Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples that she’s run through an assay machine to measure the relative abundance of 300 proteins. She needs to run these 1520 files through an imaginary program called goostats she inherited. On top of this huge task, she has to write up results by the end of the month so her paper can appear in a special issue of Aquatic Goo Letters.
The bad news is that if she has to run goostats by hand using a GUI, she’ll have to select and open a file 1520 times. If goostats takes 30 seconds to run each file, the whole process will take more than 12 hours of Nelle’s attention. With the shell, Nelle can instead assign her computer this mundane task while she focuses her attention on writing her paper.
The next few lessons will explore the ways Nelle can achieve this. More specifically, they explain how she can use a command shell to run the goostats program, using loops to automate the repetitive steps of entering file names, so that her computer can work while she writes her paper.
ls FlagsWhat does the command ls do when used
with the -l option?
What about if you use both the -l and the -h option?
By default ls lists the contents of a directory in alphabetical order by name. The command ls -t lists items by time of last change instead of alphabetically. The command ls -r lists the contents of a directory in reverse order. What happens when you combine the -t and -r flags? Hint: You may need to use the -l flag to see the last changed dates.
Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?
cd .cd /cd /home/amandacd ../..cd ~cd homecd ~/data/..cdcd ..If pwd displays /Users/thing,
what will ls -F ../backup display?
../backup: No such file or directory2012-12-01 2013-01-08 2013-01-272012-12-01/ 2013-01-08/ 2013-01-27/original/ pnas_final/ pnas_sub/ls Reading ComprehensionIf pwd displays /Users/backup,
and -r tells ls to display things in reverse order,
what command(s) will result in the following output:
pnas_sub/ pnas_final/ original/
ls pwd
ls -r -F
ls -r -F /Users/backup
Jamie realizes that she put the files sucrose.dat and maltose.dat into the wrong folder.
The files should have been placed in the raw folder. She runs these commands to explore the file system.
$ ls -F
analyzed/ raw/
$ ls -F analyzed
fructose.dat glucose.dat maltose.dat sucrose.dat
$ cd analyzed
Fill in the blanks to move these files to the raw/ folder to correct her mistake
$ mv sucrose.dat maltose.dat ____/____
Suppose you created a text file called statstics.txt
After creating and saving this file you realize you misspelled the filename! You want to correct the mistake, which of the following commands could you use to do so?
cp statstics.txt statistics.txtmv statstics.txt statistics.txtmv statstics.txt .cp statstics.txt .What is the output of the closing ls command in the sequence shown below?
$ pwd
/Users/jamie/data
$ ls
proteins.dat
$ mkdir recombined
$ mv proteins.dat recombined/
$ cp recombined/proteins.dat ../proteins-saved.dat
$ ls
proteins-saved.dat recombinedrecombinedproteins.dat recombinedproteins-saved.dat(Examples from data-shell/molecules directory)
* matches zero or more characters.
*.pdb matches ethane.pdb, propane.pdb, and every file that ends with .pdb.
p*.pdb only matches pentane.pdb and propane.pdb
? matches exactly one character.
?ethane.pdb would match methane.pdb
*ethane.pdb matches both ethane.pdb, and methane.pdb.
???ane.pdb matches three characters followed by ane.pdb, giving cubane.pdb ethane.pdb octane.pdb.
In the molecules directory which ls command(s) will
produce this output?
ethane.pdb methane.pdb
ls *t*ane.pdbls *t?ne.*ls *t??ne.pdbls ethane.*Jamie is working on a project and she sees that her files aren’t very well organized:
$ ls -F
analyzed/ fructose.dat raw/ sucrose.dat
The fructose.dat and sucrose.dat files contain output from her data
analysis. How could you use wildcards with the mv command to move both files to the analyzed directory at the same time?
If we run sort on a file containing the following lines:
10
2
19
22
6
the output is:
10
19
2
22
6
If we run sort -n on the same input, we get this instead:
2
6
10
19
22
Why?
The head command prints lines from the start of a file and the tail prints lines from the end of a file instead.
If we were to run these 2 commands:
$ head -n 3 animals.txt > animals-subset.txt
$ tail -n 2 animals.txt >> animals-subset.txt
what would animals.txt contain?
animals.txt animals.txtanimals.txtanimals.txt `
##
In our current directory, we want to find the 3 files which have the least number of lines. Which command would work?
wc -l * > sort -n > head -n 3 wc -l * | sort -n | head -n 1-3 wc -l * | head -n 3 | sort -n wc -l * | sort -n | head -n 3 A file called animals.txt looks like this:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
If we run this command, what lines will end up in final.txt?
$ cat animals.txt | head -n 5 | tail -n 3 | sort -r > final.txt
The general form of a loop:
for thing in list_of_things
do
operation_using $thing # Indentation within the loop is not required, but aids legibility
done
This exercise refers to the data-shell/molecules directory. ls gives the following output:
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
What is the output of the following code?
$ for datafile in *.pdb
> do
> ls *.pdb
> done
Now, what is the output of the following code?
$ for datafile in *.pdb
> do
> ls $datafile
> done
Why do these two loops give different outputs?
What would be the output of running the following loop in the data-shell/molecules directory?
$ for filename in c*
> do
> ls $filename
> done
cubane.pdb, octane.pdb and pentane.pdb are listed.cubane.pdb is listed.How would the output differ from using this command instead?
$ for filename in *c*
> do
> ls $filename
> done
In the data-shell/molecules directory, what is the effect of this loop?
for alkanes in *.pdb
do
echo $alkanes
cat $alkanes > alkanes.pdb
done
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.Also in the data-shell/molecules directory, what would be the output of the following loop?
for datafile in *.pdb
do
cat $datafile >> all.pdb
done
cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.ethane.pdb will be saved to a file called all.pdb.cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.
A loop is a way to do many things at once — or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them.
Suppose we want to preview the commands the following loop will execute without actually running those commands:
$ for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
What is the difference between the two loops below, and which one would we want to run?
# Version 1
$ for datafile in *.pdb
> do
> echo cat $datafile >> all.pdb
> done
# Version 2
$ for datafile in *.pdb
> do
> echo "cat $datafile >> all.pdb"
> done
Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures. What would be the result of the following code:
$ for species in cubane ethane methane
> do
> for temperature in 25 30 37 40
> do
> mkdir $species-$temperature
> done
> done
Leah has several hundred data files, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
An example of this type of file is given in data-shell/data/animal-counts/animals.txt.
We can use the command cut -d , -f 2 animals.txt | sort | uniq to produce the unique species in animals.txt. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.
Write a shell script called species.sh that takes any number of filenames as command-line arguments, and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.
$ history | tail -n 5 > recent.sh
If you run the above command the last command in the file is the history command itself, i.e., the shell has added history to the command log before actually running it. In fact, the shell always adds commands to the log before running them. Why do you think it does this?
In the molecules directory, imagine you have a shell script called script.sh containing the following commands:
head -n $2 $1
tail -n $3 $1
While you are in the molecules directory, you type the following command:
bash script.sh '*.pdb' 1 1
Which of the following outputs would you expect to see?
.pdb in the molecules directory.pdb in the molecules directorymolecules directory*.pdbWrite a shell script called longest.sh that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. When the script is run as below, it should print the name of the .pdb file in /tmp/data that has the most lines.
$ bash longest.sh /tmp/data pdb
For this question, consider the data-shell/molecules directory once again. This contains a number of .pdb files in addition to any other files you may have created. Explain what each of the following three scripts would do when run as bash script1.sh *.pdb, bash script2.sh *.pdb, and bash script3.sh *.pdb respectively.
# Script 1
echo *.*
# Script 2
for filename in $1 $2 $3
do
cat $filename
done
# Script 3
echo $@.pdb
Suppose you have saved the following script in a file called do-errors.sh in Nelle’s north-pacific-gyre/2012-07-03 directory:
# Calculate stats for data files.
for datafile in "$@"
do
echo $datfile
bash goostats $datafile stats-$datafile
done
When you run it:
$ bash do-errors.sh NENE*[AB].txt
the output is blank. To figure out why, re-run the script using the -x option:
bash -x do-errors.sh NENE*[AB].txt
What is the output showing you? Which line is responsible for the error?
grepWhich command would result in the following output:
and the presence of absence:
grep "of" haiku.txtgrep -E "of" haiku.txtgrep -w "of" haiku.txtgrep -i "of" haiku.txtLeah has several hundred data files saved in one directory, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
She wants to write a shell script that takes a species as the first command-line argument and a directory as the second argument. The script should return one file called species.txt containing a list of dates and the number of that species seen on each date. For example using the data shown above, rabbit.txt would contain:
2013-11-05,22
2013-11-06,19
Put these commands and pipes in the right order to achieve this:
cut -d : -f 2
>
|
grep -w $1 -r $2
|
$1.txt
cut -d , -f 1,3
Hint: use man grep to look for how to grep text recursively in a directory and man cut to select more than one field in a line.
An example of such a file is provided in data-shell/data/animal-counts/animals.txt
You and your friend, having just finished reading Little Women by Louisa May Alcott, are in an argument. Of the four sisters in the book, Jo, Meg, Beth, and Amy, your friend thinks that Jo was the most mentioned. You, however, are certain it was Amy. Luckily, you have a file LittleWomen.txt containing the full text of the novel (data-shell/writing/data/LittleWomen.txt). Using a for loop, how would you tabulate the number of times each of the four sisters is mentioned?
Hint: one solution might employ the commands grep and wc and a |, while another might utilize grep options. There is often more than one way to solve a programming task, so a particular solution is usually chosen based on a combination of yielding the correct result, elegance, readability, and speed.
The -v option to grep inverts pattern matching, so that only lines which do not match the pattern are printed. Given that, which of the following commands will find all files in /data whose names end in s.txt but whose names also do not contain the string net? (For example, animals.txt or amino-acids.txt but not planets.txt.) Once you have thought about your answer, you can test the commands in the data-shell directory.
find data -name "*s.txt" | grep -v netfind data -name *s.txt | grep -v netgrep -v "net" $(find data -name "*s.txt")find Pipeline Reading ComprehensionWrite a short explanatory comment for the following shell script:
wc -l $(find . -name "*.dat") | sort -n