Nelle Nemo, a marine biologist, has just returned from a six-month survey of the North Pacific Gyre, where she has been sampling gelatinous marine life in the Great Pacific Garbage Patch. She has 1520 samples that she’s run through an assay machine to measure the relative abundance of 300 proteins. She needs to run these 1520 files through an imaginary program called goostats that she inherited. On top of this huge task, she has to write up results by the end of the month so her paper can appear in a special issue of Aquatic Goo Letters.
The bad news is that if she has to run goostats by hand using a GUI, she’ll have to select and open a file 1520 times. If goostats takes 30 seconds to run each file, the whole process will take more than 12 hours of Nelle’s attention. With the shell, Nelle can instead assign her computer this mundane task while she focuses her attention on writing her paper.
The next few lessons will explore the ways Nelle can achieve this. More specifically, they explain how she can use a command shell to run the goostats program, using loops to automate the repetitive steps of entering file names, so that her computer can work while she writes her paper.
ls Flags
What does the command ls do when used with the -l option? What about if you use both the -l and the -h option?
By default ls lists the contents of a directory in alphabetical order by name. The command ls -t lists items by time of last change instead of alphabetically. The command ls -r lists the contents of a directory in reverse order. What happens when you combine the -t and -r flags? Hint: You may need to use the -l flag to see the last changed dates.
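One way to experiment with these flags is in a scratch directory where the modification times are known in advance. The sketch below is only an illustration: the directory and the three file names are hypothetical, and the timestamps are set by hand with touch -t.

```shell
# Hypothetical scratch directory and file names, used only for illustration.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Give three files distinct, known modification times (POSIX touch -t format).
touch -t 202001010000 oldest.txt
touch -t 202101010000 middle.txt
touch -t 202201010000 newest.txt

ls -t        # most recently changed file first
ls -t -r     # the same ordering reversed
ls -l -t -r  # add -l to see the timestamps next to the names
```

Comparing the second and third listings shows how -l helps to confirm which ordering was applied.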
Starting from /Users/amanda/data, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?
cd .
cd /
cd /home/amanda
cd ../..
cd ~
cd home
cd ~/data/..
cd
cd ..
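To try relative paths without touching a real home directory, a small throwaway tree can stand in for Amanda's. This is a sketch under that assumption: the amanda/data tree is created under a temporary directory, so the absolute paths will differ from the ones in the question.

```shell
# Hypothetical directory tree standing in for /Users/amanda/data.
root=$(mktemp -d)
mkdir -p "$root/amanda/data"

cd "$root/amanda/data" || exit 1
cd .          # '.' is the current directory, so nothing changes
here=$(pwd)
cd ..         # '..' is the parent of the current directory
parent=$(pwd)

echo "$here"
echo "$parent"
```

Running pwd after each cd is a quick way to check where a given option actually leads.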
If pwd displays /Users/thing, what will ls -F ../backup display?
../backup: No such file or directory
2012-12-01 2013-01-08 2013-01-27
2012-12-01/ 2013-01-08/ 2013-01-27/
original/ pnas_final/ pnas_sub/
ls Reading Comprehension
If pwd displays /Users/backup, and -r tells ls to display things in reverse order, what command(s) will result in the following output:
pnas_sub/ pnas_final/ original/
ls pwd
ls -r -F
ls -r -F /Users/backup
Jamie realizes that she put the files sucrose.dat and maltose.dat into the wrong folder. The files should have been placed in the raw folder. She runs these commands to explore the file system.
$ ls -F
analyzed/ raw/
$ ls -F analyzed
fructose.dat glucose.dat maltose.dat sucrose.dat
$ cd analyzed
Fill in the blanks to move these files to the raw/ folder to correct her mistake:
$ mv sucrose.dat maltose.dat ____/____
Suppose you created a text file called statstics.txt. After creating and saving this file, you realize you misspelled the filename! You want to correct the mistake. Which of the following commands could you use to do so?
cp statstics.txt statistics.txt
mv statstics.txt statistics.txt
mv statstics.txt .
cp statstics.txt .
What is the output of the closing ls command in the sequence shown below?
$ pwd
/Users/jamie/data
$ ls
proteins.dat
$ mkdir recombined
$ mv proteins.dat recombined/
$ cp recombined/proteins.dat ../proteins-saved.dat
$ ls
proteins-saved.dat recombined
recombined
proteins.dat recombined
proteins-saved.dat
(Examples from the data-shell/molecules directory)
* matches zero or more characters.
*.pdb matches ethane.pdb, propane.pdb, and every file that ends with .pdb.
p*.pdb only matches pentane.pdb and propane.pdb.
? matches exactly one character.
?ethane.pdb would match methane.pdb.
*ethane.pdb matches both ethane.pdb and methane.pdb.
???ane.pdb matches three characters followed by ane.pdb, giving cubane.pdb, ethane.pdb, and octane.pdb.
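The patterns above can be checked with echo, which prints whatever names the shell expands a wildcard to. The sketch below assumes a scratch directory containing empty files named after the six molecules listed later in this section; it is not the real data-shell/molecules directory.

```shell
# Scratch directory with empty stand-ins for the six .pdb files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

echo *.pdb        # every name ending in .pdb
echo p*.pdb       # names starting with p and ending in .pdb
echo ?ethane.pdb  # exactly one character, then ethane.pdb
echo *ethane.pdb  # zero or more characters, then ethane.pdb
echo ???ane.pdb   # exactly three characters, then ane.pdb
```

Because the shell expands the wildcard before the command runs, echo is a safe way to preview a pattern before handing it to a command such as mv or rm.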
In the molecules directory, which ls command(s) will produce this output?
ethane.pdb methane.pdb
ls *t*ane.pdb
ls *t?ne.*
ls *t??ne.pdb
ls ethane.*
Jamie is working on a project and she sees that her files aren’t very well organized:
$ ls -F
analyzed/ fructose.dat raw/ sucrose.dat
The fructose.dat and sucrose.dat files contain output from her data analysis. How could you use wildcards with the mv command to move both files to the analyzed directory at the same time?
If we run sort on a file containing the following lines:
10
2
19
22
6
the output is:
10
19
2
22
6
If we run sort -n on the same input, we get this instead:
2
6
10
19
22
Why?
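To reproduce the two orderings side by side, the same five numbers can be fed to sort from printf rather than from a file. This is a minimal sketch; LC_ALL=C is an added assumption that pins the comparison rules so the result does not depend on the local language settings.

```shell
# The five input lines from the example above.
numbers=$(printf '10\n2\n19\n22\n6\n')

# Plain sort compares lines as text, character by character.
plain=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# sort -n compares lines by their numeric value.
numeric=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)

echo "$plain"
echo "$numeric"
```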
The head command prints lines from the start of a file, and the tail command prints lines from the end of a file instead. If we were to run these two commands:
$ head -n 3 animals.txt > animals-subset.txt
$ tail -n 2 animals.txt >> animals-subset.txt
what would animals-subset.txt contain?
In our current directory, we want to find the 3 files which have the fewest lines. Which command would work?
wc -l * > sort -n > head -n 3
wc -l * | sort -n | head -n 1-3
wc -l * | head -n 3 | sort -n
wc -l * | sort -n | head -n 3
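The behavior of each stage in such a pipeline is easier to see on a handful of tiny files. The sketch below assumes four hypothetical text files with one to four lines each and keeps the two shortest; note that wc -l * also prints a total row, which has the largest count and therefore sorts last.

```shell
# Hypothetical files with 1, 2, 3 and 4 lines.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'a\n' > one.txt
printf 'a\nb\n' > two.txt
printf 'a\nb\nc\n' > three.txt
printf 'a\nb\nc\nd\n' > four.txt

# Count lines per file, sort the counts numerically, keep the first two rows.
shortest=$(wc -l * | sort -n | head -n 2)
echo "$shortest"
```

Running the stages one at a time (first wc -l *, then adding sort -n, then head) is a useful habit when a pipeline does not give the expected result.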
A file called animals.txt looks like this:
2012-11-05,deer
2012-11-05,rabbit
2012-11-05,raccoon
2012-11-06,rabbit
2012-11-06,deer
2012-11-06,fox
2012-11-07,rabbit
2012-11-07,bear
If we run this command, what lines will end up in final.txt?
$ cat animals.txt | head -n 5 | tail -n 3 | sort -r > final.txt
The general form of a loop:
for thing in list_of_things
do
    operation_using $thing    # Indentation within the loop is not required, but aids legibility
done
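A concrete instance of this general form might look like the sketch below. The list of fruit names and the counter are made up for illustration; any space-separated list of words works the same way.

```shell
count=0
for fruit in apple banana cherry
do
    # The body runs once per item, with $fruit set to that item.
    echo "Processing $fruit"
    count=$((count + 1))
done
echo "Ran the loop body $count times"
```

After the loop finishes, the variable keeps the last value it was assigned.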
This exercise refers to the data-shell/molecules directory. ls gives the following output:
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
What is the output of the following code?
$ for datafile in *.pdb
> do
> ls *.pdb
> done
Now, what is the output of the following code?
$ for datafile in *.pdb
> do
> ls $datafile
> done
Why do these two loops give different outputs?
What would be the output of running the following loop in the data-shell/molecules directory?
$ for filename in c*
> do
> ls $filename
> done
cubane.pdb, octane.pdb and pentane.pdb are listed.
cubane.pdb is listed.
How would the output differ from using this command instead?
$ for filename in *c*
> do
> ls $filename
> done
In the data-shell/molecules directory, what is the effect of this loop?
for alkanes in *.pdb
do
echo $alkanes
cat $alkanes > alkanes.pdb
done
Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
Prints cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.
Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
Also in the data-shell/molecules directory, what would be the output of the following loop?
for datafile in *.pdb
do
cat $datafile >> all.pdb
done
The text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
The text from ethane.pdb will be saved to a file called all.pdb.
The text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
The text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.
A loop is a way to do many things at once, or to make many mistakes at once if it does the wrong thing. One way to check what a loop would do is to echo the commands it would run instead of actually running them.
Suppose we want to preview the commands the following loop will execute without actually running those commands:
$ for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
What is the difference between the two loops below, and which one would we want to run?
# Version 1
$ for datafile in *.pdb
> do
> echo cat $datafile >> all.pdb
> done
# Version 2
$ for datafile in *.pdb
> do
> echo "cat $datafile >> all.pdb"
> done
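The echo technique can be tried safely in a scratch directory. The sketch below is an assumption-laden illustration: a.dat, b.dat and combined.dat are hypothetical names, and the whole command is quoted so that echo prints it, redirection included, rather than the shell acting on it.

```shell
# Hypothetical scratch directory with two empty data files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.dat b.dat

# Dry run: print each command the loop would execute, but run none of them.
preview=$(for datafile in *.dat
do
    echo "cat $datafile >> combined.dat"
done)
echo "$preview"

# Nothing was actually concatenated, so combined.dat does not exist.
ls
```

Once the printed commands look right, removing echo and the quotes turns the preview into the real loop.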
Suppose we want to set up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures. What would be the result of the following code:
$ for species in cubane ethane methane
> do
> for temperature in 25 30 37 40
> do
> mkdir $species-$temperature
> done
> done
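Nested loops can be explored on a smaller scale first. The sketch below uses made-up names (alpha, beta and runs 1 to 3) in a temporary directory; the inner loop runs in full for every item of the outer loop.

```shell
# Hypothetical scratch directory so no real data is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

for sample in alpha beta
do
    for run in 1 2 3
    do
        # One directory per (sample, run) combination.
        mkdir "$sample-$run"
    done
done

ls
```

Two outer items and three inner items give six directories in total.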
Leah has several hundred data files, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
An example of this type of file is given in data-shell/data/animal-counts/animals.txt.
We can use the command cut -d , -f 2 animals.txt | sort | uniq to produce the unique species in animals.txt. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.
Write a shell script called species.sh that takes any number of filenames as command-line arguments, and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.
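One possible approach, offered as a sketch rather than the only answer, is to loop over "$@" and run the pipeline once per file. The sample data below is a shortened, hypothetical copy of the file format shown above, written to a temporary directory.

```shell
# Hypothetical scratch directory and a three-line sample file.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '2013-11-05,deer,5\n2013-11-05,rabbit,22\n2013-11-06,deer,2\n' > animals.txt

# Write the script: "$@" expands to every filename given on the command line.
cat > species.sh <<'EOF'
# Print the unique species found in each file given as an argument.
for file in "$@"
do
    echo "Unique species in $file:"
    cut -d , -f 2 "$file" | sort | uniq
done
EOF

bash species.sh animals.txt
```

Quoting "$@" and "$file" keeps filenames that contain spaces intact.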
$ history | tail -n 5 > recent.sh
If you run the above command, the last command in the file is the history command itself, i.e., the shell has added history to the command log before actually running it. In fact, the shell always adds commands to the log before running them. Why do you think it does this?
In the molecules directory, imagine you have a shell script called script.sh containing the following commands:
head -n $2 $1
tail -n $3 $1
While you are in the molecules directory, you type the following command:
bash script.sh '*.pdb' 1 1
Which of the following outputs would you expect to see?
.pdb in the molecules directory
.pdb in the molecules directory
molecules directory
*.pdb
Write a shell script called longest.sh that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. When the script is run as below, it should print the name of the .pdb file in /tmp/data that has the most lines.
$ bash longest.sh /tmp/data pdb
For this question, consider the data-shell/molecules directory once again. This contains a number of .pdb files in addition to any other files you may have created. Explain what each of the following three scripts would do when run as bash script1.sh *.pdb, bash script2.sh *.pdb, and bash script3.sh *.pdb respectively.
# Script 1
echo *.*
# Script 2
for filename in $1 $2 $3
do
cat $filename
done
# Script 3
echo $@.pdb
Suppose you have saved the following script in a file called do-errors.sh in Nelle’s north-pacific-gyre/2012-07-03 directory:
# Calculate stats for data files.
for datafile in "$@"
do
echo $datfile
bash goostats $datafile stats-$datafile
done
When you run it:
$ bash do-errors.sh NENE*[AB].txt
the output is blank. To figure out why, re-run the script using the -x option:
bash -x do-errors.sh NENE*[AB].txt
What is the output showing you? Which line is responsible for the error?
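To get a feel for what -x prints before reading the trace of a real script, it can help to run it on a tiny one. The sketch below uses a hypothetical script, typo-demo.sh, with its own deliberately misspelled variable; the trace is written to standard error, so 2>&1 is used to capture it.

```shell
# Hypothetical scratch directory and demo script.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

cat > typo-demo.sh <<'EOF'
for name in "$@"
do
    echo $nmae
done
EOF

# With -x, bash prints each command after expansion, prefixed with '+'.
trace=$(bash -x typo-demo.sh first second 2>&1)
echo "$trace"
```

In the trace, a command whose variable expanded to nothing appears with the argument missing, which points at the line to inspect.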
grep
Which command would result in the following output:
and the presence of absence:
grep "of" haiku.txt
grep -E "of" haiku.txt
grep -w "of" haiku.txt
grep -i "of" haiku.txt
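The effect of these options can be compared on a small file whose contents are known. The three lines below are a made-up sample, not the contents of haiku.txt.

```shell
# Hypothetical three-line sample file.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'a cup of tea\nsoft rain\nOften late\n' > sample.txt

grep "of" sample.txt      # matches 'of' anywhere, including inside 'soft'
grep -w "of" sample.txt   # matches 'of' only as a whole word
grep -i "of" sample.txt   # ignores case, so 'Often' matches as well
```

Adding -c to any of these prints the number of matching lines instead of the lines themselves.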
Leah has several hundred data files saved in one directory, each of which is formatted like this:
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
She wants to write a shell script that takes a species as the first command-line argument and a directory as the second argument. The script should return one file named after the species (species.txt) containing a list of dates and the number of that species seen on each date. For example, using the data shown above, rabbit.txt would contain:
2013-11-05,22
2013-11-06,19
Put these commands and pipes in the right order to achieve this:
cut -d : -f 2
>
|
grep -w $1 -r $2
|
$1.txt
cut -d , -f 1,3
Hint: use man grep to look for how to grep text recursively in a directory and man cut to select more than one field in a line.
An example of such a file is provided in data-shell/data/animal-counts/animals.txt.
You and your friend, having just finished reading Little Women by Louisa May Alcott, are in an argument. Of the four sisters in the book, Jo, Meg, Beth, and Amy, your friend thinks that Jo was the most mentioned. You, however, are certain it was Amy. Luckily, you have a file LittleWomen.txt containing the full text of the novel (data-shell/writing/data/LittleWomen.txt). Using a for loop, how would you tabulate the number of times each of the four sisters is mentioned?
Hint: one solution might employ the commands grep and wc and a |, while another might utilize grep options. There is often more than one way to solve a programming task, so a particular solution is usually chosen based on a combination of yielding the correct result, elegance, readability, and speed.
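As a sketch of the first kind of solution, the loop below counts mentions in a two-line stand-in text, not in the novel itself. It assumes a grep that supports -o, which is available in GNU and BSD grep but is not part of POSIX.

```shell
# Hypothetical two-line sample standing in for LittleWomen.txt.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'Jo and Amy met Meg.\nBeth saw Jo. Jo laughed.\n' > sample.txt

for sis in Jo Meg Beth Amy
do
    # -o prints each match on its own line; -w matches whole words only.
    count=$(grep -ow "$sis" sample.txt | wc -l | tr -d ' ')
    echo "$sis: $count"
done
```

Counting matches rather than matching lines matters here, because a name can appear more than once on the same line.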
The -v option to grep inverts pattern matching, so that only lines which do not match the pattern are printed. Given that, which of the following commands will find all files in /data whose names end in s.txt but whose names also do not contain the string net? (For example, animals.txt or amino-acids.txt but not planets.txt.) Once you have thought about your answer, you can test the commands in the data-shell directory.
find data -name "*s.txt" | grep -v net
find data -name *s.txt | grep -v net
grep -v "net" $(find data -name "*s.txt")
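Combining find with grep -v can be tested on a small stand-in tree. The sketch below builds a hypothetical data directory containing the three example file names from the question plus one unrelated file, then filters the list of paths that find prints.

```shell
# Hypothetical data directory with the example file names.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
mkdir data
touch data/animals.txt data/amino-acids.txt data/planets.txt data/notes.md

# find prints matching paths; grep -v drops any path containing 'net'.
matches=$(find data -name "*s.txt" | grep -v net | sort)
echo "$matches"
```

Here grep filters the text of the paths, not the contents of the files, and the quotes around the pattern stop the shell from expanding the wildcard before find sees it.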
find Pipeline Reading Comprehension
Write a short explanatory comment for the following shell script:
wc -l $(find . -name "*.dat") | sort -n