Manipulating data
Displaying the content of a file
There are several tools aiming to display the content of a file, each of them whith a different purpose.
Displaying the entire file
cat display the whole content of a file:
$ cat molecules/methane.pdb
COMPND METHANE
AUTHOR DAVE WOODCOCK 95 12 18
ATOM 1 C 1 0.257 -0.363 0.000 1.00 0.00
ATOM 2 H 1 0.257 0.727 0.000 1.00 0.00
ATOM 3 H 1 0.771 -0.727 0.890 1.00 0.00
ATOM 4 H 1 0.771 -0.727 -0.890 1.00 0.00
ATOM 5 H 1 -0.771 -0.727 0.000 1.00 0.00
TER 6 1
END
Actually, cat’s primary purpose is to concatenate files.
Meaning that a way of using it would actually be to give several files as
input in order to get the concatenation of the files:
$ cat molecules/methane.pdb molecules/ethane.pdb
COMPND METHANE
AUTHOR DAVE WOODCOCK 95 12 18
ATOM 1 C 1 0.257 -0.363 0.000 1.00 0.00
ATOM 2 H 1 0.257 0.727 0.000 1.00 0.00
ATOM 3 H 1 0.771 -0.727 0.890 1.00 0.00
ATOM 4 H 1 0.771 -0.727 -0.890 1.00 0.00
ATOM 5 H 1 -0.771 -0.727 0.000 1.00 0.00
TER 6 1
END
COMPND ETHANE
AUTHOR DAVE WOODCOCK 95 12 18
ATOM 1 C 1 -0.752 0.001 -0.141 1.00 0.00
ATOM 2 C 1 0.752 -0.001 0.141 1.00 0.00
ATOM 3 H 1 -1.158 0.991 0.070 1.00 0.00
ATOM 4 H 1 -1.240 -0.737 0.496 1.00 0.00
ATOM 5 H 1 -0.924 -0.249 -1.188 1.00 0.00
ATOM 6 H 1 1.158 -0.991 -0.070 1.00 0.00
ATOM 7 H 1 0.924 0.249 1.188 1.00 0.00
ATOM 8 H 1 1.240 0.737 -0.496 1.00 0.00
TER 9 1
END
Displaying files screen by screen
There are two utilities, more and less, that allows to display files
screen by screen, which is very usefull when reading large files.
Question: try more and less to display iris.csv.
Question: in both programs, use / to find the occurences of ‘virginica’.
Displaying only the beginning of the file
Displaying the beginning of the file, or head, is done thanks to the
head command.
Question: use head to display the first 5 lines of iris.csv.
Solution:
$ head -n5 iris.csv sepal_length,sepal_width,petal_length,petal_width,species 5.1,3.5,1.4,0.2,setosa 4.9,3,1.4,0.2,setosa 4.7,3.2,1.3,0.2,setosa 4.6,3.1,1.5,0.2,setosa
Displaying only the end of the file
Displaying the beginning of the file, or tail, is done thanks to the
tail command.
Question: use tail to display the last 5 lines of iris.csv.
Solution:
$ tail -n5 iris.csv 6.7,3,5.2,2.3,virginica 6.3,2.5,5,1.9,virginica 6.5,3,5.2,2,virginica 6.2,3.4,5.4,2.3,virginica 5.9,3,5.1,1.8,virginica
Saving the output of a command to a file
Most standard unix tools do not have an option for naming an output file.
It is because the standard way to store a command output is by redirecting
it to a file.
This is done by using the greather-than sign >:
$ # Output is displayed on screen
$ head -n5 iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
$ # Output is redicted to tmp.csv, therefore not displayed on screen
$ head -n5 iris.csv > tmp.csv
$ # Display the content of tmp.csv
$ cat tmp.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
$ # Remove temporary file.
$ rm tmp.csv
Importantly, > creates a new file, whether or not the command output
contains something.
If the output file already exists, it is erased then created again.
Question: how to store the first 2 lines of iris.csv to output.csv?
Once done, store the last 2 lines of iris.csv to output.csv. See
by yourself that output.csv only contains the output of your last command.
Remove output.csv when done.
Solution:
$ head -2 iris.csv > output.csv $ cat output.csv sepal_length,sepal_width,petal_length,petal_width,species 5.1,3.5,1.4,0.2,setosa $ tail -2 iris.csv > output.csv $ cat output.csv 6.2,3.4,5.4,2.3,virginica 5.9,3,5.1,1.8,virginica $ # The output contains only the last two lines of iris.csv, as $ # the system erased output.csv prior to writing the tail command output. $ rm output.csv # cleaning
There is a second operator, made of two greater-than signs (‘>>’), which
allows to append output to an already existing file (note: if the file does not
exist, it will be created).
Question: how to store the first 2 and last 2 lines of iris.csv
to output.csv? Once done, display it and remove output.csv
Solution:
$ head -2 iris.csv > output.csv $ tail -2 iris.csv >> output.csv $ cat output.csv sepal_length,sepal_width,petal_length,petal_width,species 5.1,3.5,1.4,0.2,setosa 6.2,3.4,5.4,2.3,virginica 5.9,3,5.1,1.8,virginica $ rm output.csv # cleaning
File editors
There are a number of terminal file editors.
The winner in the category “easy hands-on” is probably nano, which has
the advantage that keyboard shortcut are display at the bottom of the screen.
Question: use nano to create a file animals.txt with 3 lines: “dog”,
“cat”, “rabbit”.
Solution: To open a new or existing file, the command whould be
nano <FILENAME>. In our casenano animals.txt.Then, just enter the 3 lines and hit Ctrl+X to exit.
nanoprompts whether or not you want to save the changes you made. So hitYto save the changes.
Other historical and still widely used editors are vi and emacs, which
are both very powerful.
I would personnaly recommand using vi as it uses keyboard shortcuts which
are common to other unix programs.
Counting the number of lines/words/characters in a file
To count lines, words or characters in a file, we use wc (word count).
$ wc creatures/basilisk.fasta
28 34 1708 creatures/basilisk.fasta
By default, wc displays three numbers, namely the number of lines, words
and characters in a file.
Question: how to display only the number of lines in a file? of words? of characters?
Solution:
$ # display only the number of lines $ wc -l creatures/basilisk.fasta 28 creatures/basilisk.fasta $ # display only the number of words $ wc -w creatures/basilisk.fasta 34 creatures/basilisk.fasta $ # display only the number of characters $ wc -c creatures/basilisk.fasta 1708 creatures/basilisk.fasta
Question: how to display the number of lines in pdb files located in molecules?
Solution:
$ wc -l molecules/*.pdb 20 molecules/cubane.pdb 12 molecules/ethane.pdb 9 molecules/methane.pdb 30 molecules/octane.pdb 21 molecules/pentane.pdb 15 molecules/propane.pdb 107 total
Searching a pattern in a file
grep prints lines matching a pattern.
For example, to get the “canine” lines from tooth.csv, just type
$ grep canine tooth.csv
2015-05-27,canine
2016-12-09,canine
2015-10-18,canine
2016-06-13,canine
2015-05-10,canine
2016-05-07,canine
2015-10-26,canine
2016-07-08,canine
2015-02-14,canine
2016-08-12,canine
2016-09-17,canine
2016-02-07,canine
2017-11-12,canine
2016-08-17,canine
2015-10-23,canine
2015-12-09,canine
2015-12-03,canine
2016-03-07,canine
2017-03-04,canine
2015-03-05,canine
grep has a lot of useful options:
-v, reverse search-i, ignore case-n, prefix each line with the line number-c, count the number of occurences of the pattern-r, recursively explore files
Question: how to get the number of non-canine line?
Solution:
$ grep -vc canine tooth.csv 80
Extracting a file columns
The command cut allows to extract columns from a file in various ways.
For example, to extract the date column from tooth.csv, several approches
can be used:
$ # Extract each line first 10 characters
$ cut -c 1-10 tooth.csv
2015-08-03
2015-06-21
2015-10-26
2017-01-30
2015-05-27
2016-07-14
2016-01-21
2015-05-14
2016-03-07
# [...]
$ # Extract the first column using "," as a separator
$ cut -d "," -f 1 tooth.csv
2015-08-03
2015-06-21
2015-10-26
2017-01-30
2015-05-27
2016-07-14
2016-01-21
2015-05-14
2016-03-07
# [...]
Question: find two ways to extract the teeth column from tooth.csv.
Solution:
$ # Extract characters 12 to end-of-line $ cut -c 12- tooth.csv molar molar molar premolar canine incisor premolar incisor incisor # [...] $ # Extract the second column using "," as column separator $ cut -d "," -f 2 tooth.csv molar molar molar premolar canine incisor premolar incisor incisor
Sorting a file
Quite intuitively, sort allows to sort a file lines:
$ sort tooth.csv
2015-01-16,incisor
2015-02-14,canine
2015-02-22,molar
2015-02-25,incisor
2015-03-05,canine
2015-03-20,premolar
2015-03-31,premolar
2015-04-16,molar
2015-04-23,premolar
2015-05-10,canine
# [...]
Importantly, by default, the sort is not a numeric sort but rather a sort by ASCII code.
So when sorting animals/animals.txt, this is what happens:
$ sort animals/animals.txt
11 cat
1 dog
23 bird
2 rabbit
3 chicken
Question: how to sort animals/animals.txt properly? How to sort it in decreasing order?
Solution:
# Numeric sort $ sort -n animals/animals.txt 1 dog 2 rabbit 3 chicken 11 cat 23 bird # Reverse numeric sort $ sort -nr animals/animals.txt 23 bird 11 cat 3 chicken 2 rabbit 1 dog
Question: how to sort tooth.csv by increasing day number?
Solution:
$ # We have to tell sort that the field separator we want to use is '-' $ # and then that we want to sort by the 3rd field. $ sort tooth.csv -t- -k3 2017-11-01,premolar 2017-12-02,incisor 2015-12-03,canine 2016-06-03,incisor 2017-08-03,incisor 2015-08-03,molar 2016-06-03,premolar 2017-03-04,canine 2015-12-04,incisor 2017-02-04,premolar # [...]
Concatenating files
As seen previously, cat allows both to
display a single file and to concatenate several file one after the other.
paste allows to concatenate file column wide:
$ # Concatenate animals/animals.txt column wide
$ paste animals/animals.txt animals/animals.txt
2 rabbit 2 rabbit
1 dog 1 dog
11 cat 11 cat
23 bird 23 bird
3 chicken 3 chicken
$ # Concatenate animals/animals.txt column wide with custom separator (" - ")
$ paste -d "|" animals/animals.txt animals/animals.txt
2 rabbit|2 rabbit
1 dog|1 dog
11 cat|11 cat
23 bird|23 bird
3 chicken|3 chicken
Comparing files
Comparing files can be done in numerous ways. These are the most common tools.
cmp outputs the position of the difference if files are different, else nothing:
$ cmp animals/animals.txt animals/animals2.txt
animals/animals.txt animals/animals2.txt differ: byte 11, line 2
diff outputs the line number of the difference, and the difference:
diff animals/animals.txt animals/animals2.txt
2d1
< 1 dog
This interprets as “there is a difference on line 2 of animals.txt which translates to line 1 of animals2.txt: the line ‘1 dog’ has been deleted”.
Question: what diff option allows to compare the files side-by-side?
Solution:
diff -y animals/animals.txt animals/animals2.txt 2 rabbit 2 rabbit 1 dog < 11 cat 11 cat 23 bird 23 bird 3 chicken 3 chicken
Modifying file content
To make simple changes, you probably want to use a classic text editor.
But sometimes, you need to make more complicated changes, e.g. changing all the occurences of a string in a file.
Replacing all occurences of a word in a file
To to this, sed is probably the most straight-forward solution:
$ # Replace all occurences of 'rabbit' with 'bunny'
$ sed 's/rabbit/bunny/g' animals/animals.txt
2 bunny
1 dog
11 cat
23 bird
3 chicken
The script part of this command tells sed:
s/stands for “substitute”rabbit/bunny/means”rabbit” with “bunny”- the trailing
gindicates that we want all occurences on the line to be substituted
So, in one sentence, “replace all occurences of ‘rabbit’ with ‘bunny’”.
Removing a line
By line number
$ # Remove lines 1 and 2
$ sed '1,2d' animals/animals.txt
11 cat
23 bird
3 chicken
By pattern
$ # Remove all lines that contains 'chicken'
$ sed '/chicken/d' animals/animals.txt
2 rabbit
1 dog
11 cat
23 bird
sed is actually a lot more poweful than only those two simple examples.
If you want to know more about sed there are numerous quality tutorials
around.