Manipulating data

Displaying the content of a file

There are several tools aiming to display the content of a file, each of them whith a different purpose.

Displaying the entire file

cat display the whole content of a file:

$ cat molecules/methane.pdb
COMPND      METHANE
AUTHOR      DAVE WOODCOCK  95 12 18
ATOM      1  C           1       0.257  -0.363   0.000  1.00  0.00
ATOM      2  H           1       0.257   0.727   0.000  1.00  0.00
ATOM      3  H           1       0.771  -0.727   0.890  1.00  0.00
ATOM      4  H           1       0.771  -0.727  -0.890  1.00  0.00
ATOM      5  H           1      -0.771  -0.727   0.000  1.00  0.00
TER       6              1
END

Actually, cat’s primary purpose is to concatenate files. Meaning that a way of using it would actually be to give several files as input in order to get the concatenation of the files:

$ cat molecules/methane.pdb molecules/ethane.pdb
COMPND      METHANE
AUTHOR      DAVE WOODCOCK  95 12 18
ATOM      1  C           1       0.257  -0.363   0.000  1.00  0.00
ATOM      2  H           1       0.257   0.727   0.000  1.00  0.00
ATOM      3  H           1       0.771  -0.727   0.890  1.00  0.00
ATOM      4  H           1       0.771  -0.727  -0.890  1.00  0.00
ATOM      5  H           1      -0.771  -0.727   0.000  1.00  0.00
TER       6              1
END
COMPND      ETHANE
AUTHOR      DAVE WOODCOCK  95 12 18
ATOM      1  C           1      -0.752   0.001  -0.141  1.00  0.00
ATOM      2  C           1       0.752  -0.001   0.141  1.00  0.00
ATOM      3  H           1      -1.158   0.991   0.070  1.00  0.00
ATOM      4  H           1      -1.240  -0.737   0.496  1.00  0.00
ATOM      5  H           1      -0.924  -0.249  -1.188  1.00  0.00
ATOM      6  H           1       1.158  -0.991  -0.070  1.00  0.00
ATOM      7  H           1       0.924   0.249   1.188  1.00  0.00
ATOM      8  H           1       1.240   0.737  -0.496  1.00  0.00
TER       9              1
END

Displaying files screen by screen

There are two utilities, more and less, that allows to display files screen by screen, which is very usefull when reading large files.

Question: try more and less to display iris.csv.

Question: in both programs, use / to find the occurences of ‘virginica’.

Displaying only the beginning of the file

Displaying the beginning of the file, or head, is done thanks to the head command.

Question: use head to display the first 5 lines of iris.csv.

Solution:

$ head -n5 iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa

Displaying only the end of the file

Displaying the beginning of the file, or tail, is done thanks to the tail command.

Question: use tail to display the last 5 lines of iris.csv.

Solution:

$ tail -n5 iris.csv
7,3,5.2,2.3,virginica
3,2.5,5,1.9,virginica
5,3,5.2,2,virginica
2,3.4,5.4,2.3,virginica
9,3,5.1,1.8,virginica

Saving the output of a command to a file

Most standard unix tools do not have an option for naming an output file. It is because the standard way to store a command output is by redirecting it to a file. This is done by using the greather-than sign >:

$ # Output is displayed on screen
$ head -n5 iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa

$ # Output is redicted to tmp.csv, therefore not displayed on screen
$ head -n5 iris.csv > tmp.csv

$ # Display the content of tmp.csv
$ cat tmp.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa

$ # Remove temporary file.
$ rm tmp.csv

Importantly, > creates a new file, whether or not the command output contains something. If the output file already exists, it is erased then created again.

Question: how to store the first 2 lines of iris.csv to output.csv? Once done, store the last 2 lines of iris.csv to output.csv. See by yourself that output.csv only contains the output of your last command. Remove output.csv when done.

Solution:

$ head -2 iris.csv > output.csv
$ cat output.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
$ tail -2 iris.csv > output.csv
$ cat output.csv
6.2,3.4,5.4,2.3,virginica
5.9,3,5.1,1.8,virginica
$ # The output contains only the last two lines of iris.csv, as
$ # the system erased output.csv prior to writing the tail command output.
$ rm output.csv  # cleaning

There is a second operator, made of two greater-than signs (‘>>’), which allows to append output to an already existing file (note: if the file does not exist, it will be created).

Question: how to store the first 2 and last 2 lines of iris.csv to output.csv? Once done, display it and remove output.csv

Solution:

$ head -2 iris.csv > output.csv
$ tail -2 iris.csv >> output.csv
$ cat output.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
6.2,3.4,5.4,2.3,virginica
5.9,3,5.1,1.8,virginica
$ rm output.csv  # cleaning

File editors

There are a number of terminal file editors. The winner in the category “easy hands-on” is probably nano, which has the advantage that keyboard shortcut are display at the bottom of the screen.

Question: use nano to create a file animals.txt with 3 lines: “dog”, “cat”, “rabbit”.

Solution: To open a new or existing file, the command whould be nano <FILENAME>. In our case nano animals.txt.

Then, just enter the 3 lines and hit Ctrl+X to exit. nano prompts whether or not you want to save the changes you made. So hit Y to save the changes.

Other historical and still widely used editors are vi and emacs, which are both very powerful. I would personnaly recommand using vi as it uses keyboard shortcuts which are common to other unix programs.

Counting the number of lines/words/characters in a file

To count lines, words or characters in a file, we use wc (word count).

$ wc creatures/basilisk.fasta
  28   34 1708 creatures/basilisk.fasta

By default, wc displays three numbers, namely the number of lines, words and characters in a file.

Question: how to display only the number of lines in a file? of words? of characters?

Solution:

$ # display only the number of lines
$ wc -l creatures/basilisk.fasta
28 creatures/basilisk.fasta
$ # display only the number of words
$ wc -w creatures/basilisk.fasta
34 creatures/basilisk.fasta
$ # display only the number of characters
$ wc -c creatures/basilisk.fasta
1708 creatures/basilisk.fasta

Question: how to display the number of lines in pdb files located in molecules?

Solution:

$ wc -l molecules/*.pdb
molecules/cubane.pdb
molecules/ethane.pdb
molecules/methane.pdb
molecules/octane.pdb
molecules/pentane.pdb
molecules/propane.pdb
total

Searching a pattern in a file

grep prints lines matching a pattern.

For example, to get the “canine” lines from tooth.csv, just type

$ grep canine tooth.csv
2015-05-27,canine
2016-12-09,canine
2015-10-18,canine
2016-06-13,canine
2015-05-10,canine
2016-05-07,canine
2015-10-26,canine
2016-07-08,canine
2015-02-14,canine
2016-08-12,canine
2016-09-17,canine
2016-02-07,canine
2017-11-12,canine
2016-08-17,canine
2015-10-23,canine
2015-12-09,canine
2015-12-03,canine
2016-03-07,canine
2017-03-04,canine
2015-03-05,canine

grep has a lot of useful options:

-v, reverse search
-i, ignore case
-n, prefix each line with the line number
-c, count the number of occurences of the pattern
-r, recursively explore files

Question: how to get the number of non-canine line?

Solution:

$ grep -vc canine tooth.csv
80

Extracting a file columns

The command cut allows to extract columns from a file in various ways.

For example, to extract the date column from tooth.csv, several approches can be used:

$ # Extract each line first 10 characters
$ cut -c 1-10 tooth.csv
2015-08-03
2015-06-21
2015-10-26
2017-01-30
2015-05-27
2016-07-14
2016-01-21
2015-05-14
2016-03-07
# [...]
$ # Extract the first column using "," as a separator
$ cut -d "," -f 1 tooth.csv
2015-08-03
2015-06-21
2015-10-26
2017-01-30
2015-05-27
2016-07-14
2016-01-21
2015-05-14
2016-03-07
# [...]

Question: find two ways to extract the teeth column from tooth.csv.

Solution:

$ # Extract characters 12 to end-of-line
$ cut -c 12- tooth.csv
molar
molar
molar
premolar
canine
incisor
premolar
incisor
incisor
# [...]
$ # Extract the second column using "," as column separator
$ cut -d "," -f 2 tooth.csv
molar
molar
molar
premolar
canine
incisor
premolar
incisor
incisor

Sorting a file

Quite intuitively, sort allows to sort a file lines:

$ sort tooth.csv
2015-01-16,incisor
2015-02-14,canine
2015-02-22,molar
2015-02-25,incisor
2015-03-05,canine
2015-03-20,premolar
2015-03-31,premolar
2015-04-16,molar
2015-04-23,premolar
2015-05-10,canine
# [...]

Importantly, by default, the sort is not a numeric sort but rather a sort by ASCII code.

So when sorting animals/animals.txt, this is what happens:

$ sort animals/animals.txt
cat
dog
bird
rabbit
chicken

Question: how to sort animals/animals.txt properly? How to sort it in decreasing order?

Solution:

# Numeric sort
$ sort -n animals/animals.txt
dog
rabbit
chicken
cat
bird

# Reverse numeric sort
$ sort -nr animals/animals.txt
bird
cat
chicken
rabbit
dog

Question: how to sort tooth.csv by increasing day number?

Solution:

$ # We have to tell sort that the field separator we want to use is '-'
$ # and then that we want to sort by the 3rd field.
$ sort tooth.csv -t- -k3
2017-11-01,premolar
2017-12-02,incisor
2015-12-03,canine
2016-06-03,incisor
2017-08-03,incisor
2015-08-03,molar
2016-06-03,premolar
2017-03-04,canine
2015-12-04,incisor
2017-02-04,premolar
# [...]

Concatenating files

As seen previously, cat allows both to display a single file and to concatenate several file one after the other.

paste allows to concatenate file column wide:

$ # Concatenate animals/animals.txt column wide
$ paste animals/animals.txt animals/animals.txt
rabbit    2 rabbit
dog   1 dog
cat  11 cat
bird 23 bird
chicken   3 chicken

$ # Concatenate animals/animals.txt column wide with custom separator (" - ")
$ paste -d "|" animals/animals.txt animals/animals.txt
rabbit|2 rabbit
dog|1 dog
cat|11 cat
bird|23 bird
chicken|3 chicken

Comparing files

Comparing files can be done in numerous ways. These are the most common tools.

cmp outputs the position of the difference if files are different, else nothing:

$ cmp animals/animals.txt animals/animals2.txt
animals/animals.txt animals/animals2.txt differ: byte 11, line 2

diff outputs the line number of the difference, and the difference:

diff animals/animals.txt animals/animals2.txt
2d1
< 1 dog

This interprets as “there is a difference on line 2 of animals.txt which translates to line 1 of animals2.txt: the line ‘1 dog’ has been deleted”.

Question: what diff option allows to compare the files side-by-side?

Solution:

diff -y animals/animals.txt animals/animals2.txt
rabbit                            2 rabbit
dog                                 <
cat                              11 cat
bird                             23 bird
chicken                           3 chicken

Modifying file content

To make simple changes, you probably want to use a classic text editor.

But sometimes, you need to make more complicated changes, e.g. changing all the occurences of a string in a file.

Replacing all occurences of a word in a file

To to this, sed is probably the most straight-forward solution:

$ # Replace all occurences of 'rabbit' with 'bunny'
$ sed 's/rabbit/bunny/g' animals/animals.txt
bunny
dog
cat
bird
chicken

The script part of this command tells sed:

s/ stands for “substitute”
rabbit/bunny/ means”rabbit” with “bunny”
the trailing g indicates that we want all occurences on the line to be substituted

So, in one sentence, “replace all occurences of ‘rabbit’ with ‘bunny’”.

Removing a line

By line number

$ # Remove lines 1 and 2
$ sed '1,2d' animals/animals.txt
11 cat
23 bird
3 chicken

By pattern

$ # Remove all lines that contains 'chicken'
$ sed '/chicken/d' animals/animals.txt
2 rabbit
1 dog
11 cat
23 bird

sed is actually a lot more poweful than only those two simple examples. If you want to know more about sed there are numerous quality tutorials around.

Back to tutorial homepage