Command line programming tips and tricks for biologists

To see the CPU info (command line):
$ cat /proc/cpuinfo

To get the HDD info:
$ sudo hdparm -I /dev/sda
To see the list of the active processes (top also reports memory usage):
$ top
To run a program on a set of files with a specific extension in a directory:
ls *.ext | xargs -I{} ./program {} [list of parameters]
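A quick sketch of the pattern above, with made-up file names and `wc -l` standing in for the program (a find-based variant is safer than parsing ls when names contain spaces):

```shell
cd "$(mktemp -d)"            # scratch directory
printf 'a\nb\n' > one.txt    # 2 lines
printf 'c\n'    > two.txt    # 1 line
# run `wc -l` on every .txt file in the directory
find . -maxdepth 1 -name '*.txt' | xargs -I{} wc -l {}
```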

Now a little playing with grep!
To find those lines that have neither rg1 nor rg2 and dump them into output.txt:
grep -v "rg1" "$datafile" | grep -v "rg2" > output.txt

To find those lines with either rg1 or rg2 and dump them into separate files:
grep "rg1" "$datafile" > rg1lines.txt
grep "rg2" "$datafile" > rg2lines.txt
Combining several regexes in an if statement (Perl):
$_ = $data; if ( m/regex/ && m/secondregex/ ) {..}
To sort a delimited file and remove duplicates based on a specific column (e.g. column one, comma-delimited):
sort -u -t, -k1,1 file
If the file is tab-delimited, simply remove the -t, part. If your file has several columns and you want to remove duplicates based on the combination of a few columns (e.g. if the name and family name together must match among several records, and they are located in columns 1 and 3):
sort -u -k1,1 -k3,3 file
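A minimal sketch of the two-column deduplication (the file name and data are made up; which of the duplicate lines survives is not guaranteed):

```shell
# two "john ... smith" records share columns 1 and 3
printf 'john\tx\tsmith\njohn\ty\tsmith\njane\tz\tdoe\n' > /tmp/people.tsv
# keep one line per (column 1, column 3) combination
sort -u -k1,1 -k3,3 /tmp/people.tsv
```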
To print only specific columns of a delimited file:
cat file.tsv | awk '{ print $1 "\t" $3 }'
awk -F "|" '{print $3}' filename

Unpacking archives
$ gunzip -d archive.tar.gz
$ tar -xvpf archive.tar

Sorting (on the fly) the tabular output of any program (in this example, we sort the blastn results on column 12, the HSP bit score):
$ blastall -p blastn -i seq.fa -d db.fa -m 8 | sort -g -k 12

A very useful tutorial topic is Linux shell text processing, including grep, cat, awk, sort, uniq, find, xargs, and more.
Dealing with "jobs" in Linux:

  • jobs – list the current jobs
  • fg – resume the job that’s next in the queue
  • fg [number] – resume job [number]
  • bg – Push the next job in the queue into the background
  • bg [number] – Push the job [number] into the background
  • kill %[number] – Kill the job numbered [number]
  • kill -[signal] %[number] – Send the signal [signal] to job number [number]

The percent sign in front of the kill commands tells them that the provided number is a job number, not a process number.

To learn about the type of the RAM:

sudo dmidecode --type 17

To only copy a specific number of lines to a new file:
shell: # head -100 oldfile > newfile
vi: # vi oldfile :1,100 w newfile

A little more with head:
head -30 /path/to/file | tail -20 prints the last 20 of the first 30 lines, i.e. lines 11-30. And to copy everything from line 100 onward:
tail --lines=+100 inputfile > outputfile
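A quick check of both commands above on generated input (the paths are made up; --lines=+N is GNU tail):

```shell
seq 1 100 > /tmp/hundred.txt           # a file with lines 1..100
head -30 /tmp/hundred.txt | tail -20   # prints lines 11..30
tail --lines=+100 /tmp/hundred.txt     # prints line 100 onward
```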

to count the number of lines:

wc -l

To cut a specific column (or a set of columns) from a delimited file: in the example below, it cuts everything from the second column onward (including the second column). The file is space-delimited; if it were colon-delimited (:), one would use -d: instead of -d' '.
cut -d' ' -f2- test.txt

To grep tab-delimited columns (e.g. 16[tab]29130558):
grep -e $'16\t29130558' vardb_NoCancer.txt
To print out an array in a tab-delimited format without using a loop (Perl):
local $" = "\t"; print "@array\n";
To get a directory's size on disk:
du -s /dir
Let's say we have a data file with multiple columns, and the first two columns are:
10 14
12 16
14 18
If we want to catch only those lines in which the first column is greater than 10 and the second column is smaller than 17:
awk '$1 > 10 && $2 < 17' file
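The same filter can be checked inline with printf and the data above:

```shell
# only the middle line satisfies $1 > 10 && $2 < 17
printf '10 14\n12 16\n14 18\n' | awk '$1 > 10 && $2 < 17'
# prints: 12 16
```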
This can come in very handy. Now suppose the columns are a mix of letters and numbers and you only want to check the number after the ":" (colon character). Let's say we have two files (grexample.data and grexample2.data) with the same content as the example above, except that the first column has "chr" and a colon before the numbers. Something like this:
chr4:10 14
chr5:12 16
chr6:14 18
Now you want a list of the file names (without the extension part) along with those lines in which the number after the colon is greater than 10 and smaller than 14:
grep chr *.data | sed 's/chr.*:/\t/g' | sed 's/.data//g' | sed 's/://' | awk '$2 > 10 && $2 < 14'
Here is the explanation:
Here is the explanation:
First, with the grep command, you get a list of files along with their lines containing chr.
Next, you replace the chr[number]: prefix with a tab character using sed; the numbers that followed it are now in the second column.
Next, you replace the file extension with nothing using sed.
Next, you get rid of the remaining ":" character with another sed replace.
Finally, you check the condition on the second column with awk: greater than 10 and smaller than 14.
Here is the output:
grexample2	12 16
grexample	12 16
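The whole pipeline can be reproduced in a scratch directory (file names taken from the output above; \t in the sed replacement assumes GNU sed):

```shell
cd "$(mktemp -d)"
printf 'chr4:10 14\nchr5:12 16\nchr6:14 18\n' > grexample.data
cp grexample.data grexample2.data
# file name (extension stripped) plus the lines whose chr number is in (10,14)
grep chr *.data | sed 's/chr.*:/\t/g' | sed 's/.data//g' | sed 's/://' \
  | awk '$2 > 10 && $2 < 14'
```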
P.S.: here is an actual example I ran a while ago to find variations in two exons of a gene (SETD2, exons 1 and 2):
grep chr ../../Hamid/ControlDBAnnotation/vcfs/*/*.combined_variants.excel.txt | sed 's/chr/chr\t/g' | sed 's/:/\t/' | sed 's/\.\.\/\.\.\/Hamid\/ControlDBAnnotation\/vcfs\///' | sed 's/.combined_variants.excel.txt//g' | sed 's/:/\t/' | sed 's/=chr\t/=chr/' | awk '$3 == 3 && ($4 < 47166038 && $4 > 47161671) || ($4 < 47168153 && $4 > 47168137)' > SETD2_EX1-2.combined_variants.excel.txt
P.P.S.: Don't try this at home! I mean, this will not work on your computer, since you are not using my computer and things are in different places (well, probably!). This was just to show an actual example of how these command lines can come in handy. They save lives!
Here is a very useful sed command! You will need it if you are trying to keep a pattern in your string and replace the rest of it with something else. In the example below, we want to keep the reference amino acid of the variation and its position, and replace the alternative amino acid (N) with something else (XXX in this example):
echo 'p.D433N' | sed '/p.[A-Z][0-9]*/ s/\(p.[A-Z][0-9]*\).*/\1XXX/'
This can come in handy when you only care that the amino acid at a certain position changes to something else, whatever that is!
Renaming (moving) multiple files in a one-liner (this can also be done with several other commands):
for file in *.prev ; do mv "$file" "${file/.prev/.new}" ; done
A good one-liner for concatenating multiplexed samples (fastq files) from an Illumina HiSeq 2000 (and some other platforms):
ls *R1*.fastq | sed 's/.*N501\.//' | sed 's/_R1.fastq//' | sort -u | awk '{print "ls *"$1"*_R1.fastq"}' | bash
(This is for the read-one files; do the same for the read-two files, only replacing R1 with R2.) Don't forget to decompress before concatenating and to compress afterwards. The sed commands should be altered accordingly.
Reading a tabular file into columns in bash (an example is given below):
while read file column1 column2 column3 ; do
echo $cmd $column1 $column3 $column2 $file
done < "$input"
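A runnable sketch of the loop above (the input file, its contents, and the echoed command are all hypothetical):

```shell
input=/tmp/config_demo.txt
printf 'sample1.fq 10 20 30\nsample2.fq 40 50 60\n' > "$input"
# each whitespace-separated column lands in its own variable
while read file column1 column2 column3 ; do
    echo "processing $file: $column1 $column3 $column2"
done < "$input"
```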
How to list only files (not folders):
find . -maxdepth 1 -not -type d
How to search and replace within a file:
sed -i 's/original_text/replacement/g' input.txt
So, for example, this is how you can remove everything except the digits from a file:
sed -i 's/[^0-9]*//g' input.txt
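To see what that pattern actually does, here is a made-up file (sed -i with no suffix assumes GNU sed):

```shell
printf 'chr4:10\nchr5:12\n' > /tmp/strip_demo.txt
sed -i 's/[^0-9]*//g' /tmp/strip_demo.txt   # delete every run of non-digits
cat /tmp/strip_demo.txt
# prints: 410 then 512
```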
Something very useful and necessary to know, but which for some reason I can never seem to remember, so I have to look it up every time! It's about searching the computer for a specific file. The command is the find command. Below is the general idea of how to use it:
find where1 where2 where3 -name what
The name can include wildcards (quote the pattern so the shell does not expand it first).
to remove an entire column in R:
dataframe$columnName <- NULL

and to transpose a table in R, simply:
dataframe.T <- t(dataframe)

And if you got into trouble with DOS end-of-line characters (Ctrl-M) and want to convert them to Unix format so you can parse the file in Unix:
awk '{sub(/\r$/,"");print}'
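For example, on a two-line stream with DOS line endings:

```shell
# \r\n endings become plain \n
printf 'a\r\nb\r\n' | awk '{sub(/\r$/,"");print}'
```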

Very simple way to condition clauses in oneliner awks (below is just an example):

.. | awk '{if ($1 > 0.02) print $1}'

Some very useful “search and replace” commands (directly from Vim Wiki):

:s/foo/bar/g Change each ‘foo’ to ‘bar’ in the current line.
:%s/foo/bar/g Change each ‘foo’ to ‘bar’ in all the lines.
:5,12s/foo/bar/g Change each ‘foo’ to ‘bar’ for all lines from line 5 to line 12 (inclusive).
:’a,’bs/foo/bar/g Change each ‘foo’ to ‘bar’ for all lines from mark a to mark b inclusive (see Note below).
:.,$s/foo/bar/g Change each ‘foo’ to ‘bar’ for all lines from the current line (.) to the last line ($) inclusive.
:.,+2s/foo/bar/g Change each ‘foo’ to ‘bar’ for the current line (.) and the two next lines (+2).
:g/^baz/s/foo/bar/g Change each ‘foo’ to ‘bar’ in each line starting with ‘baz’.

A handy command when you are scratching your head with sed!

Imagine you want to build a config file over a list of files you have stored somewhere. Say the first column of this config file will be the location of the file, the second a constant argument (like -1 or +5), and the third some info extracted from the file name. But bear in mind that you want to build this config file only over a certain set of files, e.g. the non-input files. Here is what you would do:

for file in /location/*.tar.gz; do if [[ ! $file =~ input ]]; then echo -e $file"\t-1\t"${file/non_intended_prefix/} | awk '{gsub ("non_intended_suffix","",$3)}1'; fi; done

Here is an example of how the input directory and output config file will look like:

for file in /location/*.tar.gz; do if [[ ! $file =~ input ]]; then echo -e $file"\t-1\t"${file/ABC_/} | awk '{gsub ("XY.*","",$3)}1'; fi; done

Input directory (~/location/) contains these files:

ABC_Case1_XYZ.tar.gz
ABC_Case2_XYZ2.tar.gz

and here is how the config file will look:

~/location/ABC_Case1_XYZ.tar.gz -1 Case1
~/location/ABC_Case2_XYZ2.tar.gz -1 Case2

To get the size of each chromosome from the UCSC server using mysql:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome

There are many useful sed commands for deleting a line with a matched pattern from a file. For example, I found the command below very useful. It's good for when you want to delete the line right before the line containing the pattern of your interest:

sed -n '/Linux/{x;d;};1h;1!{x;p;};${x;p;}' file
For example, if you pipe some lines of text to the command

sed -n '/BB/{x;d;};1h;1!{x;p;};${x;p;}'

the line immediately before each line containing BB is dropped from the output. This is good for when you want to remove the last exon from a bed file containing the set of coding exons (or introns).
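A minimal illustration with three made-up lines, where AA sits right before the BB match:

```shell
# AA (the line before BB) is dropped; BB and CC survive
printf 'AA\nBB\nCC\n' | sed -n '/BB/{x;d;};1h;1!{x;p;};${x;p;}'
```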
Adding a line to the beginning of a file:

sed -i '1s/^/<added text> /' file

For example, if you want to add the name of each file in a directory to the beginning of that file:
for file in *.counts;do echo "sed -i '1s/^/$(echo $file)\t\n/' $file" | bash;done
To rename all the files in a directory by removing the first four characters from the file names:

rename -n -v 's/^(.{4})//' *

-n makes it a dry run and -v shows what would change, so the final command would be something like this:

rename 's/^(.{4})//' *

To print the names and the number of columns in a tab-delimited file (use -d for any other delimiter):

csvcut -t -n file.tsv

To get a list of genes with the length of their longest transcript:

First download the RefSeq transcript table (for example via the UCSC Table Browser, saved here as hgTables.tsv).


Then sort and remove the shorter length transcripts by:

awk '{print $5"\t"$4-$3}' hgTables.tsv | sort -nrk2,2 | awk '!seen[$1]++' | sort -k1,1 > sorted_uniq_gene_length.tsv

There are times you want to remove a number of characters from the beginning of the names of a number of files, let's say 5 characters, and there is no shared pattern to use; here is how to do it:

for f in *; do mv “$f” “${f:5}”; done
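For instance, in a scratch directory with invented file names:

```shell
cd "$(mktemp -d)"
touch ABCD_one.txt ABCD_two.txt
for f in *; do mv "$f" "${f:5}"; done   # drop the first 5 characters
ls                                      # leaves: one.txt two.txt
```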
If you are eager to know what type of machine and OS you are using on a remote server, simply type:

cat /etc/*-release


If you want to sort a file without moving the header (let's say you want to sort it numerically based on column 6):

(head -n 1 <file> && tail -n +2 <file> | sort -nk6,6)
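For example, with a small made-up table (sorting on column 2 here instead of column 6):

```shell
printf 'name\tscore\nb\t3\na\t1\nc\t2\n' > /tmp/scored.tsv
# header stays on top; the rest is sorted numerically on column 2
(head -n 1 /tmp/scored.tsv && tail -n +2 /tmp/scored.tsv | sort -nk2,2)
```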

How to delete the first n lines from a file using sed:

sed -e ‘1,3d’
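For example, deleting the first 3 lines of a generated stream:

```shell
seq 1 5 | sed -e '1,3d'
# prints:
# 4
# 5
```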

