Nikleotide!

Extras

Some programming tips and tricks for biologists:

(mostly shell commands) 

These are (some of) the useful commands I have accumulated since 2000, when I had my first encounter with programming in Linux. So if some of them look very trivial and basic, that's the reason! There is no chronological order to the content!

To see the CPU info (command line):
$ cat /proc/cpuinfo

To get the HDD info:
$ sudo hdparm -I /dev/sda
To see the list of the active processes (and also to get HDD and memory info):
$ top
To run a program on a set of files with a specific extension in a directory:
ls *.ext | xargs -i ./program {} [list of parameters]
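Note that -i is deprecated in newer versions of GNU xargs; the modern equivalent uses -I with an explicit placeholder (./program and its parameters stand in for your own):

ls *.ext | xargs -I {} ./program {} [list of parameters]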

Now a little playing with grep!
To find those lines that contain neither rg1 nor rg2 and dump them into output.txt:
grep -v "rg1" $datafile | grep -v "rg2" > "output.txt"

To find those lines with either rg1 or rg2 and dump them into separate files:
grep "rg1" $datafile > "rg1lines.txt"
Combining several regexes in an if statement (this one is Perl):
$_ = $data; if ( m/regex/ && m/secondregex/ ) {..}
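The same idea also works as a one-liner from the shell via perl -n; a minimal sketch (regex, secondregex, and datafile.txt are placeholders):

perl -ne 'print if /regex/ && /secondregex/' datafile.txt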
To sort and remove duplicates from a delimited file based on a specific column (e.g. column one, comma-delimited):
sort -u -t, -k1,1 file
If the file is tab-delimited, simply remove the -t, part (sort splits on whitespace by default). If your file has several columns and you want to remove duplicates based on the combination of a few columns (e.g. if the name and family name together match among several records, and they are located in columns 1 and 3):
sort -u -k1,1 -k3,3 file
To print only specific columns of a delimited file:
cat file.tsv | awk '{ print $1 "\t" $3}'
or
awk -F "|" '{print $3}' filename

Unpacking archives:
$ gunzip -d archive.tar.gz
$ tar -xvpf archive.tar

Sorting (on the spot) the tabular output of any program (in this example, we want to sort the blastn results based on the criterion in column 12, the HSP bit score):
$ blastall -p blastn -i seq.fa -d db.fa -m 8 | sort -g -k 12

There is a very useful tutorial source for Linux shell text processing, covering grep, cat, awk, sort, uniq, find, xargs, and more.

Dealing with "jobs" in Linux:

  • jobs – list the current jobs
  • fg – resume the job that’s next in the queue
  • fg [number] – resume job [number]
  • bg – Push the next job in the queue into the background
  • bg [number] – Push the job [number] into the background
  • kill %[number] – Kill the job numbered [number]
  • kill -[signal] %[number] – Send the signal [signal] to job number [number]
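For example (the PID shown is made up):

$ sleep 300 &      # start a throwaway background job
[1] 12345
$ jobs             # [1]+  Running    sleep 300 &
$ kill %1          # kill job 1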

The percent sign in front of the number tells the kill command that the provided number is a job number and not a process number.

To learn about the type of the RAM:

sudo dmidecode --type 17

To copy only a specific number of lines to a new file:
shell: head -100 oldfile > newfile
vi: open the file with vi oldfile, then run :1,100w newfile

A little more from head:
head -30 /path/to/file | tail -20 prints the last 20 of the first 30 lines, i.e. lines 11-30. And to copy everything from line 100 onward:
tail --lines=+100 inputfile > outputfile
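You can sanity-check the line arithmetic with seq:

seq 50 | head -30 | tail -20    # prints 11 through 30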

to count the number of lines:

wc -l file.name

To cut only a specific column (or a set of columns) from a delimited file. In the example below, it cuts all the columns from the second column onward (including the second column). The file is space-delimited; if it were colon-delimited (:), one should use -d: instead of -d' '
cut -d' ' -f2- test.txt

 
To grep tab-delimited columns (e.g. 16[tab]29130558):
grep -e $'16\t29130558' vardb_NoCancer.txt
 
To print out an array in a tab-delimited format without a loop (this one is Perl):
local $" = "\t"; print "@array\n";
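As a complete one-liner you can paste into a shell (the array contents are made up):

perl -e 'local $" = "\t"; my @array = ("a", "b", "c"); print "@array\n";'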
 
To get information about a directory's size on disk:
du -s -h /dir   (-h makes it human readable)
 
Let’s say you want to ssh to a remote server as part of a shell script, for example, you want to rsync several files to a remote host; one way would be to enter the password every time a file transfer starts which obviously is painful if you have more than 2 files to transfer! The other way would be connecting to the remote server with ssh without the need to enter password. Wouldn’t that be more convenient? To do this, you need to add your host machine (let’s call it A) to the list of trusted (authorized) machines on machine B (the remote server).
To do this, you need to create a public RSA key on your host machine, which can then be copied to any machine you want to access without a password. Then copy it into the list of authorized keys on the remote server. That list lives in a file at: ~/.ssh/authorized_keys
So the commands would be something like this:
1- Create the public key on the A machine (local)
ssh-keygen -t rsa
2- Copy the public key onto the B machine (remote)
cat ~/.ssh/id_rsa.pub | ssh username@B-machine-ip 'cat >> ~/.ssh/authorized_keys'
Voila!
(from this source: http://www.linuxproblem.org/art_9.html)
 
Now, let’s say we have a data file called grexample.data with multiple columns. The first two columns are:
10 14
12 16
14 18

If we want to catch only those lines in which the first column is greater than 10 and the second column is smaller than 17:
cat grexample.data | awk '$1 > 10 && $2 < 17'
This can come in very handy. Now, what if the columns are a mix of letters and numbers, such as:
chr20:10
and you only want to check the number after the ":" (colon) character?
Let's say we have two files (grexample.data and grexample2.data) with the same content as the above example, except that the first column contains "chr[number]:" before the numbers. Something like this:
chr4:10 14
chr5:12 16
chr6:14 18
Now you want to get a list of file names (without the extension part) along with those lines in which the numbers after "chr…:" are greater than 10 and smaller than 14:
grep chr *.data | sed 's/chr.*:/\t/g' | sed 's/.data//g' | sed 's/://' | awk '$2 > 10 && $2 < 14'
Here is the explanation:
First, with the grep command, you get a list of files and the lines containing chr.
Next, you search and replace chr[number]: with a tab character using sed; the numbers that were after chr..: are now located in the second column.
Next, you search and replace the file extension with nothing using sed.
Next, you get rid of the remaining ":" character using sed.
Finally, you check the condition on the second column using awk: greater than 10 and smaller than 14.
Here is the output:
 
grexample2 12 16
grexample 12 16
 
 
PS: here is an actual example of what I ran a while ago to find variations in two exons of a gene (SETD2, exons 1 and 2):
grep chr ../../Hamid/ControlDBAnnotation/vcfs/*/*.combined_variants.excel.txt | sed 's/chr/chr\t/g' | sed 's/:/\t/' | sed 's/\.\.\/\.\.\/Hamid\/ControlDBAnnotation\/vcfs\///' | sed 's/.combined_variants.excel.txt//g' | sed 's/:/\t/' | sed 's/=chr\t/=chr/' | awk '$3 == 3 && ($4 < 47166038 && $4 > 47161671) || ($4 < 47168153 && $4 > 47168137)' > SETD2_EX1-2.combined_variants.excel.txt
 
PPS: Don't try this at home! I mean, this will not work on your computer, since you are not using my computer and things are in different places (well, probably!). This was just to show an actual example of how these command lines might come in handy. They save lives!
 
Here is a very useful sed command! You will need it if you are trying to keep a pattern in your string and replace the rest of it with something else. In the example below, we want to keep the variation's reference amino acid and its position, and replace the alternative amino acid (N) with something else (XXX in this example):
echo 'p.D433N' | sed '/p.[A-Z][0-9]*/ s/\(p.[A-Z][0-9]*\).*/\1XXX/'
result:
p.D433XXX
This can come in handy when you only care that the amino acid at a certain position changes to something else, whatever that is!
Renaming (moving) multiple files in a one-liner (more on this, using three different commands):
for file in *.prev ; do mv "$file" "${file/.prev/.new}" ; done
 
A good one-liner for concatenating multiplexed samples (fastq files) from an Illumina HiSeq 2000 (and some other platforms):
ls *R1*.fastq | sed 's/.*N501\.//' | sed 's/_R1.fastq//' | sort -u | awk '{print "ls *"$1"*_R1.fastq"}' | bash
(This is for read-one files; do the same for read-two files, only replacing R1 with R2.) Don't forget to decompress before concatenating and to compress afterward; the sed commands should be altered accordingly.
 
Reading a tabular file into columns in bash (an example given below):
while read file column1 column2 column3
do
echo $cmd $column1 $column3 $column2 $file
done < "$input"
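For instance, assuming a whitespace-delimited config.txt (a made-up file) whose rows hold a path followed by three fields:

input=config.txt
while read file column1 column2 column3
do
  echo "processing $file with $column1, $column2 and $column3"
done < "$input"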
How to only list files (not folders):
find . -maxdepth 1 -not -type d
How to search and replace within a file:
sed -i 's/original_text/replacement/g' input.txt
 
So, for example, this is how you can remove all the numbers from a file (the inverted class [^0-9] would do the opposite and keep only the numbers):
sed -i 's/[0-9]//g' input.txt
 
Something which is very useful and necessary to know, but which for some reason I never seem to remember, so every time I have to look it up! It's about searching the computer for a specific file.
The command is the find command. Below is the general idea of how to use it:

find where1 where2 where3 -name what
The name can include wildcards (quote the pattern so the shell doesn't expand it first).
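For example, to look for fastq files under two directories (the paths are made up):

find ~/projects /data -name '*.fastq'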
 
To remove an entire column in R:
dataframe$columnName <- NULL

and to transpose a table in R, simply:
dataframe.T <- t(dataframe)
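One caveat: t() returns a matrix, so if you need a data frame back, wrap the result:

dataframe.T <- as.data.frame(t(dataframe))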
 

And if you got into trouble with DOS end-of-line characters (Ctrl-M) and want to convert them to Unix format so you can parse the file in Unix:
awk '{sub(/\r$/,"");print}'

A very simple way to add condition clauses to one-liner awks (below is just an example):

.. | awk '{if ($1 > 0.02) print $1}'

Some very useful “search and replace” commands (directly from Vim Wiki):

:s/foo/bar/g Change each 'foo' to 'bar' in the current line.
:%s/foo/bar/g Change each 'foo' to 'bar' in all the lines.
:5,12s/foo/bar/g Change each 'foo' to 'bar' for all lines from line 5 to line 12 (inclusive).
:'a,'bs/foo/bar/g Change each 'foo' to 'bar' for all lines from mark a to mark b (inclusive).
:.,$s/foo/bar/g Change each 'foo' to 'bar' for all lines from the current line (.) to the last line ($) inclusive.
:.,+2s/foo/bar/g Change each 'foo' to 'bar' for the current line (.) and the two next lines (+2).
:g/^baz/s/foo/bar/g Change each 'foo' to 'bar' in each line starting with 'baz'.

A handy command when you are scratching your head with sed!

Imagine you want to make a config file over a list of files you have stored somewhere. Let's say the first column of this config file will be the location of the file, the second one a constant argument (like -1 or +5), and the third one some info extracted from the file name. But bear in mind that you want to build this config file only over a certain set of files, like the non-input files. Here is what you would do:

for file in /location/*.tar.gz; do if [[ ! $file =~ input ]]; then echo -e $file"\t-1\t"${file/non_intended_prefixes/} | awk '{gsub ("non-intended_suffixes","",$3)}1'; fi; done

Here is an example of how the input directory and output config file will look like:

for file in /location/*.tar.gz; do if [[ ! $file =~ input ]]; then echo -e $file"\t-1\t"${file/ABC_/} | awk '{gsub ("XY.*","",$3)}1'; fi; done

Input directory (~/location/) contains these files:

ABC_Case1_XYZ1.tar.gz
ABC_Case1_XYZ1.zip
ABC_Case2_XYZ2.tar.gz
ABC_Case2_XYZ2.zip
ABC_input1_XYZ1.tar.gz
ABC_input1_XYZ1.zip
ABC_input2_XYZ2.tar.gz
ABC_input2_XYZ2.zip

and here is how the config file will look:

~/location/ABC_Case1_XYZ1.tar.gz -1 Case1
~/location/ABC_Case2_XYZ2.tar.gz -1 Case2

To get the size of each chromosome from the UCSC server using mysql:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome

This page contains 25 useful commands for deleting a line from a file with a matched pattern. For example, I found the command below very useful. It's good for when you want to delete the line right before the line containing the pattern of interest:

sed -n '/Linux/{x;d;};1h;1!{x;p;};${x;p;}' file
 
For example, if you pipe this:
AA
BB
CC

to the command

sed -n '/BB/{x;d;};1h;1!{x;p;};${x;p;}'

above, you will get:

BB
CC
This is good for when you want to remove the last exon from a bed file containing the set of coding exons (or introns).
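If you are curious how the hold-space juggling works, here is the same command spread out with comments (GNU sed accepts # comments inside a script):

sed -n '
  /BB/ {x;d;}   # on a match: stash the match in the hold space and delete
                # the previously held line (the line *before* the match)
  1h            # first line: save it to the hold space, print nothing yet
  1!{x;p;}      # later lines: swap with the hold space and print the
                # previous line (output is delayed by one line on purpose)
  ${x;p;}       # last line: swap back and print it so the end is not lost
' file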
 
Adding a line to the beginning of a file:

sed -i '1s/^/<added text> /' file

For example, if you want to add the name of each file in a directory to the beginning of that file:
for file in *.counts;do echo "sed -i '1s/^/$(echo $file)\t\n/' $file" | bash;done
To rename all the files in the directory by removing the first four characters from the file names:

rename -n -v 's/^(.{4})//' *

-n does a dry run (no action is taken) and -v shows what would change, so once you are happy with the preview, the final command would be something like this:

rename 's/^(.{4})//' *

To print the name and number of the columns in a tab-delimited (or any other character-delimited) file:

csvcut -r -n file.tsv
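csvcut comes from the csvkit package; if it isn't installed, a plain-shell equivalent for a tab-delimited header line is:

head -1 file.tsv | tr '\t' '\n' | cat -n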

To get a list of genes with the length of their longest transcript:

First, download the RefSeq transcript table from the NCBI RefSeq database:

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/gene_RefSeqGene

Then sort by length and keep only the longest transcript per gene:

awk '{print $5"\t"$4-$3}' hgTables.tsv | sort -nrk2,2 | awk '!seen[$1]++' | sort -k1,1 > sorted_uniq_gene_length.tsv

There are times you want to remove a number of characters from the beginning of the names of a number of files, let's say 5 characters, and there is no shared pattern to use; here is how to do it:

for f in *; do mv "$f" "${f:5}"; done
If you are eager to know what type of machine and OS you are using on a remote server, simply type:

cat /etc/*-release

If you want to sort a file without moving the header (let's say you want to sort it based on column 6, which is a number):

(head -n 1 <file> && tail -n +2 <file> | sort -nk6,6)

For example:

(head -n 1 4019/4019-tumor_Cancer_Gene_Census.events.excel.tsv && tail -n +2 4019/4019-tumor_Cancer_Gene_Census.events.excel.tsv | sort -k6,6)

How to delete the first n lines from a file using sed (here, the first three):

sed -e '1,3d'

Docker:

How to install docker on Ubuntu (14.04 LTS):

sudo apt-get update
sudo apt-get -y install docker.io
ln -sf /usr/bin/docker.io /usr/local/bin/docker
sed -i '$acomplete -F _docker docker' /etc/bash_completion.d/docker.io
update-rc.d docker.io defaults

How to remove docker on Ubuntu (14.04 LTS):

To identify which docker package you have installed:
dpkg -l | grep -i docker
Then adjust the package name in the commands from https://stackoverflow.com/a/31313851/2340159 to match yours. For example, for docker.io it would be:
sudo apt-get purge -y docker.io
sudo apt-get autoremove -y --purge docker.io
sudo apt-get autoclean
The above commands will not remove images, containers, volumes, or user-created configuration files on your host. If you wish to delete all images, containers, and volumes, run the following command:
sudo rm -rf /var/lib/docker
Remove docker from apparmor.d:
sudo rm /etc/apparmor.d/docker
Remove docker group:
sudo groupdel docker

Let’s say you want to find the maximum value of a column using awk. Your input file (e.g. input.txt) has two columns, first one line number, second one the value you want to find the maximum of:

awk -v max=0 '{if ($2>max){max=$2;row=$1}}END{print "row number:"row" has the maximum value which is:"max}' input.txt

Do you want to know how long your job (running in the background or foreground) has been running? Simply use the top command, or type the command below if you only want to see the running time of a specific process (for whatever reason, such as including it in a script to stop a process after a certain time):

ps -o etime= -p "$$"

In this command, instead of $$ use the process ID (PID) that you can get from top. If your job includes multiple processes (e.g. if you are running something and piping its results to another program), this will give you the largest time-span of all the processes included in the submitted job. In other words, it gives you the total "job" run-time, not just that of a single process.
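As a sketch of the "stop a process after a certain time" idea (the PID is made up, and etimes, which reports elapsed seconds, needs a reasonably recent procps):

pid=12345                            # hypothetical PID taken from top
elapsed=$(ps -o etimes= -p "$pid")   # elapsed run time in seconds
if [ "$elapsed" -gt 3600 ]; then
  kill "$pid"                        # stop it after one hour
fi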

To remove duplicates from a flat file based on only one column (column 3 in this example):
awk -F',' '!seen[$3]++' filename

Let’s say you are on a server node and don’t know how many cores you have access to (for example, if you want to parallelize a command you need to know the number of cores you can use):

First, let’s see how many cores your node has:

cores=$(grep -c -e '^processor' /proc/cpuinfo)

For multi-threaded gzip, you can use pigz (a parallel implementation of gzip). The -p option defines the number of threads (cores) you want to assign, and -k keeps the original file.
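Putting the two together (the file name is made up):

pigz -p "$cores" -k big_file.fastq   # writes big_file.fastq.gz and keeps the original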

Let's say you want to run a program that opens a gazillion files at the same time (something like Picard's SortSam). Usually on Linux systems there's a limit to the number of open files (commonly 1024). On your own machine, you can easily change this limit by setting ulimit -n to the new limit you want (make sure you do it for both the hard and soft limits). But what if you are running that program through a docker image (for example, Picard's docker image)? Luckily, there is a solution for this in the newer versions of docker which makes it unnecessary to change the docker image itself and doesn't need a reboot of your machine (which in the case of a docker image would be almost impossible anyway). So, what's the solution? You can modify the kernel limits right from the docker run command line! Right? I was happily surprised too! Here is an example that sets the open-file limits (both soft and hard) to one million:
docker run --ulimit nofile=1000000:1000000 -it broadinstitute/picard:latest
More info here:
https://stackoverflow.com/questions/24318543/how-to-set-ulimit-file-descriptor-on-docker-container-the-image-tag-is-phusion

A little bit more fun with Awk. Let’s say you want to remove the very last column from a text file (assuming it is a delimited file):

awk 'NF{NF-=1};1' <in >out

or

awk 'NF{NF--};1' <in >out

or

awk 'NF{--NF};1' <in >out

Credit: https://unix.stackexchange.com/questions/234432/how-to-delete-the-last-column-of-a-file-in-linux

If you want to change the last occurrence of a pattern using sed (for example, changing the last dash in a line into a tab):

sed 's/\(.*\)-/\1\t/'

Credit: https://unix.stackexchange.com/a/187894/85668

If you want to edit the PATH in your docker image, add this line to your dockerfile:

ENV PATH="/newpath:${PATH}"

How to rename fasta file sequence names with ascending numbers (from https://www.biostars.org/p/53212/):

awk '/^>/{print ">" ++i; next}{print}' < test.fa

How to define and use an abs function in an awk one-liner (from the original Stack Overflow thread):

awk -F'\t' 'function abs(x){return ((x < 0.0) ? -x : x)} {if (abs($9) < 500) print $0}'

Let's say you want to connect from your local computer (L) to a remote server (R), and your username is TomL on the local machine and TomR on the remote one! Now you want to connect to the remote server without entering the password every time, or, more compellingly, you want to rsync a bunch of files to the remote server (or vice versa) without entering the password for each file! Here is a list of what you need to do:

First on your local machine, do this:

L:~TomL$ ssh-keygen -t rsa

L:~TomL$ ssh TomR@R mkdir -p .ssh

L:~TomL$ cat .ssh/id_rsa.pub | ssh TomR@R 'cat >> .ssh/authorized_keys'

Next, on your remote machine, do this:

R:~TomR$ chmod 700 ~/.ssh

R:~TomR$ chmod 640 ~/.ssh/authorized_keys

If these didn't work, try this:

L:~TomL$ cat .ssh/id_rsa.pub | ssh TomR@R 'cat >> .ssh/authorized_keys2'

R:~TomR$ chmod 640 ~/.ssh/authorized_keys2

 

 
