Thursday 29 November 2012

Quick and dirty coding: count DNA base composition in bash

Another installation in my (completely non-ordered, non-structured) quick and dirty code series; here's a way to easily measure the base composition of all of the reads in a fastq file (if you want to check your GC content say) in bash.

I was going to say a quick way, but objectively, it's probably not. Bash is really not an appropriate language to be doing this kind of thing in, but if you're working a lot in the terminal for other things it's sometimes convenient to just have a stock of bash-only scripts for common tasks.

 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l  
 echo sed '1~4d;3~4d;4~4d' FILENAME.FASTQ | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l  

As I'm hoping for these to possibly provide some assistance to people who know less about bioinformatics and programming than me (perhaps the smaller end of the wedge, but still significant), I'm going to try and provide some comments and context, so that people can learn from these as well as just use them.

Of note in this example is the first sed term used here (between the echo and the pipe ("|")), which is one that I find myself using very frequently for preliminary analysis of NGS data. All it does it look through a text file and send every 2nd line of four (N-4d just ignores the Nth line of every four lines) and send that to standard out (so either to the terminal or the next process in the pipe)

In a fastq file this setup outputs the sequence line, but obviously it's very easily adapted to output any of the lines you're interested in, from any repetitively structured text file.

In case people might want a marginally more detailed readout (with a bit more of a time penalty), I threw together the following bash script, which I saved as BaseComposition.sh, which I've commented to give a little flavour of some of the things one can do in bash (however inappropriately). Note that bash by default can only do integer arithmetic; you need to use bc or awk or something else to use floats.

 # Jamie Heather, UCL, November 2012  
   
 # Short bash script to give a quick read-out of the DNA base composition of a fastq file  
 # Run with command:  
 # $ sh BaseComposition.sh FILENAME.fastq  
   
 # Takes an input b, then runs through following pipe, whereby:  
 # every second line of 4 line repeat (i.e. sequence line) taken | sed adds tildes between each character | tr replaces tildes for new lines | grep counts each character | wc counts each hit  
 # Then feeds each base into it's own variable  
 ACOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "A" | wc -l)  
 CCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "C" | wc -l)  
 GCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "G" | wc -l)  
 TCOUNT=$(sed '1~4d;3~4d;4~4d' $1 | sed s/./\&~/g | tr "~" "\n" | grep "T" | wc -l)  
   
 # Adds up all the individual bases   
 TOTALBASES=$(($ACOUNT + $CCOUNT + $GCOUNT + $TCOUNT))  
   
 # Prints out both the number of hits for each base, and the percentage of that each base's contribution to the whole  
 echo "\n"  
 echo A bases: "\n"$ACOUNT "\n"Percentage A:  
 echo "scale=2;$ACOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo C bases: "\n"$CCOUNT "\n"Percentage C:   
 echo "scale=2;$CCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo G bases: "\n"$GCOUNT "\n"Percentage G:   
 echo "scale=2;$GCOUNT/$TOTALBASES"*100 | bc  
 echo "\n"  
 echo T bases: "\n"$TCOUNT "\n"Percentage T:   
 echo "scale=2;$TCOUNT/$TOTALBASES"*100 | bc  

No comments:

Post a Comment