Monday, 30 September 2013

5 minute bash tutorial - how to win the lottery!

Apologies for the slightly misleading title. To expand...

In my office, a number of us are involved in analysis of large datasets, often sequence or text files. Each of us has our own little habits and idiosyncrasies; given the same task, we each find different ways to go about it.

A running joke with one of my friends is that I'll persevere with getting a long, bafflingly piped bash command long after she'll have resorted to scripting something in a higher level language.

(In all fairness we both do both, but I do find something deeply satisfying about chaining everything together in one line, no matter how human-unreadable it may be.)

Last Friday, this topic came up during a conversation about the upcoming Euromillions lottery. As a bit of a joke, I whipped up a quick line of bash to generate compatible random numbers (NB - I actually wrote it wrong in the tweet!).

Today I realised that if I explained the steps in the process, it might help someone new to bash understand how best to get some more functionality out of the command line.

So, here it is (with the error removed):
 for i in {1..50}; do echo $i >> win1; done; shuf win1 | head -5; for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2; rm win*  

This should produce output that fits the Euromillions format: five numbers between one and fifty, followed by two 'star' numbers between one and eleven, one number per line.

Let's break this down.

 for i in {1..50};  

This sets up a for loop, which will iterate through every integer in the range 1 to 50. For loops in bash take the format:

 for VARIABLE in THING
 do
   SOMETHING
   SOMETHING_ELSE
 done

Here, the VARIABLE is the thing that changes each iteration, and THING is the range or list of items. SOMETHING and SOMETHING_ELSE are then commands you want to run within the loop, which have to be sandwiched between 'do' and 'done'. More bash loop examples/explanation here.

However, all those newlines take up far too much time (and make it far too comprehensible) - I tend to swap them for semi-colons, which let you concatenate multiple commands into one-liners.
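For instance, here's a trivial loop written out in full, then collapsed onto a single line - both print the numbers one to three:

```shell
# Written across multiple lines:
for i in {1..3}
do
  echo $i
done

# Exactly the same loop as a one-liner, with semi-colons standing in for the line breaks:
for i in {1..3}; do echo $i; done
```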

So, our one-liner first loops from 1 to 50, then what?

 do echo $i >> win1; done;  

Here we have the contents of that first loop, sandwiched between 'do' and 'done' as per requirements, so that bash knows what to do at each turn of that loop.

Echo is the unix equivalent of 'print'; it just outputs something you tell it to the screen (or somewhere else).

In this case, we echo $i, where i denotes the variable we set up in the loop; we just need the dollar sign to let bash know we're asking for the contents of a variable. So, in the first cycle of the loop $i = 1, then in the next $i = 2, all the way up to $i = 50, which is when the loop stops.

Now if we just had 'do echo $i' in our loop, we'd just get the numbers 1 to 50 printed out in the terminal. However, we've used '>>' to redirect the numbers away from the terminal, instead putting them in a new file called 'win1'.

By using '>>' instead of '>', each cycle of the loop appends the new number to the end of the file win1 - if we used just one > each iteration of the loop would overwrite the whole file, leaving win1 at the end containing just the number 50.
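You can see the difference between '>' and '>>' for yourself with a couple of throwaway files (demo_a and demo_b are just invented names):

```shell
echo 1 > demo_a    # '>' creates the file (or empties an existing one)
echo 2 > demo_a    # '>' again: the file is overwritten, so demo_a now holds just "2"
echo 1 >> demo_b   # '>>' also creates the file if it doesn't exist...
echo 2 >> demo_b   # ...but appends on repeat use, so demo_b holds "1" then "2"
cat demo_a demo_b  # prints: 2, 1, 2 (one per line)
rm demo_a demo_b   # tidy up
```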

So now we have the file win1 containing the numbers 1-50, each on its own line.

 shuf win1 | head -5;  

This is the bit of the code that picks the numbers.

Shuf simply shuffles all of the lines in a file into a random(ish) order. It's an incredibly useful command - when I finally found out about it, I kicked myself for not having discovered it sooner. In this instance, it mixes up all the lines in win1, so they will no longer be in numerical order.

Here's where we first encounter pipes - that vertical bar character ('|') between shuf and the next command is said to 'pipe' (or carry) the output of the first command into the second.

We take the randomly shuffled win1 file, and pipe that into head (one of the commands you'll probably end up using most!) to take the top five lines, giving you your first five lottery numbers.
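Incidentally, if your shuf is the GNU coreutils version, its -i (input range) and -n (head count) flags can generate and sample the range in one go, so the whole draw works without any temporary files at all:

```shell
shuf -i 1-50 -n 5   # five distinct numbers between 1 and 50
shuf -i 1-11 -n 2   # two distinct 'star' numbers between 1 and 11
```

Like the file-based version, this samples without replacement, so you'll never draw the same ball twice.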

The next bit just repeats the same methodology for the lucky star balls (two numbers between one and eleven), outputting them after the previous five:

 for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2;

Finally, a bit of tidying up:

 rm win*  

Rm removes files - here I've used the '*' wildcard character to make it remove all files which begin 'win', thus removing the win1 and win2 files created earlier. This is also vital if you're going to run the script more than once, as otherwise you might end up drawing the same number twice (as re-running the loop would add yet more numbers to the files).

Note that rm can be dangerous - by default, it won't double check if you're sure you want to delete something, and there's basically no going back, so be careful! For instance, by not explicitly stating the files I wanted to remove, this line would also remove any other files which fit the pattern - e.g. if I had a file called 'winning_lottery_numbers.txt' in the same directory, it would have been deleted too!

This can be overcome by being explicit in the files you're removing, like so:

 rm win1 win2   

However if you have more than a few files, you're probably going to want to use the wildcards sensibly.

There you have it - a silly bit of code, but hopefully a useful one to a bash novice!

In case you're wondering, I did a small bit of rigorous, carefully controlled science with my script - I bought a ticket using numbers generated with either my code, or with the lottery's own 'lucky dip'. My code won nothing, but I got £3 from the lucky dip! Pretty conclusive, we can all agree.

Tuesday, 10 September 2013

Python fastq analysis for beginners

It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I thought I'd continue my vague hopscotch bioinformatics-for-beginners series. Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storing next generation DNA sequencing data. Each sequence gets four lines in the file: one for the sequence identifier, one for the nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be of equal length).
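To illustrate, here's a single made-up record (the sequence and quality strings are invented, but the four-line layout is the real thing):

```
@SEQ_ID_1
GATCTGGAAGCAAGCTTACG
+
IIIIHHGGFFEEDDCCBBAA
```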

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practice, but often when I want to code something I can't immediately think of a solution for, I just go combing through tutorials, manuals and forums, finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq")
for record in SeqIO.parse(fastqfile, "fastq"):
  countseqs += 1
print("Reading from",
print(countseqs, "records analysed.")

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input
import sys # module needed for command line argument to get fastq name

if len(sys.argv) > 1:
  fastqfile = open(sys.argv[1])
else:
  sys.exit("Please supply a fastq filename (i.e. python script.py file.fastq)")

countseqs = 0
for record in SeqIO.parse(fastqfile, "fastq"):
  countseqs += 1
print("Reading from",
print(countseqs, "records analysed.")
NB - this script is easily converted to read fasta files, just by doing a find and replace of 'fastq' with 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input
import sys # module needed for command line argument to get fastq name
import regex # third-party module for fuzzy regular expressions

if len(sys.argv) > 1:
  fastqfile1 = open(sys.argv[1])
else:
  sys.exit("Please supply a fastq filename (i.e. python script.py file.fastq)")

SEQ1_outfilename = "SEQ_1_" + sys.argv[1][:15] + ".fastq"
SEQ1_outfile = open(SEQ1_outfilename, "w")
SEQ2_outfilename = "SEQ_2_" + sys.argv[1][:15] + ".fastq"
SEQ2_outfile = open(SEQ2_outfilename, "w")

countseqs = 0
countSEQ_1 = 0
countSEQ_2 = 0

for record in SeqIO.parse(fastqfile1, "fastq"): # scrolls through fastq file, using SeqIO package
  countseqs += 1 # counts number of sequences analysed
  seq1 ="(ATATATATAT){e<=1}", str(record.seq))
  if seq1:
    countSEQ_1 += 1
    SeqIO.write(record, SEQ1_outfile, "fastq-sanger")
  else:
    seq2 ="(CGCGCGCGCG){e<=1}", str(record.seq))
    if seq2:
      countSEQ_2 += 1
      SeqIO.write(record, SEQ2_outfile, "fastq-sanger")

SEQ1_outfile.close()
SEQ2_outfile.close()

print("Reading from",
print(countseqs, "records analysed.")
print(countSEQ_1, "Sequence 1s detected.")
print(countSEQ_2, "Sequence 2s detected.")

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.


Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.
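I won't reproduce readfq here, but to show the general shape of a stripped-down parser, here's a toy four-lines-per-record reader (my own sketch, not Heng Li's code, and it assumes records are never wrapped over multiple lines):

```python
def fastq_records(lines):
    """Yield (header, sequence, quality) tuples from an iterable of fastq lines.

    Toy sketch only: assumes strict four-line records with no line wrapping.
    """
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # the '+' description line, ignored
        qual = next(it)
        yield header.rstrip('\n'), seq.rstrip('\n'), qual.rstrip('\n')

# A couple of invented records, just to demonstrate:
demo = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
        "@read2\n", "GGCC\n", "+\n", "!!!!\n"]
for name, seq, qual in fastq_records(demo):
    print(name, seq, qual)
```

Note the quality strings come out untouched - no decoding into integer scores - which is exactly the sort of legwork you're skipping relative to SeqIO.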