Thursday 24 October 2013

Biopython SeqIO tips

The Biopython libraries are of tremendous use in analysing biological data. I use them in the vast majority of my bespoke fastq analysis, as I'm sure many do.
However, there's a couple of tasks I regularly find myself wanting to do, but that I could not easily find solutions for. In case anyone finds themselves in similar need, here are the solutions I found, maybe save you a bit of time.
Output quality scores
This is the main one; it's always annoyed me that there doesn't seem to be an easy way to output the quality score in SeqIO. Sure, I know with the letter_annotations option I can output the actual score in numbers, but sometimes I want to output the actual ASCII characters (such as if I want to take a subsection of a fastq record, both of nucleotide and quality strings).
Here's how I get around it; turn the whole record into a string, split that on its new lines, and just take the fourth:
 str(record.format("fastq")).split("\n")[3]  
Generate new fastq records in situ
Sometimes it's necessary to generate complete fastq records on the fly, as opposed to reading them in from an existing file.
I've found a couple of ways of doing this. The first comes buried in amongst some unrelated Biopython features, and looks like this:
 from Bio.Alphabet import generic_dna  
 from Bio.Seq import Seq  
 from Bio.SeqRecord import SeqRecord  
 new_record = SeqRecord(Seq("AAAAAA", generic_dna), id="New Record", description="")  
 new_record.letter_annotations["phred_quality"] = [40,40,40,40,40,40]   
The second way came from an answer on Biostars, and works like this:
 from Bio import SeqIO  
 from StringIO import StringIO  
 fastq_string = "@%s\n%s\n+\n%s\n" % ("New Record", "AAAAAA", "IIIIII")  
 new_record = SeqIO.read(StringIO(fastq_string), "fastq-illumina")  
They both work, with a couple of differences.
According to a quick test, the latter is faster, and obviously requires loading fewer modules.
Which to use might also depend on what format you have your qualities in; if you only have integers, then the first might be more tempting, whereas if you're making your new fastq from pieces of other existing records then the second is probably the way to go.

No comments:

Post a Comment