Sunday, 24 November 2013

Fred Sanger, sequencing legend, passed away

As you're all no doubt already aware, Fred Sanger passed away last Tuesday.

The inventor of methods to sequence both proteins and nucleic acids (and netting himself Nobel Prizes for each), if anyone can be said to be the father of sequencing, it is him.

I'm not planning to write too much about his work here (not least because I've just written a piece on it for BioNews, which should go up tomorrow available here). I do however have to share a wonderful quote of his that I found nestled in an autobiographical piece he wrote after his second Nobel (that I think has instantly become a feature of my future presentations).

Although at the time it seemed to be a major change from proteins to nucleic acids, the concern with the basic problem of "sequencing" remained the same. And indeed this theme has been at the centre of all my research since 1943, both because of its intrinsic fascination and my conviction that a knowledge of sequences could contribute much to our understanding of living matter.

He's not wrong there.


Thursday, 24 October 2013

Biopython SeqIO tips

The Biopython libraries are of tremendous use in analysing biological data. I use them in the vast majority of my bespoke fastq analysis, as I'm sure many do.
However, there's a couple of tasks I regularly find myself wanting to do, but that I could not easily find solutions for. In case anyone finds themselves in similar need, here are the solutions I found, maybe save you a bit of time.
Output quality scores
This is the main one; it's always annoyed me that there doesn't seem to be an easy way to output the quality score in SeqIO. Sure, I know with the letter_annotations option I can output the actual score in numbers, but sometimes I want to output the actual ASCII characters (such as if I want to take a subsection of a fastq record, both of nucleotide and quality strings).
Here's how I get around it; turn the whole record into a string, split that on its new lines, and just take the fourth:
Generate new fastq records in situ
Sometimes it's necessary to generate complete fastq records on the fly, as opposed to reading them in from an existing file.
I've found a couple of ways of doing this. The first comes buried in amongst some unrelated Biopython features, and looks like this:
 from Bio.Alphabet import generic_dna  
 from Bio.Seq import Seq  
 from Bio.SeqRecord import SeqRecord  
 new_record = SeqRecord(Seq("AAAAAA", generic_dna), id="New Record", description="")  
 new_record.letter_annotations["phred_quality"] = [40,40,40,40,40,40]   
The second way came from an answer on Biostars, and works like this:
 from Bio import SeqIO  
 from StringIO import StringIO  
 fastq_string = "@%s\n%s\n+\n%s\n" % ("New Record", "AAAAAA", "IIIIII")  
 new_record =, "fastq-illumina")  
They both work, with a couple of differences.
According to a quick test, the latter is faster, and obviously requires loading fewer modules.
Which to use might also depend on what format you have your qualities in; if you only have integers, then the first might be more tempting, whereas if you're making your new fastq from pieces of other existing records then the second is probably the way to go.

Sunday, 20 October 2013

Installing python dependencies for Decombinator

I've written previously about Decombinator, the TCR repertoire analysis program developed in our laboratory.

Having just formatted my computer and reinstalled all my packages, I've been reminded of what a faff this can be.

This time around I did things a lot more efficiently than I had before. Seeing as how my colleague responsible for updating the current readme is busy writing up his thesis, here's my quick guide to installing all the python modules required by Decombinator in Linux - at least in Ubuntu.

 sudo apt-get install python-numpy python-biopython python-matplotlib python-levenshtein   
 sudo apt-get install python-pip  
 sudo pip install acora  

Most of the modules are available in the repositories, which has the added bonus of filling in all their required dependencies. Acora (the module that enacts the Aho-Corasick finite-state machine) however is not, but is easily installed from the command line using pip.

Easy. Time to get decombining.

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version (and installation details) can be found on the Innate2Adaptive lab's Github repo.

Thursday, 17 October 2013

Good pipetting rules of thumb

Accurate micro-pipetting is probably the single most important physical skill that a molecular or cell biologist need learn. It is the foundation upon which the vast majority of our experiments are built.

Pipettes are to us as brushes are to painters. They are the tools by which we achieve our aims; any mistake in pipetting at any stage stands to affect the outcome of our experiment.

Understandably, there are many resources available to help inform best pipetting practice, which I recommend budding biologists make use of. This is a nice one, and there are plenty of (cheesy) videos around too. People should also know a bit (if not a lot!) about how they work, how they can go wrong, and the different ways in which they can be used – there’s an unwieldy if thorough guide covering these here.

However, all this advice tends to focus on what’s best for the experiment – which is important – but overlooks another concern; your body.

Experiments can often involve highly repetitive movements, for extended periods of time. This isn’t that great on your mechanisms, particularly those of your thumb and wrist (as well I know!), and can cause repetitive strain injuries. There are a number of ways you can cut down on these risks, but here’s a quick rundown of the ones I think are worth bearing in mind:

         Don’t rush. Be fast, by all means – if you’re able – but hurrying usually means you do things the wrong way (both in terms of your data’s and your own safety)

         Set up your bench. My bench might look a mess when I’m working, but it’s organised so that I can access everything I need to without stretching or moving lots of stuff around. Which leads onto…

         Minimise. The injuries you’re at risk of are cumulative in nature, so you want to get away with using the minimum force and number of repetitions you can get away with. Maximise use of master mixes, reverse-pipette when needed, use multi-channel/step pipettes where possible, and for the love of all that is good, don’t bang your pipette tips on as hard as you can! (This is a bit of a bugbear of mine; the more you push it on just adds to the force required to get it off! It’s also damaging to the pipette in the long run. Gentle but firm pressure with a slight twist should be more than enough to form a strong enough seal for aspiration)

         Mix it up. If you’re able, swap hands every now and then to split the load. Admittedly, most people’s weaker-hand is probably less reliable than their stronger-hand, but accuracy’s sometimes less important – for instance, a lot of your larger-volume wash steps can probably afford to take the hit

         Take breaks. The most effective, but least practical option (which I am guilty of ignoring most of the time!). Try to break up your experiments, either with pauses or activities that don’t put so much pressure on your wrists and hands (so you can still be working, fear not PIs!).

The best part is that most of these tips help improve both the accuracy and speed of your work too.

Remember, you only get one pair of hands (if you’re lucky), so don’t put them at risk for your work, no matter how important you think it is at the time. Speaking from experience, it can also impact directly on your research, as my current bout of tendonitis tenosynovitis (damn you qPCR!) put me out of commission lab-wise for a fortnight*. It’s in everyone’s best interest for you to take the time and effort to pipette safely, and that includes your data.

OK now I’m anthropomorphising results. Time to end blog post.

* I know it seems rich me writing this article after admitting I have the injuries I claim knowledge of preventing, but it's a bit of a 'do as I do and not as I say' situation. I happen to have always had dodgy wrists, no doubt from a childhood spent on computer games, an adolescence spent typing, and an adulthood spent pipetting. I think the addition of thumb-swipey smart phone ownership was the last straw.

Monday, 30 September 2013

5 minute bash tutorial - how to win the lottery!

Apologies for the slightly misleading title. To expand...

In my office, a number of us are involved in analysis of large datasets, often sequence or text files. Each of us has our own little habits and idiosyncrasies; given the same task, we each find different ways to go about it.

A running joke with one of my friends is that I'll persevere with getting a long, bafflingly piped bash command long after she'll have resorted to scripting something in a higher level language.

(In all fairness we both do both, but I do find something deeply satisfying about chaining everything together in one line, no matter how human-unreadable it may be.)

Last Friday, this topic came up during a conversation about the upcoming Euromillions lottery. As a bit of a joke, I whipped up a quick line of bash to generate compatible random numbers (NB - I actually wrote it wrong in the tweet!).

Today I realised that if I explained the steps in the process, it might help someone new to bash understand how best to get some more functionality of out the command line.

So, here it is (with the error removed):
 for i in {1..50}; do echo $i >> win1; done; shuf win1 | head -5; for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2; rm win*  

This should produce output that fits the Euromillions format; five numbers between one and fifty, with two 'star' numbers between one and eleven, e.g.:

Let's break this down.

 for i in {1..50};  

This sets up a for loop, which will iterate through every integer in the range 1 to 50. For loops in bash take the format:


Here, the VARIABLE is the thing that changes each iteration, and THING is the range or list of items. SOMETHING and SOMETHING_ELSE are then commands you want to run within the loop, which have to be sandwiched between 'do' and 'done'. More bash loop examples/explanation here.

However, all those carriage-returns take up far too much time (and make it far too comprehensible) - I tend to substitute those for semi-colons, which allow you to concatenate multiple commands into one-liners.

So, our one-liner first loops from 1 to 50, then what?

 do echo $i >> win1; done;  

Here we have the contents of that first loop, sandwiched between 'do' and 'done' as per requirements, so that bash knows what to do at each turn of that loop.

Echo is the unix equivalent of 'print'; it just outputs something you tell it to the screen (or somewhere else).

In this case, we echo $i, where i denotes the variable we set up in the loop; we just need the dollar sign to let bash know we're asking for the contents of a variable. So, in the first cycle of the loop $i = 1, then in the next $i = 2, all the way up to $i = 50, which is when the loop stops.

Now if we just had 'do echo $i' in our loop, we'd just get the numbers 1 to 50 printed out in the terminal. However, we've used '>>' to redirect the numbers away from the terminal, instead putting them in a new file called 'win1'.

By using '>>' instead of '>', each cycle of the loop appends the new number to the end of the file win1 - if we used just one > each iteration of the loop would overwrite the whole file, leaving win1 at the end containing just the number 50.

So now we have the file win1 containing the numbers 1-50, each on its own line.

 shuf win1 | head -5;  

This is the bit of the code that picks the numbers.

Shuf simply shuffles all of the lines in a file, into a random(ish) order. It's an incredible useful command, which I kicked myself when I found out about it, for having taken so long. In this instance, it mixes up all the lines in win1, so they will no longer be in numerical order.

Here's where we first encounter pipes - that long character between the shuf and the next command is said to 'pipe' (or carry) the results of the first command into the second. 

We take the randomly shuffled win1 file, and pipe that into head (one of the commands you'll probably end up using most!) to take off the top five numbers, giving you your first five lottery numbers.

The next bit just repeats the same methodology for the lucky star balls (two numbers between one and eleven), outputting them after the previous five:

 for i in {1..11}; do echo $i >> win2; done; shuf win2 | head -2;

Finally, a bit of tidying up:

 rm win*  

Rm removes files - here I've used the '*' wildcard character to make it remove all files which begin 'win', thus removing the win1 and win2 files created earlier. This is also vital if you're going to run the script more than once, as otherwise you might end up drawing the same number twice (as re-running the loop would add yet more numbers to the files).

Note that rm can be dangerous - by default, it won't double check if you're sure you want to delete something, and there's basically no going back, so be careful! For instance, by not explicitly stating the files I wanted to remove, this line would also remove any other files which fit the pattern - e.g. if I had a file called 'winning_lottery_numbers.txt' in the same directory, it would have been deleted too!

This can be overcome by being explicit in the files you're removing, like so:

 rm win1 win2   

However if you have more than a few files, you're probably going to want to use the wildcards sensibly.

There you have it - a silly bit of code, but hopefully a useful one to a bash novice!

In case you're wondering, I did a small bit of rigourous, carefully controlled science with my script - I bought a ticket using numbers generated with either my code, or with the lottery's own 'lucky dip'. My code won nothing, but I got £3 from the lucky dip! Pretty conclusive we can all agree.

Tuesday, 10 September 2013

Python fastq analysis for beginners

It's been a little while since my last post, as I did that rare and wondrous thing and took a holiday. Now I'm back, and galvanised into action by a particularly nice reference to my blog, I though I'd continue my vague hopscotch bioinformatics for beginners series, Here's a pretty basic one I've been meaning to put up for a while: fastq analysis.

Fastq is of course the format for storage next generation DNA sequencing data. Each sequence gets four lines in the file: one for sequence identifier, nucleotide sequence, a description line (often just a '+' character), and finally the per-base quality scores (thus lines two and four must be equal length).

If you're on this page, chances are good you're a wet-labber who's just starting to get into working with sequencing data. You can do basic manipulation using a multitude of different programs (I particularly like the FASTX toolkit for a nice spread of basic but useful tricks), but at some point, you're probably going to have to get your hands dirty and code something.

Now I know this isn't best practise, but often when I want to code something I can't immediately think of a solution for, I just get combing tutorials, manuals and forums finding fragments of script that do similar things to what I want, and weaving them together.

Often, the tricky place is knowing where to start. With a view to helping out people like me, here's the basic script I use for the majority of the fastq analysis that I write in Python.

It makes use of the SeqIO interface in Biopython. Basically, all this script does is scroll through a fastq file one record at a time, allowing you to interrogate or manipulate your individual sequences as needed.

from Bio import SeqIO # module needed for sequence input

countseqs = 0

fastqfile = open("YOUR_FILENAME_HERE.fastq", "rU")  
for record in SeqIO.parse(fastqfile,"fastq"):  
  countseqs += 1  
print "Reading from", fastqfile  
print countseqs, "records analysed."  

I usually make use of a little variant on this that allows command line input of the fastq file name; this means I can apply the same script to a large number of files without having to edit the filename in the script each time (and instead can make use of the tab-complete function of my shell):

from Bio import SeqIO # module needed for sequence input   
import sys   # module needed for command line argument to get fastq name   
if (len(sys.argv) > 1):   
  fastqfile = open(sys.argv[1], "rU")   
elif (len(sys.argv) == 1):   
  print "Please supply a fastq filename (i.e. python file.fastq)"   
for record in SeqIO.parse(fastqfile,"fastq"):   
  countseqs += 1   
print "Reading from", fastqfile   
print countseqs, "records analysed."   
NB - this script is easily converted into a fasta file, just by doing a find and replace of 'fastq' for 'fasta' - the parser handles it just fine.

This is of course the easy bit - filling in the "whatever you want to happen" bit is always going to be tricky.

To give you an example, here are the bones of a script I wrote based on this template, that I alluded to in a previous post.

from Bio import SeqIO # module needed for sequence input  
import sys      # module needed for command line argument to get fastq name  
import regex        # module for regex  
if (len(sys.argv) > 1):  
  fastqfile1 = open(sys.argv[1], "rU")  
elif (len(sys.argv) == 1):  
  print "Please supply a fastq filename (i.e. python file.fastq)"  
SEQ1_outfilename = ("SEQ_1_" + sys.argv[1][:15] + ".fastq")  
SEQ1_outfile = open(SEQ1_outfilename, "w")  
SEQ2_outfilename = ("SEQ_2_" + sys.argv[1][:15] + ".fastq")  
SEQ2_outfile = open(SEQ2_outfilename, "w")  
countseqs = 0  
countSEQ_1 = 0  
countSEQ_2 = 0  
for record in SeqIO.parse(fastqfile1,"fastq"): # scrolls through fastq file, using SeqIO package  
  countseqs += 1 # counts number of sequences measured  
  seq1 ='(ATATATATAT){e<=1}', str(record.seq))  
  if seq1:  
    countSEQ_1 +=1  
    output_handle = open(str(SEQ1_outfilename), "a")  
    SeqIO.write(record, output_handle, "fastq-sanger")  
    seq2 ='(CGCGCGCGCG){e<=1}', str(record.seq))  
    if seq2:   
      countSEQ_2 +=1  
      output_handle = open(str(SEQ2_outfilename), "a")  
      SeqIO.write(record, output_handle, "fastq-sanger")  
print "Reading from", fastqfile1  
print countseqs, "records analysed."  
print countSEQ_1, "Sequence 1s detected."  
print countSEQ_2, "Sequence 2s sequences detected."  

This uses the framework of the scripts above, looking for two sequences (in this case two primers, the presence of which would indicate which amplicons the sequences belonged to); depending on which subsequence it finds (if any), it bins the record in one of two output files.

To give a bit of leeway for sequencing error, I've made use of 'regex', an alternative regular expression module (as opposed to the standard/older 're'), which has an option for fuzzy string matching - i.e. you can allow mismatches or indels.

Taking one of the pretend sequences I've used here (ATATATATAT), this means that any of the following sequences would also be categorised as a hit using this search: ATATATGTAT, ATATATATT, ATATAATATAT.

There we have it; a crude, yet effective script to take a fastq file, and output two new fastqs containing (mutually exclusive) reads based on the presence of a given subsequence.

Note that this probably isn't the most efficient solution to this problem - even for something written in Python - but it was knocked together quickly, building up on simple principles from a basic fastq analysis template script, and most importantly, it got the job done. I only needed to run this code three or four times, and was able to do the analysis I needed, which is, let's face it, the most important thing.


Update: it's been a few years since I wrote this article, and it's worth pointing out that I've found a much nicer way to parse fasta/fastq files: the readfq function from Heng Li. It's available in a variety of languages, is simple, customisable, pared down, and crucially it's fast. Particularly in applications where you don't need all the extra legwork that SeqIO is doing - like converting your quality scores into their integer equivalents.

Thursday, 15 August 2013

5 minute PyMOL tutorial

I'm very lucky to have a pretty great PhD project, deep-sequencing T cell receptor repertoires. It's an interesting, exciting field, that I have only one major complaint about - I don't get to make any pretty pictures.

It sounds like a minor complaint, but it makes a difference when it comes to presentation time. My main output is sequence data; at best I produce some nice graphs, which rather lack the the impact of say, some beautiful confocal microscopy.

In order to get nice pictures to open a presentation on, I like to just fire up some molecular visualisation software and get some nice screenshots of whatever molecules are relevant to the talk. In case others wanted to do similarly - make nice pictures, without needing to become an expert in molecular visualisation or modelling - but didn't know how, here's a mini-tutorial as to how I do it.

First off, you need some visualisation software. If you're a student or teacher, I think you can get the excellent PyMOL for free, which I'll use in this blog, although there are others. I personally cut my visualisation teeth on a combination of Jmol, and (the now decidedly retro) RasMol - what you learn in one is broadly applicable to the others.

Let's get a structure to work on - head on over to the Protein Data Bank and look up something you're interested in. You can search by molecule name, ID (if you have it, you can find these in structural papers) or author name.

We need to download the PDB file for a molecule of interest, which is basically just a text file containing a set of x, y, z coordinates for all the atoms in that molecule (as described by solving the crystal structure, for example).

I've downloaded the PDB for 1MI5, a crystal structure for the LC13 TCR bound to HLAB8 complexed with a peptide from the EBNA 3A molecule of Epstein Barr Virus.

Fire up PyMOL - there's a two windows, a viewer and an interface - and open your PDB (note that if you already know the ID of your molecule, you can use this to open the file without downloading the PDB with the fetch command, e.g. 'fetch 1zgl').

At first it'll probably look like a big mess.

Orient yourself. It's pretty easy to move about with mouse; it's just hard to remember which button does what. Handily, they designers provided a nice little reminder in the bottom ride hand side.

Basically, left mouse button rotates, middle button translocates, and right mouse button zooms. Have a play, get used to it, it probably doesn't matter much anyway - if you're anything like me you'll forget it every time and have to rediscover it every time.

Now let's make it look pretty.

By default, all molecules are selected. We can use the five buttons in the top right to perform various tasks on whatever's currently selected.

The left column represents the different selections. The buttons on the right are the tasks that can be applied to those selections. Left to right: Action; Show; Hide; Label; Color
Unless I'm looking at specific interactions, I like cartoon display, as I think this makes it easiest to see generally what goes where in a structure, particularly if there's more than one polypeptide present. Click 'S' for show, then choose cartoon.

It looks somewhat fuzzy now, as it's showing the cartoon display overlain on everything else that was there. To get rid of all the fluff, go the 'H' (hide) menu, and hide the lines (default layout) and the waters (sporadic red dots).

Much better.

Now we want to be able to tell our different molecules apart. This is where the PDB web entry for our molecule really comes in handy, as it tells us what letter is used to refer to each separate molecule as. This allows us to make different selections, which we can then use to apply actions to particular molecules or subsets of molecules.

So, looking at our entry, I can enter the following code into the command line area of the interface window to generate our selections.

 select MHC, (chain a)  
 select B2m, (chain b)  
 select peptide, (chain c)  
 select alpha, (chain d)  
 select beta, (chain e)  

TCR-people, note that also for the three or four other TCR-pMHC complexes I've tried this on, they've all used the same letters for each chain, so this might be the convention.
You can then use the 'C' color button on each selection to colour them separately. I've arranged the molecule so you see down the groove of the MHC molecule, as this shows the interaction site of the whole complex well.

However, I always think it's a bit hard to see the peptide in amongst all the other molecules, so I've gone back and changed it to show as spheres.

After that, we just need to save a high-resolution image, which we can get through ray tracing:

 ray 2400, 2400  

Just save the image after ray tracing. I then cropped this down using Gimp.
All done!

When using crystal structures that you didn't produce, remember to give proper attribution; either cite the original paper, or provide the PDB ID.

For a more in-depth understanding of PyMOL, why not read the manual or check out the PyMOL wiki.

Saturday, 10 August 2013

Decombining the Recombined: High-Throughput TCR Repertoire Analysis

August 2016 update:
Decombinator has gone through a number of improvements since writing this post. The current version can be found on the Innate2Adaptive lab's Github repo.

The background

The background      

As I've mentioned in the past, my PhD project is involved with using high-throughput sequencing (HTS) to investigate T-cell receptor (TCR) repertoires.
Next-generation sequencing (NGS) technologies allow us to investigate the workings and dynamics of the adaptive immune system with greater breadth and resolution than ever before. It’s possible to extract or amplify the coding sequence for thousands to millions of different clones, and throw them all into one run of a sequencer. The trick is extracting the TCR information from the torrent of data that pours out.
Our lab’s paper for the high-throughput analysis of such repertoire data has been out for a while now, so here’s my attempt to put in to a little wider context than a paper allows, while hopefully translating it for the less computationally inclined.
In order to understand best how the brains in our lab tackled this problem, it's probably worth looking at how others before us have.

The history

There's a beaten path to follow. The germline V, D and J gene sequences are easily downloaded from IMGT GENE-DB; then you take your pick of a short read aligner to match up the V and J sequences used (we’ll come to the Ds later).
Most of the popular alignment algorithms get covered by some group or other: SWA (Ndifon et al., 2012; Wang et al., 2010), BLAT (Klarenbeek et al., 2010) and BLAST (Mamedov et al., 2011) all feature. IMGT’s HighV-QUEST software uses DNAPLOT, a bespoke aligner written for their previous low-throughput version (Alamyar et al, 2012; Giudicelli et al, 2004; Lefranc et al, 1999).
Sadly, some of the big hitters in the field don’t see fit to mention what they use to look for TCR genes (or I’ve just missed it). Robert Holt’s group produced the first NGS TCR sequencing paper I’m aware of (Freeman, Warren, Webb, Nelson, & Holt, 2009), but don’t mention how they assign their genes (admittedly they’re more focused on explaining how iSSAKE, their short-read assembler designed for TCR data, works).
The most prolific author in the TCR ‘RepSeq’ field is Harlen Robins, who has contributed to a wealth of wonderful repertoire papers (Emerson et al., 2013; Robins et al., 2009, 2010, 2012; Sherwood et al., 2013; Srivastava & Robins, 2012; Wu et al., 2012), yet all remain equally vague on TCR assignation methods (probably related to the fact that he and several other early colleagues set up a company, Adaptive Biotech, that offers TCR repertoire sequencing and analysis).
So we see a strong tendency towards alignment (and a disappointing prevalence of ‘in house scripts’). I can understand why: you've sequenced your TCRs, you've got folder of fastqs burning a hole in your hard drive, you’re itching to get analysing. What else does your average common house or garden biologist do when confronted with sequences, but align?
However, this doesn't really exploit the nature of the problem.

The problem

When trying to fill out a genome, alignment and assembly make sense; you take a read, and you either see where it matches to a reference, or see where it overlaps with the other reads to make a contig.
For TCR analysis however, we're not trying to map reads to or make a genome; arguably we’re dealing with some of the few sequences that might not be covered under the regular remit of 'the genome'. Nor are we trying to identify known amplicons from a pool, where they should all the same, give or take the odd SNP.
TCR amplicons instead should have one of a limited number of sequences at each end (corresponding to the V and J/C gene present, depending on your TCR amplification/capture strategy), with a potentially (and indeed probably) completely novel sequence in between.
In this scenario, pairwise alignment isn’t necessarily the best option. Why waste time trying to match every bit of each read to a reference (in this case, a list of germ-line V(D)J gene segments), when only the bits you’re least interested in – the germline sequences – stand a hope of aligning anywhere?

The solution?

Enter our approach: Decombinator (Thomas, Heather, Ndifon, Shawe-Taylor, & Chain, 2013).
Decombinator rapidly scans through sequence files looking for rearranged TCRs, classifying any it finds by these criteria; which V and J genes were used, how many nucleotides have been deleted from each, and the string of nucleotides between the end of the V and the start of the J. This simple five-part index condenses all of the information contained in any given detected rearrangement into a compact format, convenient for downstream processing.
All TCR analysis probably does something similar; the clever thing about Decombinator is how it gets there. At its core lies a finite state automaton that passes through each read looking for a panel short sequence ‘tags’, each one uniquely identifying the presence of a germline V or J gene. It if finds tags for both a V and a J it can populate the five fields of the identifier, thus calling the re-arrangement.
Example of the comma-delimited output from Decombinator. From left to right: (optional sixth-field) clonal frequency; V gene used; J gene used; number of V deletions; number of J deletions, and insert string
The algorithm used was developed by Aho and Corasick in the ‘70s, for bibliographic text searching, and the computer science people tell me that it’s really never been topped – when it comes to searching for multiple strings at once, the Aho-Corasick (AC) algorithm is the one (Aho & Corasick, 1975).
Its strength lies in its speed – it’s simply the most effective way to search one target string for multiple substrings. By using the AC algorithm Decombinator runs orders of magnitude faster than alignment based techniques. It does this by generating a special trie of the substrings, which it uses to search the target string exceedingly efficiently.
Essentially, the algorithm uses the trie to look for every substring or tag simultaneously, in just one pass through the sequence to be searched. It passes through the string, using the character at the current position to decide how to navigate; by knowing where it is on the trie at any one given point, it’s able to use the longest matched suffix to find the longest available matched prefix.

Figshare seems to have scotched up the resolution somewhat, but you get the idea.

Decombinator uses a slight modification of the AC approach, in order to cope with mismatches between the tag sequences and the germline gene sequenced (perhaps from SNPs, PCR/sequencing error, or use of non-prototypic alleles). If no complete tags are found, the code breaks each of the tags into two halves, making new half-tags which are then used to make new tries and search the sequence.
If a half-tag is found, Decombinator then compares the sequence in the read to all of the whole-tags that contain that half-tag*; if there’s only a one nucleotide mismatch (a Hamming distance of one) then that germline region is assigned to the read. In simulated data, we find this to work pretty well, correctly assigning ~99% of artificial reads with a 0.1% error (i.e. one random mismatch every 1,000 nucleotides, on average), dropping to ~88% for 1% error **.
It’s simple, fast, effective and open source; if you do have high-throughput human or mouse alpha-beta or gamma-delta TCR data set to analyse, it’s probably worth giving it a try. The only real available other option is HighV-QUEST, which (due to submission and speed constraints) might not be too appealing an option if you really have a serious amount of data.
(CORRECTION – in the course of writing this entry (which typically has gone on about five times longer than I intended, in both time taken and words written), some new rather exciting looking software has just been published (Bolotin et al., 2013). MiTCR makes a number of bold claims which if true, would make it a very tempting bit of software. However, given that I haven’t been able to get it working, I think I’ll save any discussion of this for a later blog entry.)

The (odds and) end(s)

D segments
If you’ve made it this far through the post, chances are good you’re either thinking about doing some TCR sequencing or have done already, in which case you’re probably wondering, ‘but what about the Ds – when do they get assigned’?
In the beta locus (of both mice and men), there are but two D regions, which are very short, and very similar. As they can get exonucleolytically nibbled from both ends, the vast majority of the time you simply can’t tell which has been used (one early paper managed in around a quarter of reads (Freeman et al., 2009)). Maybe you could do some kind of probabilistic inference for the rest, based on the fact that there is a correlation between which Ds pair with which Js, likely due to the chromatin conformation at this relative small part of the locus (Murugan, Mora, Walczak, & Callan, 2012; Ndifon et al., 2012), but that’s a lot of work for very little reward.
Hence Decombinator does not assign TRBDs; they just get included in the fifth, string based component of the five-part identifier (which explains the longer inserts you see for beta compared to alpha). If you want to go TRBD mining you’re very welcome, just take the relevant column of the output and get aligning. However, for our purposes (and I suspect many others’), knowing the exact TRBD isn’t that important, where indeed even possible at all.
There’s also the question of how to deal with errors, which can accrue during amplification or sequencing of samples. While Decombinator does mitigate error somewhat through use of the half-tags and omission of reads containing ambiguous N calls, it doesn’t have any other specific error-filtration. As with any pipeline, garbage in begets garbage out; there’s plenty of software to trim or filter HTS data, so we don’t really need to reinvent the wheel and put some in here.
Similarly, Decombinator doesn’t currently offer sequence clustering, whereby very similar sequences get amalgamated into one, as some published pipelines do. Personally, I have reservations about applying standard clustering techniques to variable immunoreceptor sequence data.
Sequences can legitimately differ by only one nucleotide, and production of very similar clonotypes is inbuilt into the recombination machinery (Greenaway et al., 2013; Venturi et al., 2011); it is very easy to imagine bona fide but low frequency clones being absorbed into their more prevalent cousins, which could obscure some genuine biology. The counter argument is of course that by not clustering, one allows through a greater proportion of errors, thus artificially inflating diversity. Again, if desired other tools for sequence clustering exist.
My contribution to Decombinator was relatively minor - the real brainwork was done before I'd even joined the lab, by my labmate, mathematician Niclas Thomas, our shared immunologist supervisor Benny Chain, and Nic's mathematical supervisor, John Shawe-Taylor. They came up with the idea and implemented the first few versions. I came in later, designing tags for the human alpha chain, testing and debugging, and bringing another biologist’s view to the table for feature development***. The other author, Wilfred Ndifon, at the time was a postdoc in a group with close collaborative ties, who I believe gave additional advice on development, and provided pair-wise alignment scripts against which to test Decombinator.

* Due to the short length of the half-tags and the high levels of homology between germline regions, not all half tags are unique
** In our Illumina data, by comparing the V sequence upstream of the rearrangement – which should be invariant – to the reference, we typically get error rates below 0.5%, some of which could be explained by allelic variation or SNPs relative to the reference
*** At first we had a system that Nic would buy a drink for every bug found. For a long time I was the only person really using Decombinator (and probably remain the person who’s used it most), often tweaking it for whatever I happened to need to make it do that day, putting me in prime place to be Bug-Finder Extraordinaire. I think I let him off from the offer after the first dozen or so bugs found.

The papers

Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), 333–340. doi:10.1145/360825.360855
Alamyar, A., Giudicelli, V., Li, S., Duroux, P., & Lefranc, M. (2012). IMGT/HIGHV-QUEST: THE IMGT® WEB PORTAL FOR IMMUNOGLOBULIN (IG) OR ANTIBODY AND T CELL RECEPTOR (TR) ANALYSIS FROM NGS HIGH THROUGHPUT AND DEEP SEQUENCING. Immunome Research, 8(1), 2. doi:10.1038/nmat3328
Bolotin, D. a, Shugay, M., Mamedov, I. Z., Putintseva, E. V, Turchaninova, M. a, Zvyagin, I. V, Britanova, O. V, et al. (2013). MiTCR: software for T-cell receptor sequencing data analysis. Nature methods, (8). doi:10.1038/nmeth.2555
Emerson, R., Sherwood, A., Desmarais, C., Malhotra, S., Phippard, D., & Robins, H. (2013). Estimating the ratio of CD4+ to CD8+ T cells using high-throughput sequence data. Journal of immunological methods. doi:10.1016/j.jim.2013.02.002
Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H., & Holt, R. a. (2009). Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome research, 19(10), 1817–24. doi:10.1101/gr.092924.109
Giudicelli, V., Chaume, D., & Lefranc, M.-P. (2004). IMGT/V-QUEST, an integrated software program for immunoglobulin and T cell receptor V-J and V-D-J rearrangement analysis. Nucleic acids research, 32(Web Server issue), W435–40. doi:10.1093/nar/gkh412
Greenaway, H. Y., Ng, B., Price, D. a, Douek, D. C., Davenport, M. P., & Venturi, V. (2013). NKT and MAIT invariant TCR╬▒ sequences can be produced efficiently by VJ gene recombination. Immunobiology, 218(2), 213–24. doi:10.1016/j.imbio.2012.04.003
Klarenbeek, P. L., Tak, P. P., Van Schaik, B. D. C., Zwinderman, A. H., Jakobs, M. E., Zhang, Z., Van Kampen, A. H. C., et al. (2010). Human T-cell memory consists mainly of unexpanded clones. Immunology letters, 133(1), 42–8. doi:10.1016/j.imlet.2010.06.011
Lefranc, M.-P. (1999). IMGT, the international ImMunoGeneTics database. Nucleic acids research, 27(1), 209–212. Retrieved from
Mamedov, I. Z., Britanova, O. V, Bolotin, D., Chkalina, A. V, Staroverov, D. B., Zvyagin, I. V, Kotlobay, A. A., et al. (2011). Quantitative tracking of T cell clones after hematopoietic stem cell transplantation. EMBO molecular medicine, 1–8.
Murugan, a., Mora, T., Walczak, a. M., & Callan, C. G. (2012). Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1212755109
Ndifon, W., Gal, H., Shifrut, E., Aharoni, R., Yissachar, N., Waysbort, N., Reich-Zeliger, S., et al. (2012). Chromatin conformation governs T-cell receptor J gene segment usage. Proceedings of the National Academy of Sciences, 1–6. doi:10.1073/pnas.1203916109
Robins, H. S., Campregher, P. V, Srivastava, S. K., Wacher, A., Turtle, C. J., Kahsai, O., Riddell, S. R., et al. (2009). Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood, 114(19), 4099–107. doi:10.1182/blood-2009-04-217604
Robins, H. S., Desmarais, C., Matthis, J., Livingston, R., Andriesen, J., Reijonen, H., Carlson, C., et al. (2012). Ultra-sensitive detection of rare T cell clones. Journal of Immunological Methods, 375(1-2), 14–19. doi:10.1016/j.jim.2011.09.001
Robins, H. S., Srivastava, S. K., Campregher, P. V, Turtle, C. J., Andriesen, J., Riddell, S. R., Carlson, C. S., et al. (2010). Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine, 2(47), 47ra64. doi:10.1126/scitranslmed.3001442
Sherwood, A. M., Emerson, R. O., Scherer, D., Habermann, N., Buck, K., Staffa, J., Desmarais, C., et al. (2013). Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer immunology, immunotherapy: CII. doi:10.1007/s00262-013-1446-2
Srivastava, S. K., & Robins, H. S. (2012). Palindromic nucleotide analysis in human T cell receptor rearrangements. PloS one, 7(12), e52250. doi:10.1371/journal.pone.0052250
Thomas, N., Heather, J., Ndifon, W., Shawe-Taylor, J., & Chain, B. (2013). Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics, 29(5), 542–50. doi:10.1093/bioinformatics/btt004
Venturi, V., Quigley, M. F., Greenaway, H. Y., Ng, P. C., Ende, Z. S., McIntosh, T., Asher, T. E., et al. (2011). A Mechanism for TCR Sharing between T Cell Subsets and Individuals Revealed by Pyrosequencing. Journal of immunology. doi:10.4049/jimmunol.1003898
Wang, C., Sanders, C. M., Yang, Q., Schroeder, H. W., Wang, E., Babrzadeh, F., Gharizadeh, B., et al. (2010). High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proceedings of the National Academy of Sciences of the United States of America, 107(4), 1518–23. doi:10.1073/pnas.0913939107
Wu, D., Sherwood, A., Fromm, J. R., Winter, S. S., Dunsmore, K. P., Loh, M. L., Greisman, H. a., et al. (2012). High-Throughput Sequencing Detects Minimal Residual Disease in Acute T Lymphoblastic Leukemia. Science Translational Medicine, 4(134), 134ra63–134ra63. doi:10.1126/scitranslmed.3003656