Saturday, 26 July 2014

See an error in a database? Let someone know!

Anyone who does any T cell receptor analysis will know IMGT (the international ImMunoGeneTics information system), the repository for all things TCR and Ig. You either use it, or you're one of those annoying people who make me drag up all the tables of outdated nomenclatures.
Much like any resource, IMGT has its good points (simple and highly useful features like GENE-DB and LocusView in particular) and its bad (the less said about LIGM-DB the better).
However, again like any resource, it's only as good as the data stored in it. The data in it, as far as I can tell, is pretty damn good (and I use it a lot). I guess that's why they got to be in charge of all the data in the first place.
As such, when I recently found an error in a sequence*, I made sure to let them know: I certainly get a lot of mileage out of their data, so it's only fair that I pay them back (and pay it forward to others) by making sure the data that is there is good.
It's always a little nerve-racking, being a PhD student emailing senior doctors and professors to let them know of a mistake you've discovered, but as hoped the information was very warmly received, and I'm told that the error will be corrected.
Science has to be self-correcting to stop errors lingering and spreading; firing off a quick email to correct an annotation might not seem like much, but if it saves one person from the same spell of confusion that you went through unravelling the mistake, then you've done a net service to the world.
* For the people who found their way here suffering from this particular error, here's what I found. I was looking at the TCR leader regions (the mono-spliced section of the transcript between the start of translation and the beginning of the V region, which encodes the localisation signal peptide) when I noticed that one gene never seemed to produce functional transcripts. It turned out that while some of the entries for the human alpha gene TRAV29/DV5 were correct, if you downloaded the L-PART2 region alone, the sequence returned actually contained a section of the start of the V gene. So, instead of reading 'GGGTAAAC', it reads 'GGGTAAACAGTCAACAGAAGAATGAT'. I just checked and it still gives the old sequence, but I assume there's a lag time for databases to update.
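
If you want to check whether a sequence you've downloaded has the same problem, something like the following rough Python sketch would do it. It only compares the two strings quoted above; the function name `trailing_overrun` is my own invention for illustration, and it assumes the overrun (if any) sits at the 3' end of the expected sequence, as it did in this case.

```python
# The two sequences quoted in the footnote above.
EXPECTED_L_PART2 = "GGGTAAAC"
OBSERVED_L_PART2 = "GGGTAAACAGTCAACAGAAGAATGAT"


def trailing_overrun(expected: str, observed: str) -> str:
    """Return the bases in `observed` that run on past `expected`.

    Hypothetical helper: assumes any extra sequence is appended at the
    3' end, so `observed` must begin with `expected`.
    """
    expected = expected.upper()
    observed = observed.upper()
    if not observed.startswith(expected):
        raise ValueError("observed sequence does not begin with the expected one")
    # Everything after the expected prefix is the overrun (empty if none).
    return observed[len(expected):]


extra = trailing_overrun(EXPECTED_L_PART2, OBSERVED_L_PART2)
print(f"{len(extra)} extra bases: {extra}")
```

An empty result would mean the downloaded region matches what you expected; anything else is the stretch that has leaked in from the neighbouring region.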