Wednesday 10 August 2016

Extracting sample names from Illumina FASTQ filenames

Another post in the vein of "minor analysis task and my solution": how to extract the sample name or identifier (i.e. what you called a given file in the samplesheet) from a list of Illumina FASTQ file names.

So you might have something that looks like this (which I've cobbled together from a couple of runs from different machines that I pulled off Basespace):
 
322_S18_R2_001.fastq.gz 
1c8_P6_act_S11_R2_001.fastq.gz
5c4_P8_res_S1_R1_001.fastq.gz


In this exercise, I wanted the first part of the IDs, which are what we labelled the actual samples with. In an ideal world this should be very easy: we could just use sed or cut to delete everything after the first instance of a character used in the text that Illumina appends to the sample name. However different people (or the same people at different times) use different naming conventions, which can often include underscores, the character that separates the string fields that get added.

Therefore what we need to do is not delete from say the first pattern, but from the nth pattern from the end, which turned out to be a slightly trickier problem (at least to one such as myself that doesn't find sed to be the most intuitive program in the world). Here's what I came up with:
sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt

What this does is find the third underscore to last (just before S[XX]_R[X].fastq.gz string) and replace it with a pipe character ('|'), which should never appear in a filename, before finding this pipe character and deleting everything after it. This produces the following output:
 
322
1c8_P6_act
5c4_P8_res

This feels a bit hacky to me, as really I feel like I should be able to do this in one substitution, but the first part of the command actually selects the latter half (so if anyone knows how to tidy this up I'd love to hear it!).

It's also worth pointing out that the output of different machines or demultiplexing settings will differ in the number of underscores, and so the command will need to change slightly. For instance, the above example was run on NextSeq data demultiplexed using bcl2fastq --no-lane-splitting. For an example of a file containing lane information ('L001' etc), here's the equivalent on some MiSeq filenames I pulled up.
# This 
324bAB_S8_L001_R1_001.fastq.gz 
076-PBMCs-alpha_S5_L001_R1_001.fastq.gz 

# After this
sed 's/\(.*\)_\(.*_\)\(.*\)_\(.*\)_/\1|\2/;s/|.*//' FilenameList.txt   

# Becomes this 
324bAB
076-PBMCs-alpha
The reason I was actually doing this was because I had a number of different NextSeq runs, some of which contained samples sequenced from the same original libraries (to increase the depth), so I needed to get the names of all samples sequenced in each run. I had each run in separate directories, and then ran this in the folder above:
for i in */
do echo $i
ls $i | grep R1 | grep fastq.gz | sed 's/\(.*\)_\(.*_\)\(.*\)_/\1|\
2/;s/|.*//' > ${i%?}.fl
done
This goes through all the directories, finds the R1 FASTQ files (as there are other files in there, and this gives me only one entry per sample) and strips their sample names out as above, and writes these to a textfile with the suffix '.fl'.

No comments:

Post a Comment