Long Reads Blog

2014-10-30   Say you just sorted a whole bunch of bam files and now want to create an index for each.

The obvious (but wrong) solution would be samtools index *.sorted.bam. Wrong, because samtools index only takes a single input bam file path.

The next obvious (correct) solution might be:

for i in *.sorted.bam
do
	samtools index "$i"
done

But who wants to type all that? And anyway, this is a post about running things in parallel, which that loop is not.

Enter the solution using GNU parallel: ls *.sorted.bam | parallel -j4 samtools index {}. Simple: it keeps 4 jobs running at a time until all the files are indexed.

Installing parallel is as easy as brew install parallel (on OS X with homebrew installed), or sudo apt-get install parallel (on Ubuntu et al; may need an extra command depending on your version).

A quick intro to parallel:

  • inputs can be passed to parallel by piping them in, as above
  • alternately, you can specify them using this format: parallel echo {} ::: A B C, which echoes A, B, and C (each as a separate job; output appears in order of job completion, so use -k if you need input order preserved)
  • The {} will be replaced by the inputs as given. You can also lop off the file name extension using {.}, for example: ls *.bam | parallel echo samtools sort {} {.}.sorted, which issues commands of the variety samtools sort mysample.bam mysample.sorted.
  • the --dryrun option will echo the commands back to you without running them. I find it easier to just type the word “echo” before my command
  • by default the output from each command will only be printed to stdout after the command has completed; to intersperse output from all commands in real time, use the -u (“ungroup”) argument to parallel
  • by default, the number of processes is the number of cores available on the computer. Use -j n to specify n jobs in parallel
  • to run more complex commands (for example, ones containing pipes or redirects), you may need to wrap the command in quotation marks so your shell passes it to parallel intact; see the sketch after this list
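
As a concrete example of that last point (with the hypothetical goal of counting mapped reads per file), the redirect below has to be quoted so the shell doesn't apply it to parallel itself:

ls *.sorted.bam | parallel -j4 "samtools view -c -F 4 {} > {.}.count"

Here samtools view -c -F 4 counts the mapped reads in each bam, and each count lands in its own file (so x.sorted.bam produces x.sorted.count, thanks to {.} stripping the extension).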

A note of caution: monitor your CPU usage when running jobs in parallel. If your processes aren’t each getting up to 100% CPU usage, this may indicate a file read/write bottleneck, and reducing -j might actually improve your overall runtime. For example, running samtools sort in parallel on lots of files exacerbates the thrashing problem described in the post below, because there are even more files being read simultaneously.
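
For instance, you might watch top in a second terminal while the indexing example above runs; if each samtools process is sitting well below 100% CPU, re-run with a lower job count:

ls *.sorted.bam | parallel -j2 samtools index {}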

2014-10-26   It turns out that the command-line options to samtools sort really do matter. When sorting a large bam file, samtools creates lots of intermediate files. The merge it runs to create the final sorted bam tries to read from all of them nearly simultaneously, thrashing even a pretty decent RAID array (the drives’ read heads spend more time seeking between files than actually reading any one of them).

A partial solution lies in increasing the amount of memory used, by specifying the -m option: for example, -m 5G gives each samtools thread 5GB of memory. The more memory available, the fewer intermediate files are created, and the less thrashing there is during the merge step.
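
Concretely, using the same old-style sort syntax as above (the file name is hypothetical):

samtools sort -m 5G mysample.bam mysample.sorted

This writes mysample.sorted.bam while spilling far fewer intermediate files to disk than the 768M default would.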

In my case, time to sort a 30x whole-genome BAM file went from days down to hours simply by specifying -m 10G rather than the default of 768M.

A more complete solution might be to use the DNAnexus fork of samtools, but it didn’t compile out of the box on OS X so I haven’t actually tried it (though the accompanying blog post was helpful).

Also, remember that newer versions of samtools sort can be parallelized to some extent using the -@ option, e.g. -@ 4 to specify 4 threads. Note that the -m memory allocation above is per thread, so in our example we’d need about 4 x 5 = 20 GB of RAM.
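
Putting the two options together (again, old-style syntax and a hypothetical file name):

samtools sort -@ 4 -m 5G mysample.bam mysample.sorted

Four sorting threads, each with 5GB to work with, for roughly 20 GB of memory in total.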