Fixing the XS tag in TopHat output [Bug Fixing]

The TopHat2 paper was just published. Although there are over a dozen of spliced mapper out there, TopHat remains one of the most frequently used mapper, which I think, could be attributed to its tightly linked sister differential expression software, Cufflinks, in the “Tuxedo” suite. Many thanks to the developers at CBCB, TopHat has been improved a lot since its first release. However, there are some bugs in the output of the final BAM file really annoyed me.

First is the XS tag when –GTF option is turned on. That is, if a read is mapped on a feature specified in the GTF,  the XS tag would always be on the same strand of the feature not matter what strand the read is mapped to. This would create wrong counts is there are unannotated antisense transcripts. To fix that, the following samtools|awk|samtools pipe will do the trick,

samtools view -h $tophatOutDir/accepted_hits.bam
| awk '{if(\$2%256==16) {sub(/XS\:A\:\+/,"XS:A:-",\$0);print \$0}; if(\$2%256==0) {sub(/XS\:A\:\-/,"XS:A:+",\$0);print \$0};}'
| samtools view -h -S -b - >$tophatOutDir/accepted_hits_xs.bam

It goes like this:

  1. samtools stream feed the lines of the accepted_hits.bam to awk;
  2. awk checks the bit flag in the second column (i.e.$2) whether 16 is present, (i.e. (\$2%256==16)), if yes then change + to -; and check for the bit flag 0 for the other strand;
  3. samtools takes the stream feed from awk and output the BAM;

Leave a comment