How to distinguish perfectly mapped reads from a SAM/BAM file
Today I was wondering, given a set of reads and a genome sequence, how I could filter out all of the reads that align to the genome perfectly. I'll explore this idea a bit here. I'm going to use nullgraph to simulate a random genome, bwa to compute the alignments, and samtools for post-processing the alignments.
Simulated data¶
I'll start by simulating a random 2.5 megabase genome and "sequencing" to a depth of 20x coverage.
make-random-genome.py -h
# Generating the bogus chromosome in 10 smaller pieces is quicker than in one big piece.
echo '>bogusgenome' > bogus-genome.fa
time for i in {1..10}
do
make-random-genome.py --length 250000 --seed $((42 * $i)) | grep -v '^>' >> bogus-genome.fa
done
make-reads.py -h
time make-reads.py --error-rate 0.005 --read-length 125 --coverage 20 --seed 54321 bogus-genome.fa > bogus-reads.fa
Ok, so we have 400k reads, 125 bp each and 186,317 of these reads have one or more "mutations" (representing single-nucleotide sequencing errors).
Take Un¶
My first thought was to take the "minimum seed length" parameter to the extreme. Many alignment algorithms, bwa mem included, use a "seed-and-extend" strategy. The seed is a subsequence of the read that must match the reference perfectly for the rest of the mismatch-tolerant alignment to be computed. If my reads are 125 bp in length and I set the minimum seed length to 125, only alignments for reads that match perfectly should be reported. The remainder of the reads should be reported as unaligned.
Let's see!
# First, gotta index the genome so bwa can do its thang.
bwa index bogus-genome.fa
I'll pipe the output of bwa
directly to samtools
.
The -f 4
flag will pull out all reads that are marked as unaligned, and the -c
flag will count the records instead of printing them.
bwa mem -k 125 bogus-genome.fa bogus-reads.fa | samtools view -f 4 -c
Great, it worked! We have 186,317 reads marked as unaligned, which matches the number of reads containing errors from our simulation.
Take Deux¶
That was pretty quick and easy, but unfortunately this solution has limited utility. If any kind of trimming, error correction, or other quality control is performed on the reads prior to alignment, then it's almost certain that the reads will not be of uniform length. If I set the minimum seed length to high, some reads will not make the cut simply because they aren't long enough, but if I set it too low, reads with errors will be aligned and I'll have to resort to some other way of distinguishing perfect matches anyway.
My next thought was to adjust the alignment scoring scheme and penalize mismatches, gap opens, and gap extensions so much that a read with any mismatch would not be aligned.
bwa mem -B 100 -O [100,100] -E [100,100] bogus-genome.fa bogus-reads.fa | samtools view -f 4 -c
bwa mem -B 1000 -O [1000,1000] -E [1000,1000] bogus-genome.fa bogus-reads.fa | samtools view -f 4 -c
Not sure what's going on here, but clearly it didn't work as intended.
Take Trois¶
My next thought was to process the alignments "normally" (with default parameters) and examine each alignment's CIGAR string to distinguish perfect matches from imperfect alignments.
Ah, if only life were that easy.
The CIGAR format uses different symbols to annotated insertions, deletions, clipped alignments, and so on.
But someone somwhere a long time ago decided that it would be a good idea to use the same symbol to annotate matches and mismatches in an alignment.
So if an alignment has a CIGAR string of 108M
it's impossible to know if that's a perfect match of 108 nucleotides, or 90 matches and then a single mismatch and then 17 more matches, or something else.
One would ether have to compare the read sequence to the sequence of the genome to which it was aligned, or compare the length of the read to the number of matches in the MD tag or something like that.
This is getting into the realm of BAM processing libraries and scripts, which I'm trying to avoid.
If at all possible I'd love to do this as a simple shell one-liner.
My friend Srihari suggested I look at the optional tags in more detail, particularly the edit distance (NM) tag. I felt silly for not thinking of this earlier myself (I'll chalk it up to my limited practical experience with SAM/BAM data). Any perfect alignment is going to have an edit distance of 0, while any imperfect alignment is going to have a positive edit distance, so that should be a simple way to distinguish and ignore perfect alignments.
# The grep command discards read alignments with an edit distance of 0.
bwa mem bogus-genome.fa bogus-reads.fa | grep -v 'NM:i:0' | wc -l
Huh, that's close, but not exactly our target. We should be seeing 186,317 reads. What's going on? Let's do a bit of detective work.
# Grab read IDs from the "seed length" method that gave us the correct answer earlier.
bwa mem -k 125 bogus-genome.fa bogus-reads.fa | samtools view -f 4 > test-one.sam
cut -f 1 test-one.sam | sort | uniq > test-one.txt
# Grab read IDs from the "edit distance" method that isn't quite right.
bwa mem bogus-genome.fa bogus-reads.fa | samtools view | grep -v 'NM:i:0' > test-two.sam
cut -f 1 test-two.sam | sort | uniq > test-two.txt
# Randomly grab a few reads that are appear in the first set but not the second.
# I'm using `gshuf` on Mac OS X. Replace it with `shuf` if you're using a Linux OS.
comm -23 test-one.txt test-two.txt | gshuf | head -n 5
Ok let's take a closer look at a couple of these reads.
grep -A 1 -e read43849r -e read5193r bogus-reads.fa
Interesting. Both have a couple of sequencing errors near the end of the read (shown as lowercase symbols). Now let's take a look at the alignments without any filtering.
# Hide the progress reports by sending them to /dev/null
bwa mem bogus-genome.fa bogus-reads.fa 2> /dev/null | grep -e read43849r -e read5193r
Ahh, ok. As the CIGAR strings show, both of these alignments are soft-clipped at the ends. That means the offending nucleotides are not considered part of the alignment, e.g. when computing the edit distance. Ignoring the soft-clipped portion, the remainder of the read is a perfect match and is thus being (erroneously) discarded.
Take Quatre¶
Ok, so it looks like we need to check for soft clipping before we apply the edit distance filter. It doesn't look like soft clipping can be disabled with bwa mem (I would LOVE to be corrected!), so it'll have to be handled by post-processing.
Our algorithm boils down to this: keep the record if the CIGAR string does not match the regular expression \d+M
(check for soft clipping) or if the edit distance is 0.
I can think of two ways of doing this.
The first approach is lazier, but is also more concise and less cryptic to those who have never slung Perl.
bwa mem bogus-genome.fa bogus-reads.fa | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | wc -l
The second approach is more explicit, avoiding any unintended off-target matches for the \d+M
regex.
This comes at the cost of introducing more complexity and verbosity in the Perl code, and honestly I don't know if an off-target match for that regex is even possible for well-formatted SAM alignments.
bwa mem bogus-genome.fa bogus-reads.fa | samtools view | perl -ne '@v = split; print if $v[5] =~ /S/ || !/NM:i:0/' | wc -l
In any case, both approaches see to work correctly and give us just the reads that do not align perfectly to the reference.
Fin¶
One thing I haven't considered much here is performance. This whole exercise would have been pretty silly if a short read aligner exists that make it easy to require only exact alignments to be reported. But I'm not familiar with any, and this use case has such limited utility with the data we work with on a daily basis that I doubt there is one in wide use. Again, I'd be happy to be proven wrong!