Today I was wondering, given a set of reads and a genome sequence, how I could filter out all of the reads that align to the genome perfectly. I'll explore this idea a bit here. I'm going to use nullgraph to simulate a random genome, bwa to compute the alignments, and samtools for post-processing the alignments.

Simulated data

I'll start by simulating a random 2.5 megabase genome and "sequencing" to a depth of 20x coverage.

In [1]:
make-random-genome.py -h
usage: make-random-genome.py [-h] [-l LENGTH] [-s SEED] [--name NAME]

optional arguments:
  -h, --help            show this help message and exit
  -l LENGTH, --length LENGTH
                        Simulated genome length
  -s SEED, --seed SEED  Random number seed
  --name NAME           sequence name
In [2]:
# Generating the bogus chromosome in 10 smaller pieces is quicker than in one big piece.
echo '>bogusgenome' > bogus-genome.fa
time for i in {1..10}
do
    make-random-genome.py --length 250000 --seed $((42 * $i)) | grep -v '^>' >> bogus-genome.fa
done
Using random seed: 42
Using random seed: 84
Using random seed: 126
Using random seed: 168
Using random seed: 210
Using random seed: 252
Using random seed: 294
Using random seed: 336
Using random seed: 378
Using random seed: 420

real	0m3.067s
user	0m2.876s
sys	0m0.143s
In [3]:
make-reads.py -h
usage: make-reads.py [-h] [-e ERROR_RATE] [-r READ_LENGTH] [-C COVERAGE]
                     [-S SEED] [--even] [--mutation-details MUTATION_DETAILS]
                     genome

positional arguments:
  genome

optional arguments:
  -h, --help            show this help message and exit
  -e ERROR_RATE, --error-rate ERROR_RATE
  -r READ_LENGTH, --read-length READ_LENGTH
                        Length of reads to generate
  -C COVERAGE, --coverage COVERAGE
                        Targeted coverage level
  -S SEED, --seed SEED  Random seed
  --even
  --mutation-details MUTATION_DETAILS
                        Write detailed log of mutations here
In [4]:
time make-reads.py --error-rate 0.005 --read-length 125 --coverage 20 --seed 54321 bogus-genome.fa > bogus-reads.fa
genome size: 2500000
coverage: 20.0
readlen: 125
error rate: 0.005
Read in template genome bogusgenome of length 2500000 from bogus-genome.fa
Generating 400000 reads of length 125 for a target coverage of 20.0 with a target error rate of 0.005
186317 of 400000 reads mutated; 251371 total mutations

real	1m39.693s
user	1m39.133s
sys	0m0.283s

Ok, so we have 400k reads, 125 bp each, and 186,317 of these reads have one or more "mutations" (representing single-nucleotide sequencing errors).
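
As a quick sanity check (not something I ran above), counting the FASTA headers in the read file should confirm the read count; something like this ought to do it.

# Count the simulated reads; should print 400000.
grep -c '^>' bogus-reads.fa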

Take Un

My first thought was to take the "minimum seed length" parameter to the extreme. Many alignment algorithms, bwa mem included, use a "seed-and-extend" strategy. The seed is a subsequence of the read that must match the reference perfectly for the rest of the mismatch-tolerant alignment to be computed. If my reads are 125 bp in length and I set the minimum seed length to 125, only alignments for reads that match perfectly should be reported. The remainder of the reads should be reported as unaligned.
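
For reference, the relevant knob here is bwa mem's minimum seed length option, which shows up in the bwa 0.7.x usage text as something like:

-k INT        minimum seed length [19]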

Let's see!

In [5]:
# First, gotta index the genome so bwa can do its thang.
bwa index bogus-genome.fa
[bwa_index] Pack FASTA... 0.03 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.83 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.20 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index bogus-genome.fa
[main] Real time: 1.122 sec; CPU: 1.113 sec

I'll pipe the output of bwa directly to samtools. The -f 4 flag will pull out all reads that are marked as unaligned, and the -c flag will count the records instead of printing them.

In [6]:
bwa mem -k 125 bogus-genome.fa bogus-reads.fa | samtools view -f 4 -c
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 2.159 CPU sec, 2.105 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 1.929 CPU sec, 1.812 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 1.971 CPU sec, 1.857 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 2.274 CPU sec, 2.168 real sec
[M::mem_process_seqs] Processed 80000 reads in 1.830 CPU sec, 1.759 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -k 125 bogus-genome.fa bogus-reads.fa
[main] Real time: 9.840 sec; CPU: 10.281 sec
186317

Great, it worked! We have 186,317 reads marked as unaligned, which matches the number of reads containing errors from our simulation.
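
If I wanted the error-containing reads themselves rather than just a count, something like samtools fasta should be able to pull them out of the same stream. This is an untested sketch, and unaligned-reads.fa is just a made-up output name.

# Sketch: write the reads flagged as unaligned (i.e. the reads with errors) to a FASTA file.
bwa mem -k 125 bogus-genome.fa bogus-reads.fa | samtools fasta -f 4 - > unaligned-reads.fa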

Take Deux

That was pretty quick and easy, but unfortunately this solution has limited utility. If any kind of trimming, error correction, or other quality control is performed on the reads prior to alignment, it's almost certain that the reads will no longer be of uniform length. If I set the minimum seed length too high, some reads won't make the cut simply because they aren't long enough; if I set it too low, reads with errors will be aligned and I'll have to resort to some other way of distinguishing perfect matches anyway.
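
(As an aside, a quick way to check whether a read set is still uniform in length, say after trimming, is to tabulate the sequence lengths. This assumes one sequence per line, and trimmed-reads.fa is a hypothetical file name.)

# Tabulate read lengths in a hypothetical trimmed FASTA file.
awk '!/^>/ {print length($0)}' trimmed-reads.fa | sort -n | uniq -c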

My next thought was to adjust the alignment scoring scheme and penalize mismatches, gap opens, and gap extensions so much that a read with any mismatch would not be aligned.

In [7]:
bwa mem -B 100 -O [100,100] -E [100,100] bogus-genome.fa bogus-reads.fa | samtools view -f 4 -c
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.261 CPU sec, 3.207 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 7.165 CPU sec, 7.206 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 5.006 CPU sec, 5.368 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.185 CPU sec, 4.146 real sec
[M::mem_process_seqs] Processed 80000 reads in 6.265 CPU sec, 6.539 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -B 100 -O [100,100] -E [100,100] bogus-genome.fa bogus-reads.fa
[main] Real time: 26.631 sec; CPU: 26.008 sec
7
In [8]:
bwa mem -B 1000 -O [1000,1000] -E [1000,1000] bogus-genome.fa bogus-reads.fa | samtools view -f 4 -c
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.372 CPU sec, 4.334 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.562 CPU sec, 4.719 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 6.073 CPU sec, 6.102 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.534 CPU sec, 3.410 real sec
[M::mem_process_seqs] Processed 80000 reads in 3.393 CPU sec, 3.320 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -B 1000 -O [1000,1000] -E [1000,1000] bogus-genome.fa bogus-reads.fa
[main] Real time: 22.047 sec; CPU: 22.072 sec
0

I'm not sure what's going on here, but it clearly didn't work as intended. (In hindsight, one thing worth noting is that the square brackets in bwa's usage text just indicate that the second value is optional; the penalties were presumably meant to be passed as plain comma-separated integers, e.g. -O 100,100.)

Take Trois

My next thought was to process the alignments "normally" (with default parameters) and examine each alignment's CIGAR string to distinguish perfect matches from imperfect alignments. Ah, if only life were that easy. The CIGAR format uses different symbols to annotate insertions, deletions, clipped alignments, and so on, but someone somewhere a long time ago decided it would be a good idea to use the same symbol (M) for both matches and mismatches. So if an alignment has a CIGAR string of 108M, it's impossible to know whether that's a perfect match of 108 nucleotides, or 90 matches followed by a single mismatch and then 17 more matches, or something else. One would either have to compare the read sequence to the genomic sequence it aligns to, or compare the length of the read to the number of matches recorded in the MD tag, or something like that. This is getting into the realm of BAM processing libraries and scripts, which I'm trying to avoid. If at all possible I'd love to do this as a simple shell one-liner.
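
To make the ambiguity concrete, here are two hypothetical alignments that share the same CIGAR string but differ in their MD tags (the MD tag records runs of matching bases and the reference base at each mismatch).

CIGAR 108M with MD:Z:108    -> all 108 aligned bases match the reference
CIGAR 108M with MD:Z:90A17  -> 90 matches, a mismatch where the reference has an A, then 17 more matches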

My friend Srihari suggested I look at the optional tags in more detail, particularly the edit distance (NM) tag. I felt silly for not thinking of this earlier myself (I'll chalk it up to my limited practical experience with SAM/BAM data). Any perfect alignment will have an edit distance of 0, while any imperfect alignment will have a positive edit distance, so this should be a simple way to identify and discard the perfect alignments.

In [9]:
# The grep command discards read alignments with an edit distance of 0.
bwa mem bogus-genome.fa bogus-reads.fa | grep -v 'NM:i:0' | wc -l
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.601 CPU sec, 3.550 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.519 CPU sec, 3.405 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 5.629 CPU sec, 5.714 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 5.022 CPU sec, 5.151 real sec
[M::mem_process_seqs] Processed 80000 reads in 3.737 CPU sec, 3.669 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem bogus-genome.fa bogus-reads.fa
[main] Real time: 21.635 sec; CPU: 21.627 sec
  186194

Huh, that's close, but not exactly our target. We should be seeing 186,317 reads. What's going on? Let's do a bit of detective work.

In [10]:
# Grab read IDs from the "seed length" method that gave us the correct answer earlier.
bwa mem -k 125 bogus-genome.fa bogus-reads.fa | samtools view -f 4 > test-one.sam
cut -f 1 test-one.sam | sort | uniq > test-one.txt

# Grab read IDs from the "edit distance" method that isn't quite right.
bwa mem bogus-genome.fa bogus-reads.fa | samtools view | grep -v 'NM:i:0' > test-two.sam
cut -f 1 test-two.sam | sort | uniq > test-two.txt

# Randomly grab a few reads that appear in the first set but not the second.
# I'm using `gshuf` on Mac OS X. Replace it with `shuf` if you're using a Linux OS.
comm -23 test-one.txt test-two.txt | gshuf | head -n 5
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 1.816 CPU sec, 1.763 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 2.448 CPU sec, 2.477 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 2.712 CPU sec, 2.592 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 1.967 CPU sec, 1.850 real sec
[M::mem_process_seqs] Processed 80000 reads in 2.266 CPU sec, 2.218 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -k 125 bogus-genome.fa bogus-reads.fa
[main] Real time: 11.074 sec; CPU: 11.341 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.708 CPU sec, 3.658 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.044 CPU sec, 3.941 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.456 CPU sec, 4.382 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.368 CPU sec, 4.279 real sec
[M::mem_process_seqs] Processed 80000 reads in 4.595 CPU sec, 4.713 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem bogus-genome.fa bogus-reads.fa
[main] Real time: 21.216 sec; CPU: 21.325 sec
read108697r
read74495f
read264906r
read294922f
read306338r

Ok, let's take a closer look at a couple of reads from this discrepant set.

In [11]:
grep -A 1 -e read43849r -e read5193r bogus-reads.fa
>read5193r start=1764079,mutations=2
GAAGCGATAGATCCACGCGATACTCAGTGTATGTAGCCAAATCAAAAAACTACACGATGCCGAACAGAGTGTTTTATTTAACCTCCTGTTTTTGTACAACGTTTTGAGAACATATATCATgGaTT
--
>read43849r start=2013628,mutations=2
aaTTCGCGATTCTAAGGCGCGGGGTCACCATTTAATAACCTCGCACTCGGTACTGAAGTAGCGGCCGTTCGTTTGTGGGCACTAGGTCATTCAGTCCCGTGGCGCGTTTTATAGACGAAACTTTA

Interesting. Both have a couple of sequencing errors near one end of the read (shown in lowercase). Now let's take a look at the alignments without any filtering.

In [12]:
# Hide the progress reports by sending them to /dev/null
bwa mem bogus-genome.fa bogus-reads.fa 2> /dev/null | grep -e read43849r -e read5193r
read5193r	16	bogusgenome	1764085	60	5S120M	*	0	0	AATCCATGATATATGTTCTCAAAACGTTGTACAAAAACAGGAGGTTAAATAAAACACTCTGTTCGGCATCGTGTAGTTTTTTGATTTGGCTACATACACTGAGTATCGCGTGGATCTATCGCTTC	*	NM:i:0	MD:Z:120	AS:i:120	XS:i:0
read43849r	16	bogusgenome	2013629	60	123M2S	*	0	0	TAAAGTTTCGTCTATAAAACGCGCCACGGGACTGAATGACCTAGTGCCCACAAACGAACGGCCGCTACTTCAGTACCGAGTGCGAGGTTATTAAATGGTGACCCCGCGCCTTAGAATCGCGAATT	*	NM:i:0	MD:Z:123	AS:i:123	XS:i:0

Ahh, ok. As the CIGAR strings show, both of these alignments are soft-clipped at one end. That means the offending nucleotides are not considered part of the alignment, e.g. when computing the edit distance. Ignoring the soft-clipped portion, the remainder of the read is a perfect match, so it gets an edit distance of 0 and is (erroneously) discarded by the filter. This likely also explains why cranking up the penalties in Take Deux didn't behave as expected: rather than reporting a read with errors as unaligned, bwa mem can simply clip off the offending bases and report a shorter, perfect local alignment.

Take Quatre

Ok, so it looks like we need to check for soft clipping before we apply the edit distance filter. It doesn't look like soft clipping can be disabled with bwa mem (I would LOVE to be corrected!), so it'll have to be handled by post-processing.

Our algorithm boils down to this: keep the record if the CIGAR string is not a single \d+M operation (which catches soft clipping and any gaps) or if the edit distance is not 0. I can think of two ways of doing this. The first approach is lazier, but it's also more concise and less cryptic to those who have never slung Perl.

In [13]:
bwa mem bogus-genome.fa bogus-reads.fa | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | wc -l
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 5.212 CPU sec, 5.262 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 6.921 CPU sec, 6.943 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 6.100 CPU sec, 6.010 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.542 CPU sec, 3.420 real sec
[M::mem_process_seqs] Processed 80000 reads in 3.608 CPU sec, 3.531 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem bogus-genome.fa bogus-reads.fa
[main] Real time: 25.384 sec; CPU: 25.531 sec
  186317

The second approach is more explicit, avoiding any unintended off-target matches for the \d+M regex. This comes at the cost of introducing more complexity and verbosity in the Perl code, and honestly I don't know if an off-target match for that regex is even possible for well-formatted SAM alignments.

In [14]:
bwa mem bogus-genome.fa bogus-reads.fa | samtools view | perl -ne '@v = split; print if $v[5] =~ /S/ || !/NM:i:0/' | wc -l
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 80000 sequences (10000000 bp)...
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.712 CPU sec, 3.667 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 5.299 CPU sec, 5.271 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 3.721 CPU sec, 3.590 real sec
[M::process] read 80000 sequences (10000000 bp)...
[M::mem_process_seqs] Processed 80000 reads in 4.195 CPU sec, 4.070 real sec
[M::mem_process_seqs] Processed 80000 reads in 4.451 CPU sec, 4.383 real sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem bogus-genome.fa bogus-reads.fa
[main] Real time: 21.338 sec; CPU: 21.505 sec
  186317

In any case, both approaches seem to work correctly and give us just the reads that do not align perfectly to the reference.
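
For completeness, if I wanted the imperfectly aligned reads back out as sequences rather than as SAM records, I'd keep the SAM header around and hand the surviving records to samtools fasta. This is an untested sketch, and imperfect-reads.fa is a made-up output name.

# Sketch: keep unaligned and imperfectly aligned records (plus the header), then convert back to FASTA.
bwa mem bogus-genome.fa bogus-reads.fa \
    | samtools view -h \
    | perl -ne 'print if /^@/ || !/\t\d+M\t/ || !/NM:i:0/' \
    | samtools fasta - > imperfect-reads.fa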

Fin

One thing I haven't considered much here is performance. This whole exercise would have been pretty silly if there's already a short read aligner that makes it easy to report only exact alignments. But I'm not familiar with any, and this use case has such limited utility for the data we work with on a daily basis that I doubt one is in wide use. Again, I'd be happy to be proven wrong!
