tl;dr? It's fine, just ignore secondary/supplementary alignments and don't disable reporting of unaligned reads.


I recently started my first real foray into human genomics, and if I thought I was overwhelmed with data before, I'm much more so now. (Thank goodness I'm not stupid ambitious enough to work on marine or soil metagenomics!) One challenge I've encountered already is that this field seems to be moving toward BAM files1 for distributing not only read alignments against a reference genome but also the original read sequences themselves. I have a passing familiarity with SAM/BAM files, but (barring a few exceptions) I've always treated them as a black box before, using "trusted" tools to create and process them. Now that I'm faced with the prospect of writing software to take them as input, I've really had to bite the bullet and dig into the nitty gritty.

BAM is the new Fastq

The Fastq format has long been the standard for storing nucleotide sequence reads from genome-scale sequencing platforms such as Illumina. Although it is riddled with the same issues as essentially every other bioinformatics data format—a redundant and incomplete spec (What *is* the purpose of the third line? Can the sequence be split across multiple lines?) and inconsistent usage (Which quality encoding was used? Are pairs interleaved or split into different files?)—the Fastq format's stripped-down simplicity is no doubt a big factor in its wide adoption. On the other hand, the SAM format was designed for storing read alignments against a reference, with BAM as its compressed binary counterpart. When it comes to analyzing the raw reads produced by a sequencer, Fastq is everybody's natural first choice.

The scale of human genomics data collection is proving challenging, however. When a single modest-coverage sample requires on the order of 100 GB for storage, and you're sequencing thousands of samples, you don't want to store duplicates of your data. And since the majority of scientists in the field are only interested in the alignments anyway, it makes much more sense to publish and archive only BAM files rather than only Fastq files.

But what if I just want the reads?

So the big question then is: if I'm not interested in the alignments and only want the original read sequences, can I get these from the BAM file? Those more familiar with SAM/BAM could probably have given an immediate answer, but as a relative n00b I had several questions.

  • Is the sequence reported in the BAM file the read sequence, the reference sequence, or an alignment artifact (like a consensus)?
  • Do BAM files include unmapped reads?
  • Are multi-mapped reads reported multiple times?

In short, is it possible to extract all input read sequences non-redundantly from a BAM file?

(I should note at this point that I wasn't completely unaware that BAM-to-Fastq converters exist2, so I probably should have approached this question with a bit more confidence.)

Some probing

To satisfy my curiousity, I did a bit of detective work on the BAM files I'm analyzing in my current project. Since the files are huge (>100GB per) and stored on a busy cluster, I selected only the first 5 million-ish reads from the file for testing (processing the whole file would have required submitting a job request to the queue).

samtools view -h mysample.bam \
    | head -n 5000000 \
    | samtools view -b -o mysample.1st5mil.bam

Checking whether the file contains unmapped reads is quite simple.

samtools view -f 4 -c mysample.1st5mil.bam

Checking whether multimapped reads are reported multiple times took a bit more command-line fu. The following command cuts out the read ID from each alignment, sorts, reports the counts of each uniq read ID, sorts again by counts, and then reports the top 10 reads by count.

samtools view mysample.1st5mil.bam \
    | cut -f 1 \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n 10

All of the top 10 read IDs occurred four times in the input. I suspected that two times would simply mean both the read and its pair were found, but four times means that both the read and its pair were each reported twice. To confirm, I searched the file for several of the duplicated read IDs.

samtools view mysample.1st5mil.bam | grep ReadIdHere

Sure enough, in each case, both of the read pairs were reported once in full, and a second time as a hard-clipped secondary alignment (256 or 0x100 in the bitwise flag). The SAM specification is a bit sparse on when hard and soft clipping are intended to be used, but the point is moot since these are marked as secondary alignments.

I then used the following commands to ensure that all hard-clipped alignments are marked as secondary (and that all secondary alignments are hard-clipped).

samtools view -f 256 -c mysample.1st5mil.bam
samtools view -f 256 mysample.1st5mil.bam | cut -f 6 | grep -c H
samtools view -F 256 mysample.1st5mil.bam | cut -f 6 | grep -c H

So far so good. The only point remaining is to verify that the sequences in the BAM file actually correspond to the read and not the reference or some kind of alignment consensus. Without access to the original reads, I could only refer back to the SAM specification, which describes the 10th column of an alignment record as the "segment sequence". Elsewhere in the SAM spec, it is said a read may be composed of multiple segments, so for primary alignments I think we're safe in saying the reported sequence is the entire original read sequence.

A little test

Based on this detective work, it looks like BAM files do indeed contain the information we need to pull out the original read sequence.3 On one hand, it's not too surprising: this kind of thing has likely come up as authors of the SAM spec and mapping software refined their tools over time. On the other hand, I couldn't shake the feeling that, as a newcomer, there might be some problematic detail that I've missed. I concluded that the only way to ease my unsettled mind was to do a test where I knew the (Fastq) input, and could verify that the (BAM) output had exactly the information I needed.

Here is the basic outline of the test.

  • grab a couple of chromosomes from the human reference genome (commands not shown)
  • grab a small sample (≈1 million reads) from a Fastq file I have access to (commands not shown)
  • index the reference sequence for mapping
  • map the reads to the reference, produce BAM file
  • extract read sequences from the BAM file, compare to contents of Fastq file

I decided to test both BWA and Bowtie2 with default settings, figuring that >95% of human genome stuff is going to involve data mapped by of these two tools.

bwa index hg.fasta
bwa mem -t 4 hg.fasta reads-subset.fastq | samtools view b -o test.bwa.bam

bowtie2-build hg.fasta hg.fasta.bt2
bowtie2 -p 4 -x hg.fasta.bt2 -U reads-subset.fastq | samtools view -b -o test.bt2.bam

Now let's extract all of the sequences corresponding to primary alignments—not marked as secondary (256 or 0x100) or supplementary (2048 or 0x800)—from each BAM file. If we sort the sequences, they should be identical right?

# 2048 + 256 = 2304
samtools -F 2304 test.bwa.bam \
    | cut -f 10 \
    | sort \
    > test.bwa.seq.txt
samtools -F 2304 test.bt2.bam \
    | cut -f 10 \
    | sort \
    > test.bt2.seq.txt
diff -q test.bwa.seq.txt test.bt2.seq.txt

Ruh, roh! It looks like they're not the same! What's going on? Both files contain exactly 1,000,000 lines, as expected.

wc -l test.bwa.seq.txt test.bt2.seq.txt

How many are shared between them?

comm -12 test.bwa.seq.txt test.bt2.seq.txt | wc -l

It looks like only about 84% are shared in this case. After a bit more investigation, I discovered that in some cases the mapper will report the reverse complement of the read sequence instead of the original read sequence. This of course is not at all problematic for BAM paring in general, but it complicated my rudimentary evaluation using the diff command. To this end, I wrote a simple C++ program that would print out each sequence it was given as well as its reverse complement.

#include <algorithm>
#include <iostream>
#include <string>

char comp(char c)
{
    return (c) == 'A' ? 'T' : (c) == 'T' ? 'A' : (c) == 'C' ? 'G' : (c) == 'G' ? 'C' : 'N';
}

int main()
{
    std::string seq, revcomp;
    while (getline(std::cin, seq)) {
        revcomp.resize(seq.length());
        std::transform(seq.rbegin(), seq.rend(), revcomp.begin(), comp);
        if (seq < revcomp) {
            std::cout << seq << '\t' << revcomp << '\n';
        } else {
            std::cout << revcomp << '\t' << seq << '\n';
        }
    }
    return 0;
}

We can now use this to confirm that the ≈16% of sequences that don't match between BWA and Bowtie are simply reverse complements of each other. If we compile this program and re-run the pipeline with it included, we should get identical output.

g++ -Wall -O3 --std=c++11 -o rc rc.cpp
samtools -F 2304 test.bwa.bam \
    | cut -f 10 \
    | ./rc \
    | sort \
    > test.bwa.seq.txt
samtools -F 2304 test.bt2.bam \
    | cut -f 10 \
    | ./rc \
    | sort \
    > test.bt2.seq.txt
diff -q test.bwa.seq.txt test.bt2.seq.txt

Woohoo, success! Of course, this only checks that the mappers give identical results. What we're really interested in is whether they both match the contents of the original Fastq file.

paste - - - - < reads-subset.fastq \
    | cut -f 2 \
    | ./rc \
    | sort \
    > test.fastq.seq.txt
diff -q test.fastq.seq.txt test.bwa.seq.txt
diff -q test.fastq.seq.txt test.bt2.seq.txt

It is indeed a match, and I think I can confidently mark this case closed.

Epilogue

I did warn you in the title that this was an idiot's guide, right? A lot of this is probably very elementary stuff to guys and gals that work with mappers and BAM file internals on a daily basis. Even so, I hope I've demonstrated clearly and comprehensively that even if you're interested only in the original read sequence, BAM files are a suitable data format. Of course...there are some caveats:

Caveats

  • Some mappers give you the option to not report unmapped reads. Enabling these settings is problematic if the Fastq files are not published alongside the BAM files.
  • Some mappers may behave differently than the two I evaluated, which (although unlikely) might render some points of my assessment invalid.
  • Performing quality control prior to mapping is convenient for those using the alignments, but prevents users from ever going back and applying a different (improved?) quality control procedure4.

Other than that, it looks like you're just fine grabbing the segment sequence from each alignment except those marked as secondary or supplementary.

But let me open this up: is there anything I've missed, any potential pitfalls that I've overlooked?


A big thanks to Carrie Ganote, Ali Berens, Titus Brown, and Fereydoun Hormozdiari for discussions on the topic and feedback on a draft of this blog post.


1Or, in the longer term, CRAM files, which are different but supposedly backwards compatible with BAM.

2This thread on SEQanswers discusses a variety of BAM-to-Fastq converters that exist or have existed.

3I haven't discussed pairing information since it's not important for the current analysis I'm doing, but it should also be simple to extract by examining the read ID and bitwise flag(s) related to pairing. Of course, things are rarely as simple in practice as they are conceptually...

4This is actually a very important discussion, and where the publishing-BAM-only approach becomes tricky. When do we handle adapter trimming? Removal of technical duplicates? Quality trimming (if any)? Wouldn't we want to apply some quality control before mapping the reads? But then again, quality control on reads is not a settled science, and publishing BAM files of cleaned-up reads prevents us from applying improved QC in the future. I'm not sure I have a good answer for this, but it's an important conversation.

Comments