Streaming data from the SRA with fastq-dump
NCBI's Sequence Read Archive is the go-to repository for published genome-scale sequence data sets.
Although there are a variety of ways to download sequence data from the SRA, the fastq-dump command from the SRA Toolkit is the most convenient in my opinion. In fact, with a few settings tweaks, fastq-dump can stream data directly from the SRA into an analysis pipeline.
- For a true streaming approach, you'll want to disable local file caching with vdb-config. By default, downloaded data is cached under your home directory, which can be very problematic on clusters with tight quotas on home directory storage.
- If you have paired reads, use the --split-files flag for proper printing of pairs and the --stdout flag (-Z for short) so that the data is printed in interleaved Fastq format, rather than in two paired files (as is the default).
- By default, the read IDs returned by fastq-dump don't include any pairing information, which some programs rely on for processing paired-end data. Include the options --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' to append a /1 or /2 to the end of each read ID for pairing information, and to throw away all of the superfluous and redundant info in the 3rd line of each Fastq record.
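Since downstream tools may silently misbehave on badly paired input, it can be worth spot-checking the stream before wiring up a full pipeline. The following is only a rough sketch: check_interleaved is a hypothetical helper of my own naming (not part of the SRA Toolkit), and it assumes 4-line Fastq records in which every /1 read is immediately followed by its /2 mate.

```shell
# check_interleaved: hypothetical helper, NOT part of the SRA Toolkit.
# Reads 4-line Fastq records on stdin and checks that every record whose ID
# ends in /1 is immediately followed by a record with the same ID ending in /2.
# Exit status is 0 if all reads are properly paired, 1 otherwise.
check_interleaved() {
    awk '
        NR % 4 == 1 {                      # header line of each record
            id = $1
            if (expect != "") {            # previous record was a /1 read
                if (id != expect) {
                    print "expected " expect " but found " id > "/dev/stderr"
                    bad = 1
                    exit
                }
                expect = ""
            } else if (id ~ /\/1$/) {      # remember which mate must come next
                expect = substr(id, 1, length(id) - 1) "2"
            } else {
                print "unexpected read ID " id > "/dev/stderr"
                bad = 1
                exit
            }
        }
        END {
            if (expect != "") bad = 1      # stream ended on an unmatched /1 read
            exit bad + 0
        }
    '
}
```

One way to use it is to pipe a small sample through the check, for example by adding fastq-dump's -X option (which limits the number of spots dumped) to the command and piping the output into check_interleaved.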
The following example pipes the SRA data set with the accession ERR612477 into a processing pipeline.
fastq-dump --split-files --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' -Z ERR612477 \
    | trim-low-abund.py --ksize 25 --max-memory-usage 2G --variable-coverage - \
    | my-favorite-mapper-or-assembler > out.dat
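If you find yourself streaming several accessions, the fastq-dump options can be wrapped in a small shell function. This is just a sketch under a few assumptions: the name stream_sra and the FASTQ_DUMP override variable are made up for illustration, and the flags are simply the ones from the command above.

```shell
# stream_sra: hypothetical convenience wrapper around the fastq-dump command above.
# FASTQ_DUMP is an assumed override hook; it defaults to fastq-dump on the PATH.
#
# Usage:
#   stream_sra ERR612477 | my-favorite-mapper-or-assembler > out.dat
stream_sra() {
    if [ "$#" -ne 1 ]; then
        echo "usage: stream_sra ACCESSION" >&2
        return 2
    fi
    # Single quotes keep the shell from expanding $ac, $si, $sg and $ri;
    # fastq-dump itself substitutes them when formatting each read ID.
    "${FASTQ_DUMP:-fastq-dump}" --split-files \
        --defline-seq '@$ac.$si.$sg/$ri' \
        --defline-qual '+' \
        -Z "$1"
}
```

The override variable is mostly there so the function can be dry-run (for instance with FASTQ_DUMP=echo) on a machine without the SRA Toolkit installed.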