Information content versus data volume and k-mer counting accuracy

Keeping track of k-mers for simple operations has become a fundamental component of many bioinformatics techniques. Two common operations on k-mers are set membership queries ("is k-mer X present in data set Y?") and abundance queries ("how many times does k-mer X occur in data set Y?"). Several probabilistic data structures have been developed to support ...

more…
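
To make the two query types concrete, here is a minimal sketch that uses an exact in-memory counter rather than any of the probabilistic data structures the post goes on to discuss. The function name `count_kmers` and the toy sequences are hypothetical, chosen only for illustration.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every k-mer (length-k substring) across an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy data set (hypothetical), k = 4
counts = count_kmers(["GATTACA", "TTACAGA"], k=4)

is_present = "TTAC" in counts   # set membership query
abundance = counts["TTAC"]      # abundance query
unseen = counts["CCCC"]         # a Counter reports 0 for k-mers it never saw
```

An exact counter like this stores every distinct k-mer, so its memory use grows with the data; it serves here only as a baseline for what the queries mean.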

How to distinguish perfectly mapped reads from a SAM/BAM file

In which I explore read alignments in SAM format and discuss the pros and cons of various approaches to distinguishing perfect matches from imperfect matches.

more…

Super simple reverse-complement aware DNA sequence search with rcgrep

There are many wonderfully elegant and efficient tools for performing all sorts of exact and inexact searches on large collections of DNA sequences. Experience has shown, however, that these tools are usually very rigid with respect to their assumptions about input data. If input files are compressed in a certain way, or stored in a non-standard format, there is usually ...

more…
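
As a rough illustration of what "reverse-complement aware" search means, the sketch below matches a query or its reverse complement against arbitrary lines of text. It is not how rcgrep is implemented; the names `reverse_complement` and `rc_aware_search` are hypothetical, and only unambiguous A/C/G/T bases are handled.

```python
import re

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def rc_aware_search(query, lines):
    """Yield each line that contains the query or its reverse complement."""
    candidates = {query, reverse_complement(query)}
    pattern = re.compile("|".join(re.escape(s) for s in candidates))
    for line in lines:
        if pattern.search(line):
            yield line

hits = list(rc_aware_search("GATTACA", ["ccTGTAATCcc", "AAAAAAA", "GATTACA"]))
```

Because the search works on plain lines of text, it makes no assumptions about the record format the sequences are stored in.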

Streaming data from the SRA with fastq-dump

NCBI's Sequence Read Archive is the go-to repository for published genome-scale sequence data sets. Although there are a variety of ways to download sequence data from the SRA, the fastq-dump command from the SRA Toolkit is the most convenient in my opinion. In fact, with a few tweaks to its settings, fastq-dump can stream data directly from the SRA into an analysis ...

more…

Composing generator functions in Python

In which I briefly motivate the utility of generator functions and demonstrate that they can be nested to create a data processing stream.

more…
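
A minimal sketch of the idea, assuming a tab-delimited input with `#` comment lines: each generator function consumes the previous one lazily, so nesting the calls builds a processing stream. The function names and sample data are hypothetical.

```python
def read_lines(text):
    """Yield the lines of a block of text one at a time."""
    for line in text.splitlines():
        yield line

def strip_comments(lines):
    """Pass through only the lines that are not comments."""
    for line in lines:
        if not line.startswith("#"):
            yield line

def parse_fields(lines):
    """Split each tab-delimited line into a list of fields."""
    for line in lines:
        yield line.split("\t")

data = "# header\nchr1\t100\nchr2\t200"

# Nest the generators: nothing runs until the stream is consumed.
stream = parse_fields(strip_comments(read_lines(data)))
records = list(stream)
```

Each stage holds only one line at a time, so the same pattern applies to inputs too large to fit in memory.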

Thoughts on research software from the PSRN workshop

In which I ramble on about experimental science, research software, and cyberinfrastructure engineering.

more…

My thoughts on the PSRN workshop on cyberinfrastructure and training

I am on my way home from a workshop on cyberinfrastructure and postgraduate training hosted by the Plant Science Research Network. It was a grueling three-day sprint, but it brought together a phenomenal diversity of experience and perspectives on the relevant issues. I wanted to summarize my thoughts while they are still fresh in my head. Beware: what follows is ...

more…

An idiot's guide to loading reads from a BAM file

tl;dr? It's fine, just ignore secondary/supplementary alignments and don't disable reporting of unaligned reads.

more…
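
The filtering rule can be sketched with the flag bits defined in the SAM specification (0x100 for secondary, 0x800 for supplementary alignments). The `(name, flag)` tuples below are a hypothetical stand-in for records read from a real BAM file, and the function names are illustrative only.

```python
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment

def is_primary_record(flag):
    """True if the record is neither a secondary nor a supplementary alignment."""
    return not flag & (SECONDARY | SUPPLEMENTARY)

def primary_records(records):
    """Yield only primary records; unaligned reads (flag bit 0x4) are kept."""
    for name, flag in records:
        if is_primary_record(flag):
            yield name, flag

# Hypothetical (read name, flag) pairs standing in for BAM records
example = [("r1", 0), ("r1", 256), ("r2", 16), ("r2", 2048), ("r3", 4)]
kept = list(primary_records(example))
```

With secondary and supplementary records dropped and unaligned reads retained, each single-end read appears exactly once (paired-end data has one such record per mate).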

Reproducible variant calling is possible with randomized algorithms

This morning I read "On genomic repeats and reproducibility" by Can Firtina and Can Alkan. The paper discusses two notable observations regarding calling genomic variants.

  1. Some sequence read aligners are not deterministic, and shuffling the order of the reads can result in different alignments.
  2. Some variant callers are not deterministic, and will report a different set of variants if an ...

more…

That darn cache! Configuring the SRA Toolkit

2016-05-18 by Daniel S. Standage in blog; tags: sra, ngs

Last night I started a batch job on our group's cluster to download and process 9 Illumina libraries from the NCBI SRA. In the past, I have almost always downloaded such data via direct links to .sra files on the SRA FTP site, and then converted these files to FASTQ format using the fastq-dump command from the SRA Toolkit ...

more…