Psst! I've posted some details about my notebook setup here.

Thoughts on research software from the PSRN workshop

In which I ramble on about experimental science, research software, and cyberinfrastructure engineering.

more…

My thoughts on the PSRN workshop on cyberinfrastructure and training

I am on my way home from a workshop on cyberinfrastructure and postgraduate training hosted by the Plant Science Research Network. It was a grueling three-day sprint, but it brought together a phenomenal diversity of experience and perspectives on the relevant issues. I wanted to summarize my thoughts while they are still fresh in my head. Beware: what follows is …

more…

An idiot's guide to loading reads from a BAM file

tl;dr? It's fine, just ignore secondary/supplementary alignments and don't disable reporting of unaligned reads.
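
A minimal sketch of the filtering step in that tl;dr, assuming each record has already been parsed into a hypothetical (read name, flag) pair. The flag bit values come from the SAM specification; the function names and the tuple layout are illustrative and are not taken from the post.

```python
# Flag bits defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def is_primary_record(flag):
    """Return True if the record is neither secondary nor supplementary."""
    return not flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY)


def select_reads(records):
    """Yield the records to load, given hypothetical (name, flag) pairs.

    Secondary and supplementary alignments are skipped. Unaligned reads
    are kept, since they are still reads in the original library.
    """
    for name, flag in records:
        if is_primary_record(flag):
            yield name, flag
```

For example, `list(select_reads([("r1", 0), ("r1", 256), ("r2", 4)]))` keeps the primary alignment of `r1` and the unaligned read `r2`, and drops the secondary alignment of `r1`.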

more…

Reproducible variant calling is possible with randomized algorithms

This morning I read "On genomic repeats and reproducibility" by Can Firtina and Can Alkan. The paper discusses two notable observations regarding calling genomic variants.

  1. Some sequence read aligners are not deterministic, and shuffling the order of the reads can result in different alignments.
  2. Some variant callers are not deterministic, and will report a different set of variants if an …

more…
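
One way to see how a randomized choice can still be reproducible: the sketch below seeds the random number generator from the read name, so the outcome does not depend on the order in which reads are processed. This is an illustration only, not the method of the paper or of any particular aligner; `pick_position`, its `seed` parameter, and the candidate lists are all hypothetical.

```python
import hashlib
import random


def pick_position(read_name, candidates, seed=42):
    """Pick one of several equally good positions for a read.

    The choice is pseudo-random, but the generator is seeded from the
    global seed and the read name, so the same read always gets the
    same answer regardless of when it is processed.
    """
    digest = hashlib.sha256(f"{seed}:{read_name}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort first so the order of the candidate list does not matter either.
    return rng.choice(sorted(candidates))
```

Calling `pick_position("readA", [100, 5000, 90210])` repeatedly, or after shuffling the input, returns the same position every time for a fixed seed.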

That darn cache! Configuring the SRA Toolkit

Permalink: 2016-05-18 by Daniel S. Standage in blog tags: sra ngs

Last night I started a batch job on our group's cluster to download and process 9 Illumina libraries from the NCBI SRA. In the past, I have almost always downloaded such data via direct links to .sra files on the SRA FTP site, and then converted these files to FASTQ format using the fastq-dump command from the SRA Toolkit. However …

more…

The eduroam network and 802.1X profiles in Mac OS X

My affiliation recently changed from Indiana University to UC Davis, and accordingly my IU credentials no longer give me access to the eduroam wifi network. Over the last couple of days I've been struggling to connect my laptop to eduroam using my new UC Davis credentials. At first I thought it was an issue with my account, but it …

more…

Citing "manuscripts in progress" on your CV

A couple of weeks ago, I saw a couple of Twitter threads explode on the topic of citing "manuscripts in progress" on one's CV.

more…

Searching for TSA master records at NCBI

The NCBI Transcriptome Shotgun Assembly (TSA) database is the go-to place for submitting transcript assemblies for long-term archival and public access. However, TSA is not one of the database options provided when doing keyword searches at NCBI. TSA sequences are available through the nuccore nucleotide database, along with all other DNA and RNA sequences.

If you want to search NCBI exclusively …

more…

On genomic interval notation

Intervals are one of the most common data abstractions used in genome informatics, along with strings and graphs. DNA has an intricate dynamic three-dimensional structure, but for many bioinformatics applications we can get away with ignoring this level of detail and representing the molecule instead as a static linear sequence of symbols. Genomic features—such as genes or transposable elements …

more…
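
A small sketch of a genomic interval as a data structure. The teaser does not show which notation the post settles on, so this assumes the two conventions commonly seen in file formats: 0-based half-open coordinates (as in BED) for storage, and 1-based closed coordinates (as in GFF3) for input and display. The `Interval` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A genomic interval stored as 0-based, half-open [start, end)."""
    seqid: str
    start: int
    end: int

    @classmethod
    def from_one_based(cls, seqid, start, end):
        """Build an interval from 1-based closed coordinates."""
        return cls(seqid, start - 1, end)

    def to_one_based(self):
        """Return (start, end) as 1-based closed coordinates."""
        return self.start + 1, self.end

    @property
    def length(self):
        # With half-open coordinates the length is a plain difference.
        return self.end - self.start

    def overlaps(self, other):
        """True if both intervals share at least one position."""
        return (self.seqid == other.seqid
                and self.start < other.end
                and other.start < self.end)
```

With this layout, `Interval.from_one_based("chr1", 101, 200)` is stored as start 100 and end 200 and has a length of 100, and two intervals that merely touch end to start do not count as overlapping.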

My tutorial on git branches

I've been trying subtly (and in a few cases not-so-subtly) for years now to convert my colleagues to the gospel of git and GitHub. The git version control system has its quirks no doubt, but there is—in my opinion—no more powerful system for open collaboration on software and science than git and GitHub. A quote I attribute (hopefully …

more…