Myths and Fallacies: GUIs versus CLIs in Bioinformatics

Bioinformatics Twitter is having...a moment. What began as one man's exasperated rant against poorly documented and distributed code has bloomed into a protracted debate about what constitutes "good" bioinformatics software and which kind(s) of interfaces developers should be expected to provide if they "really" want people to use their software. For those who have been around a while …

more…

Developer, pull request thyself

I contend that you should treat your solo software projects like collaborative projects and use pull requests, an issue tracker, and other social development tools.

more…

A snake in the pipes!

A few comments on the new "pipe" output flag in Snakemake.

more…

The Joy and Art of Automated Testing

A primer on software testing for scientists and researchers.

more…

A brief review of HULK and histosketch

About a month ago, I was intrigued to see a bit of Twitter activity around a new bioRxiv preprint. The manuscript describes HULK, a new bioinformatics tool that implements some useful comparison metrics and operations for analyzing (meta)genomes. HULK is based on a new algorithm called histogram sketching (HistoSketch for short), following the trend of related sketching algorithms (HyperLogLog …

more…

Improvements from applying filters at k-mer counting time in kevlar

Permalink: 2018-07-16 by Daniel S. Standage in blog tags: kevlar

One of the fundamental insights of the kevlar de novo variant caller is the framing of the variant discovery problem as a search for novel k-mers. In this case, "novel" means abundant in the focal sample and effectively absent from all control samples. In the early stages of creating kevlar, it quickly became clear that many k-mers satisfying these simple …

more…
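To make the definition of "novel" in the teaser above concrete, here is a minimal sketch in plain Python. It is not kevlar's implementation: it counts k-mers exactly in memory, and the function names and the `case_min` and `ctrl_max` thresholds are hypothetical stand-ins for "abundant in the focal sample" and "effectively absent from all control samples".

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in an iterable of read sequences (exact counts)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def novel_kmers(case_reads, control_read_sets, k, case_min=2, ctrl_max=0):
    """Return k-mers abundant in the case sample and absent from controls.

    case_min and ctrl_max are illustrative thresholds, not kevlar defaults.
    """
    case_counts = count_kmers(case_reads, k)
    control_counts = [count_kmers(reads, k) for reads in control_read_sets]
    return {
        kmer
        for kmer, n in case_counts.items()
        if n >= case_min and all(c[kmer] <= ctrl_max for c in control_counts)
    }


case = ["ACGTAC", "ACGTAC"]
controls = [["ACGTTT"], ["TTTTTT"]]
result = novel_kmers(case, controls, k=4)
print(sorted(result))
```

In this toy input, `ACGT` also occurs in the first control, so it is excluded even though it is abundant in the case sample.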

Loading paired reads from position-sorted BAM files

Permalink: 2018-06-12 by Daniel S. Standage in blog tags: ngs bam

BAM files with sequence alignments sorted by genomic position seem to be the new currency of exchange for large-scale human genome sequencing projects. This is convenient and practical in many ways for many people. But in my current research I work a lot with tools that only want/need the sequence information and, for whatever reasons, support only FASTA or …

more…

Information content versus data volume and k-mer counting accuracy

Keeping track of k-mers for simple operations has become a fundamental component of many bioinformatics techniques. Two common operations on k-mers are set membership queries ("is k-mer X present in data set Y?") and abundance queries ("how many times does k-mer X occur in data set Y?"). Several probabilistic data structures have been developed to support …

more…
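The two query types named in the teaser above can be sketched with a small Count-Min sketch, one well-known probabilistic structure for approximate counting. The teaser does not say which structures the post covers, so treat this choice, the class name, and the `width` and `depth` defaults as assumptions made for illustration. Its estimates can be too high when k-mers collide but never too low, so an absent k-mer is occasionally reported as present.

```python
import hashlib


class CountMinSketch:
    """Approximate k-mer counter; sizes here are illustrative only."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One salted hash per row gives a (row, column) cell for the k-mer.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def abundance(self, kmer):
        """Abundance query: an estimate that never undercounts."""
        return min(self.tables[row][col] for row, col in self._slots(kmer))

    def contains(self, kmer):
        """Membership query: may return a false positive, never a false negative."""
        return self.abundance(kmer) > 0


sketch = CountMinSketch()
read, k = "ACGTACGT", 4
for i in range(len(read) - k + 1):
    sketch.add(read[i:i + k])

print(sketch.contains("ACGT"), sketch.abundance("ACGT"))
```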

How to distinguish perfectly mapped reads from a SAM/BAM file

In which I explore read alignments in SAM format and discuss the pros and cons of various approaches to distinguishing perfect matches from imperfect matches.

more…

Super simple reverse-complement aware DNA sequence search with rcgrep

There are many wonderfully elegant and efficient tools for performing all sorts of exact and inexact searches on large collections of DNA sequences. Experience has shown, however, that these tools are usually very rigid with respect to their assumptions about input data. If input files are compressed in a certain way, or stored in a non-standard format, there is usually …

more…
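As a rough illustration of what "reverse-complement aware" search means in the title above, here is a minimal sketch. It is not rcgrep's code: the function names are hypothetical, and it assumes an exact substring match over plain lines of text, with no knowledge of any sequence file format.

```python
# Translation table for complementing DNA bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_aware_find(query, lines):
    """Yield (line number, line) for lines containing the query on either strand."""
    rc = reverse_complement(query)
    for number, line in enumerate(lines, start=1):
        if query in line or rc in line:
            yield number, line.rstrip("\n")


lines = ["TTTCGTTAAA\n", "GGGGGG\n", "xxAACGxx"]
for number, line in rc_aware_find("AACG", lines):
    print(number, line)
```

Because the search operates on arbitrary lines of text, it makes no assumption about whether the input is FASTA, FASTQ, or something non-standard.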