Daniel S. Standage

Permalink: 2017-03-02 by Daniel S. Standage in blog tags: software rcgrep frustration driven development

There are many wonderfully elegant and efficient tools for performing all sorts of exact and inexact searches on large collections of DNA sequences. Experience has shown, however, that these tools are usually very rigid with respect to their assumptions about input data. If input files are compressed in a certain way, or stored in a non-standard format, there is usually some non-trivial work and storage involved in converting the data to a format the software will accept. There are circumstances in which this extra work is justified, but many times you just need a flexible tool to do a quick-n-dirty search.

Consider the following, totally not-real never-happened-to-me hypothetical situation.

ME: I just need to search for a sequence in this text file real quick. I'll use grep.

ALSO ME: Oh, the file is gzip-compressed. Use gzgrep.

ME AGAIN: Oops, I forgot to search for the sequence's reverse complement as well. Try again.

ME, 5 MINUTES LATER: This time I want to search a different file (bzip2-compressed) for 3 sequences and their reverse complements. Invoke bzgrep with multiple -e flags.

ME, HEAD ON DESK: Uggghh, bzgrep on Linux doesn't support multiple -e flags. Pipe output of bzcat to grep.

Searching for DNA sequences in text files is a very simple task that too often is unnecessarily complicated by uninteresting and frustrating technical details.

Enter rcgrep.

rcgrep is a lightweight wrapper for the UNIX grep command and is intended to make these irrelevant details disappear as much as possible. It supports multiple queries, is reverse complement aware, handles mixtures of compressed and uncompressed files automagically, supports streaming, and can tap into grep's extended features such as case-insensitive search (grep -i) and before- and after-context reporting (grep -B and grep -A).

This tool, like a few others I've artisanally hand-crafted over the past few years, is an example of frustration-driven development. It's such a simple and obvious idea that I'm sure it's been implemented and re-implemented in dozens of ways by hundreds of researchers over the years. I hope that the small amount of work I've put into publishing this tool and making it easy to install and use will save the field a few hundred computational-biologist-hours: time much better spent on doing some awesome science.

To install, invoke pip install rcgrep or download directly from the GitHub repository at https://github.com/dib-lab/rcgrep.

Daniel S. Standage

Super simple reverse-complement aware DNA sequence search with rcgrep

Comments