Daniel S. Standage

Permalink: 2015-12-15 by Daniel S. Standage in blog tags: shell python

The UNIX shell is an indispensible tool for project organization and data management in bioinformatics. I spend a lot of time in the shell, and having picked up on a lot of time-saving techniques over the years it might just be my favorite computing environment.

The shell has its limitations, however. Piping, the very feature that gives the shell its amazing power and flexibility, can also lead to some pretty gruesome syntax. Debugging shell code is tough, the error handling is rudimentary, and good luck finding a good framework for automated tests. In short, the shell is great for interactive computing and automating simple tasks, but when it comes to workflows requiring more fine-grained control, a language like Python is often a better choice.

Here I provide a Python translation of several shell commands.

Simplest case

You don't typically get much bioinformatics work done with a single command without arguments. Anything substantial will involve data files, parameters, and so on, that are typically specified using arguments on the command line (you don't have these hard coded in a script, do you?!?!). But just for sake of completeness, it's very straightforward to execute shell commands this way in Python.

ls

The Python equivalent is as follows.

subprocess.check_call('ls')

There is even the simpler call function...

subprocess.call('ls')

...but for most situations I find the check_call more useful, since it will halt the Python code immediately if the subprocess returns a non-zero status.

Command with arguments

ls -lhp

Commands with arguments cannot simply be dropped in to the check_call command as-is. The following code will fail.

subprocess.check_call('ls -lhp')

There are two ways you can fix this: the convenient (and wrong and insecure) way...

subprocess.check_call('ls -lhp', shell=True)

...and The Right Way.

subprocess.check_call(['ls', '-lhp'])

The convenience of the first method comes at the cost of security: the shell=True setting introduces vulnerability to shell injections. This isn't the type of thing you expect to encounter much in the research setting, but it's an important consideration nonetheless, and exceptions should be made with caution.

This example is pretty silly, since you'll probably never need to call the ls command from Python. Let's do a different example you're much more likely to encounter in bioinformatics.

blastx -db /opt/ncbi/nr -query tsa.fasta \
       -evalue 1e-4 -num_threads $numthreads \
       -out tsa-vs-nr.blastx

The Python equivalent is as follows.

command = ['blastx', '-db', '/opt/ncbi/nr', '-query', 'tsa.fasta',
           '-evalue', '1e-4', '-num_threads', numthreads,
           '-out', 'tsa-vs-nr.blastx']
subprocess.check_call(command)

Redirect stdin and stdout

Many programs and commands allow you to specify input and output files as arguments, as in the blastx command above. However, sometimes your only options are stdin and/or stdout.

sed s/^scaffold_/PcanScaf/ < pcan-in.gff3 > pcan-out.gff3

Here is the Python equivalent.

with open('pcan-in.gff3', 'r') as instream, open('pcan-out.gff3', 'w') as outstream:
    subprocess.check_call(['sed', 's/^scaffold_/PcanScaf/'], stdin=instream, stdout=outstream)

The main event: pipelines

Handling input and output for single commands is great and all, but the real power of the shell is piping commands together like so.

grep -v $'\tintron\t' loci.gff3 \
    | pmrna --locus --accession --map=map.txt \
    | canon-gff3 --outfile=locus.mrnas.gff3

Unless we want to introduce security vulnerabilities, we cannot simply run these commands with a single call to the check_call function. For this use case, we want to use the Popen constructor and the communicate method.

grepproc = subprocess.Popen(['grep', '-v', '\tintron\t', 'loci.gff3'],
                             stdout=subprocess.PIPE)
pmrnaproc = subprocess.Popen(['pmrna', '--locus', '--accession', '--map=map.txt'],
                             stdin=grepproc.stdout,
                             stdout=subprocess.PIPE)
canonproc = subprocess.Popen(['canon-gff3', '--outfile=locus.mrnas.gff3'],
                             stdin=pmrnaproc.stdout)
canonproc.communicate()

If needed, it is trivial to capture the terminal output like so.

grepproc = subprocess.Popen(['grep', '-v', '\tintron\t', 'loci.gff3'],
                             stdout=subprocess.PIPE)
pmrnaproc = subprocess.Popen(['pmrna', '--locus', '--accession', '--map=map.txt'],
                             stdin=grepproc.stdout,
                             stdout=subprocess.PIPE)
canonproc = subprocess.Popen(['canon-gff3', '--outfile=locus.mrnas.gff3'],
                             stdin=pmrnaproc.stdout,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
stdout, stderr = canonproc.communicate()
for line in stderr.split('\n'):
    # process the terminal warnings

Coda

That's it!

Python's subprocess module is pretty powerful, and allows even slicker interactions with the shell, such as printing text to a pipeline of shell commands. However, writing to and reading from a pipeline simultaneously can get tricky and is prone to deadlocks. I will not cover this here, but the Internet is full of blog posts and StackOverflow threads discussing the intricacies of the subprocess for these more complicated use cases.