Generators have long been one of my favorite features of the Python programming language. As stated in the Python documentation:

When you call a [regular] function, it gets a private namespace where its local variables are created. When the function reaches a return statement, the local variables are destroyed and the value is returned to the caller. A later call to the same function creates a new private namespace and a fresh set of local variables. But, what if the local variables weren’t thrown away on exiting a function? What if you could later resume the function where it left off? This is what generators provide; they can be thought of as resumable functions.
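
To make the "resumable" part concrete, here is a tiny example of my own (not from the docs): the local variable n survives between successive calls to next(), and execution picks up right where it paused.

def countdown(n):
    while n > 0:
        yield n      # execution pauses here; the local variable n is preserved
        n -= 1

gen = countdown(3)
print(next(gen))  # 3
print(next(gen))  # 2 -- the function resumed just after the yield
print(next(gen))  # 1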

Such a simple idea, so concisely stated, and yet immensely powerful. It's difficult to overstate how useful generators are, and despite my enthusiasm for them and constant reliance on them in my daily work, even I hadn't fully appreciated their utility until recently.

Much of the software I wrote as a graduate student relies heavily on the GtNodeStream class from the GenomeTools C library. The GtNodeStream is conceptually similar to Python generators in that it implements lazy (on-demand) evaluation and retains state between calls (although it requires quite a lot of code to implement and use). The real power of node streams, as explained very well in the GenomeTools paper, is that they can be composed: any bit of data processing can be implemented as a custom node stream, and larger data processing tasks can then be addressed by chaining node streams together. I found this pattern to be a very valuable way to decompose large analysis tasks into smaller, more manageable chunks.

It wasn't until the last couple of days that I made the connection between GtNodeStreams and Python generators. In all my excitement, my burning question was this: can generators easily be composed?

As the simple toy example below shows, yes. Yes they can.

In [1]:
def source():
    for i in range(5):
        data = [i+1]
        yield data

This first generator function is very simple: each time a value is requested from it, it yields a list containing a single number. After five values, the generator is exhausted.
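
For instance, consuming it with next() and list() (my own quick check, not part of the original cells) shows the five single-element lists it produces before it runs dry.

gen = source()
print(next(gen))   # [1]
print(next(gen))   # [2]
print(list(gen))   # [[3], [4], [5]] -- the remaining values
# another next(gen) would now raise StopIteration: the generator is exhausted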

In [2]:
def transform1(instream):
    for data in instream:
        data.append('t1')
        yield data

In [3]:
def transform2(instream):
    for data in instream:
        data.append('t2')
        yield data

In [4]:
def transform3(instream):
    for data in instream:
        data.append('t3')
        yield data

These next three generator functions are trivial. For each list object the generator pulls from its input stream, it simply appends an additional value. Now, consider the behavior we observe when these generators are all composed.

In [5]:
for data in transform3(transform2(transform1(source()))):
    print(data)
[1, 't1', 't2', 't3']
[2, 't1', 't2', 't3']
[3, 't1', 't2', 't3']
[4, 't1', 't2', 't3']
[5, 't1', 't2', 't3']

Each object yielded by the source function is passed through the chain of generators, modified by each one as it passes through, until finally it is bound to the data variable in the for loop of the cell directly above. Of course this is a trivial example, but if we swap the list object for, say, a DNA sequence read, and replace these silly transform functions with code that processes DNA sequences in a useful way, then all of a sudden we have an efficient and modular framework for DNA sequence analysis.
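
As a rough sketch of what that could look like (the function names, transform steps, and the reads.txt input file here are all hypothetical, just to suggest the shape of a real pipeline):

def read_seqs(filename):
    # hypothetical source: yield one DNA sequence per line of a plain-text file
    with open(filename) as infile:
        for line in infile:
            yield line.strip()

def filter_short(instream, minlength=50):
    # drop reads shorter than minlength
    for seq in instream:
        if len(seq) >= minlength:
            yield seq

def mask_lowercase(instream):
    # replace soft-masked (lowercase) bases with 'N'
    for seq in instream:
        yield ''.join('N' if base.islower() else base for base in seq)

for seq in mask_lowercase(filter_short(read_seqs('reads.txt'))):
    print(seq)

Each stage only ever holds one read at a time, so the whole pipeline runs in constant memory no matter how large the input file is.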

What are your thoughts? Awesome sauce or old hat? Am I late to this party?
