8/14/2019 Python and Generators
Generators, comments and the data fetish
Author: Marcin Swiatek, Visimatik Inc., 2008
I worked on a tool for navigating data hierarchies stored in HDF5 files. As usually happens, at some
point I wanted to convince myself that the program would be adequate for its intended purpose [1]. To
this end, I needed to exercise it against a reasonably wide set of sample data. While I already had
examples downloaded from the HDF5 web site and the datasets I had generated for my project, I
needed more still. So I contrived a tool to help me quickly populate tables in an HDF5
file of arbitrary structure with random data [2].
This is where things get awkward. Given the sheer thrill of writing test code and the comforting
obviousness of several solutions to the problem, how do I convince at least one person to stay with me
past this paragraph? I promise to spend little time on trivial matters, like generating random numbers
or the syntax of loop statements. The article will focus on generators and reflection and suggest a
practical and entertaining use of these fairly obscure, yet very useful, aspects of Python.
This document and the associated source code can be downloaded from www.visimatik.com.
The set-up
Frequently the problem statement suggests possible solutions. And while prudent programmers should
treat solutions that 'invite themselves' with distrust, in this case the description is best taken at face
value. The tandem routines PopulateTablesInFile and PopulateTable (Listing 1) implement a
straightforward strategy: find all tables in the file and execute the prescribed number of insertions
against each table. An insertion consists of an assignment (each field gets assigned a new value),
followed by a call to the append method on an object representing the table's row.
The only aspect that calls for comment is how the assigned values are produced. This task is delegated to
callable objects I will refer to (somewhat awkwardly) as 'data makers' [3]. It is rather obvious what a
'maker' needs to do: when called, it returns a new value of the expected type. Values for all
column types except strings are ultimately generated by calls to functions imported from the
standard library module random.
It seems reasonable to associate these 'makers' with data types, or, more exactly, with types of types:
kinds [4]. However, I decided to have the front-line 'maker holder' indexed by column descriptions
rather than just types. Column descriptors carry information about the column type and, for the purpose of
this exercise, can be reduced to kinds. In the future, however, they will let me introduce another level
of indirection between the dataMakers parameter of PopulateTable and the KindMakers dictionary. I
plan to use it to populate columns with variates of different distributions, depending on data
semantics.
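The lookup described above can be tried without pytables or an HDF5 file. What follows is only a sketch under stated assumptions, not the code from TableStuffer.py: it is written for Python 3, FakeColumn is a hypothetical stand-in that models just the two attributes the lookup reads (_v_name and kind), and kind_makers and MakersByColumn are illustrative names.

```python
import random


class FakeColumn:
    """Hypothetical stand-in for a column descriptor (not the pytables class)."""

    def __init__(self, name, kind):
        self._v_name = name
        self.kind = kind


# One zero-argument callable ('maker') per kind.
kind_makers = {
    "int": lambda: random.randint(-1000, 1000),
    "bool": lambda: random.random() > 0.5,
    "float": lambda: random.expovariate(0.1),
}


class MakersByColumn:
    """Maps a column descriptor to a maker, refusing columns on a stop list."""

    def __init__(self, makers, stop_names=()):
        self._makers = makers
        self._stop_names = set(stop_names)

    def __getitem__(self, column):
        if column._v_name in self._stop_names:
            raise KeyError(column._v_name)
        return self._makers[column.kind]


makers = MakersByColumn(kind_makers, stop_names=["vID"])
columns = [
    FakeColumn("vID", "int"),
    FakeColumn("count", "int"),
    FakeColumn("flag", "bool"),
]

row = {}
for col in columns:
    try:
        row[col._v_name] = makers[col]()  # call the maker for a fresh value
    except KeyError:
        pass  # no maker for this column: leave it alone
```

Because the holder is indexed by the descriptor rather than by the kind alone, a rule based on the column name (here, the stop list) fits in without touching the per-kind dictionary.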
PopulateTablesInFile constitutes the highest-level interface to the 'table stuffer' module
(TableStuffer.py) and is used by the actual command script (TestMaker.py). There is nothing
remarkable about the command line script and we will spend no time analyzing it.
[1] This process is usually called 'testing'.
[2] There is a very long argument to be made about using synthetic data for verification of algorithms and data analysis
methods. In several fields I am familiar with, synthetic data is used often and to good results. However, in this text I
intend to focus on some features of the programming language: Python. Building models for generators of synthetic
data is out of scope of this article.
[3] I am trying hard to avoid calling these objects 'generators'. The name 'generators' might seem natural to describe a
routine that generates something. But since this article is about generators in a different meaning, it is better kept
this way. 'Makers' for objects serving data when called, 'generators' for...
[4] For practical purposes the details (like representation length) of columns' types won't really matter. Makers return
values in Python types, which are converted upon assignment. To keep things simple I have decided to ignore the
distinction between fixed-length string columns in pytables and Python strings, the only practical situation where
the exact type is indeed important.
Structures of HDF files the script can create have been predefined using the mechanism discussed
previously. These definitions can be found in Schemas/Canned.py. You will need the pytables and HDF5
libraries to run the example.
First, there was Word, or on the benefits of wordy comments
When it comes to populating text fields, the simplest solution is to generate random sequences of
characters. However, this approach has considerable drawbacks. While the practice may be adequate
for load testing, it will not do if the data is ever to be evaluated by a human being. Chance doesn't look
life-like, and a person evaluating test results based on random symbols will have considerable
difficulty finding and recalling any point of reference. The usual way out of this difficulty is to take
words from a file containing natural language text (literature classics always work best). However,
this is a truly light-weight project and hauling a large text file around with a tiny test script just seems
to be out of proportion.
How could I do without, then? The program itself is text, although admittedly with a limited and
peculiar vocabulary. Yet a good part of any source file is written for human readers: comments and
doc-strings. This will be my source of test data.
Given Python's nature, it is relatively easy to access this information at runtime. The standard
library module inspect offers several useful tools to get the task done. In the wordManufacture routine
(Listing 2), I traverse the live graph of runtime artifacts. It is worth underscoring that this is not the
graph of relations between the program's data (objects), which may refer to each other in very complex ways.
Here, we will remain on the meta level, where relations between entities (such as module, type, class
or method) are defined by the lexical structure of the program [5]. One could expect a graph with edges
defined by relations of inheritance and containment to be free of cycles. Unfortunately, this is not the
case:
>>> mmth = inspect.getmembers(__main__)
>>> mnm = [t[1] for t in mmth if t[0] == '__main__']
>>> mnm
[<module '__main__' (built-in)>]
>>> mnm[0] == __main__
True
>>>
The practical consequence of this observation is that the code cannot be treated as a tree. Normally,
one would strive to devise an algorithm avoiding infinite loops. However, in this particular situation it
made more sense for me to embrace infinity. After all, the program is supposed to generate words
until the end of time.
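The cycle can be reproduced in isolation. The sketch below is an illustration in Python 3 and not part of the article's code; the modules pkg and helper are throwaway objects built with types.ModuleType, and module_members is a hypothetical helper name.

```python
import inspect
import types

# Two throwaway modules that refer to each other, one of which also
# refers to itself (much like __main__ importing __main__).
pkg = types.ModuleType("pkg")
helper = types.ModuleType("helper")
pkg.helper = helper
helper.pkg = pkg      # cycle of length two
pkg.myself = pkg      # cycle of length one


def module_members(obj):
    """Return the module-valued members of obj."""
    return [value for _, value in inspect.getmembers(obj, inspect.ismodule)]


reaches_itself = pkg in module_members(pkg)
round_trip = any(pkg in module_members(m) for m in module_members(pkg))
```

A naive recursive walk over module_members would never terminate on this graph, which is exactly the situation the traversal has to either guard against or exploit.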
The source code associated with a programming artifact may be inspected using appropriate routines
from the inspect module. In particular, the functions getcomments and getdoc extract comments and
documentation strings, the information that best suits the purpose. Each obtained string will likely
contain several words: the smallest pieces of text that can be easily noticed, memorized and
referenced. The object producing text data will thus return individual words. Notice that the algorithm
gathering words will have to operate on a nested structure, a graph of objects containing lists of
words. A nested iteration is easy to program, but there is an additional challenge: the words need to be
returned one by one, in subsequent calls.
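As a small illustration of the two inspect calls, here is a sketch in Python 3. The helper words_of and the function sample are hypothetical names introduced for this example only. Both getdoc and getcomments may return None (getcomments does so, for instance, when the source file is not available), so the helper checks before splitting.

```python
import inspect


def words_of(obj):
    """Collect the words found in the doc-string and leading comments of obj."""
    words = []
    for text in (inspect.getdoc(obj), inspect.getcomments(obj)):
        if text:  # either call may return None
            words += text.split()
    return words


def sample():
    """Insert random rows into a table."""


doc_words = words_of(sample)
```

Splitting on whitespace keeps punctuation attached to the words, which is good enough for filler text that only has to look familiar to a human reader.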
One could gather all text up front and return words from storage. This, however, requires the graph
traversal problem to be addressed properly. The alternative is to encapsulate the process of data
extraction in a class, which would advance the iteration 'on demand', when more data is needed. There is
nothing unusual about this proposition. For instance, classes reducing a complex data structure or an
algorithm to an iteration are the favorite vehicle of database access libraries.
[5] This is a major simplification. In reality, Python's dynamic character makes the lexical structure of the program
more malleable than one may expect.
Devising a class for the task would not be difficult. The only aspect that might call for special care is the
question of representing the state of nested iterations in the object's variables.
Interestingly, in Python the task of finding the suitable representation can be delegated to the language
itself.
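To make the state-keeping concern concrete, here is a sketch in Python 3 that walks a list of texts word by word in two ways. WordCursor and word_stream are hypothetical names; the class keeps the positions of both loops in instance variables, while the generator function leaves that bookkeeping to the interpreter.

```python
class WordCursor:
    """Iterator over the words of several texts, with the state kept by hand."""

    def __init__(self, texts):
        self._texts = list(texts)
        self._outer = 0    # index of the next text to split
        self._inner = 0    # index of the next word in the current text
        self._words = []   # words of the current text

    def __iter__(self):
        return self

    def __next__(self):
        # Advance the outer loop until the inner one has a word to offer.
        while self._inner >= len(self._words):
            if self._outer >= len(self._texts):
                raise StopIteration
            self._words = self._texts[self._outer].split()
            self._outer += 1
            self._inner = 0
        word = self._words[self._inner]
        self._inner += 1
        return word


def word_stream(texts):
    """The same nested iteration; the interpreter remembers both positions."""
    for text in texts:
        for word in text.split():
            yield word


texts = ["first there was", "", "the word"]
by_hand = list(WordCursor(texts))
by_generator = list(word_stream(texts))
```

Both produce the same sequence. The difference is in how much of the traversal has to be restated as explicit state, and that difference grows with every additional level of nesting.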
The state of a computation
Suppose you invoke wordManufacture() as presented in Listing 2. What will the call return? Well, it
is easy if you try...
>>> p = wordManufacture()
>>> p
<generator object ...>
>>> p.next()
'__main__'
>>>
Instead of returning a string, as one might have expected, wordManufacture() returns an object: a
generator. According to the Python documentation, it is enough to place a yield statement in the
function's body to make the interpreter create a wholly different code execution structure and, in place
of a normal function, produce a generator function.
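The effect of a single yield can be checked directly. The following sketch assumes Python 3, where the built-in next(gen) replaces the gen.next() method used elsewhere in this article; countdown and log are illustrative names.

```python
import inspect

log = []


def countdown(start):
    log.append("body started")  # runs on the first next(), not at call time
    while start > 0:
        yield start
        start -= 1


gen = countdown(2)
called_only = list(log)  # snapshot taken before any next() call
first = next(gen)        # Python 3 spelling of gen.next()
```

Calling countdown(2) executes none of the function's body; it only builds the generator object. The body starts running when the first value is requested.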
I find it convenient to look at generators in a similar way as at iterators [6]. One could say an
iterator represents an iteration. By the same token, a generator can be thought of as an object
representing, and permitting some control over, the flow of a computation. In this context control
amounts to something very much akin to stepping through an iteration. However, the routine
which is to be controlled through a generator needs to be written in a specific way, with
explicit definition of junction points, where the generator function will communicate with the
outside. In Python, these junction points are defined using the yield keyword.
Upon invocation of the next() method of a generator object, the related generator routine
will execute up until the next yield in its code. A call to next() returns whatever the generator
routine yields.
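Stepping from one junction point to the next can be observed by recording what the routine does between yields. This is a sketch in Python 3 (hence next(gen) instead of gen.next()); stages and trace are names made up for the illustration.

```python
trace = []


def stages():
    trace.append("before first yield")
    yield "one"
    trace.append("between yields")
    yield "two"
    trace.append("after last yield")


gen = stages()
a = next(gen)
snapshot_a = list(trace)  # only the code up to the first yield has run
b = next(gen)
snapshot_b = list(trace)  # execution resumed and stopped at the second yield

try:
    next(gen)             # runs the remaining code, then the routine ends
    finished = False
except StopIteration:
    finished = True
```

Each call advances the routine by exactly one segment, and the end of the routine is reported to the caller as a StopIteration exception.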
In earlier versions of Python (before PEP 342), the interface of a generator was exactly that of an
iterator and the construct lacked the two-way communication enabled by the send method of
the interface. Thus, generator functions were just a way to code some iterations more conveniently.
Most examples given in the literature reinforce this association.
[6] When discussing the concept, I much prefer to focus on the design pattern, rather than on its interpretation in a
specific language. Oddly enough, I could not find on-line an explanation that I liked. Wikipedia does have an article
(http://en.wikipedia.org/wiki/Iterator_pattern), but it is poorly written, in my opinion. This one
(http://www.patterndepot.com/put/8/iterator.pdf) seems good, but the link just looks like it is not going to last. The
seminal GoF book 'Design Patterns' brings a good discussion, but it is not available as an on-line reference.
Generators in Python are now something more than just enhanced iterators. In the scenario
described here, one-way communication implies that one routine plays back another, as if
having it perform a certain task. However, it is possible to communicate in both directions,
through the send method of the generator interface and the return value of the yield
expression. This enables compositions where two or more routines collaborate on some task
(the term for that is, I believe, cooperative multitasking). In other words, generator
functions can be coroutines, with all the associated benefits. For instance, it is a natural way of
expressing several interesting algorithms.
Programming coroutines is an interesting topic in its own right, but a fairly broad one, too. It
will not be discussed here; instead, I refer all readers interested in writing coroutines in Python
to the already mentioned PEP 342 and other resources on the web.
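For readers who want a first taste of the two-way communication, here is a minimal sketch in Python 3. The running_average routine is an illustrative example and has nothing to do with the table stuffer; it only shows send() passing a value in and the yield expression handing a value out.

```python
def running_average():
    """Coroutine: receives numbers through send(), yields the mean so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average  # hand out the mean, then wait for input
        total += value
        count += 1
        average = total / count


avg = running_average()
next(avg)  # prime the coroutine: run it up to the first yield
results = [avg.send(x) for x in (10, 20, 60)]
avg.close()
```

The first next() call is needed because a value can only be sent to a generator that is already suspended at a yield.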
The design presented in this article also follows the established usage pattern of treating a generator
as an iterator. However, please keep in mind that while the similarities between generators and iterators
are really difficult to overlook, one should not reduce one to the other.
Generators and iterators are most often employed in the context of a for loop, which in Python is
always about iterating through something (an iterable) using an iterator object. The loop's semantics
completely hides the use of iterators, usually to the programmer's great benefit. Owing to that, the
behavior of iterators is usually of concern only in the context of writing container classes.
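What the for loop hides can be spelled out by hand. The sketch below, in Python 3, is roughly equivalent to a plain loop over the generator; squares is a made-up example routine.

```python
def squares(limit):
    for n in range(limit):
        yield n * n


# Roughly what "for value in squares(4): collected.append(value)" does.
collected = []
iterator = iter(squares(4))
while True:
    try:
        value = next(iterator)
    except StopIteration:
        break  # the loop ends quietly when the iterator is exhausted
    collected.append(value)

with_for = [value for value in squares(4)]
```

The exception handling is part of the loop's contract, which is why it rarely shows up in everyday code.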
However, this example requires placing a generator outside the usual context. Adapting an iterator or
a generator to a callable interface is not complicated, but requires some care. In my example, the
generator is adapted to a callable interface through a thin wrapper object. Its __call__ method
explicitly invokes the generator's next() method. Notice the try-except clause surrounding this call.
According to the iterator and generator contracts, the StopIteration exception is used to signal that there
are no more elements in the collection (or no more computations to perform). Hence, this is something
we must expect [7]; the semantics of the for construct in Python includes exception handling, but here it is
up to the programmer. The wordSmith class in Listing 2 implements the wrapper.
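The same adaptation, reduced to its essentials and written for Python 3, might look as follows. This is a sketch and not the wordSmith class itself: CallableStream and letters are hypothetical names, and the wrapper takes the generator function as a parameter instead of hard-coding it.

```python
class CallableStream:
    """Adapts a generator function to a callable returning one item per call."""

    def __init__(self, factory):
        self._factory = factory
        self._gen = factory()

    def __call__(self):
        try:
            return next(self._gen)
        except StopIteration:
            self._gen = self._factory()  # exhausted: start over
            return next(self._gen)


def letters():
    yield "a"
    yield "b"


maker = CallableStream(letters)
served = [maker() for _ in range(5)]
```

One caveat of this simple form: if the generator function yields nothing at all, the second next() raises StopIteration again and the exception reaches the caller.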
Remaining business
In order to ensure that the wordManufacture routine stops, I have introduced a primitive counter into the
algorithm. While it came in useful in debugging, in real life I would use a different way of extracting
finite sequences from an infinite-loop generator. The standard itertools module brings a variety of
tools extending the standard iterator mechanism [8]. One of them, islice, fits the purpose perfectly.
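A brief sketch of that approach, in Python 3, with a made-up endless generator standing in for wordManufacture:

```python
from itertools import count, islice


def endless_words():
    """Stand-in for an infinite generator; it never finishes on its own."""
    for n in count(1):
        yield "word%d" % n


stream = endless_words()
first_three = list(islice(stream, 3))  # take three items and stop
next_two = list(islice(stream, 2))     # the generator resumes where it left off
```

The generator itself stays free of any counting logic; the limit is imposed from the outside, by the consumer.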
[7] A generator will raise StopIteration when its generator function terminates.
[8] The documentation of the itertools module also brings several examples of the use of the yield statement.
Listings
Listing 1: The essential interface
import random
import time

# wordSmith is defined in Listing 2.
KindMakers = {
    'string' : wordSmith(),
    'int' : lambda : random.randint(-1000, 1000),
    'uint' : lambda : random.randint(0, 1000),
    'bool' : lambda : random.random() > 0.5,
    'float': lambda : random.expovariate(.1),
    'complex' : lambda : random.gauss(0, 100.0) + 1j * random.expovariate(.1),
    'time' : lambda : time.time() + random.gauss(0, 10000.0)}

class MakerFromColumn(object):
    def __init__(self, kinddct = KindMakers, stoplst = []):
        self._src = kinddct
        self._stoplist = stoplst[:]

    def __getitem__(self, key):
        # Columns named on the stop list get no maker at all.
        if key._v_name in self._stoplist:
            raise KeyError(key)
        return self._src[key.kind]

TypedMakers = MakerFromColumn(KindMakers, ['vID'])

def PopulateTable(table_obj, count = 1, dataMakers = TypedMakers):
    """
    The function will insert into the data table 'count' rows that will be generated
    by the supplied dataMakers.
    dataMakers object is expected to implement simple indexing operator, with
    column descriptions (tables.Col) as parameter. See the default value of
    dataMakers (TypedMakers) for an implementation.
    """
    # Pair every column that has a maker with that maker.
    boundflds = []
    for col_path in table_obj.colpathnames:
        try:
            data_gen = dataMakers[table_obj.coldescrs[col_path]]
            if data_gen:
                boundflds.append((col_path, data_gen))
        except KeyError:
            pass

    accs = table_obj.row
    for i in xrange(count):
        for (key, gen) in boundflds:
            accs[key] = gen()
        accs.append()
    table_obj.flush()

def PopulateTablesInFile(hdf5File, rowcounts = {}, defcount = 100):
    for tb in hdf5File.walkNodes("/", "Table"):
        PopulateTable(tb, rowcounts.get(tb.name, defcount))
Listing 2: The data maker for the 'string' kind
def wordManufacture(max_iter_count = -1):
    import __main__
    import inspect

    item_queue = []
    filterette = lambda itm : (inspect.isclass(itm) or inspect.isroutine(itm) or
                               inspect.ismodule(itm))

    while max_iter_count != 0:
        try:
            this_object = item_queue.pop(0)
            nextpass = [tp[1] for tp in inspect.getmembers(this_object, filterette)]
            # An object that lists itself among its members is moved to the end.
            if this_object in nextpass:
                nextpass.remove(this_object)
                nextpass.append(this_object)

            item_queue += nextpass

            txtlst = []
            doc_string = inspect.getdoc(this_object)
            comment_string = inspect.getcomments(this_object)
            if doc_string is not None:
                txtlst += doc_string.split()
            if comment_string is not None:
                txtlst += comment_string.split()

            for wrd in txtlst:
                yield wrd

        except IndexError:
            # The queue is empty: start (again) from the main module.
            assert not item_queue
            item_queue.append(__main__) #da capo...

        if max_iter_count > 0:
            max_iter_count -= 1

class wordSmith(object):
    def __init__(self):
        self.__gen = wordManufacture()

    def __call__(self):
        try:
            return self.__gen.next()
        except StopIteration:
            # The generator has finished: create a fresh one and carry on.
            self.__gen = wordManufacture()
            return self.__gen.next()