python and generators

Upload: motek

Post on 31-May-2018


  • 8/14/2019 Python and Generators

    1/6

    Generators, comments and the data fetish

    Author: Marcin Swiatek, Visimatik Inc., 2008

    I worked on a tool for navigating data hierarchies stored in HDF5 files. As usually happens, at some
    point I wanted to convince myself that the program would be adequate for its intended purpose [1]. To
    this end, I needed to exercise it against a reasonably wide set of sample data. While I already had
    examples downloaded from the HDF5 web site and the datasets I had generated for my project, I
    needed more still. So I contrived a tool to help me quickly populate tables in an HDF5
    file of arbitrary structure with random data [2].

    This is where things get awkward. Given the sheer thrill of writing test code and the comforting

    obviousness of several solutions to the problem, how do I convince at least one person to stay with me

    past this paragraph? I promise to spend little time on trivial matters, like generating random numbers

    or the syntax of loop statements. The article will focus on generators and reflection and suggest a

    practical and entertaining use of these fairly obscure, yet very useful, aspects of Python.

    This document and the associated source code can be downloaded from www.visimatik.com.

    The set-up

    Frequently the problem statement suggests possible solutions. And while prudent programmers should
    treat solutions that 'invite themselves' with distrust, in this case the description is best taken at
    face value. The tandem routines PopulateTablesInFile and PopulateTable (Listing 1) implement a
    straightforward strategy: find all tables in the file and execute the prescribed number of insertions
    against each table. An insertion consists of an assignment (each field gets assigned a new value),
    followed by a call to the append method on an object representing the table's row.

    The only aspect that calls for comment is how assigned values are produced. This task is delegated to
    callable objects I will refer to (somewhat awkwardly) as 'data makers' [3]. It is rather obvious what a
    'maker' needs to do: when called, it returns a new value of the expected type. Values for all
    column types, except strings, are ultimately generated by calls to functions imported from the
    standard library module random.

    It seems reasonable to associate these 'makers' with data types, or, more exactly, with types of types:
    kinds [4]. However, I decided to have the front-line 'maker holder' indexed by column descriptions,
    rather than just types. Column descriptors carry information about column type and for the purpose of
    this exercise can be reduced to kinds. In the future, however, they will let me introduce another level
    of indirection between the dataMakers parameter of PopulateTable and the KindMakers dictionary. I
    plan to use it to populate columns with variates of different distributions, depending on data
    semantics.

    PopulateTablesInFile constitutes the highest-level interface to the 'table stuffer' module
    (TableStuffer.py) and is used by the actual command script (TestMaker.py). There is nothing
    remarkable about the command line script and we will spend no time analyzing it.

    [1] This process is usually called 'testing'.

    [2] There is a very long argument to be made about using synthetic data for verification of algorithms and data analysis
    methods. In several fields I am familiar with, synthetic data is used often and to good results. However, in this text I
    intend to focus on some features of the programming language: Python. Building models for generators of synthetic
    data is out of scope of this article.

    [3] I am trying hard to avoid calling these objects 'generators'. The name 'generators' might seem natural to describe a
    routine that generates something. But since this article is about generators in a different meaning, it is better kept
    this way. 'Makers' for objects serving data when called, 'generators' for...

    [4] For practical purposes the details (like representation length) of columns' types won't really matter. Makers return
    values in Python types, which are converted upon assignment. To keep things simple I have decided to ignore the
    distinction between fixed-length string columns in pytables and Python strings, the only practical situation where
    the exact type is indeed important.

    The structures of the HDF5 files the script can create have been predefined using the mechanism discussed
    previously. These definitions can be found in Schemas/Canned.py. You will need the pytables and HDF5
    libraries to run the example.

    First, there was Word, or on the benefits of wordy comments

    When it comes to populating text fields, the simplest solution is to generate random sequences of
    characters. However, this approach has considerable drawbacks. While the practice may be adequate
    for load testing, it will not do if the data is ever to be evaluated by a human being. Chance doesn't look
    life-like, and a person evaluating test results based on random symbols will have considerable
    difficulty finding and recalling any point of reference. The usual way out of this difficulty is to take
    words from a file containing natural language text (literature classics always work best). However,
    this is a truly light-weight project, and hauling a large text file around with a tiny test script just seems
    out of proportion.

    How could I do without, then? The program itself is text, although admittedly with a limited and

    peculiar vocabulary. Yet a good part of any source file is written for human readers: comments and

    doc-strings. This will be my source of test data.

    Given Python's nature, it is relatively easy to access this information at runtime. The standard
    library module inspect offers several useful tools to get the task done. In the wordManufacture routine
    (Listing 2), I traverse the live graph of runtime artifacts. It is worth underscoring that this is not the
    graph of relations of the program's data (objects), which may refer to each other in very complex ways.
    Here, we will remain on the meta level, where relations between entities (such as module, type, class
    or method) are defined by the lexical structure of the program [5]. One could expect a graph with edges
    defined by relations of inheritance and containment to be free of cycles. Unfortunately, this is not the
    case:

    >>> mmth = inspect.getmembers(__main__)
    >>> mnm = [t[1] for t in mmth if t[0] == '__main__']
    >>> mnm
    [<module '__main__' (built-in)>]
    >>> mnm[0] == __main__
    True
    >>>

    The practical consequence of this observation is that the code cannot be treated as a tree. Normally,

    one would strive to devise an algorithm avoiding infinite loops. However, in this particular situation it

    made more sense for me to embrace infinity. After all, the program is supposed to generate words

    until the end of time.
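The effect of such a cycle on a naive traversal can be reproduced in isolation. The following is only a sketch, not the article's code: the module `demo` and the helper `count_visits` are hypothetical names made up for the illustration, and the module is built at runtime so that the result does not depend on how the script is started.

```python
import inspect
import types

# Hypothetical module built at runtime, so the example does not depend on
# how the script itself is launched.
demo = types.ModuleType("demo")
demo.itself = demo  # a module that lists itself among its members


def count_visits(root, limit):
    """Breadth-first walk over module members, stopping after `limit` visits."""
    queue = [root]
    visits = 0
    while queue and visits < limit:
        current = queue.pop(0)
        visits += 1
        queue += [member for _, member in inspect.getmembers(current, inspect.ismodule)]
    return visits


# Without the limit the walk would never finish: the queue is refilled
# with `demo` every time `demo` is visited.
visits = count_visits(demo, limit=10)
```

The walk uses up the whole allowance of ten visits although there is only one module to see, which is exactly the behavior a tree-walking algorithm would have to guard against.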

    The source code associated with a programming artifact may be inspected using appropriate routines
    from the inspect module. In particular, the functions getcomments and getdoc extract comments and
    documentation strings, the information that best suits the purpose. Each obtained string will likely
    contain several words: the smallest pieces of text that can be easily noticed, memorized and
    referenced. The object producing text data will thus return individual words. Notice that the algorithm
    gathering words will have to operate on a nested structure, a graph of objects containing lists of
    words. A nested iteration is easy to program, but there is an additional challenge: the words need to be
    returned one by one, in subsequent calls.
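As a small illustration of the two inspect functions mentioned above, the sketch below collects the words attached to a single function. The names `sample` and `words_of` are invented for this example. Both `inspect.getdoc` and `inspect.getcomments` return None when there is nothing to report, so the result is checked before splitting.

```python
import inspect


def sample(values):
    """Return the first element of a sequence."""
    return values[0]


def words_of(obj):
    """Collect the words found in an object's doc-string and comments."""
    words = []
    for text in (inspect.getdoc(obj), inspect.getcomments(obj)):
        if text is not None:  # both functions return None when nothing is found
            words += text.split()
    return words


doc_words = words_of(sample)
```

Here all the words come from the doc-string, since no comment block directly precedes the definition of `sample`.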

    One could gather all text up front and return words from storage. This, however, requires the graph
    traversal problem to be addressed properly. The alternative is to encapsulate the process of data
    extraction in a class, which would advance the iteration 'on demand', when more data is needed. There is
    nothing unusual about this proposition. For instance, classes reducing a complex data structure or an
    algorithm to an iteration are the favorite vehicle of database access libraries.

    [5] This is a major simplification. In reality, Python's dynamic character makes the lexical structure of the program
    more malleable than one may expect.

    Devising a class for the task would not be difficult. The only aspect that might call for special care is the
    question of representing the state of nested iterations in the object's variables.

    Interestingly, in Python the task of finding the suitable representation can be delegated to the language

    itself.
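To make the bookkeeping concrete, here is a sketch of what such a hand-written class might look like for a much simpler problem: returning, one by one, the words of a list of texts. The class name `WordCursor` and its attributes are assumptions made for this illustration; nothing like it appears in the article's source code.

```python
class WordCursor(object):
    """Hand-written iterator over the words of a list of texts.

    The position in the nested loops has to be stored explicitly:
    one index for the current text, one for the current word.
    """

    def __init__(self, texts):
        self._lines = [text.split() for text in texts]
        self._line_no = 0  # state of the outer loop
        self._word_no = 0  # state of the inner loop

    def __iter__(self):
        return self

    def __next__(self):
        while self._line_no < len(self._lines):
            line = self._lines[self._line_no]
            if self._word_no < len(line):
                word = line[self._word_no]
                self._word_no += 1
                return word
            # Inner loop exhausted: advance the outer loop.
            self._line_no += 1
            self._word_no = 0
        raise StopIteration

    next = __next__  # Python 2 spelling of the same method


cursor = WordCursor(["first text", "", "second one here"])
words = list(cursor)
```

Two nested loops already need two counters and a hand-coded transition between them. With a generator function the same state lives in ordinary local variables and loop statements.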

    The state of a computation

    Suppose you invoke wordManufacture() as presented in Listing 2. What will the call return? Well, it
    is easy if you try...

    >>> p = wordManufacture()
    >>> p
    <generator object at 0x...>
    >>> p.next()
    '__main__'
    >>>

    Instead of returning a string, as one might have expected, wordManufacture() returns an object: a
    generator. According to the Python documentation, it is enough to place a yield statement in the
    function's body to make the interpreter create a wholly different code execution structure and, in place
    of a normal function, produce a generator function.
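The point is easy to check on a toy example. The sketch below is written for current Python, where the built-in `next(gen)` plays the role of the `gen.next()` method used in the article. The function `countdown` is a made-up example, not part of the article's code.

```python
import types


def countdown(start):
    """A generator function: the yield statement in the body makes it one."""
    print("body starts running")
    while start > 0:
        yield start
        start -= 1


gen = countdown(3)          # nothing is printed yet: the body has not started
is_generator = isinstance(gen, types.GeneratorType)
first = next(gen)           # now the body runs up to the first yield
rest = list(gen)            # drain the remaining values
```

Calling `countdown(3)` merely creates the generator object. The body of the function does not start executing until the first value is requested.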

    I find it convenient to look at generators in a similar way as at iterators [6]. One could say an
    iterator represents an iteration. By the same token, a generator can be thought of as an object
    representing, and permitting some control over, the flow of a computation. In this context control
    amounts to something very much akin to stepping through an iteration. However, the routine which is
    to be controlled through a generator needs to be written in a specific way, with explicit definition of
    junction points, where the generator function will communicate with the outside. In Python, these
    junction points are defined using the yield keyword.

    Upon invocation of the next() method of a generator object, the related generator routine will execute
    up until the next yield in its code. A call to next() returns whatever the generator routine yields.
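The stepping behavior can be observed by letting a generator routine leave a trace between its junction points. This is an illustrative sketch for current Python (built-in `next()` instead of the `next()` method). The names `two_steps`, `log` and `snapshots` are invented for the example.

```python
log = []


def two_steps():
    log.append("running up to the first yield")
    yield "a"
    log.append("resumed, running up to the second yield")
    yield "b"
    log.append("resumed, running to the end")


gen = two_steps()
snapshots = [list(log)]  # before any call to next(): nothing has run
value_a = next(gen)
snapshots.append(list(log))
value_b = next(gen)
snapshots.append(list(log))
finished = False
try:
    next(gen)            # runs the code after the last yield, then signals the end
except StopIteration:
    finished = True
```

Each call advances the routine by exactly one segment of its body. The third call executes the code after the last yield and ends with StopIteration, because there is nothing left to yield.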

    In earlier versions of Python (pre PEP-342), the interface of a generator was exactly that of an
    iterator and the construct lacked the two-way communication enabled by the send method of
    the interface. Thus, generator functions were just a way to code some iterations more conveniently.
    Most examples given in the literature reinforce this association.

    [6] When discussing the concept, I much prefer to focus on the design pattern, rather than on its interpretation in a
    specific language. Oddly enough, I could not find an on-line explanation that I liked. Wikipedia does have an article,
    but it is poorly written, in my opinion. This one (http://www.patterndepot.com/put/8/iterator.pdf) seems good, but
    the link just looks like it is not going to last. The seminal GoF book 'Design Patterns' offers a good discussion, but
    it is not available as an on-line reference.

    Generators in Python are now something more than just enhanced iterators. In the scenario
    described here, one-way communication implies that one routine plays back another, as if
    having it perform a certain task. However, it is possible to communicate in both directions,
    through the send method of the generator interface and the return value of the yield
    expression. This enables compositions where two or more routines collaborate on some task
    (the term for that is, I believe, cooperative multitasking). In other words, generator
    functions can be coroutines, with all the associated benefits. For instance, they are a natural
    way of expressing several interesting algorithms. Programming coroutines is an interesting topic
    in its own right, but a fairly broad one, too. It will not be discussed here; instead I refer all
    readers interested in writing coroutines in Python to the already mentioned PEP-342 and other
    resources on the web.
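For readers who have not seen the two-way interface at work, here is a minimal sketch of it, written for current Python. The `running_average` coroutine is a stock illustration chosen for brevity and is not taken from the article's source code.

```python
def running_average():
    """Coroutine: receives numbers through send(), yields the average so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # yield hands out a result and waits for input
        total += value
        count += 1
        average = total / count


averager = running_average()
next(averager)                  # advance to the first yield before sending
results = [averager.send(x) for x in (10, 20, 60)]
```

The value passed to `send` becomes the result of the yield expression inside the routine, and the value yielded next becomes the result of `send`. The initial `next` call is needed because a freshly created generator has not yet reached a yield at which it could receive anything.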


    The design presented in this article also follows that established usage pattern. However, please keep
    in mind that while similarities between generators and iterators are really difficult to overlook, one
    should not reduce one to the other.

    Generators and iterators are most often employed in the context of a for loop, which in Python is
    always about iterating through something (an iterable) using an iterator object. The loop's semantics
    completely hides the use of iterators, usually to the programmer's great benefit. Owing to that, the
    behavior of iterators is usually of concern only in the context of writing container classes.
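What the for statement hides can be spelled out by hand. The sketch below, for current Python, runs the same generator twice: once through a for loop and once through explicit calls to `iter` and `next`. The generator `squares` is a made-up example.

```python
def squares(limit):
    for n in range(limit):
        yield n * n


# What the for statement does on the programmer's behalf.
collected_by_for = []
for value in squares(4):
    collected_by_for.append(value)

# The same loop with the iterator protocol spelled out.
collected_by_hand = []
iterator = iter(squares(4))
while True:
    try:
        value = next(iterator)
    except StopIteration:
        break
    collected_by_hand.append(value)
```

The second form is roughly what the interpreter performs for every for loop, including the handling of StopIteration that ends the loop.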

    However, this example requires placing a generator outside the usual context. Adapting an iterator or
    a generator to a callable interface is not complicated, but requires some care. In my example, the
    generator is adapted to a callable interface through a thin wrapper object. Its __call__ method
    explicitly invokes the generator's next() method. Notice the try-except clause surrounding this call.
    According to the iterator and generator contracts, the StopIteration exception is used to signal that
    there are no more elements in the collection (or no more computations to perform). Hence, this is something
    we must expect [7]; the semantics of the for construct in Python includes exception handling, but here it is
    up to the programmer. The wordSmith class in Listing 2 implements the wrapper.
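The same adaptation can be sketched in a generic form for current Python. The class below is not the article's wordSmith: `CallableFromGenerator`, its `factory` argument and the `abc` generator are hypothetical names chosen for the illustration, and the built-in `next()` replaces the `next()` method.

```python
class CallableFromGenerator(object):
    """Adapts a generator function to a plain callable that never runs dry."""

    def __init__(self, factory):
        self._factory = factory      # generator function, called to (re)start
        self._gen = factory()

    def __call__(self):
        try:
            return next(self._gen)
        except StopIteration:
            # The generator finished: start over and serve its first value.
            self._gen = self._factory()
            return next(self._gen)


def abc():
    for letter in "abc":
        yield letter


make_letter = CallableFromGenerator(abc)
letters = [make_letter() for _ in range(7)]
```

The wrapper assumes that a restarted generator yields at least one value. If it yields nothing, the second `next` call lets StopIteration escape to the caller.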

    Remaining business

    To make sure the wordManufacture routine can stop, I have introduced a primitive counter into the
    algorithm. While it came in useful in debugging, in real life I would use a different way of extracting
    finite sequences from an infinite-loop generator. The standard itertools module brings a variety of
    tools extending the standard iterator mechanism [8]. One of them, islice, fits the purpose perfectly.
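A short sketch of that alternative, for current Python: `itertools.islice` takes a finite slice out of a generator that would otherwise never stop. The generator `endless_words` is a hypothetical stand-in for an infinite word source and is not the article's wordManufacture.

```python
import itertools


def endless_words(vocabulary):
    """Hypothetical stand-in for an infinite word generator."""
    while True:
        for word in vocabulary:
            yield word


source = endless_words(["alpha", "beta", "gamma"])
first_five = list(itertools.islice(source, 5))
next_two = list(itertools.islice(source, 2))   # continues where the first slice stopped
```

Because islice consumes the underlying generator lazily, the second slice picks up exactly where the first one ended, and the generator itself needs no counter.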

    [7] A generator function will raise StopIteration upon termination.

    [8] The documentation of the itertools module also brings several examples of use of the yield statement.

    Listings

    Listing 1: The essential interface

    import random
    import time

    # One 'maker' per column kind; each returns a fresh value when called.
    # wordSmith is defined in Listing 2.
    KindMakers = {
        'string'  : wordSmith(),
        'int'     : lambda : random.randint(-1000, 1000),
        'uint'    : lambda : random.randint(0, 1000),
        'bool'    : lambda : random.random() > 0.5,
        'float'   : lambda : random.expovariate(.1),
        'complex' : lambda : random.gauss(0, 100.0) + 1j * random.expovariate(.1),
        'time'    : lambda : time.time() + random.gauss(0, 10000.0)}

    class MakerFromColumn(object):
        def __init__(self, kinddct = KindMakers, stoplst = []):
            self._src = kinddct
            self._stoplist = stoplst[:]

        def __getitem__(self, key):
            # Columns named in the stop list get no maker at all.
            if key._v_name in self._stoplist:
                raise KeyError(key)
            return self._src[key.kind]

    TypedMakers = MakerFromColumn(KindMakers, ['vID'])

    def PopulateTable(table_obj, count = 1, dataMakers = TypedMakers):
        """
        The function will insert into the data table 'count' rows that will be generated
        by the supplied dataMakers.
        dataMakers object is expected to implement simple indexing operator, with
        column descriptions (tables.Col) as parameter. See the default value of
        dataMakers (TypedMakers) for an implementation.
        """
        # Bind a maker to every column for which one is available.
        boundflds = []
        for col_path in table_obj.colpathnames:
            try:
                data_gen = dataMakers[table_obj.coldescrs[col_path]]
                if data_gen:
                    boundflds.append((col_path, data_gen))
            except KeyError:
                pass

        # Fill and append the requested number of rows.
        accs = table_obj.row
        for i in xrange(count):
            for (key, gen) in boundflds:
                accs[key] = gen()
            accs.append()
        table_obj.flush()

    def PopulateTablesInFile(hdf5File, rowcounts = {}, defcount = 100):
        for tb in hdf5File.walkNodes("/", "Table"):
            PopulateTable(tb, rowcounts.get(tb.name, defcount))

    Listing 2: The data maker for the 'string' kind

    def wordManufacture(max_iter_count = -1):
        import __main__
        import inspect

        item_queue = []
        # Only classes, routines and modules are worth visiting.
        filterette = lambda itm : (inspect.isclass(itm) or inspect.isroutine(itm) or
                                   inspect.ismodule(itm))

        while max_iter_count != 0:
            try:
                this_object = item_queue.pop(0)
                nextpass = [tp[1] for tp in inspect.getmembers(this_object, filterette)]
                if this_object in nextpass:
                    nextpass.remove(this_object)
                    nextpass.append(this_object)
                item_queue += nextpass

                # Gather the words of the doc-string and of the comments.
                txtlst = []
                doc_string = inspect.getdoc(this_object)
                comment_string = inspect.getcomments(this_object)
                if doc_string is not None:
                    txtlst += doc_string.split()
                if comment_string is not None:
                    txtlst += comment_string.split()

                for wrd in txtlst:
                    yield wrd
            except IndexError:
                assert not item_queue
                item_queue.append(__main__) #da capo...

            if max_iter_count > 0:
                max_iter_count -= 1

    class wordSmith(object):
        def __init__(self):
            self.__gen = wordManufacture()

        def __call__(self):
            try:
                return self.__gen.next()
            except StopIteration:
                # The generator ran out: start a new one.
                self.__gen = wordManufacture()
                return self.__gen.next()