mnh csv python

Post on 07-Jul-2015

65 Views

Category:

Software

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

PyGotham 09:45 AM - 10:45 AM on August 17, 2014. If you're new to Python, you might find that you're using Python as if it were C. This talk will demonstrate how to take advantage of Python's special data structures to build tools for analyzing and creating nicely-formatted reports from CSV data.

TRANSCRIPT

Building flexible tools to store sums and report on CSV data

Presented by

Margery Harrison

Audience level: Novice09:45 AM - 10:45 AM

August 17, 2014Room 704

Python Flexibility

● Basic, Fortran, C, Pascal, Javascript,...● At some point, there's a tendency to think

the same way, and just translate it● You can write Python as if it were C● Or you can take advantage of Python's

special data structures.● The second option is a lot more fun.

Using Python data structures to report on CSV data

● Lists● Sets● Tuples● Dictionaries● CSV Reader

● DictReader● Counter

Also,

● Using tuples as dictionary keys● Using enumerate() to count how many

times you've looped– See “Loop like a Native”

http://nedbatchelder.com/text/iter.html

Code Development Method

● Start with simplest possible version● Test and validate● Iterative improvements

– Make it prettier

– Make it do more

– Make it more general

This is a CSV file

color,size,shape,number

red,big,square,3

blue,big,triangle,5

green,small,square,2

blue,small,triangle,1

red,big,square,7

blue,small,triangle,3

https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg

http://i239.photobucket.com/albums/ff263/peacelovebones/two-pandas-rolling-1.jpg

CSV DictReader

>>> import csv

>>> import os

>>> with open("simpleCSV.txt") as f:

... r=csv.DictReader(f)

... for row in r:

... print row

...

Running DictReader

DictReader is sequential

Tabulate All Possible Values

How many of each?

● It's nice to have a listing that shows the variety of objects that can appear in each column.

● Next, we'd like to count how many of each● And guess what? Python has a special data

structure for that.

collections.Counter

Playing with Counters

Index into Counters

Counter + DictReader

Let's use counters to tell us how many of each value was in each column.

Print number of each value

Output

colorblue : 3green : 1red : 2

shapesquare : 3triangle: 3

number1 : 13 : 22 : 15 : 17 : 1

sizesmall : 3big : 3

You might ask, why not this?

for row in r: for head in r.fieldnames: field_value = row[head] possible_values[head].add(field_value) #count_of_values.update(row[head]) count_of_values.update(field_value) print count_of_values

Because

Counter({'e': 13, 'l': 12, 'a': 9, 'r': 9, 'g': 7, 'b': 6, 'i': 6, 's': 6, 'u': 6, 'n': 4, 'm': 3, 'q': 3, 't': 3, 'd': 2, '3': 2, '1': 1, '2': 1, '7': 1, '5': 1})

color

blue : 0

green : 0

red : 0

shapesquare : 0triangle: 0

number1 : 13 : 22 : 15 : 17 : 1

sizesmall : 0big : 0

Output

colorblue : 3green : 1red : 2

shapesquare : 3triangle: 3

number1 : 13 : 22 : 15 : 17 : 1

sizesmall : 3big : 3

How many red squares?

● We can use tuples as an index into the counter

– (red,square)

– (big,red,square)

– (small,blue,triangle)

– (small,square)

Let's use a simpler CSV

color,size,shape

red,big,square

blue,big,triangle

green,small,square

blue,small,triangle

red,big,square

blue,small,triangle

Counting Tuplestrying to use magic update()

>>> c=collections.Counter([('a,b'),('c,d,e')])>>> cCounter({'a,b': 1, 'c,d,e': 1})>>> c.update(('a','b'))>>> cCounter({'a': 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})>>> c.update((('a','b'),))>>> cCounter({'a': 1, ('a', 'b'): 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})

Oh well>>> c.update([(('a','b'),)])>>> cCounter({'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1, ('a', 'b'): 1})>>> c[('a','b')]1>>> c[('a','b')]+=5>>> cCounter({('a', 'b'): 6, 'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1})

Combo Count Part 1: Initialize

Combo Count 2: Counting

Combo Count 3: Printing

Combo Count Outputcolorblue : 33 blue in 1 combinations:('blue', 'big'): 1('blue', 'small'): 23 blue in 2 combinations:('blue', 'big', 'triangle'): 1('blue', 'small', 'triangle'): 2green : 11 green in 1 combinations:('green', 'small'): 11 green in 2 combinations:('green', 'small', 'square'): 1red : 22 red in 1 combinations:('red', 'big'): 22 red in 2 combinations:('red', 'big', 'square'): 2

shapesquare : 33 square in 1 combinations:3 square in 2 combinations:('red', 'big', 'square'): 2('green', 'small', 'square'): 1triangle: 33 triangle in 1 combinations:3 triangle in 2 combinations:('blue', 'big', 'triangle'): 1('blue', 'small', 'triangle'): 2sizesmall : 33 small in 1 combinations:('blue', 'small'): 2('green', 'small'): 13 small in 2 combinations:('green', 'small', 'square'): 1('blue', 'small', 'triangle'): 2

big : 33 big in 1 combinations:('blue', 'big'): 1('red', 'big'): 23 big in 2 combinations:('red', 'big', 'square'): 2('blue', 'big', 'triangle'): 1

Well, that's ugly

● We need to make it prettier● We need to write out to a file● We need to break things up into Classes

Printing Combination Levels

Number of Squares

Number of Red Squares

Number of Blue Squares

Number of Triangles

Number of Red Triangles

Number of Blue Triangles

Total Red

Total Blue

Indentation per level

● If we're indexing by tuple, then the indentation level could correspond to the number of items in the tuple.

● Let's have general methods to format the indentation level, given the number of items in the tuple, or input 'level' integer

A class write_indent() methodIf part of class with counter and msgs dict,

just pass in the tuple:

def write_indent(self, tup_index):''' :param tup_index: tuple index into counter''' indent = ' ' * len(tup_index) msg = self.msgs[tup_index] sum = self.counts[tup_index] indented_msg = ('{0:s}{1:s}'.format( indent, msg, sum) return indented_msg

class-less indent_message()

def indent_message(level, msg, sum,\

space_per_indent=2, space=' '):

num_spaces = self.space_per_indent * level

indent = space * num_spaces

# We'll want to tune the formatting..

indented_msg = ('{0:s}{1:s}:{2:d}'.format(

indent, msg, sum)

return indented_msg

Adjustable field widths

Depending on data, we'll want different field widths

red squares 5

Blue squares 21

Large Red Squares in the Bronx 987654321

Using format to format a format string

>>> f='{{0:{0:d}s}}'.format(3)

>>> f

'{0:3s}'

>>> f='{{0:{0:d}s}}{{1:{1:d}d}}'.format(3,5)

>>> f

'{0:3s}{1:5d}'

>>> f='{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(3,5)

>>> f

'{0:s}{1:3s}{2:5d}'

Format 3 values

● Our formatting string will print 3 values:– String of space chars: {0:s}

– Message: {1:[msg_width]s}

– Sum: Right justified {2:-[sum_width]d}

Class For Flexible Indentation

Flexible Indent Class Variables

Flexible Indent Method

Testing IndentMessages class

SimpleCSVReporter

● Open a CSV File● Create

– Set of possible values

– Set of possible tuples

– Counter indexed by each value & tuple

● Use IndentMessages to format output lines

SimpleCSVReporter class vars

readCSV() beginsinitialize sets..

readCSV() continued: Loop to collect & sum

Write to Report File

Using recursion for limitless indentation

Recursive print sub-levels

Word transform stubs

General method to test

Test with simpler CSV

Output for simpler CSV

A bigger CSV file

"CCN","REPORTDATETIME","SHIFT","OFFENSE","METHOD","BLOCKSITEADDRESS","WARD","ANC","DISTRICT","PSA","NEIGHBORHOODCLUSTER","BUSINESSIMPROVEMENTDISTRICT","VOTING_PRECINCT","START_DATE","END_DATE"

4104147,"4/16/2013 12:00:00 AM","MIDNIGHT","HOMICIDE","KNIFE","1500 - 1599 BLOCK OF 1ST STREET SW",6,"6D","FIRST",105,9,,"Precinct 127","7/27/2004 8:30:00 PM","7/27/2004 8:30:00 PM"

5047867,"6/5/2013 12:00:00 AM","MIDNIGHT","SEX ABUSE","KNIFE","6500 - 6599 BLOCK OF PINEY BRANCH ROAD NW",4,"4B","FOURTH",402,17,,"Precinct 59","4/15/2005 12:30:00 PM",

● From http://data.octo.dc.gov/

Deleted all but 4 columns

"SHIFT","OFFENSE","METHOD","DISTRICT"

"MIDNIGHT","HOMICIDE","KNIFE","FIRST"

"MIDNIGHT","SEX ABUSE","KNIFE","FOURTH"

...

"DAY","THEFT/OTHER","OTHERS","SECOND"

"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"

"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"

"EVENING","BURGLARY","OTHERS","FIFTH"

...

Method to run crime report

Output - top

Output - bottom

Improvements

● Allow user-specified order for values, e.g. FIRST, SECOND, THIRD

● Other means of tabulating● Keeping track of blank values● Summing counts in columns● ...

https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg

LinksThis talk: http://www.slideshare.net/pargery/mnh-csv-python

● https://github.com/pargery/csv_utils2

● Also some notes in http://margerytech.blogspot.com/

Info on Data Structures

● http://rhodesmill.org/brandon/slides/2014-04-pycon/data-structures/

● http://nedbatchelder.com/text/iter.html

DC crime stats

● http://data.octo.dc.gov/“The data made available here has been modified for use from its original source, which is the Government of the District of Columbia. Neither the District of Columbia Government nor the Office of the Chief Technology Officer (OCTO) makes any claims as to the completeness, accuracy or content of any data contained in this application; makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a particular use; nor are any such warranties to be implied or inferred with respect to the information or data furnished herein. The data is subject to change as modifications and updates are complete. It is understood that the information contained in the web feed is being used at one's own risk."

top related