A look inside pandas design and development
Wes McKinney, Lambda Foundry, Inc.
@wesmckinn
NYC Python Meetup, 1/10/2012
1
a.k.a. “Pragmatic Python for high performance
data analysis”
2
a.k.a. “Rise of the pandas”
3
Me
4
More like...
SPEED!!!
5
Or maybe... (j/k)
6
Me
• Mathematician at heart
• 3 years in the quant finance industry
• Last 2: statistics + freelance + open source
• My new company: Lambda Foundry
• Building analytics and tools for finance and other domains
7
Me
• Blog: http://blog.wesmckinney.com
• GitHub: http://github.com/wesm
• Twitter: @wesmckinn
• Working on “Python for Data Analysis” for O’Reilly Media
• Giving PyCon tutorial on pandas (!)
8
pandas?
• http://pandas.sf.net
• Swiss-army knife of (in-memory) data manipulation in Python
• Like R’s data.frame on steroids
• Excellent performance
• Easy-to-use, highly consistent API
• A foundation for data analysis in Python
9
pandas
• In heavy production use in the financial industry
• Generally much better performance than other open source alternatives (e.g. R)
• Hope: basis for the “next generation” data analytical environment in Python
10
Simplifying data wrangling
• Data munging / preparation / cleaning / integration is slow, error prone, and time consuming
• Everyone already <3’s Python for data wrangling: pandas takes it to the next level
11
Explosive pandas growth
• Last 6 months: 240 files changed, 49,428 insertions(+), 15,358 deletions(-) (Cython-generated C removed from the counts)
12
Rigorous unit testing
• Need to be able to trust your $1e3/e6/e9s to pandas
• > 98% line coverage as measured by coverage.py
• v0.3.0 (2/19/2011): 533 test functions
• v0.7.0 (1/09/2012): 1272 test functions
13
Some development asides
• I get a lot of questions about my dev env
• Emacs + IPython FTW
• Indispensable development tools
• pdb (and IPython-enhanced pdb)
• pylint / pyflakes (integrated with Emacs)
• nose
• coverage.py
• grin, for searching code (better than ack/grep, IMHO)
14
IPython
• Matthew Goodman: “If you are not using this tool, you are doing it wrong!”
• Tab completion, introspection, interactive debugger, command history
• Designed to enhance your productivity in every way. I can’t live without it
• IPython HTML notebook is a game changer
15
Profiling and optimization
• %time, %timeit in IPython
• %prun, to profile a statement with cProfile
• %run -p to profile whole programs
• line_profiler module, for line-by-line timing
• Optimization: find right algorithm first. Cython-ize the bottlenecks (if need be)
16
Other things that matter
• Follow PEP8 religiously
• Naming conventions, other code style
• 80-character-per-line hard limit
• Test more than you think you need to, aim for 100% line coverage
• Avoid long functions (> 50 lines), refactor aggressively
17
I’m serious about function length
http://gist.github.com/1580880
18
Don’t make a mess
YouTube: “What killed Smalltalk could kill s/Ruby/Python, too”
Uncle Bob
19
Other stuff
• Good keyboard
20
Other stuff
• Big monitors
21
Other stuff
• Ergonomic chair (good hacking posture)
22
pandas DataFrame
• Jack-of-all-trades tabular data structure

In [10]: tips[:10]
Out[10]:
    total_bill   tip     sex smoker  day    time  size
1        16.99  1.01  Female     No  Sun  Dinner     2
2        10.34  1.66    Male     No  Sun  Dinner     3
3        21.01  3.50    Male     No  Sun  Dinner     3
4        23.68  3.31    Male     No  Sun  Dinner     2
5        24.59  3.61  Female     No  Sun  Dinner     4
6        25.29  4.71    Male     No  Sun  Dinner     4
7         8.77  2.00    Male     No  Sun  Dinner     2
8        26.88  3.12    Male     No  Sun  Dinner     4
9        15.04  1.96    Male     No  Sun  Dinner     2
10       14.78  3.23    Male     No  Sun  Dinner     2
23
DataFrame
• Heterogeneous columns
• Data alignment and axis indexing
• No-copy data selection (!)
• Agile reshaping
• Fast joining, merging, concatenation
24
DataFrame
• Axis indexing enables rich data alignment, joins / merges, reshaping, selection, etc.
day              Fri    Sat    Sun   Thur
sex    smoker
Female No      3.125  2.725  3.329  2.460
       Yes     2.683  2.869  3.500  2.990
Male   No      2.500  3.257  3.115  2.942
       Yes     2.741  2.879  3.521  3.058
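A table of this shape can be produced with pandas' pivot_table. The data below is a tiny hypothetical stand-in for the tips dataset (the values are illustrative, not the real ones):

```python
import pandas as pd

# A small made-up sample shaped like the tips dataset on the slide
tips = pd.DataFrame({
    "sex":    ["Female", "Female", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
    "day":    ["Sun", "Sat", "Sun", "Sat", "Sat", "Sun"],
    "tip":    [3.50, 2.75, 3.00, 2.90, 2.60, 3.25],
})

# Mean tip with (sex, smoker) on the rows and day on the columns,
# the same layout as the table above
table = tips.pivot_table(values="tip", index=["sex", "smoker"],
                         columns="day", aggfunc="mean")
print(table)
```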
25
Let’s have a little fun
To the IPython Notebook, Batman
http://ashleyw.co.uk/project/food-nutrient-database
26
Axis indexing, the special pandas-flavored sauce
• Enables “alignment-free” programming
• Prevents major source of data munging frustration and errors
• Fast (O(1) or O(log n)) data selection
• Powerful way of describing reshape / join / merge / pivot-table operations
27
Data alignment, join ops
• The brains live in the axis index
• Indexes know how to do set logic
• Join/align ops: produce “indexers”
• Mapping between source/output
• Indexer passed to fast “take” function
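The indexer-then-take step described above can be sketched in a few lines of NumPy (the names here, like take_with_nan, are illustrative, not pandas internals):

```python
import numpy as np

# Aligning left index ['a','b','c'] against the union ['a','b','c','e']
# produces an indexer of positions into the left values; -1 marks a label
# missing from the left side.
union = np.array(["a", "b", "c", "e"])
left_indexer = np.array([0, 1, 2, -1])
left_values = np.array([10.0, 20.0, 30.0])

def take_with_nan(values, indexer):
    # A "take" that maps -1 entries to NaN, mimicking a fast take routine
    out = values.take(indexer.clip(0)).astype(float)
    out[indexer == -1] = np.nan
    return out

aligned = take_with_nan(left_values, left_indexer)
print(aligned)  # [10. 20. 30. nan]
```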
28
Index join example

left index:  [d, b, c, e]        right index: [a, b, c]

JOIN (outer)

joined index: [a, b, c, d, e]

lidx: [-1, 1, 2, 0, 3]           ridx: [0, 1, 2, -1, -1]

left_values.take(lidx, axis) → reindexed data
29
Implementing index joins
• Completely irregular case: use hash tables
• Monotonic / increasing values
• Faster specialized left/right/inner/outer join routines, especially for native types (int32/64, datetime64)
• Lookup hash table is persisted inside the Index object!
30
Um, hash table?

left index: [d, b, c, e]
map (hash table): {d: 0, b: 1, c: 2, e: 3}

joined index: [a, b, c, d, e]
look up each joined label in the map (missing → -1):

indexer: [-1, 1, 2, 0, 3]
31
Hash tables
• Form the core of many critical pandas algorithms
• unique (for set intersection / union)
• “factor”ize
• groupby
• join / merge / align
32
GroupBy, a brief algorithmic exploration
• Simple problem: compute group sums for a vector given group identifications

labels: [b, b, a, a, b, a, a]
values: [-1, 3, 2, 3, 2, -4, 1]

unique labels: [a, b]
group sums:    [2, 4]
33
GroupBy: Algo #1

unique_labels = np.unique(labels)
results = np.empty(len(unique_labels))
for i, label in enumerate(unique_labels):
    results[i] = values[labels == label].sum()

For all these examples, assume N data points and K unique groups
34
GroupBy: Algo #1, don’t do this

unique_labels = np.unique(labels)
results = np.empty(len(unique_labels))
for i, label in enumerate(unique_labels):
    results[i] = values[labels == label].sum()

Some obvious problems
• O(N * K) comparisons. Slow for large K
• K passes through values
• numpy.unique is pretty slow (more on this later)
35
GroupBy: Algo #2

Make this dict in O(N) (pseudocode):
g_inds = {label : [i where labels[i] == label]}

Now:
for i, label in enumerate(unique_labels):
    indices = g_inds[label]
    label_values = values.take(indices)
    result[i] = label_values.sum()

Pros: one pass through values. ~O(N) for N >> K
Cons: g_inds can be built in O(N), but too many list/dict API calls, even using Cython
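For reference, a runnable version of Algo #2 in plain Python/NumPy, using the labels and values from the earlier example slide:

```python
import numpy as np

labels = np.array(["b", "b", "a", "a", "b", "a", "a"])
values = np.array([-1.0, 3.0, 2.0, 3.0, 2.0, -4.0, 1.0])

# Build label -> row positions in one pass (O(N))
g_inds = {}
for i, label in enumerate(labels):
    g_inds.setdefault(label, []).append(i)

# One take + sum per group
result = {}
for label, indices in g_inds.items():
    result[label] = values.take(indices).sum()

print(result)  # group sums: a -> 2.0, b -> 4.0
```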
36
GroupBy: Algo #3, much faster
• “Factorize” labels
• Produce a vector of integers from 0, ..., K-1 corresponding to the unique observed values (use a hash table)

result = np.zeros(k)
for i, j in enumerate(factorized_labels):
    result[j] += values[i]

Pros: avoid expensive dict-of-lists creation. Avoid numpy.unique and have the option not to sort the unique labels, skipping O(K log K) work
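A runnable sketch of Algo #3's factorize-then-accumulate idea, with a plain Python dict standing in for the hash table:

```python
import numpy as np

labels = ["b", "b", "a", "a", "b", "a", "a"]
values = np.array([-1.0, 3.0, 2.0, 3.0, 2.0, -4.0, 1.0])

# "Factorize": map each label to an integer id 0..K-1 in order of appearance
table = {}
factorized_labels = np.empty(len(labels), dtype=np.int64)
for i, label in enumerate(labels):
    if label not in table:
        table[label] = len(table)
    factorized_labels[i] = table[label]

# Single accumulation pass over the data
k = len(table)
result = np.zeros(k)
for i, j in enumerate(factorized_labels):
    result[j] += values[i]

print(table)   # {'b': 0, 'a': 1}
print(result)  # group sums: b -> 4.0, a -> 2.0
```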
37
Speed comparisons
• Test case: 100,000 data points, 5,000 groups
• Algo 3, don’t sort groups: 5.46 ms
• Algo 3, sort groups: 10.6 ms
• Algo 2: 155 ms (14.6x slower)
• Algo 1: 10.49 seconds (990x slower)
• Algos 2/3 implemented in Cython
38
GroupBy
• Situation is significantly more complicated in the multi-key case.
39
Algo 3, profiled

In [32]: %prun for _ in xrange(100): algo3_nosort()

cumtime  filename:lineno(function)
  0.592  <string>:1(<module>)
  0.584  groupby_ex.py:37(algo3_nosort)
  0.535  {method 'factorize' of 'DictFactorizer' objects}   ← curious
  0.047  {pandas._tseries.group_add}
  0.002  numeric.py:65(zeros_like)
  0.001  {method 'fill' of 'numpy.ndarray' objects}
  0.000  {numpy.core.multiarray.empty_like}
  0.000  {numpy.core.multiarray.empty}
40
Slaves to algorithms
• Turns out that numpy.unique works by sorting, not a hash table. Thus O(N log N) versus O(N)
• Takes > 70% of the runtime of Algo #2
• Factorize is the new bottleneck, possible to go faster?!
41
Unique-ing faster
Basic algorithm using a dict, do this in Cython:

table = {}
uniques = []
for value in values:
    if value not in table:
        table[value] = None  # dummy
        uniques.append(value)
if sort:
    uniques.sort()

Performance may depend on the number of unique groups (due to dict resizing)
42
Unique-ing faster
• No sort: at best ~70x faster, at worst 6.5x faster
• Sort: at best ~70x faster, at worst 1.7x faster
43
Remember
44
Can we go faster?
• Python dict is renowned as one of the best hash table implementations anywhere
• But:
• No ability to preallocate, subject to arbitrary resizings
• We don’t care about reference counting, throw away table once done
• Hm, what to do, what to do?
45
Enter klib
• http://github.com/attractivechaos/klib
• Small, portable C data structures and algorithms
• khash: fast, memory-efficient hash table
• Hack a Cython interface (pxd file) and we’re in business
46
khash Cython interface

cdef extern from "khash.h":
    ctypedef struct kh_pymap_t:
        khint_t n_buckets, size, n_occupied, upper_bound
        uint32_t *flags
        PyObject **keys
        Py_ssize_t *vals

    inline kh_pymap_t* kh_init_pymap()
    inline void kh_destroy_pymap(kh_pymap_t*)
    inline khint_t kh_get_pymap(kh_pymap_t*, PyObject*)
    inline khint_t kh_put_pymap(kh_pymap_t*, PyObject*, int*)
    inline void kh_clear_pymap(kh_pymap_t*)
    inline void kh_resize_pymap(kh_pymap_t*, khint_t)
    inline void kh_del_pymap(kh_pymap_t*, khint_t)
    bint kh_exist_pymap(kh_pymap_t*, khiter_t)
47
PyDict vs. khash unique
Conclusions: dict resizing makes a big impact
48
Use strcmp in C
49
Gloves come off with int64
PyObject* boxing / PyRichCompare obvious culprit
50
Some NumPy-fu
• Think about the sorted factorize algorithm
• Want to compute sorted unique labels
• Also compute integer ids relative to the unique values, without making 2 passes through a hash table!

sorter = uniques.argsort()
reverse_indexer = np.empty(len(sorter))
reverse_indexer.put(sorter, np.arange(len(sorter)))
labels = reverse_indexer.take(labels)
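A runnable version of this trick, with illustrative data: given integer ids relative to the first-appearance order of the uniques, relabel them relative to the sorted uniques without a second hash-table pass.

```python
import numpy as np

uniques = np.array(["b", "a", "c"])  # unique labels, appearance order
labels = np.array([0, 0, 1, 2, 1])   # ids relative to `uniques`

sorter = uniques.argsort()                    # positions that sort uniques
reverse_indexer = np.empty(len(sorter), dtype=np.int64)
# Scatter 0..K-1 into the slots given by sorter: old id -> sorted id
reverse_indexer.put(sorter, np.arange(len(sorter)))

labels = reverse_indexer.take(labels)
print(uniques[sorter])  # ['a' 'b' 'c']
print(labels)           # ids now relative to the sorted uniques
```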
51
Aside, for the R community
• R’s factor function is suboptimal
• Makes two hash table passes
• unique: uniquify and sort
• match: compute ids relative to the unique labels
• This is highly fixable
• R’s integer unique is about 40% slower than my khash_int64 unique
52
Multi-key GroupBy
• Significantly more complicated because the number of possible key combinations may be very large
• Example, group by two sets of labels
• 1000 unique values in each
• “Key space”: 1,000,000, even though observed key pairs may be small
53
Multi-key GroupBy
Simplified algorithm

id1, count1 = factorize(label1)
id2, count2 = factorize(label2)
group_id = id1 * count2 + id2
nobs = count1 * count2

if nobs > LARGE_NUMBER:
    group_id, nobs = factorize(group_id)

result = group_add(data, group_id, nobs)
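A minimal runnable sketch of the key-composition step, where factorize is a simple dict-based stand-in for pandas' hash-table version:

```python
import numpy as np

def factorize(values):
    # Map values to integer ids 0..K-1 in order of appearance; return (ids, K)
    table, ids = {}, np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        if v not in table:
            table[v] = len(table)
        ids[i] = table[v]
    return ids, len(table)

key1 = ["x", "x", "y", "y"]
key2 = ["p", "q", "p", "q"]

id1, count1 = factorize(key1)
id2, count2 = factorize(key2)
# One unique integer per observed (key1, key2) pair in the combined key space
group_id = id1 * count2 + id2
print(group_id)  # [0 1 2 3]
```

When the combined key space (count1 * count2) is much larger than the number of observed pairs, factorizing group_id again compresses it back down.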
54
Multi-GroupBy
• Pathological, but realistic example
• 50,000 values, 1e4 unique keys x 2, key space 1e8
• Compress key space: 9.2 ms
• Don’t compress: 1.2s (!)
• I actually discovered this problem while writing this talk (!!)
55
Speaking of performance
• Testing the correctness of code is easy: write unit tests
• How to systematically test performance?
• Need to catch performance regressions
• Being mildly performance obsessed, I got very tired of playing performance whack-a-mole with pandas
56
vbench project
• http://github.com/wesm/vbench
• Run benchmarks for each version of your codebase
• vbench checks out each revision of your codebase, builds it, and runs all the benchmarks you define
• Results stored in a SQLite database
• Only works with git right now
57
vbench

join_dataframe_index_single_key_bigger = \
    Benchmark("df.join(df_key2, on='key2')", setup,
              name='join_dataframe_index_single_key_bigger')
58
vbench

stmt3 = "df.groupby(['key1', 'key2']).sum()"
groupby_multi_cython = Benchmark(stmt3, setup,
                                 name="groupby_multi_cython",
                                 start_date=datetime(2011, 7, 1))
59
Fast database joins
• Problem: SQL-compatible left, right, inner, outer joins
• Row duplication
• Join on index and / or join on columns
• Sorting vs. not sorting
• Algorithmically closely related to groupby etc.
60
Row duplication

left                      right
key   lvalue              key   rvalue
foo   1                   foo   5
foo   2                   foo   6
bar   3                   bar   7
baz   4                   qux   8

outer join:
key   lvalue  rvalue
foo   1       5
foo   1       6
foo   2       5
foo   2       6
bar   3       7
baz   4       NA
qux   NA      8
61
Join indexers

left                      right
key   lvalue              key   rvalue
foo   1                   foo   5
foo   2                   foo   6
bar   3                   bar   7
baz   4                   qux   8

outer join:
key   lidx  ridx
foo   0     0
foo   0     1
foo   1     0
foo   1     1
bar   2     2
baz   3     -1
qux   -1    3
62
Join indexers

left                      right
key   lvalue              key   rvalue
foo   1                   foo   5
foo   2                   foo   6
bar   3                   bar   7
baz   4                   qux   8

outer join:
key   lidx  ridx
foo   0     0
foo   0     1
foo   1     0
foo   1     1
bar   2     2
baz   3     -1
qux   -1    3

Problem: factorized keys need to be sorted!
63
An algorithmic observation
• If N values are known to be from the range 0 through K - 1, can be sorted in O(N)
• Variant of counting sort
• For our purposes, only compute the sorting indexer (argsort)
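A sketch of that counting-sort argsort, assuming group ids known to lie in 0..K-1 (the function name is illustrative):

```python
import numpy as np

def counting_argsort(ids, k):
    # Compute the sorting indexer for `ids` in O(N + K), no comparisons
    counts = np.zeros(k + 1, dtype=np.int64)
    for x in ids:
        counts[x + 1] += 1
    starts = counts.cumsum()          # start offset of each group's slot
    indexer = np.empty(len(ids), dtype=np.int64)
    for i, x in enumerate(ids):       # stable: preserves original order
        indexer[starts[x]] = i
        starts[x] += 1
    return indexer

ids = np.array([2, 0, 1, 0, 2])
indexer = counting_argsort(ids, 3)
print(ids.take(indexer))  # [0 0 1 2 2]
```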
64
Winning join algorithm

1. Factorize key columns — O(K log K) if sorting keys, O(N) if not
2. Compute / compress group indexes — O(N) (refactorize)
3. "Sort" by group indexes — O(N) (counting sort)
4. Compute left / right join indexers for the join method — O(N_output) (this step is actually fairly nontrivial)
5. Remap indexers relative to the original row ordering — O(N_output)
6. Move data efficiently into the output DataFrame — O(N_output)
65
“You’re like CLR, I’m like CLRS”- “Kill Dash Nine”, by Monzy
66
Join test case
• Left: 80k rows, 2 key columns, 8k unique key pairs
• Right: 8k rows, 2 key columns, 8k unique key pairs
• 6k matching key pairs between the tables, many-to-one join
• One column of numerical values in each
67
Join test case
• Many-to-many case: stack right DataFrame on top of itself to yield 16k rows, 2 rows for each key pair
• Aside: sorting the unique keys dominates the runtime (that pesky O(K log K)), not included in these benchmarks
68
Quick, algebra!
• Left join: 80k rows
• Right join: 62k rows
• Inner join: 60k rows
• Outer join: 82k rows
• Left join: 140k rows
• Right join: 124k rows
• Inner join: 120k rows
• Outer join: 144k rows
Many-to-manyMany-to-one
69
Results vs. some R packages
* relative timings
70
Results vs SQLite3
Note: In SQLite3 doing something like
Absolute timings
* outer is LEFT OUTER in SQLite3
71
DataFrame sort by columns
• Applied same ideas / tools to “sort by multiple columns op” yesterday
72
The bottom line
• Just a flavor: pretty much all of pandas has seen the same level of design effort and performance scrutiny
• Make sure whoever implemented your data structures and algorithms cares about performance. A lot.
• Python has amazingly powerful and productive tools for implementation work
73
Thanks!
• Follow me on Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• Exciting Python things ahead in 2012
74