practical medium data analytics with python (10 things i hate about pandas, pydata nyc 2013)

Post on 27-Jan-2015

114 Views

Category:

Technology

1 Downloads

Preview:

Click to see full reader

DESCRIPTION

by Wes McKinney (@wesmckinn) at PyData NYC 2013

TRANSCRIPT

PyData NYC 2013

Practical Medium Data Analytics with Python

PyData NYC 2013

Practical Medium Data Analytics with Python

10 Things I Hate About pandas

www.datapad.io

Wes McKinney

3

• Former quant and MIT math dude

• Creator of Pandas project for Python

• Author of Python for Data Analysis — O’Reilly

• Founder and CEO of DataPad

@wesmckinn

www.datapad.io

• > 20k copies since Oct 2012• Bringing many new people

to Python and data analysis with code

4

www.datapad.io

•http://datapad.io

•Founded in 2013, located in SF

• In private beta, join us!

•Hiring for engineering

Why hate on pandas?

pandas rocks!

www.datapad.io

• Easy-to-use, fast in-memory data wrangling and analytics library

• Enabled loads of complex data work to be done by mere mortals in Python

• Might have kept R from taking over the world (hehe)

10

So, pandas

www.datapad.io

•170 distinct contributors

•Over 5400 issues and pull requests on GitHub

•Upcoming 0.13 release

12

pandas, the project

www.datapad.io

•pandas’s broad applicability also a liability

•Only game in town for lot of things

•pandas being used in some unplanned ways

13

But.

www.datapad.io

• No more structured dtype drudgery!

• Easy IO!

• Data alignment!

• Hierarchical indexing!

• Time series analytics!

14

Some things to love

www.datapad.io

•Table reshaping

•Missing data handling

•pandas.merge, pandas.concat

•Expressive groupby machinery

15

More things to love

www.datapad.io

•General data wrangling

•ETL jobs

•Business analytics (incl. BI uses)

•Time series analysis, statistical modeling

16

Some pandas use cases

pandas does many things that are tedious, slow, or

difficult to do correctly without it

Unfortunately, pandas is not a database

www.datapad.io

•DataFrame’s internal structure intended to make row-oriented ops fast on numerical data

•Python objects can be used as data, indices (a feature, not a bug)

19

#1 Slightly too far from the metal

www.datapad.io

• Many analytics ops require a small portion of the data

• Many ways to “materialize” the full data set in memory by accident

• Axis indexes wouldn’t necessarily make sense on out of core data sets

20

#2 No support (yet) for memory maps

www.datapad.io

•N.B. HDF5/PyTables support is a partial solution

21

#2 No support (yet) for memory maps

www.datapad.io

•Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system

• Inadequacy of pandas/NumPy data type systems

22

#3 No tight database integration

www.datapad.io

• Jobs with heavy SQL-reading are slow and use tons of memory

•TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays

23

#3 No tight database integration

www.datapad.io

• Inconsistent representation of missing data

•No Boolean or Integer NA values

•NA needs to be a first class citizen in analytics operations

24

#4 Best-efforts NA representation

www.datapad.io

• Difficult to understand footprint of pandas object

• Ample data copying throughout library

• Would benefit from being able to compress data in-memory or shuttle data temporarily to disk

25

#5 RAM management

www.datapad.io

•Makes pandas not quite a fully-fledged R replacement

•GroupBy and Joins slower than they could be

26

#6 Weak support for categorical data

www.datapad.io

•Must write custom functions to pass to .apply(..)

•Easy to run up against DRY problems and general Python syntax limitations

27

#7 Complex GroupBy operations get messy

www.datapad.io

•DataFrame not intended as a database table

•Makes streaming data use a challenge

•B+ tree tables interesting?

28

#8 Appending data slow and tedious

www.datapad.io

•Currencies, units

•Time zones

•Geographic data

•Composite data types

29

#9 Limited type system, column metadata

www.datapad.io

•Filter

•Group

• Join

•Aggregate

•Limit/TopK

•Sorting

30

#10 No true query processing layer

WHERE, HAVINGGROUP BYJOINSUM, MEAN, ...LIMITORDER BY

www.datapad.io

•Hampered by use of Python data structures / GIL interactions

•Object internals not designed for concurrent use

31

#11 “Slow”: no multicore / distributed algos

Oh no what do we do

Stop believing in the “one tool to rule them all”

“Real Artists Ship”- Steve Jobs

www.datapad.io

• I am heavily biased by focus on business analytics/BI use cases

•Need production-ready software to ship in relatively short time frame

36

Focus on results

www.datapad.io

• In internal development at DataPad

•Code named “badger”

•pandas-ish syntax: designed for data processing and analytical queries

37

A new project

www.datapad.io

•Consistent data type system

•Compressed columnar binary storage

•High perf analytical query processor

•Data preparation/cleaning tools

38

Badger in a nutshell

www.datapad.io

•Time series analytics

• Immutable array data, little copying

•Analytics kernels: written C with no dependencies

•Caching of useful intermediates

39

Badger in a nutshell

www.datapad.io

•Data set: 2012 Election data (FEC)

•5.3 mm records 7 columns

•Tools

•pandas

•badger

•R: data.table

•SQL: PostgreSQL, SQLite

40

Some benchmarks

www.datapad.io

•Total contributions by candidate

41

Query 1

SELECT  cand_nm,                sum(contb_receipt_amt)  AS  totalFROM  fecGROUP  BY  cand_nm

www.datapad.io

•Total contributions by candidate

42

Query 1

badger  (in-­‐memory)  :      19ms  (1x)badger  (from-­‐disk)  :    131ms  (6.9x)pandas  (in-­‐memory)  :    273ms  (14.3x)R  data.table  1.8.10:    382ms  (20x)PostgreSQL                  :      4.7s  (247x)SQLite                          :        72s  (3800x)

www.datapad.io

•Total contributions by candidate and state

43

Query 2

SELECT  cand_nm,  contbr_st,              sum(contb_receipt_amt)  AS  totalFROM  fecGROUP  BY  cand_nm,  contbr_st

www.datapad.io44

Query 2

badger  (in-­‐memory)  :    269ms  (1x)badger  (from-­‐disk)  :    391ms  (1.5x)R  data.table  1.8.10:    500ms  (1.8x)pandas  (in-­‐memory)  :    770ms  (2.9x)PostgreSQL                  :    5.96s  (23x)

•Total contributions by candidate and state

www.datapad.io

•Total contributions by candidate and state with 2 filter predicates

45

Query 3

SELECT  cand_nm,              sum(contb_receipt_amt)  as  totalFROM  fecWHERE  contb_receipt_dt  BETWEEN                '2012-­‐05-­‐01'  and  '2012-­‐11-­‐05'    AND  contb_receipt_amt  BETWEEN                  0  and  2500GROUP  BY  cand_nm

www.datapad.io

•Total contributions by candidate and state with 2 filter predicates

46

Query 3

badger  (in-­‐memory)  :      96ms  (1x)badger  (from-­‐disk)  :    275ms  (2.9x)pandas  (in-­‐memory)  :    946ms  (9.8x)PostgreSQL                  :      6.2s  (65x)

www.datapad.io

•Distributed in-memory analytics

•Multicore algorithms

•ETL job-building tools

•Open source in some form someday

•Looking for algorithms hackers to help

47

Badger, the future

www.datapad.io

Thank you!

48

top related