practical medium data analytics with python (10 things i hate about pandas, pydata nyc 2013)

PyData NYC 2013

Practical Medium Data Analytics with Python

10 Things I Hate About pandas

www.datapad.io

Wes McKinney

• Former quant and MIT math dude

• Creator of the pandas project for Python

• Author of Python for Data Analysis — O’Reilly

• Founder and CEO of DataPad

@wesmckinn

• > 20k copies since Oct 2012

• Bringing many new people to Python and data analysis with code

• http://datapad.io

• Founded in 2013, located in SF

• In private beta, join us!

• Hiring for engineering

Why hate on pandas?

pandas rocks!

So, pandas

• Easy-to-use, fast in-memory data wrangling and analytics library

• Enabled loads of complex data work to be done by mere mortals in Python

• Might have kept R from taking over the world (hehe)

pandas, the project

• 170 distinct contributors

• Over 5400 issues and pull requests on GitHub

• Upcoming 0.13 release

• pandas’s broad applicability is also a liability

• Only game in town for a lot of things

• pandas is being used in some unplanned ways

Some things to love

• No more structured dtype drudgery!

• Easy IO!

• Data alignment!

• Hierarchical indexing!

• Time series analytics!

More things to love

• Table reshaping

• Missing data handling

• pandas.merge, pandas.concat

• Expressive groupby machinery

Some pandas use cases

• General data wrangling

• ETL jobs

• Business analytics (incl. BI uses)

• Time series analysis, statistical modeling

pandas does many things that are tedious, slow, or difficult to do correctly without it

Unfortunately, pandas is not a database

#1 Slightly too far from the metal

• DataFrame’s internal structure intended to make row-oriented ops fast on numerical data

• Python objects can be used as data, indices (a feature, not a bug)

#2 No support (yet) for memory maps

• Many analytics ops require a small portion of the data

• Many ways to “materialize” the full data set in memory by accident

• Axis indexes wouldn’t necessarily make sense on out-of-core data sets

• N.B. HDF5/PyTables support is a partial solution
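The first bullet can be illustrated with NumPy's np.memmap. This is only a sketch of the general idea, not something pandas does: the file name, row count, and chunk size are hypothetical, and the column is aggregated chunk by chunk so the whole file is never loaded at once.

```python
import os
import tempfile

import numpy as np

# Hypothetical on-disk column of 1,000,000 float64 values.
path = os.path.join(tempfile.mkdtemp(), "amounts.f8")
n_rows = 1_000_000
np.arange(n_rows, dtype="float64").tofile(path)

# Map the file instead of reading it; pages are loaded lazily on access.
column = np.memmap(path, dtype="float64", mode="r", shape=(n_rows,))

# Aggregate in fixed-size chunks so only one chunk is touched at a time.
chunk_size = 100_000
total = 0.0
for start in range(0, n_rows, chunk_size):
    total += float(column[start:start + chunk_size].sum())

print(total)
```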

#3 No tight database integration

• Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system

• Inadequacy of pandas/NumPy data type systems

• Jobs with heavy SQL-reading are slow and use tons of memory

• TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays
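One way to limit memory on SQL-heavy jobs is to read the result set in chunks and combine partial aggregates. The sketch below assumes a small in-memory SQLite table as a stand-in for a real database; the table contents are made up, and this is a workaround rather than the tight integration described above.

```python
import sqlite3

import pandas as pd

# Hypothetical table standing in for a larger SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fec (cand_nm TEXT, contb_receipt_amt REAL)")
rows = [("A", 100.0), ("B", 250.0), ("A", 50.0), ("B", 25.0), ("A", 10.0)]
conn.executemany("INSERT INTO fec VALUES (?, ?)", rows)
conn.commit()

# Read 2 rows at a time and keep only a small partial aggregate per chunk,
# so the full result set is never held as one DataFrame.
partials = []
for chunk in pd.read_sql_query("SELECT * FROM fec", conn, chunksize=2):
    partials.append(chunk.groupby("cand_nm")["contb_receipt_amt"].sum())
conn.close()

# Combine the per-chunk sums into the final totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```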

#4 Best-efforts NA representation

• Inconsistent representation of missing data

• No Boolean or Integer NA values

• NA needs to be a first class citizen in analytics operations
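A minimal sketch of what the missing Boolean and Integer NA values mean in practice, assuming the default NumPy-backed dtypes: introducing a single missing row silently changes the column type.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True, False, True])

# Reindexing adds a row with no value. With NumPy-backed dtypes the only
# available NA marker is the float NaN, so both columns change type.
ints_na = ints.reindex([0, 1, 2, 3])
bools_na = bools.reindex([0, 1, 2, 3])

print(ints.dtype, "->", ints_na.dtype)    # int64 -> float64
print(bools.dtype, "->", bools_na.dtype)  # bool -> object
```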

#5 RAM management

• Difficult to understand footprint of pandas object

• Ample data copying throughout library

• Would benefit from being able to compress data in-memory or shuttle data temporarily to disk
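The footprint problem can be sketched with DataFrame.memory_usage, a method that exists in later pandas releases than the 0.13 mentioned in this talk. The frame below is hypothetical; the point is that the shallow figure for an object column counts only the pointers, not the Python strings behind them.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0, 4.0],
    # Explicit object dtype: each value is a separate Python string.
    "state": pd.Series(["NY", "CA", "NY", "TX"], dtype=object),
})

# Shallow: size of the underlying arrays only (8 bytes per value here).
shallow = df.memory_usage(index=False)

# Deep: also counts the Python string objects the pointers refer to.
deep = df.memory_usage(index=False, deep=True)

print(shallow)
print(deep)
```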

#6 Weak support for categorical data

• Makes pandas not quite a fully-fledged R replacement

• GroupBy and Joins slower than they could be
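A rough sketch of why categorical encoding helps GroupBy: once strings are dictionary-encoded into small integer codes, a grouped sum is plain array arithmetic. The data is made up, and pd.factorize plus np.bincount are used here only to illustrate the idea, not as how pandas implements grouping.

```python
import numpy as np
import pandas as pd

states = pd.Series(["NY", "CA", "NY", "TX", "CA", "NY"])
amounts = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Dictionary-encode: one integer code per row, plus the distinct labels.
codes, labels = pd.factorize(states)

# Grouped sum over the integer codes, without hashing the strings again.
sums = np.bincount(codes, weights=amounts, minlength=len(labels))
result = dict(zip(labels, sums))

print(result)
```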

#7 Complex GroupBy operations get messy

• Must write custom functions to pass to .apply(..)

• Easy to run up against DRY problems and general Python syntax limitations
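For illustration, here is the kind of custom function the first bullet refers to: a per-group weighted mean. The column names and the weighted_mean helper are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})


def weighted_mean(group):
    # Hypothetical helper: average of `value` weighted by `weight`.
    return np.average(group["value"], weights=group["weight"])


# Anything beyond a built-in aggregate needs a function like this one,
# and each variation (another column, another weight) needs another.
result = df.groupby("key")[["value", "weight"]].apply(weighted_mean)
print(result)
```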

#8 Appending data slow and tedious

• DataFrame not intended as a database table

• Makes streaming data use a challenge

• B+ tree tables interesting?
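A small sketch of the appending problem, using a hypothetical make_batch helper in place of a real data stream. Growing a DataFrame batch by batch re-copies the existing rows each time; a common workaround is to buffer the batches and concatenate once.

```python
import pandas as pd


def make_batch(i):
    # Hypothetical stand-in for a batch of newly arrived rows.
    return pd.DataFrame({"batch": [i, i], "value": [i * 1.0, i * 2.0]})


# Growing a frame one batch at a time copies all existing rows on each step.
grown = make_batch(0)
for i in range(1, 5):
    grown = pd.concat([grown, make_batch(i)], ignore_index=True)

# Buffering the batches and concatenating once copies each row a single time.
buffered = pd.concat([make_batch(i) for i in range(5)], ignore_index=True)

print(grown.equals(buffered))
```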

#9 Limited type system, column metadata

• Currencies, units

• Time zones

• Geographic data

• Composite data types

#10 No true query processing layer

• Filter: WHERE, HAVING

• Group: GROUP BY

• Join: JOIN

• Aggregate: SUM, MEAN, ...

• Limit/TopK: LIMIT

• Sorting: ORDER BY
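As a sketch of how these operations look in pandas, the chain below maps several of them to method calls on a made-up frame. Each step runs eagerly and produces its own intermediate result; no planner sees the whole query, which is the gap this slide points at.

```python
import pandas as pd

df = pd.DataFrame({
    "cand": ["A", "B", "A", "C", "B", "C"],
    "amt": [100.0, 250.0, 50.0, 5.0, 25.0, 700.0],
})

top = (
    df[df["amt"] >= 10]            # Filter    (WHERE)
    .groupby("cand")["amt"]        # Group     (GROUP BY)
    .sum()                         # Aggregate (SUM)
    .sort_values(ascending=False)  # Sorting   (ORDER BY)
    .head(2)                       # Limit     (LIMIT)
)

print(top)
```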

#11 “Slow”: no multicore / distributed algos

• Hampered by use of Python data structures / GIL interactions

• Object internals not designed for concurrent use

Oh no what do we do

Stop believing in the “one tool to rule them all”

“Real Artists Ship” - Steve Jobs

Focus on results

• I am heavily biased by focus on business analytics/BI use cases

• Need production-ready software to ship in relatively short time frame

A new project

• In internal development at DataPad

• Code named “badger”

• pandas-ish syntax: designed for data processing and analytical queries

Badger in a nutshell

• Consistent data type system

• Compressed columnar binary storage

• High perf analytical query processor

• Data preparation/cleaning tools

• Time series analytics

• Immutable array data, little copying

• Analytics kernels: written in C with no dependencies

• Caching of useful intermediates

Some benchmarks

• Data set: 2012 Election data (FEC)

• 5.3 million records, 7 columns

• Tools

  • pandas

  • badger

  • R: data.table

  • SQL: PostgreSQL, SQLite

Query 1

• Total contributions by candidate

SELECT cand_nm, sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm

badger (in-memory): 19ms (1x)
badger (from-disk): 131ms (6.9x)
pandas (in-memory): 273ms (14.3x)
R data.table 1.8.10: 382ms (20x)
PostgreSQL: 4.7s (247x)
SQLite: 72s (3800x)
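For reference, a pandas equivalent of Query 1 might look like the sketch below. The rows are a tiny made-up stand-in for the FEC table (only the column names come from the query), so it says nothing about the timings above.

```python
import pandas as pd

# Hypothetical stand-in for the FEC table; names and amounts are made up.
fec = pd.DataFrame({
    "cand_nm": ["cand_a", "cand_b", "cand_a", "cand_b", "cand_a"],
    "contb_receipt_amt": [100.0, 250.0, 50.0, 25.0, 10.0],
})

# SELECT cand_nm, sum(contb_receipt_amt) AS total FROM fec GROUP BY cand_nm
total = fec.groupby("cand_nm")["contb_receipt_amt"].sum().rename("total")

print(total)
```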

Query 2

• Total contributions by candidate and state

SELECT cand_nm, contbr_st, sum(contb_receipt_amt) AS total
FROM fec
GROUP BY cand_nm, contbr_st

badger (in-memory): 269ms (1x)
badger (from-disk): 391ms (1.5x)
R data.table 1.8.10: 500ms (1.8x)
pandas (in-memory): 770ms (2.9x)
PostgreSQL: 5.96s (23x)

Query 3

• Total contributions by candidate and state with 2 filter predicates

SELECT cand_nm, sum(contb_receipt_amt) as total
FROM fec
WHERE contb_receipt_dt BETWEEN '2012-05-01' and '2012-11-05'
  AND contb_receipt_amt BETWEEN 0 and 2500
GROUP BY cand_nm

badger (in-memory): 96ms (1x)
badger (from-disk): 275ms (2.9x)
pandas (in-memory): 946ms (9.8x)
PostgreSQL: 6.2s (65x)
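A pandas version of the Query 3 SQL could be sketched as follows. The rows are made up to exercise both predicates; Series.between is inclusive on both ends by default, which matches SQL's BETWEEN.

```python
import pandas as pd

# Hypothetical stand-in for the FEC table; all values are made up.
fec = pd.DataFrame({
    "cand_nm": ["cand_a", "cand_a", "cand_a", "cand_b", "cand_b", "cand_b"],
    "contb_receipt_dt": pd.to_datetime([
        "2012-04-15", "2012-06-01", "2012-07-01",
        "2012-10-31", "2012-11-06", "2012-05-01",
    ]),
    "contb_receipt_amt": [100.0, 200.0, 5000.0, 2500.0, 50.0, -20.0],
})

# The two filter predicates from the WHERE clause.
in_dates = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_amounts = fec["contb_receipt_amt"].between(0, 2500)

# Filter first, then group and sum, as in the SQL.
total = fec[in_dates & in_amounts].groupby("cand_nm")["contb_receipt_amt"].sum()

print(total)
```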

Badger, the future

• Distributed in-memory analytics

• Multicore algorithms

• ETL job-building tools

• Open source in some form someday

• Looking for algorithms hackers to help

Thank you!
