pandas/data analysis at baypiggies
DESCRIPTION
Presented at BayPiggies by Chang She and Andy Hayden.
pandas is used by many people to make their lives easier when analyzing data. This talk is centered around how the overarching goal of user productivity has driven the balance of API development and performance optimization. We will cover some pandas basics. We'll talk about pandas performance. And we'll discuss data structures and algorithms. Along the way, we'll cover best practices and tools useful for developing open source projects.
Chang She is the CTO/co-founder of DataPad. A pythonista and recovering financial quant, Chang was a core contributor to pandas prior to co-founding DataPad. Chang is passionate about creating better data tools to make knowledge workers more productive.
Andy is a core contributor to pandas and holds the dubious accolade of having answered the most pandas-related questions on Stack Overflow. Andy is an analyst and software engineer from the UK, turned Data Scientist in CA, and is enthusiastic about making data tools easy.
IPython notebooks available here:
https://www.wakari.io/sharing/bundle/hayd/baypiggies
https://www.wakari.io/sharing/bundle/hayd/vbench
https://www.wakari.io/sharing/bundle/hayd/pandorable
TRANSCRIPT
![Page 1: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/1.jpg)
Python Pandas: Lessons Learned in Performance and Design
![Page 2: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/2.jpg)
Who we are
Chang She - CTO/Cofounder @ DataPad, core pandas contributor, recovering financial quant. Follow me on twitter: @changhiskhan
Andy Hayden - core pandas contributor, analyst and software engineer from the UK turned Data Scientist in CA, avid data tool maker
![Page 3: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/3.jpg)
What are we talking about?
- Why pandas?
- What’s cool about pandas?
- How do we improve and track performance?
- A few data structures and algorithms
- Bad idioms and how to fix them
![Page 4: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/4.jpg)
![Page 5: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/5.jpg)
What is it?
- Python library for analyzing real world data
- Created by Wes McKinney, now led by Jeff Reback
- Supported on all platforms
- Supports Python 3.4 as of latest version
- Big and active community
![Page 6: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/6.jpg)
Pandas Highlights
- Labelled data and automatic alignment
- Easy data integration
- Flexible slicing and dicing of data
- Analytics made to fit your brain, not vice versa (I’m looking at you SQL)
USER PRODUCTIVITY
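To make "labelled data and automatic alignment" concrete, here is a minimal sketch, not taken from the talk's notebooks. The Series names and values are made up for illustration. It shows that arithmetic matches values by index label rather than by position.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Addition pairs values up by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = a + b
print(total)
```

Only "y" and "z" appear in both inputs, so only those two labels get a numeric result.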
![Page 7: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/7.jpg)
Productivity via better workflow
- Single tool to minimize cognitive dissonance
- Iterative and not linear workflow
- Performant enough for interactive work
![Page 8: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/8.jpg)
Pandas basics
(notebook)
![Page 9: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/9.jpg)
Priorities
- Build the right abstractions
- Get the API right
- Then optimize for performance
![Page 10: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/10.jpg)
Open source APIs
- Sometimes you can’t be all things to all people
- You can only add to an API, rarely change it, and never remove anything from it
- Documentation Documentation Documentation
![Page 11: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/11.jpg)
An example
- DataFrame started life as essentially a dict of Series
- There was also DataMatrix
- Unified under DataFrame via combining homogeneous blocks. Performant and single API
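The "dict of Series" view is still visible in the public API today. The sketch below uses made-up column names and values. It builds a DataFrame from a dict of Series and pulls one column back out as a Series. It does not show the internal block storage, which is an implementation detail.

```python
import pandas as pd

# A plain dict mapping column name -> Series (illustrative data).
columns = {
    "price": pd.Series([1.5, 2.5], index=["a", "b"]),
    "volume": pd.Series([100, 200, 300], index=["a", "b", "c"]),
}

# The constructor aligns the Series on the union of their labels.
df = pd.DataFrame(columns)

# Column access behaves like dict lookup and returns a Series.
volume = df["volume"]
print(df)
```

The "price" column has no value for label "c", so that cell is filled with NaN.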
![Page 12: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/12.jpg)
Optimization
- Push slow code paths into cython or directly into C
- Try to be smart about minimizing cache misses and not creating unnecessary copies
- Careful with NAs
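One user-visible reason NAs need care is that the default missing-value marker, NaN, is a float. The following sketch is an illustration under that assumption, not code from the talk. It uses a default NumPy-backed integer Series and shows how a single missing label changes the dtype and how reductions treat the gap.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reindexing to a label that does not exist introduces a missing value.
# NaN is a float, so the integer data is upcast to float64.
with_gap = ints.reindex(["a", "b", "d"])

# Reductions skip NaN by default; skipna=False propagates it instead.
total = with_gap.sum()
strict_total = with_gap.sum(skipna=False)

print(ints.dtype, with_gap.dtype, total, strict_total)
```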
![Page 13: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/13.jpg)
Tracking Performance (vbench)
![Page 14: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/14.jpg)
what to track?
Use vbench to track everything we care about (read: whatever users have complained is slow).
There are unofficial vbench repos for numpy and scikit.
(look)
![Page 15: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/15.jpg)
why
Once users are using your API, they’ll notice performance changes: “it feels slower”.
Then they timeit and have a legitimate grievance. We want to automate this process (before users get upset).
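The manual step a user would take can be sketched with the standard library's timeit module. This is a stand-in for illustration and not vbench itself. The DataFrame, the benchmarked operation, and the repeat counts are all assumptions chosen to keep the run short.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: 10,000 rows falling into 100 groups.
df = pd.DataFrame({
    "key": np.arange(10_000) % 100,
    "value": np.arange(10_000, dtype="float64"),
})

def groupby_sum():
    """The operation whose speed we want to keep an eye on."""
    return df.groupby("key")["value"].sum()

# Run the call 10 times per trial, for 3 trials, and keep the best
# trial, since the minimum is the least affected by background noise.
calls_per_trial = 10
timings = timeit.repeat(groupby_sum, number=calls_per_trial, repeat=3)
best_per_call = min(timings) / calls_per_trial

print(f"groupby sum: {best_per_call * 1e3:.3f} ms per call")
```

Recording a number like this for every commit, and comparing it over time, is the part a benchmark tracker automates.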
![Page 16: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/16.jpg)
how
(notebook)
![Page 17: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/17.jpg)
Pandorable pandas
(notebook)
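The notebook itself is not reproduced in this transcript. As one commonly cited example of a bad idiom and its fix (an assumption about the kind of material covered, with made-up data), compare a Python-level loop over rows with the equivalent column-wise expression.

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0, 4.0], "qty": [2, 3, 1]})

# Bad idiom: iterate row by row in Python and build the result by hand.
loop_total = []
for _, row in df.iterrows():
    loop_total.append(row["price"] * row["qty"])

# Fix: operate on whole columns at once and let pandas do the loop.
df["total"] = df["price"] * df["qty"]

print(df)
```

Both versions compute the same values. The column-wise form is shorter and avoids creating a new Series object for every row.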
![Page 18: Pandas/Data Analysis at Baypiggies](https://reader034.vdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/18.jpg)
The End