python data ecosystem: thoughts on building for the future
1 © Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
In process: Python for Data Analysis: 2nd Edition
Coming early 2017
Building open source communities
Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.
Wikipedia
Step 1 Be open and transparent
Step 2 Reach out to others
Step 3 Strive for consensus
Step 4 Value contributions extending beyond lines of code
Step 5 Make things harder for bad actors
Handling problems carefully
http://numfocus.org
http://apache.org
Python packaging
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
Reflecting on the past
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools
conda config --add channels conda-forge
What’s important to me right now?
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Diagram: one in-memory working set shared by three consumers]
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray: strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
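A minimal sketch of what "strided, homogeneous buffer" and "dtype" mean in practice, assuming NumPy is installed. The array shape and values are made up for illustration; the point is that views differ only in their metadata, not in the underlying bytes.

```python
import numpy as np

# A 3x4 array of 64-bit integers stored in one contiguous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Metadata: the dtype describes every element, the strides give the
# byte step along each axis.
print(a.dtype)    # int64
print(a.strides)  # (32, 8)

# A transpose is a new view over the same buffer with swapped strides.
t = a.T
print(t.strides)               # (8, 32)
print(np.shares_memory(a, t))  # True

# Taking every other column is also a view, so writes show up in `a`.
v = a[:, ::2]
print(v.strides)  # (32, 16)
v[0, 0] = 99
print(a[0, 0])    # 99
```

Sharing like this works inside one Python process; the slide's last bullet is about the lack of an agreed model for handing such buffers to other libraries or systems.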
Problems NumPy doesn’t solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
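Two of these gaps are easy to see from a Python prompt. The sketch below assumes NumPy is installed and uses made-up values. The separate boolean mask at the end is only one possible workaround, shown as an assumption rather than as how any particular library stores NULLs.

```python
import numpy as np

# Integer arrays have no slot for "missing": NaN is a floating-point value.
ints = np.array([1, 2, 3], dtype=np.int64)
try:
    ints[1] = np.nan
    assigned = True
except ValueError:
    assigned = False
print(assigned)  # False

# The usual workaround upcasts the whole column to float64.
with_missing = np.array([1, np.nan, 3])
print(with_missing.dtype)  # float64

# Variable-length strings fall back to an array of Python object pointers.
names = np.array(["a", "bb", None], dtype=object)
print(names.dtype)  # object

# Fixed-width unicode pads every element to the longest string.
fixed = np.array(["a", "bb", "cccccccc"])
print(fixed.dtype.itemsize)  # 32 (8 characters x 4 bytes)

# One alternative: keep the values and a separate validity mask.
values = np.array([1, 0, 3], dtype=np.int64)
valid = np.array([True, False, True])
print(values[valid].sum())  # 4
```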
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
• A significant % of the world’s data will be processed through Arrow!
Focus on CPU Efficiency
        session_id   timestamp         source_ip
Row 1   1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2   1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3   1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4   1331261196   3/8/2012 6:46PM   76.102.156.138
Traditional Memory Buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values laid out column by column (session_id, timestamp, source_ip)
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
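The row-versus-column difference can be sketched with NumPy alone, using the four records from the slide. This is only an illustration of memory layout under the assumption that NumPy is installed: a structured array stands in for the traditional row buffer, and plain contiguous arrays stand in for columnar storage. It is not Arrow's actual format.

```python
import numpy as np

# Row-oriented: each 40-byte record stores all three fields side by side.
rows = np.array(
    [
        (1331246660, b"3/8/2012 2:44PM", b"99.155.155.225"),
        (1331246351, b"3/8/2012 2:38PM", b"65.87.165.114"),
        (1331244570, b"3/8/2012 2:09PM", b"71.10.106.181"),
        (1331261196, b"3/8/2012 6:46PM", b"76.102.156.138"),
    ],
    dtype=[("session_id", "i8"), ("timestamp", "S16"), ("source_ip", "S16")],
)

# Scanning one field in the row layout steps over the other fields too.
row_view = rows["session_id"]
print(row_view.strides)  # (40,)

# Column-oriented: one contiguous array per field.
columns = {name: np.ascontiguousarray(rows[name]) for name in rows.dtype.names}
col = columns["session_id"]
print(col.strides)     # (8,)
print(int(col.max()))  # 1331261196
```

With the 8-byte stride, a scan over session_id touches only the bytes it needs, which is the cache-locality point on the slide.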
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
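A rough single-process analogy for the two columns above, using only the standard library. Here pickle stands in for "each system has its own format" and a memoryview stands in for "same memory format". Real cross-system sharing involves much more (transport, ownership, lifetimes), so treat this as a sketch of the copy-versus-no-copy distinction only.

```python
import pickle
from array import array

# Producer owns a buffer of one million 64-bit integers.
data = array("q", range(1_000_000))

# Today: the consumer gets a serialized copy and has to decode it.
payload = pickle.dumps(data)
copy = pickle.loads(payload)
copy[0] = -1
unchanged = data[0]
print(unchanged)  # 0, the consumer only modified its own copy

# Shared format: the consumer reads the producer's bytes in place.
view = memoryview(data)
print(view.nbytes)  # 8000000
view[0] = -1
print(data[0])      # -1, no bytes were copied or re-encoded
```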
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
Real World Example: Feather File Format for Python and R
R
    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)
Python
    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
More on Feather
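[Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library sits between the file and the language bindings: Rcpp for R data.frame, Cython for pandas DataFrame.]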
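The "arrays first, metadata last" idea in the diagram can be sketched with a toy file layout. Everything below is hypothetical and standard-library only: the JSON footer, the 8-byte length suffix, and the function names `write_table` and `read_column` are assumptions for illustration, not Feather's actual on-disk format or API.

```python
import io
import json
import struct
from array import array


def write_table(fh, columns):
    """Write each column buffer back to back, then a JSON metadata footer."""
    meta = []
    for name, values in columns.items():
        offset = fh.tell()
        raw = values.tobytes()
        fh.write(raw)
        meta.append({"name": name, "typecode": values.typecode,
                     "offset": offset, "nbytes": len(raw)})
    footer = json.dumps(meta).encode("utf-8")
    fh.write(footer)
    # The final 8 bytes record the footer length so a reader can find it.
    fh.write(struct.pack("<Q", len(footer)))


def read_column(fh, name):
    """Find the metadata from the end of the file, then read one column."""
    fh.seek(-8, io.SEEK_END)
    (footer_len,) = struct.unpack("<Q", fh.read(8))
    fh.seek(-8 - footer_len, io.SEEK_END)
    meta = json.loads(fh.read(footer_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    fh.seek(entry["offset"])
    out = array(entry["typecode"])
    out.frombytes(fh.read(entry["nbytes"]))
    return out


buf = io.BytesIO()
write_table(buf, {"session_id": array("q", [1331246660, 1331246351]),
                  "score": array("d", [0.5, 1.5])})
print(list(read_column(buf, "score")))  # [0.5, 1.5]
```

Because the metadata records each column's offset, a reader can pull a single column without scanning the others.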
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Diagram: layered stack of pandas, Arrow (C++ / Python), and Parquet (C++)]
Shared needs for Python, R, Julia, ...
• If programming languages can establish a common C/C++-level memory representation for data frames, we can share algorithms and libraries much more easily
• Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
Real World Example: Python With Spark, Drill, Impala
[Diagram: a SQL engine produces in partitions 0 ... n - 1. Each partition is the input to a user-supplied Python function, and each function's output becomes out partitions 0 ... n - 1, which flow back to a SQL engine.]
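The data flow in this diagram can be sketched in a few lines of plain Python. The names `apply_udf` and `bytes_to_kb` and the dict-of-lists partition shape are hypothetical stand-ins. A real engine such as Spark, Drill, or Impala would run the function in parallel and move the partitions across a process boundary, which is where a shared memory format matters.

```python
def apply_udf(partitions, func):
    """Stand-in for the engine: pass each input partition to the user's function."""
    return [func(part) for part in partitions]


def bytes_to_kb(part):
    """User-supplied Python code: derive a new column from an existing one."""
    return {"kb": [n / 1024 for n in part["bytes"]]}


# Two input partitions, each holding one column as a list of values.
in_partitions = [{"bytes": [1024, 2048]}, {"bytes": [4096]}]
out_partitions = apply_udf(in_partitions, bytes_to_kb)
print(out_partitions)  # [{'kb': [1.0, 2.0]}, {'kb': [4.0]}]
```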
Get Involved in Arrow
• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
Thank you
Wes McKinney @wesmckinn
Views are my own