python data ecosystem: thoughts on building for the future

© Cloudera, Inc. All rights reserved.

Python Data Ecosystem: Thoughts on Building for the Future

Wes McKinney @wesmckinn

PyData Berlin 2016-05-21


Me

• Data Science Tools at Cloudera, formerly DataPad CEO/founder

• Serial creator of structured data tools / user interfaces

• Wrote bestseller Python for Data Analysis (2012)

• Open source projects
   • Python {pandas, Ibis, statsmodels}
   • Apache {Arrow, Parquet, Kudu (incubating)}

• Mostly work in Python and Cython/C/C++


In process: Python for Data Analysis, 2nd Edition. Coming early 2017


Building open source communities


Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.

Wikipedia


Step 1 Be open and transparent


Step 2 Reach out to others


Step 3 Strive for consensus


Step 4 Value contributions extending beyond lines of code


Step 5 Make things harder for bad actors


Handling problems carefully


http://numfocus.org

http://apache.org


Python packaging


Packaging is hard

•  Reproducible infrastructure
•  Reproducible toolchains
•  Reproducible build scripts
•  Integration testing
•  Multiple library version builds
•  Multiple Python versions
•  Dependency resolution
•  Hosting and distribution
•  Multiple environment management


Reflecting on the past


conda-forge

•  Community-curated conda package channel (on anaconda.org)
•  Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
•  Automated GitHub helper tools

conda config --add channels conda-forge


What’s important to me right now?


Important things

•  Building bridges with other data science communities (R, Julia, Scala, etc.)
•  Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
•  Building Python tools for new and changing varieties of data


RAM as the new disk?

•  SSD – DRAM performance convergence

•  NVM developments (3D XPoint)

[Diagram: one in-memory working set shared by several consumers]


Problems

•  Memory (data structure) representations

•  Metadata representations

•  Memory ownership, life-cycle


NumPy solved this problem for Python scientists

•  Common memory representation
   •  ndarray: strided, homogeneous buffer

•  Common metadata
   •  NumPy dtypes

•  No well-defined memory sharing / messaging model: case-by-case basis
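The "strided, homogeneous buffer plus dtype metadata" idea can be seen directly from Python. The snippet below is a minimal sketch using only standard NumPy attributes; the array contents are arbitrary illustration values.

```python
import numpy as np

# A 2-D ndarray is one homogeneous buffer plus metadata
# (dtype, shape, strides) describing how to walk it.
a = np.arange(12, dtype=np.int64).reshape(3, 4)
print(a.dtype, a.shape, a.strides)   # int64 (3, 4) (32, 8)

# Transposing or slicing only rewrites the metadata; no data is copied.
t = a.T
col = a[:, 1]
print(t.strides, col.strides)        # (8, 32) (32,)
print(np.shares_memory(a, t), np.shares_memory(a, col))  # True True

# Because the view shares the buffer, a write through it shows up in `a`.
col[0] = 100
print(a[0, 1])  # 100
```

Any library that understands this buffer-plus-metadata convention can consume the array without conversion, which is what made NumPy a common exchange point.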


Problems NumPy doesn’t solve as well

•  Nested data types (think JSON)

•  Missing / NULL data

•  Strings and category types

•  Columnar memory representation for tables (think: analytic SQL databases)
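Two of these gaps are easy to reproduce. The sketch below uses plain NumPy; the separate validity mask is shown as one possible workaround (Arrow tracks validity in a bitmap, while this example uses one boolean per value for simplicity), and the sample values are made up.

```python
import numpy as np

ids = np.array([10, 20, 30], dtype=np.int64)

# NumPy integers have no NULL; marking a value missing with NaN
# forces the whole array to float64.
with_nan = np.array([10, np.nan, 30])
print(with_nan.dtype)  # float64

# Workaround sketch: leave the values untouched and track validity
# in a separate mask.
valid = np.array([True, False, True])
print(ids[valid].sum())  # 40

# Variable-length strings fall back to an array of Python object pointers.
names = np.array(["a", "a much longer string"], dtype=object)
print(names.dtype)  # object
```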


Apache Arrow

http://arrow.apache.org

Some slides from Strata-HW talk w/ Jacques Nadeau


Arrow in a Slide

•  New Top-level Apache Software Foundation project

•  Focused on Columnar In-Memory Analytics
   1.  10-100x speedup on many workloads
   2.  Common data layer enables companies to choose best of breed systems
   3.  Designed to work with any programming language
   4.  Support for both relational and complex data as-is

•  Developers from 13+ major open source projects involved

•  A significant % of the world’s data will be processed through Arrow!

Calcite

Cassandra

Deeplearning4j

Drill

Hadoop

HBase

Ibis

Impala

Kudu

Pandas

Parquet

Phoenix

Spark

Storm

R


Focus on CPU Efficiency

Example table:

         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138

Traditional Memory Buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4)

Arrow Memory Buffer: values laid out column by column (session_id, timestamp, source_ip)

• Cache Locality

• Super-scalar & vectorized operation

• Minimal Structure Overhead

• Constant value access
   • With minimal structure overhead

• Operate directly on columnar compressed data
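The cache-locality point can be illustrated with NumPy alone. This is a sketch, not Arrow itself: a NumPy structured array stands in for the row-oriented buffer and a plain int64 array for a single column, using the session_id and source_ip values from the example table.

```python
import numpy as np

session_id = [1331246660, 1331246351, 1331244570, 1331261196]
source_ip = [b"99.155.155.225", b"65.87.165.114",
             b"71.10.106.181", b"76.102.156.138"]

# Row-oriented: the fields of each record sit next to each other.
row_dtype = np.dtype([("session_id", np.int64), ("source_ip", "S16")])
rows = np.array(list(zip(session_id, source_ip)), dtype=row_dtype)

# Scanning one field means stepping over every other field of each record.
row_ids = rows["session_id"]
print(row_ids.strides)  # (24,)

# Column-oriented: the same values packed back to back.
col_ids = np.array(session_id, dtype=np.int64)
print(col_ids.strides)  # (8,)

print(row_ids.flags["C_CONTIGUOUS"], col_ids.flags["C_CONTIGUOUS"])  # False True
```

A scan over `col_ids` touches only the bytes it needs, which is the layout property that makes vectorized operations effective.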


High Performance Sharing & Interchange

Today

•  Each system has its own internal memory format

•  70-80% CPU wasted on serialization and deserialization

•  Similar functionality implemented in multiple projects

With Arrow

•  All systems utilize the same memory format

•  No overhead for cross-system communication

•  Projects can share functionality (e.g., Parquet-to-Arrow reader)
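The difference between the two columns can be sketched inside a single Python process. This is only an analogy built on the standard library (pickle for the serialize/deserialize path, memoryview for the shared-buffer path); it does not use Arrow and measures nothing.

```python
import pickle
from array import array

values = array("q", range(1000))  # 64-bit integers in one buffer

# "Today": hand-off by serialize + deserialize. The consumer ends up
# with its own private copy of the data.
payload = pickle.dumps(values)
copied = pickle.loads(payload)
copied[0] = -1
print(values[0])  # 0

# "With a shared format": the consumer wraps the producer's buffer
# directly. Nothing is encoded, decoded, or copied.
shared = memoryview(values)
shared[0] = -1
print(values[0])  # -1
```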


Arrow in action: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format

• By Wes McKinney (Python) and Hadley Wickham (R)

• Read speeds close to disk IO performance


Real World Example: Feather File Format for Python and R

R

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Python

import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)


More on Feather

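[Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by a METADATA block. The libfeather C++ library reads and writes the file; Rcpp connects it to the R data.frame and Cython connects it to the pandas DataFrame.]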
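The "arrays followed by metadata" layout can be mimicked in a few lines. The sketch below is a toy format invented for illustration: it is not the Feather on-disk format, and `write_toy`, `read_toy`, and the JSON footer are hypothetical names and choices.

```python
import io
import json
import struct

import numpy as np


def write_toy(columns, fh):
    """Write each column's raw bytes, then a JSON metadata footer."""
    meta = []
    for name, arr in columns.items():
        arr = np.ascontiguousarray(arr)
        meta.append({"name": name, "dtype": arr.dtype.str,
                     "offset": fh.tell(), "length": len(arr)})
        fh.write(arr.tobytes())
    footer = json.dumps(meta).encode("utf-8")
    fh.write(footer)
    fh.write(struct.pack("<I", len(footer)))  # footer size goes last


def read_toy(data):
    """Locate the footer, then wrap each column without copying it."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    meta = json.loads(data[-4 - footer_len:-4])
    return {m["name"]: np.frombuffer(data, dtype=m["dtype"],
                                     count=m["length"], offset=m["offset"])
            for m in meta}


buf = io.BytesIO()
write_toy({"x": np.array([1, 2, 3], dtype=np.int64),
           "y": np.array([0.5, 1.5, 2.5])}, buf)
table = read_toy(buf.getvalue())
print(table["x"].tolist(), table["y"].tolist())  # [1, 2, 3] [0.5, 1.5, 2.5]
```

Because the metadata records where each array starts, a reader can pull out a single column without touching the others.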


Feather: the good and not-so-good

•  Good
   •  Language-agnostic memory representation
   •  Extremely fast
   •  New storage features can be added without much difficulty

•  Not-so-good
   •  Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)


Apache Parquet: Python support is coming

•  Collaborating with Uwe Korn from Blue Yonder

[Diagram: pandas <-> Arrow (C++ / Python) <-> Parquet (C++)]


Shared needs for Python, R, Julia, ...

•  If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily

•  Example: dplyr’s in-memory backend

•  Other requirements
   •  Permissive licensing (Python / Julia require MIT/Apache-like)
   •  Common build/test/packaging for shared C/C++ library components


Real World Example: Python With Spark, Drill, Impala

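[Diagram: a SQL engine splits its input into partitions (in partition 0 ... in partition n - 1). Each partition is passed as input to a user-supplied Python function, and each function's output becomes an output partition (out partition 0 ... out partition n - 1) that is returned to the SQL engine.]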
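The data flow in the diagram can be sketched in pure Python. Here `run_udf` is a hypothetical stand-in for the engine side and `double_evens` for the user-supplied function; no real Spark, Drill, or Impala API is used.

```python
from typing import Callable, Iterable, List

Partition = List[int]


def run_udf(partitions: Iterable[Partition],
            func: Callable[[Partition], Partition]) -> List[Partition]:
    """Stand-in for the engine: call the user function once per partition."""
    return [func(part) for part in partitions]


def double_evens(part: Partition) -> Partition:
    """User-supplied code: keep the even values and double them."""
    return [x * 2 for x in part if x % 2 == 0]


in_partitions = [[1, 2, 3], [4, 5, 6]]
out_partitions = run_udf(in_partitions, double_evens)
print(out_partitions)  # [[4], [8, 12]]
```

In a real deployment each hand-off between the engine and the Python function crosses a process or language boundary, which is where a shared memory format avoids conversion work.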


Get Involved in Arrow

•  Join the community
   • [email protected]
   • Slack: https://apachearrowslackin.herokuapp.com/
   • http://arrow.apache.org
   • @ApacheArrow


Thank you

Wes McKinney @wesmckinn

Views are my own
