Python Data Ecosystem: Thoughts on Building for the Future, Wes McKinney (@wesmckinn), PyData Berlin, 2016-05-21

Post on 15-Jan-2017


TRANSCRIPT

Page 1: Python Data Ecosystem: Thoughts on Building for the Future

1 © Cloudera, Inc. All rights reserved.

Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21

Page 2: Python Data Ecosystem: Thoughts on Building for the Future

Me

• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++

Page 3: Python Data Ecosystem: Thoughts on Building for the Future

In process: Python for Data Analysis, 2nd Edition
Coming early 2017

Page 4: Python Data Ecosystem: Thoughts on Building for the Future

Building open source communities

Page 5: Python Data Ecosystem: Thoughts on Building for the Future

Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals.

Wikipedia

Page 6: Python Data Ecosystem: Thoughts on Building for the Future

Step 1: Be open and transparent

Page 7: Python Data Ecosystem: Thoughts on Building for the Future

Step 2: Reach out to others

Page 8: Python Data Ecosystem: Thoughts on Building for the Future

Step 3: Strive for consensus

Page 9: Python Data Ecosystem: Thoughts on Building for the Future

Step 4: Value contributions extending beyond lines of code

Page 10: Python Data Ecosystem: Thoughts on Building for the Future

Step 5: Make things harder for bad actors

Page 11: Python Data Ecosystem: Thoughts on Building for the Future

Page 12: Python Data Ecosystem: Thoughts on Building for the Future

Handling problems carefully

Page 13: Python Data Ecosystem: Thoughts on Building for the Future

http://numfocus.org

http://apache.org

Page 14: Python Data Ecosystem: Thoughts on Building for the Future

Python packaging

Page 15: Python Data Ecosystem: Thoughts on Building for the Future

Packaging is hard

• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management

Page 16: Python Data Ecosystem: Thoughts on Building for the Future

Reflecting on the past

Page 17: Python Data Ecosystem: Thoughts on Building for the Future

Page 18: Python Data Ecosystem: Thoughts on Building for the Future

conda-forge

• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools

conda config --add channels conda-forge

Page 19: Python Data Ecosystem: Thoughts on Building for the Future

What’s important to me right now?

Page 20: Python Data Ecosystem: Thoughts on Building for the Future

Important things

• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data

Page 21: Python Data Ecosystem: Thoughts on Building for the Future

RAM as the new disk?

• SSD – DRAM performance convergence
• NVM developments (3D XPoint)

[Figure labels: Memory working set; Consumer, Consumer, Consumer]

Page 22: Python Data Ecosystem: Thoughts on Building for the Future

Problems

• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle

Page 23: Python Data Ecosystem: Thoughts on Building for the Future

NumPy solved this problem for Python scientists

• Common memory representation
  • ndarray: strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
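To make "strided, homogeneous buffer" and "dtypes" concrete, here is a minimal sketch using standard NumPy attributes. The array contents and variable names are illustrative only and are not taken from the talk.

```python
import numpy as np

# A 3x4 array of 64-bit integers stored in one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The dtype is the shared metadata: element type and size in bytes.
print(a.dtype, a.itemsize)        # int64 8

# Strides: how many bytes to step to reach the next row / next column.
print(a.strides)                  # (32, 8)

# Slicing and transposing only change the strides; the memory is shared.
col = a[:, 1]
t = a.T
print(col.strides, t.strides)     # (32,) (8, 32)
print(np.shares_memory(a, col))   # True

# Writing through the view is visible in the original array.
col[0] = 100
print(a[0, 1])                    # 100
```

Because every library agrees on this layout and metadata, an ndarray can be handed between libraries without conversion, which is the sense in which NumPy "solved this problem" inside the Python process.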

Page 24: Python Data Ecosystem: Thoughts on Building for the Future

Problems NumPy doesn’t solve as well

• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
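A short sketch of how some of these gaps show up in practice, using only standard NumPy behavior. The sample values are made up, and the values-plus-mask pattern at the end is one common workaround shown as an assumption, not something prescribed by the slides.

```python
import numpy as np

# Missing data: integers have no NULL, so a NaN forces an upcast to float64.
ints = np.array([1, 2, 3])
with_missing = np.array([1, np.nan, 3])
print(ints.dtype.kind, with_missing.dtype)   # i float64

# Strings: the fixed-width unicode dtype pads every element to the longest value.
names = np.array(["a", "pandas"])
print(names.dtype.itemsize)                  # 24 (6 characters x 4 bytes each)

# Nested, JSON-like records fall back to dtype=object (an array of Python object pointers).
records = np.array([{"a": 1}, {"a": [1, 2]}], dtype=object)
print(records.dtype)                         # object

# Workaround sketch (assumption): keep the integer values plus a separate validity mask.
values = np.array([1, 0, 3], dtype=np.int64)
valid = np.array([True, False, True])
print(values[valid].sum())                   # 4
```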

Page 25: Python Data Ecosystem: Thoughts on Building for the Future

Apache Arrow

http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau

Page 26: Python Data Ecosystem: Thoughts on Building for the Future

Arrow in a Slide

• New top-level Apache Software Foundation project
• Focused on columnar in-memory analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best-of-breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!

Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Page 27: Python Data Ecosystem: Thoughts on Building for the Future

Focus on CPU Efficiency

session_id   timestamp         source_ip
1331246660   3/8/2012 2:44PM   99.155.155.225
1331246351   3/8/2012 2:38PM   65.87.165.114
1331244570   3/8/2012 2:09PM   71.10.106.181
1331261196   3/8/2012 6:46PM   76.102.156.138

[Figure: the table laid out as a Traditional Memory Buffer (Row 1 to Row 4, one row after another) and as an Arrow Memory Buffer (one column after another)]

• Cache locality
• Super-scalar & vectorized operation
• Minimal structure overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
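The difference between the two layouts can be sketched with the Python standard library alone. This is an illustration of row-oriented versus column-oriented storage using the slide's four rows; it is not Arrow's actual memory format, and the variable names are hypothetical.

```python
from array import array

# The four example rows from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Columnar layout: one sequence per column; session_id is a packed int64 buffer.
columns = {
    "session_id": array("q", (r[0] for r in rows)),
    "timestamp": [r[1] for r in rows],
    "source_ip": [r[2] for r in rows],
}

# Row-wise scan: every row tuple is visited even though one field is needed.
max_row_wise = max(r[0] for r in rows)

# Column-wise scan: only the contiguous session_id buffer is read.
max_col_wise = max(columns["session_id"])

print(max_row_wise == max_col_wise)            # True
print(len(columns["session_id"].tobytes()))    # 32 (4 values x 8 bytes)
```

A scan over a single packed buffer is what gives the cache locality and vectorization benefits listed above; the row layout drags unrelated fields through the cache.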

Page 28: Python Data Ecosystem: Thoughts on Building for the Future

High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
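The contrast between the two columns can be imitated inside a single Python process. In this sketch, `pickle` stands in for a serialize/deserialize hop between systems and `memoryview` stands in for two consumers agreeing on one memory format. It is an analogy under those assumptions, not Arrow's API.

```python
import pickle
from array import array

data = array("d", [1.5, 2.5, 3.5, 4.0])

# "Today": each hop serializes and deserializes, producing an independent copy.
payload = pickle.dumps(data)
copy = pickle.loads(payload)

# "Shared format": the consumer interprets the producer's bytes in place.
view = memoryview(data)   # no bytes are copied
print(view.format, view.nbytes)   # d 32

# A change made by the producer is visible through the view, not in the copy.
data[0] = 10.0
print(view[0], copy[0])   # 10.0 1.5
```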

Page 29: Python Data Ecosystem: Thoughts on Building for the Future

Arrow in action: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance

Page 30: Python Data Ecosystem: Thoughts on Building for the Future

Real World Example: Feather File Format for Python and R

R

    library(feather)

    path <- "my_data.feather"
    write_feather(df, path)

    df <- read_feather(path)

Python

    import feather

    path = 'my_data.feather'

    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)

Page 31: Python Data Ecosystem: Thoughts on Building for the Future

More on Feather

[Figure: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library sits between the file and the language bindings: Rcpp for R data.frame, Cython for pandas DataFrame.]
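The "arrays followed by metadata" idea can be sketched as a toy file writer and reader. Everything here is hypothetical: the function names, the JSON metadata, and the 4-byte length footer are illustration only and do not reproduce the real Feather layout or the libfeather API.

```python
import io
import json
import struct
from array import array


def write_toy_file(columns):
    """Write {name: array} as raw column buffers followed by a metadata block."""
    buf = io.BytesIO()
    meta = []
    for name, col in columns.items():
        raw = col.tobytes()
        # Record where each column starts so a reader can seek straight to it.
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": buf.tell(), "nbytes": len(raw)})
        buf.write(raw)
    footer = json.dumps(meta).encode("utf-8")
    buf.write(footer)
    # Final 4 bytes: little-endian length of the metadata block.
    buf.write(struct.pack("<I", len(footer)))
    return buf.getvalue()


def read_toy_file(blob):
    """Read the metadata from the end of the blob, then rebuild each column."""
    (meta_len,) = struct.unpack("<I", blob[-4:])
    meta = json.loads(blob[-4 - meta_len:-4].decode("utf-8"))
    out = {}
    for m in meta:
        col = array(m["typecode"])
        col.frombytes(blob[m["offset"]:m["offset"] + m["nbytes"]])
        out[m["name"]] = col
    return out


table = {"x": array("q", [1, 2, 3]), "y": array("d", [0.5, 1.5, 2.5])}
roundtrip = read_toy_file(write_toy_file(table))
print(roundtrip == table)   # True
```

Putting the metadata at the end lets a writer stream the columns first and describe them afterwards, and lets a reader load only the columns it needs.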

Page 32: Python Data Ecosystem: Thoughts on Building for the Future

Feather: the good and not-so-good

• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)

Page 33: Python Data Ecosystem: Thoughts on Building for the Future

Apache Parquet: Python support is coming

• Collaborating with Uwe Korn from Blue Yonder

[Figure labels: pandas; Arrow (C++ / Python); Parquet (C++)]

Page 34: Python Data Ecosystem: Thoughts on Building for the Future

Shared needs for Python, R, Julia, ...

• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
• Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components

Page 35: Python Data Ecosystem: Thoughts on Building for the Future

Real World Example: Python With Spark, Drill, Impala

[Figure: a SQL Engine emits input partitions (in partition 0 ... in partition n - 1). Each partition is passed as input to a Python function (user-supplied Python code), whose output becomes the matching output partition (out partition 0 ... out partition n - 1), which is returned to the SQL Engine.]
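The data flow in the diagram can be sketched as a function applied once per partition. The `apply_udf` helper and the `double` function are hypothetical stand-ins: real engines such as Spark, Drill, and Impala each have their own mechanisms for invoking user code, which this sketch does not reproduce.

```python
from array import array


def apply_udf(partitions, func):
    """Call the user function once per input partition; return one output per partition."""
    return [func(p) for p in partitions]


def double(values):
    """User-supplied Python code: receives a whole column chunk, returns a new one."""
    return array(values.typecode, (v * 2 for v in values))


# Two input partitions of unequal size, as an engine might hand them over.
in_partitions = [array("q", [1, 2]), array("q", [3, 4, 5])]
out_partitions = apply_udf(in_partitions, double)

print([list(p) for p in out_partitions])   # [[2, 4], [6, 8, 10]]
```

Passing a whole partition per call, rather than one row per call, is what makes a shared columnar format valuable here: the engine and the Python function could exchange the buffer without converting it.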

Page 36: Python Data Ecosystem: Thoughts on Building for the Future

Get Involved in Arrow

• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow

Page 37: Python Data Ecosystem: Thoughts on Building for the Future

Thank you
Wes McKinney @wesmckinn
Views are my own