Does Python stand a chance in today's world of data science?
![Page 1: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/1.jpg)
Does Python stand a chance in today’s world of data science?
Radim Řehůřek
![Page 2: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/2.jpg)
YES
![Page 3: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/3.jpg)
![Page 4: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/4.jpg)
![Page 5: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/5.jpg)
![Page 6: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/6.jpg)
RaRe Technologies Ltd.
![Page 7: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/7.jpg)
Python vs. rest
● performance?
● deployment?
● logging, debugging?
● workflow, integration?
![Page 8: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/8.jpg)
SVD
![Page 9: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/9.jpg)
English Wikipedia
● ~3.5M docs
● ~2G words
● with 100K vocab, ~0.5G matrix non-zeros
  ○ very sparse
● small-ish, but known & accessible and out-of-core
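To put these figures in perspective, a back-of-the-envelope calculation shows how sparse the document-by-vocabulary matrix is. This is only a sketch: the counts are the rounded numbers from the slide, and the storage costs (8-byte values, 4-byte indices) are illustrative assumptions, not figures from the talk.

```python
# Rounded figures from the slide: ~3.5M docs, 100K vocabulary, ~0.5G non-zeros.
n_docs = 3_500_000
vocab_size = 100_000
non_zeros = 500_000_000

# Number of cells in the full (dense) docs x vocabulary matrix.
total_cells = n_docs * vocab_size

# Fraction of cells that actually hold a non-zero count.
density = non_zeros / total_cells

# Assumption for illustration only: 8 bytes per stored value,
# plus a 4-byte index per non-zero in the sparse representation.
dense_bytes = total_cells * 8
sparse_bytes = non_zeros * (8 + 4)

print(f"density: {density:.4%}")
print(f"dense storage:  {dense_bytes / 1e12:.1f} TB")
print(f"sparse storage: {sparse_bytes / 1e9:.1f} GB")
```

Under these assumptions only about one cell in 700 is non-zero, which is why a sparse, streamed representation is the natural fit for this benchmark.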
![Page 10: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/10.jpg)
![Page 11: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/11.jpg)
![Page 12: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/12.jpg)
![Page 13: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/13.jpg)
Spark mllib
● top level Apache project, Scala
● RDDs, Resilient Distributed Datasets
● ~RAM caching + execution engine
● latest Spark 1.3.0 + mllib
● AWS EMR cluster (4x m3.xlarge)
![Page 14: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/14.jpg)
SVD @ mllib
![Page 15: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/15.jpg)
![Page 16: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/16.jpg)
Mahout SSVD
● the “scikit-learn” of Hadoop, Java
● originally on MapReduce
● now Mahout Samsara @ Spark, Scala
● newest Mahout 0.10.0
![Page 17: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/17.jpg)
+ “local mode” eats up all disk, then fails
![Page 18: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/18.jpg)
Google’s word2vec
unsupervised ML
● Berlin is to Germany as Paris is to …?
● king - man + woman = queen
● which word doesn’t fit? “dinner cereal breakfast lunch”
http://radimrehurek.com/2014/02/word2vec-tutorial/#app
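The two query types on this slide can be sketched with plain vector arithmetic and cosine similarity. The vectors below are hypothetical, hand-picked toy values (real word2vec vectors are learned from text and have hundreds of dimensions), and the helper names `analogy` and `odd_one_out` are made up for this illustration. They are not the gensim API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the query words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def odd_one_out(vectors, words):
    """Word least similar to the plain average of all the given words."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

# Hypothetical toy vectors; dimensions loosely read as (royalty, male, female).
people = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.3, 0.3],
}

# Hypothetical toy vectors; dimensions loosely read as (meal, foodstuff).
meals = {
    "dinner":    [0.9, 0.2],
    "breakfast": [0.9, 0.3],
    "lunch":     [0.9, 0.25],
    "cereal":    [0.2, 0.9],
}

print(analogy(people, "man", "king", "woman"))  # king - man + woman
print(odd_one_out(meals, ["dinner", "cereal", "breakfast", "lunch"]))
```

With these toy values the first query returns "queen" and the second picks "cereal". On a trained model the same ideas apply, only with learned vectors and a much larger vocabulary.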
![Page 19: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/19.jpg)
Word2vec @ Wikipedia
![Page 20: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/20.jpg)
C vs. NumPy vs. optimized
(+pure Python: 120x slower than baseline)
![Page 21: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/21.jpg)
Single machine parallelization
C (1/2/4 workers): 1.0x / 1.9x / 3.2x
gensim (1/2/4 workers): 1.0x / 1.75x / 2.85x
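One way to read these numbers is as parallel efficiency, that is, speedup divided by the number of workers. The sketch below assumes each row is normalized to its own single-worker run, which the 1.0x entries suggest but the slide does not state explicitly.

```python
# Speedups from the slide, keyed by number of workers.
speedups = {
    "C":      {1: 1.0, 2: 1.9,  4: 3.2},
    "gensim": {1: 1.0, 2: 1.75, 4: 2.85},
}

def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved with this many workers."""
    return speedup / workers

efficiency = {
    name: {w: parallel_efficiency(s, w) for w, s in by_workers.items()}
    for name, by_workers in speedups.items()
}

for name, by_workers in efficiency.items():
    for workers, value in by_workers.items():
        print(f"{name:>6} with {workers} worker(s): {value:.0%} of linear scaling")
```

At 4 workers this gives 80% for the C tool and roughly 71% for gensim, so both scale well on a single machine, with the C implementation slightly ahead.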
![Page 22: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/22.jpg)
streaming (Python generator) for input
+ amazing Python ecosystem on either end!
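A minimal sketch of what streaming input through a Python generator can look like. The function name `stream_sentences` and the whitespace tokenization are assumptions made for this example, not code from the talk.

```python
import io

def stream_sentences(lines):
    """Yield one tokenized sentence at a time from any iterable of text lines.

    Only the current line is held in memory, so the source can be a file,
    a network stream, or a database cursor that is larger than RAM.
    """
    for line in lines:
        tokens = line.lower().split()
        if tokens:  # skip blank lines
            yield tokens

# Stand-in for a large file on disk; any iterable of strings works.
corpus = io.StringIO("Python stands a chance\n\nData science needs tools\n")

for sentence in stream_sentences(corpus):
    print(sentence)
```

Because the consumer sees only an iterable of token lists, the producer side is free to use whatever the Python ecosystem offers for reading and cleaning the data. Note that a generator can be consumed only once, so an algorithm that makes several passes over the data needs a restartable iterable instead.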
![Page 23: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/23.jpg)
![Page 24: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/24.jpg)
![Page 25: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/25.jpg)
word2vec @ mllib
![Page 26: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/26.jpg)
![Page 27: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/27.jpg)
scaling down for Spark
=> if scaling linearly, Spark needs a cluster of ~12 machines to break even (vs. pySpark)
![Page 28: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/28.jpg)
Deeplearning4j
David Przybilla, Idio Ltd.
https://github.com/idio/wiki2vec
![Page 29: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/29.jpg)
ANN libs @ Wikipedia
![Page 30: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/30.jpg)
“Do one thing and do it well.”
Doug McIlroy
"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."
Zawinski’s law of software development
Tools
![Page 31: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/31.jpg)
APIs & Abstractions
![Page 32: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/32.jpg)
Configuration, setup, deployment
Python 2 vs Python 3
… the real work!
![Page 33: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/33.jpg)
Java
![Page 34: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/34.jpg)
Complex pipelines
● Python: Luigi (~Spotify), Pinball (~Pinterest)
● Java: Apache Oozie, Azkaban...
![Page 35: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/35.jpg)
Logging
● UI, job trackers
● tracebacks, continuous
● configurable
● human readable
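The points about configurable, human-readable logs with tracebacks can be illustrated with Python's standard `logging` module. This is a sketch only: the logger name `pipeline_demo` and the helpers `build_logger` and `process` are hypothetical, chosen for this example.

```python
import io
import logging

def build_logger(stream, level=logging.INFO):
    """Create a logger writing human-readable lines to the given stream."""
    logger = logging.getLogger("pipeline_demo")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep the demo output in our stream only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s : %(levelname)s : %(message)s")
    )
    logger.addHandler(handler)
    return logger

def process(logger, docs):
    """Count tokens, logging a full traceback for bad documents and carrying on."""
    total = 0
    for i, doc in enumerate(docs):
        try:
            total += len(doc.split())
        except AttributeError:
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception("skipping document #%d: not a string", i)
    logger.info("processed %d documents, %d tokens", len(docs), total)
    return total

log_output = io.StringIO()
logger = build_logger(log_output)
token_count = process(logger, ["hello world", None, "one two three"])
print(log_output.getvalue())
```

The job keeps running after the bad document, and the log retains both the traceback and a one-line summary. Changing the level or the format string is a configuration change and leaves the processing code untouched.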
![Page 36: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/36.jpg)
Navigating the tool landscape
Let it go; if it’s meant to be, it will come back.
![Page 37: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/37.jpg)
Take “progress” easy
![Page 38: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/38.jpg)
Summary
Python’s greatest differentiating factors:
● +experienced full stack engineers
● +pragmatic, mature tools
● +HPC & scientific “baggage”
● -meh deployment, orchestration, packaging
● -not as much enterprise “baggage”
![Page 39: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/39.jpg)
![Page 40: Does Python stand a chance in today's world of data science?](https://reader031.vdocuments.us/reader031/viewer/2022030320/586b5d081a28ab432d8bb703/html5/thumbnails/40.jpg)
Radim Řehůřek
http://rare-technologies.com
(formerly radimrehurek.com)
@radimrehurek
Blog (mostly tech): http://radimrehurek.com/blog/