Python and H2O with Cliff Click at PyData Dallas 2015
TRANSCRIPT
H2O.ai Machine Intelligence
Fast, Scalable In-Memory Machine and Deep Learning for Smarter Applications
Python with H2O
Cliff Click
Who Am I?
Cliff Click
CTO, Co-Founder
[email protected]
40 yrs coding
35 yrs building compilers
30 yrs distributed computation
20 yrs OS, device drivers, HPC, HotSpot
10 yrs low-latency GC, custom Java hardware, NonBlockingHashMap
20 patents, dozens of papers
100s of public talks
PhD Computer Science, 1995, Rice University
HotSpot JVM Server Compiler
“showed the world JITing is possible”
H2O: Open Source In-Memory Machine Learning for Big Data
Distributed In-Memory Math Platform
GLM, GBM, RF, K-Means, PCA, Deep Learning
Easy to use SDK & API
Java, R/CRAN, Scala, Spark, Python, JSON, Browser GUI
Use ALL your data
Modeling without sampling
HDFS, S3, NFS, NoSQL
Big Data & Better Algorithms
Better Predictions!
TBD, Customer Support
TBD, Head of Sales
Distributed Systems Engineers Making ML Scale!
Practical Machine Learning
Value                    Requirements
Fast & Interactive       In-Memory
Big Data (No Sampling)   Distributed
Ownership                Open Source
Extensibility            API/SDK
Portability              Java, REST/JSON
Infrastructure           Cloud or On-Premise, Hadoop or Private Cluster
H2O Architecture
Prediction Engine
R & Exec Engine
Web Interface
Spark Scala REPL
Nano-Fast Scoring Engine
Distributed In-Memory K/V Store
Column-Compressed Data
Map/Reduce
Memory Manager
Algorithms! GBM, Random Forest, GLM, PCA, K-Means, Deep Learning
HDFS, S3, NFS
Demo!
Python Demo
● CitiBike of NYC
● Predict bikes-per-hour-per-station
  – From per-trip logs
● 10M rows of data
● Group-By, date/time feature-munging
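The group-by and date/time munging step can be sketched in plain Python. This is only an illustration of the idea using the standard library, not the H2O API used in the demo; the sample rows, the column layout, and the function names `bikes_per_hour_per_station` and `time_features` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical per-trip log rows: (trip start time, start station name).
trips = [
    ("2013-07-01 08:05:00", "Station A"),
    ("2013-07-01 08:40:00", "Station A"),
    ("2013-07-01 09:10:00", "Station A"),
    ("2013-07-01 08:15:00", "Station B"),
]

def bikes_per_hour_per_station(rows):
    """Group trips by (station, date, hour) and count the trips in each group."""
    counts = Counter()
    for start_time, station in rows:
        ts = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
        counts[(station, ts.date().isoformat(), ts.hour)] += 1
    return counts

def time_features(date_str, hour):
    """Derive simple date/time features for one (date, hour) group."""
    day = datetime.strptime(date_str, "%Y-%m-%d")
    return {
        "hour": hour,
        "day_of_week": day.weekday(),      # Monday is 0
        "is_weekend": day.weekday() >= 5,  # Saturday or Sunday
    }

counts = bikes_per_hour_per_station(trips)
```

Each key of `counts` is one row of the modeling table (station, date, hour), the count is the value to predict, and `time_features` supplies the predictor columns.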
H2O: A Platform for Big Math
● Most Any Java on Big 2-D Tables
  – Write like it's single-thread POJO code
  – Runs distributed & parallel by default
● Fast: billion row logistic regression takes 4 sec
● World's first parallel & distributed GBM
  – Plus GBM, Deep Learn / Neural Nets, RF, PCA, GLM...
● R integration: use terabyte datasets from R
● Sparkling Water: Direct Spark integration
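The "write single-threaded code, run it distributed" idea can be illustrated with a small map/reduce sketch. This is a conceptual Python model, not H2O's Java API: the chunking scheme and the names `split_into_chunks`, `map_chunk`, `reduce_pair`, and `chunked_mean` are assumptions made for the example, and the chunks are processed sequentially here where a real cluster would process them in parallel.

```python
from functools import reduce

def split_into_chunks(column, chunk_size):
    """Split one column into fixed-size chunks, as a stand-in for data spread over nodes."""
    return [column[i:i + chunk_size] for i in range(0, len(column), chunk_size)]

def map_chunk(chunk):
    """Per-chunk work: ordinary single-threaded code over the local rows."""
    return (sum(chunk), len(chunk))

def reduce_pair(a, b):
    """Combine two partial results; associative, so any combining order works."""
    return (a[0] + b[0], a[1] + b[1])

def chunked_mean(column, chunk_size=3):
    """Mean of a non-empty column computed from per-chunk partial results."""
    partials = [map_chunk(c) for c in split_into_chunks(column, chunk_size)]
    total, n = reduce(reduce_pair, partials)
    return total / n

result = chunked_mean([1, 2, 3, 4, 5, 6, 7])
```

Because `reduce_pair` is associative, the partial results can be combined in any grouping, which is what lets the same user code run on one thread or across many machines.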
H2O: A Platform for Big Math
● Easy launch: “java -jar h2o.jar”
  – No GC tuning: -Xmx as big as you like
● Production ready:
  – Private on-premise cluster OR
  – In the Cloud
  – Hadoop, Yarn, EC2, or standalone cluster
  – HDFS, S3, NFS, URI & other datasources
  – Open Source, Apache v2
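As a small illustration of the launch line, the sketch below assembles the `java -jar h2o.jar` command with an optional `-Xmx` heap setting. The helper `h2o_launch_command` and the `8g` heap value are hypothetical; only `java`, `-Xmx`, `-jar`, and `h2o.jar` come from the slide. The command is built but not executed.

```python
import shlex

def h2o_launch_command(heap=None, jar="h2o.jar"):
    """Build the argument list for launching H2O from its jar.

    heap is an optional JVM heap size such as "8g"; JVM options
    like -Xmx must come before -jar on the command line.
    """
    cmd = ["java"]
    if heap:
        cmd.append(f"-Xmx{heap}")
    cmd += ["-jar", jar]
    return cmd

# Render the command as a shell-safe string for display.
command_line = shlex.join(h2o_launch_command(heap="8g"))
```

The resulting list can be passed to `subprocess.run` on a machine where Java and `h2o.jar` are available.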