Introduction to Fast Scalable Machine Learning with H2O - Plano
TRANSCRIPT
H2O.ai Machine Intelligence
What is H2O?
Math Platform: Open source in-memory prediction engine
• Parallelized and distributed algorithms making the most use out of multithreaded systems
• GLM, Random Forest, GBM, PCA, etc.
API: Easy to use and adopt
• Written in Java – perfect for Java programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
Big Data: More data? Or better models? BOTH
• Use all of your data – model without down sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
H2O Production Analytics Workflow
Data sources: HDFS, S3, NFS, Local
Load Data: distributed in-memory, loss-less compression
H2O Compute Engine:
• Exploratory & Descriptive Analysis
• Feature Engineering & Selection
• Supervised & Unsupervised Modeling
• Model Evaluation & Selection
• Predict
• Data & Model Storage
Production Scoring Environment:
• Model Export: Plain Old Java Object
• Data Prep Export: Plain Old Java Object
• Your Imagination
Algorithms on H2O: Supervised Learning
Statistical Analysis
• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
Ensembles
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
Deep Neural Networks
• Deep learning: Creates multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Algorithms on H2O: Unsupervised Learning
Clustering
• K-means: Partitions observations into k clusters/groups of roughly similar spatial extent
• Generalized Low Rank Models*: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
Dimensionality Reduction
• Principal Component Analysis: Linearly transforms correlated variables into uncorrelated components
Anomaly Detection
• Autoencoders: Find outliers using nonlinear dimensionality reduction based on deep learning
H2O Software Stack
Diagram layers, top to bottom:
• Clients: JavaScript, R, Python, Excel/Tableau, Flow
• Network
• Rapids Expression Evaluation Engine, Scala, Customer Algorithm
• Algorithms: Parse, GLM, GBM, RF, Deep Learning, K-Means, PCA, Customer Algorithm
• In-H2O Prediction Engine
• Fluid Vector Frame, Distributed K/V Store, Non-blocking Hash Map
• Job, MRTask, Fork/Join
• Runs on: Spark, Hadoop, Standalone H2O
Scientific Advisory Council
Stephen Boyd, Professor of Electrical Engineering, Stanford University
Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University
Trevor Hastie, Professor of Statistics, Stanford University
H2O and R
Reading Data from HDFS into H2O with R
STEP 1: The R user calls
h2o_df = h2o.importFile("hdfs://path/to/data.csv")
Reading Data from HDFS into H2O with R
STEP 2
2.1 R function call: h2o.importFile()
2.2 HTTP REST API request to H2O has HDFS path
2.3 H2O cluster initiates distributed ingest
2.4 H2O requests data (data.csv) from HDFS
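The request in step 2 carries only the HDFS path, not the data. A minimal Python sketch of what a client might build is shown here; it only constructs the URL and never contacts a cluster. The host, the port 54321, the /3/ImportFiles route, and the path query parameter are assumptions for illustration and are not taken from the slides.

```python
from urllib.parse import urlencode, urlunsplit


def build_import_url(host: str, port: int, hdfs_path: str) -> str:
    """Build the URL a client could use to ask H2O to ingest a file.

    The route and parameter name are assumptions. Only the path is sent;
    the cluster itself reads the bytes from HDFS.
    """
    query = urlencode({"path": hdfs_path})  # percent-encodes ':' and '/'
    return urlunsplit(("http", f"{host}:{port}", "/3/ImportFiles", query, ""))


url = build_import_url("localhost", 54321, "hdfs://path/to/data.csv")
print(url)
```

Because the client sends a path rather than a payload, the size of data.csv does not affect the size of the request.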
Reading Data from HDFS into H2O with R
STEP 3
3.1 HDFS provides data (data.csv)
3.2 Distributed H2OFrame in DKV, spread across the H2O cluster
3.3 Return pointer to data in REST API JSON response
3.4 h2o_df object created in R
The h2o_df object in R holds: Cluster IP, Cluster Port, Pointer to Data
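Step 3 ends with the client holding a pointer, not the data. The following Python sketch illustrates that idea with a small handle object built from a JSON response. The class name, the field names, and the shape of the JSON body are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameHandle:
    """Client-side stand-in for h2o_df: an address, not the data itself."""
    cluster_ip: str
    cluster_port: int
    frame_key: str  # pointer to the distributed frame (hypothetical name)


def handle_from_response(ip: str, port: int, body: str) -> FrameHandle:
    # Assumed response shape: {"destination_frame": {"name": "<key>"}}
    payload = json.loads(body)
    return FrameHandle(ip, port, payload["destination_frame"]["name"])


response_body = '{"destination_frame": {"name": "data.hex"}}'
h2o_df = handle_from_response("127.0.0.1", 54321, response_body)
print(h2o_df)
```

The handle stays a few bytes in size no matter how many rows the frame on the cluster holds.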
R Script Starting H2O GLM
User process (standard R process)
- R script calls h2o.glm()
- .h2o.startModelJob() issues POST /3/ModelBuilders/glm
- HTTP REST/JSON over TCP/IP
H2O process
- Network layer: HTTP REST/JSON
- REST layer: /3/ModelBuilders/glm endpoint
- H2O - algos: Job, GLM algorithm, GLM tasks
- H2O - core: Fork/Join framework, K/V store framework
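The POST to /3/ModelBuilders/glm on this slide can be sketched from the client side. This Python sketch only builds the request object and does not send it. The base URL, the form-encoded body, and the parameter names training_frame, response_column, and family are assumptions for illustration, as are the example values.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_glm_request(base_url: str, training_frame: str,
                      response_column: str, family: str) -> Request:
    """Build (but do not send) a model-build request for GLM."""
    params = {
        "training_frame": training_frame,    # key of a frame already in H2O
        "response_column": response_column,  # column to predict
        "family": family,                    # e.g. binomial or gaussian
    }
    body = urlencode(params).encode("utf-8")
    return Request(f"{base_url}/3/ModelBuilders/glm", data=body, method="POST")


req = build_glm_request("http://localhost:54321", "data.hex", "y", "binomial")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The request names a frame by key, so the training data never leaves the H2O process.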
R Script Retrieving H2O GLM Result
User process (standard R process)
- R script calls h2o.glm()
- h2o.getModel() issues GET /3/Models/glm_model_id
- HTTP REST/JSON over TCP/IP
H2O process
- Network layer: HTTP REST/JSON
- REST layer: /3/Models endpoint
- H2O - algos
- H2O - core: Fork/Join framework, K/V store framework
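Retrieving the result is a GET against /3/Models followed by reading a JSON body. The Python sketch below builds such a URL and parses a response. The base URL and the shape of the JSON body are hypothetical; a real response would carry far more fields.

```python
import json
from urllib.parse import quote


def model_url(base_url: str, model_id: str) -> str:
    # Percent-encode the id so it stays a single path segment.
    return f"{base_url}/3/Models/{quote(model_id, safe='')}"


def summarize(body: str) -> tuple:
    """Return (model name, algorithm) from an assumed response shape."""
    model = json.loads(body)["models"][0]
    return model["model_id"]["name"], model["algo"]


url = model_url("http://localhost:54321", "glm_model_id")
sample_body = '{"models": [{"model_id": {"name": "glm_model_id"}, "algo": "glm"}]}'
print(url)
print(summarize(sample_body))
```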
H2O in Big Data Environments
Hadoop (and YARN)
• You can launch H2O directly on Hadoop:
$ hadoop jar h2odriver.jar … -nodes 3 -mapperXmx 50g
• H2O uses Hadoop MapReduce to get CPU and memory on the cluster, not to manage work
  – H2O mappers stay at 0% progress forever
    • Until you shut down the H2O job yourself
  – All mappers (3 in this case) must be running at the same time
  – The mappers communicate with each other
    • Form an H2O cluster on-the-spot within your Hadoop environment
  – No Hadoop reducers(!)
• Special YARN memory settings for large mappers
  – yarn.nodemanager.resource.memory-mb
  – yarn.scheduler.maximum-allocation-mb
• CPU resources controlled via -nthreads h2o command line argument
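The two YARN settings above matter because each mapper's container must fit the Java heap plus some overhead. The Python sketch below works through that arithmetic for the 50g mappers in the example. The 10% overhead on top of the heap is an assumption made for illustration, and the cluster values passed to the check are made-up examples.

```python
import math


def container_mb(mapper_xmx_gb: int, extra_mem_percent: int = 10) -> int:
    """Memory one mapper container needs: heap plus assumed overhead, in MB."""
    heap_mb = mapper_xmx_gb * 1024
    return heap_mb + math.ceil(heap_mb * extra_mem_percent / 100)


def too_small_settings(needed_mb: int, nodemanager_mb: int,
                       max_allocation_mb: int) -> list:
    """Return the names of YARN settings that are below the required size."""
    problems = []
    if max_allocation_mb < needed_mb:
        problems.append("yarn.scheduler.maximum-allocation-mb")
    if nodemanager_mb < needed_mb:
        problems.append("yarn.nodemanager.resource.memory-mb")
    return problems


needed = container_mb(50)  # 51200 MB heap + 5120 MB assumed overhead
print(needed)
print(too_small_settings(needed, nodemanager_mb=65536, max_allocation_mb=8192))
```

With these example values the per-container cap is the setting that has to be raised before three 50g mappers can start.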
H2O on YARN Deployment
Diagram components: Hadoop Gateway Node (hadoop jar h2odriver.jar …), YARN RM, HDFS Data Nodes, and H2O mappers (YARN containers) under the YARN NodeManagers (NM) on the YARN Worker Nodes of the Hadoop Cluster.
Now You Have an H2O Cluster
Diagram components: Hadoop Gateway Node (hadoop jar h2odriver.jar …), YARN RM, HDFS Data Nodes, and H2O mappers (YARN containers) under the YARN NodeManagers (NM) on the YARN Worker Nodes of the Hadoop Cluster.
Read Data from HDFS Once
Diagram components: Hadoop Gateway Node (hadoop jar h2odriver.jar …), YARN RM, HDFS Data Nodes, and H2O mappers (YARN containers) under the YARN NodeManagers (NM) on the YARN Worker Nodes of the Hadoop Cluster.
Build Models in-Memory
Diagram components: Hadoop Gateway Node (hadoop jar h2odriver.jar …), YARN RM, and H2O mappers (YARN containers) under the YARN NodeManagers (NM) on the YARN Worker Nodes of the Hadoop Cluster.
Sparkling Water
Sparkling Water Application Life Cycle
Diagram components: Sparkling App jar file, spark-submit, Spark Master JVM, three Spark Worker JVMs, and a Sparkling Water Cluster of three Spark Executor JVMs, each running H2O.
The diagram labels four steps, (1) through (4); step (4) marks H2O inside a Spark Executor JVM.
Sparkling Water Data Distribution
Diagram components: Data Source (e.g. HDFS), a Sparkling Water Cluster of three Spark Executor JVMs, each running H2O, a Spark RDD, and an H2O RDD.
The diagram labels three steps, (1) through (3).
How H2O Processes Data
Distributed Data Taxonomy: Vector
The vector may be very large (billions of rows)
- Stored as a compressed column (often 4x)
- Accessed as Java primitives with on-the-fly decompression
- Supports fast random access
- Modifiable with Java memory semantics
Distributed Data Taxonomy: Vector
Large vectors must be distributed over multiple JVMs
- Vector is split into chunks
- Chunk is a unit of parallel access
- Each chunk ~ 1000 elements
- Per-chunk compression
- Homed to a single node
- Can be spilled to disk
- GC very cheap
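The chunking and per-chunk compression described on this slide can be illustrated with a toy model. This is a Python stand-in under stated assumptions, not H2O's Java implementation: the class names are hypothetical, and the only compression scheme shown is storing each value as a one-byte offset from the chunk minimum when the chunk's range allows it.

```python
from array import array

CHUNK_SIZE = 1000  # roughly the chunk size mentioned on the slide


class Chunk:
    """Holds ints as 1 byte each when the chunk's range fits in 0..255."""

    def __init__(self, values):
        self.bias = min(values)
        span = max(values) - self.bias
        typecode = "B" if span <= 0xFF else "q"  # 1 byte vs 8 bytes per value
        self.data = array(typecode, (v - self.bias for v in values))

    def at(self, i):
        # Decompress on the fly: stored offset plus the chunk's bias.
        return self.data[i] + self.bias

    def nbytes(self):
        return self.data.itemsize * len(self.data)


class Vec:
    """A column split into fixed-size chunks, each compressed on its own."""

    def __init__(self, values, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = [Chunk(values[i:i + chunk_size])
                       for i in range(0, len(values), chunk_size)]

    def at(self, row):
        # Random access: locate the chunk, then the offset inside it.
        c, off = divmod(row, self.chunk_size)
        return self.chunks[c].at(off)


ages = [20 + (i % 60) for i in range(2500)]
vec = Vec(ages)
print(len(vec.chunks), vec.at(1234), sum(c.nbytes() for c in vec.chunks))
```

Because each chunk picks its own encoding, one chunk with a wide range of values does not force the rest of the column into a larger representation.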
Distributed Data Taxonomy
Example columns: age, gender, zip_code, ID
A row of data is always stored in a single JVM
Distributed data frame
- Similar to R data frame
- Adding and removing columns is cheap
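One way to keep a row on a single JVM is to chunk every column at the same row boundaries, so that chunk number c of every column can live on the same node. The Python sketch below illustrates that layout. The class, the round-robin placement of chunks on nodes, and the sample values are all hypothetical.

```python
class Frame:
    """Toy frame: named columns, all chunked at the same row boundaries."""

    def __init__(self, n_rows: int, chunk_size: int):
        self.n_rows = n_rows
        self.chunk_size = chunk_size
        self.columns = {}  # column name -> list of chunks

    def add_column(self, name, values):
        # Touches only the new column; existing columns are left alone.
        if len(values) != self.n_rows:
            raise ValueError("column length must match the frame")
        step = self.chunk_size
        self.columns[name] = [values[i:i + step]
                              for i in range(0, self.n_rows, step)]

    def remove_column(self, name):
        del self.columns[name]

    def home_node(self, row: int, n_nodes: int) -> int:
        # Assumed placement: chunk index assigned to nodes round-robin.
        return (row // self.chunk_size) % n_nodes

    def row(self, r: int) -> dict:
        # Every field of row r sits in chunk c of its column.
        c, off = divmod(r, self.chunk_size)
        return {name: chunks[c][off] for name, chunks in self.columns.items()}


frame = Frame(n_rows=5, chunk_size=2)
frame.add_column("age", [25, 31, 47, 40, 52])
frame.add_column("gender", ["M", "F", "M", "F", "F"])
print(frame.row(3), "on node", frame.home_node(3, n_nodes=2))
frame.remove_column("gender")
print(frame.row(3))
```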
Distributed Fork/Join
Diagram: one task running on each of five JVMs
Task is distributed in a tree pattern
- Results are reduced at each inner node
- Returns with a single result when all subtasks are done
Distributed Fork/Join
Diagram: within a JVM, the task forks into subtasks, each working on one chunk
On each node the task is parallelized using Fork/Join
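The two levels described on these slides, parallel work over chunks inside a node and a tree-shaped reduction across nodes, can be sketched in Python. This is an illustration of the pattern only: a thread pool stands in for Fork/Join, the binary-heap layout of the node tree is an assumption, and the sample data is made up.

```python
from concurrent.futures import ThreadPoolExecutor


def node_result(chunks, map_fn, reduce_fn):
    """Inside one node: map each local chunk in parallel, then reduce."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_fn, chunks))
    result = partials[0]
    for partial in partials[1:]:
        result = reduce_fn(result, partial)
    return result


def tree_result(node, local_results, reduce_fn):
    """Across nodes: each node folds in the results of its subtrees.

    Node i is assumed to have children 2i+1 and 2i+2, so every inner
    node reduces its own result with those coming up from below.
    """
    result = local_results[node]
    for child in (2 * node + 1, 2 * node + 2):
        if child < len(local_results):
            result = reduce_fn(result, tree_result(child, local_results, reduce_fn))
    return result


def add(a, b):
    return a + b


# Five nodes, each holding a few chunks of a column.
nodes = [[[1, 2], [3]], [[4]], [[5, 6]], [[7]], [[8, 9, 10]]]
local = [node_result(chunks, sum, add) for chunks in nodes]
total = tree_result(0, local, add)
print(local, total)
```

The reduce function has to be associative for the tree shape not to matter, which is why a sum is used here.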
Deploying H2O POJO Models
H2O on Storm
Modeling workflow: Data Source (e.g. HDFS), then H2O, then Emit POJO Model
Real-time Stream: Real-time data, then (Spout), then (Bolt) running the H2O Scoring POJO, then Predictions
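The real-time half of this slide is a pipeline: a source emits rows, a scoring step applies an exported model, and predictions come out. The Python sketch below mimics that shape with generators. It does not use Storm or a real POJO; the function names are hypothetical and the tiny logistic model with its coefficients is a made-up stand-in for an exported scoring object.

```python
import math
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, float]


def example_model(row: Row) -> float:
    """Hypothetical stand-in for an exported model: a one-feature logistic score."""
    z = -1.0 + 0.05 * row["age"]  # illustrative coefficients only
    return 1.0 / (1.0 + math.exp(-z))


def spout(rows: Iterable[Row]) -> Iterator[Row]:
    """Emit incoming rows one at a time, like a stream source."""
    yield from rows


def scoring_bolt(stream: Iterable[Row],
                 score: Callable[[Row], float]) -> Iterator[Row]:
    """Attach a prediction to each row as it passes through."""
    for row in stream:
        yield {**row, "prediction": score(row)}


incoming = [{"age": 20.0}, {"age": 40.0}]
predictions = list(scoring_bolt(spout(incoming), example_model))
for p in predictions:
    print(p)
```

The scoring step needs only the model function and the row, which is the property that lets an exported model run outside the cluster that trained it.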