Intro to Machine Learning with H2O and Python - Denver
H2O.ai Machine Intelligence
Robust, Tested and Supported Platform for Predictive Analytics
• Founded: 2011. Version 3 released in 2015
• Product: H2O open source in-memory prediction engine
• Team: 50+ core developers and data scientists
• HQ: Mountain View, CA. Sales: U.S., U.K. & Canada
H2O.ai Overview
What is H2O?
Platform: Open source in-memory prediction engine
• Parallelized and distributed algorithms making the most use out of multithreaded systems and grids
• GLM, Random Forest, GBM, Deep Learning (ANN), GLRM, PCA, K-means, etc.
Access: Data, platform and client agnostic
• Runs anywhere
• REST API drives H2O from R, Python, Web UI, Excel, Tableau
• Score code for all models
Scale: Same code. Different environments.
• Use all of your data without subsetting
• No code changes to go from development to production
Single source of truth for R and Python users
Ensembles
Deep Neural Networks
Algorithms on H2O
• Generalized Linear Models with L1 and L2 Penalties: Binomial, Gaussian, Gamma, Poisson and Tweedie
• Naïve Bayes
• Distributed Random Forest: Classification or regression models
• Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
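The "increasingly refined approximations" idea behind gradient boosting can be sketched in a few lines of plain Python. This is a toy for squared-error regression on one feature, using single-split trees (stumps); it is not H2O's GBM implementation, and the function names, the learning rate of 0.1, and the data are all assumptions made for illustration.

```python
# Toy gradient boosting for squared-error regression with one-feature stumps.
# Illustrative only: this is NOT how H2O's GBM is implemented.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def predict(model, x):
    """Base value plus the shrunken sum of all stump outputs."""
    base, rate, stumps = model
    return base + rate * sum(lv if x <= t else rv for t, lv, rv in stumps)

def fit_gbm(xs, ys, n_trees=50, rate=0.1):
    """Each new stump is fitted to the residuals of the ensemble so far."""
    model = (sum(ys) / len(ys), rate, [])
    for _ in range(n_trees):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals))
    return model

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a noisy straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 5.1, 5.8, 7.0]
small = fit_gbm(xs, ys, n_trees=5)
large = fit_gbm(xs, ys, n_trees=100)
print(mse(small, xs, ys), mse(large, xs, ys))  # training error shrinks with more trees
```

Each stump only has to correct what the previous ones got wrong, which is why the training error keeps falling as trees are added.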
Supervised Learning
Statistical Analysis
Dimensionality Reduction
Anomaly Detection
Algorithms on H2O
• K-means: Partitions observations into k clusters/groups of the same spatial size
• Principal Component Analysis: Linearly transforms correlated variables to uncorrelated components
• Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Autoencoders: Find outliers via nonlinear dimensionality reduction with deep learning
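As a rough illustration of the clustering entry in the list above, here is a minimal K-means (Lloyd's algorithm) in plain Python. It is a sketch under simple assumptions: points are tuples of floats, the first k points serve as initial centers, and the iteration count is fixed. H2O's distributed K-means is more involved; none of the names below come from H2O.

```python
# Minimal K-means (Lloyd's algorithm). Illustrative only, not H2O's implementation.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=20):
    # Assumption: naive initialization with the first k points.
    centers = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2), (9.2, 8.8), (8.8, 9.2)]
centers, clusters = kmeans(points, k=2)
print(centers)
```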
Unsupervised Learning
Clustering
Data and Client Agnostic
HDFS
S3
SQL
NoSQL
Classification
Regression
Feature Engineering
In-Memory
MapReduce/Fork-Join
Columnar Compression
Deep Learning
PCA, GLM, Cox
Random Forest / GBM Ensembles
H2O Compute Engine
Streaming
Java Score Code
Matrix Factorization
Clustering
Munging
H2O and R
Reading Data from Disk into H2O with R
STEP 1
R user
h2o_df = h2o.importFile("Local/path/to/data.csv")
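For Python users, the counterpart of the R call above is `h2o.import_file()` from the `h2o` package. The sketch below wraps it in a helper; the helper name `import_csv_into_h2o` is hypothetical, the path is the slide's placeholder rather than a real file, and running it assumes the `h2o` package and a Java runtime are installed.

```python
def import_csv_into_h2o(path):
    """Load a CSV into a local H2O instance and return the H2OFrame handle."""
    # Imported lazily so this sketch can be defined without h2o installed.
    import h2o

    h2o.init()                     # start or connect to a local H2O instance
    frame = h2o.import_file(path)  # Python counterpart of R's h2o.importFile()
    return frame

# Usage (placeholder path from the slide, not a real file):
# h2o_df = import_csv_into_h2o("Local/path/to/data.csv")
# print(h2o_df.shape)
```

As in R, the returned object is a handle to data that lives inside the H2O instance, not a local copy of the data.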
Reading Data from Disk into H2O with R
STEP 2
2.1 R function call: h2o.importFile()
2.2 HTTP REST API request to H2O has local file path
2.3 Initiate parallel ingest
2.4 Request data from disk
Local H2O Instance: H2O
Disk: data.csv
Reading Data from Disk into H2O with R
STEP 3
3.1 Disk provides data
3.2 Parallelized H2OFrame in DKV
3.3 Return pointer to data in REST API JSON response
3.4 h2o_df object created in R
Local Host: h2o_df (pointer to data)
Local H2O Instance: H2O, H2O Frame
Disk: data.csv
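The flow in Steps 2 and 3 can be imitated with a small, self-contained Python toy: the client sends a request carrying only the file path, the instance reads and parses the file in parallel chunks, stores the result in a key-value store, and answers with a JSON response holding a key rather than the data. Everything here (`ToyInstance`, `FramePointer`, the JSON fields, the chunk count) is a hypothetical stand-in chosen for illustration; it does not reproduce H2O's actual REST endpoints or its DKV.

```python
import csv
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

class ToyInstance:
    """Stand-in for a local server process. NOT H2O's real API."""

    def __init__(self):
        self.store = {}  # toy key-value store: key -> parsed frame

    def handle_import(self, request_json, n_chunks=4):
        path = json.loads(request_json)["path"]    # the request carries the file path
        with open(path, newline="") as f:          # request data from disk
            header, *lines = f.read().splitlines()
        size = max(1, -(-len(lines) // n_chunks))  # ceiling division
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        with ThreadPoolExecutor() as pool:         # parse the chunks in parallel
            parsed = list(pool.map(lambda c: list(csv.reader(c)), chunks))
        rows = [row for chunk in parsed for row in chunk]
        columns = header.split(",")
        key = f"frame_{len(self.store)}"
        self.store[key] = {"columns": columns, "rows": rows}
        # The response is a pointer (the key), never the data itself.
        return json.dumps({"key": key, "nrows": len(rows), "ncols": len(columns)})

class FramePointer:
    """Client-side handle that holds only what the JSON response returned."""

    def __init__(self, response_json):
        info = json.loads(response_json)
        self.key = info["key"]
        self.nrows = info["nrows"]
        self.ncols = info["ncols"]

def import_file(instance, path):
    """Client function: send the path, get back a pointer to the stored frame."""
    response = instance.handle_import(json.dumps({"path": path}))
    return FramePointer(response)

# Demo with a temporary CSV file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("a,b\n1,2\n3,4\n5,6\n")
    csv_path = tmp.name
instance = ToyInstance()
frame = import_file(instance, csv_path)
os.remove(csv_path)
print(frame.key, frame.nrows, frame.ncols)
```

The design point the toy tries to capture is that the client object stays small no matter how large the file is, because the rows never leave the instance.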