Machine Learning with H2O, Spark, and Python at Strata 2015

Upload: sri-ambati

Post on 14-Jul-2015

Page 1: Machine Learning with H2O, Spark, and Python at Strata 2015

H2O.ai Machine Intelligence

Fast, Scalable In-Memory Machine and Deep Learning For Smarter Applications

Python & Sparkling Water with H2O

Cliff Click Michal Malohlava

Page 2: Machine Learning with H2O, Spark, and Python at Strata 2015

Who Am I?

Cliff Click
CTO, Co-Founder, H2O.ai

● 40 yrs coding
● 35 yrs building compilers
● 30 yrs distributed computation
● 20 yrs OS, device drivers, HPC, HotSpot
● 10 yrs low-latency GC, custom Java hardware
● NonBlockingHashMap
● 20 patents, dozens of papers
● 100s of public talks
● PhD Computer Science, 1995, Rice University
● HotSpot JVM Server Compiler: “showed the world JITing is possible”

Page 3: Machine Learning with H2O, Spark, and Python at Strata 2015

H2O: Open Source In-Memory Machine Learning for Big Data

● Distributed In-Memory Math Platform
 – GLM, GBM, RF, K-Means, PCA, Deep Learning
● Easy to use SDK & API
 – Java, R (CRAN), Scala, Spark, Python, JSON, Browser GUI
● Use ALL your data
 – Modeling without sampling
 – HDFS, S3, NFS, NoSQL
● Big Data & Better Algorithms
 – Better Predictions!

Page 4: Machine Learning with H2O, Spark, and Python at Strata 2015

TBD. Customer Support

TBD Head of Sales

Distributed Systems Engineers Making ML Scale!

Page 5: Machine Learning with H2O, Spark, and Python at Strata 2015

Practical Machine Learning

Value                     Requirements
Fast & Interactive        In-Memory
Big Data (No Sampling)    Distributed
Ownership                 Open Source
Extensibility             API/SDK
Portability               Java, REST/JSON
Infrastructure            Cloud or On-Premise; Hadoop or Private Cluster

Page 6: Machine Learning with H2O, Spark, and Python at Strata 2015

H2O Architecture

[Architecture diagram; box labels:]
● Prediction Engine
● R & Exec Engine, Web Interface, Spark Scala REPL
● Nano-Fast Scoring Engine
● Distributed In-Memory K/V Store
● Column Compress Data, Map/Reduce
● Memory Manager
● Algorithms: GBM, Random Forest, GLM, PCA, K-Means, Deep Learning
● HDFS, S3, NFS
● Real Time Data Flow

Page 8: Machine Learning with H2O, Spark, and Python at Strata 2015

Python & Sparkling Water

● CitiBike of NYC
● Predict bikes-per-hour-per-station
 – From per-trip logs
● 10M rows of data
● Group-By, date/time feature-munging

Demo!

Page 9: Machine Learning with H2O, Spark, and Python at Strata 2015

H2O: A Platform for Big Math

● Most Any Java on Big 2-D Tables
 – Write like it's single-thread POJO code
 – Runs distributed & parallel by default
● Fast: billion row logistic regression takes 4 sec
● World's first parallel & distributed GBM
 – Plus Deep Learn / Neural Nets, RF, PCA, K-means...
● R integration: use terabyte datasets from R
● Sparkling Water: Direct Spark integration
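The "write it single-threaded, run it distributed" idea above can be illustrated with a small map/reduce sketch. This is plain Scala over in-memory chunks, not H2O's own API; the object `ChunkedSum` and its method names are hypothetical and exist only for illustration. Each chunk is summarised independently (the part that could run on any node), and the partial results are then merged.

```scala
// Sketch only: plain Scala collections stand in for a distributed column.
// This is NOT H2O's API; `ChunkedSum` and its methods are hypothetical names.
object ChunkedSum {
  // "map" step: summarise one chunk on its own as (sum, row count).
  def mapChunk(chunk: Array[Double]): (Double, Long) =
    (chunk.sum, chunk.length.toLong)

  // "reduce" step: merge two partial results. Up to floating-point
  // rounding, the merge order does not change the outcome.
  def reduce(a: (Double, Long), b: (Double, Long)): (Double, Long) =
    (a._1 + b._1, a._2 + b._2)

  // Mean of a column that is stored as several chunks.
  def mean(chunks: Seq[Array[Double]]): Double = {
    val (sum, count) = chunks.map(mapChunk).foldLeft((0.0, 0L))(reduce)
    if (count == 0) Double.NaN else sum / count
  }
}
```

The per-chunk code reads like ordinary single-threaded code; only the merge step has to be aware that several partial results exist.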

Page 10: Machine Learning with H2O, Spark, and Python at Strata 2015

H2O: A Platform for Big Math

● Easy launch: “java -jar h2o.jar”
 – No GC tuning: -Xmx as big as you like
● Production ready:
 – Private on-premise cluster OR in the Cloud
 – Hadoop, Yarn, EC2, or standalone cluster
 – HDFS, S3, NFS, URI & other datasources
 – Open Source, Apache v2

Page 11: Machine Learning with H2O, Spark, and Python at Strata 2015

Can I call H2O’s algorithms from my Spark workflow?

Page 12: Machine Learning with H2O, Spark, and Python at Strata 2015

YES, You can!

Page 13: Machine Learning with H2O, Spark, and Python at Strata 2015

Sparkling Water

Page 14: Machine Learning with H2O, Spark, and Python at Strata 2015

Sparkling Water Provides

● Transparent integration into Spark ecosystem
● Pure H2ORDD encapsulating H2O DataFrame
● Transparent use of H2O data structures and algorithms with Spark API
● Excels in Spark workflows requiring advanced Machine Learning algorithms

Page 15: Machine Learning with H2O, Spark, and Python at Strata 2015

Sparkling Water Design

[Diagram. Labels: spark-submit; Sparkling App; Spark Master JVM; three Spark Worker JVMs; Sparkling Water Cluster made of three Spark Executor JVMs, each containing H2O.]

Page 16: Machine Learning with H2O, Spark, and Python at Strata 2015

Data Distribution

[Diagram. Labels: Data Source (e.g. HDFS); Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O; H2O RDD; Spark RDD.]

RDDs and DataFrames share same memory space

Page 17: Machine Learning with H2O, Spark, and Python at Strata 2015

Demo time!

Page 18: Machine Learning with H2O, Spark, and Python at Strata 2015

SPARKLING WATER DEMO

H2O.AI

Created by H2O.ai / @h2oai

Page 19: Machine Learning with H2O, Spark, and Python at Strata 2015

LAUNCH SPARKLING SHELL

> export SPARK_HOME="/path/to/spark/installation"
> bin/sparkling-shell

Page 20: Machine Learning with H2O, Spark, and Python at Strata 2015

PREPARE AN ENVIRONMENT

val DIR_PREFIX = "/Users/michal/Devel/projects/h2o/repos/h2o2/bigdata/laptop/citibike-nyc/"

// Common imports
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.examples.h2o.DemoUtils._
import org.apache.spark.sql.SQLContext
import water.fvec._
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

// Initialize Spark SQLContext
implicit val sqlContext = new SQLContext(sc)
import sqlContext._

Page 21: Machine Learning with H2O, Spark, and Python at Strata 2015

LAUNCH H2O SERVICES

implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._

Page 22: Machine Learning with H2O, Spark, and Python at Strata 2015

LOAD CITIBIKE DATA USING H2O API

val dataFiles = Array[String](
  "2013-07.csv", "2013-08.csv", "2013-09.csv",
  "2013-10.csv", "2013-11.csv", "2013-12.csv").map(f => new java.io.File(DIR_PREFIX, f))

// Load and parse data
val bikesDF = new DataFrame(dataFiles:_*)

// Rename columns and remove all spaces in header
val colNames = bikesDF.names().map( n => n.replace(' ', '_'))
bikesDF._names = colNames
bikesDF.update(null)

Page 23: Machine Learning with H2O, Spark, and Python at Strata 2015

USER-DEFINED COLUMN TRANSFORMATION

// Select column 'starttime'
val startTimeF = bikesDF('starttime)

// Invoke column transformation and append the created column
bikesDF.add(new TimeSplit().doIt(startTimeF))
// Do not forget to update frame in K/V store
bikesDF.update(null)

Page 24: Machine Learning with H2O, Spark, and Python at Strata 2015

OPEN H2O FLOW UI

openFlow

AND EXPLORE DATA...

> getFrames...

Page 25: Machine Learning with H2O, Spark, and Python at Strata 2015

FROM H2O'S DATAFRAME TO RDD

val bikesRdd = asSchemaRDD(bikesDF)

Page 26: Machine Learning with H2O, Spark, and Python at Strata 2015

USE SPARK SQL

// Register table and SQL table
sqlContext.registerRDDAsTable(bikesRdd, "bikesRdd")

// Perform SQL group operation
val bikesPerDayRdd = sql(
  """SELECT Days, start_station_id, count(*) bikes
    |FROM bikesRdd
    |GROUP BY Days, start_station_id """.stripMargin)
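To make the result of the group operation concrete, here is a small sketch of the same aggregation on plain Scala collections. It does not use Spark or H2O; the `Trip` case class and the `BikesPerDay` object are hypothetical stand-ins for the per-trip table, keeping only the two grouping columns.

```scala
// Sketch only: an in-memory equivalent of
//   SELECT Days, start_station_id, count(*) bikes ... GROUP BY Days, start_station_id
// `Trip` and `BikesPerDay` are hypothetical names, not part of the demo.
object BikesPerDay {
  // One row of the per-trip log, reduced to the two grouping columns.
  final case class Trip(days: Long, startStationId: Int)

  // Count trips per (day, start station) pair.
  def bikesPerDay(trips: Seq[Trip]): Map[(Long, Int), Int] =
    trips
      .groupBy(t => (t.days, t.startStationId))
      .map { case (key, rows) => key -> rows.size }
}
```

Each key of the resulting map corresponds to one output row of the query, and the value plays the role of the `bikes` column that the model later predicts.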

Page 27: Machine Learning with H2O, Spark, and Python at Strata 2015

FROM RDD TO H2O'S DATAFRAME

val bikesPerDayDF:DataFrame = bikesPerDayRdd

AND PERFORM ADDITIONAL COLUMN TRANSFORMATION

// Select "Days" column
val daysVec = bikesPerDayDF('Days)
// Refine column into "Month" and "DayOfWeek"
val finalBikeDF = bikesPerDayDF.add(new TimeTransform().doIt(daysVec))
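The slides do not show what `TimeSplit` and `TimeTransform` do internally. As a rough sketch of this kind of date/time feature-munging, the code below assumes that "Days" means whole days since 1970-01-01 (UTC) and derives a month and a day of week from it with `java.time`. The object `DayFeatures` and its method names are hypothetical, and the demo's own helpers may use different conventions (for example, a different day-of-week numbering).

```scala
import java.time.LocalDate

// Sketch only: hypothetical helpers, not the demo's TimeSplit/TimeTransform.
object DayFeatures {
  // Assumption: a "Days" value counts whole days since 1970-01-01 (UTC).
  // floorDiv keeps timestamps before the epoch on the correct day.
  def daysSinceEpoch(epochMillis: Long): Long =
    Math.floorDiv(epochMillis, 86400000L)

  // Returns (month 1..12, ISO day of week with 1 = Monday .. 7 = Sunday).
  def monthAndDayOfWeek(days: Long): (Int, Int) = {
    val date = LocalDate.ofEpochDay(days)
    (date.getMonthValue, date.getDayOfWeek.getValue)
  }
}
```

Splitting a timestamp into coarse calendar features like these lets a tree model pick up weekly and seasonal patterns that a raw timestamp would hide.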

Page 28: Machine Learning with H2O, Spark, and Python at Strata 2015

TIME TO BUILD A MODEL!

Page 29: Machine Learning with H2O, Spark, and Python at Strata 2015

GBM MODEL BUILDER

def buildModel(df: DataFrame, trees: Int = 200, depth: Int = 6):R2 = {
  // Split into train and test parts
  val frs = splitFrame(df, Seq("train.hex", "test.hex", "hold.hex"), Seq(0.6, 0.3, 0.1))
  val (train, test, hold) = (frs(0), frs(1), frs(2))
  // Configure GBM parameters
  val gbmParams = new GBMParameters()
  gbmParams._train = train
  gbmParams._valid = test
  gbmParams._response_column = 'bikes
  gbmParams._ntrees = trees
  gbmParams._max_depth = depth
  // Build a model
  val gbmModel = new GBM(gbmParams).trainModel.get
  // Score datasets
  Seq(train,test,hold).foreach(gbmModel.score(_).delete)
  // Collect R2 metrics
  val result = R2("Model #1", r2(gbmModel, train), r2(gbmModel, test), r2(gbmModel, hold))
  // Perform clean-up
  Seq(train, test, hold).foreach(_.delete())
  result
}
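The builder reports an R2 value for the train, test, and hold-out splits. For readers unfamiliar with the metric, the sketch below computes the usual coefficient of determination, R² = 1 − SS_res / SS_tot, on plain Scala sequences. It is an illustration of the standard definition only; the `RSquared` object is a hypothetical name, and the demo's `r2` helper may compute its value differently.

```scala
// Sketch only: the textbook coefficient of determination on in-memory data.
// `RSquared` is a hypothetical name; it is not the demo's `r2` helper.
object RSquared {
  // R^2 = 1 - SS_res / SS_tot.
  // Note: if every actual value is identical, SS_tot is 0 and the result
  // is NaN or -Infinity rather than a meaningful score.
  def r2(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.length == predicted.length,
      "actual and predicted must be non-empty and of equal length")
    val mean = actual.sum / actual.length
    // Residual sum of squares: error left over after the model's predictions.
    val ssRes = actual.zip(predicted).map { case (y, p) => (y - p) * (y - p) }.sum
    // Total sum of squares: error of always predicting the mean.
    val ssTot = actual.map(y => (y - mean) * (y - mean)).sum
    1.0 - ssRes / ssTot
  }
}
```

A value of 1.0 means the predictions match exactly, while 0.0 means the model does no better than predicting the mean. Comparing the value across the three splits is what shows whether the model generalises.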

Page 30: Machine Learning with H2O, Spark, and Python at Strata 2015

BUILD A GBM MODEL

val result1 = buildModel(finalBikeDF)

Page 31: Machine Learning with H2O, Spark, and Python at Strata 2015

CAN WE IMPROVE MODEL BY USING INFORMATION ABOUT WEATHER?

Page 32: Machine Learning with H2O, Spark, and Python at Strata 2015

LOAD WEATHER DATA USING SPARK API

// Load weather data in NY 2013
val weatherData = sc.textFile(DIR_PREFIX + "31081_New_York_City__Hourly_2013.csv")
// Parse data and filter them
val weatherRdd = weatherData.map(_.split(",")).
  map(row => NYWeatherParse(row)).
  filter(!_.isWrongRow()).
  filter(_.HourLocal == Some(12)).setName("weather").cache()
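The parse-then-filter pattern above can be tried without Spark. The sketch below is a simplified stand-in: the `Weather` case class, its three-column layout ("days,hour,temperature"), and the `WeatherRows` object are all assumptions made for illustration, and the real `NYWeatherParse` handles many more fields. It keeps only well-formed rows recorded at local hour 12, so that at most one noon reading per day remains when the source has one row per hour.

```scala
import scala.util.Try

// Sketch only: a hypothetical, simplified version of the weather parsing step.
object WeatherRows {
  // Fields that fail to parse become None instead of throwing.
  final case class Weather(days: Option[Long], hourLocal: Option[Int], temperature: Option[Double]) {
    def isWrongRow: Boolean = days.isEmpty || hourLocal.isEmpty || temperature.isEmpty
  }

  // Parse one "days,hour,temperature" line (assumed layout).
  def parse(line: String): Weather = {
    // limit -1 keeps trailing empty fields; padTo guards against short rows.
    val f = line.split(",", -1).map(_.trim).padTo(3, "")
    Weather(Try(f(0).toLong).toOption, Try(f(1).toInt).toOption, Try(f(2).toDouble).toOption)
  }

  // Drop malformed rows, then keep only the readings taken at hour 12.
  def noonReadings(lines: Seq[String]): Seq[Weather] =
    lines.map(parse).filter(!_.isWrongRow).filter(_.hourLocal.contains(12))
}
```

Reducing the hourly data to one reading per day matters for the later join on the day column, because several readings per day would duplicate the matching bike rows.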

Page 33: Machine Learning with H2O, Spark, and Python at Strata 2015

CREATE A JOINED TABLE USING H2O'S DATAFRAME AND SPARK'S RDD

// Join with bike table
sqlContext.registerRDDAsTable(weatherRdd, "weatherRdd")
sqlContext.registerRDDAsTable(asSchemaRDD(finalBikeDF), "bikesRdd")

val bikesWeatherRdd = sql(
  """SELECT b.Days, b.start_station_id, b.bikes,
    |b.Month, b.DayOfWeek,
    |w.DewPoint, w.HumidityFraction, w.Prcp1Hour,
    |w.Temperature, w.WeatherCode1
    | FROM bikesRdd b
    | JOIN weatherRdd w
    | ON b.Days = w.Days """.stripMargin)
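The join matches each bike row to the weather reading for the same day. A minimal in-memory sketch of that inner join is shown below. The case classes are hypothetical and carry only a subset of the columns in the query, and the `BikeWeatherJoin` object is an illustrative name rather than anything from Spark or H2O.

```scala
// Sketch only: an in-memory inner join on the day column.
// All names here are hypothetical and carry only a few of the query's columns.
object BikeWeatherJoin {
  final case class BikeRow(days: Long, startStationId: Int, bikes: Int)
  final case class WeatherRow(days: Long, temperature: Double)
  final case class Joined(days: Long, startStationId: Int, bikes: Int, temperature: Double)

  // Inner join: bike rows with no weather reading for that day are dropped,
  // and a day with several readings would yield several joined rows.
  def join(bikes: Seq[BikeRow], weather: Seq[WeatherRow]): Seq[Joined] = {
    val byDay: Map[Long, Seq[WeatherRow]] = weather.groupBy(_.days)
    for {
      b <- bikes
      w <- byDay.getOrElse(b.days, Seq.empty)
    } yield Joined(b.days, b.startStationId, b.bikes, w.temperature)
  }
}
```

Indexing the smaller table by day first avoids comparing every bike row with every weather row.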

Page 34: Machine Learning with H2O, Spark, and Python at Strata 2015

BUILD A NEW MODEL USING SPARK'S RDD IN H2O'S API

val result2 = buildModel(bikesWeatherRdd)

Page 35: Machine Learning with H2O, Spark, and Python at Strata 2015

More info

Check out H2O.ai Training Books
http://learn.h2o.ai/

Check out H2O.ai Blog
http://h2o.ai/blog/

Check out H2O.ai Youtube Channel
https://www.youtube.com/user/0xdata

Check out GitHub
https://github.com/h2oai

Page 36: Machine Learning with H2O, Spark, and Python at Strata 2015

Learn more about H2O at h2o.ai

Thank you!

Follow us at @h2oai