2014 09 30_sparkling_water_hands_on
DESCRIPTION
How Sparkling Water brings fast, scalable machine learning via H2O to Apache Spark. By Michal Malohlava and H2O.ai. Our 100th Meetup at 0xdata, September 30, 2014. Open Source meets Out Door.
TRANSCRIPT
![Page 1: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/1.jpg)
Sparkling Water
“Killer App for Spark”
@hexadata & @mmalohlava present
![Page 2: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/2.jpg)
Spark and H2O
Several months ago…
![Page 3: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/3.jpg)
Sparkling Water
Before
Tachyon based
Unnecessary data duplication
Now
Pure H2ORDD
Transparent use of H2O data and algorithms with Spark API
![Page 4: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/4.jpg)
Sparkling Water
RDD (immutable world) + DataFrame (mutable world)
![Page 5: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/5.jpg)
Sparkling Water
RDD DataFrame
![Page 6: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/6.jpg)
Sparkling Water Design
[Diagram: a Sparkling App jar file passes through spark-submit to the Spark Master JVM; three Spark Worker JVMs; a Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O.]
![Page 7: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/7.jpg)
Data Distribution
[Diagram: a Sparkling Water Cluster of three Spark Executor JVMs, each containing H2O; a data source (e.g. HDFS); an H2O RDD and a Spark RDD.]
![Page 8: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/8.jpg)
Hands-on Time
![Page 9: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/9.jpg)
Example
Load & parse CSV data
Use Spark API, do SQL query
Create Deep Learning model
Use model for prediction
![Page 10: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/10.jpg)
Requirements
Linux or Mac OS X
Oracle Java 1.7
Virtual image is provided for Windows users
![Page 11: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/11.jpg)
Download
http://0xdata.com/download/
![Page 12: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/12.jpg)
Install and Launch
Unpack zip file
or
Open provided virtual image in VirtualBox
and launch h2o-examples/sparkling-shell
![Page 13: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/13.jpg)
What is Sparkling Shell?
Standard spark-shell
Launch H2O extension
export MASTER="local-cluster[3,2,1024]"
spark-shell \
  --jars shaded.jar \
  --conf spark.extensions=org.apache.spark.executor.H2OPlatformExtension
JAR containing H2O code
Name of H2O extension provided by JAR
Spark Master address
![Page 14: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/14.jpg)
…more on launching…
‣ By default single JVM, multi-threaded (export MASTER=local[*]) or
‣ export MASTER="local-cluster[3,2,1024]" to launch an embedded Spark cluster or
‣ Launch standalone Spark cluster via sbin/launch-spark-cloud.sh and export MASTER="spark://localhost:7077"
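The three MASTER forms above can be told apart by their shape alone. The following is a minimal sketch in plain Scala that only describes each form; `MasterUrl` and `describe` are hypothetical names for illustration, not part of Spark or Sparkling Water. It assumes the usual reading of `local-cluster[N,cores,memory]` as N workers, cores per worker, and memory per worker in MB.

```scala
// Hypothetical helper for illustration only: it is not a Spark or H2O API.
object MasterUrl {
  private val LocalThreads = """local\[(\*|\d+)\]""".r
  private val LocalCluster = """local-cluster\[(\d+),\s*(\d+),\s*(\d+)\]""".r
  private val Standalone   = """spark://([^:]+):(\d+)""".r

  // Returns a human-readable description of a MASTER value.
  def describe(master: String): String = master match {
    case LocalThreads("*")      => "single JVM, one thread per available core"
    case LocalThreads(n)        => s"single JVM, $n threads"
    case LocalCluster(w, c, m)  => s"embedded cluster: $w workers, $c cores each, $m MB each"
    case Standalone(host, port) => s"standalone cluster at $host:$port"
    case other                  => s"unrecognized master: $other"
  }
}
```

For example, `MasterUrl.describe("local-cluster[3,2,1024]")` reports three workers with two cores and 1024 MB each, which matches the cloud size of 3 that the hands-on session waits for.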
![Page 15: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/15.jpg)
Let's play with Sparkling Shell…
![Page 16: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/16.jpg)
Create H2O Client
import water.{H2O, H2OClientApp}
H2OClientApp.start()
H2O.waitForCloudSize(3, 10000)
![Page 17: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/17.jpg)
Is Spark running?
http://localhost:4040
![Page 18: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/18.jpg)
Is H2O running?
http://localhost:54321/steam/index.html
![Page 19: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/19.jpg)
Data
Load some data and parse it
import java.io.File
import org.apache.spark.examples.h2o._
import org.apache.spark.h2o._
val dataFile = "../h2o-examples/smalldata/allyears2k_headers.csv.gz"

// Create DataFrame - involves parse of data
val airlinesData = new DataFrame(new File(dataFile))
![Page 20: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/20.jpg)
Where is the data?
Go to http://localhost:54321/steam/index.html
![Page 21: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/21.jpg)
Use Spark API
// H2O Context provides useful implicits for conversions
val h2oContext = new H2OContext(sc)
import h2oContext._

// Create RDD wrapper around DataFrame
val airlinesTable : RDD[Airlines] = toRDD[Airlines](airlinesData)
airlinesTable.count

// And use Spark RDD API directly
val flightsOnlyToSF = airlinesTable.filter( f => f.Dest==Some("SFO") || f.Dest==Some("SJC") || f.Dest==Some("OAK") )
flightsOnlyToSF.count
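The filter compares `Dest` against `Some("SFO")`, which suggests the column is exposed as an `Option[String]` so that missing values can be represented. The sketch below shows the same predicate on an ordinary Scala collection instead of an RDD; `Flight` is a hypothetical stand-in for the `Airlines` row type with only the one field used here, and the sample rows are made up for illustration.

```scala
object BayAreaFilter {
  // Hypothetical stand-in for the Airlines row type: only the Dest field is modeled.
  case class Flight(Dest: Option[String])

  val bayAreaAirports = Set("SFO", "SJC", "OAK")

  // A missing destination (None) never satisfies exists, so such rows are dropped,
  // just as None never equals Some("SFO") in the slide's predicate.
  def toBayArea(f: Flight): Boolean = f.Dest.exists(bayAreaAirports.contains)

  // Made-up sample rows, including one with a missing destination.
  val flights = Seq(
    Flight(Some("SFO")), Flight(Some("ORD")), Flight(None), Flight(Some("OAK")))

  val flightsOnlyToSF = flights.filter(toBayArea)
}
```

Keeping the airport codes in a `Set` is only a design convenience; the three explicit comparisons on the slide select the same rows.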
![Page 22: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/22.jpg)
Use Spark SQL
import org.apache.spark.sql.SQLContext
// We need to create SQL context
val sqlContext = new SQLContext(sc)
import sqlContext._
airlinesTable.registerTempTable("airlinesTable")

val query = "SELECT * FROM airlinesTable WHERE Dest LIKE 'SFO' OR Dest LIKE 'SJC' OR Dest LIKE 'OAK'"
// Invoke query
val result = sql(query) // Using a registered context and tables
result.count

assert(result.count == flightsOnlyToSF.count)
![Page 23: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/23.jpg)
Launch H2O Algorithms
import hex.deeplearning._
import hex.deeplearning.DeepLearningModel.DeepLearningParameters
// Setup deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._training_frame = result( 'Year, 'Month, 'DayofMonth, 'DayOfWeek, 'CRSDepTime, 'CRSArrTime,
  'UniqueCarrier, 'FlightNum, 'TailNum, 'CRSElapsedTime, 'Origin, 'Dest,
  'Distance, 'IsDepDelayed)
dlParams.response_column = 'IsDepDelayed.name

// Create a new model builder
val dl = new DeepLearning(dlParams)

val dlModel = dl.train.get
![Page 24: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/24.jpg)
Make a prediction
// Use model to score data
val prediction = dlModel.score(result)('predict)

// Collect predicted values via RDD API
val predictionValues = toRDD[DoubleHolder](prediction)
  .collect
  .map ( _.result.getOrElse("NaN") )
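The last step replaces missing predictions with the string "NaN". The sketch below illustrates what that does to the element type, using a plain Scala collection rather than an RDD. `Holder` is a hypothetical stand-in for `DoubleHolder`, assumed here to carry an optional Double in a field named `result`, and the sample values are made up.

```scala
object PredictionValues {
  // Hypothetical stand-in for DoubleHolder: assumes an optional Double field named result.
  case class Holder(result: Option[Double])

  // Made-up scored rows, one of them without a prediction.
  val scored = Seq(Holder(Some(0.25)), Holder(None), Holder(Some(1.0)))

  // As on the slide: mixing Double and String widens the element type to Any.
  val asOnSlide: Seq[Any] = scored.map(_.result.getOrElse("NaN"))

  // Alternative that stays numeric: use Double.NaN for missing values.
  val asDoubles: Seq[Double] = scored.map(_.result.getOrElse(Double.NaN))
}
```

The string default is convenient for printing results in the shell; the `Double.NaN` variant may be preferable when the values feed further numeric code.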
![Page 25: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/25.jpg)
What is under the hood?
![Page 26: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/26.jpg)
Spark App Extension
/** Notion of Spark application platform extension. */
trait PlatformExtension extends Serializable {
  /** Method to start extension */
  def start(conf: SparkConf): Unit
  /** Method to stop extension */
  def stop(conf: SparkConf): Unit
  /* Point in Spark infrastructure which will be intercepted by this extension. */
  def intercept: InterceptionPoints = InterceptionPoints.EXECUTOR_LC
  /* User-friendly description of extension */
  def desc: String
  override def toString = s"$desc@$intercept"
}

/** Supported interception points.
  *
  * Currently only Executor life cycle is supported.
  */
object InterceptionPoints extends Enumeration {
  type InterceptionPoints = Value
  val EXECUTOR_LC /* Inject into executor lifecycle */ = Value
}
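To make the start/stop contract concrete, here is a minimal sketch of an extension written against a local copy of the trait. Everything in it is an assumption for illustration: `SparkConf` is a stand-in case class so the sketch needs no Spark on the classpath, `RecordingExtension` is a hypothetical extension that only records its calls, and `runLifecycle` is a hypothetical driver, not how Spark or the proposed patch actually invokes extensions.

```scala
object ExtensionSketch {
  // Stand-in for org.apache.spark.SparkConf, used only to keep the sketch self-contained.
  final case class SparkConf(settings: Map[String, String] = Map.empty)

  object InterceptionPoints extends Enumeration {
    val EXECUTOR_LC = Value
  }

  // Local copy with the same shape as the trait on the slide.
  trait PlatformExtension extends Serializable {
    def start(conf: SparkConf): Unit
    def stop(conf: SparkConf): Unit
    def intercept: InterceptionPoints.Value = InterceptionPoints.EXECUTOR_LC
    def desc: String
  }

  // Hypothetical extension that records its lifecycle calls instead of starting H2O.
  class RecordingExtension extends PlatformExtension {
    val events = scala.collection.mutable.ArrayBuffer.empty[String]
    def start(conf: SparkConf): Unit = events += "start"
    def stop(conf: SparkConf): Unit = events += "stop"
    def desc: String = "recording-extension"
  }

  // Hypothetical driver: start the extension, run the body, always stop afterwards.
  def runLifecycle(ext: PlatformExtension, conf: SparkConf)(body: => Unit): Unit = {
    ext.start(conf)
    try body finally ext.stop(conf)
  }
}
```

Wrapping the body in `try`/`finally` is one way to guarantee that `stop` runs even if the work fails; whether the real executor hook behaves this way is not stated in the slides.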
![Page 27: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/27.jpg)
Using App Extensions
val conf = new SparkConf()
  .setAppName("Sparkling H2O Example")
// Setup expected size of H2O cloud
conf.set("spark.h2o.cluster.size", h2oWorkers)

// Add H2O extension
conf.addExtension[H2OPlatformExtension]

// Create Spark Context
val sc = new SparkContext(conf)
![Page 28: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/28.jpg)
Spark Changes
We keep them small (~30 lines of code)
JIRA SPARK-3270 - Platform App Extensions
https://issues.apache.org/jira/browse/SPARK-3270
![Page 29: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/29.jpg)
You can participate!
Epic PUBDEV-21 aka Sparkling Water
PUBDEV-23 Test HDFS reader
PUBDEV-26 Implement toSchemaRDD
PUBDEV-27 Boolean transfers
PUBDEV-31 Support toRDD[ X <: Numeric]
PUBDEV-32/33 Mesos/YARN support
![Page 30: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/30.jpg)
More info
Check out 0xdata Blog for tutorials
http://0xdata.com/blog/
Check out 0xdata Youtube Channel
https://www.youtube.com/user/0xdata
Check out github
https://github.com/0xdata/h2o-dev
https://github.com/0xdata/perrier
![Page 31: 2014 09 30_sparkling_water_hands_on](https://reader033.vdocuments.us/reader033/viewer/2022042813/547e42bfb4af9fb4158b55ca/html5/thumbnails/31.jpg)
Thank you!
Learn more about H2O at 0xdata.com
or
neo> for r in h2o-dev perrier; do
  git clone "[email protected]:0xdata/$r.git"
done
Follow us at @hexadata