Spark ML Pipeline Serving
![Page 1: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/1.jpg)
Spark Serving
by Stepan Pushkarev, CTO of Hydrosphere.io
![Page 2: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/2.jpg)
Spark Users here?
![Page 3: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/3.jpg)
Data Scientists and Spark Users here?
![Page 4: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/4.jpg)
![Page 5: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/5.jpg)
Why do companies hire data scientists?
![Page 6: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/6.jpg)
Why do companies hire data scientists?
To make products smarter.
![Page 7: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/7.jpg)
What is the deliverable of a data scientist and a data engineer?
![Page 8: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/8.jpg)
What is the deliverable of a data scientist?
- Academic paper?
- ML model?
- R/Python script?
- Jupyter notebook?
- BI dashboard?
![Page 9: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/9.jpg)
[Diagram: cluster (data, model), data scientist, ?, web app]
![Page 10: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/10.jpg)
val wordCounts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey((a, b) => a + b)
[Diagram: the job runs across five executors]
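For readers without a cluster at hand, the word count above can be approximated on local Scala collections. This is only an analogy, not Spark code: `lines` is a made-up stand-in for `textFile`, and `groupBy` followed by a per-key `reduce` plays the role that `reduceByKey` plays across executors.

```scala
object LocalWordCount {
  // Hypothetical sample input standing in for the RDD `textFile`.
  val lines: Seq[String] =
    Seq("apache spark", "hadoop mapreduce", "spark machine learning")

  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))    // one element per word
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupBy { case (word, _) => word }  // local substitute for the shuffle by key
      .map { case (word, pairs) =>
        word -> pairs.map(_._2).reduce((a, b) => a + b) // same reducer as on the slide
      }

  def main(args: Array[String]): Unit =
    wordCounts(lines).toSeq.sortBy(_._1).foreach(println)
}
```

The three lambdas are the same ones the slide passes to `flatMap`, `map`, and `reduceByKey`; only the grouping step differs, because a local collection has no partitions to shuffle.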
![Page 11: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/11.jpg)
Machine Learning: training + serving
![Page 12: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/12.jpg)
Training (Estimation) pipeline
Pipeline stages: preprocess, preprocess, train
![Page 13: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/13.jpg)
tokenizer
Input (text, label):
  apache spark                  1
  hadoop mapreduce              0
  spark machine learning        1
Output (words, label):
  [apache, spark]               1
  [hadoop, mapreduce]           0
  [spark, machine, learning]    1
![Page 14: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/14.jpg)
hashing tf
Input (words, label):
  [apache, spark]                     1
  [hadoop, mapreduce]                 0
  [spark, machine, learning]          1
Output (feature indices, feature values, label):
  [105, 495], [1.0, 1.0]              1
  [6, 638, 655], [1.0, 1.0, 1.0]      0
  [105, 72, 852], [1.0, 1.0, 1.0]     1
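The tokenizer and hashing TF steps above can be sketched in plain Scala to show the idea behind them: each word is hashed into one of `numFeatures` buckets, and the row becomes a sparse vector of bucket indices and counts. This is a rough illustration under assumptions, not Spark's implementation: it uses `String.hashCode` as the hash function, so the indices it produces will not match the ones on the slide, and the splitting rule (lowercase, split on whitespace) is assumed to mirror the tokenizer.

```scala
object HashingSketch {
  // Assumption: lowercase and split on whitespace, as the tokenizer slide suggests.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s").toSeq

  // Map a hash code to a bucket in [0, numFeatures), including negative hash codes.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Sparse term-frequency vector: sorted bucket indices and the count per bucket.
  def hashingTF(words: Seq[String], numFeatures: Int): (Seq[Int], Seq[Double]) = {
    val counts = words
      .groupBy(word => bucket(word, numFeatures))
      .map { case (index, ws) => index -> ws.size.toDouble }
      .toSeq
      .sortBy(_._1)
    (counts.map(_._1), counts.map(_._2))
  }

  def main(args: Array[String]): Unit = {
    val (indices, values) = hashingTF(tokenize("spark machine learning"), 1000)
    println(s"$indices, $values")
  }
}
```

Because the mapping from word to index is a pure function of the word, no vocabulary has to be stored with the model, which is part of what makes this stage easy to re-run at serving time.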
![Page 15: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/15.jpg)
logistic regression
Input (feature indices, feature values, label):
  [105, 495], [1.0, 1.0]              1
  [6, 638, 655], [1.0, 1.0, 1.0]      0
  [105, 72, 852], [1.0, 1.0, 1.0]     1
Output (learned coefficients: row, feature index, weight):
  0   72    -2.7138781446090308
  0   94    0.9042505436914775
  0   105   3.0835670890496645
  0   495   3.2071722417080766
  0   722   0.9042505436914775
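To make the last step concrete, here is a small sketch of how a logistic regression model scores one sparse row: sum the weights of the active features, add the intercept, and pass the result through the sigmoid. The weights are copied from the slide, which appears to list only part of the coefficient vector; the intercept is not shown there, so `0.0` is a placeholder assumption, and the object and method names are illustrative only.

```scala
object LogisticScoringSketch {
  // Weights from the slide, keyed by feature index; unlisted features count as 0.
  val weights: Map[Int, Double] = Map(
    72  -> -2.7138781446090308,
    94  -> 0.9042505436914775,
    105 -> 3.0835670890496645,
    495 -> 3.2071722417080766,
    722 -> 0.9042505436914775
  )

  // Assumption: the slide does not show the intercept, so 0.0 is a placeholder.
  val intercept: Double = 0.0

  // Probability of label 1 for a sparse vector given as parallel index/value sequences.
  def probability(indices: Seq[Int], values: Seq[Double]): Double = {
    val margin = indices.zip(values)
      .map { case (i, v) => weights.getOrElse(i, 0.0) * v }
      .sum + intercept
    1.0 / (1.0 + math.exp(-margin))
  }

  // Threshold the probability at 0.5 to get a class label.
  def predict(indices: Seq[Int], values: Seq[Double]): Double =
    if (probability(indices, values) > 0.5) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    println(predict(Seq(105, 495), Seq(1.0, 1.0)))
    println(predict(Seq(6, 638, 655), Seq(1.0, 1.0, 1.0)))
  }
}
```

Scoring a single row needs only a map lookup and a handful of arithmetic operations, with no cluster involved.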
![Page 16: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/16.jpg)
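import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// Fit all three stages on `training`, then persist the fitted pipeline.
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")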
![Page 17: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/17.jpg)
Prediction pipeline
Pipeline stages: preprocess, preprocess
![Page 18: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/18.jpg)
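import org.apache.spark.ml.PipelineModel
// Tuple1 makes each string a single-column row that createDataFrame accepts.
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()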
![Page 19: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/19.jpg)
./bin/spark-submit …
![Page 20: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/20.jpg)
[Diagram: cluster (data, model), data scientist, ?, web app]
![Page 21: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/21.jpg)
Pipeline Serving - NOT Model Serving
A model-level API leads to code duplication and inconsistency at the pre-processing stages!
Example: Fraud Detection Model
- Web app (Ruby/PHP): preprocess, check current user
- User logs
- ML pipeline: preprocess, train, save
- Score/serve model
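One way to picture the difference is to treat the pipeline as a single composed function. The sketch below is a plain-Scala illustration, not Spark or Hydrosphere code: the stage names, the `String.hashCode` bucketing, and the weights are all hypothetical. The point is that the caller receives `String => Double` and never sees, or re-implements, the pre-processing stages.

```scala
object PipelineServingSketch {
  val numFeatures = 1000

  // Stage 1: raw text to lowercase words.
  val tokenize: String => Seq[String] =
    text => text.toLowerCase.split("\\s").toSeq.filter(_.nonEmpty)

  // Stage 2: words to a sparse term-frequency map (hashed index -> count).
  val featurize: Seq[String] => Map[Int, Double] =
    words => words
      .groupBy(w => ((w.hashCode % numFeatures) + numFeatures) % numFeatures)
      .map { case (index, ws) => index -> ws.size.toDouble }

  // Stage 3: hypothetical linear model; real weights would come from training.
  def classify(weights: Map[Int, Double]): Map[Int, Double] => Double =
    features => {
      val margin = features.map { case (i, v) => weights.getOrElse(i, 0.0) * v }.sum
      if (margin > 0.0) 1.0 else 0.0
    }

  // The served artifact is the whole chain, so callers never repeat stages 1 and 2.
  def pipeline(weights: Map[Int, Double]): String => Double =
    tokenize andThen featurize andThen classify(weights)

  def main(args: Array[String]): Unit = {
    val sparkIndex = featurize(Seq("spark")).keys.head
    val predict = pipeline(Map(sparkIndex -> 1.5)) // made-up weight for illustration
    println(predict("Spark Hadoop"))
  }
}
```

With a model-level API, only `classify` would be exposed, and the Ruby/PHP web app would have to reproduce `tokenize` and `featurize` exactly, which is where the duplication and inconsistency come from.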
![Page 22: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/22.jpg)
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944
![Page 23: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/23.jpg)
![Page 24: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/24.jpg)
[Diagram: cluster (data, model), data scientist, web app]
PMML, PFA, MLeap
- Yet another format lock-in
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts
![Page 25: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/25.jpg)
[Diagram: cluster (data, model), data scientist, web app; Docker image containing model, libs, deps]
- Fat, all-inclusive Docker image: bad practice
- Every model requires a new Docker image to be built
![Page 26: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/26.jpg)
[Diagram: cluster (data, model), data scientist, API, web app]
- Needs Spark Running
- High latency, low throughput
![Page 27: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/27.jpg)
[Diagram: cluster (data, model), data scientist, API, serving, API, web app]
+ Serving skips Spark
+ But re-uses ML algorithms
+ No new formats and APIs
+ Low Latency but not super tuned
+ Scalable
+ Stateless
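As a rough illustration of what a stateless serving layer without a running Spark cluster might look like, the sketch below exposes a scoring function over HTTP using the JDK's built-in `com.sun.net.httpserver` server. Everything here is an assumption made for the example: the `score` function is a trivial stand-in for a loaded pipeline, and the `/predict` path and port 8080 are arbitrary. It does not describe how Hydrosphere's serving layer is implemented.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets
import scala.io.Source

object StatelessServingSketch {
  // Hypothetical stand-in for a loaded pipeline: flags text that mentions "spark".
  def score(text: String): Double =
    if (text.toLowerCase.split("\\s").contains("spark")) 1.0 else 0.0

  // Each request is scored independently; the handler keeps no per-request state.
  val handler: HttpHandler = new HttpHandler {
    override def handle(exchange: HttpExchange): Unit = {
      val text = Source.fromInputStream(exchange.getRequestBody, "UTF-8").mkString
      val body = score(text).toString.getBytes(StandardCharsets.UTF_8)
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body)
      exchange.close()
    }
  }

  // Port and path are arbitrary choices for this sketch.
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/predict", handler)
    server.start()
    server
  }

  def main(args: Array[String]): Unit = {
    start(8080)
    println("POST text to http://localhost:8080/predict")
  }
}
```

Because the handler holds no state between requests, several copies of such a process could sit behind a load balancer, which is one reading of the "Scalable" and "Stateless" points above.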
![Page 28: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/28.jpg)
Low level API Challenge
MS Azure
![Page 29: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/29.jpg)
A deliverable for an ML model
- Single-row serving / scoring layer
- xml, json, parquet, pojo, other
- Monitoring, testing, integration
- Large-scale, batch processing engine
![Page 30: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/30.jpg)
Zooming out
Unified Serving/Scoring API
Repository
MLlib model, TensorFlow model, other model
![Page 31: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/31.jpg)
Real-time Prediction Pipelines
![Page 32: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/32.jpg)
Starting from scratch - SystemML
Multiple execution modes, including Spark MLContext API, Spark Batch, Hadoop Batch, Standalone, and JMLC.
![Page 33: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/33.jpg)
Demo Time
![Page 34: Spark ML Pipeline serving](https://reader034.vdocuments.us/reader034/viewer/2022051504/5a656bfb7f8b9af3678b4cc5/html5/thumbnails/34.jpg)
Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/