TRANSCRIPT
Best practices for productionizing Apache Spark MLlib models
Joseph Bradley March 7, 2018 Strata San Jose
About me
Joseph Bradley • Software engineer at Databricks • Apache Spark committer & PMC member • Ph.D. in Machine Learning from Carnegie Mellon
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Try for free today: databricks.com
Apache Spark Engine
[Diagram: Spark Core, with Spark Streaming, Spark SQL, MLlib, GraphX, … on top]
Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries
MLlib’s success
• Apache Spark integration simplifies:
  • Deployment
  • ETL
  • Integration into complete analytics pipelines with SQL & streaming
• Scalability & speed
• Pipelines for featurization, modeling & tuning
• 1000s of commits, 100s of contributors, 10,000s of users (on Databricks alone)
• Many production use cases
End goal: Data-driven applications
Challenge: smooth deployment to production
Productionizing Machine Learning
[Diagram: the Data Science / ML team produces models; Prediction Servers return results to End Users]
Deployment Options for MLlib
By latency requirement (roughly 10 ms to 1 day):
• Low-latency / real-time (~10 ms to 100 ms): Spark-less highly-available prediction server
• Streaming (~1 second to 1 minute): Spark Structured Streaming
• Batch (~1 hour to 1 day): Spark batch processing
Challenges of Productionizing
[Diagram: the Data Science / ML team serializes models; Prediction Servers deserialize them and make predictions, returning results to End Users]
Challenge: working across teams
Challenge: featurization logic
[Diagram: the same pipeline, but with feature logic embedded in the Data Science code, in the serialized model, and in the Prediction Servers]
ML Pipelines
[Diagram: Original dataset → Feature extraction → Predictive model]

Text                    | Label
I bought the game...    | 4
Do NOT bother try...    | 1
this shirt is aweso...  | 5
never got it. Seller... | 1
I ordered this to...    | 3
ML Pipelines: featurization
[Diagram: Original dataset → Feature extraction → Predictive model]

Text                    | Label | Words              | Features
I bought the game...    | 4     | "i", "bought", ... | [1, 0, 3, 9, ...]
Do NOT bother try...    | 1     | "do", "not", ...   | [0, 0, 11, 0, ...]
this shirt is aweso...  | 5     | "this", "shirt"    | [0, 2, 3, 1, ...]
never got it. Seller... | 1     | "never", "got"     | [1, 2, 0, 0, ...]
I ordered this to...    | 3     | "i", "ordered"     | [1, 0, 0, 3, ...]
ML Pipelines: model
[Diagram: Original dataset → Feature extraction → Predictive model]

Text                    | Label | Words              | Features          | Prediction | Probability
I bought the game...    | 4     | "i", "bought", ... | [1, 0, 3, 9, ...] | 4          | 0.8
Do NOT bother try...    | 1     | "do", "not", ...   | [0, 0, 11, 0, ...]| 2          | 0.6
this shirt is aweso...  | 5     | "this", "shirt"    | [0, 2, 3, 1, ...] | 5          | 0.9
never got it. Seller... | 1     | "never", "got"     | [1, 2, 0, 0, ...] | 1          | 0.7
I ordered this to...    | 3     | "i", "ordered"     | [1, 0, 0, 3, ...] | 4          | 0.7
Challenge: various environments
[Diagram: models from the Data Science / ML team must make predictions and return results across several different environments and Prediction Servers]
Summary of challenges
Sharing models across teams and across systems & environments (dev, staging, prod) while maintaining identical behavior, including featurization, both now and in the future. This requires:
• model & pipeline persistence/export
• versioning & compatibility
Production architectures for ML
Architecture A: batch
Pre-compute predictions using Spark and serve from a database

[Diagram: Recurring batch job: Train ALS Model → Save Offers to NoSQL → Ranked Offers → Send Email Offers to Customers; Display Ranked Offers in Web/Mobile]
E.g.: content recommendation
Architecture B: streaming
Score in Spark Streaming + use an API with cached predictions

[Diagram: Web Activity Logs → Compute Features → Run Prediction (streaming) → Cached Predictions; the Prediction Server or Application does an API check against the cache and may kill the user's login session]
E.g.: monitoring web sessions
Architecture C: sub-second
Train with Spark and score outside of Spark

[Diagram: Train Model in Spark → Save Model to S3/HDFS → Copy Model to Production; New Data → Predictions]
E.g.: card swipe fraud detection
Solving challenges with Apache Spark MLlib
MLlib solutions by architecture

A: Batch scoring: ML Pipelines; Spark SQL & custom logic
B: Streaming: ML Pipelines (Spark 2.3); Spark SQL & custom logic
C: Sub-second: 3rd-party solutions
A: Batch scoring in Spark
ML Pipelines cover most featurization and modeling
• Save models and Pipelines via ML persistence: pipeline.save(path)

Simple to add custom logic
• Spark SQL
  – Save workflows via notebooks, JARs, and Jobs
• Custom ML Pipeline components
  – Save via ML persistence
B: Streaming scoring in Spark
As of Apache Spark 2.3, same as batch:
• ML Pipelines cover most featurization + modeling
• Simple to add custom logic
  • Spark SQL
  • Custom ML Pipeline components
But be aware of critical updates in Spark 2.3!
Scoring with Structured Streaming in Spark 2.3
Some existing Pipelines will need fixes:
• OneHotEncoder → OneHotEncoderEstimator
• VectorAssembler sometimes needs a VectorSizeHint
RFormula has been updated & works out of the box.
(demo of streaming)
The nitty gritty
One-hot encoding
• (Spark 2.2) Transformer: stateless transform of the DataFrame.
• (Spark 2.3) Estimator: records categories during fitting and uses the same categories during scoring.
• Important fix for both batch & streaming!

Feature vector assembly (including in RFormula)
• (Spark 2.2) Vector size sometimes inferred from data
• (Spark 2.3) Add a size hint to the Pipeline when needed
C: Sub-second scoring
For REST APIs and embedded applications

Requirements:
• Lightweight deployment (no Spark dependency)
• Milliseconds for prediction (no SparkSession or Spark jobs)

Several 3rd-party solutions exist:
• Databricks Model Export
• MLeap
• PMML and PFA
• H2O
Lessons from Databricks Model Export
Most engineering work is in testing
• Identical behavior in MLlib and in exported models, including in complex Pipelines
• Automated testing to catch changes in MLlib
• Backwards compatibility tests
Backwards compatibility & stability guarantees are critical
• Added explicit guarantees to MLlib docs: https://spark.apache.org/docs/latest/ml-pipeline.html#backwards-compatibility-for-ml-persistence
Summary

A: Batch scoring: ML Pipelines; Spark SQL & custom logic
B: Streaming: ML Pipelines (Spark 2.3); Spark SQL & custom logic
C: Sub-second: 3rd-party solutions
Additional challenges outside the scope of this talk
• Feature and model management
• Monitoring
• A/B testing
Resources

Overview of productionizing Apache Spark ML models
Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models
Batch scoring
Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines
Streaming scoring
Guide and example notebook: https://tinyurl.com/y7bk5plu
Sub-second scoring
Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving
Aside: new in Apache Spark 2.3
https://databricks.com/blog/2018/02/28
• Fixes for ML scoring in Structured Streaming (this talk)
• ImageSchema and image utilities to enable Deep Learning use cases on Spark
• Python API improvements for developing custom algorithms
• And much more! Available now in Databricks Runtime 4.0!
Blog post: http://dbricks.co/2sK35XT
Available from O’Reilly: http://shop.oreilly.com/product/0636920034957.do
https://databricks.com/careers
Thank You! Questions?