TRANSCRIPT
Best practices for productionizing Apache Spark MLlib models
Joseph Bradley March 7, 2018 Strata San Jose
About me
Joseph Bradley • Software engineer at Databricks • Apache Spark committer & PMC member • Ph.D. in Machine Learning from Carnegie Mellon
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Try for free today: databricks.com
Apache Spark Engine
[Diagram: Spark Core, with Spark Streaming, Spark SQL, MLlib, GraphX, … on top]
Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries
MLlib’s success
• Apache Spark integration simplifies:
  • Deployment
  • ETL
  • Integration into complete analytics pipelines with SQL & streaming
• Scalability & speed
• Pipelines for featurization, modeling & tuning
• 1000s of commits, 100s of contributors, 10,000s of users (on Databricks alone)
• Many production use cases
End goal: Data-driven applications
Challenge: smooth deployment to production
Productionizing Machine Learning
[Diagram: the Data Science / ML team produces models; Prediction Servers return results to End Users]
Deployment Options for MLlib
By latency requirement (roughly 10 ms to 1 day):
• Low-latency / real-time (~10 ms to 100 ms): Spark-less highly-available prediction server
• Streaming (~1 second to 1 minute): Spark Structured Streaming
• Batch (~1 hour to 1 day): Spark batch processing
Challenges of Productionizing
[Diagram: the Data Science / ML team serializes models; Prediction Servers deserialize them and make predictions, returning results to End Users]
Challenge: working across teams
Challenge: featurization logic
[Diagram: the same pipeline, but with feature logic embedded in the Data Science code, in the serialized model, and in the Prediction Servers]
ML Pipelines
[Diagram: Original dataset → Feature extraction → Predictive model]

Text                    | Label
I bought the game...    | 4
Do NOT bother try...    | 1
this shirt is aweso...  | 5
never got it. Seller... | 1
I ordered this to...    | 3
ML Pipelines: featurization
[Diagram: Original dataset → Feature extraction → Predictive model]

Text                    | Label | Words              | Features
I bought the game...    | 4     | "i", "bought", ... | [1, 0, 3, 9, ...]
Do NOT bother try...    | 1     | "do", "not", ...   | [0, 0, 11, 0, ...]
this shirt is aweso...  | 5     | "this", "shirt"    | [0, 2, 3, 1, ...]
never got it. Seller... | 1     | "never", "got"     | [1, 2, 0, 0, ...]
I ordered this to...    | 3     | "i", "ordered"     | [1, 0, 0, 3, ...]
ML Pipelines: model
[Diagram: Original dataset → Feature extraction → Predictive model]

Text                    | Label | Words              | Features          | Prediction | Probability
I bought the game...    | 4     | "i", "bought", ... | [1, 0, 3, 9, ...] | 4          | 0.8
Do NOT bother try...    | 1     | "do", "not", ...   | [0, 0, 11, 0, ...]| 2          | 0.6
this shirt is aweso...  | 5     | "this", "shirt"    | [0, 2, 3, 1, ...] | 5          | 0.9
never got it. Seller... | 1     | "never", "got"     | [1, 2, 0, 0, ...] | 1          | 0.7
I ordered this to...    | 3     | "i", "ordered"     | [1, 0, 0, 3, ...] | 4          | 0.7
Challenge: various environments
[Diagram: models from the Data Science / ML team must make predictions and return results across several different environments and Prediction Servers]
Summary of challenges
Sharing models across teams and across systems & environments (dev, staging, prod) while maintaining identical behavior, including featurization, both now and in the future. This requires:
• model & pipeline persistence/export
• versioning & compatibility
Production architectures for ML
Architecture A: batch
Pre-compute predictions using Spark and serve from a database

[Diagram: Recurring batch job: Train ALS Model → Save Offers to NoSQL → Ranked Offers → Send Email Offers to Customers; Display Ranked Offers in Web/Mobile]
E.g.: content recommendation
Architecture B: streaming
Score in Spark Streaming + use an API with cached predictions

[Diagram: Web Activity Logs → Compute Features → Run Prediction (streaming) → Cached Predictions; the Prediction Server or Application does an API check against the cache and may kill the user's login session]
E.g.: monitoring web sessions
Architecture C: sub-second
Train with Spark and score outside of Spark

[Diagram: Train Model in Spark → Save Model to S3/HDFS → Copy Model to Production; New Data → Predictions]
E.g.: card swipe fraud detection
Solving challenges with Apache Spark MLlib
MLlib solutions by architecture

A: Batch scoring: ML Pipelines; Spark SQL & custom logic
B: Streaming: ML Pipelines (Spark 2.3); Spark SQL & custom logic
C: Sub-second: 3rd-party solutions
A: Batch scoring in Spark
ML Pipelines cover most featurization and modeling
• Save models and Pipelines via ML persistence: pipeline.save(path)

Simple to add custom logic
• Spark SQL
  – Save workflows via notebooks, JARs, and Jobs
• Custom ML Pipeline components
  – Save via ML persistence
B: Streaming scoring in Spark
As of Apache Spark 2.3, same as batch:
• ML Pipelines cover most featurization + modeling
• Simple to add custom logic
  • Spark SQL
  • Custom ML Pipeline components
But be aware of critical updates in Spark 2.3!
Scoring with Structured Streaming in Spark 2.3
Some existing Pipelines will need fixes:
• OneHotEncoder → OneHotEncoderEstimator
• VectorAssembler sometimes needs a VectorSizeHint
RFormula has been updated & works out of the box.
(demo of streaming)
The nitty gritty
One-hot encoding
• (Spark 2.2) Transformer: stateless transform of the DataFrame.
• (Spark 2.3) Estimator: records categories during fitting and uses the same categories during scoring.
• Important fix for both batch & streaming!

Feature vector assembly (including in RFormula)
• (Spark 2.2) Vector size sometimes inferred from data
• (Spark 2.3) Add a size hint to the Pipeline when needed
C: Sub-second scoring
For REST APIs and embedded applications

Requirements:
• Lightweight deployment (no Spark dependency)
• Milliseconds for prediction (no SparkSession or Spark jobs)

Several 3rd-party solutions exist:
• Databricks Model Export
• MLeap
• PMML and PFA
• H2O
Lessons from Databricks Model Export
Most engineering work is in testing
• Identical behavior in MLlib and in exported models, including in complex Pipelines
• Automated testing to catch changes in MLlib
• Backwards compatibility tests
Backwards compatibility & stability guarantees are critical
• Added explicit guarantees to MLlib docs: https://spark.apache.org/docs/latest/ml-pipeline.html#backwards-compatibility-for-ml-persistence
Summary

A: Batch scoring: ML Pipelines; Spark SQL & custom logic
B: Streaming: ML Pipelines (Spark 2.3); Spark SQL & custom logic
C: Sub-second: 3rd-party solutions
Additional challenges outside the scope of this talk
• Feature and model management
• Monitoring
• A/B testing
Resources

Overview of productionizing Apache Spark ML models
Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models
Batch scoring
Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines
Streaming scoring
Guide and example notebook: https://tinyurl.com/y7bk5plu
Sub-second scoring
Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving
Aside: new in Apache Spark 2.3
https://databricks.com/blog/2018/02/28
• Fixes for ML scoring in Structured Streaming (this talk)
• ImageSchema and image utilities to enable Deep Learning use cases on Spark
• Python API improvements for developing custom algorithms
• And much more! Available now in Databricks Runtime 4.0!
Blog post: http://dbricks.co/2sK35XT
Available from O’Reilly: http://shop.oreilly.com/product/0636920034957.do
https://databricks.com/careers
Thank You! Questions?