ameet talwalkar, assistant professor of computer science, ucla at mlconf sf

Collaborators:*Evan*Sparks,*Michael*Franklin,*Michael*I.*Jordan,*Tim*Kraska*

UC Berkeley

Ameet*Talwalkar

Towards*an*OpBmizer*for*MLbase

Problem:"Scalable"implementa.ons"difficult"for"ML"Developers…

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

Optimizer

Parser

Executor/Monitoring

ML Library

DMX Runtime

DMX Runtime

DMX Runtime

DMX Runtime

LLP

PLP

Master

Slaves

Key Features

-

-

® ®

The Language of Technical Computing

MATLAB® is a high-level language and interactive environment for numerical com-putation, visualization, and programming. Using MATLAB, you can analyze data, develop algorithms, and create models and applications. The language, tools, and built-in math functions enable you to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages, such as C/C++ or Java™.

You can use MATLAB for a range of appli-cations, including signal processing and communications, image and video process-ing, control systems, test and measurement, computational finance, and computational biology. More than a million engineers and scientists in industry and academia use MATLAB, the language of technical computing.

MATLAB Overview 2:04

Analyzing and visualizing data using the MATLAB desktop. The MATLAB environment also lets you write programs and develop algorithms and applications.

Key Features

-

-

® ®

The Language of Technical Computing

MATLAB® is a high-level language and interactive environment for numerical com-putation, visualization, and programming. Using MATLAB, you can analyze data, develop algorithms, and create models and applications. The language, tools, and built-in math functions enable you to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages, such as C/C++ or Java™.

You can use MATLAB for a range of appli-cations, including signal processing and communications, image and video process-ing, control systems, test and measurement, computational finance, and computational biology. More than a million engineers and scientists in industry and academia use MATLAB, the language of technical computing.

MATLAB Overview 2:04

Analyzing and visualizing data using the MATLAB desktop. The MATLAB environment also lets you write programs and develop algorithms and applications.


ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….


Optimizer

Parser

Executor/Monitoring

ML Library

DMX Runtime

DMX Runtime

DMX Runtime

DMX Runtime

LLP

PLP

Master

Slaves


ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….


Optimizer

Parser

Executor/Monitoring

ML Library

DMX Runtime

DMX Runtime

DMX Runtime

DMX Runtime

LLP

PLP

Master

Slaves

CHALLENGE:*Can*we*simplify*distributed*ML*development?

Too*many*ways*to*preprocess…

Too*many*knobs…

Problem:"ML"is"difficultfor"End"Users…

Difficult*to*debug…

Doesn’t*scale…

Too*many*algorithms…

Too*many*ways*to*preprocess…

Too*many*knobs…

Problem:"ML"is"difficultfor"End"Users…

Difficult*to*debug…

Doesn’t*scale…

CHALLENGE:*Can*we*automate*ML*pipeline*

construcBon?

Too*many*algorithms…

MLbase

4

MLlib

MLI

MLOpt

Apache*Spark

Spark:*Cluster*compuBng*system*designed*for*iteraBve*computaBon

MLlib:*Spark’s*core*ML*library

MLI:*API*to*simplify*ML*development

MLOpt:*DeclaraBve*layer*to*automate*hyperparameter*tuning

MLbase'aims'to'simplify'development'and'deployment'of'scalable'ML'pipelines

MLbase

4

MLlib

MLI

MLOpt

Apache*Spark


MLlib:*Spark’s*core*ML*library

MLI:*API*to*simplify*ML*development

MLOpt:*DeclaraBve*layer*to*automate*hyperparameter*tuning

MLbase'aims'to'simplify'development'and'deployment'of'scalable'ML'pipelinesMLOpt and MLI are

experimental testbeds

5

MLlib+**Scalable*and*fast**+**Simple*development*environment*+**Part*of*Spark’s*robust*ecosystem

SparkSQL

Apache Spark

Spark Streaming

MLlib (machine learning)

GraphX (graph)

AcBve*DevelopmentIni.al"Release• Developed*by*MLbase*team*in*AMPLab*(11*contributors)

• Scala,*Java

• Shipped*with*Spark*v0.8*(Sep*2013)

15"months"later…• 60+*contributors*from*various*organizaBons

• Scala,*Java,*Python

• Many*more*algorithms*and*ML*uBliBes

• Improved*documentaBon*/*code*examples,*API*stability

• Latest*release*part*of*Spark*v1.1*(Sept*2014)

MLlib,*MLI*and*Roadmap• MLI:*Shield*ML*Developers*from*low]details*

• Provide*familiar*mathemaBcal*operators*in*distributed*se^ng*

(tables,*matrices,*opBmizaBon*primiBves)*

• Standard*APIs*defining*ML*algorithms*and*feature*extractors

• MLlib*v1.2*will*include*ML*API*inspired*by*MLI

• Longer*term*for*MLlib*• Scalable*implementaBons*of*standard*ML*algorithms*and*

underlying*opBmizaBon*primiBves*

• Further*support*for*ML*pipeline*development*(including*hyper*

parameter*tuning*using*ideas*from*MLOpt)

MLlib,*MLI*and*Roadmap• MLI:*Shield*ML*Developers*from*low]details*

• Provide*familiar*mathemaBcal*operators*in*distributed*se^ng*

(tables,*matrices,*opBmizaBon*primiBves)*

• Standard*APIs*defining*ML*algorithms*and*feature*extractors

• MLlib*v1.2*will*include*ML*API*inspired*by*MLI

• Longer*term*for*MLlib*• Scalable*implementaBons*of*standard*ML*algorithms*and*

underlying*opBmizaBon*primiBves*

• Further*support*for*ML*pipeline*development*(including*hyper*

parameter*tuning*using*ideas*from*MLOpt)

Feedback'and'Contribu9ons'Encouraged!

Vision"MLOpt

✦ User*declaraBvely*specifies*task

✦ PAQ*=*PredicBve*AnalyBc*Query

✦ Search*through*MLlib*to*find*the*best*model/pipeline

SQL Result

TuPAQ: An Efficient Planner for Large-scale Predictive

Analytic Queries

ABSTRACTThe proliferation of massive datasets combined with the develop-ment of sophisticated analytical techniques have enabled a widevariety of novel applications such as improved product recommen-dations, automatic image tagging, and improved speech driven in-terfaces. These and many other applications can be supported byPredictive Analytic Queries (PAQs). The major obstacle to support-ing these queries is the challenging and expensive process of PAQplanning, which involves identifying and training an appropriatepredictive model. Recent efforts aiming to automate this processhave focused on single node implementations and have assumedthat model training itself is a black box, thus limiting the effective-ness of such approaches on large-scale problems. In this work, webuild upon these recent efforts and propose an integrated PAQ plan-ning architecture that combines advanced model search techniques,bandit resource allocation via runtime algorithm introspection, andphysical optimization via batching. The resulting system, TUPAQ,solves the PAQ planning problem with comparable accuracy to ex-haustive strategies but an order of magnitude faster, and can scale tomodels trained on terabytes of data across hundreds of machines.

1. INTRODUCTIONOver the past four decades, a great deal of database systems re-

search has focused on finding efficient execution strategies for anumber of different workloads – single node and distributed trans-actional, analytical, and stream processing workloads have all beenactive areas of study. Rapidly growing data volumes coupled withthe maturity of sophisticated statistical techniques have led to de-mand for support of a new type of workload: predictive analyticsover large scale, distributed datasets. Indeed, the support of pre-dictive analytic queries in database systems is an increasingly wellstudied area. Systems like MLbase [37], MADLib [33], COLUM-BUS [38], MauveDB [27], BayesStore [52] and DimmWitted [56]are all efforts to integrate statistical query processing with a datamanagement system.

Concretely, users would like to issue queries that involve reason-ing about predicted attributes, where predicted attributes are de-rived from a set of observed input attributes along with a labeled

SELECT e.sender, e.subject, e.message

FROM Emails e

WHERE e.user = ’Bob’

AND PREDICT(e.spam, e.message) = false GIVENLabeledData

(a) Spam labeling.SELECT um.title

FROM UserMovies um

WHERE um.user = ’Bob’ AND um.viewed = falseORDER BY PREDICT(um.rating) GIVEN Ratings

DESC LIMIT 50

(b) Movie recommendation.SELECT p.image

FROM Pictures p

WHERE PREDICT(p.tag, p.photo) = ’Plant’ GIVENLabeledPhotos

AND p.likes > 500

(c) Photo classification.Figure 1: Three examples of PAQs, with the predictive clauseshighlighted in green. (1a) leverages predicted values in the spamattribute to return Bob’s non-spam emails. (1b) leverages Bob’spredicted movie ratings to return Bob’s list of the top 50 predictedmovie titles. (1c) finds popular pictures of photos based on an im-age classification model – even if the images are not labeled. Eachof these use cases may require considerable training data.

training dataset. We refer to such queries as Predictive AnalyticQueries, or PAQs. The predictions returned by the system shouldbe highly accurate both on training data and new data as it comesinto the system. PAQs consist of traditional database queries alongwith new predictive clauses, and these predictive clauses are thefocus of this work. Examples of PAQs are illustrated in Figure 1,with the predictive clauses highlighted in green.

Given recent advances in statistical methodology, supervised ma-chine learning (ML) techniques are a natural way to support thepredictive clauses in PAQs. In the supervised learning setting, a sta-tistical model leverages training data to relate the input attributes tothe desired output attribute. Furthermore, ML methods learn bettermodels as the size of the training data increases, and recent ad-vances in distributed ML algorithm development, e.g., [13, 44, 40]enable large-scale model training in the distributed setting. Unfor-tunately, the application of supervised learning techniques to a newinput dataset is computationally demanding and technically chal-lenging. For a non-expert, the process of carefully preprocessingthe input attributes, selecting the appropriate ML model, and tun-

1

PAQ Model

ML

!Data

Feature*

ExtracBon

Model*

Training

Final**

Model

A*Standard*ML*Pipeline

✦ In*pracBce,*model*building*is*an*iteraBve*process*of*conBnuous*

refinement

✦ Our*grand*vision*is*to*automate*the*construcBon*of*these*

pipelines

Training*A*Model

✦ For*each*point*in*dataset*✦ compute*gradient*✦ update*model*✦ repeat*unBl*converged

✦ Requires*mul$ple'passes✦ Common*access*padern*

✦ Naive*Bayes,*Trees,*etc.

✦ Minutes*to*train*an*SVM*on*200GB*of*data*on*a*16]node*cluster

The*Tricky*Part

✦ Algorithms*✦ LogisBc*Regression,*SVM,*Tree]

based,*etc.*

✦ Algorithm*hyper]parameters*✦ Learning*Rate,*RegularizaBon,*etc.

Algorithms

Hyper*Parameters

FeaturizaBon

✦ FeaturizaBon*✦ Text:*n]grams,*TF]IDF*✦ Images:*Gabor*filters,*random*

convoluBons*✦ Random*projecBon?*Scaling?

A*Standard*ML*Pipeline

✦ In*pracBce,*model*building*is*an*iteraBve*process*of*conBnuous*

refinement

✦ Our*grand*vision*is*to*automate*the*construcBon*of*these*

pipelines

✦ Start*with*one*aspect*of*the*pipeline*]*model*selecBon

!Data

Feature*

ExtracBon

Model*

Training

Final**

Model

Automated"Model"Selec.on

One*Approach

Learning*Rate

RegularizaBon

Best"answer

✦ Try*it*all!*✦ Search*over*all*

hyperparameters,*algorithms,*features,*etc.

✦ Drawbacks*✦ Expensive*to*compute*models*✦ Hyperparameter*space*is*large

✦ Some*version*of*this*sBll*olen*done*in*pracBce!

A*Beder*Approach

✦ Beder*resource*uBlizaBon*✦ through*batching*

✦ Algorithmic*Speedups*✦ via*early*stopping*/*bandits*

✦ Improved*Search*✦ e.g.,*via*randomizaBon

Learning*Rate

RegularizaBon

Best"answer

A*Tale*Of*3*OpBmizaBons

BeLer"Resource"U.liza.on"

Algorithmic"Speedups"

Improved"Search

✦ Typical*model*update*requires*2]4*flops/double

✦ But*modern*memory*much*slower*than*processors*

✦ We*can*do*25*flops*/*double*read!*

✦ This*equates*to*6]8*model*updates*per*double*we*read,*assuming*models*fit*in*cache

✦ Train"mul.ple"models"simultaneously

Beder*Resource*UBlizaBon

What*Do*We*See*In*Spark?

✦ 2x*and*5x*increase*in*models*trained/sec*with*batching*

✦ Overhead*from*virtualizaBon,*network,*etc.

What*Do*We*See*In*Spark?

✦ These*numbers*are*with*vector]matrix*mulBplies

✦ Can*do*beder*when*rewriBng*in*terms*of*matrix]matrix*mulBplies




Improved"Search

Learning*Rate

RegularizaBon

Best"answer

Bandit*Search

✦ Each*point*is*a*trained*model

✦ Some*models*look*bad*early

✦ So*we*give*up*early!

✦ Can*frame*this*as*a*mul.Qarmed"bandit*problem✦ Each*model*is*an*arm

Bandit*Search

✦ Each*point*is*a*trained*model

✦ Some*models*look*bad*early

✦ So*we*give*up*early!

✦ Can*frame*this*as*a*mul.Qarmed"bandit*problem✦ Each*model*is*an*arm



Improved"Search


What*Search*Method?*GRID NELDER_MEAD POWELL RANDOM SMAC SPEARMINT TPE

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

australianbreast

diabetesfourclass

splice

16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625Method and Maximum Calls

Data

set a

nd V

alid

atio

n Er

ror

Maximum Calls1681256625

Comparison of Search Methods Across Learning Problems

✦ Various*derivaBve]free*opBmizaBon*techniques*✦ Simple*ones*(Grid,*Random)*✦ Classic*DerivaBve]Free*(Nelder]Mead,*Powell’s*method)*✦ Bayesian*(SMAC,*TPE,*Spearmint)

✦ What*should*we*do?*✦ Tried*on*5*datasets,*opBmized*over*4*hyperparameters!

What*Search*Method?*GRID NELDER_MEAD POWELL RANDOM SMAC SPEARMINT TPE

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

australianbreast

diabetesfourclass

splice

16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625Method and Maximum Calls

Data

set a

nd V

alid

atio

n Er

ror

Maximum Calls1681256625

Comparison of Search Methods Across Learning Problems

●●●●●●●●●●●●●●●●

●●●●

●●●●

●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●

●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0.25

0.50

0.75

0 200 400 600 800Time elapsed (m)

Best

Val

idat

ion

Erro

r See

n So

Far

Search Method●

●

●

Grid − UnoptimizedRandom − OptimizedTPE − Optimized

Model Convergence Over Time

Pu^ng*It*All*Together✦ First*version*of*MLbase*opBmizer

✦ 30GB*dense*images*(240K*x*16K)

✦ 2*model*families,*5*hyperparams

✦ Baseline:*grid*search

✦ Our*method:*combinaBon*of*✦ Batching*✦ Early*stopping*/*Bandit*✦ Random*or*TPE*

20x"speedup"compared'to'grid'search'15'minutes'vs'5'hours!'"

Does*It*Scale?

✦ 1.5TB*dataset*(1.2M*x*160K)

✦ 128*nodes,*thousands*of*passes*over*data

✦ Tried*32*models*in*15*hours*

✦ Good*answer*aler*11*hours

●●●●●●●● ●●●●●● ●●●●● ●●●●

●●●●

●● ●● ●

0.25

0.50

0.75

5 10Time elapsed (h)

Best

Val

idat

ion

Erro

r See

n So

Far

Convergence of Model Accuracy on 1.5TB Dataset

Future*Work

!Data

Feature*

ExtracBon

Model*

Training

Final**

Model

Automated"ML"Pipeline"Construc.on

Data Image!Parser Normalizer Convolver

sqrt,mean

Zipper

Linear Solver

Symmetric!Rectifier

ident,absident,mean

Global

Pooler

Patch!Extractor

Patch Whitener

KMeans!Clusterer

Feature Extractor

Label!Extractor

Linear!Mapper Model

Test!Data

Label!Extractor

Feature Extractor

Test Error

Error!Computer

No Hyperparameters!A few Hyperparameters!Lotsa Hyperparameters

Slide courtesy of Evan Sparks

Other*Future*Work

✦ Ensembling

✦ Leverage*sampling

✦ Beder*parallelism*for*smaller*datasets

✦ MulBple*hypothesis*tesBng*issues

MLOpt:*DeclaraBve*layer*that*aims*to*automate*ML*pipeline*construcBon*

MLI:*API*to*simplify*ML*development*

MLlib:*Spark’s*core*ML*library*


THANKS! QUESTIONS?

baseML

baseML

baseML

baseML

ML base

ML base

ML base

ML base

ML base

www.mlbase.org

MLlib

MLI

MLOpt

Apache*Spark

ameet talwalkar, assistant professor of computer science, ucla at mlconf sf

Technology