ameet talwalkar, assistant professor of computer science, ucla at mlconf sf

36
Collaborators: Evan Sparks, Michael Franklin, Michael I. Jordan, Tim Kraska UC Berkeley Ameet Talwalkar Towards an OpBmizer for MLbase

Upload: sessionsevents

Post on 04-Jul-2015

2.546 views

Category:

Technology


0 download

DESCRIPTION

Abstract: Apache Spark’s MLlib is a terrific library for fitting large-scale machine learning models. However, translating high-level problem statements like “learn a classifier” into a working model presently requires significant manual effort (via ad hoc parameter tuning) and computational resources (to fit several models). We present our work on the MLbase optimizer – a system designed on top of Spark to quickly and automatically search through a hyperparameter space and find a good model. By leveraging performance enhancements, better search algorithms, and statistical heuristics, our system offers an order of magnitude speedup over standard methods.

TRANSCRIPT

Page 1: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Collaborators:*Evan*Sparks,*Michael*Franklin,*Michael*I.*Jordan,*Tim*Kraska*

UC Berkeley

Ameet*Talwalkar

Towards*an*OpBmizer*for*MLbase

Page 2: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Problem:"Scalable"implementa.ons"difficult"for"ML"Developers…

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

Optimizer

Parser

Executor/Monitoring

ML Library

DMX Runtime

DMX Runtime

DMX Runtime

DMX Runtime

LLP

PLP

Master

Slaves

Key Features

-

-

® ®

The Language of Technical Computing

MATLAB® is a high-level language and interactive environment for numerical com-putation, visualization, and programming. Using MATLAB, you can analyze data, develop algorithms, and create models and applications. The language, tools, and built-in math functions enable you to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages, such as C/C++ or Java™.

You can use MATLAB for a range of appli-cations, including signal processing and communications, image and video process-ing, control systems, test and measurement, computational finance, and computational biology. More than a million engineers and scientists in industry and academia use MATLAB, the language of technical computing.

MATLAB Overview 2:04

Analyzing and visualizing data using the MATLAB desktop. The MATLAB environment also lets you write programs and develop algorithms and applications.

Key Features

-

-

® ®

The Language of Technical Computing

MATLAB® is a high-level language and interactive environment for numerical com-putation, visualization, and programming. Using MATLAB, you can analyze data, develop algorithms, and create models and applications. The language, tools, and built-in math functions enable you to explore multiple approaches and reach a solution faster than with spreadsheets or traditional programming languages, such as C/C++ or Java™.

You can use MATLAB for a range of appli-cations, including signal processing and communications, image and video process-ing, control systems, test and measurement, computational finance, and computational biology. More than a million engineers and scientists in industry and academia use MATLAB, the language of technical computing.

MATLAB Overview 2:04

Analyzing and visualizing data using the MATLAB desktop. The MATLAB environment also lets you write programs and develop algorithms and applications.

Page 3: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Problem:"Scalable"implementa.ons"difficult"for"ML"Developers…

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

Optimizer

Parser

Executor/Monitoring

ML Library

DMX Runtime

DMX Runtime

DMX Runtime

DMX Runtime

LLP

PLP

Master

Slaves

Page 4: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Problem:"Scalable"implementa.ons"difficult"for"ML"Developers…

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

Optimizer

Parser

Executor/Monitoring

ML Library

DMX Runtime

DMX Runtime

DMX Runtime

DMX Runtime

LLP

PLP

Master

Slaves

CHALLENGE:*Can*we*simplify*distributed*ML*development?

Page 5: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Too*many*ways*to*preprocess…

Too*many*knobs…

Problem:"ML"is"difficultfor"End"Users…

Difficult*to*debug…

Doesn’t*scale…

Too*many*algorithms…

Page 6: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Too*many*ways*to*preprocess…

Too*many*knobs…

Problem:"ML"is"difficultfor"End"Users…

Difficult*to*debug…

Doesn’t*scale…

CHALLENGE:*Can*we*automate*ML*pipeline*

construcBon?

Too*many*algorithms…

Page 7: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

MLbase

4

MLlib

MLI

MLOpt

Apache*Spark

Spark:*Cluster*compuBng*system*designed*for*iteraBve*computaBon

MLlib:*Spark’s*core*ML*library

MLI:*API*to*simplify*ML*development

MLOpt:*DeclaraBve*layer*to*automate*hyperparameter*tuning

MLbase'aims'to'simplify'development'and'deployment'of'scalable'ML'pipelines

Page 8: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

MLbase

4

MLlib

MLI

MLOpt

Apache*Spark

Spark:*Cluster*compuBng*system*designed*for*iteraBve*computaBon

MLlib:*Spark’s*core*ML*library

MLI:*API*to*simplify*ML*development

MLOpt:*DeclaraBve*layer*to*automate*hyperparameter*tuning

MLbase'aims'to'simplify'development'and'deployment'of'scalable'ML'pipelinesMLOpt and MLI are

experimental testbeds

Page 9: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

5

MLlib+**Scalable*and*fast**+**Simple*development*environment*+**Part*of*Spark’s*robust*ecosystem

SparkSQL

Apache Spark

Spark Streaming

MLlib (machine learning)

GraphX (graph)

Page 10: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

AcBve*DevelopmentIni.al"Release• Developed*by*MLbase*team*in*AMPLab*(11*contributors)

• Scala,*Java

• Shipped*with*Spark*v0.8*(Sep*2013)

15"months"later…• 60+*contributors*from*various*organizaBons

• Scala,*Java,*Python

• Many*more*algorithms*and*ML*uBliBes

• Improved*documentaBon*/*code*examples,*API*stability

• Latest*release*part*of*Spark*v1.1*(Sept*2014)

Page 11: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

MLlib,*MLI*and*Roadmap• MLI:*Shield*ML*Developers*from*low]details*

• Provide*familiar*mathemaBcal*operators*in*distributed*se^ng*

(tables,*matrices,*opBmizaBon*primiBves)*

• Standard*APIs*defining*ML*algorithms*and*feature*extractors

• MLlib*v1.2*will*include*ML*API*inspired*by*MLI

• Longer*term*for*MLlib*• Scalable*implementaBons*of*standard*ML*algorithms*and*

underlying*opBmizaBon*primiBves*

• Further*support*for*ML*pipeline*development*(including*hyper*

parameter*tuning*using*ideas*from*MLOpt)

Page 12: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

MLlib,*MLI*and*Roadmap• MLI:*Shield*ML*Developers*from*low]details*

• Provide*familiar*mathemaBcal*operators*in*distributed*se^ng*

(tables,*matrices,*opBmizaBon*primiBves)*

• Standard*APIs*defining*ML*algorithms*and*feature*extractors

• MLlib*v1.2*will*include*ML*API*inspired*by*MLI

• Longer*term*for*MLlib*• Scalable*implementaBons*of*standard*ML*algorithms*and*

underlying*opBmizaBon*primiBves*

• Further*support*for*ML*pipeline*development*(including*hyper*

parameter*tuning*using*ideas*from*MLOpt)

Feedback'and'Contribu9ons'Encouraged!

Page 13: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Vision"MLOpt

Page 14: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

✦ User*declaraBvely*specifies*task

✦ PAQ*=*PredicBve*AnalyBc*Query

✦ Search*through*MLlib*to*find*the*best*model/pipeline

SQL Result

TuPAQ: An Efficient Planner for Large-scale Predictive

Analytic Queries

ABSTRACTThe proliferation of massive datasets combined with the develop-ment of sophisticated analytical techniques have enabled a widevariety of novel applications such as improved product recommen-dations, automatic image tagging, and improved speech driven in-terfaces. These and many other applications can be supported byPredictive Analytic Queries (PAQs). The major obstacle to support-ing these queries is the challenging and expensive process of PAQplanning, which involves identifying and training an appropriatepredictive model. Recent efforts aiming to automate this processhave focused on single node implementations and have assumedthat model training itself is a black box, thus limiting the effective-ness of such approaches on large-scale problems. In this work, webuild upon these recent efforts and propose an integrated PAQ plan-ning architecture that combines advanced model search techniques,bandit resource allocation via runtime algorithm introspection, andphysical optimization via batching. The resulting system, TUPAQ,solves the PAQ planning problem with comparable accuracy to ex-haustive strategies but an order of magnitude faster, and can scale tomodels trained on terabytes of data across hundreds of machines.

1. INTRODUCTIONOver the past four decades, a great deal of database systems re-

search has focused on finding efficient execution strategies for anumber of different workloads – single node and distributed trans-actional, analytical, and stream processing workloads have all beenactive areas of study. Rapidly growing data volumes coupled withthe maturity of sophisticated statistical techniques have led to de-mand for support of a new type of workload: predictive analyticsover large scale, distributed datasets. Indeed, the support of pre-dictive analytic queries in database systems is an increasingly wellstudied area. Systems like MLbase [37], MADLib [33], COLUM-BUS [38], MauveDB [27], BayesStore [52] and DimmWitted [56]are all efforts to integrate statistical query processing with a datamanagement system.

Concretely, users would like to issue queries that involve reason-ing about predicted attributes, where predicted attributes are de-rived from a set of observed input attributes along with a labeled

SELECT e.sender, e.subject, e.message

FROM Emails e

WHERE e.user = ’Bob’

AND PREDICT(e.spam, e.message) = false GIVENLabeledData

(a) Spam labeling.SELECT um.title

FROM UserMovies um

WHERE um.user = ’Bob’ AND um.viewed = falseORDER BY PREDICT(um.rating) GIVEN Ratings

DESC LIMIT 50

(b) Movie recommendation.SELECT p.image

FROM Pictures p

WHERE PREDICT(p.tag, p.photo) = ’Plant’ GIVENLabeledPhotos

AND p.likes > 500

(c) Photo classification.Figure 1: Three examples of PAQs, with the predictive clauseshighlighted in green. (1a) leverages predicted values in the spamattribute to return Bob’s non-spam emails. (1b) leverages Bob’spredicted movie ratings to return Bob’s list of the top 50 predictedmovie titles. (1c) finds popular pictures of photos based on an im-age classification model – even if the images are not labeled. Eachof these use cases may require considerable training data.

training dataset. We refer to such queries as Predictive AnalyticQueries, or PAQs. The predictions returned by the system shouldbe highly accurate both on training data and new data as it comesinto the system. PAQs consist of traditional database queries alongwith new predictive clauses, and these predictive clauses are thefocus of this work. Examples of PAQs are illustrated in Figure 1,with the predictive clauses highlighted in green.

Given recent advances in statistical methodology, supervised ma-chine learning (ML) techniques are a natural way to support thepredictive clauses in PAQs. In the supervised learning setting, a sta-tistical model leverages training data to relate the input attributes tothe desired output attribute. Furthermore, ML methods learn bettermodels as the size of the training data increases, and recent ad-vances in distributed ML algorithm development, e.g., [13, 44, 40]enable large-scale model training in the distributed setting. Unfor-tunately, the application of supervised learning techniques to a newinput dataset is computationally demanding and technically chal-lenging. For a non-expert, the process of carefully preprocessingthe input attributes, selecting the appropriate ML model, and tun-

1

PAQ Model

ML

Page 15: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

!Data

Feature*

ExtracBon

Model*

Training

Final**

Model

A*Standard*ML*Pipeline

✦ In*pracBce,*model*building*is*an*iteraBve*process*of*conBnuous*

refinement

✦ Our*grand*vision*is*to*automate*the*construcBon*of*these*

pipelines

Page 16: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Training*A*Model

✦ For*each*point*in*dataset*✦ compute*gradient*✦ update*model*✦ repeat*unBl*converged

✦ Requires*mul$ple'passes✦ Common*access*padern*

✦ Naive*Bayes,*Trees,*etc.

✦ Minutes*to*train*an*SVM*on*200GB*of*data*on*a*16]node*cluster

Page 17: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

The*Tricky*Part

✦ Algorithms*✦ LogisBc*Regression,*SVM,*Tree]

based,*etc.*

✦ Algorithm*hyper]parameters*✦ Learning*Rate,*RegularizaBon,*etc.

Algorithms

Hyper*Parameters

FeaturizaBon

✦ FeaturizaBon*✦ Text:*n]grams,*TF]IDF*✦ Images:*Gabor*filters,*random*

convoluBons*✦ Random*projecBon?*Scaling?

Page 18: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

A*Standard*ML*Pipeline

✦ In*pracBce,*model*building*is*an*iteraBve*process*of*conBnuous*

refinement

✦ Our*grand*vision*is*to*automate*the*construcBon*of*these*

pipelines

✦ Start*with*one*aspect*of*the*pipeline*]*model*selecBon

!Data

Feature*

ExtracBon

Model*

Training

Final**

Model

Automated"Model"Selec.on

Page 19: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

One*Approach

Learning*Rate

RegularizaBon

Best"answer

✦ Try*it*all!*✦ Search*over*all*

hyperparameters,*algorithms,*features,*etc.

✦ Drawbacks*✦ Expensive*to*compute*models*✦ Hyperparameter*space*is*large

✦ Some*version*of*this*sBll*olen*done*in*pracBce!

Page 20: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

A*Beder*Approach

✦ Beder*resource*uBlizaBon*✦ through*batching*

✦ Algorithmic*Speedups*✦ via*early*stopping*/*bandits*

✦ Improved*Search*✦ e.g.,*via*randomizaBon

Learning*Rate

RegularizaBon

Best"answer

Page 21: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

A*Tale*Of*3*OpBmizaBons

BeLer"Resource"U.liza.on"

Algorithmic"Speedups"

Improved"Search

Page 22: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

✦ Typical*model*update*requires*2]4*flops/double

✦ But*modern*memory*much*slower*than*processors*

✦ We*can*do*25*flops*/*double*read!*

✦ This*equates*to*6]8*model*updates*per*double*we*read,*assuming*models*fit*in*cache

✦ Train"mul.ple"models"simultaneously

Beder*Resource*UBlizaBon

Page 23: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

What*Do*We*See*In*Spark?

✦ 2x*and*5x*increase*in*models*trained/sec*with*batching*

✦ Overhead*from*virtualizaBon,*network,*etc.

Page 24: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

What*Do*We*See*In*Spark?

✦ These*numbers*are*with*vector]matrix*mulBplies

✦ Can*do*beder*when*rewriBng*in*terms*of*matrix]matrix*mulBplies

Page 25: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

A*Tale*Of*3*OpBmizaBons

BeLer"Resource"U.liza.on"

Algorithmic"Speedups"

Improved"Search

Page 26: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Learning*Rate

RegularizaBon

Best"answer

Bandit*Search

✦ Each*point*is*a*trained*model

✦ Some*models*look*bad*early

✦ So*we*give*up*early!

✦ Can*frame*this*as*a*mul.Qarmed"bandit*problem✦ Each*model*is*an*arm

Page 27: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Bandit*Search

✦ Each*point*is*a*trained*model

✦ Some*models*look*bad*early

✦ So*we*give*up*early!

✦ Can*frame*this*as*a*mul.Qarmed"bandit*problem✦ Each*model*is*an*arm

Page 28: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

BeLer"Resource"U.liza.on"

Algorithmic"Speedups"

Improved"Search

A*Tale*Of*3*OpBmizaBons

Page 29: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

What*Search*Method?*GRID NELDER_MEAD POWELL RANDOM SMAC SPEARMINT TPE

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

australianbreast

diabetesfourclass

splice

16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625Method and Maximum Calls

Data

set a

nd V

alid

atio

n Er

ror

Maximum Calls1681256625

Comparison of Search Methods Across Learning Problems

✦ Various*derivaBve]free*opBmizaBon*techniques*✦ Simple*ones*(Grid,*Random)*✦ Classic*DerivaBve]Free*(Nelder]Mead,*Powell’s*method)*✦ Bayesian*(SMAC,*TPE,*Spearmint)

✦ What*should*we*do?*✦ Tried*on*5*datasets,*opBmized*over*4*hyperparameters!

Page 30: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

What*Search*Method?*GRID NELDER_MEAD POWELL RANDOM SMAC SPEARMINT TPE

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

0.00.10.20.30.40.5

australianbreast

diabetesfourclass

splice

16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625 16 81 256 625Method and Maximum Calls

Data

set a

nd V

alid

atio

n Er

ror

Maximum Calls1681256625

Comparison of Search Methods Across Learning Problems

Page 31: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

●●●●●●●●●●●●●●●●

●●●●

●●●●

●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

●●●

●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0.25

0.50

0.75

0 200 400 600 800Time elapsed (m)

Best

Val

idat

ion

Erro

r See

n So

Far

Search Method●

Grid − UnoptimizedRandom − OptimizedTPE − Optimized

Model Convergence Over Time

Pu^ng*It*All*Together✦ First*version*of*MLbase*opBmizer

✦ 30GB*dense*images*(240K*x*16K)

✦ 2*model*families,*5*hyperparams

✦ Baseline:*grid*search

✦ Our*method:*combinaBon*of*✦ Batching*✦ Early*stopping*/*Bandit*✦ Random*or*TPE*

20x"speedup"compared'to'grid'search'15'minutes'vs'5'hours!'"

Page 32: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Does*It*Scale?

✦ 1.5TB*dataset*(1.2M*x*160K)

✦ 128*nodes,*thousands*of*passes*over*data

✦ Tried*32*models*in*15*hours*

✦ Good*answer*aler*11*hours

●●●●●●●● ●●●●●● ●●●●● ●●●●

●●●●

●● ●● ●

0.25

0.50

0.75

5 10Time elapsed (h)

Best

Val

idat

ion

Erro

r See

n So

Far

Convergence of Model Accuracy on 1.5TB Dataset

Page 33: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Future*Work

!Data

Feature*

ExtracBon

Model*

Training

Final**

Model

Automated"ML"Pipeline"Construc.on

Page 34: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Data Image!Parser Normalizer Convolver

sqrt,mean

Zipper

Linear Solver

Symmetric!Rectifier

ident,absident,mean

Global

Pooler

Patch!Extractor

Patch Whitener

KMeans!Clusterer

Feature Extractor

Label!Extractor

Linear!Mapper Model

Test!Data

Label!Extractor

Feature Extractor

Test Error

Error!Computer

No Hyperparameters!A few Hyperparameters!Lotsa Hyperparameters

Slide courtesy of Evan Sparks

Page 35: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Other*Future*Work

✦ Ensembling

✦ Leverage*sampling

✦ Beder*parallelism*for*smaller*datasets

✦ MulBple*hypothesis*tesBng*issues

Page 36: Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

MLOpt:*DeclaraBve*layer*that*aims*to*automate*ML*pipeline*construcBon*

MLI:*API*to*simplify*ML*development*

MLlib:*Spark’s*core*ML*library*

Spark:*Cluster*compuBng*system*designed*for*iteraBve*computaBon

THANKS! QUESTIONS?

baseML

baseML

baseML

baseML

ML base

ML base

ML base

ML base

ML base

www.mlbase.org

MLlib

MLI

MLOpt

Apache*Spark