MLbase - UC Berkeley (ampcamp.berkeley.edu/wp-content/uploads/2013/08/...)



Page 1:

MLbase

Evan Sparks and Ameet Talwalkar, UC Berkeley

Collaborators: Tim Kraska (2), Virginia Smith (1), Xinghao Pan (1), Shivaram Venkataraman (1), Matei Zaharia (1), Rean Griffith (3), John Duchi (1), Joseph Gonzalez (1), Michael Franklin (1), Michael I. Jordan (1)
(1) UC Berkeley   (2) Brown   (3) VMware

www.mlbase.org

Page 2:

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram: an ML Developer supplies ML contract + code and a User issues a declarative ML task to the Master Server (Parser, Optimizer, Executor/Monitoring over an ML Library, with Meta-Data and Statistics), which returns a result (e.g., fn-model & summary); the Master drives DMX Runtimes (LLP/PLP) on the Slaves.]

Page 3:

Problem: ML is difficult for End Users…

✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn't scale…

Wanted: Reliable, Fast, Accurate, Provable

Page 4:

1. Easy, scalable ML development (for ML Developers)
2. User-friendly ML at scale (for End Users)

Along the way, we gain insight into data-intensive computing.

MLbase bridges ML Experts and Systems Experts.

Page 5:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 6:

Matlab Stack (single machine):

    Matlab Interface
    Lapack

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface:
    ✦ Higher-level abstractions for data access / processing
    ✦ More extensive functionality than Lapack
    ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python

Page 7:

MLbase Stack:

    ML Optimizer
    MLI
    MLlib
    Spark
    Runtime(s)

(Analogous to the single-machine Matlab stack: Matlab Interface over Lapack.)

✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
✦ MLI: API / platform for feature extraction and algorithm development
    ✦ Platform independent
✦ ML Optimizer: automates model selection
    ✦ Solves a search problem over feature extractors and algorithms in MLI

Page 8:

Example: MLlib

✦ Goal: classification of a text file
✦ Featurize the data manually
✦ Call MLlib's LR function

MLI version (for comparison; see Page 9):

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
}

MLlib (Spark) version:

def main(args: Array[String]) {
  val sc = new SparkContext("local", "SparkLR")

  // Load data from HDFS
  val data = sc.textFile(args(0)) // RDD[String]

  // User is responsible for formatting/featurizing/normalizing their RDD!
  val featurizedData: RDD[(Double, Array[Double])] = processData(data)

  // Train the model using MLlib.
  val model = new LogisticRegressionLocalRandomSGD()
    .setStepSize(0.1)
    .setNumIterations(50)
    .train(featurizedData)
}

Text classification written against MLI (top) and directly against MLlib/Spark (bottom).

Page 9:

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
    ✦ Embed in a cross-validation routine
    ✦ Use different feature extractors / algorithms
    ✦ Write new ones

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
}

(The MLlib/Spark version of the same task appears on Page 8.)

Page 10:

Example: ML Optimizer

var X = load("text_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies the task
✦ ML Optimizer searches through MLI (a sketch of such a search follows below)

(Analogy: SQL yields a Result; MQL yields a Model.)
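To make the search concrete, here is a minimal sketch of the kind of loop the ML Optimizer could run over MLI feature extractors and algorithms. Every name below (Featurizer, Algorithm, holdoutScore, searchClassify) is invented for illustration; this is not the actual MLbase API.

// Hypothetical illustration of the optimizer's search, not the real MLbase API.
trait Featurizer { def apply(raw: Seq[String]): Seq[Array[Double]] }
trait Algorithm  { def train(xs: Seq[Array[Double]], ys: Seq[Double]): Array[Double] => Double }

// Score one (featurizer, algorithm) pair on a 70/30 holdout split.
def holdoutScore(f: Featurizer, a: Algorithm,
                 raw: Seq[String], ys: Seq[Double]): Double = {
  val split = (0.7 * raw.length).toInt
  val xs = f(raw)
  val model = a.train(xs.take(split), ys.take(split))
  val correct = xs.drop(split).zip(ys.drop(split)).count { case (x, y) => model(x) == y }
  correct.toDouble / (raw.length - split)
}

// Search every combination and return the best model plus its score.
def searchClassify(raw: Seq[String], ys: Seq[Double],
                   featurizers: Seq[Featurizer],
                   algorithms: Seq[Algorithm]): (Array[Double] => Double, Double) = {
  val scored = for (f <- featurizers; a <- algorithms)
               yield ((f, a), holdoutScore(f, a, raw, ys))
  val ((bestF, bestA), bestScore) = scored.maxBy(_._2)
  (bestA.train(bestF(raw), ys), bestScore)   // analogous to (fn-model, summary)
}

A real optimizer would prune this exhaustive search; the sketch only shows the shape of the problem being solved.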

Page 11:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 12:

Lay of the Land
(chart: ease of use vs. performance/scalability)

Matlab, R:
  +  Easy (resembles math, limited set up)
  +  Sufficient for prototyping / writing papers
  -  Ad-hoc, non-scalable scripts
  -  Loss of translation upon re-implementation

GraphLab, VW, Mahout:
  +  Scalable and (sometimes) fast
  +  Existing open-source libraries
  -  Difficult to set up, extend

Page 13:

Examples

[Architecture diagram, as on Page 2.]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: a simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code

Page 14:

Lay of the Land (revisited)
(chart: ease of use vs. performance/scalability; MLI and MLlib are new entries aiming to combine ease of use with performance and scalability)

Matlab, R:
  +  Easy (resembles math, limited set up)
  +  Sufficient for prototyping / writing papers
  -  Ad-hoc, non-scalable scripts
  -  Loss of translation upon re-implementation

GraphLab, VW, Mahout:
  +  Scalable and (sometimes) fast
  +  Existing open-source libraries
  -  Difficult to set up, extend

Page 15:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW:
val x: MLTable

✦ Abstract interface for arbitrary backends
✦ Common interface to support an optimizer

Page 16:

ML Developer API (MLI)

✦ Shield ML Developers from low-level details
✦ Provide familiar mathematical operators in a distributed setting

✦ Table Computation (MLTable)
    ✦ Flexibility when loading data (heterogeneous, missing values)
    ✦ Common interface for feature extraction / algorithms
    ✦ Supports MapReduce and relational operators
✦ Linear Algebra (MLSubMatrix)
    ✦ Linear algebra on *local* partitions
    ✦ Sparse and dense matrix support
✦ Optimization Primitives (MLSolve)
    ✦ Distributed implementations of common patterns

✦ DFC: ~50 lines of code
✦ ALS: early stopping in 3 lines; < 40 lines total (see the sketch below)
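As an illustration of the "early stopping in 3 lines" claim, here is a minimal sketch of an ALS driver loop with a convergence check. The names (runALS, step, objective) and the Matrix alias are invented for illustration; this is not the actual MLI code.

// Hypothetical ALS driver: alternate U/V updates, stop when the objective plateaus.
type Matrix = Array[Array[Double]]

def runALS(maxIter: Int, tol: Double,
           step: (Matrix, Matrix) => (Matrix, Matrix),  // one U-update then one V-update
           objective: (Matrix, Matrix) => Double,       // value of the ALS objective
           init: (Matrix, Matrix)): (Matrix, Matrix) = {
  var (u, v) = init
  var prev = Double.MaxValue
  for (_ <- 1 to maxIter) {
    val next = step(u, v)
    u = next._1; v = next._2
    val cur = objective(u, v)
    // Early stopping is essentially these three lines: compare, return, update.
    if (prev - cur < tol) return (u, v)
    prev = cur
  }
  (u, v)
}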

Page 17:

MLI Ease of Use

Logistic Regression:
  System          Lines of Code
  Matlab          11
  Vowpal Wabbit   721
  MLI             55

Alternating Least Squares:
  System     Lines of Code
  Matlab     20
  Mahout     865
  GraphLab   383
  MLI        32

Page 18:

MLI/Spark Performance

✦ Walltime: elapsed time to execute a task
✦ Weak scaling: fix the problem size per processor; ideally, constant walltime as the cluster grows
✦ Strong scaling: fix the total problem size; ideally, linear speedup as the cluster grows (both metrics are written out below)
✦ EC2 experiments: m2.4xlarge instances, clusters of up to 32 machines
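In symbols (a standard formulation of these metrics, added here for clarity; not from the slides): let $T(p, n)$ be the walltime on $p$ machines for problem size $n$. Then

$$\text{weak-scaling efficiency}(p) = \frac{T(1, n)}{T(p, \, p \cdot n)}, \qquad \text{strong-scaling speedup}(p) = \frac{T(1, n)}{T(p, n)},$$

so ideal weak scaling keeps the efficiency at 1 (flat walltime as machines and data grow together), and ideal strong scaling gives a speedup of $p$.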

Page 19:

Logistic Regression: Weak Scaling

✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling across systems
✦ MLI/Spark within a factor of 2 of VW's walltime

[Bar charts: walltime (s) for MLbase (MLI/Spark), VW, and Matlab at dataset sizes n = 6K to 200K, d = 160K.]

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase, VW, Matlab). Fig. 6: Weak scaling for logistic regression (relative walltime vs. number of machines; MLbase, VW, ideal). Fig. 7: Walltime for strong scaling for logistic regression (1 to 32 machines). Fig. 8: Strong scaling for logistic regression (speedup vs. number of machines; MLbase, VW, ideal).]

(Paper excerpt:) …with respect to computation. In practice, we see comparable scaling results as more machines are added.

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), we can see that our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K-point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, and never twice as fast. These times do not include the time spent preparing data for input to VW, which was significant, but we expect that this would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and it exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS

  System       Lines of Code
  MLbase       32
  GraphLab     383
  Mahout       865
  MATLAB-Mex   124
  MATLAB       20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda \big( \|U\|_F^2 + \|V\|_F^2 \big). \qquad (2)$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = \big( V_{\Omega_i}^T V_{\Omega_i} + \lambda I_k \big)^{-1} V_{\Omega_i}^T M_{i\Omega_i}^T,$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of observed entries in the $i$-th row of $M$.

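For concreteness, here is a minimal single-machine sketch of this closed-form row update using the Breeze linear algebra library. It illustrates the math only; it is not the distributed MLI/MLlib implementation, and the function name and argument layout are invented.

import breeze.linalg.{DenseMatrix, DenseVector}

// u_i* = (V_Oi^T V_Oi + lambda I_k)^(-1) V_Oi^T m_i, where
//   vSub: the rows v_j of V with j in Omega_i   (|Omega_i| x k)
//   mSub: the observed entries M_ij for j in Omega_i
def alsRowUpdate(vSub: DenseMatrix[Double],
                 mSub: DenseVector[Double],
                 lambda: Double): DenseVector[Double] = {
  val k = vSub.cols
  val gram = (vSub.t * vSub) + (DenseMatrix.eye[Double](k) * lambda)
  gram \ (vSub.t * mSub)  // solve the k-by-k system rather than forming the inverse
}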

Page 20:

Logistic Regression: Strong Scaling

✦ Fixed dataset: 50K images, 160K dense features
✦ MLI/Spark exhibits better scaling properties
✦ MLI/Spark is faster than VW with 16 and 32 machines

[Figures 5-8 and the accompanying paper excerpt (ALS discussion, Table II) repeat here; see Page 19.]

Page 21:

ALS: Walltime

✦ Dataset: scaled version of the Netflix data (9x in size)
✦ Cluster: 9 machines
✦ MLI/Spark is an order of magnitude faster than Mahout
✦ MLI/Spark is within a factor of 2 of GraphLab

  System      Walltime (seconds)
  Matlab      15443
  Mahout      4206
  GraphLab    291
  MLI/Spark   481

Page 22:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 23:

MLI Functionality

Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, DFC
Clustering: K-Means, DP-Means
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, Naive Bayes, Decision Trees
Optimization Primitives: Parallel Gradient, Local SGD, L-BFGS, ADMM, Adagrad
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature normalization
ML Tools: Cross Validation, Evaluation Metrics

Page 24:

MLbase Stack Status

Stack (top to bottom): ML Optimizer (End User), MLI (ML Developer), MLlib, Spark
Goal 1: Summer Release
Goal 2: Winter Release

[Architecture diagram, as on Page 2.]

Page 25:

Future Directions

✦ Identify a minimal set of ML operators
✦ Expose internals of ML algorithms to the optimizer
✦ Plug-ins for Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities:
    ✦ Time-series algorithms
    ✦ Graphical models
    ✦ Advanced optimization (e.g., asynchronous computation)
    ✦ Online updates
    ✦ Sampling for efficiency

Page 26:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 27:

Typical Data Analysis Workflow

    Obtain / Load Raw Data   (Spark, MLI)
    Data Exploration         (Spark, [MLI])
    Feature Extraction       (MLI)
    Learning                 (MLI, MLlib)
    Evaluation               (MLI)
    Deployment               (Scala)

Adapted  from  slides  by  Ariel  Kleiner

Page 28:

Binary Classification

Goal: Learn a mapping from entities to discrete labels

Example: Spam Classification
✦ Entities are emails
✦ Labels are {spam, not-spam}
✦ Given past labeled emails, we want to predict whether a new email is spam or not-spam

From Foundations of Machine Learning (Mohri, Rostamizadeh, Talwalkar), section 1.2, "Definitions and terminology":

[Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.]

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm...

Adapted  from  slides  by  Ariel  Kleiner

Page 29:

Binary Classification

Goal: Learn a mapping from entities to discrete labels

Other Examples:
✦ Click (and clickthrough rate) prediction
✦ Fraud detection
✦ Face detection
✦ Exercise: "ARTS" vs "LIFE" articles on Wikipedia (real data)

[Textbook excerpt repeats here; see Page 28.]

Adapted  from  slides  by  Ariel  Kleiner

Page 30:

Classification Pipeline

[Pipeline diagram: full dataset → (training set, test set); training set → classifier; classifier + test set → accuracy; classifier + new entity → prediction]

1. Randomly split the full data into disjoint subsets
2. Featurize the data
3. Use the training set to learn a classifier
4. Evaluate the classifier on the test set (avoid overfitting)
5. Use the classifier to predict in the wild

(A code sketch of these steps follows below.)

Adapted  from  slides  by  Ariel  Kleiner
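A minimal sketch of steps 1-5 in plain Scala. The featurize function and the toy nearest-centroid classifier below are invented stand-ins for the MLI/MLlib calls, kept trivial so the sketch is self-contained and runnable.

import scala.util.Random

// Toy featurizer: two numeric features per document (invented stand-in).
def featurize(doc: String): Array[Double] =
  Array(doc.length.toDouble, doc.count(_ == '!').toDouble)

// Toy nearest-centroid classifier (invented stand-in for an MLlib algorithm).
def trainClassifier(data: Seq[(Double, Array[Double])]): Array[Double] => Double = {
  val centroids = data.groupBy(_._1).map { case (label, pts) =>
    val dim = pts.head._2.length
    label -> (0 until dim).map(i => pts.map(_._2(i)).sum / pts.size).toArray
  }
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  x => centroids.minBy { case (_, c) => dist(x, c) }._1
}

def pipeline(raw: Seq[(Double, String)]): Unit = {
  // 1. Randomly split the full data into disjoint subsets.
  val (trainRaw, testRaw) = Random.shuffle(raw).splitAt((0.8 * raw.length).toInt)
  // 2. Featurize the data.
  val train = trainRaw.map { case (y, doc) => (y, featurize(doc)) }
  val test  = testRaw.map  { case (y, doc) => (y, featurize(doc)) }
  // 3. Learn a classifier from the training set only.
  val classifier = trainClassifier(train)
  // 4. Evaluate on the held-out test set.
  val accuracy = test.count { case (y, x) => classifier(x) == y }.toDouble / test.size
  // 5. Predict on a new entity.
  val prediction = classifier(featurize("Eliminate your debt..."))
  println(s"accuracy = $accuracy, prediction = $prediction")
}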

Page 31:

E.g., Spam Classification

From: [email protected]
"Eliminate your debt by giving us your money..."
→ spam

From: [email protected]
"Hi, it's been a while! How are you? ..."
→ not-spam

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner

Page 32:

Featurization

✦ Most classifiers require numeric descriptions of entities
✦ Featurization: transform each entity into a vector of real numbers
    ✦ Opportunity to incorporate domain knowledge
    ✦ Useful even when the original data is already numeric

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner

Page 33:

E.g., "Bag of Words"

✦ Entities are documents
✦ Build a Vocabulary

From: [email protected]
"Eliminate your debt by giving us your money..."

From: [email protected]
"Hi, it's been a while! How are you? ..."

Vocabulary: been, debt, eliminate, giving, how, it's, money, while

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner

Page 34:

E.g., "Bag of Words" (continued)

✦ Entities are documents
✦ Build a Vocabulary
✦ Derive feature vectors from the Vocabulary
✦ Exercise: we'll use bigrams
(a code sketch of unigram bag-of-words follows below)

From: [email protected]
"Eliminate your debt by giving us your money..."

Feature vector over the Vocabulary:
  been       0
  debt       1
  eliminate  1
  giving     1
  how        0
  it's       0
  money      1
  while      0

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner
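A minimal sketch of binary bag-of-words featurization in plain Scala, as an illustration of the idea only; it is not the MLI nGrams/tfIdf pipeline the exercise uses. (A real pipeline would also drop stopwords such as "your" and "by", which is why the slide's vocabulary is smaller than what this naive version produces.)

// Build a vocabulary from a corpus, then map each document to a 0/1 vector.
def tokenize(doc: String): Seq[String] =
  doc.toLowerCase.split("[^a-z']+").filter(_.nonEmpty).toSeq

def buildVocabulary(corpus: Seq[String]): Seq[String] =
  corpus.flatMap(tokenize).distinct.sorted

def bagOfWords(doc: String, vocab: Seq[String]): Array[Double] = {
  val present = tokenize(doc).toSet
  vocab.map(w => if (present(w)) 1.0 else 0.0).toArray
}

// Example usage on the two messages above:
val corpus = Seq("Eliminate your debt by giving us your money...",
                 "Hi, it's been a while! How are you? ...")
val vocab  = buildVocabulary(corpus)
val vector = bagOfWords(corpus.head, vocab)  // 1.0 at debt/eliminate/giving/money (and the stopwords)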

Page 35:

Support Vector Machines (SVMs)

Credit: Foundations of Machine Learning; Mohri, Rostamizadeh, Talwalkar

✦ "Max-Margin": find the linear separator with the largest separation between the two classes
✦ Extensions:
    ✦ non-separable setting
    ✦ non-linear classifiers (kernels)

[Pipeline diagram, as on Page 30.]

From the book:

[Figure 4.1: Two possible separating hyperplanes. The right-hand side figure shows a hyperplane that maximizes the margin.]

4.2 SVMs: separable case

In this section, we assume that the training sample S can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as illustrated by the left panel of figure 4.1. But there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The solution returned by the SVM algorithm is the hyperplane with the maximum margin, or distance to the closest points, and is thus known as the maximum-margin hyperplane. The right panel of figure 4.1 illustrates that choice.

We will present later in this chapter a margin theory that provides a strong justification for this solution. We can observe already, however, that the SVM solution can also be viewed as the "safest" choice in the following sense: a test point is classified correctly by a separating hyperplane with margin $\rho$ even when it falls within a distance $\rho$ of the training samples sharing the same label; for the SVM solution, $\rho$ is the maximum margin and thus the "safest" value.

4.2.1 Primal optimization problem

We now derive the equations and optimization problem that define the SVM solution. The general equation of a hyperplane in $\mathbb{R}^N$ is

$$w \cdot x + b = 0, \qquad (4.3)$$

where $w \in \mathbb{R}^N$ is a non-zero vector normal to the hyperplane and $b \in \mathbb{R}$ a scalar. Note that this definition of a hyperplane is invariant to non-zero scalar multiplication. Hence, for a hyperplane that does not pass through any sample point, we can scale $w$ and $b$ appropriately such that $\min_{(x,y) \in S} |w \cdot x + b| = 1$.
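With that normalization the margin is $1/\|w\|$, so maximizing the margin is equivalent to the standard separable-case primal problem (stated here for completeness; it is the step the excerpt leads up to):

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, m.$$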

Page 36:

Model Evaluation

✦ The test set simulates performance on new entities
    ✦ Performance on training data is overly optimistic!
    ✦ "Overfitting"; "Generalization"
✦ Various metrics for quality; accuracy is most common
✦ Evaluation process:
    ✦ Train on the training set (don't expose the test set to the classifier)
    ✦ Make predictions on the test set (ignoring the test labels)
    ✦ Compute the fraction of correct predictions on the test set
✦ Other more sophisticated evaluation methods exist, e.g., cross-validation (see the sketch below)

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner
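A minimal sketch of k-fold cross-validation in generic Scala; the train/correct signatures are invented for illustration, not MLI's API. Each fold takes a turn as the held-out test set while the model is trained on the remaining k-1 folds, and the per-fold accuracies are averaged.

// k-fold cross-validation over data of type D with models of type M.
// `train` builds a model from data; `correct` checks one example against a model.
def crossValidate[D, M](data: Seq[D], k: Int,
                        train: Seq[D] => M,
                        correct: (M, D) => Boolean): Double = {
  val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
  val accuracies = folds.map { heldOut =>
    // Never expose the held-out fold to training.
    val model = train(folds.filterNot(_ eq heldOut).flatten)
    heldOut.count(ex => correct(model, ex)).toDouble / heldOut.size
  }
  accuracies.sum / accuracies.size
}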

Page 37:

Contribu>ons  encouraged!

www.mlbase.org

biglearn.org