MLbase - UC Berkeley (ampcamp.berkeley.edu/wp-content/uploads/2013/08/...)



Page 1:

MLbase

Evan Sparks and Ameet Talwalkar, UC Berkeley

Collaborators: Tim Kraska (2), Virginia Smith (1), Xinghao Pan (1), Shivaram Venkataraman (1), Matei Zaharia (1), Rean Griffith (3), John Duchi (1), Joseph Gonzalez (1), Michael Franklin (1), Michael I. Jordan (1)
(1) UC Berkeley   (2) Brown   (3) VMware

www.mlbase.org

Page 2:

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram: an ML Developer supplies ML contract + code and a User issues a declarative ML task to the Master Server (Parser, Optimizer, Executor/Monitoring over an ML Library, with Meta-Data and Statistics), which returns a result (e.g., fn-model & summary); the Master drives DMX Runtimes (LLP/PLP) on the Slaves.]

Page 3:

Problem: ML is difficult for End Users…

✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn't scale…

Wanted: Reliable, Fast, Accurate, Provable

Page 4:

1. Easy, scalable ML development (for ML Developers)
2. User-friendly ML at scale (for End Users)

Along the way, we gain insight into data-intensive computing.

MLbase bridges ML Experts and Systems Experts.

Page 5:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 6:

Matlab Stack (single machine):

    Matlab Interface
    Lapack

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface:
    ✦ Higher-level abstractions for data access / processing
    ✦ More extensive functionality than Lapack
    ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python

Page 7:

MLbase Stack:

    ML Optimizer
    MLI
    MLlib
    Spark
    Runtime(s)

(Analogous to the single-machine Matlab stack: Matlab Interface over Lapack.)

✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
✦ MLI: API / platform for feature extraction and algorithm development
    ✦ Platform independent
✦ ML Optimizer: automates model selection
    ✦ Solves a search problem over feature extractors and algorithms in MLI

Page 8:

Example: MLlib

✦ Goal: classification of a text file
✦ Featurize the data manually
✦ Call MLlib's LR function

MLI version (for comparison; see Page 9):

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
}

MLlib (Spark) version:

def main(args: Array[String]) {
  val sc = new SparkContext("local", "SparkLR")

  // Load data from HDFS
  val data = sc.textFile(args(0)) // RDD[String]

  // User is responsible for formatting/featurizing/normalizing their RDD!
  val featurizedData: RDD[(Double, Array[Double])] = processData(data)

  // Train the model using MLlib.
  val model = new LogisticRegressionLocalRandomSGD()
    .setStepSize(0.1)
    .setNumIterations(50)
    .train(featurizedData)
}

Text classification written against MLI (top) and directly against MLlib/Spark (bottom).

Page 9:

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
    ✦ Embed in a cross-validation routine
    ✦ Use different feature extractors / algorithms
    ✦ Write new ones

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
}

(The MLlib/Spark version of the same task appears on Page 8.)

Page 10:

Example: ML Optimizer

var X = load("text_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies the task
✦ ML Optimizer searches through MLI (a sketch of such a search follows below)

(Analogy: SQL yields a Result; MQL yields a Model.)
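To make the search concrete, here is a minimal sketch of the kind of loop the ML Optimizer could run over MLI feature extractors and algorithms. Every name below (Featurizer, Algorithm, holdoutScore, searchClassify) is invented for illustration; this is not the actual MLbase API.

// Hypothetical illustration of the optimizer's search, not the real MLbase API.
trait Featurizer { def apply(raw: Seq[String]): Seq[Array[Double]] }
trait Algorithm  { def train(xs: Seq[Array[Double]], ys: Seq[Double]): Array[Double] => Double }

// Score one (featurizer, algorithm) pair on a 70/30 holdout split.
def holdoutScore(f: Featurizer, a: Algorithm,
                 raw: Seq[String], ys: Seq[Double]): Double = {
  val split = (0.7 * raw.length).toInt
  val xs = f(raw)
  val model = a.train(xs.take(split), ys.take(split))
  val correct = xs.drop(split).zip(ys.drop(split)).count { case (x, y) => model(x) == y }
  correct.toDouble / (raw.length - split)
}

// Search every combination and return the best model plus its score.
def searchClassify(raw: Seq[String], ys: Seq[Double],
                   featurizers: Seq[Featurizer],
                   algorithms: Seq[Algorithm]): (Array[Double] => Double, Double) = {
  val scored = for (f <- featurizers; a <- algorithms)
               yield ((f, a), holdoutScore(f, a, raw, ys))
  val ((bestF, bestA), bestScore) = scored.maxBy(_._2)
  (bestA.train(bestF(raw), ys), bestScore)   // analogous to (fn-model, summary)
}

A real optimizer would prune this exhaustive search; the sketch only shows the shape of the problem being solved.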

Page 11:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 12:

Lay of the Land
(chart: ease of use vs. performance/scalability)

Matlab, R:
  +  Easy (resembles math, limited set up)
  +  Sufficient for prototyping / writing papers
  -  Ad-hoc, non-scalable scripts
  -  Loss of translation upon re-implementation

GraphLab, VW, Mahout:
  +  Scalable and (sometimes) fast
  +  Existing open-source libraries
  -  Difficult to set up, extend

Page 13:

Examples

[Architecture diagram, as on Page 2.]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: a simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code

Page 14:

Lay of the Land (revisited)
(chart: ease of use vs. performance/scalability; MLI and MLlib are new entries aiming to combine ease of use with performance and scalability)

Matlab, R:
  +  Easy (resembles math, limited set up)
  +  Sufficient for prototyping / writing papers
  -  Ad-hoc, non-scalable scripts
  -  Loss of translation upon re-implementation

GraphLab, VW, Mahout:
  +  Scalable and (sometimes) fast
  +  Existing open-source libraries
  -  Difficult to set up, extend

Page 15:

ML Developer API (MLI)

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW:
val x: MLTable

✦ Abstract interface for arbitrary backends
✦ Common interface to support an optimizer

Page 16:

ML Developer API (MLI)

✦ Shield ML Developers from low-level details
✦ Provide familiar mathematical operators in a distributed setting

✦ Table Computation (MLTable)
    ✦ Flexibility when loading data (heterogeneous, missing values)
    ✦ Common interface for feature extraction / algorithms
    ✦ Supports MapReduce and relational operators
✦ Linear Algebra (MLSubMatrix)
    ✦ Linear algebra on *local* partitions
    ✦ Sparse and dense matrix support
✦ Optimization Primitives (MLSolve)
    ✦ Distributed implementations of common patterns

✦ DFC: ~50 lines of code
✦ ALS: early stopping in 3 lines; < 40 lines total (see the sketch below)
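As an illustration of the "early stopping in 3 lines" claim, here is a minimal sketch of an ALS driver loop with a convergence check. The names (runALS, step, objective) and the Matrix alias are invented for illustration; this is not the actual MLI code.

// Hypothetical ALS driver: alternate U/V updates, stop when the objective plateaus.
type Matrix = Array[Array[Double]]

def runALS(maxIter: Int, tol: Double,
           step: (Matrix, Matrix) => (Matrix, Matrix),  // one U-update then one V-update
           objective: (Matrix, Matrix) => Double,       // value of the ALS objective
           init: (Matrix, Matrix)): (Matrix, Matrix) = {
  var (u, v) = init
  var prev = Double.MaxValue
  for (_ <- 1 to maxIter) {
    val next = step(u, v)
    u = next._1; v = next._2
    val cur = objective(u, v)
    // Early stopping is essentially these three lines: compare, return, update.
    if (prev - cur < tol) return (u, v)
    prev = cur
  }
  (u, v)
}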

Page 17:

MLI Ease of Use

Logistic Regression:
  System          Lines of Code
  Matlab          11
  Vowpal Wabbit   721
  MLI             55

Alternating Least Squares:
  System     Lines of Code
  Matlab     20
  Mahout     865
  GraphLab   383
  MLI        32

Page 18:

MLI/Spark Performance

✦ Walltime: elapsed time to execute a task
✦ Weak scaling: fix the problem size per processor; ideally, constant walltime as the cluster grows
✦ Strong scaling: fix the total problem size; ideally, linear speedup as the cluster grows (both metrics are written out below)
✦ EC2 experiments: m2.4xlarge instances, clusters of up to 32 machines
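In symbols (a standard formulation of these metrics, added here for clarity; not from the slides): let $T(p, n)$ be the walltime on $p$ machines for problem size $n$. Then

$$\text{weak-scaling efficiency}(p) = \frac{T(1, n)}{T(p, \, p \cdot n)}, \qquad \text{strong-scaling speedup}(p) = \frac{T(1, n)}{T(p, n)},$$

so ideal weak scaling keeps the efficiency at 1 (flat walltime as machines and data grow together), and ideal strong scaling gives a speedup of $p$.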

Page 19:

Logistic Regression: Weak Scaling

✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling across systems
✦ MLI/Spark within a factor of 2 of VW's walltime

[Bar charts: walltime (s) for MLbase (MLI/Spark), VW, and Matlab at dataset sizes n = 6K to 200K, d = 160K.]

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase, VW, Matlab). Fig. 6: Weak scaling for logistic regression (relative walltime vs. number of machines; MLbase, VW, ideal). Fig. 7: Walltime for strong scaling for logistic regression (1 to 32 machines). Fig. 8: Strong scaling for logistic regression (speedup vs. number of machines; MLbase, VW, ideal).]

(Paper excerpt:) …with respect to computation. In practice, we see comparable scaling results as more machines are added.

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), we can see that our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K-point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, and never twice as fast. These times do not include the time spent preparing data for input to VW, which was significant, but we expect that this would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and it exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS

  System       Lines of Code
  MLbase       32
  GraphLab     383
  Mahout       865
  MATLAB-Mex   124
  MATLAB       20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda \big( \|U\|_F^2 + \|V\|_F^2 \big). \qquad (2)$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = \big( V_{\Omega_i}^T V_{\Omega_i} + \lambda I_k \big)^{-1} V_{\Omega_i}^T M_{i\Omega_i}^T,$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of observed entries in the $i$-th row of $M$.

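For concreteness, here is a minimal single-machine sketch of this closed-form row update using the Breeze linear algebra library. It illustrates the math only; it is not the distributed MLI/MLlib implementation, and the function name and argument layout are invented.

import breeze.linalg.{DenseMatrix, DenseVector}

// u_i* = (V_Oi^T V_Oi + lambda I_k)^(-1) V_Oi^T m_i, where
//   vSub: the rows v_j of V with j in Omega_i   (|Omega_i| x k)
//   mSub: the observed entries M_ij for j in Omega_i
def alsRowUpdate(vSub: DenseMatrix[Double],
                 mSub: DenseVector[Double],
                 lambda: Double): DenseVector[Double] = {
  val k = vSub.cols
  val gram = (vSub.t * vSub) + (DenseMatrix.eye[Double](k) * lambda)
  gram \ (vSub.t * mSub)  // solve the k-by-k system rather than forming the inverse
}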

Page 20:

Logistic Regression: Strong Scaling

✦ Fixed dataset: 50K images, 160K dense features
✦ MLI/Spark exhibits better scaling properties
✦ MLI/Spark is faster than VW with 16 and 32 machines

[Figures 5-8 and the accompanying paper excerpt (ALS discussion, Table II) repeat here; see Page 19.]

Page 21:

ALS: Walltime

✦ Dataset: scaled version of the Netflix data (9x in size)
✦ Cluster: 9 machines
✦ MLI/Spark is an order of magnitude faster than Mahout
✦ MLI/Spark is within a factor of 2 of GraphLab

  System      Walltime (seconds)
  Matlab      15443
  Mahout      4206
  GraphLab    291
  MLI/Spark   481

Page 22:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 23:

MLI Functionality

Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, DFC
Clustering: K-Means, DP-Means
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, Naive Bayes, Decision Trees
Optimization Primitives: Parallel Gradient, Local SGD, L-BFGS, ADMM, Adagrad
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature normalization
ML Tools: Cross Validation, Evaluation Metrics

Page 24:

MLbase Stack Status

Stack (top to bottom): ML Optimizer (End User), MLI (ML Developer), MLlib, Spark
Goal 1: Summer Release
Goal 2: Winter Release

[Architecture diagram, as on Page 2.]

Page 25:

Future Directions

✦ Identify a minimal set of ML operators
✦ Expose internals of ML algorithms to the optimizer
✦ Plug-ins for Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities:
    ✦ Time-series algorithms
    ✦ Graphical models
    ✦ Advanced optimization (e.g., asynchronous computation)
    ✦ Online updates
    ✦ Sampling for efficiency

Page 26:

Outline: Vision, MLI Details, Current Status, ML Workflow

Page 27:

Typical Data Analysis Workflow

    Obtain / Load Raw Data   (Spark, MLI)
    Data Exploration         (Spark, [MLI])
    Feature Extraction       (MLI)
    Learning                 (MLI, MLlib)
    Evaluation               (MLI)
    Deployment               (Scala)

Adapted  from  slides  by  Ariel  Kleiner

Page 28:

Binary Classification

Goal: Learn a mapping from entities to discrete labels

Example: Spam Classification
✦ Entities are emails
✦ Labels are {spam, not-spam}
✦ Given past labeled emails, we want to predict whether a new email is spam or not-spam

From Foundations of Machine Learning (Mohri, Rostamizadeh, Talwalkar), section 1.2, "Definitions and terminology":

[Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.]

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm...

Adapted  from  slides  by  Ariel  Kleiner

Page 29:

Binary Classification

Goal: Learn a mapping from entities to discrete labels

Other Examples:
✦ Click (and clickthrough rate) prediction
✦ Fraud detection
✦ Face detection
✦ Exercise: "ARTS" vs "LIFE" articles on Wikipedia (real data)

[Textbook excerpt repeats here; see Page 28.]

Adapted  from  slides  by  Ariel  Kleiner

Page 30:

Classification Pipeline

[Pipeline diagram: full dataset → (training set, test set); training set → classifier; classifier + test set → accuracy; classifier + new entity → prediction]

1. Randomly split the full data into disjoint subsets
2. Featurize the data
3. Use the training set to learn a classifier
4. Evaluate the classifier on the test set (avoid overfitting)
5. Use the classifier to predict in the wild

(A code sketch of these steps follows below.)

Adapted  from  slides  by  Ariel  Kleiner
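A minimal sketch of steps 1-5 in plain Scala. The featurize function and the toy nearest-centroid classifier below are invented stand-ins for the MLI/MLlib calls, kept trivial so the sketch is self-contained and runnable.

import scala.util.Random

// Toy featurizer: two numeric features per document (invented stand-in).
def featurize(doc: String): Array[Double] =
  Array(doc.length.toDouble, doc.count(_ == '!').toDouble)

// Toy nearest-centroid classifier (invented stand-in for an MLlib algorithm).
def trainClassifier(data: Seq[(Double, Array[Double])]): Array[Double] => Double = {
  val centroids = data.groupBy(_._1).map { case (label, pts) =>
    val dim = pts.head._2.length
    label -> (0 until dim).map(i => pts.map(_._2(i)).sum / pts.size).toArray
  }
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  x => centroids.minBy { case (_, c) => dist(x, c) }._1
}

def pipeline(raw: Seq[(Double, String)]): Unit = {
  // 1. Randomly split the full data into disjoint subsets.
  val (trainRaw, testRaw) = Random.shuffle(raw).splitAt((0.8 * raw.length).toInt)
  // 2. Featurize the data.
  val train = trainRaw.map { case (y, doc) => (y, featurize(doc)) }
  val test  = testRaw.map  { case (y, doc) => (y, featurize(doc)) }
  // 3. Learn a classifier from the training set only.
  val classifier = trainClassifier(train)
  // 4. Evaluate on the held-out test set.
  val accuracy = test.count { case (y, x) => classifier(x) == y }.toDouble / test.size
  // 5. Predict on a new entity.
  val prediction = classifier(featurize("Eliminate your debt..."))
  println(s"accuracy = $accuracy, prediction = $prediction")
}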

Page 31:

E.g., Spam Classification

From: [email protected]
"Eliminate your debt by giving us your money..."
→ spam

From: [email protected]
"Hi, it's been a while! How are you? ..."
→ not-spam

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner

Page 32:

Featurization

✦ Most classifiers require numeric descriptions of entities
✦ Featurization: transform each entity into a vector of real numbers
    ✦ Opportunity to incorporate domain knowledge
    ✦ Useful even when the original data is already numeric

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner

Page 33:

E.g., "Bag of Words"

✦ Entities are documents
✦ Build a Vocabulary

From: [email protected]
"Eliminate your debt by giving us your money..."

From: [email protected]
"Hi, it's been a while! How are you? ..."

Vocabulary: been, debt, eliminate, giving, how, it's, money, while

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner

Page 34:

E.g., "Bag of Words" (continued)

✦ Entities are documents
✦ Build a Vocabulary
✦ Derive feature vectors from the Vocabulary
✦ Exercise: we'll use bigrams
(a code sketch of unigram bag-of-words follows below)

From: [email protected]
"Eliminate your debt by giving us your money..."

Feature vector over the Vocabulary:
  been       0
  debt       1
  eliminate  1
  giving     1
  how        0
  it's       0
  money      1
  while      0

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner
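A minimal sketch of binary bag-of-words featurization in plain Scala, as an illustration of the idea only; it is not the MLI nGrams/tfIdf pipeline the exercise uses. (A real pipeline would also drop stopwords such as "your" and "by", which is why the slide's vocabulary is smaller than what this naive version produces.)

// Build a vocabulary from a corpus, then map each document to a 0/1 vector.
def tokenize(doc: String): Seq[String] =
  doc.toLowerCase.split("[^a-z']+").filter(_.nonEmpty).toSeq

def buildVocabulary(corpus: Seq[String]): Seq[String] =
  corpus.flatMap(tokenize).distinct.sorted

def bagOfWords(doc: String, vocab: Seq[String]): Array[Double] = {
  val present = tokenize(doc).toSet
  vocab.map(w => if (present(w)) 1.0 else 0.0).toArray
}

// Example usage on the two messages above:
val corpus = Seq("Eliminate your debt by giving us your money...",
                 "Hi, it's been a while! How are you? ...")
val vocab  = buildVocabulary(corpus)
val vector = bagOfWords(corpus.head, vocab)  // 1.0 at debt/eliminate/giving/money (and the stopwords)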

Page 35:

Support Vector Machines (SVMs)

Credit: Foundations of Machine Learning; Mohri, Rostamizadeh, Talwalkar

✦ "Max-Margin": find the linear separator with the largest separation between the two classes
✦ Extensions:
    ✦ non-separable setting
    ✦ non-linear classifiers (kernels)

[Pipeline diagram, as on Page 30.]

From the book:

[Figure 4.1: Two possible separating hyperplanes. The right-hand side figure shows a hyperplane that maximizes the margin.]

4.2 SVMs: separable case

In this section, we assume that the training sample S can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as illustrated by the left panel of figure 4.1. But there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The solution returned by the SVM algorithm is the hyperplane with the maximum margin, or distance to the closest points, and is thus known as the maximum-margin hyperplane. The right panel of figure 4.1 illustrates that choice.

We will present later in this chapter a margin theory that provides a strong justification for this solution. We can observe already, however, that the SVM solution can also be viewed as the "safest" choice in the following sense: a test point is classified correctly by a separating hyperplane with margin $\rho$ even when it falls within a distance $\rho$ of the training samples sharing the same label; for the SVM solution, $\rho$ is the maximum margin and thus the "safest" value.

4.2.1 Primal optimization problem

We now derive the equations and optimization problem that define the SVM solution. The general equation of a hyperplane in $\mathbb{R}^N$ is

$$w \cdot x + b = 0, \qquad (4.3)$$

where $w \in \mathbb{R}^N$ is a non-zero vector normal to the hyperplane and $b \in \mathbb{R}$ a scalar. Note that this definition of a hyperplane is invariant to non-zero scalar multiplication. Hence, for a hyperplane that does not pass through any sample point, we can scale $w$ and $b$ appropriately such that $\min_{(x,y) \in S} |w \cdot x + b| = 1$.
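With that normalization the margin is $1/\|w\|$, so maximizing the margin is equivalent to the standard separable-case primal problem (stated here for completeness; it is the step the excerpt leads up to):

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, m.$$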

Page 36:

Model Evaluation

✦ The test set simulates performance on new entities
    ✦ Performance on training data is overly optimistic!
    ✦ "Overfitting"; "Generalization"
✦ Various metrics for quality; accuracy is most common
✦ Evaluation process:
    ✦ Train on the training set (don't expose the test set to the classifier)
    ✦ Make predictions on the test set (ignoring the test labels)
    ✦ Compute the fraction of correct predictions on the test set
✦ Other more sophisticated evaluation methods exist, e.g., cross-validation (see the sketch below)

[Pipeline diagram, as on Page 30.]

Adapted  from  slides  by  Ariel  Kleiner
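A minimal sketch of k-fold cross-validation in generic Scala; the train/correct signatures are invented for illustration, not MLI's API. Each fold takes a turn as the held-out test set while the model is trained on the remaining k-1 folds, and the per-fold accuracies are averaged.

// k-fold cross-validation over data of type D with models of type M.
// `train` builds a model from data; `correct` checks one example against a model.
def crossValidate[D, M](data: Seq[D], k: Int,
                        train: Seq[D] => M,
                        correct: (M, D) => Boolean): Double = {
  val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
  val accuracies = folds.map { heldOut =>
    // Never expose the held-out fold to training.
    val model = train(folds.filterNot(_ eq heldOut).flatten)
    heldOut.count(ex => correct(model, ex)).toDouble / heldOut.size
  }
  accuracies.sum / accuracies.size
}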

Page 37:

Contribu>ons  encouraged!

www.mlbase.org

biglearn.org