baseML
baseML
baseML
baseML
ML base
ML base
ML base
ML base
ML baseCollaborators: Tim Kraska2, Virginia Smith1, Xinghao Pan1, Shivaram Venkataraman1, Matei Zaharia1, Rean Griffith3, John Duchi1, Joseph Gonzalez1,
Michael Franklin1, Michael I. Jordan1
1UC Berkeley 2Brown 3VMware
www.mlbase.org
Evan Sparks and Ameet TalwalkarUC Berkeley
UC Berkeley
Problem: Scalable implementa>ons difficult for ML Developers…
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
Problem: Scalable implementa>ons difficult for ML Developers…
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
Problem: Scalable implementa>ons difficult for ML Developers…
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
Too many algorithms…
Problem: ML is difficultfor End Users…
Too many algorithms…
Too many knobs…
Problem: ML is difficultfor End Users…
Too many algorithms…
Too many knobs…
Problem: ML is difficultfor End Users…
Difficult to debug…
Too many algorithms…
Too many knobs…
Problem: ML is difficultfor End Users…
Difficult to debug…
Doesn’t scale…
Too many algorithms…
Too many knobs…
Problem: ML is difficultfor End Users…
Difficult to debug…
Reliable
Fast
Accurate
Provable
Doesn’t scale…
ML Experts Systems ExpertsMLbase
1. Easy scalable ML development (ML Developers)2. User-‐friendly ML at scale (End Users)
ML Experts Systems ExpertsMLbase
1. Easy scalable ML development (ML Developers)2. User-‐friendly ML at scale (End Users)
Along the way, we gain insight into data intensive compuJng
ML Experts Systems ExpertsMLbase
VisionMLI DetailsCurrent StatusML Workflow
Matlab Stack
Matlab Stack
Single Machine
Lapack
Matlab Stack
Single Machine
✦ Lapack: low-‐level Fortran linear algebra library
Lapack
Matlab Interface
Matlab Stack
Single Machine
✦ Lapack: low-‐level Fortran linear algebra library✦ Matlab Interface
✦ Higher-‐level abstracVons for data access / processing✦ More extensive funcVonality than Lapack✦ Leverages Lapack whenever possible
Lapack
Matlab Interface
Matlab Stack
Single Machine
✦ Lapack: low-‐level Fortran linear algebra library✦ Matlab Interface
✦ Higher-‐level abstracVons for data access / processing✦ More extensive funcVonality than Lapack✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python
MLbase Stack
Lapack
Matlab Interface
Single Machine
MLbase Stack
Runtime(s)
Lapack
Matlab Interface
Single Machine
MLbase Stack
Runtime(s)Spark
Lapack
Matlab Interface
Single Machine
Spark: cluster compuJng system designed for iteraJve computaJon
MLbase Stack
Runtime(s)
MLlib
Spark
Lapack
Matlab Interface
Single Machine
Spark: cluster compuJng system designed for iteraJve computaJon
MLlib: low-‐level ML library in Spark
MLbase Stack
Runtime(s)
MLlib
MLI
Spark
Lapack
Matlab Interface
Single Machine
Spark: cluster compuJng system designed for iteraJve computaJon
MLlib: low-‐level ML library in Spark
MLI: API / plaPorm for feature extracJon and algorithm development✦ PlaXorm independent
MLbase Stack
Runtime(s)
MLlib
MLI
ML Optimizer
Spark
Lapack
Matlab Interface
Single Machine
Spark: cluster compuJng system designed for iteraJve computaJon
MLlib: low-‐level ML library in Spark
MLI: API / plaPorm for feature extracJon and algorithm development✦ PlaXorm independent
ML Op>mizer: automates model selecJon✦ Solves a search problem over feature extractors and algorithms in MLI
Example: MLlib
Example: MLlib
✦ Goal: ClassificaVon of text file
Example: MLlib
✦ Goal: ClassificaVon of text file✦ Featurize data manually
1 def main(args: Array[String]) {2 val mc = new MLContext("local", "MLILR")3
4 //Read in file from HDFS5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))6
7 //Run feature extraction8 val classes = rawTextTable(??, "class")9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)11
12 //Classify the data using Logistic Regression.13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)14 }
1 def main(args: Array[String]) {2 val sc = new SparkContext("local", "SparkLR")3
4 //Load data from HDFS5 val data = sc.textFile(args(0)) //RDD[String]6
7 //User is responsible for formatting/featurizing/normalizing their RDD!8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)9
10 //Train the model using MLlib.11 val model = new LogisticRegressionLocalRandomSGD()12 .setStepSize(0.1)13 .setNumIterations(50)14 .train(featurizedData)15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
Example: MLlib
✦ Goal: ClassificaVon of text file✦ Featurize data manually✦ Calls MLlib’s LR funcVon
1 def main(args: Array[String]) {2 val mc = new MLContext("local", "MLILR")3
4 //Read in file from HDFS5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))6
7 //Run feature extraction8 val classes = rawTextTable(??, "class")9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)11
12 //Classify the data using Logistic Regression.13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)14 }
1 def main(args: Array[String]) {2 val sc = new SparkContext("local", "SparkLR")3
4 //Load data from HDFS5 val data = sc.textFile(args(0)) //RDD[String]6
7 //User is responsible for formatting/featurizing/normalizing their RDD!8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)9
10 //Train the model using MLlib.11 val model = new LogisticRegressionLocalRandomSGD()12 .setStepSize(0.1)13 .setNumIterations(50)14 .train(featurizedData)15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
Example: MLI
Example: MLI
✦ Use built-‐in feature extracVon funcVonality
1 def main(args: Array[String]) {2 val mc = new MLContext("local", "MLILR")3
4 //Read in file from HDFS5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))6
7 //Run feature extraction8 val classes = rawTextTable(??, "class")9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)11
12 //Classify the data using Logistic Regression.13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)14 }
1 def main(args: Array[String]) {2 val sc = new SparkContext("local", "SparkLR")3
4 //Load data from HDFS5 val data = sc.textFile(args(0)) //RDD[String]6
7 //User is responsible for formatting/featurizing/normalizing their RDD!8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)9
10 //Train the model using MLlib.11 val model = new LogisticRegressionLocalRandomSGD()12 .setStepSize(0.1)13 .setNumIterations(50)14 .train(featurizedData)15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
Example: MLI
✦ Use built-‐in feature extracVon funcVonality✦ MLI LogisVc Regression leverages MLlib
1 def main(args: Array[String]) {2 val mc = new MLContext("local", "MLILR")3
4 //Read in file from HDFS5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))6
7 //Run feature extraction8 val classes = rawTextTable(??, "class")9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)11
12 //Classify the data using Logistic Regression.13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)14 }
1 def main(args: Array[String]) {2 val sc = new SparkContext("local", "SparkLR")3
4 //Load data from HDFS5 val data = sc.textFile(args(0)) //RDD[String]6
7 //User is responsible for formatting/featurizing/normalizing their RDD!8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)9
10 //Train the model using MLlib.11 val model = new LogisticRegressionLocalRandomSGD()12 .setStepSize(0.1)13 .setNumIterations(50)14 .train(featurizedData)15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
Example: MLI
✦ Use built-‐in feature extracVon funcVonality✦ MLI LogisVc Regression leverages MLlib✦ Extensions:
✦ Embed in cross-‐validaVon rouVne✦ Use different feature extractors / algorithms✦ Write new ones
1 def main(args: Array[String]) {2 val mc = new MLContext("local", "MLILR")3
4 //Read in file from HDFS5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))6
7 //Run feature extraction8 val classes = rawTextTable(??, "class")9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)11
12 //Classify the data using Logistic Regression.13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)14 }
1 def main(args: Array[String]) {2 val sc = new SparkContext("local", "SparkLR")3
4 //Load data from HDFS5 val data = sc.textFile(args(0)) //RDD[String]6
7 //User is responsible for formatting/featurizing/normalizing their RDD!8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)9
10 //Train the model using MLlib.11 val model = new LogisticRegressionLocalRandomSGD()12 .setStepSize(0.1)13 .setNumIterations(50)14 .train(featurizedData)15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
Example: ML OpVmizer
var X = load(”text_file”, 2 to 10)var y = load(”text_file”, 1)var (fn-‐model, summary) = doClassify(X, y)
✦ User declaraVvely specifies task✦ ML OpVmizer searches through MLI
Example: ML OpVmizer
✦ User declaraVvely specifies task✦ ML OpVmizer searches through MLI
SQL Result MQL Model
VisionMLI DetailsCurrent StatusML Workflow
Lay of the LandEase of u
se
Performance, Scalability
Lay of the LandEase of u
se
Performance, Scalability
Matlab, Rx + Easy (Resembles math, limited set up)
+ Sufficient for prototyping / wriVng papers— Ad-‐hoc, non-‐scalable scripts— Loss of translaVon upon re-‐implementaVon
Lay of the LandEase of u
se
Performance, Scalability
Matlab, Rx + Easy (Resembles math, limited set up)
+ Sufficient for prototyping / wriVng papers— Ad-‐hoc, non-‐scalable scripts— Loss of translaVon upon re-‐implementaVon
GraphLab, VWx
Mahoutx
+ Scalable and (someVmes) fast + ExisVng open-‐source libraries— Difficult to set up, extend
ExamplesML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
ExamplesML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
‘Distributed’ Divide-‐Factor-‐Combine (DFC)✦ IniVal studies in MATLAB (Not distributed)✦ Distributed prototype involving compiled MATLAB
ExamplesML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
‘Distributed’ Divide-‐Factor-‐Combine (DFC)✦ IniVal studies in MATLAB (Not distributed)✦ Distributed prototype involving compiled MATLAB
Mahout ALS with Early Stopping✦ Theory: simple if-‐statement (3 lines of code)
ExamplesML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
‘Distributed’ Divide-‐Factor-‐Combine (DFC)✦ IniVal studies in MATLAB (Not distributed)✦ Distributed prototype involving compiled MATLAB
Mahout ALS with Early Stopping✦ Theory: simple if-‐statement (3 lines of code)✦ PracVce: sib through 7 files, nearly 1K lines of code
Lay of the LandEase of u
se
Performance, Scalability
Matlab, Rx + Easy (Resembles math, limited set up)
+ Sufficient for prototyping / wriVng papers— Ad-‐hoc, non-‐scalable scripts— Loss of translaVon upon re-‐implementaVon
GraphLab, VWx
Mahoutx
+ Scalable and (someVmes) fast + ExisVng open-‐source libraries— Difficult to set up, extend
Lay of the LandEase of u
se
Performance, Scalability
MLlibx
Matlab, Rx + Easy (Resembles math, limited set up)
+ Sufficient for prototyping / wriVng papers— Ad-‐hoc, non-‐scalable scripts— Loss of translaVon upon re-‐implementaVon
GraphLab, VWx
Mahoutx
+ Scalable and (someVmes) fast + ExisVng open-‐source libraries— Difficult to set up, extend
Lay of the LandEase of u
se
Performance, Scalability
MLIx
MLlibx
Matlab, Rx + Easy (Resembles math, limited set up)
+ Sufficient for prototyping / wriVng papers— Ad-‐hoc, non-‐scalable scripts— Loss of translaVon upon re-‐implementaVon
GraphLab, VWx
Mahoutx
+ Scalable and (someVmes) fast + ExisVng open-‐source libraries— Difficult to set up, extend
ML Developer API (MLI)
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]val x: RDD[spark.uJl.Vector]
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]val x: RDD[spark.uJl.Vector]val x: RDD[breeze.linalg.Vector]
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]val x: RDD[spark.uJl.Vector]val x: RDD[breeze.linalg.Vector]val x: RDD[BIDMat.SMat]
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]val x: RDD[spark.uJl.Vector]val x: RDD[breeze.linalg.Vector]val x: RDD[BIDMat.SMat]
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]val x: RDD[spark.uJl.Vector]val x: RDD[breeze.linalg.Vector]val x: RDD[BIDMat.SMat]
NEWval x: MLTable
ML Developer API (MLI)
OLDval x: RDD[Array[Double]]val x: RDD[spark.uJl.Vector]val x: RDD[breeze.linalg.Vector]val x: RDD[BIDMat.SMat]
NEWval x: MLTable
✦ Abstract interface for arbitrary backend✦ Common interface to support an opVmizer
ML Developer API (MLI)
ML Developer API (MLI)✦ Shield ML Developers from low-‐details
✦ provide familiar mathemaVcal operators in distributed sedng
ML Developer API (MLI)✦ Shield ML Developers from low-‐details
✦ provide familiar mathemaVcal operators in distributed sedng
✦ Table ComputaVon (MLTable)✦ Flexibility when loading data (heterogenous, missing)✦ Common interface for feature extracVon / algorithms✦ Supports MapReduce and relaVonal operators
ML Developer API (MLI)✦ Shield ML Developers from low-‐details
✦ provide familiar mathemaVcal operators in distributed sedng
✦ Table ComputaVon (MLTable)✦ Flexibility when loading data (heterogenous, missing)✦ Common interface for feature extracVon / algorithms✦ Supports MapReduce and relaVonal operators
✦ Linear Algebra (MLSubMatrix)✦ Linear algebra on *local* parVVons✦ Sparse and Dense matrix support
ML Developer API (MLI)✦ Shield ML Developers from low-‐details
✦ provide familiar mathemaVcal operators in distributed sedng
✦ Table ComputaVon (MLTable)✦ Flexibility when loading data (heterogenous, missing)✦ Common interface for feature extracVon / algorithms✦ Supports MapReduce and relaVonal operators
✦ Linear Algebra (MLSubMatrix)✦ Linear algebra on *local* parVVons✦ Sparse and Dense matrix support
✦ OpVmizaVon PrimiVves (MLSolve)✦ Distributed implementaVons of common paferns
ML Developer API (MLI)✦ Shield ML Developers from low-‐details
✦ provide familiar mathemaVcal operators in distributed sedng
✦ Table ComputaVon (MLTable)✦ Flexibility when loading data (heterogenous, missing)✦ Common interface for feature extracVon / algorithms✦ Supports MapReduce and relaVonal operators
✦ Linear Algebra (MLSubMatrix)✦ Linear algebra on *local* parVVons✦ Sparse and Dense matrix support
✦ OpVmizaVon PrimiVves (MLSolve)✦ Distributed implementaVons of common paferns
✦ DFC: ~50 lines of code✦ ALS: early stopping in 3 lines; < 40 lines total
MLI Ease of Use
MLI Ease of UseLogisVc Regression
AlternaVng Least Squares
System Lines of Code
Matlab 11
Vowpal Wabbit 721
MLI 55
System Lines of Code
Matlab 20
Mahout 865
GraphLab 383
MLI 32
MLI Ease of UseLogisVc Regression
AlternaVng Least Squares
System Lines of Code
Matlab 11
Vowpal Wabbit 721
MLI 55
System Lines of Code
Matlab 20
Mahout 865
GraphLab 383
MLI 32
MLI Ease of UseLogisVc Regression
AlternaVng Least Squares
System Lines of Code
Matlab 11
Vowpal Wabbit 721
MLI 55
System Lines of Code
Matlab 20
Mahout 865
GraphLab 383
MLI 32
MLI/Spark Performance
✦ Wall>me: elapsed Vme to execute task
MLI/Spark Performance
✦ Wall>me: elapsed Vme to execute task
✦ Weak scaling✦ fix problem size per processor✦ ideally: constant wallVme as we grow cluster
MLI/Spark Performance
✦ Wall>me: elapsed Vme to execute task
✦ Weak scaling✦ fix problem size per processor✦ ideally: constant wallVme as we grow cluster
✦ Strong scaling✦ fix total problem size✦ ideally: linear speed up as we grow cluster
MLI/Spark Performance
✦ Wall>me: elapsed Vme to execute task
✦ Weak scaling✦ fix problem size per processor✦ ideally: constant wallVme as we grow cluster
✦ Strong scaling✦ fix total problem size✦ ideally: linear speed up as we grow cluster
✦ EC2 Experiments✦ m2.4xlarge instances, up to 32 machine clusters
MLI/Spark Performance
LogisVc Regression -‐ Weak Scaling
LogisVc Regression -‐ Weak Scaling
✦ Full dataset: 200K images, 160K dense features
LogisVc Regression -‐ Weak Scaling
✦ Full dataset: 200K images, 160K dense features✦ Similar weak scaling
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e
# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLlib
LogisVc Regression -‐ Weak Scaling
✦ Full dataset: 200K images, 160K dense features✦ Similar weak scaling✦ MLI/Spark within a factor of 2 of VW’s wallJme
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=6K, d=160Kn=12.5K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
MLI/Spark
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e
# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLlib
LogisVc Regression -‐ Strong Scaling
LogisVc Regression -‐ Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features
LogisVc Regression -‐ Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features✦ MLI/Spark exhibits be^er scaling properJes
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLlib
LogisVc Regression -‐ Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features✦ MLI/Spark exhibits be^er scaling properJes✦ MLI/Spark faster than VW with 16 and 32 machines
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e
# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLI/Spark
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLlib
ALS -‐ WallVme
ALS -‐ WallVme
✦ Dataset: Scaled version of NePlix data (9X in size)✦ Cluster: 9 machines
ALS -‐ WallVme
✦ Dataset: Scaled version of NePlix data (9X in size)✦ Cluster: 9 machines
System Wall>me (seconds)
Matlab 15443
Mahout 4206
GraphLab 291
MLI/Spark 481
ALS -‐ WallVme
✦ Dataset: Scaled version of NePlix data (9X in size)✦ Cluster: 9 machines
System Wall>me (seconds)
Matlab 15443
Mahout 4206
GraphLab 291
MLI/Spark 481
ALS -‐ WallVme
✦ Dataset: Scaled version of NePlix data (9X in size)✦ Cluster: 9 machines✦ MLI/Spark an order of magnitude faster than Mahout✦ MLI/Spark within factor of 2 of GraphLab
System Wall>me (seconds)
Matlab 15443
Mahout 4206
GraphLab 291
MLI/Spark 481
VisionMLI DetailsCurrent StatusML Workflow
Linear Regression (+Lasso, Ridge)
AlternaVng Least Squares, DFC
K-‐Means, DP-‐Means
LogisVc Regression, Linear SVM (+L1, L2), MulVnomial Regression, Naive Bayes, Decision Trees
Parallel Gradient, Local SGD, L-‐BFGS, ADMM, Adagrad
Principal Component Analysis (PCA), N-‐grams, feature normalizaVon
Cross ValidaVon, EvaluaVon Metrics
MLI FuncVonalityRegression:
Collabora>ve Filtering:
Clustering:
Classifica>on:
Op>miza>on Primi>ves:
Feature Extrac>on:
ML Tools:
Linear Regression (+Lasso, Ridge)
AlternaVng Least Squares, DFC
K-‐Means, DP-‐Means
LogisVc Regression, Linear SVM (+L1, L2), MulVnomial Regression, Naive Bayes, Decision Trees
Parallel Gradient, Local SGD, L-‐BFGS, ADMM, Adagrad
Principal Component Analysis (PCA), N-‐grams, feature normalizaVon
Cross ValidaVon, EvaluaVon Metrics
MLI FuncVonalityRegression:
Collabora>ve Filtering:
Clustering:
Classifica>on:
Op>miza>on Primi>ves:
Feature Extrac>on:
ML Tools:
MLlib
MLI
ML OptimizerEnd User
MLbase Stack Status
Spark
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
MLlib
MLI
ML OptimizerEnd User
MLbase Stack Status
Spark
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
MLlib
MLI
ML OptimizerEnd User
MLbase Stack Status
Goal 1:Summer Release
Spark
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
MLlib
MLI
ML OptimizerEnd User
MLbase Stack Status
Goal 1:Summer Release
Goal 2:Winter Release
Spark
ML Developer
Meta-Data
Statistics
User
Declarative ML Task
ML Contract + Code
Master Server
….
result (e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX Runtime
DMX Runtime
DMX Runtime
DMX Runtime
LLP
PLP
Master
Slaves
Future DirecVons✦ Iden>fy minimal set of ML operators
✦ Expose internals of ML algorithms to opVmizer
✦ Plug-‐ins to Python, R
✦ Visualiza>on for unsupervised learning and exploraVon
✦ Advanced ML capabili>es✦ Time-‐series algorithms✦ Graphical models✦ Advanced OpVmizaVon (e.g., asynchronous computaVon)✦ Online updates✦ Sampling for efficiency
VisionMLI DetailsCurrent StatusML Workflow
Typical Data Analysis Workflow
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Spark, [MLI] Data Explora>on
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Spark, [MLI] Data Explora>on
MLI Feature Extrac>on
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Spark, [MLI] Data Explora>on
MLI Feature Extrac>on
MLI, MLlib Learning
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Spark, [MLI] Data Explora>on
MLI Feature Extrac>on
MLI Evalua>on
MLI, MLlib Learning
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Spark, [MLI] Data Explora>on
MLI Feature Extrac>on
Scala Deployment
MLI Evalua>on
MLI, MLlib Learning
Adapted from slides by Ariel Kleiner
Typical Data Analysis WorkflowSpark, MLI Obtain / Load Raw Data
Spark, [MLI] Data Explora>on
MLI Feature Extrac>on
Scala Deployment
MLI Evalua>on
MLI, MLlib Learning
Adapted from slides by Ariel Kleiner
Binary ClassificaVon
1.2 Definitions and terminology 3
Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.
Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.
Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.
Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm
Adapted from slides by Ariel Kleiner
Binary ClassificaVon
Goal: Learn a mapping from enVVes to discrete labels
1.2 Definitions and terminology 3
Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.
Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.
Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.
Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm
Adapted from slides by Ariel Kleiner
Binary ClassificaVon
Goal: Learn a mapping from enVVes to discrete labels
Example: Spam ClassificaVon✦ EnVVes are emails✦ Labels are {spam, not-‐spam}
1.2 Definitions and terminology 3
Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.
Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.
Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.
Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm
Adapted from slides by Ariel Kleiner
Binary ClassificaVon
Goal: Learn a mapping from enVVes to discrete labels
Example: Spam ClassificaVon✦ EnVVes are emails✦ Labels are {spam, not-‐spam}✦ Given past labeled emails, we want to predict whether
a new email is spam or not-‐spam
1.2 Definitions and terminology 3
Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.
Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.
Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.
Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm
Adapted from slides by Ariel Kleiner
Binary ClassificaVon
Goal: Learn a mapping from enVVes to discrete labels
Other Examples:✦ Click (and clickthrough rate) predicVon✦ Fraud detecVon✦ Face detecVon✦ Exercise: “ARTS” vs “LIFE” on Wikipedia
✦ Real data
1.2 Definitions and terminology 3
Figure 1.1 The zig-zag line on the left panel is consistent over the blue and redtraining sample, but it is a complex separation surface that is not likely to generalizewell to unseen data. In contrast, the decision surface on the right panel is simplerand might generalize better in spite of its misclassification of a few points of thetraining sample.
Which concept families can actually be learned, and under what conditions? Howwell can these concepts be learned computationally?
1.2 Definitions and terminology
We will use the canonical problem of spam detection as a running example toillustrate some basic definitions and to describe the use and evaluation of machinelearning algorithms in practice. Spam detection is the problem of learning toautomatically classify email messages as either spam or non-spam.
Examples: Items or instances of data used for learning or evaluation. In our spamproblem, these examples correspond to the collection of email messages we will usefor learning and testing.
Features: The set of attributes, often represented as a vector, associated to anexample. In the case of email messages, some relevant features may include thelength of the message, the name of the sender, various characteristics of the header,the presence of certain keywords in the body of the message, and so on.
Labels: Values or categories assigned to examples. In classification problems,examples are assigned specific categories, for instance, the spam and non-spamcategories in our binary classification problem. In regression, items are assignedreal-valued labels.
Training sample: Examples used to train a learning algorithm. In our spamproblem, the training sample consists of a set of email examples along with theirassociated labels. The training sample varies for di↵erent learning scenarios, asdescribed in section 1.4.
Validation sample: Examples used to tune the parameters of a learning algorithm
Adapted from slides by Ariel Kleiner
ClassificaVon PipelineClassifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
ClassificaVon PipelineClassifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
1. Randomly split full data into disjoint subsets
Adapted from slides by Ariel Kleiner
ClassificaVon PipelineClassifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
1. Randomly split full data into disjoint subsets2. Featurize the data
Adapted from slides by Ariel Kleiner
ClassificaVon PipelineClassifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
1. Randomly split full data into disjoint subsets2. Featurize the data3. Use training set to learn a classifier
Adapted from slides by Ariel Kleiner
ClassificaVon PipelineClassifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
1. Randomly split full data into disjoint subsets2. Featurize the data3. Use training set to learn a classifier4. Evaluate classifier on test set (avoid overfidng)
Adapted from slides by Ariel Kleiner
ClassificaVon PipelineClassifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
1. Randomly split full data into disjoint subsets2. Featurize the data3. Use training set to learn a classifier4. Evaluate classifier on test set (avoid overfidng)5. Use classifier to predict in the wild
Adapted from slides by Ariel Kleiner
E.g., Spam ClassificaVon
Example:(Spam(Classifica<on(
From: [email protected]
"Eliminate your debt by giving us your money..."
From: [email protected]
"Hi, it's been a while! How are you? ..."
spam
not-spam
spam
not-‐spam
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
FeaturizaVon Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
FeaturizaVon
✦ Most classifiers require numeric descripJons of enJJes
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
FeaturizaVon
✦ Most classifiers require numeric descripJons of enJJes
✦ Featuriza>on: Transform each enJty into a vector of real numbers
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
FeaturizaVon
✦ Most classifiers require numeric descripJons of enJJes
✦ Featuriza>on: Transform each enJty into a vector of real numbers✦ Opportunity to incorporate domain knowledge✦ Useful even when original data is already numeric
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
E.g., “Bag of Words”
Example:(Spam(Classifica<on(
From: [email protected]
"Eliminate your debt by giving us your money..."
From: [email protected]
"Hi, it's been a while! How are you? ..."
Vocabularybeendebt
eliminategivinghowit'smoneywhile
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
E.g., “Bag of Words” ✦ EnVVes are documents
Example:(Spam(Classifica<on(
From: [email protected]
"Eliminate your debt by giving us your money..."
From: [email protected]
"Hi, it's been a while! How are you? ..."
Vocabularybeendebt
eliminategivinghowit'smoneywhile
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
E.g., “Bag of Words” ✦ EnVVes are documents
✦ Build Vocabulary
Example:(Spam(Classifica<on(
From: [email protected]
"Eliminate your debt by giving us your money..."
From: [email protected]
"Hi, it's been a while! How are you? ..."
Vocabularybeendebt
eliminategivinghowit'smoneywhile
Example:(Spam(Classifica<on(
From: [email protected]
"Eliminate your debt by giving us your money..."
From: [email protected]
"Hi, it's been a while! How are you? ..."
Vocabularybeendebt
eliminategivinghowit'smoneywhile
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Example:(Spam(Classifica<on(
From: [email protected]
"Eliminate your debt by giving us your money..."
been
debt
eliminate
giving
how
it's
money
while
0
1
1
1
0
0
1
0
E.g., “Bag of Words” ✦ EnVVes are documents
✦ Build Vocabulary
✦ Derive feature vectors from Vocabulary✦ Exercise: we’ll use bigrams
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Support Vector Machines (SVMs)
Credit: FoundaVons of Machine Learning Mohri, Rostamizadeh, Talwalkar
64 Support Vector Machines
w·x+b=0
w·x+b=0
Figure 4.1 Two possible separating hyperplanes. The right-hand side figure showsa hyperplane that maximizes the margin.
4.2 SVMs — separable case
In this section, we assume that the training sample S can be linearly separated,that is, we assume the existence of a hyperplane that perfectly separates thetraining sample into two populations of positively and negatively labeled points,as illustrated by the left panel of figure 4.1. But there are then infinitely manysuch separating hyperplanes. Which hyperplane should a learning algorithm select?The solution returned by the SVM algorithm is the hyperplane with the maximummargin, or distance to the closest points, and is thus known as the maximum-marginhyperplane. The right panel of figure 4.1 illustrates that choice.
We will present later in this chapter a margin theory that provides a strongjustification for this solution. We can observe already, however, that the SVMsolution can also be viewed as the “safest” choice in the following sense: a testpoint is classified correctly by a separating hyperplane with margin ⇢ even whenit falls within a distance ⇢ of the training samples sharing the same label; for theSVM solution, ⇢ is the maximum margin and thus the “safest” value.
4.2.1 Primal optimization problem
We now derive the equations and optimization problem that define the SVMsolution. The general equation of a hyperplane in RN is
w · x + b = 0, (4.3)
where w 2 RN is a non-zero vector normal to the hyperplane and b 2 R ascalar. Note that this definition of a hyperplane is invariant to non-zero scalarmultiplication. Hence, for a hyperplane that does not pass through any samplepoint, we can scale w and b appropriately such that min(x,y)2S |w · x + b| = 1.
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Support Vector Machines (SVMs)
Credit: FoundaVons of Machine Learning Mohri, Rostamizadeh, Talwalkar
64 Support Vector Machines
w·x+b=0
w·x+b=0
Figure 4.1 Two possible separating hyperplanes. The right-hand side figure showsa hyperplane that maximizes the margin.
4.2 SVMs — separable case
In this section, we assume that the training sample S can be linearly separated,that is, we assume the existence of a hyperplane that perfectly separates thetraining sample into two populations of positively and negatively labeled points,as illustrated by the left panel of figure 4.1. But there are then infinitely manysuch separating hyperplanes. Which hyperplane should a learning algorithm select?The solution returned by the SVM algorithm is the hyperplane with the maximummargin, or distance to the closest points, and is thus known as the maximum-marginhyperplane. The right panel of figure 4.1 illustrates that choice.
We will present later in this chapter a margin theory that provides a strongjustification for this solution. We can observe already, however, that the SVMsolution can also be viewed as the “safest” choice in the following sense: a testpoint is classified correctly by a separating hyperplane with margin ⇢ even whenit falls within a distance ⇢ of the training samples sharing the same label; for theSVM solution, ⇢ is the maximum margin and thus the “safest” value.
4.2.1 Primal optimization problem
We now derive the equations and optimization problem that define the SVMsolution. The general equation of a hyperplane in RN is
w · x + b = 0, (4.3)
where w 2 RN is a non-zero vector normal to the hyperplane and b 2 R ascalar. Note that this definition of a hyperplane is invariant to non-zero scalarmultiplication. Hence, for a hyperplane that does not pass through any samplepoint, we can scale w and b appropriately such that min(x,y)2S |w · x + b| = 1.
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Support Vector Machines (SVMs)
Credit: FoundaVons of Machine Learning Mohri, Rostamizadeh, Talwalkar
64 Support Vector Machines
w·x+b=0
w·x+b=0
Figure 4.1 Two possible separating hyperplanes. The right-hand side figure showsa hyperplane that maximizes the margin.
4.2 SVMs — separable case
In this section, we assume that the training sample S can be linearly separated,that is, we assume the existence of a hyperplane that perfectly separates thetraining sample into two populations of positively and negatively labeled points,as illustrated by the left panel of figure 4.1. But there are then infinitely manysuch separating hyperplanes. Which hyperplane should a learning algorithm select?The solution returned by the SVM algorithm is the hyperplane with the maximummargin, or distance to the closest points, and is thus known as the maximum-marginhyperplane. The right panel of figure 4.1 illustrates that choice.
We will present later in this chapter a margin theory that provides a strongjustification for this solution. We can observe already, however, that the SVMsolution can also be viewed as the “safest” choice in the following sense: a testpoint is classified correctly by a separating hyperplane with margin ⇢ even whenit falls within a distance ⇢ of the training samples sharing the same label; for theSVM solution, ⇢ is the maximum margin and thus the “safest” value.
4.2.1 Primal optimization problem
We now derive the equations and optimization problem that define the SVMsolution. The general equation of a hyperplane in RN is
w · x + b = 0, (4.3)
where w 2 RN is a non-zero vector normal to the hyperplane and b 2 R ascalar. Note that this definition of a hyperplane is invariant to non-zero scalarmultiplication. Hence, for a hyperplane that does not pass through any samplepoint, we can scale w and b appropriately such that min(x,y)2S |w · x + b| = 1.
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Support Vector Machines (SVMs)
Credit: FoundaVons of Machine Learning Mohri, Rostamizadeh, Talwalkar
64 Support Vector Machines
w·x+b=0
w·x+b=0
Figure 4.1 Two possible separating hyperplanes. The right-hand side figure showsa hyperplane that maximizes the margin.
4.2 SVMs — separable case
In this section, we assume that the training sample S can be linearly separated,that is, we assume the existence of a hyperplane that perfectly separates thetraining sample into two populations of positively and negatively labeled points,as illustrated by the left panel of figure 4.1. But there are then infinitely manysuch separating hyperplanes. Which hyperplane should a learning algorithm select?The solution returned by the SVM algorithm is the hyperplane with the maximummargin, or distance to the closest points, and is thus known as the maximum-marginhyperplane. The right panel of figure 4.1 illustrates that choice.
We will present later in this chapter a margin theory that provides a strongjustification for this solution. We can observe already, however, that the SVMsolution can also be viewed as the “safest” choice in the following sense: a testpoint is classified correctly by a separating hyperplane with margin ⇢ even whenit falls within a distance ⇢ of the training samples sharing the same label; for theSVM solution, ⇢ is the maximum margin and thus the “safest” value.
4.2.1 Primal optimization problem
We now derive the equations and optimization problem that define the SVMsolution. The general equation of a hyperplane in RN is
w · x + b = 0, (4.3)
where w 2 RN is a non-zero vector normal to the hyperplane and b 2 R ascalar. Note that this definition of a hyperplane is invariant to non-zero scalarmultiplication. Hence, for a hyperplane that does not pass through any samplepoint, we can scale w and b appropriately such that min(x,y)2S |w · x + b| = 1.
✦ “Max-‐Margin”: find linear separator with the largest separaVon between the two classes
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Support Vector Machines (SVMs)
Credit: FoundaVons of Machine Learning Mohri, Rostamizadeh, Talwalkar
64 Support Vector Machines
w·x+b=0
w·x+b=0
Figure 4.1 Two possible separating hyperplanes. The right-hand side figure showsa hyperplane that maximizes the margin.
4.2 SVMs — separable case
In this section, we assume that the training sample S can be linearly separated,that is, we assume the existence of a hyperplane that perfectly separates thetraining sample into two populations of positively and negatively labeled points,as illustrated by the left panel of figure 4.1. But there are then infinitely manysuch separating hyperplanes. Which hyperplane should a learning algorithm select?The solution returned by the SVM algorithm is the hyperplane with the maximummargin, or distance to the closest points, and is thus known as the maximum-marginhyperplane. The right panel of figure 4.1 illustrates that choice.
We will present later in this chapter a margin theory that provides a strongjustification for this solution. We can observe already, however, that the SVMsolution can also be viewed as the “safest” choice in the following sense: a testpoint is classified correctly by a separating hyperplane with margin ⇢ even whenit falls within a distance ⇢ of the training samples sharing the same label; for theSVM solution, ⇢ is the maximum margin and thus the “safest” value.
4.2.1 Primal optimization problem
We now derive the equations and optimization problem that define the SVMsolution. The general equation of a hyperplane in RN is
w · x + b = 0, (4.3)
where w 2 RN is a non-zero vector normal to the hyperplane and b 2 R ascalar. Note that this definition of a hyperplane is invariant to non-zero scalarmultiplication. Hence, for a hyperplane that does not pass through any samplepoint, we can scale w and b appropriately such that min(x,y)2S |w · x + b| = 1.
✦ “Max-‐Margin”: find linear separator with the largest separaVon between the two classes
✦ Extensions:✦ non-‐separable sedng✦ non-‐linear classifiers (kernels)
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Model EvaluaVon Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
✦ Various metrics for quality; accuracy is most common
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
✦ Various metrics for quality; accuracy is most common✦ EvaluaVon process
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
✦ Various metrics for quality; accuracy is most common✦ EvaluaVon process
✦ Train on training set (don’t expose test set to classifier)
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
✦ Various metrics for quality; accuracy is most common✦ EvaluaVon process
✦ Train on training set (don’t expose test set to classifier)✦ Make predicJons using test set (ignoring test labels)
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
✦ Various metrics for quality; accuracy is most common✦ EvaluaVon process
✦ Train on training set (don’t expose test set to classifier)✦ Make predicJons using test set (ignoring test labels)✦ Compute fracJon of correct predicJons on test set
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Model EvaluaVon
✦ Test set simulates performance on new enVty✦ Performance on training data overly opJmisJc!✦ “Overfijng”; “GeneralizaJon”
✦ Various metrics for quality; accuracy is most common✦ EvaluaVon process
✦ Train on training set (don’t expose test set to classifier)✦ Make predicJons using test set (ignoring test labels)✦ Compute fracJon of correct predicJons on test set
✦ Other more sophisVcated evaluaVon methods, e.g., cross-‐validaVon
Classifica<on(
fulldataset
trainingset
test set
classifier
accuracy
new entity
prediction
Adapted from slides by Ariel Kleiner
Contribu>ons encouraged!
www.mlbase.org
baseML
baseML
baseML
baseML
ML base
ML base
ML base
ML base
ML basebiglearn.org