ecs289: scalable machine learningchohsieh/teaching/ecs289g... · ecs289: scalable machine learning...
TRANSCRIPT
![Page 1: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/1.jpg)
ECS289: Scalable Machine Learning
Cho-Jui HsiehUC Davis
Oct 27, 2015
![Page 2: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/2.jpg)
Outline
One versus all/One versus one
Ranking loss for multiclass/multilabel classification
Scaling to millions of labels
![Page 3: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/3.jpg)
Multiclass Learning
n data points, L labels, d featuresInput: training data xi , yini=1:
Each xi is a d dimensional feature vectorEach yi ∈ 1, . . . , L is the corresponding labelEach training data belongs to one category
Goal: find a function to predict the correct label
f (x) ≈ y
![Page 4: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/4.jpg)
Multi-label Problems
n data points, L labels, d features
Input: training data xi , yini=1:
Each xi is a d dimensional feature vectorEach yi ∈ 0, 1L is a label vector (or Yi ∈ 1, 2, . . . , L)
Example: yi = [0, 0, 1, 0, 0, 1, 1] (or Yi = 3, 6, 7)Each training data can belong to multiple categories
Goal: Given a testing sample x , predict the correct labels
![Page 5: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/5.jpg)
Illustration
Multiclass: each row of L has exact one “1”
Multilabel: each row of L can have multiple ones
![Page 6: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/6.jpg)
Measure the accuracy (multi-class)
Let xi , yimi=1 be a set of testing data for multi-class problems
Let z1, . . . , zm be the prediction for each testing data
Accuracy:
1
m
m∑i=1
I (yi = zi )
If the algorithm outputs a set of k potential labels for each sample:
Z1,Z2, . . . ,Zm
Each Zi is a set of k labels
Precision@k:1
m
m∑i=1
I (yi ∈ Zi )
![Page 7: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/7.jpg)
Measure the accuracy (multi-class)
Let xi , yimi=1 be a set of testing data for multi-class problems
Let z1, . . . , zm be the prediction for each testing data
Accuracy:
1
m
m∑i=1
I (yi = zi )
If the algorithm outputs a set of k potential labels for each sample:
Z1,Z2, . . . ,Zm
Each Zi is a set of k labels
Precision@k:1
m
m∑i=1
I (yi ∈ Zi )
![Page 8: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/8.jpg)
Example (multiclass)
![Page 9: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/9.jpg)
Example (multiclass)
![Page 10: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/10.jpg)
Example (multiclass)
![Page 11: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/11.jpg)
Example (multiclass)
![Page 12: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/12.jpg)
Measure the accuracy (multi-label)
Let xi , yimi=1 be a set of testing data for multi-label problems
For each testing data, the classifier outputs zi ∈ 0, 1L
We use Yi = t : (yi )t 6= 0 and Zi = t : (zi )t 6= 0 to denote thesubset of real and predictive labels for the i-th testing data.
If each Zi has k elements, then Precision@k is defined by:
1
m
m∑i=1
Zi ∩ Yi
k
Hamming loss: overall classification performance
1
m
m∑i=1
d(yi , zi ),
where d(yi , zi ) measures the number of places where yi and zi differ.ROC curve (assume the classifier predicts a ranking of labels for eachdata point)
![Page 13: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/13.jpg)
Example (multilabel)
![Page 14: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/14.jpg)
Example (multilabel)
![Page 15: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/15.jpg)
Example (multilabel)
![Page 16: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/16.jpg)
Traditional Approach
Many algorithms for binary classification
Idea: transform multi-class or multi-label problems to multiple binaryclassification problems
Two approaches:
One versus All (OVA)One versus One (OVO)
![Page 17: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/17.jpg)
One Versus All (OVA)
Multi-class/multi-label problems with L categories
Build L different binary classifiers
For the t-th classifier:
Positive samples: all the points in class t (xi : t ∈ yi)Negative samples: all the points not in class t (xi : t /∈ yi)ft(x): the decision value for the t-th classifier
(larger ft ⇒ higher probability that x in class t)
Prediction:f (x) = arg max
tft(x)
Example: using SVM to train each binary classifier.
![Page 18: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/18.jpg)
One Versus One (OVO)
Multi-class/multi-label problems with L categories
Build L(L− 1) different binary classifiers
For the (s, t)-th classifier:
Positive samples: all the points in class s (xi : s ∈ yi)Negative samples: all the points in class t (xi : t ∈ yi)fs,t(x): the decision value for this classifier
(larger fs,t(x) ⇒ label s has higher probability than label t)ft,s(x) = −fs,t(x)
Prediction:
f (x) = arg maxs
(∑t
fs,t(x)
)Example: using SVM to train each binary classifier.
Not for multilabel problems.
![Page 19: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/19.jpg)
OVA vs OVO
Prediction accuracy: depends on datasets
Computational time:
OVA needs to train L classifiers
OVO needs to train L(L− 1)/2 classifiers
Is OVA always faster than OVO?
![Page 20: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/20.jpg)
OVA vs OVO
Prediction accuracy: depends on datasets
Computational time:
OVA needs to train L classifiers
OVO needs to train L(L− 1)/2 classifiers
Is OVA always faster than OVO?
NO, depends on the time complexity of the binary classifier
If the binary classifier requires O(n) time for n samples:OVA and OVO have similar time complexity
If the binary classifier requires O(n1.xx) time:OVO is faster than OVA
LIBSVM (kernel SVM solver): OVO
LIBLINEAR (linear SVM solver): OVA
![Page 21: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/21.jpg)
OVA vs OVO
Prediction accuracy: depends on datasets
Computational time:
OVA needs to train L classifiers
OVO needs to train L(L− 1)/2 classifiers
Is OVA always faster than OVO?
NO, depends on the time complexity of the binary classifier
If the binary classifier requires O(n) time for n samples:OVA and OVO have similar time complexity
If the binary classifier requires O(n1.xx) time:OVO is faster than OVA
LIBSVM (kernel SVM solver): OVO
LIBLINEAR (linear SVM solver): OVA
![Page 22: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/22.jpg)
Comparisons (accuracy)
For kernel SVM with RBF kernel
(See “A comparison of methods for multiclass support vector machines”,2002)
![Page 23: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/23.jpg)
Comparisons (training time)
For kernel SVM (with RBF kernel)
(See “A comparison of methods for multiclass support vector machines”,2002)
![Page 24: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/24.jpg)
Methods Using Instance-wise Ranking Loss
![Page 25: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/25.jpg)
Main idea
OVA and OVO: decompose the problem by labels
However, the ranking of the labels for a testing sample is important
For multiclass classification, the score of yi should be larger than otherlabelsFor multilabel classification, the score of Yi should be larger than otherlabels
Both OVA and OVO decompose the problem into individual labels
⇒ they cannot capture the ranking information
Solve one combined optimization problem by minimizing theranking loss
![Page 26: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/26.jpg)
Main idea
For simplicity, we assume a linear model
Model parameters: w1, . . . ,wL
For each data point x , compute the decision value for each label:
wT1 x , wT
2 x , . . . , wTL x
Prediction:y = arg max
twT
t x
For training data xi , yi is the true label, so we want
yi ≈ arg maxt
wTt xi ∀i
![Page 27: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/27.jpg)
Main idea
Question: how to define the distance between XW and Y
![Page 28: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/28.jpg)
Weston-Watkins Formulation
Proposed in Weston and Watkins, “Multi-class support vectormachines”. In ESANN, 1999.
minwt,ξti
1
2
L∑t=1
‖wt‖2 + Cn∑
i=1
L∑t=1
ξti
s.t. wTyi
xi −wTt xi ≥ 1− ξti , ξti ≥ 0 ∀t 6= yi , ∀i = 1, . . . , n
If point i is in class yi , for any other labels (t 6= yi ), we want
wTyi
xi −wTt xi ≥ 1
or we pay a penalty ξtiPrediction:
f (x) = arg maxt
wTt xi
![Page 29: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/29.jpg)
Weston-Watkins Formulation (dual)
The dual problem of Weston-Watkins formulation:
minαt
1
2
L∑t=1
‖wt(α)‖+n∑
i=1
∑t 6=yi
αti
s.t.0 ≤ αti ≤ C , ∀t 6= yi , ∀i = 1, . . . , n
α1, . . . ,αL ∈ Rn: dual variables for each label
wt(α) = −∑
i αti xi , and αyi
i = −∑
t 6=yiαti
Can be solved by dual (block) coordinate descent
![Page 30: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/30.jpg)
Crammer-Singer Formulation
Proposed in Carmmer and Singer, “On the algorithmic implementationof multiclass kernel-based vector machines”. JMLR, 2001.
minwt,ξti
1
2
L∑t=1
‖wt‖2 + Cn∑
i=1
ξi
s.t. wTyi
xi −wTt xi ≥ 1− ξi , ∀t 6= yi , ∀i = 1, . . . , n
ξi ≥ 0 ∀i = 1, . . . , n
If point i is in class yi , for any other labels (t 6= yi ), we want
wTyi
xi −wTt xi ≥ 1
For each point i , we only pay the largest penalty
Prediction:f (x) = arg max
twT
t xi
![Page 31: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/31.jpg)
Crammer-Singer Formulation (dual)
The dual problem of Crammer-Singer formulation:
minαt
1
2
L∑t=1
‖wt(α)‖+n∑
i=1
∑t 6=yi
αti
s.t.
(αti ≤ C , ∀t,
∑t
αti = 0
), ∀i = 1, 2, . . . , n
α1, . . . ,αL ∈ Rn: dual variables for each label
wt(α) = −∑
i αti xi , and αyi
i = −∑
t 6=yiαti
Can be solved by dual (block) coordinate descent
![Page 32: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/32.jpg)
Comparisons
(See “A Sequential Dual Method for Large Scale Multi-Class Linear SVMs”,KDD 2008)
![Page 33: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/33.jpg)
Scaling to huge number of labels
![Page 34: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/34.jpg)
Challenges: large number of categories
Multi-label (or multi-class) classification with large number of labels
Image classification—> 10000 labels
Recommending tags for articles: millions of labels (tags)
![Page 35: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/35.jpg)
Challenges: large number of categories
Consider a problem with 1 million labels (L = 1, 000, 000)
One versus all: reduce to 1 million binary problems
Training: 1 million binary classification problems.
Need 694 days if each binary problem can be solved in 1 minute
Model size: 1 million models.
Need 1 TB if each model requires 1MB.
Prediction one testing data: 1 million binary prediction
Need 1000 secs if each binary prediction needs 10−3 secs.
![Page 36: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/36.jpg)
Several Approaches
Label space reduction by Compressed Sensing (CS)
Feature space reduction (PCA)
Supervised latent feature model by matrix factorization
![Page 37: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/37.jpg)
Compressed Sensing
If A ∈ Rn×d ,w ∈ Rd , y ∈ Rn, and
Aw = y
Given A and y, can we recover w?
![Page 38: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/38.jpg)
Compressed Sensing
If A ∈ Rn×d ,w ∈ Rd , y ∈ Rn, and
Aw = y
Given A and y, can we recover w?
When n d :
Usually yes, because number of constraints number of variable
(over-determined linear system)
![Page 39: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/39.jpg)
Compressed Sensing
If A ∈ Rn×d ,w ∈ Rd , y ∈ Rn, and
Aw = y
Given A and y, can we recover w?
When n d :
No, because number of constraints number of variable
(high dimensional regression, under-determined linear system, . . . )
![Page 40: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/40.jpg)
Compressed Sensing
However, if we know w is a sparse vector (with ≤ s nonozeroelements), and A satisfies certain condition, then we can recover weven in the high dimensional setting.
![Page 41: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/41.jpg)
Compressed Sensing
Conditions: w is s-sparse and A satisfies Restricted Isometry Property(RIP)
Example: all entries of A are generated i.i.d. from Gaussian N(0, σ). . .
w can be recoverd by O(s log d) samples
Lasso (`1-regularized linear regression)Orthogonal Mathcing Pursuit (OMP). . .
How is this related to multilabel classification?
![Page 42: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/42.jpg)
Label Space Reduction by Compressed Sensing
Proposed in Hsu et al., “Multi-Label Prediction via CompressedSensing”. In NIPS 2009.
Main idea: reduce the label space from RL to Rk , where k L
zi = Myi where M ∈ Rk×L
If we can recover yi by knowing zi and M, then:
we only need to learn a function to map xi to zi :
f (xi ) ≈ zi
By compressed sensing:
yi can be recovered given xi and M even when k = O(s log L)s is the number of nonzeroes in yi : usually very small in practice
![Page 43: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/43.jpg)
Label Space Reduction by Compressed Sensing
Proposed in Hsu et al., “Multi-Label Prediction via CompressedSensing”. In NIPS 2009.
Main idea: reduce the label space from RL to Rk , where k L
zi = Myi where M ∈ Rk×L
If we can recover yi by knowing zi and M, then:
we only need to learn a function to map xi to zi :
f (xi ) ≈ zi
By compressed sensing:
yi can be recovered given xi and M even when k = O(s log L)s is the number of nonzeroes in yi : usually very small in practice
![Page 44: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/44.jpg)
Label Space Reduction by Compressed Sensing
Training:
Step 1: Construct M by i.i.d. Gaussian
Step 2: Compute zi = Myi for all i = 1, 2, . . . , n
Step 3: Learn a function f such that
f (xi ) ≈ zi
Prediction for a test sample xStep 1: compute z = f (x)
Step 2: compute y ∈ RL by solving a Lasso problem
Step 3: Threshold y to give the prediction
![Page 45: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/45.jpg)
Label Space Reduction by Compressed Sensing
Reduce the label size from L to O(s log L)
Drawbacks:1 Slow prediction time (need to solve a Lasso problem for every prediction)2 Large error in practice
![Page 46: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/46.jpg)
Another way to understand this algorithm
X ∈ Rn×d : data matrix, each row is a data point xiY ∈ Rn×L: label matrix, each row is yi
M ∈ RL×k : Gaussian random matrix
Goal: find a U matrix such that XU ≈ YM
![Page 47: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/47.jpg)
Another way to understand this algorithm
X ∈ Rn×d : data matrix, each row is a data point xiY ∈ Rn×L: label matrix, each row is yi
M ∈ RL×k : Gaussian random matrix
M†: psuedo inverse of M
Goal: find a U matrix such that XUM† ≈ YMM†
![Page 48: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/48.jpg)
Feature space reduction
Another way to improve speed: reduce the feature space
Dimension reduction from d dimensional space to k dimensional space
xi → Nxi = xi ,
where N ∈ Rk×d and k d
Matrix form:X = XNT
Multilabel learning on the reduced feature space:
Learn V such thatXV ≈ Y
Time complexity: reduced by a factor of n/k
Several ways to choose N: PCA, random projection, . . .
![Page 49: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/49.jpg)
Another way to understand this algorithm
X ∈ Rn×d : data matrix, each row is a data point xiY ∈ Rn×L: label matrix, each row is yi
N ∈ Rk×d : matrix for dimensional reduction
Goal: find a V matrix such that XNTV ≈ Y
![Page 50: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/50.jpg)
Supervised latent space model (by matrix factorization)
Label space reduction: find U such that
XUV T ≈ Y
Feature space reduction: find V such that
XUV T ≈ Y
How to improve?
Find best U and V simultaneously
Proposed in
Chen and Lin, “Feature-aware label space dimension reduction formulti-label classification”. In NIPS 2012.Xu et al., “Speedup Matrix Completion with Side Information:Application to Multi-Label Learning”. In NIPS 2013.Yu et al., “Large-scale Multi-label Learning with Missing Labels”. InICML 2014.
![Page 51: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/51.jpg)
Low Rank Modeling for Multilabel Classification
Let X ∈ Rn×d be the data matrix, Y ∈ Rn×L be the 0-1 label matrix
Find U ∈ Rd×k and V ∈ RL×k such that
Y ≈ XUV T
Obtain U,V by solving the following optimization problem:
minU∈Rd×k ,V∈RL×k
‖Y − XUV T‖2F + λ1‖U‖2
F + λ2‖V ‖2F
where λ1, λ2 are the regularization parameters
Solve by alternating minimization.
![Page 52: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/52.jpg)
Low Rank Modeling for Multilabel Classification
Time complexity:
O(nkL) for updating V
O(ndk) for updating U
Overall: O(nkL + ndk) per iteration
Original time complexity: O(ndL) per iteration
![Page 53: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/53.jpg)
Extensions
Extend to general loss function:
minU∈Rd×k ,V∈RL×k
∑i ,j
dis(Yij , (XUVT )ij) + λ1‖U‖2
F + λ2‖V ‖2F
Can handle problems with missing data:
minU∈Rd×k ,V∈RL×k
∑i ,j∈Ω
dis(Yij , (XUVT )ij) + λ1‖U‖2
F + λ2‖V ‖2F
Is it similar to inductive matrix completion?
What’s the time complexity for prediction?
![Page 54: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015](https://reader034.vdocuments.us/reader034/viewer/2022042220/5ec5ede6e9422601c50f7fa2/html5/thumbnails/54.jpg)
Coming up
Next class: paper presentations on mutlilabel classification and matrixcompletion
Questions?