Source: chohsieh/teaching/ecs289g_fall2016/lecture9.pdf
ECS289: Scalable Machine Learning
Cho-Jui HsiehUC Davis
Oct 18, 2016
Outline
Kernel Methods
Solving the dual problem
Kernel approximation
Support Vector Machines (SVM)
SVM is a widely used classifier.
Given:
Training data points x1, ..., xn. Each xi ∈ R^d is a feature vector.
Consider a simple case with two classes: yi ∈ {+1, −1}.
Goal: find a hyperplane to separate the two classes of data:
if yi = 1, wᵀxi ≥ 1 − ξi; if yi = −1, wᵀxi ≤ −1 + ξi.
Support Vector Machines (SVM)
Given training data x1, ..., xn ∈ R^d with labels yi ∈ {+1, −1}.
SVM primal problem (find optimal w ∈ R^d):

  min_{w,ξ}  (1/2) wᵀw + C Σ_{i=1}^n ξi
  s.t.  yi(wᵀxi) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., n.

SVM dual problem (find optimal α ∈ R^n):

  min_α  (1/2) αᵀQα − eᵀα
  s.t.  0 ≤ αi ≤ C,  i = 1, ..., n.

Q: n-by-n kernel matrix, Qij = yi yj xiᵀxj.
Each αi corresponds to one training data point.
Primal-dual relationship: w = Σ_i αi yi xi.
Non-linearly separable problems
What if the data is not linearly separable?
Solution: map the data xi to a higher dimensional (maybe infinite dimensional) feature space ϕ(xi), where the classes are linearly separable.
Support Vector Machines (SVM)
SVM primal problem:

  min_{w,ξ}  (1/2) wᵀw + C Σ_{i=1}^n ξi
  s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., n.

The dual problem for SVM:

  min_α  (1/2) αᵀQα − eᵀα
  s.t.  0 ≤ αi ≤ C,  i = 1, ..., n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, ..., 1]ᵀ.
Kernel trick: define K(xi, xj) = ϕ(xi)ᵀϕ(xj).
At optimum: w = Σ_{i=1}^n αi yi ϕ(xi).
Various types of kernels
Gaussian kernel: K(xi, xj) = e^{−γ‖xi−xj‖²}.
Polynomial kernel: K(xi, xj) = (γ xiᵀxj + c)^d.
Other kernels for specific problems:
Graph kernels (Vishwanathan et al., “Graph Kernels”. JMLR, 2010).
Pyramid kernel for image matching (Grauman and Darrell, “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features”. ICCV, 2005).
String kernels (Lodhi et al., “Text classification using string kernels”. JMLR, 2002).
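As a concrete illustration, the Gaussian and polynomial kernels above can be evaluated over all pairs of points at once (a minimal NumPy sketch; the function names are my own, not from any library):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2), for all pairs at once, using
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, degree=3):
    # K(x, y) = (gamma * x^T y + c)^degree
    return (gamma * X @ Y.T + c) ** degree

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
K = gaussian_kernel(X, X)  # 5 x 5 symmetric PSD matrix with ones on the diagonal
```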
General Kernelized ERM
L2-regularized Empirical Risk Minimization:

  min_w  (1/2)‖w‖² + C Σ_{i=1}^n ℓi(wᵀϕ(xi))

x1, ..., xn: training samples.
ϕ(xi): nonlinear mapping to a higher dimensional space.
Dual problem:

  min_α  (1/2) αᵀQα + Σ_{i=1}^n ℓi*(−αi)

where Q ∈ R^{n×n} and Qij = ϕ(xi)ᵀϕ(xj) = K(xi, xj).
Kernel Ridge Regression
Given training samples (xi, yi), i = 1, ..., n.

  min_w  (1/2)‖w‖² + (1/2) Σ_{i=1}^n (wᵀϕ(xi) − yi)²

Dual problem:

  min_α  αᵀQα + ‖α‖² − 2 αᵀy
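Setting the gradient of the dual to zero gives a linear system, so kernel ridge regression has a closed-form dual solution. A minimal sketch (the slide's objective corresponds to lam = 1; I expose lam as a general knob, and krr_fit / krr_predict are illustrative names of my own):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, lam=1.0, gamma=1.0):
    # dual solution of kernel ridge regression: (Q + lam I) alpha = y
    Q = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(Q + lam * np.eye(len(y)), y)

def krr_predict(X_test, X_train, alpha, gamma=1.0):
    # f(x) = sum_i alpha_i K(x_i, x)
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```

With a small lam, the fitted function nearly interpolates the training targets, which is a quick sanity check on the derivation.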
Scalability
Challenges in solving kernel SVM (for a dataset with n samples):
Space: O(n²) for storing the n-by-n kernel matrix (can be reduced in some cases).
Time: O(n³) for computing the exact solution.
Greedy Coordinate Descent for Kernel SVM
Nonlinear SVM
SVM primal problem:

  min_{w,ξ}  (1/2) wᵀw + C Σ_{i=1}^n ξi
  s.t.  yi(wᵀϕ(xi)) ≥ 1 − ξi,  ξi ≥ 0,  i = 1, ..., n.

w: a (possibly infinite dimensional) vector, very hard to solve for directly.
The dual problem for SVM:

  min_α  (1/2) αᵀQα − eᵀα
  s.t.  0 ≤ αi ≤ C,  i = 1, ..., n,

where Qij = yi yj ϕ(xi)ᵀϕ(xj) and e = [1, ..., 1]ᵀ.
Kernel trick: define K(xi, xj) = ϕ(xi)ᵀϕ(xj).
Example: Gaussian kernel K(xi, xj) = e^{−γ‖xi−xj‖²}.
Nonlinear SVM
Can we solve the problem by dual coordinate descent?
The vector w = Σ_i yi αi ϕ(xi) may have infinite dimensionality, so we cannot maintain w.
The closed form solution for one coordinate needs O(n) computation time:

  δ* = max(−αi, min(C − αi, (1 − (Qα)i) / Qii))

(assuming Q is stored in memory).
Can we improve coordinate descent while keeping the same O(n) time complexity?
Greedy Coordinate Descent
The Greedy Coordinate Descent (GCD) algorithm:
For t = 1, 2, ...
  1. Compute δi* := argmin_δ f(α + δ ei) for all i = 1, ..., n.
  2. Find the best coordinate i* = argmax_i |δi*|.
  3. Update α_{i*} ← α_{i*} + δ*_{i*}.
Greedy Coordinate Descent
Other variable selection criteria:
The coordinate with the maximum step size: i* = argmax_i |δi*|.
The coordinate with the maximum objective function reduction: i* = argmax_i (f(α) − f(α + δi* ei)).
The coordinate with the maximum projected gradient.
...
Greedy Coordinate Descent
How to compute the optimal coordinate?
Closed form solution for the best δ:

  δi* = max(−αi, min(C − αi, (1 − (Qα)i) / Qii))

Observations:
  1. Computing all δi* from scratch needs O(n²) time.
  2. If Qα is stored in memory, computing all δi* only needs O(n) time.
Maintaining Qα also needs only O(n) time after each update.
Greedy Coordinate Descent
Initialize α and z = Qα.
For t = 1, 2, ...
  For all i = 1, ..., n, compute
    δi* = max(−αi, min(C − αi, (1 − zi) / Qii))
  Let i* = argmax_i |δi*|.
  α_{i*} ← α_{i*} + δ*_{i*}
  z ← z + q_{i*} δ*_{i*}   (q_{i*} is the i*-th column of Q)

(This is a simplified version of the Sequential Minimal Optimization (SMO) algorithm proposed in Platt et al., 1998.)
(A similar version is implemented in LIBSVM.)
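The loop above can be sketched in a few lines of NumPy (a minimal sketch of the simplified greedy rule, not LIBSVM's implementation; greedy_cd_svm_dual is a name I've chosen):

```python
import numpy as np

def greedy_cd_svm_dual(Q, C=1.0, iters=500, tol=1e-8):
    """Greedy coordinate descent for min_a 0.5 a^T Q a - e^T a, 0 <= a_i <= C,
    maintaining z = Q @ alpha so each iteration costs O(n)."""
    n = Q.shape[0]
    alpha = np.zeros(n)
    z = np.zeros(n)                       # z = Q @ alpha
    diag = np.maximum(np.diag(Q), 1e-12)  # guard against a zero diagonal
    for _ in range(iters):
        # optimal single-coordinate steps, clipped to the box [0, C]
        delta = np.clip((1.0 - z) / diag, -alpha, C - alpha)
        i = int(np.argmax(np.abs(delta)))
        if abs(delta[i]) < tol:           # near-optimal: stop
            break
        alpha[i] += delta[i]
        z += Q[:, i] * delta[i]           # O(n) maintenance of Q @ alpha
    return alpha
```

On a tiny separable problem with a linear kernel, the recovered w = Σ_i α_i y_i x_i separates the two classes, and the dual objective decreases below its starting value of 0.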
How to solve problems with millions of samples?
Q ∈ R^{n×n} cannot be fully stored.
Keep a fixed-size memory buffer to “cache” the computed columns of Q.
For each coordinate update:
  If qi is in memory, use it directly for the update.
  If qi is not in memory:
    1. Evict the “Least Recently Used” column.
    2. Recompute qi and store it in memory.
    3. Update αi.
LIBSVM
Implemented in LIBSVM:
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
Other functionalities:
Multi-class classification
Support vector regression
Cross-validation
Kernel Approximation
Kernel Approximation
Kernel methods are hard to scale up because
Kernel matrix G is an n-by-n matrix
Can we approximate the kernel using a low-rank representation?
Two main algorithms:
Nystrom approximation: approximate the kernel matrix.
Random features: approximate the kernel function.
Kernel Matrix Approximation
We want to form a low-rank approximation of G ∈ R^{n×n}, where Gij = K(xi, xj).
Can we do SVD?
  No: SVD needs to form the n-by-n matrix, costing O(n²) space and O(n³) time.
Can we do matrix completion or matrix sketching?
  Main problem: we need to generalize to new points.
Nystrom Approximation
Nystrom approximation:
Randomly sample m columns of G:

  G = [ W    G12 ]
      [ G21  G22 ]

W: m-by-m square matrix (observed)
G21: (n−m)-by-m matrix (observed)
G12 = G21ᵀ (observed)
G22: (n−m)-by-(n−m) matrix (not observed)
Form a kernel approximation based on these mn observed elements:

  G ≈ G̃ = [ W   ] W† [ W  G12 ]
           [ G21 ]

W†: pseudo-inverse of W
Nystrom Approximation
Why W†?
It exactly recovers the observed top-left m-by-m block.
The kernel approximation:

  [ W    G12 ]  ≈  [ W   ] W† [ W  G12 ]  =  [ W    G12        ]
  [ G21  G22 ]     [ G21 ]                   [ G21  G21 W† G12 ]
Nystrom Approximation
Algorithm:
1. Sample m “landmark points” v1, ..., vm.
2. Compute the kernel values between all training data and the landmark points:

  C = [ W   ]  =  [ K(x1, v1) ... K(x1, vm) ]
      [ G21 ]     [ K(x2, v1) ... K(x2, vm) ]
                  [    ...          ...     ]
                  [ K(xn, v1) ... K(xn, vm) ]

3. Form the kernel approximation G̃ = C W† Cᵀ.
   (No need to explicitly form the n-by-n matrix.)

Time complexity: mn kernel evaluations (O(mnd) time for classical kernels).
How to choose m?
Trade-off between accuracy vs. computational time and memory space.
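The three steps above might look as follows (a sketch with uniform landmark sampling and a Gaussian kernel; nystrom is my own name, not a library API):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom(X, m, gamma=1.0, seed=0):
    """Nystrom approximation G ~= C W^+ C^T from m uniformly sampled landmarks.
    Only the n x m block C and the m x m block W are ever formed."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)
    C = gaussian_kernel(X, X[idx], gamma)  # n x m block [W; G21] (rows permuted)
    W = C[idx]                             # m x m landmark-landmark block
    return C, np.linalg.pinv(W), idx
```

The full n-by-n matrix C @ W_pinv @ C.T is materialized below only to check properties; in practice one keeps the factors. Note that the landmark-landmark block of the approximation is exact (W W† W = W), which is the "Why W†?" point above.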
Nystrom Approximation: Training
Solve the dual problem using the low-rank representation.
Example: kernel ridge regression

  min_α  αᵀG̃α + λ‖α‖² − 2 yᵀα

Solve the linear system (G̃ + λI)α = y.
Use iterative methods (such as conjugate gradient, CG) with fast matrix-vector multiplication:

  (G̃ + λI)p = C W†(Cᵀp) + λp

O(nm) time complexity per iteration.
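The fast matrix-vector product slots directly into a standard CG loop (a hedged sketch; cg_solve_krr is my own name, and the parenthesization of the matvec is what gives the O(nm) cost):

```python
import numpy as np

def cg_solve_krr(C, W_pinv, y, lam=1.0, iters=200, tol=1e-10):
    """Conjugate gradient for (G~ + lam I) alpha = y with G~ = C W^+ C^T.
    Each matvec is computed as C (W^+ (C^T p)) + lam p: O(nm), never O(n^2)."""
    def matvec(p):
        return C @ (W_pinv @ (C.T @ p)) + lam * p
    alpha = np.zeros_like(y)
    r = y - matvec(alpha)          # initial residual (alpha = 0)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        alpha += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return alpha
```

Since G̃ + λI has at most m + 1 distinct eigenvalues (rank-m matrix plus a multiple of the identity), CG converges in at most m + 1 iterations in exact arithmetic.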
Nystrom Approximation: Training
Another approach: reduce to a linear ERM problem!
Rewrite

  G̃ = C W† Cᵀ = C (W†)^{1/2} (W†)^{1/2} Cᵀ

so the approximate kernel can be represented as a linear kernel with feature matrix C(W†)^{1/2}:

  K̃(xi, xj) = G̃ij = uiᵀuj,  where  uj = (W†)^{1/2} [K(xj, v1), ..., K(xj, vm)]ᵀ

The problem is equivalent to a linear model with features u1, ..., un ∈ R^m:
  Kernel SVM ⇒ linear SVM with m features
  Kernel ridge regression ⇒ linear ridge regression with m features
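The feature map u = (W†)^{1/2}[K(x, v1), ..., K(x, vm)]ᵀ can be sketched via an eigendecomposition of W (my own function name; the 1e-10 eigenvalue cutoff is an assumed numerical tolerance for the pseudo-inverse):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, landmarks, gamma=1.0):
    """u_j = (W^+)^{1/2} [K(x_j, v_1), ..., K(x_j, v_m)]^T, so u_i^T u_j = G~_ij.
    The same map applies unchanged to unseen points."""
    W = gaussian_kernel(landmarks, landmarks, gamma)
    C = gaussian_kernel(X, landmarks, gamma)
    s, U = np.linalg.eigh(W)                  # W = U diag(s) U^T
    inv_sqrt = np.where(s > 1e-10, 1.0 / np.sqrt(np.maximum(s, 1e-10)), 0.0)
    return C @ (U * inv_sqrt) @ U.T           # C (W^+)^{1/2}
```

Feeding these features to any linear solver realizes the kernel-SVM-to-linear-SVM reduction above. On the landmark points themselves, uiᵀuj reproduces the exact kernel values, since that block of G̃ is exact.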
Nystrom Approximation: Prediction
Given the dual solution α.
Prediction for a testing sample x:

  Σ_{i=1}^n αi K̃(xi, x)   or   Σ_{i=1}^n αi K(xi, x) ?
Nystrom Approximation: Prediction
Given the dual solution α.
Prediction for a testing sample x:

  Σ_{i=1}^n αi K̃(xi, x)

We need to use the approximate kernel instead of the original kernel!
The approximate kernel gives much better accuracy, because training and testing should use the same kernel.
Generalization bounds:
(Alaoui and Mahoney, “Fast Randomized Kernel Methods With Statistical Guarantees”. 2014.)
(Rudi et al., “Less is More: Nystrom Computational Regularization”. NIPS 2015.)
(Using a small number of landmark points can be viewed as a form of regularization.)
Nystrom Approximation: Prediction
How to define K̃(x, xi)?
Append x to the data and apply the same approximation:

       [ K(x1, v1) ... K(x1, vm) ]      [ K(v1, x1) ... K(v1, xn)  K(v1, x) ]
  G̃ = [    ...          ...     ]  W†  [    ...         ...          ...   ]
       [ K(xn, v1) ... K(xn, vm) ]      [ K(vm, x1) ... K(vm, xn)  K(vm, x) ]
       [ K(x, v1)  ... K(x, vm)  ]

Compute u = (W†)^{1/2} [K(v1, x), ..., K(vm, x)]ᵀ (needs m kernel evaluations).
The approximate kernel can then be defined by

  K̃(x, xi) = uᵀui
Nystrom Approximation: Prediction
Prediction:

  f(x) = Σ_{i=1}^n αi uᵀui = uᵀ (Σ_{i=1}^n αi ui)

(Σ_{i=1}^n αi ui) can be pre-computed ⇒ prediction time is m kernel evaluations.
Original kernel method: needs n kernel evaluations per prediction.
Summary:

                   Quality   Training time                     Prediction time
  Original kernel  exact     n^{1.x} kernel + kernel training  n kernel
  Nystrom (m)      rank m    nm kernel + linear training       m kernel
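The pre-computation trick above is a one-liner (a sketch; make_predictor is my own name, and U_train is assumed to hold the Nystrom features u1, ..., un as rows):

```python
import numpy as np

def make_predictor(U_train, alpha):
    """Precompute beta = sum_i alpha_i u_i (an m-dimensional vector) once;
    each prediction is then u(x)^T beta: m kernel evaluations to build u(x),
    plus one m-dimensional dot product."""
    beta = U_train.T @ alpha
    def predict(u_test):
        return u_test @ beta
    return predict
```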
Nystrom Approximation
How to select the landmark points v1, ..., vm?
Traditional approach: uniform random sampling from the training data.
Importance sampling (leverage scores):
  Drineas and Mahoney, “On the Nystrom method for approximating a Gram matrix for improved kernel-based learning”. JMLR 2005.
  Gittens and Mahoney, “Revisiting the Nystrom method for improved large-scale machine learning”. ICML 2013.
Kmeans sampling (performs extremely well): run kmeans clustering and set v1, ..., vm to be the cluster centers.
  Zhang et al., “Improved Nystrom low rank approximation and error analysis”. ICML 2008.
Subspace distance:
  Lim et al., “Double Nystrom method: An efficient and accurate Nystrom scheme for large-scale data sets”. ICML 2015.
Nystrom Approximation (other related papers)
Distributed setting: (Kumar et al., “Ensemble Nystrom method”. NIPS 2009.)
Block-diagonal + low-rank approximation: (Si et al., “Memory efficient kernel approximation”. ICML 2014.)
Nystrom method for fast prediction: (Hsieh et al., “Fast prediction for large-scale kernel machines”. NIPS 2014.)
Structured landmark points: (Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016.)
Kernel Approximation (random features)
Random Features
Directly approximate the kernel function (data-independent).
Generate nonlinear “features” ⇒ reduce the problem to linear model training.
Several examples:
  Random Fourier features: (Rahimi and Recht, “Random features for large-scale kernel machines”. NIPS 2007.)
  Random features for polynomial kernels: (Kar and Karnick, “Random feature maps for dot product kernels”. AISTATS 2012.)
Random Fourier Features
Random features for shift-invariant kernels:
A continuous kernel is shift-invariant if K(x, y) = k(x − y).
  Gaussian kernel: K(x, y) = e^{−‖x−y‖²}
  Laplacian kernel: K(x, y) = e^{−‖x−y‖₁}
By Bochner's theorem, a shift-invariant kernel (if positive definite) can be written as

  k(x − y) = ∫_{R^d} P(w) e^{j wᵀ(x−y)} dw

for some probability distribution P(w).
Random Fourier Features
Taking the real part, we have

  k(x − y) = ∫_{R^d} P(w) cos(wᵀx − wᵀy) dw
           = ∫_{R^d} P(w) (cos(wᵀx) cos(wᵀy) + sin(wᵀx) sin(wᵀy)) dw
           = E_{w∼P(w)}[ z_w(x)ᵀ z_w(y) ]

where z_w(x)ᵀ = [cos(wᵀx), sin(wᵀx)].
z_w(x)ᵀ z_w(y) is an unbiased estimator of K(x, y) if w is sampled from P(·).
Random Fourier Features
Sample m vectors w1, ..., wm from the distribution P(·).
Generate random features for each sample:

  ui = [ cos(w1ᵀxi), sin(w1ᵀxi), cos(w2ᵀxi), sin(w2ᵀxi), ..., cos(wmᵀxi), sin(wmᵀxi) ]ᵀ

Then K(xi, xj) ≈ (1/m) uiᵀuj, the average of the m unbiased estimators (larger m leads to a better approximation; the 1/m factor can equivalently be absorbed by scaling the features by 1/√m).
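For the Gaussian kernel specifically, Bochner's distribution P(w) is itself Gaussian. A sketch (rff_gaussian is my own name; for K(x, y) = exp(−γ‖x−y‖²) the matching sampling distribution is N(0, 2γI), and the features are scaled by 1/√m as noted above):

```python
import numpy as np

def rff_gaussian(X, m, gamma=1.0, seed=0):
    """Random Fourier features for K(x, y) = exp(-gamma * ||x - y||^2).
    Bochner pairing: P(w) = N(0, 2*gamma*I); with the 1/sqrt(m) scaling,
    U @ U.T is the average of m unbiased single-feature estimators of K."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], m))
    Z = X @ W                    # n x m matrix of projections w_i^T x
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)
```

With a few thousand features, the entries of U Uᵀ match the exact Gaussian kernel matrix to within a few percent, and the error shrinks as O(1/√m).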
Random Features
G ≈ UUᵀ, where U = [u1, ..., un]ᵀ (rows uiᵀ).
This is a rank-2m approximation.
The problem reduces to linear classification/regression on u1, ..., un.
Time complexity: O(nmd) time + linear training.
  (Nystrom: O(nmd + m³) time + linear training.)
Prediction time: O(md) per sample.
  (Nystrom: O(md) per sample.)
Fastfood
Random Fourier features require O(md) time to generate m features.
Main bottleneck: computing wiᵀx for all i = 1, ..., m.
Can we compute this in O(m log d) time?
Fastfood, proposed in (Le et al., “Fastfood: Approximating Kernel Expansions in Loglinear Time”. ICML 2013), reduces the time by using “structured random features”.
Fastfood
Tool: fast matrix-vector multiplication with the Hadamard matrix Hd ∈ R^{d×d}:
Hd x requires only O(d log d) time.
Hadamard matrix:

  H1 = [1],   H2 = [ 1   1 ],   H_{2^k} = [ H_{2^{k−1}}   H_{2^{k−1}} ]
                   [ 1  −1 ]              [ H_{2^{k−1}}  −H_{2^{k−1}} ]

Hd x can be computed efficiently by the Fast Hadamard Transform (a divide-and-conquer scheme).
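The Fast Hadamard Transform is a butterfly recursion over log d stages, each doing d additions and subtractions (a minimal sketch; fwht is my own name, and d is assumed to be a power of two):

```python
import numpy as np

def fwht(x):
    """Compute H_d @ x in O(d log d) time for d a power of two, where H_d is
    the (unnormalized) Hadamard matrix from the recursion above."""
    x = np.asarray(x, dtype=float).copy()
    d = len(x)
    h = 1
    while h < d:
        # butterfly stage: combine blocks of size h pairwise
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x
```

Comparing against the explicit matrix built by the Sylvester recursion (H_{2^k} = kron([[1, 1], [1, −1]], H_{2^{k−1}})) confirms the two agree.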
Fastfood
Sample m = d random features by

  W = Hd · diag(v1, v2, ..., vd)

so that Wx needs only O(d log d) time.
Instead of sampling all of W (d² elements), we only sample v1, ..., vd from N(0, σ²).
Each row of W is still Gaussian (but the rows are no longer independent).
Faster computation, but performance is slightly worse than the original random features.
Extensions
Fastfood: structured random features to improve computational speed.
  (Le et al., “Fastfood: Approximating Kernel Expansions in Loglinear Time”. ICML 2013.)
Structured landmark points for Nystrom approximation:
  (Si et al., “Computationally Efficient Nystrom Approximation using Fast Transforms”. ICML 2016.)
Doubly stochastic gradient with random features:
  (Dai et al., “Scalable Kernel Methods via Doubly Stochastic Gradients”. NIPS 2015.)
Generalization bound (?)
Coming up
Choose the paper for presentation (due this Sunday, Oct 23).
Questions?