Kernel methods, kernel SVM and ridge regression
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Collaborative Filtering
R: rating matrix; U: user factor; V: movie factor
Low rank matrix approximation approach
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization
$$\min_{U, V} f(U, V) = \|R - U V^\top\|_F^2, \quad \text{s.t. } U \in \mathbb{R}^{n \times k},\; V \in \mathbb{R}^{m \times k}$$
Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameter as a random variable:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$
Posterior mean estimation: $\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta$
Maximum likelihood approach:
$$\theta_{ML} = \mathrm{argmax}_\theta\, p(D \mid \theta), \qquad \theta_{MAP} = \mathrm{argmax}_\theta\, p(\theta \mid D)$$
Bayesian prediction takes into account all possible values of $\theta$:
$$p(x_{new} \mid D) = \int p(x_{new}, \theta \mid D)\, d\theta = \int p(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$$
A frequentist prediction uses a "plug-in" estimator:
$$p(x_{new} \mid D) = p(x_{new} \mid \theta_{ML}) \quad \text{or} \quad p(x_{new} \mid D) = p(x_{new} \mid \theta_{MAP})$$
(Figure: graphical model with the parameter $\theta$, the observed data $x_i$, and the new point $x_{new}$.)
PMF: Parameter Estimation
Parameter estimation: MAP estimate
$$\theta_{MAP} = \mathrm{argmax}_\theta\, p(\theta \mid D, \alpha) = \mathrm{argmax}_\theta\, p(\theta, D \mid \alpha) = \mathrm{argmax}_\theta\, p(D \mid \theta)\, p(\theta \mid \alpha)$$
In the paper, the likelihood and the priors on $U$ and $V$ are Gaussian.
PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters $U$ and $V$
Equivalent to minimizing a sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log)
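As a sketch of this equivalence (assuming, as in the PMF paper, a Gaussian likelihood with variance $\sigma^2$, Gaussian priors on the rows $U_i$, $V_j$ with variances $\sigma_U^2$, $\sigma_V^2$, and an indicator $I_{ij}$ that is 1 when user $i$ rated movie $j$), the negative log-posterior is, up to a constant,
$$\frac{1}{2} \sum_{i,j} I_{ij} \big( R_{ij} - U_i^\top V_j \big)^2 + \frac{\lambda_U}{2} \sum_i \|U_i\|^2 + \frac{\lambda_V}{2} \sum_j \|V_j\|^2, \qquad \lambda_U = \frac{\sigma^2}{\sigma_U^2}, \; \lambda_V = \frac{\sigma^2}{\sigma_V^2}$$
so maximizing the posterior is the same as minimizing this regularized sum of squares.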
Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of $\theta$:
$$p(x_{new} \mid D) = \int p(x_{new}, \theta \mid D)\, d\theta = \int p(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$$
In the paper, all parameters and hyperparameters are integrated out.
Bayesian PMF: overall algorithm
Nonlinear classifier
(Figure: data requiring nonlinear decision boundaries vs. linear SVM decision boundaries.)
Nonlinear regression
Nonlinear regression where we also want to estimate the variance: this needs more advanced methods such as Gaussian processes and kernel regression.
(Figure: a nonlinear regression function with bands three standard deviations above and below the mean, compared with linear regression.)
Nonconventional clusters
Needs more advanced methods, such as kernel methods or spectral clustering.
Nonlinear principal component analysis
(Figure: PCA vs. nonlinear PCA.)
Support Vector Machines (SVM)
$$\min_{w} \; \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t.} \;\; (w^\top x_i + b)\, y_i \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i$$
$\xi_i$: slack variables
Solve nonlinear problems with a linear relation in feature space
Nonlinear clustering, principal component analysis, canonical correlation analysis, ...
(Figure: transform the data points; a linear decision boundary in feature space corresponds to a nonlinear decision boundary in the input space.)
SVM for nonlinear problems
Some problems need complicated, even infinite-dimensional, features
$$\phi(x) = (x, x^2, x^3, x^4, \ldots)^\top$$
Explicitly computing high-dimensional features is time-consuming and makes subsequent optimization costly
SVM for nonlinear problems
(Figure: data requiring nonlinear decision boundaries vs. linear SVM decision boundaries.)
SVM Lagrangian
Primal problem:
$$\min_{w, b} \; \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t.} \;\; (w^\top x_i + b)\, y_i \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i$$
Lagrangian:
$$L(w, b, \alpha, \beta) = \frac{1}{2} w^\top w + C \sum_i \xi_i + \sum_i \alpha_i \big( 1 - \xi_i - (w^\top x_i + b)\, y_i \big) - \sum_i \beta_i \xi_i, \qquad \alpha_i \ge 0, \; \beta_i \ge 0$$
Taking the derivative of $L(w, b, \alpha, \beta)$ with respect to $w$ and $b$, we have
$$w = \sum_i \alpha_i y_i x_i, \qquad b = y_i - w^\top x_i \;\text{ for any } i \text{ such that } 0 < \alpha_i < C$$
(The $x_i$ can be replaced by possibly infinite-dimensional features $\phi(x_i)$.)
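For completeness, a sketch of the stationarity conditions behind these expressions (the standard SVM dual derivation, not spelled out on the slide):
$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C \;\text{ (since } \alpha_i, \beta_i \ge 0\text{)}$$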
SVM dual problem
Plugging $w$ and $b$ into the Lagrangian, the dual problem is
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \;\; \sum_i \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C$$
This is a quadratic program; solve for $\alpha$, then we get
$$w = \sum_i \alpha_i y_i x_i, \qquad b = y_i - w^\top x_i \;\text{ for any } i \text{ such that } 0 < \alpha_i < C$$
Data points corresponding to nonzero $\alpha_i$ are called support vectors.
The data enter only through inner products:
$$x_i^\top x_j = \sum_k x_{ik} x_{jk}, \qquad \phi(x_i)^\top \phi(x_j) = \sum_k \phi(x_i)_k\, \phi(x_j)_k$$
Denote the inner product as a function: $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$
Kernel Functions
(Figure: kernels measure similarity between structured objects directly, e.g. $K(\cdot, \cdot)$ values of 0.6, 0.2, 0.5, and 0.7 between DNA sequences such as ACAAGAT GCCATTG TCCCCCG GCCTCCT GCTGCTG and GCATGAC GCCATTG ACCTGCT GGTCCTA; each graph maps to a vector of counts (# nodes, # edges, # triangles, # rectangles, # pentagons, ...), and the kernel is the inner product of these feature vectors.)
Problem of explicitly constructing features
Explicitly constructing the feature map $\phi(x): \mathbb{R}^n \mapsto \mathcal{F}$, the feature space can grow very large very quickly
E.g., polynomial features of degree $d$: $x_1^d,\; x_1 x_2 \cdots x_d,\; x_1^2 x_2 \cdots x_{d-1},\; \ldots$
The total number of such features is huge for large $n$ and $d$:
$$\binom{d + n - 1}{d} = \frac{(d + n - 1)!}{d!\,(n - 1)!}$$
For $d = 6$, $n = 100$, there are about 1.6 billion terms
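A quick sanity check of this count (a minimal sketch using Python's standard-library `math.comb`, available in Python 3.8+):

```python
import math

d, n = 6, 100
# Number of degree-d monomial features in n variables: C(d + n - 1, d)
num_features = math.comb(d + n - 1, d)
print(num_features)  # 1609344100, i.e. roughly 1.6 billion terms
```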
Can we avoid expanding the features?
Rather than computing the features explicitly and then computing the inner product, can we merge the two steps using a clever kernel function $k(x_i, x_j)$?
E.g., polynomial kernel with $d = 2$:
$$\phi(x)^\top \phi(y) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_2 x_1 \\ x_2^2 \end{pmatrix}^{\!\top} \begin{pmatrix} y_1^2 \\ y_1 y_2 \\ y_2 y_1 \\ y_2^2 \end{pmatrix} = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^\top y)^2$$
Polynomial kernel with $d = 2$: $k(x, y) = (x^\top y)^d = \phi(x)^\top \phi(y)$, an $O(n)$ computation!
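A minimal numerical check of this identity (a sketch using NumPy; the ordering of features in `phi` is just one valid choice):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial features of a 2-d vector (one valid ordering)."""
    return np.array([v[0] ** 2, v[0] * v[1], v[1] * v[0], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = phi(x) @ phi(y)   # inner product of explicit features
kernel = (x @ y) ** 2        # polynomial kernel, only an O(n) inner product
print(explicit, kernel)      # both print 1.0
```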
When can a function $k(x, y)$ be called a kernel function?
$k(x, y)$ is equivalent to first computing the feature $\phi(x)$ and then performing an inner product: $k(x, y) = \phi(x)^\top \phi(y)$
Given a dataset $D = \{x_1, x_2, x_3, \ldots, x_n\}$, compute the pairwise kernel function $k(x_i, x_j)$ and form an $n \times n$ kernel matrix (Gram matrix):
$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$
$k(x, y)$ is a kernel function iff the matrix $K$ is positive semidefinite for every such dataset: $\forall v \in \mathbb{R}^n,\; v^\top K v \ge 0$
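A small empirical sketch of this condition on one random dataset (eigenvalues of the Gram matrix should be nonnegative up to numerical error; this only checks the necessary condition for that particular dataset):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                          # 20 random points in R^3
K = np.array([[rbf(a, b) for b in X] for a in X])     # Gram matrix

print(np.linalg.eigvalsh(K).min() >= -1e-10)          # True: K is PSD up to round-off
```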
Kernel trick
In the dual problem of the SVM, replace the inner product by a kernel:
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^\top \phi(x_j) \quad \text{s.t.} \;\; \sum_i \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C$$
This is a quadratic program; solve for $\alpha$, then we get
$$w = \sum_i \alpha_i y_i \phi(x_i), \qquad b = y_i - w^\top \phi(x_i) \;\text{ for any } i \text{ such that } 0 < \alpha_i < C$$
Evaluate the decision boundary on a new data point:
$$f(x) = w^\top \phi(x) = \Big( \sum_i \alpha_i y_i \phi(x_i) \Big)^{\!\top} \phi(x) = \sum_i \alpha_i y_i\, k(x_i, x)$$
Everywhere, $\phi(x_i)^\top \phi(x_j)$ is computed as $k(x_i, x_j)$.
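A minimal sketch of the kernel trick at prediction time (assuming the dual variables `alpha` and the offset `b` have already been obtained from the quadratic program; the Gaussian RBF kernel and the helper name `decision_function` are illustrative, not from the slides):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def decision_function(x_new, X_train, y_train, alpha, b, kernel=rbf):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b -- phi(x) is never formed explicitly."""
    return sum(a * yi * kernel(xi, x_new)
               for a, yi, xi in zip(alpha, y_train, X_train)) + b

# Toy usage with made-up dual variables; in practice alpha and b come from the QP above.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])
b = 0.0
print(decision_function(np.array([0.9, 0.9]), X_train, y_train, alpha, b))
```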
Typical kernels for vector data
Polynomial of degree $d$
$$k(x, y) = (x^\top y)^d$$
Polynomial of degree up to $d$
$$k(x, y) = (x^\top y + c)^d$$
Exponential kernel (infinite-degree polynomial)
$$k(x, y) = \exp(s \cdot x^\top y)$$
Gaussian RBF kernel
$$k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$
Laplace kernel
$$k(x, y) = \exp\!\left( -\frac{\|x - y\|}{2\sigma^2} \right)$$
Exponentiated distance
$$k(x, y) = \exp\!\left( -\frac{d(x, y)^2}{\sigma^2} \right)$$
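These kernels are straightforward to write down as functions; a sketch (NumPy vectors; the parameter names `d`, `c`, `s`, `sigma` follow the formulas above):

```python
import numpy as np

def poly(x, y, d=3):                 # polynomial of degree d: (x'y)^d
    return (x @ y) ** d

def poly_up_to(x, y, d=3, c=1.0):    # polynomial of degree up to d: (x'y + c)^d
    return (x @ y + c) ** d

def exponential(x, y, s=1.0):        # exponential kernel: exp(s * x'y)
    return np.exp(s * (x @ y))

def gaussian_rbf(x, y, sigma=1.0):   # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplace(x, y, sigma=1.0):        # exp(-||x - y|| / (2 sigma^2)), as written on the slide
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))
```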
Shape of some kernels
Translation-invariant kernels: $k(x, y) = k(x - y)$
The decision boundary is a weighted sum of bumps:
$$f(x) = \sum_i \alpha_i y_i\, k(x_i, x)$$
How to construct more complicated kernels
Know the feature space $\phi(x)$, but find a fast way to compute the inner product $k(x, y)$
E.g., string kernels and graph kernels
Find a function $k(x, y)$ and prove it is positive semidefinite
Make sure the function captures some useful similarity between data points
Combine existing kernels to get new kernels
Which combinations still result in kernels?
Combining kernels I
Positive weighted combinations of kernels are kernels
$k_1(x, y)$ and $k_2(x, y)$ are kernels, and $\alpha, \beta \ge 0$
Then $k(x, y) = \alpha k_1(x, y) + \beta k_2(x, y)$ is a kernel
A weighted combination kernel is like concatenating the feature spaces
$$k_1(x, y) = \phi(x)^\top \phi(y), \quad k_2(x, y) = \psi(x)^\top \psi(y), \quad k(x, y) = \begin{pmatrix} \phi(x) \\ \psi(x) \end{pmatrix}^{\!\top} \begin{pmatrix} \phi(y) \\ \psi(y) \end{pmatrix}$$
Mappings between spaces give you kernels
If $k(x, y)$ is a kernel, then $k(\psi(x), \psi(y))$ is a kernel, e.g. $k(x, y) = x^2 y^2$
Kernel combination II
Products of kernels are kernels
$k_1(x, y)$ and $k_2(x, y)$ are kernels
Then $k(x, y) = k_1(x, y)\, k_2(x, y)$ is a kernel
A product kernel is like using a tensor product feature space
$$k_1(x, y) = \phi(x)^\top \phi(y), \quad k_2(x, y) = \psi(x)^\top \psi(y), \quad k(x, y) = \big( \phi(x) \otimes \psi(x) \big)^\top \big( \phi(y) \otimes \psi(y) \big)$$
$k(x, y) = k_1(x, y)\, k_2(x, y) \cdots k_d(x, y)$ is a kernel with higher-order tensor features:
$$k(x, y) = \big( \phi(x) \otimes \phi(x) \otimes \cdots \otimes \phi(x) \big)^\top \big( \phi(y) \otimes \phi(y) \otimes \cdots \otimes \phi(y) \big)$$
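An empirical sketch of these closure properties (checking that the combined Gram matrices remain positive semidefinite on random data; note that the product of two kernel functions corresponds to the elementwise product of their Gram matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))

def gram(kernel):
    """Gram matrix of a kernel on the random points X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

K1 = gram(lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2))   # Gaussian RBF
K2 = gram(lambda x, y: (x @ y + 1.0) ** 2)                  # polynomial, d = 2

# Positive weighted combination and elementwise (Schur) product of Gram matrices
for K in (2.0 * K1 + 3.0 * K2, K1 * K2):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)            # True: still PSD
```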
Nonlinear regression
Nonlinear regression where we also want to estimate the variance: this needs more advanced methods such as Gaussian processes and kernel regression.
(Figure: a nonlinear regression function with bands three standard deviations above and below the mean, compared with linear regression.)
Ridge regression
A dataset $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$, $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$
With some regularization parameter $\lambda$, the goal is to find the regression function $w$:
$$w^* = \mathrm{argmin}_w \sum_i \big( y_i - x_i^\top w \big)^2 + \lambda \|w\|^2 = \mathrm{argmin}_w \|y - X^\top w\|^2 + \lambda \|w\|^2$$
Find $w$: take the derivative of the objective and set it to zero:
$$w^* = (X X^\top + \lambda I)^{-1} X y$$
Matrix inversion lemma
Find $w$: take the derivative of the objective and set it to zero:
$$w^* = (X X^\top + \lambda I_d)^{-1} X y$$
By the matrix inversion lemma,
$$w^* = (X X^\top + \lambda I_d)^{-1} X y = X (X^\top X + \lambda I_n)^{-1} y$$
Evaluating $w^*$ on a new data point, we have
$$w^{*\top} x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$$
The above expression depends only on inner products!
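A one-line justification of this identity (a sketch; it is the "push-through" special case of the matrix inversion lemma):
$$X \big( X^\top X + \lambda I_n \big) = X X^\top X + \lambda X = \big( X X^\top + \lambda I_d \big) X \;\;\Longrightarrow\;\; \big( X X^\top + \lambda I_d \big)^{-1} X = X \big( X^\top X + \lambda I_n \big)^{-1}$$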
Kernel ridge regression
$w^{*\top} x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$ depends only on inner products!
$$X^\top X = \begin{pmatrix} x_1^\top x_1 & \cdots & x_1^\top x_n \\ \vdots & \ddots & \vdots \\ x_n^\top x_1 & \cdots & x_n^\top x_n \end{pmatrix}, \qquad X^\top x = \begin{pmatrix} x_1^\top x \\ \vdots \\ x_n^\top x \end{pmatrix}$$
Kernel ridge regression: replace the inner products by a kernel function
$$X^\top X \to K = \big( k(x_i, x_j) \big)_{n \times n}, \qquad X^\top x \to k_x = \big( k(x_i, x) \big)_{n \times 1}$$
$$f(x) = y^\top (K + \lambda I_n)^{-1} k_x$$
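A minimal kernel ridge regression sketch implementing exactly this formula (Gaussian RBF kernel; `lam` plays the role of $\lambda$; the helper names are illustrative):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def krr_fit(X, y, lam=0.1, sigma=1.0):
    """Precompute c = (K + lam I)^{-1} y, so that f(x) = c^T k_x."""
    K = np.array([[rbf(a, b, sigma) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(x_new, X, c, sigma=1.0):
    k_x = np.array([rbf(xi, x_new, sigma) for xi in X])
    return c @ k_x

# Toy 1-d example: learn y = sin(x) from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
c = krr_fit(X, y, lam=0.1, sigma=0.7)
print(krr_predict(np.array([1.0]), X, c, sigma=0.7))   # should be close to sin(1) ~ 0.84
```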
Kernel ridge regression
Use the Gaussian RBF kernel
$$k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$
Use cross-validation to choose parameters
(Figure: kernel ridge regression fits for four settings of the parameters: small vs. large $\sigma$ combined with small vs. large $\lambda$.)
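One way to do this cross-validation in practice is with scikit-learn (a sketch; `KernelRidge` parametrizes the RBF kernel as $\exp(-\gamma \|x - y\|^2)$, so `gamma` corresponds to $1/(2\sigma^2)$ and `alpha` to $\lambda$; the grid values are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

# 5-fold cross-validation over the regularization strength and kernel width.
grid = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e-3, 1e-2, 1e-1, 1.0],
                "gamma": [0.1, 0.5, 1.0, 5.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # chosen regularization and kernel width
```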