Kernel methods, kernel SVM and ridge regression
Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Le Song
Collaborative Filtering
R: rating matrix; U: user factor; V: movie factor
Low rank matrix approximation approach
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization
$$\min_{U, V} f(U, V) = \|R - U V^\top\|_F^2, \quad \text{s.t. } U \in \mathbb{R}^{n \times k},\; V \in \mathbb{R}^{m \times k}$$
Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameter as a random variable:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$
Posterior mean estimation: $\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta$
Maximum likelihood approach:
$$\theta_{ML} = \mathrm{argmax}_\theta\, p(D \mid \theta), \qquad \theta_{MAP} = \mathrm{argmax}_\theta\, p(\theta \mid D)$$
Bayesian prediction takes into account all possible values of $\theta$:
$$p(x_{new} \mid D) = \int p(x_{new}, \theta \mid D)\, d\theta = \int p(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$$
A frequentist prediction uses a "plug-in" estimator:
$$p(x_{new} \mid D) = p(x_{new} \mid \theta_{ML}) \quad \text{or} \quad p(x_{new} \mid D) = p(x_{new} \mid \theta_{MAP})$$
(Figure: graphical model with the parameter $\theta$, the observed data $x_i$, and the new point $x_{new}$.)
PMF: Parameter Estimation
Parameter estimation: MAP estimate
$$\theta_{MAP} = \mathrm{argmax}_\theta\, p(\theta \mid D, \alpha) = \mathrm{argmax}_\theta\, p(\theta, D \mid \alpha) = \mathrm{argmax}_\theta\, p(D \mid \theta)\, p(\theta \mid \alpha)$$
In the paper, the likelihood and the priors on $U$ and $V$ are Gaussian.
PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters $U$ and $V$
Equivalent to minimizing a sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log)
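As a sketch of this equivalence (assuming, as in the PMF paper, a Gaussian likelihood with variance $\sigma^2$, Gaussian priors on the rows $U_i$, $V_j$ with variances $\sigma_U^2$, $\sigma_V^2$, and an indicator $I_{ij}$ that is 1 when user $i$ rated movie $j$), the negative log-posterior is, up to a constant,
$$\frac{1}{2} \sum_{i,j} I_{ij} \big( R_{ij} - U_i^\top V_j \big)^2 + \frac{\lambda_U}{2} \sum_i \|U_i\|^2 + \frac{\lambda_V}{2} \sum_j \|V_j\|^2, \qquad \lambda_U = \frac{\sigma^2}{\sigma_U^2}, \; \lambda_V = \frac{\sigma^2}{\sigma_V^2}$$
so maximizing the posterior is the same as minimizing this regularized sum of squares.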
Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of $\theta$:
$$p(x_{new} \mid D) = \int p(x_{new}, \theta \mid D)\, d\theta = \int p(x_{new} \mid \theta)\, p(\theta \mid D)\, d\theta$$
In the paper, all parameters and hyperparameters are integrated out.
Bayesian PMF: overall algorithm
Nonlinear classifier
(Figure: data requiring nonlinear decision boundaries vs. linear SVM decision boundaries.)
Nonlinear regression
Nonlinear regression where we also want to estimate the variance: this needs more advanced methods such as Gaussian processes and kernel regression.
(Figure: a nonlinear regression function with bands three standard deviations above and below the mean, compared with linear regression.)
Nonconventional clusters
Needs more advanced methods, such as kernel methods or spectral clustering.
Nonlinear principal component analysis
(Figure: PCA vs. nonlinear PCA.)
Support Vector Machines (SVM)
$$\min_{w} \; \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t.} \;\; (w^\top x_i + b)\, y_i \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i$$
$\xi_i$: slack variables
Solve nonlinear problems with a linear relation in feature space
Nonlinear clustering, principal component analysis, canonical correlation analysis, ...
(Figure: transform the data points; a linear decision boundary in feature space corresponds to a nonlinear decision boundary in the input space.)
SVM for nonlinear problems
Some problems need complicated, even infinite-dimensional, features
$$\phi(x) = (x, x^2, x^3, x^4, \ldots)^\top$$
Explicitly computing high-dimensional features is time-consuming and makes subsequent optimization costly
SVM for nonlinear problems
(Figure: data requiring nonlinear decision boundaries vs. linear SVM decision boundaries.)
SVM Lagrangian
Primal problem:
$$\min_{w, b} \; \frac{1}{2} w^\top w + C \sum_i \xi_i \quad \text{s.t.} \;\; (w^\top x_i + b)\, y_i \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i$$
Lagrangian:
$$L(w, b, \alpha, \beta) = \frac{1}{2} w^\top w + C \sum_i \xi_i + \sum_i \alpha_i \big( 1 - \xi_i - (w^\top x_i + b)\, y_i \big) - \sum_i \beta_i \xi_i, \qquad \alpha_i \ge 0, \; \beta_i \ge 0$$
Taking the derivative of $L(w, b, \alpha, \beta)$ with respect to $w$ and $b$, we have
$$w = \sum_i \alpha_i y_i x_i, \qquad b = y_i - w^\top x_i \;\text{ for any } i \text{ such that } 0 < \alpha_i < C$$
(The $x_i$ can be replaced by possibly infinite-dimensional features $\phi(x_i)$.)
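For completeness, a sketch of the stationarity conditions behind these expressions (the standard SVM dual derivation, not spelled out on the slide):
$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C \;\text{ (since } \alpha_i, \beta_i \ge 0\text{)}$$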
SVM dual problem
Plugging $w$ and $b$ into the Lagrangian, the dual problem is
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \;\; \sum_i \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C$$
This is a quadratic program; solve for $\alpha$, then we get
$$w = \sum_i \alpha_i y_i x_i, \qquad b = y_i - w^\top x_i \;\text{ for any } i \text{ such that } 0 < \alpha_i < C$$
Data points corresponding to nonzero $\alpha_i$ are called support vectors.
The data enter only through inner products:
$$x_i^\top x_j = \sum_k x_{ik} x_{jk}, \qquad \phi(x_i)^\top \phi(x_j) = \sum_k \phi(x_i)_k\, \phi(x_j)_k$$
Denote the inner product as a function: $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$
Kernel Functions
(Figure: kernels measure similarity between structured objects directly, e.g. $K(\cdot, \cdot)$ values of 0.6, 0.2, 0.5, and 0.7 between DNA sequences such as ACAAGAT GCCATTG TCCCCCG GCCTCCT GCTGCTG and GCATGAC GCCATTG ACCTGCT GGTCCTA; each graph maps to a vector of counts (# nodes, # edges, # triangles, # rectangles, # pentagons, ...), and the kernel is the inner product of these feature vectors.)
Problem of explicitly constructing features
Explicitly constructing the feature map $\phi(x): \mathbb{R}^n \mapsto \mathcal{F}$, the feature space can grow very large very quickly
E.g., polynomial features of degree $d$: $x_1^d,\; x_1 x_2 \cdots x_d,\; x_1^2 x_2 \cdots x_{d-1},\; \ldots$
The total number of such features is huge for large $n$ and $d$:
$$\binom{d + n - 1}{d} = \frac{(d + n - 1)!}{d!\,(n - 1)!}$$
For $d = 6$, $n = 100$, there are about 1.6 billion terms
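A quick sanity check of this count (a minimal sketch using Python's standard-library `math.comb`, available in Python 3.8+):

```python
import math

d, n = 6, 100
# Number of degree-d monomial features in n variables: C(d + n - 1, d)
num_features = math.comb(d + n - 1, d)
print(num_features)  # 1609344100, i.e. roughly 1.6 billion terms
```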
Can we avoid expanding the features?
Rather than computing the features explicitly and then computing the inner product, can we merge the two steps using a clever kernel function $k(x_i, x_j)$?
E.g., polynomial kernel with $d = 2$:
$$\phi(x)^\top \phi(y) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_2 x_1 \\ x_2^2 \end{pmatrix}^{\!\top} \begin{pmatrix} y_1^2 \\ y_1 y_2 \\ y_2 y_1 \\ y_2^2 \end{pmatrix} = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^\top y)^2$$
Polynomial kernel with $d = 2$: $k(x, y) = (x^\top y)^d = \phi(x)^\top \phi(y)$, an $O(n)$ computation!
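A minimal numerical check of this identity (a sketch using NumPy; the ordering of features in `phi` is just one valid choice):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial features of a 2-d vector (one valid ordering)."""
    return np.array([v[0] ** 2, v[0] * v[1], v[1] * v[0], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = phi(x) @ phi(y)   # inner product of explicit features
kernel = (x @ y) ** 2        # polynomial kernel, only an O(n) inner product
print(explicit, kernel)      # both print 1.0
```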
When can a function $k(x, y)$ be called a kernel function?
$k(x, y)$ is equivalent to first computing the feature $\phi(x)$ and then performing an inner product: $k(x, y) = \phi(x)^\top \phi(y)$
Given a dataset $D = \{x_1, x_2, x_3, \ldots, x_n\}$, compute the pairwise kernel function $k(x_i, x_j)$ and form an $n \times n$ kernel matrix (Gram matrix):
$$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$$
$k(x, y)$ is a kernel function iff the matrix $K$ is positive semidefinite for every such dataset: $\forall v \in \mathbb{R}^n,\; v^\top K v \ge 0$
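A small empirical sketch of this condition on one random dataset (eigenvalues of the Gram matrix should be nonnegative up to numerical error; this only checks the necessary condition for that particular dataset):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                          # 20 random points in R^3
K = np.array([[rbf(a, b) for b in X] for a in X])     # Gram matrix

print(np.linalg.eigvalsh(K).min() >= -1e-10)          # True: K is PSD up to round-off
```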
Kernel trick
In the dual problem of the SVM, replace the inner product by a kernel:
$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^\top \phi(x_j) \quad \text{s.t.} \;\; \sum_i \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C$$
This is a quadratic program; solve for $\alpha$, then we get
$$w = \sum_i \alpha_i y_i \phi(x_i), \qquad b = y_i - w^\top \phi(x_i) \;\text{ for any } i \text{ such that } 0 < \alpha_i < C$$
Evaluate the decision boundary on a new data point:
$$f(x) = w^\top \phi(x) = \Big( \sum_i \alpha_i y_i \phi(x_i) \Big)^{\!\top} \phi(x) = \sum_i \alpha_i y_i\, k(x_i, x)$$
Everywhere, $\phi(x_i)^\top \phi(x_j)$ is computed as $k(x_i, x_j)$.
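A minimal sketch of the kernel trick at prediction time (assuming the dual variables `alpha` and the offset `b` have already been obtained from the quadratic program; the Gaussian RBF kernel and the helper name `decision_function` are illustrative, not from the slides):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def decision_function(x_new, X_train, y_train, alpha, b, kernel=rbf):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b -- phi(x) is never formed explicitly."""
    return sum(a * yi * kernel(xi, x_new)
               for a, yi, xi in zip(alpha, y_train, X_train)) + b

# Toy usage with made-up dual variables; in practice alpha and b come from the QP above.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])
b = 0.0
print(decision_function(np.array([0.9, 0.9]), X_train, y_train, alpha, b))
```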
Typical kernels for vector data
Polynomial of degree $d$
$$k(x, y) = (x^\top y)^d$$
Polynomial of degree up to $d$
$$k(x, y) = (x^\top y + c)^d$$
Exponential kernel (infinite-degree polynomial)
$$k(x, y) = \exp(s \cdot x^\top y)$$
Gaussian RBF kernel
$$k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$
Laplace kernel
$$k(x, y) = \exp\!\left( -\frac{\|x - y\|}{2\sigma^2} \right)$$
Exponentiated distance
$$k(x, y) = \exp\!\left( -\frac{d(x, y)^2}{\sigma^2} \right)$$
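These kernels are straightforward to write down as functions; a sketch (NumPy vectors; the parameter names `d`, `c`, `s`, `sigma` follow the formulas above):

```python
import numpy as np

def poly(x, y, d=3):                 # polynomial of degree d: (x'y)^d
    return (x @ y) ** d

def poly_up_to(x, y, d=3, c=1.0):    # polynomial of degree up to d: (x'y + c)^d
    return (x @ y + c) ** d

def exponential(x, y, s=1.0):        # exponential kernel: exp(s * x'y)
    return np.exp(s * (x @ y))

def gaussian_rbf(x, y, sigma=1.0):   # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplace(x, y, sigma=1.0):        # exp(-||x - y|| / (2 sigma^2)), as written on the slide
    return np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2))
```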
Shape of some kernels
Translation-invariant kernels: $k(x, y) = k(x - y)$
The decision boundary is a weighted sum of bumps:
$$f(x) = \sum_i \alpha_i y_i\, k(x_i, x)$$
How to construct more complicated kernels
Know the feature space $\phi(x)$, but find a fast way to compute the inner product $k(x, y)$
E.g., string kernels and graph kernels
Find a function $k(x, y)$ and prove it is positive semidefinite
Make sure the function captures some useful similarity between data points
Combine existing kernels to get new kernels
Which combinations still result in kernels?
Combining kernels I
Positive weighted combinations of kernels are kernels
$k_1(x, y)$ and $k_2(x, y)$ are kernels, and $\alpha, \beta \ge 0$
Then $k(x, y) = \alpha k_1(x, y) + \beta k_2(x, y)$ is a kernel
A weighted combination kernel is like concatenating the feature spaces
$$k_1(x, y) = \phi(x)^\top \phi(y), \quad k_2(x, y) = \psi(x)^\top \psi(y), \quad k(x, y) = \begin{pmatrix} \phi(x) \\ \psi(x) \end{pmatrix}^{\!\top} \begin{pmatrix} \phi(y) \\ \psi(y) \end{pmatrix}$$
Mappings between spaces give you kernels
If $k(x, y)$ is a kernel, then $k(\psi(x), \psi(y))$ is a kernel, e.g. $k(x, y) = x^2 y^2$
Kernel combination II
Products of kernels are kernels
$k_1(x, y)$ and $k_2(x, y)$ are kernels
Then $k(x, y) = k_1(x, y)\, k_2(x, y)$ is a kernel
A product kernel is like using a tensor product feature space
$$k_1(x, y) = \phi(x)^\top \phi(y), \quad k_2(x, y) = \psi(x)^\top \psi(y), \quad k(x, y) = \big( \phi(x) \otimes \psi(x) \big)^\top \big( \phi(y) \otimes \psi(y) \big)$$
$k(x, y) = k_1(x, y)\, k_2(x, y) \cdots k_d(x, y)$ is a kernel with higher-order tensor features:
$$k(x, y) = \big( \phi(x) \otimes \phi(x) \otimes \cdots \otimes \phi(x) \big)^\top \big( \phi(y) \otimes \phi(y) \otimes \cdots \otimes \phi(y) \big)$$
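An empirical sketch of these closure properties (checking that the combined Gram matrices remain positive semidefinite on random data; note that the product of two kernel functions corresponds to the elementwise product of their Gram matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))

def gram(kernel):
    """Gram matrix of a kernel on the random points X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

K1 = gram(lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2))   # Gaussian RBF
K2 = gram(lambda x, y: (x @ y + 1.0) ** 2)                  # polynomial, d = 2

# Positive weighted combination and elementwise (Schur) product of Gram matrices
for K in (2.0 * K1 + 3.0 * K2, K1 * K2):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)            # True: still PSD
```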
Nonlinear regression
Nonlinear regression where we also want to estimate the variance: this needs more advanced methods such as Gaussian processes and kernel regression.
(Figure: a nonlinear regression function with bands three standard deviations above and below the mean, compared with linear regression.)
Ridge regression
A dataset $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$, $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$
With some regularization parameter $\lambda$, the goal is to find the regression function $w$:
$$w^* = \mathrm{argmin}_w \sum_i \big( y_i - x_i^\top w \big)^2 + \lambda \|w\|^2 = \mathrm{argmin}_w \|y - X^\top w\|^2 + \lambda \|w\|^2$$
Find $w$: take the derivative of the objective and set it to zero:
$$w^* = (X X^\top + \lambda I)^{-1} X y$$
Matrix inversion lemma
Find $w$: take the derivative of the objective and set it to zero:
$$w^* = (X X^\top + \lambda I_d)^{-1} X y$$
By the matrix inversion lemma,
$$w^* = (X X^\top + \lambda I_d)^{-1} X y = X (X^\top X + \lambda I_n)^{-1} y$$
Evaluating $w^*$ on a new data point, we have
$$w^{*\top} x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$$
The above expression depends only on inner products!
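A one-line justification of this identity (a sketch; it is the "push-through" special case of the matrix inversion lemma):
$$X \big( X^\top X + \lambda I_n \big) = X X^\top X + \lambda X = \big( X X^\top + \lambda I_d \big) X \;\;\Longrightarrow\;\; \big( X X^\top + \lambda I_d \big)^{-1} X = X \big( X^\top X + \lambda I_n \big)^{-1}$$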
Kernel ridge regression
$w^{*\top} x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$ depends only on inner products!
$$X^\top X = \begin{pmatrix} x_1^\top x_1 & \cdots & x_1^\top x_n \\ \vdots & \ddots & \vdots \\ x_n^\top x_1 & \cdots & x_n^\top x_n \end{pmatrix}, \qquad X^\top x = \begin{pmatrix} x_1^\top x \\ \vdots \\ x_n^\top x \end{pmatrix}$$
Kernel ridge regression: replace the inner products by a kernel function
$$X^\top X \to K = \big( k(x_i, x_j) \big)_{n \times n}, \qquad X^\top x \to k_x = \big( k(x_i, x) \big)_{n \times 1}$$
$$f(x) = y^\top (K + \lambda I_n)^{-1} k_x$$
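A minimal kernel ridge regression sketch implementing exactly this formula (Gaussian RBF kernel; `lam` plays the role of $\lambda$; the helper names are illustrative):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def krr_fit(X, y, lam=0.1, sigma=1.0):
    """Precompute c = (K + lam I)^{-1} y, so that f(x) = c^T k_x."""
    K = np.array([[rbf(a, b, sigma) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(x_new, X, c, sigma=1.0):
    k_x = np.array([rbf(xi, x_new, sigma) for xi in X])
    return c @ k_x

# Toy 1-d example: learn y = sin(x) from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
c = krr_fit(X, y, lam=0.1, sigma=0.7)
print(krr_predict(np.array([1.0]), X, c, sigma=0.7))   # should be close to sin(1) ~ 0.84
```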
Kernel ridge regression
Use the Gaussian RBF kernel
$$k(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$
Use cross-validation to choose parameters
(Figure: kernel ridge regression fits for four settings of the parameters: small vs. large $\sigma$ combined with small vs. large $\lambda$.)
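One way to do this cross-validation in practice is with scikit-learn (a sketch; `KernelRidge` parametrizes the RBF kernel as $\exp(-\gamma \|x - y\|^2)$, so `gamma` corresponds to $1/(2\sigma^2)$ and `alpha` to $\lambda$; the grid values are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

# 5-fold cross-validation over the regularization strength and kernel width.
grid = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e-3, 1e-2, 1e-1, 1.0],
                "gamma": [0.1, 0.5, 1.0, 5.0]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # chosen regularization and kernel width
```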