Kernel methods, kernel SVM and ridge regression

Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

Le Song


Page 1

Kernel methods, kernel SVM and ridge regression

Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Le Song

Page 2

Collaborative Filtering


Page 3

Collaborative Filtering

R: rating matrix; U: user factor; V: movie factor

Low rank matrix approximation approach

Probabilistic matrix factorization

Bayesian probabilistic matrix factorization

min_{𝑈,𝑉} 𝑓(𝑈, 𝑉) = ‖𝑅 − 𝑈𝑉⊤‖²_F, s.t. 𝑈 ∈ 𝑅^{𝑛×𝑘}, 𝑉 ∈ 𝑅^{𝑚×𝑘}, 𝑘 ≪ 𝑚, 𝑛
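The low-rank objective above can be minimized by alternating least squares: fix 𝑉 and solve for 𝑈 in closed form, then vice versa. Below is a minimal numpy sketch on a fully observed synthetic rating matrix; the PMF-style regularization, the rank constraint bookkeeping, and missing-entry handling are omitted, and all sizes are illustrative.

```python
import numpy as np

# Alternating least squares for R ≈ U V^T (fully observed, no regularizer).
rng = np.random.default_rng(0)
n, m, k = 20, 15, 3
U_true = rng.normal(size=(n, k))
V_true = rng.normal(size=(m, k))
R = U_true @ V_true.T                  # synthetic rank-k "rating" matrix

V = rng.normal(size=(m, k))            # random init
for _ in range(10):
    U = R @ V @ np.linalg.inv(V.T @ V)     # argmin_U ||R - U V^T||_F^2
    V = R.T @ U @ np.linalg.inv(U.T @ U)   # argmin_V ||R - U V^T||_F^2

err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
print(f"relative error: {err:.2e}")
```

Because the synthetic 𝑅 has rank exactly 𝑘, a few sweeps already recover it to numerical precision; on real rating data one sums only over observed entries and adds the quadratic penalties discussed on the following slides.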

Page 4

Parameter Estimation and Prediction

The Bayesian approach treats the unknown parameter as a random variable:

𝑃(𝜃|𝐷) = 𝑃(𝐷|𝜃)𝑃(𝜃) / 𝑃(𝐷) = 𝑃(𝐷|𝜃)𝑃(𝜃) / ∫ 𝑃(𝐷|𝜃)𝑃(𝜃) 𝑑𝜃

Posterior mean estimation: 𝜃_Bayes = ∫ 𝜃 𝑃(𝜃|𝐷) 𝑑𝜃

Maximum likelihood and MAP approaches:

𝜃_ML = argmax_𝜃 𝑃(𝐷|𝜃),  𝜃_MAP = argmax_𝜃 𝑃(𝜃|𝐷)

Bayesian prediction takes into account all possible values of 𝜃:

𝑃(𝑥_new|𝐷) = ∫ 𝑃(𝑥_new, 𝜃|𝐷) 𝑑𝜃 = ∫ 𝑃(𝑥_new|𝜃) 𝑃(𝜃|𝐷) 𝑑𝜃

A frequentist prediction uses a "plug-in" estimator:

𝑃(𝑥_new|𝐷) ≈ 𝑃(𝑥_new|𝜃_ML) or 𝑃(𝑥_new|𝐷) ≈ 𝑃(𝑥_new|𝜃_MAP)

(Graphical model: parameter 𝜃 generates the observations 𝑋₁, …, 𝑋_𝑁 and the new point 𝑋_new.)

Page 5

PMF: Parameter Estimation

Parameter estimation: MAP estimate

๐œƒ๐‘€๐ด๐‘ƒ = ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘Ž๐‘ฅ๐œƒ๐‘ƒ ๐œƒ ๐ท, ๐›ผ = ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘Ž๐‘ฅ๐œƒ๐‘ƒ ๐œƒ, ๐ท ๐›ผ

= ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘Ž๐‘ฅ๐œƒ๐‘ƒ ๐ท ๐œƒ ๐‘ƒ(๐œƒ|๐›ผ)

(In the PMF paper, the likelihood and the priors on 𝑈 and 𝑉 are Gaussian.)

Page 6

PMF: Interpret prior as regularization

Maximize the posterior distribution with respect to the parameters 𝑈 and 𝑉.

This is equivalent to minimizing the sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log).
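This equivalence can be checked numerically. The sketch below (notation illustrative, not the paper's) takes a Gaussian likelihood 𝑅ᵢⱼ ~ 𝑁((𝑈𝑉⊤)ᵢⱼ, 𝑠²) and Gaussian priors on 𝑈 and 𝑉, and verifies that the negative log posterior equals the regularized squared error divided by 𝑠², with regularization weights 𝜆_𝑈 = 𝑠²/𝑠²_𝑈, 𝜆_𝑉 = 𝑠²/𝑠²_𝑉.

```python
import numpy as np

# MAP with Gaussian prior == quadratic regularization (up to constants).
rng = np.random.default_rng(1)
n, m, k = 5, 4, 2
R = rng.normal(size=(n, m))
s2, s2_u, s2_v = 0.5, 1.0, 2.0        # likelihood and prior variances

def neg_log_posterior(U, V):
    ll = ((R - U @ V.T) ** 2).sum() / (2 * s2)   # -log likelihood + const
    pu = (U ** 2).sum() / (2 * s2_u)             # -log prior on U + const
    pv = (V ** 2).sum() / (2 * s2_v)             # -log prior on V + const
    return ll + pu + pv

def regularized_sse(U, V):
    lam_u, lam_v = s2 / s2_u, s2 / s2_v
    return 0.5 * ((R - U @ V.T) ** 2).sum() \
        + 0.5 * lam_u * (U ** 2).sum() + 0.5 * lam_v * (V ** 2).sum()

U1, V1 = rng.normal(size=(n, k)), rng.normal(size=(m, k))
val1 = neg_log_posterior(U1, V1)
val2 = regularized_sse(U1, V1) / s2
print(abs(val1 - val2))               # essentially zero
```

So maximizing the posterior and minimizing the regularized error give the same 𝑈, 𝑉.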


Page 7

Bayesian PMF: predicting new ratings

Bayesian prediction takes into account all possible values of 𝜃:

𝑃(𝑥_new|𝐷) = ∫ 𝑃(𝑥_new, 𝜃|𝐷) 𝑑𝜃 = ∫ 𝑃(𝑥_new|𝜃) 𝑃(𝜃|𝐷) 𝑑𝜃

In the paper, all parameters and hyperparameters are integrated out (the integral is approximated with Gibbs sampling).


Page 8

Bayesian PMF: overall algorithm


Page 9

Nonlinear classifier

(Figure: a nonlinear decision boundary vs. linear SVM decision boundaries.)

Page 10

Nonlinear regression

For nonlinear regression where we also want to estimate the variance, we need advanced methods such as Gaussian processes and kernel regression.

(Figure: linear regression vs. a nonlinear regression function, with curves three standard deviations above and below the mean.)

Page 11

Nonconventional clusters

Nonconventional clusters need more advanced methods, such as kernel methods or spectral clustering.

Page 12

Nonlinear principal component analysis

(Figure: PCA vs. nonlinear PCA.)

Page 13

Support Vector Machines (SVM)

min๐‘ค

1

2๐‘คโŠค๐‘ค + ๐ถ ๐œ‰๐‘—๐‘—

๐‘ . ๐‘ก. ๐‘คโŠค๐‘ฅ๐‘— + ๐‘ ๐‘ฆ๐‘— โ‰ฅ 1 โˆ’ ๐œ‰๐‘— , ๐œ‰๐‘— โ‰ฅ 0, โˆ€๐‘—

13

๐œ‰๐‘—: Slack

variables

Page 14

SVM for nonlinear problems

Solve a nonlinear problem with a linear relation in feature space: transform the data points 𝑥 ↦ 𝜙(𝑥), so that a linear decision boundary in feature space corresponds to a nonlinear decision boundary in the original space.

The same idea applies to nonlinear clustering, principal component analysis, canonical correlation analysis, …

Page 15

SVM for nonlinear problems

Some problems need complicated, even infinite-dimensional, features:

𝜙(𝑥) = (𝑥, 𝑥², 𝑥³, 𝑥⁴, …)⊤

Explicitly computing high-dimensional features is time consuming, and makes the subsequent optimization costly.

(Figure: nonlinear vs. linear SVM decision boundaries.)

Page 16

SVM Lagrangian

Primal problem:

min_{𝑤,𝜉} ½ 𝑤⊤𝑤 + 𝐶 Σⱼ 𝜉ⱼ

s.t. 𝑦ⱼ(𝑤⊤𝑥ⱼ + 𝑏) ≥ 1 − 𝜉ⱼ, 𝜉ⱼ ≥ 0, ∀𝑗

Lagrangian:

𝐿(𝑤, 𝜉, 𝛼, 𝛽) = ½ 𝑤⊤𝑤 + 𝐶 Σⱼ 𝜉ⱼ + Σⱼ 𝛼ⱼ(1 − 𝜉ⱼ − 𝑦ⱼ(𝑤⊤𝑥ⱼ + 𝑏)) − Σⱼ 𝛽ⱼ𝜉ⱼ

𝛼ᵢ ≥ 0, 𝛽ᵢ ≥ 0

Taking derivatives of 𝐿(𝑤, 𝜉, 𝛼, 𝛽) with respect to 𝑤 and 𝜉 and setting them to zero, we have

𝑤 = Σⱼ 𝛼ⱼ𝑦ⱼ𝑥ⱼ

𝑏 = 𝑦ₖ − 𝑤⊤𝑥ₖ for any 𝑘 such that 0 < 𝛼ₖ < 𝐶

(The 𝑥ⱼ here can be replaced by possibly infinite-dimensional features 𝜙(𝑥ⱼ).)

Page 17

SVM dual problem

Plug in ๐‘ค and ๐‘ into the Lagrangian, and the dual problem

๐‘€๐‘Ž๐‘ฅ๐›ผ ๐›ผ๐‘–๐‘– โˆ’1

2 ๐›ผ๐‘–๐›ผ๐‘—๐‘ฆ๐‘–๐‘ฆ๐‘—๐‘ฅ๐‘–

โŠค๐‘ฅ๐‘—๐‘–,๐‘—

๐‘ . ๐‘ก. ๐›ผ๐‘–๐‘ฆ๐‘–๐‘– = 0

0 โ‰ค ๐›ผ๐‘– โ‰ค ๐ถ

It is a quadratic program; solving for 𝛼, we get

𝑤 = Σⱼ 𝛼ⱼ𝑦ⱼ𝑥ⱼ

𝑏 = 𝑦ₖ − 𝑤⊤𝑥ₖ for any 𝑘 such that 0 < 𝛼ₖ < 𝐶

Data points with nonzero 𝛼ᵢ are called support vectors.

Note: 𝑥ᵢ⊤𝑥ⱼ = Σₗ 𝑥ᵢₗ𝑥ⱼₗ, and likewise 𝜙(𝑥ᵢ)⊤𝜙(𝑥ⱼ) = Σₗ 𝜙(𝑥ᵢ)ₗ𝜙(𝑥ⱼ)ₗ — the data enter only through inner products.

Page 18

Kernel Functions

Denote the inner product as a function: 𝑘(𝑥ᵢ, 𝑥ⱼ) = 𝜙(𝑥ᵢ)⊤𝜙(𝑥ⱼ)

(Figure: kernels as similarity scores, e.g. 𝐾(·, ·) = 0.6 between pairs of images, between DNA sequences such as ACAAGAT… and GCATGAC…, and between graphs, each mapped to counts of nodes, edges, triangles, rectangles, pentagons, …, whose inner product gives the kernel value.)

Page 19

Problems with explicitly constructing features

If we explicitly construct features 𝜙(𝑥): 𝑅^𝑚 ↦ 𝐹, the feature space can grow really large, really quickly.

E.g. polynomial features of degree 𝑑:

𝑥₁^𝑑, 𝑥₁𝑥₂⋯𝑥_𝑑, 𝑥₁²𝑥₂⋯𝑥_{𝑑−1}, …

The total number of such features is huge for large 𝑚 and 𝑑:

C(𝑑 + 𝑚 − 1, 𝑑) = (𝑑 + 𝑚 − 1)! / (𝑑! (𝑚 − 1)!)

For 𝑑 = 6, 𝑚 = 100, there are about 1.6 billion terms.


Page 20

Can we avoid expanding the features?

Rather than computing the features explicitly and then computing the inner product, can we merge the two steps using a clever kernel function 𝑘(𝑥ᵢ, 𝑥ⱼ)?

E.g. polynomial kernel with 𝑑 = 2:

𝜙(𝑥)⊤𝜙(𝑦) = (𝑥₁², 𝑥₁𝑥₂, 𝑥₂𝑥₁, 𝑥₂²)(𝑦₁², 𝑦₁𝑦₂, 𝑦₂𝑦₁, 𝑦₂²)⊤

= 𝑥₁²𝑦₁² + 2𝑥₁𝑥₂𝑦₁𝑦₂ + 𝑥₂²𝑦₂²

= (𝑥₁𝑦₁ + 𝑥₂𝑦₂)² = (𝑥⊤𝑦)²

In general, the polynomial kernel 𝑘(𝑥, 𝑦) = (𝑥⊤𝑦)^𝑑 = 𝜙(𝑥)⊤𝜙(𝑦) takes only 𝑂(𝑚) computation!

Page 21

What ๐‘˜(๐‘ฅ, ๐‘ฆ) can be called a kernel function?

๐‘˜(๐‘ฅ, ๐‘ฆ) equivalent to first compute feature ๐œ™(๐‘ฅ), and then perform inner product ๐‘˜ ๐‘ฅ, ๐‘ฆ = ๐œ™ ๐‘ฅ โŠค๐œ™ ๐‘ฆ

A dataset ๐ท = ๐‘ฅ1, ๐‘ฅ2, ๐‘ฅ3โ€ฆ๐‘ฅ๐‘›

Compute pairwise kernel function ๐‘˜ ๐‘ฅ๐‘– , ๐‘ฅ๐‘— and form a ๐‘› ร— ๐‘›

kernel matrix (Gram matrix)

๐พ =๐‘˜(๐‘ฅ1, ๐‘ฅ2) โ€ฆ ๐‘˜(๐‘ฅ1, ๐‘ฅ๐‘›)โ‹ฎ โ‹ฑ โ‹ฎ

๐‘˜(๐‘ฅ๐‘›, ๐‘ฅ1) โ€ฆ ๐‘˜(๐‘ฅ๐‘›, ๐‘ฅ๐‘›)

๐‘˜(๐‘ฅ, ๐‘ฆ) is a kernel function, iff matrix ๐พ is positive semidefinite

โˆ€๐‘ฃ โˆˆ ๐‘…๐‘›, ๐‘ฃโŠค๐พ๐‘ฃ โ‰ฅ 0 21

Page 22

Kernel trick

In the dual problem of the SVM, replace the inner product by a kernel:

max_𝛼 Σᵢ 𝛼ᵢ − ½ Σᵢ,ⱼ 𝛼ᵢ𝛼ⱼ𝑦ᵢ𝑦ⱼ 𝜙(𝑥ᵢ)⊤𝜙(𝑥ⱼ), with 𝜙(𝑥ᵢ)⊤𝜙(𝑥ⱼ) = 𝑘(𝑥ᵢ, 𝑥ⱼ)

s.t. Σᵢ 𝛼ᵢ𝑦ᵢ = 0, 0 ≤ 𝛼ᵢ ≤ 𝐶

It is a quadratic program; solving for 𝛼, we get

𝑤 = Σⱼ 𝛼ⱼ𝑦ⱼ𝜙(𝑥ⱼ)

𝑏 = 𝑦ₖ − 𝑤⊤𝜙(𝑥ₖ) = 𝑦ₖ − Σⱼ 𝛼ⱼ𝑦ⱼ𝑘(𝑥ⱼ, 𝑥ₖ) for any 𝑘 such that 0 < 𝛼ₖ < 𝐶

Evaluate the decision function on a new data point:

𝑓(𝑥) = 𝑤⊤𝜙(𝑥) = (Σⱼ 𝛼ⱼ𝑦ⱼ𝜙(𝑥ⱼ))⊤𝜙(𝑥) = Σⱼ 𝛼ⱼ𝑦ⱼ 𝑘(𝑥ⱼ, 𝑥)

Page 23

Typical kernels for vector data

Polynomial of degree 𝑑: 𝑘(𝑥, 𝑦) = (𝑥⊤𝑦)^𝑑

Polynomial of degree up to 𝑑: 𝑘(𝑥, 𝑦) = (𝑥⊤𝑦 + 𝑐)^𝑑

Exponential kernel (infinite-degree polynomial): 𝑘(𝑥, 𝑦) = exp(𝑠 ⋅ 𝑥⊤𝑦)

Gaussian RBF kernel: 𝑘(𝑥, 𝑦) = exp(−‖𝑥 − 𝑦‖² / 2𝜎²)

Laplace kernel: 𝑘(𝑥, 𝑦) = exp(−‖𝑥 − 𝑦‖ / 2𝜎²)

Exponentiated distance: 𝑘(𝑥, 𝑦) = exp(−𝑑(𝑥, 𝑦)² / 𝑠²)
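These kernels are one-liners in numpy; a small sketch with 𝑑, 𝑐, 𝑠, 𝜎 as free parameters (values below are illustrative):

```python
import numpy as np

# The typical vector-data kernels listed above.
def poly(x, y, d):          return float(x @ y) ** d
def poly_up_to(x, y, d, c): return (float(x @ y) + c) ** d
def expo(x, y, s):          return np.exp(s * float(x @ y))
def rbf(x, y, sigma):       return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
def laplace(x, y, sigma):   return np.exp(-np.sqrt(np.sum((x - y) ** 2)) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly(x, y, 2), rbf(x, y, 1.0))
```

Note that 𝑘(𝑥, 𝑥) = 1 for the RBF and Laplace kernels, i.e. every point has unit "self-similarity" in feature space.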

Page 24

Shape of some kernels

Translation invariant kernel ๐‘˜ ๐‘ฅ, ๐‘ฆ = ๐‘” ๐‘ฅ โˆ’ ๐‘ฆ

The decision function is a weighted sum of bumps:

𝑓(𝑥) = Σⱼ 𝛼ⱼ𝑦ⱼ 𝑘(𝑥ⱼ, 𝑥)

Page 25

How to construct more complicated kernels

Know the feature space ๐œ™ ๐‘ฅ , but find a fast way to compute the inner product ๐‘˜(๐‘ฅ, ๐‘ฆ)

Eg. string kernels, and graph kernels

Find a function ๐‘˜ ๐‘ฅ, ๐‘ฆ and prove it is positive semidefinite

Make sure the function captures some useful similarity between data points

Combine existing kernels to get new kernels

Which combinations still result in a kernel?

Page 26

Combining kernels I

Positively weighted combinations of kernels are kernels:

If 𝑘₁(𝑥, 𝑦) and 𝑘₂(𝑥, 𝑦) are kernels and 𝛼, 𝛽 ≥ 0, then 𝑘(𝑥, 𝑦) = 𝛼𝑘₁(𝑥, 𝑦) + 𝛽𝑘₂(𝑥, 𝑦) is a kernel.

A weighted combination of kernels is like a concatenated feature space:

𝑘₁(𝑥, 𝑦) = 𝜙(𝑥)⊤𝜙(𝑦), 𝑘₂(𝑥, 𝑦) = 𝜓(𝑥)⊤𝜓(𝑦)

𝑘(𝑥, 𝑦) = (𝜙(𝑥), 𝜓(𝑥))⊤(𝜙(𝑦), 𝜓(𝑦))

Mappings between spaces give you kernels:

If 𝑘(𝑥, 𝑦) is a kernel, then 𝑘(𝜙(𝑥), 𝜙(𝑦)) is a kernel, e.g. 𝑘(𝑥, 𝑦) = 𝑥²𝑦².

Page 27

Kernel combination II

Products of kernels are kernels:

If 𝑘₁(𝑥, 𝑦) and 𝑘₂(𝑥, 𝑦) are kernels, then 𝑘(𝑥, 𝑦) = 𝑘₁(𝑥, 𝑦)𝑘₂(𝑥, 𝑦) is a kernel.

A product kernel is like using a tensor-product feature space:

𝑘₁(𝑥, 𝑦) = 𝜙(𝑥)⊤𝜙(𝑦), 𝑘₂(𝑥, 𝑦) = 𝜓(𝑥)⊤𝜓(𝑦)

𝑘(𝑥, 𝑦) = (𝜙(𝑥) ⊗ 𝜓(𝑥))⊤(𝜙(𝑦) ⊗ 𝜓(𝑦))

𝑘(𝑥, 𝑦) = 𝑘₁(𝑥, 𝑦)𝑘₂(𝑥, 𝑦)⋯𝑘_𝑑(𝑥, 𝑦) is a kernel with higher-order tensor features:

𝑘(𝑥, 𝑦) = (𝜙(𝑥) ⊗ 𝜓(𝑥) ⊗ ⋯ ⊗ 𝜉(𝑥))⊤(𝜙(𝑦) ⊗ 𝜓(𝑦) ⊗ ⋯ ⊗ 𝜉(𝑦))
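Both closure rules can be spot-checked on Gram matrices: a positive weighted sum of two PSD matrices stays PSD, and the product kernel's Gram matrix is the elementwise (Schur) product of the two Gram matrices, which is PSD by the Schur product theorem. A sketch with an arbitrary random dataset:

```python
import numpy as np

# Closure of kernels under positive sums and elementwise products.
rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))

lin = X @ X.T                                    # linear-kernel Gram matrix
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
rbf = np.exp(-sq / 2.0)                          # RBF Gram matrix (sigma = 1)

K_sum = 0.7 * lin + 1.3 * rbf                    # alpha, beta >= 0
K_prod = lin * rbf                               # Schur (elementwise) product

print(np.linalg.eigvalsh(K_sum).min(), np.linalg.eigvalsh(K_prod).min())
```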

Page 28

Nonlinear regression

(This slide repeats the nonlinear regression motivation from Page 10.)

Page 29

Ridge regression

A dataset ๐‘‹ = ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐‘› โˆˆ ๐‘…๐‘‘ร—๐‘›, ๐‘ฆ = ๐‘ฆ1, โ€ฆ , ๐‘ฆ๐‘›

โŠค โˆˆ ๐‘…๐‘‘

With some regularization parameter ๐œ†, the goal is to find regression function ๐‘ค

๐‘คโˆ— = ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘–๐‘›๐‘ค (๐‘ฆ๐‘– โˆ’ ๐‘ฅ๐‘–โŠค๐‘ค๐‘–2+ ๐œ† ๐‘ค 2)

= ๐‘Ž๐‘Ÿ๐‘”๐‘š๐‘–๐‘›๐‘ค ๐‘ฆ โˆ’ ๐‘‹โŠค๐‘ค 2 + ๐œ† ๐‘ค 2

Find ๐‘ค: take derivative of the objective and set it to zeros

๐‘คโˆ— = ๐‘‹๐‘‹โŠค + ๐œ†๐ผ โˆ’1๐‘‹๐‘ฆ

29
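The closed form is two lines of numpy, using the slide's convention that 𝑋 is 𝑑 × 𝑛 (columns are data points); the synthetic data and 𝜆 below are illustrative:

```python
import numpy as np

# Ridge regression closed form: w* = (X X^T + lam I)^(-1) X y.
rng = np.random.default_rng(6)
d, n, lam = 3, 50, 0.1
X = rng.normal(size=(d, n))                  # columns are data points
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * rng.normal(size=n) # noisy linear targets

w_star = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
print(w_star)                                # close to w_true
```

Using `solve` rather than forming the explicit inverse is the standard numerically stable choice.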

Page 30

Matrix inversion lemma

Find ๐‘ค: take derivative of the objective and set it to zeros

๐‘คโˆ— = ๐‘‹๐‘‹โŠค + ๐œ†๐ผ๐‘‘โˆ’1๐‘‹๐‘ฆ

๐‘คโˆ— = ๐‘‹๐‘‹โŠค + ๐œ†๐ผ๐‘‘โˆ’1๐‘‹๐‘ฆ = ๐‘‹ ๐‘‹โŠค๐‘‹ + ๐œ†๐ผ๐‘›

โˆ’1๐‘ฆ

Evaluate ๐‘คโˆ— in a new data point, we have

๐‘คโˆ—โŠคx = ๐‘ฆโŠค ๐‘‹โŠค๐‘‹ + ๐œ†๐ผ๐‘›โˆ’1๐‘‹โŠค๐‘ฅ

The above expression only depends on inner products!

30
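The identity (𝑋𝑋⊤ + 𝜆𝐼_𝑑)⁻¹𝑋 = 𝑋(𝑋⊤𝑋 + 𝜆𝐼ₙ)⁻¹ follows from (𝑋𝑋⊤ + 𝜆𝐼_𝑑)𝑋 = 𝑋(𝑋⊤𝑋 + 𝜆𝐼ₙ), and is easy to confirm numerically (random sizes below are arbitrary):

```python
import numpy as np

# Check (X X^T + lam I_d)^(-1) X y == X (X^T X + lam I_n)^(-1) y.
rng = np.random.default_rng(7)
d, n, lam = 4, 9, 0.3
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

w_primal = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)   # d x d solve
w_dual = X @ np.linalg.solve(X.T @ X + lam * np.eye(n), y)     # n x n solve
print(np.abs(w_primal - w_dual).max())                         # essentially zero
```

The left form inverts a 𝑑 × 𝑑 matrix, the right an 𝑛 × 𝑛 one, so one picks whichever of 𝑑 and 𝑛 is smaller; the 𝑛 × 𝑛 form is the one that kernelizes.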

Page 31

Kernel ridge regression

๐‘คโˆ—โŠคx = ๐‘ฆโŠค ๐‘‹โŠค๐‘‹ + ๐œ†๐ผ๐‘›โˆ’1๐‘‹โŠค๐‘ฅ only depends on inner products!

๐‘‹โŠค๐‘‹ =๐‘ฅ1โŠค๐‘ฅ1 โ€ฆ ๐‘ฅ1

โŠค๐‘ฅ๐‘›โ‹ฎ โ‹ฑ โ‹ฎ๐‘ฅ๐‘›โŠค๐‘ฅ1 โ€ฆ ๐‘ฅ๐‘›

โŠค๐‘ฅ๐‘›

๐‘‹โŠค๐‘ฅ =๐‘ฅ1โŠค๐‘ฅโ‹ฎ๐‘ฅ๐‘›โŠค๐‘ฅ

Kernel ridge regression: replace inner product by a kernel function

๐‘‹โŠค๐‘‹ โ†’ ๐พ = ๐‘˜ ๐‘ฅ๐‘– , ๐‘ฅ๐‘—๐‘›ร—๐‘›

๐‘‹โŠค๐‘ฅ โ†’ kx = ๐‘˜ ๐‘ฅ๐‘– , ๐‘ฅ ๐‘›ร—1

๐‘“ ๐‘ฅ = ๐‘ฆโŠค ๐พ + ๐œ†๐ผ๐‘›โˆ’1๐‘˜๐‘ฅ

31
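The whole predictor fits in a few lines. A minimal sketch fitting a noisy 1-D sinusoid with the RBF kernel; the training data, 𝜎, and 𝜆 are all illustrative choices:

```python
import numpy as np

# Kernel ridge regression: f(x) = y^T (K + lam I)^(-1) k_x with an RBF kernel.
rng = np.random.default_rng(8)
Xtr = np.linspace(0, 2 * np.pi, 30)
ytr = np.sin(Xtr) + 0.05 * rng.normal(size=30)
sigma, lam = 0.5, 1e-3

def rbf(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

K = rbf(Xtr, Xtr)
coef = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)   # (K + lam I)^(-1) y

def f(x):
    return rbf(np.atleast_1d(x), Xtr) @ coef              # k_x applied to coef

print(float(f(np.pi / 2)))    # close to sin(pi/2) = 1
```

Note that training costs one 𝑛 × 𝑛 solve and each prediction costs 𝑛 kernel evaluations, regardless of the (possibly infinite) dimension of the implicit feature space.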

Page 32

Kernel ridge regression

Use the Gaussian RBF kernel

𝑘(𝑥, 𝑦) = exp(−‖𝑥 − 𝑦‖² / 2𝜎²)

Use cross-validation to choose parameters

(Figure: fits with large 𝜎, large 𝜆; small 𝜎, small 𝜆; small 𝜎, large 𝜆.)