
Lecture 2: Introduction to Kernel Methods

Pavel Laskov and Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, 2012

An Apartment for Rent.

Living area: 51 m²

Monthly rent: 550 €

Is this price fair?

Linear Regression

Objective: find the relationship between (correlated) input variables x and output variables y

f(x) = w⊤x + b

Applications of regression:

Interpolation: determining output values for points within the range of the available data
Extrapolation: determining output values for points outside of the range of the available data
Analysis: understanding specific parameters, e.g., the slope w or the bias b

Linear Regression
Problem Setup

Given is the data D = {(xi, yi)}, i = 1, …, N, where each data point is a pair of an input vector xi ∈ ℝᴰ and the corresponding output yi

Objective: find the weight vector w and the bias b which minimize the discrepancy between the predictions w⊤xi + b and the true outputs yi for each point i

Linear Regression
Matrix Representation

Arrange all data points in the rows of the data matrix X:

X = [x1⊤; …; xN⊤]

Then the prediction error can be expressed as

ξ = y − Xw − 1b

The objective is now to minimize the squared prediction error

min L = ξ⊤ξ = (y − Xw − 1b)⊤(y − Xw − 1b)

Linear Regression
Derivation of the Optimal Solution

Compute partial derivatives of L with respect to w and b:

∂L/∂w = −2X⊤(y − Xw − 1b) = −2X⊤y + 2X⊤Xw + 2X⊤1b

∂L/∂b = −2·1⊤(y − Xw − 1b) = −2·1⊤y + 2·1⊤Xw + 2·1⊤1b

The optimal solution is obtained by equating partial derivatives to zero:

[ X⊤X   X⊤1 ] [ w ]   [ X⊤ ]
[ 1⊤X   1⊤1 ] [ b ] = [ 1⊤ ] y

By setting X̄ = [X, 1] and w̄ = [w; b], one obtains the classical system of equations for linear regression passing through the origin:

X̄⊤X̄ w̄ = X̄⊤y
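To make this concrete, here is a minimal NumPy sketch (an illustration, not part of the original slides): it builds the augmented matrix X̄ = [X, 1], solves X̄⊤X̄ w̄ = X̄⊤y, and reads off the slope and bias. The apartment data below is made up for demonstration.

    import numpy as np

    # Made-up apartment data: living area in m^2 and monthly rent in euros
    area = np.array([30.0, 45.0, 51.0, 60.0, 72.0, 85.0, 100.0])
    rent = np.array([310.0, 430.0, 500.0, 570.0, 690.0, 810.0, 950.0])

    X = area.reshape(-1, 1)                              # N x D data matrix (here D = 1)
    X_bar = np.hstack([X, np.ones((X.shape[0], 1))])     # augment with a column of ones

    # Classical normal equations: X_bar^T X_bar w_bar = X_bar^T y
    w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ rent)
    w, b = w_bar[:-1], w_bar[-1]
    print("slope:", w[0], "bias:", b)
    print("predicted rent for 51 m^2:", w[0] * 51 + b)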

Linear Regression
Solution of the apartment problem

[Figure: linear regression fit of rent (€) against living area (m²); lambda = 0.00, MSE = 20.72]

Regression slope is 9.47 €/m², bias is 9.42 €

The (fair) rent should have been 9.47 · 51 ≈ 483 €

Nonlinear Regression

Model the dependency with a linear combination of non-linear basis functions:

f(x) = c1 f1(x) + c2 f2(x) + … + cM fM(x)

Construct the data matrix X from the non-linear contributions of each data point:

X = [ f1(x1)  f2(x1)  …  fM(x1)
        ⋮       ⋮           ⋮
      f1(xN)  f2(xN)  …  fM(xN) ]

Re-use the standard solution

The solution vector w contains the coefficients c1, …, cM.

In general, we will denote the space containing nonlinear transformations of the input data as the feature space:

Φ(x) = [f1(x), f2(x), …, fM(x)]
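As a small illustration (my own sketch, assuming the same NumPy setup as above), the basis-function data matrix for a 1-dimensional polynomial can be built and plugged into the very same least-squares solution:

    import numpy as np

    def poly_features(x, degree):
        """Basis functions f_p(x) = x^p, p = 1..degree, for 1-D inputs x."""
        return np.vstack([x ** p for p in range(1, degree + 1)]).T

    x = np.array([30.0, 45.0, 51.0, 60.0, 72.0, 85.0, 100.0])        # living area
    y = np.array([310.0, 430.0, 500.0, 570.0, 690.0, 810.0, 950.0])  # rent

    Phi = poly_features(x, degree=3)                    # N x M matrix of basis functions
    Phi_bar = np.hstack([Phi, np.ones((len(x), 1))])    # add the constant basis function
    coeffs = np.linalg.solve(Phi_bar.T @ Phi_bar, Phi_bar.T @ y)
    print("c1, ..., cM and bias b:", coeffs)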

Nonlinear Regression
Example: Polynomial of Degree 2

[Figure: degree-2 polynomial regression fit of rent (€) against living area (m²); MSE = 20.66]

f(x) = b + c1x + c2x²

Slightly better accuracy of fit: MSE of 20.66 instead of 20.72

Nonlinear Regression
Example: Polynomial of Degree 3

[Figure: degree-3 polynomial regression fit of rent (€) against living area (m²); MSE = 20.65]

f(x) = b + c1x + c2x² + c3x³

Tiny improvement of accuracy: MSE of 20.65 instead of 20.66

Nonlinear Regression
Example: Polynomial of Degree 4

[Figure: degree-4 polynomial regression fit of rent (€) against living area (m²); MSE = 20.29]

f(x) = b + c1x + c2x² + c3x³ + c4x⁴

A significant jump in accuracy: MSE of 20.29 instead of 20.65

Overfitting

Polynomials of higher degree can easily fit any constellation of training data. Other complex functions fit even better, but...

(−) They poorly approximate the true dependency both between and outside of the range of the training data points.

Regularization: the trade-off between the accuracy of the fit to the training data and the accuracy of approximation of the true dependency can be controlled by restricting the coefficients c1, c2, …, cM.

Modify the objective function as follows:

min L = ξ⊤ξ + λw⊤w

Regularized solution:

[ X⊤X + λI   X⊤1 ] [ w ]   [ X⊤ ]
[ 1⊤X        1⊤1 ] [ b ] = [ 1⊤ ] y
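A sketch of the regularized solve in NumPy (my own illustration, not from the slides): the only change relative to the unregularized system is the λI term added to the X⊤X block, and the bias is left unpenalized, as in the block system above.

    import numpy as np

    def ridge_fit(X, y, lam):
        """Solve the regularized block system for (w, b); the bias b is not penalized."""
        N, M = X.shape
        ones = np.ones((N, 1))
        A = np.block([[X.T @ X + lam * np.eye(M), X.T @ ones],
                      [ones.T @ X,                ones.T @ ones]])
        rhs = np.concatenate([X.T @ y, ones.T @ y])
        sol = np.linalg.solve(A, rhs)
        return sol[:-1], sol[-1]

    # e.g. w, b = ridge_fit(Phi, y, lam=0.5) with the polynomial features from before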

Nonlinear Regression
Example: Regularized Polynomial of Degree 4

[Figure: regularized degree-4 polynomial regression fit of rent (€) against living area (m²); lambda = 0.50, MSE = 20.62]

More improvement in accuracy of fit: MSE of 20.62 instead of 20.65

The Curse of Dimensionality

How many monomials does a polynomial of degree k in d dimensions have?

k = 2, d = 2:
x1² + x1x2 + x2² + x1 + x2 + b  ⇒  6

k = 2, d = 3:
x1² + x2² + x3² + x1x2 + x1x3 + x2x3 + x1 + x2 + x3 + b  ⇒  10

k = 2, general d:
x1² + … + xd² (d terms) + x1x2 + … + xd−1xd (d(d−1)/2 terms) + x1 + … + xd (d terms) + b  ⇒  O(d²)

general k and d:
(monomials of degree k) + (monomials of degree k − 1) + … + b  ⇒  Σ_{i=1}^{k} (i + d − 1)! / (i! (d − 1)!) = O(dᵏ)
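A quick numeric check of this count (my own snippet; math.comb requires Python 3.8+):

    from math import comb

    def num_monomials(k, d):
        """Number of monomials of degree <= k in d variables, including the constant term."""
        return sum(comb(i + d - 1, d - 1) for i in range(1, k + 1)) + 1

    print(num_monomials(2, 2))     # 6
    print(num_monomials(2, 3))     # 10
    print(num_monomials(5, 100))   # 96560646 -- already roughly 10^8 features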

A New Look at the Problem

Consider the following optimization problem

min_{ξ,w}  ½ ξ⊤ξ + (λ/2) w⊤w

subject to:  ξ = y − Xw − 1b

How do we minimize a function subject to constraints?

Some Facts from Optimization Theory

1. To optimize a function f(x) subject to constraints g(x) = 0:

Add g(x) to the objective function weighted by the dual variables α

L = f(x) + α⊤g(x)

The objective function extended with the weighted constraints is called the Lagrangian. Optimize the Lagrangian L with respect to both x and α.

2. If joint optimization is difficult, simplify the problem by the following trick:

Compute the partial derivatives of L with respect to x and equate them to zero.
Use these constraints to eliminate x from L.
The resulting optimization problem, containing only α, is known as the dual optimization problem.

Nonlinear Regression: A Dual View
Differential of the Lagrangian

The Lagrangian of the regression problem is:

L = ½ ξ⊤ξ + (λ/2) w⊤w + α⊤(y − Xw − 1b − ξ)

Its partial derivatives with respect to primal variables are:

∂L/∂ξ = ξ − α = 0  ⇒  ξ = α

∂L/∂w = λw − X⊤α = 0  ⇒  w = (1/λ) X⊤α

∂L/∂b = −1⊤α = 0  ⇒  1⊤α = 0

Nonlinear Regression: A Dual View
Derivation of the Dual Problem

Substituting the expressions for the optimal ξ and w back into the Lagrangian, we obtain:

L = ½ α⊤α + (λ/2)(1/λ²) α⊤XX⊤α + α⊤y − (1/λ) α⊤XX⊤α − α⊤1b − α⊤α     (the α⊤1b term vanishes, since 1⊤α = 0)

  = −(1/(2λ)) α⊤XX⊤α − ½ α⊤α + α⊤y

Differentiating with respect to α, we obtain the following optimality condition:

(XX⊤ + λI)α = λy
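A small NumPy sketch (my own illustration) checking this dual solution against the primal one; for simplicity the bias b is dropped, i.e., the 1⊤α = 0 constraint is ignored, so both sides reduce to plain ridge regression:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, lam = 10, 50, 0.5                       # few points, many features
    X = rng.standard_normal((N, M))
    y = rng.standard_normal(N)

    # Primal: (X^T X + lam I) w = X^T y        -- an (M x M) system
    w_primal = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

    # Dual:   (X X^T + lam I) alpha = lam y    -- an (N x N) system, then w = X^T alpha / lam
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), lam * y)
    w_dual = X.T @ alpha / lam

    print(np.allclose(w_primal, w_dual))          # True: both give the same weight vector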

Primal vs. Dual

Final solutions of both problems are similar:

primal: (X⊤X + λI)w = X⊤y

dual: (XX⊤ + λI)α = λy

Key difference: dimensionality of the linear system

primal: X⊤X is (M + 1) × (M + 1)

dual: XX⊤ is N × N

For nonlinear regression,

N ≪ M = Σ_{i=1}^{k} (i + d − 1)! / (i! (d − 1)!)

A Closer Look at the Dual Problem

How long does it take to compute XX⊤?

[N × M] · [M × N]:  O(N²·M) = O(N² · Σ_{i=1}^{k} (i + d − 1)! / (i! (d − 1)!)) operations, which is still prohibitive because M is huge.

Let’s look at the structure of XX⊤:

[ x1⊤ ]                     [ x1⊤x1  x1⊤x2  ⋯  x1⊤xN ]
[ x2⊤ ]  [ x1 x2 ⋯ xN ]  =  [ x2⊤x1  x2⊤x2  ⋯  x2⊤xN ]
[  ⋮  ]                     [   ⋮      ⋮    ⋱    ⋮   ]
[ xN⊤ ]                     [ xN⊤x1  xN⊤x2  ⋯  xN⊤xN ]

The main challenge is to efficiently compute the inner products xi⊤xj

Can we do this faster than in O(M)?

Kernel Magic
Example 1: 2-dimensional Polynomials of Degree 2

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 2:

Φ(x) = [x1², x2², √2·x1x2, √2·x1, √2·x2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x1²y1² + x2²y2² + 2x1x2y1y2 + 2x1y1 + 2x2y2 + 1
          = (x1y1 + x2y2 + 1)²

Complexity: 3 multiplications instead of 6.
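A short numeric check of this identity (my own snippet, not from the slides):

    import numpy as np

    def phi(v):
        """Explicit degree-2 feature map for a 2-dimensional point v = (v1, v2)."""
        v1, v2 = v
        return np.array([v1**2, v2**2, np.sqrt(2) * v1 * v2,
                         np.sqrt(2) * v1, np.sqrt(2) * v2, 1.0])

    x, y = np.array([1.5, -0.3]), np.array([0.7, 2.0])
    print(phi(x) @ phi(y))       # explicit inner product in the 6-dimensional feature space
    print((x @ y + 1) ** 2)      # kernel trick: the same value from the 2-dimensional inputs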

Kernel Magic
Example 2: 2-dimensional Polynomials of Degree 3

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 3:

Φ(x) = [x1³, x2³, √3·x1²x2, √3·x1x2², √3·x1², √3·x2², √6·x1x2, √3·x1, √3·x2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x1³y1³ + x2³y2³ + 3x1²x2y1²y2 + 3x1x2²y1y2² + 3x1²y1² + 3x2²y2² + 6x1x2y1y2 + 3x1y1 + 3x2y2 + 1
          = (x1y1 + x2y2 + 1)³

Complexity: 3 multiplications instead of 10.

Polynomial Kernel

For a polynomial of degree k in d dimensions, the inner product between two feature vectors can be computed as:

Φ(x)⊤Φ(y) = (x⊤y + 1)ᵏ

Complexity: O(d)

The function k(x, y) = (x⊤y + 1)ᵏ is called a reproducing kernel for the feature space of polynomials (the qualification “reproducing” will be discussed in the next lecture)
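As an illustration (my own sketch), the polynomial kernel and the N × N kernel (Gram) matrix can be computed without ever forming the O(dᵏ)-dimensional feature vectors:

    import numpy as np

    def poly_kernel(X, Y, k=3):
        """k(x, y) = (x^T y + 1)^k for all pairs of rows of X and Y."""
        return (X @ Y.T + 1.0) ** k

    X = np.random.default_rng(1).standard_normal((100, 20))   # N = 100 points in d = 20 dimensions
    K = poly_kernel(X, X, k=3)                                 # N x N Gram matrix, cost O(N^2 d)
    print(K.shape)                                             # (100, 100)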

Classical Kernel Functions

The Radial Basis Function (RBF) feature space is defined solely via its kernel:

k(x, y) = exp(−‖x − y‖²)

Hyperbolic Tangent Function (used for neural networks):

k(x, y) = tanh(αx⊤y + c)

Both of these functions enable O(d) computation of inner products in infinite-dimensional feature spaces.

More than 30 different kernel functions are widely used in machine learning algorithms: splines, chi-square, histogram, wavelets, etc.
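The two kernels written as code (my own sketch; a bandwidth σ with the common 2σ² scaling is included for the RBF, as used later in this lecture, even though the formula above omits it):

    import numpy as np

    def rbf_kernel(X, Y, sigma=1.0):
        """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of X and Y."""
        sq_dists = (np.sum(X**2, axis=1)[:, None]
                    + np.sum(Y**2, axis=1)[None, :]
                    - 2.0 * X @ Y.T)
        return np.exp(-sq_dists / (2.0 * sigma**2))

    def tanh_kernel(X, Y, alpha=1.0, c=0.0):
        """k(x, y) = tanh(alpha * x^T y + c) for all pairs of rows of X and Y."""
        return np.tanh(alpha * X @ Y.T + c)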

Algorithmic Properties of Kernels

(+) Kernel functions provide an efficient way to compute similarity between data in complex feature spaces

Complexity varies from linear to low-degree polynomial, compared to high-degree polynomial, exponential, or even infinite.

(+) Kernel functions provide a powerful abstraction for algorithmic design

For algorithms formulated in terms of kernels, it suffices to change the kernel function to implement a new feature space; no change to the algorithm is necessary.

(−) The numeric properties of kernel-based algorithms are worse

Special care must be taken to avoid numeric instability.

Kernel Nonlinear Regression
Example: Polynomial of Degree 4

[Figure: kernel ridge regression fit with a degree-4 polynomial kernel, rent (€) against living area (m²); lambda = 5.00, MSE = 20.13]

Excellent accuracy: MSE of 20.13 (20.26 for explicit space)

Numeric instability: strong regularization required (λ = 5)

Kernel Nonlinear Regression
Example: Radial Basis Function, σ = 30

[Figure: kernel ridge regression fit with an RBF kernel (σ = 30), rent (€) against living area (m²); lambda = 0.01, MSE = 20.29]

Good accuracy: MSE of 20.29

A smooth kernel is required: bandwidth σ = 30
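Putting the pieces together, a minimal kernel ridge regression sketch in the spirit of these examples (my own code; the bias term and the 1⊤α = 0 constraint are dropped for simplicity, and rbf_kernel refers to the sketch given earlier):

    import numpy as np

    def kernel_ridge_fit(K, y, lam):
        """Solve the dual system (K + lam I) alpha = lam y."""
        return np.linalg.solve(K + lam * np.eye(len(y)), lam * y)

    def kernel_ridge_predict(K_test_train, alpha, lam):
        """f(x) = (1/lam) * sum_j alpha_j k(x_j, x)."""
        return K_test_train @ alpha / lam

    # Usage sketch with made-up data:
    # K = rbf_kernel(X_train, X_train, sigma=30.0)
    # alpha = kernel_ridge_fit(K, y_train, lam=0.01)
    # y_hat = kernel_ridge_predict(rbf_kernel(X_test, X_train, sigma=30.0), alpha, lam=0.01)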

Summary

1. Development of learning algorithms involves complex algorithms for the optimization of cost functions

2. A widespread method for developing learning algorithms is mapping data objects into a nonlinear feature space

3. The prohibitively high dimensionality of feature spaces can, in many cases, be overcome by transforming the problem into the dual form, in which the data occurs only in the form of inner products

4. Inner products in high-dimensional spaces can be efficiently computed with nonlinear kernel functions

5. In general, kernel functions offer a powerful abstraction for the development of learning algorithms

6. Next lecture: we will discuss further mathematical properties of kernel functions which affect the development of learning algorithms
