
Lecture 2: Introduction to Kernel Methods

Pavel Laskov and Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, 2012

An Apartment for Rent.

Living area: 51 m²

Monthly rent: 550 €

Is this price fair?

Linear Regression

Objective: find the relationship between (correlated) input variables x and output variables y

f(x) = w⊤x + b

Applications of regression:

Interpolation: determining output values for points within the range of the available data
Extrapolation: determining output values for points outside of the range of the available data
Analysis: understanding specific parameters, e.g., the slope w or the bias b

Linear Regression
Problem Setup

Given is the data D = {(xi, yi)}, i = 1, …, N, where each data point is a pair of an input vector xi ∈ ℝᴰ and the corresponding output yi

Objective: find the weight vector w and the bias b which minimize the discrepancy between the predictions w⊤xi + b and the true outputs yi for each point i

Linear Regression
Matrix Representation

Arrange all data points in the rows of the data matrix X:

X = [x1⊤; …; xN⊤]

Then the prediction error can be expressed as

ξ = y − Xw − 1b

The objective is now to minimize the squared prediction error

min L = ξ⊤ξ = (y − Xw − 1b)⊤(y − Xw − 1b)

Linear Regression
Derivation of the Optimal Solution

Compute partial derivatives of L with respect to w and b:

∂L/∂w = −2X⊤(y − Xw − 1b) = −2X⊤y + 2X⊤Xw + 2X⊤1b

∂L/∂b = −2·1⊤(y − Xw − 1b) = −2·1⊤y + 2·1⊤Xw + 2·1⊤1b

The optimal solution is obtained by equating partial derivatives to zero:

[ X⊤X   X⊤1 ] [ w ]   [ X⊤ ]
[ 1⊤X   1⊤1 ] [ b ] = [ 1⊤ ] y

By setting X̄ = [X, 1] and w̄ = [w; b], one obtains the classical system of equations for linear regression passing through the origin:

X̄⊤X̄ w̄ = X̄⊤y
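To make this concrete, here is a minimal NumPy sketch (an illustration, not part of the original slides): it builds the augmented matrix X̄ = [X, 1], solves X̄⊤X̄ w̄ = X̄⊤y, and reads off the slope and bias. The apartment data below is made up for demonstration.

    import numpy as np

    # Made-up apartment data: living area in m^2 and monthly rent in euros
    area = np.array([30.0, 45.0, 51.0, 60.0, 72.0, 85.0, 100.0])
    rent = np.array([310.0, 430.0, 500.0, 570.0, 690.0, 810.0, 950.0])

    X = area.reshape(-1, 1)                              # N x D data matrix (here D = 1)
    X_bar = np.hstack([X, np.ones((X.shape[0], 1))])     # augment with a column of ones

    # Classical normal equations: X_bar^T X_bar w_bar = X_bar^T y
    w_bar = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ rent)
    w, b = w_bar[:-1], w_bar[-1]
    print("slope:", w[0], "bias:", b)
    print("predicted rent for 51 m^2:", w[0] * 51 + b)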

Linear Regression
Solution of the apartment problem

[Figure: linear regression fit of rent (€) against living area (m²); lambda = 0.00, MSE = 20.72]

Regression slope is 9.47 €/m², bias is 9.42 €

The (fair) rent should have been 9.47 · 51 ≈ 483 €

Nonlinear Regression

Model the dependency with a linear combination of non-linear basis functions:

f(x) = c1 f1(x) + c2 f2(x) + … + cM fM(x)

Construct the data matrix X from the non-linear contributions of each data point:

X = [ f1(x1)  f2(x1)  …  fM(x1)
        ⋮       ⋮           ⋮
      f1(xN)  f2(xN)  …  fM(xN) ]

Re-use the standard solution

The solution vector w contains the coefficients c1, …, cM.

In general, we will denote the space containing nonlinear transformations of the input data as the feature space:

Φ(x) = [f1(x), f2(x), …, fM(x)]
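As a small illustration (my own sketch, assuming the same NumPy setup as above), the basis-function data matrix for a 1-dimensional polynomial can be built and plugged into the very same least-squares solution:

    import numpy as np

    def poly_features(x, degree):
        """Basis functions f_p(x) = x^p, p = 1..degree, for 1-D inputs x."""
        return np.vstack([x ** p for p in range(1, degree + 1)]).T

    x = np.array([30.0, 45.0, 51.0, 60.0, 72.0, 85.0, 100.0])        # living area
    y = np.array([310.0, 430.0, 500.0, 570.0, 690.0, 810.0, 950.0])  # rent

    Phi = poly_features(x, degree=3)                    # N x M matrix of basis functions
    Phi_bar = np.hstack([Phi, np.ones((len(x), 1))])    # add the constant basis function
    coeffs = np.linalg.solve(Phi_bar.T @ Phi_bar, Phi_bar.T @ y)
    print("c1, ..., cM and bias b:", coeffs)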

Nonlinear Regression
Example: Polynomial of Degree 2

[Figure: degree-2 polynomial regression fit of rent (€) against living area (m²); MSE = 20.66]

f(x) = b + c1x + c2x²

Slightly better accuracy of fit: MSE of 20.66 instead of 20.72

Nonlinear Regression
Example: Polynomial of Degree 3

[Figure: degree-3 polynomial regression fit of rent (€) against living area (m²); MSE = 20.65]

f(x) = b + c1x + c2x² + c3x³

Tiny improvement of accuracy: MSE of 20.65 instead of 20.66

Nonlinear Regression
Example: Polynomial of Degree 4

[Figure: degree-4 polynomial regression fit of rent (€) against living area (m²); MSE = 20.29]

f(x) = b + c1x + c2x² + c3x³ + c4x⁴

A significant jump in accuracy: MSE of 20.29 instead of 20.65

Overfitting

Polynomials of higher degree can easily fit any constellation of training data. Other complex functions fit even better, but...

(−) They poorly approximate the true dependency both between and outside of the range of the training data points.

Regularization: the trade-off between the accuracy of the fit to the training data and the accuracy of approximation of the true dependency can be controlled by restricting the coefficients c1, c2, …, cM.

Modify the objective function as follows:

min L = ξ⊤ξ + λw⊤w

Regularized solution:

[ X⊤X + λI   X⊤1 ] [ w ]   [ X⊤ ]
[ 1⊤X        1⊤1 ] [ b ] = [ 1⊤ ] y
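A sketch of the regularized solve in NumPy (my own illustration, not from the slides): the only change relative to the unregularized system is the λI term added to the X⊤X block, and the bias is left unpenalized, as in the block system above.

    import numpy as np

    def ridge_fit(X, y, lam):
        """Solve the regularized block system for (w, b); the bias b is not penalized."""
        N, M = X.shape
        ones = np.ones((N, 1))
        A = np.block([[X.T @ X + lam * np.eye(M), X.T @ ones],
                      [ones.T @ X,                ones.T @ ones]])
        rhs = np.concatenate([X.T @ y, ones.T @ y])
        sol = np.linalg.solve(A, rhs)
        return sol[:-1], sol[-1]

    # e.g. w, b = ridge_fit(Phi, y, lam=0.5) with the polynomial features from before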

Nonlinear Regression
Example: Regularized Polynomial of Degree 4

[Figure: regularized degree-4 polynomial regression fit of rent (€) against living area (m²); lambda = 0.50, MSE = 20.62]

More improvement in accuracy of fit: MSE of 20.62 instead of 20.65

The Curse of Dimensionality

How many monomials does a polynomial of degree k in d dimensions have?

k = 2, d = 2:
x1² + x1x2 + x2² + x1 + x2 + b  ⇒  6

k = 2, d = 3:
x1² + x2² + x3² + x1x2 + x1x3 + x2x3 + x1 + x2 + x3 + b  ⇒  10

k = 2, general d:
x1² + … + xd² (d terms) + x1x2 + … + xd−1xd (d(d−1)/2 terms) + x1 + … + xd (d terms) + b  ⇒  O(d²)

general k and d:
(monomials of degree k) + (monomials of degree k − 1) + … + b  ⇒  Σ_{i=1}^{k} (i + d − 1)! / (i! (d − 1)!) = O(dᵏ)
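A quick numeric check of this count (my own snippet; math.comb requires Python 3.8+):

    from math import comb

    def num_monomials(k, d):
        """Number of monomials of degree <= k in d variables, including the constant term."""
        return sum(comb(i + d - 1, d - 1) for i in range(1, k + 1)) + 1

    print(num_monomials(2, 2))     # 6
    print(num_monomials(2, 3))     # 10
    print(num_monomials(5, 100))   # 96560646 -- already roughly 10^8 features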

A New Look at the Problem

Consider the following optimization problem

min_{ξ,w}  ½ ξ⊤ξ + (λ/2) w⊤w

subject to:  ξ = y − Xw − 1b

How do we minimize a function subject to constraints?

Some Facts from Optimization Theory

1. To optimize a function f(x) subject to constraints g(x) = 0:

Add g(x) to the objective function weighted by the dual variables α

L = f(x) + α⊤g(x)

The objective function extended with the weighted constraints is called the Lagrangian. Optimize the Lagrangian L with respect to both x and α.

2. If joint optimization is difficult, simplify the problem by the following trick:

Compute the partial derivatives of L with respect to x and equate them to zero.
Use these constraints to eliminate x from L.
The resulting optimization problem, containing only α, is known as the dual optimization problem.

Nonlinear Regression: A Dual View
Differential of the Lagrangian

The Lagrangian of the regression problem is:

L = ½ ξ⊤ξ + (λ/2) w⊤w + α⊤(y − Xw − 1b − ξ)

Its partial derivatives with respect to primal variables are:

∂L/∂ξ = ξ − α = 0  ⇒  ξ = α

∂L/∂w = λw − X⊤α = 0  ⇒  w = (1/λ) X⊤α

∂L/∂b = −1⊤α = 0  ⇒  1⊤α = 0

Nonlinear Regression: A Dual View
Derivation of the Dual Problem

Substituting the expressions for the optimal ξ and w back into the Lagrangian, we obtain:

L = ½ α⊤α + (λ/2)(1/λ²) α⊤XX⊤α + α⊤y − (1/λ) α⊤XX⊤α − α⊤1b − α⊤α     (the α⊤1b term vanishes, since 1⊤α = 0)

  = −(1/(2λ)) α⊤XX⊤α − ½ α⊤α + α⊤y

Differentiating with respect to α, we obtain the following optimality condition:

(XX⊤ + λI)α = λy
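A small NumPy sketch (my own illustration) checking this dual solution against the primal one; for simplicity the bias b is dropped, i.e., the 1⊤α = 0 constraint is ignored, so both sides reduce to plain ridge regression:

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, lam = 10, 50, 0.5                       # few points, many features
    X = rng.standard_normal((N, M))
    y = rng.standard_normal(N)

    # Primal: (X^T X + lam I) w = X^T y        -- an (M x M) system
    w_primal = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

    # Dual:   (X X^T + lam I) alpha = lam y    -- an (N x N) system, then w = X^T alpha / lam
    alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), lam * y)
    w_dual = X.T @ alpha / lam

    print(np.allclose(w_primal, w_dual))          # True: both give the same weight vector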

Primal vs. Dual

Final solutions of both problems are similar:

primal: (X⊤X + λI)w = X⊤y

dual: (XX⊤ + λI)α = λy

Key difference: dimensionality of the linear system

primal: X⊤X is (M + 1) × (M + 1)

dual: XX⊤ is N × N

For nonlinear regression,

N ≪ M = Σ_{i=1}^{k} (i + d − 1)! / (i! (d − 1)!)

A Closer Look at the Dual Problem

How long does it take to compute XX⊤?

[N × M] · [M × N]:  O(N²·M) = O(N² · Σ_{i=1}^{k} (i + d − 1)! / (i! (d − 1)!)) operations, which is still prohibitive because M is huge.

Let’s look at the structure of XX⊤:

[ x1⊤ ]                     [ x1⊤x1  x1⊤x2  ⋯  x1⊤xN ]
[ x2⊤ ]  [ x1 x2 ⋯ xN ]  =  [ x2⊤x1  x2⊤x2  ⋯  x2⊤xN ]
[  ⋮  ]                     [   ⋮      ⋮    ⋱    ⋮   ]
[ xN⊤ ]                     [ xN⊤x1  xN⊤x2  ⋯  xN⊤xN ]

The main challenge is to efficiently compute the inner products xi⊤xj

Can we do this faster than in O(M)?

Kernel Magic
Example 1: 2-dimensional Polynomials of Degree 2

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 2:

Φ(x) = [x1², x2², √2·x1x2, √2·x1, √2·x2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x1²y1² + x2²y2² + 2x1x2y1y2 + 2x1y1 + 2x2y2 + 1
          = (x1y1 + x2y2 + 1)²

Complexity: 3 multiplications instead of 6.
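A short numeric check of this identity (my own snippet, not from the slides):

    import numpy as np

    def phi(v):
        """Explicit degree-2 feature map for a 2-dimensional point v = (v1, v2)."""
        v1, v2 = v
        return np.array([v1**2, v2**2, np.sqrt(2) * v1 * v2,
                         np.sqrt(2) * v1, np.sqrt(2) * v2, 1.0])

    x, y = np.array([1.5, -0.3]), np.array([0.7, 2.0])
    print(phi(x) @ phi(y))       # explicit inner product in the 6-dimensional feature space
    print((x @ y + 1) ** 2)      # kernel trick: the same value from the 2-dimensional inputs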

Kernel Magic
Example 2: 2-dimensional Polynomials of Degree 3

Consider a (slightly modified) feature space for 2-dimensional polynomials of degree 3:

Φ(x) = [x1³, x2³, √3·x1²x2, √3·x1x2², √3·x1², √3·x2², √6·x1x2, √3·x1, √3·x2, 1]

Let us compute the inner product between two points in the feature space:

Φ(x)⊤Φ(y) = x1³y1³ + x2³y2³ + 3x1²x2y1²y2 + 3x1x2²y1y2² + 3x1²y1² + 3x2²y2² + 6x1x2y1y2 + 3x1y1 + 3x2y2 + 1
          = (x1y1 + x2y2 + 1)³

Complexity: 3 multiplications instead of 10.

Polynomial Kernel

For a polynomial of degree k in d dimensions, the inner product between two feature vectors can be computed as:

Φ(x)⊤Φ(y) = (x⊤y + 1)ᵏ

Complexity: O(d)

The function k(x, y) = (x⊤y + 1)ᵏ is called a reproducing kernel for the feature space of polynomials (the qualification “reproducing” will be discussed in the next lecture)
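As an illustration (my own sketch), the polynomial kernel and the N × N kernel (Gram) matrix can be computed without ever forming the O(dᵏ)-dimensional feature vectors:

    import numpy as np

    def poly_kernel(X, Y, k=3):
        """k(x, y) = (x^T y + 1)^k for all pairs of rows of X and Y."""
        return (X @ Y.T + 1.0) ** k

    X = np.random.default_rng(1).standard_normal((100, 20))   # N = 100 points in d = 20 dimensions
    K = poly_kernel(X, X, k=3)                                 # N x N Gram matrix, cost O(N^2 d)
    print(K.shape)                                             # (100, 100)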

Classical Kernel Functions

The Radial Basis Function (RBF) feature space is defined solely via its kernel:

k(x, y) = exp(−‖x − y‖²)

Hyperbolic Tangent Function (used for neural networks):

k(x, y) = tanh(αx⊤y + c)

Both of these functions enable O(d) computation of inner products in infinite-dimensional feature spaces.

More than 30 different kernel functions are widely used in machine learning algorithms: splines, chi-square, histogram, wavelets, etc.
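The two kernels written as code (my own sketch; a bandwidth σ with the common 2σ² scaling is included for the RBF, as used later in this lecture, even though the formula above omits it):

    import numpy as np

    def rbf_kernel(X, Y, sigma=1.0):
        """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of X and Y."""
        sq_dists = (np.sum(X**2, axis=1)[:, None]
                    + np.sum(Y**2, axis=1)[None, :]
                    - 2.0 * X @ Y.T)
        return np.exp(-sq_dists / (2.0 * sigma**2))

    def tanh_kernel(X, Y, alpha=1.0, c=0.0):
        """k(x, y) = tanh(alpha * x^T y + c) for all pairs of rows of X and Y."""
        return np.tanh(alpha * X @ Y.T + c)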

Algorithmic Properties of Kernels

(+) Kernel functions provide an efficient way to compute similarity between data in complex feature spaces

Complexity varies from linear to low-degree polynomial, compared to high-degree polynomial, exponential, or even infinite.

(+) Kernel functions provide a powerful abstraction for algorithmic design

For algorithms formulated in terms of kernels, it suffices to change the kernel function to implement a new feature space; no change to the algorithm is necessary.

(−) The numeric properties of kernel-based algorithms are worse

Special care must be taken to avoid numeric instability.

Kernel Nonlinear Regression
Example: Polynomial of Degree 4

[Figure: kernel ridge regression fit with a degree-4 polynomial kernel, rent (€) against living area (m²); lambda = 5.00, MSE = 20.13]

Excellent accuracy: MSE of 20.13 (20.26 for explicit space)

Numeric instability: strong regularization required (λ = 5)

Kernel Nonlinear Regression
Example: Radial Basis Function, σ = 30

[Figure: kernel ridge regression fit with an RBF kernel (σ = 30), rent (€) against living area (m²); lambda = 0.01, MSE = 20.29]

Good accuracy: MSE of 20.29

A smooth kernel is required: bandwidth σ = 30
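Putting the pieces together, a minimal kernel ridge regression sketch in the spirit of these examples (my own code; the bias term and the 1⊤α = 0 constraint are dropped for simplicity, and rbf_kernel refers to the sketch given earlier):

    import numpy as np

    def kernel_ridge_fit(K, y, lam):
        """Solve the dual system (K + lam I) alpha = lam y."""
        return np.linalg.solve(K + lam * np.eye(len(y)), lam * y)

    def kernel_ridge_predict(K_test_train, alpha, lam):
        """f(x) = (1/lam) * sum_j alpha_j k(x_j, x)."""
        return K_test_train @ alpha / lam

    # Usage sketch with made-up data:
    # K = rbf_kernel(X_train, X_train, sigma=30.0)
    # alpha = kernel_ridge_fit(K, y_train, lam=0.01)
    # y_hat = kernel_ridge_predict(rbf_kernel(X_test, X_train, sigma=30.0), alpha, lam=0.01)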

Summary

1. Development of learning algorithms involves complex algorithms for the optimization of cost functions

2. A widespread method for developing learning algorithms is mapping data objects into a nonlinear feature space

3. The prohibitively high dimensionality of feature spaces can, in many cases, be overcome by transforming the problem into the dual form, in which the data occurs only in the form of inner products

4. Inner products in high-dimensional spaces can be efficiently computed with nonlinear kernel functions

5. In general, kernel functions offer a powerful abstraction for the development of learning algorithms

6. Next lecture: we will discuss further mathematical properties of kernel functions which affect the development of learning algorithms
