
Kernel Regression

Prof. Bennett

Math Model of Learning and Discovery 2/24/03

Based on Chapter 2 of Shawe-Taylor and Cristianini

Outline

Review Ridge Regression

LS-SVM=KRR

Dual Derivation

Bias Issue

Summary

Ridge Regression Review

Use the least norm solution for fixed λ.

Regularized problem

Optimality Condition:

$$\min_{w}\; L(w, S) = \lambda\|w\|^2 + \|y - Xw\|^2$$

$$\frac{\partial L(w, S)}{\partial w} = 2\lambda w - 2X'y + 2X'Xw = 0$$

$$(X'X + \lambda I_n)\,w = X'y, \qquad \lambda > 0$$

Requires O(n³) operations.
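As a concrete illustration, here is a minimal NumPy sketch of the primal solve described above; the helper name `ridge_primal` and variable names are illustrative, not from the slides.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Solve (X'X + lam*I_n) w = X'y, the primal ridge regression system."""
    n = X.shape[1]
    # Forming X'X and solving the n x n system costs O(n^3).
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```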

Dual Representation

The inverse always exists for any λ > 0.

Alternative representation:

$$w = (X'X + \lambda I)^{-1} X'y$$

$$(X'X + \lambda I)\,w = X'y \;\Longrightarrow\; \lambda w = X'(y - Xw) \;\Longrightarrow\; w = \lambda^{-1}X'(y - Xw) = X'\alpha,$$

$$\text{where } \alpha = \lambda^{-1}(y - Xw)$$

$$\alpha = \lambda^{-1}(y - Xw) = \lambda^{-1}(y - XX'\alpha) \;\Longrightarrow\; (XX' + \lambda I)\,\alpha = y$$

$$\alpha = (G + \lambda I)^{-1} y, \qquad G = XX'$$

Solving this ℓ × ℓ system is O(ℓ³) operations.

Dual Ridge Regression

To predict a new point x:

$$g(x) = \langle w, x\rangle = \sum_{i=1}^{\ell} \alpha_i \langle x_i, x\rangle = y'(G + \lambda I)^{-1} z, \qquad \text{where } z_i = \langle x_i, x\rangle$$

Note that we only need to compute G, the Gram matrix:

$$G = XX', \qquad G_{ij} = \langle x_i, x_j\rangle$$

Ridge Regression requires only inner products between data points
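A minimal sketch of the dual computation, assuming a NumPy environment; `ridge_dual` and `predict_dual` are illustrative names. Note that both training and prediction touch the data only through inner products.

```python
import numpy as np

def ridge_dual(X, y, lam):
    """Solve (G + lam*I) alpha = y with G = X X'; an ell x ell system, so O(ell^3)."""
    ell = X.shape[0]
    G = X @ X.T                                  # Gram matrix: G_ij = <x_i, x_j>
    return np.linalg.solve(G + lam * np.eye(ell), y)

def predict_dual(X, alpha, x_new):
    """g(x) = sum_i alpha_i <x_i, x> = alpha' z, where z_i = <x_i, x_new>."""
    z = X @ x_new
    return alpha @ z
```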

Linear Regression in Feature Space

Key Idea:

Map data to higher dimensional space (feature space) and perform linear regression in embedded space.

Embedding map: $\Phi : x \in \mathbb{R}^n \mapsto \Phi(x) \in F \subseteq \mathbb{R}^N$, with $N \gg n$.
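To make the embedding idea concrete, a small hypothetical example: the quadratic map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ sends $\mathbb{R}^2$ into $\mathbb{R}^3$, and its inner products can be computed without ever forming $\Phi$ explicitly (this anticipates the kernel function introduced next).

```python
import numpy as np

def phi(x):
    """Explicit quadratic embedding of R^2 into R^3."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, u = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Inner product in feature space equals the squared inner product in input space.
print(phi(x) @ phi(u), (x @ u) ** 2)
```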

Kernel Function

A kernel is a function K such that

$$K(x, u) = \langle \Phi(x), \Phi(u)\rangle_F$$

where $\Phi$ is a mapping from input space to feature space $F$.

There are many possible kernels. The simplest is the linear kernel:

$$K(x, u) = \langle x, u\rangle$$
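The slide names only the linear kernel; as an illustration, here are the linear kernel plus two other commonly used kernels (polynomial and Gaussian) written as plain Python functions.

```python
import numpy as np

def linear_kernel(x, u):
    """K(x, u) = <x, u>: the feature map is the identity."""
    return x @ u

def poly_kernel(x, u, d=2):
    """K(x, u) = (<x, u> + 1)^d: inner product in a finite higher-dimensional feature space."""
    return (x @ u + 1.0) ** d

def rbf_kernel(x, u, sigma=1.0):
    """K(x, u) = exp(-||x - u||^2 / (2 sigma^2)): inner product in an infinite-dimensional feature space."""
    return np.exp(-np.sum((x - u) ** 2) / (2.0 * sigma ** 2))
```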

Ridge Regression in Feature Space

To predict a new point x:

$$g(x) = \langle w, \Phi(x)\rangle = \sum_{i=1}^{\ell} \alpha_i \langle \Phi(x_i), \Phi(x)\rangle = y'(G + \lambda I)^{-1} z, \qquad \text{where } z_i = \langle \Phi(x_i), \Phi(x)\rangle$$

To compute the Gram matrix:

$$G_{ij} = \big[\Phi(X)\Phi(X)'\big]_{ij} = \langle \Phi(x_i), \Phi(x_j)\rangle = K(x_i, x_j)$$

Use kernel to compute inner product
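Putting the pieces together, a minimal kernel ridge regression sketch that builds the Gram matrix from any kernel function and predicts via $g(x) = \sum_i \alpha_i K(x_i, x)$; the function names are illustrative.

```python
import numpy as np

def kernel_ridge_fit(X, y, lam, kernel):
    """Return alpha = (G + lam*I)^{-1} y, where G_ij = K(x_i, x_j)."""
    ell = X.shape[0]
    G = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(G + lam * np.eye(ell), y)

def kernel_ridge_predict(X, alpha, kernel, x_new):
    """g(x) = sum_i alpha_i K(x_i, x_new)."""
    z = np.array([kernel(xi, x_new) for xi in X])
    return alpha @ z
```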

Alternative Dual Derivation

Original math model:

$$\min_{w}\; f(w) = \lambda\|w\|^2 + \|y - Xw\|^2$$

Equivalent math model:

$$\min_{w,z}\; f(w, z) = \tfrac{\lambda}{2}\langle w, w\rangle + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 \qquad \text{s.t. } y_i - \langle x_i, w\rangle = z_i, \quad i = 1, \ldots, \ell$$

Construct the dual using Wolfe duality.

Lagrangian Function

Consider the problem

$$\min_{r}\; f(r) \qquad \text{s.t. } h_i(r) = 0, \quad i = 1, \ldots, m$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and each $h_i : \mathbb{R}^n \to \mathbb{R}$ are differentiable.

The Lagrangian function is

$$L(r, u) = f(r) + \sum_{i=1}^{m} u_i h_i(r)$$

$$\nabla_r L(r, u) = \nabla_r f(r) + \sum_{i=1}^{m} u_i \nabla_r h_i(r)$$

Wolfe Dual Problem

Primal:

$$\min_{r}\; f(r) \qquad \text{s.t. } h_i(r) = 0, \quad i = 1, \ldots, m$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable and convex, and each $h_i : \mathbb{R}^n \to \mathbb{R}$ is differentiable.

Dual:

$$\max_{r,u}\; L(r, u) = f(r) + \sum_{i=1}^{m} u_i h_i(r)$$

$$\text{s.t. } \nabla_r L(r, u) = \nabla_r f(r) + \sum_{i=1}^{m} u_i \nabla_r h_i(r) = 0$$

Lagrangian Function

Primal:

$$\min_{w,z}\; f(w, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 \qquad \text{s.t. } y_i - \langle x_i, w\rangle = z_i, \quad i = 1, \ldots, \ell$$

Lagrangian:

$$L(w, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 + \sum_{i=1}^{\ell} \alpha_i\big(y_i - \langle x_i, w\rangle - z_i\big)$$

$$\nabla_w L(w, z, \alpha) = \lambda w - X'\alpha = 0$$

$$\nabla_z L(w, z, \alpha) = z - \alpha = 0$$

Wolfe Dual Problem

Construct the Wolfe dual:

$$\max_{w,z,\alpha}\; L(w, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 + \sum_{i=1}^{\ell} \alpha_i\big(y_i - \langle x_i, w\rangle - z_i\big)$$

$$\text{s.t. } \nabla_w L(w, z, \alpha) = \lambda w - X'\alpha = 0, \qquad \nabla_z L(w, z, \alpha) = z - \alpha = 0$$

Simplify by eliminating $z = \alpha$.

Simplified Problem

Get rid of z:

$$\max_{w,\alpha}\; \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} \alpha_i^2 + \sum_{i=1}^{\ell} \alpha_i y_i - \sum_{i=1}^{\ell} \alpha_i \langle x_i, w\rangle - \sum_{i=1}^{\ell} \alpha_i^2$$

$$=\; \tfrac{\lambda}{2}\|w\|^2 - \tfrac{1}{2}\sum_{i=1}^{\ell} \alpha_i^2 + \sum_{i=1}^{\ell} \alpha_i y_i - \sum_{i=1}^{\ell} \alpha_i \langle x_i, w\rangle$$

$$\text{s.t. } \nabla_w L(w, z, \alpha) = \lambda w - X'\alpha = 0$$

Simplify by eliminating $w = X'\alpha / \lambda$.

Simplified Problem

Get rid of w:

$$\max_{\alpha}\; -\frac{1}{2\lambda}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j \langle x_i, x_j\rangle - \frac{1}{2}\sum_{i=1}^{\ell} \alpha_i^2 + \sum_{i=1}^{\ell} \alpha_i y_i$$

The problem is unconstrained.

Optimal solution

Problem in matrix notation with G=XX’

Solution satisfies

1

2 2min ( ) ' 'f G y

1

( ) 0

( )l

f G y

G I y

What about Bias

If we restrict the regression function to f(x) = w'x, the solution must pass through the origin.

Many models require a bias or constant term: f(x) = w'x + b.

Eliminate Bias

One way to eliminate bias is to “center” the data

Make data have mean of 0

$$\text{mean: } \mu_y = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i \qquad\qquad \text{standard deviation: } \sigma_y = \sqrt{\frac{1}{\ell}\sum_{i=1}^{\ell} \big(y_i - \mu_y\big)^2}$$

Center y

$$\text{centered } y: \quad \hat{y} = y - \mu_y$$

$\hat{y}$ now has sample mean 0.

Frequently it is good to make y have standard length:

$$\text{normalized } y: \quad \hat{y} = \frac{y - \mu_y}{\sigma_y}$$

Center X

Mean of X:

$$\mu = \tfrac{1}{\ell} X'e, \qquad \text{where } e \text{ is the vector of ones}$$

Center X:

$$\hat{x}_i' = x_i' - \mu' = x_i' - \tfrac{1}{\ell} e'X \qquad \text{(centered point)}$$

$$\hat{X} = X - e\mu' = X - \tfrac{1}{\ell} ee'X = \big(I - \tfrac{1}{\ell} ee'\big) X \qquad \text{(centered data)}$$

You Try

Consider data matrix with 3 points in 4 dimensions

Compute the centered data $\hat{X}$ by hand and with the following formula:

$$X = \begin{bmatrix} 1 & 2 & 4 & 1 \\ 3 & 4 & 1 & 4 \\ 2 & 0 & 9 & 1 \end{bmatrix}$$

$$\hat{X} = \big(I - \tfrac{1}{\ell} ee'\big) X$$
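A short NumPy sketch of the exercise, checking the hand computation against the formula; after centering, every column should average to 0.

```python
import numpy as np

X = np.array([[1, 2, 4, 1],
              [3, 4, 1, 4],
              [2, 0, 9, 1]], dtype=float)
ell = X.shape[0]
e = np.ones((ell, 1))

X_hat = (np.eye(ell) - e @ e.T / ell) @ X   # centered data
print(X_hat)
print(X_hat.mean(axis=0))                   # all zeros: each feature now has mean 0
```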

Center Φ(X) in Feature Space

We cannot center Φ(X) directly in feature space.

Center G = XX':

$$\hat{G} = \hat{X}\hat{X}' = \big(I - \tfrac{1}{\ell} ee'\big) XX' \big(I - \tfrac{1}{\ell} ee'\big)' = \big(I - \tfrac{1}{\ell} ee'\big) G \big(I - \tfrac{1}{\ell} ee'\big)'$$

This works in feature space too, where $G = \Phi(X)\Phi(X)' = K$:

$$\hat{K} = \big(I - \tfrac{1}{\ell} ee'\big) K \big(I - \tfrac{1}{\ell} ee'\big)'$$

Centering Kernel

Practical Computation:

$$\hat{K} = \big(I - \tfrac{1}{\ell} ee'\big) K \big(I - \tfrac{1}{\ell} ee'\big)$$

Let $\mu' = \tfrac{1}{\ell} e'K$ (row average of K).

Let $\tilde{K} = K - e\mu'$ (subtract the row average).

Let $c = \tfrac{1}{\ell} \tilde{K} e$ (column average of $\tilde{K}$).

Let $\hat{K} = \tilde{K} - c e'$ (subtract the column average).
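The four steps above translate directly into NumPy; this sketch (illustrative function name `center_kernel`) avoids forming the centering matrix explicitly.

```python
import numpy as np

def center_kernel(K):
    """Return K_hat = (I - ee'/ell) K (I - ee'/ell) using row/column averages."""
    ell = K.shape[0]
    mu = K.mean(axis=0, keepdims=True)          # mu' = (1/ell) e'K
    K_tilde = K - np.ones((ell, 1)) @ mu        # subtract the row average
    c = K_tilde.mean(axis=1, keepdims=True)     # c = (1/ell) K_tilde e
    return K_tilde - c @ np.ones((1, ell))      # subtract the column average
```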

Ridge Regression in Feature Space

Original way:

$$g(\Phi(x)) = \sum_{i=1}^{\ell} \alpha_i K(x_i, x), \qquad \alpha = (G + \lambda I)^{-1} y$$

Predicted normalized y:

$$\hat{g}(\hat{\Phi}(x)) = \sum_{i=1}^{\ell} \hat{\alpha}_i \hat{K}(x_i, x), \qquad \hat{\alpha} = (\hat{G} + \lambda I)^{-1} \hat{y}$$

Predicted original y:

$$g(\Phi(x)) = \sigma_y \sum_{i=1}^{\ell} \hat{\alpha}_i \hat{K}(x_i, x) + \mu_y$$

Worksheet

Normalized y:

$$\hat{y}_i = \frac{y_i - \mu_y}{\sigma_y}$$

Invert to get the unnormalized y:

$$y_i = \sigma_y \hat{y}_i + \mu_y$$
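The normalization and its inverse as a small sketch (helper names are illustrative); the same $\mu_y$ and $\sigma_y$ computed on the training targets must be reused when un-normalizing predictions.

```python
import numpy as np

def normalize_y(y):
    """y_hat_i = (y_i - mu_y) / sigma_y; also return the statistics for later inversion."""
    mu_y, sigma_y = y.mean(), y.std()
    return (y - mu_y) / sigma_y, mu_y, sigma_y

def unnormalize_y(y_hat, mu_y, sigma_y):
    """Invert: y_i = sigma_y * y_hat_i + mu_y."""
    return sigma_y * y_hat + mu_y
```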

Centering Test Data

Calculate the test kernel just like the training kernel:

$$\hat{K}_{tr} = \big(K_{tr} - e\mu'\big)\big(I - \tfrac{1}{\ell} ee'\big), \qquad \text{where } \mu' = \tfrac{1}{\ell} e' K_{tr}$$

$$\hat{K}_{tst} = \big(K_{tst} - e\mu'\big)\big(I - \tfrac{1}{\ell} ee'\big)$$

Prediction of the test data becomes:

$$g(\Phi(X_{tst})) = \sigma_y \hat{K}_{tst}\, \hat{\alpha} + \mu_y e, \qquad \hat{\alpha} = (\hat{G} + \lambda I)^{-1} \hat{y}$$
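A sketch of the whole centered pipeline, assuming `K_tr` holds the training kernel (ℓ × ℓ) and `K_tst` holds kernel values between test and training points; the function names are illustrative.

```python
import numpy as np

def center_train_test(K_tr, K_tst):
    """Center both kernels using the training-kernel statistics, as in the formulas above."""
    ell = K_tr.shape[0]
    e = np.ones((ell, 1))
    mu = (e.T @ K_tr) / ell                         # mu' = (1/ell) e' K_tr
    C = np.eye(ell) - e @ e.T / ell
    K_tr_hat = (K_tr - e @ mu) @ C
    K_tst_hat = (K_tst - np.ones((K_tst.shape[0], 1)) @ mu) @ C
    return K_tr_hat, K_tst_hat

def predict_test(K_tr_hat, K_tst_hat, y, lam):
    """Fit on the centered, normalized targets and return predictions on the original scale."""
    mu_y, sigma_y = y.mean(), y.std()
    y_hat = (y - mu_y) / sigma_y
    alpha_hat = np.linalg.solve(K_tr_hat + lam * np.eye(len(y)), y_hat)
    return sigma_y * (K_tst_hat @ alpha_hat) + mu_y
```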

Alternate Approach

Directly add bias to the model:

$$y_i \approx \langle x_i, w\rangle + b, \qquad b \text{ is the bias}$$

The optimization problem becomes:

$$\min_{w,b,z}\; f(w, b, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 \qquad \text{s.t. } y_i - \langle x_i, w\rangle - b = z_i, \quad i = 1, \ldots, \ell$$

Lagrangian Function

Consider the problem

$$\min_{r}\; f(r) \qquad \text{s.t. } h_i(r) = 0, \quad i = 1, \ldots, m$$

where $f : \mathbb{R}^n \to \mathbb{R}$ and each $h_i : \mathbb{R}^n \to \mathbb{R}$ are differentiable.

The Lagrangian function is

$$L(r, u) = f(r) + \sum_{i=1}^{m} u_i h_i(r)$$

$$\nabla_r L(r, u) = \nabla_r f(r) + \sum_{i=1}^{m} u_i \nabla_r h_i(r)$$

Lagrangian Function

Primal:

$$\min_{w,b,z}\; f(w, b, z) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 \qquad \text{s.t. } y_i - \langle x_i, w\rangle - b = z_i, \quad i = 1, \ldots, \ell$$

Lagrangian:

$$L(w, b, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 + \sum_{i=1}^{\ell} \alpha_i\big(y_i - \langle x_i, w\rangle - b - z_i\big)$$

$$\nabla_w L(w, b, z, \alpha) = \lambda w - X'\alpha = 0$$

$$\nabla_z L(w, b, z, \alpha) = z - \alpha = 0$$

$$\frac{\partial L(w, b, z, \alpha)}{\partial b} = -\sum_{i=1}^{\ell} \alpha_i = -e'\alpha = 0$$

Wolfe Dual Problem

Construct the Wolfe dual:

$$\max_{w,b,z,\alpha}\; L(w, b, z, \alpha) = \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{2}\sum_{i=1}^{\ell} z_i^2 + \sum_{i=1}^{\ell} \alpha_i\big(y_i - \langle x_i, w\rangle - b - z_i\big)$$

$$\text{s.t. } \nabla_w L = \lambda w - X'\alpha = 0, \qquad \nabla_z L = z - \alpha = 0, \qquad \frac{\partial L}{\partial b} = -e'\alpha = 0$$

Simplify by eliminating $z = \alpha$ and using $e'\alpha = 0$.

Simplified Problem

Eliminating $z = \alpha$:

$$\max_{w,\alpha}\; \tfrac{\lambda}{2}\|w\|^2 - \tfrac{1}{2}\sum_{i=1}^{\ell} \alpha_i^2 + \sum_{i=1}^{\ell} \alpha_i (y_i - b) - \sum_{i=1}^{\ell} \alpha_i \langle x_i, w\rangle$$

$$=\; \tfrac{\lambda}{2}\|w\|^2 - \tfrac{1}{2}\sum_{i=1}^{\ell} \alpha_i^2 + \sum_{i=1}^{\ell} \alpha_i y_i - \sum_{i=1}^{\ell} \alpha_i \langle x_i, w\rangle \qquad \text{(the } b \text{ term vanishes since } e'\alpha = 0\text{)}$$

$$\text{s.t. } \nabla_w L = \lambda w - X'\alpha = 0, \qquad e'\alpha = 0$$

Simplify by eliminating $w = X'\alpha / \lambda$.

Simplified Problem

Get rid of w

$$\max_{\alpha}\; -\frac{1}{2\lambda}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j \langle x_i, x_j\rangle - \frac{1}{2}\sum_{i=1}^{\ell} \alpha_i^2 + \sum_{i=1}^{\ell} \alpha_i y_i$$

$$\text{s.t. } \sum_{i=1}^{\ell} \alpha_i = 0$$

New Problem to be solved

Problem in matrix notation with G=XX’

This is a constrained optimization problem. The solution is still given by a system of equations, but not as simple a one.

$$\min_{\alpha}\; f(\alpha) = \frac{1}{2\lambda}\alpha' G \alpha + \frac{1}{2}\alpha'\alpha - y'\alpha$$

$$\text{s.t. } e'\alpha = 0$$
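The constrained problem can still be solved as one linear system by writing out its KKT conditions: stationarity gives $(G/\lambda + I)\alpha + \nu e = y$ and feasibility gives $e'\alpha = 0$, where the multiplier $\nu$ plays the role of the bias $b$. A minimal sketch (illustrative name `krr_with_bias`):

```python
import numpy as np

def krr_with_bias(G, y, lam):
    """Solve min (1/(2*lam)) a'Ga + (1/2)a'a - y'a  s.t. e'a = 0 via its KKT system."""
    ell = len(y)
    e = np.ones((ell, 1))
    # KKT system:  [G/lam + I   e] [alpha]   [y]
    #              [   e'       0] [  b  ] = [0]
    A = np.block([[G / lam + np.eye(ell), e],
                  [e.T, np.zeros((1, 1))]])
    rhs = np.concatenate([y, [0.0]])
    sol = np.linalg.solve(A, rhs)
    alpha, b = sol[:ell], sol[ell]
    # With this scaling, predictions are g(x) = (1/lam) * sum_i alpha_i K(x_i, x) + b.
    return alpha, b
```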

Kernel Ridge Regression

Centered algorithm just requires centering of the kernel and solving one equation.

Can also add bias directly.

+ Lots of fast equation solvers.

+ Theory supports generalization

- Requires the full training kernel to compute.

- Requires the full training kernel to predict future points.