Lecture Slides for Introduction to Machine Learning, 2nd Edition

ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e


Kernel Machines

Discriminant-based: no need to estimate densities first

Define the discriminant in terms of support vectors

The use of kernel functions: application-specific measures of similarity

No need to represent instances as vectors

Convex optimization problems with a unique solution


Optimal Separating Hyperplane

(Cortes and Vapnik, 1995; Vapnik, 1995)

Given a sample X = {x^t, r^t} where

r^t = +1 if x^t ∈ C_1
r^t = −1 if x^t ∈ C_2

find w and w_0 such that

w^T x^t + w_0 ≥ +1 for r^t = +1
w^T x^t + w_0 ≤ −1 for r^t = −1

which can be rewritten as

r^t (w^T x^t + w_0) ≥ +1


Margin

Distance from the discriminant to the closest instances on either side.

Distance of x^t to the hyperplane: |w^T x^t + w_0| / ‖w‖

We require r^t (w^T x^t + w_0) / ‖w‖ ≥ ρ, ∀t

For a unique solution, fix ρ‖w‖ = 1; then, to maximize the margin:

min (1/2)‖w‖² subject to r^t (w^T x^t + w_0) ≥ +1, ∀t

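To make the margin formulation concrete, here is a minimal sketch assuming NumPy and scikit-learn (the slides name no library): a linear SVM with a large C approximates the hard margin, so every instance should satisfy r^t (w^T x^t + w_0) ≥ +1, and the margin width is 2/‖w‖.

```python
# Minimal hard-margin check (NumPy/scikit-learn are assumptions of this sketch).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# a small, linearly separable two-class sample
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
r = np.where(y == 1, 1, -1)                     # labels in {-1, +1}

svm = SVC(kernel="linear", C=1000.0).fit(X, r)  # large C ~ hard margin
w, w0 = svm.coef_[0], svm.intercept_[0]

# every instance satisfies r^t (w^T x^t + w_0) >= +1 (up to tolerance)
print(np.all(r * (X @ w + w0) >= 1 - 1e-6))
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```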

Margin

[Figure: the separating hyperplane, its margin on either side, and the support vectors on the margin boundaries]

This constrained problem has the (primal) Lagrangian

L_p = (1/2)‖w‖² − Σ_t α^t [r^t (w^T x^t + w_0) − 1]
    = (1/2)‖w‖² − Σ_t α^t r^t (w^T x^t + w_0) + Σ_t α^t

Setting the derivatives with respect to w and w_0 to zero:

∂L_p/∂w = 0 ⇒ w = Σ_t α^t r^t x^t
∂L_p/∂w_0 = 0 ⇒ Σ_t α^t r^t = 0

Plugging these back into L_p gives the dual

L_d = (1/2) w^T w − w^T Σ_t α^t r^t x^t − w_0 Σ_t α^t r^t + Σ_t α^t
    = −(1/2) w^T w + Σ_t α^t
    = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s + Σ_t α^t

to be maximized subject to Σ_t α^t r^t = 0 and α^t ≥ 0, ∀t.

Most α^t are 0 and only a small number have α^t > 0; they are the support vectors.
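Continuing the earlier sketch (same assumed libraries): a fitted scikit-learn SVM stores α^t r^t for the support vectors in dual_coef_, so w = Σ_t α^t r^t x^t can be rebuilt from the support vectors alone.

```python
# Sketch: reconstruct w from the dual solution; dual_coef_ holds alpha^t * r^t
# for the support vectors only (all other alpha^t are 0).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)
r = np.where(y == 1, 1, -1)
svm = SVC(kernel="linear", C=1000.0).fit(X, r)

w_from_dual = svm.dual_coef_[0] @ svm.support_vectors_   # sum_t alpha^t r^t x^t
print(np.allclose(w_from_dual, svm.coef_[0]))            # True
print("support vectors:", len(svm.support_), "of", len(X))
```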

Soft Margin Hyperplane

Not linearly separable: allow slack ξ^t ≥ 0 in the constraints,

r^t (w^T x^t + w_0) ≥ 1 − ξ^t

Soft error: Σ_t ξ^t

New primal is

L_p = (1/2)‖w‖² + C Σ_t ξ^t − Σ_t α^t [r^t (w^T x^t + w_0) − 1 + ξ^t] − Σ_t μ^t ξ^t


Hinge Loss

L_hinge(y^t, r^t) = 0 if y^t r^t ≥ 1, and 1 − y^t r^t otherwise
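The loss is zero whenever the prediction is on the correct side with margin at least 1. A short NumPy sketch (NumPy is an assumption; the slide gives only the formula):

```python
# Elementwise hinge loss: 0 where y*r >= 1, else 1 - y*r.
import numpy as np

def hinge_loss(y, r):
    return np.maximum(0.0, 1.0 - y * r)

r = np.array([+1, +1, -1, -1])        # true labels
y = np.array([2.0, 0.5, -0.3, 1.2])   # discriminant outputs w^T x + w_0
print(hinge_loss(y, r))               # [0.  0.5  0.7  2.2]
```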

ν-SVM

min (1/2)‖w‖² − νρ + (1/N) Σ_t ξ^t

subject to r^t (w^T x^t + w_0) ≥ ρ − ξ^t, ξ^t ≥ 0, ρ ≥ 0

The dual is

L_d = −(1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s

subject to Σ_t α^t r^t = 0, 0 ≤ α^t ≤ 1/N, and Σ_t α^t ≥ ν

ν controls the fraction of support vectors: it is a lower bound on the fraction of instances that become support vectors and an upper bound on the fraction of margin errors.
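A quick illustration of that control knob, as a sketch assuming scikit-learn's NuSVC (not mentioned in the slides): the fraction of support vectors grows with ν.

```python
# nu lower-bounds the fraction of support vectors, so the fraction rises with nu.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import NuSVC

# overlapping classes, so some instances must become margin errors
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for nu in (0.05, 0.2, 0.5):
    model = NuSVC(nu=nu, kernel="linear").fit(X, y)
    print(f"nu={nu:4.2f}  fraction of support vectors = "
          f"{len(model.support_) / len(X):.2f}")
```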

Kernel Trick

Preprocess input x by basis functions: z = φ(x), g(z) = w^T z, so g(x) = w^T φ(x)

The SVM solution:

w = Σ_t α^t r^t z^t = Σ_t α^t r^t φ(x^t)

g(x) = w^T φ(x) = Σ_t α^t r^t φ(x^t)^T φ(x) = Σ_t α^t r^t K(x^t, x)
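The last line is what makes kernel machines practical: g(x) needs only K(x^t, x), never φ explicitly. A sketch (scikit-learn assumed) that rebuilds g(x) = Σ_t α^t r^t K(x^t, x) + w_0 from a fitted RBF SVM:

```python
# Verify g(x) = sum_t alpha^t r^t K(x^t, x) + w_0 against decision_function.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

K = rbf_kernel(X, svm.support_vectors_, gamma=1.0)  # K(x, x^t) for all SVs
g = K @ svm.dual_coef_[0] + svm.intercept_[0]       # dual_coef_ = alpha^t r^t
print(np.allclose(g, svm.decision_function(X)))     # True
```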

Vectorial Kernels

Polynomials of degree q:

K(x^t, x) = (x^T x^t + 1)^q

For example, for q = 2 and two-dimensional x and y:

K(x, y) = (x^T y + 1)² = (x_1 y_1 + x_2 y_2 + 1)²
        = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1² y_1² + x_2² y_2²

which corresponds to the basis functions

φ(x) = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1², x_2²]^T
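This identity is easy to confirm numerically; a NumPy sketch (NumPy assumed) comparing the kernel value with the explicit dot product in the φ space:

```python
# Check (x^T y + 1)^2 == phi(x)^T phi(y) for the degree-2 map above.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.3])
print((x @ y + 1) ** 2)   # 2.89
print(phi(x) @ phi(y))    # 2.89, identical
```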

Vectorial Kernels

Radial-basis functions:

K(x^t, x) = exp(−‖x^t − x‖² / (2 s²))
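One practical note, as a sketch assuming scikit-learn: libraries often parameterize the RBF kernel by gamma rather than the width s, with gamma = 1/(2s²).

```python
# The slide's width s and scikit-learn's gamma relate by gamma = 1 / (2 s^2).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 1.0]])
xt = np.array([[1.0, 3.0]])
s = 0.8
manual = np.exp(-np.sum((x - xt) ** 2) / (2 * s ** 2))
print(manual, rbf_kernel(x, xt, gamma=1 / (2 * s ** 2))[0, 0])  # equal
```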

Defining Kernels

Kernel "engineering": defining good measures of similarity

String kernels, graph kernels, image kernels, ...

Empirical kernel map: define a set of templates m_i and a score function s(x, m_i), let

φ(x^t) = [s(x^t, m_1), s(x^t, m_2), ..., s(x^t, m_M)]^T

and define

K(x, x^t) = φ(x)^T φ(x^t)
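A minimal sketch of the empirical kernel map (the score function here is an illustrative choice, not one fixed by the slides):

```python
# Empirical kernel map: score against M templates, then take a dot product.
import numpy as np

def score(x, m):
    # hypothetical similarity score; here a Gaussian of the squared distance
    return np.exp(-np.sum((x - m) ** 2))

def phi(x, templates):
    return np.array([score(x, m) for m in templates])

templates = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
x, xt = np.array([0.5, 0.5]), np.array([1.0, 0.0])
print(phi(x, templates) @ phi(xt, templates))   # K(x, x^t) = phi(x)^T phi(x^t)
```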

Multiple Kernel Learning

Fixed kernel combination:

K(x, y) = c K_1(x, y)
K(x, y) = K_1(x, y) + K_2(x, y)
K(x, y) = K_1(x, y) K_2(x, y)

Adaptive kernel combination:

K(x, y) = Σ_i η_i K_i(x, y)

L_d = Σ_t α^t − (1/2) Σ_t Σ_s α^t α^s r^t r^s Σ_i η_i K_i(x^t, x^s)

g(x) = Σ_t α^t r^t Σ_i η_i K_i(x^t, x)

Localized kernel combination, where the kernel weights are a function of the input:

g(x) = Σ_t α^t r^t Σ_i η_i(x) K_i(x^t, x)
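Fixed combinations are easy to try directly, since a sum (or product) of valid kernels is a valid kernel. A sketch (scikit-learn assumed) feeding a summed Gram matrix to an SVM:

```python
# Fixed kernel combination K = K1 + K2 via a precomputed Gram matrix.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
K = (polynomial_kernel(X, X, degree=2, gamma=1.0, coef0=1)
     + rbf_kernel(X, X, gamma=1.0))

svm = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy:", svm.score(K, y))
```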

Multiclass Kernel Machines

1-vs-all (a sketch follows after this list)

Pairwise separation

Error-Correcting Output Codes (Section 17.5)

Single multiclass optimization, where z^t is the class index of x^t:

min (1/2) Σ_i ‖w_i‖² + C Σ_i Σ_t ξ_i^t

subject to w_{z^t}^T x^t + w_{z^t 0} ≥ w_i^T x^t + w_{i0} + 2 − ξ_i^t, ∀i ≠ z^t, and ξ_i^t ≥ 0
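The 1-vs-all scheme from the first bullet, as a sketch (scikit-learn assumed): one binary machine g_i(x) per class, choosing the class with the largest discriminant.

```python
# 1-vs-all multiclass: train one binary SVM per class, pick the max g_i(x).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

machines = [SVC(kernel="rbf", gamma="scale").fit(X, (y == c).astype(int))
            for c in classes]                       # class c vs the rest
scores = np.column_stack([m.decision_function(X) for m in machines])
pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(pred == y))
```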

SVM for Regression

Use a linear model (possibly kernelized): f(x) = w^T x + w_0

Use the ε-sensitive error function:

e_ε(r^t, f(x^t)) = 0 if |r^t − f(x^t)| < ε, and |r^t − f(x^t)| − ε otherwise

min (1/2)‖w‖² + C Σ_t (ξ_+^t + ξ_−^t)

subject to

r^t − (w^T x^t + w_0) ≤ ε + ξ_+^t
(w^T x^t + w_0) − r^t ≤ ε + ξ_−^t
ξ_+^t, ξ_−^t ≥ 0
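In this formulation, instances strictly inside the ε-tube incur no loss and need not become support vectors. A sketch (scikit-learn's SVR assumed):

```python
# epsilon-sensitive regression: only points on or outside the tube become SVs.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80))[:, None]
r = np.sin(X[:, 0]) + rng.normal(0, 0.1, 80)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, r)
print("support vectors:", len(svr.support_), "of", len(X))
```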


Kernel Regression

[Figure: kernel regression fits with a polynomial kernel and with a Gaussian kernel]

One-Class Kernel Machines

Consider a sphere with center a and radius R:

min R² + C Σ_t ξ^t

subject to ‖x^t − a‖² ≤ R² + ξ^t and ξ^t ≥ 0, ∀t

The dual is

L_d = Σ_t α^t (x^t)^T x^t − Σ_t Σ_s α^t α^s (x^t)^T x^s

subject to 0 ≤ α^t ≤ C and Σ_t α^t = 1
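For a quick experiment, scikit-learn's OneClassSVM can stand in (an assumption of this sketch; it implements Schölkopf's ν-formulation with a separating hyperplane rather than the sphere above, and the two coincide for an RBF kernel, where all φ(x) have the same norm):

```python
# One-class classification: fit on "normal" data, flag points far outside it.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))            # normal data
outliers = rng.uniform(-6, 6, (10, 2))    # points far from the training cloud

occ = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.2).fit(X)
print("inliers kept:    ", np.mean(occ.predict(X) == 1))
print("outliers flagged:", np.mean(occ.predict(outliers) == -1))
```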


Kernel Dimensionality Reduction

Kernel PCA does PCA on the kernel matrix (equivalent to canonical PCA when the kernel is linear); a sketch follows below

Kernel LDA

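A sketch (scikit-learn assumed) of the kernel PCA claims: an RBF kernel unfolds data that linear PCA cannot, while a linear kernel reproduces canonical PCA up to per-component signs.

```python
# Kernel PCA on concentric circles; linear-kernel KPCA matches canonical PCA.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# with an RBF kernel the two circles become linearly separable in Z_rbf
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

Z_lin = KernelPCA(n_components=2, kernel="linear").fit_transform(X)
Z_pca = PCA(n_components=2).fit_transform(X)

# same projections up to a sign flip per component
print(np.allclose(np.abs(Z_lin), np.abs(Z_pca), atol=1e-6))
```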