nuno vasconcelos ece depp,artment, ucsdpositive definite kernels it is not hard to show that all dot...

Dot-product kernels

Nuno Vasconcelos ECE Department, UCSDp ,

Classificationa classification problem has two types of variables

• X - vector of observations (features) in the worldX - vector of observations (features) in the world• Y - state (class) of the world

Perceptron: classifier implements the linear decision rulep p

with [ ]g(x)sgn )( =xh bxwxg T +=)( w

appropriate when the classes arelinearly separable

b

x

xg )(

to deal with non-linear separabilitywe introduce a kernel

wb w

2

Kernel summary1. D not linearly separable in X, apply feature transformation Φ:X → Z,

such that dim(Z) >> dim(X)

2. computing Φ(x) too expensive:• write your learning algorithm in dot-product form• instead of Φ(xi) we only need Φ(xi)TΦ(xj) ∀ijinstead of Φ(xi), we only need Φ(xi) Φ(xj) ∀ij

3. instead of computing Φ(xi)TΦ(xj) ∀ij, define the “dot-product kernel”

)()(),( zxzxK T ΦΦ=

and compute K(xi,xj) ∀ij directly• note: the matrix

)()()(

⎥⎤

⎢⎡ M

is called the “kernel” or Gram matrix

⎥⎥⎥

⎦⎢⎢⎢

⎣

=M

LL ),( ji xxKK

3

4. forget about Φ(x) and use K(x,z) from the start!

Polynomial kernelsthis makes a significant difference when K(x,z) is easier to compute that Φ(x)TΦ(z)p ( ) ( )e.g., we have seen that

( ) TT zxzxzxK )()(),( 2ΦΦ== ( )

x

)()(),(

1 ⎞⎜⎛

ℜ→ℜΦ : with2dd

( )Tddddd

d

xxxxxxxxxxxxx

,,,,,,,, 2112111 LLLM →⎟⎟

⎠⎜⎜⎜

⎝

while K(x,z) has complexity O(d), Φ(x)TΦ(z) is O(d2)for K(x,z) = (xTz)k we go from O(d) to O(dk)

4

( , ) ( ) g ( ) ( )

Questionwhat is a good dot-product kernel?• intuitively a good kernel is one that maximizes the margin γ in• intuitively, a good kernel is one that maximizes the margin γ in

range space• however, nobody knows how to do this effectively

in practice:• pick a kernel from a library of known kernels• we talked about• we talked about

• the linear kernel K(x,z) = xTz• the Gaussian family 2zx −

• the polynomial family

σ),(zx

ezxK−

=

( )k

5

( ) { }L,2,1,1),( ∈+= kzxzxK kT

Question“this problem of mine is really asking for the kernel k’(x,z) = ...”( , )• how do I know if this is a dot-product kernel?

let’s start by the definitionDefinition: a mapping

k: X x X → ℜxxx

xx

xxx

xxxx

ooo

ooooo

oooo

x2 X

(x,y) → k(x,y)is a dot-product kernel if and only if

ooo x1

xxx

xx

xxx

xxxx

ooo

oo oo

ox1

Φ

H

k(x,y) = <Φ(x),Φ(y)>where Φ: X → H, H is a vector space and, <.,.> a dot-product in H

o o oo oox1

x3 x2xn

6

product in H

Vector spacesnote that both H and <.,.> can be abstract, not necessarily ℜdyDefinition: a vector space is a set H where • addition and scalar multiplication are defined and satisfy:

1) x+(x’+x’’) = (x+x’)+x” 5) λx ∈ H2) x+x’ = x’+x ∈ H 6) 1x = x3) 0 ∈ H, 0 + x = x 7) λ(λ’ x) = (λλ’)x) , ) ( ) ( )4) –x ∈ H, -x + x = 0 8) λ(x+x’) = λx + λx’

9) (λ+λ’)x = λx + λ’x

the canonical example is ℜd with standard vector additionthe canonical example is ℜ with standard vector addition and scalar multiplicationanother example is the space of mappings X → ℜ with

7

(f+g)(x) = f(x) + g(x) (λf)(x) = λf(x)

Bilinear formsto define dot-product we first need to recall the notion of a bilinear formDefinition: a bilinear form on a vector space H is a mapping

Q: H x H ℜQ: H x H → ℜ(x,x’) → Q(x,x’)

such that ∀ x,x’,x’’∈ Hsuch that ∀ x,x ,x ∈ H

i) Q[(λx+λx’),x”] = λQ(x,x”) + λ’Q(x’,x”)ii) Q[x”,(λx+λx’)] = λQ(x”,x) + λ’Q(x”,x’)) [ ( )] ( ) ( )

in ℜ d the canonical bilinear form is Q(x,x’) = xTAx’

8

if Q(x,x’) = Q(x’,x) ∀ x,x’∈ H, the form is symmetric

Dot productsDefinition: a dot-product on a vector space H is a symmetric bilinear formy

<.,.>: H x H → ℜ(x,x’) → <x,x’>

h th tsuch that

i) <x,x> ≥ 0, ∀ x∈ Hii) <x x> = 0 if and only if x = 0ii) <x,x> = 0 if and only if x = 0

note that for the canonical bilinear form in ℜ dnote that for the canonical bilinear form in ℜ<x,x> = xTAx

this means that A must be positive definite

9

this means that A must be positive definitexTAx > 0, ∀ x ≠ 0

Positive definite matricesrecall that (e.g. Linear Algebra and Applications, Strang)

Definition: each of the following is a necessary andDefinition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (semi) positive definite:

i) TA ≥ 0 0i) xTAx ≥ 0, ∀ x ≠ 0 ii) all eigenvalues of A satisfy λi ≥ 0iii) all upper-left submatrices Ak have non-negative determinanti ) there is a matri R ith independent ro s s ch thativ) there is a matrix R with independent rows such that

A = RTR

l ft b t iupper left submatrices:

L 322212

3,12,11,1

32,11,1

2111 ⎥⎥⎤

⎢⎢⎡

=⎥⎤

⎢⎡

== aaaaaa

Aaa

AaA

10

3,32,31,3

3,22,21,232,21,2

21,11

⎥⎥⎦⎢

⎢⎣

⎥⎦

⎢⎣ aaa

aa

Positive definite matricesproperty iv) is particularly interesting• in ℜ d <x x> = xTAx is a dot-product kernel if and only if A is• in ℜ , <x,x> = x Ax is a dot-product kernel if and only if A is

positive definite• from iv) this holds if and only if there is R such that A = RTR• hence

<x,y> = xTAy = (Rx)T(Ry) = Φ(x)TΦ(y)with

Φ: ℜ d → ℜ d

x → Rx

i.e. the dot-product kernelp

k(x,z) = xTAz, (A positive definite)is the standard dot-product in the range space of the

11

is the standard dot product in the range space of the mapping Φ(x) = Rx

Notethere are positive semidefinite matrices

xTAx ≥ 0x Ax ≥ 0and positive definite matrices

xTAx > 0

we will work with semidefinite but, to simplify, will call definiteif we really need > 0 we will say “strictly positive definite”

12

Positive definite kernelshow do we define a positive definite function?Definition: a function k(x y) is a positive definite kernel onDefinition: a function k(x,y) is a positive definite kernel on X xX if ∀ l and ∀ {x1, ..., xl}, xi∈ X, the Gram matrix

⎤⎡ M

⎥⎥⎥

⎦

⎤

⎢⎢⎢

⎣

⎡=

M

LL

M

),( ji xxkK

is positive definite.Note: this implies that

⎥⎦⎢⎣ M

Note: this implies that• k(x,x) ≥ 0 ∀ x∈ X

• PD ∀ x,y∈ X⎤⎡ ),(),( yxkxxk (*)

13

yetc...⎥

⎦

⎤⎢⎣

⎡),(),(),(),(

yykxyky ( )

Positive definite kernelsthis proves some simple properties• a PD kernel is symmetrica PD kernel is symmetric

Xyxxykyxk ∈∀= ,),,(),( Proof:

since PD means symmetric (*) implies k(x,y) = k(y,x) ∀ x,y∈ X

• Cauchy-Schwarz inequality for kernels: if k(x,y) is a PD kernel, y q y ( y)then

Xyxyykxxkyxk ∈∀≤ ,),,(),(),( 2

Proof:from (*), and property iii) of PD matrices, the determinant of the 2x2 matrix of (*) is non-negative This means that

14

2x2 matrix of ( ) is non negative. This means that

k(x,x)k(y,y) – k(x,y)2 ≥ 0

Positive definite kernelsit is not hard to show that all dot product kernels are PDLemma 1: Let k(x y) be a dot product kernel Then k(x y)Lemma 1: Let k(x,y) be a dot-product kernel. Then k(x,y)is positive definiteproof: p• k(x,y) dot product kernel implies that• ∃Φ and some dot product <.,.> such that

k( ) <Φ( ) Φ( )>k(x,y) = <Φ(x),Φ(y)> • this implies that if:

• we pick any l, and any sequence {x1, ..., xl}, ⎤⎡ My y { 1 l}• and let K be the associated Gram matrix • then, for ∀ c ≠ 0 ⎥

⎥⎥

⎦

⎤

⎢⎢⎢

⎣

⎡=

M

LL

M

),( ji xxkK

15

Positive definite kernels

( ),= ∑ij

jijiT xxkccKcc

( ) ( ),ΦΦ= ∑ij

jiji

j

xxcc (k is dot product)

( ) ( ), ΦΦ= ∑∑j

jji

ii xcxc (<.,.> is a bilinear form)

( ) 02

≥Φ= ∑ ii

j

xc (from def of dot product)i product)

16

Positive definite kernelsthe converse is also true but more difficult to proveLemma 2: Let k(x y) x y∈ X be a positive definiteLemma 2: Let k(x,y), x,y∈ X, be a positive definite kernel. Then k(x,y) is a dot product kernelproof:p• we need to show that there is a transformation Φ, a vector space

H = Φ(X), and a dot product <.,.>* in H such that k(x y) <Φ(x) Φ(y)> x2k(x,y) = <Φ(x),Φ(y)>*

• we proceed in three steps1. construct a vector space H

x xx

x

xxx

x

xxx

x

ooo

o

oooo

oooo

x1

x2

Φ

X

p H

2. define the dot-product <.,.>* on H

3. show that k(x,y) = <Φ(x),Φ(y)>* holdsx xx

x

xxx

x

xxx

x

ooo

ooo

oox1

Φ

H

17

ooo

oo ox1

x3 x2xn

The vector space Hwe define H as the space spanned by linear combinations of k(.,xi)( , i)

H = ( )⎭⎬⎫

⎩⎨⎧

∈∀∀= ∑=

m

iiii Xxmxkff

1,,.,(.)|(.) α

notation: by k(.,xi) we mean a function of g(y) = k(y,xi) of y, xi is fixed.homework: check that H is a vector space• e.g. 2) ( )

⎪⎪⎫

= ∑ .,(.) xkfm

iiα ( )

( )∈+=+

⎪⎪⎭

⎪⎪⎬

= ∑

∑= (.)(.)'(.)'(.)

'.,(.)'

.,(.)

'

1

1 ffffxkf

xkf

m

ijj

iii

β

αH

18

⎪⎭=1i

Examplewhen we use the Gaussian kernel

2. ix−

k( ) i G i t d ith i I

2)(., σi

i exK−

=

k(.,xi) is a Gaussian centered on xi with covariance σIand

⎪⎫⎪⎧−m x i.

H =

is the space of all linear⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

∀∀= ∑=

−m

iii xmeff

1

,,(.)|(.) 2 σα

is the space of all linear combinations of Gaussiansnote that these are not mixtures

e.g.

19

but close

The operator <.,.>*

if f(.) and g(.) ∈ H, with'mm

( ) ( )∑∑==

==11

'.,(.).,(.)m

jjj

m

iii xkgxkf βα (**)

we define the operator <.,.>* as

'

( )∑∑= =

=m

iji

m

jji xxkgf

1

'

1*

',, βα (***)

20

Examplewhen we use the Gaussian kernel

2. ix−

th t < > i i ht d f G i t

2)(., σi

i exK−

=

the operator <.,.>* is a weighted sum of Gaussian terms

∑∑−

−m xxm ji

f''

2

2

β

you can look at this as either:

∑∑= =

=i j

ji egf1 1

*

2, σβα

you can look at this as either:• a dot product in H (still need to prove this)• a non-linear measure of similarity in X, somewhat related to

21

ylikelihoods

The operator <.,.>*

important note: for f(.) and g(.) ∈ H, the operator <.,.>*

m m '

( )∑∑= =

=m

iji

m

jji xxkgf

1

'

1*

',, βα

has the property

( )xxkxkxk ')'()( = (****)

proof: just make

( )jiji xxkxkxk ,)(.,),(.,*

= ( )

p j

⎩⎨⎧

≠∀==≠∀==

jkik

kj

ki 0,10,1

ββαα

22

⎩ ≠∀ jkkj 0,1 ββ

The operator <.,.>*

assume that <.,.>* is a dot product in H (proof in moments))since

( )jiji xxkxkxk ,)(.,),(.,*

=

then, clearly

( )*

)(),(, jiji xxxxk ΦΦ=

with

( )*

)(),(, jiji

Φ: X → Hx → k( x)

i.e. the kernel is a dot-product on H, which results from the feature transformation Φ

x → k(.,x)

23

this proves Lemma 2

Examplewhen we use the Gaussian kernel• the point xi ∈ ℜd is mapped into the Gaussian G(x xi σI)

σ

2

),(ixx

i exxK−

−=

• the point xi ∈ ℜ is mapped into the Gaussian G(x,xi,σI)• H is the space of all functions that are linear combinations of

Gaussians• this has infinite dimension• the kernel is a dot product in H, and a non-linear

similarity on X

xx

x

x

xx

x

xx2

ℜd H*

x

xx

x

oo

o

o

o

o oo

o oo

ox1

xx

x

x

x

xx

x

xx

x

x

o

o

ooo

o Φ

*

24

o o

x3x2

xn

x oo

o

ooo

x1*

In summaryto show that k(x,y), x,y∈ X, positive definite ⇒ k(x,y) is a dot product kernelpwe need to show that

( )m m '

( )∑∑= =

=i

jij

ji xxkgf1 1

*',, βα

is a dot product on

H = ( ) ⎬⎫

⎨⎧

∈∀∀= ∑m

Xxmxkff ( )|( ) αH =

this reduces to verifying the dot product conditions

( )⎭⎬

⎩⎨ ∈∀∀= ∑

=iiii Xxmxkff

1,,.,(.)|(.) α

25

y g p

The operator <.,.>*

1) is <.,.>* a bilinear form on H ?

by definition of f( ) and g( ) in (**)by definition of f(.) and g(.) in ( )

( ) ( )'

*'.,,.,, j

m

j

m

ii xkxkgf ∑∑= βα

on the other hand,*11 ji

∑∑==

m m '

( )

( )

∑∑= =

=

m m

m

iji

m

jji xxkgf

'

1 1*

',, βα from (***)

equality of the two left hand sides is the definition of

( ) ( )∑∑= =

=m

iji

m

jji xkxk

1*

1'.,,.,βα from (****)

26

q ybilinearity

The operator <.,.>*

2) is <.,.>* symmetric?note thatnote that

( )*

1

'

1*

,,', gfxxkfgm

iij

m

jji == ∑∑

= =

βα

if and only if k(xi,xj’) = k(xj’,xi) for all xi, xj’.b t thi f ll f th iti d fi it f k( )

1 1i j

but this follows from the positive definiteness of k(x,y)we have seen that a PD kernel is always symmetrich i t ihence, <.,.>* is symmetric

27

The operator <.,.>*

3) is <f,f>* ≥ 0, ∀ f ∈ H ?

by definition of f( ) in (**)by definition of f(.) in ( )

( ) αααα Kxxkff Tm

ji

m

ji == ∑∑*,,

where α ∈ ℜm and K is the Gram matrixsince k(x y) is positive definite K is positive definite by

i j= =1 1

since k(x,y) is positive definite, K is positive definite by definition and <f,f>* ≥ 0the only non-trivial part of the proof is to show that

(x)

<f,f>* = 0 ⇒ f = 0we need two more results

28

The operator <.,.>*

Lemma 3: <.,.>* is itself a positive definite kernel on H x H

proof:proof:• consider any sequence {f1, ..., fm}, fi ∈ H

• then

=∑ ∑∑*

*,, ffff

ij jjj

iiijiji γγγγ (by bilinearity of <.,.>*)

0

≥

=*21

*

,ggj j

(for some g1,g2 ∈ H )

(by (x) )

• hence the Gram matrix is always PD and the kernel <.,.>* is PD

( y ( ) )

29

The operator <.,.>*

Lemma 4: ∀ f ∈ H, <k(.,x),f(.)>* = f(x)

proof:proof:

)(.,),(.,(.)),(.,*

xkxkfxk ii= ∑α (by (**) )

)(.,),(.,*

*

xkxki

ii

i

= ∑α (by bilinearity of <.,.>*)

)(),( xfxxki

ii == ∑α (by (****) )

30

The operator <.,.>*

4) we are now ready to prove that <f,f>* = 0 ⇒ f = 0proof:proof:• since <.,.>* is a PD kernel (lemma 3) we can apply Cauchy-

Schwarz Xyxyykxxkyxk ∈∀≤ ,),,(),(),( 2 • using k(.,x) as x and f(.) as y this becomes

( )2

***),(.,,)(.,),(., fxkffxkxk ≥

• and using lemma 4

)(,),( 2*

xfffxxk ≥

• from which <f,f>* = 0 ⇒ f = 0

31

In summarywe have shown that

( )∑∑= =

=m

iji

m

jji xxkgf

1

'

1*

',, βα

is a dot product on

( ) ⎬⎫

⎨⎧

∀∀∑m

Xxmxkff ( )|( )H =

and this shows that if k(x y) x y∈ X is a positive definite

( )⎭⎬

⎩⎨ ∈∀∀= ∑

=iiii Xxmxkff

1,,.,(.)|(.) α

and this shows that if k(x,y), x,y∈ X, is a positive definite kernel, then k(x,y) is a dot product kernel.since we had initially proven the converse, we have the

32

y p ,following theorem.

Dot product kernelsTheorem: k(x,y), x,y∈ X, is a dot-product kernel if and only if it is a positive definite kernely p

this is interesting because it allows us to check whether a k l i d t d t t!kernel is a dot product or not!• check if the Gram matrix is positive definite for all possible

sequences {x1, ..., xl}, xi∈ X

but the proof is much more interesting than this result aloneit actually gives us insight on what the kernel is doinglet’s summarize

33

Dot product kernelsa dot product kernel k(x,y), x,y∈ X:• applies a feature transformationapplies a feature transformation

Φ: X → Hx → k(.,x)

• to the vector space

H =

x → k(.,x)

( )⎭⎬⎫

⎩⎨⎧

∈∀∀= ∑m

iii Xxmxkff ,,.,(.)|(.) α

• where the kernel implements the dot product

( )⎭⎬

⎩⎨ ∑

=iiii

1,,,( )|( )

m m '

( )∑∑= =

=m

iji

m

jji xxkgf

1 1*

',, βα

34

Dot product kernelsthe dot product

m m '

h th d i t

( )∑∑= =

=m

iji

m

jji xxkgf

1

'

1*

',, βα

has the reproducing property

)((.)),(.,*

xffxk =you can think of this as analog to the convolution with a Dirac deltawe will talk about this a lot in the coming lecturesfinally, <.,.>* is itself a positive definite kernel on H x H

35

A good picture to rememberwhen we use the Gaussian kernel• the point xi ∈ ℜd is mapped into the Gaussian G(x xi σI)

σ

2

),(ixx

i exxK−

−=

• the point xi ∈ ℜ is mapped into the Gaussian G(x,xi,σI)• H is the space of all functions that are linear combinations of

Gaussiansth k l i d t d t i• the kernel is a dot product in H

• the dot product with one of the Gaussianshas the reproducing property

xx

x

x

xx

x

xx2

ℜd H*

x

xx

x

oo

o

o

o

o oo

o oo

ox1

xx

x

x

x

xx

x

xx

x

x

o

o

ooo

o Φ

*

36

o o

x3x2

xn

x oo

o

ooo

x1*

nuno vasconcelos ece depp,artment, ucsdpositive definite kernels it is not hard to show that all dot...

Documents