

Lecture 4: Elementary Kernel Algorithms

Pavel Laskov, Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, May 15, 2012


Lessons from Last Lecture

Matrices and their properties

Vector spaces and Hilbert spaces

Kernel functions and RKHS

Relation between kernel functions and positive definiteness

Closure properties of kernel functions

[Figure 6.1: Reproducing property of a reproducing kernel — g(x) = K(x, y) and f(x) as elements of F, evaluated on the domain E.]

Since g(x) in Definition 6.4.1 is equal to K(x, y) as a function of x evaluated at fixed y, the inner product ⟨f(x), g(x)⟩_F is equal to the evaluation of f(x) at the point y. It is convenient to introduce the notation K(x, ·), which means "K of a fixed argument x considered as a function in F". In this notation the reproducing property (6.37) can be expressed as:

K(x, y) = ⟨K(x, ·), K(y, ·)⟩_F                        (6.38)

This expression, also referred to as the inner product property, is important for three reasons. First, it sheds light on how a kernel function can provide the value of an inner product without ever computing the latter: by definition, the reproducing property defines a mapping from the domain of a set of functions into the set itself, and thus an operation in the space of functions can be substituted by an operation in their domain. Second, property (6.38) can be used to verify that a given function constitutes a reproducing kernel. Lastly, the cross-kernel will be defined solely in terms of the inner product property rather than the reproducing property. The inner product property is illustrated in Figure 6.2.

Example. Let F₂ be the space of homogeneous polynomials in two variables:

f(u, v) = a u² + b √2 uv + c v².

[Figure 6.2: Inner product property of a reproducing kernel — f(t) = K(x, ·) and g(t) = K(y, ·) as functions in F.]

The basis functions in this space are {u², √2 uv, v²}, and the inner product is defined as

⟨f₁, f₂⟩_{F₂} ≡ a₁a₂ + b₁b₂ + c₁c₂.

The function K((u₁, v₁), (u₂, v₂)) = (u₁u₂ + v₁v₂)² is the reproducing kernel in F₂. Indeed, for fixed (u₁, v₁),

K((u₁, v₁), (u, v)) = (u₁u + v₁v)² = u₁² u² + √2 u₁v₁ · √2 uv + v₁² v².

The latter is obviously an element of F₂ (with coefficients a = u₁², b = √2 u₁v₁, c = v₁²), and hence K satisfies property 1 of Definition 6.4.1. Also,

⟨K((u₁, v₁), ·), K((u₂, v₂), ·)⟩_{F₂} = ⟨u₁² u² + √2 u₁v₁ √2 uv + v₁² v², u₂² u² + √2 u₂v₂ √2 uv + v₂² v²⟩_{F₂}
                                     = u₁²u₂² + √2 u₁v₁ · √2 u₂v₂ + v₁²v₂²
                                     = (u₁u₂ + v₁v₂)².

Therefore, K satisfies property 2 of Definition 6.4.1.


Agenda for Today

Carrying out elementary operations on data in the feature space without explicit access to it:

  Computing norms of data points
  Computing distances between data points
  Scaling data
  Centering data
  Projecting data onto arbitrary directions

Simple learning algorithms using kernel operations:

  Centroid anomaly detection
  Fisher Discriminant Analysis

Normed Vector Space

Vector space:

  Closure under addition: ∀ x, z ∈ X ⇒ x + z ∈ X
  Closure under scalar multiplication: ∀ x ∈ X, a ∈ ℜ ⇒ a · x ∈ X
  7 additional axioms (see last lecture)

A normed vector space has a unary norm operator ‖·‖ with the following properties:

  Nonnegativity: x ≠ 0 ⇒ ‖x‖ > 0
  Scalability: ‖a · x‖ = |a| · ‖x‖
  Triangle inequality: ‖x + z‖ ≤ ‖x‖ + ‖z‖

The norm represents the notion of length.

Inner Product Spaces

An inner product vector space has a binary inner product operator ⟨·, ·⟩ with the following properties:

  Commutativity: ⟨x, z⟩ = ⟨z, x⟩
  Associativity: ⟨a · x, z⟩ = a · ⟨x, z⟩
  Distributivity: ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩
  Positive-definiteness: ⟨x, x⟩ ≥ 0, with equality only for x = 0

Inner product spaces are normed spaces: the Euclidean norm is defined as ‖x‖₂ = √⟨x, x⟩

The inner product represents the notion of angle:

∠(x, z) = arccos( ⟨x, z⟩ / (‖x‖ ‖z‖) )
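As a quick numerical illustration of the norm and angle formulas above, here is a minimal NumPy sketch (the vectors x and z are arbitrary examples, not taken from the lecture):

```python
import numpy as np

x = np.array([3.0, 4.0])
z = np.array([1.0, 0.0])

norm_x = np.sqrt(np.dot(x, x))                    # ||x||_2 = sqrt(<x, x>) = 5.0
cos_xz = np.dot(x, z) / (norm_x * np.linalg.norm(z))
angle_xz = np.arccos(np.clip(cos_xz, -1.0, 1.0))  # angle between x and z, ~0.927 rad

print(norm_x, angle_xz)
```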

Application to RKHS

Given the kernel function k(x, y), how can we compute the Euclidean norm of the image φ(x)?

‖φ(x)‖₂ = √⟨φ(x), φ(x)⟩ = √k(x, x)

Similarly, one can compute the squared distance between two arbitrary feature vectors:

‖φ(x) − φ(z)‖₂² = ⟨φ(x) − φ(z), φ(x) − φ(z)⟩
                = ⟨φ(x), φ(x)⟩ − 2 ⟨φ(x), φ(z)⟩ + ⟨φ(z), φ(z)⟩
                = k(x, x) − 2 k(x, z) + k(z, z)
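These two identities translate directly into code. A minimal sketch follows; the Gaussian RBF kernel is only an illustrative choice, and any valid kernel function could be plugged in:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # example kernel: k(x, z) = exp(-gamma ||x - z||^2)
    d = x - z
    return np.exp(-gamma * np.dot(d, d))

def feature_norm(x, k):
    # ||phi(x)||_2 = sqrt(k(x, x))
    return np.sqrt(k(x, x))

def feature_distance(x, z, k):
    # ||phi(x) - phi(z)||_2 = sqrt(k(x, x) - 2 k(x, z) + k(z, z))
    return np.sqrt(k(x, x) - 2.0 * k(x, z) + k(z, z))

x, z = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(feature_norm(x, rbf_kernel), feature_distance(x, z, rbf_kernel))
```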

Distance from the Center of Mass

Denote the average of all data points as the center of mass:

φ_S = (1/n) ∑_{i=1}^{n} φ(xᵢ)

Then the squared distance of any point from the center of mass can be computed as follows:

‖φ(x) − φ_S‖₂² = ⟨φ(x) − (1/n) ∑_{i=1}^{n} φ(xᵢ), φ(x) − (1/n) ∑_{i=1}^{n} φ(xᵢ)⟩
               = ⟨φ(x), φ(x)⟩ − (2/n) ∑_{i=1}^{n} ⟨φ(x), φ(xᵢ)⟩ + (1/n²) ∑_{i,j=1}^{n} ⟨φ(xᵢ), φ(xⱼ)⟩
               = k(x, x) − (2/n) ∑_{i=1}^{n} k(x, xᵢ) + (1/n²) ∑_{i,j=1}^{n} k(xᵢ, xⱼ)
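In terms of precomputed kernel values, the distance needs only the vector of k(x, xᵢ) and the training Gram matrix. A minimal sketch (the function and argument names are illustrative):

```python
import numpy as np

def dist2_to_center_of_mass(k_xx, k_x_train, K_train):
    """Squared feature-space distance of x from the center of mass phi_S.

    k_xx      -- scalar k(x, x)
    k_x_train -- length-n vector of k(x, x_i)
    K_train   -- n x n kernel matrix of the training sample
    """
    n = len(k_x_train)
    return k_xx - 2.0 * np.mean(k_x_train) + np.sum(K_train) / n**2
```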

Simple Anomaly Detection Algorithm

Points that are far away from the expected mean of the data should be considered anomalous.

How far is "far away"?

How to compute the expected mean?

Centroid Anomaly Detection with Finite Data

Consider the finite sample S = {x₁, . . . , xₙ} of size n.

Compute the empirical mean of the data, and the distances r^i_emp from the mean for all data points (use "kernelization" if necessary).

Let k be the index of the point with the largest distance r^k_emp.

It can be shown that the distance between the empirical mean and the expected mean is bounded, with probability 1 − δ, as follows:

g(S) = ‖x̄ − E[x]‖ ≤ √(2R²/n) (√2 + √(ln(1/δ)))

where R is the supremum over the norms of the points in the distribution.
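To get a feeling for the size of this bound, here is a small numerical sketch of g(S) exactly as written above (the values of R, n, and δ are arbitrary examples):

```python
import numpy as np

def g_bound(R, n, delta):
    # g(S) <= sqrt(2 R^2 / n) * (sqrt(2) + sqrt(ln(1/delta)))
    return np.sqrt(2.0 * R**2 / n) * (np.sqrt(2.0) + np.sqrt(np.log(1.0 / delta)))

print(g_bound(R=1.0, n=1000, delta=0.05))   # ~0.14 for these example values
```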

Centroid Anomaly Detection with Finite Data (ctd.)

By the triangle inequality,

r_emp ≤ r + g(S)
r^k ≤ r^k_emp + g(S)

Rearranging the terms, we obtain

r ≥ r_emp − g(S)
−r^k ≥ −r^k_emp − g(S)

Adding these together, we obtain:

r − r^k ≥ r_emp − r^k_emp − 2 g(S)

[Figure: a new point x and the training point x^k, with distances r, r^k to the expected mean and r_emp, r^k_emp to the empirical mean; g bounds the distance between the two means.]

Centroid Anomaly Detection: Decision Making

What is the meaning of the inequality

r − r^k ≥ r_emp − r^k_emp − 2 g(S) ?

If the right-hand side is greater than 0, i.e., r_emp > r^k_emp + 2 g(S), then the left-hand side is also greater than 0, i.e., the new data point x is farther away from the expected mean than the point x^k.

☺ And we have done this without ever computing r or r^k!

☹ The "margin" g(S) can be pretty large, especially if we want high confidence, i.e., small δ (g ∼ √(ln(1/δ)))

☹ The "margin" g(S) is infinite unless the distribution is bounded (g ∼ R)
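The full decision rule can be assembled from the kernel expressions derived earlier. Below is a minimal sketch of such a centroid anomaly detector; the function and argument names are illustrative, and R and δ must be supplied by the user:

```python
import numpy as np

def r_emp(k_xx, k_x_train, K_train):
    # distance of x from the *empirical* center of mass, computed via kernels only
    n = K_train.shape[0]
    d2 = k_xx - 2.0 * np.mean(k_x_train) + np.sum(K_train) / n**2
    return np.sqrt(max(d2, 0.0))

def is_anomalous(k_xx, k_x_train, K_train, R, delta):
    """Flag x if r_emp(x) > r^k_emp + 2 g(S): then, with probability 1 - delta,
    x lies farther from the *expected* mean than the farthest training point."""
    n = K_train.shape[0]
    # empirical distances of all training points from the empirical mean
    d2 = np.diag(K_train) - 2.0 * K_train.mean(axis=1) + np.sum(K_train) / n**2
    r_k_emp = np.sqrt(np.maximum(d2, 0.0)).max()
    g = np.sqrt(2.0 * R**2 / n) * (np.sqrt(2.0) + np.sqrt(np.log(1.0 / delta)))
    return r_emp(k_xx, k_x_train, K_train) > r_k_emp + 2.0 * g
```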

Centroid Anomaly Detection: Lessons Learned

The prediction risk on unknown data can often be estimated in the form:

R(x) ≤ R_emp(S) + h(S)

Minimization of the empirical risk alone does not suffice!

The bound on the prediction risk often depends on the supremum R of the norms of the data points.

Data of unbounded norm is bad!

Using Kernels for Data Normalization

Suppose we want to project all data in the feature space to have norm 1 . . .

[Figure: data points x₁, . . . , x₁₂ being scaled to unit norm in the feature space.]

Divide each point by its norm:

φ̃(x) = φ(x) / ‖φ(x)‖

Use inner products between the normalized data:

k̃(x, z) = ⟨φ(x)/‖φ(x)‖, φ(z)/‖φ(z)‖⟩ = ⟨φ(x), φ(z)⟩ / (‖φ(x)‖ ‖φ(z)‖) = k(x, z) / √(k(x, x) k(z, z))
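At the level of the kernel matrix, this normalization is a one-liner. A minimal sketch (K is any precomputed Gram matrix):

```python
import numpy as np

def normalize_kernel(K):
    # k~(x, z) = k(x, z) / sqrt(k(x, x) k(z, z)); all phi(x) end up with unit norm
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```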

Using Kernels for Data Centering

Suppose we want to translate the origin to the center of mass of the data. We will see in the next lecture (homework) why . . .

[Figure: data points x₁, . . . , x₁₂ and their center of mass in the feature space.]

The inner product in the new coordinate system becomes

k̃(x, z) = ⟨φ(x) − (1/n) ∑_{i=1}^{n} φ(xᵢ), φ(z) − (1/n) ∑_{i=1}^{n} φ(xᵢ)⟩
         = k(x, z) − (1/n) ∑_{i=1}^{n} k(x, xᵢ) − (1/n) ∑_{i=1}^{n} k(z, xᵢ) + (1/n²) ∑_{i,j=1}^{n} k(xᵢ, xⱼ)

The kernel matrix for the training data can be expressed as:

K̃ = K − (1/n) 1ₙₙ K − (1/n) K 1ₙₙ + (1/n²) 1ₙₙ K 1ₙₙ

where 1ₙₙ denotes the n × n matrix of all ones.
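The centering formula for the training kernel matrix can be implemented directly. A minimal sketch, with 1ₙₙ realized as an all-ones matrix:

```python
import numpy as np

def center_kernel(K):
    # K~ = K - (1/n) 1 K - (1/n) K 1 + (1/n^2) 1 K 1,  1 being the n x n all-ones matrix
    n = K.shape[0]
    ones_n = np.ones((n, n)) / n
    return K - ones_n @ K - K @ ones_n + ones_n @ K @ ones_n
```

Equivalently, K̃ = H K H with the centering matrix H = I − (1/n) 1ₙₙ.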

Fisher Discriminant Analysis

Pick some direction w

Project all data points onto w

Compute the averages μ⁺_w and μ⁻_w of the sets of projected points from the two classes

Compute the corresponding standard deviations σ⁺_w and σ⁻_w

Choose w so as to maximize the following cost function:

J(w) = ‖μ⁺_w − μ⁻_w‖² / ((σ⁺_w)² + (σ⁻_w)²)

Projection of a Point onto a Direction

The projection of a point x onto a direction w is computed as:

x_w = (w/‖w‖) ⟨w, x⟩

The average projection of a set of points onto w can be computed as:

μ_w = (1/n) ∑_{i=1}^{n} (w/‖w‖) ⟨w, xᵢ⟩ = (1/‖w‖) w m⊤w,

where

m := (1/n) ∑_{i=1}^{n} xᵢ
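Combining these projection formulas with the cost function from the previous slide, J(w) can be evaluated directly in input space. A minimal sketch (X_pos and X_neg hold the two classes as rows; the names are illustrative). Since J(w) is invariant to the scale of w, the unit direction w/‖w‖ is used for the projections:

```python
import numpy as np

def fisher_criterion(X_pos, X_neg, w):
    # J(w) = ||mu+_w - mu-_w||^2 / ((sigma+_w)^2 + (sigma-_w)^2)
    u = w / np.linalg.norm(w)              # unit direction
    p_pos, p_neg = X_pos @ u, X_neg @ u    # scalar projections onto w/||w||
    numerator = (p_pos.mean() - p_neg.mean()) ** 2
    denominator = p_pos.var() + p_neg.var()
    return numerator / denominator
```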

"Between" Scatter Matrix

Consider the numerator of J(w):

‖μ⁺_w − μ⁻_w‖² = ⟨μ⁺_w − μ⁻_w, μ⁺_w − μ⁻_w⟩

               = (1/‖w‖²) (w [m⁺]⊤w − w [m⁻]⊤w)⊤ (w [m⁺]⊤w − w [m⁻]⊤w)

               = (1/‖w‖²) (w⊤m⁺ w⊤w [m⁺]⊤w − w⊤m⁻ w⊤w [m⁺]⊤w − w⊤m⁺ w⊤w [m⁻]⊤w + w⊤m⁻ w⊤w [m⁻]⊤w)

               = (‖w‖²/‖w‖²) w⊤ (m⁺ − m⁻)(m⁺ − m⁻)⊤ w  =  w⊤ S_B w

S_B = (m⁺ − m⁻)(m⁺ − m⁻)⊤ is called the "between" scatter matrix.

"Within" Scatter Matrix

Consider the terms in the denominator of J(w):

(σ⁺_w)² = (1/n⁺) ∑_{i=1}^{n⁺} ‖ (w/‖w‖) ⟨w, xᵢ⟩ − (1/n⁺) (w/‖w‖) ∑_{j=1}^{n⁺} ⟨w, xⱼ⟩ ‖²

        = (1/n⁺) (1/‖w‖²) ∑_{i=1}^{n⁺} ‖ w xᵢ⊤w − w [m⁺]⊤w ‖²

        = w⊤ [ (1/n⁺) ∑_{i=1}^{n⁺} (xᵢ − m⁺)(xᵢ − m⁺)⊤ ] w  =  w⊤ S⁺_W w

S_W is called the "within" scatter matrix (here S⁺_W for the positive class; S⁻_W is defined analogously).

Fisher Discriminant Analysis (ctd.)

Putting all terms together, we obtain the following optimization problem:

max_w J(w) = (w⊤ S_B w) / (w⊤ (S⁺_W + S⁻_W) w)

Using the properties of the Rayleigh coefficient (Lecture 3, slide 14), it can be shown that the optimal solution is given by the leading eigenvector of the matrix

(S⁺_W + S⁻_W)⁻¹ S_B
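The eigenvalue characterization can be used as-is in input space. A minimal sketch (X_pos and X_neg hold the class data as rows; the small ridge term guarding against a singular within-class scatter is an implementation choice, not part of the slide):

```python
import numpy as np

def fda_direction(X_pos, X_neg, reg=1e-6):
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_B = np.outer(m_pos - m_neg, m_pos - m_neg)                    # "between" scatter
    S_W = (X_pos - m_pos).T @ (X_pos - m_pos) / len(X_pos) \
        + (X_neg - m_neg).T @ (X_neg - m_neg) / len(X_neg)          # pooled "within" scatter
    S_W += reg * np.eye(S_W.shape[0])
    # leading eigenvector of S_W^{-1} S_B maximizes the Rayleigh quotient J(w)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    return np.real(evecs[:, np.argmax(np.real(evals))])
```

Since S_B has rank one, the same direction can also be obtained directly as w ∝ (S⁺_W + S⁻_W)⁻¹ (m⁺ − m⁻).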

Fisher Discriminant Analysis: Kernelization

From the theory of RKHS, it follows that the solution w in the feature space must be a linear combination of the images of the training data:

w = ∑_{i=1}^{n} αᵢ φ(xᵢ)

Then each expression of the form w⊤m can be represented as:

w⊤m = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} αᵢ φ(xᵢ)⊤φ(xⱼ) = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} αᵢ k(xᵢ, xⱼ) = (1/n) α⊤K 1ₙ

where 1ₙ is the vector of n ones.

Kernel Fisher Discriminant Analysis

Putting all the pieces together, the optimization problem of KFD can be formulated as follows:

max_α J(α) = (α⊤Mα) / (α⊤Nα)

where

M = (K⁺ − K⁻) 1ₙₙ (K⁺ − K⁻)

N = K⁺ (I_{n⁺} − (1/n⁺) 1_{n⁺n⁺}) [K⁺]⊤ + K⁻ (I_{n⁻} − (1/n⁻) 1_{n⁻n⁻}) [K⁻]⊤

Classification is performed as:

f(x) = w · φ(x) = ∑_{i=1}^{n} αᵢ k(xᵢ, x)
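As a minimal sketch of the kernelized procedure, the code below follows the standard formulation of KFD (class-mean kernel vectors for M, class-wise centered Gram blocks for N), which may differ from the slide's M and N by normalization constants; the regularization term and all names are implementation choices, not part of the slide:

```python
import numpy as np

def kfd_fit(K, y, reg=1e-3):
    """K: n x n training kernel matrix, y: labels in {+1, -1}."""
    pos, neg = (y == +1), (y == -1)
    # class-mean kernel vectors: (m+)_i = (1/n+) sum_{j in +} k(x_i, x_j)
    m_pos = K[:, pos].mean(axis=1)
    m_neg = K[:, neg].mean(axis=1)
    M = np.outer(m_pos - m_neg, m_pos - m_neg)           # "between" matrix
    # "within" matrix: N = sum_c K_c (I - (1/n_c) 1) K_c^T
    N = np.zeros_like(K)
    for mask in (pos, neg):
        Kc = K[:, mask]
        n_c = mask.sum()
        H = np.eye(n_c) - np.ones((n_c, n_c)) / n_c      # class-wise centering matrix
        N += Kc @ H @ Kc.T
    N += reg * np.eye(K.shape[0])                        # ridge for invertibility
    # leading eigenvector of N^{-1} M maximizes the Rayleigh quotient J(alpha)
    evals, evecs = np.linalg.eig(np.linalg.solve(N, M))
    return np.real(evecs[:, np.argmax(np.real(evals))])

def kfd_project(alpha, K_test_train):
    # f(x) = sum_i alpha_i k(x_i, x); K_test_train has shape (n_test, n_train)
    return K_test_train @ alpha
```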

Summary

We explored the connection between inner products (kernels) and norms

We learned how to use kernels to compute distances in feature spaces

We applied this technique to a simple anomaly detection algorithm

We saw the first bound on the expected risk of a learning algorithm

We learned how to perform other interesting operations in the feature space, e.g., scaling and centering

We performed "kernelization" of the classical Fisher Discriminant Analysis algorithm

Next Lecture: kernelization of other pattern analysis techniques

  Principal Component Analysis (PCA)
  Canonical Correlation Analysis (CCA)