

Lecture 4: Elementary Kernel Algorithms

Pavel Laskov, Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, May 15, 2012


Lessons from Last Lecture

Matrices and their properties

Vector spaces and Hilbert spaces

Kernel functions and RKHS

Relation between kernel functions and positive definiteness

Closure properties of kernel functions

[Figure 6.1: Reproducing property of a reproducing kernel — g(x) = K(x, y) and f(x) as elements of F, evaluated on the domain E.]

Since g(x) in Definition 6.4.1 is equal to K(x, y) as a function of x evaluated at fixed y, the inner product ⟨f(x), g(x)⟩_F is equal to the evaluation of f(x) at the point y. It is convenient to introduce the notation K(x, ·), which means "K of a fixed argument x considered as a function in F". In this notation the reproducing property (6.37) can be expressed as:

K(x, y) = ⟨K(x, ·), K(y, ·)⟩_F                        (6.38)

This expression, also referred to as the inner product property, is important for three reasons. First, it sheds light on how a kernel function can provide the value of an inner product without ever computing the latter: by definition, the reproducing property defines a mapping from the domain of a set of functions into the set itself, and thus an operation in the space of functions can be substituted by an operation in their domain. Second, property (6.38) can be used to verify that a given function constitutes a reproducing kernel. Lastly, the cross-kernel will be defined solely in terms of the inner product property rather than the reproducing property. The inner product property is illustrated in Figure 6.2.

Example. Let F₂ be the space of homogeneous polynomials in two variables:

f(u, v) = a u² + b √2 uv + c v².

[Figure 6.2: Inner product property of a reproducing kernel — f(t) = K(x, ·) and g(t) = K(y, ·) as functions in F.]

The basis functions in this space are {u², √2 uv, v²}, and the inner product is defined as

⟨f₁, f₂⟩_{F₂} ≡ a₁a₂ + b₁b₂ + c₁c₂.

The function K((u₁, v₁), (u₂, v₂)) = (u₁u₂ + v₁v₂)² is the reproducing kernel in F₂. Indeed, for fixed (u₁, v₁),

K((u₁, v₁), (u, v)) = (u₁u + v₁v)² = u₁² u² + √2 u₁v₁ · √2 uv + v₁² v².

The latter is obviously an element of F₂ (with coefficients a = u₁², b = √2 u₁v₁, c = v₁²), and hence K satisfies property 1 of Definition 6.4.1. Also,

⟨K((u₁, v₁), ·), K((u₂, v₂), ·)⟩_{F₂} = ⟨u₁² u² + √2 u₁v₁ √2 uv + v₁² v², u₂² u² + √2 u₂v₂ √2 uv + v₂² v²⟩_{F₂}
                                     = u₁²u₂² + √2 u₁v₁ · √2 u₂v₂ + v₁²v₂²
                                     = (u₁u₂ + v₁v₂)².

Therefore, K satisfies property 2 of Definition 6.4.1.


Agenda for Today

Carrying out elementary operations on data in the feature space without explicit access to it:

  Computing norms of data points
  Computing distances between data points
  Scaling data
  Centering data
  Projecting data onto arbitrary directions

Simple learning algorithms using kernel operations:

  Centroid anomaly detection
  Fisher Discriminant Analysis

Normed Vector Space

Vector space:

  Closure under addition: ∀ x, z ∈ X ⇒ x + z ∈ X
  Closure under scalar multiplication: ∀ x ∈ X, a ∈ ℜ ⇒ a · x ∈ X
  7 additional axioms (see last lecture)

A normed vector space has a unary norm operator ‖·‖ with the following properties:

  Nonnegativity: x ≠ 0 ⇒ ‖x‖ > 0
  Scalability: ‖a · x‖ = |a| · ‖x‖
  Triangle inequality: ‖x + z‖ ≤ ‖x‖ + ‖z‖

The norm represents the notion of length.

Inner Product Spaces

An inner product vector space has a binary inner product operator ⟨·, ·⟩ with the following properties:

  Commutativity: ⟨x, z⟩ = ⟨z, x⟩
  Associativity: ⟨a · x, z⟩ = a · ⟨x, z⟩
  Distributivity: ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩
  Positive-definiteness: ⟨x, x⟩ ≥ 0, with equality only for x = 0

Inner product spaces are normed spaces: the Euclidean norm is defined as ‖x‖₂ = √⟨x, x⟩

The inner product represents the notion of angle:

∠(x, z) = arccos( ⟨x, z⟩ / (‖x‖ ‖z‖) )
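As a quick numerical illustration of the norm and angle formulas above, here is a minimal NumPy sketch (the vectors x and z are arbitrary examples, not taken from the lecture):

```python
import numpy as np

x = np.array([3.0, 4.0])
z = np.array([1.0, 0.0])

norm_x = np.sqrt(np.dot(x, x))                    # ||x||_2 = sqrt(<x, x>) = 5.0
cos_xz = np.dot(x, z) / (norm_x * np.linalg.norm(z))
angle_xz = np.arccos(np.clip(cos_xz, -1.0, 1.0))  # angle between x and z, ~0.927 rad

print(norm_x, angle_xz)
```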

Application to RKHS

Given the kernel function k(x, y), how can we compute the Euclidean norm of the image φ(x)?

‖φ(x)‖₂ = √⟨φ(x), φ(x)⟩ = √k(x, x)

Similarly, one can compute the squared distance between two arbitrary feature vectors:

‖φ(x) − φ(z)‖₂² = ⟨φ(x) − φ(z), φ(x) − φ(z)⟩
                = ⟨φ(x), φ(x)⟩ − 2 ⟨φ(x), φ(z)⟩ + ⟨φ(z), φ(z)⟩
                = k(x, x) − 2 k(x, z) + k(z, z)
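These two identities translate directly into code. A minimal sketch follows; the Gaussian RBF kernel is only an illustrative choice, and any valid kernel function could be plugged in:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # example kernel: k(x, z) = exp(-gamma ||x - z||^2)
    d = x - z
    return np.exp(-gamma * np.dot(d, d))

def feature_norm(x, k):
    # ||phi(x)||_2 = sqrt(k(x, x))
    return np.sqrt(k(x, x))

def feature_distance(x, z, k):
    # ||phi(x) - phi(z)||_2 = sqrt(k(x, x) - 2 k(x, z) + k(z, z))
    return np.sqrt(k(x, x) - 2.0 * k(x, z) + k(z, z))

x, z = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(feature_norm(x, rbf_kernel), feature_distance(x, z, rbf_kernel))
```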

Distance from the Center of Mass

Denote the average of all data points as the center of mass:

φ_S = (1/n) ∑_{i=1}^{n} φ(xᵢ)

Then the squared distance of any point from the center of mass can be computed as follows:

‖φ(x) − φ_S‖₂² = ⟨φ(x) − (1/n) ∑_{i=1}^{n} φ(xᵢ), φ(x) − (1/n) ∑_{i=1}^{n} φ(xᵢ)⟩
               = ⟨φ(x), φ(x)⟩ − (2/n) ∑_{i=1}^{n} ⟨φ(x), φ(xᵢ)⟩ + (1/n²) ∑_{i,j=1}^{n} ⟨φ(xᵢ), φ(xⱼ)⟩
               = k(x, x) − (2/n) ∑_{i=1}^{n} k(x, xᵢ) + (1/n²) ∑_{i,j=1}^{n} k(xᵢ, xⱼ)
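In terms of precomputed kernel values, the distance needs only the vector of k(x, xᵢ) and the training Gram matrix. A minimal sketch (the function and argument names are illustrative):

```python
import numpy as np

def dist2_to_center_of_mass(k_xx, k_x_train, K_train):
    """Squared feature-space distance of x from the center of mass phi_S.

    k_xx      -- scalar k(x, x)
    k_x_train -- length-n vector of k(x, x_i)
    K_train   -- n x n kernel matrix of the training sample
    """
    n = len(k_x_train)
    return k_xx - 2.0 * np.mean(k_x_train) + np.sum(K_train) / n**2
```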

Simple Anomaly Detection Algorithm

Points that are far away from the expected mean of the data should be considered anomalous.

How far is "far away"?

How to compute the expected mean?

Centroid Anomaly Detection with Finite Data

Consider the finite sample S = {x₁, . . . , xₙ} of size n.

Compute the empirical mean of the data, and the distances r^i_emp from the mean for all data points (use "kernelization" if necessary).

Let k be the index of the point with the largest distance r^k_emp.

It can be shown that the distance between the empirical mean and the expected mean is bounded, with probability 1 − δ, as follows:

g(S) = ‖x̄ − E[x]‖ ≤ √(2R²/n) (√2 + √(ln(1/δ)))

where R is the supremum over the norms of the points in the distribution.
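To get a feeling for the size of this bound, here is a small numerical sketch of g(S) exactly as written above (the values of R, n, and δ are arbitrary examples):

```python
import numpy as np

def g_bound(R, n, delta):
    # g(S) <= sqrt(2 R^2 / n) * (sqrt(2) + sqrt(ln(1/delta)))
    return np.sqrt(2.0 * R**2 / n) * (np.sqrt(2.0) + np.sqrt(np.log(1.0 / delta)))

print(g_bound(R=1.0, n=1000, delta=0.05))   # ~0.14 for these example values
```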

Centroid Anomaly Detection with Finite Data (ctd.)

By the triangle inequality,

r_emp ≤ r + g(S)
r^k ≤ r^k_emp + g(S)

Rearranging the terms, we obtain

r ≥ r_emp − g(S)
−r^k ≥ −r^k_emp − g(S)

Adding these together, we obtain:

r − r^k ≥ r_emp − r^k_emp − 2 g(S)

[Figure: a new point x and the training point x^k, with distances r, r^k to the expected mean and r_emp, r^k_emp to the empirical mean; g bounds the distance between the two means.]

Centroid Anomaly Detection: Decision Making

What is the meaning of the inequality

r − r^k ≥ r_emp − r^k_emp − 2 g(S) ?

If the right-hand side is greater than 0, i.e., r_emp > r^k_emp + 2 g(S), then the left-hand side is also greater than 0, i.e., the new data point x is farther away from the expected mean than the point x^k.

☺ And we have done this without ever computing r or r^k!

☹ The "margin" g(S) can be pretty large, especially if we want high confidence, i.e., small δ (g ∼ √(ln(1/δ)))

☹ The "margin" g(S) is infinite unless the distribution is bounded (g ∼ R)
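The full decision rule can be assembled from the kernel expressions derived earlier. Below is a minimal sketch of such a centroid anomaly detector; the function and argument names are illustrative, and R and δ must be supplied by the user:

```python
import numpy as np

def r_emp(k_xx, k_x_train, K_train):
    # distance of x from the *empirical* center of mass, computed via kernels only
    n = K_train.shape[0]
    d2 = k_xx - 2.0 * np.mean(k_x_train) + np.sum(K_train) / n**2
    return np.sqrt(max(d2, 0.0))

def is_anomalous(k_xx, k_x_train, K_train, R, delta):
    """Flag x if r_emp(x) > r^k_emp + 2 g(S): then, with probability 1 - delta,
    x lies farther from the *expected* mean than the farthest training point."""
    n = K_train.shape[0]
    # empirical distances of all training points from the empirical mean
    d2 = np.diag(K_train) - 2.0 * K_train.mean(axis=1) + np.sum(K_train) / n**2
    r_k_emp = np.sqrt(np.maximum(d2, 0.0)).max()
    g = np.sqrt(2.0 * R**2 / n) * (np.sqrt(2.0) + np.sqrt(np.log(1.0 / delta)))
    return r_emp(k_xx, k_x_train, K_train) > r_k_emp + 2.0 * g
```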

Centroid Anomaly Detection: Lessons Learned

The prediction risk on unknown data can often be estimated in the form:

R(x) ≤ R_emp(S) + h(S)

Minimization of the empirical risk alone does not suffice!

The bound on the prediction risk often depends on the supremum R of the norms of the data points.

Data of unbounded norm is bad!

Using Kernels for Data Normalization

Suppose we want to project all data in the feature space to have norm 1 . . .

[Figure: data points x₁, . . . , x₁₂ being scaled to unit norm in the feature space.]

Divide each point by its norm:

φ̃(x) = φ(x) / ‖φ(x)‖

Use inner products between the normalized data:

k̃(x, z) = ⟨φ(x)/‖φ(x)‖, φ(z)/‖φ(z)‖⟩ = ⟨φ(x), φ(z)⟩ / (‖φ(x)‖ ‖φ(z)‖) = k(x, z) / √(k(x, x) k(z, z))
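At the level of the kernel matrix, this normalization is a one-liner. A minimal sketch (K is any precomputed Gram matrix):

```python
import numpy as np

def normalize_kernel(K):
    # k~(x, z) = k(x, z) / sqrt(k(x, x) k(z, z)); all phi(x) end up with unit norm
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```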

Using Kernels for Data Centering

Suppose we want to translate the origin to the center of mass of the data. We will see in the next lecture (homework) why . . .

[Figure: data points x₁, . . . , x₁₂ and their center of mass in the feature space.]

The inner product in the new coordinate system becomes

k̃(x, z) = ⟨φ(x) − (1/n) ∑_{i=1}^{n} φ(xᵢ), φ(z) − (1/n) ∑_{i=1}^{n} φ(xᵢ)⟩
         = k(x, z) − (1/n) ∑_{i=1}^{n} k(x, xᵢ) − (1/n) ∑_{i=1}^{n} k(z, xᵢ) + (1/n²) ∑_{i,j=1}^{n} k(xᵢ, xⱼ)

The kernel matrix for the training data can be expressed as:

K̃ = K − (1/n) 1ₙₙ K − (1/n) K 1ₙₙ + (1/n²) 1ₙₙ K 1ₙₙ

where 1ₙₙ denotes the n × n matrix of all ones.
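The centering formula for the training kernel matrix can be implemented directly. A minimal sketch, with 1ₙₙ realized as an all-ones matrix:

```python
import numpy as np

def center_kernel(K):
    # K~ = K - (1/n) 1 K - (1/n) K 1 + (1/n^2) 1 K 1,  1 being the n x n all-ones matrix
    n = K.shape[0]
    ones_n = np.ones((n, n)) / n
    return K - ones_n @ K - K @ ones_n + ones_n @ K @ ones_n
```

Equivalently, K̃ = H K H with the centering matrix H = I − (1/n) 1ₙₙ.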

Fisher Discriminant Analysis

Pick some direction w

Project all data points onto w

Compute the averages μ⁺_w and μ⁻_w of the sets of projected points from the two classes

Compute the corresponding standard deviations σ⁺_w and σ⁻_w

Choose w so as to maximize the following cost function:

J(w) = ‖μ⁺_w − μ⁻_w‖² / ((σ⁺_w)² + (σ⁻_w)²)

Projection of a Point onto a Direction

The projection of a point x onto a direction w is computed as:

x_w = (w/‖w‖) ⟨w, x⟩

The average projection of a set of points onto w can be computed as:

μ_w = (1/n) ∑_{i=1}^{n} (w/‖w‖) ⟨w, xᵢ⟩ = (1/‖w‖) w m⊤w,

where

m := (1/n) ∑_{i=1}^{n} xᵢ
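Combining these projection formulas with the cost function from the previous slide, J(w) can be evaluated directly in input space. A minimal sketch (X_pos and X_neg hold the two classes as rows; the names are illustrative). Since J(w) is invariant to the scale of w, the unit direction w/‖w‖ is used for the projections:

```python
import numpy as np

def fisher_criterion(X_pos, X_neg, w):
    # J(w) = ||mu+_w - mu-_w||^2 / ((sigma+_w)^2 + (sigma-_w)^2)
    u = w / np.linalg.norm(w)              # unit direction
    p_pos, p_neg = X_pos @ u, X_neg @ u    # scalar projections onto w/||w||
    numerator = (p_pos.mean() - p_neg.mean()) ** 2
    denominator = p_pos.var() + p_neg.var()
    return numerator / denominator
```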

"Between" Scatter Matrix

Consider the numerator of J(w):

‖μ⁺_w − μ⁻_w‖² = ⟨μ⁺_w − μ⁻_w, μ⁺_w − μ⁻_w⟩

               = (1/‖w‖²) (w [m⁺]⊤w − w [m⁻]⊤w)⊤ (w [m⁺]⊤w − w [m⁻]⊤w)

               = (1/‖w‖²) (w⊤m⁺ w⊤w [m⁺]⊤w − w⊤m⁻ w⊤w [m⁺]⊤w − w⊤m⁺ w⊤w [m⁻]⊤w + w⊤m⁻ w⊤w [m⁻]⊤w)

               = (‖w‖²/‖w‖²) w⊤ (m⁺ − m⁻)(m⁺ − m⁻)⊤ w  =  w⊤ S_B w

S_B = (m⁺ − m⁻)(m⁺ − m⁻)⊤ is called the "between" scatter matrix.

"Within" Scatter Matrix

Consider the terms in the denominator of J(w):

(σ⁺_w)² = (1/n⁺) ∑_{i=1}^{n⁺} ‖ (w/‖w‖) ⟨w, xᵢ⟩ − (1/n⁺) (w/‖w‖) ∑_{j=1}^{n⁺} ⟨w, xⱼ⟩ ‖²

        = (1/n⁺) (1/‖w‖²) ∑_{i=1}^{n⁺} ‖ w xᵢ⊤w − w [m⁺]⊤w ‖²

        = w⊤ [ (1/n⁺) ∑_{i=1}^{n⁺} (xᵢ − m⁺)(xᵢ − m⁺)⊤ ] w  =  w⊤ S⁺_W w

S_W is called the "within" scatter matrix (here S⁺_W for the positive class; S⁻_W is defined analogously).

Fisher Discriminant Analysis (ctd.)

Putting all terms together, we obtain the following optimization problem:

max_w J(w) = (w⊤ S_B w) / (w⊤ (S⁺_W + S⁻_W) w)

Using the properties of the Rayleigh coefficient (Lecture 3, slide 14), it can be shown that the optimal solution is given by the leading eigenvector of the matrix

(S⁺_W + S⁻_W)⁻¹ S_B
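The eigenvalue characterization can be used as-is in input space. A minimal sketch (X_pos and X_neg hold the class data as rows; the small ridge term guarding against a singular within-class scatter is an implementation choice, not part of the slide):

```python
import numpy as np

def fda_direction(X_pos, X_neg, reg=1e-6):
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S_B = np.outer(m_pos - m_neg, m_pos - m_neg)                    # "between" scatter
    S_W = (X_pos - m_pos).T @ (X_pos - m_pos) / len(X_pos) \
        + (X_neg - m_neg).T @ (X_neg - m_neg) / len(X_neg)          # pooled "within" scatter
    S_W += reg * np.eye(S_W.shape[0])
    # leading eigenvector of S_W^{-1} S_B maximizes the Rayleigh quotient J(w)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    return np.real(evecs[:, np.argmax(np.real(evals))])
```

Since S_B has rank one, the same direction can also be obtained directly as w ∝ (S⁺_W + S⁻_W)⁻¹ (m⁺ − m⁻).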

Fisher Discriminant Analysis: Kernelization

From the theory of RKHS, it follows that the solution w in the feature space must be a linear combination of the images of the training data:

w = ∑_{i=1}^{n} αᵢ φ(xᵢ)

Then each expression of the form w⊤m can be represented as:

w⊤m = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} αᵢ φ(xᵢ)⊤φ(xⱼ) = (1/n) ∑_{i=1}^{n} ∑_{j=1}^{n} αᵢ k(xᵢ, xⱼ) = (1/n) α⊤K 1ₙ

where 1ₙ is the vector of n ones.

Kernel Fisher Discriminant Analysis

Putting all the pieces together, the optimization problem of KFD can be formulated as follows:

max_α J(α) = (α⊤Mα) / (α⊤Nα)

where

M = (K⁺ − K⁻) 1ₙₙ (K⁺ − K⁻)

N = K⁺ (I_{n⁺} − (1/n⁺) 1_{n⁺n⁺}) [K⁺]⊤ + K⁻ (I_{n⁻} − (1/n⁻) 1_{n⁻n⁻}) [K⁻]⊤

Classification is performed as:

f(x) = w · φ(x) = ∑_{i=1}^{n} αᵢ k(xᵢ, x)
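As a minimal sketch of the kernelized procedure, the code below follows the standard formulation of KFD (class-mean kernel vectors for M, class-wise centered Gram blocks for N), which may differ from the slide's M and N by normalization constants; the regularization term and all names are implementation choices, not part of the slide:

```python
import numpy as np

def kfd_fit(K, y, reg=1e-3):
    """K: n x n training kernel matrix, y: labels in {+1, -1}."""
    pos, neg = (y == +1), (y == -1)
    # class-mean kernel vectors: (m+)_i = (1/n+) sum_{j in +} k(x_i, x_j)
    m_pos = K[:, pos].mean(axis=1)
    m_neg = K[:, neg].mean(axis=1)
    M = np.outer(m_pos - m_neg, m_pos - m_neg)           # "between" matrix
    # "within" matrix: N = sum_c K_c (I - (1/n_c) 1) K_c^T
    N = np.zeros_like(K)
    for mask in (pos, neg):
        Kc = K[:, mask]
        n_c = mask.sum()
        H = np.eye(n_c) - np.ones((n_c, n_c)) / n_c      # class-wise centering matrix
        N += Kc @ H @ Kc.T
    N += reg * np.eye(K.shape[0])                        # ridge for invertibility
    # leading eigenvector of N^{-1} M maximizes the Rayleigh quotient J(alpha)
    evals, evecs = np.linalg.eig(np.linalg.solve(N, M))
    return np.real(evecs[:, np.argmax(np.real(evals))])

def kfd_project(alpha, K_test_train):
    # f(x) = sum_i alpha_i k(x_i, x); K_test_train has shape (n_test, n_train)
    return K_test_train @ alpha
```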

Summary

We explored the connection between inner products (kernels) and norms

We learned how to use kernels to compute distances in feature spaces

We applied this technique to a simple anomaly detection algorithm

We saw the first bound on the expected risk of a learning algorithm

We learned how to perform other interesting operations in the feature space, e.g., scaling and centering

We performed "kernelization" of the classical Fisher Discriminant Analysis algorithm

Next Lecture: kernelization of other pattern analysis techniques

  Principal Component Analysis (PCA)
  Canonical Correlation Analysis (CCA)