Lecture 4: Elementary Kernel Algorithms
Pavel Laskov and Blaine Nelson
Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany
Advanced Topics in Machine Learning, 2012
Lessons from Last Lecture
Matrices and their properties
Vector spaces and Hilbert spaces
Kernel functions and RKHS
Relation between kernel functions and positive definiteness
Closure properties of kernel functions
Figure 6.1: Reproducing property of a reproducing kernel.
Figure 6.2: Inner product property of a reproducing kernel: K(x, y) = 〈K(x, ·), K(y, ·)〉_F.
Agenda for Today
Carrying out elementary operations on data in the feature space without explicit access to it
Computing norms of data points
Computing distances between data points
Scaling data
Centering data
Projecting data on arbitrary directions
Simple learning algorithms using kernel operations
Centroid anomaly detection
Fisher Discriminant Analysis
Normed Vector Space
Vector space:
Closure under addition: ∀ x, z ∈ X ⇒ x + z ∈ X
Closure under scalar multiplication: ∀ x ∈ X, a ∈ ℜ ⇒ a · x ∈ X
7 additional axioms (see last lecture)
Normed vector space has a unary norm operator ‖·‖ with the following properties:
Nonnegativity: x ≠ 0 ⇒ ‖x‖ > 0
Scalability: ‖a · x‖ = |a| · ‖x‖
Triangle inequality: ‖x + z‖ ≤ ‖x‖ + ‖z‖
Norm represents the notion of length
Inner Product Spaces
Inner product vector space has a binary inner product operator 〈·, ·〉 with the following properties:
Commutativity: 〈x, z〉 = 〈z, x〉
Associativity: 〈a · x, z〉 = a · 〈x, z〉
Distributivity: 〈x + y, z〉 = 〈x, z〉 + 〈y, z〉
Positive-definiteness: 〈x, x〉 ≥ 0, with equality only for x = 0.
Inner product spaces are normed spaces:
Euclidean norm is defined as ‖x‖₂ = √〈x, x〉
Inner product represents the notion of angle:
∠xz = arccos( 〈x, z〉 / (‖x‖ ‖z‖) )
Application to RKHS
Given the kernel function k(x, y), how can we compute the Euclidean norm of the image φ(x)?
‖φ(x)‖₂ = √〈φ(x), φ(x)〉 = √k(x, x)
Similarly, one can compute the squared distance between two arbitrary feature vectors:
‖φ(x) − φ(z)‖₂² = 〈φ(x) − φ(z), φ(x) − φ(z)〉
                = 〈φ(x), φ(x)〉 − 2 〈φ(x), φ(z)〉 + 〈φ(z), φ(z)〉
                = k(x, x) − 2 k(x, z) + k(z, z)
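As a quick illustration, a minimal Python/NumPy sketch of both identities, assuming a Gaussian (RBF) kernel purely for concreteness (any valid kernel function can be plugged in):

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Example kernel; any positive-definite kernel works here.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def feature_norm(x, k=rbf_kernel):
    # Norm of phi(x) in feature space: sqrt(k(x, x)).
    return np.sqrt(k(x, x))

def feature_distance(x, z, k=rbf_kernel):
    # Distance of phi(x) and phi(z): sqrt(k(x,x) - 2 k(x,z) + k(z,z)).
    return np.sqrt(k(x, x) - 2 * k(x, z) + k(z, z))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(feature_norm(x), feature_distance(x, z))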
Distance from the Center of Mass
Denote the average of all data points as the center of mass:
φ_S = (1/n) ∑_{i=1}^{n} φ(x_i)
Then the squared distance of any point from the center of mass can be computed as follows:
‖φ(x) − φ_S‖₂² = 〈 φ(x) − (1/n) ∑_{i=1}^{n} φ(x_i), φ(x) − (1/n) ∑_{i=1}^{n} φ(x_i) 〉
               = 〈φ(x), φ(x)〉 − (2/n) ∑_{i=1}^{n} 〈φ(x), φ(x_i)〉 + (1/n²) ∑_{i,j=1}^{n} 〈φ(x_i), φ(x_j)〉
               = k(x, x) − (2/n) ∑_{i=1}^{n} k(x, x_i) + (1/n²) ∑_{i,j=1}^{n} k(x_i, x_j)
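A minimal sketch of this computation, again assuming an RBF kernel for concreteness; only kernel evaluations of the input points are used:

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def dist_to_center_of_mass(x, X, k=rbf_kernel):
    # Distance of phi(x) to the mean of phi(x_1), ..., phi(x_n),
    # computed from kernel evaluations only.
    n = X.shape[0]
    k_xx = k(x, x)
    k_xXi = np.array([k(x, xi) for xi in X])              # k(x, x_i)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # k(x_i, x_j)
    sq = k_xx - 2.0 / n * k_xXi.sum() + K.sum() / n**2
    return np.sqrt(max(sq, 0.0))   # guard against tiny negative round-off

X = np.random.randn(20, 2)
print(dist_to_center_of_mass(np.array([3.0, 3.0]), X))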
Simple Anomaly Detection Algorithm
Points that are far away from the expected mean of the data should be considered anomalous.
How far is “far away”?
How to compute the expected mean?
Centroid Anomaly Detection with Finite Data
Consider the finite sample S = x1, . . . , xn of size n.
Compute the empirical mean of the data, and the distances r^i_emp from the mean for all data points (use “kernelization” if necessary).
Let k be the index of the point with the largest distance r^k_emp.
It can be shown that the distance between the empirical mean and the expected mean is bounded, with probability 1 − δ, as follows:
g(S) = ‖x̄ − E[x]‖ ≤ √(2R²/n) (√2 + √(ln(1/δ)))
where x̄ denotes the empirical mean and R is the supremum of the norm of points from the distribution.
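A small sketch of these quantities, assuming a linear feature space for concreteness and estimating R by the largest norm observed in the sample (a crude stand-in for the true supremum):

import numpy as np

def centroid_stats(X, delta=0.05, R=None):
    # Empirical distances r^i_emp to the sample mean and the bound g(S).
    n = X.shape[0]
    mean = X.mean(axis=0)
    r_emp = np.linalg.norm(X - mean, axis=1)
    if R is None:
        R = np.linalg.norm(X, axis=1).max()   # sample-based estimate of R
    g = np.sqrt(2 * R**2 / n) * (np.sqrt(2) + np.sqrt(np.log(1 / delta)))
    return r_emp, g

X = np.random.randn(200, 3)
r_emp, g = centroid_stats(X)
print(r_emp.max(), g)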
Centroid Anomaly Detection with Finite Data (ctd.)
By the triangle inequality,
r_emp ≤ r + g(S)
r^k ≤ r^k_emp + g(S)
Rearranging the terms, we obtain
r ≥ r_emp − g(S)
−r^k ≥ −r^k_emp − g(S)
Adding together, we obtain:
r − r^k ≥ r_emp − r^k_emp − 2g(S)
[Figure: empirical mean vs. expected mean, showing the distances r_emp, r^k_emp, r, r^k and the deviation g for points x and x^k.]
Centroid Anomaly Detection: Decision Making
What is the meaning of the inequality
r − r^k ≥ r_emp − r^k_emp − 2g(S) ?
If the right-hand side is greater than 0, i.e., r_emp > r^k_emp + 2g(S), then the left-hand side is also greater than 0, i.e., the new data point x is farther away from the expected mean than the point x^k.
☺ And we can do this without ever computing r or r^k!
☹ The “margin” g(S) can be pretty high, especially if we want high confidence, i.e., small δ (g ∼ √(ln(1/δ)))
☹ The “margin” g(S) is infinite unless the distribution is bounded (g ∼ R)
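A minimal sketch of this decision rule, under the same assumptions as above (linear feature space, R estimated from the sample):

import numpy as np

def is_anomalous(x, X, delta=0.05):
    # Flag x as anomalous if r_emp(x) > r^k_emp + 2 g(S), which guarantees
    # r(x) > r^k with probability at least 1 - delta.
    n = X.shape[0]
    mean = X.mean(axis=0)
    r_emp = np.linalg.norm(X - mean, axis=1)
    R = np.linalg.norm(X, axis=1).max()       # sample-based estimate of R
    g = np.sqrt(2 * R**2 / n) * (np.sqrt(2) + np.sqrt(np.log(1 / delta)))
    r_emp_x = np.linalg.norm(x - mean)
    return r_emp_x > r_emp.max() + 2 * g

X = np.random.randn(500, 2)
print(is_anomalous(np.array([6.0, 6.0]), X), is_anomalous(np.array([0.2, 0.1]), X))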
Centroid Anomaly Detection: Lessons Learned
The prediction risk on unknown data can often be estimated in the form:
R(x) ≤ R_emp(S) + h(S)
Minimization of the empirical risk alone does not suffice!
The bound on the prediction risk often depends on the supremum R of the norm of the data points.
Data of unbounded norm is bad!
Using Kernels for Data Normalization
Suppose we want to project all data in the feature space to have norm 1...
Divide each point by its norm:
φ̃(x) = φ(x) / ‖φ(x)‖
Use inner products between normalized data:
k̃(x, z) = 〈 φ(x)/‖φ(x)‖, φ(z)/‖φ(z)‖ 〉 = 〈φ(x), φ(z)〉 / (‖φ(x)‖ ‖φ(z)‖) = k(x, z) / √(k(x, x) k(z, z))
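A minimal sketch of this normalization applied to a precomputed kernel matrix K, here built from a linear kernel for concreteness:

import numpy as np

def normalize_kernel(K):
    # Cosine-normalize a kernel matrix:
    # K_tilde[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j]).
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

X = np.random.randn(10, 3)
K = X @ X.T                                 # linear kernel as an example
K_tilde = normalize_kernel(K)
print(np.allclose(np.diag(K_tilde), 1.0))   # every image now has norm 1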
Using Kernels for Data Centering
Suppose we want to translate the origin to the center of mass of the data. We will see in the next lecture (homework) why...
The inner product in the new coordinate system becomes
k̃(x, z) = 〈 φ(x) − (1/n) ∑_{i=1}^{n} φ(x_i), φ(z) − (1/n) ∑_{i=1}^{n} φ(x_i) 〉
         = k(x, z) − (1/n) ∑_{i=1}^{n} k(x, x_i) − (1/n) ∑_{i=1}^{n} k(z, x_i) + (1/n²) ∑_{i,j=1}^{n} k(x_i, x_j)
The kernel matrix for the training data can be expressed as:
K̃ = K − (1/n) 1_{nn} K − (1/n) K 1_{nn} + (1/n²) 1_{nn} K 1_{nn}
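Here 1_{nn} denotes the n × n matrix of all ones. A minimal sketch of this centering applied to a precomputed kernel matrix K:

import numpy as np

def center_kernel(K):
    # Center a kernel matrix in feature space:
    # K_tilde = K - (1/n) 1 K - (1/n) K 1 + (1/n^2) 1 K 1.
    n = K.shape[0]
    one = np.ones((n, n)) / n
    return K - one @ K - K @ one + one @ K @ one

X = np.random.randn(15, 4)
K = X @ X.T                               # linear kernel for illustration
Kc = center_kernel(K)
print(np.allclose(Kc.sum(axis=0), 0.0))   # centered images sum to zero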
Fisher Discriminant Analysis
Pick some direction w
Project all data points onto w
Compute the averages μ⁺ and μ⁻ of the sets of projected points from the two classes
Compute the corresponding standard deviations σ⁺ and σ⁻
Choose w so as to maximize the following cost function:
J(w) = ‖μ⁺_w − μ⁻_w‖² / ((σ⁺_w)² + (σ⁻_w)²)
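A minimal sketch that evaluates this criterion for a candidate direction w on two example point clouds, using the scalar projections 〈w, x〉/‖w‖:

import numpy as np

def fisher_criterion(w, X_pos, X_neg):
    # J(w) = (mu+_w - mu-_w)^2 / (sigma+_w^2 + sigma-_w^2).
    w = w / np.linalg.norm(w)
    s_pos, s_neg = X_pos @ w, X_neg @ w          # scalar projections
    num = (s_pos.mean() - s_neg.mean()) ** 2
    den = s_pos.var() + s_neg.var()
    return num / den

X_pos = np.random.randn(50, 2) + np.array([2.0, 0.0])
X_neg = np.random.randn(50, 2) - np.array([2.0, 0.0])
print(fisher_criterion(np.array([1.0, 0.0]), X_pos, X_neg))   # good direction
print(fisher_criterion(np.array([0.0, 1.0]), X_pos, X_neg))   # poor direction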
Projection of a Point onto a Direction
Projection of a point x onto a direction w is computed as:
x_w = (w / ‖w‖) 〈w, x〉
The average projection of a set of points onto w can be computed as:
μ_w = (1/n) ∑_{i=1}^{n} (w / ‖w‖) 〈w, x_i〉 = (1/‖w‖) w m^⊤ w,
where
m := (1/n) ∑_{i=1}^{n} x_i
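A minimal sketch of both formulas for an arbitrary example direction w:

import numpy as np

def project(x, w):
    # Projection of x onto the direction w: (w / ||w||) <w, x>.
    return (w / np.linalg.norm(w)) * np.dot(w, x)

def mean_projection(X, w):
    # Average projection mu_w = (1 / ||w||) w (m^T w), with m the mean of X.
    m = X.mean(axis=0)
    return (w / np.linalg.norm(w)) * np.dot(m, w)

X = np.random.randn(8, 3)
w = np.array([1.0, 2.0, -1.0])
print(mean_projection(X, w))
print(np.mean([project(x, w) for x in X], axis=0))   # same result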
“Between” Scatter Matrix
Consider the numerator of J(w):
‖μ⁺_w − μ⁻_w‖² = 〈μ⁺_w − μ⁻_w, μ⁺_w − μ⁻_w〉
               = (1/‖w‖²) (w [m⁺]^⊤ w − w [m⁻]^⊤ w)^⊤ (w [m⁺]^⊤ w − w [m⁻]^⊤ w)
               = (1/‖w‖²) ( w^⊤m⁺ w^⊤w [m⁺]^⊤w − w^⊤m⁻ w^⊤w [m⁺]^⊤w − w^⊤m⁺ w^⊤w [m⁻]^⊤w + w^⊤m⁻ w^⊤w [m⁻]^⊤w )
               = (‖w‖²/‖w‖²) w^⊤ (m⁺ − m⁻)(m⁺ − m⁻)^⊤ w = w^⊤ S_B w
S_B = (m⁺ − m⁻)(m⁺ − m⁻)^⊤ is called the “between” scatter matrix.
“Within” Scatter Matrix
Consider the terms in the denominator of J(w):
(σ⁺_w)² = (1/n⁺) ∑_{i=1}^{n⁺} ‖ (w/‖w‖) 〈w, x_i〉 − (1/n⁺) (w/‖w‖) ∑_{j=1}^{n⁺} 〈w, x_j〉 ‖²
        = (1/n⁺) (1/‖w‖²) ∑_{i=1}^{n⁺} ‖ w x_i^⊤ w − w [m⁺]^⊤ w ‖²
        = w^⊤ ( (1/n⁺) ∑_{i=1}^{n⁺} (x_i − m⁺)(x_i − m⁺)^⊤ ) w = w^⊤ S⁺_W w
S⁺_W (and, analogously, S⁻_W) is called the “within” scatter matrix.
Fisher Discriminant Analysis (ctd.)
Putting all terms together, we obtain the following optimization problem:
max_w J(w) = (w^⊤ S_B w) / (w^⊤ (S⁺_W + S⁻_W) w)
Using the properties of the Rayleigh coefficient (Lecture 3, slide 14), it can be shown that the optimal solution is given by the leading eigenvector of the matrix
(S⁺_W + S⁻_W)⁻¹ S_B
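A minimal sketch of this recipe in the input space, with a small ridge term added to the within scatter for numerical stability (an implementation choice, not part of the derivation):

import numpy as np

def fisher_direction(X_pos, X_neg, ridge=1e-8):
    # Leading eigenvector of (S_W+ + S_W-)^{-1} S_B.
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    d = (m_pos - m_neg)[:, None]
    S_B = d @ d.T
    S_W = np.cov(X_pos.T, bias=True) + np.cov(X_neg.T, bias=True)
    S_W += ridge * np.eye(S_W.shape[0])          # regularize for stability
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)

X_pos = np.random.randn(100, 2) + np.array([2.0, 0.0])
X_neg = np.random.randn(100, 2) - np.array([2.0, 0.0])
print(fisher_direction(X_pos, X_neg))   # roughly along the class-mean difference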
Fisher Discriminant Analysis: Kernelization
From the theory of RKHS, it follows that the solution w in the feature space must be a linear combination of images of the training data:
w = ∑_{i=1}^{n} α_i φ(x_i)
Then each expression of the form w⊤m can be represented as:
w^⊤ m = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i φ(x_i)^⊤ φ(x_j) = ∑_{i=1}^{n} ∑_{j=1}^{n} α_i k(x_i, x_j) = α^⊤ K 1_n
Kernel Fisher Discriminant Analysis
Putting all pieces together, the optimization problem of KFD can be formulated as follows:
max_α J(α) = (α^⊤ M α) / (α^⊤ N α)
where
M = (K⁺ − K⁻) 1_{nn} (K⁺ − K⁻)
N = K⁺ (I_{n⁺} − (1/n⁺) 1_{n⁺n⁺}) [K⁺]^⊤ + K⁻ (I_{n⁻} − (1/n⁻) 1_{n⁻n⁻}) [K⁻]^⊤
Classification is performed as:
f(x) = w · φ(x) = ∑_{i=1}^{n} α_i k(x_i, x)
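A minimal sketch of evaluating this decision function, with an RBF kernel and placeholder coefficients α standing in for the solution of the KFD problem:

import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kfd_decision(x, X_train, alpha, k=rbf_kernel):
    # f(x) = sum_i alpha_i k(x_i, x); its sign (or a threshold) gives the class.
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train))

X_train = np.random.randn(6, 2)
alpha = np.random.randn(6)   # placeholder; in KFD, alpha maximizes J(alpha)
print(kfd_decision(np.array([0.3, -0.7]), X_train, alpha))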
Summary
We explored the connection between inner products (kernels) and norms
We learned how to use kernels to compute distances in feature spaces
We applied this technique for a simple anomaly detection algorithm
We saw the first bound on the expected risk of a learning algorithm
We learned how to perform other interesting operations in the feature space, e.g., scaling and centering
We performed “kernelization” of the classical Fisher Discriminant Analysis algorithm
Next Lecture: kernelization of other pattern analysis techniques
Principal Component Analysis (PCA)
Canonical Correlation Analysis (CCA)