Introduction to Neuroinformatics:
Support Vector Machines
Prof. Dr. Martin Riedmiller
University of Osnabrück
Institute of Computer Science and Institute of Cognitive Science
Outline
support vector machines (SVM) for classification: basic ideas
optimization under constraints: Lagrange multipliers, Kuhn-Tucker conditions, duality, algorithmic approaches
applying optimization theory to SVMs
kernel trick: making linear classifiers non-linear
variants of SVMs
Linear classification
perceptron revisited: find a hyperplane that classifies a given set of positive and negative training patterns correctly, i.e. find $(\vec{w}, w_0)$ so that:
$\langle \vec{w}, \vec{x} \rangle + w_0 > 0$ for all positive training patterns $\vec{x}$
$\langle \vec{w}, \vec{x} \rangle + w_0 < 0$ for all negative training patterns $\vec{x}$
perceptron learning finds a solving hyperplane (if possible) by adding patterns to and subtracting patterns from the weight vector
if there are several solving hyperplanes, perceptron learning finds an arbitrary one
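As a small illustration (not on the original slides), a minimal sketch of this perceptron learning rule in Python; the function name and toy interface are our own:

import numpy as np

def perceptron(X, d, epochs=100):
    # X: p x n matrix of patterns, d: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(X, d):
            if t * (w @ x + w0) <= 0:   # pattern on the wrong side of the hyperplane
                w += t * x              # add positive / subtract negative patterns
                w0 += t
                mistakes += 1
        if mistakes == 0:               # a solving hyperplane has been found
            break
    return w, w0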
Linear classification (cont.)
Which is the best separating hyperplane?
[figure: the same set of positive (+) and negative (−) training patterns, separated by several different hyperplanes]
the performance on the training set is equally good, but what do you expect for an independent test set?
Linear classification (cont.)
for each separating hyperplane we can define the margin: the margin is the smallest distance of a training pattern to the separating hyperplane.
optimality: the risk of misclassification is the smaller, the larger the margin is.
[figure: candidate separating hyperplanes with their margins; the hyperplane with the largest margin keeps the greatest distance to both the positive and the negative patterns]
Support vector machines
support vector machines (SVM) (Vapnik, 1970s): find a separating hyperplane that maximizes the margin.
notation:
• training set $D = \{(\vec{x}^{(1)}, d^{(1)}), \dots, (\vec{x}^{(p)}, d^{(p)})\}$, $\vec{x}^{(i)} \in \mathbb{R}^n$, $d^{(i)} \in \{-1, +1\}$
• separating hyperplane defined by weight vector $\vec{w}$ and bias weight $b$ ($= w_0$, to be in line with the SVM literature)
• margin $\rho$
distance of a point $\vec{x}$ to a hyperplane $(\vec{w}, b)$:
$$\left| \frac{\langle \vec{x}, \vec{w} \rangle + b}{\|\vec{w}\|} \right|$$
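The distance formula translates directly into code; a one-line sketch of ours using NumPy:

import numpy as np

def distance_to_hyperplane(x, w, b):
    # |<x, w> + b| / ||w||
    return abs(w @ x + b) / np.linalg.norm(w)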
Support vector machines
find the separating hyperplane with maximal margin:
$$\max_{\vec{w}, b, \rho} \ \rho^2 \quad \text{subject to} \quad \rho > 0, \qquad \frac{d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right)}{\|\vec{w}\|} \ge \rho \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D$$
this task contains a degree of freedom: $\vec{w}$ and $b$ can be scaled by a positive number without changing the separating hyperplane. We can remove this degree of freedom and simplify the problem by adding a constraint on the length of $\vec{w}$. Here: $\|\vec{w}\| = \frac{1}{\rho}$
Support vector machines (cont.)
optimization problem with length constraint:
$$\max_{\vec{w}, b, \rho} \ \rho^2 = \frac{1}{\|\vec{w}\|^2} \quad \text{subject to} \quad \rho > 0, \qquad \|\vec{w}\| = \frac{1}{\rho}, \qquad \frac{d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right)}{\|\vec{w}\|} \ge \rho = \frac{1}{\|\vec{w}\|} \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D$$
rewriting the optimization problem: finally, we can turn it into a minimization task by minimizing the reciprocal $\|\vec{w}\|^2$ instead of maximizing $\frac{1}{\|\vec{w}\|^2}$
Support vector machines (cont.)
the minimization task:
$$\min_{\vec{w}, b, \rho} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D, \qquad \rho = \frac{1}{\|\vec{w}\|}$$
(the factor $\frac{1}{2}$ is added for convenience; it does not change the minimizer)
$\rho$ occurs only in one equality constraint, i.e. we can solve the optimization problem for $\vec{w}$ and $b$ and calculate $\rho$ afterwards from $\vec{w}$
Support vector machines (cont.)
final mathematical form of the minimization task:
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D$$
the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM).
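Since this is a quadratic program, it can be handed to an off-the-shelf solver. A minimal sketch, assuming the cvxpy library and some hypothetical separable toy data (neither appears on the slides):

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])  # patterns
d = np.array([1.0, 1.0, -1.0, -1.0])                             # labels
w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(d, X @ w + b) >= 1]                    # d_i (<x_i, w> + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value, 1.0 / np.linalg.norm(w.value))           # hyperplane and margin rho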
Optimization Theory
Optimization theory
a general form of optimization problem:
$$\min_{\vec{x}} \ f(\vec{x}) \quad \text{subject to} \quad \vec{x} \in \Omega$$
$\Omega \subseteq \mathbb{R}^n$: the set of feasible (permitted) points
$f: \mathbb{R}^n \to \mathbb{R}$: the target function that should be minimized
maximization problems can be described in the same way (maximize $f(\vec{x})$ ⇔ minimize $-f(\vec{x})$)
often, $\Omega$ is implicitly given by equality and inequality constraints
Local and global minima
a feasible point $\vec{x} \in \Omega$ is called a global minimum if for all $\vec{y} \in \Omega$ it holds: $f(\vec{x}) \le f(\vec{y})$
a feasible point $\vec{x} \in \Omega$ is called a local minimum if there is a radius $r > 0$ so that for all points $\vec{y} \in \Omega$ with $\|\vec{x} - \vec{y}\| < r$ it holds: $f(\vec{x}) \le f(\vec{y})$
[figure: a function $f(x)$ over a feasible interval $\Omega$ with two local minima, one of which is the global minimum]
Convexity
A set $X$ is called a convex set if for all points $\vec{x}, \vec{y} \in X$ and any $0 \le \theta \le 1$ it holds:
$$\theta\vec{x} + (1 - \theta)\vec{y} \in X$$
intuitively: a set is convex if all points between two elements of the set are also elements of the set.
[figure: a set $X$ with two elements $\vec{x}, \vec{y} \in X$ and the connecting segment $\theta\vec{x} + (1-\theta)\vec{y}$]
Convexity (cont.)
[figure: examples of convex and non-convex sets]
Convexity (cont.)
intersections of convex sets are also convex. Proof: → blackboard
unions of convex sets are not guaranteed to be convex.
Convexity (cont.)
A function $f$ is called a convex function if for all points $\vec{x}, \vec{y}$ and all $0 \le \theta \le 1$ it holds:
$$f(\theta\vec{x} + (1 - \theta)\vec{y}) \le \theta f(\vec{x}) + (1 - \theta) f(\vec{y})$$
[figure: a convex function $f(x)$; the chord between $\vec{x}$ and $\vec{y}$ lies above the graph]
Convexity (cont.)
[figure: examples of convex and non-convex functions]
Convexity (cont.)
sums of convex functions are also convex. Proof: → blackboard
positive multiples of convex functions are also convex. Proof: → for your own practice (very simple)
differences of convex functions are not guaranteed to be convex.
Convexity (cont.)
linear functions are convex. Proof: → for your own practice (very simple)
the function $f: x \mapsto x^2$ is a convex function. Proof:
$$f(\theta x + (1-\theta)y) - \left(\theta f(x) + (1-\theta)f(y)\right) = (\theta x + (1-\theta)y)^2 - (\theta x^2 + (1-\theta)y^2) = \dots = -\theta(1-\theta)(x^2 - 2xy + y^2) = -\theta(1-\theta)(x-y)^2 \le 0$$
the function $f: \vec{x} \mapsto \|\vec{x}\|^2$ is a convex function. Proof: follows from above
if $f$ is a convex function, then the set $\{\vec{x} \mid f(\vec{x}) \le 0\}$ is a convex set. Proof: → blackboard
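A quick numerical spot-check of the convexity inequality for $f(x) = x^2$ (a sketch of ours, illustrative rather than a proof):

import numpy as np

f = lambda x: x**2
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y, theta = rng.normal(), rng.normal(), rng.uniform()
    # f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    assert f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y) + 1e-12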
Convexity and minima
Lemma: if $\Omega$ is a convex set and $f$ is a convex function, each local minimum of $f$ w.r.t. $\Omega$ is also a global minimum.
Proof: → blackboard
this lemma does not mean that for each convex set and each convex function a minimum exists!
there are conditions under which the existence of a global minimum can be guaranteed (e.g. $\Omega$ non-empty and compact, $f$ continuous)
Convexity and minima (cont.)
Lemma: If $\Omega$ is a convex set and $f$ is a convex function w.r.t. $\Omega$, each point satisfying $\operatorname{grad} f(\vec{x}) = \vec{0}$ is a local minimum.
Proof: → omitted (see the literature on convex functions)
Constraints
often, the set $\Omega$ is given implicitly by a set of constraints $g_1, \dots, g_k$ and $h_1, \dots, h_m$ in the following form:
$$\Omega = \left( \bigcap_{i=1}^{k} \{\vec{x} \mid g_i(\vec{x}) \le 0\} \right) \cap \left( \bigcap_{i=1}^{m} \{\vec{x} \mid h_i(\vec{x}) = 0\} \right)$$
$g_j$ and $h_i$ are real-valued functions
• constraints of the form $g_j(\vec{x}) \le 0$ are called inequality constraints
• constraints of the form $h_i(\vec{x}) = 0$ are called equality constraints
equality constraints $h_i(\vec{x}) = 0$ can be replaced by two inequality constraints: $h_i(\vec{x}) \le 0$ and $-h_i(\vec{x}) \le 0$.
Characterizing local minima
for a feasible point $\vec{x} \in \Omega$ we call a direction $\vec{v} \in \mathbb{R}^n \setminus \{\vec{0}\}$ a feasible direction if there exists $r > 0$ so that $\vec{x} + \theta\vec{v}$ is a feasible point for all $0 \le \theta \le r$
for a feasible point $\vec{x} \in \Omega$ we call a direction $\vec{v} \in \mathbb{R}^n \setminus \{\vec{0}\}$ a descending direction if there exists $r > 0$ so that $f(\vec{x} + \theta\vec{v}) < f(\vec{x})$ for all $0 < \theta \le r$
Characterizing local minima (cont.)
a local minimum of $f$ with respect to $\Omega$ is a feasible point for which the intersection of feasible and descending directions is empty.
there is no feasible and descending direction in $\vec{x}$ ⇔ there is a small area around $\vec{x}$ where we do not find any feasible point for which the value of $f$ is smaller ⇔ local minimum
[figure: example points illustrating three situations: no minimum (a feasible descending direction exists), a local minimum (descending directions are non-feasible), and a minimum (no descending directions at all)]
Linear constraints, differentiable target function
the theory simplifies if the target function $f$ is differentiable and the constraints are linear:
$$\min_{\vec{x}} \ f(\vec{x}) \quad \text{subject to} \quad g_j(\vec{x}) \le 0, \ j = 1, \dots, k, \qquad h_i(\vec{x}) = 0, \ i = 1, \dots, m$$
with $g_i(\vec{x}) = g_{i,0} + g_{i,1}x_1 + g_{i,2}x_2 + \dots + g_{i,n}x_n$ ($g_{i,0}, \dots, g_{i,n}$ are the coefficients of the offset and slope of $g_i$) and $h_i(\vec{x}) = h_{i,0} + h_{i,1}x_1 + h_{i,2}x_2 + \dots + h_{i,n}x_n$
if $f$ is a linear function, these tasks are called linear programs
if $f$ is a quadratic function, these tasks are called quadratic programs
Linear constraints, differentiable target function (cont.)
what are the descending directions at point $\vec{x}$? A direction $\vec{v}$ is descending if
$$\langle \vec{v}, \operatorname{grad} f(\vec{x}) \rangle < 0$$
Linear constraints, differentiable target function (cont.)
what are the feasible directions in point $\vec{x}$?
• active and inactive constraints: an inequality constraint $g$ is called active at point $\vec{x}$ if $g(\vec{x}) = 0$. Otherwise it is called inactive.
• inactive constraints: inactive constraints do not restrict feasible directions
• active constraints: directions $\vec{v}$ with $\langle \vec{v}, \operatorname{grad} g(\vec{x}) \rangle > 0$ are infeasible
[figure: feasible directions at a point for an inactive and for an active constraint, relative to $\operatorname{grad} g(\vec{x})$]
Linear constraints, differentiable target function (cont.)
what are the feasible directions in point $\vec{x}$? A direction $\vec{v}$ is feasible in $\vec{x}$ if for all constraints $g_i$ it holds:
$$g_i \ \text{is inactive} \quad \text{or} \quad \langle \vec{v}, \operatorname{grad} g_i(\vec{x}) \rangle \le 0$$
Characterizing local minima
local minima are feasible points at which the intersection of descending directions and feasible directions is empty.
examples with no active constraints:
$\operatorname{grad} f(\vec{x}) \ne \vec{0}$ ⇒ no minimum; $\operatorname{grad} f(\vec{x}) = \vec{0}$ ⇒ minimum
Characterizing local minima (cont.)
examples with one active constraint:
$-\operatorname{grad} f = \lambda \operatorname{grad} g$ with $\lambda < 0$: all descending directions feasible, no minimum
$-\operatorname{grad} f = \lambda \operatorname{grad} g + \vec{u}$ with $\lambda < 0$ and $\vec{u} \perp \operatorname{grad} g$, $\vec{u} \ne \vec{0}$: some descending directions feasible, no minimum
Characterizing local minima (cont.)
$-\operatorname{grad} f = \lambda \operatorname{grad} g + \vec{u}$ with $\lambda > 0$ and $\vec{u} \perp \operatorname{grad} g$, $\vec{u} \ne \vec{0}$: some descending directions feasible, no minimum
$-\operatorname{grad} f = \lambda \operatorname{grad} g$ with $\lambda > 0$: all descending directions non-feasible, minimum
Characterizing local minima (cont.)
$-\operatorname{grad} f = \vec{0} = 0 \cdot \operatorname{grad} g$: no descending directions, minimum
For one active constraint we found: decomposing $-\operatorname{grad} f(\vec{x}) = \lambda \operatorname{grad} g(\vec{x}) + \vec{u}$ with $\vec{u} \perp \operatorname{grad} g(\vec{x})$, we find feasible descending directions only if $\lambda < 0$ or $\vec{u} \ne \vec{0}$
$\vec{x}$ is a minimum if we find $\lambda \ge 0$ with $\operatorname{grad} f(\vec{x}) + \lambda \operatorname{grad} g(\vec{x}) = \vec{0}$
Remark: this principle can be generalized to more active constraints
Characterizing local minima (cont.)
examples with two active constraints:
$-\operatorname{grad} f = \lambda_1 \operatorname{grad} g_1 + \lambda_2 \operatorname{grad} g_2$ with $\lambda_1 < 0$, $\lambda_2 \ge 0$: some descending directions feasible, no minimum
$-\operatorname{grad} f = \lambda_1 \operatorname{grad} g_1 + \lambda_2 \operatorname{grad} g_2$ with $\lambda_1 \ge 0$, $\lambda_2 \ge 0$: all descending directions non-feasible, minimum
Characterizing local minima (cont.)
examples with an equality constraint:
$-\operatorname{grad} f = \lambda \operatorname{grad} h + \vec{u}$ with $\vec{u} \perp \operatorname{grad} h$, $\vec{u} \ne \vec{0}$: one descending direction feasible, no minimum
$-\operatorname{grad} f = \lambda \operatorname{grad} h$: all descending directions non-feasible, minimum
Characterizing local minima (cont.)
general principle for problems with differentiable target function, linear equality constraints $h_i$ and linear inequality constraints $g_j$:
A feasible point $\vec{x}$ is a minimum if there exist $\alpha_j \ge 0$ and $\beta_i$ so that:
$$\operatorname{grad} f(\vec{x}) + \sum_{\text{active inequality constraints } j} \alpha_j \cdot \operatorname{grad} g_j(\vec{x}) + \sum_{i=1}^{m} \beta_i \cdot \operatorname{grad} h_i(\vec{x}) = \vec{0}$$
(corollary to Farkas' lemma)
Lagrange function:
$$L(\vec{x}, \vec{\alpha}, \vec{\beta}) = f(\vec{x}) + \sum_{j=1}^{k} \alpha_j \cdot g_j(\vec{x}) + \sum_{i=1}^{m} \beta_i \cdot h_i(\vec{x})$$
$\alpha_j, \beta_i$ are called Lagrange multipliers
(Karush-)Kuhn-Tucker conditions
rewriting these conditions yields the (K)KT conditions:
$g_i(\vec{x}) \le 0$ for all inequality constraints
$h_j(\vec{x}) = 0$ for all equality constraints
$\alpha_i \cdot g_i(\vec{x}) = 0$ for all inequality constraints
$\frac{\partial L(\vec{x}, \vec{\alpha}, \vec{\beta})}{\partial x_i} = 0$ for all $i = 1, \dots, n$
$\alpha_i \ge 0$ for all inequality constraints
Lemma: A point $\vec{x}$ is a minimum if and only if there exist $\alpha_j$ and $\beta_i$ so that the KT conditions are met.
Proof: omitted, see literature
Kuhn-Tucker conditions (cont.)
the KT conditions do not give an algorithm to find the minima, but they allow us to check whether a given point is a minimum
example:
$$\min_{x_1, x_2} \ x_1^2 + x_2^2 \quad \text{subject to} \quad -x_1 + x_2 + 2 \le 0, \qquad -2x_1 - x_2 - 2 \le 0$$
Lagrange function, KT conditions, check whether $(0,0)$, $(4,2)$, $(1,-1)$ are minima → blackboard
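The exercise itself is meant for the blackboard; as a numerical cross-check, a sketch of ours using SciPy's SLSQP solver. SciPy expects inequality constraints in the form fun(x) ≥ 0, so the g ≤ 0 constraints are negated:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2
constraints = [
    {'type': 'ineq', 'fun': lambda x: -(-x[0] + x[1] + 2)},    # g1(x) <= 0
    {'type': 'ineq', 'fun': lambda x: -(-2*x[0] - x[1] - 2)},  # g2(x) <= 0
]
result = minimize(f, x0=np.array([5.0, 5.0]), method='SLSQP', constraints=constraints)
# converges near (1, -1): there g1 is active, grad f = (2, -2) = -2 * grad g1,
# so the KT conditions hold with alpha1 = 2 >= 0, alpha2 = 0
print(result.x)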
Ways to find the minimum
brute force approach
• equality constraints are always active
• inequality constraints may be active or inactive
• at the minimum $\vec{x}^*$, a certain subset $A$ of inequality constraints is active
• if we knew $A$ and $f$ looks "nice", we could find $\vec{x}^*$ analytically
• main idea: check all possible combinations of active constraints for the minimum
example → blackboard
can be applied only for linear and quadratic target functions
the number of possible combinations of active constraints grows exponentially
Ways to find the minimum (cont.)
active set methods: incrementally build the set of active constraints
• start with a feasible point $\vec{x}$ and assume $A = \emptyset$
• perform a line search along the steepest descending direction that is consistent with the constraints in $A$
• if a constraint $\notin A$ is violated, it is added to $A$
• if no step is possible, Lagrange multipliers are calculated to check whether we have found the minimum
• if necessary, constraints are removed from $A$
sketch of algorithm → next slide
Ways to find the minimum(cont.)
1: repeat2: calculate direction of maximal descent3: project direction of maximal descent onto boundaries of constraints in A (yields search
direction ~v)
4: if ~v 6= ~0 then5: calculate minimum on ray from ~x into direction ~v (yields steplength τ )6: check whether constraints are violated when step is performed. If yes, shorten τ and add
restricting constraint to A
7: perform step from ~x to ~x + τ~v
8: else9: calculate Lagrange multipliers
10: if Lagrange multipliers are all non-negative then11: minimum found, stop.12: else13: remove constraint with negative Lagrange multiplier from A
14: end if15: end if16: until minimum found
Introduction to Neuroinformatics: Support Vector Machines – p. 41
Duality
a mathematical concept: solve a task different from the original task; by doing so, the original task is implicitly solved.
the original task is called the primal problem, the other problem is called the dual problem
primal task (reminder):
$$\min_{\vec{x}} \ f(\vec{x}) \quad \text{subject to} \quad g_j(\vec{x}) \le 0, \ j = 1, \dots, k, \qquad h_i(\vec{x}) = 0, \ i = 1, \dots, m$$
the Lagrange function of the primal (reminder):
$$L(\vec{x}, \vec{\alpha}, \vec{\beta}) = f(\vec{x}) + \sum_{j=1}^{k} \alpha_j \cdot g_j(\vec{x}) + \sum_{i=1}^{m} \beta_i \cdot h_i(\vec{x})$$
Duality (cont.)
the function $Q$:
$$Q(\vec{\alpha}, \vec{\beta}) = \inf_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\alpha}, \vec{\beta})$$
remarks:
• the infimum is the largest lower bound of $L$; if no lower bound exists, we say the infimum is $-\infty$
• the infimum is taken over all points $\vec{x}$, regardless of whether these points are feasible or infeasible w.r.t. the primal.
dual task:
$$\max_{\vec{\alpha}, \vec{\beta}} \ Q(\vec{\alpha}, \vec{\beta}) \quad \text{subject to} \quad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, k$$
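A tiny worked example of ours (not from the slides): minimize $x^2$ subject to $1 - x \le 0$. Here $L(x, \alpha) = x^2 + \alpha(1 - x)$, the infimum over $x$ is attained at $x = \alpha/2$, so $Q(\alpha) = \alpha - \alpha^2/4$:

import numpy as np

alphas = np.linspace(0.0, 4.0, 401)
Q = alphas - alphas**2 / 4
best = np.argmax(Q)
# dual optimum alpha' = 2 with Q(alpha') = 1, matching the primal
# optimum x' = alpha'/2 = 1 with f(x') = 1
print(alphas[best], Q[best])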
Duality (cont.)
Lemma: If $\vec{x}'$ is a feasible point of the primal task and $(\vec{\alpha}', \vec{\beta}')$ is a feasible point of the dual task, then:
$$Q(\vec{\alpha}', \vec{\beta}') \le f(\vec{x}')$$
Proof:
$$Q(\vec{\alpha}', \vec{\beta}') \le L(\vec{x}', \vec{\alpha}', \vec{\beta}') = f(\vec{x}') + \sum_{i=1}^{k} \underbrace{\alpha_i'}_{\ge 0} \cdot \underbrace{g_i(\vec{x}')}_{\le 0} + \sum_{j=1}^{m} \beta_j' \cdot \underbrace{h_j(\vec{x}')}_{= 0} \le f(\vec{x}')$$
Duality (cont.)
Corollary: If $\vec{x}'$ is a feasible point of the primal task and $(\vec{\alpha}', \vec{\beta}')$ is a feasible point of the dual task and $f(\vec{x}') = Q(\vec{\alpha}', \vec{\beta}')$, then $\vec{x}'$ is a solution of the primal problem and $(\vec{\alpha}', \vec{\beta}')$ is a solution of the dual problem.
Proof: For any feasible $\vec{x}$: $f(\vec{x}) \ge Q(\vec{\alpha}', \vec{\beta}') = f(\vec{x}')$. For any feasible $(\vec{\alpha}, \vec{\beta})$: $Q(\vec{\alpha}, \vec{\beta}) \le f(\vec{x}') = Q(\vec{\alpha}', \vec{\beta}')$
Potential problem: points $\vec{x}'$ and $(\vec{\alpha}', \vec{\beta}')$ do not necessarily exist for general problems ⇒ restrictions on $f$ and the type of constraints.
Duality (cont.)
Lemma: If $f$ is convex and differentiable, the $g_j$, $h_i$ are linear, and $\vec{x}'$ is a solution of the primal problem, then a solution $(\vec{\alpha}', \vec{\beta}')$ of the dual exists and $f(\vec{x}') = Q(\vec{\alpha}', \vec{\beta}')$
Proof:
$\vec{x}'$ solution of the primal ⇒ there exist $(\vec{\alpha}', \vec{\beta}')$ that meet the KT conditions
$f$ convex, $g_j$, $h_i$ linear ⇒ $L(\vec{x}, \vec{\alpha}', \vec{\beta}')$ is convex in $\vec{x}$
from the KT conditions: $\frac{\partial L}{\partial x_i}(\vec{x}', \vec{\alpha}', \vec{\beta}') = 0$ ⇒ $L(\vec{x}', \vec{\alpha}', \vec{\beta}') = Q(\vec{\alpha}', \vec{\beta}')$
$Q(\vec{\alpha}', \vec{\beta}') = L(\vec{x}', \vec{\alpha}', \vec{\beta}') = f(\vec{x}')$ ⇒ $(\vec{\alpha}', \vec{\beta}')$ is a solution of the dual
the proof shows: the KT conditions are the link between primal solution and dual solution
solutions do not need to be unique
Duality (cont.)
main message from duality: under certain circumstances,
• you can solve the primal and get the solution of the dual for free
• you can solve the dual and get the solution of the primal for free
• you can use the Kuhn-Tucker conditions to transform the primal solution into a dual solution and vice versa
Summary (optimization theory)
general ideas about optimization (local, global minima)
convex problems: each local minimum is a global minimum
general characterization of minima (feasible directions, descending directions, Kuhn-Tucker)
algorithms to find the minimum (brute force, active set)
duality
literature: Roger Fletcher, Practical Methods of Optimization, Wiley, 1991
Applying Results of Optimization Theory to Support Vector Machines
Support vector machines
final mathematical form of the minimization task (reminder):
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 \ \text{ for } i = 1, \dots, p$$
the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM).
Support vector machines (cont.)
transforming into the appropriate form:
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad 1 - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \le 0 \ \text{ for } i = 1, \dots, p$$
note:
• the target function is convex
• the constraints are linear
• hence, local minima are also global minima (if feasible points exist)
Support vector machines (cont.)
Lagrange function:
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{p} \alpha_i \left( 1 - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \right) = \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{p} \alpha_i - \sum_{i=1}^{p} \alpha_i d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - b \sum_{i=1}^{p} \alpha_i d^{(i)}$$
derivatives:
$$\frac{\partial L}{\partial w_j} = w_j - \sum_{i=1}^{p} \alpha_i d^{(i)} x_j^{(i)} \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{p} \alpha_i d^{(i)}$$
Support vector machines (cont.)
calculating $Q(\vec{\alpha}) = \inf_{\vec{w}, b} L(\vec{w}, b, \vec{\alpha})$
$L$ is convex. Hence, if we can find a point that zeros the partial derivatives, it is the minimum.
zeroing the derivatives and resolving w.r.t. $\vec{w}$ and $b$:
$$w_j = \sum_{i=1}^{p} \alpha_i d^{(i)} x_j^{(i)}, \qquad \text{i.e.} \quad \vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad 0 = \sum_{i=1}^{p} \alpha_i d^{(i)}$$
a strange equation: $b$ is lost. What does it mean?
Support vector machines (cont.)
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{p} \alpha_i - \sum_{i=1}^{p} \alpha_i d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - b \sum_{i=1}^{p} \alpha_i d^{(i)}, \qquad \vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad 0 = \sum_{i=1}^{p} \alpha_i d^{(i)}$$
first case: $\sum_{i=1}^{p} \alpha_i d^{(i)} = 0$
in this case, we can find a $\vec{w}$ that zeros the partial derivatives for any value of $b$; it is the minimum (= infimum). Substituting $\vec{w}$ by $\sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}$ yields:
$$Q(\vec{\alpha}) = -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle + \sum_{i=1}^{p} \alpha_i$$
second case: $\sum_{i=1}^{p} \alpha_i d^{(i)} \ne 0$
regardless of $\vec{w}$, we can make $L$ arbitrarily small by varying $b$. Hence, $Q(\vec{\alpha}) = -\infty$
Support vector machines (cont.)
the dual problem:
$$\max_{\vec{\alpha}} \ \begin{cases} -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle + \sum_{i=1}^{p} \alpha_i & \text{if } \sum_{i=1}^{p} \alpha_i d^{(i)} = 0 \\ -\infty & \text{otherwise} \end{cases} \quad \text{subject to} \quad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, p$$
Support vector machines (cont.)
since the second case of the case distinction will never be the maximum, we can rewrite the dual:
$$\max_{\vec{\alpha}} \ -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle + \sum_{i=1}^{p} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{p} \alpha_i d^{(i)} = 0, \qquad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, p$$
note: the dual is a problem in $\vec{\alpha}$; $\vec{w}$ and $b$ do not occur
for now, there is no advantage in solving the dual instead of the primal problem. Later on, we will see the advantage of the dual.
for both the dual and the primal problem, powerful algorithmic solution methods exist (e.g. active set methods)
Support vector machines (cont.)
back to the primal; KT conditions:
$$1 - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - d^{(i)} b \le 0 \ \text{ for all } i = 1, \dots, p$$
$$\alpha_i \cdot \left( 1 - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - d^{(i)} b \right) = 0 \ \text{ for all } i = 1, \dots, p$$
$$\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad 0 = \sum_{i=1}^{p} \alpha_i d^{(i)}, \qquad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, p$$
what are $\vec{w}$ and $b$ if the $\alpha_i$ are already known?
Support vector machines (cont.)
resolving the KT conditions w.r.t. $\vec{w}$ and $b$:
• $\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}$
• for $i$ with $\alpha_i \ne 0$ we get from the complementary condition: $1 - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - d^{(i)} b = 0$. Hence:
$$b = \frac{1}{d^{(i)}} - \langle \vec{x}^{(i)}, \vec{w} \rangle = \frac{1}{d^{(i)}} - \sum_{j=1}^{p} \alpha_j d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle$$
the margin $\rho$:
$$\rho = \frac{1}{\|\vec{w}\|} = \frac{1}{\sqrt{\sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle}}$$
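These formulas turn a dual solution $\vec{\alpha}$ directly into the hyperplane; a small helper sketch of ours in NumPy:

import numpy as np

def recover_hyperplane(alpha, X, d):
    # w = sum_i alpha_i d_i x_i
    w = (alpha * d) @ X
    i = int(np.argmax(alpha))            # any index with alpha_i > 0
    b = 1.0 / d[i] - X[i] @ w            # from the complementary condition
    rho = 1.0 / np.linalg.norm(w)        # the margin
    return w, b, rho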
Support vector machines (cont.)
some constraints become active ($\alpha_i > 0$), some inactive ($\alpha_i = 0$)
each constraint refers to one pattern
a constraint is active if the respective pattern is located next to the separating hyperplane
patterns that refer to active constraints are called support vectors
the solution only depends on the support vectors
[figure: the optimal separating hyperplane with its margin band; the support vectors (sv) lie on the margin boundaries on the positive and negative side]
Lemma: Removing non-support vectors from a classification task does not change the solution found by a support vector machine.
Support vector machines (cont.)
example (blackboard):

i    x1(i)   x2(i)   d(i)
1     0       0      −1
2     0       1      −1
3    −1       1      −1
4     1       0      +1
5     2       1      +1

solution: $w_1 = 2$, $w_2 = 0$, $b = -1$; support vectors: $\vec{x}^{(1)}, \vec{x}^{(2)}, \vec{x}^{(4)}$
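The example can be cross-checked numerically; a sketch of ours using scikit-learn's SVC (not part of the slides; a very large C emulates the hard margin case):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [-1, 1], [1, 0], [2, 1]], dtype=float)
d = np.array([-1, -1, -1, 1, 1])
clf = SVC(kernel='linear', C=1e6).fit(X, d)
print(clf.coef_, clf.intercept_)  # approximately w = (2, 0), b = -1
print(clf.support_)               # indices 0, 1, 3, i.e. patterns 1, 2 and 4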
Fault tolerant SVMs
Soft margin SVM
the SVM discussed up to here (the so-called "hard margin" case): perfect classification
what about data which are not linearly separable?
soft margin SVM
opposing aspects: maximizing the margin vs. minimizing the errors
[figure: a non-separable data set; some positive and negative patterns fall inside the margin band or on the wrong side of the separating hyperplane, incurring additional errors]
Soft margin SVM (cont.)
mathematical model of errors: slack variables $\xi_i$. $\xi_i$ models the error that is made for the training pattern $\vec{x}^{(i)}$
extended constraints:
$$d^{(i)} \cdot \left( \langle \vec{w}, \vec{x}^{(i)} \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
extended optimization target:
$$\min_{\vec{w}, b, \vec{\xi}} \ \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{p} \xi_i^2$$
$C > 0$ is a fixed constant to balance margin and errors
Soft margin SVM (cont.)
primal of soft margin classification:
$$\min_{\vec{w}, b, \vec{\xi}} \ \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{p} \xi_i^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \ \text{ for } i = 1, \dots, p$$
Lagrange function:
$$L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\beta}) = \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{p} \xi_i^2 + \sum_{i=1}^{p} \alpha_i \left( 1 - \xi_i - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \right) + \sum_{i=1}^{p} \beta_i \cdot (-\xi_i)$$
primal variables: $\vec{w}, b, \vec{\xi}$. Lagrange multipliers: $\vec{\alpha}, \vec{\beta}$
Soft margin SVM (cont.)
KT conditions:
$$1 - \xi_i - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \le 0 \ \text{ for } i = 1, \dots, p$$
$$-\xi_i \le 0 \ \text{ for } i = 1, \dots, p$$
$$\alpha_i \cdot \left( 1 - \xi_i - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \right) = 0 \ \text{ for } i = 1, \dots, p$$
$$\beta_i \cdot (-\xi_i) = 0 \ \text{ for } i = 1, \dots, p$$
$$\frac{\partial L}{\partial w_j} = w_j - \sum_{i=1}^{p} \alpha_i d^{(i)} x_j^{(i)} = 0 \ \text{ for } j = 1, \dots, n$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{p} \alpha_i d^{(i)} = 0$$
$$\frac{\partial L}{\partial \xi_i} = 2C\xi_i - \alpha_i - \beta_i = 0 \ \text{ for } i = 1, \dots, p$$
$$\alpha_i \ge 0, \qquad \beta_i \ge 0 \ \text{ for } i = 1, \dots, p$$
Soft margin SVM (cont.)
deriving the solution from the KT conditions. We get $\vec{w}$ and $\xi_i$ directly:
$$\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad \xi_i = \frac{\alpha_i + \beta_i}{2C}$$
to derive $b$, we exploit the complementary condition for a pattern with non-zero $\alpha_i$:
$$b = \frac{1 - \xi_i - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle}{d^{(i)}} = \frac{1 - \frac{\alpha_i + \beta_i}{2C}}{d^{(i)}} - \sum_{j=1}^{p} \alpha_j d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle$$
Soft margin SVM (cont.)
which patterns become support vectors?
• patterns on the boundary of the margin band
• misclassified patterns and patterns within the margin band
Soft margin SVM (cont.)
calculating the function $Q(\vec{\alpha}, \vec{\beta}) = \inf_{\vec{w}, b, \vec{\xi}} L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\beta})$
first case: $\sum_{i=1}^{p} \alpha_i d^{(i)} = 0$:
then the minimum exists with $\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}$, $\xi_i = \frac{\alpha_i + \beta_i}{2C}$, $b$ arbitrary.
$$Q(\vec{\alpha}, \vec{\beta}) = -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle - \frac{1}{2} \sum_{i=1}^{p} \frac{(\alpha_i + \beta_i)^2}{2C} + \sum_{i=1}^{p} \alpha_i$$
second case: $\sum_{i=1}^{p} \alpha_i d^{(i)} \ne 0$: $L$ is unbounded below, hence $Q(\vec{\alpha}, \vec{\beta}) = -\infty$
observation: the second case does not contribute to the maximum of $Q$
Soft margin SVM (cont.)

the dual:

    maximize_{~α,~β}  ( −(1/2) ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩
                        − (1/2) ∑_{i=1}^p (α_i + β_i)²/(2C) + ∑_{i=1}^p α_i )

    subject to  ∑_{i=1}^p α_i d^(i) = 0
                α_i ≥ 0   for all i = 1, ..., p
                β_i ≥ 0   for all i = 1, ..., p
Introduction to Neuroinformatics: Support Vector Machines – p. 70
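As a sketch (toy data made up for illustration), the dual can be maximized numerically by minimizing −Q under the constraints above; note that ~β enters only through −(α_i + β_i)²/(4C), so the maximum is always attained at ~β = 0:

    # Sketch: solving the soft-margin dual with SciPy's SLSQP.
    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
    d = np.array([1.0, 1.0, -1.0, -1.0])
    p, C = len(d), 1.0
    G = (d[:, None] * d[None, :]) * (X @ X.T)  # G_ij = d^(i) d^(j) <x^(i), x^(j)>

    def neg_Q(z):                              # z = (alpha, beta)
        a, bt = z[:p], z[p:]
        return 0.5 * a @ G @ a + np.sum((a + bt) ** 2) / (4 * C) - np.sum(a)

    res = minimize(neg_Q, np.zeros(2 * p), bounds=[(0, None)] * (2 * p),
                   constraints=[{"type": "eq", "fun": lambda z: z[:p] @ d}])
    alpha, beta = res.x[:p], res.x[p:]
    print("alpha =", alpha.round(4), " beta =", beta.round(4))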
Soft margin SVM (cont.)

there is a variant of soft margin SVM that uses the following optimization target:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^p ξ_i

instead of:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^p ξ_i²

advantage: the calculation of the solution simplifies in some respects, and the solution is more robust w.r.t. outliers.

C controls the balance between errors and margin width:
• large C prefers small errors
• small C prefers a large margin

an adequate value of C has to be determined experimentally (cf. the sketch below)
→ demo in practical exercises
Introduction to Neuroinformatics: Support Vector Machines – p. 71
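A sketch of such an experiment (data and values of C made up), using scikit-learn's SVC, which implements the linear-slack variant above:

    # Sketch: effect of C on margin width and number of support vectors.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
    y = np.array([1] * 20 + [-1] * 20)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        width = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width 2/||w||
        print(f"C={C:>6}: margin width={width:.3f}, "
              f"support vectors={len(clf.support_)}")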
Non-linear SVMs
Introduction to Neuroinformatics: Support Vector Machines – p. 72
Non-linear SVMs
linear discrimination is not appropriate in all cases; we need a non-linear variant of SVMs

switching from linear constraints to non-linear constraints causes a lot of numerical problems:
• losing convexity
• KT conditions no longer sufficient
• optimization algorithms with low performance
Introduction to Neuroinformatics: Support Vector Machines – p. 73
Non-linear SVMs (cont.)

classical idea: use non-linear features

a feature mapping Φ (non-linear) maps the original input vectors to feature vectors

we have to replace ~x in all formulae by Φ(~x)

the dual of the hard margin case becomes:

    maximize_{~α}  −(1/2) ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) ⟨Φ(~x^(i)), Φ(~x^(j))⟩ + ∑_{i=1}^p α_i

    subject to  ∑_{i=1}^p α_i d^(i) = 0
                α_i ≥ 0   for all i = 1, ..., p
Introduction to Neuroinformatics: Support Vector Machines – p. 74
Kernel trick

defining a function KΦ(·, ·) as KΦ(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩ we get:

    maximize_{~α}  −(1/2) ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) KΦ(~x^(i), ~x^(j)) + ∑_{i=1}^p α_i

    subject to  ∑_{i=1}^p α_i d^(i) = 0
                α_i ≥ 0   for all i = 1, ..., p

now, patterns occur in the dual only as arguments of the function KΦ

the function KΦ is called a kernel function
Introduction to Neuroinformatics: Support Vector Machines – p. 75
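In code, the dual then needs only the Gram matrix K_ij = KΦ(~x^(i), ~x^(j)); a sketch (the polynomial kernel here is just one possible choice):

    # Sketch: building the kernel Gram matrix that the dual operates on.
    import numpy as np

    def poly_kernel(x, y, deg=2):
        return (x @ y + 1.0) ** deg

    def gram_matrix(X, kernel):
        p = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(p)]
                         for i in range(p)])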
Kernel trick (cont.)

thought experiment: if we didn't know Φ but had access to the kernel KΦ, could we solve the dual optimization problem? Yes!

can we calculate the solving weight vector ~w, bias b and the margin ρ? (see slide 59)

weight vector: no, bias: yes, margin: yes

    b = 1/d^(s) − ∑_{j=1}^p α_j d^(j) KΦ(~x^(s), ~x^(j))   for a support vector s

    ρ = 1 / √( ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) KΦ(~x^(i), ~x^(j)) )

can we apply the learned classification to a new input pattern ~x^(new)? Yes! (see next slide)
Introduction to Neuroinformatics: Support Vector Machines – p. 76
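Both formulas need only the Gram matrix; a sketch (function name mine, ~α and a support vector index s assumed given):

    # Sketch: bias and margin from the kernel alone (hard margin case).
    import numpy as np

    def bias_and_margin(K, d, alpha, s):
        b = 1.0 / d[s] - K[s] @ (alpha * d)        # b from support vector s
        rho = 1.0 / np.sqrt((alpha * d) @ K @ (alpha * d))
        return b, rho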
Kernel trick (cont.)

    ~w = ∑_{i=1}^p α_i d^(i) Φ(~x^(i))

applying ~w and b to a new pattern:

    ⟨~w, Φ(~x^(new))⟩ + b = ⟨ ∑_{i=1}^p α_i d^(i) Φ(~x^(i)), Φ(~x^(new)) ⟩ + b
                          = ∑_{i=1}^p α_i d^(i) KΦ(~x^(i), ~x^(new)) + b
Introduction to Neuroinformatics: Support Vector Machines – p. 77
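As a sketch, prediction therefore needs only the stored support vectors, their multipliers and the kernel (argument names are mine):

    # Sketch: classifying a new pattern without ever forming w or Phi.
    import numpy as np

    def svm_predict(x_new, X_sv, d_sv, alpha_sv, b, kernel):
        s = sum(a * di * kernel(xi, x_new)
                for a, di, xi in zip(alpha_sv, d_sv, X_sv))
        return np.sign(s + b)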
Kernel trick (cont.)

quintessence:
• replacing the dot product by a kernel function is possible
• an explicit representation is not possible if we do not have access to Φ
• nonetheless, predictions can be made
• we need to memorize the support vectors and the non-zero Lagrange multipliers α_i

the technique of implicitly calculating a separating hyperplane in feature space by replacing the dot product with a kernel function is called the kernel trick

[diagram: input space —Φ→ feature space, optimal linear classification there, —Φ⁻¹→ back to input space; the kernel trick takes the direct route]
Introduction to Neuroinformatics: Support Vector Machines – p. 78
Kernel trick (cont.)

example (one-dimensional input space):

D = {(0; −1), (−1; 1), (2; 1)} is not linearly separable in input space

feature mapping:

    Φ(x) = (x², √2·x, 1)^T

kernel function:

    KΦ(x, y) = ⟨(x², √2·x, 1)^T, (y², √2·y, 1)^T⟩ = (xy)² + 2(xy) + 1 = (xy + 1)²

solution: α_1 = 3/4, α_2 = 2/3, α_3 = 1/12

    b = −1,  ~w = (1, −(1/2)√2, 0)^T   (in feature space!)
Introduction to Neuroinformatics: Support Vector Machines – p. 79
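A quick numerical check of this example (sketch; the printed decision values should all equal 1, i.e. every pattern lies exactly on the margin):

    # Sketch: verifying the worked example.
    import numpy as np

    x = np.array([0.0, -1.0, 2.0])
    d = np.array([-1.0, 1.0, 1.0])
    alpha = np.array([3/4, 2/3, 1/12])
    b = -1.0

    K = (np.outer(x, x) + 1.0) ** 2       # K(x, y) = (xy + 1)^2
    print(np.isclose(alpha @ d, 0.0))     # dual constraint: True
    print(d * (K @ (alpha * d) + b))      # -> [1. 1. 1.]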
Kernel trick (cont.)

example, illustration:

[figure: the patterns at 0, −1, 2 in input space and their images under Φ in the (f_1, f_2) feature plane, where they become linearly separable]

Which points in input space are on the decision boundary?
look for z with ∑_{i=1}^p α_i d^(i) KΦ(~x^(i), z) + b = 0

yields: z = (1 ± √5)/2
Introduction to Neuroinformatics: Support Vector Machines – p. 80
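Expanding ∑_i α_i d^(i) (x^(i) z + 1)² + b with the values of the example gives z² − z − 1 = 0, so the roots can be checked in one line (sketch):

    # Sketch: decision boundary of the example in input space.
    import numpy as np
    print(np.roots([1.0, -1.0, -1.0]))    # -> (1 ± sqrt(5)) / 2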
Kernel trick (cont.)

kernels can be designed individually for an application:
• look for meaningful features
• create Φ
• derive KΦ

in these cases, using the kernel instead is not better than an explicit calculation in feature space

there are classes of generic kernels:
• can be used for many tasks
• often, Φ is very complicated and the feature space is very high dimensional
• for some generic kernels, the feature space has infinite dimension and Φ cannot be calculated explicitly
• typically, using the kernel instead of an explicit calculation in feature space is computationally more efficient
Introduction to Neuroinformatics: Support Vector Machines – p. 81
Generic kernels

example of a kernel (one-dimensional input space):

    Φ(x) = (x², √2·x, 1)^T

mapping from input space to feature space needs: 2 multiplications
calculating the dot product in feature space needs: 3 multiplications, 2 additions
evaluating ⟨Φ(x), Φ(y)⟩ thus needs: 7 multiplications, 2 additions

corresponding kernel: KΦ(x, y) = (xy + 1)²
kernel evaluation needs: 2 multiplications, 1 addition
Introduction to Neuroinformatics: Support Vector Machines – p. 82
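A sketch confirming that both computations give the same value:

    # Sketch: explicit feature map vs. kernel for the example above.
    import numpy as np

    def phi(x):
        return np.array([x ** 2, np.sqrt(2) * x, 1.0])

    x, y = 1.7, -0.3
    print(phi(x) @ phi(y), (x * y + 1.0) ** 2)   # same value, fewer operations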
Generic kernels (cont.)

a class of generic kernels: polynomial kernels
polynomial kernels are functions of the form:

    KΦ(~x, ~y) = (⟨~x, ~y⟩)^d

or:

    KΦ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d

d ∈ N is a kernel parameter that needs to be chosen manually

for polynomial kernels, Φ can be calculated explicitly. It contains all monomials of degree d (first case) or all monomials of degree at most d (second case).
But: the feature space becomes very large, even for small d; e.g. for the first variant the number of features is:

    n     d     number of features
    3     2     6
    5     3     35
    10    3     220
    10    5     2002
    10    10    92 378
    50    4     292 825
    256   4     183 181 376
Introduction to Neuroinformatics: Support Vector Machines – p. 83
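The table follows from the count of monomials of degree d in n variables, binomial(n + d − 1, d); a sketch reproducing it:

    # Sketch: feature space dimension of the kernel (<x, y>)^d.
    from math import comb

    for n, d in [(3, 2), (5, 3), (10, 3), (10, 5),
                 (10, 10), (50, 4), (256, 4)]:
        print(n, d, comb(n + d - 1, d))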
Generic kernels (cont.)

what makes a symmetric function K a kernel function?

• a mapping Φ into some space with a dot product (a Hilbert space) exists such that K(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩

• equivalently, K satisfies the condition of Mercer's theorem:

    A continuous, symmetric function K : [a, b]ⁿ × [a, b]ⁿ → R is a kernel
    if and only if for all functions g : [a, b]ⁿ → R with ∫_{[a,b]ⁿ} g(~x)² d~x < ∞:

        ∫_{[a,b]ⁿ} ∫_{[a,b]ⁿ} K(~x, ~y) g(~x) g(~y) d~x d~y ≥ 0

    Proof: omitted

consequences: sums of kernels and positive multiples of kernels are also kernels
Introduction to Neuroinformatics: Support Vector Machines – p. 84
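A practical consequence that can be checked numerically: every Gram matrix of a true kernel must be positive semidefinite. A sketch for the polynomial kernel on random points:

    # Sketch: Gram matrices of true kernels have no negative eigenvalues.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    K = (X @ X.T + 1.0) ** 2                     # Gram matrix of (<x,y> + 1)^2
    print(np.linalg.eigvalsh(K).min() >= -1e-9)  # -> True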
Generic kernels (cont.)

RBF kernels (Gaussian kernels):

    KΦ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))

σ² > 0 is a kernel parameter, chosen manually

feature spaces have infinite dimension

local information processing, circular decision boundaries in input space

what is calculated in an SVM with an RBF kernel resembles an RBF network, but training is done in a different way

the kernel parameter controls the complexity of the kernel: if σ² is large, the behavior is similar to the dot product; if σ² is very small, decision boundaries in input space may become very complex

in the limit σ² → 0 the kernel degenerates and the SVM merely memorizes the training patterns

most popular kernel type in practice
Introduction to Neuroinformatics: Support Vector Machines – p. 85
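A sketch of the kernel and the role of σ² (values made up): for large σ² even distant points look similar, for small σ² the kernel is sharply local:

    # Sketch: RBF kernel values for one pair of points and varying sigma^2.
    import numpy as np

    def rbf_kernel(x, y, sigma2=1.0):
        diff = np.asarray(x) - np.asarray(y)
        return np.exp(-(diff @ diff) / (2.0 * sigma2))

    x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    for s2 in (100.0, 1.0, 0.01):
        print(s2, rbf_kernel(x, y, s2))   # -> near 1, 0.37, near 0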
Generic kernels (cont.)

sigmoid kernels:

    KΦ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))

γ > 0 and θ are kernel parameters, chosen manually

feature spaces have infinite dimension

what is calculated in an SVM with a sigmoid kernel resembles an MLP network with one hidden layer

no practical use due to numerical degeneration of the optimization problem
Introduction to Neuroinformatics: Support Vector Machines – p. 86
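One way to see the degeneration (a constructed sketch, parameters chosen for the purpose): for many parameter settings the sigmoid function violates Mercer's condition, so its Gram matrices can have negative eigenvalues:

    # Sketch: the sigmoid "kernel" need not be positive semidefinite.
    import numpy as np

    X = np.eye(20)                        # 20 mutually orthogonal unit vectors
    K = np.tanh(2.0 * (X @ X.T) - 1.0)    # gamma = 2, theta = 0.5
    print(np.linalg.eigvalsh(K).min())    # -> about -13.7, clearly negative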
Generic kernels: summary

dot-product kernel: KΦ(~x, ~y) = ⟨~x, ~y⟩
input space = feature space; simple, but useful

polynomial kernels: KΦ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d, KΦ(~x, ~y) = (⟨~x, ~y⟩)^d
extensions of the dot-product kernel; feature spaces of finite dimension

RBF kernels: KΦ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))
very important in practice, very flexible

sigmoid kernels: KΦ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))
limited practical use

problem-dependent kernels
→ demo in practical exercises
Introduction to Neuroinformatics: Support Vector Machines – p. 87
Summary (SVMs)

linear classification with an optimal separating hyperplane

mathematical problem description as a convex quadratic program

numerical solution by various approaches, often based on active set methods: libSVM, SMO, ...

representation of the solution in terms of support vectors

soft margin case: classification with errors

kernel trick: implicit calculations in feature space, non-linear SVM

extensions:
• SVM for regression
• SVM for one-class classification

important ideas:
• large margin techniques
• kernel-based techniques

Introduction to Neuroinformatics: Support Vector Machines – p. 88
Further readings
Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, 2000

C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998. Available at: http://www.kernel-machines.org

Roger Fletcher, Practical Methods of Optimization, Wiley, 1987
Introduction to Neuroinformatics: Support Vector Machines – p. 89