
Introduction to Neuroinformatics: Support Vector Machines

Prof. Dr. Martin Riedmiller

University of Osnabrück

Institute of Computer Science and Institute of Cognitive Science


Outline

support vector machines (SVM) for classification: basic ideas

optimization under constraints: Lagrange multipliers, Kuhn-Tucker conditions, dual problem, algorithmic approaches

applying optimization theory to SVM

kernel trick: making linear classifiers non-linear

variants of SVMs


Linear classification

perceptron revisited: find a hyperplane that classifies a given set of positive and negative training patterns correctly, i.e. find (~w, w0) so that:

〈~w, ~x〉 + w0 > 0 for all positive training patterns ~x

〈~w, ~x〉 + w0 < 0 for all negative training patterns ~x

perceptron learning finds a solving hyperplane (if possible) by adding and subtracting patterns to the weight vector

if there are several solving hyperplanes, perceptron learning finds an arbitrary one
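As a concrete illustration, here is a minimal sketch of the perceptron rule just described, in Python with a small made-up dataset (the dataset and function names are assumptions of this sketch, not taken from the slides): misclassified patterns are added to or subtracted from the weight vector until every pattern is classified correctly.

```python
import numpy as np

def perceptron(X, d, max_epochs=1000):
    """Perceptron learning: returns (w, w0) with d_i * (<w, x_i> + w0) > 0
    for all patterns, provided the data are linearly separable."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, d):                     # t is +1 or -1
            if t * (np.dot(w, x) + w0) <= 0:       # misclassified (or on the hyperplane)
                w, w0 = w + t * x, w0 + t          # add / subtract the pattern
                errors += 1
        if errors == 0:                            # all patterns classified correctly
            return w, w0
    raise RuntimeError("no separating hyperplane found within max_epochs")

# toy data: two separable point clouds (assumed example)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
d = np.array([1, 1, -1, -1])
w, w0 = perceptron(X, d)
print(w, w0, d * (X @ w + w0) > 0)                 # all True once separated
```

Which of the many possible separating hyperplanes it returns depends on the initialization and the order of the patterns — exactly the arbitrariness the maximum-margin idea below removes.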


Linear classification (cont.)

Which is the best separating hyperplane?

[figure: positive (+) and negative (−) training patterns with several candidate separating hyperplanes]

the performance on the training set is equally good, but what do you expect for an independent test set?


Linear classification (cont.)

for each separating hyperplane we can define the margin: the margin is the smallest distance of a training pattern to the separating hyperplane.


optimality: the larger the margin, the smaller the risk of misclassification.

[figure: positive and negative training patterns with a separating hyperplane and its margin]


Support vector machines

support vector machines (SVM) (Vapnik, 1970s): find a separating hyperplane that maximizes the margin.

notation:

• training set D = {(~x(1), d(1)), . . . , (~x(p), d(p))}, ~x(i) ∈ Rn, d(i) ∈ {−1, +1}

• separating hyperplane defined by weight vector ~w and bias weight b

(= w0, to be in line with SVM literature)

• margin ρ

distance of a point ~x to a hyperplane (~w, b):

| ⟨~x, ~w⟩ + b | / ||~w||
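A small numeric illustration (Python/NumPy, with a made-up hyperplane and patterns; the values are assumptions of this sketch) of this distance formula and of the margin ρ as the smallest such distance over the training set:

```python
import numpy as np

def distance(x, w, b):
    """Distance of point x to the hyperplane <x, w> + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def margin(X, w, b):
    """Margin rho: smallest distance of any training pattern to the hyperplane."""
    return min(distance(x, w, b) for x in X)

w, b = np.array([1.0, 1.0]), -1.0                      # assumed hyperplane
X = np.array([[2.0, 2.0], [0.0, -1.0], [-1.0, 0.0]])   # assumed training patterns
print([round(distance(x, w, b), 3) for x in X])        # [2.121, 1.414, 1.414]
print(round(margin(X, w, b), 3))                       # 1.414
```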


Support vector machines

find the separating hyperplane with maximal margin:

maximize over ~w, b, ρ:   ρ²
subject to   ρ > 0
             d(i) · ( ⟨~x(i), ~w⟩ + b ) / ||~w|| ≥ ρ   for all (~x(i), d(i)) ∈ D

this task contains a degree of freedom: ~w and b can be scaled by a positive number without changing the separating hyperplane. We can remove this degree of freedom and simplify the problem by adding a constraint on the length of ~w. Here:

||~w|| = 1/ρ


Support vector machines (cont.)

optimization problem with length constraint:

maximize over ~w, b, ρ:   ρ² = 1/||~w||²
subject to   ρ > 0
             ||~w|| = 1/ρ,   i.e. ρ = 1/||~w||
             d(i) · ( ⟨~x(i), ~w⟩ + b ) / ||~w|| ≥ ρ = 1/||~w||   for all (~x(i), d(i)) ∈ D

rewriting the optimization problem

Finally, we can make it a minimization task by minimizing the reciprocal of 1/||~w||², i.e. by minimizing ||~w||².


Support vector machines (cont.)

the minimization task:

minimize over ~w, b, ρ:   (1/2) ||~w||²
subject to   d(i) · ( ⟨~x(i), ~w⟩ + b ) ≥ 1   for all (~x(i), d(i)) ∈ D
             ρ = 1/||~w||

ρ occurs only in one equality constraint, i.e. we can solve the optimization problem for ~w and b and calculate ρ afterwards from ~w


Support vector machines (cont.)

final mathematical form of the minimization task:

minimize over ~w, b:   (1/2) ||~w||²
subject to   d(i) · ( ⟨~x(i), ~w⟩ + b ) ≥ 1   for all (~x(i), d(i)) ∈ D

the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM)
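This is a minimization of a quadratic function under linear inequality constraints, so it can be handed to a standard solver. A hedged sketch (using the cvxpy modelling library and a made-up toy dataset — both are assumptions of this example; the slides do not prescribe any particular solver):

```python
import numpy as np
import cvxpy as cp

# assumed toy training set: two linearly separable classes in R^2
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.5], [-2.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(d, X @ w + b) >= 1]          # d_i (<x_i, w> + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print("margin rho =", 1.0 / np.linalg.norm(w.value))    # rho = 1 / ||w||
```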


Optimization Theory


Optimization theory

a general form of an optimization problem:

minimize over ~x:   f(~x)
subject to   ~x ∈ Ω

Ω ⊆ Rn: the set of feasible (permitted) points

f : Rn → R: the target function that should be minimized

maximization problems can be described in the same way (maximize f(x) ⇔ minimize −f(x))

often, Ω is implicitly given by equality and inequality constraints


Local and global minima

a feasible point ~x ∈ Ω is called a global minimum if for all ~y ∈ Ω holds:

f(~x) ≤ f(~y)

a feasible point ~x ∈ Ω is called a local minimum if there is a radius r > 0 so that for all points ~y ∈ Ω with ||~x − ~y|| < r holds:

f(~x) ≤ f(~y)

[figure: a function f(x) over a feasible set Ω with two local minima, one of them global]


Convexity

A set X is called a convex set if for all points ~x, ~y ∈ X and any 0 ≤ θ ≤ 1 holds:

θ~x + (1 − θ)~y ∈ X

intuitively: a set is convex if all points between two elements of the set are also elements of the set.

[figure: a set X with two points ~x, ~y ∈ X and the question whether θ~x + (1 − θ)~y ∈ X]


Convexity (cont.)

[figures: examples of convex sets and of non-convex sets]


Convexity (cont.)

intersections of convex sets are also convex. Proof: → blackboard

unions of convex sets are not guaranteed to be convex.


Convexity (cont.)

A function f is called a convex function if for all points ~x, ~y and all 0 ≤ θ ≤ 1 holds:

f(θ~x + (1 − θ)~y) ≤ θf(~x) + (1 − θ)f(~y)

[figure: a convex function f(x) with the chord between f(~x) and f(~y) lying above the graph]


[figures: examples of convex functions and of non-convex functions]


Convexity (cont.)

sums of convex functions are also convex. Proof: → blackboard

positive multiples of convex functions are also convex. Proof: → for your own practice (very simple)

differences of convex functions are not guaranteed to be convex.


Convexity (cont.)

linear functions are convex. Proof: → for your own practice (very simple)

the function f : x ↦ x² is a convex function. Proof:

f(θx + (1 − θ)y) − (θf(x) + (1 − θ)f(y))
= (θx + (1 − θ)y)² − (θx² + (1 − θ)y²) = . . .
= −θ(1 − θ)(x² − 2xy + y²) = −θ(1 − θ)(x − y)² ≤ 0

the function f : ~x ↦ ||~x||² is a convex function. Proof: follows from above

if f is a convex function, then the set {~x | f(~x) ≤ 0} is a convex set. Proof: → blackboard
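A quick numeric sanity check (Python/NumPy; randomly sampled pairs, so an illustration rather than a proof, and the helper name is an invention of this sketch) of the convexity inequality for the two functions above:

```python
import numpy as np

def convex_on_samples(f, points, n_trials=1000, seed=0):
    """Check f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    on randomly drawn pairs of points -- a sanity check, not a proof."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x = points[rng.integers(len(points))]
        y = points[rng.integers(len(points))]
        theta = rng.random()
        if f(theta * x + (1 - theta) * y) > theta * f(x) + (1 - theta) * f(y) + 1e-12:
            return False
    return True

pts_1d = [np.array([v]) for v in np.linspace(-3, 3, 25)]
print(convex_on_samples(lambda x: float(x[0] ** 2), pts_1d))        # True for x^2
pts_2d = [np.array([a, b]) for a in (-2.0, 0.0, 2.0) for b in (-1.0, 1.0)]
print(convex_on_samples(lambda x: float(np.dot(x, x)), pts_2d))     # True for ||x||^2
```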


Convexity and minima

Lemma: if Ω is a convex set and f is a convex function, each local minimum of f w.r.t. Ω is also a global minimum.

Proof: → blackboard

this lemma does not mean that for each convex set and each convex function a minimum exists!

there are conditions under which the existence of a global minimum can be guaranteed (e.g. Ω non-empty and compact, f continuous)


Convexity and minima (cont.)

Lemma: If Ω is a convex set and f is a convex function w.r.t. Ω, each point satisfying grad f(~x) = 0 is a local minimum.

Proof: → omitted (literature about convex functions)


Constraints

often, the set Ω is given implicitly by a set of constraints g1, . . . , gk and h1, . . . , hm in the following form:

Ω = ( ⋂_{i=1..k} {~x | gi(~x) ≤ 0} ) ∩ ( ⋂_{i=1..m} {~x | hi(~x) = 0} )

gj and hi are real valued functions

• constraints of the form gj(~x) ≤ 0 are called inequality constraints

• constraints of the form hi(~x) = 0 are called equality constraints

equality constraints hi(~x) = 0 can be replaced by two inequality constraints:

hi(~x) ≤ 0 and −hi(~x) ≤ 0.


Characterizing local minima

for a feasible point ~x ∈ Ω we call a direction ~v ∈ Rn \ {~0} a feasible direction if there exists r > 0 so that ~x + θ~v is a feasible point for all 0 ≤ θ ≤ r

for a feasible point ~x ∈ Ω we call a direction ~v ∈ Rn \ {~0} a descending direction if there exists r > 0 so that f(~x + θ~v) < f(~x) for all 0 < θ ≤ r


Characterizing local minima (cont.)

a local minimum of f with respect to Ω is a feasible point for which the intersection of feasible and descending directions is empty.

there is no feasible and descending direction in ~x ⇔ there is a small area around ~x where we do not find any feasible point for which the value of f is smaller ⇔ local minimum

[figure: example points illustrating “no minimum”, “minimum (no descending directions)”, “local minimum (descending directions non-feasible)”, and “no minimum”]


Linear constraints, differentiable target function

the theory simplifies if the target function f is differentiable and the constraints are linear:

minimize over ~x:   f(~x)
subject to   gj(~x) ≤ 0,   j = 1, . . . , k
             hi(~x) = 0,   i = 1, . . . , m

with gj(~x) = gj,0 + gj,1 x1 + gj,2 x2 + · · · + gj,n xn
(gj,0, . . . , gj,n are coefficients of the offset and slope of gj)
and hi(~x) = hi,0 + hi,1 x1 + hi,2 x2 + · · · + hi,n xn

if f is a linear function, these tasks are called linear programs

if f is a quadratic function, these tasks are called quadratic programs


Linear constraints, differentiable target function (cont.)

what are the descending directions at point ~x? A direction ~v is descending if

⟨~v, grad f(~x)⟩ < 0


Linear constraints, differentiable target function (cont.)

what are the feasible directions in point ~x?

• active and inactive constraints: an inequality constraint g is called active at point ~x if g(~x) = 0. Otherwise it is called inactive.

• inactive constraints: inactive constraints do not restrict feasible directions

• active constraints: directions ~v with ⟨~v, grad g(~x)⟩ > 0 are infeasible

[figure: grad g(~x) at an inactive and at an active constraint, with the corresponding feasible directions]


Linear constraints, differentiable target function (cont.)

what are the feasible directions in point ~x? A direction ~v is feasible in ~x if for all constraints gi holds:

gi is inactive or ⟨~v, grad gi(~x)⟩ ≤ 0


Characterizing local minima

local minima are feasible points at which the intersection of descending directions and feasible directions is empty.

examples with no active constraints:

[figure: two examples]  grad f(~x) ≠ ~0 ⇒ no minimum;   grad f(~x) = ~0 ⇒ minimum


Characterizing local minima (cont.)

examples with one active constraint:

[figure] −grad f = λ grad g with λ < 0: all descending directions feasible ⇒ no minimum

[figure] −grad f = λ grad g + ~u with λ < 0 and ~u ⊥ grad g, ~u ≠ ~0: some descending directions feasible ⇒ no minimum


Characterizing local minima (cont.)

[figure] −grad f = λ grad g + ~u with λ > 0 and ~u ⊥ grad g, ~u ≠ ~0: some descending directions feasible ⇒ no minimum

[figure] −grad f = λ grad g with λ > 0: all descending directions non-feasible ⇒ minimum


Characterizing local minima (cont.)

[figure] −grad f = ~0 = 0 · grad g: no descending directions ⇒ minimum

For one active constraint we found: decomposing −grad f(~x) = λ grad g(~x) + ~u with ~u ⊥ grad g(~x),

we find feasible descending directions only if λ < 0 or ~u ≠ ~0

~x is a minimum if we find λ ≥ 0 with grad f(~x) + λ grad g(~x) = ~0

Remark: this principle can be generalized to more active constraints


Characterizing local minima (cont.)

examples with two active constraints:

[figure] −grad f = λ1 grad g1 + λ2 grad g2 with λ1 < 0, λ2 ≥ 0: some descending directions feasible ⇒ no minimum

[figure] −grad f = λ1 grad g1 + λ2 grad g2 with λ1 ≥ 0, λ2 ≥ 0: all descending directions non-feasible ⇒ minimum


Characterizing local minima (cont.)

examples with an equality constraint:

[figure] −grad f = λ grad h + ~u with ~u ⊥ grad h, ~u ≠ ~0: one descending direction feasible ⇒ no minimum

[figure] −grad f = λ grad h: all descending directions non-feasible ⇒ minimum


Characterizing local minima (cont.)

general principle for problems with a differentiable target function, linear equality constraints hi and linear inequality constraints gj:

A feasible point ~x is a minimum if there exist αj ≥ 0 and βi so that:

grad f(~x) + Σ_{active inequality constraints j} ( αj · grad gj(~x) ) + Σ_{i=1..m} ( βi · grad hi(~x) ) = ~0

(corollary to Farkas’ lemma)

Lagrange function:

L(~x, ~α, ~β) = f(~x) + Σ_{j=1..k} ( αj · gj(~x) ) + Σ_{i=1..m} ( βi · hi(~x) )

αi, βi are called Lagrange multipliers


(Karush-) Kuhn-Tucker conditions

rewriting these conditions yields the (K)KT conditions:

gi(~x) ≤ 0 for all inequality constraints

hj(~x) = 0 for all equality constraints

αi · gi(~x) = 0 for all inequality constraints

0 = ∂L(~x, ~α, ~β) / ∂xi   for all i = 1, . . . , n

αi ≥ 0 for all inequality constraints

Lemma: A point ~x is a minimum if and only if there exist αj and βi so that the KT conditions are met.

Proof: omitted, see literature


Kuhn-Tucker conditions (cont.)

the KT conditions do not give an algorithm to find the minima, but they allow us to check whether a given point is a minimum

example:

minimize over x1, x2:   x1² + x2²
subject to   −x1 + x2 + 2 ≤ 0
             −2x1 − x2 − 2 ≤ 0

Lagrange function, KT conditions, check whether (0, 0), (4, 2), (1, −1) are minima. → blackboard
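A hedged numeric sketch (Python/NumPy; the helper names are inventions of this example) of that check: determine which constraints are active at the candidate point, then look for multipliers αi ≥ 0 that make the gradient of the Lagrangian vanish — the same steps the blackboard derivation performs by hand.

```python
import numpy as np

grad_f = lambda x: np.array([2 * x[0], 2 * x[1]])                  # f(x) = x1^2 + x2^2
g      = [lambda x: -x[0] + x[1] + 2, lambda x: -2 * x[0] - x[1] - 2]
grad_g = [np.array([-1.0, 1.0]), np.array([-2.0, -1.0])]           # gradients of g1, g2

def satisfies_kkt(x, tol=1e-9):
    x = np.asarray(x, dtype=float)
    if any(gi(x) > tol for gi in g):                 # feasibility: g_i(x) <= 0
        return False
    active = [i for i in range(len(g)) if abs(g[i](x)) <= tol]
    if not active:                                   # no active constraint: need grad f = 0
        return bool(np.allclose(grad_f(x), 0.0))
    # look for alpha with grad f(x) + sum_i alpha_i grad g_i(x) = 0 (least squares)
    A = np.column_stack([grad_g[i] for i in active])
    alpha = np.linalg.lstsq(A, -grad_f(x), rcond=None)[0]
    stationary = np.allclose(A @ alpha, -grad_f(x))
    return bool(stationary and np.all(alpha >= -tol))  # multipliers must be >= 0

for point in [(0, 0), (4, 2), (1, -1)]:
    print(point, satisfies_kkt(point))               # expected: False, False, True
```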


Ways to find the minimum

brute force approach

• equality constraints are always active

• inequality constraints may be active or inactive

• at the minimum ~x∗, a certain subset A of inequality constraints is active

• if we knew A and f looks “nice”, we could find ~x∗ analytically

• main idea: check all possible combinations of active constraints for the minimum

example → blackboard (a small numeric sketch follows below)

can be applied only for linear and quadratic target functions

number of possible combinations of active constraints grows exponentially
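A small sketch of this brute-force idea (Python/NumPy), reusing the quadratic example from the Kuhn-Tucker slide above: for every subset of inequality constraints assumed active, solve the resulting equality-constrained quadratic problem analytically (a small linear system) and keep the best candidate that is feasible for all constraints. For this convex example, that candidate is the minimum.

```python
import numpy as np
from itertools import combinations

# example from the Kuhn-Tucker slide: f(x) = x1^2 + x2^2,
# g1(x) = -x1 + x2 + 2 <= 0,  g2(x) = -2*x1 - x2 - 2 <= 0
G = np.array([[-1.0, 1.0], [-2.0, -1.0]])      # rows: gradients of g1, g2
c = np.array([2.0, -2.0])                      # g_i(x) = G[i] @ x + c[i]
f = lambda x: float(x @ x)

best, n = None, 2
for k in range(len(G) + 1):
    for active in combinations(range(len(G)), k):
        # minimize f subject to the chosen constraints holding with equality:
        # stationarity 2x + A^T alpha = 0 together with A x = -c_A
        m = len(active)
        K = np.zeros((n + m, n + m))
        K[:n, :n] = 2.0 * np.eye(n)
        rhs = np.zeros(n + m)
        if m:
            A = G[list(active)]
            K[:n, n:], K[n:, :n] = A.T, A
            rhs[n:] = -c[list(active)]
        x = np.linalg.solve(K, rhs)[:n]
        if np.all(G @ x + c <= 1e-9):          # keep only candidates feasible everywhere
            if best is None or f(x) < best[1]:
                best = (x, f(x), active)

print(best)    # expected: x = (1, -1), f = 2, active constraint set = (0,)
```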


Ways to find the minimum (cont.)

active set methods: incrementally build the set of active constraints

start with a feasible point ~x and assume A = ∅

perform a line search in the steepest descending direction that is consistent with the constraints in A

if a constraint ∉ A is violated, it is added to A

if no step is possible, Lagrange multipliers are calculated to check whether we found the minimum.

if necessary, constraints are removed from A

sketch of algorithm → next slide


Ways to find the minimum (cont.)

1: repeat
2:   calculate direction of maximal descent
3:   project direction of maximal descent onto boundaries of constraints in A (yields search direction ~v)
4:   if ~v ≠ ~0 then
5:     calculate minimum on ray from ~x into direction ~v (yields step length τ)
6:     check whether constraints are violated when the step is performed. If yes, shorten τ and add the restricting constraint to A
7:     perform step from ~x to ~x + τ~v
8:   else
9:     calculate Lagrange multipliers
10:    if Lagrange multipliers are all non-negative then
11:      minimum found, stop.
12:    else
13:      remove a constraint with negative Lagrange multiplier from A
14:    end if
15:  end if
16: until minimum found
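Not the active-set routine itself, but an easy way to cross-check its result: general-purpose solvers handle exactly this kind of problem. A minimal sketch with scipy.optimize.minimize (SLSQP) on the Kuhn-Tucker example used earlier (using SciPy is an assumption of this sketch, not something the slides prescribe; note SciPy expects inequality constraints in the form c(x) ≥ 0, so the signs are flipped):

```python
import numpy as np
from scipy.optimize import minimize

# example from the Kuhn-Tucker slide: min x1^2 + x2^2
# subject to -x1 + x2 + 2 <= 0 and -2*x1 - x2 - 2 <= 0
f = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {"type": "ineq", "fun": lambda x: -(-x[0] + x[1] + 2)},      # rewritten as >= 0
    {"type": "ineq", "fun": lambda x: -(-2 * x[0] - x[1] - 2)},  # rewritten as >= 0
]
res = minimize(f, x0=np.array([4.0, -1.0]), method="SLSQP", constraints=constraints)
print(res.x, res.fun)    # should be close to (1, -1) with f = 2
```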


Duality

a mathematical concept to solve a task different from the original task. By doing so, the original task is implicitly solved.

the original task is called the primal problem, the other problem is called the dual problem

primal task: (reminder)

minimize over ~x:   f(~x)
subject to   gj(~x) ≤ 0,   j = 1, . . . , k
             hi(~x) = 0,   i = 1, . . . , m

the Lagrange function of the primal: (reminder)

L(~x, ~α, ~β) = f(~x) + Σ_{j=1..k} ( αj · gj(~x) ) + Σ_{i=1..m} ( βi · hi(~x) )


Duality (cont.)

the function Q:

Q(~α, ~β) = inf_{~x ∈ Rn} L(~x, ~α, ~β)

remarks:

• the infimum is the largest lower bound of L; if no lower bound exists, we say the infimum is −∞

• the infimum is taken over all points ~x, no matter whether these points are feasible or infeasible w.r.t. the primal.

dual task:

maximize over ~α, ~β:   Q(~α, ~β)
subject to   αi ≥ 0   for all i = 1, . . . , k
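A small numeric illustration of Q and of the dual task (Python/SciPy, again on the Kuhn-Tucker example used earlier; approximating the infimum by an unconstrained numerical minimization is a choice of this sketch). For each feasible ~α the value Q(~α) stays below the primal value f(~x′) of any feasible ~x′ — the lemma on the next slide.

```python
import numpy as np
from scipy.optimize import minimize

# Kuhn-Tucker example: f(x) = x1^2 + x2^2, inequality constraints
# g1(x) = -x1 + x2 + 2 <= 0, g2(x) = -2*x1 - x2 - 2 <= 0, no equality constraints
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: np.array([-x[0] + x[1] + 2, -2 * x[0] - x[1] - 2])
L = lambda x, a: f(x) + a @ g(x)                 # Lagrange function

def Q(alpha):
    """Dual function: infimum of L over all x, approximated numerically."""
    return minimize(lambda x: L(x, alpha), x0=np.zeros(2)).fun

x_prime = np.array([4.0, -1.0])                  # a feasible point of the primal
for alpha in [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 3.0])]:
    print(alpha, round(Q(alpha), 3), "<=", f(x_prime))   # weak duality holds
```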


Duality(cont.)

Lemma:

If ~x′ is feasible point of the primal task and (~α′, ~β′) is feasible point of thedual task, then:

Q(~α′, ~β′) ≤ f(~x′)

Proof:

Q(~α′, ~β′) ≤ L(~x′, ~α′, ~β′)

= f(~x′) +k∑

i=1

( α′i

︸︷︷︸

≥0

· gi(~x′)

︸ ︷︷ ︸

≤0

) +m∑

j=1

(β′j · hj(~x

′)︸ ︷︷ ︸

=0

)

≤ f(~x′)

Introduction to Neuroinformatics: Support Vector Machines – p. 44

Duality (cont.)

Corollary:
If ~x′ is a feasible point of the primal task, (~α′, ~β′) is a feasible point of the dual task, and f(~x′) = Q(~α′, ~β′), then ~x′ is a solution of the primal problem and (~α′, ~β′) is a solution of the dual problem.

Proof:
For any feasible ~x:        f(~x) ≥ Q(~α′, ~β′) = f(~x′)
For any feasible (~α, ~β):  Q(~α, ~β) ≤ f(~x′) = Q(~α′, ~β′)

Potential problem: points ~x′ and (~α′, ~β′) with f(~x′) = Q(~α′, ~β′) do not necessarily exist for general problems ⇒ restrictions on f and on the type of constraints are needed.

Duality (cont.)

Lemma:
If f is convex and differentiable, g_j and h_i are linear, and ~x′ is a solution of the primal problem, then a solution (~α′, ~β′) of the dual exists and

    f(~x′) = Q(~α′, ~β′)

Proof:

    ~x′ solution of the primal ⇒ there exist (~α′, ~β′) that meet the KT conditions

    f convex, g_j, h_i linear ⇒ L(~x, ~α′, ~β′) is convex in ~x

    from the KT conditions: ∂L/∂x_i (~x′, ~α′, ~β′) = 0; since L(·, ~α′, ~β′) is convex, ~x′ minimizes it, so L(~x′, ~α′, ~β′) = Q(~α′, ~β′)

    Q(~α′, ~β′) = L(~x′, ~α′, ~β′) = f(~x′) (the last equality uses the complementarity conditions α′_j g_j(~x′) = 0 and h_i(~x′) = 0) ⇒ (~α′, ~β′) is a solution of the dual

the proof shows: the KT conditions are the link between primal solution and dual solution

solutions do not need to be unique

Duality (cont.)

main message from duality: under certain circumstances,

• you can solve the primal and get the solution of the dual for free
• you can solve the dual and get the solution of the primal for free
• you can use the Kuhn-Tucker conditions to transform the primal solution into a dual solution and vice versa
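a small numeric illustration (an addition, not part of the original slides): for the toy problem minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, the dual function is Q(α) = inf_x [x² + α(1 − x)] = α − α²/4; maximizing it over α ≥ 0 gives α′ = 2 and Q(α′) = 1 = f(x′) with x′ = 1, so the primal and dual optima coincide. The sketch below checks this with a generic scalar optimizer; the toy problem and the search interval (0, 10) are assumptions chosen only for the demonstration.

from scipy.optimize import minimize_scalar

def f(x):
    return x ** 2

def Q(alpha):
    # infimum of the Lagrangian x^2 + alpha*(1 - x) over x, attained at x = alpha/2
    x_star = alpha / 2.0
    return f(x_star) + alpha * (1.0 - x_star)

# maximize Q over alpha >= 0 by minimizing -Q on a bounded interval
res = minimize_scalar(lambda a: -Q(a), bounds=(0.0, 10.0), method="bounded")
print("alpha' ~", round(res.x, 4))        # ~ 2.0
print("Q(alpha') ~", round(Q(res.x), 4))  # ~ 1.0
print("f(x') =", f(1.0))                  # 1.0, the primal optimum at x' = 1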

Summary (optimization theory)

general ideas about optimization (local and global minima)

convex problems: each local minimum is a global minimum

general characterization of minima (feasible directions, descending directions, Kuhn-Tucker conditions)

algorithms to find the minimum (brute force, active set)

duality

literature: Roger Fletcher, Practical Methods of Optimization, Wiley 1991


Applying Results of Optimization Theory to Support Vector Machines

Support vector machines

final mathematical form of the minimization task (reminder):

    minimize_{~w,b}  (1/2)||~w||²
    subject to       d^(i)(⟨~x^(i), ~w⟩ + b) ≥ 1   for i = 1, …, p

the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as a support vector machine (SVM)

Support vector machines (cont.)

transforming into the appropriate form:

    minimize_{~w,b}  (1/2)||~w||²
    subject to       1 − d^(i)(⟨~x^(i), ~w⟩ + b) ≤ 0   for i = 1, …, p

note:

• the target function is convex
• the constraints are linear
• hence, local minima are also global minima (if feasible points exist)

Support vector machines (cont.)

Lagrange function:

    L(~w, b, ~α) = (1/2)||~w||² + ∑_{i=1}^{p} α_i (1 − d^(i)(⟨~x^(i), ~w⟩ + b))

                 = (1/2)||~w||² + ∑_{i=1}^{p} α_i − ∑_{i=1}^{p} α_i d^(i) ⟨~x^(i), ~w⟩ − b ∑_{i=1}^{p} α_i d^(i)

derivatives:

    ∂L/∂w_j = w_j − ∑_{i=1}^{p} α_i d^(i) x_j^(i)

    ∂L/∂b = − ∑_{i=1}^{p} α_i d^(i)

Support vector machines (cont.)

calculating Q(~α) = inf_{~w,b} L(~w, b, ~α)

L is convex. Hence, if we can find a point that zeros the partial derivatives, it is the minimum.

zeroing the derivatives and resolving w.r.t. ~w and b:

    w_j = ∑_{i=1}^{p} α_i d^(i) x_j^(i)        i.e.   ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)

    0 = ∑_{i=1}^{p} α_i d^(i)

strange equation, b is lost, what does it mean?

Support vector machines (cont.)

    L(~w, b, ~α) = (1/2)||~w||² + ∑_{i=1}^{p} α_i − ∑_{i=1}^{p} α_i d^(i) ⟨~x^(i), ~w⟩ − b ∑_{i=1}^{p} α_i d^(i)

    ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)        0 = ∑_{i=1}^{p} α_i d^(i)

first case: ∑_{i=1}^{p} α_i d^(i) = 0
In this case, we can find ~w that zeros the partial derivatives for any value of b; it is the minimum (= infimum). Substituting ~w by ∑_{i=1}^{p} α_i d^(i) ~x^(i) yields:

    Q(~α) = −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ + ∑_{i=1}^{p} α_i

second case: ∑_{i=1}^{p} α_i d^(i) ≠ 0
No matter how ~w is chosen, we can make L arbitrarily small by varying b. Hence,

    Q(~α) = −∞

Support vector machines (cont.)

the dual problem:

    maximize_{~α}   Q(~α)

    where   Q(~α) = −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ + ∑_{i=1}^{p} α_i    if ∑_{i=1}^{p} α_i d^(i) = 0
            Q(~α) = −∞                                                                                        otherwise

    subject to   α_i ≥ 0 for all i = 1, …, p

Support vector machines (cont.)

since the second case of the case distinction can never attain the maximum, we can rewrite the dual:

    maximize_{~α}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ + ∑_{i=1}^{p} α_i

    subject to     ∑_{i=1}^{p} α_i d^(i) = 0
                   α_i ≥ 0 for all i = 1, …, p

note: the dual is a problem in ~α alone; ~w and b do not occur

for now, there is no advantage in solving the dual instead of the primal problem. Later on, we will see the advantage of the dual.

for both the dual and the primal problem, powerful algorithmic solution methods exist (e.g. active set methods)
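illustration (an addition, not part of the original slides): the rewritten dual is an ordinary quadratic program and can be handed to a generic solver. The sketch below uses scipy's SLSQP on the small data set of the blackboard example further below; the data and the solver choice are assumptions made only for this demonstration.

import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [-1., 1.], [1., 0.], [2., 1.]])  # patterns ~x^(i)
d = np.array([-1., -1., -1., 1., 1.])                              # labels d^(i)
p = len(d)

G = (d[:, None] * d[None, :]) * (X @ X.T)     # G_ij = d^(i) d^(j) <x^(i), x^(j)>

def neg_dual(alpha):
    # negative dual objective: we minimize -Q(alpha) to maximize Q(alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(p),
               bounds=[(0.0, None)] * p,                             # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ d}], # sum_i alpha_i d^(i) = 0
               method="SLSQP")
alpha = res.x
w = (alpha * d) @ X                           # w = sum_i alpha_i d^(i) x^(i)
s = int(np.argmax(alpha))                     # index of one support vector (alpha_s > 0)
b = 1.0 / d[s] - X[s] @ w
print(np.round(alpha, 3), np.round(w, 3), round(b, 3))  # expected roughly w = (2, 0), b = -1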

Support vector machines (cont.)

back to the primal; KT conditions:

    1 − d^(i) ⟨~x^(i), ~w⟩ − d^(i) b ≤ 0                 for all i = 1, …, p
    α_i · (1 − d^(i) ⟨~x^(i), ~w⟩ − d^(i) b) = 0         for all i = 1, …, p
    ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)
    0 = ∑_{i=1}^{p} α_i d^(i)
    α_i ≥ 0                                              for all i = 1, …, p

what are ~w and b if the α_i are already known?

Support vector machines (cont.)

resolving the KT conditions w.r.t. ~w and b:

• ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)

• for i with α_i ≠ 0 we get from the complementarity condition:

      1 − d^(i) ⟨~x^(i), ~w⟩ − d^(i) b = 0

  Hence:

      b = 1/d^(i) − ⟨~x^(i), ~w⟩ = 1/d^(i) − ∑_{j=1}^{p} α_j d^(j) ⟨~x^(i), ~x^(j)⟩

the margin ρ:

    ρ = 1/||~w|| = 1 / √( ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ )
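a minimal sketch (an addition, not from the original slides) of how ~w, b and ρ follow from known α_i, assuming X, d and alpha are numpy arrays as in the dual-solver sketch above:

import numpy as np

def primal_from_dual(X, d, alpha, tol=1e-8):
    w = (alpha * d) @ X                  # w = sum_i alpha_i d^(i) x^(i)
    sv = np.flatnonzero(alpha > tol)     # indices of the support vectors
    i = sv[0]                            # any support vector yields the same b
    b = 1.0 / d[i] - X[i] @ w            # from the complementarity condition
    rho = 1.0 / np.linalg.norm(w)        # margin rho = 1 / ||w||
    return w, b, rho, sv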

Support vector machines (cont.)

some constraints become active (α_i > 0), some inactive (α_i = 0)

each constraint refers to one pattern

a constraint is active if the respective pattern is located next to the separating hyperplane

patterns that refer to active constraints are called support vectors

the solution only depends on the support vectors

[figure: optimal separating hyperplane with margin band; the support vectors (sv) are the positive and negative patterns lying on the margin boundaries]

Lemma:
Removing non-support vectors from a classification task does not change the solution found by a support vector machine.

Support vector machines (cont.)

example (blackboard):

    i    x_1^(i)   x_2^(i)   d^(i)
    1       0         0       −1
    2       0         1       −1
    3      −1         1       −1
    4       1         0       +1
    5       2         1       +1

solution:

    w_1 = 2,  w_2 = 0,  b = −1

support vectors: ~x^(1), ~x^(2), ~x^(4)
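a quick cross-check (an addition, not from the original slides): scikit-learn's SVC with a very large C approximates the hard margin case and should reproduce this solution; the value C = 1e6 is an illustrative assumption.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [-1, 1], [1, 0], [2, 1]], dtype=float)
d = np.array([-1, -1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)
print(clf.coef_, clf.intercept_)   # expected: approximately [[2. 0.]] and [-1.]
print(clf.support_)                # expected support vector indices: [0 1 3]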

Fault tolerant SVMs

Soft margin SVM

the SVM discussed up to here (the so-called "hard margin case") requires perfect classification

what about data which are not linearly separable?

soft margin SVM

opposing aspects: maximizing the margin vs. minimizing the errors

[figure: separating hyperplane with margin band; some positive and negative patterns lie inside the band or on the wrong side, each contributing an additional error]

Soft margin SVM (cont.)

mathematical model of errors: slack variables ξ_i. ξ_i models the error that is made for the training pattern ~x^(i)

extended constraints:

    d^(i) · (⟨~w, ~x^(i)⟩ + b) ≥ 1 − ξ_i
    ξ_i ≥ 0

extended optimization target:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²

C > 0 is a fixed constant to balance margin and errors

Soft margin SVM (cont.)

primal of soft margin classification:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²
    subject to          d^(i)(⟨~x^(i), ~w⟩ + b) ≥ 1 − ξ_i   for i = 1, …, p
                        ξ_i ≥ 0                              for i = 1, …, p

Lagrange function:

    L(~w, b, ~ξ, ~α, ~β) = (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²
                           + ∑_{i=1}^{p} α_i (1 − ξ_i − d^(i)(⟨~x^(i), ~w⟩ + b))
                           + ∑_{i=1}^{p} β_i · (−ξ_i)

primal variables: ~w, b, ~ξ. Lagrange multipliers: ~α, ~β

Soft margin SVM (cont.)

KT conditions:

    1 − ξ_i − d^(i)(⟨~x^(i), ~w⟩ + b) ≤ 0                    for i = 1, …, p
    −ξ_i ≤ 0                                                 for i = 1, …, p
    α_i · (1 − ξ_i − d^(i)(⟨~x^(i), ~w⟩ + b)) = 0            for i = 1, …, p
    β_i · (−ξ_i) = 0                                         for i = 1, …, p
    ∂L/∂w_j = w_j − ∑_{i=1}^{p} α_i d^(i) x_j^(i) = 0        for all components j
    ∂L/∂b = − ∑_{i=1}^{p} α_i d^(i) = 0
    ∂L/∂ξ_i = 2Cξ_i − α_i − β_i = 0                          for i = 1, …, p
    α_i ≥ 0                                                  for i = 1, …, p
    β_i ≥ 0                                                  for i = 1, …, p

Soft margin SVM (cont.)

deriving the solution from the KT conditions. We get ~w and ξ_i directly:

    ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)

    ξ_i = (α_i + β_i) / (2C)

to derive b, we exploit the complementarity condition for a pattern with non-zero α_i:

    b = (1 − ξ_i − d^(i) ⟨~x^(i), ~w⟩) / d^(i)
      = (1 − (α_i + β_i)/(2C)) / d^(i) − ∑_{j=1}^{p} α_j d^(j) ⟨~x^(i), ~x^(j)⟩

Soft margin SVM (cont.)

which patterns become support vectors?

• patterns on the boundary of the margin band
• misclassified patterns and patterns within the margin band

Soft margin SVM (cont.)

calculating the function Q(~α, ~β) = inf_{~w,b,~ξ} L(~w, b, ~ξ, ~α, ~β)

first case: ∑_{i=1}^{p} α_i d^(i) = 0
Then the minimum exists, with ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i), ξ_i = (α_i + β_i)/(2C), and b arbitrary:

    Q(~α, ~β) = −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ − (1/2) ∑_{i=1}^{p} (α_i + β_i)²/(2C) + ∑_{i=1}^{p} α_i

second case: ∑_{i=1}^{p} α_i d^(i) ≠ 0
Here L is unbounded below, hence Q(~α, ~β) = −∞

observation: the second case does not contribute to the maximum of Q

Soft margin SVM (cont.)

the dual:

    maximize_{~α,~β}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ − (1/2) ∑_{i=1}^{p} (α_i + β_i)²/(2C) + ∑_{i=1}^{p} α_i

    subject to        ∑_{i=1}^{p} α_i d^(i) = 0
                      α_i ≥ 0 for all i = 1, …, p
                      β_i ≥ 0 for all i = 1, …, p

Soft margin SVM (cont.)

there is a variant of the soft margin SVM that uses the following optimization target:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i

instead of:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²

advantage: the calculation of the solution simplifies at a certain point and the solution is more robust w.r.t. outliers.

C controls the balance between errors and margin width:

• a large C prefers small errors
• a small C prefers a large margin

an adequate value of C has to be optimized experimentally (see the sketch below)

→ demo in practical exercises
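illustration (an addition, not from the original slides): a common way to optimize C experimentally is a cross-validated grid search; the synthetic data set and the parameter grid below are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic two-class data set, only for demonstration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# try several values of C with 5-fold cross-validation
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100, 1000]}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"], "cv accuracy:", round(search.best_score_, 3))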

Non-linear SVMs

linear discrimination is not appropriate in all cases; we need a non-linear variant of SVMs

switching from linear constraints to non-linear constraints causes a lot of numerical problems:

• losing convexity
• KT conditions are no longer sufficient
• optimization algorithms with low performance

Non-linear SVMs (cont.)

classical idea: use non-linear features

a (non-linear) feature mapping Φ maps the original input vectors to feature vectors

we have to replace ~x in all formulae by Φ(~x)

the dual of the hard margin case becomes:

    maximize_{~α}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨Φ(~x^(i)), Φ(~x^(j))⟩ + ∑_{i=1}^{p} α_i

    subject to     ∑_{i=1}^{p} α_i d^(i) = 0
                   α_i ≥ 0 for all i = 1, …, p

Kernel trick

defining a function K_Φ as K_Φ(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩, we get:

    maximize_{~α}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) K_Φ(~x^(i), ~x^(j)) + ∑_{i=1}^{p} α_i

    subject to     ∑_{i=1}^{p} α_i d^(i) = 0
                   α_i ≥ 0 for all i = 1, …, p

now, the patterns only occur in the dual as arguments of the function K_Φ

the function K_Φ is called a kernel function

Kernel trick (cont.)

thought experiment: if we did not know Φ but had access to the kernel K_Φ, could we solve the dual optimization problem?
Yes!

can we calculate the solving weight vector ~w, the bias b and the margin ρ? (see the formulas derived from the KT conditions above)

weight vector: no, bias: yes, margin: yes

    b = 1/d^(s) − ∑_{j=1}^{p} α_j d^(j) K_Φ(~x^(s), ~x^(j))        for a support vector s

    ρ = 1 / √( ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) K_Φ(~x^(i), ~x^(j)) )

can we apply the learned classification to a new input pattern ~x^(new)? Yes! (see next slide)

Kernel trick (cont.)

    ~w = ∑_{i=1}^{p} α_i d^(i) Φ(~x^(i))

applying ~w and b to a new pattern:

    ⟨~w, Φ(~x^(new))⟩ + b = ⟨∑_{i=1}^{p} α_i d^(i) Φ(~x^(i)), Φ(~x^(new))⟩ + b
                          = ∑_{i=1}^{p} α_i d^(i) K_Φ(~x^(i), ~x^(new)) + b
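illustration (an addition, not from the original slides): training and prediction done purely through a kernel function, never forming Φ explicitly. The polynomial kernel and the one-dimensional data are those of the example on the next slides; the SLSQP solver choice is an assumption.

import numpy as np
from scipy.optimize import minimize

def kernel(x, y):
    return (np.dot(x, y) + 1.0) ** 2              # polynomial kernel (<x, y> + 1)^2

X = np.array([[0.0], [-1.0], [2.0]])              # one-dimensional example patterns
d = np.array([-1.0, 1.0, 1.0])
p = len(d)

K = np.array([[kernel(X[i], X[j]) for j in range(p)] for i in range(p)])
G = (d[:, None] * d[None, :]) * K

res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(), np.zeros(p),
               bounds=[(0.0, None)] * p,
               constraints=[{"type": "eq", "fun": lambda a: a @ d}],
               method="SLSQP")
alpha = res.x
s = int(np.argmax(alpha))                         # one support vector
b = 1.0 / d[s] - sum(alpha[j] * d[j] * kernel(X[s], X[j]) for j in range(p))

def predict(x_new):
    # decision value: sum_i alpha_i d^(i) K(x^(i), x_new) + b
    return sum(alpha[i] * d[i] * kernel(X[i], x_new) for i in range(p)) + b

print(np.round(alpha, 3), round(b, 3))            # expected roughly (0.75, 0.667, 0.083) and -1
print(round(predict(np.array([3.0])), 3))         # positive value, i.e. class +1 side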

Kernel trick (cont.)

quintessence:

• replacing the dot product by a kernel function is possible
• an explicit representation of ~w is not possible if we do not have access to Φ
• nonetheless, predictions can be made
• we need to memorize the support vectors and the non-zero Lagrange multipliers α_i

the technique of implicitly calculating a separating hyperplane in feature space by replacing the dot product with a kernel function is called the kernel trick

[figure: Φ maps the input space to the feature space, where an optimal linear classification is computed; the kernel trick performs this implicitly, and Φ⁻¹ carries the decision boundary back to the input space]

Kernel trick (cont.)

example (one-dimensional input space):

D = {(0; −1), (−1; +1), (2; +1)} (pairs of input value and class label) is not linearly separable in input space

feature mapping:   Φ(x) = (x², √2·x, 1)ᵀ

kernel function:

    K_Φ(x, y) = ⟨(x², √2·x, 1)ᵀ, (y², √2·y, 1)ᵀ⟩ = (xy)² + 2(xy) + 1 = (xy + 1)²

solution: α_1 = 3/4, α_2 = 2/3, α_3 = 1/12

    b = −1,   ~w = (1, −(1/2)√2, 0)ᵀ   (in feature space!)
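a quick numeric check (an addition, not from the original slides) that the explicit feature map and the kernel agree, i.e. ⟨Φ(x), Φ(y)⟩ = (xy + 1)²; the test points are arbitrary.

import numpy as np

def phi(x):
    return np.array([x ** 2, np.sqrt(2.0) * x, 1.0])

def k(x, y):
    return (x * y + 1.0) ** 2

for x, y in [(0.0, -1.0), (-1.0, 2.0), (2.0, 0.5)]:
    assert np.isclose(phi(x) @ phi(y), k(x, y))
print("feature-space dot product and kernel agree on the test points")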

Kernel trick (cont.)

example, illustration:

[figure: the three training points on the input-space axis and their images under Φ in the (f1, f2) feature space, where they become linearly separable]

Which points in input space are on the decision boundary?
Look for z with

    ∑_{i=1}^{p} α_i d^(i) K_Φ(~x^(i), z) + b = 0

This yields: z = (1 ± √5)/2
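numeric check (an addition, not from the original slides): with the α_i and b from above, the kernel expansion reduces to f(z) = z² − z − 1, whose roots are exactly z = (1 ± √5)/2.

import numpy as np

alpha = np.array([3/4, 2/3, 1/12])
d = np.array([-1.0, 1.0, 1.0])
x = np.array([0.0, -1.0, 2.0])
b = -1.0

def f(z):
    # kernel expansion with K(x, y) = (xy + 1)^2
    return sum(alpha[i] * d[i] * (x[i] * z + 1.0) ** 2 for i in range(3)) + b

roots = [(1 + np.sqrt(5)) / 2, (1 - np.sqrt(5)) / 2]
print([round(f(z), 10) for z in roots])   # both values should be numerically zero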

Kernel trick (cont.)

kernels can be designed individually for an application:

• look for meaningful features
• create Φ
• derive K_Φ

in these cases, using the kernel instead is not better than an explicit calculation in feature space

there are classes of generic kernels:

• they can be used for many tasks
• often, Φ is very complicated and the feature space is very high dimensional
• for some generic kernels, the feature space has infinite dimension and Φ cannot be calculated explicitly
• typically, using the kernel instead of an explicit calculation in feature space is computationally more efficient

Generic kernels

example of a kernel (one-dimensional input space):

    Φ(x) = (x², √2·x, 1)ᵀ

mapping from input space to feature space needs: 2 multiplications
calculating the dot product in feature space: 3 multiplications, 2 additions
evaluating ⟨Φ(x), Φ(y)⟩: 7 multiplications, 2 additions

corresponding kernel: K_Φ(x, y) = (xy + 1)²
kernel evaluation needs: 2 multiplications, 1 addition

Generic kernels (cont.)

a class of generic kernels: polynomial kernels
polynomial kernels are functions of the form:

    K_Φ(~x, ~y) = (⟨~x, ~y⟩)^d     or     K_Φ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d

d ∈ N is a kernel parameter that needs to be chosen manually

for polynomial kernels, Φ can be calculated explicitly. It contains all monomials of degree d (first case) or all monomials of degree at most d (second case).
But: the feature space becomes very large, even for small d; e.g. for the first variant the number of features is (a short counting sketch follows below):

    n      d     number of features
    3      2                  6
    5      3                 35
    10     3                220
    10     5              2 002
    10    10             92 378
    50     4            292 825
    256    4        183 181 376
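counting sketch (an addition, not from the original slides): the number of monomials of degree d in n input variables is C(n + d − 1, d), which reproduces the table above.

from math import comb

for n, d in [(3, 2), (5, 3), (10, 3), (10, 5), (10, 10), (50, 4), (256, 4)]:
    print(n, d, comb(n + d - 1, d))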

Generic kernels (cont.)

what makes a symmetric function K a kernel function?

• a mapping Φ into some space with a dot product (a Hilbert space) exists such that K(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩

• K meets Mercer's theorem:
  A continuous, symmetric function K : [a, b]ⁿ × [a, b]ⁿ → R is a kernel if and only if for all functions g : [a, b]ⁿ → R with ∫_{[a,b]ⁿ} g(~x)² d~x < ∞ it holds that:

      ∫_{[a,b]ⁿ} ∫_{[a,b]ⁿ} K(~x, ~y) g(~x) g(~y) d~x d~y ≥ 0

  Proof: omitted

consequences: sums of kernels and positive multiples of kernels are also kernels
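a practical check (an addition, not from the original slides): on any finite point set, the Gram matrix of a kernel must be symmetric positive semi-definite, a necessary condition consistent with Mercer's characterization. The polynomial kernel and the random sample below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 random points in R^3

K = (X @ X.T + 1.0) ** 2                      # Gram matrix of (<x, y> + 1)^2
eigvals = np.linalg.eigvalsh(K)               # eigenvalues of the symmetric matrix
print("smallest eigenvalue:", eigvals.min())  # should be >= 0 up to numerical error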

Generic kernels (cont.)

RBF kernels (Gaussian kernels):

    K_Φ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))

σ² > 0 is a kernel parameter, chosen manually

the feature space has infinite dimension

local information processing, circular decision boundaries in input space

what is calculated in an SVM with RBF kernel resembles an RBF network, but training is done in a different way

the kernel parameter controls the complexity of the kernel: if σ² is large, the behavior is similar to the dot product; if σ² is very small, decision boundaries in input space may become very complex (degeneration as σ² → 0)

most popular kernel type in practice
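illustration (an addition, not from the original slides): the RBF kernel and the effect of σ². With a large σ² even distant points still look similar (dot-product-like behavior); with a very small σ² the Gram matrix becomes nearly diagonal, which tends to produce very complex decision boundaries. The values of σ² below are illustrative.

import numpy as np

def rbf_kernel(x, y, sigma2):
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma2))

x, y = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for sigma2 in [0.01, 1.0, 100.0]:
    print(sigma2, round(rbf_kernel(x, y, sigma2), 6))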

Generic kernels (cont.)

sigmoid kernels:

    K_Φ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))

γ > 0 and θ are kernel parameters, chosen manually

the feature space has infinite dimension

what is calculated in an SVM with sigmoid kernel resembles an MLP network with one hidden layer

no practical use due to numerical degeneration of the optimization problem

Generic kernels: summary

dot-product kernel: K_Φ(~x, ~y) = ⟨~x, ~y⟩
input space = feature space; simple, but useful

polynomial kernels: K_Φ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d, K_Φ(~x, ~y) = (⟨~x, ~y⟩)^d
extensions of the dot-product kernel, feature spaces with finite dimension

RBF kernels: K_Φ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))
very important in practice, very flexible

sigmoid kernels: K_Φ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))
limited practical use

problem-dependent kernels

→ demo in practical exercises

Summary (SVMs)

linear classification with an optimal separating hyperplane

mathematical problem description as a convex quadratic program

numerical solving by various approaches, often based on active set methods: libSVM, SMO, ...

representation of the solution in terms of support vectors

soft margin case: classification with errors

kernel trick: implicit calculations in feature space, non-linear SVM

extensions:

• SVM for regression
• SVM for one-class classification

important ideas:

• large margin techniques
• kernel-based techniques

Further readings

Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000

C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998. Available at: http://www.kernel-machines.org

Roger Fletcher, Practical Methods of Optimization, Wiley, 1987