
Introduction to Neuroinformatics: Support Vector Machines

Prof. Dr. Martin Riedmiller

University of Osnabrück

Institute of Computer Science and Institute of Cognitive Science


Outline

support vector machines (SVM) for classification: basic ideas

optimization under constraints: Lagrange multipliers, Kuhn-Tucker conditions, dual problem, algorithmic approaches

applying optimization theory to SVM

kernel trick: making linear classifiers non-linear

variants of SVMs


Linear classification

perceptron revisited: find a hyperplane that classifies a given set of positive and negative training patterns correctly, i.e. find (~w, w0) so that:

〈~w, ~x〉 + w0 > 0 for all positive training patterns ~x

〈~w, ~x〉 + w0 < 0 for all negative training patterns ~x

perceptron learning finds a solving hyperplane (if possible) by adding and subtracting patterns to the weight vector

if there are several solving hyperplanes, perceptron learning finds an arbitrary one
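As a concrete illustration, here is a minimal sketch of the perceptron rule just described, in Python with a small made-up dataset (the dataset and function names are assumptions of this sketch, not taken from the slides): misclassified patterns are added to or subtracted from the weight vector until every pattern is classified correctly.

```python
import numpy as np

def perceptron(X, d, max_epochs=1000):
    """Perceptron learning: returns (w, w0) with d_i * (<w, x_i> + w0) > 0
    for all patterns, provided the data are linearly separable."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, d):                     # t is +1 or -1
            if t * (np.dot(w, x) + w0) <= 0:       # misclassified (or on the hyperplane)
                w, w0 = w + t * x, w0 + t          # add / subtract the pattern
                errors += 1
        if errors == 0:                            # all patterns classified correctly
            return w, w0
    raise RuntimeError("no separating hyperplane found within max_epochs")

# toy data: two separable point clouds (assumed example)
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
d = np.array([1, 1, -1, -1])
w, w0 = perceptron(X, d)
print(w, w0, d * (X @ w + w0) > 0)                 # all True once separated
```

Which of the many possible separating hyperplanes it returns depends on the initialization and the order of the patterns — exactly the arbitrariness the maximum-margin idea below removes.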


Linear classification (cont.)

Which is the best separating hyperplane?

[figure: positive (+) and negative (−) training patterns with several candidate separating hyperplanes]

the performance on the training set is equally good, but what do you expect for an independent test set?


Linear classification (cont.)

for each separating hyperplane we can define the margin: the margin is the smallest distance of a training pattern to the separating hyperplane.


optimality: the larger the margin, the smaller the risk of misclassification.

[figure: positive and negative training patterns with a separating hyperplane and its margin]


Support vector machines

support vector machines (SVM) (Vapnik, 1970s): find a separating hyperplane that maximizes the margin.

notation:

• training set D = {(~x(1), d(1)), . . . , (~x(p), d(p))}, ~x(i) ∈ Rn, d(i) ∈ {−1, +1}

• separating hyperplane defined by weight vector ~w and bias weight b

(= w0, to be in line with SVM literature)

• margin ρ

distance of a point ~x to a hyperplane (~w, b):

| ⟨~x, ~w⟩ + b | / ||~w||
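A small numeric illustration (Python/NumPy, with a made-up hyperplane and patterns; the values are assumptions of this sketch) of this distance formula and of the margin ρ as the smallest such distance over the training set:

```python
import numpy as np

def distance(x, w, b):
    """Distance of point x to the hyperplane <x, w> + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def margin(X, w, b):
    """Margin rho: smallest distance of any training pattern to the hyperplane."""
    return min(distance(x, w, b) for x in X)

w, b = np.array([1.0, 1.0]), -1.0                      # assumed hyperplane
X = np.array([[2.0, 2.0], [0.0, -1.0], [-1.0, 0.0]])   # assumed training patterns
print([round(distance(x, w, b), 3) for x in X])        # [2.121, 1.414, 1.414]
print(round(margin(X, w, b), 3))                       # 1.414
```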


Support vector machines

find the separating hyperplane with maximal margin:

maximize over ~w, b, ρ:   ρ²
subject to   ρ > 0
             d(i) · ( ⟨~x(i), ~w⟩ + b ) / ||~w|| ≥ ρ   for all (~x(i), d(i)) ∈ D

this task contains a degree of freedom: ~w and b can be scaled by a positive number without changing the separating hyperplane. We can remove this degree of freedom and simplify the problem by adding a constraint on the length of ~w. Here:

||~w|| = 1/ρ


Support vector machines (cont.)

optimization problem with length constraint:

maximize over ~w, b, ρ:   ρ² = 1/||~w||²
subject to   ρ > 0
             ||~w|| = 1/ρ,   i.e. ρ = 1/||~w||
             d(i) · ( ⟨~x(i), ~w⟩ + b ) / ||~w|| ≥ ρ = 1/||~w||   for all (~x(i), d(i)) ∈ D

rewriting the optimization problem

Finally, we can make it a minimization task by minimizing the reciprocal of 1/||~w||², i.e. by minimizing ||~w||².


Support vector machines (cont.)

the minimization task:

minimize over ~w, b, ρ:   (1/2) ||~w||²
subject to   d(i) · ( ⟨~x(i), ~w⟩ + b ) ≥ 1   for all (~x(i), d(i)) ∈ D
             ρ = 1/||~w||

ρ occurs only in one equality constraint, i.e. we can solve the optimization problem for ~w and b and calculate ρ afterwards from ~w


Support vector machines (cont.)

final mathematical form of the minimization task:

minimize over ~w, b:   (1/2) ||~w||²
subject to   d(i) · ( ⟨~x(i), ~w⟩ + b ) ≥ 1   for all (~x(i), d(i)) ∈ D

the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM)
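This is a minimization of a quadratic function under linear inequality constraints, so it can be handed to a standard solver. A hedged sketch (using the cvxpy modelling library and a made-up toy dataset — both are assumptions of this example; the slides do not prescribe any particular solver):

```python
import numpy as np
import cvxpy as cp

# assumed toy training set: two linearly separable classes in R^2
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.5], [-2.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(d, X @ w + b) >= 1]          # d_i (<x_i, w> + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print("margin rho =", 1.0 / np.linalg.norm(w.value))    # rho = 1 / ||w||
```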


Optimization Theory


Optimization theory

a general form of an optimization problem:

minimize over ~x:   f(~x)
subject to   ~x ∈ Ω

Ω ⊆ Rn: the set of feasible (permitted) points

f : Rn → R: the target function that should be minimized

maximization problems can be described in the same way (maximize f(x) ⇔ minimize −f(x))

often, Ω is implicitly given by equality and inequality constraints


Local and global minima

a feasible point ~x ∈ Ω is called a global minimum if for all ~y ∈ Ω holds:

f(~x) ≤ f(~y)

a feasible point ~x ∈ Ω is called a local minimum if there is a radius r > 0 so that for all points ~y ∈ Ω with ||~x − ~y|| < r holds:

f(~x) ≤ f(~y)

[figure: a function f(x) over a feasible set Ω with two local minima, one of them global]


Convexity

A set X is called a convex set if for all points ~x, ~y ∈ X and any 0 ≤ θ ≤ 1 holds:

θ~x + (1 − θ)~y ∈ X

intuitively: a set is convex if all points between two elements of the set are also elements of the set.

[figure: a set X with two points ~x, ~y ∈ X and the question whether θ~x + (1 − θ)~y ∈ X]


Convexity (cont.)

[figures: examples of convex sets and of non-convex sets]


Convexity (cont.)

intersections of convex sets are also convex. Proof: → blackboard

unions of convex sets are not guaranteed to be convex.


Convexity (cont.)

A function f is called a convex function if for all points ~x, ~y and all 0 ≤ θ ≤ 1 holds:

f(θ~x + (1 − θ)~y) ≤ θf(~x) + (1 − θ)f(~y)

[figure: a convex function f(x) with the chord between f(~x) and f(~y) lying above the graph]


[figures: examples of convex functions and of non-convex functions]


Convexity (cont.)

sums of convex functions are also convex. Proof: → blackboard

positive multiples of convex functions are also convex. Proof: → for your own practice (very simple)

differences of convex functions are not guaranteed to be convex.


Convexity (cont.)

linear functions are convex. Proof: → for your own practice (very simple)

the function f : x ↦ x² is a convex function. Proof:

f(θx + (1 − θ)y) − (θf(x) + (1 − θ)f(y))
= (θx + (1 − θ)y)² − (θx² + (1 − θ)y²) = . . .
= −θ(1 − θ)(x² − 2xy + y²) = −θ(1 − θ)(x − y)² ≤ 0

the function f : ~x ↦ ||~x||² is a convex function. Proof: follows from above

if f is a convex function, then the set {~x | f(~x) ≤ 0} is a convex set. Proof: → blackboard
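A quick numeric sanity check (Python/NumPy; randomly sampled pairs, so an illustration rather than a proof, and the helper name is an invention of this sketch) of the convexity inequality for the two functions above:

```python
import numpy as np

def convex_on_samples(f, points, n_trials=1000, seed=0):
    """Check f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    on randomly drawn pairs of points -- a sanity check, not a proof."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x = points[rng.integers(len(points))]
        y = points[rng.integers(len(points))]
        theta = rng.random()
        if f(theta * x + (1 - theta) * y) > theta * f(x) + (1 - theta) * f(y) + 1e-12:
            return False
    return True

pts_1d = [np.array([v]) for v in np.linspace(-3, 3, 25)]
print(convex_on_samples(lambda x: float(x[0] ** 2), pts_1d))        # True for x^2
pts_2d = [np.array([a, b]) for a in (-2.0, 0.0, 2.0) for b in (-1.0, 1.0)]
print(convex_on_samples(lambda x: float(np.dot(x, x)), pts_2d))     # True for ||x||^2
```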


Convexity and minima

Lemma: if Ω is a convex set and f is a convex function, each local minimum of f w.r.t. Ω is also a global minimum.

Proof: → blackboard

this lemma does not mean that for each convex set and each convex function a minimum exists!

there are conditions under which the existence of a global minimum can be guaranteed (e.g. Ω non-empty and compact, f continuous)


Convexity and minima (cont.)

Lemma: If Ω is a convex set and f is a convex function w.r.t. Ω, each point satisfying grad f(~x) = 0 is a local minimum.

Proof: → omitted (literature about convex functions)


Constraints

often, the set Ω is given implicitly by a set of constraints g1, . . . , gk and h1, . . . , hm in the following form:

Ω = ( ⋂_{i=1..k} {~x | gi(~x) ≤ 0} ) ∩ ( ⋂_{i=1..m} {~x | hi(~x) = 0} )

gj and hi are real valued functions

• constraints of the form gj(~x) ≤ 0 are called inequality constraints

• constraints of the form hi(~x) = 0 are called equality constraints

equality constraints hi(~x) = 0 can be replaced by two inequality constraints:

hi(~x) ≤ 0 and −hi(~x) ≤ 0.


Characterizing local minima

for a feasible point ~x ∈ Ω we call a direction ~v ∈ Rn \ {~0} a feasible direction if there exists r > 0 so that ~x + θ~v is a feasible point for all 0 ≤ θ ≤ r

for a feasible point ~x ∈ Ω we call a direction ~v ∈ Rn \ {~0} a descending direction if there exists r > 0 so that f(~x + θ~v) < f(~x) for all 0 < θ ≤ r


Characterizing local minima (cont.)

a local minimum of f with respect to Ω is a feasible point for which the intersection of feasible and descending directions is empty.

there is no feasible and descending direction in ~x ⇔ there is a small area around ~x where we do not find any feasible point for which the value of f is smaller ⇔ local minimum

[figure: example points illustrating “no minimum”, “minimum (no descending directions)”, “local minimum (descending directions non-feasible)”, and “no minimum”]


Linear constraints, differentiable target function

the theory simplifies if the target function f is differentiable and the constraints are linear:

minimize over ~x:   f(~x)
subject to   gj(~x) ≤ 0,   j = 1, . . . , k
             hi(~x) = 0,   i = 1, . . . , m

with gj(~x) = gj,0 + gj,1 x1 + gj,2 x2 + · · · + gj,n xn
(gj,0, . . . , gj,n are coefficients of the offset and slope of gj)
and hi(~x) = hi,0 + hi,1 x1 + hi,2 x2 + · · · + hi,n xn

if f is a linear function, these tasks are called linear programs

if f is a quadratic function, these tasks are called quadratic programs


Linear constraints, differentiable target function (cont.)

what are the descending directions at point ~x? A direction ~v is descending if

⟨~v, grad f(~x)⟩ < 0


Linear constraints, differentiable target function (cont.)

what are the feasible directions in point ~x?

• active and inactive constraints: an inequality constraint g is called active at point ~x if g(~x) = 0. Otherwise it is called inactive.

• inactive constraints: inactive constraints do not restrict feasible directions

• active constraints: directions ~v with ⟨~v, grad g(~x)⟩ > 0 are infeasible

[figure: grad g(~x) at an inactive and at an active constraint, with the corresponding feasible directions]


Linear constraints, differentiable target function (cont.)

what are the feasible directions in point ~x? A direction ~v is feasible in ~x if for all constraints gi holds:

gi is inactive or ⟨~v, grad gi(~x)⟩ ≤ 0


Characterizing local minima

local minima are feasible points at which the intersection of descending directions and feasible directions is empty.

examples with no active constraints:

[figure: two examples]  grad f(~x) ≠ ~0 ⇒ no minimum;   grad f(~x) = ~0 ⇒ minimum


Characterizing local minima (cont.)

examples with one active constraint:

[figure] −grad f = λ grad g with λ < 0: all descending directions feasible ⇒ no minimum

[figure] −grad f = λ grad g + ~u with λ < 0 and ~u ⊥ grad g, ~u ≠ ~0: some descending directions feasible ⇒ no minimum


Characterizing local minima (cont.)

[figure] −grad f = λ grad g + ~u with λ > 0 and ~u ⊥ grad g, ~u ≠ ~0: some descending directions feasible ⇒ no minimum

[figure] −grad f = λ grad g with λ > 0: all descending directions non-feasible ⇒ minimum


Characterizing local minima (cont.)

[figure] −grad f = ~0 = 0 · grad g: no descending directions ⇒ minimum

For one active constraint we found: decomposing −grad f(~x) = λ grad g(~x) + ~u with ~u ⊥ grad g(~x),

we find feasible descending directions only if λ < 0 or ~u ≠ ~0

~x is a minimum if we find λ ≥ 0 with grad f(~x) + λ grad g(~x) = ~0

Remark: this principle can be generalized to more active constraints


Characterizing local minima (cont.)

examples with two active constraints:

[figure] −grad f = λ1 grad g1 + λ2 grad g2 with λ1 < 0, λ2 ≥ 0: some descending directions feasible ⇒ no minimum

[figure] −grad f = λ1 grad g1 + λ2 grad g2 with λ1 ≥ 0, λ2 ≥ 0: all descending directions non-feasible ⇒ minimum


Characterizing local minima (cont.)

examples with an equality constraint:

[figure] −grad f = λ grad h + ~u with ~u ⊥ grad h, ~u ≠ ~0: one descending direction feasible ⇒ no minimum

[figure] −grad f = λ grad h: all descending directions non-feasible ⇒ minimum


Characterizing local minima (cont.)

general principle for problems with a differentiable target function, linear equality constraints hi and linear inequality constraints gj:

A feasible point ~x is a minimum if there exist αj ≥ 0 and βi so that:

grad f(~x) + Σ_{active inequality constraints j} ( αj · grad gj(~x) ) + Σ_{i=1..m} ( βi · grad hi(~x) ) = ~0

(corollary to Farkas’ lemma)

Lagrange function:

L(~x, ~α, ~β) = f(~x) + Σ_{j=1..k} ( αj · gj(~x) ) + Σ_{i=1..m} ( βi · hi(~x) )

αi, βi are called Lagrange multipliers


(Karush-) Kuhn-Tucker conditions

rewriting these conditions yields the (K)KT conditions:

gi(~x) ≤ 0 for all inequality constraints

hj(~x) = 0 for all equality constraints

αi · gi(~x) = 0 for all inequality constraints

0 = ∂L(~x, ~α, ~β) / ∂xi   for all i = 1, . . . , n

αi ≥ 0 for all inequality constraints

Lemma: A point ~x is a minimum if and only if there exist αj and βi so that the KT conditions are met.

Proof: omitted, see literature


Kuhn-Tucker conditions (cont.)

the KT conditions do not give an algorithm to find the minima, but they allow us to check whether a given point is a minimum

example:

minimize over x1, x2:   x1² + x2²
subject to   −x1 + x2 + 2 ≤ 0
             −2x1 − x2 − 2 ≤ 0

Lagrange function, KT conditions, check whether (0, 0), (4, 2), (1, −1) are minima. → blackboard
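A hedged numeric sketch (Python/NumPy; the helper names are inventions of this example) of that check: determine which constraints are active at the candidate point, then look for multipliers αi ≥ 0 that make the gradient of the Lagrangian vanish — the same steps the blackboard derivation performs by hand.

```python
import numpy as np

grad_f = lambda x: np.array([2 * x[0], 2 * x[1]])                  # f(x) = x1^2 + x2^2
g      = [lambda x: -x[0] + x[1] + 2, lambda x: -2 * x[0] - x[1] - 2]
grad_g = [np.array([-1.0, 1.0]), np.array([-2.0, -1.0])]           # gradients of g1, g2

def satisfies_kkt(x, tol=1e-9):
    x = np.asarray(x, dtype=float)
    if any(gi(x) > tol for gi in g):                 # feasibility: g_i(x) <= 0
        return False
    active = [i for i in range(len(g)) if abs(g[i](x)) <= tol]
    if not active:                                   # no active constraint: need grad f = 0
        return bool(np.allclose(grad_f(x), 0.0))
    # look for alpha with grad f(x) + sum_i alpha_i grad g_i(x) = 0 (least squares)
    A = np.column_stack([grad_g[i] for i in active])
    alpha = np.linalg.lstsq(A, -grad_f(x), rcond=None)[0]
    stationary = np.allclose(A @ alpha, -grad_f(x))
    return bool(stationary and np.all(alpha >= -tol))  # multipliers must be >= 0

for point in [(0, 0), (4, 2), (1, -1)]:
    print(point, satisfies_kkt(point))               # expected: False, False, True
```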


Ways to find the minimum

brute force approach

• equality constraints are always active

• inequality constraints may be active or inactive

• at the minimum ~x∗, a certain subset A of inequality constraints is active

• if we knew A and f looks “nice”, we could find ~x∗ analytically

• main idea: check all possible combinations of active constraints for the minimum

example → blackboard (a small numeric sketch follows below)

can be applied only for linear and quadratic target functions

number of possible combinations of active constraints grows exponentially
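A small sketch of this brute-force idea (Python/NumPy), reusing the quadratic example from the Kuhn-Tucker slide above: for every subset of inequality constraints assumed active, solve the resulting equality-constrained quadratic problem analytically (a small linear system) and keep the best candidate that is feasible for all constraints. For this convex example, that candidate is the minimum.

```python
import numpy as np
from itertools import combinations

# example from the Kuhn-Tucker slide: f(x) = x1^2 + x2^2,
# g1(x) = -x1 + x2 + 2 <= 0,  g2(x) = -2*x1 - x2 - 2 <= 0
G = np.array([[-1.0, 1.0], [-2.0, -1.0]])      # rows: gradients of g1, g2
c = np.array([2.0, -2.0])                      # g_i(x) = G[i] @ x + c[i]
f = lambda x: float(x @ x)

best, n = None, 2
for k in range(len(G) + 1):
    for active in combinations(range(len(G)), k):
        # minimize f subject to the chosen constraints holding with equality:
        # stationarity 2x + A^T alpha = 0 together with A x = -c_A
        m = len(active)
        K = np.zeros((n + m, n + m))
        K[:n, :n] = 2.0 * np.eye(n)
        rhs = np.zeros(n + m)
        if m:
            A = G[list(active)]
            K[:n, n:], K[n:, :n] = A.T, A
            rhs[n:] = -c[list(active)]
        x = np.linalg.solve(K, rhs)[:n]
        if np.all(G @ x + c <= 1e-9):          # keep only candidates feasible everywhere
            if best is None or f(x) < best[1]:
                best = (x, f(x), active)

print(best)    # expected: x = (1, -1), f = 2, active constraint set = (0,)
```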


Ways to find the minimum (cont.)

active set methods: incrementally build the set of active constraints

start with a feasible point ~x and assume A = ∅

perform a line search in the steepest descending direction that is consistent with the constraints in A

if a constraint ∉ A is violated, it is added to A

if no step is possible, Lagrange multipliers are calculated to check whether we found the minimum.

if necessary, constraints are removed from A

sketch of algorithm → next slide


Ways to find the minimum (cont.)

1: repeat
2:   calculate direction of maximal descent
3:   project direction of maximal descent onto boundaries of constraints in A (yields search direction ~v)
4:   if ~v ≠ ~0 then
5:     calculate minimum on ray from ~x into direction ~v (yields step length τ)
6:     check whether constraints are violated when the step is performed. If yes, shorten τ and add the restricting constraint to A
7:     perform step from ~x to ~x + τ~v
8:   else
9:     calculate Lagrange multipliers
10:    if Lagrange multipliers are all non-negative then
11:      minimum found, stop.
12:    else
13:      remove a constraint with negative Lagrange multiplier from A
14:    end if
15:  end if
16: until minimum found
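Not the active-set routine itself, but an easy way to cross-check its result: general-purpose solvers handle exactly this kind of problem. A minimal sketch with scipy.optimize.minimize (SLSQP) on the Kuhn-Tucker example used earlier (using SciPy is an assumption of this sketch, not something the slides prescribe; note SciPy expects inequality constraints in the form c(x) ≥ 0, so the signs are flipped):

```python
import numpy as np
from scipy.optimize import minimize

# example from the Kuhn-Tucker slide: min x1^2 + x2^2
# subject to -x1 + x2 + 2 <= 0 and -2*x1 - x2 - 2 <= 0
f = lambda x: x[0] ** 2 + x[1] ** 2
constraints = [
    {"type": "ineq", "fun": lambda x: -(-x[0] + x[1] + 2)},      # rewritten as >= 0
    {"type": "ineq", "fun": lambda x: -(-2 * x[0] - x[1] - 2)},  # rewritten as >= 0
]
res = minimize(f, x0=np.array([4.0, -1.0]), method="SLSQP", constraints=constraints)
print(res.x, res.fun)    # should be close to (1, -1) with f = 2
```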


Duality

a mathematical concept to solve a task different from the original task. By doing so, the original task is implicitly solved.

the original task is called the primal problem, the other problem is called the dual problem

primal task: (reminder)

minimize over ~x:   f(~x)
subject to   gj(~x) ≤ 0,   j = 1, . . . , k
             hi(~x) = 0,   i = 1, . . . , m

the Lagrange function of the primal: (reminder)

L(~x, ~α, ~β) = f(~x) + Σ_{j=1..k} ( αj · gj(~x) ) + Σ_{i=1..m} ( βi · hi(~x) )


Duality (cont.)

the function Q:

Q(~α, ~β) = inf_{~x ∈ Rn} L(~x, ~α, ~β)

remarks:

• the infimum is the largest lower bound of L; if no lower bound exists, we say the infimum is −∞

• the infimum is taken over all points ~x, no matter whether these points are feasible or infeasible w.r.t. the primal.

dual task:

maximize over ~α, ~β:   Q(~α, ~β)
subject to   αi ≥ 0   for all i = 1, . . . , k
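A small numeric illustration of Q and of the dual task (Python/SciPy, again on the Kuhn-Tucker example used earlier; approximating the infimum by an unconstrained numerical minimization is a choice of this sketch). For each feasible ~α the value Q(~α) stays below the primal value f(~x′) of any feasible ~x′ — the lemma on the next slide.

```python
import numpy as np
from scipy.optimize import minimize

# Kuhn-Tucker example: f(x) = x1^2 + x2^2, inequality constraints
# g1(x) = -x1 + x2 + 2 <= 0, g2(x) = -2*x1 - x2 - 2 <= 0, no equality constraints
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: np.array([-x[0] + x[1] + 2, -2 * x[0] - x[1] - 2])
L = lambda x, a: f(x) + a @ g(x)                 # Lagrange function

def Q(alpha):
    """Dual function: infimum of L over all x, approximated numerically."""
    return minimize(lambda x: L(x, alpha), x0=np.zeros(2)).fun

x_prime = np.array([4.0, -1.0])                  # a feasible point of the primal
for alpha in [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 3.0])]:
    print(alpha, round(Q(alpha), 3), "<=", f(x_prime))   # weak duality holds
```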


Duality(cont.)

Lemma:

If ~x′ is feasible point of the primal task and (~α′, ~β′) is feasible point of thedual task, then:

Q(~α′, ~β′) ≤ f(~x′)

Proof:

Q(~α′, ~β′) ≤ L(~x′, ~α′, ~β′)

= f(~x′) +k∑

i=1

( α′i

︸︷︷︸

≥0

· gi(~x′)

︸ ︷︷ ︸

≤0

) +m∑

j=1

(β′j · hj(~x

′)︸ ︷︷ ︸

=0

)

≤ f(~x′)

Introduction to Neuroinformatics: Support Vector Machines – p. 44

Duality (cont.)

Corollary:
If ~x′ is a feasible point of the primal task, (~α′, ~β′) is a feasible point of the dual task, and f(~x′) = Q(~α′, ~β′), then ~x′ is a solution of the primal problem and (~α′, ~β′) is a solution of the dual problem.

Proof:
For any feasible ~x:        f(~x) ≥ Q(~α′, ~β′) = f(~x′)
For any feasible (~α, ~β):  Q(~α, ~β) ≤ f(~x′) = Q(~α′, ~β′)

Potential problem: points ~x′ and (~α′, ~β′) with f(~x′) = Q(~α′, ~β′) do not necessarily exist for general problems ⇒ restrictions on f and on the type of constraints are needed.

Duality (cont.)

Lemma:
If f is convex and differentiable, g_j and h_i are linear, and ~x′ is a solution of the primal problem, then a solution (~α′, ~β′) of the dual exists and

    f(~x′) = Q(~α′, ~β′)

Proof:

    ~x′ solution of the primal ⇒ there exist (~α′, ~β′) that meet the KT conditions

    f convex, g_j, h_i linear ⇒ L(~x, ~α′, ~β′) is convex in ~x

    from the KT conditions: ∂L/∂x_i (~x′, ~α′, ~β′) = 0; since L(·, ~α′, ~β′) is convex, ~x′ minimizes it, so L(~x′, ~α′, ~β′) = Q(~α′, ~β′)

    Q(~α′, ~β′) = L(~x′, ~α′, ~β′) = f(~x′) (the last equality uses the complementarity conditions α′_j g_j(~x′) = 0 and h_i(~x′) = 0) ⇒ (~α′, ~β′) is a solution of the dual

the proof shows: the KT conditions are the link between primal solution and dual solution

solutions do not need to be unique

Duality (cont.)

main message from duality: under certain circumstances,

• you can solve the primal and get the solution of the dual for free
• you can solve the dual and get the solution of the primal for free
• you can use the Kuhn-Tucker conditions to transform the primal solution into a dual solution and vice versa
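a small numeric illustration (an addition, not part of the original slides): for the toy problem minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, the dual function is Q(α) = inf_x [x² + α(1 − x)] = α − α²/4; maximizing it over α ≥ 0 gives α′ = 2 and Q(α′) = 1 = f(x′) with x′ = 1, so the primal and dual optima coincide. The sketch below checks this with a generic scalar optimizer; the toy problem and the search interval (0, 10) are assumptions chosen only for the demonstration.

from scipy.optimize import minimize_scalar

def f(x):
    return x ** 2

def Q(alpha):
    # infimum of the Lagrangian x^2 + alpha*(1 - x) over x, attained at x = alpha/2
    x_star = alpha / 2.0
    return f(x_star) + alpha * (1.0 - x_star)

# maximize Q over alpha >= 0 by minimizing -Q on a bounded interval
res = minimize_scalar(lambda a: -Q(a), bounds=(0.0, 10.0), method="bounded")
print("alpha' ~", round(res.x, 4))        # ~ 2.0
print("Q(alpha') ~", round(Q(res.x), 4))  # ~ 1.0
print("f(x') =", f(1.0))                  # 1.0, the primal optimum at x' = 1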

Summary (optimization theory)

general ideas about optimization (local and global minima)

convex problems: each local minimum is a global minimum

general characterization of minima (feasible directions, descending directions, Kuhn-Tucker conditions)

algorithms to find the minimum (brute force, active set)

duality

literature: Roger Fletcher, Practical Methods of Optimization, Wiley 1991


Applying Results of Optimization Theory to Support Vector Machines

Support vector machines

final mathematical form of the minimization task (reminder):

    minimize_{~w,b}  (1/2)||~w||²
    subject to       d^(i)(⟨~x^(i), ~w⟩ + b) ≥ 1   for i = 1, …, p

the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as a support vector machine (SVM)

Support vector machines (cont.)

transforming into the appropriate form:

    minimize_{~w,b}  (1/2)||~w||²
    subject to       1 − d^(i)(⟨~x^(i), ~w⟩ + b) ≤ 0   for i = 1, …, p

note:

• the target function is convex
• the constraints are linear
• hence, local minima are also global minima (if feasible points exist)

Support vector machines (cont.)

Lagrange function:

    L(~w, b, ~α) = (1/2)||~w||² + ∑_{i=1}^{p} α_i (1 − d^(i)(⟨~x^(i), ~w⟩ + b))

                 = (1/2)||~w||² + ∑_{i=1}^{p} α_i − ∑_{i=1}^{p} α_i d^(i) ⟨~x^(i), ~w⟩ − b ∑_{i=1}^{p} α_i d^(i)

derivatives:

    ∂L/∂w_j = w_j − ∑_{i=1}^{p} α_i d^(i) x_j^(i)

    ∂L/∂b = − ∑_{i=1}^{p} α_i d^(i)

Support vector machines (cont.)

calculating Q(~α) = inf_{~w,b} L(~w, b, ~α)

L is convex. Hence, if we can find a point that zeros the partial derivatives, it is the minimum.

zeroing the derivatives and resolving w.r.t. ~w and b:

    w_j = ∑_{i=1}^{p} α_i d^(i) x_j^(i)        i.e.   ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)

    0 = ∑_{i=1}^{p} α_i d^(i)

strange equation, b is lost, what does it mean?

Support vector machines (cont.)

    L(~w, b, ~α) = (1/2)||~w||² + ∑_{i=1}^{p} α_i − ∑_{i=1}^{p} α_i d^(i) ⟨~x^(i), ~w⟩ − b ∑_{i=1}^{p} α_i d^(i)

    ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)        0 = ∑_{i=1}^{p} α_i d^(i)

first case: ∑_{i=1}^{p} α_i d^(i) = 0
In this case, we can find ~w that zeros the partial derivatives for any value of b; it is the minimum (= infimum). Substituting ~w by ∑_{i=1}^{p} α_i d^(i) ~x^(i) yields:

    Q(~α) = −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ + ∑_{i=1}^{p} α_i

second case: ∑_{i=1}^{p} α_i d^(i) ≠ 0
No matter how ~w is chosen, we can make L arbitrarily small by varying b. Hence,

    Q(~α) = −∞

Support vector machines (cont.)

the dual problem:

    maximize_{~α}   Q(~α)

    where   Q(~α) = −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ + ∑_{i=1}^{p} α_i    if ∑_{i=1}^{p} α_i d^(i) = 0
            Q(~α) = −∞                                                                                        otherwise

    subject to   α_i ≥ 0 for all i = 1, …, p

Support vector machines (cont.)

since the second case of the case distinction can never attain the maximum, we can rewrite the dual:

    maximize_{~α}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ + ∑_{i=1}^{p} α_i

    subject to     ∑_{i=1}^{p} α_i d^(i) = 0
                   α_i ≥ 0 for all i = 1, …, p

note: the dual is a problem in ~α alone; ~w and b do not occur

for now, there is no advantage in solving the dual instead of the primal problem. Later on, we will see the advantage of the dual.

for both the dual and the primal problem, powerful algorithmic solution methods exist (e.g. active set methods)
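illustration (an addition, not part of the original slides): the rewritten dual is an ordinary quadratic program and can be handed to a generic solver. The sketch below uses scipy's SLSQP on the small data set of the blackboard example further below; the data and the solver choice are assumptions made only for this demonstration.

import numpy as np
from scipy.optimize import minimize

X = np.array([[0., 0.], [0., 1.], [-1., 1.], [1., 0.], [2., 1.]])  # patterns ~x^(i)
d = np.array([-1., -1., -1., 1., 1.])                              # labels d^(i)
p = len(d)

G = (d[:, None] * d[None, :]) * (X @ X.T)     # G_ij = d^(i) d^(j) <x^(i), x^(j)>

def neg_dual(alpha):
    # negative dual objective: we minimize -Q(alpha) to maximize Q(alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(p),
               bounds=[(0.0, None)] * p,                             # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ d}], # sum_i alpha_i d^(i) = 0
               method="SLSQP")
alpha = res.x
w = (alpha * d) @ X                           # w = sum_i alpha_i d^(i) x^(i)
s = int(np.argmax(alpha))                     # index of one support vector (alpha_s > 0)
b = 1.0 / d[s] - X[s] @ w
print(np.round(alpha, 3), np.round(w, 3), round(b, 3))  # expected roughly w = (2, 0), b = -1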

Support vector machines (cont.)

back to the primal; KT conditions:

    1 − d^(i) ⟨~x^(i), ~w⟩ − d^(i) b ≤ 0                 for all i = 1, …, p
    α_i · (1 − d^(i) ⟨~x^(i), ~w⟩ − d^(i) b) = 0         for all i = 1, …, p
    ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)
    0 = ∑_{i=1}^{p} α_i d^(i)
    α_i ≥ 0                                              for all i = 1, …, p

what are ~w and b if the α_i are already known?

Support vector machines (cont.)

resolving the KT conditions w.r.t. ~w and b:

• ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)

• for i with α_i ≠ 0 we get from the complementarity condition:

      1 − d^(i) ⟨~x^(i), ~w⟩ − d^(i) b = 0

  Hence:

      b = 1/d^(i) − ⟨~x^(i), ~w⟩ = 1/d^(i) − ∑_{j=1}^{p} α_j d^(j) ⟨~x^(i), ~x^(j)⟩

the margin ρ:

    ρ = 1/||~w|| = 1 / √( ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ )
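a minimal sketch (an addition, not from the original slides) of how ~w, b and ρ follow from known α_i, assuming X, d and alpha are numpy arrays as in the dual-solver sketch above:

import numpy as np

def primal_from_dual(X, d, alpha, tol=1e-8):
    w = (alpha * d) @ X                  # w = sum_i alpha_i d^(i) x^(i)
    sv = np.flatnonzero(alpha > tol)     # indices of the support vectors
    i = sv[0]                            # any support vector yields the same b
    b = 1.0 / d[i] - X[i] @ w            # from the complementarity condition
    rho = 1.0 / np.linalg.norm(w)        # margin rho = 1 / ||w||
    return w, b, rho, sv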

Support vector machines (cont.)

some constraints become active (α_i > 0), some inactive (α_i = 0)

each constraint refers to one pattern

a constraint is active if the respective pattern is located next to the separating hyperplane

patterns that refer to active constraints are called support vectors

the solution only depends on the support vectors

[figure: optimal separating hyperplane with margin band; the support vectors (sv) are the positive and negative patterns lying on the margin boundaries]

Lemma:
Removing non-support vectors from a classification task does not change the solution found by a support vector machine.

Support vector machines (cont.)

example (blackboard):

    i    x_1^(i)   x_2^(i)   d^(i)
    1       0         0       −1
    2       0         1       −1
    3      −1         1       −1
    4       1         0       +1
    5       2         1       +1

solution:

    w_1 = 2,  w_2 = 0,  b = −1

support vectors: ~x^(1), ~x^(2), ~x^(4)
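a quick cross-check (an addition, not from the original slides): scikit-learn's SVC with a very large C approximates the hard margin case and should reproduce this solution; the value C = 1e6 is an illustrative assumption.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [-1, 1], [1, 0], [2, 1]], dtype=float)
d = np.array([-1, -1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, d)
print(clf.coef_, clf.intercept_)   # expected: approximately [[2. 0.]] and [-1.]
print(clf.support_)                # expected support vector indices: [0 1 3]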

Fault tolerant SVMs

Soft margin SVM

the SVM discussed up to here (the so-called "hard margin case") requires perfect classification

what about data which are not linearly separable?

soft margin SVM

opposing aspects: maximizing the margin vs. minimizing the errors

[figure: separating hyperplane with margin band; some positive and negative patterns lie inside the band or on the wrong side, each contributing an additional error]

Soft margin SVM (cont.)

mathematical model of errors: slack variables ξ_i. ξ_i models the error that is made for the training pattern ~x^(i)

extended constraints:

    d^(i) · (⟨~w, ~x^(i)⟩ + b) ≥ 1 − ξ_i
    ξ_i ≥ 0

extended optimization target:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²

C > 0 is a fixed constant to balance margin and errors

Soft margin SVM (cont.)

primal of soft margin classification:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²
    subject to          d^(i)(⟨~x^(i), ~w⟩ + b) ≥ 1 − ξ_i   for i = 1, …, p
                        ξ_i ≥ 0                              for i = 1, …, p

Lagrange function:

    L(~w, b, ~ξ, ~α, ~β) = (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²
                           + ∑_{i=1}^{p} α_i (1 − ξ_i − d^(i)(⟨~x^(i), ~w⟩ + b))
                           + ∑_{i=1}^{p} β_i · (−ξ_i)

primal variables: ~w, b, ~ξ. Lagrange multipliers: ~α, ~β

Soft margin SVM (cont.)

KT conditions:

    1 − ξ_i − d^(i)(⟨~x^(i), ~w⟩ + b) ≤ 0                    for i = 1, …, p
    −ξ_i ≤ 0                                                 for i = 1, …, p
    α_i · (1 − ξ_i − d^(i)(⟨~x^(i), ~w⟩ + b)) = 0            for i = 1, …, p
    β_i · (−ξ_i) = 0                                         for i = 1, …, p
    ∂L/∂w_j = w_j − ∑_{i=1}^{p} α_i d^(i) x_j^(i) = 0        for all components j
    ∂L/∂b = − ∑_{i=1}^{p} α_i d^(i) = 0
    ∂L/∂ξ_i = 2Cξ_i − α_i − β_i = 0                          for i = 1, …, p
    α_i ≥ 0                                                  for i = 1, …, p
    β_i ≥ 0                                                  for i = 1, …, p

Soft margin SVM (cont.)

deriving the solution from the KT conditions. We get ~w and ξ_i directly:

    ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i)

    ξ_i = (α_i + β_i) / (2C)

to derive b, we exploit the complementarity condition for a pattern with non-zero α_i:

    b = (1 − ξ_i − d^(i) ⟨~x^(i), ~w⟩) / d^(i)
      = (1 − (α_i + β_i)/(2C)) / d^(i) − ∑_{j=1}^{p} α_j d^(j) ⟨~x^(i), ~x^(j)⟩

Soft margin SVM (cont.)

which patterns become support vectors?

• patterns on the boundary of the margin band
• misclassified patterns and patterns within the margin band

Soft margin SVM (cont.)

calculating the function Q(~α, ~β) = inf_{~w,b,~ξ} L(~w, b, ~ξ, ~α, ~β)

first case: ∑_{i=1}^{p} α_i d^(i) = 0
Then the minimum exists, with ~w = ∑_{i=1}^{p} α_i d^(i) ~x^(i), ξ_i = (α_i + β_i)/(2C), and b arbitrary:

    Q(~α, ~β) = −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ − (1/2) ∑_{i=1}^{p} (α_i + β_i)²/(2C) + ∑_{i=1}^{p} α_i

second case: ∑_{i=1}^{p} α_i d^(i) ≠ 0
Here L is unbounded below, hence Q(~α, ~β) = −∞

observation: the second case does not contribute to the maximum of Q

Soft margin SVM (cont.)

the dual:

    maximize_{~α,~β}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩ − (1/2) ∑_{i=1}^{p} (α_i + β_i)²/(2C) + ∑_{i=1}^{p} α_i

    subject to        ∑_{i=1}^{p} α_i d^(i) = 0
                      α_i ≥ 0 for all i = 1, …, p
                      β_i ≥ 0 for all i = 1, …, p

Soft margin SVM (cont.)

there is a variant of the soft margin SVM that uses the following optimization target:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i

instead of:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^{p} ξ_i²

advantage: the calculation of the solution simplifies at a certain point and the solution is more robust w.r.t. outliers.

C controls the balance between errors and margin width:

• a large C prefers small errors
• a small C prefers a large margin

an adequate value of C has to be optimized experimentally (see the sketch below)

→ demo in practical exercises
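illustration (an addition, not from the original slides): a common way to optimize C experimentally is a cross-validated grid search; the synthetic data set and the parameter grid below are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic two-class data set, only for demonstration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# try several values of C with 5-fold cross-validation
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100, 1000]}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"], "cv accuracy:", round(search.best_score_, 3))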

Non-linear SVMs

linear discrimination is not appropriate in all cases; we need a non-linear variant of SVMs

switching from linear constraints to non-linear constraints causes a lot of numerical problems:

• losing convexity
• KT conditions are no longer sufficient
• optimization algorithms with low performance

Non-linear SVMs (cont.)

classical idea: use non-linear features

a (non-linear) feature mapping Φ maps the original input vectors to feature vectors

we have to replace ~x in all formulae by Φ(~x)

the dual of the hard margin case becomes:

    maximize_{~α}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) ⟨Φ(~x^(i)), Φ(~x^(j))⟩ + ∑_{i=1}^{p} α_i

    subject to     ∑_{i=1}^{p} α_i d^(i) = 0
                   α_i ≥ 0 for all i = 1, …, p

Kernel trick

defining a function K_Φ as K_Φ(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩, we get:

    maximize_{~α}  −(1/2) ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) K_Φ(~x^(i), ~x^(j)) + ∑_{i=1}^{p} α_i

    subject to     ∑_{i=1}^{p} α_i d^(i) = 0
                   α_i ≥ 0 for all i = 1, …, p

now, the patterns only occur in the dual as arguments of the function K_Φ

the function K_Φ is called a kernel function

Kernel trick (cont.)

thought experiment: if we did not know Φ but had access to the kernel K_Φ, could we solve the dual optimization problem?
Yes!

can we calculate the solving weight vector ~w, the bias b and the margin ρ? (see the formulas derived from the KT conditions above)

weight vector: no, bias: yes, margin: yes

    b = 1/d^(s) − ∑_{j=1}^{p} α_j d^(j) K_Φ(~x^(s), ~x^(j))        for a support vector s

    ρ = 1 / √( ∑_{i=1}^{p} ∑_{j=1}^{p} α_i α_j d^(i) d^(j) K_Φ(~x^(i), ~x^(j)) )

can we apply the learned classification to a new input pattern ~x^(new)? Yes! (see next slide)

Kernel trick (cont.)

    ~w = ∑_{i=1}^{p} α_i d^(i) Φ(~x^(i))

applying ~w and b to a new pattern:

    ⟨~w, Φ(~x^(new))⟩ + b = ⟨∑_{i=1}^{p} α_i d^(i) Φ(~x^(i)), Φ(~x^(new))⟩ + b
                          = ∑_{i=1}^{p} α_i d^(i) K_Φ(~x^(i), ~x^(new)) + b
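illustration (an addition, not from the original slides): training and prediction done purely through a kernel function, never forming Φ explicitly. The polynomial kernel and the one-dimensional data are those of the example on the next slides; the SLSQP solver choice is an assumption.

import numpy as np
from scipy.optimize import minimize

def kernel(x, y):
    return (np.dot(x, y) + 1.0) ** 2              # polynomial kernel (<x, y> + 1)^2

X = np.array([[0.0], [-1.0], [2.0]])              # one-dimensional example patterns
d = np.array([-1.0, 1.0, 1.0])
p = len(d)

K = np.array([[kernel(X[i], X[j]) for j in range(p)] for i in range(p)])
G = (d[:, None] * d[None, :]) * K

res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(), np.zeros(p),
               bounds=[(0.0, None)] * p,
               constraints=[{"type": "eq", "fun": lambda a: a @ d}],
               method="SLSQP")
alpha = res.x
s = int(np.argmax(alpha))                         # one support vector
b = 1.0 / d[s] - sum(alpha[j] * d[j] * kernel(X[s], X[j]) for j in range(p))

def predict(x_new):
    # decision value: sum_i alpha_i d^(i) K(x^(i), x_new) + b
    return sum(alpha[i] * d[i] * kernel(X[i], x_new) for i in range(p)) + b

print(np.round(alpha, 3), round(b, 3))            # expected roughly (0.75, 0.667, 0.083) and -1
print(round(predict(np.array([3.0])), 3))         # positive value, i.e. class +1 side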

Kernel trick (cont.)

quintessence:

• replacing the dot product by a kernel function is possible
• an explicit representation of ~w is not possible if we do not have access to Φ
• nonetheless, predictions can be made
• we need to memorize the support vectors and the non-zero Lagrange multipliers α_i

the technique of implicitly calculating a separating hyperplane in feature space by replacing the dot product with a kernel function is called the kernel trick

[figure: Φ maps the input space to the feature space, where an optimal linear classification is computed; the kernel trick performs this implicitly, and Φ⁻¹ carries the decision boundary back to the input space]

Kernel trick (cont.)

example (one-dimensional input space):

D = {(0; −1), (−1; +1), (2; +1)} (pairs of input value and class label) is not linearly separable in input space

feature mapping:   Φ(x) = (x², √2·x, 1)ᵀ

kernel function:

    K_Φ(x, y) = ⟨(x², √2·x, 1)ᵀ, (y², √2·y, 1)ᵀ⟩ = (xy)² + 2(xy) + 1 = (xy + 1)²

solution: α_1 = 3/4, α_2 = 2/3, α_3 = 1/12

    b = −1,   ~w = (1, −(1/2)√2, 0)ᵀ   (in feature space!)
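a quick numeric check (an addition, not from the original slides) that the explicit feature map and the kernel agree, i.e. ⟨Φ(x), Φ(y)⟩ = (xy + 1)²; the test points are arbitrary.

import numpy as np

def phi(x):
    return np.array([x ** 2, np.sqrt(2.0) * x, 1.0])

def k(x, y):
    return (x * y + 1.0) ** 2

for x, y in [(0.0, -1.0), (-1.0, 2.0), (2.0, 0.5)]:
    assert np.isclose(phi(x) @ phi(y), k(x, y))
print("feature-space dot product and kernel agree on the test points")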

Kernel trick (cont.)

example, illustration:

[figure: the three training points on the input-space axis and their images under Φ in the (f1, f2) feature space, where they become linearly separable]

Which points in input space are on the decision boundary?
Look for z with

    ∑_{i=1}^{p} α_i d^(i) K_Φ(~x^(i), z) + b = 0

This yields: z = (1 ± √5)/2
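numeric check (an addition, not from the original slides): with the α_i and b from above, the kernel expansion reduces to f(z) = z² − z − 1, whose roots are exactly z = (1 ± √5)/2.

import numpy as np

alpha = np.array([3/4, 2/3, 1/12])
d = np.array([-1.0, 1.0, 1.0])
x = np.array([0.0, -1.0, 2.0])
b = -1.0

def f(z):
    # kernel expansion with K(x, y) = (xy + 1)^2
    return sum(alpha[i] * d[i] * (x[i] * z + 1.0) ** 2 for i in range(3)) + b

roots = [(1 + np.sqrt(5)) / 2, (1 - np.sqrt(5)) / 2]
print([round(f(z), 10) for z in roots])   # both values should be numerically zero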

Kernel trick (cont.)

kernels can be designed individually for an application:

• look for meaningful features
• create Φ
• derive K_Φ

in these cases, using the kernel instead is not better than an explicit calculation in feature space

there are classes of generic kernels:

• they can be used for many tasks
• often, Φ is very complicated and the feature space is very high dimensional
• for some generic kernels, the feature space has infinite dimension and Φ cannot be calculated explicitly
• typically, using the kernel instead of an explicit calculation in feature space is computationally more efficient

Generic kernels

example of a kernel (one-dimensional input space):

    Φ(x) = (x², √2·x, 1)ᵀ

mapping from input space to feature space needs: 2 multiplications
calculating the dot product in feature space: 3 multiplications, 2 additions
evaluating ⟨Φ(x), Φ(y)⟩: 7 multiplications, 2 additions

corresponding kernel: K_Φ(x, y) = (xy + 1)²
kernel evaluation needs: 2 multiplications, 1 addition

Generic kernels (cont.)

a class of generic kernels: polynomial kernels
polynomial kernels are functions of the form:

    K_Φ(~x, ~y) = (⟨~x, ~y⟩)^d     or     K_Φ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d

d ∈ N is a kernel parameter that needs to be chosen manually

for polynomial kernels, Φ can be calculated explicitly. It contains all monomials of degree d (first case) or all monomials of degree at most d (second case).
But: the feature space becomes very large, even for small d; e.g. for the first variant the number of features is (a short counting sketch follows below):

    n      d     number of features
    3      2                  6
    5      3                 35
    10     3                220
    10     5              2 002
    10    10             92 378
    50     4            292 825
    256    4        183 181 376
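counting sketch (an addition, not from the original slides): the number of monomials of degree d in n input variables is C(n + d − 1, d), which reproduces the table above.

from math import comb

for n, d in [(3, 2), (5, 3), (10, 3), (10, 5), (10, 10), (50, 4), (256, 4)]:
    print(n, d, comb(n + d - 1, d))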

Generic kernels (cont.)

what makes a symmetric function K a kernel function?

• a mapping Φ into some space with a dot product (a Hilbert space) exists such that K(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩

• K meets Mercer's theorem:
  A continuous, symmetric function K : [a, b]ⁿ × [a, b]ⁿ → R is a kernel if and only if for all functions g : [a, b]ⁿ → R with ∫_{[a,b]ⁿ} g(~x)² d~x < ∞ it holds that:

      ∫_{[a,b]ⁿ} ∫_{[a,b]ⁿ} K(~x, ~y) g(~x) g(~y) d~x d~y ≥ 0

  Proof: omitted

consequences: sums of kernels and positive multiples of kernels are also kernels
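a practical check (an addition, not from the original slides): on any finite point set, the Gram matrix of a kernel must be symmetric positive semi-definite, a necessary condition consistent with Mercer's characterization. The polynomial kernel and the random sample below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 random points in R^3

K = (X @ X.T + 1.0) ** 2                      # Gram matrix of (<x, y> + 1)^2
eigvals = np.linalg.eigvalsh(K)               # eigenvalues of the symmetric matrix
print("smallest eigenvalue:", eigvals.min())  # should be >= 0 up to numerical error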

Generic kernels (cont.)

RBF kernels (Gaussian kernels):

    K_Φ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))

σ² > 0 is a kernel parameter, chosen manually

the feature space has infinite dimension

local information processing, circular decision boundaries in input space

what is calculated in an SVM with RBF kernel resembles an RBF network, but training is done in a different way

the kernel parameter controls the complexity of the kernel: if σ² is large, the behavior is similar to the dot product; if σ² is very small, decision boundaries in input space may become very complex (degeneration as σ² → 0)

most popular kernel type in practice
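illustration (an addition, not from the original slides): the RBF kernel and the effect of σ². With a large σ² even distant points still look similar (dot-product-like behavior); with a very small σ² the Gram matrix becomes nearly diagonal, which tends to produce very complex decision boundaries. The values of σ² below are illustrative.

import numpy as np

def rbf_kernel(x, y, sigma2):
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma2))

x, y = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for sigma2 in [0.01, 1.0, 100.0]:
    print(sigma2, round(rbf_kernel(x, y, sigma2), 6))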

Generic kernels (cont.)

sigmoid kernels:

    K_Φ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))

γ > 0 and θ are kernel parameters, chosen manually

the feature space has infinite dimension

what is calculated in an SVM with sigmoid kernel resembles an MLP network with one hidden layer

no practical use due to numerical degeneration of the optimization problem

Generic kernels: summary

dot-product kernel: K_Φ(~x, ~y) = ⟨~x, ~y⟩
input space = feature space; simple, but useful

polynomial kernels: K_Φ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d, K_Φ(~x, ~y) = (⟨~x, ~y⟩)^d
extensions of the dot-product kernel, feature spaces with finite dimension

RBF kernels: K_Φ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))
very important in practice, very flexible

sigmoid kernels: K_Φ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))
limited practical use

problem-dependent kernels

→ demo in practical exercises

Summary (SVMs)

linear classification with an optimal separating hyperplane

mathematical problem description as a convex quadratic program

numerical solving by various approaches, often based on active set methods: libSVM, SMO, ...

representation of the solution in terms of support vectors

soft margin case: classification with errors

kernel trick: implicit calculations in feature space, non-linear SVM

extensions:

• SVM for regression
• SVM for one-class classification

important ideas:

• large margin techniques
• kernel-based techniques

Further readings

Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000

C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998. Available at: http://www.kernel-machines.org

Roger Fletcher, Practical Methods of Optimization, Wiley, 1987