
Introduction to Neuroinformatics:

Support Vector Machines

Prof. Dr. Martin Riedmiller

University of Osnabrück

Institute of Computer Science and Institute of Cognitive Science

Introduction to Neuroinformatics: Support Vector Machines – p. 1

Outline

support vector machines (SVM) for classification: basic ideas

optimization under constraints: Lagrange multipliers, Kuhn-Tucker conditions, the dual, algorithmic approaches

applying optimization theory to SVM

kernel trick: making the linear approach non-linear

variants of SVMs


Linear classification

perceptron revisited: find a hyperplane that classifies a given set of positive and negative training patterns correctly, i.e. find (~w, w0) so that:

⟨~w, ~x⟩ + w0 > 0 for all positive training patterns ~x

⟨~w, ~x⟩ + w0 < 0 for all negative training patterns ~x

perceptron learning finds a solving hyperplane (if one exists) by adding and subtracting patterns to/from the weight vector

if there are several solving hyperplanes, perceptron learning finds an arbitrary one
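The add/subtract rule above can be sketched in a few lines of Python. This is not part of the original slides; the helper name `perceptron` and the toy data set are illustrative.

```python
import numpy as np

def perceptron(X, d, epochs=100):
    """Perceptron learning: add misclassified positive patterns to the
    weight vector, subtract misclassified negative ones, until every
    pattern satisfies d * (<w, x> + w0) > 0."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, d):
            if t * (np.dot(w, x) + w0) <= 0:   # misclassified (or on the plane)
                w += t * x                      # add / subtract the pattern
                w0 += t
                errors += 1
        if errors == 0:                         # all patterns classified correctly
            break
    return w, w0

# linearly separable toy set, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = perceptron(X, d)
print(w, w0)   # some separating hyperplane, not necessarily the best one
```

Note that the returned hyperplane depends on the presentation order of the patterns, which is exactly the arbitrariness mentioned above.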


Linear classification (cont.)

Which is the best separating hyperplane?

[figure: two sets of training patterns (+ and −) with several candidate separating hyperplanes]

the performance on the training set is equally good, but what do you expect for an independent test set?

Linear classification (cont.)

for each separating hyperplane we can define the margin: the margin is the smallest distance of a training pattern to the separating hyperplane.

optimality: the larger the margin, the smaller the risk of misclassification.

[figure: training patterns (+ and −) with separating hyperplanes of different margins]

Support vector machines

support vector machines (SVM) (Vapnik, 1970s): find a separating hyperplane that maximizes the margin.

notation:

• training set D = {(~x(1), d(1)), . . . , (~x(p), d(p))}, ~x(i) ∈ Rn, d(i) ∈ {−1, +1}

• separating hyperplane defined by weight vector ~w and bias weight b (= w0, to be in line with the SVM literature)

• margin ρ

distance of a point ~x to a hyperplane (~w, b):

|⟨~x, ~w⟩ + b| / ||~w||
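The distance formula can be checked numerically. The sketch below (with an illustrative hyperplane, not taken from the slides) also shows that scaling (~w, b) by a positive factor leaves the distance unchanged, which becomes important on the next slide.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance of point x to the hyperplane {y : <w, y> + b = 0}."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# illustrative hyperplane x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1
w, b = np.array([1.0, 1.0]), -1.0
origin = np.zeros(2)
print(distance_to_hyperplane(origin, w, b))          # 1/sqrt(2) ≈ 0.7071
# scaling (w, b) by a positive factor leaves the distance unchanged
print(distance_to_hyperplane(origin, 3 * w, 3 * b))  # same value
```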

Support vector machines

find the separating hyperplane with maximal margin:

maximize over ~w, b, ρ:  ρ²

subject to  ρ > 0

d(i) · (⟨~x(i), ~w⟩ + b) / ||~w|| ≥ ρ  for all (~x(i), d(i)) ∈ D

this task contains a degree of freedom: ~w and b can be scaled by a positive number without changing the separating hyperplane. We can remove this degree of freedom and simplify the problem by adding a constraint on the length of ~w. Here:

||~w|| = 1/ρ

Support vector machines (cont.)

optimization problem with length constraint:

maximize over ~w, b, ρ:  ρ² = 1/||~w||²

subject to  ρ > 0,  ||~w|| = 1/ρ

d(i) · (⟨~x(i), ~w⟩ + b) / ||~w|| ≥ ρ = 1/||~w||  for all (~x(i), d(i)) ∈ D

rewriting the optimization problem, the constraint becomes d(i) · (⟨~x(i), ~w⟩ + b) ≥ 1.

Finally, we can make it a minimization task by minimizing the reciprocal of the objective 1/||~w||², i.e. minimizing ||~w||².

Support vector machines (cont.)

the minimization task:

minimize over ~w, b, ρ:  (1/2) ||~w||²

subject to  d(i) · (⟨~x(i), ~w⟩ + b) ≥ 1  for all (~x(i), d(i)) ∈ D

ρ = 1/||~w||

ρ occurs only in one equality constraint, i.e. we can solve the optimization problem for ~w and b and calculate ρ afterwards from ~w

Support vector machines (cont.)

final mathematical form of the minimization task:

minimize over ~w, b:  (1/2) ||~w||²

subject to  d(i) · (⟨~x(i), ~w⟩ + b) ≥ 1  for all (~x(i), d(i)) ∈ D

the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM).
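As an illustration, this minimization task can be handed to a generic constrained solver. The sketch below uses scipy.optimize.minimize with SLSQP on an assumed toy data set; a dedicated quadratic-programming solver would normally be used, so treat this as a demonstration of the problem formulation only.

```python
import numpy as np
from scipy.optimize import minimize

# toy training set, labels d(i) in {-1, +1} (illustrative, symmetric about 0)
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [-2.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):                 # z = (w1, w2, b); minimize (1/2)||w||^2
    w = z[:2]
    return 0.5 * np.dot(w, w)

constraints = [                   # d(i) * (<x(i), w> + b) >= 1
    {"type": "ineq", "fun": (lambda z, x=x, t=t: t * (np.dot(x, z[:2]) + z[2]) - 1)}
    for x, t in zip(X, d)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
rho = 1.0 / np.linalg.norm(w)     # margin recovered from ||w|| = 1/rho
print(w, b, rho)                  # w ≈ (0.5, 0.5), b ≈ 0 for this data
```

For this symmetric toy set the maximal-margin hyperplane passes through the origin, and every margin constraint holds with value at least 1 at the solution.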


Optimization Theory


Optimization theory

a general form of an optimization problem:

minimize over ~x:  f(~x)

subject to  ~x ∈ Ω

Ω ⊆ Rn: the set of feasible (permitted) points

f : Rn → R: the target function that should be minimized

maximization problems can be described in the same way (maximize f(~x) ⇔ minimize −f(~x))

often, Ω is implicitly given by equality and inequality constraints

Local and global minima

a feasible point ~x ∈ Ω is called a global minimum if for all ~y ∈ Ω holds:

f(~x) ≤ f(~y)

a feasible point ~x ∈ Ω is called a local minimum if there is a radius r > 0 so that for all points ~y ∈ Ω with ||~x − ~y|| < r holds:

f(~x) ≤ f(~y)

[figure: a function f over a feasible set Ω with two local minima and one global minimum]

Convexity

A set X is called a convex set if for all points ~x, ~y ∈ X and any 0 ≤ θ ≤ 1 holds:

θ~x + (1 − θ)~y ∈ X

intuitively: a set is convex if all points between two elements of the set are also elements of the set.

[figure: a set X with points ~x, ~y ∈ X and the connecting point θ~x + (1 − θ)~y]

Convexity (cont.)

[figure: examples of convex sets and non-convex sets]

Convexity (cont.)

intersections of convex sets are also convex. Proof: → blackboard

unions of convex sets are not guaranteed to be convex.

Convexity (cont.)

A function f is called a convex function if for all points ~x, ~y and all 0 ≤ θ ≤ 1 holds:

f(θ~x + (1 − θ)~y) ≤ θf(~x) + (1 − θ)f(~y)

[figure: a convex function f; the chord between (~x, f(~x)) and (~y, f(~y)) lies above the graph]

Convexity (cont.)

[figure: examples of convex functions and non-convex functions]

Convexity (cont.)

sums of convex functions are also convex. Proof: → blackboard

positive multiples of convex functions are also convex. Proof: → for your own practice (very simple)

differences of convex functions are not guaranteed to be convex.

Convexity (cont.)

linear functions are convex. Proof: → for your own practice (very simple)

the function f : x ↦ x² is a convex function. Proof:

f(θx + (1 − θ)y) − (θf(x) + (1 − θ)f(y))
= (θx + (1 − θ)y)² − (θx² + (1 − θ)y²) = . . .
= −θ(1 − θ)(x² − 2xy + y²) = −θ(1 − θ)(x − y)² ≤ 0

the function f : ~x ↦ ||~x||² is a convex function. Proof: follows from above

if f is a convex function, then the set {~x | f(~x) ≤ 0} is a convex set. Proof: → blackboard
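The defining inequality can also be tested numerically on sampled points. This grid check (an illustration, not part of the slides) is only a necessary test, not a proof of convexity; the function choices are illustrative.

```python
import numpy as np

def is_convex_on_samples(f, xs, thetas):
    """Check the convexity inequality
    f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    on all sampled pairs (a necessary, not sufficient, test)."""
    for x in xs:
        for y in xs:
            for th in thetas:
                lhs = f(th * x + (1 - th) * y)
                rhs = th * f(x) + (1 - th) * f(y)
                if lhs > rhs + 1e-12:
                    return False
    return True

xs = np.linspace(-3, 3, 13)
thetas = np.linspace(0, 1, 11)
print(is_convex_on_samples(lambda x: x**2, xs, thetas))    # True
print(is_convex_on_samples(lambda x: -x**2, xs, thetas))   # False
```

For −x² the difference computed in the proof above flips sign, which is exactly what the sampled check detects.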


Convexity and minima

Lemma: if Ω is a convex set and f is a convex function, each local minimum of f w.r.t. Ω is also a global minimum.

Proof: → blackboard

this lemma does not mean that for each convex set and each convex function a minimum exists!

there are conditions under which the existence of a global minimum can be guaranteed (e.g. Ω non-empty and compact, f continuous)

Convexity and minima (cont.)

Lemma: If Ω is a convex set and f is a convex function w.r.t. Ω, each point satisfying grad f(~x) = ~0 is a local minimum.

Proof: → omitted (literature about convex functions)

Constraints

often, the set Ω is given implicitly by a set of constraints g1, . . . , gk and h1, . . . , hm in the following form:

Ω = (⋂_{j=1}^{k} {~x | gj(~x) ≤ 0}) ∩ (⋂_{i=1}^{m} {~x | hi(~x) = 0})

gj and hi are real-valued functions

• constraints of the form gj(~x) ≤ 0 are called inequality constraints

• constraints of the form hi(~x) = 0 are called equality constraints

equality constraints hi(~x) = 0 can be replaced by two inequality constraints: hi(~x) ≤ 0 and −hi(~x) ≤ 0.

Characterizing local minima

for a feasible point ~x ∈ Ω we call a direction ~v ∈ Rn \ {~0} a feasible direction if there exists r > 0 so that ~x + θ~v is a feasible point for all 0 ≤ θ ≤ r

for a feasible point ~x ∈ Ω we call a direction ~v ∈ Rn \ {~0} a descending direction if there exists r > 0 so that f(~x + θ~v) < f(~x) for all 0 < θ ≤ r

Characterizing local minima (cont.)

a local minimum of f with respect to Ω is a feasible point for which the intersection of feasible and descending directions is empty.

there is no feasible and descending direction in ~x ⇔ there is a small area around ~x where we do not find any feasible point for which the value of f is smaller ⇔ local minimum

[figure: boundary points of Ω labelled "no minimum", "minimum (no descending directions)", and "local minimum (descending directions non-feasible)"]

Linear constraints, differentiable target function

theory simplifies if the target function f is differentiable and the constraints are linear:

minimize over ~x:  f(~x)

subject to  gj(~x) ≤ 0,  j = 1, . . . , k

hi(~x) = 0,  i = 1, . . . , m

with gj(~x) = gj,0 + gj,1 x1 + gj,2 x2 + · · · + gj,n xn

(gj,0, . . . , gj,n are the coefficients of the offset and slope of gj)

hi(~x) = hi,0 + hi,1 x1 + hi,2 x2 + · · · + hi,n xn

if f is a linear function, these tasks are called linear programs

if f is a quadratic function, these tasks are called quadratic programs

Linear constraints, differentiable target function (cont.)

what are the descending directions at point ~x? A direction ~v is descending if

⟨~v, grad f(~x)⟩ < 0

Linear constraints, differentiable target function (cont.)

what are the feasible directions at point ~x?

• active and inactive constraints: an inequality constraint g is called active at point ~x if g(~x) = 0. Otherwise it is called inactive.

• inactive constraints: inactive constraints do not restrict feasible directions

• active constraints: directions ~v with ⟨~v, grad g(~x)⟩ > 0 are infeasible

[figure: feasible directions at a boundary point relative to grad g(~x)]

Linear constraints, differentiable target function (cont.)

what are the feasible directions at point ~x? A direction ~v is feasible in ~x if for all constraints gi holds:

gi is inactive or ⟨~v, grad gi(~x)⟩ ≤ 0
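Both direction tests reduce to dot products with gradients. The following minimal sketch (the example point, constraint, and helper names are illustrative, not from the slides) checks one candidate direction:

```python
import numpy as np

def is_descending(v, grad_f):
    """Descending direction: <v, grad f(x)> < 0."""
    return np.dot(v, grad_f) < 0

def is_feasible_direction(v, active_grads):
    """Feasible direction w.r.t. linear inequality constraints:
    <v, grad g(x)> <= 0 for every constraint g active at x."""
    return all(np.dot(v, g) <= 0 for g in active_grads)

# assumed example: f(x) = x1^2 + x2^2 at x = (1, -1), so grad f = (2, -2);
# active constraint g(x) = -x1 + x2 + 2 with grad g = (-1, 1)
grad_f = np.array([2.0, -2.0])
active = [np.array([-1.0, 1.0])]

v = np.array([-1.0, 1.0])                # points back toward the origin
print(is_descending(v, grad_f))          # True:  <v, grad f> = -4 < 0
print(is_feasible_direction(v, active))  # False: <v, grad g> = 2 > 0
```

The direction toward the unconstrained minimum is descending but not feasible here, which is the situation characterized on the following slides.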


Characterizing local minima

local minima are feasible points at which the intersection of descending directions and feasible directions is empty.

examples with no active constraints:

grad f(~x) ≠ ~0 ⇒ no minimum

grad f(~x) = ~0 ⇒ minimum

Characterizing local minima (cont.)

examples with one active constraint:

[figure: grad f(~x), grad g(~x), descending and feasible directions]

−grad f = λ grad g with λ < 0: all descending directions feasible, no minimum

−grad f = λ grad g + ~u with λ < 0 and ~u ⊥ grad g, ~u ≠ ~0: some descending directions feasible, no minimum

Characterizing local minima (cont.)

[figure: grad f(~x), grad g(~x), descending and feasible directions]

−grad f = λ grad g + ~u with λ > 0 and ~u ⊥ grad g, ~u ≠ ~0: some descending directions feasible, no minimum

−grad f = λ grad g with λ > 0: all descending directions non-feasible, minimum

Characterizing local minima (cont.)

grad f(~x) = ~0: −grad f = ~0 = 0 · grad g, no descending directions, minimum

For one active constraint we found: decomposing −grad f(~x) = λ grad g(~x) + ~u with ~u ⊥ grad g(~x),

only if λ < 0 or ~u ≠ ~0 do we find feasible descending directions

~x is a minimum if we find λ ≥ 0 with grad f(~x) + λ grad g(~x) = ~0

Remark: this principle can be generalized to more active constraints

Characterizing local minima (cont.)

examples with two active constraints:

[figure: grad f(~x), grad g1(~x), grad g2(~x), descending and feasible directions]

−grad f = λ1 grad g1 + λ2 grad g2 with λ1 < 0, λ2 ≥ 0: some descending directions feasible, no minimum

−grad f = λ1 grad g1 + λ2 grad g2 with λ1 ≥ 0, λ2 ≥ 0: all descending directions non-feasible, minimum

Characterizing local minima (cont.)

examples with an equality constraint:

[figure: grad f(~x), grad h(~x), descending and feasible directions]

−grad f = λ grad h + ~u with ~u ⊥ grad h, ~u ≠ ~0: one descending direction feasible, no minimum

−grad f = λ grad h: all descending directions non-feasible, minimum

Characterizing local minima (cont.)

general principle for problems with a differentiable target function, linear equality constraints hi and linear inequality constraints gj:

A feasible point ~x is a minimum if there exist αj ≥ 0 and βi so that:

grad f(~x) + Σ_{active inequality constraints j} (αj · grad gj(~x)) + Σ_{i=1}^{m} (βi · grad hi(~x)) = ~0

(corollary to Farkas' lemma)

Lagrange function:

L(~x, ~α, ~β) = f(~x) + Σ_{j=1}^{k} (αj · gj(~x)) + Σ_{i=1}^{m} (βi · hi(~x))

αj, βi are called Lagrange multipliers

(Karush-)Kuhn-Tucker conditions

rewriting these conditions yields the (K)KT conditions:

gj(~x) ≤ 0 for all inequality constraints

hi(~x) = 0 for all equality constraints

αj · gj(~x) = 0 for all inequality constraints

∂L(~x, ~α, ~β) / ∂xi = 0 for all i = 1, . . . , n

αj ≥ 0 for all inequality constraints

Lemma: A point ~x is a minimum if and only if there exist αj and βi so that the KT conditions are met.

Proof: omitted, see literature

Kuhn-Tucker conditions (cont.)

the KT conditions do not give an algorithm to find the minima, but they allow us to check whether a given point is a minimum

example:

minimize over x1, x2:  x1² + x2²

subject to  −x1 + x2 + 2 ≤ 0

−2x1 − x2 − 2 ≤ 0

Lagrange function, KT conditions, check whether (0, 0), (4, 2), (1, −1) are minima → blackboard
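The blackboard derivation is not reproduced here, but the KT conditions for this example can be verified numerically. The helper `kt_point` below is an illustrative sketch: it checks feasibility, determines the active constraints, and solves the stationarity equation for non-negative multipliers.

```python
import numpy as np

# minimize f(x) = x1^2 + x2^2
# subject to g1(x) = -x1 + x2 + 2 <= 0,  g2(x) = -2*x1 - x2 - 2 <= 0
grad_f = lambda x: np.array([2 * x[0], 2 * x[1]])
gs = [lambda x: -x[0] + x[1] + 2, lambda x: -2 * x[0] - x[1] - 2]
grad_gs = [np.array([-1.0, 1.0]), np.array([-2.0, -1.0])]

def kt_point(x, tol=1e-9):
    """Check the KT conditions: feasibility, complementary slackness
    (alpha_j = 0 for inactive constraints) and stationarity
    grad f(x) + sum_j alpha_j grad g_j(x) = 0 with alpha_j >= 0."""
    vals = [g(x) for g in gs]
    if any(v > tol for v in vals):                    # infeasible point
        return False
    active = [j for j, v in enumerate(vals) if abs(v) <= tol]
    if not active:                                    # interior: need grad f = 0
        return bool(np.allclose(grad_f(x), 0))
    # solve grad f + sum_j alpha_j grad g_j = 0 over the active constraints
    A = np.column_stack([grad_gs[j] for j in active])
    alpha = np.linalg.lstsq(A, -grad_f(x), rcond=None)[0]
    stationary = np.allclose(A @ alpha, -grad_f(x))
    return bool(stationary and np.all(alpha >= -tol))

print(kt_point(np.array([0.0, 0.0])))   # False: infeasible (g1(0,0) = 2 > 0)
print(kt_point(np.array([4.0, 2.0])))   # False: stationarity cannot be met
print(kt_point(np.array([1.0, -1.0])))  # True: alpha1 = 2, alpha2 = 0
```

At (1, −1) only g1 is active, and α1 = 2 ≥ 0 makes grad f + α1 grad g1 vanish, so the point satisfies all KT conditions.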


Ways to find the minimum

brute force approach

• equality constraints are always active

• inequality constraints may be active or inactive

• at the minimum ~x∗, a certain subset A of inequality constraints is active

• if we knew A and f looked “nice”, we could find ~x∗ analytically

• main idea: check all possible combinations of active constraints for the minimum

example → blackboard

can be applied only to linear and quadratic target functions

the number of possible combinations of active constraints grows exponentially
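The brute-force idea can be sketched for a quadratic target f(~x) = ||~x||² with linear inequality constraints aᵢᵀ~x + bᵢ ≤ 0 (a toy illustration, reusing the constraints of the Kuhn-Tucker example above): for every subset A treated as active, solve the linear KKT system and keep the candidate if it is feasible and all multipliers are non-negative.

```python
import itertools
import numpy as np

# constraints a_i^T x + b_i <= 0 (example: -x1 + x2 + 2 <= 0, -2 x1 - x2 - 2 <= 0)
A = np.array([[-1.0, 1.0], [-2.0, -1.0]])
b = np.array([2.0, -2.0])

def brute_force_min(A, b):
    n = A.shape[1]
    best = None
    for active in itertools.chain.from_iterable(
            itertools.combinations(range(len(b)), k) for k in range(len(b) + 1)):
        Aa, ba = A[list(active)], b[list(active)]
        # KKT system for f(x) = ||x||^2:  2x + Aa^T alpha = 0,  Aa x + ba = 0
        K = np.block([[2 * np.eye(n), Aa.T],
                      [Aa, np.zeros((len(active), len(active)))]])
        rhs = np.concatenate([np.zeros(n), -ba])
        try:
            sol = np.linalg.solve(K, rhs)
        except np.linalg.LinAlgError:
            continue
        x, alpha = sol[:n], sol[n:]
        # keep candidate only if feasible and all multipliers non-negative
        if np.all(A @ x + b <= 1e-9) and np.all(alpha >= -1e-9):
            if best is None or x @ x < best @ best:
                best = x
    return best

print(brute_force_min(A, b))  # the minimum (1, -1), with only the first constraint active
```

The loop over subsets is exactly the exponential enumeration mentioned above; with k constraints there are 2ᵏ candidate active sets.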

Ways to find the minimum (cont.)

active set methods: incrementally build the set of active constraints

start with a feasible point ~x and A = ∅

perform a line search along the steepest descent direction that is consistent with the constraints in A

if a constraint ∉ A is violated, it is added to A

if no step is possible, Lagrange multipliers are calculated to check whether we found the minimum.

if necessary, constraints are removed from A

sketch of algorithm → next slide

Ways to find the minimum (cont.)

1: repeat
2:   calculate direction of maximal descent
3:   project direction of maximal descent onto boundaries of constraints in A (yields search direction ~v)
4:   if ~v ≠ ~0 then
5:     calculate minimum on ray from ~x into direction ~v (yields step length τ)
6:     check whether constraints are violated when the step is performed. If yes, shorten τ and add the restricting constraint to A
7:     perform step from ~x to ~x + τ~v
8:   else
9:     calculate Lagrange multipliers
10:    if Lagrange multipliers are all non-negative then
11:      minimum found, stop.
12:    else
13:      remove a constraint with negative Lagrange multiplier from A
14:    end if
15:  end if
16: until minimum found

Duality

a mathematical concept to solve a task different from the original task. By doing so, the original task is implicitly solved.

the original task is called the primal problem, the other problem is called the dual problem

primal task: (reminder)

minimize_{~x}  f(~x)

subject to  gj(~x) ≤ 0   j = 1, . . . , k

            hi(~x) = 0   i = 1, . . . , m

the Lagrange function of the primal: (reminder)

L(~x, ~α, ~β) = f(~x) + Σ_{j=1}^k (αj · gj(~x)) + Σ_{i=1}^m (βi · hi(~x))

Duality (cont.)

the function Q:

Q(~α, ~β) = inf_{~x ∈ Rⁿ} L(~x, ~α, ~β)

remarks:

• the infimum is the largest lower bound of L; if no lower bound exists, we say the infimum is −∞

• the infimum is taken over all points ~x, no matter whether these points are feasible or infeasible w.r.t. the primal.

dual task:

maximize_{~α, ~β}  Q(~α, ~β)

subject to  αi ≥ 0 for all i = 1, . . . , k

Duality (cont.)

Lemma:
If ~x′ is a feasible point of the primal task and (~α′, ~β′) is a feasible point of the dual task, then:

Q(~α′, ~β′) ≤ f(~x′)

Proof:

Q(~α′, ~β′) ≤ L(~x′, ~α′, ~β′)

            = f(~x′) + Σ_{i=1}^k (α′i · gi(~x′)) + Σ_{j=1}^m (β′j · hj(~x′))

            ≤ f(~x′)

since α′i ≥ 0 and gi(~x′) ≤ 0 for all inequality constraints, and hj(~x′) = 0 for all equality constraints.
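A small numeric illustration of this lemma (a sketch, with assumed constraint data: for f(~x) = ||~x||² with linear inequality constraints aᵢᵀ~x + bᵢ ≤ 0, the infimum in Q has a closed form, since L is minimized at ~x = −½ Σ αᵢ aᵢ, giving Q(~α) = −¼ ||Σ αᵢ aᵢ||² + Σ αᵢ bᵢ):

```python
import numpy as np

# constraints a_i^T x + b_i <= 0: here -x1 + x2 + 2 <= 0 and -2 x1 - x2 - 2 <= 0
A = np.array([[-1.0, 1.0], [-2.0, -1.0]])
b = np.array([2.0, -2.0])

f = lambda x: x @ x

def Q(alpha):
    # dual function for f(x) = ||x||^2: the infimum of the Lagrangian in closed form
    s = A.T @ alpha
    return -0.25 * s @ s + alpha @ b

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.normal(size=2) * 5
    if np.all(A @ x + b <= 0):              # x feasible for the primal
        alpha = rng.uniform(0, 3, size=2)   # alpha feasible for the dual (alpha >= 0)
        assert Q(alpha) <= f(x) + 1e-12     # weak duality: Q(alpha) <= f(x)
print("weak duality verified on random feasible points")
```

At ~α = (2, 0) one gets Q = 2 = f((1, −1)), i.e. the dual bound is tight at the optimum, which is the corollary on the next slide.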

Duality (cont.)

Corollary:
If ~x′ is a feasible point of the primal task, (~α′, ~β′) is a feasible point of the dual task, and f(~x′) = Q(~α′, ~β′), then ~x′ is a solution of the primal problem and (~α′, ~β′) is a solution of the dual problem.

Proof:
For any feasible ~x:

f(~x) ≥ Q(~α′, ~β′) = f(~x′)

For any feasible (~α, ~β):

Q(~α, ~β) ≤ f(~x′) = Q(~α′, ~β′)

Potential problem: points ~x′ and (~α′, ~β′) do not necessarily exist for general problems ⇒ restrictions on f and the type of constraints.

Duality (cont.)

Lemma:
If f is convex and differentiable, gj, hi are linear, and ~x′ is a solution of the primal problem, then a solution (~α′, ~β′) of the dual exists and

f(~x′) = Q(~α′, ~β′)

Proof:

~x′ solution of the primal ⇒ there exist (~α′, ~β′) that meet the KT conditions

f convex, gj, hi linear ⇒ L(~x, ~α′, ~β′) convex in ~x

from the KT conditions: ∂L/∂xi (~x′, ~α′, ~β′) = 0 ⇒ L(~x′, ~α′, ~β′) = Q(~α′, ~β′)

Q(~α′, ~β′) = L(~x′, ~α′, ~β′) = f(~x′) ⇒ (~α′, ~β′) is a solution of the dual

the proof shows: the KT conditions are the link between primal solution and dual solution

solutions do not need to be unique

Duality (cont.)

main message from duality: under certain circumstances,

• you can solve the primal and get the solution of the dual for free

• you can solve the dual and get the solution of the primal for free

• you can use the Kuhn-Tucker conditions to transform the primal solution into a dual solution and vice versa

Summary (optimization theory)

general ideas about optimization (local, global minima)

convex problems: each local minimum is a global minimum

general characterization of minima (feasible directions, descending directions, Kuhn-Tucker)

algorithms to find the minimum (brute force, active set)

duality

literature: Roger Fletcher, Practical methods of optimization, Wiley 1991

Applying Results of Optimization Theory to Support Vector Machines

Support vector machines

final mathematical form of the minimization task (reminder):

minimize_{~w,b}  (1/2) ||~w||²

subject to  d(i) (⟨~x(i), ~w⟩ + b) ≥ 1   for i = 1, . . . , p

the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as a support vector machine (SVM)

Support vector machines (cont.)

transforming into the appropriate form:

minimize_{~w,b}  (1/2) ||~w||²

subject to  1 − d(i) (⟨~x(i), ~w⟩ + b) ≤ 0   for i = 1, . . . , p

note:

• target function is convex

• constraints are linear

• hence, local minima are also global minima (if feasible points exist)

Support vector machines (cont.)

Lagrange function:

L(~w, b, ~α) = (1/2) ||~w||² + Σ_{i=1}^p αi (1 − d(i) (⟨~x(i), ~w⟩ + b))

            = (1/2) ||~w||² + Σ_{i=1}^p αi − Σ_{i=1}^p αi d(i) ⟨~x(i), ~w⟩ − b Σ_{i=1}^p αi d(i)

derivatives:

∂L/∂wj = wj − Σ_{i=1}^p αi d(i) x(i)j

∂L/∂b = − Σ_{i=1}^p αi d(i)

Support vector machines (cont.)

calculating Q(~α) = inf_{~w,b} L(~w, b, ~α)

L is convex. Hence, if we can find a point that zeros the partial derivatives, it is the minimum

zeroing the derivatives and resolving w.r.t. ~w and b:

wj = Σ_{i=1}^p αi d(i) x(i)j    i.e.    ~w = Σ_{i=1}^p αi d(i) ~x(i)

0 = Σ_{i=1}^p αi d(i)

strange equation, b is lost, what does it mean?

Support vector machines (cont.)

L(~w, b, ~α) = (1/2) ||~w||² + Σ_{i=1}^p αi − Σ_{i=1}^p αi d(i) ⟨~x(i), ~w⟩ − b Σ_{i=1}^p αi d(i)

~w = Σ_{i=1}^p αi d(i) ~x(i)        0 = Σ_{i=1}^p αi d(i)

first case: Σ_{i=1}^p αi d(i) = 0

in this case, we can find ~w that zeros the partial derivatives for any value of b; it is the minimum (= infimum). Substituting ~w by Σ_{i=1}^p αi d(i) ~x(i) yields:

Q(~α) = −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨~x(i), ~x(j)⟩) + Σ_{i=1}^p αi

second case: Σ_{i=1}^p αi d(i) ≠ 0

no matter how ~w is chosen, we can make L arbitrarily small by varying b. Hence,

Q(~α) = −∞

Support vector machines (cont.)

the dual problem:

maximize_{~α}  Q(~α)

where

Q(~α) = −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨~x(i), ~x(j)⟩) + Σ_{i=1}^p αi   if Σ_{i=1}^p αi d(i) = 0

Q(~α) = −∞   otherwise

subject to  αi ≥ 0 for all i = 1, . . . , p

Support vector machines (cont.)

since the second case of the case distinction will never attain the maximum, we can rewrite the dual:

maximize_{~α}  −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨~x(i), ~x(j)⟩) + Σ_{i=1}^p αi

subject to  Σ_{i=1}^p αi d(i) = 0

            αi ≥ 0 for all i = 1, . . . , p

note: the dual is a problem in ~α alone; ~w and b do not occur

for now, there is no advantage in solving the dual instead of the primal problem. Later on, we will see the advantage of the dual.

for both the dual and the primal problem, powerful algorithmic solution methods exist (e.g. active set methods)

Support vector machines (cont.)

back to the primal, KT conditions:

1 − d(i) ⟨~x(i), ~w⟩ − d(i) b ≤ 0   for all i = 1, . . . , p

αi · (1 − d(i) ⟨~x(i), ~w⟩ − d(i) b) = 0   for all i = 1, . . . , p

~w = Σ_{i=1}^p (αi d(i) ~x(i))

0 = Σ_{i=1}^p αi d(i)

αi ≥ 0   for all i = 1, . . . , p

what are ~w and b if αi are already known?

Support vector machines (cont.)

resolving the KT conditions w.r.t. ~w and b:

• ~w = Σ_{i=1}^p (αi d(i) ~x(i))

• for i with αi ≠ 0 we get from the complementary condition:

  1 − d(i) ⟨~x(i), ~w⟩ − d(i) b = 0

  Hence:

  b = 1/d(i) − ⟨~x(i), ~w⟩ = 1/d(i) − Σ_{j=1}^p (αj d(j) ⟨~x(i), ~x(j)⟩)

the margin ρ:

ρ = 1/||~w|| = 1 / √( Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨~x(i), ~x(j)⟩) )

Support vector machines (cont.)

some constraints become active (αi > 0), some inactive (αi = 0)

each constraint refers to one pattern

a constraint is active if the respective pattern is located next to the separating hyperplane

patterns that refer to active constraints are called support vectors

the solution only depends on the support vectors

[figure: positives and negatives separated by the optimal separating hyperplane; the support vectors (sv) lie on the margin boundaries on both sides]

Lemma: Removing non-support vectors from a classification task does not change the solution found by a support vector machine.

Support vector machines (cont.)

example (blackboard):

i    x(i)1   x(i)2   d(i)
1      0       0      −1
2      0       1      −1
3     −1       1      −1
4      1       0      +1
5      2       1      +1

solution:

w1 = 2, w2 = 0, b = −1

support vectors: ~x(1), ~x(2), ~x(4)
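This example can be cross-checked by solving the dual derived above numerically (a sketch assuming SciPy is available; ~w and b are then recovered from the KT conditions):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0, 0], [0, 1], [-1, 1], [1, 0], [2, 1]], dtype=float)
d = np.array([-1, -1, -1, 1, 1], dtype=float)
G = (d[:, None] * d[None, :]) * (X @ X.T)    # G_ij = d_i d_j <x_i, x_j>

# dual: maximize -1/2 a'Ga + sum(a)  s.t.  sum_i a_i d_i = 0,  a >= 0
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               x0=np.zeros(5),
               jac=lambda a: G @ a - np.ones(5),
               bounds=[(0, None)] * 5,
               constraints=[{"type": "eq", "fun": lambda a: a @ d}],
               method="SLSQP", options={"ftol": 1e-12, "maxiter": 500})
alpha = res.x
w = (alpha * d) @ X                  # w = sum_i alpha_i d_i x_i
s = int(np.argmax(alpha))            # pick a support vector (alpha_s > 0)
b = 1.0 / d[s] - X[s] @ w            # b from the complementary condition
print(w, b)  # close to w = (2, 0), b = -1
```

The recovered hyperplane matches the blackboard solution; the non-zero αi mark the support vectors.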

Fault tolerant SVMs

Soft margin SVM

SVM discussed up to here (the so-called “hard margin” case): perfect classification

what about data which are not linearly separable?

soft margin SVM

opposing aspects: maximizing the margin vs. minimizing the errors

[figure: non-separable data; patterns on the wrong side of the separating hyperplane or within the margin band contribute an additional error]

Soft margin SVM (cont.)

mathematical model of errors: slack variables ξi. ξi models the error that is made for the training pattern ~x(i)

extended constraints:

d(i) · (⟨~w, ~x(i)⟩ + b) ≥ 1 − ξi

ξi ≥ 0

extended optimization target:

minimize_{~w,b,~ξ}  (1/2) ||~w||² + C Σ_{i=1}^p ξi²

C > 0 is a fixed constant to balance margin and errors

Soft margin SVM (cont.)

primal of soft margin classification:

minimize_{~w,b,~ξ}  (1/2) ||~w||² + C Σ_{i=1}^p ξi²

subject to  d(i) (⟨~x(i), ~w⟩ + b) ≥ 1 − ξi   for i = 1, . . . , p

            ξi ≥ 0   for i = 1, . . . , p

Lagrange function:

L(~w, b, ~ξ, ~α, ~β) = (1/2) ||~w||² + C Σ_{i=1}^p ξi²
                     + Σ_{i=1}^p (αi (1 − ξi − d(i) (⟨~x(i), ~w⟩ + b))) + Σ_{i=1}^p (βi · (−ξi))

primal variables: ~w, b, ~ξ. Lagrange multipliers: ~α, ~β

Soft margin SVM (cont.)

KT conditions:

1 − ξi − d(i) (⟨~x(i), ~w⟩ + b) ≤ 0   for i = 1, . . . , p

−ξi ≤ 0   for i = 1, . . . , p

αi · (1 − ξi − d(i) (⟨~x(i), ~w⟩ + b)) = 0   for i = 1, . . . , p

βi · (−ξi) = 0   for i = 1, . . . , p

∂L/∂wj = wj − Σ_{i=1}^p αi d(i) x(i)j = 0   for j = 1, . . . , n

∂L/∂b = − Σ_{i=1}^p αi d(i) = 0

∂L/∂ξi = 2C ξi − αi − βi = 0   for i = 1, . . . , p

αi ≥ 0   for i = 1, . . . , p

βi ≥ 0   for i = 1, . . . , p

Soft margin SVM (cont.)

deriving the solution from the KT conditions. We get ~w and ξi directly:

~w = Σ_{i=1}^p (αi d(i) ~x(i))

ξi = (αi + βi) / (2C)

to derive b, we exploit the complementary condition for a pattern with non-zero αi:

b = (1 − ξi)/d(i) − ⟨~x(i), ~w⟩ = (1 − (αi + βi)/(2C)) / d(i) − Σ_{j=1}^p (αj d(j) ⟨~x(i), ~x(j)⟩)

Soft margin SVM (cont.)

which patterns become support vectors?

• patterns on the boundary of the margin band

• misclassified patterns and patterns within the margin band

Soft margin SVM (cont.)

calculating the function Q(~α, ~β) = inf_{~w,b,~ξ} L(~w, b, ~ξ, ~α, ~β)

first case: Σ_{i=1}^p αi d(i) = 0:

Then the minimum exists, with ~w = Σ_{i=1}^p αi d(i) ~x(i), ξi = (αi + βi)/(2C), and b arbitrary.

Q(~α, ~β) = −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨~x(i), ~x(j)⟩) − (1/2) Σ_{i=1}^p (αi + βi)²/(2C) + Σ_{i=1}^p αi

second case: Σ_{i=1}^p αi d(i) ≠ 0.

Here L is unbounded below: Q(~α, ~β) = −∞

observation: the second case does not contribute to the maximum of Q

Soft margin SVM (cont.)

the dual:

maximize_{~α,~β}  ( −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨~x(i), ~x(j)⟩) − (1/2) Σ_{i=1}^p (αi + βi)²/(2C) + Σ_{i=1}^p αi )

subject to  Σ_{i=1}^p αi d(i) = 0

            αi ≥ 0 for all i = 1, . . . , p

            βi ≥ 0 for all i = 1, . . . , p

Soft margin SVM (cont.)

there is a variant of the soft margin SVM that uses the following optimization target:

minimize_{~w,b,~ξ}  (1/2) ||~w||² + C Σ_{i=1}^p ξi

instead of:

minimize_{~w,b,~ξ}  (1/2) ||~w||² + C Σ_{i=1}^p ξi²

advantage: the calculation of the solution simplifies at a certain point and the solution is more robust w.r.t. outliers.

C controls the balance between errors and margin width:

• large C prefers small errors

• small C prefers a large margin

an adequate value of C has to be optimized experimentally

→ demo in practical exercises
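A small demo of this tradeoff (a sketch assuming scikit-learn is available; its SVC with a linear kernel implements the linear-slack variant above): on non-separable toy data, a small C yields a smaller ||~w||, i.e. a wider margin, than a large C.

```python
import numpy as np
from sklearn.svm import SVC

# non-separable toy data: one negative "outlier" sits inside the positive region
X = np.array([[-2.0, 0.0], [-1.0, 1.0], [-1.5, -0.5],
              [1.0, 0.0], [2.0, 1.0], [1.5, -0.5], [1.4, 0.3]])
d = np.array([-1, -1, -1, 1, 1, 1, -1])   # last point is the outlier

norms = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, d)
    norms[C] = np.linalg.norm(clf.coef_)   # margin width is 1 / ||w||

# small C tolerates the outlier's error and keeps the margin wide
print(norms)
```

The particular dataset and C values are illustrative assumptions; in practice C is tuned e.g. by cross-validation.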

Non-linear SVMs

Non-linear SVMs

linear discrimination is not appropriate in all cases; we need a non-linear variant of SVMs

switching from linear constraints to non-linear constraints causes a lot of numerical problems:

• losing convexity

• KT conditions not sufficient

• optimization algorithms with low performance

Non-linear SVMs (cont.)

classical idea: use non-linear features

a (non-linear) feature mapping Φ maps the original input vectors to feature vectors

we have to replace ~x in all formulae by Φ(~x)

the dual of the hard margin case becomes:

maximize_{~α}  −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) ⟨Φ(~x(i)), Φ(~x(j))⟩) + Σ_{i=1}^p αi

subject to  Σ_{i=1}^p αi d(i) = 0

            αi ≥ 0 for all i = 1, . . . , p

Kernel trick

defining a function KΦ(·, ·) as KΦ(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩ we get:

maximize_{~α}  −(1/2) Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) KΦ(~x(i), ~x(j))) + Σ_{i=1}^p αi

subject to  Σ_{i=1}^p αi d(i) = 0

            αi ≥ 0 for all i = 1, . . . , p

now, patterns only occur as arguments of the function KΦ in the dual

the function KΦ is called a kernel function
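A classical example of such a kernel function: for ~x, ~y ∈ R², the homogeneous polynomial kernel K(~x, ~y) = ⟨~x, ~y⟩² equals the dot product of the explicit feature vectors Φ(~x) = (x1², √2 x1 x2, x2²). A quick numeric sketch:

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel in R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, y):
    # kernel function: evaluates <phi(x), phi(y)> without ever computing phi
    return (x @ y) ** 2

rng = np.random.default_rng(1)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(y), K(x, y))
print("K(x, y) = <phi(x), phi(y)> verified")
```

Evaluating K costs one dot product in R², while the explicit feature space is 3-dimensional here and grows rapidly with input dimension and polynomial degree.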

Kernel trick (cont.)

thought experiment: if we didn’t know Φ but had access to the kernel KΦ, could we still solve the dual optimization problem? Yes!

can we calculate the solving weight vector ~w, the bias b, and the margin ρ? (see slide 59)

weight vector: no, bias: yes, margin: yes

b = 1/d(s) − Σ_{j=1}^p (αj d(j) KΦ(~x(s), ~x(j)))   for a support vector s

ρ = 1 / √( Σ_{i=1}^p Σ_{j=1}^p (αi αj d(i) d(j) KΦ(~x(i), ~x(j))) )

can we apply the learned classification to a new input pattern ~x(new)? Yes! (see next slide)

Introduction to Neuroinformatics: Support Vector Machines – p. 76

Kernel trick (cont.)

~w = ∑_{i=1}^{p} αi d(i) Φ(~x(i))

applying ~w and b to a new pattern:

〈~w, Φ(~x(new))〉 + b = 〈∑_{i=1}^{p} αi d(i) Φ(~x(i)), Φ(~x(new))〉 + b

                    = ∑_{i=1}^{p} αi d(i) KΦ(~x(i), ~x(new)) + b

Introduction to Neuroinformatics: Support Vector Machines – p. 77
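The prediction formula above needs only the kernel, the stored support vectors, their labels, and the Lagrange multipliers. A minimal sketch (function and variable names are illustrative, not from the lecture; the kernel here is the polynomial kernel (xy + 1)² used in the 1-D example below):

```python
def kernel(x, y):
    # polynomial kernel (x*y + 1)^2 for one-dimensional inputs
    return (x * y + 1) ** 2

def svm_decision(x_new, support_vectors, labels, alphas, b):
    # f(x) = sum_i alpha_i d^(i) K(x^(i), x) + b
    return sum(a * d * kernel(sv, x_new)
               for a, d, sv in zip(alphas, labels, support_vectors)) + b

def svm_classify(x_new, support_vectors, labels, alphas, b):
    # sign of the decision function gives the predicted class
    return 1 if svm_decision(x_new, support_vectors, labels, alphas, b) > 0 else -1
```

Only the support vectors (patterns with αi > 0) contribute to the sum, so everything else from the training set can be discarded after training.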

Kernel trick (cont.)

quintessence:

• replacing the dot product by a kernel function is possible

• an explicit representation of ~w is not possible if we do not have access to Φ

• nonetheless, predictions can be made

• we need to memorize the support vectors and the non-zero Lagrange multipliers αi

the technique of implicitly calculating a separating hyperplane in feature space by replacing the dot product with a kernel function is called the kernel trick

(diagram: Φ maps the input space to the feature space, where a linear optimal classification is found, and Φ−1 maps back; the kernel trick performs this detour implicitly)

Introduction to Neuroinformatics: Support Vector Machines – p. 78

Kernel trick (cont.)

example (one-dimensional input space):

D = {(0; −1), (−1; 1), (2; 1)} is not linearly separable in input space

feature mapping: Φ(x) = (x², √2 x, 1)^T

kernel function:

KΦ(x, y) = 〈(x², √2 x, 1)^T, (y², √2 y, 1)^T〉 = (xy)² + 2(xy) + 1 = (xy + 1)²

solution: α1 = 3/4, α2 = 2/3, α3 = 1/12

b = −1, ~w = (1, −(1/2)√2, 0)^T (in feature space!)

Introduction to Neuroinformatics: Support Vector Machines – p. 79
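The stated solution is easy to verify: recompute ~w from the dual variables and check that all three patterns lie exactly on the margin, i.e. d(i)(〈~w, Φ(~x(i))〉 + b) = 1. A quick sanity-check sketch (not part of the original slides):

```python
import math

# training set D = {(0;-1), (-1;1), (2;1)} and the dual solution from the slide
X = [0.0, -1.0, 2.0]
d = [-1, 1, 1]
alpha = [3/4, 2/3, 1/12]
b = -1.0

def phi(x):
    # feature mapping Phi(x) = (x^2, sqrt(2)*x, 1)
    return (x * x, math.sqrt(2) * x, 1.0)

# w = sum_i alpha_i d^(i) Phi(x^(i))
w = [sum(a * di * phi(xi)[k] for a, di, xi in zip(alpha, d, X))
     for k in range(3)]

# all three patterns are support vectors, so each margin value should be 1
margins = [di * (sum(wk * pk for wk, pk in zip(w, phi(xi))) + b)
           for xi, di in zip(X, d)]

print(w)        # close to (1, -sqrt(2)/2, 0)
print(margins)  # all close to 1
```

The dual constraint ∑ αi d(i) = 0 also holds: −3/4 + 2/3 + 1/12 = 0.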

Kernel trick (cont.)

example, illustration: (figure: the three training points on the input-space axis are mapped by Φ into the feature space with axes f1 and f2, where they become linearly separable)

Which points in input space are on the decision boundary?

look for z with ∑_{i=1}^{p} αi d(i) KΦ(~x(i), z) + b = 0

yields: z = (1 ± √5)/2

Introduction to Neuroinformatics: Support Vector Machines – p. 80
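The boundary points can be found without ever using Φ, purely via the kernel. For this example the decision function simplifies to z² − z − 1, whose roots are (1 ± √5)/2. A sketch:

```python
# decision function in input space, written purely with the kernel K(x, y) = (x*y + 1)^2
X = [0.0, -1.0, 2.0]
d = [-1, 1, 1]
alpha = [3/4, 2/3, 1/12]
b = -1.0

def f(z):
    # f(z) = sum_i alpha_i d^(i) K(x^(i), z) + b
    return sum(a * di * (xi * z + 1) ** 2 for a, di, xi in zip(alpha, d, X)) + b

# expanding f gives z^2 - z - 1; its roots are the boundary points
roots = ((1 + 5 ** 0.5) / 2, (1 - 5 ** 0.5) / 2)
```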

Kernel trick (cont.)

kernels can be designed individually for an application:

• look for meaningful features

• create Φ

• derive KΦ

in these cases, using the kernel is not better than an explicit calculation in feature space

there are classes of generic kernels:

• can be used for many tasks

• often, Φ is very complicated and the feature space is very high dimensional

• for some generic kernels, the feature space has infinite dimension and Φ cannot be calculated explicitly

• typically, using the kernel instead of an explicit calculation in feature space is computationally more efficient

Introduction to Neuroinformatics: Support Vector Machines – p. 81

Generic kernels

example of a kernel (one-dimensional input space):

Φ(x) = (x², √2 x, 1)^T

mapping from input space to feature space needs: 2 multiplications
calculation of the dot product in feature space: 3 multiplications, 2 additions
evaluating 〈Φ(x), Φ(y)〉 in total: 7 multiplications, 2 additions

corresponding kernel: KΦ(x, y) = (xy + 1)²

kernel evaluation needs: 2 multiplications, 1 addition

Introduction to Neuroinformatics: Support Vector Machines – p. 82
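The identity 〈Φ(x), Φ(y)〉 = (xy + 1)² behind this operation count is easy to confirm numerically (a sketch):

```python
import math
import random

def phi(x):
    # explicit feature mapping: 2 multiplications
    return (x * x, math.sqrt(2) * x, 1.0)

def dot(u, v):
    # dot product in feature space: 3 multiplications, 2 additions
    return sum(ui * vi for ui, vi in zip(u, v))

def kernel(x, y):
    # equivalent kernel evaluation: 2 multiplications, 1 addition
    return (x * y + 1) ** 2

random.seed(0)
for _ in range(100):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(dot(phi(x), phi(y)) - kernel(x, y)) < 1e-8
```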

Generic kernels (cont.)

a class of generic kernels: polynomial kernels
polynomial kernels are functions of the form:

KΦ(~x, ~y) = (〈~x, ~y〉)^d

or:

KΦ(~x, ~y) = (〈~x, ~y〉 + 1)^d

d ∈ N is a kernel parameter that needs to be chosen manually

for polynomial kernels, Φ can be calculated explicitly. It contains all monomials of degree d (first case) or all monomials of maximal degree d (second case)
But: the feature space becomes very large, even for small d, e.g. for the first variant the number of features is:

  n     d    number of features
  3     2    6
  5     3    35
  10    3    220
  10    5    2 002
  10   10    92 378
  50    4    292 825
  256   4    183 181 376

Introduction to Neuroinformatics: Support Vector Machines – p. 83
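The entries in the table are the number of monomials of degree exactly d in n variables, which is the binomial coefficient C(n + d − 1, d). A quick check reproduces the table (sketch):

```python
from math import comb

def num_features(n, d):
    # number of monomials of degree exactly d in n variables
    return comb(n + d - 1, d)

# the (n, d) pairs and feature counts from the slide
table = {(3, 2): 6, (5, 3): 35, (10, 3): 220, (10, 5): 2002,
         (10, 10): 92378, (50, 4): 292825, (256, 4): 183181376}

for (n, d), expected in table.items():
    assert num_features(n, d) == expected
```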

Generic kernels (cont.)

what makes a symmetric function K a kernel function?

• a mapping Φ into some space with a dot product (a Hilbert space) exists and K(~x, ~y) = 〈Φ(~x), Φ(~y)〉

• K meets Mercer's theorem:

A continuous, symmetric function K : [a, b]^n × [a, b]^n → R is a kernel if and only if for all functions g : [a, b]^n → R with

∫_{[a,b]^n} g(~x)² d~x < ∞

holds:

∫_{[a,b]^n} ∫_{[a,b]^n} K(~x, ~y) g(~x) g(~y) d~x d~y ≥ 0

Proof: omitted

consequences: sums of kernels and positive multiples of kernels are also kernels

Introduction to Neuroinformatics: Support Vector Machines – p. 84
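A practical face of Mercer's condition: on any finite sample, the Gram matrix (K(~x(i), ~x(j)))_{ij} of a valid kernel is symmetric and positive semidefinite, i.e. ∑_{ij} K(~x(i), ~x(j)) g_i g_j ≥ 0 for arbitrary coefficients g — a discretized version of the integral condition. A numerical spot check (sketch; the RBF kernel used here is introduced on the next slide):

```python
import math
import random

def rbf(x, y, sigma2=1.0):
    # Gaussian (RBF) kernel for one-dimensional inputs
    return math.exp(-((x - y) ** 2) / (2 * sigma2))

random.seed(1)
xs = [random.uniform(-3, 3) for _ in range(20)]
K = [[rbf(xi, xj) for xj in xs] for xi in xs]

# discretized Mercer condition: the quadratic form must never be negative
for _ in range(200):
    g = [random.gauss(0, 1) for _ in xs]
    q = sum(K[i][j] * g[i] * g[j]
            for i in range(len(xs)) for j in range(len(xs)))
    assert q >= -1e-9
```

This is only a necessary-condition spot check, not a proof, but it is the standard way to catch an invalid hand-crafted kernel in practice.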

Generic kernels (cont.)

RBF kernels (Gaussian kernels):

KΦ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))

σ² > 0 is a kernel parameter, chosen manually

feature spaces have infinite dimension

local information processing, circular decision boundaries in input space

what is calculated in an SVM with an RBF kernel resembles an RBF network, but training is done in a different way

the kernel parameter controls the complexity of the kernel: if σ² is large, the behavior is similar to the dot product; if σ² is very small, decision boundaries in input space may become very complex (degeneration for σ² → 0)

most popular kernel type in practice

Introduction to Neuroinformatics: Support Vector Machines – p. 85
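The role of σ² can be read off directly from kernel values for a fixed pair of points (a sketch): with a large σ², even distant points still look similar (smooth, near-global behavior), while with a small σ² the kernel is essentially zero off the diagonal (purely local behavior).

```python
import math

def rbf(x, y, sigma2):
    # Gaussian (RBF) kernel for one-dimensional inputs
    return math.exp(-((x - y) ** 2) / (2 * sigma2))

# same pair of points, different kernel widths
wide = rbf(0.0, 1.0, sigma2=100.0)   # large sigma^2: similarity close to 1
narrow = rbf(0.0, 1.0, sigma2=0.01)  # small sigma^2: similarity close to 0
```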

Generic kernels (cont.)

sigmoid kernels:

KΦ(~x, ~y) = tanh(γ(〈~x, ~y〉 − θ))

γ > 0 and θ are kernel parameters, chosen manually

feature spaces have infinite dimension

what is calculated in an SVM with a sigmoid kernel resembles an MLP network with one hidden layer

of no practical use due to numerical degeneration of the optimization problem

Introduction to Neuroinformatics: Support Vector Machines – p. 86

Generic kernels: summary

dot-product kernel: KΦ(~x, ~y) = 〈~x, ~y〉
input space = feature space; simple, but useful

polynomial kernels: KΦ(~x, ~y) = (〈~x, ~y〉 + 1)^d, KΦ(~x, ~y) = (〈~x, ~y〉)^d
extensions of the dot-product kernel; feature spaces with finite dimension

RBF kernels: KΦ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))
very important in practice, very flexible

sigmoid kernels: KΦ(~x, ~y) = tanh(γ(〈~x, ~y〉 − θ))
limited practical use

problem-dependent kernels

→ demo in practical exercises

Introduction to Neuroinformatics: Support Vector Machines – p. 87

Summary (SVMs)

linear classification with an optimal separating hyperplane

mathematical problem description as a convex quadratic program

numerical solving by various approaches, often based on active set methods: libSVM, SMO, ...

representation of the solution in terms of support vectors

soft margin case: classification with errors

kernel trick: implicit calculations in feature space, non-linear SVM

extensions:

• SVM for regression

• SVM for one-class classification

important ideas:

• large margin techniques

• kernel-based techniques

Introduction to Neuroinformatics: Support Vector Machines – p. 88

Further readings

Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, 2000

C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998. Available at: http://www.kernel-machines.org

Roger Fletcher, Practical Methods of Optimization, Wiley, 1987

Introduction to Neuroinformatics: Support Vector Machines – p. 89
