Introduction to Neuroinformatics:
Support Vector Machines
Prof. Dr. Martin Riedmiller
University of Osnabrück
Institute of Computer Science and Institute of Cognitive Science
Outline
support vector machines (SVM) for classification: basic ideas
optimization under constraints: Lagrange multipliers, Kuhn-Tucker conditions, duality, algorithmic approaches
applying optimization theory to SVMs
kernel trick: making linear classifiers non-linear
variants of SVMs
Linear classification
perceptron revisited: find a hyperplane that classifies a given set of positive and negative training patterns correctly, i.e. find $(\vec{w}, w_0)$ so that:
$\langle \vec{w}, \vec{x} \rangle + w_0 > 0$ for all positive training patterns $\vec{x}$
$\langle \vec{w}, \vec{x} \rangle + w_0 < 0$ for all negative training patterns $\vec{x}$
perceptron learning finds a solving hyperplane (if possible) by adding patterns to and subtracting patterns from the weight vector
if there are several solving hyperplanes, perceptron learning finds an arbitrary one
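As a small illustration (not on the original slides), a minimal sketch of this perceptron learning rule in Python; the function name and toy interface are our own:

import numpy as np

def perceptron(X, d, epochs=100):
    # X: p x n matrix of patterns, d: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in zip(X, d):
            if t * (w @ x + w0) <= 0:   # pattern on the wrong side of the hyperplane
                w += t * x              # add positive / subtract negative patterns
                w0 += t
                mistakes += 1
        if mistakes == 0:               # a solving hyperplane has been found
            break
    return w, w0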
Linear classification (cont.)
Which is the best separating hyperplane?
[figure: the same set of positive (+) and negative (−) training patterns, separated by several different hyperplanes]
the performance on the training set is equally good, but what do you expect for an independent test set?
Linear classification (cont.)
for each separating hyperplane we can define the margin: the margin is the smallest distance of a training pattern to the separating hyperplane.
optimality: the risk of misclassification is the smaller, the larger the margin is.
[figure: candidate separating hyperplanes with their margins; the hyperplane with the largest margin keeps the greatest distance to both the positive and the negative patterns]
Support vector machines
support vector machines (SVM) (Vapnik, 1970s): find a separating hyperplane that maximizes the margin.
notation:
• training set $D = \{(\vec{x}^{(1)}, d^{(1)}), \dots, (\vec{x}^{(p)}, d^{(p)})\}$, $\vec{x}^{(i)} \in \mathbb{R}^n$, $d^{(i)} \in \{-1, +1\}$
• separating hyperplane defined by weight vector $\vec{w}$ and bias weight $b$ ($= w_0$, to be in line with the SVM literature)
• margin $\rho$
distance of a point $\vec{x}$ to a hyperplane $(\vec{w}, b)$:
$$\left| \frac{\langle \vec{x}, \vec{w} \rangle + b}{\|\vec{w}\|} \right|$$
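The distance formula translates directly into code; a one-line sketch of ours using NumPy:

import numpy as np

def distance_to_hyperplane(x, w, b):
    # |<x, w> + b| / ||w||
    return abs(w @ x + b) / np.linalg.norm(w)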
Support vector machines
find the separating hyperplane with maximal margin:
$$\max_{\vec{w}, b, \rho} \ \rho^2 \quad \text{subject to} \quad \rho > 0, \qquad \frac{d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right)}{\|\vec{w}\|} \ge \rho \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D$$
this task contains a degree of freedom: $\vec{w}$ and $b$ can be scaled by a positive number without changing the separating hyperplane. We can remove this degree of freedom and simplify the problem by adding a constraint on the length of $\vec{w}$. Here: $\|\vec{w}\| = \frac{1}{\rho}$
Support vector machines (cont.)
optimization problem with length constraint:
$$\max_{\vec{w}, b, \rho} \ \rho^2 = \frac{1}{\|\vec{w}\|^2} \quad \text{subject to} \quad \rho > 0, \qquad \|\vec{w}\| = \frac{1}{\rho}, \qquad \frac{d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right)}{\|\vec{w}\|} \ge \rho = \frac{1}{\|\vec{w}\|} \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D$$
rewriting the optimization problem: finally, we can turn it into a minimization task by minimizing the reciprocal $\|\vec{w}\|^2$ instead of maximizing $\frac{1}{\|\vec{w}\|^2}$
Support vector machines (cont.)
the minimization task:
$$\min_{\vec{w}, b, \rho} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D, \qquad \rho = \frac{1}{\|\vec{w}\|}$$
(the factor $\frac{1}{2}$ is added for convenience; it does not change the minimizer)
$\rho$ occurs only in one equality constraint, i.e. we can solve the optimization problem for $\vec{w}$ and $b$ and calculate $\rho$ afterwards from $\vec{w}$
Support vector machines (cont.)
final mathematical form of the minimization task:
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 \ \text{ for all } (\vec{x}^{(i)}, d^{(i)}) \in D$$
the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM).
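Since this is a quadratic program, it can be handed to an off-the-shelf solver. A minimal sketch, assuming the cvxpy library and some hypothetical separable toy data (neither appears on the slides):

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [2.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])  # patterns
d = np.array([1.0, 1.0, -1.0, -1.0])                             # labels
w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(d, X @ w + b) >= 1]                    # d_i (<x_i, w> + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value, 1.0 / np.linalg.norm(w.value))           # hyperplane and margin rho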
Optimization Theory
Optimization theory
a general form of optimization problem:
$$\min_{\vec{x}} \ f(\vec{x}) \quad \text{subject to} \quad \vec{x} \in \Omega$$
$\Omega \subseteq \mathbb{R}^n$: the set of feasible (permitted) points
$f: \mathbb{R}^n \to \mathbb{R}$: the target function that should be minimized
maximization problems can be described in the same way (maximize $f(\vec{x})$ ⇔ minimize $-f(\vec{x})$)
often, $\Omega$ is implicitly given by equality and inequality constraints
Local and global minima
a feasible point $\vec{x} \in \Omega$ is called a global minimum if for all $\vec{y} \in \Omega$ it holds: $f(\vec{x}) \le f(\vec{y})$
a feasible point $\vec{x} \in \Omega$ is called a local minimum if there is a radius $r > 0$ so that for all points $\vec{y} \in \Omega$ with $\|\vec{x} - \vec{y}\| < r$ it holds: $f(\vec{x}) \le f(\vec{y})$
[figure: a function $f(x)$ over a feasible interval $\Omega$ with two local minima, one of which is the global minimum]
Convexity
A set $X$ is called a convex set if for all points $\vec{x}, \vec{y} \in X$ and any $0 \le \theta \le 1$ it holds:
$$\theta\vec{x} + (1 - \theta)\vec{y} \in X$$
intuitively: a set is convex if all points between two elements of the set are also elements of the set.
[figure: a set $X$ with two elements $\vec{x}, \vec{y} \in X$ and the connecting segment $\theta\vec{x} + (1-\theta)\vec{y}$]
Convexity (cont.)
[figure: examples of convex and non-convex sets]
Convexity (cont.)
intersections of convex sets are also convex. Proof: → blackboard
unions of convex sets are not guaranteed to be convex.
Convexity (cont.)
A function $f$ is called a convex function if for all points $\vec{x}, \vec{y}$ and all $0 \le \theta \le 1$ it holds:
$$f(\theta\vec{x} + (1 - \theta)\vec{y}) \le \theta f(\vec{x}) + (1 - \theta) f(\vec{y})$$
[figure: a convex function $f(x)$; the chord between $\vec{x}$ and $\vec{y}$ lies above the graph]
Convexity (cont.)
[figure: examples of convex and non-convex functions]
Convexity (cont.)
sums of convex functions are also convex. Proof: → blackboard
positive multiples of convex functions are also convex. Proof: → for your own practice (very simple)
differences of convex functions are not guaranteed to be convex.
Convexity (cont.)
linear functions are convex. Proof: → for your own practice (very simple)
the function $f: x \mapsto x^2$ is a convex function. Proof:
$$f(\theta x + (1-\theta)y) - \left(\theta f(x) + (1-\theta)f(y)\right) = (\theta x + (1-\theta)y)^2 - (\theta x^2 + (1-\theta)y^2) = \dots = -\theta(1-\theta)(x^2 - 2xy + y^2) = -\theta(1-\theta)(x-y)^2 \le 0$$
the function $f: \vec{x} \mapsto \|\vec{x}\|^2$ is a convex function. Proof: follows from above
if $f$ is a convex function, then the set $\{\vec{x} \mid f(\vec{x}) \le 0\}$ is a convex set. Proof: → blackboard
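A quick numerical spot-check of the convexity inequality for $f(x) = x^2$ (a sketch of ours, illustrative rather than a proof):

import numpy as np

f = lambda x: x**2
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y, theta = rng.normal(), rng.normal(), rng.uniform()
    # f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    assert f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y) + 1e-12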
Convexity and minima
Lemma: if $\Omega$ is a convex set and $f$ is a convex function, each local minimum of $f$ w.r.t. $\Omega$ is also a global minimum.
Proof: → blackboard
this lemma does not mean that for each convex set and each convex function a minimum exists!
there are conditions under which the existence of a global minimum can be guaranteed (e.g. $\Omega$ non-empty and compact, $f$ continuous)
Convexity and minima (cont.)
Lemma: If $\Omega$ is a convex set and $f$ is a convex function w.r.t. $\Omega$, each point satisfying $\operatorname{grad} f(\vec{x}) = \vec{0}$ is a local minimum.
Proof: → omitted (see the literature on convex functions)
Constraints
often, the set $\Omega$ is given implicitly by a set of constraints $g_1, \dots, g_k$ and $h_1, \dots, h_m$ in the following form:
$$\Omega = \left( \bigcap_{i=1}^{k} \{\vec{x} \mid g_i(\vec{x}) \le 0\} \right) \cap \left( \bigcap_{i=1}^{m} \{\vec{x} \mid h_i(\vec{x}) = 0\} \right)$$
$g_j$ and $h_i$ are real-valued functions
• constraints of the form $g_j(\vec{x}) \le 0$ are called inequality constraints
• constraints of the form $h_i(\vec{x}) = 0$ are called equality constraints
equality constraints $h_i(\vec{x}) = 0$ can be replaced by two inequality constraints: $h_i(\vec{x}) \le 0$ and $-h_i(\vec{x}) \le 0$.
Characterizing local minima
for a feasible point $\vec{x} \in \Omega$ we call a direction $\vec{v} \in \mathbb{R}^n \setminus \{\vec{0}\}$ a feasible direction if there exists $r > 0$ so that $\vec{x} + \theta\vec{v}$ is a feasible point for all $0 \le \theta \le r$
for a feasible point $\vec{x} \in \Omega$ we call a direction $\vec{v} \in \mathbb{R}^n \setminus \{\vec{0}\}$ a descending direction if there exists $r > 0$ so that $f(\vec{x} + \theta\vec{v}) < f(\vec{x})$ for all $0 < \theta \le r$
Characterizing local minima (cont.)
a local minimum of $f$ with respect to $\Omega$ is a feasible point for which the intersection of feasible and descending directions is empty.
there is no feasible and descending direction in $\vec{x}$ ⇔ there is a small area around $\vec{x}$ where we do not find any feasible point for which the value of $f$ is smaller ⇔ local minimum
[figure: example points illustrating three situations: no minimum (a feasible descending direction exists), a local minimum (descending directions are non-feasible), and a minimum (no descending directions at all)]
Linear constraints, differentiable target function
the theory simplifies if the target function $f$ is differentiable and the constraints are linear:
$$\min_{\vec{x}} \ f(\vec{x}) \quad \text{subject to} \quad g_j(\vec{x}) \le 0, \ j = 1, \dots, k, \qquad h_i(\vec{x}) = 0, \ i = 1, \dots, m$$
with $g_i(\vec{x}) = g_{i,0} + g_{i,1}x_1 + g_{i,2}x_2 + \dots + g_{i,n}x_n$ ($g_{i,0}, \dots, g_{i,n}$ are the coefficients of the offset and slope of $g_i$) and $h_i(\vec{x}) = h_{i,0} + h_{i,1}x_1 + h_{i,2}x_2 + \dots + h_{i,n}x_n$
if $f$ is a linear function, these tasks are called linear programs
if $f$ is a quadratic function, these tasks are called quadratic programs
Linear constraints, differentiable target function (cont.)
what are the descending directions at point $\vec{x}$? A direction $\vec{v}$ is descending if
$$\langle \vec{v}, \operatorname{grad} f(\vec{x}) \rangle < 0$$
Linear constraints, differentiable target function (cont.)
what are the feasible directions in point $\vec{x}$?
• active and inactive constraints: an inequality constraint $g$ is called active at point $\vec{x}$ if $g(\vec{x}) = 0$. Otherwise it is called inactive.
• inactive constraints: inactive constraints do not restrict feasible directions
• active constraints: directions $\vec{v}$ with $\langle \vec{v}, \operatorname{grad} g(\vec{x}) \rangle > 0$ are infeasible
[figure: feasible directions at a point for an inactive and for an active constraint, relative to $\operatorname{grad} g(\vec{x})$]
Linear constraints, differentiable target function (cont.)
what are the feasible directions in point $\vec{x}$? A direction $\vec{v}$ is feasible in $\vec{x}$ if for all constraints $g_i$ it holds:
$$g_i \ \text{is inactive} \quad \text{or} \quad \langle \vec{v}, \operatorname{grad} g_i(\vec{x}) \rangle \le 0$$
Characterizing local minima
local minima are feasible points at which the intersection of descending directions and feasible directions is empty.
examples with no active constraints:
$\operatorname{grad} f(\vec{x}) \ne \vec{0}$ ⇒ no minimum; $\operatorname{grad} f(\vec{x}) = \vec{0}$ ⇒ minimum
Characterizing local minima (cont.)
examples with one active constraint:
$-\operatorname{grad} f = \lambda \operatorname{grad} g$ with $\lambda < 0$: all descending directions feasible, no minimum
$-\operatorname{grad} f = \lambda \operatorname{grad} g + \vec{u}$ with $\lambda < 0$ and $\vec{u} \perp \operatorname{grad} g$, $\vec{u} \ne \vec{0}$: some descending directions feasible, no minimum
Characterizing local minima (cont.)
$-\operatorname{grad} f = \lambda \operatorname{grad} g + \vec{u}$ with $\lambda > 0$ and $\vec{u} \perp \operatorname{grad} g$, $\vec{u} \ne \vec{0}$: some descending directions feasible, no minimum
$-\operatorname{grad} f = \lambda \operatorname{grad} g$ with $\lambda > 0$: all descending directions non-feasible, minimum
Characterizing local minima (cont.)
$-\operatorname{grad} f = \vec{0} = 0 \cdot \operatorname{grad} g$: no descending directions, minimum
For one active constraint we found: decomposing $-\operatorname{grad} f(\vec{x}) = \lambda \operatorname{grad} g(\vec{x}) + \vec{u}$ with $\vec{u} \perp \operatorname{grad} g(\vec{x})$, we find feasible descending directions only if $\lambda < 0$ or $\vec{u} \ne \vec{0}$
$\vec{x}$ is a minimum if we find $\lambda \ge 0$ with $\operatorname{grad} f(\vec{x}) + \lambda \operatorname{grad} g(\vec{x}) = \vec{0}$
Remark: this principle can be generalized to more active constraints
Characterizing local minima (cont.)
examples with two active constraints:
$-\operatorname{grad} f = \lambda_1 \operatorname{grad} g_1 + \lambda_2 \operatorname{grad} g_2$ with $\lambda_1 < 0$, $\lambda_2 \ge 0$: some descending directions feasible, no minimum
$-\operatorname{grad} f = \lambda_1 \operatorname{grad} g_1 + \lambda_2 \operatorname{grad} g_2$ with $\lambda_1 \ge 0$, $\lambda_2 \ge 0$: all descending directions non-feasible, minimum
Characterizing local minima (cont.)
examples with an equality constraint:
$-\operatorname{grad} f = \lambda \operatorname{grad} h + \vec{u}$ with $\vec{u} \perp \operatorname{grad} h$, $\vec{u} \ne \vec{0}$: one descending direction feasible, no minimum
$-\operatorname{grad} f = \lambda \operatorname{grad} h$: all descending directions non-feasible, minimum
Characterizing local minima (cont.)
general principle for problems with differentiable target function, linear equality constraints $h_i$ and linear inequality constraints $g_j$:
A feasible point $\vec{x}$ is a minimum if there exist $\alpha_j \ge 0$ and $\beta_i$ so that:
$$\operatorname{grad} f(\vec{x}) + \sum_{\text{active inequality constraints } j} \alpha_j \cdot \operatorname{grad} g_j(\vec{x}) + \sum_{i=1}^{m} \beta_i \cdot \operatorname{grad} h_i(\vec{x}) = \vec{0}$$
(corollary to Farkas' lemma)
Lagrange function:
$$L(\vec{x}, \vec{\alpha}, \vec{\beta}) = f(\vec{x}) + \sum_{j=1}^{k} \alpha_j \cdot g_j(\vec{x}) + \sum_{i=1}^{m} \beta_i \cdot h_i(\vec{x})$$
$\alpha_j, \beta_i$ are called Lagrange multipliers
(Karush-)Kuhn-Tucker conditions
rewriting these conditions yields the (K)KT conditions:
$g_i(\vec{x}) \le 0$ for all inequality constraints
$h_j(\vec{x}) = 0$ for all equality constraints
$\alpha_i \cdot g_i(\vec{x}) = 0$ for all inequality constraints
$\frac{\partial L(\vec{x}, \vec{\alpha}, \vec{\beta})}{\partial x_i} = 0$ for all $i = 1, \dots, n$
$\alpha_i \ge 0$ for all inequality constraints
Lemma: A point $\vec{x}$ is a minimum if and only if there exist $\alpha_j$ and $\beta_i$ so that the KT conditions are met.
Proof: omitted, see literature
Kuhn-Tucker conditions (cont.)
the KT conditions do not give an algorithm to find the minima, but they allow us to check whether a given point is a minimum
example:
$$\min_{x_1, x_2} \ x_1^2 + x_2^2 \quad \text{subject to} \quad -x_1 + x_2 + 2 \le 0, \qquad -2x_1 - x_2 - 2 \le 0$$
Lagrange function, KT conditions, check whether $(0,0)$, $(4,2)$, $(1,-1)$ are minima → blackboard
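The exercise itself is meant for the blackboard; as a numerical cross-check, a sketch of ours using SciPy's SLSQP solver. SciPy expects inequality constraints in the form fun(x) ≥ 0, so the g ≤ 0 constraints are negated:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + x[1]**2
constraints = [
    {'type': 'ineq', 'fun': lambda x: -(-x[0] + x[1] + 2)},    # g1(x) <= 0
    {'type': 'ineq', 'fun': lambda x: -(-2*x[0] - x[1] - 2)},  # g2(x) <= 0
]
result = minimize(f, x0=np.array([5.0, 5.0]), method='SLSQP', constraints=constraints)
# converges near (1, -1): there g1 is active, grad f = (2, -2) = -2 * grad g1,
# so the KT conditions hold with alpha1 = 2 >= 0, alpha2 = 0
print(result.x)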
Ways to find the minimum
brute force approach
• equality constraints are always active
• inequality constraints may be active or inactive
• at the minimum $\vec{x}^*$, a certain subset $A$ of inequality constraints is active
• if we knew $A$ and $f$ looks "nice", we could find $\vec{x}^*$ analytically
• main idea: check all possible combinations of active constraints for the minimum
example → blackboard
can be applied only for linear and quadratic target functions
the number of possible combinations of active constraints grows exponentially
Ways to find the minimum (cont.)
active set methods: incrementally build the set of active constraints
• start with a feasible point $\vec{x}$ and assume $A = \emptyset$
• perform a line search along the steepest descending direction that is consistent with the constraints in $A$
• if a constraint $\notin A$ is violated, it is added to $A$
• if no step is possible, Lagrange multipliers are calculated to check whether we have found the minimum
• if necessary, constraints are removed from $A$
sketch of algorithm → next slide
Ways to find the minimum(cont.)
1: repeat2: calculate direction of maximal descent3: project direction of maximal descent onto boundaries of constraints in A (yields search
direction ~v)
4: if ~v 6= ~0 then5: calculate minimum on ray from ~x into direction ~v (yields steplength τ )6: check whether constraints are violated when step is performed. If yes, shorten τ and add
restricting constraint to A
7: perform step from ~x to ~x + τ~v
8: else9: calculate Lagrange multipliers
10: if Lagrange multipliers are all non-negative then11: minimum found, stop.12: else13: remove constraint with negative Lagrange multiplier from A
14: end if15: end if16: until minimum found
Introduction to Neuroinformatics: Support Vector Machines – p. 41
Duality
a mathematical concept: solve a task different from the original task; by doing so, the original task is implicitly solved.
the original task is called the primal problem, the other problem is called the dual problem
primal task (reminder):
$$\min_{\vec{x}} \ f(\vec{x}) \quad \text{subject to} \quad g_j(\vec{x}) \le 0, \ j = 1, \dots, k, \qquad h_i(\vec{x}) = 0, \ i = 1, \dots, m$$
the Lagrange function of the primal (reminder):
$$L(\vec{x}, \vec{\alpha}, \vec{\beta}) = f(\vec{x}) + \sum_{j=1}^{k} \alpha_j \cdot g_j(\vec{x}) + \sum_{i=1}^{m} \beta_i \cdot h_i(\vec{x})$$
Duality (cont.)
the function $Q$:
$$Q(\vec{\alpha}, \vec{\beta}) = \inf_{\vec{x} \in \mathbb{R}^n} L(\vec{x}, \vec{\alpha}, \vec{\beta})$$
remarks:
• the infimum is the largest lower bound of $L$; if no lower bound exists, we say the infimum is $-\infty$
• the infimum is taken over all points $\vec{x}$, regardless of whether these points are feasible or infeasible w.r.t. the primal.
dual task:
$$\max_{\vec{\alpha}, \vec{\beta}} \ Q(\vec{\alpha}, \vec{\beta}) \quad \text{subject to} \quad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, k$$
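A tiny worked example of ours (not from the slides): minimize $x^2$ subject to $1 - x \le 0$. Here $L(x, \alpha) = x^2 + \alpha(1 - x)$, the infimum over $x$ is attained at $x = \alpha/2$, so $Q(\alpha) = \alpha - \alpha^2/4$:

import numpy as np

alphas = np.linspace(0.0, 4.0, 401)
Q = alphas - alphas**2 / 4
best = np.argmax(Q)
# dual optimum alpha' = 2 with Q(alpha') = 1, matching the primal
# optimum x' = alpha'/2 = 1 with f(x') = 1
print(alphas[best], Q[best])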
Duality (cont.)
Lemma: If $\vec{x}'$ is a feasible point of the primal task and $(\vec{\alpha}', \vec{\beta}')$ is a feasible point of the dual task, then:
$$Q(\vec{\alpha}', \vec{\beta}') \le f(\vec{x}')$$
Proof:
$$Q(\vec{\alpha}', \vec{\beta}') \le L(\vec{x}', \vec{\alpha}', \vec{\beta}') = f(\vec{x}') + \sum_{i=1}^{k} \underbrace{\alpha_i'}_{\ge 0} \cdot \underbrace{g_i(\vec{x}')}_{\le 0} + \sum_{j=1}^{m} \beta_j' \cdot \underbrace{h_j(\vec{x}')}_{= 0} \le f(\vec{x}')$$
Duality (cont.)
Corollary: If $\vec{x}'$ is a feasible point of the primal task and $(\vec{\alpha}', \vec{\beta}')$ is a feasible point of the dual task and $f(\vec{x}') = Q(\vec{\alpha}', \vec{\beta}')$, then $\vec{x}'$ is a solution of the primal problem and $(\vec{\alpha}', \vec{\beta}')$ is a solution of the dual problem.
Proof: For any feasible $\vec{x}$: $f(\vec{x}) \ge Q(\vec{\alpha}', \vec{\beta}') = f(\vec{x}')$. For any feasible $(\vec{\alpha}, \vec{\beta})$: $Q(\vec{\alpha}, \vec{\beta}) \le f(\vec{x}') = Q(\vec{\alpha}', \vec{\beta}')$
Potential problem: points $\vec{x}'$ and $(\vec{\alpha}', \vec{\beta}')$ do not necessarily exist for general problems ⇒ restrictions on $f$ and the type of constraints.
Duality (cont.)
Lemma: If $f$ is convex and differentiable, the $g_j$, $h_i$ are linear, and $\vec{x}'$ is a solution of the primal problem, then a solution $(\vec{\alpha}', \vec{\beta}')$ of the dual exists and $f(\vec{x}') = Q(\vec{\alpha}', \vec{\beta}')$
Proof:
$\vec{x}'$ solution of the primal ⇒ there exist $(\vec{\alpha}', \vec{\beta}')$ that meet the KT conditions
$f$ convex, $g_j$, $h_i$ linear ⇒ $L(\vec{x}, \vec{\alpha}', \vec{\beta}')$ is convex in $\vec{x}$
from the KT conditions: $\frac{\partial L}{\partial x_i}(\vec{x}', \vec{\alpha}', \vec{\beta}') = 0$ ⇒ $L(\vec{x}', \vec{\alpha}', \vec{\beta}') = Q(\vec{\alpha}', \vec{\beta}')$
$Q(\vec{\alpha}', \vec{\beta}') = L(\vec{x}', \vec{\alpha}', \vec{\beta}') = f(\vec{x}')$ ⇒ $(\vec{\alpha}', \vec{\beta}')$ is a solution of the dual
the proof shows: the KT conditions are the link between primal solution and dual solution
solutions do not need to be unique
Duality (cont.)
main message from duality: under certain circumstances,
• you can solve the primal and get the solution of the dual for free
• you can solve the dual and get the solution of the primal for free
• you can use the Kuhn-Tucker conditions to transform the primal solution into a dual solution and vice versa
Summary (optimization theory)
general ideas about optimization (local, global minima)
convex problems: each local minimum is a global minimum
general characterization of minima (feasible directions, descending directions, Kuhn-Tucker)
algorithms to find the minimum (brute force, active set)
duality
literature: Roger Fletcher, Practical Methods of Optimization, Wiley, 1991
Applying Results of Optimization Theory to Support Vector Machines
Support vector machines
final mathematical form of the minimization task (reminder):
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 \ \text{ for } i = 1, \dots, p$$
the solution of this optimization task (if it exists) constitutes the optimal separating hyperplane with maximal margin. It is known as the support vector machine (SVM).
Support vector machines (cont.)
transforming into the appropriate form:
$$\min_{\vec{w}, b} \ \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad 1 - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \le 0 \ \text{ for } i = 1, \dots, p$$
note:
• the target function is convex
• the constraints are linear
• hence, local minima are also global minima (if feasible points exist)
Support vector machines (cont.)
Lagrange function:
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{p} \alpha_i \left( 1 - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \right) = \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{p} \alpha_i - \sum_{i=1}^{p} \alpha_i d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - b \sum_{i=1}^{p} \alpha_i d^{(i)}$$
derivatives:
$$\frac{\partial L}{\partial w_j} = w_j - \sum_{i=1}^{p} \alpha_i d^{(i)} x_j^{(i)} \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{p} \alpha_i d^{(i)}$$
Support vector machines (cont.)
calculating $Q(\vec{\alpha}) = \inf_{\vec{w}, b} L(\vec{w}, b, \vec{\alpha})$
$L$ is convex. Hence, if we can find a point that zeros the partial derivatives, it is the minimum.
zeroing the derivatives and resolving w.r.t. $\vec{w}$ and $b$:
$$w_j = \sum_{i=1}^{p} \alpha_i d^{(i)} x_j^{(i)}, \qquad \text{i.e.} \quad \vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad 0 = \sum_{i=1}^{p} \alpha_i d^{(i)}$$
a strange equation: $b$ is lost. What does it mean?
Support vector machines (cont.)
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 + \sum_{i=1}^{p} \alpha_i - \sum_{i=1}^{p} \alpha_i d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - b \sum_{i=1}^{p} \alpha_i d^{(i)}, \qquad \vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad 0 = \sum_{i=1}^{p} \alpha_i d^{(i)}$$
first case: $\sum_{i=1}^{p} \alpha_i d^{(i)} = 0$
in this case, we can find a $\vec{w}$ that zeros the partial derivatives for any value of $b$; it is the minimum (= infimum). Substituting $\vec{w}$ by $\sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}$ yields:
$$Q(\vec{\alpha}) = -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle + \sum_{i=1}^{p} \alpha_i$$
second case: $\sum_{i=1}^{p} \alpha_i d^{(i)} \ne 0$
regardless of $\vec{w}$, we can make $L$ arbitrarily small by varying $b$. Hence, $Q(\vec{\alpha}) = -\infty$
Support vector machines (cont.)
the dual problem:
$$\max_{\vec{\alpha}} \ \begin{cases} -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle + \sum_{i=1}^{p} \alpha_i & \text{if } \sum_{i=1}^{p} \alpha_i d^{(i)} = 0 \\ -\infty & \text{otherwise} \end{cases} \quad \text{subject to} \quad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, p$$
Support vector machines (cont.)
since the second case of the case distinction will never be the maximum, we can rewrite the dual:
$$\max_{\vec{\alpha}} \ -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle + \sum_{i=1}^{p} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{p} \alpha_i d^{(i)} = 0, \qquad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, p$$
note: the dual is a problem in $\vec{\alpha}$; $\vec{w}$ and $b$ do not occur
for now, there is no advantage in solving the dual instead of the primal problem. Later on, we will see the advantage of the dual.
for both the dual and the primal problem, powerful algorithmic solution methods exist (e.g. active set methods)
Support vector machines (cont.)
back to the primal; KT conditions:
$$1 - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - d^{(i)} b \le 0 \ \text{ for all } i = 1, \dots, p$$
$$\alpha_i \cdot \left( 1 - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - d^{(i)} b \right) = 0 \ \text{ for all } i = 1, \dots, p$$
$$\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad 0 = \sum_{i=1}^{p} \alpha_i d^{(i)}, \qquad \alpha_i \ge 0 \ \text{ for all } i = 1, \dots, p$$
what are $\vec{w}$ and $b$ if the $\alpha_i$ are already known?
Support vector machines (cont.)
resolving the KT conditions w.r.t. $\vec{w}$ and $b$:
• $\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}$
• for $i$ with $\alpha_i \ne 0$ we get from the complementary condition: $1 - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle - d^{(i)} b = 0$. Hence:
$$b = \frac{1}{d^{(i)}} - \langle \vec{x}^{(i)}, \vec{w} \rangle = \frac{1}{d^{(i)}} - \sum_{j=1}^{p} \alpha_j d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle$$
the margin $\rho$:
$$\rho = \frac{1}{\|\vec{w}\|} = \frac{1}{\sqrt{\sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle}}$$
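These formulas turn a dual solution $\vec{\alpha}$ directly into the hyperplane; a small helper sketch of ours in NumPy:

import numpy as np

def recover_hyperplane(alpha, X, d):
    # w = sum_i alpha_i d_i x_i
    w = (alpha * d) @ X
    i = int(np.argmax(alpha))            # any index with alpha_i > 0
    b = 1.0 / d[i] - X[i] @ w            # from the complementary condition
    rho = 1.0 / np.linalg.norm(w)        # the margin
    return w, b, rho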
Support vector machines (cont.)
some constraints become active ($\alpha_i > 0$), some inactive ($\alpha_i = 0$)
each constraint refers to one pattern
a constraint is active if the respective pattern is located next to the separating hyperplane
patterns that refer to active constraints are called support vectors
the solution only depends on the support vectors
[figure: the optimal separating hyperplane with its margin band; the support vectors (sv) lie on the margin boundaries on the positive and negative side]
Lemma: Removing non-support vectors from a classification task does not change the solution found by a support vector machine.
Support vector machines (cont.)
example (blackboard):

i    x1(i)   x2(i)   d(i)
1     0       0      −1
2     0       1      −1
3    −1       1      −1
4     1       0      +1
5     2       1      +1

solution: $w_1 = 2$, $w_2 = 0$, $b = -1$; support vectors: $\vec{x}^{(1)}, \vec{x}^{(2)}, \vec{x}^{(4)}$
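The example can be cross-checked numerically; a sketch of ours using scikit-learn's SVC (not part of the slides; a very large C emulates the hard margin case):

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [-1, 1], [1, 0], [2, 1]], dtype=float)
d = np.array([-1, -1, -1, 1, 1])
clf = SVC(kernel='linear', C=1e6).fit(X, d)
print(clf.coef_, clf.intercept_)  # approximately w = (2, 0), b = -1
print(clf.support_)               # indices 0, 1, 3, i.e. patterns 1, 2 and 4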
Fault tolerant SVMs
Soft margin SVM
the SVM discussed up to here (the so-called "hard margin" case): perfect classification
what about data which are not linearly separable?
soft margin SVM
opposing aspects: maximizing the margin vs. minimizing the errors
[figure: a non-separable data set; some positive and negative patterns fall inside the margin band or on the wrong side of the separating hyperplane, incurring additional errors]
Soft margin SVM (cont.)
mathematical model of errors: slack variables $\xi_i$. $\xi_i$ models the error that is made for the training pattern $\vec{x}^{(i)}$
extended constraints:
$$d^{(i)} \cdot \left( \langle \vec{w}, \vec{x}^{(i)} \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
extended optimization target:
$$\min_{\vec{w}, b, \vec{\xi}} \ \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{p} \xi_i^2$$
$C > 0$ is a fixed constant to balance margin and errors
Soft margin SVM (cont.)
primal of soft margin classification:
$$\min_{\vec{w}, b, \vec{\xi}} \ \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{p} \xi_i^2 \quad \text{subject to} \quad d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \ \text{ for } i = 1, \dots, p$$
Lagrange function:
$$L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\beta}) = \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{p} \xi_i^2 + \sum_{i=1}^{p} \alpha_i \left( 1 - \xi_i - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \right) + \sum_{i=1}^{p} \beta_i \cdot (-\xi_i)$$
primal variables: $\vec{w}, b, \vec{\xi}$. Lagrange multipliers: $\vec{\alpha}, \vec{\beta}$
Soft margin SVM (cont.)
KT conditions:
$$1 - \xi_i - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \le 0 \ \text{ for } i = 1, \dots, p$$
$$-\xi_i \le 0 \ \text{ for } i = 1, \dots, p$$
$$\alpha_i \cdot \left( 1 - \xi_i - d^{(i)} \left( \langle \vec{x}^{(i)}, \vec{w} \rangle + b \right) \right) = 0 \ \text{ for } i = 1, \dots, p$$
$$\beta_i \cdot (-\xi_i) = 0 \ \text{ for } i = 1, \dots, p$$
$$\frac{\partial L}{\partial w_j} = w_j - \sum_{i=1}^{p} \alpha_i d^{(i)} x_j^{(i)} = 0 \ \text{ for } j = 1, \dots, n$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{p} \alpha_i d^{(i)} = 0$$
$$\frac{\partial L}{\partial \xi_i} = 2C\xi_i - \alpha_i - \beta_i = 0 \ \text{ for } i = 1, \dots, p$$
$$\alpha_i \ge 0, \qquad \beta_i \ge 0 \ \text{ for } i = 1, \dots, p$$
Soft margin SVM (cont.)
deriving the solution from the KT conditions. We get $\vec{w}$ and $\xi_i$ directly:
$$\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}, \qquad \xi_i = \frac{\alpha_i + \beta_i}{2C}$$
to derive $b$, we exploit the complementary condition for a pattern with non-zero $\alpha_i$:
$$b = \frac{1 - \xi_i - d^{(i)} \langle \vec{x}^{(i)}, \vec{w} \rangle}{d^{(i)}} = \frac{1 - \frac{\alpha_i + \beta_i}{2C}}{d^{(i)}} - \sum_{j=1}^{p} \alpha_j d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle$$
Soft margin SVM (cont.)
which patterns become support vectors?
• patterns on the boundary of the margin band
• misclassified patterns and patterns within the margin band
Soft margin SVM (cont.)
calculating the function $Q(\vec{\alpha}, \vec{\beta}) = \inf_{\vec{w}, b, \vec{\xi}} L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\beta})$
first case: $\sum_{i=1}^{p} \alpha_i d^{(i)} = 0$:
then the minimum exists with $\vec{w} = \sum_{i=1}^{p} \alpha_i d^{(i)} \vec{x}^{(i)}$, $\xi_i = \frac{\alpha_i + \beta_i}{2C}$, $b$ arbitrary.
$$Q(\vec{\alpha}, \vec{\beta}) = -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i \alpha_j d^{(i)} d^{(j)} \langle \vec{x}^{(i)}, \vec{x}^{(j)} \rangle - \frac{1}{2} \sum_{i=1}^{p} \frac{(\alpha_i + \beta_i)^2}{2C} + \sum_{i=1}^{p} \alpha_i$$
second case: $\sum_{i=1}^{p} \alpha_i d^{(i)} \ne 0$: $L$ is unbounded below, hence $Q(\vec{\alpha}, \vec{\beta}) = -\infty$
observation: the second case does not contribute to the maximum of $Q$
Soft margin SVM (cont.)

the dual:

    maximize_{~α,~β}  ( −(1/2) ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) ⟨~x^(i), ~x^(j)⟩
                        − (1/2) ∑_{i=1}^p (α_i + β_i)²/(2C) + ∑_{i=1}^p α_i )

    subject to  ∑_{i=1}^p α_i d^(i) = 0
                α_i ≥ 0   for all i = 1, ..., p
                β_i ≥ 0   for all i = 1, ..., p
Introduction to Neuroinformatics: Support Vector Machines – p. 70
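As a sketch (toy data made up for illustration), the dual can be maximized numerically by minimizing −Q under the constraints above; note that ~β enters only through −(α_i + β_i)²/(4C), so the maximum is always attained at ~β = 0:

    # Sketch: solving the soft-margin dual with SciPy's SLSQP.
    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
    d = np.array([1.0, 1.0, -1.0, -1.0])
    p, C = len(d), 1.0
    G = (d[:, None] * d[None, :]) * (X @ X.T)  # G_ij = d^(i) d^(j) <x^(i), x^(j)>

    def neg_Q(z):                              # z = (alpha, beta)
        a, bt = z[:p], z[p:]
        return 0.5 * a @ G @ a + np.sum((a + bt) ** 2) / (4 * C) - np.sum(a)

    res = minimize(neg_Q, np.zeros(2 * p), bounds=[(0, None)] * (2 * p),
                   constraints=[{"type": "eq", "fun": lambda z: z[:p] @ d}])
    alpha, beta = res.x[:p], res.x[p:]
    print("alpha =", alpha.round(4), " beta =", beta.round(4))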
Soft margin SVM (cont.)

there is a variant of soft margin SVM that uses the following optimization target:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^p ξ_i

instead of:

    minimize_{~w,b,~ξ}  (1/2)||~w||² + C ∑_{i=1}^p ξ_i²

advantage: the calculation of the solution simplifies in some respects, and the solution is more robust w.r.t. outliers.

C controls the balance between errors and margin width:
• large C prefers small errors
• small C prefers a large margin

an adequate value of C has to be determined experimentally (cf. the sketch below)
→ demo in practical exercises
Introduction to Neuroinformatics: Support Vector Machines – p. 71
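A sketch of such an experiment (data and values of C made up), using scikit-learn's SVC, which implements the linear-slack variant above:

    # Sketch: effect of C on margin width and number of support vectors.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
    y = np.array([1] * 20 + [-1] * 20)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        width = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width 2/||w||
        print(f"C={C:>6}: margin width={width:.3f}, "
              f"support vectors={len(clf.support_)}")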
Non-linear SVMs
Introduction to Neuroinformatics: Support Vector Machines – p. 72
Non-linear SVMs
linear discrimination is not appropriate in all cases; we need a non-linear variant of SVMs

switching from linear constraints to non-linear constraints causes a lot of numerical problems:
• losing convexity
• KT conditions no longer sufficient
• optimization algorithms with low performance
Introduction to Neuroinformatics: Support Vector Machines – p. 73
Non-linear SVMs (cont.)

classical idea: use non-linear features

a feature mapping Φ (non-linear) maps the original input vectors to feature vectors

we have to replace ~x in all formulae by Φ(~x)

the dual of the hard margin case becomes:

    maximize_{~α}  −(1/2) ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) ⟨Φ(~x^(i)), Φ(~x^(j))⟩ + ∑_{i=1}^p α_i

    subject to  ∑_{i=1}^p α_i d^(i) = 0
                α_i ≥ 0   for all i = 1, ..., p
Introduction to Neuroinformatics: Support Vector Machines – p. 74
Kernel trick

defining a function KΦ(·, ·) as KΦ(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩ we get:

    maximize_{~α}  −(1/2) ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) KΦ(~x^(i), ~x^(j)) + ∑_{i=1}^p α_i

    subject to  ∑_{i=1}^p α_i d^(i) = 0
                α_i ≥ 0   for all i = 1, ..., p

now, patterns occur in the dual only as arguments of the function KΦ

the function KΦ is called a kernel function
Introduction to Neuroinformatics: Support Vector Machines – p. 75
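In code, the dual then needs only the Gram matrix K_ij = KΦ(~x^(i), ~x^(j)); a sketch (the polynomial kernel here is just one possible choice):

    # Sketch: building the kernel Gram matrix that the dual operates on.
    import numpy as np

    def poly_kernel(x, y, deg=2):
        return (x @ y + 1.0) ** deg

    def gram_matrix(X, kernel):
        p = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(p)]
                         for i in range(p)])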
Kernel trick (cont.)

thought experiment: if we didn't know Φ but had access to the kernel KΦ, could we solve the dual optimization problem? Yes!

can we calculate the solving weight vector ~w, bias b and the margin ρ? (see slide 59)

weight vector: no, bias: yes, margin: yes

    b = 1/d^(s) − ∑_{j=1}^p α_j d^(j) KΦ(~x^(s), ~x^(j))   for a support vector s

    ρ = 1 / √( ∑_{i=1}^p ∑_{j=1}^p α_i α_j d^(i) d^(j) KΦ(~x^(i), ~x^(j)) )

can we apply the learned classification to a new input pattern ~x^(new)? Yes! (see next slide)
Introduction to Neuroinformatics: Support Vector Machines – p. 76
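Both formulas need only the Gram matrix; a sketch (function name mine, ~α and a support vector index s assumed given):

    # Sketch: bias and margin from the kernel alone (hard margin case).
    import numpy as np

    def bias_and_margin(K, d, alpha, s):
        b = 1.0 / d[s] - K[s] @ (alpha * d)        # b from support vector s
        rho = 1.0 / np.sqrt((alpha * d) @ K @ (alpha * d))
        return b, rho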
Kernel trick (cont.)

    ~w = ∑_{i=1}^p α_i d^(i) Φ(~x^(i))

applying ~w and b to a new pattern:

    ⟨~w, Φ(~x^(new))⟩ + b = ⟨ ∑_{i=1}^p α_i d^(i) Φ(~x^(i)), Φ(~x^(new)) ⟩ + b
                          = ∑_{i=1}^p α_i d^(i) KΦ(~x^(i), ~x^(new)) + b
Introduction to Neuroinformatics: Support Vector Machines – p. 77
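As a sketch, prediction therefore needs only the stored support vectors, their multipliers and the kernel (argument names are mine):

    # Sketch: classifying a new pattern without ever forming w or Phi.
    import numpy as np

    def svm_predict(x_new, X_sv, d_sv, alpha_sv, b, kernel):
        s = sum(a * di * kernel(xi, x_new)
                for a, di, xi in zip(alpha_sv, d_sv, X_sv))
        return np.sign(s + b)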
Kernel trick (cont.)

quintessence:
• replacing the dot product by a kernel function is possible
• an explicit representation is not possible if we do not have access to Φ
• nonetheless, predictions can be made
• we need to memorize the support vectors and the non-zero Lagrange multipliers α_i

the technique of implicitly calculating a separating hyperplane in feature space by replacing the dot product with a kernel function is called the kernel trick

[diagram: input space —Φ→ feature space, optimal linear classification there, —Φ⁻¹→ back to input space; the kernel trick takes the direct route]
Introduction to Neuroinformatics: Support Vector Machines – p. 78
Kernel trick (cont.)

example (one-dimensional input space):

D = {(0; −1), (−1; 1), (2; 1)} is not linearly separable in input space

feature mapping:

    Φ(x) = (x², √2·x, 1)^T

kernel function:

    KΦ(x, y) = ⟨(x², √2·x, 1)^T, (y², √2·y, 1)^T⟩ = (xy)² + 2(xy) + 1 = (xy + 1)²

solution: α_1 = 3/4, α_2 = 2/3, α_3 = 1/12

    b = −1,  ~w = (1, −(1/2)√2, 0)^T   (in feature space!)
Introduction to Neuroinformatics: Support Vector Machines – p. 79
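A quick numerical check of this example (sketch; the printed decision values should all equal 1, i.e. every pattern lies exactly on the margin):

    # Sketch: verifying the worked example.
    import numpy as np

    x = np.array([0.0, -1.0, 2.0])
    d = np.array([-1.0, 1.0, 1.0])
    alpha = np.array([3/4, 2/3, 1/12])
    b = -1.0

    K = (np.outer(x, x) + 1.0) ** 2       # K(x, y) = (xy + 1)^2
    print(np.isclose(alpha @ d, 0.0))     # dual constraint: True
    print(d * (K @ (alpha * d) + b))      # -> [1. 1. 1.]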
Kernel trick (cont.)

example, illustration:

[figure: the patterns at 0, −1, 2 in input space and their images under Φ in the (f_1, f_2) feature plane, where they become linearly separable]

Which points in input space are on the decision boundary?
look for z with ∑_{i=1}^p α_i d^(i) KΦ(~x^(i), z) + b = 0

yields: z = (1 ± √5)/2
Introduction to Neuroinformatics: Support Vector Machines – p. 80
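Expanding ∑_i α_i d^(i) (x^(i) z + 1)² + b with the values of the example gives z² − z − 1 = 0, so the roots can be checked in one line (sketch):

    # Sketch: decision boundary of the example in input space.
    import numpy as np
    print(np.roots([1.0, -1.0, -1.0]))    # -> (1 ± sqrt(5)) / 2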
Kernel trick (cont.)

kernels can be designed individually for an application:
• look for meaningful features
• create Φ
• derive KΦ

in these cases, using the kernel instead is not better than an explicit calculation in feature space

there are classes of generic kernels:
• can be used for many tasks
• often, Φ is very complicated and the feature space is very high dimensional
• for some generic kernels, the feature space has infinite dimension and Φ cannot be calculated explicitly
• typically, using the kernel instead of an explicit calculation in feature space is computationally more efficient
Introduction to Neuroinformatics: Support Vector Machines – p. 81
Generic kernels

example of a kernel (one-dimensional input space):

    Φ(x) = (x², √2·x, 1)^T

mapping from input space to feature space needs: 2 multiplications
calculating the dot product in feature space needs: 3 multiplications, 2 additions
evaluating ⟨Φ(x), Φ(y)⟩ thus needs: 7 multiplications, 2 additions

corresponding kernel: KΦ(x, y) = (xy + 1)²
kernel evaluation needs: 2 multiplications, 1 addition
Introduction to Neuroinformatics: Support Vector Machines – p. 82
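A sketch confirming that both computations give the same value:

    # Sketch: explicit feature map vs. kernel for the example above.
    import numpy as np

    def phi(x):
        return np.array([x ** 2, np.sqrt(2) * x, 1.0])

    x, y = 1.7, -0.3
    print(phi(x) @ phi(y), (x * y + 1.0) ** 2)   # same value, fewer operations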
Generic kernels (cont.)

a class of generic kernels: polynomial kernels
polynomial kernels are functions of the form:

    KΦ(~x, ~y) = (⟨~x, ~y⟩)^d

or:

    KΦ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d

d ∈ N is a kernel parameter that needs to be chosen manually

for polynomial kernels, Φ can be calculated explicitly. It contains all monomials of degree d (first case) or all monomials of degree at most d (second case).
But: the feature space becomes very large, even for small d; e.g. for the first variant the number of features is:

    n     d     number of features
    3     2     6
    5     3     35
    10    3     220
    10    5     2002
    10    10    92 378
    50    4     292 825
    256   4     183 181 376
Introduction to Neuroinformatics: Support Vector Machines – p. 83
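The table follows from the count of monomials of degree d in n variables, binomial(n + d − 1, d); a sketch reproducing it:

    # Sketch: feature space dimension of the kernel (<x, y>)^d.
    from math import comb

    for n, d in [(3, 2), (5, 3), (10, 3), (10, 5),
                 (10, 10), (50, 4), (256, 4)]:
        print(n, d, comb(n + d - 1, d))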
Generic kernels (cont.)

what makes a symmetric function K a kernel function?

• a mapping Φ into some space with a dot product (a Hilbert space) exists such that K(~x, ~y) = ⟨Φ(~x), Φ(~y)⟩

• equivalently, K satisfies the condition of Mercer's theorem:

    A continuous, symmetric function K : [a, b]ⁿ × [a, b]ⁿ → R is a kernel
    if and only if for all functions g : [a, b]ⁿ → R with ∫_{[a,b]ⁿ} g(~x)² d~x < ∞:

        ∫_{[a,b]ⁿ} ∫_{[a,b]ⁿ} K(~x, ~y) g(~x) g(~y) d~x d~y ≥ 0

    Proof: omitted

consequences: sums of kernels and positive multiples of kernels are also kernels
Introduction to Neuroinformatics: Support Vector Machines – p. 84
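A practical consequence that can be checked numerically: every Gram matrix of a true kernel must be positive semidefinite. A sketch for the polynomial kernel on random points:

    # Sketch: Gram matrices of true kernels have no negative eigenvalues.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    K = (X @ X.T + 1.0) ** 2                     # Gram matrix of (<x,y> + 1)^2
    print(np.linalg.eigvalsh(K).min() >= -1e-9)  # -> True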
Generic kernels (cont.)

RBF kernels (Gaussian kernels):

    KΦ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))

σ² > 0 is a kernel parameter, chosen manually

feature spaces have infinite dimension

local information processing, circular decision boundaries in input space

what is calculated in an SVM with an RBF kernel resembles an RBF network, but training is done in a different way

the kernel parameter controls the complexity of the kernel: if σ² is large, the behavior is similar to the dot product; if σ² is very small, decision boundaries in input space may become very complex

in the limit σ² → 0 the kernel degenerates and the SVM merely memorizes the training patterns

most popular kernel type in practice
Introduction to Neuroinformatics: Support Vector Machines – p. 85
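A sketch of the kernel and the role of σ² (values made up): for large σ² even distant points look similar, for small σ² the kernel is sharply local:

    # Sketch: RBF kernel values for one pair of points and varying sigma^2.
    import numpy as np

    def rbf_kernel(x, y, sigma2=1.0):
        diff = np.asarray(x) - np.asarray(y)
        return np.exp(-(diff @ diff) / (2.0 * sigma2))

    x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    for s2 in (100.0, 1.0, 0.01):
        print(s2, rbf_kernel(x, y, s2))   # -> near 1, 0.37, near 0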
Generic kernels (cont.)

sigmoid kernels:

    KΦ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))

γ > 0 and θ are kernel parameters, chosen manually

feature spaces have infinite dimension

what is calculated in an SVM with a sigmoid kernel resembles an MLP network with one hidden layer

no practical use due to numerical degeneration of the optimization problem
Introduction to Neuroinformatics: Support Vector Machines – p. 86
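One way to see the degeneration (a constructed sketch, parameters chosen for the purpose): for many parameter settings the sigmoid function violates Mercer's condition, so its Gram matrices can have negative eigenvalues:

    # Sketch: the sigmoid "kernel" need not be positive semidefinite.
    import numpy as np

    X = np.eye(20)                        # 20 mutually orthogonal unit vectors
    K = np.tanh(2.0 * (X @ X.T) - 1.0)    # gamma = 2, theta = 0.5
    print(np.linalg.eigvalsh(K).min())    # -> about -13.7, clearly negative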
Generic kernels: summary

dot-product kernel: KΦ(~x, ~y) = ⟨~x, ~y⟩
input space = feature space; simple, but useful

polynomial kernels: KΦ(~x, ~y) = (⟨~x, ~y⟩ + 1)^d, KΦ(~x, ~y) = (⟨~x, ~y⟩)^d
extensions of the dot-product kernel; feature spaces of finite dimension

RBF kernels: KΦ(~x, ~y) = e^(−||~x − ~y||² / (2σ²))
very important in practice, very flexible

sigmoid kernels: KΦ(~x, ~y) = tanh(γ(⟨~x, ~y⟩ − θ))
limited practical use

problem-dependent kernels
→ demo in practical exercises
Introduction to Neuroinformatics: Support Vector Machines – p. 87
Summary (SVMs)

linear classification with an optimal separating hyperplane

mathematical problem description as a convex quadratic program

numerical solution by various approaches, often based on active set methods: libSVM, SMO, ...

representation of the solution in terms of support vectors

soft margin case: classification with errors

kernel trick: implicit calculations in feature space, non-linear SVM

extensions:
• SVM for regression
• SVM for one-class classification

important ideas:
• large margin techniques
• kernel-based techniques

Introduction to Neuroinformatics: Support Vector Machines – p. 88
Further readings
Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge Univ. Press, 2000

C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, 1998. Available at: http://www.kernel-machines.org

Roger Fletcher, Practical Methods of Optimization, Wiley, 1987
Introduction to Neuroinformatics: Support Vector Machines – p. 89