support vector machines - university at buffalosrihari/cse574/chap7/7.1-svms.pdfmachine learning...

Machine Learning Srihari

Support Vector Machines

[email protected]


SVM Discussion Overview

1. Overview of SVMs2. Margin Geometry3. SVM Optimization4. Overlapping Distributions5. Relationship to Logistic Regression6. Dealing with Multiple Classes7. SVM for Regression8. SVM and Computational Learning Theory9. Relevance Vector Machines

Machine Learning Srihari Kernel Methods• In the kernel method for regression

• Given data set {xi,ti}, i = 1..N• Gram matrix K has elements Knm= ϕ (xn)Tϕ (xm)=k(xn,xm)

• Output y predicted for x isy(x)=k(x)T(K +λIN)-1t, where k(x) has elements kn(x)=(xn,x)

• Many choices of kernel1. Gaussian: k(x,x’) = exp (-||x-x’||2/2σ2)2. Cosine similarity (BOW, TF-IDF):

3. Mercer kernel:# substrings shared

k(x

i,x

i ') =

xiTx

i '

||xi||

2||x

i '||

2

xi=[xi1,..,xiD]

k(x,x ') = w

ss∈A*∑ φ

s(x)φ

s(x ') ϕs(x) is no of times

substring s appears in x

J(w) =

12

wTφ(xn)−t

n{ }2

+λ2n=1

N

∑ wTw

J(w) =

12aTΦΦTΦΦTa−aTΦΦTt +

12

tTt +λ2aTΦΦTa

J(a) =

12aTKKa−aTKt +

12

tTt +λ2aTKa

a =(K +λIN)-1t

w = ΦTa a

n=−

1λ

wTφ(xn)−t

n{ }

Primal optimization

Dual optimization

w appearing in a eliminated using kernel


Sparse Kernel Methods

• Kernel methods have a limitation • that k(xn,xm) must be evaluated for all possible

pairs (xn,xm) of training points• Which is computationally infeasible

• In sparse solutions, predictions for new inputs depend only on kernels evaluated at a subset of training data points


Alternative Names for SVM• Also called Sparse kernel machines

• Prediction is based on a linear combination of a kernel evaluated at the training points

• Sparse because not all pairs of training points need be used

• Also called Maximum margin classifiers


Importance of SVMs• SVM is a discriminative method• SVM brings together:

1. Computational learning theory 2. Previous methods in linear discriminants 3. Optimization theory

• Widely used for ML problems: • Classification, regression, novelty detection


Ideas Used in SVM• Linearly separable case considered

• By using higher-dimensions than original feature space

• Kernel trick reduces computational overhead

• Determination of model parameters is a convex optimization problem

• Extensive use of Lagrange multipliers

k(y

j,y

k) = y

jt ⋅y

k= φ(x

j)t ⋅φ(x

k)


Boundaries induced by kernels

Source: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805


2. Maximum Margin Classifiers• Begin with 2-class linear classifier

y(x)=wTϕ(x)+b• where ϕ(x) is a feature space transformation

• We will introduce a dual representation which avoids working with feature space• Training set: x1,.., xN with targets t1,..,tN , tn ε {-1,1}

• New x classified according to sign of y(x)• Assume training data is linearly separable in

feature space, • i.e., there exists at least one choice of w and b

such that tny(xn)>0 for all points


Definition of margin • Perceptron guarantees a solution in finite

no. of steps, which depends on 1. Initial values for w,b2. Order in which data are presented

• If there are multiple solutions we need one with smallest generalization error• SVM approaches this with margin concept

• Defined as smallest distance between decision boundary and any of the samples


Support Vector Definition

• Margin:• Perpendicular distance between between

boundary and the closest data point (left)• Maximizing margin leads to a particular

choice of decision boundary • Determined by a subset of the data points

• Which are known as support vectors (circles)

y = 1y = 0

y = �1

margin

y = 1

y = 0

y = �1


Three support vectors are shown as solid dots

Support Vectors and Margin

• Support vectors are those nearest patterns at distance b from hyperplane

• SVM finds hyperplanewith maximum distance

(margin distance b)

from nearest training patterns


Motivation for maximum margin• Maximum margin solution is motivated by

computational learning theory• Simple insight

• Based on a hybrid of generative and discriminative approaches• Seen next (Tong, Koller 2000)


Max Margin Insight• Model class-conditionals p(x |Ci)

• Using Parzen estimator with Gaussian kernels– With common parameter σ2

• defines the Bayes classifier – Determines best hyperplane for learned density model

• As σ2 à 0 hyperplane has maximum margin– In the limit hyperplane becomes independent of points

that are not support vectors

100 normallydistributed noswith decreasing σp(x|Ci)

Different priors p(Ci)for linearlyseparable data leads toa decision boundarythat lies in the middle

P(C

1|x) =

P(x |C1)P(C

1)

P(x |C1)P(C

1)+ P(x |C

2)P(C

2)


y(x) = wt(x

p+ r

w||w ||

)+ b

Distance of a point x from planeTheorem: Given a plane y(x)=wTx+b=0, perpendiculardistance from x to the plane is

|||| wwrxx p +=

r =

y(x)||w ||

px y(x) < 0

= wtx

p+w

0+ r

wtw||w ||

= y(xp)+ r

||w ||2

||w || = r ||w ||

r = y(x)

||w ||

y(x) = wt ⋅x +b = 0

y(x) > 0

Corollary: Distance of origin to plane is

r = y(0)/||w|| = b/||w||

Since y(0)=wT0+b=bthen b=0 implies thatplane passes through origin

= 0

QED x = x

p+ r

w||w ||

Proof:


y(x) = 1

SVM Margin geometry

Shortest line connectingConvex hulls, whichhas length = 2/||w||

Optimal hyperplaneis orthogonal to..

y1y2y(x) = -1

y(x) = wtx = 0y2

x1


Maximum Margin Solution• We interested in solutions for which tny(xn)>0 for all n• Distance of point xn to decision surface is

• Margin is closest point xn of data set• We wish to optimize w, b to maximize distance

• Maximum margin solution is found from

• Direct solution is very complex• So convert into equivalent problem

tny(x

n)

||w ||=

tn(wTφ(x

n)+b)

||w ||

argmax

w,b

1||w ||

minn

tn(wTφ(x

n)+b)⎡

⎣⎢⎤⎦⎥

⎧⎨⎪⎪

⎩⎪⎪

⎫⎬⎪⎪

⎭⎪⎪


Equivalent problem• If we rescale wàκw and bàκb then the

distance from any point xn to the decision surface given by tny(xn)/||w|| is unchanged• We can use this freedom to set, for the point

closest to the surface, tn (wTϕ(xn)+b)=1• In this case all points will satisfy

tn (wTϕ(xn)+b)≥1 n=1,..,N• This is the canonical representation of the

hyperplane


Canonical Representationtn (wTϕ(xn)+b)≥1 n=1,..,N

• When equality holds, constraints are active• For the remainder they are inactive• There will always be at least one active

constraint, because there will be a closest point• Once margin is maximized there will be two

active constraints

Machine Learning Srihari Optimization Problem (OP) Restated• Augmented space:

y(x) = wtx by choosing w0= b and x0=1, i.e, plane passes through origin• Canonical representation of decision hyperplane is then

tn y(xn) > 0, n = 1,.., N• Since distance of a point y to hyperplane y(x)=0 is

Maximizing the margin is equivalent to maximizing ||w||-1

Which is equivalent to minimizing ||w||2

y(x)||w ||

margin

xy(x)= wTx

So the optimization problem

is equivalent to

subject to the constraintstn (wTϕ(xn)+b)≥1 n=1,..,N

argmax

w,b

1||w ||

minn

tn(wTφ(x

n)+b)⎡

⎣⎢⎤⎦⎥

⎧⎨⎪⎪

⎩⎪⎪

⎫⎬⎪⎪

⎭⎪⎪

argmin

w,b

12

w2 The factor ½ is for later convenience

This is an example of quadratic programmingWe are trying to maximize a quadratic function subject to a set of linear inequality constraints


• Can be cast as an unconstrained problem • by introducing Lagrange undetermined multipliers

with one multiplier an ≥ 0 for each constraint• The Lagrange function we wish to minimize is

L w,b,a( ) =

12

w2− a

ntn(wtφ(x

n)+b)−1⎡

⎣⎢⎤⎦⎥

n=1

N

∑

SVM Constrained Optimization

Optimize

arg minw,b

12

||w ||2

subject to constraints

tn(wtφ(x

n)+b)≥1, n = 1,.....N

where a=(a1,..,aN)T

Note the minus sign in front of the Lagrange multiplier termBecause we are minimizing wrt w and b, and maximizing wrt a


• We wish to minimize

• We seek to minimize L ( ) • with respect to w and b• maximize it w.r.t . the undetermined

multipliers • Last term represents the goal of classifying the points correctly• Karush-Kuhn-Tucker construction shows that this can be recast

as a maximization problem which is computationally better

Optimization of Lagrange function

L w,b,a( ) =

12

w2− a

ntnwt(φ(x

n)+b)−1⎡

⎣⎢⎤⎦⎥

n=1

N

∑

an ≥ 0


Eliminating w and b

• Setting derivatives of L(w,b,a) wrt w and bequal to zero• We obtain the following conditions

• Eliminating w and b from L(w,b,a) using these conditions then gives the dual representation of the maximum margin problem

L w,b,a( ) =

12

w2− a

ntnwt(φ(x

n)+b)−1⎡

⎣⎢⎤⎦⎥

n=1

N

∑

w = a

nn∑ t

nx

n 0 = a

nn∑ t

n


• Problem is one of maximizing

• Subject to the constraints

given the training data• Here the kernel function is defined by

• Again a quadratic optimization problem • Optimize a quadratic function subject to a set

of inequality constraints

!L a( ) = a

nn=1

N

∑ −12 m=1

N

∑ ama

ntntmk(x

n,x

m)

n=1

N

∑

Dual O.P.

a

nn=1

N

∑ tn

= 0 an≥ 0, n = 1,.....,N

k(x,x ') = φ(x)t ⋅φ(x')


Complexity of Primal and Dual• Quadratic programming complexity is O(M3)

• Primal OP of minimizing • over M basis functions • Dual OP of maximizing• over N data points

• For M<<N it appears disadvantageous• But it allows model to be reformulated using kernels

• So it can be applied to feature spaces whose dimensionality M exceeds no of data points N

• Kernel formulation also makes clear that k(x,x’) be positive definite, which bounds L(a) is lower-bounded

argmin

w,b

12

w2

!L a( ) = a

nn=1

N

∑ −12 m=1

N

∑ ama

ntntmk(x

n,x

m)

n=1

N

∑


• To classify points using trained model:• Evaluate sign of y(x)=wTϕ(x)+b

• This can be expressed in terms of the parameters {an} and the kernel function by substituting for w using to give

Classifying new points

w = a

nn∑ t

nx

n

y(x) = a

ntnk(x,x

n)+b

n=1

N

∑


Definition of Support Vectors• This OP satisfies KKT conditions

• which requires the following hold:

• Thus for every data point either an=0 or tny(xn)=1– A point with an=0 will not appear in sum

• Remaining data points are called support vectors– Because they satisfy tny(xn)=1 they correspond to

points that lie on the maximum margin hyperplanes in feature space

an≥ 0

tny(x

n)−1≥ 0

an{t

ny(x

n)−1} = 0

y(x) = a

ntnk(x,x

n)+b

n=1

N

∑

y = 1y = 0

y = �1

margin

y = 1

y = 0

y = �1

This property is central to practical applicability ofSVMs. Once it is trained a significant proportion ofData points can be discarded and only the support vectors retained


Solution for b• Having solved the quadratic programming

problem and found a we can can determine value of threshold parameter b• Noting that for any support vector tny(xn)=1• Using gives

• Where S denotes the indices of support vectors• Solving for b gives

• Where NS is the total no of support vectors

y(x) = a

ntnk(x,x

n)+b

n=1

N

∑ tn

amtmk(x

n,x

m)+b

m∈S∑⎛

⎝⎜⎜⎜

⎞

⎠⎟⎟⎟⎟= 1

b =

1N

S

tn− a

mtmk(x

n,x

m)

m∈S∑

⎛

⎝⎜⎜⎜

⎞

⎠⎟⎟⎟⎟


Equivalent Error Function of SVM• For comparison with alternative models

• Express the maximum margin classifier in terms of minimization of an error function:

• Where E∞(z) is zero if z≥0 and ∞ otherwise– Ensures that tn (wTϕ(xn)+b)≥1 n=1,..,N are satisfied

• Gives infinite error if point is misclassified• Zero error if it was classified correctly

• We optimize parameters to maximize margin

E∞

n=1

N

∑ y(xn)t

n−1( )+λ w

2


Example of SVM Classifier • Two classes in two

dimensions• Synthetic Data• Shows contours of constant

y (x)• Obtained from SVM with

Gaussian kernel function• Decision boundary is shown• Margin boundaries are also

shown• Support vectors are shown• Shows sparsity of SVM


Quadratic term

Summary of SVM Optimization ProblemsDifferentNotation here!


Kernel Function: key property

• If kernel function is chosen with propertyk(x,x’) = (ϕ (x)T.ϕ(x’))

then computational expense of increased dimensionality is avoided.

• Polynomial kernel k(x,x’) = (xT.x’)pcan be shown (next slide) to correspond to a mapϕ into the space spanned by all products of exactly d dimensions.


A Polynomial Kernel Function

• Inner product needs computing six feature values and 3 ✕ 3 = 9 multiplications

• Kernel function k(x,y) has 2 multiplications, one add and a squaring

Suppose x = (x1,x

2) is a 2 - dimensional input vector

The feature space is 3 - dimensional :

φ(x) = x12,( 2x

1x

2,x

22)

t

Then inner product is

φ(x)Tφ(y) = x12,( 2x

1x

2,x

22)

t

y12,( 2y

1y

2,y

22) = (x

1y

1+ x

2y

2)2

Polynomial kernel function to compute the same value is

k(x,y) = (xT.y)2 = x1,( x

2)T y1,( y

2)⎛⎝⎜⎜⎜

⎞⎠⎟⎟⎟

2

= (x1y

1+ x

2y

2)2

or k(x,y) = φ(x)Tφ(y)

φ(x)Tφ(y)

same


Another Polynomial kernel function

• k(x,y) = (x.y + 1)2• This one maps d = 2, p = 2 into a six- dimensional space

• Contains all the powers of x

• Inner product needs 36multiplications

• Kernel function needs 4multiplications

k(x,y) = φ(x)φ(y)where

φ(x) = x12 ,( x2

2 , 2x1, 2x2 , 2x1x2 ,1)x22


Non-Linear Case• Mapping function ϕ (.) to a

sufficiently high dimension• So that data from two

categories can always be separated by a hyperplane

• Assume each pattern xkhas been transformed toϕ(xk), for k=1,.., n

• First choose the non-linear ϕfunctions

• To map the input vector to a higher dimensional feature space

• Dimensionality of space can be arbitrarily high only limited by computational resources


Mapping into Higher Dimensional Feature Space

• Mapping each input point x by map

Points on 1-d line are mapped onto curve in 3-d.

• Linear separation in 3-d space is possible. Linear discriminant function in 3-d is in the form

Φ(x) =1xx 2

⎛

⎝

⎜⎜⎜⎜⎜⎜⎜⎜

⎞

⎠

⎟⎟⎟⎟⎟⎟⎟⎟⎟

y(x) = a1x

1+a

2x

2+a

3x

3


Gaussian (RBF) Kernel• ϕ(x)maps x to an infinite space

• Achieves perfect separationFor sufficiently small kernel bandwidth, decision boundary will look like you just drew little circles around the points whenever they are needed to separatepositiveand negative examples.


RBF Features from Taylor’s Series


Pattern Transformation using Kernels

• Problem with high-dimensional mapping • Very many parameters• Polynomial of degree p over d variables leads to O(dp) variables in feature space

• Example: if d = 50 and p =2 we need a feature space of size 2500• Solution:

• Dual Optimization problem needs only inner products

• Each pattern xk transformed into

• Dimensionality of mapped space can be arbitrarily high

φ(xk)


SVM for the XOR problem• XOR: binary valued features x1, x2• Not solved by linear discriminant• Function ϕ maps input x = [x1, x2] into six-D

feature space y = [1, √2x1, √2x2, √2x1x2 , x12 , x2

2]T

HyperbolasCorresponding tox1 x2 = +1

HyperplanesCorresponding to√2x1 x2 = +1

input space feature sub-space


• We seek to maximize

• Subject to the constraints , i.e.,

• From problem symmetry, at the solution

SVM for XOR: maximization problem

a1−a

2+a

3−a

4= 0

0≤ak k = 1, 2, 3, 4

a

kk=1

4

∑ −12

ak

j=1

4

∑k=1

4

∑ ajt

jtkφ(x

j)φ(x

k)

a1=a3 and a2=a4

0 = a

nn∑ t

n


SVM for XOR: maximization problem

• Can use iterative gradient descent• Or use analytical techniques for small problem• The solution is

a* = (1/8, 1/8, 1/8, 1/8)• Last term of Optimizn Problem implies that all four points are support vectors

(unusual and due to symmetric nature of XOR)• The final discriminant function is y(x1,x2) = x1✕ x2• Decision hyperplane is defined by y(x1,x2) = 0• Margin is given by b=1/||a|| = √2

HyperbolasCorresponding tox1 x2 = +1

HyperplanesCorresponding to_

√2x1 x2 = +1

support vector machines - university at buffalosrihari/cse574/chap7/7.1-svms.pdfmachine learning...

Documents