
Page 1:

Machine Learning Srihari

Support Vector Machines

[email protected]

Page 2:

SVM Discussion Overview

1. Overview of SVMs
2. Margin Geometry
3. SVM Optimization
4. Overlapping Distributions
5. Relationship to Logistic Regression
6. Dealing with Multiple Classes
7. SVM for Regression
8. SVM and Computational Learning Theory
9. Relevance Vector Machines

Page 3:

Kernel Methods

• In the kernel method for regression:

• Given data set {xi, ti}, i = 1..N
• Gram matrix K has elements Knm = ϕ(xn)^T ϕ(xm) = k(xn, xm)
• Output y predicted for x is y(x) = k(x)^T (K + λI_N)^{-1} t, where k(x) has elements kn(x) = k(xn, x)

• Many choices of kernel:
  1. Gaussian: k(x,x′) = exp( −||x − x′||² / 2σ² )
  2. Cosine similarity (BOW, TF-IDF): k(xi, xi′) = xi^T xi′ / ( ||xi||₂ ||xi′||₂ ), where xi = [xi1,..,xiD]
  3. Mercer (string) kernel, counting shared substrings: k(x, x′) = Σ_{s∈A*} ws ϕs(x) ϕs(x′), where ϕs(x) is the number of times substring s appears in x

Primal optimization:

    J(w) = (1/2) Σ_{n=1}^N { w^T ϕ(xn) − tn }² + (λ/2) w^T w

Setting the gradient to zero gives w = Φ^T a, with an = −(1/λ){ w^T ϕ(xn) − tn }

Dual optimization (w is eliminated using the kernel):

    J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a − a^T ΦΦ^T t + (1/2) t^T t + (λ/2) a^T ΦΦ^T a
         = (1/2) a^T K K a − a^T K t + (1/2) t^T t + (λ/2) a^T K a

    a = (K + λ I_N)^{-1} t

(a minimal code sketch of the resulting predictor follows)
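To make the dual predictor concrete, here is a minimal NumPy sketch of kernel (ridge) regression as summarized above, y(x) = k(x)^T (K + λI_N)^{-1} t. The Gaussian kernel, the synthetic data, and the value of λ are illustrative assumptions, not taken from the slides.

import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Assumed synthetic training data: noisy samples of sin(x)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 0.1                                          # regularization parameter lambda
K = gaussian_kernel(X, X)                          # Gram matrix K_nm = k(x_n, x_m)
a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # a = (K + lambda I_N)^{-1} t

# Prediction at new inputs: y(x) = k(x)^T a, with k_n(x) = k(x_n, x)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
print(gaussian_kernel(X_new, X) @ a)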

Page 4:

Sparse Kernel Methods

• Kernel methods have a limitation: k(xn, xm) must be evaluated for all possible pairs (xn, xm) of training points, which is computationally expensive and can be infeasible for large training sets

• In sparse solutions, predictions for new inputs depend only on kernels evaluated at a subset of training data points

Page 5:

Alternative Names for SVM

• Also called sparse kernel machines

• Prediction is based on a linear combination of a kernel evaluated at the training points

• Sparse because not all training points need be used

• Also called Maximum margin classifiers

Page 6:

Importance of SVMs

• SVM is a discriminative method
• SVM brings together:
  1. Computational learning theory
  2. Previous methods in linear discriminants
  3. Optimization theory
• Widely used for ML problems: classification, regression, novelty detection

Page 7:

Ideas Used in SVM

• Linearly separable case considered
• By using higher dimensions than the original feature space

• Kernel trick reduces computational overhead

• Determination of model parameters is a convex optimization problem

• Extensive use of Lagrange multipliers

    k(yj, yk) = yj^T · yk = ϕ(xj)^T · ϕ(xk)

Page 8:

Boundaries induced by kernels

Source: https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
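The gist linked above plots the boundaries; as a lighter-weight alternative, the hedged sketch below fits scikit-learn's SVC with three different kernels on an assumed toy dataset (make_moons) and reports accuracy and support-vector counts instead of drawing the boundaries.

import numpy as np
from sklearn import datasets, svm

# Assumed two-class toy data (not the data shown in the slide's figure)
X, y = datasets.make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = svm.SVC(kernel=kernel, C=1.0).fit(X, y)
    print(f"{kernel:6s} kernel: train accuracy {clf.score(X, y):.2f}, "
          f"{len(clf.support_)} support vectors")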

Page 9:

2. Maximum Margin Classifiers

• Begin with the 2-class linear classifier y(x) = w^T ϕ(x) + b, where ϕ(x) is a feature space transformation
• We will introduce a dual representation which avoids working directly in the feature space
• Training set: x1,..,xN with targets t1,..,tN, tn ∈ {−1, 1}
• A new x is classified according to the sign of y(x)
• Assume the training data is linearly separable in feature space, i.e., there exists at least one choice of w and b such that tn y(xn) > 0 for all points

Page 10:

Definition of margin

• The perceptron guarantees a solution in a finite no. of steps, which depends on:
  1. Initial values for w, b
  2. Order in which data are presented
• If there are multiple solutions we need the one with smallest generalization error
• SVM approaches this with the margin concept
• Defined as the smallest distance between the decision boundary and any of the samples

Page 11:

Support Vector Definition

• Margin: perpendicular distance between the boundary and the closest data point (left)
• Maximizing the margin leads to a particular choice of decision boundary
• Determined by a subset of the data points, which are known as support vectors (circled)

[Figure: decision boundary y = 0 with margin boundaries y = 1 and y = −1; the margin is the distance between the boundary and the closest points]

Page 12:

Support Vectors and Margin

• Three support vectors are shown as solid dots (figure)
• Support vectors are those nearest patterns at distance b from the hyperplane
• SVM finds the hyperplane with maximum distance (margin distance b) from the nearest training patterns

Page 13:

Motivation for maximum margin

• The maximum margin solution is motivated by computational learning theory
• A simple insight, based on a hybrid of generative and discriminative approaches, is seen next (Tong, Koller 2000)

Page 14:

Max Margin Insight

• Model the class-conditionals p(x|Ci) using a Parzen estimator with Gaussian kernels
  – With common parameter σ²
• The posterior P(C1|x) (below) defines the Bayes classifier, which determines the best hyperplane for the learned density model
• As σ² → 0 the hyperplane has maximum margin
  – In the limit the hyperplane becomes independent of points that are not support vectors

[Figure: p(x|Ci) estimated from 100 normally distributed points with decreasing σ; for linearly separable data, different priors p(Ci) lead to a decision boundary that lies in the middle]

    P(C1|x) = P(x|C1)P(C1) / [ P(x|C1)P(C1) + P(x|C2)P(C2) ]

Page 15:

Distance of a point x from plane

Theorem: Given a plane y(x) = w^T x + b = 0, the perpendicular distance from x to the plane is r = y(x)/||w||.

Proof: Write x = xp + r w/||w||, where xp is the projection of x onto the plane, so y(xp) = 0. Then

    y(x) = w^T ( xp + r w/||w|| ) + b
         = w^T xp + b + r (w^T w)/||w||
         = y(xp) + r ||w||²/||w||
         = r ||w||

so r = y(x)/||w||.  QED

Corollary: The distance of the origin to the plane is r = y(0)/||w|| = b/||w||. Since y(0) = w^T·0 + b = b, it follows that b = 0 implies the plane passes through the origin.

[Figure: plane y(x) = w^T x + b = 0 separating the regions y(x) > 0 and y(x) < 0, with xp the foot of the perpendicular from x; a small numerical check follows]
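A quick numerical check of the result r = y(x)/||w||, using an arbitrary (assumed) plane and point:

import numpy as np

w = np.array([2.0, -1.0, 0.5])          # assumed plane parameters
b = 1.5
x = np.array([0.3, 2.0, -1.0])          # assumed point

r = (w @ x + b) / np.linalg.norm(w)     # signed distance r = y(x)/||w||
x_p = x - r * w / np.linalg.norm(w)     # projection of x onto the plane

print(w @ x_p + b)                      # ~0: x_p lies on the plane
print(np.linalg.norm(x - x_p), abs(r))  # equal: |r| is the perpendicular distance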

Page 16:

SVM Margin geometry

[Figure: hyperplane y(x) = w^T x = 0 with margin boundaries y(x) = 1 and y(x) = −1; the optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes, which has length 2/||w||]

Page 17:

Maximum Margin Solution

• We are interested in solutions for which tn y(xn) > 0 for all n
• The distance of a point xn to the decision surface is

    tn y(xn)/||w|| = tn (w^T ϕ(xn) + b)/||w||

• The margin is given by the closest point xn of the data set, and we wish to optimize w, b to maximize this distance
• The maximum margin solution is found from

    argmax_{w,b} { (1/||w||) min_n [ tn (w^T ϕ(xn) + b) ] }

• A direct solution is very complex, so we convert it into an equivalent problem

Page 18:

Equivalent problem

• If we rescale w → κw and b → κb then the distance from any point xn to the decision surface, given by tn y(xn)/||w||, is unchanged
• We can use this freedom to set, for the point closest to the surface, tn (w^T ϕ(xn) + b) = 1
• In this case all points will satisfy tn (w^T ϕ(xn) + b) ≥ 1, n = 1,..,N
• This is the canonical representation of the decision hyperplane

Page 19:

Canonical Representation

    tn (w^T ϕ(xn) + b) ≥ 1, n = 1,..,N

• When equality holds, the constraints are active; for the remainder they are inactive
• There will always be at least one active constraint, because there will always be a closest point
• Once the margin has been maximized there will be at least two active constraints

Page 20:

Optimization Problem (OP) Restated

• Augmented space: y(x) = w^T x, by choosing w0 = b and x0 = 1, i.e., the plane passes through the origin
• The canonical representation of the decision hyperplane is then tn y(xn) > 0, n = 1,..,N
• Since the distance of a point x to the hyperplane y(x) = 0 is y(x)/||w||, maximizing the margin is equivalent to maximizing ||w||⁻¹, which is equivalent to minimizing ||w||²

So the optimization problem

    argmax_{w,b} { (1/||w||) min_n [ tn (w^T ϕ(xn) + b) ] }

is equivalent to

    argmin_{w,b} (1/2)||w||²    (the factor ½ is for later convenience)

subject to the constraints tn (w^T ϕ(xn) + b) ≥ 1, n = 1,..,N.

This is an example of quadratic programming: we are trying to minimize a quadratic function subject to a set of linear inequality constraints (a small solver sketch follows).

[Figure: point x, the hyperplane y(x) = w^T x, and the margin]
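A minimal sketch of this primal QP, argmin_{w,b} ½||w||² subject to tn(w^T xn + b) ≥ 1, using cvxpy with ϕ(x) = x on an assumed linearly separable toy set; cvxpy and the data are illustrative choices, not part of the slides.

import numpy as np
import cvxpy as cp

# Assumed linearly separable 2-D toy data, labels t_n in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [0.0, 0.0], [0.5, -1.0], [-1.0, 0.5]])
t = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# minimize (1/2)||w||^2  subject to  t_n (w^T x_n + b) >= 1
constraints = [cp.multiply(t, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin 1/||w|| =", 1 / np.linalg.norm(w.value))

Points whose constraint holds with equality, tn(w^T xn + b) = 1, are the support vectors.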

Page 21:

SVM Constrained Optimization

• Optimize

    argmin_{w,b} (1/2)||w||²

  subject to the constraints tn (w^T ϕ(xn) + b) ≥ 1, n = 1,..,N
• This can be cast as an unconstrained problem by introducing Lagrange undetermined multipliers, with one multiplier an ≥ 0 for each constraint
• The Lagrange function we wish to minimize is

    L(w,b,a) = (1/2)||w||² − Σ_{n=1}^N an [ tn (w^T ϕ(xn) + b) − 1 ]

  where a = (a1,..,aN)^T
• Note the minus sign in front of the Lagrange multiplier term: we are minimizing with respect to w and b, and maximizing with respect to a

Page 22:

Optimization of Lagrange function

• We wish to optimize

    L(w,b,a) = (1/2)||w||² − Σ_{n=1}^N an [ tn (w^T ϕ(xn) + b) − 1 ],    an ≥ 0

• We seek to minimize L with respect to w and b, and maximize it w.r.t. the undetermined multipliers a
• The last term represents the goal of classifying the points correctly
• The Karush-Kuhn-Tucker construction shows that this can be recast as a maximization problem which is computationally better

Page 23:

Eliminating w and b

• Setting the derivatives of L(w,b,a) with respect to w and b equal to zero, where

    L(w,b,a) = (1/2)||w||² − Σ_{n=1}^N an [ tn (w^T ϕ(xn) + b) − 1 ]

  we obtain the following conditions:

    w = Σ_n an tn ϕ(xn)        0 = Σ_n an tn

• Eliminating w and b from L(w,b,a) using these conditions then gives the dual representation of the maximum margin problem

Page 24:

Dual O.P.

• The problem is one of maximizing

    L̃(a) = Σ_{n=1}^N an − (1/2) Σ_{n=1}^N Σ_{m=1}^N an am tn tm k(xn, xm)

• Subject to the constraints, given the training data,

    Σ_{n=1}^N an tn = 0,    an ≥ 0, n = 1,..,N

• Here the kernel function is defined by k(x,x′) = ϕ(x)^T ϕ(x′)
• Again a quadratic optimization problem: optimize a quadratic function subject to a set of inequality constraints (see the sketch below)
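A hedged sketch of this dual QP with cvxpy; a linear kernel k(x,x′) = x^T x′ is assumed so the quadratic term can be written as a sum of squares, Σ_nm an am tn tm xn^T xm = ||X^T(t∘a)||², which keeps the problem in a form cvxpy accepts directly. The toy data are the same illustrative points as in the primal sketch.

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [0.0, 0.0], [0.5, -1.0], [-1.0, 0.5]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(t)

a = cp.Variable(N)
# maximize sum_n a_n - (1/2) || X^T (t * a) ||^2
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(X.T @ cp.multiply(t, a)))
constraints = [a >= 0, t @ a == 0]
cp.Problem(objective, constraints).solve()

print("a_n:", np.round(a.value, 3))          # nonzero only for support vectors
print("w  :", X.T @ (a.value * t))           # w = sum_n a_n t_n x_n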

Page 25:

Complexity of Primal and Dual

• Quadratic programming complexity is O(M³)
  – The primal OP minimizes argmin_{w,b} (1/2)||w||² over M basis functions
  – The dual OP maximizes L̃(a) = Σ_{n=1}^N an − (1/2) Σ_n Σ_m an am tn tm k(xn, xm) over N data points
• For M << N the dual appears disadvantageous, but it allows the model to be reformulated using kernels
• So it can be applied to feature spaces whose dimensionality M exceeds the number of data points N
• The kernel formulation also makes clear the requirement that k(x,x′) be positive definite, which ensures that L̃(a) is bounded below

Page 26:

Classifying new points

• To classify points using the trained model: evaluate the sign of y(x) = w^T ϕ(x) + b
• This can be expressed in terms of the parameters {an} and the kernel function by substituting for w using w = Σ_n an tn ϕ(xn), to give

    y(x) = Σ_{n=1}^N an tn k(x, xn) + b

(a small sketch of this rule follows)
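A small sketch of this prediction rule; the kernel, support vectors, multipliers, and b are placeholder values chosen to be self-consistent, not outputs from the slides.

import numpy as np

def svm_predict(x_new, X_sv, t_sv, a_sv, b, kernel):
    # y(x) = sum_n a_n t_n k(x, x_n) + b, summed over the support vectors only
    k_vals = np.array([kernel(x_new, x_n) for x_n in X_sv])
    return np.sign(np.sum(a_sv * t_sv * k_vals) + b)

linear_kernel = lambda x, y: x @ y           # assumed kernel
X_sv = np.array([[2.0, 2.0], [0.0, 0.0]])    # assumed support vectors
t_sv = np.array([1.0, -1.0])
a_sv = np.array([0.25, 0.25])
b = -1.0
print(svm_predict(np.array([3.0, 3.0]), X_sv, t_sv, a_sv, b, linear_kernel))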

Page 27:

Definition of Support Vectors

• This OP satisfies the KKT conditions, which require that the following hold:

    an ≥ 0
    tn y(xn) − 1 ≥ 0
    an { tn y(xn) − 1 } = 0

• Thus for every data point either an = 0 or tn y(xn) = 1
  – A point with an = 0 will not appear in the sum y(x) = Σ_{n=1}^N an tn k(x, xn) + b
• The remaining data points are called support vectors
  – Because they satisfy tn y(xn) = 1 they correspond to points that lie on the maximum margin hyperplanes in feature space

[Figure: decision boundary y = 0 with margin boundaries y = 1 and y = −1; support vectors lie on the margin boundaries]

This property is central to the practical applicability of SVMs: once the model is trained, a significant proportion of the data points can be discarded and only the support vectors retained.

Page 28:

Solution for b

• Having solved the quadratic programming problem and found a, we can determine the value of the threshold parameter b
• Noting that any support vector satisfies tn y(xn) = 1, and using y(x) = Σ_n an tn k(x, xn) + b, gives

    tn ( Σ_{m∈S} am tm k(xn, xm) + b ) = 1

  where S denotes the indices of the support vectors
• Solving for b gives

    b = (1/N_S) Σ_{n∈S} ( tn − Σ_{m∈S} am tm k(xn, xm) )

  where N_S is the total number of support vectors (a short code sketch follows)
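The same formula as a short NumPy helper, reusing the assumed toy support-vector values from the prediction sketch above:

import numpy as np

def compute_b(X_sv, t_sv, a_sv, kernel):
    # b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    K = np.array([[kernel(xn, xm) for xm in X_sv] for xn in X_sv])
    return np.mean(t_sv - K @ (a_sv * t_sv))

linear_kernel = lambda x, y: x @ y
X_sv = np.array([[2.0, 2.0], [0.0, 0.0]])    # assumed support vectors
t_sv = np.array([1.0, -1.0])
a_sv = np.array([0.25, 0.25])
print(compute_b(X_sv, t_sv, a_sv, linear_kernel))   # -1.0 for these values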

Page 29:

Equivalent Error Function of SVM

• For comparison with alternative models, express the maximum margin classifier in terms of the minimization of an error function:

    Σ_{n=1}^N E∞( y(xn) tn − 1 ) + λ ||w||²

• Where E∞(z) is zero if z ≥ 0 and ∞ otherwise
  – Ensures that the constraints tn (w^T ϕ(xn) + b) ≥ 1, n = 1,..,N are satisfied
• Gives infinite error if a point is misclassified (violates its constraint) and zero error if it is classified correctly
• We then optimize the parameters to maximize the margin

Page 30:

Example of SVM Classifier

• Two classes in two dimensions, synthetic data
• Shows contours of constant y(x), obtained from an SVM with a Gaussian kernel function
• The decision boundary, the margin boundaries, and the support vectors are shown
• Shows the sparsity of the SVM

Page 31:

Summary of SVM Optimization Problems

[Figure: table summarizing the SVM optimization problems and their quadratic terms; note that different notation is used here]

Page 32:

[Figure-only slide]

Page 33:

Kernel Function: key property

• If the kernel function is chosen with the property k(x,x′) = ϕ(x)^T ϕ(x′), then the computational expense of the increased dimensionality is avoided.
• The polynomial kernel k(x,x′) = (x^T x′)^p can be shown (next slide) to correspond to a map ϕ into the space spanned by all products of exactly p input components.

Page 34:

A Polynomial Kernel Function

• Inner product needs computing six feature values and 3 ✕ 3 = 9 multiplications

• Kernel function k(x,y) has 2 multiplications, one add and a squaring

Suppose x = (x1, x2) is a 2-dimensional input vector. The feature space is 3-dimensional:

    ϕ(x) = ( x1², √2 x1x2, x2² )^T

Then the inner product is

    ϕ(x)^T ϕ(y) = ( x1², √2 x1x2, x2² ) ( y1², √2 y1y2, y2² )^T = ( x1y1 + x2y2 )²

The polynomial kernel function that computes the same value is

    k(x,y) = (x^T y)² = ( (x1, x2)^T (y1, y2) )² = ( x1y1 + x2y2 )²

so k(x,y) = ϕ(x)^T ϕ(y), the same value (a numerical check follows).
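A two-line numerical confirmation that the explicit feature map and the kernel give the same value, using arbitrary assumed vectors x and y:

import numpy as np

def phi(x):
    # explicit feature map for the quadratic kernel in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product in feature space
print((x @ y) ** 2)      # kernel evaluation: same value, far fewer operations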

Page 35:

Another Polynomial kernel function

• k(x,y) = (x·y + 1)²
• This one maps d = 2, p = 2 into a six-dimensional space
• Contains all the powers of x up to degree 2
• The explicit inner product needs 36 multiplications; the kernel function needs 4 multiplications
• k(x,y) = ϕ(x)^T ϕ(y), where

    ϕ(x) = ( x1², x2², √2 x1, √2 x2, √2 x1x2, 1 )^T

Page 36:

Non-Linear Case

• Map with a function ϕ(·) to a sufficiently high dimension so that data from the two categories can always be separated by a hyperplane
• Assume each pattern xk has been transformed to ϕ(xk), for k = 1,..,n
• First choose the non-linear ϕ functions to map the input vector to a higher dimensional feature space
• The dimensionality of the space can be arbitrarily high, limited only by computational resources

Page 37:

Mapping into Higher Dimensional Feature Space

• Map each input point x by the map

    Φ(x) = ( 1, x, x² )^T

  Points on the 1-d line are mapped onto a curve in 3-d.
• Linear separation in the 3-d space is possible. The linear discriminant function in 3-d has the form

    y(x) = a1x1 + a2x2 + a3x3

  (a small demonstration follows)
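A hedged sketch of this 1-d to 3-d trick: data that cannot be split by a single threshold on the line becomes linearly separable after applying Φ(x) = (1, x, x²)^T. The data and the use of scikit-learn's linear SVC are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

# Assumed 1-D data: class +1 sits between two groups of class -1
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
t = np.array([-1, -1, 1, 1, 1, -1, -1])

Phi = np.column_stack([np.ones_like(x), x, x ** 2])   # Phi(x) = (1, x, x^2)^T
clf = SVC(kernel="linear", C=1e6).fit(Phi, t)         # large C ~ hard margin
print(clf.score(Phi, t))                              # 1.0: separable after the mapping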

Page 38:

Gaussian (RBF) Kernel

• ϕ(x) maps x to an infinite-dimensional space
• Achieves perfect separation: for a sufficiently small kernel bandwidth, the decision boundary will look as if little circles were drawn around the points wherever they are needed to separate positive and negative examples (see the sketch below)
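A sketch of the bandwidth effect described above, using scikit-learn's RBF SVC on an assumed toy dataset; in scikit-learn gamma = 1/(2σ²), so a large gamma corresponds to a small σ.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

for gamma in (0.1, 1.0, 100.0):          # increasing gamma = shrinking bandwidth
    clf = SVC(kernel="rbf", gamma=gamma, C=1e3).fit(X, y)
    print(f"gamma={gamma:6.1f}: train accuracy {clf.score(X, y):.2f}, "
          f"{len(clf.support_)} support vectors")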

Page 39:

RBF Features from Taylor’s Series

Page 40:

Pattern Transformation using Kernels

• Problem with a high-dimensional mapping: very many parameters
  – A polynomial of degree p over d variables leads to O(d^p) variables in feature space
  – Example: if d = 50 and p = 2 we need a feature space of size 2500
• Solution: the dual optimization problem needs only inner products
• Each pattern xk is transformed into ϕ(xk)
• The dimensionality of the mapped space can be arbitrarily high

Page 41:

SVM for the XOR problem

• XOR: binary valued features x1, x2
• Not solved by a linear discriminant
• The function ϕ maps the input x = [x1, x2] into the six-D feature space

    y = [ 1, √2 x1, √2 x2, √2 x1x2, x1², x2² ]^T

[Figure: input space, with hyperbolas corresponding to x1x2 = ±1; feature sub-space, with hyperplanes corresponding to √2 x1x2 = ±1]

Page 42:

SVM for XOR: maximization problem

• We seek to maximize

    Σ_{k=1}^4 ak − (1/2) Σ_{k=1}^4 Σ_{j=1}^4 ak aj tk tj ϕ(xj)^T ϕ(xk)

• Subject to the constraints 0 = Σ_n an tn and an ≥ 0, i.e.,

    a1 − a2 + a3 − a4 = 0,    0 ≤ ak, k = 1, 2, 3, 4

• From problem symmetry, at the solution a1 = a3 and a2 = a4

Page 43:

SVM for XOR: maximization problem

• Can use iterative gradient descent, or analytical techniques for this small problem
• The solution is a* = (1/8, 1/8, 1/8, 1/8)
• The last term of the optimization problem implies that all four points are support vectors (unusual, and due to the symmetric nature of XOR)
• The final discriminant function is y(x1,x2) = x1 ✕ x2
• The decision hyperplane is defined by y(x1,x2) = 0
• The margin is given by b = 1/||w|| = √2 (a numerical check follows)

[Figure: input space, with hyperbolas corresponding to x1x2 = ±1; feature sub-space, with hyperplanes corresponding to √2 x1x2 = ±1]
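A numerical check of this XOR solution, solving the dual over the explicit six-dimensional feature map with cvxpy; the point ordering and the sign convention t = x1·x2 are assumptions made to match the slide's discriminant.

import numpy as np
import cvxpy as cp

def phi(x):
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1.0, -1.0, -1.0, 1.0])        # labels t_n = x1 * x2 (assumed convention)
Phi = np.vstack([phi(x) for x in X])

a = cp.Variable(4)
obj = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Phi.T @ cp.multiply(t, a)))
cp.Problem(obj, [a >= 0, t @ a == 0]).solve()

print(np.round(a.value, 4))                 # all four multipliers equal 1/8
w = Phi.T @ (a.value * t)
print(np.round(w, 4))                       # only the sqrt(2)*x1*x2 component survives
print("margin =", 1 / np.linalg.norm(w))    # sqrt(2)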