
Tony Jebara, Columbia University

Machine Learning 4771

Instructor: Tony Jebara


Topic 6
• Review: Support Vector Machines

• Primal & Dual Solution

• Non-separable SVMs

• Kernels

• SVM Demo


Review: SVM
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk J(θ)
• Assume first that the 2-class data is linearly separable: we have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,+1\}$
• Classifier: $f(x;\theta) = \mathrm{sign}(w^T x + b)$
• Decision boundary or hyperplane given by $w^T x + b = 0$
• Note: can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error $R_{emp}(\theta) = 0$
• Want the widest or thickest one (max margin); it is also unique!
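As a minimal numerical illustration of the decision rule above (the toy data, w, and b here are made up for illustration and are not from the course):

  X = [2 2; 3 1; -1 -2; -2 -1];      % four 2-d points, one per row
  y = [+1; +1; -1; -1];              % their class labels
  w = [1; 1];  b = 0;                % one of many separating hyperplanes w'*x + b = 0
  yhat = sign(X*w + b);              % f(x;theta) = sign(w'*x + b), applied row-wise
  disp([y yhat]);                    % identical columns: empirical error Remp = 0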


Support Vector Machines
• Define:
  $H$: decision hyperplane $w^T x + b = 0$
  $H_+$ = positive margin hyperplane
  $H_-$ = negative margin hyperplane
  $q$ = distance from the decision plane to the origin
• $q = \min_x \|x - 0\|$ subject to $w^T x + b = 0$, i.e. $\min_x \; \frac{1}{2}\|x - 0\|^2 - \lambda\,(w^T x + b)$
• 1) Gradient: $\frac{\partial}{\partial x}\left[ \frac{1}{2} x^T x - \lambda\,(w^T x + b) \right] = 0 \;\Rightarrow\; x - \lambda w = 0 \;\Rightarrow\; x = \lambda w$
• 2) Plug into the constraint: $w^T(\lambda w) + b = 0 \;\Rightarrow\; \lambda = -\frac{b}{w^T w}$
• 3) Solution: $\hat{x} = -\frac{b}{w^T w}\, w$
• 4) Distance: $q = \|\hat{x} - 0\| = \left\| \frac{b}{w^T w}\, w \right\| = \frac{|b|}{w^T w}\sqrt{w^T w} = \frac{|b|}{\|w\|}$
• 5) Define, without loss of generality (since we can scale b & w):
  $H: \; w^T x + b = 0$
  $H_+: \; w^T x + b = +1$
  $H_-: \; w^T x + b = -1$
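As a sanity check on steps 3 and 4, the closed-form closest point and the distance $|b|/\|w\|$ can be verified numerically (a minimal sketch; the particular w and b are arbitrary):

  w = [3; -1; 2];                        % arbitrary normal vector
  b = 4;                                 % arbitrary offset
  xhat = -(b/(w'*w))*w;                  % closed-form closest point to the origin (step 3)
  assert(abs(w'*xhat + b) < 1e-12);      % xhat indeed lies on the hyperplane
  q_formula = abs(b)/norm(w);            % distance from step 4
  q_numeric = norm(xhat);                % distance of the closest point
  fprintf('q = %.6f (formula) vs %.6f (numeric)\n', q_formula, q_numeric);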


Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta) = 0$ are thus:
  $w^T x_i + b \ge +1 \;\;\forall\, y_i = +1$
  $w^T x_i + b \le -1 \;\;\forall\, y_i = -1$
• Or more simply: $y_i(w^T x_i + b) - 1 \ge 0$
• The margin of the SVM is: $m = d_+ + d_-$
  [Figure: the hyperplanes $H$, $H_+: w^T x + b = +1$, $H_-: w^T x + b = -1$, with distances $d_+$ and $d_-$ from $H$ to each margin hyperplane]
• Distance to origin: $H: q = \frac{|b|}{\|w\|}$, $\;H_+: q_+ = \frac{|b-1|}{\|w\|}$, $\;H_-: q_- = \frac{|-1-b|}{\|w\|}$
• Therefore: $d_+ = d_- = \frac{1}{\|w\|}$ and the margin is $m = \frac{2}{\|w\|}$
• Want to max the margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM Problem: $\min\; \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$
• This is a quadratic program!
• Can plug this into a MATLAB function called "qp()", done! (a sketch follows below)
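To make the "plug it into qp()" step concrete, here is a minimal sketch of the hard-margin primal as a quadratic program over the stacked variable [w; b], using quadprog (the modern replacement for qp mentioned at the end of these notes). It assumes linearly separable data in an N-by-d matrix X with labels y in {-1,+1}; the variable names are illustrative, not from the slides:

  % Hard-margin primal SVM: min 0.5*||w||^2  s.t.  y_i*(w'*x_i + b) >= 1
  [N, d] = size(X);
  H = blkdiag(eye(d), 0);              % quadratic cost on w only, none on b
  f = zeros(d+1, 1);                   % no linear cost term
  A = -diag(y) * [X, ones(N,1)];       % rewrites y_i*(x_i'*w + b) >= 1 as A*z <= c
  c = -ones(N, 1);
  z = quadprog(H, f, A, c);            % z = [w; b]
  w = z(1:d);  b = z(d+1);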


SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem $L_P$: $\min\; \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$
• This is a quadratic program: a quadratic cost function with multiple linear inequalities (these carve out a convex hull)
• Subtract from the cost each inequality times a Lagrange multiplier $\alpha_i$, and take derivatives with respect to w & b:
  $L_P = \min_{w,b}\; \max_{\alpha \ge 0}\; \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i(w^T x_i + b) - 1 \right)$
  $\frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
  $\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0$
• Plug back in to get the dual: $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
• Also have constraints: $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$
• The above $L_D$ must be maximized! Convex duality... also qp()
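A minimal sketch of solving this dual with quadprog (again with illustrative names X, y, alpha). Since quadprog minimizes, the dual is negated; the tiny ridge added to Q is only a numerical convenience and is not part of the slides:

  % Hard-margin dual SVM: max sum(alpha) - 0.5*alpha'*Q*alpha
  %   s.t. sum(alpha.*y) = 0 and alpha >= 0
  [N, d] = size(X);
  Q = (y*y') .* (X*X');                 % Q_ij = y_i*y_j*x_i'*x_j
  Q = Q + 1e-8*eye(N);                  % optional ridge for numerical stability
  f = -ones(N, 1);                      % negate so that minimizing maximizes L_D
  Aeq = y';  beq = 0;                   % equality constraint sum_i alpha_i*y_i = 0
  lb = zeros(N, 1);                     % alpha_i >= 0, no upper bound yet
  alpha = quadprog(Q, f, [], [], Aeq, beq, lb, []);
  w = X' * (alpha .* y);                % recover w = sum_i alpha_i*y_i*x_i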


SVM Dual Solution Properties
• We have the dual convex program:
  $\max_\alpha\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$
• Solve for N alphas (one per data point) instead of D w's
• Still convex (qp) so there is a unique solution, which gives the alphas
• The alphas can be used to get w: $w = \sum_i \alpha_i y_i x_i$
• Support vectors: the points with non-zero alphas (shown with thicker circles); they all live on the margin: $w^T x_i + b = \pm 1$
• The solution is sparse: most alphas = 0, and these are the non-support vectors. The SVM ignores them: if they move (without crossing the margin) or are deleted, the SVM doesn't change (it stays robust)
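In practice a solver returns alphas that are numerically small rather than exactly zero, so support vectors are usually identified with a tolerance (a sketch; the threshold 1e-6 is an arbitrary illustrative choice):

  tol = 1e-6;                            % numerical threshold for "non-zero"
  sv = find(alpha > tol);                % indices of the support vectors
  fprintf('%d of %d points are support vectors\n', numel(sv), N);
  w = X(sv,:)' * (alpha(sv) .* y(sv));   % w depends only on the support vectors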


SVM Dual Solution Properties
• Primal & Dual Illustration
• Recall we could get w from the alphas: $w = \sum_i \alpha_i y_i x_i$
• Or could use the alphas as is: $f(x) = \mathrm{sign}(x^T w + b) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, x^T x_i + b \right)$
• Karush-Kuhn-Tucker conditions (KKT): solve for the value of b. For nonzero alphas the margin constraint $w^T x_i + b \ge +1$ (or $\le -1$) is tight, so we have $w^T x_i + b = y_i$; using the known w, compute $b_i = y_i - w^T x_i \;\;\forall i: \alpha_i > 0$ for each support vector, then $b = \mathrm{average}(b_i)$
• Sparsity (few nonzero alphas) is useful for several reasons:
  it means the SVM only uses some of the training data to learn,
  it should help improve its ability to generalize to test data,
  and it makes the final learned classifier computationally faster to use
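A sketch of the KKT-based bias computation and the resulting decision rule, continuing with the illustrative X, y, alpha, sv names from the previous sketches:

  bi = y(sv) - X(sv,:) * w;          % b_i = y_i - w'*x_i for each support vector
  b = mean(bi);                      % average over the support vectors
  f = @(Xnew) sign(Xnew*w + b);      % Xnew is M-by-d; returns M predicted labels
  yhat = f(X);                       % e.g. predictions on the training set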


Non-Separable SVMs
• What happens when the data is non-separable?
• There is no solution and the convex hull (the feasible region carved out by the constraints) shrinks to nothing
• Not all constraints can be satisfied, and their alphas grow to infinity
• Instead of perfectly classifying each point, we "relax" the problem with (positive) slack variables $\xi_i$ that allow data to (sometimes) fall on the wrong side. For example, instead of $y_i(w^T x_i + b) \ge 1$ we might allow $w^T x_i + b \ge -0.03$ for a point with $y_i = +1$
• New constraints:
  $w^T x_i + b \ge +1 - \xi_i$ if $y_i = +1$, where $\xi_i \ge 0$
  $w^T x_i + b \le -1 + \xi_i$ if $y_i = -1$, where $\xi_i \ge 0$
• But too much $\xi$ means too much slack, so penalize it:
  $L_P: \;\min\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i(w^T x_i + b) - 1 + \xi_i \ge 0$
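A minimal quadprog sketch of this soft-margin primal over the stacked variable z = [w; b; xi] (names X, y, C are illustrative; C would be chosen by the user or by cross-validation, as discussed next):

  % Soft-margin primal: min 0.5*||w||^2 + C*sum(xi)
  %   s.t. y_i*(w'*x_i + b) >= 1 - xi_i  and  xi_i >= 0
  C = 1;                                     % illustrative choice of C
  [N, d] = size(X);
  H = blkdiag(eye(d), zeros(1+N));           % quadratic cost on w only
  f = [zeros(d+1,1); C*ones(N,1)];           % linear cost C*sum(xi)
  A = [-diag(y)*[X, ones(N,1)], -eye(N)];    % -y_i*(w'*x_i + b) - xi_i <= -1
  c = -ones(N,1);
  lb = [-inf(d+1,1); zeros(N,1)];            % only the slacks are bounded below
  z = quadprog(H, f, A, c, [], [], lb, []);
  w = z(1:d);  b = z(d+1);  xi = z(d+2:end);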


Non-Separable SVMs
• This new problem is still convex, still qp()!
• The user chooses the scalar C (or cross-validates it), which controls how much slack $\xi$ to use (how non-separable the fit is) and how robust it is to outliers or bad points on the wrong side
• Form the Lagrangian (its terms encourage, respectively, a large margin, low slack, points on the right side, and $\xi$ positivity):
  $L_P: \;\min\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left( y_i(w^T x_i + b) - 1 + \xi_i \right) - \sum_i \beta_i \xi_i$
  $\frac{\partial L_P}{\partial b}$ and $\frac{\partial L_P}{\partial w}$ as before...
  $\frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \alpha_i = C - \beta_i$, but $\alpha_i, \beta_i \ge 0$, therefore $0 \le \alpha_i \le C$
• Can now write the dual problem (to maximize):
  $L_D: \;\max_\alpha\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \in [0, C]$
• Same dual as before but the alphas can't grow beyond C (see the short sketch below)
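Compared with the earlier hard-margin dual sketch, the only change needed in code is the upper bound on alpha (illustrative names as before):

  C = 1;                                              % illustrative choice of C
  ub = C * ones(N, 1);                                % box constraint 0 <= alpha_i <= C
  alpha = quadprog(Q, f, [], [], Aeq, beq, lb, ub);   % Q, f, Aeq, beq, lb as before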


Non-Separable SVMs
• As we try to enforce a classification for a non-separable data point, its Lagrange multiplier alpha keeps growing endlessly
• Clamping alpha so that it stops growing at C makes the SVM "give up" on those non-separable points
• The dual program is now solved as before, with the extra constraints that the alphas are positive AND less than C... this gives the alphas... and from the alphas we get $w = \sum_i \alpha_i y_i x_i$
• Karush-Kuhn-Tucker conditions (KKT): solve for the value of b using the support vectors on the margin, i.e. those with alphas not equal to zero AND not equal to C; for these, assume $\xi_i = 0$ in the condition $y_i(w^T x_i + b) - 1 + \xi_i = 0$, use the formula to get $b_i = y_i - w^T x_i$, and set $b = \mathrm{average}(b_i)$
• Mechanical analogy: support vector forces & torques
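A sketch of the soft-margin bias computation, keeping only the margin support vectors with 0 < alpha_i < C (tolerance and names illustrative):

  tol = 1e-6;
  sv  = find(alpha > tol);                       % all support vectors
  msv = find(alpha > tol & alpha < C - tol);     % margin SVs: 0 < alpha_i < C, so xi_i = 0
  w = X(sv,:)' * (alpha(sv) .* y(sv));
  b = mean(y(msv) - X(msv,:)*w);                 % b_i = y_i - w'*x_i, averaged over margin SVs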


Nonlinear SVMs
• What if the problem is not linear?


Nonlinear SVMs
• What if the problem is not linear?
• We can use our old trick...
• Map the d-dimensional x data from the input space L to a high-dimensional (Hilbert) feature space H via basis functions Φ(x)
• For example, for a quadratic classifier:
  $x_i \to \Phi(x_i)$ via $\Phi(x) = \begin{bmatrix} x \\ \mathrm{vec}(x x^T) \end{bmatrix}$
• Call the phi's feature vectors computed from the original x inputs
• Replace all x's in the SVM equations with phi's
• Now solve the following learning problem:
  $L_D: \;\max_\alpha\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j)$ s.t. $\alpha_i \in [0, C]$, $\sum_i \alpha_i y_i = 0$
• Which gives a nonlinear classifier in the original space:
  $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, \phi(x)^T \phi(x_i) + b \right)$
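A sketch of the explicit quadratic feature map Φ(x) = [x; vec(xx')] described above; feeding PHI into the earlier dual sketch in place of X is all that "replace x's with phi's" requires (names illustrative):

  phi = @(x) [x; reshape(x*x', [], 1)];    % Phi(x) = [x; vec(x*x')], length d + d^2
  PHI = zeros(N, d + d^2);                 % map every row of the N-by-d matrix X
  for i = 1:N
      PHI(i,:) = phi(X(i,:)')';
  end
  Q = (y*y') .* (PHI*PHI');                % Q_ij = y_i*y_j*Phi(x_i)'*Phi(x_j)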


Kernels (see http://www.youtube.com/watch?v=3liCbRZPrZA)
• One important aspect of SVMs: all the math involves only inner products between the phi features!
  $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, \phi(x)^T \phi(x_i) + b \right)$
• Replace all inner products with a general kernel function
• Mercer kernel: accepts 2 inputs and outputs a scalar via:
  $k(x, x') = \langle \phi(x), \phi(x') \rangle = \begin{cases} \phi(x)^T \phi(x') & \text{for finite } \phi \\ \int_t \phi(x,t)\, \phi(x',t)\, dt & \text{otherwise} \end{cases}$
• Example: quadratic polynomial
  $\phi(x) = \begin{bmatrix} x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 \end{bmatrix}^T$
  $k(x, x') = \phi(x)^T \phi(x') = x_1^2 x_1'^2 + 2\, x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2$
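A quick numerical check of this identity, using the explicit 3-dimensional feature map for 2-dimensional inputs (the test vectors are arbitrary):

  phi2 = @(x) [x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2];   % explicit quadratic features
  x = [0.7; -1.2];  xp = [2.0; 0.5];                 % two arbitrary 2-d points
  lhs = phi2(x)' * phi2(xp);                         % phi(x)'*phi(x')
  rhs = (x' * xp)^2;                                 % (x'*x')^2
  fprintf('feature inner product = %.10f, kernel = %.10f\n', lhs, rhs);   % both 0.64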


Kernels
• Sometimes, many different Φ(x) will produce the same k(x,x')
• Sometimes k(x,x') is computable but the features are huge or infinite!
• Example: polynomials. With an explicit polynomial mapping, the feature space Φ(x) is huge:
  for d-dimensional data and a p-th order polynomial, $\dim(\mathrm{H}) = \binom{d + p - 1}{p}$
  e.g. images of size 16x16 with p=4 have dim(H) = 183 million
• But can equivalently just use the kernel $k(x, y) = (x^T y)^p$, by the Multinomial Theorem (with $w_r$ the weight on each term):
  $k(x, x') = (x^T x')^p = \left( \sum_i x_i x'_i \right)^p = \sum_r \frac{p!}{r_1!\, r_2!\, r_3! \cdots (p - r_1 - r_2 - \cdots)!}\; x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\; x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}$
  $= \sum_r \left( w_r\, x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d} \right)\left( w_r\, x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d} \right) = \phi(x)^T \phi(x')$
  Equivalent!
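The feature-space dimension quoted above can be checked directly (16x16 images give d = 256):

  d = 256;  p = 4;                         % 16x16 image pixels, 4th-order polynomial
  dimH = nchoosek(d + p - 1, p);           % number of degree-p monomial features
  fprintf('dim(H) = %d\n', dimH);          % prints 183181376, roughly 183 million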


Kernels

• Replace each inner product $x_i^T x_j \to k(x_i, x_j)$, for example:
  P-th Order Polynomial Kernel: $k(x, x') = (x^T x' + 1)^p$
  RBF Kernel (infinite-dimensional!): $k(x, x') = \exp\left( -\frac{1}{2\sigma^2} \|x - x'\|^2 \right)$
  Sigmoid (hyperbolic tan) Kernel: $k(x, x') = \tanh(\kappa\, x^T x' - \delta)$
• Using kernels we get the generalized inner product SVM:
  $L_D: \;\max_\alpha\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ s.t. $\alpha_i \in [0, C]$, $\sum_i \alpha_i y_i = 0$
  $f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, k(x_i, x) + b \right)$
• Still a qp solver, just use the Gram matrix K (positive definite), $K_{ij} = k(x_i, x_j)$, e.g. for three points:
  $K = \begin{bmatrix} k(x_1,x_1) & k(x_1,x_2) & k(x_1,x_3) \\ k(x_1,x_2) & k(x_2,x_2) & k(x_2,x_3) \\ k(x_1,x_3) & k(x_2,x_3) & k(x_3,x_3) \end{bmatrix}$
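A sketch of building an RBF Gram matrix and solving the kernelized dual; the eigenvalue line is just a positive-definiteness sanity check, and sigma, C, and the other names are illustrative (it assumes a MATLAB version with implicit expansion):

  sigma = 1.0;
  D2 = sum(X.^2,2) + sum(X.^2,2)' - 2*(X*X');   % squared distances ||x_i - x_j||^2
  K = exp(-D2 / (2*sigma^2));                   % RBF Gram matrix, K_ij = k(x_i,x_j)
  fprintf('smallest eigenvalue of K: %g\n', min(eig(K)));   % >= 0 up to roundoff
  Q = (y*y') .* K;                              % kernelized dual: K replaces X*X'
  alpha = quadprog(Q, -ones(N,1), [], [], y', 0, zeros(N,1), C*ones(N,1));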


Kernelized SVMs
• Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + 1)^p$
• Radial basis function kernel: $k(x_i, x_j) = \exp\left( -\frac{1}{2\sigma^2} \|x_i - x_j\|^2 \right)$
  [Figure: example decision boundaries learned with the Polynomial Kernel and the RBF kernel]
• Least-squares, logistic regression, and the perceptron are also kernelizable
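Once the alphas are trained, classifying a new point only needs kernel evaluations against the support vectors. A sketch with the RBF kernel, continuing the illustrative names from the earlier sketches (any of the kernels above could replace kfun):

  kfun = @(A,B) exp(-(sum(A.^2,2) + sum(B.^2,2)' - 2*(A*B')) / (2*sigma^2));  % RBF kernel matrix
  tol = 1e-6;
  sv  = find(alpha > tol);                       % support vectors of the kernelized dual
  msv = find(alpha > tol & alpha < C - tol);     % margin SVs, used for the bias
  b = mean(y(msv) - kfun(X(msv,:), X(sv,:)) * (alpha(sv) .* y(sv)));
  f = @(Xnew) sign(kfun(Xnew, X(sv,:)) * (alpha(sv) .* y(sv)) + b);
  yhat = f(X);                                   % e.g. predictions on the training set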


SVM Demo
• SVM Demo by Steve Gunn: http://www.isis.ecs.soton.ac.uk/resources/svminfo/
• In svc.m replace
  [alpha lambda how] = qp(…);
  with
  [alpha lambda how] = quadprog(H,c,[],[],A,b,vlb,vub,x0);
• This swaps the old MATLAB quadratic-programming command qp for quadprog, its replacement in more recent MATLAB versions
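For reference, assuming the variable names in Gunn's svc.m follow the standard dual setup (this mapping is an assumption about that code, not something stated on the slide), the quadprog arguments correspond to the dual program as follows:

  % Assumed mapping of the quadprog arguments onto the dual SVM
  % (the names H, c, A, b, vlb, vub, x0 come from svc.m and are not defined on this slide):
  %   H        : N-by-N Hessian, H_ij = y_i*y_j*k(x_i,x_j)
  %   c        : -ones(N,1), so minimizing 0.5*alpha'*H*alpha + c'*alpha maximizes L_D
  %   A, b     : the equality constraint y'*alpha = 0, passed in quadprog's Aeq/beq slots
  %   vlb, vub : zeros(N,1) and C*ones(N,1), the box constraint 0 <= alpha_i <= C
  %   x0       : an initial guess for the alphas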