Transcript of lecture slides: cse.msu.edu/~cse802/s17/slides/lec_17_mar27_svm.pdf


Support Vector Machine and Kernel Methods

Jiayu Zhou

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

February 26, 2017


Which Separator Do You Pick?


Robustness to Noisy Data

Being robust to noise (measurement error) is good (remember regularization).

Thicker Cushion Means More Robustness

We call such hyperplanes fat.

Two Crucial Questions

1. Can we efficiently find the fattest separating hyperplane?

2. Is a fatter hyperplane better than a thin one?

Pulling Out the Bias

Before: $x \in \{1\} \times \mathbb{R}^d$, $w \in \mathbb{R}^{d+1}$, with

$x = (1, x_1, \dots, x_d)^T, \quad w = (w_0, w_1, \dots, w_d)^T, \quad \text{signal} = w^T x.$

After: $x \in \mathbb{R}^d$, $b \in \mathbb{R}$, $w \in \mathbb{R}^d$, with

$x = (x_1, \dots, x_d)^T, \quad w = (w_1, \dots, w_d)^T, \quad \text{bias } b, \quad \text{signal} = w^T x + b.$

Separating The Data

A hyperplane $h = (b, w)$ separates the data when

$y_n(w^T x_n + b) > 0 \quad \text{for all } n.$

By rescaling the weights and bias, we can always normalize so that

$\min_{n=1,\dots,N} y_n(w^T x_n + b) = 1.$

Distance to the Hyperplane

$w$ is normal to the hyperplane (why?): for any two points $x_1, x_2$ on $h$,

$w^T(x_2 - x_1) = w^T x_2 - w^T x_1 = -b + b = 0.$

Scalar projection:

$a^T b = \|a\|\,\|b\|\cos(a, b) \;\Rightarrow\; a^T b / \|b\| = \|a\|\cos(a, b).$

Let $x_\perp$ be the orthogonal projection of $x$ onto $h$; the distance to the hyperplane is given by the projection of $x - x_\perp$ onto $w$ (why?):

$\mathrm{dist}(x, h) = \frac{1}{\|w\|}\,|w^T x - w^T x_\perp| = \frac{1}{\|w\|}\,|w^T x + b|.$

Fatness of a Separating Hyperplane

Since $|y_n| = 1$ and $h$ separates the data ($y_n(w^T x_n + b) > 0$),

$\mathrm{dist}(x_n, h) = \frac{1}{\|w\|}|w^T x_n + b| = \frac{1}{\|w\|}|y_n(w^T x_n + b)| = \frac{1}{\|w\|}\, y_n(w^T x_n + b).$

Fatness = distance to the closest point:

$\text{Fatness} = \min_n \mathrm{dist}(x_n, h) = \frac{1}{\|w\|}\min_n y_n(w^T x_n + b) = \frac{1}{\|w\|}.$

Maximizing the Margin

Formal definition of margin:

$\gamma(h) = \frac{1}{\|w\|}$

NOTE: the bias $b$ does not appear in the margin.

Objective for maximizing the margin:

$\min_{b,w}\ \frac{1}{2} w^T w \quad \text{subject to}\quad \min_{n=1,\dots,N} y_n(w^T x_n + b) = 1$

An equivalent objective:

$\min_{b,w}\ \frac{1}{2} w^T w \quad \text{subject to}\quad y_n(w^T x_n + b) \ge 1 \text{ for } n = 1,\dots,N$

Example - Our Toy Data Set

$\min_{b,w}\ \frac{1}{2} w^T w \quad \text{subject to}\quad y_n(w^T x_n + b) \ge 1 \text{ for } n = 1,\dots,N$

Training data:

$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$

What is the margin?

Example - Our Toy Data Set

$\min_{b,w}\ \frac{1}{2} w^T w \quad \text{subject to}\quad y_n(w^T x_n + b) \ge 1 \text{ for } n = 1,\dots,N$

$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$

The constraints become:

(1) $-b \ge 1$
(2) $-(2w_1 + 2w_2 + b) \ge 1$
(3) $2w_1 + b \ge 1$
(4) $3w_1 + b \ge 1$

(1) + (3) $\Rightarrow w_1 \ge 1$, and (2) + (3) $\Rightarrow w_2 \le -1$, so

$\frac{1}{2} w^T w = \frac{1}{2}(w_1^2 + w_2^2) \ge 1.$

Thus: $w_1 = 1$, $w_2 = -1$, $b = -1$.

Example - Our Toy Data Set

Given data $X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}$, the optimal solution is

$w^* = \begin{bmatrix} w_1 = 1 \\ w_2 = -1 \end{bmatrix}, \quad b^* = -1.$

Optimal hyperplane: $g(x) = \mathrm{sign}(x_1 - x_2 - 1)$

Margin: $\frac{1}{\|w\|} = \frac{1}{\sqrt{2}} \approx 0.707$

For data points (1), (2) and (3), $y_n(x_n^T w^* + b^*) = 1$: these are the support vectors.

Solver: Quadratic Programming

$\min_{u \in \mathbb{R}^q}\ \frac{1}{2} u^T Q u + p^T u \quad \text{subject to}\quad A u \ge c$

$u^* \leftarrow QP(Q, p, A, c)$

($Q = 0$ is linear programming.) http://cvxopt.org/examples/tutorial/qp.html

Maximum Margin Hyperplane is QP

The maximum-margin problem

$\min_{b,w}\ \frac{1}{2} w^T w \quad \text{subject to}\quad y_n(w^T x_n + b) \ge 1,\ \forall n$

fits the generic QP template

$\min_{u \in \mathbb{R}^q}\ \frac{1}{2} u^T Q u + p^T u \quad \text{subject to}\quad A u \ge c.$

Take $u = \begin{bmatrix} b \\ w \end{bmatrix} \in \mathbb{R}^{d+1}$. Then

$\frac{1}{2} w^T w = \frac{1}{2}\,[b,\ w^T] \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix} \begin{bmatrix} b \\ w \end{bmatrix} = \frac{1}{2}\, u^T \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix} u,$

so $Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}$ and $p = 0_{d+1}$.

Each constraint $y_n(w^T x_n + b) \ge 1$ is $[y_n,\ y_n x_n^T]\, u \ge 1$, so

$\begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix} u \ge \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \quad\text{i.e.}\quad A = \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix}, \quad c = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}.$

Back To Our Example

Exercise:

$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$

(1) $-b \ge 1$
(2) $-(2w_1 + 2w_2 + b) \ge 1$
(3) $2w_1 + b \ge 1$
(4) $3w_1 + b \ge 1$

Show the corresponding $Q, p, A, c$:

$Q = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad p = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad A = \begin{bmatrix} -1 & 0 & 0 \\ -1 & -2 & -2 \\ 1 & 2 & 0 \\ 1 & 3 & 0 \end{bmatrix}, \quad c = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$

Use your QP solver to obtain

$u^* = [b^*, w_1^*, w_2^*]^T = [-1, 1, -1].$
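As a concrete check, the exercise can be solved with the `qp()` helper sketched earlier (cvxopt assumed). Note that the $Q$ block for the bias is singular; if the solver complains, adding a tiny ridge such as $10^{-8}$ to that zero diagonal entry is a common workaround.

```python
import numpy as np

Q = np.diag([0.0, 1.0, 1.0])
p = np.zeros(3)
A = np.array([[-1.0,  0.0,  0.0],
              [-1.0, -2.0, -2.0],
              [ 1.0,  2.0,  0.0],
              [ 1.0,  3.0,  0.0]])
c = np.ones(4)

u_star = qp(Q, p, A, c)       # qp() as defined in the earlier sketch
print(np.round(u_star, 3))    # approximately [-1.  1. -1.]  =>  b* = -1, w* = (1, -1)
```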

Primal QP algorithm for linear-SVM

1. Let $p = 0_{d+1}$ be the $(d+1)$-vector of zeros and $c = 1_N$ the $N$-vector of ones. Construct matrices $Q$ and $A$, where

$A = \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix}, \quad Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}.$

2. Return $\begin{bmatrix} b^* \\ w^* \end{bmatrix} = u^* \leftarrow QP(Q, p, A, c)$.

3. The final hypothesis is $g(x) = \mathrm{sign}(x^T w^* + b^*)$.
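The three steps translate almost line for line into Python. The sketch below reuses the `qp()` helper from before (cvxopt assumed) and assumes the data are linearly separable; the function name is a choice for this illustration.

```python
import numpy as np

def svm_hard_margin_primal(X, y):
    """X: (N, d) inputs, y: (N,) labels in {-1, +1}. Returns (b*, w*)."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                        # no penalty on the bias term
    p = np.zeros(d + 1)
    A = np.hstack([y[:, None], y[:, None] * X])  # rows [y_n, y_n x_n^T]
    c = np.ones(N)
    u = qp(Q, p, A, c)                           # qp() as sketched earlier
    return u[0], u[1:]

# Usage on the toy data set:
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
b, w = svm_hard_margin_primal(X, y)
g = lambda x: np.sign(x @ w + b)                 # final hypothesis
```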

Link to Regularization

$\min_w\ E_{in}(w) \quad \text{subject to}\quad w^T w \le C$

                 optimal hyperplane      regularization
  minimize       w^T w                   E_in
  subject to     E_in = 0                w^T w ≤ C

How to Handle Non-Separable Data?

(a) Tolerate noisy data points: soft-margin SVM.

(b) Inherent nonlinear boundary: non-linear transformation.


Non-Linear Transformation

$\Phi_1(x) = (x_1, x_2)$

$\Phi_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$

$\Phi_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$

Non-Linear Transformation

Use the nonlinear transform together with the optimal hyperplane: with a transform $\Phi: \mathbb{R}^d \to \mathbb{R}^{\tilde d}$, let

$z_n = \Phi(x_n).$

Solve the hard-margin SVM in the Z-space to obtain $(\tilde b^*, \tilde w^*)$:

$\min_{\tilde b, \tilde w}\ \frac{1}{2} \tilde w^T \tilde w \quad \text{subject to}\quad y_n(\tilde w^T z_n + \tilde b) \ge 1,\ \forall n$

Final hypothesis:

$g(x) = \mathrm{sign}(\tilde w^{*T} \Phi(x) + \tilde b^*)$

SVM and non-linear transformation

(Figure in the slides: the margin is shaded in yellow, and the support vectors are boxed.)

For $\Phi_2$, $\tilde d_2 = 5$; for $\Phi_3$, $\tilde d_3 = 9$. Although $\tilde d_3$ is nearly double $\tilde d_2$, the resulting SVM separator is not severely overfitting with $\Phi_3$ (regularization?).

Support Vector Machine Summary

A very powerful, easy-to-use linear model which comes with automatic regularization.

Fully exploiting the SVM: the kernel.

Potential robustness to overfitting even after transforming to a much higher dimension.
How about infinite-dimensional transforms? The kernel trick.

SVM Dual: Formulation

Primal and dual in optimization.

The dual view of SVM enables us to exploit the kernel trick.

In the primal SVM problem we solve for $w \in \mathbb{R}^d$ and $b$, while in the dual problem we solve for $\alpha \in \mathbb{R}^N$:

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0,\ \forall n,$

which is also a QP problem.

SVM Dual: Prediction

We can obtain the primal solution:

$w^* = \sum_{n=1}^N y_n \alpha_n^* x_n,$

where $\alpha_n^* > 0$ only for the support vectors.

The optimal hypothesis:

$g(x) = \mathrm{sign}(w^{*T} x + b^*) = \mathrm{sign}\left(\sum_{n=1}^N y_n \alpha_n^* x_n^T x + b^*\right) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* x_n^T x + b^*\right)$

Dual SVM: Summary

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0,\ \forall n$

$w^* = \sum_{n=1}^N y_n \alpha_n^* x_n$
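For reference, here is a sketch of this dual QP solved with cvxopt (assumed installed). cvxopt minimizes, so the maximization is rewritten as $\min \frac{1}{2}\alpha^T H \alpha - 1^T\alpha$ with $H_{nm} = y_n y_m x_n^T x_m$. The small ridge on $H$ and the recovery of $b^*$ from a support vector ($b^* = y_s - w^{*T} x_s$), which the slides use implicitly, are standard details not spelled out above.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_hard_margin_dual(X, y):
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    H = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)            # tiny ridge for numerics
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(H), matrix(-np.ones((N, 1))),
                     matrix(-np.eye(N)), matrix(np.zeros((N, 1))),  # alpha_n >= 0
                     matrix(y.reshape(1, -1)), matrix(0.0))         # y^T alpha = 0
    alpha = np.array(sol['x']).ravel()
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    sv = alpha > 1e-6                                             # support vectors
    b = np.mean(y[sv] - X[sv] @ w)                                # b* = y_s - w^T x_s
    return alpha, w, b
```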

Common SVM Basis Functions

$z_k$ = polynomial terms of $x_k$ of degree 1 to $q$

$z_k$ = radial basis functions of $x_k$: $z_k(j) = \phi_j(x_k) = \exp(-|x_k - c_j|^2 / \sigma^2)$

$z_k$ = sigmoid functions of $x_k$

Quadratic Basis Functions

$\Phi(x) = \left(1,\ \sqrt{2}\,x_1, \dots, \sqrt{2}\,x_d,\ x_1^2, \dots, x_d^2,\ \sqrt{2}\,x_1 x_2, \dots, \sqrt{2}\,x_1 x_d,\ \sqrt{2}\,x_2 x_3, \dots, \sqrt{2}\,x_{d-1} x_d\right)$

including the constant term, linear terms, pure quadratic terms, and quadratic cross-terms.

The number of terms is approximately $d^2/2$.

You may be wondering what those $\sqrt{2}$s are doing. You'll find out why they're there soon.

Dual SVM: Non-linear Transformation

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m \Phi(x_n)^T \Phi(x_m)$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0,\ \forall n$

$w^* = \sum_{n=1}^N y_n \alpha_n^* \Phi(x_n)$

We need to prepare a matrix $Q$ with $Q_{nm} = y_n y_m \Phi(x_n)^T \Phi(x_m)$.

Cost? We must do $N^2/2$ dot products to get this matrix ready. Each dot product requires $d^2/2$ additions and multiplications, so the whole thing costs $N^2 d^2 / 4$.

Quadratic Dot Products

With the quadratic basis functions,

$\Phi(a)^T \Phi(b) = \underbrace{1}_{\text{constant term}} + \underbrace{\sum_{i=1}^d 2 a_i b_i}_{\text{linear terms}} + \underbrace{\sum_{i=1}^d a_i^2 b_i^2}_{\text{pure quadratic terms}} + \underbrace{\sum_{i=1}^d \sum_{j=i+1}^d 2 a_i a_j b_i b_j}_{\text{quadratic cross-terms}}$

Quadratic Dot Product

Does $\Phi(a)^T \Phi(b)$ look familiar?

$\Phi(a)^T \Phi(b) = 1 + 2\sum_{i=1}^d a_i b_i + \sum_{i=1}^d a_i^2 b_i^2 + \sum_{i=1}^d \sum_{j=i+1}^d 2 a_i a_j b_i b_j$

Try this: $(a^T b + 1)^2$.

$(a^T b + 1)^2 = (a^T b)^2 + 2\,a^T b + 1$
$= \left(\sum_{i=1}^d a_i b_i\right)^2 + 2\sum_{i=1}^d a_i b_i + 1$
$= \sum_{i=1}^d \sum_{j=1}^d a_i b_i a_j b_j + 2\sum_{i=1}^d a_i b_i + 1$
$= \sum_{i=1}^d a_i^2 b_i^2 + 2\sum_{i=1}^d \sum_{j=i+1}^d a_i a_j b_i b_j + 2\sum_{i=1}^d a_i b_i + 1$

They're the same! And this is only $O(d)$ to compute!
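A quick numerical spot check (not from the slides) confirms that the explicit quadratic feature map with the $\sqrt{2}$ scaling reproduces the kernel $(a^T b + 1)^2$:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit quadratic feature map with the sqrt(2) scaling from the slides."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi_quadratic(a) @ phi_quadratic(b), (a @ b + 1) ** 2))  # True
```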

Dual SVM: Non-linear Transformation

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m \Phi(x_n)^T \Phi(x_m)$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0,\ \forall n$

$w^* = \sum_{n=1}^N y_n \alpha_n^* \Phi(x_n)$

We still need the matrix $Q$ with $Q_{nm} = y_n y_m \Phi(x_n)^T \Phi(x_m)$.

Cost? We must do $N^2/2$ dot products to get this matrix ready, but each dot product now requires only $d$ additions and multiplications.

Higher Order Polynomials

Explicit transform:

  Phi(x)      terms       cost           d = 100
  Quadratic   d^2/2       d^2 N^2/4      2.5k N^2
  Cubic       d^3/6       d^3 N^2/12     83k N^2
  Quartic     d^4/24      d^4 N^2/48     1.96m N^2

Kernel trick:

  Phi(a)^T Phi(b)   kernel             cost         d = 100
  Quadratic         (a^T b + 1)^2      d N^2/2      50 N^2
  Cubic             (a^T b + 1)^3      d N^2/2      50 N^2
  Quartic           (a^T b + 1)^4      d N^2/2      50 N^2

Dual SVM with Quintic basis functions

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m \underbrace{\Phi(x_n)^T \Phi(x_m)}_{(x_n^T x_m + 1)^5}$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0,\ \forall n$

Classification:

$g(x) = \mathrm{sign}(w^{*T} \Phi(x) + b^*) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* \Phi(x_n)^T \Phi(x) + b^*\right) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* (x_n^T x + 1)^5 + b^*\right)$
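In practice a quintic-kernel SVM like this can be fit with scikit-learn's SVC, whose polynomial kernel is $(\gamma\, x^T x' + \text{coef0})^{\text{degree}}$; with $\gamma = 1$ and coef0 = 1 this matches $(x_n^T x_m + 1)^5$, and a very large C approximates the hard margin (a sketch, not part of the slides).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel='poly', degree=5, gamma=1.0, coef0=1.0, C=1e6)  # K = (x^T x' + 1)^5
clf.fit(X, y)
print(clf.support_)                  # indices of the support vectors
print(clf.predict([[1.0, 0.5]]))
```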

Dual SVM with general kernel functions

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m K(x_n, x_m)$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad \alpha_n \ge 0,\ \forall n$

Classification:

$g(x) = \mathrm{sign}(w^{*T} \Phi(x) + b^*) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* \Phi(x_n)^T \Phi(x) + b^*\right) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* K(x_n, x) + b^*\right)$
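Any such kernel can also be supplied to scikit-learn's SVC as a precomputed Gram matrix, which mirrors the $K(x_n, x_m)$ formulation directly (a sketch; `my_kernel` is a stand-in name, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def my_kernel(a, b):
    return (a @ b + 1.0) ** 3            # any valid kernel K(a, b)

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1, -1, 1, 1])
K_train = np.array([[my_kernel(a, b) for b in X] for a in X])

clf = SVC(kernel='precomputed', C=1e6).fit(K_train, y)

X_test = np.array([[1.0, 0.5]])
K_test = np.array([[my_kernel(a, b) for b in X] for a in X_test])  # shape (n_test, N)
print(clf.predict(K_test))
```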

Kernel Tricks

Replacing the dot product with a kernel function. Not all functions are kernel functions!

The function needs to be decomposable: $K(a, b) = \Phi(a)^T \Phi(b)$.
Could $K(a, b) = (a - b)^3$ be a kernel function?
Could $K(a, b) = (a - b)^4 - (a + b)^2$ be a kernel function?

Mercer's condition: to expand a kernel function $K(a, b)$ into a dot product, i.e., $K(a, b) = \Phi(a)^T \Phi(b)$, $K(a, b)$ has to be a positive semi-definite function; equivalently, the kernel matrix $K$ is symmetric PSD for any given $x_1, \dots, x_N$.
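A necessary-condition spot check of Mercer's condition can be done numerically: build the Gram matrix on a sample and inspect its eigenvalues. This is only a sanity check, not a proof, and reading the slide's $(a - b)^3$ as the distance-based $\|a - b\|^3$ below is one interpretation chosen for illustration.

```python
import numpy as np

def gram_psd_check(kernel, X, tol=1e-8):
    """Necessary condition: the Gram matrix must be symmetric PSD."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

X = np.random.default_rng(1).normal(size=(20, 3))
print(gram_psd_check(lambda a, b: (a @ b + 1) ** 2, X))             # True: a valid kernel
print(gram_psd_check(lambda a, b: np.linalg.norm(a - b) ** 3, X))   # False: not a kernel
```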

Kernel Design: expression kernel

mRNA expression data: each matrix entry is an mRNA expression measurement, each column is an experiment, and each row corresponds to a gene. We want a kernel that scores whether two expression profiles are similar or dissimilar:

$K(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i x_i}\ \sqrt{\sum_i y_i y_i}}$
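This expression kernel is just cosine similarity between profiles; a minimal sketch (the example vectors are made up for illustration):

```python
import numpy as np

def expression_kernel(x, y):
    """K(x, y) = <x, y> / (||x|| ||y||): cosine similarity of expression profiles."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([2.0, -1.0, 0.5, 3.0])    # one gene's expression profile
y = np.array([1.8, -0.9, 0.7, 2.5])    # another gene's profile
print(expression_kernel(x, y))          # close to 1 => similar profiles
```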

Kernel Design: sequence kernel

Kernels also let us work with non-vectorial data. What is a scalar product on a pair of variable-length, discrete strings? For example, two protein sequences:

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Commonly Used SVM Kernel Functions

$K(a, b) = (\alpha \cdot a^T b + \beta)^Q$ is an example of an SVM kernel function.

Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:

Radial-basis style (RBF)/Gaussian kernel: $K(a, b) = \exp(-\gamma \|a - b\|^2)$
Sigmoid functions

2nd Order Polynomial Kernel

$K(a, b) = (\alpha \cdot a^T b + \beta)^2$

Gaussian Kernels

$K(a, b) = \exp(-\gamma \|a - b\|^2)$

When $\gamma$ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small $\gamma$, the sophisticated boundary discovered by SVM with the Gaussian-RBF kernel looks quite good.
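The effect of $\gamma$ can be reproduced with scikit-learn's RBF-kernel SVC (a sketch; the synthetic data set here is a stand-in, not the one in the slides):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel='rbf', gamma=gamma, C=1e6).fit(X, y)   # large C ~ hard margin
    print(gamma, clf.n_support_.sum(), clf.score(X, y))
# Large gamma drives training accuracy to 1.0 with many support vectors
# (overfitting); a moderate gamma gives a smoother boundary.
```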

Gaussian Kernels

(a) For a noisy data set, a linear classifier appears to work quite well; (b) using the Gaussian-RBF kernel with the hard-margin SVM leads to overfitting.

From hard-margin to soft-margin

When there are outliers, the hard-margin SVM with a Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise.

Remedy: a soft formulation that allows small violations of the margin, or even some classification errors.

Soft margin: introduce a margin violation $\varepsilon_n \ge 0$ for each data point $(x_n, y_n)$ and require that

$y_n(w^T x_n + b) \ge 1 - \varepsilon_n.$

$\varepsilon_n$ captures by how much $(x_n, y_n)$ fails to be separated.

Soft-Margin SVM

We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:

$\min_{b, w, \varepsilon}\ \frac{1}{2} w^T w + C \sum_{n=1}^N \varepsilon_n$

$\text{subject to}\quad y_n(w^T x_n + b) \ge 1 - \varepsilon_n, \quad \varepsilon_n \ge 0, \quad n = 1, \dots, N$

The meaning of C? When C is large, we care more about violating the margin, which gets us closer to the hard-margin SVM. When C is small, on the other hand, we care less about violating the margin.
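The C trade-off is easy to see with scikit-learn's linear soft-margin SVC (a sketch on synthetic data; exact numbers will vary):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)          # gamma(h) = 1 / ||w||
    print(f"C={C}: margin={margin:.3f}, support vectors={clf.n_support_.sum()}")
# Small C tolerates violations (wider margin, more support vectors);
# large C approaches the hard-margin solution.
```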

Soft Margin Example


Soft Margin and Hard Margin

$\min_{b, w, \varepsilon}\ \underbrace{\tfrac{1}{2} w^T w}_{\text{margin}} + \underbrace{C \sum_{n=1}^N \varepsilon_n}_{\text{error tolerance}}$

$\text{subject to}\quad y_n(w^T x_n + b) \ge 1 - \varepsilon_n, \quad \varepsilon_n \ge 0,\ \forall n$

The Hinge Loss

The trade-off sounds very similar, right?

We have $\varepsilon_n \ge 0$ and $y_n(w^T x_n + b) \ge 1 - \varepsilon_n$, hence $\varepsilon_n \ge 1 - y_n(w^T x_n + b)$.

The SVM loss (a.k.a. hinge loss) function:

$E_{SVM}(b, w) = \frac{1}{N} \sum_{n=1}^N \max(1 - y_n(w^T x_n + b),\ 0)$

The soft-margin SVM can be re-written as the following optimization problem:

$\min_{b, w}\ E_{SVM}(b, w) + \lambda\, w^T w$
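A direct sketch of the hinge-loss objective $E_{SVM}(b, w) + \lambda\, w^T w$ in NumPy, checked on the toy data set with its optimal hard-margin separator:

```python
import numpy as np

def svm_objective(b, w, X, y, lam):
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0).mean()   # E_SVM(b, w)
    return hinge + lam * (w @ w)

X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., 1., 1.])
print(svm_objective(-1.0, np.array([1.0, -1.0]), X, y, lam=0.0))  # 0.0: no violations
```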

Dual Soft-Margin SVM

$\max_{\alpha \in \mathbb{R}^N}\ \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{m=1}^N \sum_{n=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m$

$\text{subject to}\quad \sum_{n=1}^N y_n \alpha_n = 0, \quad 0 \le \alpha_n \le C,\ \forall n$

$w^* = \sum_{n=1}^N y_n \alpha_n^* x_n$
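Compared with the hard-margin dual sketch above, the only change is the box constraint $0 \le \alpha_n \le C$, i.e. stacking $-I$ and $I$ into the inequality block (cvxopt assumed; the recovery of $b^*$ from a free support vector is a standard detail chosen for this illustration):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_soft_margin_dual(X, y, C):
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    H = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)
    G = np.vstack([-np.eye(N), np.eye(N)])            # -alpha <= 0 and alpha <= C
    h = np.hstack([np.zeros(N), C * np.ones(N)])
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(H), matrix(-np.ones((N, 1))),
                     matrix(G), matrix(h.reshape(-1, 1)),
                     matrix(y.reshape(1, -1)), matrix(0.0))
    alpha = np.array(sol['x']).ravel()
    w = ((alpha * y)[:, None] * X).sum(axis=0)
    free_sv = (alpha > 1e-6) & (alpha < C - 1e-6)     # support vectors on the margin
    b = np.mean(y[free_sv] - X[free_sv] @ w)
    return alpha, w, b
```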

Summary of Dual SVM

Deliver a large-margin hyperplane, and in so doing control the effective model complexity.

Deal with high- or infinite-dimensional transforms using the kernel trick.

Express the final hypothesis $g(x)$ using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.

Control the sensitivity to outliers and regularize the solution through setting C appropriately.

Support Vector Machine

(Closing diagram from the slides: a robust classifier via maximum margin; hard-margin and soft-margin designs each have a primal objective (a QP in d variables) and a dual objective (a QP in N variables); the dual enables the kernel trick and identifies the support vectors; the soft margin allows training error via the hinge loss.)
