CS456: Machine Learning


Page 1: CS456: Machine Learning


CS456: Machine Learning
Support Vector Machine + Kernel

Jakramate Bootkrajang

Department of Computer Science, Chiang Mai University

Page 2: CS456: Machine Learning

Objectives

To understand what a decision boundary is

To understand how SVM works and be able to apply the SVM model effectively

To understand what a kernel is, and be able to choose an appropriate kernel

Page 3: CS456: Machine Learning

Outlines

Linear Decision Boundary

SVM’s philosophy

Margin

SVM + Kernel function

Regularised SVM

SVM in Sklearn

Page 4: CS456: Machine Learning

Linear decision boundary

A linear decision boundary is an imaginary line that separates data classes


Page 5: CS456: Machine Learning

How to draw such a line?

Consider a linear classifier

$\mathrm{sign}(w^T x)$

for binary classification where the class label $y$ is either $1$ or $-1$

This classifier predicts $-1$ if $w^T x < 0$

and predicts $1$ if $w^T x > 0$

Page 6: CS456: Machine Learning

Concrete example [1/3]

For example, if $w = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$

We see that $w^T x_{\text{violet}} = \begin{bmatrix} 2 \\ 2 \end{bmatrix}^T \begin{bmatrix} 1 \\ 1 \end{bmatrix} > 0$ and $w^T x_{\text{orange}} < 0$
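A quick numeric check of this example with NumPy (the orange point's coordinates are not given on the slide, so the value below is an illustrative assumption):

```python
import numpy as np

w = np.array([2.0, 2.0])            # weight vector from the example
x_violet = np.array([1.0, 1.0])     # point from the slide, on the positive side
x_orange = np.array([-1.0, -1.0])   # assumed coordinates for the negative-side point

for name, x in [("violet", x_violet), ("orange", x_orange)]:
    score = w @ x                   # w^T x
    print(f"{name}: w^T x = {score:+.1f}, predicted label = {np.sign(score):+.0f}")
```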

Page 7: CS456: Machine Learning

Concrete example [2/3]

Now, observe that $w^T x_{\text{green}} = 0$.


Page 8: CS456: Machine Learning

Concrete example [3/3]

The set of points where $w^T x = 0$ defines the decision boundary.

Geometrically, they are the points (vectors) which are perpendicular to $w$ (their dot product with $w$ is zero).

Note that changing the magnitude of $w$ does not change the decision boundary.

Page 9: CS456: Machine Learning

Decision boundary and prediction


Intuitively, we are more certain about the prediction for a point that is far away from the boundary (the orange point),

while we are less sure about points that lie closer to the decision boundary (the violet point).

Page 10: CS456: Machine Learning

SVM’s philosophy

SVM makes use of this intuition and tries to find a decision boundary which

1. is farthest away from the data of both classes (maximum margin)

2. predicts class labels correctly


Page 11: CS456: Machine Learning

Distance of x to the decision hyperplane


Represent $x = x_p + r \frac{w}{\|w\|}$, i.e., a point $x_p$ on the hyperplane plus $r$ times the unit vector along $w$

Since $w^T x_p = 0$, we then have $w^T x = w^T x_p + r \frac{w^T w}{\|w\|} = r\|w\|$

In other words, $r = \frac{w^T x}{\|w\|}$ (note that $r$ is invariant to scaling of $w$)
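A minimal NumPy sketch of this distance formula (the vectors are made up for illustration); it also shows the invariance of $r$ to rescaling of $w$:

```python
import numpy as np

def signed_distance(w, x):
    """Signed distance r = w^T x / ||w|| of x to the hyperplane w^T x = 0."""
    return (w @ x) / np.linalg.norm(w)

w = np.array([2.0, 2.0])
x = np.array([1.0, 1.0])
print(signed_distance(w, x))      # about 1.414
print(signed_distance(3 * w, x))  # same value: r is invariant to scaling of w
```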

Page 12: CS456: Machine Learning

Margin

The distance from the closest point in a class to the decision boundary is called the margin

Usually, we assume the margin of the positive class and the margin of the negative class are equal

This means the decision boundary is placed midway between the two closest points having different class labels.

Page 13: CS456: Machine Learning

SVM’s goal = maximum margin

According to a theorem from learning theory, among all possible linear decision functions, the one that maximises the (geometric) margin of the training set will minimise the generalisation error


Page 14: CS456: Machine Learning

Types of margins

Geometric margin: $r = \frac{w^T x}{\|w\|}$

▶ Normalised functional margin

Functional margin: $w^T x$

▶ Can be increased without bound by multiplying $w$ by a constant

Page 15: CS456: Machine Learning

Maximum (geometric) margin (1/2)

Since we can scale the functional margin, we can demand the functional margin for the nearest points to be $+1$ and $-1$ on the two sides of the decision boundary.

Denoting a nearest positive example by $x_+$ and a nearest negative example by $x_-$, we have

▶ $w^T x_+ = +1$

▶ $w^T x_- = -1$


Page 16: CS456: Machine Learning

Maximum (geometric) margin (2/2)

We then compute the geometric margin from the functional margin constraints

$$\text{margin} = \frac{1}{2}\left(\frac{w^T x_+}{\|w\|} - \frac{w^T x_-}{\|w\|}\right) = \frac{1}{2\|w\|}\left(w^T x_+ - w^T x_-\right) = \frac{1}{\|w\|}$$
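The identity can be checked numerically; the values of $w$, $x_+$ and $x_-$ below are made up so that the functional-margin constraints $w^T x_+ = +1$ and $w^T x_- = -1$ hold:

```python
import numpy as np

w = np.array([1.0, 1.0])
x_plus = np.array([0.5, 0.5])     # w^T x+ = +1
x_minus = np.array([-0.5, -0.5])  # w^T x- = -1

norm_w = np.linalg.norm(w)
margin_from_points = 0.5 * ((w @ x_plus) / norm_w - (w @ x_minus) / norm_w)
print(margin_from_points, 1.0 / norm_w)   # both are about 0.7071
```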

Page 17: CS456: Machine Learning

SVM’s objective function

Given a linearly separable training set $S = \{(x_i, y_i)\}_{i=1}^{m}$.

We need to find $w$ which maximises $\frac{1}{\|w\|}$

Maximising $\frac{1}{\|w\|}$ is equivalent to minimising $\|w\|^2$

The objective of SVM is then the following quadratic program:

minimise: $\frac{1}{2}\|w\|^2$

subject to: $w^T x_i \ge +1$ for $y_i = +1$
            $w^T x_i \le -1$ for $y_i = -1$
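This quadratic program can be written almost verbatim with a convex-optimisation library. The sketch below uses cvxpy on a tiny toy set; cvxpy and the data are assumptions for illustration, not part of the slides. The two constraints are combined into the single form $y_i\, w^T x_i \ge 1$.

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy data (illustrative, not from the slides)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.sum_squares(w))   # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]         # y_i w^T x_i >= 1 covers both cases
cp.Problem(objective, constraints).solve()

print("w =", w.value, " margin =", 1.0 / np.linalg.norm(w.value))
```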

Page 18: CS456: Machine Learning

SVM’s philosophy: revisit

SVM tries to find a decision boundary which

1. is farthest away from the data of both classes

minimise: $\frac{1}{2}\|w\|^2$

2. predicts class labels correctly

subject to: $w^T x_i \ge +1$ for $y_i = +1$
            $w^T x_i \le -1$ for $y_i = -1$

We can write the above as:

$$\arg\min_w \; \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \max(0, 1 - y_i w^T x_i) \qquad (1)$$
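Equation (1) is an unconstrained hinge-loss objective, so it can also be minimised directly with (sub)gradient descent. A minimal sketch under that view (the toy data and learning rate are illustrative):

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy data
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels are +1 / -1

w = np.zeros(2)
lr = 0.01
for _ in range(1000):
    margins = y * (X @ w)                                  # y_i w^T x_i
    active = margins < 1                                   # points with non-zero hinge loss
    grad = w - (y[active, None] * X[active]).sum(axis=0)   # subgradient of Eq. (1)
    w -= lr * grad

print("w =", w)
print("hinge losses:", np.maximum(0.0, 1.0 - y * (X @ w)))
```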

Page 19: CS456: Machine Learning

Support vectors

The training points that are nearest to the decision boundary are called support vectors.

Quiz: what is the output of our decision function for these points?


Page 20: CS456: Machine Learning

Non-linear data

What to do when data is not linearly separable?

Page 21: CS456: Machine Learning

Non-linear classifiers

First approach

▶ Use a non-linear model, e.g., a neural network

▶ (problems: many parameters, local minima)

Second approach

▶ Transform the data into a richer feature space (including high-dimensional/non-linear features), then use a linear classifier

Page 22: CS456: Machine Learning

Learning in the feature space

Map the data into a feature space where it becomes linearly separable

Page 23: CS456: Machine Learning

Feature transformation function

It is computationally infeasible to explicitly compute the image of the mapping $\phi()$

because some $\phi()$ even map $x$ into an infinite-dimensional space

Instead of explicitly calculating $\phi(x)$, we can focus on the relative similarity between the mapped data points

That is, we calculate $\phi(x_i)^T \phi(x_j)$ for all $i, j$

Page 24: CS456: Machine Learning

Kernel

A kernel is a function that gives the dot product between the vectors in the feature space induced by the mapping $\phi$.

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

In matrix form, the kernel matrix (a.k.a. Gram matrix) is given by

$$K = \begin{bmatrix}
K(x_1, x_1) & \dots & K(x_1, x_j) & \dots & K(x_1, x_m) \\
\vdots & \ddots & \vdots & & \vdots \\
K(x_i, x_1) & \dots & K(x_i, x_j) & \dots & K(x_i, x_m) \\
\vdots & & \vdots & \ddots & \vdots \\
K(x_m, x_1) & \dots & K(x_m, x_j) & \dots & K(x_m, x_m)
\end{bmatrix}$$
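Building the Gram matrix needs only the kernel function, never $\phi$ itself. A minimal NumPy sketch (the kernel choice and data here are illustrative):

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(x_i, x_j) for all pairs of rows of X."""
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

poly2 = lambda u, v: (u @ v) ** 2      # degree-2 polynomial kernel (see the next slides)
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
print(gram_matrix(X, poly2))           # 3 x 3 symmetric kernel matrix
```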

Page 25: CS456: Machine Learning

Kernel

Each row of $K$ is regarded as a transformed data point, and $K$ can be used in place of the data matrix when training classification models

$$K = \begin{bmatrix}
K(x_1, x_1) & \dots & K(x_1, x_j) & \dots & K(x_1, x_m) \\
\vdots & \ddots & \vdots & & \vdots \\
K(x_i, x_1) & \dots & K(x_i, x_j) & \dots & K(x_i, x_m) \\
\vdots & & \vdots & \ddots & \vdots \\
K(x_m, x_1) & \dots & K(x_m, x_j) & \dots & K(x_m, x_m)
\end{bmatrix}$$

Page 26: CS456: Machine Learning

Example 1: Polynomial kernel

Mapping points $x, y$ in the 2D input space to a 3D feature space (note: $y$ is a data point, not a class label, in this slide)

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

The corresponding mapping $\phi$ is

$$\phi(x) = \begin{bmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{bmatrix}, \qquad \phi(y) = \begin{bmatrix} y_1^2 \\ \sqrt{2}\, y_1 y_2 \\ y_2^2 \end{bmatrix}$$

So we get a polynomial kernel $K(x, y) = \phi(x)^T \phi(y) = (x^T y)^2$

▶ of degree 2
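The identity $\phi(x)^T \phi(y) = (x^T y)^2$ is easy to verify numerically; the two points below are made up for illustration:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map from the slide."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # explicit mapping, then dot product in feature space
print((x @ y) ** 2)      # kernel trick: same value without computing phi
```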

Page 27: CS456: Machine Learning

Visualising

Figure: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

SVM with polynomial kernel: https://www.youtube.com/watch?v=3liCbRZPrZA

Page 28: CS456: Machine Learning

Kernels: Gaussian (rbf) kernel

A Gaussian kernel is defined as

$$K(x, y) = e^{-\|x - y\|^2 / \sigma}$$

The mapping $\phi()$ for the Gaussian kernel is infinite-dimensional

$\sigma$ is a kernel parameter called the 'kernel width', which must be chosen carefully

A large $\sigma$ assumes that the neighbourhood of $x$ spreads further

A small $\sigma$ assumes a small neighbourhood
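A small NumPy sketch of the kernel as defined above, showing how $\sigma$ controls the size of the neighbourhood (the points are illustrative). Note that scikit-learn parameterises the RBF kernel as $e^{-\gamma \|x - y\|^2}$, so its gamma plays the role of $1/\sigma$ under this definition.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel as on the slide: exp(-||x - y||^2 / sigma)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma)

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])

# Larger sigma -> similarity decays more slowly with distance -> wider neighbourhood
for sigma in (0.5, 1.0, 10.0):
    print(f"sigma = {sigma:>4}: K(x, y) = {rbf_kernel(x, y, sigma):.5f}")
```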

Page 29: CS456: Machine Learning

Overlapping data ?

We cannot hope for every dataset to be perfectly separable (either linearly or non-linearly)

This includes naturally overlapping classes

And also datasets which are quite noisy

▶ Originally separable, but due to noise the observed data is not

If we try to apply SVM to this kind of data, there will be no solution

▶ Since we require that all data must be correctly classified

Page 30: CS456: Machine Learning

Regularised SVM (1/3)

We will relax the constraints

subject to: $w^T x_i \ge +1$ for $y_i = +1$
            $w^T x_i \le -1$ for $y_i = -1$

to

subject to: $w^T x_i \ge +1 - \xi_i$ for $y_i = +1$
            $w^T x_i \le -1 + \xi_i$ for $y_i = -1$

Page 31: CS456: Machine Learning

Regularised SVM (2/3)

The $\xi_i$ are called the slack variables

Our new objective is then

minimise: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$

subject to: $w^T x_i \ge +1 - \xi_i$ for $y_i = +1$
            $w^T x_i \le -1 + \xi_i$ for $y_i = -1$
            $\xi_i \ge 0$ for all $i$
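The slack formulation drops into the same convex-programming pattern sketched earlier; again cvxpy and the toy data are assumptions for illustration (one point is deliberately mislabelled so the hard-margin problem would be infeasible):

```python
import cvxpy as cp
import numpy as np

# Toy data that is not linearly separable: the last point is mislabelled
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [1.5, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

w = cp.Variable(2)
xi = cp.Variable(4, nonneg=True)                    # slack variables, xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w) >= 1 - xi]     # relaxed margin constraints
cp.Problem(objective, constraints).solve()

print("w =", w.value)
print("slacks =", xi.value)   # non-zero slack marks points inside or beyond the margin
```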

Page 32: CS456: Machine Learning

Regularised SVM (3/3)

The parameter $C$ controls the trade-off between fitting the data well and allowing some slackness

The predictive performance of SVM is known to depend on the $C$ parameter

▶ Picking $C$ is usually done via cross-validation, as sketched below
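A common way to pick $C$ in scikit-learn is GridSearchCV; a minimal sketch (the candidate grid and the synthetic dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```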

Page 33: CS456: Machine Learning

Example of various C settings

Figure: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Page 34: CS456: Machine Learning


Example of various C settings

Figure: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Page 35: CS456: Machine Learning


Example of various C settings

Figure: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Page 36: CS456: Machine Learning


SVM in sklearn

Python's sklearn implements SVM in the SVC class (sklearn.svm.SVC)

Page 37: CS456: Machine Learning

SVC parameters

We have seen all of the important parameters (in sklearn, gamma plays the role of the kernel width; with the kernel written as above, gamma corresponds to $1/\sigma$)

Page 38: CS456: Machine Learning

SVC example

An example of using SVC
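The code shown on the original slide is not preserved in this transcript; the snippet below is a typical minimal usage of SVC with the parameters discussed above (dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(C=1.0, kernel="rbf", gamma="scale")   # C, kernel and gamma as discussed
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```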

Page 39: CS456: Machine Learning

Objectives: revisited

To understand what a decision boundary is

To understand how SVM works and be able to apply the SVM model effectively

To understand what a kernel is, and be able to choose an appropriate kernel

Page 40: CS456: Machine Learning

Lastly

Questions, please.
