Machine Learning Classifiers: Support Vector Machine
What is Classification?
(Figure: detecting facial attributes; faces grouped into children and female adults. He & Zhang, Pattern Recognition, 2011; Sony "Make Believe" image.)
The training set must be as unambiguous as possible. This is not easy, especially as members of different classes may share similar attributes. Learning implies generalization: identifying which features of each class member make the class most distinguishable from the other classes.
Multi-Class Classification
(Figure: faces grouped into three classes: children, female adults, and male adults.)
Whenever possible, the classes should be balanced.
Garbage model: male adult versus anything that is neither a female adult nor a child. With such a model, the classes can no longer be balanced!
Classifiers
There is a plethora of classifiers, e.g.:
- Neural networks (feed-forward with backpropagation, multi-layer perceptron)
- Decision trees (C4.5, random forest)
- Kernel methods (support vector machine, Gaussian process classifier)
- Mixtures of linear classifiers (boosting)
In this class, we will see only SVM and boosting for mixtures of classifiers.
Each classifier type has its pros and cons:
- Complex models embed non-linearity but require heavy computation
- Simple models often need to be combined in large numbers, hence a high memory footprint
- A high number of hyperparameters requires extensive cross-validation to determine the optimal classifier
- Some classifiers come with guarantees of a globally optimal solution; others have only local optimality guarantees
Support Vector Machine
Brief history:
SVM was invented by Vladimir Vapnik. It started with the invention of statistical learning theory (Vapnik, 1979). The current form of SVM was presented in Boser, Guyon and Vapnik (1992) and Cortes and Vapnik (1995).
Textbooks:
An easy introduction to SVM is given in Learning with Kernels by Bernhard Schölkopf and Alexander Smola. A good survey of the theory behind SVM is given in Support Vector Machines and Other Kernel-Based Learning Methods by Nello Cristianini and John Shawe-Taylor.
Support Vector Machine
The success of SVM is mainly due to:
- Its ease of use (lots of software available, good documentation)
- Excellent performance on a variety of datasets
- Good solvers that make the optimization (learning phase) very quick
- Very fast retrieval, which does not hinder practical applications
SVM has been applied to numerous classification problems:
- Computer vision (face detection, object recognition, feature categorization, etc.)
- Bioinformatics (categorization of gene expression and microarray data)
- WWW (categorization of websites)
- Production (quality control, defect detection)
- Robotics (categorization of sensor readings)
- Finance (bankruptcy prediction)
Optimal Linear Classification
• Which choice is better?
• How could we formulate this problem?
(Figure: three candidate separating lines, labeled 'good', 'OK', and 'bad'.)
Linear Classifiers
(Figure, repeated over several slides: a dataset with points labeled $-1$ and $+1$, and various candidate separating lines.)
A linear classifier with parameters $(w, b)$ maps an input $x$ to an estimated label $y_{est}$:

$f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)$

How would you classify this data? Any of the candidate lines would be fine... but which is best?
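Not part of the slides: a minimal Python sketch of this decision function, with made-up weights, offset, and points for illustration.

```python
import numpy as np

def linear_classifier(X, w, b):
    """f(x; w, b) = sgn(<w, x> + b), applied to each row of X."""
    return np.sign(X @ w + b)

# Made-up parameters and points for illustration.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(linear_classifier(X, w, b))  # labels in {-1, +1} (np.sign returns 0 exactly on the boundary)
```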
Classifier Margin
Define the margin of a linear classifier $f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)$ as the width by which the boundary could be increased before hitting a datapoint.
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).
Support vectors are those datapoints that the margin pushes up against.
To find this classifier, we need to determine a measure of the margin, and then maximize this measure.
Determining the Optimal Separating Hyperplane
Definition:
The separating hyperplane is $\{x : \langle w, x \rangle + b = 0\}$, and the margins on either side of it satisfy $\langle w, x \rangle + b = \pm 1$.
Determining the Optimal Separating Hyperplane
(Figure: the class with label $y = -1$ and the class with label $y = +1$ on either side of the separating plane, with level sets $\{x : \langle w, x \rangle + b = \pm 2\}$ and $\{x : \langle w, x \rangle + b = \pm 3\}$ drawn farther out.)
Points on either side of the separating plane have negative and positive coordinates, respectively.
Decision function: $f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)$
Determining the Optimal Separating Hyperplane
What is the distance from a point $x$ to the hyperplane $\{x : \langle w, x \rangle + b = 0\}$?
Determining the Optimal Separating Hyperplane
Let $x'$ be the point on the hyperplane closest to $x$, so that $\langle w, x' \rangle + b = 0$, i.e. $\langle w, x' \rangle = -b$. Projecting $x - x'$ onto the unit vector $\frac{w}{\|w\|}$:

$\left\langle \frac{w}{\|w\|}, x - x' \right\rangle = \frac{\langle w, x \rangle - \langle w, x' \rangle}{\|w\|} = \frac{\langle w, x \rangle + b}{\|w\|}$

Distance to the hyperplane: $\frac{|\langle w, x \rangle + b|}{\|w\|}$
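A quick numeric check of this formula (a sketch, not from the slides; the hyperplane and point are made-up values):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """|<w, x> + b| / ||w||, the point-to-hyperplane distance derived above."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Made-up example: hyperplane x1 + x2 - 1 = 0 and the point (2, 2).
print(distance_to_hyperplane(np.array([2.0, 2.0]), np.array([1.0, 1.0]), -1.0))
# prints 3/sqrt(2) ≈ 2.1213
```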
Determining the Optimal Separating Hyperplane
(Figure: the class with label $y = -1$, the class with label $y = +1$, and two points $x^1$, $x^2$ on either side of the margin.)
Take two points on either side of the margin:

$\langle w, x^1 \rangle + b = +1$
$\langle w, x^2 \rangle + b = -1$

Subtracting the second equation from the first gives $\langle w, x^1 - x^2 \rangle = 2$, hence $\|x^1 - x^2\| \geq \frac{2}{\|w\|}$.
The margin between the two classes is at least $2/\|w\|$.
Determining the Optimal Separating Hyperplane
The separation is measured by the margin $\frac{2}{\|w\|}$. Maximizing it is equivalent to minimizing $\frac{\|w\|}{2}$. Better still is to minimize the convex form $\frac{\|w\|^2}{2}$.
Determining the Optimal Separating Hyperplane
• Finding the optimal separating hyperplane turns out to be an optimization problem of the following form:

$\min_{w, b} \frac{1}{2}\|w\|^2$
subject to $y^i (\langle w, x^i \rangle + b) \geq 1, \quad i = 1, 2, \dots, M$

(the constraints $\langle w, x^i \rangle + b \geq +1$ when $y^i = +1$ and $\langle w, x^i \rangle + b \leq -1$ when $y^i = -1$ combine into this single condition)

• N+1 parameters (N: dimension of the data)
• M constraints (M: number of datapoints)
• It is called the primal problem.
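The primal problem can be handed directly to a generic convex solver. A minimal sketch using cvxpy on made-up linearly separable data (an illustration only, not the specialized solvers used in practice):

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable data: two clouds, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

# Primal: minimize (1/2)||w||^2 subject to y_i (<w, x_i> + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```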
Determining the Optimal Separating Hyperplane
Rephrase the minimization-under-constraints problem in terms of the Lagrange multipliers $\alpha_i$, $i = 1, \dots, M$ (M: number of datapoints), one for each of the inequality constraints; this leads to the dual problem. The Lagrangian is:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^M \alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right), \quad \text{with } \alpha_i \geq 0$

(Minimization of a convex function under linear constraints through Lagrange multipliers gives the globally optimal solution.)
Determining the Optimal Separating Hyperplane
The solution of this problem is found by maximizing over $\alpha$ and minimizing over $w$ and $b$:

$\max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)$

where $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^M \alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right)$
Determining the Optimal Separating Hyperplane
Requiring that the gradient of $L$ with respect to $w$ vanishes:

$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^M \alpha_i y^i x^i$

The vector defining the hyperplane is determined by the training points. Note that while $w$ is unique (minimization of a convex function), the $\alpha_i$ are not unique.
Determining the Optimal Separating Hyperplane
Requiring that the gradient of $L$ with respect to $b$ vanishes:

$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^M \alpha_i y^i = 0$

This requires at least one datapoint in each class.
Determining the Optimal Separating Hyperplane
Differentiating $L$ with respect to the multipliers $\alpha_i$ recovers the original constraints:

$\frac{\partial L(w, b, \alpha)}{\partial \alpha_i} \leq 0 \;\Rightarrow\; y^i (\langle w, x^i \rangle + b) \geq 1$
Determining the Optimal Separating Hyperplane
Complete optimization problem, summarized by the Karush-Kuhn-Tucker (KKT) conditions:

$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^M \alpha_i y^i x^i$ (stationarity)

$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^M \alpha_i y^i = 0$ (stationarity)

$y^i (\langle w, x^i \rangle + b) \geq 1, \quad i = 1, \dots, M$ (primal feasibility)

$\alpha_i \geq 0, \quad i = 1, \dots, M$ (dual feasibility)

$\alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right) = 0, \quad i = 1, \dots, M$ (complementarity conditions)

The feasibility conditions from KKT require $\alpha_i \geq 0$ for all points $x^i$ so that the primal constraints $y^i (\langle w, x^i \rangle + b) \geq 1$ are satisfied: the points are correctly classified.

The $\alpha_i$, $i = 1, \dots, M$, determine the solution $w$:
- all pairs of datapoints $(x^i, y^i)$ for which $\alpha_i > 0$ are the support vectors;
- all pairs of datapoints $(x^i, y^i)$ for which $\alpha_i = 0$ are "irrelevant" when computing the margin.
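The dual problem implied by these conditions can also be solved with a generic solver. A minimal sketch with cvxpy (the toy data matches the primal sketch above; the 1e-6 threshold for selecting support vectors is an arbitrary numerical choice):

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable data, as in the primal sketch.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

M = len(y)
P = np.outer(y, y) * (X @ X.T)   # P_ij = y_i y_j <x_i, x_j>

# Dual: maximize sum(alpha) - (1/2) alpha^T P alpha, s.t. alpha >= 0, sum_i alpha_i y_i = 0.
alpha = cp.Variable(M)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(P)))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                   # stationarity: w = sum_i alpha_i y_i x_i
sv = a > 1e-6                     # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # complementarity: y_i (<w, x_i> + b) = 1 on SVs
print("support vectors:", int(sv.sum()), " b =", float(b))
```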
Determining the Optimal Separating Hyperplane
Consider three cases for a datapoint $x^i$, relative to the hyperplanes $\{x : \langle w, x \rangle + b = 0\}$ and $\{x : \langle w, x \rangle + b = \pm 1\}$:

1. $y^i (w^T x^i + b) = 1$: the point lies exactly on the margin.
2. $y^i (w^T x^i + b) > 1$, i.e. $w^T x^i + b < -1$ for $y^i = -1$ or $w^T x^i + b > 1$ for $y^i = +1$: the point lies outside the margin.
3. $y^i (w^T x^i + b) < 1$: the point lies inside the margin and does not satisfy the constraint.
Determining the Optimal Separating Hyperplane
Since $w = \sum_{i=1}^M \alpha_i y^i x^i$, the decision function can be expressed in terms of the support vectors:

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle x^i, x \rangle + b \right)$

Use the complementarity condition $\alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right) = 0$ to compute $b$.

To determine how good the hyperplane is, use cross-validation to get an estimate of the error on the testing set.
Non-Separable Data Sets
(Figure: a dataset with labels $+1$ and $-1$ that no linear boundary separates.)
This is going to be a problem! What should we do?
Idea: introduce some slack on the constraints.
Support Vector Machine for non-separable datasets
The constraints are relaxed by introducing slack variables $\xi_i \geq 0$, $i = 1, \dots, M$:

$\langle w, x^i \rangle + b \geq +1 - \xi_i \quad \text{if } y^i = +1$
$\langle w, x^i \rangle + b \leq -1 + \xi_i \quad \text{if } y^i = -1$

(Figure: three in-margin or misclassified points with slacks $\xi_1$, $\xi_2$, $\xi_3$.)

This can again be expressed through a compact notation:

$y^i (\langle w, x^i \rangle + b) \geq 1 - \xi_i$
Support Vector Machine for non-separable datasets
The objective function adds a penalty for too-large slack variables:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$

This finds a trade-off between maximizing the margin and minimizing the classification errors. $C > 0$ weights the influence of the penalty term.
Support Vector Machine for non-separable datasets
Optimization under constraints problem of the form:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$
subject to $y^j (\langle w, x^j \rangle + b) \geq 1 - \xi_j$ and $\xi_j \geq 0$, $j = 1, \dots, M$.

The hyperplane has the same form of solution as in the separable case:

$w = \sum_{j=1}^M \alpha_j y^j x^j$
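In practice this QP is solved by library code. A minimal sketch with scikit-learn's SVC on made-up overlapping data; note that scikit-learn minimizes $\frac{1}{2}\|w\|^2 + C \sum_j \xi_j$ without the $1/M$ factor used on the slides, so its C differs by a scaling:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping data: two Gaussian clouds, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin linear SVM; C trades margin width against slack penalties.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors per class:", clf.n_support_)
```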
Support Vector Machine for non-separable datasets
(Figure: the hyperplanes $w^T x + b = +1$ and $w^T x + b = -1$, the normal vector $w$, and the support vectors highlighted.)
The decision boundary is determined only by the support vectors:

$w = \sum_{i=1}^M \alpha_i y^i x^i$

$\alpha_i = 0$ for non-support vectors; $\alpha_i > 0$ for the support vectors, for which $y^i (w^T x^i + b) - 1 = 0$ when $\xi_i = 0$.
Non-Linear Classification
What if the points in the input space cannot be separated
by a linear hyperplane?
The Kernel Trick for SVM
As usual, observe that the decision function of the linear SVM computes inner products between pairs of observations, $\langle x^i, x^j \rangle$:

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle x^i, x \rangle + b \right)$
Non-Linear Classification with Support Vector Machines
The decision function in linear classification with SVM was given by:

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle x^i, x \rangle + b \right)$

Replace the linear plane $\langle w, x \rangle + b$ in the original space by a linear plane in feature space: $x \to \phi(x)$ and $\langle w, \phi(x) \rangle + b$, where $w$ now lives in feature space! The decision function becomes:

$f(x) = \mathrm{sgn}(\langle w, \phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle \phi(x^i), \phi(x) \rangle + b \right)$

Use the kernel trick by exploiting the fact that the decision function depends only on a dot product in feature space:

$k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$

so that

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i\, k(x^i, x) + b \right)$
Non-Linear Classification with Support Vector Machines
The optimization problem in feature space becomes (the kernel appears in both the objective and the decision function):

$\max_{\alpha} L(\alpha) = \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M \alpha_i \alpha_j y^i y^j\, k(x^i, x^j)$
subject to $\alpha_i \geq 0$ for all $i = 1, \dots, M$ and $\sum_{i=1}^M \alpha_i y^i = 0$.

The decision function in feature space is computed as follows:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i\, k(x^i, x) + b \right)$
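A minimal sketch of the kernel trick in action, using scikit-learn's SVC with an RBF kernel on concentric-circles data that no linear hyperplane in input space separates (gamma and C are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Data that no linear hyperplane separates in input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1} to match the slides' convention

# RBF-kernel SVM: the separating plane is linear in feature space only.
clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

# The decision function is sum_i alpha_i y_i k(x_i, x) + b over the support vectors.
print("train accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_))
print("decision values for 3 points:", clf.decision_function(X[:3]))
```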
Non-Linear Classification with Support Vector Machines
The offset $b$ can be estimated from the KKT conditions. Best is to compute the expectation over the constraints and hence to compute an estimate of $b$ through regression:

$b = \frac{1}{M} \sum_{j=1}^M \left( y^j - \sum_{i=1}^M \alpha_i y^i\, k(x^i, x^j) \right)$
How to read out the result of SVM
(Figures: a trained SVM shown with the hyperplane, the support vectors, and a color gradient encoding the distance to the hyperplane; a second figure also highlights the margin.)
The hyperparameters of SVM
SVM decision function:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i\, k(x^i, x) + b \right)$

The kernel has several open parameters (hyperparameters) that need to be determined before running SVM:

Gaussian kernel: $k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$ (the kernel width $\sigma$)

Inhomogeneous polynomial kernel: $k(x, x') = \left( \langle x, x' \rangle + d \right)^p$, $p \in \mathbb{N}$, $d \geq 0$ (the order $p$ of the polynomial; usually $d = 1$)
The hyperparameters of SVM
Recall the optimization under constraints problem:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$
subject to $y^j (\langle w, x^j \rangle + b) \geq 1 - \xi_j$ and $\xi_j \geq 0$, $j = 1, \dots, M$.

$C$, which determines the cost associated with incorrectly classified datapoints, is an open parameter of the optimization function.
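Both open parameters are typically set by cross-validation. A minimal sketch with scikit-learn's GridSearchCV; the grids are arbitrary illustrative choices, and scikit-learn parameterizes the Gaussian kernel width through gamma:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# Cross-validated grid search over the penalty C and the kernel width (gamma).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_, " cv accuracy:", grid.best_score_)
```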
Non-Linear Support Vector Machines: Examples

Effect of the penalty factor C
(Figures: RBF kernel width 0.20. With C=1000, several datapoints are misclassified; with C=2000, fewer datapoints are misclassified.)

Effect of the width of the Gaussian kernel
(Figures: C=1000. With kernel width 0.001, 113 support vectors out of 345 datapoints; with width 0.008, 64 support vectors; with width 0.02, 33 support vectors.)
Different optimization runs end up with different solutions: several combinations of the support vectors yield the same optimum.
Support Vector Machine for non-separable datasets
The original objective function was:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$

Determining $C$ may be difficult in practice.
Support Vector Machine for non-separable datasets
ν-SVM is an alternative that automatically optimizes for the best trade-off between model complexity (the largest margin) and the penalty on the error:

$\min_{w, \xi, \rho} \frac{1}{2}\|w\|^2 - \nu \rho + \frac{1}{M} \sum_{i=1}^M \xi_i$
subject to $y^i (\langle w, x^i \rangle + b) \geq \rho - \xi_i$, and $\xi_i \geq 0$, $\rho \geq 0$.
Support Vector Machine for non-separable datasets
ν is an upper bound on the fraction of margin errors (i.e. the fraction of datapoints misclassified within the margin) and a lower bound on the fraction of support vectors.
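These bounds can be checked empirically. A minimal sketch with scikit-learn's NuSVC (data and kernel are arbitrary illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import NuSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
    # nu lower-bounds the fraction of support vectors (and upper-bounds margin errors).
    frac_sv = len(clf.support_) / len(X)
    print(f"nu={nu}: fraction of support vectors = {frac_sv:.2f}")
```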
Support Vector Machine for non-separable datasets
(Figures: ν-SVM with RBF kernel width 0.1. With ν=0.001 there are few support vectors; the number of support vectors increases with ν=0.2 and again with ν=0.9, and the number of margin errors increases likewise.)
Summary: When to use SVM?
SVMs have a wide range of applications for all types of data (vision, text, handwriting, etc.).
SVM is very powerful for large-scale classification, with optimized solvers for the training stage, and it is rapid during recall.
One issue is that the computation at recall grows linearly with the number of support vectors, and the algorithm is not very sparse in support vectors.
Another issue is that it can predict only two classes. For multi-class classification, one needs to run several two-class classifiers.
Multi-Class SVM
(Figure: three classes, e.g. children, female adults, and male adults, with one classifier per class: f1, f2, f3.)
Construct a set of K binary classifiers $f^1, \dots, f^K$, each trained to separate one class from the rest.
Compute the class label in a winner-take-all approach:

$\hat{y}(x) = \arg\max_{i = 1, \dots, K} \left( \sum_{j=1}^M \alpha_j^i y^j\, k(x^j, x) + b^i \right)$

It is sufficient to compute only K-1 classifiers for K classes, but computing the K-th classifier may provide tighter bounds on the K-th class.
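A minimal sketch of the one-vs-rest scheme with scikit-learn; the blob data stands in for the children / female adult / male adult example, and the parameters are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three classes, as in the slides' children / female adult / male adult example.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# One binary SVM per class (one vs. the rest), winner-take-all on decision values.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=10.0)).fit(X, y)

scores = ovr.decision_function(X[:5])  # one column per binary classifier f^i
print("argmax:", np.argmax(scores, axis=1), " predict:", ovr.predict(X[:5]))
```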
Multi-Class SVM
Drawbacks of combining multiple binary classifiers:
- How to reject an instance if it belongs to none of the classes (garbage model, or a threshold on the minimum of the associated classifier functions; but it is difficult to compare the scales of the different classifier functions)
- Asymmetric classification (some classes have many more positive examples than others); one can adjust the C penalty to give each class a relative influence as a function of its number of patterns, e.g. C = 10*M.
An alternative is to compute all the classes as part of one optimization function. Performance seems, however, comparable to one-against-all optimization (Franc & Hlavac, Multi-class support vector machine, 2002; J. Weston and C. Watkins, Multi-class support vector machines, 1998).
Relevance Vector Machine
Even though SVM usually results in a relatively small number of support vectors compared to the total number of datapoints, nothing ensures that we obtain a sparse solution.
SVM also requires finding hyperparameters ($C$ and the kernel width $\sigma$) and a special form for the basis function (the kernel must satisfy the Mercer conditions).
RVM relaxes these two assumptions by taking a Bayesian approach.
Relevance Vector Machine
Start from the solution of SVM and rewrite it as a linear combination over M basis functions:

$y(x) = f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i\, k(x, x^i) + b \right), \quad \alpha = (\alpha_1, \dots, \alpha_M)^T$

A sparse solution has the majority of the entries in $\alpha$ equal to zero, e.g. $\alpha = (1, 0, 0, \dots, 1, 0)^T$.
Relevance Vector Machine
Model the distribution of the class label with a Bernoulli distribution, replacing the sign function with the continuous but steep sigmoid function $g(z) = 1 / (1 + e^{-z})$:

$p(y \mid x; \alpha) = g\left( \sum_{i=1}^M \alpha_i\, k(x, x^i) + b \right)^y \left( 1 - g\left( \sum_{i=1}^M \alpha_i\, k(x, x^i) + b \right) \right)^{1 - y}$

Doing maximum likelihood would lead to overfitting, as we have more parameters than datapoints. Instead, approximate the distribution of the $\alpha_i$ with a probability density function, which reduces the number of parameters to estimate.
The optimal $\alpha$ cannot be computed in closed form; one must use an iterative method similar to expectation-maximization (see Tipping 2001, supplementary material).
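A hypothetical illustration of this likelihood model (a sketch only, not Tipping's full training algorithm; the kernel width, data, and the chosen sparse alpha are made up):

```python
import numpy as np

def rbf_kernel(X, Xi, width=0.5):
    """Gaussian basis functions k(x, x_i) = exp(-||x - x_i||^2 / (2 width^2))."""
    d2 = ((X[:, None, :] - Xi[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

def log_likelihood(alpha, b, X, y, width=0.5):
    """Sum over datapoints of log p(y|x) = y log g(z) + (1 - y) log(1 - g(z))."""
    z = rbf_kernel(X, X, width) @ alpha + b
    g = 1.0 / (1.0 + np.exp(-z))  # sigmoid replaces the sign function
    return np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels in {0, 1} for the Bernoulli model
alpha = np.zeros(30)
alpha[[3, 17]] = 1.0  # a sparse alpha: most entries are exactly zero
print("log-likelihood:", log_likelihood(alpha, 0.0, X, y))
```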
Relevance Vector Machine
(Figure: effect of a fit parameter in $[0, 1]$. This parameter determines how good the fit is: the smaller the value, the more closely the true parameters are fitted.)
Relevance Vector Machine
(Figures comparing sparsity on the same data: RVM with l=0.04 and kernel width 0.01 retains 19 support vectors out of 477 datapoints, while SVM with C=1000 and kernel width 0.01 retains 51 support vectors.)
Pros and Cons of RVM
Pros:
- Yields sparser solutions (fewer support vectors)
- Outputs estimates of the posterior probability of class membership (uncertainty on the prediction of the class label)
Cons:
- Iterative method (needs a stopping criterion and an arbitrary value for the precision)
- Very intensive computationally at training (memory and number of iterations grow with the square and the cube of the number of basis functions, i.e. with the number of datapoints)