Machine Learning Classifiers: Support Vector Machine
What is Classification?
(Figure: detecting facial attributes; faces grouped into children and female adults. He & Zhang, Pattern Recognition, 2011; Sony "Make Believe" image.)
The training set must be as unambiguous as possible. This is not easy, especially as members of different classes may share similar attributes. Learning implies generalization: identifying which features of each class member make the class most distinguishable from the other classes.
Multi-Class Classification
(Figure: faces grouped into three classes: children, female adults, and male adults.)
Whenever possible, the classes should be balanced.
Garbage model: male adult versus anything that is neither a female adult nor a child. With such a model, the classes can no longer be balanced!
Classifiers
There is a plethora of classifiers, e.g.:
- Neural networks (feed-forward with backpropagation, multi-layer perceptron)
- Decision trees (C4.5, random forest)
- Kernel methods (support vector machine, Gaussian process classifier)
- Mixtures of linear classifiers (boosting)
In this class, we will see only SVM and boosting for mixtures of classifiers.
Each classifier type has its pros and cons:
- Complex models embed non-linearity but require heavy computation
- Simple models often need to be combined in large numbers, hence a high memory footprint
- A high number of hyperparameters requires extensive cross-validation to determine the optimal classifier
- Some classifiers come with guarantees of a globally optimal solution; others have only local optimality guarantees
Support Vector Machine
Brief history:
SVM was invented by Vladimir Vapnik. It started with the invention of statistical learning theory (Vapnik, 1979). The current form of SVM was presented in Boser, Guyon and Vapnik (1992) and Cortes and Vapnik (1995).
Textbooks:
An easy introduction to SVM is given in Learning with Kernels by Bernhard Schölkopf and Alexander Smola. A good survey of the theory behind SVM is given in Support Vector Machines and Other Kernel-Based Learning Methods by Nello Cristianini and John Shawe-Taylor.
Support Vector Machine
The success of SVM is mainly due to:
- Its ease of use (lots of software available, good documentation)
- Excellent performance on a variety of datasets
- Good solvers that make the optimization (learning phase) very quick
- Very fast retrieval, which does not hinder practical applications
SVM has been applied to numerous classification problems:
- Computer vision (face detection, object recognition, feature categorization, etc.)
- Bioinformatics (categorization of gene expression and microarray data)
- WWW (categorization of websites)
- Production (quality control, defect detection)
- Robotics (categorization of sensor readings)
- Finance (bankruptcy prediction)
Optimal Linear Classification
• Which choice is better?
• How could we formulate this problem?
(Figure: three candidate separating lines, labeled 'good', 'OK', and 'bad'.)
Linear Classifiers
(Figure, repeated over several slides: a dataset with points labeled $-1$ and $+1$, and various candidate separating lines.)
A linear classifier with parameters $(w, b)$ maps an input $x$ to an estimated label $y_{est}$:

$f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)$

How would you classify this data? Any of the candidate lines would be fine... but which is best?
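Not part of the slides: a minimal Python sketch of this decision function, with made-up weights, offset, and points for illustration.

```python
import numpy as np

def linear_classifier(X, w, b):
    """f(x; w, b) = sgn(<w, x> + b), applied to each row of X."""
    return np.sign(X @ w + b)

# Made-up parameters and points for illustration.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(linear_classifier(X, w, b))  # labels in {-1, +1} (np.sign returns 0 exactly on the boundary)
```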
Classifier Margin
Define the margin of a linear classifier $f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)$ as the width by which the boundary could be increased before hitting a datapoint.
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).
Support vectors are those datapoints that the margin pushes up against.
To find this classifier, we need to determine a measure of the margin, and then maximize this measure.
Determining the Optimal Separating Hyperplane
Definition:
The separating hyperplane is $\{x : \langle w, x \rangle + b = 0\}$, and the margins on either side of it satisfy $\langle w, x \rangle + b = \pm 1$.
Determining the Optimal Separating Hyperplane
(Figure: the class with label $y = -1$ and the class with label $y = +1$ on either side of the separating plane, with level sets $\{x : \langle w, x \rangle + b = \pm 2\}$ and $\{x : \langle w, x \rangle + b = \pm 3\}$ drawn farther out.)
Points on either side of the separating plane have negative and positive coordinates, respectively.
Decision function: $f(x; w, b) = \mathrm{sgn}(\langle w, x \rangle + b)$
Determining the Optimal Separating Hyperplane
What is the distance from a point $x$ to the hyperplane $\{x : \langle w, x \rangle + b = 0\}$?
Determining the Optimal Separating Hyperplane
Let $x'$ be the point on the hyperplane closest to $x$, so that $\langle w, x' \rangle + b = 0$, i.e. $\langle w, x' \rangle = -b$. Projecting $x - x'$ onto the unit vector $\frac{w}{\|w\|}$:

$\left\langle \frac{w}{\|w\|}, x - x' \right\rangle = \frac{\langle w, x \rangle - \langle w, x' \rangle}{\|w\|} = \frac{\langle w, x \rangle + b}{\|w\|}$

Distance to the hyperplane: $\frac{|\langle w, x \rangle + b|}{\|w\|}$
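A quick numeric check of this formula (a sketch, not from the slides; the hyperplane and point are made-up values):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """|<w, x> + b| / ||w||, the point-to-hyperplane distance derived above."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Made-up example: hyperplane x1 + x2 - 1 = 0 and the point (2, 2).
print(distance_to_hyperplane(np.array([2.0, 2.0]), np.array([1.0, 1.0]), -1.0))
# prints 3/sqrt(2) ≈ 2.1213
```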
Determining the Optimal Separating Hyperplane
(Figure: the class with label $y = -1$, the class with label $y = +1$, and two points $x^1$, $x^2$ on either side of the margin.)
Take two points on either side of the margin:

$\langle w, x^1 \rangle + b = +1$
$\langle w, x^2 \rangle + b = -1$

Subtracting the second equation from the first gives $\langle w, x^1 - x^2 \rangle = 2$, hence $\|x^1 - x^2\| \geq \frac{2}{\|w\|}$.
The margin between the two classes is at least $2/\|w\|$.
Determining the Optimal Separating Hyperplane
The separation is measured by the margin $\frac{2}{\|w\|}$. Maximizing it is equivalent to minimizing $\frac{\|w\|}{2}$. Better still is to minimize the convex form $\frac{\|w\|^2}{2}$.
Determining the Optimal Separating Hyperplane
• Finding the optimal separating hyperplane turns out to be an optimization problem of the following form:

$\min_{w, b} \frac{1}{2}\|w\|^2$
subject to $y^i (\langle w, x^i \rangle + b) \geq 1, \quad i = 1, 2, \dots, M$

(the constraints $\langle w, x^i \rangle + b \geq +1$ when $y^i = +1$ and $\langle w, x^i \rangle + b \leq -1$ when $y^i = -1$ combine into this single condition)

• N+1 parameters (N: dimension of the data)
• M constraints (M: number of datapoints)
• It is called the primal problem.
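The primal problem can be handed directly to a generic convex solver. A minimal sketch using cvxpy on made-up linearly separable data (an illustration only, not the specialized solvers used in practice):

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable data: two clouds, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

# Primal: minimize (1/2)||w||^2 subject to y_i (<w, x_i> + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()

print("w =", w.value, " b =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```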
Determining the Optimal Separating Hyperplane
Rephrase the minimization-under-constraints problem in terms of the Lagrange multipliers $\alpha_i$, $i = 1, \dots, M$ (M: number of datapoints), one for each of the inequality constraints; this leads to the dual problem. The Lagrangian is:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^M \alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right), \quad \text{with } \alpha_i \geq 0$

(Minimization of a convex function under linear constraints through Lagrange multipliers gives the globally optimal solution.)
Determining the Optimal Separating Hyperplane
The solution of this problem is found by maximizing over $\alpha$ and minimizing over $w$ and $b$:

$\max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha)$

where $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^M \alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right)$
Determining the Optimal Separating Hyperplane
Requiring that the gradient of $L$ with respect to $w$ vanishes:

$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^M \alpha_i y^i x^i$

The vector defining the hyperplane is determined by the training points. Note that while $w$ is unique (minimization of a convex function), the $\alpha_i$ are not unique.
Determining the Optimal Separating Hyperplane
Requiring that the gradient of $L$ with respect to $b$ vanishes:

$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^M \alpha_i y^i = 0$

This requires at least one datapoint in each class.
Determining the Optimal Separating Hyperplane
Differentiating $L$ with respect to the multipliers $\alpha_i$ recovers the original constraints:

$\frac{\partial L(w, b, \alpha)}{\partial \alpha_i} \leq 0 \;\Rightarrow\; y^i (\langle w, x^i \rangle + b) \geq 1$
Determining the Optimal Separating Hyperplane
Complete optimization problem, summarized by the Karush-Kuhn-Tucker (KKT) conditions:

$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^M \alpha_i y^i x^i$ (stationarity)

$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^M \alpha_i y^i = 0$ (stationarity)

$y^i (\langle w, x^i \rangle + b) \geq 1, \quad i = 1, \dots, M$ (primal feasibility)

$\alpha_i \geq 0, \quad i = 1, \dots, M$ (dual feasibility)

$\alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right) = 0, \quad i = 1, \dots, M$ (complementarity conditions)

The feasibility conditions from KKT require $\alpha_i \geq 0$ for all points $x^i$ so that the primal constraints $y^i (\langle w, x^i \rangle + b) \geq 1$ are satisfied: the points are correctly classified.

The $\alpha_i$, $i = 1, \dots, M$, determine the solution $w$:
- all pairs of datapoints $(x^i, y^i)$ for which $\alpha_i > 0$ are the support vectors;
- all pairs of datapoints $(x^i, y^i)$ for which $\alpha_i = 0$ are "irrelevant" when computing the margin.
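The dual problem implied by these conditions can also be solved with a generic solver. A minimal sketch with cvxpy (the toy data matches the primal sketch above; the 1e-6 threshold for selecting support vectors is an arbitrary numerical choice):

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable data, as in the primal sketch.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

M = len(y)
P = np.outer(y, y) * (X @ X.T)   # P_ij = y_i y_j <x_i, x_j>

# Dual: maximize sum(alpha) - (1/2) alpha^T P alpha, s.t. alpha >= 0, sum_i alpha_i y_i = 0.
alpha = cp.Variable(M)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(P)))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X                   # stationarity: w = sum_i alpha_i y_i x_i
sv = a > 1e-6                     # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # complementarity: y_i (<w, x_i> + b) = 1 on SVs
print("support vectors:", int(sv.sum()), " b =", float(b))
```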
Determining the Optimal Separating Hyperplane
Consider three cases for a datapoint $x^i$, relative to the hyperplanes $\{x : \langle w, x \rangle + b = 0\}$ and $\{x : \langle w, x \rangle + b = \pm 1\}$:

1. $y^i (w^T x^i + b) = 1$: the point lies exactly on the margin.
2. $y^i (w^T x^i + b) > 1$, i.e. $w^T x^i + b < -1$ for $y^i = -1$ or $w^T x^i + b > 1$ for $y^i = +1$: the point lies outside the margin.
3. $y^i (w^T x^i + b) < 1$: the point lies inside the margin and does not satisfy the constraint.
Determining the Optimal Separating Hyperplane
Since $w = \sum_{i=1}^M \alpha_i y^i x^i$, the decision function can be expressed in terms of the support vectors:

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle x^i, x \rangle + b \right)$

Use the complementarity condition $\alpha_i \left( y^i (\langle w, x^i \rangle + b) - 1 \right) = 0$ to compute $b$.

To determine how good the hyperplane is, use cross-validation to get an estimate of the error on the testing set.
Non-Separable Data Sets
(Figure: a dataset with labels $+1$ and $-1$ that no linear boundary separates.)
This is going to be a problem! What should we do?
Idea: introduce some slack on the constraints.
Support Vector Machine for non-separable datasets
The constraints are relaxed by introducing slack variables $\xi_i \geq 0$, $i = 1, \dots, M$:

$\langle w, x^i \rangle + b \geq +1 - \xi_i \quad \text{if } y^i = +1$
$\langle w, x^i \rangle + b \leq -1 + \xi_i \quad \text{if } y^i = -1$

(Figure: three in-margin or misclassified points with slacks $\xi_1$, $\xi_2$, $\xi_3$.)

This can again be expressed through a compact notation:

$y^i (\langle w, x^i \rangle + b) \geq 1 - \xi_i$
Support Vector Machine for non-separable datasets
The objective function adds a penalty for too-large slack variables:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$

This finds a trade-off between maximizing the margin and minimizing the classification errors. $C > 0$ weights the influence of the penalty term.
Support Vector Machine for non-separable datasets
Optimization under constraints problem of the form:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$
subject to $y^j (\langle w, x^j \rangle + b) \geq 1 - \xi_j$ and $\xi_j \geq 0$, $j = 1, \dots, M$.

The hyperplane has the same form of solution as in the separable case:

$w = \sum_{j=1}^M \alpha_j y^j x^j$
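In practice this QP is solved by library code. A minimal sketch with scikit-learn's SVC on made-up overlapping data; note that scikit-learn minimizes $\frac{1}{2}\|w\|^2 + C \sum_j \xi_j$ without the $1/M$ factor used on the slides, so its C differs by a scaling:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up overlapping data: two Gaussian clouds, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin linear SVM; C trades margin width against slack penalties.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors per class:", clf.n_support_)
```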
Support Vector Machine for non-separable datasets
(Figure: the hyperplanes $w^T x + b = +1$ and $w^T x + b = -1$, the normal vector $w$, and the support vectors highlighted.)
The decision boundary is determined only by the support vectors:

$w = \sum_{i=1}^M \alpha_i y^i x^i$

$\alpha_i = 0$ for non-support vectors; $\alpha_i > 0$ for the support vectors, for which $y^i (w^T x^i + b) - 1 = 0$ when $\xi_i = 0$.
Non-Linear Classification
What if the points in the input space cannot be separated
by a linear hyperplane?
The Kernel Trick for SVM
As usual, observe that the decision function of the linear SVM computes inner products between pairs of observations, $\langle x^i, x^j \rangle$:

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle x^i, x \rangle + b \right)$
Non-Linear Classification with Support Vector Machines
The decision function in linear classification with SVM was given by:

$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle x^i, x \rangle + b \right)$

Replace the linear plane $\langle w, x \rangle + b$ in the original space by a linear plane in feature space: $x \to \phi(x)$ and $\langle w, \phi(x) \rangle + b$, where $w$ now lives in feature space! The decision function becomes:

$f(x) = \mathrm{sgn}(\langle w, \phi(x) \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i \langle \phi(x^i), \phi(x) \rangle + b \right)$

Use the kernel trick by exploiting the fact that the decision function depends only on a dot product in feature space:

$k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$

so that

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i\, k(x^i, x) + b \right)$
Non-Linear Classification with Support Vector Machines
The optimization problem in feature space becomes (the kernel appears in both the objective and the decision function):

$\max_{\alpha} L(\alpha) = \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M \alpha_i \alpha_j y^i y^j\, k(x^i, x^j)$
subject to $\alpha_i \geq 0$ for all $i = 1, \dots, M$ and $\sum_{i=1}^M \alpha_i y^i = 0$.

The decision function in feature space is computed as follows:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i\, k(x^i, x) + b \right)$
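A minimal sketch of the kernel trick in action, using scikit-learn's SVC with an RBF kernel on concentric-circles data that no linear hyperplane in input space separates (gamma and C are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Data that no linear hyperplane separates in input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
y = 2 * y - 1  # relabel {0, 1} -> {-1, +1} to match the slides' convention

# RBF-kernel SVM: the separating plane is linear in feature space only.
clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

# The decision function is sum_i alpha_i y_i k(x_i, x) + b over the support vectors.
print("train accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_))
print("decision values for 3 points:", clf.decision_function(X[:3]))
```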
Non-Linear Classification with Support Vector Machines
The offset $b$ can be estimated from the KKT conditions. Best is to compute the expectation over the constraints and hence to compute an estimate of $b$ through regression:

$b = \frac{1}{M} \sum_{j=1}^M \left( y^j - \sum_{i=1}^M \alpha_i y^i\, k(x^i, x^j) \right)$
How to read out the result of SVM
(Figures: a trained SVM shown with the hyperplane, the support vectors, and a color gradient encoding the distance to the hyperplane; a second figure also highlights the margin.)
The hyperparameters of SVM
SVM decision function:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i y^i\, k(x^i, x) + b \right)$

The kernel has several open parameters (hyperparameters) that need to be determined before running SVM:

Gaussian kernel: $k(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$ (the kernel width $\sigma$)

Inhomogeneous polynomial kernel: $k(x, x') = \left( \langle x, x' \rangle + d \right)^p$, $p \in \mathbb{N}$, $d \geq 0$ (the order $p$ of the polynomial; usually $d = 1$)
The hyperparameters of SVM
Recall the optimization under constraints problem:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$
subject to $y^j (\langle w, x^j \rangle + b) \geq 1 - \xi_j$ and $\xi_j \geq 0$, $j = 1, \dots, M$.

$C$, which determines the cost associated with incorrectly classified datapoints, is an open parameter of the optimization function.
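Both open parameters are typically set by cross-validation. A minimal sketch with scikit-learn's GridSearchCV; the grids are arbitrary illustrative choices, and scikit-learn parameterizes the Gaussian kernel width through gamma:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

# Cross-validated grid search over the penalty C and the kernel width (gamma).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100, 1000], "gamma": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)
print("best hyperparameters:", grid.best_params_, " cv accuracy:", grid.best_score_)
```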
Non-Linear Support Vector Machines: Examples

Effect of the penalty factor C
(Figures: RBF kernel width 0.20. With C=1000, several datapoints are misclassified; with C=2000, fewer datapoints are misclassified.)

Effect of the width of the Gaussian kernel
(Figures: C=1000. With kernel width 0.001, 113 support vectors out of 345 datapoints; with width 0.008, 64 support vectors; with width 0.02, 33 support vectors.)
Different optimization runs end up with different solutions: several combinations of the support vectors yield the same optimum.
Support Vector Machine for non-separable datasets
The original objective function was:

$\min_{w, \xi} \frac{1}{2}\|w\|^2 + \frac{C}{M} \sum_{j=1}^M \xi_j$

Determining $C$ may be difficult in practice.
Support Vector Machine for non-separable datasets
ν-SVM is an alternative that automatically optimizes for the best trade-off between model complexity (the largest margin) and the penalty on the error:

$\min_{w, \xi, \rho} \frac{1}{2}\|w\|^2 - \nu \rho + \frac{1}{M} \sum_{i=1}^M \xi_i$
subject to $y^i (\langle w, x^i \rangle + b) \geq \rho - \xi_i$, and $\xi_i \geq 0$, $\rho \geq 0$.
Support Vector Machine for non-separable datasets
ν is an upper bound on the fraction of margin errors (i.e. the fraction of datapoints misclassified within the margin) and a lower bound on the fraction of support vectors.
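These bounds can be checked empirically. A minimal sketch with scikit-learn's NuSVC (data and kernel are arbitrary illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import NuSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
    # nu lower-bounds the fraction of support vectors (and upper-bounds margin errors).
    frac_sv = len(clf.support_) / len(X)
    print(f"nu={nu}: fraction of support vectors = {frac_sv:.2f}")
```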
Support Vector Machine for non-separable datasets
(Figures: ν-SVM with RBF kernel width 0.1. With ν=0.001 there are few support vectors; the number of support vectors increases with ν=0.2 and again with ν=0.9, and the number of margin errors increases likewise.)
Summary: When to use SVM?
SVMs have a wide range of applications for all types of data (vision, text, handwriting, etc.).
SVM is very powerful for large-scale classification, with optimized solvers for the training stage, and it is rapid during recall.
One issue is that the computation at recall grows linearly with the number of support vectors, and the algorithm is not very sparse in support vectors.
Another issue is that it can predict only two classes. For multi-class classification, one needs to run several two-class classifiers.
Multi-Class SVM
(Figure: three classes, e.g. children, female adults, and male adults, with one classifier per class: f1, f2, f3.)
Construct a set of K binary classifiers $f^1, \dots, f^K$, each trained to separate one class from the rest.
Compute the class label in a winner-take-all approach:

$\hat{y}(x) = \arg\max_{i = 1, \dots, K} \left( \sum_{j=1}^M \alpha_j^i y^j\, k(x^j, x) + b^i \right)$

It is sufficient to compute only K-1 classifiers for K classes, but computing the K-th classifier may provide tighter bounds on the K-th class.
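A minimal sketch of the one-vs-rest scheme with scikit-learn; the blob data stands in for the children / female adult / male adult example, and the parameters are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three classes, as in the slides' children / female adult / male adult example.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# One binary SVM per class (one vs. the rest), winner-take-all on decision values.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=10.0)).fit(X, y)

scores = ovr.decision_function(X[:5])  # one column per binary classifier f^i
print("argmax:", np.argmax(scores, axis=1), " predict:", ovr.predict(X[:5]))
```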
Multi-Class SVM
Drawbacks of combining multiple binary classifiers:
- How to reject an instance if it belongs to none of the classes (garbage model, or a threshold on the minimum of the associated classifier functions; but it is difficult to compare the scales of the different classifier functions)
- Asymmetric classification (some classes have many more positive examples than others); one can adjust the C penalty to give each class a relative influence as a function of its number of patterns, e.g. C = 10*M.
An alternative is to compute all the classes as part of one optimization function. Performance seems, however, comparable to one-against-all optimization (Franc & Hlavac, Multi-class support vector machine, 2002; J. Weston and C. Watkins, Multi-class support vector machines, 1998).
Relevance Vector Machine
Even though SVM usually results in a relatively small number of support vectors compared to the total number of datapoints, nothing ensures that we obtain a sparse solution.
SVM also requires finding hyperparameters ($C$ and the kernel width $\sigma$) and a special form for the basis function (the kernel must satisfy the Mercer conditions).
RVM relaxes these two assumptions by taking a Bayesian approach.
Relevance Vector Machine
Start from the solution of SVM and rewrite it as a linear combination over M basis functions:

$y(x) = f(x) = \mathrm{sgn}\left( \sum_{i=1}^M \alpha_i\, k(x, x^i) + b \right), \quad \alpha = (\alpha_1, \dots, \alpha_M)^T$

A sparse solution has the majority of the entries in $\alpha$ equal to zero, e.g. $\alpha = (1, 0, 0, \dots, 1, 0)^T$.
Relevance Vector Machine
Model the distribution of the class label with a Bernoulli distribution, replacing the sign function with the continuous but steep sigmoid function $g(z) = 1 / (1 + e^{-z})$:

$p(y \mid x; \alpha) = g\left( \sum_{i=1}^M \alpha_i\, k(x, x^i) + b \right)^y \left( 1 - g\left( \sum_{i=1}^M \alpha_i\, k(x, x^i) + b \right) \right)^{1 - y}$

Doing maximum likelihood would lead to overfitting, as we have more parameters than datapoints. Instead, approximate the distribution of the $\alpha_i$ with a probability density function, which reduces the number of parameters to estimate.
The optimal $\alpha$ cannot be computed in closed form; one must use an iterative method similar to expectation-maximization (see Tipping 2001, supplementary material).
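A hypothetical illustration of this likelihood model (a sketch only, not Tipping's full training algorithm; the kernel width, data, and the chosen sparse alpha are made up):

```python
import numpy as np

def rbf_kernel(X, Xi, width=0.5):
    """Gaussian basis functions k(x, x_i) = exp(-||x - x_i||^2 / (2 width^2))."""
    d2 = ((X[:, None, :] - Xi[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

def log_likelihood(alpha, b, X, y, width=0.5):
    """Sum over datapoints of log p(y|x) = y log g(z) + (1 - y) log(1 - g(z))."""
    z = rbf_kernel(X, X, width) @ alpha + b
    g = 1.0 / (1.0 + np.exp(-z))  # sigmoid replaces the sign function
    return np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels in {0, 1} for the Bernoulli model
alpha = np.zeros(30)
alpha[[3, 17]] = 1.0  # a sparse alpha: most entries are exactly zero
print("log-likelihood:", log_likelihood(alpha, 0.0, X, y))
```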
Relevance Vector Machine
(Figure: effect of a fit parameter in $[0, 1]$. This parameter determines how good the fit is: the smaller the value, the more closely the true parameters are fitted.)
Relevance Vector Machine
(Figures comparing sparsity on the same data: RVM with l=0.04 and kernel width 0.01 retains 19 support vectors out of 477 datapoints, while SVM with C=1000 and kernel width 0.01 retains 51 support vectors.)
Pros and Cons of RVM
Pros:
- Yields sparser solutions (fewer support vectors)
- Outputs estimates of the posterior probability of class membership (uncertainty on the prediction of the class label)
Cons:
- Iterative method (needs a stopping criterion and an arbitrary value for the precision)
- Very intensive computationally at training (memory and number of iterations grow with the square and the cube of the number of basis functions, i.e. with the number of datapoints)