Support Vector Machines - polimi.it
TRANSCRIPT
Prof. Matteo Matteucci
Machine Learning: Support Vector Machines
Discriminative vs. Generative Approaches
o Generative approach: we derived the classifier from some generative hypothesis about the way the data have been generated
Linearity of the log-odds of the posteriors (Logistic Regression)
Multivariate Gaussian class-conditional likelihood (LDA)
o Discriminative approach: find a boundary of a prescribed form (e.g., a linear separating boundary) able to reduce the classification error
Define a discriminating function instead of making assumptions on the data distribution
o Example: find the hyperplane which separates the data points in the best way (e.g., with the minimum error rate)
From ESL
Perceptron
o A perceptron computes the value of a weighted sum and returns its sign (the name dates back to the 1950s literature on neural networks); see the sketch after this list
o Basically, a perceptron is a linear classifier for which:
We do not assume any particular probabilistic model generating the data
We learn the parameters using some optimization technique so as to minimize an error function (e.g., the error rate)
We cannot infer the role of the individual variables in the model from the values of the weights
o Inference in discriminative models is quite complex, and it is usually not the goal anyway; the goal is prediction.
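A minimal sketch, not from the slides, of what a perceptron computes: the sign of a weighted sum. The weights and inputs below are illustrative assumptions.

```python
import numpy as np

def perceptron_predict(X, beta, beta0):
    """Return the +1/-1 labels: sign(x^T beta + beta0) for each row of X."""
    return np.sign(X @ beta + beta0)

X = np.array([[1.0, 2.0],
              [-1.0, -0.5]])
print(perceptron_predict(X, beta=np.array([0.5, -0.2]), beta0=0.1))  # [ 1. -1.]
```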
An Example: Simulated Data
[Figure: simulated two-class data with the LDA solution and several perceptron solutions]
Hyperplanes: Linear Algebra
o Let us consider the hyperplane (affine set) $L$ in $\mathbb{R}^2$ defined by $f(x) = \beta_0 + \beta^T x = 0$
Any two points $x_1$ and $x_2$ on $L$ have $\beta^T (x_1 - x_2) = 0$
The unit vector normal to the surface $L$ is $\beta^* = \beta / \|\beta\|$
For any point $x_0$ in $L$ we have $\beta^T x_0 = -\beta_0$
The signed distance of any point $x$ to $L$ is defined by $\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\left(\beta^T x + \beta_0\right)$
Thus $f(x)$ is proportional to the signed distance of $x$ from the plane defined by $f(x) = 0$
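The one-line derivation behind this annotation, following the standard treatment in ESL (the gradient of $f$ is $\beta$, so $\|f'(x)\| = \|\beta\|$):

```latex
\beta^{*T}(x - x_0)
  = \frac{\beta^T x - \beta^T x_0}{\|\beta\|}
  = \frac{\beta^T x + \beta_0}{\|\beta\|}
  = \frac{f(x)}{\|f'(x)\|}
```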
Perceptron Learning Algorithm (1/2)
o The error function for perceptron learning is the distance of the misclassified points from the decision boundary
The output is coded as +1/-1
If an output which should be $+1$ is misclassified, then $x_i^T \beta + \beta_0 < 0$
For an output with $-1$ we have the opposite
o The goal becomes minimizing the perceptron criterion $D(\beta, \beta_0)$ shown below
This is non-negative and proportional to the distance of the misclassified points from the decision boundary $x^T \beta + \beta_0 = 0$
$\mathcal{M}$ denotes the set of misclassified points
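The criterion itself, reconstructed from ESL since the slide formula did not survive extraction:

```latex
D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i \left( x_i^T \beta + \beta_0 \right) \;\ge\; 0
```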
Perceptron Learning Algorithm (2/2)
o Minimize the error function by stochastic gradient descent
The gradients with respect to the model parameters are $\partial D / \partial \beta = -\sum_{i \in \mathcal{M}} y_i x_i$ and $\partial D / \partial \beta_0 = -\sum_{i \in \mathcal{M}} y_i$
Stochastic gradient descent applies, for each misclassified point, the update $(\beta, \beta_0) \leftarrow (\beta + \rho\, y_i x_i,\; \beta_0 + \rho\, y_i)$; a code sketch follows
o If the data are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps
$\rho$ is the learning rate
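A minimal sketch, not from the slides, of perceptron learning by stochastic gradient descent on $D(\beta, \beta_0)$; the stopping rule and default learning rate are illustrative assumptions.

```python
import numpy as np

def perceptron_fit(X, y, rho=1.0, max_epochs=1000):
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:   # point is misclassified
                beta = beta + rho * yi * xi     # beta  <- beta  + rho * y_i * x_i
                beta0 = beta0 + rho * yi        # beta0 <- beta0 + rho * y_i
                mistakes += 1
        if mistakes == 0:                       # separable data: finite convergence
            break
    return beta, beta0
```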
An Example: Simulated Data
[Figure: the LDA solution and several perceptron solutions]
Is there any optimal plane?
An Example: Simulated Data
[Figure: the maximum margin classifier ...]
Separable Case Formulation (1)
o Maximize the margin $M$ (see the formulation below)
o The two constraints respectively:
select one of the equivalent separating hyperplanes (via $\|\beta\| = 1$)
ensure each point is on the right side of the margin
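The optimization problem, reconstructed from ESL since the slide formulas did not survive extraction; $\|\beta\| = 1$ is the constraint that picks one representative among the equivalent hyperplanes:

```latex
\max_{\beta,\, \beta_0,\, \|\beta\| = 1} \; M
\quad \text{subject to} \quad
y_i \left( x_i^T \beta + \beta_0 \right) \ge M, \qquad i = 1, \dots, N
```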
From ESL
Separable Case Formulation (2)
o We can remove the constraint on the parameter norm by changing the margin constraint to $\frac{1}{\|\beta\|}\, y_i (x_i^T \beta + \beta_0) \ge M$
which becomes $y_i (x_i^T \beta + \beta_0) \ge M \|\beta\|$
o If we redefine $\|\beta\| = 1/M$ we obtain the equivalent problem shown below
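The equivalent problem, in the standard ESL form:

```latex
\min_{\beta,\, \beta_0} \; \frac{1}{2} \|\beta\|^2
\quad \text{subject to} \quad
y_i \left( x_i^T \beta + \beta_0 \right) \ge 1, \qquad i = 1, \dots, N
```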
o Which is a “simple” convex optimization problem
(quadratic with linear constraints)
Separable Case Solution (1)
o We can solve the constrained quadratic problem by the use of Lagrange multipliers
Setting the derivatives to zero we obtain the stationarity conditions
And by substitution the so-called Wolfe dual
Can you derive it?
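A sketch of the derivation the slide asks for, following the standard ESL treatment:

```latex
% Primal Lagrangian, to be minimized w.r.t. beta and beta_0:
L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - 1 \right]
% Setting the derivatives to zero:
\beta = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^{N} \alpha_i y_i
% Substituting back gives the Wolfe dual, maximized subject to alpha_i >= 0:
L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} \alpha_i \alpha_k y_i y_k \, x_i^T x_k
```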
Separable Case Solution (2)
o The Karush-Kuhn-Tucker complementarity conditions must hold too: $\alpha_i \left[ y_i (x_i^T \beta + \beta_0) - 1 \right] = 0 \;\; \forall i$
If $\alpha_i > 0$ then we have $y_i (x_i^T \beta + \beta_0) = 1$, i.e., $x_i$ lies on the boundary of the margin (it is a support vector)
If $y_i (x_i^T \beta + \beta_0) > 1$, this means $x_i$ is away from the margin boundary and $\alpha_i = 0$
o The final output comes from $G(x) = \operatorname{sign}\left( x^T \hat{\beta} + \hat{\beta}_0 \right)$ with $\hat{\beta} = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i$
Non Separable Case Formulation (1)
o Maximize the margin $M$
o To accommodate "errors" we use some extra slack variables $\xi_i$ (see below) ...
This gives a convex
optimization problem
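The relaxed constraints, reconstructed from ESL; the second (relative) form of the slack is the one that leads to the convex problem mentioned in the margin note:

```latex
% Two natural relaxations of the margin constraint:
y_i (x_i^T \beta + \beta_0) \ge M - \xi_i
\quad \text{or} \quad
y_i (x_i^T \beta + \beta_0) \ge M (1 - \xi_i),
\qquad \xi_i \ge 0, \quad \sum_{i=1}^{N} \xi_i \le K
```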
Non Separable Case Formulation (2)
o We can remove the constraint on the parameter norm, obtaining $y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i$
o This can be rewritten in the penalized form shown below
o Solved by Lagrange multipliers as for the separable case ...
From ESL
User-defined cost $C$; $C = \infty$ corresponds to the separable case ...
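The resulting soft-margin problem, in the standard ESL form:

```latex
\min_{\beta,\, \beta_0} \; \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
\xi_i \ge 0, \quad y_i (x_i^T \beta + \beta_0) \ge 1 - \xi_i \;\; \forall i
```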
Non Separable Case Solution (1)
o The primal Lagrange function is shown below
Setting the derivatives to zero gives the stationarity conditions
with the positivity constraints $\alpha_i, \mu_i, \xi_i \ge 0 \;\; \forall i$
o By substitution we obtain the Lagrangian (Wolfe) dual function ...
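The primal Lagrangian and its stationarity conditions, reconstructed from ESL since the slide formulas did not survive extraction:

```latex
L_P = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{N} \xi_i
      - \sum_{i=1}^{N} \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \right]
      - \sum_{i=1}^{N} \mu_i \xi_i
% Setting the derivatives w.r.t. beta, beta_0 and xi_i to zero:
\beta = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad
0 = \sum_{i=1}^{N} \alpha_i y_i, \qquad
\alpha_i = C - \mu_i \;\; \forall i
```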
Non Separable Case Solution (2)
o The dual Lagrange function is given below
Subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
Together with the Karush-Kuhn-Tucker conditions (also reported below)
o Solving the optimization problem we have $\hat{\beta} = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i$, computed using only the support vectors ...
Can you derive it?
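The dual and the Karush-Kuhn-Tucker conditions, reconstructed from ESL as an answer sketch:

```latex
% Wolfe dual, maximized subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0:
L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} \alpha_i \alpha_k y_i y_k \, x_i^T x_k
% Karush-Kuhn-Tucker conditions:
\alpha_i \left[ y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \right] = 0, \qquad
\mu_i \xi_i = 0, \qquad
y_i (x_i^T \beta + \beta_0) - (1 - \xi_i) \ge 0
```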
Non Separable Case Solution (3)
o The solution of the dual optimization problem provides $\hat{\beta} = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i$
o The Karush-Kuhn-Tucker conditions imply:
$\hat{\alpha}_i$ is non-zero only for the support vectors, the points whose margin constraint is active
If $0 < \hat{\alpha}_i < C$, the support vector lies on the margin and $\hat{\xi}_i = 0$
If $\hat{\alpha}_i = C$, we have $\hat{\xi}_i > 0$
Using the margin points we can solve for $\hat{\beta}_0$ (see below)
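Spelling out the last step: for a margin point the constraint is active with zero slack, which pins down $\hat{\beta}_0$ (implementations typically average over all margin points for numerical stability):

```latex
% For any margin point m (0 < alpha_m < C, xi_m = 0):
y_m \left( x_m^T \hat{\beta} + \hat{\beta}_0 \right) = 1
\;\Rightarrow\;
\hat{\beta}_0 = y_m - x_m^T \hat{\beta}
\qquad \text{(using } 1 / y_m = y_m \text{ for } y_m \in \{-1, +1\})
```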
Separable Data vs. Non Separable Data
A Synthetic Example
A Synthetic Example
The smaller the $C$, the bigger the margin
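A minimal sketch, not from the slides, illustrating this annotation with scikit-learn's SVC; the data, seed, and chosen values of C are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy Gaussian blobs, labels in {-1, +1}
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)   # geometric margin width is 2 / ||beta||
    print(f"C={C:>6}: margin={margin:.2f}, support vectors={clf.n_support_.sum()}")
```

Smaller C tolerates more margin violations, so the margin widens and more points become support vectors.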
Support Vector Machines (1)
o Learning the classifier involves the features only through their scalar products (both formulas are reconstructed below)
o We can compute this scalar product in a new (transformed) feature space
o The feature space can grow up to infinite dimension with SVMs ...
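The two formulas the slide's "scalar product" annotations referred to, reconstructed from ESL with a basis expansion $h(\cdot)$; both the dual and the decision function touch the inputs only through scalar products:

```latex
L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N}
      \alpha_i \alpha_k y_i y_k \, \langle h(x_i), h(x_k) \rangle
% Decision function:
f(x) = \sum_{i=1}^{N} \alpha_i y_i \, \langle h(x), h(x_i) \rangle + \beta_0
```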
The Kernel Trick
o What we need for learning and prediction is only the result of the scalar product
o In some cases this scalar product can be written as a kernel, $K(x, x') = \langle h(x), h(x') \rangle$
o Popular choices are listed below
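The popular choices, reconstructed from ESL ($d$, $\gamma$, $\kappa_1$, $\kappa_2$ are user-chosen parameters):

```latex
\text{$d$-th degree polynomial:} \quad K(x, x') = \left( 1 + \langle x, x' \rangle \right)^d
\text{radial basis:} \quad K(x, x') = \exp\left( -\gamma \|x - x'\|^2 \right)
\text{neural network:} \quad K(x, x') = \tanh\left( \kappa_1 \langle x, x' \rangle + \kappa_2 \right)
```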
The Kernel Trick Example
o Consider an input space with two variables and a polynomial kernel of degree 2: $K(x, x') = (1 + \langle x, x' \rangle)^2$
o It turns out that $M = 6$ and, if we choose the basis expansion $h(x)$ shown below
o We obtain $K(x, x') = \langle h(x), h(x') \rangle$
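The worked expansion, reconstructed from ESL:

```latex
K(x, x') = \left( 1 + x_1 x_1' + x_2 x_2' \right)^2
         = 1 + 2 x_1 x_1' + 2 x_2 x_2' + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2'
% Six terms, matched by the M = 6 basis functions
h(x) = \left( 1,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \right)
% so that K(x, x') = <h(x), h(x')>
```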
A Synthetic Example (Non Linear)