TRANSCRIPT
1
SUPPORT VECTOR MACHINES
İsmail GÜNEŞ
2
What is SVM?
A new generation learning system, based on recent advances in statistical learning theory.
It uses:
a hypothesis space of linear functions,
a high-dimensional feature space,
optimisation theory,
statistical learning theory.
3
Features of SVM
Invented by Vapnik.
Simple, geometric, and always trained to find the global optimum.
Used for pattern recognition, regression, and linear operator inversion.
Considered too slow at the beginning; for most applications this problem has now been overcome.
4
Features of SVM (Cont’d)
Based on a simple idea.
High performance in practical applications.
Can deal with complex nonlinear problems, yet works with a simple linear algorithm.
5
The main idea of SVMs:
Find the optimal hyperplane for linearly separable patterns!
Extend to patterns that are not linearly separable!
6
Separating Line (or Hyperplane)
Goal: Find the best line (or hyperplane) to separate the training data. How to formalize?
[Figure: Class 1 and Class -1 points separated by a line.]
In two dimensions, the equation of the line is given by:
w1 x + w2 y = b
Better notation for n dimensions:
Σ_i w_i x_i = b, i.e. w · x = b
7
Simple Classifier
The simple classifier:
Points that fall on the right are classified as “1”.
Points that fall on the left are classified as “-1”.
Using the training set, find a hyperplane (line) so that
w · x_i ≥ b for class 1
w · x_i < b for class -1
where w is the weight vector, x is the input vector, and b is the bias.
How can we improve this simple classifier?
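As a sketch, the decision rule above takes only a few lines of Python; the weight vector, bias, and sample points below are made-up values for illustration:

```python
# A minimal sketch of the simple classifier above.
# The weight vector w, bias b, and the sample points are made-up values.

def classify(w, x, b):
    """Return 1 if w . x >= b (right of the plane), else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= b else -1

w = [1.0, 1.0]   # assumed weight vector
b = 1.0          # assumed bias

print(classify(w, [2.0, 2.0], b))   # -> 1  (falls on the right)
print(classify(w, [0.0, 0.0], b))   # -> -1 (falls on the left)
```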
8
Finding the Best Plane
Which of the following two planes is better?
[Figure: Class 1 and Class -1 points with two candidate separating planes.]
The green plane is the better choice, since it is more likely to do well on future test data.
9
Separating the Planes
Construct the bounding planes:
Draw two planes parallel to the classification plane.
Push them as far apart as possible, until they hit data points.
The classification plane whose bounding planes are furthest apart is the best one.
Classification plane: w · x = b
Bounding planes: w · x = b + 1 and w · x = b − 1
[Figure: Class 1 and Class -1 points with the classification plane and its two bounding planes.]
10
Finding the Best Plane (Cont’d)
All points in class 1 should be to the right of bounding plane 1:
w · x_i ≥ b + 1
All points in class -1 should be to the left of bounding plane -1:
w · x_i ≤ b − 1
y_i is +1 or -1 depending on the classification, so the above two inequalities can be written as one:
y_i (w · x_i − b) ≥ 1
The distance between the bounding planes should be maximized. That distance is
2 / ||w|| = 2 / √(w_1² + w_2² + … + w_n²)
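These two formulas can be checked numerically; w, b, and the labeled points below are assumed toy values, not a trained model:

```python
import math

# Illustrative check of the margin formula and the combined constraint.
# w, b, and the labeled points are assumed toy values, not a trained model.

def margin_width(w):
    """Distance between the two bounding planes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraint(w, b, x, y):
    """Check y * (w . x - b) >= 1 for one labeled point."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) - b) >= 1

w, b = [0.5, 0.5], 0.0
points = [([2.0, 2.0], 1), ([-2.0, -2.0], -1)]

print(margin_width(w))                                            # 2/||w|| = 2*sqrt(2)
print(all(satisfies_constraint(w, b, x, y) for x, y in points))   # True
```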
11
The Optimization Problem
Mathematical techniques find the hyperplane that optimizes this measure (maximizes the distance).
This is a mathematical program: an optimization problem subject to constraints.
More specifically, it is a quadratic program.
There are high-powered software tools for solving this kind of problem (both commercial and academic).
12
Data Which Is Not Linearly Separable
What if a separating plane does not exist?
[Figure: Class 1 and Class -1 points, with one point on the wrong side of the plane marked “error”.]
Take the original inequality and add a slack variable ξ_i to measure the error:
y_i (w · x_i − b) ≥ 1 − ξ_i
Find the plane that maximizes the margin and minimizes the errors on the training points.
13
The Support Vector Machine
Push the planes apart and minimize the error at the same time:
min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^{m} ξ_i
such that y_i (w · x_i − b) ≥ 1 − ξ_i
C is a positive number chosen to balance these two goals.
This problem is called a Support Vector Machine, or SVM.
The SVM is one of many techniques for doing supervised machine learning.
Others: neural networks, decision trees, k-nearest neighbor.
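Real solvers treat this as a quadratic program, as the next slide notes; still, the same objective can be attacked with plain subgradient descent on the hinge-loss form. The toy data, C, learning rate, and epoch count below are all assumptions for illustration:

```python
# Rough subgradient-descent sketch of the soft-margin objective above:
#   minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i - b)).
# The toy data, C, learning rate, and epoch count are assumed values.

def train_svm(points, C=1.0, lr=0.01, epochs=500):
    n = len(points[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in points:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            if margin < 1:
                # hinge term active: grad_w = w - C*y*x, grad_b = C*y
                w = [wi - lr * (wi - C * y * xi) for wi, xi in zip(w, x)]
                b -= lr * C * y
            else:
                # only the regularizer (1/2)||w||^2 contributes: grad_w = w
                w = [wi - lr * wi for wi in w]
    return w, b

points = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([-2.0, -2.0], -1), ([-1.0, -3.0], -1)]
w, b = train_svm(points)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) - b >= 0 else -1
               for x, _ in points]
print(predictions)   # should match the labels [1, 1, -1, -1]
```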
14
Terminology
Points that touch a bounding plane, or lie on the wrong side of it, are called support vectors.
If all points except the support vectors were removed, the solution would be the same.
The support vectors are the points that are most difficult to classify.
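The definition can be sketched directly in code; the plane (w, b) and the points below are illustrative values, not a trained model:

```python
# Sketch of the definition above: support vectors are the points with
# y_i (w . x_i - b) <= 1 (touching a bounding plane or on the wrong side).
# The plane (w, b) and the points are illustrative values.

def support_vectors(points, w, b, tol=1e-9):
    sv = []
    for x, y in points:
        if y * (sum(wi * xi for wi, xi in zip(w, x)) - b) <= 1 + tol:
            sv.append(x)
    return sv

w, b = [1.0, 0.0], 0.0             # plane x1 = 0, bounding planes x1 = +/- 1
points = [([1.0, 0.0], 1),         # on the bounding plane        -> support vector
          ([3.0, 0.0], 1),         # far inside its side          -> not a support vector
          ([-1.0, 2.0], -1)]       # on the other bounding plane  -> support vector

print(support_vectors(points, w, b))   # -> [[1.0, 0.0], [-1.0, 2.0]]
```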
15
What About Nonlinear Surfaces?
Some datasets may not be best separated by a plane.
First idea (simple and effective): map each data point into a higher-dimensional space, and find a linear fit there, e.g. a quadratic solution.
Problem: if the dimensionality of that space is high, this means lots of calculations.
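A minimal sketch of such a mapping, using a standard degree-2 feature map for 2-D inputs; the specific map and input are illustrative choices, not something fixed by the slides:

```python
import math

# Sketch of the "map to a higher-dimensional space" idea using a standard
# degree-2 feature map for 2-D inputs (an illustrative choice):
#   phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
# A dataset separable only by a circle in 2-D becomes linearly separable in
# this 3-D space, but the feature count grows quickly with dimension/degree.

def phi(x):
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

print(phi([1.0, 2.0]))   # -> [1.0, 2.828..., 4.0]
```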
16
Solution
Nonlinear surfaces can be used without these problems through the use of a kernel function.
The kernel function specifies a similarity measure between two vectors.
17
Solution (Cont’d)
The only way in which the data appears in the training problem is in the form of dot products x_i · x_j.
First map the data to some other (possibly infinite-dimensional) space H using a mapping Φ.
The training algorithm then depends on the data only through dot products in H: Φ(x_i) · Φ(x_j).
If there is a kernel function K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), we would only need to use K in the training algorithm and would never need to know Φ explicitly.
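This identity can be checked numerically for one concrete case, the degree-2 polynomial kernel K(x, z) = (x · z)², whose explicit feature map in two dimensions is φ(x) = (x1², √2·x1·x2, x2²). The kernel choice and test vectors are illustrative, not taken from the slides:

```python
import math

# Check of K(x_i, x_j) = phi(x_i) . phi(x_j) for the degree-2 polynomial
# kernel K(x, z) = (x . z)^2 in two dimensions, whose explicit feature map
# is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). The vectors are arbitrary examples.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x, z):
    return dot(x, z) ** 2          # computed in the original 2-D space

def phi(x):
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
print(poly2_kernel(x, z))          # -> 1.0, without ever forming phi
print(dot(phi(x), phi(z)))         # same value, up to floating-point rounding
```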
18
SVM Applications
Pattern recognition:
handwriting recognition
3D object recognition
speaker identification
face detection
text categorization
bio-informatics
Regression estimation.
Density estimation.
More…
19
Conclusions
SVMs give good performance in a variety of applications such as pattern recognition, regression estimation, time series prediction, etc.
Some open issues:
Considered too slow at the beginning; this problem is now solved.
The choice of kernel function: there are no guidelines.
In most cases, SVM generalizes better than other competing methods (it has held the record for the lowest handwriting recognition error rate, 0.56%).
20
References
Cristianini, N. and Shawe-Taylor, J., “An Introduction to Support Vector Machines and Other Kernel-based Learning Methods”, 2000. www.support-vector.net
Burges, C. J. C., “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, 1998.