Support Vector Machine
[Page 1]
Submitted by: Garisha Chowdhary, MCSE 1st year, Jadavpur University
[Page 2]
A set of related supervised learning methods
A non-probabilistic binary linear classifier
Linear learners like perceptrons, but unlike them they use the concepts of maximum margin, linearization, and kernel functions
Used for classification and regression analysis
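As a concrete illustration (not part of the original slides), here is a minimal sketch of training a linear SVM classifier, assuming the scikit-learn library and its bundled iris dataset:

```python
# Minimal sketch: a linear SVM on a toy binary problem (scikit-learn assumed).
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two of the three iris classes give a simple binary task.
X, y = datasets.load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear")   # non-probabilistic, maximum-margin classifier
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```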
[Page 3]
A good separation
• Map non-linearly separable instances to higher dimensions to overcome linearity constraints
• Select between hyperplanes, using maximum margin as a test
(Figure: three candidate hyperplanes separating Class 1 from Class 2)
[Page 4]
Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class, since the larger the margin, the lower the generalization error (more confident predictions).
(Figure: maximum-margin hyperplane between Class 1 and Class 2)
[Page 5]
Given N samples {(x1,y1), (x2,y2), …, (xN,yN)}
• where yi = +1/−1 are the labels of the data, and xi belongs to Rn

Find a hyperplane wTx + b = 0 such that
• wTxi + b > 0 for all i such that yi = +1
• wTxi + b < 0 for all i such that yi = −1
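A short sketch of this separation test, with an assumed toy dataset and hyperplane (the numbers are illustrative, not from the slides):

```python
# Check the separation condition: all points should satisfy
# y_i * (w^T x_i + b) > 0 for a separating (w, b).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])  # samples
y = np.array([1, 1, -1, -1])                                        # labels
w, b = np.array([1.0, 1.0]), 0.0                                    # assumed hyperplane

separates = np.all(y * (X @ w + b) > 0)
print("hyperplane separates the data:", separates)
```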
Functional Margin
• With respect to a training example, defined by ˆγ(i) = y(i)(wTx(i) + b)
• Want the functional margin to be large, i.e. y(i)(wTx(i) + b) >> 0
• May rescale w and b without altering the decision function, but this multiplies the functional margin by the scale factor
• Allows us to impose a normalization condition ||w|| = 1, i.e. to consider the functional margin of (w/||w||, b/||w||)
• With respect to the training set, defined by ˆγ = mini ˆγ(i)
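A sketch of these definitions in code, on the same illustrative toy data as before; the rescaling behavior described above is checked with an assert:

```python
# Functional margin of each example and of the whole training set.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

gamma_hat_i = y * (X @ w + b)      # per-example functional margin
gamma_hat = gamma_hat_i.min()      # functional margin of the training set
print(gamma_hat_i, gamma_hat)

# Rescaling (w, b) by c > 0 scales the functional margin by c.
c = 10.0
assert np.allclose(y * (X @ (c * w) + c * b), c * gamma_hat_i)
```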
[Page 6]
Geometric Margin
• Defined by γ(i) = y(i)((w/||w||)Tx(i) + b/||w||)
• If ||w|| = 1, the functional margin equals the geometric margin
• Invariant to scaling of the parameters w and b; w may be scaled such that ||w|| = 1
• Also, γ = mini γ(i)

Now, the objective is to:

Maximize γ w.r.t. γ, w, b s.t.
• y(i)(wTx(i) + b) >= γ for all i
• ||w|| = 1

Equivalently, maximize ˆγ/||w|| w.r.t. ˆγ, w, b s.t.
• y(i)(wTx(i) + b) >= ˆγ for all i

Introducing the scaling constraint that the functional margin be 1, the objective function may be further simplified to maximizing 1/||w||, or:

Minimize (1/2)||w||² s.t.
• y(i)(wTx(i) + b) >= 1 for all i
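A sketch of the geometric margin and its scale invariance, on the same illustrative data:

```python
# Geometric margin = functional margin / ||w||, invariant to rescaling (w, b).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

def geometric_margin(w, b):
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

print(geometric_margin(w, b))            # gamma
print(geometric_margin(5 * w, 5 * b))    # unchanged under scaling
```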
[Page 7]
Using the Lagrangian to solve the inequality-constrained optimization problem, we have
L = ½||w||² − Σαi(yi(wTxi + b) − 1)

Setting the gradient of L w.r.t. w and b to 0, we have
w = Σαiyixi for all i , Σαiyi = 0

Substituting w into L, we get the dual problem corresponding to the primal:
maximize W(α) = Σαi − ½ΣΣαiαjyiyjxiTxj , s.t. αi >= 0 , Σαiyi = 0

Solve for α and recover
w = Σαiyixi , b = −(max over i: y(i)=−1 of wTx(i) + min over i: y(i)=+1 of wTx(i)) / 2
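A sketch of solving this dual numerically for a tiny dataset, using scipy's general-purpose SLSQP optimizer rather than a dedicated SVM solver (an assumption for illustration):

```python
# Solve the dual: max W(a) = sum(a) - 1/2 a^T Q a, s.t. a_i >= 0, a . y = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = yi * yj * (xi . xj)

def neg_dual(a):                            # minimize the negative of W(a)
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(4), method="SLSQP",
               bounds=[(0, None)] * 4,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i yi xi
b = -(max(X[y == -1] @ w) + min(X[y == 1] @ w)) / 2
print("alpha:", alpha, "w:", w, "b:", b)
```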
[Page 8]
For conversion of the primal problem to the dual problem, the following Karush-Kuhn-Tucker (KKT) conditions must be satisfied:
• (∂/∂wi)L(w, α) = 0, i = 1, …, n
• αi gi(w,b) = 0, i = 1, …, k
• gi(w,b) <= 0, i = 1, …, k
• αi >= 0, i = 1, …, k

From the KKT complementary slackness condition (the 2nd):
• αi > 0 => gi(w,b) = 0 (active constraint) => (x(i), y(i)) has functional margin 1 (support vectors)
• gi(w,b) < 0 => αi = 0 (inactive constraint; non-support vectors)
(Figure: Class 1 and Class 2 with the support vectors marked on the margin)
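scikit-learn's SVC exposes exactly these quantities, which makes the complementary slackness condition easy to inspect (a sketch, reusing the earlier toy data; a large C approximates the hard margin):

```python
# Support vectors are the points with alpha_i > 0; they sit on the margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print("support vector indices:", clf.support_)
print("alpha_i * y_i:", clf.dual_coef_)

# KKT check: support vectors have functional margin ~ 1.
w = clf.coef_.ravel()
print(y[clf.support_] * (X[clf.support_] @ w + clf.intercept_))
```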
[Page 9]
In the case of non-linearly separable data, mapping the data to a high-dimensional feature space via a non-linear mapping function φ increases the likelihood that the data is linearly separable.
Use a kernel function to simplify computations over the high-dimensional mapped data; it corresponds to the dot product under some non-linear mapping of the data.
Having found αi, classifying a test point x requires only a quantity that depends on the inner product between x and the support vectors.
A kernel function is a measure of similarity between two vectors.
A kernel function is valid if it satisfies Mercer's theorem, which states that the corresponding kernel matrix must be symmetric positive semi-definite (zTKz >= 0 for all z).
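A sketch of this kernelized decision function, f(x) = Σαi yi K(xi, x) + b, with assumed (illustrative) support vectors and multipliers:

```python
# Decision value computed from inner products with support vectors only.
import numpy as np

def linear_kernel(u, v):
    return u @ v

def decision(x, support_X, support_y, support_alpha, b, K=linear_kernel):
    return sum(a * yi * K(xi, x)
               for a, yi, xi in zip(support_alpha, support_y, support_X)) + b

x_test = np.array([1.0, 1.5])
support_X = np.array([[2.0, 2.0], [-2.0, -1.0]])   # assumed support vectors
support_y = np.array([1, -1])
support_alpha = np.array([0.1, 0.1])               # assumed multipliers
print(np.sign(decision(x_test, support_X, support_y, support_alpha, b=0.0)))
```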
[Page 10]
Polynomial kernel with degree d
• K(x,y) = (xTy + 1)^d

Radial basis function kernel with width s
• K(x,y) = exp(−||x−y||²/(2s²))
• Feature space is infinite-dimensional

Sigmoid with parameters k and q
• K(x,y) = tanh(k xTy + q)
• Does not satisfy the Mercer condition for all k and q
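A sketch of these three kernels, plus a numerical Mercer check on a random sample: the eigenvalues of the Gram matrix should be non-negative, up to floating-point error.

```python
# The three kernels above, and a PSD check of an RBF Gram matrix.
import numpy as np

def poly_kernel(x, z, d=2):
    return (x @ z + 1) ** d

def rbf_kernel(x, z, s=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * s ** 2))

def sigmoid_kernel(x, z, k=1.0, q=-1.0):
    return np.tanh(k * (x @ z) + q)       # not Mercer-valid for all k, q

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print("min eigenvalue of RBF Gram matrix:", np.linalg.eigvalsh(K).min())
```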
[Page 11]
High dimensionality doesn't guarantee linear separation; the hyperplane might be susceptible to outliers.
Relax the constraints by introducing 'slack variables' ξi that allow violation of a constraint by a small quantity.
Penalize the objective function for violations.
The parameter C controls the trade-off between the penalty and the margin.
The objective now becomes: min over w, b, ξ of (1/2)||w||² + C Σξi s.t. y(i)(wTx(i) + b) >= 1 − ξi , ξi >= 0
This tries to ensure that most examples have a functional margin of at least 1.
Forming the corresponding Lagrangian, the dual problem now is to: max over α of Σαi − ½ΣΣαiαjyiyjxiTxj , s.t. 0 <= αi <= C , Σαiyi = 0
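A sketch of the role of C using scikit-learn's soft-margin SVC on overlapping synthetic classes (data assumed for illustration): a small C tolerates margin violations and keeps many support vectors, while a large C penalizes them.

```python
# Effect of C on the number of support vectors for overlapping classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, size=(50, 2)),
               rng.normal(+1, 1.2, size=(50, 2))])   # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors")
```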
[Page 12]
(Figure: Class 1 and Class 2)
Parameter Selection
• The effectiveness of an SVM depends on the selection of the kernel, the kernel parameters, and the parameter C
• A common choice is the Gaussian kernel, with a single parameter γ
• The best combination of C and γ is often selected by grid search over exponentially increasing sequences of C and γ
• Each combination is checked using cross-validation, and the one with the best accuracy is chosen
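A sketch of this grid search with scikit-learn's GridSearchCV; the exponential ranges for C and γ are assumptions for illustration:

```python
# Grid search over exponentially spaced C and gamma with cross-validation.
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

param_grid = {"C": [2 ** k for k in range(-5, 6, 2)],
              "gamma": [2 ** k for k in range(-7, 2, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```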
[Page 13]
Drawbacks
• Cannot be directly applied to multiclass problems; requires the use of algorithms that convert a multiclass problem into multiple binary classification problems
• Uncalibrated class membership probabilities
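Both drawbacks have standard workarounds in practice; here is a sketch with scikit-learn, using one-vs-rest wrapping for multiclass and the library's built-in Platt scaling for calibrated probabilities:

```python
# One-vs-rest for multiclass, and calibrated probabilities via Platt scaling.
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)   # 3-class problem

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print("OvR predictions:", ovr.predict(X[:5]))

# probability=True fits an internal calibration model (Platt scaling).
prob_svc = SVC(kernel="rbf", probability=True).fit(X, y)
print("calibrated probabilities:", prob_svc.predict_proba(X[:2]))
```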
[Page 14]
THANK YOU