Support Vector Machine


Page 1: support vector machine

Submitted by: Garisha Chowdhary, MCSE 1st year, Jadavpur University

Page 2: support vector machine

A set of related supervised learning methods

Non-probabilistic binary linear classifier

Linear learners like perceptrons, but unlike them they use the concepts of maximum margin, linearization, and kernel functions

Used for classification and regression analysis

Page 3: support vector machine

A good separation

Map non-linearly separable instances to higher dimensions to overcome linearity constraints

Select between candidate hyperplanes, using the maximum margin as the test

[Figure: three candidate separating hyperplanes between Class 1 and Class 2]

Page 4: support vector machine

Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class

This is because the larger the margin, the lower the generalization error (more confident predictions)

[Figure: Class 1 and Class 2 separated by a maximum-margin hyperplane]

Page 5: support vector machine

Given N samples
• {(x(1), y(1)), (x(2), y(2)), …, (x(N), y(N))}
• where the labels y(i) = +1/−1 and x(i) belongs to Rⁿ

Find a hyperplane wᵀx + b = 0 such that
• wᵀx(i) + b > 0 for all i such that y(i) = +1
• wᵀx(i) + b < 0 for all i such that y(i) = −1

Functional margin
• With respect to a training example, defined by γ̂(i) = y(i)(wᵀx(i) + b)
• Want the functional margin to be large, i.e. y(i)(wᵀx(i) + b) >> 0
• May rescale w and b without altering the decision function, but multiplying the functional margin by the scale factor
• This allows us to impose a normalization condition ||w|| = 1 and consider the functional margin of (w/||w||, b/||w||)
• W.r.t. the training set, defined by γ̂ = min over i of γ̂(i)
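A minimal numpy sketch of these definitions; the toy data and the candidate hyperplane (w, b) below are made-up illustrative values, not from the slides:

```python
import numpy as np

# Toy linearly separable data and an assumed candidate hyperplane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])
b = 0.0

# Functional margin of each example: gamma_hat(i) = y(i) * (w^T x(i) + b).
functional_margins = y * (X @ w + b)

# Functional margin w.r.t. the whole training set: the minimum over examples.
gamma_hat = functional_margins.min()
print(functional_margins, gamma_hat)

# Rescaling (w, b) rescales the functional margin by the same factor
# without changing the decision function:
print(y * (X @ (10 * w) + 10 * b))   # 10x the margins, same classifier
```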

Page 6: support vector machine

Geometric margin
• Defined by γ(i) = y(i)((w/||w||)ᵀx(i) + b/||w||)
• If ||w|| = 1, the functional margin equals the geometric margin
• Invariant to scaling of the parameters w and b; w may be scaled such that ||w|| = 1
• Also, γ = min over i of γ(i)

Now, the objective is to

Maximize γ w.r.t. γ, w, b s.t.
• y(i)(wᵀx(i) + b) >= γ for all i
• ||w|| = 1

or equivalently, maximize γ̂/||w|| w.r.t. γ̂, w, b s.t.
• y(i)(wᵀx(i) + b) >= γ̂ for all i

Introducing the scaling constraint that the functional margin be 1, the objective function may be simplified further to maximizing 1/||w||, or

Minimize (1/2)||w||² s.t.
• y(i)(wᵀx(i) + b) >= 1 for all i
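As a sketch, this primal problem can be handed to a general-purpose constrained solver; the toy data and the choice of scipy's SLSQP method below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as theta = [w_1, w_2, b].
def objective(theta):
    w = theta[:2]
    return 0.5 * (w @ w)                 # (1/2)||w||^2

# One inequality constraint per example: y(i)(w^T x(i) + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)                # expect roughly w ~ [0.25, 0.25], b ~ 0
```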

Page 7: support vector machine

Using a Lagrangian to solve this inequality-constrained optimization problem, we have

L = ½||w||² − Σ αi (yi(wᵀxi + b) − 1)

Setting the gradient of L w.r.t. w and b to 0, we have

w = Σ αi yi xi , Σ αi yi = 0

Substituting w in L, we get the dual problem corresponding to the primal problem:

maximize W(α) = Σ αi − ½ ΣΣ αi αj yi yj xiᵀxj , s.t. αi >= 0 , Σ αi yi = 0

Solve for α and recover

w = Σ αi yi xi , b = −(max over i: yi=−1 of wᵀxi + min over i: yi=1 of wᵀxi) / 2
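A minimal sketch of solving this dual with the same general-purpose solver (toy data as before; real implementations use dedicated QP solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Same toy data as in the primal sketch (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q[i, j] = y_i y_j x_i^T x_j, so W(alpha) = sum(alpha) - 0.5 alpha^T Q alpha.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):                     # scipy minimizes, so negate W
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum alpha_i y_i = 0
bounds = [(0.0, None)] * n                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(n), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x

# Recover w and b as on the slide.
w = (alpha * y) @ X
b = -0.5 * (max(X[i] @ w for i in range(n) if y[i] == -1)
            + min(X[i] @ w for i in range(n) if y[i] == 1))
print("alpha =", alpha.round(4), "w =", w, "b =", b)
```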

Page 8: support vector machine

For the conversion of the primal problem to the dual problem, the following Karush-Kuhn-Tucker (KKT) conditions must be satisfied, where gi(w, b) = 1 − yi(wᵀxi + b) is the i-th constraint

• (∂/∂wi) L(w, α) = 0, i = 1, …, n
• αi gi(w, b) = 0, i = 1, …, k
• gi(w, b) <= 0, i = 1, …, k
• αi >= 0, i = 1, …, k

From the KKT complementary slackness condition (the 2nd), illustrated in the sketch below:

• αi > 0 ⇒ gi(w, b) = 0 (active constraint) ⇒ (x(i), y(i)) has functional margin 1 (a support vector)

• gi(w, b) < 0 ⇒ αi = 0 (inactive constraint; non-support vectors)

[Figure: Class 1 and Class 2 with support vectors on the margin]
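Continuing the dual sketch above, support vectors and complementary slackness can be checked directly from α; the α values below are assumed stand-ins for the solver's output on the toy data, not computed here:

```python
import numpy as np

# Toy data and an assumed dual solution (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.0625, 0.0, 0.0625, 0.0])   # assumed solver output
b = 0.0                                        # assumed, from the recovery formula

w = (alpha * y) @ X

# g_i(w, b) = 1 - y_i (w^T x_i + b); complementary slackness: alpha_i * g_i = 0.
g = 1.0 - y * (X @ w + b)
print("alpha_i * g_i:", alpha * g)             # ~0 for every i

# Support vectors are the examples with alpha_i > 0 (functional margin 1).
support = alpha > 1e-8
print("support vectors:", X[support])
```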

Page 9: support vector machine

In the case of non-linearly separable data, mapping the data to a high-dimensional feature space via a non-linear mapping function φ increases the likelihood that the data is linearly separable

Use a kernel function, corresponding to the dot product of some non-linear mapping of the data, to simplify computations over the high-dimensionally mapped data

Having found the αi, classifying a test point x requires calculating only a quantity that depends on the inner products between x and the support vectors

The kernel function is a measure of similarity between two vectors

A kernel function is valid if it satisfies Mercer's theorem, which states that the corresponding kernel matrix must be symmetric positive semi-definite (zᵀKz >= 0 for all z)
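A sketch of the resulting kernelized decision function, f(x) = Σ αi yi K(xi, x) + b; the RBF kernel choice and all numeric values below are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(x, z, s=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 s^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * s ** 2))

def decision(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b -- only inner products with the
    # support vectors are needed; phi(x) is never computed explicitly.
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b

# Illustrative support vectors and multipliers (assumed values).
X_sv = np.array([[2.0, 2.0], [-2.0, -2.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.0625, 0.0625])
print(np.sign(decision(np.array([1.0, 3.0]), X_sv, y_sv, alpha_sv, b=0.0)))
```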

Page 10: support vector machine

Polynomial kernel with degree d
• K(x, y) = (xᵀy + 1)^d

Radial basis function kernel with width s
• K(x, y) = exp(−||x − y||²/(2s²))
• Feature space is infinite-dimensional

Sigmoid with parameters k and q
• K(x, y) = tanh(k xᵀy + q)
• Does not satisfy the Mercer condition for all k and q
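Minimal numpy versions of these three kernels, with an eigenvalue check of the Mercer condition on a random Gram matrix; the parameter defaults and data are arbitrary illustrative choices:

```python
import numpy as np

def poly_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d                      # (x^T z + 1)^d

def rbf_kernel(x, z, s=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * s ** 2))

def sigmoid_kernel(x, z, k=1.0, q=-1.0):
    return np.tanh(k * (x @ z) + q)                # not PSD for all k, q

def gram(kernel, X):
    # Kernel (Gram) matrix K[i, j] = K(x_i, x_j).
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(0).normal(size=(10, 2))  # illustrative data
for kern in (poly_kernel, rbf_kernel, sigmoid_kernel):
    K = gram(kern, X)
    # Mercer check: symmetric PSD <=> all eigenvalues >= 0 (up to round-off).
    print(kern.__name__, np.linalg.eigvalsh(K).min() >= -1e-9)
```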

Page 11: support vector machine

High dimensionality doesn't guarantee linear separation; the hyperplane might also be susceptible to outliers

Relax the constraints by introducing 'slack variables' ξi that allow violation of a constraint by a small quantity

Penalize the objective function for violations

The parameter C controls the trade-off between the penalty and the margin.

So the objective now becomes: minimize over w, b, ξ: (1/2)||w||² + C Σ ξi s.t. y(i)(wᵀx(i) + b) >= 1 − ξi , ξi >= 0

This tries to ensure that most examples have functional margin at least 1

Forming the corresponding Lagrangian, the dual problem now is to: maximize over α: Σ αi − ½ ΣΣ αi αj yi yj xiᵀxj , s.t. 0 <= αi <= C , Σ αi yi = 0
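In the dual sketch earlier, this only changes the bounds on each αi from (0, ∞) to (0, C). In practice, library implementations expose C directly; a minimal scikit-learn sketch with illustrative toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data; the last point sits close to the opposite class, so a soft
# margin is useful (values are illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -2.0], [1.5, 1.5]])
y = np.array([1, 1, -1, -1, -1])

# Small C tolerates margin violations in exchange for a wider margin;
# large C penalizes violations heavily, approaching the hard margin.
clf = SVC(kernel="linear", C=0.1).fit(X, y)
print(clf.coef_, clf.intercept_, clf.support_)
```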

Page 12: support vector machine

[Figure: Class 1 and Class 2]

Parameter Selection

• The effectiveness of an SVM depends on the selection of the kernel, the kernel parameters, and the parameter C
• A common choice is the Gaussian kernel, with a single parameter γ
• The best combination of C and γ is often selected by grid search over exponentially increasing sequences of C and γ
• Each combination is checked using cross-validation, and the one with the best accuracy is chosen
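A sketch of this grid search with scikit-learn; the synthetic data, grid ranges, and the 5-fold cross-validation setting are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)                     # illustrative data
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # non-linear boundary

# Exponentially spaced grids for C and gamma, as described above.
param_grid = {"C": 10.0 ** np.arange(-2, 3),
              "gamma": 10.0 ** np.arange(-3, 2)}

# Each (C, gamma) pair is scored by cross-validation; the best is kept.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```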

Page 13: support vector machine

Drawbacks

• Cannot be directly applied to multiclass problems; requires algorithms that convert the multiclass problem into multiple binary classification problems

• Class membership probabilities are uncalibrated
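Both drawbacks are typically handled by wrappers around the binary SVM; for example, scikit-learn offers one-vs-rest reduction and Platt-scaled probabilities. A sketch with illustrative synthetic data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Three Gaussian blobs as a toy multiclass problem (illustrative values).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = rng.normal(size=(90, 2)) + np.repeat(centers, 30, axis=0)
y = np.repeat([0, 1, 2], 30)

# One-vs-rest: one binary SVM per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5)).fit(X, y)

# probability=True fits Platt scaling via internal cross-validation,
# giving approximately calibrated class probabilities.
prob_svc = SVC(kernel="rbf", gamma=0.5, probability=True).fit(X, y)
print(ovr.predict([[4.0, 0.5]]), prob_svc.predict_proba([[4.0, 0.5]]))
```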

Page 14: support vector machine

THANK YOU