Support Vector Machines: Theory and Implementation in Python by Nachi


Page 1: Support Vector Machines

Support Vector Machines

Theory and Implementation in Python

by Nachi

Page 2: Support Vector Machines

Definition

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.

- Wikipedia

Page 3: Support Vector Machines

Properties of an SVM

Non-probabilistic binary linear classifier

Support for non-linear classification using the 'kernel trick'

Page 4: Support Vector Machines

Linear separability

Two sets of points in p-dimensional space are said to be linearly separable if they can be separated by a (p-1)-dimensional hyperplane.

Example - The two sets of 2D data in the image are separated by a single straight line (a 1-dimensional hyperplane), and hence are linearly separable.

Page 5: Support Vector Machines

Linear Discriminant

The hyperplane that separates the two sets of data is called the linear discriminant.

Equation: Wᵀ X = C

W = [w1, w2, ..., wn] and X = [x1, x2, ..., xn] for n dimensions, i.e. w1·x1 + w2·x2 + ... + wn·xn = C

Page 6: Support Vector Machines

Selecting the hyperplane

For any linearly separable data set, there exist infinitely many separating hyperplanes. Hence, we must choose the most suitable one for classification.

Page 7: Support Vector Machines

Maximal Margin Hyperplane

We can compute the (perpendicular) distance from each observation in the data set to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest.

Page 8: Support Vector Machines

Example - maximal margin hyperplane

Page 9: Support Vector Machines

Finding the shortest distance (margin)

Find Xp such that ||Xp - X|| is minimum and Wᵀ Xp = C (as Xp lies on the decision boundary).

[Wᵀ denotes W transpose]

Page 10: Support Vector Machines

Maximizing the margin

Maximize D such that

D = (Wᵀ X - C) / ||W||

where X is the support vector (the observation closest to the hyperplane).
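As a rough numerical check of this formula (the vectors below are made-up values for illustration, not taken from the slides), D can be computed directly and confirmed against the projection of X onto the hyperplane; |Wᵀ X - C| is used in general so the distance is non-negative:

import numpy as np

W = np.array([1.0, 1.0])                   # hypothetical weight vector
C = 3.0                                    # hypothetical threshold
X = np.array([4.0, 2.0])                   # hypothetical support vector

D = (W @ X - C) / np.linalg.norm(W)        # slide's formula, ~2.12 here
Xp = X - ((W @ X - C) / (W @ W)) * W       # projection of X onto the hyperplane
print(D)
print(W @ Xp)                              # ~3.0 -> Xp lies on Wᵀ X = C
print(np.linalg.norm(X - Xp))              # equals D, as expected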

Page 11: Support Vector Machines

Why the maximal margin hyperplane?

● Suppose we have a maximal margin hyperplane for a data set and want to predict the class of a new observation: we compute its distance from the hyperplane.

● The greater the distance from the hyperplane, the more confident we are that the sample belongs to that class.

● Thus, the hyperplane whose smallest distance to the training observations is largest would be the most suitable.

Page 12: Support Vector Machines

Classifying a new sample

Consider a new sample x' = [x1, x2, ..., xn]. To predict the class to which the sample belongs, we simply compute Wᵀ x' and compare it with C.

If Wᵀ x' > C, the sample lies on one side (the positive half-space) of the hyperplane; if Wᵀ x' < C, it lies on the other side (the negative half-space). The sample belongs to the class that corresponds to that half-space.
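As a minimal sketch of this rule (the weights W, threshold C, and samples below are made-up values for illustration, not from the slides), the half-space test takes a few lines of NumPy:

import numpy as np

def classify(x_new, W, C):
    # Assign a label based on which half-space x_new falls into
    return 1 if W @ x_new > C else 0

W = np.array([1.0, 1.0])   # hypothetical weight vector
C = 3.0                    # hypothetical threshold
print(classify(np.array([4.0, 2.0]), W, C))  # Wᵀx' = 6 > 3 -> class 1
print(classify(np.array([0.5, 1.0]), W, C))  # Wᵀx' = 1.5 < 3 -> class 0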

Page 13: Support Vector Machines

SVM - A linear discriminant

An SVM is simply a linear discriminant that builds a hyperplane with as large a margin as possible.

It classifies a new sample by simply computing the distance from the hyperplane.

Page 14: Support Vector Machines

Support Vectors

● Observations (represented as vectors) that lie exactly at the margin distance from the hyperplane are called support vectors.

● These are important as shifting them even slightly might change the position of the hyperplane to a great extent.

Page 15: Support Vector Machines

Example - Support vectors

The vectors lying on the green lines in the image are the support vectors.

Page 16: Support Vector Machines

Soft margin

Insisting on perfectly separating the training data can overfit it, i.e. make the classifier overly sensitive to individual observations. To avoid this, we may allow some amount of misclassification, in exchange for greater robustness to individual observations and better classification of most of the observations.

Page 17: Support Vector Machines

Achieving the soft margin

Each observation has a 'slack variable' that allows it to be on the wrong side of the margin or of the hyperplane.

Sum of slack variables <= C

where C is a nonnegative tuning parameter: our budget for the total amount by which the margin can be violated by all the observations.

Page 18: Support Vector Machines

Tuning parameter C & Support vectors relation

Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. These observations do affect the support vector classifier. When the tuning parameter C is large, the margin is wide, many observations violate the margin, and so there are many support vectors.
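Note that scikit-learn's SVC also exposes a regularization parameter named C, but it acts as a penalty on margin violations, so its effect runs opposite to the slack 'budget' described above: a small sklearn C gives a wide, soft margin with many support vectors, while a large sklearn C gives a narrow, hard margin with fewer. A minimal sketch, assuming scikit-learn is installed and using made-up toy data:

from sklearn import svm

X = [[0, 0], [1, 1], [1, 0], [2, 2], [3, 3], [3, 2]]   # illustrative points
Y = [0, 0, 0, 1, 1, 1]

for c in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='linear', C=c).fit(X, Y)
    # Smaller C typically yields more support vectors (softer margin)
    print(c, len(clf.support_vectors_))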

Page 19: Support Vector Machines

Non linearly separable

In this case, an SVM would not be able to linearly classify the data. Hence, the SVM uses what is known as the 'kernel trick'. In this 'trick', the feature space is enlarged; the idea is that a boundary that is linear in the enlarged feature space need not be linear in the original feature space. The enlargement can be done using various kernel functions.
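A minimal sketch of the kernel trick with scikit-learn, using made-up XOR-style data (no straight line can separate these four points in the original 2D space, but an RBF kernel implicitly enlarges the feature space):

from sklearn import svm

X = [[0, 0], [1, 1], [0, 1], [1, 0]]   # XOR pattern: not linearly separable
Y = [0, 0, 1, 1]

linear_clf = svm.SVC(kernel='linear').fit(X, Y)
rbf_clf = svm.SVC(kernel='rbf', gamma=2.0).fit(X, Y)

print(linear_clf.score(X, Y))   # below 1.0: no separating line exists
print(rbf_clf.score(X, Y))      # 1.0 expected: separable in the enlarged space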

Page 20: Support Vector Machines

Enlarged feature space

Page 21: Support Vector Machines

Multi-Category Classification

● One-Versus-One Classification

● One-Versus-All Classification (both strategies are sketched below)
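A minimal sketch of both strategies with scikit-learn (the three-class toy data below is illustrative): SVC implements the one-versus-one scheme internally, while OneVsRestClassifier wraps a binary SVC for one-versus-all:

from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier

X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]   # toy data
Y = [0, 0, 1, 1, 2, 2]                                    # three classes

ovo_clf = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X, Y)
ova_clf = OneVsRestClassifier(svm.SVC(kernel='linear')).fit(X, Y)

print(ovo_clf.predict([[9, 1]]))   # expected: class 2
print(ova_clf.predict([[1, 0]]))   # expected: class 0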

Page 22: Support Vector Machines

Sample Data

X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]

Y = [0, 0, 0, 1, 1]

Page 23: Support Vector Machines

SVM in sklearn

from sklearn import svm
clfy = svm.SVC()

Default: class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma=0.0, coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, random_state=None)

Page 24: Support Vector Machines

‘Fit’ the model

clfy.fit(X, Y)

Fit the SVM model, i.e. compute and build the separating hyperplane.

Page 25: Support Vector Machines

Features of sklearn

clfy.support_vectors_  Retrieves all the support vectors of the model

clfy.predict([[3, 3]])  Predicts the class of the given sample

Page 26: Support Vector Machines

Features of sklearn

clfy.score(X, Y)  Returns the mean accuracy on the given test data and labels.

clfy.decision_function([[2.5, 2.5]])  Returns the distance of the given samples to the separating hyperplane.
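Putting the snippets from Pages 22-26 together, a minimal end-to-end sketch (a linear kernel is chosen here for illustration, whereas the slide's default is 'rbf'; newer scikit-learn versions expect 2D arrays for predict/decision_function and default gamma to 'scale' rather than 0.0):

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]   # sample data from Page 22
Y = [0, 0, 0, 1, 1]

clfy = svm.SVC(kernel='linear')
clfy.fit(X, Y)                                  # build the hyperplane

print(clfy.support_vectors_)                    # support vectors found while fitting
print(clfy.predict([[3, 3]]))                   # expected: [1]
print(clfy.score(X, Y))                         # mean accuracy on the training data
print(clfy.decision_function([[2.5, 2.5]]))     # signed value relative to the hyperplane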

Page 27: Support Vector Machines

Conclusion

Parameter and kernel selection is crucial in an SVM model.