  • Stochastic Subgradient Approach for Solving Linear Support Vector Machines

    Jan Rupnik, Jozef Stefan Institute

  • Outline

    Introduction
    Support Vector Machines
    Stochastic Subgradient Descent SVM - Pegasos
    Experiments

  • Introduction

    Support Vector Machines (SVMs) have become one of the most popular classification tools of the last decade.
    Straightforward implementations could not handle large sets of training examples.
    Recently, methods for solving SVMs with linear computational complexity have emerged.
    Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.

  • Hard Margin SVM

    Many hyperplanes perfectly separate the two classes (e.g. the red line in the figure).
    Only one hyperplane achieves the maximum margin (the blue line).
    The examples that lie on the margin are the support vectors.

  • Soft Margin SVM

    Allow a small number of examples to be misclassified in order to find a large margin classifier.

    The trade-off between classification accuracy on the training set and the size of the margin is a parameter of the SVM.

  • Problem setting

    Let S = {(x_1, y_1), ..., (x_m, y_m)} be the set of input-output pairs, where x_i ∈ R^N and y_i ∈ {-1, 1}.
    Find a hyperplane with normal vector w ∈ R^N and offset b ∈ R that has good classification accuracy on the training set S and a large margin.
    Classify a new example x as sign(w·x + b).
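
    As a minimal sketch of this decision rule (NumPy, with made-up toy values for w, b and x):

```python
import numpy as np

# Hypothetical toy values; not taken from the slides.
w = np.array([0.4, -1.2, 0.7])
b = 0.1
x = np.array([1.0, 0.5, 2.0])

def predict(w, b, x):
    """Classify x with the linear decision rule sign(w.x + b)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(w, b, x))  # w.x + b = 1.3 >= 0, so the prediction is +1
```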

  • Optimization problem

    Regularized hinge loss:

    min_w  λ/2 w·w + 1/m Σ_i (1 - y_i(w·x_i + b))_+

    λ controls the trade-off between the margin and the loss: the first term controls the size of the margin, the sum is the empirical hinge loss on the training set.
    The margin y_i(w·x_i + b) is positive for correctly classified examples and negative otherwise.
    (1 - z)_+ := max{0, 1 - z} is the hinge loss.
    The first summand is a quadratic function and the sum is piecewise linear, so the whole objective is piecewise quadratic.
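
    For concreteness, a minimal NumPy sketch of this objective; the function name svm_objective and the parameter lam (for λ) are illustrative, not from the slides:

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Regularized hinge loss: lam/2 * ||w||^2 + average hinge loss over the training set."""
    margins = y * (X @ w + b)                  # y_i (w.x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)     # (1 - margin)_+
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```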

  • Perceptron

    We ignore the offset parameter b from now on (b = 0).

    Regularized hinge loss (SVM):
    min_w  λ/2 w·w + 1/m Σ_i (1 - y_i(w·x_i))_+

    Perceptron:
    min_w  1/m Σ_i (-y_i(w·x_i))_+

  • Loss functions

    Standard 0/1 loss: penalizes all incorrectly classified examples by the same amount.
    Perceptron loss: penalizes incorrectly classified examples x proportionally to the size of |w·x|.
    Hinge loss: penalizes incorrectly classified examples as well as correctly classified examples that lie within the margin.
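
    The three losses can be written as functions of the margin y(w·x); a small illustrative sketch (the function names are assumptions):

```python
import numpy as np

def zero_one_loss(margin):
    """1 for misclassified examples (margin <= 0), 0 otherwise."""
    return np.where(margin <= 0, 1.0, 0.0)

def perceptron_loss(margin):
    """max(0, -margin): grows with |w.x| for misclassified examples."""
    return np.maximum(0.0, -margin)

def hinge_loss(margin):
    """max(0, 1 - margin): also penalizes correct examples inside the margin."""
    return np.maximum(0.0, 1.0 - margin)

# A correctly classified example inside the margin, e.g. margin = 0.5:
# zero_one_loss -> 0.0, perceptron_loss -> 0.0, hinge_loss -> 0.5
```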

  • Stochastic Subgradient Descent

    Gradient descent optimization in the perceptron (smooth objective).
    Subgradient descent in Pegasos (non-differentiable objective).
    At a differentiable point the gradient is unique; at a non-differentiable point all subgradients are equally valid.

  • Stochastic Subgradient

    Subgradient in the perceptron: -1/m Σ_i y_i x_i, summed over the misclassified examples.
    Subgradient in the SVM: λw - 1/m Σ_i y_i x_i, summed over the examples with non-zero hinge loss (y_i(w·x_i) < 1).
    For every point w the subgradient is a function of the training sample S. We can estimate it from a smaller random subset A of S of size k (the stochastic part) and speed up computations.
    Stochastic subgradient in the SVM: λw - 1/k Σ_{(x,y) ∈ A, y(w·x) < 1} y x.
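
    A hedged NumPy sketch of this mini-batch subgradient estimate (names such as stochastic_subgradient, lam and rng are assumptions made for illustration):

```python
import numpy as np

def stochastic_subgradient(w, X, y, lam, k, rng):
    """Estimate a subgradient of the SVM objective from a random subset A of size k."""
    idx = rng.choice(len(y), size=k, replace=False)   # random subsample A
    Xa, ya = X[idx], y[idx]
    violating = ya * (Xa @ w) < 1                     # examples with non-zero hinge loss
    return lam * w - (Xa[violating].T @ ya[violating]) / k
```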

  • Pegasos - the algorithm

    At each iteration t:
    set the learning rate η_t = 1/(λt),
    draw a random subsample A_t of S of size k,
    take a subgradient step with λw - 1/k Σ_{(x,y) ∈ A_t, y(w·x) < 1} y x (the subgradient of the loss term is zero on the other training points),
    project w back into the ball of radius 1/√λ (rescaling).
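
    A minimal NumPy sketch of the loop described above (the function name, parameter defaults and data layout are assumptions, not taken from the slides):

```python
import numpy as np

def pegasos(X, y, lam=0.01, k=1, T=1000, seed=0):
    """Pegasos-style solver for the linear SVM without offset (b = 0).

    X: (m, N) array of examples, y: (m,) array of labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    radius = 1.0 / np.sqrt(lam)                        # the solution lies in this ball
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)                          # learning rate eta_t = 1/(lambda * t)
        idx = rng.choice(m, size=k, replace=False)     # subsample A_t of size k
        Xa, ya = X[idx], y[idx]
        violating = ya * (Xa @ w) < 1                  # points with non-zero hinge loss
        grad = lam * w - (Xa[violating].T @ ya[violating]) / k
        w -= eta * grad                                # subgradient step
        norm = np.linalg.norm(w)
        if norm > radius:
            w *= radius / norm                         # projection into the ball (rescaling)
    return w
```

    On sparse bag-of-words data such as the Reuters experiments below, X would normally be a scipy.sparse matrix rather than a dense array; dense NumPy keeps the sketch short.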

  • What's new in Pegasos?

    The subgradient descent technique is 50 years old.
    The soft margin SVM is 14 years old.
    Gradient descent methods typically suffer from slow convergence.
    The authors of Pegasos proved that an aggressively decreasing learning rate η_t still leads to convergence.
    Previous works: η_t = 1/(λ√t); Pegasos: η_t = 1/(λt).
    They also proved that the optimal solution always lies in a ball of radius 1/√λ.

  • SVM-Light

    A popular SVM solver with superlinear computational complexity.
    Solves a large quadratic program.
    The solution can be expressed in terms of a small subset of training vectors, called support vectors.
    Uses an active set method to find the support vectors, solving a series of smaller quadratic problems.
    Produces highly accurate solutions.
    Algorithm and implementation by Thorsten Joachims.
    T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.

  • Experiments - data

    Reuters RCV2: roughly 800,000 news articles.
    Already preprocessed into bag-of-words vectors, publicly available.
    Roughly 50,000 features; sparse vectors.
    Category CCAT, which consists of 381,327 news articles.

  • Quick convergence to suboptimal solutions

    200 iterations took 9.2 CPU seconds; the objective value was 0.3% higher than the optimal value.

    560 iterations were needed to get within 0.1% of the optimal objective value.

    SVM-Light takes roughly 4 hours of CPU time.

  • Test set error

    Optimizing the objective value to high precision is often not necessary.

    The lowest error on the test set is achieved much earlier.

  • Parameters k and T

    The product kT determines how close to the optimal value we get.

    If kT is fixed, the choice of k does not play a significant role.

  • Conclusions and final notes

    Pegasos is one of the most efficient suboptimal SVM solvers.
    Suboptimal solutions often generalize well to new examples.
    It can take advantage of sparsity.
    It is a linear solver; nonlinear extensions have been proposed, but they suffer from slower convergence.

  • Thank you!