
Support Vector Machines

Lecturer: Yishay Mansour

Itay Kirshenbaum

Lecture Overview

In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).

Lecture Overview – Cont.

We begin by building the intuition behind SVMs,

then define the SVM as an optimization problem and discuss how to solve it efficiently.

We conclude with an analysis of the error rate of SVMs using two techniques: Leave-One-Out and VC-dimension.

Introduction

The Support Vector Machine is a supervised learning algorithm

used to learn a hyperplane that solves the binary classification problem,

which is among the most extensively studied problems in machine learning.

Binary Classification Problem

Input space: $X \subseteq \mathbb{R}^n$

Output space: $Y = \{-1, +1\}$

Training data: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn i.i.d. from a distribution $D$

Goal: Select a hypothesis $h \in H$ that best predicts other points drawn i.i.d. from $D$

Binary Classification – Cont.

Consider the problem of predicting the success of a new drug based on a patient's height and weight.

$m$ ill people are selected and treated.

This generates $m$ 2-dimensional vectors (height and weight).

Each point is assigned $+1$ to indicate successful treatment, or $-1$ otherwise.

This can be used as training data.
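As a concrete illustration, here is a minimal sketch of generating such a training set in Python; the distributions and the labeling rule below are invented purely for illustration and are not part of the lecture.

```python
import numpy as np

# Toy version of the drug-trial example: m patients, each a 2-d point
# (height, weight), labeled +1 for successful treatment and -1 otherwise.
rng = np.random.default_rng(0)
m = 100
height = rng.normal(170, 10, m)   # cm (hypothetical)
weight = rng.normal(70, 12, m)    # kg (hypothetical)
X = np.column_stack([height, weight])
# Hypothetical labeling rule standing in for the unknown distribution D.
y = np.where(0.03 * height - 0.05 * weight > 1.5, 1, -1)
S = list(zip(X, y))               # S = {(x_1, y_1), ..., (x_m, y_m)}
```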

Binary Classification – Cont.

There are infinitely many ways to classify the data.

Occam’s razor: simple classification rules tend to provide better results.

We therefore use a linear classifier, or hyperplane.

Our class of linear classifiers:

$H = \{x \mapsto \mathrm{sign}(w \cdot x + b) \mid w \in \mathbb{R}^n, b \in \mathbb{R}\}$

$h \in H$ maps $x \in X$ to $+1$ if $w \cdot x + b \geq 0$, and to $-1$ otherwise.


Choosing a Good Hyperplane

Intuition: consider two cases of positive classification:

$w \cdot x + b = 0.1$ and $w \cdot x + b = 100$

We are more confident in the decision made by the latter than in the former.

Therefore, choose a hyperplane with maximal margin.

Good Hyperplane – Cont.

Definition (functional margin): for a linear classifier $(w, b)$,

$\hat{\gamma}_i = y_i(w \cdot x_i + b)$ and the functional margin of $S$ is $\hat{\gamma}_S = \min_{i = 1, \ldots, m} \hat{\gamma}_i$

where $\mathrm{sign}(w \cdot x_i + b)$ is the classification of $x_i$ according to $(w, b)$.

Maximal Margin

$(w, b)$ can be scaled to increase the functional margin: $\mathrm{sign}(w \cdot x + b) = \mathrm{sign}(5w \cdot x + 5b)$ for all $x$,

but the functional margin of $(5w, 5b)$ is 5 times greater than that of $(w, b)$.

Cope with this by adding an additional constraint: $\|w\| = 1$.

Maximal Margin – Cont.

Geometric margin: consider the geometric distance between the hyperplane and the closest points.

Geometric Margin

Definition (geometric margin):

$\gamma_i = y_i\left(\frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|}\right)$ and the geometric margin of $S$ is $\gamma_S = \min_{i = 1, \ldots, m} \gamma_i$

Relation to the functional margin: $\gamma_i = \hat{\gamma}_i / \|w\|$

Both are equal when $\|w\| = 1$.

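The sketch below (an addition, not from the original slides) computes both margins for a fixed, made-up hyperplane $(w, b)$ and toy points, to make the two definitions concrete.

```python
import numpy as np

# Functional and geometric margins of a fixed hyperplane (w, b); the
# hyperplane and the points below are made up for illustration.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
y = np.array([1, -1, 1])

functional = y * (X @ w + b)                 # gamma_hat_i = y_i (w . x_i + b)
geometric = functional / np.linalg.norm(w)   # gamma_i = gamma_hat_i / ||w||

print(functional.min())                      # functional margin of S
print(geometric.min())                       # geometric margin of S
```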

The Algorithm

We saw two definitions of the margin and the intuition behind seeking a margin-maximizing hyperplane.

Goal: write an optimization program that finds such a hyperplane.

We always look for $(w, b)$ maximizing the margin.

The Algorithm – Take 1

First try:

$\max_{w, b, \gamma} \; \gamma \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq \gamma, \; i = 1, \ldots, m, \quad \|w\| = 1$

Idea: maximize $\gamma$, requiring that the functional margin of each sample is at least $\gamma$.

The functional and geometric margins are the same since $\|w\| = 1$,

so this is the largest possible geometric margin with respect to the training set.

The Algorithm – Take 2

The first try can't be solved by any off-the-shelf optimization software:

the constraint $\|w\| = 1$ is non-linear; in fact, it's even non-convex.

How can we discard the constraint? Use the geometric margin:

$\max_{w, b, \hat{\gamma}} \; \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq \hat{\gamma}, \; i = 1, \ldots, m$

The Algorithm – Take 3

We now have a non-convex objective function $\frac{\hat{\gamma}}{\|w\|}$, so the problem remains.

Remember that we can scale $(w, b)$ as we wish: force the functional margin to be $\hat{\gamma} = 1$.

The objective function becomes $\max \frac{1}{\|w\|}$, which is the same as $\min \frac{1}{2}\|w\|^2$.

The factor of $\frac{1}{2}$ and the power of 2 do not change the program; they just make things easier.

The algorithm – Final version

The final program:

$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq 1, \; i = 1, \ldots, m$

The objective is convex (quadratic) and all constraints are linear,

so the program can be solved efficiently using standard quadratic programming (QP) software.
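As a minimal sketch of what "solve with off-the-shelf software" looks like, the snippet below feeds this program to a generic convex solver; the choice of the cvxpy package and the toy data are assumptions made here, not part of the lecture.

```python
import cvxpy as cp
import numpy as np

# Hard-margin SVM primal on a tiny, linearly separable toy set.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))       # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]         # y_i (w . x_i + b) >= 1
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```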

Convex Optimization

We want to solve the optimization problem more efficiently than generic QP

Solution – Use convex optimization techniques

Convex Optimization – Cont.

Definition: a function $f$ is convex if

$\forall x, y \in X, \; \lambda \in [0, 1]: \quad f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$

Theorem: let $f$ be a differentiable convex function. Then

$\forall x, y \in X: \quad f(y) \geq f(x) + \nabla f(x) \cdot (y - x)$
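A quick numeric illustration (an addition, not from the slides): for the convex function $f(x) = \frac{1}{2}\|x\|^2$, whose gradient is $x$, the first-order condition above can be checked directly on random points.

```python
import numpy as np

# Check f(y) >= f(x) + grad f(x) . (y - x) for f(x) = (1/2)||x||^2.
rng = np.random.default_rng(1)
f = lambda v: 0.5 * np.dot(v, v)
grad_f = lambda v: v

for _ in range(1000):
    x, z = rng.normal(size=3), rng.normal(size=3)
    assert f(z) >= f(x) + grad_f(x) @ (z - x) - 1e-12
```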

Convex Optimization Problem

Let $f, g_i : X \to \mathbb{R}$, $i = 1, \ldots, m$, be convex functions.

We look for a value of $x \in X$ that minimizes $f(x)$ under the constraints $g_i(x) \leq 0$:

Find $\min_{x \in X} f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \; i = 1, \ldots, m$

Lagrange Multipliers

Lagrange multipliers are used to find a maximum or a minimum of a function subject to constraints.

We use them to solve our optimization problem.

Definition: the Lagrangian of a function $f$ subject to the constraints $g_i$, $i = 1, \ldots, m$, is

$L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x), \quad x \in X, \; \alpha_i \geq 0$

The $\alpha_i$ are called the Lagrange multipliers.

Primal Program

Plan: use the Lagrangian to write a program called the Primal Program,

which equals $f(x)$ if all the constraints are met, and $\infty$ otherwise.

Definition (Primal Program):

$\theta_P(x) = \max_{\alpha \geq 0} L(x, \alpha)$

Primal Program – Cont.

The constraints are of the form $g_i(x) \leq 0$.

If they are met, $\sum_{i=1}^{m} \alpha_i g_i(x)$ is maximized when all $\alpha_i$ are 0, the summation is 0, and $\theta_P(x) = f(x)$.

Otherwise, $\sum_{i=1}^{m} \alpha_i g_i(x)$ is maximized by letting $\alpha_i \to \infty$, and $\theta_P(x) = \infty$.
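A small numeric sketch of this case analysis, using an invented one-dimensional problem ($f(x) = x^2$, $g(x) = 1 - x$) and a crude grid over $\alpha$; none of these choices come from the lecture.

```python
import numpy as np

# theta_P(x) = max over alpha >= 0 of [ f(x) + alpha * g(x) ]
f = lambda x: x ** 2
g = lambda x: 1.0 - x

def theta_P(x, alphas=np.linspace(0.0, 1e6, 5)):
    # Approximate the supremum over alpha >= 0 on a coarse grid.
    return max(f(x) + a * g(x) for a in alphas)

print(theta_P(2.0))   # feasible: g(2) = -1 <= 0, so theta_P(2) = f(2) = 4
print(theta_P(0.5))   # infeasible: g(0.5) > 0, so theta_P blows up toward infinity
```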

Primal Program – Cont.

Our convex optimization problem is now:

$\min_{x \in X} \theta_P(x) = \min_{x \in X} \max_{\alpha \geq 0} L(x, \alpha)$

Define $p^* = \min_{x \in X} \theta_P(x)$ as the value of the primal program.

Dual Program

We define the Dual Program as:

$\theta_D(\alpha) = \min_{x \in X} L(x, \alpha)$

We'll look at

$\max_{\alpha \geq 0} \theta_D(\alpha) = \max_{\alpha \geq 0} \min_{x \in X} L(x, \alpha)$

This is the same as our primal program, except that the order of min / max is different.

Define $d^* = \max_{\alpha \geq 0} \min_{x \in X} L(x, \alpha)$ as the value of our Dual Program.

Dual Program – Cont.

We want to show $d^* = p^*$: if we find a solution to one problem, we find the solution to the other.

Start with $d^* \leq p^*$: "max min" is always less than "min max":

$d^* = \max_{\alpha \geq 0} \min_{x \in X} L(x, \alpha) \leq \min_{x \in X} \max_{\alpha \geq 0} L(x, \alpha) = p^*$

Now on to $p^* \leq d^*$.

Dual Program – Cont.

Claim: if there exist $x^*$ and $\alpha^* \geq 0$ which are a saddle point, i.e., $x^*$ is feasible and for every $x \in X$ and $\alpha \geq 0$

$L(x^*, \alpha) \leq L(x^*, \alpha^*) \leq L(x, \alpha^*)$,

then $p^* = d^*$ and $x^*$ is a solution to $p^*$.

Proof:

$p^* = \inf_{x} \sup_{\alpha \geq 0} L(x, \alpha) \leq \sup_{\alpha \geq 0} L(x^*, \alpha) \leq L(x^*, \alpha^*) \leq \inf_{x} L(x, \alpha^*) \leq \sup_{\alpha \geq 0} \inf_{x} L(x, \alpha) = d^*$

Conclude: together with $d^* \leq p^*$, this gives $d^* = p^*$.

Karush-Kuhn-Tucker (KKT) conditions

The KKT conditions give a characterization of an optimal solution to a convex problem.

Theorem: assume that $f$ and $g_i$, $i = 1, \ldots, m$, are differentiable and convex.

Then $x^*$ is a solution to the optimization problem iff there exists $\alpha \geq 0$ s.t.:

1. $\nabla_x L(x^*, \alpha) = \nabla f(x^*) + \sum_{i=1}^{m} \alpha_i \nabla g_i(x^*) = 0$

2. $\nabla_{\alpha} L(x^*, \alpha) = \big(g_1(x^*), \ldots, g_m(x^*)\big) \leq 0$

3. $\alpha_i g_i(x^*) = 0$ for $i = 1, \ldots, m$

KKT Conditions – Cont.

Proof: for every feasible $x$:

$f(x) \geq f(x^*) + \nabla f(x^*) \cdot (x - x^*)$

$= f(x^*) - \sum_{i=1}^{m} \alpha_i \nabla g_i(x^*) \cdot (x - x^*)$

$\geq f(x^*) - \sum_{i=1}^{m} \alpha_i \big[g_i(x) - g_i(x^*)\big]$

$= f(x^*) - \sum_{i=1}^{m} \alpha_i g_i(x) \geq f(x^*)$

The other direction holds as well.

KKT Conditions – Cont.

Example: consider the following optimization problem:

$\min \; \frac{1}{2} x^2 \quad \text{s.t.} \quad x \geq 2$

We have $f(x) = \frac{1}{2} x^2$ and $g(x) = 2 - x$.

The Lagrangian will be

$L(x, \alpha) = \frac{1}{2} x^2 + \alpha (2 - x)$

$\nabla_x L(x, \alpha) = x - \alpha = 0$, together with $\alpha \geq 0$ and $\alpha(2 - x) = 0$,

gives $x^* = \alpha^* = 2$ and $L(x^*, \alpha^*) = \frac{1}{2}(2)^2 = 2$.
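To sanity-check this worked example numerically, one can hand the same problem to a generic solver; the use of scipy here is an assumption for illustration, not part of the lecture.

```python
from scipy.optimize import minimize

# min (1/2) x^2 subject to x >= 2; the KKT analysis gives x* = 2, alpha* = 2.
res = minimize(lambda x: 0.5 * x[0] ** 2,
               x0=[5.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 2.0}])  # x - 2 >= 0
print(res.x)   # approximately [2.0]
```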

Optimal Margin Classifier

Back to SVM: rewrite our optimization program as

$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq 1, \; i = 1, \ldots, m$

with constraints $g_i(w, b) = 1 - y_i(w \cdot x_i + b) \leq 0$.

Following the KKT conditions, $\alpha_i > 0$ only for points in the training set with a margin of exactly 1.

These are the support vectors of the training set.

Optimal Margin – Cont.

Optimal margin classifier and its support vectors

Optimal Margin – Cont.

Construct the Lagrangian:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \big[y_i(w \cdot x_i + b) - 1\big]$

To find the dual form $\theta_D(\alpha)$, first minimize $L(w, b, \alpha)$ over $w$ and $b$.

Do so by setting the derivatives to zero:

$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \quad \Rightarrow \quad w^* = \sum_{i=1}^{m} \alpha_i y_i x_i$

Optimal Margin – Cont.

Take the derivative with respect to $b$:

$\frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0$

Use $w^*$ in the Lagrangian:

$L(w^*, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i, j = 1}^{m} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - b \sum_{i=1}^{m} \alpha_i y_i$

We saw the last term is zero, so

$L(w^*, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i, j = 1}^{m} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) =: W(\alpha)$

Optimal Margin – Cont.

The dual optimization problem:

$\max_{\alpha} \; W(\alpha) \quad \text{s.t.} \quad \alpha_i \geq 0, \; i = 1, \ldots, m, \quad \sum_{i=1}^{m} \alpha_i y_i = 0$

The KKT conditions hold, so we can solve it by finding $\alpha^*$ that maximizes $W(\alpha)$.

Assuming we have $\alpha^*$, the solution to the primal problem is

$w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i$

Optimal Margin – Cont.

We still need to find $b^*$.

Assume $x_i$ is a support vector. We get:

$y_i(w^* \cdot x_i + b^*) = 1$

$b^* = \frac{1}{y_i} - w^* \cdot x_i = y_i - w^* \cdot x_i$
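The sketch below solves this dual for a toy separable data set and recovers $(w^*, b^*)$ exactly as above; the cvxpy package and the particular data are assumptions made for illustration.

```python
import cvxpy as cp
import numpy as np

# Dual of the hard-margin SVM: max W(alpha) s.t. alpha >= 0, sum_i alpha_i y_i = 0.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# ||sum_i alpha_i y_i x_i||^2 = sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)
W = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
cp.Problem(cp.Maximize(W), [alpha >= 0, y @ alpha == 0]).solve()

a = alpha.value
w_star = (a * y) @ X                 # w* = sum_i alpha_i* y_i x_i
sv = int(np.argmax(a))               # index of a support vector (alpha_i* > 0)
b_star = y[sv] - w_star @ X[sv]      # b* = y_i - w* . x_i
print(w_star, b_star)
```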

Error Analysis Using Leave-One-Out

The Leave-One-Out (LOO) method:

Remove one point at a time from the training set,

calculate an SVM for the remaining points,

and test the result using the removed point.

Definition: the indicator function I(exp) is 1 if exp is true, otherwise 0, and

$\hat{R}_{LOO} = \frac{1}{m} \sum_{i=1}^{m} I\big(h_{S \setminus \{x_i\}}(x_i) \neq y_i\big)$
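As a sketch of how one might compute $\hat{R}_{LOO}$ in practice (an addition, not from the lecture): retrain on $S \setminus \{x_i\}$ for each $i$ and count mistakes on the held-out point. The scikit-learn SVC class with a linear kernel and a large $C$ is assumed here as an approximation of the hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

def loo_error(X, y, C=1e6):
    # R_hat_LOO = (1/m) * sum_i I( h_{S \ {x_i}}(x_i) != y_i )
    m = len(y)
    mistakes = 0
    for i in range(m):
        mask = np.arange(m) != i                         # remove point i from S
        clf = SVC(kernel="linear", C=C).fit(X[mask], y[mask])
        mistakes += clf.predict(X[i:i + 1])[0] != y[i]
    return mistakes / m
```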

LOO Error Analysis – Cont.

Expected error:

$E_{S \sim D^m}\big[\hat{R}_{LOO}\big] = \frac{1}{m} \sum_{i=1}^{m} E_{S \sim D^m}\Big[I\big(h_{S \setminus \{x_i\}}(x_i) \neq y_i\big)\Big] = E_{S' \sim D^{m-1}}\big[error(h_{S'})\big]$

It follows that the expected LOO error for a training set of size $m$ equals the expected error of a hypothesis trained on a set of size $m - 1$.

LOO Error Analysis – Cont.

Theorem:

$E_{S \sim D^{m-1}}\big[error(h_S)\big] \leq E_{S \sim D^m}\left[\frac{N_{SV}(S)}{m}\right]$

where $N_{SV}(S)$ is the number of support vectors in $S$.

Proof: if $h_{S \setminus \{x_i\}}$ classifies the removed point $x_i$ incorrectly, then $x_i$ must be a support vector of $S$.

Hence: $\hat{R}_{LOO}(S) \leq \frac{N_{SV}(S)}{m}$, and taking expectations gives the theorem.
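To see the bound concretely, the snippet below compares the LOO estimate with the support-vector ratio $N_{SV}(S)/m$ on invented data; the toy clusters and the use of scikit-learn with a large $C$ (as a hard-margin stand-in) are assumptions made here.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian clusters, one per class (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (30, 2)), rng.normal(-2, 1, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)
m = len(y)

clf = SVC(kernel="linear", C=1e6).fit(X, y)
n_sv = clf.support_.size                    # N_SV(S): number of support vectors

loo_mistakes = 0
for i in range(m):                          # Leave-One-Out estimate on the same data
    mask = np.arange(m) != i
    h = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
    loo_mistakes += h.predict(X[i:i + 1])[0] != y[i]

print(loo_mistakes / m, "<=", n_sv / m)     # R_hat_LOO <= N_SV(S) / m
```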

Generalization Bounds Using VC-dimension

Theorem: let $S = \{x_1, \ldots, x_d\}$ with $\|x_i\| \leq R$ for every $x_i \in S$,

and let $d$ be the VC-dimension of the set of hyperplanes $\{x \mapsto \mathrm{sign}(w \cdot x) : \|w\| \leq 1\}$ with margin at least $\gamma$ on $S$.

Then $d \leq \frac{R^2}{\gamma^2}$.

Proof: assume the set $\{x_1, \ldots, x_d\}$ is shattered.

So for every $y \in \{-1, +1\}^d$ there is a $w$ with $\|w\| \leq 1$ such that $y_i (w \cdot x_i) \geq \gamma$ for $i = 1, \ldots, d$.

Summing over $i$:

$d\gamma \leq \sum_{i=1}^{d} y_i (w \cdot x_i) = w \cdot \left(\sum_{i=1}^{d} y_i x_i\right) \leq \|w\| \left\|\sum_{i=1}^{d} y_i x_i\right\| \leq \left\|\sum_{i=1}^{d} y_i x_i\right\|$

Generalization Bounds Using VC-dimension – Cont.

Proof – Cont.

Averaging over the $y$'s with the uniform distribution:

$d^2\gamma^2 \leq E_y\left[\left\|\sum_{i=1}^{d} y_i x_i\right\|^2\right] = \sum_{i, j} E_y[y_i y_j] \, (x_i \cdot x_j)$

Since $E_y[y_i y_j] = 0$ when $i \neq j$ and $E_y[y_i y_j] = 1$ when $i = j$, we can conclude that

$d^2\gamma^2 \leq \sum_{i=1}^{d} \|x_i\|^2 \leq d R^2$

Therefore $d \leq \frac{R^2}{\gamma^2}$.
