Classification – Linear SVM
CS479/579: Data Mining
TRANSCRIPT
Linear Support Vector Machine

Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data.

One possible solution: a hyperplane B1.
[Figure: the data with candidate decision boundary B1]

Another possible solution: a hyperplane B2.
[Figure: the same data with a different candidate boundary B2]

Other possible solutions: many other separating hyperplanes exist as well.
[Figure: the data with additional candidate boundaries]

Which one is better, B1 or B2? How do you define "better"? The bigger the margin, the better.
[Figure: B1 and B2 overlaid on the data]

Find the hyperplane that maximizes the margin; hence B1 is better than B2.
[Figure: B1 with margin hyperplanes b11 and b12, and B2 with margin hyperplanes b21 and b22; the margin is the distance between a boundary's two margin hyperplanes]

The decision boundary B1 and its margin hyperplanes:
  B1:  wᵀx + b = 0
  b11: wᵀx + b = +1
  b12: wᵀx + b = −1

The decision function:
  f(x) = +1 if wᵀx + b ≥ 1
  f(x) = −1 if wᵀx + b ≤ −1

Margin = 2 / ‖w‖
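
The decision rule is straightforward to express in code. A minimal NumPy sketch (function and variable names are illustrative, not from the slides):

    import numpy as np

    def f(w, b, x):
        """Decision function from the slide: which side of B1 the point x falls on.
        Scores in (-1, 1) lie strictly inside the margin."""
        s = np.dot(w, x) + b
        if s >= 1:
            return 1
        if s <= -1:
            return -1
        return 0  # inside the margin; the training constraints keep labeled data out of here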

The margin of the decision boundaries:

Let x1 be a point on b11 (so wᵀx1 + b = +1) and x2 a point on b12 (so wᵀx2 + b = −1). Subtracting the two equations:
  wᵀ(x1 − x2) = 2
  wᵀ(x1 − x2) = ‖w‖ × ‖x1 − x2‖ × cos θ, where ‖·‖ is the norm of a vector
  ‖w‖ × ‖x1 − x2‖ × cos θ = ‖w‖ × d, where d is the length of the vector (x1 − x2) in the direction of w
Thus ‖w‖ × d = 2, and the margin is
  d = 2 / ‖w‖
[Figure: B1 with margin hyperplanes b11 and b12, points x1 and x2 on them, and the margin width d]
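
A quick numeric check of d = 2/‖w‖, with an illustrative weight vector (not from the slides):

    import numpy as np

    w = np.array([3.0, 4.0])      # illustrative weight vector
    d = 2.0 / np.linalg.norm(w)   # margin = 2 / ||w||
    print(d)                      # 0.4, since ||w|| = 5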

Learn a linear SVM Model

The training phase of SVM is to estimate the parameters w and b. The parameters must satisfy the two conditions:
  yi = +1 if wᵀxi + b ≥ +1
  yi = −1 if wᵀxi + b ≤ −1
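
These conditions can be checked mechanically for a candidate (w, b), using the equivalent combined form yi (wᵀxi + b) ≥ 1 derived on the next slide. A minimal sketch (names illustrative):

    import numpy as np

    def satisfies_conditions(w, b, X, y):
        """True if every training point (row of X, label +1/-1 in y)
        satisfies y_i * (w^T x_i + b) >= 1."""
        return bool(np.all(y * (X @ w + b) >= 1))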

Formulate the problem – rationale

We want to maximize the margin
  d = 2 / ‖w‖,
which is equivalent to minimizing
  L(w) = ‖w‖² / 2,
subject to the constraints
  yi = +1 if wᵀxi + b ≥ +1
  yi = −1 if wᵀxi + b ≤ −1
which are equivalent to
  yi (wᵀxi + b) ≥ 1,  i = 1, · · · , N

Formulate the problem

This is formalized as the following constrained optimization problem.
Objective function:
  minimize ‖w‖² / 2
  subject to yi (wᵀxi + b) ≥ 1,  i = 1, · · · , N
It is a convex optimization problem:
  The objective function is quadratic.
  The constraints are linear.
  It can be solved using the standard Lagrange multiplier method.

Lagrange formulation

Lagrangian for the optimization problem (take the constraints into account by rewriting the objective function):
  LP(w, b, λ) = ‖w‖²/2 − ∑_{i=1}^N λi (yi (wᵀxi + b) − 1)
Minimize w.r.t. w and b, and maximize w.r.t. each λi ≥ 0, where the λi are the Lagrange multipliers.
To minimize the Lagrangian, take the derivatives of LP(w, b, λ) w.r.t. w and b and set them to 0:
  ∂LP/∂w = 0 ⟹ w = ∑_{i=1}^N λi yi xi
  ∂LP/∂b = 0 ⟹ ∑_{i=1}^N λi yi = 0

Substituting LP(w, b, λ) to get LD(λ)

Solving LP(w, b, λ) directly is still difficult because it involves a large number of parameters: w, b, and the λi.
Idea: transform the Lagrangian into a function of the Lagrange multipliers only, by substituting w and b in LP(w, b, λ). We get the dual problem:
  LD(λ) = ∑_{i=1}^N λi − (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj
Details: see the next slide.
LD(λ) is convenient to work with because it is a simple quadratic form in the vector λ.
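
The dual objective is easy to evaluate in matrix form; a minimal sketch (names illustrative):

    import numpy as np

    def dual_objective(lam, X, y):
        """L_D(lambda) = sum_i lam_i - 1/2 sum_ij lam_i lam_j y_i y_j x_i^T x_j."""
        Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j
        return lam.sum() - 0.5 * lam @ Q @ lam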

Substitution details to get the dual problem LD(λ)

  ‖w‖²/2 = (1/2) wᵀw = (1/2) (∑_{i=1}^N λi yi xiᵀ) (∑_{j=1}^N λj yj xj) = (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj   (1)

  −∑_{i=1}^N λi (yi (wᵀxi + b) − 1)
    = −∑_{i=1}^N λi yi wᵀxi − ∑_{i=1}^N λi yi b + ∑_{i=1}^N λi   (2)
    = −∑_{i=1}^N λi yi (∑_{j=1}^N λj yj xjᵀ) xi + ∑_{i=1}^N λi   (3)
    = ∑_{i=1}^N λi − ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj   (4)

(Step (3) substitutes w = ∑_{j=1}^N λj yj xj and uses ∑_{i=1}^N λi yi = 0 to drop the b term.)

Adding (1) and (4):
  LD(λ) = ∑_{i=1}^N λi − (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj

Finalized LD(λ)

  LD(λ) = ∑_{i=1}^N λi − (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj

Maximize w.r.t. each λi, subject to
  λi ≥ 0 for i = 1, 2, · · · , N
and
  ∑_{i=1}^N λi yi = 0
Solve LD(λ) using quadratic programming (QP) to obtain all the λi (the QP algorithm itself is out of scope).

Solve LD(λ) – QP

We are maximizing LD(λ):
  max_λ ( ∑_{i=1}^N λi − (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj )
subject to the constraints
  (1) λi ≥ 0 for i = 1, 2, · · · , N, and
  (2) ∑_{i=1}^N λi yi = 0.
Translate the objective to a minimization, because QP packages generally assume minimization:
  min_λ ( (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj − ∑_{i=1}^N λi )

Solve LD(λ) – QP

In matrix form:
  min_λ ( (1/2) λᵀ Q λ + (−1)ᵀ λ )
where Q is the matrix of quadratic coefficients, Q_ij = yi yj xiᵀxj:
  Q = [ y1y1 x1ᵀx1   y1y2 x1ᵀx2   · · ·   y1yN x1ᵀxN ]
      [ y2y1 x2ᵀx1   y2y2 x2ᵀx2   · · ·   y2yN x2ᵀxN ]
      [    · · ·        · · ·     · · ·      · · ·   ]
      [ yNy1 xNᵀx1   yNy2 xNᵀx2   · · ·   yNyN xNᵀxN ]
subject to
  yᵀλ = 0    (linear constraint)
  0 ≤ λ ≤ ∞  (lower and upper bounds)
In compact form: min_λ ( (1/2) λᵀQλ + (−1)ᵀλ ) subject to yᵀλ = 0; λ ≥ 0.
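
This problem can be handed to any QP routine. A minimal sketch using SciPy's general-purpose SLSQP solver (an illustrative choice; the course materials use Octave's QP instead):

    import numpy as np
    from scipy.optimize import minimize

    def solve_dual(X, y):
        """min 0.5*lam^T Q lam - 1^T lam  s.t.  y^T lam = 0, lam >= 0."""
        N = len(y)
        Q = (y[:, None] * y[None, :]) * (X @ X.T)
        fun = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()
        jac = lambda lam: Q @ lam - np.ones(N)
        cons = [{"type": "eq", "fun": lambda lam: y @ lam, "jac": lambda lam: y}]
        res = minimize(fun, np.zeros(N), jac=jac, bounds=[(0, None)] * N,
                       constraints=cons, method="SLSQP")
        return res.x

On the eight-point example two slides below, this should recover λ ≈ (100, 100, 0, · · · , 0), matching the multipliers given on that slide.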

Lagrange multipliers

QP solves for λ = (λ1, λ2, · · · , λN), where most of the λi are zero.
Karush-Kuhn-Tucker (KKT) conditions:
  λi ≥ 0
  Complementary slackness (each constraint is either inactive with λi = 0, or met with equality):
    λi (yi (wᵀxi + b) − 1) = 0
  Either λi is zero, or yi (wᵀxi + b) − 1 = 0.
A support vector is a training instance xi with yi (wᵀxi + b) − 1 = 0 and λi > 0.
Training instances that do not reside along these margin hyperplanes have λi = 0.
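
In code, the support vectors are simply the points whose multipliers are numerically positive; a sketch continuing the solver above:

    import numpy as np

    def support_vector_mask(lam, tol=1e-8):
        """KKT: lam_i > 0 only for support vectors; a tolerance absorbs
        the near-zero values returned by a numerical QP solver."""
        return lam > tol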

Get w and b

w and b depend only on the support vectors xi and their class labels yi.
w value:
  w = ∑_{i=1}^N λi yi xi
b value, for any support vector xi:
  b = yi − wᵀxi
Idea:
  Given a support vector (xi, yi), we have yi (wᵀxi + b) − 1 = 0.
  Multiplying both sides by yi gives yi² (wᵀxi + b) − yi = 0.
  yi² = 1 because yi = +1 or yi = −1.
  Then (wᵀxi + b) − yi = 0, i.e., b = yi − wᵀxi.
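
A sketch of this recovery step, given λ from the QP solver (names illustrative):

    import numpy as np

    def recover_w_b(lam, X, y, tol=1e-8):
        """w = sum_i lam_i y_i x_i; b = y_i - w^T x_i for any support vector."""
        w = (lam * y) @ X                  # rows of X weighted by lam_i * y_i
        i = np.flatnonzero(lam > tol)[0]   # index of one support vector
        b = y[i] - X[i] @ w
        return w, b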

Get w and b – Example

   xi1    xi2    yi    λi
  0.40   0.50   +1    100
  0.50   0.60   −1    100
  0.90   0.40   −1      0
  0.70   0.90   −1      0
  0.17   0.05   +1      0
  0.40   0.35   +1      0
  0.90   0.80   −1      0
  0.20   0.00   +1      0

λ is found with a quadratic programming package; only the first two points (the support vectors) have λi > 0, so only they contribute to the sums.
With wᵀ = (w1, w2):
  w1 = ∑_{i=1}^2 λi yi xi1 = 100 × 1 × 0.4 + 100 × (−1) × 0.5 = −10
  w2 = ∑_{i=1}^2 λi yi xi2 = 100 × 1 × 0.5 + 100 × (−1) × 0.6 = −10
  b = 1 − wᵀx1 = 1 − ((−10) × 0.4 + (−10) × 0.5) = 10
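
The arithmetic checks out in NumPy (data copied from the table above):

    import numpy as np

    X = np.array([[0.4, 0.5], [0.5, 0.6], [0.9, 0.4], [0.7, 0.9],
                  [0.17, 0.05], [0.4, 0.35], [0.9, 0.8], [0.2, 0.0]])
    y = np.array([1, -1, -1, -1, 1, 1, -1, 1])
    lam = np.array([100.0, 100.0, 0, 0, 0, 0, 0, 0])

    w = (lam * y) @ X      # -> array([-10., -10.])
    b = y[0] - X[0] @ w    # -> 10.0  (the first point is a support vector)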

SVM: classify a test data point z

  yz = sign(wᵀz + b) = sign( (∑_{i=1}^N λi yi xiᵀ) z + b )

If yz = +1, the test instance is classified as the positive class.
If yz = −1, the test instance is classified as the negative class.
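
Note that the second form needs only the inner products xiᵀz, which is what makes the kernel substitution on the next slides possible. A sketch of both forms, continuing the example above (z is a hypothetical test point):

    import numpy as np

    z = np.array([0.2, 0.1])                   # hypothetical test point
    y_primal = np.sign(w @ z + b)              # sign(w^T z + b)
    y_dual = np.sign((lam * y) @ (X @ z) + b)  # sign(sum_i lam_i y_i x_i^T z + b)
    assert y_primal == y_dual                  # identical by construction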

SVM – discussions

What if the problem is not linearly separable?

Nonlinear Support Vector Machines

What if the decision boundary is not linear?
Do not work in the X space. Transform the data into a higher-dimensional Z space such that the data are linearly separable:

  LD(λ) = ∑_{i=1}^N λi − (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj xiᵀxj
    →
  LD(λ) = ∑_{i=1}^N λi − (1/2) ∑_{i=1}^N ∑_{j=1}^N λi λj yi yj ziᵀzj

Apply the linear SVM as before; the support vectors now live in Z space.
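
As an illustration, with a hypothetical quadratic feature map φ (the slide does not fix a particular transform), training only requires replacing the inner products:

    import numpy as np

    def phi(x):
        """Map a 2-D input into a higher-dimensional Z space (illustrative choice)."""
        x1, x2 = x
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2, x1, x2, 1.0])

    # The dual only needs z_i^T z_j, so the same linear-SVM QP applies
    # with X @ X.T replaced by Z @ Z.T.
    Z = np.array([phi(x) for x in X])   # X from the earlier example
    Q_z = np.outer(y, y) * (Z @ Z.T)    # quadratic coefficients in Z space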

Quadratic programming packages – Octave

https://www.gnu.org/software/octave/doc/interpreter/Quadratic-Programming.html

Octave's qp solves the quadratic program
  min_x ( 0.5 xᵀ H x + xᵀ q )
subject to
  A x = b
  lb ≤ x ≤ ub
  A_lb ≤ A_in x ≤ A_ub