Support Vector Machines
Supervised Learning
• Labeled data: examples with feature vectors and labels.
  [Figure: labeled images (quail, apple, corn) with feature vectors such as (1.1, -0.5, 0, 0, 0.3, …) and (-1, 0, 1.2, -0.4, 0.1, …)]
• Model class: consider classifiers of the form $y = f(x; w)$.
• Learning: find $w$ that works well on the training data $(x, y)$.
• Optimization!
Linear Classifiers
• A simple and effective family of classifiers: $y = \mathrm{sign}[w \cdot x + b]$
• The training problem:
  • Given a set of $n$ training points $(x_i, y_i)$
  • Find the "best fitting" classifier $w, b$
Training Linear Classifiers
• $y = \mathrm{sign}[w \cdot x + b]$
• How do we find it?
• If there exists a classifier with zero training error, we can find one with the perceptron algorithm (a minimal sketch follows below).
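As an illustration only (not part of the lecture), here is a minimal NumPy sketch of the perceptron on made-up separable data; the toy data, epoch limit, and variable names are assumptions for the example.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Minimal perceptron: returns (w, b) with zero training error
    if the data are linearly separable and max_epochs is large enough."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # move the hyperplane toward the point
                b += yi
                mistakes += 1
        if mistakes == 0:                # zero training error reached
            break
    return w, b

# Toy separable data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # should match y
```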
Many Possible Solutions
• If there exists one solution, there exist many.
• Which one should we choose?
• Intuitively: one that’s farther away from the points.
Maximum Margin Classifier
• For every point $x_i$, denote by $d(x_i, w, b)$ its distance from the hyperplane.
• Margin of a classifier: the shortest distance to the hyperplane, $\min_i d(x_i, w, b)$.
• Goal: find the classifier that maximizes $\min_i d(x_i, w, b)$.
ML and Optimization
• We have an optimization problem
• Namely we want to find a set of parameters that will maximize some objective function (the margin) subject to some constraints (classifying correctly)
• Need a toolbox for solving such problems
• In what follows we provide an overview
Unconstrained Optimization
• Use $w$ to denote the optimization variables
• For example: $\min_{w_1, w_2} (w_1 - 2w_2)^2$
• Generally: $\min_w f(w)$
• Solve by:
  • Find all $w$ such that $\frac{\partial f(w)}{\partial w} = 0$
  • These stationary points are the candidates for the global minimum (and asymptotes)
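A small numeric check of this recipe on the example above (illustrative only, assuming NumPy): the gradient of $(w_1 - 2w_2)^2$ vanishes exactly on the line $w_1 = 2w_2$, and every such stationary point attains the global minimum.

```python
import numpy as np

# f(w1, w2) = (w1 - 2*w2)^2
def f(w):
    return (w[0] - 2 * w[1]) ** 2

def grad_f(w):
    g = 2 * (w[0] - 2 * w[1])
    return np.array([g, -2 * g])   # partial derivatives w.r.t. w1 and w2

# Any point with w1 = 2*w2 is stationary, e.g. (4, 2)
w_star = np.array([4.0, 2.0])
print(grad_f(w_star))  # [0. 0.]
print(f(w_star))       # 0.0 -- also the global minimum, since f >= 0 everywhere
```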
Constrained Minimization
• Suppose we are only interested in variables that satisfy $h(w) = 0$.
• The optimization problem is: $\min f(w)$ s.t. $h(w) = 0$
• The zero-gradient point may not satisfy the constraint.
[Figure: level sets of $f$ in the $(w_1, w_2)$ plane with the constraint curve $h(w) = 0$]
Directional Derivative
• Given a function $f(w)$ and a direction $v$ with $\|v\|_2 = 1$.
• What happens to $f(w + \alpha v)$ if we make a small change $\alpha$ in direction $v$?
• The directional derivative is $\nabla f(w) \cdot v$.
• A direction along the curve $h(w) = 0$ has zero directional derivative of $h$.
• Thus the gradient $\nabla h(w)$ is orthogonal to the curve.
[Figure: the curve $h(w) = 0$ in the $(w_1, w_2)$ plane with a point $w$, a direction $v$, and the gradients $\nabla f(w)$, $\nabla h(w)$]
Constrained Minimization
• The optimization problem is: $\min f(w)$ s.t. $h(w) = 0$
• Consider $f$ along the curve $\{w : h(w) = 0\}$.
• A vector of movement $v$ along the curve is orthogonal to $\nabla h(w)$.
• The gradient of $f$ along the curve is $\nabla f(w) \cdot v$.
• It is zero for every such $v$ iff $\nabla f(w) = \lambda \nabla h(w)$ for some $\lambda$.
Lagrange Multiplier
• The optimum points should satisfy:
  1. $\nabla f(w) = \lambda \nabla h(w)$ for some $\lambda$
  2. $h(w) = 0$ (constraint satisfied)
• Alternative formulation. Define the Lagrangian: $L(w, \lambda) = f(w) + \lambda h(w)$
• The optimum should satisfy:
  1. $\nabla_w L(w, \lambda) = 0$
  2. $\nabla_\lambda L(w, \lambda) = 0$
Example
• What is the distance between the hyperplane $w \cdot x + b = 0$ and a point $\bar{x}$?
• Solve: $\min_{x : w \cdot x + b = 0} 0.5\|x - \bar{x}\|_2^2$
• Lagrangian: $L(x, \lambda) = 0.5\|x - \bar{x}\|_2^2 + \lambda (w \cdot x + b)$
• $\nabla_x L(x, \lambda) = (x - \bar{x}) + \lambda w = 0$, so $x = \bar{x} - \lambda w$
• Use primal feasibility to solve for $\lambda$: $(\bar{x} - \lambda w) \cdot w + b = 0$, giving $\lambda = \frac{\bar{x} \cdot w + b}{\|w\|^2}$
• Therefore $\|x - \bar{x}\|_2 = |\lambda|\,\|w\| = \frac{|\bar{x} \cdot w + b|}{\|w\|}$
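A quick sanity check of this closed form (illustrative, assuming NumPy and SciPy): solve the constrained problem with a generic solver and compare against $|\bar{x} \cdot w + b| / \|w\|$; the particular $w$, $b$, and point are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

w, b = np.array([3.0, 4.0]), -2.0
x_bar = np.array([1.0, 5.0])

# Closed form from the Lagrangian derivation above
dist_closed = abs(x_bar @ w + b) / np.linalg.norm(w)

# Solve min 0.5*||x - x_bar||^2  s.t.  w.x + b = 0 with a generic constrained solver
res = minimize(lambda x: 0.5 * np.sum((x - x_bar) ** 2),
               x0=np.zeros(2),
               constraints=[{"type": "eq", "fun": lambda x: w @ x + b}])
dist_numeric = np.linalg.norm(res.x - x_bar)

print(dist_closed, dist_numeric)  # the two values should agree (4.2)
```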
Multiple Constraints
• Solve: $\min f(w)$ s.t. $h_i(w) = 0 \;\; \forall i = 1, \ldots, p$
• Introduce a multiplier per constraint: $\lambda_1, \ldots, \lambda_p$
• Lagrangian: $L(w, \lambda) = f(w) + \sum_i \lambda_i h_i(w)$
• Optimality conditions:
  1. $\nabla_w L(w, \lambda) = 0$
  2. $\nabla_{\lambda_i} L(w, \lambda) = 0$
• There may be several such points. Need to check which one is the global optimum.
Inequality Constraints
• Solve: $\min f(w)$ s.t. $h(w) \le 0$
• Optimality conditions:
  1. $h(w) \le 0$ (constraint satisfied)
  2a. When $h(w) = 0$: we are "stuck" if the only directions that decrease $f$ take us outside the constraint set. Namely: $\nabla f(w) = -\alpha \nabla h(w)$ for some $\alpha \ge 0$
  2b. When $h(w) < 0$: we need $\nabla f(w) = 0$
[Figure: the feasible region $h(w) \le 0$ in the $(w_1, w_2)$ plane with $\nabla h(w)$ and $-\nabla f(w)$, contrasting "progress possible" with "stuck"]
Complementary Slackness
• Either $h(w) = 0$ and $\nabla f(w) = -\alpha \nabla h(w)$, or $h(w) < 0$ and $\nabla f(w) = 0$.
• Summarize as:
  $\nabla f(w) = -\alpha \nabla h(w)$
  $\alpha h(w) = 0$
  $\alpha \ge 0, \;\; h(w) \le 0$
• Called the Karush-Kuhn-Tucker (KKT) conditions. Always necessary.
• Sufficient for convex optimization.
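A tiny worked instance of these conditions (made up for illustration): $\min (w - 3)^2$ s.t. $w - 1 \le 0$. The unconstrained minimum $w = 3$ is infeasible, so the constraint is active at the optimum $w = 1$.

```python
# KKT check for: min (w - 3)^2  s.t.  h(w) = w - 1 <= 0
w_star = 1.0                 # candidate optimum, on the constraint boundary
grad_f = 2 * (w_star - 3)    # = -4
grad_h = 1.0
alpha = -grad_f / grad_h     # from grad_f = -alpha * grad_h  =>  alpha = 4

h = w_star - 1               # = 0: constraint satisfied with equality
print(alpha >= 0)            # True: the multiplier is non-negative
print(alpha * h == 0)        # True: complementary slackness holds
```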
Lagrange Multipliers
• Consider the general problem:
  $\min f(w)$ s.t. $h_i(w) = 0 \;\; \forall i = 1, \ldots, p$ and $g_i(w) \le 0 \;\; \forall i = 1, \ldots, m$
• Define the Lagrangian: $L(w, \lambda, \alpha) = f(w) + \sum_i \lambda_i h_i(w) + \sum_i \alpha_i g_i(w)$
• The optimum must satisfy:
  $\nabla_w L(w, \lambda, \alpha) = 0$
  $\alpha_i g_i(w) = 0 \;\; \forall i, \quad \alpha_i \ge 0, \quad g_i(w) \le 0, \quad h_i(w) = 0$
• Typically easy to solve for $w$ if someone hands us $\alpha, \lambda$!
Convex Optimization
• A general optimization problem may have many local minima/maxima and saddle points.
• This makes minimization hard (e.g., exponential in dimension).
• Convex optimization problems are a "nice" subclass. Require:
  • Convex $f(w)$, $g_i(w)$
  • Linear $h_i(w)$
[Figure: a non-convex $f(w)$ with several local optima vs. a convex $f(w)$]
Convex Optimization
• A function is convex if:
  • Its value on a line segment is below the linear interpolation of its endpoints: $f(t w_1 + (1 - t) w_2) \le t f(w_1) + (1 - t) f(w_2)$ for $t \in [0, 1]$.
  • Equivalently, its second derivative (or Hessian) is non-negative (positive semidefinite).
• Examples: $f(w) = w \cdot x$, $\; f(w) = \max[w \cdot x, 0]$, $\; f(w) = w^T A w$ with $A \succeq 0$
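A small numeric illustration of the chord condition for the quadratic example (assuming NumPy; the PSD matrix $A$ and the sampled points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B @ B.T                        # A = B B^T is positive semidefinite by construction

def f(w):
    return w @ A @ w               # f(w) = w^T A w

w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
for t in np.linspace(0, 1, 11):
    on_segment = f(t * w1 + (1 - t) * w2)   # value on the line segment
    chord = t * f(w1) + (1 - t) * f(w2)     # linear interpolation of endpoint values
    assert on_segment <= chord + 1e-9       # convexity: value below the chord
print("chord condition holds at all sampled points")
```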
Convex Optimization
• Nice things:
• No local optima
• KKT conditions are sufficient for global optimality.
• Multipliers can be solved via dual.
Convex Duality
• For every convex problem, we can define a dual problem that has the same value.
• Optimization is over the Lagrange multipliers.
• Solution to dual implies solution to primal via KKT
• Dual might be easier to solve.
Convex Duality
• Recall the Lagrangian: $L(w, \lambda, \alpha) = f(w) + \sum_i \lambda_i h_i(w) + \sum_i \alpha_i g_i(w)$
• Then: $\min f(w)$ s.t. $h_i(w) = 0, \; g_i(w) \le 0$ equals $\min_w \max_{\lambda, \alpha \ge 0} L(w, \lambda, \alpha)$. Why? (The inner max is $+\infty$ whenever some constraint is violated, and equals $f(w)$ otherwise.)
• Swapping the min and max gives: $\min_w \max_{\lambda, \alpha \ge 0} L(w, \lambda, \alpha) \ge \max_{\lambda, \alpha \ge 0} \min_w L(w, \lambda, \alpha)$
• In the convex case it is an equality.
Convex Duality
• Define: $g(\lambda, \alpha) = \min_w L(w, \lambda, \alpha)$
• Dual problem: $\max_{\lambda, \alpha \ge 0} g(\lambda, \alpha)$
• Has the same value as the primal problem.
• The resulting $\lambda, \alpha$ are optimal. You can recover the "primal" variables $w$ via the KKT conditions:
  $\nabla_w L(w, \lambda, \alpha) = 0$
  $\alpha_i g_i(w) = 0 \;\; \forall i, \quad \alpha_i \ge 0, \quad g_i(w) \le 0, \quad h_i(w) = 0$
  This is often easy.
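A one-dimensional worked example (chosen only for illustration): $\min_w w^2$ s.t. $1 - w \le 0$, whose primal optimum is $w = 1$ with value 1. The sketch below forms $g(\alpha)$ in closed form, maximizes it on a grid, and recovers $w$ from $\nabla_w L = 0$.

```python
import numpy as np

# Primal: min_w w^2  s.t.  g(w) = 1 - w <= 0      (optimum: w = 1, value 1)
# Lagrangian: L(w, a) = w^2 + a*(1 - w)
# Inner minimization over w: dL/dw = 2w - a = 0  =>  w = a/2
def g_dual(a):
    w = a / 2
    return w ** 2 + a * (1 - w)   # = a - a^2/4

alphas = np.linspace(0, 5, 501)   # grid over the feasible multipliers a >= 0
a_star = alphas[np.argmax(g_dual(alphas))]
print(a_star, g_dual(a_star))     # a* = 2, dual value 1: same value as the primal
print(a_star / 2)                 # recovered primal variable w = 1
```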
Maximum Margin Classifier
• For every point $x_i$, denote by $d(x_i, w, b)$ its distance from the hyperplane.
• Margin of a classifier: the shortest distance to the hyperplane, $\min_i d(x_i, w, b)$.
• Goal: find the classifier that maximizes $\min_i d(x_i, w, b)$.
Geometry of Linear Classifiers
• $y = \mathrm{sign}[w \cdot x + b]$
• $w$ is the direction orthogonal to the hyperplane $w \cdot x + b = 0$.
• Proof: if $x_1, x_2$ are on the hyperplane, then $w \cdot (x_1 - x_2) = 0$.
• What is $b$? The distance from the origin to the hyperplane is $\frac{|b|}{\|w\|}$.
• The distance from a point $x$ to the hyperplane is $\frac{|x \cdot w + b|}{\|w\|}$.
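A quick numeric check of these facts for an arbitrary hyperplane (illustrative, assuming NumPy):

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 3.0

# Two points on the hyperplane w.x + b = 0 (found by solving for one coordinate)
x1 = np.array([0.0, 3.0])            # 2*0 - 3 + 3 = 0
x2 = np.array([1.0, 5.0])            # 2*1 - 5 + 3 = 0
print(w @ (x1 - x2))                 # 0.0: w is orthogonal to the hyperplane

print(abs(b) / np.linalg.norm(w))    # distance from the origin to the hyperplane

x = np.array([4.0, 0.0])
print(abs(x @ w + b) / np.linalg.norm(w))  # distance from x to the hyperplane
```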
Max Margin Hyperplane
• Find a hyperplane that maximizes the minimum distance.
• Solve: $\max_w \frac{1}{\|w\|} \min_i |w \cdot x_i + b|$ s.t. $y_i (w \cdot x_i + b) \ge 0$
• Any solution $(w, b)$ can be rescaled to $(cw, cb)$ without affecting the objective or the constraints.
• We can therefore rescale such that $\min_i |w \cdot x_i + b| = 1$:
  $\max_w \|w\|^{-1}$ s.t. $y_i (w \cdot x_i + b) \ge 0, \;\; \min_i |w \cdot x_i + b| = 1$
Max Margin Hyperplane
• $\max_w \|w\|^{-1}$ s.t. $y_i (w \cdot x_i + b) \ge 0, \;\; \min_i |w \cdot x_i + b| = 1$
• Equivalently: $\max_w \|w\|^{-1}$ s.t. $\min_i y_i (w \cdot x_i + b) = 1$
• We can relax to an inequality (why?): $\min_i y_i (w \cdot x_i + b) \ge 1$
• This gives: $\max_w \|w\|^{-1}$ s.t. $y_i (w \cdot x_i + b) \ge 1$
• Equivalently: $\min_w \|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$
Support Vector Machines (SVM)
• The SVM classifier is the solution to: $\min_w 0.5\|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$ (the factor 0.5 doesn't affect the optimum).
• The $x_i$ where this constraint holds with equality are the "support vectors".
• It is a convex optimization problem, namely a convex quadratic program (quadratic objective and linear constraints).
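A hedged sketch of handing this quadratic program to a generic solver (here SciPy's SLSQP via `scipy.optimize.minimize`; the toy data and variable names are assumptions for the example):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

# Variables z = (w, b); objective 0.5*||w||^2; constraints y_i*(w.x_i + b) - 1 >= 0
def objective(z):
    return 0.5 * np.sum(z[:d] ** 2)

constraints = [{"type": "ineq",
                "fun": lambda z, xi=xi, yi=yi: yi * (xi @ z[:d] + z[d]) - 1}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints)
w, b = res.x[:d], res.x[d]
print(w, b)
print(y * (X @ w + b))  # all >= 1; the points where this equals 1 are the support vectors
```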
SVM History
• Initial version by Vapnik and Chervonenkis (63)
• Non-linear version by Boser, Guyon, Vapnik (92)
• Much work on generalization theory since (by Bartlett, Shawe-Taylor, Mendelson, Schoelkopf, Smola, and others).
• Many variants for regression, unsupervised learning, etc.
Solving SVM
• The SVM classifier is the solution to: $\min_w 0.5\|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$
• You can plug this into a solver and get $w, b$.
• Let's use the Lagrangian to understand the solution:
  $L(w, b, \alpha) = 0.5\|w\|^2 + \sum_i \alpha_i \left[1 - y_i (w \cdot x_i + b)\right]$
  $\nabla_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
  $\nabla_b L(w, b, \alpha) = -\sum_i \alpha_i y_i = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$
The Representer Theorem
• The optimal weight vector is a weighted combination of the data points: $w = \sum_i \alpha_i y_i x_i$
• This will be very important!
• When is $\alpha_i = 0$? (recall KKT)
  • Whenever $y_i (w \cdot x_i + b) > 1$
  • $\alpha_i > 0$ only when $y_i (w \cdot x_i + b) = 1$
• The optimal weight vector is a combination of the support vectors only!
Deriving via Dual
• How do we find the $\alpha_i$ in $w = \sum_i \alpha_i y_i x_i$, and then $b$?
• Use the dual!
• Recall $L(w, b, \alpha) = 0.5\|w\|^2 + \sum_i \alpha_i \left[1 - y_i (w \cdot x_i + b)\right]$ and $g(\alpha) = \min_{w, b} L(w, b, \alpha)$.
• We know the minimizing $w$. Plug it into the Lagrangian:
  $g(\alpha) = 0.5 \Big\|\sum_i \alpha_i y_i x_i\Big\|_2^2 - \sum_i \alpha_i \left[ y_i \Big( \Big(\sum_j \alpha_j y_j x_j\Big) \cdot x_i + b \Big) - 1 \right]$
  $= \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
• Constrain $\sum_i \alpha_i y_i = 0$, because otherwise $g(\alpha) = -\infty$.
The SVM Dual
• The dual problem is: $\max_\alpha \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ s.t. $\alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0$
• The number of variables and constraints equals the number of training points.
• Also a convex quadratic program (why?)
• Obtaining the primal $w$: $w = \sum_i \alpha_i y_i x_i$
Finding b
• Recall from KKT that support vectors ($\alpha_i > 0$) satisfy: $y_i (w \cdot x_i + b) = 1$
• Since we know $w$, we can solve for $b$.
• This should give the same value for all support vectors.
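A sketch of solving the dual numerically and recovering $w$ and $b$ from the support vectors (illustrative only; it reuses SciPy's generic SLSQP solver and the same kind of toy data as above):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
K = X @ X.T                                   # Gram matrix of inner products x_i . x_j

def neg_dual(a):                              # minimize the negative dual objective
    return 0.5 * np.sum(np.outer(a * y, a * y) * K) - np.sum(a)

res = minimize(neg_dual, x0=np.zeros(n),
               bounds=[(0, None)] * n,        # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                           # representer theorem: w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                             # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                # y_i*(w.x_i + b) = 1 and y_i in {-1,+1} => b = y_i - w.x_i
print(alpha)
print(w, b)
```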
Non-Separable Case
• So far we assumed a separating hyperplane exists.
• If it doesn't, our optimization problem is infeasible.
• For real data, we don't want to make this assumption. Because:
  • The data may be noisy. A linear classifier may still do OK.
  • The data may come from a non-linear rule. Next class!
Non-Separable Case
• Ideally, we would like to find the classifier that minimizes the training error.
• But:
  • It turns out this is NP-hard.
  • How do we incorporate margin?
• Let's start from the separable case.
Non-Separable Case
• Separable case: $\min_w 0.5\|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$
• Need to "relax" the constraints.
• Allow violation by $\xi_i \ge 0$, but "pay" for the violation:
  $\min_w 0.5\|w\|^2 + C\sum_i \xi_i$ s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$
• $C$ is a constant that determines how much we care about classification errors as opposed to margin.
Dual for the Non-Separable Case
• The dual is: $\max_\alpha \sum_i \alpha_i - 0.5 \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ s.t. $0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$
• Mapping to the primal is as before.
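In practice one usually calls an off-the-shelf solver. A sketch using scikit-learn's linear SVC (assuming scikit-learn is installed; the synthetic data and the value of $C$ are arbitrary choices for the example):

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, possibly non-separable toy data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (20, 2)), rng.normal(-1.0, 1.0, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# C trades off the margin against the slack penalty, as in the primal above
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # the primal w and b
print(len(clf.support_))           # number of support vectors (points with alpha_i > 0)
```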
Alternative Interpretation
• The primal is: $\min_w 0.5\|w\|^2 + C\sum_i \xi_i$ s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$
• We can solve for $\xi_i$ to get: $\xi_i = \max\left[0, 1 - y_i (w \cdot x_i + b)\right]$
• The problem becomes: $\min_w C\sum_i \max\left[0, 1 - y_i (w \cdot x_i + b)\right] + 0.5\|w\|_2^2$
Alternative Interpretation
• The primal is: $\min_w C\sum_i \max\left[0, 1 - y_i\, w \cdot x_i\right] + 0.5\|w\|_2^2$ (writing $w \cdot x_i$ for $w \cdot x_i + b$ for brevity)
• The function $\max\left[0, 1 - y_i\, w \cdot x_i\right]$ is called the hinge loss.
• SVM uses the hinge loss as an approximation to the 0-1 loss.
[Figure: the hinge loss and the 0-1 loss plotted as functions of $y_i\, w \cdot x_i$]
• It upper bounds the true classification error.
• A convex upper bound!
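A small numeric check of the upper-bound claim (illustrative, assuming NumPy; the grid of margin values is arbitrary):

```python
import numpy as np

margins = np.linspace(-3, 4, 15)         # values of y_i * (w . x_i)
hinge = np.maximum(0, 1 - margins)       # hinge loss
zero_one = (margins <= 0).astype(float)  # 0-1 loss: an error iff the margin is not positive

print(np.all(hinge >= zero_one))         # True: the hinge loss upper bounds the 0-1 loss
```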
Alternative Interpretation
• The primal is: $\min_w C\sum_i \max\left[0, 1 - y_i\, w \cdot x_i\right] + 0.5\|w\|_2^2$, i.e., a bound on the loss plus regularization.
• A very common design pattern.
• Other losses and regularizers can be considered:
  • Logistic loss: $\frac{1}{\ln 2}\ln\left(1 + e^{-y_i\, w \cdot x_i}\right)$
  • L1 regularization: $\|w\|_1 = \sum_i |w_i|$. Sparsity inducing.
SVM and Generalization
• Intuitively, choosing a large margin should improve generalization.
• Assume the true distribution and classifier are such that the margin is $\gamma$.
• Expect generalization to behave like $\gamma^{-1}$.
• But $\gamma$ can always be increased by rescaling.
• Denote by $R$ the largest norm of $x$.
• Generalization scales with $R\gamma^{-1}$.
SVM and Generalization
• Assume the training error is zero.
• It can be shown that the generalization error satisfies (up to some logarithmic factors):
  $\mathrm{error}(w) \le \frac{c_1}{m}\,\frac{R^2}{\gamma^2} + \frac{c_2}{m}\log\frac{m}{\delta}$
• The VC dimension is replaced by $\frac{R^2}{\gamma^2}$.
• Appeared in "Structural Risk Minimization over Data-Dependent Hierarchies" (98)
Leave-One-Out Bounds
• Another intuition: using few support vectors should lead to good generalization.
• We will show this via the leave-one-out error.
• Denote by $S^{-i}$ the training sample without $(x_i, y_i)$.
• Denote by $h_S$ the hypothesis obtained by training on $S$.
• $\hat{R}_{LOO}(S) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\left[h_{S^{-i}}(x_i) \neq y_i\right]$
Leave-One-Out Bounds
• The LOO error is similar in spirit to the generalization error, but we only train on $m - 1$ points.
• Denote by $R(h) = \mathbb{E}_{(x,y)\sim D}\, \mathbb{I}\left[h(x) \neq y\right]$ the generalization error of $h$.
• Can show: $\mathbb{E}_{S_m}\left[\hat{R}_{LOO}(S_m)\right] = \mathbb{E}_{S_{m-1}}\left[R(h_{S_{m-1}})\right]$
• The LOO error and the generalization error have the same expected value.
Leave-One-Out Bounds for SVM
• What is the expected LOO error of SVM (separable case)?
• If a non-support vector is left out, the solution does not change, and the error on it is zero.
• Otherwise there might be an error: $\hat{R}_{LOO}(S_m) \le \frac{N_{SV}(S_m)}{m}$
• Therefore: $\mathbb{E}_{S_{m-1}}\left[R(h_{S_{m-1}})\right] \le \frac{1}{m}\mathbb{E}_{S_m}\left[N_{SV}(S_m)\right]$
• Generalization is related to the number of support vectors.
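A sketch of checking the bound empirically with scikit-learn (illustrative assumptions: synthetic separable data, and hard-margin behaviour approximated by a very large $C$):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (30, 2)), rng.normal(-2.0, 0.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)
m = len(y)

# Train on the full sample; count support vectors
full = SVC(kernel="linear", C=1e6).fit(X, y)
n_sv = len(full.support_)

# Leave-one-out error: retrain m times, each time without point i
loo_errors = 0
for i in range(m):
    keep = np.arange(m) != i
    clf = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])
    loo_errors += clf.predict(X[i:i + 1])[0] != y[i]

print(loo_errors / m, "<=", n_sv / m)  # the LOO error is bounded by #support vectors / m
```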