TRANSCRIPT · 2020-03-05
Support Vector Machines: Training with Stochastic Gradient Descent
Machine Learning, Spring 2020
The slides are mainly from Vivek Srikumar
1
Support vector machines
• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels
2
SVM objective function
3
Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences
Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences
C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
The SVM objective, in stages:

Hard-margin SVM:
min_{w,b} ½ wᵀw
s.t. ∀i, yi(wᵀxi + b) ≥ 1

Soft-margin SVM with slack variables ξi:
min_{w,b,ξ} ½ wᵀw + C Σi ξi
s.t. ∀i, yi(wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0

Eliminating the slack variables:
min_{w,b} ½ wᵀw + C Σi max(0, 1 − yi(wᵀxi + b))

where LHinge(y, x, w, b) = max(0, 1 − y(wᵀx + b))

In simplified notation (bias folded into w):
min_w ½ w0ᵀw0 + C Σi max(0, 1 − yi wᵀxi)
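The final unconstrained objective can be evaluated directly. A minimal sketch in numpy (the function name `svm_objective` and the toy data are my own, not from the slides):

```python
import numpy as np

def svm_objective(w0, b, X, y, C):
    """Regularized hinge loss: 1/2 w0.w0 + C * sum_i max(0, 1 - y_i(w0.x_i + b))."""
    margins = y * (X @ w0 + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w0 @ w0 + C * hinge.sum()

# Two toy examples, one per class
X = np.array([[1.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))  # both margins >= 1, so only the regularizer contributes: 0.25
```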
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
4
Solving the SVM optimization problem
This function is convex in w, b.
For convenience, use simplified notation:
w0 ← w
w ← [w0; b]
xi ← [xi; 1]
6
216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269
maxw
1
2w>w
s.t.8i, yi(w>xi + b) � 1
maxw
1
2w>w + C
X
i
⇠i
s.t.8i, yi(w>xi + b) � 1� ⇠i
⇠i � 0
maxw,b
1
2w>w + C
X
i
max�0, 1� yi(w
>xi + b)�
LHinge(y,x,w, b) = max�0, 1� y(w>x+ b)
�
minw
1
2w>
0 w0 + CX
i
max(0, 1� yiw>xi)
5
216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269
maxw
1
2w>w
s.t.8i, yi(w>xi + b) � 1
maxw
1
2w>w + C
X
i
⇠i
s.t.8i, yi(w>xi + b) � 1� ⇠i
⇠i � 0
minw,b
1
2w>w + C
X
i
max�0, 1� yi(w
>xi + b)�
LHinge(y,x,w, b) = max�0, 1� y(w>x+ b)
�
minw
1
2w>
0 w0 + CX
i
max(0, 1� yiw>xi)
5
Recall: Convex functions
A function 𝑓 is convex if for every 𝒖, 𝒗 in the domain, and for every 𝜆 ∈ [0,1] we have
𝑓(𝜆𝒖 + (1 − 𝜆)𝒗) ≤ 𝜆𝑓(𝒖) + (1 − 𝜆)𝑓(𝒗)
From the geometric perspective: every tangent plane lies below the function.
[Figure: a convex curve with the chord between (𝒖, 𝑓(𝒖)) and (𝒗, 𝑓(𝒗)) lying above it]
7
Convex functions
9
Examples: linear functions are convex; the max of convex functions is convex.
Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is nonnegative (for one-dimensional functions)
3. Showing that the Hessian is positive semi-definite (for multivariate functions)
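Way 1 above can be spot-checked numerically. A small script (my own illustration, not from the slides) that samples random points and verifies the convexity inequality for the hinge function max(0, 1 − z):

```python
import random

def hinge(z):
    # max(0, 1 - z): a max of two linear functions, hence convex
    return max(0.0, 1.0 - z)

# Spot-check the definition: f(l*u + (1-l)*v) <= l*f(u) + (1-l)*f(v)
random.seed(0)
for _ in range(10000):
    u, v = random.uniform(-5, 5), random.uniform(-5, 5)
    lam = random.random()
    lhs = hinge(lam * u + (1 - lam) * v)
    rhs = lam * hinge(u) + (1 - lam) * hinge(v)
    assert lhs <= rhs + 1e-12
print("hinge passed the convexity spot-check")
```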
Not all functions are convex
10
These are concave: 𝑓(𝜆𝒖 + (1 − 𝜆)𝒗) ≥ 𝜆𝑓(𝒖) + (1 − 𝜆)𝑓(𝒗)
These are neither
Convex functions are convenient
A function 𝑓 is convex if for every 𝒖, 𝒗 in the domain, and for every 𝜆 ∈ [0,1] we have
𝑓(𝜆𝒖 + (1 − 𝜆)𝒗) ≤ 𝜆𝑓(𝒖) + (1 − 𝜆)𝑓(𝒗)
In general: a necessary condition for x to be a minimum of the function f is ∇f(x) = 0
For convex functions, this is both necessary and sufficient
11
Solving the SVM optimization problem
This function is convex in w
• This is a quadratic optimization problem because the objective is quadratic
• Older methods: used techniques from Quadratic Programming
  – Very slow
• No constraints, so we can use gradient descent
  – Still very slow!
13
Gradient descent
General strategy for minimizing a function J(w)
• Start with an initial guess for w, say w0
• Iterate till convergence:
  – Compute the gradient of J at wt
  – Update wt to get wt+1 by taking a step in the opposite direction of the gradient
Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
[Figure: J(w) curve with successive iterates w0, w1, w2, … descending toward the minimum]
14
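The strategy above can be sketched as a generic loop (a minimal illustration with a fixed step size; the function name and the toy quadratic objective are my own, not from the slides):

```python
import numpy as np

def gradient_descent(grad, w0, r=0.1, steps=100):
    """Generic template: repeatedly step against the gradient.
    grad: function returning the gradient of J at w; r: learning rate."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - r * grad(w)
    return w

# Minimize J(w) = (w - 3)^2, whose gradient is 2(w - 3); minimum at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3.0), np.array([0.0]))
print(w_star)  # close to [3.]
```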
We are trying to minimize
J(w) = ½ w0ᵀw0 + C Σi max(0, 1 − yi wᵀxi)
Gradient descent for SVM
1. Initialize w0
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update: wt+1 ← wt − r ∇J(wt)
r: called the learning rate.
18
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
19
Gradient descent for SVM
1. Initialize w0
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update: wt+1 ← wt − r ∇J(wt)
r: called the learning rate
The gradient of the SVM objective requires summing over the entire training set
Slow, does not really scale
20
Stochastic gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − 𝛾t ∇Jt(wt-1)
3. Return final w
21
We are trying to minimize the single-example objective
Jᵗ(w) = ½ w0ᵀw0 + C·N max(0, 1 − yi wᵀxi)
N: number of training examples
Stochastic gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current w to be ∇Jt(w)
   3. Update: w ← w − 𝛾t ∇Jt(w)
3. Return final w
26
Stochastic gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current w to be ∇Jt(w)
   3. Update: w ← w − 𝛾t ∇Jt(w)
3. Return final w
This algorithm is guaranteed to converge to the minimum of J if 𝛾t is small enough.
What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
27
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
28
Gradient Descent vs SGD
29
Gradient descent
Gradient Descent vs SGD
30
Stochastic Gradient descent
Many more updates than gradient descent, but each individual update is less computationally expensive
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
48
Stochastic gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current w to be ∇Jt(w)
   3. Update: w ← w − 𝛾t ∇Jt(w)
3. Return final w
What is the derivative of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
49
Hinge loss is not differentiable!
What is the derivative of the hinge loss with respect to w?
50
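One way to see the non-differentiability numerically: the one-sided difference quotients of the hinge function disagree at the kink z = 1 (a one-off check of my own, not from the slides):

```python
def hinge(z):
    # The hinge function max(0, 1 - z); its graph has a corner at z = 1.
    return max(0.0, 1.0 - z)

h = 1e-6
left = (hinge(1.0) - hinge(1.0 - h)) / h    # slope approaching from the left: -1
right = (hinge(1.0 + h) - hinge(1.0)) / h   # slope approaching from the right: 0
print(left, right)  # the two one-sided slopes disagree, so no gradient exists at z = 1
```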
Detour: Sub-gradients
Generalization of gradients to non-differentiable functions
(Remember that every tangent lies below the function for convex functions)
Informally, a sub-tangent line at a point is any line that crosses the point and lies below the entire function. A sub-gradient is the slope of the sub-tangent line.
51
Sub-gradients
52 [Example from Boyd]
Formally, g is a subgradient to f at x if f(z) ≥ f(x) + gᵀ(z − x) for every z
[Figure: f is differentiable at x1, and g1 is the gradient (the tangent) at x1; g2 and g3 are both subgradients at x2]
Sub-gradient of the SVM objective
55
General strategy: First solve the max and compute the gradient for each case
Jᵗ(w) = ½ w0ᵀw0 + C·N max(0, 1 − yi wᵀxi)

∇Jᵗ = [w0; 0]                 if max(0, 1 − yi wᵀxi) = 0
     = [w0; 0] − C·N yi xi     otherwise
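The case analysis for the sub-gradient can be written out directly. A sketch assuming numpy, with the bias folded into w as a trailing coordinate (the function name and toy values are mine, not from the slides):

```python
import numpy as np

def svm_subgradient(w, x, y, C, N):
    """Sub-gradient of J^t(w) = 1/2 w0.w0 + C*N*max(0, 1 - y w.x),
    where w = [w0; b] and x carries a trailing constant-1 (bias) feature."""
    reg = np.append(w[:-1], 0.0)   # [w0; 0]: the bias is not regularized
    if y * (w @ x) >= 1:           # the hinge term is zero here
        return reg
    return reg - C * N * y * x     # hinge term contributes -C*N*y*x

w = np.array([0.0, 0.0, 0.0])      # [w0; b], initialized to zero
x = np.array([1.0, 2.0, 1.0])      # input with the bias feature appended
# At w = 0 the hinge is active, so the sub-gradient is -C*N*y*x
print(svm_subgradient(w, x, y=1.0, C=1.0, N=5))
```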
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
57
Stochastic sub-gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← (1 − 𝛾t) w + 𝛾t C yi xi
      else w ← (1 − 𝛾t) w
3. Return w
58
Stochastic sub-gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← w − 𝛾t [w0; 0] + 𝛾t C N yi xi
      else w0 ← (1 − 𝛾t) w0
3. Return w
61
Stochastic sub-gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← w − 𝛾t [w0; 0] + 𝛾t C N yi xi
      else w0 ← (1 − 𝛾t) w0
3. Return w
63
𝛾t: learning rate, many tweaks possible
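Putting the whole algorithm together: a runnable sketch of the stochastic sub-gradient SVM trainer, assuming numpy. The learning-rate schedule and the toy data are my own choices; the update mirrors the pseudocode above.

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, gamma0=0.1, epochs=100, seed=0):
    """Stochastic sub-gradient descent for the SVM objective.
    X: (N, n) inputs; y: labels in {-1, +1}. A constant-1 feature is
    appended so the last weight coordinate plays the role of the bias b."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    N = len(y)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(N):               # shuffle each epoch
            gamma = gamma0 / (1 + gamma0 * t / C)  # one possible decaying schedule
            if y[i] * (w @ Xb[i]) <= 1:            # inside the margin: loss term fires
                w[:-1] *= (1 - gamma)              # regularizer shrinks w0, not the bias
                w += gamma * C * N * y[i] * Xb[i]
            else:
                w[:-1] *= (1 - gamma)
            t += 1
    return w

# Toy linearly separable data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_svm_sgd(X, y)
print(np.sign(X @ w[:-1] + w[-1]))
```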
Convergence and learning rates
With enough iterations, it will converge in expectation
Provided the step sizes are "square summable, but not summable":
• Step sizes 𝛾t are positive
• The sum of the squares of the step sizes over t = 1 to ∞ is finite
• The sum of the step sizes over t = 1 to ∞ is infinite
• Some examples: 𝛾t = 𝛾0 / (1 + 𝛾0 t / C), or 𝛾t = 𝛾0 / (1 + t)
64
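A quick numerical illustration (my own, not from the slides) of "square summable, but not summable" for the schedule 𝛾t = 1/(1 + t): the plain partial sums keep growing, while the squared partial sums level off.

```python
import math

def gamma(t):
    # The step-size schedule gamma_t = 1/(1 + t)
    return 1.0 / (1 + t)

totals = {}
for T in (10**2, 10**4, 10**5):
    s = sum(gamma(t) for t in range(1, T))        # grows like log T (not summable)
    s2 = sum(gamma(t) ** 2 for t in range(1, T))  # converges to pi^2/6 - 1 (square summable)
    totals[T] = (s, s2)
    print(T, round(s, 3), round(s2, 6))
```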
Convergence and learning rates
• Number of iterations to get to accuracy within 𝜖:
  – For strongly convex functions, N examples, d dimensional:
    • Gradient descent: O(N d ln(1/𝜖))
    • Stochastic gradient descent: O(d/𝜖)
• More subtleties involved, but SGD is generally preferable when the data size is huge
65
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
66
Stochastic sub-gradient descent for SVM
Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}
1. Initialize w0 = 0 ∈ ℜn
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1, w ← w − 𝛾t [w0; 0] + 𝛾t C N yi xi
      else w0 ← (1 − 𝛾t) w0
3. Return w
67
Compare with the Perceptron update: if yi wTxi ≤ 0, update w ← w + r yi xi
Perceptron vs. SVM
• Perceptron: stochastic sub-gradient descent for a different loss
  – No regularization though
• SVM optimizes the hinge loss
  – With regularization
68
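The contrast can be made concrete by writing both updates side by side (a sketch of my own; `perceptron_update` and `svm_update` are hypothetical names, not from the slides):

```python
import numpy as np

def perceptron_update(w, x, y, r=1.0):
    # Perceptron: update only on an outright mistake (y * w.x <= 0)
    if y * (w @ x) <= 0:
        return w + r * y * x
    return w

def svm_update(w, x, y, gamma, C, N):
    # SVM sub-gradient step: the regularizer always shrinks w0 (all but the
    # bias coordinate), and the loss term fires whenever the example is
    # inside the margin (y * w.x <= 1), not just on mistakes.
    inside_margin = y * (w @ x) <= 1
    w = w.copy()
    w[:-1] = (1 - gamma) * w[:-1]
    if inside_margin:
        w = w + gamma * C * N * y * x
    return w

# A correctly classified but low-margin point: the Perceptron is satisfied,
# while the SVM keeps pushing the margin out.
w = np.array([0.5, 0.0, 0.0])      # [w0; b] with the bias folded in
x = np.array([1.0, 0.0, 1.0])      # trailing 1 is the bias feature
print(perceptron_update(w, x, 1.0))        # unchanged: margin is positive
print(svm_update(w, x, 1.0, 0.1, 1.0, 1))  # changed: margin is below 1
```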
SVM summary from optimization perspective
• Minimize regularized hinge loss
• Solve using stochastic sub-gradient descent
  – Very fast; run time does not depend on the number of examples
  – Compare with the Perceptron algorithm: similar framework with different objectives!
  – Compare with the Perceptron algorithm: Perceptron does not maximize margin width
    • Perceptron variants can force a margin
• Other successful optimization algorithms exist
  – E.g., dual coordinate descent, implemented in liblinear
69
Questions?