Introduction to Machine Learning (67577), Lecture 8
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Support Vector Machines and Kernel Methods
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 1 / 31
Outline
1 Support Vector Machines
    Margin
    hard-SVM
    soft-SVM
    Solving SVM using SGD

2 Kernels
    Embeddings into feature spaces
    The Kernel Trick
    Examples of kernels
    SGD with kernels
    Duality
Which separating hyperplane is better?
Intuitively, the dashed black one is better
Margin
Given a hyperplane L = {v : 〈w,v〉 + b = 0} and a point x, the distance of x to L is

d(x, L) = min{‖x − v‖ : v ∈ L}

Claim: if ‖w‖ = 1 then d(x, L) = |〈w,x〉 + b|
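The claim can be checked numerically. A minimal sketch (the vectors below are arbitrary made-up values), using the fact that for ‖w‖ = 1 the orthogonal projection of x onto L is x − (〈w,x〉 + b)w:

```python
import numpy as np

# Hyperplane L = {v : <w, v> + b = 0} with ||w|| = 1.
w = np.array([3.0, 4.0]) / 5.0   # unit norm
b = -2.0
x = np.array([1.5, -0.5])

# Orthogonal projection of x onto L: v = x - (<w, x> + b) w.
v = x - (np.dot(w, x) + b) * w
assert abs(np.dot(w, v) + b) < 1e-12          # v lies on L

# The distance from x to its projection equals |<w, x> + b|.
dist = np.linalg.norm(x - v)
assert abs(dist - abs(np.dot(w, x) + b)) < 1e-9
```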
Margin and Support Vectors
Recall: a separating hyperplane is defined by (w, b) s.t. ∀i, yi(〈w,xi〉 + b) > 0

The margin of a separating hyperplane is the distance of the closest example to it: min_i |〈w,xi〉 + b|

The closest examples are called support vectors
Support Vector Machine (SVM)
Hard-SVM: seek the separating hyperplane with the largest margin:

argmax_{(w,b) : ‖w‖=1}  min_{i∈[m]}  |〈w,xi〉 + b|   s.t. ∀i, yi(〈w,xi〉 + b) > 0 .

Equivalently:

argmax_{(w,b) : ‖w‖=1}  min_{i∈[m]}  yi(〈w,xi〉 + b) .

Equivalently:

(w0, b0) = argmin_{(w,b)} ‖w‖²   s.t. ∀i, yi(〈w,xi〉 + b) ≥ 1

Observe: the margin of (w0/‖w0‖, b0/‖w0‖) is 1/‖w0‖, and it is maximal.
Margin-based Analysis
Margin is scale sensitive: if (w, b) separates (x1, y1), . . . , (xm, ym) with margin γ, then it separates (2x1, y1), . . . , (2xm, ym) with a margin of 2γ. The margin therefore depends on the scale of the examples.

Margin of a distribution: we say that D is separable with a (γ, ρ)-margin if there exists (w⋆, b⋆) s.t. ‖w⋆‖ = 1 and

D({(x, y) : ‖x‖ ≤ ρ ∧ y(〈w⋆,x〉 + b⋆) ≥ γ}) = 1 .

Theorem: if D is separable with a (γ, ρ)-margin, then the sample complexity of hard-SVM is

m(ε, δ) ≤ (8/ε²) (2(ρ/γ)² + log(2/δ))

Unlike VC bounds, here the sample complexity depends on ρ/γ instead of d
Soft-SVM
Hard-SVM assumes that the data is separable
What if it’s not? We can relax the constraint to yield soft-SVM
argmin_{w,b,ξ}  (λ‖w‖² + (1/m) Σ_{i=1}^m ξi)   s.t. ∀i, yi(〈w,xi〉 + b) ≥ 1 − ξi and ξi ≥ 0

This can be written as regularized loss minimization:

argmin_{w,b}  (λ‖w‖² + L_S^hinge((w, b)))

where we use the hinge loss

ℓ^hinge((w, b), (x, y)) = max{0, 1 − y(〈w,x〉 + b)} .
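The equivalence rests on the fact that at the optimum of the constrained form, each slack takes the value ξi = max{0, 1 − yi(〈w,xi〉 + b)}, which is exactly the hinge loss. A small sketch evaluating the regularized-loss form (the data, w, b, and λ below are made up for illustration):

```python
import numpy as np

def hinge(w, b, x, y):
    # hinge loss: max{0, 1 - y(<w, x> + b)}
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

def soft_svm_objective(w, b, X, Y, lam):
    # Regularized-loss form: lambda*||w||^2 + (1/m) * sum of hinge losses.
    # At the optimum of the constrained form, xi_i = hinge(w, b, x_i, y_i).
    return lam * np.dot(w, w) + float(np.mean(
        [hinge(w, b, x, y) for x, y in zip(X, Y)]))

# Made-up data and parameters, purely illustrative.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.2, -1.0]])
Y = np.array([1.0, -1.0, -1.0])
w, b, lam = np.array([0.5, -0.5]), 0.1, 0.1
obj = soft_svm_objective(w, b, X, Y, lam)   # hinge losses: 1.4, 0.35, 1.7
```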
The Homogeneous Case

Recall: by adding one more feature to x with the constant value 1, we can remove the bias term

However, this yields a slightly different algorithm, since now we effectively regularize the bias term, b, as well

This has little effect on the sample complexity, and it simplifies the analysis and the algorithm, so from now on we omit b
Sample complexity of soft-SVM
Observe: soft-SVM = RLM (regularized loss minimization)

Observe: the hinge loss, w ↦ max{0, 1 − y〈w,x〉}, is ‖x‖-Lipschitz

Assume that D is s.t. ‖x‖ ≤ ρ with probability 1

Then we obtain a convex-Lipschitz loss, and by the results from the previous lecture, for every u,

E_{S∼D^m}[L_D^hinge(A(S))] ≤ L_D^hinge(u) + λ‖u‖² + 2ρ²/(λm) .

Since the hinge loss upper bounds the 0-1 loss, the right-hand side is also an upper bound on E_{S∼D^m}[L_D^{0-1}(A(S))]

For every B > 0, if we set λ = √(2ρ²/(B²m)) then:

E_{S∼D^m}[L_D^{0-1}(A(S))] ≤ min_{w : ‖w‖≤B} L_D^hinge(w) + √(8ρ²B²/m) .
Margin/Norm vs. Dimensionality
The VC dimension of learning halfspaces depends on the dimension, d

Therefore, the sample complexity grows with d

In contrast, the sample complexity of SVM depends on (ρ/γ)², or equivalently, ρ²B²

Sometimes d ≫ ρ²B² (as we saw in the previous lecture)

There is no contradiction to the fundamental theorem, since here we bound the error of the algorithm using L_D^hinge(w⋆), while in the fundamental theorem we have L_D^{0-1}(w⋆)

This is additional prior knowledge on the problem, namely, that L_D^hinge(w⋆) is not much larger than L_D^{0-1}(w⋆).
Solving SVM using SGD
SGD for solving soft-SVM

goal: solve argmin_w (λ/2 ‖w‖² + (1/m) Σ_{i=1}^m max{0, 1 − yi〈w,xi〉})
parameter: T
initialize: θ(1) = 0
for t = 1, . . . , T:
    let w(t) = (1/(λt)) θ(t)
    choose i uniformly at random from [m]
    if yi〈w(t),xi〉 < 1:
        set θ(t+1) = θ(t) + yi xi
    else:
        set θ(t+1) = θ(t)
output: w̄ = (1/T) Σ_{t=1}^T w(t)
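The pseudocode above translates directly into Python. A minimal sketch of it (the toy data, λ, and T are made-up values; this is the homogeneous case, with no bias term):

```python
import numpy as np

def sgd_soft_svm(X, Y, lam, T, seed=0):
    """SGD for soft-SVM (homogeneous case): theta accumulates the
    subgradient updates, w(t) = theta(t)/(lam*t), and the output is
    the average of the iterates w(t)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / (lam * t)
        w_sum += w
        i = rng.integers(m)                 # i chosen uniformly from [m]
        if Y[i] * np.dot(w, X[i]) < 1:      # hinge constraint violated
            theta = theta + Y[i] * X[i]
    return w_sum / T

# Made-up separable toy data: label = sign of the first coordinate,
# shifted to create a margin.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Y = np.sign(X[:, 0])
X[:, 0] += 0.5 * Y
w_bar = sgd_soft_svm(X, Y, lam=0.1, T=5000)
acc = np.mean(np.sign(X @ w_bar) == Y)      # training accuracy
```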
Embeddings into feature spaces
The following sample in R is not separable by halfspaces

But, if we map x → (x, x²), it is separable by halfspaces
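A tiny illustration with a made-up one-dimensional sample (the separating halfspace w = (0, 1), b = −4 is chosen by hand for this example):

```python
import numpy as np

# Made-up sample: positives lie on both sides of the negatives, so no
# threshold on x alone classifies it correctly.
x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
y = np.array([1, -1, -1, -1, 1])

# After the map psi(x) = (x, x^2), the halfspace w = (0, 1), b = -4
# separates the sample: the prediction is sign(x^2 - 4).
psi = np.stack([x, x**2], axis=1)
w, b = np.array([0.0, 1.0]), -4.0
assert np.all(np.sign(psi @ w + b) == y)
```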
Embeddings into feature spaces
The general approach:

Define ψ : X → F, where F is some feature space (formally, we require F to be a subset of a Hilbert space)

Train a halfspace over (ψ(x1), y1), . . . , (ψ(xm), ym)

Questions:

How to choose ψ?

If F is high dimensional, we face:
    a statistical challenge, which can be tackled using margin
    a computational challenge, which can be tackled using kernels
Choosing a mapping
In general, this requires prior knowledge

In addition, there are some generic mappings that enrich the class of halfspaces, e.g. polynomial mappings
Polynomial mappings
Recall: a degree-k polynomial over a single variable is p(x) = Σ_{j=0}^k wj x^j

It can be rewritten as 〈w, ψ(x)〉 where ψ(x) = (1, x, x², . . . , x^k)

More generally, a degree-k multivariate polynomial from R^n to R can be written as

p(x) = Σ_{J ∈ [n]^r : r ≤ k}  wJ ∏_{i=1}^r x_{Ji} .

As before, we can rewrite p(x) = 〈w, ψ(x)〉 where now ψ : R^n → R^d is such that for every J ∈ [n]^r, r ≤ k, the coordinate of ψ(x) associated with J is the monomial ∏_{i=1}^r x_{Ji}.
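A sketch of this feature map, with one coordinate per multi-index J ∈ [n]^r for r ≤ k (for r = 0 the empty product gives the constant feature 1):

```python
import itertools
import numpy as np

def psi(x, k):
    # One coordinate per multi-index J in [n]^r, r <= k; the coordinate
    # holds the monomial prod_{i=1}^r x[J_i] (r = 0 gives the constant 1).
    feats = []
    for r in range(k + 1):
        for J in itertools.product(range(len(x)), repeat=r):
            feats.append(np.prod([x[j] for j in J]))
    return np.array(feats)

z = psi(np.array([2.0, 3.0]), 2)
# Coordinates: 1, x1, x2, x1*x1, x1*x2, x2*x1, x2*x2
#            = 1, 2, 3, 4, 6, 6, 9
```

Note that the dimension d grows quickly with n and k, which is exactly the computational challenge that kernels address next.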
The Kernel Trick
A kernel function for a mapping ψ is a function that implements the inner product in the feature space, namely,

K(x, x′) = 〈ψ(x), ψ(x′)〉

We will see that sometimes it is easy to calculate K(x, x′) efficiently, without applying ψ at all

But, is this enough?
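One concrete case: for the polynomial feature map of the previous slide (one unscaled coordinate per multi-index J ∈ [n]^r, r ≤ k), the inner product collapses, since Σ_{J∈[n]^r} ∏_i x_{Ji} x′_{Ji} = 〈x, x′〉^r, to K(x, x′) = Σ_{r=0}^k 〈x, x′〉^r. So K is computable in O(n + k) time without ever forming ψ. A sketch checking this identity on made-up vectors:

```python
import itertools
import numpy as np

def psi(x, k):
    # Explicit degree-k feature map: one coordinate per J in [n]^r, r <= k.
    feats = []
    for r in range(k + 1):
        for J in itertools.product(range(len(x)), repeat=r):
            feats.append(np.prod([x[j] for j in J]))
    return np.array(feats)

def K(x, xp, k):
    # Kernel for the map above, computed without applying psi:
    # <psi(x), psi(x')> = sum_{r=0}^{k} <x, x'>^r
    s = np.dot(x, xp)
    return sum(s ** r for r in range(k + 1))

x, xp, k = np.array([1.0, 2.0]), np.array([0.5, -1.0]), 3
assert np.isclose(np.dot(psi(x, k), psi(xp, k)), K(x, xp, k))
```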
The Representer Theorem
Theorem
Consider any learning rule of the form

w⋆ = argmin_w (f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + λ‖w‖²),

where f : R^m → R is an arbitrary function. Then, there exists α ∈ R^m such that w⋆ = Σ_{i=1}^m αi ψ(xi).

Proof.

We can write w⋆ = Σ_{i=1}^m αi ψ(xi) + u, where 〈u, ψ(xi)〉 = 0 for all i. Set w = w⋆ − u. Observe that ‖w⋆‖² = ‖w‖² + ‖u‖², and for every i, 〈w, ψ(xi)〉 = 〈w⋆, ψ(xi)〉. Hence, the objective at w equals the objective at w⋆ minus λ‖u‖². By the optimality of w⋆, u must be zero.
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 19 / 31
Implications of Representer Theorem
By the representer theorem, the optimal solution can be written as

w = ∑_i α_i ψ(x_i)

Denote by G the matrix s.t. G_{i,j} = 〈ψ(x_i), ψ(x_j)〉. We have that for all i

〈w, ψ(x_i)〉 = 〈∑_j α_j ψ(x_j), ψ(x_i)〉 = ∑_{j=1}^{m} α_j 〈ψ(x_j), ψ(x_i)〉 = (Gα)_i

and

‖w‖^2 = 〈∑_j α_j ψ(x_j), ∑_j α_j ψ(x_j)〉 = ∑_{i,j=1}^{m} α_i α_j 〈ψ(x_i), ψ(x_j)〉 = α^⊤ G α .

So, we can optimize over α:

argmin_{α ∈ R^m} ( f(Gα) + λ α^⊤ G α )
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 20 / 31
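As a concrete instance (my choice, not the slides'), take f to be the mean squared error. The α problem then has the closed-form minimizer α = (G + λmI)^{-1} y, i.e. kernel ridge regression. A minimal sketch, with an arbitrary synthetic dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                          # 20 training points in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

K = lambda a, b: (1.0 + a @ b) ** 2                   # degree-2 polynomial kernel
G = np.array([[K(xi, xj) for xj in X] for xi in X])   # Gram matrix G_ij = K(x_i, x_j)

lam, m = 0.1, len(X)
# With f(Ga) = (1/m)||Ga - y||^2, the gradient vanishes at
# alpha = (G + lam * m * I)^{-1} y   (kernel ridge regression)
alpha = np.linalg.solve(G + lam * m * np.eye(m), y)

preds = G @ alpha                                     # (G alpha)_i = <w, psi(x_i)>
```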
The Kernel Trick
Observe: the Gram matrix, G, only depends on inner products, and therefore can be calculated using K alone

Suppose we found α. Then, given a new instance x,

〈w, ψ(x)〉 = 〈∑_j α_j ψ(x_j), ψ(x)〉 = ∑_j α_j 〈ψ(x_j), ψ(x)〉 = ∑_j α_j K(x_j, x)

That is, we can do both training and prediction using K alone
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 21 / 31
Representer Theorem for SVM
Soft-SVM:

min_{α ∈ R^m} ( λ α^⊤ G α + (1/m) ∑_{i=1}^{m} max{0, 1 − y_i (Gα)_i} )

Hard-SVM:

min_{α ∈ R^m} α^⊤ G α s.t. ∀i, y_i (Gα)_i ≥ 1
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 22 / 31
Polynomial Kernels
The degree k polynomial kernel is defined to be

K(x,x′) = (1 + 〈x,x′〉)^k .

Exercise: show that if we define ψ : R^n → R^{(n+1)^k} s.t. for every J ∈ {0, 1, . . . , n}^k there is an element of ψ(x) that equals ∏_{i=1}^{k} x_{J_i} (with the convention x_0 = 1), then

K(x,x′) = 〈ψ(x), ψ(x′)〉 .

Since ψ contains all the monomials up to degree k, a halfspace over the range of ψ corresponds to a polynomial predictor of degree k over the original space.

Observe that calculating K(x,x′) takes O(n) time while the dimension of ψ(x) is (n+1)^k
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 23 / 31
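The exercise can be checked numerically for small n and k; the feature map below (coordinates indexed by tuples J, with an extra constant coordinate playing the role of x_0) is one possible ordering:

```python
import numpy as np
from itertools import product

def psi(x, k):
    """Explicit feature map: one coordinate per J in {0,1,...,n}^k, with x_0 = 1."""
    xe = np.concatenate(([1.0], x))                  # prepend the constant coordinate
    return np.array([np.prod(xe[list(J)])
                     for J in product(range(len(xe)), repeat=k)])

def K(x, xp, k):
    return (1.0 + x @ xp) ** k                       # degree-k polynomial kernel

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)       # n = 2, so psi maps to R^(3^k)
for k in (1, 2, 3):
    assert np.isclose(K(x, xp, k), psi(x, k) @ psi(xp, k))
```

The check works because 〈ψ(x), ψ(x′)〉 sums ∏_i x_{J_i} x′_{J_i} over all tuples J, which factors into (∑_{j=0}^{n} x_j x′_j)^k = (1 + 〈x,x′〉)^k.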
Gaussian kernel (RBF)
Let the original instance space be R and consider the mapping ψ where for each integer n ≥ 0 there exists an element ψ(x)_n which equals (1/√(n!)) e^{−x^2/2} x^n. Then,

〈ψ(x), ψ(x′)〉 = ∑_{n=0}^{∞} ( (1/√(n!)) e^{−x^2/2} x^n ) ( (1/√(n!)) e^{−(x′)^2/2} (x′)^n )
= e^{−(x^2 + (x′)^2)/2} ∑_{n=0}^{∞} (x x′)^n / n! = e^{−(x − x′)^2/2} .

More generally, the Gaussian kernel is defined to be

K(x,x′) = e^{−‖x − x′‖^2 / (2σ)} .

Can learn any polynomial ...
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 24 / 31
Characterizing Kernel Functions
Lemma (Mercer’s conditions)
A symmetric function K : X × X → R implements an inner product in some Hilbert space if and only if it is positive semidefinite; namely, for all x_1, . . . , x_m, the Gram matrix, G_{i,j} = K(x_i, x_j), is a positive semidefinite matrix.
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 25 / 31
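The "only if" direction is easy to probe empirically: for any sample, the Gram matrix of a valid kernel must have no negative eigenvalues. A sketch using the Gaussian kernel (σ = 1) on an arbitrary random sample:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))                          # arbitrary sample x_1, ..., x_10
K = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2)    # Gaussian kernel, sigma = 1
G = np.array([[K(a, b) for b in X] for a in X])       # Gram matrix

eigvals = np.linalg.eigvalsh(G)        # G is symmetric, so eigenvalues are real
assert np.all(eigvals >= -1e-10)       # positive semidefinite (up to roundoff)
```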
Implementing soft-SVM with kernels
We can use a generic convex optimization algorithm on the α problem
Alternatively, we can implement the SGD algorithm on the original w problem, but observe that all the operations of SGD can be implemented using the kernel alone
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 26 / 31
SGD with kernels for soft-SVM
SGD for Solving Soft-SVM with Kernels
parameter: T
Initialize: β^(1) = 0 ∈ R^m
for t = 1, . . . , T
  Let α^(t) = (1/(λt)) β^(t)
  Choose i uniformly at random from [m]
  For all j ≠ i set β^(t+1)_j = β^(t)_j
  If y_i ∑_{j=1}^{m} α^(t)_j K(x_j, x_i) < 1
    Set β^(t+1)_i = β^(t)_i + y_i
  Else
    Set β^(t+1)_i = β^(t)_i
Output: w = ∑_{j=1}^{m} ᾱ_j ψ(x_j) where ᾱ = (1/T) ∑_{t=1}^{T} α^(t)
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 27 / 31
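A sketch of the algorithm above in Python. The toy XOR-style dataset, the degree-2 polynomial kernel, and the parameters λ and T are illustrative choices; the Gram matrix is precomputed for speed:

```python
import numpy as np

def kernelized_sgd_svm(K_fn, X, y, lam, T, seed=0):
    """SGD for soft-SVM, expressed through the kernel alone.
    Returns averaged coefficients alpha_bar; the predictor is
    x -> sign(sum_j alpha_bar[j] * K_fn(X[j], x))."""
    rng = np.random.default_rng(seed)
    m = len(X)
    G = np.array([[K_fn(a, b) for b in X] for a in X])   # precomputed Gram matrix
    beta = np.zeros(m)
    alpha_sum = np.zeros(m)
    for t in range(1, T + 1):
        alpha = beta / (lam * t)          # alpha^(t) = beta^(t) / (lam * t)
        i = rng.integers(m)               # uniform i in [m]
        if y[i] * (G[i] @ alpha) < 1:     # margin violated at x_i
            beta[i] += y[i]
        alpha_sum += alpha
    return alpha_sum / T                  # averaged alpha

# XOR-like data: not linearly separable, but separable in the
# degree-2 polynomial feature space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K_fn = lambda a, b: (1.0 + a @ b) ** 2

alpha_bar = kernelized_sgd_svm(K_fn, X, y, lam=0.1, T=5000)
preds = np.sign([sum(alpha_bar[j] * K_fn(X[j], x) for j in range(len(X)))
                 for x in X])
```

Note that neither training nor prediction ever evaluates ψ; only K_fn is called, exactly as the slide claims.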
Duality
Historically, many of the properties of SVM have been obtained by considering a dual problem

It is not a must, but it can be helpful

We show how to derive a dual problem for Hard-SVM:

min_w ‖w‖^2 s.t. ∀i, y_i 〈w, x_i〉 ≥ 1
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 28 / 31
Duality
Hard-SVM can be rewritten as:

min_w max_{α ∈ R^m : α ≥ 0} ( (1/2)‖w‖^2 + ∑_{i=1}^{m} α_i (1 − y_i 〈w, x_i〉) )

Let's flip the order of min and max. This can only decrease the objective value, so we obtain the weak duality inequality:

min_w max_{α ∈ R^m : α ≥ 0} ( (1/2)‖w‖^2 + ∑_{i=1}^{m} α_i (1 − y_i 〈w, x_i〉) ) ≥ max_{α ∈ R^m : α ≥ 0} min_w ( (1/2)‖w‖^2 + ∑_{i=1}^{m} α_i (1 − y_i 〈w, x_i〉) )

In our case, there's also strong duality (i.e., the above holds with equality)
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 29 / 31
Duality
The dual problem:

max_{α ∈ R^m : α ≥ 0} min_w ( (1/2)‖w‖^2 + ∑_{i=1}^{m} α_i (1 − y_i 〈w, x_i〉) )

We can solve the inner optimization analytically and obtain the solution

w = ∑_{i=1}^{m} α_i y_i x_i

Plugging it back yields

max_{α ∈ R^m : α ≥ 0} ( (1/2)‖∑_{i=1}^{m} α_i y_i x_i‖^2 + ∑_{i=1}^{m} α_i (1 − y_i 〈∑_j α_j y_j x_j, x_i〉) ) .
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 30 / 31
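Expanding the last expression and using ∑_i α_i y_i 〈∑_j α_j y_j x_j, x_i〉 = ‖∑_i α_i y_i x_i‖^2, it simplifies to the familiar form of the dual, which depends on the data only through inner products (and can therefore be kernelized):

```latex
\max_{\alpha \in \mathbb{R}^m : \alpha \ge 0} \;
\sum_{i=1}^{m} \alpha_i
\;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
\alpha_i \alpha_j \, y_i y_j \, \langle x_j, x_i \rangle
```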
Summary
Margin as additional prior knowledge
Hard and Soft SVM
Kernels
Shai Shalev-Shwartz (Hebrew U) IML Lecture 8 SVM 31 / 31