CSE446: SVMs, Spring 2017 (courses.cs.washington.edu, posted 2017-05-02)
TRANSCRIPT
![Page 1: CSE446: SVMs Spring 2017 - courses.cs.washington.edu · 2017-05-02 · CSE446: SVMs Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, and Luke Zettelmoyer](https://reader033.vdocuments.us/reader033/viewer/2022042407/5f21a080e22097491c7aa5e2/html5/thumbnails/1.jpg)
CSE446: SVMs, Spring 2017
Ali Farhadi
Slides adapted from Carlos Guestrin and Luke Zettlemoyer
Linear classifiers – Which line is better?
Pick the one with the largest margin!
w•x = ∑i wi xi

(Figure: the dividing line w•x + w0 = 0, with w•x + w0 > 0 and w•x + w0 >> 0 on the positive side, and w•x + w0 < 0 and w•x + w0 << 0 on the negative side; γ marks the margin on each side of the line.)
Margin for point j: γj = yj (w•xj + w0)

Max margin: maxγ,w,w0 γ subject to yj (w•xj + w0) ≥ γ for all j
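The margin definition above is easy to compute directly. A minimal sketch with made-up weights (the slide gives no numeric values): the margin is positive exactly when the point is classified correctly, and its size says how far the point sits from the boundary.

```python
def dot(w, x):
    # w.x = sum_i w_i x_i, as on the slide
    return sum(wi * xi for wi, xi in zip(w, x))

def margin(w, w0, x, y):
    # Margin of point (x, y) under the classifier sign(w.x + w0):
    # positive iff the point is on the correct side of the line.
    return y * (dot(w, x) + w0)

w, w0 = [1.0, -1.0], 0.5          # hypothetical weights, for illustration only
print(margin(w, w0, [2.0, 0.0], +1))   # 2.5: correct, far from the boundary
print(margin(w, w0, [0.0, 2.0], +1))   # -1.5: misclassified
```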
How many possible solutions?
w•x + w0 = 0

Any other ways of writing the same dividing line?
• w.x + b = 0
• 2w.x + 2b = 0
• 1000w.x + 1000b = 0
• ….
Any constant scaling has the same intersection with the z=0 plane, so it gives the same dividing line!
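The scaling point can be checked in a few lines. A small sketch with illustrative numbers (not from the slides): scaling (w, b) by any positive constant changes the score but never its sign, so every prediction stays the same.

```python
def sign(t):
    return 1 if t > 0 else -1

def predict(w, b, x):
    # Classify by the sign of w.x + b
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, b = [2.0, -1.0], 0.5
x = [1.0, 3.0]
# w.x + b = -0.5 here; scaling by c > 0 gives c * (w.x + b), same sign.
for c in (1.0, 2.0, 1000.0):
    cw = [c * wi for wi in w]
    print(predict(cw, c * b, x))   # -1 every time
```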
Do we really want to max γ,w,w0?
Review: Normal to a plane
w•x + w0 = 0

Key terms:
• projection of xj onto w
• the unit vector w/||w||, normal to the plane
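The projection onto the unit normal gives the signed distance from the plane, which is what the margin derivation uses. A small sketch with made-up numbers (not from the slides):

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def signed_distance(w, w0, x):
    # Project x onto the unit normal w/||w||: the signed distance of x
    # from the plane w.x + w0 = 0 is (w.x + w0) / ||w||.
    return (dot(w, x) + w0) / norm(w)

w, w0 = [3.0, 4.0], -5.0                    # ||w|| = 5
print(signed_distance(w, w0, [1.0, 2.0]))   # (3 + 8 - 5) / 5 = 1.2
```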
w•x + w0 = +1
w•x + w0 = -1
w•x + w0 = 0

Assume x+ lies on the positive canonical line (w•x + w0 = +1) and x- on the negative one (w•x + w0 = -1); γ is the margin between each canonical line and the boundary.

Final result: we can maximize the constrained margin by minimizing ||w||2!
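The jump from the canonical hyperplanes to "minimize ||w||2" can be spelled out; this is the standard derivation, using the assumption above that x+ and x- sit on the two canonical lines:

```latex
w \cdot x^{+} + w_0 = +1, \qquad w \cdot x^{-} + w_0 = -1
\;\Rightarrow\; w \cdot (x^{+} - x^{-}) = 2 .
% The width of the margin band is the projection of x^+ - x^- onto the
% unit normal w / \|w\|:
2\gamma = \frac{w \cdot (x^{+} - x^{-})}{\|w\|} = \frac{2}{\|w\|}
\;\Rightarrow\; \gamma = \frac{1}{\|w\|} .
```

So maximizing γ is exactly minimizing ||w|| (equivalently ||w||2), subject to every point having margin at least 1.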
Max margin using canonical hyperplanes
The assumption of canonical hyperplanes (at +1 and -1) changes the objective and the constraints!
w.x + b = +1
w.x + b = -1
w.x + b = 0

(Figure: x+ and x- on the canonical lines; γ is the margin.)
Support vector machines (SVMs)
• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
  – Not simple gradient ascent, but close
• Decision boundary defined by support vectors
w.x + b = +1
w.x + b = -1
w.x + b = 0
margin 2γ

Support vectors:
• data points on the canonical lines

Non-support vectors:
• everything else
• moving them will not change w
What if the data is not linearly separable?
Add More Features!!!
Can use Kernels… (more on this later)
What about overfitting?
What if the data is still not linearly separable?
• First idea: jointly minimize ||w||2 and the number of training mistakes:
  minw,w0 ||w||2 + C #(mistakes)
  – How to trade off the two criteria? Pick C on a development / cross-validation set
• Problems with trading off #(mistakes), a 0/1 loss:
  – Not a QP anymore
  – Also doesn't distinguish near misses from really bad mistakes
  – NP-hard to find the optimal solution!
Slack variables – Hinge loss
For each data point:
• If margin ≥ 1, don't care
• If margin < 1, pay a linear penalty

minw,w0,ξ ||w||2 + C Σj ξj subject to yj (w•xj + w0) ≥ 1 - ξj, ξj ≥ 0

Slack penalty C > 0:
• C = ∞: have to separate the data!
• C = 0: ignore the data entirely!
• Select C on a dev. set, etc.
w•x + w0 = +1
w•x + w0 = -1
w•x + w0 = 0

(Figure: each margin-violating point pays a slack ξ.)
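The slack a point pays follows directly from the constraint above: ξj = max(0, 1 - yj (w•xj + w0)). A minimal sketch with illustrative weights (not from the slides):

```python
def slack(w, w0, x, y):
    # Slack xi = max(0, 1 - y (w.x + w0)): zero when the margin is >= 1,
    # a linear penalty once the margin drops below 1.
    m = y * (sum(wi * xi for wi, xi in zip(w, x)) + w0)
    return max(0.0, 1.0 - m)

w, w0 = [1.0, 0.0], 0.0
print(slack(w, w0, [2.0, 0.0], +1))    # margin 2   -> slack 0 (don't care)
print(slack(w, w0, [0.5, 0.0], +1))    # margin 0.5 -> slack 0.5
print(slack(w, w0, [-1.0, 0.0], +1))   # margin -1  -> slack 2 (misclassified)
```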
Slack variables – Hinge loss

w•x + w0 = +1
w•x + w0 = -1
w•x + w0 = 0

(Figure: each margin-violating point pays a slack ξ.)

minw,w0,ξ ||w||2 + C Σj ξj subject to yj (w•xj + w0) ≥ 1 - ξj, ξj ≥ 0

Hinge loss (equivalent unconstrained form): minw,w0 ||w||2 + C Σj [1 - yj (w•xj + w0)]+ , where [x]+ = max(x, 0)
Regularization: the ||w||2 term.

Solving SVMs:
• Differentiate and set equal to zero!
• No closed-form solution, but the quadratic program (top) is convex
• Hinge loss is not differentiable everywhere, so gradient descent is a little trickier…
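One standard way around the non-differentiability is a stochastic subgradient method (Pegasos-style). A minimal sketch with toy hyperparameters, not the course's reference implementation: where a point's margin is below 1, a subgradient of its hinge term is -y x; elsewhere only the regularizer contributes.

```python
def train_hinge(data, lam=0.01, lr=0.1, epochs=200):
    # Stochastic subgradient descent on  lam * ||w||^2 / 2 + sum_j [1 - y_j (w.x_j + b)]+
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            m = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if m < 1:
                # hinge is active: step along y*x, shrink w toward zero
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # hinge inactive: only the regularizer pulls on w
                w = [wi - lr * lam * wi for wi in w]
    return w, b

data = [([2.0], +1), ([3.0], +1), ([-2.0], -1), ([-3.0], -1)]
w, b = train_hinge(data)
print(all(y * (w[0] * x[0] + b) > 0 for x, y in data))   # True on this toy set
```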
Logistic Regression as Minimizing Loss
Logistic regression assumes: P(Y = y | x) = 1 / (1 + exp(-y (w•x + w0)))

And tries to maximize data likelihood, for Y = {-1, +1}: maxw,w0 Πj P(yj | xj)

Equivalent to minimizing log loss: minw,w0 Σj ln(1 + exp(-yj (w•xj + w0)))
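The log loss above is a one-liner to evaluate. A small sketch with made-up weights (illustrative values only):

```python
import math

def log_loss(w, w0, x, y):
    # For Y in {-1, +1}, P(y | x) = 1 / (1 + exp(-y (w.x + w0))),
    # so maximizing likelihood = minimizing log(1 + exp(-y (w.x + w0))).
    m = y * (sum(wi * xi for wi, xi in zip(w, x)) + w0)
    return math.log(1.0 + math.exp(-m))

w, w0 = [1.0], 0.0
print(log_loss(w, w0, [0.0], +1))   # point on the boundary: loss = ln 2
print(log_loss(w, w0, [3.0], +1))   # confidently correct: loss near 0
```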
SVMs vs Regularized Logistic Regression
SVM objective: minw,w0 ||w||2 + C Σj [1 - yj (w•xj + w0)]+

Logistic regression objective: minw,w0 λ ||w||2 + Σj ln(1 + exp(-yj (w•xj + w0)))

Tradeoff: same l2 regularization term, but a different error term

[x]+ = max(x, 0)
Graphing loss vs. margin

We want to smoothly approximate the 0/1 loss!

(Figure: 0/1 loss, hinge loss, and logistic (log) loss plotted against the margin yj (w•xj + w0).)
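The three curves in that plot are simple functions of the margin m = y (w•x + w0), and the comparison can be tabulated directly. A small sketch (conventions chosen here: m = 0 counts as a 0/1 error, and log loss is measured in bits so that it upper-bounds the 0/1 loss):

```python
import math

def zero_one(m):
    return 1.0 if m <= 0 else 0.0            # counting m = 0 as an error

def hinge(m):
    return max(0.0, 1.0 - m)

def log2_loss(m):
    return math.log2(1.0 + math.exp(-m))     # log loss in bits

for m in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(m, zero_one(m), hinge(m), round(log2_loss(m), 3))
# Both hinge and base-2 log loss are convex upper bounds on the 0/1 loss,
# which is what makes them tractable surrogates.
```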
What about multiple classes?
One against All
Learn 3 classifiers:
• + vs {0,-}, weights w+
• - vs {0,+}, weights w-
• 0 vs {+,-}, weights w0
Output for x: y = argmaxi wi•x

(Figure: the three weight vectors w+, w-, w0.)

Any problems? Could we learn this dataset?
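The one-against-all prediction rule is just an argmax over per-class scores. A minimal sketch with hypothetical weight vectors for the three classes (illustrative values, not learned):

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical weights for classes '+', '-', '0', chosen for illustration.
weights = {'+': [1.0, 0.0], '-': [-1.0, 0.0], '0': [0.0, 1.0]}

def predict(x):
    # One-against-all: output the class whose classifier scores highest.
    return max(weights, key=lambda c: dot(weights[c], x))

print(predict([2.0, 0.5]))    # '+'
print(predict([-3.0, 1.0]))   # '-'
print(predict([0.1, 2.0]))    # '0'
```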
Learn 1 classifier: Multiclass SVM

Simultaneously learn 3 sets of weights:
• How do we guarantee the correct labels?
• Need new constraints!

For each class y' other than the true label yj: w_yj•xj ≥ w_y'•xj + 1

(Figure: the three weight vectors w+, w-, w0.)
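The new constraints are easy to check mechanically: the true class's score must beat every other class's score by at least 1. A small sketch with hypothetical weights and data (illustrative values only):

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def satisfies_constraints(weights, data):
    # Multiclass SVM constraint: for every example (x, y) and every wrong
    # class y', require  w_y . x >= w_y' . x + 1.
    for x, y in data:
        for c, wc in weights.items():
            if c != y and dot(weights[y], x) < dot(wc, x) + 1:
                return False
    return True

weights = {'+': [2.0, 0.0], '-': [-2.0, 0.0], '0': [0.0, 2.0]}
data = [([1.0, 0.0], '+'), ([-1.0, 0.0], '-'), ([0.0, 1.0], '0')]
print(satisfies_constraints(weights, data))   # True: all margins >= 1
```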
Learn 1 classifier: Multiclass SVM

Also, can introduce slack variables, as before: minw,ξ ||w||2 + C Σj ξj, relaxing each class constraint to w_yj•xj ≥ w_y'•xj + 1 - ξj, with ξj ≥ 0.

Now, can we learn it?
What you need to know
• Maximizing the margin
• Derivation of the SVM formulation
• Slack variables and hinge loss
• Tackling multiple classes
  – One against All
  – Multiclass SVMs