Lecture 12 – Part 01 Support Vector Machines




Linear Classifiers

▶ Prediction rule: $H(\vec x) = \vec w \cdot \operatorname{Aug}(\vec x)$
▶ Predict class 1 if $H(\vec x) > 0$.
▶ Predict class −1 if $H(\vec x) < 0$.
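
As a quick illustration (not from the slides), here is a minimal NumPy sketch of this rule, assuming $\operatorname{Aug}(\vec x)$ prepends a constant 1 so the bias term lives inside $\vec w$; the weights and the test point below are made up:

```python
import numpy as np

def aug(x):
    # Prepend a constant 1 so the bias is folded into the weight vector.
    return np.concatenate(([1.0], x))

def predict(w, x):
    # Linear classifier: the sign of w . Aug(x).
    return 1 if np.dot(w, aug(x)) > 0 else -1

w = np.array([-1.0, 2.0, 0.5])           # hypothetical weights: [bias, w1, w2]
print(predict(w, np.array([1.0, 0.2])))  # w . Aug(x) = -1 + 2 + 0.1 > 0, so prints 1
```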


Decision Boundary

▶ $\operatorname{Aug}(\vec x) \cdot \vec w$ is proportional to the distance from the boundary.


Recall: Logistic Regression

▶ Prediction Rule: $H(\vec x) = \sigma(\vec w \cdot \operatorname{Aug}(\vec x))$

▶ Find $\vec w$ by maximizing the log likelihood.

▶ Predict class 1 if $H(\vec x) > 0.5$, class −1 otherwise.

▶ But $\sigma(\vec w \cdot \operatorname{Aug}(\vec x)) > 0.5 \iff \vec w \cdot \operatorname{Aug}(\vec x) > 0$.
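
One line of algebra makes the equivalence explicit:

$$\sigma(z) = \frac{1}{1+e^{-z}} > \frac{1}{2} \;\iff\; 1 + e^{-z} < 2 \;\iff\; e^{-z} < 1 \;\iff\; z > 0, \qquad z = \vec w \cdot \operatorname{Aug}(\vec x).$$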


Recall: the Perceptron

▶ Prediction Rule: $H(\vec x) = \vec w \cdot \operatorname{Aug}(\vec x)$

▶ Find $\vec w$ by minimizing the perceptron risk.

▶ Theorem: if the training data is linearly separable, the perceptron algorithm finds a dividing hyperplane.

Perceptron Problems

The learned perceptron may have a small margin.

We prefer large margins for generalization.


Maximum Margin Classifiers

▶ Assume: linear separability (for now).

▶ Many possible boundaries with zero error.

▶ Goal: Find the linear boundary with the largest margin w.r.t. the training data.


Observation

▶ Training data: $\{(\vec x^{(i)}, y_i)\}$

▶ Classification is correct when:

$$\begin{cases} \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) > 0, & \text{if } y_i = 1 \\ \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) < 0, & \text{if } y_i = -1 \end{cases}$$

▶ Equivalently, classification is correct if:

$$y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) > 0$$


Recall

▶ $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)})$ is proportional to the distance from the boundary.

▶ Our goal: find the $\vec w$ that maximizes the smallest distance.

$$\vec w_{\text{best}} = \operatorname*{argmax}_{\vec w \in \mathbb{R}^{d+1}} \; \min_{i \in 1, \dots, n} \left[ y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \right]$$

▶ This looks hard. But there is a trick.


Another Observation

▶ If the data is linearly separable, then there is a $\vec w$ such that

$$y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) > 0$$

for all $i = 1, \dots, n$.

▶ Actually, linearly separable $\implies$ there is a $\vec w$ s.t.

$$y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1$$

for all $i = 1, \dots, n$.


Why?

▶ Suppose $\vec w$ separates the data, but $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) = 0.01$.

▶ Define $\vec u = \frac{1}{0.01} \vec w = 100 \, \vec w$.

▶ Then $y_i \, \vec u \cdot \operatorname{Aug}(\vec x^{(i)}) = 1$.

▶ Note: $\|\vec u\|$ is large!


▶ Suppose $\vec w$ separates the data, but $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) = 0.5$.

▶ Define $\vec u = \frac{1}{0.5} \vec w = 2 \, \vec w$.

▶ Then $y_i \, \vec u \cdot \operatorname{Aug}(\vec x^{(i)}) = 1$.

▶ Note: $\|\vec u\|$ is smaller!


The Trick

▶ We will demand that

$$y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1$$

▶ The larger $\|\vec w\|$, the smaller the margin.

▶ New Goal: Minimize $\|\vec w\|^2$ subject to $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1$ for all $i$.


Optimization

▶ Minimize $\|\vec w\|^2$ subject to $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1$ for all $i$.

▶ This is a convex, quadratic optimization problem.

▶ Can be solved efficiently with quadratic programming.
▶ But there is no exact general formula for the solution.
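
For concreteness, here is a minimal sketch of this quadratic program written with the cvxpy modeling library (not part of the lecture); the toy data, which is assumed to be linearly separable, is made up:

```python
import cvxpy as cp
import numpy as np

# Hypothetical, linearly separable toy data; rows are Aug(x) = [1, x1, x2].
X_aug = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 3.0],
                  [1.0, -1.0, -1.0], [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(X_aug.shape[1])
objective = cp.Minimize(cp.sum_squares(w))        # minimize ||w||^2
constraints = [cp.multiply(y, X_aug @ w) >= 1]    # y_i (w . Aug(x_i)) >= 1 for all i
cp.Problem(objective, constraints).solve()
print(w.value)
```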


Support Vectors

▶ A support vector is a training point $\vec x^{(i)}$ such that

$$y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) = 1$$


Support Vector Machines (SVMs)

▶ The maximum margin solution $\vec w$ is a linear combination of the support vectors.

▶ Let $S$ be the set of support vectors. Then

$$\vec w = \sum_{i \in S} y_i \, \alpha_i \operatorname{Aug}(\vec x^{(i)})$$
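
If scikit-learn is used, a fitted sklearn.svm.SVC with a linear kernel exposes exactly these pieces: support_vectors_ holds the support vectors and dual_coef_ holds the products $y_i \alpha_i$, so the identity above can be checked numerically. Note that scikit-learn keeps the bias as a separate intercept_ rather than augmenting; the toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, well-separated toy blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # very large C ~ hard margin

# dual_coef_[0, j] = y_j * alpha_j for the j-th support vector.
w_from_svs = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_svs, clf.coef_))          # True: w is a combination of the SVs
```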


Example: Irises

▶ 3 classes: iris setosa, iris versicolor, iris virginica

▶ 4 measurements: petal length/width, sepal length/width


▶ Using only sepal width/petal width
▶ Two classes: versicolor (black), setosa (red)


Lecture 12 – Part 02 Soft-Margin SVMs


Non-Separability

▶ So far we’ve assumed data is linearly separable.

▶ What if it isn’t?


The Problem

▶ Old Goal: Minimize $\|\vec w\|^2$ subject to $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1$ for all $i$.

▶ This no longer makes sense.


Cut Some Slack

▶ Idea: allow each classification to be $\xi_i$ wrong, but not too wrong.


▶ New problem. Fix some number $C \ge 0$.

$$\min_{\vec w \in \mathbb{R}^{d+1},\; \vec\xi \in \mathbb{R}^n} \; \|\vec w\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1 - \xi_i$ for all $i$, and $\vec\xi \ge 0$.


The Slack Parameter, C

▶ 𝐶 controls how much slack is given.

$$\min_{\vec w \in \mathbb{R}^{d+1},\; \vec\xi \in \mathbb{R}^n} \; \|\vec w\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1 - \xi_i$ for all $i$, and $\vec\xi \ge 0$.

▶ Large $C$: don't give much slack; avoid misclassifications.

▶ Small $C$: allow more slack, at the cost of misclassifications.
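
One quick way to see this effect (a sketch, not from the lecture) is to fit a linear SVM on overlapping data with a small and a large $C$ and compare training accuracy and the number of support vectors; scikit-learn's C parameter plays the same role as the slack parameter here, and the data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping blobs: not linearly separable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.5, (50, 2)), rng.normal(1.0, 1.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: train accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```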


Example: Small C


Example: Large C


Soft and Hard Margins

▶ The max-margin SVM from before has a hard margin.

▶ Now: the soft margin SVM.

▶ As 𝐶 → ∞, the margin hardens.


Another View: Loss Functions

▶ Recall our problem:

$$\min_{\vec w \in \mathbb{R}^{d+1},\; \vec\xi \in \mathbb{R}^n} \; \|\vec w\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to $y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}) \ge 1 - \xi_i$ for all $i$, and $\vec\xi \ge 0$.

▶ Note: at the optimum, if $\vec x^{(i)}$ is misclassified (or falls inside the margin), then

$$\xi_i = 1 - y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)}),$$

and $\xi_i = 0$ otherwise.


▶ New problem:

$$\min_{\vec w \in \mathbb{R}^{d+1}} \; \|\vec w\|^2 + C \sum_{i=1}^{n} \max\{0,\, 1 - y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)})\}$$

▶ $\max\{0,\, 1 - y_i \, \vec w \cdot \operatorname{Aug}(\vec x^{(i)})\}$ is called the hinge loss.
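
A short NumPy helper makes the hinge loss concrete (the names are hypothetical):

```python
import numpy as np

def total_hinge_loss(w, X_aug, y):
    # Sum of max(0, 1 - y_i * w . Aug(x_i)): zero for points beyond the margin,
    # positive for points inside the margin or misclassified.
    margins = y * (X_aug @ w)
    return np.sum(np.maximum(0.0, 1.0 - margins))
```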


Another Way to Optimize

▶ We can use subgradient descent to minimize the SVM risk.
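
A minimal sketch of that idea, assuming the objective $\|\vec w\|^2 + C \sum_i \max\{0, 1 - y_i \vec w \cdot \operatorname{Aug}(\vec x^{(i)})\}$; the step size, iteration count, and variable names are made up, and this is only one of several reasonable update rules:

```python
import numpy as np

def svm_subgradient_descent(X_aug, y, C=1.0, lr=0.01, iters=1000):
    # Minimize ||w||^2 + C * sum_i max(0, 1 - y_i * w . Aug(x_i)) by subgradient steps.
    n, d = X_aug.shape
    w = np.zeros(d)
    for _ in range(iters):
        violated = y * (X_aug @ w) < 1          # points with nonzero hinge loss
        # Subgradient: 2w from ||w||^2, minus C * y_i * Aug(x_i) for each violated point.
        grad = 2.0 * w - C * (y[violated] @ X_aug[violated])
        w -= lr * grad
    return w
```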


Lecture 12 – Part 03 Sentiment Analysis


Why use linear predictors?

▶ Linear classifiers look very simple.

▶ That can be both good and bad.
▶ Good: the math is tractable, and they are less likely to overfit.
▶ Bad: they may be too simple and underfit.

▶ They can work surprisingly well.


Sentiment Analysis

▶ Given: a piece of text.

▶ Determine: whether it is positive or negative in tone.

▶ Example: “Needless to say, I wasted my money.”


The Data

▶ Sentences from reviews on Amazon, Yelp, IMDB.

▶ Each labeled (by a human) positive or negative.

▶ Examples:
▶ “Needless to say, I wasted my money.”
▶ “I have to jiggle the plug to get it to line up right.”
▶ “Will order from them again!”
▶ “He was very impressed when going from the original battery to the extended battery.”


The Plan

▶ We’ll train a soft-margin SVM.

▶ Problem: SVMs take fixed-length vectors as inputs, not sentences.


Bags of Words

To turn a document into a fixed-length vector:

▶ First, choose a dictionary of words:
▶ E.g.: [“wasted”, “impressed”, “great”, “bad”, “again”]

▶ Count the number of occurrences of each dictionary word in the document.
▶ “It was bad. So bad that I was impressed at how bad it was.” → $(0, 1, 0, 3, 0)^T$

▶ This is called a bag of words representation.
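
Here is a small sketch of the mapping, using the tiny dictionary from the example above (in practice a library class such as sklearn.feature_extraction.text.CountVectorizer does the same thing over a full vocabulary):

```python
import re

dictionary = ["wasted", "impressed", "great", "bad", "again"]

def bag_of_words(text, dictionary):
    # Count how many times each dictionary word appears in the document.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in dictionary]

doc = "It was bad. So bad that I was impressed at how bad it was."
print(bag_of_words(doc, dictionary))   # [0, 1, 0, 3, 0]
```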


Choosing the Dictionary

▶ Many ways of choosing the dictionary.

▶ Easiest: take all of the words in the training set.
▶ Perhaps throw out stop words like “the”, “a”, etc.

▶ Resulting dimensionality of the feature vectors: large.


Experiment

▶ Bag-of-words features with a 4500-word dictionary.

▶ 2500 training sentences, 500 test sentences.

▶ Train a soft margin SVM.
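
A hedged scikit-learn sketch of this kind of experiment; the sentences, labels, split, and C value below are placeholders rather than the lecture's actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data; the real experiment uses thousands of labeled review sentences.
sentences = ["Needless to say, I wasted my money.",
             "I have to jiggle the plug to get it to line up right.",
             "Will order from them again!",
             "He was very impressed when going from the original battery to the extended battery."]
labels = [-1, -1, 1, 1]

vectorizer = CountVectorizer(stop_words="english")          # bag of words, stop words dropped
X = vectorizer.fit_transform(sentences)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = LinearSVC(C=0.32).fit(X_tr, y_tr)                     # soft-margin linear SVM
print("test error:", 1 - clf.score(X_te, y_te))
```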


Choosing C

▶ We have to choose the slack parameter, 𝐶.

▶ Use cross validation!
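
One common way to do this (a sketch under assumed names, not the lecture's code) is a grid search over candidate $C$ values with k-fold cross-validation:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the bag-of-words training data: 200 sentences, 50 features.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(200, 50)).astype(float)
y_train = np.where(X_train[:, 0] + X_train[:, 1] > X_train[:, 2] + 1.0, 1, -1)

param_grid = {"C": np.logspace(-3, 3, 13)}               # candidate slack parameters
search = GridSearchCV(LinearSVC(), param_grid, cv=5)     # 5-fold cross-validation
search.fit(X_train, y_train)
print("best C:", search.best_params_["C"], "cv accuracy:", round(search.best_score_, 3))
```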


Cross Validation


Results

▶ With $C = 0.32$, test error ≈ 15.6%.
