Lecture notes for Stat 231: Pattern Recognition and Machine Learning

1. Stat 231. A.L. Yuille. Fall 2004

Linear Separation and Margins. Non-Separable Data and Slack Variables. Duality and Support Vectors. Read 5.10, A.3, and 5.11 of Duda, Hart, and Stork, or better, 12.1 and 12.2 of Hastie, Tibshirani, and Friedman.

2. Separation by Hyperplanes

Data: \{(x_i, y_i)\}, i = 1, \dots, N, with labels y_i \in \{-1, +1\}. Hyperplane: \{x : w \cdot x + b = 0\}. Linear classifier: f(x) = \mathrm{sign}(w \cdot x + b).

By simple geometry, the signed distance of a point x to the plane is (w \cdot x + b)/\|w\|. The line through x perpendicular to the plane is x(t) = x - t \, w/\|w\|. It hits the plane when w \cdot x(t) + b = 0, i.e. at t = (w \cdot x + b)/\|w\|, which implies that the distance is (w \cdot x + b)/\|w\|.
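A quick numerical check of the distance formula (numbers invented for illustration): for w = (3, 4), b = -5, and x = (2, 3),

(w \cdot x + b)/\|w\| = (3 \cdot 2 + 4 \cdot 3 - 5)/\sqrt{3^2 + 4^2} = 13/5 = 2.6,

so x lies 2.6 units on the positive side of the plane.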

3. Margin, Support Vectors

We introduce two new concepts: (I) the margin, (II) support vectors. These will enable us to understand performance in the non-separable case.

Technical methods: quadratic optimization with linear constraints, Lagrange multipliers, and duality.

Margins will also be important when studying generalization. Everything in this lecture can be extended beyond hyperplanes (next lecture).

4. Margin for Separable Data

Assume there is a separating hyperplane, i.e. y_i (w \cdot x_i + b) > 0 for all i = 1, \dots, N.

We seek the classifier with the biggest margin C:

\max_{w, b, \|w\| = 1} C \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge C, \; i = 1, \dots, N.
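A standard reformulation (this is the step used implicitly on the Quadratic Programming slide below, where the restriction \|w\| = 1 is dropped): without the constraint \|w\| = 1 the conditions read y_i (w \cdot x_i + b)/\|w\| \ge C, and since (w, b) can be rescaled freely we may set \|w\| = 1/C, so that maximizing the margin is equivalent to

\min_{w, b} \tfrac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \; i = 1, \dots, N.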

5. Margin Non-Separable

Use the concept of margin to define an optimality criterion for non-separable data.

For the data samples (x_i, y_i), i = 1, \dots, N, define slack variables z_i \ge 0. Seek the hyperplane that maximizes the margin while allowing a limited amount K of slack (margin violations and misclassified data). One criterion:

\max_{w, b, \|w\| = 1} C

subject to y_i (w \cdot x_i + b) \ge C (1 - z_i), \quad z_i \ge 0, \quad \sum_{i=1}^{N} z_i \le K.

6. Margin Non-Separable

If z_j < 1, then data point j is correctly classified by the hyperplane. If z_j > 0, then z_j is the proportional amount by which data point j lies on the wrong side of its margin; misclassification corresponds to z_j > 1, so the constraint \sum_j z_j \le K bounds the number of training misclassifications by K.
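A concrete illustration (numbers invented), with the margin normalized to C = 1: a point with y_j (w \cdot x_j + b) = 0.4 has slack z_j = 0.6 (inside the margin but on the correct side), while a point with y_j (w \cdot x_j + b) = -0.5 has z_j = 1.5 > 1 and is misclassified.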

From this criterion, the points closest to the hyperplane are the ones that most influence its form (more details later). These are the data points that are hardest to classify; they will become the support vectors. By contrast, data points far from the hyperplane are less important. This differs from probability estimation, where every data point contributes to the estimate.

7. Quadratic Programming

Remove the restriction that \|w\| = 1. Define the criterion: minimize

\tfrac{1}{2} \|w\|^2 + \gamma \sum_{i=1}^{N} z_i

subject to y_i (w \cdot x_i + b) \ge 1 - z_i and z_i \ge 0 for all i, where the constant \gamma controls the amount of slack (it replaces the budget K).

This is a quadratic primal problem with linear constraints (unique solution). It can be formulated using Lagrange multipliers.

Variables: w, b, the slacks \{z_i\}, plus Lagrange multipliers \{\alpha_i\} and \{\mu_i\} for the two sets of constraints.

8. Quadratic Programming

Extremizing (differentiating) the primal Lagrangian (written out on the Primal to Dual slide below) with respect to w, b, and the z_i respectively yields

w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad \alpha_i = \gamma - \mu_i \; \text{for all } i.

The solution is w = \sum_i \alpha_i y_i x_i.

The solution only depends on the support vectors: the data points with \alpha_i > 0.

9. Duality

Any quadratic optimization problem L_p with linear constraints can be reformulated in terms of a dual problem L_d.

The variables of the dual problem are the Lagrange multipliers of the primal problem; in this case they are the \alpha_i.

Linear algebra gives:

L_d = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,

to be maximized subject to 0 \le \alpha_i \le \gamma and \sum_i \alpha_i y_i = 0. Standard packages exist to solve the primal and the (easier) dual problem.
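As a minimal illustration of the "standard packages" remark (scikit-learn is my choice here, not part of the original notes, and the toy data are invented), SVC with a linear kernel solves this dual and exposes the support vectors, the values \alpha_i y_i, and the resulting w and b:

import numpy as np
from sklearn.svm import SVC

# toy two-class data, invented purely for illustration
X = np.array([[2.0, 2.0], [1.0, 3.0], [2.5, 1.0],
              [-2.0, -1.0], [-1.0, -3.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C plays the role of gamma above
print(clf.support_)                           # indices of the support vectors
print(clf.dual_coef_)                         # alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)              # w and b

# check the form of the solution: w = sum_i alpha_i y_i x_i over support vectors only
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))

The final check confirms the form of the solution from the previous slide: w is a weighted sum of the support vectors alone.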

Primal to Dual

To obtain the dual formulation, rewrite the primal as the Lagrangian

L_p = \tfrac{1}{2} \|w\|^2 + \gamma \sum_i z_i - \sum_i \alpha_i \big[ y_i (w \cdot x_i + b) - (1 - z_i) \big] - \sum_i \mu_i z_i, \qquad \alpha_i \ge 0, \; \mu_i \ge 0.

Extremize with respect to w, b, and the z_i (set the derivatives to zero) and substitute the results back in. All terms cancel except

L_d = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j.
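For completeness, the cancellation written out (a standard computation, not shown on the original slide): substituting w = \sum_i \alpha_i y_i x_i, \sum_i \alpha_i y_i = 0, and \alpha_i = \gamma - \mu_i into L_p, the quadratic and cross terms combine as

\tfrac{1}{2} \|w\|^2 - \sum_i \alpha_i y_i \, w \cdot x_i = -\tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,

the b term vanishes because \sum_i \alpha_i y_i = 0, and the slack terms vanish because \gamma - \alpha_i - \mu_i = 0, leaving exactly L_d above.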

10. Support Vectors

The form of the solution is w = \sum_i \alpha_i y_i x_i. Now \alpha_i \ne 0 only for the data points whose constraint is active, i.e. y_i (w \cdot x_i + b) = 1 - z_i.

These are the support vectors. Two types: (1) those exactly on the margin, for which z_i = 0 and 0 < \alpha_i < \gamma; (2) those past the margin, for which z_i > 0 and \alpha_i = \gamma.

This characterization follows from the Karush-Kuhn-Tucker conditions (next slide).

11. Karush-Kuhn-Tucker

KKT conditions: relations between the Lagrange multipliers and the constraints,

\alpha_i \big[ y_i (w \cdot x_i + b) - (1 - z_i) \big] = 0, \qquad \mu_i z_i = 0, \qquad y_i (w \cdot x_i + b) - (1 - z_i) \ge 0, \quad \text{for all } i.

Use any margin point to solve for b.
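Spelling out that last step (a standard consequence of the KKT conditions, not shown on the slide): a margin point has 0 < \alpha_i < \gamma, hence \mu_i > 0 and z_i = 0, so y_i (w \cdot x_i + b) = 1 and b = y_i - w \cdot x_i (using 1/y_i = y_i for y_i \in \{-1, +1\}). In practice one averages this value over all margin points for numerical stability.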

12. Perceptron and Margins.

The Perceptron rule can be re-interpreted in terms of the margin and given a formulation in dual space.

Perceptron convergence. The critical quantity for the convergence of the Perceptron is R/m, where R is the radius of the smallest ball containing the data and m is the margin of a separating hyperplane, m = \min_i y_i (w^* \cdot x_i + b^*)/\|w^*\|. Then the number of Perceptron errors is bounded above by (R/m)^2.
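For instance (numbers purely illustrative): if the data lie in a ball of radius R = 10 and are separable with margin m = 0.5, the bound guarantees at most (10/0.5)^2 = 400 Perceptron mistakes, independently of the dimension of the data.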

13. Perceptron in Dual Space

The Perceptron learning algorithm works by adding misclassified data examples to the weights. Set the initial weights to zero.

The weight hypothesis will then always be of the form w = \sum_i \alpha_i y_i x_i, where \alpha_i counts how often example i has been misclassified. Perceptron rule in dual space: an update rule for the \alpha_i. If data point i is misclassified, i.e. y_i \sum_j \alpha_j y_j \, x_j \cdot x_i \le 0, then set \alpha_i \to \alpha_i + 1.
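A minimal runnable sketch of this dual-space rule (my own illustration, assuming the separable case, a zero bias term, and invented toy data):

import numpy as np

def dual_perceptron(X, y, n_sweeps=100):
    # X: (N, d) data, y: (N,) labels in {-1, +1}.
    # Returns the dual coefficients alpha; w = sum_i alpha_i * y_i * x_i.
    N = X.shape[0]
    alpha = np.zeros(N)
    gram = X @ X.T                       # only inner products are needed
    for _ in range(n_sweeps):
        mistakes = 0
        for i in range(N):
            # current classifier at x_i: sum_j alpha_j y_j (x_j . x_i)
            if y[i] * np.sum(alpha * y * gram[:, i]) <= 0:
                alpha[i] += 1            # add the misclassified example
                mistakes += 1
        if mistakes == 0:                # converged: every point classified correctly
            break
    return alpha

# toy separable data (invented)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
alpha = dual_perceptron(X, y)
w = (alpha * y) @ X                      # recover the primal weights
print(alpha, w, np.sign(X @ w))

Note that only the inner products x_j \cdot x_i ever appear, which is what allows the extension beyond hyperplanes mentioned at the start of the lecture.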

Summary

Linear separability and margins.
Slack variables z_i – formulation for the non-separable case.
Quadratic optimization with linear constraints: primal problem L_p and dual problem L_d (standard techniques for solution).
Dual Perceptron rule (separable case only).
Solution of the form w = \sum_i \alpha_i y_i x_i; the dual variables \alpha_i determine the support vectors.
Support vectors – the hard-to-classify data (no analog in probability estimation).