Lecture notes for Stat 231: Pattern Recognition and Machine Learning
1. Stat 231. A.L. Yuille. Fall 2004
PAC Learning and Generalizability. Margin Errors. Structural Risk Minimization.
2. Induction: History.
Francis Bacon described empiricism: formulate hypotheses and test them by experiments.
English Empiricist School of Philosophy. David Hume (Scottish): scepticism. “Why should the Sun rise tomorrow just because it always has?”
Karl Popper, The Logic of Scientific Discovery: the Falsifiability Principle. “A hypothesis is useless unless it can be disproven.”
3. Risk and Empirical Risk
Dataset: samples {(x_i, y_i) : i = 1, ..., n} drawn from an unknown distribution P(x, y). Set of learning machines {f(x, α) : α ∈ Λ} (e.g. all thresholded hyperplanes).
Risk: R(α) = ∫ L(y, f(x, α)) dP(x, y).
Specialize: two classes, M = 2. The loss function counts misclassifications, i.e. L(y, f(x, α)) = 1 if f(x, α) ≠ y and 0 otherwise.
Empirical Risk: R_emp(α) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i, α)), computed on the dataset.
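As a concrete illustration, here is a minimal Python sketch (not part of the original notes) of the empirical risk under 0-1 loss for a thresholded hyperplane classifier; the toy data and the particular weight vector are made up for the example.

```python
# A minimal sketch of the empirical risk for a thresholded hyperplane
# classifier f(x; w, b) = sign(w.x + b) under 0-1 loss.
import numpy as np

def empirical_risk(w, b, X, y):
    """Fraction of misclassified samples: R_emp = (1/n) sum_i 1[f(x_i) != y_i]."""
    predictions = np.sign(X @ w + b)          # thresholded hyperplane
    return float(np.mean(predictions != y))   # 0-1 loss averaged over the dataset

# Example: a toy 2D dataset with labels in {-1, +1} (hypothetical values).
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, -1.0], [3.0, 0.5]])
y = np.array([+1, +1, -1, -1])
print(empirical_risk(np.array([-1.0, 0.5]), 1.0, X, y))
```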
4. Risk and Empirical Risk
Key Concept: the Vapnik-Chervonenkis (VC) dimension h.
The VC dimension is a function of the set of classifiers. It is independent of the distribution P(x, y) of the data.
The VC dimension is a measure of the “degrees of freedom” of the set of classifiers.
Intuitively, the size of the dataset n must be larger than the VC dimension before you can learn.
E.g. Cover’s theorem: hyperplanes in d dimensions need at least 2(d+1) samples before it becomes unlikely that a purely chance dichotomy (a random labelling that happens to be linearly separable) can be found (see the counting sketch below).
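The following sketch (added for illustration, not from the notes) evaluates Cover’s function-counting formula: of the 2^n dichotomies of n points in general position in R^d, exactly C(n, d) = 2 Σ_{k=0}^{d} binom(n−1, k) are linearly separable, which is where the 2(d+1) rule of thumb comes from.

```python
# Cover's function-counting theorem: the fraction of the 2^n dichotomies of
# n points in general position in R^d realizable by a hyperplane is
# C(n, d) / 2^n, where C(n, d) = 2 * sum_{k=0}^{d} binom(n-1, k).
from math import comb

def separable_fraction(n, d):
    """Probability that a random labelling of n generic points in R^d is linearly separable."""
    count = 2 * sum(comb(n - 1, k) for k in range(d + 1))
    return count / 2 ** n

d = 3
for n in [d + 1, 2 * (d + 1), 4 * (d + 1)]:
    print(n, separable_fraction(n, d))
# At n = d + 1 every labelling is separable (fraction 1.0); at n = 2(d + 1) the
# fraction is exactly 0.5, and it falls rapidly for larger n, so fitting a
# random labelling is no longer likely to happen by chance.
```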
5. PAC.
Probably Approximately Correct (PAC).
If h < n, where h is the VC dimension of the classifier set, then with probability at least 1 − η:
R(α) ≤ R_emp(α) + φ(h, n, η),
where
φ(h, n, η) = sqrt( [ h (ln(2n/h) + 1) − ln(η/4) ] / n ).
For hyperplanes, h = d + 1.
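As an illustration of how the bound behaves, here is a small sketch (not from the notes) that evaluates the capacity term φ(h, n, η) above; the choices of η, d and n are arbitrary example values.

```python
# Capacity term in the PAC bound above:
# phi(h, n, eta) = sqrt( (h * (ln(2n/h) + 1) - ln(eta/4)) / n ).
import math

def capacity_term(h, n, eta=0.05):
    """Confidence term added to the empirical risk; valid for h < n."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def risk_bound(emp_risk, h, n, eta=0.05):
    """With probability >= 1 - eta, the true risk is at most this value."""
    return emp_risk + capacity_term(h, n, eta)

# For hyperplanes in d = 256 dimensions, h = d + 1 = 257; the bound only
# becomes informative once n is much larger than h.
for n in [1_000, 10_000, 100_000]:
    print(n, round(risk_bound(0.01, 257, n), 3))
```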
6. PAC
Generalizability: a small empirical risk implies, with high probability, a small risk, provided the capacity term φ(h, n, η) is small.
“Probably Approximately Correct” (PAC), because we can never be completely sure that we haven’t been misled by rare samples.
In practice, require h/n to be small together with a small η.
7. PAC
This is the basic machine learning result; there are a number of variants.
The VC dimension is one measure of the capacity of the set of classifiers. Other measures give tighter bounds but are harder to compute: the annealed VC entropy and the growth function.
The VC dimension is d + 1 for thresholded hyperplanes. It can also be bounded nicely for separable kernels (later this lecture).
A forthcoming lecture will sketch the derivation of the PAC bound. It makes use of the probability of rare events (e.g. Cramér’s theorem, Sanov’s theorem).
8. VC for Margins
The VC dimension is the largest number of data points that can be shattered by the classifier set.
Shattered means that every possible dichotomy of the data points can be realized by some classifier in the set (cf. Cover’s hyperplane counting).
The VC dimension is d + 1 for thresholded hyperplanes in d dimensions.
But we can obtain tighter effective VC dimensions by considering margins. These bounds extend directly to kernel hyperplanes.
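A small sketch (added here, not from the notes) makes the definition of shattering concrete: it enumerates every dichotomy of a point set and tests each for linear separability by solving an LP feasibility problem; scipy is assumed as the solver.

```python
# Check whether a point set can be shattered by thresholded hyperplanes, by
# testing every dichotomy for linear separability: find w, b with
# y_i (w.x_i + b) >= 1 for all i (an LP feasibility problem).
from itertools import product
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """True if some hyperplane w.x + b realizes the labelling y in {-1,+1}^n."""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # rows: -y_i * [x_i, 1]
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0

def is_shattered(X):
    """True if every one of the 2^n dichotomies of the rows of X is separable."""
    return all(is_separable(X, np.array(y))
               for y in product([-1, 1], repeat=len(X)))

# Three points in general position in R^2 can be shattered (VC dim of lines is 3),
# but four points in the XOR configuration cannot.
print(is_shattered(np.array([[0., 0.], [1., 0.], [0., 1.]])))            # True
print(is_shattered(np.array([[0., 0.], [1., 1.], [1., 0.], [0., 1.]])))  # False
```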
9. VC Margin Hyperplanes
Hyperplanes f(x) = sign(w · x + b), normalized with respect to the data so that min_i |w · x_i + b| = 1 (canonical form).
Then the set of classifiers satisfying ||w|| ≤ A has VC dimension satisfying:
h ≤ min(R² A², d) + 1,
where R is the radius of the smallest sphere containing the datapoints. Recall that 1/||w|| is the margin, so the margin is ≥ 1/A.
Enforcing a large margin effectively limits the VC dimension.
10. VC Margin: Kernels.
The same technique applies to kernels. Claim: finding the radius R of the minimum sphere that encloses the data in feature space depends on the feature vectors Φ(x_i) only through the kernel K(x_i, x_j) = Φ(x_i) · Φ(x_j) (the kernel trick).
Primal: minimize R² over the centre c and radius R, subject to ||Φ(x_i) − c||² ≤ R² for all i.
Introduce Lagrange multipliers λ_i ≥ 0. Dual: maximize
W(λ) = Σ_i λ_i K(x_i, x_i) − Σ_{i,j} λ_i λ_j K(x_i, x_j)
s.t. λ_i ≥ 0 and Σ_i λ_i = 1. It depends on dot products only!
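Here is a rough sketch (not from the notes) of solving this dual numerically for a given Gram matrix; the SLSQP solver and the toy polynomial kernel are arbitrary choices for illustration. At the optimum, R² equals the dual value, and the resulting R can be plugged into the margin bound h ≤ min(R²A², d) + 1 from the previous slide.

```python
# Estimate the radius R of the smallest sphere enclosing the data in feature
# space by solving the dual above with a generic solver. At the optimum,
# R^2 = sum_i l_i K_ii - sum_{ij} l_i l_j K_ij.
import numpy as np
from scipy.optimize import minimize

def enclosing_sphere_radius(K):
    """K is the n x n kernel (Gram) matrix of the data."""
    n = len(K)
    diag = np.diag(K)

    def neg_dual(lam):                      # minimize the negative dual objective
        return -(lam @ diag - lam @ K @ lam)

    res = minimize(neg_dual, np.full(n, 1.0 / n),
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,                           # lambda_i >= 0
                   constraints={"type": "eq",
                                "fun": lambda lam: lam.sum() - 1.0})  # sum lambda_i = 1
    return np.sqrt(max(-res.fun, 0.0))

# Example with a polynomial kernel K(x, x') = (1 + x.x')^p on toy 2D data
# (the kernel and data here are illustrative, not the lecture's experiment).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for p in [1, 2, 4]:
    K = (1.0 + X @ X.T) ** p
    print(p, round(enclosing_sphere_radius(K), 3))
```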
11. Generalizability for Kernels
The capacity term is a monotonic function of h. Use the Margin VC bound to decide which kernels will do best for
learning the US Post Office handwritten dataset. For each kernel choice, solve the dual problem to estimate R. Assume that the empirical risk is negligible – because it is
possible to classify digits correctly using kernels (but not linear). This predicts that the fourth order kernel has the best
generalization – this compares nicely with the results of the classifiers when tested.
13. Structural Risk Minimization
Standard learning says: pick the classifier α* that minimizes the empirical risk. Traditional approach: use cross-validation to determine whether α* is generalizing.
VC theory says: evaluate the bound R(α) ≤ R_emp(α) + φ(h, n, η), and ensure there are enough samples that φ(h, n, η) is small.
Alternative: Structural Risk Minimization. Divide the set of classifiers into a nested hierarchy of sets S_1 ⊂ S_2 ⊂ ... ⊂ S_p ⊂ ..., with corresponding VC dimensions h_1 ≤ h_2 ≤ ... ≤ h_p ≤ ...
14. Structural Risk Minimization
Within each set S_p, select the classifier that minimizes:
Empirical Risk + Capacity Term, i.e. R_emp(α) + φ(h_p, n, η), and then choose the level p with the smallest value (a toy sketch follows below).
The capacity term determines the “generalizability” of the classifier. Increasing the amount of training data allows you to increase p and use a richer class of classifiers. Is the bound tight enough in practice?
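A toy sketch of this selection rule (added for illustration, not from the notes): a nested family of thresholded polynomial classifiers on 1D data, with the number of free parameters p + 1 used as a stand-in for h_p and the capacity term from the PAC slide above.

```python
# Structural risk minimization over a nested hierarchy: thresholded polynomial
# classifiers of increasing degree p on 1D data, scored by
# empirical risk + capacity term, then pick the degree with the smallest bound.
import math
import numpy as np

def capacity_term(h, n, eta=0.05):
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def fit_and_risk(x, y, p):
    """Least-squares polynomial score of degree p, thresholded at 0; returns R_emp."""
    Phi = np.vander(x, p + 1)                      # columns x^p, ..., x, 1
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean(np.sign(Phi @ w) != y))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.where(np.abs(x) > 0.5, 1.0, -1.0)          # toy labels needing degree >= 2

n = len(x)
bounds = {p: fit_and_risk(x, y, p) + capacity_term(p + 1, n) for p in range(1, 8)}
best_p = min(bounds, key=bounds.get)
print({p: round(b, 3) for p, b in bounds.items()}, "-> choose p =", best_p)
```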
16. Summary
PAC learning and the VC dimension. The VC dimension is a measure of the capacity of the set of classifiers.
The risk is bounded by the empirical risk plus a capacity term.
VC dimensions can be bounded for linear and kernel classifiers by the margin concept. This can predict which kernels are best able to generalize.
Structural Risk Minimization: penalize classifiers that have poor generalization bounds.