
Page 1

The Learning Problem and Regularization

Tomaso Poggio

9.520 Class 02

February 2011

Page 2

Computational Learning

Statistical Learning Theory

Learning is viewed as a generalization/inference problem from usually small sets of high-dimensional, noisy data.

Page 3

Learning Tasks and Models

Supervised
Semisupervised
Unsupervised
Online
Transductive
Active
Variable Selection
Reinforcement
...

One can consider the data to be created in a deterministic, stochastic, or even adversarial way.

Page 4

Where to Start?

Statistical and Supervised Learning

Statistical models are essential to deal with noise, sampling, and other sources of uncertainty. Supervised learning is by far the best understood class of problems.

Regularization

Regularization provides a fundamental framework to solve learning problems and design learning algorithms. We present a set of ideas and tools which are at the core of several developments in supervised learning and beyond it. This is a theory, with associated algorithms, that works in practice, e.g. in products such as vision systems for cars. Later in the semester we will learn about ongoing research combining neuroscience and learning. The latter research is at the frontier of approaches that may or may not work in practice (similar to Bayesian techniques: it is still unclear how well they work beyond toy or special problems).

Page 6

Remarks on Foundations of Learning Theory

Intelligent behavior (at least learning) consists of optimizing under constraints. Constraints are key for solving computational problems; constraints are key for prediction. Constraints may correspond to rather general symmetry properties of the problem (e.g. time invariance, space invariance, invariance to physical units (the pi theorem), universality of numbers and metrics implying normalization, etc.).

Key questions at the core of learning theory:
generalization and predictivity, not explanation;
probabilities are unknown, only data are given;
which constraints are needed to ensure generalization (and therefore which hypothesis spaces)?
regularization techniques usually result in computationally "nice" and well-posed optimization problems.

Page 7

Plan

Part I: Basic Concepts and Notation
Part II: Foundational Results
Part III: Algorithms

Page 8

Problem at a Glance

Given a training set of input-output pairs

Sn = (x1, y1), . . . , (xn, yn)

find fS such that fS(x) ∼ y.

e.g. the x's are vectors and the y's are discrete labels in classification and real values in regression.

Page 9

Learning is Inference

For the above problem to make sense we need to assume input and output to be related!

Statistical and Supervised Learning

Each input-output pair is a sample from a fixed but unknown distribution p(x, y). Under general conditions we can write

p(x, y) = p(y|x) p(x).

The training set Sn is a set of independently and identically distributed samples.

Page 11

Again: Data Generated By A Probability Distribution

We assume that there are an "input" space X and an "output" space Y. We are given a training set S consisting of n samples drawn i.i.d. from the probability distribution µ(z) on Z = X × Y:

(x1, y1), . . . , (xn, yn)

that is, z1, . . . , zn. We will use the conditional probability of y given x, written p(y|x):

µ(z) = p(x, y) = p(y|x) · p(x)

It is crucial to note that we view p(x, y) as fixed but unknown.
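To make the factorization concrete, here is a minimal sketch (not from the slides; the uniform marginal, sinusoidal target and Gaussian noise are illustrative assumptions) of how a training set Sn is generated by first drawing x from p(x) and then y from p(y|x).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n, sigma=0.1):
    """Draw n i.i.d. pairs (x_i, y_i) from p(x, y) = p(y|x) p(x).

    Here p(x) is uniform on [0, 1] and p(y|x) is a Gaussian centered at a
    target function unknown to the learner -- both are illustrative choices.
    """
    x = rng.uniform(0.0, 1.0, size=n)                            # sample from the marginal p(x)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)   # sample from p(y|x)
    return x, y

x_train, y_train = sample_training_set(n=20)
```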

Page 12

Noise ...

(Figure: the conditional distribution p(y|x) for a given x, with input space X and output space Y.)

the same x can generate different y (according to p(y|x)):

the underlying process is deterministic, but there is noise in the measurement of y;
the underlying process is not deterministic;
the underlying process is deterministic, but only incomplete information is available.

Page 13

...and Sampling

(Figure: sample points and the marginal distribution p(x) over the input space.)

Even in a noise-free case we have to deal with sampling.

The marginal distribution p(x) might model:

errors in the location of the input points;
discretization error for a given grid;
presence or absence of certain input instances.

Page 17

Problem at a Glance

Given a training set of input-output pairs

Sn = (x1, y1), . . . , (xn, yn)

find fS such that fS(x) ∼ y.

e.g. the x's are vectors and the y's are discrete labels in classification and real values in regression.

Page 18

Learning, Generalization and Overfitting

Predictivity or Generalization
Given the data, the goal is to learn how to make decisions/predictions about future data, i.e. data not belonging to the training set. Generalization is the key requirement emphasized in Learning Theory. This emphasis makes it different from traditional statistics (especially explanatory statistics) or Bayesian freakonomics.

The problem is often: Avoid overfitting!!

Page 19

Loss functions

As we look for a deterministic estimator in a stochastic environment, we expect to incur errors.

Loss function
A loss function V : R × Y determines the price V(f(x), y) we pay for predicting f(x) when in fact the true output is y.

Page 21

Loss functions for regression

The most common is the square loss or L2 loss

V(f(x), y) = (f(x) − y)²

Absolute value or L1 loss:

V(f(x), y) = |f(x) − y|

Vapnik’s ε-insensitive loss:

V(f(x), y) = (|f(x) − y| − ε)_+
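These three regression losses are simple elementwise formulas; the following sketch (illustrative, not part of the slides) evaluates them for vectors of predictions f(x) and targets y.

```python
import numpy as np

def square_loss(fx, y):
    """L2 loss: V(f(x), y) = (f(x) - y)^2."""
    return (fx - y) ** 2

def absolute_loss(fx, y):
    """L1 loss: V(f(x), y) = |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, eps=0.1):
    """Vapnik's epsilon-insensitive loss: V(f(x), y) = (|f(x) - y| - eps)_+."""
    return np.maximum(np.abs(fx - y) - eps, 0.0)

fx = np.array([0.2, 0.9, 1.5])
y = np.array([0.0, 1.0, 1.0])
print(square_loss(fx, y), absolute_loss(fx, y), eps_insensitive_loss(fx, y))
```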

Page 22

Loss functions for (binary) classification

The most intuitive one: the 0−1 loss:

V(f(x), y) = θ(−y f(x))

(θ is the step function). The more tractable hinge loss:

V(f(x), y) = (1 − y f(x))_+

And again the square loss or L2 loss:

V(f(x), y) = (1 − y f(x))²
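For binary labels y ∈ {−1, +1}, the classification losses above can be written directly in terms of the margin y f(x); a small illustrative sketch (not from the slides):

```python
import numpy as np

def zero_one_loss(fx, y):
    """0-1 loss: V(f(x), y) = theta(-y f(x)); here margin 0 is counted as an error (a convention)."""
    return (y * fx <= 0).astype(float)

def hinge_loss(fx, y):
    """Hinge loss: V(f(x), y) = (1 - y f(x))_+."""
    return np.maximum(1.0 - y * fx, 0.0)

def square_loss_classification(fx, y):
    """Square loss for classification: V(f(x), y) = (1 - y f(x))^2."""
    return (1.0 - y * fx) ** 2

fx = np.array([1.2, -0.3, 0.4])
y = np.array([1, 1, -1])
print(zero_one_loss(fx, y), hinge_loss(fx, y), square_loss_classification(fx, y))
```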

Page 23

Loss functions

Page 24

Expected Risk

A good function – we will also speak of a hypothesis – should incur only a few errors. We need a way to quantify this idea.

Expected Risk
The quantity

I[f] = ∫_{X×Y} V(f(x), y) p(x, y) dx dy

is called the expected error and measures the loss averaged over the unknown distribution.

A good function should have small expected risk.

Page 26

Target Function

The expected risk is usually defined on some large space F, possibly dependent on p(x, y).

The best possible error is

inf_{f∈F} I[f]

The infimum is often achieved at a minimizer f∗ that we call the target function.

Page 27

Learning Algorithms and Generalization

A learning algorithm can be seen as a map

Sn → fn

from the training set to a set of candidate functions.

Page 28

Basic definitions

p(x, y): probability distribution
Sn: training set
V(f(x), y): loss function
In[f] = (1/n) ∑_{i=1}^n V(f(xi), yi): empirical risk
I[f] = ∫_{X×Y} V(f(x), y) p(x, y) dx dy: expected risk
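The empirical risk is computable from the training set alone, while the expected risk involves the unknown distribution. The sketch below is illustrative only: the data-generating distribution is an assumption made so that I[f] can be approximated by Monte Carlo, which a learner of course cannot do.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Illustrative p(x, y): x uniform on [0, 1], y = sin(2*pi*x) + Gaussian noise (an assumption)."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

square = lambda fx, y: (fx - y) ** 2

def empirical_risk(f, x, y, loss):
    """In[f] = (1/n) sum_i V(f(x_i), y_i), computed from the training set."""
    return np.mean(loss(f(x), y))

def expected_risk_mc(f, loss, n_mc=200_000):
    """Monte Carlo approximation of I[f]; possible here only because p(x, y) can be simulated."""
    x, y = sample(n_mc)
    return np.mean(loss(f(x), y))

f = lambda x: np.zeros_like(x)                 # the trivial predictor f(x) = 0
x_tr, y_tr = sample(20)
print(empirical_risk(f, x_tr, y_tr, square), expected_risk_mc(f, square))
```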

Page 29

Reminder

Convergence in probability
Let {Xn} be a sequence of bounded random variables. Then

lim_{n→∞} Xn = X in probability

if

∀ε > 0, lim_{n→∞} P{|Xn − X| ≥ ε} = 0

Convergence in Expectation
Let {Xn} be a sequence of bounded random variables. Then

lim_{n→∞} Xn = X in expectation

if

lim_{n→∞} E(|Xn − X|) = 0

Convergence in the mean implies convergence in probability.

Page 31

Consistency and Universal Consistency

A requirement considered of basic importance in classical statistics is for the algorithm to get better as we get more data (in the context of machine learning, consistency is less immediately critical than generalization)...

Consistency
We say that an algorithm is consistent if

∀ε > 0, lim_{n→∞} P{I[fn] − I[f∗] ≥ ε} = 0

Universal Consistency
We say that an algorithm is universally consistent if, for all probability distributions p,

∀ε > 0, lim_{n→∞} P{I[fn] − I[f∗] ≥ ε} = 0

Page 34

Sample Complexity and Learning Rates

The above requirements are asymptotic.

Error Rates
A more practical question is: how fast does the error decay? This can be expressed as

P{I[fn] − I[f∗] ≤ ε(n, δ)} ≥ 1 − δ.

Sample Complexity
Or equivalently: 'how many points do we need to achieve an error ε with a prescribed probability δ?' This can be expressed as

P{I[fn] − I[f∗] ≤ ε} ≥ 1 − δ,

for n = n(ε, δ).

Page 39

Empirical risk and Generalization

How do we design learning algorithms that work? One of the most natural ideas is ERM...

Empirical Risk
The empirical risk is a natural proxy (how good?) for the expected risk:

In[f] = (1/n) ∑_{i=1}^n V(f(xi), yi).

Generalization Error
The effectiveness of such an approximation is captured by the generalization error,

P{|I[fn] − In[fn]| ≤ ε} ≥ 1 − δ,

for n = n(ε, δ).

Page 42

Some (Theoretical and Practical) Questions

How do we go from data to an actual algorithm or class of algorithms?
Is minimizing error on the data a good idea?
Are there fundamental limitations in what we can and cannot learn?

Page 45

Plan

Part I: Basic Concepts and Notation
Part II: Foundational Results
Part III: Algorithms

Page 46

No Free Lunch Theorem (Devroye et al.)

Universal Consistency
Since classical statistics worries so much about consistency, let us start here, even if it is not the practically important concept. Can we learn any problem consistently? Or, equivalently, do universally consistent algorithms exist?
YES! Nearest neighbors, histogram rules, SVMs with (so-called) universal kernels...

No Free Lunch Theorem
Given a number of points (and a confidence), can we always achieve a prescribed error?
NO!

The last statement can be interpreted as follows: inference from finite samples can be effectively performed if and only if the problem satisfies some a priori condition.

Page 50

Hypotheses Space

Learning does not happen in a void. In statistical learning, a first prior assumption amounts to choosing a suitable space of hypotheses H.

The hypothesis space H is the space of functions that we allow our algorithm to "look at". For many algorithms (such as optimization algorithms) it is the space the algorithm is allowed to search. As we will see in future classes, it is often important to choose the hypothesis space as a function of the amount of data n available.

Page 52

Hypotheses Space

Examples: linear functions, polynomials, RBFs, Sobolev spaces...

Learning algorithm

A learning algorithm A is then a map from the data space to H,

A(Sn) = fn ∈ H.

Page 54

Empirical Risk Minimization

How do we choose H? How do we design A?

ERM
A prototype algorithm in statistical learning theory is Empirical Risk Minimization:

min_{f∈H} In[f].

Page 55

Reminder: Expected error, empirical error

Given a function f, a loss function V, and a probability distribution µ over Z, the expected or true error of f is:

I[f] = Ez V[f, z] = ∫_Z V(f, z) dµ(z)

which is the expected loss on a new example drawn at random from µ. We would like to make I[f] small, but in general we do not know µ.

Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is:

IS[f] = (1/n) ∑_i V(f, zi)

Page 56

Reminder: Generalization

A natural requirement for fS is distribution-independent generalization:

lim_{n→∞} |IS[fS] − I[fS]| = 0 in probability

This is equivalent to saying that for each n there exist an εn and a δ(εn) such that

P{|ISn[fSn] − I[fSn]| ≥ εn} ≤ δ(εn),    (1)

with εn and δ going to zero for n → ∞. In other words, the training error for the solution must converge to the expected error and thus be a "proxy" for it. Otherwise the solution would not be "predictive".

A desirable additional requirement is consistency:

∀ε > 0, lim_{n→∞} P{ I[fS] − inf_{f∈H} I[f] ≥ ε } = 0.

Page 57

A learning algorithm should be well-posed, e.g. stable

In addition to the key property of generalization, a "good" learning algorithm should also be stable: fS should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity. Stability is a good requirement for the learning problem and, in fact, for any mathematical problem. We open here a small parenthesis on stability and well-posedness.

Page 58

General definition of Well-Posed and Ill-Posed problems

A problem is well-posed if its solution:

exists
is unique
depends continuously on the data (e.g. it is stable)

A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution.

Page 59

More on well-posed and ill-posed problems

Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems. As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Then, given the linear, continuous operator L, consider the equation

g = Lu.

The direct problem is to compute g given u; the inverse problem is to compute u given the data g. In the learning case, L is somewhat similar to a "sampling" operation and the inverse problem becomes the problem of finding a function that takes the values

f(xi) = yi,  i = 1, . . . , n

The inverse problem of finding u is well-posed when the solution
exists,
is unique, and
is stable, that is, depends continuously on the initial data g.

Ill-posed problems fail to satisfy one or more of these criteria. Often the term ill-posed applies to problems that are not stable, which in a sense is the key condition.

Page 60

ERM

Given a training set S and a function space H, empirical risk minimization, as we have seen, is the class of algorithms that look at S and select fS as

fS = arg min_{f∈H} IS[f].

For example, linear regression is ERM when V(z) = (f(x) − y)² and H is the space of linear functions f = ax.
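For that example – square loss and H = {f(x) = ax} – ERM has a closed form; a minimal sketch on synthetic data (the data generator is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 2.0 * x + rng.normal(0, 0.1, size=30)        # noisy linear data (illustrative)

# ERM over H = {f(x) = a x} with the square loss:
#   a_hat = argmin_a (1/n) sum_i (a x_i - y_i)^2
a_hat = np.sum(x * y) / np.sum(x * x)            # closed-form minimizer
f_S = lambda t: a_hat * t

train_error = np.mean((f_S(x) - y) ** 2)         # IS[fS], the empirical risk of the solution
print(a_hat, train_error)
```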

Page 61

Generalization and Well-posedness of Empirical Risk Minimization

For ERM to represent a "good" class of learning algorithms, the solution should

generalize
exist, be unique and – especially – be stable (well-posedness), according to some definition of stability.

Page 62

ERM and generalization: given a certain number of samples...

Page 63

...suppose this is the “true” solution...

Page 64

... but suppose ERM gives this solution.

Page 65

Under which conditions does the ERM solution converge, as the number of examples increases, to the true solution? In other words... what are the conditions for generalization of ERM?

Page 66

ERM and stability: given 10 samples...

(Figure: ten sample points; both axes range from 0 to 1.)

Page 67

...we can find the smoothest interpolating polynomial (which degree?).

(Figure: the interpolating polynomial through the ten points; axes from 0 to 1.)

Page 68

But if we perturb the points slightly...

(Figure: the slightly perturbed points; axes from 0 to 1.)

Page 69

...the solution changes a lot!

(Figure: the interpolating polynomial for the perturbed points, very different from the original one; axes from 0 to 1.)

Page 70

If we restrict ourselves to degree two polynomials...

(Figure: a degree-two polynomial fit to the points; axes from 0 to 1.)

Page 71

...the solution varies only a small amount under a small perturbation.

(Figure: the degree-two fit to the perturbed points, nearly unchanged; axes from 0 to 1.)
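This instability is easy to reproduce numerically: interpolating ten points with a degree-nine polynomial and then perturbing the outputs slightly typically changes the curve far more than a degree-two least-squares fit does. A sketch (the specific data and perturbation scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.05, 0.95, 10)
y = 0.5 + 0.3 * np.sin(2 * np.pi * x)            # ten samples of a smooth function
y_pert = y + rng.normal(0, 0.01, size=10)        # small perturbation of the outputs

grid = np.linspace(0, 1, 200)

def fit_predict(x, y, degree):
    """Least-squares polynomial fit of the given degree, evaluated on the grid."""
    coeffs = np.polyfit(x, y, degree)            # degree 9 interpolates the 10 points
    return np.polyval(coeffs, grid)

for degree in (9, 2):
    change = np.max(np.abs(fit_predict(x, y, degree) - fit_predict(x, y_pert, degree)))
    print(f"degree {degree}: max change of the fitted curve = {change:.3f}")
# The degree-9 interpolant typically moves much more than the degree-2 fit.
```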

Page 72

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. It seems intriguing that the classical conditions for consistency of ERM – thus quite a different property – consist of appropriately restricting H. It seems that the same restrictions that make the approximation of the data stable may provide solutions that generalize...

Page 73

ERM: conditions for well-posedness (stability) and predictivity (generalization)

We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say fS, is such that |IS[fS] − I[fS]| converges to zero in probability as n increases. Note that the above requirement is NOT the law of large numbers; the requirement for a fixed f that |IS[f] − I[f]| converges to zero in probability as n increases IS the law of large numbers.

Page 74

ERM: conditions for well-posedness (stability) and predictivity (generalization) in the case of regression and classification

Theorem [Vapnik and Cervonenkis (71), Alon et al. (97), Dudley, Giné, and Zinn (91)]

A (necessary) and sufficient condition for generalization (and consistency) of ERM is that H is uGC.

Definition
H is a (weak) uniform Glivenko-Cantelli (uGC) class if

∀ε > 0, lim_{n→∞} sup_µ PS{ sup_{f∈H} |I[f] − IS[f]| > ε } = 0.

Page 75

ERM: conditions for well-posedness (stability) and predictivity (generalization) in the case of regression and classification

The theorem (Vapnik et al.) says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa). Other results characterize uGC classes in terms of measures of complexity or capacity of H (such as the VC dimension).

A separate theorem (Niyogi, Poggio et al.) also guarantees stability (defined in a specific way) of ERM (for supervised learning). Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM.

Thus the two desirable conditions for a supervised learning algorithm – generalization and stability – are equivalent (and they correspond to the same constraints on H).

Page 76

Key Theorem(s) Illustrated

Page 77

Key Theorem(s)

Uniform Glivenko-Cantelli Classes
We say that H is a uniform Glivenko-Cantelli (uGC) class if, for all p,

∀ε > 0, lim_{n→∞} P{ sup_{f∈H} |I[f] − In[f]| > ε } = 0.

A necessary and sufficient condition for consistency of ERM is that H is uGC.
See: [Vapnik and Cervonenkis (71), Alon et al. (97), Dudley, Giné, and Zinn (91)].

In turn, the uGC property is equivalent to requiring H to have finite capacity: the V_γ dimension in general and the VC dimension in classification.

Page 79

Stability

z = (x, y)
S = z1, ..., zn
Si = z1, ..., zi−1, zi+1, ..., zn

CV Stability
A learning algorithm A is CV_loo stable if for each n there exist a β_CV^(n) and a δ_CV^(n) such that, for all p,

P{ |V(fSi, zi) − V(fS, zi)| ≤ β_CV^(n) } ≥ 1 − δ_CV^(n),

with β_CV^(n) and δ_CV^(n) going to zero for n → ∞.
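For a concrete algorithm the quantity |V(fSi, zi) − V(fS, zi)| can be estimated empirically. The sketch below is illustrative: regularized linear least squares is used as the learning algorithm A and the square loss as V, both assumptions of mine rather than part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(np.pi * X[:, 0]) + rng.normal(0, 0.1, size=n)

def train(X, y, lam=0.1):
    """Regularized linear least squares (an illustrative, stable algorithm)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])                 # add a bias column
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(2), Xb.T @ y)
    return lambda Xnew: np.hstack([Xnew, np.ones((len(Xnew), 1))]) @ w

V = lambda fx, y: (fx - y) ** 2                               # square loss

f_S = train(X, y)
diffs = []
for i in range(n):
    f_Si = train(np.delete(X, i, axis=0), np.delete(y, i))    # leave z_i out and retrain
    zx, zy = X[i:i+1], y[i]
    diffs.append(abs(V(f_Si(zx)[0], zy) - V(f_S(zx)[0], zy)))

print("empirical CV_loo stability (max over i):", max(diffs))
```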

Page 80

Key Theorem(s) Illustrated

Page 81

ERM and ill-posedness

Ill-posed problems often arise if one tries to infer general laws from few data:

the hypothesis space is too large;
there are not enough data.

In general ERM leads to ill-posed solutions because

the solution may be too complex;

it may not be unique;

it may change radically when leaving one sample out.

Page 82

Regularization

Regularization is the classical way to restore well-posedness (and ensure generalization). Regularization (originally introduced by Tikhonov independently of the learning problem) ensures well-posedness and (because of the above argument) generalization of ERM by constraining the hypothesis space H. The direct way – minimize the empirical error subject to f being in a ball in an appropriate H – is called Ivanov regularization. The indirect way is Tikhonov regularization (which is not strictly ERM).

Page 83

Ivanov and Tikhonov Regularization

ERM finds the function in H which minimizes

(1/n) ∑_{i=1}^n V(f(xi), yi)

which in general – for an arbitrary hypothesis space H – is ill-posed.

Ivanov regularizes by finding the function that minimizes

(1/n) ∑_{i=1}^n V(f(xi), yi)

while satisfying

R(f) ≤ A.

Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter γ, the regularized functional

(1/n) ∑_{i=1}^n V(f(xi), yi) + γ R(f).    (2)

R(f) is the regularizer, a penalization on f. In this course we will mainly discuss the case R(f) = ‖f‖²_K, where ‖f‖²_K is the squared norm in the Reproducing Kernel Hilbert Space (RKHS) H defined by the kernel K.

Page 84

Tikhonov Regularization

As we will see in future classes

Tikhonov regularization ensures well-posedness, e.g. existence, uniqueness and especially stability (in a very strong form) of the solution;
Tikhonov regularization ensures generalization;
Tikhonov regularization is closely related to – but different from – Ivanov regularization, e.g. ERM on a hypothesis space H which is a ball in an RKHS.

Page 85

Remarks on Foundations of Learning Theory

Intelligent behavior (at least learning) consists of optimizing under constraints. Constraints are key for solving computational problems; constraints are key for prediction. Constraints may correspond to rather general symmetry properties of the problem (e.g. time invariance, space invariance, invariance to physical units (the pi theorem), universality of numbers and metrics implying normalization, etc.).

Key questions at the core of learning theory:
generalization and predictivity, not explanation;
probabilities are unknown, only data are given;
which constraints are needed to ensure generalization (and therefore which hypothesis spaces)?
regularization techniques usually result in computationally "nice" and well-posed optimization problems.

Page 86

Statistical Learning Theory and Bayes

The Bayesian approach tends to ignore

the issue of generalization (following the tradition in statistics of explanatory statistics);
that probabilities are not known and that only data are known: assuming a specific distribution is a very strong – unconstrained by any Bayesian theory – seat-of-the-pants guess;
the question of which priors are needed to ensure generalization;
that the resulting optimization problems are often computationally intractable and possibly ill-posed optimization problems (for instance, not unique).

The last point may be quite devastating for Bayesonomics: Monte Carlo techniques etc. may just hide hopeless exponential computational complexity for the Bayesian approach to real-life problems, like exhaustive search did initially for AI. A possibly interesting conjecture, suggested by our stability results and the last point above, is that ill-posed optimization problems or their ill-conditioned approximate solutions may not be predictive!

Page 87

Plan

Part I: Basic Concepts and Notation

Part II: Foundational Results
Part III: Algorithms

Page 88

Hypotheses Space

We are going to look at hypothesis spaces which are reproducing kernel Hilbert spaces (RKHS).

RKHS are Hilbert spaces of point-wise defined functions. They can be defined via a reproducing kernel, which is a symmetric positive definite function:

∑_{i,j=1}^n ci cj K(ti, tj) ≥ 0

for any n ∈ N and any choice of t1, ..., tn ∈ X and c1, ..., cn ∈ R.

Functions in the space are (the completion of) linear combinations

f(x) = ∑_{i=1}^p K(x, xi) ci.

The norm in the space is a natural measure of complexity:

‖f‖²_H = ∑_{i,j=1}^p K(xj, xi) ci cj.

Page 92

Examples of pd kernels

Very common examples of symmetric pd kernels are:

• Linear kernel
$$K(x, x') = x \cdot x'$$

• Gaussian kernel
$$K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}, \qquad \sigma > 0$$

• Polynomial kernel
$$K(x, x') = (x \cdot x' + 1)^d, \qquad d \in \mathbb{N}$$

For specific applications, designing an effective kernel is a challenging problem.
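As an illustrative sketch (not from the slides), the three kernels above written as plain functions; the default values of $\sigma$ and $d$ are arbitrary:

```python
import numpy as np

def linear_kernel(x, xp):
    # K(x, x') = x . x'
    return np.dot(x, xp)

def gaussian_kernel(x, xp, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma^2), sigma > 0
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def polynomial_kernel(x, xp, d=3):
    # K(x, x') = (x . x' + 1)^d, d in N
    return (np.dot(x, xp) + 1.0) ** d

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), gaussian_kernel(x, xp), polynomial_kernel(x, xp))
```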


Kernel and Features

Oftentimes kernels are defined through a dictionary of features

$$D = \{\phi_j,\ j = 1, \dots, p \mid \phi_j : X \to \mathbb{R}\},$$

setting

$$K(x, x') = \sum_{j=1}^{p} \phi_j(x)\, \phi_j(x').$$
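As an illustration of the feature-dictionary construction (the dictionary below is a hypothetical choice, not one from the slides):

```python
import numpy as np

# A hypothetical dictionary of p = 3 features phi_j : R^2 -> R
features = [
    lambda x: x[0],           # phi_1
    lambda x: x[1],           # phi_2
    lambda x: x[0] * x[1],    # phi_3
]

def dictionary_kernel(x, xp):
    # K(x, x') = sum_j phi_j(x) phi_j(x')
    return sum(phi(x) * phi(xp) for phi in features)

# Equivalently, K(x, x') is the inner product of the feature vectors Phi(x), Phi(x')
Phi = lambda x: np.array([phi(x) for phi in features])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(dictionary_kernel(x, xp), Phi(x) @ Phi(xp))
print(dictionary_kernel(x, xp))
```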


Ivanov regularization

We can regularize by explicitly restricting the hypotheses space H, for example to a ball of radius R.

Ivanov regularization

$$\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) \quad \text{subject to} \quad \|f\|_H^2 \le R.$$

The above algorithm corresponds to a constrained optimization problem.
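One simple way to attack this constrained problem numerically (a minimal sketch under the square loss, not necessarily the method intended by the slides) is projected functional gradient descent on the coefficients of a kernel expansion: take a gradient step and, whenever $\|f\|_H^2 > R$, rescale $f$ back onto the ball. The kernel, data, step size, and value of $R$ below are arbitrary illustrations.

```python
import numpy as np

def ivanov_kernel_regression(K, y, R, step=0.1, iters=500):
    """Sketch of projected functional gradient descent for
       min_f (1/n) sum_i (f(x_i) - y_i)^2  subject to  ||f||_H^2 <= R,
       with f(x) = sum_i c_i K(x, x_i)."""
    n = len(y)
    c = np.zeros(n)
    for _ in range(iters):
        residual = K @ c - y                  # f(x_i) - y_i on the training points
        c = c - step * (2.0 / n) * residual   # functional gradient step for the square loss
        norm_sq = c @ K @ c                   # ||f||_H^2 = c^T K c
        if norm_sq > R:                       # project back onto the RKHS ball
            c *= np.sqrt(R / norm_sq)
    return c

# Toy usage with a Gaussian kernel on 1-D data (arbitrary choices)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
K = np.exp(-(X[:, None] - X[None, :]) ** 2)
c = ivanov_kernel_regression(K, y, R=5.0)
print("||f||_H^2 =", c @ K @ c)  # stays within the constraint
```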


Tikhonov regularization

Regularization can also be done implicitly via penalization

Tikhonov regularization

$$\arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda\, \|f\|_H^2.$$

λ is the regularization parameter trading off between the two terms.

The above algorithm can be seen as the Lagrangian formulation of a constrained optimization problem.


The Representer Theorem

An important result
The minimizer over the RKHS $H$, $f_S$, of the regularized empirical functional

$$I_S[f] + \lambda \|f\|_H^2,$$

can be represented by the expression

$$f_S(x) = \sum_{i=1}^{n} c_i K(x_i, x),$$

for some $(c_1, \dots, c_n) \in \mathbb{R}^n$.

Hence, minimizing over the (possibly infinite-dimensional) Hilbert space boils down to minimizing over $\mathbb{R}^n$.


SVM and RLS

The way the coefficients $c = (c_1, \dots, c_n)$ are computed depends on the choice of loss function.

RLS: Let $y = (y_1, \dots, y_n)$ and $K_{i,j} = K(x_i, x_j)$; then $c = (K + \lambda n I)^{-1} y$.
SVM: Let $\alpha_i = y_i c_i$ and $Q_{i,j} = y_i K(x_i, x_j) y_j$; the $\alpha_i$ are obtained by solving a constrained quadratic program in $Q$ (the SVM dual problem).
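A minimal sketch of the RLS computation above on synthetic data (the Gaussian kernel, the data, and the value of $\lambda$ are arbitrary choices for illustration):

```python
import numpy as np

def gaussian_gram(X, Xp, sigma=1.0):
    # Gram matrix with entries exp(-||x_i - x'_j||^2 / sigma^2)
    d2 = np.sum((X[:, None, :] - Xp[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma ** 2)

# Synthetic 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

n, lam = len(y), 0.01
K = gaussian_gram(X, X)

# RLS coefficients: c = (K + lambda * n * I)^{-1} y
# (solve the linear system instead of forming the inverse explicitly)
c = np.linalg.solve(K + lam * n * np.eye(n), y)

# Predict through the representer expansion f(x) = sum_i c_i K(x, x_i)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(gaussian_gram(X_test, X) @ c)
```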


Bayes Interpretation

Tikhonov regularization can be read in Bayesian terms: with the square loss, the empirical error plays the role of a negative log-likelihood (Gaussian noise) and the RKHS norm that of a negative log-prior over functions, so the regularized solution is a MAP estimate.


Regularization approach

More generally we can consider:

$$I_n(f) + \lambda\, R(f)$$

where $R(f)$ is a regularizing functional.

• Sparsity-based methods (sketched below)
• Manifold learning
• Multiclass
• ...
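As one concrete instance of the general scheme above (an illustrative sketch, not taken from the slides): sparsity-based methods can take $R(f)$ to be the $\ell_1$ norm of the coefficients of a linear model, and the resulting objective can be minimized by proximal gradient descent (ISTA). All data and parameter values below are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, iters=1000):
    """Minimize (1/n) ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, p = X.shape
    step = n / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L, with L the gradient's Lipschitz constant
    w = np.zeros(p)
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)     # gradient of the smooth (data-fit) term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy usage: only 2 of 10 features are relevant
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[[1, 4]] = [2.0, -3.0]
y = X @ w_true + 0.1 * rng.standard_normal(50)
print(np.round(ista(X, y, lam=0.1), 2))  # most entries shrunk to (near) zero
```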


Summary

• Statistical learning as a framework to deal with uncertainty in data.
• No free lunch and key theorem: no prior, no learning.
• Regularization as a fundamental tool for enforcing a prior and ensuring stability and generalization.


Kernel and Data Representation

In the above reasoning, the kernel and the hypotheses space define a representation/parameterization of the problem and hence play a special role.

Where do they come from?

• There are a few off-the-shelf choices (Gaussian, polynomial, etc.).
• Often they are the product of problem-specific engineering.

Are there principles, applicable in a wide range of situations, for designing effective data representations?
