Lecture 2: Learning Without Over-learning
Isabelle Guyon, [email protected]
TRANSCRIPT
Machine Learning
• Learning machines include:
– Linear discriminant (including Naïve Bayes)
– Kernel methods
– Neural networks
– Decision trees
• Learning is tuning:
– Parameters (weights w or α, threshold b)
– Hyperparameters (basis functions, kernels, number of units)
Conventions
[Figure: data matrix X = {xij} with m rows (examples xi) and n columns (features), target vector y = {yj}, weight vector w]
What is a Risk Functional?
• A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
[Figure: the risk R[f(x,w)] plotted over the parameter space w, with its minimum at w*]
Examples of risk functionals
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 − AUC
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) − yi)²
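As a concrete illustration, here is a minimal numpy sketch of these two empirical risks (1 − AUC would additionally require ranking the classifier's scores; the toy arrays are only illustrative):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Classification risk: fraction of examples with F(x_i) != y_i."""
    return np.mean(y_true != y_pred)

def mean_square_error(y_true, f_x):
    """Regression risk: (1/m) * sum_i (f(x_i) - y_i)^2."""
    return np.mean((f_x - y_true) ** 2)

# toy usage
y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])
print(error_rate(y_true, y_pred))         # 0.5
print(mean_square_error(y_true, y_pred))  # 2.0
```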
How to Train?
• Define a risk functional R[f(x,w)]
• Find a method to optimize it, typically “gradient descent”
wj ← wj − η ∂R/∂wj
or any optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.)
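A schematic numpy sketch of the update wj ← wj − η ∂R/∂wj, applied here to the mean-square-error risk of a linear model; the learning rate η and the fixed number of steps are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

def train_linear(X, y, eta=0.01, n_steps=1000):
    """Minimize R[w] = (1/m) * sum_i (w.x_i - y_i)^2 by gradient descent."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_steps):
        residual = X @ w - y               # f(x_i) - y_i for all examples
        grad = (2.0 / m) * X.T @ residual  # dR/dw
        w -= eta * grad                    # w_j <- w_j - eta * dR/dw_j
    return w
```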
Fit / Robustness Tradeoff
[Figure: training examples in the (x1, x2) plane with two candidate decision boundaries, illustrating the tradeoff between fitting the data and robustness]
Overfitting
Example: Polynomial regression
Learning machine: y = w0 + w1x + w2x^2 + … + w10x^10
Target: a 10th degree polynomial + noise
[Figure: degree-10 fit y(x) to the noisy training points, with the fit error indicated]
Underfitting
Example: Polynomial regression
Linear model: y = w0 + w1x
Target: a 10th degree polynomial + noise
[Figure: linear fit y(x) to the noisy training points, with the fit error indicated]
Variance
[Figure: y vs. x plots of fits obtained from different training sets, illustrating variance]
Bias
[Figure: y vs. x plot illustrating bias]
Ockham’s Razor
• Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate”.
• Of two theories providing similarly good predictions, prefer the simplest one.
• Shave off unnecessary parameters of your models.
The Power of Amnesia
• The human brain is made of billions of cells, or neurons, which are highly interconnected by synapses.
• Exposure to enriched environments with extra sensory and social stimulation enhances the connectivity of the synapses, but children and adolescents can also lose up to 20 million synapses per day.
Artificial Neurons
f(x) = w · x + b
[Figure: artificial neuron with inputs x1, x2, …, xn and a constant input 1, weighted by w1, w2, …, wn and the bias b, producing the output f(x); biological analogy: synapses on the dendrites collect the activation of other neurons, the cell potential passes through an activation function, and the result travels down the axon]
McCulloch and Pitts, 1943
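A minimal sketch of this artificial neuron, f(x) = w · x + b; the sign activation used here is just one illustrative choice of activation function:

```python
import numpy as np

def neuron(x, w, b, activation=np.sign):
    """Artificial neuron: weighted sum of the inputs plus bias, then activation."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x1..xn
w = np.array([1.0, 0.3, -0.2])   # synaptic weights w1..wn
print(neuron(x, w, b=0.1))       # -> +1.0 or -1.0
```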
Hebb’s Rule
wj ← wj + η yi xij
[Figure: a synapse of weight wj on a dendrite carrying the input xj, modulating the activation y passed along the axon to another neuron]
Weight Decay
wj ← wj + η yi xij      (Hebb's rule)
wj ← (1 − γ) wj + η yi xij      (weight decay)
γ ∈ [0, 1], decay parameter
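The two updates, written as a small sketch for a single example (xi, yi); eta plays the role of the learning rate η and gamma is the decay parameter γ above:

```python
import numpy as np

def hebb_update(w, x_i, y_i, eta=0.01):
    """Hebb's rule: w_j <- w_j + eta * y_i * x_ij."""
    return w + eta * y_i * x_i

def weight_decay_update(w, x_i, y_i, eta=0.01, gamma=0.001):
    """Weight decay: w_j <- (1 - gamma) * w_j + eta * y_i * x_ij."""
    return (1.0 - gamma) * w + eta * y_i * x_i
```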
Overfitting Avoidance
[Figure: eleven panels showing the degree-10 fit for increasing values of the ridge: d=10, r = 0.01, 0.1, 1, 10, 1e+002, 1e+003, 1e+004, 1e+005, 1e+006, 1e+007, 1e+008]
Example: Polynomial regression
Target: a 10th degree polynomial + noise
Learning machine: y = w0 + w1x + w2x^2 + … + w10x^10
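A sketch reproducing this experiment: a degree-10 polynomial fitted with a ridge (weight-decay) penalty r on the coefficients; the particular target coefficients, noise level, and input rescaling are illustrative, and the fit uses the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 30)
x_scaled = x / 10.0                                # rescale to [-1, 1] for numerical conditioning
true_w = rng.normal(size=11)                       # coefficients of an illustrative degree-10 target
Phi = np.vander(x_scaled, 11, increasing=True)     # columns [1, x, x^2, ..., x^10]
y = Phi @ true_w + 0.1 * rng.normal(size=x.size)   # target polynomial + noise

def ridge_fit(Phi, y, r):
    """Minimize ||Phi w - y||^2 + r ||w||^2 (closed-form ridge / weight-decay solution)."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + r * np.eye(n), Phi.T @ y)

for r in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(Phi, y, r)
    mse = np.mean((Phi @ w - y) ** 2)
    print(f"r={r:7.2f}  training MSE={mse:.4f}  ||w||^2={w @ w:.3f}")
```

As r grows, the squared norm of the weights shrinks and the fit becomes smoother, at the price of a higher training error, which is the behaviour illustrated by the panels above.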
Weight Decay for MLP
Replace: wj ← wj + back_prop(j)
by: wj ← (1 − γ) wj + back_prop(j)
Theoretical Foundations
• Structural Risk Minimization
• Bayesian priors
• Minimum Description Length
• Bias/variance tradeoff
Risk Minimization
• Examples are given:
(x1, y1), (x2, y2), … (xm, ym)
• Learning problem: find the best function f(x; w) minimizing the risk functional
R[f] = ∫ L(f(x; w), y) dP(x, y)
where L is the loss function and P(x, y) is the unknown data distribution.
Approximations of R[f]
• Empirical risk: Rtrain[f] = (1/m) Σi=1:m L(f(xi; w), yi)
– 0/1 loss 1(F(xi) ≠ yi): Rtrain[f] = error rate
– square loss (f(xi) − yi)²: Rtrain[f] = mean square error
• Guaranteed risk:
With high probability (1 − δ), R[f] ≤ Rgua[f]
Rgua[f] = Rtrain[f] + ε(C)
Structural Risk Minimization
Vapnik, 1974
[Figure: nested sets S1 ⊂ S2 ⊂ S3 of increasing complexity]
Nested subsets of models, increasing complexity/capacity:
S1 ⊂ S2 ⊂ … ⊂ SN
[Figure: training error Tr, guaranteed risk Ga = Tr + ε, and ε (a function of the model complexity C) plotted against the complexity/capacity C]
SRM Example (linear model)
• Rank with ||w||² = Σi wi²
Sk = { w | ||w||² < ωk² }, ω1 < ω2 < … < ωn
• Minimization under constraint:
min Rtrain[f] s.t. ||w||² < ωk²
• Lagrangian:
Rreg[f, λ] = Rtrain[f] + λ ||w||²
[Figure: risk R vs. capacity over the nested subsets S1 ⊂ S2 ⊂ … ⊂ SN]
Gradient Descent
Rreg[f] = Remp[f] + λ ||w||²      (SRM/regularization)
wj ← wj − η ∂Rreg/∂wj
wj ← wj − η ∂Remp/∂wj − 2ηλ wj
wj ← (1 − 2ηλ) wj − η ∂Remp/∂wj      (weight decay)
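The equivalence of the last two lines, written as a small sketch; grad_Remp stands for any routine returning ∂Remp/∂w, and both functions compute exactly the same update:

```python
def regularized_step(w, grad_Remp, eta, lam):
    """One step of w <- w - eta * d(Remp + lam*||w||^2)/dw."""
    return w - eta * grad_Remp(w) - 2.0 * eta * lam * w

def weight_decay_step(w, grad_Remp, eta, lam):
    """Equivalent weight-decay form: w <- (1 - 2*eta*lam) * w - eta * dRemp/dw."""
    return (1.0 - 2.0 * eta * lam) * w - eta * grad_Remp(w)
```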
Multiple Structures
• Shrinkage (weight decay, ridge regression, SVM):
Sk = { w | ||w||² < ωk }, ω1 < ω2 < … < ωk
λ1 > λ2 > λ3 > … > λk (λ is the ridge)
• Feature selection:
Sk = { w | ||w||0 < σk },
σ1 < σ2 < … < σk (σ is the number of features)
• Data compression:
σ1 < σ2 < … < σk (σ may be the number of clusters)
Hyper-parameter Selection
• Learning = adjusting: parameters (the w vector) and hyper-parameters (γ).
• Cross-validation with K folds:
For various values of γ:
– Adjust w on a fraction (K−1)/K of the training examples, e.g. 9/10.
– Test on the 1/K remaining examples, e.g. 1/10.
– Rotate examples and average the test results (CV error).
– Select γ to minimize the CV error.
– Re-compute w on all training examples using the optimal γ.
[Figure: data (X, y) split into training data, on which the K folds are made, and test data held out for a prospective study / "real" validation]
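A minimal sketch of this selection procedure; fit(X, y, gamma) and loss(X, y, w) are hypothetical placeholders for whatever learning machine and risk are used, and the candidate γ values would be supplied by the user:

```python
import numpy as np

def cross_validate(X, y, gammas, fit, loss, K=10, seed=0):
    """Select the hyper-parameter gamma by K-fold cross-validation."""
    m = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(m), K)
    cv_error = []
    for gamma in gammas:
        errors = []
        for k in range(K):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
            w = fit(X[train_idx], y[train_idx], gamma)        # adjust w on (K-1)/K of the data
            errors.append(loss(X[test_idx], y[test_idx], w))  # test on the remaining 1/K
        cv_error.append(np.mean(errors))                      # average over the K rotations
    best_gamma = gammas[int(np.argmin(cv_error))]             # gamma minimizing the CV error
    w_final = fit(X, y, best_gamma)                           # re-train on all examples
    return best_gamma, w_final
```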
Summary
• High complexity models may “overfit”:
– fit the training examples perfectly
– generalize poorly to new cases
• SRM solution: organize the models in nested subsets such that, in every structure element, complexity < threshold.
• Regularization: formalize learning as a constrained optimization problem; minimize the regularized risk = training error + penalty.
Bayesian MAP ⇔ SRM
• Maximum A Posteriori (MAP):
f = argmax P(f|D) = argmax P(D|f) P(f) = argmin −log P(D|f) − log P(f)
• Structural Risk Minimization (SRM):
f = argmin Remp[f] + Ω[f]
Negative log likelihood = empirical risk Remp[f]
Negative log prior = regularizer Ω[f]
Example: Gaussian Prior
• Linear model:
f(x) = w · x
• Gaussian prior:
P(f) = exp(−||w||² / (2σ²))
• Regularizer:
Ω[f] = −log P(f) = λ ||w||², with λ = 1/(2σ²)
[Figure: contours of the Gaussian prior in the (w1, w2) weight plane]
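A short worked version of the correspondence, assuming in addition a Gaussian noise model of variance σn² for the likelihood (an assumption not stated on the slide); λ' simply absorbs the rescaling between the sum of squared errors and the mean square error:

```latex
\begin{align*}
-\log P(D \mid f) &= \frac{1}{2\sigma_n^2}\sum_{i=1}^{m}\bigl(f(x_i)-y_i\bigr)^2 + \mathrm{const}
  \;\propto\; R_{\mathrm{emp}}[f] \\
-\log P(f) &= \frac{\lVert w\rVert^2}{2\sigma^2} + \mathrm{const}
  \;=\; \lambda\,\lVert w\rVert^2 + \mathrm{const}, \qquad \lambda=\tfrac{1}{2\sigma^2} \\
f^{\ast} &= \operatorname*{argmin}_f\;\bigl[-\log P(D\mid f)-\log P(f)\bigr]
  \;=\; \operatorname*{argmin}_f\; R_{\mathrm{emp}}[f] + \lambda'\,\lVert w\rVert^2
\end{align*}
```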
Minimum Description Length
• MDL: minimize the length of the “message”.
• Two part code: transmit the model and the residual.
• f = argmin −log2 P(D|f) − log2 P(f)
– −log2 P(f): length of the shortest code to encode the model (model complexity)
– −log2 P(D|f): residual, length of the shortest code to encode the data given the model
Bias-variance tradeoff
• f trained on a training set D of size m (m fixed)
• For the square loss:
ED[f(x) − y]² = [ED f(x) − y]² + ED[f(x) − ED f(x)]²
– ED[f(x) − y]²: expected value of the loss over training sets D of the same size
– [ED f(x) − y]²: bias², the squared gap between the average prediction ED f(x) and the target y
– ED[f(x) − ED f(x)]²: variance, the spread of f(x) around its average ED f(x)
[Figure: y vs. x plots illustrating bias² and variance]
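A numpy sketch estimating the two terms of the decomposition over many training sets D of the same size; the target function, the noise level, and the polynomial learner are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(-10, 10, 101)           # points x at which bias and variance are measured
target = lambda x: np.sin(x / 3.0)           # illustrative target (stands in for the noisy polynomial)
degree, m, n_datasets, noise = 5, 30, 200, 0.3

fits = []
for _ in range(n_datasets):                  # many training sets D of the same size m
    x = rng.uniform(-10, 10, m)
    y = target(x) + noise * rng.normal(size=m)
    coeffs = np.polyfit(x, y, degree)        # learner f trained on D
    fits.append(np.polyval(coeffs, x_test))  # f(x) on the test grid
fits = np.array(fits)                        # shape (n_datasets, len(x_test))

mean_fit = fits.mean(axis=0)                               # E_D f(x)
bias2 = np.mean((mean_fit - target(x_test)) ** 2)          # [E_D f(x) - y]^2, averaged over x
variance = np.mean((fits - mean_fit) ** 2)                 # E_D [f(x) - E_D f(x)]^2, averaged over x
print(f"bias^2 = {bias2:.4f}   variance = {variance:.4f}")
```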
The Effect of SRM
Reduces the variance…
…at the expense of introducing some bias.
Ensemble Methods
ED[f(x) − y]² = [ED f(x) − y]² + ED[f(x) − ED f(x)]²
• Variance can also be reduced with committee machines.
• The committee members “vote” to make the final decision.
• Committee members are built e.g. with data subsamples.
• Each committee member should have a low bias (no use of ridge/weight decay).
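A sketch of such a committee built by bagging (bootstrap subsamples of the data), with a majority vote as the final decision; fit and predict stand for any low-bias base learner and its prediction routine, both placeholders here:

```python
import numpy as np

def bagged_committee(X, y, fit, n_members=25, seed=0):
    """Train committee members on bootstrap subsamples of the data."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, m, size=m)     # bootstrap subsample (drawn with replacement)
        members.append(fit(X[idx], y[idx]))  # low-bias member (no ridge / weight decay)
    return members

def committee_vote(members, predict, X):
    """Each member votes; the sign of the average vote is the final decision."""
    votes = np.array([predict(f, X) for f in members])  # shape (n_members, n_examples)
    return np.sign(votes.mean(axis=0))
```

Averaging the members' votes reduces the variance term of the decomposition above without requiring any regularization of the individual members.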
Overall summary
• Weight decay is a powerful means of overfitting avoidance (||w||² regularizer).
• It has several theoretical justifications: SRM, Bayesian prior, MDL.
• It controls variance in the learning machine family, but introduces bias.
• Variance can also be controlled with ensemble methods.
Want to Learn More?
• Statistical Learning Theory, V. Vapnik. Theoretical reference book on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN: 0471030031.
• Structural risk minimization for character recognition, I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S.A. Solla. In J. E. Moody et al., editor, Advances in Neural Information Processing Systems 4 (NIPS 91), pages 471--479, San Mateo CA, Morgan Kaufmann, 1992. http://clopinet.com/isabelle/Papers/srm.ps.Z
• Kernel Ridge Regression Tutorial, I. Guyon. http://clopinet.com/isabelle/Projects/ETH/KernelRidge.pdf
• Feature Extraction: Foundations and Applications. I. Guyon et al, Eds. Book for practitioners with datasets of NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book