Model Assessment and Selection
Lecture Notes for Comp540, Chapter 7 (transcript)
Jian Li
Mar. 2007
Goal
• Model Selection
• Model Assessment
A Regression Problem
• y = f(x) + noise
• Can we learn f from this data?
• Let’s consider three methods...
Linear Regression
Quadratic Regression
Joining the dots
Which is best?
• Why not choose the method with the best fit to the data?
• Because the real question is: "How well are you going to predict future data drawn from the same distribution?"
Model Selection and Assessment
• Model Selection: estimating the performance of different models in order to choose the best one (the one that attains the minimum test error)
• Model Assessment: having chosen a final model, estimating its prediction error on new data
Why Study Errors?
• Why do we want to study errors?
• In a data-rich situation, split the data into Train / Validation / Test: the training set fits the models, the validation set does model selection, and the test set does model assessment
• But that's not usually the case
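In code, a data-rich split might look like this (the 50/25/25 proportions and the numpy usage are illustrative assumptions, not fixed by the notes):

```python
import numpy as np

# 50% train (fit each model), 25% validation (model selection),
# 25% test (model assessment, touched only once at the end).
rng = np.random.default_rng(0)
N = 200
idx = rng.permutation(N)            # shuffle before splitting
n_train, n_val = N // 2, N // 4
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]
```

The key discipline is that the test indices are never used until the final assessment step.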
Overall Motivation
• Errors
  - Measurement of errors (loss functions)
  - Decomposing test error into bias & variance
• Estimating the true error
  - Estimating in-sample error (analytically): AIC, BIC, MDL, SRM with VC dimension
  - Estimating extra-sample error (efficient sample reuse): cross-validation & bootstrapping
Measuring Errors: Loss Functions
• Typical regression loss functions
  Squared error: $L(Y, \hat f(X)) = (Y - \hat f(X))^2$
  Absolute error: $L(Y, \hat f(X)) = |Y - \hat f(X)|$
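Both losses are one-liners when averaged over a sample; a sketch with a toy example:

```python
import numpy as np

def squared_error(y, f_hat):
    # L(Y, f(X)) = (Y - f(X))^2, averaged over the sample
    return np.mean((y - f_hat) ** 2)

def absolute_error(y, f_hat):
    # L(Y, f(X)) = |Y - f(X)|, averaged over the sample
    return np.mean(np.abs(y - f_hat))

y = np.array([1.0, 2.0, 3.0])
f_hat = np.array([1.5, 2.0, 2.0])
```

Squared error punishes the large miss (1.0) much more than absolute error does, which is the usual reason to prefer absolute error under heavy-tailed noise.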
Measuring Errors: Loss Functions
• Typical classification loss functions
  0-1 loss: $L(G, \hat G(X)) = I(G \ne \hat G(X))$
  Log-likelihood (cross-entropy loss / deviance): $L(G, \hat p(X)) = -2 \log \hat p_G(X)$
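A sketch of both classification losses (the `p_hat[i, k]` layout, estimated P(class k | x_i), is an assumed convention):

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # L(G, G_hat(X)) = I(G != G_hat(X)), averaged over the sample
    return np.mean(np.asarray(y) != np.asarray(y_pred))

def cross_entropy_loss(y, p_hat):
    # -log p_hat_G(X), averaged; multiply by 2 for the deviance form
    p_hat = np.asarray(p_hat)
    return -np.mean(np.log(p_hat[np.arange(len(y)), y]))

y = [0, 1]
p_hat = [[0.9, 0.1], [0.2, 0.8]]
```

Note that cross-entropy stays informative even when all points are classified correctly, since it rewards confident correct probabilities.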
The Goal: Low Test Error
• We want to minimize the generalization error, or test error:
  $\mathrm{Err} = E\big[L(Y, \hat f(X))\big]$
• But all we really know is the training error:
  $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big)$
• And this is a bad estimate of the test error
Bias, Variance & Complexity
Training error can always be reduced by increasing model complexity, but this risks over-fitting.
Typically, test error first decreases and then rises with model complexity, while training error decreases monotonically.
Decomposing Test Error
Model: $Y = f(X) + \varepsilon$, with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$
For squared-error loss and additive noise:
$\mathrm{Err}(x_0) = \sigma^2 + \big[E\hat f(x_0) - f(x_0)\big]^2 + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2$
• $\sigma^2$: irreducible error of the target $Y$
• Bias²: squared deviation of the average estimate from the true function
• Variance: expected squared deviation of our estimate around its mean
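The decomposition can be checked by simulation. Everything below (f = sin, sigma = 0.3, a deliberately misspecified straight-line fit so the bias is nonzero) is an illustrative assumption, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
f, sigma, N, reps, x0 = np.sin, 0.3, 30, 2000, 1.0
xs = np.linspace(0.0, 3.0, N)

preds = np.empty(reps)
for r in range(reps):
    y = f(xs) + rng.normal(0.0, sigma, N)
    preds[r] = np.polyval(np.polyfit(xs, y, 1), x0)   # refit on fresh data

bias2 = (np.mean(preds) - f(x0)) ** 2     # [E f_hat(x0) - f(x0)]^2
var = np.var(preds)                        # E[f_hat(x0) - E f_hat(x0)]^2
decomposed = sigma ** 2 + bias2 + var

# Direct Monte-Carlo estimate of E[(Y0 - f_hat(x0))^2] for comparison:
y0 = f(x0) + rng.normal(0.0, sigma, reps)
direct = np.mean((y0 - preds) ** 2)
```

With enough replications, `direct` and `decomposed` agree: the three terms really do add up under squared-error loss.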
Further Bias Decomposition
• For linear models (e.g. ridge), the average squared bias can be further decomposed into an average model bias and an average estimation bias:
  $E_{x_0}\big[f(x_0) - x_0^T\beta_*\big]^2 + E_{x_0}\big[x_0^T\beta_* - E\,x_0^T\hat\beta_\alpha\big]^2$
• $\beta_* = \arg\min_\beta E\big(f(X) - X^T\beta\big)^2$ is the best-fitting linear approximation
• For standard (least-squares) linear regression, the estimation bias is 0
Model Fitting
Graphical representation of bias & variance:
• The hypothesis space contains the model space (basic linear regression) and, inside it, the regularized model space (ridge regression)
• Model bias: distance from the truth to the closest fit in population (if $\varepsilon = 0$)
• Estimation bias: distance from the closest fit in population to the shrunken (ridge) fit
• Estimation variance: scatter of the closest fit (given our observation) around its average, driven by the particular realization of the training data
Bias & Variance Decomposition Examples
• kNN regression:
  $\mathrm{Err}(x_0) = \sigma^2 + \big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\big]^2 + \frac{\sigma^2}{k}$
• Linear regression (linear weights on $y$: $\hat y = Hy$):
  $\mathrm{Err}(x_0) = \sigma^2 + \big[f(x_0) - E\hat f(x_0)\big]^2 + \|h(x_0)\|^2 \sigma^2$
• Averaging over the training set:
  $\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} \big[f(x_i) - E\hat f(x_i)\big]^2 + \frac{p}{N}\sigma^2$
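The average-variance term for linear regression, (p/N)·σ², comes from the trace of the hat matrix; a quick numerical check (toy design matrix assumed):

```python
import numpy as np

# For least squares, y_hat = H y with H = X (X^T X)^{-1} X^T, so
# sum_i Var(f_hat(x_i)) = sigma^2 * trace(H) = p * sigma^2,
# which averaged over N points gives (p/N) * sigma^2.
rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.normal(size=(N, p))
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
trace_H = np.trace(H)                    # equals p up to round-off
```

The trace equals p regardless of what the toy design matrix looks like, as long as X has full column rank.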
Simulated Example of Bias-Variance Decomposition
• Regression with squared error loss: prediction error = Bias² + Variance (plus irreducible noise); the components add up exactly
• Classification with 0-1 loss: prediction error ≠ Bias² + Variance; the bias-variance interaction is different for 0-1 loss than for squared error loss
• Estimation errors on the right side of the decision boundary don't hurt!
Optimism of the Training Error Rate
• Typically the training error rate is less than the true error, because the same data is being used both to fit the method and to assess its error:
  $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big) \;<\; \mathrm{Err} = E\big[L(Y, \hat f(X))\big]$
• The training error is therefore overly optimistic
Estimating Test Error
• Can we estimate the discrepancy between $\overline{\mathrm{err}}$ and $\mathrm{Err}$ (the extra-sample error)?
• Strategy: adjust the training error for its optimism
• $\mathrm{Err}_{in}$, the in-sample error: the expectation over $N$ new responses observed at each of the training points $x_i$
Optimism
• Define the optimism: $op \equiv \mathrm{Err}_{in} - E_y[\overline{\mathrm{err}}]$
• For squared error, 0-1, and other loss functions:
  $op = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$
• Summary:
  $\mathrm{Err}_{in} = E_y[\overline{\mathrm{err}}] + \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$
• For a linear fit with $d$ independent inputs/basis functions:
  $\mathrm{Err}_{in} = E_y[\overline{\mathrm{err}}] + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$
  so optimism grows linearly with $d$ and shrinks as the training sample size $N$ grows
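The covariance formula can be verified by Monte Carlo for a linear fit, where the covariances sum to d·σ². The design matrix and true coefficients below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, reps = 40, 3, 1.0, 4000
X = rng.normal(size=(N, d))
f_true = X @ np.array([1.0, -2.0, 0.5])
H = X @ np.linalg.solve(X.T @ X, X.T)   # y_hat = H y

# Repeated response vectors at the SAME x_i, as Err_in requires:
ys = f_true + rng.normal(0.0, sigma, size=(reps, N))
y_hats = ys @ H.T                        # row b: fitted values for draw b
cov_sum = sum(np.cov(y_hats[:, i], ys[:, i])[0, 1] for i in range(N))
op_hat = 2 * cov_sum / N                 # should be near 2*d*sigma^2/N
```

Here the theoretical value is 2·3·1/40 = 0.15, and the simulated `op_hat` lands close to it.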
Ways to Estimate Prediction Error
• In-sample error estimates: AIC, BIC, MDL, SRM
• Extra-sample error estimates:
  - Cross-validation (leave-one-out, K-fold)
  - Bootstrap
Estimates of In-Sample Prediction Error
• General form of the in-sample estimate:
  $\widehat{\mathrm{Err}}_{in} = \overline{\mathrm{err}} + \widehat{op}$
• For a linear fit with $d$ parameters:
  $C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$ (the so-called $C_p$ statistic)
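A sketch of the C_p statistic as a function; the toy numbers are assumptions, and sigma2_hat would typically come from a low-bias, full model:

```python
import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    # C_p = err_bar + 2 * (d/N) * sigma2_hat for a d-parameter linear fit
    N = len(y)
    err_bar = np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return err_bar + 2.0 * (d / N) * sigma2_hat

# Toy numbers: training MSE 0.5, d = 2, N = 4, sigma2_hat = 1
# give C_p = 0.5 + 2 * (2/4) * 1 = 1.5
```

The penalty term 2(d/N)·σ̂² is exactly the estimated optimism from the previous slide.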
AIC & BIC
Similarly:
• Akaike Information Criterion (AIC):
  $\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$
• Bayesian Information Criterion (BIC):
  $\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$
AIC & BIC
As model-selection scores (to maximize):
• AIC: $\mathrm{LL}(\mathrm{Data} \mid \mathrm{MLE\ params}) - (\#\ \mathrm{of\ parameters})$
• BIC: $\mathrm{LL}(\mathrm{Data} \mid \mathrm{MLE\ params}) - \frac{\log N}{2}\,(\#\ \mathrm{of\ parameters})$
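Both criteria as functions of the maximized log-likelihood, here in the smaller-is-better "-2 loglik" sign convention of the previous slide (a sketch; the toy log-likelihoods are assumptions):

```python
import math

def aic(loglik, d, N):
    return -2.0 / N * loglik + 2.0 * d / N

def bic(loglik, d, N):
    return -2.0 * loglik + math.log(N) * d

# A richer model (d = 10) must buy its extra parameters with likelihood:
# here loglik only improves from -100 to -98, so both criteria keep d = 3.
N = 50
simple, rich = aic(-100.0, 3, N), aic(-98.0, 10, N)
```

Because BIC's per-parameter penalty log(N) exceeds AIC's constant 2 once N > e², BIC prunes harder on larger samples.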
MDL (Minimum Description Length)
• Regularity ~ compressibility
• Learning ~ finding regularities
• [Diagram: input samples in $R^n$ pass through both the learning model and the real model; the learning model's predictions in $R^1$ are compared ("=?") against the real class in $R^1$, and the mismatch is the error]
MDL (Minimum Description Length)
• Regularity ~ compressibility
• Learning ~ finding regularities
• $\mathrm{length} = -\log \Pr(y \mid \theta, M, X) - \log \Pr(\theta \mid M)$
  The first term is the length of transmitting the discrepancy given the model, under optimal coding; the second is the description length of the model itself under optimal coding
• MDL principle: choose the model with the minimum description length
• This is equivalent to maximizing the posterior: $\Pr(y \mid \theta, M, X)\,\Pr(\theta \mid M)$
SRM with VC (Vapnik-Chervonenkis) Dimension
• $h$ = VC dimension (a measure of the power of the class of functions $f$)
• Vapnik showed that, with probability $1-\eta$:
  $\mathrm{Err}_{true} \le \mathrm{Err}_{train} + \frac{\epsilon}{2}\left(1 + \sqrt{1 + \frac{4\,\mathrm{Err}_{train}}{\epsilon}}\right)$, where $\epsilon = a_1\,\frac{h\,[\log(a_2 N/h) + 1] - \log(\eta/4)}{N}$
• The bound loosens as $h$ increases
• Structural Risk Minimization (SRM): a method of selecting a class $F$ from a family of nested classes by minimizing this upper bound
Errin Estimation
• A trade-off between the fit to the data and the model complexity:
  AIC: $\overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$
  BIC: $-2\,\mathrm{loglik} + (\log N)\,d$
  MDL: $\mathrm{length} = -\log\Pr(y \mid \theta, M, X) - \log\Pr(\theta \mid M)$
  VC: $\mathrm{Err}_{true} \le \mathrm{Err}_{train} + \frac{\epsilon}{2}\big(1 + \sqrt{1 + 4\,\mathrm{Err}_{train}/\epsilon}\big)$
Estimation of Extra-Sample Err
• Cross Validation
• Bootstrap
Cross-Validation
• K-fold: split the data into K folds; each fold in turn is held out as the test set while the model is trained on the remaining folds
• $CV(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)\big)$, where $\hat f^{-\kappa(i)}$ is fit with the fold containing observation $i$ removed
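A minimal K-fold CV sketch for least squares under squared-error loss (the toy data are assumptions; `kappa` is the fold label of each point):

```python
import numpy as np

def k_fold_cv(X, y, K, rng):
    N = len(y)
    kappa = rng.permutation(N) % K           # fold assignment kappa(i)
    sq_err = np.empty(N)
    for k in range(K):
        held_out = kappa == k
        # Fit with fold k removed, predict the held-out fold:
        beta, *_ = np.linalg.lstsq(X[~held_out], y[~held_out], rcond=None)
        sq_err[held_out] = (y[held_out] - X[held_out] @ beta) ** 2
    return sq_err.mean()

# Toy linear data: CV should land near the noise variance sigma^2 = 0.25.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 100)
cv_estimate = k_fold_cv(X, y, 5, rng)
```

Every point is predicted exactly once, by a model that never saw it, which is what removes the optimism of the training error.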
How many folds?
• As k increases (from small k-fold toward leave-one-out): bias decreases, variance increases, and computation increases
Cross-Validation: Choosing K
Popular choices for K: 5, 10, and N (leave-one-out)
Generalized Cross-Validation
• LOOCV can be computationally expensive for linear fitting with large $N$
• GCV provides a computationally cheaper approximation
• For a linear fitting method under squared-error loss, $\hat y = Sy$ ($S$ is a smoother matrix), LOOCV has the closed form
  $\frac{1}{N}\sum_{i=1}^{N} \big[y_i - \hat f^{-i}(x_i)\big]^2 = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\right]^2$, where $S_{ii}$ is the $i$-th diagonal element of $S$
• GCV approximates each $S_{ii}$ by the average $\mathrm{trace}(S)/N$:
  $GCV = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\right]^2$
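A GCV sketch using ridge regression as the linear smoother (an assumed choice; with lam = 0 it reduces to plain least squares, where trace(S) = p):

```python
import numpy as np

def gcv(X, y, lam):
    N, p = X.shape
    # Smoother matrix of ridge regression: y_hat = S y
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.trace(S) / N)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0.0, 0.3, 30)
```

One fit per candidate `lam` suffices, versus N refits for exact LOOCV, which is the computational point of GCV.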
Bootstrap: Main Concept
“The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas”
(An Introduction to the Bootstrap, Efron and Tibshirani, 1993)
Step 1: Draw samples with replacement
Step 2: Calculate the statistic
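The two steps, sketched for the sample median (the normal toy data are an assumption; any statistic works):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

B = 1000
boot_stats = np.array([
    # Step 1: draw a sample of size N with replacement;
    # Step 2: calculate the statistic on it.
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(B)
])
se_hat = boot_stats.std(ddof=1)   # bootstrap standard error of the median
```

No formula for the standard error of the median is needed, which is exactly the Efron and Tibshirani selling point quoted above.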
How the Bootstrap Works
• Ideally, the sampling distribution of the sample mean $\bar x$ would come from many independent random samples; in practice we cannot afford a large number of random samples
• Theory tells us the sampling distribution only in simple, well-understood cases
• Bootstrap idea: the sample stands for the population, and the distribution of $\bar x$ in many resamples stands for the sampling distribution
Bootstrap: Error Estimation with Errboot
• $\mathrm{Var}_F[S(Z)]$ depends on the unknown true distribution $F$; the bootstrap estimates it by
  $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^{B} \big(S(Z^{*b}) - \bar S^*\big)^2$, where $\bar S^* = \frac{1}{B}\sum_{b=1}^{B} S(Z^{*b})$
• A straightforward application of the bootstrap to error prediction:
  $\widehat{\mathrm{Err}}_{boot} = \frac{1}{B}\sum_{b=1}^{B} \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big)$
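A sketch of Errboot on toy linear data (assumed setup). Note that every bootstrap fit is also scored on points it trained on, which is the source of its optimism:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 60, 200
x = rng.uniform(-1.0, 1.0, N)
X = np.column_stack([np.ones(N), x])
y = 1.5 * x + rng.normal(0.0, 0.4, N)

losses = np.empty(B)
for b in range(B):
    star = rng.integers(0, N, N)                 # indices of Z*b
    beta, *_ = np.linalg.lstsq(X[star], y[star], rcond=None)
    losses[b] = np.mean((y - X @ beta) ** 2)     # scored on ALL original i
err_boot = losses.mean()
```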
Bootstrap: Error Estimation with Err(1)
A CV-inspired improvement on Errboot: score each point only with the bootstrap fits that did not see it
$\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat f^{*b}(x_i)\big)$, where $C^{-i}$ is the set of bootstrap samples $b$ that do not contain observation $i$
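A sketch of Err(1) on the same kind of toy linear data (assumed setup): point i is predicted only by the fits f*b with b in C^{-i}, the set of bootstrap samples not containing i:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 60, 200
x = rng.uniform(-1.0, 1.0, N)
X = np.column_stack([np.ones(N), x])
y = 1.5 * x + rng.normal(0.0, 0.4, N)

preds = np.full((B, N), np.nan)                  # nan marks "b contains i"
for b in range(B):
    star = rng.integers(0, N, N)
    beta, *_ = np.linalg.lstsq(X[star], y[star], rcond=None)
    out = np.setdiff1d(np.arange(N), star)       # the i with b in C^{-i}
    preds[b, out] = X[out] @ beta

# Inner mean: over b in C^{-i} (nans ignored); outer mean: over i.
err1 = np.nanmean(np.nanmean((y - preds) ** 2, axis=0))
```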
Bootstrap: Error Estimation with Err(.632)
An improvement on Err(1) in light-fitting cases:
$\widehat{\mathrm{Err}}^{(.632)} = .368\,\overline{\mathrm{err}} + .632\,\widehat{\mathrm{Err}}^{(1)}$
Where does .632 come from? With $N$ data points $Z = (z_1, \dots, z_N)$:
• Probability of $z_i$ NOT being chosen when 1 point is uniformly sampled from $Z$: $1 - \frac{1}{N}$
• Probability of $z_i$ NOT being chosen when $Z$ is sampled $N$ times: $\big(1 - \frac{1}{N}\big)^N$
• Probability of $z_i$ being chosen AT LEAST once when $Z$ is sampled $N$ times: $1 - \big(1 - \frac{1}{N}\big)^N \approx 1 - e^{-1} = 0.632$
So the estimator pulls the training error toward Err(1) by exactly that amount:
$\widehat{\mathrm{Err}}^{(.632)} = \overline{\mathrm{err}} + .632\,\big(\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}\big) = .368\,\overline{\mathrm{err}} + .632\,\widehat{\mathrm{Err}}^{(1)}$
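The arithmetic behind the 0.632 weight, as a sketch:

```python
import math

def p_appears(N):
    # P(z_i appears at least once in a bootstrap sample of size N)
    return 1.0 - (1.0 - 1.0 / N) ** N    # -> 1 - e^{-1} ~ 0.632 as N grows

def err_632(err_train, err1):
    # .368 * err_bar + .632 * Err^(1)
    return 0.368 * err_train + 0.632 * err1
```

For N = 1 the probability is exactly 1, and it decreases toward 1 - 1/e as N grows, which is where both weights come from.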
Bootstrap: Error Estimation with Err(.632+)
An improvement on Err(.632) by adaptively accounting for overfitting
• Depending on the amount of overfitting, the best error estimate is as little as Err(.632) , or as much as Err(1), or something in between
• Err(.632+) is like Err(.632) with adaptive weights, with Err(1) weighted at least .632
• Err(.632+) adaptively mixes training error and leave-one-out error using the relative overfitting rate (R)
Bootstrap: Error Estimation with Err(.632+)
$\widehat{\mathrm{Err}}^{(.632+)}$ ranges from $\widehat{\mathrm{Err}}^{(.632)}$ if there is minimal overfitting ($\hat R \approx 0$), to $\widehat{\mathrm{Err}}^{(1)}$ if there is maximal overfitting ($\hat R \approx 1$)
Cross Validation & Bootstrap
• Why bother with cross-validation and the bootstrap when analytical estimates are known?
1) AIC, BIC, MDL, and SRM all require knowledge of $d$, which is difficult to obtain in most situations.
2) The bootstrap and cross-validation give results similar to the above, but are also applicable in more complex situations.
3) Estimating the noise variance requires a roughly working model; cross-validation and the bootstrap work well even if the model is far from correct.
Conclusion
• Test error plays a crucial role in model selection
• AIC, BIC and SRM/VC have the advantage that you only need the training error
• If the VC dimension is known, then SRM is a good method for model selection: it requires much less computation than CV and the bootstrap, but is wildly conservative
• Methods like CV and the bootstrap give tighter error bounds, but might have more variance
• Asymptotically, AIC and leave-one-out CV should be the same
• Asymptotically, BIC and a carefully chosen k-fold CV should be the same
• BIC is what you want if you want the best structure instead of the best predictor
• The bootstrap has much wider applicability than just estimating prediction error