Model assessment and cross-validation - overview


Page 1: Model assessment and cross-validation - overview


Model assessment and cross-validation - overview

Bias-variance decomposition and tradeoff

Analytical validation methods (AIC, BIC)

Resampling methods (cross-validation, bootstrap methods)

Page 2: Model assessment and cross-validation - overview


Bias, variance and model complexity

Page 3: Model assessment and cross-validation - overview


Bias, variance decomposition, and tradeoff

Expected prediction error at a point $x_0$ under squared-error loss:

$Err(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$

$= \text{Irreducible error} + \text{Bias}^2 + \text{Variance}$

We would like to minimize Err!

Increasing the complexity of models increases variance and decreases bias

Example: Smoothing based on nearest neighbours.
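The nearest-neighbour example can be made concrete with a small simulation. Below is a minimal sketch, assuming NumPy is available; the true function, noise level, and target point x0 are invented for illustration. It estimates the bias and variance of a k-nearest-neighbour smoother at x0 over repeated training sets: small k gives low bias and high variance, large k the opposite.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # illustrative true regression function
x_train = np.linspace(0, 1, 50)
x0 = 0.5                                 # target point for the decomposition
sigma = 0.3                              # noise standard deviation

def knn_fit_at(x0, x, y, k):
    """k-nearest-neighbour smoother evaluated at a single point x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

for k in (1, 5, 25):
    preds = []
    for _ in range(2000):                # repeat over simulated training sets
        y = f(x_train) + rng.normal(0, sigma, x_train.size)
        preds.append(knn_fit_at(x0, x_train, y, k))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2  # squared bias at x0
    var = preds.var()                    # variance of the fit at x0
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}  "
          f"bias^2 + var + sigma^2 = {bias2 + var + sigma**2:.4f}")
```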

Page 4: Model assessment and cross-validation - overview


Loss function, training error, and test error - quantitative response (regression)

• Loss functions (examples): squared error $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$, absolute error $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$

• Training error: $\overline{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$

• Test error: $Err = E[L(Y, \hat{f}(X))]$, where the expectation is taken over the joint distribution of (X, Y)
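As an illustration of the two quantities, here is a small sketch (my own example, not from the slides) assuming NumPy: a linear data-generating model is invented, the training error is the average squared loss on the fitted sample, and the test error is approximated by evaluating the loss on a large fresh sample from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative data: Y = 2X + noise, fitted by least squares.
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.5, 100)
beta = np.polyfit(x, y, deg=1)           # fit a straight line
yhat = np.polyval(beta, x)

train_err = np.mean((y - yhat) ** 2)     # training error (squared-error loss)

# Test error: expectation over new (X, Y), approximated by a large fresh sample.
x_new = rng.uniform(0, 1, 100_000)
y_new = 2 * x_new + rng.normal(0, 0.5, 100_000)
test_err = np.mean((y_new - np.polyval(beta, x_new)) ** 2)
print(train_err, test_err)
```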

Page 5: Model assessment and cross-validation - overview


Loss function, training error, and test error - qualitative response (classification)

Model: $G \in \{1, 2, \ldots, K\}$, $p_k(X) = \Pr(G = k \mid X)$

Decision rule: $\hat{G}(X) = \arg\max_k \hat{p}_k(X)$

Examples of loss functions:

• 0-1 loss: $L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))$

• Cross-entropy loss (log-likelihood): $L(G, \hat{p}(X)) = -2 \sum_{k=1}^{K} I(G = k)\log \hat{p}_k(X) = -2 \log \hat{p}_G(X)$
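A small NumPy sketch of these quantities (illustrative values, not from the slides): the class probabilities and true labels are made up, the argmax decision rule is applied, and both the 0-1 loss and the cross-entropy loss are evaluated.

```python
import numpy as np

# Illustrative: predicted class probabilities for N=4 observations, K=3 classes.
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])
g = np.array([0, 1, 0, 2])                  # true class labels (0-based here)

g_hat = p_hat.argmax(axis=1)                # decision rule: argmax_k p_hat_k(x)
zero_one = np.mean(g_hat != g)              # 0-1 loss -> misclassification rate
cross_entropy = -2 * np.mean(np.log(p_hat[np.arange(len(g)), g]))  # -2 log p_hat_G

print(zero_one, cross_entropy)
```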

Page 6: Model assessment and cross-validation - overview


Training error and test error - qualitative response (classification)

• Training error

– 0-1 loss: misclassification rate

– Cross-entropy loss: $\overline{err} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat{p}_{g_i}(x_i)$

• Test error: $Err = E[L(G, \hat{G}(X))]$

Page 7: Model assessment and cross-validation - overview


Model selection

Assume that the model $f_\alpha(x)$ depends on a tuning parameter (model complexity parameter) $\alpha$

Examples of α:

• No. predictors (multiple regression)

• Degree of a polynomial (polynomial regression)

• Penalty factor (smoothing splines, ridge regression)

• Window width (kernels)

The aim of model selection: To find a model having minimum test error

Page 8: Model assessment and cross-validation - overview


Model selection and assessment

Model selection

Estimate the performance of different models in order to choose the best one.

Model assessment

Having chosen the final model, estimate the test error

Page 9: Model assessment and cross-validation - overview


Model selection and assessment in data-rich applications

• Training set (to fit the models, approx. 50%)

• Validation set (for model selection, approx. 25%)

• Test set (for model assessment, approx. 25%)

Example (splines)

1. Fit the training set using models with smoothing factors λ1, …, λn

2. Using the fitted splines f1*(x), …, fn*(x), estimate the prediction error on the validation set and choose the model i with the smallest error.

3. Estimate the generalization error of the chosen model i using the test set
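A sketch of this three-way workflow, assuming NumPy; it is not from the slides. Instead of smoothing splines with factors λ, a polynomial degree stands in as the complexity parameter, and the data-generating curve is invented. The 50/25/25 split follows the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative data; polynomial degree stands in for the slide's smoothing factor.
n = 400
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# 50 % training, 25 % validation, 25 % test (as on the slide)
perm = rng.permutation(n)
train, val, test = perm[:200], perm[200:300], perm[300:]

def mse(idx, coefs):
    return np.mean((y[idx] - np.polyval(coefs, x[idx])) ** 2)

# 1. Fit the candidate models on the training set
fits = {d: np.polyfit(x[train], y[train], deg=d) for d in range(1, 11)}

# 2. Model selection: pick the degree with the smallest validation error
best_d = min(fits, key=lambda d: mse(val, fits[d]))

# 3. Model assessment: estimate the generalization error on the test set
print("chosen degree:", best_d, " test MSE:", mse(test, fits[best_d]))
```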

Page 10: Model assessment and cross-validation - overview


Model selection and assessment in applications with insufficient data

• Analytical expressions to select and assess models

– Cp (correction for the number of inputs or basis functions)

– AIC (Akaike's information criterion)

– BIC (Bayesian information criterion)

• Resampling

– Cross-validation

– The bootstrap

Page 11: Model assessment and cross-validation - overview


Analytical validation methods

Background:

A fitted model typically overfits the training data, so the prediction error is on average higher than the training error.

Terminology:

The difference between the prediction error and the average training error is called the optimism

Basic idea:

Find an analytical expression for the optimism.

Page 12: Model assessment and cross-validation - overview


Optimism

• Training error rate: $\overline{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$

• In-sample error: $Err_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^{new}}\left[L(Y_i^{new}, \hat{f}(x_i))\right]$

• Optimism: $op = Err_{in} - \overline{err}$

Page 13: Model assessment and cross-validation - overview


Analytical expressions for the optimism

1. For squared error, 0-1 loss, and some other loss functions, the average optimism is $\omega = E_y(op) = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$

2. For linear models with d inputs or basis functions: $\omega = 2\,\frac{d}{N}\,\sigma_\varepsilon^2$

Page 14: Model assessment and cross-validation - overview


Cp scores

Basic idea:

$E_y(Err_{in}) = E_y(\overline{err}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$ when d parameters are fitted under squared loss.

Compute

$C_p = \overline{err} + 2\,\frac{d}{N}\,\hat{\sigma}_\varepsilon^2$

and choose the model with the smallest Cp score.

Properties:

• The penalty increases with increasing model complexity (d) and decreases as the training sample size increases
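A sketch of the Cp computation in NumPy (illustrative data, not from the slides): nested linear models with d predictors are compared, and $\hat{\sigma}_\varepsilon^2$ is estimated from the largest, low-bias model, which is one common convention.

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative regression data: only the first 3 of 10 predictors matter.
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, N)

def fit_err(d):
    """Training error of a least-squares fit using the first d predictors plus an intercept."""
    Xd = np.column_stack([np.ones(N), X[:, :d]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.mean((y - Xd @ beta) ** 2)

# sigma^2 estimated from the largest (low-bias) model
err_full = fit_err(p)
sigma2 = err_full * N / (N - p - 1)

for d in range(1, p + 1):
    cp = fit_err(d) + 2 * (d + 1) / N * sigma2   # d+1 fitted parameters incl. intercept
    print(f"d={d:2d}  Cp={cp:.3f}")
```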

Page 15: Model assessment and cross-validation - overview


Akaike’s information criterion

When the likelihood is maximized, it holds asymptotically that

$-2\,E\left[\log \Pr\nolimits_{\hat{\theta}}(Y)\right] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$

where $\mathrm{loglik} = \sum_{i=1}^{N} \log \Pr\nolimits_{\hat{\theta}}(y_i)$.

Given a tuning parameter $\alpha$, we set

$AIC(\alpha) = \overline{err}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat{\sigma}_\varepsilon^2$

where $d(\alpha)$ is the effective number of parameters.

For Gaussian models, AIC is equivalent to Cp

Page 16: Model assessment and cross-validation - overview


Effective number of parameters, d(α)

For linear smoothers: $\hat{y} = S\,y$

Examples:

• Simple linear regression (trace(S) equals the exact number of parameters), ridge regression

• Smoothing splines

• Kernel smoothers

Define the effective number of parameters as $d(S) = \mathrm{trace}(S)$
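A minimal NumPy sketch of this definition (not from the slides), using ridge regression as the linear smoother: for λ = 0 the trace equals the exact number of parameters (here 10) and it shrinks as the penalty grows. The design matrix is random, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative design matrix; ridge regression is a linear smoother y_hat = S y.
N, p = 100, 10
X = rng.normal(size=(N, p))

def effective_df(lam):
    """d(lambda) = trace(S) for the ridge smoother S = X (X'X + lam I)^{-1} X'."""
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:6.1f}  effective df = {effective_df(lam):.2f}")
```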

Page 17: Model assessment and cross-validation - overview


Bayesian information criterion

Based on Bayesian theory we set

$BIC = -2\,\mathrm{loglik} + \log(N)\,d$

• For Gaussian models (with variance $\sigma_\varepsilon^2$ known): $BIC = \frac{N}{\sigma_\varepsilon^2}\left[\overline{err} + \log(N)\,\frac{d}{N}\,\sigma_\varepsilon^2\right]$

Properties:

• BIC is obtained from AIC by replacing the factor 2 with log(N)

• Since log(N) > 2 for N > 7, BIC penalizes complex models more heavily than AIC
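A sketch comparing the two criteria under a Gaussian likelihood (NumPy, not from the slides): the models are polynomials of increasing degree, the likelihood is evaluated at the maximum-likelihood estimates, and the parameter count includes the noise variance, which is one common convention. For N = 100, log(N) ≈ 4.6 > 2, so BIC penalizes the larger models more heavily than AIC.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative Gaussian regression; compare AIC and BIC penalties.
N = 100
x = rng.uniform(0, 1, N)
y = 1 + 3 * x + rng.normal(0, 0.5, N)

for d in range(1, 6):                      # polynomial degree = model complexity
    coefs = np.polyfit(x, y, deg=d)
    resid = y - np.polyval(coefs, x)
    sigma2 = np.mean(resid ** 2)           # ML estimate of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)   # Gaussian log-likelihood at the MLE
    k = d + 2                              # fitted parameters: d+1 coefficients + sigma^2
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + np.log(N) * k
    print(f"degree={d}  AIC={aic:7.2f}  BIC={bic:7.2f}")
```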

Page 18: Model assessment and cross-validation - overview


Features of AIC and BIC

For large samples (asymptotic property)

• BIC chooses the right model (if it is present among the alternatives)

• AIC tends to choose models that are too complex

For small samples

• BIC tends to choose models that are too simple

• AIC performs well

Page 19: Model assessment and cross-validation - overview


Resampling methods

Cross-validation

K-fold cross-validation (rough scheme):

1. Divide the data set into K roughly equal-sized subsets

2. Remove subset i and fit the model using the remaining data.

3. Predict the function values for subset i using the fitted model.

4. Repeat steps 2-3 for i = 1, …, K

5. CV = the average squared difference between observed and predicted values (other loss functions are possible)
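Here is a compact sketch of the scheme above, assuming NumPy; the data and model are invented for illustration. Squared error is used in step 5, a polynomial degree plays the role of the model, and the last two lines also illustrate the CV-based model selection discussed on the following slides.

```python
import numpy as np

rng = np.random.default_rng(6)
# Illustrative data; polynomial degree plays the role of the model being assessed.
N = 200
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

def k_fold_cv(x, y, degree, K=10, rng=rng):
    """K-fold CV estimate of the squared prediction error (K=N gives leave-one-out)."""
    folds = np.array_split(rng.permutation(len(y)), K)   # step 1: K roughly equal subsets
    sq_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)  # step 2: fit without fold i
        pred = np.polyval(coefs, x[test_idx])                       # step 3: predict fold i
        sq_errors.append((y[test_idx] - pred) ** 2)
    return np.mean(np.concatenate(sq_errors))            # step 5: average squared difference

# Model selection: pick the complexity (degree) with the smallest CV score
scores = {d: k_fold_cv(x, y, degree=d) for d in range(1, 8)}
print("best degree:", min(scores, key=scores.get))
```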

Page 20: Model assessment and cross-validation - overview


Resampling methods

Cross-validation

Note: if K = N, the method is called leave-one-out cross-validation.

K-fold cross-validation: $CV = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i, \hat{f}^{-\kappa(i)}(x_i)\right)$, where $\kappa(i)$ is the fold containing observation i and $\hat{f}^{-k}$ denotes the model fitted with fold k removed

Page 21: Model assessment and cross-validation - overview


Model selection using cross-validation

Given a model $\hat{f}_\alpha$ that depends on a tuning (complexity) parameter $\alpha$, choose the value of $\alpha$ with the smallest CV score: $CV(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i, \hat{f}_\alpha^{-\kappa(i)}(x_i)\right)$

Page 22: Model assessment and cross-validation - overview


Prediction error and cross-validation curve estimated from a single training set

Page 23: Model assessment and cross-validation - overview


Generalized cross-validation

Basic idea: approximate leave-one-out CV to make it faster.

Used for linear smoothers ($\hat{y} = S\,y$):

$GCV(\hat{f}) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(S)/N}\right]^2$

Note: in smoothing problems, GCV is similar to AIC
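A NumPy sketch of the GCV score for a linear smoother (not from the slides), again using ridge regression so that the smoother matrix S and its trace are explicit; the data are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Illustrative: ridge regression as a linear smoother y_hat = S y, scored by GCV.
N, p = 100, 10
X = rng.normal(size=(N, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, N)

def gcv(lam):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # smoother matrix
    resid = y - S @ y
    return np.mean((resid / (1 - np.trace(S) / N)) ** 2)

for lam in (0.01, 0.1, 1.0, 10.0):
    print(f"lambda={lam:5.2f}  GCV={gcv(lam):.4f}")
```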

Page 24: Model assessment and cross-validation - overview


The bootstrap

Why do we need the bootstrap?

• To estimate the uncertainty of parameter estimates

Example:

• Given a sample X1, …, Xn from an unknown distribution, we estimate its mean (expectation) by the sample mean

• How do we quantify the uncertainty of the sample mean?

Page 25: Model assessment and cross-validation - overview


Implementation of the crude nonparametric bootstrap

1. Sample with replacement from the observed data to obtain a bootstrap sample

2. Repeat step 1 B times

3. Compute the parameter estimate on each bootstrap sample

4. Compute the variance of the estimates from step 3
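A sketch of steps 1-4 for the sample-mean example (NumPy, not from the slides): the exponential sample is invented, B bootstrap resamples are drawn, and the spread of the resampled means estimates the standard error, which can be compared with the classical s/√n formula.

```python
import numpy as np

rng = np.random.default_rng(8)
# Illustrative sample from an "unknown" distribution; we want the uncertainty of its mean.
x = rng.exponential(scale=2.0, size=50)

B = 2000
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)   # step 1: sample with replacement
    boot_means[b] = resample.mean()                       # step 3: estimate on each bootstrap sample

se_boot = boot_means.std(ddof=1)                          # step 4: spread of the estimates
print("sample mean:", x.mean(), " bootstrap SE:", se_boot)
print("classical SE (for comparison):", x.std(ddof=1) / np.sqrt(x.size))
```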

Page 26: Model assessment and cross-validation - overview


Bootstrap replicates

Page 27: Model assessment and cross-validation - overview


Benefits and drawbacks of the nonparametric bootstrap

1. The uncertainty of an estimate can be obtained without any information about the underlying distribution

2. It is not applicable to small data sets

3. If the distribution is known (except for the values of some parameters), the bootstrap may be slightly less efficient than conventional parametric methods

Page 28: Model assessment and cross-validation - overview


The bootstrap and estimation of prediction errors

• Fit the model to bootstrap samples (which play the role of training sets) and examine how well it predicts the original training set (which plays the role of test data)

Page 29: Model assessment and cross-validation - overview


An improved bootstrap estimator of the prediction error

The leave-one-out bootstrap behaves roughly like two-fold cross-validation and produces upward-biased estimates of the expected squared prediction error.

Solution:

.632 estimator (pulls the bias down)
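A sketch of the leave-one-out bootstrap and the .632 correction (NumPy, not from the slides): a straight-line fit on invented data stands in for the model, each observation is scored only by bootstrap fits that did not see it, and the .632 estimator then blends this with the training error using the standard weights 0.368 and 0.632.

```python
import numpy as np

rng = np.random.default_rng(9)
# Illustrative data; a straight-line fit stands in for the model being assessed.
N = 80
x = rng.uniform(0, 1, N)
y = 2 * x + rng.normal(0, 0.5, N)

def fit_predict(train_idx, x_eval):
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=1)
    return np.polyval(coefs, x_eval)

# Training error (resubstitution)
err_bar = np.mean((y - fit_predict(np.arange(N), x)) ** 2)

# Leave-one-out bootstrap: each observation is predicted only by
# bootstrap fits whose sample did not contain it.
B = 200
sq_err = [[] for _ in range(N)]
for b in range(B):
    boot = rng.integers(0, N, size=N)              # bootstrap sample (indices, with replacement)
    out = np.setdiff1d(np.arange(N), boot)         # observations left out of this sample
    if out.size == 0:
        continue
    pred = fit_predict(boot, x[out])
    for i, e in zip(out, (y[out] - pred) ** 2):
        sq_err[i].append(e)

err_loo_boot = np.mean([np.mean(e) for e in sq_err if e])

# The .632 estimator pulls the upward-biased leave-one-out bootstrap
# estimate back towards the (downward-biased) training error.
err_632 = 0.368 * err_bar + 0.632 * err_loo_boot
print(err_bar, err_loo_boot, err_632)
```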