

Lecture 15: Model Selection and Validation, Cross Validation

Hao Helen Zhang

Fall, 2017


Outline

Read: ESL 7.4, 7.10

Goal: better understand cross validation (CV)

Topics: Introduction; Training Error vs Test Error; Objective Tuning Methods


In general,

A linear regression method like OLS has low bias (if the true model is linear) but high variance (especially if the data dimension is high), leading to poor predictions.

Modern methods, such as LASSO and ridge regression, solve

$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda J(\beta).$$

They introduce some bias to reduce variance, leading to better prediction accuracy.

J is called a penalty function. Regularization can help to achieve sparsity or smoothness in estimation.
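To make the penalized criterion concrete, here is a minimal numerical sketch using the ridge penalty $J(\beta) = \|\beta\|_2^2$, which has a closed-form solution; the data, the `ridge_fit` helper, and the chosen λ values are hypothetical, used only for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_2^2 (ridge penalty J(beta) = ||beta||_2^2)."""
    p = X.shape[1]
    # Closed-form ridge solution: (X'X + lam * I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data: a sparse linear signal in 10 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.array([2.0, -1.0, 0.5] + [0.0] * 7)
y = X @ beta_true + rng.normal(scale=0.5, size=50)

beta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 recovers OLS (low bias, high variance)
beta_ridge = ridge_fit(X, y, lam=5.0)  # lam > 0 adds some bias but shrinks variance
```

Larger λ shrinks the coefficients more aggressively, trading extra bias for reduced variance.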


Model Performance Measurements

Given the data $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i$ is the input and $y_i$ is the response, we assume

$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \dots, n.$$

We fit the model using some method and obtain $\hat f$. If the model is indexed by a parameter, we denote the fitted model by $\hat f_\alpha$ or $\hat f_\lambda$.

For example,

the OLS estimator: $\hat f(x) = x^T \hat\beta^{\mathrm{ols}}$

the LASSO estimator: $\hat f_\lambda(x) = x^T \hat\beta^{\mathrm{lasso}}_\lambda$.


Bias-Variance Trade-off

As complexity increases, the model tends to

adapt to more complicated underlying structures (decrease in bias)

have increased estimation error (increase in variance)

In between there is an optimal model complexity, giving the minimum test error.

Role of tuning parameter α or λ:

The tuning parameter varies the model complexity.

Find the best α that produces the minimum test error.


Model Performance Evaluation

How do we evaluate the performance of $\hat f$? Consider

its generalization performance

its prediction capability on an independent set of data, the so-called “test data”

Model assessment is extremely important, as it

measures the quality of the final model.

allows us to compare different methods and make an optimal choice.

guides the choice of tuning parameters in regularization methods.


Model Selection and Assessment

Two Goals:

Model Selection: estimate the performance of different models and choose the best model.

Model Assessment: after choosing a final model, estimate its prediction error (generalization performance) on new data.


Common Techniques

The following are commonly used to estimate the prediction error

Validation set (data-rich situation)

Resampling methods:

cross validation

bootstrap

Analytical estimation criteria

Next, we will introduce their basic ideas.

See Efron (2004, JASA) for more details.


Different Types of Errors

In machine learning, we encounter different types of “errors”

training error

test error, prediction error

tuning error

cross validation error

Example: If Y is continuous, two commonly-used loss functions are

squared error: $L(Y, f(X)) = (Y - f(X))^2$

absolute error: $L(Y, f(X)) = |Y - f(X)|$


Training error and Test error

Training error: the average loss over the training samples

$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat f(x_i))$$

Prediction error: the expected error over all the inputs

$$\mathrm{PE} = E_{(X, Y', \text{data})}\left[\frac{1}{n}\sum_{i=1}^{n} L(Y_i', \hat f(X_i))\right],$$

where $Y_i' = f(X_i) + \varepsilon_i'$, $i = 1, \dots, n$, are independent of $y_1, \dots, y_n$. The expectation is taken w.r.t. all random quantities.

Test error: Given a test data set $(x_i, y_i')$, $i = 1, \dots, n'$,

$$\mathrm{TestErr} = \frac{1}{n'}\sum_{i=1}^{n'} L(y_i', \hat f(x_i)).$$

TestErr is an estimate of PE. The larger $n'$, the better the estimate.
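As a concrete illustration of these definitions, the following sketch simulates training data and an independent test set, fits $\hat f$ by OLS, and computes TrainErr and TestErr under squared-error loss; the data-generating model and the sample sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, p = 100, 1000, 5
beta = rng.normal(size=p)

# Training data (x_i, y_i) and an independent test set of size n'
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
X_test = rng.normal(size=(n_test, p))
y_test = X_test @ beta + rng.normal(size=n_test)

# Fit f_hat by OLS on the training data
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

train_err = np.mean((y - X @ beta_hat) ** 2)            # TrainErr (squared-error loss)
test_err = np.mean((y_test - X_test @ beta_hat) ** 2)   # TestErr, an estimate of PE
```

With a large test set, `test_err` is a stable estimate of PE, while `train_err` is typically smaller.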


Classification Problems: Y is discrete

If Y takes values in $\{1, \dots, K\}$, the common loss functions are

0-1 loss: $L(Y, f(X)) = I(Y \neq f(X))$

log-likelihood loss: $L(Y, p(X)) = -2\sum_{k=1}^{K} I(Y = k)\log p_k(X) = -2\log p_Y(X)$

With the log-likelihood loss, the training error is the sample log-likelihood:

$$\mathrm{TrainErr} = -\frac{2}{n}\sum_{i=1}^{n}\log p_{y_i}(x_i)$$

The log-likelihood loss is also referred to as cross-entropy loss or deviance.

For general densities (Poisson, gamma, exponential, log-normal):

$$L(Y, \theta(X)) = -2\log \Pr_{\theta(X)}(Y).$$

With the 0-1 loss, the test error is the expected misclassification rate.
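A small sketch of these two losses in code; the helper names are hypothetical and, unlike the slide's $1, \dots, K$ coding, classes are coded $0, \dots, K-1$ so they can index the probability matrix directly.

```python
import numpy as np

def misclassification_rate(y, y_pred):
    """Average 0-1 loss: I(Y != f(X)) averaged over the sample."""
    return np.mean(np.asarray(y) != np.asarray(y_pred))

def deviance(y, prob, eps=1e-12):
    """-2 * average log-likelihood (cross-entropy); prob[i, k] = estimated P(Y = k | x_i),
    with classes coded 0, ..., K-1 in y."""
    y = np.asarray(y)
    return -2.0 * np.mean(np.log(prob[np.arange(len(y)), y] + eps))

# Tiny illustration with K = 2 classes
y_true = np.array([0, 1, 1, 0])
p_hat = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]])
print(misclassification_rate(y_true, p_hat.argmax(axis=1)))  # 0.25 (one of four misclassified)
print(deviance(y_true, p_hat))                               # -2 * mean(log 0.8, log 0.7, log 0.4, log 0.9)
```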


Training Error vs Test Error

Training error can be easily calculated by applying the statistical learning method to the training set.

Test error is the average error of a statistical learning method when it is used to predict the response for a new observation, one that was not used in training the method.

TrainErr is not a good estimate of TestErr:

The training error rate is often quite different from the test error rate.

TrainErr can dramatically underestimate TestErr.


What is Wrong with Training Error

TrainErr is not an objective measure of the generalization performance of the method, since we build $\hat f$ from the training data.

TrainErr decreases with model complexity.

Training error drops to zero if the model is complex enough.

A model with zero training error overfits the training data; overfitted models typically generalize poorly.

Question: How do we estimate TestErr accurately? We consider

data-rich situation

data-insufficient situation


Data-Rich Situation

Key Idea: Estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.

Validation-set Approach: without tuning

Randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.

The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and the misclassification rate in the case of a qualitative (discrete) response.


Validation-set Approach: with tuning

Randomly divide the dataset into three parts:

training set: to fit the models

validation (tuning) set: to estimate the prediction error for model selection

test set: to assess the generalization error of the final chosen model

The typical proportions are 50%, 25%, and 25%, respectively.
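A minimal sketch of such a three-way random split, assuming the 50/25/25 proportions above; the `three_way_split` helper and the sample size are hypothetical.

```python
import numpy as np

def three_way_split(n, rng, fractions=(0.50, 0.25, 0.25)):
    """Randomly split indices 0..n-1 into training / validation / test index sets."""
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = three_way_split(1000, rng)
# Fit each candidate model on train_idx, pick the tuning parameter that minimizes the
# validation error on val_idx, then report the error of the chosen model on test_idx once.
```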


Drawbacks of Validation-set Approach

The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.

In the validation approach, only a subset of the observations - those that are included in the training set rather than in the validation set - are used to fit the model.

This suggests that the validation-set error may tend to overestimate the test error for the model fit on the entire data set.


Data-Insufficient Situation

How much training data is enough?

depends on the signal-to-noise ratio of the underlying function

depends on the complexity of the models

too difficult to give a general rule


Resampling Methods

These methods refit a model of interest to samples formed from the training set, and obtain additional information about the fitted model.

Nonparametric methods: efficient sample reuse

Cross-validation, bootstrap techniques

They provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.


K-fold Cross Validation

The simplest and most popular way to estimate the test error:

1. Randomly split the data into K roughly equal parts.

2. For each k = 1, ..., K:

leave the kth portion out, and fit the model using the other K − 1 parts; denote the solution by $\hat f^{-k}(x)$

calculate the prediction error of $\hat f^{-k}(x)$ on the kth (left-out) portion

3. Average the errors.

Define the index function (allocating fold memberships)

$$\kappa: \{1, \dots, n\} \to \{1, \dots, K\}.$$

Then the K-fold cross-validation estimate of prediction error is

$$\mathrm{CV} = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)$$

Typical choice: K = 5 or 10.
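A generic sketch of the procedure above; the `kfold_cv` helper and the OLS/squared-loss plug-ins are illustrative, and the data `X`, `y` are assumed to exist.

```python
import numpy as np

def kfold_cv(X, y, fit, predict, loss, K=5, rng=None):
    """K-fold CV estimate (1/n) * sum_i L(y_i, f^{-kappa(i)}(x_i)) for a generic fit/predict pair."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)        # kappa: observation -> fold
    errors = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model = fit(X[train_idx], y[train_idx])           # f^{-k}, fit without the kth portion
        errors[test_idx] = loss(y[test_idx], predict(model, X[test_idx]))
    return errors.mean()

# Example plug-ins: OLS with squared-error loss (data X, y assumed to exist)
ols_fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
ols_predict = lambda beta, X: X @ beta
sq_loss = lambda y, yhat: (y - yhat) ** 2
# cv_err = kfold_cv(X, y, ols_fit, ols_predict, sq_loss, K=10)
```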


Leave-One-Out (LOO) Cross Validation

If K = n, we have leave-one-out cross validation:

$$\kappa(i) = i$$

LOO-CV is approximately unbiased for the true prediction error.

LOO-CV can have high variance because the n “training sets” are so similar to one another.

Computation:

The computational burden is generally considerable, requiring n model fits.

For special models like linear smoothers, the computation can be done quickly, for example via Generalized Cross Validation (GCV).


Statistical Properties of Cross Validation Error Estimates

Cross-validation error is very nearly unbiased for test error.

The slight bias arises because the training set in cross-validation is slightly smaller than the actual data set (e.g., for LOOCV the training set size is n − 1 when there are n observed cases).

In nearly all situations, the effect of this bias is conservative, i.e., the estimated fit is slightly biased in the direction suggesting a poorer fit. In practice, this bias is rarely a concern.

The variance of the CV error can be large.

In practice, to reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.


Choose α

Given a tuning parameter α, denote the model fit with the kth part removed by $\hat f^{-k}(x, \alpha)$. The CV(α) curve is

$$\mathrm{CV}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)\big)$$

CV(α) provides an estimate of the test error curve.

We find the best parameter $\hat\alpha$ by minimizing CV(α).

The final model is $\hat f(x, \hat\alpha)$.


Performance of CV

Example: Two-class classification problem

Linear model with best subsets regression of subset size p

In Figure 7.9 (p is the tuning parameter):

both curves pick p = 10 (the true model).

The CV curve is flat beyond p = 10.

Standard error bars are the standard errors of the individual misclassification error rates.


Standard Errors for K-fold CV

Let $\mathrm{CV}_k(\hat f_\alpha)$ denote the validation error of $\hat f_\alpha$ in fold k, or simply $\mathrm{CV}_k(\alpha)$. Here $\alpha \in \{\alpha_1, \dots, \alpha_m\}$, a set of candidate values for α. Then we have:

The cross validation error

$$\mathrm{CV}(\alpha) = \frac{1}{K}\sum_{k=1}^{K} \mathrm{CV}_k(\alpha).$$

The standard error of CV(α) is calculated as

$$\mathrm{SE}(\alpha) = \sqrt{\mathrm{Var}\big(\mathrm{CV}(\alpha)\big)}.$$


One-standard Error for K-fold CV

Let

$$\hat\alpha = \arg\min_{\alpha \in \{\alpha_1, \dots, \alpha_m\}} \mathrm{CV}(\alpha).$$

First we find $\hat\alpha$; then we move α in the direction of increasing regularization as far as we can, as long as

$$\mathrm{CV}(\alpha) \le \mathrm{CV}(\hat\alpha) + \mathrm{SE}(\hat\alpha).$$

Choose the most parsimonious model with error no more than one standard error above the best error.

Idea: “All else equal (up to one standard error), go for the simplest model.”

In this example, the one-SE rule would choose p = 9.
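A sketch of the one-SE rule given a K × m table of per-fold errors $\mathrm{CV}_k(\alpha_j)$; estimating SE(α) by the fold-to-fold standard deviation divided by $\sqrt{K}$ is one common convention and is an assumption here, as is the ordering of the candidates from least to most regularization.

```python
import numpy as np

def one_se_rule(cv_fold_errors, alphas):
    """cv_fold_errors: shape (K, m) array of CV_k(alpha_j); alphas: the m candidate values,
    assumed ordered from least to most regularization.  Returns the one-SE-rule choice."""
    K = cv_fold_errors.shape[0]
    cv_mean = cv_fold_errors.mean(axis=0)                    # CV(alpha_j)
    cv_se = cv_fold_errors.std(axis=0, ddof=1) / np.sqrt(K)  # SE(alpha_j), one common estimate
    j_min = int(np.argmin(cv_mean))                          # index of alpha-hat
    threshold = cv_mean[j_min] + cv_se[j_min]                # CV(alpha-hat) + SE(alpha-hat)
    eligible = np.flatnonzero(cv_mean <= threshold)
    return alphas[eligible.max()]  # most regularized alpha still within one SE of the best
```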


[Figure 7.9 (ESL): CV curve with standard error bars. x-axis: Subset Size p (5 to 20); y-axis: Misclassification Error (0.0 to 0.6).]


The Wrong and Right Way

Consider a simple classifier for microarrays:

1. Starting with 5,000 genes, find the top 200 genes having the largest correlation with the class labels.

2. Carry out nearest-centroid classification using the top 200 genes.

How do we select the tuning parameter in the classifier?

Way 1: apply cross-validation to step 2

Way 2: apply cross-validation to steps 1 and 2

Which is right? Cross-validating the whole procedure (Way 2).
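A sketch of the “right way”, with the gene screening repeated inside every fold; the helper name, fold count, and the assumption that labels are coded numerically (e.g., 0/1) are all illustrative, not part of the lecture.

```python
import numpy as np

def cv_whole_procedure(X, y, n_keep=200, K=5, rng=None):
    """'Right way' sketch: repeat the gene screening (Step 1) inside every CV fold,
    then apply nearest-centroid classification (Step 2) on the selected genes.
    Labels y are assumed to be coded numerically (e.g. 0/1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    mistakes = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Xtr, ytr = X[train_idx], y[train_idx]
        # Step 1, done inside the fold: rank genes by |correlation| with the class labels
        corr = np.array([abs(np.corrcoef(Xtr[:, j], ytr)[0, 1]) for j in range(X.shape[1])])
        keep = np.argsort(corr)[-n_keep:]
        # Step 2: nearest-centroid classification on the selected genes
        centroids = {c: Xtr[ytr == c][:, keep].mean(axis=0) for c in np.unique(ytr)}
        for i in test_idx:
            pred = min(centroids, key=lambda c: np.linalg.norm(X[i, keep] - centroids[c]))
            mistakes.append(pred != y[i])
    return float(np.mean(mistakes))
```

The “wrong way” would select the 200 genes once on the full data and cross-validate only Step 2, which lets the held-out labels leak into the screening step and gives an optimistic error estimate.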


Generalized Cross Validation

If a linear fitting method $\hat y = S y$ satisfies

$$y_i - \hat f^{-i}(x_i) = \frac{y_i - \hat f(x_i)}{1 - S_{ii}},$$

then we do NOT need to train the model n times.

Using the approximation $S_{ii} \approx \operatorname{trace}(S)/n$, we have the approximate score

$$\mathrm{GCV} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i - \hat f(x_i)}{1 - \operatorname{trace}(S)/n}\right]^2$$

computational advantage

GCV(α) is always smoother than CV(α)
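A sketch of the GCV score for one concrete linear smoother, ridge regression with hat matrix $S = X(X^TX + \lambda I)^{-1}X^T$; the `gcv_ridge` helper and the λ grid are hypothetical.

```python
import numpy as np

def gcv_ridge(X, y, lam):
    """GCV score for ridge regression, which is a linear smoother y_hat = S y."""
    n, p = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # smoother (hat) matrix
    y_hat = S @ y
    df = np.trace(S)                                         # replaces each S_ii by trace(S)/n
    return np.mean(((y - y_hat) / (1.0 - df / n)) ** 2)

# Choose lambda on a grid by minimizing GCV (data X, y assumed to exist)
# lams = np.logspace(-3, 3, 25)
# lam_gcv = lams[int(np.argmin([gcv_ridge(X, y, l) for l in lams]))]
```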


Application Examples

For regression problems, apply CV to

shrinkage methods: select λ

best subset selection: select the best model

nonparametric methods: select the smoothing parameter

For classification methods, apply CV to

kNN: select k

SVM: select λ


Model-Based Methods: Information Criteria

Covariance penalties

Cp

AIC

BIC

Stein’s UBR

They can be used for both model selection and test error estimation.


Generalized Information Criteria (GIC)

Information criteria:

$$\mathrm{GIC}(\text{model}) = -2 \cdot \mathrm{loglik} + \alpha \cdot \mathrm{df},$$

where the degrees of freedom (df) of $\hat f$ is the number of effective parameters of the model.

Examples: In linear regression settings, we use

$$\mathrm{AIC} = n\log(\mathrm{RSS}/n) + 2\,\mathrm{df},$$

$$\mathrm{BIC} = n\log(\mathrm{RSS}/n) + \log(n)\,\mathrm{df},$$

where $\mathrm{RSS} = \sum_{i=1}^{n} [y_i - \hat f(x_i)]^2$ (the residual sum of squares).
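A small sketch computing these two criteria for an OLS fit, taking df to be the number of fitted coefficients; the helper name and the usage pattern are illustrative only.

```python
import numpy as np

def aic_bic_ols(X, y):
    """AIC and BIC for an OLS fit, taking df = number of fitted coefficients (columns of X)."""
    n, df = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    aic = n * np.log(rss / n) + 2 * df
    bic = n * np.log(rss / n) + np.log(n) * df
    return aic, bic

# Compare candidate subsets of predictors by their AIC/BIC (X_full, y assumed to exist)
# for cols in candidate_subsets:
#     print(cols, aic_bic_ols(X_full[:, cols], y))
```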


Optimism of Training Error Rate

training error:

$$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat f(x_i))$$

test error:

$$\mathrm{Err} = E_{(X, Y, \text{data})}\big[L(Y, \hat f(X))\big].$$

in-sample error rate (an estimate of prediction error):

$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{n}\sum_{i=1}^{n} E_{y^{\mathrm{new}}} E_{y}\, L\big(y_i^{\mathrm{new}}, \hat f(x_i)\big)$$

Optimism:

$$\mathrm{op} = \mathrm{Err}_{\mathrm{in}} - E_y\, \mathrm{err}$$


Optimism

op is typically positive (Why?)

For squared error, 0-1, and other loss functions,

$$\mathrm{op} = \frac{2}{n}\sum_{i=1}^{n} \mathrm{Cov}(\hat y_i, y_i).$$

The amount by which err underestimates the true error depends on how strongly $y_i$ is correlated with its prediction $\hat y_i$.
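A short worked step, not on the slide, showing how the covariance term simplifies under the additional assumption of a linear smoother $\hat y = Sy$ with i.i.d. noise of variance $\sigma^2$:

$$\sum_{i=1}^{n} \mathrm{Cov}(\hat y_i, y_i) = \operatorname{trace}\big(\mathrm{Cov}(Sy, y)\big) = \sigma^2 \operatorname{trace}(S), \qquad \text{so} \quad \mathrm{op} = \frac{2\sigma^2}{n}\operatorname{trace}(S).$$

Plugging an estimate $\hat\sigma^2$ into $\mathrm{err} + \mathrm{op}$ then gives a $C_p$-type estimate of the in-sample error, which is the starting point for the covariance-penalty criteria ($C_p$, AIC, Stein's UBR) listed earlier.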
