Modeling Social Data, Lecture 7: Model Complexity and Generalization


Model complexity and generalization

APAM E4990

Modeling Social Data

Jake Hofman

Columbia University

March 3, 2017


Overfitting (a la xkcd)


Complexity

Our models should be complex enough to explain the past, but simple enough to generalize to the future.


Bias-variance tradeoff


(excerpt from The Elements of Statistical Learning, Ch. 2: Overview of Supervised Learning)

[Figure 2.11: Test and training error as a function of model complexity. Prediction error is plotted against model complexity, from low to high, for the training sample and the test sample; low complexity corresponds to high bias and low variance, high complexity to low bias and high variance.]

be close to $\hat{f}(x_0)$. As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

Simple models may be “wrong” (high bias), but their fits don’t vary much across different samples of training data (low variance)
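
To see this tradeoff numerically, here is a minimal sketch (not from the lecture) that fits k-nearest-neighbor regressors with different values of k to synthetic data; the data-generating function and the use of scikit-learn are assumptions for illustration. Training error keeps shrinking as complexity grows (smaller k), while test error follows the U-shape of Figure 2.11.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def make_data(n):
    # synthetic data: a sine curve plus Gaussian noise (an illustrative assumption)
    x = rng.uniform(0, 1, (n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 0.3, n)
    return x, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(1000)

# small k = high complexity, large k = low complexity
for k in [1, 5, 15, 50, 100]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"k={k:3d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")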


Bias-variance tradeoff

Flexible models can capture more complex relationships (low bias), but are also sensitive to noise in the training data (high variance)
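
The variance side of the tradeoff can be made concrete with another small sketch (again not from the lecture): refit a simple and a flexible model on many fresh training samples drawn from the same synthetic process, and compare how much their predictions at a fixed point move around. The polynomial models, query point, and data-generating function are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
x0 = 0.25                                   # fixed query point

def true_f(x):
    return np.sin(2 * np.pi * x)            # assumed "true" underlying curve

preds = {1: [], 9: []}                      # degree 1 = simple, degree 9 = flexible
for _ in range(200):                        # 200 independent training samples
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + rng.normal(0, 0.3, 30)
    for degree in preds:
        coefs = np.polyfit(x, y, degree)    # refit the model on this sample
        preds[degree].append(np.polyval(coefs, x0))

for degree, p in preds.items():
    p = np.array(p)
    # simple model: larger bias, smaller variance; flexible model: the reverse
    print(f"degree {degree}: bias {p.mean() - true_f(x0):+.3f}, variance {p.var():.3f}")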


Bigger models ≠ better models


Cross-validation

(excerpt from The Elements of Statistical Learning, Ch. 7: Model Assessment and Selection)

The “−2” in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.

For ease of exposition, for the remainder of this chapter we will use Y and f(X) to represent all of the above situations, since we focus mainly on the quantitative response (squared-error loss) setting. For the other situations, the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the expected test error for a model. Typically our model will have a tuning parameter or parameters α and so we can write our predictions as $\hat{f}_\alpha(x)$. The tuning parameter varies the complexity of our model, and we wish to find the value of α that minimizes error, that is, produces the minimum of the average test error curve in Figure 7.1. Having said this, for brevity we will often suppress the dependence of $\hat{f}(x)$ on α.

It is important to note that there are in fact two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with smallest test-set error. Then the test-set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing:

[Diagram: the data divided into consecutive blocks labeled Train, Validation, and Test.]

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.

• Randomly split our data into three sets

• Fit models on the training set

• Use the validation set to find the best model

• Quote final performance of this model on the test set
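
As a concrete illustration of this recipe, here is a minimal sketch (not from the lecture) of a 50/25/25 train / validation / test workflow using scikit-learn on synthetic data; the candidate models and the data are assumptions for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (400, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 400)

# 50% train, then split the remainder into 25% validation and 25% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

def mse(model, X, y):
    return np.mean((model.predict(X) - y) ** 2)

# fit candidate models on the training set, then pick the best one on the validation set
candidates = {k: KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
              for k in [1, 5, 15, 50]}
best_k = min(candidates, key=lambda k: mse(candidates[k], X_val, y_val))

# quote the final performance of the chosen model on the held-out test set
print(f"chose k={best_k}, test MSE {mse(candidates[best_k], X_test, y_test):.3f}")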


K-fold cross-validation

Estimates of generalization error from one train / validation split can be noisy, so shuffle data and average over K distinct validation partitions instead


K-fold cross-validation: pseudocode

(randomly) divide the data into K parts

for each model:
    for each of the K folds:
        train on everything but one fold
        measure the error on the held-out fold
        store the training and validation error
    compute and store the average error across all folds

pick the model with the lowest average validation error
evaluate its performance on a final, held-out test set
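
A minimal Python sketch of this pseudocode (not from the lecture), using scikit-learn's KFold on synthetic data; the candidate models (k-NN regressors with different k) and the data are illustrative assumptions.

import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (300, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 300)

# hold out a final test set, then cross-validate on the rest
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)     # (randomly) divide into K parts

avg_val_error = {}
for k in [1, 5, 15, 50]:                                 # for each model
    fold_errors = []
    for train_idx, val_idx in kf.split(X_dev):           # for each of the K folds
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_dev[train_idx], y_dev[train_idx])    # train on everything but one fold
        err = np.mean((model.predict(X_dev[val_idx]) - y_dev[val_idx]) ** 2)
        fold_errors.append(err)                          # error on the held-out fold
    avg_val_error[k] = np.mean(fold_errors)              # average error across folds

best_k = min(avg_val_error, key=avg_val_error.get)       # lowest average validation error
final = KNeighborsRegressor(n_neighbors=best_k).fit(X_dev, y_dev)
test_mse = np.mean((final.predict(X_test) - y_test) ** 2)
print(f"chose k={best_k}, test MSE {test_mse:.3f}")      # evaluate on held-out test set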
