Modeling Social Data, Lecture 7: Model Complexity and Generalization
Model complexity and generalization
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
March 3, 2017
Jake Hofman (Columbia University) Model complexity and generalization March 3, 2017 1 / 10
Overfitting (a la xkcd)
Complexity
Our models should be complex enough to explain the past, but simple enough to generalize to the future
Bias-variance tradeoff
Bias-variance tradeoff

[Excerpt from The Elements of Statistical Learning, p. 38, Ch. 2 (Overview of Supervised Learning):]

FIGURE 2.11. Test and training error as a function of model complexity. [Plot of prediction error against model complexity, low to high, for the training sample and the test sample; the low-complexity end is marked high bias / low variance, the high-complexity end low bias / high variance.]
…be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff. More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error (1/N) Σ_i (y_i − ŷ_i)^2. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions f(x0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
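The excerpt's warning that training error understates test error can be seen with a tiny simulation (an illustrative sketch, not from the slides; the data-generating setup is made up): a 1-nearest-neighbor regressor memorizes its training set, so its training error is exactly zero while its test error stays near the noise floor.

```python
import math
import random

# Illustrative sketch: 1-NN regression on synthetic 1-D data.
# Each training point is its own nearest neighbor, so training error
# is exactly zero; test error on fresh data remains roughly 2 * sigma^2.
random.seed(0)
xs = [random.random() for _ in range(100)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.5) for x in xs]

def one_nn(x0):
    """Predict with the response of the single nearest training point."""
    i = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[i]

train_err = sum((one_nn(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

test_xs = [random.random() for _ in range(100)]
test_ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.5) for x in test_xs]
test_err = sum((one_nn(x) - y) ** 2 for x, y in zip(test_xs, test_ys)) / len(test_xs)

print(train_err, test_err)  # training error is 0; test error is not
```

The gap between the two numbers is exactly the point of the excerpt: fitting the data harder drives training error down without improving generalization.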
Simple models may be “wrong” (high bias), but fits don’t vary a lot with different samples of training data (low variance)
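The 1/k behavior of the k-NN variance term quoted above can be checked with a short simulation (a sketch with synthetic data; the setup, a constant true function plus unit-variance noise, is an assumption made here for illustration):

```python
import random
import statistics

# Sketch: estimate the variance of a k-NN prediction at a fixed point x0
# by resampling the training data many times. With f(x) = 0 and unit-variance
# noise, the k-NN fit averages k noisy responses, so its variance is ~ 1/k.
random.seed(0)

def knn_prediction_variance(k, n_train=200, n_resamples=2000):
    """Variance (over resampled training sets) of the k-NN estimate at x0 = 0.5."""
    preds = []
    for _ in range(n_resamples):
        xs = [random.random() for _ in range(n_train)]
        ys = [random.gauss(0.0, 1.0) for _ in range(n_train)]  # f(x) = 0 plus noise
        nearest = sorted(range(n_train), key=lambda i: abs(xs[i] - 0.5))[:k]
        preds.append(sum(ys[i] for i in nearest) / k)
    return statistics.pvariance(preds)

v1, v25 = knn_prediction_variance(1), knn_prediction_variance(25)
print(v1, v25)  # variance shrinks roughly like 1/k
```

Raising k (a simpler, smoother model) cuts the variance, at the cost of averaging over more distant, potentially biased neighbors.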
Flexible models can capture more complex relationships (low bias), but are also sensitive to noise in the training data (high variance)
Bigger models ≠ Better models
Cross-validation

[Excerpt from The Elements of Statistical Learning, p. 222, Ch. 7 (Model Assessment and Selection):]
The “−2” in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss.

For ease of exposition, for the remainder of this chapter we will use Y and f(X) to represent all of the above situations, since we focus mainly on the quantitative response (squared-error loss) setting. For the other situations, the appropriate translations are obvious.

In this chapter we describe a number of methods for estimating the expected test error for a model. Typically our model will have a tuning parameter or parameters α and so we can write our predictions as fα(x). The tuning parameter varies the complexity of our model, and we wish to find the value of α that minimizes error, that is, produces the minimum of the average test error curve in Figure 7.1. Having said this, for brevity we will often suppress the dependence of f(x) on α.

It is important to note that there are in fact two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing:

[Diagram: the data divided into consecutive blocks labeled Train (50%) | Validation (25%) | Test (25%)]

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.
• Randomly split our data into three sets
• Fit models on the training set
• Use the validation set to find the best model
• Quote final performance of this model on the test set
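The bullets above can be sketched in a few lines (the 50/25/25 proportions follow the ESL excerpt; the function name and seed are illustrative choices, not from the slides):

```python
import random

# Sketch: randomly split n record indices into 50% training,
# 25% validation, and 25% test sets, as in the ESL excerpt.
def train_validation_test_split(n, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # randomly split our data
    n_train, n_val = n // 2, n // 4
    train = idx[:n_train]                   # fit models here
    validation = idx[n_train:n_train + n_val]  # pick the best model here
    test = idx[n_train + n_val:]            # quote final performance here
    return train, validation, test

train, val, test = train_validation_test_split(100)
print(len(train), len(val), len(test))  # 50 25 25
```

Splitting indices rather than the records themselves keeps the same partition reusable across features and labels.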
K-fold cross-validation
Estimates of generalization error from one train / validation split can be noisy, so shuffle data and average over K distinct validation partitions instead
K-fold cross-validation: pseudocode
(randomly) divide the data into K parts
for each model:
    for each of the K folds:
        train on everything but one fold
        measure the error on the held out fold
        store the training and validation error
    compute and store the average error across all folds
pick the model with the lowest average validation error
evaluate its performance on a final, held out test set
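The pseudocode above can be turned into a runnable sketch. Here the candidate "models" are k-nearest-neighbor regressors with different k, and the data are synthetic; both choices are illustrative assumptions, not from the slides.

```python
import math
import random
import statistics

# Runnable sketch of the K-fold cross-validation pseudocode above.
random.seed(1)

def knn_predict(train_xs, train_ys, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x0))[:k]
    return sum(train_ys[i] for i in nearest) / k

def mse(xs, ys, train_xs, train_ys, k):
    return statistics.fmean((knn_predict(train_xs, train_ys, x, k) - y) ** 2
                            for x, y in zip(xs, ys))

# Synthetic data: y = sin(2*pi*x) + noise
xs = [random.random() for _ in range(300)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]

# set aside a final, held out test set
test_xs, test_ys = xs[250:], ys[250:]
xs, ys = xs[:250], ys[:250]

K = 5
indices = list(range(len(xs)))
random.shuffle(indices)                          # (randomly) divide the data
folds = [indices[i::K] for i in range(K)]        # into K parts

avg_error = {}
for k in (1, 5, 25, 125):                        # for each model
    errors = []
    for fold in folds:                           # for each of the K folds
        held = set(fold)
        tr_xs = [xs[i] for i in indices if i not in held]  # train on everything
        tr_ys = [ys[i] for i in indices if i not in held]  # but one fold
        va_xs = [xs[i] for i in fold]
        va_ys = [ys[i] for i in fold]
        errors.append(mse(va_xs, va_ys, tr_xs, tr_ys, k))  # error on held out fold
    avg_error[k] = statistics.fmean(errors)      # average error across all folds

best_k = min(avg_error, key=avg_error.get)       # lowest average validation error
test_error = mse(test_xs, test_ys, xs, ys, best_k)  # final test-set performance
print(best_k, test_error)
```

Note that the too-simple model (very large k, high bias) and the too-flexible one (k = 1, high variance) both lose to an intermediate k, which is the bias-variance tradeoff from earlier in the lecture playing out in the validation error.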