Introduction to Data Science: Classifier Evaluation and Model Selection

Héctor Corrada Bravo
University of Maryland, College Park, USA
CMSC320: 2020-04-26

Classifier evaluation

How do we determine how well classifiers are performing?

One way is to compute the error rate of the classifier: the percentage of mistakes it makes when predicting class labels.

logis_fit <- glm(default ~ balance, data=Default, family=binomial)
logis_pred_prob <- predict(logis_fit, type="response")
logis_pred <- ifelse(logis_pred_prob > 0.5, "Yes", "No")
print(table(predicted=logis_pred, observed=Default$default))

##          observed
## predicted   No  Yes
##       No  9625  233
##       Yes   42  100

# error rate
mean(Default$default != logis_pred) * 100

## [1] 2.75

# dummy error rate (always predict "No")
mean(Default$default != "No") * 100

## [1] 3.33

We need a more precise language to describe classification mistakes:

                     True Class +          True Class -          Total
Predicted Class +    True Positive (TP)    False Positive (FP)   P*
Predicted Class -    False Negative (FN)   True Negative (TN)    N*
Total                P                     N

Using these we can define statistics that describe classifier performance:

Name                               Definition   Synonyms
False Positive Rate (FPR)          FP / N       Type-I error, 1 - Specificity
True Positive Rate (TPR)           TP / P       1 - Type-II error, power, sensitivity, recall
Positive Predictive Value (PPV)    TP / P*      precision, 1 - false discovery proportion
Negative Predictive Value (NPV)    TN / N*
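As an illustration (not part of the original slides), here is a minimal sketch of how these rates could be computed in R from the confusion matrix counts shown earlier for the balance-only logistic model:

# Confusion matrix counts from the earlier output (cutoff 0.5)
TP <- 100; FP <- 42; FN <- 233; TN <- 9625

P <- TP + FN        # actual positives
N <- FP + TN        # actual negatives
P_star <- TP + FP   # predicted positives
N_star <- FN + TN   # predicted negatives

c(TPR = TP / P,        # sensitivity / recall
  FPR = FP / N,        # 1 - specificity
  PPV = TP / P_star,   # precision
  NPV = TN / N_star)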

In the credit default case we may want to increase TPR (recall: make sure we catch all defaults) at the expense of FPR (1 - Specificity: clients we lose because we think they will default).

This leads to a natural question: can we adjust our classifier's TPR and FPR?

Remember we are classifying Yes if

log[ P(Y = Yes|X) / P(Y = No|X) ] > 0  ⇒  P(Y = Yes|X) > 0.5

What would happen if we instead use P(Y = Yes|X) > 0.2?
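A minimal sketch (not from the slides) of what lowering the cutoff does, reusing logis_pred, logis_pred_prob and the Default data from the code above; the helper function is hypothetical:

# Re-classify with a lower probability cutoff
logis_pred_02 <- ifelse(logis_pred_prob > 0.2, "Yes", "No")

# Helper to compute TPR and FPR for a vector of predictions
rates <- function(pred, obs) {
  tp <- sum(pred == "Yes" & obs == "Yes")
  fp <- sum(pred == "Yes" & obs == "No")
  c(TPR = tp / sum(obs == "Yes"), FPR = fp / sum(obs == "No"))
}

rates(logis_pred,    Default$default)  # cutoff 0.5
rates(logis_pred_02, Default$default)  # cutoff 0.2: higher TPR, higher FPR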


A way of describing the TPR and FPR tradeoff is by using the ROC curve (Receiver Operating Characteristic) and the AUROC (the area under the ROC curve).
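One illustrative way to draw the ROC curve and compute the AUROC in R is with the ROCR package; this is a sketch that assumes ROCR is installed and reuses logis_pred_prob from above, and it is not necessarily how the slide figures were produced:

library(ROCR)

# Pair the predicted probabilities with the observed labels
pred_obj <- prediction(logis_pred_prob, Default$default)

# TPR vs. FPR across all possible cutoffs
roc_perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(roc_perf)
abline(a = 0, b = 1, lty = 2)  # diagonal = random classifier

# Area under the ROC curve
performance(pred_obj, measure = "auc")@y.values[[1]]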


Consider comparing a logistic regression model using all predictors in the dataset, including an interaction term between balance and student:

default ~ balance*student + income

Another metric that is frequently used to understand classification errors and tradeoffs is the precision-recall curve:
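A sketch of how the two precision-recall curves could be computed, again assuming ROCR and fitting the larger model from the formula above; the variable names are illustrative:

# Fit the larger model and get its predicted probabilities
logis_fit2 <- glm(default ~ balance*student + income,
                  data = Default, family = binomial)
logis_pred_prob2 <- predict(logis_fit2, type = "response")

pred_small <- prediction(logis_pred_prob,  Default$default)
pred_big   <- prediction(logis_pred_prob2, Default$default)

# Precision vs. recall across cutoffs for both models
pr_small <- performance(pred_small, measure = "prec", x.measure = "rec")
pr_big   <- performance(pred_big,   measure = "prec", x.measure = "rec")

plot(pr_small)
plot(pr_big, add = TRUE, col = "red")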

The bigger model shows a slightly higher precision at the same recall values and a slightly higher area under the precision-recall curve.

This is commonly found in datasets where there is a skewed distribution of classes (e.g., there are many more "No" than "Yes" in this dataset). The area under the PR curve tends to distinguish classifier performance better than the area under the ROC curve in these cases.

Model Selection

Our goal when we use a learning model like linear or logistic regression, decision trees, etc., is to learn a model that can predict outcomes for new, unseen data.

We should therefore think of model evaluation in terms of expected prediction error: what will the prediction error be for data outside the training data?

How, then, do we measure our models' ability to predict unseen data when we only have access to training data?

Cross-validation

The most common method to evaluate model generalization performance is cross-validation. It is used in two essential data analysis phases: model selection and model assessment.

Model Selection

Decide what kind of model we should fit, and how complex it should be.

Consider a regression example: I will fit a linear regression model; which predictors should be included? Interactions? Data transformations?

Another example is what classification tree depth to use.

Or which kind of algorithm to use: linear regression vs. decision tree vs. random forest.

Model Assessment

Determine how well our selected model performs as a general model.

Ex.: I've built a linear regression model with a specific set of predictors. How well will it perform on unseen data? The same question can be asked of a classification tree of a specific depth.

Cross-validation is a resampling method to obtain estimates of expected prediction error rate (or any other performance measure on unseen data).

In some instances, you will have a large predefined test dataset that you should never use when training. In the absence of access to this kind of dataset, cross-validation can be used.

Validation Set

The simplest way to use cross-validation is to create a validation set: our dataset is randomly divided into training and validation sets. The validation set is then set aside and not used until we are ready to compute the test error rate (once; don't go back and check if you can improve it).
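A minimal sketch of such a split, assuming the Default data from earlier and an illustrative 50/50 split:

set.seed(1234)  # make the random split reproducible

n <- nrow(Default)
train_idx <- sample(n, size = n / 2)  # half the rows for training

train_set      <- Default[train_idx, ]
validation_set <- Default[-train_idx, ]

# Fit on the training set only...
fit <- glm(default ~ balance, data = train_set, family = binomial)

# ...then estimate the error rate once on the held-out validation set
val_prob <- predict(fit, newdata = validation_set, type = "response")
val_pred <- ifelse(val_prob > 0.5, "Yes", "No")
mean(val_pred != validation_set$default)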

Let's look at our running example using automobile data, where we want to build a regression model to predict miles per gallon given other auto attributes.

A linear regression model was not appropriate for this dataset, so we use polynomial regression as an illustrative example.

For polynomial regression, our regression model (for a single predictor X) is given as a degree-d polynomial:

E[Y | X = x] = β0 + β1x + β2x^2 + ⋯ + βdx^d

For model selection, we want to decide what degree d we should use to model this data.

Using the validation set method: split our data into training and validation sets, fit the regression model with different polynomial degrees d on the training set, and measure test error on the validation set.
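A sketch of this procedure, assuming the Auto data from the ISLR package with horsepower as the single predictor (the slides do not name the predictor, so this choice is illustrative):

library(ISLR)  # assumed source of the Auto data (mpg, horsepower, ...)

set.seed(1)
n <- nrow(Auto)
train_idx <- sample(n, size = n / 2)

# Validation-set MSE for polynomial degrees 1..10
degrees <- 1:10
val_mse <- sapply(degrees, function(d) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train_idx)
  pred <- predict(fit, newdata = Auto[-train_idx, ])
  mean((Auto$mpg[-train_idx] - pred)^2)
})

plot(degrees, val_mse, type = "b",
     xlab = "polynomial degree d", ylab = "validation MSE")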

Resampled validation set

The validation set approach can be prone to sampling issues: it can be highly variable, since the error rate is a random quantity that depends on which observations fall in the training and validation sets.

We can improve our estimate of test error by averaging multiple measurements of it (remember the law of large numbers).

Resample the validation set 10 times (yielding different validation and training sets) and average the resulting test errors.
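Continuing the previous sketch (same Auto/horsepower assumptions), the resampled version repeats the split and averages the per-degree errors:

# Repeat the validation-set split 10 times and average the MSE estimates
set.seed(2)
n_reps <- 10

mse_by_rep <- sapply(1:n_reps, function(rep) {
  train_idx <- sample(nrow(Auto), size = nrow(Auto) / 2)
  sapply(1:10, function(d) {
    fit  <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train_idx)
    pred <- predict(fit, newdata = Auto[-train_idx, ])
    mean((Auto$mpg[-train_idx] - pred)^2)
  })
})

# Average test error for each degree across the 10 resamples
rowMeans(mse_by_rep)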

Leave-one-out Cross-Validation

This approach still has some issues. Each of the training sets in our validation approach only uses 50% of the data to train, which leads to models that may not perform as well as models trained with the full dataset, and thus we can overestimate error.

To alleviate this, we can take the approach to the extreme: make each single training point its own validation set.

Procedure: for each observation i in the data set:

a. Train the model on all but the i-th observation
b. Predict the response for the i-th observation
c. Calculate the prediction error

This gives us the following cross-validation estimate of error:

CV(n) = (1/n) ∑i (yi − ŷi)^2
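A sketch of LOOCV on the running example (same Auto/horsepower assumptions as above), written as an explicit loop; the boot::cv.glm shortcut is noted in comments:

# Leave-one-out CV for a degree-2 polynomial (illustrative degree choice)
d <- 2
n <- nrow(Auto)

loocv_errors <- sapply(1:n, function(i) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = Auto[-i, ])   # train without observation i
  pred <- predict(fit, newdata = Auto[i, , drop = FALSE])    # predict observation i
  (Auto$mpg[i] - pred)^2
})

mean(loocv_errors)  # CV(n)

# Equivalent shortcut:
# library(boot)
# fit <- glm(mpg ~ poly(horsepower, d), data = Auto)  # gaussian glm behaves like lm
# cv.glm(Auto, fit)$delta[1]                          # K defaults to n, i.e., LOOCV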

- Uses n − 1 observations to train each model.
- No sampling effects are introduced, since error is estimated on each sample.

Disadvantages:

- Depending on the models we are trying to fit, it can be very costly to train n models.
- The error estimate for each model is highly variable (since it comes from a single datapoint).

On our running example: [figure]

k-fold Cross-Validation

This discussion leads us to the most commonly used cross-validation approach: k-fold cross-validation.

Procedure: partition observations randomly into k groups (folds). For each of the k groups of observations:

- Train the model on the observations in the other k − 1 folds
- Estimate test-set error (e.g., Mean Squared Error) on this fold

Then compute the average error across the folds:

CV(k) = (1/k) ∑i MSEi

where MSEi is the mean squared error estimated on the i-th fold.
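A sketch of 10-fold CV over polynomial degrees, using boot::cv.glm for brevity; it assumes the boot package and the same Auto data as in the earlier sketches:

library(boot)  # provides cv.glm

set.seed(3)

# 10-fold CV estimate of test MSE for polynomial degrees 1..10
cv_mse <- sapply(1:10, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)  # gaussian glm == lm here
  cv.glm(Auto, fit, K = 10)$delta[1]
})

cv_mse
which.min(cv_mse)  # degree with the lowest estimated test error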

- Fewer models to fit (only k of them).
- Less variance in each of the computed test error estimates in each fold.

It can be shown that there is a slight bias (usually an overestimate) in the error estimate obtained from this procedure.

Running example: [figure]

Cross-Validation in Classification

Each of these procedures can be used for classification as well. In this case we would substitute MSE with a performance metric of choice, e.g., error rate, accuracy, TPR, FPR, or AUROC.

Note, however, that not all of these work with LOOCV (e.g., AUROC, since it can't be defined over single data points).
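For example, a sketch of a 10-fold CV estimate of the error rate for the balance-only logistic model; it assumes boot and the Default data, with a misclassification cost in the style commonly used with cv.glm:

library(boot)

set.seed(4)

# 0/1 response so the cost function can compare observed labels to probabilities
Default$default01 <- ifelse(Default$default == "Yes", 1, 0)

fit <- glm(default01 ~ balance, data = Default, family = binomial)

# Count a mistake whenever the predicted probability falls on the wrong side of 0.5
err_cost <- function(obs, pred_prob) mean(abs(obs - pred_prob) > 0.5)

cv.glm(Default, fit, cost = err_cost, K = 10)$delta[1]  # CV error rate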

Comparing models using cross-validation

Suppose you want to compare two classification models (logistic regression vs. a decision tree) on the Default dataset. We can use cross-validation to determine if one model is better than the other, using a t-test for example.

Using hypothesis testing:

term         estimate   std.error   statistic   p.value
(Intercept)  0.0267     0.0020306   13.148828   0.0000000
methodtree   0.0030     0.0028717    1.044677   0.3099998

In this case, we do not observe any significant difference between these two classification methods.
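The slides do not show the code behind this table, but one way to produce a comparison like it is to compute per-fold error rates for both methods on the same folds and then regress error on method; a decision tree from the rpart package is assumed here purely for illustration:

library(rpart)  # decision tree implementation assumed for this sketch

set.seed(5)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Default)))  # random fold assignment

# Per-fold error rate for each method
fold_errors <- do.call(rbind, lapply(1:k, function(f) {
  train <- Default[folds != f, ]
  test  <- Default[folds == f, ]

  glm_fit  <- glm(default ~ balance, data = train, family = binomial)
  glm_pred <- ifelse(predict(glm_fit, test, type = "response") > 0.5, "Yes", "No")

  tree_fit  <- rpart(default ~ balance, data = train)
  tree_pred <- predict(tree_fit, test, type = "class")

  data.frame(fold = f,
             glm  = mean(glm_pred != test$default),
             tree = mean(as.character(tree_pred) != test$default))
}))

# Long format: one error per (fold, method) pair, then test for a method effect
long <- data.frame(error  = c(fold_errors$glm, fold_errors$tree),
                   method = rep(c("glm", "tree"), each = k))
summary(lm(error ~ method, data = long))  # coefficient "methodtree" as in the table

# A paired t-test over folds is an alternative
t.test(fold_errors$glm, fold_errors$tree, paired = TRUE)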

Summary

- Model selection and assessment are critical steps of data analysis.
- Error and accuracy statistics are not enough to understand classifier performance.
- Classification can be done using probability cutoffs to trade off, e.g., TPR vs. FPR (ROC curve) or precision vs. recall (PR curve).
- The area under the ROC or PR curve summarizes classifier performance across different cutoffs.

- Resampling methods are general tools used for this purpose.
- k-fold cross-validation can be used to provide larger training sets to algorithms while stabilizing empirical estimates of expected prediction error.

