Evaluation of learned models
Kurt Driessens
again with slides stolen from Evgueni Smirnov and Hendrik Blockeel
Overview
• Motivation
• Metrics for Classifier Evaluation
• Methods for Classifier Evaluation & Comparison
• Costs in Data Mining
  – Cost-Sensitive Classification and Learning
  – Lift Charts
  – ROC Curves
Motivation
• It is important to evaluate a classifier's generalization performance in order to:
  – Determine whether to employ the classifier.
    (For example: when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers.)
  – Optimize the classifier.
    (For example: when post-pruning decision trees we must evaluate the accuracy of the decision trees at each pruning step.)
Model's Evaluation in the KDD Process

(Figure: the KDD process – data → selection → target data → preprocessing & cleaning → processed data → transformation & feature selection → transformed data → data mining → patterns → interpretation/evaluation → knowledge.)
How to evaluate the Classifier’s Generalization Performance?
Assume that we test a classifier on some test set and derive at the end the following confusion matrix:

                   Predicted class
                   Pos    Neg
  Actual class  +  TP     FN      (P)
                -  FP     TN      (N)
Metrics for Classifier’s Evaluation
                   Predicted class
                   Pos    Neg
  Actual class  +  TP     FN      (P)
                -  FP     TN      (N)

Accuracy         = (TP + TN) / (P + N)
Error            = (FP + FN) / (P + N)
Precision        = TP / (TP + FP)
Recall / TP rate = TP / P
FP rate          = FP / N
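As a quick illustration of these definitions, the sketch below computes each metric from raw confusion-matrix counts. The function name and the example counts are made up for illustration only.

```python
# Minimal sketch: computing the metrics above from confusion-matrix counts.
def classification_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # actual positives / negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "error": (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall_tp_rate": tp / p,
        "fp_rate": fp / n,
    }

print(classification_metrics(tp=60, fn=40, fp=20, tn=80))
```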
How to Estimate the Metrics?
• We can use:
  – Training data;
  – Independent test data;
  – Hold-out method;
  – k-fold cross-validation method;
  – Leave-one-out method;
  – Bootstrap method;
  – And many more…
Estimation with Training Data
• The accuracy/error estimates on the training data are not good indicators of performance on future data.
  – Q: Why?
  – A: Because new data will probably not be exactly the same as the training data!
• The accuracy/error estimates on the training data measure the degree of the classifier's overfitting.

(Diagram: the same training set is used both to build the classifier and to test it.)
Estimation with Independent Test Data
• Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.
• For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

(Diagram: the training set is used to build the classifier; a separate test set is used to evaluate it.)
Hold-out Method

• The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). Then we build a classifier using the training data and test it using the test data.
• It is typically used when we have thousands of instances, including plenty from each class.

(Diagram: the data is split into a training set, used to build the classifier, and a test set, used to evaluate it.)
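A minimal sketch of the hold-out method using scikit-learn; the dataset and classifier below are arbitrary placeholders, not part of the slides.

```python
# Hold-out evaluation: 2/3 of the data for training, 1/3 for testing.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))
```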
Classification: Train, Validation, Test Split

(Diagram: the data with known results is split into a training set, a validation set and a final test set. The model builder learns a classifier from the training set; the classifier is evaluated on the validation set and tuned accordingly; the final evaluation is done on the final test set.)

The test data can't be used for parameter tuning!
Making the Most of the Data
• Once evaluation is complete, all the data can be used to build the final classifier.
• Generally, the larger the training data the better the classifier (but returns diminish).
• The larger the test data the more accurate the error estimate.
Stratification
• The holdout method reserves a certain amount for testing and uses the remainder for training.
  – Usually: one third for testing, the rest for training.
• For "unbalanced" datasets, samples might not be representative.
  – Few or no instances of some classes.
• Stratified sampling: an advanced version of balancing the data.
  – Make sure that each class is represented with approximately equal proportions in both subsets.
Repeated Holdout Method
• In general, estimates can be made more reliable by repeated sampling:
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
  – The error rates of the different iterations are averaged to yield an overall error rate.
• This is called the repeated holdout method.
Repeated Holdout Method, 2
• Random sampling is not optimal:
  – the different test sets overlap;
  – we would like all instances from the data to be tested at least once.
• Can we prevent overlapping?
k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test sets:
  – First step: the data is split into k subsets of equal size;
  – Second step: each subset in turn is used for testing and the remainder for training.
• The subsets are stratified before the cross-validation.
• The estimates are averaged to yield an overall estimate.

(Diagram, k = 3: in each round a different part of the data is held out as the test set and the classifier is trained on the other two parts.)

      train | train | test
      train | test  | train
      test  | train | train
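A minimal sketch of (stratified) k-fold cross-validation with scikit-learn; the dataset and classifier are placeholders chosen for illustration.

```python
# Stratified 10-fold cross-validation; per-fold accuracies are averaged.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("per-fold accuracy:", np.round(scores, 3))
print("overall estimate:", scores.mean())
```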
More on Cross-Validation
• Standard method for evaluation: stratified 10-fold cross-validation.
• Why 10? Extensive experiments have shown that this is the best choice to get an accurate estimate.
• Stratification reduces the estimate's variance.
• Even better: repeated stratified cross-validation:
  – E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance).
Leave-One-Out Cross-Validation
• Leave-one-out is a particular form of cross-validation:
  – Set the number of folds to the number of training instances;
  – I.e., for n training instances, build the classifier n times.
• Makes best use of the data.
• Involves no random sub-sampling.
• Very computationally expensive.
Leave-One-Out Cross-Validation and Stratification
• A disadvantage of leave-one-out CV is that stratification is not possible:
  – It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example – a random dataset split equally into two classes:
  – The best inducer predicts the majority class;
  – 50% accuracy on fresh data;
  – The leave-one-out CV estimate is 100% error!
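The extreme example can be reproduced in a few lines; below is a sketch with a 50/50 two-class dataset and a majority-class predictor (the setup is made up to mirror the slide, not taken from it).

```python
# Leave-one-out pathology: 50/50 class split and a majority-class predictor.
import numpy as np

y = np.array([0] * 50 + [1] * 50)             # 100 instances, two equally frequent classes

errors = 0
for i in range(len(y)):
    train = np.delete(y, i)                    # leave instance i out
    majority = np.bincount(train).argmax()     # majority class among the 99 remaining labels
    errors += int(majority != y[i])            # the held-out class is always the minority

print("leave-one-out error rate:", errors / len(y))   # prints 1.0, i.e. 100% error
```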
Bootstrap Method
• Cross-validation uses sampling without replacement:
  – The same instance, once selected, can not be selected again for a particular training/test set.
• The bootstrap uses sampling with replacement to form the training set:
  – Sample a dataset of n instances n times with replacement to form a new dataset of n instances;
  – Use this data as the training set;
  – Use the instances from the original dataset that don't occur in the new training set for testing.
Bootstrap Method
• The bootstrap method is also called the 0.632 bootstrap:
  – A particular instance has a probability of 1 - 1/n of not being picked in one draw;
  – Thus its probability of ending up in the test data (never being picked in n draws) is:

      (1 - 1/n)^n ≈ 1/e ≈ 0.368

  – This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.
Estimating Error with the Bootstrap Method
• The error estimate on the test data will be very pessimistic because the classifier is trained on only approx. 63% of the instances.
  – Therefore, combine it with the training error:

      err = 0.632 · e_test instances + 0.368 · e_training instances

  – The training error gets less weight than the error on the test data.
  – Repeat the process several times with different replacement samples; average the results.
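A sketch of one round of the 0.632 bootstrap is given below; the dataset, classifier and variable names are placeholders chosen for illustration.

```python
# One bootstrap round: sample n instances with replacement for training,
# use the unused instances for testing, then combine the two errors.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n = len(y)
rng = np.random.default_rng(0)

train_idx = rng.integers(0, n, size=n)                 # sampling with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)       # out-of-sample instances

clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
e_train = 1 - clf.score(X[train_idx], y[train_idx])
e_test = 1 - clf.score(X[test_idx], y[test_idx])

err = 0.632 * e_test + 0.368 * e_train                 # 0.632 bootstrap estimate
print("distinct instances in training set:", len(np.unique(train_idx)) / n)
print("0.632 bootstrap error estimate:", err)
```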
Confidence Intervals for Performance
• Assume that the error error_S(h) of the classifier h, estimated by 10-fold cross-validation, is 25%.
• How close is the estimated error error_S(h) to the true error error_D(h)?
Confidence intervals (2)
• If the test data contain n examples, drawn independently of each other, with n ≥ 30,
• then with approximately N% probability, error_D(h) lies in the interval

      error_S(h) ± z_N · sqrt( error_S(h) · (1 - error_S(h)) / n )

  where z_N is given by:

      N%:   50%   68%   80%   90%   95%   98%   99%
      z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
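A sketch of the interval computation follows; the error estimate and test-set size are illustrative numbers only.

```python
# 95% confidence interval for the true error, following the formula above.
from math import sqrt

error_s = 0.25   # estimated error of hypothesis h
n = 100          # number of test examples (should be >= 30 for the approximation)
z_95 = 1.96      # z_N for N = 95%

half_width = z_95 * sqrt(error_s * (1 - error_s) / n)
print(f"95% CI: [{error_s - half_width:.3f}, {error_s + half_width:.3f}]")
```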
Comparison of hypotheses
• Given two hypotheses, which one has the lower true error?
• Statistical hypothesis test:
  – claim that both are equally good;
  – if the claim is rejected, accept that one is better than the other.
• 2 cases:
  – compare 2 hypotheses on (possibly) different test sets;
  – compare 2 hypotheses on the same test set.
Different Test Sets
• Let p'_1 and p'_2 be the estimated error rates of the two hypotheses on independent test sets of sizes n_1 and n_2.
• Reject the claim that both are equally good (at confidence level z) if:

      |p'_1 - p'_2| > z · sqrt( p'_1(1 - p'_1)/n_1 + p'_2(1 - p'_2)/n_2 )
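A sketch of this comparison on different test sets; the error rates and test-set sizes below are illustrative placeholders.

```python
# Compare two hypotheses evaluated on independent test sets.
from math import sqrt

p1, n1 = 0.15, 200   # error rate of h1 and size of its test set
p2, n2 = 0.22, 250   # error rate of h2 and size of its test set
z_95 = 1.96          # threshold for ~95% confidence

se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
if abs(p1 - p2) > z_95 * se:
    print("difference is significant at ~95% confidence")
else:
    print("cannot reject that h1 and h2 are equally good")
```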
Same Test Set
• When comparing hypotheses on the same data set, a more powerful procedure is possible:
  – it uses more information from the test;
  – the possible influence of easy/difficult examples is removed.
• A more informative method:
  – for each single example, compare h1 and h2;
  – how often was h1 correct and h2 wrong on the same example, vs. the other way around?
  – McNemar's test.
McNemar's Test

• Consider the table:

                    h1 correct   h1 wrong
      h2 correct        A            B
      h2 wrong          C            D

• If h1 is equally good as h2:
  – for each instance where h1 and h2 differ, the probability is 0.5 that either is correct;
  – hence we expect B ≈ C ≈ (B + C)/2;
  – B and C follow a binomial (approximately normal) distribution.
• Reject equality if B deviates too much from (B + C)/2.

Example comparison

• Consider the table below:

                    h1 correct   h1 wrong
      h2 correct        45           10
      h2 wrong           0           45

• Method with independent test sets:
  – 55-45 in favour of h2 (out of 100);
  – not very convincing.
• Method with the same test set:
  – much more convincing: 10-0 in favour of h2.
• h2 is clearly better than h1 – this might not be discovered using the "conservative" comparison.
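Below is a sketch of McNemar's test on the example table above, using the common continuity-corrected chi-square statistic; this is one standard variant, not necessarily the exact version intended in the slides.

```python
# McNemar's test on the disagreement counts B and C from the example table.
from scipy.stats import chi2

B = 10   # h2 correct, h1 wrong
C = 0    # h1 correct, h2 wrong

statistic = (abs(B - C) - 1) ** 2 / (B + C)   # continuity-corrected statistic
p_value = chi2.sf(statistic, df=1)
print(f"McNemar statistic = {statistic:.2f}, p-value = {p_value:.4f}")
```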
Metric Evaluation Summary

1. Use test sets and the hold-out method for "large" data.
2. Use the cross-validation method for "middle-sized" data.
3. Use the leave-one-out and bootstrap methods for small data.

• Don't use test data for parameter tuning – use separate validation data.
• Comparing two classifiers to each other can use more advanced statistics: t-test, McNemar, …
Drawbacks of Accuracy
• Evaluation based on accuracy is not always appropriate.
• Shortcomings:
  – can sometimes be misleading;
  – unstable when the class distribution may change;
  – assumes symmetric misclassification costs.
1: Accuracy can be misleading
• Assume all examples are negative, except for a small blue region of positives.
• Which of these classifiers is best?

(Figure: two classifiers on this dataset.
 Classifier 1: IF false THEN pos – 96% correct.
 Classifier 2: IF green area THEN pos – 92% correct.)
• An alternative measure: correlation
  – e.g., correlation = (ad - bc) / sqrt(Tpos · Tneg · T+ · T-)
  – close to 1: high correlation between predictions and classes;
  – close to 0: no correlation;
  – (close to -1: predicting the opposite).
  – Avoids the unintuitive results just mentioned.

      prediction \ actual     +      -      Sum
      Pos                     a      b      Tpos
      Neg                     c      d      Tneg
      Sum                     T+     T-     T

(Note: +/- are actual values, pos/neg are predictions.)
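The sketch below computes this correlation measure (essentially the Matthews correlation coefficient) from confusion-matrix counts; the counts are illustrative, not from the slides.

```python
# Correlation between predictions and actual classes from counts a, b, c, d.
from math import sqrt

a, b, c, d = 60, 20, 40, 80          # pos/+, pos/-, neg/+, neg/-
t_pos, t_neg = a + b, c + d          # row sums (predictions)
t_plus, t_minus = a + c, b + d       # column sums (actual classes)

correlation = (a * d - b * c) / sqrt(t_pos * t_neg * t_plus * t_minus)
print("correlation:", round(correlation, 3))
```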
2: Accuracy is sensitive to class distributions
• If the class distribution in the test set differs from that in the training set, the accuracy will also differ.
• E.g., suppose a classifier has a TP rate of 0.8 and a TN rate of 0.6:
  – Tested on a test set with T+/T = 0.5 and T-/T = 0.5:
    Acc = 0.8 · 0.5 + 0.6 · 0.5 = 0.7
  – Employed in an environment with T+/T = 0.3 and T-/T = 0.7:
    Acc = 0.8 · 0.3 + 0.6 · 0.7 = 0.66
3: Accuracy ignores misclassification costs
• Accuracy ignores the possibility of different misclassification costs:
  – sometimes, incorrectly predicting "pos" costs more/less than incorrectly predicting "neg".
• E.g.:
  – not treating an ill patient vs. treating a healthy patient;
  – refusing credit to a client who would have paid back vs. assigning credit to a client who won't pay back.
• We need to distinguish the probabilities of making the different types of errors.
Misclassification Costs

• Solution: distinguish "predictive accuracy" for the different classes.
  – Acc: probability that some instance is classified correctly.
  – Decomposed into:
    • TP: "true positive" rate, the (estimated) probability that a positive instance is classified correctly;
    • TN: "true negative" rate, the (estimated) probability that a negative instance is classified correctly.
  – We also define:
    • FP = 1 - TN: "false positive" rate, the estimated probability that a negative is classified as positive;
    • analogously, FN = 1 - TP.
Misclassification Costs (2)
• Consider costs C_FP and C_FN = the cost of a false positive resp. a false negative.
• Expected cost of a single prediction:
  – C = C_FP · P(pos|-) · P(-) + C_FN · P(neg|+) · P(+)
  – estimated by C = C_FP · FP · T-/T + C_FN · FN · T+/T
• Note:
  – Acc is a weighted average of TP and TN:
    Acc = TP · T+/T + TN · T-/T
  – C is not computable from Acc alone.
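A small sketch of the expected-cost estimate above; the function name, rates and costs are illustrative placeholders.

```python
# Expected misclassification cost: C = C_FP * FP * T-/T + C_FN * FN * T+/T
def expected_cost(fp_rate, fn_rate, c_fp, c_fn, frac_neg, frac_pos):
    return c_fp * fp_rate * frac_neg + c_fn * fn_rate * frac_pos

# A classifier with TP = 0.8 (so FN = 0.2) and TN = 0.6 (so FP = 0.4),
# where a false negative is ten times as costly as a false positive.
print(expected_cost(fp_rate=0.4, fn_rate=0.2, c_fp=1, c_fn=10,
                    frac_neg=0.7, frac_pos=0.3))
```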
Cost Sensitive Learning
• Simple methods for cost-sensitive learning:
  – resampling of instances according to costs;
  – weighting of instances according to costs.
• In Weka, cost-sensitive classification and learning can be applied to any classifier using the meta scheme CostSensitiveClassifier.
Lift Charts

• In practice, decisions are usually made by comparing possible scenarios, taking into account different costs.
• E.g., a promotional mailout to 1,000,000 households:
  – If we mail to all households, we get a 0.1% response (1,000 respondents).
  – A data mining tool identifies (a) a subset of 100,000 households with a 0.4% response (400); or (b) a subset of 400,000 households with a 0.2% response (800).
  – Depending on the costs, we can make the final decision using lift charts!
  – A lift chart allows a visual comparison.
Generating a Lift Chart

• Instances are sorted according to their predicted probability of being a true positive:

      Rank   Predicted probability   Actual class
      1      0.95                    Pos
      2      0.93                    Pos
      3      0.93                    Neg
      4      0.88                    Pos
      …      …                       …

• In the lift chart, the x axis is the sample size and the y axis is the number of true positives.
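A sketch of how the lift-chart points can be computed from predicted probabilities and actual classes; the toy data below is made up for illustration.

```python
# Lift chart points: sort by decreasing probability, accumulate true positives.
import numpy as np

probs  = np.array([0.95, 0.93, 0.93, 0.88, 0.80, 0.55, 0.40, 0.30])
actual = np.array([1, 1, 0, 1, 0, 1, 0, 0])     # 1 = Pos, 0 = Neg

order = np.argsort(-probs)                      # decreasing predicted probability
cum_true_pos = np.cumsum(actual[order])         # y axis: number of true positives
sample_size = np.arange(1, len(probs) + 1)      # x axis: sample size

for x, y in zip(sample_size, cum_true_pos):
    print(f"sample size {x}: {y} true positives")
```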
Hypothetical Lift Chart
ROC diagrams
• ROC = "Receiver Operating Characteristic".
• Allows us to see:
  – how well a classifier will perform given certain misclassification costs and class distribution;
  – in which environments one classifier is better than another.
• Explicitly aims at solving problems 2 and 3 mentioned before.
ROC diagram (2)

• A ROC diagram plots the TP rate versus the FP rate.
• From the confusion matrix:
  – TP rate = a/(a + c) = a/T+
  – FP rate = b/(b + d) = b/T-

      prediction \ actual     +      -      Sum
      Pos                     a      b      Tpos
      Neg                     c      d      Tneg
      Sum                     T+     T-     T
Classifier in a ROC diagram

• 1 classifier = 1 point in the ROC diagram.

(Figure: ROC diagram with FP rate on the x axis and TP rate on the y axis, both from 0 to 1. The diagonal corresponds to random prediction. The top-left corner (0, 1) is perfect prediction: no negatives returned as positives, no positives forgotten. The trivial rule "if false then pos" sits at (0, 0); the trivial rule "if true then pos" sits at (1, 1). Three example classifiers are plotted from the confusion matrices below.)

      True \ Predicted    pos   neg
      +                   60    40
      -                   20    80

      True \ Predicted    pos   neg
      +                   80    20
      -                   50    50

      True \ Predicted    pos   neg
      +                   40    60
      -                   30    70
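The sketch below turns the three confusion matrices above into ROC points; the classifier names are made up for illustration.

```python
# Compute (TP rate, FP rate) points for the three example confusion matrices.
def roc_point(tp, fn, fp, tn):
    return tp / (tp + fn), fp / (fp + tn)   # (TP rate, FP rate)

matrices = {
    "classifier 1": (60, 40, 20, 80),
    "classifier 2": (80, 20, 50, 50),
    "classifier 3": (40, 60, 30, 70),
}
for name, (tp, fn, fp, tn) in matrices.items():
    tpr, fpr = roc_point(tp, fn, fp, tn)
    print(f"{name}: TP rate = {tpr:.2f}, FP rate = {fpr:.2f}")
```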
Dominance in the ROC Space
Classifier A dominates classifier B if and only if TPr_A > TPr_B and FPr_A < FPr_B.
ROC Convex Hull (ROCCH)
• The ROCCH is determined by the dominant classifiers.
• Classifiers below the ROCCH are always sub-optimal.
• Any point on the line segment connecting two classifiers can be achieved by randomly choosing between them.
• The classifiers on the ROCCH can be combined to form a hybrid.
Rank classifiers
• Rank classifiers yield a ROC curve:
  – each specific threshold = 1 point on that curve.

(Figure: ROC diagram – FP rate on the x axis, TP rate on the y axis – showing the curve of a ranker together with fixed classifiers. Annotations: a ranker with a high threshold is worse than "Red"; a ranker with a low threshold is better than "Blue".)
ROC for one Classifier
(Figure: four example ROC curves.)
• Good separation between the classes: convex curve.
• Reasonable separation between the classes: mostly convex.
• Fairly poor separation between the classes: mostly convex.
• Poor separation between the classes: large and small concavities. Random performance.
The AUC Metric
• The area under the ROC curve (AUC) assesses the ranking in terms of the separation of the classes.
• The AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
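The sketch below checks this pairwise interpretation of the AUC against scikit-learn on toy scores; the data is made up for illustration.

```python
# AUC as the probability that a random positive is ranked above a random negative.
import numpy as np
from sklearn.metrics import roc_auc_score

y      = np.array([1, 1, 0, 1, 0, 1, 0, 0])
scores = np.array([0.95, 0.93, 0.93, 0.88, 0.80, 0.55, 0.40, 0.30])

pos, neg = scores[y == 1], scores[y == 0]
# Count positive/negative pairs where the positive is ranked higher (ties count 0.5).
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print("pairwise estimate:", np.mean(pairs))
print("sklearn roc_auc_score:", roc_auc_score(y, scores))
```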
Note
• To generate ROC curves or lift charts we need to use some of the evaluation methods considered in this lecture.
• ROC curves and lift charts can be used for internal optimization of classifiers.
Costs in ROC diagram
• Given misclassification costs:
  – C_FP: cost of a false positive;
  – C_FN: cost of a false negative (an undetected "+").
• The average cost is:
  – C = C_FP · FP · T-/T + C_FN · (1 - TP) · T+/T
• Lines of equal cost can be drawn in the ROC diagram (straight lines).
  – Slope of such a line: (C_FP · T-/T) / (C_FN · T+/T)
(Figure: ROC diagram – FP rate on the x axis, TP rate on the y axis – with iso-cost lines of increasing cost. Annotations: with a high cost of false positives, "Red" is better; with a low cost of false positives, the ranker with a low threshold is better; "Blue" and "Green" are never better than the ranker or Red.)
Iso-Accuracy Lines
• Remember: accuracy is a weighted average of TP and TN:
  – Acc = TP · T+/T + TN · T-/T = TP · T+/T + (1 - FP) · T-/T
• For a fixed accuracy this gives a straight line in the ROC diagram:
  – TP = (N/P) · FP + constant
• Higher iso-accuracy lines are better.
Example

• For a uniform class distribution, C4.5 is optimal and achieves about 82% accuracy.
• With 4 times as many positives as negatives, SVM is optimal and achieves about 84% accuracy.
• With 4 times as many negatives as positives, CN2 is optimal and achieves about 86% accuracy.
Summary
• Metrics for Classifier Evaluation
• Methods for Classifier Evaluation & Comparison
• Costs in Data Mining
  – Cost-Sensitive Classification and Learning
  – Lift Charts
  – ROC Curves
Evaluation of regression models
• Predicting numbers: no "right or wrong" approach.
• Possible measures:
  – Sum of squared errors (SSE):
    • an absolute measure.
  – Relative error (RE): measures the improvement over a trivial model:
    • RE = SSE(hypothesis) / SSE(trivial hypothesis);
    • trivial hypothesis: e.g. always predict the mean;
    • RE is normally between 0 and 1.
  – Spearman correlation r:
    • measures how well predictions and actual values correlate;
    • less sensitive to the actual errors.
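A short sketch of the three regression measures above on made-up predictions; the numbers are placeholders for illustration only.

```python
# SSE, relative error and Spearman correlation on toy regression predictions.
import numpy as np
from scipy.stats import spearmanr

actual    = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
predicted = np.array([2.8, 5.5, 3.0, 6.0, 4.0])

sse = np.sum((actual - predicted) ** 2)                 # sum of squared errors
sse_trivial = np.sum((actual - actual.mean()) ** 2)     # trivial model: predict the mean
re = sse / sse_trivial                                  # relative error
rho, _ = spearmanr(predicted, actual)                   # Spearman correlation

print(f"SSE = {sse:.3f}, RE = {re:.3f}, Spearman r = {rho:.3f}")
```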