thoughts on model assessment
![Page 1: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/1.jpg)
THOUGHTS ON MODEL ASSESSMENT
Porco, DAIDD, December 2012
![Page 2: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/2.jpg)
Assessment
Some models are so wrong as to be useless.
Not really possible to argue that a model is “true”
Treacherous concept of “validation”: has a model that has been “validated” been shown to be “valid”?
Philosophy of science
![Page 3: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/3.jpg)
Face Validity
Knowledge representation
Analogy
Does the model represent in some approximate sense facts about the system?
![Page 4: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/4.jpg)
Verification
Does your simulation actually simulate what you think it simulates?
How do you know your program is not generating nonsense?
Unit testing
Boundary cases
“Test harnesses”
Formal software testing techniques
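A minimal sketch of what unit tests, boundary cases, and a test harness might look like in R. The one-step SIR update `sir_step` and the harness `run_sir_tests` are hypothetical names invented for this illustration; they do not come from the slides.

```r
# Hypothetical one-step update for a discrete-time SIR model (illustration only).
sir_step <- function(state, beta, gamma) {
  S <- state[["S"]]; I <- state[["I"]]; R <- state[["R"]]
  N <- S + I + R
  new_inf <- if (N > 0) beta * S * I / N else 0   # new infections this step
  new_rec <- gamma * I                            # new recoveries this step
  c(S = S - new_inf, I = I + new_inf - new_rec, R = R + new_rec)
}

# A tiny test harness: every check stops with an error if it fails.
run_sir_tests <- function() {
  s0 <- c(S = 99, I = 1, R = 0)
  s1 <- sir_step(s0, beta = 0.5, gamma = 0.1)
  # Unit test: the total population must be conserved.
  stopifnot(isTRUE(all.equal(sum(s1), sum(s0))))
  # Boundary case: with no infectives, nothing should change.
  s_noinf <- c(S = 100, I = 0, R = 0)
  stopifnot(isTRUE(all.equal(sir_step(s_noinf, 0.5, 0.1), s_noinf)))
  # Boundary case: with zero transmission, nobody leaves S.
  s2 <- sir_step(s0, beta = 0, gamma = 0.1)
  stopifnot(isTRUE(all.equal(s2[["S"]], 99)))
  invisible(TRUE)
}

run_sir_tests()
```

The checks encode facts that must hold whatever the parameter values, so a failure points to a programming error rather than to a modeling choice.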
![Page 5: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/5.jpg)
Thought
“…the gold standard of model performance is predictive power on data that wasn’t used to fit your model…”
--Conway and White, Machine Learning for Hackers
![Page 6: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/6.jpg)
Prediction
Not everything is predictable
Predict individuals? Communities? Counterfactuals?
Does the model somehow correspond to reality?
![Page 7: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/7.jpg)
California TB
![Page 8: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/8.jpg)
Probabilistic forecasting
Weather
Forecast expressed as a probability
Correctly quantify uncertainty
Uncertainty is needed to make the best use of a forecast
If you don’t know, don’t say you do
Not bet-hedging, but honesty
![Page 9: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/9.jpg)
Two approaches
Assessment of probabilistic forecasts
Brier Score
Information Score
![Page 10: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/10.jpg)
Assessing simple forecasts
Binary event: 0 for no, 1 for yes
Forecast: p is the probability it occurred
If p=1, you forecast it with certainty
If p=0, you forecast that it was certain not to have occurred
If 0<p<1, you were somewhere in the middle
![Page 11: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/11.jpg)
Brier score
![Page 12: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/12.jpg)
Brier score
Squared error of a probabilistic forecast
If the event occurred: BS = (p-1)^2 + ((1-p)-0)^2
BS = 2(1-p)^2
Brier score has a negative orientation (like in golf, smaller is better)
Some authors do not use the factor of 2
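The two-term form and the factor-of-2 shortcut can be checked against each other numerically. This is a small sketch; the function names `brier_two_category` and `brier_short` are made up for the illustration.

```r
# Brier score for one binary forecast, summed over both outcome categories.
# p: forecast probability that the event occurs; outcome: 1 if it occurred, 0 if not.
brier_two_category <- function(p, outcome) {
  (p - outcome)^2 + ((1 - p) - (1 - outcome))^2
}

# Equivalent shortcut using the factor of 2.
brier_short <- function(p, outcome) 2 * (p - outcome)^2

p <- seq(0, 1, by = 0.1)
all.equal(brier_two_category(p, 1), brier_short(p, 1))   # event occurred
all.equal(brier_two_category(p, 0), brier_short(p, 0))   # event did not occur
```

Authors who omit the factor of 2 are in effect scoring only the first of the two squared terms.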
![Page 13: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/13.jpg)
Brier score
Example. Suppose I compute a forecast that trachoma will be eliminated in village A after two years, as measured by pooled PCR for DNA in 50 randomly selected children. Suppose I say the elimination probability is 0.8.
If trachoma is in fact eliminated, the Brier score for this is (0.8-1)^2 + (0.2-0)^2 = [you work it out]
![Page 14: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/14.jpg)
Brier score
What is the smallest possible Brier score?
What is the largest possible Brier score?
![Page 15: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/15.jpg)
Brier score
Suppose we now forecast elimination in five villages. We can compute an overall Brier score by just adding up separate scores for each village.

| Elimination Prob. | Elimination | Score |
|---|---|---|
| 0.8 | 1 | 0.08 |
| 0.1 | 0 | 0.02 |
| 0.5 | 1 | 0.5 |
| 0.4 | 1 | 0.72 |
| 0.8 | 0 | 1.28 |
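The five-village scores can be recomputed in a few lines of R. This is a sketch using the factor-of-2 convention; the variable names are chosen for the illustration.

```r
prob <- c(0.8, 0.1, 0.5, 0.4, 0.8)   # forecast elimination probabilities
elim <- c(1, 0, 1, 1, 0)             # observed: 1 = eliminated, 0 = not eliminated

score <- 2 * (prob - elim)^2         # per-village Brier score (factor of 2 included)
print(data.frame(prob, elim, score))

total <- sum(score)                  # overall Brier score: sum over villages
total
```

The per-village values match the Score column, and they add up to an overall Brier score of 2.6.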
![Page 16: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/16.jpg)
Murphy decomposition
Classical form applies to binary predictions
Predictions are probabilities
Finite set of possible probabilities
Example: chance of a meteorological event happening might be predicted as 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100%.
![Page 17: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/17.jpg)
Reliability (sensu Murphy)
Similar to Brier score – less is better
Looking at all identical forecasts, how similar is what really happened to what was forecast?
![Page 18: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/18.jpg)
Examples

| Prediction | Result | Prediction | Result |
|---|---|---|---|
| 0.2 | 0 | 0.8 | 1 |
| 0.2 | 1 | 0.8 | 0 |
| 0.2 | 0 | 0.8 | 0 |
| 0.2 | 0 | 0.8 | 0 |
| 0.2 | 0 | 0.8 | 1 |
| 0.2 | 1 | 0.8 | 1 |
![Page 19: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/19.jpg)
Example
Here, we get a total Brier score of 6.96.
We have to add up (0.2-0)^2 + (0.2-1)^2 + … (and not forget the factor of two).
![Page 20: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/20.jpg)
Some terminology
Forecast levels: the different values possible for the forecast. Notation: f_k for forecast level k
Observed means at the forecast levels: the average number of times the event happened, stratified by forecast level. Notation: ō_k
We have two forecast levels in the example, 0.2 and 0.8. For forecast level 0.2, we have an observed mean of 2/6. For the forecast level 0.8, we have an observed mean of 3/6.
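One way to obtain the observed means at each forecast level is base R's `tapply`, sketched here on the twelve forecasts of the example. The names `ds` and `fs` follow the deck's later script; `obs_mean_by_level` and `n_by_level` are made up for the illustration.

```r
ds <- c(0, 1, 0, 0, 0, 1,  1, 0, 0, 0, 1, 1)   # outcomes (1 = event happened)
fs <- c(rep(0.2, 6), rep(0.8, 6))              # forecast attached to each outcome

# Observed mean of the outcomes, stratified by forecast level.
obs_mean_by_level <- tapply(ds, fs, mean)
obs_mean_by_level                              # 2/6 at level 0.2, 3/6 at level 0.8

# Number of forecasts issued at each level.
n_by_level <- table(fs)
n_by_level
```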
![Page 21: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/21.jpg)
Examples
We had 2/6 that were YES when we gave a 20% chance. Let’s compare the prediction to the successes just for the 20% predictions. For each one we have the same thing: (0.2-2/6)^2 + (0.8-4/6)^2, which is about 0.0356.
We had 3/6 that were YES when we gave an 80% chance. We do a similar computation and we get (0.8-3/6)^2 + (0.2-3/6)^2 = 0.18.
![Page 22: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/22.jpg)
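Reliability
So the total reliability score is computed by weighting the reliability at each forecast level by the number of forecasts at that level, and adding up:

| Prediction | Reliability | Repeats | Total |
|---|---|---|---|
| 0.2 | 0.0356 | 6 | 0.2133 |
| 0.8 | 0.18 | 6 | 1.08 |

This yields a total score of approximately 1.2933 for the reliability component REL.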
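The reliability arithmetic can be reproduced directly from the forecast levels, observed means, and counts. This is a sketch with variable names chosen for the illustration, using the factor-of-2 convention.

```r
f_k    <- c(0.2, 0.8)    # forecast levels
obar_k <- c(2/6, 3/6)    # observed mean at each forecast level
n_k    <- c(6, 6)        # number of forecasts issued at each level

rel_each  <- 2 * (f_k - obar_k)^2    # per-forecast reliability at each level
rel_level <- n_k * rel_each          # contribution of each level
rel_total <- sum(rel_level)          # reliability component REL

round(rel_each, 4)     # 0.0356 0.1800
round(rel_level, 4)    # 0.2133 1.0800
round(rel_total, 4)    # 1.2933
```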
![Page 23: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/23.jpg)
Resolution
When we said different things would happen, how different really were they?
We’re going to contrast the observed means at different forecast levels.
Specifically, we want the variance of the distribution of the observed means at different forecast levels.
![Page 24: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/24.jpg)
Example
Using the simple example: the overall mean is 5/12. Then compute 2*(2/6-5/12)^2 for every observation in the first group, and 2*(3/6-5/12)^2 for every observation in the second group.
The total resolution component RES is 6*2*(2/6-5/12)^2 + 6*2*(3/6-5/12)^2 = 1/6.
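A short sketch of the resolution computation with the same conventions (factor of 2, sum over all twelve forecasts). The variable names are chosen for the illustration.

```r
obar_k <- c(2/6, 3/6)    # observed mean at forecast levels 0.2 and 0.8
n_k    <- c(6, 6)        # number of forecasts at each level
obar   <- sum(n_k * obar_k) / sum(n_k)   # overall observed mean, 5/12

res_each  <- 2 * (obar_k - obar)^2       # per-forecast resolution at each level
res_total <- sum(n_k * res_each)         # resolution component RES
res_total                                # 1/6, about 0.1667
```

Both levels sit 1/12 away from the overall mean, so each contributes 6*2*(1/12)^2 = 1/12.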
![Page 25: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/25.jpg)
Uncertainty
In the classical Murphy formula, this is computed by calculating the mean observation times one minus the mean, and adding this up for each observation (times two here, to match our Brier score convention).
For our example, the uncertainty is 5.8333 or so.
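The uncertainty term depends only on the outcomes, not on the forecasts. A sketch with the factor of 2 carried along:

```r
ds <- c(0, 1, 0, 0, 0, 1,  1, 0, 0, 0, 1, 1)   # the twelve outcomes of the example

n    <- length(ds)
obar <- mean(ds)                     # overall observed mean, 5/12

unc_each <- 2 * obar * (1 - obar)    # identical for every observation
unc      <- n * unc_each             # uncertainty component UNC
unc                                  # 35/6, about 5.8333
```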
![Page 26: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/26.jpg)
Murphy decomposition
Murphy (1973) showed that the Brier score can be decomposed as follows: BS = REL - RES + UNC
N.B. the negative sign in front of resolution
High uncertainty contributes to high Brier score (all else being equal)
High discrepancy of the observed means at each level from the forecasts raises the Brier score
But having those observed means at each level separate from each other lowers the Brier score
![Page 27: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/27.jpg)
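Script

    library(plyr)  # provides ddply() and .()

    # Per-forecast Brier score (with the factor of 2)
    brier.component <- function(forecast, observation) {
      2 * (forecast - observation)^2
    }

    # Per-forecast reliability: forecast level vs. observed mean at that level
    reliability <- function(fk, ok) {
      2 * (fk - ok)^2
    }

    # Per-forecast resolution: observed mean at the level vs. overall mean
    resolution <- function(obar, ok) {
      2 * (ok - obar)^2
    }

    # For each forecast, look up the observed mean at its forecast level,
    # returned in the original order given by key
    gen.obark <- function(predictions, outcome, key) {
      tmp <- data.frame(key = key, outcome = outcome, pred = predictions)
      f1 <- merge(tmp,
                  ddply(tmp, .(pred), function(x) { mean(x$outcome) }),
                  by = "pred")
      f1 <- f1[order(f1$key), ]
      f1$V1
    }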
![Page 28: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/28.jpg)
Example

    ds <- c(0,1,0,0,0,1,1,0,0,0,1,1)
    fs <- c(rep(0.2,6), rep(0.8,6))
    obark <- gen.obark(fs, ds, 1:12)
    rel <- sum(reliability(fs, obark))
    res <- sum(resolution(mean(ds), obark))
    unc <- 12*2*(mean(ds)*(1-mean(ds)))
    brs <- sum(brier.component(fs, ds))
    (rel-res+unc)-brs
    [1] -1.776357e-15
![Page 29: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/29.jpg)
General form
B: Brier score; K, number of forecast levels; n_k the number of forecasts at level k
Other quantities as defined earlier
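The slide's own equation did not survive as text. One form that is consistent with the deck's script and worked example (summing over forecasts rather than averaging, and keeping the factor of two) is sketched below. Here N, the total number of forecasts, is a symbol introduced for this sketch; many references divide through by N and drop the factor of two.

```latex
B \;=\; \underbrace{\sum_{k=1}^{K} n_k \, 2\,(f_k - \bar{o}_k)^2}_{\text{REL}}
  \;-\; \underbrace{\sum_{k=1}^{K} n_k \, 2\,(\bar{o}_k - \bar{o})^2}_{\text{RES}}
  \;+\; \underbrace{2N\,\bar{o}\,(1 - \bar{o})}_{\text{UNC}},
\qquad N = \sum_{k=1}^{K} n_k
```

For the twelve-forecast example this gives 1.2933 - 0.1667 + 5.8333 = 6.96, the total Brier score found earlier.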
![Page 30: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/30.jpg)
More general forms
More general decompositions (Stephenson, five terms)
Continuous forms
![Page 31: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/31.jpg)
Alternatives to Brier score
Decomposition into Reliability, Resolution, Uncertainty works for information measures (Weijs et al)
![Page 32: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/32.jpg)
Information measures
Surprise
Expected surprise
Surprising a model
![Page 33: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/33.jpg)
Surprisal
How much do you learn if two equally likely alternatives are disclosed to you?
Heads vs tails; 1 vs 2
![Page 34: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/34.jpg)
Surprisal
How much do you learn if 3 are disclosed? A, B, or C
How much if I disclose one of 26 equally likely outcomes?
You should have learned more
![Page 35: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/35.jpg)
Surprisal
Standard example:
Now combine two independent things. If we had 1 vs 2, and A, B, or C: A/1 C/2 B/1 B/2 …
![Page 36: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/36.jpg)
Surprisal
Six equally likely things
s(2 x 3) = s(2) + s(3)
Uniquely useful way to do this: define surprisal as log(1/p), log of the reciprocal probability
Stevens information tutorial online
![Page 37: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/37.jpg)
Surprisal
Different bases can be used for the logarithm
Log base 2 is traditional
How much information is transmitted if you learn which of two equally likely outcomes happened? Log2(1/(1/2)) = 1
If we use base two, then the value is 1, referred to as 1 bit.
Disclosure of one of two equally likely outcomes reveals one bit of information.
Notation: log base 2 often written lg
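A minimal sketch of surprisal in R, showing the one-bit case and the additivity property s(2 x 3) = s(2) + s(3). The function name `surprisal` and its `base` argument are choices made for this illustration.

```r
# Surprisal of an outcome with probability p, in units set by the log base
# (base 2 gives bits).
surprisal <- function(p, base = 2) log(1 / p, base = base)

surprisal(1/2)     # one of two equally likely outcomes: 1 bit
surprisal(1/3)     # one of three equally likely outcomes
surprisal(1/26)    # one of 26 equally likely outcomes: about 4.70 bits

# Two independent disclosures: 2 x 3 = 6 equally likely combinations.
all.equal(surprisal(1/6), surprisal(1/2) + surprisal(1/3))
```

Changing the base only rescales the answer; for example, `surprisal(1/2, base = exp(1))` gives the same quantity in natural-log units.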
![Page 38: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/38.jpg)
Example
Sequence of values of flips of a bent coin with P(heads)=1/3, P(tails)=2/3

| Observed | Probability | 1/Probability | Surprisal |
|---|---|---|---|
| 0 | 2/3 | 1.5 | 0.585 |
| 0 | 2/3 | 1.5 | 0.585 |
| 1 | 1/3 | 3 | 1.585 |
| 0 | 2/3 | 1.5 | 0.585 |
| 1 | 1/3 | 3 | 1.585 |
| 1 | 1/3 | 3 | 1.585 |

The average surprisal was 1.085 or so.
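The bent-coin table can be recomputed as follows. This is a sketch that assumes 1 codes heads and 0 codes tails, as the probabilities in the table imply; the variable names are chosen for the illustration.

```r
flips <- c(0, 0, 1, 0, 1, 1)                  # 1 = heads, 0 = tails
prob  <- ifelse(flips == 1, 1/3, 2/3)         # probability of what was observed

surprisal_bits <- log2(1 / prob)              # surprisal of each flip, in bits
print(data.frame(observed = flips,
                 probability = prob,
                 reciprocal = 1 / prob,
                 surprisal = round(surprisal_bits, 3)))

avg_surprisal <- mean(surprisal_bits)
avg_surprisal                                 # about 1.085
```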
![Page 39: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/39.jpg)
Expected surprisal
Every time an event with probability p_i happens, the surprisal is lg(1/p_i).
What is the expected surprisal?
![Page 40: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/40.jpg)
Expected value
![Page 41: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/41.jpg)
Expected square
![Page 42: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/42.jpg)
Expected surprisal
Shannon entropy
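The equations on the three preceding slides were images and are not recoverable from the text. As a sketch, the standard definitions they most plausibly refer to are given below for a discrete random variable X that takes value x_i with probability p_i; the symbols X and x_i are assumptions introduced here.

```latex
E[X] = \sum_i p_i \, x_i ,
\qquad
E[X^2] = \sum_i p_i \, x_i^2 ,
\qquad
H \;=\; E\!\left[\lg \frac{1}{p}\right]
  \;=\; \sum_i p_i \,\lg \frac{1}{p_i}
  \;=\; -\sum_i p_i \,\lg p_i
```

The last expression is the expected surprisal, which is the Shannon entropy, measured in bits when lg is the base-2 logarithm.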
![Page 43: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/43.jpg)
Shannon entropy
Shannon entropy can be used to quantify the amount of information provided by a model for a categorical outcome, in much the same way that the squared multiple correlation coefficient can quantify the amount of variance explained by a continuous model
Use of entropy and related measures to assess probabilistic predictions is sometimes recommended (Weijs)
![Page 44: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/44.jpg)
Estimated entropy
If you have a Bernoulli trial, what is the entropy? The success probability is p.
H = -(1-p) lg(1-p) - p lg(p)
If we estimate p from data and plug this in, we get an estimator of the entropy.
Example: if we observe 8 successes in 20 trials, the observed frequency is 0.4, and the estimated entropy would be about 0.971.
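A sketch of the plug-in entropy estimate in R. The function name `bernoulli_entropy` is made up for the illustration, and the convention 0 lg 0 = 0 is applied so that p = 0 and p = 1 give zero entropy.

```r
# Entropy (in bits) of a Bernoulli trial with success probability p.
bernoulli_entropy <- function(p) {
  # One term of the entropy sum, with the convention 0 * lg(0) = 0.
  term <- function(q) ifelse(q > 0, -q * log2(q), 0)
  term(p) + term(1 - p)
}

p_hat <- 8 / 20              # observed frequency: 8 successes in 20 trials
bernoulli_entropy(p_hat)     # about 0.971

bernoulli_entropy(0.5)       # a fair coin has the maximum, 1 bit
bernoulli_entropy(c(0, 1))   # certain outcomes carry no uncertainty: 0 0
```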
![Page 45: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/45.jpg)
What does a model do?
If we now have a statistical model that provides a better prediction, we can compute the conditional entropy.
Without the model, our best estimate of the chance of something is just the observed relative frequency.
But with the model, maybe we have a better estimate.
![Page 46: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/46.jpg)
Example
Logistic regression: use data from the Mycotic Ulcer Treatment Trial (Prajna et al 2012, Lietman group).
Using just the chance of having a transplant or perforation, the estimated entropy is 0.637. About 16% of patients had such an event.
![Page 47: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/47.jpg)
Example
Now, let’s use a covariate, such as what treatment a person got.
Now, we have a better (we hope) estimate of the chance of a bad outcome for each person. These are 11% for treatment 1 and 21% for treatment 2.
![Page 48: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/48.jpg)
Example
For group 1, the chance of a bad outcome was 11%. The entropy given membership in this group is about 0.503.
For group 2, the chance of a bad outcome was 21% or so. The entropy given membership in this group is about 0.744.
The weighted average of these is the conditional entropy given the model. This gives us just 0.623.
![Page 49: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/49.jpg)
Example
So the entropy was 0.637, and now it’s 0.623. The treatment model does not predict very much.
If we look at the proportional reduction, we get 2.2%; this model explained 2.2% of the uncertainty.
This is exactly the McFadden R^2 you get in logistic regression output.
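The link between the proportional entropy reduction and McFadden's R^2 can be checked numerically. The sketch below uses HYPOTHETICAL data (100 patients per arm, with 11 and 21 bad outcomes), chosen only to echo the rounded percentages above. These are not the trial data, so the numbers will not reproduce the slide's values exactly.

```r
# HYPOTHETICAL data, not the trial data: 100 patients per arm.
trt <- factor(rep(c("t1", "t2"), each = 100))
y   <- c(rep(1, 11), rep(0, 89),    # arm t1: 11 bad outcomes
         rep(1, 21), rep(0, 79))    # arm t2: 21 bad outcomes

# McFadden R^2 from logistic regression: 1 - logLik(model) / logLik(null).
fit  <- glm(y ~ trt, family = binomial)
null <- glm(y ~ 1,   family = binomial)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

# Entropy version: marginal entropy vs. entropy conditional on treatment.
h2 <- function(p) -p * log2(p) - (1 - p) * log2(1 - p)
h_marg <- h2(mean(y))
h_cond <- sum(tapply(y, trt, function(v) length(v) / length(y) * h2(mean(v))))
entropy_reduction <- 1 - h_cond / h_marg

c(mcfadden = mcfadden, entropy_reduction = entropy_reduction)
all.equal(mcfadden, entropy_reduction, tolerance = 1e-6)
```

The two agree because, with a single categorical covariate, the fitted probabilities are the group proportions, so each log-likelihood is the corresponding estimated entropy multiplied by a constant (minus the sample size, with a change of log base), and the constant cancels in the ratio.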
![Page 50: Thoughts on model assessment](https://reader035.vdocuments.us/reader035/viewer/2022062410/568164cb550346895dd6ea85/html5/thumbnails/50.jpg)
Summary
Probabilistic forecasts can be assessed using the Brier score.
The Brier score can be decomposed into reliability, resolution, and uncertainty components.
Information theoretic measures can also be used in assessment.
Information theoretic measures are useful in other biostatistical applications as well.