Common verification methods for ensemble forecasts, and how to apply them properly


Page 1: Common verification methods for ensemble forecasts, and how to apply them properly

1

Common verification methods for ensemble forecasts, and how to apply them properly

Tom Hamill
NOAA Earth System Research Lab, Physical Sciences Division, Boulder, CO
[email protected]

NOAA Earth System Research Laboratory

Page 2: Common verification methods for ensemble forecasts, and how to apply them properly

2

Part 1: two desirable properties of ensembles, and the challenges of evaluating these properties

• Property 1: Reliability, no matter how you slice and dice your ensemble data.

• Property 2: Specificity, i.e., sharpness.

Page 3: Common verification methods for ensemble forecasts, and how to apply them properly

3

Unreliable ensemble?

Here, the observed is outside of the range of the ensemble, which was sampled from the pdf shown. Is this a sign of a poor ensemble forecast?

Page 4: Common verification methods for ensemble forecasts, and how to apply them properly

4

Unreliable ensemble?

Here, the observed is outside of the range of the ensemble, which was sampled from the pdf shown. Is this a sign of a poor ensemble forecast?

You just don't know; it's only one sample.

Page 5: Common verification methods for ensemble forecasts, and how to apply them properly

5

(Figure: four example ensemble forecasts, with the observation falling at rank 1 of 21, rank 14 of 21, rank 5 of 21, and rank 3 of 21.)

Page 6: Common verification methods for ensemble forecasts, and how to apply them properly

6

“Rank histograms,” aka “Talagrand diagrams”

Flat histogram: happens when the observed is indistinguishable from any other member of the ensemble. The ensemble hopefully is reliable.

Sloped histogram: happens when the observed too commonly is lower than the ensemble members.

U-shaped histogram: happens when there are either some low and some high biases, or when the ensemble doesn't spread out enough.

With lots of samples from many situations, one can evaluate the characteristics of the ensemble.

ref: Hamill, MWR, March 2001

Page 7: Common verification methods for ensemble forecasts, and how to apply them properly

7

Underlying mathematics

$$\mathbf{X} = \left(x_1, \ldots, x_n\right)$$

the n-member sorted ensemble at some point

$$E\left[P\left(V < x_i\right)\right] = \frac{i}{n+1}$$

the probability that the truth V is less than the ith sorted member, if V and X's members sample the same distribution

… equivalently,

$$E\left[P\left(x_{i-1} \le V < x_i\right)\right] = \frac{1}{n+1}$$

$$\mathbf{R} = \left(\bar{r}_1, \ldots, \bar{r}_{n+1}\right), \qquad r_j = P\left(x_{j-1} \le V < x_j\right)$$

our rank histogram vector; the overbar denotes the sample average over many (hopefully independent) samples

ref: ibid.
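The recipe above is easy to turn into code. Below is a minimal sketch (not from the talk) that tallies a rank histogram with numpy from synthetic data; the sample sizes are illustrative, and ties between the observation and a member are broken at random.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cases, n_members = 10000, 20

# Truth and members drawn from the same distribution, so the
# histogram should come out approximately flat (1/(n+1) per bin).
truth = rng.normal(size=n_cases)
ensemble = rng.normal(size=(n_cases, n_members))

def rank_histogram(obs, ens, rng):
    """Count how often the observation falls in each of the n+1 bins
    defined by the sorted ensemble members (ties broken at random)."""
    n_members = ens.shape[1]
    counts = np.zeros(n_members + 1, dtype=int)
    for o, members in zip(obs, ens):
        rank = np.sum(members < o) + rng.integers(np.sum(members == o) + 1)
        counts[rank] += 1
    return counts

counts = rank_histogram(truth, ensemble, rng)
print("relative frequency per bin:", np.round(counts / counts.sum(), 3))
print("expected for a reliable ensemble:", round(1.0 / (n_members + 1), 3))
```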

Page 8: Common verification methods for ensemble forecasts, and how to apply them properly

8

Rank histograms of Z500, T850, T2m

(from 1998 reforecast version of NCEP GFS)

Solid lines indicate ranks after bias correction. Rank histograms are particularly U-shaped for T2m.

Ref: Hamill and Whitaker, MWR, Sep 2007.

Page 9: Common verification methods for ensemble forecasts, and how to apply them properly

9

Rank histograms…pretty simple, right?

Let’s consider some of the issues involved with this one seemingly simple verification metric.

Page 10: Common verification methods for ensemble forecasts, and how to apply them properly

10

Issues in using rank histograms
(1) Flat rank histograms can be created from combinations of uncalibrated ensembles.

Lesson: you can only be confident you have reliability if you see flatness of the rank histogram after having sliced the data many different ways.

Ref: Hamill, MWR, Mar 2001

Page 11: Common verification methods for ensemble forecasts, and how to apply them properly

11

Issues in using rank histograms
(2) The same rank histogram shape may result from ensembles with different deficiencies.

Ref: ibid

Here, half of the ensemble members are from a low-biased distribution, half from a high-biased distribution.

Here, all of the ensemble members are from an under-spread distribution.

Page 12: Common verification methods for ensemble forecasts, and how to apply them properly

12

Issues in using rank histograms
(3) Evaluating the ensemble relative to observations with errors distorts the shape of rank histograms.

A solution of sorts is to dress every ensemble member with a random sample of noise, consistent with the observation errors.

Ref: ibid.

(Figure panels: small obs error, medium obs error, large obs error.)
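A minimal sketch of the dressing idea, assuming Gaussian observation errors with a known standard deviation (obs_err_sd below is an illustrative parameter, and the data are synthetic): add observation-error noise to each member before computing the ranks, so member and observation are compared on an equal footing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_members = 5000, 20
spread, obs_err_sd = 1.0, 1.0          # assumed forecast spread and obs-error sd

# A reliable ensemble of the true state, verified against imperfect observations.
forecast_mean = rng.normal(size=n_cases)
true_state = forecast_mean + spread * rng.normal(size=n_cases)
ensemble = forecast_mean[:, None] + spread * rng.normal(size=(n_cases, n_members))
observations = true_state + obs_err_sd * rng.normal(size=n_cases)

def rank_counts(obs, ens):
    ranks = (ens < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Ranking against the raw members gives a spurious U shape (obs error inflates
# the apparent outliers); dressing the members restores a near-flat histogram.
dressed = ensemble + obs_err_sd * rng.normal(size=ensemble.shape)
print("raw    :", np.round(rank_counts(observations, ensemble) / n_cases, 3))
print("dressed:", np.round(rank_counts(observations, dressed) / n_cases, 3))
```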

Page 13: Common verification methods for ensemble forecasts, and how to apply them properly

13

Rank histogram shapes under “location” and “scale” errors

truth from N(0,1)

forecast from N(μ,σ); no implicit correlations between members or between a member and the truth

Ref: ibid.

Page 14: Common verification methods for ensemble forecasts, and how to apply them properly

14

Caren Marzban's question: what if we did have correlations between members, and with the truth?

Truth y; ensemble (x1, …, xn)

$$\left(y, x_1, \ldots, x_n\right) \sim \mathrm{MVN}\!\left(\left(0, \mu_x, \ldots, \mu_x\right),\ \Sigma\right)$$

$$\Sigma = \begin{bmatrix} 1 & R\sigma_x & R\sigma_x & \cdots & R\sigma_x \\ & \sigma_x^2 & r\sigma_x^2 & \cdots & r\sigma_x^2 \\ & & \sigma_x^2 & \cdots & r\sigma_x^2 \\ & & & \ddots & \vdots \\ & & & & \sigma_x^2 \end{bmatrix} \quad \text{(symmetric)}$$

r is the correlation between ensemble members

R is the correlation between an ensemble member and the observation

Ref: Marzban et al., MWR 2010, in press.
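A sketch (mine, not Marzban's code) of how draws from this setup can be generated with numpy and the resulting rank histogram tallied; mu_x, sigma_x, r, and R below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_members, n_cases = 10, 20000
mu_x, sigma_x, r, R = 0.0, 2.0, 0.9, 0.9    # illustrative values

# Covariance of (y, x_1, ..., x_n): var(y) = 1, var(x_i) = sigma_x^2,
# cov(y, x_i) = R*sigma_x, cov(x_i, x_j) = r*sigma_x^2.
dim = n_members + 1
cov = np.full((dim, dim), r * sigma_x**2)
cov[0, :] = cov[:, 0] = R * sigma_x
cov[0, 0] = 1.0
np.fill_diagonal(cov[1:, 1:], sigma_x**2)
mean = np.concatenate(([0.0], np.full(n_members, mu_x)))

draws = rng.multivariate_normal(mean, cov, size=n_cases)
truth, ensemble = draws[:, 0], draws[:, 1:]

ranks = (ensemble < truth[:, None]).sum(axis=1)
print(np.round(np.bincount(ranks, minlength=n_members + 1) / n_cases, 3))
```

With these illustrative settings the histogram tends toward a U shape even though sigma_x is twice the standard deviation of the truth, which is the puzzle taken up on the next two slides.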

Page 15: Common verification methods for ensemble forecasts, and how to apply them properly

Rank histograms, r = R = 0.9

15

Marzban's rank histograms include box & whiskers to quantify sampling variability (nice!).

(Figure panels: columns μ = 0.0, 0.2, 0.4, 0.8, 1.6; rows σ = 0.25, 0.5, 1.0, 2.0, 4.0.)

Page 16: Common verification methods for ensemble forecasts, and how to apply them properly

Rank histograms, r = R = 0.9

16

Marzban's rank histograms include box & whiskers to quantify sampling variability (nice!).

What's going on here? Why do we now appear to have diagnosed under-spread with an ensemble with apparent excess spread?

(Figure panels: columns μ = 0.0, 0.2, 0.4, 0.8, 1.6; rows σ = 0.25, 0.5, 1.0, 2.0, 4.0.)

Page 17: Common verification methods for ensemble forecasts, and how to apply them properly

17

Dan Wilks’ insight into this curious phenomenon…

Property of the multivariate Gaussian distribution:

$$\mu_{x|y} = \mu_x + R\,\frac{\sigma_x}{\sigma_y}\left(y - \mu_y\right) = \mu_x + R\,\sigma_x\,y$$

(the second equality holds since in this case we assumed y ~ N(0,1), so μy = 0 and σy = 1)

When μx = 0 (unbiased forecast), and given σy = 1, the expected value for an ensemble member is Rσx·y, which will be more extreme (further from the origin) than y for relatively large σx.

Ref: Wilks, 2010 MWR, submitted.

Page 18: Common verification methods for ensemble forecasts, and how to apply them properly

Illustration, μx=0

18

slope Rσx

Here, for a 1-member ensemble, that forecast member x1 tends to be further from the origin than the verification y. If we generated another ensemble member, it would tend to be further from the origin in the same manner. Hence, you have an ensemble clustered together, away from the origin, and the verification near the origin, i.e., at extreme ranks relative to the sorted ensemble.

Many realizations of the truth and a synthetic 1-member ensemble were created, with μx = 0, σx = 2, r = R = 0.9.

Page 19: Common verification methods for ensemble forecasts, and how to apply them properly

19

Rank histograms in higher dimensions: the “minimum spanning tree” histogram

• Solid lines: minimum spanning tree (MST) between 10-member forecasts
• Dashed line: MST when observed O is substituted for member D
• Calculate the MST's sum of line segments for all forecasts, and with the observed replacing each forecast member. Tally the rank of the pure-forecast sum relative to the sums where the observed replaced a member.
• Repeat for independent samples, building up a histogram

Ref: Wilks, MWR, June 2004. See also Smith and Hansen, MWR, June 2004
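A sketch of the MST tally described above (my own illustration using scipy, with synthetic 2-D forecasts): compute the total MST length of the forecast cloud, recompute it with the observation substituted for each member in turn, and record where the pure-forecast length ranks among those sums.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(2)

def mst_length(points):
    """Total edge length of the minimum spanning tree of a point cloud."""
    return minimum_spanning_tree(squareform(pdist(points))).sum()

def mst_rank(ensemble, obs):
    """Rank of the pure-forecast MST length among the lengths obtained by
    substituting the observation for each member in turn."""
    base = mst_length(ensemble)
    substituted = []
    for k in range(len(ensemble)):
        modified = ensemble.copy()
        modified[k] = obs
        substituted.append(mst_length(modified))
    return int(np.sum(np.array(substituted) < base))

n_cases, n_members, n_dim = 1000, 10, 2
counts = np.zeros(n_members + 1, dtype=int)
for _ in range(n_cases):
    ens = rng.normal(size=(n_members, n_dim))
    obs = rng.normal(size=n_dim)          # same distribution: expect a near-flat tally
    counts[mst_rank(ens, obs)] += 1
print(np.round(counts / counts.sum(), 3))
```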

Page 20: Common verification methods for ensemble forecasts, and how to apply them properly

20

Rank histograms in higher dimensions: the “minimum spanning tree” histogram

• Graphical interpretation of MST is different and perhaps less intuitive than it is for uni-dimensional rank histogram.

Ref: Wilks, MWR, June 2004. See also Smith and Hansen, MWR, June 2004

Page 21: Common verification methods for ensemble forecasts, and how to apply them properly

21

Multi-variate rank histogram

• Standardize and rotate using the Mahalanobis transformation (see Wilks 2006 text).
• For each of the n members of the forecast and the observed, define the "pre-rank" as the number of vectors to its lower left (a number between 1 and n+1).
• The multi-variate rank is the rank of the observation pre-rank, with ties resolved at random.
• Composite multi-variate ranks over many independent samples and plot the rank histogram.
• Same interpretation as the scalar rank histogram (e.g., U-shape = under-dispersive).

based on Tilmann Gneiting's presentation at Probability and Statistics, 2008 AMS Annual Conf., New Orleans, and http://www.springerlink.com/content/q58j4167355611g1/

$$z_i = \mathbf{S}^{-1/2}\left(x_i - \bar{x}\right)$$

"Mahalanobis" transform (S is the forecasts' sample covariance)

Page 22: Common verification methods for ensemble forecasts, and how to apply them properly

22

Multi-variate rank histogram calculation

based on Tilmann Gneiting’s presentation at Probability and Statistics, 2008 AMS Annual Conf., New Orleans

F1, F2, F3, F4, F5, O pre-ranks: [1, 5, 3, 1, 4, 1] ; sorted: obs = either rank 1, 2, or 3 with p=1/3.
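A sketch of the multivariate rank recipe above (my own reading of the listed steps, on synthetic data): Mahalanobis-standardize the pooled observation and members using the forecasts' sample covariance, compute pre-ranks, and resolve ties in the observation's pre-rank at random.

```python
import numpy as np

rng = np.random.default_rng(3)

def multivariate_rank(ensemble, obs, rng):
    """Multivariate rank of the observation: Mahalanobis-standardize,
    compute pre-ranks, resolve ties in the observation's rank at random."""
    pool = np.vstack([obs, ensemble])               # row 0 is the observation
    perts = ensemble - ensemble.mean(axis=0)
    cov = perts.T @ perts / (len(ensemble) - 1)     # forecasts' sample covariance
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (pool - ensemble.mean(axis=0)) @ inv_sqrt   # "Mahalanobis" transform
    # pre-rank: number of pooled vectors at or below this one in every coordinate
    pre = np.array([np.sum(np.all(z <= z[i], axis=1)) for i in range(len(z))])
    ties = np.sum(pre == pre[0])
    return int(np.sum(pre < pre[0]) + rng.integers(ties) + 1)   # 1 .. n+1

n_cases, n_members, n_dim = 5000, 10, 2
counts = np.zeros(n_members + 1, dtype=int)
for _ in range(n_cases):
    ens = rng.normal(size=(n_members, n_dim))
    obs = rng.normal(size=n_dim)                    # same distribution: expect ~flat
    counts[multivariate_rank(ens, obs, rng) - 1] += 1
print(np.round(counts / counts.sum(), 3))
```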

Page 23: Common verification methods for ensemble forecasts, and how to apply them properly

23

Reliability diagrams

Page 24: Common verification methods for ensemble forecasts, and how to apply them properly

24

Reliability diagrams

The curve tells you what the observed frequency was each time you forecast a given probability. This curve ought to lie along the y = x line. Here it shows the ensemble-forecast system over-forecasts the probability of light rain.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Page 25: Common verification methods for ensemble forecasts, and how to apply them properly

25

Reliability diagrams

The inset histogram tells you how frequently each probability was issued.

Perfectly sharp: the frequency of usage populates only 0% and 100%.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Page 26: Common verification methods for ensemble forecasts, and how to apply them properly

26

Reliability diagrams

BSS = Brier Skill Score

$$\mathrm{BSS} = \frac{\mathrm{BS}(\mathrm{climo}) - \mathrm{BS}(\mathrm{forecast})}{\mathrm{BS}(\mathrm{climo}) - \mathrm{BS}(\mathrm{perfect})}$$

BS(•) measures the Brier Score, which you can think of as the squared error of a probabilistic forecast.

Perfect: BSS = 1.0. Climatology: BSS = 0.0.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences
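A sketch of how the reliability curve, the inset frequency-of-usage histogram, and the BSS might be computed from a set of probability forecasts and binary outcomes; the data are synthetic and deliberately over-forecast, and the ten-bin choice is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic probability forecasts and binary outcomes: outcomes occur with a
# probability somewhat lower than forecast, mimicking over-forecasting.
n = 50000
y = rng.random(n)                                   # forecast probabilities
o = (rng.random(n) < 0.8 * y).astype(float)         # observed events (1/0)

bins = np.linspace(0.0, 1.0, 11)                    # 10 probability bins
which = np.clip(np.digitize(y, bins) - 1, 0, 9)

usage = np.bincount(which, minlength=10)            # inset histogram
obs_freq = np.bincount(which, weights=o, minlength=10) / np.maximum(usage, 1)
fcst_avg = np.bincount(which, weights=y, minlength=10) / np.maximum(usage, 1)

bs_forecast = np.mean((y - o) ** 2)
bs_climo = np.mean((o.mean() - o) ** 2)             # constant climatological forecast
bss = 1.0 - bs_forecast / bs_climo                  # (BS_climo - BS_fcst)/(BS_climo - 0)

for f, freq, count in zip(fcst_avg, obs_freq, usage):
    print(f"avg fcst prob {f:4.2f}   obs freq {freq:4.2f}   N {count}")
print("BSS =", round(bss, 3))
```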

Page 27: Common verification methods for ensemble forecasts, and how to apply them properly

27

Brier score
• Define an event, e.g., precip > 2.5 mm.
• Let yi be the forecast probability for the ith forecast case.
• Let oi be the observed probability (1 or 0).

Then

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - o_i\right)^2$$

(So the Brier score is the average squared error of the probabilistic forecast.)

Ref: Wilks 2006 text

Page 28: Common verification methods for ensemble forecasts, and how to apply them properly

28

Brier score decomposition

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - o_i\right)^2$$

Suppose there are I allowable forecast values, e.g., I = 10 for (5%, 15%, ..., 95%), with frequency of usage Ni. Suppose an overall climatological probability ō and a conditional average observed probability ōi for the ith forecast value. Then

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I}N_i\left(y_i - \bar{o}_i\right)^2 \;-\; \frac{1}{n}\sum_{i=1}^{I}N_i\left(\bar{o}_i - \bar{o}\right)^2 \;+\; \bar{o}\left(1 - \bar{o}\right)$$

("reliability")   ("resolution")   ("uncertainty")

Ref: 2006 Wilks text
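A sketch of the decomposition on synthetic forecasts restricted to I = 10 allowable values; because the forecasts are reliable by construction, the reliability term comes out near zero, and the three terms recombine to the Brier score.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic forecasts restricted to I = 10 allowable values (5%, 15%, ..., 95%)
allowable = np.arange(0.05, 1.0, 0.10)
y = rng.choice(allowable, size=200000)
o = (rng.random(y.size) < y).astype(float)          # reliable by construction

n = y.size
obar = o.mean()
reliability = resolution = 0.0
for yi in allowable:
    mask = (y == yi)
    Ni = mask.sum()
    if Ni == 0:
        continue
    oi = o[mask].mean()                              # conditional observed frequency
    reliability += Ni * (yi - oi) ** 2 / n
    resolution += Ni * (oi - obar) ** 2 / n
uncertainty = obar * (1.0 - obar)

bs = np.mean((y - o) ** 2)
print("BS              :", round(bs, 5))
print("rel - res + unc :", round(reliability - resolution + uncertainty, 5))
print("reliability, resolution, uncertainty:",
      round(reliability, 5), round(resolution, 5), round(uncertainty, 5))
```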

Page 29: Common verification methods for ensemble forecasts, and how to apply them properly

29

Brier score decomposition

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - o_i\right)^2$$

Suppose there are I allowable forecast values, e.g., I = 10 for (5%, 15%, ..., 95%), with frequency of usage Ni. Suppose an overall climatological probability ō and a conditional average observed probability ōi for the ith forecast value. Then

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I}N_i\left(y_i - \bar{o}_i\right)^2 \;-\; \frac{1}{n}\sum_{i=1}^{I}N_i\left(\bar{o}_i - \bar{o}\right)^2 \;+\; \bar{o}\left(1 - \bar{o}\right)$$

("reliability")   ("resolution")   ("uncertainty")

The decomposition only makes sense when every sample is drawn from a distribution with an overall climatology of ō. For example, don't use it if your forecast sample mixes together data from both desert and rainforest locations. For more, see Hamill and Juras, Oct 2006 QJ.

Page 30: Common verification methods for ensemble forecasts, and how to apply them properly

Degenerate case:

Skill might appropriately be 0.0 if all samples with 0.0 probability are drawn from a climatology with 0.0 probability, and all samples with 1.0 are drawn from a climatology with 1.0 probability.

Page 31: Common verification methods for ensemble forecasts, and how to apply them properly

31

“Attributes diagram”

www.bom.gov.au/bmrc/wefor/staff/eee/verif/ReliabilityDiagram.gif, from Beth Ebert's verification web page, http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html, based on Hsu and Murphy, 1986, Int'l Journal of Forecasting

$$\mathrm{BSS} = \frac{\text{"Resolution"} - \text{"Reliability"}}{\text{"Uncertainty"}}$$

Uncertainty term always positive, so probability forecasts will exhibit positive skill if resolution term is larger in absolute value than reliability term. Geometrically, this corresponds to points on the attributes diagram being closer to 1:1 perfect reliability line than horizontal no-resolution line (from Wilks text, 2006, chapter 7)

Again, this geometric interpretation of the attributes diagram makes sense only if all samples used to populate the diagram are drawn from the same climatological distribution.If you are mixing samples from locations with different climatologies, this interpretation is no longer correct! (Hamill and Juras, Oct 2006 QJRMS)

Page 32: Common verification methods for ensemble forecasts, and how to apply them properly

32

Proposed modifications to standard reliability diagrams

(1) Block-bootstrap techniques (perhaps each forecast day is a block) to provide confidence intervals. See Hamill, WAF, April 1999, and Bröcker and Smith, WAF, June 2007.

(2) BSS calculated in a way that does not attribute false skill to varying climatology (talk later this morning)

(3) Distribution of climatological forecasts plotted as horizontal bars on the inset histogram. Helps explain why there is small skill for a forecast that appears so reliable (figure from Hamill et al., MWR, 2008).

Page 33: Common verification methods for ensemble forecasts, and how to apply them properly

33

When does reliability ≠ reliability?

Probabilities directly estimated from a “reliable” ensemble system with only a few members may not produce diagonal reliability curves due to sampling variability.

10 members 50 members

ref: Richardson 2001 QJRMS

Page 34: Common verification methods for ensemble forecasts, and how to apply them properly

34

Sharpness also important

"Sharpness" measures the specificity of the probabilistic forecast. Given two reliable forecast systems, the one producing the sharper forecasts is preferable.

But: you don't want sharpness without reliability; that implies unrealistic confidence.

Page 35: Common verification methods for ensemble forecasts, and how to apply them properly

35

Sharpness ≠ resolution

• Sharpness is a property of the forecasts alone; a measure of sharpness in the Brier score decomposition would be how populated the extreme Ni's are.

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I}N_i\left(y_i - \bar{o}_i\right)^2 \;-\; \frac{1}{n}\sum_{i=1}^{I}N_i\left(\bar{o}_i - \bar{o}\right)^2 \;+\; \bar{o}\left(1 - \bar{o}\right)$$

("reliability")   ("resolution")   ("uncertainty")

Page 36: Common verification methods for ensemble forecasts, and how to apply them properly

“Spread-error” relationships are important, too.

The ensemble-mean error from a sample of this pdf should, on average, be low.

The ensemble-mean error should be moderate on average.

The ensemble-mean error should be large on average.

Small-spread ensemble forecasts should have less ensemble-mean error than large-spread forecasts; in some sense this is a conditional reliability dependent upon the amount of sharpness.

36

Page 37: Common verification methods for ensemble forecasts, and how to apply them properly

37

Why would you expect a spread-error relationship?

• The squared ensemble-mean error ought to have the same expected value as the squared ensemble spread.

• If V and the xi are sampled from the same distribution, then

$$E\left[\left(x_i - \bar{x}\right)^2\right] = E\left[\left(V - \bar{x}\right)^2\right]$$

(spread)²   (ens.-mean error)²

• Sometimes quantified with a correlation between spread and error.
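A small synthetic check of both statements (a sketch, with illustrative numbers): give each case its own spread, verify that the mean squared ensemble-mean error matches the mean ensemble variance (up to a small finite-ensemble factor), and note that the case-by-case spread-error correlation is positive but well below 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n_cases, n_members = 20000, 20

# Each case gets its own spread; truth and members are drawn from the same pdf.
case_spread = rng.lognormal(mean=0.0, sigma=0.5, size=n_cases)
truth = case_spread * rng.normal(size=n_cases)
members = case_spread[:, None] * rng.normal(size=(n_cases, n_members))

ens_mean = members.mean(axis=1)
spread_sq = members.var(axis=1, ddof=1)        # (spread)^2, case by case
err = truth - ens_mean                         # ensemble-mean error, case by case

print("mean spread^2        :", round(spread_sq.mean(), 3))
print("mean squared error   :", round(np.mean(err ** 2), 3))
print("corr(spread, |error|):",
      round(np.corrcoef(np.sqrt(spread_sq), np.abs(err))[0, 1], 3))
```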

Page 38: Common verification methods for ensemble forecasts, and how to apply them properly

Spread-error correlations with 1990’s NCEP GFS

At a given grid point, the spread S is assumed to be a random variable with a lognormal distribution,

$$\ln S \sim N\left(\ln S_m,\ \beta\right)$$

where Sm is the mean spread and β is its standard deviation.

As β increases, there is a wider range of spreads in the sample. One would then expect the possibility of a larger spread-skill correlation.

(Figure: spread-error correlation as a function of β.)

Lesson: spread-error correlations inherently limited by amount of variation in spread

Ref: Whitaker and Loughe, Dec. 1998 MWR
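A small simulation in the spirit of the figure (mine, not the paper's): draw spreads from the lognormal distribution above for several values of β, draw an error whose standard deviation equals the spread, and look at how the spread-error correlation grows with β.

```python
import numpy as np

rng = np.random.default_rng(7)
n_cases, S_m = 100000, 1.0

for beta in [0.1, 0.25, 0.5, 1.0]:
    spread = np.exp(rng.normal(np.log(S_m), beta, size=n_cases))  # ln S ~ N(ln Sm, beta)
    abs_error = np.abs(spread * rng.normal(size=n_cases))         # error sd equals the spread
    corr = np.corrcoef(spread, abs_error)[0, 1]
    print(f"beta = {beta:4.2f}   corr(spread, |error|) = {corr:4.2f}")
```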

Page 39: Common verification methods for ensemble forecasts, and how to apply them properly

39

Spread-error relationships and precipitation forecasts

True spread-skill relationships are harder to diagnose if the forecast PDF is non-normally distributed, as is typically the case for precipitation forecasts.

Commonly, the spread is no longer independent of the mean value; it's larger when the amount is larger.

Hence, you may get an apparent spread-error relationship, but this may reflect variations in the mean forecast rather than real spread-skill.

Ref & possible solution: Hamill and Colucci, MWR, Mar. 1998

Page 40: Common verification methods for ensemble forecasts, and how to apply them properly

40

Is the spread-error correlation as high as it could be?

(Figure: hurricane track error and spread.)

Ref: Hamill et al., MWR 2010 conditionally accepted.

Page 41: Common verification methods for ensemble forecasts, and how to apply them properly

…use one member as synthetic verification to gauge potential spread-error correlation

41

Page 42: Common verification methods for ensemble forecasts, and how to apply them properly

42

Spread vs. error in more than one dimension: verification of ellipse eccentricity

Hurricane Bill, initialized 00 UTC 19 August 2009.

A bi-variate normal was fitted to each ensemble's forecast track positions. The ellipse encloses 90% of the fitted probability.

Ref: Hamill et al., MWR 2010 conditionally accepted.

Page 43: Common verification methods for ensemble forecasts, and how to apply them properly

43

Ellipse eccentricity analysis

$$\mathbf{x}'_l = \left(x_l(1) - \bar{x}_l,\ \ldots,\ x_l(n_t) - \bar{x}_l\right) \big/ \left(n_t - 1\right)^{1/2}$$

$$\mathbf{x}'_j = \left(x_j(1) - \bar{x}_j,\ \ldots,\ x_j(n_t) - \bar{x}_j\right) \big/ \left(n_t - 1\right)^{1/2}$$

l = longitude, j = latitude, nt = number of tracked members

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}'_l \\ \mathbf{x}'_j \end{bmatrix}, \qquad \mathbf{F} = \mathbf{X}\mathbf{X}^{T} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{-1} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{T} = \left(\mathbf{E}\boldsymbol{\Lambda}^{1/2}\right)\left(\mathbf{E}\boldsymbol{\Lambda}^{1/2}\right)^{T}$$

The projection of the ensemble-mean track error onto eigenvector e1, averaged over cases, should be consistent with the projection of the members onto e1, averaged over cases and members; likewise for e2.

Question: are the error projections larger along the direction where the ellipse is stretched out?

Page 44: Common verification methods for ensemble forecasts, and how to apply them properly

Ellipse eccentricity analysis (non-homogeneous)

Page 45: Common verification methods for ensemble forecasts, and how to apply them properly

Ellipse eccentricity analysis (non-homogeneous)

45

Notes for GFS/EnKF:

(1) Along the major axis of the ellipse, the average projection of the errors is consistent with the projection of the members; the spread is well estimated.
(2) Along the minor axis of the ellipse, the projection of the errors is slightly larger than the projection of the members; too little spread.
(3) Together, these imply more isotropy is needed.
(4) Still, some separation of the projection of the error onto the ellipses (dashed lines) indicates there is some skill in forecasting ellipticity.

Page 46: Common verification methods for ensemble forecasts, and how to apply them properly

Another way of conveying spread-error relationships.

46

Averaging the individual samples into bins, then plotting a curve.

Ref: Wang and Bishop, JAS, May 2003.

Page 47: Common verification methods for ensemble forecasts, and how to apply them properly

47

Part 2: other common (and uncommon) ensemble-related evaluation tools

• What do they tell us about the ensemble?
• How are they computed?
• Is there some relation to reliability, sharpness, or other previously discussed attributes?
• What are some issues in their usage?

Page 48: Common verification methods for ensemble forecasts, and how to apply them properly

48

Isla Gilmour’s non-linearity index

Ref: Gilmour et al., JAS, Nov. 2001

how long does a linear regime persist?

Page 49: Common verification methods for ensemble forecasts, and how to apply them properly

49

Power spectra and saturation

Ref: Tribbia and Baumhefner, MWR, March 2004

The 0-curve represents the power difference between analyses as a function of global wavenumber.

Other numbered curves indicate the power difference at different forecast leads, in days.

Diagnostics like this can be useful in evaluating the upscale growth characteristics of ensemble uncertainty.

Page 50: Common verification methods for ensemble forecasts, and how to apply them properly

50

Satterfield and Szunyogh’s “Explained Variance”

$$\mathrm{EV} = \frac{\overline{\left\|\xi^{\parallel}\right\|^2}}{\overline{\left\|\xi^{\parallel}\right\|^2} + \overline{\left\|\xi^{\perp}\right\|^2}}$$

How much of the error ξ (the difference of the truth from the ensemble mean) lies in the space spanned by the ensemble (∥) vs. orthogonal to it (⊥)? (The calculation is done over a limited-area region.)

One would expect that as EV decreases, more U-shaped multi-dimensional rank histograms would be encountered.

Ref: Satterfield and Szunyogh, MWR, March 2010
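One plausible way to compute EV for a single case (a sketch under my own reading of the definition above, not Satterfield and Szunyogh's code): build an orthonormal basis for the ensemble-perturbation subspace, split the error of the ensemble mean into components inside and orthogonal to it, and take the ratio of squared norms.

```python
import numpy as np

rng = np.random.default_rng(8)
n_members, n_grid = 10, 200                 # ensemble size, local state dimension

ensemble = rng.normal(size=(n_members, n_grid))
truth = rng.normal(size=n_grid)

perts = (ensemble - ensemble.mean(axis=0)).T        # columns = perturbation vectors
xi = truth - ensemble.mean(axis=0)                  # error of the ensemble mean

# Orthonormal basis for the space spanned by the perturbations
U, s, _ = np.linalg.svd(perts, full_matrices=False)
basis = U[:, s > 1e-10 * s.max()]

xi_par = basis @ (basis.T @ xi)                     # component within the ensemble span
xi_perp = xi - xi_par                               # orthogonal component

ev = np.sum(xi_par**2) / (np.sum(xi_par**2) + np.sum(xi_perp**2))
print("explained variance EV =", round(ev, 3))
```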

Page 51: Common verification methods for ensemble forecasts, and how to apply them properly

51

Example with forecasts from the "local ETKF"

top: 5x5 degree boxes
bottom: 10x10 degree boxes

The larger the box, the smaller the EV, but the patterns are similar.

Page 52: Common verification methods for ensemble forecasts, and how to apply them properly

52

Cumulative distribution function (CDF); used in the CRPS

• Ff(x) = Pr{X ≤ x}, where X is the random variable and x is some specified threshold.

Page 53: Common verification methods for ensemble forecasts, and how to apply them properly

Continuous Ranked Probability Score

• Let Fif(x) be the forecast probability CDF for the ith forecast case.
• Let Fio(x) be the observed probability CDF (a Heaviside function).

$$\mathrm{CRPS} = \frac{1}{n}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\left(F_i^f(x) - F_i^o(x)\right)^2\,dx$$

Page 54: Common verification methods for ensemble forecasts, and how to apply them properly

54

Continuous Ranked Probability Score

• Let Fif(x) be the forecast probability CDF for the ith forecast case.
• Let Fio(x) be the observed probability CDF (a Heaviside function)*.

$$\mathrm{CRPS} = \frac{1}{n}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\left(F_i^f(x) - F_i^o(x)\right)^2\,dx$$

(difference squared)

* or incorporate obs error; see Candille and Talagrand, QJRMS, 2007
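A sketch that evaluates the integral above numerically for ensemble forecasts, using the empirical ensemble CDF for Fif and a Heaviside step at the observation for Fio; the grid and data are illustrative. (For an ensemble, the same score can also be computed from pairwise absolute differences among members and the observation, but the direct integral keeps the connection to the formula explicit.)

```python
import numpy as np

rng = np.random.default_rng(9)
n_cases, n_members = 2000, 20

ensemble = rng.normal(size=(n_cases, n_members))
obs = rng.normal(size=n_cases)                 # same distribution as the members

# Grid over which to approximate the integral of (Ff - Fo)^2 dx
x = np.linspace(-8.0, 8.0, 1601)
dx = x[1] - x[0]

crps_total = 0.0
for members, o in zip(ensemble, obs):
    Ff = np.searchsorted(np.sort(members), x, side="right") / n_members  # empirical CDF
    Fo = (x >= o).astype(float)                                          # Heaviside at the obs
    crps_total += np.sum((Ff - Fo) ** 2) * dx
print("CRPS =", round(crps_total / n_cases, 3))
```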

Page 55: Common verification methods for ensemble forecasts, and how to apply them properly

55

Continuous Ranked Probability Skill Score (CRPSS)

$$\mathrm{CRPSS} = \frac{\mathrm{CRPS}(\mathrm{forecast}) - \mathrm{CRPS}(\mathrm{climo})}{\mathrm{CRPS}(\mathrm{perfect}) - \mathrm{CRPS}(\mathrm{climo})}$$

Like the Brier score, it's common to convert this to a skill score by normalizing by the skill of a reference forecast, perhaps climatology.

Danger: this can over-estimate forecast skill. See the later presentation on this.

Page 56: Common verification methods for ensemble forecasts, and how to apply them properly

56

Decomposition of CRPS

• Like Brier score, there is a decomposition of CRPS into reliability, resolution, uncertainty.

• Like Brier score, interpretation of this decomposition only makes sense if all samples are draws from a distribution with the same climatology.

Ref: Hersbach, WAF, Oct 2000, + Hamill and Juras, QJ, Oct 2006

Page 57: Common verification methods for ensemble forecasts, and how to apply them properly

57

Relative Operating Characteristic (ROC)

Hit Rate = H / (H+M)
FAR = F / (F+C)

Page 58: Common verification methods for ensemble forecasts, and how to apply them properly

58

Relative Operating Characteristic (ROC)

$$\mathrm{ROCSS} = \frac{\mathrm{AUC}_f - \mathrm{AUC}_{clim}}{\mathrm{AUC}_{perf} - \mathrm{AUC}_{clim}} = \frac{\mathrm{AUC}_f - 0.5}{1.0 - 0.5} = 2\,\mathrm{AUC}_f - 1$$

Page 59: Common verification methods for ensemble forecasts, and how to apply them properly

59

Method of calculation of ROC: parts 1 and 2

(1) Build a contingency table (Fcst ≥ T? vs. Obs ≥ T?) for each sorted ensemble member.

(Figure: a single-case example on a number line from 55 to 66; the observation falls below the threshold T, the lower sorted members also fall below T, and the upper sorted members fall above it, so for this case the lower members each tally a correct negative and the upper members each tally a false alarm in their 2x2 tables.)

(2) Repeat the process for other locations and dates, building up contingency tables for the sorted members.

Page 60: Common verification methods for ensemble forecasts, and how to apply them properly

60

Method of calculation of ROC: part 3

(3) From each sorted ensemble member's accumulated contingency table, get the hit rate and false alarm rate: HR = H / (H+M), FAR = F / (F+C), where H = hits, F = false alarms, M = misses, C = correct negatives (Fcst ≥ T? vs. Obs ≥ T?).

Sorted member 1:  H = 1106   F = 3       M = 5651   C = 73270   →   HR = 0.163, FAR = 0.000
Sorted member 2:  H = 3097   F = 176     M = 3630   C = 73097   →   HR = 0.504, FAR = 0.002
Sorted member 3:  H = 4020   F = 561     M = 2707   C = 72712   →   HR = 0.597, FAR = 0.007
Sorted member 4:  H = 4692   F = 1270    M = 2035   C = 72003   →   HR = 0.697, FAR = 0.017
Sorted member 5:  H = 5297   F = 2655    M = 1430   C = 70618   →   HR = 0.787, FAR = 0.036
Sorted member 6:  H = 6603   F = 44895   M = 124    C = 28378   →   HR = 0.981, FAR = 0.612

Page 61: Common verification methods for ensemble forecasts, and how to apply them properly

61

Method of calculation of ROC: part 4

(4) Plot hit rate vs. false alarm rate. Adding the (0, 0) and (1, 1) endpoints to the values from the sorted members gives:

HR = [0.000, 0.163, 0.504, 0.597, 0.697, 0.787, 0.981, 1.000]
FAR = [0.000, 0.000, 0.002, 0.007, 0.017, 0.036, 0.612, 1.000]

Again, this can overestimate forecast skill.
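A sketch of the four steps on synthetic data: each sorted member serves in turn as the decision threshold, a 2x2 contingency table is accumulated over all cases, the resulting (FAR, HR) points plus the (0, 0) and (1, 1) endpoints trace the curve, and the area under it gives ROCSS = 2 AUC − 1. The event threshold and forecast model below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n_cases, n_members = 20000, 10
threshold = 1.0                                  # event: value >= threshold

# Synthetic forecasts with some skill: members and obs scattered about a signal
signal = rng.normal(size=n_cases)
obs = signal + 0.7 * rng.normal(size=n_cases)
members = np.sort(signal[:, None] + 0.7 * rng.normal(size=(n_cases, n_members)), axis=1)

event = obs >= threshold
hr, far = [0.0], [0.0]
for k in range(n_members):                       # lowest sorted member first:
    warn = members[:, k] >= threshold            # requiring it to exceed T is most conservative
    H = np.sum(warn & event)
    F = np.sum(warn & ~event)
    M = np.sum(~warn & event)
    C = np.sum(~warn & ~event)
    hr.append(H / (H + M))
    far.append(F / (F + C))
hr.append(1.0)
far.append(1.0)

hr, far = np.array(hr), np.array(far)
auc = np.sum(0.5 * (hr[1:] + hr[:-1]) * np.diff(far))   # trapezoid rule
print("HR :", np.round(hr, 3))
print("FAR:", np.round(far, 3))
print("AUC =", round(auc, 3), "   ROCSS = 2*AUC - 1 =", round(2 * auc - 1, 3))
```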

Page 62: Common verification methods for ensemble forecasts, and how to apply them properly

62

ROC connection with hypothesis testing

A point on the ROC curve evaluates the probability of a Type I error (inappropriate rejection of the null hypothesis, which here is Obs < T) against 1 − the probability of a Type II error (failing to warn when the event Obs ≥ T occurs). The ROC curve provides the tradeoff as different ensemble data are used as the test statistic.

Contingency table (columns: Observed ≥ T?; rows: Fcst ≥ T?, our test statistic):

                Obs: YES         Obs: NO
Fcst: YES       H (hit)          F (false alarm)
Fcst: NO        M (miss)         C (correct no)

FAR = F / (F + C) = P(Type I error)
HR = H / (H + M) = 1 − P(Type II error)

Page 63: Common verification methods for ensemble forecasts, and how to apply them properly

63

Economic value diagrams

These diagrams tell you the potential economic value of your ensemble forecast system applied to a particular forecast aspect. A perfect forecast has a value of 1.0; climatology has a value of 0.0. Value differs with the user's cost/loss ratio.

Motivated by the search for a metric that relates ensemble forecast performance to things that customers will actually care about.

Page 64: Common verification methods for ensemble forecasts, and how to apply them properly

64

Economic value: calculation method

Assumes the decision maker alters actions based on weather forecast info.

C = cost of protection
L = Lp + Lu = total cost of a loss, where …
Lp = loss that can be protected against
Lu = loss that can't be protected against
N = no cost

h + m = o;   f + c = 1 − o

Page 65: Common verification methods for ensemble forecasts, and how to apply them properly

65

Economic value, continued

Suppose we have the contingency table of forecast outcomes, [h, m, f, c]. Then we can calculate the expected value of the expenses from a forecast, from climatology, and from a perfect forecast:

$$E_{forecast} = fC + h\left(C + L_u\right) + m\left(L_p + L_u\right)$$

$$E_{climate} = \mathrm{Min}\left[o\left(L_p + L_u\right),\ C + oL_u\right] = oL_u + \mathrm{Min}\left[oL_p,\ C\right]$$

$$E_{perfect} = o\left(C + L_u\right)$$

$$V = \frac{E_{climate} - E_{forecast}}{E_{climate} - E_{perfect}} = \frac{\mathrm{Min}\left[oL_p,\ C\right] - (h+f)C - mL_p}{\mathrm{Min}\left[oL_p,\ C\right] - oC}$$

Note that the value will vary with C, Lp, Lu; different users with different protection costs may experience a different value from the forecast system.

h + m = o;   f + c = 1 − o

Ref: Zhu et al. 2002 BAMS

Page 66: Common verification methods for ensemble forecasts, and how to apply them properly

66

From ROC to economic value

$$\mathrm{HR} = \frac{h}{o}, \qquad \mathrm{FAR} = \frac{f}{1-o}, \qquad m = o - \mathrm{HR}\cdot o$$

$$V = \frac{\mathrm{Min}\left[o,\ C/L_p\right] - (h+f)\left(C/L_p\right) - m}{\mathrm{Min}\left[o,\ C/L_p\right] - o\left(C/L_p\right)} = \frac{\mathrm{Min}\left[o,\ C/L_p\right] - \left(C/L_p\right)\mathrm{FAR}\left(1-o\right) + \mathrm{HR}\cdot o\left(1 - C/L_p\right) - o}{\mathrm{Min}\left[o,\ C/L_p\right] - o\left(C/L_p\right)}$$

Value is now seen to be related to FAR and HR, the components of the ROC curve. An (HR, FAR) point on the ROC curve will thus map to a value curve (as a function of C/L).

Ref: ibid.
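A sketch of this mapping: given a climatological event frequency o and one (HR, FAR) point from the ROC curve (both illustrative numbers below), compute the value V as a function of the user's cost/loss ratio C/Lp. The envelope over all sorted members' (HR, FAR) points gives the overall value curve shown on the next slide.

```python
import numpy as np

o = 0.10                        # climatological event frequency (illustrative)
HR, FAR = 0.70, 0.05            # one (HR, FAR) point from an ROC curve (illustrative)

def value(cost_loss, o, HR, FAR):
    """Economic value V for a user with cost/loss ratio C/Lp, using
    h = HR*o, f = FAR*(1 - o), m = o - HR*o in the formula above."""
    h = HR * o
    f = FAR * (1.0 - o)
    m = o - h
    numer = np.minimum(o, cost_loss) - (h + f) * cost_loss - m
    denom = np.minimum(o, cost_loss) - o * cost_loss
    return numer / denom

cost_loss = np.linspace(0.01, 0.99, 99)
V = value(cost_loss, o, HR, FAR)
print("value at C/Lp = 0.05, 0.10, 0.50:",
      np.round(value(np.array([0.05, 0.10, 0.50]), o, HR, FAR), 2))
print(f"max value {V.max():.2f} at C/Lp = {cost_loss[V.argmax()]:.2f}")
```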

Page 67: Common verification methods for ensemble forecasts, and how to apply them properly

67

The red curve is from the ROC data for the member defining the 90th percentile of the ensemble distribution. The green curve is for the 10th percentile. The overall economic value is the maximum (use whatever member for the decision threshold that provides the best economic value).

Page 68: Common verification methods for ensemble forecasts, and how to apply them properly

68

Conclusions

• There are many ensemble verification techniques out there.
• A good principle is to thought-test every verification technique you seek to use: is there some way it could be misleading you about the characteristics of the forecast? (Commonly, the answer will be yes.)

Page 69: Common verification methods for ensemble forecasts, and how to apply them properly

69

Useful references

• Good overall references for forecast verification:
  – (1) Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences (2nd Ed). Academic Press, 627 pp.
  – (2) Beth Ebert's forecast verification web page, http://tinyurl.com/y97c74
• Rank histograms: Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550-560.
• Spread-skill relationships: Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292-3302.
• Brier score, continuous ranked probability score, reliability diagrams: Wilks text again.
• Relative operating characteristic: Harvey, L. O., Jr., and others, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863-883.
• Economic value diagrams:
  – (1) Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Royal Meteor. Soc., 126, 649-667.
  – (2) Zhu, Y., and others, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73-83.
• Overforecasting skill: Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology? Quart. J. Royal Meteor. Soc., Jan 2007 issue. http://tinyurl.com/kxtct