Common verification methods for ensemble forecasts, and how to apply them properly


Page 1: Common verification methods for ensemble forecasts, and how to apply them properly

1

Common verification methods for ensemble forecasts, and how to apply them properly

Tom Hamill
NOAA Earth System Research Lab, Physical Sciences Division, Boulder, CO
[email protected]

NOAA Earth System Research Laboratory

Page 2: Common verification methods for ensemble forecasts, and how to apply them properly

2

Part 1: two desirable properties of ensembles, and the challenges of evaluating these properties

• Property 1: Reliability, no matter how you slice and dice your ensemble data.

• Property 2: Specificity, i.e., sharpness.

Page 3: Common verification methods for ensemble forecasts, and how to apply them properly

3

Unreliable ensemble?

Here, the observed is outside of the range of the ensemble, which was sampled from the pdf shown. Is this a sign of a poor ensemble forecast?

Page 4: Common verification methods for ensemble forecasts, and how to apply them properly

4

Unreliable ensemble?

Here, the observed is outside of the range of the ensemble, which was sampled from the pdf shown. Is this a sign of a poor ensemble forecast?

You just don't know; it's only one sample.

Page 5: Common verification methods for ensemble forecasts, and how to apply them properly

5

(Figure: four example ensemble forecasts, with the observation falling at rank 1 of 21, rank 14 of 21, rank 5 of 21, and rank 3 of 21.)

Page 6: Common verification methods for ensemble forecasts, and how to apply them properly

6

“Rank histograms,” aka “Talagrand diagrams”

Flat histogram: happens when the observed is indistinguishable from any other member of the ensemble. The ensemble hopefully is reliable.

Sloped histogram: happens when the observed too commonly is lower than the ensemble members.

U-shaped histogram: happens when there are either some low and some high biases, or when the ensemble doesn't spread out enough.

With lots of samples from many situations, one can evaluate the characteristics of the ensemble.

ref: Hamill, MWR, March 2001

Page 7: Common verification methods for ensemble forecasts, and how to apply them properly

7

Underlying mathematics

$$\mathbf{X} = \left(x_1, \ldots, x_n\right)$$

the n-member sorted ensemble at some point

$$E\left[P\left(V < x_i\right)\right] = \frac{i}{n+1}$$

the probability that the truth V is less than the ith sorted member, if V and X's members sample the same distribution

… equivalently,

$$E\left[P\left(x_{i-1} \le V < x_i\right)\right] = \frac{1}{n+1}$$

$$\mathbf{R} = \left(\bar{r}_1, \ldots, \bar{r}_{n+1}\right), \qquad r_j = P\left(x_{j-1} \le V < x_j\right)$$

our rank histogram vector; the overbar denotes the sample average over many (hopefully independent) samples

ref: ibid.
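The recipe above is easy to turn into code. Below is a minimal sketch (not from the talk) that tallies a rank histogram with numpy from synthetic data; the sample sizes are illustrative, and ties between the observation and a member are broken at random.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cases, n_members = 10000, 20

# Truth and members drawn from the same distribution, so the
# histogram should come out approximately flat (1/(n+1) per bin).
truth = rng.normal(size=n_cases)
ensemble = rng.normal(size=(n_cases, n_members))

def rank_histogram(obs, ens, rng):
    """Count how often the observation falls in each of the n+1 bins
    defined by the sorted ensemble members (ties broken at random)."""
    n_members = ens.shape[1]
    counts = np.zeros(n_members + 1, dtype=int)
    for o, members in zip(obs, ens):
        rank = np.sum(members < o) + rng.integers(np.sum(members == o) + 1)
        counts[rank] += 1
    return counts

counts = rank_histogram(truth, ensemble, rng)
print("relative frequency per bin:", np.round(counts / counts.sum(), 3))
print("expected for a reliable ensemble:", round(1.0 / (n_members + 1), 3))
```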

Page 8: Common verification methods for ensemble forecasts, and how to apply them properly

8

Rank histograms of Z500, T850, T2m

(from 1998 reforecast version of NCEP GFS)

Solid lines indicate ranks after bias correction. Rank histograms are particularly U-shaped for T2m.

Ref: Hamill and Whitaker, MWR, Sep 2007.

Page 9: Common verification methods for ensemble forecasts, and how to apply them properly

9

Rank histograms…pretty simple, right?

Let’s consider some of the issues involved with this one seemingly simple verification metric.

Page 10: Common verification methods for ensemble forecasts, and how to apply them properly

10

Issues in using rank histograms
(1) Flat rank histograms can be created from combinations of uncalibrated ensembles.

Lesson: you can only be confident you have reliability if you see flatness of the rank histogram after having sliced the data many different ways.

Ref: Hamill, MWR, Mar 2001

Page 11: Common verification methods for ensemble forecasts, and how to apply them properly

11

Issues in using rank histograms
(2) The same rank histogram shape may result from ensembles with different deficiencies.

Ref: ibid

Here, half of the ensemble members are from a low-biased distribution, half from a high-biased distribution.

Here, all of the ensemble members are from an under-spread distribution.

Page 12: Common verification methods for ensemble forecasts, and how to apply them properly

12

Issues in using rank histograms
(3) Evaluating the ensemble relative to observations with errors distorts the shape of rank histograms.

A solution of sorts is to dress every ensemble member with a random sample of noise, consistent with the observation errors.

Ref: ibid.

(Figure panels: small obs error, medium obs error, large obs error.)
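A minimal sketch of the dressing idea, assuming Gaussian observation errors with a known standard deviation (obs_err_sd below is an illustrative parameter, and the data are synthetic): add observation-error noise to each member before computing the ranks, so member and observation are compared on an equal footing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_members = 5000, 20
spread, obs_err_sd = 1.0, 1.0          # assumed forecast spread and obs-error sd

# A reliable ensemble of the true state, verified against imperfect observations.
forecast_mean = rng.normal(size=n_cases)
true_state = forecast_mean + spread * rng.normal(size=n_cases)
ensemble = forecast_mean[:, None] + spread * rng.normal(size=(n_cases, n_members))
observations = true_state + obs_err_sd * rng.normal(size=n_cases)

def rank_counts(obs, ens):
    ranks = (ens < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Ranking against the raw members gives a spurious U shape (obs error inflates
# the apparent outliers); dressing the members restores a near-flat histogram.
dressed = ensemble + obs_err_sd * rng.normal(size=ensemble.shape)
print("raw    :", np.round(rank_counts(observations, ensemble) / n_cases, 3))
print("dressed:", np.round(rank_counts(observations, dressed) / n_cases, 3))
```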

Page 13: Common verification methods for ensemble forecasts, and how to apply them properly

13

Rank histogram shapes under “location” and “scale” errors

truth from N(0,1)

forecast from N(μ,σ); no implicit correlations between members or between a member and the truth

Ref: ibid.

Page 14: Common verification methods for ensemble forecasts, and how to apply them properly

14

Caren Marzban's question: what if we did have correlations between members, and with the truth?

Truth y; ensemble (x1, …, xn)

$$\left(y, x_1, \ldots, x_n\right) \sim \mathrm{MVN}\!\left(\left(0, \mu_x, \ldots, \mu_x\right),\ \Sigma\right)$$

$$\Sigma = \begin{bmatrix} 1 & R\sigma_x & R\sigma_x & \cdots & R\sigma_x \\ & \sigma_x^2 & r\sigma_x^2 & \cdots & r\sigma_x^2 \\ & & \sigma_x^2 & \cdots & r\sigma_x^2 \\ & & & \ddots & \vdots \\ & & & & \sigma_x^2 \end{bmatrix} \quad \text{(symmetric)}$$

r is the correlation between ensemble members

R is the correlation between an ensemble member and the observation

Ref: Marzban et al., MWR 2010, in press.
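A sketch (mine, not Marzban's code) of how draws from this setup can be generated with numpy and the resulting rank histogram tallied; mu_x, sigma_x, r, and R below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_members, n_cases = 10, 20000
mu_x, sigma_x, r, R = 0.0, 2.0, 0.9, 0.9    # illustrative values

# Covariance of (y, x_1, ..., x_n): var(y) = 1, var(x_i) = sigma_x^2,
# cov(y, x_i) = R*sigma_x, cov(x_i, x_j) = r*sigma_x^2.
dim = n_members + 1
cov = np.full((dim, dim), r * sigma_x**2)
cov[0, :] = cov[:, 0] = R * sigma_x
cov[0, 0] = 1.0
np.fill_diagonal(cov[1:, 1:], sigma_x**2)
mean = np.concatenate(([0.0], np.full(n_members, mu_x)))

draws = rng.multivariate_normal(mean, cov, size=n_cases)
truth, ensemble = draws[:, 0], draws[:, 1:]

ranks = (ensemble < truth[:, None]).sum(axis=1)
print(np.round(np.bincount(ranks, minlength=n_members + 1) / n_cases, 3))
```

With these illustrative settings the histogram tends toward a U shape even though sigma_x is twice the standard deviation of the truth, which is the puzzle taken up on the next two slides.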

Page 15: Common verification methods for ensemble forecasts, and how to apply them properly

Rank histograms, r = R = 0.9

15

Marzban's rank histograms include box & whiskers to quantify sampling variability (nice!).

(Figure panels: columns μ = 0.0, 0.2, 0.4, 0.8, 1.6; rows σ = 0.25, 0.5, 1.0, 2.0, 4.0.)

Page 16: Common verification methods for ensemble forecasts, and how to apply them properly

Rank histograms, r = R = 0.9

16

Marzban's rank histograms include box & whiskers to quantify sampling variability (nice!).

What's going on here? Why do we now appear to have diagnosed under-spread with an ensemble with apparent excess spread?

(Figure panels: columns μ = 0.0, 0.2, 0.4, 0.8, 1.6; rows σ = 0.25, 0.5, 1.0, 2.0, 4.0.)

Page 17: Common verification methods for ensemble forecasts, and how to apply them properly

17

Dan Wilks’ insight into this curious phenomenon…

Property of the multivariate Gaussian distribution:

$$\mu_{x|y} = \mu_x + R\,\frac{\sigma_x}{\sigma_y}\left(y - \mu_y\right) = \mu_x + R\,\sigma_x\,y$$

(the second equality holds since in this case we assumed y ~ N(0,1), so μy = 0 and σy = 1)

When μx = 0 (unbiased forecast), and given σy = 1, the expected value for an ensemble member is Rσx·y, which will be more extreme (further from the origin) than y for relatively large σx.

Ref: Wilks, 2010 MWR, submitted.

Page 18: Common verification methods for ensemble forecasts, and how to apply them properly

Illustration, μx=0

18

slope Rσx

Here, for a 1-member ensemble, that forecast member x1 tends to be further from the origin than the verification y. If we generated another ensemble member, it would tend to be further from the origin in the same manner. Hence, you have an ensemble clustered together, away from the origin, and the verification near the origin, i.e., at extreme ranks relative to the sorted ensemble.

Many realizations of the truth and a synthetic 1-member ensemble were created, with μx = 0, σx = 2, r = R = 0.9.

Page 19: Common verification methods for ensemble forecasts, and how to apply them properly

19

Rank histograms in higher dimensions: the “minimum spanning tree” histogram

• Solid lines: minimum spanning tree (MST) between 10-member forecasts
• Dashed line: MST when observed O is substituted for member D
• Calculate the MST's sum of line segments for all forecasts, and with the observed replacing each forecast member. Tally the rank of the pure-forecast sum relative to the sums where the observed replaced a member.
• Repeat for independent samples, building up a histogram

Ref: Wilks, MWR, June 2004. See also Smith and Hansen, MWR, June 2004
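A sketch of the MST tally described above (my own illustration using scipy, with synthetic 2-D forecasts): compute the total MST length of the forecast cloud, recompute it with the observation substituted for each member in turn, and record where the pure-forecast length ranks among those sums.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(2)

def mst_length(points):
    """Total edge length of the minimum spanning tree of a point cloud."""
    return minimum_spanning_tree(squareform(pdist(points))).sum()

def mst_rank(ensemble, obs):
    """Rank of the pure-forecast MST length among the lengths obtained by
    substituting the observation for each member in turn."""
    base = mst_length(ensemble)
    substituted = []
    for k in range(len(ensemble)):
        modified = ensemble.copy()
        modified[k] = obs
        substituted.append(mst_length(modified))
    return int(np.sum(np.array(substituted) < base))

n_cases, n_members, n_dim = 1000, 10, 2
counts = np.zeros(n_members + 1, dtype=int)
for _ in range(n_cases):
    ens = rng.normal(size=(n_members, n_dim))
    obs = rng.normal(size=n_dim)          # same distribution: expect a near-flat tally
    counts[mst_rank(ens, obs)] += 1
print(np.round(counts / counts.sum(), 3))
```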

Page 20: Common verification methods for ensemble forecasts, and how to apply them properly

20

Rank histograms in higher dimensions: the “minimum spanning tree” histogram

• Graphical interpretation of MST is different and perhaps less intuitive than it is for uni-dimensional rank histogram.

Ref: Wilks, MWR, June 2004. See also Smith and Hansen, MWR, June 2004

Page 21: Common verification methods for ensemble forecasts, and how to apply them properly

21

Multi-variate rank histogram

• Standardize and rotate using the Mahalanobis transformation (see Wilks 2006 text).
• For each of the n members of the forecast and the observed, define the "pre-rank" as the number of vectors to its lower left (a number between 1 and n+1).
• The multi-variate rank is the rank of the observation pre-rank, with ties resolved at random.
• Composite multi-variate ranks over many independent samples and plot the rank histogram.
• Same interpretation as the scalar rank histogram (e.g., U-shape = under-dispersive).

based on Tilmann Gneiting's presentation at Probability and Statistics, 2008 AMS Annual Conf., New Orleans, and http://www.springerlink.com/content/q58j4167355611g1/

$$z_i = \mathbf{S}^{-1/2}\left(x_i - \bar{x}\right)$$

"Mahalanobis" transform (S is the forecasts' sample covariance)

Page 22: Common verification methods for ensemble forecasts, and how to apply them properly

22

Multi-variate rank histogram calculation

based on Tilmann Gneiting’s presentation at Probability and Statistics, 2008 AMS Annual Conf., New Orleans

F1, F2, F3, F4, F5, O pre-ranks: [1, 5, 3, 1, 4, 1] ; sorted: obs = either rank 1, 2, or 3 with p=1/3.
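A sketch of the multivariate rank recipe above (my own reading of the listed steps, on synthetic data): Mahalanobis-standardize the pooled observation and members using the forecasts' sample covariance, compute pre-ranks, and resolve ties in the observation's pre-rank at random.

```python
import numpy as np

rng = np.random.default_rng(3)

def multivariate_rank(ensemble, obs, rng):
    """Multivariate rank of the observation: Mahalanobis-standardize,
    compute pre-ranks, resolve ties in the observation's rank at random."""
    pool = np.vstack([obs, ensemble])               # row 0 is the observation
    perts = ensemble - ensemble.mean(axis=0)
    cov = perts.T @ perts / (len(ensemble) - 1)     # forecasts' sample covariance
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (pool - ensemble.mean(axis=0)) @ inv_sqrt   # "Mahalanobis" transform
    # pre-rank: number of pooled vectors at or below this one in every coordinate
    pre = np.array([np.sum(np.all(z <= z[i], axis=1)) for i in range(len(z))])
    ties = np.sum(pre == pre[0])
    return int(np.sum(pre < pre[0]) + rng.integers(ties) + 1)   # 1 .. n+1

n_cases, n_members, n_dim = 5000, 10, 2
counts = np.zeros(n_members + 1, dtype=int)
for _ in range(n_cases):
    ens = rng.normal(size=(n_members, n_dim))
    obs = rng.normal(size=n_dim)                    # same distribution: expect ~flat
    counts[multivariate_rank(ens, obs, rng) - 1] += 1
print(np.round(counts / counts.sum(), 3))
```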

Page 23: Common verification methods for ensemble forecasts, and how to apply them properly

23

Reliability diagrams

Page 24: Common verification methods for ensemble forecasts, and how to apply them properly

24

Reliability diagrams

The curve tells you what the observed frequency was each time you forecast a given probability. This curve ought to lie along the y = x line. Here it shows the ensemble-forecast system over-forecasts the probability of light rain.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Page 25: Common verification methods for ensemble forecasts, and how to apply them properly

25

Reliability diagrams

The inset histogram tells you how frequently each probability was issued.

Perfectly sharp: the frequency of usage populates only 0% and 100%.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Page 26: Common verification methods for ensemble forecasts, and how to apply them properly

26

Reliability diagrams

BSS = Brier Skill Score

$$\mathrm{BSS} = \frac{\mathrm{BS}(\mathrm{climo}) - \mathrm{BS}(\mathrm{forecast})}{\mathrm{BS}(\mathrm{climo}) - \mathrm{BS}(\mathrm{perfect})}$$

BS(•) measures the Brier Score, which you can think of as the squared error of a probabilistic forecast.

Perfect: BSS = 1.0. Climatology: BSS = 0.0.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences
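A sketch of how the reliability curve, the inset frequency-of-usage histogram, and the BSS might be computed from a set of probability forecasts and binary outcomes; the data are synthetic and deliberately over-forecast, and the ten-bin choice is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic probability forecasts and binary outcomes: outcomes occur with a
# probability somewhat lower than forecast, mimicking over-forecasting.
n = 50000
y = rng.random(n)                                   # forecast probabilities
o = (rng.random(n) < 0.8 * y).astype(float)         # observed events (1/0)

bins = np.linspace(0.0, 1.0, 11)                    # 10 probability bins
which = np.clip(np.digitize(y, bins) - 1, 0, 9)

usage = np.bincount(which, minlength=10)            # inset histogram
obs_freq = np.bincount(which, weights=o, minlength=10) / np.maximum(usage, 1)
fcst_avg = np.bincount(which, weights=y, minlength=10) / np.maximum(usage, 1)

bs_forecast = np.mean((y - o) ** 2)
bs_climo = np.mean((o.mean() - o) ** 2)             # constant climatological forecast
bss = 1.0 - bs_forecast / bs_climo                  # (BS_climo - BS_fcst)/(BS_climo - 0)

for f, freq, count in zip(fcst_avg, obs_freq, usage):
    print(f"avg fcst prob {f:4.2f}   obs freq {freq:4.2f}   N {count}")
print("BSS =", round(bss, 3))
```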

Page 27: Common verification methods for ensemble forecasts, and how to apply them properly

27

Brier score
• Define an event, e.g., precip > 2.5 mm.
• Let yi be the forecast probability for the ith forecast case.
• Let oi be the observed probability (1 or 0).

Then

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - o_i\right)^2$$

(So the Brier score is the average squared error of the probabilistic forecast.)

Ref: Wilks 2006 text

Page 28: Common verification methods for ensemble forecasts, and how to apply them properly

28

Brier score decomposition

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - o_i\right)^2$$

Suppose there are I allowable forecast values, e.g., I = 10 for (5%, 15%, ..., 95%), with frequency of usage Ni. Suppose an overall climatological probability ō and a conditional average observed probability ōi for the ith forecast value. Then

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I}N_i\left(y_i - \bar{o}_i\right)^2 \;-\; \frac{1}{n}\sum_{i=1}^{I}N_i\left(\bar{o}_i - \bar{o}\right)^2 \;+\; \bar{o}\left(1 - \bar{o}\right)$$

("reliability")   ("resolution")   ("uncertainty")

Ref: 2006 Wilks text
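A sketch of the decomposition on synthetic forecasts restricted to I = 10 allowable values; because the forecasts are reliable by construction, the reliability term comes out near zero, and the three terms recombine to the Brier score.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic forecasts restricted to I = 10 allowable values (5%, 15%, ..., 95%)
allowable = np.arange(0.05, 1.0, 0.10)
y = rng.choice(allowable, size=200000)
o = (rng.random(y.size) < y).astype(float)          # reliable by construction

n = y.size
obar = o.mean()
reliability = resolution = 0.0
for yi in allowable:
    mask = (y == yi)
    Ni = mask.sum()
    if Ni == 0:
        continue
    oi = o[mask].mean()                              # conditional observed frequency
    reliability += Ni * (yi - oi) ** 2 / n
    resolution += Ni * (oi - obar) ** 2 / n
uncertainty = obar * (1.0 - obar)

bs = np.mean((y - o) ** 2)
print("BS              :", round(bs, 5))
print("rel - res + unc :", round(reliability - resolution + uncertainty, 5))
print("reliability, resolution, uncertainty:",
      round(reliability, 5), round(resolution, 5), round(uncertainty, 5))
```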

Page 29: Common verification methods for ensemble forecasts, and how to apply them properly

29

Brier score decomposition

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - o_i\right)^2$$

Suppose there are I allowable forecast values, e.g., I = 10 for (5%, 15%, ..., 95%), with frequency of usage Ni. Suppose an overall climatological probability ō and a conditional average observed probability ōi for the ith forecast value. Then

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I}N_i\left(y_i - \bar{o}_i\right)^2 \;-\; \frac{1}{n}\sum_{i=1}^{I}N_i\left(\bar{o}_i - \bar{o}\right)^2 \;+\; \bar{o}\left(1 - \bar{o}\right)$$

("reliability")   ("resolution")   ("uncertainty")

The decomposition only makes sense when every sample is drawn from a distribution with an overall climatology of ō. For example, don't use it if your forecast sample mixes together data from both desert and rainforest locations. For more, see Hamill and Juras, Oct 2006 QJ.

Page 30: Common verification methods for ensemble forecasts, and how to apply them properly

Degenerate case:

Skill might appropriately be 0.0 if all samples with 0.0 probability are drawn from a climatology with 0.0 probability, and all samples with 1.0 are drawn from a climatology with 1.0 probability.

Page 31: Common verification methods for ensemble forecasts, and how to apply them properly

31

“Attributes diagram”

www.bom.gov.au/bmrc/wefor/staff/eee/verif/ReliabilityDiagram.gif, from Beth Ebert's verification web page, http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html, based on Hsu and Murphy, 1986, Int'l Journal of Forecasting

$$\mathrm{BSS} = \frac{\text{"Resolution"} - \text{"Reliability"}}{\text{"Uncertainty"}}$$

Uncertainty term always positive, so probability forecasts will exhibit positive skill if resolution term is larger in absolute value than reliability term. Geometrically, this corresponds to points on the attributes diagram being closer to 1:1 perfect reliability line than horizontal no-resolution line (from Wilks text, 2006, chapter 7)

Again, this geometric interpretation of the attributes diagram makes sense only if all samples used to populate the diagram are drawn from the same climatological distribution.If you are mixing samples from locations with different climatologies, this interpretation is no longer correct! (Hamill and Juras, Oct 2006 QJRMS)

Page 32: Common verification methods for ensemble forecasts, and how to apply them properly

32

Proposed modifications to standard reliability diagrams

(1) Block-bootstrap techniques (perhaps each forecast day is a block) to provide confidence intervals. See Hamill, WAF, April 1999, and Bröcker and Smith, WAF, June 2007.

(2) BSS calculated in a way that does not attribute false skill to varying climatology (talk later this morning)

(3) Distribution of climatological forecasts plotted as horizontal bars on the inset histogram. Helps explain why there is small skill for a forecast that appears so reliable (figure from Hamill et al., MWR, 2008).

Page 33: Common verification methods for ensemble forecasts, and how to apply them properly

33

When does reliability ≠ reliability?

Probabilities directly estimated from a “reliable” ensemble system with only a few members may not produce diagonal reliability curves due to sampling variability.

10 members 50 members

ref: Richardson 2001 QJRMS

Page 34: Common verification methods for ensemble forecasts, and how to apply them properly

34

Sharpness also important

"Sharpness" measures the specificity of the probabilistic forecast. Given two reliable forecast systems, the one producing the sharper forecasts is preferable.

But: you don't want sharpness without reliability; that implies unrealistic confidence.

Page 35: Common verification methods for ensemble forecasts, and how to apply them properly

35

Sharpness ≠ resolution

• Sharpness is a property of the forecasts alone; a measure of sharpness in the Brier score decomposition would be how populated the extreme Ni's are.

$$\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{I}N_i\left(y_i - \bar{o}_i\right)^2 \;-\; \frac{1}{n}\sum_{i=1}^{I}N_i\left(\bar{o}_i - \bar{o}\right)^2 \;+\; \bar{o}\left(1 - \bar{o}\right)$$

("reliability")   ("resolution")   ("uncertainty")

Page 36: Common verification methods for ensemble forecasts, and how to apply them properly

“Spread-error” relationships are important, too.

The ensemble-mean error from a sample of this pdf should, on average, be low.

The ensemble-mean error should be moderate on average.

The ensemble-mean error should be large on average.

Small-spread ensemble forecasts should have less ensemble-mean error than large-spread forecasts; in some sense this is a conditional reliability dependent upon the amount of sharpness.

36

Page 37: Common verification methods for ensemble forecasts, and how to apply them properly

37

Why would you expect a spread-error relationship?

• The squared ensemble-mean error ought to have the same expected value as the squared ensemble spread.

• If V and the xi are sampled from the same distribution, then

$$E\left[\left(x_i - \bar{x}\right)^2\right] = E\left[\left(V - \bar{x}\right)^2\right]$$

(spread)²   (ens.-mean error)²

• Sometimes quantified with a correlation between spread and error.
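A small synthetic check of both statements (a sketch, with illustrative numbers): give each case its own spread, verify that the mean squared ensemble-mean error matches the mean ensemble variance (up to a small finite-ensemble factor), and note that the case-by-case spread-error correlation is positive but well below 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n_cases, n_members = 20000, 20

# Each case gets its own spread; truth and members are drawn from the same pdf.
case_spread = rng.lognormal(mean=0.0, sigma=0.5, size=n_cases)
truth = case_spread * rng.normal(size=n_cases)
members = case_spread[:, None] * rng.normal(size=(n_cases, n_members))

ens_mean = members.mean(axis=1)
spread_sq = members.var(axis=1, ddof=1)        # (spread)^2, case by case
err = truth - ens_mean                         # ensemble-mean error, case by case

print("mean spread^2        :", round(spread_sq.mean(), 3))
print("mean squared error   :", round(np.mean(err ** 2), 3))
print("corr(spread, |error|):",
      round(np.corrcoef(np.sqrt(spread_sq), np.abs(err))[0, 1], 3))
```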

Page 38: Common verification methods for ensemble forecasts, and how to apply them properly

Spread-error correlations with 1990’s NCEP GFS

At a given grid point, the spread S is assumed to be a random variable with a lognormal distribution,

$$\ln S \sim N\left(\ln S_m,\ \beta\right)$$

where Sm is the mean spread and β is its standard deviation.

As β increases, there is a wider range of spreads in the sample. One would then expect the possibility of a larger spread-skill correlation.

(Figure: spread-error correlation as a function of β.)

Lesson: spread-error correlations inherently limited by amount of variation in spread

Ref: Whitaker and Loughe, Dec. 1998 MWR
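A small simulation in the spirit of the figure (mine, not the paper's): draw spreads from the lognormal distribution above for several values of β, draw an error whose standard deviation equals the spread, and look at how the spread-error correlation grows with β.

```python
import numpy as np

rng = np.random.default_rng(7)
n_cases, S_m = 100000, 1.0

for beta in [0.1, 0.25, 0.5, 1.0]:
    spread = np.exp(rng.normal(np.log(S_m), beta, size=n_cases))  # ln S ~ N(ln Sm, beta)
    abs_error = np.abs(spread * rng.normal(size=n_cases))         # error sd equals the spread
    corr = np.corrcoef(spread, abs_error)[0, 1]
    print(f"beta = {beta:4.2f}   corr(spread, |error|) = {corr:4.2f}")
```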

Page 39: Common verification methods for ensemble forecasts, and how to apply them properly

39

Spread-error relationships and precipitation forecasts

True spread-skill relationships are harder to diagnose if the forecast PDF is non-normally distributed, as is typically the case for precipitation forecasts.

Commonly, the spread is no longer independent of the mean value; it's larger when the amount is larger.

Hence, you may get an apparent spread-error relationship, but this may reflect variations in the mean forecast rather than real spread-skill.

Ref & possible solution: Hamill and Colucci, MWR, Mar. 1998

Page 40: Common verification methods for ensemble forecasts, and how to apply them properly

40

Is the spread-error correlation as high as it could be?

(Figure: hurricane track error and spread.)

Ref: Hamill et al., MWR 2010 conditionally accepted.

Page 41: Common verification methods for ensemble forecasts, and how to apply them properly

…use one member as synthetic verification to gauge potential spread-error correlation

41

Page 42: Common verification methods for ensemble forecasts, and how to apply them properly

42

Spread vs. error in more than one dimension: verification of ellipse eccentricity

Hurricane Bill, initialized 00 UTC 19 August 2009.

A bi-variate normal was fitted to each ensemble's forecast track positions. The ellipse encloses 90% of the fitted probability.

Ref: Hamill et al., MWR 2010 conditionally accepted.

Page 43: Common verification methods for ensemble forecasts, and how to apply them properly

43

Ellipse eccentricity analysis

$$\mathbf{x}'_l = \left(x_l(1) - \bar{x}_l,\ \ldots,\ x_l(n_t) - \bar{x}_l\right) \big/ \left(n_t - 1\right)^{1/2}$$

$$\mathbf{x}'_j = \left(x_j(1) - \bar{x}_j,\ \ldots,\ x_j(n_t) - \bar{x}_j\right) \big/ \left(n_t - 1\right)^{1/2}$$

l = longitude, j = latitude, nt = number of tracked members

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}'_l \\ \mathbf{x}'_j \end{bmatrix}, \qquad \mathbf{F} = \mathbf{X}\mathbf{X}^{T} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{-1} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{T} = \left(\mathbf{E}\boldsymbol{\Lambda}^{1/2}\right)\left(\mathbf{E}\boldsymbol{\Lambda}^{1/2}\right)^{T}$$

The projection of the ensemble-mean track error onto eigenvector e1, averaged over cases, should be consistent with the projection of the members onto e1, averaged over cases and members; likewise for e2.

Question: are the error projections larger along the direction where the ellipse is stretched out?

Page 44: Common verification methods for ensemble forecasts, and how to apply them properly

Ellipse eccentricity analysis (non-homogeneous)

Page 45: Common verification methods for ensemble forecasts, and how to apply them properly

Ellipse eccentricity analysis (non-homogeneous)

45

Notes for GFS/EnKF:

(1) Along the major axis of the ellipse, the average projection of the errors is consistent with the projection of the members; the spread is well estimated.
(2) Along the minor axis of the ellipse, the projection of the errors is slightly larger than the projection of the members; too little spread.
(3) Together, these imply more isotropy is needed.
(4) Still, some separation of the projection of the error onto the ellipses (dashed lines) indicates there is some skill in forecasting ellipticity.

Page 46: Common verification methods for ensemble forecasts, and how to apply them properly

Another way of conveying spread-error relationships.

46

Averaging the individual samples into bins, then plotting a curve.

Ref: Wang and Bishop, JAS, May 2003.

Page 47: Common verification methods for ensemble forecasts, and how to apply them properly

47

Part 2: other common (and uncommon) ensemble-related evaluation tools

• What do they tell us about the ensemble?
• How are they computed?
• Is there some relation to reliability, sharpness, or other previously discussed attributes?
• What are some issues in their usage?

Page 48: Common verification methods for ensemble forecasts, and how to apply them properly

48

Isla Gilmour’s non-linearity index

Ref: Gilmour et al., JAS, Nov. 2001

how long does a linear regime persist?

Page 49: Common verification methods for ensemble forecasts, and how to apply them properly

49

Power spectra and saturation

Ref: Tribbia and Baumhefner, MWR, March 2004

The 0-curve represents the power difference between analyses as a function of global wavenumber.

Other numbered curves indicate the power difference at different forecast leads, in days.

Diagnostics like this can be useful in evaluating the upscale growth characteristics of ensemble uncertainty.

Page 50: Common verification methods for ensemble forecasts, and how to apply them properly

50

Satterfield and Szunyogh’s “Explained Variance”

$$\mathrm{EV} = \frac{\overline{\left\|\xi^{\parallel}\right\|^2}}{\overline{\left\|\xi^{\parallel}\right\|^2} + \overline{\left\|\xi^{\perp}\right\|^2}}$$

How much of the error ξ (the difference of the truth from the ensemble mean) lies in the space spanned by the ensemble (∥) vs. orthogonal to it (⊥)? (The calculation is done over a limited-area region.)

One would expect that as EV decreases, more U-shaped multi-dimensional rank histograms would be encountered.

Ref: Satterfield and Szunyogh, MWR, March 2010
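One plausible way to compute EV for a single case (a sketch under my own reading of the definition above, not Satterfield and Szunyogh's code): build an orthonormal basis for the ensemble-perturbation subspace, split the error of the ensemble mean into components inside and orthogonal to it, and take the ratio of squared norms.

```python
import numpy as np

rng = np.random.default_rng(8)
n_members, n_grid = 10, 200                 # ensemble size, local state dimension

ensemble = rng.normal(size=(n_members, n_grid))
truth = rng.normal(size=n_grid)

perts = (ensemble - ensemble.mean(axis=0)).T        # columns = perturbation vectors
xi = truth - ensemble.mean(axis=0)                  # error of the ensemble mean

# Orthonormal basis for the space spanned by the perturbations
U, s, _ = np.linalg.svd(perts, full_matrices=False)
basis = U[:, s > 1e-10 * s.max()]

xi_par = basis @ (basis.T @ xi)                     # component within the ensemble span
xi_perp = xi - xi_par                               # orthogonal component

ev = np.sum(xi_par**2) / (np.sum(xi_par**2) + np.sum(xi_perp**2))
print("explained variance EV =", round(ev, 3))
```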

Page 51: Common verification methods for ensemble forecasts, and how to apply them properly

51

Example with forecasts from the "local ETKF"

top: 5x5 degree boxes
bottom: 10x10 degree boxes

The larger the box, the smaller the EV, but the patterns are similar.

Page 52: Common verification methods for ensemble forecasts, and how to apply them properly

52

Cumulative distribution function (CDF); used in the CRPS

• Ff(x) = Pr{X ≤ x}, where X is the random variable and x is some specified threshold.

Page 53: Common verification methods for ensemble forecasts, and how to apply them properly

Continuous Ranked Probability Score

• Let Fif(x) be the forecast probability CDF for the ith forecast case.
• Let Fio(x) be the observed probability CDF (a Heaviside function).

$$\mathrm{CRPS} = \frac{1}{n}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\left(F_i^f(x) - F_i^o(x)\right)^2\,dx$$

Page 54: Common verification methods for ensemble forecasts, and how to apply them properly

54

Continuous Ranked Probability Score

• Let Fif(x) be the forecast probability CDF for the ith forecast case.
• Let Fio(x) be the observed probability CDF (a Heaviside function)*.

$$\mathrm{CRPS} = \frac{1}{n}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\left(F_i^f(x) - F_i^o(x)\right)^2\,dx$$

(difference squared)

* or incorporate obs error; see Candille and Talagrand, QJRMS, 2007
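A sketch that evaluates the integral above numerically for ensemble forecasts, using the empirical ensemble CDF for Fif and a Heaviside step at the observation for Fio; the grid and data are illustrative. (For an ensemble, the same score can also be computed from pairwise absolute differences among members and the observation, but the direct integral keeps the connection to the formula explicit.)

```python
import numpy as np

rng = np.random.default_rng(9)
n_cases, n_members = 2000, 20

ensemble = rng.normal(size=(n_cases, n_members))
obs = rng.normal(size=n_cases)                 # same distribution as the members

# Grid over which to approximate the integral of (Ff - Fo)^2 dx
x = np.linspace(-8.0, 8.0, 1601)
dx = x[1] - x[0]

crps_total = 0.0
for members, o in zip(ensemble, obs):
    Ff = np.searchsorted(np.sort(members), x, side="right") / n_members  # empirical CDF
    Fo = (x >= o).astype(float)                                          # Heaviside at the obs
    crps_total += np.sum((Ff - Fo) ** 2) * dx
print("CRPS =", round(crps_total / n_cases, 3))
```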

Page 55: Common verification methods for ensemble forecasts, and how to apply them properly

55

Continuous Ranked Probability Skill Score (CRPSS)

$$\mathrm{CRPSS} = \frac{\mathrm{CRPS}(\mathrm{forecast}) - \mathrm{CRPS}(\mathrm{climo})}{\mathrm{CRPS}(\mathrm{perfect}) - \mathrm{CRPS}(\mathrm{climo})}$$

Like the Brier score, it's common to convert this to a skill score by normalizing by the skill of a reference forecast, perhaps climatology.

Danger: this can over-estimate forecast skill. See the later presentation on this.

Page 56: Common verification methods for ensemble forecasts, and how to apply them properly

56

Decomposition of CRPS

• Like Brier score, there is a decomposition of CRPS into reliability, resolution, uncertainty.

• Like Brier score, interpretation of this decomposition only makes sense if all samples are draws from a distribution with the same climatology.

Ref: Hersbach, WAF, Oct 2000, + Hamill and Juras, QJ, Oct 2006

Page 57: Common verification methods for ensemble forecasts, and how to apply them properly

57

Relative Operating Characteristic (ROC)

Hit Rate = H / (H+M)
FAR = F / (F+C)

Page 58: Common verification methods for ensemble forecasts, and how to apply them properly

58

Relative Operating Characteristic (ROC)

$$\mathrm{ROCSS} = \frac{\mathrm{AUC}_f - \mathrm{AUC}_{clim}}{\mathrm{AUC}_{perf} - \mathrm{AUC}_{clim}} = \frac{\mathrm{AUC}_f - 0.5}{1.0 - 0.5} = 2\,\mathrm{AUC}_f - 1$$

Page 59: Common verification methods for ensemble forecasts, and how to apply them properly

59

Method of calculation of ROC: parts 1 and 2

(1) Build a contingency table (Fcst ≥ T? vs. Obs ≥ T?) for each sorted ensemble member.

(Figure: a single-case example on a number line from 55 to 66; the observation falls below the threshold T, the lower sorted members also fall below T, and the upper sorted members fall above it, so for this case the lower members each tally a correct negative and the upper members each tally a false alarm in their 2x2 tables.)

(2) Repeat the process for other locations and dates, building up contingency tables for the sorted members.

Page 60: Common verification methods for ensemble forecasts, and how to apply them properly

60

Method of calculation of ROC: part 3

(3) From each sorted ensemble member's accumulated contingency table, get the hit rate and false alarm rate: HR = H / (H+M), FAR = F / (F+C), where H = hits, F = false alarms, M = misses, C = correct negatives (Fcst ≥ T? vs. Obs ≥ T?).

Sorted member 1:  H = 1106   F = 3       M = 5651   C = 73270   →   HR = 0.163, FAR = 0.000
Sorted member 2:  H = 3097   F = 176     M = 3630   C = 73097   →   HR = 0.504, FAR = 0.002
Sorted member 3:  H = 4020   F = 561     M = 2707   C = 72712   →   HR = 0.597, FAR = 0.007
Sorted member 4:  H = 4692   F = 1270    M = 2035   C = 72003   →   HR = 0.697, FAR = 0.017
Sorted member 5:  H = 5297   F = 2655    M = 1430   C = 70618   →   HR = 0.787, FAR = 0.036
Sorted member 6:  H = 6603   F = 44895   M = 124    C = 28378   →   HR = 0.981, FAR = 0.612

Page 61: Common verification methods for ensemble forecasts, and how to apply them properly

61

Method of calculation of ROC: part 4

(4) Plot hit rate vs. false alarm rate. Adding the (0, 0) and (1, 1) endpoints to the values from the sorted members gives:

HR = [0.000, 0.163, 0.504, 0.597, 0.697, 0.787, 0.981, 1.000]
FAR = [0.000, 0.000, 0.002, 0.007, 0.017, 0.036, 0.612, 1.000]

Again, this can overestimate forecast skill.
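A sketch of the four steps on synthetic data: each sorted member serves in turn as the decision threshold, a 2x2 contingency table is accumulated over all cases, the resulting (FAR, HR) points plus the (0, 0) and (1, 1) endpoints trace the curve, and the area under it gives ROCSS = 2 AUC − 1. The event threshold and forecast model below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n_cases, n_members = 20000, 10
threshold = 1.0                                  # event: value >= threshold

# Synthetic forecasts with some skill: members and obs scattered about a signal
signal = rng.normal(size=n_cases)
obs = signal + 0.7 * rng.normal(size=n_cases)
members = np.sort(signal[:, None] + 0.7 * rng.normal(size=(n_cases, n_members)), axis=1)

event = obs >= threshold
hr, far = [0.0], [0.0]
for k in range(n_members):                       # lowest sorted member first:
    warn = members[:, k] >= threshold            # requiring it to exceed T is most conservative
    H = np.sum(warn & event)
    F = np.sum(warn & ~event)
    M = np.sum(~warn & event)
    C = np.sum(~warn & ~event)
    hr.append(H / (H + M))
    far.append(F / (F + C))
hr.append(1.0)
far.append(1.0)

hr, far = np.array(hr), np.array(far)
auc = np.sum(0.5 * (hr[1:] + hr[:-1]) * np.diff(far))   # trapezoid rule
print("HR :", np.round(hr, 3))
print("FAR:", np.round(far, 3))
print("AUC =", round(auc, 3), "   ROCSS = 2*AUC - 1 =", round(2 * auc - 1, 3))
```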

Page 62: Common verification methods for ensemble forecasts, and how to apply them properly

62

ROC connection with hypothesis testing

A point on the ROC curve evaluates the probability of a Type I error (inappropriate rejection of the null hypothesis, which here is Obs < T) against 1 − the probability of a Type II error (failing to warn when the event Obs ≥ T occurs). The ROC curve provides the tradeoff as different ensemble data are used as the test statistic.

Contingency table (columns: Observed ≥ T?; rows: Fcst ≥ T?, our test statistic):

                Obs: YES         Obs: NO
Fcst: YES       H (hit)          F (false alarm)
Fcst: NO        M (miss)         C (correct no)

FAR = F / (F + C) = P(Type I error)
HR = H / (H + M) = 1 − P(Type II error)

Page 63: Common verification methods for ensemble forecasts, and how to apply them properly

63

Economic value diagrams

These diagrams tell you the potential economic value of your ensemble forecast system applied to a particular forecast aspect. A perfect forecast has a value of 1.0; climatology has a value of 0.0. Value differs with the user's cost/loss ratio.

Motivated by the search for a metric that relates ensemble forecast performance to things that customers will actually care about.

Page 64: Common verification methods for ensemble forecasts, and how to apply them properly

64

Economic value: calculation method

Assumes the decision maker alters actions based on weather forecast info.

C = cost of protection
L = Lp + Lu = total cost of a loss, where …
Lp = loss that can be protected against
Lu = loss that can't be protected against
N = no cost

h + m = o;   f + c = 1 − o

Page 65: Common verification methods for ensemble forecasts, and how to apply them properly

65

Economic value, continued

Suppose we have the contingency table of forecast outcomes, [h, m, f, c]. Then we can calculate the expected value of the expenses from a forecast, from climatology, and from a perfect forecast:

$$E_{forecast} = fC + h\left(C + L_u\right) + m\left(L_p + L_u\right)$$

$$E_{climate} = \mathrm{Min}\left[o\left(L_p + L_u\right),\ C + oL_u\right] = oL_u + \mathrm{Min}\left[oL_p,\ C\right]$$

$$E_{perfect} = o\left(C + L_u\right)$$

$$V = \frac{E_{climate} - E_{forecast}}{E_{climate} - E_{perfect}} = \frac{\mathrm{Min}\left[oL_p,\ C\right] - (h+f)C - mL_p}{\mathrm{Min}\left[oL_p,\ C\right] - oC}$$

Note that the value will vary with C, Lp, Lu; different users with different protection costs may experience a different value from the forecast system.

h + m = o;   f + c = 1 − o

Ref: Zhu et al. 2002 BAMS

Page 66: Common verification methods for ensemble forecasts, and how to apply them properly

66

From ROC to economic value

$$\mathrm{HR} = \frac{h}{o}, \qquad \mathrm{FAR} = \frac{f}{1-o}, \qquad m = o - \mathrm{HR}\cdot o$$

$$V = \frac{\mathrm{Min}\left[o,\ C/L_p\right] - (h+f)\left(C/L_p\right) - m}{\mathrm{Min}\left[o,\ C/L_p\right] - o\left(C/L_p\right)} = \frac{\mathrm{Min}\left[o,\ C/L_p\right] - \left(C/L_p\right)\mathrm{FAR}\left(1-o\right) + \mathrm{HR}\cdot o\left(1 - C/L_p\right) - o}{\mathrm{Min}\left[o,\ C/L_p\right] - o\left(C/L_p\right)}$$

Value is now seen to be related to FAR and HR, the components of the ROC curve. An (HR, FAR) point on the ROC curve will thus map to a value curve (as a function of C/L).

Ref: ibid.
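A sketch of this mapping: given a climatological event frequency o and one (HR, FAR) point from the ROC curve (both illustrative numbers below), compute the value V as a function of the user's cost/loss ratio C/Lp. The envelope over all sorted members' (HR, FAR) points gives the overall value curve shown on the next slide.

```python
import numpy as np

o = 0.10                        # climatological event frequency (illustrative)
HR, FAR = 0.70, 0.05            # one (HR, FAR) point from an ROC curve (illustrative)

def value(cost_loss, o, HR, FAR):
    """Economic value V for a user with cost/loss ratio C/Lp, using
    h = HR*o, f = FAR*(1 - o), m = o - HR*o in the formula above."""
    h = HR * o
    f = FAR * (1.0 - o)
    m = o - h
    numer = np.minimum(o, cost_loss) - (h + f) * cost_loss - m
    denom = np.minimum(o, cost_loss) - o * cost_loss
    return numer / denom

cost_loss = np.linspace(0.01, 0.99, 99)
V = value(cost_loss, o, HR, FAR)
print("value at C/Lp = 0.05, 0.10, 0.50:",
      np.round(value(np.array([0.05, 0.10, 0.50]), o, HR, FAR), 2))
print(f"max value {V.max():.2f} at C/Lp = {cost_loss[V.argmax()]:.2f}")
```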

Page 67: Common verification methods for ensemble forecasts, and how to apply them properly

67

The red curve is from the ROC data for the member defining the 90th percentile of the ensemble distribution. The green curve is for the 10th percentile. The overall economic value is the maximum (use whatever member for the decision threshold that provides the best economic value).

Page 68: Common verification methods for ensemble forecasts, and how to apply them properly

68

Conclusions

• There are many ensemble verification techniques out there.
• A good principle is to thought-test every verification technique you seek to use: is there some way it could be misleading you about the characteristics of the forecast? (Commonly, the answer will be yes.)

Page 69: Common verification methods for ensemble forecasts, and how to apply them properly

69

Useful references

• Good overall references for forecast verification:
  – (1) Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences (2nd Ed). Academic Press, 627 pp.
  – (2) Beth Ebert's forecast verification web page, http://tinyurl.com/y97c74
• Rank histograms: Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550-560.
• Spread-skill relationships: Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292-3302.
• Brier score, continuous ranked probability score, reliability diagrams: Wilks text again.
• Relative operating characteristic: Harvey, L. O., Jr., and others, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863-883.
• Economic value diagrams:
  – (1) Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Royal Meteor. Soc., 126, 649-667.
  – (2) Zhu, Y., and others, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73-83.
• Overforecasting skill: Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology? Quart. J. Royal Meteor. Soc., Jan 2007 issue. http://tinyurl.com/kxtct