Common verification methods for ensemble forecasts


Page 1: Common verification methods for  ensemble forecasts

Common verification methods for ensemble forecasts

Tom Hamill, NOAA Earth System Research Lab,

Physical Sciences Division, Boulder, CO

[email protected]

NOAA Earth System Research Laboratory

Page 2: Common verification methods for  ensemble forecasts

What constitutes a “good” ensemble forecast?

Here, the observation lies outside the range of the ensemble, which was sampled from the pdf shown. Is this a sign of a poor ensemble forecast?

Page 3: Common verification methods for  ensemble forecasts

[Four example ensemble forecasts; the verifying observation falls at rank 1, 14, 5, and 3 of 21, respectively.]

Page 4: Common verification methods for  ensemble forecasts

One way of evaluating ensembles: “rank histograms” or “Talagrand diagrams”

Flat histogram: happens when the observed value is statistically indistinguishable from any other member of the ensemble; the ensemble is "reliable."

Sloped histogram: happens when the observed value is too commonly lower than the ensemble members (i.e., the forecast is biased high).

U-shaped histogram: happens when there are either some low and some high biases, or when the ensemble doesn't spread out enough.

We need lots of samples from many situations to evaluate the characteristics of the ensemble.

ref: Hamill, MWR, March 2001
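A minimal sketch (not from the slides) of how a rank histogram might be tallied from an archive of ensemble forecasts and verifying observations; the array shapes, function name, and synthetic test data are assumptions for illustration only.

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """ensembles: (ncases, nmembers) array; observations: (ncases,) array.
    Returns counts of the observation's rank among the members (nmembers + 1 bins)."""
    ncases, nmembers = ensembles.shape
    counts = np.zeros(nmembers + 1, dtype=int)
    for ens, obs in zip(ensembles, observations):
        rank = np.sum(ens < obs)      # 0 .. nmembers; ties ignored for continuous variables
        counts[rank] += 1
    return counts

# Synthetic check: if the observation is statistically indistinguishable from the
# members, the histogram should be roughly flat (the "reliable" case on the slide).
rng = np.random.default_rng(0)
truth = rng.normal(size=5000)
members = truth[:, None] + rng.normal(size=(5000, 20))
obs = truth + rng.normal(size=5000)
print(rank_histogram(members, obs))
```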

Page 5: Common verification methods for  ensemble forecasts

Rank histograms of Z500, T850, T2m

(from 1998 reforecast version of NCEP GFS)

Solid lines indicate ranks after bias correction. Rank histograms are particularly U-shaped for T2M, which is probably the most relevant of the three plotted here.

Page 6: Common verification methods for  ensemble forecasts

Rank histograms tell us about reliability - but what else is important?

"Sharpness" measures the specificity of the probabilistic forecast. Given two reliable forecast systems, the one producing the sharper forecasts is preferable.

But: we don't want sharp forecasts if they are not reliable; that implies unrealistic confidence.

Page 7: Common verification methods for  ensemble forecasts

“Spread-skill” relationships are important, too.

The ensemble-mean error from a sample of a small-spread pdf should be low on average; from a moderate-spread pdf, moderate on average; and from a large-spread pdf, large on average.

Small-spread ensemble forecasts should have less ensemble-mean error than large-spread forecasts.
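As a concrete way to check this, here is a minimal sketch (assumed, not from the slides) that bins cases by ensemble spread and compares the average ensemble-mean error across the bins; the function name and bin count are illustrative.

```python
import numpy as np

def spread_skill_bins(ensembles, observations, nbins=5):
    """Return (mean spread, mean |ensemble-mean error|) for equally populated spread bins."""
    spread = ensembles.std(axis=1)                        # per-case ensemble spread
    err = np.abs(ensembles.mean(axis=1) - observations)   # per-case ensemble-mean error
    edges = np.quantile(spread, np.linspace(0.0, 1.0, nbins + 1))
    bins = np.clip(np.digitize(spread, edges[1:-1]), 0, nbins - 1)
    return [(spread[bins == b].mean(), err[bins == b].mean()) for b in range(nbins)]
```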

Page 8: Common verification methods for  ensemble forecasts

Spread-skill and precipitation forecasts.

True spread-skill relationships are harder to diagnose if the forecast PDF is non-normally distributed, as is typically the case for precipitation forecasts.

Commonly, the spread is no longer independent of the mean value; it is larger when the forecast amount is larger.

Hence, you get an apparent spread-skill relationship, but this may reflect variations in the mean forecast rather than a real spread-skill relationship.

Page 9: Common verification methods for  ensemble forecasts

Reliability Diagrams

Page 10: Common verification methods for  ensemble forecasts

Reliability Diagrams

The curve tells you what the observed frequency was each time you forecast a given probability. This curve ought to lie along the y = x line. Here it shows that the ensemble-forecast system over-forecasts the probability of light rain.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Page 11: Common verification methods for  ensemble forecasts

Reliability Diagrams

The inset histogram tells you how frequently each probability was issued.

Perfectly sharp: the frequency of usage populates only 0% and 100%.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences
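A minimal sketch (assumed, not from the slides) of the numbers behind a reliability diagram: bin the issued probabilities and, within each bin, record the mean forecast probability, the observed relative frequency (the curve), and the number of cases (the inset histogram). The function name and bin count are illustrative.

```python
import numpy as np

def reliability_table(p_forecast, o_observed, nbins=11):
    """p_forecast: issued probabilities in [0, 1]; o_observed: 1 if the event occurred, else 0."""
    p = np.asarray(p_forecast, dtype=float)
    o = np.asarray(o_observed, dtype=float)
    edges = np.linspace(0.0, 1.0, nbins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p >= lo) & ((p < hi) | (hi == 1.0))
        if in_bin.any():
            # (mean issued probability, observed frequency, how often this bin was used)
            rows.append((p[in_bin].mean(), o[in_bin].mean(), int(in_bin.sum())))
    return rows
```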

Page 12: Common verification methods for  ensemble forecasts

Reliability Diagrams

BSS = Brier Skill Score

$$\mathrm{BSS} = \frac{\mathrm{BS}(\mathrm{climo}) - \mathrm{BS}(\mathrm{forecast})}{\mathrm{BS}(\mathrm{climo}) - \mathrm{BS}(\mathrm{perfect})}$$

BS(·) denotes the Brier Score, which you can think of as the squared error of a probabilistic forecast.

Perfect: BSS = 1.0. Climatology: BSS = 0.0.

Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Page 13: Common verification methods for  ensemble forecasts

Brier Score

• Define an event, e.g., precip > 2.5 mm.
• Let $P_i^f$ be the forecast probability for the ith forecast case.
• Let $O_i$ be the observed probability (1 or 0).

Then

$$\mathrm{BS}(\mathrm{forecast}) = \frac{1}{n_{\mathrm{cases}}} \sum_{i=1}^{n_{\mathrm{cases}}} \left( P_i^f - O_i \right)^2$$

(So the Brier score is the averaged squared error of the probabilistic forecast.)
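A minimal sketch (assumed) of the Brier score just defined and the skill-score form from the previous slide; the function and variable names are illustrative.

```python
import numpy as np

def brier_score(p_forecast, o_observed):
    """Mean squared error of the probabilistic forecast: mean of (P_i^f - O_i)^2."""
    p = np.asarray(p_forecast, dtype=float)
    o = np.asarray(o_observed, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(p_forecast, o_observed, p_climo):
    """BSS = [BS(climo) - BS(forecast)] / [BS(climo) - BS(perfect)], with BS(perfect) = 0."""
    o = np.asarray(o_observed, dtype=float)
    bs_fcst = brier_score(p_forecast, o)
    bs_climo = brier_score(np.full(o.shape, p_climo), o)   # constant climatological probability
    return (bs_climo - bs_fcst) / bs_climo
```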

Page 14: Common verification methods for  ensemble forecasts

Reliability after post-processing

Statistical correction of forecasts using a long, stable set of prior forecasts from the same model (as in MOS).

Ref: Hamill et al., MWR, Nov 2006

Page 15: Common verification methods for  ensemble forecasts

Cumulative Distribution Function (CDF)

• $F^f(x) = \Pr\{X \le x\}$, where X is the random variable and x is some specified threshold.

Page 16: Common verification methods for  ensemble forecasts

Continuous Ranked Probability Score

• Let $F_i^f(x)$ be the forecast probability CDF for the ith forecast case.
• Let $F_i^o(x)$ be the observed probability CDF (a Heaviside step function at the observed value).

$$\mathrm{CRPS}(\mathrm{forecast}) = \frac{1}{n_{\mathrm{cases}}} \sum_{i=1}^{n_{\mathrm{cases}}} \int_{-\infty}^{\infty} \left( F_i^f(x) - F_i^o(x) \right)^2 dx$$

Page 17: Common verification methods for  ensemble forecasts

Continuous Ranked Probability Score

(Same definition as on the previous slide; note that it is the squared difference between the forecast and observed CDFs that is integrated over x.)

Page 18: Common verification methods for  ensemble forecasts

Continuous Ranked Probability Skill Score (CRPSS)

$$\mathrm{CRPSS} = \frac{\mathrm{CRPS}(\mathrm{forecast}) - \mathrm{CRPS}(\mathrm{climo})}{\mathrm{CRPS}(\mathrm{perfect}) - \mathrm{CRPS}(\mathrm{climo})}$$

As with the Brier score, it is common to convert the CRPS to a skill score by normalizing against the score of climatology.
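A minimal sketch (assumed) of the CRPS integral from the previous slides for a single ensemble forecast, plus the skill-score normalization above. The forecast CDF is taken as the empirical CDF of the members, the observed CDF as a Heaviside step at the observation, and the integral is evaluated on a finite grid; the grid range and resolution are arbitrary choices.

```python
import numpy as np

def crps_one_case(members, obs, ngrid=2001):
    """Numerically integrate (F_f(x) - F_o(x))^2 on a grid spanning the members and observation."""
    members = np.sort(np.asarray(members, dtype=float))
    lo = min(members[0], obs) - 1.0
    hi = max(members[-1], obs) + 1.0
    x = np.linspace(lo, hi, ngrid)
    f_fcst = np.searchsorted(members, x, side="right") / members.size  # empirical forecast CDF
    f_obs = (x >= obs).astype(float)                                   # Heaviside at the observation
    return np.sum((f_fcst - f_obs) ** 2) * (x[1] - x[0])

def crpss(crps_forecast, crps_climo):
    """CRPSS = [CRPS(forecast) - CRPS(climo)] / [CRPS(perfect) - CRPS(climo)], CRPS(perfect) = 0."""
    return (crps_forecast - crps_climo) / (0.0 - crps_climo)
```

Averaging crps_one_case over all forecast cases gives the CRPS sum over i in the definition above.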

Page 19: Common verification methods for  ensemble forecasts

Relative Operating Characteristic (ROC)

Page 20: Common verification methods for  ensemble forecasts

Relative Operating Characteristic (ROC)

$$\mathrm{ROCSS} = \frac{\mathrm{AUC}_f - \mathrm{AUC}_{\mathrm{clim}}}{\mathrm{AUC}_{\mathrm{perf}} - \mathrm{AUC}_{\mathrm{clim}}} = \frac{\mathrm{AUC}_f - 0.5}{1.0 - 0.5} = 2\,\mathrm{AUC}_f - 1$$

Page 21: Common verification methods for  ensemble forecasts

Method of Calculation of ROC: Parts 1 and 2

(1) For a chosen event threshold T, build a 2×2 contingency table (forecast ≥ T? vs. observed ≥ T?) for each sorted ensemble member; for each case, one cell of each member's table is incremented according to whether that member and the observation exceed T.

[Figure: an example case with the sorted member values and the verifying observation plotted against the threshold T, and the resulting single-count contingency tables for each sorted member.]

(2) Repeat the process for other locations and dates, building up the contingency tables for the sorted members.

Page 22: Common verification methods for  ensemble forecasts

Method of Calculation of ROC: Part 3

(3) Get the hit rate and false alarm rate from the accumulated contingency table for each sorted ensemble member. With H = hits, M = misses, F = false alarms, and C = correct rejections (rows: forecast ≥ T? yes/no; columns: observed ≥ T? yes/no):

HR = H / (H + M),   FAR = F / (F + C)

For example, a table with H = 4020, F = 561, M = 2707, C = 72712 gives HR = 4020/6727 ≈ 0.597 and FAR = 561/73273 ≈ 0.007.

[Figure: the six accumulated contingency tables, one per sorted member, yielding (HR, FAR) pairs of (0.163, 0.000), (0.504, 0.002), (0.597, 0.007), (0.697, 0.017), (0.787, 0.036), and (0.981, 0.612).]

Page 23: Common verification methods for  ensemble forecasts

Method of Calculation of ROC: Part 3

Collecting the (HR, FAR) pairs for the six sorted members from the previous slide, together with the trivial endpoints (0, 0) and (1, 1):

HR = [0.000, 0.163, 0.504, 0.597, 0.697, 0.787, 0.981, 1.000]

FAR = [0.000, 0.000, 0.002, 0.007, 0.017, 0.036, 0.612, 1.000]

(4) Plot hit rate vs. false alarm rate.
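A minimal sketch (assumed, not from the slides) of the whole procedure: for each sorted-member criterion, tabulate hits, misses, false alarms, and correct rejections, convert to (HR, FAR), and integrate the curve to obtain the area under it and the ROC skill score (ROCSS = 2 AUC − 1). Variable and function names are illustrative, and the sketch assumes the event occurs at least once in the sample.

```python
import numpy as np

def roc_from_ensemble(ensembles, observations, threshold):
    """ensembles: (ncases, nmembers); observations: (ncases,). Event: value >= threshold."""
    obs_yes = observations >= threshold
    nmembers = ensembles.shape[1]
    n_exceed = np.sum(ensembles >= threshold, axis=1)   # members exceeding threshold per case
    hr, far = [0.0], [0.0]
    for k in range(nmembers, 0, -1):                    # strictest criterion first
        fcst_yes = n_exceed >= k                        # forecast "yes" if at least k members exceed
        h = np.sum(fcst_yes & obs_yes)
        m = np.sum(~fcst_yes & obs_yes)
        f = np.sum(fcst_yes & ~obs_yes)
        c = np.sum(~fcst_yes & ~obs_yes)
        hr.append(h / (h + m))
        far.append(f / (f + c))
    hr.append(1.0)
    far.append(1.0)
    auc = np.trapz(hr, far)                             # area under the (FAR, HR) curve
    return np.array(far), np.array(hr), auc, 2.0 * auc - 1.0
```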

Page 24: Common verification methods for  ensemble forecasts

Economic Value Diagrams

These diagrams tell you the potential economic value of your ensemble forecast system applied to a particular forecast aspect. A perfect forecast has a value of 1.0; climatology has a value of 0.0. Value differs with the user's cost/loss ratio.

Motivated by the search for a metric that relates ensemble forecast performance to things that customers actually care about.

Page 25: Common verification methods for  ensemble forecasts

Economic Value: calculation method

Assumes the decision maker alters actions based on weather forecast information.

C = cost of protection
L = Lp + Lu = total cost of a loss, where
Lp = loss that can be protected against
Lu = loss that can't be protected against
N = no cost

$$h + m = o, \qquad f + c = 1 - o$$

(Here h, m, f, and c are the relative frequencies of hits, misses, false alarms, and correct rejections, and o is the climatological event frequency.)

Page 26: Common verification methods for  ensemble forecasts

Economic value, continued. Suppose we have the contingency table of forecast outcomes, [h, m, f, c].

Then we can calculate the expected expense from the forecast, from climatology, and from a perfect forecast:

$$E_{\mathrm{forecast}} = fC + h(C + L_u) + m(L_p + L_u)$$

$$E_{\mathrm{climate}} = \min\left[\, o(L_p + L_u),\; C + oL_u \,\right] = oL_u + \min\left[\, oL_p,\; C \,\right]$$

$$E_{\mathrm{perfect}} = o(C + L_u)$$

$$V = \frac{E_{\mathrm{climate}} - E_{\mathrm{forecast}}}{E_{\mathrm{climate}} - E_{\mathrm{perfect}}} = \frac{\min[oL_p, C] - (h+f)C - mL_p}{\min[oL_p, C] - oC}$$

Note that the value will vary with C, Lp, Lu; different users with different protection costs may experience a different value from the forecast system.

$$h + m = o, \qquad f + c = 1 - o$$
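A minimal sketch (assumed) of the value calculation above from the relative frequencies h, m, f, c (which sum to 1), the protection cost C, and the protectable loss Lp; it assumes C ≤ Lp so that protecting can ever be worthwhile. As the derivation shows, the Lu terms cancel out of V and are not needed.

```python
def economic_value(h, m, f, c, C, Lp):
    """V = (E_climate - E_forecast) / (E_climate - E_perfect), with the Lu terms cancelling."""
    o = h + m                                  # climatological event frequency
    baseline = min(o * Lp, C)                  # cheapest fixed strategy given climatology alone
    numerator = baseline - (h + f) * C - m * Lp
    denominator = baseline - o * C
    return numerator / denominator
```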

Page 27: Common verification methods for  ensemble forecasts

From ROC to Economic Value

$$\mathrm{HR} = \frac{h}{o}, \qquad \mathrm{FAR} = \frac{f}{1 - o}, \qquad m = o - \mathrm{HR}\, o$$

$$V = \frac{\min[\, o,\, C/L_p \,] - (h + f)\, C/L_p - m}{\min[\, o,\, C/L_p \,] - o\, C/L_p}
   = \frac{\min[\, o,\, C/L_p \,] - (C/L_p)\,\mathrm{FAR}\,(1 - o) + \mathrm{HR}\, o\, (1 - C/L_p) - o}{\min[\, o,\, C/L_p \,] - o\, C/L_p}$$

Value is now seen to be related to FAR and HR, the components of the ROC curve. A (HR, FAR) point on the ROC curve will thus map to a value curve (as a function of C/L).
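A minimal sketch (assumed) of that mapping: given one (HR, FAR) point and the climatological event frequency o, evaluate V over a range of cost/loss ratios r = C/Lp using the expression above. The example HR/FAR pair and o ≈ 0.084 come from the contingency-table example a few slides back (6727 observed events out of 80000 cases).

```python
import numpy as np

def value_curve(hit_rate, false_alarm_rate, o, cost_loss_ratio):
    """Economic value as a function of r = C/Lp for a single (HR, FAR) decision threshold."""
    r = np.asarray(cost_loss_ratio, dtype=float)
    numerator = (np.minimum(o, r)
                 - r * false_alarm_rate * (1.0 - o)
                 + hit_rate * o * (1.0 - r)
                 - o)
    denominator = np.minimum(o, r) - o * r
    return numerator / denominator

# e.g. the member with HR = 0.697, FAR = 0.017 from the earlier example
ratios = np.linspace(0.01, 0.99, 99)
v = value_curve(0.697, 0.017, o=6727.0 / 80000.0, cost_loss_ratio=ratios)
```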

Page 28: Common verification methods for  ensemble forecasts

The red curve is from the ROC data for the member defining the 90th percentile of the ensemble distribution; the green curve is for the 10th percentile. The overall economic value is the maximum over members (use whichever member as the decision threshold provides the best economic value).

Page 29: Common verification methods for  ensemble forecasts

Forecast skill is often overestimated!

- Suppose you have a sample of forecasts from two islands, and each island has a different climatology.

- Weather forecasts are impossible on both islands.

- Simulate the "forecast" with an ensemble of draws from climatology.

- Island 1: F ~ N(μ, 1). Island 2: F ~ N(−μ, 1).

- Calculate ROCSS, BSS, ETS in the normal way. We expect no skill.

As the climatologies of the two islands begin to differ, the apparent "skill" increases, even though the samples are drawn from climatology.

These scores falsely attribute differences in samples’ climatologies to skill of the forecast.

Samples must have the same climatological event frequency to avoid this.
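A minimal sketch (assumed) of the two-island experiment described above: the "forecast" is just an ensemble of draws from each island's own climatology, yet the Brier skill score computed against the pooled climatology comes out positive. The threshold, sample sizes, and value of mu are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, ncases, nmembers, threshold = 1.0, 20000, 20, 1.0        # event: value >= threshold

def climatology_forecasts(island_mean):
    obs = rng.normal(island_mean, 1.0, ncases)
    ens = rng.normal(island_mean, 1.0, (ncases, nmembers))   # draws from the island's climatology
    prob = np.mean(ens >= threshold, axis=1)                 # ensemble relative frequency
    return prob, (obs >= threshold).astype(float)

p1, o1 = climatology_forecasts(+mu)                          # island 1: F ~ N(+mu, 1)
p2, o2 = climatology_forecasts(-mu)                          # island 2: F ~ N(-mu, 1)
p, o = np.concatenate([p1, p2]), np.concatenate([o1, o2])

bs_forecast = np.mean((p - o) ** 2)
bs_pooled_climo = np.mean((o.mean() - o) ** 2)               # pooled climatology as the reference
print("apparent BSS =", 1.0 - bs_forecast / bs_pooled_climo) # > 0 despite zero real skill
```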

Page 30: Common verification methods for  ensemble forecasts

Useful References

• Good overall references for forecast verification:
  – (1) Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences (2nd Ed). Academic Press, 627 pp.
  – (2) Beth Ebert's forecast verification web page, http://tinyurl.com/y97c74
• Rank histograms: Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550-560.
• Spread-skill relationships: Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292-3302.
• Brier score, continuous ranked probability score, reliability diagrams: Wilks text again.
• Relative operating characteristic: Harvey, L. O., Jr., and others, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863-883.
• Economic value diagrams:
  – (1) Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Royal Meteor. Soc., 126, 649-667.
  – (2) Zhu, Y., and others, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73-83.

• Overforecasting skill: Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology? Quart. J. Royal Meteor. Soc., Jan 2007 issue. http://tinyurl.com/kxtct

Page 31: Common verification methods for  ensemble forecasts

Spread-skill for 1990’s NCEP GFS

At a given grid point, the spread S is assumed to be a random variable with a lognormal distribution,

$$\ln S \sim N(\ln S_m, \sigma)$$

where Sm is the mean spread and σ is its standard deviation.

As σ increases, there is a wider range of spreads in the sample. One would then expect the possibility of a larger spread-skill correlation.
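A minimal sketch (assumed) of that model: draw spreads from the lognormal distribution above, draw ensemble-mean errors with standard deviation equal to the spread, and observe that the spread-error correlation grows as σ increases; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
ncases, s_mean = 50000, 1.0
for sigma in (0.1, 0.3, 0.6):
    spread = np.exp(rng.normal(np.log(s_mean), sigma, ncases))   # ln S ~ N(ln Sm, sigma)
    abs_err = np.abs(rng.normal(0.0, spread))                    # |error| drawn with std = spread
    print(sigma, np.corrcoef(spread, abs_err)[0, 1])             # correlation rises with sigma
```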