

Journal of Hydrology (2008) 350, 14–24

available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/jhydrol

Evaluation and calibration of operational hydrological ensemble forecasts in Sweden

Jonas Olsson *, Göran Lindström

Hydrological Research Unit, Swedish Meteorological and Hydrological Institute, 601 76 Norrköping, Sweden

Received 4 July 2007; received in revised form 2 November 2007; accepted 9 November 2007

KEYWORDS Ensemble; Probability; Forecasts; Flood warning; HBV model; Sweden

0022-1694/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.jhydrol.2007.11.010

* Corresponding author. Tel.: +46 (0) 11 4958322; fax: +46 (0) 11 4958250. E-mail addresses: Jonas.Olsson@smhi.se (J. Olsson), Goran.[email protected] (G. Lindström).

Summary Daily operational hydrological 9-day ensemble forecasts during 18 months in 45 catchments were evaluated in probabilistic terms. The forecasts were generated by using ECMWF meteorological ensemble forecasts as input to the HBV model, set up and calibrated for each catchment. Two kinds of reference discharges were used in the evaluation, “perfect forecasts” and actual discharge observations. A percentile-based evaluation indicated that the ensemble spread is underestimated, with a degree that decreases with increasing lead time. The share of this error related to hydrological model uncertainty was found to be similar in magnitude to the share related to underdispersivity in the ECMWF meteorological forecasts. A threshold-based evaluation indicated that the probability of exceeding a high discharge threshold is generally overestimated in the ensemble forecasts, with a degree that increases with probability level. In this case the contribution to the error from the meteorological forecasts is larger than the contribution from the hydrological model. A simple calibration method to adjust the ensemble spread by bias correction of ensemble percentiles was formulated and tested in five catchments. The method substantially improved the ensemble spread in all tested catchments, and the adjustment parameters were found to be reasonably well estimated as simple functions of the mean catchment discharge. © 2007 Elsevier B.V. All rights reserved.

Introduction

Within meteorology, medium-range (3–10 days) ensemble forecasting is an established way to acknowledge the uncertainty in the initial atmospheric conditions and to generate probabilistic forecasts. Operational ensemble forecasting is performed at several meteorological institutes and services, e.g. the European Centre for Medium-range Weather Forecasts, ECMWF (e.g. Molteni et al., 1996; Buizza et al., 2005). The forecasts are obtained by perturbing the initial state to produce a number of possible realisations, from which an atmospheric model is run to generate an ensemble of members (e.g. Persson, 2001). The main gain from ensemble forecasts is a measure of the forecast uncertainty. This measure may be of a qualitative nature, essentially with a small spread among the ensemble members indicating a high certainty, and vice versa. This is known as spread-skill relationships, which have been widely evaluated for meteorological ensemble forecasts (e.g. Scherrer et al., 2004). Further, the ensemble may be used to produce quantitative probabilistic forecasts. A precise estimation of the probabilities however requires that the forecasts are well calibrated, i.e. that they accurately describe the variability of the predictand. This is not always the case. For instance, the ECMWF forecasts tend to be somewhat underdispersive, i.e. the variability is underestimated (e.g. Buizza, 1997; Persson, 2001; Buizza et al., 2005).

An attractive prospect is to use the meteorological ensemble forecasts as input to a hydrological model, thereby producing medium-range hydrological (discharge) ensemble forecasts. This was attempted shortly after the meteorological ensemble forecasts became available, and to date a fair amount of scientific evaluation of hydrological ensemble forecasts has been performed. De Roo et al. (2003) applied ensemble forecasts as a part of their European flood forecasting system (EFFS), which includes the LISFLOOD rainfall–runoff model, and showed some case studies. Roulin and Vannitsem (2005) evaluated hydrological ensemble forecasts in two Belgian catchments and found a considerably higher accuracy than with forecasts based on climatological precipitation. The EFFS system was evaluated for two cases of flooding (rivers Meuse and Odra) by Gouweleeuw et al. (2005), who concluded that proper evaluation requires data from a long time period where different discharge levels are represented. Werner et al. (2005) evaluated the EFFS system for a case of flooding in the river Rhine in 1995 and proposed a new type of presentation that combines today’s ensemble forecast with previous forecasts. Roulin (2006) investigated how economic considerations can be combined with ensemble forecasts to improve decision-making in flooding situations.

Feeding a hydrological model with meteorological ensemble forecasts is a way to handle the input uncertainty. There is however also another major source of uncertainty involved, that of the hydrological model, and an important issue concerns how to accommodate this component as well. Pappenberger et al. (2005) tackled this issue by feeding the meteorological ensemble through an ensemble of LISFLOOD models with different but equally plausible parameter sets, obtained using the GLUE concept (Beven and Binley, 1992). The methodology was demonstrated for the Meuse case mentioned above. A method for statistical post-processing of hydrological ensemble forecasts that accounts for the model uncertainty was developed and tested by Seo et al. (2006). In this method, each original ensemble member is perturbed stochastically with an amount that reflects the model uncertainty. Verification of the method in five subcatchments of the Juniata River showed that the post-processed ensemble was accurate and unbiased both in the mean and in the probabilistic sense.

Virtually all published evaluations of hydrological ensemble forecasts have been limited to flooding case studies and/or single catchments. This is partly due to the fact that despite the recent scientific advancements, there are still only a few institutes where hydrological ensemble forecasts are issued on an operational basis. At the Swedish Meteorological and Hydrological Institute (SMHI), medium-range hydrological ensemble forecasting has been operational in around 50 catchments since July 2004. The forecasts are stored for subsequent analysis and a deterministic evaluation was recently performed by Johnell et al. (2007). The main aim of this study is to perform a probabilistic evaluation of the first 18 months of daily forecasts from this system. This is to our knowledge the largest data set for which medium-range hydrological ensemble forecasts have been evaluated to date. A further objective is to develop and test a methodology for adjusting the derived exceedance probabilities.

Operational system and analysed data

The hydrological forecasting system at the SMHI is based on the HBV-96 model (e.g. Bergström, 1976; Lindström et al., 1997), which is a widely used catchment-scale conceptual hydrological model. Water volumes in different compartments are determined according to the general water balance equation

$$P - E - Q = \frac{d}{dt}\left[S_{SP} + S_{SM} + S_{GW} + S_{SW}\right] \qquad (1)$$

where P denotes precipitation, E evapotranspiration, Q discharge, and S storage in various compartments: snow pack (SP), soil moisture (SM), groundwater (GW) and surface water (SW). The main input data are observations of precipitation and temperature (T), which are normally interpolated to catchment values. The model contains subroutines for estimation of snow melt and accumulation, evapotranspiration, soil moisture and generated runoff. A simple routing scheme connects runoff from different sub-catchments. The model is semi-distributed and a catchment may be divided into altitude and vegetation zones. The model has a number of free parameters that need to be calibrated, which may be done by an automatic procedure (Lindström, 1997).

Daily operational forecasting is performed for a number (currently around 50) of so-called indicator catchments that are rather evenly distributed throughout Sweden. The catchments have been selected to represent different geographical regions as well as catchment sizes (from 8 to 6110 km²; mean size 647 km²), although the main focus is on small and medium-sized catchments which respond rapidly to changes in the meteorological forcing. Forecasts are updated autoregressively, i.e. the model error for a certain day in the forecast is estimated as a function of the model error at the start of the forecast (e.g. Lundberg, 1982).
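The autoregressive updating can be illustrated with a minimal sketch. The text only states that the error on a forecast day is a function of the error at the start of the forecast; the first-order decay used here, the coefficient `rho` and the function name `ar_update` are assumptions made for illustration and are not the operational SMHI formulation.

```python
import numpy as np

def ar_update(model_forecast, error_at_start, rho=0.8):
    """Add a decaying share of the initial model error to each forecast day.

    error_at_start is observed minus simulated discharge when the forecast is issued.
    rho is an assumed AR(1) coefficient (hypothetical value).
    """
    model_forecast = np.asarray(model_forecast, dtype=float)
    lead_days = np.arange(1, model_forecast.size + 1)
    # Assumed decay: the error expected on day k is rho**k times the initial error.
    predicted_error = error_at_start * rho ** lead_days
    return model_forecast + predicted_error

# Hypothetical numbers: the model is 2.0 m3/s too low when the forecast is issued.
raw = [10.0, 11.0, 12.5]
updated = ar_update(raw, error_at_start=2.0, rho=0.5)
print(updated)
```

With this kind of scheme the correction is largest on the first forecast day and fades with lead time, so the remaining model error grows through the forecast.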

In 2004, the hydrological forecasting system was complemented with a routine for ensemble forecasting. In this routine, meteorological ensemble forecasts from the European Centre for Medium-range Weather Forecasts (ECMWF) (Persson, 2001) are used to provide the P and T input for the indicator catchments. In this case no interpolation is made, but the P and T values from the nearest cell in the ECMWF T255 grid are used directly. This grid has a spatial resolution of 0.7° which corresponds to ~60 km, thus an indicator catchment is generally covered by only one grid cell. ECMWF provides 10-day forecasts, but owing to time differences in the forecasting systems at ECMWF and SMHI, respectively, the first 12 h in the ECMWF forecasts cannot be used in the hydrological forecasts, and therefore the hydrological forecasts are 9 days long.
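A nearest-cell lookup of this kind might be sketched as follows. A regular 0.7° latitude-longitude grid is assumed purely for illustration (the layout of the operational grid is not described here), and the function name `nearest_cell`, the grid extents and the example coordinates are hypothetical.

```python
import numpy as np

def nearest_cell(lat, lon, grid_lats, grid_lons):
    """Return indices (i, j) of the grid cell centre closest to the point (lat, lon)."""
    lat_g, lon_g = np.meshgrid(np.radians(grid_lats), np.radians(grid_lons), indexing="ij")
    lat_p, lon_p = np.radians(lat), np.radians(lon)
    # Haversine formula on a unit sphere; only the ordering of distances matters.
    a = (np.sin((lat_g - lat_p) / 2) ** 2
         + np.cos(lat_p) * np.cos(lat_g) * np.sin((lon_g - lon_p) / 2) ** 2)
    dist = 2 * np.arcsin(np.sqrt(a))
    return np.unravel_index(np.argmin(dist), dist.shape)

# Hypothetical regular grid roughly covering Sweden.
grid_lats = np.arange(55.0, 70.0, 0.7)
grid_lons = np.arange(10.0, 25.0, 0.7)
i, j = nearest_cell(58.6, 16.2, grid_lats, grid_lons)
print(grid_lats[i], grid_lons[j])
```

The P and T series of cell (i, j) would then be passed to the catchment model without any interpolation.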

The ECMWF forecasts comprise 50 ensemble members and one control forecast, which are used to generate 51 9-day hydrological forecasts in each indicator catchment. In the ensemble routine, these forecasts are processed to generate five statistical percentiles for each forecast day: minimum (2% probability of non-exceedance), lower quartile (25%), median (50%), upper quartile (75%) and maximum (98%). These percentiles were used in the probabilistic evaluation below.
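As a sketch of this processing step, the five percentiles can be derived from a 51-member discharge ensemble as follows. The synthetic ensemble, the function name `ensemble_percentiles` and the use of NumPy's default (linear) quartile interpolation are assumptions; the text does not specify how the quartiles are computed operationally.

```python
import numpy as np

def ensemble_percentiles(members):
    """members: array of shape (n_members, n_days). Returns one series per percentile."""
    members = np.asarray(members, dtype=float)
    return {
        "min": members.min(axis=0),                  # treated as the 2% level
        "q25": np.percentile(members, 25, axis=0),
        "median": np.median(members, axis=0),
        "q75": np.percentile(members, 75, axis=0),
        "max": members.max(axis=0),                  # treated as the 98% level
    }

# Synthetic ensemble: 51 members (50 perturbed + 1 control), 9 forecast days.
rng = np.random.default_rng(0)
members = rng.gamma(shape=2.0, scale=5.0, size=(51, 9))
pct = ensemble_percentiles(members)
print(pct["median"])
```

With 51 members, the ensemble minimum and maximum each leave roughly 1/51 ≈ 2% of the probability mass outside the ensemble, which is consistent with the 2% and 98% levels quoted above.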

As the hydrological ensemble forecasting system was made operational in July 2004 and the present evaluation began in January 2006, 18 months of data were available. Out of all indicator catchments, a few were omitted in the present evaluation. In some of these the number of observations was insufficient for a reliable evaluation. In others the HBV model error was unusually large during the evaluation period, possibly because of insufficient model calibration (or errors in the observations), which may have a substantial impact on the total result. The final number of catchments evaluated is 45 and for these on average 9% of the data are missing.

Evaluation of the hydrological ensemble forecasts requires a reference discharge with which to compare. One obvious candidate is the observed discharge (OBS). The difference between HBV forecast and observation, however, contains error components from both the meteorological forecast and the hydrological model. As this evaluation focuses on the accuracy of the spread in the hydrological forecasts, generated only by the spread in the meteorological ensemble forecasts, the model error component needs to be eliminated. This was achieved by using as reference discharge “perfect HBV forecasts” (HBVpf). These are reconstructed forecasts (hindcasts) generated from the actually observed meteorological inputs, which are assumed to be without errors. This way the forecasts and the reference will have the same model error and the comparison thus reflects only the properties of the meteorological forecast. Thus HBVpf is the main reference discharge and for this purpose HBVpf forecasts were reconstructed for all catchments during the entire period. However, the reference discharge OBS was also used to evaluate the total, operational accuracy in the ensemble forecasts. This further made it possible to estimate the relative contributions to the total error from the meteorological forecast and the hydrological model, respectively.

Figure 1 Theoretical frequencies of reference discharge falling between different ensemble percentiles (a) and reliability diagram (b). [Panel (a): observed frequency (%) in the intervals <2%, 2%-25%, 25%-50%, 50%-75%, 75%-98% and >98%, theoretically correct spread. Panel (b): observed frequency (%) against forecasted probability (%), with the line of correct forecasts.]

Probabilistic evaluation methods

Two types of probabilistic evaluation methods have been employed, one percentile-based and one threshold-based. In the percentile-based evaluation, the frequency of reference discharge falling below the ensemble percentiles is calculated. If the ensemble spread is accurate, these frequencies should agree with the percentiles’ probabilities of non-exceedance. Thus, during 2% of the evaluation period (which here corresponds to 11 days) the reference discharge should lie above (below) the ensemble maximum (minimum). During 23% of the period it should fall between the upper (lower) quartile and the maximum (minimum), and during 25% of the period between the median and each of the quartiles. This theoretical frequency distribution is shown graphically in Fig. 1a, which corresponds to the Talagrand diagram that is widely used in the verification of meteorological ensemble forecasts (e.g. Persson, 2001).
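The percentile-based count can be sketched in a few lines. The function name `interval_frequencies`, the array layout and the toy numbers are assumptions chosen for illustration; only the six intervals and their theoretical frequencies come from the text.

```python
import numpy as np

LABELS = ["<2%", "2%-25%", "25%-50%", "50%-75%", "75%-98%", ">98%"]
THEORETICAL = np.array([2.0, 23.0, 25.0, 25.0, 23.0, 2.0])  # percent of cases

def interval_frequencies(reference, percentiles):
    """reference: (n,) reference discharges.
    percentiles: (n, 5) ensemble min, q25, median, q75, max for the same cases.
    Returns the percentage of cases falling in each of the six intervals."""
    reference = np.asarray(reference, dtype=float)
    percentiles = np.asarray(percentiles, dtype=float)
    # The number of percentiles strictly below the reference is the interval index 0..5.
    idx = (percentiles < reference[:, None]).sum(axis=1)
    counts = np.bincount(idx, minlength=6)
    return 100.0 * counts / reference.size

# Toy data: the same five percentiles for eight cases, with varying reference values.
percentiles = np.tile([1.0, 2.0, 3.0, 4.0, 5.0], (8, 1))
reference = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 2.2, 3.3])
freq = interval_frequencies(reference, percentiles)
for label, f, t in zip(LABELS, freq, THEORETICAL):
    print(f"{label:>8}: observed {f:5.1f}%  theoretical {t:4.1f}%")
```

Frequencies in the two outer intervals that exceed 2% indicate an ensemble that is too narrow.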

The percentile-based evaluation thus focuses only on the discharge levels given by the ensemble percentiles, and how they relate to the actually occurring discharge. An alternative approach is to focus on a certain event, such as the exceedance of some critical discharge threshold. In the threshold-based evaluation, the ensemble percentiles as well as the actual observation are compared with a discharge threshold. Similarly to the percentile-based evaluation, the estimated risk of threshold exceedance is compared with the corresponding frequency of reference discharge exceeding the threshold. For example, if in a certain forecast the threshold is located between the ensemble maximum (2% probability of exceedance) and the upper quartile (25%), the probability of exceedance may be approximated as (2 + 25)/2 = 13.5%. Thus, out of all such forecasts, in 13.5% of the cases the reference discharge should exceed the threshold if the ensemble probabilities are accurate. The result of this type of analysis may be plotted in a reliability diagram (e.g. Wilks, 1995). In this diagram, forecasted probability is plotted on the x-axis and observed frequency on the y-axis, with the line y = x indicating perfect forecasts (Fig. 1b).
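A minimal sketch of the threshold-based bookkeeping is given below. The class probabilities are interval midpoints formed in the same way as the 13.5% example; the function name `reliability_points`, the data layout and the toy numbers are assumptions, and no screening of cases is applied here.

```python
import numpy as np

# Mean exceedance probability (%) of each class, indexed by the number of
# percentiles (min, q25, median, q75, max) that lie above the threshold.
CLASS_PROB = np.array([1.0, 13.5, 37.5, 62.5, 86.5, 99.0])

def reliability_points(percentiles, reference, threshold):
    """Return {forecasted probability (%): observed exceedance frequency (%)}."""
    percentiles = np.asarray(percentiles, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n_above = (percentiles > threshold).sum(axis=1)   # class index per forecast
    exceeded = reference > threshold                  # did exceedance occur?
    points = {}
    for k in range(6):
        mask = n_above == k
        if mask.any():
            points[float(CLASS_PROB[k])] = 100.0 * exceeded[mask].mean()
    return points

# Toy data: four forecasts where only the ensemble maximum exceeds the threshold,
# and two forecasts where all percentiles exceed it.
percentiles = np.array([[2, 4, 6, 8, 12]] * 4 + [[11, 12, 13, 14, 15]] * 2, dtype=float)
reference = np.array([11, 5, 7, 3, 12, 9], dtype=float)
points = reliability_points(percentiles, reference, threshold=10.0)
print(points)
```

Each key-value pair is one point in the reliability diagram; points below the line y = x correspond to overestimated exceedance probabilities.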

The threshold-based evaluation has been performed with operational flood warning in mind. Thus, in the evaluation only cases when the discharge was below the threshold at the time of the forecast have been considered, since continued exceedance during the forecast is of limited interest. At SMHI, a flood warning is issued when e.g. the 2-year flow or the 10-year flow is expected to be exceeded. However, as this evaluation comprises only 18 months of data these thresholds are not suitable. Instead two lower threshold levels were selected, termed Q70 and Q90, the former being the 70th percentile (i.e. exceeded during 30% of the evaluation period) and the latter the 90th percentile (10%).
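The threshold selection and the screening of cases might look as follows. Which discharge series the percentiles are taken from is not stated in the text, so the input series here is an assumption, as are the function names `flood_thresholds` and `applicable` and the synthetic data.

```python
import numpy as np

def flood_thresholds(discharge):
    """Return (Q70, Q90): the 70th and 90th percentiles of a discharge series."""
    q70, q90 = np.percentile(np.asarray(discharge, dtype=float), [70, 90])
    return q70, q90

def applicable(discharge_at_issue, threshold):
    """True for forecasts issued while the discharge is still below the threshold."""
    return np.asarray(discharge_at_issue, dtype=float) < threshold

# Synthetic daily discharge series (m3/s) standing in for the evaluation period.
discharge = np.arange(1.0, 101.0)
q70, q90 = flood_thresholds(discharge)
share_above_q70 = (discharge > q70).mean()
keep = applicable([50.0, 80.0, 70.0], q70)
print(q70, q90, share_above_q70, keep)
```

Only forecasts for which `keep` is true would enter the threshold-based evaluation.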

Results and discussion

Percentile-based evaluation

Fig. 2 shows two examples of results from the percentile-based evaluation using reference discharge HBVpf. Fig. 2a shows the result for catchment Hammarby, forecast day 5, which represents catchments with a comparatively accurate ensemble spread. In this case the observed frequencies agree relatively well with the theoretical distribution. One notable deviation is that the frequencies of HBVpf forecasts falling between the median and one of the quartiles (25–50% and 50–75% in Fig. 2) are 5–10% too low. This implies that the interquartile range (IQR) is generally too small and the reference discharge falls outside it too often. Another clear difference is that the frequency of HBVpf forecasts falling below the ensemble minimum is ~10% too high, i.e. the ensemble minimum is systematically overestimated. A closer inspection of the data showed that this overestimation (1) is very small, i.e. the HBVpf forecast is very close to but slightly below the ensemble minimum, and (2) mainly occurs during periods of recession or low flow, when it is likely that no precipitation occurred in reality. Because of the low resolution of the ECMWF model grid, however, ensemble forecast precipitation has an overestimated frequency of days with non-zero precipitation, as compared with observed catchment average precipitation (Johnell et al., 2007). Thus it may be suspected that small amounts of erroneously forecasted precipitation during these periods are one source of this deviation.

Figure 2 Frequencies of reference discharge (HBVpf) between ensemble percentiles in catchments Hammarby (a) and Ersbo (b) for lead time 5 days. [Both panels: frequency (%) in the intervals <2%, 2%-25%, 25%-50%, 50%-75%, 75%-98% and >98%, compared with the theoretical distribution.]

Fig. 2b shows the result for catchment Ersbo, forecast day 5, which represents catchments with a comparatively inaccurate ensemble spread. In this case the HBVpf forecast far too often falls outside the entire ensemble spread, during nearly 50% of the time compared with the theoretical 4%, and the spread is thus far too narrow. Compared with Hammarby (Fig. 2a) the HBVpf forecast falls below the ensemble minimum nearly twice as often, and further lies above the ensemble maximum more than 20% of the time.

The variation with forecast lead time is exemplified in Fig. 3a, showing the results for Hammarby on days 1, 5 and 9 in the forecast. It is obvious that the agreement of the ensemble spread with the theoretical spread increases with increasing lead time. The HBVpf forecast falls outside the ensemble spread in ~40% of the forecasts on day 1 while this only happens in ~10% of the forecasts on day 9. The HBVpf forecast falls within the IQR in ~20% of the forecasts on day 1 and in ~40% on day 9. This indicates that the spread in single ECMWF precipitation forecasts (for day 1) is not sufficient for generating a proper spread in the discharge forecasts. However, as spread from a number of consecutive precipitation forecasts gradually accumulates in the discharge ensemble forecasts, the spread of the latter approaches the correct level. Fig. 3b shows the result averaged over all forecast lead times (day 1–9). The picture is very similar to Fig. 2a, i.e. the middle day in the forecast (day 5) represents the average behaviour well.

The variation with lead time averaged over all catchments is illustrated in Fig. 4. The picture is similar to that of Hammarby shown in Fig. 3a, i.e. the ensemble spread is greatly underestimated on the first day of the forecast and reasonably accurate on the last day. On average, the frequency of HBVpf forecasts falling within the ensemble spread increases from ~33% of the time on day 1 to ~75% on day 9. The overestimated frequency of HBVpf forecasts below the ensemble minimum is clear for all lead times.

Figure 3 Frequencies of reference discharge (HBVpf) between ensemble percentiles in catchment Hammarby for lead times 1, 5 and 9 days (a) and as average over all lead times (b). [Both panels: frequency (%) per percentile interval, compared with the theoretical distribution.]

Figure 4 Frequencies of reference discharge (HBVpf) between ensemble percentiles averaged over all catchments, for lead times 1, 5 and 9 days. [Panels: Day 1, Day 5, Day 9.]

When using the perfect forecast HBVpf as reference discharge, the result is not affected by model uncertainty. Assuming that the meteorological observations used to generate HBVpf are accurate, the only source of inaccuracies in the spread of the discharge ensemble forecast is thus inaccuracies in the input precipitation and temperature ensemble forecasts. Even if temperature has a strong impact during the snowmelt period, in general the precipitation is the main forcing with respect to runoff generation. Thus it may be assumed that underestimated spread in the ECMWF precipitation forecasts is the main source of the underestimated discharge spread. One reason for this is clearly the mismatch in spatial scale between the ECMWF grid (~3600 km²) and the catchments (~650 km²). Another reason is likely an underestimated spread in the ECMWF ensemble forecasts, which is typically found in verifications (e.g. Buizza et al., 2005). As no downscaling to catchment scale is performed in this study, it may be expected that the spread in the ECMWF precipitation better captures the observed variability in large catchments than in small ones. This would be reflected also in the discharge forecasts, i.e. the reference discharge would be more often located within the ensemble spread in large catchments than in small ones. Evaluation for the data used here, however, failed to reveal any clear trend with catchment size, probably because the catchments are too small (all but one are around or below half the size of the ECMWF grid).

As mentioned in Section “Operational system and analysed data”, evaluation was also performed using the actual discharge observations (OBS) as reference discharge. Fig. 5 shows the result of the percentile-based evaluation for both reference discharges, averaged over all catchments and lead times. The frequency of HBVpf forecasts falling within the ensemble spread is 63% whereas for OBS the frequency is 38%. The frequency of reference discharge within the IQR is 26% for HBVpf and 14% for OBS. The accuracy of the forecasted ensemble spread is thus approximately halved for reference discharge OBS, which indicates a similar contribution to the total error from errors in the meteorological forecast and the hydrological model, respectively.

Figure 5 Frequencies of reference discharge (HBVpf and OBS) between ensemble percentiles averaged over all catchments and lead times. [Frequency (%) per percentile interval, compared with the theoretical distribution.]

Threshold-based evaluation

In the threshold-based evaluation, for each catchment the results from all lead times (days 1–9) have been aggregated. This was done because of the small number of occasions when the ensemble forecasts indicate exceedance of a discharge threshold level, from a starting level below the threshold at the time of the forecast. Usually either the entire ensemble spread is below the threshold or the discharge is above the threshold at the time of the forecast. Therefore there is only a small number of applicable forecasts for each lead time. This is most pronounced for short lead times, when the ensemble spread is very narrow. For example, on average over all catchments there are in the evaluation period only two occasions when the threshold level Q70 is forecasted to be exceeded with a probability between 75% and 98% for day 1 in the forecast. For longer lead times both the ensemble spread and the number of applicable forecasts increase, but the latter seldom greatly exceeds 10. The limited number of cases makes the estimated frequencies of exceedance very uncertain for single lead times, so the results from all lead times were averaged to get reasonably stable frequency estimates. The average number of applicable cases for each catchment varies from ~100 for the highest probabilities to ~500–1500 for the lowest.

Figure 6 Reliability diagrams (Q70, HBVpf), averaged over all lead times, for catchments Krokfors kvarn (a) and Åkesta kvarn (b). Inserted diagram shows the number of cases for each forecasted probability between 1% (far left) and 99% (far right). [Both panels: observed frequency (%) against forecasted probability (%).]

Fig. 6 shows two examples of results from the threshold-based evaluation in terms of reliability diagrams for two catchments, threshold level Q70. The forecasted probabilities on the x-axis are the mean values of each considered interval. For example, the case when the ensemble maximum is below the threshold represents a probability between 0% and 2%, on average 1%, which is the smallest forecasted probability. The other forecasted probabilities are 13.5% (mean of 2 and 25), 37.5% (25 and 50), 62.5% (50 and 75), 86.5% (75 and 98) and 99% (98 and 100). The inserted diagrams show the number of forecasts applicable for each forecasted probability, as previously mentioned generally between 100 and 1500.

The result for catchment Krokfors kvarn, representing catchments with a good agreement between observed frequencies and forecasted probabilities, is shown in Fig. 6a. For all probability levels, the observed frequencies are very close to the theoretical y = x line. For example, out of 52 cases with a forecasted probability of 99% (i.e. all members exceed the threshold), exceedance actually occurred in 48, giving an observed frequency of 92.3% (top right corner in Fig. 6a). Fig. 6b shows the result for catchment Åkesta kvarn, representing catchments with less accurately forecasted probabilities. In Åkesta kvarn the probabilities of exceedance are substantially overestimated, particularly the high probability levels. In this case exceedance with 99% probability was forecasted 105 times, but the threshold was actually exceeded in only 72 of them (68.6%). Exceedance forecasted with 86.5% probability occurred in only 40% of the cases. In this catchment there is thus a pronounced risk of false alarms.

The reliability diagram averaged over all catchments is shown in Fig. 7a. Overall the curve reflects a behaviour in between the results shown in Fig. 6, characterised by an overestimation of the exceedance probabilities that increases with the probability level. For the highest probability (99%) the average overestimation is ~20%. Exceedance will thus occur in only ~80% of the cases when it is forecasted to almost certainly happen, i.e. in this sense every fifth case will be a false alarm. The curves for thresholds Q70 and Q90, respectively, are virtually identical, which is also the case for individual catchments. The only real difference is a smaller number of forecasted cases for threshold Q90, and thus a higher uncertainty in the estimated frequencies.

Figure 7 Reliability diagrams, averaged over all catchments and lead times, for thresholds Q70 and Q90 (HBVpf) (a) and for reference discharge HBVpf and OBS (Q70) (b). [Both panels: observed frequency (%) against forecasted probability (%).]

The situation in Fig. 7a, where the curve is approximately a straight line through the origin but less steep than y = x, can be adjusted by a simple calibration procedure. If all forecasted probabilities are multiplied by ~0.7 the adjusted values will agree very well with the y = x line. A drawback is however that 70% will be the highest probability possible to forecast.
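One way such a scaling factor could be obtained is a least-squares line through the origin, sketched below. The text does not say how the value ~0.7 was derived, so the fitting method is an assumption, as are the function names `fit_scale` and `calibrate`; the observed frequencies are illustrative values placed exactly on a 0.7 line and are not data from the study.

```python
import numpy as np

def fit_scale(forecast_prob, observed_freq):
    """Least-squares slope k of a line through the origin: observed ~ k * forecast."""
    p = np.asarray(forecast_prob, dtype=float)
    f = np.asarray(observed_freq, dtype=float)
    return float((p * f).sum() / (p * p).sum())

def calibrate(forecast_prob, k):
    """Scale forecasted probabilities (%) by the factor k."""
    return k * np.asarray(forecast_prob, dtype=float)

forecast_prob = np.array([1.0, 13.5, 37.5, 62.5, 86.5, 99.0])
observed_freq = 0.7 * forecast_prob        # illustrative, not measured
k = fit_scale(forecast_prob, observed_freq)
adjusted = calibrate(forecast_prob, k)
print(k, adjusted.max())
```

The largest adjusted value, 0.7 × 99% = 69.3%, illustrates the drawback noted above: probabilities above roughly 70% can no longer be issued.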

Fig. 7b shows the average reliability diagrams for reference discharge HBVpf and OBS, respectively, threshold Q70. The overestimation of exceedance probabilities is markedly larger for OBS, for which exceedance forecasted with 99% probability occurred in only ~40% of the cases. In total, the deviation from the y = x line is ~40% larger for OBS than for HBVpf, indicating that in this threshold-based perspective the impact of the hydrological model error is smaller than that of the meteorological forecast error.

Calibration of ensemble spread

As shown in Section “Percentile-based evaluation”, the spread in the hydrological ensemble forecasts is systematically underestimated. Consequently, the direct use of the ensemble spread does not give correct exceedance probabilities. For example, on average over all catchments, for forecast day 1 ~30% of the HBVpf forecasts exceed the ensemble maximum (Fig. 4). Thus the ensemble maximum reflects a non-exceedance probability of ~70% rather than the theoretical 98%. For reference discharge OBS, which is more relevant in operational forecasting, the discrepancy is substantially larger (Fig. 5) which complicates the use of the ensemble forecasts for decision-making purposes. Therefore it was decided to perform a follow-up study to the evaluation, with the objective of developing a method for adjusting the derived probabilities.

Adjusting the probabilities so that they reflect the variability of the discharge observations has, in principle, two components. One component is the required adjustment related to the meteorological forecasts, i.e. to adjust for the inaccuracies found in the evaluation using reference discharge HBVpf. As shown in Fig. 4, the type of adjustment required is an increase of the spread which is very large on day 1 and gradually decreases for longer lead times. The second component is the adjustment related to the evolution of the hydrological model error. Owing to the autoregressive updating used, the model error is zero at the time of the forecast and gradually increases for longer lead times. This indicates an opposite type of adjustment, where the increase of the spread increases with lead time.

As the two adjustment components have opposite trends with lead time, and further are of a similar order of magnitude (Section “Percentile-based evaluation”), a simple conceivable approach is a constant adjustment for all days in the forecast. One way to achieve this adjustment is by a simple translation of the original ensemble percentiles, upwards in the case of the maximum and the upper quartile and downwards in the case of the lower quartile and the minimum. This amounts to a bias correction of the percentiles which is independent of lead time. Fig. 8 illustrates such an adjustment of the minimum (2%) and maximum (98%) percentiles for an arbitrary forecast. The parameter Cmin (Cmax) denotes the constant amount with which the ensemble minimum (maximum) is shifted downwards (upwards). The quartile adjustment parameters will be denoted C25 and C75, and only C when referring to all four parameters.
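The translation of the percentiles may be sketched as follows. The function name, the dictionary layout and all numerical values are hypothetical; the clipping of the lowered percentiles at zero discharge is an added assumption that is not stated in the text.

```python
import numpy as np

def adjust_percentiles(percentiles, c_min, c25, c75, c_max):
    """Shift ensemble percentiles by constants that do not depend on lead time.

    percentiles: dict with keys "min", "q25", "q75", "max", each holding one
    discharge value (m3/s) per forecast day.
    """
    adjusted = {
        "min": np.asarray(percentiles["min"], dtype=float) - c_min,  # downwards
        "q25": np.asarray(percentiles["q25"], dtype=float) - c25,    # downwards
        "q75": np.asarray(percentiles["q75"], dtype=float) + c75,    # upwards
        "max": np.asarray(percentiles["max"], dtype=float) + c_max,  # upwards
    }
    # Assumption (not from the text): discharge cannot become negative
    for key in ("min", "q25"):
        adjusted[key] = np.maximum(adjusted[key], 0.0)
    return adjusted

# Hypothetical three-day forecast (m3/s) and hypothetical C values
forecast = {
    "min": [100.0, 98.0, 95.0],
    "q25": [105.0, 104.0, 102.0],
    "q75": [112.0, 115.0, 118.0],
    "max": [120.0, 130.0, 140.0],
}
adjusted = adjust_percentiles(forecast, c_min=3.0, c25=0.5, c75=0.5, c_max=6.0)
```

Because the same constants are applied on every forecast day, the shape of each percentile curve is preserved and only its level changes.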

Figure 8 Schematic of the proposed calibration method. [Axes: date (4–14 Aug) vs. Q (m³/s); curves: EPS-median (50%), original and adjusted EPS-min/max (2% and 98%), offset by Cmin and Cmax.]

The approach was evaluated for five of the catchments by identifying, for each lead time, the values of C that would provide observed frequencies of non-exceedance in line with the ensemble percentiles. For example, out of the 550 days in the evaluation period, the ensemble maximum should be exceeded on 11. The value of Cmax producing this number of exceedances was estimated in an iterative trial-and-error procedure, and similarly for the other C parameters. The results for Cmin and Cmax in two of the catchments are shown in Fig. 9. In catchment Pepparforsen (Fig. 9a) Cmin is constantly close to 1 m³/s, weakly increasing with lead time. The parameter Cmax varies more, between 4 m³/s and more than 5 m³/s, but is for all lead times relatively close to the mean value Cmax = 4.3 m³/s. In Sundstorp (Fig. 9b) Cmin is relatively stable around the mean value Cmin = 3.4 m³/s, whereas Cmax varies between 0 and 5 m³/s (mean Cmax = 2.0 m³/s) and clearly decreases with increasing lead time. The results for C25 and C75 as well as for the other three evaluated catchments are overall similar to the picture shown in Fig. 9, i.e. the C parameters are relatively independent of lead time and are generally well approximated by their mean value.
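One way to automate the search for Cmax at a single lead time is sketched below. Bisection is used here in place of manual trial and error, which is an assumption and not necessarily how the study proceeded; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def count_exceedances(obs, ens_max, c):
    """Number of days on which the observation exceeds the shifted maximum."""
    return int(np.sum(np.asarray(obs) > np.asarray(ens_max) + c))

def estimate_c_max(obs, ens_max, target_fraction=0.02, tol=1e-3):
    """Smallest shift (within tol) giving at most the target number of exceedances."""
    obs = np.asarray(obs, dtype=float)
    ens_max = np.asarray(ens_max, dtype=float)
    target = int(round(target_fraction * obs.size))  # e.g. 11 of 550 days
    lo, hi = 0.0, max(float(np.max(obs - ens_max)), 0.0)
    if count_exceedances(obs, ens_max, lo) <= target:
        return 0.0  # no upward shift needed
    # Invariant: too many exceedances at lo, acceptably few at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_exceedances(obs, ens_max, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi

# Synthetic illustration: 550 days, constant ensemble maximum of 10 m3/s
ens_max = np.full(550, 10.0)
obs = 10.0 + np.linspace(-5.0, 5.0, 550)
c_max = estimate_c_max(obs, ens_max)
n_exceed = count_exceedances(obs, ens_max, c_max)
```

The same search applies to the other parameters after changing the target fraction and, for Cmin and C25, the direction of the shift.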

Figure 9 Parameters Cmin and Cmax as a function of forecast day in catchments Pepparforsen (a) and Sundstorp (b). [Axes: forecast day (1–9) vs. C (m³/s); series: Cmin, Cmax and mean value.]

To verify that the mean value approximation of the C parameters is sufficiently accurate, the ensemble forecasts were re-evaluated after adjustment by the suggested procedure. For each forecast the ensemble percentiles were adjusted prior to the evaluation, using the mean C values. It should be noted that the entire data sample was thus used in both calibration and verification of the method, as the period available was considered too short for reliable split-sample evaluation, especially of the minimum and maximum percentiles.

Figure 10 Frequencies of reference discharge (OBS) between ensemble percentiles in catchment Sundstorp before (top) and after (bottom) adjustment, for lead times 1, 5 and 9 days. [Panels: Day 1, Day 5 and Day 9, original and adjusted; intervals: <2%, 2–25%, 25–50%, 50–75%, 75–98%, >98%.]

The re-evaluation was carried out in all five catchments, and Fig. 10 shows the result in catchment Sundstorp. The figure shows the frequency of observations falling between the ensemble percentiles, both the original (top) and the adjusted (bottom) ones, for forecast days 1, 5 and 9. It is apparent that the adjustment improves the agreement between the observed frequencies and the theoretical probabilities for all three lead times shown. In a few cases, the agreement with the theoretical probability deteriorates after the adjustment (e.g. for interval “2–25%” on day 9), but in total the improvement is remarkable. It may be noted that the frequency of observations above the ensemble maximum (“>98%” in Fig. 10) is somewhat overestimated on day 1 and somewhat underestimated on day 9, which is in line with the variation of Cmax with lead time (Fig. 9b). As Cmax systematically decreases with increasing lead time, the mean value approximation will underestimate the adjustment on day 1 and overestimate it on day 9. The effect is, however, rather limited, indicating that the mean value approximation is satisfactory even when C exhibits a pronounced and systematic variation with lead time.
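The frequencies compared in the re-evaluation can be computed as in the following sketch. The function name, the interval labels and the toy data are hypothetical, and an observation exactly equal to a percentile is counted in the lower interval, which is an arbitrary convention chosen here.

```python
import numpy as np

LABELS = ["<2%", "2-25%", "25-50%", "50-75%", "75-98%", ">98%"]

def interval_frequencies(obs, q_min, q25, q50, q75, q_max):
    """Percentage of observations falling in each of the six percentile intervals."""
    obs = np.asarray(obs, dtype=float)
    bounds = np.stack(
        [np.asarray(q, dtype=float) for q in (q_min, q25, q50, q75, q_max)],
        axis=1,
    )  # shape (n_days, 5)
    # Number of percentile curves lying below each observation: 0 .. 5
    interval = np.sum(obs[:, None] > bounds, axis=1)
    counts = np.bincount(interval, minlength=6)
    return dict(zip(LABELS, 100.0 * counts / obs.size))

# Toy data: six days, one observation in each interval
obs = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
freq = interval_frequencies(
    obs,
    q_min=np.full(6, 1.0), q25=np.full(6, 2.0), q50=np.full(6, 3.0),
    q75=np.full(6, 4.0), q_max=np.full(6, 5.0),
)
```

For a well-calibrated ensemble the six frequencies would approach the theoretical values 2, 23, 25, 25, 23 and 2%, which is the comparison made in Fig. 10.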

The level of improvement by the adjustment was similar in the other four evaluated catchments. As expected, the values of the C parameters vary between catchments, in particular reflecting their difference in size (50–700 km²) and mean discharge (0.8–6.3 m³/s). For the purpose of generalisation, the possibility to relate the C parameters to the catchment mean discharge Q was investigated. It was found that all C parameters may be reasonably well estimated as simple functions of Q according to

\[
\begin{aligned}
C_{25} &= C_{75} = C_q = 0.06\,Q \\
C_{\min} &= 6\,C_q \\
C_{\max} &= 12\,C_q
\end{aligned}
\tag{2}
\]

where Cq denotes a common adjustment of both quartiles. Fig. 11 shows the relationship between generalised C parameters estimated according to (2) and individually calibrated parameters in all five catchments. The amount of explained variance achieved by (2) is 70%.

Figure 11 Relationship between adjustment parameters obtained by calibration for the individual catchments (Individual) and by the generalised equation (2) (Generalised). [Axes: Generalised (m³/s) vs. Individual (m³/s), both 0–6; series: Cmin, C25, C75 and Cmax.]
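Equation (2) translates directly into a small helper. The function name and the example discharge are hypothetical; the coefficients are those given in (2).

```python
def generalised_parameters(mean_discharge):
    """Adjustment parameters (m3/s) from the catchment mean discharge Q (m3/s), Eq. (2)."""
    c_q = 0.06 * mean_discharge  # common adjustment of both quartiles
    return {
        "C25": c_q,
        "C75": c_q,
        "Cmin": 6.0 * c_q,
        "Cmax": 12.0 * c_q,
    }

# Hypothetical catchment with a mean discharge of 5.0 m3/s
params = generalised_parameters(5.0)
```

For this example the quartiles are shifted by about 0.3 m³/s, the minimum by about 1.8 m³/s and the maximum by about 3.6 m³/s.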

The adjustment procedure was developed and optimised to improve the forecasts in the percentile-based sense without regard to the performance in a threshold-based sense. An evaluation of the adjusted forecasts, however, revealed some improvement also of the reliability diagrams, although not as pronounced. This was at least partly because the unadjusted reliability diagrams were already relatively accurate in the evaluated catchments.

Conclusions

The main findings in this study can be summarized as follows:

• The spread in the raw (i.e. unadjusted) hydrological ensemble forecasts is underestimated, most pronounced on forecast day 1 and to a gradually lower degree with increasing lead time.

• The exceedance probability of high discharge thresholds is overestimated in the forecasts and this overestimation increases with increasing probability level.


• The contributions to these inaccuracies coming from errors in the meteorological forecast and the hydrological model, respectively, are similar in the case of the spread underestimation whereas the meteorological forecast error is slightly larger in the case of the overestimated exceedance probability.

• A simple calibration method can substantially improve the ensemble spread.

In essence, this investigation has shown that the output from a conceptual hydrological model, set up and calibrated for a catchment of size ~650 km², fed with raw ECMWF meteorological ensemble forecasts, is characterised in particular by substantial deficiencies in the ensemble spread and consequently the derived exceedance probabilities. This finding is neither new nor unexpected, but the significant contributions are (1) characterisation and quantification of the deficiencies involved and (2) assessment of the contributions from uncertainties in the meteorological forecast and the hydrological model, respectively. Concerning the latter, the relative proportions of the two contributions thus turned out to differ somewhat between the two evaluation methods used. The similar contribution found in the percentile-based evaluation indicates that in total, i.e. when all phases of the runoff process are considered (base flow, rising limb, recession), the two uncertainty components are of approximately equal importance. The threshold-based evaluation (as performed here), however, focuses on the rising limb. Thus, the dominance of the meteorological forecast error in this case indicates that the uncertainty in forecasts preceding flow peaks contributes more than the HBV model’s inability to describe the corresponding flow rise exactly.

From a scientific point of view, a separate treatment of the two uncertainty components is clearly desirable. Such an approach is further appropriate for producing an ensemble of adjusted members, which may be required in some hydrological applications, and not only adjusted percentiles. Concerning the meteorological forecast component, this would include some form of downscaling of the ECMWF forecasts from the ECMWF model grid to the catchment scale, effectively increasing the ensemble spread, prior to the hydrological modelling. Further, the underestimated spread is likely related also to underdispersivity in the ECMWF ensemble prediction system, which is likely to be reduced in future versions of the system (e.g. Buizza et al., 2005). Concerning the hydrological model component, this may be treated e.g. by combining the meteorological forecast ensemble with an ensemble of hydrological models with different parameter sets (e.g. Pappenberger et al., 2005). Alternatively an error term may be added to the hydrological ensemble members from a single model, reflecting the probability distribution of the expected model error (e.g. Seo et al., 2006).

From an operational point of view, however, a total adjustment of only the final discharge output has apparent advantages. Using separate adjustment methods first of all involves substantially more work, both in design and management. Both the meteorological and hydrological models involved are constantly developing, which puts high demands on the continuous update also of the adjustment methods. A total adjustment method also has to be kept updated, but this is far less laborious as it may involve only rather straightforward analyses of relatively limited amounts of directly available data. Further, with separate adjustments it has to be verified that these, when integrated, are indeed capable of generating accurate exceedance probabilities also for the final output.

It may be remarked that adjustment methods conceptually similar to the one proposed here, i.e. essentially bias correction of the ensemble percentiles, have been advocated also for meteorological forecasts (e.g. Hamill and Colucci, 1997). We consider the version of the method proposed here to be a first, rather crude approach, with a range of possible improvements. For example, it may be expected that the amount of adjustment required depends on several factors, such as the discharge level at the time of the forecast as well as properties of the meteorological forecast. The prospect of development in this direction, which essentially may be a way to indirectly take the two separate uncertainty components into account, will be assessed in connection with future evaluations of the ensemble forecasting system. Also the more advanced methods of post-processing ensemble forecasts recently developed within meteorology (e.g. Raftery et al., 2005) may be considered for the adjustment of hydrological ensemble forecasts.

Acknowledgements

The financial support from Räddningsverket (Swedish Rescue Services Agency), Elforsk (Swedish Electrical Utilities Joint Company for Research and Development) and SMHI (Swedish Meteorological and Hydrological Institute) is gratefully acknowledged. Many thanks to Anna Johnell for providing the data base, to Anders Persson for fruitful discussions and to two reviewers for constructive criticism of the original manuscript. The work has benefited from participation in COST Action 731 (Propagation of uncertainty in advanced meteo-hydrological forecast systems).

References

Bergström, S., 1976. Development and application of a conceptual runoff model for Scandinavian catchments. SMHI Norrköping, Report RHO 7.

Beven, K.J., Binley, A., 1992. The future of distributed models: model calibration and uncertainty prediction. Hydrological Processes 6, 279–298.

Buizza, R., 1997. Potential forecast skill of ensemble prediction and spread and skill distributions of the ECMWF ensemble prediction system. Monthly Weather Review 125, 99–119.

Buizza, R., Houtekamer, P.L., Toth, Z., Pellerin, G., Wei, M., Zhu, Y., 2005. A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Monthly Weather Review 133, 1076–1097.

De Roo, A., Gouweleeuw, B., Thielen, J., Bartholmes, J., Bongioannini-Cerlini, P., Todini, E., Bates, P., Horritt, M., Hunter, N., Beven, K., Pappenberger, F., Heise, E., Rivin, G., Hils, M., Hollingsworth, A., Holst, B., Kwadijk, J., Reggiani, P., van Dijk, M., Sattler, K., Sprokkereef, E., 2003. Development of a European flood forecasting system. International Journal of River Basin Management 1, 49–59.


Gouweleeuw, B.T., Thielen, J., Franchello, G., De Roo, A.P.J., Buizza, R., 2005. Flood forecasting using medium-range probabilistic weather prediction. Hydrology and Earth System Sciences 9, 365–380.

Hamill, T.M., Colucci, S.J., 1997. Verification of Eta-RSM short-range ensemble forecasts. Monthly Weather Review 125, 1312–1327.

Johnell, A., Lindström, G., Olsson, J., 2007. Deterministic evaluation of ensemble streamflow predictions in Sweden. Nordic Hydrology 38, 441–450.

Lindström, G., 1997. A simple automatic calibration routine for the HBV model. Nordic Hydrology 28, 153–168.

Lindström, G., Johansson, B., Persson, M., Gardelin, M., Bergström, S., 1997. Development and test of the distributed HBV-96 model. Journal of Hydrology 201, 272–288.

Lundberg, A., 1982. Combination of a conceptual model and an autoregressive error model for improving short time forecasting. Nordic Hydrology 13, 233–246.

Molteni, F., Buizza, R., Palmer, T.N., Petroliagis, T., 1996. The ECMWF ensemble prediction system: methodology and validation. Quarterly Journal of the Royal Meteorological Society 122, 73–119.

Pappenberger, F., Beven, K.J., Hunter, N.M., Bates, P.D., Gouweleeuw, B.T., Thielen, J., De Roo, A.P.J., 2005. Cascading model uncertainty from medium range weather forecasts (10 days) through a rainfall-runoff model to flood inundation predictions with the European Flood Forecasting System (EFFS). Hydrology and Earth System Sciences 9, 381–393.

Persson, A., 2001. User guide to ECMWF forecast products. Meteorological Bulletin M3.2, ECMWF.

Raftery, A.E., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133, 1155–1174.

Roulin, E., 2006. Skill and relative economic value of medium-range hydrological ensemble predictions. Hydrology and Earth System Sciences Discussions 3, 1369–1406.

Roulin, E., Vannitsem, S., 2005. Skill of medium-range hydrological ensemble predictions. Journal of Hydrometeorology 6, 729–744.

Scherrer, S.C., Appenzeller, C., Eckert, P., Cattani, D., 2004. Analysis of the spread-skill relations using the ECMWF ensemble prediction system over Europe. Weather and Forecasting 19, 552–565.

Seo, D.-J., Herr, H.D., Schaake, J.C., 2006. A statistical post-processor for accounting of hydrologic uncertainty in short-range ensemble streamflow prediction. Hydrology and Earth System Sciences Discussions 3, 1987–2035.

Werner, M., Reggiani, P., De Roo, A., Bates, P., Sprokkereef, E., 2005. Flood forecasting and warning at the river basin and at the European scale. Natural Hazards 36, 25–42.

Wilks, D.S., 1995. Statistical methods in the atmospheric sciences. Academic Press, San Diego.