Robin Hogan, Ewan O'Connor, Anthony Illingworth
University of Reading, UK
Chris Ferro, Ian Jolliffe, David Stephenson
University of Exeter, UK
Verifying cloud forecasts: What is the "half-life" of a cloud forecast? Is the Equitable Threat Score really equitable?
How skillful is a forecast?
• Most model evaluations of clouds test the cloud climatology
– What about individual forecasts?
• A standard measure shows an ECMWF forecast "half-life" of ~6 days in 1980 and ~9 days in 2000
– But it is virtually insensitive to clouds!
[Figure: ECMWF 500-hPa geopotential anomaly correlation]
Overview
• The "Cloudnet" processing of ground-based radar and lidar observations
– Continuous evaluation of the climatology of clouds in models
– Evaluation of the diurnal cycle of boundary-layer clouds
• Desirable properties of verification measures (skill scores)
– Usefulness for rare events: the Symmetric Extreme Dependency Score
– Equitability: is the "Equitable Threat Score" equitable?
• Testing the skill of cloud forecasts from seven models
– Skill versus cloud fraction, height, scale, forecast lead time, season...
– Estimating the forecast "half-life"
• Testing the skill of cloud forecasts from space
– Evaluation of the ECMWF model with ICESat/GLAS lidar
• Most results taken from these papers:
– Hogan, O'Connor & Illingworth (QJ 2009)
– Hogan, Ferro, Jolliffe & Stephenson (WAF, in press)
The Cloudnet project
• Aim: to retrieve and evaluate the crucial cloud variables in forecast and climate models
– 8+ models: global, mesoscale and high-resolution forecast models
– Variables: cloud fraction, LWC, IWC, plus a number of others
– Sites: 4 across Europe plus worldwide ARM sites
– Period: several years, to avoid unrepresentative case studies
• Current status
– Funded by the US Department of Energy Climate Change Prediction Program to apply the processing to ARM data worldwide
Level 1b
• Minimum instrument requirements at each site
– Cloud radar, lidar, microwave radiometer, rain gauge, model or sondes
[Figure: example radar and lidar displays]
Level 1c
• Instrument Synergy product
– Example of target classification and data quality fields (categories include ice, liquid, rain and aerosol)
Level 2a/2b
• Cloud products on the (L2a) observational and (L2b) model grids
– Water content and cloud fraction
[Figures: L2a IWC on the radar/lidar grid; L2b cloud fraction on the model grid]
[Figure: cloud-fraction time-height sections from Chilbolton observations and from the Met Office mesoscale, ECMWF global, Meteo-France ARPEGE, KNMI RACMO and Swedish RCA models]
Cloud fraction in 7 models
• Mean & PDF for 2004 for Chilbolton, Paris and Cabauw, 0-7 km (Illingworth et al., BAMS 2007)
– All models except DWD underestimate mid-level cloud
– Some have separate "radiatively inactive" snow (ECMWF, DWD); the Met Office has combined ice and snow but still underestimates cloud fraction
– Wide range of low cloud amounts in the models
– Not enough overcast boxes, particularly in the Met Office model
Diurnal cycle composite of clouds
Barrett, Hogan & O’Connor (GRL 2009)
– Meteo-France: local mixing scheme, so too little entrainment
– SMHI: prognostic TKE scheme, so no diurnal evolution
– All other models have a non-local mixing scheme in unstable conditions and an explicit formulation for entrainment at cloud top: better performance over the diurnal cycle
(Radar and lidar provide cloud boundaries and cloud properties above the site.)
Joint PDFs of cloud fraction
• Raw (1-hr) resolution
– 1 year from Murgtal
– DWD COSMO model
• 6-hr averaging
[Figure panels a-d: joint PDFs of observed versus modelled cloud fraction]
…or use a simple contingency table (DWD model, Murgtal):

                      Observed cloud          Observed clear sky
  Model cloud         a = 7194 (cloud hit)    b = 4098 (false alarm)
  Model clear sky     c = 4502 (miss)         d = 41062 (clear-sky hit)

Contingency tables
For a given set of observed events there are only 2 degrees of freedom in all possible forecasts (e.g. a and b), because two quantities are fixed:
– the total number of cases, n = a + b + c + d
– the base rate (observed frequency of occurrence), p = (a + c)/n
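A minimal sketch of building such a table (the function name and inputs are assumptions, not from the talk): it counts hits, false alarms, misses and clear-sky hits from co-located boolean cloud masks.

```python
import numpy as np

def contingency_table(model_cloud, obs_cloud):
    """model_cloud, obs_cloud: boolean arrays, True where the cloud
    fraction exceeds the chosen threshold."""
    a = np.sum(model_cloud & obs_cloud)     # cloud hit
    b = np.sum(model_cloud & ~obs_cloud)    # false alarm
    c = np.sum(~model_cloud & obs_cloud)    # miss
    d = np.sum(~model_cloud & ~obs_cloud)   # clear-sky hit
    return a, b, c, d
```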
Skill-bias diagrams
[Figure: skill-bias diagram. The horizontal axis runs from under-prediction through no bias to over-prediction; positive skill lies above the random-forecast line and negative skill below it. Marked points: best possible forecast, worst possible forecast, random unbiased forecast, constant forecast of non-occurrence, constant forecast of occurrence. Example reality: n = 16, p = 1/4.]
5 desirable properties of verification measures
1. "Equitable": all random forecasts receive an expected score of zero
– Constant forecasts of occurrence or non-occurrence also score zero
– Note that forecasting the right cloud climatology versus height, but with no other skill, should also score zero
2. Difficult to "hedge"
– Some measures reward under- or over-prediction
3. Useful for rare events
– Almost all measures are "degenerate" in that they asymptote to 0 or 1 for vanishingly rare events
4. Dependence on the full joint PDF, not just the 2x2 contingency table
– A difference between cloud fractions of 0.9 and 1 is as important for radiation as a difference between 0 and 0.1
– Difficult to achieve alongside the other desirable properties: won't be studied much today...
5. "Linear", so that an inverse exponential can be fitted to estimate the half-life
– Some measures (e.g. the Odds Ratio Skill Score) are very non-linear
Hedging
"Issuing a forecast that differs from your true belief in order to improve your score" (e.g. Jolliffe 2008)
• Hit rate H = a/(a+c)
– Fraction of events correctly forecast
– Easily hedged by randomly changing some forecasts of non-occurrence to occurrence, as the sketch below shows
[Figure: skill-bias diagram with contours of H = 0.5, 0.75 and 1]
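An illustrative Monte Carlo sketch (not from the talk; numbers are for demonstration only): a random unbiased forecast of events with base rate p = 1/4 has H of about 0.25, and flipping half of its "no" forecasts to "yes" raises H to about 0.62 without adding any skill.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.random(100_000) < 0.25    # observed events, p = 1/4
fcst = rng.random(100_000) < 0.25   # random, unbiased forecast

def hit_rate(f, o):
    a = np.sum(f & o)               # hits
    c = np.sum(~f & o)              # misses
    return a / (a + c)

print(hit_rate(fcst, obs))          # ~0.25

flip = ~fcst & (rng.random(100_000) < 0.5)  # hedge: flip half the "no"s
print(hit_rate(fcst | flip, obs))   # ~0.62: better score, no extra skill
```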
Equitability
Defined by Gandin and Murphy (1992):
• Requirement 1: an equitable verification measure awards all random forecasting systems, including those that always forecast the same value, the same expected score
– Inequitable measures rank some random forecasts above skillful ones
• Requirement 2: an equitable verification measure S must be expressible as a linear weighted sum of the elements of the contingency table, i.e. S = (S_a a + S_b b + S_c c + S_d d)/n
– This can safely be discarded: it is incompatible with other desirable properties, e.g. usefulness for rare events
• Gandin and Murphy reported that only the Peirce Skill Score, and linear transforms of it, is equitable by their requirements
– PSS = hit rate minus false alarm rate = a/(a+c) − b/(b+d)
– What about all the other measures reported to be equitable?
Some reportedly equitable measures
HSS = [x − E(x)] / [n − E(x)], with x = a + d
ETS = [a − E(a)] / [a + b + c − E(a)]
LOR = ln[ad/bc]
ORSS = [ad/bc − 1] / [ad/bc + 1]
where E(a) = (a+b)(a+c)/n is the expected value of a for an unbiased random forecasting system (and similarly for E(x)).
Random and constant forecasts all score zero, so these measures are all equitable, right?
Simple attempts to hedge will fail for all these measures. A coded version of the four measures is sketched below.
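The four measures coded directly from their definitions (a sketch; the function name is mine), evaluated for the DWD Murgtal table given earlier:

```python
import numpy as np

def scores(a, b, c, d):
    n = a + b + c + d
    Ea = (a + b) * (a + c) / n          # E(a) for an unbiased random forecast
    Ex = Ea + (c + d) * (b + d) / n     # E(x) for x = a + d
    hss = (a + d - Ex) / (n - Ex)
    ets = (a - Ea) / (a + b + c - Ea)
    odds = a * d / (b * c)              # odds ratio
    lor = np.log(odds)
    orss = (odds - 1) / (odds + 1)
    return hss, ets, lor, orss

print(scores(7194, 4098, 4502, 41062))
```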
Skill versus cloud-fraction threshold
• Consider 7 models evaluated over 3 European sites in 2003-2004
– LOR implies that skill increases for larger cloud-fraction thresholds
– HSS implies that skill decreases significantly for larger cloud-fraction thresholds
[Figure: LOR and HSS versus cloud-fraction threshold]
Extreme dependency score
• Stephenson et al. (2008) explained this behavior:
– Almost all scores have a meaningless limit as the "base rate" p → 0
– HSS tends to zero and LOR tends to infinity
• They proposed the Extreme Dependency Score:
EDS = 2 ln[(a+c)/n] / ln[a/n] − 1
– where n = a + b + c + d
• It can be shown that this score tends to a meaningful limit:
– Rewrite in terms of the hit rate H = a/(a+c) and base rate p = (a+c)/n, giving EDS = 2 ln(p) / ln(Hp) − 1
– Then assume a power-law dependence of H on p as p → 0: H ∝ p^δ
– In the limit p → 0 we find EDS = (1 − δ)/(1 + δ)
– This is useful because random forecasts have a hit rate converging to zero at the same rate as the base rate: δ = 1, so EDS = 0
– Perfect forecasts have a constant hit rate as the base rate falls: δ = 0, so EDS = 1
Symmetric extreme dependency score
• EDS problems:
– Easy to hedge (unless calibrated)
– Not equitable
• Solved by defining a symmetric version:
SEDS = {ln[(a+b)/n] + ln[(a+c)/n]} / ln[a/n] − 1
– All the benefits of EDS, none of the drawbacks!
Hogan, O’Connor and Illingworth (2009 QJRMS)
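EDS and SEDS coded from the formulae above (a sketch; function names are mine), again for the DWD Murgtal table:

```python
import numpy as np

def eds(a, b, c, d):
    n = a + b + c + d
    return 2.0 * np.log((a + c) / n) / np.log(a / n) - 1.0

def seds(a, b, c, d):
    n = a + b + c + d
    return (np.log((a + b) / n) + np.log((a + c) / n)) / np.log(a / n) - 1.0

print(eds(7194, 4098, 4502, 41062), seds(7194, 4098, 4502, 41062))
```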
Skill versus cloud-fraction threshold
SEDS has much flatter behaviour for all models (except the Met Office, which underestimates high-cloud occurrence significantly)
[Figure: LOR, HSS and SEDS versus cloud-fraction threshold]
Skill versus height
– Most scores are not reliable near the tropopause, because cloud fraction tends to zero there
[Figure: profiles of skill versus height for HSS, LOR, EDS, SEDS and LBSS]
• The new score reveals:
– Skill tends to decrease slowly at the tropopause
– Mid-level clouds (4-5 km) are the most skilfully predicted, particularly by the Met Office
– Boundary-layer clouds are the least skilfully predicted
A surprise?
• Is mid-level cloud really well forecast?
– The frequency of occurrence of these clouds is commonly too low (e.g. from Cloudnet: Illingworth et al. 2007)
– Specification of cloud phase is cited as a problem
– The higher skill could be because large-scale ascent has its largest amplitude at mid levels, so the cloud response to large-scale dynamics is clearest there
– Is the higher skill of the Met Office models (global and mesoscale) because they have arguably the most sophisticated microphysics, with separate liquid and ice water content (Wilson and Ballard 1999)?
• The low skill for boundary-layer cloud is not a surprise!
– A well-known problem for forecasting (Martin et al. 2000)
– Occurrence and height are a subtle function of subsidence rate, stability, free-troposphere humidity, surface fluxes, entrainment rate...
Key properties for estimating half-life
• We wish to model the score S versus forecast lead time t as
S(t) = S0 · 2^(−t/τ_1/2)
– where S0 is the initial score and τ_1/2 is the forecast "half-life"
• We need linearity
– Some measures "saturate" at the high-skill end (e.g. Yule's Q / ORSS)
– This leads to a misleadingly long fitted half-life
• ...and equitability
– The formula above assumes that the score tends to zero for very long forecasts, which will only occur if the measure is equitable
A sketch of this fit is given below.
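A minimal sketch of the half-life fit using scipy; the lead times and scores below are illustrative placeholders, not results from the talk.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, S0, tau):
    """S(t) = S0 * 2**(-t/tau), where tau is the half-life."""
    return S0 * 2.0 ** (-t / tau)

t = np.array([12.0, 24, 36, 48, 60, 72])              # lead time (hours)
S = np.array([0.42, 0.36, 0.31, 0.28, 0.25, 0.23])    # skill score (made up)

(S0, tau), _ = curve_fit(decay, t, S, p0=(0.5, 48.0))
print(f"S0 = {S0:.2f}, half-life = {tau / 24:.1f} days")
```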
• The expected values of a-d for a random forecasting system may score zero:
– S[E(a), E(b), E(c), E(d)] = 0
• But the expected score may not be zero!
– E[S(a,b,c,d)] = Σ P(a,b,c,d) S(a,b,c,d), summed over all possible tables
• The width of the distribution of random scores decreases for larger sample size n
– A measure is only equitable if the positive and negative scores cancel
Which measures are equitable?
[Figure: distributions of scores for random forecasts with n = 16 and n = 80; ETS and ORSS are asymmetric]
Asymptotic equitability
• Consider first unbiased forecasts of events that occur with probability p = 1/2
– The expected value of the "Equitable Threat Score" for a random forecasting system decreases below 0.01 only when n > 30
– We term this behaviour asymptotic equitability
– Other measures are never equitable, e.g. the Critical Success Index, CSI = a/(a+b+c), also known as the Threat Score
A Monte Carlo sketch of this expected score follows.
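This sketch (illustrative, not the paper's calculation) estimates the expected ETS of unbiased random forecasts with p = 1/2, showing that the expectation is positive for small n and only approaches zero as n grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_ets(n, p=0.5, trials=20_000):
    obs = rng.random((trials, n)) < p
    fcst = rng.random((trials, n)) < p     # random forecast, same base rate
    a = np.sum(fcst & obs, axis=1)
    b = np.sum(fcst & ~obs, axis=1)
    c = np.sum(~fcst & obs, axis=1)
    Ea = (a + b) * (a + c) / n
    with np.errstate(invalid="ignore"):    # 0/0 if a = b = c = 0
        ets = (a - Ea) / (a + b + c - Ea)
    return np.nanmean(ets)

for n in (16, 30, 80, 1000):
    print(n, round(expected_ets(n), 4))
```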
What about rarer events?
• The "Equitable Threat Score" is still virtually equitable for n > 30
• ORSS, EDS and SEDS approach zero much more slowly with n
– For events that occur 2% of the time (e.g. Finley's tornado forecasts), n > 25,000 is needed before the magnitude of the expected score is less than 0.01
– But these measures are supposed to be useful for rare events!
Possible solutions
1. Ensure n is large enough that E(a) > 10
2. Inequitable scores can be scaled to make them equitable:
S_equit = [S − E(S|p,q)] / [max(S) − E(S|p,q)]
– This opens the way to a new class of non-linear equitable measures (a sketch follows)
3. Report confidence intervals and "p-values" (the probability of a score being achieved by chance)
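A sketch of solution 2 (names are mine): rescale an inequitable score so that random forecasts have expected score 0 and a perfect forecast scores 1.

```python
def equitable_rescale(S, expected_S, max_S=1.0):
    """Implements S_equit = [S - E(S|p,q)] / [max(S) - E(S|p,q)];
    expected_S could come from the kind of Monte Carlo estimate
    sketched above."""
    return (S - expected_S) / (max_S - expected_S)
```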
What is the origin of the term "ETS"?
• First use of "Equitable Threat Score": Mesinger & Black (1992)
– A modification of the "Threat Score" a/(a+b+c)
– They cited Gandin and Murphy's equitability requirement that constant forecasts score zero (which ETS satisfies), although it does not satisfy the requirement that non-constant random forecasts have an expected score of zero
– ETS is now one of the most widely used verification measures in meteorology
• An example of rediscovery
– Gilbert (1884) discussed a/(a+b+c) as a possible verification measure in the context of Finley's (1884) tornado forecasts
– Gilbert noted the deficiencies of this measure and proposed exactly the same formula as ETS, 108 years earlier!
• We suggest that ETS be referred to as the Gilbert Skill Score (GSS)
– Or use the Heidke Skill Score, which is unconditionally equitable and is uniquely related to it by ETS = HSS / (2 − HSS); a numeric check is sketched below
Hogan, Ferro, Jolliffe and Stephenson (WAF, in press)
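A quick numeric check (illustrative) of the identity ETS = HSS / (2 − HSS), using the DWD Murgtal contingency table from earlier:

```python
a, b, c, d = 7194, 4098, 4502, 41062
n = a + b + c + d
Ea = (a + b) * (a + c) / n           # expected hits, random forecast
Ex = Ea + (c + d) * (b + d) / n      # expected value of x = a + d
hss = (a + d - Ex) / (n - Ex)
ets = (a - Ea) / (a + b + c - Ea)
assert abs(ets - hss / (2 - hss)) < 1e-12
```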
Properties of various measures
[Table: measures classified as truly equitable, asymptotically equitable, or not equitable]
Skill versus lead time
• Only possible for the UK Met Office 12-km model and the German DWD 7-km model
– Steady decrease of skill with lead time
– Both models appear to improve between 2004 and 2007
• Generally the UK model is best over the UK and the German model best over Germany
– An exception is Murgtal in 2007 (the Met Office model wins)
[Figure: skill versus lead time in 2004 and 2007]
Forecast "half-life"
• Fit an inverse exponential, S(t) = S0 · 2^(−t/τ_1/2)
– S0 is the initial score and τ_1/2 is the half-life
• A noticeably longer half-life is fitted after 36 hours
– The same was found for Met Office rainfall forecasts (Roberts 2008)
– The first timescale is due to data assimilation and convective events
– The second is due to more predictable large-scale weather systems
[Figure: fitted half-lives for the Met Office and DWD models in 2004 and 2007, ranging from about 2.4 to 3.2 days on the first timescale and 4.0 to 4.3 days on the second]
• Different spatial scales? Convection?
– Average temporally before calculating the skill scores
– Both the absolute score and the half-life increase with the number of hours averaged
Why is the half-life less for clouds than for pressure?
• Cloud is noisier than geopotential height Z because it is separated from it by around two orders of differentiation:
– cloud ~ vertical wind ~ relative vorticity ~ ∇²(streamfunction) ~ ∇²(pressure)
– This suggests cloud observations should be used routinely to evaluate models
[Figure: geopotential height anomaly versus vertical velocity]
Satellite observations: ICESat
• Cloud observations from the ICESat 0.5-micron lidar (first data February 2004)
• Global coverage, but the lidar is attenuated by thick clouds, making direct model comparison difficult
– Optically thick liquid cloud obscures the view of any clouds beneath
– Solution: forward-model the measurements (including attenuation) using the ECMWF variables
[Figure: lidar apparent backscatter coefficient (m^-1 sr^-1) versus latitude]
Global cloud fraction comparison
[Figure: ECMWF raw cloud fraction, ECMWF processed cloud fraction and ICESat cloud fraction]
Wilkinson, Hogan, Illingworth and Benedetti (MWR 2008)
• Results for October 2003
– Tropical convection peaks too high
– Too much polar cloud
– Elsewhere agreement is good
• Results can be ambiguous
– An apparent low-cloud underestimate could be a real error, or could be due to high cloud above being too thick
Testing the model skill from space
Clearly we need to apply SEDS to cloud estimated from lidar and radar!
[Figure: SEDS versus latitude and height, with an unreliable region marked]
– Lowest skill: tropical boundary-layer clouds
– Tropical skill appears to peak at mid-levels, but cloud is very infrequent there
– Highest skill in the northern mid-latitude and polar upper troposphere
– Is some of the reduction of skill at low levels due to lidar attenuation?
Wilkinson, Hogan, Illingworth and Benedetti (MWR 2008)
CCPP project
• The US Dept of Energy Climate Change Prediction Program recently funded a 5-year consortium project centred at Brookhaven, NY
– Implement an updated Cloudnet processing system at Atmospheric Radiation Measurement (ARM) radar-lidar sites worldwide
– Ingests ARM's cloud-boundary diagnosis, but uses Cloudnet for the statistics
– New diagnostics are being tested
• Testing of NWP models
– NCEP, ECMWF, Met Office, Meteo-France...
– Over a decade of data at several sites: have cloud forecasts improved over this time?
• Single-column model testbed
– SCM versions of many GCMs will be run over the ARM sites by Roel Neggers
– Different parameterization schemes will be tested
– Verification measures can be used to judge improvements
[Figure: skill at the US Southern Great Plains site, winter versus summer 2004]
Summary and outlook
• Model comparisons reveal:
– The half-life of a cloud forecast is between 2.5 and 4 days, much less than the ~9 days for ECMWF 500-hPa geopotential-height forecasts
– In Europe, skill is higher for mid-level cloud and lower for boundary-layer cloud, but there is a larger seasonal contrast in the Southern US
• Findings applicable to other verification problems:
– The "Symmetric Extreme Dependency Score" is a reliable measure of skill for both common and rare events (given a large enough sample)
– Many measures regarded as equitable, including the "Equitable Threat Score", are only so for very large samples, but they can be rescaled
• Future work (in addition to CCPP):
– CloudSat & Calipso: what is the skill of cloud forecasts globally?
– What is the half-life of ECMWF cloud forecasts? (We need more data!)
– Near-real-time evaluation for rapid feedback to NWP centres?
– Dept of Meteorology Lunchtime Seminar, 1pm Tuesday 3rd Nov: "Faster and more accurate representation of clouds and gases in GCM radiation schemes"
Monthly skill versus time
• A measure of the skill of forecasting cloud fraction > 0.05
– Comparing models using similar forecast lead times
– Compared with the persistence forecast (yesterday's measurements)
• Lower skill in summer convective events
Statistics from the AMF
• Murgtal, Germany, 2007
– 140-day comparison with the Met Office 12-km model
• Dataset released to the COPS community
– Includes the German DWD model at multiple resolutions and forecast lead times
Possible skill scores
Contingency table:
                       Observed cloud   Observed clear sky
  Modelled cloud       a (hit)          b (false alarm)
  Modelled clear sky   c (miss)         d (correct negative)

DWD model:          a = 7194     b = 4098
                    c = 4502     d = 41062
Perfect forecast:   ap = 11696   bp = 0
                    cp = 0       dp = 45160
Random forecast:    ar = 2581    br = 8711
                    cr = 9115    dr = 36449
• To ensure equitability and linearity, we can use the concept of the "generalized skill score" = (x − x_random)/(x_perfect − x_random)
– where "x" is any number derived from the joint PDF
– The resulting scores vary linearly from random = 0 to perfect = 1
• Simplest example: the Heidke skill score (HSS) uses x = a + d (a coded sketch follows)
– We will use this as a reference to test other scores
• The Brier skill score uses x = the mean squared cloud-fraction difference; the Linear Brier skill score (LBSS) uses x = the mean absolute difference
– These are sensitive to errors in the model for all values of cloud fraction
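A sketch of the generalized skill score for the Heidke case x = a + d, using the DWD table above; x_random is taken from the expected counts of an unbiased random forecast, and x_perfect = n.

```python
a, b, c, d = 7194, 4098, 4502, 41062
n = a + b + c + d
Ea = (a + b) * (a + c) / n           # expected hits, random forecast
Ed = (c + d) * (b + d) / n           # expected clear-sky hits

def generalized_skill(x, x_random, x_perfect):
    return (x - x_random) / (x_perfect - x_random)

print(generalized_skill(a + d, Ea + Ed, n))   # the HSS, ~0.53
```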
"Cloud" is deemed to occur when the cloud fraction f is larger than some threshold f_thresh.
Alternative approach
• How valid is it to estimate 3D cloud fraction from a 2D slice?
– Henderson and Pincus (2009) imply that it is reasonable, although presumably not in convective conditions
• Alternative: treat cloud fraction as a probability forecast
– Each time the model forecasts a particular cloud fraction, calculate the fraction of time that cloud was observed instantaneously over the site
– This leads to a reliability diagram:
Jakob et al. (2004)
[Figure: reliability diagram with "perfect", "no skill" and "no resolution" reference lines]
A sketch of computing the points of such a diagram follows.
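This sketch (function and variable names are mine, not from the talk) bins the forecast cloud fractions and computes the observed cloud frequency in each bin:

```python
import numpy as np

def reliability_curve(fcst_cf, obs_cloud, bins=10):
    """fcst_cf: forecast cloud fractions in [0, 1];
    obs_cloud: boolean instantaneous cloud occurrence over the site."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.digitize(fcst_cf, edges[1:-1])    # bin index 0..bins-1
    fcst_mean, obs_freq = [], []
    for i in range(bins):
        m = idx == i
        if m.any():                            # skip empty bins
            fcst_mean.append(fcst_cf[m].mean())
            obs_freq.append(obs_cloud[m].mean())
    return np.array(fcst_mean), np.array(obs_freq)
```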
Simulating the lidar backscatter:
– Create subcolumns with maximum-random overlap (see the sketch below)
– Forward-model the lidar backscatter from the ECMWF water content and particle size
– Remove signals below the lidar sensitivity
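A minimal sketch of one common maximum-random overlap generator (my own formulation, not the actual Cloudnet/ECMWF code): adjacent cloudy layers keep the same rank and so overlap maximally, while layers below clear air are redrawn and so overlap randomly.

```python
import numpy as np

def maxrand_subcolumns(cf, n_sub, seed=0):
    """cf: cloud-fraction profile (top to bottom); returns a boolean
    (n_levels, n_sub) array, True where a subcolumn cell is cloudy."""
    rng = np.random.default_rng(seed)
    cf = np.asarray(cf, dtype=float)
    x = np.empty((len(cf), n_sub))
    x[0] = rng.random(n_sub)
    for k in range(1, len(cf)):
        above_cloudy = x[k - 1] > 1.0 - cf[k - 1]
        # keep the rank where the layer above is cloudy (maximum overlap),
        # redraw where it is clear (random overlap)
        x[k] = np.where(above_cloudy, x[k - 1], rng.random(n_sub))
    return x > 1.0 - cf[:, None]
```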
Testing the model climatology
[Figure: ECMWF raw cloud fraction, ECMWF cloud fraction after processing, and ICESat cloud fraction]
– Reduction in model cloud fraction due to lidar attenuation
– Error due to the uncertain extinction-to-backscatter ratio