benchmark database inhomogeneous data, surrogate data and synthetic data victor venema

Benchmark database

inhomogeneous data, surrogate data and

synthetic data

Victor Venema

M e te o ro lo g ic a l

I n stitu te

B o n n

Victor Venema, [email protected], COST HOME, Mai 2008, Budapest, Hungary

Goals of COST-HOME working group 1

Literature survey

Benchmark dataset– Known inhomogeneities– Test the homogenisation algorithms (HA)


Benchmark dataset1) Real (inhomogeneous) climate records

Most realistic case Investigate if various HA find the same breaks Good meta-data

2) Synthetic data For example, Gaussian white noise Insert know inhomogeneities Test performance

3) Surrogate data Empirical distribution and correlations Insert know inhomogeneities Compare to synthetic data: test of assumptions


Creation benchmark – Outline talk

1) Start with (in)homogeneous data

2) Multiple surrogate and synthetic realisations

3) Mask surrogate records

4) Add global trend

5) Insert inhomogeneities in station time series

6) Published on the web

7) Homogenize by COST participants and third parties

8) Analyse the results and publish


1) Start with homogeneous data

Monthly mean temperature and precipitation Later also daily data (WG4), maybe other

variables

Homogeneous No missing data Longer surrogates are based on multiple copies Generated networks are 100 a


1) Start with inhomogeneous data

Distribution– Years with breaks are removed– Mean of section between breaks is adjusted to global

mean

Spectrum– Longest period without any breaks in the stations– Surrogate is divided in overlapping sections– Fourier coefficients and phases are adjusted for every

small section– No adjustments on large scales!


Surrogates from inhomogeneous data

1900 1920 1940 1960 1980 2000 20200

50

100

150

200

250

300Measurement stations rrm_thenetherlands rrm

1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 20000

50

100

150

200

250Surrogate stations


2) Multiple surrogate realisations

Multiple surrogate realisations– Temporal correlations– Station cross-correlations– Empirical distribution function

Annual cycle removed before, added at the end Number of stations between 5 and 20 Cross correlation varies as much as possible


5) Insert inhomogeneities in stations

Independent breaks Determined at random for every station and time 5 breaks per 100 a Monthly slightly different perturbations Temperature

– Additive– Size: Gaussian distribution, σ=0.8°C

Rain– Multiplicative– Size: Gaussian distribution, <x>=1, σ=10%


Example break perturbations station

1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000-0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9


Example break perturbations network

Year

Temperature perturbations

1920 1940 1960 1980 2000

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

5.5

-1

-0.5

0

0.5

1

1.5



Correlated break in network One break in 50 % of networks In 30 % of the station simultaneously Position random

– At least 10 % of data points on either side


Example correlated break

Year

Correlated break

1920 1940 1960 1980 2000

2

4

6

8

10

12

14

16

18

20-0.3

-0.25

-0.2

-0.15

-0.1

-0.05

0



Outliers Size

– Temperature: < 1 or > 99 percentile– Rain: < 0.1 or > 99.9 percentile

Frequency– 50 % of networks: 1 %– 50 % of networks: 3 %


Example outlier perturbations station

1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000-10

-5

0

5Outliers


Example outliers network

Year

Outliers

1900 1920 1940 1960 1980 2000

2

4

6

8

10

12

14

16

18

20

-10

-5

0

5

10



Local trends (only temperature) Linear increase or decrease in one station Duration: 30, 60a Maximum size: 0.2 to 1.5 °C Frequency: once in 10 % of the stations Also for rain?


Example local trends

Year

Local trends

1900 1920 1940 1960 1980

2

4

6

8

10

12

14

16

18

20-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1


6) Published on the web

Inhomogeneous data will be published on the COST-HOME homepage

Everyone is welcome to download and homogenize the data

http://www.meteo.uni-bonn.de/ mitarbeiter/venema/themes/homogenisation


7) Homogenize by participants

Return homogenised data Should be in COST-HOME file format (next slide)

Return break detections– BREAK– OUTLI– BEGTR– ENDTR

Multiple breaks at one data possible


7) Homogenize by participants

COST-HOME file format: http://www.meteo.uni-bonn.de/

venema/themes/homogenisation/costhome_fileformat.pdf For benchmark & COST homogenisation software

New since Vienna:– Stations files include height– Many clarifications


Work in progress

Preliminary benchmark: http://www.meteo.uni-bonn.de/ venema/themes/homogenisation/

Write report on the benchmark dataset More input data

Set deadline for the availability benchmark Deadline for the return of the homogeneous data Agree on the details of the benchmark Daily data: other, realistic, fair inhomogeneities

benchmark database inhomogeneous data, surrogate data and synthetic data victor venema

Documents

cost home

inhomogeneous data slide

hungary goals of cost

hungary example

cost participants

hungary surrogates

possible slide

hungary benchmark dataset