„EXTREME-VALUE ANALYSIS: FOCUSING ON THE FIT AND THE CONDITIONS, WITH HYDROLOGICAL APPLICATIONS”

Dávid Bozsó, Pál Rakonczai, András Zempléni

Eötvös Loránd University, Budapest

4th Conference on Extreme Value Analysis: Probabilistic and Statistical Models and their Applications

Table of contents

• Goodness of fit procedures
• Checking the conditions D, D(un) and D’(un)
• Multivariate problems:
    • Copulas
    • Simulations
    • Goodness of fit tests for copulas
    • Time dependence
• Hydrological applications

Generalized Pareto distribution

Peaks over a sufficiently high threshold u can be modeled by the generalized Pareto distribution (under mild conditions):

$$ F_{GPD}(y) = P(X - u \le y \mid X > u) \approx 1 - \left(1 + \frac{\xi y}{\sigma}\right)^{-1/\xi} $$

Appropriate threshold selection is very important
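As a small illustration of the GPD formula, the distribution function of the excesses can be written as follows. This is only a sketch: the function name `gpd_cdf`, the argument names `shape` (ξ) and `scale` (σ), and the numerical values in the example call are assumptions chosen for illustration, not values from the talk.

```python
import math

def gpd_cdf(y, shape, scale):
    """Distribution function of the excess y = x - u under a GPD(shape, scale) model."""
    if y <= 0.0:
        return 0.0
    if abs(shape) < 1e-12:
        # Limit shape -> 0: exponential distribution.
        return 1.0 - math.exp(-y / scale)
    t = 1.0 + shape * y / scale
    if t <= 0.0:
        # Only possible for shape < 0: y lies beyond the upper endpoint -scale/shape.
        return 1.0
    return 1.0 - t ** (-1.0 / shape)

# Hypothetical parameters, chosen only for illustration.
print(gpd_cdf(100.0, shape=-0.5, scale=100.0))
```

A negative shape parameter, as in the hydrological fits shown later, corresponds to a distribution with a finite upper endpoint.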

Goodness of fit in univariate threshold models

Usual goodness-of-fit tests (Chi-squared, Kolmogorov-Smirnov) are not sensitive to the tails

A better alternative is the Anderson-Darling test

$$ A_n^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F(x))^2}{F(x)\,(1 - F(x))}\, dF(x), $$

where the discrepancies near the tails get larger weights. Its computation:

$$ A_n^2 = -n - \sum_{i=1}^{n} (2i-1)\big(\log z_i + \log(1 - z_{n+1-i})\big)/n, $$

where z_i = F(X_(i)) and X_(1) ≤ … ≤ X_(n) is the ordered sample
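The computing formula of the Anderson-Darling statistic translates directly into code. The following is a minimal sketch, not the authors' implementation; the function name `anderson_darling` and the argument `cdf` (the hypothesised distribution function F) are assumptions made for the example.

```python
import math

def anderson_darling(sample, cdf):
    """A_n^2 = -n - (1/n) * sum_{i=1}^{n} (2i-1) * (log z_i + log(1 - z_{n+1-i}))."""
    # z_i = F(X_(i)); every value must lie strictly between 0 and 1.
    z = sorted(cdf(x) for x in sample)
    n = len(z)
    total = 0.0
    for i in range(1, n + 1):
        # z[i - 1] is z_i and z[n - i] is z_{n+1-i} in 0-based indexing.
        total += (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
    return -n - total / n

# Example: three points tested against the uniform distribution on (0, 1).
print(anderson_darling([0.2, 0.5, 0.9], lambda x: x))
```

Both logarithms become large in absolute value when some z_i is close to 0 or 1, which is how the statistic puts more weight on the tails.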

Goodness of fit - continued

Modification: often the focus is on one tail only. For maximum:

$$ B_n^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F(x))^2}{1 - F(x)}\, dF(x) $$

(Zempléni, 2004). Computation, with z_i = F(X_(i)):

$$ B_n^2 = n/2 - 2\sum_{i=1}^{n} z_i - \sum_{i=1}^{n} (2i-1)\log(1 - z_{n+1-i})/n $$

Critical values can be simulated (like in Choulakian and Stephens, 2001)
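The upper-tail statistic and a simple Monte Carlo critical value can be sketched as below. The names `upper_tail_ad` and `simulated_critical_value` are hypothetical. The simulation assumes a fully specified null distribution, so the z_i are plain uniform variates; when the GPD parameters are estimated from the data, each replication would also need a refit, which this sketch does not do.

```python
import math
import random

def upper_tail_ad(z_values):
    """B_n^2 = n/2 - 2*sum(z_i) - (1/n) * sum_{i=1}^{n} (2i-1) * log(1 - z_{n+1-i})."""
    z = sorted(z_values)
    n = len(z)
    weighted_logs = sum(
        (2 * i - 1) * math.log(1.0 - z[n - i]) for i in range(1, n + 1)
    )
    return n / 2.0 - 2.0 * sum(z) - weighted_logs / n

def simulated_critical_value(n, level=0.95, replications=2000, seed=1):
    """Empirical quantile of B_n^2 under the null (uniform z_i, no estimated parameters)."""
    rng = random.Random(seed)
    stats = sorted(
        upper_tail_ad([rng.random() for _ in range(n)]) for _ in range(replications)
    )
    return stats[int(level * replications) - 1]

print(simulated_critical_value(50))
```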

Finding thresholds

Theoretical results related to the GPD are doubly asymptotic, since not only the sample size but also the threshold has to converge to infinity

How can we find suitable thresholds? Suggestion:

• Increase the threshold level step by step
• Fit the GPD (by ML method for example) and perform AD-type tests in all of the cases
• Select some levels, for which the fit is acceptable

For more details, see Bozsó et al, 2005
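The suggested scan over thresholds might look like the sketch below. To keep it dependency-free, the GPD is fitted by the method of moments rather than by maximum likelihood, which is a deliberate simplification. All function names, the `min_excesses` cut-off and the synthetic exponential data are assumptions for illustration.

```python
import math
import random

def gpd_cdf(y, shape, scale):
    if abs(shape) < 1e-12:
        return 1.0 - math.exp(-y / scale)
    t = 1.0 + shape * y / scale
    return 1.0 if t <= 0.0 else 1.0 - t ** (-1.0 / shape)

def fit_gpd_moments(excesses):
    """Method-of-moments estimates (shape, scale); a stand-in for an ML fit."""
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((e - mean) ** 2 for e in excesses) / (n - 1)
    ratio = mean * mean / var  # equals 1 - 2*shape for a GPD
    return 0.5 * (1.0 - ratio), 0.5 * mean * (ratio + 1.0)

def anderson_darling(z_values):
    # Clip to the open unit interval so that the logarithms stay finite.
    z = sorted(min(max(v, 1e-12), 1.0 - 1e-12) for v in z_values)
    n = len(z)
    s = sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
        for i in range(1, n + 1)
    )
    return -n - s / n

def scan_thresholds(data, thresholds, min_excesses=30):
    """Return (threshold, fitted shape, AD statistic) for each usable threshold."""
    rows = []
    for u in thresholds:
        excesses = [x - u for x in data if x > u]
        if len(excesses) < min_excesses:
            continue
        shape, scale = fit_gpd_moments(excesses)
        ad = anderson_darling(gpd_cdf(e, shape, scale) for e in excesses)
        rows.append((u, shape, ad))
    return rows

# Synthetic exponential data (true shape 0), used only as a stand-in for real data.
rng = random.Random(0)
synthetic = [rng.expovariate(1.0) for _ in range(2000)]
for u, shape, ad in scan_thresholds(synthetic, [0.5, 1.0, 1.5]):
    print(f"u={u:.1f}  shape={shape:+.3f}  AD={ad:.3f}")
```

The AD values would then be compared with simulated critical values, and the thresholds with an acceptable fit retained.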

Hydrological applications

Daily water level data from several stations along the river Tisza were given (time span: more than 100 years)

As an illustration we have chosen Szeged station, but in fact we have repeated the suggested procedures (almost) automatically for all the stations

In later parts of the talk we shall also use data from Csenger (river Szamos)

Finding thresholds

Threshold (cm)   Shape parameter   AD statistic
...              ...               ...
330              -0.5717           1.1599
340              -0.5601           0.9015
350              -0.5473           0.6296
360              -0.5344           0.4048
370              -0.5312           0.4198
380              -0.5339           0.5456
390              -0.5191           0.3566
400              -0.5033           0.2188
410              -0.499            0.2414
420              -0.5152           0.437
430              -0.491            0.2163
440              -0.4866           0.239
450              -0.4896           0.2599
460              -0.4775           0.33
...              ...               ...

[Figure: Example: Szeged water level. A-D statistics plotted against threshold levels (cm), together with the 95% critical value.]

Focusing on the conditions

So far (for iid data):
• Threshold selection
• Fit a GPD model for data over the selected threshold

Dependence is present
• Possible long range dependence?
• Are the return levels affected by it?

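Condition D and D(un)

Here F_{i_1,…,i_p}(u) denotes P(X_{i_1} ≤ u, …, X_{i_p} ≤ u).

Condition D is said to hold if for any integers i_1 < … < i_p and j_1 < … < j_r for which j_1 − i_p ≥ l, and any real u,

$$ \left| F_{i_1,\dots,i_p,j_1,\dots,j_r}(u) - F_{i_1,\dots,i_p}(u)\, F_{j_1,\dots,j_r}(u) \right| \le g(l), $$

where g(l) → 0 as l → ∞.

Condition D(u_n) is said to hold if for any integers 1 ≤ i_1 < … < i_p < j_1 < … < j_r ≤ n for which j_1 − i_p ≥ l, we have

$$ \left| F_{i_1,\dots,i_p,j_1,\dots,j_r}(u_n) - F_{i_1,\dots,i_p}(u_n)\, F_{j_1,\dots,j_r}(u_n) \right| \le \alpha_{n,l}, $$

where α_{n,l_n} → 0 as n → ∞ for some sequence l_n = o(n).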

How to check condition D ?

Set p=1 and r=1 in the definition of condition D and choose threshold u as the level of interest, e.g. 400 or 430 cm in our example

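Calculate

$$ d(l) = \frac{1}{n-l}\sum_{i=1}^{n-l} \mathbf{1}\{\max(X_i, X_{i+l}) \le u\} - \left(\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le u\}\right)^{2} $$

for each lag l=1,…,1000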
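The statistic d(l) is the difference between the empirical joint probability that both X_i and X_{i+l} stay below u and the square of the empirical marginal probability. A minimal sketch follows; the function names and the synthetic AR(1) series are assumptions used only to make the example runnable.

```python
import random

def d_statistic(x, u, lag):
    """Empirical P(max(X_i, X_{i+lag}) <= u) minus the squared empirical P(X_i <= u)."""
    n = len(x)
    joint = sum(1 for i in range(n - lag) if max(x[i], x[i + lag]) <= u) / (n - lag)
    marginal = sum(1 for v in x if v <= u) / n
    return joint - marginal ** 2

def abs_d_profile(x, u, max_lag):
    """|d(l)| for l = 1, ..., max_lag."""
    return [abs(d_statistic(x, u, lag)) for lag in range(1, max_lag + 1)]

# Synthetic AR(1) series, used here only as a stand-in for real data.
rng = random.Random(0)
series = [0.0]
for _ in range(1999):
    series.append(0.7 * series[-1] + rng.gauss(0.0, 1.0))
level = sorted(series)[int(0.8 * len(series))]  # roughly the 80% quantile
profile = abs_d_profile(series, level, max_lag=50)
print(profile[0], profile[-1])
```

For a weakly dependent series |d(l)| should shrink as the lag grows, which is what the comparison with iid and AR(1) sequences is meant to show.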

Applications: daily water level data

400 cm – 80% quantile

430 cm – 83% quantile

Compare with |d(l)| for well-known sequences

• iid, normally distributed sequence

• AR(1) series

[Figure: |d(l)| statistics for selected levels, plotted against the lag (0 to 1000). Curves: 400 cm (80% quantile) and 430 cm (83% quantile).]


The simulation study confirms our hypothesis: the empirical data are within the 95% confidence interval

[Figure: |d(l)| statistics from 10000 simulations, plotted against the lag (0 to 1000). Curves: hydrological data (level: 430 cm); normal iid sequences (sample mean, 95% quantile); AR(1) sequences (sample mean, 95% quantile).]

Condition D’(un)

Condition D'(u_n) is said to hold for the stationary sequence (X_n) and the sequence (u_n) of constants if

$$ \lim_{k\to\infty}\,\limsup_{n\to\infty}\; n \sum_{j=2}^{[n/k]} P(X_1 > u_n,\, X_j > u_n) = 0, $$

where [ ] denotes the integer part.

Practical procedure: select a sequence (u_n), calculate the following statistic and plot it as a function of k:

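$$ d'(k) = \max_{1 \le n \le 1000}\; n \sum_{i=2}^{[n/k]} \frac{1}{N-i+1} \sum_{j=1}^{N-i+1} \mathbf{1}\{\min(X_j, X_{j+i-1}) > u_n\}, $$

where N denotes the sample size.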


[Figure: d'(k) statistics for simulation, plotted against k (0 to 1000). Curves: hydrological data; normal iid sequence (sample mean); Y_n = max(X_n, X_{n+1}), where X_n has a standard normal distribution (sample mean).]

Multivariate models

Copulas are very useful tools for investigating dependence among the coordinates of multivariate observations

The marginal distributions and the dependence structure can be modeled separately!

Which parametric models to use for the hydrological applications? (in two dimensions)

Hydrological applications

Water level peaks measured at two different stations are shown (peaks were coupled to each other if they occurred within one month of each other)

With the help of the earlier algorithm we can choose threshold levels (blue lines) and fit GPD to the marginals

Only those peaks are used which are extremal in both coordinates!

[Figure: Omitting flood peaks. Scatter plot of Szeged (cm) against Csenger (cm).]

QQ-Plot for marginals

[Figure: Quantile Plot of Szeged station, empirical quantiles against the GPD model (shape parameter = -0.56).]

[Figure: Quantile Plot of Csenger station, empirical quantiles against the GPD model (shape parameter = -0.37).]

Empirical copula

[Figure: empirical copula on the unit square, Szeged against Csenger. Kendall's tau = 0.44.]

After transforming the data into uniform marginals the empirical copula is obtained

Which parametric copula is the most adequate for the given application?
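One common way to obtain approximately uniform marginals is a rank transformation; whether the slides used ranks or the fitted GPD marginals is not stated, so the rank-based version below is an assumption. The function names are hypothetical, and Kendall's tau is computed by the plain O(n²) pair count.

```python
def pseudo_observations(values):
    """Map each value to rank / (n + 1), which lies strictly inside (0, 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u

def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / (number of pairs); ties count as zero."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

# Made-up peak levels (cm) at two stations, for illustration only.
station_a = [610, 720, 655, 800, 590]
station_b = [400, 520, 530, 640, 410]
u = pseudo_observations(station_a)
v = pseudo_observations(station_b)
print(list(zip(u, v)), kendall_tau(station_a, station_b))
```

Kendall's tau depends only on the ranks, so it is the same for the raw data and for the transformed points.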

Conceivable copulas in 2D

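Elliptical copulas:

• Gauss: $C_R(u) = \Phi_R^d\big(\Phi^{-1}(u_1), \Phi^{-1}(u_2)\big)$
• Student-t: $C_{\nu,R}(u) = t_{\nu,R}^d\big(t_\nu^{-1}(u_1), t_\nu^{-1}(u_2)\big)$

Archimedean copulas:

• Gumbel: $C_\theta(u,v) = \exp\big(-[(-\log u)^{\theta} + (-\log v)^{\theta}]^{1/\theta}\big)$
• Clayton: $C_\theta(u,v) = (u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$

Other copulas:

• Frechet: $C_\theta(u,v) = (1-\theta)\,uv + \theta \min(u,v)$
• …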
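The closed-form copulas above are easy to evaluate, and for each of them Kendall's tau is a known function of θ that can be inverted to get a moment-type estimate. The sketch below assumes the standard parametrisations (Gumbel with θ ≥ 1, Clayton with θ > 0, Frechet as a mixture of independence and the upper bound); the function names are hypothetical.

```python
import math

def gumbel_copula(u, v, theta):
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))

def clayton_copula(u, v, theta):
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def frechet_copula(u, v, theta):
    return (1.0 - theta) * u * v + theta * min(u, v)

def gumbel_theta_from_tau(tau):
    # Inverts tau = 1 - 1/theta.
    return 1.0 / (1.0 - tau)

def clayton_theta_from_tau(tau):
    # Inverts tau = theta / (theta + 2).
    return 2.0 * tau / (1.0 - tau)

def frechet_theta_from_tau(tau):
    # Inverts tau = theta * (theta + 2) / 3.
    return -1.0 + math.sqrt(1.0 + 3.0 * tau)

tau = 0.425
print(gumbel_theta_from_tau(tau), frechet_theta_from_tau(tau))
```

With tau = 0.425 these inversions give values close to the Gumbel and Frechet parameters reported in the time-dependence part of the talk, which suggests (but does not prove) that a tau-based estimate was used there.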

Simulation - Gauss

[Figure: four panels. Original flood peaks; simulated flood peaks; empirical copula; Gauss copula.]

Simulation – Student-t

[Figure: four panels. Original flood peaks; simulated flood peaks; empirical copula; Student-t copula.]

Simulation – Clayton I.

[Figure: four panels. Original flood peaks; simulated flood peaks; empirical copula; Clayton copula.]

Simulation – Clayton II.

[Figure: four panels. Original flood peaks; simulated flood peaks; empirical copula; Clayton copula.]

Simulation – Gumbel

[Figure: four panels. Original flood peaks; simulated flood peaks; empirical copula; Gumbel copula.]
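As an example of how such simulated samples can be produced, the Clayton copula admits a simple conditional-inversion sampler: draw u and an auxiliary uniform w, then solve ∂C/∂u = w for v. The sketch below is one standard way of doing this and is not necessarily the method used for the figures; the function name and the parameter value are assumptions.

```python
import random

def sample_clayton(n, theta, seed=None):
    """Draw n pairs (u, v) from a Clayton copula with parameter theta > 0."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # 1 - random() lies in (0, 1], which avoids raising 0 to a negative power.
        u = 1.0 - rng.random()
        w = 1.0 - rng.random()
        # Solves dC/du (u, v) = w for v.
        v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs

sample = sample_clayton(500, theta=1.5, seed=0)
print(sample[:3])
```

Applying the inverse marginal distribution functions to u and v would then give simulated flood peaks on the original scale.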

Goodness of fit for copulas

Cramér-von Mises and Kolmogorov-Smirnov functionals of

$$ \sqrt{n}\,\big(C_n - C_{\theta_n}\big) $$

might be used to test the null hypothesis $H_0: C \in \mathcal{C}$, where $\mathcal{C}$ is the parametric copula family

A simple approach, which is based on the multivariate probability integral transformation of F, is defined by

$$ K(t) = P\{F(X_1, X_2, \dots, X_d) \le t\} = P\{C(U_1, U_2, \dots, U_d) \le t\}, $$

where (U1,...,Ud) is a vector of uniform variables having C as their joint distribution

Visual comparison

Genest et al (2003) proposed a graphical procedure for model selection through the visual comparison of the non-parametric estimate Kn(.) of K to the parametric estimate K(θn,.), where

$$ K_n(t) = \frac{1}{n}\sum_{j=1}^{n} \mathbf{1}\{e_{jn} \le t\}, \qquad e_{jn} = \frac{1}{n}\sum_{k=1}^{n} \prod_{i=1}^{d} \mathbf{1}\{X_{ik} \le X_{ij}\} $$

The better the fit is, the closer the graphs of these functions are

Question: how to define the distance between the graphs?
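The non-parametric estimate K_n and a parametric K can be sketched as follows. For the Gumbel copula the closed form K(θ, t) = t − t·log(t)/θ follows from the general Archimedean formula K(t) = t − φ(t)/φ'(t) with generator φ(t) = (−log t)^θ. The function names are hypothetical.

```python
import math

def empirical_k(points):
    """Return the values e_jn and the step function K_n built from them."""
    n = len(points)
    e = []
    for xj in points:
        # Number of observations that are componentwise <= x_j (x_j itself included).
        count = sum(1 for xk in points if all(a <= b for a, b in zip(xk, xj)))
        e.append(count / n)

    def k_n(t):
        return sum(1 for value in e if value <= t) / n

    return e, k_n

def gumbel_k(t, theta):
    """K(theta, t) = t - t*log(t)/theta for the Gumbel copula, 0 < t <= 1."""
    return t - t * math.log(t) / theta

e, k_n = empirical_k([(1, 1), (2, 2), (3, 3)])
print(e, k_n(0.5), gumbel_k(0.5, 1.74))
```

Plotting `k_n` and `gumbel_k` over a grid of t values gives the two graphs whose distance is discussed next.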

Weighted quadratic differences:

$$ \sum_{t_i \in (0,1)} \big(K(\theta_n, t_i) - K_n(t_i)\big)^2\, w_i $$

with the weights

$$ w_i = 1, \qquad w_i = \frac{1}{K(\theta_n, t_i)\,\big(1 - K(\theta_n, t_i)\big)}, \qquad w_i^{M} = \frac{1}{1 - K(\theta_n, t_i)} $$

[Figure: Gumbel vs. empirical copula, plotted against t. Panel 1: squared deviations. Panel 2: weighted squared deviations (weighted and modified weighted).]
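The three versions of the quadratic distance (unweighted, weighted, and modified weighted with more emphasis on the upper tail) can be computed in one pass over a grid. This is a sketch under the assumption that the weights are 1, 1/(K(1−K)) and 1/(1−K) evaluated at the parametric K; the function name and the toy inputs are hypothetical.

```python
def quadratic_distances(k_param, k_emp, grid):
    """Return (SS, weighted SS, modified weighted SS) over grid points in (0, 1)."""
    ss = weighted = modified = 0.0
    for t in grid:
        kp = k_param(t)
        diff2 = (kp - k_emp(t)) ** 2
        ss += diff2
        weighted += diff2 / (kp * (1.0 - kp))  # emphasises both tails
        modified += diff2 / (1.0 - kp)         # emphasises the upper tail only
    return ss, weighted, modified

# Toy example: identity as "parametric" K, shifted identity as "empirical" K.
grid = [i / 10 for i in range(1, 10)]
print(quadratic_distances(lambda t: t, lambda t: min(1.0, t + 0.05), grid))
```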

Which weights to use?

In order to compare which test statistic performs better at detecting discrepancies in the upper tail, we applied the following algorithm:

1. Simulate a sample from a parametric copula
2. Randomly choose two discordant points (x, y) near the right tail and permute their coordinates so that the new points x*, y* are concordant (the marginals do not change, but the copula changes)
3. Perform the three versions of the test for the modified data set
4. Repeat steps 2 and 3, and investigate which statistic is faster in detecting the changes
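Step 2 of the algorithm can be sketched as a single function. "Near the right tail" is interpreted here as both coordinates exceeding a cut-off `tail`, which is an assumption, as are the function name and the default cut-off.

```python
import random

def make_pair_concordant(points, tail=0.8, rng=None):
    """Swap the second coordinates of one random discordant pair in the upper tail.

    Returns (new_points, changed). The marginal samples are left unchanged.
    """
    rng = rng or random.Random()
    idx = [i for i, (u, v) in enumerate(points) if u > tail and v > tail]
    discordant = [
        (i, j)
        for a, i in enumerate(idx)
        for j in idx[a + 1:]
        if (points[i][0] - points[j][0]) * (points[i][1] - points[j][1]) < 0
    ]
    if not discordant:
        return list(points), False
    i, j = rng.choice(discordant)
    new_points = list(points)
    new_points[i] = (points[i][0], points[j][1])
    new_points[j] = (points[j][0], points[i][1])
    return new_points, True

pts = [(0.9, 0.95), (0.95, 0.9), (0.1, 0.2)]
print(make_pair_concordant(pts, rng=random.Random(0)))
```

Calling the function repeatedly and recomputing the three test statistics after each call reproduces steps 3 and 4.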

The data and its permutations

[Figure: four scatter plots on the unit square, titled Permutation = 0, Permutation = 5, Permutation = 10 and Permutation = 15.]

The number in the title gives the number of changed pairs

Detecting changes

[Figure: Comparing methods. Three panels plotted against the number of permutations (5 to 25): squared deviations, weighted squared deviations, modified weighted squared deviations.]

In general the tests based on weighted squared deviations perform better than the original one.

Of the two weighted tests, the modified version is more sensitive!

Simulation results

We recorded how many steps the different tests needed to detect the changes during the replications

As expected, the modified weights were the best!

                       mean    st.dev
Sum of squares (SS)    15.78   8.57
Weighted SS            11.58   7.7
Modified weighted SS   10.15   7.13

Time dependence

Has the dependence structure of the observations changed in the last century?

Windows of 80 years with a step size of 5 years were used to detect possible changes

Firstly we have to decide which copula to use
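The moving-window scheme (80-year windows shifted by 5 years) can be sketched as follows. The data layout, a list of `(year, x, y)` triples of coupled peaks, and the function names are assumptions made for the example; Kendall's tau is computed per window by a plain pair count.

```python
def sliding_windows(first_year, last_year, width=80, step=5):
    """Yield (start, end) year pairs of the moving windows."""
    start = first_year
    while start + width <= last_year:
        yield start, start + width
        start += step

def kendall_tau(pairs):
    n = len(pairs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

def tau_by_window(peaks, first_year, last_year):
    """Return (start, end, sample size, Kendall's tau) for every window."""
    results = []
    for lo, hi in sliding_windows(first_year, last_year):
        window = [(x, y) for year, x, y in peaks if lo <= year <= hi]
        if len(window) >= 2:
            results.append((lo, hi, len(window), kendall_tau(window)))
    return results

print(list(sliding_windows(1900, 2000)))
```

For the period 1900 to 2000 this yields the five windows 1900-1980, 1905-1985, 1910-1990, 1915-1995 and 1920-2000.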

[Figure: two panels on the unit square, Frechet (theta = 0.51) and Gumbel (theta = 1.74). Curves: windows 1900-1980, 1905-1985, 1910-1990, 1915-1995, 1920-2000, and 1900-2000 (parametric).]

Time dependence

[Figure: Frechet vs. Gumbel, three panels over the interval 0 to 1: squared deviations, weighted squared deviations, modified weighted squared deviations. Curves: deviations from the Frechet copula and deviations from the Gumbel copula.]

In all of the three cases the Gumbel copula seems to be better than Frechet!

Simulated critical values

n \ tau   0.3      0.4      0.5      0.6
50        0.4612   0.4625   0.4104   0.3685
100       0.2105   0.2036   0.1872   0.1723
150       0.1345   0.1295   0.1312   0.117

n \ tau   0.3      0.4      0.5      0.6
50        4.0558   3.3538   2.9626   2.4796
100       1.6783   1.507    1.2564   1.1039
150       1.1004   0.9405   0.8423   0.733

n \ tau   0.3      0.4      0.5      0.6
50        2.8795   2.3359   1.9354   1.5223
100       1.2066   1.0019   0.8153   0.6618
150       0.7914   0.6278   0.5349   0.4314

Applications for the hydrological data set: time dependence

Window     N (sample size)   Kendall-tau   theta    Sum of squares (SS)   weighted SS   Modified weighted SS
1          88                0.4306        1.7563   0.1208                0.8148        0.5313
2          90                0.4559        1.8379   0.16                  1.1805        0.7844
3          87                0.4503        1.819    0.1376                1.1665        0.8706
4          88                0.3872        1.6319   0.1419                1.5023        1.1526*
5          94                0.3856        1.6277   0.088                 0.906         0.6534
All obs.   119               0.425         1.7391   0.075                 0.5987        0.3716

• The only (marginally) significant value is marked with *
• A simulation study may be used for detecting changes in the dependence structure

References

Bozsó, D., Rakonczai, P. and Zempléni, A. (2005). Floods on river Tisza and some of its affluents. Extreme-value modelling in practice. Statisztikai Szemle, accepted for publication. (In Hungarian.)

Choulakian, V. and Stephens, M.A. (2001). Goodness-of-fit tests for the generalized Pareto distribution. Technometrics 43, 478-484.

D’Agostino, R.B. and Stephens, M.A. (1986). Goodness-of-fit Techniques. Marcel Dekker.

Genest, C., Quessy, J.-F. and Rémillard, B. (2003). Goodness-of-fit Procedures for Copula Models Based on the Integral Probability Transformation. GERAD.

Leadbetter, M. R., Lindgren, G. and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer.

Zempléni, A. (2004). Goodness-of-fit test in extreme value applications. Discussion paper No. 383, SFB 386, Statistische Analyse Diskreter Strukturen, TU München.
