Post on 21-Dec-2015
EXTREME-VALUE ANALYSIS: FOCUSING ON THE FIT AND THE CONDITIONS, WITH HYDROLOGICAL APPLICATIONS
Dávid Bozsó, Pál Rakonczai, András Zempléni
Eötvös Loránd University, Budapest
4th Conference on Extreme Value Analysis: Probabilistic and Statistical Models and their Applications
Table of contents
• Goodness of fit procedures
• Checking the conditions D, D(un) and D’(un)
• Multivariate problems:
  • Copulas
  • Simulations
  • Goodness of fit tests for copulas
  • Time dependence
• Hydrological applications
Generalized Pareto distribution
Peaks over a sufficiently high threshold u can be modeled by the generalized Pareto distribution (under mild conditions):
$$F_{GPD}(y) = P(X - u \le y \mid X > u) \approx 1 - \left(1 + \frac{\xi y}{\tilde{\sigma}}\right)^{-1/\xi}$$
Appropriate threshold selection is very important
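As a minimal sketch, the GPD distribution function of an excess y over the threshold can be evaluated as follows. The function name `gpd_cdf` and the parameter names `xi` (shape) and `sigma` (scale) are illustrative choices, not taken from the talk; the handling of the limit case xi = 0 (exponential distribution) and of the finite upper endpoint for xi < 0 follows the standard definition.

```python
import math

def gpd_cdf(y, xi, sigma):
    """Generalized Pareto CDF of an excess y over the threshold.

    xi is the shape and sigma the scale parameter (names are assumptions).
    """
    if y <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:
        # Limit case xi -> 0: exponential distribution.
        return 1.0 - math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:
        # Only possible for xi < 0: y lies at or beyond the upper endpoint -sigma/xi.
        return 1.0
    return 1.0 - base ** (-1.0 / xi)

# A negative shape (as in the fitted water level models) gives a bounded tail.
print(gpd_cdf(100.0, -0.5, 100.0))
print(gpd_cdf(200.0, -0.5, 100.0))
```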
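Goodness of fit in univariate threshold models
Usual goodness-of-fit tests (Chi-squared, Kolmogorov-Smirnov) are not sensitive in the tails
A better alternative is the Anderson-Darling test
$$A^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F(x))^2}{F(x)\,(1 - F(x))}\, dF(x),$$
where the discrepancies near the tails get larger weights. Its computation:
$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1)\left(\log z_i + \log(1 - z_{n+1-i})\right),$$
where $z_i = F(x_{(i)})$ is the fitted distribution function at the i-th ordered observation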
Goodness of fit - continued
Modification: often the focus is on one tail only
For maximum:
$$B^2 = n \int_{-\infty}^{\infty} \frac{(F_n(x) - F(x))^2}{1 - F(x)}\, dF(x)$$
(Zempléni, 2004)
Computation:
$$B^2 = \frac{n}{2} - 2 \sum_{i=1}^{n} z_i - \sum_{i=1}^{n} \left(2 - \frac{2i - 1}{n}\right) \log(1 - z_i)$$
Critical values can be simulated (as in Choulakian and Stephens, 2001)
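The two computing formulas can be sketched in a few lines. This assumes the usual computing form of the Anderson-Darling statistic and of its upper-tail modification, with z_i the fitted distribution function evaluated at the i-th ordered observation; the function name `ad_statistics` is hypothetical.

```python
import math

def ad_statistics(z):
    """Return (A2, B2) for fitted CDF values z, each strictly inside (0, 1).

    A2 is the Anderson-Darling statistic; B2 is the upper-tail version,
    which weights the squared discrepancies by 1 / (1 - F).
    """
    z = sorted(z)
    n = len(z)
    a2 = -n - sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
        for i in range(1, n + 1)
    ) / n
    b2 = n / 2.0 - 2.0 * sum(z) - sum(
        (2.0 - (2 * i - 1) / n) * math.log(1.0 - z[i - 1])
        for i in range(1, n + 1)
    )
    return a2, b2

a2, b2 = ad_statistics([0.12, 0.35, 0.41, 0.77, 0.93])
print(a2, b2)
```

Both statistics are non-negative by construction, and large values indicate a poor fit. Critical values would still have to be simulated for the fitted model, as the slide notes.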
Finding thresholds
Theoretical results related to the GPD are doubly asymptotic: not only the sample size but also the threshold has to tend to infinity
How can we find suitable thresholds? Suggestion:
• Increase the threshold level step by step
• Fit the GPD (by the ML method, for example) and perform AD-type tests in all of the cases
• Select some levels for which the fit is acceptable
For more details, see Bozsó et al., 2005
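The stepwise procedure can be sketched as below. To keep the example self-contained, the GPD is fitted by the method of moments, which is a simple stand-in for the ML fit mentioned above and is only valid for shape values below 0.5. All function names, the simulated exponential sample, and the `min_excesses` cut-off are assumptions made for illustration.

```python
import math
import random

def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit (stand-in for ML): returns (xi, sigma)."""
    n = len(excesses)
    mean = sum(excesses) / n
    var = sum((e - mean) ** 2 for e in excesses) / (n - 1)
    ratio = mean * mean / var
    return 0.5 * (1.0 - ratio), 0.5 * mean * (1.0 + ratio)

def gpd_cdf(y, xi, sigma):
    """GPD distribution function for a positive excess y."""
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    return 1.0 if base <= 0.0 else 1.0 - base ** (-1.0 / xi)

def ad_statistic(z):
    """Anderson-Darling statistic from sorted fitted CDF values in (0, 1)."""
    n = len(z)
    total = sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n

def scan_thresholds(data, thresholds, min_excesses=30):
    """Fit a GPD above each threshold; return (threshold, shape, A2) rows."""
    rows = []
    eps = 1e-12  # keeps the logarithms finite if a point hits the fitted endpoint
    for u in thresholds:
        excesses = sorted(x - u for x in data if x > u)
        if len(excesses) < min_excesses:
            continue  # too few peaks for a meaningful fit
        xi, sigma = fit_gpd_moments(excesses)
        z = [min(max(gpd_cdf(y, xi, sigma), eps), 1.0 - eps) for y in excesses]
        rows.append((u, xi, ad_statistic(z)))
    return rows

# Illustrative data only: exponential excesses correspond to a shape near 0.
random.seed(1)
sample = [random.expovariate(1.0) for _ in range(2000)]
for u, xi, a2 in scan_thresholds(sample, [0.5, 1.0, 1.5]):
    print(f"u={u:.1f}  shape={xi:+.3f}  A2={a2:.3f}")
```

Each row would then be compared with a simulated critical value, and the thresholds with an acceptable fit retained.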
Hydrological applications
Daily water level data from several stations along the river Tisza were given (time span: more than 100 years)
As an illustration we have chosen Szeged station, but in fact we have repeated the suggested procedures (almost) automatically for all the stations
In later parts of the talk we shall also use data from Csenger (river Szamos)
Finding thresholds

Threshold (cm)   Shape parameter   AD statistic
...              ...               ...
330              -0.5717           1.1599
340              -0.5601           0.9015
350              -0.5473           0.6296
360              -0.5344           0.4048
370              -0.5312           0.4198
380              -0.5339           0.5456
390              -0.5191           0.3566
400              -0.5033           0.2188
410              -0.499            0.2414
420              -0.5152           0.437
430              -0.491            0.2163
440              -0.4866           0.239
450              -0.4896           0.2599
460              -0.4775           0.33
...              ...               ...

[Figure: Example: Szeged water level. A-D statistics plotted against threshold levels (cm), together with the 95% critical value.]
Focusing on the conditions
So far (for iid data):
• Threshold selection
• Fit a GPD model for data over the selected threshold
Dependence is present
• Possible long range dependence?
• Are the return levels affected by it?
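Condition D and D(un)
Condition D is said to hold if for any integers $i_1 < \dots < i_p < j_1 < \dots < j_r$ for which $j_1 - i_p \ge l$, and any real $u$,
$$\left| F_{i_1,\dots,i_p,j_1,\dots,j_r}(u) - F_{i_1,\dots,i_p}(u)\, F_{j_1,\dots,j_r}(u) \right| \le g(l),$$
where $g(l) \to 0$ as $l \to \infty$.
Condition D(un) is said to hold if for any integers $1 \le i_1 < \dots < i_p < j_1 < \dots < j_r \le n$ for which $j_1 - i_p \ge l$, we have
$$\left| F_{i_1,\dots,i_p,j_1,\dots,j_r}(u_n) - F_{i_1,\dots,i_p}(u_n)\, F_{j_1,\dots,j_r}(u_n) \right| \le \alpha_{n,l},$$
where $\alpha_{n,l_n} \to 0$ as $n \to \infty$ for some sequence $l_n = o(n)$.
Here $F_{i_1,\dots,i_p}(u) = P(X_{i_1} \le u, \dots, X_{i_p} \le u)$.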
How to check condition D?
Set p=1 and r=1 in the definition of condition D and choose threshold u as the level of interest, e.g. 400 or 430 cm in our example
Calculate
$$d(l) = \frac{1}{n-l} \sum_{i=1}^{n-l} \mathbf{1}\{\max(X_i, X_{i+l}) \le u\} - \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le u\} \right)^2$$
for each lag l=1,…,1000
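A direct translation of this statistic is sketched below, assuming that d(l) is the empirical joint non-exceedance frequency at lag l minus the squared marginal non-exceedance frequency. The function name `d_stat` and the toy series are illustrative only.

```python
def d_stat(x, u, lag):
    """Empirical d(l): P(X_i <= u, X_{i+l} <= u) minus P(X_i <= u)^2."""
    n = len(x)
    joint = sum(1 for i in range(n - lag) if max(x[i], x[i + lag]) <= u) / (n - lag)
    marginal = sum(1 for value in x if value <= u) / n
    return joint - marginal ** 2

# Toy series alternating below and above the level u = 3.
series = [1, 5, 1, 5, 1, 5]
print([d_stat(series, 3, lag) for lag in (1, 2)])  # [-0.25, 0.25]
```

Plotting |d(l)| for l = 1, ..., 1000 and comparing it with the same curve for simulated iid or AR(1) series gives the kind of comparison described on the following slides.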
Applications: daily water level data
400 cm – 80% quantile
430 cm – 83% quantile
Compare with |d(l)| for well-known sequences
• iid, normally distributed sequence
• AR(1) series
[Figure: |d(l)| statistics for selected levels, plotted against the lag (0 to 1000). Curves: 400 cm (80% quantile) and 430 cm (83% quantile).]
Applications: daily water level data
The simulation study confirms our hypothesis: the empirical data lie within the 95% confidence interval
[Figure: |d(l)| statistics from 10000 simulations, plotted against the lag (0 to 1000). Curves: hydrological data (level: 430 cm); normal iid sequences (sample mean, 95% quantile); AR(1) sequences (sample mean, 95% quantile).]
Condition D’(un)
Condition D’(un) is said to hold for the stationary sequence $(X_j)$ and sequence $(u_n)$ of constants if
$$\lim_{k \to \infty} \limsup_{n \to \infty}\; n \sum_{j=2}^{[n/k]} P(X_1 > u_n,\, X_j > u_n) = 0,$$
where [ ] denotes the integer part.
Practical procedure: select a sequence (un), calculate
$$d'(k) = \max_{1 \le n \le 1000}\; n \sum_{i=2}^{[n/k]} \frac{1}{N-i+1} \sum_{j=1}^{N-i+1} \mathbf{1}\{\min(X_j, X_{j+i-1}) > u_n\}$$
and plot it as a function of k
Applications: daily water level data
[Figure: d'(k) statistics for simulation, plotted against k (0 to 1000). Curves: hydrological data; normal iid sequence (sample mean); Yn=max(Xn,Xn+1), where X2 has a standard normal distribution (sample mean).]
Multivariate models
Copulas are very useful tools for investigating dependence among the coordinates of multivariate observations
The marginal distributions and the dependence structure can be modeled separately!
Which parametric models to use for the hydrological applications? (in two dimensions)
Hydrological applications
Water level peaks measured at two different stations are shown (peaks were paired with each other if they occurred within one month of each other)
With the help of the earlier algorithm we can choose threshold levels (blue lines) and fit a GPD to the marginals
Only those peaks are used which are extremal in both coordinates!
[Figure: Omitting flood peaks. Scatter plot of the paired peaks: Szeged (cm) against Csenger (cm).]
QQ-Plot for marginals
[Figure: Quantile plot of Szeged station. Empirical quantiles against GPD model quantiles; shape parameter = -0.56.]
[Figure: Quantile plot of Csenger station. Empirical quantiles against GPD model quantiles; shape parameter = -0.37.]
Empirical copula
[Figure: Empirical copula, Szeged against Csenger. Kendall's tau = 0.44.]
After transforming the data into uniform marginals the empirical copula is obtained
Which parametric copula is the most adequate for the given application?
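The transformation to uniform marginals and the rank correlation can be sketched as follows. Here the transformation uses ranks divided by n + 1, which is one common choice and an assumption on our part (the fitted marginal distributions could be used instead). The function names and the four data points are invented for illustration.

```python
def pseudo_observations(values):
    """Rank-transform a sample to (0, 1): rank / (n + 1), assuming no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return [rank / (n + 1) for rank in ranks]

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant pairs) / total number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return 2.0 * score / (n * (n - 1))

# Invented peak levels (cm) at two stations, for illustration only.
station_a = [510, 620, 700, 880]
station_b = [300, 450, 400, 700]
u = pseudo_observations(station_a)
v = pseudo_observations(station_b)
print(list(zip(u, v)))
print(kendall_tau(station_a, station_b))
```

Kendall's tau depends only on the ranks, so it is the same for the original peaks and for the transformed points.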
Conceivable copulas in 2D
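Elliptical copulas:
Gauss:
$$C_R(u) = \Phi_R^d\left(\Phi^{-1}(u_1), \Phi^{-1}(u_2)\right)$$
Student-t:
$$C_{\nu,R}(u) = t_{\nu,R}^d\left(t_\nu^{-1}(u_1), t_\nu^{-1}(u_2)\right)$$
Archimedean copulas:
Gumbel:
$$C_\theta(u,v) = \exp\left(-\left[(-\log u)^{\theta} + (-\log v)^{\theta}\right]^{1/\theta}\right)$$
Clayton:
$$C_\theta(u,v) = \left(u^{-\theta} + v^{-\theta} - 1\right)^{-1/\theta}$$
Other copulas:
Frechet:
$$C_\theta(u,v) = (1-\theta)\,uv + \theta \min(u,v)$$
…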
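The closed-form families on this slide are easy to evaluate directly. The sketch below assumes the Gumbel parametrization with theta >= 1, the Clayton parametrization with theta > 0, and the Frechet family as a mixture of the independence copula and min(u, v); the function names are illustrative. The relation tau = 1 - 1/theta for the Gumbel family is the standard one.

```python
import math

def gumbel_copula(u, v, theta):
    """Gumbel copula, theta >= 1 (theta = 1 gives independence)."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def clayton_copula(u, v, theta):
    """Clayton copula, theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def frechet_copula(u, v, theta):
    """Mixture of independence (theta = 0) and comonotonicity (theta = 1)."""
    return (1.0 - theta) * u * v + theta * min(u, v)

def gumbel_theta_from_tau(tau):
    """Invert Kendall's tau = 1 - 1/theta for the Gumbel family."""
    return 1.0 / (1.0 - tau)

print(gumbel_copula(0.3, 0.7, 1.74))
print(clayton_copula(0.5, 0.5, 1.0))
print(gumbel_theta_from_tau(0.425))
```

With tau = 0.425 the inversion gives a Gumbel theta of about 1.739.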
Simulation - Gauss
[Figure, four panels: original flood peaks; simulated flood peaks; empirical copula; Gauss copula.]
Simulation – Student-t
[Figure, four panels: original flood peaks; simulated flood peaks; empirical copula; Student-t copula.]
Simulation – Clayton I.
[Figure, four panels: original flood peaks; simulated flood peaks; empirical copula; Clayton copula.]
Simulation – Clayton II.
[Figure, four panels: original flood peaks; simulated flood peaks; empirical copula; Clayton copula.]
Simulation – Gumbel
[Figure, four panels: original flood peaks; simulated flood peaks; empirical copula; Gumbel copula.]
Goodness of fit for copulas
Cramér-von Mises and Kolmogorov-Smirnov functionals of
$$\mathbb{C}_n = \sqrt{n}\,\left(C_n - C_{\theta_n}\right)$$
might be used to test the null hypothesis
$$H_0: C \in \{C_\theta\}$$
A simple approach, based on the multivariate probability integral transformation of F, is defined by
$$K(t) = P\{F(x_1, \dots, x_d) \le t\} = P\{C(U_1, \dots, U_d) \le t\},$$
where (U1,...,Ud) is a vector of uniform variables having C as their joint distribution
Visual comparison
Genest et al. (2003) proposed a graphical procedure for model selection through the visual comparison of the non-parametric estimate Kn(.) of K to the parametric estimate K(θn,.), where
$$K_n(t) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{e_{jn} \le t\}, \qquad e_{jn} = \frac{1}{n} \sum_{k=1}^{n} \prod_{i=1}^{d} \mathbf{1}\{X_{ik} \le X_{ij}\}$$
The better the fit is, the closer the graphs of these functions are
Question: how to define the distance between the graphs?
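A small sketch of the non-parametric estimate: e_jn is taken to be the proportion of observations that are componentwise less than or equal to the j-th one, and K_n is the empirical distribution function of these values. The function names and the four bivariate points are assumptions for illustration.

```python
def pseudo_e(sample):
    """e_jn for each observation: share of points dominated componentwise."""
    n = len(sample)
    return [
        sum(all(xk[i] <= xj[i] for i in range(len(xj))) for xk in sample) / n
        for xj in sample
    ]

def K_n(e_values, t):
    """Empirical distribution function of the e_jn values at t."""
    return sum(1 for e in e_values if e <= t) / len(e_values)

points = [(1, 1), (2, 3), (3, 2), (4, 4)]
e_values = pseudo_e(points)
print(e_values)            # [0.25, 0.5, 0.5, 1.0]
print(K_n(e_values, 0.5))  # 0.75
```

Evaluating `K_n` on a grid of t values gives the curve that is compared with the parametric K(θn, t).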
[Figure: Gumbel vs. empirical copula. Squared deviations plotted against t.]
[Figure: Gumbel vs. empirical copula. Weighted squared deviations plotted against t. Curves: weighted, modified weighted.]
Weighted quadratic differences:
$$\sum_{t_i \in (0,1)} \left( K(\theta_n, t_i) - K_n(t_i) \right)^2 w_i$$
with one of the weights
$$w_i = 1, \qquad w_i = \left[ K(\theta_n, t_i)\left(1 - K(\theta_n, t_i)\right) \right]^{-1}, \qquad w_i^M = \left(1 - K(\theta_n, t_i)\right)^{-1}$$
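The three versions of the statistic differ only in the weight applied to each squared difference, which a short sketch makes explicit. It assumes the weights 1, 1/(K(1-K)) and 1/(1-K), evaluated at the parametric K on a grid of points in (0,1). The function names are hypothetical; `gumbel_K` uses the standard closed form K(t) = t - t log(t)/theta for the Gumbel family.

```python
import math

def gumbel_K(t, theta):
    """Parametric K(t) for the Gumbel copula (standard Archimedean formula)."""
    return t - t * math.log(t) / theta

def weighted_ss(kn_values, k_values):
    """Return (SS, weighted SS, modified weighted SS) over a grid of t values.

    kn_values: non-parametric K_n(t_i); k_values: parametric K(theta_n, t_i),
    each strictly inside (0, 1).
    """
    ss = weighted = modified = 0.0
    for kn, k in zip(kn_values, k_values):
        squared = (k - kn) ** 2
        ss += squared
        weighted += squared / (k * (1.0 - k))  # emphasises both tails
        modified += squared / (1.0 - k)        # emphasises the upper tail only
    return ss, weighted, modified

grid = [0.1 * i for i in range(1, 10)]
k_param = [gumbel_K(t, 1.74) for t in grid]
k_empirical = [k - 0.01 for k in k_param]  # artificial constant discrepancy
print(weighted_ss(k_empirical, k_param))
```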
Which weights to use?
To compare which test statistic performs better at detecting discrepancies in the upper tail, we applied the following algorithm:
1. Simulate a sample from a parametric copula
2. Randomly choose two non-concordant points (x,y) near the right tail and permute their coordinates so that the new points x*,y* are concordant (the marginals do not change, but the copula does)
3. Perform the three versions of the test for the modified data set
4. Repeat steps 2 and 3, and investigate which statistic detects the changes faster
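Step 2 of the algorithm can be sketched as below: a discordant pair of points in the upper-right corner is selected at random and their second coordinates are swapped, which makes the pair concordant while leaving both marginal samples unchanged. The function name `make_concordant`, the corner cut-off `tail`, and the example points are assumptions for illustration.

```python
import random

def make_concordant(points, tail=0.7, rng=random):
    """Swap the second coordinates of one random discordant pair in the upper tail.

    points: list of (u, v) tuples in the unit square, modified in place.
    Returns True if a pair was changed, False if no discordant pair exists.
    """
    corner = [i for i, (u, v) in enumerate(points) if u > tail and v > tail]
    discordant = [
        (i, j)
        for pos, i in enumerate(corner)
        for j in corner[pos + 1:]
        if (points[i][0] - points[j][0]) * (points[i][1] - points[j][1]) < 0
    ]
    if not discordant:
        return False
    i, j = rng.choice(discordant)
    (ui, vi), (uj, vj) = points[i], points[j]
    points[i], points[j] = (ui, vj), (uj, vi)  # marginal samples are unchanged
    return True

data = [(0.8, 0.95), (0.9, 0.75), (0.1, 0.2)]
changed = make_concordant(data)
print(changed, data)
```

Repeating the call and recomputing the three test statistics after each change gives the number of permutations each test needs before it rejects.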
The data and its permutations
[Figure, four panels: Permutation = 0, Permutation = 5, Permutation = 10, Permutation = 15. The number in the title gives the number of changed pairs.]
Detecting changes
[Figure: Comparing methods. Three panels (sq.dev, w. sq.dev, modified) plotted against the number of permutations (5 to 25).]
In general, the tests based on weighted squared deviations perform better than the original one.
Of the two weighted tests, the modified version is more sensitive!
Simulation results
We recorded how many steps the different tests needed to detect the changes during the replications
As expected, the modified weights were the best!
Test statistic          mean    st.dev
Sum of squares (SS)     15.78   8.57
Weighted SS             11.58   7.7
Modified weighted SS    10.15   7.13
Time dependence
Has the dependence structure of the observations changed in the last century?
Windows of 80 years with a step size of 5 years were used to detect possible changes
Firstly we have to decide which copula to use
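The moving windows described above can be generated as follows. This is a minimal sketch: the function names, the inclusive window endpoints, and the layout of the peak records are assumptions, and the three records are invented.

```python
def sliding_windows(first_year, last_year, width, step):
    """Return (start, end) year pairs of the given width inside the period."""
    return [
        (start, start + width)
        for start in range(first_year, last_year - width + 1, step)
    ]

def peaks_in_window(peaks, window):
    """Select records (year, level_1, level_2) whose year lies in the window."""
    start, end = window  # both endpoints treated as inclusive (assumption)
    return [p for p in peaks if start <= p[0] <= end]

windows = sliding_windows(1900, 2000, 80, 5)
print(windows)

# Invented records, for illustration only.
records = [(1899, 700, 410), (1950, 820, 530), (1990, 760, 480)]
print(peaks_in_window(records, windows[0]))
```

The copula fit and the goodness-of-fit statistics would then be recomputed on the peaks falling in each window.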
[Figure, two panels: Frechet theta = 0.51 and Gumbel theta = 1.74. Curves for the windows 1900-1980, 1905-1985, 1910-1990, 1915-1995, 1920-2000, and 1900-2000 (parametric).]
Time dependence
[Figure, three panels: Frechet vs. Gumbel. Squared deviations, weighted squared deviations, and modified weighted squared deviations, each plotted on (0, 1). Curves: deviations from Frechet copula, deviations from Gumbel copula.]
In all of the three cases the Gumbel copula seems to be better than Frechet!
Simulated critical values
n \ tau   0.3      0.4      0.5      0.6
50        0.4612   0.4625   0.4104   0.3685
100       0.2105   0.2036   0.1872   0.1723
150       0.1345   0.1295   0.1312   0.117

n \ tau   0.3      0.4      0.5      0.6
50        4.0558   3.3538   2.9626   2.4796
100       1.6783   1.507    1.2564   1.1039
150       1.1004   0.9405   0.8423   0.733

n \ tau   0.3      0.4      0.5      0.6
50        2.8795   2.3359   1.9354   1.5223
100       1.2066   1.0019   0.8153   0.6618
150       0.7914   0.6278   0.5349   0.4314
Applications for the hydrological data set: time dependence
Window     N (sample size)   Kendall-tau   theta    Sum of squares (SS)   Weighted SS   Modified weighted SS
1          88                0.4306        1.7563   0.1208                0.8148        0.5313
2          90                0.4559        1.8379   0.16                  1.1805        0.7844
3          87                0.4503        1.819    0.1376                1.1665        0.8706
4          88                0.3872        1.6319   0.1419                1.5023        1.1526*
5          94                0.3856        1.6277   0.088                 0.906         0.6534
All obs.   119               0.425         1.7391   0.075                 0.5987        0.3716

• The only (marginally) significant value is marked with *
• A simulation study may be used for detecting changes in the dependence structure
References
Bozsó, D., Rakonczai, P. and Zempléni, A. (2005). Floods on river Tisza and some of its affluents. Extreme-value modelling in practice. Statisztikai Szemle, accepted for publication. (In Hungarian.)
Choulakian, V. and Stephens, M.A. (2001). Goodness-of-fit tests for the generalized Pareto distribution. Technometrics 43, 478-484.
D’Agostino, R.B. and Stephens, M.A. (1986). Goodness-of-fit Techniques. Marcel Dekker.
Genest, C., Quessy, J.-F. and Rémillard, B. (2003). Goodness-of-fit Procedures for Copula Models Based on the Integral Probability Transformation. GERAD.
Leadbetter, M.R., Lindgren, G. and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer.
Zempléni, A. (2004). Goodness-of-fit test in extreme value applications. Discussion paper No. 383, SFB 386, Statistische Analyse Diskreter Strukturen, TU München.