bayesian adaptive bandwidth kernel density estimation of...
TRANSCRIPT
Bayesian Adaptive Bandwidth Kernel Density
Estimation of Irregular Multivariate Distributions
Shuowen Hu, D. S. Poskitt, Xibin Zhang∗
Department of Econometrics and Business Statistics, Monash University, Australia
Abstract
In this paper, we propose a new methodology for multivariate kernel density
estimation in which data are categorized into low- and high-density regions
as an underlying mechanism for assigning adaptive bandwidths. We derive
the posterior density of the bandwidth parameters via the Kullback-Leibler
divergence criterion and use a Markov chain Monte Carlo (MCMC) sampling
algorithm to estimate the adaptive bandwidths. The resulting estimator is
referred to as the tail-adaptive density estimator. Monte Carlo simulation
results show that the tail-adaptive density estimator outperforms the global-
bandwidth density estimators implemented using different global bandwidth
selection rules. The inferential potential of the tail-adaptive density estimator
is demonstrated by employing the estimator to estimate the bivariate density
of daily index returns observed from the USA and Australian stock markets.
Keywords: marginal likelihood, Markov chain Monte Carlo, S&P500 index,
value-at-risk
∗Corresponding author. Email: [email protected]. Address: 900 DandenongRoad, Caulfield East, Victoria 3145, Australia. Telephone: +61 3 99032130.Fax: +61 3 99032007. URL: http://users.monash.edu.au/~xzhang/
Preprint submitted to Computational Statistics and Data Analysis July 25, 2011
1. Introduction
Kernel density estimation is one of the most important techniques for
understanding the distributional properties of data. It is understood that the
effectiveness of such approach depends on the choice of a kernel function and
the choice of a smoothing parameter or bandwidth (see for example, Izenman,
1991, for a discussion). Although the two issues cannot be treated separately,
it is widely accepted that the performance of a kernel density estimator is
mainly determined by the bandwidth (see for example, Scott, 1992; Wand
and Jones, 1995), while the impact of kernel choices on the performance of the
resulting density estimator was examined by Marron and Nolan (1988),Vieu
(1999) and Horova et al. (2002). Most investigations have aimed at choosing
a fixed or global bandwidth for a full sample of data (see Jones et al., 1996,
for a survey). Terrell and Scott (1992) and Sain and Scott (1996) proposed
the idea of data-driven adaptive bandwidth density estimation, which allows
the bandwidth to vary at different data points. In this situation, bandwidth
selection remains as an important issue and has been extensively investigated
for univariate data. However, less attention has been paid to investigations
of data-driven methods for estimating adaptive bandwidths in multivariate
data. This motivates the investigation of this paper.
Our investigation is also empirically motivated. Most financial analysts
believe that during the course of the global financial crisis caused by the fall-
out of the USA sub-prime mortgage crisis, the USA stock market has had a
leading effect on most other stock markets world wide. Using a kernel den-
sity estimator of bivariate stock-index returns we can derive the conditional
distribution of the stock-index return in one market for a given value of the
2
stock-index return in the USA market, and therefore we can better under-
stand how the former market was associated with the USA market. However,
the marginal density of stock-index returns often exhibits leptokurtosis. Con-
sequently, the kernel estimation of the bivariate density of stock-index returns
may require different bandwidths to different groups of observed returns.
1.1. Some Background
Let X = (X1, X2, · · · , Xd)′ denote a d-dimensional random vector with
its density function f(x) defined on Rd (see Jacome et al., 2008, for an
example of censored data). Let {x1,x2, · · · ,xn} be a random sample drawn
from f(x). The kernel density estimator of f(x) is (Wand and Jones, 1995)
fH(x) =1
n
n∑i=1
KH(x− xi) =1
n|H|1/2n∑i=1
K(H−1/2(x− xi)), (1)
where K(·) is a multivariate kernel, and H is a symmetric and positive defi-
nite d× d matrix known as the bandwidth matrix. To choose an optimal H,
several methods have been discussed in literature (see for example, Devroye
and Gyorfi, 1985; Marron, 1987). Marron (1992) proposed a bootstrapping
approach to the approximation of the mean integrated squared error (MISE),
and the normal reference rule (NRR) or equivalently the rule-of-thumb that
minimizes the asymptotic MISE was discussed by Scott (1992). The least
squares cross-validation was discussed by Sain et al. (1994) and Duong and
Hazelton (2005), and a plug-in method was suggested by Wand and Jones
(1994) and improved by Duong and Hazelton (2003). Moreover, Zhang et al.
(2006) proposed a Bayesian approach to the estimation of the bandwidth
matrix based on Kullback-Leibler information criterion.
3
However, a problem in using a global bandwidth is that the kernel meth-
ods often produce unsatisfactory results for complex or irregular densities.
Sain and Scott (1996) presented a classic example, in which a bimodal mix-
ture with two modes of equal height but different levels of variation, where
an optimal global bandwidth tends to under-smooth the mode with a large
variation and over-smooth the mode with a small variation. Therefore, it is
necessary to let the bandwidth be adaptive to local observations; a relatively
small bandwidth is needed where observations are densely distributed, and a
large bandwidth is required where observations are sparsely distributed (see
Jones, 1990; Wand and Jones, 1995; Sain, 2002, for similar arguments).
Several versions of the adaptive bandwidth kernel density estimator have
been studied in literature. Mielniczuk et al. (1989) proposed to use a weight-
ing function on data point x in a global bandwidth estimator. A popular
method is to make the bandwidth as a function of data points, and Nolan and
Marron (1989) discussed this issue from a general Delta-sequence estimator
perspective. Loftsgaarden and Quesenberry (1965) suggested to replace H in
(1) is by H(x), which is the bandwidth at data point x. This estimator was
studied by Cao (2001) and Sain and Scott (2002) and is also called the bal-
loon estimator. However, the balloon estimator does not integrate to one and
therefore is not a good choice for density estimation (Terrell and Scott, 1992;
Izenman, 1991). The other estimator proposed by Breiman et al. (1977) is
called the sample-point estimator, which employs H(xi) as the bandwidth
associated with sample point xi. Abramson (1982a,b) suggested to choose
bandwidth as the inverse square root of f(xi). As the sample-point estimator
always integrate to one, we consider this estimator throughout this paper.
4
The difficulty of using sample-point estimator is that the number of band-
widths exceeds the sample size in multivariate data, and this makes diffi-
culties in estimating bandwidths. Sain and Scott (1996) and Sain (2002)
suggested grouping data into different bins and using a constant bandwidth
for each bin. Although this method reduces the number of bandwidths, the
number of bandwidths still grows exponentially with the dimension.
1.2. The Tail-Adaptive Density Estimator
A simple method to control the number of bandwidths to be estimated
for the sample-point kernel density estimator is to put the data into a small
number of groups. In this paper, we propose dividing the observations into
two regions, namely the low-density region (LDR) and high-density region
(HDR), and assigning two different bandwidth matrices to these two regions.
When the true density is unimodal, the low-density region is the tail area
and should be assigned larger bandwidths than the high-density region. We
call this type of kernel density estimator the tail-adaptive density estimator.
It is not new to group the observations into the low- and high-density
regions. Hartigan (1975, 1987) proposed clustering data into different regions
with different density values, and Hyndman (1996) presented an algorithm for
computing and graphing data in different density regions. A comprehensive
review of applications related to the issue of low-and high-density regions
was given in Mason and Polonik (2009). In terms of bandwidth selection,
Samworth and Wand (2010) considered a bandwidth selection method for
univariate high-density region estimation. To our knowledge, there is yet any
other investigation on bandwidth selection for a general multivariate kernel
density estimation, where two different bandwidth matrices are assigned to
5
observations in the low- and high-density regions.
In this paper, we treat the elements of the two bandwidth matrices
as parameters, whose posterior can be approximately derived through the
Kullback-Leibler information. Therefore, bandwidths can be estimated through
a posterior simulator. During the past decade, there have been several inves-
tigations on Bayesian approaches to bandwidth estimation for kernel density
estimation (see for example, Brewer, 2000; Gangopadhyay and Cheung, 2002;
Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular,
Zhang et al. (2006) derived the posterior of bandwidths for multivariate ker-
nel density estimation with a global bandwidth matrix. Their Monte Carlo
simulation results reveal the advantage of this Bayesian approach over its
competitors including the plug-in method of Duong and Hazelton (2003)
and the NRR. However, Hall (1987) showed that the use of a global band-
width for a long-tailed distribution can mislead Kullback-Leibler informa-
tion. Therefore, in this paper, we extent the sampling algorithm proposed
by Zhang et al. (2006) by incorporating tail-adaptive bandwidth matrices
into the multivariate kernel density estimation.
The rest of this paper is organized as follows. In Section 2, we derive the
posterior of the elements of the tail-adaptive bandwidth matrices describe an
MCMC sampling algorithm. Sections 3 and 4 present the results of Monte
Carlo simulation studies designed to examine the performance of the tail-
adaptive density estimator. In Section 5, we apply the tail-adaptive density
estimator to the estimation of the bivariate density of two asset returns.
Section 6 concludes the paper.
6
2. Bayesian estimation of bandwidths
2.1. Likelihood cross-validation
The Kullback-Leibler information, which measures the discrepancy be-
tween a density estimator and its true density, is defined as
dKL
(f(x), fH(x)
)=
∫Rd
log{f(x)}f(x)dx−∫Rd
log{fH(x)}f(x)dx. (2)
An optimal bandwidth could be derived by minimizing (2), and this is equiv-
alent to the maximization of∫Rd log{fH(x)}f(x)dx with respect to H. As
this integral is the expectation of log{fH(x)} under f(x), it is approximated
by n−1∑n
i=1 log fH,i(xi), where
fH,i(xi) =1
n− 1
n∑j=1j 6=i
|H|−1/2K(H−1/2(xi − xj)
), (3)
known as the leave-one-out estimator of f(xi) (Hardle, 1991). The likelihood
cross-validation method is to maximize∑n
i=1 log fH,i(xi), which is regarded
as the likelihood of {x1,x2, · · · ,xn} for given H, with respect to H.
The bandwidth matrix can be either a full or diagonal matrix. A full
bandwidth matrix can reveal useful features of the resulting density estima-
tor, while the implementation of some computing algorithms for bandwidth
estimation is often very difficult especially when adaptive bandwidth ma-
trices are used. The numerical result obtained by Sain (2002) showed that
the density estimator with a full bandwidth matrix is not smooth in low-
density regions. In this paper, we use a diagonal bandwidth matrix and let
h = (h1, h2, . . . , hd)′ denote the vector of the square roots of the diagonal
elements of the bandwidth matrix. h is also known as the bandwidth vector.
7
2.2. Tail-adaptive kernel density estimator
The concept of grouping observations into low- and high-density regions
has been discussed in literature (see for example, Hartigan, 1975, p205). We
propose to group some observations into the low-density region, inside which
each observation has a density no more than the density of each observation
outside the region. Actually, our definition of the low-density region is in the
same sense of Hyndman (1996) whose defined the highest density region.
Let α be a threshold value that determines the proportion of the low-
density region relative to the sample space. Let L(fα) denote a subset of the
sample space, so that the (100× α)% low-density region is shown as
L(fα) = {x : f(x) ≤ fα},
where fα is the largest constant such that Pr{x ∈ L(fα)} ≤ α.
Let
Ij =
1 if xj ∈ L(fα)
0 otherwise,
for j = 1, 2, · · · , n. Let h(1) denote the bandwidth vector assigned to the
observations in L(fα) and h(0) the bandwidth vector assigned to those ob-
servations outside L(fα). The kernel density estimator is
fh(1),h(0)(x) =1
n
n∑j=1
{IjK
((x− xj)./h
(1))./h(1)
+ (1− Ij)K((x− xj)./h
(0))./h(0)
}, (4)
where ./ denotes division by elements. The leave-one-out estimator is denoted
as fh(1),h(0),i(xi) for i = 1, 2, · · · , n. As the low-density region becomes the tail
area when the underlying density is unimodal, we call (4) the tail-adaptive
8
estimator for simplicity. The value of α can be chosen as either 5% or 10%.
As f(x) is unknown, the initial value of fα can be approximated through the
kernel density estimator of f(x) using a global bandwidth.
2.3. Posterior of bandwidth parameters
Given h(1) and h(0), the approximate likelihood is
`(x1,x2, . . . ,xn|h(1),h(0)) =n∏i=1
fh(1),h(0),i(xi). (5)
As suggested by Zhang et al. (2006), we assume that the prior of each band-
width to be the truncated Cauchy density
p(h(l)k ) =
2
1 + h(l)k × h
(l)k
, for h(l)k > 0,
for k = 1, 2, · · · , d, and l = 0 and 1. The posterior of h(1) and h(0) for given
{x1,x2, . . . ,xn} is
π(h(1),h(0)|x1,x2, · · · ,xn) ∝
{n∏i=1
fh(1),h(0),i(xi)
}×
{d∏
k=1
p(h(1)k )× p(h(0)
k )
}.
(6)
The posterior given by (6) is of non-standard form, and we cannot derive
an analytical expression as the estimate of {h(1),h(0)}. However, we can use
the random-walk Metropolis algorithm to sample {h(1),h(0)} from (6). The
sampling procedure is as follows.
1) Obtain initial low-density regions for a given α based on kernel density
estimator with a global bandwidth vector chosen via NRR.
2) Assign initial values to h(1) and h(0), which are respectively, the band-
width vectors given to observations within the low- and high-density
regions specified in Step 1).
9
3) Let h denote the vector of all elements of h(1) and h(0). Use the random-
walk Metropolis algorithm to update h with the acceptance probability
computed through the posterior given by (6).
4) Derive the low-density region according the density estimator with the
bandwidth vectors updated in Step 3).
5) Repeat Steps 3) and 4) until the simulated chain of h achieves reason-
able mixing performance.
During the above iteration process, we discard the draws during the burn-
in period, and record the draws of h thereafter. Let {h(1), h(2), · · · , h(M)}
denote the recorded draws. The posterior mean (or ergodic average) denoted
as∑M
i=1 h(i)/M , is an estimate of h.
3. A Monte Carlo simulation study
To investigate the performance of the proposed tail-adaptive kernel den-
sity estimator, we approximate Kullback-Leibler information via Monte Carlo
simulation. A large number of random vectors are drawn from f(x) so as to
approximate (2) numerically. A bandwidth estimation method is better than
its competitor if the former produces a smaller Kullback-Leibler information
than the former.
3.1. True densities
We simulate samples from six target bivariate densities labeled A, B, C,
D, E and F1. These densities are of irregular shapes. Density A is a mixture of
1Refer to Hu et al. (2010) for univariate target densities, their contour plots and the
simulation results.
10
two equally weighted normal densities with bimodality: fA (x|µ1,Σ1, µ2,Σ2) =
12φ (x|µ1,Σ1) + 1
2φ (x|µ2,Σ2) , where φ(x|µ,Σ) is a multivariate normal den-
sity with mean µ and variance-covariance matrix
Σ =
1 ρ
ρ 1
. (7)
We used µ1 = (−1.5,−1.5)′, µ2 = (2, 2)′, ρ = 0.3 for Σ1 and ρ = −0.9 for Σ2.
Density B is a mixture of two normal densities with different weights
but an equal height at the modes: fB (x|µ1,Σ1, µ2,Σ2) = 34φ (x|µ1,Σ1) +
14φ (x|µ2,Σ2) , where µ1 = (−1.5,−1.5)′ and µ2 = (1.5, 1.5)′. Σ is defined in
(7) with ρ = 0.5 in Σ1, and Σ2 = 13Σ1.
Density C is a mixture of two skew-normal densities proposed by Azzalini
and Valle (1996): fC (x|µ1, γ1, µ2, γ2,Σ) = 12× 2φ(x|µ1,Σ)Φ (γ′1(x− µ1)) +
12× 2φ(x|µ2,Σ)Φ (γ′2(x− µ2)), where Φ(·) is the cumulative density function
(CDF) of the standard normal distribution. We chose µ1 = (−0.5,−0.5)′,
γ1 = (−9,−9)′, µ2 = (0, 0)′, γ2 = (9, 9) and Σ is defined in (7) with ρ = 0.3.
Note that γ1, γ2 ∈ R2 are the shape parameters determining the skewness.
Density D is a Student t density denoted as td (x|µ,Σ, ν),
Density E is a mixture of two Student t densities with ν = 5:
fE (x|µ1, µ2,Σ, ν) = 0.5 td (x|µ1,Σ1, ν) + 0.5 td (x|µ2,Σ2, ν) ,
where µ1 = (−2, 0)′, µ2 = (2, 0)′, and Σ1 and Σ2 are defined in (7) with
ρ = −0.5 and 0.5, respectively.
Density F is a skew-t density proposed by Azzalini and Capitanio (2003):
fF (x|µ,Σ, α, ν) = 2 td(x|µ,Σ, ν)Td(x|ν + d),
11
where x = γ′ω−1(x − µ)(
ν+d(x−µ)′Σ−1(x−µ)+ν
)1/2
, µ = (0, 0)′, γ = (−2, 0)′, Σ
is an identity matrix. ω is a diagonal matrix with diagonal elements the
same as those of Σ, and Td(·|ν + d) is the CDF of the Student t distribution
with ν + d degrees of freedom. The contour plot of each of the six bivariate
densities can be found in Figure 2 of Hu et al. (2010).
3.2. Accuracy of our Bayesian bandwidth estimation
We generated samples of n = 500, 1000, 2000 from each of the six bivariate
densities. The kernel function is the product of univariate Gaussian kernels.
We estimated the bandwidth vectors for the proposed tail-adaptive kernel
density estimator with α = 0.05 and 0.1. A global bandwidth vector is also
estimated via the Bayesian method of Zhang et al. (2006) and NRR.
We used the random-walk Metropolis algorithm to sample all bandwidths
from their posterior. The burn-in period contains 3,000 iterations, and the
following 10,000 iterations were recorded. We computed the batch-mean
standard deviation discussed by Roberts (1996) and the simulation ineffi-
cient factor (SIF) discussed by Kim et al. (1998) to monitor the mixing per-
formance. Both indicators show that the sampler has achieved a reasonable
mixing performance. Refer to Hu et al. (2010) for graphs on visual inspection
of the mixing performance.
Consider a sample generated from fF (x) with α = 0.05 and sample size
1000. Table 1 presents the MCMC results, where the SIFs are very small and
indicate a reasonable mixing performance of the proposed sampler. The tail-
adaptive density estimator clearly captures the fat-tailed feature of fF . For
example, the estimates of both components of h(1) for observations inside the
12
low-density region are respectively, much larger than the estimates of both
components of h(0) for observations outside this region.
We generatedN=100,000 random vectors from the true density and calcu-
lated the estimated Kullback-Leibler information defined by (2). As shown in
Table 2, among all six densities considered, the tail-adaptive density estima-
tor obviously performs better than the global-bandwidth density estimators
with the bandwidth vector estimated through Bayesian sampling and NRR.
The results also indicate that the performance of the tail-adaptive density
estimator is not very sensitive to different values of α.
The mean integrated squared error (MISE) was also used to examine the
performance of tail-adaptive density estimator. We numerically approximate
the MISE through 200 data sets for each the bivariate densities with sample
size 500, 1000 and 2000. Table 3 shows that the lower-adaptive estimator
always outperforms the global-bandwidth estimator.
4. Tail-adaptive density estimation for high dimensions
Our proposed Bayesian sampling algorithm for estimating bandwidths in
tail-adaptive kernel density estimation is applicable to data of any dimen-
sion. In this section, we aim to examine the performance of the tail-adaptive
estimator in comparison with the global-bandwidth estimator.
4.1. True densities
We consider four target densities labeled G, H, I and J. Density G is a
mixture of two multivariate normal densities:
fG (x|µ1, µ2,Σ1,Σ2) =1
2φ (x|µ1,Σ1) +
1
2φ (x|µ2,Σ2) ,
13
where µ1 = (−1.5,−1.5,−1.5,−1.5,−1.5)′, µ2 = (2, 2, 2, 2, 2)′, and both
variance-covariance matrices have the form of
Σ =1
1− ρ2
1 ρ ρ2 ρ3 ρ4
ρ 1 ρ ρ2 ρ3
. . .
ρ4 ρ3 ρ2 ρ 1
, (8)
with ρ = 0.3 for Σ1 and ρ = −0.9 for Σ2.
Density H is a multivariate skew-normal densities:
fH (x|µ,Σ, α) = 2φ (x|µ,Σ) Φ (γ′(x− µ)) ,
where Σ is given by (8) with ρ = 0.9, µ = (−0.5,−0.5,−0.5,−0.5,−0.5)′,
and γ = (−9,−9,−9,−9,−9)′.
Density I is a mixture of two multivariate Student t densities:
fI (x|µ1, µ2,Σ1,Σ2, ν) = 0.5 td (x|µ1,Σ1, ν) + 0.5 td (x|µ2,Σ2, ν) ,
where µ1 = (−2, 0,−2, 0,−2)′, µ2 = (2, 0, 2, 0, 2)′, ν = 5, and both Σ1 and
Σ1 are defined by (8) with ρ = −0.5 and ρ = 0.5, respectively.
Density J is a multivariate skew-t densities:
fJ (x|µ,Σ, α, ν) = 2td (x|µ,Σ, ν)Td (x|ν + d) ,
where µ = 0, ν = 5, Σ is an identity matrix, and x is defined in a similar
way as that in the bivariate fF (x) with γ = (2, 0, 2, 0, 2)′.
4.2. Accuracy of our Bayesian bandwidth estimation
We generated samples of sizes n = 500, 1000, 2000 from each of the five-
dimensional densities. Table 4 presents the estimated Kullback-Leibler in-
formation between the true density and its estimator resulted from each of
14
the three bandwidth estimation methods. The kernel density estimator with
tail-adaptive bandwidth vectors obviously outperforms its counterpart with a
global bandwidth vector estimated via either the NRR or Bayesian sampling.
This finding is consistent with what we found in the bivariate situation. For
all sample sizes of each density, we found that the tail-adaptive kernel den-
sity estimator with α = 0.1 slightly outperforms the same estimator with
α = 0.05. However, we would be reluctant to make a decision as to whether
the former performs better than the latter because such a is marginal. Due
to the heavy burden required by the numerical computation of MISE for
five-dimensional densities, we have not computed MISE in this section.
5. An application of the tail-adaptive density estimator
In this section, we apply the proposed tail-adaptive kernel density es-
timator to the estimation of bivariate density of stock-index returns. We
downloaded the daily closing index values of the S&P 500 index in the USA
stock market and the All Ordinaries (AORD) in the Australian stock mar-
ket, from the 2nd January 2006 to the 16th September 2010 excluding non-
trading days, with sample size 1155. The AORD return was matched to the
overnight S&P 500 return. Let Pt denote the closing index at date t. The
daily continuously compounded returns in percentage form was computed as
(lnPt− lnPt−1)×100. The sample period covers the period of current global
financial crisis, and there were some extreme observations. The Pearson
correlation coefficient between the two return series is 0.6171.
We used the random-walk Metropolis algorithm to estimate bandwidths
for the tail-adaptive kernel density estimator of the bivariate returns, with
15
α = 0.05. We also estimated the global bandwidth via the NRR and Bayesian
sampling. The results are given in Table 5, where the SIFs indicate reasonable
mixing performance achieved by both sampling algorithms. Moreover, the
use of tail-adaptive bandwidths leads to a much larger marginal likelihood
than the use of a global bandwidth in this bivariate kernel density estimator,
where the marginal likelihoods were calculated through Newton and Raftery
(1994). Thus, the use of tail-adaptive bandwidths is strongly favored against
the use of a global bandwidth. The density surface and contour plot for each
of the three density estimators are presented in Hu et al. (2010), where the
tail-adaptive estimator is shown to be able to capture richer dynamics than
the other tow.
Let xt and yt denote the S&P 500 and AORD daily returns, respectively.
We used the estimated tail-adaptive bandwidths to compute the conditional
density of yt at a given value of xt through f(y|xt = x) = f(y, x)/fx(x),
where f(y, x) is the joint density of (yt, xt), and fx(x) is the marginal density
of xt. Bandwidths estimated for a joint density can be used for the purpose
to compute conditional density (Holmes et al., 2010; Polak et al., 2010). We
computed the conditional density of xt given that yt is at the 10th, 5th and
1st percentiles, which are corresponding to percentage return values of -0.75,
-1.13, and -2.24, respectively.
We are also able to compute the conditional probability Pr{yt ≤ y|xt ≤
x} = Pr{yt ≤ y, xt ≤ x}/Pr{xt ≤ x}. For example, Pr{yt ≤ 0|xt ≤ 0} =
0.67. It means that when the USA stock market finished daily trading with
a negative return, there was a 67% chance that the Australian market would
also drop. Such a percentage suggests that the Australian market followed
16
the USA market during this period. However, this result should not be
interpreted as blaming the problems in Australian stock market to the USA
stock market, while this is simply an evidence of market dependence.
In addition, we can estimate the conditional CDF of yt for given xt = x:
F (y|xt = x) = Pr{yt ≤ y|xt = x} =
∫ y
−∞
f(z, x)
fx(x)dz. (9)
Given that the USA market finished daily trading with the S&P 500 return
being at x%, the probability that the Australian market dropped beyond
the same level is indicated by letting y = x in (9). At the above-mentioned
percentiles of the S&P 500 return, the probability that the AORD return
were below the same levels are 0.27, 0.24 and 0.12, respectively. It indicates
that the probability that the Australian market had a larger drop than the
USA market was no more than 27%. See Hu et al. (2010) for the graphs of
the conditional density and its CDF.
Moreover, we evaluated the performance of our VaR in comparison to the
VaR obtained through IGARCH model (or the Riskmetrics). We followed
the steps described in Bao et al. (2006) and calculated the check function of
Koenker and Bassett Jr (1978). The existing sample was used as the learning
set, and we forecasted daily 5% VaR of AORD from 17th September 2010 to
5th May 2011. The check function computed by IGARCH and our proposed
method are 0.0356 and 0.0344, respectively. This finding indicates that our
approach to VaR estimation is more effective than the Riskmetrics.
6. Conclusion
This paper proposes a kernel density estimator with tail-adaptive band-
widths, which are assigned to the low- and high-density regions, respectively.
17
We have derived the posterior of bandwidth parameters based on Kullback-
Leibler information and presented an MCMC sampling algorithm to estimate
bandwidths. The Monte Carlo simulation study shows that the tail-adaptive
kernel density estimator outperforms its competitor, the global-bandwidth
kernel density estimator. The simulation result also shows that the improve-
ment made by the tail-adaptive kernel density estimator is especially obvious
when the underlying density is fat-tailed. Although the probability of the
low-density region has to be chosen before sampling is carried out, we have
found that performance the tail-adaptive kernel estimator is not sensitive to
different values of this probability. Future search could include this proba-
bility as a parameter to be estimated via the same sampling procedure.
We applied the tail-adaptive kernel density estimator to the estimation
of bivariate density of the paired daily returns of the Australian Ordinary
and S&P 500 indices during a period that covers the global financial crisis.
The tail-adaptive density estimator is strongly favored against the global-
bandwidth density estimator. We derived the estimated conditional density
and distribution of the Australian index return given that the USA market
finished daily trading with different return values. Although the Australian
stock market followed the USA market during the crisis, there was no more
than 27% chance that the former market had a larger drop than the latter.
Our approach can be viewed as mode estimation in clustering analysis.
Even though our algorithm can be implemented to data of any dimension, the
curse of dimensionality is a major concern for kernel density estimation. Fer-
raty and Vieu (2006) suggested attacking the dimensionality problem from
a functional setting even for data of infinite dimensions. It might be possi-
18
ble to explore whether the proposed Bayesian approach is applicable in this
situation. We leave it for future research.
Acknowledgements
We extend our sincere thanks to the Co-Editor Stanley Azen, an associate
editor and two reviewers for their very insightful comments that have led to
a substantially improved paper. Thanks also go to the Victorian Partnership
for Advanced Computing (VPAC) for its quality computing facility.
References
Abramson, I., 1982a. Arbitrariness of the pilot estimator in adaptive kernel
methods. Journal of Multivariate Analysis 12, 562–567.
Abramson, I., 1982b. On bandwidth variation in kernel estimates — a square
root law. The Annals of Statistics 10, 1217–1223.
Azzalini, A., Capitanio, A., 2003. Distributions generated by perturbation
of symmetry with emphasis on a multivariate skew t-distribution. Journal
of the Royal Statistical Society, Series B 65, 367–389.
Azzalini, A., Valle, A., 1996. The multivariate skew-normal distribution.
Biometrika 83, 715–726.
Bao, Y., Lee, T., Saltoglu, B., 2006. Evaluating predictive performance
of value-at-risk models in emerging markets: a reality check. Journal of
Forecasting 25, 101–128.
19
Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel estimates of
multivariate densities. Technometrics 19, 135–144.
Brewer, M., 2000. A Bayesian model for local smoothing in kernel density
estimation. Statistics and Computing 10, 299–309.
Cao, R., 2001. Relative efficiency of local bandwidths in kernel density esti-
mation. Statistics 35, 113–137.
Devroye, L., Gyorfi, L., 1985. Nonparametric Density Estimation: The L1
View. John Wiley & Sons, New York.
Duong, T., Hazelton, M., 2003. Plug-in bandwidth matrices for bivariate
kernel density estimation. Journal of Nonparametric Statistics 15, 17–30.
Duong, T., Hazelton, M., 2005. Cross-validation bandwidth matrices for
multivariate kernel density estimation. Scandinavian Journal of Statistics
32, 485–506.
Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis: Theory
and Practice. Springer, New York.
Gangopadhyay, A., Cheung, K., 2002. Bayesian approach to the choice of
smoothing parameter in kernel density estimation. Journal of Nonpara-
metric Statistics 14, 655–664.
Hall, P., 1987. On Kullback-Leibler loss and density estimation. The Annals
of Statistics 15, 1491–1519.
Hardle, W., 1991. Smoothing Techniques: with Implementation in S.
Springer, New York.
20
Hartigan, J.A., 1975. Clustering Algorithms. John Wiley & Sons, New York.
Hartigan, J.A., 1987. Estimation of a convex density contour in two dimen-
sions. Journal of the American Statistical Association 82, 267–270.
Holmes, M.P., Gray, A.G., Isbell Jr, C.L., 2010. Fast kernel conditional
density estimation: A dual-tree Monte Carlo approach. Computational
Statistics & Data Analysis 54, 1707–1718.
Horova, I., Vieu, P., Zelinka, H., 2002. Optimal choice of nonparametric
estimaties of a density and of its derivatives. Statistics & Decisions 20,
355–378.
Hu, S., Poskitt, D., Zhang, X., 2010. Bayesian Adaptive Bandwidth Ker-
nel Density Estimation of Irregular Multivariate Distributions. Monash
Econometrics and Business Statistics Working Papers.
Hyndman, R.J., 1996. Computing and graphing highest density regions. The
American Statistician 50, 120–126.
Izenman, A.J., 1991. Recent developments in nonparametric density estima-
tion. Journal of the American Statistical Association 86, 205–224.
Jacome, M., Gijbels, I., Cao, R., 2008. Comparison of presmoothing methods
in kernel density estimation under censoring. Computational Statistics 23,
381–406.
Jones, M.C., 1990. Variable kernel density estimates and variable kernel
density estimates. Australian & New Zealand Journal of Statistics 32,
361–371.
21
Jones, M.C., Marron, J.S., Sheather, S.J., 1996. A brief survey of band-
width selection for density estimation. Journal of the American Statistical
Association 91, 401–407.
Kim, S., Shepherd, N., Chib, S., 1998. Stochastic volatility: Likelihood
inference and comparison with ARCH models. Review of Economic Studies
65, 361–393.
Koenker, R., Bassett Jr, G., 1978. Regression quantiles. Econometrica 46,
33–50.
Kulasekera, K.B., Padgett, W.J., 2006. Bayes bandwidth selection in kernel
density estimation with censored data. Journal of Nonparametric Statistics
18, 129–143.
de Lima, M., Atuncar, G., 2010. A bayesian method to estimate the optimal
bandwidth for multivariate kernel estimator. Journal of Nonparametric
Statistics 23, 137–148.
Loftsgaarden, D.O., Quesenberry, C.P., 1965. A nonparametric estimate of a
multivariate density function. The Annals of Mathematical Statistics 36,
1049–1051.
Marron, J., 1992. Bootstrap bandwidth selection, in: LePage, R., Billard, L.
(Eds.), Exploring the Limits of Boostrap. John Wiley & Sons, New York,
pp. 249–262.
Marron, J.S., 1987. A comparison of cross-validation techniques in density
estimation. The Annals of Statistics 15, pp. 152–162.
22
Marron, J.S., Nolan, D., 1988. Canonical kernels for density estimation.
Statistics & Probability Letters 7, 195 – 199.
Mason, D.M., Polonik, W., 2009. Asymptotic normality of plug-in level set
estimates. The Annals of Applied Probability 19, 1108–1142.
Mielniczuk, J., Sarda, P., Vieu, P., 1989. Local data-driven bandwidth choice
for density estimation. Journal of Statistical Planning and Inference 23,
53–69.
Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with
the weighted likelihood bootstrap. Journal of the Royal Statistical Society,
Series B 56, 3–48.
Nolan, D., Marron, J., 1989. Uniform consistency of automatic and location-
adaptive delta-sequence estimators. Probability Theory and Related Fields
80, 619–632.
Polak, J., Zhang, X., King, M.L., 2010. Bandwidth selection for kernel condi-
tional density estimation using the MCMC method. Manuscript presented
at Australian Statistical Conference, 6-10 December, Western Australia.
Roberts, G.O., 1996. Markov chain concepts related to sampling algorithms,
in: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), Markov Chain
Monte Carlo in Practice. Chapman & Hall, London, pp. 45–57.
Sain, S.R., 2002. Multivariate locally adaptive density estimation. Compu-
tational Statistics & Data Analysis 39, 165–186.
23
Sain, S.R., Baggerly, K.A., Scott, D.W., 1994. Cross-validation of multivari-
ate densities. Journal of the American Statistical Association 89, 807–817.
Sain, S.R., Scott, D.W., 1996. On locally adaptive density estimation. Jour-
nal of the American Statistical Association 91, 1525–1534.
Sain, S.R., Scott, D.W., 2002. Zero-bias locally adaptive density estimators.
Scandinavian Journal of Statistics 29, 441–460.
Samworth, R.J., Wand, M.P., 2010. Asymptotics and optimal bandwidth
selection for highest density region estimation. The Annals of Statistics
38, 1767–1792.
Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and
Visualization. John Wiley & Sons, New York.
Terrell, G.R., Scott, D.W., 1992. Variable kernel density estimation. The
Annals of Statistics 20, 1236–1265.
Vieu, P., 1999. Multiple kernel procedure: An asymptotic support. Scandi-
navian Journal of Statistics 26, 61–72.
Wand, M.P., Jones, M.C., 1994. Multivariate plug-in bandwidth selection.
Computational Statistics 9, 97–116.
Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, New
York.
Zhang, X., King, M.L., Hyndman, R.J., 2006. A Bayesian approach to band-
width selection for multivariate kernel density estimation. Computational
Statistics & Data Analysis 50, 3009–3031.
24
Table 1: MCMC results obtained based on a sample generated from density F
Bandwidths Mean Standard Batch-mean SIF Acceptance
deviation standard deviation rate
h(1)1 1.1121 0.3184 0.0157 24.32 0.28
h(1)2 1.6432 0.3816 0.0164 18.57
h(0)1 0.2505 0.0469 0.0019 17.13
h(0)2 0.4196 0.0675 0.0018 7.35
Table 2: Estimated Kullback-Leibler information for bivariate densities
Global-bandwidth Tail-adaptive bandwidth
Density n NRR Bayesian α = 0.05 α = 0.10
fA 500 0.2878 0.0858 0.0772 0.0748
1000 0.2382 0.0617 0.0498 0.0467
2000 0.1981 0.0402 0.0339 0.0338
fB 500 0.1201 0.0499 0.0444 0.0442
1000 0.0826 0.0349 0.0332 0.0337
2000 0.0653 0.0256 0.0219 0.0217
fC 500 0.1126 0.0930 0.0783 0.0768
1000 0.0924 0.0689 0.0559 0.0558
2000 0.0900 0.0648 0.0497 0.0498
fD 500 0.1171 0.0946 0.0464 0.0449
1000 0.0809 0.0769 0.0286 0.0312
2000 0.0590 0.0565 0.0242 0.0270
fE 500 0.1436 0.1072 0.0623 0.0530
1000 0.1038 0.1088 0.0328 0.0397
2000 0.0782 0.0666 0.0262 0.0282
fF 500 0.1169 0.1641 0.0520 0.0545
1000 0.0781 0.0657 0.0261 0.0306
2000 0.0708 0.0637 0.0237 0.0242
25
Table 3: Estimated MISE (×100) for bivariate densities
Global-bandwidth Tail-adaptive bandwidth
Density n NRR Bayesian α = 0.05 α = 0.10
fA 500 1.8482 0.4945 0.4415 0.4401
1000 1.5546 0.3354 0.2920 0.2901
2000 1.2875 0.2230 0.1875 0.1854
fB 500 0.6526 0.2718 0.2612 0.2606
1000 0.4927 0.1828 0.1738 0.1727
2000 0.3595 0.1164 0.1101 0.1089
fC 500 0.8812 0.6022 0.5422 0.5422
1000 0.7059 0.4208 0.3694 0.3804
2000 0.5589 0.2868 0.2600 0.2692
fD 500 0.3250 0.6573 0.2782 0.2638
1000 0.2152 0.5121 0.1796 0.1737
2000 0.1489 0.3915 0.1219 0.1245
fE 500 0.4022 0.3722 0.1919 0.1822
1000 0.3039 0.2980 0.1279 0.1242
2000 0.2207 0.2283 0.0840 0.0844
fF 500 0.3331 0.7560 0.3092 0.2966
1000 0.2229 0.6150 0.2054 0.2022
2000 0.1496 0.4856 0.1386 0.1437
26
Table 4: Estimated Kullback-Leibler information for 5-dimensional densities
Global-bandwidth Tail-adaptive bandwidth
Density n NRR Bayesian α = 0.05 α = 0.10
fG 500 0.8923 0.4280 0.4026 0.4004
1000 0.7705 0.3093 0.2848 0.2825
2000 0.6933 0.2489 0.2343 0.2300
fH 500 0.4559 0.3438 0.3212 0.3179
1000 0.4041 0.2892 0.2613 0.2582
2000 0.3355 0.2226 0.2033 0.1987
fI 500 0.5943 0.5674 0.3446 0.3187
1000 0.4994 0.4814 0.2891 0.2666
2000 0.4395 0.4255 0.2274 0.2072
fJ 500 0.6107 0.5755 0.3226 0.3033
1000 0.5969 0.4415 0.2538 0.2284
2000 0.5050 0.3937 0.1971 0.1773
Table 5: A summary of MCMC results obtained through our proposed Bayesian sampling
algorithm to the tail-adaptive kernel density estimator of the S&P500 and AORD returns
Bandwidths Mean Standard SIF Acceptance log marginal
deviation rate likelihood
NRR h1 0.2171
h2 0.1783
Bayesian global h1 0.1795 0.0113 5.63 0.21 -1719.64
bandwidth h2 0.2485 0.0121 5.89
Tail-adaptive h(1)1 0.5533 0.2217 39.30 0.27 -1657.14
bandwidth h(1)2 0.1221 0.0161 15.39
with α = 0.05 h(0)1 0.5552 0.1140 19.97
h(0)2 0.1547 0.0174 13.55
27