MODELS AND METHODS FOR SPATIAL DATA:
APPLICATIONS IN EPIDEMIOLOGICAL,
ENVIRONMENTAL AND ECOLOGICAL STUDIES
by
Cindy Xin Feng
M.Sc. (Statistics), Simon Fraser University, 2006
B.Sc. (Applied Mathematics), Beijing University of Technology, 2003
a Thesis submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
in the Department of
Statistics and Actuarial Science
© Cindy Xin Feng 2011
SIMON FRASER UNIVERSITY
Summer 2011
All rights reserved. However, in accordance with the Copyright Act of
Canada, this work may be reproduced without authorization under the
conditions for Fair Dealing. Therefore, limited reproduction of this
work for the purposes of private study, research, criticism, review and
news reporting is likely to be in accordance with the law, particularly
if cited appropriately.
-
APPROVAL
Name: Cindy Xin Feng
Degree: Doctor of Philosophy
Title of Thesis: Models and Methods for Spatial Data: Applications in
Epidemiological, Environmental and Ecological Studies
Examining Committee: Dr. Rick Routledge
Chair
Dr. Charmaine Dean, Senior Supervisor
Dr. Jiguo Cao, Supervisor
Dr. Yi Lu, Supervisor
Dr. Paramjit Gill, Internal External Examiner
Dr. Patrick Brown, External Examiner,
University of Toronto
Date Approved: August 24, 2011
-
Partial Copyright Licence
-
Abstract
This thesis develops new methodologies for applied problems using smoothing techniques
for spatial or spatial-temporal data. We investigate Bayesian ranking methods
for identifying high risk areas in disease mapping, assessing these particularly with
regard to their performance in isolating emerging unusual and extreme risks in small
areas. We build on information obtained through mapping multivariate outcomes by
developing models which investigate if the multivariate spatial outcomes share the
same underlying spatial structure. We develop a general framework for joint model-
ing of multivariate spatial outcomes for count and zero-inflated count data using a
common spatial factor model.
We also study spatial exposure measures, motivated by an analysis of Comandra
blister rust infection on lodgepole pine trees from British Columbia. We contrast
nearest distance with other, more general, exposure measures and consider the impact
of mis-specification of exposure measures in a semiparametric generalized additive
modeling framework including a spatial residual term modeled as a thin plate regression
spline. An appealing feature of the new spatial exposure measures considered is that
they can be easily adapted to other problems, such as investigation of the association
of asthma incidence to traffic exposures. A common theme in the thesis is the use of
functional data analysis, and we specifically adapt such methods for assessing spatial
and temporal variation of Cadmium concentration in Pacific oysters from British
Columbia.
The methodologies developed in these projects widen the toolbox for spatial anal-
ysis in applications in epidemiology, and in environmental and ecological studies.
-
Acknowledgments
I am deeply indebted to my senior supervisor Dr. Charmaine Dean for her guidance
and support in countless ways. Without her enlightening instruction, great kindness
and patience, I could not have completed my thesis. Her support and encouragement
were very helpful to me through some very difficult times in my life. I also want
to extend my gratitude to my examining committee members, Dr. Rick Routledge,
Dr. Jiguo Cao, Dr. Yi Lu, Dr. Paramjit Gill and Dr. Patrick Brown for all their
careful reviewing and insightful comments. Their detailed reviews and constructive
comments greatly improved the thesis.
Many thanks to the faculty and staff of the Department of Statistics and Actuarial
Science of Simon Fraser University for providing me a wonderful environment for
graduate studies. In particular, I would also like to thank Dr. Derek Bingham, Dr.
Boxin Tang, Dr. Richard Lockhart, Dr. Tim Swartz, Dr. Leilei Zeng, Dr. Joan Hu
and Mr. Ian Bercovitz for their support, and Sadika, Kelly and Charlene for their
constant help. Thank you also to my fellow graduate students for their company and
for growing together with me during my graduate studies.
Finally, and most importantly, I would like to thank my family; I would not have been
able to come this far without their care and encouragement.
-
Contents
Approval
Abstract
Acknowledgments
Contents
1 Introduction
1.1 Overview
1.2 Disease Mapping
1.2.1 Conditional Autoregressive Priors
1.3 Thin-Plate Splines
1.4 Outline of Thesis
1.4.1 Chapter 2
1.4.2 Chapter 3
1.4.3 Chapter 4
1.4.4 Chapter 5
1.4.5 Chapter 6
2 Bayesian Ranking Methods for the Detection of Isolated Hotspots in Disease Mapping
2.1 Introduction
2.2 Bayesian Disease-Mapping Model
2.3 Ranking Methods
2.3.1 Squared error loss function for the isolation measures
2.3.2 Squared error loss function for the ranks of the isolation measures
2.3.3 Weighted rank squared error loss function
2.3.4 Misclassification rates of regions in the top $100\gamma\%$ group
2.4 Comparison of Rank Estimators of Isolation
2.5 Summary
3 Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes
3.1 Introduction
3.2 Models for Joint Count Outcomes
3.2.1 Common Spatial Factor Model for Counts
3.2.2 Common Spatial Factor Model for Zero Heavy Counts
3.2.3 Model Assessment and Comparison
3.3 Applications
3.3.1 Ontario Lung Cancer
3.3.2 Comandra Blister Rust Tree Infection
3.4 Power of the Test for Common Spatial Structure
3.5 Precision Gains Through Joint Outcome Modeling
3.6 Summary and Concluding Remarks
4 Impact of Misspecifying Spatial Exposures
4.1 Introduction
4.2 Comandra Blister Rust Study
4.3 Flexible Smooth Models
4.4 Comparison of Exposure Measures for CBR Infection
4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures
4.6 Summary
5 Exploring Spatial and Temporal Variations of Cadmium Concentrations in Pacific Oysters from British Columbia
5.1 Introduction
5.1.1 The Motivating Datasets
5.1.2 Statistical Methods
5.2 Methodology
5.2.1 Spline Smoothing
5.2.2 Monotone Spline Smoothing
5.2.3 Functional Principal Component Analysis
5.2.4 Semi-Parametric Additive Model
5.3 Results
5.3.1 Data Representation
5.3.2 Spatial Variability
5.3.3 The Semi-Parametric Additive Model
5.4 Concluding Remarks
6 Future Work
6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes
6.2 Spatial Modeling for Infectious Disease
6.3 Curve Clustering
Bibliography
A Appendix for Chapter 3
B Appendix for Chapter 4
C Appendix for Chapter 5
-
Chapter 1
Introduction
1.1 Overview
In recent years, there has been considerable interest in the development and applica-
tion of spatial models and methods for the analysis of spatially correlated data, which
are often geographically referenced, temporally correlated or highly multivariate. For
example, a motivating dataset considers the analysis of lung cancer for males and
females by local health unit in Ontario. The key idea throughout the approaches
considered is to take advantage of the correlation structure among observations to
perform estimation, prediction, hypothesis testing and other statistical procedures.
We begin with a review of some important concepts which form the building blocks
of the methods and models developed in later chapters. This is followed by an outline
of the material presented in each of the chapters of the thesis.
1.2 Disease Mapping
Mapping of disease incidence and mortality rates is of primary importance in many epidemiological
studies. The use of crude rates to estimate rare disease risks in small
areas such as health units, census areas or administrative zones, is problematic since it
does not account for the high variability of population sizes over the different regions,
nor the spatial patterns of the regions under study. Because of this, interpretation of
the spatial distribution of disease based on crude estimates is often misleading. Al-
ternatively, Bayesian inference is widely used to produce stabilized risk maps through
borrowing information from neighborhoods across the map. Early developments of
disease mapping methodology included the use of empirical Bayes (EB) techniques
(Manton et al., 1989; Marshall, 1991; Dean and MacNab, 2001; Breslow and Clay-
ton, 1993) to estimate parameters, and a plug-in approximation of these for posterior
inference, which yielded unbiased estimates of the relative risks. However, the variance
of these estimates was underestimated, since the EB approach does not account
for the uncertainty arising from estimating hyperparameters. In recent years, (fully)
Bayesian (FB) approaches have gained prominence. Inference is based on Markov
chain Monte Carlo (MCMC) algorithms (Besag et al., 1991; Bernardinelli and Montomoli,
1991; MacNab et al., 2004; Congdon, 2006). Interval estimates of relative
risks based on posterior distributions account for the uncertainty associated with the
estimates through the hyperprior specifications. Bayesian methods for disease mapping
are often termed hierarchical spatial modeling. The first level of the hierarchy
depicts the distribution of the data; the second level introduces the spatial dependence
through random effects which account for heterogeneity in the risks; the
lowest level specifies the distribution of the hyperparameters.
1.2.1 Conditional Autoregressive Priors
One of the most popular choices for the distribution of the random effects in hi-
erarchical spatial modeling is the intrinsic conditional autoregressive (CAR) model
(Besag et al., 1991). Let $W = (w_{ij})$ denote the so-called spatial proximity matrix,
$i = 1, \ldots, n$ and $j = 1, \ldots, n$ for $n$ regions, where $w_{ii} = 0$ and $w_{ij} = 1$ if the $i$th
and the $j$th areas are neighbours (denoted $j \sim i$), and 0 otherwise. The conditional
expectation and variance are
\[
E(b_i \mid b_{j \neq i}) = \frac{1}{w_{i+}} \sum_{j \sim i} b_j, \qquad
\mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \tag{1.1}
\]
where $b_{j \neq i}$ denotes the random effects of all regions other than region $i$, and
$b = (b_1, \ldots, b_n)$ has joint distribution
\[
b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - W)^{-1}, \tag{1.2}
\]
where $D = \mathrm{diag}(w_{1+}, \ldots, w_{n+})$ and $w_{i+} = \sum_j w_{ij}$. The forms (1.1) and (1.2) define the
intrinsic CAR (Besag et al., 1991) uniquely. With this model, local smoothing can be
achieved, as $E(b_i \mid b_{j \neq i})$ is the local risk average over the neighborhood of region $i$ and
$\mathrm{Var}(b_i \mid b_{j \neq i})$ is scaled by the inverse of the number of neighbors, so that the greater
the number of neighbors the smaller the variance. However, the intrinsic CAR prior
is improper, since the matrix $(D - W)$ is singular. This impropriety can be remedied
by enforcing constraints such as $\sum_{i=1}^{n} b_i = 0$, which can be implemented numerically
at each iteration of an MCMC algorithm used for model fitting. Alternatively, the
so-called proper CAR model may be used; this model incorporates an additional
parameter $\rho$, so that the full conditionals have moments
\[
E(b_i \mid b_{j \neq i}) = \frac{\rho}{w_{i+}} \sum_{j \sim i} b_j, \qquad
\mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \qquad \rho \in (0, 1), \tag{1.3}
\]
leading to the unique joint distribution
\[
b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - \rho W)^{-1}, \tag{1.4}
\]
so that the matrix $(D - \rho W)$ is non-singular and the covariance matrix is well defined.
Alternatively, Leroux et al. (1999) proposed a CAR model defining the full conditionals
as
\[
E(b_i \mid b_{j \neq i}) = \frac{\lambda}{1 - \lambda + \lambda w_{i+}} \sum_{j \sim i} b_j, \qquad
\mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{1 - \lambda + \lambda w_{i+}}, \qquad \lambda \in (0, 1), \tag{1.5}
\]
leading to the unique joint distribution
\[
b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 \{\lambda (D - W) + (1 - \lambda) I\}^{-1}, \tag{1.6}
\]
where $\lambda$ is a weighting parameter which weights the contributions from the spatially
correlated effect, modeled as intrinsic CAR, and the independent random noise term,
modeled as an independent normal distribution.
For point-referenced data, geostatistical models (Cressie, 1993) are often used,
which directly specify the covariance matrix based on the distance between the spatial
sites. For example, the correlation between two spatial sites may decay exponentially
with distance. CAR models, in contrast, are specified based on the adjacency
structure among the spatial units, and can be used for either point-referenced data
or lattice data. In addition, inference for geostatistical models usually requires inversion
of covariance matrices at each MCMC iteration; CAR models are therefore
computationally more efficient than geostatistical models.
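As a rough illustration of the CAR priors reviewed in this subsection, the following sketch builds the unscaled precision matrices of the intrinsic, proper and Leroux CAR models from an adjacency matrix, and evaluates the conditional mean and variance of the intrinsic CAR for one region. The function names, the four-region map and the values chosen for the dependence and weighting parameters (`rho`, `lam`) are illustrative assumptions, not part of the thesis.

```python
def car_precision_matrices(adjacency, rho=0.9, lam=0.5):
    """Unscaled precision matrices of three CAR priors.

    adjacency: symmetric 0/1 proximity matrix W with zero diagonal.
    rho, lam:  assumed illustrative values in (0, 1) for the proper CAR
               dependence parameter and the Leroux weighting parameter.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]  # w_{i+}: number of neighbours
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    diag = [[degree[i] * identity[i][j] for j in range(n)] for i in range(n)]

    # Intrinsic CAR: D - W (singular, since every row sums to zero).
    intrinsic = [[diag[i][j] - adjacency[i][j] for j in range(n)] for i in range(n)]
    # Proper CAR: D - rho * W (strictly diagonally dominant for rho in (0, 1)).
    proper = [[diag[i][j] - rho * adjacency[i][j] for j in range(n)] for i in range(n)]
    # Leroux CAR: lam * (D - W) + (1 - lam) * I.
    leroux = [[lam * intrinsic[i][j] + (1.0 - lam) * identity[i][j]
               for j in range(n)] for i in range(n)]
    return intrinsic, proper, leroux


def intrinsic_conditional_moments(b, adjacency, i, sigma2_b=1.0):
    """Conditional mean and variance of b_i given its neighbours (intrinsic CAR)."""
    neighbours = [j for j, w in enumerate(adjacency[i]) if w == 1]
    mean = sum(b[j] for j in neighbours) / len(neighbours)
    variance = sigma2_b / len(neighbours)
    return mean, variance


# Hypothetical map: four regions arranged on a line, 0 - 1 - 2 - 3.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
Q_icar, Q_proper, Q_leroux = car_precision_matrices(W)
print([sum(row) for row in Q_icar])  # zero row sums: D - W is singular
print(intrinsic_conditional_moments([0.2, -0.1, 0.4, 0.0], W, i=1))
```

The zero row sums of the intrinsic matrix make the impropriety visible directly, while the positive row sums of the other two matrices reflect the diagonal dominance that makes them invertible.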
1.3 Thin-Plate Splines
Thin-plate splines (Duchon, 1977) offer a very elegant approach for estimating a
smooth function of multiple predictor variables. The following provides a concise
introduction to thin plate splines. For a more detailed description, see Duchon
(1977), Meinguet (1979), Green and Silverman (1994) and Wood (2004, 2006).
Suppose the response $y_i$, $i = 1, \ldots, n$, is modeled as a smooth function of covariates
$x_i$ such that
\[
y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{1.7}
\]
where $f$ is an unknown function on a fixed domain $D \subset \mathbb{R}^d$, $\epsilon_i$ is a random error term,
and $x_i \in D$ are fixed values for covariates.
Thin-plate spline smoothing estimates $f$ by finding the function $\hat{f}$ which minimizes
the penalized sum of squares
\[
\frac{1}{n} \sum_{i=1}^{n} w_i \{y_i - f(x_i)\}^2 + \lambda J_m(f), \tag{1.8}
\]
where $w_i$, $i = 1, 2, \ldots, n$, are some fixed constants; $J_m(f)$ is a penalty function measuring
the non-smoothness or so-called ‘wiggliness’ of $f$; and $\lambda$ is the smoothing parameter,
which controls the tradeoff between fitting the data closely and the smoothness
of $f$. The penalty term is defined as
\[
J_m(f) = \int \cdots \int_{\mathbb{R}^d} \sum_{\nu_1 + \cdots + \nu_d = m}
\frac{m!}{\nu_1! \cdots \nu_d!}
\left( \frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}} \right)^2
dx_1 \cdots dx_d. \tag{1.9}
\]
The sum in the integral is taken over all vectors of non-negative integers $\nu = (\nu_1, \ldots, \nu_d)^T$ such that
$\nu_1 + \cdots + \nu_d = m$, where $d$ denotes the number of covariates, so $d = 2$ for spatial
longitude and latitude coordinate data, and the order $m$ of differentiation in the
penalty can be any integer satisfying $2m > d$. Matheron (1973) and Duchon (1977)
showed that the function minimizing (1.8) has the form
\[
f(x) = \sum_{j=1}^{k} \alpha_j \phi_j(x) + \sum_{i=1}^{n} \delta_i \eta_i(x), \tag{1.10}
\]
where $(\phi_1, \ldots, \phi_k)$ are linearly independent polynomials spanning the space of all
$d$-dimensional polynomials of degree less than $m$, and $\alpha_j$, $j = 1, \ldots, k$, and $\delta_i$,
$i = 1, \ldots, n$, are coefficients to be estimated. For example, when $d = 2$, $m = 2$, $k = 3$
and $x = (x_1, x_2)$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$ and $\phi_3(x) = x_2$. For $d = 2$, $m = 3$,
$k = 6$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$, $\phi_3(x) = x_2$, $\phi_4(x) = x_1 x_2$, $\phi_5(x) = x_1^2$,
$\phi_6(x) = x_2^2$. The functions $(\eta_1, \ldots, \eta_n)$ are a set of $n$ radial basis functions, defined
as
\[
\eta_i(r) =
\begin{cases}
a_{md} \, \|r\|^{2m-d} \log \|r\|, & d \text{ even}, \\
b_{md} \, \|r\|^{2m-d}, & d \text{ odd},
\end{cases}
\]
where $a_{md}$ and $b_{md}$ are constants.
For modeling spatial effects, thin-plate regression splines can be viewed as a Gaussian
process with generalized covariance (Cressie, 1993), characterized in terms of
distance $\tau$. The form of the covariance in two dimensions is $C(\tau) \propto \tau^{2m-2} \log(\tau)$,
where $m$ is the order of the spline (commonly two). Paciorek (2007) provided a nice
comparison of a variety of approaches for modeling spatial surfaces. Wood (2000, 2003,
2004) proposed the use of iterative weighted fitting of reduced rank thin-plate splines
for computational efficiency.
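To make the two ingredients of the thin-plate spline representation in this section concrete, the sketch below counts the polynomials that span the unpenalized part of the basis and evaluates the radial basis function at a given distance. It is only an illustration under stated assumptions: the function names are hypothetical, the distance is passed in directly as a non-negative number, and the multiplicative constants of the radial basis are set to one.

```python
import math


def null_space_dimension(d, m):
    """Number of polynomials in d covariates with total degree less than m."""
    if 2 * m <= d:
        raise ValueError("the penalty order must satisfy 2m > d")
    return math.comb(m + d - 1, d)


def tps_radial_basis(r, d, m):
    """Thin-plate radial basis at distance r >= 0, with constants set to 1.

    Returns r^(2m-d) * log(r) for even d and r^(2m-d) for odd d; the value
    at r = 0 is taken to be 0, which is the limit in both cases.
    """
    if 2 * m <= d:
        raise ValueError("the penalty order must satisfy 2m > d")
    if r == 0.0:
        return 0.0
    power = r ** (2 * m - d)
    return power * math.log(r) if d % 2 == 0 else power


print(null_space_dimension(d=2, m=2))   # polynomials 1, x1, x2
print(null_space_dimension(d=2, m=3))   # adds x1*x2, x1^2, x2^2
print(tps_radial_basis(2.0, d=1, m=2))  # cubic term in one dimension
```

For spatial coordinates ($d = 2$) with the common choice $m = 2$, the first call reproduces the three-polynomial case described above, and the radial term reduces to the familiar squared distance times its logarithm.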
1.4 Outline of Thesis
This thesis develops models and methods for the analysis of spatial or spatial-temporal
data arising from epidemiological, environmental and ecological studies. Specific problems
considered include identification of high risk isolated areas in Chapter 2;
joint modeling of multivariate spatially correlated outcomes using common spatial
factor models in Chapter 3; misspecification of spatial exposure measures in Chapter 4;
and investigation of functional data analysis approaches for modeling spatially and
temporally correlated data in Chapter 5. Each of Chapters 2, 3, 4 and 5 constitutes
a submitted paper. As a result, some introductory material, as well as the descriptions
of motivating data sets, may be repeated across these chapters.
1.4.1 Chapter 2
In disease mapping studies, often there is interest in identifying high risk areas in
order to investigate causes of mortality for surveillance purposes, or perhaps for effi-
cient allocation of health funding. Here, we focus on identification of locally isolated
high risk regions termed ‘local hotspots’ or ‘emerging hotspots’, defined as regions
with elevated risks with respect to their neighbors. Identification of ‘local hotspots’ or
‘emerging hotspots’ before they become extreme is crucial for disease surveillance. We
develop methods of ranking the difference between area risks or ranks and corresponding
values for neighbours, based on (1) the standardized mortality ratio (SMR), (2)
minimizing mean squared errors of estimation for relative risks, (3) minimizing mean
squared errors of estimation for ranks of risks, (4) minimizing a weighted squared
error loss function for ranks and (5) maximizing the sensitivity in the upper and
lower $100\gamma\%$ relative risks at prespecified $\gamma$. We evaluate our methods through a
simulation investigation in a scenario which reflects the Scottish lip cancer data used in
several mapping studies. Our simulation results show that ranking the difference between
posterior ranks of emerging hotspots and corresponding values for neighbours,
based on minimizing mean squared errors of estimation for ranks, is superior to other
methods for identifying emerging hotspots.
1.4.2 Chapter 3
This chapter discusses joint outcome modeling of multivariate spatial data, where
outcomes include count as well as zero-inflated count data. The framework utilized for
the joint spatial count outcome analysis reflects that which is now commonly employed
for the joint analysis of longitudinal and survival data, termed shared frailty models,
in which the outcomes are linked through a shared latent spatial random risk term.
We discuss these types of joint mapping models and consider the benefits achieved
through such joint modeling in the disease mapping context. We also consider the
power of tests for common spatial structure and develop recommendations on the
sort of power achievable in some contexts, as well as overall recommendations on the
utility of joint mapping. We illustrate the approaches in an analysis of lung cancer
mortality as well as an ecological study of Comandra blister rust infection of lodgepole
pine trees.
1.4.3 Chapter 4
In environmental and epidemiological studies, the nearest distance between the sus-
ceptible subject and the exposure source is a commonly used exposure measure, prin-
cipally because this measure is easy to collect. However, the density of the exposure in
the neighborhood of the subject may play an important role in the response to expo-
sure. Misspecification of exposure measures may result in inaccurate determinations
of the link between exposure and the response of interest. Such considerations are
motivated by the study of the disease dynamics of Comandra blister rust (Cronartium
comandrae) on lodgepole pine. This disease spreads to pine trees through alternate host
plants near the trees. We aim at understanding the relationship between the alternate
host plant presence and the disease, as well as effects relating to genetic variation in
the trees. We contrast the use of nearest distance to the alternate host plant, with
host plant densities at different orders of neighborhood, as exposure measures, in the
framework of a flexible semiparametric generalized additive model, while adjusting for
a spatially smooth surface. We demonstrate that if exposure is inaccurately modeled,
bias in estimating genetic effects may result. Our study also provides
information on the added benefit of collecting more detailed information on exposure
beyond the simple nearest distance measure.
1.4.4 Chapter 5
Oysters from the Pacific Northwest coast of British Columbia, Canada, contain high
levels of cadmium, in some cases exceeding some international food safety guidelines.
A primary goal of this chapter is the investigation of the spatial and temporal variation
in cadmium concentrations for oysters sampled from coastal British Columbia. Such
information is important so that recommendations can be made as to where and when
oysters can be cultured such that accumulation of cadmium within these oysters
is minimized. Some modern statistical methods are applied to achieve this goal,
including monotone spline smoothing, functional principal component analysis and
semi-parametric additive modelling. Oyster growth rates are estimated as the first
derivatives of the monotone smoothing growth curves. Some important patterns in
cadmium accumulation by oysters are observed. For example, most inland regions
tend to have a higher level of cadmium concentration than most coastal regions, so
more caution needs to be taken for shellfish aquaculture practices occurring in the
inland regions. The semi-parametric additive modelling shows that oyster cadmium
concentration decreases with oyster length, and oysters sampled at 7m have higher
average cadmium concentration than those sampled at 1m.
1.4.5 Chapter 6
The thesis closes with a discussion of future research topics.
-
Chapter 2
Bayesian Ranking Methods for the
Detection of Isolated Hotspots in
Disease Mapping
2.1 Introduction
In disease mapping, early capture of emerging hotspots, that is, regions with ele-
vated risks which are surrounded by areas with much lower risks, before they become
extreme, is crucial in decision-making related to health surveillance. Such decision-
making processes may refer to optimal allocation of resources for health prevention, or
to decisions reflecting mobility of a society or other environmental controls. A typical
approach for detection of disease hotspots through a hypothesis testing framework
utilizes the scan statistic (see Kulldorff and Nagarwalla, 1995; Kulldorff et al., 1998),
which aims at detecting the location and size of hotspots without any preconceived
assumptions about these values. Our focus here is quite different as we seek to es-
timate and rank various local elevations in risk across a map. Model based spatial
methods are used here to estimate such ranks. For rare diseases, the observed dis-
ease count may exhibit extra Poisson variation. Hence, the standardized mortality
ratios (SMRs), a basic investigative tool for epidemiologists, may be highly variable.
Subsequently, in maps of SMRs, the most variable values, arising typically from low
population areas, tend to be highlighted, masking the true underlying pattern of dis-
ease risk. To address the issue of such overdispersion, the field of disease mapping
has flourished in the last decade with a variety of estimation methods and spatial
models for latent levels of the model hierarchy. In particular, there have been many
developments related to Bayesian hierarchical models, which allow the risk in an area
to borrow strength from neighboring areas where the disease risks are similar. These
models have indeed become standard tools for mapping rates (see Besag et al., 1991;
Clayton and Bernardinelli, 1992; Clayton et al., 1993; Lawson et al., 2000; MacNab
et al., 2004; Best et al., 2005, for example) in order to identify global hotspots and
trends in the risk surface across the map.
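The instability of SMRs in low population areas can be seen from a short calculation. The sketch below assumes a simple Poisson model with mean equal to the relative risk times the expected count, under which the SMR has standard deviation equal to the square root of the relative risk divided by the expected count; the function names and the two expected counts are illustrative assumptions.

```python
import math


def smr(observed, expected):
    """Standardized mortality ratio: observed count over expected count."""
    return observed / expected


def smr_standard_error(relative_risk, expected):
    """Standard deviation of the SMR when the count is Poisson(relative_risk * expected)."""
    return math.sqrt(relative_risk / expected)


# Two hypothetical regions with the same true relative risk of 1.
small_area_expected = 2.0
large_area_expected = 200.0
print(smr_standard_error(1.0, small_area_expected))
print(smr_standard_error(1.0, large_area_expected))

# A single additional case moves the small-area SMR far more.
print(smr(3, small_area_expected), smr(201, large_area_expected))
```

With identical underlying risk, the small area has an SMR standard deviation ten times that of the large area, which is why extreme mapped SMRs tend to come from sparsely populated regions.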
Identification of local or emerging hotspots has received less attention. It is
unclear whether and what sorts of smoothing techniques offer advantages for iden-
tifying isolated hotspots, over basic estimates such as raw rates. Here, we maintain
the focus on Bayesian hierarchical conditional autoregressive (CAR) models, devel-
oped by Besag et al. (1991); Clayton and Bernardinelli (1992); Clayton et al. (1993).
This model and its extensions have become commonplace in epidemiological studies
and have been shown to be flexible and robust (Lawson et al., 2000). Best et al.
(2005) demonstrate the merits of the CAR model when compared to other contemporary
models including a multivariate normal geostatistical model with exponential
covariance, a spatial mixture model, a partition model and a gamma moving average
model. While CAR models were not designed to detect isolated hotspots or clusters
of isolated hotspots, they have nevertheless been used broadly for identifying extreme
risks.
The most natural measure of isolation is the difference between the risk or rank of a
potential hotspot and the corresponding quantity for its neighbors. Ranking methods
play a valuable role in drawing attention to elevated regions. This chapter considers
methods for ranking isolation measures with the goal of using these to identify local
or emerging hotspots. We note that Laird and Louis (1989) showed that ranking of
empirical Bayes estimators can be more accurate than that of conventional maximum
likelihood estimators. Shen and Louis (1998) investigated ranking procedures using
squared error loss functions operating on the difference between the estimated and
true ranks. We note also that in many applications, interest focuses principally on
identifying the locations with relatively high (e.g. in the upper 10%) or low risks.
With such an emphasis, Lin et al. (2006) discussed various loss functions for Bayesian
optimal ranking, as well as decision rules for identifying the regions with the top
$100\gamma\%$ risk values. Wright et al. (2003) developed a weighted rank squared error loss
function targeted at the most likely high-risk locations. We contrast these methods
for identifying the highest and lowest isolation measures across a map and develop
recommendations based on adaptations of these procedures. Though we focus on
disease mapping, we note that methods for ranking isolation measures may be broadly
useful in many other contexts, particularly sociological, for ranking political or racial
isolation, or ecological, for diversity studies.
In Section 2.2, we review the Bayesian hierarchical models commonly used for
analyzing disease incidence and mortality data. Section 2.3 discusses the ranking
methods considered, focusing on identifying regions associated with high risks which
are isolated and building upon Bayesian hierarchical models. Section 2.4 evaluates the
methods using the spatial distribution of lip cancer from Scotland where local hotspots
are artificially generated. Section 2.5 closes with a discussion and recommendations.
2.2 Bayesian Disease-Mapping Model
It is well known that Bayesian hierarchical models for disease mapping provide a trade-off
between bias and variance reduction of estimates, and are particularly helpful in
cases where the disease is rare. The variance reduction is achieved through borrowing
information from neighboring regions to produce a more stable estimate of the risk
surface with estimated risks shrunk toward the overall mean risk, or some function of
this mean. Marshall (1991) reviews empirical Bayes and some early Bayesian methods
for disease mapping; Lawson et al. (2000) compare disease mapping models using
various goodness of fit criteria; Best et al. (2005) provide a comprehensive review of
recent developments in Bayesian disease mapping and compare models through
simulation studies; Richardson et al. (2004) conduct a comprehensive evaluation
designed to highlight the amount of smoothing of risk which occurs and the effects on
identifying global hotspots in a variety of settings. Our aim here is to evaluate various
ranking methods for risk estimators obtained from fitting Bayesian disease mapping
models. We focus on the basic spatial model described by Besag et al. (1991).
Let the area under study be divided into n contiguous regions labeled i = 1, ⋅ ⋅ ⋅ , n,
and let y = (y1, ⋅ ⋅ ⋅ , yn)T be the observed, and E = (E1, ⋅ ⋅ ⋅ , En)T be the expected,
disease counts. Denote by � = (�1, ⋅ ⋅ ⋅ , �n)T , i = 1, ⋅ ⋅ ⋅ , n the underlying random
region-specific disease risks. The response variables, conditional on �i, i = 1, ⋅ ⋅ ⋅ , n,
are assumed independent and Poisson distributed: yi∣�i∼Poisson(�i), �i = �iEi. The
conditional log linear model (Besag et al., 1991) specifies
log(�i) = � + log(Ei) + �i, �i = � + bi + ℎi ,
where � denotes the overall mean risk, while �i is decomposed into a spatially cor-
related random error term bi, and a uncorrelated error ℎi. The spatially correlated
-
CHAPTER 2. ISOLATED HOTSPOT DETECTION 14
random effects, b = (b1, ⋅ ⋅ ⋅ , bn)T are conveniently interpreted conditionally, as
bi∣bj ∕=i ∼ N
(∑j∼iwijbj∑j∼iwij
,�2b∑j∼iwij
),
where j ∼ i indicates that region j belongs to the neighbourhood of region i,
i = 1, ⋅ ⋅ ⋅ , n. Neighborhoods define the scope of the conditional influence and may
be constructed in different ways depending on the context of the analysis. In our
application, we define regions which are contiguous in space with the ith region,
sharing a common boundary, as its neighborhood. The weights, wij ≥ 0, wii = 0,
i, j = 1, ⋅ ⋅ ⋅ , n may be based on adjacency indicators for a lattice, or on a distance
measure between region i and j. Where the weights are based on adjacency indicators,
the joint distribution of the random effects, b, is described as the intrinsic conditional
autoregressive model (Besag, 1974; Sun et al., 1999): $b \sim \mathrm{MVN}(0, \sigma_b^2 Q^{-1})$,
where Q has ith diagonal element equal to the number of neighbors of the ith region
while for i ≠ j, Qij = −1 if i and j are neighbors, and 0 otherwise. The vector of
random risks, θ, accommodates extra variation through a white noise error vector,
$h = (h_1, \cdots, h_n)^T \sim \mathrm{MVN}(0, \sigma_h^2 I)$, where I is an identity matrix of dimension
n. By combining the independent and spatially correlated sources of random errors,
we obtain the convolution conditional autoregressive model for defining the distribution
of the risks θi, as defined by Besag et al. (1991): $h + b \sim \mathrm{MVN}(0, \Sigma)$, where
$\Sigma = \sigma_h^2 I + \sigma_b^2 Q^{-1}$. The values of $\sigma_h^2$ and $\sigma_b^2$ give a sense of the contributions of
spatial and non-spatial components in explaining the variability in the map of risks.
Bayesian analysis requires the specification of prior distributions for the parameters.
We put a diffuse prior on the intercept α. For the variance parameters ($\sigma_b^2$, $\sigma_h^2$) of
the random effects (b, h), we assign their square roots, that is, the standard deviations
$\sigma_b$ and $\sigma_h$, a noninformative uniform prior density between 0 and 100 (Gelman, 2006).
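As a small illustration of the conditional specification above, the following Python sketch builds the adjacency-based matrix Q and evaluates the conditional mean and variance of one spatial effect given the others, taking all weights equal to 1 for neighbours. The four-region chain map, the values in `b`, and the names `icar_precision`, `car_conditional` and `sigma2_b` are hypothetical and chosen only for illustration; they do not come from the thesis analysis.

```python
def icar_precision(neighbours):
    """Q[i][i] = number of neighbours of region i; Q[i][j] = -1 if i ~ j, else 0."""
    n = len(neighbours)
    Q = [[0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbours):
        Q[i][i] = len(nbrs)
        for j in nbrs:
            Q[i][j] = -1
    return Q


def car_conditional(i, b, neighbours, sigma2_b):
    """Mean and variance of b_i given the other effects, with adjacency weights w_ij = 1."""
    nbrs = neighbours[i]
    mean = sum(b[j] for j in nbrs) / len(nbrs)
    var = sigma2_b / len(nbrs)
    return mean, var


# Hypothetical map: four regions in a chain 0 - 1 - 2 - 3.
neighbours = [[1], [0, 2], [1, 3], [2]]
Q = icar_precision(neighbours)

# Hypothetical current values of the spatial effects.
b = [0.2, -0.1, 0.4, 0.0]
mean_1, var_1 = car_conditional(1, b, neighbours, sigma2_b=0.5)
```

Each row of Q sums to zero, which reflects the fact that the intrinsic model is improper and is identified only through differences between neighbouring regions.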
In the Bayesian approach to disease mapping, inference on the relative risks is
based on the posterior distribution of the risks given the data. Markov chain
Monte Carlo (MCMC) methods based on Gibbs sampling (Geman and Geman, 1984;
Gelfand and Smith, 1990) are easily implemented in the WinBUGS
software package (Spiegelhalter et al., 2003), allowing for estimation of the posterior
distribution of the relative risks. The R package R2WinBUGS (Sturtz et al., 2005)
may be used to export results for additional analyses using R.
2.3 Ranking Methods
To estimate isolation, we propose to rank the difference between the rank or risk
estimate of the region under consideration and the corresponding mean value for its
neighbours. We expect this to provide a useful mechanism for identifying areas with
emerging or unusual elevated risk, and hence for prioritizing public health
investigations. We discuss ranking approaches from two perspectives: (i) traditional
approaches which use estimates based upon the SMR and (ii) approaches based on smoothing
methods, which focus on obtaining a general impression of trends over space as well as
utilizing these to provide more precise identification of isolated high risk areas.
Let d = (d1, ⋅ ⋅ ⋅ , dn)T be a vector representing the isolation measure defined as
the true difference in relative risks between the region and the mean value of the risk
for its neighborhood
\[
d_i = \theta_i - \frac{1}{N_i} \sum_{j \sim i} \theta_j , \tag{2.1}
\]
where Ni denotes the number of neighbours for region i, i = 1, ⋅ ⋅ ⋅ , n. Define the
corresponding rank of di as
\[
\mathrm{rank}(d_i) = R_i = \sum_{j=1}^{n} I\{d_i \le d_j\} , \tag{2.2}
\]
where I {A} is the indicator function for event A. The smallest difference has rank
n and the largest has rank 1. The ranking methods considered are obtained by
minimizing the following loss functions.
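The isolation measure (2.1) and its rank (2.2) can be sketched in a few lines of Python. The risk values and the four-region chain map below are hypothetical, and the function names `isolation` and `rank_desc` are illustrative only.

```python
def isolation(theta, neighbours):
    """d_i = theta_i minus the mean risk over the neighbours of region i, as in (2.1)."""
    return [theta[i] - sum(theta[j] for j in nbrs) / len(nbrs)
            for i, nbrs in enumerate(neighbours)]


def rank_desc(values):
    """R_i = sum_j I{values[i] <= values[j]}, as in (2.2): the largest value has rank 1."""
    return [sum(1 for vj in values if vi <= vj) for vi in values]


# Hypothetical relative risks on a chain map 0 - 1 - 2 - 3.
theta = [1.0, 3.0, 1.2, 0.8]
neighbours = [[1], [0, 2], [1, 3], [2]]

d = isolation(theta, neighbours)   # region 1 stands well above its neighbours
R = rank_desc(d)                   # so region 1 receives rank 1
```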
2.3.1 Squared error loss function for the isolation measures
It is well known that the posterior mean minimizes the Bayesian risk with respect
to the squared-error loss (SEL) function (Berger, 1985). For example, the posterior
mean, E(θi∣y), is the optimal Bayes estimate obtained by minimizing the posterior
expectation of the sum of squared error loss function
$L(\theta, \hat{\theta}) = \sum_{i=1}^{n} (\hat{\theta}_i - \theta_i)^2 / n$
(Carlin and Louis, 1996). In our case, we rank the posterior mean of the isolation
value, E(di∣y), which minimizes the posterior expectation of
$L(d, \hat{d}) = \sum_{i=1}^{n} (\hat{d}_i - d_i)^2 / n$.
The corresponding estimated ranks are denoted as PM.
2.3.2 Squared error loss function for the ranks of the isolation measures
Laird and Louis (1989), Shen and Louis (1998) and Louis and Shen (1999) showed
that if ranks of parameters are of interest, using a rank estimator directly is more
appropriate than using the parameter estimator to obtain ranks. The posterior expected
rank is obtained by minimizing the sum of squared error loss function of the ranks
$L(R, \hat{R}) = \sum_{i=1}^{n} (\hat{R}_i - R_i)^2 / n$. The estimated ranks, which are non-integer quantities,
are
\[
\bar{R}_i = E(R_i \mid y) = \sum_{j=1}^{n} P(d_i \le d_j \mid y) , \tag{2.3}
\]
and tend to be shrunk towards the mid-rank (n + 1)/2. Hence, we rank the posterior
means of (2.3) as described below, and denote the corresponding estimated ranks,
$\hat{R}_i = \mathrm{rank}(\bar{R}_i)$, as PRANK. Lin et al. (2006) show that the estimator is also optimal
under weighted squared error loss of ranks, $\frac{1}{n} \sum_{i=1}^{n} w_i (\hat{R}_i - R_i)^2$, for any values of
wi, i = 1, ⋅ ⋅ ⋅ , n. Calculation of PRANK can be easily implemented in the Bayesian
context. Let $\theta^{(r)} = (\theta_1^{(r)}, \cdots, \theta_n^{(r)})^T$ be a random draw of θ from p(θ∣y); rank the
isolation measures $d_i^{(r)} = \theta_i^{(r)} - (1/N_i) \sum_{j \sim i} \theta_j^{(r)}$, i = 1, ⋅ ⋅ ⋅ , n, j = 1, ⋅ ⋅ ⋅ , Ni, and
subsequently rank the average rank of $d_i^{(r)}$ over the MCMC iterations, r = 1, ⋅ ⋅ ⋅ , R,
to obtain the optimal rank based on (2.3).
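The calculation just described can be sketched as follows, alongside the PM ranks of Subsection 2.3.1 for comparison. This is a minimal illustration, not the thesis code: the three "posterior draws" are hypothetical numbers rather than MCMC output, and the sketch assumes that the region with the smallest posterior expected rank is assigned estimated rank 1, so that rank 1 continues to denote the most isolated region.

```python
def isolation(theta, neighbours):
    """Isolation measure (2.1) for one draw of the risks."""
    return [theta[i] - sum(theta[j] for j in nbrs) / len(nbrs)
            for i, nbrs in enumerate(neighbours)]


def rank_desc(values):
    """Rank as in (2.2): the largest value has rank 1."""
    return [sum(1 for vj in values if vi <= vj) for vi in values]


def rank_asc(values):
    """The smallest value has rank 1 (used to rank the posterior expected ranks)."""
    return [sum(1 for vj in values if vj <= vi) for vi in values]


def pm_and_prank(theta_draws, neighbours):
    n, n_draws = len(neighbours), len(theta_draws)
    d_sum, r_sum = [0.0] * n, [0.0] * n
    for theta in theta_draws:
        d = isolation(theta, neighbours)
        r = rank_desc(d)                      # rank within this draw
        for i in range(n):
            d_sum[i] += d[i]
            r_sum[i] += r[i]
    d_bar = [s / n_draws for s in d_sum]      # Monte Carlo estimate of E(d_i | y)
    r_bar = [s / n_draws for s in r_sum]      # Monte Carlo estimate of (2.3)
    return rank_desc(d_bar), rank_asc(r_bar), r_bar


# Hypothetical draws of the risks on a chain map 0 - 1 - 2 - 3.
neighbours = [[1], [0, 2], [1, 3], [2]]
theta_draws = [
    [1.0, 3.0, 1.2, 0.8],
    [1.0, 2.0, 1.5, 0.9],
    [1.1, 2.5, 1.0, 0.7],
]
pm, prank, r_bar = pm_and_prank(theta_draws, neighbours)
```

In this toy example the two methods agree on the most isolated region but order regions 2 and 3 differently, which illustrates that ranking posterior means and ranking posterior expected ranks need not coincide.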
Ranking methods described in Subsections 2.3.1 and 2.3.2 may be reasonable
choices when accurate ranking of all regions is of interest. In contrast, the methods
described in Subsection 2.3.3 and 2.3.4 focus on high risk areas.
2.3.3 Weighted rank squared error loss function
The posterior means are less variable than a typical draw from the posterior
distribution (Louis, 1984). Therefore, high risks tend to be underestimated, while low risks
tend to be overestimated. Wright et al. (2003) introduce weighted rank squared error
loss functions in a hierarchical setting for estimating extrema (hotspots) of
parameters. In an exploratory approach, we adapt this method to be aligned with a focus
on identifying local isolated hotspots.
Let (d(1), ⋅ ⋅ ⋅ , d(n)) be the ordered vector of d, d(1) < ⋅ ⋅ ⋅ < d(n), assuming no ties.
To identify the most isolated hotspot, we consider the following loss function:
\[
J(d, \hat{d}, c) = \sum_{k=1}^{n} \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2
= \sum_{k=1}^{n} c_{r(k)} (d_k - \hat{d}_k)^2 , \tag{2.4}
\]
where $r(k) \equiv \{j : d_k = d_{(j)}\}$, $c_{r(k)} = \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\}$ and c = (c1, ⋅ ⋅ ⋅ , cn)T is the
vector of weights for d. The optimal Bayes estimator of dk is obtained by minimizing
the conditional expectation of the kth element in (2.4),
\[
E\{J_k(d, \hat{d}_k, c \mid y)\} = \int \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 \, p(d \mid y) \, dd , \tag{2.5}
\]
which yields
\[
\hat{d}_k = \frac{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, d_k \, p(d \mid y) \, dd}{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, p(d \mid y) \, dd}
= \frac{\sum_{j=1}^{n} E(d_k \mid d_k = d_{(j)}, y) \, c_j \, p(d_k = d_{(j)} \mid y)}{\sum_{j=1}^{n} c_j \, p(d_k = d_{(j)} \mid y)} . \tag{2.6}
\]
The estimate d̂k is a weighted average of conditional posterior means of dk, with the
weight being cj multiplied by the posterior probability that dk has rank j. The corre-
sponding estimated ranks are denoted as WRSEL. For identifying extreme risks, we
use the suggestion in Wright et al. (2003) to consider a sharply increasing weighting
vector, with ci = exp [{(n+ 1)− i} /s] as the weight for rank i, i = 1, ⋅ ⋅ ⋅ , n. We
let WRSEL(a) denote the estimated ranks when s = 2, so that the weighting func-
tion puts large weight on highly isolated risks and almost 0 weight otherwise, and
WRSEL(b) denote the estimated ranks when s = 10, so that the weight function de-
clines less steeply as risks become less isolated. Figure 2.1 displays the weight vectors
c for WRSEL(a) and WRSEL(b) when n = 56.
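A Monte Carlo version of the weighted estimate can be sketched as follows: within each posterior draw, the isolation measure of region k is weighted by the weight attached to its rank in that draw, and the weighted draws are averaged. The sketch assumes, in line with (2.2), that rank 1 denotes the largest isolation measure, so that the exponential weights above emphasize the most isolated regions. The draws in `d_draws` are hypothetical values, not output from the thesis analysis, and the function names are illustrative.

```python
import math


def rank_desc(values):
    """Rank as in (2.2): the largest value has rank 1."""
    return [sum(1 for vj in values if vi <= vj) for vi in values]


def wrsel_weights(n, s):
    """c_i = exp[{(n + 1) - i} / s] for ranks i = 1, ..., n."""
    return [math.exp((n + 1 - i) / s) for i in range(1, n + 1)]


def wrsel_estimates(d_draws, s):
    """Weighted average of the draws of d_k, weighted by c at the rank of d_k in each draw."""
    n = len(d_draws[0])
    c = wrsel_weights(n, s)
    num, den = [0.0] * n, [0.0] * n
    for d in d_draws:
        ranks = rank_desc(d)
        for k in range(n):
            w = c[ranks[k] - 1]
            num[k] += w * d[k]
            den[k] += w
    return [num[k] / den[k] for k in range(n)]


# Weights for n = 56 regions, scaled to have maximum 1 as in Figure 2.1.
c_a = wrsel_weights(56, s=2)      # steep decline, as for WRSEL(a)
c_b = wrsel_weights(56, s=10)     # gentler decline, as for WRSEL(b)
c_a_scaled = [c / c_a[0] for c in c_a]
c_b_scaled = [c / c_b[0] for c in c_b]

# Hypothetical posterior draws of the isolation measures for four regions.
d_draws = [
    [-2.0, 1.9, -0.7, -0.4],
    [-1.0, 0.75, 0.05, -0.6],
    [-1.4, 1.45, -0.6, -0.3],
]
d_hat = wrsel_estimates(d_draws, s=2)
wrsel_rank = rank_desc(d_hat)
```

When a region holds the same rank in every draw, its weighted estimate reduces to the ordinary posterior mean; otherwise the estimate is pulled towards the draws in which the region ranks highly.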
Figure 2.1: Plot of the weight function c against rank for WRSEL(a) and WRSEL(b). In this
plot, the weight functions are scaled to have maximum value of 1.
2.3.4 Misclassification rates of regions in the top 100γ% group
Lin et al. (2006) considered specific loss functions tailored for estimating extreme
ranks. They recommended ranking the posterior probability that a region’s rank is
in the top 100γ% of ranks based on the rank-based misclassification loss function:
\[
L_{0|1}(\gamma, R, \hat{R}) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \mathrm{FP}(\gamma, R_i, \hat{R}_i) + \mathrm{FN}(\gamma, R_i, \hat{R}_i) \right\} , \tag{2.7}
\]
where
\[
\begin{aligned}
\mathrm{FP}(\gamma, R_i, \hat{R}_i) &= I\{R_i > \gamma(n + 1),\ \hat{R}_i \le \gamma(n + 1)\} ; \\
\mathrm{FN}(\gamma, R_i, \hat{R}_i) &= I\{R_i \le \gamma(n + 1),\ \hat{R}_i > \gamma(n + 1)\} ,
\end{aligned} \tag{2.8}
\]
and FP (false positive) and FN (false negative) indicate the two possible types of
misclassification.
Lin et al. (2006) show that the loss function (2.7) is minimized by ranking the following
posterior probabilities:
\[
P(R_i \le \gamma(n + 1) \mid y) , \tag{2.9}
\]
which are based on the posterior distribution of Ri; this ranking minimizes
errors in classifying regions above or below a percentile threshold. The corresponding
estimated ranks are denoted as PPR.
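The posterior probabilities in (2.9) can be approximated by the proportion of posterior draws in which a region's rank falls at or below the cutoff, after which the probabilities themselves are ranked. The sketch below is an illustration under stated assumptions: the draws in `d_draws` are hypothetical, the parameter name `gamma` is used here for the threshold fraction in (2.9), and rank 1 denotes the most isolated region as in (2.2).

```python
def rank_desc(values):
    """Rank as in (2.2): the largest value has rank 1 (tied values share a rank)."""
    return [sum(1 for vj in values if vi <= vj) for vi in values]


def ppr(d_draws, gamma):
    """Estimate P(R_i <= gamma * (n + 1) | y) from draws, then rank the probabilities."""
    n = len(d_draws[0])
    cutoff = gamma * (n + 1)
    counts = [0] * n
    for d in d_draws:
        for i, r in enumerate(rank_desc(d)):
            if r <= cutoff + 1e-9:        # small tolerance for floating-point cutoffs
                counts[i] += 1
    probs = [c / len(d_draws) for c in counts]
    return probs, rank_desc(probs)        # largest probability receives rank 1


# Hypothetical posterior draws of the isolation measures for four regions.
d_draws = [
    [-2.0, 1.9, -0.7, -0.4],
    [-1.0, 0.75, 0.05, -0.6],
    [-1.4, 1.45, -0.6, -0.3],
]
n = len(d_draws[0])
probs, ppr_rank = ppr(d_draws, gamma=2 / (n + 1))   # cutoff: ranks 1 and 2
```

Regions with equal estimated probabilities receive the same rank under this simple rule; with long MCMC runs and a map of realistic size, a tie-breaking rule may still be needed.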
2.4 Comparison of Rank Estimators of Isolation
In an effort to understand how these ranking methods perform and how well they cap-
ture isolated hotspots when these are only modestly elevated, we consider hotspots
from regions with low expected counts, where the elevation in risk ranges from mod-
erate to large. We consider a single isolated hotspot, a small cluster of contiguous
hotspots, and, for comparison, several non-contiguous isolated hotspots.
In the investigations, the background relative risks are spatially correlated while an
independent discrete random effect inflates the risks in the target regions. Specifically,
counts were generated from a multinomial distribution
\[
y \sim \mathrm{Multinomial}\left( \sum_{i=1}^{n} E_i ,\ \left\{ \frac{E_i \theta_i}{\sum_{j=1}^{n} E_j \theta_j} \right\}_{i=1}^{n} \right), \quad \theta_i = \exp(\alpha + b_i + \log \delta_i) , \tag{2.10}
\]
where α is the overall mean rate over the map; bi denotes a spatially correlated random
effect; δi = 1 if the region is not a hotspot, and a constant t otherwise, t being
the inflation factor. To accommodate sampling variability, each simulation scenario is
replicated 500 times. Two MCMC chains have been run for a total of 20,000 iterations,
keeping every 10th, after a 10,000 iteration burn-in period. Brooks-Gelman-Rubin
diagnostics (Brooks and Gelman, 1998), as well as graphical checks of chains and
their autocorrelations were performed to assess convergence. The distribution of the
spatially correlated random effects, the expected disease counts and the neighbor-
hood structure mimic the fitted distribution from an initial analysis of the Scottish
lip cancer data (see Breslow and Clayton, 1993, for example). The data comprise
observed and expected counts of lip cancer cases during the period 1975-1980 over 56
Scottish counties. Table 2.1 summarizes observed and expected counts for these data.
The lip cancer data are known for exhibiting severe extra-Poisson variation (Clayton
and Kaldor, 1987). Breslow and Clayton (1993) and others have found that a conditional
Poisson model with spatially correlated CAR random effects provides a fair
fit to these data. We use the estimated model parameters from such an analysis to
define the background spatial pattern. Additionally, emerging isolated hotspots are
generated as follows:
∙ Scenario I: A single region is considered as an emerging hotspot. Three candidates
are considered with expected counts of 1.8, 6 and 14.6, corresponding
approximately to the 10th, 50th and 90th percentiles of the expected counts,
respectively. Note that we choose the hotspot with low expected count (10th
percentile of the expected counts), such that it is surrounded by neighbours
with fairly high expected counts, one of which has expected count 50.7. In
this case, the neighbours may have substantial smoothing effects on the target
region under the CAR model.
∙ Scenario II: A group of three contiguous regions is considered as an isolated clus-
ter. Two cases are considered: (i) areas with low expected counts 3.3, 4.8 and
2.9; (ii) areas with high expected counts of 9.3, 14.6 and 88.7. When contigu-
ous regions are proposed as hotspots, di (2.1) is calculated by excluding target
hotspots from Ni. This mimics a hypothesis testing scenario where a specific
cluster is being tested. Note that the expected counts from the neighbours of
case (i) are fairly low, with mean expected counts about 7.5.
∙ Scenario III: A group of three non-contiguous regions is considered as an isolated
group of regions of higher risk. Two cases are considered (i) areas with low
expected counts of 2.5, 3.3 and 3.6; (ii) areas with high expected counts of 10.1,
50.7 and 8.2. The expected counts of the neighbours for two of the isolated
hotspots in case (i) are fairly low, while the third hotspot has a neighbor with
the highest expected count over the map.
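The count-generation step in (2.10) for a scenario of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the expected counts, spatial effects `b`, intercept `alpha` and inflation factor `t` are hypothetical values, the spatial effects are simply supplied rather than drawn from a fitted CAR model, and the multinomial total is taken as the sum of the expected counts rounded to an integer.

```python
import math
import random


def simulate_counts(E, b, alpha, hotspots, t, rng):
    """Draw counts from the multinomial in (2.10) with hotspot risks inflated by t."""
    n = len(E)
    theta = [math.exp(alpha + b[i] + (math.log(t) if i in hotspots else 0.0))
             for i in range(n)]
    weights = [E[i] * theta[i] for i in range(n)]   # proportional to E_i * theta_i
    total = round(sum(E))                           # assumption: integer total count
    draws = rng.choices(range(n), weights=weights, k=total)
    y = [0] * n
    for i in draws:
        y[i] += 1
    return y, theta


# Hypothetical four-region example; region 0 has a low expected count and is the hotspot.
E = [1.8, 6.0, 14.6, 50.7]
b = [0.1, -0.2, 0.3, 0.0]
alpha, t = 0.0, 3.0
rng = random.Random(2011)
y, theta = simulate_counts(E, b, alpha, hotspots={0}, t=t, rng=rng)
```

Because the counts are drawn from a multinomial distribution, the total number of cases is held fixed across replicates, and only its allocation among regions varies.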
Note that estimated risks for the isolated hotspots which have moderate or high
expected counts are less likely to be influenced by disease counts for their neighbours.
In the simulation studies, the risks of the elevated regions are inflated to be sharply
different from their neighbors and (i) not overly high (rank about 10th place),
(ii) moderately high (rank about 3rd place) or (iii) high (rank about 1st place). The
magnitudes of the inflations for Scenarios I, II and III are reflected in Figures 2.2,
2.3 and 2.4, respectively. The corresponding geographical locations for the isolated
hotspots are shown in Figure 2.5 for Scenario I and in Figure 2.6 for Scenarios II and III.
We also consider scaling the
expected counts by a factor u = 1, 4 and 8 for all scenarios. The threshold in (2.9)
corresponds to 1/(n + 1) for Scenario I and 3/(n + 1) for Scenarios II and III. The
simulated data are analyzed using model (2.1).
Table 2.1: Scottish lip cancer data: summary statistics.
                      Minimum   First quartile   Median   Third quartile   Maximum
Observed count (y)       0.00             4.75     8.00            11.00     39.00
Expected count (E)       1.10             4.05     6.30            10.12     88.70
SMR (y/E)                0                0.49     1.11             2.24      6.43
To assess the accuracy of the proposed ranking methods for identification of isolated
hotspots, we consider the root mean squared error of R̂i, for an isolated hotspot
at site i, given by
\[
\mathrm{RMSE}(\hat{R}_i) = \left\{ \frac{1}{M} \sum_{m=1}^{M} \left( \hat{R}_i^{(m)} - R_i \right)^2 \right\}^{1/2} , \tag{2.11}
\]
where Ri is the true rank (2.2) and $\hat{R}_i^{(m)}$ is the estimated value based on the mth
simulated dataset, m = 1, ⋅ ⋅ ⋅ , M. For cases where clusters are considered (Scenarios
II and III), we calculate the average RMSE for the hotspots.
Figure 2.2: Plot of true relative risks versus true ranks of the relative risks for the
Scottish lip cancer data. The panels in the top, middle and bottom rows correspond
to Scenario I with the target region having low, moderate and high expected incidence
count, respectively. The isolated hotspot, shown as black dots, are inflated to about
the 10th (column 1), 3rd (column 2) and 1st (column 3) places. The symbol +
identifies neighboring regions of the isolated hotspot.
Figure 2.3: Plot of true relative risks versus true ranks of the relative risks for the
Scottish lip cancer data. The panels in the top and bottom rows correspond to an
isolated cluster of three contiguous hotspots with low and high expected incidence
count, respectively. The isolated hotspots, shown as black dots, are inflated to about
the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the
isolated hotspots.
Figure 2.4: Plot of true relative risks versus true ranks of the relative risks for the
Scottish lip cancer data. The panels in the top and bottom rows correspond to an
isolated cluster of three non-contiguous hotspots with low and high expected incidence
count, respectively. The isolated hotspots, shown as black dots, are inflated to about
the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the
isolated hotspots.
Figure 2.5: The panels display di for Scenario I. The single isolated hotspot with low,
moderate and high expected count, are identified by the red circle in the 1st, 2nd
and 3rd panels, respectively. (Panel titles: low E, moderate E, high E. Map legend
classes: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5.)
Figure 2.6: The top and bottom panels display di for Scenarios II and III, respectively.
The cluster of three contiguous isolated hotspots with low and high expected counts
for simulation Scenario II are identified by the red circles in the left and right top
panels, respectively; the cluster of three non-contiguous hotspots with low and high
expected counts for simulation Scenario III are identified by the red circles in the left
and right bottom panels, respectively. (Panel titles: low E, high E. Map legend
classes: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5.)
We also evaluate the ranking methods based on the correct positive and false
positive rates
\[
\begin{aligned}
\mathrm{CP} &= P(\hat{R}_i < \kappa \mid R_i < \kappa) = \frac{1}{M} \sum_{m=1}^{M} I\left\{ \hat{R}_i^{(m)} < \kappa \mid R_i < \kappa \right\} ; \\
\mathrm{FP} &= P(\hat{R}_i < \kappa \mid R_i > \kappa) = \frac{1}{M} \sum_{m=1}^{M} I\left\{ \hat{R}_i^{(m)} < \kappa \mid R_i > \kappa \right\} ,
\end{aligned} \tag{2.12}
\]
where κ in (2.12) denotes the threshold defining high ranks, κ = 2 for Scenario I and
4 for Scenarios II and III.
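The three performance measures, RMSE in (2.11) and the CP and FP rates in (2.12), can be computed from the estimated ranks across simulated datasets along the following lines. The sketch is illustrative only: the true and estimated ranks are hypothetical, the parameter name `kappa` stands for the threshold in (2.12), and the rates are averaged over all regions satisfying the conditioning event, which is an assumption about how the tabulated rates are pooled.

```python
import math


def rmse(est_ranks, true_rank):
    """Root mean squared error (2.11) of the estimated rank for one region."""
    M = len(est_ranks)
    return math.sqrt(sum((r - true_rank) ** 2 for r in est_ranks) / M)


def cp_fp(est_rank_sets, true_ranks, kappa):
    """Correct positive and false positive rates (2.12), pooled over qualifying regions."""
    M = len(est_rank_sets)
    high = [i for i, r in enumerate(true_ranks) if r < kappa]   # truly high-ranked
    low = [i for i, r in enumerate(true_ranks) if r > kappa]    # truly not high-ranked
    cp = sum(1 for est in est_rank_sets for i in high if est[i] < kappa) / (M * len(high))
    fp = sum(1 for est in est_rank_sets for i in low if est[i] < kappa) / (M * len(low))
    return cp, fp


# Hypothetical true ranks for four regions; region 1 is the isolated hotspot.
true_ranks = [4, 1, 3, 2]

# Hypothetical estimated ranks from M = 4 simulated datasets (one list per dataset).
est_rank_sets = [
    [4, 1, 3, 2],
    [4, 2, 1, 3],
    [3, 1, 4, 2],
    [4, 1, 2, 3],
]

hotspot = 1
rmse_hotspot = rmse([est[hotspot] for est in est_rank_sets], true_ranks[hotspot])
cp, fp = cp_fp(est_rank_sets, true_ranks, kappa=2)
```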
Table 2.2 displays RMSE, CP and FP for all the ranking methods evaluated here
for Scenario I for the case where the hotspot is associated with a low expected count
surrounded by neighbours with high expected values. It is not surprising that SMR
performs better in this case, as the CAR model pools information from the neighbours
to produce an estimate for the target region; therefore, the risk estimate for this
isolated hotspot tends to be smoothed under the CAR model. In contrast, for the
case of an isolated hotspot with moderate or large expected count, as shown in Tables
2.3 and 2.4, PRANK outperforms SMR. The gains of using PRANK are substantial
when the expected incidence count for the emerging hotspot is large. For example,
in Table 2.4, when the isolated hotspot is in the 10th place, CP is about 71.2% while
FP is about 0.5% for PRANK, yielding a performance which is far superior to the
other ranking methods. In general, WRSEL(a), WRSEL(b) and PPR perform less
well. The WRSEL function tends to inflate the point estimates of the high risks;
because their weights are low, inaccuracies in point estimates of the other regions
with low isolation measures are relatively unimportant. WRSEL does not provide
precise estimates of all the risks and this may make it unsuitable for ranking purposes
(ranking requires good estimates over the whole map). Our empirical evaluation of
PPR over a sequence of values of the threshold (not shown here) suggests that
the performance of this estimator in terms of RMSE, CP and FP is influenced by the
threshold, especially when the expected disease counts are low for the isolated hotspots. For
all the ranking methods, RMSE decreases, CP increases and FP decreases when the
emerging hotspots are gradually elevated above the whole surface, and when the
expected incidence counts for all the regions are inflated. These findings apply also
to the cases where the isolated hotspots are a cluster of three contiguous regions (see
Tables 2.5 and 2.6) and also where the isolated hotspots are three non-contiguous
regions (see Tables 2.7 and 2.8). It is also interesting to note that, in contrast to
Scenario I, for Scenario II, where three contiguous regions with low expected counts
are inflated as a cluster of hotspots, PRANK is superior to SMR, as the CAR model
has less of a smoothing effect on these isolated hotspots.
2.5 Summary
In this study, we focus on developing and evaluating rank estimators for disease map-
ping for the identification of emerging isolated hotspots. To determine the magnitude
of elevation of the hotspots relative to their neighbours, we developed an isolation
measure, the difference of risks or their rank estimators for the emerging high risk
regions and their neighbours. In summary, we note that though the CAR model
provides a smoothed risk surface, the estimates for PRANK or PM based on this
model perform reasonably well in detecting the emerging isolated hotspots. Simula-
tion studies show that gains of using PRANK may be substantial compared to other
ranking methods considered, especially when the disease is rare and the high risk area
is not yet a global outlier. The research has adopted the widely used CAR model.
Rank estimators based on other models may yield different results on identification of
isolated hotspots. The isolation measure developed here depends on the definition of
the neighborhood structure. The performance of the isolation measure may depend
on the distribution of the number of neighbours; hence the development of methods
which account for the number of neighbours may be useful.
In addition, in comparison to the classical scan statistic, we expect that the rank-
ing methods based on the spatial model may have lower false positive rates for identify-
ing isolated hotspots, since the classical scan statistic is very sensitive to the violation
of the assumption of spatial independence, detecting clusters at the 5% level much
more often than 5% of the time when spatially correlated data are simulated (Loh
and Zhu, 2007). It would be useful to compare the use of the scan statistic to our
ranking methods through simulation studies when no isolated hotspots exist.
Table 2.2: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated single hotspot with LOW expected disease counts, whose risk was inflated
to about the 10th, 3rd and 1st place; the expected disease counts for all the regions
are scaled by u = 1, 4 and 8.
                          10th                   3rd                    1st
                     RMSE     CP     FP     RMSE     CP     FP     RMSE     CP     FP
u = 1   SMR         8.988  0.262  0.013    4.211  0.542  0.008    1.842  0.706  0.005
        PM         11.924  0.014  0.018    6.991  0.098  0.016    4.631  0.212  0.014
        WRSEL(a)   14.479  0.004  0.018   10.173  0.064  0.017    7.689  0.132  0.016
        WRSEL(b)   15.673  0.006  0.018   10.304  0.074  0.017    7.542  0.160  0.015
        PPR        21.577  0.008  0.018   13.456  0.088  0.017    9.316  0.162  0.015
        PRANK      11.643  0.024  0.018    6.555  0.150  0.015    4.103  0.278  0.013
u = 4   SMR         2.109  0.428  0.010    0.417  0.886  0.002    0.253  0.948  0.001
        PM          4.331  0.070  0.017    1.291  0.646  0.006    0.629  0.834  0.003
        WRSEL(a)    6.937  0.036  0.018    1.865  0.560  0.008    0.913  0.784  0.004
        WRSEL(b)    5.753  0.050  0.017    1.449  0.612  0.007    0.700  0.828  0.003
        PPR        10.509  0.054  0.017    1.785  0.626  0.007    0.739  0.832  0.003
        PRANK       3.935  0.096  0.016    1.154  0.676  0.006    0.576  0.862  0.003
u = 8   SMR         1.305  0.472  0.010    0.205  0.958  0.001    0.118  0.986  0.000
        PM          2.510  0.164  0.015    0.397  0.872  0.002    0.161  0.974  0.000
        WRSEL(a)    3.409  0.132  0.016    0.443  0.834  0.003    0.179  0.968  0.001
        WRSEL(b)    2.781  0.164  0.015    0.422  0.858  0.003    0.161  0.974  0.000
        PPR         5.720  0.160  0.015    0.422  0.864  0.002    0.155  0.976  0.000
        PRANK       2.373  0.188  0.015    0.374  0.890  0.002    0.141  0.980  0.000
Table 2.3: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated single hotspot with MODERATE expected disease counts, whose risk was
inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the
regions are scaled by u = 1, 4 and 8.
                          10th                   3rd                    1st
                     RMSE     CP     FP     RMSE     CP     FP     RMSE     CP     FP
u = 1   SMR         3.159  0.208  0.014    1.152  0.580  0.008    0.605  0.802  0.004
        PM          3.151  0.200  0.015    1.275  0.542  0.008    0.560  0.832  0.003
        WRSEL(a)    9.498  0.028  0.018    4.629  0.206  0.014    1.997  0.602  0.007
        WRSEL(b)    6.426  0.082  0.017    2.577  0.420  0.011    0.980  0.750  0.005
        PPR         8.818  0.112  0.016    2.705  0.474  0.010    0.995  0.784  0.004
        PRANK       2.164  0.378  0.011    0.729  0.764  0.004    0.319  0.922  0.001
u = 4   SMR         0.931  0.604  0.007    0.341  0.884  0.002    0.110  0.988  0.000
        PM          0.963  0.588  0.007    0.241  0.942  0.001    0.089  0.992  0.000
        WRSEL(a)    1.957  0.326  0.012    0.392  0.864  0.002    0.110  0.988  0.000
        WRSEL(b)    1.138  0.512  0.009    0.300  0.922  0.001    0.089  0.992  0.000
        PPR         1.483  0.534  0.008    0.272  0.938  0.001    0.089  0.992  0.000
        PRANK       0.769  0.690  0.006    0.195  0.962  0.001    0.077  0.994  0.000
u = 8   SMR         0.597  0.710  0.005    0.200  0.960  0.001    0.000  1.000  0.000
        PM          0.642  0.706  0.005    0.179  0.968  0.001    0.000  1.000  0.000
        WRSEL(a)    0.872  0.552  0.008    0.205  0.958  0.001    0.000  1.000  0.000
        WRSEL(b)    0.672  0.678  0.006    0.179  0.968  0.001    0.000  1.000  0.000
        PPR         0.722  0.700  0.005    0.179  0.968  0.001    0.000  1.000  0.000
        PRANK       0.546  0.790  0.004    0.161  0.974  0.000    0.000  1.000  0.000
Table 2.4: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated single hotspot with HIGH expected disease counts, whose risk was inflated
to about the 10th, 3rd and 1st place; the expected disease counts for all the regions
are scaled by u = 1, 4 and 8.
                          10th                   3rd                    1st
                     RMSE     CP     FP     RMSE     CP     FP     RMSE     CP     FP
u = 1   SMR         1.834  0.250  0.014    0.799  0.630  0.007    0.417  0.860  0.003
        PM          1.321  0.428  0.010    0.494  0.804  0.004    0.249  0.950  0.001
        WRSEL(a)    7.695  0.030  0.018    2.490  0.358  0.012    0.832  0.782  0.004
        WRSEL(b)    3.025  0.188  0.015    0.906  0.674  0.006    0.319  0.920  0.001
        PPR         5.272  0.250  0.014    0.926  0.744  0.005    0.382  0.936  0.001
        PRANK       0.696  0.712  0.005    0.257  0.940  0.001    0.118  0.986  0.000
u = 4   SMR         0.651  0.684  0.006    0.283  0.920  0.001    0.077  0.994  0.000
        PM          0.562  0.766  0.004    0.200  0.960  0.001    0.063  0.996  0.000
        WRSEL(a)    1.049  0.504  0.009    0.268  0.934  0.001    0.077  0.994  0.000
        WRSEL(b)    0.660  0.716  0.005    0.205  0.958  0.001    0.063  0.996  0.000
        PPR         0.720  0.734  0.005    0.195  0.962  0.001    0.063  0.996  0.000
        PRANK       0.415  0.870  0.002    0.167  0.972  0.001    0.045  0.998  0.000
u = 8   SMR         0.537  0.734  0.005    0.118  0.986  0.000    0.000  1.000  0.000
        PM          0.454  0.816  0.003    0.110  0.988  0.000    0.000  1.000  0.000
        WRSEL(a)    0.610  0.686  0.006    0.110  0.988  0.000    0.000  1.000  0.000
        WRSEL(b)    0.486  0.792  0.004    0.110  0.988  0.000    0.000  1.000  0.000
        PPR         0.475  0.808  0.003    0.100  0.990  0.000    0.000  1.000  0.000
        PRANK       0.369  0.880  0.002    0.089  0.992  0.000    0.000  1.000  0.000
Table 2.5: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated cluster of three contiguous hotspots with LOW expected disease counts,
the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.
                          10th                   3rd                    1st
                     RMSE     CP     FP     RMSE     CP     FP     RMSE     CP     FP
u = 1   SMR         5.264  0.455  0.031    3.530  0.578  0.024    1.712  0.808  0.011
        PM          3.904  0.412  0.033    2.762  0.568  0.024    1.611  0.821  0.010
        WRSEL(a)    9.437  0.111  0.050    7.373  0.276  0.041    3.445  0.619  0.022
        WRSEL(b)    7.071  0.249  0.043    4.987  0.425  0.033    2.207  0.740  0.015
        PPR         6.695  0.351  0.042    4.601  0.517  0.032    2.278  0.800  0.016
        PRANK       3.268  0.549  0.026    2.214  0.689  0.018    1.385  0.879  0.007
u = 4   SMR         2.172  0.656  0.019    1.508  0.823  0.010    1.204  0.977  0.001
        PM          2.148  0.635  0.021    1.485  0.819  0.010    1.193  0.987  0.001
        WRSEL(a)    3.767  0.442  0.032    2.081  0.706  0.017    1.228  0.961  0.002
        WRSEL(b)    2.465  0.583  0.024    1.586  0.793  0.012    1.198  0.983  0.001
        PPR         2.523  0.623  0.027    1.584  0.813  0.015    1.180  0.990  0.003
        PRANK       2.011  0.683  0.018    1.422  0.845  0.009    1.185  0.991  0.001
u = 8   SMR         1.701  0.740  0.015    1.279  0.908  0.005    1.179  0.997  0.000
        PM          1.697  0.745  0.014    1.268  0.914  0.005    1.182  0.999  0.000
        WRSEL(a)    2.043  0.646  0.020    1.389  0.853  0.008    1.191  0.996  0.000
        WRSEL(b)    1.755  0.721  0.016    1.301  0.901  0.006    1.183  0.999  0.000
        PPR         1.781  0.741  0.019    1.287  0.915  0.010    1.179  0.999  0.001
        PRANK       1.655  0.764  0.013    1.253  0.922  0.004    1.181  0.999  0.000
Table 2.6: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying
an isolated cluster of three contiguous hotspots with HIGH expected disease counts,
the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.
                          10th                   3rd                    1st
                     RMSE     CP     FP     RMSE     CP     FP     RMSE     CP     FP
u = 1   SMR         4.105  0.347  0.037    2.511  0.599  0.023    1.767  0.778  0.013
        PM          3.514  0.509  0.028    2.121  0.727  0.015    1.504  0.865  0.008
        WRSEL(a)   14.481  0.014  0.056    9.552  0.119  0.050    4.741  0.432  0.032
        WRSEL(b)    7.166  0.209  0.045    4.073  0.522  0.027    2.161  0.777  0.013
        PPR         7.550  0.413  0.046    4.196  0.672  0.032    2.466  0.843  0.019
        PRANK       2.628  0.713  0.016    1.572  0.835  0.009    1.328  0.934  0.004
u = 4   SMR         2.269  0.646  0.020    1.445  0.854  0.008    1.204  0.972  0.002
        PM          2.184  0.677  0.018    1.378  0.883  0.007    1.184  0.983  0.001
        WRSEL(a)    5.851  0.294  0.040    2.518  0.673  0.019    1.262  0.929  0.004
        WRSEL(b)    2.722  0.601  0.023    1.504  0.849  0.009    1.190  0.978  0.001
        PPR         4.034  0.648  0.032    1.583  0.867  0.023    1.184  0.984  0.010
        PRANK       1.966  0.729  0.015    1.304  0.917  0.005    1.179  0.989  0.001
u = 8   SMR         1.965  0.705  0.017    1.291  0.909  0.005    1.163  0.994  0.000
        PM          1.967  0.718  0.016    1.268  0.923  0.004    1.160  0.997  0.000
        WRSEL(a)    3.107  0.551  0.025    1.437  0.838  0.009    1.167  0.989  0.001
        WRSEL(b)    2.091  0.691  0.018    1.290  0.911  0.005    1.161  0.995  0.000
        PPR         3.275  0.705  0.030    1.271  0.927  0.021    1.167  0.998  0.011
        PRANK       1.893  0.753  0.014    1.244  0.941  0.003    1.162  0.995  0.000
Table 2.7: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an
isolated cluster of three non-contiguous hotspots with LOW expected disease counts,
the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.
                          10th                   3rd                    1st
                     RMSE     CP     FP     RMSE     CP     FP     RMSE     CP     FP
u = 1   SMR         9.478  0.412  0.033    4.000  0.630  0.021    2.561  0.816  0.010
        PM          8.487  0.286  0.040    3.993  0.543  0.026    2.544  0.777  0.013
        WRSEL(a)    9.800  0.081  0.052    6.782  0.301  0.040    3.668  0.613  0.022
        WRSEL(b)    9.080  0.147  0.048    5.522  0.427  0.032    2.872  0.717  0.016
        PPR         8.361  0.227  0.047    5.112  0.497  0.032    2.600  0.758  0.017
        PRANK       8.642  0.391  0.034    3.926  0.631  0.021    2.482  0.823  0.010
u = 4   SMR         3.484  0.577  0.024    1.458  0.851  0.008    1.191  0.971  0.002
        PM          3.335  0.521  0.027    1.487  0.835  0.009    1.180  0.976  0.001
        WRSEL(a)    4.550  0.367  0.036    1.871  0.750  0.014    1.208  0.954  0.003
        WRSEL(b)    3.618  0.476  0.030    1.570  0.818  0.010    1.186  0.972  0.002
        PPR         4.030  0.505  0.032    1.536  0.833  0.013    1.171  0.979  0.003
        PRANK       3.337  0.569  0.024    1.437  0.856  0.008    1.167  0.983  0.001
u = 8   SMR         2.495  0.641  0.020    1.308  0.904  0.005    1.177  0.998  0.000
        PM          2.509  0.600  0.023    1.314  0.905  0.005    1.176  0.998  0.000
        WRSEL(a)    2.878  0.540  0.026    1.384  0.861  0.008    1.184  0.995  0.000
        WRSEL(b)    2.591  0.585  0.024    1.326  0.897  0.006    1.178  0.998  0.000
        PPR         3.277  0.603  0.025    1.302  0.915  0.009    1.174  0.999  0.001
        PRANK       2.513  0.612  0.022    1.303  0.911  0.005    1.178  0.998  0.000
-
Table 2.8: Root mean squared error (RMSE), correct positive (CP) and false positive
(FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an
isolated cluster of three non-contiguous hotspots with HIGH expected disease counts,
risk of which were inflated to about the 10th, 3rd and 1st places; the expected disease
counts for all the regions are scaled by u = 1, 4 and 8.
                   10th                   3rd                    1st
u      Method      RMSE  CP     FP        RMSE  CP     FP        RMSE  CP     FP
u = 1  SMR        3.787  0.379  0.035    2.308  0.593  0.023    1.552  0.809  0.011
       PM         3.438  0.465  0.030    2.056  0.699  0.017    1.396  0.882  0.007
       WRSEL(a)  13.639  0.015  0.056    8.508  0.124  0.050    4.016  0.481  0.029
       WRSEL(b)   7.210  0.183  0.046    3.736  0.494  0.029    1.794  0.801  0.011
       PPR        7.308  0.375  0.047    3.646  0.647  0.032    1.678  0.871  0.018
       PRANK      2.550  0.665  0.019    1.547  0.819  0.010    1.230  0.941  0.003
u = 4  SMR        2.032  0.649  0.020    1.341  0.885  0.007    1.170  0.973  0.002
       PM         1.991  0.668  0.019    1.297  0.918  0.005    1.154  0.985  0.001
       WRSEL(a)   4.936  0.218  0.044    1.827  0.717  0.016    1.194  0.954  0.003
       WRSEL(b)   2.349  0.569  0.024    1.356  0.885  0.007    1.159  0.981  0.001
       PPR        2.618  0.647  0.032    1.290  0.922  0.016    1.142  0.990  0.007
       PRANK      1.852  0.738  0.015    1.252  0.944  0.003    1.150  0.987  0.001
u = 8  SMR        1.785  0.749  0.014    1.246  0.935  0.004    1.152  0.996  0.000
       PM         1.770  0.768  0.013    1.215  0.955  0.003    1.145  0.997  0.000
       WRSEL(a)   2.456  0.539  0.026    1.321  0.872  0.007    1.154  0.994  0.000
       WRSEL(b)   1.865  0.718  0.016    1.224  0.946  0.003    1.148  0.996  0.000
       PPR        1.930  0.766  0.024    1.216  0.964  0.013    1.144  0.998  0.008
       PRANK      1.756  0.799  0.011    1.212  0.959  0.002    1.145  0.996  0.000
-
Chapter 3
Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes
3.1 Introduction
In public health, environmental and ecological studies, variables measured at the same
spatial locations may be correlated so that the spatial structures of such variables
across the region under consideration are very similar, indicating that they may be
characterized by a common spatial risk surface. Exploiting such commonality in
risks may improve the precision of local area risk estimates, especially for
rare diseases.
Shared component spatial models have been studied in a variety of applied con-
texts. Knorr-Held and Best (2001) proposed a shared-component model which mim-
ics an ecological regression on the unobserved shared component. The two diseases
considered in that application share a common spatial structure and, in addition, support
disease-specific spatially uncorrelated random errors. Fitting the model requires
strong prior assumptions on the spatial and uncorrelated random errors, typically
because of challenges related to identifiability of the latent spatial fields. Wang
and Wall (2003) proposed a common spatial factor model to study multivariate indicators
of cancer risk across counties in Minnesota. To avoid identifiability issues, the
model includes the common spatial structure term but no excess heterogeneity, and
the variance of the shared spatially correlated random effect is treated
as fixed. Hogan and Tchernis (2004) proposed a common factor model for spatial
multivariate count data with constraints imposed on the variance structure of the
conditional autoregressive model they employ. Congdon (2006) set out a framework
for modeling multiple health outcomes over area, age, and time dimensions
that takes account of spatial correlation as well as interactions between dimensions.
Tzala and Best (2006) proposed a Bayesian latent variable model for cancer mortality
data, which linked spatial effects. Other joint modeling approaches
for multivariate spatial data have also been proposed, including the multivariate version
of the conditional autoregressive model (MVCAR) (Gelfand and Vounatsou, 2003),
which assumes the spatial structure is the same across the multivariate outcomes.
Such modeling allows for the pooling of information across spatial units as well as
across multiple outcomes within units. In contrast, the common spatial factor model
may stratify the spatial variation into two components: the shared component and
outcome-specific components. Such a modeling approach permits a simple analysis
of which spatial term dominates as well as an identification of the common spatial
structure. Though testing for a common spatial structure is quite relevant in certain
studies, there has been very little discussion of the power of such tests. We consider
this in the context of the analysis of count data and also examine the utility of joint
modeling in terms of gains in the efficiency of estimating relative risks.
In environmental and ecological studies, count data are often characterized by an
excess of zeros and spatial dependence (Clarke and Green, 1988; Welsh et al., 1996;
Martin et al., 2005). When studying the abundance of species, for example, a large
proportion of zero counts may indicate that the habitat is unsuitable in certain
areas. In such cases, standard distributions such as the Poisson, binomial
and negative binomial may fail to provide an adequate fit. Zero-inflated
distributions (Lambert, 1992) form a class of distributions designed for such data.
For handling zero-inflation, mixture models and conditional models are
two common approaches within the context of ecological and health studies. The
well-known zero-inflated Poisson (ZIP) model (Lambert, 1992) is a mixture of a de-
generate zero mass and a Poisson distribution. On the other hand, Welsh et al. (1996)
formulate a two-component conditional model where the presence/absence of counts
is modeled with a binomial distribution and the abundance at active sites is mod-
eled using a truncated Poisson or truncated negative binomial distribution. These
two models have different interpretations. Structural zeros and random zeros are not
distinguished under the conditional specification, whereas the mixture model permits
an examination of the different sources of error (Kuhnert et al., 2005). For more
discussion of zero-inflated models from a Bayesian perspective see Angers and Biswas
(2003) and Ainsworth (2007).
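The distinction between the two specifications can be illustrated with a minimal sketch, given here only as an aid to reading and not as the implementation used in this thesis. The function names and the parameter names `pi` (probability of a structural zero), `p` (probability of presence) and `lam` (Poisson mean) are assumptions introduced for the illustration, and the conditional model is shown with a truncated Poisson for the positive counts.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability mass function at count y with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, pi, lam):
    """Mixture (zero-inflated Poisson) model.

    With probability pi the observation is a structural zero; otherwise it
    is drawn from Poisson(lam), which can itself produce a random zero.
    """
    base = (1.0 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base


def hurdle_pmf(y, p, lam):
    """Conditional (two-component) model.

    Presence occurs with probability p; given presence, the count follows
    a zero-truncated Poisson(lam). All zeros come from the absence part.
    """
    if y == 0:
        return 1.0 - p
    return p * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


def prob_structural_zero(pi, lam):
    """Under the mixture model, P(zero is structural | y = 0)."""
    return pi / zip_pmf(0, pi, lam)
```

Under the mixture form an observed zero can be attributed to either source, so `prob_structural_zero(pi, lam)` is strictly between 0 and 1 whenever `0 < pi < 1`; under the conditional form no such decomposition is available because every zero arises from the absence component.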
In many applications, zero-inflated count data are spatially correlated. Rathbun
and Fei (2006) introduced a zero-inflated Poisson model, in which the component
modeling the excess zeros is governed by a hidden spatial probit model; a threshold,
defining large probabilities in the probit layer, governs the proportion of zeros. Agarwal
et al. (2002) also proposed a zero-inflated model for spatial count data using a
mixture model approach and incorporating spatial random errors into either or both
of the model components. With multivariate zero-inflated count data corresponding
to several related spatial outcomes, there is also the possibility of linking model com-
ponents across the various outcomes using a shared latent spatial structure. This
would be relevant, for example, if the underlying, hidden mechanisms resulting in the
structural zeros, or the abundance of counts, are related across the outcomes.
The methods developed in this chapter for joint outcome analysis of spatial count
and zero-heavy count data focus on the use of shared latent spatial frailty models.
We discuss such joint mapping models and evaluate what benefits may be achieved
through joint modeling. The rest of the chapter is structured as follows. Section
3.2 describes a general modeling framework for common spatial factor models for
count data and zero-inflated count data. Section 3.3 presents two motivating appli-
cations, applying the common spatial factor model to Ontario lung cancer data and
zero-inflated forestry infection data related to a study of Comandra blister rust on
lodgepole pine trees. Section 3.4 examines hypothesis testing of whether two spatial
maps share the same underlying spatial structure for count data. A power study is
performed based on the situational context of the Ontario lung cancer data. Section
3.5 compares joint and separate modeling in terms of accuracy and efficiency of es-
timating relative risks through simulation investigations. Some closing remarks are
provided in Section 3.6.
3.2 Models for Joint Count Outcomes
We present here a general modeling framework for the common spatial factor model
for joint modeling of count data and zero-inflated count data. In disease mapping,
the typical response is a rate (both in health and forest epidemiology), hence the
focus on the analysis of counts herein. However, generalization of the model to other
non-normal data is straightforward.
3.2.1 Common Spatial Factor Model for Counts
Let $y_{ij} \mid \mu_{ij} \sim \mathrm{Poisson}(\mu_{ij})$ for region $i = 1, \ldots, n$ and outcome $j = 1, \ldots, J$, where
$y_{ij}$ denotes the response and $\mu_{ij}$ denotes the expected mean count for outcome $j$ in
region $i$. The common spatial factor model can be written as:
$$\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + \gamma_j b_i + h_{ij}, \tag{3.1}$$
where $\alpha_j$ denotes the overall mean rate for the $j$th outcome and $E_{ij}$ is the expected
number of disease counts in region $i$ for the $j$th outcome based on some standardized
rates; $b_i$, $i = 1, \ldots, n$, is the spatial random effect, assumed here to follow a conditional
autoregressive distribution (Besag, 1974) to account for the spatially structured
correlation