MODELS AND METHODS FOR SPATIAL DATA:

APPLICATIONS IN EPIDEMIOLOGICAL,

ENVIRONMENTAL AND ECOLOGICAL STUDIES

by

Cindy Xin Feng

M.Sc. (Statistics), Simon Fraser University, 2006

B.Sc. (Applied Mathematics), Beijing University of Technology, 2003

a Thesis submitted in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

in the Department of

Statistics and Actuarial Science

© Cindy Xin Feng 2011

SIMON FRASER UNIVERSITY

Summer 2011

All rights reserved. However, in accordance with the Copyright Act of

Canada, this work may be reproduced without authorization under the

conditions for Fair Dealing. Therefore, limited reproduction of this

work for the purposes of private study, research, criticism, review and

news reporting is likely to be in accordance with the law, particularly

if cited appropriately.

APPROVAL

Name: Cindy Xin Feng

Degree: Doctor of Philosophy

Title of Thesis: Models and Methods for Spatial Data: Applications in

Epidemiological, Environmental and Ecological Studies

Examining Committee: Dr. Rick Routledge

Chair

Dr. Charmaine Dean, Senior Supervisor

Dr. Jiguo Cao, Supervisor

Dr. Yi Lu, Supervisor

Dr. Paramjit Gill, Internal External Examiner

Dr. Patrick Brown, External Examiner,

University of Toronto

Date Approved: August 24, 2011

Partial Copyright Licence

Abstract

This thesis develops new methodologies for applied problems using smoothing techniques

for spatial or spatial-temporal data. We investigate Bayesian ranking methods

for identifying high risk areas in disease mapping, assessing these particularly with

regard to their performance in isolating emerging unusual and extreme risks in small

areas. We build on information obtained through mapping multivariate outcomes by

developing models which investigate if the multivariate spatial outcomes share the

same underlying spatial structure. We develop a general framework for joint model-

ing of multivariate spatial outcomes for count and zero-inflated count data using a

common spatial factor model.

We also study spatial exposure measures, motivated by an analysis of Comandra

blister rust infection on lodgepole pine trees from British Columbia. We contrast

nearest distance with other, more general, exposure measures and consider the impact

of mis-specification of exposure measures in a semiparametric generalized additive

modeling framework including a spatial residual term modeled as thin plate regression

spline. An appealing feature of the new spatial exposure measures considered is that

they can be easily adapted to other problems, such as investigation of the association

of asthma incidence to traffic exposures. A common theme in the thesis is the use of

functional data analysis, and we specifically adapt such methods for assessing spatial

and temporal variation of Cadmium concentration in Pacific oysters from British

Columbia.

The methodologies developed in these projects widen the toolbox for spatial anal-

ysis in applications in epidemiology, and in environmental and ecological studies.

Acknowledgments

I am deeply indebted to my senior supervisor Dr. Charmaine Dean for her guidance

and support in countless ways. Without her enlightening instruction, great kindness

and patience, I could not have completed my thesis. Her support and encouragement

were very helpful to me through some very difficult times in my life. I also want

to extend my gratitude to my examining committee members, Dr. Rick Routledge,

Dr. Jiguo Cao, Dr. Yi Lu, Dr. Paramjit Gill and Dr. Patrick Brown for all their

careful reviewing and insightful comments. Their detailed reviews and constructive

comments greatly improved the thesis.

Many thanks to the faculty and staff of the Department of Statistics and Actuarial

Science of Simon Fraser University for providing me a wonderful environment for

graduate studies. In particular, I would also like to thank Dr. Derek Bingham, Dr.

Boxin Tang, Dr. Richard Lockhart, Dr. Tim Swartz, Dr. Leilei Zeng, Dr. Joan Hu

and Mr. Ian Bercovitz for their support, and Sadika, Kelly and Charlene for their

constant help. Thank you also to my fellow graduate students for their company and

for growing together with me during my graduate studies.

Finally, and most importantly, I would like to thank my family; I would not have been

able to come this far without their care and encouragement.

Contents

Approval
Abstract
Acknowledgments
Contents

1 Introduction
1.1 Overview
1.2 Disease Mapping
1.2.1 Conditional Autoregressive Priors
1.3 Thin-Plate Splines
1.4 Outline of Thesis
1.4.1 Chapter 2
1.4.2 Chapter 3
1.4.3 Chapter 4
1.4.4 Chapter 5
1.4.5 Chapter 6

2 Bayesian Ranking Methods for the Detection of Isolated Hotspots in Disease Mapping
2.1 Introduction
2.2 Bayesian Disease-Mapping Model
2.3 Ranking Methods
2.3.1 Squared error loss function for the isolation measures
2.3.2 Squared error loss function for the ranks of the isolation measures
2.3.3 Weighted rank squared error loss function
2.3.4 Misclassification rates of regions in the top 100γ% group
2.4 Comparison of Rank Estimators of Isolation
2.5 Summary

3 Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes
3.1 Introduction
3.2 Models for Joint Count Outcomes
3.2.1 Common Spatial Factor Model for Counts
3.2.2 Common Spatial Factor Model for Zero Heavy Counts
3.2.3 Model Assessment and Comparison
3.3 Applications
3.3.1 Ontario Lung Cancer
3.3.2 Comandra Blister Rust Tree Infection
3.4 Power of the Test for Common Spatial Structure
3.5 Precision Gains Through Joint Outcome Modeling
3.6 Summary and Concluding Remarks

4 Impact of Misspecifying Spatial Exposures
4.1 Introduction
4.2 Comandra Blister Rust Study
4.3 Flexible Smooth Models
4.4 Comparison of Exposure Measures for CBR Infection
4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures
4.6 Summary

5 Exploring Spatial and Temporal Variations of Cadmium Concentrations in Pacific Oysters from British Columbia
5.1 Introduction
5.1.1 The Motivating Datasets
5.1.2 Statistical Methods
5.2 Methodology
5.2.1 Spline Smoothing
5.2.2 Monotone Spline Smoothing
5.2.3 Functional Principal Component Analysis
5.2.4 Semi-Parametric Additive Model
5.3 Results
5.3.1 Data Representation
5.3.2 Spatial Variability
5.3.3 The Semi-Parametric Additive Model
5.4 Concluding Remarks

6 Future Work
6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes
6.2 Spatial Modeling for Infectious Disease
6.3 Curve Clustering

Bibliography

A Appendix for Chapter 3
B Appendix for Chapter 4
C Appendix for Chapter 5

Chapter 1

Introduction

1.1 Overview

In recent years, there has been considerable interest in the development and applica-

tion of spatial models and methods for the analysis of spatially correlated data, which

are often geographically referenced, temporally correlated or highly multivariate. For

example, a motivating dataset considers the analysis of lung cancer for males and

females by local health unit in Ontario. The key idea throughout the approaches

considered is to take advantage of the correlation structure among observations to

perform estimation, prediction, hypothesis testing and other statistical procedures.

We begin with a review of some important concepts which form the building blocks

of the methods and models developed in later chapters. This is followed by an outline

of the material presented in each of the chapters of the thesis.

1.2 Disease Mapping

Mapping of disease incidence and mortality rates is of primary importance in many

epidemiological studies. The use of crude rates to estimate rare disease risks in small

areas such as health units, census areas or administrative zones, is problematic since it

does not account for the high variability of population sizes over the different regions,

nor the spatial patterns of the regions under study. Because of this, interpretation of

the spatial distribution of disease based on crude estimates is often misleading. Al-

ternatively, Bayesian inference is widely used to produce stabilized risk maps through

borrowing information from neighborhoods across the map. Early developments of

disease mapping methodology included the use of empirical Bayes (EB) techniques

(Manton et al., 1989; Marshall, 1991; Dean and MacNab, 2001; Breslow and Clay-

ton, 1993) to estimate parameters, and a plug-in approximation of these for posterior

inference, which yielded unbiased estimates of the relative risks. However, the variances

of these estimates were underestimated, since the EB approach does not account

for the uncertainty arising from estimating hyperparameters. In recent years, fully

Bayesian (FB) approaches have gained prominence. Inference is based on Markov

chain Monte Carlo (MCMC) algorithms (Besag et al., 1991; Bernardinelli and Montomoli,

1991; MacNab et al., 2004; Congdon, 2006). Interval estimation of relative

risks based on posterior distributions accounts for the uncertainty associated with the

estimates through the hyperprior specifications. Bayesian methods for disease mapping

are often termed hierarchical spatial modeling. The first level of the hierarchy

depicts the distribution of the data; the second level introduces the spatial dependence

through random effects which account for heterogeneity in the risks; the third and

lowest level specifies the distribution of the hyperparameters.


1.2.1 Conditional Autoregressive Priors

One of the most popular choices for the distribution of the random effects in hi-

erarchical spatial modeling is the intrinsic conditional autoregressive (CAR) model

(Besag et al., 1991). Let $W = (w_{ij})$ denote the so-called spatial proximity matrix,

$i = 1, \ldots, n$ and $j = 1, \ldots, n$ for $n$ regions, where $w_{ii} = 0$ and $w_{ij} = 1$ if the $i$th

and the $j$th areas are neighbours (denoted $j \sim i$), and 0 otherwise. The conditional

expectation and variance are

$$E(b_i \mid b_{j \neq i}) = \frac{1}{w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \tag{1.1}$$

where $b_{j \neq i}$ denotes the random effects of all regions other than region $i$, and

$b = (b_1, \ldots, b_n)$ has joint distribution

$$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - W)^{-1}, \tag{1.2}$$

where $D = \mathrm{diag}(w_{1+}, \ldots, w_{n+})$ and $w_{i+} = \sum_j w_{ij}$. The forms (1.1) and (1.2) define the

intrinsic CAR (Besag et al., 1991) uniquely. With this model, local smoothing can be

achieved, as $E(b_i \mid b_{j \neq i})$ is the local risk average over the neighborhood of region $i$ and

$\mathrm{Var}(b_i \mid b_{j \neq i})$ is scaled by the inverse of the number of neighbors, so that the greater

the number of neighbors the smaller the variance. However, the intrinsic CAR prior

is improper, since the matrix $(D - W)$ is singular. This impropriety can be remedied

by enforcing constraints such as $\sum_{i=1}^{n} b_i = 0$, which can be implemented numerically

at each iteration of an MCMC algorithm used for model fitting. Alternatively, the

so-called proper CAR model may be used; this model incorporates an additional

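parameter $\rho$, so that the full conditionals are

$$E(b_i \mid b_{j \neq i}) = \frac{\rho}{w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \qquad \rho \in (0, 1), \tag{1.3}$$

leading to the unique joint distribution

$$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - \rho W)^{-1}, \tag{1.4}$$

in which the matrix $(D - \rho W)$ is non-singular, so that the covariance matrix $\Sigma$ is well defined.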

Alternatively, Leroux et al. (1999) proposed a CAR model defining the full conditionals

as

$$E(b_i \mid b_{j \neq i}) = \frac{\lambda}{1 - \lambda + \lambda w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{1 - \lambda + \lambda w_{i+}}, \qquad \lambda \in (0, 1), \tag{1.5}$$

leading to the unique joint distribution

$$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 \{\lambda (D - W) + (1 - \lambda) I\}^{-1}, \tag{1.6}$$

where $\lambda$ is a parameter which weights the contributions of the spatially

correlated effect, modeled as an intrinsic CAR, and the independent random noise term,

modeled as an independent normal distribution.
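
The three CAR structures above can be compared on a toy example. The following is a minimal sketch only: the four-region adjacency matrix, the vector of random effects and the function names are hypothetical, and the additional parameter of the proper CAR and the weighting parameter of the Leroux model are called `rho` and `lam` here purely for illustration. All matrices are precision matrices up to the factor 1/σ_b². The sketch checks that every row of D − W sums to zero (so the intrinsic CAR precision is singular) and evaluates the conditional mean of one region under each prior.

```python
# Sketch of the three CAR precision structures on a made-up 4-region map.
# Matrices are precisions up to the factor 1 / sigma_b^2; names are illustrative.

def car_precisions(W, rho, lam):
    """Return (intrinsic, proper, leroux) precision matrices as nested lists."""
    n = len(W)
    w_plus = [sum(row) for row in W]  # w_{i+}: number of neighbours of region i
    intrinsic = [[(w_plus[i] if i == j else 0) - W[i][j] for j in range(n)]
                 for i in range(n)]                      # D - W
    proper = [[(w_plus[i] if i == j else 0) - rho * W[i][j] for j in range(n)]
              for i in range(n)]                         # D - rho * W
    leroux = [[lam * intrinsic[i][j] + (1 - lam) * (1 if i == j else 0)
               for j in range(n)] for i in range(n)]     # lam (D - W) + (1 - lam) I
    return intrinsic, proper, leroux


def conditional_mean(Q, b, i):
    """E(b_i | rest) for a zero-mean Gaussian with precision Q."""
    return -sum(Q[i][j] * b[j] for j in range(len(b)) if j != i) / Q[i][i]


def conditional_variance_factor(Q, i):
    """Var(b_i | rest) divided by sigma_b^2, i.e. 1 / Q_ii."""
    return 1.0 / Q[i][i]


# Hypothetical map: four regions in a row, so region 1 neighbours regions 0 and 2.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
b = [1.0, 0.0, 3.0, 5.0]
intrinsic, proper, leroux = car_precisions(W, rho=0.5, lam=0.5)

# Every row of D - W sums to zero, so (D - W) 1 = 0 and the matrix is singular.
print([sum(row) for row in intrinsic])

# Conditional means of region 1 under the intrinsic, proper and Leroux priors.
print(conditional_mean(intrinsic, b, 1),
      conditional_mean(proper, b, 1),
      conditional_mean(leroux, b, 1))
```

With `lam` close to 1 the Leroux precision approaches D − W, the intrinsic CAR, while with `lam` close to 0 it approaches the identity, that is, independent random effects.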

For point referenced data, geostatistical models (Cressie, 1993) are often used,

which directly specify the covariance matrix based on the distance between the spa-

tial sites. For example, the correlation between two spatial sites may decay expo-

nentially with distance; whereas, CAR models are specified based on the adjacency

structure among the spatial units, and can be used for either point-referenced data

or lattice data. In addition, inference for geostatistical models usually requires inversion

of covariance matrices at each MCMC iteration; CAR models, which specify the precision

matrix directly, are therefore computationally more efficient than geostatistical models.

1.3 Thin-Plate Splines

Thin-plate splines (Duchon, 1977) offer a very elegant approach for estimating a

smooth function of multiple predictor variables. The following provides a concise

introduction to thin-plate splines. For a more detailed description, see Duchon

(1977), Meinguet (1979), Green and Silverman (1994) and Wood (2004, 2006).

Suppose the response $y_i$, $i = 1, \ldots, n$, is modeled as a smooth function of covariates

$x_i$ such that

$$y_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{1.7}$$

where $f$ is an unknown function on a fixed domain $D \subset \mathbb{R}^d$, $\epsilon_i$ is a random error term,

and $x_i \in D$ are fixed values for covariates.

Thin-plate spline smoothing estimates $f$ by finding the function $\hat{f}$ which minimizes

the penalized sum of squares

$$\frac{1}{n} \sum_{i=1}^{n} w_i \{y_i - f(x_i)\}^2 + \lambda J_m(f), \tag{1.8}$$

where $w_i$, $i = 1, 2, \ldots, n$, are some fixed constants; $J_m(f)$ is a penalty function measuring

the non-smoothness or so-called 'wiggliness' of $f$; and $\lambda$ is the smoothing parameter,

which controls the tradeoff between $f$ fitting the data precisely and smoothness

of $f$. The penalty term is defined as

$$J_m(f) = \int \cdots \int_{\mathbb{R}^d} \sum_{\nu_1 + \cdots + \nu_d = m} \frac{m!}{\nu_1! \cdots \nu_d!} \left( \frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}} \right)^2 dx_1 \cdots dx_d. \tag{1.9}$$

The sum in the integral is taken over all non-negative integer vectors $\nu = (\nu_1, \ldots, \nu_d)^T$ such that

$\nu_1 + \cdots + \nu_d = m$, where $d$ denotes the number of covariates, so $d = 2$ for spatial

longitude and latitude coordinate data, and the order $m$ of differentiation in the

penalty can be any integer satisfying $2m > d$. Matheron (1973) and Duchon (1977)

showed that the function minimizing (1.8) has the form

$$f(x) = \sum_{j=1}^{k} \alpha_j \phi_j(x) + \sum_{i=1}^{n} \delta_i \eta_i(x), \tag{1.10}$$

where $(\phi_1, \ldots, \phi_k)$ are linearly independent polynomials spanning the space of all

$d$-dimensional polynomials of degree less than $m$, and $\alpha_j$, $j = 1, \ldots, k$, and $\delta_i$, $i = 1, \ldots, n$,

are coefficients to be estimated. For example, when $d = 2$, $m = 2$, $k = 3$

and $x = (x_1, x_2)$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$ and $\phi_3(x) = x_2$. For $d = 2$, $m = 3$,

$k = 6$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$, $\phi_3(x) = x_2$, $\phi_4(x) = x_1 x_2$, $\phi_5(x) = x_1^2$,

$\phi_6(x) = x_2^2$. The functions $(\eta_1, \ldots, \eta_n)$ are a set of $n$ radial basis functions, defined

as

$$\eta_i(r) = \begin{cases} a_{md} \|r\|^{2m-d} \log \|r\|, & d \text{ even}, \\ b_{md} \|r\|^{2m-d}, & d \text{ odd}, \end{cases}$$

where $a_{md}$ and $b_{md}$ are constants.

For modeling spatial effects, thin-plate regression splines can be viewed as a Gaussian

process with generalized covariance (Cressie, 1993), characterized in terms of

distance $\tau$. The form of the covariance in two dimensions is $C(\tau) \propto \tau^{2m-2} \log(\tau)$,

where $m$ is the order of the spline (commonly two). Paciorek (2007) provided a

comparison of a variety of approaches for modeling spatial surfaces. Wood (2000, 2003,

2004) proposed the use of iterative weighted fitting of reduced rank thin-plate splines

for computational efficiency.
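
The two ingredients of the thin-plate spline representation described in this section, the polynomial part and the radial basis functions, can be illustrated with a short sketch. The function names are hypothetical, and the constants a_md and b_md are replaced by an arbitrary argument `const` (default 1), which is an assumption made only for illustration. The sketch counts the polynomials of degree less than m in d variables, lists their exponents, and evaluates the radial basis function at a given distance.

```python
from itertools import product
from math import comb, log


def n_polynomial_terms(d, m):
    """Number k of monomials in d variables with total degree less than m."""
    return comb(m + d - 1, d)


def monomial_exponents(d, m):
    """Exponent tuples (e_1, ..., e_d) of all monomials with total degree < m."""
    return [e for e in product(range(m), repeat=d) if sum(e) < m]


def radial_basis(r_norm, d, m, const=1.0):
    """Thin-plate radial basis at distance r_norm, up to the constant `const`.

    `const` stands in for a_md (d even) or b_md (d odd); its value is assumed.
    """
    if 2 * m <= d:
        raise ValueError("the thin-plate penalty requires 2m > d")
    if r_norm == 0.0:
        return 0.0  # limiting value as the distance tends to zero
    power = 2 * m - d
    if d % 2 == 0:
        return const * r_norm ** power * log(r_norm)
    return const * r_norm ** power


# Two spatial coordinates (d = 2) with second- and third-order penalties.
print(n_polynomial_terms(2, 2), monomial_exponents(2, 2))
print(n_polynomial_terms(2, 3), monomial_exponents(2, 3))
print(radial_basis(2.0, d=2, m=2))
```

For d = 2 the counts agree with the values k = 3 (m = 2) and k = 6 (m = 3) quoted in the text.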

1.4 Outline of Thesis

This thesis develops models and methods for the analysis of spatial or spatial-temporal

data arising from epidemiological, environmental and ecological studies. Specific problems

considered include identification of high risk isolated areas in Chapter

2; joint modeling of multivariate spatially correlated outcomes using common spatial

factor models in Chapter 3; misspecification of spatial exposure measures in Chapter

4; and investigation of functional data analysis approaches for modeling spatially and

temporally correlated data in Chapter 5. Each of Chapters 2, 3, 4 and 5 constitutes

a submitted paper. As a result, some introductory material, as well as the descriptions

of motivating data sets, may be repeated across these chapters.


1.4.1 Chapter 2

In disease mapping studies, often there is interest in identifying high risk areas in

order to investigate causes of mortality for surveillance purposes, or perhaps for effi-

cient allocation of health funding. Here, we focus on identification of locally isolated

high risk regions termed ‘local hotspots’ or ‘emerging hotspots’, defined as regions

with elevated risks, with respect to their neighbors. Identification of ‘local hotspots’ or

‘emerging hotspots’ before they become extreme is crucial for disease surveillance. We

develop methods of ranking the difference between area risks or ranks and corresponding

values for neighbours, based on (1) the standardized mortality ratio (SMR), (2)

minimizing mean squared errors of estimation for relative risks, (3) minimizing mean

squared errors of estimation for ranks of risks, (4) minimizing a weighted squared

error loss function for ranks and (5) maximizing the sensitivity in the upper and

lower 100γ% relative risks at prespecified γ. We evaluate our methods through a

simulation study in a scenario which reflects the Scottish lip cancer data used in

several mapping studies. Our simulation results show that ranking the difference be-

tween posterior ranks of emerging hotspots and corresponding values for neighbours,

based on minimizing mean squared errors of estimation for ranks, is superior to other

methods for identifying emerging hotspots.

1.4.2 Chapter 3

This chapter discusses joint outcome modeling of multivariate spatial data, where

outcomes include count as well as zero-inflated count data. The framework utilized for

the joint spatial count outcome analysis reflects that which is now commonly employed

for the joint analysis of longitudinal and survival data, termed shared frailty models,

in which the outcomes are linked through a shared latent spatial random risk term.

We discuss these types of joint mapping models and consider the benefits achieved


through such joint modeling in the disease mapping context. We also consider the

power of tests for common spatial structure and develop recommendations on the

sort of power achievable in some contexts, as well as overall recommendations on the

utility of joint mapping. We illustrate the approaches in an analysis of lung cancer

mortality as well as an ecological study of Comandra blister rust infection of lodgepole

pine trees.

1.4.3 Chapter 4

In environmental and epidemiological studies, the nearest distance between the sus-

ceptible subject and the exposure source is a commonly used exposure measure, prin-

cipally because this measure is easy to collect. However, the density of the exposure in

the neighborhood of the subject may play an important role in the response to expo-

sure. Misspecification of exposure measures may result in inaccurate determinations

of the link between exposure and the response of interest. Such considerations are

motivated by the study of the disease dynamics of Comandra blister rust (Cronartium

comandrae) on lodgepole pine. This disease spreads to pine trees through alternate host

plants near the trees. We aim to understand the relationship between the alternate

host plant presence and the disease, as well as effects relating to genetic variation in

the trees. We contrast the use of nearest distance to the alternate host plant, with

host plant densities at different orders of neighborhood, as exposure measures, in the

framework of a flexible semiparametric generalized additive model, while adjusting for

a spatially smooth surface. We demonstrate that if exposure is inaccurately modeled,

bias in estimating genetic effects may arise. Our study also provides

information on the added benefit of collecting more detailed information on exposure

beyond the simple nearest distance measure.


1.4.4 Chapter 5

Oysters from the Pacific Northwest coast of British Columbia, Canada, contain high

levels of cadmium, in some cases exceeding some international food safety guidelines.

A primary goal of this chapter is the investigation of the spatial and temporal variation

in cadmium concentrations for oysters sampled from coastal British Columbia. Such

information is important so that recommendations can be made as to where and when

oysters can be cultured such that accumulation of cadmium within these oysters

is minimized. Some modern statistical methods are applied to achieve this goal,

including monotone spline smoothing, functional principal component analysis and

semi-parametric additive modelling. Oyster growth rates are estimated as the first

derivatives of the monotone smoothing growth curves. Some important patterns in

cadmium accumulation by oysters are observed. For example, most inland regions

tend to have a higher level of cadmium concentration than most coastal regions, so

more caution needs to be taken for shellfish aquaculture practices occurring in the

inland regions. The semi-parametric additive modelling shows that oyster cadmium

concentration decreases with oyster length, and oysters sampled at 7 m have higher

average cadmium concentration than those sampled at 1 m.

1.4.5 Chapter 6

The thesis closes with a discussion of future research topics.

Chapter 2

Bayesian Ranking Methods for the

Detection of Isolated Hotspots in

Disease Mapping

2.1 Introduction

In disease mapping, early capture of emerging hotspots, that is, regions with ele-

vated risks which are surrounded by areas with much lower risks, before they become

extreme, is crucial in decision-making related to health surveillance. Such decision-

making processes may refer to optimal allocation of resources for health prevention, or

to decisions reflecting mobility of a society or other environmental controls. A typical

approach for detection of disease hotspots through a hypothesis testing framework

utilizes the scan statistic (see Kulldorff and Nagarwalla, 1995; Kulldorff et al., 1998),

which aims at detecting the location and size of hotspots without any preconceived

assumptions about these values. Our focus here is quite different as we seek to es-

timate and rank various local elevations in risk across a map. Model based spatial

methods are used here to estimate such ranks. For rare diseases, the observed dis-

ease count may exhibit extra Poisson variation. Hence, the standardized mortality

ratios (SMRs), a basic investigative tool for epidemiologists, may be highly variable.

Subsequently, in maps of SMRs, the most variable values, arising typically from low

population areas, tend to be highlighted, masking the true underlying pattern of dis-

ease risk. To address the issue of such overdispersion, the field of disease mapping

has flourished in the last decade with a variety of estimation methods and spatial

models for latent levels of the model hierarchy. In particular, there have been many

developments related to Bayesian hierarchical models, which allow the risk in an area

to borrow strength from neighboring areas where the disease risks are similar. These

models have indeed become standard tools for mapping rates (see Besag et al., 1991;

Clayton and Bernardinelli, 1992; Clayton et al., 1993; Lawson et al., 2000; MacNab

et al., 2004; Best et al., 2005, for example) in order to identify global hotspots and

trends in the risk surface across the map.
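
The instability of SMRs in low population areas can be made concrete with a small sketch. It assumes only the Poisson model for the counts, y ~ Poisson(θE), under which the SMR y/E has variance θ/E; the function names and the two expected counts are hypothetical and chosen for illustration.

```python
from math import sqrt


def smr(observed, expected):
    """Standardized mortality ratio: observed count over expected count."""
    return observed / expected


def smr_sd(theta, expected):
    """Standard deviation of y / E when y ~ Poisson(theta * E)."""
    return sqrt(theta / expected)


# Same true relative risk, very different expected counts (illustrative values).
theta = 1.0
for expected in (4.0, 100.0):
    print(expected, smr_sd(theta, expected))
```

Under this assumption, two regions with the same underlying risk can show very different SMRs simply because the region with the smaller expected count has the larger sampling variability.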

Identification of local or emerging hotspots has received less attention. It is

unclear whether and what sorts of smoothing techniques offer advantages for iden-

tifying isolated hotspots, over basic estimates such as raw rates. Here, we maintain

the focus on Bayesian hierarchical conditional autoregressive (CAR) models, developed

by Besag et al. (1991), Clayton and Bernardinelli (1992) and Clayton et al. (1993).

This model and its extensions have become commonplace in epidemiological studies

and have been shown to be flexible and robust (Lawson et al., 2000). Best et al.

(2005) demonstrate the merits of the CAR model when compared to other contemporary

models including a multivariate normal geostatistical model with exponential

covariance, a spatial mixture model, a partition model and a gamma moving average

model. While CAR models were not designed to detect isolated hotspots or clusters

of isolated hotspots, they have nevertheless been used broadly for identifying extreme

risks.


The most natural measure of isolation is the difference between the risk or rank of a

potential hotspot and the corresponding quantity for its neighbors. Ranking methods

play a valuable role in drawing attention to elevated regions. This chapter considers

methods for ranking isolation measures with the goal of using these to identify local

or emerging hotspots. We note that Laird and Louis (1989) showed that ranking of

empirical Bayes estimators can be more accurate than that of conventional maximum

likelihood estimators. Shen and Louis (1998) investigated ranking procedures using

squared error loss functions operating on the difference between the estimated and

true ranks. We note also that in many applications, interest focuses principally on

identifying the locations with relatively high (e.g. in the upper 10%) or low risks.

With such an emphasis, Lin et al. (2006) discussed various loss functions for Bayesian

optimal ranking, as well as decision rules for identifying the regions with the top

100γ% risk values. Wright et al. (2003) developed a weighted rank squared error loss

function targeted at the most likely high-risk locations. We contrast these methods

for identifying the highest and lowest isolation measures across a map and develop

recommendations based on adaptations of these procedures. Though we focus on

disease mapping, we note that methods for ranking isolation measures may be broadly

useful in many other contexts, particularly sociological, for ranking political or racial

isolation, or ecological, for diversity studies.
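
The isolation measure described at the start of this discussion, the difference between the risk of a region and the corresponding mean over its neighbours, can be sketched in a few lines. This is an illustration under stated assumptions rather than the estimators developed in this chapter: the risk values and neighbour lists are made up, plug-in risk estimates are used in place of posterior quantities, and ties in the ranking are broken by region index.

```python
def isolation_measures(risk, neighbours):
    """Risk of each region minus the mean risk over its neighbours."""
    measures = []
    for i, nb in enumerate(neighbours):
        if not nb:
            raise ValueError(f"region {i} has no neighbours")
        measures.append(risk[i] - sum(risk[j] for j in nb) / len(nb))
    return measures


def ranks(values):
    """Ranks from 1 (smallest) to n (largest); ties broken by index order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


# Hypothetical map of five regions in a row; region 2 is an isolated elevation.
risk = [1.0, 1.1, 3.0, 0.9, 1.0]
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]

isolation = isolation_measures(risk, neighbours)
print(isolation)
print(ranks(isolation))
```

In this toy map the region with the highest isolation rank is the one whose risk is elevated relative to both of its neighbours, while its immediate neighbours receive the lowest ranks.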

In Section 2.2, we review the Bayesian hierarchical models commonly used for

analyzing disease incidence and mortality data. Section 2.3 discusses the ranking

methods considered, focusing on identifying regions associated with high risks which

are isolated and building upon Bayesian hierarchical models. Section 2.4 evaluates the

methods using the spatial distribution of lip cancer from Scotland where local hotspots

are artificially generated. Section 2.5 closes with a discussion and recommendations.


2.2 Bayesian Disease-Mapping Model

It is well known that Bayesian hierarchical models for disease mapping provide a trade-off

between bias and variance reduction of estimates, and are particularly helpful in

cases where the disease is rare. The variance reduction is achieved through borrowing

information from the neighboring regions to produce a more stable estimate of the risk

surface with estimated risks shrunk toward the overall mean risk, or some function of

this mean. Marshall (1991) reviews empirical Bayes and some early Bayesian methods

for disease mapping; Lawson et al. (2000) compare disease mapping models using

various goodness of fit criteria; Best et al. (2005) provide a comprehensive review of

recent developments in Bayesian disease mapping and compare models through

simulation studies; Richardson et al. (2004) conduct a comprehensive evaluation

designed to highlight the amount of smoothing of risk which occurs and the effects on

identifying global hotspots in a variety of settings. Our aim here is to evaluate various

ranking methods for risk estimators obtained from fitting Bayesian disease mapping

models. We focus on the basic spatial model described by Besag et al. (1991).

Let the area under study be divided into $n$ contiguous regions labeled $i = 1, \ldots, n$,

and let $y = (y_1, \ldots, y_n)^T$ be the observed, and $E = (E_1, \ldots, E_n)^T$ be the expected,

disease counts. Denote by $\theta = (\theta_1, \ldots, \theta_n)^T$ the underlying random

region-specific disease risks. The response variables, conditional on $\theta_i$, $i = 1, \ldots, n$,

are assumed independent and Poisson distributed: $y_i \mid \theta_i \sim \mathrm{Poisson}(\mu_i)$, $\mu_i = \theta_i E_i$. The

conditional log linear model (Besag et al., 1991) specifies

$$\log(\mu_i) = \alpha + \log(E_i) + \eta_i, \qquad \eta_i = b_i + h_i,$$

where $\alpha$ denotes the overall mean risk, while $\eta_i$ is decomposed into a spatially

correlated random error term $b_i$ and an uncorrelated error $h_i$. The spatially correlated


random effects, $b = (b_1, \ldots, b_n)^T$, are conveniently interpreted conditionally, as

$$b_i \mid b_{j \neq i} \sim N\left( \frac{\sum_{j \sim i} w_{ij} b_j}{\sum_{j \sim i} w_{ij}}, \; \frac{\sigma_b^2}{\sum_{j \sim i} w_{ij}} \right),$$

where $j \sim i$ indicates that region $j$ belongs to the neighbourhood of region $i$,

$i = 1, \ldots, n$. Neighborhoods define the scope of the conditional influence and may

be constructed in different ways depending on the context of the analysis. In our

application, we define regions which are contiguous in space with the $i$th region,

sharing a common boundary, as its neighborhood. The weights, $w_{ij} \geq 0$, $w_{ii} = 0$,

$i, j = 1, \ldots, n$, may be based on adjacency indicators for a lattice, or on a distance

measure between regions $i$ and $j$. Where the weights are based on adjacency indicators,

the joint distribution of random effects, $b$, is described as the intrinsic conditional

autoregressive model (Besag, 1974; Sun et al., 1999): $b \sim \mathrm{MVN}(0, \sigma_b^2 Q^{-1})$,

where $Q$ has $i$th diagonal element equal to the number of neighbors of the $i$th region

while for $i \neq j$, $Q_{ij} = -1$ if $i$ and $j$ are neighbors, and 0 otherwise. The vector of

random risks, $\theta$, accommodates extra variation through a white noise error vector

$h = (h_1, \ldots, h_n)^T \sim \mathrm{MVN}(0, \sigma_h^2 I)$, where $I$ is an identity matrix of dimension

$n$. By combining the independent and spatially correlated sources of random errors,

we obtain the convolution conditional autoregressive model for defining the distribution

of the risks $\theta_i$, as defined by Besag et al. (1991): $h + b \sim \mathrm{MVN}(0, \Sigma)$, where

$\Sigma = \sigma_h^2 I + \sigma_b^2 Q^{-1}$. The values of $\sigma_h^2$ and $\sigma_b^2$ give a sense of the contributions of

spatial and non-spatial components in explaining the variability in the map of risks.

Bayesian analysis requires the specification of prior distribution for the parameters.

We put diffuse prior on the intercept �. For the variance parameters (�2b , �2ℎ) of

the random effects (b,h), we let the square root be a noninformative uniform prior

density between 0 and 100 (Gelman, 2006).
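As a minimal sketch of the adjacency-based structure described above, the following Python fragment builds the matrix Q from a list of neighbours and evaluates the conditional mean and variance of one spatial effect given the others, taking all weights equal to 1. The four-region chain map, the function names and the numerical values are hypothetical and serve only as an illustration; they are not taken from the analysis.

```python
def icar_precision(neighbours):
    """Build Q for adjacency weights: Q[i][i] is the number of neighbours
    of region i, Q[i][j] = -1 if i and j are neighbours, 0 otherwise."""
    n = len(neighbours)
    Q = [[0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbours):
        Q[i][i] = len(nbrs)
        for j in nbrs:
            Q[i][j] = -1
    return Q


def car_conditional(i, b, neighbours, sigma2_b):
    """Conditional mean and variance of b[i] given the other effects,
    assuming adjacency weights w_ij = 1 for neighbours."""
    nbrs = neighbours[i]
    mean = sum(b[j] for j in nbrs) / len(nbrs)
    variance = sigma2_b / len(nbrs)
    return mean, variance


# Hypothetical map of four regions arranged in a chain: 0 - 1 - 2 - 3.
neighbours = [[1], [0, 2], [1, 3], [2]]
Q = icar_precision(neighbours)

# Hypothetical values of the spatial effects and of sigma_b^2.
b = [0.2, 0.0, -0.4, 0.1]
mean1, var1 = car_conditional(1, b, neighbours, sigma2_b=0.5)
```

Each row of Q sums to zero, so Q is singular; the sketch therefore works with Q and the conditional form only and does not attempt an inverse.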

In the Bayesian approach to disease mapping, inference on the relative risks is

based on the posterior distribution of the risks given the data. The use of Markov


chain Monte Carlo (MCMC) methods based on Gibbs sampling (Geman and Geman, 1984; Gelfand and Smith, 1990) yields easy implementation in the WinBUGS software package (Spiegelhalter et al., 2003), allowing for estimation of the posterior distribution of the relative risks. The R package R2WinBUGS (Sturtz et al., 2005) may be used to run WinBUGS from R and to return the results to R for additional analyses.

2.3 Ranking Methods

To estimate isolation, we propose to rank the difference between the rank or risk es-

timates of the region under consideration and the corresponding mean value from its

neighbours. We expect this to provide a useful mechanism for identifying areas with

emerging or unusually elevated risk, and hence for prioritizing public health investigations. We discuss ranking approaches from both (i) traditional perspectives, which use estimates based upon the SMR, and (ii) perspectives based on smoothing methods, which aim to give a general impression of trends over space while also providing more precise identification of isolated high risk areas.

Let $\mathbf{d} = (d_1, \ldots, d_n)^T$ be the vector of isolation measures, where $d_i$ is defined as the true difference between the relative risk of region $i$ and the mean relative risk of its neighbourhood,

$$d_i = \psi_i - \frac{1}{N_i} \sum_{j \sim i} \psi_j , \tag{2.1}$$

where $N_i$ denotes the number of neighbours of region $i$, $i = 1, \ldots, n$. Define the corresponding rank of $d_i$ as

$$\text{rank}(d_i) = R_i = \sum_{j=1}^{n} I\{d_i \leq d_j\} , \tag{2.2}$$

where $I\{A\}$ is the indicator function for event $A$. The smallest difference has rank $n$ and the largest has rank 1. The ranking methods considered are obtained by minimizing the following loss functions.
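The isolation measure (2.1) and its rank (2.2) can be sketched in a few lines of Python. The four-region chain map and the risk values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def isolation(risks, neighbours):
    """Isolation measure (2.1): risk of region i minus the mean risk
    of its neighbours."""
    return [
        risks[i] - sum(risks[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(risks))
    ]


def ranks(d):
    """Rank (2.2): R_i counts the regions j with d_i <= d_j, so the
    largest value has rank 1 and the smallest has rank n."""
    return [sum(1 for dj in d if di <= dj) for di in d]


# Hypothetical map of four regions in a chain: 0 - 1 - 2 - 3.
neighbours = [[1], [0, 2], [1, 3], [2]]
# Hypothetical true relative risks; region 1 is elevated above its neighbours.
risks = [1.0, 3.0, 1.2, 0.8]

d = isolation(risks, neighbours)
R = ranks(d)
```

Here region 1 has the largest isolation value and therefore rank 1, while region 0, whose only neighbour is the elevated region, has the smallest value and rank 4.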

2.3.1 Squared error loss function for the isolation measures

It is well known that the posterior mean minimizes the Bayes risk with respect to the squared-error loss (SEL) function (Berger, 1985). For example, the posterior mean, $E(\psi_i \mid \mathbf{y})$, is the optimal Bayes estimate obtained by minimizing the posterior expectation of the sum of squared error loss function $L(\boldsymbol{\psi}, \hat{\boldsymbol{\psi}}) = \sum_{i=1}^{n} (\hat{\psi}_i - \psi_i)^2 / n$ (Carlin and Louis, 1996). In our case, we rank the posterior mean of the isolation value, $E(d_i \mid \mathbf{y})$, which minimizes the posterior expectation of $L(\mathbf{d}, \hat{\mathbf{d}}) = \sum_{i=1}^{n} (\hat{d}_i - d_i)^2 / n$. The corresponding estimated ranks are denoted as PM.

2.3.2 Squared error loss function for the ranks of the isola-

tion measures

Laird and Louis (1989), Shen and Louis (1998) and Louis and Shen (1999) showed that if ranks of parameters are of interest, using a rank estimator directly is more appropriate than using the parameter estimator to obtain ranks. The posterior expected rank is obtained by minimizing the sum of squared error loss function of the ranks, $L(\mathbf{R}, \hat{\mathbf{R}}) = \sum_{i=1}^{n} (\hat{R}_i - R_i)^2 / n$. The estimated ranks, which are non-integer quantities, are

$$\bar{R}_i = E(R_i \mid \mathbf{y}) = \sum_{j=1}^{n} P(d_i \leq d_j \mid \mathbf{y}) , \tag{2.3}$$

and tend to be shrunk towards the mid-rank $(n + 1)/2$. Hence, we rank the posterior expected ranks in (2.3), as described below, and denote the corresponding estimated ranks, $\hat{R}_i = \text{rank}(\bar{R}_i)$, as PRANK. Lin et al. (2006) show that this estimator is also optimal under the weighted squared error loss of ranks, $\frac{1}{n} \sum_{i=1}^{n} w_i (\hat{R}_i - R_i)^2$, for any values of the weights $w_i$, $i = 1, \ldots, n$. Calculation of PRANK is easily implemented in the Bayesian context. Let $\boldsymbol{\psi}^{(r)} = (\psi_1^{(r)}, \ldots, \psi_n^{(r)})^T$ be a random draw of $\boldsymbol{\psi}$ from $p(\boldsymbol{\psi} \mid \mathbf{y})$; rank the isolation measures $d_i^{(r)} = \psi_i^{(r)} - \frac{1}{N_i} \sum_{j \sim i} \psi_j^{(r)}$, $i = 1, \ldots, n$, within each draw, and subsequently rank the average rank of $d_i^{(r)}$ over the MCMC iterations, $r = 1, \ldots, R$, to obtain the optimal rank based on (2.3).
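The PM and PRANK calculations can be sketched from posterior draws as follows. This is an illustration under stated assumptions rather than the code used for the thesis: the three "draws" are hypothetical numbers standing in for MCMC output, the map is a hypothetical four-region chain, and the final step assumes that the region with the smallest posterior expected rank is assigned estimated rank 1.

```python
def isolation(risks, neighbours):
    """Isolation measure (2.1) for one vector of risks."""
    return [
        risks[i] - sum(risks[j] for j in neighbours[i]) / len(neighbours[i])
        for i in range(len(risks))
    ]


def ranks(d):
    """Rank (2.2): the largest value has rank 1."""
    return [sum(1 for dj in d if di <= dj) for di in d]


def pm_and_prank(draws, neighbours):
    """Return the PM ranks, the posterior expected ranks (2.3) and PRANK."""
    n, n_draws = len(draws[0]), len(draws)
    d_draws = [isolation(risks, neighbours) for risks in draws]

    # PM: rank the posterior means of the isolation measures.
    post_mean_d = [sum(d[i] for d in d_draws) / n_draws for i in range(n)]
    pm = ranks(post_mean_d)

    # PRANK: rank within each draw, average the ranks over draws, then
    # order the averages (assumed: smallest average gets rank 1).
    rank_draws = [ranks(d) for d in d_draws]
    mean_rank = [sum(r[i] for r in rank_draws) / n_draws for i in range(n)]
    prank = [sum(1 for rj in mean_rank if rj <= ri) for ri in mean_rank]
    return pm, mean_rank, prank


# Hypothetical map (chain 0 - 1 - 2 - 3) and hypothetical posterior draws.
neighbours = [[1], [0, 2], [1, 3], [2]]
draws = [
    [1.0, 3.0, 1.2, 0.8],
    [0.9, 2.5, 1.5, 0.7],
    [1.1, 2.8, 1.0, 1.3],
]
pm, mean_rank, prank = pm_and_prank(draws, neighbours)
```

In this toy example PM and PRANK agree; the posterior expected ranks of regions 2 and 3 (8/3 and 7/3) show the shrinkage towards the mid-rank.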

Ranking methods described in Subsections 2.3.1 and 2.3.2 may be reasonable

choices when accurate ranking of all regions is of interest. In contrast, the methods

described in Subsections 2.3.3 and 2.3.4 focus on high risk areas.

2.3.3 Weighted rank squared error loss function

The posterior means are less variable than a typical draw from the posterior distribu-

tion (Louis, 1984). Therefore, high risks tend to be underestimated, while low risks

tend to be overestimated. Wright et al. (2003) introduce weighted rank squared error loss functions in a hierarchical setting for estimating extrema (hotspots) of parameters. In an exploratory approach, we adapt this method to be aligned with a focus

on identifying local isolated hotspots.

Let $(d_{(1)}, \ldots, d_{(n)})$ be the ordered vector of $\mathbf{d}$, $d_{(1)} > \cdots > d_{(n)}$, assuming no ties, so that $d_{(j)}$ is the element with rank $j$ in (2.2).

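To identify the most isolated hotspot, we consider the following loss function:

$$J(\mathbf{d}, \hat{\mathbf{d}}, \mathbf{c}) = \sum_{k=1}^{n} \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 = \sum_{k=1}^{n} c_{r(k)} (d_k - \hat{d}_k)^2 , \tag{2.4}$$

where $r(k) \equiv \{j : d_k = d_{(j)}\}$, $c_{r(k)} = \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\}$ and $\mathbf{c} = (c_1, \ldots, c_n)^T$ is the vector of weights for $\mathbf{d}$. The optimal Bayes estimator of $d_k$ is obtained by minimizing the conditional expectation of the $k$th element in (2.4),

$$E\{J_k(\mathbf{d}, \hat{d}_k, \mathbf{c}) \mid \mathbf{y}\} = \int \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 \, p(\mathbf{d} \mid \mathbf{y}) \, d\mathbf{d} , \tag{2.5}$$

which yields

$$\hat{d}_k = \frac{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, d_k \, p(\mathbf{d} \mid \mathbf{y}) \, d\mathbf{d}}{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, p(\mathbf{d} \mid \mathbf{y}) \, d\mathbf{d}} = \frac{\sum_{j=1}^{n} E(d_k \mid d_k = d_{(j)}, \mathbf{y}) \, c_j \, p(d_k = d_{(j)} \mid \mathbf{y})}{\sum_{j=1}^{n} c_j \, p(d_k = d_{(j)} \mid \mathbf{y})} . \tag{2.6}$$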

The estimate $\hat{d}_k$ is a weighted average of conditional posterior means of $d_k$, with the weight being $c_j$ multiplied by the posterior probability that $d_k$ has rank $j$. The corresponding estimated ranks are denoted as WRSEL. For identifying extreme risks, we follow the suggestion in Wright et al. (2003) and consider a weighting vector that rises sharply towards the most isolated ranks, with $c_i = \exp[\{(n + 1) - i\}/s]$ as the weight for rank $i$, $i = 1, \ldots, n$. We let WRSEL(a) denote the estimated ranks when $s = 2$, so that the weighting function puts large weight on highly isolated risks and almost 0 weight otherwise, and WRSEL(b) denote the estimated ranks when $s = 10$, so that the weight function declines less steeply as risks become less isolated. Figure 2.1 displays the weight vectors $\mathbf{c}$ for WRSEL(a) and WRSEL(b) when $n = 56$.
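The weight vectors and a Monte Carlo version of the estimator (2.6) can be sketched as below. The sketch assumes that posterior draws of the isolation measures are available and approximates (2.6) by the ratio of the draw averages of c_rank × d_k and of c_rank, which is one way of evaluating the two integrals; the function names and the two-region toy draws are hypothetical. The weights are scaled to a maximum of 1, which leaves the ratio in (2.6) unchanged.

```python
import math


def wrsel_weights(n, s):
    """Weights c_i = exp[{(n + 1) - i} / s] for ranks i = 1..n,
    scaled so that the largest weight equals 1."""
    c = [math.exp(((n + 1) - i) / s) for i in range(1, n + 1)]
    c_max = max(c)
    return [ci / c_max for ci in c]


def ranks(d):
    """Rank (2.2): the largest value has rank 1."""
    return [sum(1 for dj in d if di <= dj) for di in d]


def wrsel_estimate(d_draws, c):
    """Monte Carlo approximation of (2.6): weighted average of d_k over
    draws, each draw weighted by the weight of the rank d_k takes in it."""
    n = len(d_draws[0])
    num = [0.0] * n
    den = [0.0] * n
    for d in d_draws:
        r = ranks(d)
        for k in range(n):
            w = c[r[k] - 1]
            num[k] += w * d[k]
            den[k] += w
    return [num[k] / den[k] for k in range(n)]


c_a = wrsel_weights(56, 2)   # WRSEL(a): steep decline with rank
c_b = wrsel_weights(56, 10)  # WRSEL(b): slower decline with rank

# Hypothetical draws for two regions; the plain mean of region 0 is 1.0.
toy_draws = [[1.0, 0.0], [3.0, 0.0], [-1.0, 0.0]]
estimate = wrsel_estimate(toy_draws, wrsel_weights(2, 1))
```

In the toy example the draws in which region 0 has rank 1 receive more weight, so its estimate exceeds the unweighted posterior mean of 1.0, illustrating how the loss function pushes estimates of high isolation values upwards.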

Figure 2.1: Plot of the weight function c for WRSEL(a) and WRSEL(b). In this plot, the weight functions are scaled to have maximum value of 1. [Horizontal axis: rank, 0 to 50; vertical axis: c, 0 to 1.]

2.3.4 Misclassification rates of regions in the top 100γ% group

Lin et al. (2006) considered specific loss functions tailored for estimating extreme ranks. They recommended ranking the posterior probability that a region's rank is in the top $100\gamma\%$ of ranks, based on the rank-based misclassification loss function

$$L_{0|1}(\gamma, \mathbf{R}, \hat{\mathbf{R}}) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \text{FP}(\gamma, R_i, \hat{R}_i) + \text{FN}(\gamma, R_i, \hat{R}_i) \right\} , \tag{2.7}$$

where

$$\text{FP}(\gamma, R_i, \hat{R}_i) = I\{R_i > \gamma(n + 1), \; \hat{R}_i \leq \gamma(n + 1)\}; \qquad \text{FN}(\gamma, R_i, \hat{R}_i) = I\{R_i \leq \gamma(n + 1), \; \hat{R}_i > \gamma(n + 1)\} , \tag{2.8}$$

and FP (false positive) and FN (false negative) indicate the two possible types of misclassification.

Lin et al. (2006) show that the loss function (2.7) is minimized by ranking the posterior probabilities

$$P(R_i \leq \gamma(n + 1) \mid \mathbf{y}) , \tag{2.9}$$

which are based on the posterior distribution of $R_i$; this minimizes errors in classifying regions above or below a percentile threshold. The corresponding estimated ranks are denoted as PPR.
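A minimal sketch of the PPR calculation from posterior draws of the isolation measures is given below. The draws are hypothetical, gamma denotes the proportion defining the top group, and ties in the posterior probabilities are broken by region index, which is an assumption of this sketch and not something specified in the text.

```python
def ranks(d):
    """Rank (2.2): the largest value has rank 1."""
    return [sum(1 for dj in d if di <= dj) for di in d]


def ppr(d_draws, gamma):
    """Posterior probability (2.9) that each region's rank is within the
    top gamma * (n + 1) ranks, and the ranks of those probabilities."""
    n = len(d_draws[0])
    # Small tolerance guards against floating point error in gamma * (n + 1).
    cutoff = gamma * (n + 1) + 1e-9
    prob = [0.0] * n
    for d in d_draws:
        for i, r in enumerate(ranks(d)):
            if r <= cutoff:
                prob[i] += 1.0 / len(d_draws)
    # Largest probability gets estimated rank 1; ties broken by index.
    order = sorted(range(n), key=lambda i: -prob[i])
    est = [0] * n
    for position, i in enumerate(order, start=1):
        est[i] = position
    return prob, est


# Hypothetical posterior draws of the isolation measures for four regions.
d_draws = [
    [-2.0, 1.9, -0.7, -0.4],
    [-1.6, 1.3, -0.1, -0.8],
    [-1.7, 1.75, -1.05, 0.3],
]
prob, est = ppr(d_draws, gamma=2 / 5)  # top two ranks out of n = 4
```

Region 1 is in the top two ranks in every draw and region 3 in two of the three draws, so they receive estimated ranks 1 and 2.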

2.4 Comparison of Rank Estimators of Isolation

In an effort to understand how these ranking methods perform and how well they cap-

ture isolated hotspots when these are only modestly elevated, we consider hotspots


from regions with low expected counts, where the elevation in risk ranges from mod-

erate to large. We consider a single isolated hotspot, a small cluster of contiguous

hotspots, and, for comparison, several non-contiguous isolated hotspots.

In the investigations, the background relative risks are spatially correlated while an independent discrete random effect inflates the risks in the target regions. Specifically, counts were generated from a multinomial distribution

$$\mathbf{y} \sim \text{Multinomial}\left( \sum_{i=1}^{n} E_i, \; \left\{ \frac{E_i \psi_i}{\sum_{k=1}^{n} E_k \psi_k} \right\}_{i=1}^{n} \right), \qquad \psi_i = \exp(\alpha + b_i + \log \delta_i) , \tag{2.10}$$

where $\alpha$ is the overall mean rate over the map; $b_i$ denotes a spatially correlated random effect; $\delta_i = 1$ if region $i$ is not a hotspot, and $\delta_i = t$ otherwise, $t$ being the inflation factor. To accommodate sampling variability, each simulation scenario is replicated 500 times. Two MCMC chains were run for a total of 20,000 iterations, keeping every 10th, after a 10,000 iteration burn-in period. Brooks-Gelman-Rubin diagnostics (Brooks and Gelman, 1998), as well as graphical checks of the chains and their autocorrelations, were used to assess convergence. The distribution of the spatially correlated random effects, the expected disease counts and the neighbourhood structure mimic the fitted distribution from an initial analysis of the Scottish lip cancer data (see Breslow and Clayton, 1993, for example). The data comprise observed and expected counts of lip cancer cases during the period 1975-1980 over 56 Scottish counties. Table 2.1 summarizes the observed and expected counts for these data. The lip cancer data are known for exhibiting severe extra-Poisson variation (Clayton and Kaldor, 1987). Breslow and Clayton (1993) and others have found that a conditional Poisson model with spatially correlated CAR random effects provides a fair fit to these data. We use the estimated model parameters from such an analysis to define the background spatial pattern. Additionally, emerging isolated hotspots are generated as follows.


∙ Scenario I: A single region is considered as an emerging hotspot. Three candidates are considered, with expected counts of 1.8, 6 and 14.6, corresponding

approximately to the 10th, 50th and 90th percentiles of the expected counts,

respectively. Note that we choose the hotspot with low expected count (10th

percentile of the expected counts), such that it is surrounded by neighbours

with fairly high expected counts, one of which has expected count 50.7. In

this case, the neighbours may have substantial smoothing effects on the target

region under the CAR model.

∙ Scenario II: A group of three contiguous regions is considered as an isolated clus-

ter. Two cases are considered: (i) areas with low expected counts 3.3, 4.8 and

2.9; (ii) areas with high expected counts of 9.3, 14.6 and 88.7. When contigu-

ous regions are proposed as hotspots, di (2.1) is calculated by excluding target

hotspots from Ni. This mimics a hypothesis testing scenario where a specific

cluster is being tested. Note that the expected counts from the neighbours of

case (i) are fairly low, with mean expected counts about 7.5.

∙ Scenario III: A group of three non-contiguous regions is considered as an isolated

group of regions of higher risk. Two cases are considered: (i) areas with low

expected counts of 2.5, 3.3 and 3.6; (ii) areas with high expected counts of 10.1,

50.7 and 8.2. The expected counts of the neighbours for two of the isolated

hotspots in case (i) are fairly low, while the third hotspot has a neighbor with

the highest expected count over the map.
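The data-generating step in (2.10), with one or more regions inflated as in the scenarios above, can be sketched as follows. All inputs are hypothetical: the four expected counts are borrowed from values quoted in the scenarios only for illustration, the spatial effects are fixed numbers and not a draw from a fitted CAR model, and the multinomial total is rounded to an integer, which is an assumption of this sketch.

```python
import math
import random


def simulate_counts(E, b, alpha, hotspots, t, rng):
    """Draw counts from the multinomial model (2.10), inflating the risk
    of the regions listed in `hotspots` by the factor t."""
    n = len(E)
    delta = [t if i in hotspots else 1.0 for i in range(n)]
    risks = [math.exp(alpha + b[i] + math.log(delta[i])) for i in range(n)]
    weights = [E[i] * risks[i] for i in range(n)]
    # Assumption: the total sum of expected counts is rounded to an integer.
    total = round(sum(E))
    counts = [0] * n
    # Allocate each of the `total` cases to a region with probability
    # proportional to E_i * risk_i.
    for i in rng.choices(range(n), weights=weights, k=total):
        counts[i] += 1
    return counts, risks


rng = random.Random(1)              # fixed seed for reproducibility
E = [1.8, 6.0, 14.6, 50.7]          # hypothetical expected counts
b = [0.1, -0.2, 0.0, 0.3]           # hypothetical spatial effects
y, risks = simulate_counts(E, b, alpha=0.0, hotspots={0}, t=3.0, rng=rng)
```

Because the total is fixed, inflating one region's risk lowers the allocation probabilities of the others; the inflated region's relative risk is exactly t times what it would be without inflation.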

Note that estimated risks for the isolated hotspots which have moderate or high expected counts are less likely to be influenced by disease counts for their neighbours. In the simulation studies, the risks of the elevated regions are inflated to be sharply different from their neighbours and to be (i) not overly high (rank about 10th place), (ii) moderately high (rank about 3rd place) or (iii) high (rank about 1st place). The magnitudes of the inflations for Scenarios I, II and III are reflected in Figures 2.2, 2.3 and 2.4, respectively. The corresponding geographical locations of the isolated hotspots are shown in Figure 2.5 for Scenario I and in Figure 2.6 for Scenarios II and III. We also consider scaling the expected counts by a factor u = 1, 4 and 8 for all scenarios. The threshold $\gamma$ in (2.9) is set to $1/(n + 1)$ for Scenario I and $3/(n + 1)$ for Scenarios II and III. The simulated data are analyzed using the convolution CAR model of Besag et al. (1991) described above, together with the isolation measure (2.1).

Table 2.1: Scottish lip cancer data: summary statistics.

                     Minimum  First quartile  Median  Third quartile  Maximum
Observed count (y)      0.00            4.75    8.00           11.00    39.00
Expected count (E)      1.10            4.05    6.30           10.12    88.70
SMR (y/E)               0               0.49    1.11            2.24     6.43

To assess the accuracy of the proposed ranking methods for identification of isolated hotspots, we consider the root mean squared error of $\hat{R}_i$, for an isolated hotspot at site $i$, given by

$$\text{RMSE}(\hat{R}_i) = \left\{ \frac{1}{M} \sum_{m=1}^{M} \left( \hat{R}_i^{(m)} - R_i \right)^2 \right\}^{1/2} , \tag{2.11}$$

where $R_i$ is the true rank (2.2) and $\hat{R}_i^{(m)}$ is the estimated value based on the $m$th simulated dataset, $m = 1, \ldots, M$. For cases where clusters are considered (Scenarios II and III), we calculate the average RMSE for the hotspots.

We also evaluate the ranking methods based on the correct positive and false positive rates

$$\text{CP} = P(\hat{R}_i < \kappa \mid R_i < \kappa) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \kappa \mid R_i < \kappa\}; \qquad \text{FP} = P(\hat{R}_i < \kappa \mid R_i > \kappa) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \kappa \mid R_i > \kappa\} , \tag{2.12}$$

where $\kappa$ in (2.12) denotes the threshold defining high ranks, $\kappa = 2$ for Scenario I and 4 for Scenarios II and III.

Figure 2.2: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data. The panels in the top, middle and bottom rows correspond to Scenario I with the target region having low, moderate and high expected incidence count, respectively. The isolated hotspot, shown as a black dot, is inflated to about the 10th (column 1), 3rd (column 2) and 1st (column 3) places. The symbol + identifies neighboring regions of the isolated hotspot. [Each panel: horizontal axis true rank, vertical axis true relative risk; panels labeled 10th, 3rd and 1st.]

Figure 2.3: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data. The panels in the top and bottom rows correspond to an isolated cluster of three contiguous hotspots with low and high expected incidence count, respectively. The isolated hotspots, shown as black dots, are inflated to about the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the isolated hotspots. [Each panel: horizontal axis true rank, vertical axis true relative risk; panels labeled 10th, 3rd and 1st.]

Figure 2.4: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data. The panels in the top and bottom rows correspond to an isolated cluster of three non-contiguous hotspots with low and high expected incidence count, respectively. The isolated hotspots, shown as black dots, are inflated to about the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the isolated hotspots. [Each panel: horizontal axis true rank, vertical axis true relative risk; panels labeled 10th, 3rd and 1st.]

Figure 2.5: The panels display $d_i$ for Scenario I. The single isolated hotspots with low, moderate and high expected count are identified by the red circle in the 1st, 2nd and 3rd panels, respectively. [Panel titles: low E, moderate E, high E; map legend: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5.]

Figure 2.6: The top and bottom panels display $d_i$ for Scenarios II and III, respectively. The cluster of three contiguous isolated hotspots with low and high expected counts for simulation Scenario II are identified by the red circles in the left and right top panels, respectively; the cluster of three non-contiguous hotspots with low and high expected counts for simulation Scenario III are identified by the red circles in the left and right bottom panels, respectively. [Panel titles: low E, high E; map legend: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5.]
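The three evaluation criteria, RMSE in (2.11) and CP and FP in (2.12), can be sketched as below. The estimated ranks are hypothetical numbers standing in for the output of M = 4 simulated datasets. Pooling CP over all regions whose true rank is below the threshold, and FP over all regions whose true rank is above it, is an assumption of this sketch about how the per-region quantities are combined.

```python
import math


def rmse(est_ranks, true_rank):
    """RMSE (2.11) of the estimated ranks of one region over M datasets."""
    M = len(est_ranks)
    return math.sqrt(sum((r - true_rank) ** 2 for r in est_ranks) / M)


def cp_fp(est_rank_sets, true_ranks, threshold):
    """Correct positive and false positive rates (2.12).
    est_rank_sets[m][i] is the estimated rank of region i in dataset m."""
    M = len(est_rank_sets)
    pos = [i for i, r in enumerate(true_ranks) if r < threshold]
    neg = [i for i, r in enumerate(true_ranks) if r > threshold]
    cp = sum(est[i] < threshold for est in est_rank_sets for i in pos) / (M * len(pos))
    fp = sum(est[i] < threshold for est in est_rank_sets for i in neg) / (M * len(neg))
    return cp, fp


# Hypothetical true ranks for four regions; region 1 is the hotspot.
true_ranks = [4, 1, 3, 2]
# Hypothetical estimated ranks from M = 4 simulated datasets.
est_rank_sets = [
    [4, 1, 3, 2],
    [3, 2, 4, 1],
    [4, 1, 2, 3],
    [4, 2, 1, 3],
]

hotspot_rmse = rmse([est[1] for est in est_rank_sets], true_ranks[1])
cp, fp = cp_fp(est_rank_sets, true_ranks, threshold=2)
```

With a threshold of 2, only a region ranked first counts as flagged; the hotspot is ranked first in two of the four datasets, and one of the eight (dataset, non-hotspot) pairs with true rank above 2 is wrongly ranked first.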

Table 2.2 displays RMSE, CP and FP for all the ranking methods evaluated here

for Scenario I for the case where the hotspot is associated with a low expected count

surrounded by neighbours with high expected values. It is not surprising that SMR

performs better in this case, as the CAR model pools information from the neighbours

to produce an estimate for the target region; therefore, the risk estimate for this

isolated hotspot tends to be smoothed under the CAR model. In contrast, for the

case of an isolated hotspot with moderate or large expected count, as shown in Tables

2.3 and 2.4, PRANK outperforms SMR. The gains of using PRANK are substantial

when the expected incidence count for the emerging hotspot is large. For example,

in Table 2.4, when the isolated hotspot is in the 10th place, CP is about 71.2% while

FP is about 0.5% for PRANK, yielding a performance which is far superior to the

other ranking methods. In general, WRSEL(a), WRSEL(b) and PPR perform less well. The WRSEL loss function tends to inflate the point estimates of the high risks; because their weights are low, inaccuracies in the point estimates of the other regions, with low isolation measures, are relatively unimportant. WRSEL therefore does not provide precise estimates of all the risks, and this may make it unsuitable for ranking purposes (ranking requires good estimates over the whole map). Our empirical evaluation of PPR over a sequence of values of the threshold $\gamma$ (not shown here) suggests that the performance of this estimator in terms of RMSE, CP and FP is influenced by $\gamma$, especially when the expected disease counts are low for the isolated hotspots. For


all the ranking methods, RMSE decreases, CP increases and FP decreases when the

emerging hotspots are gradually elevated above the whole surface, and when the

expected incidence counts for all the regions are inflated. These findings apply also

to the cases where the isolated hotspots are a cluster of three contiguous regions (see

Tables 2.5 and 2.6) and also where the isolated hotspots are three non-contiguous

regions (see Tables 2.7 and 2.8). It is also interesting to note that, in contrast to

Scenario I, for Scenario II, where three contiguous regions with low expected counts

are inflated as a cluster of hotspots, PRANK is superior to SMR, as the CAR model

has less of a smoothing effect on these isolated hotspots.

2.5 Summary

In this study, we focus on developing and evaluating rank estimators for disease map-

ping for the identification of emerging isolated hotspots. To determine the magnitude

of elevation of the hotspots relative to their neighbours, we developed an isolation

measure, the difference of risks or their rank estimators for the emerging high risk

regions and their neighbours. In summary, we note that though the CAR model

provides a smoothed risk surface, the estimates for PRANK or PM based on this

model perform reasonably well in detecting the emerging isolated hotspots. Simula-

tion studies show that gains of using PRANK may be substantial compared to other

ranking methods considered, especially when the disease is rare and the high risk area

is not yet a global outlier. The research has adopted the widely used CAR model.

Rank estimators based on other models may yield different results on identification of

isolated hotspots. The isolation measure developed here depends on the definition of

the neighborhood structure. The performance of the isolation measure may depend

on the distribution of the number of neighbours; hence the development of methods

which account for the number of neighbours may be useful.


In addition, in comparison to the classical scan statistic, we expect that the rank-

ing methods based on the spatial model may have lower false positive rates for identify-

ing isolated hotspots, since the classical scan statistic is very sensitive to the violation

of the assumption of spatial independence, detecting clusters at the 5% level much

more often than 5% of the time when spatially correlated data are simulated (Loh

and Zhu, 2007). It would be useful to compare the use of the scan statistic to our

ranking methods through simulation studies when no isolated hotspots exist.


Table 2.2: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with LOW expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                          10th                    3rd                     1st
                   RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
u = 1  SMR         8.988  0.262  0.013     4.211  0.542  0.008     1.842  0.706  0.005
       PM         11.924  0.014  0.018     6.991  0.098  0.016     4.631  0.212  0.014
       WRSEL(a)   14.479  0.004  0.018    10.173  0.064  0.017     7.689  0.132  0.016
       WRSEL(b)   15.673  0.006  0.018    10.304  0.074  0.017     7.542  0.160  0.015
       PPR        21.577  0.008  0.018    13.456  0.088  0.017     9.316  0.162  0.015
       PRANK      11.643  0.024  0.018     6.555  0.150  0.015     4.103  0.278  0.013




Table 2.3: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with MODERATE expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Table 2.4: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with HIGH expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Table 2.5: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three contiguous hotspots with LOW expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Table 2.6: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three contiguous hotspots with HIGH expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Table 2.7: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with LOW expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

Table 2.8: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with HIGH expected disease counts, whose risks were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.





Chapter 3

Joint Analysis of Multivariate

Spatial Count and Zero-Heavy

Count Outcomes

3.1 Introduction

In public health, environmental and ecological studies, variables measured at the same

spatial locations may be correlated so that the spatial structures of such variables

across the region under consideration are very similar, indicating that they may be

characterized by a common spatial risk surface. Employing such a commonality in

risks may be useful for gaining precision of local area risk estimates, especially for

rare diseases.

Shared component spatial models have been studied in a variety of applied con-

texts. Knorr-Held and Best (2001) proposed a shared-component model which mim-

ics an ecological regression on the unobserved shared component. The two diseases


considered in that application share a common spatial structure and, as well, sup-

port disease-specific spatially uncorrelated random errors. Fitting the model requires

strong prior assumptions of the random spatial and uncorrelated errors, typically be-

cause of challenges arising related to identifiability of the latent spatial fields. Wang

and Wall (2003) proposed a common spatial factor model to study multivariate indi-

cators of cancer risk across counties in Minnesota. To avoid identifiability issues, the

model includes the common spatial structure term but no excess heterogeneity and,

as well, the variance of the shared spatially correlated random effect is considered

as fixed. Hogan and Tchernis (2004) proposed a common factor model for spatial

multivariate count data with constraints imposed on the variance structure of the

conditional autoregressive model they employ. Congdon (2006) set out a modeling

framework for modeling multiple health outcomes over area, age, and time dimensions

that takes account of spatial correlation as well as interactions between dimensions.

Tzala and Best (2006) proposed a Bayesian latent variable model for cancer mor-

tality data, which linked spatial effects. As well, other joint modeling approaches

for multivariate spatial data have been proposed including the multivariate version

of the conditional autoregressive model (MVCAR) (Gelfand and Vounatsou, 2003),

which assumes the spatial structure is the same across the multivariate outcomes.

Such modelling allows for the pooling of information across spatial units as well as

across multiple outcomes within units. In contrast, the common spatial factor model

may stratify the spatial variation into two components: the shared component and

outcome-specific components. Such a modeling approach permits a simple analysis

of which spatial term dominates as well as an identification of the common spatial

structure. Though testing for a common spatial structure is quite relevant in certain

studies, there has been very little discussion of the power of such tests. We consider

this in the context of the analysis of count data and also examine the utility of joint

modeling in terms of gains in the efficiency of estimating relative risks.


In environmental and ecological studies, count data are often characterized by an excess of zeros and spatial dependence (Clarke and Green, 1988; Welsh et al., 1996; Martin et al., 2005). When studying the abundance of species in ecological studies, having a large proportion of zero counts may indicate the habitat is unsuitable in certain

areas, for example. In such cases, standard distributions such as Poisson, binomial

and negative-binomial may fail to provide an adequate fit. A class of distributions

for such data is defined as zero-inflated distributions (Lambert, 1992).

For handling zero-inflation, mixture models and conditional models are

two common approaches within the context of ecological and health studies. The

well-known zero-inflated Poisson (ZIP) model (Lambert, 1992) is a mixture of a de-

generate zero mass and a Poisson distribution. On the other hand, Welsh et al. (1996)

formulate a two-component conditional model where the presence/absence of counts

is modeled with a binomial distribution and the abundance at active sites is mod-

eled using a truncated Poisson or truncated negative binomial distribution. These

two models have different interpretations. Structural zeros and random zeros are not

distinguished under the conditional specification, whereas the mixture model permits

an examination of the different sources of error (Kuhnert et al., 2005). For more

discussion of zero-inflated models from a Bayesian perspective see Angers and Biswas

(2003) and Ainsworth (2007).
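
To make the distinction between the two specifications concrete, the following sketch writes out the probability mass functions of a zero-inflated Poisson mixture and of a two-component conditional model with a zero-truncated Poisson for positive counts. The function and parameter names (`zip_pmf`, `hurdle_pmf`, `pi`, `p_present`, `lam`) are our own illustrative choices, and covariates and spatial effects are omitted.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of observing count y with mean lam."""
    return math.exp(-lam) * lam**y / math.factorial(y)


def zip_pmf(y, pi, lam):
    """Mixture model: point mass at zero with weight pi, Poisson(lam) with weight 1 - pi.

    Zeros arise from both components (structural and random zeros).
    """
    base = (1.0 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base


def hurdle_pmf(y, p_present, lam):
    """Conditional model: presence with probability p_present, then a
    zero-truncated Poisson(lam) for the positive counts.

    All zeros come from the absence component only.
    """
    if y == 0:
        return 1.0 - p_present
    return p_present * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


# Illustrative parameter values (assumptions, not estimates).
print(zip_pmf(0, pi=0.3, lam=2.0), hurdle_pmf(0, p_present=0.6, lam=2.0))
```

In the mixture form the probability of a zero is `pi + (1 - pi) * exp(-lam)`, so zeros have two sources, whereas in the conditional form it is simply `1 - p_present`.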

In many applications, zero-inflated count data are spatially correlated. Rathbun

and Fei (2006) introduced a zero-inflated Poisson model, in which the component

modeling the excess zeros is governed by a hidden spatial probit model; a threshold,

defining large probabilities in the probit layer, governs the proportion of zeros. Agarwal

et al. (2002) also proposed a zero-inflated model for spatial count data using a

mixture model approach and incorporating spatial random errors into either or both

of the model components. With multivariate zero-inflated count data corresponding


to several related spatial outcomes, there is also the possibility of linking model com-

ponents across the various outcomes using a shared latent spatial structure. This

would be relevant, for example, if the underlying, hidden mechanisms resulting in the

structural zeros, or the abundance of counts, are related across the outcomes.
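
The idea of linking outcomes through a shared latent spatial structure can be sketched by simulation. The example below is only an illustration under stated assumptions: a hypothetical 10 x 10 lattice with rook neighbours, a proper conditional autoregressive prior with an assumed dependence parameter `rho`, assumed expected counts `E`, and assumed strengths `loadings` with which each outcome depends on the shared effect. It is not the fitting procedure used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 10 x 10 lattice; W is the rook-neighbour adjacency matrix.
side = 10
n = side * side
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        k = r * side + c
        if r + 1 < side:
            W[k, k + side] = W[k + side, k] = 1.0
        if c + 1 < side:
            W[k, k + 1] = W[k + 1, k] = 1.0

# Proper CAR precision matrix Q = D - rho * W; rho < 1 keeps Q positive definite.
rho = 0.95
Q = np.diag(W.sum(axis=1)) - rho * W

# Draw the shared spatial effect b ~ N(0, Q^{-1}) via the Cholesky factor Q = L L^T.
L = np.linalg.cholesky(Q)
b = np.linalg.solve(L.T, rng.standard_normal(n))

# Two outcomes share b; each also has its own unstructured noise term.
E = np.full(n, 20.0)      # assumed expected counts
loadings = [1.0, 0.7]     # assumed strengths of the shared term
y = [rng.poisson(E * np.exp(g * b + rng.normal(0.0, 0.1, n))) for g in loadings]

print(np.corrcoef(y[0], y[1])[0, 1])
```

Because both simulated outcomes depend on the same latent field `b`, their counts are positively correlated across regions, which is the feature that joint modeling seeks to exploit.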

The methods developed in this chapter for joint outcome analysis of spatial count

and zero-heavy count data focus on the use of shared latent spatial frailty models.

We discuss such joint mapping models and evaluate what benefits may be achieved

through joint modeling. The rest of the chapter is structured as follows. Section

3.2 describes a general modeling framework for common spatial factor models for

count data and zero-inflated count data. Section 3.3 presents two motivating appli-

cations, applying the common spatial factor model to Ontario lung cancer data and

zero-inflated forestry infection data related to a study of Comandra blister rust on

lodgepole pine trees. Section 3.4 examines hypothesis testing of whether two spatial

maps share the same underlying spatial structure for count data. A power study is

performed based on the situational context of the Ontario lung cancer data. Section

3.5 compares joint and separate modeling in terms of accuracy and efficiency of es-

timating relative risks through simulation investigations. Some closing remarks are

provided in Section 3.6.

3.2 Models for Joint Count Outcomes

We present here a general modeling framework for the common spatial factor model

for joint modeling of count data and zero-inflated count data. In disease mapping,

the typical response is a rate (both in health and forest epidemiology), hence the

focus on the analysis of counts herein. However, generalization of the model to other

non-normal data is straightforward.


3.2.1 Common Spatial Factor Model for Counts

Let $y_{ij} \mid \mu_{ij} \sim \text{Poisson}(\mu_{ij})$ for region $i = 1, \ldots, n$ and outcome $j = 1, \ldots, J$, where

$y_{ij}$ denotes the response and $\mu_{ij}$ denotes the expected mean count for outcome $j$ in

region $i$. The common spatial factor model can be written as:

$$\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + \gamma_j b_i + h_{ij}, \tag{3.1}$$

where $\alpha_j$ denotes the overall mean rate for the $j$th outcome and $E_{ij}$ is the expected

number of disease counts in region $i$ for the $j$th outcome based on some standardized

rates; $b_i$, $i = 1, \ldots, n$, is the spatial random effect, assumed here to follow a conditional

autoregressive distribution (Besag, 1974) to account for the spatially structured correlation
