models and methods for spatial data: applications in...


  • MODELS AND METHODS FOR SPATIAL DATA:

    APPLICATIONS IN EPIDEMIOLOGICAL,

    ENVIRONMENTAL AND ECOLOGICAL STUDIES

    by

    Cindy Xin Feng

    M.Sc. (Statistics), Simon Fraser University, 2006

    B.Sc. (Applied Mathematics), Beijing University of Technology, 2003

    a Thesis submitted in partial fulfillment

    of the requirements for the degree of

    Doctor of Philosophy

    in the Department of

    Statistics and Actuarial Science

    © Cindy Xin Feng 2011

    SIMON FRASER UNIVERSITY

    Summer 2011

    All rights reserved. However, in accordance with the Copyright Act of

    Canada, this work may be reproduced without authorization under the

    conditions for Fair Dealing. Therefore, limited reproduction of this

    work for the purposes of private study, research, criticism, review and

    news reporting is likely to be in accordance with the law, particularly

    if cited appropriately.

  • APPROVAL

    Name: Cindy Xin Feng

    Degree: Doctor of Philosophy

    Title of Thesis: Models and Methods for Spatial Data: Applications in Epidemiological, Environmental and Ecological Studies

    Examining Committee: Dr. Rick Routledge

    Chair

    Dr. Charmaine Dean, Senior Supervisor

    Dr. Jiguo Cao, Supervisor

    Dr. Yi Lu, Supervisor

    Dr. Paramjit Gill, Internal External Examiner

    Dr. Patrick Brown, External Examiner,

    University of Toronto

    Date Approved: August 24, 2011

  • Partial Copyright Licence

  • Abstract

    This thesis develops new methodologies for applied problems using smoothing techniques for spatial or spatial-temporal data. We investigate Bayesian ranking methods for identifying high-risk areas in disease mapping, assessing these particularly with regard to their performance in isolating emerging unusual and extreme risks in small areas. We build on information obtained through mapping multivariate outcomes by developing models which investigate whether the multivariate spatial outcomes share the same underlying spatial structure. We develop a general framework for joint modeling of multivariate spatial outcomes for count and zero-inflated count data using a common spatial factor model.

    We also study spatial exposure measures, motivated by an analysis of Comandra blister rust infection on lodgepole pine trees from British Columbia. We contrast nearest distance with other, more general, exposure measures and consider the impact of misspecification of exposure measures in a semiparametric generalized additive modeling framework that includes a spatial residual term modeled as a thin-plate regression spline. An appealing feature of the new spatial exposure measures considered is that they can be easily adapted to other problems, such as investigation of the association of asthma incidence with traffic exposures. A common theme in the thesis is the use of functional data analysis, and we specifically adapt such methods for assessing spatial and temporal variation of cadmium concentration in Pacific oysters from British Columbia.

    The methodologies developed in these projects widen the toolbox for spatial analysis in applications in epidemiology, and in environmental and ecological studies.

  • Acknowledgments

    I am deeply indebted to my senior supervisor Dr. Charmaine Dean for her guidance

    and support in countless ways. Without her enlightening instruction, great kindness

    and patience, I could not have completed my thesis. Her support and encouragement

    were very helpful to me through some very difficult times in my life. I also want

    to extend my gratitude to my examining committee members, Dr. Rick Routledge,

    Dr. Jiguo Cao, Dr. Yi Lu, Dr. Paramjit Gill and Dr. Patrick Brown for all their

    careful reviewing and insightful comments. Their detailed reviews and constructive

    comments greatly improved the thesis.

    Many thanks to the faculty and staff of the Department of Statistics and Actuarial

    Science of Simon Fraser University for providing me a wonderful environment for

    graduate studies. In particular, I would also like to thank Dr. Derek Bingham, Dr.

    Boxin Tang, Dr. Richard Lockhart, Dr. Tim Swartz, Dr. Leilei Zeng, Dr. Joan Hu

    and Mr. Ian Bercovitz for their support, and Sadika, Kelly and Charlene for their constant help. Thank you also to my fellow graduate students for their company and for growing together with me during my graduate studies.

    Finally, and most importantly, I would like to thank my family; I would not have been able to go this far without their care and encouragement.

  • Contents

    Approval
    Abstract
    Acknowledgments
    Contents

    1 Introduction
        1.1 Overview
        1.2 Disease Mapping
            1.2.1 Conditional Autoregressive Priors
        1.3 Thin-Plate Splines
        1.4 Outline of Thesis
            1.4.1 Chapter 2
            1.4.2 Chapter 3
            1.4.3 Chapter 4
            1.4.4 Chapter 5
            1.4.5 Chapter 6

    2 Bayesian Ranking Methods for the Detection of Isolated Hotspots in Disease Mapping
        2.1 Introduction
        2.2 Bayesian Disease-Mapping Model
        2.3 Ranking Methods
            2.3.1 Squared error loss function for the isolation measures
            2.3.2 Squared error loss function for the ranks of the isolation measures
            2.3.3 Weighted rank squared error loss function
            2.3.4 Misclassification rates of regions in the top $100\gamma\%$ group
        2.4 Comparison of Rank Estimators of Isolation
        2.5 Summary

    3 Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes
        3.1 Introduction
        3.2 Models for Joint Count Outcomes
            3.2.1 Common Spatial Factor Model for Counts
            3.2.2 Common Spatial Factor Model for Zero Heavy Counts
            3.2.3 Model Assessment and Comparison
        3.3 Applications
            3.3.1 Ontario Lung Cancer
            3.3.2 Comandra Blister Rust Tree Infection
        3.4 Power of the Test for Common Spatial Structure
        3.5 Precision Gains Through Joint Outcome Modeling
        3.6 Summary and Concluding Remarks

    4 Impact of Misspecifying Spatial Exposures
        4.1 Introduction
        4.2 Comandra Blister Rust Study
        4.3 Flexible Smooth Models
        4.4 Comparison of Exposure Measures for CBR Infection
        4.5 Assessing the Effect of Misspecification of Spatial Exposure Measures
        4.6 Summary

    5 Exploring Spatial and Temporal Variations of Cadmium Concentrations in Pacific Oysters from British Columbia
        5.1 Introduction
            5.1.1 The Motivating Datasets
            5.1.2 Statistical Methods
        5.2 Methodology
            5.2.1 Spline Smoothing
            5.2.2 Monotone Spline Smoothing
            5.2.3 Functional Principal Component Analysis
            5.2.4 Semi-Parametric Additive Model
        5.3 Results
            5.3.1 Data Representation
            5.3.2 Spatial Variability
            5.3.3 The Semi-Parametric Additive Model
        5.4 Concluding Remarks

    6 Future Work
        6.1 Spatial-temporal Modeling for Multivariate Spatial Outcomes
        6.2 Spatial Modeling for Infectious Disease
        6.3 Curve Clustering

    Bibliography

    A Appendix for Chapter 3
    B Appendix for Chapter 4
    C Appendix for Chapter 5

  • Chapter 1

    Introduction

    1.1 Overview

    In recent years, there has been considerable interest in the development and applica-

    tion of spatial models and methods for the analysis of spatially correlated data, which

    are often geographically referenced, temporally correlated or highly multivariate. For

    example, a motivating dataset considers the analysis of lung cancer for males and

    females by local health unit in Ontario. The key idea throughout the approaches

    considered is to take advantage of the correlation structure among observations to

    perform estimation, prediction, hypothesis testing and other statistical procedures.

    We begin with a review of some important concepts which form the building blocks

    of the methods and models developed in later chapters. This is followed by an outline

    of the material presented in each of the chapters of the thesis.


  • CHAPTER 1. INTRODUCTION 2

    1.2 Disease Mapping

    Mapping of disease incidence and mortality rates is of primary importance in many epidemiological studies. The use of crude rates to estimate rare disease risks in small

    areas such as health units, census areas or administrative zones, is problematic since it

    does not account for the high variability of population sizes over the different regions,

    nor the spatial patterns of the regions under study. Because of this, interpretation of

    the spatial distribution of disease based on crude estimates is often misleading. Al-

    ternatively, Bayesian inference is widely used to produce stabilized risk maps through

    borrowing information from neighborhoods across the map. Early developments of

    disease mapping methodology included the use of empirical Bayes (EB) techniques

    (Manton et al., 1989; Marshall, 1991; Dean and MacNab, 2001; Breslow and Clay-

    ton, 1993) to estimate parameters, and a plug-in approximation of these for posterior

    inference, which yielded unbiased estimates of the relative risks. However, the variance of these estimates was underestimated, since the EB approach does not account

    for the uncertainty arising from estimating hyperparameters. In recent years, (fully)

    Bayesian (FB) approaches have gained prominence. Inference is based on Markov

    chain Monte Carlo (MCMC) algorithms (Besag et al., 1991; Bernardinelli and Montomoli, 1991; MacNab et al., 2004; Congdon, 2006). Interval estimates of relative risks based on posterior distributions account for the uncertainty associated with the estimates through the hyperprior specifications. Bayesian methods for disease mapping are often termed hierarchical spatial modeling. The first level of the hierarchy describes the distribution of the data; the second level introduces spatial dependence through random effects, which account for heterogeneity in the risks; and the lowest level specifies the distribution of the hyperparameters.
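    The instability of crude rates in small areas can be illustrated with a minimal sketch. It assumes the crude estimate is the standardized mortality ratio, observed over expected counts, and uses the plug-in Poisson standard error sqrt(y)/E; the function name and the two regions are hypothetical and are not taken from the thesis.

```python
import numpy as np

def smr_summary(observed, expected):
    """Return the crude SMR y/E and its plug-in Poisson standard error sqrt(y)/E."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    smr = observed / expected
    se = np.sqrt(observed) / expected
    return smr, se

# Hypothetical regions: the same crude SMR, but very different expected counts.
smr, se = smr_summary(observed=[2, 200], expected=[1.0, 100.0])
print(smr)  # both regions have the same crude estimate
print(se)   # the low-population region is far less precise
```

    Both regions share the same crude estimate, yet the standard error in the small region is ten times larger, which is the variability that the smoothing models described above are meant to stabilize.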


    1.2.1 Conditional Autoregressive Priors

    One of the most popular choices for the distribution of the random effects in hi-

    erarchical spatial modeling is the intrinsic conditional autoregressive (CAR) model

    (Besag et al., 1991). Let $W = (w_{ij})$ denote the so-called spatial proximity matrix, $i = 1, \ldots, n$ and $j = 1, \ldots, n$ for $n$ regions, where $w_{ii} = 0$ and $w_{ij} = 1$ if the $i$th and the $j$th areas are neighbours (denoted $j \sim i$), and 0 otherwise. The conditional expectation and variance are

    $$E(b_i \mid b_{j \neq i}) = \frac{1}{w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \tag{1.1}$$

    where $b_{j \neq i}$ denotes the random effects for all regions other than $i$, and $b = (b_1, \ldots, b_n)$ has joint distribution

    $$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - W)^{-1}, \tag{1.2}$$

    where $D = \mathrm{diag}(w_{1+}, \ldots, w_{n+})$ and $w_{i+} = \sum_j w_{ij}$. The forms (1.1) and (1.2) define the

    intrinsic CAR (Besag et al., 1991) uniquely. With this model, local smoothing can be achieved, as $E(b_i \mid b_{j \neq i})$ is the local risk average over the neighborhood of region $i$ and $\mathrm{Var}(b_i \mid b_{j \neq i})$ is scaled by the inverse of the number of neighbors, so that the greater the number of neighbors the smaller the variance. However, the intrinsic CAR prior is improper, since the matrix $(D - W)$ is singular. This impropriety can be remedied by enforcing a constraint such as $\sum_{i=1}^{n} b_i = 0$, which can be implemented numerically at each iteration of an MCMC algorithm used for model fitting. Alternatively, the so-called proper CAR model may be used; this model incorporates an additional parameter $\rho$, so that the full conditionals are

    $$E(b_i \mid b_{j \neq i}) = \frac{\rho}{w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{w_{i+}}, \qquad \rho \in (0, 1), \tag{1.3}$$

    leading to the unique joint distribution

    $$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 (D - \rho W)^{-1}, \tag{1.4}$$

    so that the matrix $(D - \rho W)$, which is proportional to the precision matrix $\Sigma^{-1}$, is non-singular.
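    The contrast between the intrinsic and proper CAR priors can be checked numerically. The following is a minimal sketch on a hypothetical map of four regions arranged in a line; the adjacency matrix and the value 0.9 for the additional proper CAR parameter (called `rho` in the code) are illustrative assumptions, not values from the thesis.

```python
import numpy as np

# Hypothetical 4-region map arranged in a line: 1-2-3-4.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))        # D = diag(w_1+, ..., w_n+)

Q_intrinsic = D - W               # intrinsic CAR: every row sums to zero
rho = 0.9                         # illustrative value in (0, 1)
Q_proper = D - rho * W            # proper CAR

rank_intrinsic = np.linalg.matrix_rank(Q_intrinsic)
min_eig_proper = np.linalg.eigvalsh(Q_proper).min()

print(rank_intrinsic)             # less than 4, so D - W is singular
print(min_eig_proper > 0)         # D - rho W is positive definite
```

    Because each row of D - W sums to zero, the vector of ones lies in its null space, which is why a sum-to-zero constraint on the random effects is a natural way to remove the impropriety.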

    Alternatively, Leroux et al. (1999) proposed a CAR model defining the full conditionals as

    $$E(b_i \mid b_{j \neq i}) = \frac{\lambda}{1 - \lambda + \lambda w_{i+}} \sum_{j \sim i} b_j, \qquad \mathrm{Var}(b_i \mid b_{j \neq i}) = \frac{\sigma_b^2}{1 - \lambda + \lambda w_{i+}}, \qquad \lambda \in (0, 1), \tag{1.5}$$

    leading to the unique joint distribution

    $$b \sim \mathrm{MVN}(0, \Sigma), \qquad \Sigma = \sigma_b^2 \{\lambda (D - W) + (1 - \lambda) I\}^{-1}, \tag{1.6}$$

    where $\lambda$ is a weighting parameter which weights the contributions from the spatially correlated effect, modeled as an intrinsic CAR, and the independent random noise term, modeled with an independent normal distribution.
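    The link between the full conditionals and the joint distribution of the Leroux et al. (1999) model can be verified with a short sketch. It uses the standard Gaussian identity that, for a zero-mean multivariate normal with precision matrix Q, the conditional variance of component i is 1/Q_ii and the conditional mean is minus that variance times the sum of Q_ij b_j over j other than i. The map, parameter values and function names are hypothetical.

```python
import numpy as np

def leroux_precision(W, lam, sigma2_b=1.0):
    """Precision matrix {lam (D - W) + (1 - lam) I} / sigma2_b."""
    D = np.diag(W.sum(axis=1))
    n = W.shape[0]
    return (lam * (D - W) + (1.0 - lam) * np.eye(n)) / sigma2_b

def full_conditional(Q, b, i):
    """Mean and variance of b_i given the rest, for b ~ MVN(0, Q^{-1})."""
    var = 1.0 / Q[i, i]
    others = np.arange(len(b)) != i
    mean = -var * (Q[i, others] @ b[others])
    return mean, var

# Hypothetical 4-region line map and illustrative parameter values.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
lam, sigma2_b = 0.7, 2.0
b = np.array([0.3, -0.1, 0.5, 0.2])

Q = leroux_precision(W, lam, sigma2_b)
i = 1
mean, var = full_conditional(Q, b, i)

# Closed-form conditionals in terms of the number of neighbours w_i+.
w_plus = W[i].sum()
closed_mean = lam / (1 - lam + lam * w_plus) * (W[i] @ b)
closed_var = sigma2_b / (1 - lam + lam * w_plus)
print(mean, closed_mean)
print(var, closed_var)
```

    The two routes agree, and setting `lam` close to 1 moves the conditionals toward those of the intrinsic CAR, while `lam` close to 0 moves them toward independent normal noise.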

    For point-referenced data, geostatistical models (Cressie, 1993) are often used; these directly specify the covariance matrix based on the distances between the spatial sites. For example, the correlation between two spatial sites may decay exponentially with distance. CAR models, in contrast, are specified based on the adjacency structure among the spatial units, and can be used for either point-referenced data or lattice data. In addition, inference for geostatistical models usually requires inversion of covariance matrices at each MCMC iteration; CAR models, which specify the precision matrix directly, are therefore computationally more efficient than geostatistical models.

    1.3 Thin-Plate Splines

    Thin-plate splines (Duchon, 1977) offer a very elegant approach for estimating a

    smooth function of multiple predictor variables. The following provides a concise

    introduction to thin-plate splines. For a more detailed description, see Duchon (1977), Meinguet (1979), Green and Silverman (1994) and Wood (2004, 2006).


    Suppose the response $y_i$, $i = 1, \ldots, n$, is modeled as a smooth function of covariates $x_i$ such that

    $$y_i = f(x_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1.7}$$

    where $f$ is an unknown function on a fixed domain $D \subset \mathbb{R}^d$, $\varepsilon_i$ is a random error term, and $x_i \in D$ are fixed values for the covariates.

    Thin-plate spline smoothing estimates $f$ by finding the function $\hat{f}$ which minimizes the penalized sum of squares

    $$\frac{1}{n} \sum_{i=1}^{n} w_i \{y_i - f(x_i)\}^2 + \lambda J_m(f), \tag{1.8}$$

    where $w_i$, $i = 1, 2, \ldots, n$, are some fixed constants; $J_m(f)$ is a penalty function measuring the non-smoothness or so-called ‘wiggliness’ of $f$; and $\lambda$ is the smoothing parameter, which controls the tradeoff between $f$ fitting the data precisely and smoothness of $f$. The penalty term is defined as

    $$J_m(f) = \int \cdots \int_{\mathbb{R}^d} \sum_{\nu_1 + \cdots + \nu_d = m} \frac{m!}{\nu_1! \cdots \nu_d!} \left( \frac{\partial^m f}{\partial x_1^{\nu_1} \cdots \partial x_d^{\nu_d}} \right)^2 dx_1 \cdots dx_d. \tag{1.9}$$

    The sum in the integral is taken over all the integers $\nu = (\nu_1, \ldots, \nu_d)^T$ such that $\nu_1 + \cdots + \nu_d = m$, where $d$ denotes the number of covariates, so $d = 2$ for spatial longitude and latitude coordinate data, and the order $m$ of differentiation in the penalty can be any integer satisfying $2m > d$. Matheron (1973) and Duchon (1977) showed that the function minimizing (1.8) has the form

    $$f(x) = \sum_{j=1}^{k} \alpha_j \phi_j(x) + \sum_{i=1}^{n} \delta_i \eta_i(x), \tag{1.10}$$

    where $(\phi_1, \ldots, \phi_k)$ are linearly independent polynomials spanning the space of all polynomials in $d$ variables of degree less than $m$, and $\alpha_j$, $j = 1, \ldots, k$, and $\delta_i$, $i = 1, \ldots, n$, are coefficients to be estimated. For example, when $d = 2$, $m = 2$, $k = 3$ and $x = (x_1, x_2)$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$ and $\phi_3(x) = x_2$. For $d = 2$, $m = 3$, $k = 6$, we have $\phi_1(x) = 1$, $\phi_2(x) = x_1$, $\phi_3(x) = x_2$, $\phi_4(x) = x_1 x_2$, $\phi_5(x) = x_1^2$, $\phi_6(x) = x_2^2$. The functions $(\eta_1, \ldots, \eta_n)$ are a set of $n$ radial basis functions, each evaluated at the displacement $r = x - x_i$ and defined as

    $$\eta_i(r) = \begin{cases} a_{md} \|r\|^{2m-d} \log \|r\|, & d \text{ even}, \\ b_{md} \|r\|^{2m-d}, & d \text{ odd}, \end{cases}$$

    where $a_{md}$ and $b_{md}$ are constants.

    For modeling spatial effects, thin-plate regression splines can be viewed as a Gaussian process with generalized covariance (Cressie, 1993), characterized in terms of distance $\tau$. The form of the covariance in two dimensions is $C(\tau) \propto \tau^{2m-2} \log(\tau)$, where $m$ is the order of the spline (commonly two). Paciorek (2007) provided a comparison of a variety of approaches for modeling spatial surfaces. Wood (2000, 2003, 2004) proposed the use of iterative weighted fitting of reduced-rank thin-plate splines for computational efficiency.
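    As a rough illustration of the construction in this section, the sketch below fits a full-rank thin-plate spline for two covariates with second-order penalty, using a polynomial part spanned by 1, x1 and x2 and the radial basis r^2 log r. Several choices are assumptions made for illustration: unit weights, the multiplicative constant of the radial basis set to 1 (so the smoothing parameter `lam` is on an arbitrary scale), the standard linear system with the side condition that the radial coefficients are orthogonal to the polynomial basis, and all function names. It is not the reduced-rank approach of Wood.

```python
import numpy as np

def tps_radial(r):
    """Radial basis r^2 log r for d = 2, m = 2, with value 0 at r = 0."""
    out = np.zeros_like(r)
    pos = r > 0
    out[pos] = r[pos] ** 2 * np.log(r[pos])
    return out

def pairwise_dist(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def fit_tps(x, y, lam):
    """Solve (E + n*lam*I) delta + T alpha = y subject to T' delta = 0."""
    n = x.shape[0]
    E = tps_radial(pairwise_dist(x, x))
    T = np.column_stack([np.ones(n), x])          # polynomial part: 1, x1, x2
    A = np.block([[E + n * lam * np.eye(n), T],
                  [T.T, np.zeros((3, 3))]])
    rhs = np.concatenate([y, np.zeros(3)])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n:]                        # delta, alpha

def predict_tps(x_new, x, delta, alpha):
    """Evaluate the fitted surface at the rows of x_new."""
    E_new = tps_radial(pairwise_dist(x_new, x))
    T_new = np.column_stack([np.ones(x_new.shape[0]), x_new])
    return E_new @ delta + T_new @ alpha

rng = np.random.default_rng(0)
x = rng.uniform(size=(30, 2))                      # hypothetical site coordinates
y = np.sin(2 * np.pi * x[:, 0]) + x[:, 1]

# With lam = 0 the spline interpolates the data.
delta, alpha = fit_tps(x, y, lam=0.0)
fitted = predict_tps(x, x, delta, alpha)

# A plane has zero penalty, so it is reproduced for any lam.
y_plane = 1.0 + 2.0 * x[:, 0] - x[:, 1]
delta_p, alpha_p = fit_tps(x, y_plane, lam=1.0)
```

    Increasing `lam` shrinks the radial coefficients toward zero, so the fit moves from interpolation toward the plane spanned by the polynomial part. The full-rank system has one radial coefficient per observation, which is the cost that reduced-rank fitting is designed to avoid.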

    1.4 Outline of Thesis

    This thesis develops models and methods for the analysis of spatial or spatial-temporal data arising from epidemiological, environmental and ecological studies. The specific problems considered are the identification of high-risk isolated areas in Chapter 2; joint modeling of multivariate spatially correlated outcomes using common spatial factor models in Chapter 3; misspecification of spatial exposure measures in Chapter 4; and investigation of functional data analysis approaches for modeling spatially and temporally correlated data in Chapter 5. Each of Chapters 2, 3, 4 and 5 constitutes a submitted paper. As a result, some introductory material, as well as the descriptions of motivating data sets, may be repeated across these chapters.


    1.4.1 Chapter 2

    In disease mapping studies, often there is interest in identifying high risk areas in

    order to investigate causes of mortality for surveillance purposes, or perhaps for effi-

    cient allocation of health funding. Here, we focus on identification of locally isolated

    high risk regions termed ‘local hotspots’ or ‘emerging hotspots’, defined as regions

    with elevated risks, with respect to their neighbors. Identification of ‘local hotspots’ or

    ‘emerging hotspots’ before they become extreme is crucial for disease surveillance. We

    develop methods of ranking the difference between area risks or ranks and corresponding values for neighbours, based on (1) the standardized mortality ratio (SMR); (2) minimizing mean squared errors of estimation for relative risks; (3) minimizing mean squared errors of estimation for ranks of risks; (4) minimizing a weighted squared error loss function for ranks; and (5) maximizing the sensitivity in the upper and lower $100\gamma\%$ of relative risks at a prespecified $\gamma$. We evaluate our methods through a simulation investigation in a scenario which reflects the Scottish lip cancer data used in

    several mapping studies. Our simulation results show that ranking the difference be-

    tween posterior ranks of emerging hotspots and corresponding values for neighbours,

    based on minimizing mean squared errors of estimation for ranks, is superior to other

    methods for identifying emerging hotspots.

    1.4.2 Chapter 3

    This chapter discusses joint outcome modeling of multivariate spatial data, where

    outcomes include count as well as zero-inflated count data. The framework utilized for

    the joint spatial count outcome analysis reflects that which is now commonly employed

    for the joint analysis of longitudinal and survival data, termed shared frailty models,

    in which the outcomes are linked through a shared latent spatial random risk term.

    We discuss these types of joint mapping models and consider the benefits achieved


    through such joint modeling in the disease mapping context. We also consider the

    power of tests for common spatial structure and develop recommendations on the

    sort of power achievable in some contexts, as well as overall recommendations on the

    utility of joint mapping. We illustrate the approaches in an analysis of lung cancer

    mortality as well as an ecological study of Comandra blister rust infection of lodgepole

    pine trees.

    1.4.3 Chapter 4

    In environmental and epidemiological studies, the nearest distance between the sus-

    ceptible subject and the exposure source is a commonly used exposure measure, prin-

    cipally because this measure is easy to collect. However, the density of the exposure in

    the neighborhood of the subject may play an important role in the response to expo-

    sure. Misspecification of exposure measures may result in inaccurate determinations

    of the link between exposure and the response of interest. Such considerations are

    motivated by the study of the disease dynamics of Comandra blister rust (Cronartium

    comandrae) on lodgepole pine. This disease spreads to pine trees through alternate host

    plants near the trees. We aim at understanding the relationship between the alternate

    host plant presence and the disease, as well as effects relating to genetic variation in

    the trees. We contrast the use of nearest distance to the alternate host plant, with

    host plant densities at different orders of neighborhood, as exposure measures, in the

    framework of a flexible semiparametric generalized additive model, while adjusting for

    a spatially smooth surface. We demonstrate that if exposure is inaccurately modeled,

    bias in estimating genetic effects may arise. Our study also provides

    information on the added benefit of collecting more detailed information on exposure

    beyond the simple nearest distance measure.


    1.4.4 Chapter 5

    Oysters from the Pacific Northwest coast of British Columbia, Canada, contain high

    levels of cadmium, in some cases exceeding some international food safety guidelines.

    A primary goal of this chapter is the investigation of the spatial and temporal variation

    in cadmium concentrations for oysters sampled from coastal British Columbia. Such

    information is important so that recommendations can be made as to where and when

    oysters can be cultured such that accumulation of cadmium within these oysters

    is minimized. Some modern statistical methods are applied to achieve this goal,

    including monotone spline smoothing, functional principal component analysis and

    semi-parametric additive modelling. Oyster growth rates are estimated as the first

    derivatives of the monotone smoothing growth curves. Some important patterns in

    cadmium accumulation by oysters are observed. For example, most inland regions

    tend to have a higher level of cadmium concentration than most coastal regions, so

    more caution needs to be taken for shellfish aquaculture practices occurring in the

    inland regions. The semi-parametric additive modelling shows that oyster cadmium

    concentration decreases with oyster length, and oysters sampled at 7 m have higher

    average cadmium concentration than those sampled at 1 m.

    1.4.5 Chapter 6

    The thesis closes with a discussion of future research topics.

  • Chapter 2

    Bayesian Ranking Methods for the

    Detection of Isolated Hotspots in

    Disease Mapping

    2.1 Introduction

    In disease mapping, early capture of emerging hotspots, that is, regions with ele-

    vated risks which are surrounded by areas with much lower risks, before they become

    extreme, is crucial in decision-making related to health surveillance. Such decision-

    making processes may refer to optimal allocation of resources for health prevention, or

    to decisions reflecting mobility of a society or other environmental controls. A typical

    approach for detection of disease hotspots through a hypothesis testing framework

    utilizes the scan statistic (see Kulldorff and Nagarwalla, 1995; Kulldorff et al., 1998),

    which aims at detecting the location and size of hotspots without any preconceived

    assumptions about these values. Our focus here is quite different, as we seek to estimate and rank various local elevations in risk across a map. Model-based spatial

  • CHAPTER 2. ISOLATED HOTSPOT DETECTION 11

    methods are used here to estimate such ranks. For rare diseases, the observed dis-

    ease count may exhibit extra Poisson variation. Hence, the standardized mortality

    ratios (SMRs), a basic investigative tool for epidemiologists, may be highly variable.

    Subsequently, in maps of SMRs, the most variable values, arising typically from low

    population areas, tend to be highlighted, masking the true underlying pattern of dis-

    ease risk. To address the issue of such overdispersion, the field of disease mapping

    has flourished in the last decade with a variety of estimation methods and spatial

    models for latent levels of the model hierarchy. In particular, there have been many

    developments related to Bayesian hierarchical models, which allow the risk in an area

    to borrow strength from neighboring areas where the disease risks are similar. These

    models have indeed become standard tools for mapping rates (see Besag et al., 1991;

    Clayton and Bernardinelli, 1992; Clayton et al., 1993; Lawson et al., 2000; MacNab

    et al., 2004; Best et al., 2005, for example) in order to identify global hotspots and

    trends in the risk surface across the map.

    Identification of local or emerging hotspots has received less attention. It is

    unclear whether and what sorts of smoothing techniques offer advantages for iden-

    tifying isolated hotspots, over basic estimates such as raw rates. Here, we maintain

    the focus on Bayesian hierarchical conditional autoregressive (CAR) models, devel-

    oped by Besag et al. (1991); Clayton and Bernardinelli (1992); Clayton et al. (1993).

    This model and its extensions have become commonplace in epidemiological studies

    and have been shown to be flexible and robust (Lawson et al., 2000). Best et al.

    (2005) demonstrate the merits of the CAR model when compared to other contemporary models including a multivariate normal geostatistical model with exponential

    covariance, a spatial mixture model, a partition model and a gamma moving average

    model. While CAR models were not designed to detect isolated hotspots or clusters

    of isolated hotspots, they have nevertheless been used broadly for identifying extreme

    risks.


    The most natural measure of isolation is the difference between the risk or rank of a

    potential hotspot and the corresponding quantity for its neighbors. Ranking methods

    play a valuable role in drawing attention to elevated regions. This chapter considers

    methods for ranking isolation measures with the goal of using these to identify local

    or emerging hotspots. We note that Laird and Louis (1989) showed that ranking of

    empirical Bayes estimators can be more accurate than that of conventional maximum

    likelihood estimators. Shen and Louis (1998) investigated ranking procedures using

    squared error loss functions operating on the difference between the estimated and

    true ranks. We note also that in many applications, interest focuses principally on

    identifying the locations with relatively high (e.g. in the upper 10%) or low risks. With such an emphasis, Lin et al. (2006) discussed various loss functions for Bayesian optimal ranking, as well as decision rules for identifying the regions with the top $100\gamma\%$ risk values. Wright et al. (2003) developed a weighted rank squared error loss

    function targeted at the most likely high-risk locations. We contrast these methods

    for identifying the highest and lowest isolation measures across a map and develop

    recommendations based on adaptations of these procedures. Though we focus on

    disease mapping, we note that methods for ranking isolation measures may be broadly

    useful in many other contexts, particularly sociological, for ranking political or racial

    isolation, or ecological, for diversity studies.
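    To make the idea of ranking isolation measures concrete, here is a minimal sketch. It assumes the isolation of a region is its risk minus the average risk of its neighbours (one possible reading of "the corresponding quantity for its neighbors"), and it summarizes posterior draws by the posterior mean of the ranks, which is the estimator that minimizes squared error loss on ranks. The adjacency matrix, the posterior draws and the function names are all hypothetical.

```python
import numpy as np

def isolation(theta, W):
    """Risk of each region minus the average risk over its neighbours."""
    return theta - (W @ theta) / W.sum(axis=1)

def ranks(values):
    """Ranks along the last axis, with rank 1 for the smallest value."""
    return np.argsort(np.argsort(values, axis=-1), axis=-1) + 1

def posterior_mean_ranks(theta_draws, W):
    """Average, over posterior draws, of the ranks of the isolation measures."""
    iso = np.array([isolation(t, W) for t in theta_draws])   # draws x regions
    return ranks(iso).mean(axis=0)

# Hypothetical 4-region line map; region 2 is elevated relative to its neighbours.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
theta_draws = np.array([[1.0, 3.0, 1.0, 1.0],
                        [1.2, 2.5, 0.9, 1.1],
                        [0.8, 2.8, 1.1, 1.0]])    # stand-in for MCMC output

mean_ranks = posterior_mean_ranks(theta_draws, W)
most_isolated = int(np.argmax(mean_ranks))        # zero-based index of the top region
print(mean_ranks, most_isolated)
```

    In practice the draws would come from the fitted hierarchical model, and the averaged ranks could be ranked again to give integer ranks for reporting.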

    In Section 2.2, we review the Bayesian hierarchical models commonly used for

    analyzing disease incidence and mortality data. Section 2.3 discusses the ranking

    methods considered, focusing on identifying regions associated with high risks which

    are isolated and building upon Bayesian hierarchical models. Section 2.4 evaluates the

    methods using the spatial distribution of lip cancer from Scotland where local hotspots

    are artificially generated. Section 2.5 closes with a discussion and recommendations.


    2.2 Bayesian Disease-Mapping Model

    It is well known that Bayesian hierarchical models for disease mapping provide a trade-off between bias and variance reduction of estimates, and are particularly helpful in

    cases where the disease is rare. The variance reduction is achieved through borrowing

    information from the neighboring region to produce a more stable estimate of the risk

    surface with estimated risks shrunk toward the overall mean risk, or some function of

    this mean. Marshall (1991) reviews empirical Bayes and some early Bayesian methods for disease mapping; Lawson et al. (2000) compare disease mapping models using various goodness-of-fit criteria; Best et al. (2005) provide a comprehensive review of recent developments in Bayesian disease mapping and compare models through simulation studies; Richardson et al. (2004) conduct a comprehensive evaluation

    designed to highlight the amount of smoothing of risk which occurs and the effects on

    identifying global hotspots in a variety of settings. Our aim here is to evaluate various

    ranking methods for risk estimators obtained from fitting Bayesian disease mapping

    models. We focus on the basic spatial model described by Besag et al. (1991).

    Let the area under study be divided into n contiguous regions labeled i = 1, ⋅ ⋅ ⋅ , n,

    and let y = (y1, ⋅ ⋅ ⋅ , yn)T be the observed, and E = (E1, ⋅ ⋅ ⋅ , En)T be the expected,

    disease counts. Denote by θ = (θ1, ⋅ ⋅ ⋅ , θn)T the underlying random

    region-specific disease risks. The response variables, conditional on θi, i = 1, ⋅ ⋅ ⋅ , n,

    are assumed independent and Poisson distributed: yi∣θi ∼ Poisson(μi), μi = θiEi. The

    conditional log linear model (Besag et al., 1991) specifies

    \log(\mu_i) = \log(E_i) + \eta_i , \qquad \eta_i = \alpha + b_i + h_i ,

    where α denotes the overall mean risk on the log scale, while the remainder of ηi is decomposed into a spatially

    correlated random error term bi, and an uncorrelated error hi. The spatially correlated


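    random effects, b = (b1, ⋅ ⋅ ⋅ , bn)T , are conveniently interpreted conditionally, as

    b_i \mid b_{j \neq i} \sim N\left( \frac{\sum_{j \sim i} w_{ij} b_j}{\sum_{j \sim i} w_{ij}} ,\; \frac{\sigma_b^2}{\sum_{j \sim i} w_{ij}} \right),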

    where j ∼ i indicates that region j belongs to the neighbourhood of region i,

    i = 1, ⋅ ⋅ ⋅ , n. Neighborhoods define the scope of the conditional influence and may

    be constructed in different ways depending on the context of the analysis. In our

    application, we define regions which are contiguous in space with the ith region,

    sharing a common boundary, as its neighborhood. The weights, wij ≥ 0, wii = 0,

    i, j = 1, ⋅ ⋅ ⋅ , n may be based on adjacency indicators for a lattice, or on a distance

    measure between region i and j. Where the weights are based on adjacency indicators,

    the joint distribution of the random effects, b, is described as the intrinsic conditional

    autoregressive model (Besag, 1974; Sun et al., 1999): b ∼ MVN(0, σ_b^2 Q^{-1}),

    where Q has ith diagonal element equal to the number of neighbors of the ith region

    while for i ≠ j, Q_ij = −1 if i and j are neighbors, and 0 otherwise. The vector of

    random risks, θ, accommodates extra variation through a white noise error vector,

    h = (h1, ⋅ ⋅ ⋅ , hn)T ∼ MVN(0, σ_h^2 I), where I is an identity matrix of dimension

    n. By combining the independent and spatially correlated sources of random errors,

    we obtain the convolution conditional autoregressive model for defining the distribution

    of the risks θi, as defined by Besag et al. (1991): h + b ∼ MVN(0, Σ), where

    Σ = σ_h^2 I + σ_b^2 Q^{-1}. The values of σ_b^2 and σ_h^2 give a sense of the contributions of

    spatial and non-spatial components in explaining the variability in the map of risks.
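    As a minimal sketch of the adjacency-based structure described above, the following Python fragment builds the matrix Q from a neighbour list and evaluates the conditional mean and variance of one spatial effect given the others. The four-region chain map, the effect values and the function names are hypothetical and serve only as an illustration; 0/1 adjacency weights are assumed.

```python
def icar_precision(neighbors):
    """Build Q: diagonal = number of neighbours, off-diagonal = -1 for neighbours, 0 otherwise."""
    n = len(neighbors)
    Q = [[0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbors):
        Q[i][i] = len(nbrs)
        for j in nbrs:
            Q[i][j] = -1
    return Q


def conditional_moments(i, b, neighbors, sigma2_b):
    """Mean and variance of b_i given all other effects, with 0/1 adjacency weights."""
    nbrs = neighbors[i]
    mean = sum(b[j] for j in nbrs) / len(nbrs)
    var = sigma2_b / len(nbrs)
    return mean, var


# Hypothetical map: four regions in a chain 0 - 1 - 2 - 3
neighbors = [[1], [0, 2], [1, 3], [2]]
Q = icar_precision(neighbors)
mean, var = conditional_moments(1, [0.2, 0.0, -0.4, 0.1], neighbors, sigma2_b=0.5)
```

    Each row of Q sums to zero, which reflects the fact that the intrinsic model only specifies differences between neighbouring effects.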

    Bayesian analysis requires the specification of prior distributions for the parameters.

    We put a diffuse prior on the intercept α. For the variance parameters (σ_b^2, σ_h^2) of

    the random effects (b, h), we assign their square roots (σ_b, σ_h) noninformative uniform prior

    densities between 0 and 100 (Gelman, 2006).

    In the Bayesian approach to disease mapping, inference on the relative risks is

    based on the posterior distribution of the risks given the data. The use of Markov


    chain Monte Carlo (MCMC) methods based on Gibbs sampling (Geman and Ge-

    man, 1984; Gelfand and Smith, 1990) yields easy implementation in the WinBUGS

    software package (Spiegelhalter et al., 2003), allowing for estimation of the posterior

    distribution of the relative risks. The R package R2WinBUGS (Sturtz et al., 2005)

    may be used to export results for additional analyses using R.

    2.3 Ranking Methods

    To estimate isolation, we propose to rank the difference between the rank or risk es-

    timates of the region under consideration and the corresponding mean value from its

    neighbours. We expect this to provide a useful mechanism for identifying areas with

    emerging or unusual elevated risk, and hence for prioritizing public health investigations.

    Our discussion of ranking approaches draws on both (i) traditional perspectives

    which use estimates based upon the SMR and (ii) those based on smoothing methods

    with a focus of obtaining a general impression of trends over space as well as utilizing

    these to provide more precise identification of isolated high risk areas.

    Let d = (d1, ⋅ ⋅ ⋅ , dn)T be a vector representing the isolation measure defined as

    the true difference in relative risks between the region and the mean value of the risk

    for its neighborhood

    d_i = \theta_i - \frac{1}{N_i} \sum_{j \sim i} \theta_j , \qquad (2.1)

    where Ni denotes the number of neighbours for region i, i = 1, ⋅ ⋅ ⋅ , n. Define the

    corresponding rank of di as

    \mathrm{rank}(d_i) = R_i = \sum_{j=1}^{n} I\{d_i \le d_j\} , \qquad (2.2)

    where I {A} is the indicator function for event A. The smallest difference has rank

    n and the largest has rank 1. The ranking methods considered are obtained by


    minimizing the following loss functions.
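    To make the definitions (2.1) and (2.2) concrete, here is a small Python sketch that computes the isolation measure for each region and then its rank, with rank 1 assigned to the largest difference. The risk values and the four-region chain map are invented for illustration, and the function names are not taken from the thesis.

```python
def isolation(theta, neighbors):
    """Isolation measure (2.1): own risk minus the mean risk of the neighbours."""
    return [theta[i] - sum(theta[j] for j in nbrs) / len(nbrs)
            for i, nbrs in enumerate(neighbors)]


def ranks_desc(d):
    """Rank (2.2): R_i counts the regions j with d_i <= d_j, so the largest d has rank 1."""
    return [sum(1 for dj in d if di <= dj) for di in d]


# Hypothetical relative risks on a chain map 0 - 1 - 2 - 3
theta = [1.0, 3.0, 1.2, 0.8]
neighbors = [[1], [0, 2], [1, 3], [2]]

d = isolation(theta, neighbors)
R = ranks_desc(d)
```

    In this toy map, region 1 has a risk well above both of its neighbours and therefore receives rank 1, while region 0, whose only neighbour is the elevated region, receives the last rank.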

    2.3.1 Squared error loss function for the isolation measures

    It is well known that the posterior mean minimizes the Bayesian risk with respect

    to the squared-error loss (SEL) function (Berger, 1985). For example, the posterior

    mean, E(θi∣y), is the optimal Bayes estimate obtained by minimizing the posterior

    expectation of the sum of squared error loss function L(\theta, \hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{\theta}_i - \theta_i)^2

    (Carlin and Louis, 1996). In our case, we rank the posterior mean of the isolation

    value, E(di∣y), which minimizes the posterior expectation of L(d, \hat{d}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{d}_i - d_i)^2.

    The corresponding estimated ranks are denoted as PM.

    2.3.2 Squared error loss function for the ranks of the isolation measures

    Laird and Louis (1989), Shen and Louis (1998) and Louis and Shen (1999) showed

    that if ranks of parameters are of interest, using a rank estimator directly is more ap-

    propriate than using the parameter estimator to obtain ranks. The posterior expected

    rank is obtained by minimizing the sum of squared error loss function of the ranks

    L(R, \hat{R}) = \frac{1}{n} \sum_{i=1}^{n} (\hat{R}_i - R_i)^2. The estimated ranks, which are non-integer quantities,

    are

    \bar{R}_i = E(R_i \mid y) = \sum_{j=1}^{n} P(d_i \le d_j \mid y) , \qquad (2.3)

    and tend to be shrunk towards the mid-rank (n + 1)/2. Hence, we rank the posterior

    means in (2.3) as described below, and denote the corresponding estimated ranks,

    R̂i = rank(R̄i), as PRANK. Lin et al. (2006) show that the estimator is also optimal

    under weighted squared error loss of ranks, \frac{1}{n} \sum_{i=1}^{n} w_i (\hat{R}_i - R_i)^2, for any values of

    wi, i = 1, ⋅ ⋅ ⋅ , n. Calculation of PRANK can be easily implemented in the Bayesian


    context. Let θ^(r) = (θ_1^(r), ⋅ ⋅ ⋅ , θ_n^(r))T be a random draw of θ from p(θ∣y); rank the

    isolation measures d_i^(r) = θ_i^(r) − (1/N_i) ∑_{j∼i} θ_j^(r), i = 1, ⋅ ⋅ ⋅ , n, j = 1, ⋅ ⋅ ⋅ , N_i, and

    subsequently rank the average rank of d(r)i over the MCMC iterations, r = 1, ⋅ ⋅ ⋅ , R,

    to obtain the optimal rank based on (2.3).
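    The computation just described can be sketched in a few lines of Python. The sketch assumes that posterior draws of the relative risks are already available (here three invented draws stand in for MCMC output), ranks the isolation measures within each draw, averages those ranks, and finally ranks the averages so that the smallest expected rank receives rank 1. All names and numbers are illustrative.

```python
def isolation(theta, neighbors):
    """Isolation measure (2.1) for one draw of the risks."""
    return [theta[i] - sum(theta[j] for j in nbrs) / len(nbrs)
            for i, nbrs in enumerate(neighbors)]


def ranks_desc(d):
    """Rank (2.2): the largest value receives rank 1."""
    return [sum(1 for dj in d if di <= dj) for di in d]


def prank(draws, neighbors):
    """Average the per-draw ranks, then rank the averages (smallest average -> rank 1)."""
    n = len(neighbors)
    totals = [0.0] * n
    for theta in draws:
        r = ranks_desc(isolation(theta, neighbors))
        for i in range(n):
            totals[i] += r[i]
    mean_rank = [t / len(draws) for t in totals]
    order = sorted(range(n), key=lambda i: mean_rank[i])
    est = [0] * n
    for position, i in enumerate(order, start=1):
        est[i] = position
    return mean_rank, est


# Hypothetical posterior draws of the risks on a chain map 0 - 1 - 2 - 3
neighbors = [[1], [0, 2], [1, 3], [2]]
draws = [
    [1.0, 3.0, 1.2, 0.8],
    [1.1, 2.5, 1.0, 1.4],
    [0.9, 1.6, 1.3, 2.0],
]
mean_rank, est = prank(draws, neighbors)
```

    The averaged ranks are generally non-integer, which is why a final ranking step is needed to return integer ranks.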

    Ranking methods described in Subsections 2.3.1 and 2.3.2 may be reasonable

    choices when accurate ranking of all regions is of interest. In contrast, the methods

    described in Subsections 2.3.3 and 2.3.4 focus on high risk areas.

    2.3.3 Weighted rank squared error loss function

    The posterior means are less variable than a typical draw from the posterior distribu-

    tion (Louis, 1984). Therefore, high risks tend to be underestimated, while low risks

    tend to be overestimated. Wright et al. (2003) introduce weighted rank squared error

    loss functions in a hierarchical setting for estimating extrema (hotspots) of parameters.

    In an exploratory approach, we adapt this method to be aligned with a focus

    on identifying local isolated hotspots.

    Let (d(1), ⋅ ⋅ ⋅ , d(n)) be the ordered vector of d, d(1) < ⋅ ⋅ ⋅ < d(n), assuming no ties.

    To identify the most isolated hotspot, we consider the following loss function:

    J(d, \hat{d}, c) = \sum_{k=1}^{n} \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 = \sum_{k=1}^{n} c_{r(k)} (d_k - \hat{d}_k)^2 , \qquad (2.4)

    where r(k) \equiv \{j : d_k = d_{(j)}\}, c_{r(k)} = \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\}, and c = (c1, ⋅ ⋅ ⋅ , cn)T is the

    vector of weights for d. The optimal Bayes estimator of dk is obtained by minimizing

    the conditional expectation of the kth element in (2.4),

    E\{J_k(d, \hat{d}_k, c) \mid y\} = \int \sum_{j=1}^{n} c_j I\{d_k = d_{(j)}\} (d_k - \hat{d}_k)^2 \, p(d \mid y) \, dd , \qquad (2.5)

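    which yields

    \hat{d}_k = \frac{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, d_k \, p(d \mid y) \, dd}{\sum_{j=1}^{n} c_j \int I\{d_k = d_{(j)}\} \, p(d \mid y) \, dd} = \frac{\sum_{j=1}^{n} E(d_k \mid d_k = d_{(j)}, y) \, c_j \, p(d_k = d_{(j)} \mid y)}{\sum_{j=1}^{n} c_j \, p(d_k = d_{(j)} \mid y)} . \qquad (2.6)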

    The estimate d̂k is a weighted average of conditional posterior means of dk, with the

    weight being cj multiplied by the posterior probability that dk has rank j. The corre-

    sponding estimated ranks are denoted as WRSEL. For identifying extreme risks, we

    use the suggestion in Wright et al. (2003) to consider a sharply increasing weighting

    vector, with ci = exp [{(n+ 1)− i} /s] as the weight for rank i, i = 1, ⋅ ⋅ ⋅ , n. We

    let WRSEL(a) denote the estimated ranks when s = 2, so that the weighting func-

    tion puts large weight on highly isolated risks and almost 0 weight otherwise, and

    WRSEL(b) denote the estimated ranks when s = 10, so that the weight function de-

    clines less steeply as risks become less isolated. Figure 2.1 displays the weight vectors

    c for WRSEL(a) and WRSEL(b) when n = 56.
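    The weighting scheme and the estimator (2.6) can be approximated from posterior draws as in the following Python sketch. It assumes that the weight c_i is attached to rank i with rank 1 denoting the most isolated region, as in (2.2), and it replaces the integrals in (2.6) by averages over draws; the two-region draws are invented and the function names are hypothetical.

```python
import math


def wrsel_weights(n, s):
    """Weights c_i = exp[{(n + 1) - i} / s] for ranks i = 1..n, scaled to a maximum of 1."""
    raw = [math.exp(((n + 1) - i) / s) for i in range(1, n + 1)]
    top = max(raw)
    return [c / top for c in raw]


def wrsel_estimate(d_draws, weights):
    """Monte Carlo analogue of (2.6): average of d_k weighted by the weight of its rank in each draw."""
    n = len(weights)
    num = [0.0] * n
    den = [0.0] * n
    for d in d_draws:
        for k in range(n):
            rank_k = sum(1 for dj in d if d[k] <= dj)  # rank 1 = largest value
            c = weights[rank_k - 1]
            num[k] += c * d[k]
            den[k] += c
    return [num[k] / den[k] for k in range(n)]


# Weight vectors for the two settings named in the text, with n = 56
weights_a = wrsel_weights(56, s=2)
weights_b = wrsel_weights(56, s=10)

# Hypothetical draws of the isolation measures for two regions
toy_weights = wrsel_weights(2, s=1)
est = wrsel_estimate([[2.0, 1.0], [0.0, 1.0]], toy_weights)
```

    In the toy example the first region has posterior mean 1.0, but its weighted estimate is pulled upward because the draw in which it is ranked first carries more weight. Scaling the weights has no effect on the estimate, since the scale cancels in the ratio.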

    2.3.4 Misclassification rates of regions in the top 100γ% group

    Lin et al. (2006) considered specific loss functions tailored for estimating extreme

    ranks. They recommended ranking the posterior probability that a region’s rank is

    in the top 100γ% of ranks based on the rank-based misclassification loss function:

    L_{0|1}(\gamma, R, \hat{R}) = \frac{1}{n} \sum_{i=1}^{n} \{ FP(\gamma, R_i, \hat{R}_i) + FN(\gamma, R_i, \hat{R}_i) \} , \qquad (2.7)

    where

    FP(\gamma, R_i, \hat{R}_i) = I\{R_i > \gamma(n+1), \hat{R}_i \le \gamma(n+1)\};

    FN(\gamma, R_i, \hat{R}_i) = I\{R_i \le \gamma(n+1), \hat{R}_i > \gamma(n+1)\}, \qquad (2.8)

    Figure 2.1: Plot of the weight function c for WRSEL(a) and WRSEL(b). In this plot, the weight functions are scaled to have maximum value of 1.

    [Plot omitted; horizontal axis: rank, vertical axis: c.]

    where FP (false positive) and FN (false negative) indicate the two possible misclassi-

    fication rates.

    Lin et al. (2006) show that the loss function (2.7) is minimized by ranking the following

    posterior probabilities:

    P(R_i \le \gamma(n+1) \mid y) , \qquad (2.9)

    which are based on the posterior distribution of Ri; this ranking minimizes

    errors in classifying regions above or below a percentile threshold. The corresponding

    estimated ranks are denoted as PPR.
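    A rough Python sketch of this estimator follows. It assumes posterior draws of the isolation measures are available and, to avoid floating-point issues, expresses the cut-off directly as an integer number of top ranks (the argument top_k, a hypothetical name standing for the product of the threshold and n + 1). The draws are invented for illustration.

```python
def ppr(d_draws, top_k):
    """Posterior probability that each region's rank is within the top_k ranks, and the resulting ranking."""
    n = len(d_draws[0])
    hits = [0] * n
    for d in d_draws:
        for i in range(n):
            rank_i = sum(1 for dj in d if d[i] <= dj)  # rank 1 = largest value
            if rank_i <= top_k:
                hits[i] += 1
    prob = [h / len(d_draws) for h in hits]
    # Regions with the highest posterior probability are ranked first
    order = sorted(range(n), key=lambda i: -prob[i])
    est = [0] * n
    for position, i in enumerate(order, start=1):
        est[i] = position
    return prob, est


# Hypothetical draws of the isolation measures for three regions
d_draws = [
    [2.0, 1.0, 0.0],
    [1.5, 2.5, 0.2],
    [3.0, 0.1, 0.5],
    [1.4, 0.9, 0.3],
]
prob, est = ppr(d_draws, top_k=1)
```

    Ties in the posterior probabilities are broken here by region index, which is an arbitrary choice made only to keep the sketch short.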

    2.4 Comparison of Rank Estimators of Isolation

    In an effort to understand how these ranking methods perform and how well they cap-

    ture isolated hotspots when these are only modestly elevated, we consider hotspots


    from regions with low expected counts, where the elevation in risk ranges from mod-

    erate to large. We consider a single isolated hotspot, a small cluster of contiguous

    hotspots, and, for comparison, several non-contiguous isolated hotspots.

    In the investigations, the background relative risks are spatially correlated while an

    independent discrete random effect inflates the risks in the target regions. Specifically,

    counts were generated from a multinomial distribution

    y_i \sim \mathrm{Multinomial}\left( \sum_{i=1}^{n} E_i ,\; \frac{E_i \theta_i}{\sum_{j=1}^{n} E_j \theta_j} \right), \qquad \theta_i = \exp(\alpha + b_i + \log \delta_i) , \qquad (2.10)

    where α is the overall mean rate over the map; bi denotes a spatially correlated random

    effect; δi = 1 if the region is not a hotspot, and constant t otherwise, t being

    the inflation factor. To accommodate sampling variability, each simulation scenario is

    replicated 500 times. Two MCMC chains have been run for a total of 20,000 iterations,

    keeping every 10th, after a 10,000 iteration burn-in period. Brooks-Gelman-Rubin

    diagnostics (Brooks and Gelman, 1998), as well as graphical checks of chains and

    their autocorrelations were performed to assess convergence. The distribution of the

    spatially correlated random effects, the expected disease counts and the neighbor-

    hood structure mimic the fitted distribution from an initial analysis of the Scottish

    lip cancer data (see Breslow and Clayton, 1993, for example). The data comprise

    observed and expected counts of lip cancer cases during the period 1975-1980 over 56

    Scottish counties. Table 2.1 summarizes observed and expected counts for these data.

    The lip cancer data are known for exhibiting severe extra-Poisson variation (Clayton

    and Kaldor, 1987). Breslow and Clayton (1993) and others have found that a con-

    ditional Poisson model with spatially correlated CAR random effects provides a fair

    fit to these data. We use the estimated model parameters from such an analysis to

    define the background spatial pattern. Additionally, emerging isolated hotspots are

    generated as


    ∙ Scenario I: A single region is considered as an emerging hotspot. Three candidates

    are considered with expected counts of 1.8, 6 and 14.6, corresponding

    approximately to the 10th, 50th and 90th percentiles of the expected counts,

    respectively. Note that we choose the hotspot with low expected count (10th

    percentile of the expected counts), such that it is surrounded by neighbours

    with fairly high expected counts, one of which has expected count 50.7. In

    this case, the neighbours may have substantial smoothing effects on the target

    region under the CAR model.

    ∙ Scenario II: A group of three contiguous regions is considered as an isolated clus-

    ter. Two cases are considered: (i) areas with low expected counts 3.3, 4.8 and

    2.9; (ii) areas with high expected counts of 9.3, 14.6 and 88.7. When contigu-

    ous regions are proposed as hotspots, di (2.1) is calculated by excluding target

    hotspots from Ni. This mimics a hypothesis testing scenario where a specific

    cluster is being tested. Note that the expected counts from the neighbours of

    case (i) are fairly low, with mean expected counts about 7.5.

    ∙ Scenario III: A group of three non-contiguous regions is considered as an isolated

    group of regions of higher risk. Two cases are considered (i) areas with low

    expected counts of 2.5, 3.3 and 3.6; (ii) areas with high expected counts of 10.1,

    50.7 and 8.2. The expected counts of the neighbours for two of the isolated

    hotspots in case (i) are fairly low, while the third hotspot has a neighbor with

    the highest expected count over the map.

    Note that estimated risks for the isolated hotspots which have moderate or high

    expected counts are less likely to be influenced by disease counts for their neighbours.
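    The data-generating mechanism in (2.10) can be illustrated with the short Python sketch below. It is only a sketch under stated assumptions: the four expected counts, the spatial effects, the intercept and the inflation factor are invented, the spatial effects are supplied by hand rather than drawn from a fitted CAR model, and the multinomial total is taken to be the rounded sum of the expected counts.

```python
import math
import random


def simulate_counts(E, b, alpha, hotspots, t, rng):
    """Draw one map of counts with the structure of (2.10)."""
    n = len(E)
    # delta_i equals the inflation factor t in hotspot regions and 1 elsewhere
    delta = [t if i in hotspots else 1.0 for i in range(n)]
    theta = [math.exp(alpha + b[i] + math.log(delta[i])) for i in range(n)]
    weights = [E[i] * theta[i] for i in range(n)]
    # Assumption: the multinomial total is the rounded sum of the expected counts
    total = round(sum(E))
    y = [0] * n
    for i in rng.choices(range(n), weights=weights, k=total):
        y[i] += 1
    return theta, y


# Hypothetical inputs: region 0 is the hotspot with a low expected count
E = [1.8, 6.0, 14.6, 50.7]
b = [0.1, -0.2, 0.0, 0.3]
rng = random.Random(2011)
theta, y = simulate_counts(E, b, alpha=0.0, hotspots={0}, t=3.0, rng=rng)
```

    Because the counts are drawn conditionally on a fixed total, inflating one region slightly lowers the expected share of every other region.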

    In the simulation studies, the risks of the elevated regions are inflated to be sharply

    different from their neighbors and to be (i) not overly high (rank about 10th place),

    (ii) moderately high (rank about 3rd place) or (iii) high (rank about 1st place). The

    Table 2.1: Scottish lip cancer data: summary statistics.

                          Minimum   First quartile   Median   Third quartile   Maximum
    Observed count (y)    0.00      4.75             8.00     11.00            39.00
    Expected count (E)    1.10      4.05             6.30     10.12            88.70
    SMR (y/E)             0         0.49             1.11     2.24             6.43

    magnitudes of the inflations for Scenarios I, II and III are reflected in Figures 2.2,

    2.3 and 2.4, respectively. The corresponding geographical locations for the isolated

    hotspots are shown in Figure 2.5 (Scenario I) and Figure 2.6 (Scenarios II and III). We also consider scaling the

    expected counts by a factor u = 1, 4 and 8 for all scenarios. The threshold γ in (2.9)

    corresponds to 1/(n + 1) for Scenario I and 3/(n + 1) for Scenarios II and III. The

    simulated data are analyzed using model (2.1).

    To assess the accuracy of the proposed ranking methods for identification of iso-

    lated hotspots, we consider the root mean squared error of R̂i, for an isolated hotspot

    at site i, given by

    \mathrm{RMSE}(\hat{R}_i) = \left\{ \frac{1}{M} \sum_{m=1}^{M} \left( \hat{R}_i^{(m)} - R_i \right)^2 \right\}^{1/2} , \qquad (2.11)

    where Ri is the true rank (2.2) and R̂_i^(m) is the estimated value based on the mth

    simulated dataset, m = 1, ⋅ ⋅ ⋅ , M. For cases where clusters are considered (Scenarios

    II and III), we calculate the average RMSE for the hotspots.

    We also evaluate the ranking methods based on the correct positive and false

    Figure 2.2: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data. The panels in the top, middle and bottom rows correspond to Scenario I with the target region having low, moderate and high expected incidence count, respectively. The isolated hotspot, shown as a black dot, is inflated to about the 10th (column 1), 3rd (column 2) and 1st (column 3) places. The symbol + identifies neighboring regions of the isolated hotspot.

    [Panels omitted; horizontal axis: true rank, vertical axis: true relative risk.]

    Figure 2.3: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data. The panels in the top and bottom rows correspond to an isolated cluster of three contiguous hotspots with low and high expected incidence count, respectively. The isolated hotspots, shown as black dots, are inflated to about the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the isolated hotspots.

    [Panels omitted; horizontal axis: true rank, vertical axis: true relative risk.]

    Figure 2.4: Plot of true relative risks versus true ranks of the relative risks for the Scottish lip cancer data. The panels in the top and bottom rows correspond to an isolated cluster of three non-contiguous hotspots with low and high expected incidence count, respectively. The isolated hotspots, shown as black dots, are inflated to about the 10th, 3rd and 1st places. The symbol + identifies neighboring regions of the isolated hotspots.

    [Panels omitted; horizontal axis: true rank, vertical axis: true relative risk.]

    Figure 2.5: The panels display di for Scenario I. The single isolated hotspot with low, moderate and high expected count is identified by the red circle in the 1st, 2nd and 3rd panels, respectively.

    [Map panels omitted; panel titles: low E, moderate E, high E; legend classes: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5.]

    Figure 2.6: The top and bottom panels display di for Scenarios II and III, respectively. The cluster of three contiguous isolated hotspots with low and high expected counts for simulation Scenario II are identified by the red circles in the left and right top panels, respectively; the cluster of three non-contiguous hotspots with low and high expected counts for simulation Scenario III are identified by the red circles in the left and right bottom panels, respectively.

    [Map panels omitted; panel titles: low E, high E; legend classes: under −1.5, −1.5 to 0, 0 to 1.5, over 1.5.]


    positive rates

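    CP = P(\hat{R}_i < \tau \mid R_i < \tau) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \tau \mid R_i < \tau\};

    FP = P(\hat{R}_i < \tau \mid R_i > \tau) = \frac{1}{M} \sum_{m=1}^{M} I\{\hat{R}_i^{(m)} < \tau \mid R_i > \tau\}, \qquad (2.12)

    where τ in (2.12) denotes the threshold defining high ranks, τ = 2 for Scenario I and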

    4 for Scenarios II and III.
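    The evaluation criteria (2.11) and (2.12) reduce to simple summaries of the estimated ranks across simulated datasets, as in the following Python sketch. The estimated ranks below are invented for five hypothetical replicates; the same helper is applied to a true hotspot to obtain a correct positive rate and to a region that is not a hotspot to obtain a false positive rate, and the name tau for the threshold is an assumption of this sketch.

```python
import math


def rmse(est_ranks, true_rank):
    """Root mean squared error (2.11) of the estimated ranks over M simulated datasets."""
    m = len(est_ranks)
    return math.sqrt(sum((r - true_rank) ** 2 for r in est_ranks) / m)


def rate_below(est_ranks, tau):
    """Proportion of simulated datasets in which the estimated rank falls below the threshold tau."""
    return sum(1 for r in est_ranks if r < tau) / len(est_ranks)


# Hypothetical estimated ranks over M = 5 replicates
hotspot_ranks = [1, 1, 3, 2, 1]      # region whose true rank is 1
background_ranks = [2, 3, 1, 1, 4]   # region whose true rank is above the threshold

tau = 2
hotspot_rmse = rmse(hotspot_ranks, true_rank=1)
cp = rate_below(hotspot_ranks, tau)      # correct positive rate for the hotspot
fp = rate_below(background_ranks, tau)   # false positive rate for the background region
```

    For clusters of hotspots, the per-region values would then be averaged over the hotspot regions, in line with the averaging of RMSE described above.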

    Table 2.2 displays RMSE, CP and FP for all the ranking methods evaluated here

    for Scenario I for the case where the hotspot is associated with a low expected count

    surrounded by neighbours with high expected values. It is not surprising that SMR

    performs better in this case, as the CAR model pools information from the neighbours

    to produce an estimate for the target region; therefore, the risk estimate for this

    isolated hotspot tends to be smoothed under the CAR model. In contrast, for the

    case of an isolated hotspot with moderate or large expected count, as shown in Tables

    2.3 and 2.4, PRANK outperforms SMR. The gains of using PRANK are substantial

    when the expected incidence count for the emerging hotspot is large. For example,

    in Table 2.4, when the isolated hotspot is in the 10th place, CP is about 71.2% while

    FP is about 0.5% for PRANK, yielding a performance which is far superior to the

    other ranking methods. In general, WRSEL(a), WRSEL(b) and PPR perform less

    well. The WRSEL function tends to inflate the point estimates of the high risks;

    because their weights are low, inaccuracies in point estimates of the other regions

    with low isolation measures are relatively unimportant. WRSEL does not provide

    precise estimates of all the risks and this may make it unsuitable for ranking purposes

    (ranking requires good estimates over the whole map). Our empirical evaluation of

    PPR over a sequence of values of the threshold γ (not shown here) suggests that

    the performance of this estimator in terms of RMSE, CP and FP is influenced by γ,

    especially when the expected disease counts are low for the isolated hotspots. For


    all the ranking methods, RMSE decreases, CP increases and FP decreases when the

    emerging hotspots are gradually elevated above the whole surface, and when the

    expected incidence counts for all the regions are inflated. These findings apply also

    to the cases where the isolated hotspots are a cluster of three contiguous regions (see

    Tables 2.5 and 2.6) and also where the isolated hotspots are three non-contiguous

    regions (see Tables 2.7 and 2.8). It is also interesting to note that, in contrast to

    Scenario I, for Scenario II, where three contiguous regions with low expected counts

    are inflated as a cluster of hotspots, PRANK is superior to SMR, as the CAR model

    has less of a smoothing effect on these isolated hotspots.

    2.5 Summary

    In this study, we focus on developing and evaluating rank estimators for disease map-

    ping for the identification of emerging isolated hotspots. To determine the magnitude

    of elevation of the hotspots relative to their neighbours, we developed an isolation

    measure, the difference of risks or their rank estimators for the emerging high risk

    regions and their neighbours. In summary, we note that though the CAR model

    provides a smoothed risk surface, the estimates for PRANK or PM based on this

    model perform reasonably well in detecting the emerging isolated hotspots. Simula-

    tion studies show that gains of using PRANK may be substantial compared to other

    ranking methods considered, especially when the disease is rare and the high risk area

    is not yet a global outlier. The research has adopted the widely used CAR model.

    Rank estimators based on other models may yield different results on identification of

    isolated hotspots. The isolation measure developed here depends on the definition of

    the neighborhood structure. The performance of the isolation measure may depend

    on the distribution of the number of neighbours; hence the development of methods

    which account for the number of neighbours may be useful.


    In addition, in comparison to the classical scan statistic, we expect that the rank-

    ing methods based on the spatial model may have lower false positive rates for identify-

    ing isolated hotspots, since the classical scan statistic is very sensitive to the violation

    of the assumption of spatial independence, detecting clusters at the 5% level much

    more often than 5% of the time when spatially correlated data are simulated (Loh

    and Zhu, 2007). It would be useful to compare the use of the scan statistic to our

    ranking methods through simulation studies when no isolated hotspots exist.

    Table 2.2: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with LOW expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                     10th                    3rd                     1st
                     RMSE    CP      FP      RMSE    CP      FP      RMSE    CP      FP
    u = 1  SMR       8.988   0.262   0.013   4.211   0.542   0.008   1.842   0.706   0.005
           PM        11.924  0.014   0.018   6.991   0.098   0.016   4.631   0.212   0.014
           WRSEL(a)  14.479  0.004   0.018   10.173  0.064   0.017   7.689   0.132   0.016
           WRSEL(b)  15.673  0.006   0.018   10.304  0.074   0.017   7.542   0.160   0.015
           PPR       21.577  0.008   0.018   13.456  0.088   0.017   9.316   0.162   0.015
           PRANK     11.643  0.024   0.018   6.555   0.150   0.015   4.103   0.278   0.013
    u = 4  SMR       2.109   0.428   0.010   0.417   0.886   0.002   0.253   0.948   0.001
           PM        4.331   0.070   0.017   1.291   0.646   0.006   0.629   0.834   0.003
           WRSEL(a)  6.937   0.036   0.018   1.865   0.560   0.008   0.913   0.784   0.004
           WRSEL(b)  5.753   0.050   0.017   1.449   0.612   0.007   0.700   0.828   0.003
           PPR       10.509  0.054   0.017   1.785   0.626   0.007   0.739   0.832   0.003
           PRANK     3.935   0.096   0.016   1.154   0.676   0.006   0.576   0.862   0.003
    u = 8  SMR       1.305   0.472   0.010   0.205   0.958   0.001   0.118   0.986   0.000
           PM        2.510   0.164   0.015   0.397   0.872   0.002   0.161   0.974   0.000
           WRSEL(a)  3.409   0.132   0.016   0.443   0.834   0.003   0.179   0.968   0.001
           WRSEL(b)  2.781   0.164   0.015   0.422   0.858   0.003   0.161   0.974   0.000
           PPR       5.720   0.160   0.015   0.422   0.864   0.002   0.155   0.976   0.000
           PRANK     2.373   0.188   0.015   0.374   0.890   0.002   0.141   0.980   0.000

    Table 2.3: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with MODERATE expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                             10th                    3rd                     1st
    u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
    u = 1  SMR        3.159  0.208  0.014     1.152  0.580  0.008     0.605  0.802  0.004
           PM         3.151  0.200  0.015     1.275  0.542  0.008     0.560  0.832  0.003
           WRSEL(a)   9.498  0.028  0.018     4.629  0.206  0.014     1.997  0.602  0.007
           WRSEL(b)   6.426  0.082  0.017     2.577  0.420  0.011     0.980  0.750  0.005
           PPR        8.818  0.112  0.016     2.705  0.474  0.010     0.995  0.784  0.004
           PRANK      2.164  0.378  0.011     0.729  0.764  0.004     0.319  0.922  0.001

    u = 4  SMR        0.931  0.604  0.007     0.341  0.884  0.002     0.110  0.988  0.000
           PM         0.963  0.588  0.007     0.241  0.942  0.001     0.089  0.992  0.000
           WRSEL(a)   1.957  0.326  0.012     0.392  0.864  0.002     0.110  0.988  0.000
           WRSEL(b)   1.138  0.512  0.009     0.300  0.922  0.001     0.089  0.992  0.000
           PPR        1.483  0.534  0.008     0.272  0.938  0.001     0.089  0.992  0.000
           PRANK      0.769  0.690  0.006     0.195  0.962  0.001     0.077  0.994  0.000

    u = 8  SMR        0.597  0.710  0.005     0.200  0.960  0.001     0.000  1.000  0.000
           PM         0.642  0.706  0.005     0.179  0.968  0.001     0.000  1.000  0.000
           WRSEL(a)   0.872  0.552  0.008     0.205  0.958  0.001     0.000  1.000  0.000
           WRSEL(b)   0.672  0.678  0.006     0.179  0.968  0.001     0.000  1.000  0.000
           PPR        0.722  0.700  0.005     0.179  0.968  0.001     0.000  1.000  0.000
           PRANK      0.546  0.790  0.004     0.161  0.974  0.000     0.000  1.000  0.000

    Table 2.4: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated single hotspot with HIGH expected disease counts, whose risk was inflated to about the 10th, 3rd and 1st place; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                             10th                    3rd                     1st
    u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
    u = 1  SMR        1.834  0.250  0.014     0.799  0.630  0.007     0.417  0.860  0.003
           PM         1.321  0.428  0.010     0.494  0.804  0.004     0.249  0.950  0.001
           WRSEL(a)   7.695  0.030  0.018     2.490  0.358  0.012     0.832  0.782  0.004
           WRSEL(b)   3.025  0.188  0.015     0.906  0.674  0.006     0.319  0.920  0.001
           PPR        5.272  0.250  0.014     0.926  0.744  0.005     0.382  0.936  0.001
           PRANK      0.696  0.712  0.005     0.257  0.940  0.001     0.118  0.986  0.000

    u = 4  SMR        0.651  0.684  0.006     0.283  0.920  0.001     0.077  0.994  0.000
           PM         0.562  0.766  0.004     0.200  0.960  0.001     0.063  0.996  0.000
           WRSEL(a)   1.049  0.504  0.009     0.268  0.934  0.001     0.077  0.994  0.000
           WRSEL(b)   0.660  0.716  0.005     0.205  0.958  0.001     0.063  0.996  0.000
           PPR        0.720  0.734  0.005     0.195  0.962  0.001     0.063  0.996  0.000
           PRANK      0.415  0.870  0.002     0.167  0.972  0.001     0.045  0.998  0.000

    u = 8  SMR        0.537  0.734  0.005     0.118  0.986  0.000     0.000  1.000  0.000
           PM         0.454  0.816  0.003     0.110  0.988  0.000     0.000  1.000  0.000
           WRSEL(a)   0.610  0.686  0.006     0.110  0.988  0.000     0.000  1.000  0.000
           WRSEL(b)   0.486  0.792  0.004     0.110  0.988  0.000     0.000  1.000  0.000
           PPR        0.475  0.808  0.003     0.100  0.990  0.000     0.000  1.000  0.000
           PRANK      0.369  0.880  0.002     0.089  0.992  0.000     0.000  1.000  0.000

    Table 2.5: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three contiguous hotspots with LOW expected disease counts, the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                             10th                    3rd                     1st
    u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
    u = 1  SMR        5.264  0.455  0.031     3.530  0.578  0.024     1.712  0.808  0.011
           PM         3.904  0.412  0.033     2.762  0.568  0.024     1.611  0.821  0.010
           WRSEL(a)   9.437  0.111  0.050     7.373  0.276  0.041     3.445  0.619  0.022
           WRSEL(b)   7.071  0.249  0.043     4.987  0.425  0.033     2.207  0.740  0.015
           PPR        6.695  0.351  0.042     4.601  0.517  0.032     2.278  0.800  0.016
           PRANK      3.268  0.549  0.026     2.214  0.689  0.018     1.385  0.879  0.007

    u = 4  SMR        2.172  0.656  0.019     1.508  0.823  0.010     1.204  0.977  0.001
           PM         2.148  0.635  0.021     1.485  0.819  0.010     1.193  0.987  0.001
           WRSEL(a)   3.767  0.442  0.032     2.081  0.706  0.017     1.228  0.961  0.002
           WRSEL(b)   2.465  0.583  0.024     1.586  0.793  0.012     1.198  0.983  0.001
           PPR        2.523  0.623  0.027     1.584  0.813  0.015     1.180  0.990  0.003
           PRANK      2.011  0.683  0.018     1.422  0.845  0.009     1.185  0.991  0.001

    u = 8  SMR        1.701  0.740  0.015     1.279  0.908  0.005     1.179  0.997  0.000
           PM         1.697  0.745  0.014     1.268  0.914  0.005     1.182  0.999  0.000
           WRSEL(a)   2.043  0.646  0.020     1.389  0.853  0.008     1.191  0.996  0.000
           WRSEL(b)   1.755  0.721  0.016     1.301  0.901  0.006     1.183  0.999  0.000
           PPR        1.781  0.741  0.019     1.287  0.915  0.010     1.179  0.999  0.001
           PRANK      1.655  0.764  0.013     1.253  0.922  0.004     1.181  0.999  0.000

    Table 2.6: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three contiguous hotspots with HIGH expected disease counts, the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                             10th                    3rd                     1st
    u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
    u = 1  SMR        4.105  0.347  0.037     2.511  0.599  0.023     1.767  0.778  0.013
           PM         3.514  0.509  0.028     2.121  0.727  0.015     1.504  0.865  0.008
           WRSEL(a)  14.481  0.014  0.056     9.552  0.119  0.050     4.741  0.432  0.032
           WRSEL(b)   7.166  0.209  0.045     4.073  0.522  0.027     2.161  0.777  0.013
           PPR        7.550  0.413  0.046     4.196  0.672  0.032     2.466  0.843  0.019
           PRANK      2.628  0.713  0.016     1.572  0.835  0.009     1.328  0.934  0.004

    u = 4  SMR        2.269  0.646  0.020     1.445  0.854  0.008     1.204  0.972  0.002
           PM         2.184  0.677  0.018     1.378  0.883  0.007     1.184  0.983  0.001
           WRSEL(a)   5.851  0.294  0.040     2.518  0.673  0.019     1.262  0.929  0.004
           WRSEL(b)   2.722  0.601  0.023     1.504  0.849  0.009     1.190  0.978  0.001
           PPR        4.034  0.648  0.032     1.583  0.867  0.023     1.184  0.984  0.010
           PRANK      1.966  0.729  0.015     1.304  0.917  0.005     1.179  0.989  0.001

    u = 8  SMR        1.965  0.705  0.017     1.291  0.909  0.005     1.163  0.994  0.000
           PM         1.967  0.718  0.016     1.268  0.923  0.004     1.160  0.997  0.000
           WRSEL(a)   3.107  0.551  0.025     1.437  0.838  0.009     1.167  0.989  0.001
           WRSEL(b)   2.091  0.691  0.018     1.290  0.911  0.005     1.161  0.995  0.000
           PPR        3.275  0.705  0.030     1.271  0.927  0.021     1.167  0.998  0.011
           PRANK      1.893  0.753  0.014     1.244  0.941  0.003     1.162  0.995  0.000

    Table 2.7: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with LOW expected disease counts, the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                             10th                    3rd                     1st
    u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
    u = 1  SMR        9.478  0.412  0.033     4.000  0.630  0.021     2.561  0.816  0.010
           PM         8.487  0.286  0.040     3.993  0.543  0.026     2.544  0.777  0.013
           WRSEL(a)   9.800  0.081  0.052     6.782  0.301  0.040     3.668  0.613  0.022
           WRSEL(b)   9.080  0.147  0.048     5.522  0.427  0.032     2.872  0.717  0.016
           PPR        8.361  0.227  0.047     5.112  0.497  0.032     2.600  0.758  0.017
           PRANK      8.642  0.391  0.034     3.926  0.631  0.021     2.482  0.823  0.010

    u = 4  SMR        3.484  0.577  0.024     1.458  0.851  0.008     1.191  0.971  0.002
           PM         3.335  0.521  0.027     1.487  0.835  0.009     1.180  0.976  0.001
           WRSEL(a)   4.550  0.367  0.036     1.871  0.750  0.014     1.208  0.954  0.003
           WRSEL(b)   3.618  0.476  0.030     1.570  0.818  0.010     1.186  0.972  0.002
           PPR        4.030  0.505  0.032     1.536  0.833  0.013     1.171  0.979  0.003
           PRANK      3.337  0.569  0.024     1.437  0.856  0.008     1.167  0.983  0.001

    u = 8  SMR        2.495  0.641  0.020     1.308  0.904  0.005     1.177  0.998  0.000
           PM         2.509  0.600  0.023     1.314  0.905  0.005     1.176  0.998  0.000
           WRSEL(a)   2.878  0.540  0.026     1.384  0.861  0.008     1.184  0.995  0.000
           WRSEL(b)   2.591  0.585  0.024     1.326  0.897  0.006     1.178  0.998  0.000
           PPR        3.277  0.603  0.025     1.302  0.915  0.009     1.174  0.999  0.001
           PRANK      2.513  0.612  0.022     1.303  0.911  0.005     1.178  0.998  0.000

    Table 2.8: Root mean squared error (RMSE), correct positive (CP) and false positive (FP) rates of SMR, PM, WRSEL(a), WRSEL(b), PPR and PRANK in identifying an isolated cluster of three non-contiguous hotspots with HIGH expected disease counts, the risks of which were inflated to about the 10th, 3rd and 1st places; the expected disease counts for all the regions are scaled by u = 1, 4 and 8.

                             10th                    3rd                     1st
    u      Method      RMSE     CP     FP      RMSE     CP     FP      RMSE     CP     FP
    u = 1  SMR        3.787  0.379  0.035     2.308  0.593  0.023     1.552  0.809  0.011
           PM         3.438  0.465  0.030     2.056  0.699  0.017     1.396  0.882  0.007
           WRSEL(a)  13.639  0.015  0.056     8.508  0.124  0.050     4.016  0.481  0.029
           WRSEL(b)   7.210  0.183  0.046     3.736  0.494  0.029     1.794  0.801  0.011
           PPR        7.308  0.375  0.047     3.646  0.647  0.032     1.678  0.871  0.018
           PRANK      2.550  0.665  0.019     1.547  0.819  0.010     1.230  0.941  0.003

    u = 4  SMR        2.032  0.649  0.020     1.341  0.885  0.007     1.170  0.973  0.002
           PM         1.991  0.668  0.019     1.297  0.918  0.005     1.154  0.985  0.001
           WRSEL(a)   4.936  0.218  0.044     1.827  0.717  0.016     1.194  0.954  0.003
           WRSEL(b)   2.349  0.569  0.024     1.356  0.885  0.007     1.159  0.981  0.001
           PPR        2.618  0.647  0.032     1.290  0.922  0.016     1.142  0.990  0.007
           PRANK      1.852  0.738  0.015     1.252  0.944  0.003     1.150  0.987  0.001

    u = 8  SMR        1.785  0.749  0.014     1.246  0.935  0.004     1.152  0.996  0.000
           PM         1.770  0.768  0.013     1.215  0.955  0.003     1.145  0.997  0.000
           WRSEL(a)   2.456  0.539  0.026     1.321  0.872  0.007     1.154  0.994  0.000
           WRSEL(b)   1.865  0.718  0.016     1.224  0.946  0.003     1.148  0.996  0.000
           PPR        1.930  0.766  0.024     1.216  0.964  0.013     1.144  0.998  0.008
           PRANK      1.756  0.799  0.011     1.212  0.959  0.002     1.145  0.996  0.000

  • Chapter 3

    Joint Analysis of Multivariate Spatial Count and Zero-Heavy Count Outcomes

    3.1 Introduction

    In public health, environmental and ecological studies, variables measured at the same

    spatial locations may be correlated so that the spatial structures of such variables

    across the region under consideration are very similar, indicating that they may be

    characterized by a common spatial risk surface. Employing such a commonality in

    risks may be useful for gaining precision of local area risk estimates, especially for

    rare diseases.

    Shared component spatial models have been studied in a variety of applied con-

    texts. Knorr-Held and Best (2001) proposed a shared-component model which mim-

    ics an ecological regression on the unobserved shared component. The two diseases

  • CHAPTER 3. COMMON SPATIAL FACTOR MODEL 39

    considered in that application share a common spatial structure and, as well,

    support disease-specific spatially uncorrelated random errors. Fitting the model requires

    strong prior assumptions on the spatial and uncorrelated random errors, typically

    because of challenges related to the identifiability of the latent spatial fields. Wang

    and Wall (2003) proposed a common spatial factor model to study multivariate indi-

    cators of cancer risk across counties in Minnesota. To avoid identifiability issues, the

    model includes the common spatial structure term but no excess heterogeneity and,

    as well, the variance of the shared spatially correlated random effect is considered

    as fixed. Hogan and Tchernis (2004) proposed a common factor model for spatial

    multivariate count data with constraints imposed on the variance structure of the

    conditional autoregressive model they employ. Congdon (2006) set out a modeling

    framework for modeling multiple health outcomes over area, age, and time dimensions

    that takes account of spatial correlation as well as interactions between dimensions.

    Tzala and Best (2006) proposed a Bayesian latent variable model for cancer mor-

    tality data, which linked spatial effects. As well, other joint modeling approaches

    for multivariate spatial data have been proposed including the multivariate version

    of the conditional autoregressive model (MVCAR) (Gelfand and Vounatsou, 2003),

    which assumes the spatial structure is the same across the multivariate outcomes.

    Such modelling allows for the pooling of information across spatial units as well as

    across multiple outcomes within units. In contrast, the common spatial factor model

    may stratify the spatial variation into two components: the shared component and

    outcome-specific components. Such a modeling approach permits a simple analysis

    of which spatial term dominates as well as an identification of the common spatial

    structure. Though testing for a common spatial structure is quite relevant in certain

    studies, there has been very little discussion of the power of such tests. We consider

    this in the context of the analysis of count data and also examine the utility of joint

    modeling in terms of gains in the efficiency of estimating relative risks.

    In environmental and ecological studies, count data are often characterized by an

    excess of zeros and spatial dependence (Clarke and Green, 1988; Welsh et al., 1996;

    Martin et al., 2005). When studying the abundance of species in ecological studies,

    having a large proportion of zero counts may indicate the habitat is unsuitable in certain

    areas, for example. In such cases, standard distributions such as Poisson, binomial

    and negative binomial may fail to provide an adequate fit. Distributions designed

    for such data are known as zero-inflated distributions (Lambert, 1992).

    For handling zero-inflation, mixture models and conditional models are

    two common approaches within the context of ecological and health studies. The

    well-known zero-inflated Poisson (ZIP) model (Lambert, 1992) is a mixture of a de-

    generate zero mass and a Poisson distribution. On the other hand, Welsh et al. (1996)

    formulate a two-component conditional model where the presence/absence of counts

    is modeled with a binomial distribution and the abundance at active sites is mod-

    eled using a truncated Poisson or truncated negative binomial distribution. These

    two models have different interpretations. Structural zeros and random zeros are not

    distinguished under the conditional specification, whereas the mixture model permits

    an examination of the different sources of error (Kuhnert et al., 2005). For more

    discussion of zero-inflated models from a Bayesian perspective see Angers and Biswas

    (2003) and Ainsworth (2007).
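The contrast between the mixture and conditional formulations described above can be sketched numerically. The block below is a minimal illustration, not code from this thesis: the function names and the parameter names `lam` (Poisson mean), `p` (probability of a structural zero in the mixture model) and `pi` (probability of presence in the conditional model) are assumptions introduced for the example, and the conditional model is shown with a truncated Poisson only.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability P(Y = k) for mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k: int, lam: float, p: float) -> float:
    """Mixture (zero-inflated Poisson) model: a structural zero with
    probability p, otherwise a Poisson(lam) draw that may itself be zero."""
    base = (1.0 - p) * poisson_pmf(k, lam)
    return p + base if k == 0 else base


def hurdle_pmf(k: int, lam: float, pi: float) -> float:
    """Conditional (two-part) model: presence with probability pi, then a
    zero-truncated Poisson(lam) count at the sites where counts are present."""
    if k == 0:
        return 1.0 - pi
    return pi * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


def zip_zero_split(lam: float, p: float) -> tuple:
    """Split P(Y = 0) under the mixture model into its two sources:
    (structural zeros, random Poisson zeros)."""
    return p, (1.0 - p) * math.exp(-lam)
```

Under the mixture model, `zip_zero_split` separates the zero probability into a structural part and a random part, whereas `hurdle_pmf` assigns all zeros to the single absence probability `1 - pi`, which mirrors the difference in interpretation noted in the text.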

    In many applications, zero-inflated count data are spatially correlated. Rathbun

    and Fei (2006) introduced a zero-inflated Poisson model, in which the component

    modeling the excess zeros is governed by a hidden spatial probit model; a threshold,

    defining large probabilities in the probit layer, governs the proportion of zeros.

    Agarwal et al. (2002) also proposed a zero-inflated model for spatial count data using a

    mixture model approach and incorporating spatial random errors into either or both

    of the model components. With multivariate zero-inflated count data corresponding

    to several related spatial outcomes, there is also the possibility of linking model com-

    ponents across the various outcomes using a shared latent spatial structure. This

    would be relevant, for example, if the underlying, hidden mechanisms resulting in the

    structural zeros, or the abundance of counts, are related across the outcomes.
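To make the idea of linking zero-inflated outcomes through a shared latent spatial structure concrete, the following is a rough simulation sketch under strong simplifying assumptions that are not taken from this thesis: regions lie on a one-dimensional transect, the shared effect is a centred random walk standing in for a spatially smooth field, and the hypothetical parameters `a`, `g`, `alpha` and `d` are an intercept and a loading for the zero part and for the count part of each outcome.

```python
import math
import random


def random_walk_field(n, sd, rng):
    """Shared latent effect on a 1-D transect: a centred first-order random
    walk, used only as a simple stand-in for a spatially smooth field."""
    b, level = [], 0.0
    for _ in range(n):
        level += rng.gauss(0.0, sd)
        b.append(level)
    mean = sum(b) / n
    return [x - mean for x in b]


def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (suitable for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_shared_zip(n, outcomes, sd=0.2, seed=1):
    """Simulate zero-inflated Poisson counts for several outcomes that all
    load on one shared latent effect b.

    outcomes: list of dicts with the hypothetical keys
      a, g      -> logit P(structural zero) = a + g * b_i
      alpha, d  -> log Poisson mean         = alpha + d * b_i
    """
    rng = random.Random(seed)
    b = random_walk_field(n, sd, rng)
    data = []
    for o in outcomes:
        y = []
        for bi in b:
            p_zero = 1.0 / (1.0 + math.exp(-(o["a"] + o["g"] * bi)))
            lam = math.exp(o["alpha"] + o["d"] * bi)
            # Structural zero with probability p_zero, else a Poisson count.
            y.append(0 if rng.random() < p_zero else rpois(lam, rng))
        data.append(y)
    return b, data
```

Setting `g` or `d` to zero for an outcome removes the shared effect from the corresponding component, which gives a simple way to explore whether the zeros, the abundance, or both are related across outcomes in simulated data.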

    The methods developed in this chapter for joint outcome analysis of spatial count

    and zero-heavy count data focus on the use of shared latent spatial frailty models.

    We discuss such joint mapping models and evaluate what benefits may be achieved

    through joint modeling. The rest of the chapter is structured as follows. Section

    3.2 describes a general modeling framework for common spatial factor models for

    count data and zero-inflated count data. Section 3.3 presents two motivating appli-

    cations, applying the common spatial factor model to Ontario lung cancer data and

    zero-inflated forestry infection data related to a study of Comandra blister rust on

    lodgepole pine trees. Section 3.4 examines hypothesis testing of whether two spatial

    maps share the same underlying spatial structure for count data. A power study is

    performed based on the situational context of the Ontario lung cancer data. Section

    3.5 compares joint and separate modeling in terms of accuracy and efficiency of es-

    timating relative risks through simulation investigations. Some closing remarks are

    provided in Section 3.6.

    3.2 Models for Joint Count Outcomes

    We present here a general modeling framework for the common spatial factor model

    for joint modeling of count data and zero-inflated count data. In disease mapping,

    the typical response is a rate (both in health and forest epidemiology), hence the

    focus on the analysis of counts herein. However, generalization of the model to other

    non-normal data is straightforward.

    3.2.1 Common Spatial Factor Model for Counts

    Let $y_{ij} \mid \mu_{ij} \sim \text{Poisson}(\mu_{ij})$ for region $i = 1, \cdots, n$ and outcome $j = 1, \cdots, J$, where $y_{ij}$ denotes the response and $\mu_{ij}$ denotes the expected mean count for outcome $j$ in region $i$. The common spatial factor model can be written as:

    $$\log(\mu_{ij}) = \alpha_j + \log(E_{ij}) + \gamma_j b_i + h_{ij}, \tag{3.1}$$

    where $\alpha_j$ denotes the overall mean rate for the $j$th outcome and $E_{ij}$ is the expected number of disease counts in region $i$ for the $j$th outcome based on some standardized rates; $b_i$, $i = 1, \cdots, n$, is the spatial random effect, assumed here to follow a conditional autoregressive distribution (Besag, 1974) to account for the spatially structured correlation