mining boundary effects in areally referenced spatial data using the bayesian information criterion

Geoinformatica (2011) 15:435–454DOI 10.1007/s10707-010-0109-0

Mining boundary effects in areally referenced spatialdata using the Bayesian information criterion

Pei Li · Sudipto Banerjee · Alexander M. McBean

Received: 2 December 2009 / Revised: 2 May 2010 /Accepted: 26 May 2010 / Published online: 18 June 2010© Springer Science+Business Media, LLC 2010

Abstract Statistical models for areal data are primarily used for smoothing mapsrevealing spatial trends. Subsequent interest often resides in the formal identificationof ‘boundaries’ on the map. Here boundaries refer to ‘difference boundaries’, rep-resenting significant differences between adjacent regions. Recently, Lu and Carlin(Geogr Anal 37:265–285, 2005) discussed a Bayesian framework to carry out edgedetection employing a spatial hierarchical model that is estimated using Markovchain Monte Carlo (MCMC) methods. Here we offer an alternative that avoidsMCMC and is easier to implement. Our approach resembles a model comparisonproblem where the models correspond to different underlying edge configurationsacross which we wish to smooth (or not). We incorporate these edge configurationsin spatially autoregressive models and demonstrate how the Bayesian InformationCriteria (BIC) can be used to detect difference boundaries in the map. We illustrateour methods with a Minnesota Pneumonia and Influenza Hospitalization dataset toelicit boundaries detected from the different models.

Keywords Areal data · Bayesian information criteria · Boundaries · Conditionallyautoregressive models · Simultaneous autoregressive models · Wombling

1 Introduction

The growing popularity of Geographical Information Systems (GIS) has gener-ated much interest in analyzing and modelling geographically referenced data.

This work was supported in part by NIH grants 1-R01-CA95995 and 1-RC1-GM092400-01.

P. Li · S. Banerjee (B) · A. M. McBeanDivision of Biostatistics, School of Public Health, University of Minnesota,Mayo Mail Code 303, Minneapolis, MN 55455–0392, USAe-mail: [email protected], [email protected]

A. M. McBeanDivision of Health Policy & Management, School of Public Health, University of Minnesota,Minneapolis, MN 55455, USA

436 Geoinformatica (2011) 15:435–454

Geographical referencing depends upon the resolution of the data: when datareferencing is done with respect to the coordinates of the location (e.g. latitudeand longitude), we call them point-referenced, as is common in environmental andecological studies, while data aggregated over regions in a map (e.g. mortality ratesby counties or zip-codes) are called areally-referenced or lattice. In the domain ofpublic health, due to patient confidentialities, data are usually of the latter typeand are usually available as case counts or rates referenced to areal regions, suchas counties, census-tracts or ZIP codes, rather than the geographical location ofthe individual residences. These regions offer a convenient way of grouping thepopulation and preserving confidentiality.

Statistical models for spatial data are primarily concerned with explaining varia-tion, separating spatial signals from noise and improving estimation and prediction.These models capture associations or correlations across space depending upon thetype of referencing in the data. For point-referenced datasets, models customarilyemploy spatial processes to capture spatial associations as a function of Euclideangeometric objects such as distance and direction. These models are popular in geo-statistics (see, e.g., [4, 13]) and provide spatial interpolation or “kriging” accountingfor uncertainty in estimation and prediction.

For areally-referenced data, the association structures are built upon adjacenciesor neighborhood structures of the regions. Here the statistical models regard ob-servations from a region to be more similar to those from its neighboring regionsthan those arising from regions farther away. These structures underpin spatiallyweighted regression models and spatial autoregressive models that have been widelyemployed for smoothing maps and evincing spatial trends and clusters. They havebeen applied extensively in econometrics (see, e.g. [1, 2, 23, 24]) and public health(see, e.g., [4, 34]).

Subsequent inferential interest often resides not in the statistically estimated mapsthemselves, but on the formal identification of “edges” or “boundaries” on the spatialsurface or map. The ‘boundary’ here refers to those on the map that reflect sharpdifferences of the outcome variable between its two incident regions. Detecting suchboundaries for contagious diseases such as influenza, can help surveillance systemscontrol or at least slow down the spread of the infection and better manage localtreatment response (e.g. targeting vaccine delivery).

In this article, we offer a BIC based spatial autoregressive model to MinnesotaPneumonia and Influenza Hospitalization data. We identify the boundaries thatseparate the more affected areas from the less affected areas. These boundaries couldprovide information to the government or other related departments to identify areasof the most rapid change in incidence and prevalence for adjusting local treatmentresponse (e.g. targeting vaccine delivery).

2 “Wombling”: detecting boundaries of abrupt change on maps

This boundary detection problem is often referred to as “wombling”, after a foun-dational article by Womble [36], much like “kriging” obtained its name from thepioneering work of Krige. For point-referenced models, investigators often seekboundaries that reflect rapid change on the estimated spatial surface. Applicationsin the literature include detection of ecotones in forests [16] and the edges of

Geoinformatica (2011) 15:435–454 437

distinct soil zones. Fortin and Drapeau [17] reported that this technique correctlydetects boundaries in both simulated and real environmental data. For example,raster wombling, also known as lattice wombling, operates on numeric raster data –where the sampling locations are aligned in a rectangular grid, forming pixels.Barbujani and Sokal [6] used raster wombling to identify genetic boundaries inEurasian human populations. Bocquet-Appel and Bacro [10] applied a multivariateapproach to genetic, morphometric and physiologic characteristics, and found that itcorrectly detected the locations of simulated transition zones. Fortin [18] delineatedboundaries for tree and shrub density, percent coverage, and species presence-absence. Recently Banerjee and Gelfand [5] developed a more formal statisticalinferential framework for detecting curves representing rapid change on estimatedspatial process surfaces.

While wombling methods have been applied extensively to point-referenced data,they are relatively less visible in areal contexts. Areal wombling, also known aspolygonal wombling, has been addressed by Jacquez and Greiling [21, 22] to estimateboundaries of rapid change for colorectal, lung and breast cancer incidence in Nas-sau, Suffolk and Queens counties in New York. They proposed algorithms assigningboundary likelihood values (BLV’s) to each areal boundary using an Euclideandistance metric between neighboring observations. This Euclidean distance metricis looked upon as a “dissimilarity” value. Dissimilarity values are calculated for eachpair of adjacent regions, adjacency being defined as sharing a border. Thus, if i and jare neighbors then the BLV associated with the edge (i, j) is ||yi − y j||, where || · || issome appropriate metric (for instance Euclidean for continuous responses, Hammingfor binary responses). Locations with higher BLV’s are more likely to be a part of adifference boundary, since the variable changes rapidly there.

While attractive in their simplicity and ease of use, the algorithmic approachesdo not account for all sources of uncertainty and can lead to spurious statisticalinference. For instance, public health data are often characterized by extremenessin counts and rates corresponding to certain thinly populated regions that arises dueto random variation in the observed data rather than any systematic differences.Statistical models assist in capturing spatial variation and separating them fromrandom noise. A more detailed review of the existing algorithmic approaches andtheir deficiencies can be found in Lu and Carlin [26], who proposed a statisticalmodelling framework to carry out areal wombling. They considered disease countdata (Yi, Ei), where Yi and Ei are the observed and internally standardized expectedcounts from the ith county and employed a spatial hierarchical model that is estimatedusing Markov chain Monte Carlo (MCMC) methods. Statistical inference proceedsfrom the posterior distribution of the parameters. Lu and Carlin [26], and Wheelerand Waller [35] investigate different metrics �ij for the BLV and identify boundariesusing the posterior means of the BLV. The CAR model, however, smooths across allgeographical neighbors, and can lead to over-smoothing and subsequent underesti-mation of several �ij’s.

To remedy this problem, Lu et al. [27] and Ma et al. [28] investigated estimatingthe adjacency matrix within a hierarchical framework using priors on the adjacencyrelationships. However, these models often involve weakly identifiable parametersthat are difficult to estimate from the data. Fairly informative prior knowledgeis required that is usually unavailable. Furthermore, they employ computationallyexpensive MCMC algorithms that can be inexorably slow in converging to the desiredposterior distributions.

438 Geoinformatica (2011) 15:435–454

The current article focuses primarily upon areally-referenced models and de-tecting “difference boundaries” on spatial maps. Our current work investigates amiddle ground between the algorithmic approaches that ignore sources of variationand the fully Bayesian hierarchical modelling approaches that are computationallyprohibitive. We treat the areal wombling problem as one of model comparison andseek to learn about difference boundaries from the data by considering the influenceof each edge on these models. For this purpose, we employ the Bayesian InformationCriteria (BIC) that has become a popular tool in statistical learning and datamining to approximate the marginal posterior probabilities of the different modelsand identify the influence of the edges. Exhausting all possible models will againbecome computationally prohibitive, hence we consider a “leave-one-out” algorithmthat assesses the influence of each edge in the map, given all the other edges arepresent.

The remainder of the paper is organized as follows. Section 3 reviews spatialautoregression models for areal data analysis. Section 5 discusses the Bayesianinferential paradigm and the Bayesian Information Criteria. Section 6 illustrates theBIC methodology for areal wombling using some simulated scenarios as well as anappication to a Pneumonia and Influenza (P& I) dataset from Minnesota. Finally,Section 7 concludes the paper with some discussion and indication towards futurework.

3 Statistical models for areal data analysis

Areal data are referenced by regions in a geographical map, which can be repre-sented as an n × n matrix W, whose (i, j)-th entry, wij, connects units in i and jspatially in some fashion. Customarily wii is set to 0. Possibilities include binarychoices, i.e. wij = 1 if i and j share some common boundary, perhaps a vertex (asin a regular grid). Alternatively, wij could reflect “distance” between units, e.g., adecreasing function of inter-centroidal distances between the units (as in a countyor other regional map). But distance can be returned to a binary determination. Forexample we could set wij = 1 for all i and j within a specified distance. Or, for a giveni, we could get wij = 1 if j is one of the K nearest (in distance) neighbors of i. WhileW is often symmetric, it is not necessarily so; for instance, the K-nearest neighborsexample provides a setting where symmetry may be violated. For the illustrations inthis article we will consider a connected map (i.e. no islands) and a symmetric binaryproximty matrix W.

As the notation suggests, the entries in W can be viewed as weights. More weightwill be associated with j’s closer (in some sense) to i than those farther away from i. Inthis exploratory context W provides the mechanism for introducing spatial structureinto statistical models. To see this consider the symmetric binary specification for Wand let y = (y1, . . . , yn)

′ be an n × 1 vector of outcomes where yi has been observedin the i-th region. An intuitively appealing spatial smoother would smooth theobervation in each region by taking the mean of its neighbors. Thus, each yi wouldbe predicted by the average of its neighbors, say yi = 1

wi+

∑j∼i y j with ∼ denoting “is

a neighbor of” and wi+ being the number of neighbors of region i. A statistical modelfor this smoother would relate the i-th observation to the mean of its neighbors.Specifically, we write yi = ∑n

j=1 ρwijy j + εi, where wij = wij/wi+, ρ is a parameter

Geoinformatica (2011) 15:435–454 439

representing the strength of the spatial association and εiiid∼∼ N(0, τ 2) is a stochastic

error or noise component for each observation. This error could be representativeof variability from a number of sources including unobserved explanatory variables,sampling error and so on.

The key problem in statistical inference is to sensibly model spatial associationsover the map while yielding a theoretically valid joint probability distributions.Letting W be the row-normalized matrix with entries wij, we can write the abovemodel as y = ρWy + ε, whence y = (I − ρW)−1ε. Provided that the inverse exists,we have the dispersion matrix of y as �(τ 2, ρ, W) = τ 2[(I − ρW ′)(I − ρW)]−1. Usingstandard eigen-analysis (see, e.g., [1, 4]) it can be shown that (I − ρW)−1 existswhenever ρ ∈ (1/λ(1), 1), where λ(1) is the smallest eigen-value of W. It is also truethat λ(1) is real-valued and negative, but restricting ρ ∈ (0, 1) yields non-negativeelements in (I − ρW)−1. This seems to be more intuitive in spatial settings, wherenegative associations between proximate locations is difficult to envision. With thisrestriction on ρ, we obtain a valid joint multivariate Gaussian distribution y ∼MV N(0, �(τ 2, ρ, W)). This is called a Simultaneous Autoregression (SAR) model.

In public-health data analysis contexts, it is often desired to carry out spatialinference after adjusting for certain important covariates that explain the large-scalevariation seen in the data. This leads to random effect or hierarchical linear models,

yi = x ′i β + φi, i = 1, . . . , n, (1)

where yi is the dependent variable, xi a vector of areally-referenced regressors and φi

are spatial random effects that model association between adjacent regions. Thus, wewould now let φ = (φ1, . . . , φn)

′ follow a SAR model, i.e. φ ∼ MV N(0, �(τ 2, ρ, W)).A very important point, at least in our current context, is to note that the SAR modelsare well-suited to maximum likelihood estimation but not at all for MCMC fitting ofBayesian models. In fact, the log likelihood associated with Eq. 1 is

1

2log

∣∣∣τ−1

(I − ρW

)∣∣∣ − 1

2τ 2

(y − Xβ

)′ (I − ρW

) (I − ρW ′

) (y − Xβ

). (2)

Though ρW will introduce a regression or autocorrelation parameter, the quadraticform is quick to evaluate (requiring no matrix inverse) and the determinant can usu-ally be calculated rapidly using diagonally dominant, sparse matrix approximations.Thus maximization can be done iteratively but, in general, efficiently. On the otherhand, we note that the absence of a hierarchical form with random effects impliescomplex Bayesian model fitting.

The SAR model, with the help of the proximity matrix, captures spatial associa-tions by assuming that neighboring regions are likely to exhibit greater associationthan regions that are not neighbors. This degree of association is controlled by theso-called spatial autocorrelation parameter ρ. A consequence of this is that the SARmodel smooths the outcome across neighboring regions to produce maps that betterreveal regions where the response tends to cluster. However, smoothing across allgeographical boundaries may lead to oversmoothing resulting in maps that wouldtend to conceal difference boundaries. Arriving at models that are formally selectedusing a statistical paradigm will deliver the optimal adjacency matrix Wk and, in theprocess, would have solved the “wombling” problem by identifying “true” edges thatshould not be smoothed across. Indeed these edges are the “complements” of Wk in

440 Geoinformatica (2011) 15:435–454

being those entries that were 1 in W but are 0 in Wk, i.e., edges corresponding to the1 entries in W − Wk.

Unlike SAR models, a Conditional Autoregression (CAR) specification wouldmodel each effect conditional upon the remaining effects. Such a model specifiesconditional distributions

φi | φ j, j �= i ∼ N

⎛

⎝ρ

K∑

j=1

wij

wi+φ j,

τ 2

wi+

⎞

⎠ .

Besag [9] proved that these full conditional distributions specify a joint distributionfor the φi’s such that φ ∼ MV N(0, τ 2[D − ρW]−1). Now, one needs to make surethat D − ρW is positive definite, a sufficient condition for which (see, e.g., [4]) is torestrict ρ ∈ (1/λ(1), 1). Assuncao and Krainski [3] also provide discussion of the ρ

parameter and explanations that help the practitioner to view the covariance matrixof a CAR model in a natural way. The CAR model has been especially popular inBayesian inference as its conditional specification is convenient for Gibbs samplingand MCMC schemes. The relationship between the CAR and SAR models in termsof resulting spatial correlations and their interpretations has been explored by Wall[33].

4 Statistical detection of boundary effects

Returning to our primary problem of detecting difference boundaries, we formulatethis problem as one of comparing between models that represent different boundaryhypotheses. A boundary hypotheses corresponds to a particular underlying mapspecifying which edges should be smoothed over and which should not. A fewissues, however, arise regarding the exact choice of the model. For instance, considerthe hypothesis of no difference boundaries at all in the map. What model wouldcorrespond to this hypothesis?

If we believe there are no difference boundaries at all, should we consider themap as comprising a single region? This implies having no region-specific effectsat all or, equivalently, W = 0 (the null matrix), thereby reducing Eq. 1 to a simplelinear regression model with no random effects and ε ∼ MV N(0, σ 2 I). Alternatively,we could still regard yi as arising from n different regions but, given the absenceof difference boundaries, we would retain independent regional effects instead ofspatial structures in the model. This would amount to ρW = I so that Cov(φ, ε) = 0and we obtain a linear random effects model with iid regional effects. The choice isnot straightforward and may depend upon the objectives of the analysis.

Here we will adopt the second approach, where we always retain random effectsand consider models varying in their specification of W that controls spatial smooth-ing. At the other extreme all the geographical edges may in fact be differenceboundaries. Any intermediate model that lies between these extremes is completelyspecified by modifying the original map to delete some edges.

Ideally we would like to consider a class of models M = {M1, . . . , MK} represent-ing all possible models or all possible maps derived from W by deleting combinationsof geographical edges. In other words, let W = {wij} be the adjacency matrix ofthe map (i.e., wii = 0, and wij = 1 if i is adjacent to j and 0 otherwise). ModelMk will be a SAR model with the adjacency matrix Wk that has been derived by

Geoinformatica (2011) 15:435–454 441

changing some of the 1’s to 0’s in W. This amounts to dropping some edges fromthe original map or, equivalently, combining two regions into one. However, nowwe encounter an explosion in the number of models. To be precise, if W is theoriginal geographical map, we have 21′W1/2 models to compare. This is infeasibleand will require sophisticated MCMC Model Composition or MC3 algorithms (see.e.g., [20]) for selecting models. These formal statistical methods will again becomecomputationally intensive and inconducive for learning of edge effects in large maps.Therefore, we consider only models that arise by changing only one entry in the Wmatrix. This avoids MCMC and resorts to the simpler Bayesian Information Criteriathat requires only the maximum likelihood estimates for the models.

5 The Bayesian information approach

By modelling both the observed data and any unknown parameter or other unob-served effects as random variables, the hierarchical Bayesian approach to statisticalanalysis provides a cohesive framework for combining complex data models andexternal knowledge or expert opinion (e.g., [8, 11, 19, 32]). In this approach, inaddition to specifying the distributional model f (y|θ) for the observed data y =(y1, . . . , yn) given a vector of unknown parameters θ = (θ1, . . . , θk), we suppose thatθ is a random quantity sampled from a prior distribution p(θ | γ ), where γ is avector of hyperparameters. Inference concerning θ is then based on its posteriordistribution,

p(θ | y, γ ) = p(y, θ | γ )

p(y | γ )= p(y, θ | γ )

∫p(y, θ | γ ) dθ

= p(y | θ)p(θ | γ )∫

p(y | θ)p(θ | γ ) dθ. (3)

Notice the contribution of both the data (in the form of the likelihood p(y | θ)) andthe external knowledge or opinion (in the form of the prior p(θ |γ )) to the posterior.If γ is known, this posterior distribution is fully specified; if not, a second-stage priordistribution (called a hyper-prior) may be specified for it, leading to a fully Bayesiananalysis. Alternatively, we might simply replace γ by an estimate γ obtained as thevalue which maximizes the marginal distribution p(y | γ ) viewed as a function of γ .Inference proceeds based on the estimated posterior distribution p(θ | y, γ ), obtainedby plugging γ into Eq. 3. This is called an empirical Bayes analysis and is closer tomaximum likelihood estimation techniques.

The Bayesian decision-making paradigm improves upon the classical approachesto statistical analysis in its more philosophically sound foundation, its unified ap-proach to data analysis, and its ability to formally incorporate prior opinion orexternal empirical evidence into the results via the prior distribution. Statisticians,formerly reluctant to adopt the Bayesian approach due to general skepticism con-cerning its philosophy and a lack of necessary computational tools, are now turningto it with increasing regularity as classical methods emerge as both theoretically andpractically inadequate. Modelling the θi’s as random (instead of fixed) effects allowsus to induce specific (e.g. spatial, temporal or more general) correlation structuresamong them, hence among the observed data yi as well. Hierarchical Bayesianmethods now enjoy broad application in the analysis of complex systems, whereit is natural to pool information across from different sources (e.g. [19]). ModernBayesian methods seek complete evaluation of the posterior distributions using

442 Geoinformatica (2011) 15:435–454

simulation methods that draw samples from the posterior distribution. This sampling-based paradigm enables exact inference free of unverifiable asymptotic assumptionson sample sizes and other regularity conditions.

A computational challenge in applying Bayesian methods is that for many com-plex systems, inference under Eq. 3 generally involves distributions that are in-tractable in closed form, and thus one needs more sophisticated algorithms to samplefrom the posterior. Forms for the prior distributions (called conjugate forms) mayoften be found which enable at least partial analytic evaluation of these distributions,but in the presence of nuisance parameters (typically unknown variances), someintractable distributions remain. Here the emergence of inexpensive, high-speedcomputing equipment and software comes to the rescue, enabling the applicationof recently developed Markov chain Monte Carlo (MCMC) integration methods,such as the Metropolis-Hastings algorithm and the Gibbs sampler. See the books byGelman et al. [19], Carlin and Louis [11] and Robert [32] for details on Bayesiananalysis and computing.

Bayesian inference proceeds by considering a set of models, say M ={M1, . . . , MK}, each representing a hypothesis, and then selecting the best model(s)using some statistical metric. Assuming that model Mj has parameters, say θ j,associated with it and we have specified priors p(θ j | Mj) for each j, we will seekposterior distributions of the model itself,

p(Mj | y) = p(Mj)p(y | Mj)∑K

k=1 p(Mk)p(y | Mk)= p(Mj) × ∫

p(θ j | Mj)p(y | θ j, Mj) dθ j∑K

k=1 p(Mk) × ∫p(θk | Mk)p(y | θk, Mk) dθk

.

(4)To compare two models, say M1 and M2, we form their posterior odds

p(M1 | y)

p(M2 | y)= p(M1)

p(M2)× p(y | M1)

p(y | M2). (5)

If the odds are greater than one, we choose model M1, otherwise we opt for M2. TheBayes factor is defined as

BF12 = p(y | M1)

p(y | M2)(6)

and represents the contribution of the data towards the posterior odds. Often, wewill have no prior reason to favor one model over another and the posterior oddswill equal the Bayes Factor. Thus, one seeks to evaluate the marginal distribution ofthe data, given a model, as

p(y | Mj) =∫

p(y|θ j)p(θ j | Mj) dθ j. (7)

Computation of the marginal distribution in Eq. 7 for general hierarchical models canbe much more complicated and has occupied plenty of attention over the last severalyears. Many of these methods, while offering better evaluations and approximations,involve computationally intensive simulation algorithms, such as MCMC methods,that require much finessing and several thousands of iterations to yield accurateresults.

LeSage and Parent [25] provide a computationally simple and fast approach toevaluating the true log-marginal likelihood for the SAR model. However, theirapproach seems to be best suited to the Zellner g-prior on the regression coefficients

Geoinformatica (2011) 15:435–454 443

and may not be directly applicable to more general priors. LeSage and Pace [24, Ch.6]also discuss the issue of comparing SAR models based on different adjacency matri-ces W. They rely upon univariate numerical integration over the range of supportfor the parameter ρ in the SAR model, which involves calculating log (det(In − ρW))

for every value of the parameter ρ. Computationally fast methods to compute the logdeterminant terms are presented in Pace and Barry [29] and Barry and Pace [7] . Herewe approximate the marginal distribution in Eq. 7 using the Laplace-approximation,which avoids the numerical integration over the parameter space.

In edge detection problems, as outlined earlier, we encounter a large number ofmodels. Hence, a faster approach will be to employ an inexpensive approximationto Eq. 7. One such approximation is based upon a Laplace-approximation and somesubsequent simplifications (see, e.g., [30]):

log p(y | Mj) = log p(y | θ j

) − dim(Mj)

2log n + O(1), (8)

where dim(Mj) is the number of parameters that are being estimated in the modelMj. The Bayesian Information Criteria is derived from this approximation as

BIC j = −2 log p(y | θ j

) + dim(Mj) log n. (9)

Therefore, choosing the model with the minimum BIC amounts to choosing themodel with the maximum posterior probability. In fact, if we consider the modelsin M then we can estimate the posterior probability for each model as

p(Mj | y) ≈ exp(− 1

2 BIC j)

∑Kk=1 exp

(− 12 BICk

) . (10)

The advantages of computing the posterior model probabilities as Eq. 10 includecomputational simplicity and a direct connection with the thoroughly investigatedBIC. While the justification of the approximation Eq. 10 is asymptotic in general,this can also be looked upon as an approximation for a noninformative prior evenfor moderate and small sample sizes.

6 Bayesian information criteria in SAR/CAR models to detectdifference boundaries: illustrations

6.1 Simulation experiments

We illustrate our model comparison approach first with some simulation studiesand then apply it to a real data analysis in Section 6.2. The simulation was set upunder two scenarios: with and without explanatory variables. The spatial adjacencymatrix was based on the Minnesota county map and in another scenario, the 8 × 8rectangular grid. There are 87 counties and 211 boundaries between counties on theMinnesota map, thus there are 211 different boundary hypotheses in our analysis.We generated data {Yi} from a Poisson distribution whose true parameter values areknown.

Without the explanatory variables, we divided the Minnesota map into six regions,and let μi ∈ {0, 0.5, 1, 1.5, 2, 2.5} with the true difference boundaries mapped onFig. 1. Note that two of the clusters are shaded white. The one in the interior

444 Geoinformatica (2011) 15:435–454

Fig. 1 A map of the simulateddata with the grey-scalesshowing the six differentclusters, each having its ownmean. Two of the clusters areshaded white with the one inthe interior comprising a singlecounty (Sherburne) and hasmean 0, while the other has amean of 0.5. There are 47boundary segments thatseparate regions with differentmeans (shades)

0,0.51.01.52.02.5

comprises a single county (Sherburne) and has mean 0, while the other has a mean of0.5. This configuration creates a county with all its boundaries being true differenceboundaries. Letting Yi be the simulated number of cases in county i, we generate

{Yi} ∼ Poisson(5 exp(μi)) for i = 1, 2, . . . , 87. Let Ei =∑N

k=1 Yi∑N

k=1 OiOi be the expected

number of cases, where Oi is the population of county i, and N is the total number ofcounties. Assuming equal population in all counties, we take the log standardizedmorbidity ratio, yi = log(100 × Yi/Ei), as our outcome variable. Note that this isessentially a relative rate expressed as a percentage and transformed to a logarithmicscale to strengthen its Gaussian behavior.

We next fit the model in Eq. 1 with yi as the response, where the regressionstructure consists only of an intercept, i.e. x′

iβ = β0, and {φi}’s follow a SAR and aCAR distribution (see Section 3). There are 211 different boundary hypothesis inour analysis, as there are 211 edges on the Minnesota County Map and each modelarises from deleting (hence not smoothing across) exactly one geographical edge.We compute the BIC for all these models using Eq. 9, in which the log likelihood iscomputed by Eq. 2. Therefore, each model corresponds to one edge and the modelswith higher posterior probabilities provide evidence in favor of the correspondingedge being a difference boundary.

As we know the 47 true difference boundaries on the map, we are able to obtainthe “true” detection rates (sensitivity) for the BIC approach by declaring the edgescorresponding to the top 47 models as difference boundaries. We compare the resultswith the existing methods, e.g. Lu and Carlin [26] (LC). The average detection rateof the 50 simulated datasets for the different methods are listed in Table 1. Thesensitivity of the BIC based model comparison approach is competitive with the

Geoinformatica (2011) 15:435–454 445

Table 1 Average detection rate in the simulation study (50 datasets) by BIC-based model compar-ison approach with SAR and CAR spatial effects, the LC method as well as a simple ranking basedupon absolute differences

MN county map 8 × 8 grid

BIC-SAR 77.2 73.3BIC-CAR 74.1 64.6LC 78.7 79.5BLV’s 83.8 78.5

The simulation study was based on MN county map and 8 × 8 rectangular grid respectively

LC approach, especially when the SAR prior probability model is specified for thespatial effects. Indeed, Table 1 reveals that the detection rate for the BIC basedapproach using SAR spatial effects differs from the Lu and Carlin method by only1.5%. With CAR spatial effects, this difference is slightly higher at 4.6%. Yet, theinferential procedure in LC employs MCMC algorithms that involve substantiallygreater computational demands. The BIC based model comparison approach thatwe proffer here provides much quicker inference via generalized least squares, whilelosing little sensitivity as compared to the LC method.

In another scenario, we conduct the simulation on an 8 × 8 rectangular grid. Wepartition the 64 rectangular units into 5 areal clusters, as depicted in Fig. 2. We assignfour grey-scale values to these five clusters, with the clusters in the lower left andthe extreme right column having the highest true grey-scale value of 5.0. The regionin the upper middle has the lowest true value of 2.0, the upper-left has a true valueof 3.0, while the cluster in the lower middle has a true value of 4.0. Analogous tothe previous scenario, we generate {Yi} ∼ Poisson(exp(μi)), and use yi = log(100 ×Yi/Ei) as the response variable in the model. The average detection rate of the 50simulated datasets for the different methods are listed in Table 1. The detection rate

Fig. 2 A 8 × 8 rectangulargrid of the true values with thegrey-scales showing the fivedifferent clusters. There are atotal of 112 boundaries ofwhich 22 are designated astrue difference (“wombling”)boundaries

2

4

6

8

2 4 6 8

2.0

2.5

3.0

3.5

4.0

4.5

5.0

446 Geoinformatica (2011) 15:435–454

for the BIC based approach using SAR spatial effects differs from the Lu and Carlinmethod by 6.2%. With CAR spatial effects this difference is slightly higher at 14.9%.

We also computed the BLV’s associated with the edges (see Section 2). These arethe absolute difference of the outcomes, i.e. �ij = ‖yi − y j‖, computed for every pairof adjacent counties. We identified the 47 highest BLV’s as difference boundaries(see, e.g., Jacquez and Greiling [21, 22]). The last row of Table 1 presents theaverage detection rates from the 50 simulated datasets based upon the �ij’s. Thesewere 84.8% for the Minnesota county map and 78.5% for the 8 × 8 grid. While the�ij’s seem to provide a simple and practical way of identifying boundaries for theoutcome variable, this procedure is not model-based and will not apply to our nextscenario.

In our final scenario, we assume that there is an explanatory variable associatedwith the outcome. In other words, we now set x′

i = (1, xi) in Eq. 1, where we generateeach xi from N(μi, σ ) with μi taking values in 0, 0.2, 0.4, 0.6, 0.8, 1, depending uponwhere county i lies in the map in Fig. 1, and σ = 0.5. We subsequently fix thesegenerated xi’s and draw the spatial random effects, i.e. the φi’s from a SAR model.Next, we simulate Yi ∼ Poisson(exp(θ0 + θ1xi + φi)) with θ0 = 0, θ1 = 5. In fitting themodel and carrying out subsequent boundary analysis, we again use yi = log(100 ×Yi/Ei) as our outcome variable, where Ei is as defined in the preceding scenarios.

When the model accommodates an explanatory variable, as in our current sce-nario, the boundary effects of interest pertain not to the outcome, but the residualsafter adjusting for the explanatory variable. Since the residuals are never observed,BLVs defined as Euclidean metrics between observations is no longer applicable fordetecting the difference boundaries on the spatial residual map. Nevertheless, theBIC based approach and the LC method are both able to do so. True differenceboundaries are unknown in this case. As such, we compute the rank of each of the211 models based on BIC, and compare them with the difference between the spatialresiduals produced by LC. Figure 3 plots the difference in spatial residuals from theLC approach against the rank of the model using BIC. A locally weighted scatter-plot

Fig. 3 A simulation examplewith a single explanatoryvariable in the model. Thex-axis is the expectation of theabsolute difference betweenthe spatial residuals of theadjacent counties by LCmethod. The y-axis marks theranks produced by BIC for the211 models using SAR spatialeffects. A loess smoothed lineis also shown on the plot

0.2 0.4 0.6 0.8 1.0

050

100

150

200

LC

Mod

el R

ank

Geoinformatica (2011) 15:435–454 447

smoother or “loess” [12] is also fitted ot the plot. The figure reveals a very cleardecreasing trend indicating that the model with a better fit (lower BIC) will tendto have larger differences in the spatial residuals from the LC method. This indicatesconsistency between the BIC-based methods and the LC method. Again, we note thecomputational efficiency of the BIC-based methods as compared to the LC method.

6.2 Application to Minnesota P & I dataset

We apply our model comparison approach to a Pneumonia and Influenza diagnosisdataset from the state of Minnesota. Residents of Minnesota who were 65 yearsof age and older and who were enrolled in the Medicare fee-for-service programas of December 31, 2001, were included in our study. This population had beenidentified as part of a multi-year study regarding the impact of vaccination on elderlyMinnesota residents. The Medicare Denominator file for 2001 was used to define thecohort. In addition to meeting the criteria for age and state of residence, to be eligiblefor inclusion in the study the person had to be enrolled in both Medicare Part A andMedicare Part B, not be enrolled in a Medicare Advantage health plan, and not haveend-stage renal disease. The Denominator file also indicated the county of residencefor each person. County-level average per capita income was obtained from the 2000U.S. Census SF3 file.

Hospitalizations for pneumonia and influenza (P&I) were identified by theMedicare Provider Analysis and Review (MedPAR) short stay inpatient file for the

Table 2 Names of adjacent counties that have significant boundary effects from the SAR model

1 Cook, Lake 26 Murray, Pipestone2 Itasca, Koochiching 27 Morrison, Todd3 Beltrami, Koochiching 28 Redwood, Yellow Medicine4 Steele, Waseca 29 Jackson, Martin5 Pope, Stearns 30 Goodhue, Wabasha6 Cass, Wadena 31 Otter Tail, Todd7 Todd, Wadena 32 Douglas, Pope8 Murray, Redwood 33 Becker, Wadena9 Traverse, Wilkin 34 Brown, Renville10 Koochiching, Lake of the Woods 35 Kandiyohi, Pope11 Freeborn, Steele 36 Benton, Morrison12 Isanti, Sherburne 37 Fillmore, Houston13 Clearwater, Mahnomen 38 Anoka, Isanti14 Renville, Yellow Medicine 39 Blue Earth, Brown15 Chippewa, Renville 40 Hubbard, Wadena16 Cottonwood, Murray 41 Carver, McLeod17 Isanti, Mille Lacs 42 Clay, Otter Tail18 Koochiching, St. Louis 43 Blue Earth, Watonwan19 Grant, Wilkin 44 Mille Lacs, Morrison20 Lyon, Redwood 45 Murray, Nobles21 Becker, Mahnomen 46 Dodge, Olmsted22 Cottonwood, Jackson 47 Morrison, Stearns23 Lincoln, Pipestone 48 Douglas, Grant24 Goodhue, Olmsted 49 Dakota, Goodhue25 Cass, Morrison 50 Fillmore, Olmsted

The numbers in the first column are the ranks according to their BIC scores

448 Geoinformatica (2011) 15:435–454

Table 3 Names of adjacent counties that have significant boundary effects from the CAR model

1 Cook, Lake 26 Goodhue, Wabasha2 Itasca, Koochiching 27 Otter Tail, Todd3 Beltrami, Koochiching 28 Lincoln, Pipestone4 Pope, Stearns 29 Hubbard, Wadena5 Steele, Waseca 30 Carver, McLeod6 Cass, Wadena 31 Becker, Wadena7 Todd, Wadena 32 Douglas, Pope8 Renville, Yellow Medicine 33 Anoka, Isanti9 Koochiching, St. Louis 34 Cass, Morrison10 Clearwater, Mahnomen 35 Kandiyohi, Pope11 Isanti, Sherburne 36 Murray, Pipestone12 Chippewa, Renville 37 Morrison, Todd13 Koochiching, Lake of the Woods 38 Brown, Renville14 Freeborn, Steele 39 Clay, Otter Tail15 Murray, Redwood 40 Blue Earth, Watonwan16 Isanti, Mille Lacs 41 Dodge, Olmsted17 Traverse, Wilkin 42 Blue Earth, Watonwan18 Cottonwood, Murray 43 Mahnomen, Norman19 Grant, Wilkin 44 Le Sueur, Scott20 Becker, Mahnomen 45 Benton, Morrison21 Lyon, Redwood 46 Aitkin, Kanabec22 Goodhue, Olmsted 47 Dakota, Goodhue23 Cottonwood, Jackson 48 Fillmore, Houston24 Redwood, Yellow Medicine 49 Mille Lacs, Morrison25 Jackson, Martin 50 Dakota, Hennepin

The numbers in the first column are the ranks according to their BIC scores

above Minnesota residents. This annual file contains one record per hospitalizationbased on the date of discharge. Hospitalizations for P&I (Pneumonia and Influenza)were identified using ICD-9-CM codes 481–487. Rates of P&I hospitalization aretraditional measures of the impact of influenza virus in the elderly population.Boundary analysis might help identify barriers separating counties that experience

Fig. 4 Difference Boundariesdetected by the BIC basedmodel comparison approachwith SAR spatial effects. Top50 boundaries correspondingto models with lowest BIC arehighlighted. The map for theCAR spatial effects is verysimilar and not shown

Geoinformatica (2011) 15:435–454 449

different impacts of the influenza virus. Here we studied the number of hospital-izations from P&I in both influenza and shoulder period among persons at riskin each county. We adjust for the average income per person in each county byincorporating it as an explanatory variable in our model. Therefore, the vector xin Eq. 1 has two columns, the other being an intercept. Let Yi be the observed

number of hospitalizations in county i, Ei =∑N

k=1 Yi∑N

k=1 OiOi be the expected number of

Fig. 5 Plot of model rank(SAR: top panel; CAR: bottompanel) against the absolutedifference of the observed logstandardized morbidity ratio.The horizontal axis is the rankof the models in terms ofincreasing BIC. The verticalaxis is the absolute differenceof the observed logstandardized morbidity ratio.A loess smoothed line is alsoshown on the plots

0 50 100 150 200

0.0

0.2

0.4

0.6

0.8

1.0

1.2

SAR

model rank

0 50 100 150 200

model rank

|yi–

yj|

0.0

0.2

0.4

0.6

0.8

1.0

1.2

CAR

|yi–

yj|

450 Geoinformatica (2011) 15:435–454

cases, where Oi is the population (age 65 and older) of county i, and N is the totalnumber of counties. Similar to the simulation study, we take log transformationyi = log(100 × Yi/Ei) as our outcome variable under study.

The intercept from the SAR model was estimated to be 5.12 (mean) with a95% credible interval (4.62, 5.63), while for the CAR model these were 5.13 and(4.63, 5.63) respectively. The regression coefficient for average income per personwas estimated to be −0.017 with a 95% credible interval (−0.037, 0.003) from theSAR model, while they were −0.017 and (−0.037, 0.002) from the CAR model. Theestimate of τ 2 from the SAR and CAR models were 0.097 and 0.095 respectively,while ρ was estimated to be 0.072 and 0.127 respectively. The parameter ρ hasdifferent interpretations for the SAR and CAR models and can be looked upon as ameasure of spatial smoothing. It is, however, dangerous to interpret this as a spatialcorrelation in the strict statistical sense [33].

Tables 2 and 3 list the adjacent counties having the 50 largest boundary effects,ranked by their BIC scores from the SAR and CAR models respectively. Forty-sixof the top fifty pairs are present in both tables (although they may not necessarilyagree in rank), while four (in bold) boundaries are unique. The top seven countypairs also agree in their rankings. The fifty difference boundaries detected by themodel comparison approach using SAR spatial effects are highlighted in Fig. 4.Figure 5 reveals a similarly consistent performance between the SAR and CAReffects in detecting difference boundaries on this map. Models ranked higher (or

Fig. 6 Choropleth map ofresiduals from the SAR model.The map from the CAR modelis very similar. Darker colorsrepresents higher value of thespatial residuals after adjustingfor the covariate, which alsoimplies the county is moreaffected by P&I

0–20%20%– 40%40%– 60%60%– 80%80%– 100%

Geoinformatica (2011) 15:435–454 451

that fit better) usually detect boundaries with a big difference in the raw data, whichcorroborates our approach. But some of the points with large differences in the rawdata are ranked low in the plot. This could be attributed to the smoothing andborrowing of strength from neighboring regions that could diminish the strengthof the neighboring regions and dilute the difference in some cases. Figure 6 is thechoropleth map of the spatial residuals under the model which has a neighborhoodstructure with fifty detected boundaries (i.e. non-smoothing boundaries).

7 Discussion and future directions

Clearly we have only skimmed the surface of the edge detection problem. Infact, here we have investigated the Bayesian Information Criteria and its utility inmarginal probability approximations. Still, our approach of formulating the edgedetection problem as a model comparison problem is relatively novel. We viewour current work as a relatively simple data-mining tool that can suggest influentialboundary effects in health maps. The use of the BIC is straightforward in our leave-one-out framework and can prove a useful tool for spatial analysts.

Our future methodolological investigations will focus upon three directions: (1)more sophisticated model search algorithms such as MC3 (Markov Chain MonteCarlo Model Composition) and Bayesian Model Averaging algorithms that exhaustthe space of all models (see [20]); (2) using Bayesian False Discovery Rates togetherwith a formal Bayesian hypothesis testing framework to make decisions regardingwombling boundaries, and (3) to develop alternative nonparametric Bayesian mod-els for areal data that would facilitate boundary detection.

The third direction merits some further discussion. Instead of incorporatingrandom “edge effects” 0 in Lu et al. [27], Ma et al. [28], one can explore an alternativestochastic mechanism that would let us detect wombling boundaries by consideringprobabilities such as P(φi = φ j | i ∼ j). Clearly using direct CAR specifications willsimply not work as it yields continuous measures for the φis, rendering P(φi =φ j | i ∼ j) = 0. The challenge here is to model the spatial effects in an almost surelydiscrete fashion while at the same time accounting for the spatial dependence. Anonparametric Bayesian framework that models the spatial effects as almost surediscrete realizations of some distribution comes to mind—the Dirichlet process [15]has been employed extensively for modelling clustered data and presents itself asa natural choice, but how do we accommodate the spatial dependence? These andother issues will form a part of our future research plans.

References

1. Anselin L (1988) Spatial econometrics: methods and models. Kluwer, Boston2. Anselin L (1990) Spatial dependence and spatial structural instability in applied regression

analysis. J Reg Sci 30:185–2073. Assuncao R, Krainski E (2009) Neighborhood dependence in Bayesian spatial models. Biom J

51:851–869

452 Geoinformatica (2011) 15:435–454

4. Banerjee S, Carlin BP, Gelfand AE (2004) Hierarchical modeling and analysis for spatial data.Chapman and Hall/CRC Press, Boca Raton

5. Banerjee S, Gelfand AE (2006) Bayesian wombling: curvilinear gradient assessment underspatial process models. J Am Stat Assoc 101:1487–1501

6. Barbujani G, Sokal RR (1990) Zones of sharp genetic change in Europe are also linguisticboundaries. Proc Natl Acad Sci USA 87:1816–1819

7. Barry R, Pace RK (1999) A Monte Carlo estimator of the log determinant of large sparsematrices. Linear Algebra Appl 289:41–54

8. Berger JO (1985) Statistical decision theory and Bayesian analysis. Springer-Verlag, Berlin.ISBN 0-387-96098-8

9. Besag J (1974) Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc, B36(2):192–236

10. Bocquet-Appel JP, Bacro JN (1994) Generalized wombling. Syst Biol 43:442–44811. Carlin BP, Louis TA (2000) Bayes and empirical bayes methods for data analysis, 2nd edn.

Chapman and Hall/CRC Press, Boca Raton12. Cleveland WS, Devlin SJ (1988) Locally-weighted regression: an approach to regression analysis

by local fitting. J Am Stat Assoc 83:596–61013. Cressie NAC (1993) Statistics for spatial data, 2nd edn. Wiley, New York14. Cromley EK, McLafferty SL (2002) GIS and public health. Guilford Publications, New York15. Ferguson TS (1973) A Bayesian analysis of some nonparametric problems. Ann Stat 1:209–

23016. Fortin MJ (1994) Edge detection algorithms for two-dimensional ecological data. Ecology

75:956–96517. Fortin MJ, Drapeau P (1995) Delineation of ecological boundaries: comparisons of approaches

and significance tests. Oikos 72:323–33218. Fortin MJ (1997) Effects of data types on vegetation boundary delineation. Can J For Res

27:1851–185819. Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 2nd edn. Chapman

and Hall/CRC Press, Boca Raton20. Hoeting JA, Madigan D, Raftery AE, Volinsky C (1999) Bayesian model averaging: a tutorial.

Stat Sci 14:382–41721. Jacquez GM, Greiling DA (2003). Local clustering in breast, lung and colorectal cancer in Long

Island, New York. Int J Health Geogr 2:322. Jacquez GM, Greiling DA (2003). Geographic boundaries in breast, lung and colorectal cancers

in relation to exposure to air toxics in long Island, New York. Int J Health Geogr 2:423. LeSage JP (1997) Analysis of spatial contiguity influences on state price level formation. Int J

Forecast 13:245–25324. LeSage JP, Pace K (2009) Introduction to spatial econometrics. Chapman and Hall/CRC, Boca

Raton, FL25. LeSage JP, Parent O (2007) Bayesian model averaging for spatial econometric models. Geogr

Anal 39:241–26726. Lu H, Carlin BP (2005) Bayesian areal wombling for geographical boundary analysis. Geogr

Anal 37:265–28527. Lu H, Reilly C, Banerjee S, Carlin BP (2007) Bayesian areal wombling via adjacency modeling.

Environ Ecol Stat 14:433–45228. Ma H, Carlin BP, Banerjee S (2010) Hierarchical and joint site-edge methods for Medicare

hospice service region boundary analysis. Biometrics 66(2):355–36429. Pace RK, Barry R (1997) Sparse spatial autoregressions. Stat Probab Lett 33:291–29730. Raftery AE (1995) Bayesian model selection in social research (with discussion). In: Marsden

PV (ed) Sociological methodology 1995. Blackwell, Cambridge, MA, pp 111–19531. Raftery AE, Madigan D, Hoeting J (1997) Bayesian model averaging for linear regression

models. J Am Stat Assoc 92:179–19132. Robert C (2001) The Bayesian choice, 2nd ed. Springer, New York33. Wall MM (2004) A close look at the spatial correlation structure implied by the CAR and SAR

models. J Stat Plan Inference 121(2):311–32434. Waller, Gotway (2004) Applied spatial statistics for public health data. Wiley, New York35. Wheeler D, Waller L (2008) Mountains, valleys, and rivers: the transmission of raccoon rabies

over a heterogeneous landscape. J Agric Biol Environ Stat 13:388–40636. Womble WH (1951) Differential systematics. Science 114:315–322

Geoinformatica (2011) 15:435–454 453

Pei Li is a graduate research assistant in the Division of Biostatistics at the University of Minnesota.Pei joined the Division of Biostatistics in 2005 after graduating from Peking University. Shecompleted an MS in Biostatistics in 2007 and is currently a PhD candidate at the University.

Sudipto Banerjee is an Associate Professor of Biostatistics in the School of Public Health, Universityof Minnesota. He received a Ph.D. (2000) and an M.S. (1998) in Statistics from the Department ofStatistics at the University of Connecticut, an M.STAT from the Indian Statistical Institute Calcuttaand a B.S. from Presidency College, Calcutta, India. His research focuses upon statistical modelingand analysis of geographically referenced datasets, Bayesian statistics (theory and methods), inter-face between statistics and Geographical Information Systems, and statistical computing. He haspublished over fifty-five peerreviewed journal articles, several book chapters and has co-authoreda book titled “Hierarchical Modeling and Analysis for Spatial Data”. In 2009 he was honoredwith the Abdel El Sharaawi Young Investigator Award from the The International EnvironmetricsSociety. This is awarded to a young investigator who has made outstanding contributions to the fieldenvironmetrics.

454 Geoinformatica (2011) 15:435–454

Alexander M. McBean is a Professor of Health Policy and Management in the School of PublicHealth, University of Minnesota. He received his M.D. degree from Harvard University in 1968and an M.Sc. in Social Medicine from the London School of Hygiene and Tropical Medicine in1976. His research focuses on the application of administrative data, particularly data from the U.S.Medicare program, to various research topics concerning the elderly. This includes studies in theuse and effectiveness of preventive health services, the use of services to treat diabetes and cancer,and disparities in the use of services based on race or ethnicity. He also directs the Research DataAssistance Center (ResDAC) which assists researchers in understanding and using Medicare andMedicaid administrative data.

mining boundary effects in areally referenced spatial data using the bayesian information criterion

Documents