a geographic regression model for medical statistics



Soc. Sci. Med. Vol. 26, No. 1, pp. 119-129, 1988 Printed in Great Britain

0277-9536/88 $3.00 + 0.00 Pergamon Journals Ltd

A GEOGRAPHIC REGRESSION MODEL FOR MEDICAL STATISTICS

SUSAN KENNEDY

Department of Geography, University of California, Santa Barbara, CA 93106, U.S.A.

Abstract-A method for modeling geographic processes using census-type data is introduced in an analysis of male and female lung cancer mortality rates. The study area comprises the counties in those states which abut the Gulf of Mexico and the southeast Atlantic Coast of the United States. A spatially autoregressive model is used to estimate the strength of the univariate relationship between both the male and female lung cancer mortality rates in a county and in the respective lung cancer rates in the first to fifth order adjacent counties. The results show that male lung cancer exhibits spatial autocorrelation while female lung cancer does not, and that the female data exhibit a spatial trend while the male data do not. These findings suggest that factors which vary at the regional scale play a greater role in the etiology of female lung cancer and that factors that vary at the neighborhood scale play a greater role in the etiology of male lung cancer.

Key words-cancer, spatial statistics, spatial pattern analysis

INTRODUCTION

A generalized method for modeling medical geographic processes using a spatially autoregressive model for statistical census-type data will be introduced in this paper. This model is particularly well suited to modeling medical geographic processes because of the inherent spatial dependencies almost always found in such data. A fundamental property of geographically located data is that the set of data values is likely to be dependent over space [1]. There are three basic approaches available for modeling spatially dependent data [2]:

(1) Apply the standard textbook methodologies and ignore the dependencies. This is the most common approach, although it ignores the statistical problems that arise when such an approach is chosen.

(2) Use a two-step procedure like that suggested by Durbin and Watson [3] or Cliff and Ord [1]. In regression analysis, for example, one would use a standard procedure such as ordinary least squares (OLS) and then test to see if the residuals generated by this method exhibit some form of spatial dependency. A variant of this approach is the three-step procedure proposed by Glick [4] to decompose the one-dimensional geographic variation in cancer occurrence by estimating and removing first the trend and then the neighborhood effects and then testing the residuals for dependence.

(3) Assume a specific dependency structure at the start and alter existing methods to take this dependency structure into account. This is the approach taken in this paper.
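The two-step procedure in (2) can be sketched as follows. This is a minimal illustration and not the procedure used in this paper: it fits a one-predictor OLS regression and then computes Moran's I on the residuals as one common index of spatial dependence. The function names, the toy data, and the choice of Moran's I with binary contiguity weights are assumptions made for the example; a formal test would also need the sampling distribution of the index, which is omitted here.

```python
def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals


def morans_i(values, weights):
    """Moran's I for `values`, given spatial weights as {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / s0) * (cross / sum(d * d for d in dev))


# Hypothetical toy data: five areas in a row, each adjacent to the next.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.0]
weights = {}
for i in range(4):
    weights[(i, i + 1)] = 1.0
    weights[(i + 1, i)] = 1.0

# Step 1: ordinary regression. Step 2: index of dependence in the residuals.
a, b, residuals = ols_simple(x, y)
i_resid = morans_i(residuals, weights)
```

A value of the index near zero would be consistent with spatially independent residuals; values well away from zero suggest that the OLS assumptions are violated.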

The methodology described in this paper modifies the OLS regression model to take into consideration a particular dependency structure based on a contiguity measure for neighboring counties. The model also simultaneously estimates both the trend and neighborhood components of the geographic variation in empirical cancer data. Many spatial process models assume that observations are available on a rectangular grid [1], which is not the case with most social data. This method has the advantage that it does not require that medical statistics collected by geographic areas or census divisions be interpolated to a regular lattice or assigned to an irregular lattice using arbitrary centroids, and it preserves topological relationships of geographic units. The method is also suitable for the analysis of lattice-type data, and for the analysis of large two-dimensional data sets. Two examples have been computed: one for male and one for female lung cancer rates in the Gulf of Mexico and southeast Atlantic region of the United States.

DATA

Lung cancer is a process suited to modeling by this technique because it is a disease which has been shown to exhibit spatial dependency at the county level of analysis [6]. The analysis is performed on lung cancer statistics. Both the male and female rates of the disease are examined, as they have been shown to have dissimilar spatial patterns [4, 7]. Amongst the criteria used to determine suitability for inclusion in this study are the reliability and scale of variation of the data. Since lung cancer is a highly fatal disease, the mortality rate is not significantly influenced by treatment variables and is essentially equal to the morbidity rate. Lung cancer is also a disease with a relatively high signal-to-noise ratio. In this case, the signal is the true death rate from cancer and the noise is the random and/or binomial variation due to sampling error. In addition, lung cancer was chosen because this disease is known to exhibit spatial autocorrelation at the county scale [5]. The scale of spatial variation plays an important role in the geographic approach to cancer etiology. For example, certain cancers typically exhibit a great deal of variation at one scale of aggregation, but are characterized by very little variation at another scale. It is important to analyze data collected by geographic areas at the scale at which they exhibit variation [8]. For example, it would not make sense to use this model to describe spatial processes of leukemia at the county scale since it has been shown that there is not much difference in leukemia occurrence among counties [7].

The primary data used for this investigation were obtained from the National Cancer Institute’s U.S. Cancer Mortality by County: 1950-1969 [9]. This source furnishes age-standardized 20-year average mortality rates by race, anatomical site, and sex for every county in the contiguous United States. This data set is one of the most accurate, exhaustive, and disaggregate data sets available on cancer occurrence in the United States. After the computer analysis for this paper was completed, another data set became available [10]. However, restrictions on computer time and cost made it impossible to include the more recent data in this analysis.

Secondary data used in this analysis are the county boundaries. From a digital record [11] I have computed the first to fifth-order neighbor counties of each county, the corresponding lengths of the common county boundary, and the county centroids (in latitude and longitude) for each county in the study area.

STUDY AREA

The study area chosen for this analysis consists of the states which abut the Gulf of Mexico and the southeast Atlantic Coast of the United States. This area includes the states of Alabama, Arkansas, Florida, Georgia, Louisiana, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, and Texas. These 11 states comprise a total of 1086 counties. This particular region was chosen because it has higher cancer rates than most parts of the United States [12]. Although the method could have been used for the entire set of 3056 counties in the contiguous United States, the decreasing resolution of the data from east to west (due to the increasing size of the counties from east to west), as well as the lower rates of cancer on the west coast, might have biased the results. Continuously scaled choropleth maps of male and female lung cancer mortality for the study area are given in Figs 1 and 2, respectively. The unclassed maps were produced because they are more accurate with respect to quantization error [13, 14] and it has been suggested that unclassed maps appear to be of superior visual quality [14]. As it is difficult to detect patterns on these maps simply by looking at them, a statistical-spatial analysis was attempted.

COMPUTATIONAL PROCEDURE

A spatially autoregressive model is used to estimate the strength of the relationship between each variable in a county and its value in the first to fifth-order neighboring counties. The independent variables are weighted linear combinations of the value in each order of neighbor. To qualify as a neighbor, counties must have a finite length of common border. Thus, Z_n is the weighted linear combination of the variable in the nth-order neighboring counties. The weights are the normalized lengths of common boundary in relation to the perimeter of the (n - 1)th-order neighbors.

This model is most easily explained by reference to a specific county, for which I have chosen Autauga County, Alabama. For the five first-order neighbors of Autauga County the weights are proportional to the length of common boundary between each of the five neighbors and the perimeter of Autauga County. For a higher-order neighbor, for example one of Autauga County’s 20 third-order neighbors, the weights are proportional to the length of common boundary between each of the 20 third-order neighbors and the perimeter of the ring formed by Autauga County’s 12 second-order neighbors. The perimeter of these second-order neighbors is calculated by summing the lengths of common boundary between Autauga County’s 12 second-order neighbors and 20 third-order neighbors. This approach implicitly includes distance effects because the length of the perimeter of the (n - 1)th-order neighbors and the number of nth-order neighbors increase as the order increases. A map of Autauga County and its first to fifth-order neighbors is shown in Fig. 3. This approach is related to the weighting by common boundary method for calculating spatial autocorrelation discussed by Cliff and Ord [1].

As can be seen from Fig. 3, Autauga County has five first-order neighbors, 12 second-order neighbors, 20 third-order neighbors, 29 fourth-order neighbors, and 38 fifth-order neighbors. The complexity (variation in size, shape, and orientation) of county units and of spatial neighborhoods can also be seen in this figure. Of course, a different county would have a different-looking neighborhood, but the general procedure is the same. First to fifth-order neighborhoods were computed for 1086 counties.
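The neighbor rings described above can be sketched as a breadth-first search over a county adjacency table. The sketch assumes that an nth-order neighbor is a county whose shortest chain of shared borders to the starting county has n links; the paper does not state the definition in these terms, and the function name, the adjacency dictionary, and the five-county toy layout are all hypothetical.

```python
def neighbor_orders(adjacency, start, max_order=5):
    """Return {order: set of counties} for orders 1..max_order.

    `adjacency` maps each county to the set of counties that share a
    border of finite length with it.
    """
    seen = {start}
    frontier = {start}
    rings = {}
    for order in range(1, max_order + 1):
        ring = set()
        for county in frontier:
            ring.update(adjacency.get(county, ()))
        ring -= seen          # keep only counties not reached at a lower order
        rings[order] = ring
        seen |= ring
        frontier = ring
    return rings


# Hypothetical layout: A-B-C-D in a chain, with E also bordering B.
adjacency = {
    "A": {"B"},
    "B": {"A", "C", "E"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"B"},
}
rings = neighbor_orders(adjacency, "A")
```

A county near the edge of the study area simply gets smaller or empty rings at higher orders, so no special case is needed in the search itself.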

The weighted sum of each observation within a neighborhood ‘ring’ now becomes one of the independent variables in a multiple regression. In this manner, the spatial autocorrelation effect, if it exists, is extracted from the observations.
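A minimal sketch of one such weighted sum is given here, under one reading of the weighting rule: each county in the nth ring is weighted by the length of border it shares with the (n - 1)th ring, divided by the total border shared between the two rings, so that the weights sum to one. For the first order the inner ring is taken to be the county itself. The function name, the data layout, and the value returned for an empty ring are assumptions made for the example.

```python
def ring_weighted_value(rates, shared_len, inner_ring, outer_ring):
    """Boundary-weighted mean of `rates` over the counties in `outer_ring`.

    `shared_len` maps frozenset({a, b}) to the length of the border
    shared by counties a and b.
    """
    length = {
        j: sum(shared_len.get(frozenset((j, k)), 0.0) for k in inner_ring)
        for j in outer_ring
    }
    perimeter = sum(length.values())
    if perimeter == 0:
        return 0.0  # assumption: an empty ring contributes nothing
    return sum(length[j] / perimeter * rates[j] for j in outer_ring)


# Hypothetical county A with two first-order neighbors.
shared_len = {frozenset(("A", "B")): 3.0, frozenset(("A", "C")): 1.0}
rates = {"B": 40.0, "C": 20.0}
z1 = ring_weighted_value(rates, shared_len, {"A"}, {"B", "C"})
```

Here B carries three quarters of the weight and C one quarter, so the first-order value for A is 0.75 * 40 + 0.25 * 20 = 35.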

This method has the useful property that the neighbor computations near the boundaries do not require special processing. Neighborhood computations are handled in the same manner for a county on the edge of a study area as for a county in the center of a study area. The only difference in neighborhood computation between interior and boundary counties is that the boundary counties have fewer neighbors than do the interior counties.

THE MODEL

In order to estimate whether regional effects or neighborhood effects are more important, both trend and local effects were estimated simultaneously. This method is another way to estimate global system-wide trends and separate them from local neighborhood effects. The model differs from related approaches [4, 17, 18] in that it does not estimate the regional and neighborhood effects separately. The trend effect was modeled by including the location within the United States in the model via the polygon centroids for each county.

Fig. 2. Female lung cancer by county in the Gulf of Mexico and southeast Atlantic coast. Darker shading indicates higher mortality rates.

Fig. 3. First to fifth-order neighbors of Autauga County, Alabama identified by number.

The model fitted is given as follows:

$$
Y_i = A + \underbrace{\gamma_1 \lambda_i + \gamma_2 \varphi_i}_{\text{Geographic trend component}} + \underbrace{B_1 Z_{1i} + B_2 Z_{2i} + B_3 Z_{3i} + B_4 Z_{4i} + B_5 Z_{5i}}_{\text{Neighborhood effect component}} + e_i
$$

Where:

$Y_i$ is the age-standardized death rate in county $i$;

$\lambda_i$ is $\cos(\varphi_i)\, l_i$, where $l_i$ is the longitude coordinate of the centroid of county $i$ and $\varphi_i$ is the latitude coordinate of the county centroid. This transformation is a modified cylindrical equidistant projection [15];

$\varphi_i$ is the latitude coordinate of the centroid of county $i$;

$Z_{ki} = W_{ij} V_j$, where $W$ is the weighting matrix for the $j$ neighbors of county $i$, $V$ is the matrix of death rates for the neighbors $j$ of county $i$, and $k$ is the order of neighbor;

$e_i$ is the error term (to be minimized by the least squares regression).
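The fitting step can be illustrated with a small least squares sketch. It builds one row of predictors per county (intercept, cos(latitude) times longitude, latitude, and the five ring values) and solves the normal equations with a hand-written Gaussian elimination so that it needs no external library; in practice a numerical library would normally do this. Everything here is an assumption made for illustration: the function names, the use of degrees for the centroid coordinates, and the synthetic noise-free data, which are not cancer rates.

```python
import math
import random


def design_row(lon_deg, lat_deg, z_values):
    """Predictors for one county: 1, cos(lat)*lon, lat, then Z1..Z5."""
    lam = math.cos(math.radians(lat_deg)) * lon_deg
    return [1.0, lam, lat_deg] + list(z_values)


def solve(a, b):
    """Solve the square system a x = b by elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Synthetic toy data generated from known coefficients (hypothetical values).
rng = random.Random(0)
true_coef = [8.0, 0.1, -0.2, 0.4, 0.2, 0.0, 0.0, 0.0]
rows = [
    design_row(rng.uniform(-100, -80), rng.uniform(26, 36),
               [rng.uniform(20, 50) for _ in range(5)])
    for _ in range(60)
]
y = [sum(c * x for c, x in zip(true_coef, row)) for row in rows]
coef = fit_ols(rows, y)  # order: A, gamma_1, gamma_2, B_1..B_5
```

Because the toy response contains no noise, the fitted coefficients should reproduce the generating values up to rounding, which is a convenient check that the design matrix matches the model equation term by term.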

RESULTS

The results of the ordinary least squares regression are reported in Tables 1 and 2. The male lung cancer data exhibit a highly significant degree of spatial association as defined by a two-tailed t-test. The male lung cancer data exhibit spatial autocorrelation out to the second-order neighbors. At the third-order neighbors and beyond the autocorrelation coefficients (standardized regression coefficients) go to zero. This latter effect is consistent with the beta coefficients for the trend analysis, which indicate that these data are spatially stationary (the converse, autocorrelation coefficients not equal to zero, does not necessarily imply nonstationarity). If the data were not stationary, the neighborhood effects would be ‘mixed up’ with the trend effects and the beta coefficients would not necessarily decrease with increasing distance.
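The significance flags can be checked roughly from the coefficient and standard error pairs printed in Table 1. The sketch below is an illustration only. It assumes that each parenthesized standard error belongs to the coefficient beside it, that the ratio of the two can be treated as a t statistic, and that with roughly a thousand counties a normal approximation to the two-tailed critical value is adequate. The function names are hypothetical.

```python
from statistics import NormalDist


def t_ratio(coef, std_err):
    """Ratio of a coefficient to its standard error."""
    return coef / std_err


def is_significant(coef, std_err, confidence=0.99):
    """Two-tailed test using a normal approximation to the t distribution."""
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return abs(t_ratio(coef, std_err)) > critical


# Coefficient and standard error pairs in the order printed in Table 1.
male = [
    (0.006, 0.0284), (0.039, 0.0378), (0.406, 0.0367), (0.208, 0.0441),
    (0.032, 0.0475), (0.004, 0.0502), (0.018, 0.0436),
]
flags = [is_significant(c, s) for c, s in male]
```

Under these assumptions only the third and fourth pairs exceed the 99% threshold, which agrees with the two starred entries in the table.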

The female lung cancer data exhibit a strong trend component in the easterly direction and a weaker, although statistically significant, component in the southerly direction. Thus the female lung cancer data surface exhibits a trend surface increasing in a southeasterly direction which is not likely to be due to chance alone. For female lung cancer the trend component in the easterly direction is larger than any of the neighborhood effects. Both the first and second-order neighbors exhibit weak autocorrelation effects. The strongest neighborhood effect is a negative autocorrelation with the fifth-order neighbors. This negative autocorrelation may represent a medium scale cyclical feature of the data, but it is not possible to determine if this is so without estimating the correlogram for a greater number of county lags.

Table 1. Summary of regression analysis for male lung cancer

Significant trend component: No

Significant* neighborhood component: Yes

Constant: 8.86

Standardized† regression coefficients: 0.006 (0.0284), 0.039 (0.0378), 0.406* (0.0367), 0.208* (0.0441), 0.032 (0.0475), 0.004 (0.0502), 0.018 (0.0436)

R²: 0.376

*Significant at the 99% confidence level. †Standard error in parentheses.

Table 2. Summary of regression analysis for female lung cancer

Significant* trend component: Yes

Significant* neighborhood component: Yes

Constant: 8.079

Standardized‡ regression coefficients: 0.084* (0.0395), 0.153† (0.0453), 0.062 (0.0329), 0.084* (0.0357), 0.02 (0.0384), 0.024 (0.0377), 0.103† (0.0391)

R²: 0.0737

*Significant at the 95% confidence level. †Significant at the 99% confidence level. ‡Standard error of coefficients in parentheses.

Fig. 4. Spatial correlogram for male lung cancer.

The female lung cancer age-standardized rates exhibit weakly homogeneous neighborhood effects and strong regional effects, while male lung cancer rates exhibit strong neighborhood effects and weak regional effects. In addition, the pattern of male autocorrelation coefficients is characteristic of a distance decay function. The diagrams shown in Figs 4 and 5 express the relationship between spatial lag and the beta coefficients for the male and female data.

Analysis of the correlation matrix for the female lung cancer data revealed that the correlations between the pairs of independent variables ranged from 0.2331 to 0.5734. It is clear that multicollinearity is not a significant problem in the female lung cancer data. The values in the male lung cancer correlation matrix ranged from -0.1144 to 0.8943. Thus, the extent of multicollinearity is larger in the male data than in the female data. The existence of multicollinearity could potentially create a situation in which the beta coefficients were unstable, but since the highest correlations in the male lung cancer correlation matrix are among variables that are not significant, this does not pose a serious problem.
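The multicollinearity check described above amounts to scanning the pairwise correlations among the independent variables and reporting their range. A minimal sketch follows; the function names and the three toy columns are hypothetical and stand in for the trend and ring variables.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_range(columns):
    """Smallest and largest pairwise correlation among named columns."""
    values = [pearson(columns[a], columns[b])
              for a, b in combinations(columns, 2)]
    return min(values), max(values)


# Hypothetical predictor columns.
columns = {
    "Z1": [1.0, 2.0, 3.0, 4.0],
    "Z2": [1.0, 3.0, 2.0, 4.0],
    "Z3": [4.0, 3.0, 2.0, 1.0],
}
lowest, highest = correlation_range(columns)
```

A wide range with values close to 1 in magnitude, as in this toy case, is the situation in which coefficient estimates can become unstable.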

Fig. 5. Spatial correlogram for female lung cancer.

In addition, the existence of autocorrelation in the male data implies that the true number of degrees of freedom is less than would be expected from the sample size. The required adjustment to the degrees of freedom for a situation of this type is discussed in Davis [16]. Since this adjustment would only affect the confidence interval for marginally significant estimators (and, in this case, there are no such marginal estimators), it was not performed when using the t-test for the male data.

Choropleth maps of the predicted mortality rate for both the male and female lung cancer are given in Figs 6 and 7. These predicted maps were based on the estimated significant regression coefficients. Thus, in the case of males, the cancer rates are predicted on the basis of the neighboring rates only. For females the absolute location is the more important predictor variable. The most striking feature evident in these maps is the preeminent delineation of the counties surrounding the Mississippi River on the map of male predicted mortality rates. This feature is also evident on the map of actual lung cancer mortality for males (see Fig. 1).

Continuously scaled choropleth maps of the positive and negative residuals for the male lung cancer data are given in Figs 8 and 9, respectively. Visual inspection of these maps, as well as of the female lung cancer residuals (not shown), suggests, but of course does not prove, that the residuals are randomly distributed on the maps. This inspection suggests that I have extracted as much as possible from these data using an autoregressive model.

CONCLUSION

The difference between the patterns of male and female lung cancer mortality rates suggests that different factors are at work in the etiologic development of lung cancer in men and women. Although differences in cigarette smoking habits between men and women can account for the difference in magnitude between male and female death rates observed during the study period [19, 20], this mechanism cannot explain the differences in the spatial organization of the male and female death rates. The existence of the neighborhood effects exhibited by the male lung cancer data coupled with the corresponding lack of neighborhood effects exhibited by the female lung cancer data suggests that lung cancer in the male may be occupational in origin. It is quite possible that this effect is due to the difference in commuting patterns between men and women during the study period. In the 1950s and early 1960s, there were fewer women in the work force than there are now, so we can assume that a higher percentage of women than men worked at home and thus did not have to travel to work, while many men commuted to their jobs. This hypothesis accounts for the distance decay effect evident in the male spatial correlogram because men commuting to work in an adjacent county may develop cancer as a result of occupational exposure but go home to die in their county of residence.

On the other hand, the significance of the trend component in the pattern of female lung cancer mortality suggests that the female etiology is affected by some environmental process operating at the state or regional level. If physical environmental effects were a major cause of lung cancer in the Gulf Coast Southeastern Region of the United States we would expect to see similar patterns of spatial organization of mortality rates between men and women. Thus, we can speculate that regional differences in employment patterns for women, or regional differences in the diffusion of cigarette smoking by women, may be represented by the trend component. This idea that there are regional differences in the diffusion of cigarette smoking is supported by Greenberg [21], who suggests that there are substantial differences in female cigarette smoking within metropolitan regions and between metropolitan regions. These geographic differences in the sociology of American women have not yet been studied.

Fig. 9. Negative residuals for male lung cancer. Darker shading indicates larger magnitudes.

The results of this study suggest that factors that vary at the local neighborhood scale appear to be a fruitful avenue of research in the etiology of male lung cancer. With both the increased cigarette smoking and participation of women in the work force during the past 15 years, it would be interesting to investigate whether the spatial organization of female lung cancer mortality patterns has changed over time to more closely approximate the spatial organization of male lung cancer mortality patterns. These findings suggest that research on the role that environmental carcinogens play in the development of lung cancer, and perhaps other cancers as well, should focus on the female mortality rates for the period 1950-1969.

Acknowledgement--The assistance of Waldo R. Tobler is gratefully acknowledged.

REFERENCES

1. Cliff A. D. and Ord J. K. Spatial Processes, Models and Applications. Pion, London, 1981.

2. Streitberg B. Multivariate models of dependent spatial data. In Exploratory and Explanatory Statistical Analysis of Spatial Data (Edited by Bartels C. P. A. and Ketellapper R. H.), p. 139. Nijhoff, Boston, 1979.

3. Durbin J. and Watson G. S. Testing for serial correlation in least squares regression III. Biometrika 58, 1, 1971.

4. Glick B. J. The spatial organization of cancer mortality. Ann. Ass. Am. Geogr. 72, 471, 1982.

5. Glick B. J. The spatial autocorrelation of cancer mortality. Soc. Sci. Med. 13D, 123, 1979.

6. Glick B. J. The spatial organization of cancer occurrence: theoretical and empirical perspectives. Unpublished Ph.D. dissertation, State University of New York at Buffalo, 1981.

7. Mason T. J., McKay F. W., Hoover R. et al. Atlas of Cancer Mortality in U.S. Counties: 1950-1969. National Institute of Health, DHEW Publ. No. (NIH) 75-780. U.S. Government Printing Office, Washington, D.C., 1975.

8. Harvey D. W. Pattern, process and the scale problem in geographical research. Transact. Inst. Br. Geogr. 48, 71, 1968.

9. Mason T. J. and McKay F. W. U.S. Cancer Mortality by County: 1950-1969. Department of Health, Education, and Welfare. Publication No. (NIH) 74-615. U.S. Government Printing Office, Washington, D.C., 1974.

10. Riggan W. B., Van Bruggen J., Acquavella J. F., Beaubier J. and Mason T. J. U.S. Cancer Mortality Rates and Trends, 1950-1979. United States Environmental Protection Agency, Health Effects Research Laboratory, Research Triangle Park, N.C., 1983.

11. Tobler W. R. Personal communication, 1983.

12. Blot W. J. and Fraumeni J. F. Geographic patterns of lung cancer: industrial correlations. Am. J. Epidem. 103, 539-550, 1976.

13. Tobler W. R. Choropleth maps without class intervals? Geogr. Anal. 3, 262, 1973.

14. Gale N. and Halperin W. C. A case for better graphics: the unclassed choropleth map. Am. Statist. 36, 330, 1982.

15. Maling D. H. Coordinate Systems and Map Projections. Phillip, London, 1973.

16. Davis R. E. Predictability of sea-surface temperature and sea-level pressure anomalies over the North Pacific Ocean. J. phys. Oceanogr. 6, 246-266.

17. Haining R. P. Specification and estimation problems in models of spatial dependence. In Studies in Geography, No. 24. Department of Geography, Northwestern University, Evanston, Ill., 1978.

18. Richards K. S. Stochastic processes in one-dimensional series: an introduction. In Concepts and Techniques in Modern Geography (CATMOG), No. 23. Geo-Abstracts, Norwich, 1979.

19. Blot W. J. and Fraumeni J. F. Changing patterns of lung cancer in the United States. Am. J. Epidem. 115, 664, 1982.

20. Doll R. and Peto R. The causes of cancer: quantitative estimates of avoidable risks of cancer in the United States today. J. natn. Cancer Inst. 66, 1191, 1981.

21. Greenberg M. et al. White female respiratory cancer mortality: a geographical anomaly. Lung 161, 235, 1983.