
STATISTICS IN MEDICINE, VOL. 12, 1249-1258 (1993)

APPLICATIONS OF EXTRACTION MAPPING IN ENVIRONMENTAL EPIDEMIOLOGY

A. B. LAWSON Department of Mathematical and Computer Sciences, Dundee Institute of Technology, Bell Street, Dundee DD1 1HG, U.K.

AND

F. L. R. WILLIAMS Department of Epidemiology and Public Health, Ninewells Hospital and Medical School, University of Dundee, Dundee DD1 9SY, U.K.

SUMMARY This paper discusses a new method which allows the extraction of a background disease rate from a data set consisting of spatial co-ordinates of morbidity or mortality events. We demonstrate its application with two data sets, one based on cancer registry data and the other on death certificates.

INTRODUCTION

There is growing interest in the development of methods for the detection of clusters of disease which are thought to be caused by environmental pollutants. Several methods already exist, which range from little more than a simple visual assessment of the spatial distribution to more sophisticated procedures involving statistical tests and modelling. Clusters may be identified during the routine monitoring of health events, or they may be hypothesized to exist a priori as being causally related to a postulated source of pollution. There are two main approaches to the collection of data for the evaluation of clusters: either the data are summarized into areal units such as postcodes, enumeration districts or grid squares, or the data are treated as discrete events and described by precise geographical coordinates. One concern about using area based data is the artificial boundaries which are imposed upon the data by the use of administrative units which probably bear no relation to the postulated source of pollution. The imposition of inappropriate boundaries can result in dilution of the data to such an extent that an effect, although present, cannot be detected; this is a phenomenon which is particularly important when the health outcomes being measured are especially rare forms of a disease. However, an advantage of using area based data lies in the availability of demographic data which allow the disease to be standardized for population and for social and gender characteristics. The alternative to area based data is point-event data, and the pioneering studies by John Snow of the cholera epidemics in London during the mid nineteenth century were among the first to use this methodology. In general, point-event data adequately reflect exposure to a pollutant; however, some method is needed to account for factors which may spuriously indicate a cluster. 
One method is to design a case-control study, and to match for individual case backgrounds; another is to develop a method for interpolating population data to a point event.

0277-6715/93/131249-10$10.00 © 1993 by John Wiley & Sons, Ltd.

Received November 1991 Revised January 1993


One novel method which has been used for identifying a cluster of disease uses geographical data derived from addresses on the death certificates.1 The null model used was a heterogeneous Poisson process (HEPP) model, where the concentration of events (intensity function) consisted of the background disease rate (background hazard) and a separate function of the spatial location of the data points.1 Use of this model requires that the case disease (the disease which is thought to be the result of exposure to pollution) is matched to a control disease (a disease which has the same age and sex distribution as the case disease but which remains unaffected by the pollution source). The excess disease rate is estimated by comparison of the case and control disease rates. The successful use of this HEPP model requires the correct choice of control disease. The ideal control disease is one that has the same age, sex, occupational and frequency distribution as the disease under investigation, but has no (known) association with the postulated type of pollution. For example, in a study of lung cancer deaths an appropriate control disease would be coronary heart disease. Both diseases occur frequently and share major aetiological characteristics: the frequency of both increases with age, and they have a higher prevalence in men, an inverse social class gradient, and strong associations with smoking. But importantly, of the two diseases, only lung cancer has an association with airborne pollution. Sometimes the choice of a control disease is not ideal, owing to the constraints of the study, the obscurity of the case disease, or the uncertainty of the pollution type; in these instances one solution to the problem is to select two control diseases and to compare outcomes.

One drawback of this HEPP model is that it assumes a radial distribution of disease around the source of pollution. However, prevailing winds, local topography, meteorological peculiarities of the area, characteristics of the chimney flue and height, all contribute to the directional flow of pollution emanating from a chimney and make questionable this simple assumption of radial disease distribution.

In this paper we develop the application of kernel regression smoothing to the problem of assessing the appropriateness of radial and/or directional components of an HEPP model.2,3 This technique may be regarded as an exploratory tool, in that it should be used before considering a specific model for describing the spatial excess of a particular disease. By using kernel regression smoothing it would be possible to validate, for example, the assumption of radial disease distribution within an HEPP model. We apply this method to two different data sets, one based on death certificates and the other on cancer registrations grouped by postcode units. Both data sets were exposed to a point source of airborne pollution.

METHODS

Kernel regression smoothing

Kernel regression smoothing is used to allow the extraction of the underlying disease rate, in order to isolate the locations of excess. The basis of the method has been discussed in a number of recent texts.4,5 We present a brief outline of the basic method and our proposed application to case-control extraction.

Kernel regression smoothing is one of a number of non-parametric methods for smoothing data. In one dimension, the method consists of the estimation of $m(X_i)$ in the relation

$$Y_i = m(X_i) + \epsilon_i,$$

where $\epsilon_i$, $i = 1, \ldots, n$, are assumed to be symmetric errors. The function $m(X_i)$ is not restricted to simple functions such as linear or quadratic forms, but does represent the conditional mean of $Y$ given $X = x$.

The function $m(x)$ at $x$ is estimated from the $Y_i$ values in a neighbourhood of $x$. The estimator used in kernel regression can be defined as

$$\hat m(x) = n^{-1} \sum_{i=1}^{n} W_i(x)\, Y_i, \qquad (1)$$

that is, the curve is a weighted sum of $Y_i$ values. Suitable weights can be defined as

$$W_i(x) = \frac{K(x - X_i)}{n^{-1} \sum_{j=1}^{n} K(x - X_j)}.$$

If these weights are used, then (1) is known as the Nadaraya-Watson estimator.~ The function $K$ is known as the kernel function, and it defines the relative weight to be given to data points at different distances from $x$. A probability density function is usually chosen for this function. In fact a commonly used kernel function is the Gaussian, where

$$K(u) = (2\pi s^2)^{-1/2} \exp\{-u^2 / (2 s^2)\},$$

with $u = x - X_i$ and $s$ the smoothing bandwidth. This kernel is symmetrical about each data point and gives low weight to points far from data values.
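As an illustration of the estimator in (1) with Gaussian kernel weights, a minimal Python sketch is given here. The function names, the synthetic data and the bandwidth value are assumptions made for the example; they are not taken from the analyses reported in this paper.

```python
import numpy as np

def gaussian_kernel(u, s):
    """Gaussian kernel with bandwidth s, evaluated at offsets u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def nadaraya_watson(x_eval, x_obs, y_obs, s):
    """Nadaraya-Watson estimate of m(x) at each point of x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    # k[a, i] = K(x_eval[a] - x_obs[i])
    k = gaussian_kernel(x_eval[:, None] - x_obs[None, :], s)
    # Normalised weights: each row sums to one over the observations.
    w = k / k.sum(axis=1, keepdims=True)
    return w @ y_obs

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
m_hat = nadaraya_watson(np.linspace(0.1, 0.9, 5), x, y, s=0.05)
```

Because the weights are normalised, the fitted value at any point is a weighted average of the observed responses, so a constant response is reproduced exactly.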

The type of kernel used in this method is not as important as the choice of the smoothing bandwidth parameter $s$. This parameter must be chosen prior to smoothing and its value critically affects the resulting regression curve. Too large a value of $s$ will yield an oversmoothed fit, while too small a value will leave spikes at data values. It is important to be able to derive a value for $s$ which is optimal for a given data set. Optimality criteria vary, but an estimator which minimizes average squared error (ASE) from the model is often assumed. The method of cross-validation provides an unbiased estimate of ASE, and is defined as

$$CV(s) = n^{-1} \sum_{i=1}^{n} \{Y_i - \hat m_{s,i}(X_i)\}^2,$$

where $\hat m_{s,i}$ is the estimator calculated with the $i$th item left out. Minimizing $CV(s)$ with respect to $s$ provides an estimate of the optimal bandwidth. The methods above can be extended to two dimensions.
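The leave-one-out criterion can be sketched as follows. This is an illustrative implementation under stated assumptions: a Gaussian kernel, a simple grid search over candidate bandwidths, and synthetic data. The function names and the candidate grid are hypothetical choices for the example.

```python
import numpy as np

def loo_cv_score(x, y, s):
    """Leave-one-out CV(s) for a Gaussian-kernel Nadaraya-Watson fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / s) ** 2)
    np.fill_diagonal(k, 0.0)  # leave the ith observation out of its own fit
    denom = k.sum(axis=1)
    if np.any(denom == 0.0):
        # Bandwidth so small that some point has no effective neighbours.
        return np.inf
    m_loo = (k @ y) / denom
    return np.mean((y - m_loo) ** 2)

def select_bandwidth(x, y, candidates):
    """Return the candidate bandwidth minimising CV(s), with all scores."""
    candidates = np.asarray(candidates, dtype=float)
    scores = np.array([loo_cv_score(x, y, s) for s in candidates])
    return candidates[int(np.argmin(scores))], scores

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, x.size)
candidates = np.geomspace(0.01, 1.0, 20)
s_best, scores = select_bandwidth(x, y, candidates)
```

A grid search is used here only for transparency; any one-dimensional minimizer could be substituted.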

If the response variable is defined as $z_i$ observed at coordinates $\{X_i, Y_i\}$, the $z_i$ can be smoothed by assuming the model

$$z_i = m(X_i, Y_i) + \epsilon_i.$$

In this model, a bivariate Gaussian kernel can be employed and the resulting $\hat m(X, Y)$ values will yield a smooth underlying surface. Again, cross-validation can be used to estimate $s$.
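A two-dimensional version might look like the sketch below. It assumes an isotropic bivariate Gaussian kernel with a single bandwidth $s$ (the text does not specify whether separate bandwidths were used for each coordinate), and the function name and synthetic data are illustrative.

```python
import numpy as np

def smooth_surface(grid_xy, obs_xy, z, s):
    """Nadaraya-Watson surface estimate with an isotropic bivariate Gaussian kernel."""
    grid_xy = np.asarray(grid_xy, dtype=float)
    obs_xy = np.asarray(obs_xy, dtype=float)
    z = np.asarray(z, dtype=float)
    # Squared distances between every grid point and every observation.
    d2 = ((grid_xy[:, None, :] - obs_xy[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-0.5 * d2 / s**2)  # the normalising constant cancels in the ratio
    return (k @ z) / k.sum(axis=1)

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
obs = rng.uniform(0.0, 1.0, size=(300, 2))
z = obs[:, 0] + obs[:, 1] + rng.normal(0.0, 0.1, 300)
gx, gy = np.meshgrid(np.linspace(0.1, 0.9, 9), np.linspace(0.1, 0.9, 9))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = smooth_surface(grid, obs, z, s=0.1).reshape(gx.shape)
```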

Model definition and extraction method

The point locations of a disease, $\{X_i;\ i = 1, \ldots, n\}$, are observed within a predefined window. It is common practice to assume that such data are governed by an HEPP model with first-order intensity given by

$$\lambda^*(X_i) = \rho\, h(X_i)\, \lambda(X_i),$$

where $\rho$ is the constant baseline intensity, $h(X)$ is a case-control background hazard and $\lambda(X)$ is a function of spatial variables. If Diggle1 is followed, $h(X)$ is evaluated at $\{X_i\}$ as $\hat h(X_i)$ by kernel density estimation, and subsequent parametric modelling is based on the HEPP log-likelihood:

$$l = \sum_{i=1}^{n} \ln \lambda^*(X_i) - \int \lambda^*(X)\, dX. \qquad (3)$$

The integral in (3) is usually taken over the predefined window, although Diggle1 exploits the special form of his $\lambda(X)$ to yield an analytic solution over the whole plane.
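To make the log-likelihood in (3) concrete, the following sketch evaluates it over a rectangular window, approximating the integral by a midpoint rule. Several elements are assumptions made purely for illustration: the rectangular window, the flat background hazard, the radial form $1 + \alpha \exp(-\beta d^2)$ used for the spatial function, and all function and parameter names.

```python
import numpy as np

def hepp_loglik(points, intensity, window, n_grid=200):
    """Sum of log-intensities at the events minus the integrated intensity."""
    points = np.asarray(points, dtype=float)
    (x0, x1), (y0, y1) = window
    # Midpoints of an n_grid x n_grid partition of the window.
    gx = x0 + (np.arange(n_grid) + 0.5) * (x1 - x0) / n_grid
    gy = y0 + (np.arange(n_grid) + 0.5) * (y1 - y0) / n_grid
    xx, yy = np.meshgrid(gx, gy)
    cells = np.column_stack([xx.ravel(), yy.ravel()])
    cell_area = (x1 - x0) * (y1 - y0) / n_grid**2
    integral = intensity(cells).sum() * cell_area
    return np.log(intensity(points)).sum() - integral

def make_intensity(rho, source, alpha, beta):
    """Hypothetical intensity: flat background times a radial excess term."""
    source = np.asarray(source, dtype=float)

    def intensity(xy):
        d2 = ((np.asarray(xy, dtype=float) - source) ** 2).sum(axis=1)
        return rho * (1.0 + alpha * np.exp(-beta * d2))

    return intensity

# Synthetic events on the unit square, for illustration only.
rng = np.random.default_rng(2)
pts = rng.uniform(0.0, 1.0, size=(50, 2))
window = ((0.0, 1.0), (0.0, 1.0))
ll_null = hepp_loglik(pts, make_intensity(50.0, (0.5, 0.5), 0.0, 1.0), window)
ll_alt = hepp_loglik(pts, make_intensity(50.0, (0.5, 0.5), 2.0, 20.0), window)
```

With `alpha = 0` the intensity is constant, and the log-likelihood reduces to $n \ln \rho - \rho |A|$ for a window of area $|A|$, which gives a simple check on the numerical integration.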

To allow correct specification of $\lambda(X)$ it would be advantageous to be able to 'extract' $\hat h(X_i)$ from the process of interest. This can be achieved in a number of ways. One simple method is to use kernel smoothing applied to $\hat h^{-1}(X_i)$ to isolate the excess intensity. Define

$$\hat\lambda(X_i) = \sum_{j=1}^{n} K(X_i - X_j)\, \hat h^{-1}(X_j),$$

where $K(\cdot)$ is a normalized Nadaraya-Watson kernel weight. This smoothing has parallels in the discrete count case where standardized mortality ratios (SMRs) are smoothed over time.6 Other examples of kernel smoothing are available.7-9

In the above estimator $\hat\lambda(X_i)$, two smoothing operations are applied: first for $\hat h(X_i)$, and second for $\hat\lambda(X_i)$. As is well known, kernel smoothing is very sensitive to the bandwidth parameter $s$, but not to the kernel used (Reference 4, p. 127).

It is important that an objective method be sought to estimate both bandwidths. In fact, Diggle1 and Lawson' have both found that $\lambda(X_i)$ estimation is very sensitive to the bandwidth used for $h(X_i)$ estimation. In our examples we have used cross-validation to estimate both bandwidths. In the case of $\hat h(X_i)$ we have used likelihood cross-validation, while for $\hat\lambda(X_i)$ we have used ordinary cross-validation (for example Reference 5). Note that in the resultant map, areas with $\hat\lambda(X_i) > 1$ show an excess of intensity.
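The two-stage extraction can be sketched as follows. This is one possible reading of the procedure, not the implementation used for the examples: the background is estimated from control locations by a Gaussian kernel density estimate, its bandwidth is chosen by leave-one-out likelihood cross-validation over a hypothetical grid, and the reciprocal of the background at the case locations is then smoothed with normalized kernel weights. All names, the synthetic data and the second bandwidth value are assumptions.

```python
import numpy as np

def _sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def kde2d(eval_xy, data_xy, s):
    """Bivariate Gaussian kernel density estimate at eval_xy."""
    k = np.exp(-0.5 * _sq_dists(eval_xy, data_xy) / s**2)
    return k.sum(axis=1) / (data_xy.shape[0] * 2.0 * np.pi * s**2)

def likelihood_cv(data_xy, s):
    """Leave-one-out log-likelihood of the density estimate (larger is better)."""
    n = data_xy.shape[0]
    k = np.exp(-0.5 * _sq_dists(data_xy, data_xy) / s**2) / (2.0 * np.pi * s**2)
    np.fill_diagonal(k, 0.0)
    with np.errstate(divide="ignore"):
        return np.log(k.sum(axis=1) / (n - 1)).sum()

def extract_surface(grid_xy, case_xy, control_xy, s_h, s_lam):
    """Smooth 1 / h_hat(X_i) over the case locations with normalised weights."""
    h_hat = kde2d(case_xy, control_xy, s_h)  # background at the case locations
    k = np.exp(-0.5 * _sq_dists(grid_xy, case_xy) / s_lam**2)
    w = k / k.sum(axis=1, keepdims=True)
    return w @ (1.0 / h_hat)

# Synthetic case and control locations, for illustration only.
rng = np.random.default_rng(3)
controls = rng.uniform(0.0, 1.0, size=(300, 2))
cases = np.vstack([rng.uniform(0.0, 1.0, size=(60, 2)),
                   rng.normal(0.7, 0.05, size=(20, 2))])
grid_s = np.linspace(0.05, 0.3, 11)
s_h = grid_s[int(np.argmax([likelihood_cv(controls, s) for s in grid_s]))]
gx, gy = np.meshgrid(np.linspace(0.1, 0.9, 9), np.linspace(0.1, 0.9, 9))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = extract_surface(grid, cases, controls, s_h, s_lam=0.1).reshape(gx.shape)
```

In practice the second bandwidth would also be chosen by cross-validation rather than fixed, and the scaling of the resulting surface depends on how the background estimate is normalised.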

It is also possible to assess the variability of $\hat\lambda(X_i)$, as a pointwise conditional variance is available. Usually, in one dimension, a confidence interval is derived from this. However, it is more difficult in two dimensions to display clearly confidence regions for an estimated surface. One possibility is to simply present a conditional standard error surface, although this may ignore bias in the conditional variance surface.' Define

$$\hat\sigma^2(X) = n^{-1} \sum_{i=1}^{n} W_i(X)\, \{y_i - \hat m_s(X)\}^2$$

as the conditional variance at $X$. The standard error surface can be computed straightforwardly as $\hat\sigma(X)$.
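A sketch of the conditional standard error surface is given below, assuming the same isotropic Gaussian kernel weights as in the surface fit and writing the normalized weights as $w_i(X) = n^{-1} W_i(X)$, so that they sum to one. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_with_standard_error(grid_xy, obs_xy, z, s):
    """Kernel regression surface and its pointwise conditional standard error."""
    grid_xy = np.asarray(grid_xy, dtype=float)
    obs_xy = np.asarray(obs_xy, dtype=float)
    z = np.asarray(z, dtype=float)
    d2 = ((grid_xy[:, None, :] - obs_xy[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-0.5 * d2 / s**2)
    w = k / k.sum(axis=1, keepdims=True)  # normalised weights, rows sum to one
    m_hat = w @ z
    # Weighted mean of squared residuals about the fitted value at each grid point.
    resid2 = (z[None, :] - m_hat[:, None]) ** 2
    var_hat = (w * resid2).sum(axis=1)
    return m_hat, np.sqrt(var_hat)

# Synthetic data whose noise level grows with the second coordinate.
rng = np.random.default_rng(4)
obs = rng.uniform(0.0, 1.0, size=(200, 2))
z = obs[:, 0] + rng.normal(0.0, 0.1 + 0.4 * obs[:, 1], 200)
grid = np.array([[0.5, 0.1], [0.5, 0.9]])
m_hat, se = fit_with_standard_error(grid, obs, z, s=0.1)
```

For a binary response the estimate reduces to $\hat m (1 - \hat m)$ at each point, which is a convenient check on an implementation.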

EXAMPLES

We demonstrate the application of kernel regression smoothing to two data sets. One data set enables the evaluation of the association between a disease and a suspected point source of pollution. The other data set allows the comparison of cancer incidence over two periods, only the later of which involved exposure to a point source of pollution.

Example 1: larynx cancer (Lancashire)

Between 1972 and 1980 an industrial waste incinerator operated at a site approximately 2 km south-west of a small town, Coppull, Lancashire, England. An investigation was designed to determine whether the incinerator had had a detrimental effect upon health. Cancer registration data were scrutinized and only one cancer, cancer of the larynx, was thought to be unusual in the vicinity of the incinerator.10 Fifty-eight cancers were found in the ten year period of the study, of which five were within 2 km of the incinerator.

Figure 1. (a) Cross-validated density surface of cancer of the larynx (s = 0.01) for Lancashire. (b) Cross-validated density surface of respiratory cancer (s = 0.01) for Lancashire

In the investigation of the significance of this cluster, the authors10 applied a heterogeneous Poisson process (HEPP) model to the distribution of larynx cancer and estimated a background intensity function $\lambda_0(x)$ using a control disease. The authors acknowledged the need to choose a control disease with an age and sex distribution similar to that of the case disease, and selected respiratory cancer. Given the associations between airborne pollution and lung cancer,11 the use of that disease as a control is questionable.

As an example of the application only, we applied kernel regression smoothing to the larynx and respiratory cancer data. Figure 1 shows the cross-validated density surfaces for (a) larynx cancer and (b) respiratory cancer. Figure 2(a) shows the derived surface following the extraction of respiratory cancer. When cross-validation is used the main feature of this surface is a south-east/north-west axis with a large peak close to and north-east of the incinerator. The cross-validated surfaces do not appear to show a simple radial distance decline in intensity from the incinerator, as hypothesized by the original investigators;' this may account for the large standard errors found for the two-parameter model used in that study. An alternative model with multi-modal directional preference and a peaked radial effect may be more appropriate, although the variability in the location of these peaks suggests that such a simplistic model may be unrealistic for detecting an association of deaths with a point source of pollution. The standard error surface (Figure 2(b)) shows a peak of variability to the east of the incinerator which reflects either the true variability or possibly the paucity of the data.

Figure 2. (a) Contour plot for Lancashire following extraction of respiratory cancer (s = 0.0168). (b) Standard error surface for Lancashire following extraction of respiratory cancer (s = 0.015). The putative source is depicted by a solid square


Figure 3. (a) Cross-validated density surface of respiratory cancer (s = 2.707) for Bathgate 1960-65. (b) Cross-validated density surface of respiratory cancer (s = 3.523) for Bathgate 1966-76

Example 2: respiratory cancer (Bathgate)

The study in Bathgate, Scotland,12 was instigated in an attempt to corroborate the findings of a previous study in which airborne pollution emanating from a steel foundry was shown to be associated with a cluster of deaths from respiratory cancer in the residences downwind from the foundry.13 The exact location of the foundry in Bathgate was ascertained, and the probable distribution of air pollution from it estimated using two approaches. First, a scale model was made of the town, and the wind patterns were estimated by a wind tunnel experiment. Secondly, soil samples from various locations in Bathgate were analysed for their metal content. Mortality was investigated by plotting on a map the number of deaths from respiratory cancer in two periods: 1960-65 and 1966-76. The result of the epidemiological investigations indicated that there was no evidence of a clustering of deaths downwind of the foundry in the former period, but that in the latter period there was an excess of deaths from respiratory cancer.

Figure 4. (a) Contour plot for Bathgate following extraction of respiratory cancer for the years 1960-65 (s = 1.888). (b) Standard error surface for Bathgate following extraction of respiratory cancer for the years 1960-65 (s = 2.105). The putative source is depicted by a solid square. Horizontal axis of panel (a): W-E (km tenths)

In this example, the period 1960-65 acted as the control for the period 1966-76. This was justifiable for two reasons. First, the period 1960-65 was not hypothesized a priori to have an increased SMR from respiratory cancer, because the metallurgical processes thought to be associated with an increase in respiratory cancer were not in operation in the earlier period. Secondly, the population structure remained fairly constant between 1960 and 1976. Figures 3(a) and 3(b) show the cross-validated density surfaces for 1960-65 and 1966-76 respectively. Figure 4(a) shows the residual surface following extraction, and clearly shows a pronounced peak in the area downwind of the source of pollution. The standard error surface (Figure 4(b)) indicates variability to the east of the putative source of pollution, with high variability evident to the south-east.


CONCLUSIONS

The ability to visualize clusters of a disease clearly is of potential benefit to anyone within the general field of environmental epidemiology, but in particular investigators trying to assess the possible associations between a point source of pollution and subsequent cases of a disease. Several mathematical methods are available for assessing the significance of clusters;14,15 however, most assume a radial distribution of disease around a point source. Environmental monitoring of soil and plumes, in conjunction with wind tunnel experiments, indicates in at least two towns that this is not a valid assumption,12,16,17 and clustering methodologies which assume a radial distribution of disease would not be a valid procedure in these instances. This underlines the need for a procedure, such as the extraction method described in this paper, which demonstrates the presence or absence of a cluster of disease around a point source of pollution.

The identification of clusters of a disease, however, is only the first stage in assessing the public health outcomes and only one of several approaches that will be necessary in trying to establish cause and effect. Delineating areas of possible exposure from a point source of pollution requires a knowledge and appreciation of local topography, prevailing wind direction and meteorological peculiarities. To avoid diminishing the extent of the problem, the choice of control disease should reflect age and sex distributions similar to those of the case disease and must remain unaffected by the presence of a source of pollution. Cause and effect can only be shown conclusively when the responsible pathogen is identified; thus, in any environmental study, physical monitoring of the environment and of the pollution is the ultimate requirement.

ACKNOWLEDGEMENTS

We thank Professor O. Lloyd (Chinese University of Hong Kong) for providing the data for Bathgate; Professor P. Diggle and Dr. A. Gatrell (University of Lancaster) for giving access to the data on larynx cancer; and Professor C. du V. Florey (University of Dundee) for helpful comments on an early draft of this paper.

REFERENCES

1. Diggle, P. J. ‘A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point’, Journal of the Royal Statistical Society, Series A, 153(3), 349-362 (1990).

2. Lawson, A. B. The Statistical Analysis of Point Events Associated with a Fixed Point, Ph.D. Thesis, University of St. Andrews, 1991.

3. Lawson, A. B. ‘GLIM and normalising constant models in spatial and directional data analysis’, Computational Statistics and Data Analysis, 3, 331-348 (1992).

4. Härdle, W. Applied Nonparametric Regression, Cambridge University Press, 1990.

5. Härdle, W. Smoothing Techniques with Applications, Springer, 1991.

6. Breslow, N. and Day, N. Statistical Methods in Cancer Research: Cohort Analysis, IARC, Lyon, 1987.

7. Ramlau-Hansen, H. ‘Smoothing counting process intensities by means of kernel functions’, Annals of Statistics, 11(2), 453-466 (1983).

8. Yandell, B. ‘Nonparametric inference for rates with censored survival data’, Annals of Statistics, 11(4), 1119-1135 (1983).

9. Titterington, D. M. ‘A comparative study of kernel-based density estimates for categorical data’, Technometrics, 22(2), 259-268 (1980).

10. Diggle, P. J., Gatrell, A. C. and Lovett, A. A. ‘Modelling the prevalence of cancer of the larynx in part of Lancashire: a new methodology for spatial epidemiology’, in Thomas, R. W. (ed.), Spatial Epidemiology, Pion Press, 1990.

11. Crofton, J. and Douglas, A. Respiratory Diseases, 3rd edn, Blackwell, 1981.

12. Lloyd, O. Ll., Smith, G., Lloyd, M. M., Holland, Y. and Gailey, F. ‘Raised mortality from lung cancer and high sex ratios of births associated with industrial pollution’, British Journal of Industrial Medicine, 42, 475-480 (1985).

13. Lloyd, O. Ll. ‘Respiratory cancer clustering associated with localized industrial air pollution’, Lancet, 8059, 318-320 (1978).

14. Openshaw, S. CATMOG 38: The Modifiable Areal Unit Problem, Geo Books, Norwich, 1984.

15. Openshaw, S., Charlton, M., Wymer, C. and Craft, A. W. ‘A mark 1 geographical analysis machine for the automated analysis of point data sets’, International Journal of Geographical Information Systems, 1, 335-338 (1987).

16. Gailey, F. A. Y. and Lloyd, O. Ll. ‘A wind tunnel study of the flow of air pollution in Armadale, Central Scotland’, Ecology of Disease, 4, 419-431 (1983).

17. Lloyd, O. Ll. and Gailey, F. A. ‘Techniques of low technology sampling of air pollutions by metals: a comparison of concentrations and map patterns’, British Journal of Industrial Medicine, 44, 494-504 (1987).
