
Application of Geostatistical Methods for Public Health Risk Mapping

Daniela Nuvolone (*), Roberto Fresco (*), Sara Maio (**), Sandra Baldacci (**),

Anna Angino (**), Franca Martini (**), Marco Borbotti (**), Giovanni Viegi (**), Roberto della Maggiore (*)

(*) CNR Institute of Information Science and Technologies “Alessandro Faedo” Pisa Research Area

Pisa, Italy [email protected]

(**) CNR Institute of Clinical Physiology Pisa, Italy

[email protected]

SUMMARY

The Information Systems Technology Centre of ISTI-CNR uses GIS technology to manage epidemiological and environmental data, mainly on air quality topics. This work uses data from epidemiological surveys based on a standardized interviewer-administered questionnaire covering respiratory symptoms and diseases, lifestyle and personal habits. The study applies geostatistical methods to obtain a respiratory health risk map. A variable representing the respiratory health status of the population was created from information about the presence or absence of respiratory symptoms/diseases derived from the epidemiological questionnaire. Exploratory Spatial Data Analysis (ESDA) tools provided an understanding of the spatial and structural properties of the dataset. The Inverse Distance Weighted interpolation technique was used to create continuous surfaces from the sample points. To assess the resulting map, the population was classified according to it and classical statistical analyses were performed. Results show that interpolation gives a good representation of the distribution of symptoms/diseases over land.

KEYWORDS: Geostatistics, epidemiology, health risk map.

INTRODUCTION

In recent years increasingly powerful and versatile geostatistical tools have been developed for geoscience applications. Geostatistics is a methodology for incorporating the spatial and temporal coordinates of observations in data processing (Houlding, 2000). It is a relatively new discipline, developed in the 1960s primarily by mining engineers who faced the problem of evaluating recoverable reserves in mining deposits. The applications of geostatistics have spread from the original metal mining topics to such diverse fields as soil science, oceanography, hydrogeology, agriculture, environmental science and, more recently, health science (Berke, 2004). A key feature of epidemiological data is their location in a space-time continuum, and this space-time dimension should be incorporated in any descriptive analysis of the data (Lawson, 2001). Recently GIS has emerged as an innovative and important component of many projects in public health and epidemiology: it has proved useful for epidemiological research, decision-making, planning, management and dissemination of information (Elliott, 2004). GIS can be used to map and analyse the geographical distribution of populations at risk, health outcomes and risk factors; to explore associations between risk factors and health status; and to plan public health services (De Lepper, 1994; Bottai, 2002). Many studies combine GIS techniques (geocoding, buffers) with classical statistics (chi-square test, multiple logistic regression) to examine the relation between risk factors and population health status (Nicolai, 1996). A classical approach consists of using GIS as a tool to classify the population with reference to geographical items (Rytkönen, 2004). Classical statistics are then applied to assess the hypothesis that inspired the classification (Nicolai, 2003; Wjst, 1993; English, 1999).

GEOSTATISTICS AND EPIDEMIOLOGY

We followed the classical epidemiological approach described in another work we are presenting at the 8th AGILE Conference (“Gis for epidemiological studies” by Maio et al.): we used distance from roads as a proxy for exposure to road traffic pollution, classified the population according to home distance from roads, and then analysed the relationship between proximity to roads and the prevalence of respiratory symptoms/diseases. In the present study we explored a new use of geostatistics in epidemiology, experimenting with an innovative GIS-based approach to the analysis of epidemiological data. We applied geostatistical tools to an epidemiological dataset in order to discover its structural and spatial patterns, if any, using correlation functions (semivariogram). A health risk map, representing the spatial distribution of respiratory symptoms and diseases, was then produced through spatial interpolation techniques. A troublesome aspect of our study is the binary nature of our dataset (absence/presence of respiratory symptoms/diseases); we addressed it by constructing a continuous variable suitable for geostatistical application.

THE EXPERIMENTED MODEL

Starting from epidemiological data we produced a risk map for respiratory symptoms/diseases through the following steps:

1. geocoding of subjects by home addresses to obtain a layer of points;
2. calculation of a respiratory health value for each point by a variable that gets information from epidemiological data (local symptomatical index);
3. use of geostatistical tools to get respiratory health risk map.

To assess the significance of the produced map we performed two additional steps:

4. land classification according to four levels of health status;
5. validation of results by means of classical statistical analysis (bivariate and multiple logistic regression analysis).

CASE STUDY

The study makes use of data gathered during an epidemiological survey performed by the Italian National Council of Research (CNR) and the Italian National Electricity Power Supplier (ENEL) in 1991-93. This survey aimed to analyse the relationship between traffic exposure and respiratory symptoms/diseases in the Pisa-Cascina area (Tuscany, Italy). This epidemiological dataset has been subjected to many statistical analyses by the CNR Institute of Clinical Physiology in order to assess the distribution of symptoms/diseases and their relation with different environmental conditions (Viegi, 1991; Viegi, 1994; Viegi, 1999; Simoni, 1998; Carrozzi, 1993). The basic numeric cartography used in this study was mainly supplied by local authorities: the Municipalities of Pisa and Cascina, the Provincial Administration of Pisa and the Tuscany Region.

GEOCODING

The subjects’ residences were integrated in a Geographical Information System: geocoding is based on the subjects’ home addresses, and this operation produced 1121 different points for the whole sample of 2816 subjects (the survey was designed on a family cluster basis, so many subjects share the same home address). As figure 1 shows, the sample points have a markedly inhomogeneous and clustered distribution. This is a consequence of the original sampling method, which was based on administrative entities (census tracts) rather than on strictly geographical features.

Figure 1: Geocoding of subjects by home addresses (scale bar: 0-6,000 meters)

THE LOCAL SYMPTOMATICAL INDEX

The first step of the research dealt with the possibility of applying geostatistical tools to the epidemiological dataset. Because of the binary nature of our data (presence/absence of respiratory symptoms/diseases), we faced the problem of constructing a new continuous variable suitable for geostatistical application. Geostatistics works on (x,y,z) triplets (the coordinates of a point and the value of a continuous variable at that point), whereas our sample is characterized by a high number of binary variables which represent the health situation of the studied population. The scheme we adopted to overcome this inconvenience is based on the creation of a new variable working as an indicator of population health status. The CNR standardized questionnaire gives information about 14 respiratory symptoms/diseases: cough, phlegm, attacks of shortness of breath with wheeze, bronchial asthma, eye redness, hay-fever, wheezes, dyspnea, chronic bronchitis, bronchiectasis, emphysema, rhinitis, sinusitis, eczema. We assigned a value of 0 or 1 to each symptom/disease according to its absence or presence; the continuous variable, called local symptomatical index, was computed for each of the 1121 home addresses as the mean of the values referring to the subjects associated with each point. Because of the great number of missing data for rhinitis, sinusitis, chronic bronchitis and bronchiectasis, we initially removed these four symptoms/diseases from our analysis. After calculating the mean values for each of the 1121 home addresses, we obtained a variable ranging from 0 to 0.75 with only 67 different values. This low variability of our local symptomatical index led us to review our calculation and to recover part of the information we had previously lost. To obtain the variable values, we therefore considered up to 14 symptoms/diseases where available, and calculated the mean for each point over the number of symptoms/diseases actually reported for that point. We thus obtained a new variable ranging from 0 to 0.714 with 162 different values. This is our local symptomatical index.
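The index calculation can be illustrated with a minimal Python sketch. It rests on one reading of the text, namely that the index at a home address is the mean of all available 0/1 answers given by the subjects living there, with missing answers skipped. The function name, the input layout and the toy data are hypothetical and do not come from the survey.

```python
from collections import defaultdict

def local_symptomatical_index(subjects):
    """Return {address: index}, where index is the mean of the available
    0/1 symptom answers of all subjects geocoded at that address.

    subjects: iterable of (address, answers); answers is a list holding
    1 (present), 0 (absent) or None (missing) per symptom/disease.
    Assumption: missing answers are simply left out of the mean.
    """
    totals = defaultdict(lambda: [0, 0])  # address -> [sum of answers, answer count]
    for address, answers in subjects:
        for value in answers:
            if value is None:  # missing answer: not counted
                continue
            totals[address][0] += value
            totals[address][1] += 1
    return {a: s / n for a, (s, n) in totals.items() if n > 0}

# Toy data (hypothetical): two subjects share address "A", one lives at "B".
toy_subjects = [
    ("A", [1, 0, None, 1]),
    ("A", [0, 0, 0, None]),
    ("B", [0, 0, 0, 0]),
]
index = local_symptomatical_index(toy_subjects)
```

In this toy case address "A" has 2 positive answers out of 6 available ones, so its index is 2/6, while address "B" has index 0.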

EXPLORATORY SPATIAL DATA ANALYSIS

We used a commercial tool (the Geostatistical Analyst extension of ArcMap 8.2 by ESRI Inc.) for geostatistical analysis. Exploratory Spatial Data Analysis (ESDA) allows a deeper understanding of the properties of the spatial dataset in order to best model the data using geostatistics. A series of tools were used to explore the distribution of data (Histogram and Normal QQPlot), to look for global and local outliers (Histogram, Semivariogram cloud, Voronoi Map), to look for global trends (Trend Analysis), and to examine spatial autocorrelation and directional variation (Semivariogram cloud). Figures 2, 3, 4 and 5 summarize the ESDA results.

The Histogram (fig. 2) and the Normal QQPlot (fig. 3) indicate that the local symptomatical index values are not approximately normally distributed. The statistics calculated by the histogram tool help to examine the distribution: in a normal distribution the mean and the median should be similar, the skewness index should be near zero and the kurtosis index should be near three. If data are normally distributed, the points in the QQPlot should lie close to a straight line. The positive skewness reported in the histogram (median lower than mean) suggests that more than 50% of points have a local symptomatical index value lower than the mean value, which means a general health status better than the mean level at each of those home address points.

Trend Analysis (fig. 4) shows that no global trend can be identified in our dataset. The two lines plotted in the xz and yz planes represent the polynomial curves fitted to the points projected along the east-west and north-south directions respectively. If the curve through the projected points is flat, as in our case, no trend exists.

The semivariogram cloud (fig. 5) allows spatial autocorrelation to be explored by measuring the distance between any two locations and then plotting half the squared difference between the values at those locations. The x-axis represents the distance between the locations and the y-axis half the squared difference of their index values. Each dot in the semivariogram cloud represents a pair of locations. If data are spatially dependent, pairs of points that are close together (on the far left of the x-axis) should have a lower difference (be low on the y-axis). As points become farther away from each other (moving right on the x-axis) the squared difference should become greater (moving up on the y-axis).
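The construction of the semivariogram cloud described above can be sketched in a few lines of Python. This is only an illustration of the definition (one dot per pair of locations, distance on the x-axis, half the squared difference on the y-axis); it is not the Geostatistical Analyst implementation, and the function name and sample coordinates are hypothetical.

```python
import math
from itertools import combinations

def semivariogram_cloud(points):
    """Return one (distance, semivariance) tuple per pair of points.

    points: list of (x, y, z), with z the local symptomatical index.
    semivariance = half the squared difference of the two z values.
    """
    cloud = []
    for (x1, y1, z1), (x2, y2, z2) in combinations(points, 2):
        distance = math.hypot(x2 - x1, y2 - y1)
        semivariance = 0.5 * (z1 - z2) ** 2
        cloud.append((distance, semivariance))
    return cloud

# Hypothetical sample: three addresses on a line, coordinates in meters.
sample = [(0.0, 0.0, 0.10), (3.0, 4.0, 0.30), (6.0, 8.0, 0.10)]
cloud = semivariogram_cloud(sample)
```

With n sample points the cloud holds n(n-1)/2 dots, which for the 1121 home addresses of this study means several hundred thousand pairs.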

Figure 2: Histogram

Figure 3: Normal QQPlot

Figure 4: Trend Analysis

Figure 5: Semivariogram cloud

The semivariogram cloud in figure 5 shows that no spatial autocorrelation can be identified in our sample points. No autocorrelation appears when varying the lag size and the number of lags either: reducing the lag size means zooming in on the details of the local variation between neighbouring sample points. We also used the geostatistical tools to look for directional influences (anisotropy) in our dataset: analysing the semivariogram along the main road direction, no autocorrelation is evident. We had expected directional influences to be relevant in our analysis, but the unusual nature of our variable and perhaps the many factors adding noise to it (winds, personal history of subjects) evidently mask spatial phenomena.
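The role of lag size and number of lags can be made concrete with a small sketch of an empirical (binned) semivariogram: pairs are grouped by distance into lags and the semivariances are averaged within each lag. This is a generic textbook construction written for illustration, not the algorithm used by Geostatistical Analyst; the function name, parameters and sample data are assumptions.

```python
import math
from itertools import combinations

def empirical_semivariogram(points, lag_size, n_lags):
    """Average the pairwise semivariances inside distance bins (lags).

    Returns a list of (lag_centre, mean_semivariance_or_None, pair_count).
    Pairs farther apart than lag_size * n_lags are ignored.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for (x1, y1, z1), (x2, y2, z2) in combinations(points, 2):
        distance = math.hypot(x2 - x1, y2 - y1)
        k = int(distance // lag_size)  # index of the lag this pair falls in
        if k < n_lags:
            sums[k] += 0.5 * (z1 - z2) ** 2
            counts[k] += 1
    return [
        ((k + 0.5) * lag_size, sums[k] / counts[k] if counts[k] else None, counts[k])
        for k in range(n_lags)
    ]

# Hypothetical points on a line with alternating index values.
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2), (2.0, 0.0, 0.0), (3.0, 0.0, 0.2)]
bins = empirical_semivariogram(line, lag_size=1.0, n_lags=4)
```

Spatially dependent data would give mean semivariances that rise with the lag centre; a curve that stays flat or erratic at every lag size, as reported here, gives no evidence of autocorrelation.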

THE RESPIRATORY HEALTH RISK MAP

Geostatistical Analyst provides a set of interpolation techniques for predicting the value of a variable at unsampled locations, based on the values at the sample points. We tried several interpolations using the different methods available in Geostatistical Analyst. One of the most used is the Kriging method (Bailey, 2001; Deyian, 2004; Kleinschmidt, 2000) but, because of the absence of autocorrelation between sample points, Kriging is not well suited to our dataset. We therefore compared two other techniques: Inverse Distance Weighted (IDW) and Spline.

IDW implements the assumption that things that are close to each other are more alike than those that are farther apart. To predict a value for any unmeasured location, IDW uses the measured values surrounding the prediction location: it assumes that each measured point has a local influence on the predicted value that diminishes as distance increases. The IDW technique does not take spatial autocorrelation into account (Johnston, 2001). The Spline function is conceptually similar to fitting a rubber membrane through the measured sample values while minimizing the total curvature of the surface. Both Spline and IDW are exact interpolators, that is, the surface is forced to pass through each measured sample point, but IDW will never predict values above the maximum measured value or below the minimum measured value, while Spline can do so (Johnston, 2001).

In our application the two methods give very similar interpolations; comparing cross-validation results, IDW yields a lower root-mean-square prediction error (RMSPE) than Spline, so we decided to apply the Inverse Distance Weighted technique. We used a power value equal to 1: this is the optimal value calculated by Geostatistical Analyst by minimizing the root-mean-square prediction error. Another matter with IDW is the neighbourhood search strategy. Because IDW reduces the influence of measured values on the prediction as distance increases, it is common practice to limit the number of measured values involved in the prediction in order to speed up calculation. After trying different values for the number of neighbours to include, we settled on 10 neighbours. As for the shape of the neighbourhood, we selected the circle because of the absence of evident directional influences.

We thus obtained the continuous surface representing the study area (figure 6). This surface is to be understood as the risk map for respiratory symptoms/diseases as derived from the local symptomatical index built upon epidemiological data. It shows the trend of the “respiratory health” phenomenon considered as an environmental indicator.
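As an illustration of the technique, the following Python sketch implements a plain IDW predictor with a power parameter and a nearest-neighbour limit, together with a leave-one-out cross-validation that returns the RMSPE. It is a simplified stand-in written for this text, not the Geostatistical Analyst algorithm: the function names are hypothetical, the neighbourhood is just the n nearest samples, and the sample data are invented.

```python
import math

def idw_predict(x, y, samples, power=1.0, n_neighbours=10):
    """Predict the value at (x, y) as the inverse-distance-weighted mean
    of the n_neighbours nearest samples [(xi, yi, zi), ...]."""
    nearest = sorted(
        (math.hypot(xi - x, yi - y), zi) for xi, yi, zi in samples
    )[:n_neighbours]
    for distance, z in nearest:
        if distance == 0.0:  # exact interpolator: return the measured value
            return z
    weights = [1.0 / distance ** power for distance, _ in nearest]
    weighted_sum = sum(w * z for w, (_, z) in zip(weights, nearest))
    return weighted_sum / sum(weights)

def loo_rmspe(samples, power=1.0, n_neighbours=10):
    """Leave-one-out cross-validation: predict each sample from the
    others and return the root-mean-square prediction error."""
    squared_errors = []
    for i, (x, y, z) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        error = idw_predict(x, y, others, power, n_neighbours) - z
        squared_errors.append(error * error)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical samples on a line.
pair = [(0.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
midpoint = idw_predict(1.0, 0.0, pair)       # equal weights
off_centre = idw_predict(0.5, 0.0, pair)     # closer to the first sample
triple = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (2.0, 0.0, 1.0)]
rmspe = loo_rmspe(triple)
```

Because the prediction is a weighted mean with positive weights, it always stays between the minimum and maximum of the neighbouring values. Calling `loo_rmspe` for several candidate powers and keeping the smallest error mimics, in a simplified way, the selection of the power value by RMSPE minimization.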

Figure 6: The interpolated surface (scale bar: 0-6,000 meters)

ASSESSING THE MAP BY CLASSICAL STATISTICS

We then classified the obtained map according to 4 ascending threshold values, identifying areas with a homogeneous level of symptoms/diseases, as shown in figure 7. The threshold values were chosen empirically, by trial and review: the nature of our local symptomatical index is quite unusual and the literature gives no indications on this topic.
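The classification step can be sketched as a simple lookup of an index value against ascending cut points. The paper does not report its threshold values, so the three internal breaks below are purely hypothetical placeholders that yield four classes; the function and constant names are also assumptions.

```python
from bisect import bisect_right

CLASS_LABELS = ["low", "medium-low", "medium-high", "high"]

# Hypothetical cut points: the actual thresholds are not given in the text.
HYPOTHETICAL_BREAKS = [0.10, 0.20, 0.30]

def classify(index_value, breaks=HYPOTHETICAL_BREAKS):
    """Return the class number (1 to 4) for a local symptomatical index
    value; a value equal to a break falls in the higher class."""
    return bisect_right(breaks, index_value) + 1

classes = [classify(v) for v in (0.05, 0.10, 0.25, 0.714)]
labels = [CLASS_LABELS[c - 1] for c in classes]
```

The same function can be applied both to the cells of the interpolated surface and to the interpolated value at each subject's home address, which is how a surface classification carries over to the subjects.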

Figure 7: Population respiratory health status map (legend: local symptomatical index classes low, medium-low, medium-high and high; sample points; scale bar: 0-6,000 meters)

The validation of the obtained classification was performed by means of statistical analysis. Using GIS technology, the whole sample (2816 subjects) was classified according to the same values used to classify the interpolated surface, obtaining 4 classes of subjects characterized by an increasing value of the local symptomatical index from class 1 to class 4. Statistical analysis, performed using Statistical Package for the Social Sciences (SPSS) routines, aimed to investigate statistically significant associations between presenting respiratory symptoms/diseases and belonging to one of the four classes into which the area was classified after interpolation. In other words, classical statistical analysis was used to assess the results of the geostatistical application.

The prevalence rates of respiratory symptoms/diseases and the significance levels (p values) in relation to the classification of subjects in 4 classes are shown in table 1. In order to take into account the role of different risk factors in determining respiratory symptoms and diseases, we used a multiple logistic regression model with sex, age, smoking habit, passive smoking exposure, education and work exposure as independent variables. Table 2 summarizes the results of the multivariate models: odds ratios and confidence intervals obtained for the 2816 subjects (statistically significant values are represented in bold, borderline values in italic).

Table 1 shows prevalence rates that are higher in class 4 (highest local symptomatical index) than in class 1 (lowest local symptomatical index) for all reported symptoms/diseases, with only a few non-monotonic steps between intermediate classes (dyspnea, rhinitis, sinusitis). This trend is highly statistically significant (p<0.001) for all symptoms/diseases except rhinitis (borderline value p=0.083), sinusitis and eczema (p=0.127 and p=0.130 respectively); no values are available for bronchiectasis. In the multiple logistic regression model, we selected class 1 as the control class and studied the relationship with the other three classes. We obtained increasing OR values from class 2 to class 4 for almost all symptoms/diseases: this trend is statistically significant for cough, wheezes, attacks of shortness of breath, hay fever and bronchial asthma. In the other cases significant OR values are nevertheless present in one or two classes of subjects, mainly in class 4. These findings provide evidence that the geostatistical map gives a good indication of the distribution of symptoms/diseases over land.

Table 1 - Prevalence rates (%) of respiratory symptoms/diseases in relation to subjects classification.

symptoms/diseases     class 1   class 2   class 3   class 4   p
cough                 12.5      15.7      20.6      27.4      0.000
phlegm                12.2      13.0      17.3      27.4      0.000
wheeze                16.5      24.4      30.2      36.2      0.000
dyspnea               19.2      25.8      23.4      31.4      0.000
attacks                5.1       8.6      10.4      13.3      0.000
emphysema              2.9       4.7       7.1       9.1      0.000
chronic bronchitis     2.6       2.9       5.1       8.5      0.000
bronchiectasis         -         -         -         -        .
hay fever             14.5      20.2      23.6      30.1      0.000
eye redness           17.7      20.9      24.9      26.8      0.000
bronchial asthma       4.0       6.6       8.4      12.7      0.000
rhinitis              18.9      24.2      24.1      34.0      0.083
sinusitis             23.6      31.9      37.9      34.9      0.127
eczema                 9.9      10.8      11.3      13.9      0.130
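To relate the prevalence rates of Table 1 to the odds ratios discussed in the text, a crude (unadjusted) odds ratio can be computed directly from two prevalences. The sketch below does this for the cough row, taking class 1 as reference. It is only a worked arithmetic example: the crude values are not the ones in Table 2, which come from a multiple logistic regression adjusted for sex, age, smoking habit, passive smoking exposure, education and work exposure. Function names are hypothetical.

```python
def odds(prevalence_percent):
    """Convert a prevalence in percent into odds p / (1 - p)."""
    p = prevalence_percent / 100.0
    return p / (1.0 - p)

def crude_odds_ratios(prevalences):
    """Odds ratios of classes 2..n against class 1 (the reference)."""
    reference = odds(prevalences[0])
    return [odds(p) / reference for p in prevalences[1:]]

# Cough prevalences (%) for classes 1 to 4, taken from Table 1.
cough = [12.5, 15.7, 20.6, 27.4]
cough_or = crude_odds_ratios(cough)
rounded = [round(v, 2) for v in cough_or]  # about 1.30, 1.82, 2.64
```

For cough the crude ratios (about 1.30, 1.82 and 2.64) are of the same order as the adjusted ones reported in Table 2 (1.32, 1.80 and 2.34); the differences reflect the adjustment for the covariates.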

Table 2 - Relationship between respiratory symptoms/diseases and the classification obtained by a geostatistical approach: odds ratio (OR) and 95% confidence intervals.

symptoms/diseases     control   class 2             class 3             class 4
                      OR        OR     95% CI       OR     95% CI       OR     95% CI
cough                 1         1.32   0.97-1.78    1.80   1.36-2.38    2.34   1.75-3.13
phlegm                1         1.06   0.77-1.46    1.41   1.05-1.88    2.40   1.79-3.23
wheezes               1         1.71   1.32-2.21    2.27   1.78-2.89    2.74   2.11-3.56
dyspnea               1         1.37   1.05-1.80    1.11   0.85-1.45    1.56   1.18-2.07
attacks               1         1.77   1.19-2.64    2.07   1.52-3.21    2.87   1.95-4.23
emphysema             1         1.45   0.84-2.53    2.04   1.25-3.34    2.49   1.05-4.13
chronic bronchitis    1         1.09   0.57-2.07    1.71   0.99-2.95    2.55   1.49-4.34
bronchiectasis        -         -      -            -      .            -      -
hay fever             1         1.54   1.18-2.01    1.86   1.45-2.39    2.76   2.12-3.60
eye redness           1         1.26   0.97-1.62    1.59   1.25-2.02    1.80   1.39-2.34
bronchial asthma      1         1.71   1.09-2.69    2.28   1.50-3.47    3.66   2.41-5.54
rhinitis              1         1.26   0.63-2.55    1.30   0.67-2.52    2.09   1.09-4.00
sinusitis             1         1.67   0.87-3.20    2.07   1.13-3.79    1.82   0.98-3.38
eczema                1         1.12   0.81-1.56    1.14   0.83-1.57    1.46   1.04-2.03
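A quick way to read Table 2 is to check whether each 95% confidence interval excludes 1, which corresponds to statistical significance at the 5% level for the odds ratio. The short sketch below applies this standard rule to the cough row; the helper name is hypothetical and the numbers are copied from the table.

```python
def is_significant(ci_low, ci_high):
    """An odds ratio differs significantly from 1 (5% level) when its
    95% confidence interval does not contain 1."""
    return ci_low > 1.0 or ci_high < 1.0

# Cough row of Table 2: class -> (OR, CI lower bound, CI upper bound).
cough_table2 = {
    "class 2": (1.32, 0.97, 1.78),
    "class 3": (1.80, 1.36, 2.38),
    "class 4": (2.34, 1.75, 3.13),
}
significant = {
    name: is_significant(low, high)
    for name, (_, low, high) in cough_table2.items()
}
```

For cough the class 2 interval (0.97-1.78) includes 1, while the class 3 and class 4 intervals lie entirely above 1.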

THE SUBDIVISION IN SMALL AREAS

The spatial distribution of the data suggested identifying subsamples characterized by a more homogeneous and less clustered distribution pattern. For this reason the whole sample was subdivided into seven subsamples (small areas), as shown in figure 8. The small areas were named Pisa, Putignano, Oratoio, Titignano, Navacchio, San Benedetto and Cascina, after the administrative boundaries of the localities.

Figure 8: Subdivision of the whole sample in small areas (map labels: Pisa, Putignano, Oratoio, Titignano, Navacchio, San Benedetto, Cascina; scale bar: 0-6,000 meters)

The subdivision into small areas is also justified by other considerations. If we analyse the distribution of the population sample over land and overlay different layers (streets, buildings, general topography), we notice different characteristics between adjacent parts of our sample. In particular we can distinguish an urban area in the western part (Pisa), a suburban area in the proximity of the city (Putignano, Oratoio) and a rather rural area in the eastern part of our sample (Titignano, Navacchio, San Benedetto), while the Cascina area again has suburban characteristics. Diverse factors may affect respiratory health in such different conditions, so reducing the study scale may reveal local phenomena. The ESDA was repeated for each small area: results generally showed a more normal distribution of the local symptomatical index values. However, no spatial autocorrelation could be observed for the small areas either, except for the Cascina subsample, in which the empirical semivariogram showed a slight correlation between points. We used the same interpolation method (IDW) to obtain continuous surfaces for the small areas, and these were classified according to the same local symptomatical index values used for the whole sample. Figure 9 shows the classified surface produced for the Cascina subsample.

Figure 9: Cascina subsample (legend: local symptomatical index classes low, medium-low, medium-high and high; sample points; scale bar: 0-800 meters)

The classical statistical approach applied to validate the map obtained for the whole sample was repeated for two small areas: Cascina and Pisa. Among the seven subareas, only the Cascina and Pisa samples have a sufficient number of subjects for statistical analysis. Statistical analyses confirm the good results observed for the whole sample: for some symptoms/diseases we obtained a statistically significant trend, whereas in other cases statistical significance is present in one or two classes.

CONCLUSIONS

The overall model we experimented with gave good results as far as the study approach is concerned. We used sampled epidemiological data to derive an environmental indicator of respiratory health (the local symptomatical index). We succeeded in producing a land map showing the trend of the “respiratory health” phenomenon. We applied geostatistical tools and spatial interpolation in order to produce the distribution map of respiratory symptoms. We used classical statistical methods to assess the interpolated map. Statistical outputs gave satisfactory results, both for the whole sample and for the small areas. The great number of significant associations obtained indicates that the interpolated map may be considered a good representation of the distribution of respiratory symptoms/diseases over land. In particular, no substantial difference emerged from the separate study of the small areas: the overall results obtained for the whole area were confirmed, with little additional information. From an environmental point of view the study suggests the need for a multilayer analysis that considers all the possible risk factors for population respiratory health status. For example it could be relevant to focus on the presence of different kinds of pollution sources.
In this respect we also underline the need for greater collaboration from administrative bodies in collecting and publishing the data necessary for a deeper knowledge of the study area. We refer especially to data about traffic density, air emissions and pollutant monitoring. In our study the lack of spatial autocorrelation did not allow us to exploit the advantages of other interpolation techniques available in the Geostatistical Analyst. For example the Kriging family of methods produces not only prediction surfaces but also error or uncertainty surfaces, giving an indication of how good the predictions are. Our approach implements a data mining scheme that is quite unusual for an epidemiological study, and we propose it to epidemiologists as a possible new approach for environmental investigations.

ACKNOWLEDGEMENTS

We specially thank Matteo Bottai for his precious support on statistical items and for the revision of the text. We also thank the review Committee of the 8th AGILE Conference for the useful suggestions, which helped us to refine our work. Finally, thanks to Fred Toppen for the great help he gave us in laying out our study.

BIBLIOGRAPHY

Bailey T. C. Spatial statistical methods in health. Cad. Saude Publica, Rio de Janeiro, 17(5):1083-1098, 2001

Berke O. Exploratory disease mapping: kriging the spatial risk function from regional count data. International Journal of Health Geographics, 3:18, 2004

Bottai M., Geraci M. Uso dei metodi di statistica spaziale in ambienti GIS: alcune applicazioni geostatistiche. Working Paper CNUCE-B4-2002-02, CNUCE CNR, 2002

Carrozzi L., Viegi G., Baldacci S. Effetti respiratori dell’inquinamento negli ambienti confinati: valutazioni epidemiologiche. Atti del Convegno nazionale “Aria 92” La qualità dell’aria negli ambienti interni, Pisa, 28-29 ottobre 1992: 246, 1993

Cromley E. K., McLafferty S. L. GIS and Public Health. The Guilford Press, 2002

Deyian L. Geostatistical Analysis of Chinese Cancer Mortality: Variogram, Kriging and beyond. Journal of Data Science, 2:177-193, 2004

De Lepper M. J. C., Scholten H. J., Stern R. M. The Added Value of Geographical Information Systems in Public and Environmental Health. Kluwer Academic Publishers, 1994

Elliott P., Wartenberg D. Spatial epidemiology: current approaches and future challenges. Environ Health Perspect, 112(9):998-1006, 2004

English P., Neutra R., Scalf R., Sullivan M., Waller L., Zhu L. Examining associations between childhood asthma and traffic flow using a geographic information system. Environmental Health Perspectives, 107:761-767, 1999

Houlding S. W. Practical Geostatistics - Modeling and Spatial Analysis. Springer, New York, 2000

Johnston K., Ver Hoef J. M., Krivoruchko K., Lucas N. Using Geostatistical Analyst. ESRI, 2001

Kleinschmidt I., Bagayoko M., Clarke G.P.Y., Craig M., Le Sueur D. A spatial statistical approach to malaria mapping. International Journal of Epidemiology, 29:355-361, 2000

Lawson A. B. Statistical Methods in Spatial Epidemiology. John Wiley & Sons, 2001

Nicolai T., von Mutius E. Respiratory hypersensitivity and environmental factors: East and West Germany. Toxicology Letters, 86:105-113, 1996

Nicolai T., Carr D., Weiland S.K., Duhme H., von Ehrenstein O., Wagner C., von Mutius E. Urban traffic and pollutant exposure related to respiratory outcomes and atopy in a large sample of children. Eur Respir J, 21:956-963, 2003

Rytkönen M.J. Not all maps are equal: GIS and spatial analysis in epidemiology. International Journal of Circumpolar Health, 63:1, 2004

Simoni M., Biavati P., Carrozzi L., Viegi G., Paoletti P., Matteucci G., Ziliani GL, Ioannilli E., Sapigni T. The Po River Delta (North Italy) indoor epidemiological study: home characteristics indoor pollutants and subjects’ daily activity pattern. Indoor Air, 8:70-9, 1998

Viegi G., Paoletti P., Carrozzi L., Vellutini M., Diviggiano E., Di Pede C., Pistelli G., Giuntini C., Lebowitz M. Prevalence rates of respiratory symptoms in Italian general population samples, exposed to different levels of air pollution. Environ Health Perspect, 94:95-99, 1991

Viegi G., Baldacci S., Vellutini M., Carrozzi L., Modena P., Pedreschi M., Maggiorelli F., Di Pede F., Paoletti P., Giuntini C. Prevalence rates of diagnosis of asthma in general population samples of Northern and Central Italy. Monaldi Arch Chest Dis, 49(3):191-196, 1994

Viegi G., Pedreschi M., Baldacci S. et al. Prevalence rates of respiratory symptoms and diseases in general population samples of North and Central Italy. Int J Tuberc Lung Dis, 3:1034-1042, 1999

Wjst M., Reitmeir P., Dold S., Wulff A., Nicolai T., von Loeffelholz-Colberg E.F., von Mutius E. Road traffic and adverse effects on respiratory health in children. British Medical Journal, 307:596-600, 1993
