
Edoardo PIZZOLI, Chiara PICCINI

NTTS 2013 - New Techniques and Technologies for Statistics

SPATIAL DATA REPRESENTATION: AN IMPROVEMENT OF STATISTICAL DISSEMINATION FOR POLICY ANALYSIS

Brussels, 5-7 March 2013



Hypothesis

Statistical units bring information on the territory they belong to: their individual characteristics are proxies of the territorial characteristics.

Data available at the statistical-unit level can be used to expand information on administrative areas, provided that the studied phenomena show an empirical spatial correlation.

If such a statistical relationship exists among a set of units with respect to a specific characteristic, a spatial analysis is possible.

A map or cartogram is always used to represent the phenomena under investigation in space.

Introduction/1




Geostatistics

Geostatistics is usually applied to the analysis of natural phenomena. Its main assumptions are that point values are influenced by the values at nearby points (spatial autocorrelation) and that the phenomenon is continuously distributed over the territory.

Socio-economic variables, discrete by nature, can be processed with geostatistical methods by assuming both that their values are spatially dependent and that the analyzed phenomenon is continuously distributed over the territory. The basic hypothesis is that the analyzed units can act as measurement points of phenomena distributed over the whole territory.

The graphical result is certainly better than a thematic cartogram which forces the data into administrative boundaries, often arbitrarily set.

Introduction/2

Spatial data

Each data value is associated with a location in space, and there is at least an implied connection between the location and the data value.

Spatial autocorrelation and geostatistics

Spatial autocorrelation is defined as the variation of a property within a geo-space: characteristics at proximal locations appear to be correlated, either positively or negatively.

Spatial autocorrelation is the subject matter of geostatistics.
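
As a rough illustration of how spatial autocorrelation can be quantified, the sketch below computes Moran's I, a common global index that is not named in these slides. The function name, the binary distance-band weights and the `bandwidth` parameter are assumptions made for the example only.

```python
import math


def morans_i(points, values, bandwidth):
    """Moran's I with binary distance-band weights.

    Two units are treated as neighbours (weight 1) when their distance is
    at most `bandwidth`; all other pairs get weight 0. This weighting
    scheme is an assumption of the sketch, not a choice made in the slides.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    cross = 0.0   # sum of w_ij * dev_i * dev_j over neighbour pairs
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= bandwidth:
                cross += dev[i] * dev[j]
                w_sum += 1.0

    total_sq = sum(x * x for x in dev)
    if w_sum == 0.0 or total_sq == 0.0:
        raise ValueError("no neighbour pairs or no variation in the values")
    return (n / w_sum) * (cross / total_sq)


# Hypothetical units along a line: similar values sit next to each other.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = morans_i(pts, [1.0, 1.0, 5.0, 5.0], bandwidth=1.0)
alternating = morans_i(pts, [1.0, 5.0, 1.0, 5.0], bandwidth=1.0)
```

In this toy case the clustered values give a positive index and the alternating values a negative one, matching the positive and negative correlation mentioned above.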




Spatial prediction models (algorithms)

Deterministic
Arbitrary or empirical models.
No estimate of model error.
E.g. Thiessen polygons, inverse distance weighting, trend surfaces, splines.
More primitive and often suboptimal; in some situations they can perform as well as the statistical methods, or even better.

Statistical
Model parameters estimated in an objective way.
Estimate of the prediction error available.
E.g. kriging, environmental correlation, Bayesian-based models, mixed models.
The input dataset usually needs to satisfy strict statistical assumptions.
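
To make the deterministic family concrete, here is a minimal sketch of inverse distance weighting, one of the methods listed above. The function name, the `power` parameter and the sample coordinates are assumptions for illustration. Note that the method returns a value but no estimate of its error.

```python
import math


def idw(points, values, target, power=2.0):
    """Inverse distance weighted estimate at `target`.

    Each known point contributes with weight 1 / distance**power.
    If the target coincides with a known point, that value is returned.
    """
    num = 0.0
    den = 0.0
    for p, v in zip(points, values):
        d = math.dist(p, target)
        if d == 0.0:
            return v
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den


# Hypothetical data: two units with known values.
known_points = [(0.0, 0.0), (2.0, 0.0)]
known_values = [10.0, 20.0]
midpoint_estimate = idw(known_points, known_values, (1.0, 0.0))
near_first_estimate = idw(known_points, known_values, (0.5, 0.0))
```

The `power` exponent is chosen by the analyst rather than estimated from the data, which is what makes the model "arbitrary or empirical" in the sense used above.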

Geostatistical modelling




Basic steps of geostatistical analysis:

1. estimation of semivariogram

2. estimation of the parameters of the semivariogram model

3. estimation of the surface.

A validation of the estimation can be added.
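
Step 1 can be sketched as follows, assuming the classical estimator in which the semivariance for a lag class is half the mean squared difference between all pairs of values whose separation falls in that class. The function name and the `lag_width` and `n_lags` parameters are hypothetical.

```python
import math


def empirical_semivariogram(points, values, lag_width, n_lags):
    """Return (lag centre, semivariance, pair count) for each non-empty lag class.

    Lag class k collects the pairs whose distance d satisfies
    k * lag_width <= d < (k + 1) * lag_width.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            k = int(math.dist(points[i], points[j]) // lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1

    result = []
    for k in range(n_lags):
        if counts[k] > 0:
            centre = (k + 0.5) * lag_width
            result.append((centre, sums[k] / (2 * counts[k]), counts[k]))
    return result


# Hypothetical example with three units on a line.
gamma = empirical_semivariogram(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [1.0, 2.0, 4.0],
    lag_width=1.0, n_lags=3,
)
```

A semivariance that grows with the lag, as in this toy output, is the signature of spatial dependence that steps 2 and 3 then exploit.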

The kriging algorithm is an optimal interpolator: it generates the best linear unbiased estimate at each location, using the semivariogram model.

The most commonly used kriging algorithm is Ordinary Kriging (OK).

A normal distribution of the data is usually a prerequisite for the application of geostatistics: OK may give unacceptable results if the data are severely non-normal.
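
The following is a minimal sketch of the OK estimate at a single location, assuming an exponential semivariogram of the form nugget + partial_sill * (1 - exp(-h / range_param)). That parametrisation, the helper names and the sample data are assumptions. The weights and a Lagrange multiplier are obtained by solving the usual OK linear system, which forces the weights to sum to one.

```python
import math


def exponential_variogram(h, nugget, partial_sill, range_param):
    """Assumed model form; equals 0 at h = 0 by definition."""
    if h == 0.0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_param))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ordinary_kriging(points, values, target, variogram):
    """Return (estimate, kriging variance, weights) at `target`."""
    n = len(points)
    # Semivariances between data points, bordered by the unbiasedness constraint.
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Semivariances between each data point and the target.
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    sol = solve(a, b)
    weights, mu = sol[:n], sol[n]
    estimate = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + mu
    return estimate, variance, weights


# Hypothetical data and model parameters.
model = lambda h: exponential_variogram(h, 0.0, 1.0, 1.0)
pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
vals = [10.0, 20.0, 40.0]
est, var, w = ordinary_kriging(pts, vals, (1.0, 0.5), model)
```

The second returned value is the quantity that a prediction error map displays; it depends on the configuration of the points and on the semivariogram model, not on the observed values themselves.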




Indicator kriging (IK) is another geostatistical approach to geospatial modeling, which makes no assumption of normality and is essentially a non-parametric counterpart to OK.

IK uses indicator (0 or 1) variables to generate probabilities that a critical value was exceeded or not at each location in the study area, and then proceeds the same as OK.

If a threshold is used to create the indicator variable, the resulting interpolation map would show the probabilities of exceeding (or being below) the threshold.

This makes it possible to produce probability maps and risk maps.
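
The indicator coding itself is simple; a minimal sketch follows, with a hypothetical function name and made-up sample values. "Exceeded" is read here as strictly greater than the threshold, which is an assumption.

```python
def indicator_transform(values, threshold):
    """Code each value as 1 if it exceeds the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


# Made-up values, not taken from the dataset used in the slides.
sample = [10.0, 26.0, 30.0, 400.0]
indicators = indicator_transform(sample, threshold=26.0)

# The mean of the indicators is the observed share of units above the threshold.
share_above = sum(indicators) / len(indicators)
```

Kriging these 0/1 values instead of the raw data yields, at each location, an estimate that is read as the probability of exceeding the threshold.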




The problem of outliers

Outliers may provide useful information about the magnitude of the phenomenon; at the same time, they may unduly influence the results of the analysis.

Possible solutions: logarithmic transformation; winsorization; trimming.

A logarithmic transformation brings the data closer to a normal distribution, but it also flattens them out, so the information carried by the outliers is completely lost.

In a trimmed dataset, the extreme values are discarded; in a Winsorized dataset, the extreme values are instead replaced by other values.
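
The three treatments can be sketched as below. The function names, the count-based cut-offs (`k_low`, `k_high`) and the sample values are assumptions; the slides do not state which cut-offs or which logarithm base were used.

```python
import math


def log_transform(values):
    """Natural logarithm of each value (values must be strictly positive)."""
    return [math.log(v) for v in values]


def winsorize(values, k_low=0, k_high=1):
    """Replace the k_low smallest and k_high largest values by the nearest retained value."""
    s = sorted(values)
    lo = s[k_low]
    hi = s[len(s) - 1 - k_high]
    return [min(max(v, lo), hi) for v in values]


def trim(values, k_low=0, k_high=1):
    """Discard the k_low smallest and k_high largest values, keeping the original order.

    In a spatial analysis the coordinates of the discarded units
    would have to be dropped as well.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    keep = set(order[k_low:len(values) - k_high])
    return [v for i, v in enumerate(values) if i in keep]


# Made-up ratios with one extreme value.
ratios = [12.0, 20.0, 35.0, 2400.0]
logged = log_transform(ratios)
winsorized = winsorize(ratios)
trimmed = trim(ratios)
```

With these made-up numbers the winsorized list keeps four units and caps the extreme one at 35, while the trimmed list keeps only three units.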




Data from ASIA-Agricoltura (year 2009).

The process of representing the data is as follows:

1. specification of the territorial administrative level of interest;

2. normalization of the units’ addresses (data quality control);

3. assignment of geographical coordinates to units, starting from addresses;

4. correction of errors on coordinates;

5. mapping point location, visualizing their spatial distribution;

6. variographic analysis of data;

7. estimation in non-sampled points by means of the appropriate algorithm;

8. mapping estimated values over the territory;

9. cross-validation to test the accuracy of the estimation.
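
Step 9 is sketched below as leave-one-out cross-validation: each unit is removed in turn, its value is predicted from the remaining units, and the prediction errors are summarised. To keep the example short, a nearest-neighbour predictor (the rule behind Thiessen polygons) stands in for kriging; any function with the same signature could be passed instead. All names and data are hypothetical.

```python
import math


def nearest_neighbour(points, values, target):
    """Predict the value of the closest known point (Thiessen polygon rule)."""
    closest = min(zip(points, values), key=lambda pv: math.dist(pv[0], target))
    return closest[1]


def leave_one_out(points, values, predictor):
    """Return (mean error, root mean squared error, list of errors)."""
    errors = []
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_values = values[:i] + values[i + 1:]
        predicted = predictor(rest_points, rest_values, points[i])
        errors.append(predicted - values[i])
    mean_error = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mean_error, rmse, errors


# Made-up units on a line.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
vals = [1.0, 2.0, 5.0]
mean_error, rmse, errors = leave_one_out(pts, vals, nearest_neighbour)
```

A mean error close to zero suggests little systematic bias, while the root mean squared error summarises the typical size of the prediction error.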




Application example - the Province of Palermo

Results/1

Figure: relative frequency distributions of the revenues/number of employees ratio (R/N). Panels: a) original dataset (R/N); b) logarithmic transformation (LogR/N); c) winsorized dataset (Wins_R/N); d) trimmed dataset (R/N). Vertical axis: relative frequency.

Results/2

Deterministic interpolator: the Radial Basis Function (RBF)




The semivariogram is a function that describes the differences between samples separated by varying distances.

In both cases the spatial autocorrelation is weak, but a model can be fitted to the point semivariograms: Spherical for R/N and Exponential for log(R/N).

Their parameters can be used in the OK algorithm to estimate maps.
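
The slides do not say how the models were fitted. As one possible illustration, the sketch below fits a spherical model to empirical semivariogram points by a plain grid search that minimises the squared error weighted by the number of pairs in each lag. The function names, the candidate grids and the synthetic points are all assumptions.

```python
def spherical(h, nugget, partial_sill, range_param):
    """Spherical semivariogram model; flat at nugget + partial_sill beyond the range."""
    if h == 0.0:
        return 0.0
    if h >= range_param:
        return nugget + partial_sill
    r = h / range_param
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def fit_spherical(lags, gammas, counts, nuggets, sills, ranges):
    """Grid search over candidate parameters.

    Returns (weighted squared error, nugget, partial sill, range)
    for the best combination found.
    """
    best = None
    for c0 in nuggets:
        for c in sills:
            for a in ranges:
                sse = sum(
                    n * (g - spherical(h, c0, c, a)) ** 2
                    for h, g, n in zip(lags, gammas, counts)
                )
                if best is None or sse < best[0]:
                    best = (sse, c0, c, a)
    return best


# Synthetic semivariogram points generated from known parameters.
lags = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
gammas = [spherical(h, 0.2, 1.0, 3.0) for h in lags]
counts = [10] * len(lags)
fit = fit_spherical(
    lags, gammas, counts,
    nuggets=[0.0, 0.1, 0.2, 0.3],
    sills=[0.5, 1.0, 1.5],
    ranges=[1.0, 2.0, 3.0, 4.0],
)
```

The fitted nugget, partial sill and range are the parameters that would then be handed to the kriging step.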




Variogram modelling

a) R/N b) log(R/N)




Estimated maps/1

Estimated map of R/N by OK

Prediction error map of R/N by OK




Estimated maps/2

Estimated map of log(R/N) by OK

Prediction error map of log(R/N) by OK

The contour lines appear strongly flattened towards low values.




Estimated maps/3

Estimated map of R/N by OK (winsorized dataset)

Prediction error map of R/N by OK (winsorized dataset)

The value of the outlier has been lowered.




Estimated maps/4

Estimated map of R/N by OK (trimmed dataset)

Prediction error map of R/N by OK (trimmed dataset)

Deleting the outlier causes a loss of information.

The estimation error is high everywhere.

The variable R/N was transformed into an indicator variable, using the value 26 (proposed by the software) as the threshold.

Spatial autocorrelation is almost absent, but an Exponential model can be fitted to the point semivariogram.

Its parameters are used in the Kriging algorithm to estimate a probability map.




Indicator Kriging/1

Indicator variable




Probability of exceeding the threshold

Prediction error map

Indicator Kriging/2

Estimated maps of socio-economic indicators, based on available micro-data at the statistical-unit level, represent a clear enhancement of the dissemination of statistical information.

Visualizing statistical information on a map does not just add a further dimension (space) to the data; it improves socio-economic information by linking it to geographical characteristics.

Maps are useful to identify areas of policy intervention, to plan and evaluate actions, to perform simulations and to build future scenarios.

The error map is a further improvement, showing the limitations and the reliability of the available statistical information.

The choice of the interpolation method depends essentially on the type of available data and on the objective of the analysis.


Conclusions


Thank you for your attention
