l1 spatial data - uantwerpen

Spatial issues in data analysis and model building: distance, scale and complexity.

Isabelle THOMAS Francqui Chair

March 11th 2015

Spatial analysis

• Visualization: showing interesting patterns (maps)

• Exploratory Spatial Data Analysis (ESDA): finding interesting patterns

• Spatial modelling (regression, …): explaining interesting patterns

Spatial is special

INTRODUCTION Distance Scale Complexity Accidents Conclusions

BAD NEWS

GOOD NEWS

ESDA DESCRIPTION

Spatial STATISTICS

Statistical MAPS

Modeling Spatial statistical

analysis and hypothesis testing

(Spatial) modeling and prediction

LEVEL OF DIFFICULTY


DISTANCE

DISTANCE: adjacency, interaction, and neighborhoods

SCALE: MAUP, spatial autocorrelation, ecological fallacy, edge/border effect

Why is distance so important ? (1)

[Figure: price of land against quantity of land (Q1, Q2, Q3) and distance to the CBD, from downtown (high densities) towards the periphery (low densities)]

The core of (transport) geography. Enters most models, many indices.

LOCATION

Absolute: latitude, longitude; an address

Relative: distance, directions to other places

Distance

Adjacency

Neighbourhood

Interaction

Why is distance so important? (2)

Introduction DISTANCE Scale Complexity Accidents Conclusions

Adjacency Distance Interaction Neighbourhood

Adjacency matrix (or adjacency list)

i and j are adjacent
- if they share a common boundary (share = ?)
- if they are within a specified distance (buffer, neighbourhood)

Binary or distance-based weights.

Order of adjacency.

Rook Queen

1st order

2nd order
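The rook and queen rules and the order of adjacency can be sketched on a regular grid. This is only an illustration under stated assumptions: the grid layout, the row-by-row cell numbering and the function names `grid_neighbours` and `second_order` are hypothetical and do not come from the slides.

```python
def grid_neighbours(rows, cols, rule="rook"):
    """Return {cell: sorted list of adjacent cells} for a rows x cols grid.

    Cells are numbered row by row, starting at 0 (an assumption of this sketch).
    rook  = cells sharing an edge; queen = cells sharing an edge or a corner.
    """
    if rule == "rook":
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif rule == "queen":
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0)]
    else:
        raise ValueError("rule must be 'rook' or 'queen'")
    neighbours = {}
    for r in range(rows):
        for c in range(cols):
            adjacent = []
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adjacent.append(rr * cols + cc)
            neighbours[r * cols + c] = sorted(adjacent)
    return neighbours


def second_order(neighbours, cell):
    """Cells that are neighbours of neighbours, excluding 1st order and the cell itself."""
    first = set(neighbours[cell])
    reached = set()
    for n in first:
        reached.update(neighbours[n])
    return sorted(reached - first - {cell})


rook = grid_neighbours(3, 3, "rook")
queen = grid_neighbours(3, 3, "queen")
print(rook[4])                 # centre cell, rook rule: [1, 3, 5, 7]
print(len(queen[4]))           # centre cell, queen rule: 8 neighbours
print(second_order(rook, 0))   # corner cell, 2nd order under rook: [2, 4, 6]
```

For irregular polygons (communes, statistical sectors) the same idea applies, but adjacency has to be derived from shared boundaries in a GIS rather than from grid indices.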

– dij measures the separation between i and j
– (mathematical) definition:

• dij > 0 if i ≠ j (distinction/separation)
• dij = 0 if i = j (co-location/equivalence): diagonal of the adjacency matrix
• dij + djk ≥ dik (triangle inequality)
• dij = dji (symmetry: is the graph symmetric?)

Measuring distance is not simple …

In spatial analysis
• Objects may not be truly point-like/distinct
• Triangle inequality may not hold
• Symmetry condition may not hold

Terrain distances – cross section view

Measuring distance is not simple …

NB.- Spherical coordinates – spherical /ellipsoidal computations • Metrics

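$$d_{ij} = 2R\,\sin^{-1}\!\left(\sqrt{\sin^2 A + \cos\phi_i \cos\phi_j \sin^2 B}\right), \quad \text{where } A = \frac{\phi_i - \phi_j}{2},\; B = \frac{\lambda_i - \lambda_j}{2}$$

(φ: latitude, λ: longitude, R: radius of the sphere)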
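A spherical (great-circle) distance can be computed with the haversine form, d = 2R·asin(√(sin²A + cos φ1 · cos φ2 · sin²B)), with A and B half the latitude and longitude differences. The sketch below assumes a perfect sphere with a mean Earth radius of 6371 km; ellipsoidal computations would need a geodesic library. The function name `haversine_km` is illustrative.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees.

    Assumes a spherical Earth (radius_km is the mean radius), not an ellipsoid.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (phi1 - phi2) / 2.0                   # half the latitude difference
    b = math.radians(lon1 - lon2) / 2.0       # half the longitude difference
    h = math.sin(a) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(b) ** 2
    h = min(1.0, h)                           # guard against rounding above 1
    return 2.0 * radius_km * math.asin(math.sqrt(h))


# Two antipodal points on the equator are half a circumference apart.
half_circumference = haversine_km(0.0, 0.0, 0.0, 180.0)
print(half_circumference)
```

Unlike terrain or network distances, this metric is symmetric and satisfies the triangle inequality.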

Measuring distance • lp metrics

p = 1 Manhattan; p = 2 Euclidean; ...
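The lp family can be written as one function, d = (Σ |a_k − b_k|^p)^(1/p). A minimal sketch, with the hypothetical name `minkowski`, for points given as coordinate tuples in a projected (planar) system:

```python
def minkowski(a, b, p):
    """l_p distance between two points a and b (sequences of coordinates), p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1 for the triangle inequality to hold")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


origin, point = (0.0, 0.0), (3.0, 4.0)
print(minkowski(origin, point, 1))   # Manhattan: 7.0
print(minkowski(origin, point, 2))   # Euclidean: 5.0
```

Intermediate values of p (between 1 and 2) are sometimes used as a rough approximation of travel over a street network, but that choice is an empirical one.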

Distance Adjacency Interaction Neighbourhood

Distance decay models
– Simple inverse power models: $f(d_{ij}) = d_{ij}^{-\beta},\ \beta \ge 0$
– Trip distribution models: $T_{ij} = A_i B_j O_i D_j f(d_{ij})$
– Statistical modelling
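A trip distribution model of the form T_ij = A_i B_j O_i D_j f(d_ij) can be sketched with an inverse power decay f(d) = d^(−β) and iteratively balanced factors A_i and B_j. Everything below is an assumption made for illustration: the function name `doubly_constrained`, the values of O, D and the distances, β = 2, and a fixed number of iterations instead of a convergence test. All distances must be strictly positive and the totals of O and D must match.

```python
def doubly_constrained(origins, destinations, dist, beta=2.0, iterations=100):
    """Doubly constrained gravity model with f(d) = d ** -beta.

    origins[i] = trips leaving i, destinations[j] = trips arriving in j,
    dist[i][j] = strictly positive separation between i and j.
    Returns the matrix T[i][j] of estimated flows.
    """
    n, m = len(origins), len(destinations)
    f = [[dist[i][j] ** -beta for j in range(m)] for i in range(n)]
    a = [1.0] * n
    b = [1.0] * m
    for _ in range(iterations):
        # Balance rows, then columns, until both sets of totals are reproduced.
        a = [1.0 / sum(b[j] * destinations[j] * f[i][j] for j in range(m))
             for i in range(n)]
        b = [1.0 / sum(a[i] * origins[i] * f[i][j] for i in range(n))
             for j in range(m)]
    return [[a[i] * b[j] * origins[i] * destinations[j] * f[i][j]
             for j in range(m)] for i in range(n)]


O = [100.0, 200.0]                  # hypothetical trip productions
D = [150.0, 150.0]                  # hypothetical trip attractions
d = [[1.0, 2.0], [2.0, 1.0]]        # hypothetical distances (intrazonal > 0)
flows = doubly_constrained(O, D, d)
for row in flows:
    print([round(t, 1) for t in row])
```

A larger β concentrates flows on nearby pairs; β = 0 removes the effect of distance altogether.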

Errors
A: d(2,j) < d(i,j) < d(5,j)
B: d(1,i) = 0
C: i can be allocated to j while closer to j′

Aggregation decreases
– data collection costs
– modeling costs
– computing costs
– confidentiality concerns
– data statistical uncertainty (smaller sample deviations for larger samples)

Aggregation increases
– modeling errors/biases

Distance – aggregation & scale

Introduction Distance SCALE Complexity Accidents Conclusions

LOCATION

Don’t forget the essence of your problem

SITUATION

SOCIOECONOMIC ENVIRONMENT

Land, transportation, amenities, …

Labor, materials, energy, …

Capital, subsidies, regulations, …

MACRO (national)

MICRO (local)

MESO (regional)

SCALE: cartographically

Large cartographic scale Small cartographic scale

Statistical sectors Communes, provinces, …

Extent constant, different grain

Increasing extent, grain constant

• Extent: spatial dimension of an object (or process) observed/analyzed

• Grain (BSU): level of spatial resolution at which an object (or process) is measured/observed.

SCALE: 2 aspects

Source « INS »

Land rent

(by sq m) 2013

SCALE: Extent

Results obtained at one scale do not necessarily apply at other scales. A pattern may be clustered at one scale but dispersed at another scale.

Population clustered into cities

City populations are dispersed

Scale is always important in spatial analysis!

SCALE: Extent

1. Patterns are dependent upon the scale of observation.
2. The importance of explanatory variables changes with scale.
3. Statistical relationships may change with scale.
4. Patterns are generated by processes acting over various spatial (and temporal) scales.

No unique solution: nested models, power laws, fractals, networks, …

Why be concerned about scale?

Power laws
• Summarize how relationships change with changes in scale
• Often expressed on a log-log plot
• Y = constant × X^n
• Similar slopes are thought to have similar structuring processes (n = slope)
• Example: species-area relationships

! However : power laws often lack an explanatory process
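On a log-log plot the power law Y = constant × X^n becomes a straight line, log Y = log(constant) + n·log X, so the exponent n can be estimated as the slope of an ordinary least-squares fit on the logged values. A minimal sketch with synthetic data; the name `fit_power_law` and the numbers are illustrative only.

```python
import math


def fit_power_law(xs, ys):
    """Estimate (constant, n) in y = constant * x**n by least squares on logs.

    All xs and ys must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    k = len(lx)
    mean_x, mean_y = sum(lx) / k, sum(ly) / k
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic data generated from y = 3 * x**0.5.
areas = [1.0, 2.0, 4.0, 8.0]
counts = [3.0 * x ** 0.5 for x in areas]
constant, exponent = fit_power_law(areas, counts)
print(round(constant, 6), round(exponent, 6))
```

A good linear fit on the logs describes the scaling; it does not by itself identify the process that generates it.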

• The same pattern appears across all scales. It is scale invariant.

• The relationship between size of box and pattern in it is constant.

• Fractals follow their own power law, relating the number of boxes needed to cover a shape to the size of the boxes.
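The box-counting idea can be sketched as follows: count the boxes N(s) of side s that contain at least one point, repeat for several sizes, and take the slope of log N(s) against log(1/s) as the dimension. The function names and the two test shapes (a straight line and a filled square, whose dimensions are known to be 1 and 2) are assumptions made for this illustration.

```python
import math


def box_count(points, size):
    """Number of grid boxes of side `size` that contain at least one point."""
    return len({(math.floor(x / size), math.floor(y / size)) for x, y in points})


def box_dimension(points, sizes):
    """Slope of log N(s) against log(1/s), fitted by least squares."""
    lx = [math.log(1.0 / s) for s in sizes]
    ly = [math.log(box_count(points, s)) for s in sizes]
    k = len(sizes)
    mean_x, mean_y = sum(lx) / k, sum(ly) / k
    return (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
            / sum((a - mean_x) ** 2 for a in lx))


sizes = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
line = [(i / 1024, 0.0) for i in range(1024)]                        # 1-D shape
square = [(i / 32, j / 32) for i in range(32) for j in range(32)]    # 2-D shape
print(round(box_dimension(line, sizes), 3))
print(round(box_dimension(square, sizes), 3))
```

Real patterns (built-up areas, street networks) typically give non-integer slopes, and the estimate depends on the range of box sizes used.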

Fractals

• Can represent relationships at a variety of scales at once.

• Structural properties of networks provide means of understanding how they work.
– Nodes and links
– Degree centrality and betweenness
– Weak versus strong links
– Directional versus non-directional graphs

Networks

1. Modifiable Areal Unit Problem (MAUP)
2. Ecological fallacy
3. Edge/border effect
4. Spatial autocorrelation
(…)

Fallacies of scale

1. Modifiable Areal Unit Problem (MAUP)
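The two components of the MAUP, scale (level of aggregation) and zoning (where the boundaries fall), can be illustrated by computing the same correlation on basic units and on two alternative zonings. The data and zonings below are entirely hypothetical, and `pearson` and `aggregate` are illustrative helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def aggregate(values, zones):
    """Mean of `values` within each zone; `zones` is a list of index lists."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]


# Hypothetical values for 8 basic spatial units along a transect.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

zoning_a = [[0, 1], [2, 3], [4, 5], [6, 7]]      # units grouped in pairs
zoning_b = [[0], [1, 2], [3, 4], [5, 6], [7]]    # boundaries shifted by one unit

r_units = pearson(x, y)
r_a = pearson(aggregate(x, zoning_a), aggregate(y, zoning_a))
r_b = pearson(aggregate(x, zoning_b), aggregate(y, zoning_b))
print(round(r_units, 3), round(r_a, 3), round(r_b, 3))
```

In this toy example aggregation into pairs removes the within-pair variation and raises the correlation, while shifting the boundaries gives yet another value from the same underlying data.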

Ecological fallacy: making claims about local-scale phenomena based on broad-scale observations.

Individualistic fallacy: making claims about broad-scale phenomena based on observations conducted at small, local scales.

2. Ecological fallacy

Do not generalise conclusions at other scales

Points close to the border are closer to locations outside the study area. The effect arises when an artificial boundary is imposed on a study, often just to keep it manageable.

Biases > nearest-neighbor distances > (model results)?

How to consider “the rest of the world”?

3. Edge/Border effects Solution:

1) Biased parameter estimates
2) Data redundancy (affecting the calculation of confidence intervals)
3) Moran and Geary

4. Spatial autocorrelation (1)

Coefficient
– Coordinate (x,y,z)
– Spatial weights matrix (binary or other), W = {wij}
– Coefficient formulation: desirable properties
• Reflects co-variation patterns
• Reflects adjacency patterns via weights matrix
• Normalised for absolute cell values
• Normalised for data variation
• Adjusts for number of included cells in totals

www.spatialanalysisonline.com

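• Moran’s I

$$I = \frac{1}{p}\,\frac{\sum_i \sum_j w_{ij}\,(z_i - \bar{z})(z_j - \bar{z})}{\sum_i (z_i - \bar{z})^2}, \quad \text{where } p = \frac{\sum_i \sum_j w_{ij}}{n}$$

• Modification for point data
• Replace weights matrix with distance bands, width h
• Pre-normalise z values by subtracting means
• Count number of other points in each band, N(h)

[Equation: distance-band form I(h); not legible in the source]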
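Moran's I in its usual form, I = (n / Σ_i Σ_j w_ij) · Σ_i Σ_j w_ij (z_i − z̄)(z_j − z̄) / Σ_i (z_i − z̄)², can be computed directly from a list of values and a weights matrix. The sketch below assumes binary rook weights on a regular grid; `rook_weights` and `morans_i` are illustrative names, and the two test patterns are synthetic.

```python
def rook_weights(rows, cols):
    """Binary rook-contiguity weights {(i, j): 1} for a rows x cols grid."""
    w = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                       # neighbour to the right
                w[(i, i + 1)] = w[(i + 1, i)] = 1
            if r + 1 < rows:                       # neighbour below
                w[(i, i + cols)] = w[(i + cols, i)] = 1
    return w


def morans_i(z, w):
    """Moran's I for values z and weights w = {(i, j): w_ij}.

    Undefined (division by zero) when all values are identical.
    """
    n = len(z)
    mean = sum(z) / n
    dev = [v - mean for v in z]
    p = sum(w.values()) / n
    numerator = sum(wij * dev[i] * dev[j] for (i, j), wij in w.items())
    denominator = sum(d * d for d in dev)
    return numerator / (p * denominator)


w = rook_weights(4, 4)
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]
two_halves = [1 if c >= 2 else 0 for r in range(4) for c in range(4)]
print(morans_i(checkerboard, w))   # perfect negative autocorrelation
print(morans_i(two_halves, w))     # positive autocorrelation (clustered values)
```

The same values give a different I under a queen or a distance-based matrix, which is why the choice of W should always be reported.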

Extending SA concepts
– Distance formula weights vs bands
– Lattice models with more complex neighbourhoods and lag models (GeoDa)
– Disaggregation of SA index computations (row-wise) with/without row standardisation (LISA)
– Significance testing
• Normal model
• Randomisation models
• Bonferroni/other corrections

Moran I Correlogram

Source data points Lag distance bands, h Correlogram

• Underlying socio-economic process has led to clustered distribution of variable values
– Grouping, spatial interaction
– Diffusion, dispersal
– Spatial hierarchies

• Mismatch between process and spatial units
– Counties vs retail trade zones
– Census block groups vs neighborhood networks

4. Spatial autocorrelation (6) Causes of spatial dependence / Interpretation

What is spatial autocorrelation? D. Griffith, 1992 – L’Esp. Géo.

Explore the data

Fit an OLS

Perform diagnosis

Run adapted model

(ex GWR)

Compare models

EDA ESDA

Global autocorrelation Local autocorrelation

Global model Local model

RESULTS DECISION

Hypotheses

Introduction Distance SCALE COMPLEXITY Accidents Conclusions

Start with OLS and look for

– Positive spatial autocorrelation > dependence between samples exists
– Datasets often non-Normal >> transformations may be required (Log, Box-Cox, Logistic)
– Samples are often clustered >> spatial declustering may be required
– Heteroskedasticity is common (errors are not iid)
– Spatial coordinates (x,y) may form part of the modelling process

Type of spatial effect > Remedies
– Spatial heterogeneity (Koenker-Bassett test)
  • Include covariate which accounts for heterogeneity?
  • Split region?
– Spatial autocorrelation (Lagrange Multiplier tests)
  • Identify missing variables?
  • Explore effects of spatially-lagged independent variables?
  • Use appropriate spatial regression model?
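The first diagnostic step (fit an OLS, then check the residuals for spatial autocorrelation) can be sketched on a toy transect. This is not one of the formal Lagrange Multiplier or Koenker-Bassett tests named above; it only computes Moran's I on the residuals. The data are synthetic: y depends on x and on an omitted, spatially clustered effect, and all names are illustrative.

```python
def ols_fit(x, y):
    """Simple OLS with one regressor; returns (intercept, slope, residuals)."""
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return intercept, slope, residuals


def morans_i_transect(z):
    """Moran's I with binary weights linking consecutive units on a transect."""
    k = len(z)
    mean = sum(z) / k
    dev = [v - mean for v in z]
    cross = 2.0 * sum(dev[i] * dev[i + 1] for i in range(k - 1))  # both directions
    total_weight = 2.0 * (k - 1)
    return (k / total_weight) * cross / sum(d * d for d in dev)


x = [0, 1, 0, 1, 0, 1, 0, 1]
omitted = [-1, -1, -1, -1, 1, 1, 1, 1]          # clustered effect left out of the model
y = [1 + 2 * a + s for a, s in zip(x, omitted)]

intercept, slope, residuals = ols_fit(x, y)
print(intercept, slope)
print(round(morans_i_transect(residuals), 3))   # clearly positive
```

Here the coefficients are recovered correctly because the omitted effect is uncorrelated with x, yet the residuals remain strongly autocorrelated, which is the signal that a missing variable or a spatial model should be considered.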

Regression models


• Identify the source (LM tests will help)
– Regression residuals (LM-Error)
  • Mismatch of process and spatial units => systematic errors, correlated across spatial units
– Dependent variable (LM-Lag)
  • Underlying socio-economic process has led to clustered distribution of variable values => influence of neighboring values on unit values

Regression models

LARGE number of solutions : Spatial autoregressive process (SAR) Spatial moving average process (SMA), …

COMPLEXITY or COMPLICATION ?

Introduction Distance Scale COMPLEXITY Accidents Conclusions

• Algorithmic complexity
• Deterministic complexity
• Aggregate complexity

Key generic properties
1. Nonlinear relationships
2. Techniques such as artificial intelligence
3. Emerges from relatively simple interactions

Systems change and evolve

Complexity is hard to define

Property Attributes

Has a distributed nature & representation Multiscalar.

Openness Open system

Non-linear dynamics Path dependence.

Limited functional decomposability

Emergence and self-organisation Emergence

Adaptive behaviour and adaptation Self organization

Non deterministic and non tractability Stochastic

Vocabulary about complexity

SYSTEM ANALYSIS

MIT, Jay Forrester (6’), Bertalanffy (67)

General system theory

System’s autonomy

SELF-ORGANIZATION Prigogine, Haken (1970-80)

Open systems, dissipative structures, unpredictable effects of non-linear micro-interactions on system’s macro structure and dynamics, path dependence (irreversibility)

COMPLEX SYSTEMS Santa Fe Institute,

ISI, ECSS (1990-2000)

Emerging properties

Models: Multi-Agents-Systems

Models: differential equations

Urban systems are complex systems
• Urban systems are produced by social interactions (conveying information), according to their range in space and duration in time
• Non-linear interactions occur at micro, meso or macro levels, and between levels
• Emergence of collective properties within cities:
  – Hierarchical organisation (« cities as systems within systems of cities », Reynaud, 1841; Berry, 1964; Pred, 1977)
  – Urban « memory » (dynamic path dependence) as a constraint on urban dynamics at both levels

PLACE(S)(Environment)

Road(s) PEOPLE (Roadusers):

(x, y, t)

VEHICLE(S)

INTERACTIONS

From facts … to geography

Introduction Distance Scale Complexity ACCIDENTS Conclusions

Multi-level problem

Explore the data

Fit an OLS

Perform diagnosis

Run adapted model

(ex GWR)

Compare models

EDA ESDA

Global autocorrelation Local autocorrelation

Global model Local model

Step 1: EDA Select variable and describe

Univariate

Bi- and multi- variate

Visualizations

Tables, Charts, Plots, autocorr, hot spot

Step 2 : ESDA

Test spatial homogeneity

Spatial weights

Global & Local spatial autocorrelation

• Point pattern analysis: describing a point pattern (black spots, black zones)
- Density-based point pattern measures
- Distance-based point pattern measures
- Assessing point patterns statistically
• Aggregation
- Segments of road
- Communes (stat sectors)
• Explanation/prediction
- Measuring and modeling numbers/risk

Pinpoint location (point): black spot
Black road segment (line)
Black « region » (polygon)

Multi-scale, multi-dimensional, multi-disciplinary, multi-causal analysis. Necessity: to isolate and to control for, in order to avoid badly specified models.

Describe / Understand / Explain / Predict + ACT (Engineering, Enforcement, Education, Environment)

Poisson or not ?

• Poisson > Binomial • Aggregation effects • Length of segments
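One quick, informal check of the "Poisson or not?" question is the index of dispersion (variance divided by mean) of the counts per segment: a Poisson process gives a ratio close to 1, while values well above 1 indicate over-dispersion. The counts below are invented for illustration and assume segments of equal length; `dispersion_index` is an illustrative name, and this is a descriptive check rather than a formal test.

```python
def dispersion_index(counts):
    """Sample variance divided by the mean of a list of counts."""
    k = len(counts)
    mean = sum(counts) / k
    variance = sum((c - mean) ** 2 for c in counts) / (k - 1)
    return variance / mean


# Hypothetical accident counts on ten road segments of equal length.
accidents_per_segment = [0, 0, 1, 0, 5, 0, 0, 2, 0, 7]
ratio = dispersion_index(accidents_per_segment)
print(round(ratio, 2))   # well above 1: accidents concentrate on a few segments
```

Because the ratio changes with segment length and with the way segments are aggregated, it should be recomputed for each segmentation considered.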

Road accidents N29 Charleroi-Jodoigne

Moran for black segments

Kernel

Mechelen

Infrastructure &

Environnement

Yi = 1 if hm belongs to a « black segment ».

Yi = 0 otherwise

Characteristics of the road - Usage - Physical properties - Environment (landuse, …)

(Official data; Numerical Digital Terrain Model; IGN maps)

Logistic regression 5.2


Objective: explain variations in Y

Controlling spatial biases

EXPLORATORY

Identify potential explanatory factors

Statistical tools:
• Graphics (basic statistics)
• Cluster analyses (PCA)
• Correlations (x,y)

STATISTICAL MODELLING

Relative importance of variables?

Statistical tools
• Statistical models
• Corrections for multicollinearity & spatial effects

2 steps

Factor X ?

Factor 1

Factor 2

village

[Figure: % cycling by commuting distance (km), 1 to 20 km]

• Commuting distances (< 10 km)
• Town size: regional towns > large towns
• Regional differences (culture + …)

Exploratory step

Exploratory step

Correlation (ρxy, from –1 to 1) of each factor with % cycling:

Dissatisfaction with cycleways: –0.82
Slopes: –0.77
Bad health: –0.58
Commuting distances: –0.54
Accident risk: –0.32
No child, town size: 0.23
Job density: 0.38
Active people < 25 years: 0.54

[Figure panels: commuting distances (km); average slopes (d°)]

POLICY-RELATED FACTORS

ENVIRONMENTAL FACTORS

INDIVIDUAL FACTORS

- Income - Education - Gender - Age - Car availability - Young children/household

Socio-economic data (NIS)

- Subjective health

Health data (NIS)

- Slopes (d°)

Physical data (UCL)

- Air pollution (PM10)

Environmental data (IRCEL-CELINE)

- Accident risk: f (number of accidents, travel time)

Accident data (NIS)

- Land-use (e.g. urban) - City size - Job and pop. densities

Land-use data (UCL)

- Satisfaction of cycle paths - Traffic volume - Commuting distance (km)

Trip/local characteristics

BICYCLE USE

Scale : communes (INS 5)

Vandenbulcke et al Transportation Research Part A (2011)

OLS (ordinary least squares): $y = X\beta + \varepsilon$

Diagnostics > remedies
• Multicollinearity (VIF, …) > uncorrelated X
• Heteroskedasticity (BP tests) > « White correction »
• Spatial autocorrelation (LM tests) > spatial autoregressive model (spatial lag), with W a queen matrix: $y = \rho W y + X\beta + \varepsilon$
• Structural instability (Chow tests) > inclusion of spatial regimes (ESDA): $y_1 = \rho_1 W y_1 + X_1\beta_1 + \varepsilon_1$ and $y_2 = \rho_2 W y_2 + X_2\beta_2 + \varepsilon_2$

Final model: SPATIAL AUTOREGRESSIVE MODEL + REGIMES
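The term Wy in the spatial lag model y = ρWy + Xβ + ε is the spatially lagged dependent variable: with a row-standardised W it is simply the mean of y over each unit's neighbours. The sketch below only builds Wy and the residuals for given parameter values; it does not estimate ρ (that requires maximum likelihood or a comparable estimator, as provided by dedicated software). The neighbour lists, values and function names are hypothetical, and every unit is assumed to have at least one neighbour.

```python
def spatial_lag(y, neighbours):
    """Row-standardised spatial lag: mean of y over the neighbours of each unit."""
    return [sum(y[j] for j in neighbours[i]) / len(neighbours[i])
            for i in range(len(y))]


def lag_model_residuals(y, x, neighbours, rho, intercept, beta):
    """Residuals y - (rho * Wy + intercept + beta * x) for one regressor x."""
    wy = spatial_lag(y, neighbours)
    return [yi - (rho * wyi + intercept + beta * xi)
            for yi, wyi, xi in zip(y, wy, x)]


# Four units on a line: 0 - 1 - 2 - 3 (hypothetical contiguity).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
y = [10.0, 20.0, 30.0, 40.0]
x = [1.0, 2.0, 3.0, 4.0]

print(spatial_lag(y, chain))                                  # [20.0, 20.0, 30.0, 30.0]
print(lag_model_residuals(y, x, chain, rho=0.5, intercept=0.0, beta=5.0))
```

With ρ = 0 the expression reduces to the OLS specification, which is why the lag model is read as OLS plus the influence of neighbouring values.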

OLS Model (n = 589)

Italics: ln(x+1)

Y = % commuter cyclists in commune i


SAR Model (LAG)

Estimation                              OLS (y)        ML (y)
Intercept                               6.4124****     3.2698****
Median income                           0.0030         0.00852
Active men                              0.0472****     0.01673**
Age 2 (45-54 years)                    -0.0460****    -0.02505***
Young children                         -0.0567****    -0.0218****
Cycleways dissatisfaction              -0.0127****    -0.0049****
Commuting distance                     -0.0114***     -0.00652**
Air quality                             0.0141****     0.00405
City size                              -0.0954****    -0.08747****
Bad health                             -0.0521****    -0.01889****
Accident risk                          -0.1673**      -0.14495***
Traffic volume 2 (municipal network)   -0.9216****    -0.46952****
Age 3 (> 54 years)                     -0.2054*       -0.14503*
Education 3 (university degree)        -0.4988****    -0.23034***
Slopes                                 -0.4873****    -0.17630****
Lag coefficient (ρ)                     -              0.6015****
R-squared (R²)                          0.879          -
Log likelihood                         -102.43         33.68
Moran's I of residuals                  0.34 (0.00)    0.01 (0.45)

Residuals

Simpson’s paradox

Spatial LAG model + Regimes N-S

                                        North          South
Intercept                               2.3084*        4.30951****
Median income                           0.0311*       -0.0027
Active men                              0.0296**       0.0008
Age 2 (45-54 years)                    -0.0417**      -0.0205***
Young children                         -0.0365***     -0.0247***
Cycleways dissatisfaction              -0.0052***     -0.0045***
Commuting distance                     -0.0165***     -0.0047*
Air quality                             0.01384****   -0.0054
City size                              -0.11459****   -0.03615****
Bad health                             -0.0098        -0.0146**
Accident risk                          -0.76319****   -0.14892****
Traffic volume 2 (municipal network)   -0.2357        -0.4521**
Age 3 (> 54 years)                     -0.1074        -0.0680
Education 3 (university degree)        -0.0968        -0.3132***
Slopes                                 -0.1931**      -0.19718****
Lag coefficient (ρ)                     0.5362****
N                                       589 (N North = 308; N South = 281)
Log likelihood                          93.923

North = Flanders; South = Wallonia & Brussels

Main results

– Demographic factors: e.g. gender, children
– Socio-economic: e.g. education
– Environmental & policy-related factors, e.g.:
  • Dissatisfaction with cycle facilities
  • Town size
  • Accident risk
  • Traffic volume

location 2 > location 1

Spatial factors?

Importance of space/location

Network location 1 Network location 2

Bicycle traffic =

? accident

street network

• Binary Yi = 0,1: logistic specification

• Corrections for
– Multicollinearity
– Heteroskedasticity
– Residual spatial autocorrelation

omitted variables? spatial models

• Spatial models (Bayesian framework)
– ICAR model… but fit not improved
– Hierarchical auto-logistic model

Cases = accidents + Controls = generated absences yi = (0,1)

Regression methods (e.g. logistic models) Advantage: estimation of risk, reduced statistical bias Issues: no vehicle & human factors, selection of controls

Models based on case-controls?

Methodology

Regression methods (e.g. multinomial logit models) Issues: over-/under-dispersion, underreporting, etc.

Regression methods (e.g. logistic models) Main issue: bias in the selection of road trajectories

Case-control

strategy

Transportation (gravity-based

models)

Epidemiology (case-control

studies)

Ecology (generation of

controls)

Models based on surveys, road trajectories

Models based on accident-only data

Data collection

• Accident risk = time-consuming process
– Accidents (cases) to be geocoded/located
– ‘Absences’ (controls) to be generated
  • … but no rigorous sampling method: tricky and questionable results!
– Road network: exclude ‘unbikeable’ links
– Risk factors to be collected…

• Software requirements: GIS

• Controls = locations without any accident (officially), supposed to be safe

• Generation of controls = random sampling of points along the road network, BUT:
– Proportional to bicycle traffic (stratified sampling)
– Exclude ‘black zones’ (hot spots of accidents) from the bikeable network

Black zones

Data collection: controls and absences

1) Negative exponential function

2) 500 impedance functions 3) No edge effect

Stratified random sampling

Potential bicycle traffic


Black spots (network kernel densities)


Ncontrols = 4*Naccidents
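The generation of controls (random points on the bikeable network, drawn in proportion to bicycle traffic, outside black zones, four controls per accident) can be sketched as weighted random sampling of network links. This is a simplified illustration and not the procedure of the original study: the link attributes, the weighting by traffic × length, the fixed seed and the name `sample_controls` are all assumptions.

```python
import random


def sample_controls(links, n_accidents, ratio=4, seed=0):
    """Draw ratio * n_accidents control points on links outside black zones.

    links: list of dicts with keys 'id', 'length' (m), 'traffic' (potential
    bicycle traffic) and 'black_zone' (bool). Returns (link id, offset in m).
    Sampling weight = traffic * length (assumption of this sketch).
    """
    rng = random.Random(seed)
    eligible = [link for link in links if not link["black_zone"]]
    weights = [link["traffic"] * link["length"] for link in eligible]
    chosen = rng.choices(eligible, weights=weights, k=ratio * n_accidents)
    # Place each control at a uniformly random position along its link.
    return [(link["id"], rng.uniform(0.0, link["length"])) for link in chosen]


# Hypothetical network of four links; link "c" lies in a black zone.
links = [
    {"id": "a", "length": 120.0, "traffic": 50.0, "black_zone": False},
    {"id": "b", "length": 80.0, "traffic": 200.0, "black_zone": False},
    {"id": "c", "length": 60.0, "traffic": 300.0, "black_zone": True},
    {"id": "d", "length": 200.0, "traffic": 10.0, "black_zone": False},
]
controls = sample_controls(links, n_accidents=5)
print(len(controls))    # 20 controls for 5 accidents
print(controls[:3])
```

Whatever the weighting, the selection of controls remains the delicate part of a case-control design, so the sampling rule should be documented and tested for sensitivity.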

Data collection: risk factors

Infrastructure factors
• Cycling facilities & contraflow cycling
• Discontinuities
• Parking areas & garages
• Bridge & funnels
• Crossroads & complexity
• Tram railways
• Traffic-calming areas
• Major roads
• Proximity city centre
• Distance to specific points of interest (e.g. schools, bus stops, etc.)

Traffic conditions
• Cars
• Trucks/lorries & buses
• Vans

Environmental factors
• Gradients
• Green blocks (parks, etc.)

• Advantage of GIS: combination of several datasets

• Accidents/controls – ‘Attached’ variables – ‘Crossings’

Data collection: risk factors

DATASET

Results: Modelling process

DEPENDENT VARIABLE (BINARY) Accident data (geocoded)

Controls/absences

INDEPENDENT VARIABLES (RISK FACTORS)

Infrastructure factors

Traffic conditions

Environment (physical)

MODELLING PROCESS

FINAL MODEL

Choice of the specification

Convergence diagnostics

Corrections for spatial effects

PREDICTIONS

Results: robust

Results: Predictions for a trajectory

Schuman’s roundabout

Tram railways

High traffic

volume

Exit High traffic

volume

Succession of crossroads on a major road (Wetstraat/Rue de la Loi) + segregated cycling facility

End of a separated cycling facility at

the crossroad Residential ward

Residential ward + contraflow

Take home message

• Location(s) and distance(s)
• Scale: independence of scales; nested.
• COMPLEXITY of spatial processes
• UNCERTAINTY

Introduction Distance Scale Complexity Accidents CONCLUSIONS

Spatial statistics
Large data sets
Spatial autocorrelation
Scales
Border/edge effects
MAUP (scale + zoning)
Heterogeneity
…

SPACE BIASES


Readings

Data analysis
• Fotheringham A., Brunsdon C. & Charlton M. (2000) Quantitative Geography: Perspectives on Spatial Data Analysis. London, SAGE.
• Fotheringham A., Brunsdon C. & Charlton M. (2002) Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Chichester.
• Bailey T. & Gatrell A. (1995) Interactive Spatial Data Analysis. Essex, UK, Longman.
• www.spatialanalysisonline.com

Road accidents in Belgium
• Thomas I. (1996) Spatial Data Aggregation. Exploratory Analysis of Road Accidents. AAP, 28:2, 251-264.
• Steenberghen T. et al. (2004) Intra-urban location of road accidents black zones: a Belgian example. IJGIS, 18:2, 169-181.
• Vandenbulcke G., Thomas I., Int Panis L. (2014) Predicting cycling accident risk in Brussels: an innovative spatial case-control approach. AAP, 62, 341-357.
• Vandenbulcke G. et al. (2011) Bicycle commuting in Belgium: Spatial determinants and re-cycling strategies. TR-A, 45, 118-137.

Your exercise (10 pages). Take your own data set (if you have none, go to Census11) and « PLAY » with it. Get 3 variables: Y (your choice) + 1 « explanatory » X + a measure of distance.
1. Define/describe them very well; justify the scale (extent and grain) and its limitations.
2. EDA and ESDA + statistical map of the 3 variables. Compute correlations between variables for several extents and/or 2 levels of aggregation and/or 2 subsets.
3. Compute a simple OLS and map the residuals (compute spatial autocorrelation) for both levels of aggregation.
4. If possible, enhance the regression by adopting another method, for instance correcting for spatial autocorrelation.
5. Critical and strong conclusion (incl. potentials, challenges, …)
