The identification of exceptional values in the ESPON database

Paul Harris, Martin Charlton
National Centre for Geocomputation, NUIM Maynooth, Ireland
Madrid seminar - 10/6/10


Page 1: The identification of exceptional values in the ESPON database

The identification of exceptional values in the ESPON database

Paul Harris, Martin Charlton

National Centre for Geocomputation, NUIM Maynooth, Ireland

Madrid seminar - 10/6/10

Page 2

Outline

1. ESPON DB data

2. Identifying exceptional values

3. Case study 1 (detecting logical input errors)

4. Case study 2 (detecting statistical outliers)

5. Next things to do…

Page 3

1. ESPON DB data

Socio-economic, land cover, …

Continuous, categorical, nominal, ordinal, …

Spatial support: area units - NUTS 0/1/2/23/3 (whose boundaries may also change over time)

Temporal support: commonly, yearly units (with only a short time series)

Page 4

2. Identifying exceptional values

Define two types:

1. Logical input errors (e.g. a negative unemployment rate)

2. Statistical outliers (e.g. an unusually high unemployment rate)

Two-stage identification algorithm:

Stage 1: identify input errors via mechanical techniques

Stage 2: identify outliers via statistical techniques

Page 5

Stage 1:

Identify logical input errors

Page 6

Logical input errors…

Usually detected using some logical, mathematical approach

Statistical detection may also help…

Typical input errors:

Impossible values (e.g. negatives, fractions…)

Repeated data for different variables

Data displaced between or within columns

Data swapped between or within columns

Wrong NUTS code or name

Wrong NUTS regions used (e.g. for 1999 instead of 2006)

Missing value code (e.g. 9999 treated as a true value)

Etc.
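Some of the checks listed on this slide are mechanical enough to sketch in a few lines. The example below is illustrative only and is not the authors' implementation: the record layout, the field names `nuts`, `pop` and `gdp`, the example codes, and the set of missing-value codes (9999 is taken from the slide, -9999 is added as an assumption) are all hypothetical.

```python
# Hypothetical records: field names and values are invented for illustration.
MISSING_CODES = {9999, -9999}  # assumed missing-value codes


def check_record(record, valid_nuts_codes):
    """Return a list of reasons why a record looks like a logical input error."""
    reasons = []
    if record["nuts"] not in valid_nuts_codes:
        reasons.append("unknown NUTS code")
    for field in ("pop", "gdp"):
        value = record[field]
        if value in MISSING_CODES:
            reasons.append(f"{field}: missing-value code stored as data")
        elif value < 0:
            reasons.append(f"{field}: negative value")
    # Population is a count, so a fractional value is impossible.
    pop = record["pop"]
    if pop not in MISSING_CODES and pop != int(pop):
        reasons.append("pop: fractional count")
    return reasons


valid = {"IE011", "IE012", "ES300"}
records = [
    {"nuts": "IE011", "pop": 1200000, "gdp": 45000.0},
    {"nuts": "IE012", "pop": -5300, "gdp": 900.0},
    {"nuts": "XX999", "pop": 9999, "gdp": 1200.0},
    {"nuts": "ES300", "pop": 6100000.5, "gdp": 150000.0},
]
flags = {r["nuts"]: check_record(r, valid) for r in records}
```

Returning reasons rather than a single yes/no keeps each flag explainable, which suits the later step of consulting an expert on the data.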

Page 7

Our approach…

Detect input errors mathematically (& statistically)

Flag observations if they are likely input errors

If possible - correct them

More likely - consult an expert on the data

Once happy - go to stage 2 - assume data is error-free

Page 8

Stage 2:

Identify statistical outliers

Page 9

Types of outliers…

Page 10

Our approach…

There is no single ‘best’ outlier detection technique, so…

Apply a representative selection of outlier detection techniques (which are simple & robust)

Flag an observation if it is a likely outlier according to each technique

Build up a weight of evidence for the likelihood of a given observation being statistically outlying

Suggest what type of outlier it is likely to be - aspatial, spatial, temporal, relationship, some mixture…

Consult an expert on the data to decide on the appropriate course of action

Here’s an example using nine techniques & three observations…

Page 11

Identification technique | Identification type | No. of Obs. 1-3 flagged 'Yes'

1. Boxplot statistics | Aspatial & univariate | 2

2. Hawkins’ spatial test statistic | Spatial & univariate | 1

3. Time series statistics | Temporal & univariate | 3

4. Large residuals from multiple linear regression* | Aspatial & multivariate, linear relationships | 3

5. Large residuals from locally weighted regression* | Aspatial & multivariate, nonlinear relationships | 1

6. Large residuals from geographically weighted regression* | Spatial & multivariate, nonlinear relationships | 1

7. Principal component analysis* | Aspatial & multivariate, linear relationships | 1

8. Locally weighted principal component analysis* | Aspatial & multivariate, nonlinear relationships | 1

9. Geographically weighted principal component analysis* | Spatial & multivariate, nonlinear relationships | 1

* Can have a spatial, univariate form if the coordinate data are used as variables
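The weight-of-evidence idea amounts to tallying, per observation, how many techniques flag it. The sketch below shows one simple way to do that, assuming every technique carries equal weight (the slides do not say how the evidence is combined). The technique names and observation labels are invented and do not reproduce the flags in the table.

```python
from collections import Counter

# Hypothetical flags: technique name -> set of observation ids it flags.
flags_by_technique = {
    "boxplot": {"A", "B"},
    "spatial_test": {"B"},
    "time_series": {"A", "B", "C"},
    "mlr_residuals": {"B"},
}


def weight_of_evidence(flags_by_technique):
    """Count, per observation, how many techniques flag it as outlying."""
    counts = Counter()
    for flagged in flags_by_technique.values():
        counts.update(flagged)
    return counts


evidence = weight_of_evidence(flags_by_technique)
# Rank observations from most to least supported; ties broken by id.
ranking = sorted(evidence, key=lambda obs: (-evidence[obs], obs))
```

Keeping the per-technique sets, rather than only the counts, also makes it possible to suggest the likely outlier type from which techniques fired.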

Page 12

3. Case study 1 (detecting logical input errors)

Data

Data at NUTS3 level (1351 observations/regions)

Variables: GDP evolution (2000 to 2005) (%), calculated using 4 other variables:

[Equation: GDP evolution E, 2000 to 2005, expressed in terms of GDP and POP for 2000 and 2005]

205 logical input errors deliberately introduced to NUTS codes & the 4 variables used to calculate GDP evolution only

~ 15% of data infected

Page 13

Performance results

False negatives - 13.2% (e.g. in Italy)

False positives - 2.0% (e.g. in Spain)

Overall misclassification rate - 3.7%
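The three rates fit together if false negatives are measured against the infected regions, false positives against the clean regions, and the overall rate against all regions. The sketch below assumes those definitions. The totals (1351 regions, 205 introduced errors) come from the slides, but the counts of 27 missed errors and 23 false alarms are back-calculated assumptions chosen to match the reported percentages, not figures from the presentation.

```python
def error_rates(n_total, n_infected, missed, false_alarms):
    """Return (false negative rate, false positive rate, overall misclassification rate)."""
    n_clean = n_total - n_infected
    fn_rate = missed / n_infected          # infected regions that were not flagged
    fp_rate = false_alarms / n_clean       # clean regions that were flagged
    overall = (missed + false_alarms) / n_total
    return fn_rate, fp_rate, overall


# 1351 and 205 are from the slides; 27 and 23 are assumed (back-calculated) counts.
fn, fp, overall = error_rates(1351, 205, missed=27, false_alarms=23)
summary = f"{fn:.1%} {fp:.1%} {overall:.1%}"
print(summary)  # prints "13.2% 2.0% 3.7%"
```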

Page 14

Consequences if we had ignored input errors…

Page 15

4. Case study 2 (detecting statistical outliers)

Data

Data at NUTS23 level for eight years: 2000-2007

For each year - ‘unemployment rate’ calculated [(Unemployed population)/(Active population)]

8 variables at each of 790 regions = 6320 obs.

Data checked for input errors - i.e. stage 1 done

Page 16

Page 17

Presentation of results…

For brevity…

Let's say - we only need at least one of 8 time-specific unemployment values in a region to be outlying…

(But we can identify outliers by year too)
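The "at least one of 8" reporting rule is a simple reduction over the yearly flags. A minimal sketch follows, with invented region labels and flags. It assumes each technique has already produced one boolean per region per year.

```python
# Hypothetical flags: region -> one boolean per year, 2000 to 2007.
YEARS = list(range(2000, 2008))
yearly_flags = {
    "R1": [False] * 8,
    "R2": [False, False, True, False, False, False, False, False],
    "R3": [True, True, False, False, False, False, False, True],
}

# A region is reported if at least one of its eight yearly values is flagged.
outlying_regions = [r for r, flags in yearly_flags.items() if any(flags)]

# The per-year view is still available from the same flags.
outlying_years = {
    r: [y for y, f in zip(YEARS, flags) if f] for r, flags in yearly_flags.items()
}
```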

Page 18

Results: 1 boxplot statistics (aspatial & univariate)
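For reference, a common form of the boxplot rule flags values lying more than k interquartile ranges beyond the quartiles. The sketch below uses the conventional k = 1.5 and Python's "inclusive" quartile method. Both are assumptions, since the slides do not state which variant was used, and the unemployment rates are invented.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Invented unemployment rates (%), with one unusually high value.
rates = [4.1, 5.0, 5.3, 6.2, 6.8, 7.4, 8.0, 24.9]
flags = boxplot_outliers(rates)
```

Because the fences depend only on the quartiles, a single extreme value barely moves them, which is what makes the rule robust.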

Page 19

Results: 2 Hawkins’ test (spatial & univariate)

Page 20

Results: 3 time series statistics (temporal & univariate)

Page 21

Results: 4 MLR residuals (aspatial linear relationships)
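Residual-based detection fits a regression and flags observations the fit explains poorly. The sketch below uses plain ordinary least squares with a cut-off of 3 standardised residuals. The threshold, the simulated data and the function name `mlr_outliers` are assumptions for illustration. It is not a robust fit, so it stands in for the general idea rather than for the authors' technique.

```python
import numpy as np


def mlr_outliers(X, y, threshold=3.0):
    """Flag observations whose standardised OLS residual exceeds the threshold."""
    A = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares fit
    residuals = y - A @ beta
    dof = len(y) - A.shape[1]                      # residual degrees of freedom
    sigma = np.sqrt(residuals @ residuals / dof)   # residual standard error
    return np.abs(residuals / sigma) > threshold


# Simulated data with a known linear relationship (not ESPON data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
y[17] += 5.0  # plant one observation that breaks the relationship
flags = mlr_outliers(X, y)
```

The locally and geographically weighted variants replace the single global fit with fits weighted by closeness in attribute space or geographic space, respectively.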

Page 22

Results: 5 LWR residuals (aspatial nonlinear relationships)

Page 23

Results: 6 GWR residuals (spatial nonlinear relationships)

Page 24

Results: 7 PCA residuals (aspatial linear relationships & model-free)
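One way to read "PCA residuals" is as the distance between each observation and its reconstruction from the leading principal components, so that points breaking the dominant correlation structure stand out. The sketch below takes that reading as an assumption. The number of retained components, the standardisation step and the simulated data are illustrative choices, not details from the slides.

```python
import numpy as np


def pca_residuals(X, n_components):
    """Return each row's distance from the subspace of the leading components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise each variable
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # rows of Vt are the components
    V = Vt[:n_components].T                           # loadings of retained components
    reconstruction = Z @ V @ V.T
    return np.linalg.norm(Z - reconstruction, axis=1)


# Simulated data: two strongly correlated variables (not ESPON data).
rng = np.random.default_rng(1)
t = rng.normal(size=300)
X = np.column_stack([t, t + rng.normal(scale=0.05, size=300)])
X[42] = [2.0, -2.0]  # plant one point that contradicts the correlation
res = pca_residuals(X, n_components=1)
most_unusual = int(np.argmax(res))
```

The planted point is unremarkable on each variable taken alone, so a univariate rule would miss it. It only shows up as a multivariate residual.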

Page 25

Results: 8 LWPCA residuals (aspatial nonlinear relationships & model-free)

Page 26

Results: 9 GWPCA residuals (spatial nonlinear relationships & model-free)

Page 27

Summary of results: weight of evidence

Page 28

Preliminary performance results

Infected ~ 5% of the data with ‘outliers’ & repeated the analysis on this ‘infected’ data…

False negatives: 10.3%

False positives: 34.3%

Overall misclassification rate: 26.1%

Problems:

Difficult to guarantee that our infections actually produce outliers…

The data already contains outliers (as shown)

Page 29

5. Next things to do…

1. Other ways of performance testing our approach

Simulated data with known properties?

Statistical theory (or properties)?

2. Refining each of our nine chosen techniques

Robust extensions

Page 30

Thank You!