EuroRegionalMap: Best practices in quality assessment for a pan-European dataset
Nathalie Delattre
QKEN meeting, Brussels, 5-7 May 2010
Items
1. ERM: presentation
2. Best Practices in quality control
3. Quality issues
4. Expectation-Debate
1. ERM: presentation
Project status: consolidation phase 2007-2010 Eurostat Contract
1. to provide a yearly update of the ERM data for a European coverage in accordance with the EC contract No. 2006/S 174-185902
2. to improve the level of harmonisation of ERM in the data content and selection criteria
3. to upgrade ERM according to EUROSTAT specifications oriented towards spatial analysis
Evolution of ERM towards EC requirements
Theme | Mapping | Background location | Help for location | Spatial analysis | Routing
BND X X X X
HYDRO X X X X
TRANS X X X X X
NAMET X
POP X X X
MISC X X X X
VEG X X
ERM level of progress
• Release 2.2 (Jan 2008)
  • 31 countries: EU26, 4 EFTA, Moldova
  • Croatia: administrative boundaries
• Release 3.0 (Jan 2009)
  + Croatia: railway network
  + Isle of Man
  + Faeroe Islands
  • No update or improvement from Italy (VMap data sources)
  • No data from Bulgaria
• Release 3.1 (Jan 2010)
  • Adm, transports, settlements, names
• Release 3.2 (Dec 2010)
  • Hydro
  + Bulgaria
Production work flow
Actors: Countries, RC
Phases: Production phase, Validation phase, Integration phase
Steps:
• Countries: data production, own quality control
• RC: quality control
• Countries: corrections and edge matching
• RC: quality control (also on edge-matching)
• Countries: last corrections
Deliverables:
• National component of the ERM data: draft version (GDB or shapefiles); validation report
• National component of the ERM data: draft version; validation report
• National components of the ERM data: final version; metadata + lineage files; final reception (sending approval)
Integration phase
PM: Data integration into a seamless coverage
Task: finishing the edge-matching at cross-border areas by integrating the duplicated features located on international boundaries into one single feature
Deliverables:
• ERM data set in File GDB
• Metadata for ERM
PM: Specific processes for specific features requested by the EC
Tasks:
• Adding a land mask feature
• Merging the ferry lines into a seamless and consistent network usable for spatial analysis
• Setting up UIC codes for railways
Deliverables:
• ERM data set in File GDB, fit for EC
• Metadata for Eurostat
• Quality assessment report
2. Quality control : best practices for a pan-European dataset
Quality control
1. Validation process : checking the conformity with the ERM specifications
2. Quality assessment process: reporting on data content and data harmonisation in selection criteria
Validation specifications
• Compliance with the ERM Specifications
• Data model
• Topology
• Allowed attribute values
• Selection criteria
• Geometrical resolution
• Coherence and consistency of feature and attributes
• Homogeneity of attribute values in a feature network
• Consistency between themes
• Cross-border continuity between neighbouring countries
Minimum Requirements
Validation process
• To ensure best data quality
• Steps: ERM data production; validation by producer; validation by RC; report about validation results; loop back if errors exist
Validation deliverables
Documentation:
My ERM documentation
D41_ERMSpecificationDC_v43.pdf
D51_DataValidationSpecifications_V40.pdf
D52_DataValidationSpecifications_MinReq_v12.pdf
ICC_ERM_ValidationReport_template.xls
ERM_v31_Validation_Tools_v10.xls
…
Quality indicators in Metadata
• Metadata for discovery (standard ISO 19115):
  • ERM_Metadata_partners_template.xls
• Lineage files (data quality)
  • ERM Lineage Template.doc
  • ERM_Lineage.xls
Quality indicators
1. Existence (ID1) = presence/absence of feature or attribute
Def: the feature or attribute information exists in the real-world context and has been captured (presence) or not captured (absence) in the ERM data set.
Values:
• Presence: indicator ID1 = 1
• Absence: indicator ID1 = 0
• N_A: indicator ID1 = -1 (the feature/attribute doesn't exist in the real-world context)
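The three ID1 values can be written as a small decision rule. This is only a minimal sketch of the definition above; the function name and its two boolean parameters are hypothetical and not part of the ERM tools.

```python
# Hypothetical helper illustrating the ID1 (existence) encoding.
PRESENT, ABSENT, NOT_APPLICABLE = 1, 0, -1


def existence_indicator(exists_in_real_world: bool, captured: bool) -> int:
    """Return ID1 for a feature or attribute.

    -1: the information does not exist in the real-world context (N_A)
     1: it exists and has been captured in the data set
     0: it exists but has not been captured
    """
    if not exists_in_real_world:
        return NOT_APPLICABLE
    return PRESENT if captured else ABSENT


if __name__ == "__main__":
    print(existence_indicator(True, True))    # exists and captured
    print(existence_indicator(True, False))   # exists, not captured
    print(existence_indicator(False, False))  # does not exist
```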
Existence for Austria
Theme name | Feature class name | Feature code / attribute name | Feature name / attribute description | Obligation | Existence (ID1) = [0, 1, -1] | Comments
BND | PolbndA | FA001 | Administrative Area | M | 1 |
 | | EBM0 | Sabe Hierarchical Number | M | 1 |
 | | EBM1 | Sabe Hierarchical Number | M | 1 |
 | | EBM2 | Sabe Hierarchical Number | M | 1 |
 | | EBM3 | Sabe Hierarchical Number | M | 1 |
 | | EBM4 | Sabe Hierarchical Number | M | -1 |
 | | EBM5 | Sabe Hierarchical Number | M | -1 |
 | | TAA | Type of administrative area | M | 1 |
HYDRO | CoastA | BA010 | Foreshore | M | -1 |
HYDRO | CoastL | BA010 | Coastline Shoreline | M | -1 |
• Hierarchical levels 4 and 5 do not exist in Austria: ID1 = -1
• Foreshore and coastline do not exist in Austria: ID1 = -1
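Existence for Spain
• Foreshore does not enter into the selection criteria: ID1 = -1
• Shoreline exists but has not been captured: ID1 = 0
Theme name | Feature class name | Feature code / attribute name | Feature name / attribute description | Obligation | Existence (ID1) = [0, 1, -1] | Comments
HYDRO | CoastA | BA010 | Foreshore | M | -1 | not entering into selection criteria
 | | MCC | Material Composition Category | M | -1 |
 | | NAMA1 | Name in first national language (ASCII) | O | -1 |
 | | NAMA2 | Name in second national language (ASCII) | O | -1 |
 | | NAMN1 | Name in first national language | O | -1 |
 | | NAMN2 | Name in second national language | O | -1 |
 | | NLN1 | 3-Char Language Code | O | -1 |
 | | NLN2 | 3-Char Language Code | O | -1 |
HYDRO | CoastL | BA010 | Coastline Shoreline | M | 1 |
HYDRO | CoastL | BB081 | Shoreline Construction | O | 0 |
 | | HOC | Hydrographical Origin Category | O | 0 |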
Quality indicators (2)
1. Completeness (ID2) group of indicators
  1. Selection compliancy (ID2.1) for features
  2. Data Completeness (ID2.2) for attributes
Selection compliancy: features are captured for the entire territory and in accordance with the portrayal and selection criteria of the specifications
Values:
• ID2.1 = 1 (fully compliant)
• ID2.1 = 0 (not fully compliant)
Quality indicators (3)
1. Completeness (ID2) group of indicators
  1. Selection compliancy (ID2.1) for features
  2. Data Completeness (ID2.2) for attributes
Data completeness: percentage of populated attributes holding real values (null values such as UNK or N_P are not counted as populated)
Value: %
Example: value for RTN
• Number of features with RTN <> [UNK] = 34000
• Total number of features = 45000
• ID2.2 = ROUNDUP(34000 / 45000 * 100) = 76%
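The RTN example can be reproduced with a few lines of Python. This is a minimal sketch rather than the ERM Toolbox script: the function name is hypothetical, and the default set of null markers (UNK, N_P) is an assumption taken from the definition above.

```python
# Hypothetical sketch of the ID2.2 (data completeness) computation.
# Assumption: only the text markers UNK and N_P are treated as null here.
NULL_MARKERS = frozenset({"UNK", "N_P"})


def data_completeness(values, null_markers=NULL_MARKERS):
    """Return the rounded-up percentage of values that are not null markers."""
    total = len(values)
    if total == 0:
        raise ValueError("no features to evaluate")
    populated = sum(1 for v in values if v not in null_markers)
    # Integer ceiling division: rounds up without floating-point arithmetic.
    return -(-populated * 100 // total)


if __name__ == "__main__":
    # 34000 features with a route number, 11000 with RTN unknown.
    rtn_values = ["N340"] * 34000 + ["UNK"] * 11000
    print(data_completeness(rtn_values))  # 76
```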
Example: Completeness for road and island
Theme name | Feature class name | Feature code / attribute name | Feature name / attribute description | Obligation | Existence (ID1) = [0, 1, -1] | Selection compliancy (ID2.1) = [0, 1] | Data completeness (ID2.2) = [0-100] % | Comments and improvement in data quality (from the previous release / expected for next release)
HYDRO | IslandA | BA030 | Island | M | 1 | 0 | | areas less than 0.6 km2 have been captured
 | | NAMN1 | Name in first national language | M | 1 | | 80 |
 | | NAMN2 | Name in second national language | M | -1 | | |
 | | NAMA1 | Name in first national language (ASCII) | M | 1 | | 80 |
 | | NAMA2 | Name in second national language (ASCII) | M | -1 | | |
 | | NLN1 | 3-Char Language Code | M | 1 | | 100 |
 | | NLN2 | 3-Char Language Code | M | -1 | | |
TRANS | RoadL | AP030 | Road | M | 1 | 1 | |
 | | EXS | Existence Category | M | 1 | | 100 |
 | | LLE | Location Level | M | 1 | | 100 |
 | | LTN | Lane/Track Number | M | 1 | | 75 | local roads have LTN unknown
 | | MED | Median Category | M | 1 | | 100 |
 | | NAMN1 | Name in first national language | O | 0 | | |
 | | NAMN2 | Name in second national language | O | 0 | | | not evaluated
 | | NAMA1 | Name in first national language (ASCII) | O | 0 | | |
 | | NAMA2 | Name in second national language (ASCII) | O | 0 | | | not evaluated
 | | NLN1 | 3-Char Language Code | O | 0 | | |
 | | NLN2 | 3-Char Language Code | O | 0 | | | not evaluated
 | | RST | Road Surface Type | M | 1 | | 100 |
 | | RSU | Seasonal Availability | O | 1 | | 100 |
 | | RTE | Route Number (Int.) | M | 1 | | 100 |
 | | RTN | Route Number (Nat.) | M | 1 | | 75 | local roads have RTN unknown
 | | RTT | Route Intended Use | M | 1 | | 100 |
 | | TOL | Toll Category | O | -1 | | |
 | | TUC | Transportation Use Category | M | 1 | | 100 | TUC has been newly populated
TRANS | RunwayL | GB055 | Runway | M | | | |
Metadata on not provided information
Attribute type | Null/No Value | Unknown | Unpopulated | Not Applicable
Meaning in the real-world context | Information cannot be applied | Information is missing | Information exists but has not been collected | Information doesn't exist
Text | N/A | UNK | N_P | N_A
Integer Coded | -32768 | 0 | 997 | 998
Integer Actual Value | -32768 | -29999 | -29997 | -29998
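The table of reserved values lends itself to a simple lookup. The sketch below only restates the table as a Python dictionary; the dictionary keys, the category labels, and the function name are hypothetical choices for illustration.

```python
# Reserved "not provided" values per attribute type, as listed in the table.
# Key names and category labels are illustrative assumptions.
NOT_PROVIDED = {
    "text": {
        "N/A": "null", "UNK": "unknown",
        "N_P": "unpopulated", "N_A": "not applicable",
    },
    "integer_coded": {
        -32768: "null", 0: "unknown",
        997: "unpopulated", 998: "not applicable",
    },
    "integer_actual": {
        -32768: "null", -29999: "unknown",
        -29997: "unpopulated", -29998: "not applicable",
    },
}


def classify_value(value, attribute_type):
    """Return the 'not provided' category of a value, or 'real value'."""
    return NOT_PROVIDED[attribute_type].get(value, "real value")


if __name__ == "__main__":
    print(classify_value("UNK", "text"))
    print(classify_value(997, "integer_coded"))
    print(classify_value(4, "integer_actual"))
```

A check of this kind could feed a completeness count, since only values classified as real would be counted as populated.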
Quality tools
PLTS Data Reviewer (Knowledgebase)
• Automated validation of attribute domains as well as combinations of attributes
• Validation of minimum dimensions
GDB Topology
• Validation of topology
• Not all relationships can be defined
ERM Scripts (Python)
• Validation of generalisation degree
• Attribute completeness
Visual control
• Necessary as not all checks can be automated (e.g. feature density)
Python Scripts in ERM Toolbox
• Edgematching
  • Check Edgematching for lines
  • Check Edgematching for points
• ERM QC
  • Check Multipart
  • Feature Statistics
  • Item Statistics
  • Populate Symbol Number
  • Summary Statistics
  • Test ASCII fields
• Export
  • Export to Shape
Statistics tools
• Feature Statistics
  • the number of features per feature class
  • use:
    • QA - presence of feature classes and country codes
    • helps to fill in the metadata (lineage.doc)
Statistics tools
• AllStatistics
  • ID1 = the existence of the feature and attribute {0,1}
  • ID2 = the completeness of the feature and attribute {0,..,100}
  • use: helps to fill in the metadata (lineage.xls)
Statistics tools
• GeomStat
  • the number of features per unit area (10 km2, 100 km2, etc.)
  • use: QA - density of features -> basis for harmonisation of selection criteria between countries
[Figure: GeomStat comparison of WatrcrsL (Natural) feature density within a 10 km x 10 km area for CZ, SK, RO, MD and HU]
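Counting features per unit area, as GeomStat does, can be sketched with a regular grid. This is not the GeomStat script itself: it assumes each feature is reduced to one representative point in projected coordinates (metres), and the function name and default cell size are hypothetical.

```python
import math
from collections import Counter


def features_per_cell(points, cell_size_m=10_000.0):
    """Count representative points per square grid cell.

    points: iterable of (x, y) in a projected coordinate system, in metres.
    Returns a Counter keyed by (column, row) cell index.
    """
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size_m), math.floor(y / cell_size_m))
        counts[cell] += 1
    return counts


if __name__ == "__main__":
    sample = [(500.0, 500.0), (9_999.0, 0.0), (10_000.0, 0.0), (-1.0, 5.0)]
    for cell, n in sorted(features_per_cell(sample).items()):
        print(cell, n)
```

Comparing such counts for cells on either side of a border gives a rough indication of whether neighbouring countries apply similar selection criteria.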
Geometry tools
• MinVertexDistance
  • checks the minimum allowed distance between vertices (50 m)
  • use: QA - data quality requirements
[Figure: WatrcrsL example with two vertices 46 m apart, below the 50 m minimum: correction needed]
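A check of the kind MinVertexDistance performs can be sketched as follows. This is an illustration under stated assumptions, not the ERM Toolbox code: vertices are taken to be in projected coordinates (metres), and the function name is hypothetical.

```python
import math


def short_segments(vertices, min_distance=50.0):
    """Return (segment index, length) for consecutive vertices closer than min_distance.

    vertices: list of (x, y) tuples in metres, in drawing order.
    """
    flagged = []
    for i in range(len(vertices) - 1):
        (x1, y1), (x2, y2) = vertices[i], vertices[i + 1]
        distance = math.hypot(x2 - x1, y2 - y1)
        if distance < min_distance:
            flagged.append((i, distance))
    return flagged


if __name__ == "__main__":
    watercourse = [(0.0, 0.0), (100.0, 0.0), (146.0, 0.0), (146.0, 80.0)]
    # The second segment is 46 m long, below the 50 m minimum.
    print(short_segments(watercourse))
```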
Quality issues
Quality requirements
1. Compliancy with a standard (ERM specifications)
2. Topological errors (goal: usable topological network)
3. Completeness in attributes (e.g. name completion)
4. Data harmonisation between countries
  • in selection criteria
  • in classification
  • in geometrical accuracy (vertex density)
Quality issues: Transport
• Heterogeneity in national classification of the roads (primary, secondary, etc.)
Quality issues: Hydro
• Heterogeneity in selection criteria
Quality issues: Hydro
• Name completion (non-named rivers selected in blue)
Quality issues: Hydro
• River hierarchical level: must be consistent at European level (rivers with national hierarchical level shown in blue)
Expectations
1. Need for a quality control manager
  1. Assess the quality of the data
  2. Suggest new methodology and improvements in quality control tools
  3. Provide a quality assessment report for each release
2. ESDIN framework (the near future for ERM):
  1. What kind of quality data model for the pan-European products?
  2. What kind of validation tools and quality control?
3. Commitment of the Quality KEN? Support welcome; which kind?
Debate: quality data model? For which kind of data?
• Quality control applicable to base level datasets
  • Related to real-world phenomena
• Quality control applicable to generalised and derived datasets (at medium scale level)?
  • Added factor of selection criteria
• Quality control applicable to pan-European datasets?
  • Added factor of harmonisation between countries