Representing Contextual Aspects of Data. Andreas Harth, joint work with Juan Salas. PlanetData 1st EC Review, 7-8 December 2011, Luxembourg. Outline: Motivation, Source Datasets, NeoGeo Vocabulary, Integration Algorithm, Integrated Datasets and Services, Community Activities, Demo.
Slide 1 of x
Representing Contextual Aspects of Data
Andreas Harth
Joint work with Juan Salas
PlanetData 1st EC Review, 7-8 December 2011, Luxembourg
Slide 2 of x
Outline
• Motivation
• Source Datasets
• NeoGeo Vocabulary
• Integration Algorithm
• Integrated Datasets and Services
• Community Activities
• Demo
• Outlook
• Conclusion
Slide 3 of x
Motivation
Geodata is becoming increasingly relevant
• Location-based services
• Mobile applications
• Ever-increasing amount of sensor data (phones, satellites)
Data is published in many formats
• GML, KML, WKT, RDF? …
Applications require integrated access to geodata
• Spatial querying
• Spatial reasoning
Slide 4 of x
GeoData
Geospatial data is ubiquitous in information management, whether aimed at scientific, industrial or everyday activities. For this reason, a shared representation of GeoData is vital for the future of the Semantic Web.
Example application fields include:
• Transport
• Demography
• Mobile Applications
• Remote Sensing
• Commerce
(and many more…)
Slide 5 of x
Requirements
Integrated data format (syntax) and access (data transfer protocol)
• Linked Data (RDF, HTTP)
Mapping to a common vocabulary
• Focus on representing geographic regions
Mappings between instances
Algorithms and systems for integrated querying
Algorithms and systems for integrated reasoning
Slide 6 of x
Integration Challenges
Vocabularies – http://geovocab.org/doc/survey.html
• Survey of several well-known Linked Data datasets (Ordnance Survey, GeoLinkedData.es, LinkedGeoData.org, GeoNames, DBpedia).
• Identified properties and classes mapped to the NeoGeo vocabularies published at GeoVocab.org
Instances
• Finding equivalences between regions across multiple datasets at the geometry level.
Slide 7 of x
Geodata Integration System Architecture
[Diagram: Source 1, Source 2, … Source n; Wrapper 1; Mapping 1, Mapping 2, … Mapping n; Integration]
Slide 8 of x
Integration Vocabulary
GeoVocab.org is an initiative to study methods and tools for the integration of geospatial data on the Semantic Web
Geometry Vocabulary – http://geovocab.org/geometry
• Representation of georeferenced geometric shapes.
Spatial Ontology – http://geovocab.org/spatial
• Representation and reasoning on topological relations based on the Region Connection Calculus.
[Diagram: spatial:Feature, linked via ngeo:geometry to ngeo:Geometry; spatial:* relations]
Slide 9 of x
Spatial Ontology
http://geovocab.org/spatial
Uses the RCC vocabulary to represent topological relations between regions.
Supports RCC5 and RCC8 relations.
Inference is available for most RCC relations. However, some rules require "Negation as Failure", which is not supported in OWL.
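The relationship between the two supported calculi can be sketched as a lookup table. The block below is a minimal illustration using the standard RCC abbreviations; the dictionary and function names are hypothetical and are not the property names of the Spatial Ontology.

```python
# Standard RCC-8 base relations mapped to their coarser RCC-5 counterparts.
# Names here are illustrative, not taken from http://geovocab.org/spatial.
RCC8_TO_RCC5 = {
    "DC": "DR",     # disconnected          -> discrete
    "EC": "DR",     # externally connected  -> discrete
    "PO": "PO",     # partially overlapping
    "TPP": "PP",    # tangential proper part
    "NTPP": "PP",   # non-tangential proper part
    "TPPi": "PPi",  # inverse of TPP
    "NTPPi": "PPi", # inverse of NTPP
    "EQ": "EQ",     # equal
}

# Each RCC-8 relation has an inverse; symmetric relations are their own inverse.
RCC8_INVERSE = {
    "DC": "DC", "EC": "EC", "PO": "PO", "EQ": "EQ",
    "TPP": "TPPi", "TPPi": "TPP",
    "NTPP": "NTPPi", "NTPPi": "NTPP",
}


def to_rcc5(relation: str) -> str:
    """Collapse an RCC-8 relation to the RCC-5 relation that covers it."""
    return RCC8_TO_RCC5[relation]


def inverse(relation: str) -> str:
    """Return the relation that holds when the two regions are swapped."""
    return RCC8_INVERSE[relation]
```

If region A is a tangential proper part of B, then `inverse("TPP")` gives the relation from B to A, and `to_rcc5` reduces either to the RCC5 level.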
Slide 10 of x
Spatial Properties (RCC-8)
Slide 11 of x
Geometry Ontology
http://geovocab.org/geometry
Premises:
● Open RDF Format
● Fully based on Linked Data principles
Based on:
● ISO 19109 - OGC General Feature Model
● ISO 19137 - Core profile of the spatial schema
Slide 12 of x
Geometry Ontology
http://geovocab.org/geometry
Because the Geometry Ontology is based on the General Feature Model, it distinguishes between the feature (the resource to which the geometry belongs) and the actual geometry. As a result:
• The semantics of the feature are more important than the representation of the geometry.
• Instances of the feature are related to the type of the feature.
• A feature can be related to multiple geometries, not as MultiLineString, MultiPolygon or MultiPoint, but as multiple distinct geometries. This makes it possible to model different geometric properties for a single feature (e.g. different scales).
Because it is also based on ISO 19137, the geometries that can be represented are Point, LineString, Polygon, MultiPoint, MultiLineString and MultiPolygon, which should suffice for most use cases without adding extra complexity.
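The feature/geometry split can be illustrated with a small data model. This is only a sketch: the class names, the `scale` field and the example URIs are hypothetical and do not come from the ontology itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# The six geometry types named on the slide (ISO 19137 core profile).
ALLOWED_KINDS = {
    "Point", "LineString", "Polygon",
    "MultiPoint", "MultiLineString", "MultiPolygon",
}


@dataclass
class Geometry:
    uri: str
    kind: str
    coords: List[Tuple[float, float]]
    scale: str = ""  # hypothetical metadata, e.g. a map scale


@dataclass
class Feature:
    uri: str
    feature_type: str
    geometries: List[Geometry] = field(default_factory=list)

    def add_geometry(self, geometry: Geometry) -> None:
        """Attach one more distinct geometry to this feature."""
        if geometry.kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported geometry type: {geometry.kind}")
        self.geometries.append(geometry)


# One feature, two distinct geometries at different (made-up) scales.
region = Feature("http://example.org/id/region", "AdministrativeRegion")
region.add_geometry(Geometry("http://example.org/id/region_coarse", "Polygon",
                             [(0, 0), (1, 0), (1, 1), (0, 0)], scale="1:10M"))
region.add_geometry(Geometry("http://example.org/id/region_fine", "Polygon",
                             [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)], scale="1:1M"))
```

Keeping the geometries in a list on the feature, rather than merging them into one Multi* geometry, mirrors the third bullet above.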
Slide 13 of x
Geometry Ontology
http://geovocab.org/geometry
Unlike GML/WKT representations embedded into RDF, the Geometry Ontology is fully based on RDF.
Advantages:
● It is possible to aggregate geometries. For example: a MultiPolygon can be composed of several Polygon resources, each with its own URI and metadata.
● Metadata can be added to individual parts of the geometries. For example: label disputed borders as such, or compose a polygon from GPS measurements, each with its own version and measurement date.
Disadvantages:
● The geometry must be reassembled in WKT or GML in order to use current libraries for querying or spatial indexing.
Slide 14 of x
Different Approaches
List of W3C Geo Coordinates
The coordinates of a geometric shape are encoded as a list of W3C Geo Point resources. This follows the current implementations of some RDF spatial datasets such as GeoLinkedData.es and LinkedGeoData.org.
Advantages:
● Metadata can be added to nodes.
● Geometries can be linked at node level.
Disadvantages:
● Restricted to WGS 84.
● Generates a large number of triples, which must be joined when using current libraries for querying.
Slide 15 of x
Example
List of W3C Geo Coordinates
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .

nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList (
  [ geo:long "8.33996995"; geo:lat "49.08015" ]
  [ geo:long "8.41577995"; geo:lat "49.2510995" ]
  [ geo:long "8.46698545"; geo:lat "49.2829755" ]
  [ geo:long "8.48726795"; geo:lat "49.2900265" ]
  [ geo:long "8.81823295"; geo:lat "49.194497" ]
  [ geo:long "8.87779445"; geo:lat "49.0584785" ]
  [ geo:long "8.57685695"; geo:lat "48.9896935" ]
  [ geo:long "8.49357245"; geo:lat "48.820182" ]
  [ geo:long "8.41662495"; geo:lat "48.835368" ]
  [ geo:long "8.30566745"; geo:lat "48.862568" ]
  [ geo:long "8.35457445"; geo:lat "48.934889" ]
  [ geo:long "8.26128395"; geo:lat "48.980917" ]
  [ geo:long "8.27714095"; geo:lat "48.99016" ]
  [ geo:long "8.53982195"; geo:lat "48.953889" ]
  [ geo:long "8.43560245"; geo:lat "49.091529" ]
  [ geo:long "8.33996995"; geo:lat "49.08015" ]
) .
Slide 16 of x
Different Approaches
Single Literal
All coordinates are concatenated into a single literal value.
Advantages:
● Reduces the number of triples.
● Allows the use of coordinate systems other than WGS 84.
Disadvantages:
● Metadata cannot be added to single parts of the geometry (at the level of the coordinates).
● Shared segments cannot be referenced.
Slide 17 of x
Example
Single Literal
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .

nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList "8.33996995 49.08015,8.41577995 49.2510995,8.46698545 49.2829755,8.48726795 49.2900265,8.81823295 49.194497,8.87779445 49.0584785,8.57685695 48.9896935,8.49357245 48.820182,8.41662495 48.835368,8.30566745 48.862568,8.35457445 48.934889,8.26128395 48.980917,8.27714095 48.99016,8.539821950 48.953889,8.43560245 49.091529,8.33996995 49.08015" .
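To use such a literal with WKT-based libraries, it has to be parsed and re-serialised. The following is a minimal sketch, assuming the literal layout shown in the example ("long lat" pairs separated by commas); the function names are hypothetical.

```python
from typing import List, Tuple


def parse_pos_list(literal: str) -> List[Tuple[float, float]]:
    """Split a 'long lat,long lat,...' literal into (long, lat) tuples."""
    coords = []
    for pair in literal.split(","):
        lon, lat = pair.split()
        coords.append((float(lon), float(lat)))
    return coords


def to_wkt_polygon(ring: List[Tuple[float, float]]) -> str:
    """Serialise a closed exterior ring as a WKT POLYGON (x = long, y = lat)."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("exterior ring must be closed and have at least 4 points")
    body = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({body}))"


ring = parse_pos_list("8.33996995 49.08015,8.41577995 49.2510995,"
                      "8.46698545 49.2829755,8.33996995 49.08015")
wkt = to_wkt_polygon(ring)
```

The shortened ring above reuses the first three coordinates of the slide's example plus the closing point; it is only there to keep the snippet readable.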
Slide 18 of x
Different Approaches
List of coordinate literals
Combines both previous approaches: the coordinates are encoded as a list of literals, each of which encodes a segment of coordinates.
Advantages:
● The user can choose the desired level of granularity.
● Contiguous parts of a geometry that share the same metadata can be grouped.
● Shared borders can be reused easily.
● Coordinate systems other than WGS 84 can be used.
Disadvantages:
● Segments must be joined for querying with current libraries.
Slide 19 of x
Example
List of coordinate literals
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .

nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList (
  "8.33996995 49.08015,8.41577995 49.2510995,8.46698545 49.2829755,8.48726795 49.2900265,8.81823295 49.194497"
  "8.87779445 49.0584785,8.57685695 48.9896935,8.49357245 48.820182,8.41662495 48.835368,8.30566745 48.862568"
  "8.35457445 48.934889,8.26128395 48.980917,8.27714095 48.99016,8.539821950 48.953889,8.43560245 49.091529,8.33996995 49.08015"
) .
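Joining the segments back into one ring, as the disadvantage noted for this approach requires, might look like the sketch below. It assumes each literal uses the "long lat" comma-separated layout of the example, and it additionally assumes that a vertex repeated at the boundary between two segments should be kept only once; that second rule is a guess, not something the slides specify.

```python
from typing import Iterable, List, Tuple


def join_segments(segments: Iterable[str]) -> List[Tuple[float, float]]:
    """Concatenate coordinate-literal segments into one list of (long, lat)."""
    ring: List[Tuple[float, float]] = []
    for segment in segments:
        for pair in segment.split(","):
            lon, lat = pair.split()
            point = (float(lon), float(lat))
            # Assumption: skip a vertex that merely repeats the previous one,
            # which happens when neighbouring segments share an endpoint.
            if not ring or ring[-1] != point:
                ring.append(point)
    return ring


def is_closed_ring(ring: List[Tuple[float, float]]) -> bool:
    """A linear ring needs at least 4 points and identical first/last point."""
    return len(ring) >= 4 and ring[0] == ring[-1]


joined = join_segments(["0 0,1 0", "1 0,1 1", "1 1,0 0"])
```

The joined list can then be passed to whatever WKT or GML serialiser the querying library expects.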
Slide 20 of x
Georeferenced geometric shapes
Dataset; Point; Bounding Box; Points in Lists; Single predicate; Literal
UN FAO: Own
Ordnance Survey: W3C Geo / GeoRSS; Own / GML
GeoLinkedData.es: W3C Geo; Own; Own / GML
LinkedGeoData.org: W3C Geo; Own
GeoNames.org: W3C Geo
Uberblic.org: Own
RAMON NUTS
DBpedia.org
NeoGeo: W3C Geo
Slide 21 of x
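Spatial Relations
Dataset; Disjoint; Touches; Overlaps; Within; Contains; Equals; Nearby
UN FAO: Touches = hasBorderWith; Within = isInGroup
Ordnance Survey: Disjoint = disjoint; Touches = touches; Overlaps = partiallyOverlaps; Within = within; Contains = contains; Equals = Equals
GeoLinkedData.es: Within = formaParteDe; Contains = formadoPor
LinkedGeoData.org
GeoNames.org: Touches = neighbour / neighbouringFeatures; Within = parentFeature; Contains = childrenFeatures; Nearby = nearby / nearbyFeatures
Uberblic.org: Touches = adjoining_location; Within = containing_location
RAMON NUTS: Within = partOf
DBpedia.org: Within = locatedInArea
NeoGeo: Disjoint = DC; Touches = EC; Overlaps = PO; Within = PP; Contains = PPi; Equals = EQ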
Slide 22 of x
Geospatial Datasets
GADM-RDF – http://gadm.geovocab.org
• RDF representation of the administrative regions of the GADM project: http://gadm.org
NUTS-RDF – http://nuts.geovocab.org
• RDF representation of Eurostat's NUTS nomenclature.
They serve as:
• New geospatial information on the Semantic Web.
• Bridges between already published spatial datasets.
• Proof-of-concept platforms.
Slide 23 of x
Vocabulary Mappings
FROM (rows) / TO (columns) | NeoGeo | DBpedia | LinkedGeoData | geo.linkeddata.es | Geonames
NeoGeo | - | SC, SP | SC, SP | SC, SP | SC, SP
DBpedia | tbd | - | | |
LinkedGeoData | tbd | | - | |
geo.linkeddata.es | tbd | | | - |
Geonames | tbd | | | | -
SC: rdfs:subClassOf, SP: rdfs:subPropertyOf, SA: owl:sameAs
Slide 24 of x
Instance Mappings
FROM (rows) / TO (columns) | NeoGeo NUTS | NeoGeo GADM | DBpedia | LinkedGeoData | geo.linkeddata.es | Geonames
NeoGeo NUTS | - | EQ | PPi | PPi | PPi |
NeoGeo GADM | EQ | - | PPi | PPi | PPi |
DBpedia | PP | PP | - | | |
LinkedGeoData | PP | PP | | - | |
geo.linkeddata.es | PP | PP | | | - |
Geonames | | | | | | -
Slide 25 of x
Geometric Equivalences
NUTS-RDF and GADM-RDF have different:
• Sampling values
• Scales
• Starting points
• Rounding effects
Geometric shapes will not be equivalent vertex by vertex.
A sensible criterion for finding geometric equivalences is needed.
Slide 26 of x
Algorithm Overview
[Diagram: WGS-84, Plate Carrée projection; Hausdorff distance; spatial:EQ]
Slide 27 of x
1. Retrieve sample data
The algorithm requires:
• WGS-84 coordinate reference system.
• Plate Carrée projection: X = longitude, Y = latitude.
• Coordinates are treated as Cartesian.
• This distorts all parameters (area, shape, distance, direction).
• Geometric shapes are equally distorted on both datasets.
• Local reprojections are avoided (e.g. UTM).
• Units will be presented in centesimal degrees.
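Treating coordinates as Cartesian under Plate Carrée amounts to using longitude and latitude directly as x and y. A minimal sketch, with hypothetical function names:

```python
import math
from typing import Tuple

LonLat = Tuple[float, float]


def plate_carree(lon: float, lat: float) -> Tuple[float, float]:
    """Plate Carrée as described on the slide: X = longitude, Y = latitude."""
    return (lon, lat)


def planar_distance(a: LonLat, b: LonLat) -> float:
    """Euclidean distance in degree units between two (lon, lat) points."""
    ax, ay = plate_carree(*a)
    bx, by = plate_carree(*b)
    return math.hypot(ax - bx, ay - by)
```

The result is a distance in degrees, not metres. That is acceptable here because both datasets are distorted in the same way and only relative comparisons are needed.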
Slide 28 of x
2. Similarity threshold function
The Hausdorff Distance provides a measure of similarity between geometric shapes.
It can be intuitively defined as the largest of the distances from each point of one shape to the closest point of the other shape.
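The definition can be made concrete with a brute-force version that only looks at vertices. This is a simplified sketch, not the implementation used for the results in these slides: it ignores points along the edges and runs in time proportional to the product of the two vertex counts.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from a vertex of `a` to its nearest vertex in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric (vertex-based) Hausdorff distance between two shapes."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


shape_a = [(0.0, 0.0), (1.0, 0.0)]
shape_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
distance = hausdorff(shape_a, shape_b)  # 3.0: the vertex (1, 3) is 3 units from shape_a
```

Both directions are needed because the directed distance is not symmetric: every vertex of `shape_a` also occurs in `shape_b`, yet `shape_b` has a vertex far from `shape_a`.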
Slide 29 of x
2. Similarity threshold function
Smaller regions need a lower Hausdorff Distance threshold than larger regions.
Slide 30 of x
2. Similarity threshold function
NUTS Name | NUTS Area | GADM Name | Hausdorff Distance | Midpoint Value
ESPAÑA | 53.47 | España | 1.63 | 10.39
 | | Tamanghasset | 19.15 |
ΕΛΛΑΔΑ / ELLADA | 13.16 | Ellas or Ellada | 1.05 | 3.7
 | | Bulgaria | 6.34 |
ÖSTERREICH | 10.07 | Österreich | 0.18 | 2.06
 | | Ceská republika | 3.93 |
Hedmark | 4.61 | Hedmark | 0.48 | 2.93
 | | Oppland | 2.45 |
Somme | 0.78 | Somme | 0.32 | 0.5
 | | Oise | 0.67 |
We calculate the midpoint value between the Hausdorff Distances for a correct guess and the lowest wrong guess.
Slide 31 of x
2. Similarity threshold function
We perform regression on the midpoint values to obtain the Hausdorff Distance threshold function.
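The midpoint and regression steps can be sketched as follows. The slides do not state the form of the regression; the logarithmic model `threshold = a * ln(area) + b` used here is an assumption chosen for illustration, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def midpoint(correct: float, lowest_wrong: float) -> float:
    """Midpoint between the Hausdorff distances of the correct and the lowest wrong guess."""
    return (correct + lowest_wrong) / 2


def fit_log_threshold(areas: Sequence[float],
                      midpoints: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of midpoints against ln(area).

    Assumed model (not confirmed by the slides): threshold = a * ln(area) + b.
    """
    xs = [math.log(area) for area in areas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(midpoints) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, midpoints))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def threshold(area: float, a: float, b: float) -> float:
    """Evaluate the fitted threshold function for a region of the given area."""
    return a * math.log(area) + b


# Midpoint for the ESPAÑA row: correct guess 1.63, lowest wrong guess 19.15.
spain_midpoint = midpoint(1.63, 19.15)  # approximately 10.39
```

Whatever model is chosen, the fitted function should grow with area, so that smaller regions receive a lower threshold than larger ones.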
Slide 32 of x
3. Finding spatial equivalences
NUTS Name | NUTS Area | GADM Name | Hausdorff Distance | Threshold Function | spatial:EQ
HRVATSKA | 6.21 | Hrvatska | 1.14 | 3.49 | Yes
NEDERLAND | 4.83 | Nederland | 0.39 | 2.96 | Yes
LIETUVA | 9.15 | Lietuva | 0.47 | 4.31 | Yes
ΕΛΛΑΔΑ / ELLADA | 13.17 | Bulgaria | 6.35 | 5.08 | No
UNITED KINGDOM | 33.03 | France | 12.6 | 7.02 | No
Córdoba | 1.41 | Sevilla | 1.41 | 0.37 | No
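The decision in the last column reduces to a comparison between the Hausdorff distance and the threshold for the region's area. A minimal sketch using the values from the table; the function name is hypothetical, and the strict `<` comparison is an assumption.

```python
def is_spatially_equal(hausdorff_distance: float, threshold_value: float) -> bool:
    """Assert spatial:EQ only if the distance stays below the threshold."""
    return hausdorff_distance < threshold_value


# (NUTS name, GADM name, Hausdorff distance, threshold) taken from the table.
rows = [
    ("HRVATSKA", "Hrvatska", 1.14, 3.49),
    ("NEDERLAND", "Nederland", 0.39, 2.96),
    ("LIETUVA", "Lietuva", 0.47, 4.31),
    ("ΕΛΛΑΔΑ / ELLADA", "Bulgaria", 6.35, 5.08),
    ("UNITED KINGDOM", "France", 12.6, 7.02),
    ("Córdoba", "Sevilla", 1.41, 0.37),
]

decisions = [is_spatially_equal(hd, thr) for _, _, hd, thr in rows]
```

The first three pairs fall below their thresholds and are accepted; the last three exceed them and are rejected, matching the Yes/No column.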
Slide 33 of x
Poor Geospatial Information
Sometimes a location is approximated as a single point.
This can lead to false assertions when calculating containment relations.

<http://dbpedia.org/resource/Germany> geo:lat 52.516666; geo:long 13.383333 .
<http://nuts.geovocab.org/id/DE30_geometry> rdf:type ngeo:Polygon .

Germany is not contained in Berlin.
Other properties must be considered when calculating containment relations (e.g. rdf:type).
Other spatial relations (e.g. spatial:EQ) cannot be calculated.
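The false assertion can be reproduced with a plain point-in-polygon test. The sketch below uses ray casting; the rectangle is a made-up stand-in for the DE30 polygon, not its real outline, and the function name is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray casting: count crossings of a ray going in +x direction from the point.

    `ring` must be closed (first point repeated at the end).
    """
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical bounding rectangle around Berlin as (long, lat); NOT the real DE30 geometry.
berlin_box = [(13.0, 52.3), (13.8, 52.3), (13.8, 52.7), (13.0, 52.7), (13.0, 52.3)]

# The single point DBpedia gives for Germany, as (long, lat).
germany_point = (13.383333, 52.516666)

naive_result = point_in_polygon(germany_point, berlin_box)
```

The geometric test alone reports the point as inside, which is why additional properties such as rdf:type have to be checked before asserting containment.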
Slide 34 of x
Optimizations
The cost of calculating the Hausdorff distance depends on the number of vertices.
The Ramer-Douglas-Peucker algorithm simplifies geometric shapes, given a chosen maximum separation between the original and the simplified shape.
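A compact recursive version of the Ramer-Douglas-Peucker algorithm is sketched below. It measures the distance from a vertex to the segment between the current endpoints, which is one common variant; the function names and the sample line are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from point p to the segment a-b (or to a if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points: Sequence[Point], epsilon: float) -> List[Point]:
    """Drop vertices that lie within `epsilon` of the simplified line."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        # Keep the farthest vertex and simplify both halves recursively.
        left = rdp(points[:index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]


line = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
simplified = rdp(line, 0.2)
```

With a tolerance of 0.2, the nearly collinear vertex (1, 0.05) is dropped while the peak at (3, 2) is kept. For very large rings an iterative formulation would avoid deep recursion.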
Slide 35 of x
Optimizations
Region Name | NUTS Points | GADM Points | Hausdorff Distance (Original) | Time [ms] (Original) | Hausdorff Distance (0.2 Simplif.) | Time [ms] (0.2 Simplif.)
Finland | 389 | 107783 | 1.3996 | 30353 | 1.3483 | 2504
Croatia | 175 | 193180 | 1.1374 | 7830 | 1.1366 | 1108
Schleswig-Holstein | 118 | 28001 | 0.7281 | 1870 | 0.7257 | 296
Iceland | 320 | 7610 | 0.4163 | 567 | 0.4613 | 66
Karlsruhe | 47 | 1021 | 0.1062 | 35 | 0.1906 | 13
Seine-Saint-Denis | 6 | 30 | 0.0812 | 1 | 0.0716 | 2
Slide 36 of x
Spatial Databases
The algorithm also works well with spatial databases (e.g. PostgreSQL / PostGIS):

SELECT g.gadm_id, n.nuts_id
FROM nuts n
INNER JOIN gadm g ON (n.geometry && g.geometry)
WHERE n.shape_area BETWEEN (g.shape_area * 0.9) AND (g.shape_area * 1.1)
  AND ST_HausdorffDistance(
        ST_SimplifyPreserveTopology(n.geometry, 0.5),
        ST_SimplifyPreserveTopology(g.geometry, 0.5)
      ) < g.max_hausdorff_dist;
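The two cheap pre-filters in the query (bounding-box overlap via `&&` and an area ratio within 10%) can be sketched outside the database as well. The helper names below are hypothetical; only the 0.9 and 1.1 factors are taken from the query.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bounding_box(ring: Sequence[Point]) -> BBox:
    """Axis-aligned bounding box of a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """True if the two boxes intersect, mirroring the `&&` operator in the query."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def areas_compatible(nuts_area: float, gadm_area: float) -> bool:
    """Mirror of: n.shape_area BETWEEN g.shape_area * 0.9 AND g.shape_area * 1.1."""
    return gadm_area * 0.9 <= nuts_area <= gadm_area * 1.1


def is_candidate(nuts_ring: Sequence[Point], nuts_area: float,
                 gadm_ring: Sequence[Point], gadm_area: float) -> bool:
    """Only candidate pairs need the expensive Hausdorff distance computed."""
    return (bboxes_overlap(bounding_box(nuts_ring), bounding_box(gadm_ring))
            and areas_compatible(nuts_area, gadm_area))
```

Filtering first keeps the Hausdorff computation, whose cost grows with the number of vertices, restricted to pairs that could plausibly match.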
Slide 37 of x
Evaluation
Not every NUTS region matches a GADM region.
Many NUTS regions represent parts or aggregations of GADM administrative boundaries.
1,671 NUTS regions => 965 matches & 13 false positives.
NUTS UKF2: Leicestershire, Rutland and Northamptonshire
GADM 2_13988: Leicestershire
Slide 38 of x
Evaluation
NUTS Region | NUTS Area | Incorrect GADM guess | Hausdorff Distance
UKM34 | 0.0214 | East Renfrewshire | 0.1862
FR106 | 0.0334 | Val-De-Marne | 0.1644
BE321 | 0.0654 | Soignies | 0.3521
BE353 | 0.1188 | Thuin | 0.2834
CH061 | 0.1672 | Aargau | 0.3653
LT | 9.5204 | Latvija | 2.5098
LI | 0.0205 | Appenzell Innerrhoden | 0.2783
UKM28 | 0.0689 | North Lanarkshire | 0.3478
BE331 | 0.1013 | Liège | 0.335
BE353 | 0.1188 | Thuin | 0.2834
CH061 | 0.1672 | Aargau | 0.3653
SE3 | 60.585 | Norge | 7.8658
BE321 | 0.0654 | Soignies | 0.3521
Slide 39 of x
Currently available resources
• NeoGeo vocabulary and best practices for publishing geodata as Linked Data
• NUTS and GADM datasets online
• Integration vocabulary online, including mappings
• GADM mappings to DBpedia
• Linked Data Services for accessing/querying spatial indices (withinRegion, boundingBox)
• Work on similarity metrics (with optimisations and evaluation) for geospatial regions
Slide 40 of x
Future Work
• Finalisation of NeoGeo vocabulary
• Improvement of precision of spatial similarity; publish service online
• More earth and space science data
• Tools to support the mapping process
• More instance mappings to GADM
• Possibly map to sensor descriptions
• More experiments: querying of integrated data
• Include reasoning
• Temporal context
Slide 41 of x
Conclusion
• GeoVocab.org publishes the vocabulary and vocabulary mappings
• NUTS and GADM use the vocabulary and provide instance mappings to several other well-known datasets
• Several services are online
• An optimised algorithm is used to detect spatially co-located features across multiple RDF datasets
• More work remains to be done, including coordination with other efforts