Representing Contextual Aspects of Data

Slide 1 of x Representing Contextual Aspects of Data Andreas Harth Joint work with Juan Salas PlanetData 1st EC Review, 7-8 December 2011, Luxembourg



Page 1: Representing  Contextual Aspects  of  Data

Slide 1 of x

Representing Contextual Aspects of Data

Andreas Harth

Joint work with Juan Salas

PlanetData 1st EC Review, 7-8 December 2011, Luxembourg

Page 2: Representing  Contextual Aspects  of  Data

Slide 2 of x

Outline

• Motivation
• Source Datasets
• NeoGeo Vocabulary
• Integration Algorithm
• Integrated Datasets and Services
• Community Activities
• Demo
• Outlook
• Conclusion

Page 3: Representing  Contextual Aspects  of  Data

Slide 3 of x

Motivation

Geodata is becoming increasingly relevant:
• Location-based services
• Mobile applications
• Ever-increasing amount of sensor data (phones, satellites)

Data is published in many formats:
• GML, KML, WKT, RDF?…

Applications require integrated access to geodata:
• Spatial querying
• Spatial reasoning

Page 4: Representing  Contextual Aspects  of  Data

Slide 4 of x

GeoData

Geospatial data is ubiquitous in information management, whether it serves scientific, industrial or everyday activities. For this reason, a shared representation of geodata is vitally important for the future of the Semantic Web.

Example application fields include:
• Transport
• Demography
• Mobile Applications
• Remote Sensing
• Commerce
(and many more…)

Page 5: Representing  Contextual Aspects  of  Data

Slide 5 of x

Requirements

Integrated data format (syntax) and access (data transfer protocol)
• Linked Data (RDF, HTTP)

Mapping to a common vocabulary
• Focus on representing geographic regions

Mappings between instances

Algorithms and systems for integrated querying

Algorithms and systems for integrated reasoning

(integrate that slide with next one)

Page 6: Representing  Contextual Aspects  of  Data

Slide 6 of x

Integration Challenges

Vocabularies – http://geovocab.org/doc/survey.html
• Survey of several well-known Linked Data datasets (Ordnance Survey, GeoLinkedData.es, LinkedGeoData.org, GeoNames, DBpedia).
• Identified properties and classes mapped to the NeoGeo vocabularies published at GeoVocab.org

Instances
• Finding equivalences between regions across multiple datasets at the geometry level.

Page 7: Representing  Contextual Aspects  of  Data

Slide 7 of x

Geodata Integration System Architecture

[Figure: architecture diagram with boxes Source 1, Source 2, … Source n; Wrapper 1; Mapping 1, Mapping 2, … Mapping n; and Integration.]

Page 8: Representing  Contextual Aspects  of  Data

Slide 8 of x

Integration Vocabulary

GeoVocab.org is an initiative to study methods and tools for the integration of geospatial data on the Semantic Web.

Geometry Vocabulary – http://geovocab.org/geometry
• Representation of georeferenced geometric shapes.

Spatial Ontology – http://geovocab.org/spatial
• Representation of, and reasoning on, topological relations based on the Region Connection Calculus.

[Figure: spatial:Feature and ngeo:Geometry, linked by ngeo:geometry; spatial:* relations.]

Page 9: Representing  Contextual Aspects  of  Data

Slide 9 of x

Spatial Ontology
http://geovocab.org/spatial

Uses the RCC vocabulary to represent topological relations between regions.

Supports RCC5 and RCC8 relations.

Inference is available for most RCC relations. However, some rules require "negation as failure", which is not supported in OWL.
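How RCC8 relates to the coarser RCC5 can be written as a pair of lookup tables. This is a minimal sketch using the standard RCC relation abbreviations; the function names `inverse` and `coarsen` are hypothetical and not part of the NeoGeo vocabulary.

```python
# Standard RCC-8 base relations mapped to their inverse relations.
RCC8_INVERSE = {
    "DC": "DC", "EC": "EC", "PO": "PO", "EQ": "EQ",
    "TPP": "TPPi", "NTPP": "NTPPi",
    "TPPi": "TPP", "NTPPi": "NTPP",
}

# RCC-5 merges RCC-8 relations that differ only in whether boundaries touch.
RCC8_TO_RCC5 = {
    "DC": "DR", "EC": "DR",
    "PO": "PO", "EQ": "EQ",
    "TPP": "PP", "NTPP": "PP",
    "TPPi": "PPi", "NTPPi": "PPi",
}


def inverse(relation: str) -> str:
    """Return the relation that holds when the two regions are swapped."""
    return RCC8_INVERSE[relation]


def coarsen(relation: str) -> str:
    """Map an RCC-8 relation to the RCC-5 relation that subsumes it."""
    return RCC8_TO_RCC5[relation]
```

Symmetric relations (DC, EC, PO, EQ) are their own inverses; only the proper-part relations change direction.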

Page 10: Representing  Contextual Aspects  of  Data

Slide 10 of x

Spatial Properties (RCC-8)

Page 11: Representing  Contextual Aspects  of  Data

Slide 11 of x

Geometry Ontology
http://geovocab.org/geometry

Premises:
● Open RDF format
● Fully based on Linked Data principles

Based on:
● ISO 19109 - OGC General Feature Model
● ISO 19137 - Core profile of the spatial schema

Page 12: Representing  Contextual Aspects  of  Data

Slide 12 of x

Geometry Ontology
http://geovocab.org/geometry

Since the Geometry Ontology is based on the General Feature Model, it distinguishes between the feature (the resource to which the geometry belongs) and the actual geometry. As a result:
• The semantics of the feature are more important than the representation of the geometry.
• Instances of the feature are related to the type of the feature.
• A feature can be related to multiple geometries, not as MultiLineString, MultiPolygon or MultiPoint, but as multiple distinct geometries. This makes it possible to model different geometric properties for a single feature (e.g. different scales).

Because it is also based on ISO 19137, the geometries that can be represented are Point, LineString, Polygon, MultiPoint, MultiLineString and MultiPolygon, which should suffice for most use cases without adding extra complexity.

Page 13: Representing  Contextual Aspects  of  Data

Slide 13 of x

Geometry Ontology
http://geovocab.org/geometry

Unlike GML/WKT representations embedded into RDF, the Geometry Ontology is fully based on RDF.

Advantages:
● Geometries can be aggregated. For example: a MultiPolygon can be composed of several Polygon resources, each with its own URI and metadata.
● Metadata can be added to individual parts of a geometry. For example: label disputed borders as such, or compose a polygon from GPS measurements, each with its own version and measurement date.

Disadvantages:
● The geometry must be reassembled into WKT or GML in order to use current libraries for querying or spatial indexing.

Page 14: Representing  Contextual Aspects  of  Data

Slide 14 of x

Different Approaches
List of W3C Geo Coordinates

The coordinates of a geometric shape are encoded as a list of W3C Geo Point resources. This follows the current practice of some RDF spatial datasets such as GeoLinkedData.es and LinkedGeoData.org.

Advantages:
● Metadata can be added to nodes.
● Geometries can be linked at node level.

Disadvantages:
● Restricted to WGS 84.
● Generates a large number of triples, which must be joined when using current libraries for querying.
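To make the triple overhead concrete, the count for a ring encoded as an RDF collection of points can be estimated as follows. This is a rough sketch that assumes each point is a blank node carrying only geo:long and geo:lat, and that the list is a standard RDF collection (one rdf:first and one rdf:rest per list cell); the function name is hypothetical.

```python
def triples_for_point_list(n_points: int) -> int:
    """Estimate the triples needed for one ngeo:posList holding n points."""
    list_cells = 2 * n_points    # rdf:first + rdf:rest for every list cell
    point_props = 2 * n_points   # geo:long + geo:lat for every point
    pos_list_link = 1            # the ngeo:posList triple pointing at the list
    return pos_list_link + list_cells + point_props


# A ring with 16 coordinate pairs, compared with a single-literal encoding,
# which needs exactly one ngeo:posList triple.
ring_triples = triples_for_point_list(16)
literal_triples = 1
```

Under these assumptions the cost grows by four triples per vertex, which is why detailed boundaries become expensive in this encoding.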

Page 15: Representing  Contextual Aspects  of  Data

Slide 15 of x

Example
List of W3C Geo Coordinates

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .

nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList (
  [ geo:long "8.33996995"; geo:lat "49.08015" ]
  [ geo:long "8.41577995"; geo:lat "49.2510995" ]
  [ geo:long "8.46698545"; geo:lat "49.2829755" ]
  [ geo:long "8.48726795"; geo:lat "49.2900265" ]
  [ geo:long "8.81823295"; geo:lat "49.194497" ]
  [ geo:long "8.87779445"; geo:lat "49.0584785" ]
  [ geo:long "8.57685695"; geo:lat "48.9896935" ]
  [ geo:long "8.49357245"; geo:lat "48.820182" ]
  [ geo:long "8.41662495"; geo:lat "48.835368" ]
  [ geo:long "8.30566745"; geo:lat "48.862568" ]
  [ geo:long "8.35457445"; geo:lat "48.934889" ]
  [ geo:long "8.26128395"; geo:lat "48.980917" ]
  [ geo:long "8.27714095"; geo:lat "48.99016" ]
  [ geo:long "8.53982195"; geo:lat "48.953889" ]
  [ geo:long "8.43560245"; geo:lat "49.091529" ]
  [ geo:long "8.33996995"; geo:lat "49.08015" ]
) .

Page 16: Representing  Contextual Aspects  of  Data

Slide 16 of x

Different Approaches
Single Literal

All coordinates are concatenated into a single literal value.

Advantages:
● Reduces the number of triples.
● Allows coordinate systems other than WGS 84.

Disadvantages:
● Metadata cannot be added to single parts of the geometry (at the level of the coordinates).
● Shared segments cannot be referenced.

Page 17: Representing  Contextual Aspects  of  Data

Slide 17 of x

Example
Single Literal

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .

nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList "8.33996995 49.08015,8.41577995 49.2510995,8.46698545 49.2829755,8.48726795 49.2900265,8.81823295 49.194497,8.87779445 49.0584785,8.57685695 48.9896935,8.49357245 48.820182,8.41662495 48.835368,8.30566745 48.862568,8.35457445 48.934889,8.26128395 48.980917,8.27714095 48.99016,8.539821950 48.953889,8.43560245 49.091529,8.33996995 49.08015" .

Page 18: Representing  Contextual Aspects  of  Data

Slide 18 of x

Different Approaches
List of coordinate literals

Combines the two previous approaches: the coordinates are encoded as a list of literals, each of which encodes a segment of coordinates.

Advantages:
● The user can choose the desired level of granularity.
● Contiguous parts of a geometry that share the same metadata can be grouped.
● Shared borders can be reused easily.
● Allows coordinate systems other than WGS 84.

Disadvantages:
● Segments must be joined for querying with current libraries.

Page 19: Representing  Contextual Aspects  of  Data

Slide 19 of x

Example
List of coordinate literals

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix nuts: <http://nuts.geovocab.org/id/> .
@prefix ngeo: <http://geovocab.org/geometry#> .

nuts:DE123_geometry rdf:type ngeo:Polygon .
nuts:DE123_geometry ngeo:exterior _:d1e59878 .
_:d1e59878 rdf:type ngeo:LinearRing .
_:d1e59878 ngeo:posList (
  "8.33996995 49.08015,8.41577995 49.2510995,8.46698545 49.2829755,8.48726795 49.2900265,8.81823295 49.194497"
  "8.87779445 49.0584785,8.57685695 48.9896935,8.49357245 48.820182,8.41662495 48.835368,8.30566745 48.862568"
  "8.35457445 48.934889,8.26128395 48.980917,8.27714095 48.99016,8.539821950 48.953889,8.43560245 49.091529,8.33996995 49.08015"
) .
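The joining step named as the disadvantage of this approach can be sketched as below. The sketch assumes each literal has the form "lon lat,lon lat,…" as in the example, and that the target library accepts WKT with coordinates in the same order; `parse_positions` and `segments_to_wkt_polygon` are hypothetical helper names.

```python
def parse_positions(literal: str) -> list[tuple[float, float]]:
    """Parse a 'lon lat,lon lat,...' literal into (lon, lat) tuples."""
    points = []
    for pair in literal.split(","):
        lon, lat = pair.split()
        points.append((float(lon), float(lat)))
    return points


def segments_to_wkt_polygon(segments: list[str]) -> str:
    """Concatenate segment literals in list order into one WKT polygon."""
    ring = []
    for segment in segments:
        ring.extend(parse_positions(segment))
    # A valid exterior ring must end on the point where it started.
    if ring[0] != ring[-1]:
        raise ValueError("exterior ring is not closed")
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON (({coords}))"


# Toy segments, not the NUTS data from the example.
wkt = segments_to_wkt_polygon(["0 0,1 0", "1 1,0 0"])
```

The same function covers the single-literal approach when it is given a list with one element.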

Page 20: Representing  Contextual Aspects  of  Data

Slide 20 of x

Georeferenced geometric shapes

Columns: Dataset; Point; Bounding Box; Points in Lists; Single predicate; Literal

Vocabularies per dataset, listed in column order (empty cells omitted):

UN FAO: Own
Ordnance Survey: W3C Geo / GeoRSS; Own / GML
GeoLinkedData.es: W3C Geo; Own; Own / GML
LinkedGeoData.org: W3C Geo; Own
GeoNames.org: W3C Geo
Uberblic.org: Own
RAMON NUTS:
DBpedia.org:
NeoGeo: W3C Geo

Page 21: Representing  Contextual Aspects  of  Data

Slide 21 of x

Spatial Relations

Columns: Dataset; Disjoint; Touches; Overlaps; Within; Contains; Equals; Nearby

Properties per dataset, listed in column order (empty cells omitted):

UN FAO: hasBorderWith; isInGroup
Ordnance Survey: disjoint; touches; partiallyOverlaps; within; contains; Equals
GeoLinkedData.es: formaParteDe; formadoPor
LinkedGeoData.org:
GeoNames.org: neighbour / neighbouringFeatures; parentFeature; childrenFeatures; nearby / nearbyFeatures
Uberblic.org: adjoining_location; containing_location
RAMON NUTS: partOf
DBpedia.org: locatedInArea
NeoGeo: DC; EC; PO; PP; PPi; EQ

Page 22: Representing  Contextual Aspects  of  Data

Slide 22 of x

Geospatial Datasets

GADM-RDF – http://gadm.geovocab.orgRDF representation of the administrative regions of

the GADM project: http://gadm.org

NUTS-RDF – http://nuts.geovocab.orgRDF representation of Eurostat's NUTS

nomenclature.

They serve as:New geospatial information on the Semantic Web.Bridges between already published spatial datasets.Proof-of-concept platforms.

Page 23: Representing  Contextual Aspects  of  Data

Slide 23 of x

Vocabulary Mappings

FROM \ TO | NeoGeo | DBpedia | LinkedGeoData | geo.linkeddata.es | Geonames
NeoGeo | - | SC, SP | SC, SP | SC, SP | SC, SP
DBpedia | tbd | - | | |
LinkedGeoData | tbd | | - | |
geo.linkeddata.es | tbd | | | - |
Geonames | tbd | | | | -

SC: rdfs:subClassOf, SP: rdfs:subPropertyOf, SA: owl:sameAs

Page 24: Representing  Contextual Aspects  of  Data

Slide 24 of x

Instance Mappings

FROM \ TO | NeoGeo NUTS | NeoGeo GADM | DBpedia | LinkedGeoData | geo.linkeddata.es | Geonames
NeoGeo NUTS | - | EQ | PPi | PPi | PPi |
NeoGeo GADM | EQ | - | PPi | PPi | PPi |
DBpedia | PP | PP | - | | |
LinkedGeoData | PP | PP | | - | |
geo.linkeddata.es | PP | PP | | | - |
Geonames | | | | | | -

Page 25: Representing  Contextual Aspects  of  Data

Slide 25 of x

Geometric Equivalences

NUTS-RDF and GADM-RDF have different:
• Sampling values
• Scales
• Starting points
• Rounding effects

Geometric shapes will not be vertex by vertex equivalent.

A sensible criterion for finding geometric equivalences is needed.

Page 26: Representing  Contextual Aspects  of  Data

Slide 26 of x

Algorithm Overview

[Figure: algorithm overview. Labels: WGS-84, Plate Carrée projection; Hausdorff distance; spatial:EQ (cardinalities 1, 1, *).]

Page 27: Representing  Contextual Aspects  of  Data

Slide 27 of x

1. Retrieve sample data

The algorithm requires:
• WGS-84 coordinate reference system.
• Plate Carrée projection: X = longitude, Y = latitude.
• Coordinates are treated as Cartesian.
• This distorts all parameters (area, shape, distance, direction).
• Geometric shapes are equally distorted in both datasets.
• Local reprojections (e.g. UTM) are avoided.
• Units are given in decimal degrees.

Page 28: Representing  Contextual Aspects  of  Data

Slide 28 of x

2. Similarity threshold function

The Hausdorff Distance provides a measure of similarity between geometric shapes.

Intuitively, it is the largest of the distances from any point of one shape to the closest point of the other shape.
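The definition can be written down directly for two lists of vertices. This is a minimal sketch of the discrete variant: it compares vertices only (not points along edges), treats coordinates as planar as the Plate Carrée step implies, and runs in time proportional to the product of the two vertex counts. The function names are hypothetical.

```python
import math

Point = tuple[float, float]


def directed_hausdorff(a: list[Point], b: list[Point]) -> float:
    """Largest distance from a vertex of `a` to its nearest vertex of `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: list[Point], b: list[Point]) -> float:
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy shapes: `b` has one extra vertex that lies 3 units away from `a`.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
distance = hausdorff(a, b)
```

Taking the maximum over both directions matters: every vertex of `a` also appears in `b`, so the directed distance from `a` to `b` alone would be zero.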

Page 29: Representing  Contextual Aspects  of  Data

Slide 29 of x

2. Similarity threshold function

Smaller regions need a lower Hausdorff Distance threshold than larger regions.

Page 30: Representing  Contextual Aspects  of  Data

Slide 30 of x

2. Similarity threshold function

NUTS Name | NUTS Area | GADM Name | Hausdorff Distance | Midpoint Value
ESPAÑA | 53.47 | España | 1.63 | 10.39
 | | Tamanghasset | 19.15 |
ΕΛΛΑΔΑ / ELLADA | 13.16 | Ellas or Ellada | 1.05 | 3.7
 | | Bulgaria | 6.34 |
ÖSTERREICH | 10.07 | Österreich | 0.18 | 2.06
 | | Česká republika | 3.93 |
Hedmark | 4.61 | Hedmark | 0.48 | 2.93
 | | Oppland | 2.45 |
Somme | 0.78 | Somme | 0.32 | 0.5
 | | Oise | 0.67 |

We calculate the midpoint value between the Hausdorff Distances for a correct guess and the lowest wrong guess.

Page 31: Representing  Contextual Aspects  of  Data

Slide 31 of x

2. Similarity threshold function

We perform regression on the midpoint values to obtain the Hausdorff Distance threshold function.
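The slides do not state which regression model is used. Purely for illustration, the sketch below assumes a logarithmic form, threshold = a · ln(area) + b, fitted by ordinary least squares. Midpoints are recomputed from the two distance columns of the sample table, so they may differ from the printed Midpoint column; all function names are hypothetical.

```python
import math

# (NUTS area, distance to correct match, distance to lowest wrong match),
# copied from the sample table.
samples = [
    (53.47, 1.63, 19.15),
    (13.16, 1.05, 6.34),
    (10.07, 0.18, 3.93),
    (4.61, 0.48, 2.45),
    (0.78, 0.32, 0.67),
]


def midpoint(correct: float, lowest_wrong: float) -> float:
    """Halfway between the correct guess and the lowest wrong guess."""
    return (correct + lowest_wrong) / 2


def fit_log_threshold(points):
    """Least-squares fit of value = a * ln(area) + b (assumed model)."""
    xs = [math.log(area) for area, _ in points]
    ys = [value for _, value in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


points = [(area, midpoint(c, w)) for area, c, w in samples]
a, b = fit_log_threshold(points)


def threshold(area: float) -> float:
    """Hausdorff distance threshold for a region of the given area."""
    return a * math.log(area) + b
```

Whatever model is chosen, the fitted function should grow with area, so that smaller regions get a stricter threshold.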

Page 32: Representing  Contextual Aspects  of  Data

Slide 32 of x

3. Finding spatial equivalences

NUTS Name | NUTS Area | GADM Name | Hausdorff Distance | Threshold Function | spatial:EQ
HRVATSKA | 6.21 | Hrvatska | 1.14 | 3.49 | Yes
NEDERLAND | 4.83 | Nederland | 0.39 | 2.96 | Yes
LIETUVA | 9.15 | Lietuva | 0.47 | 4.31 | Yes
ΕΛΛΑΔΑ / ELLADA | 13.17 | Bulgaria | 6.35 | 5.08 | No
UNITED KINGDOM | 33.03 | France | 12.6 | 7.02 | No
Córdoba | 1.41 | Sevilla | 1.41 | 0.37 | No
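The decision in the last column follows from comparing the two preceding columns. A minimal sketch, using three rows from the table and a hypothetical function name:

```python
def is_spatial_eq(hausdorff_distance: float, threshold: float) -> bool:
    """Assert spatial:EQ only when the distance stays below the threshold."""
    return hausdorff_distance < threshold


# (NUTS name, GADM name, Hausdorff distance, threshold function value)
rows = [
    ("HRVATSKA", "Hrvatska", 1.14, 3.49),
    ("ΕΛΛΑΔΑ / ELLADA", "Bulgaria", 6.35, 5.08),
    ("Córdoba", "Sevilla", 1.41, 0.37),
]

results = {(nuts, gadm): is_spatial_eq(d, t) for nuts, gadm, d, t in rows}
```

The Córdoba/Sevilla row shows why the threshold depends on area: a distance of 1.41 would pass for a country-sized region but not for a province.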

Page 33: Representing  Contextual Aspects  of  Data

Slide 33 of x

Poor Geospatial Information

Sometimes location is approximated as a single point.
• This can lead to false assertions when calculating containment relations.

<http://dbpedia.org/resource/Germany> geo:lat 52.516666; geo:long 13.383333 .

<http://nuts.geovocab.org/id/DE30_geometry> rdf:type ngeo:Polygon .

Germany is not contained in Berlin.

Other properties must be considered to calculate containment relations (e.g. rdf:type).

Other spatial relations (e.g. spatial:EQ) cannot be calculated.
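The false assertion can be reproduced with a plain point-in-polygon test, and avoided with a type check. In this sketch the rectangle is a hypothetical stand-in for the DE30 polygon (not the real geometry), and the `RANK` table and `plausibly_within` guard are invented for illustration; the source only says that properties such as rdf:type must be considered.

```python
def point_in_ring(point, ring):
    """Ray casting: toggle on every edge crossed by a ray going right."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical administrative ranks; a lower number means a larger unit.
RANK = {"Country": 0, "State": 1, "City": 2}


def plausibly_within(point, feature_type, ring, region_type):
    """Accept containment only for a smaller kind of unit than the region."""
    if RANK[feature_type] <= RANK[region_type]:
        return False
    return point_in_ring(point, ring)


# DBpedia's point for Germany, as (long, lat).
germany_point = (13.383333, 52.516666)
# Hypothetical closed rectangle roughly around Berlin.
berlin_box = [(13.0, 52.3), (13.8, 52.3), (13.8, 52.7), (13.0, 52.7), (13.0, 52.3)]

naive = point_in_ring(germany_point, berlin_box)
guarded = plausibly_within(germany_point, "Country", berlin_box, "State")
```

The naive test places Germany inside the Berlin stand-in, while the guarded test rejects the pair before any geometry is examined.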

Page 34: Representing  Contextual Aspects  of  Data

Slide 34 of x

Optimizations

The cost of calculating the Hausdorff distance depends on the number of vertices.

The Ramer-Douglas-Peucker algorithm simplifies geometric shapes, using an arbitrary maximum separation.
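A compact recursive version of Ramer-Douglas-Peucker is sketched below, with `epsilon` playing the role of the maximum separation. It is an illustrative implementation, not the one used for the measurements; for rings with very many vertices an iterative variant would be needed to stay within Python's recursion limit.

```python
import math


def perpendicular_distance(p, start, end):
    """Distance from p to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = p, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # Degenerate chord (e.g. a closed ring): fall back to point distance.
        return math.dist(p, start)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def simplify(points, epsilon):
    """Drop vertices that lie within epsilon of the simplified line."""
    if len(points) < 3:
        return list(points)
    distances = [
        perpendicular_distance(p, points[0], points[-1]) for p in points[1:-1]
    ]
    # Index (into points) of the vertex farthest from the chord.
    index = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[index - 1] <= epsilon:
        return [points[0], points[-1]]
    left = simplify(points[: index + 1], epsilon)
    right = simplify(points[index:], epsilon)
    return left[:-1] + right  # the split vertex appears in both halves


flat = simplify([(0, 0), (1, 0.05), (2, 0)], 0.2)
peak = simplify([(0, 0), (1, 1), (2, 0)], 0.2)
```

A nearly collinear middle vertex is dropped, while a vertex farther than `epsilon` from the chord is kept.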

Page 35: Representing  Contextual Aspects  of  Data

Slide 35 of x

Optimizations

Region Name | NUTS Points | GADM Points | Hausdorff Distance (Original) | Time [ms] (Original) | Hausdorff Distance (0.2 Simplif.) | Time [ms] (0.2 Simplif.)
Finland | 389 | 107783 | 1.3996 | 30353 | 1.3483 | 2504
Croatia | 175 | 193180 | 1.1374 | 7830 | 1.1366 | 1108
Schleswig-Holstein | 118 | 28001 | 0.7281 | 1870 | 0.7257 | 296
Iceland | 320 | 7610 | 0.4163 | 567 | 0.4613 | 66
Karlsruhe | 47 | 1021 | 0.1062 | 35 | 0.1906 | 13
Seine-Saint-Denis | 6 | 30 | 0.0812 | 1 | 0.0716 | 2

Page 36: Representing  Contextual Aspects  of  Data

Slide 36 of x

Spatial Databases

The algorithm also works well with spatial databases (e.g. PostgreSQL / PostGIS):

SELECT g.gadm_id, n.nuts_id
FROM nuts n
INNER JOIN gadm g ON (n.geometry && g.geometry)
WHERE n.shape_area BETWEEN (g.shape_area * 0.9) AND (g.shape_area * 1.1)
  AND ST_HausdorffDistance(
        ST_SimplifyPreserveTopology(n.geometry, 0.5),
        ST_SimplifyPreserveTopology(g.geometry, 0.5)
      ) < g.max_hausdorff_dist;
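The two cheap filters in the query (the bounding-box overlap expressed by `&&` and the area window of ±10%) can be mirrored outside the database to prune candidate pairs before the Hausdorff distance is computed. This is a sketch with hypothetical function names, operating on plain coordinate lists.

```python
def bbox(ring):
    """Bounding box of a ring as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_overlap(a, b):
    """True if two bounding boxes intersect (like the && join condition)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2


def areas_compatible(nuts_area, gadm_area, tolerance=0.1):
    """True if the NUTS area lies within +/- tolerance of the GADM area."""
    return gadm_area * (1 - tolerance) <= nuts_area <= gadm_area * (1 + tolerance)


def is_candidate(nuts_ring, nuts_area, gadm_ring, gadm_area):
    """Apply both filters; only candidates need the costly distance check."""
    return bboxes_overlap(bbox(nuts_ring), bbox(gadm_ring)) and areas_compatible(
        nuts_area, gadm_area
    )
```

Both checks are constant-time once the bounding boxes and areas are known, so they remove most pairs at negligible cost.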

Page 37: Representing  Contextual Aspects  of  Data

Slide 37 of x

Evaluation

Not every NUTS region matches a GADM region.
• Many NUTS regions represent parts or aggregations of GADM administrative boundaries.

1,671 NUTS regions => 965 matches & 13 false positives.

NUTS UKF2: Leicestershire, Rutland and Northamptonshire

GADM 2_13988: Leicestershire

Page 38: Representing  Contextual Aspects  of  Data

Slide 38 of x

Evaluation

NUTS Region | NUTS Area | Incorrect GADM guess | Hausdorff Distance
UKM34 | 0.0214 | East Renfrewshire | 0.1862
FR106 | 0.0334 | Val-De-Marne | 0.1644
BE321 | 0.0654 | Soignies | 0.3521
BE353 | 0.1188 | Thuin | 0.2834
CH061 | 0.1672 | Aargau | 0.3653
LT | 9.5204 | Latvija | 2.5098
LI | 0.0205 | Appenzell Innerrhoden | 0.2783
UKM28 | 0.0689 | North Lanarkshire | 0.3478
BE331 | 0.1013 | Liège | 0.335
BE353 | 0.1188 | Thuin | 0.2834
CH061 | 0.1672 | Aargau | 0.3653
SE3 | 60.585 | Norge | 7.8658
BE321 | 0.0654 | Soignies | 0.3521

Page 39: Representing  Contextual Aspects  of  Data

Slide 39 of x

Currently available resources

• NeoGeo vocabulary and best practices for publishing geodata as Linked Data
• NUTS and GADM datasets online
• Integration vocabulary online, including mappings
• GADM mappings to DBpedia
• Linked Data services for accessing/querying spatial indices (withinRegion, boundingBox)
• Work on similarity metrics (with optimisations and evaluation) for geospatial regions

Page 40: Representing  Contextual Aspects  of  Data

Slide 40 of x

Future Work

• Finalisation of NeoGeo vocabulary
• Improvement of precision of spatial similarity; publish service online
• More earth and space science data
• Tools to support the mapping process
• More instance mappings to GADM
• Possibly map to sensor descriptions
• More experiments: querying of integrated data
• Include reasoning
• Temporal context

Page 41: Representing  Contextual Aspects  of  Data

Slide 41 of x

Conclusion

• GeoVocab.org published vocabulary and vocabulary mappings
• NUTS and GADM use the vocabulary and are instance-mapped to several other well-known datasets
• Several services online
• Using an optimised algorithm for the detection of spatially co-located features across multiple RDF datasets
• More work to be done, including coordination with other efforts