Fusing OpenStreetMap with Wikipedia

Post on 29-Aug-2014

Category:

Data & Analytics

DESCRIPTION

Ulmon's recipe for a travel guide is to fuse multiple open data sources that you might otherwise consult individually when planning a vacation, and to present them as one coherent package. We try to fuse this data so that the resulting whole is more valuable than the sum of its parts. Our main sources of map data and of knowledge about places are OpenStreetMap and Wikipedia, respectively. This talk covers the challenges of connecting the two and our strategies for coping with them.

TRANSCRIPT

Fusing OpenStreetMap with Wikipedia
Ulmon GmbH

08/05/2014 Linuxwochen Wien

Hello from

Ulmon’s recipe for a travel guide
Fuse sources of data to create a whole more valuable than its parts

Wikipedia and OSM in CityMaps2Go

What about unmatchable Wiki entries?

Wikipedia tag in OpenStreetMap

Wikipedia tag statistics
http://taginfo.openstreetmap.org

Tag name        Number of values
wikipedia       339,148
wikipedia:ru    30,457
wikipedia:en    16,432
wikipedia:de    13,923
wikipedia:es    4,706
Sum             404,666

Total Wikipedia entries with location: 1,621,704 in 15 languages (798,965 English)
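
As a quick check of the numbers on this slide, the short sketch below adds up the listed tag counts and compares the sum with the number of geotagged Wikipedia entries. The ratio is only a rough indication: several OSM objects can point at the same article, and only the tag keys listed on the slide are counted. The variable names are illustrative.

```python
# Tag counts as listed on the slide (taginfo figures at the time of the talk).
tag_counts = {
    "wikipedia": 339_148,
    "wikipedia:ru": 30_457,
    "wikipedia:en": 16_432,
    "wikipedia:de": 13_923,
    "wikipedia:es": 4_706,
}

# Wikipedia entries with a location, across 15 languages (from the slide).
geotagged_articles = 1_621_704

tagged = sum(tag_counts.values())
share = tagged / geotagged_articles

print(f"tagged OSM objects: {tagged:,}")
print(f"share of geotagged entries (rough upper bound): {share:.1%}")
```

Under these assumptions, explicit tags cover at most about a quarter of the geotagged entries, which is why matching beyond the existing tags is needed.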

The Confusion of Tongues

Multiple OSM candidates for one Wiki

Multiple fitting Wiki entries

Wiki articles with no OSM object

What data to include?

… for an offline guide

178MB!

Ulmon’s matching algorithm

• …
• Stephansdom
• Ströck
• Stephansplatz
• Stephansplatz (U3 station)
• Stock-im-Eisen-Platz
• Café Weinwurm
• DO&CO am Stephansplatz
• Haas-Haus
• Aida
• …

Distance: 0.9
Name: 1.0
Type: 0.0
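
The slides do not say how the distance score is computed. As a minimal sketch, one could measure the great-circle distance between the Wikipedia coordinate and the OSM object with the haversine formula and map it onto a score between 0 and 1. The linear falloff and the 500 m cutoff below are assumptions made for illustration, not Ulmon's actual parameters.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def distance_score(meters, max_meters=500.0):
    """Hypothetical score: 1.0 at zero distance, 0.0 at or beyond max_meters."""
    return max(0.0, 1.0 - meters / max_meters)
```

With these assumed parameters, a candidate 50 m away from the article's coordinate would score 0.9.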

Comparing Names

• Edit distance (Levenshtein distance)
• Soundex
• Dice coefficient
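
The three measures listed above can be sketched in a few lines each. This is a textbook version of each measure, not Ulmon's code: the normalisation in `edit_similarity` and the use of character bigrams for the Dice coefficient are assumptions, and the Soundex variant is the common American one, which simply drops non-ASCII letters such as "ö".

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop ca
                               current[j - 1] + 1,       # add cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalisation: 1.0 for equal strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


_CODES = {c: d for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                                  "4": "l", "5": "mn", "6": "r"}.items()
          for c in letters}


def soundex(name):
    """American Soundex code (letter plus three digits); '' if no ASCII letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != last:
            result += code
        last = code  # vowels reset the run, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def dice_coefficient(a, b):
    """Sorensen-Dice coefficient over character bigrams, case-insensitive."""
    def bigrams(s):
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2.0 * sum((x & y).values()) / total
```

Edit distance catches typos, Soundex catches spelling variants that sound alike in English, and the Dice coefficient tolerates reordered or partially matching names, so the three complement each other.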

Type score

• Compare OSM tags with DBpedia types
  – Manual rules
  – Word similarity
  – Future: synonym analysis based on WordNet
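
A minimal sketch of how manual rules and word similarity could be combined into one type score. The rule table, the DBpedia type names in it, and the Jaccard word overlap are illustrative assumptions; the slides do not describe the actual rules or similarity measure.

```python
import re

# Hypothetical manual rules: (OSM key, OSM value) -> compatible DBpedia types.
MANUAL_RULES = {
    ("amenity", "place_of_worship"): {"ReligiousBuilding", "Church"},
    ("tourism", "museum"): {"Museum"},
    ("railway", "station"): {"Station"},
}


def words(text):
    """Split snake_case and CamelCase identifiers into lowercase words."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text).replace("_", " ")
    return set(spaced.lower().split())


def type_score(osm_tags, dbpedia_types):
    """1.0 if a manual rule fires, otherwise the best word overlap (Jaccard)."""
    types = set(dbpedia_types)
    best = 0.0
    for key, value in osm_tags.items():
        if MANUAL_RULES.get((key, value), set()) & types:
            return 1.0
        tag_words = words(value)
        for type_name in types:
            type_words = words(type_name)
            union = tag_words | type_words
            if union:
                best = max(best, len(tag_words & type_words) / len(union))
    return best
```

For example, under these assumed rules `type_score({"amenity": "cafe"}, ["Church"])` is 0.0, matching the kind of "Type: 0.0" result a café would get for a cathedral article.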

Decision tree

• Generated using the J48 algorithm of the Weka toolkit
• How to get learning data?
  – Manual creation
  – Parsing wikipedia tags from OSM
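
Existing wikipedia tags can serve as known-good pairs for training. The sketch below extracts (language, title) pairs from an OSM object's tags, handling both common forms: `wikipedia=de:Stephansdom` and `wikipedia:de=Stephansdom`. The function name is illustrative, and skipping URL-style or prefix-less values is an assumption of this sketch.

```python
def wikipedia_links(tags):
    """Return (language, title) pairs found in an OSM object's tags."""
    links = []
    for key, value in tags.items():
        if "://" in value:
            continue  # full URLs are not parsed in this sketch
        if key == "wikipedia":
            # Form: wikipedia=<lang>:<title>; split on the first colon only,
            # because article titles may themselves contain colons.
            lang, sep, title = value.partition(":")
            if sep and lang and title:
                links.append((lang, title.strip()))
        elif key.startswith("wikipedia:"):
            # Form: wikipedia:<lang>=<title>
            lang = key[len("wikipedia:"):]
            if lang and value:
                links.append((lang, value.strip()))
    return links
```

Each extracted pair gives a positive example (the tagged OSM object matches the article); other nearby OSM objects could then be used as negative examples for the same article.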

Ulmon’s matching performance

• Current
  – Total wiki entries: 810K (674K English)
  – Matched entries: 429K
• Future
  – Total wiki entries: 1.6M
  – Matched entries (extrapolation): 850K
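
The extrapolated figure is consistent with simply carrying the current match rate over to the larger data set. The slides do not state the method, so the proportional scaling below is an assumption.

```python
# Figures from the slide, in entries.
current_total = 810_000
current_matched = 429_000
future_total = 1_600_000

# Assumption: the match rate stays the same when more languages are added.
match_rate = current_matched / current_total
projected_matches = match_rate * future_total

print(f"current match rate: {match_rate:.1%}")
print(f"projected matches:  {projected_matches:,.0f}")
```

This gives a match rate of about 53% and roughly 847K projected matches, in line with the 850K quoted on the slide.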

Multiple OSM candidates for one Wiki

Multiple fitting Wiki entries

Open questions

• Reduce false positives
  – Current: 10%, desired: < 3%
• Get more matches!
• Reduce the amount of data

Thank you for your attention!
Come visit us at www.ulmon.com
