Fusing OpenStreetMap with Wikipedia
DESCRIPTION
Ulmon's recipe for a travel guide is to fuse multiple open data sources that you might otherwise consult individually when planning a vacation, and to present them as one coherent package. We aim to fuse this data so that the resulting whole is more valuable than the sum of its parts. Our main sources of map data and of knowledge about places are OpenStreetMap and Wikipedia, respectively. This talk is about the challenges of connecting the two and our strategies for coping with them.
TRANSCRIPT
Fusing OpenStreetMap with Wikipedia
Ulmon GmbH
08/05/2014 Linuxwochen Wien
Hello from
Ulmon’s recipe for a travel guide
Fuse sources of data to create a whole more valuable than its parts
Wikipedia and OSM in CityMaps2Go
What about unmatchable WIKI?
Wikipedia tag in OpenStreetMap
Wikipedia tag statistics
http://taginfo.openstreetmap.org
Tag name        Number of values
wikipedia       339,148
wikipedia:ru    30,457
wikipedia:en    16,432
wikipedia:de    13,923
wikipedia:es    4,706
Sum             404,666
Total Wikipedia entries with location: 1,621,704 in 15 languages (798,965 English)
The Confusion of Tongues
Multiple OSM candidates for one Wiki
Multiple fitting Wiki entries
Wiki articles with no OSM object
What data to include?
… for an offline guide
178MB!
Ulmon’s matching algorithm
Candidates: …, Stephansdom, Ströck, Stephansplatz, Stephansplatz (U3 station), Stock-im-Eisen-Platz, Café Weinwurm, DO&CO am Stephansplatz, Haas-Haus, Aida, …
Distance: 0.9
Name: 1.0
Type: 0.0
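The slide scores a candidate on distance, name and type, each apparently on a scale from 0 to 1. How the distance score is derived is not stated. A minimal sketch, assuming the score falls off linearly with the great-circle distance between the Wikipedia coordinate and the OSM object; the function names and the `max_radius_m` cutoff of 500 m are hypothetical and not taken from the talk:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def distance_score(meters, max_radius_m=500.0):
    """Assumed scoring: 1.0 at zero distance, 0.0 at or beyond the cutoff."""
    return max(0.0, 1.0 - meters / max_radius_m)
```

Under these assumptions, a candidate 50 m away from the article's coordinate would receive a distance score of 0.9.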
Comparing Names
• Edit distance (Levenshtein distance)
• Soundex
• Dice coefficient
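The three measures named above can be sketched as follows. These are textbook versions, not Ulmon's code: the normalisation of the edit distance to a 0..1 similarity and the use of character bigrams for the Dice coefficient are assumptions. Soundex here is the American variant, which only considers ASCII letters, so a name such as "Ströck" loses its umlaut.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Assumed normalisation: 1.0 for identical strings, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets the previous code to ""
    return (result + "000")[:4]

def dice_coefficient(a, b):
    """Sørensen-Dice coefficient on sets of lower-cased character bigrams."""
    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))
```

The measures complement each other: edit distance catches typos, Soundex catches spellings that sound alike, and the bigram Dice coefficient is more tolerant of reordered or extra words.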
Type score
• Compare OSM tags with DBpedia types
– Manual rules
– Word similarity
– Future: synonym analysis based on WordNet
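One way the combination of manual rules and word similarity could look is sketched below. The rule table, the list of classifying keys and the use of `difflib.SequenceMatcher` as the word-similarity measure are all illustrative assumptions; the talk does not show the actual rules or measure.

```python
from difflib import SequenceMatcher

# Hypothetical manual rules: (OSM key, OSM value) -> acceptable DBpedia types.
MANUAL_RULES = {
    ("amenity", "place_of_worship"): {"ReligiousBuilding", "Church"},
    ("railway", "station"): {"Station", "RailwayStation"},
    ("amenity", "cafe"): {"Restaurant"},
}

# Only tags that classify an object are compared, not free text like "name".
CLASS_KEYS = ("amenity", "tourism", "railway", "historic", "shop", "building")

def type_score(osm_tags, dbpedia_types):
    """1.0 if a manual rule fires, else the best word similarity found."""
    best = 0.0
    for key in CLASS_KEYS:
        value = osm_tags.get(key)
        if value is None:
            continue
        allowed = MANUAL_RULES.get((key, value), set())
        if allowed & set(dbpedia_types):
            return 1.0
        word = value.lower().replace("_", "")
        for dbp_type in dbpedia_types:
            sim = SequenceMatcher(None, word, dbp_type.lower()).ratio()
            best = max(best, sim)
    return best
```

For example, `type_score({"amenity": "place_of_worship", "name": "Stephansdom"}, {"Church", "Building"})` returns 1.0 through the manual rule, while an object with no classifying tag scores 0.0.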
Decision tree
• Generated using the J48 algorithm of the Weka toolkit
• How to get learning data?
– Manual creation
– Parsing wikipedia tags from OSM
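Parsing wikipedia tags from OSM could be sketched as below. OSM objects carry such links either as `wikipedia=<lang>:<title>` or as `wikipedia:<lang>=<title>`; the function names, and the choice to skip values that are full URLs, are assumptions made for this illustration.

```python
def wikipedia_links(tags):
    """Return the set of (language, article title) pairs an OSM object links to."""
    links = set()
    for key, value in tags.items():
        value = value.strip()
        if not value or value.startswith(("http://", "https://")):
            continue  # assumption: URL-style values are ignored in this sketch
        if key == "wikipedia":
            lang, sep, title = value.partition(":")
            if sep and lang and title:
                links.add((lang, title.replace("_", " ").strip()))
        elif key.startswith("wikipedia:"):
            lang = key.split(":", 1)[1]
            if lang:
                links.add((lang, value.replace("_", " ")))
    return links

def label_candidate(osm_tags, wiki_lang, wiki_title):
    """True if the OSM object's own tags point at this article."""
    return (wiki_lang, wiki_title) in wikipedia_links(osm_tags)
```

Tagged objects give positive examples directly. Negative examples would have to come from elsewhere, for instance from pairing a tagged object with other nearby articles, which is again an assumption rather than something the talk describes.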
Ulmon’s matching performance
• Current
– Total wiki entries: 810K (674K English)
– Matched entries: 429K
• Future
– Total wiki entries: 1.6M
– Matched entries (extrapolation): 850K
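The extrapolated figure is consistent with assuming that the current match rate carries over to the larger data set. That assumption is ours; the slide does not state how the 850K was obtained. A quick check using the slide's rounded numbers:

```python
current_total = 810_000     # wiki entries today
current_matched = 429_000   # matched entries today
future_total = 1_600_000    # wiki entries planned

# Assumption: the share of matched entries stays the same.
match_rate = current_matched / current_total
future_matched = match_rate * future_total

print(f"match rate: {match_rate:.1%}")
print(f"extrapolated matches: {future_matched:,.0f}")
```

This gives a match rate of roughly 53% and just under 850K extrapolated matches, in line with the figure on the slide.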
Multiple OSM candidates for one Wiki
Multiple fitting Wiki entries
Open questions
• Reduce false positives
– Current: 10%, desired: < 3%
• Get more matches!
• Reduce the amount of data
Thank you for your attention!
Come visit us at www.ulmon.com