OpenStreetMap Geocoder Based on Solr

Post on 11-May-2015

DESCRIPTION

Presented by Ishan Chattopadhyaya, LucidWorks. This talk covers the technical aspects of a new OpenStreetMap geocoder based on Apache Solr and Lucene. Recent releases of Apache Lucene and Apache Solr (4.0 onwards) have markedly improved their spatial search capabilities. Improved support for distributed storage and search, via SolrCloud mode, also makes applications built on Solr easier to scale. OpenStreetMap's current geocoder, Nominatim, is based on PostgreSQL/PostGIS. Benefits of using Solr rather than a database system like Postgres for building a geocoder include robust partial text search, analysis in various languages (stemming, tokenization, stop words, etc.), spell check, faceting, and highlighting. Through this presentation, the author hopes to build an appreciation for a Solr-based geocoder.

TRANSCRIPT

Ishan Chattopadhyaya

LucidWorks

OpenStreetMap Foundation

Twitter: @ichattopadhyaya, OSM: chatman

What is OpenStreetMap?

● Wikipedia of GeoData

● OpenStreetMap is a project aimed squarely at creating and providing free geographic data such as street maps to anyone who wants them.

State of OSM

● Commercial competitors

– Google Maps

– Bing Maps

● http://tools.geofabrik.de/mc/

The OpenStreetMap Software Stack

What is a Geocoder?

● Input: raw query

● Output: geocoordinates

Nominatim

● http://nominatim.openstreetmap.org/

Goals for the new Geocoder

● Search for:

– Cities and towns

– Streets

– Address points

– Places of Interest, Businesses, Amenities, Attractions etc.

● Reverse geocoding

● Support for fuzzy queries

Good changes in Lucene/Solr 4.x

● Support for indexing polygons

– RecursivePrefixTree indexing

● Special spatial search predicates

– Contains

– IsWithin

– Intersects

– Etc.

● Reference: David Smiley's Lucene Revolution presentation

● SolrCloud mode for distributed indexing/searching
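As a rough illustration of how these predicates show up in a query, the sketch below builds filter clauses of the form field:"Predicate(shape)", the syntax Solr 4.x accepts for RecursivePrefixTree fields. The field name geo follows the schema slides in this talk; the helper name spatial_filter and the example polygon are assumptions made for illustration, and polygon shapes additionally require JTS to be available to Solr.

```python
# Predicates named on the slide; Solr may support others.
PREDICATES = {"Contains", "IsWithin", "Intersects"}


def spatial_filter(field: str, predicate: str, wkt: str) -> str:
    """Return a Solr filter clause such as geo:"Intersects(POLYGON((...)))"."""
    if predicate not in PREDICATES:
        raise ValueError(f"unsupported predicate: {predicate}")
    return f'{field}:"{predicate}({wkt})"'


# Hypothetical polygon roughly around central Dublin (WKT uses "lon lat" order).
dublin_box = (
    "POLYGON((-6.30 53.32, -6.20 53.32, -6.20 53.36, "
    "-6.30 53.36, -6.30 53.32))"
)

fq = spatial_filter("geo", "IsWithin", dublin_box)
print(fq)
```

The resulting string would be passed as an fq parameter, so that only documents whose indexed shape satisfies the predicate are returned.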

Architecture

[Diagram: Planet dumps, Indexer, Solr, API Layer, www.geocoder.in]

Indexing: OSM Data format

● Node

– “A node defines a single geospatial point using a latitude and longitude.”

● Way

– “A way is an ordered list of between 2 and 2,000 nodes. Ways are used to represent linear features (vectors), such as rivers or roads.”

● Relation

– “A Relation is an all-purpose data structure that documents a relationship between two or more other objects.”
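The three element types can be pictured as plain data structures. The following is a minimal sketch, not the indexer's actual model: the class layout, the tags dictionaries, and the example ids are assumptions, while the 2 to 2,000 node limit comes from the definition quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_ids: List[int]  # ordered references to Node ids
    tags: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The quoted definition limits a way to between 2 and 2,000 nodes.
        if not 2 <= len(self.node_ids) <= 2000:
            raise ValueError("a way must reference between 2 and 2,000 nodes")


@dataclass
class Relation:
    id: int
    members: List[Tuple[str, int, str]]  # (member type, member id, role)
    tags: Dict[str, str] = field(default_factory=dict)


# Hypothetical ids and coordinates, for illustration only.
a = Node(1, 53.3338, -6.2320)
b = Node(2, 53.3345, -6.2301)
road = Way(10, [a.id, b.id], {"highway": "residential", "name": "Lansdowne Road"})
print(road)
```

A street is therefore a Way whose geometry is recovered by resolving its node references, which is the kind of join the indexer has to perform before it can hand a shape to Solr.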

Indexing: Facts and figures

● Number of OSM Nodes in the database = 2071039612

● Number of OSM Ways in the database = 202570637

● Number of OSM Relations in the database = 2217240

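Indexing: Schema

Hierarchy fields: admin2, admin3, admin4, admin5, admin6, admin7, street, st_type

Example values, in order: Ireland, Dublin County, Dublin, Ballsbridge, Lansdowne, Street

Record fields: name, level, geo, popularity

Example record: name = Lansdowne Street, level = s, geo = <shape>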

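Indexing: Schema

Hierarchy fields: admin2, admin3, admin4, admin5, admin6, admin7, street, st_type

Example values, in order: Ireland, Dublin County, Dublin

Record fields: name, level, geo, popularity

Example record: name = Dublin, level = 6, geo = <shape>, popularity = 1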

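Indexing: Schema (POIs)

Hierarchy fields: admin2, admin3, admin4, admin5, admin6, admin7, street, st_type

Example values, in order: Ireland, Dublin County, Dublin, Ballsbridge

Record fields: name, category, geo

Example record: name = Ballsbridge Hotel, category = hotel, geo = <shape>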

Searching

Raw query -> Classifier -> Classifications -> Validator -> Valid classifications -> Geocoder (lookup) -> Structured location + geocodes

Searching: Classification

Query -> Tokenizer -> Shingles -> Bloom Filters -> Classifications

● Query = “hotels near lansdowne rd dublin”

● Shingles: hotels, near, lansdowne, rd, dublin, hotels near, near lansdowne, lansdowne rd, rd dublin, .., hotels near lansdowne rd dublin
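The shingle step can be sketched as follows. This is only an illustration: it assumes a lowercase whitespace tokenizer, whereas the real system presumably relies on a Lucene analyzer, and the function name shingles is made up for the example.

```python
from typing import List


def shingles(query: str) -> List[str]:
    """Return every contiguous word n-gram of the query, shortest first."""
    tokens = query.lower().split()
    out: List[str] = []
    for size in range(1, len(tokens) + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


result = shingles("hotels near lansdowne rd dublin")
for s in result:
    print(s)
print(len(result))  # 15 shingles for a 5-word query (5 + 4 + 3 + 2 + 1)
```

A query of n words yields n(n+1)/2 shingles, so the number of Bloom filter lookups stays small for typical address queries.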

Searching: Classification

● Shingles: hotels, near, lansdowne, rd, dublin, hotels near, near lansdowne, ..

● Bloom filters, one per field: Cat, A2, A4, A5, Streets

– hotels: match in Cat

– dublin: match in A5 and Streets

– lansdowne: match in A5 and Streets
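To make the lookup concrete, here is a toy Bloom filter per field and a classifier that tests each shingle against all of them. It is a sketch under assumptions: the filter size, the SHA-256 based hashing, and the tiny vocabularies are invented for the example, and a real deployment would populate the filters from the indexed data.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterable[int]:
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical per-field vocabularies (admin4 is left empty in this toy example).
VOCAB: Dict[str, List[str]] = {
    "category": ["hotels", "restaurants"],
    "admin2": ["ireland"],
    "admin4": [],
    "admin5": ["dublin", "lansdowne"],
    "street": ["dublin", "lansdowne"],
}

filters: Dict[str, BloomFilter] = {}
for field_name, words in VOCAB.items():
    bf = BloomFilter()
    for word in words:
        bf.add(word)
    filters[field_name] = bf


def classify(shingle_list: List[str]) -> Dict[str, List[str]]:
    """Map each shingle to the fields whose Bloom filter reports a match."""
    found: Dict[str, List[str]] = {}
    for shingle in shingle_list:
        labels = [name for name, bf in filters.items() if shingle in bf]
        if labels:
            found[shingle] = labels
    return found


print(classify(["hotels", "near", "lansdowne", "rd", "dublin"]))
```

Because a Bloom filter can report false positives, its output is only a candidate classification, which fits the pipeline's separate validation step.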

Searching: Classifications

● Query = “hotels near lansdowne rd dublin”

● Classifications:

– hotels = category

– lansdowne = admin5

– lansdowne = street

– dublin = admin5

– dublin = street

● Possible permutations:

– C.5.5

– C.S.5

– C.5.S

– C...5

– C.5..

– etc.
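Each permutation string has one character per query token: C for category, 5 for admin5, S for street, and a dot for a token left unclassified. A small sketch of enumerating them is shown below; letting every token also be unclassified is an assumption, and the function name label_permutations is hypothetical.

```python
from itertools import product
from typing import Dict, List

# Short codes as used on the slide; "." marks an unclassified token.
CODES = {"category": "C", "admin5": "5", "street": "S"}

TOKENS = ["hotels", "near", "lansdowne", "rd", "dublin"]
CLASSES: Dict[str, List[str]] = {
    "hotels": ["category"],
    "lansdowne": ["admin5", "street"],
    "dublin": ["admin5", "street"],
}


def label_permutations(tokens: List[str], classes: Dict[str, List[str]]) -> List[str]:
    """Enumerate every way of assigning one class (or none) to each token."""
    options = []
    for token in tokens:
        codes = [CODES[label] for label in classes.get(token, [])]
        options.append(codes + ["."])  # assumption: any token may stay unclassified
    return ["".join(combo) for combo in product(*options)]


patterns = label_permutations(TOKENS, CLASSES)
print(len(patterns))  # 18 = 2 * 1 * 3 * 1 * 3
print(patterns[0])    # C.5.5
```

The count grows multiplicatively with the number of ambiguous tokens, so discarding permutations that cannot match any document early keeps the number of index queries manageable.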

Searching: Solr Query

● Query = “hotels near lansdowne rd dublin”

● Possible permutations:

– C.5.5: +level:5 +admin5:lansdowne +admin5:dublin

– C.S.5: +level:s +street:lansdowne +admin5:dublin

– C.5.S: +level:s +street:dublin +admin5:lansdowne

– C...5: +level:5 +admin5:dublin

– C.5..: +level:5 +admin5:lansdowne

– etc.
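The mapping from a permutation to a query string can be sketched as below. The rule for the level clause (s when any token is a street, otherwise the highest admin number present) is inferred from the examples on the slide rather than stated by the author, and build_query is a hypothetical helper. Category tokens are ignored here and handled in the separate POI search.

```python
from typing import List


def build_query(tokens: List[str], pattern: str) -> str:
    """Turn a permutation such as "C.S.5" into a Lucene query string."""
    if len(tokens) != len(pattern):
        raise ValueError("pattern must have one character per token")

    streets = [t for t, c in zip(tokens, pattern) if c == "S"]
    admins = [(c, t) for t, c in zip(tokens, pattern) if c.isdigit()]

    if streets:
        level = "s"
    elif admins:
        level = max(c for c, _ in admins)  # finest admin level present
    else:
        raise ValueError("pattern contains no location component")

    clauses = [f"+level:{level}"]
    clauses += [f"+street:{t}" for t in streets]
    clauses += [f"+admin{c}:{t}" for c, t in admins]
    return " ".join(clauses)


tokens = ["hotels", "near", "lansdowne", "rd", "dublin"]
for pattern in ["C.5.5", "C.S.5", "C.5.S", "C...5", "C.5.."]:
    print(pattern, "->", build_query(tokens, pattern))
```

Each generated string can be run against the index, and the permutations that return documents are the ones that survive validation.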

"POINT (-6.232063,53.333833)"

Searching: Searching for POIs

● Query = “hotels near lansdowne rd dublin”

● Query = “hotels near” near "POINT (-6.232063,53.333833)"

● Solr query:

– fl=*,score

– sort=score asc

– q={!geofilt score=distance filter=false sfield=geo pt=53.333833,-6.232063 d=10}

– fq=+category:hotel
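The same request can be assembled programmatically. The sketch below only builds the parameters and a URL; the endpoint, the core name geocoder, and the helper poi_params are assumptions for illustration. With score=distance the score is the distance from pt, so sorting by score ascending returns the nearest hotels first, and d is the search radius in kilometres.

```python
from typing import Dict
from urllib.parse import urlencode


def poi_params(lat: float, lon: float, category: str, radius_km: float = 10.0) -> Dict[str, str]:
    """Build Solr parameters for a distance-sorted POI search around a point."""
    geofilt = (
        f"{{!geofilt score=distance filter=false sfield=geo "
        f"pt={lat},{lon} d={radius_km:g}}}"
    )
    return {
        "fl": "*,score",
        "sort": "score asc",
        "q": geofilt,
        "fq": f"+category:{category}",
    }


params = poi_params(53.333833, -6.232063, "hotel")

# Hypothetical endpoint and core name; adjust to the actual deployment.
url = "http://localhost:8983/solr/geocoder/select?" + urlencode(params)
print(params["q"])
print(url)
```

Note that pt takes latitude first, the reverse of the longitude-first order in the POINT string.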

Challenges: Indexing

● Street Associativity

● Incomplete polygons

Challenges

● Handling Updates

● Data validation

Distributed Search

● Need for distributed search?

● Geographical partitioning

Conclusion

● http://www.geocoder.in/

● Twitter: @ichattopadhyaya
