Retrieving Location-based Data on the Web
Andrei Tabarcea, 14.02.2011
Introduction
- The goal is to find services and points of interest close to the user’s location
- We call this “location-based search”
- We try to find location information in web pages
MOPSI Search
MOPSI Search Results
Locally Managed Database
Users’ Collection
Open Web Searches
Combination of search results
Location Information in Webpages
- Site hosting information (owner address, server address etc.)
- HTML tags (geo-tags, address-tags, vcards for Google Maps etc.)
- Addresses, postal codes, phone numbers
- Well-known places
Main Challenges
- Find location information in webpages
- Find relevant information related to the found location information
Ad-Hoc Georeferencing
• The problem is how to extract and validate location data from semi-structured text
• Postal address is the most common location data found
• Our goal is to give geographical coordinates to services mentioned in web pages
• We call this method ad-hoc georeferencing
<HTML>
<HEAD profile="http://geotags.com/geo">
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css" type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
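The geo-tags in the header above can be read with a few lines of code. The following is a minimal sketch using Python's standard `html.parser` module; the class name `GeoTagParser` and the function `extract_geo_position` are illustrative names, not part of MOPSI.

```python
from html.parser import HTMLParser


class GeoTagParser(HTMLParser):
    """Collects <meta name="geo.*" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.geo = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("geo.") and attr.get("content"):
            self.geo[name] = attr["content"]


def extract_geo_position(html):
    """Return (latitude, longitude) from geo.position, or None if absent or malformed."""
    parser = GeoTagParser()
    parser.feed(html)
    position = parser.geo.get("geo.position")
    if position is None:
        return None
    try:
        lat, lon = (float(part) for part in position.split(";"))
    except ValueError:
        return None
    return lat, lon


page = (
    '<html><head profile="http://geotags.com/geo">'
    '<meta name="geo.position" content="62.35;29.44">'
    '<meta name="geo.region" content="FI">'
    '<meta name="geo.placename" content="Joensuu">'
    '<title>Pages of Pasi Fränti</title></head></html>'
)
print(extract_geo_position(page))  # (62.35, 29.44)
```

Pages without such tags return `None`, which is the case where address detection from the page text is needed instead.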
VS.
Extracting the Information
For each link:
- Extract plain text from the HTML file
- Detect street names by using a gazetteer
- Extract additional service information
- Gather results as a list
For the result list:
- Evaluate relevance
- Arrange by distance
- Purge overlapping results
- Show results
- (Optionally) Save results
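The "arrange by distance" and "purge overlapping results" steps could look roughly like the sketch below. It assumes each result is a dictionary with hypothetical `address`, `lat` and `lon` keys, and it treats two results as overlapping when their normalized addresses are equal; the slides do not specify the actual overlap criterion, and relevance evaluation is left out. The coordinates in the example are made-up illustrative values.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def arrange_results(results, user_lat, user_lon):
    """Sort results by distance to the user, then drop repeated addresses."""
    ranked = sorted(
        results,
        key=lambda r: haversine_km(user_lat, user_lon, r["lat"], r["lon"]),
    )
    seen = set()
    unique = []
    for result in ranked:
        key = result["address"].strip().lower()  # assumed overlap criterion
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


# Illustrative, made-up results and user position.
results = [
    {"title": "B", "address": "Kauppakatu 1", "lat": 62.610, "lon": 29.770},
    {"title": "A", "address": "Koskikatu 17", "lat": 62.601, "lon": 29.760},
    {"title": "A copy", "address": "koskikatu 17 ", "lat": 62.601, "lon": 29.760},
]
arranged = arrange_results(results, 62.600, 29.760)
print([r["title"] for r in arranged])  # ['A', 'B']
```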
Problems
- How to evaluate relevance?
- Mixed keyword meanings
- No relation between keywords and addresses
Mobile Search Engine
Search engine consists of:
• User interface (mobile application, web user interface)
• Core server software
• Geocoded street-name database
[Diagram: the mobile application and the web user interface send keyword, address and coordinates to the core server software and receive search results; the core server software obtains coordinates from the geocoded street-name database]
Core Server software
[Diagram: core server software]
Components: relevant municipalities detector, page parser, georeferencing module, address and description detector, address validator, geocoded database
Data passed between components: keyword, coordinates, municipalities list, <keyword, municipality> query, result links, word list, addresses, <keyword, address, coordinates>, results list, sorted results list
Street-address Detection
• We use a rule-based pattern matching algorithm
• The detection of street names is the starting point of the algorithm
• An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names)
• Address-block candidates are validated using the gazetteer
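As a rough illustration of the steps above, the sketch below builds address-block candidates with a regular expression and validates them against a gazetteer. The pattern, the short suffix list and the one-entry `GAZETTEER` set are assumptions made for this example only; the actual rules and gazetteer are not shown in the slides, and telephone numbers are not handled here.

```python
import re

# Hypothetical rule: a street name with a common Finnish suffix, a house number,
# a five-digit postal code and a municipality name.
ADDRESS_RE = re.compile(
    r"(?P<street>\b\w+(?:katu|tie|kuja|polku))\s+"
    r"(?P<number>\d+)\s*,?\s*"
    r"(?P<postal>\d{5})\s+"
    r"(?P<municipality>[A-ZÅÄÖ][\wÅÄÖåäö-]+)"
)

# Hypothetical stand-in for the gazetteer: known (street, municipality) pairs.
GAZETTEER = {("koskikatu", "joensuu")}


def detect_addresses(text, gazetteer):
    """Return address-block candidates whose street and municipality are in the gazetteer."""
    validated = []
    for match in ADDRESS_RE.finditer(text):
        candidate = match.groupdict()
        key = (candidate["street"].lower(), candidate["municipality"].lower())
        if key in gazetteer:  # validation step
            validated.append(candidate)
    return validated


text = (
    "Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: Koskikatu 17 80100 JOENSUU "
    "Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246"
)
for address in detect_addresses(text, GAZETTEER):
    print(address["street"], address["number"], address["postal"], address["municipality"])
```

Both the visiting address and the postal address in the sample text match the pattern, so the same block is reported twice; a later step would have to merge such repeats.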
Title Detection
- Title detection (or company detection) is a Named Entity Recognition problem
- We designed a 2-step system to detect titles associated with addresses:
- Step 1: Fast dictionary match
- Step 2: Use a classifier to detect the title
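The two steps can be sketched as a dictionary lookup with a fallback to a scoring function. In this sketch `KNOWN_TITLES` is a hypothetical dictionary, and `score_candidate` is a hand-written heuristic that only stands in for the trained classifier, whose features and model are not described in the slides.

```python
KNOWN_TITLES = {"joen pizza special"}  # hypothetical dictionary of known names


def dictionary_match(candidates, dictionary):
    """Step 1: return the first candidate found in the dictionary, if any."""
    for candidate in candidates:
        if candidate.lower() in dictionary:
            return candidate
    return None


def score_candidate(candidate):
    """Stand-in for a trained classifier: favours short, capitalised phrases."""
    words = candidate.split()
    if not words:
        return 0.0
    capitalised = sum(word[0].isupper() for word in words) / len(words)
    brevity = 1.0 / len(words)
    return 0.7 * capitalised + 0.3 * brevity


def detect_title(candidates, dictionary, threshold=0.5):
    """Try the fast dictionary match first, then fall back to the scorer."""
    match = dictionary_match(candidates, dictionary)
    if match is not None:
        return match
    best = max(candidates, key=score_candidate, default=None)
    if best is not None and score_candidate(best) >= threshold:
        return best
    return None  # no title detected


print(detect_title(["Y-tunnus", "Joen Pizza Special"], KNOWN_TITLES))
```

Running the cheap lookup first means the slower step is only needed for names that are not yet in the dictionary.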
Title Extractor
Usually, the text before the address holds relevant information
Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: Koskikatu 17 80100 JOENSUU Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala: Kahvila-ravintolat
address
words before the address
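A minimal sketch of collecting the words before the address as title candidates is given below. The function name `candidate_words`, the window size, and the filter that drops field labels ending in a colon and tokens containing digits are all assumptions for illustration, not the extractor's actual rules.

```python
def candidate_words(text, address, window=8):
    """Return up to `window` words preceding the address, minus labels and numbers."""
    index = text.find(address)
    if index == -1:
        return []
    preceding = text[:index].split()[-window:]
    return [
        word
        for word in preceding
        if not word.endswith(":") and not any(ch.isdigit() for ch in word)
    ]


sample = (
    "Joen Pizza Special Y-tunnus: 2129577-6 Käyntiosoite: Koskikatu 17 80100 JOENSUU "
    "Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246"
)
print(" ".join(candidate_words(sample, "Koskikatu 17 80100 JOENSUU")))  # Joen Pizza Special
```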
The Problem
- Results for keyword “kahvila” (Finnish for “café”), address: “Freesenkatu 1, Helsinki”
No title
System Architecture
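[Diagram: title detection system architecture]
Components: dataset collection, HTML parser, dictionary matching, title extractor, classifier, evaluator
Data passed between components: HTML pages, parsed HTML, match / no match, title candidate, tagged and hand-checked data, training data, evaluation data, statistics, TITLE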
Parsing HTML pages
- Current solution extracts text from HTML pages
- We don’t exploit the advantage that we extract data from web pages
- Proposed future solution:
  - Visual segmentation of web pages
  - Detection of the address block
  - Nearest-neighbor search considering text and visual characteristics
Joen Pizza Special Y-tunnus 2129577-6 Käyntiosoite Koskikatu 17 80100 JOENSUU Postiosoite Koskikatu 17 80100 JOENSUU Puhelin: 013-220246 Virallinen toimiala Kahvila-ravintolat
Questions
Thank you