IBM - Dublin Research Lab
Part III
Searching City Data
Veli Bicer
IBM Research
Overview of Components in City Search
[Figure: architecture of a city search system. Labels recoverable from the diagram:]
• Data types: Stream Data, Structured Data, Textual Data, Geospatial Data, Other
• Example sources: Sensors, Transportation, Weather, GPS, …; Municipality, Public Services, …; Social, Web, Businesses, Events, …; Road Network, POI, …
• Geographic Crawling
• Data Access and Management Layer: Semantic Virtual Views on Data
• Indexing: Search Indexes
– Multiple indexes needed to support different lookup patterns, e.g. text, structured, real-time and spatial indexes
• Query Processing
– Understanding the query characteristics via classification, rewriting, expansion and semantic translation
• Retrieval and Ranking
– Multiple aspects of relevance for various ranking signals and aggregation
• Model Learning: Search Logs; Queries, Clickthrough Data
• Contextual Access for Users and Applications: Search Interface (Traditional), Search Interface (Map), Search Interface (Mobile)
• Legend: Part II, Part III
• Related tutorials: [Russell-Rose, SIGIR 2013], [Liu, SIGIR 2008], [Zhang & Jin, SIGIR 2008], [Mika & Tran, SemTech 2013]
Outline
• Indexing City Data
– Real-time Indexing
– Spatial Indexing
– Structure Indexing
• Query Processing
– Query Classification/Rewriting
– Query Context
– Query Semantics
• Retrieval Models
– Multiple Aspects
– Content/Structure based Models
– Distance
– Local Popularity
– Context
– Aggregation
• Future Directions
Indexing City Data
• City data comes in different forms and shapes
– Standard indexing techniques (e.g. Lucene) are useful for handling textual content
– Multiple indexes are needed to support different lookup patterns
• Challenges (a poor fit for hypertext indexing):
– Streaming data has a short temporal span
– Spatial Indexing
– Structured Data
Real-time Indexing
• Partial Indexing
– Indexing only the records with high chance of being queried
• One example: Tweet Index
– Index tweets based on their content and their rankings w.r.t. existing queries
• Indexing process
– determines whether to index or not
– classifies the tweet as distinguished or noisy via ranking
– keeps some statistics in memory
• Ranking function
– based on user’s PageRank
– popularity of topics
[Chen et al, SIGMOD’11]
[Figures: Overall Architecture; Data Flow]
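The partial-indexing decision on this slide can be sketched as follows. This is a minimal illustration, not the TI implementation from [Chen et al, SIGMOD’11]: the `Tweet` fields, the product-based `score` function and the `threshold` value are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    tid: int
    user_pagerank: float     # assumed to be precomputed per user
    topic_popularity: float  # assumed popularity of the tweet's topic, in [0, 1]
    text: str

def score(tweet: Tweet) -> float:
    # Hypothetical ranking function combining the two signals named on the slide.
    return tweet.user_pagerank * tweet.topic_popularity

def route(tweets, threshold=0.1):
    """Split tweets into those indexed immediately and those only written to the log."""
    indexed, logged = [], []
    for t in tweets:
        (indexed if score(t) >= threshold else logged).append(t.tid)
    return indexed, logged

tweets = [
    Tweet(1, 0.9, 0.5, "traffic jam on the quays"),
    Tweet(2, 0.1, 0.2, "good morning"),
]
indexed, logged = route(tweets)
```

Tweets routed to `logged` are the "noisy" ones in the slide's terms: they stay retrievable through the log file but do not occupy index space.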
Real-time Indexing
• The index keeps the records for the same keyword in a list, with the latest record inserted at the head
– Sorted by their timestamps
– Stores TID, U-PageRank, TF, Tree ID
• A separate Tweet Table
– Stores a tree encoding to determine re-tweets
– Pointer to log file if tweet is not indexed
[Figure: Index Structure and Tweet Table. Tweet Table fields: ID of the replied tweet; # of tweets that reply to this tweet; offset in the log file (for unindexed tweets)]
[Chen et al, SIGMOD’11]
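A rough sketch of the index structure described above, assuming tweets arrive in timestamp order so that inserting at the head keeps each list sorted by recency. The function names and the in-memory `deque` representation are illustrative assumptions; only the stored fields (TID, U-PageRank, TF, Tree ID) come from the slide.

```python
from collections import defaultdict, deque

# keyword -> postings, newest first; each posting is (tid, user_pagerank, tf, tree_id)
index = defaultdict(deque)

def insert(tid, user_pagerank, terms, tree_id):
    """Insert one posting per distinct term at the head of that term's list."""
    for term in set(terms):
        tf = terms.count(term)
        index[term].appendleft((tid, user_pagerank, tf, tree_id))

def latest(term, k):
    """Return the k most recent postings for a term; no sorting is needed."""
    return list(index[term])[:k]

insert(1, 0.3, ["dublin", "bus", "dublin"], tree_id=1)
insert(2, 0.7, ["dublin", "rain"], tree_id=2)
```

Because the newest posting is always at the head, a query for the most recent tweets only has to read a prefix of the list.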
Spatial Indexing
• Needed to handle geographic queries
– Q=“Chinese restaurant” with a location
• A hybrid index (IR-Tree) combining an R-tree for spatial DBs with an inverted index for text
– Text relevance: language model (LM)
– R-Tree: groups nearby objects and represents them by their minimum bounding rectangle (MBR) in the next higher level of the tree
[Cong et al, VLDB’09]
[Figure: Term-Doc Matrix; Bounding Boxes]
Spatial Indexing
• Leaf nodes contain entries (object, rectangle) and a pointer to the II (inverted index) for those entries
• Non-leaf nodes contain the minimum bounding box of their child nodes and a pseudo document representing all documents in the sub-nodes
[Cong et al, VLDB’09]
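The two ingredients of a node in such a hybrid index can be sketched as below: a bounding rectangle used for spatial pruning and a pseudo document used for textual pruning. This is an illustrative simplification, not the IR-tree of [Cong et al, VLDB’09]: the pseudo document is assumed to be a plain set of terms, and `may_contain` assumes conjunctive keyword matching rather than language-model scoring.

```python
import math

def mbr(points):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of 2-D points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def min_dist(q, rect):
    """Smallest Euclidean distance from query point q to any point inside rect."""
    dx = max(rect[0] - q[0], 0.0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0.0, q[1] - rect[3])
    return math.hypot(dx, dy)

def may_contain(terms, pseudo_doc):
    """Assumed conjunctive check: a subtree is skipped if a query term is absent."""
    return all(t in pseudo_doc for t in terms)

leaf = mbr([(1, 1), (2, 3), (4, 2)])
pseudo_doc = {"chinese", "restaurant", "noodles"}  # hypothetical union of child terms
```

`min_dist` gives a lower bound on the distance to every object below a node, which is what lets a top-k search skip whole subtrees.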
RDF data
• Consists of triples <s,p,o>
• Triples form a graph, where vertices denote resources and their values, connected by directed labelled edges representing properties (i.e., relations and attributes)
• URIs are used as labels of edges and vertices representing resources
• Schema information is often incomplete, especially for Web data and RDFa data
• RDF data might be schema-less, semi-structured data
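As a small illustration of the triple model above, the sketch below stores <s,p,o> triples in plain Python structures and answers simple triple patterns. The `ex:` identifiers are hypothetical and no RDF library is used; this only shows the graph reading of triples.

```python
from collections import defaultdict

# Hypothetical triples; the ex: identifiers are illustrative, not from a real dataset.
triples = [
    ("ex:p1", "ex:worksAt", "ex:org1"),
    ("ex:p1", "ex:authorOf", "ex:paper1"),
    ("ex:p2", "ex:authorOf", "ex:paper1"),
    ("ex:org1", "ex:name", "KIT"),
]

out_edges = defaultdict(list)  # subject -> [(property, object)]
in_edges = defaultdict(list)   # object  -> [(property, subject)]
for s, p, o in triples:
    out_edges[s].append((p, o))
    in_edges[o].append((p, s))

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]
```

No schema is declared anywhere: the set of properties a resource has is only known by looking at its edges, which is the schema-less situation the last bullet describes.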
[Figure: example RDF data graph with vertices 0–9, KIT and MIT, and edges labelled AuthorOf, Supervises, WorksAt and Name]
[Tran et al, TKDE’12]
adopted from http://es.slideshare.net/thanhtran81/query-processing-using-structure-index-for-rdf-data-on-the-web
Structure Index: State-of-the-art
• Data Partitioning
– Vertical partitioning (SW-Store)
• Indexing
– Sextuple indexing (Hexastore)
– Materialization and indexing of entire join paths (GRIN)
• Index Implementation
– B+ tree
– Inverted index (Semplore)
– Index compression (RDF-3X)
• Query processing
– Sorted merge join based on vertical partitioning and indexing (SW-Store)
– Join order optimization based on dynamic programming (RDF-3X)
→ A combination of different concepts makes up the state of the art!
adopted from http://es.slideshare.net/thanhtran81/query-processing-using-structure-index-for-rdf-data-on-the-web
Structure Index for RDF data
• Structure index is a graph
– Is a structural description that is finer-grained than a schema
– Consists of classes (extensions) and relations between them
– Resources in an extension exhibit the same structure, i.e., they cannot be distinguished by their outgoing (forward bisimilarity) and incoming (backward bisimilarity) “edge trees”
– Parameterize bisimulation by two sets of edge labels
[Figure: the example data graph (vertices 0–9, KIT, MIT) and its structure index with extensions B1: 3,7; B2: 0,1; B3: 8,9; B4: 2,4,6; B5: KIT, MIT; B6: 5, connected by AuthorOf, Supervises, WorksAt and Name edges]
[Tran et al, TKDE’12]
adopted from http://es.slideshare.net/thanhtran81/query-processing-using-structure-index-for-rdf-data-on-the-web
Structure-based Partitioning
• Whether a graph vertex instantiates a variable of a query depends on its structure → vertices are physically grouped based on structural similarity
• Apply grouping captured by the structure index to the physical organization
– Create a physical group for every vertex of the structure index (i.e., every extension)
– Triples are in the same group when their subjects belong to the same extension
• Triples of an SP table not only satisfy the property of a triple pattern but also provide some structural guarantee, e.g., they match the entire query structure
[Figure: structure index with extensions B1: 3,7; B2: 0,1; B3: 8,9; B4: 2,4,6; B5: KIT, MIT; B6: 5]
SP B4 table
Sub  Property  Obj
2    AuthorOf  0
4    AuthorOf  0
6    AuthorOf  1
2    WorksAt   8
4    WorksAt   8
6    WorksAt   9
VP AuthorOf table
Sub  Obj
2    0
4    0
6    1
3    0
7    1
[Tran et al, TKDE’12]
adopted from http://es.slideshare.net/thanhtran81/query-processing-using-structure-index-for-rdf-data-on-the-web
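The difference between vertical partitioning (VP) and structure-based partitioning (SP) can be reproduced with the numbers on this slide. The sketch below is illustrative only: it uses just the triples visible in the two tables and the extensions listed in the figure (B5 is left out because no triple shown has KIT or MIT as subject), and the dictionary-based layout is an assumption.

```python
from collections import defaultdict

# Extensions of the structure index, as listed on the slide.
extensions = {"B1": {3, 7}, "B2": {0, 1}, "B3": {8, 9}, "B4": {2, 4, 6}, "B6": {5}}
# Only the triples visible in the slide's two tables (a subset of the data graph).
triples = [
    (2, "AuthorOf", 0), (4, "AuthorOf", 0), (6, "AuthorOf", 1),
    (3, "AuthorOf", 0), (7, "AuthorOf", 1),
    (2, "WorksAt", 8), (4, "WorksAt", 8), (6, "WorksAt", 9),
]
ext_of = {v: name for name, members in extensions.items() for v in members}

vp = defaultdict(list)  # vertical partitioning: one (sub, obj) table per property
sp = defaultdict(list)  # structure-based partitioning: one table per extension
for s, p, o in triples:
    vp[p].append((s, o))
    sp[ext_of[s]].append((s, p, o))
```

`vp["AuthorOf"]` mixes subjects from B1 and B4, while `sp["B4"]` only holds subjects that are both authors and employees, which is the structural guarantee mentioned above.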
Query Processing
• According to [Zhang et al, GIR’06], most users search for information at city level
– 83.77% of Yahoo! geo-queries contain a city name
• Users tend to modify their queries with geo terms (geomodification):
– e.g. “dry cleaner” → “dry cleaner dublin”
• Queries can:
– be global – e.g. “javascript”
– include location – e.g. “dun laoghaire”
– request information – e.g. “dublin castles”
– or be implicitly local – e.g. “house for sale”
• How to classify local vs global queries?
• How to automatically rewrite/expand the query?
• How to understand query context/semantics?
Query Classification/Rewriting
• [Gravano et al, CIKM’03] classified ~2.5 M Excite queries to determine global and local queries using pseudo-relevance feedback
– 20 features generated based on:
• Frequency of location words, aggregate # of locations, unique # of locations, fraction of pages with location etc.
– Trains different classifiers (RIPPER, C4.5, log-linear, SVM)
• Log-linear and SVM perform best
• [Jones et al, WWW’06] exploits users’ query sessions in logs for query rewriting
– Generate candidates from different substitutions in the logs
– Identify the important candidates
– Rank the candidates
– Assign confidence scores
[Zhang et al, GIR’06]
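To make the locality features listed above more concrete, here is a rough sketch that computes three of them over the top result pages of a query. It is not the feature extractor of [Gravano et al, CIKM’03]: the gazetteer, the whitespace tokenisation and the feature names are assumptions for illustration.

```python
def location_features(pages, gazetteer):
    """Compute a few locality features over the top result pages of a query."""
    per_page = [[w for w in page.lower().split() if w in gazetteer] for page in pages]
    mentions = [w for words in per_page for w in words]
    return {
        "total_location_mentions": len(mentions),
        "unique_locations": len(set(mentions)),
        "fraction_pages_with_location": sum(1 for w in per_page if w) / len(pages),
    }

gazetteer = {"dublin", "cork", "galway"}  # hypothetical gazetteer of place names
pages = [
    "Dry cleaner in Dublin city",
    "Best dry cleaning tips",
    "Cork and Dublin cleaners",
]
features = location_features(pages, gazetteer)
```

A feature vector like this would then be passed to one of the classifiers named on the slide to decide whether the query is local or global.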
Query Context
• Understanding query context for different domains
– Create a model for each domain using EM:
– Select a domain for query with min. KL-Divergence
[Bai et al, SIGIR’07]
[Figures: “Environment” term probabilities; Distribution of TREC Queries]
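The domain-selection step above can be sketched as picking the domain model with the smallest KL divergence from the query model. This is a simplified illustration: the unigram distributions are made up, the direction KL(query || domain) and the `eps` floor for unseen terms are assumptions, and the EM estimation of the domain models is not shown.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the vocabulary of p; eps avoids log(0) for unseen terms."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def select_domain(query_model, domain_models):
    """Pick the domain whose term distribution is closest to the query model."""
    return min(domain_models,
               key=lambda d: kl_divergence(query_model, domain_models[d]))

# Hypothetical unigram models; on the slide these are estimated with EM.
domains = {
    "environment": {"pollution": 0.5, "water": 0.4, "bank": 0.1},
    "finance": {"bank": 0.6, "loan": 0.3, "water": 0.1},
}
query = {"water": 0.5, "pollution": 0.5}
best = select_domain(query, domains)
```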
Query Semantics
• Semantic entity index
– Inverted index for entities / triples
– Return entities / entities’ relationships as results to keys
• Semantic entity ranking
– Structured language model: one language model for every attribute
– Returns entities’ LMs that most likely generate the keywords, i.e. the entity descriptions that best match the keywords
[Figure: Entity level, example query “address company dublin”]
[Tran et al, ICDE’09]
adopted from: [Mika and Tran,SemTech’13]
Query Semantics
• Offline component: query-independent schema graph
• Reuse schema
– Pseudo-schema construction: all possible connections between classes of entities, e.g. friendships between users
• Online component: query-specific keyword matching elements
– Connect keyword matching elements / entities to the classes they belong to
[Figure: Relationships / Structure and Entity levels, example query “address company dublin”]
[Tran et al, ICDE’09]
adopted from: [Mika and Tran,SemTech’13]
Query Semantics
• Top-k graph exploration
– Shortest-path based algorithm that finds top-k graphs connecting keyword matching elements
• Top-k graph ranking
– Language model based
– Aggregated model that combines the LMs of entities matching the keywords
[Figure: Relationships / Structure and Entity levels, example query “address company dublin”]
[Tran et al, ICDE’09]
adopted from: [Mika and Tran,SemTech’13]
Query Semantics
• Graph to query mapping
– Translation rules that map top ranked graphs to structured queries (SQL, SPARQL)
– Translation rules that map structured queries to natural language questions
• Graph matching
– Triple index: cover index supporting different triple patterns
– Various join implementations
[Figure: Triple, Relationships / Structure and Entity levels; the keyword query “address company dublin” shown with the question “Address of companies located in Dublin?”]
[Tran et al, ICDE’09]
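The graph-to-query mapping can be pictured as rendering the edges of a top-ranked graph as triple patterns. The sketch below is a toy translation rule, not the one from [Tran et al, ICDE’09]: the `ex:` vocabulary and the shape of the graph for “address company dublin” are hypothetical, and prefix declarations are omitted.

```python
def to_sparql(edges, select):
    """Render (subject, predicate, object) edges of a query graph as SPARQL text."""
    patterns = " .\n  ".join(f"{s} {p} {o}" for s, p, o in edges)
    return f"SELECT {' '.join(select)} WHERE {{\n  {patterns} .\n}}"

# Hypothetical top-ranked graph for the keywords "address company dublin".
# The ex: vocabulary is illustrative only.
graph = [
    ("?c", "rdf:type", "ex:Company"),
    ("?c", "ex:locatedIn", "ex:Dublin"),
    ("?c", "ex:address", "?a"),
]
query = to_sparql(graph, ["?a"])
```

Each keyword ends up as either a class, a constant or a property in the structured query, which is how the keyword query acquires its intended semantics.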
Retrieval Models
Multiple Aspects of Relevance
• Overall relevance between a query and a piece of city-specific information is a tradeoff between multiple relevance aspects
– E.g. the reputation aspect can be more important for “coffee shops”, but the distance aspect can be more important for “bank office”.
• Important aspects for city search
– Content
– Structure
– Local popularity
– Distance
– Temporality
– Reputation
– Ratings
– Ease-of-access
– Activity
– Topic
– User preference
– … and even more
• Aggregation of multiple aspects
Content-based model construction
• Document statistics, e.g.
– Term frequency
– Document length
• Collection statistics, e.g.
– Inverse document frequency
– Background language models
$$P(t \mid \theta_d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t \mid C)$$
$$w_{t,d} = \frac{tf}{|d|} \cdot idf$$
• An object is more likely about “Dublin”?
– When it contains a relatively high number of mentions of the term “Dublin”
– When the number of mentions of the term in the overall collection is relatively low
adopted from: [Tran, ESSIR’11]
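Both content-based scores on this slide can be computed on a toy collection as sketched below. The documents are made up, `lam = 0.5` is an arbitrary choice, and `idf = log(N / df)` is one common variant assumed here because the slide does not specify which idf is used.

```python
import math
from collections import Counter

docs = {
    "d1": "dublin castle dublin city".split(),
    "d2": "cork city harbour".split(),
    "d3": "galway city bay".split(),
}
collection = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection.values())

def p_smoothed(term, doc, lam=0.5):
    """P(t | theta_d) = lam * tf/|d| + (1 - lam) * P(t | C)."""
    tf = doc.count(term)
    return lam * tf / len(doc) + (1 - lam) * collection[term] / collection_len

def tf_idf(term, doc):
    """w_{t,d} = tf/|d| * idf, with idf = log(N / df) as an assumed variant."""
    df = sum(1 for d in docs.values() if term in d)
    return doc.count(term) / len(doc) * math.log(len(docs) / df) if df else 0.0
```

“city” occurs in every document, so its tf-idf weight is zero, while “dublin” is frequent in d1 and rare in the collection, which matches the two conditions in the callout above.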
Structure-based model construction
• PageRank
– Link analysis algorithm
– Measuring relative importance of nodes
– Link counts as a vote of support
– The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links)
• ObjectRank
– Types and semantics of links vary in structured data
– Authority transfer schema graph specifies connection strengths
– Recursively compute authority transfer data graph
• An object (about “Dublin”) is more important?
– When a relatively large number of objects are linked to it
[Hristidis et al, TODS’08]
adopted from: [Tran, ESSIR’11]
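The recursive definition of PageRank above is usually evaluated by power iteration. The sketch below is a plain textbook version on a made-up link graph; the damping factor 0.85, the fixed iteration count and the uniform handling of dangling nodes are conventional assumptions, and ObjectRank's typed authority transfer is not modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each node to its outgoing links."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            targets = outs or nodes  # dangling nodes spread their rank uniformly
            share = damping * rank[v] / len(targets)
            for u in targets:
                new[u] += share
        rank = new
    return rank

# Hypothetical link graph: three objects link to "dublin".
links = {"a": ["dublin"], "b": ["dublin"], "c": ["dublin", "a"], "dublin": []}
ranks = pagerank(links)
```

ObjectRank follows the same iteration but replaces the uniform `share` with edge-type-specific transfer rates taken from the authority transfer schema graph.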
Structure-based model construction – proximity
• EASE, XRANK, BLINKS, etc.
• EASE
– Proximity between a pair of keywords
– Overall score of a JRT is an aggregation of the scores of keyword pairs
• XRANK
– Ranking of XML documents / elements
– Proximity of n is defined based on w, the smallest text window in n that contains all search keywords
• A structured result (e.g. Steiner tree) is more relevant?
– When it is more compact, s.t. its elements are closely related
[Li et al, SIGMOD08]
[Guo et al, SIGMOD03]
adopted from: [Chen et al, SIGMOD09]
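The quantity w used by XRANK, the smallest text window containing all search keywords, can be found with a standard sliding-window scan. The sketch below only computes the window length over a token list; how XRANK turns w into a proximity score is not shown on the slide and is not reproduced here.

```python
from collections import Counter

def smallest_window(tokens, keywords):
    """Length of the smallest token window containing all keywords, or None."""
    need = set(keywords)
    if not need:
        return 0
    have = Counter()
    best, left = None, 0
    for right, tok in enumerate(tokens):
        if tok in need:
            have[tok] += 1
        # Shrink from the left while the window still covers every keyword.
        while all(have[k] > 0 for k in need):
            width = right - left + 1
            best = width if best is None else min(best, width)
            if tokens[left] in need:
                have[tokens[left]] -= 1
            left += 1
    return best

tokens = "malahide is a town near dublin with a castle in malahide".split()
```

A smaller window means the keywords occur closer together, which corresponds to the "more compact" results in the callout above.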
Structured-content-based model construction
• Consider the structure of objects during content-based modeling, i.e., to obtain a structured content-based model
– Content-based model for structured objects, structured documents, database tuples…
$$P(t \mid \theta_d) = \sum_{f \in F_d} \alpha_f \, P(t \mid \theta_f)$$
• An object is more likely about “Dublin”?
– When its (important) fields / attributes contain a relatively high number of mentions of the term “Dublin”
adopted from: [Tran, ESSIR’11]
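The field-weighted mixture above can be evaluated directly, as in this small sketch. The object, its two fields and the weights `alpha` are invented for illustration, and the per-field models are unsmoothed maximum-likelihood estimates, which is a simplifying assumption.

```python
def p_field(term, tokens):
    """Maximum-likelihood P(t | theta_f) for one field (no smoothing)."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

def p_structured(term, fields, alpha):
    """P(t | theta_d) = sum over fields f of alpha_f * P(t | theta_f)."""
    return sum(alpha[f] * p_field(term, toks) for f, toks in fields.items())

# Hypothetical object with two fields; the weights alpha are assumed and sum to 1.
obj = {
    "name": "dublin castle".split(),
    "description": "historic castle in the city centre".split(),
}
alpha = {"name": 0.7, "description": 0.3}
```

Giving `name` a higher weight than `description` means a mention of “Dublin” in the name counts more, which is the "(important) fields" point in the callout.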
Structured-content-based model construction
Relevance model [Lavrenko et al, SIGIR01]
$$P(w \mid R) \approx P(w \mid q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$$
$$P(w, q_1 \ldots q_k) = \sum_{M \in \mathcal{M}} P(M) \, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$$
[Figure: query terms q1, q2, q3 (castle, Malahide, visit), an unknown word w (???) and models M]
Sample probabilities:
P(w|Q)  w
.077    malahide
.055    dublin
.034    town
.033    tourist
.027    castle
.011    beach
.010    marina
.010    north
.010    coastal
…
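The relevance model on this slide sums, over a set of models M, the probability of generating both the word w and all query terms, and then normalises by the query probability. The sketch below assumes one unsmoothed unigram model per document and a uniform prior P(M); the documents are made up, so the numbers will not match the sample probabilities on the slide.

```python
from collections import Counter, defaultdict

def relevance_model(query, docs):
    """P(w | q1..qk) with one unigram model M per document and uniform P(M)."""
    models = [{w: c / len(d) for w, c in Counter(d).items()} for d in docs]
    joint = defaultdict(float)
    for m in models:
        q_lik = 1.0
        for q in query:
            q_lik *= m.get(q, 0.0)                       # prod_i P(q_i | M)
        for w, p in m.items():
            joint[w] += (1.0 / len(models)) * p * q_lik  # P(M) P(w|M) prod_i P(q_i|M)
    total = sum(joint.values())                          # equals P(q1..qk)
    return {w: v / total for w, v in joint.items()} if total else {}

docs = [
    "malahide castle visit dublin".split(),
    "malahide beach marina".split(),
    "cork harbour visit".split(),
]
rm = relevance_model(["malahide", "castle"], docs)
```

Words such as “dublin” receive probability mass even though they are not in the query, which is what makes the model usable for query expansion.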
Structured-content-based model construction
Edge-specific relevance model
• Given a query Q={q1,…,qn}, a set of resources (FR) is retrieved
– E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}
• Based on the FR results, an edge-specific RM_FR is constructed for each unique edge e:
[Bicer et al, CIKM11]
Structured-content-based model construction
Edge-specific resource model [Bicer et al, CIKM11]
• Edge-specific resource model:
– Smoothing with the model for the entire resource
• Use RM for query expansion: the score of a resource is calculated based on the cross-entropy of the edge-specific RM_FR and the edge-specific RM_r:
• Alpha controls the importance of edges
Structured-content-based model construction
Scoring
• Ranking aggregated JRTs:
– The cross entropy between the edge-specific RM_FR (query model) and the geometric mean of the combined edge-specific RM_JRT:
• The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k keyword search algorithms)
[Bicer et al, CIKM11]
Distance
• Idea: distance is relative to the type of business in local search
– E.g. furniture vs. coffee shop
• Using backoff methods to enrich a result business with aggregate features from selected objects in a search log
• Different distance functions:
– Geographic distance, categorical distance, user distance
[Berberich et al, SIGIR’11]
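Of the distance functions listed above, geographic distance is the easiest to make concrete. The sketch below uses the haversine formula for great-circle distance; the category-relative `distance_score` with an exponential decay and a per-category tolerance is a hypothetical illustration of "distance relative to the type of business", not the model from the cited paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_score(km, tolerance_km):
    """Hypothetical decay: a larger per-category tolerance penalises distance less."""
    return math.exp(-km / tolerance_km)

dublin_cork = haversine_km(53.3498, -6.2603, 51.8985, -8.4756)
```

With an assumed tolerance of 20 km for furniture stores and 2 km for coffee shops, the same 5 km trip is penalised far more heavily for a coffee shop.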
Local Popularity
• Mining traffic patterns of venues from 22 million Foursquare check-ins
– in 12 million venues
• 56% of which are assigned to categories
• 7.8% of which have at least one tag
• Traffic pattern of a venue is modeled as check-in frequency for a series of time units (24 units for daily, 70 units for weekly TPs)
• Temporal correlation measure to detect similarity based on traffic patterns among the venues
• Clustering the venues
– Using K-Means or EM
[Cheng et al, CIKM’11]
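A daily traffic pattern and a correlation between two venues can be sketched as follows. The check-in hours are invented, and Pearson correlation is used as a stand-in because the slide does not define the temporal correlation measure of [Cheng et al, CIKM’11].

```python
import math

def daily_pattern(checkin_hours):
    """24-unit traffic pattern: normalised check-in frequency per hour of day."""
    counts = [0] * 24
    for h in checkin_hours:
        counts[h % 24] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def pearson(x, y):
    """Pearson correlation, used here as a stand-in temporal correlation measure."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical check-in hours for three venues.
cafe_a = daily_pattern([8, 8, 9, 9, 10, 13])
cafe_b = daily_pattern([8, 9, 9, 10, 10, 13])
bar = daily_pattern([21, 22, 22, 23, 23, 0])
```

The resulting 24-dimensional vectors are also what a clustering step such as K-Means would take as input.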
Local Popularity
[Figures: daily and weekly traffic pattern for Walmart; top pairs of venues correlated based on traffic; panels labelled book shops and pharmacies, steakhouses, sub shops and coffee shops]
[Cheng et al, CIKM’11]
Context
• Hapori: Uses context to provide more relevant results
– Temporal, Weather, Spatial, Personal
• Analyzed 80,000 local search queries submitted to Mobile Bing Local from 11,000 users
• Data from search logs contained
– Query terms, POI identifier of click, user GPS location, timestamp, user id
– Timestamp used to extract weather data
• Goal: Improve the POI relevance based on a preference model of context
[Lane et al, Ubicomp’10]
Context
• POI Decision: Clicks
• Contextual Features:
– Rely on phone sensors
• Differences among people via a CSM
• Use features and CSM to train a classifier for each POI category
– KNN classifier
[Lane et al, Ubicomp’10]
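The KNN step can be illustrated with a few lines of plain Python. This is only a generic k-nearest-neighbour vote over context vectors: the feature encoding and the training examples are made up, and neither the CSM weighting between users nor the per-category training described on the slide is modelled.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest training examples (Euclidean distance)."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical context vectors: (hour of day / 24, temperature in C / 40, distance km / 10),
# each labelled with the POI the user clicked.
train = [
    ((0.35, 0.50, 0.1), "cafe"),
    ((0.38, 0.45, 0.2), "cafe"),
    ((0.33, 0.55, 0.1), "cafe"),
    ((0.90, 0.30, 0.3), "pub"),
    ((0.92, 0.25, 0.2), "pub"),
    ((0.88, 0.35, 0.4), "pub"),
]
```

A call such as `knn_predict(train, (0.36, 0.5, 0.1))` returns the POI clicked most often in similar contexts.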
Aggregation
• How to aggregate multiple aspects of relevance?
• Aspect ranking function (ARF):
– Maps a feature vector to an aspect relevance score
• Model Aggregation Function:
– Aggregates estimated aspect relevances to a final relevance score
• Apply learning-to-rank to learn ARFs
[Kang et al, WSDM’12]
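A linear combination is one simple choice of aggregation function and is enough to show how the same results can be ranked differently per query type. The aspect scores and weights below are invented for illustration; in the setting of [Kang et al, WSDM’12] the aspect scores would come from learned ARFs and the aggregation would be learned as well.

```python
def aggregate(aspect_scores, weights):
    """Hypothetical linear aggregation of per-aspect relevance scores."""
    return sum(weights[a] * s for a, s in aspect_scores.items())

# Assumed outputs of per-aspect ranking functions (ARFs) for two results.
results = {
    "cafe_near": {"content": 0.6, "distance": 0.9, "reputation": 0.4},
    "cafe_famous": {"content": 0.7, "distance": 0.3, "reputation": 0.9},
}
# Assumed weights per query type; these would normally be learned.
weights_coffee = {"content": 0.3, "distance": 0.2, "reputation": 0.5}
weights_bank = {"content": 0.3, "distance": 0.6, "reputation": 0.1}

def rank(results, weights):
    """Order result ids by aggregated relevance, best first."""
    return sorted(results, key=lambda r: aggregate(results[r], weights), reverse=True)
```

With reputation-heavy weights the famous venue ranks first, while distance-heavy weights promote the nearby one, mirroring the “coffee shops” versus “bank office” example earlier in the deck.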
Future Directions
• More and more city data is becoming available, helping us to understand city dynamics
• The IR community can benefit from this to better answer the queries of people in everyday city life
• Collaboration
– “Build smarter cities, together”
• A toolbox of initial approaches is already developed in different areas:
– Core IR, DM, Semantic Search, GIR etc.
• More focused research within city context
References
• Indexing
– Chun Chen, Feng Li, Beng C. Ooi, Sai Wu, TI: an efficient indexing mechanism for real-time search on tweets, SIGMOD 2011
– Thanh Tran, Günter Ladwig, Sebastian Rudolph, RDF Data Partitioning and Query Processing Using Structure Indexes, Transactions on Knowledge and Data Engineering, 2012.
– Cong, Gao, Christian S. Jensen, and Dingming Wu. Efficient retrieval of the top-k most relevant spatial web objects. Proceedings of the VLDB Endowment 2.1 (2009): 337-348.
• Query Processing
– Zhang, Wei Vivian, Benjamin Rey, Eugene Stipp, and Rosie Jones. Geomodification in Query Rewriting. In GIR. 2006.
– Gravano, Luis, Vasileios Hatzivassiloglou, and Richard Lichtenstein. Categorizing web queries according to geographical locality. Proceedings of the twelfth international conference on Information and knowledge management. ACM, 2003.
– Jones, Rosie, Benjamin Rey, Omid Madani, and Wiley Greiner. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pp. 387-396. ACM, 2006.
– Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano, Top-k exploration of query candidates for efficient keyword search on graph-shaped (rdf) data, ICDE, 2009.
– Bai, Jing, Jian-Yun Nie, Guihong Cao, and Hugues Bouchard. Using query contexts in information retrieval. In SIGIR, 2007.
• Retrieval
– Changsung Kang, Xuanhui Wang, Yi Chang, Belle Tseng, Learning to rank with multi-aspect relevance for vertical search, WSDM 2012
– Nicholas D Lane, Dimitrios Lymberopoulos, Feng Zhao, Andrew T. Campbell, Hapori: context-based local search for mobile phones using community behavioral modeling and similarity, Ubicomp, 2010.
– Klaus Berberich, Arnd C. König, Dimitrios Lymberopoulos, Peixiang Zhao, Improving local search ranking through external logs, SIGIR 2011.
– Cheng, Zhiyuan, et al. Toward traffic-driven location-based Web search. CIKM, 2011.
– Hristidis, Vagelis, Heasoo Hwang, and Yannis Papakonstantinou. Authority-based keyword search in databases. ACM Transactions on Database Systems (TODS) 33, no. 1 (2008): 1.
– Li, Guoliang, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD, 2008.
– Guo, Lin, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRANK: ranked keyword search over XML documents. In SIGMOD, 2003.
– Bicer, Veli, Thanh Tran, and Radoslav Nedkov. Ranking support for keyword search on structured data using relevance models. In CIKM, 2011.
• Other Tutorials
– Tony Russell-Rose, Designing Search Usability, Tutorial at SIGIR 2013.
– Tie-Yan Liu, Learning to Rank for Information Retrieval, Tutorial at SIGIR 2008
– Yi Zhang and Rong Jin, Supervised and Semi-Supervised Learning for IR, Tutorial at SIGIR 2008
– Peter Mika, Thanh Tran, Semantic Search on the Rise, Tutorial at SemTech 2013
– Thanh Tran, Semantic Search - Focus: IR on Structured Data, ESSIR, 2011.
– Yi Chen, Keyword Search on Structured and Semi-structured Data. Tutorial in SIGMOD, 2009
Acknowledgements
Spyros Kotoulas
Thanh Tran
Freddy Lecue
Giusy Di Lorenzo
Marco Luca Sbodio
Martin Stephenson
Pierpaolo Tommasi
Simone Tallevi-Diotallevi
Pol Mac Aonghusa