IBM - Dublin Research Lab

Part III

Searching City Data

Veli Bicer

IBM Research


Overview of Components in City Search

[Figure: layered architecture of city search. Data sources (stream, structured, textual, geospatial and other data) feed the Data Access and Management Layer with its Semantic Virtual Views on Data. Above it sit Geographic Crawling, Indexing, the Search Indexes, Query Processing, Retrieval and Ranking, Search Logs and Model Learning, and the Search Interfaces (Traditional, Map, Mobile) that give Contextual Access for Users and Applications. A legend distinguishes Part II components from Part III components.]

• Data sources: sensors, transportation, weather, GPS; municipality, public services; social, Web, businesses, events; road network, POI

• Indexing: multiple indexes are needed to support different lookup patterns, e.g. text, structured, real-time and spatial indexes

• Search logs: queries, clickthrough data

• Query processing: understanding the query characteristics via classification, rewriting, expansion and semantic translation

• Retrieval and ranking: multiple aspects of relevance for various ranking signals and aggregation

[Russell-Rose, SIGIR 2013]

[Liu, SIGIR 2008]

[Zhang & Jin, SIGIR 2008]

[Mika & Tran, SemTech 2013]


Outline

• Indexing City Data

– Real-time Indexing

– Spatial Indexing

– Structure Indexing

• Query Processing

– Query Classification/Rewriting

– Query Context

– Query Semantics

• Retrieval Models

– Multiple Aspects

– Content/Structure based Models

– Distance

– Local Popularity

– Context

– Aggregation

• Future Directions


Indexing City Data


Indexing City Data

• City data comes in different forms and shapes

– Standard indexing techniques (e.g. Lucene) are useful for handling textual content

– Multiple indexes are needed to support different lookup patterns

• Challenges (a poor fit for hypertext indexing):

– Streaming data has a short temporal span

– Spatial Indexing

– Structured Data


Real-time Indexing

• Partial Indexing

– Indexing only the records with high chance of being queried

• One example: Tweet Index

– Index tweets based on their content and their rankings w.r.t. existing queries

• Indexing process

– determines whether to index or not

– classifies the tweet as distinguished or noisy via ranking

– keeps some statistics in memory

• Ranking function

– based on user’s PageRank

– popularity of topics

[Chen et al, SIGMOD’11]
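The indexing decision above can be sketched as follows. This is a minimal illustration, not the method of [Chen et al, SIGMOD’11]: the scoring formula, the `threshold` and `weight` parameters, and the `topic_popularity` statistic are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    tid: int
    text: str
    user_pagerank: float  # assumed to be precomputed for the tweet's author


@dataclass
class PartialIndexer:
    topic_popularity: dict   # term -> popularity in [0, 1]; hypothetical in-memory statistic
    threshold: float = 0.5   # hypothetical cut-off between "distinguished" and "noisy"
    weight: float = 0.5      # hypothetical balance between user PageRank and topic popularity
    indexed: list = field(default_factory=list)  # tweet IDs indexed immediately
    log: list = field(default_factory=list)      # tweet IDs only appended to the log

    def score(self, tweet):
        # Combine the author's PageRank with the most popular topic term in the tweet.
        terms = tweet.text.lower().split()
        popularity = max((self.topic_popularity.get(t, 0.0) for t in terms), default=0.0)
        return self.weight * tweet.user_pagerank + (1 - self.weight) * popularity

    def add(self, tweet):
        """Index the tweet if it scores high enough; otherwise keep it in the log only."""
        if self.score(tweet) >= self.threshold:
            self.indexed.append(tweet.tid)
            return True
        self.log.append(tweet.tid)
        return False
```

A tweet that ends up only in the log stays reachable later, which is what keeps partial indexing cheap at insertion time.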

[Figures: overall architecture; data flow]


Real-time Indexing

• The index keeps the records for the same keyword in a list, with the latest record inserted at the head

– Sorted by their timestamps

– Stores TID, U-PageRank, TF, Tree ID

• A separate Tweet Table

– Stores a tree encoding to determine re-tweets

– Pointer to log file if tweet is not indexed

[Figure: Tweet Table and Index Structure. Tweet Table fields: ID of the replied tweet; # of tweets that reply to this tweet; offset in the log file (for unindexed tweets)]

[Chen et al, SIGMOD’11]
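A rough sketch of the two structures described on this slide, assuming tweets arrive in timestamp order so that inserting at the head keeps each list sorted newest-first. The class and field names are illustrative and are not taken from the paper.

```python
from collections import defaultdict, deque


class TweetIndex:
    def __init__(self):
        # keyword -> postings (TID, U-PageRank, TF, Tree ID), newest first
        self.postings = defaultdict(deque)
        # TID -> Tweet Table row
        self.tweet_table = {}

    def insert(self, tid, text, user_pagerank, tree_id, replied_to=None):
        terms = text.lower().split()
        for term in set(terms):
            tf = terms.count(term)
            # The latest record goes to the head of the keyword's list.
            self.postings[term].appendleft((tid, user_pagerank, tf, tree_id))
        self.tweet_table[tid] = {
            "replied_to": replied_to,  # ID of the replied tweet
            "num_replies": 0,          # number of tweets that reply to this tweet
            "log_offset": None,        # offset in the log file (for unindexed tweets)
        }
        if replied_to in self.tweet_table:
            self.tweet_table[replied_to]["num_replies"] += 1

    def latest(self, term, k=10):
        """Return the TIDs of the k most recent tweets containing the term."""
        return [posting[0] for posting in list(self.postings.get(term.lower(), ()))[:k]]
```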


Spatial Indexing

• Needed to handle geographic queries

– Q = “Chinese restaurant” with a location

• A hybrid index (IR-tree) combining the R-tree for spatial DBs with an inverted index for text

– Text relevancy: LM

– R-tree: groups nearby objects with their minimum bounding rectangle (MBR) in the next higher level of the tree

[Cong et al, VLDB’09]

[Figures: term-document matrix; bounding boxes]


Spatial Indexing

[Cong et al, VLDB’09]

• Leaf nodes contain entries (object, rectangle) and a pointer to the inverted index (II) for those entries

• Non-leaf nodes contain the minimum bounding box of their child nodes and a pseudo-document representing all documents in the sub-nodes
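The node layout above can be illustrated with a simplified sketch. It makes several assumptions that differ from the IR-tree of [Cong et al, VLDB’09]: objects are points stored directly as lowest-level nodes, the pseudo-document is a plain set of terms, keywords act as a boolean filter, and results are ordered by distance only rather than by a combined text and distance score.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: tuple                                     # (xmin, ymin, xmax, ymax)
    terms: set                                     # pseudo-document: all terms below this node
    children: list = field(default_factory=list)   # child nodes (empty for objects)
    obj: str = None                                # object name at the lowest level


def make_object(name, x, y, text):
    return Node((x, y, x, y), set(text.lower().split()), obj=name)


def make_parent(children):
    # The parent's MBR encloses all child MBRs; its pseudo-document is the union of their terms.
    xmins, ymins, xmaxs, ymaxs = zip(*(c.mbr for c in children))
    terms = set().union(*(c.terms for c in children))
    return Node((min(xmins), min(ymins), max(xmaxs), max(ymaxs)), terms, list(children))


def min_dist(point, mbr):
    # Smallest possible distance from the point to anything inside the rectangle.
    px, py = point
    dx = max(mbr[0] - px, 0, px - mbr[2])
    dy = max(mbr[1] - py, 0, py - mbr[3])
    return math.hypot(dx, dy)


def nearest_with_keywords(root, point, keywords, k=1):
    """Best-first search: visit nodes by MBR distance, skipping subtrees that lack a keyword."""
    query = {w.lower() for w in keywords}
    heap = [(min_dist(point, root.mbr), 0, root)]
    counter = 1  # tie-breaker so that nodes themselves are never compared
    results = []
    while heap and len(results) < k:
        dist, _, node = heapq.heappop(heap)
        if not query <= node.terms:
            continue  # the pseudo-document rules out the whole subtree
        if node.obj is not None:
            results.append((node.obj, dist))
            continue
        for child in node.children:
            heapq.heappush(heap, (min_dist(point, child.mbr), counter, child))
            counter += 1
    return results
```

The pseudo-document is what lets the search discard a whole region without opening it: if the union of terms under a node misses a query keyword, no object below it can match.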


RDF data

• Consists of triples <s,p,o>

• Triples form a graph, where vertices denote resources and their values, connected by directed labelled edges representing properties (i.e., relations and attributes)

• URIs are used as labels of edges and vertices representing resources

• Schema information is often incomplete, especially for Web data and RDFa data

• RDF data might be schema-less, semi-structured data

[Figure: example RDF graph with resource vertices 0 to 9 and the values KIT and MIT, connected by AuthorOf, Supervises, WorksAt and Name edges]

[Tran et al, TKDE’12]

adapted from http://es.slideshare.net/thanhtran81/query-processing-using-structure-index-for-rdf-data-on-the-web


Structure Index: State-of-the-art

• Data Partitioning

– Vertical partitioning (SW-Store)

• Indexing

– Sextuple indexing (Hexastore)

– Materialization and indexing of entire join paths (GRIN)

• Index Implementation

– B+ tree

– Inverted index (Semplore)

– Index compression (RDF-3X)

• Query processing

– Sorted merge join based on vertical partitioning and indexing (SW-Store)

– Join order optimization based on dynamic programming (RDF-3X)

→ A combination of different concepts makes up the state of the art!

adapted from http://es.slideshare.net/thanhtran81/query-processing-using-structure-index-for-rdf-data-on-the-web
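As one concrete instance of the indexing ideas listed on this slide, sextuple indexing keeps one index per ordering of subject, property and object, so that every triple pattern has an index whose prefix matches its bound positions. The sketch below uses nested dictionaries and only illustrates the idea; it is not Hexastore's actual storage layout. The sample triples are taken from the tables of the running example.

```python
from collections import defaultdict
from itertools import permutations


def build_sextuple_index(triples):
    """Build six nested indexes (spo, sop, pso, pos, osp, ops) over the triples."""
    index = {}
    for order in permutations("spo"):
        nested = defaultdict(lambda: defaultdict(set))
        for s, p, o in triples:
            value = {"s": s, "p": p, "o": o}
            first, second, third = (value[position] for position in order)
            nested[first][second].add(third)
        index["".join(order)] = nested
    return index


TRIPLES = [
    (2, "AuthorOf", 0), (4, "AuthorOf", 0), (6, "AuthorOf", 1),
    (2, "WorksAt", 8), (4, "WorksAt", 8), (6, "WorksAt", 9),
    (3, "AuthorOf", 0), (7, "AuthorOf", 1),
]
```

With these indexes, a pattern such as (?x, AuthorOf, 0) is a direct lookup in the `pos` index: `build_sextuple_index(TRIPLES)["pos"]["AuthorOf"][0]`.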


Structure Index for RDF data

• Structure index is a graph

– Is a structural description that is finer-grained than a schema

– Consists of classes (extensions) and relations between them

– Resources in an extension exhibit the same structure, i.e., cannot be distinguished by outgoing (forward bisimilarity) and incoming (backward bisimilarity) “edge trees”

– Parameterize bisimulation by two sets of edge labels
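The extensions can be computed by partition refinement. The sketch below is a simple fixpoint iteration, not the algorithm of [Tran et al, TKDE’12], and it uses all edge labels instead of the two parameter sets mentioned above. It starts with all vertices in one block and repeatedly splits blocks by the labelled edges that lead to and from other blocks.

```python
from collections import defaultdict


def bisimulation_blocks(triples):
    """Group vertices that cannot be told apart by their outgoing and incoming edge trees."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    block = {n: 0 for n in nodes}
    num_blocks = 1
    while True:
        out_sig = defaultdict(set)
        in_sig = defaultdict(set)
        for s, p, o in triples:
            out_sig[s].add((p, block[o]))  # forward view: label and block of the target
            in_sig[o].add((p, block[s]))   # backward view: label and block of the source
        ids = {}
        new_block = {}
        for n in nodes:
            signature = (block[n], frozenset(out_sig[n]), frozenset(in_sig[n]))
            new_block[n] = ids.setdefault(signature, len(ids))
        if len(ids) == num_blocks:
            break  # no block was split in this round, so the partition is stable
        block, num_blocks = new_block, len(ids)
    groups = defaultdict(list)
    for n, b in block.items():
        groups[b].append(n)
    return sorted(sorted(group) for group in groups.values())
```

Run on the AuthorOf and WorksAt triples that appear in the tables of the running example, this yields the groups {0,1}, {2,4,6}, {3,7} and {8,9}.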

[Figure: the example RDF graph (vertices 0 to 9, KIT, MIT) and its structure index with extensions B1: 3,7; B2: 0,1; B3: 8,9; B4: 2,4,6; B5: KIT, MIT; B6: 5, connected by AuthorOf, Supervises, WorksAt and Name edges]




Structure-based Partitioning

• Whether a graph vertex instantiates a variable of a query depends on its structure → vertices are physically grouped based on structural similarity

• Apply the grouping captured by the structure index to the physical organization

– Create a physical group for every vertex of the structure index

– Triples are in the same group when their subjects belong to the same extension

• Triples of an SP table not only satisfy the property of a triple pattern but also provide some structural guarantee, e.g., that they match the entire query structure
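The difference between vertical partitioning (VP, one table per property) and structure-based partitioning (SP, one table per extension of the subject) can be sketched with the triples and extensions of the running example. The function names are illustrative.

```python
from collections import defaultdict

TRIPLES = [
    (2, "AuthorOf", 0), (4, "AuthorOf", 0), (6, "AuthorOf", 1),
    (2, "WorksAt", 8), (4, "WorksAt", 8), (6, "WorksAt", 9),
    (3, "AuthorOf", 0), (7, "AuthorOf", 1),
]

# Extension of each vertex, as given by the structure index of the example.
EXTENSION = {3: "B1", 7: "B1", 0: "B2", 1: "B2", 8: "B3", 9: "B3", 2: "B4", 4: "B4", 6: "B4"}


def vertical_partitioning(triples):
    """VP: one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return tables


def structure_based_partitioning(triples, extension):
    """SP: one (subject, property, object) table per extension of the subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[extension[s]].append((s, p, o))
    return tables
```

The VP AuthorOf table mixes subjects from B1 and B4, whereas every subject in the SP B4 table is known to have both an AuthorOf and a WorksAt edge.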

[Figure: structure index with extensions B1: 3,7; B2: 0,1; B3: 8,9; B4: 2,4,6; B5: KIT, MIT; B6: 5, connected by AuthorOf, Supervises, WorksAt and Name edges]

SP B4 table

Sub  Property  Obj
2    AuthorOf  0
4    AuthorOf  0
6    AuthorOf  1
2    WorksAt   8
4    WorksAt   8
6    WorksAt   9

VP AuthorOf table

Sub  Obj
2    0
4    0
6    1
3    0
7    1




Query Processing


Query Processing

• According to [Zhang et al, GIR’06], most users search for information at the city level

– 83.77% of Yahoo! geo-queries contain a city name

• Users tend to modify their queries with geo terms (geomodification):

– e.g. “dry cleaner” → “dry cleaner dublin”

• Queries can:

– be global – e.g. “javascript”

– include location – e.g. “dun laoghaire”

– request information – e.g. “dublin castles”

– or be implicitly local – e.g. “house for sale”

• How to classify local vs global queries?

• How to automatically rewrite/expand the query?

• How to understand query context/semantics?


Query Classification/Rewriting

• [Gravano et al, CIKM’03] classified ~2.5 M Excite queries to determine global and local queries using pseudo-relevance feedback

– 20 features generated based on:

• Frequency of location words, aggregate # of locations, unique # of locations, fraction of pages with location etc.

– Trains different classifiers (RIPPER, C4.5, log-linear, SVM)

• Log-linear and SVM perform best

• [Jones et al, WWW’06] exploit users’ query sessions in logs for query rewriting

– Generate candidates from different substitutions in the logs

– Identify the important candidates

– Rank the candidates

– Assign confidence scores

[Zhang et al, GIR’06]


Query Context

• Understanding query context for different domains

– Create a model for each domain using EM:

– Select a domain for query with min. KL-Divergence

[Bai et al, SIGIR’07]
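Domain selection by minimum KL-divergence can be sketched as below. Note the swap: the domain models here are additively smoothed unigram counts rather than the EM-trained models of [Bai et al, SIGIR’07], and the `alpha` smoothing constant and the sample domain texts are assumptions made for the example.

```python
import math
from collections import Counter


def language_model(tokens, vocab, alpha=0.1):
    """Unigram model with additive smoothing so that every vocabulary term has mass."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def select_domain(query, domain_texts):
    """Return the domain whose model is closest to the query model, plus all scores."""
    query_tokens = query.lower().split()
    vocab = set(query_tokens)
    for text in domain_texts.values():
        vocab |= set(text.lower().split())
    query_model = language_model(query_tokens, vocab)
    scores = {
        domain: kl_divergence(query_model, language_model(text.lower().split(), vocab))
        for domain, text in domain_texts.items()
    }
    return min(scores, key=scores.get), scores
```

For example, with an "environment" domain built from text about air quality and a "transport" domain built from text about buses and trams, the query "air pollution" is assigned to "environment".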

[Figures: “Environment” term probabilities; distribution of TREC queries]


Query Semantics

• Semantic entity index

– Inverted index for entities / triples

– Returns entities / entities’ relationships as results for keys

• Semantic entity ranking

– Structured language model: one language model for every attribute

– Returns the entities whose LMs most likely generate the keywords, i.e. the entity descriptions that best match the keywords

[Figure: entity results for the keyword query “address company dublin”]

[Tran et al, ICDE’09]

adapted from: [Mika and Tran, SemTech’13]


Query Semantics

• Offline component: query-independent schema graph

• Reuse schema

– Pseudo-schema construction: all possible connections between classes of entities, e.g. friendships between users

• Online component: query-specific keyword matching elements

– Connect keyword matching elements / entities to the classes they belong to

[Figure: Entity; Relationships / Structure]





Query Semantics

• Top-k graph exploration

– Shortest-path based algorithm that finds top-k graphs connecting keyword matching elements

• Top-k graph ranking

– Language model based

– Aggregated model that combines the LMs of entities matching the keywords

[Figure: Entity; Relationships / Structure]





Query Semantics

• Graph to query mapping

– Translation rules that map top ranked graphs to structured queries (SQL, SPARQL)

– Translation rules that map structured queries to natural language questions

• Graph matching

– Triple index: cover index supporting different triple patterns

– Various join implementations

[Figure: Triple; Relationships / Structure; Entity. Example question: “Address of companies located in Dublin?”]




Retrieval Models


Multiple Aspects of Relevance

• Overall relevance between a query and city-specific information is a tradeoff between multiple relevance aspects

– E.g. the reputation aspect can be more important for “coffee shops”, but the distance aspect can be more important for “bank office”.

• Important aspects for city search

– Content

– Structure

– Local popularity

– Distance

– Temporality

– Reputation

– Ratings

– Ease-of-access

– Activity

– Topic

– User preference

– … and even more

• Aggregation of multiple aspects


Content-based model construction

• Document statistics, e.g.

– Term frequency

– Document length

• Collection statistics, e.g.

– Inverse document frequency

– Background language models

$$P(t \mid \theta_d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t \mid C)$$

$$w_{t,d} = \frac{tf}{|d|} \cdot idf$$

• An object is more likely about “Dublin”?

– When it contains a relatively high number of mentions of the term “Dublin”

– When the number of mentions of the term in the overall collection is relatively low

adapted from: [Tran, ESSIR’11]
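Both content-based weights on this slide can be computed in a few lines. The sketch assumes documents are token lists, takes the background model P(t|C) as the term's relative frequency in the whole collection, and uses idf = log(N / df), which is one common definition and is not stated on the slide. The value of `lam` is arbitrary.

```python
import math


def smoothed_probability(term, doc, collection, lam=0.8):
    """P(t | theta_d) = lam * tf / |d| + (1 - lam) * P(t | C)."""
    p_doc = doc.count(term) / len(doc)
    p_collection = collection.count(term) / len(collection)
    return lam * p_doc + (1 - lam) * p_collection


def tf_idf(term, doc, docs):
    """w_{t,d} = tf / |d| * idf, with idf = log(N / df) as an assumed definition."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return doc.count(term) / len(doc) * math.log(len(docs) / df)


docs = [["dublin", "castle", "dublin"], ["cork", "harbour"], ["dublin", "port"]]
collection = [token for doc in docs for token in doc]
scores = [smoothed_probability("dublin", doc, collection) for doc in docs]
```

The first document mentions "dublin" most often relative to its length and gets the highest probability; the second never mentions it but still receives a small non-zero probability from the background model.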


Structure-based model construction

• PageRank

– Link analysis algorithm

– Measuring relative importance of nodes

– Link counts as a vote of support

– The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links)

• ObjectRank

– Types and semantics of links vary in structured data

– Authority transfer schema graph specifies connection strengths

– Recursively compute authority transfer data graph

• An object (about “Dublin”) is more important?

– When a relatively large number of objects are linked to it

[Hristidis et al, TODS’08]
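The recursive PageRank definition can be made concrete with a short power iteration. The damping factor of 0.85 and the uniform redistribution of rank from nodes without outgoing links are conventional choices, not taken from the slide; ObjectRank's per-edge-type authority transfer rates are not modelled here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in links.items():
            for t in targets:
                # Each link passes on an equal share of the source node's rank.
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


ranks = pagerank({"a": ["dublin"], "b": ["dublin"], "c": ["dublin", "a"], "dublin": []})
```

In this toy graph the node "dublin" receives links from all other nodes and ends up with the highest rank.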



Structure-based model construction – proximity

• EASE, XRANK, BLINKS, etc.

• EASE

– Proximity between a pair of keywords

– The overall score of a JRT is an aggregation of the scores of its keyword pairs

• XRANK

– Ranking of XML documents / elements

– The proximity of n is defined based on w, the smallest text window in n that contains all search keywords

• A structured result (e.g. Steiner tree) is more relevant?

– When it is more compact, s.t. its elements are closely related

[Li et al, SIGMOD08]

[Guo et al, SIGMOD03]

adapted from: [Chen et al, SIGMOD09]
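The window w used by XRANK can be found with a standard sliding-window scan. The sketch only locates the smallest window over a flat token list; how XRANK turns the window size into a proximity score, and how it handles nested XML elements, is not reproduced here.

```python
from collections import Counter


def smallest_window(tokens, keywords):
    """Return (start, end) token positions of the smallest window containing all keywords."""
    needed = {k.lower() for k in keywords}
    if not needed:
        return None
    lowered = [t.lower() for t in tokens]
    seen = Counter()
    covered = 0  # number of distinct keywords inside the current window
    best = None
    left = 0
    for right, token in enumerate(lowered):
        if token in needed:
            seen[token] += 1
            if seen[token] == 1:
                covered += 1
        # Shrink from the left while the window still contains every keyword.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            dropped = lowered[left]
            if dropped in needed:
                seen[dropped] -= 1
                if seen[dropped] == 0:
                    covered -= 1
            left += 1
    return best


text = "malahide castle is a castle near dublin in north dublin".split()
window = smallest_window(text, ["castle", "dublin"])
```

Here the window covers the tokens "castle near dublin" (positions 4 to 6), which is more compact than the span starting at the first mention of "castle".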


Structured-content-based model construction

• Consider the structure of objects during content-based modeling, i.e., to obtain a structured content-based model

– Content-based model for structured objects, structured documents, database tuples…

$$P(t \mid \theta_d) = \sum_{f \in F_d} \alpha_f \, P(t \mid \theta_f)$$

• An object is more likely about “Dublin”?

– When its (important) fields / attributes contain a relatively high number of mentions of the term “Dublin”
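The field-weighted mixture can be sketched as follows, assuming each field model is a plain (unsmoothed) relative frequency and that the field weights sum to one. The field names and weights are made up for the example.

```python
def field_probability(term, fields, alphas):
    """P(t | theta_d) as the alpha-weighted sum of the per-field term probabilities."""
    total = 0.0
    for name, tokens in fields.items():
        p_field = tokens.count(term) / len(tokens) if tokens else 0.0
        total += alphas[name] * p_field
    return total


venue = {
    "name": ["dublin", "castle"],
    "description": ["a", "castle", "in", "the", "city", "centre"],
}
alphas = {"name": 0.7, "description": 0.3}  # hypothetical field weights
p_dublin = field_probability("dublin", venue, alphas)
p_castle = field_probability("castle", venue, alphas)
```

Because the name field carries most of the weight, a single mention there counts for more than a mention in the longer description.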



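Structured-content-based model construction

Relevance model [Lavrenko et al, SIGIR01]

$$P(w, q_1 \ldots q_k) = \sum_{M \in \mathcal{M}} P(M) \, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$$

$$P(w \mid R) \approx P(w \mid q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$$

[Figure: the query terms q1, q2, q3 (“castle”, “Malahide”, “visit”) and the unknown word w (“???”) are each sampled from a model M]

Sample probabilities:

P(w|Q)  w
.077    malahide
.055    dublin
.034    town
.033    tourist
.027    castle
.011    beach
.010    marina
.010    north
.010    coastal
…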
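The relevance model estimate can be sketched over a tiny collection. The sketch assumes a uniform prior P(M) over documents, one smoothed unigram model per document (with an arbitrary `lam`), and query terms that occur somewhere in the collection; the sample documents are invented.

```python
from collections import Counter
from math import prod


def relevance_model(query, docs, lam=0.8):
    """Estimate P(w | q_1 ... q_k) by summing over one model M per document."""
    collection = [t for d in docs for t in d]
    coll_counts = Counter(collection)

    def p(w, doc):
        # Smoothed document model P(w | M); sums to one over the collection vocabulary.
        return lam * doc.count(w) / len(doc) + (1 - lam) * coll_counts[w] / len(collection)

    prior = 1.0 / len(docs)
    joint = {
        w: sum(prior * p(w, d) * prod(p(q, d) for q in query) for d in docs)
        for w in coll_counts
    }
    # Summing the joint over all words w gives P(q_1 ... q_k), the normaliser.
    evidence = sum(joint.values())
    return {w: value / evidence for w, value in joint.items()}


docs = [
    ["malahide", "castle", "dublin", "coastal", "town"],
    ["malahide", "marina", "beach", "dublin"],
    ["cork", "harbour", "city"],
]
rm = relevance_model(["malahide", "castle"], docs)
```

Words that co-occur with the query terms, such as "dublin", receive far more probability mass than words from unrelated documents, which is what makes the model useful for query expansion.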


Structured-content-based model construction

Edge-specific relevance model

• Given a query Q={q1,…,qn}, a set of resources (FR) is retrieved

– E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, p2m2,m3}

• Based on the FR results, an edge-specific RM_FR is constructed for each unique edge e:

[Bicer et al, CIKM11]


Structured-content-based model construction

Edge-specific resource model [Bicer et al, CIKM11]

• Edge-specific resource model:

– Smoothing with the model for the entire resource

• Use RM for query expansion: the score of a resource is calculated based on the cross-entropy of the edge-specific RM_FR and the edge-specific RM_r:

• α controls the importance of edges


Structured-content-based model construction

Scoring

• Ranking aggregated JRTs:

– The cross-entropy between the edge-specific RM_FR (query model) and the geometric mean of the combined edge-specific RM_JRT:

• The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k keyword search algorithms)

[Bicer et al, CIKM11]


Distance

• Distance is relative to the type of business in local search

– E.g. furniture vs. coffee shop

• Using backoff methods to enrich a result business with aggregate features from selected objects in a search log

• Different distance functions:

– Geographic distance, categorical distance, user distance

[Berberich et al, SIGIR’11]
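Of the three distance functions, only the geographic one has a standard closed form. The sketch below uses the haversine great-circle distance with a mean Earth radius of 6371 km; this is a common choice and not necessarily the one used in the paper. The categorical and user distances depend on the search log and are not reproduced.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))
```

A ranking function can then interpret the same number of kilometres differently per business type, e.g. tolerating a longer trip for a furniture store than for a coffee shop.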


Distance

[Berberich et al, SIGIR’11]


Local Popularity

• Mining traffic patterns of venues from 22 million Foursquare check-ins

– in 12 million venues

• 56% of which are assigned to categories

• 7.8% of which have at least one tag

• The traffic pattern (TP) of a venue is modeled as the check-in frequency for a series of time units (24 units for daily, 70 units for weekly TPs)

• Temporal correlation measure to detect similarity among the venues based on their traffic patterns

• Clustering the venues

– Using K-Means or EM

[Cheng et al, CIKM’11]
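A daily traffic pattern and a similarity between two venues can be sketched as below. Pearson correlation stands in for the paper's temporal correlation measure, whose exact definition is not given on the slide, and the check-in times are invented.

```python
import math
from datetime import datetime


def daily_pattern(checkins):
    """24-unit traffic pattern: share of a venue's check-ins falling in each hour of the day."""
    counts = [0] * 24
    for ts in checkins:
        counts[ts.hour] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 24


def pearson(x, y):
    """Pearson correlation of two equally long patterns (0.0 if either one is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def at(hour):
    return datetime(2013, 7, 1, hour, 0)  # arbitrary date; only the hour matters here


coffee_a = daily_pattern([at(8), at(8), at(9), at(13)])
coffee_b = daily_pattern([at(8), at(9), at(9), at(14)])
bar = daily_pattern([at(21), at(22), at(23), at(23)])
```

The two coffee shops peak in the morning and correlate positively, while the bar's evening pattern does not, so a clustering step such as K-Means over these vectors would tend to group the coffee shops together.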


Local Popularity

• Daily and Weekly Traffic Pattern for Walmart

• Top pairs of venues correlated based on traffic

[Cheng et al, CIKM’11]


Local Popularity

[Cheng et al, CIKM’11]

[Figure: traffic-pattern clusters: book shops and pharmacies; steakhouses; sub shops; coffee shops]


Context

• Hapori: Uses context to provide more relevant results

– Temporal, Weather, Spatial, Personal

• Analyzed 80,000 local search queries submitted to Mobile Bing Local from 11,000 users

• Data from search logs contained

– Query terms, POI identifier of click, user GPS location, timestamp, user id

– Timestamp used to extract weather data

• Goal: Improve the POI relevance based on a preference model of context

[Lane et al, Ubicomp’10]


Context

• POI Decision: Clicks

• Contextual Features:

– Rely on phone sensors

• Differences among people via a CSM

• Use features and CSM to train a classifier for each POI category

– KNN classifier

[Lane et al, Ubicomp’10]


Context

[Lane et al, Ubicomp’10]


Aggregation

• How to aggregate multiple aspects of relevance?

• Aspect ranking function (ARF):

– Maps a feature vector to an aspect relevance score

• Model aggregation function:

– Aggregates the estimated aspect relevances to a final relevance score

• Apply learning-to-rank to learn ARFs

[Kang et al, WSDM’12]
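The two-stage scheme can be sketched with hand-written functions. Everything numeric here is an assumption: the two ARFs, the per-category aggregation weights and the venue features are invented, whereas [Kang et al, WSDM’12] learn these functions with learning-to-rank.

```python
# Hypothetical aspect ranking functions: each maps a feature dict to a score in [0, 1].
ARFS = {
    "distance": lambda f: 1.0 / (1.0 + f["km"]),
    "reputation": lambda f: f["rating"] / 5.0,
}

# Hypothetical aggregation weights per query category.
AGGREGATION_WEIGHTS = {
    "coffee shop": {"distance": 0.3, "reputation": 0.7},
    "bank office": {"distance": 0.8, "reputation": 0.2},
}


def relevance(features, category):
    """Aggregate the aspect relevance scores into one final relevance score."""
    weights = AGGREGATION_WEIGHTS[category]
    return sum(weights[aspect] * arf(features) for aspect, arf in ARFS.items())


def rank(venues, category):
    """Return venue names ordered by decreasing aggregated relevance."""
    return sorted(venues, key=lambda name: relevance(venues[name], category), reverse=True)


venues = {
    "near_but_average": {"km": 0.2, "rating": 3.0},
    "far_but_excellent": {"km": 2.0, "rating": 4.8},
}
```

With these weights the well-rated venue wins for "coffee shop" and the nearby venue wins for "bank office", mirroring the reputation versus distance example given earlier in the deck.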


Future Directions

• More and more city data is becoming available, helping us to understand city dynamics

• The IR community can benefit from this to better answer the queries of people in everyday city life

• Collaboration

– “Build smarter cities, together”

• A toolbox of initial approaches has already been developed in different areas:

– Core IR, DM, Semantic Search, GIR etc.

• More focused research within city context


References

• Indexing

– Chun Chen, Feng Li, Beng C. Ooi, Sai Wu, TI: an efficient indexing mechanism for real-time search on tweets, SIGMOD 2011

– Thanh Tran, Günter Ladwig, Sebastian Rudolph, RDF Data Partitioning and Query Processing Using Structure Indexes, Transactions on Knowledge and Data Engineering, 2012.

– Cong, Gao, Christian S. Jensen, and Dingming Wu. Efficient retrieval of the top-k most relevant spatial web objects. Proceedings of the VLDB Endowment 2.1 (2009): 337-348.

• Query Processing

– Zhang, Wei Vivian, Benjamin Rey, Eugene Stipp, and Rosie Jones. Geomodification in Query Rewriting. In GIR. 2006.

– Gravano, Luis, Vasileios Hatzivassiloglou, and Richard Lichtenstein. Categorizing web queries according to geographical locality. Proceedings of the twelfth international conference on Information and knowledge management. ACM, 2003.

– Jones, Rosie, Benjamin Rey, Omid Madani, and Wiley Greiner. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pp. 387-396. ACM, 2006.

– Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano, Top-k exploration of query candidates for efficient keyword search on graph-shaped (rdf) data, ICDE, 2009.

– Bai, Jing, Jian-Yun Nie, Guihong Cao, and Hugues Bouchard. Using query contexts in information retrieval. In SIGIR, 2007.


References

• Retrieval

– Changsung Kang, Xuanhui Wang, Yi Chang, Belle Tseng, Learning to rank with multi-aspect relevance for vertical search, WSDM 2012

– Nicholas D Lane, Dimitrios Lymberopoulos, Feng Zhao, Andrew T. Campbell, Hapori: context-based local search for mobile phones using community behavioral modeling and similarity, Ubicomp, 2010.

– Klaus Berberich, Arnd C. König, Dimitrios Lymberopoulos, Peixiang Zhao, Improving local search ranking through external logs, SIGIR 2011.

– Cheng, Zhiyuan, et al. Toward traffic-driven location-based Web search. CIKM, 2011.

– Hristidis, Vagelis, Heasoo Hwang, and Yannis Papakonstantinou. Authority-based keyword search in databases. ACM Transactions on Database Systems (TODS) 33, no. 1 (2008): 1.

– Li, Guoliang, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD, 2008.

– Guo, Lin, Feng Shao, Chavdar Botev, and Jayavel Shanmugasundaram. XRANK: ranked keyword search over XML documents. In SIGMOD, 2003.

– Bicer, Veli, Thanh Tran, and Radoslav Nedkov. Ranking support for keyword search on structured data using relevance models. In CIKM, 2011.


References

• Other Tutorials

– Tony Russell-Rose, Designing Search Usability, Tutorial at SIGIR 2013.

– Tie-Yan Liu, Learning to Rank for Information Retrieval, Tutorial at SIGIR 2008

– Yi Zhang and Rong Jin, Supervised and Semi-Supervised Learning for IR, Tutorial at SIGIR 2008

– Peter Mika, Thanh Tran, Semantic Search on the Rise, Tutorial at SemTech 2013

– Thanh Tran, Semantic Search - Focus: IR on Structured Data, ESSIR, 2011.

– Yi Chen, Keyword Search on Structured and Semi-structured Data. Tutorial in SIGMOD, 2009


Acknowledgements

Spyros Kotoulas

Thanh Tran

Freddy Lecue

Giusy Di Lorenzo

Marco Luca Sbodio

Martin Stephenson

Pierpaolo Tommasi

Simone Tallevi-Diotallevi

Pol Mac Aonghusa
