DivQ: Diversification for Keyword Search over Structured Databases
Elena Demidova, Peter Fankhauser, Xuan Zhou and Wolfgang Nejdl
L3S Research Center, Hannover, Germany
Fraunhofer IPSI, Darmstadt, Germany
CSIRO ICT Centre, Australia
SIGIR 2010
2010. 12. 17. Jaehui Park
Copyright 2010 by CEBT
INTRODUCTION
Keyword search over structured data
– No single interpretation of a keyword query can satisfy all users.
– Multiple interpretations may yield overlapping results.
Diversification
– Minimizing the risk of user dissatisfaction by balancing the relevance and novelty of search results
An example
Query: "London"
– location: the capital of the UK
– name: a book written by Jack London
Each occurrence can be viewed as a keyword interpretation with different semantics, offering complementary results.
INTRODUCTION
Motivation
Taking advantage of the structure of the databases
– Query interpretation in terms of the underlying database
– To deliver more diverse and orthogonal representations of query results
  ex) attribute
Contributions
DivQ
– A probabilistic query disambiguation model
– A diversification scheme for generating top-k query interpretations
Evaluation metrics for structured data
– α-nDCG-W
– WS-recall
The Diversification Scheme
Query interpretations
– a keyword query -> a set of structured queries
Ranking the query interpretations
– Providing a quick overview of the available classes of results
– Faceted search: navigate and choose
Q: CONSIDERATION CHRISTOPHER GUEST

Top-3 interpretations: ranking
Relevance  Interpretation
0.9        A director CHRISTOPHER GUEST of a movie CONSIDERATION
0.8        An actor CHRISTOPHER GUEST in a movie CONSIDERATION
0.5        A director CHRISTOPHER GUEST

Top-3 interpretations: diversification (increasing novelty)
Relevance  Interpretation
0.9        A director CHRISTOPHER GUEST of a movie CONSIDERATION
0.4        An actor CHRISTOPHER GUEST
0.2        A plot containing CHRISTOPHER GUEST of a movie
The Diversification Scheme
Bringing Keywords into Structure
Keyword interpretations Ai:ki
– Mapping each keyword ki to an element Ai of an algebraic expression
– (Predefined) query template T joining the keyword interpretations: a structural pattern that is frequently used to query the database
– An example
  Keyword query (K): CONSIDERATION CHRISTOPHER GUEST
  director:CHRISTOPHER director:GUEST movie:CONSIDERATION
  T: A director X of a movie Y
The Diversification Scheme
Estimating Query Relevance
Relevance of a query interpretation Q to the informational need expressed by K
– P(Q|K) = P(I,T|K)
  T: query template, I: a set of keyword interpretations
– Assumptions
  Each keyword has one particular interpretation.
  The probability of a keyword interpretation is independent of the part of the query interpretation the keyword is not interpreted to.
– Attribute-specific term frequency (ex. the avg number of co-occurrences)
  ex) rank higher: a first name and a last name of a person both mapped to the attribute "name"
Formula annotations:
– the probability that, given that Aj is a part of a query interpretation, the keyword interpretation Aj:kj is also a part of the query interpretation
– smoothing factor
The Diversification Scheme
Estimating Query Similarity
– The Jaccard coefficient between the sets of keyword interpretations I contained in Q1 and Q2
Combining Relevance and Similarity
1. Select the most relevant interpretation as the first interpretation presented to the user
2. Select each of the following interpretations based on both its relevance and its novelty
(formula annotation: selected query interpretation set)
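The similarity measure can be illustrated with a minimal sketch. It assumes each query interpretation is represented as a Python set of `attribute:keyword` strings; this representation and the function name `jaccard_similarity` are chosen here for illustration and are not taken from the paper.

```python
def jaccard_similarity(i1: set, i2: set) -> float:
    """Jaccard coefficient between two sets of keyword interpretations."""
    union = i1 | i2
    if not union:
        return 0.0  # assumption: two empty interpretations are treated as dissimilar
    return len(i1 & i2) / len(union)

# Illustrative interpretations of the query CONSIDERATION CHRISTOPHER GUEST
q1 = {"director:CHRISTOPHER", "director:GUEST", "movie:CONSIDERATION"}
q2 = {"director:CHRISTOPHER", "director:GUEST"}
q3 = {"actor:CHRISTOPHER", "actor:GUEST"}

print(jaccard_similarity(q1, q2))  # q1 and q2 share two of three keyword interpretations
print(jaccard_similarity(q1, q3))  # q1 and q3 share none
```

Under this representation, two interpretations that map the same keywords to different attributes have no overlap and therefore count as fully novel with respect to each other.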
The Diversification Scheme
The diversification algorithm
– Materializing the top-k query interpretations by relevance
– The worst case: O(l*r)
  – l: the number of query interpretations in L
  – r: the number of query interpretations in the result list R
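The greedy selection can be sketched as follows. The scoring formula is an assumption: the slide's formula was not preserved, so this sketch uses an MMR-style score, `lam * relevance - (1 - lam) * average similarity to the already selected interpretations`, where `lam` is a hypothetical trade-off weight. The candidate sets are illustrative encodings of the example interpretations, and the early-termination logic of the original algorithm is not reproduced.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diversify(candidates, r, lam=0.5):
    """Greedily pick up to r interpretations from candidates.

    candidates: list of (relevance, set of keyword interpretations).
    lam: hypothetical trade-off weight (1.0 means pure relevance ranking).
    """
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    if not remaining or r <= 0:
        return []
    selected = [remaining.pop(0)]        # step 1: the most relevant interpretation
    sim_sum = [0.0] * len(remaining)     # running similarity to the selected set
    while remaining and len(selected) < r:
        last = selected[-1][1]
        best_idx, best_score = 0, float("-inf")
        for idx, (rel, interp) in enumerate(remaining):
            # one similarity computation per candidate per round
            sim_sum[idx] += jaccard(interp, last)
            score = lam * rel - (1 - lam) * sim_sum[idx] / len(selected)
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(remaining.pop(best_idx))
        sim_sum.pop(best_idx)
    return selected

candidates = [
    (0.9, {"director:CHRISTOPHER", "director:GUEST", "movie:CONSIDERATION"}),
    (0.8, {"actor:CHRISTOPHER", "actor:GUEST", "movie:CONSIDERATION"}),
    (0.5, {"director:CHRISTOPHER", "director:GUEST"}),
    (0.4, {"actor:CHRISTOPHER", "actor:GUEST"}),
    (0.2, {"plot:CHRISTOPHER", "plot:GUEST"}),
]
print([rel for rel, _ in diversify(candidates, 3, lam=1.0)])
print([rel for rel, _ in diversify(candidates, 3, lam=0.3)])
```

Because the similarity to the selected set is accumulated incrementally, each of the r rounds performs at most l similarity computations, which is in line with the O(l*r) worst case stated above (after the initial sort).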
EVALUATION METRICS
α-nDCG-W
Example gains
D1 D2 D3 D4 D5 D6
3  2  3  0  1  2
CGn (Cumulative Gain)
– ex) CG6 = 3+2+3+0+1+2 = 11
DCGi (Discounted Cumulative Gain)
– ex) DCG1 = 3, DCG2 = 3 + 2/log2(2) = 5, DCG3 = 3 + 2/log2(2) + 3/log2(3) ≈ 6.893
nDCGi = DCGi / ideal DCGi
α-nDCG
– Views a document as a set of information nuggets n
  Counts how many documents containing n were seen before and discounts the gain of this document accordingly
– If α = 0, it is the standard nDCG
– With increasing α, novelty is rewarded with more credit
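The CG, DCG and nDCG definitions can be checked with a short sketch. It assumes the discounting convention used in the example (the first gain is not discounted, later gains are divided by log2 of the rank); the function names are illustrative.

```python
import math

def dcg(gains):
    """Return DCG_i for every rank i of the given gain list."""
    total, out = 0.0, []
    for rank, gain in enumerate(gains, start=1):
        # rank 1 is undiscounted; later ranks are divided by log2(rank)
        total += gain if rank == 1 else gain / math.log2(rank)
        out.append(total)
    return out

def ndcg(gains):
    """Return nDCG_i = DCG_i / ideal DCG_i for every rank i."""
    ideal = dcg(sorted(gains, reverse=True))
    return [d / i if i > 0 else 0.0 for d, i in zip(dcg(gains), ideal)]

gains = [3, 2, 3, 0, 1, 2]                     # gains of D1..D6
print(sum(gains))                              # cumulative gain CG_6
print([round(x, 3) for x in dcg(gains)[:3]])   # DCG_1, DCG_2, DCG_3
print([round(x, 3) for x in ndcg(gains)[:3]])  # nDCG_1, nDCG_2, nDCG_3
```

The ideal DCG is obtained by sorting the same gains in descending order, so nDCG equals 1 at every rank only when the list is already ordered by gain.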
EVALUATION METRICS
α-nDCG-W
In databases
– An information nugget n corresponds to a primary key pki
The gain
The overlap
– For each primary key pki in the result of Qk
  Count how many query interpretations with pki were seen before, and aggregate the counts
(formula annotation: overlap factor)
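The idea of discounting by primary-key overlap can be sketched as follows. The exact gain and overlap formulas from the slide were not preserved, so this sketch makes an explicit assumption: the gain of an interpretation is its assessed relevance multiplied by the mean of (1 - α)^(number of earlier interpretations containing the key) over its primary keys. This is not necessarily the definition used in the paper, and normalization by an ideal ordering is omitted.

```python
import math

def alpha_dcg_w(results, alpha):
    """Discounted cumulative gain with a primary-key overlap discount.

    results: ranked list of (relevance, set of primary keys).
    alpha: novelty weight; alpha = 0 disables the overlap discount.
    """
    seen = {}              # primary key -> number of earlier interpretations containing it
    total, out = 0.0, []
    for rank, (rel, pks) in enumerate(results, start=1):
        if pks:
            # assumed overlap factor: mean novelty over the primary keys
            novelty = sum((1 - alpha) ** seen.get(pk, 0) for pk in pks) / len(pks)
        else:
            novelty = 0.0  # assumption: an empty result contributes no gain
        gain = rel * novelty
        total += gain if rank == 1 else gain / math.log2(rank)
        out.append(total)
        for pk in pks:
            seen[pk] = seen.get(pk, 0) + 1
    return out

# Hypothetical ranked interpretations with made-up primary keys
ranked = [(0.9, {1, 2}), (0.8, {1, 2}), (0.4, {3})]
print(alpha_dcg_w(ranked, alpha=0.0))
print(alpha_dcg_w(ranked, alpha=0.5))
```

With α = 0 the second interpretation keeps its full gain even though it returns the same tuples as the first; with α = 0.5 its gain is halved, which rewards orderings that surface new primary keys early.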
EVALUATION METRICS
Weighted S-Recall
S-recall
– Instance recall at rank k, used when search results are related to several subtopics
– The number of unique subtopics covered by the first k results, divided by the total number of subtopics
– A primary key corresponds to a subtopic in S-recall
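Plain S-recall as defined above can be sketched in a few lines, with each result represented as the set of primary keys (subtopics) it covers. The weighting used by the weighted variant is not given on the slide and is therefore not modelled; the function name and the sample data are illustrative.

```python
def s_recall_at_k(results, k, all_subtopics):
    """Fraction of all subtopics covered by the first k results."""
    if not all_subtopics:
        return 0.0  # assumption: recall is defined as 0 when there are no subtopics
    covered = set().union(*results[:k])
    return len(covered & all_subtopics) / len(all_subtopics)

# Hypothetical primary keys returned by three ranked query interpretations
results = [{1, 2}, {2, 3}, {4}]
all_subtopics = {1, 2, 3, 4, 5}

for k in (1, 2, 3):
    print(k, s_recall_at_k(results, k, all_subtopics))
```

A result that only repeats already covered primary keys leaves S-recall unchanged, which is why the metric favours diversified orderings.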
EXPERIMENTS
IMDB
– 10,000,000 records
Lyrics
– 400,000 records
Query logs
– MSN, AOL
– 200 most frequent queries (single query)
– 100 queries (complex queries)
EXPERIMENTS
User Study
16 participants were asked to assess relevance on a two-point Likert scale
– top-25 interpretations
EXPERIMENTS
α-nDCG-W
– α = 0, 0.5, and 0.99
EXPERIMENTS
WS-recall
Balancing Relevance and Novelty
CONCLUSION
We present an approach to search result diversification over structured data.
– a probabilistic query disambiguation model
– a query similarity measure
– a greedy algorithm
Adaptations of established evaluation metrics are proposed.
– α-nDCG-W and WS-recall
Evaluation results demonstrate the quality of the proposed model and show that our algorithms can substantially improve the novelty of keyword search results over structured data.