topical query decomposition
Topical Query Decomposition. Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis. Yahoo! Research, Barcelona, Spain. KDD 08.
![Page 1: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/1.jpg)
Topical Query Decomposition
Francesco Bonchi Carlos Castillo Debora Donato Aristides Gionis
Yahoo! Research, Barcelona, Spain
KDD 08
![Page 2: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/2.jpg)
Abstract
Given a query and a document retrieval system, produce a small set of queries whose union of resulting documents corresponds approximately to the result set of the original query.
Two formulations are considered:
Set cover problem: solved with a greedy algorithm.
Clustering problem: solved with a two-phase algorithm based on hierarchical agglomerative clustering and dynamic programming.
![Page 3: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/3.jpg)
Introduction
A query log L is a list of pairs <q, D(q)>, where q is a query and D(q) is its result, i.e. the set of documents that answer query q.
Q(q) is the maximal set of queries pi such that, for each pi, the set D(pi) has at least one document in common with the documents returned by q.
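The definition of Q(q) can be illustrated with a minimal sketch. The toy log, the document identifiers, and the function name `candidate_queries` are hypothetical, and excluding q itself from Q(q) is an assumption made here for illustration.

```python
from typing import Dict, Set


def candidate_queries(log: Dict[str, Set[str]], q: str) -> Set[str]:
    """Return Q(q): queries in the log whose results overlap D(q).

    The query q itself is excluded (an assumption of this sketch).
    """
    target = log[q]
    return {p for p, docs in log.items() if p != q and docs & target}


# Toy query log mapping each query to D(query); all values are made up.
toy_log = {
    "jaguar": {"d1", "d2", "d3", "d4"},
    "jaguar car": {"d1", "d2", "d9"},
    "jaguar animal": {"d3", "d4"},
    "python": {"d7"},
}

print(sorted(candidate_queries(toy_log, "jaguar")))
```

A real log with millions of queries would need an inverted index from documents to queries rather than this linear scan.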
![Page 4: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/4.jpg)
![Page 5: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/5.jpg)
The goal is to compute a cover: select a subcollection C ⊆ Q(q7) that covers almost all of D(q7).
![Page 6: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/6.jpg)
Problem Statement – 1/3
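Red-blue set cover problem, for a query q:
$U = \{b_1, \dots, b_n, r_1, \dots, r_m\}$
$B = \{b_1, \dots, b_n\}$: the blue points, i.e. the documents in D(q)
$R = \{r_1, \dots, r_m\}$: the red points, i.e. the documents outside D(q) that are returned by the candidate queries
$S = \{S_1, \dots, S_k\}$: the candidate sets, provided by the query log L
$S_i \subseteq U$
$S_i^B$: blue points in $S_i$ ($S_i^B = S_i \cap B$)
$S_i^R$: red points in $S_i$ ($S_i^R = S_i \cap R$)
Goal: find a subcollection C ⊆ S that covers many blue points of U without covering too many red points.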
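As a small illustration of the blue/red split, the sketch below partitions one candidate set into its blue and red parts. It assumes that the blue points are the documents of D(q) and that every other document of the candidate set is red; the identifiers and the helper name `split_blue_red` are hypothetical.

```python
from typing import Set, Tuple


def split_blue_red(s_i: Set[str], blue: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return the blue part (S_i intersected with B) and the red part (the rest)."""
    return s_i & blue, s_i - blue


blue = {"d1", "d2", "d3", "d4"}   # B: documents returned by the original query
s_car = {"d1", "d2", "d9"}        # result set of one candidate query

s_blue, s_red = split_blue_red(s_car, blue)
print(sorted(s_blue), sorted(s_red))
```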
![Page 7: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/7.jpg)
Problem Statement – 2/3
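For each query q, the candidate sets come from the candidate queries Q(q).
Each set $S_i$, with its blue and red points, is weighted by its scatter $\mathrm{sc}(S_i)$ (coherence is the opposite of scatter):
$$\mathrm{sc}(S_i) = \min_{u \in S_i} \sum_{v \in S_i} d(u, v)^2$$
Blue points can be weighted by clicks: the weight $w(b)$ of a document b is logarithmic (base 2) in $\mathrm{clicks}(b, q)$, and the weighted size of a set is
$$|S_i|_w = \sum_{b \in S_i^B} w(b)$$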
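A minimal sketch of the scatter as the smallest total squared distance from one point of the set to all points of the set, together with a weighted set size. The Euclidean distance on toy 2-D vectors and the precomputed weight table are assumptions made for illustration; the slides do not fix either here.

```python
from typing import Dict, Iterable, Sequence


def scatter(points: Sequence[Sequence[float]]) -> float:
    """Minimum over u of the sum over v of d(u, v)^2, with Euclidean d (assumed)."""
    def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Raises ValueError on an empty set, for which scatter is undefined.
    return min(sum(sq_dist(u, v) for v in points) for u in points)


def weighted_size(blue_points: Iterable[str], weight: Dict[str, float]) -> float:
    """Sum of the weights w(b) over the blue points of a set."""
    return sum(weight[b] for b in blue_points)


docs = [(0, 0), (2, 0), (1, 0)]        # toy document vectors
print(scatter(docs))                   # best center is (1, 0)

toy_weights = {"b1": 1.0, "b2": 2.5, "b3": 4.0}   # hypothetical click-based weights
print(weighted_size({"b1", "b2"}, toy_weights))
```

A tight set of near-duplicate documents has scatter close to zero, which is why low scatter is read as high coherence.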
![Page 8: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/8.jpg)
Problem Statement – 3/3
Our goal is to find a subcollection C ⊆ S that covers almost all the blue points of U and has large coherence.
More precisely, we want C to satisfy the following properties: cover-blue, not-cover-red, small-overlap, and coherence.
![Page 9: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/9.jpg)
Greedy Algorithm – 1/2
At the i-th iteration, the algorithm selects the set S that minimizes the score $s(S, V_B, V_R)$.
The parameters C, R and O weight the relative importance of the three terms.
$V_B$: the blue points already covered by the sets selected in previous iterations.
$V_R$: the red points already covered by the sets selected in previous iterations.
D. Peleg. Approximation algorithms for the Label-CoverMAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 2007.
![Page 10: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/10.jpg)
Greedy Algorithm – 2/2
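The greedy procedure can be sketched as follows. The exact score s(S, V_B, V_R) is not reproduced in this transcript, so the ratio used below (weighted scatter, newly covered red points and overlap with already covered blue points, divided by the number of newly covered blue points) is an assumption, as are the names `greedy_cover`, `lam_c`, `lam_r`, `lam_o` and the toy data.

```python
from typing import Dict, List, Set


def greedy_cover(sets: Dict[str, Set[int]], blue: Set[int],
                 scatter: Dict[str, float],
                 lam_c: float = 1.0, lam_r: float = 1.0, lam_o: float = 1.0,
                 target: float = 1.0) -> List[str]:
    """Pick sets greedily until a fraction `target` of the blue points is covered."""
    covered_blue: Set[int] = set()   # V_B
    covered_red: Set[int] = set()    # V_R
    remaining = dict(sets)
    chosen: List[str] = []
    while remaining and len(covered_blue) < target * len(blue):
        best, best_score = None, float("inf")
        for name, s in remaining.items():
            new_blue = (s & blue) - covered_blue
            if not new_blue:
                continue  # a set that adds no blue point is never useful
            new_red = (s - blue) - covered_red
            overlap = s & covered_blue
            # Assumed score: cost of the three terms per newly covered blue point.
            score = (lam_c * scatter[name] + lam_r * len(new_red)
                     + lam_o * len(overlap)) / len(new_blue)
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break  # no remaining set covers a new blue point
        s = remaining.pop(best)
        chosen.append(best)
        covered_blue |= s & blue
        covered_red |= s - blue
    return chosen


blue_docs = {1, 2, 3, 4}
toy_sets = {"A": {1, 2}, "B": {3, 4, 9}, "C": {1, 2, 3, 4, 7, 8, 9}}
toy_scatter = {"A": 0.5, "B": 0.5, "C": 5.0}
print(greedy_cover(toy_sets, blue_docs, toy_scatter))
```

In this toy run the broad, incoherent set C is never selected because two focused sets cover the same blue points at lower cost.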
![Page 11: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/11.jpg)
Integer Programming
$S_1 + S_2 + \dots + S_l \le 10$
$S_i \le 1$
![Page 12: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/12.jpg)
Clustering-Based Method
Two-phase approach:
First phase: all points in set B are clustered using a hierarchical agglomerative clustering algorithm (CLUTO toolkit).
Second phase: the clusters of the hierarchy produced by the agglomerative algorithm are matched with the sets of S.
The main idea is to match sets of S to clusters of the hierarchy. Every node T of the hierarchy corresponds to a cluster.
Let T(B) be the set of points of B in the cluster of node T.
![Page 13: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/13.jpg)
Clustering-Based Method
Dendrogram
![Page 14: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/14.jpg)
Clustering-Based Method – Dynamic Programming – 1/2
Complete coverage:
For each set S in the collection S and each node T of the hierarchy, compute a matching score m(T, S).
m*(T): the score of the best-matching set in S for node T.
Dynamic programming gives the optimal cost of covering the points of T(B) with sets in S.
![Page 15: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/15.jpg)
Clustering-Based Method – Dynamic Programming – 2/2
Partial coverage:
The parameter U weights the relative importance of the two terms: the scatter cost of the selected sets and the number of uncovered points.
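The dynamic program over the dendrogram can be sketched under stated assumptions, since the recurrences themselves are not reproduced in this transcript. The sketch assumes that a node is either matched to the cheapest set containing all of its points, or split into its two children, or (for partial coverage) left uncovered at a price of `lam_u` per point. The `Node` class, the matching rule, and all names and costs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

INF = float("inf")


@dataclass
class Node:
    points: FrozenSet[str]            # T(B): blue points under this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def best_match(node: Node, sets: Dict[str, Set[str]],
               cost: Dict[str, float]) -> Tuple[float, Optional[str]]:
    """Assumed m*(T): the cheapest set that contains every point of T(B)."""
    best: Tuple[float, Optional[str]] = (INF, None)
    for name, s in sets.items():
        if node.points <= s and cost[name] < best[0]:
            best = (cost[name], name)
    return best


def cover(node: Node, sets: Dict[str, Set[str]], cost: Dict[str, float],
          lam_u: Optional[float] = None) -> Tuple[float, List[str]]:
    """Return (cost, chosen sets) for the subtree; lam_u=None means complete coverage."""
    options: List[Tuple[float, List[str]]] = []
    m, name = best_match(node, sets, cost)
    if name is not None:
        options.append((m, [name]))                      # match this node directly
    if node.left is not None and node.right is not None:
        lc, ls = cover(node.left, sets, cost, lam_u)
        rc, rs = cover(node.right, sets, cost, lam_u)
        options.append((lc + rc, ls + rs))               # cover the children separately
    if lam_u is not None:
        options.append((lam_u * len(node.points), []))   # leave these points uncovered
    return min(options, key=lambda o: o[0]) if options else (INF, [])


leaves = {b: Node(frozenset({b})) for b in ("b1", "b2", "b3", "b4")}
n12 = Node(frozenset({"b1", "b2"}), leaves["b1"], leaves["b2"])
n34 = Node(frozenset({"b3", "b4"}), leaves["b3"], leaves["b4"])
root = Node(frozenset({"b1", "b2", "b3", "b4"}), n12, n34)

toy_sets = {"A": {"b1", "b2", "r1"}, "B": {"b3"},
            "C": {"b1", "b2", "b3", "b4", "r2", "r3"}}
toy_cost = {"A": 1.0, "B": 0.5, "C": 10.0}

print(cover(root, toy_sets, toy_cost))              # complete coverage
print(cover(root, toy_sets, toy_cost, lam_u=2.0))   # partial coverage
```

With complete coverage only the expensive set C can reach b4, while with the uncovered-point penalty the cheaper sets A and B are chosen and b4 is left out. The sketch does not merge duplicate selections of one set across subtrees, which a fuller implementation would need to handle.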
![Page 16: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/16.jpg)
Application
Query log L: 2.9 million distinct queries.
A majority of users only look at the first page of results, while few users request more result pages.
D(q): take the deepest result page to which any user asking for q in the query log navigated, and consider the set of result documents for the query up to that page.
24 million distinct documents seen by the users.
![Page 17: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/17.jpg)
Application - Candidate queries for the cover
For each query q, the candidate queries Qk(q)
![Page 18: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/18.jpg)
Application - Results
A set of 100 queries was randomly picked from the top 10,000 queries submitted by users.
Measures:
Cost of k queries
The number of documents included outside the set D(q)
Average number of queries covering each element
Coverage after the top k candidates have been picked
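Three of these measures can be computed from a selected cover with a few set operations. This is a sketch only: the function name `cover_stats` and the toy data are hypothetical, and averaging the number of covering queries over the covered documents of D(q) is an assumption about how "each element" is counted.

```python
from typing import Dict, List, Set


def cover_stats(chosen: List[str], sets: Dict[str, Set[int]],
                blue: Set[int]) -> Dict[str, float]:
    """Coverage, red-document count, and average overlap for the chosen queries."""
    covered = set().union(*(sets[name] for name in chosen))
    covered_blue = covered & blue
    # For each covered document of D(q), count how many chosen queries return it.
    counts = [sum(1 for name in chosen if b in sets[name]) for b in covered_blue]
    return {
        "coverage": len(covered_blue) / len(blue),
        "red_documents": len(covered - blue),
        "avg_queries_per_covered_doc": sum(counts) / len(counts) if counts else 0.0,
    }


d_q = {1, 2, 3, 4}                       # D(q) for the original query
results = {"A": {1, 2}, "B": {2, 3, 9}}  # result sets of the selected queries

stats = cover_stats(["A", "B"], results, d_q)
print(stats)
```

Calling the function on the first k selected queries for increasing k gives the coverage curve after the top k candidates.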
![Page 19: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/19.jpg)
![Page 20: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/20.jpg)
![Page 21: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/21.jpg)
![Page 22: Topical Query Decomposition](https://reader035.vdocuments.us/reader035/viewer/2022062421/56812b25550346895d8f2761/html5/thumbnails/22.jpg)
Conclusions
A novel problem: topical query decomposition.
Elegant solutions: red-blue metric set cover, and clustering with predefined clusters (hierarchical agglomerative clustering).
The set-cover formulation provides solutions of better quality.
Code and data for reproducing the results shown in Table 3 are available at http://www.yr-bcn.es/querydecomp/ .