 Web Search Result Diversification Farhan Ahmad Apr 2011


8/6/2019 Web Search Result Diversification

http://slidepdf.com/reader/full/web-search-result-diversification 1/70

 

Web Search Result Diversification

Farhan Ahmad

Apr 2011

Original Paper

Diversifying Search Results

Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, Samuel Ieong
Search Labs, Microsoft Research

Second ACM International Conference on Web Search and Data Mining (WSDM 2009)
Barcelona, Spain, February 9-12, 2009

Abstract

Query terms given by users are often ambiguous.

Search engines should diversify the search results to minimize the risk of dissatisfaction of the average user.

The authors present a systematic approach for measuring the diversity of search results, and an algorithm that maximizes the diversity of a subset of the search results.

Introduction: Ambiguous queries

Consider the search term 'FLASH'.

It can have several interpretations:

− Flash player
− Flash floods
− Flash Gordon (an adventure hero)

Suppose Flash player is searched most often.

Most of the top results returned for the query 'FLASH' will belong to this category.

This is because search engines rank on the basis of similarity, and make no explicit attempt to diversify the documents.

Introduction: Relevance of search results

The basic premise is: “The relevance of a set of documents depends not only on the individual relevance of its members, but also on how they relate to each other.”
(Jaime G. Carbonell and Jade Goldstein, "The use of MMR, diversity-based re-ranking for reordering documents and producing summaries")

Ideally, the result set should properly account for the interests of the overall user population.

Introduction: Basic Assumption #1

A taxonomy of information exists at the topical level.

A document can belong to one or more categories, and so can a query.

Introduction: Basic Assumption #2

Usage statistics are available for user intents.

Example: When searching for 'FLASH', 65% of users intended to find 'Flash Player', 15% were looking for 'Recent flash floods' and 5% were looking for 'Flash Gordon'.

Introduction: Defining the objective

To maximize the relevance of a result document set based on the individual relevance of its members and their diversity.

Formalization of Notation

The set of categories to which a document d belongs is denoted by C(d).

The set of categories to which a query q belongs is denoted by C(q).

Example:

− C(q='FLASH') = { 'flash floods', 'Flash player', 'Flash Gordon' }
− C(d='FLASH') = { 'flash floods', 'Flash player', 'Flash Gordon', 'Flash village' }

C(d) ∩ C(q) may be empty.

The probability that a given query q belongs to a category c is denoted by P(c|q). It is called the user intent for query q and category c.

Assumption: Our knowledge is complete, that is,

$$\sum_{c \in C(q)} P(c|q) = 1$$

Informally, this means that given a query q, we have an exhaustive list of all the categories to which the query could belong.

V(d|q,c) is defined as the relevance value of document d for query q, when the intended category of q is c.

If we constrain V to [0,1], it represents the probability that document d satisfies user query q with intended category c.

V can be obtained by multiplying the query-document similarity by the probability that document d belongs to category c.

− V(d|q,c) = Similarity(d,q) * P(c|d)
− Here P(c|d) can be computed by the classifier algorithm, e.g. as a confidence value.
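The product form of V can be illustrated with a minimal Python sketch. The function name `relevance_value`, the dictionary layout, and all scores are assumptions made for illustration; the slides do not prescribe a particular similarity measure or classifier.

```python
def relevance_value(similarity, category_probs, category):
    """Return V(d|q,c) = Similarity(d,q) * P(c|d).

    category_probs is a hypothetical classifier output mapping each
    category of the document to a confidence P(c|d); categories the
    classifier did not assign are treated as probability 0.
    """
    return similarity * category_probs.get(category, 0.0)


# Hypothetical document about the Flash player, with made-up scores.
doc_category_probs = {"Flash player": 0.9, "Flash Gordon": 0.1}
v_player = relevance_value(0.8, doc_category_probs, "Flash player")
v_floods = relevance_value(0.8, doc_category_probs, "Flash floods")
print(v_player, v_floods)
```

Because both factors lie in [0,1], the product stays in [0,1], which is what allows V to be read as a probability.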

Assumption: Given a query q and a category of intent c, the relevance values of two documents are independent.

− $V(d_1|q,c)$ and $V(d_2|q,c)$ are independent.

Formalizing the objective function

Suppose users only consider the top k results of a search engine.

We can then rephrase the objective “To maximize the relevance of a result document set based on the individual relevance of its members and their diversity” as:

The objective is to maximize the probability that an average user finds at least one relevant result among the top k.

Formally, given

− a query q,
− a set of documents D,
− a distribution of category of intent P(c|q),
− the relevance V(d|q,c) of each document d ∈ D,

we want to find the set S ⊆ D of top k results (|S| = k) such that

$$P(S|q) = \sum_{c} P(c|q) \left( 1 - \prod_{d \in S} \bigl(1 - V(d|q,c)\bigr) \right)$$

is maximized.
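The objective can be evaluated directly for a candidate result set. The sketch below is one possible encoding, not the authors' code: `diversity`, the `(document, category)` keys of `V`, and all numbers are hypothetical, and a missing key is assumed to mean V(d|q,c) = 0.

```python
from math import prod


def diversity(S, intent, V):
    """P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c))).

    intent maps category -> P(c|q); V maps (doc, category) -> relevance.
    """
    return sum(
        p_c * (1.0 - prod(1.0 - V.get((d, c), 0.0) for d in S))
        for c, p_c in intent.items()
    )


# Hypothetical intent distribution and relevance values.
intent = {"player": 0.7, "floods": 0.2, "gordon": 0.1}
V = {("d1", "player"): 0.8, ("d2", "player"): 0.5, ("d3", "floods"): 0.6}

same_category = diversity(["d1", "d2"], intent, V)   # both about the player
two_categories = diversity(["d1", "d3"], intent, V)  # player + floods
print(same_category, two_categories)
```

With these made-up values the mixed set scores higher than the single-category set, which is the behaviour the objective is meant to reward.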

Origin of the objective function

For d ∈ S, V(d|q,c) is the probability that document d from our result subset S satisfies a user query q having intended category c.

For d ∈ S, 1 − V(d|q,c) is the probability that document d from our result subset S does not satisfy a user query q having intended category c.

$\prod_{d \in S} (1 - V(d|q,c))$ is the probability that no document d from our result subset S satisfies user query q having intended category c. This step uses the assumption that the relevance values of different documents are independent.

$1 - \prod_{d \in S} (1 - V(d|q,c))$ is the probability that at least one document d from our result subset S satisfies user query q having intended category c.
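As a small worked instance, assume a result subset S = {d_1, d_2} with illustrative (made-up) values V(d_1|q,c) = 0.5 and V(d_2|q,c) = 0.4 for some category c:

```latex
\prod_{d \in S} \bigl(1 - V(d|q,c)\bigr) = (1 - 0.5)(1 - 0.4) = 0.30,
\qquad
1 - \prod_{d \in S} \bigl(1 - V(d|q,c)\bigr) = 1 - 0.30 = 0.70
```

So under these assumed values, a user whose intent is c finds at least one satisfying document in S with probability 0.70.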

Therefore,

$$P(c|q) \left( 1 - \prod_{d \in S} \bigl(1 - V(d|q,c)\bigr) \right)$$

gives the probability that query q has intended category c and is satisfied by at least one document from our result subset S.

If we sum this quantity over the different categories $\{c_1, c_2, \dots, c_r\}$, we find the probability that a user query belonging to any of these categories is satisfied by at least one document from our result subset S.

Therefore, by defining

$$P(S|q) = \sum_{c} P(c|q) \left( 1 - \prod_{d \in S} \bigl(1 - V(d|q,c)\bigr) \right)$$

and trying to maximize P(S|q), we are trying to maximize the chances that an average user is satisfied.

That is, P(S|q) measures the diversity of result subset S for a query q.

Formalizing the problem

Given a result set D, find a set S ⊆ D, |S| = k, whose diversity

$$P(S|q) = \sum_{c} P(c|q) \left( 1 - \prod_{d \in S} \bigl(1 - V(d|q,c)\bigr) \right)$$

is the maximum over all such S.

This problem is formally written as Diversify(k).

Caveats

Diversify(k) does not try to cover all the categories.

− While trying to maximize $\sum_{c} P(c|q) \left( 1 - \prod_{d \in S} (1 - V(d|q,c)) \right)$,
− and maintaining |S| = k,

we might need to exclude all documents from some category $c_r$.

It might even be that P(S|q) is maximized by taking all k documents from a single category, say $c_3$. All other categories are left out in such a case.
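The caveat can be reproduced on a toy instance by brute force. This is a hedged sketch: the intent distribution, the relevance values and the helper name `diversity` are all invented for illustration, with one dominant category and one rare one.

```python
from itertools import combinations
from math import prod


def diversity(S, intent, V):
    """P(S|q) for a set S, with missing (doc, category) pairs treated as 0."""
    return sum(
        p_c * (1.0 - prod(1.0 - V.get((d, c), 0.0) for d in S))
        for c, p_c in intent.items()
    )


# Hypothetical query where 90% of users mean category c1.
intent = {"c1": 0.9, "c2": 0.1}
V = {("d1", "c1"): 0.5, ("d2", "c1"): 0.5, ("d3", "c2"): 0.5}
docs = ["d1", "d2", "d3"]

# Exhaustively evaluate every subset of size k = 2.
best = max(combinations(docs, 2), key=lambda S: diversity(S, intent, V))
print(best, diversity(best, intent, V))
```

Here the best pair consists of the two c1 documents, so category c2 is not represented at all even though a c2 document was available.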

Problems with Diversify(k)

Diversify(k) does not say anything about the ordering of the result subset S.

Diversify(k) is NP-hard (shown by a reduction from the maximum coverage problem).

A greedy algorithm for Diversify(k)

The authors propose a greedy algorithm for Diversify(k) that uses the concept of marginal utility to diversify as well as re-rank the search results.

This algorithm maximizes P(S|q) when every document belongs to just one category; otherwise it optimizes P(S|q) with bounded error.

Notation

U(c|q,S) is the probability that a query q belongs to the category c, given that all documents in the set S fail to satisfy the user.

− Initially S = ∅
− and we define U(c|q,∅) = P(c|q)

We define the marginal utility g of a document d as the product of its relevance value V with the conditional distribution of categories U, summed over the categories of d:

$$g(d|q,c,S) = \sum_{c \in C(d)} U(c|q,S) \, V(d|q,c)$$

− It is the probability that document d satisfies the user when all documents that come before it fail to do so.

Greedy algorithm IA-Select

Inputs: k, q, C(q), D, C(d), P(c|q), V(d|q,c)
Output: S ⊆ D, |S| = k

 1: S = ∅
 2: ∀c, U(c|q,S) = P(c|q)
 3: WHILE |S| < k
 4:   FOR d ∈ D DO
 5:     g(d|q,c,S) = ∑_{c ∈ C(d)} U(c|q,S) · V(d|q,c)
 6:   ENDFOR
 7:   d* = argmax_d g(d|q,c,S)
 8:   ∀c ∈ C(d*), U(c|q,S) = (1 − V(d*|q,c)) · U(c|q,S)
 9:   S = S ∪ {d*}
10:   D = D − {d*}
11: ENDWHILE
12: RETURN S
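The pseudocode translates fairly directly into Python. The following is a sketch under stated assumptions, not the authors' implementation: `ia_select`, the `(doc, category)` dictionary for V, and the sample data are hypothetical, and C(d) is represented implicitly by treating missing pairs as V = 0. The comments refer to the numbered steps of the pseudocode.

```python
def ia_select(k, docs, intent, V):
    """Greedy selection of k documents by marginal utility.

    intent maps category -> P(c|q); V maps (doc, category) -> relevance
    in [0, 1]. Returns the selected documents in selection order.
    """
    U = dict(intent)                  # step 2: U(c|q,S) = P(c|q)
    remaining = list(docs)
    S = []                            # step 1
    while len(S) < k and remaining:   # step 3
        # steps 4-7: marginal utility of each candidate, then argmax
        g = {d: sum(U[c] * V.get((d, c), 0.0) for c in U) for d in remaining}
        d_star = max(remaining, key=g.get)
        # step 8: discount the categories d_star is likely to satisfy
        for c in U:
            U[c] *= 1.0 - V.get((d_star, c), 0.0)
        S.append(d_star)              # step 9
        remaining.remove(d_star)      # step 10
    return S                          # step 12


# Hypothetical data for the query 'FLASH'.
intent = {"player": 0.7, "floods": 0.2, "gordon": 0.1}
V = {("d1", "player"): 0.8, ("d2", "player"): 0.5,
     ("d3", "floods"): 0.6, ("d4", "gordon"): 0.9}
selected = ia_select(3, ["d1", "d2", "d3", "d4"], intent, V)
print(selected)
```

In this made-up example the second player document d2 is passed over after d1 is chosen, because the update in step 8 shrinks the remaining weight of the player category. Keeping S as a list also records the order of selection, which gives a ranking of the chosen documents.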

Proofs

Earlier we claimed that

– IA-Select maximizes the diversity P(S|q) if every document belongs to exactly one category.
– Otherwise, IA-Select optimizes P(S|q) with bounded error.

Basis for proofs

To prove our claims, we need to understand the concept of submodularity.

• We first define submodularity.
• Then we prove that P(S|q) is submodular.
• Then we prove our claims.

Submodularity

It is known as the principle of diminishing marginal utility in economics.

• In our context, the marginal benefit of adding a document to a larger collection is no greater than that of adding the same document to a smaller collection.
• Formal definition follows.

If N is a set and f is a set function $f: 2^N \to \mathbb{R}$, then f is submodular if and only if, for all S ⊆ T ⊆ N and all d ∈ N with d ∉ T,

$$f(S \cup \{d\}) - f(S) \ge f(T \cup \{d\}) - f(T)$$

We have chosen two subsets S and T of the domain N, with S contained in T, and a new element d in N which has not yet been added to either S or T.

We evaluate the change in the value of f due to the addition of d, for both S and T.

• f(S ∪ {d}) – f(S) is the marginal utility gained from adding d to the smaller set S.
• f(T ∪ {d}) – f(T) is the marginal utility gained from adding d to the larger set T.

If the inequality above holds for every such S, T and d, f is said to be submodular.

P(S|q) is submodular

• Let S and T, with S ⊆ T ⊆ D, be two sets of documents.
• Let e ∈ D be a document such that e ∉ T.
• Let S' = S ∪ {e} and T' = T ∪ {e}.

• Then

$$\begin{aligned}
P(S'|q) - P(S|q)
&= \sum_{c} P(c|q) \left[ \Bigl(1 - \prod_{d \in S'} (1 - V(d|q,c))\Bigr) - \Bigl(1 - \prod_{d \in S} (1 - V(d|q,c))\Bigr) \right] \\
&= \sum_{c} P(c|q) \left[ \prod_{d \in S} (1 - V(d|q,c)) - \prod_{d \in S'} (1 - V(d|q,c)) \right] \\
&= \sum_{c} P(c|q) \left[ \prod_{d \in S} (1 - V(d|q,c)) - \prod_{d \in S} (1 - V(d|q,c)) \cdot (1 - V(e|q,c)) \right] \\
&= \sum_{c} P(c|q) \Bigl( \prod_{d \in S} (1 - V(d|q,c)) \Bigr) V(e|q,c)
\end{aligned}$$

• Similarly,

$$P(T'|q) - P(T|q) = \sum_{c} P(c|q) \Bigl( \prod_{d \in T} (1 - V(d|q,c)) \Bigr) V(e|q,c)$$

• Because S ⊆ T, for every category c

$$\prod_{d \in S} (1 - V(d|q,c)) \ge \prod_{d \in T} (1 - V(d|q,c))$$

since both sides are products of factors in [0,1], and the right-hand product contains all the factors of the left-hand product in addition to its own.

• Therefore

$$P(S'|q) - P(S|q) \ge P(T'|q) - P(T|q)$$

• Hence P(S|q) is submodular.
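The inequality can also be checked numerically on a small instance by enumerating every pair S ⊆ T and every document e outside T. This brute-force check is only a sanity test on made-up data, not a substitute for the proof; `diversity`, `is_submodular` and the values in `V` are assumptions for illustration.

```python
from itertools import combinations
from math import prod


def diversity(S, intent, V):
    """P(S|q), treating missing (doc, category) pairs as relevance 0."""
    return sum(
        p_c * (1.0 - prod(1.0 - V.get((d, c), 0.0) for d in S))
        for c, p_c in intent.items()
    )


def is_submodular(docs, intent, V, tol=1e-12):
    """Check f(S+e) - f(S) >= f(T+e) - f(T) for all S <= T and e not in T."""
    subsets = [set(s) for r in range(len(docs) + 1)
               for s in combinations(docs, r)]
    for T in subsets:
        for S in subsets:
            if not S <= T:
                continue
            for e in docs:
                if e in T:
                    continue
                gain_S = diversity(S | {e}, intent, V) - diversity(S, intent, V)
                gain_T = diversity(T | {e}, intent, V) - diversity(T, intent, V)
                if gain_S < gain_T - tol:
                    return False
    return True


# Hypothetical instance in which some documents span two categories.
intent = {"player": 0.7, "floods": 0.2, "gordon": 0.1}
V = {("d1", "player"): 0.8, ("d1", "gordon"): 0.3, ("d2", "player"): 0.5,
     ("d3", "floods"): 0.6, ("d4", "gordon"): 0.9, ("d4", "floods"): 0.2}
print(is_submodular(["d1", "d2", "d3", "d4"], intent, V))
```

The enumeration grows exponentially with the number of documents, so it is only practical for a handful of them.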

Proof #1

• Formally, we have to prove that IA-Select maximizes P(S|q) if ∀d ∈ D, |C(d)| = 1.

• In this case, step 5 of the algorithm becomes

– g(d|q,c,S) = U(c|q,S) · V(d|q,c), where c is the single category of d,

instead of

$$g(d|q,c,S) = \sum_{c \in C(d)} U(c|q,S) \, V(d|q,c)$$

• Step 8 of the algorithm becomes

– for c = C(d*): U(c|q,S) = (1 – V(d*|q,c)) · U(c|q,S)

instead of

– ∀c ∈ C(d*): U(c|q,S) = (1 – V(d*|q,c)) · U(c|q,S)

• Because U(c|q,S) = P(c|q) at the outset, as documents $\{d_1, d_2, \dots\}$ get added to S, step 8 updates U as

– for c = C(d_1):

$$U(c|q,S) = (1 - V(d_1|q,c)) \, U(c|q,\emptyset) = (1 - V(d_1|q,c)) \, P(c|q)$$

– for c = C(d_2):

$$U(c|q,S) = (1 - V(d_2|q,c)) (1 - V(d_1|q,c)) \, U(c|q,\emptyset) = (1 - V(d_2|q,c)) (1 - V(d_1|q,c)) \, P(c|q)$$

• Therefore, in the kth iteration of IA-Select, steps 4 and 5 compute g as

FOR d ∈ D DO

$$g(d|q,c,S) = \Bigl( \prod_{d' \in S} (1 - V(d'|q,c)) \, P(c|q) \Bigr) V(d|q,c) = P(c|q) \Bigl( \prod_{d' \in S} (1 - V(d'|q,c)) \Bigr) V(d|q,c)$$

• From the result for submodularity of P(S|q) we know that, for every document d,

$$\sum_{c} g(d|q,c,S) = P(S \cup \{d\}|q) - P(S|q)$$

– Because g(d|q,c,S) is non-zero for exactly one category c, the sum can be removed:

$$g(d|q,c,S) = P(S \cup \{d\}|q) - P(S|q)$$

• At the start S = ∅. When we have added the required k documents to S,

$$\begin{aligned}
g(d_1|q,c,\emptyset) &= P(\{d_1\}|q) - P(\emptyset|q) \\
g(d_2|q,c,\{d_1\}) &= P(\{d_1,d_2\}|q) - P(\{d_1\}|q) \\
&\;\;\vdots \\
g(d_k|q,c,\{d_1,\dots,d_{k-1}\}) &= P(S|q) - P(\{d_1,\dots,d_{k-1}\}|q)
\end{aligned}$$

• The sum telescopes:

$$g(d_1|q,c,\emptyset) + g(d_2|q,c,\{d_1\}) + \dots + g(d_k|q,c,\{d_1,\dots,d_{k-1}\}) = P(S|q) - P(\emptyset|q) = P(S|q)$$

since P(∅|q) = 0 (the empty product equals 1).
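The telescoping identity can be checked numerically: the marginal utilities collected during a greedy run should add up to P(S|q) of the final set. The sketch below uses hypothetical names (`diversity`, `greedy_with_gains`) and made-up single-category data.

```python
from math import prod


def diversity(S, intent, V):
    """P(S|q), treating missing (doc, category) pairs as relevance 0."""
    return sum(
        p_c * (1.0 - prod(1.0 - V.get((d, c), 0.0) for d in S))
        for c, p_c in intent.items()
    )


def greedy_with_gains(k, docs, intent, V):
    """Run the greedy selection and record the marginal utility of each pick."""
    U, S, gains, remaining = dict(intent), [], [], list(docs)
    while len(S) < k and remaining:
        g = {d: sum(U[c] * V.get((d, c), 0.0) for c in U) for d in remaining}
        d_star = max(remaining, key=g.get)
        gains.append(g[d_star])
        for c in U:
            U[c] *= 1.0 - V.get((d_star, c), 0.0)
        S.append(d_star)
        remaining.remove(d_star)
    return S, gains


# Hypothetical instance where every document has exactly one category.
intent = {"player": 0.7, "floods": 0.2, "gordon": 0.1}
V = {("d1", "player"): 0.8, ("d2", "player"): 0.5,
     ("d3", "floods"): 0.6, ("d4", "gordon"): 0.9}
S, gains = greedy_with_gains(3, ["d1", "d2", "d3", "d4"], intent, V)
print(S, sum(gains), diversity(S, intent, V))
```

The two printed values agree up to floating-point rounding, as the telescoping sum predicts.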

Proof #2

• Formally, we have to prove that if documents can belong to many categories, their selection based on g(d|q,c,S) still optimizes P(S|q), with an error that is bounded.

• Since our objective P(S|q) is submodular, and our algorithm IA-Select is greedy in the sense mentioned before,

– IA-Select is a (1 − 1/e)-approximation to Diversify(k).
– That is, the value of P(S|q) attained by IA-Select is not less than (1 − 1/e) times the maximum value obtainable from Diversify(k).
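The bound can be observed empirically by comparing the greedy value with a brute-force optimum on a small random instance. This is an illustrative experiment under assumptions (random relevance values, a fixed seed, hypothetical function names), not part of the original evaluation.

```python
import random
from itertools import combinations
from math import e, prod


def diversity(S, intent, V):
    """P(S|q), treating missing (doc, category) pairs as relevance 0."""
    return sum(p * (1.0 - prod(1.0 - V.get((d, c), 0.0) for d in S))
               for c, p in intent.items())


def ia_select(k, docs, intent, V):
    """Greedy selection by marginal utility (sketch of IA-Select)."""
    U, S, remaining = dict(intent), [], list(docs)
    while len(S) < k and remaining:
        d_star = max(remaining,
                     key=lambda d: sum(U[c] * V.get((d, c), 0.0) for c in U))
        for c in U:
            U[c] *= 1.0 - V.get((d_star, c), 0.0)
        S.append(d_star)
        remaining.remove(d_star)
    return S


rng = random.Random(0)  # fixed seed, purely for reproducibility
intent = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
docs = [f"d{i}" for i in range(8)]
# Each document gets a random relevance for a random subset of categories.
V = {(d, c): rng.random() for d in docs for c in intent if rng.random() < 0.5}

k = 3
greedy_value = diversity(ia_select(k, docs, intent, V), intent, V)
optimal_value = max(diversity(S, intent, V) for S in combinations(docs, k))
ratio = greedy_value / optimal_value
print(ratio, 1 - 1 / e)
```

The ratio printed first should never fall below 1 − 1/e ≈ 0.632; on small instances like this one it is typically much closer to 1.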

Evaluating IA-Select

• To measure the result set diversity of search engines and compare it with the diversity of results obtained using IA-Select, the authors first defined intent-aware counterparts of traditional IR metrics.

• One such metric is the reciprocal rank (RR): the inverse of the first position at which a relevant document is found in a list.
• If there is a rank threshold T, RR is zero if no relevant document is found among the first T documents.
• The mean reciprocal rank (MRR) of a query set is the average RR of the queries in the set.
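These definitions can be written down in a few lines. In this minimal sketch, a ranked list is assumed to be given as a list of booleans (True where the result at that rank is relevant); the function names and that encoding are choices made for illustration.

```python
def reciprocal_rank(ranked_relevance, threshold=None):
    """Inverse of the first rank holding a relevant result, or 0.0 if none.

    ranked_relevance: list of booleans in rank order.
    threshold: optional rank threshold T; results beyond it are ignored.
    """
    top = ranked_relevance if threshold is None else ranked_relevance[:threshold]
    for position, relevant in enumerate(top, start=1):
        if relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs, threshold=None):
    """Average RR over the ranked lists of a query set."""
    return sum(reciprocal_rank(r, threshold) for r in runs) / len(runs)


rr_third = reciprocal_rank([False, False, True])             # relevant at rank 3
rr_cut = reciprocal_rank([False, False, True], threshold=2)  # cut off before it
mrr = mean_reciprocal_rank([[True], [False, True]])
print(rr_third, rr_cut, mrr)
```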

• Because IA-Select diversifies the top k results, the authors defined an intent-aware MRR (MRR-IA).
• For a result set D, rank threshold k and query q,

$$\text{MRR-IA}(D,k) = \sum_{c} P(c|q) \, \text{MRR}(D,k|c)$$

where MRR(D,k|c) gives the average RR for a query set belonging to category c.
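For a single query, the intent-aware measure reduces to an intent-weighted reciprocal rank. The sketch below illustrates that case under assumptions: `rr_intent_aware`, the per-category relevance judgments and the intent distribution are hypothetical, and averaging the result over a query set would give MRR-IA.

```python
def reciprocal_rank(results, relevant, k):
    """RR of a ranked result list against a set of relevant docs, cut at k."""
    for position, doc in enumerate(results[:k], start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def rr_intent_aware(results, k, intent, relevant_by_category):
    """sum_c P(c|q) * RR(results, k | c) for one query."""
    return sum(
        p_c * reciprocal_rank(results, relevant_by_category.get(c, set()), k)
        for c, p_c in intent.items()
    )


# Hypothetical judgments: which documents are relevant under each intent.
intent = {"player": 0.7, "floods": 0.2, "gordon": 0.1}
relevant_by_category = {"player": {"d1", "d2"}, "floods": {"d3"}, "gordon": {"d4"}}

diversified = rr_intent_aware(["d1", "d3", "d4"], 3, intent, relevant_by_category)
undiversified = rr_intent_aware(["d1", "d2", "d3"], 3, intent, relevant_by_category)
print(diversified, undiversified)
```

With these made-up judgments the diversified ranking scores higher, because users with the less common intents also find a relevant result within the top k.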

• Shown below are the results obtained using three commercial search engines and IA-Select.