© 2010 IBM Corporation

Diversified Ranking on Large Graphs: An Optimization Viewpoint

Hanghang Tong, Jingrui He, Zhen Wen, Ching-Yung Lin, Ravi Konuru

KDD 2011, August 21-24, San Diego, CA

Background: Why Diversity?

A1: Uncertainty & Ambiguity in an Information Need

Case 1: Uncertainty from the query

Case 2: Uncertainty from the user

Background: Why Diversity? (cont.)

A2: Uncertainty & ambiguity of an information need
– C1: Product search: want different reviews
– C2: Political issue debate: desire different opinions
– C3: Legal search: get an overview of a topic
– C4: Team assembling: find a set of relevant & diversified experts

A3: Become a better and safer employee
– Better: a 1% increase in diversity corresponds to an additional $886 of monthly revenue
– Safer: a 1% increase in diversity corresponds to an increase of 11.8% in job retention

Problem Definitions & Challenges

Problem 1 (Evaluate/measure a given top-k ranking list)
– Given: A large graph A, the query vector p, the damping factor c, and a subset of k nodes S;
– Measure: the goodness of the subset of nodes S by a single number in terms of (a) the relevance of each node in S w.r.t. the query vector p, and (b) the diversity among all the nodes in the subset S.

Problem 2 (Find a near-optimal top-k ranking list)
– Given: A large graph A, the query vector p, the damping factor c, and the budget k;
– Find: A subset of k nodes S that maximizes the goodness measure f(S).

Challenges
– (for Prob. 1) No existing measure encodes both relevance and diversity
– (for Prob. 2) Subset-level optimization

Our Solutions (10-second introduction!)

Problem 1 (Evaluate/measure a given top-k ranking list)
A1: A weighted sum between relevance and similarity

Problem 2 (Find a near-optimal top-k ranking list)
A2: A greedy algorithm (near-optimal, linear scalability)

[Formula annotation labels: weight, relevance, diversity]

Measure Relevance (r) by RWR (a.k.a. Personalized PageRank)

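[Figure: example graph with nodes 1 to 12]

\[
\underbrace{\mathbf{r}}_{n \times 1}
= 0.9 \, \underbrace{\mathbf{A}}_{n \times n} \, \mathbf{r}
+ 0.1 \, \underbrace{\mathbf{e}}_{n \times 1}
\]

\[
\mathbf{r} = \left(\begin{array}{c}
0.13 \\ 0.10 \\ 0.13 \\ 0.22 \\ 0.13 \\ 0.05 \\ 0.05 \\ 0.08 \\ 0.04 \\ 0.03 \\ 0.04 \\ 0.02
\end{array}\right), \quad
\mathbf{A} = \left(\begin{array}{cccccccccccc}
0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1/3 & 0 & 0 & 0 & 0 & 1/4 & 0 & 0 & 0 & 0 \\
1/3 & 1/3 & 0 & 1/3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1/3 & 0 & 1/4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/3 & 0 & 1/2 & 1/2 & 1/4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1/4 & 0 & 1/2 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1/4 & 1/2 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1/3 & 0 & 0 & 1/4 & 0 & 0 & 0 & 1/2 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/4 & 0 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/2 & 0 & 1/3 & 1/2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/4 & 0 & 1/3 & 0 & 1/2 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 0
\end{array}\right), \quad
\mathbf{e} = \left(\begin{array}{c}
0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{array}\right)
\]

r: ranking vector (n x 1); A: adjacency matrix (n x n); e: starting vector (n x 1). The 0.1 term is the restart probability (labeled "Restart p" on the slide).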

r = c A r + (1-c) e
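The fixed-point equation above can be solved by simple iteration. The following is a minimal sketch, not the authors' implementation: it assumes the 12-node example graph whose edges are read off the nonzero pattern of the adjacency matrix on the slide, a restart at node 4, and c = 0.9. The names `EDGES`, `column_normalized`, and `rwr` are illustrative.

```python
# Undirected edges of the 12-node example (1-based ids), taken from the
# nonzero pattern of the slide's adjacency matrix. Treat as an assumption.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 8), (3, 4), (4, 5),
         (5, 6), (5, 7), (5, 8), (6, 7), (8, 9), (8, 11),
         (9, 10), (10, 11), (10, 12), (11, 12)]


def column_normalized(edges, n):
    """Dense matrix with A[i][j] = 1/deg(j) when i and j are adjacent."""
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u - 1].append(v - 1)
        neighbors[v - 1].append(u - 1)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in neighbors[j]:
            A[i][j] = 1.0 / len(neighbors[j])
    return A


def rwr(A, e, c=0.9, tol=1e-12, max_iter=10000):
    """Iterate r <- c*A*r + (1-c)*e until the largest change is below tol."""
    n = len(e)
    r = list(e)
    for _ in range(max_iter):
        nxt = [c * sum(A[i][j] * r[j] for j in range(n)) + (1 - c) * e[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


n = 12
A = column_normalized(EDGES, n)
e = [0.0] * n
e[3] = 1.0  # assumed query: restart at node 4 (index 3)
r = rwr(A, e)
print([round(x, 2) for x in r])
```

Because A is column-stochastic and e sums to 1, every iterate keeps summing to 1, so r can be read as a probability distribution over nodes. The dense matrix is only for readability; a large graph would need a sparse representation.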

r = c A r + (1-c) e = [c A + (1-c) e 1'] r = B r, since 1' r = 1 (the entries of r sum to 1)

Diversity ~ reverse of weighted similarity on the personalized graph

B: Personalized Graph (a.k.a. 'Google Matrix')

[Figure: example graph with nodes 1 to 12]

B(i,j): How node i and node j are connected in the personalized graph

\[
g(S) = w \sum_{i \in S} r(i) \; - \sum_{i,j \in S} B(i,j)\, r(j)
\]
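To make the measure concrete, here is a small sketch that evaluates g(S) directly from its definition. The toy input (a 3-node path graph, query at node 0, c = 0.9) and the helper names `personalized_matrix`, `stationary`, and `goodness` are assumptions for illustration, not taken from the slides.

```python
def personalized_matrix(A, e, c):
    """B = c*A + (1-c)*e*1', i.e. B[i][j] = c*A[i][j] + (1-c)*e[i]."""
    n = len(e)
    return [[c * A[i][j] + (1 - c) * e[i] for j in range(n)]
            for i in range(n)]


def stationary(B, e, iters=2000):
    """Approximate the fixed point r = B r by repeated multiplication."""
    n = len(e)
    r = list(e)
    for _ in range(iters):
        r = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


def goodness(S, r, B, w=2.0):
    """g(S) = w * sum_{i in S} r(i) - sum_{i,j in S} B(i,j) * r(j)."""
    S = list(S)
    relevance = sum(r[i] for i in S)
    redundancy = sum(B[i][j] * r[j] for i in S for j in S)
    return w * relevance - redundancy


# Hypothetical toy input: path graph 0 - 1 - 2, column-normalized.
A = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
e = [1.0, 0.0, 0.0]  # query vector: restart at node 0
B = personalized_matrix(A, e, c=0.9)
r = stationary(B, e)

for S in ([], [0], [0, 1], [0, 1, 2]):
    print(S, round(goodness(S, r, B), 4))
```

With w = 2 and S equal to the whole node set, the relevance term is 2 and the second term is 1 (each column of B sums to 1 and r sums to 1), so g evaluates to 1. This gives a quick sanity check on any implementation.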

Properties of g(S): Why is it a Good Measure?

P1: g(S) = 0 for an empty set S
P2: g(S) is submodular for any w > 0
P3: g(S) is monotonically non-decreasing for any w >= 2

A greedy algorithm (Dragon) leads to a near-optimal solution for any w >= 2
– Quality: g(S) >= (1 − 1/e) g(S*), where S is the subset returned by the greedy algorithm and S* is the optimal subset maximizing g(S)
– Complexity: O(m) for both time and space
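The greedy idea can be sketched as follows: repeatedly add the node with the largest marginal gain in g. This is a naive illustration that re-evaluates g from scratch at every step, so it does not reproduce the O(m) cost stated for Dragon. The 4-cycle graph, the uniform query vector, and the names `goodness`, `greedy_select`, and `brute_force_best` are assumptions made for the example.

```python
from itertools import combinations


def goodness(S, r, B, w=2.0):
    """g(S) = w * sum_{i in S} r(i) - sum_{i,j in S} B(i,j) * r(j)."""
    S = list(S)
    return (w * sum(r[i] for i in S)
            - sum(B[i][j] * r[j] for i in S for j in S))


def greedy_select(r, B, k, w=2.0):
    """Add, k times, the candidate with the largest marginal gain in g."""
    selected = []
    candidates = set(range(len(r)))
    for _ in range(min(k, len(r))):
        base = goodness(selected, r, B, w)
        # Ties are broken toward the smallest node index.
        best = max(sorted(candidates),
                   key=lambda j: goodness(selected + [j], r, B, w) - base)
        selected.append(best)
        candidates.remove(best)
    return selected


def brute_force_best(r, B, k, w=2.0):
    """Exhaustive optimum over all size-k subsets; tiny inputs only."""
    return max(combinations(range(len(r)), k),
               key=lambda S: goodness(S, r, B, w))


# Hypothetical input: 4-cycle 0-1-2-3-0 with a uniform query vector.
# The graph is regular, so r is uniform and satisfies r = B r exactly.
n, c = 4, 0.9
A = [[0.5 if (i - j) % n in (1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
B = [[c * A[i][j] + (1 - c) / n for j in range(n)] for i in range(n)]
r = [1.0 / n] * n

picked = greedy_select(r, B, k=2)
print(picked)  # -> [0, 2]
```

All four nodes are equally relevant here, so the second pick is decided purely by the diversity term: after node 0 is chosen, its neighbors 1 and 3 are penalized through B, and the opposite node 2 is selected.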

Footnote: Dragon stands for Diversified Ranking on Graph: An Optimization Viewpoint

Experimental Results

[Figure: four panels titled "An Illustrative Example", "Compare w/ alternative choices", "Quality-Time Balance", and "Scalability". Axis labels: Quality, Budget, Time. Reference marker: Opt. Quality.]

Conclusion

Problem 1 (Evaluate/measure a given top-k ranking list)
A1: A weighted sum between relevance and similarity

Problem 2 (Find a near-optimal top-k ranking list)
A2: A greedy algorithm (near-optimal, linear scalability)

Contact: Hanghang Tong ([email protected])

Academic Literature: More Detailed Comparison

[Comparison table residue: entries [6] and [7]; column headers "For Problem 1" and "For Problem 2"]

This disclosure proposes:
(1) The first measure that combines both relevance & diversity
(2) The first method that (a) leads to a near-optimal solution with (b) linear complexity
