1 clustering applications at yahoo! deepayan chakrabarti ([email protected])

32
1 Clustering Applications at Yahoo! Deepayan Chakrabarti ([email protected])

Post on 21-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

1

Clustering Applications at Yahoo!

Deepayan Chakrabarti ([email protected])

Page 2: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

2

Outline

Clustering, in itself, is often not the primary problem Exploratory analysis is rarely needed

Methods are often tied to the “final” goal

Data

Problem

Method

Issues

Data

Problem

Method

Issues

Data

Problem

Method

Issues

Page 3: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

3

Outline

Advertiser-keyword graph(+ social networks) Graph Partitioning

CTR predictions for ads on webpages Co-clustering

Query refinement and suggestions Local search methods

Conclusions

Page 4: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

4

Outline

Advertiser-keyword graph(+ social networks) Graph Partitioning

CTR predictions for ads on webpages Co-clustering

Query refinement and suggestions Local search methods

Conclusions

Page 5: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

5

Graph Partitioning (Applications)

Find clusters of advertisers and keywords Keyword suggestions Running experiments on

some “natural” clusters

Similar: Y! Answers Flickr

Advertiser Bidded Keyword

bids

~10M nodes

Page 6: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

6

Graph Partitioning (Applications)

Find clusters of IM users Targeted advertising Exploratory analysis

Clusters of the Web Graph Distributed pagerank

computation

~100M nodes

Who-messages-whom IM graph

Page 7: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

7

Graph Partitioning (Methods)

Basic “Global” spectral partitioning [Ng+/01] Find 2nd eigenvector of the graph Laplacian This embeds all nodes on the real line Split the line in two, to get two clusters

Can approximate the optimal conductance cut For more clusters:

Use k eigenvectors (for known k), or Split in two, and recurse on each cluster

Page 8: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

8

Graph Partitioning (Methods)

However, this has problems [Lang/05, Leskovec+/08]:

Min. conductance or quotient cuts lead to “small chunks”

Better balance worse cuts

Large-sized low-quotient cuts are actually just unions of whiskers

“Whiskers”

Page 9: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

9

Graph Partitioning (Methods)

However, this has problems: Min. conductance or quotient cuts lead to “small

chunks” Balance is not very strongly encouraged

Recursive partitioning takes too long Each eigenvector computation yields only a small cluster

being broken off

Two alternatives: More balanced cuts recursion is faster Unbalanced cuts, but much faster computation

Page 10: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

10

Graph Partitioning (Methods)

Achieving better balance Combine algorithms (e.g., [Anderson+/08])

METIS (more balanced cuts), followed by Flow-based improvement (conductance)

Stronger balance constraints

Many nodes One nodeSpectral embedding

Perfect balance on real line: NP-Hard

Perfect balance on hypersphere: SDP formulation [Lang/05]

Page 11: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

11

Graph Partitioning (Methods)

Faster Computation via “local” graph partitioning [Spielman+/04, Anderson+/06] Pick seeds randomly Build local clusters around seed Bite off cluster, and repeat

Time for each iteration is proportional to the size of the local cluster (scalability)

Better for large graphs, or when not all clusters are needed

Page 12: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

12

Graph Partitioning (Issues)

Many complex networks are very different from planar/mesh-like networks

Good small cuts

Good large cuts are hard to find, and may not even existHard to have a good hierarchical partitioning

Is conductance really the best objective? METIS+flow finds lower conductance cuts, but Local spectral methods finds more “tightly-knit” cuts [Leskovec+/08]

Is there a good compromise?

Page 13: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

13

Graph Partitioning (Issues)

Speed and Scalability Most implementations have trouble with large

graphs, or graphs with some extremely high-degree nodes

Map-reduce style partitioning algorithms?

Page 14: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

14

Outline

Advertiser-keyword graph(+ social networks) Graph Partitioning

CTR predictions for ads on webpages Co-clustering

Query refinement and suggestions Local search methods

Conclusions

Page 15: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

15

Co-clustering (Applications)

The Content Match problem Predict click-thru rate (CTR) Extreme sparsity

Few views Even fewer clicks

Ads

Web

page

s

clicksviews

= CTR

Page 16: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

16

Co-clustering (Methods)

Approximate cell CTR using block CTR [Dhillon+/03]

Minimize divergence between the original matrix and its reconstruction

Combats sparsity

Ad clusters

Web

page

cl

uste

rs

= row effect + column effect + block effectCell CTR

Page 17: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

17

Co-clustering (Issues)

Picking the right number of clusters Use MDL [Chakrabarti+/2004]

Handling new ads/pages Combine co-clustering with feature-based

prediction models [Agarwal+/2007] Handling extra information

E.g., if each page and ad can be categorized into a taxonomy [Chakrabarti+/2007]

Can the taxonomy be automatically modified?

Page 18: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

18

Co-clustering (Issues)

Iterative process (hard for map-reduce) Factor-3 approximation by just clustering

webpages and ads separately [Dasgupta+/2008]

Page 19: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

19

Outline

Advertiser-keyword graph(+ social networks) Graph Partitioning

CTR predictions for ads on webpages Co-clustering

Query refinement and suggestions Local search methods

Conclusions

Page 20: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

20

Query Refinement

User inputs ambiguous query (“madonna”) Search engine asks:

“Did you mean: songs, videos, pictures?”

lyrics “song title” album …

Refinement =

cluster of terms

Page 21: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

21

pictures

photos

songs

lyrics

albums

Query Refinement

Suppose we could relate queries with keywords

Honda

Ford

QueriesRelated

Keywords

Madonna

Beatles

Page 22: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

22

Query Refinement (Problem)

Different from plain bipartite graph partitioning Don’t confuse users!

only 3 or fewer clusters for each query only a few easily-distinguishable clusters overall

Clustering quality now also depends on the query workload the algo that picks the “top-3” clusters for any query

Page 23: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

23

Query Refinement (Method)

Can optimally pick top-k clusters for any query [Wang+/2009] for a wide range of matching functions only if clusters are disjoint

Iteratively improve clustering via local search Move a keyword to a new cluster Update top-k clusters for all queries in workload Repeat

Page 24: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

24

Query Refinement (Issues)

Iterative slow Each iteration has to go over the entire historical

query logs Optimality guarantees? Modeling issues:

How do we present a cluster to the user? Cluster naming? How does a user interact with a cluster?

Page 25: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

25

Outline

Advertiser-keyword graph(+ social networks) Graph Partitioning

CTR predictions for ads on webpages Co-clustering

Query refinement and suggestions Local search methods

Conclusions

Page 26: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

26

Conclusions

Standalone clustering applications are rare! Constraints on clustering

What use will it serve (as in query refinement)? What is the desired “natural” cluster size?

Clustering for predictions (e.g., CTR) Missing values Extreme sparsity

Combining clustering with explore-exploit

Page 27: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

27

Conclusions

Even for standalone clustering: Scalability concerns

10s to 100s of millions of nodes Skewed degree distributions Algorithms amenable to map-reduce

The right “balance” relaxation Tradeoffs between cluster “compactness”, conductance,

and partition sizes Is it even reasonable to require balance?

Page 28: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

28

References

On Spectral Clustering: Analysis and an Algorithm, by Ng, Jordan, & Weiss, in NIPS 2002 Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, by

Spielman & Teng, STOC 2004 Fixing two weaknesses of the Spectral Method, by Lang, NIPS 2005 Local Graph Partitioning using PageRank Vectors, by Anderson, Chung, & Lang, SODA 2008 An algorithm for improving graph partitions, by Anderson & Lang, SODA 2008 Finding dense and isolated submarkets in a sponsored search spending graph, by Lang & Anderson,

CIKM 2007 Clustering of bipartite advertiser-keyword graph, by Carrasco, Fain, Lang, & Zhukov, IEEE Computer

Society, 2003 Statistical Properties of Community Structure in Large Social and Information Networks, by Leskovec,

Lang, Dasgupta, & Mahoney, WWW 2008 Approximate Algorithms for Co-Clustering, by Anagnostopoulos, Dasgupta, & Kumar, PODS 2008 Information-theoretic co-clustering, by Dhillon, Mallela, & Modha, in KDD 2003 Fully Automatic Cross-Associations, by Chakrabarti, Papadimitriou, & Faloutsos, in KDD 2004 Predictive discrete latent factor models for large scale dyadic data, by Merugu, & Agarwal, in KDD 2007 Estimating Rates of Rare Events at Multiple Resolutions, by Agarwal, Broder, Chakrabarti, Diklic,

Josifovski, & Sayyadian, in KDD 2007 Mining Broad Latent Query Aspects from Search Sessions, by Wang, Chakrabarti, & Punera, in KDD

2009

Page 29: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

29

Graph Partitioning (Applications)

Find clusters of users posting in similar groups Enhance group activity Are their special groups?

Can we build special tools for them?

Similar: Y! Answers Flickr

Users Yahoo! Groups

~100M users, ~10M groups

Page 30: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

30

Graph Partitioning (Issues)

Many complex networks are very different from planar/mesh-like networks

Good small cutsGood large cuts are hard to find, and may not even exist

Hard to have a good hierarchical partitioning

Good large cuts If they exist, most algorithms find them If not: METIS+flow finds lower conductance cuts, but

local spectral methods finds more “coherent” cuts Is there a good compromise?

Page 31: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

31

Query Refinement

Given an ambiguous query, present user with up to 5 refinements madonna songs, videos, pictures

Each refinement represents a cluster of terms songs = (lyrics, “song title”, album, …)

Page 32: 1 Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

32

lyrics

albums

pictures

photos

songs

Query Refinement

How can we relate “madonna” and “songs”?

Search “madonna”

Search“madonna songs”

clickno click

Beatles

Honda

Ford

Queries Reformulations

Madonna