fast katz and commuters: efficient estimation of social relatedness in large networks

FAST KATZ AND COMMUTERS Efficient Estimation of Social Relatedness

David F. Gleich

Sandia National Labs

WAW2010

16 December 2010

Palo Alto, CA

With Pooya Esfandiar, Francesco Bonchi, Chen Grief,

Laks V. S. Lakshmanan, and Byung-Won On

David F. Gleich (Sandia) ICME la/opt seminar

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin

Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

1 / 28

MAIN RESULTS

A – adjacency matrix

L – Laplacian matrix

Katz score :

Commute time:

For Katz Compute one fast Compute top fast For Commute Compute one fast

2 of 28

OUTLINE

Katz Rank and Commute Time, then Why

Matrices, moments, and quadrature rules for pairwise scores

Sparse linear systems solves for top-k

Some results

David F. Gleich (Sandia) ICME la/opt seminar 3 of 28

NOTE – EVERYTHING IS SIMPLE

All graphs are undirected

All graphs are connected

All graphs are unweighted

KATZ SCORES

The Katz score is

Carl Neumann

COMMUTE TIME

Consider a uniform random walk on a graph

Fouss et al. TKDE 2007

Also called the hitting

time from node i to j, or

the first transition time

: graph Laplacian

is the only null-vector

6 of 28

WHAT DO OTHER PEOPLE DO?

1) Just work with the linear algebra formulations

2) For Katz, truncate the Neumann series as a few (3-5) terms

3) Use low-rank approximations from EVD(A) or EVD(L)

4) For commute, use Johnson-Lindenstrauss inspired random sampling

5) Approximately decompose into smaller problems

Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,

Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007

7 of 28

THE PROBLEM

All of these techniques are

preprocessing based because

most people’s goal is to compute

all the scores.

We want to avoid

preprocessing the graph.

There are a few caveats here! i.e. one could solve the system instead of looking for the matrix inverse

8 of 28

WHY? LINK PREDICTION

Liben-Nowell and Kleinberg 2003, 2006 found that path based link prediction was more efficient

Neighborhood based

Path based

9 of 28

WHY NO PREPROCESSING?

The graph is constantly changing

as I rate new movies.

WHY NO PREPROCESSING?

Top-k predicted “links”

are movies to watch!

Pairwise scores give

user similarity

11 of 28

PAIRWISE ALGORITHMS

Commute

Golub and Meurant

to the rescue!

12 of 28

MMQ - THE BIG IDEA

Quadratic form

Weighted sum

Stieltjes integral

Quadrature approximation

Matrix equation David F. Gleich (Sandia) ICME la/opt seminar

A is s.p.d. use EVD

“A tautology”

Lanczos

13 of 28

MMQ PROCEDURE SKETCH

1. Run k-steps of Lanczos on starting with

2. Compute , with an additional eigenvalue at ,

3. Compute , with an additional eigenvalue at , set

4. Output as lower and upper bounds on b

Correspond to a Gauss-Radau rule, with

u as a prescribed node

Correspond to a Gauss-Radau rule, with

l as a prescribed node

Bad bounds give worse

performance!

Larger k gives better results;

k steps is k matvecs

Easy to “update” this inverse as k increases.

14 of 28

PRACTICAL MMQ

ONE LAST STEP FOR KATZ

TOP-K ALGORITHM FOR KATZ

Approximate

where is sparse

Keep sparse too

Ideally, don’t “touch” all of

For PageRank, we can do this!

THE ALGORITHM - MCSHERRY

Start with the Richardson iteration

Rewrite

Note is sparse

If , then is sparse.

Idea Only add one component of to

McSherry WWW2005

Richardson converges if

18 of 28

THE ALGORITHM FOR KATZ

Pick as max David F. Gleich (Sandia) ICME la/opt seminar

Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more

19 of 28

CONVERGENCE?

The “standard” proof works for 1/max-degree.

If you pick as the maximum element, we can show this is convergent if Richardson converges. This proof requires to be symmetric positive definite.

RESULTS – DATA, PARAMETERS

All unweighted, connected graphs

Easy :

Hard :

KATZ BOUND CONVERGENCE

COMMUTE BOUND CONVERG.

KATZ SET CONVERGENCE

For arXiv graph.

24 of 28

TIMING

CONCLUSIONS

These algorithms are faster than many alternatives in these special cases.

For pairwise commute, stopping criteria are simpler with bounds.

For top-k problems, we often need less than 1 matvec for good enough results

WARTS AND TODOS

Stopping criteria on our top-k algorithm can be a bit hairy… we should refine it.

The top-k approach doesn’t work right for commute time… we have an alternative.

Take advantage of new research to “seed” commute time better!

Evaluate more datasets, like Netflix.

von Luxburg et al. NIPS2010

27 of 28

By AngryDogDesign on DeviantArt

More details in the paper

Slides should be online soon

Code is online already

stanford.edu/~dgleich/

publications/2010/codes/fast-katz

David F. Gleich (Sandia) ICME la/opt seminar 28

LEO KATZ

NOT QUITE, WIKIPEDIA

: adjacency, : random walk

PageRank

These are equivalent if has constant degree

WHAT KATZ ACTUALLY SAID

Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometria 18(1):39-43

“we assume that each link independently has the

same probability of being effective” …

“we conceive a constant , depending

on the group and the context of the particular

investigation, which has the force of a probability

of effectiveness of a single link. A k-step chain

then, has probability of being effective.”

“We wish to find the column sums of the matrix”

RETURNING TO THE MATRIX

Carl Neumann

PROPERTIES OF KATZ’S MATRIX

is symmetric

exists when

is sym. pos. def. when

Note that 1/max-degree suffices

SKIPPING DETAILS

: graph Laplacian

is the only null-vector

Carl Neumann

I’ve heard the Neumann series called the “von Neumann”

series more than I’d like! In fact, the von Neumann kernel

of a graph should be named the “Neumann” kernel!

Wikipedia page

35 / 28

LANCZOS

, k-steps of the Lanczos method produce

36 of 28

PRACTICAL LANCZOS

Only need to store the last 2 vectors in

Updating requires O(matvec) work

is not orthogonal

PRACTICAL MMQ

Increase k to become more accurate

Bad eigenvalue bounds yield worse results

and are easy to compute

not required, we can iteratively

update it’s LU factorization

THE ALGORITHM

How to pick ?

MAIN RESULTS – SLIDE ONE

A – adjacency matrix

L – Laplacian matrix

Katz score :

Commute time:

TOP-K INSPIRATION: PAGERANK

Approximate

where is sparse

Keep sparse too? YES!

Ideally, don’t “touch” all of ? YES!

McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.

41 of 28

THE ALGORITHM

Note is sparse.

If , then is sparse.

only add one component of to

MAIN RESULTS – SLIDE TWO

For Katz Compute one fast

Compute top fast

For Commute

Compute one fast

For almost commute

Compute top fast

MAIN RESULTS – SLIDE THREE

F-MEASURE

PAIRWISE RESULTS

Katz upper and lower bounds

Katz error convergence

Commute-time upper and lower bounds

Commute-time error convergence

For the arXiv graph here

fast katz and commuters: efficient estimation of social relatedness in large networks

Technology

explanatory semantic relatedness and explicit...

alcoholics anonymous: personal stories, relatedness

designing smart specialization policy: relatedness

katz certificate declaration...

computing semantic relatedness using wikipedia features

relatedness = recency of common ancestry

form-independent meaning representation for...

modes of relatedness in psychotherapy

virginia halifax county - microsoft...commuting patterns...

pedigree relatedness and pseudo-phenotypes as a first ......

controlling for phylogenetic relatedness and evolutionary

institutional relatedness behind product...

roots and event structure i - linguistic society of...

motivation and relatedness

hernia and work relatedness

relatedness analysis tool for comparing drafted

habitat fragmentation in forests affects relatedness -

semantic similarity and relatedness between clinical terms...

relatedness-based multi-entity summarization

fast katz and commuters