fast matrix computations for pair-wise and column-wise katz scores and commute times
DESCRIPTION
A seminar I gave at the University of Chicago on ideas for computing Katz matrices and commute times quickly.
TRANSCRIPT
FAST MATRIX COMPUTATIONS FOR
PAIRWISE AND COLUMN-WISE KATZ
SCORES AND COMMUTE TIMES
David F. Gleich
Purdue University
University of Chicago
Statistical and Scientific Computing Seminar
October 6th, 2011
With Pooya Esfandiar, Francesco Bonchi, Chen Greif,
Laks V. S. Lakshmanan, and Byung-Won On
David F. Gleich (Purdue) Univ. Chicago SSCS Seminar 1 / 47
MAIN RESULTS
A – adjacency matrix
L – Laplacian matrix
Katz score: $K_{i,j} = [(I - \alpha A)^{-1}]_{i,j}$
Commute time: $C_{i,j} = \mathrm{vol}(G)\,(e_i - e_j)^T L^{\dagger} (e_i - e_j)$
For Katz: compute one score fast; compute the top scores fast
For commute time: compute one score fast
OUTLINE
Why study these measures?
Katz Rank and Commute Time
How else do people compute them?
Quadrature rules for pairwise scores
Sparse linear systems solves for top-k
As many results as we have time for…
WHY? LINK PREDICTION
Liben-Nowell and Kleinberg (2003, 2006) found that path-based link prediction was more efficient
Neighborhood-based
Path-based
NOTE
All graphs are undirected
All graphs are connected
LEO KATZ
NOT QUITE, WIKIPEDIA
$A$: adjacency matrix, $P$: random walk matrix
PageRank
Katz
These are equivalent if the graph has constant degree
WHAT KATZ ACTUALLY SAID
Leo Katz 1953, A New Status Index Derived from Sociometric Analysis, Psychometrika 18(1):39-43
“we assume that each link independently has the
same probability of being effective” …
“we conceive a constant $\alpha$, depending
on the group and the context of the particular
investigation, which has the force of a probability
of effectiveness of a single link. A k-step chain
then, has probability $\alpha^k$ of being effective.”
“We wish to find the column sums of the matrix”
A MODERN TAKE
The Katz score (node-based) is
The Katz score (edge-based) is
RETURNING TO THE MATRIX
Carl Neumann
Carl Neumann
I’ve heard the Neumann series called the “von Neumann”
series more than I’d like! In fact, the von Neumann kernel
of a graph should be named the “Neumann” kernel!
Wikipedia page
PROPERTIES OF KATZ’S MATRIX
The Katz matrix $(I - \alpha A)^{-1}$
is symmetric
exists when $\alpha < 1/\|A\|_2$
is spd when $\alpha < 1/\|A\|_2$
Note that $\alpha < 1/\text{max-degree}$ suffices
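A small numerical check of these properties can make them concrete. The sketch below assumes the Katz matrix is $(I - \alpha A)^{-1}$ with the Neumann series started at the identity term (Katz's own sum starts at the first power, which differs only by subtracting $I$). The graph, the value of `alpha`, and the helper `neumann` are made up for illustration.

```python
import numpy as np

# Hypothetical 5-node undirected, connected graph (a path plus one chord).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
n = 5
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

max_degree = A.sum(axis=1).max()
alpha = 0.5 / max_degree  # safely below 1/max-degree

# Direct form: K = (I - alpha*A)^{-1}
K = np.linalg.inv(np.eye(n) - alpha * A)

def neumann(A, alpha, terms):
    """Truncated Neumann series: sum_{k=0}^{terms} alpha^k A^k."""
    total = np.eye(len(A))
    power = np.eye(len(A))
    for _ in range(terms):
        power = alpha * (power @ A)
        total += power
    return total

K5 = neumann(A, alpha, 5)    # a short truncation, as in prior work
K60 = neumann(A, alpha, 60)  # a long truncation, close to the inverse

is_symmetric = np.allclose(K, K.T)
is_spd = bool(np.all(np.linalg.eigvalsh(np.eye(n) - alpha * A) > 0))
err5 = np.abs(K - K5).max()
err60 = np.abs(K - K60).max()
print(is_symmetric, is_spd, err5, err60)
```

Forming the dense inverse is only reasonable for tiny graphs like this one; the rest of the talk is about avoiding it.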
COMMUTE TIME
Picture taken from Google images, seems to be
Bay Bridge Traffic by Jim M. Goldstein.
COMMUTE TIME
Consider a uniform random walk on a graph
$H_{i,j}$: the expected number of steps for the walk to first reach node j starting from node i.
Also called the hitting time from node i to j, or the first transition time.
Commute time: $C_{i,j} = H_{i,j} + H_{j,i}$
SKIPPING DETAILS
$L$: graph Laplacian
$e$, the vector of all ones, is the only null-vector
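The skipped details can be checked on a tiny example. The sketch below assumes the standard identity $C_{i,j} = \mathrm{vol}(G)\,(e_i - e_j)^T L^{\dagger} (e_i - e_j)$ with $L = D - A$, and compares it against hitting times obtained from the random walk directly. The path graph and the helper names `commute_time` and `hitting_time` are illustrative assumptions.

```python
import numpy as np

# Hypothetical example graph: the path 0 - 1 - 2 (undirected, connected).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
n = A.shape[0]
d = A.sum(axis=1)          # degrees
L = np.diag(d) - A         # graph Laplacian
vol = d.sum()              # volume of the graph = sum of degrees
L_pinv = np.linalg.pinv(L) # dense pseudoinverse; fine only for tiny graphs

def commute_time(i, j):
    """Commute time from the Laplacian pseudoinverse."""
    e = np.zeros(n)
    e[i] += 1.0
    e[j] -= 1.0
    return vol * (e @ L_pinv @ e)

def hitting_time(i, j):
    """Expected steps for a uniform random walk to first reach j from i."""
    if i == j:
        return 0.0
    P = A / d[:, None]                     # row-stochastic transition matrix
    keep = [k for k in range(n) if k != j]
    M = np.eye(n - 1) - P[np.ix_(keep, keep)]
    h = np.linalg.solve(M, np.ones(n - 1)) # h = 1 + P h away from the target
    return h[keep.index(i)]

print(commute_time(0, 2), hitting_time(0, 2) + hitting_time(2, 0))
```

On this path the two computations agree, and `L @ np.ones(n)` is zero, which is the null-vector statement above.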
WHAT DO OTHER PEOPLE DO?
1) Just work with the linear algebra formulations
2) For Katz, truncate the Neumann series to a few (3-5) terms
3) Use low-rank approximations from EVD(A) or EVD(L)
4) For commute, use Johnson-Lindenstrauss inspired random sampling
5) Approximately decompose into smaller problems
Liben-Nowell and Kleinberg CIKM2003, Acar et al. ICDM2009,
Spielman and Srivastava STOC2008, Sarkar and Moore UAI2007, Wang et al. ICDM2007
THE PROBLEM
All of these techniques are
preprocessing based because
most people’s goal is to compute
all the scores.
We want to avoid
preprocessing the graph.
There are a few caveats here! For example, one could solve the linear system instead of computing the matrix inverse.
WHY NO PREPROCESSING?
The graph is constantly changing
as I rate new movies.
WHY NO PREPROCESSING?
Top-k predicted “links”
are movies to watch!
Pairwise scores give
user similarity
PAIR-WISE ALGORITHMS
PAIRWISE ALGORITHMS
Katz: $K_{i,j} = e_i^T (I - \alpha A)^{-1} e_j$
Commute: $C_{i,j} = \mathrm{vol}(G)\,(e_i - e_j)^T L^{\dagger} (e_i - e_j)$
Golub and Meurant
to the rescue!
MMQ - THE BIG IDEA
Quadratic form
Weighted sum
Stieltjes integral
Quadrature approximation
Matrix equation
Think
A is s.p.d. use EVD
“A tautology”
Lanczos
LANCZOS
Given a symmetric matrix and a starting vector, k steps of the Lanczos method produce
a matrix $V_k$ of Lanczos vectors and a symmetric tridiagonal matrix $T_k$
PRACTICAL LANCZOS
Only need to store the last 2 vectors in $V_k$
Updating $T_k$ requires O(matvec) work
$V_k$ is not orthogonal in practice
MMQ PROCEDURE
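Goal: lower and upper bounds on $x^T Z^{-1} x$
Given a symmetric positive definite matrix $Z$, a vector $x$, and eigenvalue bounds $l \le \lambda_{\min}(Z)$ and $u \ge \lambda_{\max}(Z)$
1. Run k steps of Lanczos on $Z$ starting with $x$ to get $T_k$
2. Compute $T_k$ extended with an additional eigenvalue at $u$ and set the lower bound from it. This corresponds to a Gauss-Radau rule with $u$ as a prescribed node.
3. Compute $T_k$ extended with an additional eigenvalue at $l$ and set the upper bound from it. This corresponds to a Gauss-Radau rule with $l$ as a prescribed node.
4. Output the two values as lower and upper bounds on $x^T Z^{-1} x$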
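The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the talk: it targets the quadratic form with the matrix inverse, builds each Gauss-Radau matrix explicitly with the standard Golub construction, and uses dense solves on the small tridiagonal systems. The names `lanczos`, `radau_estimate`, `mmq_bounds`, `lam_lo`, `lam_hi`, and the example graph are all hypothetical.

```python
import numpy as np

def lanczos(Z, x, k):
    """k Lanczos steps on symmetric Z from x; keeps only the last 2 vectors.
    Returns the diagonal and off-diagonal of T; betas[-1] couples step k to k+1."""
    q_prev = np.zeros(len(x))
    q = x / np.linalg.norm(x)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(k):
        w = Z @ q - beta * q_prev
        alpha = q @ w
        w = w - alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, w / beta
    return np.array(alphas), np.array(betas)

def radau_estimate(alphas, betas, node, xnorm2):
    """Estimate x^T Z^{-1} x from T_k extended so that `node` is an eigenvalue."""
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    rhs = np.zeros(k)
    rhs[-1] = betas[-1] ** 2
    delta = np.linalg.solve(T - node * np.eye(k), rhs)
    T_hat = np.zeros((k + 1, k + 1))
    T_hat[:k, :k] = T
    T_hat[k, k - 1] = T_hat[k - 1, k] = betas[-1]
    T_hat[k, k] = node + delta[-1]   # forces an eigenvalue of T_hat at `node`
    e1 = np.zeros(k + 1)
    e1[0] = 1.0
    return xnorm2 * np.linalg.solve(T_hat, e1)[0]

def mmq_bounds(Z, x, k, lam_lo, lam_hi):
    """Lower and upper bounds on x^T Z^{-1} x for s.p.d. Z."""
    alphas, betas = lanczos(Z, x, k)
    xnorm2 = x @ x
    lower = radau_estimate(alphas, betas, lam_hi, xnorm2)  # node at upper bound
    upper = radau_estimate(alphas, betas, lam_lo, xnorm2)  # node at lower bound
    return lower, upper

# Hypothetical example: a 12-cycle with three chords, Z = I - alpha*A.
n = 12
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
for i, j in [(0, 5), (2, 9), (4, 11)]:
    A[i, j] = A[j, i] = 1.0
alpha, dmax = 0.25, A.sum(axis=1).max()
Z = np.eye(n) - alpha * A
x = np.zeros(n)
x[0] = 1.0
# Eigenvalues of Z lie in [1 - alpha*dmax, 1 + alpha*dmax].
lower, upper = mmq_bounds(Z, x, 4, 1 - alpha * dmax, 1 + alpha * dmax)
exact = np.linalg.inv(Z)[0, 0]
print(lower, exact, upper)
```

The eigenvalue bounds here come from the maximum degree alone, so no eigenvalue computation is needed to run the procedure on a Katz system.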
PRACTICAL MMQ
Increase k to become more accurate
Bad eigenvalue bounds yield worse results
The extended tridiagonal matrices are easy to compute
Storing $T_k$ is not required; we can iteratively
update its LU factorization
PRACTICAL MMQ
ONE LAST STEP FOR KATZ
Katz
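The quadrature bounds apply to a quadratic form, while a pairwise Katz score is a bilinear form. One standard way to bridge the two, assumed here since the slide's formula did not survive extraction, is the polarization identity for the symmetric matrix $Z = I - \alpha A$:

```latex
\[
  e_i^T Z^{-1} e_j
  = \tfrac{1}{4}\Bigl[ (e_i + e_j)^T Z^{-1} (e_i + e_j)
                     - (e_i - e_j)^T Z^{-1} (e_i - e_j) \Bigr],
  \qquad Z = I - \alpha A .
\]
```

Under this identity, lower and upper bounds on the two quadratic forms combine into bounds on the pairwise score: the lower bound on the first minus the upper bound on the second, and vice versa, each divided by 4.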
COLUMN-WISE
ALGORITHMS
COLUMN-WISE COMMUTE-TIME
requires the entire diagonal of $L^{\dagger}$
Paige and Saunders, 1975
Each vector is computed by a Lanczos-based CG algorithm.
COLUMN-WISE COMMUTE TIME
following Hein et al. 2010 is a MUCH better and faster approximation.
KATZ SCORES ARE LOCALIZED
Up to 50 neighbors account for 99.65% of the total mass
PARTICIPATION RATIOS
Participation ratios: “effective non-zeros” in a vector
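A short sketch of one common definition may help: the participation ratio of a vector $x$ is $(\sum_i x_i^2)^2 / \sum_i x_i^4$. Whether the slide used exactly this normalization is an assumption; the function name `participation_ratio` is illustrative.

```python
import numpy as np

def participation_ratio(x):
    """Effective number of non-zeros: (sum x_i^2)^2 / sum x_i^4."""
    x = np.asarray(x, dtype=float)
    s2 = np.sum(x ** 2)
    s4 = np.sum(x ** 4)
    return s2 ** 2 / s4 if s4 > 0 else 0.0

uniform = participation_ratio(np.ones(100))   # mass spread over all 100 entries
spike = participation_ratio(np.eye(100)[0])   # all mass on a single entry
print(uniform, spike)
```

A perfectly uniform vector of length n scores n and a single spike scores 1, so a small ratio for a Katz column indicates that the column is localized.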
TOP-K ALGORITHM FOR KATZ
Approximate $(I - \alpha A)^{-1} e_i$ by $x$
where $x$ is sparse
Keep the residual sparse too
Ideally, don’t “touch” all of $A$
INSPIRATION - PAGERANK
Approximate $(I - \alpha P)^{-1} e_i$ by $x$
where $x$ is sparse
Keep the residual sparse too? YES!
Ideally, don’t “touch” all of $P$? YES!
McSherry WWW2005, Berkhin 2007, Anderson et al. FOCS2008 – Thanks to Reid Anderson for telling me McSherry did this too.
THE ALGORITHM - MCSHERRY
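For $(I - \alpha P)\,x = e_i$
Start with the Richardson iteration: $x^{(k+1)} = x^{(k)} + r^{(k)}$, with residual $r^{(k)} = e_i - (I - \alpha P)\,x^{(k)}$
Rewrite: $r^{(k+1)} = \alpha P\, r^{(k)}$
Richardson converges if $\rho(\alpha P) < 1$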
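The Richardson iteration itself is only a few lines. The sketch below assumes a PageRank-style system with a column-stochastic $P$ and a single-entry right-hand side; the 4-node graph, the value 0.85, and the function name `richardson` are illustrative choices.

```python
import numpy as np

def richardson(M, b, tol=1e-12, max_iter=10_000):
    """Richardson iteration x <- x + r with residual r = b - M x."""
    x = np.zeros(len(b))
    r = np.array(b, dtype=float)
    for _ in range(max_iter):
        if np.linalg.norm(r, 1) < tol:
            break
        x = x + r          # add the whole residual to the iterate
        r = b - M @ x      # equals (I - M) applied to the old residual
    return x

# Hypothetical 4-node undirected graph and its column-stochastic walk matrix.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
P = A / A.sum(axis=0)[None, :]
alpha = 0.85
M = np.eye(4) - alpha * P
b = np.zeros(4)
b[0] = 1.0

x_rich = richardson(M, b)
x_exact = np.linalg.solve(M, b)
print(np.abs(x_rich - x_exact).max())
```

Each step here touches the whole matrix through a matvec; adding only one component of the residual at a time is what makes the iteration local.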
THE ALGORITHM
Note the right-hand side $e_i$ is sparse.
If $x^{(0)} = 0$, then the residual $r^{(0)} = e_i$ is sparse.
Idea
only add one component of the residual $r^{(k)}$ to the iterate $x^{(k)}$
THE ALGORITHM
For $(I - \alpha P)\,x = e_i$
Init: $x^{(0)} = 0$, $r^{(0)} = e_i$
How to pick the component $j$ to update?
THE ALGORITHM FOR KATZ
For $(I - \alpha A)\,x = e_i$
Init: $x^{(0)} = 0$, $r^{(0)} = e_i$
Pick $j$ as the index of the largest entry of the residual
Storing the non-zeros of the residual in a heap makes picking the max log(n) time. See Anderson et al. FOCS2008 for more
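A minimal sketch of this one-component-at-a-time update for a Katz column follows, assuming an unweighted undirected graph stored as an adjacency dictionary and `alpha` below 1/max-degree. The function name `katz_column_push`, the lazy-deletion heap, and the example graph are illustrative assumptions rather than the released code.

```python
import heapq
import numpy as np

def katz_column_push(adj, alpha, i, tol=1e-10, max_steps=100_000):
    """Approximate x = (I - alpha*A)^{-1} e_i, touching only visited nodes.
    adj maps each node to a list of its neighbors."""
    x = {}               # sparse solution
    r = {i: 1.0}         # sparse residual, starts as e_i
    heap = [(-1.0, i)]   # max-heap on residual values via negation
    steps = 0
    while heap and steps < max_steps:
        neg_val, j = heapq.heappop(heap)
        rj = r.get(j, 0.0)
        if rj != -neg_val:
            continue     # stale heap entry: r[j] changed after this push
        if rj < tol:
            break        # largest residual entry is already below tol
        x[j] = x.get(j, 0.0) + rj      # move residual mass into the solution
        r[j] = 0.0
        for nbr in adj[j]:             # r <- r + alpha * rj * A e_j
            r[nbr] = r.get(nbr, 0.0) + alpha * rj
            heapq.heappush(heap, (-r[nbr], nbr))
        steps += 1
    return x, r

# Hypothetical graph: a 6-cycle with one chord (max degree 3).
adj = {0: [1, 5, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3, 5], 5: [4, 0]}
alpha = 0.2
x_sparse, resid = katz_column_push(adj, alpha, 0)

# Dense reference solution for comparison on this tiny graph.
n = len(adj)
A = np.zeros((n, n))
for u, nbrs in adj.items():
    for v in nbrs:
        A[u, v] = 1.0
x_exact = np.linalg.solve(np.eye(n) - alpha * A, np.eye(n)[0])
x_dense = np.array([x_sparse.get(k, 0.0) for k in range(n)])
top3 = sorted(x_sparse, key=x_sparse.get, reverse=True)[:3]
print(top3, np.abs(x_dense - x_exact).max())
```

With a loose tolerance the loop stops after a handful of updates and never visits most of a large graph, which is the point of the top-k approach.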
CONVERGENCE?
If $\alpha < 1/\text{max-degree}$ then $\alpha A$ is sub-stochastic and the PageRank-based proof applies because the matrix is diagonally dominant.
For $1/\text{max-degree} \le \alpha < 1/\|A\|_2$ and symmetric $A$, this algorithm is the Gauss-Southwell procedure and it still converges.
RESULTS – DATA, PARAMETERS
All unweighted, connected graphs
Easy
Hard
KATZ BOUND CONVERGENCE
COMMUTE BOUND CONVERGENCE
KATZ SET CONVERGENCE
For arXiv graph.
TIMING
CONCLUSIONS
These algorithms are faster than many alternatives.
For pairwise commute, stopping criteria are simpler with bounds
For top-k problems, we often need less than 1 matvec for good enough results
By AngryDogDesign on DeviantArt
Paper at WAW2010, J. Internet Mathematics
Slides should be online soon
Code is online already
www.cs.purdue.edu/homes/dgleich/
/codes/fast-katz-2011