Anti-differentiating approximation algorithms: PageRank and mincut
DESCRIPTION
We study how Google's PageRank method relates to mincut and a particular type of electrical flow in a network. We also explain in detail how the "push method" for computing PageRank accelerates it. This has implications for semi-supervised learning and machine learning, as well as social network analysis.

TRANSCRIPT
Anti-differentiating approximation algorithms
& new relationships between PageRank, spectral, and localized flow

David F. Gleich, Purdue University
Joint work with Michael Mahoney. Supported by NSF CAREER 1149756-CCF and the Simons Institute.
Anti-differentiating approximation algorithms
& new relationships between PageRank, spectral, and localized flow

1. A new derivation of the PageRank vector for an undirected graph based on Laplacians, cuts, or flows.
2. A new understanding of the "push" methods to compute personalized PageRank.
3. An empirical improvement to methods for semi-supervised learning.
The PageRank problem
The PageRank random surfer:
1. With probability $\beta$, follow a random-walk step.
2. With probability $1-\beta$, jump randomly according to the distribution $v$.
Goal: find the stationary distribution $x$.
Algorithm: solve the linear system

$$(I - \beta A D^{-1})\, x = (1-\beta)\, v, \qquad \text{equivalently} \qquad x = \beta A D^{-1} x + (1-\beta)\, v,$$

where $A$ is the symmetric adjacency matrix, $D$ is the diagonal degree matrix, $x$ is the solution, and $v$ is the jump vector.
The PageRank problem & the Laplacian
Three equivalent formulations:
1. $(I - \beta A D^{-1})\, x = (1-\beta)\, v$;
2. $(I - \beta \mathcal{A})\, y = (1-\beta)\, D^{-1/2} v$, where $\mathcal{A} = D^{-1/2} A D^{-1/2}$ and $x = D^{1/2} y$; and
3. $[\alpha D + L]\, z = \alpha v$, where $\beta = 1/(1+\alpha)$, $x = Dz$, and $L = D - A$ is the combinatorial Laplacian.
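A quick numerical sanity check of these three formulations (my own sketch on a made-up toy graph; not part of the talk):

```python
import numpy as np

# Toy undirected graph (made-up example), plus beta and a jump vector v.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)
D = np.diag(d)
L = D - A                                   # combinatorial Laplacian
beta = 0.85
alpha = (1 - beta) / beta                   # so that beta = 1 / (1 + alpha)
v = np.full(n, 1.0 / n)

# 1. (I - beta A D^{-1}) x = (1 - beta) v
x1 = np.linalg.solve(np.eye(n) - beta * A @ np.diag(1 / d), (1 - beta) * v)

# 2. (I - beta D^{-1/2} A D^{-1/2}) y = (1 - beta) D^{-1/2} v,  with x = D^{1/2} y
Dihalf = np.diag(1 / np.sqrt(d))
y = np.linalg.solve(np.eye(n) - beta * Dihalf @ A @ Dihalf, (1 - beta) * Dihalf @ v)
x2 = np.diag(np.sqrt(d)) @ y

# 3. (alpha D + L) z = alpha v,  with x = D z
z = np.linalg.solve(alpha * D + L, alpha * v)
x3 = D @ z

print(np.allclose(x1, x2), np.allclose(x1, x3))   # both True
```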
The Push Algorithm for PageRank
Proposed (in closest form) by Andersen, Chung, and Lang (also by McSherry, and Jeh & Widom) for personalized PageRank. Strongly related to Gauss-Seidel (see my talk at the Simons Institute for this). Derived to show improved runtime for balanced solvers.
The push method (parameters $\tau$, $\rho$):
1. $x^{(1)} = 0$, $r^{(1)} = (1-\beta)\, e_i$, $k = 1$
2. while any $r_j > \tau d_j$ ($d_j$ is the degree of node $j$):
3. $\quad x^{(k+1)} = x^{(k)} + (r_j - \tau d_j \rho)\, e_j$
4. $\quad r_i^{(k+1)} = \begin{cases} \tau d_j \rho & i = j \\ r_i^{(k)} + \beta (r_j - \tau d_j \rho)/d_j & i \sim j \\ r_i^{(k)} & \text{otherwise} \end{cases}$
5. $\quad k \leftarrow k + 1$
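A minimal Python sketch of the push procedure above (my paraphrase of the pseudocode; the queue-based sweep order and the weighted-edge handling are implementation choices, not prescribed by the slide):

```python
import numpy as np
import scipy.sparse as sp
from collections import deque

def push_pagerank(A, seed, beta=0.85, tau=1e-4, rho=0.5):
    """Approximate personalized PageRank via the push method.

    A: symmetric CSR adjacency matrix (no self-loops assumed); seed: index i
    of the personalization vector e_i.  A node j is pushed whenever its
    residual r_j exceeds tau * d_j: the excess (r_j - tau*d_j*rho) moves to
    x_j, and beta times that amount is spread over the neighbors of j.
    """
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    x = np.zeros(n)
    r = np.zeros(n)
    r[seed] = 1.0 - beta
    queue = deque([seed])
    while queue:
        j = queue.popleft()
        if r[j] <= tau * d[j]:
            continue                               # stale queue entry
        amount = r[j] - tau * d[j] * rho
        x[j] += amount
        r[j] = tau * d[j] * rho
        start, end = A.indptr[j], A.indptr[j + 1]
        for i, a_ij in zip(A.indices[start:end], A.data[start:end]):
            r[i] += beta * amount * a_ij / d[j]    # a_ij = 1 for unweighted graphs
            if r[i] > tau * d[i]:
                queue.append(i)
    return x                                       # sparse (mostly zero) approximation
```

A typical call would look like `x = push_pagerank(A, seed=i, beta=0.85, tau=1e-4)`; the output stays exactly zero on nodes the pushes never reach, which is the sparsity highlighted on the next slide.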
… demo of push …
Why do we care about push?
1. Used for empirical studies of "communities."
2. Used for "fast PageRank" approximation.
It produces sparse approximations to PageRank!
[Figure: Newman's netscience graph, 379 vertices, 1828 nonzeros; the push approximation is "zero" on most of the nodes, while v has a single one at the seed node.]
Our question: Why does the "push method" have such incredible empirical utility?
The O(correct) answer:
1. PageRank is related to the Laplacian.
2. The Laplacian is related to cuts.
3. Andersen, Chung, and Lang provide the "right" bounds and "localization."
This talk: the Θ(correct) answer? A deeper insight into the relationship.
Intellectually indebted to …
Chin, Mądry, Miller & Peng [2013] Orecchia & Zhu [2014]
The s-t min-cut problem

$$\text{minimize } \|Bx\|_{C,1} = \sum_{ij \in E} C_{i,j} |x_i - x_j| \quad \text{subject to } x_s = 1,\; x_t = 0,\; x \ge 0,$$

where $B$ is the unweighted incidence matrix and $C$ is the diagonal capacity matrix.
The localized cut graph
Related to a construction used in "FlowImprove," Andersen & Lang (2007); and Orecchia & Zhu (2014).

$$A_S = \begin{bmatrix} 0 & \alpha d_S^T & 0 \\ \alpha d_S & A & \alpha d_{\bar{S}} \\ 0 & \alpha d_{\bar{S}}^T & 0 \end{bmatrix}$$

Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.
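A small sketch (mine, not from the talk) of assembling the localized cut graph $A_S$ with SciPy, with $s$ and $t$ appended as the first and last vertices; the set $S$ passed in is arbitrary:

```python
import numpy as np
import scipy.sparse as sp

def localized_cut_graph(A, S, alpha):
    """Return A_S with vertex order [s, original graph, t]: s is connected to
    each vertex of S and t to each vertex of the complement, both with
    weight alpha * degree."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    in_S = np.zeros(n, dtype=bool)
    in_S[list(S)] = True
    dS = np.where(in_S, d, 0.0)              # degrees restricted to S
    dT = np.where(~in_S, d, 0.0)             # degrees restricted to S-bar
    s_row = sp.csr_matrix(alpha * dS)        # 1 x n block  (alpha * d_S^T)
    t_row = sp.csr_matrix(alpha * dT)        # 1 x n block  (alpha * d_Sbar^T)
    zero = sp.csr_matrix((1, 1))
    return sp.bmat([[zero,    s_row, zero],
                    [s_row.T, A,     t_row.T],
                    [zero,    t_row, zero]], format='csr')
```

The weighted Laplacian of this graph equals $B_S^T C(\alpha) B_S$, which is the quadratic form behind the 2-norm problems on the next slides.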
The localized cut graph
Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.

$$B_S = \begin{bmatrix} e & -I_S & 0 \\ 0 & B & 0 \\ 0 & -I_{\bar{S}} & e \end{bmatrix}$$

Solve the s-t min-cut:
$$\text{minimize } \|B_S x\|_{C(\alpha),1} \quad \text{subject to } x_s = 1,\; x_t = 0,\; x \ge 0.$$
The localized cut graph
Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.

$$B_S = \begin{bmatrix} e & -I_S & 0 \\ 0 & B & 0 \\ 0 & -I_{\bar{S}} & e \end{bmatrix}$$

Solve the "electrical flow" s-t min-cut:
$$\text{minimize } \|B_S x\|_{C(\alpha),2} \quad \text{subject to } x_s = 1,\; x_t = 0.$$
s-t min-cut → PageRank
The PageRank vector $z$ that solves
$$(\alpha D + L)\, z = \alpha v$$
with $v = d_S / \mathrm{vol}(S)$ is a renormalized solution of the electrical cut computation:
$$\text{minimize } \|B_S x\|_{C(\alpha),2} \quad \text{subject to } x_s = 1,\; x_t = 0.$$
Specifically, if $x$ is the solution, then
$$x = \begin{bmatrix} 1 \\ \mathrm{vol}(S)\, z \\ 0 \end{bmatrix}.$$
Proof: Square and expand the objective into a Laplacian, then apply the constraints.
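A numerical check of this statement (my own sketch; the toy graph, seed set, and $\alpha$ are made-up). Because the 2-norm problem has only equality constraints, its solution is the harmonic extension of the boundary values $x_s = 1$, $x_t = 0$ on the localized cut graph, i.e. one Laplacian solve:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy graph, seed set, and alpha (made-up example values).
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]
d = np.asarray(A.sum(axis=1)).ravel()
S = [0, 1]
alpha = 0.15 / 0.85                              # alpha = (1 - beta) / beta
volS = d[S].sum()

# PageRank formulation: (alpha * D + L) z = alpha * v with v = d_S / vol(S).
v = np.zeros(n)
v[S] = d[S] / volS
D, L = sp.diags(d), sp.diags(d) - A
z = spla.spsolve((alpha * D + L).tocsc(), alpha * v)

# Electrical-flow formulation: build the localized cut graph (order [s, graph, t])
# and solve for the harmonic extension of the boundary values x_s = 1, x_t = 0.
dS = np.where(np.isin(np.arange(n), S), d, 0.0)
dT = d - dS
AS = sp.bmat([[None, sp.csr_matrix(alpha * dS), None],
              [sp.csr_matrix(alpha * dS).T, A, sp.csr_matrix(alpha * dT).T],
              [None, sp.csr_matrix(alpha * dT), None]], format='csr')
LS = (sp.diags(np.asarray(AS.sum(axis=1)).ravel()) - AS).tocsr()
interior = np.arange(1, n + 1)                   # the original graph vertices
rhs = -LS[interior, :][:, [0]].toarray().ravel() # contribution of x_s = 1
x_int = spla.spsolve(LS[interior, :][:, interior].tocsc(), rhs)

# The theorem: the interior of the electrical-flow solution is vol(S) * z.
print(np.allclose(x_int, volS * z))              # True
```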
PageRank → s-t min-cut
That equivalence works if $v$ is degree-weighted. What if $v$ is the uniform vector?

$$A(s) = \begin{bmatrix} 0 & \alpha s^T & 0 \\ \alpha s & A & \alpha(d - s) \\ 0 & \alpha(d - s)^T & 0 \end{bmatrix}.$$
And beyond …
It is easy to cook up interesting diffusion-like problems and adapt them to this framework. In particular, Zhou et al. (2004) gave a semi-supervised learning diffusion that we study soon:

$$\begin{bmatrix} 0 & e_S^T & 0 \\ e_S & \theta A & e_{\bar{S}} \\ 0 & e_{\bar{S}}^T & 0 \end{bmatrix} \qquad (I + \theta L)\, x = e_S$$
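A minimal sketch of that diffusion as a single linear solve (the toy graph, seed set, and $\theta$ are made-up values, not from the talk):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy graph, seed set S, and diffusion parameter theta (illustrative values only).
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]
d = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(d) - A                      # combinatorial Laplacian
theta = 2.0
eS = np.zeros(n)
eS[[0, 1]] = 1.0                         # indicator vector of the seed set S

# The diffusion from the slide: (I + theta * L) x = e_S
x = spla.spsolve((sp.eye(n) + theta * L).tocsc(), eS)
print(x)
```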
Back to the push method
Let $x$ be the output from the push method with $0 < \beta < 1$, $v = d_S/\mathrm{vol}(S)$, $\rho = 1$, and $\tau > 0$. Set $\alpha = \frac{1-\beta}{\beta}$, $\kappa = \tau\, \mathrm{vol}(S)/\beta$, and let $z_G$ solve
$$\text{minimize } \tfrac{1}{2} \|B_S z\|_{C(\alpha),2}^2 + \kappa \|Dz\|_1 \quad \text{subject to } z_s = 1,\; z_t = 0,\; z \ge 0,$$
where $z = \begin{bmatrix} 1 \\ z_G \\ 0 \end{bmatrix}$. Then $x = D z_G / \mathrm{vol}(S)$.

Proof: Write out the KKT conditions and show that the push method solves them. The slackness conditions were "tricky." The 1-norm term acts as regularization for sparsity.
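Here is a sketch of checking this equivalence numerically, assuming cvxpy is available; the toy graph, seed set, and parameter values are mine, and the quadratic is hand-expanded over the graph block $z_G$ (the constant $\tfrac{\alpha}{2}\mathrm{vol}(S)$ is dropped):

```python
import numpy as np
import cvxpy as cp

# Toy undirected graph, seed set S, and push parameters (made-up example values).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)
D, L = np.diag(d), np.diag(d) - A
S = [0, 1]
volS = d[S].sum()
dS = np.zeros(n)
dS[S] = d[S]                               # degree vector restricted to S

beta, tau = 0.85, 1e-2
alpha = (1 - beta) / beta
kappa = tau * volS / beta
v = dS / volS                              # seed distribution d_S / vol(S)

# Push method with rho = 1 and seed vector v (simple sweep over violated nodes).
x = np.zeros(n)
r = (1 - beta) * v
while (r > tau * d).any():
    j = int(np.argmax(r - tau * d))        # any node above threshold works
    amount = r[j] - tau * d[j]             # rho = 1
    x[j] += amount
    r[j] = tau * d[j]
    r += beta * amount * A[:, j] / d[j]    # spread to neighbors (A has no self-loops)

# The regularized cut problem restricted to the graph block z_G, using
# (1/2) z^T L_S z = (1/2) z_G^T (alpha*D + L) z_G - alpha*d_S^T z_G + const, and
# kappa * ||D z_G||_1 = kappa * d^T z_G  (valid because z_G >= 0).
zG = cp.Variable(n, nonneg=True)
objective = 0.5 * cp.quad_form(zG, alpha * D + L) + (kappa * d - alpha * dS) @ zG
cp.Problem(cp.Minimize(objective)).solve()

# The theorem predicts x == D z_G / vol(S); the gap should be ~0 up to solver tolerance.
print(np.max(np.abs(x - d * zG.value / volS)))
```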
Need for normalization
… demo of equivalence …
This is a case of Algorithmic Anti-differentiation!
The ideal world
Given problem P. Derive solution characterization C. Show algorithm A finds a solution where C holds. Profit?!
Example: Given "min-cut." Derive "max-flow is equivalent to min-cut." Show push-relabel solves max-flow. Profit!!
(The ideal world)′
Given problem P. Derive an approximate solution characterization C′. Show algorithm A′ quickly finds a solution where C′ holds. Profit?!
Example: Given "sparsest-cut." Derive the Rayleigh-quotient approximation. Show the power method finds a good Rayleigh quotient. Profit?!
The real world?
Given task P. Hack around until you find something useful. Write a paper presenting "novel heuristic" H for P and … Profit!!
Example: Given "find-communities." Hack around: ??? (hidden) ???. Write a paper presenting "three matvecs finds real-world communities." Profit!!
The real world
Example: Given "find-communities." Hack around. Write a paper presenting "three matvecs finds real-world communities." Profit!!
To understand why H works: show heuristic H solves some problem P′; guess and check until you find something H solves; derive a characterization of heuristic H.
Algorithmic anti-differentiation: Given heuristic H, is there a problem P′ such that H is an algorithm for P′?
e.g. Mahoney & Orecchia
If your algorithm is related to optimization, this is: given a procedure X, what objective does it optimize?
Algorithmic anti-differentiation: Given heuristic H, is there a problem P′ such that H is an algorithm for P′?
In the unconstrained case, this is just "anti-differentiation!"
Algorithmic anti-differentiation in the literature
Dhillon et al. (2007): spectral clustering, trace minimization, and kernel k-means. Saunders (1995): LSQR and Craig iterative methods.
Why does it matter? These details matter in many empirical studies, and they can dramatically impact performance (speed or quality).
Semi-supervised Learning on Graphs

$$A_{i,j} = \exp\left( \frac{-\|d_i - d_j\|_2^2}{2\sigma^2} \right)$$

where $d_i$, $d_j$ are the image vectors and $\sigma = 2.5$ or $\sigma = 1.25$.
Zhou et al., NIPS (2003)
Semi-supervised Learning on Graphs
$\sigma = 2.5$; $\sigma = 1.25$
Experiment: predict unlabeled images from the labeled ones.
Semi-supervised Learning on Graphs

$$K_1 = (I - \alpha\mathcal{A})^{-1} \qquad K_2 = (D - \alpha A)^{-1} \qquad K_3 = (\mathrm{Diag}(\mathcal{A}e) - \alpha\mathcal{A})^{-1}$$
$$Y = K_i L$$

Here $K_3$ is our new "kernel," $L$ holds indicators on the revealed labels, $Y$ gives the predictions, and $\mathcal{A} = D^{-1/2} A D^{-1/2}$ is the normalized adjacency.
Experiment: vary the number of labeled images and track performance, predicting $y = \mathrm{argmax}_j\, Y$.
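A minimal sketch of this prediction pipeline (my illustrative code, not the talk's USPS experiment); the feature matrix, the defaults $\sigma = 2.5$ and $\alpha = 0.99$, and the use of the normalized adjacency for $K_1$ and $K_3$ follow the description here and in the paper excerpt later in this transcript:

```python
import numpy as np

def ssl_predict(features, labels, labeled_idx, sigma=2.5, alpha=0.99, kernel="K3"):
    """Graph semi-supervised prediction with the kernels above (illustrative sketch).

    features: (n, p) array of image vectors d_i; labels: length-n class ids
    (only entries in labeled_idx are revealed); returns predicted class ids.
    """
    n = features.shape[0]
    # RBF weighted graph: A_ij = exp(-||d_i - d_j||^2 / (2 sigma^2)), zero diagonal.
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    An = A / np.sqrt(np.outer(deg, deg))          # normalized adjacency D^{-1/2} A D^{-1/2}
    if kernel == "K1":
        K = np.linalg.inv(np.eye(n) - alpha * An)
    elif kernel == "K2":
        K = np.linalg.inv(np.diag(deg) - alpha * A)
    else:  # "K3": combinatorial-Laplacian-style kernel built from the normalized adjacency
        K = np.linalg.inv(np.diag(An.sum(axis=1)) - alpha * An)
    # Label-indicator matrix (the L in Y = K_i L), one column per class.
    classes = np.unique(np.asarray(labels)[labeled_idx])
    Lmat = np.zeros((n, len(classes)))
    for c, cls in enumerate(classes):
        for i in labeled_idx:
            if labels[i] == cls:
                Lmat[i, c] = 1.0
    Y = K @ Lmat                                  # Y = K_i L
    return classes[np.argmax(Y, axis=1)]          # y = argmax_j Y
```

A call such as `ssl_predict(X, y, labeled_idx=[0, 5, 9], kernel="K2")` returns predictions for every node from the few revealed labels.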
Semi-supervised Learning on Graphs
$$K_1 = (I - \alpha\mathcal{A})^{-1} \qquad K_2 = (D - \alpha A)^{-1} \qquad K_3 = (\mathrm{Diag}(\mathcal{A}e) - \alpha\mathcal{A})^{-1} \qquad Y = K_i L$$
Experiment: vary the number of labeled images and track performance, predicting $y = \mathrm{argmax}_j\, Y$.
[Plot: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and the regularized $K_3$ (rK3), with $\sigma = 1.25$.]
Zhou et al., NIPS (2004)
Semi-supervised Learning on Graphs
Same experiment with $\sigma = 2.5$ (our new value).
[Plot: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and the regularized $K_3$ (rK3), with $\sigma = 2.5$; random guessing shown for reference.]
What's happening?
[Plot: ROC curves (true positive vs. false positive rate) for the task "2 vs. 1, 2, 3, 4" with $\sigma = 2.5$ and $\sigma = 1.25$, comparing $K_1$, $K_2$, $K_3$, and rK3.]
Much better performance!
The results of our !regularized estimate
Why does it matter? Theory has the answer!
We "sweep" over cuts from approximate eigenvectors! It's the order, not the values.
Improved performance
[Plot: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and the regularized $K_3$ (rK3).]
$Y = K_i L$, predicting $y = \mathrm{argmin}_j\, \mathrm{SortedRank}(Y)$.
We have spent no time tuning the regularization parameter.
Anti-differentiating Approximation Algorithms (paper excerpt)
[Figure 1 panels: 16 nonzeros, 15 nonzeros, 284 nonzeros, 24 nonzeros.]
Figure 1. Examples of the different cut vectors on a portion of the net-science graph. At left we show the set S highlighted with its vertices enlarged. In the other figures, we show the solution vectors from the various cut problems (from left to right, Probs. (2), (4), and (6), or mincut, PageRank, and ACL) for this set S. Each vector determines the color and size of a vertex where high values are large and dark. White vertices with outlines are numerically non-zero (which is why most of the vertices in the fourth figure are outlined in contrast to the third figure).
1.3. Application to semi-supervised learning
We recreate the USPS digit recognition problem from Zhou et al. (2003). Consider the digits 1, 2, 3, 4. For all nodes with these true labels, construct a weighted graph using a radial basis function:
$$A_{i,j} = \exp\left(-\|d_i - d_j\|_2^2 / (2\sigma^2)\right)$$
where $\sigma \in \{1.25, 2.5\}$, and where $d_i$ is the vector representation of the $i$th image of a digit. This constructs the weighted graph that is the input to the semi-supervised learning method. We randomly pick nodes to be labeled in the input and ensure that at least one node from each class is chosen; and we predict labels using the following two kernels as in Zhou et al. (2003):
$$K_1 = (I - \alpha\mathcal{A})^{-1}, \quad \text{and} \quad K_2 = (D - \alpha A)^{-1},$$
where we used $\alpha = 0.99$, as in the prior work.
When we wanted to use our new regularized cut problems, we discovered that the weighted graphs A constructed through the radial basis function kernel method had many rows that were effectively zero. These are images that are not similar to anything else in the data. This fact posed a problem for our formulation because we need the seed vector $s \ge 0$ and the complement seed vector $d - s \ge 0$. Also, the cases where we would divide by the degree would be grossly inflated by these small degrees. Consequently, we investigated a third kernel:
$$K_3 = (\mathrm{Diag}(\mathcal{A}e) - \alpha\mathcal{A})^{-1}.$$
The intuition for this kernel is that we first normalize the degrees via $\mathcal{A}$, and then use a combinatorial Laplacian kernel to propagate the labels. Because the degrees are much more uniform, our new cut-based formulations are significantly easier to apply and had much better results in practice. Finally, we also consider using our 1-norm based regularization applied to $K_3$ with $\kappa = 10^{-5}$, which, for the sake of consistency, we label rK3 although it is not a matrix-type kernel.
Detail on the error rates (Figure 2). For the multi-class prediction task explored by Zhou et al., we show the mean error rates (percentage of mislabeled points) over 100 trials for four methods ($K_1$, $K_2$, $K_3$, and rK3) as we vary the number of labels on four problems. The four problems use the digit sets 1, 2, 3, 4 and 5, 6, 7, 8 with $\sigma = 1.25$ and $\sigma = 2.5$ in the weight function. And as in Zhou et al. (2003), we use the maximum entry in each row of $K_i S$, where $S$ represents the matrix of labeled nodes: $S_{i,j} = 1$ if node $i$ is chosen as a labeled node for digit $j$. We note:
· First, setting $\sigma = 1.25$ results in very low error rates, and all of the kernels have approximately the same performance once the number of labels is above 20. For fewer labels, the regularized kernel has worse performance. With $\sigma = 1.25$, the graph has a good partition based on the four classes. The regularization does not help because it limits the propagation of labels to all of the nodes; however, once there are sufficient labels this effect weakens.
· Second, when $\sigma = 2.5$, the results are much worse. Here, $K_2$ has the best performance and the regularized kernel rK3 has the worst performance.
Details on the ROC curves (Figure 3). We look at all ROC curves for single-class prediction tasks with 20 labeled nodes as we sweep along the solutions for a single label from each of the kernels. (Formally,
Recap & Conclusions
Open issues: Better treatment of directed graphs? An algorithm for $\rho < 1$? ($\rho$ is set to ½ in most "uses"; this needs new analysis.)
New relationships between localized cuts & PageRank. A new understanding of the PPR "push" procedure. Improvements to semi-supervised learning on graphs!