Anti-differentiating approximation algorithms: PageRank and mincut
DESCRIPTION
We study how Google's PageRank method relates to mincut and a particular type of electrical flow in a network. We also explain in detail how the "push method" for computing PageRank accelerates it. This has implications for semi-supervised learning and machine learning, as well as social network analysis.

TRANSCRIPT
Anti-differentiating approximation algorithms
& new relationships between PageRank, spectral, and localized flow

David F. Gleich, Purdue University
Joint work with Michael Mahoney. Supported by NSF CAREER 1149756-CCF and the Simons Institute.
Anti-differentiating approximation algorithms
& new relationships between PageRank, spectral, and localized flow

1. A new derivation of the PageRank vector for an undirected graph based on Laplacians, cuts, or flows.
2. A new understanding of the "push" methods to compute personalized PageRank.
3. An empirical improvement to methods for semi-supervised learning.
The PageRank problem
The PageRank random surfer:
1. With probability $\beta$, follow a random-walk step.
2. With probability $1-\beta$, jump randomly according to the distribution $v$.
Goal: find the stationary distribution $x$.
Algorithm: solve the linear system

$$(I - \beta A D^{-1})\, x = (1-\beta)\, v, \qquad \text{equivalently} \qquad x = \beta A D^{-1} x + (1-\beta)\, v,$$

where $A$ is the symmetric adjacency matrix, $D$ is the diagonal degree matrix, $x$ is the solution, and $v$ is the jump vector.
The PageRank problem & the Laplacian
Three equivalent formulations:
1. $(I - \beta A D^{-1})\, x = (1-\beta)\, v$;
2. $(I - \beta \mathcal{A})\, y = (1-\beta)\, D^{-1/2} v$, where $\mathcal{A} = D^{-1/2} A D^{-1/2}$ and $x = D^{1/2} y$; and
3. $[\alpha D + L]\, z = \alpha v$, where $\beta = 1/(1+\alpha)$, $x = Dz$, and $L = D - A$ is the combinatorial Laplacian.
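A quick numerical sanity check of these three formulations (my own sketch on a made-up toy graph; not part of the talk):

```python
import numpy as np

# Toy undirected graph (made-up example), plus beta and a jump vector v.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)
D = np.diag(d)
L = D - A                                   # combinatorial Laplacian
beta = 0.85
alpha = (1 - beta) / beta                   # so that beta = 1 / (1 + alpha)
v = np.full(n, 1.0 / n)

# 1. (I - beta A D^{-1}) x = (1 - beta) v
x1 = np.linalg.solve(np.eye(n) - beta * A @ np.diag(1 / d), (1 - beta) * v)

# 2. (I - beta D^{-1/2} A D^{-1/2}) y = (1 - beta) D^{-1/2} v,  with x = D^{1/2} y
Dihalf = np.diag(1 / np.sqrt(d))
y = np.linalg.solve(np.eye(n) - beta * Dihalf @ A @ Dihalf, (1 - beta) * Dihalf @ v)
x2 = np.diag(np.sqrt(d)) @ y

# 3. (alpha D + L) z = alpha v,  with x = D z
z = np.linalg.solve(alpha * D + L, alpha * v)
x3 = D @ z

print(np.allclose(x1, x2), np.allclose(x1, x3))   # both True
```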
The Push Algorithm for PageRank
Proposed (in closest form) by Andersen, Chung, and Lang (also by McSherry, and Jeh & Widom) for personalized PageRank. Strongly related to Gauss-Seidel (see my talk at the Simons Institute for this). Derived to show improved runtime for balanced solvers.
The push method (parameters $\tau$, $\rho$):
1. $x^{(1)} = 0$, $r^{(1)} = (1-\beta)\, e_i$, $k = 1$
2. while any $r_j > \tau d_j$ ($d_j$ is the degree of node $j$):
3. $\quad x^{(k+1)} = x^{(k)} + (r_j - \tau d_j \rho)\, e_j$
4. $\quad r_i^{(k+1)} = \begin{cases} \tau d_j \rho & i = j \\ r_i^{(k)} + \beta (r_j - \tau d_j \rho)/d_j & i \sim j \\ r_i^{(k)} & \text{otherwise} \end{cases}$
5. $\quad k \leftarrow k + 1$
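A minimal Python sketch of the push procedure above (my paraphrase of the pseudocode; the queue-based sweep order and the weighted-edge handling are implementation choices, not prescribed by the slide):

```python
import numpy as np
import scipy.sparse as sp
from collections import deque

def push_pagerank(A, seed, beta=0.85, tau=1e-4, rho=0.5):
    """Approximate personalized PageRank via the push method.

    A: symmetric CSR adjacency matrix (no self-loops assumed); seed: index i
    of the personalization vector e_i.  A node j is pushed whenever its
    residual r_j exceeds tau * d_j: the excess (r_j - tau*d_j*rho) moves to
    x_j, and beta times that amount is spread over the neighbors of j.
    """
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    x = np.zeros(n)
    r = np.zeros(n)
    r[seed] = 1.0 - beta
    queue = deque([seed])
    while queue:
        j = queue.popleft()
        if r[j] <= tau * d[j]:
            continue                               # stale queue entry
        amount = r[j] - tau * d[j] * rho
        x[j] += amount
        r[j] = tau * d[j] * rho
        start, end = A.indptr[j], A.indptr[j + 1]
        for i, a_ij in zip(A.indices[start:end], A.data[start:end]):
            r[i] += beta * amount * a_ij / d[j]    # a_ij = 1 for unweighted graphs
            if r[i] > tau * d[i]:
                queue.append(i)
    return x                                       # sparse (mostly zero) approximation
```

A typical call would look like `x = push_pagerank(A, seed=i, beta=0.85, tau=1e-4)`; the output stays exactly zero on nodes the pushes never reach, which is the sparsity highlighted on the next slide.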
… demo of push …
Why do we care about push?
1. Used for empirical studies of "communities."
2. Used for "fast PageRank" approximation.
It produces sparse approximations to PageRank!
[Figure: Newman's netscience graph, 379 vertices, 1828 nonzeros; the push approximation is "zero" on most of the nodes, while v has a single one at the seed node.]
Our question: Why does the "push method" have such incredible empirical utility?
The O(correct) answer:
1. PageRank is related to the Laplacian.
2. The Laplacian is related to cuts.
3. Andersen, Chung, and Lang provide the "right" bounds and "localization."
This talk: the Θ(correct) answer? A deeper insight into the relationship.
Intellectually indebted to …
Chin, Mądry, Miller & Peng [2013] Orecchia & Zhu [2014]
The s-t min-cut problem

$$\text{minimize } \|Bx\|_{C,1} = \sum_{ij \in E} C_{i,j} |x_i - x_j| \quad \text{subject to } x_s = 1,\; x_t = 0,\; x \ge 0,$$

where $B$ is the unweighted incidence matrix and $C$ is the diagonal capacity matrix.
The localized cut graph
Related to a construction used in "FlowImprove," Andersen & Lang (2007); and Orecchia & Zhu (2014).

$$A_S = \begin{bmatrix} 0 & \alpha d_S^T & 0 \\ \alpha d_S & A & \alpha d_{\bar{S}} \\ 0 & \alpha d_{\bar{S}}^T & 0 \end{bmatrix}$$

Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.
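A small sketch (mine, not from the talk) of assembling the localized cut graph $A_S$ with SciPy, with $s$ and $t$ appended as the first and last vertices; the set $S$ passed in is arbitrary:

```python
import numpy as np
import scipy.sparse as sp

def localized_cut_graph(A, S, alpha):
    """Return A_S with vertex order [s, original graph, t]: s is connected to
    each vertex of S and t to each vertex of the complement, both with
    weight alpha * degree."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    in_S = np.zeros(n, dtype=bool)
    in_S[list(S)] = True
    dS = np.where(in_S, d, 0.0)              # degrees restricted to S
    dT = np.where(~in_S, d, 0.0)             # degrees restricted to S-bar
    s_row = sp.csr_matrix(alpha * dS)        # 1 x n block  (alpha * d_S^T)
    t_row = sp.csr_matrix(alpha * dT)        # 1 x n block  (alpha * d_Sbar^T)
    zero = sp.csr_matrix((1, 1))
    return sp.bmat([[zero,    s_row, zero],
                    [s_row.T, A,     t_row.T],
                    [zero,    t_row, zero]], format='csr')
```

The weighted Laplacian of this graph equals $B_S^T C(\alpha) B_S$, which is the quadratic form behind the 2-norm problems on the next slides.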
The localized cut graph
Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.

$$B_S = \begin{bmatrix} e & -I_S & 0 \\ 0 & B & 0 \\ 0 & -I_{\bar{S}} & e \end{bmatrix}$$

Solve the s-t min-cut:
$$\text{minimize } \|B_S x\|_{C(\alpha),1} \quad \text{subject to } x_s = 1,\; x_t = 0,\; x \ge 0.$$
The localized cut graph
Connect $s$ to vertices in $S$ with weight $\alpha \cdot$ degree; connect $t$ to vertices in $\bar{S}$ with weight $\alpha \cdot$ degree.

$$B_S = \begin{bmatrix} e & -I_S & 0 \\ 0 & B & 0 \\ 0 & -I_{\bar{S}} & e \end{bmatrix}$$

Solve the "electrical flow" s-t min-cut:
$$\text{minimize } \|B_S x\|_{C(\alpha),2} \quad \text{subject to } x_s = 1,\; x_t = 0.$$
s-t min-cut → PageRank
The PageRank vector $z$ that solves
$$(\alpha D + L)\, z = \alpha v$$
with $v = d_S / \mathrm{vol}(S)$ is a renormalized solution of the electrical cut computation:
$$\text{minimize } \|B_S x\|_{C(\alpha),2} \quad \text{subject to } x_s = 1,\; x_t = 0.$$
Specifically, if $x$ is the solution, then
$$x = \begin{bmatrix} 1 \\ \mathrm{vol}(S)\, z \\ 0 \end{bmatrix}.$$
Proof: Square and expand the objective into a Laplacian, then apply the constraints.
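A numerical check of this statement (my own sketch; the toy graph, seed set, and $\alpha$ are made-up). Because the 2-norm problem has only equality constraints, its solution is the harmonic extension of the boundary values $x_s = 1$, $x_t = 0$ on the localized cut graph, i.e. one Laplacian solve:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy graph, seed set, and alpha (made-up example values).
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]
d = np.asarray(A.sum(axis=1)).ravel()
S = [0, 1]
alpha = 0.15 / 0.85                              # alpha = (1 - beta) / beta
volS = d[S].sum()

# PageRank formulation: (alpha * D + L) z = alpha * v with v = d_S / vol(S).
v = np.zeros(n)
v[S] = d[S] / volS
D, L = sp.diags(d), sp.diags(d) - A
z = spla.spsolve((alpha * D + L).tocsc(), alpha * v)

# Electrical-flow formulation: build the localized cut graph (order [s, graph, t])
# and solve for the harmonic extension of the boundary values x_s = 1, x_t = 0.
dS = np.where(np.isin(np.arange(n), S), d, 0.0)
dT = d - dS
AS = sp.bmat([[None, sp.csr_matrix(alpha * dS), None],
              [sp.csr_matrix(alpha * dS).T, A, sp.csr_matrix(alpha * dT).T],
              [None, sp.csr_matrix(alpha * dT), None]], format='csr')
LS = (sp.diags(np.asarray(AS.sum(axis=1)).ravel()) - AS).tocsr()
interior = np.arange(1, n + 1)                   # the original graph vertices
rhs = -LS[interior, :][:, [0]].toarray().ravel() # contribution of x_s = 1
x_int = spla.spsolve(LS[interior, :][:, interior].tocsc(), rhs)

# The theorem: the interior of the electrical-flow solution is vol(S) * z.
print(np.allclose(x_int, volS * z))              # True
```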
PageRank → s-t min-cut
That equivalence works if $v$ is degree-weighted. What if $v$ is the uniform vector?

$$A(s) = \begin{bmatrix} 0 & \alpha s^T & 0 \\ \alpha s & A & \alpha(d - s) \\ 0 & \alpha(d - s)^T & 0 \end{bmatrix}.$$
And beyond …
It is easy to cook up interesting diffusion-like problems and adapt them to this framework. In particular, Zhou et al. (2004) gave a semi-supervised learning diffusion that we study soon:

$$\begin{bmatrix} 0 & e_S^T & 0 \\ e_S & \theta A & e_{\bar{S}} \\ 0 & e_{\bar{S}}^T & 0 \end{bmatrix} \qquad (I + \theta L)\, x = e_S$$
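A minimal sketch of that diffusion as a single linear solve (the toy graph, seed set, and $\theta$ are made-up values, not from the talk):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy graph, seed set S, and diffusion parameter theta (illustrative values only).
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]
d = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(d) - A                      # combinatorial Laplacian
theta = 2.0
eS = np.zeros(n)
eS[[0, 1]] = 1.0                         # indicator vector of the seed set S

# The diffusion from the slide: (I + theta * L) x = e_S
x = spla.spsolve((sp.eye(n) + theta * L).tocsc(), eS)
print(x)
```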
Back to the push method
Let $x$ be the output from the push method with $0 < \beta < 1$, $v = d_S/\mathrm{vol}(S)$, $\rho = 1$, and $\tau > 0$. Set $\alpha = \frac{1-\beta}{\beta}$, $\kappa = \tau\, \mathrm{vol}(S)/\beta$, and let $z_G$ solve
$$\text{minimize } \tfrac{1}{2} \|B_S z\|_{C(\alpha),2}^2 + \kappa \|Dz\|_1 \quad \text{subject to } z_s = 1,\; z_t = 0,\; z \ge 0,$$
where $z = \begin{bmatrix} 1 \\ z_G \\ 0 \end{bmatrix}$. Then $x = D z_G / \mathrm{vol}(S)$.

Proof: Write out the KKT conditions and show that the push method solves them. The slackness conditions were "tricky." The 1-norm term acts as regularization for sparsity.
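Here is a sketch of checking this equivalence numerically, assuming cvxpy is available; the toy graph, seed set, and parameter values are mine, and the quadratic is hand-expanded over the graph block $z_G$ (the constant $\tfrac{\alpha}{2}\mathrm{vol}(S)$ is dropped):

```python
import numpy as np
import cvxpy as cp

# Toy undirected graph, seed set S, and push parameters (made-up example values).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)
D, L = np.diag(d), np.diag(d) - A
S = [0, 1]
volS = d[S].sum()
dS = np.zeros(n)
dS[S] = d[S]                               # degree vector restricted to S

beta, tau = 0.85, 1e-2
alpha = (1 - beta) / beta
kappa = tau * volS / beta
v = dS / volS                              # seed distribution d_S / vol(S)

# Push method with rho = 1 and seed vector v (simple sweep over violated nodes).
x = np.zeros(n)
r = (1 - beta) * v
while (r > tau * d).any():
    j = int(np.argmax(r - tau * d))        # any node above threshold works
    amount = r[j] - tau * d[j]             # rho = 1
    x[j] += amount
    r[j] = tau * d[j]
    r += beta * amount * A[:, j] / d[j]    # spread to neighbors (A has no self-loops)

# The regularized cut problem restricted to the graph block z_G, using
# (1/2) z^T L_S z = (1/2) z_G^T (alpha*D + L) z_G - alpha*d_S^T z_G + const, and
# kappa * ||D z_G||_1 = kappa * d^T z_G  (valid because z_G >= 0).
zG = cp.Variable(n, nonneg=True)
objective = 0.5 * cp.quad_form(zG, alpha * D + L) + (kappa * d - alpha * dS) @ zG
cp.Problem(cp.Minimize(objective)).solve()

# The theorem predicts x == D z_G / vol(S); the gap should be ~0 up to solver tolerance.
print(np.max(np.abs(x - d * zG.value / volS)))
```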
Need for normalization
… demo of equivalence …
This is a case of Algorithmic Anti-differentiation!
The ideal world
Given problem P. Derive solution characterization C. Show algorithm A finds a solution where C holds. Profit?!
Example: Given "min-cut." Derive "max-flow is equivalent to min-cut." Show push-relabel solves max-flow. Profit!!
(The ideal world)′
Given problem P. Derive an approximate solution characterization C′. Show algorithm A′ quickly finds a solution where C′ holds. Profit?!
Example: Given "sparsest-cut." Derive the Rayleigh-quotient approximation. Show the power method finds a good Rayleigh quotient. Profit?!
The real world?
Given task P. Hack around until you find something useful. Write a paper presenting "novel heuristic" H for P and … Profit!!
Example: Given "find-communities." Hack around: ??? (hidden) ???. Write a paper presenting "three matvecs finds real-world communities." Profit!!
The real world
Example: Given "find-communities." Hack around. Write a paper presenting "three matvecs finds real-world communities." Profit!!
To understand why H works: show heuristic H solves some problem P′; guess and check until you find something H solves; derive a characterization of heuristic H.
Algorithmic anti-differentiation: Given heuristic H, is there a problem P′ such that H is an algorithm for P′?
e.g. Mahoney & Orecchia
If your algorithm is related to optimization, this is: given a procedure X, what objective does it optimize?
Algorithmic anti-differentiation: Given heuristic H, is there a problem P′ such that H is an algorithm for P′?
In the unconstrained case, this is just "anti-differentiation!"
Algorithmic anti-differentiation in the literature
Dhillon et al. (2007): spectral clustering, trace minimization, and kernel k-means. Saunders (1995): LSQR and Craig iterative methods.
Why does it matter? These details matter in many empirical studies, and they can dramatically impact performance (speed or quality).
Semi-supervised Learning on Graphs

$$A_{i,j} = \exp\left( \frac{-\|d_i - d_j\|_2^2}{2\sigma^2} \right)$$

where $d_i$, $d_j$ are the image vectors and $\sigma = 2.5$ or $\sigma = 1.25$.
Zhou et al., NIPS (2003)
Semi-supervised Learning on Graphs
$\sigma = 2.5$; $\sigma = 1.25$
Experiment: predict unlabeled images from the labeled ones.
Semi-supervised Learning on Graphs

$$K_1 = (I - \alpha\mathcal{A})^{-1} \qquad K_2 = (D - \alpha A)^{-1} \qquad K_3 = (\mathrm{Diag}(\mathcal{A}e) - \alpha\mathcal{A})^{-1}$$
$$Y = K_i L$$

Here $K_3$ is our new "kernel," $L$ holds indicators on the revealed labels, $Y$ gives the predictions, and $\mathcal{A} = D^{-1/2} A D^{-1/2}$ is the normalized adjacency.
Experiment: vary the number of labeled images and track performance, predicting $y = \mathrm{argmax}_j\, Y$.
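A minimal sketch of this prediction pipeline (my illustrative code, not the talk's USPS experiment); the feature matrix, the defaults $\sigma = 2.5$ and $\alpha = 0.99$, and the use of the normalized adjacency for $K_1$ and $K_3$ follow the description here and in the paper excerpt later in this transcript:

```python
import numpy as np

def ssl_predict(features, labels, labeled_idx, sigma=2.5, alpha=0.99, kernel="K3"):
    """Graph semi-supervised prediction with the kernels above (illustrative sketch).

    features: (n, p) array of image vectors d_i; labels: length-n class ids
    (only entries in labeled_idx are revealed); returns predicted class ids.
    """
    n = features.shape[0]
    # RBF weighted graph: A_ij = exp(-||d_i - d_j||^2 / (2 sigma^2)), zero diagonal.
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    An = A / np.sqrt(np.outer(deg, deg))          # normalized adjacency D^{-1/2} A D^{-1/2}
    if kernel == "K1":
        K = np.linalg.inv(np.eye(n) - alpha * An)
    elif kernel == "K2":
        K = np.linalg.inv(np.diag(deg) - alpha * A)
    else:  # "K3": combinatorial-Laplacian-style kernel built from the normalized adjacency
        K = np.linalg.inv(np.diag(An.sum(axis=1)) - alpha * An)
    # Label-indicator matrix (the L in Y = K_i L), one column per class.
    classes = np.unique(np.asarray(labels)[labeled_idx])
    Lmat = np.zeros((n, len(classes)))
    for c, cls in enumerate(classes):
        for i in labeled_idx:
            if labels[i] == cls:
                Lmat[i, c] = 1.0
    Y = K @ Lmat                                  # Y = K_i L
    return classes[np.argmax(Y, axis=1)]          # y = argmax_j Y
```

A call such as `ssl_predict(X, y, labeled_idx=[0, 5, 9], kernel="K2")` returns predictions for every node from the few revealed labels.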
Semi-supervised Learning on Graphs
$$K_1 = (I - \alpha\mathcal{A})^{-1} \qquad K_2 = (D - \alpha A)^{-1} \qquad K_3 = (\mathrm{Diag}(\mathcal{A}e) - \alpha\mathcal{A})^{-1} \qquad Y = K_i L$$
Experiment: vary the number of labeled images and track performance, predicting $y = \mathrm{argmax}_j\, Y$.
[Plot: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and the regularized $K_3$ (rK3), with $\sigma = 1.25$.]
Zhou et al., NIPS (2004)
Semi-supervised Learning on Graphs
Same experiment with $\sigma = 2.5$ (our new value).
[Plot: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and the regularized $K_3$ (rK3), with $\sigma = 2.5$; random guessing shown for reference.]
What's happening?
[Plot: ROC curves (true positive vs. false positive rate) for the task "2 vs. 1, 2, 3, 4" with $\sigma = 2.5$ and $\sigma = 1.25$, comparing $K_1$, $K_2$, $K_3$, and rK3.]
Much better performance!
The results of our !regularized estimate
Why does it matter? Theory has the answer!
We "sweep" over cuts from approximate eigenvectors! It's the order, not the values.
Improved performance
[Plot: error rate vs. number of labels for $K_1$, $K_2$, $K_3$, and the regularized $K_3$ (rK3).]
$Y = K_i L$, predicting $y = \mathrm{argmin}_j\, \mathrm{SortedRank}(Y)$.
We have spent no time tuning the regularization parameter.
Anti-differentiating Approximation Algorithms (paper excerpt)
[Figure 1 panels: 16 nonzeros, 15 nonzeros, 284 nonzeros, 24 nonzeros.]
Figure 1. Examples of the different cut vectors on a portion of the net-science graph. At left we show the set S highlighted with its vertices enlarged. In the other figures, we show the solution vectors from the various cut problems (from left to right, Probs. (2), (4), and (6), or mincut, PageRank, and ACL) for this set S. Each vector determines the color and size of a vertex where high values are large and dark. White vertices with outlines are numerically non-zero (which is why most of the vertices in the fourth figure are outlined in contrast to the third figure).
1.3. Application to semi-supervised learning
We recreate the USPS digit recognition problem from Zhou et al. (2003). Consider the digits 1, 2, 3, 4. For all nodes with these true labels, construct a weighted graph using a radial basis function:
$$A_{i,j} = \exp\left(-\|d_i - d_j\|_2^2 / (2\sigma^2)\right)$$
where $\sigma \in \{1.25, 2.5\}$, and where $d_i$ is the vector representation of the $i$th image of a digit. This constructs the weighted graph that is the input to the semi-supervised learning method. We randomly pick nodes to be labeled in the input and ensure that at least one node from each class is chosen; and we predict labels using the following two kernels as in Zhou et al. (2003):
$$K_1 = (I - \alpha\mathcal{A})^{-1}, \quad \text{and} \quad K_2 = (D - \alpha A)^{-1},$$
where we used $\alpha = 0.99$, as in the prior work.
When we wanted to use our new regularized cut problems, we discovered that the weighted graphs A constructed through the radial basis function kernel method had many rows that were effectively zero. These are images that are not similar to anything else in the data. This fact posed a problem for our formulation because we need the seed vector $s \ge 0$ and the complement seed vector $d - s \ge 0$. Also, the cases where we would divide by the degree would be grossly inflated by these small degrees. Consequently, we investigated a third kernel:
$$K_3 = (\mathrm{Diag}(\mathcal{A}e) - \alpha\mathcal{A})^{-1}.$$
The intuition for this kernel is that we first normalize the degrees via $\mathcal{A}$, and then use a combinatorial Laplacian kernel to propagate the labels. Because the degrees are much more uniform, our new cut-based formulations are significantly easier to apply and had much better results in practice. Finally, we also consider using our 1-norm based regularization applied to $K_3$ with $\kappa = 10^{-5}$, which, for the sake of consistency, we label rK3 although it is not a matrix-type kernel.
Detail on the error rates (Figure 2). For the multi-class prediction task explored by Zhou et al., we show the mean error rates (percentage of mislabeled points) over 100 trials for four methods ($K_1$, $K_2$, $K_3$, and rK3) as we vary the number of labels on four problems. The four problems use the digit sets 1, 2, 3, 4 and 5, 6, 7, 8 with $\sigma = 1.25$ and $\sigma = 2.5$ in the weight function. And as in Zhou et al. (2003), we use the maximum entry in each row of $K_i S$, where $S$ represents the matrix of labeled nodes: $S_{i,j} = 1$ if node $i$ is chosen as a labeled node for digit $j$. We note:
· First, setting $\sigma = 1.25$ results in very low error rates, and all of the kernels have approximately the same performance once the number of labels is above 20. For fewer labels, the regularized kernel has worse performance. With $\sigma = 1.25$, the graph has a good partition based on the four classes. The regularization does not help because it limits the propagation of labels to all of the nodes; however, once there are sufficient labels this effect weakens.
· Second, when $\sigma = 2.5$, the results are much worse. Here, $K_2$ has the best performance and the regularized kernel rK3 has the worst performance.
Details on the ROC curves (Figure 3). We look at all ROC curves for single-class prediction tasks with 20 labeled nodes as we sweep along the solutions for a single label from each of the kernels. (Formally,
Recap & Conclusions
Open issues: Better treatment of directed graphs? An algorithm for $\rho < 1$? ($\rho$ is set to ½ in most "uses"; this needs new analysis.)
New relationships between localized cuts & PageRank. A new understanding of the PPR "push" procedure. Improvements to semi-supervised learning on graphs!