Anti-differentiating Approximation Algorithms: PageRank and MinCut


Upload: david-gleich

Post on 15-Jan-2015


DESCRIPTION

We study how Google's PageRank method relates to mincut and a particular type of electrical flow in a network. We also explain in detail how the "push method" accelerates the computation of PageRank. This has implications for semi-supervised learning and machine learning, as well as social network analysis.

TRANSCRIPT

Page 1: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Anti-differentiating approximation algorithms
& new relationships between PageRank, spectral, and localized flow

David F. Gleich, Purdue University

Joint work with Michael Mahoney. Supported by NSF CAREER 1149756-CCF, Simons Inst.


Page 2: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Anti-differentiating approximation algorithms
& new relationships between PageRank, spectral, and localized flow

1. A new derivation of the PageRank vector for an undirected graph based on Laplacians, cuts, or flows.

2. A new understanding of the "push" method to compute Personalized PageRank.

3. An empirical improvement to methods for semi-supervised learning.


Page 3: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The PageRank problem. The PageRank random surfer:

1. With probability β, follow a random-walk step.

2. With probability (1 - β), jump randomly according to the distribution v.

Goal: find the stationary distribution x. Algorithm: solve the linear system

x = β A D^{-1} x + (1 - β) v,   i.e.,   (I - β A D^{-1}) x = (1 - β) v

where A is the symmetric adjacency matrix, D is the diagonal degree matrix, x is the solution, and v is the jump vector.
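As a concrete illustration (not code from the talk), here is a minimal Python/NumPy sketch that solves this linear system on a made-up four-node graph; the graph, β, and jump vector are assumptions chosen for the example.

import numpy as np

# Toy undirected graph: adjacency matrix A and diagonal degree matrix D (invented data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
n = A.shape[0]

beta = 0.85                      # random-walk probability
v = np.array([1.0, 0, 0, 0])     # jump vector: teleport to node 0

# Solve (I - beta * A * D^{-1}) x = (1 - beta) v
P = A @ np.linalg.inv(D)         # column-stochastic random-walk matrix A D^{-1}
x = np.linalg.solve(np.eye(n) - beta * P, (1 - beta) * v)

# x is also the fixed point of x = beta * P x + (1 - beta) v
assert np.allclose(x, beta * P @ x + (1 - beta) * v)
print(x, x.sum())                # entries sum to 1 because v sums to 1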


Page 4: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The PageRank problem & the Laplacian

Three equivalent formulations:

1. (I - β A D^{-1}) x = (1 - β) v;

2. (I - β 𝒜) y = (1 - β) D^{-1/2} v, where 𝒜 = D^{-1/2} A D^{-1/2} and x = D^{1/2} y; and

3. [α D + L] z = α v, where β = 1/(1 + α) and x = D z.

Here L = D - A is the combinatorial Laplacian.
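As a quick numerical check of the equivalence between formulations 1 and 3, here is a small NumPy sketch on an invented graph; the substitution x = Dz and β = 1/(1 + α) come from the slide, everything else is illustrative.

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
D = np.diag(d)
L = D - A                        # combinatorial Laplacian
n = A.shape[0]

beta = 0.85
alpha = (1 - beta) / beta        # so that beta = 1 / (1 + alpha)
v = np.array([1.0, 0, 0, 0])

# Formulation 1: (I - beta A D^{-1}) x = (1 - beta) v
x1 = np.linalg.solve(np.eye(n) - beta * A @ np.linalg.inv(D), (1 - beta) * v)

# Formulation 3: [alpha D + L] z = alpha v, with x = D z
z = np.linalg.solve(alpha * D + L, alpha * v)
x3 = D @ z

print(np.allclose(x1, x3))       # True: the two formulations agree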


Page 5: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The Push Algorithm for PageRank. Proposed (in closest form) in Andersen, Chung & Lang (also by McSherry, and Jeh & Widom) for personalized PageRank.

Strongly related to Gauss-Seidel (see my talk at Simons for this). Derived to show improved runtime for balanced solvers.

The Push Method (parameters τ, ρ):

1. x^(1) = 0, r^(1) = (1 - β) e_i, k = 1

2. while any r_j > τ d_j   (d_j is the degree of node j)

3.   x^(k+1) = x^(k) + (r_j - τ d_j ρ) e_j

4.   r_i^(k+1) = { τ d_j ρ                            if i = j
                   r_i^(k) + β (r_j - τ d_j ρ)/d_j    if i ~ j
                   r_i^(k)                            otherwise }

5.   k ← k + 1
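The following Python sketch implements the push loop as written above for an unweighted graph; the helper name push_pagerank, the toy graph, and the parameter values are my own choices for illustration, not code from the talk.

import numpy as np

def push_pagerank(A, i, beta=0.85, tau=1e-4, rho=0.5):
    """Sketch of the push loop from the slide: seed node i, threshold tau * d_j."""
    d = A.sum(axis=1)
    n = A.shape[0]
    x = np.zeros(n)
    r = np.zeros(n)
    r[i] = 1 - beta                       # r^(1) = (1 - beta) e_i
    while True:
        over = np.where(r > tau * d)[0]   # nodes whose residual exceeds tau * degree
        if len(over) == 0:
            break
        j = over[0]
        push = r[j] - tau * d[j] * rho    # mass moved from the residual to the solution
        x[j] += push
        r[j] = tau * d[j] * rho
        nbrs = np.where(A[j] > 0)[0]
        # each neighbor receives beta * push / d_j (A[j, nbr] = 1 on an unweighted graph)
        r[nbrs] += beta * push / d[j] * A[j, nbrs]
    return x, r

# Toy unweighted graph; seed at node 0.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x, r = push_pagerank(A, 0)
print(x)      # a sparse, nonnegative approximation related to personalized PageRank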


Page 6: Anti-differentiating Approximation Algorithms: PageRank and MinCut

… demo of push …


Page 7: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Why do we care about push?

1. Used for empirical studies of "communities"

2. Used for "fast PageRank" approximation

It produces sparse approximations to PageRank!

Example: Newman's netscience graph (379 vertices, 1828 nonzeros). The solution is "zero" on most of the nodes; v has a single one at the seed node.


Page 8: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Our question: Why does the “push method” have such incredible empirical utility?


Page 9: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The O(correct) answer:

1. PageRank is related to the Laplacian

2. The Laplacian is related to cuts

3. Andersen, Chung & Lang provide the "right" bounds and "localization"

This talk: the Θ(correct) answer? A deeper insight into the relationship.


Page 10: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Intellectually indebted to …

Chin, Mądry, Miller & Peng [2013]; Orecchia & Zhu [2014]


Page 11: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The s-t min-cut problem:

minimize   ‖B x‖_{C,1} = Σ_{ij ∈ E} C_{i,j} |x_i - x_j|
subject to x_s = 1, x_t = 0, x ≥ 0.

B is the unweighted incidence matrix; C is the diagonal capacity matrix.
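Because this 1-norm objective is a linear program, one way to see it in action is to hand it to a generic LP solver. The sketch below does that with scipy.optimize.linprog on an invented five-edge graph; the graph, capacities, and variable layout are assumptions for illustration.

import numpy as np
from scipy.optimize import linprog

# Toy weighted graph as an edge list (i, j, capacity); s = 0, t = 3 (all invented).
edges = [(0, 1, 2.0), (0, 2, 1.0), (1, 2, 1.0), (1, 3, 1.0), (2, 3, 2.0)]
n, m = 4, len(edges)
s, t = 0, 3

# Variables: [x_0 .. x_{n-1}, y_0 .. y_{m-1}] with y_e playing the role of |x_i - x_j|.
c = np.concatenate([np.zeros(n), [cap for (_, _, cap) in edges]])

A_ub, b_ub = [], []
for e, (i, j, _) in enumerate(edges):
    row = np.zeros(n + m); row[i], row[j], row[n + e] = 1, -1, -1
    A_ub.append(row); b_ub.append(0.0)      #  x_i - x_j - y_e <= 0
    row = np.zeros(n + m); row[i], row[j], row[n + e] = -1, 1, -1
    A_ub.append(row); b_ub.append(0.0)      #  x_j - x_i - y_e <= 0

A_eq = np.zeros((2, n + m)); A_eq[0, s], A_eq[1, t] = 1, 1
b_eq = [1.0, 0.0]                           # x_s = 1, x_t = 0

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n + m))
print(res.fun, res.x[:n])   # min-cut value; x near 0/1 indicates the two sides of the cut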


Page 12: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The localized cut graph. Related to a construction used in "FlowImprove," Andersen & Lang (2007); and Orecchia & Zhu (2014).

A_S = [ 0         α d_S^T     0
        α d_S     A           α d_S̄
        0         α d_S̄^T     0 ]

Connect s to vertices in S with weight α · degree. Connect t to vertices in S̄ with weight α · degree.
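A small NumPy sketch of this construction, assuming the graph's nodes are indexed 0..n-1 and s and t are appended as the first and last rows; the helper name and the toy graph are illustrative, not from the talk.

import numpy as np

def localized_cut_graph(A, S, alpha):
    """Augment A with a source s and sink t as on the slide:
    s connects to each vertex in S with weight alpha * degree,
    t connects to each vertex in the complement with weight alpha * degree."""
    n = A.shape[0]
    d = A.sum(axis=1)
    dS = np.where(np.isin(np.arange(n), S), d, 0.0)   # d_S: degrees restricted to S
    dSbar = d - dS                                    # degrees restricted to the complement
    AS = np.zeros((n + 2, n + 2))
    AS[1:n+1, 1:n+1] = A               # original graph in the middle block
    AS[0, 1:n+1] = alpha * dS          # row and column for s (index 0)
    AS[1:n+1, 0] = alpha * dS
    AS[n+1, 1:n+1] = alpha * dSbar     # row and column for t (index n+1)
    AS[1:n+1, n+1] = alpha * dSbar
    return AS

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(localized_cut_graph(A, S=[0, 1], alpha=0.2))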


Page 13: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The localized cut graph. Connect s to vertices in S with weight α · degree. Connect t to vertices in S̄ with weight α · degree.

B_S = [ e    -I_S    0
        0     B      0
        0    -I_S̄    e ]

Solve the s-t min-cut:

minimize   ‖B_S x‖_{C(α),1}
subject to x_s = 1, x_t = 0, x ≥ 0.


Page 14: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The localized cut graph. Connect s to vertices in S with weight α · degree. Connect t to vertices in S̄ with weight α · degree.

B_S = [ e    -I_S    0
        0     B      0
        0    -I_S̄    e ]

Solve the "electrical flow" s-t min-cut:

minimize   ‖B_S x‖_{C(α),2}
subject to x_s = 1, x_t = 0.


Page 15: Anti-differentiating Approximation Algorithms: PageRank and MinCut

s-t min-cut → PageRank

The PageRank vector z that solves

(α D + L) z = α v,   with v = d_S / vol(S),

is a renormalized solution of the electrical cut computation:

minimize   ‖B_S x‖_{C(α),2}
subject to x_s = 1, x_t = 0.

Specifically, if x is the solution, then

x = [ 1 ; vol(S) z ; 0 ].

Proof: Square and expand the objective into a Laplacian, then apply the constraints.
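Here is a numerical sketch of this statement on a toy graph: solve the PageRank system (αD + L)z = αv with v = d_S/vol(S), solve the electrical-flow problem via the Laplacian of the localized cut graph, and compare. The graph, S, and α are invented for illustration.

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)
D, L = np.diag(d), np.diag(d) - A
S = [0, 1]
alpha = 0.2

# PageRank side: (alpha D + L) z = alpha v with v = d_S / vol(S)
dS = np.array([d[i] if i in S else 0.0 for i in range(n)])
vol_S = dS.sum()
z = np.linalg.solve(alpha * D + L, alpha * dS / vol_S)

# Electrical-flow side: the squared objective is x' L(A_S) x for the localized cut
# graph, so with x_s = 1 and x_t = 0 fixed, the interior block solves a Laplacian system.
AS = np.zeros((n + 2, n + 2))
AS[1:n+1, 1:n+1] = A
AS[0, 1:n+1] = alpha * dS
AS[1:n+1, 0] = alpha * dS
AS[n+1, 1:n+1] = alpha * (d - dS)
AS[1:n+1, n+1] = alpha * (d - dS)
L_aug = np.diag(AS.sum(axis=1)) - AS
x_int = np.linalg.solve(L_aug[1:n+1, 1:n+1], -L_aug[1:n+1, 0] * 1.0)

# The slide's claim: x = [1; vol(S) z; 0]
print(np.allclose(x_int, vol_S * z))    # True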


Page 16: Anti-differentiating Approximation Algorithms: PageRank and MinCut

PageRank → s-t min-cut

That equivalence works if v is degree-weighted. What if v is the uniform vector?

A(s) = [ 0             α s^T           0
         α s           A               α (d - s)
         0             α (d - s)^T     0 ].


Page 17: Anti-differentiating Approximation Algorithms: PageRank and MinCut

And beyond …

Easy to cook up interesting diffusion-like problems and adapt them to this framework. In particular, Zhou et al. (2004) gave a semi-supervised learning diffusion we study soon.

[ 0       e_S^T     0
  e_S     θ A       e_S̄
  0       e_S̄^T     0 ]          (I + θ L) x = e_S
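A minimal sketch of this diffusion on a toy graph, solving (I + θL)x = e_S directly with NumPy; the graph, seed set, and θ are assumptions for illustration.

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
n = A.shape[0]
theta = 0.5
eS = np.array([1.0, 1.0, 0.0, 0.0])   # indicator of the seed set S = {0, 1}

x = np.linalg.solve(np.eye(n) + theta * L, eS)
print(x)    # diffusion values: large inside S, decaying outside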


Page 18: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Back to the push method

Let x be the output from the push method with 0 < β < 1, v = d_S / vol(S), ρ = 1, and τ > 0. Set α = (1 - β)/β, κ = τ vol(S)/β, and let z_G solve:

minimize   (1/2) ‖B_S z‖²_{C(α),2} + κ ‖D z‖_1
subject to z_s = 1, z_t = 0, z ≥ 0,

where z = [ 1 ; z_G ; 0 ]. Then x = D z_G / vol(S).

Proof: Write out the KKT conditions and show that the push method solves them. Slackness was "tricky."

(The 1-norm term is regularization for sparsity; the scaling x = D z_G / vol(S) reflects the need for normalization.)

Page 19: Anti-differentiating Approximation Algorithms: PageRank and MinCut

… demo of equivalence …


Page 20: Anti-differentiating Approximation Algorithms: PageRank and MinCut

This is a case of Algorithmic Anti-differentiation!


Page 21: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The ideal world

Given Problem P. Derive solution characterization C. Show algorithm A finds a solution where C holds. Profit?

Given "min-cut." Derive "max-flow is equivalent to min-cut." Show push-relabel solves max-flow. Profit!


Page 22: Anti-differentiating Approximation Algorithms: PageRank and MinCut

(The ideal world)’

Given Problem P. Derive approximate solution characterization C'. Show algorithm A' quickly finds a solution where C' holds. Profit?

Given "sparsest-cut." Derive the Rayleigh-quotient approximation. Show the power method finds a good Rayleigh quotient. Profit?


Page 23: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The real world?

Given Task P. Hack around until you find something useful. Write a paper presenting "novel heuristic" H for P and … Profit!

Given "find-communities." Hack around ??? (hidden) ???. Write a paper presenting "three matvecs finds real-world communities." Profit!


Page 24: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Understand why H works: show heuristic H solves P'. Guess and check until you find something H solves. Derive a characterization of heuristic H.

The real world

Given "find-communities." Hack around. Write a paper presenting "three matvecs finds real-world communities." Profit!

Algorithmic Anti-differentiation: Given heuristic H, is there a problem P' such that H is an algorithm for P'?


e.g. Mahoney & Orecchia

Page 25: Anti-differentiating Approximation Algorithms: PageRank and MinCut

If your algorithm is related to optimization, this is: given a procedure X, what objective does it optimize?

The real world. Algorithmic Anti-differentiation: Given heuristic H, is there a problem P' such that H is an algorithm for P'?

In an unconstrained case, this is just "anti-differentiation!"


Page 26: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Algorithmic Anti-differentiation in the literature

Dhillon et al. (2007): spectral clustering, trace minimization & kernel k-means. Saunders (1995): LSQR & Craig iterative methods.


Page 27: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Why does it matter? These details matter in many empirical studies, and can dramatically impact performance (speed or quality).


Page 28: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Semi-supervised Learning on Graphs

A_{i,j} = exp( -‖d_i - d_j‖²₂ / (2σ²) ),   where d_i is the vector for image i

σ = 2.5 and σ = 1.25

Zhou et al. NIPS (2003)
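A small Python sketch of this graph construction; random vectors stand in for the digit images, and zeroing the diagonal is my assumption (the slide does not specify what happens at i = j).

import numpy as np

def rbf_graph(X, sigma):
    """Weighted graph with A_ij = exp(-||x_i - x_j||_2^2 / (2 sigma^2)) from data rows X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)    # assumption: no self-loops
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 16))   # fake "images" just to exercise the construction
print(rbf_graph(X, sigma=2.5).round(3))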


Page 29: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Semi-supervised Learning on Graphs

σ = 2.5 and σ = 1.25

Experiment: predict unlabeled images from the labeled ones.

Page 30: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Semi-supervised Learning on Graphs

K1 = (I - γ A)^{-1}
K2 = (D - γ A)^{-1}
K3 = (Diag(A e) - γ A)^{-1}   (our new "kernel")

Y = K_i L, where L holds indicators on the revealed labels and Y gives the predictions; y = argmax_j Y.

Experiment: vary the number of labeled images and track performance.
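A sketch of this prediction pipeline, Y = K_i L followed by y = argmax_j Y, using K2 (which is invertible for γ < 1 because D - γA is diagonally dominant); the data, labels, revealed nodes, and σ are invented, and the other kernels would be used the same way.

import numpy as np

# Fake data standing in for the digit images.
rng = np.random.default_rng(0)
n_pts, n_classes = 30, 4
X = rng.normal(size=(n_pts, 16))
true_labels = rng.integers(0, n_classes, size=n_pts)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-sq / (2 * 2.5 ** 2))
np.fill_diagonal(A, 0.0)
D = np.diag(A.sum(axis=1))

gamma = 0.99
K2 = np.linalg.inv(D - gamma * A)

# Label matrix: one indicator column per class, nonzero only on the revealed labels.
revealed = [0, 5, 10, 15]
Lmat = np.zeros((n_pts, n_classes))
for i in revealed:
    Lmat[i, true_labels[i]] = 1.0

Y = K2 @ Lmat            # diffuse the revealed labels through the kernel: Y = K_i L
y = Y.argmax(axis=1)     # predicted class per node
print(y)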


Page 31: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Semi-supervised Learning on Graphs

K1 = (I - γ A)^{-1},  K2 = (D - γ A)^{-1},  K3 = (Diag(A e) - γ A)^{-1}

Y = K_i L;  y = argmax_j Y

Experiment: vary the number of labeled images and track performance.

[Figure: error rate vs. number of labels for K1, K2, K3, and rK3 (regularized K3); σ = 1.25. Zhou et al. NIPS (2004).]


Page 32: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Semi-supervised Learning on Graphs

K1 = (I - γ A)^{-1},  K2 = (D - γ A)^{-1},  K3 = (Diag(A e) - γ A)^{-1}

Y = K_i L;  y = argmax_j Y

Experiment: vary the number of labeled images and track performance.

[Figure: error rate vs. number of labels for K1, K2, K3, and rK3 (regularized K3); σ = 2.5, our new value; the random-guessing level is marked.]


Page 33: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Semi-supervised Learning on Graphs

K1 = (I - γ A)^{-1},  K2 = (D - γ A)^{-1},  K3 = (Diag(A e) - γ A)^{-1}

Y = K_i L;  y = argmax_j Y

Experiment: vary the number of labeled images and track performance.

[Figure: error rate vs. number of labels for K1, K2, K3, and rK3 (regularized K3); σ = 2.5, our new value; the random-guessing level is marked.]


Page 34: Anti-differentiating Approximation Algorithms: PageRank and MinCut

What’s happening?

[Figure: ROC curves (true positive rate vs. false positive rate) for the task "2 vs. 1,2,3,4" with σ = 2.5 and σ = 1.25, for K1, K2, K3, and rK3.]

Much better performance!


Page 35: Anti-differentiating Approximation Algorithms: PageRank and MinCut

The results of our regularized estimate



Page 36: Anti-differentiating Approximation Algorithms: PageRank and MinCut

Why does it matter? Theory has the answer!

We "sweep" over cuts from approximate eigenvectors. It's the order, not the values.
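The "sweep" referred to here is the usual procedure of ordering vertices by the (degree-normalized) vector and taking the prefix of smallest conductance; below is a minimal Python sketch of that idea, not code from the talk, with a toy graph and vector chosen for illustration.

import numpy as np

def sweep_cut(A, x):
    """Order vertices by x / degree and return the prefix set with the best conductance.
    A simple O(n^3) sketch; only the *order* of x matters, not its values."""
    d = A.sum(axis=1)
    vol_total = d.sum()
    order = np.argsort(-x / d)
    best_set, best_cond = None, np.inf
    for k in range(1, len(d)):             # proper, nonempty prefixes
        mask = np.zeros(len(d), dtype=bool)
        mask[order[:k]] = True
        cut = A[mask][:, ~mask].sum()      # weight of edges leaving the prefix
        vol = min(d[mask].sum(), vol_total - d[mask].sum())
        cond = cut / vol
        if cond < best_cond:
            best_set, best_cond = order[:k].copy(), cond
    return best_set, best_cond

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([0.4, 0.3, 0.2, 0.1])         # e.g., a PageRank-style vector
print(sweep_cut(A, x))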


Page 37: Anti-differentiating Approximation Algorithms: PageRank and MinCut

[Figure: error rate vs. number of labels for K1, K2, K3, and rK3 (regularized K3); σ = 2.5, our new value.]

Improved performance.

Y = K_i L;  y = argmin_j SortedRank(Y)

K1 = (I - γ A)^{-1},  K2 = (D - γ A)^{-1},  K3 = (Diag(A e) - γ A)^{-1}

We have spent no time tuning the regularization parameter.

Page 38: Anti-differentiating Approximation Algorithms: PageRank and MinCut


Anti-differentiating Approximation Algorithms

Figure 1. Examples of the different cut vectors on a portion of the net-science graph. At left we show the set S highlighted with its vertices enlarged. In the other figures, we show the solution vectors from the various cut problems (from left to right, Probs. (2), (4), and (6), or mincut, PageRank, and ACL) for this set S. Each vector determines the color and size of a vertex where high values are large and dark. White vertices with outlines are numerically non-zero (which is why most of the vertices in the fourth figure are outlined in contrast to the third figure). [Panel labels: 16 nonzeros, 15 nonzeros, 284 nonzeros, 24 nonzeros.]

1.3. Application to semi-supervised learning

We recreate the USPS digit recognition problem from Zhou et al. (2003). Consider the digits 1, 2, 3, 4. For all nodes with these true labels, construct a weighted graph using a radial basis function:

A_{i,j} = exp( -‖d_i - d_j‖²₂ / (2σ²) )

where σ ∈ {1.25, 2.5}, and where d_i is the vector representation of the ith image of a digit. This constructs the weighted graph that is the input to the semi-supervised learning method. We randomly pick nodes to be labeled in the input and ensure that at least one node from each class is chosen; and we predict labels using the following two kernels as in Zhou et al. (2003):

K1 = (I - γ A)^{-1}, and
K2 = (D - γ A)^{-1},

where we used γ = 0.99, as in the prior work.

When we wanted to use our new regularized cut problems, we discovered that the weighted graphs A constructed through the radial basis function kernel method had many rows that were effectively zero. These are images that are not similar to anything else in the data. This fact posed a problem for our formulation because we need the seed vector s ≥ 0 and the complement seed vector d - s ≥ 0. Also, the cases where we would divide by the degree would be grossly inflated by these small degrees. Consequently, we investigated a third kernel:

K3 = (Diag(A e) - γ A)^{-1}.

The intuition for this kernel is that we first normalize the degrees via A, and then use a combinatorial Laplacian kernel to propagate the labels. Because the degrees are much more uniform, our new cut-based formulations are significantly easier to apply and had much better results in practice. Finally, we also consider using our 1-norm based regularization applied to K3 with κ = 10^{-5}, which for the sake of consistency, we label rK3 although it is not a matrix-type kernel.

Detail on the error rates (Figure 2). For the multi-class prediction task explored by Zhou et al., we show the mean error rates (percentage of mislabeled points) over 100 trials for four methods (K1, K2, K3, and rK3) as we vary the number of labels on four problems. The four problems use the digit sets 1, 2, 3, 4 and 5, 6, 7, 8 with σ = 1.25 and σ = 2.5 in the weight function. And as in Zhou et al. (2003), we use the maximum entry in each row of K_i S, where S represents the matrix of labeled nodes with S_{i,j} = 1 if node i is chosen as a labeled node for digit j. We note:

· First, setting σ = 1.25 results in very low error rates, and all of the kernels have approximately the same performance once the number of labels is above 20. For fewer labels, the regularized kernel has worse performance. With σ = 1.25, the graph has a good partition based on the four classes. The regularization does not help because it limits the propagation of labels to all of the nodes; however, once there are sufficient labels this effect weakens.

· Second, when σ = 2.5, the results are much worse. Here, K2 has the best performance and the regularized kernel rK3 has the worst performance.

Details on the ROC curves (Figure 3). We look at all ROC curves for single-class prediction tasks with 20 labeled nodes as we sweep along the solutions for a single label from each of the kernels. (Formally,

Recap & Conclusions


Open issues: Better treatment of directed graphs? An algorithm for ρ < 1? (ρ is set to ½ in most "uses"; this needs new analysis.)

New relationships between localized cuts & PageRank. A new understanding of the PPR "push" procedure. Improvements to semi-supervised learning on graphs!