Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow


Upload: david-gleich

Post on 15-Jan-2015


DESCRIPTION

This talk covers anti-differentiating approximation algorithms, an idea for explaining the success of widely used heuristic procedures. Formally, this involves finding an optimization problem that an approximation algorithm or heuristic solves exactly.

TRANSCRIPT

Page 1: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

Algorithmic Anti-Differentiation

A case study with min-cuts, spectral, and flow

David F. Gleich · Purdue University
Michael W. Mahoney · Berkeley ICSI

Code: www.cs.purdue.edu/homes/dgleich/codes/l1pagerank

Page 2: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

Algorithmic Anti-differentiation: understanding how and why heuristic procedures such as
• early stopping,
• truncating small entries,
• etc.
are actually algorithms for implicit objectives.

Page 3: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The ideal world: given problem P, derive solution characterization C, show algorithm A finds a solution where C holds. Profit?

Example: given “min-cut,” derive “max-flow is equivalent to min-cut,” show push-relabel solves max-flow. Profit!

Page 4: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

(The ideal world)′: given problem P, derive approximate solution characterization C′, show algorithm A′ finds a solution where C′ holds. Profit?

Example: given “sparsest-cut,” derive the Rayleigh-quotient approximation, show the power method finds a good Rayleigh quotient. Profit? (In academia!)

Page 5: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The real world: given task P, hack around until you find something useful, write a paper presenting “novel heuristic” H for P. Profit!

Example: given “find-communities,” hack around (… hidden …), write a paper on “three steps of the power method finds communities.” Profit!

Page 6: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

(The ideal world)′′: to understand why H works, derive a characterization of heuristic H, guess and check until you find a problem P′ that H solves, and show that heuristic H solves P′.

Example: given “find-communities,” hack around, write a paper on “three steps of the power method finds communities.” Profit!

Page 7: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The real world: algorithmic anti-differentiation. Given heuristic H, is there a problem P′ such that H is an algorithm for P′?

If your algorithm is related to optimization, this is: given a procedure X, what objective does it optimize?

In the smooth, unconstrained case, this is just “anti-differentiation!”


Page 8: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

Algorithmic Anti-differentiation in the literature:
• Mahoney & Orecchia (2011): three steps of the power method and p-norm regularization.
• Dhillon et al. (2007): spectral clustering, trace minimization & kernel k-means.
• Saunders (1995): LSQR & CRAIG iterative methods for Ax = b.
• … many more …

Page 9: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

Outline
1. A new derivation of the PageRank vector for an undirected graph based on Laplacians, cuts, or flows.
2. An understanding of the implicit regularization in the PageRank “push” method.
3. The impact of this on a few applications.

Page 10: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The PageRank problem

The PageRank random surfer:
1. With probability β, follow a random-walk step.
2. With probability (1 − β), jump randomly according to the distribution v.
Goal: find the stationary distribution x.

With A the symmetric adjacency matrix, D the diagonal degree matrix, and v the jump vector, the solution x satisfies

\[ (I - \beta A D^{-1})\, x = (1 - \beta)\, v. \]

This is equivalent to the combinatorial-Laplacian system

\[ [\alpha D + L]\, z = \alpha v, \quad \text{where } \beta = 1/(1+\alpha) \text{ and } x = Dz. \]
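To make the equivalence concrete, here is a minimal numerical sketch (not the speaker's code; the tiny example graph and parameter values are made up) checking that the two linear systems give the same vector.

```python
import numpy as np

# Check that (I - beta*A*D^{-1}) x = (1 - beta) v  and  [alpha*D + L] z = alpha*v
# agree, with beta = 1/(1 + alpha) and x = D z.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # small undirected example graph
d = A.sum(axis=1)
D = np.diag(d)
L = D - A                                   # combinatorial Laplacian

alpha = 0.5
beta = 1.0 / (1.0 + alpha)
v = np.array([1.0, 0.0, 0.0, 0.0])          # jump (teleportation) distribution

x = np.linalg.solve(np.eye(4) - beta * A @ np.diag(1.0 / d), (1 - beta) * v)
z = np.linalg.solve(alpha * D + L, alpha * v)

print(np.allclose(x, D @ z))                # True: the two systems agree
```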

Page 11: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The Push Algorithm for PageRank

Proposed (in closest form) in Andersen, Chung & Lang (also by McSherry, and Jeh & Widom) for personalized PageRank. Strongly related to Gauss-Seidel and coordinate descent; derived to quickly approximate PageRank with sparsity.

The push method, with parameters τ and ρ:

1. x^(1) = 0, r^(1) = (1 − β) e_i, k = 1
2. While any r_j > τ d_j (d_j is the degree of node j):
3.   x^(k+1) = x^(k) + (r_j − τ d_j ρ) e_j
4.   r_i^(k+1) = τ d_j ρ                            if i = j
     r_i^(k+1) = r_i^(k) + β (r_j − τ d_j ρ) / d_j   if i ∼ j
     r_i^(k+1) = r_i^(k)                             otherwise
5.   k ← k + 1
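Here is a minimal sketch of the push procedure above (not the speaker's implementation; it assumes a simple undirected graph with no self-loops and uses a dense adjacency matrix for clarity).

```python
import numpy as np

# beta: teleportation parameter, tau: tolerance, rho: push fraction, seed: start node.
def push_pagerank(A, seed, beta=0.85, tau=1e-4, rho=1.0):
    n = A.shape[0]
    d = A.sum(axis=1)
    x = np.zeros(n)                      # solution estimate (stays sparse)
    r = np.zeros(n)                      # residual
    r[seed] = 1.0 - beta
    while True:
        over = np.where(r > tau * d)[0]  # nodes with too much residual
        if len(over) == 0:
            break
        j = over[0]
        amount = r[j] - tau * d[j] * rho
        x[j] += amount                   # step 3: move mass into the solution
        r[j] = tau * d[j] * rho          # step 4, case i = j
        neighbors = np.where(A[j] > 0)[0]
        r[neighbors] += beta * amount / d[j]   # step 4, case i ~ j
    return x, r
```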

Page 12: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The push method stays local


Page 13: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

Why do we care about push?
1. It is used for empirical studies of “communities” and is an ingredient in an empirically successful community finder (Whang et al., CIKM 2013).
2. It is used for “fast PageRank” approximation.
3. It produces sparse approximations to PageRank (illustrated in the sketch below).

Example figure: Newman’s netscience graph, 379 vertices, 1828 non-zeros; v has a single one on the seed node, and the push solution is “zero” on most of the nodes.
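A small usage sketch of point 3 (reusing the hypothetical push_pagerank helper from the previous snippet; the path graph is a made-up example, not the netscience graph): push seeded at one node touches only a few nodes, while the exact PageRank solve has many more non-zero entries.

```python
import numpy as np

n = 100
P = np.zeros((n, n))
for i in range(n - 1):
    P[i, i + 1] = P[i + 1, i] = 1.0      # a 100-node path graph

beta, tau = 0.85, 1e-3
x_push, _ = push_pagerank(P, seed=0, beta=beta, tau=tau)

d = P.sum(axis=1)
v = np.zeros(n); v[0] = 1.0
x_exact = np.linalg.solve(np.eye(n) - beta * P @ np.diag(1.0 / d),
                          (1 - beta) * v)

# push touches far fewer nodes than the exact solve
print("push nonzeros: ", np.count_nonzero(x_push))
print("exact nonzeros:", np.count_nonzero(x_exact > 1e-12))
```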

Page 14: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The s-t min-cut problem, where B is the unweighted incidence matrix and C is the diagonal cost matrix:

\[ \begin{aligned} &\text{minimize} && \|Bx\|_{C,1} = \sum_{ij \in E} C_{ij}\,|x_i - x_j| \\ &\text{subject to} && x_s = 1,\ x_t = 0,\ x \ge 0. \end{aligned} \]
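As one concrete (hypothetical) way to solve this problem, the absolute values can be linearized with an auxiliary variable per edge and handed to a generic LP solver; a minimal sketch using scipy follows (not the speaker's code).

```python
import numpy as np
from scipy.optimize import linprog

# minimize sum_{ij in E} C_ij |x_i - x_j|  s.t.  x_s = 1, x_t = 0, x >= 0,
# linearized with an auxiliary variable y_e >= |x_i - x_j| for each edge.
def st_mincut_lp(edges, costs, n, s, t):
    m = len(edges)
    c = np.concatenate([np.zeros(n), np.asarray(costs, dtype=float)])
    A_ub, b_ub = [], []
    for e, (i, j) in enumerate(edges):
        row = np.zeros(n + m)
        row[i], row[j], row[n + e] = 1.0, -1.0, -1.0    #  x_i - x_j <= y_e
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(n + m)
        row[i], row[j], row[n + e] = -1.0, 1.0, -1.0    #  x_j - x_i <= y_e
        A_ub.append(row); b_ub.append(0.0)
    A_eq = np.zeros((2, n + m)); b_eq = np.array([1.0, 0.0])
    A_eq[0, s] = 1.0                                     # x_s = 1
    A_eq[1, t] = 1.0                                     # x_t = 0
    bounds = [(0, None)] * (n + m)                       # x >= 0, y >= 0
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.fun            # relaxed cut indicator and cut value
```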

Page 15: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The localized cut graph

Related to a construction used in “FlowImprove,” Andersen & Lang (2007), and in Orecchia & Zhu (2014). Connect s to the vertices in S with weight α · degree, and connect t to the vertices in S̄ with weight α · degree:

\[ A_S = \begin{bmatrix} 0 & \alpha d_S^T & 0 \\ \alpha d_S & A & \alpha d_{\bar S} \\ 0 & \alpha d_{\bar S}^T & 0 \end{bmatrix}. \]
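A minimal construction sketch of the matrix A_S above (a hypothetical helper, not the speaker's code), placing the source at node 0 and the sink at the last node:

```python
import numpy as np

def localized_cut_graph(A, S, alpha):
    """Augment A with a source (node 0) tied to S and a sink (last node) tied
    to the complement of S, each with weight alpha * degree."""
    n = A.shape[0]
    d = A.sum(axis=1)
    in_S = np.zeros(n, dtype=bool)
    in_S[list(S)] = True
    dS = np.where(in_S, d, 0.0)            # degrees restricted to S
    dSbar = np.where(~in_S, d, 0.0)        # degrees restricted to the complement
    AS = np.zeros((n + 2, n + 2))
    AS[1:n+1, 1:n+1] = A                   # original graph in the middle block
    AS[0, 1:n+1] = AS[1:n+1, 0] = alpha * dS         # source connections
    AS[n+1, 1:n+1] = AS[1:n+1, n+1] = alpha * dSbar  # sink connections
    return AS
```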

Page 16: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The localized cut graph & PageRank

Solve the s-t min-cut:

\[ \begin{aligned} &\text{minimize} && \|B_S x\|_{C(\alpha),1} \\ &\text{subject to} && x_s = 1,\ x_t = 0,\ x \ge 0. \end{aligned} \]
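Putting the two earlier sketches together (localized_cut_graph and st_mincut_lp are the hypothetical helpers defined above; the 4-node graph is a made-up example), this is roughly how one would solve that s-t min-cut:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
AS = localized_cut_graph(A, S=[0, 1], alpha=0.5)       # augmented graph
n2 = AS.shape[0]
edges = [(i, j) for i in range(n2) for j in range(i + 1, n2) if AS[i, j] > 0]
costs = [AS[i, j] for (i, j) in edges]
x, cut_value = st_mincut_lp(edges, costs, n2, s=0, t=n2 - 1)
print(np.round(x, 3), cut_value)
```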

Page 17: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

The localized cut graph & PageRank

Solve the “spectral” s-t min-cut:

\[ \begin{aligned} &\text{minimize} && \|B_S x\|_{C(\alpha),2} \\ &\text{subject to} && x_s = 1,\ x_t = 0,\ x \ge 0. \end{aligned} \]

The PageRank vector z that solves (αD + L) z = αv with v = d_S / vol(S) is a renormalized solution of the electrical cut computation:

\[ \begin{aligned} &\text{minimize} && \|B_S x\|_{C(\alpha),2} \\ &\text{subject to} && x_s = 1,\ x_t = 0. \end{aligned} \]

Specifically, if x is the solution, then

\[ x = \begin{bmatrix} 1 \\ \mathrm{vol}(S)\, z \\ 0 \end{bmatrix}. \]
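Here is a minimal numeric check of that statement on a made-up 4-node graph (not the speaker's code): the electrical solution on the localized cut graph with x_s = 1, x_t = 0 matches vol(S)·z on the interior nodes.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)
S = [0, 1]
alpha = 0.5
vol_S = d[S].sum()
v = np.zeros(n); v[S] = d[S] / vol_S          # v = d_S / vol(S)

# PageRank-style solve: (alpha*D + L) z = alpha * v
L = np.diag(d) - A
z = np.linalg.solve(alpha * np.diag(d) + L, alpha * v)

# Electrical (2-norm) cut on the localized cut graph: minimize x^T L_AS x with
# x_s = 1, x_t = 0, so the interior values satisfy L_AS[int,int] x = -L_AS[int,s].
AS = np.zeros((n + 2, n + 2))
AS[1:n+1, 1:n+1] = A
dS = np.where(np.isin(np.arange(n), S), d, 0.0)
AS[0, 1:n+1] = AS[1:n+1, 0] = alpha * dS
AS[n+1, 1:n+1] = AS[1:n+1, n+1] = alpha * (d - dS)
LAS = np.diag(AS.sum(axis=1)) - AS
interior = np.arange(1, n + 1)
x_int = np.linalg.solve(LAS[np.ix_(interior, interior)], -LAS[interior, 0])

print(np.allclose(x_int, vol_S * z))          # True up to round-off
```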

Page 18: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

Back to the push method

Let x be the output from the push method with 0 < β < 1, v = d_S / vol(S), ρ = 1, and τ > 0. Set α = (1 − β)/β and κ = τ vol(S)/β, and let z_G solve

\[ \begin{aligned} &\text{minimize} && \tfrac{1}{2}\|B_S z\|_{C(\alpha),2}^2 + \kappa\,\|Dz\|_1 \\ &\text{subject to} && z_s = 1,\ z_t = 0,\ z \ge 0, \end{aligned} \]

where z = [1; z_G; 0]. Then x = D z_G / vol(S).

Proof idea: write out the KKT conditions and show that the push method solves them; the slackness conditions were the “tricky” part.

Takeaways: the 1-norm term provides regularization for sparsity, and the relation x = D z_G / vol(S) shows the need for normalization.

Page 19: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

A simple example

4. Illustrative empirical results

In this section, we present several empirical results that illustrate our theory from Section 3. We begin with a few pedagogical examples in order to state solutions of these problems that are correctly normalized. These precise values are sensitive to small differences and imprecisions in solving the equations. We state them here so that others can verify the correctness of their future implementations, although the codes underlying our computations are also available for download.[4] We then illustrate how the 2-norm PageRank vector (Prob. (4)) and 1-norm regularized vector (Prob. (6)) approximate the 1-norm min-cut problem (Prob. (2)) for a small but widely-studied social graph.

4.1. Pedagogical examples

We start by solving numerically the various formulations and variants (min-cut, PageRank, and ACL) of our basic problem for the example graph illustrated in Figure 1. A summary of these results may be found in Table 1. To orient the reader about the normalizations (which can be tricky), we present the three PageRank vectors that satisfy the relationships xpr = Dz and vol(S) z = x(α, S), as predicted by the theory from Section 2 and Theorem 1. Note that z, x(α, S), and z_G round to the discrete solution produced by the exact cut if we simply threshold at about half the largest value. Thus, and not surprisingly, all these formulations reveal, to a first approximation, similar properties of the graph. Importantly, though, note also that the vector z_G has the same sparsity as the true cut-vector. This is a direct consequence of the implicit ℓ1 regularization that we characterize in Theorem 3. Understanding these “details” matters critically as we proceed to apply these methods to larger graphs where there may be many small entries in the PageRank vector z.[5]

4.2. Newman’s netscience graph

Consider, for instance, Newman’s netscience graph (Newman, 2006), which has 379 nodes and 914 undirected edges. We considered a set S of 16 vertices, illustrated in Figure 2. In this case, the solution of the 1-norm min-cut problem (Prob. (2)) has 15 non-zeros; the solution of the PageRank problem (which we have shown in Prob. (4) implicitly solves an ℓ2 regression problem) has 284 non-zeros;[6] and the solution from the ACL procedure (which we have shown in Prob. (6) solves an ℓ1-regularized ℓ2 regression problem) has 24 non-zeros. The true “min-cut” set is large in both the 2-norm PageRank problem and the regularized problem. Thus, we identify the underlying graph feature correctly; but the implicitly regularized ACL procedure does so with many fewer non-zeros than the vanilla PageRank procedure.

[4] http://www.cs.purdue.edu/homes/dgleich/codes/l1pagerank

[5] The reason is both statistical, in that we encourage sparsity to find sparse partitions when the original set S is small, as well as algorithmic, in that the algorithm does not in general touch most of the nodes of a large graph in order to find small clusters (Mahoney, 2012). These properties were critically exploited in (Leskovec et al., 2009).

[6] The PageRank solution actually has 379 non-zero entries. However, only 284 of them are non-zero above a 10⁻⁵ threshold.

Table 1. Numerical results of solving each problem (Prob. (2), Prob. (4), and Prob. (6)) for the graph G and set S from Figure 1. Here, we set α = 1/2, τ = 10⁻², ρ = 1, β = 2/3, and κ = 0.315. The vectors xpr, z, and x(α, S) are the PageRank vectors from Theorem 1, where x(α, S) solves Prob. (4) and the others are from the problems at the end of Section 2. The vector xcut solves the cut Prob. (2), and z_G solves Prob. (6).

Deg.   xpr     z       x(α, S)   xcut   z_G
2      0.0788  0.0394  0.8276    1      0.2758
4      0.1475  0.0369  0.7742    1      0.2437
7      0.2362  0.0337  0.7086    1      0.2138
4      0.1435  0.0359  0.7533    1      0.2325
4      0.1297  0.0324  0.6812    1      0.1977
7      0.1186  0.0169  0.3557    0      0
3      0.0385  0.0128  0.2693    0      0
2      0.0167  0.0083  0.1749    0      0
4      0.0487  0.0122  0.2554    0      0
3      0.0419  0.0140  0.2933    0      0

5. Discussion and Conclusion

We have shown that the PageRank linear system corresponds to a 2-norm variation on a 1-norm formulation of a min-cut problem, and we have also shown that the ACL procedure for computing an approximate personalized PageRank vector exactly solves a 1-norm regularized version of the PageRank 2-norm objective. Both of these results are examples of algorithmic anti-differentiation, which involves extracting a precise optimality characterization for the output of an approximate or heuristic algorithmic procedure.

While a great deal of work has focused on deriving new theoretical bounds on the objective function quality or runtime of an approximation algorithm, our focus is somewhat different: we have been interested in understanding what are the precise problems that approximation algorithms and heuristics solve exactly. Prior work has exploited these ideas less precisely in a much larger-scale application to draw very strong downstream conclusions (Leskovec et al., 2009), and our empirical results have illustrated this more precisely in much smaller-scale and more-controlled applications.

Our hope is that one can use these insights in order to combine simple heuristic/algorithmic primitives in ways that will give rise to principled scalable machine learning algorithms more generally. We have early results along these lines that involve semi-supervised learning.

Page 20: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow


Figure 2 (panel labels: 16 nonzeros, 15 nonzeros, 284 nonzeros, 24 nonzeros). Examples of the different cut vectors on a portion of the netscience graph. In the left subfigure, we show the set S highlighted with its vertices enlarged. In the other subfigures, we show the solution vectors from the various cut problems (from left to right, Probs. (2), (4), and (6), solved with min-cut, PageRank, and ACL) for this set S. Each vector determines the color and size of a vertex, where high values are large and dark. White vertices with outlines are numerically non-zero (which is why most of the vertices in the fourth figure are outlined, in contrast to the third figure). The true min-cut set is large in all vectors, but the implicitly regularized problem achieves this with many fewer non-zeros than the vanilla PageRank problem.

References

Andersen, Reid and Lang, Kevin. An algorithm for improving graph partitions. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 651–660, 2008.

Andersen, Reid, Chung, Fan, and Lang, Kevin. Local graph partitioning using PageRank vectors. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 475–486, 2006.

Arasu, Arvind, Novak, Jasmine, Tomkins, Andrew, and Tomlin, John. PageRank computation and the structure of the web: Experiments and algorithms. In Proceedings of the 11th International Conference on the World Wide Web, 2002.

Chin, Hui Han, Madry, Aleksander, Miller, Gary L., and Peng, Richard. Runtime guarantees for regression problems. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pp. 269–282, 2013.

Dhillon, Inderjit S., Guan, Yuqiang, and Kulis, Brian. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., 29(11):1944–1957, 2007.

Goldberg, Andrew V. and Tarjan, Robert E. A new approach to the maximum-flow problem. Journal of the ACM, 35(4):921–940, 1988.

Koutra, Danai, Ke, Tai-You, Kang, U., Chau, Duen Horng, Pao, Hsing-Kuo Kenneth, and Faloutsos, Christos. Unifying guilt-by-association approaches: Theorems and fast algorithms. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 245–260, 2011.

Kulis, Brian and Jordan, Michael I. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Langville, Amy N. and Meyer, Carl D. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

Leskovec, Jure, Lang, Kevin J., Dasgupta, Anirban, and Mahoney, Michael W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

Mahoney, Michael W. Approximate computation and implicit regularization for very large-scale data analysis. In Proceedings of the 31st Symposium on Principles of Database Systems, pp. 143–154, 2012.

Mahoney, Michael W., Orecchia, Lorenzo, and Vishnoi, Nisheeth K. A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally. Journal of Machine Learning Research, 13:2339–2365, 2012.

Newman, M. E. J. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.

Orecchia, Lorenzo and Mahoney, Michael W. Implementing regularization implicitly via approximate eigenvector computation. In Proceedings of the 28th International Conference on Machine Learning, pp. 121–128, 2011.

Orecchia, Lorenzo and Zhu, Zeyuan Allen. Flow-based algorithms for local graph clustering. In Proceedings of the 25th ACM-SIAM Symposium on Discrete Algorithms, pp. 1267–1286, 2014.

Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford University, November 1999.

Saunders, Michael A. Solution of sparse rectangular systems using LSQR and CRAIG. BIT Numerical Mathematics, 35(4):588–604, 1995.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

Zhou, Dengyong, Bousquet, Olivier, Lal, Thomas Navin, Weston, Jason, and Scholkopf, Bernhard. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pp. 321–328, 2004.


Push’s sparsity helps it identify the “right” graph feature with fewer non-zeros.

Figure panels: the set S, the min-cut solution, the push solution, and the PageRank solution.

Page 21: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

It’s easy to make this apply broadly

Easy to cook up interesting diffusion-like problems and adapt them to this framework. In particular, Zhou et al. (2004) gave a semi-supervised learning diffusion we are currently studying …

The adapted construction, the analogue of the localized cut graph with θ scaling the original edges and unit-weight teleportation edges:

\[ \begin{bmatrix} 0 & e_S^T & 0 \\ e_S & \theta A & e_{\bar S} \\ 0 & e_{\bar S}^T & 0 \end{bmatrix}. \]

The associated regularized cut problem and its reduced form:

\[ \begin{aligned} &\text{minimize} && \tfrac{1}{2}\|B_S x\|_2^2 + \kappa\,\|x\|_1 \\ &\text{subject to} && x_s = 1,\ x_t = 0,\ x \ge 0, \end{aligned} \]

\[ \begin{aligned} &\text{minimize} && \tfrac{1}{2}\, x^T (I + \theta L)\, x - \gamma\, x^T e_S + \kappa\,\|x\|_1 \\ &\text{subject to} && x \ge 0. \end{aligned} \]
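A minimal sketch (not the speaker's code) of solving the second, reduced problem above with projected proximal gradient; θ, γ, and κ are illustrative parameter choices, and the coefficient names follow the reconstructed formulation above.

```python
import numpy as np

def ssl_diffusion(A, S, theta=1.0, gamma=1.0, kappa=1e-3, iters=2000):
    """Minimize (1/2) x^T (I + theta*L) x - gamma * x^T e_S + kappa * ||x||_1
    subject to x >= 0, via projected proximal gradient."""
    n = A.shape[0]
    d = A.sum(axis=1)
    L = np.diag(d) - A
    Q = np.eye(n) + theta * L
    eS = np.zeros(n); eS[list(S)] = 1.0
    eta = 1.0 / (1.0 + 2.0 * theta * d.max())   # step size <= 1 / lambda_max(Q)
    x = np.zeros(n)
    for _ in range(iters):
        grad = Q @ x - gamma * eS
        # on x >= 0 the l1 term is linear, so the prox is a shift plus projection
        x = np.maximum(0.0, x - eta * (grad + kappa))
    return x
```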

Page 22: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow


1.3. Application to semi-supervised learning

We recreate the USPS digit recognition problem from Zhou et al. (2003). Consider the digits 1, 2, 3, 4. For all nodes with these true labels, construct a weighted graph using a radial basis function:

\[ A_{i,j} = \exp\!\bigl(-\|d_i - d_j\|_2^2 / (2\sigma^2)\bigr), \]

where σ ∈ {1.25, 2.5} and d_i is the vector representation of the i-th image of a digit. This constructs the weighted graph that is the input to the semi-supervised learning method. We randomly pick nodes to be labeled in the input and ensure that at least one node from each class is chosen; and we predict labels using the following two kernels, as in Zhou et al. (2003):

\[ K_1 = (I - \alpha A)^{-1} \quad\text{and}\quad K_2 = (D - \alpha A)^{-1}, \]

where we used α = 0.99, as in the prior work.

When we wanted to use our new regularized cut problems, we discovered that the weighted graphs A constructed through the radial basis function kernel method had many rows that were effectively zero. These are images that are not similar to anything else in the data. This fact posed a problem for our formulation because we need the seed vector s ≥ 0 and the complement seed vector d − s ≥ 0. Also, the cases where we would divide by the degree would be grossly inflated by these small degrees. Consequently, we investigated a third kernel:

\[ K_3 = (\mathrm{Diag}(Ae) - \alpha A)^{-1}. \]

The intuition for this kernel is that we first normalize the degrees via A, and then use a combinatorial Laplacian kernel to propagate the labels. Because the degrees are much more uniform, our new cut-based formulations are significantly easier to apply and had much better results in practice. Finally, we also consider using our 1-norm based regularization applied to K_3 with κ = 10⁻⁵, which, for the sake of consistency, we label rK3 although it is not a matrix-type kernel.

Detail on the error rates (Figure 2): For the multi-class prediction task explored by Zhou et al., we show the mean error rates (percentage of mislabeled points) over 100 trials for four methods (K_1, K_2, K_3, and rK3) as we vary the number of labels on four problems. The four problems use the digit sets 1, 2, 3, 4 and 5, 6, 7, 8 with σ = 1.25 and σ = 2.5 in the weight function. And as in Zhou et al. (2003), we use the maximum entry in each row of K_i S, where S represents the matrix of labeled nodes with S_{i,j} = 1 if node i is chosen as a labeled node for digit j. We note:

· First, setting σ = 1.25 results in very low error rates, and all of the kernels have approximately the same performance once the number of labels is above 20. For fewer labels, the regularized kernel has worse performance. With σ = 1.25, the graph has a good partition based on the four classes. The regularization does not help because it limits the propagation of labels to all of the nodes; however, once there are sufficient labels this effect weakens.

· Second, when σ = 2.5, the results are much worse. Here, K_2 has the best performance and the regularized kernel rK3 has the worst performance.

Details on the ROC curves (Figure 3): We look at all ROC curves for single-class prediction tasks with 20 labeled nodes as we sweep along the solutions for a single label from each of the kernels. (Formally,

Recap & Conclusions


Open issues:
• Better treatment of directed graphs?
• An algorithm for ρ < 1? (ρ is set to ½ in most “uses”; needs new analysis.)
• (Coming soon) Improvements to semi-supervised learning on graphs.

Key point: we don’t solve the 1-norm regularized problem with a 1-norm solver, but with the efficient push method. Run push, and you get 1-norm regularization with early stopping.

David Gleich · Purdue · Supported by NSF CAREER 1149756-CCF · www.cs.purdue.edu/homes/dgleich

1. “Defined” algorithmic anti-differentiation to understand why heuristics work.
2. Found equivalences between PageRank and cut/flow problems.
3. Connected push and 1-norm regularization.

Page 23: Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow

PageRank → s-t min-cut: that equivalence works if the seed vector s is degree-weighted. What if s is the uniform vector?

\[ A(s) = \begin{bmatrix} 0 & \alpha s^T & 0 \\ \alpha s & A & \alpha (d - s) \\ 0 & \alpha (d - s)^T & 0 \end{bmatrix}. \]
