co-blockmodel and spectral algorithm di-sim · 2015. 1. 9. · co-clustering for directed graphs:...

Co-clustering for directed graphs: the Stochasticco-Blockmodel and spectral algorithm Di-Sim

Karl Rohe ∗1, Tai Qin1, and Bin Yu2

1Department of Statistics, University of Wisconsin Madison2Department of Statistics, University of California, Berkeley

January 9, 2015

Abstract

Directed graphs have asymmetric connections, yet the current graph clustering methodolo-gies cannot identify the potentially global structure of these asymmetries. We give a spectralalgorithm called di-sim that builds on a dual measure of similarity that correspond to howa node (i) sends and (ii) receives edges. Using di-sim, we analyze the global asymmetries inthe networks of Enron emails, political blogs, and the c elegans neural connectome. In eachexample, a small subset of nodes have persistent asymmetries; these nodes send edges with onecluster, but receive edges with another cluster. Previous approaches would have assigned theseasymmetric nodes to only one cluster, failing to identify their sending/receiving asymmetries.

Regularization and “projection" are two steps of di-sim that are essential for spectralclustering algorithms to work in practice. The theoretical results show that these steps makethe algorithm weakly consistent under the degree corrected Stochastic co-Blockmodel, a modelthat generalizes the Stochastic Blockmodel to allow for both (i) degree heterogeneity and (ii)the global asymmetries that we intend to detect. The theoretical results make no assumptionson the smallest degree nodes. Instead, the theorem requires that the average degree growssufficiently fast and that the weak consistency only applies to the subset of the nodes withsufficiently large leverage scores. The results results also apply to bipartite graphs.

IntroductionThe network analysis literature has primarily studied networks with symmetric relationships (i.e.undirected edges). However, many networks contain asymmetric relationships (i.e. directed edges).For example, in a communication network, one person calls the other person. Citation networks, webgraphs, and internet networks are also characterized by asymmetric relationships. Even networksthat are often represented as undirected networks of symmetric edges (e.g. Facebook friendships androad networks) are simplifications of an underlying directed network; in Facebook, each friendshipis proposed by one of the friends and received by the other friend. This induces an asymmetry. Inroad networks, one is often interested in the flow of traffic; anyone who has a reverse commute can

∗karlrohe at stat dot wisc dot edu

1

arX

iv:1

204.

2296

v2 [

stat

.ML

] 8

Jan

201

5

confirm that traffic flows asymmetrically. In biochemical cellular networks, a relationship representsthe flow of information and/or energy in the cell. These are causal graphs, and causality requiresdirection. In these examples and in a wide range of other applications, directed networks moreaccurately represent the underlying data generating mechanism.

To model the clustering structure of an undirected network, the Stochastic Blockmodel assignseach actor to one of k blocks and actors in the same block are exchangeable or “stochasticallyequivalent" (White et al. (1976); Holland et al. (1983)). Specifically, i and j are stochasticallyequivalent if

P (i connects to `) = P (j connects to `) for every actor ` in the network.

This paper extends the notion of stochastic equivalence to directed networks in a way that al-lows for two separate notions of equivalence, “stochastically equivalent senders" and “stochasticallyequivalent receivers".

To estimate these dual notions of equivalence, we propose co-clustering in the context of net-works. In a more general setting, Hartigan (1972) first proposed co-clustering to simultaneouslycluster the rows and columns of a data matrix. In the network setting, the data matrix could bethe adjacency matrix or some form of the graph Laplacian; row i gives the sending pattern foractor i and column j gives the receiving pattern for actor j. Co-clustering for network data isparticularly interesting because the rows and columns index the same set of nodes. The Stochasticco-Blockmodel (proposed in Section 3) clarifies the relationship between co-clustering and the dualnotions of stochastic equivalence. In this model, there are two separate partitions of the actors; onepartition for “stochastically equivalent senders" and the other partition for “stochastically equivalentreceivers".

In addition to providing a novel framework to understanding the asymmetries in a directedgraph, the contributions of this paper are threefold. The first contribution is algorithmic; Section 1presents di-sim, a novel and computationally tractable spectral algorithm for co-clustering sparseand heterogeneous data matrices. While spectral algorithms have become popular in the method-ological and theoretical literature, their empirical performance is often less than satisfactory. Thesecond contribution is methodological and empirical; Section 2 demonstrates that di-sim performswell on three empirical networks. In each of the three networks, di-sim finds a small subset of nodeswith persistent asymmetries, sending edges to one cluster and receive edges from another cluster.The final contribution is theoretical. To illustrate the types of asymmetries that di-sim estimates,Section 3 proposes the degree corrected Stochastic co-Blockmodel. Using this model, Theorem 3.1gives conditions under which di-sim misclusters a vanishing fraction of the nodes.

The theoretical results are novel in several ways. First, Theorem 3.1 gives the first statisticalestimation results results for directed graphs or bipartite graphs with general degree distributions.Second, because di-sim uses the leading singular vectors of a sparse and asymmetric matrix, theproof required novel extensions of previous proof techniques. These techniques extend the spectralresults to bipartite graphs; previous results for bipartite graphs have only studied computationallyintractable techniques, e.g. Flynn and Perry (2012); Wolfe and Choi (2014). Third, the maintheorem does not presume that the number of sending clusters is equal to the number of receivingclusters and the theoretical results highlight the difficulties presented when they are not equal.Fourth, because we study a sparse degree corrected model, the theoretical results highlight theimportance of the regularization and projection steps in di-sim. Finally, the results do not dependon the minimum node degree. Instead, the weakly connected nodes affect the conclusions throughtheir statistical leverage scores in the observed graph Laplacian. From the perspective of numerical

2

linear algebra, the leverage scores are essential to controlling the algorithmic difficulty of computingthe singular vectors (Mahoney, 2011).

1 Co-clusteringCo-clustering (a.k.a. bi-clustering) was first proposed in Hartigan (1972) for data arranged in amatrix M ∈ Rn×d. In addition to clustering the rows of M into kr clusters, co-clustering simulta-neously clusters the columns of M into kc clusters. In the past decade, co-clustering has becomean important data analytic technique in biological applications (e.g. Madeira and Oliveira (2004),Tanay et al. (2004), Tanay et al. (2005), Madeira et al. (2010)), text processing (e.g. Dhillon(2001), Bisson and Hussain (2008)), and natural language processing (e.g. Freitag (2004), Rohwerand Freitag (2004)). In these settings, Banerjee et al. (2004) describes how co-clustering dramati-cally reduces the number of parameters that one needs to estimate. This leads to three advantagesover traditional clustering: (1) more interpretable results, (2) faster computation, and (3) implicitstatistical regularization.

Previous applications of co-clustering have involved matrices where the rows and columns indexdifferent sets of objects. For example, in text processing, the rows correspond to documents, and thecolumns correspond to words. Element i, j of this matrix denotes how many times word j appears indocument i. The row clusters correspond to clusters of similar documents and the column clusterscorrespond to clusters of similar words. In contrast, this paper applies co-clustering to a matrixwhere the rows and columns index the same set of nodes. The ith row of the matrix identifiesthe outgoing edges for node i; two nodes are in the same row cluster if they send edges to severalof the same nodes. The ith column of this matrix identifies the incoming edges for node i; twonodes are in the same row co-cluster if they send edges to several of the same nodes. As such, eachnode i is in two types of clusters (one for the ith column and one for the ith row). Comparingthese two distinct partitions of the nodes can lead to novel insights when compared to the standardco-clustering applications where the rows and columns index different sets. The three examples inSection 2 will illustrate how this duality can lead to novel interpretations.

This paper proposes and studies a spectral co-clustering algorithm called di-sim. Building onprevious spectral co-clustering algorithms (e.g. Dhillon (2001)), di-sim incorporates regularizationand projection steps. These two steps are essential when there is a large amounts of degree hetero-geneity and several weakly connected nodes. The name di-sim has three meanings. First, becausedi-sim co-clusters the nodes, it uses two distinct (but related) similarity measures between nodes:“the number of common parents" and “the number of common offspring" to create two differentpartitions of the nodes. In this sense, di-sim means two similarities and two partitions. Second,di- denotes that this algorithm is specifically for directed graphs. Finally, di-sim, pronounced “dice‘em", dices data into clusters.

1.1 DI-SIM; a co-clustering algorithm for directed graphsSpectral graph algorithms have a rich history in mathematics, computer science, and statistics (e.gFiedler (1973); Chung (1997); Koltchinskii and Giné (2000)) and this line of literature motivatesthe di-sim algorithm. For a detailed account of spectral clustering for undirected graphs, see vonLuxburg (2007).

The essential algorithmic difference between standard spectral clustering and di-sim is that di-sim uses the singular value decomposition (SVD) instead of the eigendecomposition. Previously,

3

Dhillon (2001) proposed using the SVD to co-cluster bipartite networks. More generally, in theage of big data, SVD has become a canonical algorithm for low rank approximations because thereare computationally fast implementations and it generalizes the eigendecomposition to generalrectangular matrices. It is defined as follows.

Definition 1. The singular value decomposition (SVD) factorizes a matrix M ∈ Rn×d (n ≥ d)into the product of orthonormal matrices U ∈ Rn×d, V ∈ Rd×d and a diagonal matrix Σ ∈ Rd×dwith nonnegative entries,

M = UΣV T .

The columns of U contain the left singular vectors. The columns of V contain the rightsingular vectors. The diagonal of Σ contains the singular values. If the matrix M is square andsymmetric, then the SVD is equivalent to the eigendecomposition and U = V . In this way, SVD isa generalization of the eigendecomposition. Section 4.1 (at the end of this paper) briefly highlightsthe previous algorithms that employ SVD to explore the structure of graphs.

1.2 The di-sim algorithmLet G = (V,E) denote a graph, where V is a vertex set and E is an edge set. The vertex setV = {1, . . . , n} contains vertices or nodes. These are the actors in the graph. This paper considersunweighted, directed edges. So, the edge set E contains a pair (i, j) if there is an edge, or rela-tionship, from node i to node j: i → j. The graph can be represented as an adjacency matrixA ∈ {0, 1}n×n:

Aij =

{1 if (i, j) is in the edge set0 otherwise.

If the adjacency matrix is symmetric, then the graph is undirected. We are interested in exploringthe asymmetries in A.

The graph Laplacian is a function of the adjacency matrix. It is fundamental to spectralgraph theory and the spectral clustering algorithm (Chung (1997); von Luxburg (2007)). Severalprevious papers have proposed and or studied various ways of regularizing the graph Laplacian;these regularization steps improve the statistical performance of various spectral algorithms (Pageet al. (1999); Andersen et al. (2006); Chaudhuri et al. (2012); Amini et al. (2013); Qin and Rohe(2013); Joseph and Yu (2014)). This paper generalizes the regularization proposed in Chaudhuriet al. (2012) to directed graphs. Define the regularized graph Laplacian L ∈ Rn×n for directedgraphs with the diagonal matrices P ∈ Rn×n and O ∈ Rn×n, regularization parameter τ ≥ 0, andidentity matrix I ∈ Rn×n,

Pjj =∑k Akj =

∑k 1{k → j} and P τ = P + τI;

Oii =∑k Aik =

∑k 1{i→ k} and Oτ = O + τI; and

Lij =Aij√OτiiP

τjj

= 1{i→j}√OτiiP

τjj

= [(Oτ )−1/2A(P τ )−1/2]ij .

(1)

Pjj is the number of nodes that send an edge to node j, or the number of parents to node j.Similarly, Oii is the number of nodes to which i sends an edge, or the number of offspring to nodei. A more standard definition of the graph Laplacian is I −O−1/2AO−1/2. Our definition also usesP in the normalization and it does not contain I−. These changes are essential to our theoreticalresults and many of the interpretations of di-sim would not hold otherwise. The regularized degree

4

matrices, P τ and Oτ , artificially inflate every degree by a constant τ . In the setting of undirectedgraphs, Qin and Rohe (2013) showed that in order to make the asymptotic bounds informative, τshould grow proportionally to the average node degree,

∑iOii/n. Note that

∑iOii/n =

∑j Pjj/n

since the out degree equals to the in degree. We use the average node degree as the default valuefor τ .

To apply di-sim to a bipartite graph on disjoint sets of vertices U and V (e.g. U contains wordsand V contains documents), let U index the rows of A and V index the columns of A. As such, Ais rectangular and Aij = 1 if and only if i ∈ U shares an edge with j ∈ V (e.g. word i is containedin document j). While the dimensions of O,P, and L must change to reflect that A is rectangular,the definitions in Equations (1) remain the same.

Throughout, for x ∈ Rd, ‖x‖2 =√∑d

i=1 x2i , for M ∈ Rd×p, ‖M‖ denotes the spectral norm and

‖M‖F denotes the Frobenius norm. With the above notation, di-sim is defined as follows.

di-sim

Input: Adjacency matrix A ∈ {0, 1}n×n, regularizer τ ≥ 0 (Default: τ = average node degree),number of row-clusters ky, number of column-clusters kz.

(1) Compute the regularized graph Laplacian L = (Oτ )−1/2A(P τ )−1/2.

(2) Compute the top K left and right singular vectors XL ∈ Rn×K , XR ∈ Rn×K , whereK = min{ky, kz}.

(3) Normalize each row of XL and XR to have unit length. That is, define X∗L ∈ Rn×K , X∗R ∈Rn×K , such that

[X∗L]i =[XL]i‖[XL]i‖2

, [X∗R]j =[XR]j‖[XR]j‖2

,

where [XL]i is the ith row of XL and similarly for [X∗L]i, [XR]j , [X∗R]j .

(4) Cluster the rows of X∗L into kr clusters with (1 + α)-approximate k-means (Kumar et al.(2004)). Because each row of X∗L corresponds to a node’s sending pattern in the graph,the results cluster the nodes’ sending patterns.

(5) Cluster the receiving patterns by performing step (4) on the matrix X∗R with kz clusters.

Output: The clusters from step (4) and (5).

When A is undirected, then the left and right singular vectors of L are equal to each other andequal to the eigenvectors of L. In this special case, di-sim is equivalent to previous versions ofundirected spectral clustering (e.g. see von Luxburg (2007), Qin and Rohe (2013)).

1.3 Interpreting the singular vectorsThis subsection examines how the singular vectors of L correspond to the following dual measuresof similarity: “number of common parents" and “number of common offspring". Recall that theSVD expresses a matrix M ∈ Rn×d as the product of three matrices, M = UΣV T . Lemma 1.1shows how to compute the matrices U, V , and Σ, giving insight into the similarity measures used

5

by di-sim. The lemma follows from Lemma 7.3.1 in Horn and Johnson (2005).

Lemma 1.1. With SVD, M = UΣV T for M ∈ Rn×d. The matrices U and V contain the eigenvec-tors to the symmetric, positive semi-definite matrices MMT and MTM respectively. Both MMTand MTM have the same eigenvalues and these values are contained in the diagonal of Σ2. If Mis symmetric, U contains the eigenvectors of M and U = V .

This implies that di-sim uses the eigenvectors of two matrices, LTL and LLT . To understandthese matrices, first look at ATA and AAT .

(ATA)ab =∑x

1{x→ a and x→ b} : The number of common “parents".

(AAT )ab =∑x

1{a→ x and b→ x} : The number of common “offspring".

These two similarity matrices are symmetric and easily interpretable. LLT and LTL perform a sim-ilar task while down-weighting the contribution of high degree nodes and utilizing the regularizationparameter τ .

2 Applications where asymmetric relationships allow for novelinsights

The next three subsections use di-sim to examine the asymmetries in (i) the email communicationnetwork at Enron; (ii) the network of hyperlinks among a set of political blogs; and (iii) the neuralconnectome of a primitive worm, c elegans. These examples demonstrate how one can leverage thegraph asymmetries to make novel insights into the graph structure. The examples also demonstratesimple modifications of di-sim that are appropriate in various settings.

2.1 Detecting malfeasance at EnronThe defunct corporation Enron went bankrupt on December 2, 2001 because “its reported financialcondition was sustained substantially by an institutionalized, systematic, and creatively plannedaccounting fraud" (Wikipedia (2013)). This section examines a communication network formed witha portion of the corporations’ emails that were made publicly available as a result of the federalinvestigation into corporate misconduct. We use di-sim to search for “bottleneck" communicators,or people that relayed information from one part of the organization to another.

The emails used in the following analysis form a communication network for 154 employees ofEnron between 1998 and 2002 (Cohen (2009)). In our analysis, we set Aij as the number of emailsthat i sends to j over the entire time period. This is a weighted network. While the data set alsoprovides the text of the emails, we only use the “metadata", i.e. the network A.

2.1.1 Data Analysis

This section does not use the full di-sim algorithm; there are two simplifications.

1. The rows of the singular vector matrices XL, XR ∈ R1222×K are not projected onto the unitsphere. That is, step (c) in di-sim is skipped.

6

●

●

●

●●

●●●

●●●●●●

●●●●●●●●●●●

5 10 15 20 25

0.2

0.3

0.4

0.5

0.6

0.7

Eigengap after thesecond and fifth values.

Index

sing

ular

val

ues

●

●

●

●●

●

●

Histogram of movement scoresusing K=2

movement score

0.0 0.1 0.2 0.3 0.4 0.5

0

20

40

60

80

100

120

Histogram of movement scoresusing K=5

movement score

0.0 0.2 0.4 0.6 0.8

0

20

40

60

80

Figure 1: The left panel displays the top 25 singular values of L. There are two eigengaps. The firsteigengap, suggests K = 2 using the singular values in solid black. The second eigengap, suggestsK = 5 by adding the singular values in solid grey. Using K = 2, the center panel gives a histogramof the movement scores m(K)i as defined in Equation 2. The right panel gives a histogram of themovement scores using K = 5. Each histogram has an outlier. For K = 2, the outlier is Enron’sDirector for Regulatory and Government Affairs Jeff Dasovich. For K = 5, the outlier is BillWilliams who is discussed in the text below.

2. Instead of running k-means, the asymmetry in the graph is investigated by directly comparingthe rows of XL and XR with the movement score, defined as

m(K)i =

(K∑`=1

([XL]i` − [XR]i`)2)1/2

. (2)

If one were to ignore edge direction by symmetrizing the network, then m(K)i would be zerofor all i. As such, it measures the asymmetry in a node’s connections.

The left panel in Figure 1 displays the top 25 singular values of L. The center panel gives thehistogram of the movement scores m(2)i . The right panel gives the histogram of m

(5)i . The outlier

for K = 2 is Enron’s Director for Regulatory and Government Affairs Jeff Dasovich. Using K = 5,the outlier is an energy trader at Enron named Bill Williams.

The large movement scores for Dasovich and Williams could be due to three possibilities. First,they could receive information from one part of the network and transmits it to another part ofthe network (i.e. act as a bottleneck communicator); second, they could take information and notrelay that information; or third, they could receive little information through this communicationnetwork, but transmit lots of information. In fact, in the weighted network (Aij is number ofemails from i to j), Dasovich has the largest out-degree and Williams’ has the 10th largest out-degree. Dasovich has the ninth largest in-degree and Williams has the 45th highest in-degree (outof n = 154). This rules out the second and third possibilities, suggesting that both Dasovich andWilliams are bottleneck communicators.

Although such network patterns do not necessarily imply criminal activity, the analysis identifiesEnron employee Bill Williams as a clear outlier. Using qualitative evidence not associated with the

7

methods presented here, Williams was convicted of creating artificial energy shortages by orderingpower plants to temporarily shut down. The New York Times reported on the incident and quotedfrom audio recordings of Bill Williams telling a power plant to shut down. The day after that audiorecording, roughly half a million Californians suffered from rolling blackouts (Egan (2005)).

Although Williams’ communications with the power plant make him a bottleneck communicator,it is worth noting that our vertex set in this data does not contain people outside of Enron. Assuch, Williams was identified for playing the bottleneck communicator for other activities withinEnron. Importantly, this analysis would have been infeasible if we had ignored edge direction. Thedata in this section have been extensively preprocessed by Zhou et al. (2007) and Perry and Wolfe(2013).

2.2 Blog network during the 2004 US presidential electionIn the 2004 US presidential election, political blogs contributed to the election media landscape forthe first time. In order to better understand the role of these blogs, Adamic and Glance (2005)recorded the hyperlink connections among these blogs and found that the connections betweenblogs were highly related to the blog’s political persuasion. In subsequent research, Karrer andNewman (2011); Chen et al. (2012), and Zhao et al. (2012) estimated the political partition from thenetwork alone. In contrast to the work presented here, each of these previous analyses symmetrizedthe edge directions. As such, they found a single partition of the blogs into conservative andliberal blogs. However, co-clustering with di-sim finds two partitions, one based on how the blogssend hyperlinks and another based on how the blogs receive hyperlinks. These two partitions areroughly similar.1 This suggests that most blogs with similar sending patterns have similar receivingpatterns. However, some blogs send hyperlinks to conservative blogs and receive hyperlinks fromliberal blogs, and vice versa. For these blogs, the direction of the edges is particularly salient.Using di-sim, we seek to identify and characterize these blogs. Figure 2 illustrates how a nodecould belong to opposite sending and receiving clusters.

Figure 2: In this diagram, there are two clusters and a bottleneck node between the two clusters.In the sending cluster, this node joins the nodes on the left. In the receiving cluster, this node joinsthe nodes on the right.

1Both partitions roughly align with the political divide of liberal vs. conservative blogs.

8

2.2.1 Data description

To create the network, Adamic and Glance (2005) curated a list of the top 1,494 political blogs and,in February of 2005, (a) recorded the front page of each blog and (b) identified the hyperlinks thatpoint to other blogs on the list. From these links, Adamic and Glance (2005) created a directednetwork.1 Each blog was identified as liberal or conservative. Some of these labels were manuallyidentified and some of the labels are self-reported to one of several blog directories. While theselabels may be subject to various types of errors, they are generally consistent with the networkconnectivity and the names of the blogs (e.g. xtremerightwing.net vs. loveamericahatebush.com).We will thus refer to these labels as the true labels. To refer to the blogs on either side of the politicalpartition, we will use the terms {Kerry, dem, liberal} interchangeably and the terms {Bush, gop,conservative} interchangeably.

We restrict our analysis to the 1,222 blogs in the largest connected component; this removes266 nodes with no edges and two blogs linked together without any connections to the largestcomponent. Our analysis concerns the 586 liberal blogs and 636 conservative blogs that remain.While this network is sparse (the average degree is 16) clustering is feasible because there areroughly 10 times as many edges between blogs of the same party than between blogs of differentparty affiliations (Adamic and Glance, 2005).

Because there are two political parties, we set k = 2 for both the sending and receiving clusters.This section makes one modification to di-sim. Instead of running k-means twice (once for XL ∈R1222×2 and once for XR ∈ R1222×2), the analysis runs k-means on XL and XR simultaneously.That is, XL and XR are stacked into a single tall matrix in R2444×2 and k-means is run on the rowsof this tall matrix. This makes the labels of the left and right clusters comparable. After runningk-means on the 1,222 blogs in the largest connected component, subsequent analysis is restrictedto the blogs that have at least three incoming edges and at least three outgoing edges. There are549 such blogs. Of these blogs, 543 are clustered into sending and receiving clusters that are nearlyidentical; this partition broadly agrees with the true labels of “Kerry" and “Bush" blogs.

While 543 of the 549 blogs are clustered into identical sending and receiving clusters, the re-maining six are clustered into different sending and receiving clusters (See Table 1). Five of theseblogs are “dem2gop" blogs that appear to take links from Kerry (i.e. dem) blogs and send links toBush (i.e. gop) blogs. The final blog in the table (quando.net) is the only “gop2dem" blog, takingmore edges from Bush blogs and sending links to Kerry blogs.

All of the six blogs are labeled as Kerry blogs. However, we visited the blog urls and performedrelated web searches (Table 2). Many of the sites are now defunct. Interestingly, the only gop2demblog in the entire analysis, quando.net, appears mislabeled in the original data set. Adamic andGlance (2005) label it as a Kerry blog. Upon closer inspection, this blog hosts a collection of con-servative/libertarian bloggers (e.g. “Face it - the only thing Bush can brag about is his comparativeconservative advantage over Kerry. And that’s akin to saying a tornado is–comparatively–better athome improvement projects than a hurricane.").

This analysis finds six blogs with asymmetric community memberships. Each of these six blogsappear to be doing “opposition research," where they link to blogs that hold different politicalviews. As such, asymmetric blogs link to content that they dislike. We found no evidence of anyasymmetric blogs receiving links from the opposite party. This suggests that the incoming edgesappear to be more informative for detecting the community membership of a political blog. Thisanalysis is only feasible because di-sim respects the asymmetry between incoming and outgoing

1See Adamic and Glance (2005) for a more complete description of how the list of 1,494 blogs was curated.

9

Table 1: Of the 549 blogs that have at least three incoming edges and at least three outgoing edges,these are the only six blogs whose receiving cluster (from.cluster) is different from their sendingcluster (to.cluster). The numbers to the right of the line are generated using the labels provided inthe data set. These numbers reveal that di-sim identifies the nodes with asymmetric relationshipsbetween the true blocks.blog url from.clust 2 to.clust from.dem from.gop to.dem to.gopchepooka.com dem2gop 13 2 1 2clarified.blogspot.com dem2gop 4 2 0 6politics.feedster.com dem2gop 3 0 13 18polstate.com dem2gop 31 7 3 2shininglight.us dem2gop 2 2 4 7qando.net gop2dem 5 57 14 10

Table 2: After accounting for the fact that the data set appears to mislabel quando.net as a liberalblog, all asymmetric blogs link to blogs of the opposite political leaning.blog url label in data set upon visitchepooka.com liberal unclear, possibly defunctclarified.blogspot.com liberal Kerry supporterpolitics.feedster.com liberal defunct, evidence for Kerry supporterpolstate.com liberal defunct, old twitter feed self-identifies as “pan-partisian"shininglight.us liberal defunctqando.net liberal collection of conservative bloggers,

see http://www.qando.net/archives/2004_09.htm

edges.

2.3 The neural connectome of c elegansThis section examines the neural connectome of the male Caenorhabditis elegans (c elegans), a 1mmlong worm. The chemical connections between the neurons of c elegant create a directed networkand a directed analysis highlights vast dissimilarities between the sending and receiving patterns.Said another way, XL represents a different structure than XR. For some neurons, this reflectspreviously understood behavior that is relevant to the understanding of the connectome (e.g. seeFigure 6 in Jarrell et al. (2012) for a discussion of the role of PVV neurons in feedforward loops).For other neurons, our analysis suggests areas for future inquiry.

2.3.1 Data description

c elegans is well suited to laboratory research–it is easy to store and reproduce because it is 1mmin length, it is easy to witness an organism’s state because its exterior is transparent, and theanatomy is easy to identify and catalogue because the adult male is composed of exactly 1031 cells,of which exactly 383 are neurons. As a result, c elegans has become a model organism for severalareas of biological research, including neurology. For example, it was the first organism with a fullysequenced genome and also the first with a complete wiring diagram of the neurological connections(White et al. (1986)).

10

Our analysis concerns the chemical connections in the connectome of the male c elegans. Jarrellet al. (2012) mapped the posterior neural connectome of the male c elegans by slicing the posteriorof the 1mm long worm into a series of 5,000 serial slices, 70 nm to 90 nm thick. Each slice was imagedwith an electron microscope, and the neurons from each slice were mapped to the neurons in theadjacent slices. Piecing these mappings together created a three dimensional image of the organismthat reveals the synaptic connections between the neurons. The construction of the connectomeused both computational tools for automatic information extraction and a substantial amount ofhuman judgment.

This section investigates the directed graph that encompasses the chemical connections amongthe neurons, muscles, and gonad. In the posterior chemical connectome, there are

• 126 nodes that send at least one edge and receive at least one edge,

• one node that sends at least one edge and receives no edges, and

• 73 nodes that send no edges and receive at least one edge.

Of the nodes that send at least one edge, the average out degree is 18. Of the nodes that receiveedges, the average in degree is 11.5. Both of these degree calculations are on the unweighted graph.In fact, each edge has an edge weight that corresponds to the size of the synaptic connection. Thelarger connections produce a more robust connection between neurons. More details can be found inJarrell et al. (2012). The distribution of these edge weights has a long tail. Based on a preliminaryanalysis, the edge weights were log-transformed.

The analysis uses di-sim with the default value of τ . Because the original paper Jarrell et al.(2012) estimated seven communities, we also estimate seven communities. After normalizing therows of XL ∈ R127×7 and XR ∈ R199×7, these matrices are combined into a single, taller matrix inR326×7 and k-means is applied to this matrix with K = 7. By running k-means once (instead oftwice), the sending and receiving clusters are more easily comparable because they have the samecluster center.

2.3.2 Results

Using di-sim to estimate the sending and receiving clusters, Figure 3 shows how the co-clustersconnect to each other. It displays the matrix B̂ which is an estimate of the matrix B in Definition2. B̂u,v is a proportion. The denominator is the number of node pairs (i, j) with i in sendingcluster u and j in receiving cluster v. The numerator is the number of such pairs that connect.This matrix has a strong diagonal which suggests that if an edge comes from sending block u, thenit probably points to a node in receiving block u. While di-sim was run on the weighted graph,Figure 3 computes B̂ with the unweighted graph. When B̂ is computed on the weighted graph (i.e.B̂u,v is average weight of the edges from block u to block v), the results are largely unchanged.

Figure 4 presents the left and right partitions of the c elegans connectome as estimated bydi-sim. The figure compares the two di-sim partitions with the single partition estimated in theoriginal paper (Jarrell et al. (2012)) in which they used the spectral technique of Leicht and Newman(2008).

Figure 4 presents three partitions of the nodes. The first two partitions correspond to thesending clusters (on left) and receiving clusters (on right) in di-sim.1 Because the k-means step

1Some nodes are listed off to the right side of the receiving cluster. These are nodes that do not send any edges,thus they do not have a sending cluster. These nodes are largely motor neurons that control muscles.

11

B

Receiving clusters

Sen

ding

clu

ster

s

1 2 3 4 5 6 7

7

6

5

4

3

2

1

0.0

0.2

0.4

0.6

0.8

Figure 3: Element u, v is darker when there are more edge from block u to block v. A strongdiagonal in this matrix suggests that if an edge comes from a node in sending block u, then itprobably goes to a node in receiving block u.

was run only once, the left and right clusters are comparable. So, the vertical orientation of theclusters is informative because the ith sending cluster from the top sends several edges to the ithreceiving cluster from the top.

Each neuron has exactly one line that connects the node’s sending cluster to the node’s receivingcluster.2 Darker lines indicate that the neuron moves further between its left and right represen-tations in the singular vectors (as measured by the movement score in Equation 2). If there wereno co-clustering structure, then all of the lines would be horizontal. However, several lines traversediagonally, connecting different clusters, and thus indicating non-trivial co-clustering structure. Toidentify which neurons move a further distance, they are written in a slightly larger font in thesending and receiving clusters.

The final partition represented in Figure 4 is represented by the color of the text; this partitioncorresponds to the communities or modules estimated in Jarrell et al. (2012). The sensory inputto the Response Module (orange) comes from the ventral side of the worm’s fan; this module playsan important role in helping the worm physically align with another worm for reproduction. Theresponse module feeds into the Locomotion Module (pink). The locomotion module contains the

2The lines in Figure 4 do not represent the edges in the graph.

12

●

●

●●● ● ●● ●● ● ●● ● ● ●●●●● ● ● ●● ●●● ●● ●● ● ●● ●●● ● ●● ●● ●● ● ●● ● ●●● ●●● ● ●●● ●●●● ● ●● ● ● ●●● ● ●● ● ●

● ●● ●● ●● ●● ● ● ●●● ● ●●● ● ●●● ● ●● ● ●●●●● ●● ●●● ● ● ●●●● ●●● ●●● ● ●●● ● ● ●● ●● ●●● ●● ● ● ●● ● ●● ●● ● ●● ● ● ● ● ●●● ●●● ● ●● ● ●●●●● ● ●● ●● ● ● ● ●● ●● ●●● ● ●●● ● ● ●●●● ●● ●● ● ●● ●

● ●● ●●● ● ●●● ● ●● ● ●●●● ●●● ● ● ●● ● ●●●● ●●● ● ● ● ● ●●●●● ●●● ●● ● ●●●● ●● ● ● ●● ●● ● ●●● ●● ●● ●● ●● ●●●● ● ● ●●● ●● ● ●●●●● ●● ● ● ●● ● ●● ● ●● ●● ●● ●● ● ● ●● ●● ●●● ● ●●● ● ● ● ● ●● ●● ●● ●●● ● ● ●● ●●●● ●●● ● ● ●● ●●●●● ●● ● ● ● ●●● ● ● ●● ● ●●● ●● ●●●●●● ● ● ●● ●●●● ●● ●● ●● ●●● ● ● ● ● ●

CA03,CP04, AVL,CA08,

DB05,DD03,VA09,VA10,VB06, VB07,

VB08,VD09,

VD10,

CA02, CA08,DB05,DD03,VA10,

VB07,VB08,

VB09,VD09,

DD04,DD05,

vBWML17,vBW

MR18,

vBWML19,vBW

MR19,vBW

MR20,

SPDL,SPDR,SPVL, SPVR,PVV,PDA,PDC, PVZ,CP01, CP02,

CP03,DVC,

EF1,CA09,DA08, DD06,

AS11,VA11,VA12,

VB09,VB10,VB11, VD11,

VD12,VD13,

SPDL,SPDR,CA09, DD06,VA12,

VB10,VD10,

VD11,VD12,

VD13,

PVU,dBW

ML21,dBW

MR21,dBW

ML22,

dBWMR22,dBW

ML24,vBW

ML20,vBW

ML21,

vBWMR21,vBW

ML22,vBW

ML23,vBW

MR23,

dglR3,dglL2,

dglR2,dglL1,

dglR1,ailL,

ailR,intL,

PCBL,PCBR,PCCL,PCCR,SPCL,SPCR,CA02, CA05,

CA06,CP05,

CP06,DVB,

DVE,DVF,

HOB,

PCBL,PCCL,PCCR,SPCL,SPCR,SPVL, SPVR,CA03,

CA05,CA06,CP01,CP04, CP05,

CP06,DVB,

DVC,DVE,

DVF,AVL,

PVT,vBW

MR24,

aobL,aobR,

pobL,pobR,

adp,dspL,

dspR,vspL,

vspR,intR,

sph,gonad,

R1AL,R1AR,R1BL,R2AL,R2AR,R3AL,

R3AR,R4AL,R4AR,R5AL,R5AR,R5BL,R6AL,R6AR,R7AL,R7AR,R7BL,

CP08,CA04,

R1AL,R1AR,R1BL,R2AL,R2AR,

R3AL,R3AR,R4AL,R4AR,R5AL,R5AR,R5BL,

R6AL,R6AR,R7AL,R7AR,CA04,PDC,

dBWML23,vBW

MR22,

cdlL,cdlR,

polL,polR,

dglR8,dglL7,

dglR7,dglR6,

dglL5,dglR5,

dglL4,hyp,

PVY,PVX,AVAL,

AVAR,AVBL,AVBR,AVDL,AVDR,PVCL,PVCR, DA07,

DA09,

AVAL,AVAR,AVBL,AVBR,AVDL,

AVDR,PVCL,PVCR, DA07,DA08,DA09,

AS11,VA09,VA11,VB06,VB11,

DA04,DA05,

DA06,DB04,

DB06,DB07,

AS08,AS09,

AS10,

PHAL,PHBL,PHBR,R2BL,R2BR,R4BL,R4BR,R8AL,R8AR,R8BL,R8BR,R9AL,R9AR,

HOA,

HOB,

PCAL,PCAR, LUAL,LUAR,

DX1,DX2,

DX3,CA07,

PHBL,R2BL,R2BR,R4BL,R4BR,R8AL,R8AR,R8BL,R8BR,R9AL,HO

A,PCAL,PCAR,PCBR, LUAL,

LUAR,CP07,CP08,PVY,PVX,

PDA,PVZ,CP02,

CP03,DX1,

DX3,AVG

,

gecL,gecR,

grtL,grtR,

PHAR,R1BR,R3BL,R3BR,R5BR,R6BL,R6BR,R7BR,R9BL,R9BR,CP07,

CP09,PDB,

EF2,AVG

,AN3a,

AN3b,PVNL,PVNR,

PGA,

PHAL,PHAR,PHBR,R1BR,R3BL,R3BR,R5BR,R6BL,R6BR,R7BL,R7BR,R9AR,R9BL,R9BR,CP09,PVV,

PDB,DX2,

EF1,EF2,

AN3a,AN3b,PVNL,PVNR,CA07,

PGA,

EF3,PVS,

dBWMR23,dBW

MR24,

dglL6,

These lines show how

nodes m

ove between

send and receive clusters.R

eceiving Clusters

Sending Clusters

Motor N

eurons

C-R

esponseC

-Locomotion

D-R

(1-5)A

E-PVVF-Insem

ination

Figure 4: Cluster 1 in Figure 3 corresponds to the top cluster in this figure, cluster 2 correspondsto the second cluster from the top, and so on.

13

body-wall motor neurons, helping the worm to move. The R(1-5)A module (green) contains sensoryneurons that “promote ventral curling of the tail during mating". The PVV Module (blue) is likely“involved in aspects of male posture during mating". The Insemination Module (yellow) containsneurons that “will take over the male’s behavior once the vulva is sensed". The interpretation ofthe clusters in Jarrell et al. (2012) comes from that paper.

Figure 4 suggests that there is co-clustering structure beyond the standard one-way clustering.In particular, PVV, PVX, PVY, and PVZ neurons are all in large bold font because they havelarge movement scores. These findings are consistent with the discussion of feedforward circuits inJarrell et al. (2012). In particular, Figure 6 in Jarrell et al. (2012) illustrates how these neurons(PVV, PVX, PVY, and PVZ) send most of their edges to neurons in separate clusters.

While Figure 3 shows that most edges stay within the same cluster, Figure 4 shows that manyof the nodes do not stay within the same cluster. In particular ten of the 25 nodes in sending cluster2 are not in receiving cluster 2; they move. Instead of using the symmetric notion of clustering,this analysis co-clusters the sending and receiving patterns with di-sim, revealing several persistentasymmetries across the network.

3 Stochastic co-BlockmodelThis section proposes a statistical model for a directed graph with dual notions of stochastic equiv-alence. Despite the fact that di-sim is not a model based algorithm, when the graph is sampledfrom this model, di-sim will estimate these dual partitions.

3.1 Stochastic equivalence, a model based similarityStochastic equivalence is a fundamental concept in classical social network analysis. In the Stochas-tic Blockmodel, two nodes are in the same block if and only if they are stochastically equivalent(Holland et al. (1983)). In a directed network, two nodes a and b are stochastically equivalent ifand only if both of the following hold:

P (a→ x) = P (b→ x) ∀x and (3)P (x→ a) = P (x→ b) ∀x (4)

where a→ x denotes the event that a sends an edge to x. Separating these two notions allows for co-clustering structure. Two nodes a and b are stochastically equivalent senders if and only if Equation3 holds. Two nodes a and b are stochastically equivalent receivers if and only if Equation 4 holds.These two concepts correspond to a model based notion of co-clusters and they are simultaneouslyrepresented in the new Stochastic co-Blockmodel.

3.2 A statistical model of co-clustering in directed graphsThe Stochastic Blockmodel provides a model for a random network with K well defined blocks, orcommunities (Holland et al. (1983)). The Stochastic co-Blockmodel is an extension of the StochasticBlockmodel.

This model naturally generalizes to bi-partite graphs, where the rows and the columns of Aindex different sets of actors (e.g. words and documents). As such, the rest of the paper allows fora different number of rows (Nr) and columns (Nc) in the adjacency matrix A. Using the notationfrom the previous sections, a directed graph would satisfy Nr = Nc = n.

14

Definition 2. Define three nonrandom matrices, Y ∈ {0, 1}Nr×ky , Z ∈ {0, 1}Nc×kz and B ∈[0, 1]ky×kz . Each row of Y and each row of Z has exactly one 1 and each column has at least one 1.Under the Stochastic co-Blockmodel (ScBM), the adjacency matrix A ∈ {0, 1}Nr×Nc is randomsuch that E(A) = Y BZT . Further, each edge is independent, so the probability distribution factors

P (A) =∏i,j

P (Aij).

Without loss of generality, we will always presume that ky ≤ kz.In the Stochastic Blockmodel, E(A) = ZBZT . In the ScBM, E(A) = Y BZT . In this definition,

Y and Z record two types of block membership which correspond to the two types of stochasticequivalence (Equations 3 and 4). Denote yi as the ith row of Y and zi to be the ith row of Z.

Proposition 3.1. Under the ScBM for a directed graph, if yi = yj, then nodes i and j are stochas-tically equivalent senders, Equation 3. Similarly, if zi = zj, then nodes i and j are stochasticallyequivalent receivers, Equation 4.

Wang and Wong (1987) previously proposed and studied a directed Stochastic Blockmodel.However, our aims are different. Where Wang and Wong (1987) sought to understand the depen-dence between Aij and Aji, the current paper seeks to understand the co-clustering structure ofthe blocks. Importantly, where we use two types of stochastic equivalence (sending and receiving),Wang and Wong (1987) uses only one type of stochastic equivalence which implies that if two nodesare stochastically equivalent senders, then the nodes are also stochastically equivalent receivers andvice versa. By encoding co-clustering structure, the ScBM more closely aligns with the concept ofseparately exchangeable arrays (e.g. see Diaconis and Janson (2007) and Wolfe and Choi (2014)).

3.2.1 Degree correction

The degree-corrected Stochastic Blockmodel generalizes the Stochastic Blockmodel to allow fornodes in the same block to have highly heterogeneous degrees (Karrer and Newman (2011)). The-orem 3.1 below studies a similar generalization of the ScBM. The Degree-Corrected Stochastic co-Blockmodel (DC-ScBM) adds two sets of parameters (θyi > 0, i = 1, ..., Nr and θ

zj > 0, j = 1, ..., Nc)

that control the in- and out-degrees for each node. Let B be a ky × kz matrix where Bab ≥ 0 forall a, b. Then, under the DC-ScBM

P (Aij = 1) = θyi θzjByizj

where θyi θzjByizj ∈ [0, 1]. Note that parameters θ

yi and θ

zj are arbitrary to within a multiplicative

constant that is absorbed into B. To make it identifiable, we impose the constraint that withineach row block, the summation of θyi s is 1. That is, for each row-block s,∑

i

θyi 1(Yis = 1) = 1.

Similarly, for any column-block t, we impose∑j

θzj1(Zjt = 1) = 1.

15

Under this constraint, B has explicit meaning: Bst represents the expected number of links fromrow-block s to column-block t. Under the DC-ScBM, define A , EA. This matrix can be expressedas a product of the matrices,

A = ΘyYBZTΘz,

where Θy is a diagonal matrix whose ii’th element is θyi and Θz is defined similarly with θ

zj .

3.3 Estimating the Stochastic co-Blockmodel with di-simTheorem 3.1 bounds the number of nodes that di-sim “misclusters". This demonstrates that theco-clusters from di-sim estimate both the row- and column-block memberships, one in matrix Yand the other in matrix Z, corresponding to the two types of stochastic equivalence. This impliesthat the two notions of stochastic equivalence relate to the two sets of singular vectors of L.

In a diverse set of large empirical networks, the optimal clusters, as judged by a wide variety ofgraph cut objective functions, are not very large (Leskovec et al. (2008)). To account for this, theresults below limit the growth of community sizes by allowing the number of communities to growwith the number of nodes. Previously, Rohe et al. (2011); Choi et al. (2012); Rohe et al. (2012), andBhattacharyya and Bickel (2014) have also studied this high dimensional setting for the undirectedStochastic Blockmodel.

Several previous papers have explored the use of spectral tools to aid the estimation of theStochastic Blockmodel, including McSherry (2001); Dasgupta et al. (2004); Coja-Oghlan and Lanka(2009); Ames and Vavasis (2010); Rohe et al. (2011); Sussman et al. (2012); Chaudhuri et al. (2012);Joseph and Yu (2014); Qin and Rohe (2013); Sarkar and Bickel (2013); Krzakala et al. (2013); Jin(2015); and Lei and Rinaldo (2015). The results below build on this previous literature in severalways. Theorem 3.1 gives the first statistical estimation results for directed graphs or bipartite graphswith general degree distributions. Because we study a graph that is directed, di-sim uses the leadingsingular vectors of a sparse and asymmetric matrix. As such, the proof required novel extensionsof previous proof techniques. These techniques allow the results to also hold for bipartite graphs;previous results for bipartite graphs have only studied computationally intractable techniques, e.g.Flynn and Perry (2012); Wolfe and Choi (2014). For directed graphs and particularly for bipartitegraphs, it is not necessarily true that the number of sending clusters should equal the number ofreceiving clusters. Theorem 3.1 below does not presume that the number of sending clusters equalsthe number of receiving clusters; the theoretical results highlight the statistical price that is paidwhen they are not equal. Finally, we study a sparse degree corrected model and the theoreticalresults highlight the importance of the regularization and projection steps in di-sim.

Previous theoretical papers that use the non-regularized graph Laplacian all require that theminimum degree grows with the number of nodes (e.g. Rohe et al. (2011); Sarkar and Bickel (2013);Lei and Rinaldo (2015)). However, in many empirical networks, most nodes have 1, 2, or 3 edges.In these settings, the non-regularized graph Laplacian often has highly localized eigenvectors thatare uninformative for estimating large partitions in the graph. Because di-sim uses a regularizedgraph Laplacian, the concentration of the singular vectors does not require a growing minimumnode degree. Several previous papers have realized the benefits of regularizing the graph Laplacian(e.g. Page et al. (1999); Andersen et al. (2006); Amini et al. (2013); Chaudhuri et al. (2012);Qin and Rohe (2013); Joseph and Yu (2014)). While the regularized singular vectors concentratewithout a growing minimum degree, the weakly connected nodes effect the conclusions through theirstatistical leverage scores. From the perspective of numerical linear algebra, the leverage scores and

16

the localization of the singular vectors are essential to controlling the algorithmic difficulty ofcomputing the singular vectors (Mahoney, 2011).

3.3.1 Population notation

Recall that A = E(A) is the population version of the adjacency matrix A. Under the Degree-Corrected Stochastic co-Blockmodel,

A = ΘyYBZTΘz,

Similar to Equation (1), define regularized population versions of O, P , and L as

Ojj =∑k Akj

Pii =∑k Aik

Oτ = O + τI, Pτ = P + τI

L = O− 12τ A P

− 12τ

(5)

where O and P are diagonal matrices. The population graph Laplacian L has an alternativeexpression in terms of Y and Z.

Lemma 3.1. (Explicit form for Lτ ) Under the DC-ScBM with parameters {B, Y, Z,ΘY ,ΘZ},define ΘY,τ ∈ RNr×Nr (ΘZ,τ ∈ RNc×Nc) to be diagonal matrix where

[ΘY,τ ]ii = θYi

OiiOii + τ

[ΘZ,τ ]jj = θZj

PjjPjj + τ

.

Then L has the following form,

L = O− 12τ A P

− 12τ = Θ

12

Y,τY BLZTΘ

12

Z,τ ,

for some matrix BL ∈ Rky×kz that is defined in the proof.

The proof of Lemma 3.1 is in Section D.1, in the supplementary materials.

3.3.2 Definition of misclustered

Rigorous discussions of clustering require careful attention to identifiability. In the ScBM, the orderof the columns of Y and Z are unidentifiable. This leads to difficulty in defining “misclustered".Theorem 3.1 uses the following definition of misclustered that is extended from Rohe et al. (2011).

By the singular value decomposition, there exist orthonormal matrices XL ∈ RNr×ky and XR ∈RNc×ky and diagonal matrix Λ ∈ Rky×ky such that

L = XLΛXTR .

Define X ∗L and X∗R as the row normalized population singular vectors,

[X ∗L ]i =[XL]i||[XL]i||2

, [X ∗R ]j =[XR]j||[XR]j ||2

.

17

Unless stated otherwise, we will presume without loss of generality that ky ≤ kz. If rank(B) = ky,then there exist matrices µy ∈ Rky×ky and µz ∈ Rkz×ky such that Y µy = X ∗L and Zµz = X ∗R(implied by Lemma D.1 in the supplementary materials). Moreover, the rows of µy are distinct;with a slightly stronger assumption, the rows of µz are also distinct. As such, k-means applied tothe rows of X ∗L will reveal the partition in Y . Similarly for µ

z, X ∗R , and Z. As such, di-sim appliedto the population Laplacian, L , can discover the block structure in the matrices Y and Z.

Let XL ∈ RNr×ky be a matrix whose orthonormal columns are the right singular vectors cor-responding to the largest ky singular values of L. di-sim applies k-means (with ky clusters) to therows of X∗L, denoted as u1, . . . , uNr . Each row is assigned to one cluster and each cluster has acentroid.

Definition 3. For i = 1, . . . , Nr, define cLi ∈ Rky to be the centroid corresponding to ui afterrunning (1 + α)-approximate k-means on u1, . . . , uNr with ky clusters.

If cLi is closer to some population centroid other than its own, i.e. yjµy for some yj 6= yi, thenwe call node i Y -misclustered. This definition must be slightly complicated by the fact that thecoordinates in XL must first align with the coordinates in XL. So, the definitions below include anadditional rotation matrix RL.

Definition 4. The set of nodes Y -misclustered is

My ={i : ‖cLi − yiµyRL‖2 > ‖cLi − yjµyRL‖2 for any yj 6= yi

}, (6)

where RL is the orthonormal matrix that solves Wahba’s problem min ‖XL −XLRL‖F , i.e. it isthe procrustean transformation.

Defining Z-misclustered, requires defining cRi and µz analogous to the previous definitions.

Definition 5. The set of nodes Z-misclustered is

Mz ={i : ‖cRi − ziµzRR‖2 > ‖cRi − zjµzRR‖2 for any zj 6= zi

}, (7)

where RR is the orthonormal matrix that solves Wahba’s problem min ‖XR −XRRR‖F , i.e. it isthe procrustean transformation.

3.3.3 Asymptotic performance

DefineH = (Y TΘY,τY )

1/2BL(ZTΘZ,τZ)

1/2.

H ∈ Rky×kz shares same top K singular values with the population graph Laplacian L . DefineH·j as the jth column of H, and define

γz = mini 6=j‖H·i −H·j‖2. (8)

When kz > ky, γz controls the additional difficulty in estimating Z.Define my as the minimum row length of XL. Similarly define mz as the minimum row length

of XR. That is,my = min

i=1,..,Nr||[XL]i||2, mz = min

j=1,..,Nc||[XR]j ||2. (9)

These are the minimum leverage scores for the matrices L L T and L TL .The next theorem bounds the sizes of the sets of misclustered nodes, |My| and |Mz|.

18

Theorem 3.1. Suppose A ∈ RNr×Nc is an adjacency matrix sampled from the Degree-CorrectedStochastic co-Blockmodel with ky left blocks and kx right blocks. Let K = min{ky, kz} = ky. DefineL as in Equation 5. Define λ1 ≥ λ2 ≥ · · · ≥ λK > 0 as the K nonzero singular values of L . LetMy and Mz be the sets of Y - and Z-misclustered nodes (Equations 6 and 7) by DI-SIM. Let δ bethe minimum expected row and column degree of A, that is δ = min(mini Oii,minj Pjj). Define γz,my and mz as in Equations 8 and 9. For any � > 0, if δ+ τ > 3 ln(Nr +Nc) + 3 ln(4/�), then withprobability at least 1− �,

MyNr≤ c0(α)

K ln(4(Nr +Nc)/�)

Nrλ2Km2y(δ + τ)

, (10)

MzNc≤ c1(α)

K ln(4(Nr +Nc)/�)

Ncλ2Km2zγ

2z (δ + τ)

. (11)

A proof of Theorem 3.1 is contained in the appendix.Because ‖XL‖2F = K, the average leverage score ||[XL]i||2 is

√K/Nr. If the my is of the same

order, with λK and K fixed, thenMyNr

goes to zero when δ + τ grows faster than ln(Nr + Nc). Insparse graphs, δ is fixed and so τ must grow with n. To ensure that λK remains fixed while τ isgrowing, it is necessary for the average degree to also grow.

In many empirical networks, the vast majority of nodes have very small degrees; this is a regimein which δ is not growing. In such networks, the bounds in Equations (10) and (11) are vacuousunless τ > 0. While these equations are upper bounds, the simulations in the appendix show thatfor sparse networks (i.e. δ small), these bounds align with the performance of di-sim. Moreover,the performance of di-sim is drastically improves with statistical regularization.

These results highlight the sensitivity to the smallest leverage scores my and mz. When thereare excessively small leverage scores, then the bound above can become meaningless. However, aslight modification of di-sim that excludes the low leveraged points from the k-means step and theclustering results, obtains a vastly improved bound. If one computes the leading singular vectorsand only runs k-means on the with the observations i that satisfy ||[XL]i||2 > η

√K/N , then the

theoretical results are much improved. Denote the nodes misclustered by this procedure as M ∗y .Let there be N∗ nodes with ||[XL]i||2 > η

√K/N . If N/N∗ = O(1) and the population eigengap

λK is not asymptotically diminishing, then

M ∗yN∗≤ c2(α)

ln((Nr +Nc)/�)

η2(δ + τ).

The proof mimics the proof of Theorem 3.1.In Theorem 3.1, the bound for Mz exceeds the bound for My because the bound for Mz contains

an additional term γz. This asymmetry stems from allowing kz ≥ ky. In fact, if ky = kz, then γzcan be removed, making the bounds identical. However, if kz > ky, then Rank(L ) is at most ky.So, the singular value decomposition represents the data in ky dimensions and the k-means stepsfor both the left and the right clusters are done in ky dimensions. In estimating Y , there is onedimension in the singular vector representation for each of the ky blocks. At the same time, thesingular value representation shoehorns the kz blocks in Z into less than kz dimensions. So, thereis less space to separate each of the kz clusters, obscuring the estimation of Z.

To further understand the bound in Theorem 3.1, define the following toy model.

Definition 6. The four parameter ScBM is an ScBM parameterized by K ∈ N, s ∈ N, r ∈ (0, 1),and p ∈ (0, 1) such that p + r ≤ 1. The matrices Y, Z ∈ {0, 1}n×K each contain s ones in eachcolumn and B = pIK + r1K1TK .

19

In the four parameter ScBM, there are K left- and right-blocks each with s nodes and the nodepartitions in Y and Z are not necessarily related. If yi = zj , then P (i → j) = p + r. Otherwise,P (i→ j) = r.

Corollary 3.1. Assume the four parameter ScBM, with same number of rows and columns, andr, p fixed and K growing with N = Ks. Since δ is growing with n, set τ = 0. Then,

λK =1

K(r/p) + 1,

where λK is the Kth largest singular value of L . Moreover,

N−1(|My|+ |Mz|) = Op(K2 logN

N

).

The proportion of nodes that are misclustered converges to zero, as long as number of clustersK = o(

√N/ logN).

The proof of Corollary 3.1 is contained in the supplementary materials, Section D.

4 Discussion

4.1 Related SVD methodsSeveral other researchers have used SVD to explore and understand different network features.

Kleinberg (1999) proposed the concept of “hubs and authorities" for hyperlink-induced topicsearch (HITS). This algorithm that was a precursor to Google’s PageRank algorithm (Page et al.(1999)). The SVD plays a key role in this algorithm. The SVD also played a key role in Hoff (2009),where the left and right singular vectors estimate “sender-specific and receiver-specific latent nodalattributes". Like di-sim, the algorithms in Kleinberg (1999) and Hoff (2009) use the SVD toinvestigate asymmetric features of directed graphs.

Dhillon (2001) suggested an algorithm similar to di-sim that was to be applied to bipartitegraphs in which the rows and columns of L correspond to different entities (e.g. documents andwords). There are three key differences between di-sim and the algorithm in Dhillon (2001). First,Dhillon (2001) does not use regularization. So, the definition of L remains the same, but τ = 0. Theregularization step helps di-sim when L has highly localized singular vectors; this often happenswhen several nodes have very small degrees. Second, Dhillon (2001) does not project the rows ofthe singular vectors onto the sphere. The project step helps di-sim when the node degrees arehighly heterogeneous. Finally, to estimate K clusters, Dhillon (2001) only uses dlog2Ke singularvectors (dxe is the smallest integer greater than x). While it is much faster to only compute log2Ksingular vectors, there is additional information contained in the remaining top K singular vectors.For example, under the four parameter ScBM, λ2 = · · · = λK . As such, there is not an eigengapafter the dlog2Keth singular value.

SVD has been used in other forms of discrete data, most notably in correspondence analysis(CA). In fact, di-sim normalizes the rows and columns in an identical fashion to CA. CA hassimilarities to principal components analysis, but it is applicable to categorical data in contingencytables and is built on a beautiful set of algebraic ideas (Holmes (2006)). The methodology was firstpublished in Hirschfeld (1935) and (like spectral clustering) it has been rediscovered and reapplied

20

several times over (Guttman (1959)). While there exists a deep algorithmic, algebraic, and heuristicunderstanding of CA, it is rarely conceived through a statistical model; Goodman (1986) is oneexception. Wasserman et al. (1990) study how one could use CA to study relational data, but wasparticularly interested in two-way or bipartite networks. Anderson et al. (1992) mentions CA andvisual inspection as one possible way to construct blocks in a Stochastic Blockmodel. The previousCA literature has not explored the parameter estimation performance of CA under any of thesemodels, nor has the literature explored the dual partitions under a directed graph. Algorithmically,the CA literature does not employ the regularization step (using τ) for sparse data. Nor does itemploy the projection step, where the rows of the singular vector matrices are normalized to haveunit length. This is a potentially fruitful area for further research in CA.

In research that was contemporaneous to this paper’s tech report (Rohe and Yu (2012)), bothWolfe and Choi (2014) and Flynn and Perry (2012) studied likelihood formulations of co-clusteringin the network setting. Wolfe and Choi (2014) studied a “non-parametric" model that assumesthe nodes are separately exchangeable. This is a generalization of the Stochastic co-Blockmodel.Flynn and Perry (2012) uses a profile likelihood formulation to develop a consistent estimator ofthe Stochastic co-Blockmodel.

4.2 ConclusionBy extending both spectral clustering and the Stochastic Blockmodel to a co-clustering framework,this paper aims to better conceptualize clustering in directed graphs; co-clustering is a meaningfulprocedure for directed networks and helps to guide the development of reasonable questions fornetwork researchers.

Given that empirical graphs can be sparse, with highly heterogeneous node degrees, we proposea novel spectral algorithm di-sim that incorporates both the regularization and projection steps.Section 2 demonstrates how di-sim’s asymmetric analysis finds novel structure in three empiricalnetworks. In the Enron email network, it identifies Bill Williams, who was part of the conspiracyto manufacture energy shortages in Southern California. In the political blog network, it identifiessix asymmetric blogs. Finally, in the c elegans network di-sim identifies several neurons that formfeedforward circuits. In each of these examples, the conclusions are only feasible because the dataanalysis leverages the edge asymmetries.

Investigating the statistical properties of di-sim required several theoretical novelties that buildon the extensive literature for spectral algorithms. The results highlight the importance of regular-ization and the statistical leverage scores. Importantly, because of the regularization, the conver-gence of the singular vectors does not require a growing minimum degree. Moreover, because thetheory accommodates a “degree corrected" model, it was necessary to project the rows of XL andXR onto the sphere. Finally, these results extend to bipartite graphs, where the rows and columnsof the adjacency matrix index different sets of objects.

Acknowledgements: Thank you David Gleich for your thoughtful questions and helpful ref-erences. Thank you Sara Fernandes-Taylor and Zoe Russek for your helpful comments. Thank youSusan Holmes for the helpful references. While Karl Rohe was a graduate student, he was partiallysupported by an NSF VIGRE Graduate Fellowship at UC Berkeley and ARO grant W911NF-11-1-0114. More recently, NSF DMS-1309998 has supported this research. Tai Qin is supported byDMS-1308877. Bin Yu is partially supported by NSF grants SES-0835531 (CDI), DMS-1107000,0939370 CCF, and ARO grant W911NF-11-1-0114.

21

ReferencesAdamic, L. A. and Glance, N. (2005). The political blogosphere and the 2004 us election: divided

they blog. In Proceedings of the 3rd international workshop on Link discovery , pages 36–43.ACM.

Ames, B. P. and Vavasis, S. A. (2010). Convex optimization for the planted k-disjoint-cliqueproblem. arXiv preprint arXiv:1008.2814 .

Amini, A. A., Chen, A., Bickel, P. J., Levina, E., et al. (2013). Pseudo-likelihood methods forcommunity detection in large sparse networks. The Annals of Statistics, 41(4), 2097–2122.

Andersen, R., Chung, F., and Lang, K. (2006). Local graph partitioning using pagerank vectors.In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages475–486. IEEE.

Anderson, C. J., Wasserman, S., and Faust, K. (1992). Building stochastic blockmodels. Socialnetworks, 14(1), 137–161.

Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S., and Modha, D. (2004). A generalized maximumentropy approach to bregman co-clustering and matrix approximation. In Proceedings of thetenth ACM SIGKDD international conference on Knowledge discovery and data mining , pages509–514. ACM.

Bhattacharyya, S. and Bickel, P. J. (2014). Community detection in networks using graph distance.arXiv preprint arXiv:1401.3915 .

Bisson, G. and Hussain, F. (2008). Chi-sim: A new similarity measure for the co-clustering task.In Machine Learning and Applications, 2008. ICMLA’08. Seventh International Conference on,pages 211–217. IEEE.

Borchers, H. W. (2012). [r] k-means++. https://stat.ethz.ch/pipermail/r-help/2012-January/300051.html.

Chaudhuri, K., Graham, F. C., and Tsiatas, A. (2012). Spectral clustering of graphs with general de-grees in the extended planted partition model. Journal of Machine Learning Research-ProceedingsTrack , 23, 35–1.

Chen, A., Amini, A. A., Bickel, P. J., and Levina, E. (2012). Fitting community models to largesparse networks. arXiv preprint arXiv:1207.2340 .

Choi, D., Wolfe, P., and Airoldi, E. (2012). Stochastic blockmodels with growing number of classes.Biometrica (in press).

Chung, F. (1997). Spectral graph theory . Amer Mathematical Society.

Chung, F. and Radcliffe, M. (2011). On the spectra of general random graphs. the electronic journalof combinatorics, 18(P215), 1.

Chung, F. R. K. and Lu, L. (2006). Complex graphs and networks. Number 107. American Math-ematical Soc.

Cohen, W. W. (2009). Enron email dataset.

22

Coja-Oghlan, A. and Lanka, A. (2009). Finding planted partitions in random graphs with generaldegree distributions. SIAM Journal on Discrete Mathematics, 23(4), 1682–1714.

Dasgupta, A., Hopcroft, J. E., and McSherry, F. (2004). Spectral analysis of random graphs withskewed degree distributions. In Foundations of Computer Science, 2004. Proceedings. 45th AnnualIEEE Symposium on, pages 602–610. IEEE.

Dhillon, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning.In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discoveryand data mining , pages 269–274. ACM.

Diaconis, P. and Janson, S. (2007). Graph limits and exchangeable random graphs. arXiv preprintarXiv:0712.2749 .

Egan, T. (2005). Tapes reveal enron took a role in crisis. [Online; accessed 16-July-2013].

Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal , 23(2),298–305.

Flynn, C. J. and Perry, P. O. (2012). Consistent biclustering. arXiv preprint arXiv:1206.6927 .

Freitag, D. (2004). Trained named entity recognition using distributional clusters. In Proceedingsof EMNLP , volume 4, pages 262–269.

Goodman, L. A. (1986). Some useful extensions of the usual correspondence analysis approachand the usual log-linear models approach in the analysis of contingency tables. InternationalStatistical Review / Revue Internationale de Statistique, 54(3), pp. 243–270.

Guttman, L. (1959). Metricizing rank-ordered or unordered data for a linear factor analysis.Sankhyā: The Indian Journal of Statistics (1933-1960), 21(3/4), 257–268.

Hartigan, J. (1972). Direct clustering of a data matrix. Journal of the American Statistical Asso-ciation, pages 123–129.

Hirschfeld, H. (1935). A connection between correlation and contingency. Mathematical Proceedingsof the Cambridge Philosophical Society , 31(04), 520–524.

Hoff, P. (2009). Multiplicative latent factor models for description and prediction of social networks.Computational & Mathematical Organization Theory , 15(4), 261–272.

Hoff, P., Raftery, A., and Handcock, M. (2002). Latent space approaches to social network analysis.Journal of the American Statistical Association, 97(460), 1090–1098.

Holland, P., Laskey, K., and Leinhardt, S. (1983). Stochastic blockmodels: Some first steps. SocialNetworks, 5, 109–137.

Holmes, S. (2006). Multivariate analysis: The french way. Festschrift for David Freedman.

Horn, R. and Johnson, C. (2005). Matrix analysis. Cambridge university press.

Jarrell, T. A., Wang, Y., Bloniarz, A. E., Brittin, C. A., Xu, M., Thomson, J. N., Albertson, D. G.,Hall, D. H., and Emmons, S. W. (2012). The connectome of a decision-making neural network.Science, 337(6093), 437–444.

23

Jin, J. (2015). Fast community detection by score. The Annals of Statistics, 43(1), 57–89.

Joseph, A. and Yu, B. (2014). Impact of regularization on spectral clustering. arXiv preprintarXiv:1312.1733 .

Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in net-works. Physical Review E , 83(1), 016107.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM(JACM), 46(5), 604–632.

Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators.Bernoulli , 6(1), 113–167.

Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborová, L., and Zhang, P. (2013). Spec-tral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences,110(52), 20935–20940.

Kumar, A., Sabharwal, Y., and Sen, S. (2004). A simple linear time (1+ ε)-approximation algo-rithm for geometric k-means clustering in any dimensions. In Proceedings-Annual Symposium onFoundations of Computer Science, pages 454–462. IEEE.

Lei, J. and Rinaldo, A. (2013). Consistency of spectral clustering in sparse stochastic block models.arXiv preprint arXiv:1312.2050 .

Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. TheAnnals of Statistics, 43(1), 215–237.

Leicht, E. and Newman, M. (2008). Community structure in directed networks. Physical ReviewLetters, 100(11), 118703.

Leskovec, J., Lang, K., Dasgupta, A., and Mahoney, M. (2008). Statistical properties of commu-nity structure in large social and information networks. In Proceeding of the 17th internationalconference on World Wide Web, pages 695–704. ACM.

Madeira, S. and Oliveira, A. (2004). Biclustering algorithms for biological data analysis: a survey.Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1), 24–45.

Madeira, S., Teixeira, M., Sa-Correia, I., and Oliveira, A. (2010). Identification of regulatory mod-ules in time series gene expression data using a linear time biclustering algorithm. ComputationalBiology and Bioinformatics, IEEE/ACM Transactions on, 7(1), 153–165.

Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends R©in Machine Learning , 3(2), 123–224.

McSherry, F. (2001). Spectral partitioning of random graphs. In Foundations of Computer Science,2001. Proceedings. 42nd IEEE Symposium on, pages 529–537. IEEE.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The pagerank citation ranking: Bringingorder to the web.

24

Perry, P. O. and Wolfe, P. J. (2013). Point process modelling for directed interaction networks.Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-corrected stochasticblockmodel. Advances in Neural Information Processing Systems.

Rohe, K. and Yu, B. (2012). Co-clustering for directed graphs; the stochastic co-blockmodel and aspectral algorithm. arXiv preprint arXiv:1204.2296 .

Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-dimensional stochasticblockmodel. The Annals of Statistics, 39(4), 1878–1915.

Rohe, K., Qin, T., and Fan, H. (2012). The highest dimensional stochastic blockmodel with aregularized estimator. arXiv preprint arXiv:1206.2380 .

Rohwer, R. and Freitag, D. (2004). Towards full automation of lexicon construction. In Proceedingsof the HLT-NAACL Workshop on Computational Lexical Semantics, pages 9–16. Association forComputational Linguistics.

Sarkar, P. and Bickel, P. J. (2013). Role of normalization in spectral clustering for stochasticblockmodels. arXiv preprint arXiv:1310.1495 .

Steinhaus, H. (1956). Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci , 1,801–804.

Sussman, D. L., Tang, M., Fishkind, D. E., and Priebe, C. E. (2012). A consistent adjacency spectralembedding for stochastic blockmodel graphs. Journal of the American Statistical Association,107(499), 1119–1128.

Tanay, A., Sharan, R., Kupiec, M., and Shamir, R. (2004). Revealing modularity and organizationin the yeast molecular network by integrated analysis of highly heterogeneous genomewide data.Proceedings of the National Academy of Sciences of the United States of America, 101(9), 2981.

Tanay, A., Sharan, R., and Shamir, R. (2005). Biclustering algorithms: A survey. Handbook ofcomputational molecular biology , 9, 26–1.

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing , 17(4), 395–416.

Wang, Y. andWong, G. (1987). Stochastic blockmodels for directed graphs. Journal of the AmericanStatistical Association, 82(397), 8–19.

Wasserman, S., Faust, K., and Galaskiewicz, J. (1990). Correspondence and canonical analysis ofrelational data. Journal of Mathematical Sociology , 15(1), 11–64.

White, H., Boorman, S., and Breiger, R. (1976). Social structure from multiple networks. I. Block-models of roles and positions. American Journal of Sociology , 81(4), 730–780.

White, J. G., Southgate, E., Thomson, J. N., and Brenner, S. (1986). The structure of the nervoussystem of the nematode caenorhabditis elegans. Philosophical Transactions of the Royal Societyof London. B, Biological Sciences, 314(1165), 1–340.

25

Wikipedia (2013). Enron — Wikipedia, the free encyclopedia. [Online; accessed 16-July-2013].

Wolfe, P. and Choi, D. (2014). Co-clustering separately exchangeable network data. Annals ofStatistics.

Zhao, Y., Levina, E., and Zhu, J. (2012). Consistency of community detection in networks underdegree-corrected stochastic block models. The Annals of Statistics, 40(4), 2266–2292.

Zhou, Y., Goldberg, M., Magdon-Ismail, M., and Wallace, W. (2007). Strategies for cleaningorganizational emails with an application to enron email dataset. In 5th Conf. of North AmericanAssociation for Computational Social and Organizational Science. Citeseer.

26

Supplementary materials

A SimulationThe theoretical results of Theorem 3.1 identify (1) the expected node degree and (2) the spectral gapas essential parameters that control the clustering performance of di-sim. The simulations inves-tigate di-sim’s non-asymptotic sensitivity to these quantities under the four parameter StochasticCo-Blockmodel (Definition 6). Moreover, the simulations investigate the performance under themodel without degree correction and with degree correction.

Both simulations use k = 5 blocks for both Y and Z. Each of the five blocks contains 400 nodes.So, n = 2000. When the model is degree corrected, θ1, . . . , θn are iid with θi

d=√Z + .169 where

Z ∼ exponential(1). The addition of .169 ensures that E(θi) ≈ 1 and thus the expected degrees areunchanged between the degree corrected model and the model without degree correction.

In the first simulation, the expected node degree is represented on the horizontal axis; the outof block probability r and the in block probability p+r change in a way that keeps the spectral gapof L fixed across the horizontal axis. In the second simulation, the spectral gap is represented onthe horizontal axis; the probabilities p and r change so that the expected degree pk + rn remainsfixed at twenty. In both simulations, the partition matrices Y and Z are sampled independentlyand uniformly over the set of matrices with s = 400 and k = 5.

To design the parameter settings of p and r, note that the population graph Laplacian L isa rank k matrix. So, its k + 1 eigenvalue is λk+1 = 0 and the spectral gap is λk − λk+1 = λk.Corollary 3.1 says that the kth eigenvalue of L for τ = 0 is

λk =1

k(r/p) + 1.

To keep the spectral gap λk fixed, it is equivalent to keeping r/p fixed.We use the k-means++ algorithm (Kumar et al. (2004), Borchers (2012)) with ten initial-

izations. Only the results for Y -misclustered (Definition 6) are reported. Code is provided athttp://www.stat.wisc.edu/∼karlrohe/.

A.0.1 Simulation 1

This simulation investigates the sensitivity of di-sim to a diminishing number of edges. Figure 5displays the simulation results for a sequence of nine equally spaced values of the expected degreebetween 5 and 16. To decrease the variability of the plot, each simulation was run twenty times;only the average is displayed. The solid line corresponds to setting the regularization parameterequal to zero (τ = 0). The line with longer dashes represents τ = 1. The line with small dashesrepresents the average degree, τ = 1n

∑i Pii.

Figure 5 demonstrates two things. First, the number of misclustered nodes increases as theexpected degree goes to zero. Second, regularization decreases the number of misclustered nodesfor small values of the expected degree.

A.0.2 Simulation 2

This simulation investigates the sensitivity of di-sim to a diminishing spectral gap λk. Figure 5displays the simulation results for a sequence of nine equally spaced values of the spectral gap,

27

6 8 10 12 14 16

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Stochastic Co−Blockmodel

Expected degree

Pro

port

ion

of n

odes

mis

culs

tere

d tau= 0tau = 1tau = avg deg

6 8 10 12 14 16

0.1

0.3

0.5

0.7

Degree correctedStochastic Co−Blockmodel

Expected degreeP

ropo

rtio

n of

nod

es m

iscu

lste

red

Figure 5: In the simulation on the left, the data comes from the four parameter Stochastic Co-Blockmodel. On the right, the data comes from the same model, but with degree correction. Theθi parameters have expectation one. In both models, k = 5 and s = 400. The probabilities p andr vary such that p = 5r, keeping the spectral gap fixed at λk = 1/2. This simulation shows thatfor small expected degree, regularization decreases the proportion of nodes that are misclustered.Moreover, the benefits of regularization are more pronounced under the degree corrected model.

between .3 and .6. In each simulation, the expected degree is held constant at twenty. To decreasethe variability, each simulation was run twenty times; only the average is displayed. The solid linecorresponds to setting the regularization parameter equal to zero (τ = 0). The line with longerdashes represents τ = 1. The line with small dashes represents the average degree, τ = 1n

∑i Pii.

Figure 5 demonstrates two things. First, the number of misclustered nodes increases as thespectral gap goes to zero. Second, regularization yields slight benefits when the spectral gap issmall and the model is degree corrected.

B Directed latent space modelThe following definition of the directed latent space model is motivated by the Aldous-Hooverrepresentation for infinite exchangeable arrays and the latent space model proposed by Hoff et al.(2002). It specifies the distribution of the random directed adjacency matrix A ∈ {0, 1}n×n.

Definition 7. The random adjacency matrix A is from the directed latent space model if and

28

0.30 0.40 0.50 0.60

0.00

0.05

0.10

0.15

0.20

0.25

Stochastic Co−Blockmodel

Spectral gap in population

Pro

port

ion

of n

odes

mis

culs

tere

d tau= 0tau = 1tau = avg deg

0.30 0.40 0.50 0.60

0.00

0.10

0.20

0.30

Degree correctedStochastic Co−Blockmodel

Spectral gap in populationP

ropo

rtio

n of

nod

es m

iscu

lste

red

Figure 6: In the simulation on the left, the data comes from the four parameter Stochastic Co-Blockmodel. On the right, the data comes from the same model, but with degree correction. The θiparameters have expectation one. In both models, k = 5 and s = 400. The spectral gap, displayedon the horizontal axis, changes because the probabilities p and r change. The values of p and rvary in a way that keeps the expected degree fixed at twenty for all simulations. Without degreecorrection, the three separate lines are difficult to distinguish because they are nearly identical.Under the degree corrected model, regularization improves performance when the spectral gap issmall.

only ifP(A|{zi, yi}ni=1) =

∏i

Definition 8. The Stochastic co-Blockmodel with k blocks is a directed latent space model with

A = Y BZT ,

where Y,Z ∈ {0, 1}n×k both have exactly one 1 in each row and at least one 1 in each column andB ∈ [0, 1]k×k is full rank.

C Convergence of Singular VectorsThe classical spectral clustering algorithm above can be divided into two steps: (1) find the eigen-decomposition of L and (2) run k-means. Several previous papers have studied the estimationperformance of the classical spectral clustering algorithm under a standard social network model.However, due to the asymmetry of A, previous proof techniques can not be directly applied tostudy the singular vectors for di-sim. In this analysis, we (a) symmetrize the graph Laplacian, (b)apply modern matrix concentration techniques to this symmetrized version of the graph Laplacian,and (c) apply an updated version of the Davis-Kahn theorem to bound the distance between thesingular spaces of the empirical and population Laplacian.

For simplicity, from now on let L denote the regularized graph Laplacian.Define the symmetrized version of L and L as

L̃ =

(0 LLT 0

), L̃ =

(0 L

L T 0

).

The next theorem gives a sharp bound between L̃ and L̃ .

Theorem C.1. (Concentration of L) Let G be a random graph, with independent edges andpr(vi ∼ vj) = pij. Let δ be the minimum expected row and column degree of G, that is δ =min(mini Oii,minj Pjj). For any � > 0, if δ + τ > 3 ln(Nr +Nc) + 3 ln(4/�), then with probabilityat least 1− �,

‖L̃− L̃ ‖ ≤ 4√

3 ln(4(Nr +Nc)/�)

δ + τ. (12)

Proof. Let C = P−12

τ AO− 12τ and define C̃ in the same way as L̃. Then ‖L̃−L̃ ‖ ≤ ‖C̃−L̃ ‖+‖L̃−C̃‖.

We bound the two terms separately.For the first term, we apply the following concentration inequality for matrices, see for example

Chung and Radcliffe (2011).

Lemma C.1. Let X1, X2, ..., Xm be independent random N × N Hermitian matrices. Moreover,assume that ‖Xi − E(Xi)‖ ≤ M for all i, and v2 = ‖

∑var(Xi)‖. Let X =

∑Xi. Then for any

a > 0,

pr(‖X − E(X)‖ ≥ a) ≤ 2N exp(− a

2

2v2 + 2Ma/3

).

Let Eij be the matrix with 1 in the i, j and j, i positions and 0 everywhere else. Let pij = Aij .To use this inequality, express C̃ − L̃ as the sum of the matrices Yi,m+j ,

Yi,m+j =1√

(Oii + τ)(Pjj + τ)(Aij − pij)Ei,m+j , i = 1, ...,m, j = 1, ..., n.

30

Note that

‖C̃ − L̃ ‖ = ‖m∑i=1

n∑j=1

Yi,m+j‖,

and‖Yi,m+j‖ ≤

1√(Oii + τ)(Pjj + τ)

≤ (δ + τ)−1.

Moreover,

E[Yi,m+j ] = 0 and E[Y 2i,m+j ] =1

(Oii + τ)(Pjj + τ)(pij − p2ij)(Eii + Em+j,m+j).

Then,

v2 = ‖m∑i=1

n∑j=1

E[Y 2i,m+j ]‖ = ‖m∑i=1

n∑j=1

1

(Oii + τ)(Pjj + τ)(pij − p2ij)(Eii + Em+j,m+j)‖

= ‖m∑i=1

[

n∑j=1

1

(Oii + τ)(Pjj + τ)(pij − p2ij)]Eii +

n∑j=1

[

m∑i=1

1

(Oii + τ)(Pjj + τ)(pij − p2ij)]Em+j,m+j‖

= max

{max

i=1,...,m(

n∑j=1

1

(Oii + τ)(Pjj + τ)(pij − p2ij)), max

j=1,...,n(

m∑i=1

1

(Oii + τ)(Pjj + τ)(pij − p2ij))

}

≤ max{

maxi=1,...,m

1

δ + τ

n∑j=1

pijOii + τ

, maxj=1,...,n

1

δ + τ

m∑i=1

pijPjj + τ

}= (δ + τ)−1.

Take

a =

√3 ln(4(Nr +Nc)/�)

δ + τ.

By assumption, δ + τ > 3 ln(Nr +Nc) + 3 ln(4/�). So a < 1. Applying Lemma C.1,

pr(‖C̃ − L̃ ‖ ≥ a) ≤ 2(Nr +Nc) exp(−

3 ln(4(Nr+Nc)/�)δ+τ

2/(δ + τ) + 2a/[3(δ + τ)]

)≤ 2N exp(−3 ln(4(Nr +Nc)/�)

3)

≤ �/2.

For the second term ‖L̃− C̃‖, define

Dτ =

(Oτ 00 Pτ

), Dτ =

(Oτ 00 Pτ

), D = D0, and D = D0.

Apply the two sided concentration inequality for each i, 1 ≤ i ≤ Nr + Nc, (see for exampleChung and Lu (2006, chap. 2))

pr(|Dii −Dii| ≥ λ) ≤ exp{−λ2

2Dii}+ exp{− λ

2

2Dii +23λ}.

31

Let λ = a(Dii + τ), where a is as before.

pr

(|Dii −Dii| ≥ a(Dii + τ

co-blockmodel and spectral algorithm di-sim · 2015. 1. 9. · co-clustering for directed graphs:...

Documents