
Supplementary Material: Large-scale community structure in social and information networks

Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, Michael W. Mahoney

Note: This is a draft, from March 25, 2009. Please do not distribute.

In this Supplementary Material, we describe additional material supporting our main text:

• In Section S1, we describe the network data sets we used in our analysis, and we provide basic statistics for each of these networks.

• In Section S2, we discuss the network community profile (NCP) in greater detail, and we present the NCP for a wide range of networks, including many of the over 100 social and information networks we have analyzed.

• In Section S3, we discuss in greater detail structural properties of networks related to the nested core-periphery structure that are responsible for the empirical observations we have made.

• In Section S4, we describe in more detail the Forest Fire Model, which provides an iterative local edge attachment mechanism that produces networks with a nested core-periphery structure and an upward-sloping NCP.

• In Section S5, we compare our results to other algorithms, and we discuss algorithmic issues which will provide confidence that we are observing properties of the networks we analyze and not artifacts of the algorithms we employ.

• In Section S6, we provide a more detailed discussion of the relationship between our results and the concept of modularity, which has been widely used in the community detection literature.

A central question for our conclusions in the main text has to do with the extent to which one can be confident that the upward-sloping NCP, as well as other related quantities we discuss, are properties of the networks we are considering, rather than artifacts of the approximation algorithms we employ. We have gone to a great deal of effort to demonstrate this. In particular:

• We use several classes of graph partitioning algorithms to probe the networks for sets of nodes that could plausibly be interpreted as communities. These algorithms, including flow-based methods, spectral methods, and hierarchical methods, have complementary strengths and weaknesses that are well understood both in theory and in practice. For example, flow-based methods are known to have difficulties with expanders (42, 43), and flow-based post-processing of other methods is known in practice to yield cuts with extremely good conductance values (39, 41). On the other hand, spectral methods are known to have difficulties when they confuse long paths with deep cuts (30, 64), a consequence of which is that they may be viewed as computing a “regularized” approximation to the network community profile plot.

• We compute spectral-based lower bounds and also semidefinite-programming-based lower bounds for the conductance of our network datasets.

• We compute a wide range of other structural properties of the networks, e.g., sizes, degree distributions, maximum and average diameters of the purported communities, internal versus external conductance values of the purported communities, etc.

• We recompute statistics on versions of the networks that have been modified in well-understood ways, e.g., by removing small barely-connected sets of nodes or by randomizing the edges.

• We compare our results across not only over 100 large social and information networks, but also numerous commonly-studied small social networks, expanders, and low-dimensional manifold-like objects, and we compare our results on each network with what is known from the field from which the network is drawn. To our knowledge, this makes ours the most extensive such analysis of the community structure in large real-world social and information networks.

• We compare results with analytical and/or simulational results on a wide range of commonly and not-so-commonly used network generation models (7, 12, 25, 26, 38, 45, 51, 57).

Aside from Section S1, where we describe the network data sets, and Section S6, where we provide a more detailed discussion of the relationship between our results and modularity, most of this Supplementary Material is devoted to describing these methodological issues in greater detail.

S1 Social and information network datasets we analyze

We have examined a large number of real-world complex networks. See Tables S1, S2, S3, and S4 for a summary. (In addition to the networks listed in these tables, we have also analyzed 58 networks that are related to the MESSENGER network; these correspond to 58 versions of the Instant messenger social network in 58 different countries. These networks range in size from nearly $10^5$ nodes up to nearly $10^7$ nodes and include the MESSENGER-DE network.) For convenience, we have organized the networks into the following categories: Social networks; Information/citation networks; Collaboration networks; Web graphs; Internet networks; Bipartite affiliation networks; Biological networks; Low-dimensional networks; IMDB networks; and Amazon networks. We have also examined numerous small social networks that have been used as a testbed for community detection algorithms (e.g., Zachary’s karate club (5, 70), interactions between dolphins (5, 48), interactions between monks (5, 61), Newman’s network science network (5, 53), etc.), numerous simple network models in which by design there is an underlying geometry (e.g., power grid and road networks (69), simple meshes, low-dimensional manifolds including graphs corresponding to the well-studied “swiss roll” dataset (67), a geometric preferential attachment model (25, 26), etc.), several networks that are very good expanders, and many simulated networks generated by commonly-used network generation models (e.g., preferential attachment models (51), copying models (38), hierarchical models (57), etc.).

Social networks: The class of social networks in Table S1 is particularly diverse and interesting. It includes several large on-line social networks: a network of professional contacts from LinkedIn


Network $N$ $E$ $N_b/N$ $E_b/E$ $d$ $\tilde{d}$ $C$ $D$ $\bar{D}$ Description

Social networks

DELICIOUS 147,567 301,921 0.40 0.65 4.09 48.44 0.30 24 6.28 del.icio.us collaborative tagging social network
EPINIONS 75,877 405,739 0.48 0.90 10.69 183.88 0.26 15 4.27 Who-trusts-whom network from epinions.com (59)
FLICKR 404,733 2,110,078 0.33 0.86 10.43 442.75 0.40 18 5.42 Flickr photo sharing social network (37)
LINKEDIN 6,946,668 30,507,070 0.47 0.88 8.78 351.66 0.23 23 5.43 Social network of professional contacts
LIVEJOURNAL01 3,766,521 30,629,297 0.78 0.97 16.26 111.24 0.36 23 5.55 Friendship network of a blogging community (11)
LIVEJOURNAL11 4,145,160 34,469,135 0.77 0.97 16.63 122.44 0.36 23 5.61 Friendship network of a blogging community (11)
LIVEJOURNAL12 4,843,953 42,845,684 0.76 0.97 17.69 170.66 0.35 20 5.53 Friendship network of a blogging community (11)
MESSENGER 1,878,736 4,079,161 0.53 0.78 4.34 15.40 0.09 26 7.42 Instant messenger social network
EMAIL-ALL 234,352 383,111 0.18 0.50 3.27 576.87 0.50 14 4.07 Research organization email network (all addresses) (46)
EMAIL-INOUT 37,803 114,199 0.47 0.82 6.04 165.73 0.58 8 3.74 (all addresses but email has to be sent both ways) (46)
EMAIL-INSIDE 986 16,064 0.90 0.99 32.58 74.66 0.45 7 2.60 (only emails inside the research organization) (46)
EMAIL-ENRON 33,696 180,811 0.61 0.90 10.73 142.36 0.71 13 3.99 Enron email dataset (36)
ANSWERS 488,484 1,240,189 0.45 0.78 5.08 251.78 0.11 22 5.72 Yahoo Answers social network
ANSWERS-1 26,971 91,812 0.56 0.87 6.81 59.17 0.08 16 4.49 Cluster 1 from Yahoo Answers
ANSWERS-2 25,431 65,551 0.48 0.80 5.16 56.57 0.10 15 4.76 Cluster 2 from Yahoo Answers
ANSWERS-3 45,122 165,648 0.53 0.87 7.34 417.83 0.21 15 3.94 Cluster 3 from Yahoo Answers
ANSWERS-4 93,971 266,199 0.49 0.82 5.67 94.48 0.08 16 4.91 Cluster 4 from Yahoo Answers
ANSWERS-5 5,313 11,528 0.41 0.73 4.34 29.55 0.12 14 4.75 Cluster 5 from Yahoo Answers
ANSWERS-6 290,351 613,237 0.40 0.71 4.22 57.16 0.09 22 5.92 Cluster 6 from Yahoo Answers

Information (citation) networks

CIT-PATENTS 3,764,105 16,511,682 0.82 0.96 8.77 21.34 0.09 26 8.15 Citation network of all US patents (45)
CIT-HEP-PH 34,401 420,784 0.96 1.00 24.46 63.50 0.30 14 4.33 Citations between physics (arxiv hep-ph) papers (28)
CIT-HEP-TH 27,400 352,021 0.94 0.99 25.69 106.40 0.33 15 4.20 Citations between physics (arxiv hep-th) papers (28)
BLOG-NAT05-6M 29,150 182,212 0.74 0.96 12.50 342.51 0.24 10 3.40 Blog citation network (6 months of data) (47)
BLOG-NAT06ALL 32,384 315,713 0.87 0.99 19.50 153.08 0.20 18 3.94 Blog citation network (1 year of data) (47)
POST-NAT05-6M 238,305 297,338 0.21 0.34 2.50 39.51 0.13 45 10.34 Blog post citation network (6 months) (47)
POST-NAT06ALL 437,305 565,072 0.22 0.38 2.58 35.54 0.11 54 10.48 Blog post citation network (1 year) (47)

Collaboration networks

ATA-IMDB 883,963 27,473,042 0.87 0.99 62.16 517.40 0.79 15 3.48 IMDB actor collaboration network from Dec 2007
CA-ASTRO-PH 17,903 196,972 0.89 0.98 22.00 65.70 0.67 14 4.21 Co-authorship in astro-ph of arxiv.org (45)
CA-COND-MAT 21,363 91,286 0.81 0.93 8.55 22.47 0.70 15 5.36 Co-authorship in cond-mat category (45)
CA-GR-QC 4,158 13,422 0.64 0.78 6.46 17.98 0.66 17 6.10 Co-authorship in gr-qc category (45)
CA-HEP-PH 11,204 117,619 0.81 0.97 21.00 130.88 0.69 13 4.71 Co-authorship in hep-ph category (45)
CA-HEP-TH 8,638 24,806 0.68 0.85 5.74 12.99 0.58 18 5.96 Co-authorship in hep-th category (45)
CA-DBLP 317,080 1,049,866 0.67 0.84 6.62 21.75 0.73 23 6.75 DBLP co-authorship network (11)

Table S1: Network datasets we analyzed. Statistics of networks we consider: number of nodes $N$; number of edges $E$; fraction of nodes not in whiskers (size of largest biconnected component) $N_b/N$; fraction of edges in biconnected component $E_b/E$; average degree $d = 2E/N$; second-order average degree $\tilde{d}$; average clustering coefficient $C$; diameter $D$; and average path length $\bar{D}$.



Web graphs

WEB-BERKSTAN 319,717 1,542,940 0.57 0.88 9.65 1,067.55 0.32 35 5.66 Web graph of Stanford and UC Berkeley (35)
WEB-GOOGLE 855,802 4,291,352 0.75 0.92 10.03 170.35 0.62 24 6.27 Web graph Google released in 2002 (3)
WEB-NOTREDAME 325,729 1,090,108 0.41 0.76 6.69 280.68 0.47 46 7.22 Web graph of University of Notre Dame (8)
WEB-TREC 1,458,316 6,225,033 0.59 0.78 8.54 682.89 0.68 112 8.58 Web graph of TREC WT10G web corpus (2)

Internet networks

AS-ROUTEVIEWS 6,474 12,572 0.62 0.80 3.88 164.81 0.40 9 3.72 AS from Oregon Exchange BGP Route View (45)
AS-CAIDA 26,389 52,861 0.61 0.81 4.01 281.93 0.33 17 3.86 CAIDA AS Relationships Dataset
AS-SKITTER 1,719,037 12,814,089 0.99 1.00 14.91 9,934.01 0.17 5 3.44 AS from traceroutes run daily in 2005 by Skitter
AS-NEWMAN 22,963 48,436 0.65 0.83 4.22 261.46 0.35 11 3.83 AS graph from Newman (5)
AS-OREGON 13,579 37,448 0.72 0.90 5.52 235.97 0.46 9 3.58 Autonomous systems (1)
GNUTELLA-25 22,663 54,693 0.59 0.83 4.83 10.75 0.01 11 5.57 Gnutella network on March 25 2000 (60)
GNUTELLA-30 36,646 88,303 0.55 0.81 4.82 11.46 0.01 11 5.75 Gnutella P2P network on March 30 2000 (60)
GNUTELLA-31 62,561 147,878 0.54 0.81 4.73 11.60 0.01 11 5.94 Gnutella network on March 31 2000 (60)
EDONKEY 5,792,297 147,829,887 0.93 1.00 51.04 6,139.99 0.08 5 3.66 P2P eDonkey graph for a period of 47 hours in 2004

Bipartite networks

IPTRAFFIC 2,250,498 21,643,497 1.00 1.00 19.23 94,889.05 0.00 5 2.53 IP traffic graph of a single router for 24 hours
ATP-ASTRO-PH 54,498 131,123 0.70 0.87 4.81 16.67 0.00 28 7.78 Authors-to-papers network of astro-ph (47)
ATP-COND-MAT 57,552 104,179 0.65 0.79 3.62 10.54 0.00 31 9.96 Authors-to-papers network of cond-mat (47)
ATP-GR-QC 14,832 22,266 0.47 0.60 3.00 9.72 0.00 35 11.08 Authors-to-papers network of gr-qc (47)
ATP-HEP-PH 47,832 86,434 0.60 0.76 3.61 16.80 0.00 27 8.55 Authors-to-papers network of hep-ph (47)
ATP-HEP-TH 39,986 64,154 0.53 0.68 3.21 13.07 0.00 36 10.74 Authors-to-papers network of hep-th (47)
ATP-DBLP 615,678 944,456 0.49 0.64 3.07 13.61 0.00 48 12.69 DBLP authors-to-papers bipartite network
SPENDING 1,831,540 2,918,920 0.34 0.58 3.19 1,536.35 0.00 26 5.62 Users-to-keywords they bid
HW7 653,260 2,278,448 0.99 0.99 6.98 346.85 0.00 24 6.26 Downsampled advertiser-query bid graph
NETFLIX 497,959 100,480,507 1.00 1.00 403.57 28,432.89 0.00 5 2.31 Users-to-movies they rated. From Netflix prize (4)
QUERYTERMS 13,805,808 17,498,668 0.28 0.41 2.53 14.92 0.00 86 19.81 Users-to-queries they submit to a search engine
CLICKSTREAM 199,308 951,649 0.39 0.87 9.55 430.74 0.00 7 3.83 Users-to-URLs they visited (50)

Biological networks

BIO-PROTEINS 4,626 14,801 0.72 0.91 6.40 24.25 0.12 12 4.24 Yeast protein interaction network (21)
BIO-YEAST 1,458 1,948 0.37 0.51 2.67 7.13 0.14 19 6.89 Yeast protein interaction network data (32)
BIO-YEASTP0.001 353 1,517 0.73 0.93 8.59 20.18 0.57 11 4.33 Yeast protein-protein interaction map (56)
BIO-YEASTP0.01 1,266 8,511 0.79 0.97 13.45 47.73 0.44 12 3.87 Yeast protein-protein interaction map (56)




Nearly low-dimensional networks

ROAD-CA 1,957,027 2,760,388 0.80 0.85 2.82 3.17 0.06 865 310.97 California road network
ROAD-USA 126,146 161,950 0.97 0.98 2.57 2.81 0.03 617 218.55 USA road network (only main roads)
ROAD-PA 1,087,562 1,541,514 0.79 0.85 2.83 3.20 0.06 794 306.89 Pennsylvania road network
ROAD-TX 1,351,137 1,879,201 0.78 0.84 2.78 3.15 0.06 1,064 418.73 Texas road network
POWERGRID 4,941 6,594 0.62 0.69 2.67 3.87 0.11 46 19.07 Power grid of Western States Power Grid (69)
MANI-FACES7K 696 6,979 0.98 0.99 20.05 37.99 0.56 16 5.52 Faces (64x64 grayscale images) (connect 7k closest pairs)
MANI-FACES4K 663 3,465 0.90 0.97 10.45 20.20 0.56 29 8.96 Faces (connect 4k closest pairs)
MANI-FACES2K 551 1,981 0.84 0.94 7.19 12.77 0.54 32 11.07 Faces (connect 2k closest pairs)
MANI-FACESK10 698 6,935 1.00 1.00 19.87 25.32 0.51 6 3.25 Faces (connect every node to its 10 nearest neighbors)
MANI-FACESK3 698 2,091 1.00 1.00 5.99 7.98 0.45 9 4.89 Faces (connect every node to its 3 nearest neighbors)
MANI-FACESK5 698 3,480 1.00 1.00 9.97 12.91 0.48 7 4.03 Faces (connect every node to its 5 nearest neighbors)
MANI-SWISS200K 20,000 200,000 1.00 1.00 20.00 21.08 0.59 103 37.21 Swiss-roll (connect 200k nearest pairs of nodes)
MANI-SWISS100K 19,990 99,979 1.00 1.00 10.00 11.02 0.59 162 58.32 Swiss-roll (connect 100k nearest pairs of nodes)
MANI-SWISS60K 19,042 57,747 0.93 0.96 6.07 7.03 0.59 243 89.15 Swiss-roll (connect 60k nearest pairs of nodes)
MANI-SWISSK10 20,000 199,955 1.00 1.00 20.00 25.38 0.56 10 5.47 Swiss-roll (every node connects to 10 nearest neighbors)
MANI-SWISSK5 20,000 99,990 1.00 1.00 10.00 12.89 0.54 13 8.34 Swiss-roll (every node connects to 5 nearest neighbors)
MANI-SWISSK3 20,000 59,997 1.00 1.00 6.00 7.88 0.50 17 6.89 Swiss-roll (every node connects to 3 nearest neighbors)

IMDB Actor-to-Movie graphs

ATM-IMDB 2,076,978 5,847,693 0.49 0.82 5.63 65.41 0.00 32 6.82 Actors-to-movies graph from IMDB (imdb.com)
IMDB-TOP30 198,430 566,756 0.99 1.00 5.71 18.19 0.00 26 8.32 Actors-to-movies graph heavily preprocessed
IMDB-RAW07 601,481 1,320,616 0.54 0.79 4.39 20.94 0.00 32 8.55 Country clusters were extracted from this graph
IMDB-FRANCE 35,827 74,201 0.51 0.76 4.14 14.62 0.00 20 6.57 Cluster of French movies
IMDB-GERMANY 21,258 42,197 0.56 0.78 3.97 13.69 0.00 34 7.47 German movies (to actors that played in them)
IMDB-INDIA 12,999 25,836 0.57 0.78 3.98 31.55 0.00 19 6.00 Indian movies
IMDB-ITALY 19,189 37,534 0.55 0.77 3.91 11.66 0.00 30 6.91 Italian movies
IMDB-JAPAN 15,042 34,131 0.60 0.82 4.54 16.98 0.00 19 6.81 Japanese movies
IMDB-MEXICO 13,783 36,986 0.64 0.86 5.37 24.15 0.00 19 5.43 Mexican movies
IMDB-SPAIN 15,494 31,313 0.51 0.76 4.04 14.22 0.00 28 6.44 Spanish movies
IMDB-UK 42,133 82,915 0.52 0.76 3.94 15.14 0.00 23 7.04 UK movies
IMDB-USA 241,360 530,494 0.51 0.78 4.40 25.25 0.00 30 7.63 USA movies
IMDB-WGERMANY 12,120 24,117 0.56 0.78 3.98 11.73 0.00 22 6.26 West German movies

Amazon product co-purchasing networks

AMAZON0302 262,111 899,792 0.95 0.97 6.87 11.14 0.43 38 8.85 Amazon products from 2003 03 02 (20)
AMAZON0312 400,727 2,349,869 0.94 0.99 11.73 30.33 0.42 20 6.46 Amazon products from 2003 03 12 (20)
AMAZON0505 410,236 2,439,437 0.94 0.99 11.89 30.93 0.43 22 6.48 Amazon products from 2003 05 05 (20)
AMAZON0601 403,364 2,443,311 0.96 0.99 12.11 30.55 0.43 25 6.42 Amazon products from 2003 06 01 (20)
AMAZONALL 473,315 3,505,519 0.94 0.99 14.81 52.70 0.41 19 5.66 Amazon products (all 4 graphs merged) (20)
AMAZONALLPROD 524,371 1,491,793 0.80 0.91 5.69 11.75 0.35 42 11.18 Products (all products, source+target) (44)
AMAZONSRCPROD 334,863 925,872 0.84 0.91 5.53 11.53 0.43 47 12.11 Products (only source products) (44)




Additional networks

AIRPORTS 500 2,980 371 2,822 11.92 53.78 0.73 7 2.98 Commercial airports in the United States (22)
ASTRO-PH 14,845 119,652 12,672 116,272 16.12 45.46 0.72 14 4.79 Collaborations in astrophysics (5)
CELEGANSNEURAL 297 2,148 282 2,133 14.46 26.05 0.31 5 2.41 Neural network of the nematode C. elegans (14)
CENTRALITY-LITERATURE 98 518 87 507 10.57 22.30 0.55 4 2.16 Centrality Literature (14)
COMPANY 16 37 14 35 4.63 5.54 0.51 5 2.00 The Company dataset (14)
COND-MAT 31,163 120,029 21,772 106,516 7.70 21.74 0.72 16 4.54 Condensed matter collaboration network (5)
CSPHD 1,268 1,656 432 820 2.61 7.93 0.03 15 5.45 Ties between Ph.D. students and advisors in CS (14)
DINING 26 42 26 42 3.23 3.81 0.12 6 2.83 Dining-table partners in a dormitory (14)
DOLPHINS 62 159 53 150 5.13 6.81 0.30 8 3.25 Frequent associations between 62 dolphins (14) (48)
DRUG-NET 193 273 108 184 2.83 4.25 0.19 18 7.12 Drug Net dataset (14)
FLYING-TEAMS 48 284 48 284 11.83 13.25 0.39 3 1.84 Ties between cadet pilots (14)
FOOTBALL 115 613 115 613 10.66 10.73 0.40 4 2.49 NCAA football teams (5)
GLEISER-COMICS-1 39 86 12 44 4.41 8.74 0.68 6 2.56 Collaborations in Marvel Universe comic books (14)
GLEISER-COMICS-2 54 127 14 37 4.70 8.26 0.67 7 3.08 Collaborations in Marvel Universe comic books (14)
GLEISER-COMICS-3 86 197 14 50 4.58 8.38 0.77 9 3.40 Collaborations in Marvel Universe comic books (14)
GLEISER-COMICS-4 113 249 29 89 4.41 8.54 0.73 9 4.10 Collaborations in Marvel Universe comic books (14)
KARATE 34 78 28 67 4.59 7.77 0.59 5 2.46 Zachary’s Karate club dataset (70)
LESMIS 77 254 54 227 6.60 12.06 0.74 5 2.52 Network of Les Miserables (5)
MEXICAN-POWER 31 106 31 106 6.84 8.53 0.51 4 1.93 Ties between Mexican political elite (14)
MODMATH 30 61 30 61 4.07 4.80 0.23 5 2.49 Diffusion of a new mathematics method in the 50s (14)
MONKS-SAMNPR 18 35 15 35 3.89 5.29 0.23 3 1.03 Sampson Monastery, blame relation (61)
MONKS-SAMPES 18 45 17 44 5.00 5.76 0.53 4 1.84 Sampson Monastery, disesteem relation (61)
MONKS-SAMPIN 18 41 17 40 4.56 5.51 0.51 4 1.87 Sampson Monastery, positive influence relation (61)
MONKS-SAMPLK2 18 42 18 42 4.67 5.24 0.39 4 1.80 Sampson Monastery, likes relation (61)
MONKS-SAMPLK3 18 41 18 41 4.56 4.78 0.47 4 2.08 Sampson Monastery, likes relation (61)
MONKS-SAMPPR 18 32 12 23 3.56 4.41 0.58 5 2.54 Sampson Monastery social network (61)
NETSCIENCE 1,589 2,742 134 372 3.45 6.94 0.88 17 0.37 Collaborations in network science (5)
POLBLOGS 1,222 16,715 1,081 16,573 27.36 81.26 0.36 8 2.72 Political blogs (6)
POLBOOKS 105 441 105 441 8.40 11.93 0.49 7 3.05 Copurchasing of books on U.S. politics (14)
PRISON 67 142 62 137 4.24 5.25 0.33 7 3.31 Social network between prisoners (14)
50 WOMEN01 33 59 17 31 3.58 4.49 0.49 7 3.16 School friendships in the West of Scotland (14)
50 WOMEN02 48 81 18 36 3.38 4.00 0.45 10 4.40 School friendships in the West of Scotland (14)
SANJUANSUR 75 144 72 141 3.84 4.69 0.32 9 3.77 Visiting relations in haciendas in Costa Rica (14)
SCOTLAND 142 324 126 308 4.56 10.61 0.08 8 3.14 Corporate interlocks in Scotland (14)
STRIKE 24 38 10 16 3.17 3.74 0.46 6 2.85 Strike in a wood-processing facility (14)
TANZANIA-4-2006 17 30 14 27 3.53 4.57 0.63 4 2.14 Tanzania Embassy Bombing - 2006 dataset (14)
TARO 22 39 22 39 3.55 3.79 0.34 5 2.40 Gift giving in Papuan village (14)
WEST-BANK-18-36 14 13 2 1 1.86 2.69 0.00 7 3.06 Six major terrorist groups (14)
WIKI USERTALK 2,804,912 5,779,101 962,467 3,948,004 4.12 7,449.21 0.22 11 3.75 Wikipedia talk graph
WORLD-TRADE 77 848 77 848 22.03 33.99 0.76 3 1.70 World trading of metal (14)



(LINKEDIN); a friendship network of a LiveJournal blogging community (LIVEJOURNAL01); and a who-trusts-whom network of Epinions (EPINIONS). It also includes email networks from Enron (EMAIL-ENRON) and from a large European research organization. For the latter we generated three networks: EMAIL-INSIDE uses only the communication inside the organization; EMAIL-INOUT also adds external email addresses where email has been sent both ways; and EMAIL-ALL adds all communication inside the organization and to the outside world. Also included in the class of social networks are networks that are not the central focus of the websites from which they come, but which instead serve as a tool for people to share information more easily. For example, we have: the network of the social bookmarking site Delicious (DELICIOUS); the network of the Flickr photo sharing website (FLICKR); and a network from the Yahoo! Answers question answering website (ANSWERS). In all these networks, a node refers to an individual and an edge indicates that one person has some sort of interaction with another person, e.g., one person subscribes to their neighbor’s bookmarks or photos, or answers their questions.

Information and citation networks: The class of Information/citation networks contains several different citation networks. It contains two citation networks of physics papers on arxiv.org (CIT-HEP-TH and CIT-HEP-PH), and a network of citations of US patents (CIT-PATENTS). (These paper-to-paper citation networks are to be distinguished from scientific collaboration networks and author-to-paper bipartite networks, as described below.) It also contains two types of blog citation networks. In the so-called post networks, nodes are posts and edges represent hyperlinks between blog posts (POST-NAT05-6M and POST-NAT06ALL). On the other hand, the so-called blog network is the blog-level aggregation of the same data, i.e., there is a link between two blogs if there is a post in the first blog that links to a post in the second blog (BLOG-NAT05-6M and BLOG-NAT06ALL).

Collaboration networks: The class of collaboration networks contains academic collaboration (i.e., co-authorship) networks between physicists from various categories in arxiv.org (CA-ASTRO-PH, etc.) and between authors in computer science (CA-DBLP). It also contains a network of collaborations between pairs of actors in IMDB (ATA-IMDB), i.e., there is an edge connecting a pair of actors if they appeared in the same movie. (Again, this should be distinguished from actor-to-movie bipartite networks, as described below.)

Web graphs: The class of Web graph networks includes four different web graphs in which nodes represent web pages and edges represent hyperlinks between those pages. Networks were obtained from Google (WEB-GOOGLE), the University of Notre Dame (WEB-NOTREDAME), TREC (WEB-TREC), and Stanford University and UC Berkeley (WEB-BERKSTAN).

Internet networks: The class of Internet networks consists of various autonomous systems networks obtained from different sources, as well as Gnutella and eDonkey peer-to-peer file sharing networks.

Bipartite networks: The class of Bipartite networks is particularly diverse and includes: authors-to-papers graphs from both computer science (ATP-DBLP) and physics (ATP-ASTRO-PH, etc.); a network representing users and the URLs they visited (CLICKSTREAM); a network representing users and the movies they rated (NETFLIX); and a users-to-queries network representing query terms that users typed into a search engine (QUERYTERMS). (We also have analyzed several bipartite actors-to-movies networks extracted from the IMDB database, which we have listed separately below.)

Biological networks: The class of Biological networks includes protein-protein interaction networks of yeast obtained from various sources.

Low dimensional grid-like networks: The class of Low-dimensional networks consists of graphs constructed from road (ROAD-CA, etc.) or power grid (POWERGRID) connections and as such might be expected to “live” on a two-dimensional surface in a way that all of the other networks do not. We also added a “swiss roll” network, a 2-dimensional manifold embedded in 3 dimensions, and a “Faces” dataset where each point is a 64 by 64 gray-scale image of a face (embedded in 4,096-dimensional space) and where we connected the faces that were most similar (using the Euclidean distance).

IMDB, Yahoo! Answers and Amazon networks: Finally, we have networks from IMDB, Amazon, and Yahoo! Answers, and for each of these we have separately analyzed subnetworks. The IMDB networks consist of actor-to-movie links, and we include the full network as well as subnetworks associated with individual countries based on the country of production. For the Amazon networks, recall that Amazon sells a variety of products, and for each item $A$ one may compile a list of up to ten other items most frequently purchased by buyers of $A$. This information can be presented as a directed network in which vertices represent items and there is an edge from item $A$ to another item $B$ if $B$ was frequently purchased by buyers of $A$. We consider the network as undirected. We use five networks from a study of Clauset et al. (20), and two networks from the viral marketing study of Leskovec et al. (44). Finally, for the Yahoo! Answers networks, we observe several deep cuts at large size scales, and so in addition to the full network, we analyze the six most well-connected subnetworks.

In addition to providing a brief description of each network, Tables S1, S2, S3, and S4 show the following statistics: the number of nodes $N$; the number of edges $E$; the fraction of nodes in the largest biconnected component $N_b/N$; the fraction of edges in the largest biconnected component $E_b/E$; the average degree $d = 2E/N$; the empirical second-order average degree (19) $\tilde{d}$; the average clustering coefficient (69) $C$; the estimated diameter $D$; and the estimated average path length $\bar{D}$. (The diameter was estimated using the following algorithm: pick a random node, find the farthest node $X$ (via shortest path); move to $X$ and find the farthest node from $X$; iterate this procedure until the distance to the farthest node no longer increases. The average path length was estimated based on 10,000 randomly sampled nodes.)
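The diameter and path-length estimates described above can be sketched as follows. This is a minimal illustration and not the code used for the tables: the graph is assumed to be connected, undirected, unweighted, and stored as a dict of neighbor lists, and the function names are hypothetical. In particular, the text does not say how the 10,000 sampled nodes are used; the sketch assumes that each sampled node serves as a breadth-first-search source and that distances to all other nodes are averaged.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def estimate_diameter(adj, rng):
    """Start at a random node and keep jumping to the farthest node until
    the farthest distance stops increasing (a lower bound on the diameter)."""
    node = rng.choice(sorted(adj))
    best = -1
    while True:
        dist = bfs_distances(adj, node)
        farthest = max(dist, key=dist.get)
        if dist[farthest] <= best:
            return best
        best = dist[farthest]
        node = farthest

def estimate_avg_path_length(adj, rng, samples=10_000):
    """Average distance from sampled source nodes to all other nodes
    (assumed reading of the sampling step; needs at least two nodes)."""
    nodes = sorted(adj)
    sources = rng.sample(nodes, min(samples, len(nodes)))
    total = pairs = 0
    for s in sources:
        dist = bfs_distances(adj, s)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy example: a path graph on 10 nodes, whose true diameter is 9.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
rng = random.Random(0)
print(estimate_diameter(path, rng))  # → 9
print(estimate_avg_path_length(path, rng))
```

Each sweep costs one breadth-first search, so the diameter estimate remains feasible on graphs with millions of nodes, at the price of returning only a lower bound.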

In all cases, we consider the network as undirected, and we extract and analyze the largest connected component. The sizes of these networks range from about 5,000 nodes up to nearly 14 million nodes, and from about 6,000 edges up to more than 100 million edges. All of the networks are quite sparse: their densities range from an average degree of about 2.5 for the blog post network, up to an average degree of about 400 in the network of movie ratings from Netflix, and most of the other networks, including the purely social networks, have average degree around 10 (median average degree of 6). In many cases, we examined several versions of a given network. For example, we considered the entire IMDB actor-to-movie network, as well as sub-pieces of it corresponding to different language and country groups. In total, we have examined over 100 large real-world social and information networks, making this, to our knowledge, the largest and most comprehensive study of such networks.

S2 The Network Community Profile (NCP) of small and large networks

In this section, we discuss in greater detail the network community profile (NCP), which measures the quality of network communities at different size scales. We start in Section S2.1 by defining conductance and describing algorithms for finding low-conductance cuts, and we follow this in Section S2.2 by introducing the NCP. Then, in Section S2.3, we present the NCP for several examples of networks which inform people’s intuition and for which the NCP behaves in a characteristic manner. Then, in Section S2.4 we present the NCP for a wide range of large real world social and information networks. We will see that in such networks the NCP behaves in a qualitatively different manner.

S2.1 Conductance and algorithms for finding low-conductance cuts

Let $G = (V, E)$ denote a graph. The conductance $\phi$ of a set of nodes $S \subset V$ (where $S$ is assumed to contain no more than half of all the nodes) is defined as follows. Let $v$ be the sum of degrees of nodes in $S$, and let $s$ be the number of edges with one endpoint in $S$ and one endpoint in $\bar{S}$, where $\bar{S}$ denotes the complement of $S$. Then, the conductance of $S$ is $\phi = s/v$, or equivalently $\phi = s/(s + 2e)$, where $e$ is the number of edges with both endpoints in $S$. More formally:

Definition 1. Given a graph $G$ with adjacency matrix $A$, the conductance of a set of nodes $S$ is defined as:

\[
\phi(S) = \frac{\sum_{i \in S,\, j \notin S} A_{ij}}{\min\{A(S), A(\bar{S})\}}, \tag{1}
\]

where $A(S) = \sum_{i \in S} \sum_{j \in V} A_{ij}$, or equivalently $A(S) = \sum_{i \in S} d(i)$, where $d(i)$ is the degree of node $i$ in $G$. Moreover, in this case, the conductance of the graph $G$ is:

\[
\phi_G = \min_{S \subset V} \phi(S). \tag{2}
\]

Thus, the conductance of a set provides a measure for the quality of the cut $(S, \bar{S})$, or relatedly the goodness of a community $S$. Indeed, it is often noted that communities should be thought of as sets of nodes with more and/or better intra-connections than inter-connections; see Figure S1 for an illustration. When interested in detecting communities and evaluating their quality, we prefer sets with small conductances, i.e., sets that are densely linked inside and sparsely linked to the outside. There are of course many other density-based measures that have been used to partition a graph into a set of communities (27, 62, 68). One that deserves particular mention is modularity (54, 55); see Section S6 for a more detailed discussion of modularity.
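As a concrete check of equation (1), the following sketch computes the conductance of a node set in a small unweighted graph. The dict-of-neighbor-lists representation, the function name, and the toy graph (two triangles joined by a single edge) are assumptions made here for illustration only.

```python
def conductance(adj, S):
    """Conductance of node set S as in equation (1), for an undirected,
    unweighted graph given as a dict mapping each node to its neighbors."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)  # edges leaving S
    vol_S = sum(len(adj[u]) for u in S)                    # A(S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    denom = min(vol_S, vol_rest)
    if denom == 0:
        raise ValueError("S and its complement must both have positive volume")
    return cut / denom

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
print(conductance(two_triangles, {0, 1, 2}))  # one cut edge over volume 7
```

For the set {0, 1, 2}, one edge is cut and three edges lie inside, so s/(s + 2e) = 1/(1 + 6) = 1/7, which agrees with s/v where v = 7 is the sum of the degrees.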

In order to scale up to very large networks, we have extensively used the Local Spectral Algorithm of Andersen, Chung, and Lang (9) to find node sets of low conductance, i.e., good communities, around a seed node. This algorithm takes as input two parameters (the seed node and a parameter $\epsilon$ that intuitively controls the locality of the computation) and it outputs a set of nodes. Local spectral methods were introduced by Spielman and Teng (9, 65), and they have roughly the same kind of quadratic approximation guarantees as the global spectral method, but their computational cost is proportional to the size of the obtained piece (16–18).
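To give a feel for how a local spectral method of this kind operates, here is a minimal sketch of a push-based approximate personalized PageRank computation followed by a sweep cut. It is not the implementation used in this work: the teleportation parameter `alpha`, its default value, the function names, and the toy graph are all assumptions made for illustration, and the graph is assumed to be simple, undirected, and free of isolated nodes.

```python
def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push procedure: returns an approximate personalized PageRank vector p
    for `seed`, stopping once every residual r[u] is below eps * deg(u)."""
    p, r = {}, {seed: 1.0}
    stack = [seed]
    while stack:
        u = stack.pop()
        deg = len(adj[u])
        mass = r.get(u, 0.0)
        if mass < eps * deg:
            continue  # stale entry: residual already below the threshold
        p[u] = p.get(u, 0.0) + alpha * mass
        r[u] = (1 - alpha) * mass / 2            # lazy walk keeps half
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + (1 - alpha) * mass / (2 * deg)
        stack.extend(v for v in adj[u] + [u] if r[v] >= eps * len(adj[v]))
    return p

def sweep_cut(adj, p):
    """Order nodes by p[u] / deg(u) and return the prefix set with the
    smallest conductance, together with that conductance."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(p, key=lambda u: p[u] / len(adj[u]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for size, u in enumerate(order, start=1):
        inside = sum(1 for v in adj[u] if v in in_set)
        in_set.add(u)
        vol += len(adj[u])
        cut += len(adj[u]) - 2 * inside  # edges to the set stop being cut
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_size = cut / denom, size
    return set(order[:best_size]), best_phi

def two_cliques(n=5):
    """Toy graph: two n-cliques joined by the single edge (n-1)-(n)."""
    adj = {i: [] for i in range(2 * n)}
    for block in (range(n), range(n, 2 * n)):
        for u in block:
            adj[u].extend(v for v in block if v != u)
    adj[n - 1].append(n)
    adj[n].append(n - 1)
    return adj

adj = two_cliques()
community, phi = sweep_cut(adj, approximate_ppr(adj, seed=0))
print(sorted(community), phi)
```

Only nodes that receive residual mass are ever touched, which is why the cost of such a method scales with the size of the output set rather than with the size of the whole graph. Smaller values of `eps` let the computation spread farther from the seed.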

S2.2 Definition of the network community profile

In order to more finely resolve community structure in large networks, we introduce the network community profile (NCP). Intuitively, the NCP measures the quality of the best possible community in a large network, as a function of the community size. Formally, we may define it as the conductance value of the best conductance set of cardinality $k$ in the entire network, as a function of $k$.


Figure S1: (a) Three communities: caricature of the traditional view of communities as being sets of nodes with more and/or better intra-connections than inter-connections. (b) Conductance bottleneck: a graph with its minimum conductance bottleneck illustrated.

Definition 2. Given a graph $G$ with adjacency matrix $A$, the network community profile (NCP) plots $\Phi(k)$ as a function of $k$, where

\[
\Phi(k) = \min_{S \subset V,\, |S| = k} \phi(S), \tag{3}
\]

where $|S|$ denotes the cardinality of the set $S$, and where the conductance $\phi(S)$ of $S$ is given by equation (1).

Since this quantity is intractable to compute, we will employ well-studied approximation algorithms for the Minimum Conductance Cut Problem to approximate it. In particular, operationally we will use several natural heuristics based on approximation algorithms to do graph partitioning in order to compute different approximations to the NCP. We will use the Local Spectral Algorithm of Andersen, Chung, and Lang (9) most extensively; this procedure returns sets that are somewhat more “compact” or “smoothed” or “regularized” than other methods. In Section S5, we will describe other methods, including Metis+MQI, i.e., the graph partitioning package Metis (34) followed by the flow-based post-processing procedure MQI (41).
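For intuition only, equation (3) can be evaluated by exhaustive search on a toy graph. The sketch below enumerates every node subset, so it is usable only for a handful of nodes and has nothing to do with the approximation algorithms used in this work. The function names and the toy graph are assumptions, and the range of $k$ is restricted to at most half of the nodes, following the convention of Section S2.1.

```python
from itertools import combinations

def conductance(adj, S):
    """Conductance of node set S as in equation (1)."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    return cut / min(vol_S, vol_rest)

def ncp_brute_force(adj):
    """Exact Phi(k) for k = 1 .. n/2 by enumerating all k-subsets.
    Exponential time; assumes a graph without isolated nodes."""
    nodes = sorted(adj)
    return {
        k: min(conductance(adj, S) for S in combinations(nodes, k))
        for k in range(1, len(nodes) // 2 + 1)
    }

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
for k, phi in ncp_brute_force(two_triangles).items():
    print(k, phi)
```

On this graph the profile drops from 1 at k = 1 to 1/2 at k = 2 and to 1/7 at k = 3, where a whole triangle is cut off by the single bridging edge.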

S2.3 Community profile plots for expander, low-dimensional, and small social networks

The NCP behaves in a characteristic manner for graphs that are “well-embeddable” into an underlying low-dimensional geometric structure. To illustrate this, consider Figure S2. In Figure 2(a), we show the results for a 1-dimensional chain, a 2-dimensional grid, and a 3-dimensional cube. In each case, the NCP is steadily downward sloping as a function of the number of nodes in the smaller cluster. Moreover, the curves are straight lines with a slope equal to −1/d, where d is the dimensionality of the underlying grids. In particular, as the underlying dimension increases, the slope of the NCP gets less steep. This is simply a manifestation of the isoperimetric (i.e., surface area to volume) phenomenon: for a grid, the “best” cut is obtained by cutting out a set of adjacent nodes, in which case the surface area (number of edges cut) increases as O(m^{d−1}), while the volume (number of vertices/edges inside the cluster) increases as O(m^d).
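The slope of −1/d follows from a short scaling argument. The sketch below assumes that the best set of size k is a d-dimensional block of side length m (our reading of the symbol m in the text) and ignores constant factors and boundary effects.

```latex
\begin{align*}
k &\sim m^{d}, \\
\phi(S) &\sim \frac{\text{surface area}}{\text{volume}}
        \sim \frac{m^{d-1}}{m^{d}} = m^{-1}, \\
\Phi(k) &\sim k^{-1/d}
\quad\Longrightarrow\quad
\log \Phi(k) \approx -\frac{1}{d}\,\log k + \text{const}.
\end{align*}
```

On log-log axes this is a straight line of slope −1/d, i.e., approximately −1, −0.5, and −0.33 for d = 1, 2, and 3.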

This qualitative phenomenon of a steadily downward sloping NCP is quite robust for networks that “live” in a low-dimensional structure, e.g., on a manifold or the surface of the earth. For example, Figure 2(b) shows the NCP for a power grid network of Western States Power Grid (69), and Figure 2(c) shows the NCP for a road network of California. These two networks have very different sizes (the power grid network has 4,941 nodes and 6,594 edges, and the road network has 1,957,027 nodes and 2,760,388 edges), and they arise in very different application domains. In both cases, however, we see a predominantly downward sloping NCP, very similar to the profile of a simple 2-dimensional grid. Moreover, empirically we observe that minima in the NCP correspond to community-like sets, which are occasionally nested. This may correspond to hierarchical community organization. For example, the nodes giving the dip at k = 19 are included in the nodes giving the dip at k = 883, while dips at k = 94 and k = 105 are both included in the dip at k = 262.

In a similar manner, Figure 2(d) shows the profile plot for a graph generated from a “swiss roll” dataset, which is commonly examined in the manifold and machine learning literature (67). In this case, we still observe a downward sloping NCP that corresponds to the internal dimensionality of the manifold (2 in this case). Finally, Figures 2(e) and 2(f) show NCPs for two graphs that are very good expanders. The first is a G_nm graph with 100,000 nodes and a number of edges such that the average degree is 4, 6, and 8. The second is a constant degree expander: to make one with degree d, we take the union of d disjoint but otherwise random complete matchings, and we have plotted the results for d = 4, 6, 8. In both of these cases, the NCP is roughly flat, as we also observed in Figure 2(a) for a clique; this is to be expected since the minimum conductance cut in the entire graph cannot be too small for a good expander (31).
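The union-of-matchings construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: disjointness is enforced here by simple rejection sampling (a matching that reuses an existing edge is redrawn), and the function name, the seed, and the retry limit are hypothetical.

```python
import random


def random_matching_expander(n, d, seed=None, max_tries=10_000):
    """Union of d edge-disjoint random perfect matchings on n nodes."""
    if n % 2:
        raise ValueError("n must be even for a complete matching")
    rng = random.Random(seed)
    nodes = list(range(n))
    edges = set()
    for _ in range(d):
        for _ in range(max_tries):
            rng.shuffle(nodes)
            # Pair consecutive nodes of the shuffled order.
            matching = {frozenset(nodes[i:i + 2]) for i in range(0, n, 2)}
            if edges.isdisjoint(matching):  # keep matchings edge-disjoint
                edges |= matching
                break
        else:
            raise RuntimeError("could not draw a disjoint matching")
    return edges


edges = random_matching_expander(100, 4, seed=1)
degrees = [0] * 100
for edge in edges:
    for node in edge:
        degrees[node] += 1
print(len(edges), set(degrees))
```

Every node ends up with degree exactly d, since each matching contributes one edge per node and no edge is repeated.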

Somewhat surprisingly (especially when compared with large networks in Section S2.4), a steadily downward sloping NCP is seen for small social networks that have been extensively studied in validating community detection algorithms. Several examples are shown in Figure S3. For these networks, the interpretation is similar to that for the low-dimensional networks: the downward slope indicates that as potential communities get larger and larger, there are relatively more intra-edges than inter-edges; and empirically we observe that local minima in the NCP correspond to sets of nodes that are plausible communities. Consider, e.g., Zachary’s karate club (70) network (ZACHARYKARATE), an extensively-analyzed social network (33, 52, 54). The network has 34 nodes, each of which represents a member of a karate club, and 78 edges, each of which represents a friendship tie between two members. Figure 3(a) depicts the karate club network, and Figure 3(b) shows its NCP. There are two local minima in the plot: the first dip at k = 5 corresponds to Cut A, and the second dip at k = 17 corresponds to Cut B. Note that Cut B, which separates the graph roughly in half, has a better conductance value than Cut A. This corresponds with the intuition about the NCP derived from studying low-dimensional graphs. Note also that the karate network corresponds well with the intuitive notion of a community, where nodes of the community are densely linked among themselves and there are few edges between nodes of different communities.

In a similar manner: Figure 3(c) shows a social network (with 62 nodes and 159 edges) of interactions within a group of dolphins (48); Figure 3(e) shows a social network of monks (with 18 nodes representing individual monks and 41 edges representing social ties between pairs of monks) in a cloister (61); and Figure 3(g) depicts Newman’s network (with 914 collaborations between 379 researchers) of scientists who conduct research on networks (55). For each network, the NCP exhibits a downward

[Figure S2 panels, each plotting Φ (conductance) against k (number of nodes in the cluster) on logarithmic axes. (a) Several low-dimensional meshes: Clique, −1/d ≈ 0; Cube, −1/d ≈ −0.33; Grid, −1/d ≈ −0.50; Chain, −1/d ≈ −1.0. (b) POWERGRID. (c) ROAD-CA. (d) Manifold. (e) Expander: dense G_nm graph, with average degree 4, 6, and 8. (f) Expander: union of matchings, with degree 4, 6, and 8.]

Figure S2: Network community profile plots for expander-like graphs and several networks that “live” in low-dimensional spaces. (2(a)) A large clique graph, a cube (3d mesh), a grid (2d mesh) and a chain (line). Note that the slope of the community profile plot directly corresponds to the dimensionality of the graph. (2(b)) and (2(c)) Two networks on the Earth’s surface that are thus reasonably well-embeddable in two dimensions. (2(d)) A 2d “swiss roll” manifold embedded in 3 dimensions, where we connected every point to its 10 nearest neighbors. (2(e)) and (2(f)) Two networks that are very good expanders.

[Figure S3 panels; the profile plots show Φ (conductance) on logarithmic axes. (a) Zachary’s karate club network and (b) its community profile plot. (c) Dolphins social network and (d) its community profile plot. (e) Monks social network and (f) its community profile plot. (g) Network science network and (h) its community profile plot.]

Figure S3: Depiction of several small social networks that are common test sets for community detection algorithms and their network community profile plots. (3(a)–3(b)) Zachary’s karate club network. (3(c)–3(d)) A network of dolphins. (3(e)–3(f)) A network of monks. (3(g)–3(h)) A network of researchers researching networks.


trend, and it has local minima at cluster sizes that correspond to good communities: the minimum for the dolphins network (Figure 3(d)) corresponds to the separation of the network into two communities denoted with different shape and color of the nodes (gray circles versus red squares); the minimum of the monk network (Figure 3(f)) corresponds to the split between the 7 Turks (red squares) and the so-called loyal opposition (gray circles) (61); and empirically both local minima and the global minimum in the network science network (Figure 3(h)) correspond to plausible communities.

S2.4 Community profile plots for large social and information networks

We have examined NCPs for each of the networks listed in Tables S1, S2, and S3. In the main text, we presented NCPs for six of these networks. (These six networks were chosen to be representative of the wide range of networks we have examined.) See Figures S4, S5, and S6 below for the NCPs of other networks listed in Tables S1, S2, and S3 and for a discussion of them. The red curves plot the results of the Local Spectral Algorithm on the specified network. In addition, the black curves plot the results of the Local Spectral Algorithm applied to a rewired version of the same network, i.e., to a random graph conditioned on the same degree distribution as the original network. (We obtained such random graphs by starting with the original network and then randomly selecting pairs of edges and rewiring the endpoints. By doing the rewiring long enough, we obtain a random graph that has the same degree sequence as the original network (49).) The most striking feature of these plots is that the NCP of the real networks is steadily increasing for nearly its entire range. Note that both axes of these figures are logarithmic, and thus the upward trend of the NCP is over a wide range of size scales.
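The rewiring step can be illustrated with a short degree-preserving edge-swap sketch. The text does not specify how self-loops and repeated edges are handled, so rejecting such swaps is an assumption made here, as are the function name, the attempt limit, and the small test graph.

```python
import random
from collections import Counter


def rewire(edges, n_swaps, seed=None):
    """Degree-preserving rewiring of a simple undirected graph by edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # choose one of the two possible pairings
            c, d = d, c
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if len(new1) < 2 or len(new2) < 2:
            continue  # assumption: reject swaps that create a self-loop
        if new1 in present or new2 in present or new1 == new2:
            continue  # assumption: reject swaps that create a repeated edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


# Hypothetical test graph: a 12-cycle plus chords, every node of degree 4.
original = [(i, (i + 1) % 12) for i in range(12)]
original += [(i, (i + 3) % 12) for i in range(12)]
rewired = rewire(original, 30, seed=0)
before = Counter(u for e in original for u in e)
after = Counter(u for e in rewired for u in e)
print(before == after)
```

Each accepted swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), so every node keeps its degree while the connectivity pattern is randomized.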

In the first two rows of Figure S4, we have several examples of purely Social networks and two email networks, in the third row we have patent and blog Information/citation networks, and in the final row we have three examples of actor and author Collaboration networks. In Figure S5, we see three examples each of Web graphs, Internet networks, Bipartite affiliation networks, and Biological networks. Finally, in the first row of Figure S6, we see Low-dimensional networks, including two road networks and a manifold network; in the second row, we have an IMDB Actor-to-Movie graph and two subgraphs induced by restricting to individual countries; in the third row, we see three Amazon product co-purchasing networks; and in the final row we see a Yahoo! Answers network and two subgraphs that are large good conductance cuts from the full network.

These network datasets are drawn from a wide range of areas, and these graphs contain a wealth of information, a full analysis of which is well beyond the scope of the paper. Note, however, that the general trend of an upward-sloping NCP manifests itself in nearly every network. We have observed qualitatively similar results in nearly every large social and information network we have examined. Qualitative observations are consistent across the range of network sizes, densities, and different domains from which the networks are drawn. Intuitively, the upward trend in the NCP means that separating large clusters from the rest of the network is especially expensive. It suggests that larger and larger clusters are “blended in” more and more with the rest of the network. The interpretation we draw, based on these data and data presented in subsequent sections, is that, if a density-based concept such as conductance captures our intuitive notion of community goodness and if we model large networks with interaction graphs, then the best possible communities get less and less community-like as they grow in size.

The IMDB-RAW07 network is interesting in that its NCP does not increase much (at least not the version computed by the Local Spectral Algorithm) and we clearly observe large sets with good conductance values. Upon examination, many of the large good conductance cuts seem to be associated with

[Figure S4 panels, each plotting Φ (conductance) against cluster size on logarithmic axes. Social networks: LINKEDIN, MESSENGER, DELICIOUS, FLICKR, EMAIL-INOUT, EMAIL-ENRON. Information networks (citation and blog networks): PATENTS, BLOG-NAT06ALL, POST-NAT06ALL. Collaboration networks: ATA-IMDB, CA-ASTRO-PH, CA-HEP-PH.]

Figure S4: [Best viewed in color.] Community profile plots of networks from Table S1. Red curves plot the results of the Local Spectral Algorithm on the specified network, and black curves plot the results from a randomly rewired version of the same network. In nearly every case, the “best” communities are quite small and the NCP steadily increases for nearly its entire range.

[Figure S5 panels, each plotting Φ (conductance) against cluster size on logarithmic axes. Web graphs: WEB-BERKSTAN, WEB-NOTREDAME, WEB-TREC. Internet networks: AS-NEWMAN, GNUTELLA-25, AS-OREGON. Bipartite affiliation networks: ATP-COND-MAT, ATP-HEP-TH, CLICKSTREAM. Biological networks: BIO-PROTEINS, BIO-YEAST, BIO-YEASTP0.01.]

Figure S5: [Best viewed in color.] Community profile plots of networks from Table S2.

[Figure S6 panels, each plotting Φ (conductance) against cluster size on logarithmic axes, with curves for the original network and the rewired network. Low-dimensional networks: ROAD-USA, ROAD-PA, MANI-FACESK5. IMDB Actor-to-Movie graphs: IMDB-RAW07, IMDB-MEXICO, IMDB-WGERMANY. Amazon product co-purchasing networks: AMAZONALLPROD, AMAZON0302, AMAZONALL. Yahoo Answers networks: ANSWERS, ANSWERS-5, ANSWERS-6.]

Figure S6: [Best viewed in color.] Community profile plots of networks from Table S3, as well as ANSWERS and two sub-pieces of ANSWERS.


different language groups. Two things should be noted. First, and not surprisingly, in this network and others, we have observed that there is some sensitivity to how the data are prepared. For example, we obtain somewhat stronger communities if ambiguous nodes (and there are a lot of ambiguous nodes in network datasets with millions of nodes) are removed than if, e.g., they are assigned to a country based on a voting mechanism or some other heuristic. A full analysis of these data preparation issues is beyond the scope of this paper, but our overall conclusions seem to hold independent of the preparation details. Second, if we examine individual countries (two representative examples are shown), then we see substantially less structure at large size scales.

The Yahoo! Answers social network (see ANSWERS) also has several large cuts with good conductance value; in fact, the best cut in the network has more than 10^5 nodes. (It is likely that exogenous factors are responsible for these large deep cuts.) Using standard graph partitioning procedures, we obtained four large disjoint clusters consisting of ca. 5,300, 25,400, 27,000, and 290,000 nodes, respectively, corresponding to the four dips (two of which visually overlap) in the NCP. We then examined the community profile plots for each of these pieces. The two representative examples that we show clearly indicate an NCP that is much more like those of the other network datasets we have examined.

Finally, with respect to the NCP of the rewired networks, note that it first slightly decreases but then increases and flattens out. Several general points should be noted. First, typically, the original network has considerably more structure, i.e., deeper/better cuts, than its rewired version, even up to the largest size scales. That is, we observe significantly more structure than would be seen, for example, in a random graph with the same degree sequence. Second, relative to the original network, the “best” community in the rewired graph, i.e., the global minimum of the conductance curve, shifts upward and towards the left. This means that in rewired networks the best conductance clusters get smaller and have worse conductance scores. Third, sets at and near the minimum are small trees that are connected to the core of the random graph by a single edge. Fourth, after the small dip at a very small size scale (≈ 10 nodes), the NCP increases to a high level rather quickly. This is due to the absence of structure in the core. Finally, also note that the variance in the rewired version of the NCP (data not shown) is not much larger than the width of the curve in the figure.

S3 The core, the periphery, and the size of the best communities

In this section, we discuss in greater detail structural properties of networks related to the nested core-periphery structure that are responsible for the empirical observations we have made.

S3.1 The core and the periphery of a network

In nearly every network we have examined, there is a substantial fraction of nodes that are barely connected to the main part of the network, i.e., that are part of a small cluster of ca. 10 to 100 nodes that are attached to the remainder of the network via one or a small number of edges. In particular, a large fraction of the network is made out of nodes that are not in the biconnected core. (We are slightly abusing standard terminology by using the term bi-connectivity to mean 2-edge-connectivity. We are running the classic DFS-based bi-connectivity algorithm, which identifies both bridge edges and articulation nodes, but then we are only knocking out the bridge edges, not the articulation nodes, so we end up with 2-edge-connected pieces.)


For example, the EPINIONS network has 75,877 nodes and 405,739 edges, and the core of the network has only 36,111 (47%) nodes and 365,253 (90%) edges. For DELICIOUS, the core is even smaller: it contains only 40% of the nodes and 65% of the edges. Averaging over our network datasets, we see that the largest biconnected component contains only around 60% of the nodes and 80% of the edges of the original network. This is somewhat akin to the so-called “Jellyfish” model (63, 66) (which was proposed as a model for the graph of internet topology) and also to the “Octopus” model for random power law graphs (19). Moreover, the global minimum of the NCP plot is nearly always one of these pieces that is connected by only a single edge. Since these small barely-connected pieces seem to have a disproportionately large influence on the community structure of our network datasets, we examine them in greater detail in the next section.

We define whiskers, or more precisely 1-whiskers, to be maximal subgraphs that can be detached from the rest of the network by removing a single edge. (Occasionally, we use the term whiskers informally to refer to barely connected sets of nodes more generally.) To find 1-whiskers, we employ the following algorithm. Using a depth-first search algorithm, we find the largest biconnected component B of the graph G. (A graph is biconnected if the removal of any single edge does not disconnect the graph.) We then delete all the edges in G that have one of their end points in B. We call the connected components of this new graph G′ 1-whiskers, since they correspond to the largest subgraphs that can be disconnected from G by removing just a single edge.
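The whisker-finding procedure can be sketched in a few dozen lines. This is a minimal illustration assuming a simple undirected graph stored as an adjacency dictionary; it finds bridges with a standard low-link depth-first search, takes the largest 2-edge-connected piece as the core B, and reports the connected components of the remaining nodes as 1-whiskers (the isolated nodes of B that the edge deletion would produce are left out). All names and the toy graph are illustrative.

```python
def find_bridges(adj):
    """Bridge edges of a simple undirected graph, via iterative low-link DFS."""
    disc, low, bridges, timer = {}, {}, set(), 0
    for root in adj:
        if root in disc:
            continue
        disc[root] = low[root] = timer
        timer += 1
        stack = [(root, None, iter(adj[root]))]
        while stack:
            node, parent, neighbors = stack[-1]
            descended = False
            for nxt in neighbors:
                if nxt == parent:
                    continue
                if nxt in disc:
                    low[node] = min(low[node], disc[nxt])
                else:
                    disc[nxt] = low[nxt] = timer
                    timer += 1
                    stack.append((nxt, node, iter(adj[nxt])))
                    descended = True
                    break
            if not descended:
                stack.pop()
                if parent is not None:
                    low[parent] = min(low[parent], low[node])
                    if low[node] > disc[parent]:
                        bridges.add(frozenset((node, parent)))
    return bridges


def components(adj, allowed, skip_edges=frozenset()):
    """Connected components among `allowed` nodes, ignoring `skip_edges`."""
    seen, result = set(), []
    for start in allowed:
        if start in seen:
            continue
        seen.add(start)
        comp, todo = {start}, [start]
        while todo:
            u = todo.pop()
            for v in adj[u]:
                if v in allowed and v not in seen \
                        and frozenset((u, v)) not in skip_edges:
                    seen.add(v)
                    comp.add(v)
                    todo.append(v)
        result.append(comp)
    return result


def one_whiskers(adj):
    """Return (core B, list of 1-whiskers)."""
    bridges = find_bridges(adj)
    core = max(components(adj, set(adj), bridges), key=len)
    return core, components(adj, set(adj) - core)


# Toy graph: a 4-cycle core, a triangle hanging off node 3, a leaf off node 0.
example = {
    0: [1, 3, 7], 1: [0, 2], 2: [1, 3], 3: [2, 0, 4],
    4: [3, 5, 6], 5: [4, 6], 6: [5, 4], 7: [0],
}
core, whiskers = one_whiskers(example)
print(core, whiskers)
```

In the toy graph the core is {0, 1, 2, 3}, and the two whiskers are the triangle {4, 5, 6} and the leaf {7}, each attached to the core by a single edge.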

Empirically, if one looks at the sets of nodes achieving the minimum in the NCP plot, then before the global NCP minimum the communities are whiskers, and above that size scale they are often unions of disjoint whiskers. To understand the extent to which these whiskers and unions of whiskers are responsible for the “best” conductance sets of different sizes, we have developed a Bag-of-Whiskers Heuristic. We artificially compose “communities” from disconnected whiskers and measure the conductance of such clusters. Clearly, interpreting and relating such communities to real-world communities makes little sense, as these communities are in fact disconnected. In more detail, we performed the following experiment: suppose we have a set W = {w_1, w_2, . . .} of whiskers. In order to construct the optimal conductance cluster of size k, we need to solve the following problem: find a set C of whiskers such that

$$\sum_{i \in C} N(w_i) = k \quad \text{and} \quad \frac{\sum_{i \in C} d(w_i)}{|C|} \ \text{is maximized},$$

where N(w_i) is the number of nodes in w_i and d(w_i) is its total internal degree. We then use dynamic programming to get an approximate solution to this problem. This way, for each size k, we find a cluster that is composed solely of (disconnected) whiskers.
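One way to set up such a dynamic program is sketched below. It is not necessarily the procedure used for the experiments: this version tracks the number of whiskers explicitly, which solves the stated problem exactly but costs on the order of |W|² · k operations, so an approximate variant would be needed at scale. The function name and the example whiskers are hypothetical.

```python
def bag_of_whiskers(whiskers, max_k):
    """whiskers: list of (number of nodes, total internal degree) pairs.

    Returns {k: best value of sum(d) / |C|} over all sets C of whiskers
    whose node counts sum to exactly k.
    """
    neg = float("-inf")
    m = len(whiskers)
    # best[c][k]: largest total degree using exactly c whiskers and k nodes.
    best = [[neg] * (max_k + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for size, degree in whiskers:
        # Descending c ensures each whisker is used at most once.
        for c in range(m - 1, -1, -1):
            for k in range(max_k - size + 1):
                if best[c][k] != neg:
                    cand = best[c][k] + degree
                    if cand > best[c + 1][k + size]:
                        best[c + 1][k + size] = cand
    result = {}
    for k in range(1, max_k + 1):
        ratios = [best[c][k] / c for c in range(1, m + 1) if best[c][k] != neg]
        if ratios:
            result[k] = max(ratios)
    return result


# Three hypothetical whiskers given as (nodes, total internal degree).
scores = bag_of_whiskers([(3, 8), (2, 4), (5, 20)], max_k=10)
print(scores)
```

In the example, a cluster of size 5 is best formed by the single 5-node whisker (ratio 20) rather than by combining the 3-node and 2-node whiskers (ratio 6), and no combination of whiskers has exactly 4 nodes.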

There are several observations we can make. First, the largest whisker (denoted with a red square) is the lowest point in nearly all NCP plots. This means that the best conductance community is in a sense trivial as it cuts just a single edge, and in addition that a very simple heuristic can find this set. Second, for community size below the critical size of ≈ 100 nodes (i.e., of size smaller than the largest whisker), the best community in the network is actually a whisker and can be cut by a single edge. Third, for community size larger than the critical size of ≈ 100, the Bag-of-Whiskers communities have better scores than the internally well-connected communities extracted by Local Spectral. The best conductance sets of a given size are often disconnected, and when they are connected they are often only tenuously connected. Thus, if one only cares about finding good cuts, then the best cuts in these large sparse graphs are obtained by composing unrelated disconnected pieces. Intuitively, a compact cluster is internally well and evenly connected. Possible measures for cluster compactness include: cluster connectedness, diameter, conductance of the cut inside the cluster, ratio of conductance of the cut outside versus the cut

[Figure S7 panels, each plotting Φ (conductance) against n (number of nodes in the cluster) on logarithmic axes, with curves for the original network and the network core. (a) LIVEJOURNAL01. (b) MESSENGER-DE. (c) ATP-DBLP. (d) CIT-HEP-TH. (e) WEB-GOOGLE. (f) AMAZONALL.]

Figure S7: NCP of networks and network cores (i.e., excluding 1-whiskers, as described in the text) for large social and information networks with six different types of link semantics: (a) Friendships in the LIVEJOURNAL01 blogging network; (b) Communications in the German Yahoo MESSENGER-DE network; (c) Author-to-paper ATP-DBLP network among computer scientists; (d) Citation network CIT-HEP-TH from arXiv high-energy physics; (e) WEB-GOOGLE graph of the Web; (f) AMAZONALL product co-purchasing network from Amazon. These six networks were shown in the main text.

inside. We discuss this in more detail in Section S5.

S3.2 NCP of the network and its core

Given the surprisingly significant effect on the community structure of real-world networks that whiskers and unions of disjoint whiskers have, one might wonder whether we see something qualitatively different if we consider a real-world network in which these barely-connected pieces have been removed. To study this, we found all 1-whiskers and removed them from our networks, using the procedure we described above, i.e., we selected the largest biconnected component for each of our network datasets. This way, we kept only the network “core,” and we then computed the NCP plots for these modified networks. Figure S7 shows the NCP plots of networks constructed when we remove whiskers (i.e., keep only the network core) for the six networks we studied in detail in the main text.

Figure S7 compares the NCP plot of the full network with that of only the network core. Notice that the NCP remains largely unchanged: the plot only shifts slightly upward, but the general trends remain the same. Upon examination, the communities that are responsible for the downward part of the NCP of the network core are connected to the rest of the network by 2 edges. Intuitively, the network core has a large number of barely connected pieces (connected now by two edges rather than by one edge), and thus the “core” itself has a core-periphery structure. Since the “volume” for these pieces is similar to that for the original periphery, whereas the “surface area” is a factor of two larger, the conductance value is roughly a factor of two worse. Thus, although we have been discussing the 1-periphery in this section, one should really view it as the simplest example of the weakly-connected pieces that exert a significant effect on the community structure in large real-world networks.

There exists no well-defined algorithm for computing higher order “cores” and peripheries (2, 3, . . .), as we have defined the core above. Thus, we have also used the k-core decomposition (24) of the network to provide an alternate core-periphery notion. Recall that a k-core of a network is the largest connected set of nodes in the original network that cannot be disconnected by the removal of k nodes. In this case, a k-periphery consists of the components that can be disconnected from the (k − 1)-core by removing k edges. In particular, we define a 2-periphery to be the set of all components that can be disconnected from a core (1-core) with the removal of just two edges. To obtain a k-core of a network, we repeatedly delete nodes with degree less than k until no nodes can be deleted. We then considered the NCP of the various “cores” thereby constructed. Figure S8 presents a typical example of the results: the NCP for such cores has qualitatively similar properties to that of the original network. As we proceed to higher order cores (larger k), the NCP moves up and somewhat to the left, but it still decreases, hits a minimum, and then gradually increases.
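The deletion procedure for obtaining a k-core can be sketched directly. The version below is a minimal illustration for a simple undirected graph stored as an adjacency dictionary; the function name and the toy graph are hypothetical.

```python
def k_core(adj, k):
    """Nodes that remain after repeatedly deleting nodes of degree < k."""
    degree = {u: len(neighbors) for u, neighbors in adj.items()}
    alive = set(adj)
    pending = [u for u in adj if degree[u] < k]
    while pending:
        u = pending.pop()
        if u not in alive:
            continue  # already deleted through an earlier entry
        alive.remove(u)
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                if degree[v] < k:
                    pending.append(v)
    return alive


# Toy graph: a 4-clique on nodes 0-3 with a path 3-4-5 attached.
example = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3],
    3: [0, 1, 2, 4], 4: [3, 5], 5: [4],
}
print(k_core(example, 2), k_core(example, 3), k_core(example, 4))
```

For the toy graph, the 2-core and the 3-core are both the clique {0, 1, 2, 3}, while the 4-core is empty: deleting the attached path leaves every clique node with degree 3.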

Overall, these experiments indicate that large social and information networks have an onion-like structure, with a large number of relatively small components on the outside and then a series of denser and denser layers. Each of the layers again contains many small pieces connected to the rest of the layer with relatively few edges. It is important to note that initially 1-connected components (components that can be disconnected by the removal of a single edge) dominate the downward part of the NCP. As those are removed, 2-connected components take their role, and so on.

S3.3 Best community size versus network size

Finally, let us consider the size and community quality of the “best” communities as a function of the size of the network. We examined this in two ways: first, we looked at these quantities for all of our static networks; and second, we looked at these quantities for one of our networks as a function of time.

Figure S9 plots the size of the best cluster as a function of network size. Here we took all networks we considered in our study, and for each network we found the size of the best community. Note in particular that the best community size increases only very slowly with the network size across a wide range of real social and information networks. Note also that (when restricting ourselves to considering only networks with more than 200 nodes) we see that the average community size is 106 nodes and that the median community size is 60 nodes.

Figure S10 illustrates the same phenomena but now for an evolving social network, LINKEDIN. In this case, we have complete temporal information about the evolution (additions of nodes and edges) of this network. We thus created 65 snapshots of the network at various points in time, and we plot the size of the best community over time and the score of that community. Here, the average community size is 107 nodes, with median size 72 nodes (again taken over the snapshots with more than 200 nodes).

S4 Forest Fire model

In this section, we describe in more detail the Forest Fire Model, which provides an iterative local edge attachment mechanism that produces networks with a nested core-periphery structure and an upward-sloping NCP.

S4.1 Forest Fire attachment mechanism

To describe the Forest Fire Model (45, 46), let us fix two parameters, a forward burning probability p_f and a backward burning probability p_r. (Note that for our empirical evaluations in the main text and below, we use a single parameter p, where we set the burning probability p_f = p_r = p, but the more general version of the model is of interest since, e.g., it has been shown to produce networks that exhibit heavy-tailed degree distributions, densification over time, and shrinking diameters over time (45, 46).) We start the entire process with a single node, and at each time step t > 1, we consider a new node v that joins the graph G_t constructed thus far. The node v forms out-links to nodes in G_t as follows:

(i) Node v first chooses a node w, which we will refer to as a “seed” node or an “ambassador” node, uniformly at random and forms a link to w.

(ii) Node v selects x out-links and y in-links of w that have not yet been visited. (x and y are two geometrically distributed random numbers with means p_f/(1 − p_f) and p_r/(1 − p_r), respectively. If not enough in-links or out-links are available, then v selects as many as possible.) Let w_1, w_2, . . . , w_{x+y} denote the nodes at the other ends of these selected links.

(iii) Node v forms out-links to w_1, w_2, . . . , w_{x+y}, and then applies step (ii) recursively to each of the w_1, w_2, . . . , w_{x+y}, except that nodes cannot be visited a second time during the process.
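Steps (i) through (iii) can be sketched as a small generator. This is an illustration under stated assumptions, not a reference implementation: the recursion of step (iii) is unrolled into a breadth-first queue, candidate links are sorted only to make seeded runs reproducible, and the function names, seed, and parameter values are hypothetical.

```python
import random
from collections import deque


def geometric(rng, p):
    """Number of successes before the first failure; mean p / (1 - p)."""
    count = 0
    while rng.random() < p:
        count += 1
    return count


def forest_fire(n, p_f, p_r, seed=None):
    """Return {node: set of out-link targets} for an n-node Forest Fire graph."""
    if not (0 <= p_f < 1 and 0 <= p_r < 1):
        raise ValueError("burning probabilities must lie in [0, 1)")
    rng = random.Random(seed)
    out_links, in_links = {0: set()}, {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)  # step (i): uniform seed node
        visited = {ambassador}
        queue = deque([ambassador])
        while queue:
            w = queue.popleft()
            # Step (ii): pick x unvisited out-links and y unvisited in-links.
            outs = sorted(u for u in out_links[w] if u not in visited)
            chosen = rng.sample(outs, min(geometric(rng, p_f), len(outs)))
            ins = sorted(u for u in in_links[w]
                         if u not in visited and u not in chosen)
            chosen += rng.sample(ins, min(geometric(rng, p_r), len(ins)))
            visited.update(chosen)
            queue.extend(chosen)  # step (iii): continue burning from these
        # v links to the ambassador and to every node reached by the fire.
        out_links[v] = set(visited)
        in_links[v] = set()
        for u in visited:
            in_links[u].add(v)
    return out_links


graph = forest_fire(300, 0.3, 0.3, seed=42)
print(sum(len(targets) for targets in graph.values()), "links")
```

Every new node links at least to its ambassador, and all of its out-links point to nodes that joined earlier, so the generated graph is connected when edge directions are ignored.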

Thus, burning of links in the Forest Fire Model begins at node w, spreads to w_1, w_2, . . . , w_{x+y}, and proceeds recursively until the process dies out. One can view such a process intuitively as corresponding to a model in which a person comes to a party and first meets an ambassador who then introduces him or her around. If the person creates a small number of friendships, these will likely be from the ambassador’s “community,” but if the person happens to create many friendships, then these will likely go outside the ambassador’s circle of friends. This way, the ambassador’s community might gradually get intermingled with the rest of the network. Figure S11 illustrates this, and the caption of Figure S11 explains the depicted process.

Two properties of this model are particularly significant. First, although many nodes might form one or a small number of links, certain nodes can produce large conflagrations, burning many edges and thus forming a large number of out-links before the process ends. Such nodes will help generate a skewed out-degree distribution, and they will also serve as “bridges” that connect formerly disparate parts of the network. This will help make the NCP plot gradually increase. Second, there is a locality structure in that as each new node v arrives over time, it is assigned a “center of gravity” in some part of the network, i.e., at the ambassador node, and the manner in which new links are added depends sensitively on the local graph structure around the ambassador node. Not only does the probability of linking to other nodes decrease rapidly with distance to the current ambassador, but because of the recursive process, regions with a higher density of links tend to attract new links.

S4.2 Forest Fire NCP plots

The Forest Fire Model is sensitive to the choice of the burning probabilities $p_f$ and $p_r$. In Figure S12, we fix $p_f = p_r = p$, and we plot Forest Fire NCPs for various values of this burning probability p. The first thing to note is that since we are varying p, we are viewing networks with very different densities. Empirically, we found that we obtain realistic community profiles for values of p ≈ 0.31. Plots in Figure S12 show NCPs for various settings of p around 0.31. Notice that if p is too low or too high, then we obtain qualitatively different results. For example, for smaller values of p, the community profile plot gradually decreases to much larger size scales, and for values of p below those shown it decreases for nearly the entire plot. In this case, the Forest Fire does not spread well since the forward burning probability is too small; the network is extremely sparse and tree-like with just a few extra edges, and so we get large, well-separated “communities” that get better as they get larger. On the other hand, when the burning probability is too high, the NCP plot has a minimum and then rises extremely rapidly. In this case, if a node which initially attached to a whisker successfully burns into the core, then it quickly establishes many successful connections to other nodes in the core. Thus, the network has relatively large whiskers that failed to establish such a connection and a very expander-like core, with no intermediate region, and the increase in the community profile plot is quite abrupt.

Figure S13 further examines the behavior of the Forest Fire Model over the entire parameter space of p. For each value of p we generate a Forest Fire network on 100,000 nodes and record the position of the global minimum of the NCP plot, i.e., the size $k^*$ of the best cluster and its conductance. Figure S13(a) shows how the size of the best cluster changes with the value of p. Notice that for small values of p clusters tend to be very large, and for values of p that are too large the resulting fires spread too widely and the network has no community structure. For p around 0.30 the NCP plot reaches its minimum at $k^* \approx 100$, which is as observed in our collection of networks. Similarly, Figure S13(b) plots the conductance of the best cluster as a function of p (as in Figure S9). Notice the sharp transition and the fact that for p around 0.30 conductances are in the range $10^{-2}$ to $10^{-3}$, which is about the same as we see in real networks.

S4.3 Community size in Forest Fire model

We have also examined how the size and the conductance of the best cluster change in the Forest Fire model as we generate larger and larger networks. For each value of the burning probability p and each network size n we generated 10 different Forest Fire networks. We vary p on [0.27, 0.35] and generate networks with up to 2 million nodes. Figure S14 then plots the size of the best cluster as a function of Forest Fire network size for various values of p. For small values of p the best cluster size slowly increases, and as we increase p the best cluster size grows slower and slower. For p = 0.31 the cluster size is practically constant (or increases very slowly), with the best cluster size just below 100 nodes. Similarly, Figure S15 plots the conductance of the best cluster as a function of network size. Similar to real networks, the conductance of the best cluster ranges in the interval $10^{-1}$ to $10^{-4}$ when a suitable value of p (p ≈ 0.31) is chosen.

S4.4 Nested core-periphery structure of Forest Fire networks

Finally, we have examined how the NCP plots for the full networks generated by the Forest Fire Model compare with the NCP plots of only their cores. See Figure S16. Here we performed a similar experiment, where for a set of values of the burning probability p we created NCP plots of the full Forest Fire network (on 1.4 million nodes) and NCP plots of only the network cores (after we remove the whiskers). Figure S16 shows that the core of the Forest Fire network has a similar structure as in real networks. The NCP plots shift upwards and left. Moreover, we also note that the network core occupies a significant part of the Forest Fire networks. Table S5 shows the fraction of nodes and edges that belong to the core of the Forest Fire network for various values of p. Note that for p ≈ 0.31 we get about the right size and density of the network core.

Burn prob.      Size of the network core
    p         nodes, n_c/n    edges, m_c/m
  0.27            0.35            0.55
  0.28            0.38            0.59
  0.29            0.42            0.64
  0.30            0.45            0.70
  0.31            0.49            0.75
  0.32            0.53            0.81
  0.33            0.56            0.88
  0.34            0.67            0.99
  0.35            0.89            1.00

Table S5: Size of the core in Forest Fire networks for different values of burning probability p. Fraction n_c/n of the total number of nodes of the network that are in the network core. Fraction m_c/m of the total number of edges of the network that are in the network core.

S5 Computing the NCP: comparison to other algorithms

In this section, we want to demonstrate that what we are observing is a true structural property of our network datasets, rather than properties of our algorithms; and we want to use the differences between different approximation algorithms to further highlight structural properties of our network datasets. In this section we discuss several meta-issues related to this, including whether or not our algorithms are sufficiently powerful to recover the true shape of the minimal conductance curves, and whether we should actually be trying to optimize a slightly different measure that combines conductance of the separating cut with the piece compactness.

Recall that we defined the NCP plot to be a curve showing the minimum conductance φ as a function of piece size k. Finding the points on this curve is NP-hard. Any cut that we find will only provide an upper bound on the true minimum at the resulting piece’s size. Given that fact, how confident can we be that the curve of upper bounds that we have computed has the same rising or falling shape as the true curve?
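To make the notion of an upper-bounding curve concrete, the following Python sketch computes the conductance of a node set and the lower envelope of a collection of candidate sets. It assumes the standard definition of conductance (number of cut edges divided by the smaller of the two volumes, where volume is the sum of degrees), which is taken here to match the main text; the function names and the adjacency-list representation are illustrative choices, not part of the original study.

```python
def conductance(adj, S):
    """Conductance of node set S in an undirected graph {node: [neighbors]}."""
    S = set(S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(vol_S, vol_rest)
    return cut / denom if denom > 0 else float("inf")


def ncp_upper_bound(adj, candidate_sets):
    """Best (lowest) conductance found per piece size k.

    Each entry is only an upper bound on the true minimum at that size,
    because the candidate sets come from heuristics, not exhaustive search.
    """
    best = {}
    for S in candidate_sets:
        k = len(set(S))
        phi = conductance(adj, S)
        if phi < best.get(k, float("inf")):
            best[k] = phi
    return best
```

For two triangles joined by a single edge, the set consisting of one triangle has conductance 1/7, and any algorithm that fails to find it would report a higher value at k = 3; this is the sense in which every computed curve bounds the true curve from above.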

One method for finding out whether any given algorithm is doing a good job of pushing down the upper bounding curve in a non-size-biased way is to compare its curves for numerous graphs with those produced by other algorithms. In such experiments, it is good if the algorithms are very powerful and also independent of each other. We have done extensive experiments along these lines, and our choice of Local Spectral and Metis+MQI as the two algorithms for the main body of this paper was based on the results. In Section S5.1 we mention a few interesting points related to this.

A different method for reducing our uncertainty about the shape of the true curve would be to also compute lower bounds on the curve. Ideally, one would compute a complete curve of tight lower bounds, leaving a thin band between the upper- and lower-bounding curves, which would make the rising or falling shape of the true curve obvious. In Section S5.2 we discuss some experiments with lower bounds. Although we only obtained a few lower bounds rather than a full curve, the results are consistent with our main results obtained from upper-bounding curves.

Finally, in Section S5.3 we will discuss our decision to use the Local Spectral algorithm in addition to Metis+MQI in the main body of the paper, despite the fact that Metis+MQI clearly dominates Local Spectral at the nominal task of finding the lowest possible upper bounding curve for the minimal conductance curve. The reason for this decision is that Local Spectral often returns “nicer” and more “compact” pieces because rather than minimizing conductance alone, it optimizes a slightly different measure that produces a compromise between the conductance of the bounding cut and the “compactness” of the resulting piece.

As a summary, we have been primarily relying on the Local Spectral graph partitioning algorithm of Andersen, Chung, and Lang (9). A second procedure we have used (even on graphs of millions of nodes) is “Metis+MQI,” which consists of using the popular graph partitioning package Metis (34) followed by a flow-based MQI post-processing (41). With this Metis+MQI procedure, we obtain sets of nodes that have very good conductance scores. At very small size scales, these sets of nodes could plausibly be interpreted as good communities, but at larger size scales, we often obtain tenuously-connected (and in some cases unions of disconnected) pieces, which perhaps do not correspond to intuitive communities. With the Local Spectral Algorithm, we obtain sets of nodes with good conductance values that are more “compact” or more “regularized” than the pieces returned by Metis+MQI. Since spectral methods confuse long paths with deep cuts (30, 64), empirically we obtain sets of nodes that have worse conductance scores than sets returned by Metis+MQI, but which are “tighter” and more “community-like.” (This has provided an important cross-check when working with very large graphs.) For example, at small size scales the sets of nodes returned by the Local Spectral Algorithm agree with the output of Metis+MQI, but at larger scales this algorithm returns sets of nodes with substantially smaller diameter and average diameter, which seem plausibly more community-like.

S5.1 Cross-checking between algorithms

As just mentioned, one way to gain some confidence in the upper bounding curves produced by a given algorithm is to compare them with the curves produced by other algorithms that are as strong as possible, and as independent as possible. We have extensively experimented with several variants of the global spectral method, both the usual eigenvector-based embedding on a line, and an SDP-based embedding on a hypersphere, both with the usual hyperplane-sweep rounding method and a fancier flow-based rounding method which includes MQI as the last step. In addition, special post-processing can be done to obtain either connected or disconnected sets. After examining the output of those 8 comparatively expensive algorithms on more than 100 graphs, we found that our two cheaper main algorithms did miss an occasional cut on an occasional graph, but nothing at all serious enough to change our main conclusions. All of those detailed results are suppressed in this paper.

We have also done experiments with a practical version of the Leighton-Rao algorithm (42, 43), similar to the implementation described in (40) and (41). These results are especially interesting because the Leighton-Rao algorithm, which is based on multi-commodity flow, provides a completely independent check on Metis, and on Spectral Methods generally, and therefore on our two main algorithms, namely Metis+MQI and Local Spectral. The Leighton-Rao algorithm has two phases. In the first phase, edge congestions are produced by routing a large number of commodities through the network. We adapted our program to optimize conductance (rather than ordinary ratio cut score) by letting the expected demand between a pair of nodes be proportional to the product of their degrees. In the second phase, a rounding algorithm is used to convert edge congestions into actual cuts. Our method was to sweep over node orderings produced by running Prim’s MST algorithm on the congestion graph, starting from a large number of different initial nodes, using a range of different scales to avoid quadratic run time. We used two variations of this method, one that only produces connected sets, and another one that can also produce disconnected sets.
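The sweep step of such a rounding procedure can be sketched as follows. Given any node ordering, the code evaluates the conductance of every prefix of the ordering, updating the cut size and volume incrementally. The ordering itself is treated as an input; producing it from edge congestions with Prim’s algorithm is not reproduced here. The function name `sweep_cut` is hypothetical, conductance is assumed to be cut size over the smaller volume, and the graph is assumed to be undirected with no self-loops.

```python
def sweep_cut(adj, order):
    """Return [(k, conductance)] for every proper prefix of `order`.

    adj: undirected graph as {node: [neighbors]}, no self-loops.
    order: list containing every node of the graph exactly once.
    """
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    inside = set()
    vol = 0
    cut = 0
    results = []
    for u in order[:-1]:  # the full node set is not a cut
        links_in = sum(1 for v in adj[u] if v in inside)
        # Edges from u into the prefix stop being cut; u's other edges become cut.
        cut += len(adj[u]) - 2 * links_in
        vol += len(adj[u])
        inside.add(u)
        denom = min(vol, vol_total - vol)
        results.append((len(inside), cut / denom if denom > 0 else float("inf")))
    return results
```

Taking the minimum of the returned values at each size over many orderings (for instance, orderings started from many different initial nodes) yields one upper-bounding curve of the kind compared in this section.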

In the second row of Figure S17, we show Leighton-Rao curves for three example graphs. Our standard Local Spectral and Metis+MQI curves are drawn in black, while the Leighton-Rao curves for connected and possibly disconnected sets are drawn in green and magenta, respectively. We note that for small to medium scales, the Leighton-Rao curves for connected sets resemble the Local Spectral curves, while the Leighton-Rao curves for possibly disconnected sets resemble the Metis+MQI curves. This is a big hint about the structure of the sets produced by Local Spectral and Metis+MQI, which we will discuss further in Section S5.3.

At large scales, the Leighton-Rao curves for these example graphs shoot up and become much worse than our standard curves. This is not surprising because expander graphs are known to be the worst case input for the Leighton-Rao approximation guarantee, and we believe that these graphs contain an expander-like core that is necessarily encountered at large scales. We remark that Leighton-Rao does not work poorly at large scales on every kind of graph. (In fact, for large low-dimensional mesh-like graphs, Leighton-Rao is a very cheap and effective method for finding cuts at all scales, while our local spectral method becomes impractically slow at medium to large scales.)

We have now covered the main theoretical algorithms that are practical enough to actually run, which are based on spectral embeddings and on multi-commodity flow. Starting with (10), there has been a recent burst of theoretical activity showing that spectral and flow-based ideas, which were already known to have complementary strengths and weaknesses, can in fact be combined to obtain the best ever approximations. At present none of the resulting algorithms are sufficiently practical at the sizes that we require, so they were not included in this study.

Finally, we mention that in addition to the above theoretically-based practical methods for finding low-conductance cuts, there exist a very large number of heuristic graph clustering methods. We have tried a number of them, including Graclus (23) and Newman’s modularity optimizing program (we refer to it as Dendrogram) (29). Graclus attempts to find a partitioning of a graph into pieces bounded by low-conductance cuts using a kernel k-means algorithm. We ran Graclus repeatedly, asking for $2, 3, \ldots, i, \ldots, i \cdot \sqrt{2}, \ldots$ pieces. Then we measured the size and conductance of all of the resulting pieces. Newman’s Dendrogram program constructs a recursive partitioning of a graph (that is, a dendrogram) from the bottom up by repeatedly deleting the surviving edge with the highest betweenness centrality. A flat partitioning could then be obtained by cutting at the level which gives the highest modularity score, but instead of doing that, we measured the size and conductance of every piece defined by a subtree in the dendrogram.

In the bottom row of Figure S17, we present these results as scatter plots. Again our two standard curves are drawn in black. No Graclus or Dendrogram point lies below the Metis+MQI curve. The lower envelopes of the points are roughly similar to those produced by Local Spectral.

Our main point with these experiments is that the lowest points produced by either Graclus or Dendrogram gradually rise as one moves from small scales to larger scales, so in principle we could have made the same observations about the structure of large social and information networks by running one of those easily downloadable programs instead of the algorithms that we did run. We chose the algorithms we did due to their speed and power, although they may not be as familiar to many readers.

S5.2 Lower bounds on cut conductance

As mentioned above, our main arguments are all based on curves which are actually upper bounds on the true minimum conductance curve. To get a better idea of how good those upper bounds are, we also compute some lower bounds. Here we will discuss the spectral lower bound (15) on the conductance of cuts of arbitrary balance, and we will also discuss a related SDP-based lower bound (13) on the conductance of any cut that divides the graph into two pieces of equal volume.

First, we introduce the following notation: $\vec{d}$ is a column vector of the graph’s node degrees; $D$ is a square matrix whose only nonzero entries are the graph’s node degrees on the diagonal; $A$ is the adjacency matrix of $G$; $L = D - A$ is then the non-normalized Laplacian matrix of $G$; $\mathbf{1}$ is the vector of 1’s; and $A \bullet B = \mathrm{trace}(A^T B)$ is the matrix dot-product operator.

Now, consider the following optimization problem (which is well known to be equivalent to an eigenproblem):

$$\lambda_G = \min\left\{ \frac{x^T L x}{x^T D x} \;:\; x \perp \vec{d},\ x \neq 0 \right\}.$$

Let $x$ be a vector achieving the minimum value $\lambda_G$. Then $\lambda_G/2$ is the spectral lower bound on the conductance of any cut in the graph, regardless of balance, while $x$ defines a spectral embedding of the graph on a line, to which rounding algorithms can be applied to obtain actual cuts that can serve as upper bounds at various sizes.
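For readers who want to see why half of the minimum value bounds the conductance of every cut, the following short derivation plugs a cut-dependent test vector into the optimization problem. It is a standard argument written out under the assumption that the conductance of a set $S$ is $\phi(S) = \mathrm{cut}(S)/\min(a, b)$, where $a = \mathrm{Vol}(S)$, $b = \mathrm{Vol}(\bar{S})$, and $\mathrm{cut}(S)$ is the number of edges leaving $S$; these symbols are introduced here for illustration.

```latex
% Test vector for an arbitrary cut (S, \bar{S}) with a = Vol(S), b = Vol(\bar{S}).
\begin{align*}
x_i &= \begin{cases} 1/a, & i \in S, \\ -1/b, & i \in \bar{S}, \end{cases} \\
x^T \vec{d} &= a \cdot \tfrac{1}{a} - b \cdot \tfrac{1}{b} = 0
  \quad \text{(so $x$ is feasible)}, \\
x^T L x &= \sum_{(i,j) \in E} (x_i - x_j)^2
  = \mathrm{cut}(S) \left( \tfrac{1}{a} + \tfrac{1}{b} \right)^2, \\
x^T D x &= \sum_i d_i x_i^2 = \tfrac{1}{a} + \tfrac{1}{b}, \\
\lambda_G &\le \frac{x^T L x}{x^T D x}
  = \mathrm{cut}(S) \, \frac{a + b}{a b}
  \le \frac{2 \, \mathrm{cut}(S)}{\min(a, b)} = 2 \phi(S).
\end{align*}
```

The last inequality uses $\max(a, b) \ge (a + b)/2$. Since the cut was arbitrary, $\lambda_G / 2 \le \phi(S)$ for every $S$, which is why the bound appears as a single horizontal line that applies at all sizes.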

Next, we discuss an SDP-based lower bound on cuts which partition the graph into two sets of exactly equal volume. Consider:

$$C_G = \min\left\{ \tfrac{1}{4} L \bullet Y \;:\; \mathrm{diag}(Y) = \mathbf{1},\ Y \bullet (\vec{d}\,\vec{d}^T) = 0,\ Y \succeq 0 \right\},$$

and let $Y$ be a matrix achieving the minimum value $C_G$. Then $C_G$ is a lower bound on the weight of any cut with perfect volume balance, and $2C_G/\mathrm{Vol}(G)$ is a lower bound on the conductance of any cut with perfect volume balance. We briefly mention that since $Y \succeq 0$, we can view $Y$ as a Gram matrix that can be factored as $RR^T$. Then the rows of $R$ are the coordinates of an embedding of the graph on a hypersphere. Again, rounding algorithms can be applied to the embedding to obtain actual cuts that can serve as upper bounds.
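The reason the SDP value bounds every perfectly balanced cut can be written out in a few lines: any such cut yields a feasible rank-one matrix whose objective value equals the cut size. The derivation below is a standard relaxation argument, stated with an illustrative $\pm 1$ indicator vector $s$ and assuming unit edge weights.

```latex
% s is the +1/-1 indicator vector of a cut (S, \bar{S}) with Vol(S) = Vol(\bar{S}).
\begin{align*}
Y &= s s^T, \qquad s \in \{-1, +1\}^n, \\
\mathrm{diag}(Y) &= \mathbf{1}, \qquad Y \succeq 0, \\
Y \bullet (\vec{d}\,\vec{d}^T) &= (s^T \vec{d})^2
  = \big( \mathrm{Vol}(S) - \mathrm{Vol}(\bar{S}) \big)^2 = 0, \\
\tfrac{1}{4} L \bullet Y &= \tfrac{1}{4} s^T L s
  = \tfrac{1}{4} \sum_{(i,j) \in E} (s_i - s_j)^2 = \mathrm{cut}(S), \\
C_G &\le \mathrm{cut}(S)
  \quad \Longrightarrow \quad
  \phi(S) = \frac{\mathrm{cut}(S)}{\mathrm{Vol}(G)/2} \ge \frac{2 C_G}{\mathrm{Vol}(G)}.
\end{align*}
```

Each cut edge contributes $(s_i - s_j)^2 = 4$, which is where the factor $\tfrac{1}{4}$ in the objective comes from.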

The spectral and SDP embeddings defined here were the basis for the extensive experiments with global spectral partitioning methods that were alluded to in Section S5.1. However, in this section, it is the lower bounds that concern us. In the top row of Figure S17, we present the spectral and SDP lower bounds for three example graphs. The spectral lower bound, which applies to cuts of any balance, is drawn as a horizontal line which appears near the bottom of each plot. The SDP lower bound, which only applies to cuts separating a specific volume, namely Vol(G)/2, appears as an upwards-pointing triangle near the right side of each plot. (Note that plotting this point required us to use volume rather than number of nodes for the x-axis of these three plots.)

Clearly, for these graphs, the lower bound at Vol(G)/2 is higher than the spectral lower bound which applies at smaller scales. More importantly, the lower bound at Vol(G)/2 is higher than our upper bounds at many smaller scales, so the true curve must go up, at least at the very end, as one moves from small to large scales.

Take, for example, the top left plot of Figure S17, where in black we plot the conductance curves obtained by our (Local Spectral and Metis+MQI) algorithms. With a red dashed line we also plot the lower bound on the best possible cut in the network, and with a red triangle we plot the lower bound for the cut that separates the graph into two equal-volume parts. Thus, the true conductance curve (which is intractable to compute) lies below the black curves but above the red line and red triangle. This also demonstrates that the conductance curve, which starts at the upper left corner of the NCP plot, first goes down and reaches its minimum close to the horizontal dashed line (spectral lower bound), and then sharply rises and ends up above the red triangle (SDP lower bound). This verifies that our conductance curves and the resulting NCP plots are not artifacts of the community detection algorithms we employed.

S5.3 Local Spectral and Metis+MQI

In this section we discuss our rationale for using Local Spectral in addition to Metis+MQI as one of our two main algorithms for finding sets bounded by low conductance cuts. This choice requires some justification because the NCP plots are intended to show the tightest possible upper bound on the lowest conductance cut for each piece size, while the curve for Local Spectral is generally above that for Metis+MQI.

Our reason for using Local Spectral in addition to Metis+MQI is that Local Spectral returns pieces that are internally “nicer”. For graphs with a rising NCP plot, we have found that many of the low conductance sets returned by Metis+MQI (or Leighton-Rao, or the Bag-of-Whiskers Heuristic) are actually disconnected. Since internally disconnected sets are not very satisfying “communities”, it is natural to wonder about NCP plot-style curves with the additional requirement that pieces must be internally well connected. In Section S5.1, we generated such a curve using Leighton-Rao, and found that the curve corresponding to connected pieces was higher than a curve allowing disconnected sets.

In the top row of Figure S18, we show scatter plots illustrating a similar comparison between the conductance of the cuts bounding connected pieces generated by Local Spectral and by Metis+MQI. Our method for getting connected pieces from Metis+MQI here is simply to separately measure each of the pieces in a disconnected set. The blue points in the figures show the conductance of some cuts found by Local Spectral. The red points show the conductance of some cuts found by Metis+MQI. Apparently, Local Spectral and Metis+MQI find similar pieces at very small scales, but at slightly larger scales a gap opens up between the red cloud and the blue cloud. In other words, at those scales Metis+MQI is finding lower conductance cuts than Local Spectral, even when the pieces must be internally connected.

However, there is still a measurable sense in which the Local Spectral pieces are “nicer” and more “compact,” as shown in the second row of scatter plots in Figure S18. For each of the same pieces for which we plotted a conductance in the top row, we are now plotting the average shortest path length between random node pairs in that piece. In these plots, we see that in the same size range where Metis+MQI is generating clearly lower conductance connected sets, Local Spectral is generating pieces with clearly shorter internal paths. In other words, the Local Spectral pieces are more “compact”.
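A compactness measure of this kind can be estimated with breadth-first search, as in the sketch below. It is an illustration only: the function names, the number of sampled pairs, and the decision to restrict paths to the subgraph induced by the piece are assumptions, since the text does not specify whether paths may leave the piece.

```python
import random
from collections import deque


def bfs_distances(adj, source, allowed):
    """Hop distances from source, using only nodes in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj, piece, num_pairs=1000, seed=0):
    """Mean shortest-path length over random node pairs inside `piece`.

    Returns infinity if a sampled pair is disconnected within the piece.
    """
    rng = random.Random(seed)
    nodes = sorted(piece)
    if len(nodes) < 2:
        return 0.0
    allowed = set(nodes)
    cache = {}  # one BFS per distinct source node
    total = 0
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        if s not in cache:
            cache[s] = bfs_distances(adj, s, allowed)
        if t not in cache[s]:
            return float("inf")
        total += cache[s][t]
    return total / num_pairs
```

Two pieces of the same size and similar conductance can then be compared directly: the one with the smaller average is the more “compact” in the sense used here.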

Finally, in the bottom row of Figure S18 we briefly introduce the topic of internal vs. external cuts, which is something that none of our algorithms are explicitly trying to optimize. These are again scatter plots showing the same set of Local Spectral and Metis+MQI pieces as before, but now the y-axis is external conductance divided by internal conductance. External conductance is the quantity that we usually plot, namely the conductance of the cut which separates the piece from the graph. Internal conductance is the score of a low conductance cut inside the piece (that is, in the induced subgraph on the piece’s nodes). Intuitively, good communities should have small ratios, ideally below 1.0, which would mean that they are well separated from the rest of the network, but that they are internally well-connected. However, the three bottom-row plots show that for these three sample graphs, there are mostly no ratios well below 1.0 except at small sizes. (Of course, any given graph could happen to contain a very distinct piece of any size, and the roughly thousand-node piece in the EMAIL-ENRON network is a good example.)

This demonstrates another aspect of our findings: small communities of size below ≈ 100 nodes are internally compact and well separated from the remainder of the network, whereas larger pieces are so hard to separate that separating them from the network is more expensive than separating them internally.
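The ratio of external to internal conductance can be illustrated on small examples with the following sketch. It is not the procedure used for Figure S18: the internal conductance is found here by brute-force enumeration of all bipartitions of the induced subgraph, which is only feasible for pieces of a couple of dozen nodes, whereas the text only requires some low conductance cut inside the piece. All function names are hypothetical, and conductance is assumed to be cut size over the smaller volume.

```python
from itertools import combinations


def conductance(adj, S):
    """Cut size over the smaller volume, for an undirected {node: [neighbors]} graph."""
    S = set(S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(vol_S, vol_rest)
    return cut / denom if denom > 0 else float("inf")


def internal_conductance(adj, piece):
    """Minimum conductance over all bipartitions of the induced subgraph."""
    piece = set(piece)
    if len(piece) < 2:
        raise ValueError("a piece needs at least two nodes to be cut internally")
    sub = {u: [v for v in adj[u] if v in piece] for u in piece}
    nodes = sorted(sub)
    best = float("inf")
    # Conductance is symmetric in a set and its complement, so sizes up to n/2 suffice.
    for r in range(1, len(nodes) // 2 + 1):
        for T in combinations(nodes, r):
            best = min(best, conductance(sub, T))
    return best


def external_to_internal_ratio(adj, piece):
    """External conductance divided by internal conductance of `piece`."""
    internal = internal_conductance(adj, piece)
    external = conductance(adj, piece)
    return external / internal if internal > 0 else float("inf")
```

For two triangles joined by one edge, a single triangle has external conductance 1/7 and internal conductance 1, giving a ratio well below 1.0; adding the neighboring node from the other triangle raises the ratio to 1.0.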

S6 Experiments using modularity

Thus far, we have focused on methodological issues related to approximating conductance and the NCP in very large graphs. In this section, we describe a different issue. Although there are many other density-based measures that have been used to partition a graph into a set of communities (27, 62, 68), one that deserves particular mention is modularity (54, 55). Thus, in this section, we discuss the relationship of our results with modularity.

Modularity (54, 55) is one of the most widely used methods to evaluate the quality of a division of a network into modules or communities. For a given partition of a network into a set of communities, modularity measures the number of within-community edges, relative to a null model that is usually taken to be a random graph with the same degree distribution. (Note that modularity was originally introduced and is typically used to measure the strength or quality of a particular partition of a network into a number of communities. On the other hand, rather than seeking to partition a graph into the “best” possible partition of communities, we would like to know how good a particular element of that partition is, i.e., how community-like are the best possible communities that modularity or any other merit function can hope to find, in particular as a function of the size of that partition. Of course, one could ask those questions with respect to the modularity objective.) One important difference between modularity and conductance is that modularity assumes a null model, and then good communities (good divisions into communities) are those that are more pronounced than what is expected from the null model. Another important difference is that modularity only directly takes into account edges internal to a community, whereas conductance directly takes into account both internal and external edges.

More formally, let a network contain n nodes and m edges, let the matrix A denote the network’s adjacency matrix, let $k_i$ be the degree of node i, and let δ(i, j) = 1 if nodes i and j belong to the same group and 0 otherwise. Modularity Q is then defined as:

$$Q = \frac{1}{4m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(i, j) \qquad (4)$$

The value of the modularity lies in the range [0, 1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance. The first term of the summation in (4) counts the number of edges inside the group, while the second term counts the expected number of edges inside the group in the null model. Thus modularity can be rewritten as

$$Q = \frac{1}{4m} \left( V - E(V) \right),$$

where V denotes a volume (the number of edges inside the groups), and E(V) denotes an expected volume (the expected number of edges inside the groups) if the edges were placed randomly in the network (while obeying node degrees). For completeness, we also consider a related measure that we call the modularity ratio R, which we define as:

$$R = \frac{V}{E(V)}.$$

While modularity is the difference between the number of edges inside the group in the real and the random network (and so the correction factor is an additive term), the modularity ratio considers the ratio between the number of edges inside the groups and the number of edges inside the groups as expected by chance (in which case the correction factor is a multiplicative term).
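The two measures can be computed for a single group as in the following sketch. It follows the double sum in Eq. (4) literally, under the assumption that the sum runs over all ordered pairs of nodes in the group (including i = j), so that every internal edge is counted twice in V and the expected term factorizes into the squared degree sum divided by 2m. The function names are hypothetical.

```python
def volume_terms(adj, group):
    """Return (V, E(V)) for one group in an undirected {node: [neighbors]} graph."""
    group = set(group)
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2m
    # V: sum of A_ij over ordered pairs inside the group (each edge counted twice).
    V = sum(1 for i in group for j in adj[i] if j in group)
    # E(V): sum of k_i * k_j / (2m) over ordered pairs, which factorizes.
    degree_sum = sum(len(adj[i]) for i in group)
    EV = degree_sum ** 2 / two_m
    return V, EV


def modularity(adj, group):
    """Q = (V - E(V)) / (4m), the additive correction."""
    V, EV = volume_terms(adj, group)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return (V - EV) / (4 * m)


def modularity_ratio(adj, group):
    """R = V / E(V), the multiplicative correction."""
    V, EV = volume_terms(adj, group)
    return V / EV
```

On two triangles joined by a single edge (m = 7), one triangle gives V = 6 and E(V) = 49/14 = 3.5, so Q = 2.5/28 and R = 6/3.5.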

We then computed plots similar to the Network Community Profile plot, except that we used modularity and the modularity ratio as community quality measures. We use the simulated annealing modularity optimization techniques (58) to find partitions of the network into groups. For each of the obtained groups we then measure the modularity and modularity ratio of the obtained cut. Then, for each group size k, we plot the value of the best (i.e., maximum) modularity and best (i.e., minimum) modularity ratio group. See Figure S19 for such plots for 3 synthetic and 6 real networks that we considered. Notice the double y-axis: we plot modularity Q and the appropriately scaled modularity ratio R using the left y-axis, and conductance using the right y-axis. A general observation is that modularity tends to roughly monotonically increase towards the bisection of the network. This should not be surprising since modularity measures the “volume” of communities, with a small additive correction, and the volume clearly increases with community size. On the other hand, the modularity ratio tends to decrease towards the bisection of the network. This too should not be surprising, since it involves dividing the volume by a relatively small number.

In particular, these results demonstrate that, with respect to the modularity Q, the “best” community in any of these networks has about half of all nodes (similar to what we see with small social networks using either conductance or modularity). On the other hand, with respect to the modularity ratio, the “best” community in any of these networks has two or three nodes. In particular, whereas the conductance is discriminative, in that it reveals interesting characteristics of the community structure in different classes of networks (downward for grids, flat for random graphs, downward with “hops” for hierarchical networks, and the now-characteristic downward-and-then-upward-sloping shape for large real social and information networks), modularity, on the other hand, tends to follow the same general pattern, with occasional jumps in the curve, for all of these classes of networks.

We have also examined the behavior of both terms V and E(V) that participate in the calculation of the modularity (and thus of the modularity ratio). See Tables S6 and S7, where we show V, the number of edges inside the group, and E(V), the number of edges inside the group as expected by chance, for several of the networks. Notice that for very small group sizes the volume is also small, while the expected volume is almost zero. Then, as the community size in the network increases, the volume also increases, while the expected volume increases at a faster rate. If one considers modularity V − E(V), then bigger communities are still better, while the modularity ratio V/E(V) still prefers small communities. Consider the following heuristic calculation. Take a network with 1 million edges that contains a triad (a triple of connected nodes) attached to the rest of the network via a single edge. The volume of the triad is V = 6, and the expected volume is $E(V) \approx 10^{-5}$. Thus after normalization the modularity will be very small, $Q \approx 10^{-6}$. On the other hand, the modularity ratio will be huge, $R \approx 10^{5}$. Then, as the cluster size grows, modularity considers those clusters less and less random, while the modularity ratio does exactly the opposite. Exactly this can be noted from Tables S6 and S7: for small cluster sizes $R \approx 10^{5}$, while $Q \approx 10^{-6}$. For large clusters, $R \approx 10$, while $Q \approx 1$.

Cluster         Random graph              2d grid                    HRG
 size k          V        E(V)          V        E(V)          V        E(V)
     10         18        0.00         26        0.00         90        0.00
     20         38        0.01         62        0.00        184        0.01
     40         86        0.04        134        0.01        384        0.06
     80        164        0.16        282        0.02        832        0.26
    100        208        0.26        358        0.04      1,080        0.42
    200        414        1.20        742        0.15      2,168        1.71
    400        840        4.90      1,518        0.61      4,292        6.89
    800      1,756       20.09      3,082        2.47      8,846       27.40
  1,000      2,280       33.15      3,866        3.87     11,154       42.88
  2,000      4,610      129.98      7,806       15.63     22,312      171.97
  4,000      9,344      544.61     15,720       62.95     44,690      689.79
  8,000     22,650    2,419.08     31,622      253.27     89,514    2,756.10
 10,000     30,448    3,803.44     39,572      396.17    111,964    4,306.05
 20,000     75,072   15,832.50     79,388    1,589.53    223,976   17,229.10
 40,000    187,488   61,213.30    159,120    6,371.22    448,254   68,986.70
 80,000    468,680  249,019.00    318,736   25,565.60    895,674  275,788.00
100,000    638,748  393,241.00    398,638   39,943.40  1,120,340  430,915.00

Table S6: Volume and expected volume of the modularity score for three networks with downward-sloping NCP. Notice the modularity is heavily dominated by the volume (V) term, while the expected volume (E(V)) tends to be relatively small. Thus the null model used by modularity does a good job of assessing the overall partitioning of the network, while it is less useful for assessing the quality of individual communities in the network, which is our goal here.
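The orders of magnitude in the heuristic triad calculation can be checked by direct substitution. The worked arithmetic below assumes that the three triad nodes have degrees 2, 2, and 3 (the third node carries the single edge to the rest of the network) and that the sums in Eq. (4) run over all ordered pairs of triad nodes; both are assumptions made for illustration.

```latex
% Triad with degrees 2, 2, 3 in a network with m = 10^6 edges.
\begin{align*}
V &= \sum_{i,j \in \text{triad}} A_{ij} = 2 \cdot 3 = 6, \\
E(V) &= \sum_{i,j \in \text{triad}} \frac{k_i k_j}{2m}
  = \frac{(2 + 2 + 3)^2}{2 \cdot 10^{6}} \approx 2.5 \times 10^{-5}, \\
Q &= \frac{V - E(V)}{4m} \approx \frac{6}{4 \cdot 10^{6}} = 1.5 \times 10^{-6}, \\
R &= \frac{V}{E(V)} \approx \frac{6}{2.5 \times 10^{-5}} \approx 2.4 \times 10^{5}.
\end{align*}
```

These values agree with the stated orders of magnitude, $E(V) \approx 10^{-5}$, $Q \approx 10^{-6}$, and $R \approx 10^{5}$.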

References and Notes

1. University of Oregon Route Views Project. Online data and reports. http://www.routeviews.org, 1997.

2. TREC Web Corpus: WT10g. http://ir.dcs.gla.ac.uk/test collections/wt10g.html, 2000.

3. Google Programming Contest. http://www.google.com/programming-contest/, 2002.

4. Netflix prize. http://www.netflixprize.com/, 2006.

5. Network data. http://www-personal.umich.edu/~mejn/netdata/, July 16, 2007.

6. L.A. Adamic and N. Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD ’05: Proceedings of the 3rd International Workshop on Link Discovery, pages 36–43, 2005.

7. R. Z. Albert and A-L. Barabasi. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.


Cluster   LiveJournal             DBLP                   Google
size k    V           E(V)        V         E(V)         V         E(V)
10        90          0.00        90        0.00         90        0.00
20        348         0.00        224       0.02         380       0.01
40        1,414       0.02        1,096     0.58         920       0.09
80        3,494       0.17        2,124     2.24         516       0.03
100       6,590       0.00        2,256     2.52         372       0.01
200       14,738      3.18        2,784     3.96         652       0.04
400       14,476      3.10        17,730    171.98       5,824     3.96
800       88,118      113.95      20,232    230.19       14,696    25.43
1,000     27,462      11.17       15,498    138.98       9,390     10.54
2,000     107,974     175.40      25,080    360.06       24,572    71.36
4,000     586,298     5,091.09    43,358    1,075.04     50,656    306.89
8,000     808,894     9,790.56    68,718    2,825.45     89,724    945.75
10,000    384,920     2,284.44    78,698    3,920.99     135,856   2,182.98
20,000    1,142,650   21,548.50   128,708   10,186.70    181,740   3,971.82
40,000    1,333,620   33,666.90   222,612   32,697.80    334,286   13,679.20
80,000    1,307,610   38,008.10   491,242   159,397.00   818,760   85,059.90
100,000   1,247,110   35,817.60   627,596   247,117.00   896,802   101,645.00

Table S7: Volume and expected volume of the modularity score for three large social and information networks. Notice that the modularity is heavily dominated by the volume (V) term, while the expected volume (E(V)) tends to be relatively small. Thus the null model used by modularity does a good job of assessing the overall partitioning of the network, while it is less useful for assessing the quality of individual communities in the network.
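As a small illustration of how the modularity ratio R = V/E(V) behaves across cluster sizes, the sketch below computes R for four rows copied from the DBLP columns of Table S7. The row selection and the helper name `modularity_ratio` are choices made here for illustration only; rows with E(V) = 0.00 are left out because the ratio is undefined at the printed precision.

```python
# (cluster size k, V, E(V)) copied from the DBLP columns of Table S7.
dblp_rows = [
    (20, 224, 0.02),
    (400, 17_730, 171.98),
    (8_000, 68_718, 2_825.45),
    (100_000, 627_596, 247_117.00),
]


def modularity_ratio(volume, expected_volume):
    """Return R = V / E(V); assumes expected_volume > 0."""
    return volume / expected_volume


ratios = {k: modularity_ratio(v, ev) for k, v, ev in dblp_rows}
for k, r in ratios.items():
    print(f"k = {k:>7,d}   R = {r:,.1f}")
```

For these rows the ratio falls steadily as the cluster size grows, which is the trend described for the modularity ratio in the discussion of Tables S6 and S7.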

8. R. Z. Albert, H. Jeong, and A-L. Barabasi. The diameter of the world wide web. Nature, 401:130–131, 1999.

9. R. Andersen, F.R.K. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS ’06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486, 2006.

10. S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. In STOC ’04: Proceedings of the 36th annual ACM Symposium on Theory of Computing, pages 222–231, 2004.

11. L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44–54, 2006.

12. B. Bollobas and O. M. Riordan. Mathematical results on scale-free random graphs. In S. Bornholdt and H.G. Schuster, editors, Handbook of Graphs and Networks, pages 1–34. Wiley, 2004.

13. S. Burer and R.D.C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (series B), 95(2):329–357, 2003.

14. CASOS. Center for computational analysis of social and organizational systems (cmu casos network data). http://casos.isri.cmu.edu/computational_tools/data2.php, March, 2009.

15. F.R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.

16. F.R.K. Chung. Four proofs of Cheeger inequality and graph partition algorithms. In Proceedings of ICCM, 2007.

17. F.R.K. Chung. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences of the United States of America, 104(50):19735–19740, 2007.

18. F.R.K. Chung. Random walks and local cuts in graphs. Linear Algebra and its Applications, 423:22–32, 2007.


19. F.R.K. Chung and L. Lu. Complex Graphs and Networks, volume 107 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 2006.

20. A. Clauset, M.E.J. Newman, and C. Moore. Finding community structure in very large networks. arXiv:cond-mat/0408187, August 2004.

21. V. Colizza, A. Flammini, M.A. Serrano, and A. Vespignani. Characterization and modeling of protein-protein interaction networks. Physica A: Statistical Mechanics and its Applications, 352:1–27, 2005.

22. V. Colizza, R. Pastor-Satorras, and A. Vespignani. Reaction-diffusion processes and metapopulation models in heterogeneous networks. arXiv:cond-mat/0703129, March 2007.

23. I.S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.

24. S.N. Dorogovtsev, A.V. Goltsev, and J.F.F. Mendes. k-core organization of complex networks. Physical Review Letters, 96(4):040601, 2006.

25. A.D. Flaxman, A.M. Frieze, and J. Vera. A geometric preferential attachment model of networks. In WAW ’04: Proceedings of the 3rd Workshop On Algorithms And Models For The Web-Graph, pages 44–55, 2004.

26. A.D. Flaxman, A.M. Frieze, and J. Vera. A geometric preferential attachment model of networks II. In WAW ’07: Proceedings of the 5th Workshop On Algorithms And Models For The Web-Graph, pages 41–55, 2007.

27. M. Gaertler. Clustering. In U. Brandes and T. Erlebach, editors, Network Analysis: Methodological Foundations, pages 178–215. Springer, 2005.

28. J. Gehrke, P. Ginsparg, and J. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD Explorations, 5(2):149–151, 2003.

29. M. Girvan and M.E.J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99(12):7821–7826, 2002.

30. S. Guattery and G.L. Miller. On the quality of spectral separators. SIAM Journal on Matrix Analysis and Applications, 19:701–719, 1998.

31. S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43:439–561, 2006.

32. H. Jeong, S.P. Mason, A-L. Barabasi, and Z.N. Oltvai. Lethality and centrality in protein networks. Nature, 411:41–42, 2001.

33. B. Karrer, E. Levina, and M.E.J. Newman. Robustness of community structure in networks. arXiv:0709.2108, September 2007.

34. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20:359–392, 1998.

35. A. Khalil and Y. Liu. Experiments with PageRank computation. Technical Report 603, Indiana University Department of Computer Science, December 2004.

36. B. Klimt and Y. Yang. Introducing the enron corpus. In CEAS ’04: Proceedings of the 1st Conference on Email and Anti-Spam, 2004.

37. R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 611–617, 2006.

38. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In FOCS ’00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 57–65, 2000.

39. K. Lang. Finding good nearly balanced cuts in power law graphs. Technical Report YRL-2004-036, Yahoo! Research Labs, Pasadena, CA, November 2004.

40. K. Lang and S. Rao. Finding near-optimal cuts: an empirical evaluation. In SODA ’93: Proceedings of the 4th annual ACM-SIAM Symposium on Discrete algorithms, pages 212–221, 1993.

41. K. Lang and S. Rao. A flow-based method for improving the expansion or conductance of graph cuts. In IPCO ’04: Proceedings of the 10th International IPCO Conference on Integer Programming and Combinatorial Optimization, pages 325–337, 2004.


42. T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In FOCS ’88: Proceedings of the 28th Annual Symposium on Foundations of Computer Science, pages 422–431, 1988.

43. T. Leighton and S. Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 46(6):787–832, 1999.

44. J. Leskovec, L.A. Adamic, and B.A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), 2007.

45. J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05: Proceeding of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 177–187, 2005.

46. J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.

47. J. Leskovec, M. McGlohon, C. Faloutsos, N.S. Glance, and M. Hurst. Patterns of cascading behavior in large blog graphs. In SDM ’07: Proceedings of the 7th SIAM Conference on Data Mining, 2007.

48. D. Lusseau, K. Schneider, O.J. Boisseau, P. Haase, E. Slooten, and S.M. Dawson. The bottlenose dolphin community of doubtful sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54:396–405, 2003.

49. R. Milo, N. Kashtan, S. Itzkovitz, M.E.J. Newman, and U. Alon. On the uniform generation of random graphs with prescribed degree sequences. arXiv:cond-mat/0312028v2, May 2004.

50. A.L. Montgomery and C. Faloutsos. Identifying web browsing trends and patterns. Computer, 34(7):94–95, 2001.

51. M.E.J. Newman. The structure and function of complex networks. SIAM Review, 45:167–256, 2003.

52. M.E.J. Newman. Detecting community structure in networks. The European Physical Journal B, 38:321–330, 2004.

53. M.E.J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74:036104, 2006.

54. M.E.J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences of the United States of America, 103(23):8577–8582, 2006.

55. M.E.J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.

56. Y. Qi, J.K. Seetharaman, and Z.B. Joseph. Random forest similarity for protein-protein interaction prediction from multiple sources. In Pacific Symposium on Biocomputing, 2005.

57. E. Ravasz and A.-L. Barabasi. Hierarchical organization in complex networks. Physical Review E, 67:026112, 2003.

58. J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Physical Review E, 74:016110, 2006.

59. M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In ISWC ’03: Proceedings of the 2nd International Semantic Web Conference, pages 351–368, 2003.

60. M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. IEEE Internet Computing, 6(1):50–57, 2002.

61. S.F. Sampson. A novitiate in a period of change: An experimental and case study of social relationships. Technical report, Cornell University Department of Sociology PhD thesis, 1968.

62. S.E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.

63. G. Siganos, S.L. Tauro, and M. Faloutsos. Jellyfish: A conceptual model for the as internet topology. Journal of Communications and Networks, 8:339–350, 2006.

64. D.A. Spielman and S.-H. Teng. Spectral partitioning works: Planar graphs and finite element meshes. In FOCS ’96: Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pages 96–107, 1996.

65. D.A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC ’04: Proceedings of the 36th annual ACM Symposium on Theory of Computing, pages 81–90, 2004.


66. S.L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos. A simple conceptual model for the internet topology. In GLOBECOM ’01: Global Telecommunications Conference, pages 1667–1671, 2001.

67. J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

68. U. von Luxburg. A tutorial on spectral clustering. Technical Report 149, Max Planck Institute for Biological Cybernetics, August 2006.

69. D.J. Watts and S.H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998.

70. W.W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.


[Figure S8 plots, four panels: (a) LINKEDIN, (b) EPINIONS, (c) DELICIOUS, (d) CA-ASTRO-PH. Vertical axis in each panel: Φ (conductance), logarithmic scale; the horizontal axis is also logarithmic. Legend recovered with panel (a), given as (nodes, edges): Network (6946668, 30507070); 1-core (3325624, 26886026); 2-core (2362606, 25024555); 3-core (1899606, 23672842); 4-core (1608867, 22536592); 5-core (1400681, 21517689). Legend recovered with panel (c): 1-core (70912, 226107); 3-core (21390, 118965); 5-core (9438, 71980); 7-core (4784, 45334); 9-core (2531, 28284).]

Figure S8: NCP of networks and their “higher order network cores.” Here, we use the network k-core decomposition heuristic as explained in the text to obtain higher order network cores. Note that as layers of periphery are being “peeled” away, the NCP maintains its characteristic upward-sloping shape, while shifting slightly upward (best community scores are getting slightly worse) and to the left (best communities are getting slightly smaller).
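The “peeling” of periphery layers can be illustrated with the textbook k-core computation, which repeatedly deletes nodes whose remaining degree is below k. The sketch below is a minimal illustration of that standard definition and is not the authors' implementation; the function name `k_core` and the toy graph are hypothetical. Note that the legend of Figure S8 lists a 1-core that is smaller than the full network, so the indexing convention used for the figure may differ from the standard one assumed here.

```python
from collections import deque


def k_core(adjacency, k):
    """Return the node set of the k-core of an undirected graph.

    adjacency maps each node to the set of its neighbors. Nodes whose
    remaining degree drops below k are peeled away until none is left.
    """
    degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
    removed = set()
    queue = deque(node for node, d in degree.items() if d < k)
    while queue:
        node = queue.popleft()
        if node in removed:
            continue
        removed.add(node)
        for neighbor in adjacency[node]:
            if neighbor not in removed:
                degree[neighbor] -= 1
                if degree[neighbor] < k:
                    queue.append(neighbor)
    return set(adjacency) - removed


# Toy graph: a triangle a-b-c with a two-node "whisker" c-d-e attached.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
two_core = k_core(toy, 2)
three_core = k_core(toy, 3)
print(sorted(two_core), sorted(three_core))
```

In the toy graph the whisker is peeled away at k = 2, leaving only the triangle, and nothing survives at k = 3.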


[Figure S9 plots, two panels on logarithmic axes. (a) Size of best cluster: k* (size of best cluster) versus n (network size), with Average and Median lines. (b) Conductance of best cluster: conductance of best cluster versus network size.]

Figure S9: Size k* and conductance Φ(k*) of the best cluster (where k* = arg min Φ(k)) of our large social and information networks. Each point corresponds to one network (of size corresponding to its coordinate along the X-axis). The average and the median were calculated for networks with n > 200 (as indicated by a vertical line); the average community size is 106 nodes and the median is 60.
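To make the quantity k* = arg min Φ(k) concrete, the sketch below computes the conductance of a few candidate clusters, keeps the best score for each cluster size, and reports the size with the lowest conductance. It assumes the usual definition of conductance, the number of cut edges divided by the smaller of the two volumes, and takes the candidate clusters as given; how candidates are found (for example by the partitioning algorithms discussed in Section S5) is outside this sketch. The names `conductance` and `best_cluster` are hypothetical.

```python
def conductance(adjacency, cluster):
    """Cut edges divided by the smaller volume (assumed standard definition)."""
    cluster = set(cluster)
    cut = sum(1 for u in cluster for v in adjacency[u] if v not in cluster)
    vol_in = sum(len(adjacency[u]) for u in cluster)
    vol_total = sum(len(neighbors) for neighbors in adjacency.values())
    denominator = min(vol_in, vol_total - vol_in)
    return cut / denominator if denominator > 0 else float("inf")


def best_cluster(adjacency, candidates):
    """Return (k*, Phi(k*), profile), where profile maps size k to the
    lowest conductance seen among the candidate clusters of that size."""
    profile = {}
    for cluster in candidates:
        phi = conductance(adjacency, cluster)
        size = len(cluster)
        if phi < profile.get(size, float("inf")):
            profile[size] = phi
    k_star = min(profile, key=profile.get)
    return k_star, profile[k_star], profile


# Toy graph: two triangles joined by the single edge 2-3.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
k_star, phi_star, profile = best_cluster(graph, [{0}, {0, 1}, {0, 1, 2}])
print(k_star, phi_star, profile)
```

For this toy graph the best of the three candidates is one of the triangles, cut off by a single edge out of a volume of seven.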

[Figure S10 plots, two panels on logarithmic axes. Left: k* (size of best cluster) versus n(t) (network size at time t), with Average and Median lines. Right: Φ(k*) (conductance of k*) versus n(t) (network size at time t).]

Figure S10: Size k* and conductance Φ(k*) of the best cluster (where k* = arg min Φ(k)) as a function of time in a time-evolving network. We took snapshots of the growing LINKEDIN social network and for each snapshot we calculated and plotted these quantities. The average and the median were calculated for n(t) > 200 (as indicated by a vertical line); the average community size here is 107 nodes and the median is 72.


[Figure S11 drawings, six panels (a) to (f), showing successive steps of the burning process.]

Figure S11: The Forest Fire burning process. The attachment mechanism of the Forest Fire Model imitates the spread of a “fire” on a network by recursively attempting to burn edges and attaching to nodes at the endpoints of successfully burnt edges. (11(a)) A new node u joins the network and randomly selects a “seed” node a to link to. (11(b)) Then u attempts to “burn” a’s neighbors, and if successful it attaches a link to them. Here, dashed lines represent newly created edges and numbers correspond to the order in which edges are added; directed lines indicate existing edges where the fire successfully spreads; and double lines represent edges where the fire did not spread. In this example, the fire spreads from a to c, but not to d. (11(c)) Then from c, the fire spreads everywhere but b, and thus u adds new links to all of c’s neighbors except b. (11(d)) The fire then spreads from e to f, but not to e’s other neighbors, and thus an edge is added from u to f. (11(e)) In this example, the fire does not spread to any of f’s neighbors, and thus the process terminates. (11(f)) The new graph at the end of this step of the Forest Fire generative process. Notice that in this example, the fire burned the “bridge” edge (c, e) and thus u creates edges that attach a previously well-separated cluster of nodes (indicated by green) to the rest of the network. When a new node attaches edges in a network, it will usually create connections inside the seed node’s local community (which gives the downward part of the NCP), but when bridge edges get burned, previously well-separated clusters get better attached to each other (which gives the upward-sloping part of the NCP).
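The burning process in Figure S11 can be sketched in a few lines. The version below is a deliberately simplified, undirected variant written for illustration: it assumes that each not-yet-burned neighbor of a burning node catches fire independently with probability p, which is a simplification made here and may differ from the precise burning rule of the Forest Fire Model described in Section S4. The function name `forest_fire_graph` and its parameters are hypothetical.

```python
import random


def forest_fire_graph(num_nodes, p, seed=0):
    """Grow an undirected graph with a simplified forest-fire rule.

    Assumption (not taken from the paper): every unburned neighbor of a
    burning node is burned independently with probability p.
    """
    rng = random.Random(seed)
    adjacency = {0: set()}
    for u in range(1, num_nodes):
        seed_node = rng.randrange(u)      # uniformly random existing node
        burned = {seed_node}
        frontier = [seed_node]
        while frontier:                   # recursively spread the fire
            current = frontier.pop()
            for neighbor in sorted(adjacency[current]):
                if neighbor not in burned and rng.random() < p:
                    burned.add(neighbor)
                    frontier.append(neighbor)
        adjacency[u] = set()
        for target in burned:             # u links to every burned node
            adjacency[u].add(target)
            adjacency[target].add(u)
    return adjacency


graph = forest_fire_graph(num_nodes=200, p=0.3, seed=1)
tree = forest_fire_graph(num_nodes=200, p=0.0, seed=1)
num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
print(len(graph), num_edges)
```

With p = 0 the fire never spreads, so every new node links only to its seed and the result is a tree; larger values of p let the fire cross bridge edges more often, which is the mechanism the caption associates with the upward-sloping part of the NCP.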


[Figure S12 plots, nine panels: (a) p = 0.27, (b) p = 0.28, (c) p = 0.29, (d) p = 0.30, (e) p = 0.31, (f) p = 0.32, (g) p = 0.33, (h) p = 0.34, (i) p = 0.35. Vertical axis in each panel: Φ (conductance), logarithmic scale; the horizontal axis is also logarithmic.]

Figure S12: [Best viewed in color.] NCP for the Forest Fire Model at various parameter settings of the burning probability p (as we increase it left to right, top to bottom). Note that the very largest and smallest values for p lead to less realistic NCPs, as discussed in the text.


[Figure S13 plots, two panels with a logarithmic vertical axis. Left: k* (size of best cluster) versus p (burning probability). Right: Φ(k*) (conductance of k*) versus p (burning probability).]

Figure S13: Size k* and conductance Φ(k*) of the best cluster (where k* = arg min Φ(k)) in the Forest Fire Model, as a function of burning probability p.


[Figure S14 plots, nine panels on logarithmic axes: (a) p = 0.27, (b) p = 0.28, (c) p = 0.29, (d) p = 0.30, (e) p = 0.31, (f) p = 0.32, (g) p = 0.33, (h) p = 0.34, (i) p = 0.35. Each panel shows k* (size of best cluster) versus n (network size), with a Median line.]

Figure S14: Size k* of the best cluster (where k* = arg min Φ(k)) for (multiple independent) networks generated by the Forest Fire Model, as a function of burning probability p.


[Figure S15 plots, nine panels on logarithmic axes: (a) p = 0.27, (b) p = 0.28, (c) p = 0.29, (d) p = 0.30, (e) p = 0.31, (f) p = 0.32, (g) p = 0.33, (h) p = 0.34, (i) p = 0.35. Each panel shows Φ(k*) (conductance of k*) versus n (network size).]

Figure S15: Conductance Φ(k*) of the best cluster (where k* = arg min Φ(k)) for (multiple independent) networks generated by the Forest Fire Model, as a function of burning probability p.


[Figure S16 plots, nine panels: (a) p = 0.27, (b) p = 0.28, (c) p = 0.29, (d) p = 0.30, (e) p = 0.31, (f) p = 0.32, (g) p = 0.33, (h) p = 0.34, (i) p = 0.35. Vertical axis in each panel: Φ (conductance), logarithmic scale; the horizontal axis is also logarithmic.]

Figure S16: [Best viewed in color.] NCP for networks from the Forest Fire Model and for the corresponding network cores. Shown are results at various parameter settings of the burning probability p (as we increase it left to right, top to bottom). Note that the NCP of the core behaves in the characteristic downward-then-upward-sloping manner and, as in real networks, shifts upward slightly.


Lower bounds on the conductance of best cut in the network.

Leighton-Rao: connected clusters (green), disconnected clusters (magenta).

NCP plots obtained by Graclus and Newman’s Dendrogram algorithm.

Figure S17: Results of other algorithms for three networks: EPINIONS, EMAIL-ENRON, and CA-ASTRO-PH. The top row plots (in black) conductance curves as obtained by Local Spectral and Metis+MQI. The top row also shows lower bounds on the conductance of any cut (Spectral lower bound, dashed line) and of the cut separating the graph in half (SDP lower bound, red triangle). The middle row shows NCP plots for connected (green) and disconnected (magenta) pieces from our implementation of the Leighton-Rao algorithm. The bottom row shows the conductance of some cuts found by Graclus and by Newman’s Dendrogram algorithm. The overall conclusion is that the qualitative shape of the NCP plots is a structural property of large networks and the plot remains practically unchanged regardless of what particular community detection algorithm we use.


Conductance of connected clusters found by Local Spectral (blue) and Metis+MQI (red)

Cluster compactness: average shortest path length

Cluster compactness: external vs. internal conductance

Figure S18: Results of comparing Local Spectral (blue) and Metis+MQI (red) on connected clusters for three networks: ATP-DBLP, EMAIL-ENRON, and CA-ASTRO-PH. In the top row, we plot the conductance of the bounding cut. In the middle row, we plot the average shortest path length in the cluster. In the bottom row, we plot the ratio of the external conductance to the internal conductance. Observe that generally Metis+MQI yields better (lower conductance) cuts while Local Spectral yields pieces that are more compact: they have shorter path lengths and better internal connectivity.


[Figure S19 plots, nine panels: (a) Two-dimensional Grid, (b) Random Graph, (c) Hierarchical Random Graph, (d) LIVEJOURNAL01, (e) MESSENGER-DE, (f) ATP-DBLP, (g) CIT-HEP-TH, (h) WEB-GOOGLE, (i) AMAZONALL. Each panel shows three curves, Φ, Q, and R, with two logarithmic vertical axes: Q, R (modularity) and Φ (conductance); the horizontal axis is also logarithmic.]

Figure S19: Modularity Q, Modularity Ratio R, and Conductance Φ for three synthetic networks and six large social and information networks with six different types of link semantics: Friendships in the LIVEJOURNAL01 blogging network; Communications in the German Yahoo MESSENGER-DE network; Author-to-paper ATP-DBLP network among computer scientists; Citation network CIT-HEP-TH from arXiv high-energy physics; WEB-GOOGLE graph of the Web; and AMAZONALL product co-purchasing network from Amazon.

