multiscale community blockmodel for network explorationlsong/papers/hoparsonxin11.pdf · hierarchy:...

Multiscale Community Blockmodel for Network Exploration

Qirong Ho Ankur P. Parikh Le Song Eric P. XingCarnegie Mellon University Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University

Abstract

Real world networks exhibit a complex set ofphenomena such as underlying hierarchical or-ganization, multiscale interaction, and varyingtopologies of communities. Most existing meth-ods do not adequately capture the intrinsic inter-play among such phenomena. We propose a non-parametric Multiscale Community Blockmodel(MSCB) to model the generation of hierarchiesin social communities, selective membership ofactors to subsets of these communities, and theresultant networks due to within- and cross- com-munity interactions. By using the nested ChineseRestaurant Process, our model automatically in-fers the hierarchy structure from the data. We de-velop a collapsed Gibbs sampling algorithm forposterior inference, conduct extensive validationusing synthetic networks, and demonstrate theutility of our model in real-world datasets suchas predator-prey networks and citation networks.

1 INTRODUCTIONHow do complex networks and their self-organizationarise from coordinated interactions and information shar-ing among the actors? One way to tap into this ques-tion is to understand the latent structures over actors whichlead to the formation and organization of these networks.In particular, we are interested in uncovering the func-tional/sociological communities of network actors, andtheir influence on network connections. We consider acommunity to be a group of actors that share a commontheme, like a clique of football fans in a social network, oran ecosystem of dependent organisms in a biological foodweb. Our objective is to gain a deeper understanding of therelationships within and among these communities, so asto shed insight into the network topology.

More specifically, we seek to address three critical aspects

Appearing in Proceedings of the 14th International Conference onArtificial Intelligence and Statistics (AISTATS) 2011, Fort Laud-erdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright2011 by the authors.

of network modeling and community discovery:

1. Hierarchy — not all communities are equal: a com-munity can contain sub-communities, or be containedby super-communities. This is a natural way to struc-ture the latent space of actors.

2. Multiscale Granularity — we must distinguish be-tween coarse or generic associations that may occurin a large super-community, as opposed to fine grainedinteractions that occur within or among small, closely-interconnected sub-communities.

3. Assortativity/Disassortativity — some communitieshave strong within-community interactions and weakcross-community interactions (assortativity), yet oth-ers may exhibit the reverse (disassortativity).

These aspects are not independent, but are strongly inter-related. As an example, consider an oceanic food web(Figure 1), a directed network with species as actors andpredator-prey relationships as edges. This network exhibitshierarchy: cold-blooded animals and mammals are largesuper-communities that can be sub-divided into smallersub-communities, such as sharks and squid, or toothedwhales and pinnipeds. These sub-communities can in turnbe divided into even smaller communities (not shown). Theideas of hierarchy and network should not be confused witheach other. The hierarchy is an organization of actors insome latent space learned from the observed network.

Next, the predator-prey relationships in the ocean are mul-tiscale. Consider a sperm whale: it occasionally eats fish,which are common prey for many oceanic animals. Hence,this “sperm whale, fish” interaction is generic. Moreover,sperm whales usually eat giant squid, which are prey spe-cific to them (making this interaction fine-grained). It isimportant to differentiate between such interactions of dif-ferent scale.

Finally, the toothed whale sub-community demonstratesboth assortative and disassortative behavior. Many toothedwhales feed on small fish and seals, which are cross-community interactions. However, whales such as orcasfeed on other whales, which are within-community inter-actions.


We propose a nonparametric Multiscale CommunityBlockmodel (MSCB) that presents a unified approach toaddress these three concerns. Using the nested ChineseRestaurant Process (Blei, Griffiths, and Jordan 2010) as anonparametric structural prior, our model learns the struc-ture of the hierarchy from the data without requiring thebranching factor at each node to be prespecified. Moreover,by exploiting latent space ideas from Blei et al. (2003)and Airoldi et al. (2008), we uncover the coarse/fine-grained interactions that underlie the network. Finally,our model builds upon the blockmodel concept (Wang andWong 1987; Airoldi, Blei, Fienberg, and Xing 2008) to in-tegrate assortativity and disassortativity into our hierarchy.

1.1 Comparison to Existing Work

Existing methods for graph clustering and inferring com-munity structure do not adequately capture the three as-pects we have described. Methods such as Girvan andNewman (2002), Hoff et al. (2002), Handcock et al. (2007),Krause et al. (2003) and Guimera and Amaral (2005) can-not discover disassortative communities characterized byweak within-community and strong cross-community in-teractions. Furthermore, they do not explicitly model orga-nizational structure — and by extension, multiscale gran-ularity of interactions. These methods do not meet any ofour criteria, and are unsuited for our purposes.

The Mixed Membership Stochastic Blockmodel(MMSB) (Airoldi, Blei, Fienberg, and Xing 2008)aims to discover the multiple latent roles played by eachactor in the network, while employing a blockmodel toaccommodate both disassortative and assortative typesof interactions. The multi-role memberships discoveredby MMSB are similar, but not identical, to our notion ofcoarse/fine-grained interactions. Furthermore, MMSBdoes not induce a hierarchical structure over the actors.These considerations prevent MMSB from modeling theorganized network phenomena that our model is designedto explore. Another example of a latent space modelis Miller et al.’s link prediction model (2009), whichallows each actor to take on multiple binary features in aninfinite-dimensional space. Like MMSB, this model doesnot learn a structure over its latent space, and thereforecannot replicate our model’s ability to discover communityhierarchies.

On the other hand, methods such as Clauset et al. (2004),Radicchi et al. (2004) and the infinite stochastic block-model (Kemp and Tenenbaum 2008) explicitly model someform of organizational structure. However, they do not per-mit actors to have multiple kinds of interactions, whichprecludes them from learning the kind of multiscale inter-actions we have described. Roy et al. (2007) generalizethe infinite relational model (Kemp, Tenenbaum, Griffiths,Yamada, and Ueda 2006) for hierarchical group discovery,and extend their work to the nonparametric setting with

Cold-blooded

Mammals

Sharks Fish

Toothed Whales

Pinnipeds

Sea Creatures

Cold-blooded Mammals

Shar

ks

Fish

Too

the

d

Wh

ale

s

Pin

nip

ed

sOrca Whale

Sperm Whale

Orca WhaleSperm Whale

Sea Lion Seal

Sea LionSeal

Bottlenose Dolphin

Bottlenose Dolphin

HerringTuna

Herring

TunaMako

Hammerhead

HammerheadMako

Squid

Squ

id

Giant

GiantFlying

Flying

Figure 1: Illustration of an oceanic food web as a set of nestedcommunities (Left), and the corresponding hierarchy of com-munities (Right). Vertices in the network represent individualspecies, and directed edges represent predator-prey relationships(not all shown). Solid edges are fine-grained, specific interac-tions, while dashed edges are coarse-grained and generic to alarge community.

Mondrian Processes (Teh and Roy 2009). However, theirmodels are limited to binary hierarchies. Our model as-sumes no limit on the hierarchy’s branching factor, whichis more realistic for certain networks.

2 MULTISCALE COMMUNITYBLOCKMODEL (MSCB)

In the sequel, we describe the different aspects of themodel, beginning with the hierarchy and then proceedingto network edge generation. We use the oceanic food webin Figure 1 as a running example.

2.1 The Community HierarchyIn our model, the hierarchy is a tree where each nodeis a community. The root of the tree is designated aslevel 0. Nodes closer to the root represent large super-communities, (e.g. the “cold-blooded animals” and “mam-mals” in Figure 1), while those closer to the leaves repre-sent finer-grained sub-communities (e.g. “toothed whales”or “sharks”).

Each actor is associated with a single path of super-sub communities in the hierarchy (which we call its pathci). This path delineates a sequence of communities fromcoarse to fine. For example, a sperm whale could have thepath [mammal, toothed whale].

Selecting the number of branches at every tree node a priorican be daunting because of the huge number of possibili-ties. One might consider using heuristic methods that guessat the number of such children, but doing so would defeatthe purpose of employing a probabilistic model.

We solve this problem by adopting a nonparametricBayesian prior on paths through trees, the nested Chi-nese Restaurant Process (nCRP) (Blei, Griffiths, and Jor-dan 2010), which automatically selects the number ofbranches based on the data. The generative process for thenCRP works according to the following metaphor: Actor 1

Qirong Ho, Ankur P. Parikh, Le Song, Eric P. Xing

chooses his tree path first, followed by actor 2, and so on.Consider actor i. He begins at the root, then with proba-bility n(1)

x,i−1/(i − 1 + γ) he selects branch x of the tree,and with probability γ/(i− 1+ γ), he picks a new branch.n

(1)x,i−1 is the number of actors before i that chose branch x

at level 1, and γ is a hyperparameter dictating the probabil-ity that an actor will start a new branch. Higher values of γwill increase the width of the hierarchy.

Actor i continues this process as he descends the tree.When picking a branch at level k, with probabilityn

(k)y,i−1/(n

(k−1)i−1 + γ) he selects branch y, and with proba-

bility γ/(n(k−1)i−1 + γ) he starts a new branch. Here, n(k−1)

i−1

counts the number of actors 1, . . . , i − 1 having the samepath as actor i up to (and including) level k − 1. Out ofthese actors, n(k)

y,i−1 is the number that picked branch y (atlevel k). This sequence of branch choices defines the pathof actor i, and the union of all these paths forms the hierar-chy. A more formal treatment of the nCRP can be found inthe Supplemental. In our model, we limit the hierarchy toa maximum depth of K.

2.2 Multiscale Membership

In order to enable multiscale granularity on the interactions,we associate each actor i with a Multiscale Membership(MM) vector θi. The MM vector is aK-dimensional multi-nomial that encodes an actor’s tendencies to interact as amember of the different super- and sub- communities alonghis/her path of depthK. Consider two toothed whales: dol-phins and sperm whales. Both have the same path in thetree, [mammal, toothed whale], yet both behave very dif-ferently. A dolphin’s diet mainly consists of fish, which arecommon prey for many mammals. Thus it has a high prob-ability of interacting as a member of the mammal super-community, although it occasionally chooses prey that aremore specific to its species.

A sperm whale on the other hand barely eats fish, and thusrarely interacts as a member of its super-community. In-stead, it eats giant squid, a more specific prey uncommonto most mammals. As a result, a sperm whale has a higherprobability of participating in fine-grained interactions, in-stead of coarse ones like the dolphin does.

Like the mixed membership vector of the MMSB (Airoldi,Blei, Fienberg, and Xing 2008), which allows an actor tohave a distribution over roles, our Multiscale Membershipvector allows an actor to have a distribution over commu-nities. However, there is a key difference: in MMSB, anactor may have a distribution over all possible latent roles,whereas in our model, an actor’s Multiscale Membershipvector is a distribution over only the set of super and sub-communities along the actor’s path. This is because allow-ing an actor to have a distribution over all communities inthe hierarchy can render the hierarchy virtually meaning-less: a dolphin could simultaneously be in the shark and

ci

N

N-1

z→ij

z←ji

θi

N×N

Eijγ

m,π

Stick(m,π)

nCRP(γ)

Bernoulli(SB )

Multinomial(θi)

∞×∞

Bh,h'λ1,λ2

Beta(λ1,λ2)

Multinomial(θi)

ij

Figure 2: Graphical model for MSCB.

toothed whale communities, which is unrealistic.

The Multiscale Membership vectors θi are drawn from atwo-parameter stick breaking process Stick(m,π) (Blei,Griffiths, and Jordan 2010). The stick breaking prior makesit more intuitive to bias interactions toward coarser or finerlevels compared to a Dirichlet prior with either a single pa-rameter (which is not expressive enough), or K− 1 param-eters (which may be too expressive). The parameterm > 0influences the mean of θi, and π > 0 influences its vari-ance (details in the Supplemental). Because the hierarchyis only learnt up to depth K, we truncate the Stick(m,π).

2.3 Network Edge GenerationAt this point, we shall introduce some notation. Let E bethe N × N adjacency matrix of observed network edges,where Eij corresponds to the directed edge or interac-tion/relationship from actor i to j. In the context of ourfood web, the actors are sea creatures like dolphins andsperm whales, and the edges represent predator-prey inter-actions. A value of Eij = 1 indicates that the interactionis present, while Eij = 0 indicates absence, and we ignoreself-edges Eii.

We introduce our generative process for network edges:

• For each actor i ∈ {1, . . . , N}– Sample i’s path ci ∼ nCRP(γ).– Sample i’s MM θi ∼ Stick(m,π).

• To generate the network, for each directed edge Eij :

– Sample donor level z→ij ∼ Multinomial(θi).– Let1 h = ci[z→ij ].– Sample receiver level z←ij ∼ Multinomial(θj).– Let h′ = cj [z←ij ].– Sample the edge Eij ∼ Bernoulli(SB(h, h

′)). Weshall explain the meaning of SB later.

The basic idea is as follows: for every directed edge Eij ,both actor i (the donor) and actor j (the receiver) pick

1Formally, h is the community at level z→ij on path ci.


BB

BRoot

1 2 3

1 2 1 12 BFigure 3: Sibling groups in the hierarchy, and their associatedcommunity compatibility matrices B.

communities h and h′ from their respective paths ci, cj ,according to their MM vectors θi, θj . The communi-ties h, h′ are then used to select a community compatibil-ity parameter SB(h, h

′), which in turn generates Eij ∼Bernoulli(SB(h, h

′)). Note that the arrow in z→ij or z←ijdenotes donor or receiver respectively, not edge directionbetween i and j.

2.4 Community Compatibility Matrices B

We now discuss the SB() function. Intuitively, the compat-ibility from h to h′ is high if actors from h often interactwith actors from h′. Conversely, a low compatibility indi-cates that actors from h rarely interact with actors from h′.Thus, it is natural to define compatibility to be a Bernoulliparameter in [0, 1], where 1 indicates perfect compatibil-ity. This notion of compatibility is what allows our modelto account for both assortative and disassortative behavior.(For example, strong assortative interactions correspond tohigh compatibility parameters when h = h′).

There are many ways to associate compatibility parameterswith pairs of communities h, h′. Our goal is to meaning-fully integrate compatibility with the hierarchy and multi-scale interactions over communities. A first attempt mightbe to ignore the hierarchy, and place a fullH×H compati-bility matrix over all community pairs h, h′, which is anal-ogous to the blockmatrix of MMSB (Airoldi, Blei, Fien-berg, and Xing 2008). However, this formulation does notcapture the multiscale nature of interactions, because thereis no connection between the compatibility parameter forh, h′ and those communities’ levels in the hierarchy.

Instead, we restrict the compatibility parameters by defin-ing a compatibility matrix for each sibling group (a setof children under the same parent) in the hierarchy. Eachsibling group’s compatibility matrix defines the interactionprobability between every pair of siblings in that group —refer to Figure 3 for an illustration. Since the number ofhierarchy nodes is not pre-specified, the number and sizeof the sibling group compatibility matrices must be deter-mined automatically from the data. We shall address thisissue when we describe our inference procedure.

When the interacting communities h, h′ share the same par-ent, we simply choose the appropriate entry from their sib-

ling group matrix. However, if h, h′ do not share the sameparent, then we invoke the following coarsening procedure:

1. Recall that h = ci[z→ij ] and h′ = cj [z←ij ].2. Find zmin = min(z→ij , z←ij).3. If hcoarse = ci[zmin] and h′coarse = cj [zmin] are in

the same sibling group, then we look up its compati-bility matrix entry Bhcoarse,h′coarse

. We then generateEij ∼ Bernoulli(Bhcoarse,h′coarse

).4. Otherwise, hcoarse, h′coarse have zero compatibility,

and we generate Eij = 0.

Essentially, if actor i picks a community at a deeper levelthan actor j, then i coarsens up along his path to the samelevel as j. We can now formally define the SB() functionfrom the previous section:

SB(h, h′) =

Bhcoarse,h′coarse

hcoarse and h′coarse have same parent0 otherwise

h = ci[z→ij ] h′ = cj [z←ij ]

hcoarse = ci[zmin] h′coarse = cj [zmin]

zmin = min(z→ij , z←ij).

For brevity, we define the shorthand SijB := SB(h, h′).

Finally, a Beta(λ1, λ2) prior is placed over every commu-nity compatibility parameter Bhcoarse,h′coarse

. This addsthe following step to our generative process:

• For each hcoarse, h′coarse:

– Sample Bhcoarse,h′coarse∼ Beta(λ1, λ2).

Increasing λ1 will bias the model towards having moreedges, whereas increasing λ2 biases the model towardshaving fewer edges. Intuitively, λ1 is the number of fakeedges we are willing to assume, while λ2 is our assumednumber of fake non-edges.

A graphical model representation of our generative processcan be found in Figure 2.

3 COLLAPSED GIBBS SAMPLERINFERENCE

Exact inference on our model is intractable, so we derive acollapsed Gibbs sampling scheme for posterior inference.The compatibility matrices B present a challenge sincethey can change in number/size as nodes are added anddeleted during the sampling process (because the hierar-chy structure is not prespecified). To finesse this issue, weanalytically integrate them out using Beta-Bernoulli conju-gacy, which adds dependence among interactions that im-plicitly share a compatibility parameter.

For faster mixing, the θi’s are also integrated out. Thus,the only variables that need to be explicitly sampled arethe levels z and the paths c. The sampling equations areprovided below.


Sampling Levels The distribution of z→ij conditionedon all other variables is

P(z→ij | c, z−(→ij),E, γ,m, π, λ1, λ2) ∝ (1)P(Eij | c, z,E−(ij), λ1, λ2)P(z→ij | zi,(−j),m, π)

where E−(ij) is the set of all edges except Eij , andzi,(−j) = {z→i·, z←·i} \ z→ij . The first term, for a par-ticular value of z→ij , is equal to{

Γ(a+b+λ1+λ2)Γ(a+λ1)Γ(b+λ2) ·

Γ(a+Eij+λ1)Γ(b+(1−Eij)+λ2)Γ(a+b+1+λ1+λ2) SijB 6= 0

0 otherwise

a =∣∣∣{(x, y) | (x, y) 6= (i, j),SxyB = SijB , Exy = 1

}∣∣∣b =

∣∣∣{(x, y) | (x, y) 6= (i, j),SxyB = SijB , Exy = 0}∣∣∣ (2)

In Eq. 2, the compatibility matrices B have been integratedout via Beta-Bernoulli conjugacy. As a result, z→ij de-pends on the interactions Exy that share Eij’s compatibil-ity parameter at this point in the sampling process.

The second term is computed by conditioning on the stick-breaking lengths V1, ..., VK associated with z→ij :P(z→ij = k | zi,(−j),m, π) = (3)

mπ +#[zi,(−j) = k]

π +#[zi,(−j) ≥ k]

k−1∏u=1

(1−m)π +#[zi,(−j) > u]

π +#[zi,(−j) ≥ u]

Since we have limited the maximum depth toK, we simplyignore the event z→ij > K, and renormalize the distribu-tion of z→ij over the domain {1, . . . ,K}. The distributionof z←ij is derived in similar fashion.

The runtime complexity of sampling a single zij is O(K),where K is the (fixed) depth of our hierarchy. Hence thetotal runtime for all z is O(N2K).

Sampling Paths The distribution of ci conditioned on allother variables is

P(ci | c−i, z,E, γ,m, π, λ1, λ2) ∝ (4)P(E(i·),(·i) | c, z,E−(i·),−(·i), λ1, λ2)P(ci | c−i, γ)

where E(i·),(·i) = {Exy | x= i or y= i} is the set of edgesExy that depend on ci, and E−(i·),−(·i) is its complement.The second term can be computed using the nCRP defini-tion (refer to the Supplemental). The first term, for a par-ticular value of ci, is

∏B∈B(i·),(·i)

Γ(gB+hB+λ1+λ2)Γ(gB+λ1)Γ(hB+λ2) ·

Γ(gB+rB+λ1)Γ(hB+sB+λ2)Γ(gB+hB+rB+sB+λ1+λ2)

∀Exy∈E(i·),(·i), SxyB 6=0

0 otherwise

B(i·),(·i) = {Bh,h′ | ∃(i, j)[Eij ∈E(i·),(·i),SijB=Bh,h′ ]}

gB =∣∣{(x, y) | Exy∈E−(i·),−(·i), S

xyB =B,Exy=1

}∣∣hB =

∣∣{(x, y) | Exy∈E−(i·),−(·i), SxyB =B,Exy=0

}∣∣rB =

∣∣{(x, y) | Exy∈E(i·),(·i), SxyB =B,Exy=1

}∣∣sB =

∣∣{(x, y) | Exy∈E(i·),(·i), SxyB =B,Exy=0

}∣∣ (5)

where B(i·),(·i) is the set of community compatibility pa-rameters Bh,h′ associated with some edge in E(i·),(·i).Similar to Eq. (2), Eq. (5) is a consequence of integrat-ing out B for all interactions E associated with actor i.

The runtime for a single ci is O(NH), whereH is the num-ber of hierarchy nodes. Hence the time to sample all c isO(N2H). Note that H = O(NK), so the complexity ofsampling all c is O(N3K).

4 SIMULATIONWe first evaluate our model’s ability to recover hier-archies on simulated data. Our focus is to exam-ine how our model’s ability to model both assortative(within-community) interactions and disassortative (cross-community) interactions differentiates it from a traditionalhierarchical clustering algorithm.

Our experiments explore the effect of different compatibil-ity matrices B. We first explore an on-diagonal B, whosediagonal elements are much larger than the off-diagonals(strong assortative interactions). We also investigate an off-diagonal B, whose off-diagonal elements are larger (strongdisassortative interactions). For either B type, we experi-mented with maximum hierarchy depths K = 2 and 3. Forthe K = 2 simulations, the number of actors N was 150,while for K = 3 we used 300 actors. Additional detailsand experiments can be found in the Supplemental.

We compare our approach to hierarchical spectral cluster-ing (denoted HSpectral). For spectral clustering, it is un-clear how the number of clusters at each node would beselected, so we give it the number of 1st-level branches asan advantage (and then let it do a binary split at the deeperlevels). For our model, we fix m = π = λ1 = λ2 = .5 andsearch over γ = {.01, .1, .5, 1, 1.5, 2}, picking the valuethat maximizes the marginal likelihood. We calculate theF1 score at each level k, F1k = 2∗Precision∗Recall

Recall+Precision whereRecall = TP

TP+FN , and Precision = TPTP+FP . TP is true

positive count (actors that should be in the same cluster, upto depth k), FP is false positive count, TN is true negativecount, and FN is false negative count. The total F1 scoreis computed by averaging the F1k scores for each level.

Figure 4 illustrates our results as a function of the numberof branches at the first level of the generated tree. As onecan see, in Figure 4(a) and Figure 4(b), when B is stronglyon-diagonal, our algorithm performs well, but a little worsethan HSpectral (since HSpectral is given the number oflevel 1 branches). A specific example for K = 2 is shownin Figures 4(e), 4(f), 4(g), 4(h), on which both models per-formed reasonably well.

However, when B is strongly off-diagonal (implying strongcross-community interactions), HSpectral performs poorly.This is because, by formulation, spectral clustering cannotrecover a disassortative community. On the other hand, ourmethod still gives good results (Figure 4(c), Figure 4(d)).


1 2 3 4 5 60

0.2

0.4

0.6

0.8

1

number of branches at first level

f1 s

core

tree recovery, K=2, on−diagonal B

HSpectralOur Model

(a)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, K=3, on−diagonal B

HSpectralOur Model

(b)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, K=2, off−diagonal B

HSpectralOur Model

(c)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, K=3, off−diagonal B

HSpectralOur Model

(d)

20 40 60 80 100 120 140

20

40

60

80

100

120

140

(e)

Original

root

140 10

95 17 23 7

(f)

Our Model

root

106 34 10

97 9 33 8

(g)

HSpectral

root

141 9

102 39 5

(h)

20 40 60 80 100 120 140

20

40

60

80

100

120

140

(i)

Original

root

92 53 5

90 5 38 5 <5

(j)

Our Model

root

81 51 16

81 12 35 16

(k)

HSpectral

root

32 48 70

13 19 24 24 23 47

(l)

Figure 4: Simulation Results. Figures 4(a), 4(b), 4(c), and 4(d) show quantitative results. Figures 4(e), 4(f), 4(g) 4(h) illustrate resultsfor one on-diagonal (assortative) network, and Figures 4(i), 4(j), 4(k) , 4(l) illustrate results for one off-diagonal (disassortative) network.4(e) and 4(i) are the original networks for these two cases (black indicates edge). The numbers inside hierarchy nodes are actor counts(nodes of size < 5 are not shown). See text for details.

0 5 10 15 20 25−900

−800

−700

−600

−500

−400

−300

−200

−100Grass dataset: held−out log marginal−likelihoods

MMSB number of roles

Log

mar

gina

l−lik

elih

ood

MMSBOur model, K=2Our model, K=3

0 5 10 15 20 25−4500

−4000

−3500

−3000

−2500

−2000

−1500

−1000

−500

0Terrorist dataset: held−out log marginal−likelihoods

MMSB number of roles

Log

mar

gina

l−lik

elih

ood

MMSBOur model, K=2Our model, K=3

Figure 5: Held-out marginal likelihoods for our model andMMSB. Dotted lines show our model’s error bars.

A K = 2 example is shown in Figures 4(i), 4(j), 4(k), 4(l)where our model performs accurately while HSpectral es-sentially divides the actors randomly and performs poorly.Thus our model successfully captures a variety of commu-nity interactions that traditional clustering methods cannot.Moreove, it can also recover the actor-specific multi-scaleinteraction levels for a richer network analysis.

5 HELD-OUT EVALUATIONHaving evaluated how our model compares to a traditionalhierarchical clustering method without a latent space, wenow seek to compare it to latent space models that do notaccount for hierarchical structure. Since our latent spaceis integrated with the hierarchy, it is not possible to com-pare to a “non-hierarchical version” of our model. MMSB(Airoldi, Blei, Fienberg, and Xing 2008) seems the bestchoice for comparison, since it has analogous (but differ-ent) notions to our multi-scale membership and communitycompatibility in a non-hierarchical setting.

We use two real-world datasets, a 75-species food web ofgrass-feeding wasps (Dawah, Hawkins, and Claridge 1995;Clauset, Moore, and Newman 2008), and the 9/11 hijackernetwork consisting of 62 terrorists (Krebs 2002; Clauset,Moore, and Newman 2008). Our choices reflect two com-mon modes of interaction seen in real-world network data:edges in the food web denote predator-prey relationships,while edges in the terrorist network reflect social cohesion.The food web could be represented as a hierarchy where

different branches reflect different trophic levels (e.g. para-site, predator or prey), while the terrorist network could beinterpreted as an organization chart. In the following exper-iments, we compare our model to MMSB using held-outmarginal likelihood; models with higher likelihood imply abetter fit to the data.

For each dataset, we generated 5 sets of training and testsubgraphs; each train/test pair was obtained by randomlypartitioning the actors into two equal sets, and keeping onlythe edges within each partition. With each train/test pair,we first used the training subgraph to select an appropri-ate prior on the community compatibility parameters B, byperforming a gridsearch over (λ1, λ2) ∈ {.1, .3, .5, .7, .9}2according to the log marginal likelihood. The remain-ing parameters were fixed to γ = 1,m = π = 0.5, aswe found our results to be relatively insensitive to them.Using the best gridsearch parameters, we then estimatedthe log marginal likelihood on the corresponding test sub-graphs, averaging over them to obtain our model’s averageheld-out likelihood. This entire procedure was conductedfor maximum hierarchy depths K = 2 and 3. The pro-cedure for MMSB was similar, except that we used 100random restarts of the MMSB variational EM algorithmon the training subgraphs to select the best parameters.MMSB also requires the number of latent roles R as atuning parameter, so we repeated the experiment for each2 ≤ R ≤ 20. For either algorithm, log marginal likelihoodswere estimated using 10,000 importance samples.

The results are shown in Figure 5. On both datasets, ourmodel’s held-out likelihood for either value of K is su-perior to MMSB for all R. Notably, MMSB’s likelihoodpeaks on both datasets at R = 2, but selecting so few roleswill lead to an extremely coarse network analysis. In con-trast, our model automatically recovers a suitable level ofhierarchical complexity and enables rich interpretations ofthe data — as we shall demonstrate next.


Hierarchy Root

45

1

14

2

8

3

5

4

2

5

1

6

43

1.1

2

1.2

12

2.1

2

2.2

8

3.1

5

4.1

2

5.1

1

6.1

36 Parasitoids7 Hyper-Parasitoids

1 Parasitoids1 Hyper-Parasitoids

12 Herbivores 2 Herbivores 8 Grasses2 Parasitoids

2 Hyper-Parasitoids1 Hyper-Hyper-Parasitoids

1 Hyper-Parasitoids1 Hyper-Hyper-Parasitoids

1 Herbivores

Species Index

Hie

rarc

hy P

ositi

on

Grass dataset: Multiscale Membership vector for each species

10 20 30 40 50 60 70

1

2

3

4

5

6

1.1

1.2

2.1

2.2

3.1

4.1

5.1

6.10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

2.1

2.2

3.1

4.1 5.1

6.1

9

1

1011

2

1213

3

14

15

16

45

17

18

19

6

20 2122

7

23

8

24 2526 27 2829 30 31

32

3334 35 3637 38

3940

4142 43 4445 46 47 48495051 52

5354

55

565758

59 6061

62 6364

65

66

67

686970 71 7273

74 75

Grass Herbivore Parasitoid Hyper-Parasitoid

Hyper-Hyper-

Parasitoid

Figure 6: Grass network: Top Left: Inferred hierarchy of communities, with community trophic level counts at the bottom. Top Right:Multiscale Membership vectors for each actor. Bottom: Original network. Edges show interacting communities (edge head/tail colorsmatch assumed hierarchy underlying the interactions) and interaction level (1 = solid, 2 = dashed) inferred by our model. Node shapesrepresent annotated trophic levels (see legend in bottom right).

Hierarchy Root

878 69 20 20 12 1

810 26 14 8 7 6 7 43 11 15 14 6 19 1 6 6 1

92: theory68: field

66: quantum50: gauge

49: noncommutative44: theories

40: string31: model30: black29: brane

5: supergravity4: string

3: wrapped3: branes

3: intersecting3: solutions3: extended3: gauged

2: 7-manifolds2: pp-waves

5: pp-wave3: plane-wave

3: theory2: iia

2: type2: action2: matrix2: plane2: string

2: background

5: higher4: spin

2: dimensions2: four

2: equations1: model1: susy

1: theory1: strings

1: analysis

2: fluxes2: rotating

1: 7-manifolds1: membranes

1: chiral1: m-theory

1: one1: strings1: probes1: beyond

3: background3: wave2: type

2: matrix2: r-r

2: plane2: pp1: iib

1: model1: iia

3: string2: theory2: field

1: sector1: flux

1: time-dependent1: berkovits’

1: chiral1: supergravity1: projection

12: string10: theory

10: tachyon6: models

6: field5: matrix4: open

4: supersymmetric4: intersecting

4: bmn

3: tachyon3: rolling2: n=1

2: tachyons2: super

1: intersecting1: solutions

1: supergravity1: bsft1: time

3: inflation3: brane

2: interaction2: evolution2: models2: theory2: strings2: matrix

2: pp-wave2: string

3: theories3: theory

3: geometric3: tachyon

2: open2: duality

2: inflation2: rolling2: field

2: string

2: matrix2: models

1: spacelike1: intersecting1: dependent

1: worlds1: integral

1: dbi1: dijkgraaf-vafa

1: calculation

4: strings4: pp-wave

3: supergravity3: string2: open2: big

2: holography2: correspondence

2: star2: branes

1: 3d1: instability

1: singularities1: null

1: sector1: solutions1: doubling1: theory1: ghost1: curves

1: ns5-brane1: principle

1: cosmological1: deformed

2: theory2: string

1: amplitudes1: geometries

1: orbifolds1: strings

1: free1: s^51: hkt

1: moyal

1: supersymmetry1: limits

1: penrose1: maximal

Figure 7: HEP network: Inferred hierarchy of communities, with the most frequent title keywords at the bottom. Community positions(circles) show the number of papers.

6 REAL-DATA QUALITATIVE ANALYSISIn this section, we apply our model to interpret two real-world networks. We demonstrate that our model recoversthe three network aspects we seek: hierarchy, multiscalegranularity, and assortativity/disassortativity.

For both experiments, we use the optimal parameters froma held-out gridsearch similar to the previous section. Wethen ran our Gibbs sampler for 10,000 burn-in iterations,and took 500 samples. In order to account for posteriorspread, we report a “consensus” sample that is analogousto an average. A description of the consensus and otherexperimental details can be found in the Supplemental.

6.1 Grass-Feeding Wasp Parasitoids Food Web

We begin with the earlier grass dataset, consisting of 75species in a food web, and in which interactions represent

predator-prey relationships. This dataset annotates eachspecies with its position or “trophic level” in the food web:grass, herbivore, parasitoid, hyper-parasitoid (parasites thatprey on other parasites), and hyper-hyper parasitoid. OurGibbs sampler’s inferred community hierarchy and Multi-scale Membership (MM) vectors are reported in Figure 6.We also show the original network, where each interactionEij = 1 has been augmented with its associated commu-nity and interaction level (missing links Eij = 0 are notshown). Trophic level annotations are shown in the hierar-chy as counts, and in the network as node shapes.

In general, the level 1 super-communities separate thetrophic levels. For example, community 3 contains all grassspecies, 2 contains most herbivores, and 1 contains mostparasitoids. Note that the trophic levels form a set of dis-assortative communities, e.g. herbivores feed on grasses,but not on other herbivores. In contrast, Clauset et al.’s


0 100 200 300 400 500 600 700 800 900 1000

0

100

200

300

400

500

600

700

800

900

1000

HEP dataset adjacency matrix, sorted in hierarchical order

Figure 8: HEP network: Adjacency matrix, permuted accordingto the communities in Figure 7.

method, which assumes assortative communities, did notrecover this structure in their experiments (Clauset, Moore,and Newman 2008).

The smaller super-communities are still more interesting:for instance, the herbivore in super-community 6 is thesole prey of the parasitoids in super-community 4, whichjustifies its separation from the other herbivores in super-community 2. Moreover, community 5 solely contains thetwo apex parasitoids with the largest and 2nd-largest rangeof prey species.

At level 2, the communities are separated by more subtlecriteria than just trophic levels. The herbivores in commu-nity 2.2 are the sole prey of species 42 and 41 in commu-nity 1.1, while community 1.2 contains another apex par-asitoid with an especially large range of prey species. Inboth cases, our model has separated these auxiliary foodwebs from the main web.

We now investigate the Multiscale Memberships recoveredby our model. The MM vectors in Figure 6 show the fre-quency at which each species interacts as a member of aparticular super- or sub-community. Most species iden-tify at the super-community (i.e. generic) level, thoughsome occasionally identify at the sub-community level.Our results show that level 2 interactions occur only withinsuper-communities, hence they account for fine-grained,within-community interactions. For example, the within-community links in community 4, as well as the links fromspecies 65 in sub-community 1.2 to other members of com-munity 1, are all level 2 interactions. Note that we have notshown interaction levels for missing links, and a numberof these are accounted for by level 2 interactions (e.g. incommunity 1).

6.2 High-Energy Physics Citation Network

Finally, we consider a 1,000-paper subgraph of the arXivhigh-energy physics citation network, taken from the 2003KDD Cup (2010). We constructed this subgraph bysubsampling papers involved in citations from Jan 2002through May 2003. Our Gibbs sampler completed 10,000samples in just under 23 hours on a single processor,demonstrating that our algorithm scales to networks with

thousands of actors.

The inferred community hierarchy is shown in Figure 7,where each sub-community has been annotated with its pa-pers’ most frequent title words2. We also show the adja-cency matrix in Figure 8, permuted to match the order ofinferred communities.

As expected, our model learns communites reflecting spe-cific areas of study (an assortative network). The giant 810-paper level 2 community has a sparse citation pattern, im-plying that its papers are not specific to any research topic.This is confirmed by the top 3 keywords: ‘theory’, ‘field’and ‘quantum’, which are general to physics research. Theother level 2 communities under the same parent are morefocused, with specific physical concepts like ‘supergrav-ity’, ‘string’ and ‘pp-wave’. This is also reflected in theadjacency matrix, which is denser among these communi-ties. The remaining super-communities form a dense sub-network mostly separated from the rest, implying narrowerresearch foci. In particular, three of the sub-communitiesinvolve the title keyword “tachyon”, which is absent fromthe giant level 1 community.

7 CONCLUSION

We have developed a nonparametric Multiscale Commu-nity Blockmodel (MSCB) that models social networks interms of the hierarchical community memberships thatactors undertake during interactions. Our model auto-matically infers the structure of the hierarchy while si-multaneously recovering the Multiscale Memberships ofevery actor, setting it apart from hierarchy-discoveringmethods that are restricted to binary hierarchies and/orsingle-community-memberships for actors. Moreover, ourmodel is expressive enough to account for both assortative(within-community) and disassortative (cross-community)interactions, as we have demonstrated through our simula-tion and real dataset experiments. We believe these aspectsmake our model suitable for exploring and understandingreal-world network phenomena.

Acknowledgements

This paper is based on work supported by NSFIIS-0713379, NSF DBI-0546594 (Career), ONRN000140910758, AFOSR FA9550010247, NIH1R01GM093156, and an Alfred P. Sloan ResearchFellowship to Eric P. Xing. Qirong Ho is supportedby a graduate fellowship from the Agency for Science,Technology And Research, Singapore.

2While this output is reminiscent of topic models, our modelis not a topic model. The hierarchy is learnt only from the citationnetwork, without the paper contents.


ReferencesAiroldi, E., D. Blei, S. Fienberg, and E. Xing (2008).

Mixed membership stochastic blockmodels. TheJournal of Machine Learning Research 9, 1981–2014.

Blei, D., T. Griffiths, and M. Jordan (2010). The nestedChinese restaurant process and Bayesian nonpara-metric inference of topic hierarchies. Journal of theACM (JACM) 57(2), 1–30.

Blei, D., A. Ng, and M. Jordan (2003). Latent dirich-let allocation. The Journal of Machine Learning Re-search 3, 993–1022.

Clauset, A., C. Moore, and M. Newman (2008). Hierar-chical structure and the prediction of missing linksin networks. Nature 453(7191), 98–101.

Clauset, A., M. Newman, and C. Moore (2004). Findingcommunity structure in very large networks. Physi-cal Review E 70(6), 66111.

Dawah, H., B. Hawkins, and M. Claridge (1995). Struc-ture of the parasitoid communities of grass-feedingchalcid wasps. Journal of animal ecology 64(6),708–720.

Girvan, M. and M. Newman (2002). Community struc-ture in social and biological networks. Proceedingsof the National Academy of Sciences 99(12), 7821.

Guimera, R. and L. Amaral (2005). Functional cartog-raphy of complex metabolic networks. Nature 433,895–900.

Handcock, M., A. Raftery, and J. Tantrum (2007).Model-based clustering for social networks. Jour-nal of the Royal Statistical Society-Series A 170(2),301–354.

Hoff, P., A. Raftery, and M. Handcock (2002). Latentspace approaches to social network analysis. Journalof the American Statistical Association 97, 1090–1098.

KDD (2010, June). KDD Cup 2003 - Datasets. http://www.cs.cornell.edu/projects/kddcup/datasets.html.

Kemp, C. and J. Tenenbaum (2008). The discoveryof structural form. Proceedings of the NationalAcademy of Sciences 105(31), 10687.

Kemp, C., J. Tenenbaum, T. Griffiths, T. Yamada, andN. Ueda (2006). Learning systems of concepts withan infinite relational model. In Proceedings of theNational Conference on Artificial Intelligence, Vol-ume 21, pp. 381. Menlo Park, CA; Cambridge, MA;London; AAAI Press; MIT Press; 1999.

Krause, A., K. Frank, D. Mason, R. Ulanowicz, andW. Taylor (2003). Compartments revealed in food-web structure. Nature 426(6964), 282–285.

Krebs, V. (2002). Mapping networks of terrorist cells.Connections 24(3), 43–52.

Miller, K., T. Griffiths, and M. Jordan (2009). Nonpara-metric Latent Feature Models for Link Prediction.Advances in Neural Information Processing Systems(NIPS).

Radicchi, F., C. Castellano, F. Cecconi, V. Loreto, andD. Parisi (2004). Defining and identifying com-munities in networks. Proceedings of the NationalAcademy of Sciences 101(9), 2658.

Roy, D., C. Kemp, V. Mansinghka, and J. Tenenbaum(2007). Learning annotated hierarchies from rela-tional data. Advances in neural information process-ing systems 19, 1185.

Teh, Y. and D. Roy (2009). The Mondrian Process. Ad-vances in neural information processing systems.

Wang, Y. and G. Wong (1987). Stochastic blockmodelsfor directed graphs. Journal of the American Statis-tical Association 82(397), 8–19.

Supplemental MaterialMultiscale Community Blockmodel for Network Exploration

A Nested Chinese Restaurant ProcessThe nested Chinese Restaurant Process (nCRP) [1] is an extension of the regular Chinese Restaurant Process (CRP), arecursively-defined prior over positive integers. For concreteness, we shall use the first level of each actor path, ci1, todefine the CRP:

P(ci1 = x | c1:(i−1),1) =

{|{j<i | cj1=x}|

i−1+γ x ∈ {c1:(i−1),1}γ

i−1+γ x is the smallest positive integer not in {c1:(i−1),1}(1)

where γ > 0 is a “concentration” parameter that controls the probability of drawing new integers, and for conciseness wedefine c1:(i−1),1 ≡ (c11, . . . , c(i−1)1). The nCRP is essentially a hierarchy of CRP priors, beginning with a single CRP priorat the top level. With each unique integer x seen at the top-level prior, we associate a child CRP prior with |{i | ci1 = x}|observations, resulting in a two-level tree of CRP priors. We can repeat this process ad infinitum on the newly-created childpriors, resulting in an infinite-level tree of CRP priors, though we only use a K-level nCRP. All CRP priors in the nCRP sharethe same concentration parameter γ.

Now we can finish describing our generative process: for each actor i ∈ N , we can sample ci ∼ nCRP(γ) using therecursive nCRP definition:

P(cik = x | c1:(i−1), ci,1:(k−1)) =|{j<i | cj,1:(k−1)=ci,1:(k−1)∧cjk=x}||{j<i | cj,1:(k−1)=ci,1:(k−1)}|+γ x ∈ {cjk | (j < i) ∧ cj,1:(k−1) = ci,1:(k−1)}

γ

|{j<i | cj,1:(k−1)=ci,1:(k−1)}|+γ x is the smallest positive integer not in the above set.(2)

B Stick Breaking ProcessesStick breaking constructions work as follows: Consider a stick of length 1. Draw Vi1 ∼ Beta(mπ, (1−m)π). Let θi1 = Vi1and let 1 − θi1 be the remainder of the stick after chopping off this length Vi1. To calculate the length θi2, draw Vi2 ∼Beta(mπ, (1 − m)π) and chop off this fraction of the remainder of the stick, giving θi2 = Vi2(1 − Vi1). Thus Vik is thefraction to chop off from the stick’s remainder, and θik is the length of the kth stick that was chopped off. In general, we drawVik ∼ Beta(mπ, (1−m)π) from k = 1 to k =∞ and the corresponding {θik}∞k=1 is defined below:

θik = Vik

k−1∏u=1

(1− Viu) (3)

This process is known as the two-parameter GEM distribution [1] (although we refer to it as Stick(m,π)) and draws fromStick(m,π) are denoted as θi ∼ Stick(m,π). m > 0 influences the mean of θi, and π > 0 influences its variance. Becausethe hierarchy is only learnt up to depth K, we truncate the Stick(m,π) distribution at level K. The stick breaking prior makesit more intuitive to bias interactions toward coarser or finer levels compared to a Dirichlet prior with either a single parameter(which is not expressive enough), or K − 1 parameters (which may be too expressive).

C Collapsed Gibbs SamplerExact inference on our model is intractable, so we derive a collapsed Gibbs sampling scheme for posterior inference. The θ’sand B’s are integrated out for faster mixing, so we only have to sample z and c.

1

Sampling levels The distribution of z→ij conditioned on all other variables is

P(z→ij | c, z−(→ij),E, γ,m, π, λ1, λ2)

∝ P(Eij , z→ij | c, z−(→ij),E−(ij), γ,m, π, λ1, λ2)

= P(Eij | c, z,E−(ij), γ,m, π, λ1, λ2)P(z→ij | c, z−(←ij),E−(ij), γ,m, π, λ1, λ2)

= P(Eij | c, z,E−(ij), λ1, λ2)P(z→ij | zi,(−j),m, π) (4)

where E−(ij) is the set of all edges except Eij , and zi,(−j) = {z→i·, z←·i} \ z→ij . The first term, for a particular value ofz→ij , is

First term =

{Γ(a+b+λ1+λ2)

Γ(a+λ1)Γ(b+λ2) ·Γ(a+Eij+λ1)Γ(b+(1−Eij)+λ2)

Γ(a+b+1+λ1+λ2) SijB 6= 0

0 otherwise

a =∣∣∣{(x, y) | (x, y) 6= (i, j),SxyB = SijB , Exy = 1

}∣∣∣b =

∣∣∣{(x, y) | (x, y) 6= (i, j),SxyB = SijB , Exy = 0}∣∣∣ (5)

The second term can be computed by conditioning on the stick-breaking lengths V1, ..., VK associated with z→ij :

P(z→ij = k | zi,(−j),m, π) = E[I(z→ij = k) | zi,(−j),m, π

]= E

[E[I(z→ij = k) | Vi1, ..., Vik, zi,(−j),m, π

]]= E

[Vik

k−1∏u=1

(1− Viu) | zi,(−j),m, π

]

= E[Vik | zi,(−j),m, π]k−1∏u=1

E[(1− Viu) | zi,(−j),m, π]

=mπ +#[zi,(−j) = k]

π +#[zi,(−j) ≥ k]

k−1∏u=1

(1−m)π +#[zi,(−j) > u]

π +#[zi,(−j) ≥ u](6)

Since we have limited the maximum depth to K, we simply ignore the event z→ij > K, and renormalize the distribution ofz→ij over the domain {1, . . . ,K}. The distribution of z←ij is derived in similar fashion.

Sampling paths The distribution of ci conditioned on all other variables is

P(ci | c−i, z,E, γ,m, π, λ1, λ2)

∝ P(ci,E(i·),(·i) | c−i, z,E−(i·),−(·i), γ,m, π, λ1, λ2)

= P(E(i,·),(·,i) | c, z,E−(i·),−(·i), γ,m, π, λ1, λ2)P(ci | c−i, z,E−(i·),−(·i), γ,m, π, λ1, λ2)

= P(E(i·),(·i) | c, z,E−(i·),−(·i), λ1, λ2)P(ci | c−i, γ) (7)

where E(i·),(·i) = {Exy | x = i or y = i} is the set of all edges Eij whose distributions depend on ci, and E−(i·),−(·i) is itscomplement. The second term can be computed using the recursive nCRP definition

P(cik = x | c1:(i−1), ci,1:(k−1), γ) =|{j<i | cj,1:(k−1)=ci,1:(k−1)∧cjk=x}||{j<i | cj,1:(k−1)=ci,1:(k−1)}|+γ x ∈ {cjk | (j < i) ∧ cj,1:(k−1) = ci,1:(k−1)}

γ

|{j<i | cj,1:(k−1)=ci,1:(k−1)}|+γ x is the smallest positive integer not in the above set.(8)

2

while the first term, for a particular value of ci, is

First term =

{∏B∈B(i,·),(·,i)

Γ(gB+hB+λ1+λ2)Γ(gB+λ1)Γ(hB+λ2) ·

Γ(gB+rB+λ1)Γ(hB+sB+λ2)Γ(gB+hB+rB+sB+λ1+λ2) ∀Exy ∈ E(i·),(·i), S

xyB 6= 0

0 otherwise

gB =∣∣{(x, y) | Exy ∈ E−(i·),−(·i), S

xyB = B,Exy = 1

}∣∣hB =

∣∣{(x, y) | Exy ∈ E−(i·),−(·i), SxyB = B,Exy = 0

}∣∣rB =

∣∣{(x, y) | Exy ∈ E(i·),(·i), SxyB = B,Exy = 1

}∣∣sB =

∣∣{(x, y) | Exy ∈ E(i·),(·i), SxyB = B,Exy = 0

}∣∣ (9)

where B(i·),(·i) = {B ∈ B | ∃(i, j), (Eij ∈ E(i·),(·i),SijB = B)} is the set of allB ∈ B associated with some edge in E(i·),(·i)

through SB .

D Simulation Framework Details and Additional Experiments

D.1 K=2 experiment detailsFor K = 2, the 4 types of B’s explored are:

1. on-diagonal, low noise - Bon−diagonal = (.4, .7), Boff−diagonal = (.02, .02);

2. on-diagonal, high noise - Bon−diagonal = (.3, .6), Boff−diagonal = (.1, .1);

3. off-diagonal, low noise - Bon−diagonal = (.02, .02), Boff−diagonal = (.4, .7);

4. off-diagonal, high noise - Bon−diagonal = (.1, .1), Boff−diagonal = (.3, .6).

Bon−diagonal = (a, b) means that actors interacting in the same level-1 community do so with probability a, while actorsinteracting in the same level 2 community do so with probability b. Boff−diagonal gives analogous interaction probabilitiesfor different communities on the same level.

The experiment had N = 150 actors and θ = (.25, .75) for all actors.

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, on−diagonal B, low noise

HSpectralOur Model

(a)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, on−diagonal B, high noise

HSpectralOur Model

(b)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, off−diagonal B, low noise

HSpectralOur Model

(c)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

tree recovery, off−diagonal B, high noise

HSpectralOur Model

(d)

Figure 1: K = 2 experiments

The Gibbs sampler was run for 1,500 burn-in iterations on each experiment. The “number of branches at level 1” refers tonumber of branches of size ≥ 5 (since there are often branches of size 1 or 2). For fairness, the ”correct” number of level 1branches given to spectral clustering is also the number of branches of size ≥ 5. Note that spectral clustering was given thethe number of first level clusters as an advantage (with binary splits in the deeper levels).

D.2 K=3 experiment resultsFor K = 3, the 4 types of B’s explored are

1. on-diagonal, low noise - Bon−diagonal = (.5, .7, .9), Boff−diagonal = (.02, .02, .02);

2. on-diagonal, high noise - Bon−diagonal = (.5, .7, .9), Boff−diagonal = (.2, .2, .2);

3

3. off-diagonal, low noise - Bon−diagonal = (.02, .02, .02), Boff−diagonal = (.5, .7, .9);

4. off-diagonal, high noise - Bon−diagonal = (.2, .2, .2), Boff−diagonal = (.5, .7, .9).

Similar to the K = 2 experiments, Bon−diagonal = (a, b, c) means that actors interacting in the same level-1, 2 and 3communities do so with probabilities a, b and c respectively. Boff−diagonal gives analogous interaction probabilities fordifferent communities on the same level.

The experiment had N = 300 actors and θ = (.15, .3, .55) for all actors.

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

K=3, low noise, on−diagonal B

HSpectralOur Model

(a)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

K=3, high noise, on−diagonal B

HSpectralOur Model

(b)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

K=3, low noise, off−diagonal B

HSpectralOur Model

(c)

1 2 3 4 5 60

0.2

0.4

0.6

0.8

1


f1 s

core

K=3, high noise, off−diagonal B

HSpectralOur Model

(d)

Figure 2: K = 3 experiments

The Gibbs sampler was run for 3,000 burn-in iterations on each experiment. As in the previous case, spectral clusteringwas given the the number of first level clusters as an advantage (with binary splits in the deeper levels). Note that spectralclustering actually performs better in the “high-noise off-diagonal” case than the ”low noise off-diagonal” case, since thehigher noise means more assortativity for the off-diagonal case.

E Qualitative analysis details

E.1 Grass Food Web DatasetWe ran our Gibbs sampler on the full network to infer the community hierarchy and actor Multiscale Memberships. Themodel parameters were chosen via gridsearch over (λ1, λ2) ∈ {.1, .3, .5, .7, .9}2, according to the marginal log likelihood(estimated using 10,000 importance samples). In line with the held-out experiments, we fixed the remaining parameters toγ = 1,m = 0.5, π = 0.5. Finally, the hierarchy depth was set to K = 2. We ran our Gibbs sampler using the optimalparameters λ1 = 0.1, λ2 = 0.5 for 10,000 iterations of burn-in, and took 100 samples with a lag time of 5 iterations. Aplateauing log complete likelihood plot revealed that our sampler coverged well before the last iteration.

The Gibbs samples represent a posterior distribution over paths ci. In order to represent the “average” of this posterior,we generated a consensus sample by counting the number of times each pair of actors shared the same community hierarchyposition, over all samples. Actors that shared positions in > 50% of all samples were assigned to the same path in theconsensus. For levels z→ij and z←ij , we simply took the mode over all samples. In a final post-processing step to reducevisual clutter, we merged bottom-level (i.e. level-2) communities with ≤ 5 actors into one community under the same parent.

E.2 High Energy Physics datasetWe applied the same parameter selection and post-processing as the previous dataset; the optimal gridsearch parameters were(λ1 = 0.7, λ2 = 0.5). Each of the 25 parameter combinations required less than 6 hours to test on a single processor core.We ran our Gibbs sampler for 10,000 iterations of burn-in, and took 10 samples with a lag time of 50 iterations. The entireGibbs sampling procedure completed in just under 23 hours on a single processor core.

References[1] D.M. Blei, T.L. Griffiths, and M.I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference

of topic hierarchies. Journal of the ACM (JACM), 57(2):1–30, 2010.

4

multiscale community blockmodel for network explorationlsong/papers/hoparsonxin11.pdf · hierarchy:...

Documents