On decomposable random graphs and link prediction
models
Mohamad Elmasri
Department of Mathematics and Statistics
McGill University, Montreal
August 2017
A thesis submitted to McGill University in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
© Mohamad Elmasri 2017
Abstract
In combinatorial graph theory, decomposable graphs are a class of graphs that are guaranteed to decompose into conditionally independent components, known as maximal cliques. In statistics, decomposable graphs are widely used in the field of graphical models and Bayesian model determination, where the dependency structure among high-dimensional data or model parameters is unknown. Decomposable graphs are hence used as functional priors over large covariance matrices or as priors over hierarchies of model parameters. One such example is the Gaussian graphical model (Lauritzen, 1996; Whittaker, 2009), which has seen success in a variety of applications. Beyond this framework, decomposable graphs are seldom used in statistical applications.
Random graphs, on the other hand, have recently seen much research interest, where the
focus is on developing methodologies for models on relational data in the form of random
binary matrices. A principal component of such models is to assume a network framework by mapping the relations to edges of the network, and data sources to nodes. The likelihood of an edge is assumed to be driven by affinity parameters of the associated nodes.
The first part of this work proposes a framework for modelling random decomposable graphs, using tools similar to those used for random graphs. Rather than modelling edges between nodes, the framework models the bipartite links between the graph nodes and latent community nodes, through node affinity parameters. The latent communities are assumed to represent the maximal cliques in decomposable graphs. Under the proposed framework, simple Markov update rules are given, with explicit lower bounds on their mixing time (time until convergence). Under a set of conditions, an exact expression for the expected number of maximal cliques per node is given.
The second part of this work illustrates a new application of decomposable graphs that is
motivated by the proposed framework. Combinatorially, there is a unique set of subgraphs of
any maximal clique. Treating maximal cliques as latent communities allows the treatment of
subgraphs of maximal cliques as sub-clusters within each community. The proposed framework is extended to incorporate a sub-clustering component, which enables the modelling
of decomposable graphs and simultaneous modelling of the sub-clustering dynamics forming
within each larger community.
The final part of this work deals with the topic of link prediction in networks with presence-only data, where absence is only an indication of missing information and not of a prohibited link. The work is motivated by a particular example: identifying undocumented or potential interactions among species from the set of available documented interactions, with the aim of guiding the sampling of ecological networks by identifying the most likely undocumented interactions. The problem is framed in a bipartite graph structure, where edges represent interactions between pairs of species. The work first constructs a Bayesian latent score model, which ranks observed edges from the most probable down to the least certain. To improve scoring efficiency, and thus link prediction, the work incorporates a Markov random field component informed by phylogenetic relationships among species. The model is validated using two host-parasite networks constructed from published databases, the Global Mammal Parasite Database and the Enhanced Infectious Diseases database, each with thousands of pairwise interactions. Finally, the model is extended by integrating a correction mechanism for missing interactions in the observed data, which proves valuable in reducing uncertainty in unobserved interactions.
Résumé
In combinatorial graph theory, decomposable graphs are a type of graph guaranteed to decompose into conditionally independent components, called maximal cliques. In statistics, decomposable graphs are commonly used in the field of graphical models and in Bayesian model determination, where the dependency structure among high-dimensional data or model parameters is unknown. Decomposable graphs are thus used as functional priors over large covariance matrices or as priors over hierarchies of model parameters. One example of this use is the Gaussian graphical model (Lauritzen, 1996; Whittaker, 2009), which has been applied successfully in a large number of cases.

Random graphs have generated much interest, in particular for relational data in the form of random binary matrices. A principal component of such models is the definition of a network framework, associating relations with the edges of the network and data sources with its nodes.

The first part of this work proposes a modelling framework for random decomposable graphs, using tools similar to those used for random graphs. Rather than modelling the edges between nodes, the framework models the bipartite links between the graph nodes and latent community nodes, through node affinity parameters. The hypothesis is that the latent communities represent the maximal cliques of decomposable graphs. Within the proposed framework, simple Markov update rules are given explicit lower bounds on their mixing time (time until convergence).

The second part of this work illustrates a new application of decomposable graphs building on the proposed framework. Combinatorially, there is a unique set of subgraphs for any maximal clique. Treating each maximal clique as a latent community makes it possible to treat the subgraphs of maximal cliques as sub-clusters within each community. The proposed framework is extended to incorporate a sub-clustering component, which allows the modelling of decomposable graphs and, simultaneously, of the sub-clustering dynamics forming within each larger community.

The final part of this work deals with link prediction in networks with presence-only data, where absence is only an indication of missing data and not of a prohibited link. This work builds on a specific example: identifying undocumented or potential interactions among species from the set of documented interactions. The objective is to help guide the sampling of ecological networks by identifying the most likely undocumented interactions. The problem is framed as a bipartite graph structure, where edges represent interactions between pairs of species. The work first develops a Bayesian latent score model that ranks the observed edges from the most probable to the least certain. To improve scoring efficiency, and hence link prediction, the work incorporates a Markov random field component that uses the phylogenetic relationships among species. The model is validated using two host-parasite networks built from published databases, the Global Mammal Parasite Database and the Enhanced Infectious Diseases database, each containing thousands of pairwise interactions. Finally, the model is extended by integrating a correction mechanism for missing interactions in the observed data, which proves effective in reducing uncertainty in unobserved interactions.
Acknowledgments
First and foremost, I am sincerely grateful to my supervisor, Professor David A. Stephens.
Since my early days in the Doctoral programme, he encouraged me to follow my own research
path, gave me ample room to learn and grow academically and professionally, was generous
with financial support, and always provided valuable suggestions.
I am also grateful to the faculty of the Department of Mathematics and Statistics for their
excellent graduate courses that were essential to my learning. Thanks to the administrative and IT staff of the department for their help through many applications and other paperwork, and thanks to the cleaning team that kept our offices tidy and boards clean.
I am especially grateful to Professor Russell Steele, for his unyielding optimism and
encouragement, and for being instrumental in shaping the student-run Stat and Biology
Exchange group (S-Bex). A large part of this work has been motivated by the problems and
ideas discussed in this interdisciplinary group. Thanks to Amanda Winegardner and Zofia
Taranu for organizing S-Bex and for making it such an enjoyable experience. I would also
like to thank Maxwell Farrell, an S-Bex member, with whom I spent much time discussing
ideas and collaborating on research work.
Thanks to all the friends who helped me during those years: Patrick Montjourides, Oscar Xacur, Ivo Pendev, Jeno Grebennikov, Hassein Asmar, and many others; I cannot stress how thankful I am. I am very grateful to Friedrich Huebler from the UNESCO Institute for Statistics, for his professional mentorship and support.
I am indebted to my family in Montreal, for providing a second home and for all the delicious food and good times. Importantly, I'd like to express my never-ending gratitude for a long list of things to my parents Maha and Ahmed and my siblings, Fatima, Ebrahim, Maryam and Noor.
No words can describe how grateful I am to my wife Sheena Bell, who stood by my side all along this journey. Without you, this simply could not have been accomplished.
I would like to thank the Lorne Trottier Science Accelerator Fellowships, the Fonds de recherche du Québec - Nature et technologies (FRQNT), and the Department Graduate Awards for their generous financial support. I would also like to thank the examiners and defence committee for their comments and valuable feedback, and thanks to everyone who helped in editing this document.
Contents
Abstract i
Résumé iii
Acknowledgments v
List of Figures xvi
List of Tables xviii
1 Introduction 1
1.1 Thesis contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 8
2.1 Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Key properties of Poisson processes . . . . . . . . . . . . . . . . . . . 11
2.1.2 The Cox process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Bayesian models for exchangeable graphs . . . . . . . . . . . . . . . . . . . . 14
2.2.1 The de Finetti representation of sequences . . . . . . . . . . . . . . . 16
2.2.2 The Aldous-Hoover representation theorem for random graphs . . . . 18
2.2.3 Exchangeable graphs as exchangeable 2-arrays . . . . . . . . . . . . . 20
2.2.4 The Kallenberg representation theorem for random graphs . . . . . . 22
2.2.5 Exchangeable graphs as exchangeable measures on R_+^2 . . . . . . . . 25
2.3 Completely random measures . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Sampling CRM from unit rate Poisson processes . . . . . . . . . . . . 31
2.3.1.1 Homogeneous CRMs . . . . . . . . . . . . . . . . . . . . . . 33
2.3.1.2 Inhomogeneous CRMs . . . . . . . . . . . . . . . . . . . . . 33
3 Decomposable random graphs 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Decomposable graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2 Models for decomposable graphs . . . . . . . . . . . . . . . . . . . . . 40
3.3 Decomposable random graphs by conditioning on junction trees . . . . . . . 42
3.3.1 Decomposable graphs as point processes . . . . . . . . . . . . . . 47
3.3.2 Finite graphs forming from domain restrictions . . . . . . . . . . . . 50
3.3.2.1 Augmentation by an identity matrix . . . . . . . . . . . . . 53
3.3.2.2 Likelihood factorization with respect to Z . . . . . . . . . . 57
3.4 Exact sampling conditional on a junction tree . . . . . . . . . . . . . . . . . 59
3.4.1 Sequential sampling with finite steps . . . . . . . . . . . . . . . . 60
3.4.2 Sampling using a Markov stopped process . . . . . . . . . . . . . . . 61
3.4.2.1 Mixing time of the stopped process . . . . . . . . . . . . . . 62
3.5 Edge updates on a junction tree . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6.0.2 On the joint distribution of a realization . . . . . . . . . . . 69
3.6.1 The multiplicative model . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6.1.1 Posterior distribution for the special case of a single marginal 72
3.6.1.2 Inference by Gibbs sampling . . . . . . . . . . . . . . . . . . 76
3.6.2 The log transformed multiplicative model . . . . . . . . . . . . . . . . 76
3.6.2.1 Posterior distribution for the two marginals . . . . . . . . . 77
3.7 Model properties: Expected number of cliques per node . . . . . . . . . . . . 79
3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4 Sub-clustering in decomposable graphs and size-varying junction trees 88
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Subgraphs of cliques as sub-clusters . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Permissible moves in the bipartite relation . . . . . . . . . . . . . . . . . . . 90
4.3.1 Disconnecting single-clique nodes . . . . . . . . . . . . . . . . . . . . 92
4.3.2 Disconnecting multi-clique nodes . . . . . . . . . . . . . . . . . . . . 94
4.3.3 Connecting nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 Promoting a sub-clique to be maximal . . . . . . . . . . . . . . . . . . . . . 101
4.5 Markov updates under size-varying junction trees . . . . . . . . . . . . . . . 103
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5 A Bayesian model for link prediction in ecological networks 107
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.2 Bayesian hierarchical model for prediction of ecological interactions . . . . . 109
5.2.1 Network-based latent score model . . . . . . . . . . . . . . . . . . . . 109
5.2.2 Prior and Posterior distribution of choice parameters . . . . . . . . . 113
5.2.3 Markov Chain Monte Carlo algorithm . . . . . . . . . . . . . . . . . . 115
5.3 Uncertainty in unobserved interactions . . . . . . . . . . . . . . . . . . . . . 116
5.3.1 Markov Chain Monte Carlo algorithm . . . . . . . . . . . . . . . . . . 118
5.4 A case study with host-parasite networks . . . . . . . . . . . . . . . . . . . . 119
5.4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.4.2 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4.3 Prediction comparison by cross-validation . . . . . . . . . . . . . . . 122
5.4.4 Uncertainty in unobserved interactions . . . . . . . . . . . . . . . . . 126
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Appendices 133
A Latent formulation and sampling 134
A.1 Existence of the joint distribution . . . . . . . . . . . . . . . . . . . . . . . . 138
A.1.1 Parametrization using an exponential distribution . . . . . . . . . . . 138
A.2 Latent score sampling with uncertainty . . . . . . . . . . . . . . . . . . . . . 139
B Details on the MCMC algorithm 141
C Additional results 145
C.1 Posterior distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
C.2 Representative trace plots and diagnostics . . . . . . . . . . . . . . . . . . . 146
C.3 Parameter numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
C.4 Uncertainty - histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
C.5 Interaction matrices for subsets - Carnivora and Rodentia . . . . . . . . . . . 150
C.6 ROC with and without g for full GMPD and EID2 databases . . . . . . . . . 151
C.7 Percentage of recovered pairwise interactions . . . . . . . . . . . . . . . . . . 154
C.8 Posterior degree distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
C.9 Hyperparameters and effective size . . . . . . . . . . . . . . . . . . . . . 156
6 Conclusion and future research 158
6.1 Future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
List of Figures
2.1 An example of a simple graph generated under the Kallenberg representation. The top left corner shows a generated Poisson point process (θi, ϑi), with restrictions on the location (x-axis) and weight (y-axis) domains shown in dotted grey lines; points outside the restricted cube are shown with grey circles. Using the point process and the cohesion function W shown by the heat map in the top right corner, we generate a random simple graph as shown in the bottom left corner, where only nodes with active edges are shown; in black circles are nodes within the restricted cube, in grey are nodes outside the restricted cube though with active edges. The graph is shown in the bottom right corner with the same colour coding. . . . 28
3.1 An undirected decomposable graph of 4 cliques of size 3: ABC, BEF, BCE, CDE. 38
3.2 A decomposable graph and its bipartite graph linking junction trees of cliques
and perfect orderings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 An example of arbitrarily adding an edge between nodes in a decomposable graph: on the left is the original graph; in the middle, node E joins clique AD, causing a change in the junction tree while preserving decomposability; on the right, node F joins clique ABC, abolishing decomposability by forming the cycle ADEF with no inner chord. . . . 44
3.4 A realization of a decomposable graph in 3.4d from the point process in 3.4a and the tree in 3.4b. The grey area in 3.4a is the edge-greedy partition (r, ro], where only one extra node (in blue) was needed to guarantee all active cliques are maximal, since Zr′,r(θ′3 ∩ ·) is a subset of Zr′,r(θ′6 ∩ ·) and of Zr′,r(θ′7 ∩ ·). 3.4c is the biadjacency matrix of active (clique-)nodes representing the graph. . . . 52
3.5 Relaxation of (3.13) by removing the empty rows in the realization of Figure
3.4c and augmenting the results with an identity matrix. . . . . . . . . . . . 54
3.6 A realization of a 5-node junction tree from (3.21), on the left is the original
directed weighted tree where Wk = W (ϑ′k, ϑi) for a random ϑi, on the right is
the undirected tree by expectation where W∗ = E(W ). . . . . . . . . . . . . 65
3.7 Moving along the bipartite graph of Figure 3.2, from junction tree T1 to T2,
through severing and reconnecting the edge C2, C3 (dotted lines) to C2, C1. 66
3.8 Density of W (x, y) = exp(−(x+ y)). . . . . . . . . . . . . . . . . . . . . . . 72
3.9 Different-size realizations from W (x, y) = exp(−(λ1x + λ2y)); the 10-node tree on the top left is sampled according to (3.21) with a (c′ = 1, r′ = 10)-truncation. The top and middle panels are the decomposable graphs resulting from different size-realization settings; the middle panel illustrates the effect of varying λ2 for the same parameter set (θi, ϑi) generated from a (c = 2, r = 50)-truncation; the corresponding adjacency matrices are in the bottom panel. . . . 73
3.10 Junction tree, decomposable graph, and posterior MCMC trace plots for three randomly selected nodes, where the fi are i.i.d. Beta(α, 1), for the single marginal distribution of W (x, y) = f(y). . . . 75
3.11 Junction tree, decomposable graph, and the posterior MCMC trace plot of
ϑi = ϑ = 0.3, for the case W (ϑ′k, ϑi) = ϑ. . . . . . . . . . . . . . . . . . . . . 75
3.12 A binary 3-regular tree, with 10 nodes including the root node ϑ′0, over two levels (L = 2). . . . 81
4.1 A 4-node clique (left) and all its unique subgraphs, including single-node
cliques, for a total of 15 subgraphs. . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 An example of a biadjacency matrix (left), with 5 maximal cliques, starred and in red, and 10 sub-cliques. The corresponding junction tree (top right) has all sub-cliques and their ascendants circled and connected with dashed lines, with maximal cliques in red solid lines. The decomposable graph (bottom right) summarizes the biadjacency matrix. . . . 91
4.3 Examples of disconnecting single-clique nodes of the graph in Figure 4.2. The
top panel shows the case when disconnecting node A from clique ABCD (top
left), where BCD is still maximal, and the previous sub-clique AB is now
maximal, adding another clique-node to the junction tree joined at BCD (top
right), while discarding all other sub-cliques that contain A with nodes C or
D, as AC. The middle row shows the case when disconnecting node G from
FGH (middle left), where FH is still maximal, while the previous sub-clique
GH is now maximal adding an extra clique-node to the junction tree (middle
right) connected to FH. The bottom panel shows the case when a maximal
clique becomes sub-maximal, by disconnecting the node E from CEF (bottom
left), where CF is now a sub-clique of CEF (shown dashed and in blue),
thus removing the corresponding clique-node from the junction tree (bottom
right), while connecting all previous CEF edges to CDF. The new maximal
clique-node EF adds an edge to the tree with CDF. . . . . . . . . . . . . . . 95
4.4 An example: disconnecting a multi-clique node D from the maximal clique
ABCD in Z and G, where the resulting graph G ′ is decomposable albeit Z′ is
not its representative bipartite matrix; missing the maximal clique BCD in G ′. 98
4.5 Examples of disconnecting multi-clique nodes of the example in Figure 4.2.
The graph in the top panel (top left) shows the example of disconnecting C
from ABCD, cases (i.c) and (ii.a) of Proposition 6, where the separator CD
belongs to the sub-clique ACD, making it maximal. The junction tree (top
right) is rewired accordingly, and no sub-clique is discarded. The graph in
the bottom panel (bottom left) illustrates the case of disconnecting H from
FGH to form FG, while discarding the sub-clique GH, as in (i.a) and (ii.a)
of Proposition 6, since FG∩HI is empty, the junction tree (bottom right) is
rewired accordingly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6 An example of connecting a node to a sub-clique in an adjacent maximal
clique. Node H connects to the sub-clique EF (left) from the example in
Figure 4.2, by (iii) of Corollary 5 this forms the new maximal clique EFH
connecting maximal cliques CEF and FGH. . . . . . . . . . . . . . . . . . . 102
5.1 Left-ordered interaction matrix Z of the GMPD (left) and EID2 (right) databases. 120
5.2 Degree distribution of hosts (red crosses) and parasites (blue stars) on log-
scale, for the GMPD (left) and EID2 (right) databases. . . . . . . . . . . . . 121
5.3 ROC comparison of the latent score (LS) network model with three variations and the regular NN algorithm. The proposed LS full model in black, the affinity-only variation in cyan, the phylogeny-only variation in grey, and the weighted-by-counts version in green. The regular NN algorithm in brown. All ROC curves are based on an average of 10-fold cross-validations. . . . 124
5.4 Posterior association matrix comparison, for the GMPD (top panel) and EID2 (bottom panel), between the affinity-only (left), phylogeny-only (middle) and full model (right). . . . 126
5.5 Comparison of ROC curves for the model with g (black) and without g (grey),
for GMPD-Carnivora on the left and the EID2-Rodentia on the right. . . . . 128
C.1 Boxplots of posterior estimates for the host and parasite parameters with the
80 highest medians, and the posterior distributions of the scale parameter,
dashed horizontal lines are the mean posterior and 95% credible intervals, for
the GMPD (top panel) and EID2 (bottom panel). . . . . . . . . . . . . . . . 146
C.2 Trace plots for the GMPD and EID2: host (top) and parasite (middle) of highest median posterior, and the similarity matrix scaling parameter (bottom). . . . 147
C.3 ACF plots and effective sample sizes for the GMPD and EID2: host (top) and parasite (middle) of highest median posterior, and the similarity matrix scaling parameter (bottom). . . . 147
C.4 Posterior histogram for g for the GMPD (left) and EID2 (right) databases. . 149
C.5 Comparison in posterior log-probability between observed and unobserved in-
teractions, for the model without g (left) and with g (right), for the GMPD-
Carnivora database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
C.6 Association matrices of the whole GMPD-Carnivora subset: Observed (left),
posterior for the model without g (middle), posterior for the model with g
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
C.7 Association matrices of the whole EID2-Rodentia subset: Observed (left),
posterior for the model without g (middle), posterior for the model with g
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
C.8 Comparison of ROC curves for the full dataset, for the models with(out) g. . 151
C.9 Posterior association matrices for the full datasets. . . . . . . . . . . . . . . . 152
C.10 Number of pairwise recovered interactions from the original data. . . . . . . 154
C.11 Comparison of degree distribution on log-scale, for the full model (without
accounting for uncertainty) and the model with g, GMPD dataset. . . . . . . 155
C.12 Comparison of degree distribution on log-scale, for the full model (without
accounting for uncertainty) and the model with g, EID2 dataset. . . . . . . . 156
C.13 Trace plots of convergence of three chains started at different values for the expected value of the hyperparameter, for the GMPD dataset. . . . 157
List of Tables
2.1 Summary of some known models admitting the graphon representation. . . . 22
3.1 Possible perfect orderings of the cliques of Figure 3.1 . . . . . . . . . . . . 39
3.2 A summary table of the number of clique-nodes at distance k from clique-nodes at level ℓ ≤ L for a d-regular tree with L levels, where ⌊x⌋ is the floor operator and d̄ = (d − 1). . . . 83
4.1 Multi-clique nodes of example in Figure 4.2, their disconnect from maximal
cliques, separator sets and possible sub-cliques to become maximal. . . . . . 98
5.1 Area under the curve and prediction values for tested models . . . . . . . . . 125
5.2 Two-sided Wilcoxon signed rank test to compare model AUCs . . . . . . . . 125
5.3 AUC comparison between models with g and without g on the GMPD and
EID2 databases and clade subsets . . . . . . . . . . . . . . . . . . . . . . . . 129
5.4 Percentage of observed interactions correctly predicted in the held-out portion
of the validation set (in parentheses) and in the full data, for the GMPD and
EID2 databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
C.1 Posterior means, Monte Carlo standard errors and credible intervals for the
highest anity parameters and the scale parameter. . . . . . . . . . . . . . . 148
C.2 AUC comparison between models with g and without g, on the GMPD database and clade subsets, with different variations of the model . . . . . . 152
C.3 Percentage of observed interactions correctly predicted in the held-out portion of the validation set (in parentheses) and in the full data, for the GMPD database . . . 152
C.4 AUC comparison between models with g and without g, on the EID2 database and clade subsets, with different variations of the model . . . . . . 153
C.5 Percentage of observed interactions correctly predicted in the held-out portion of the validation set (in parentheses) and in the full data, for the EID2 database . . . 153
Chapter 1
Introduction
With technology advancing, data gathering capacity is consistently improving and new forms of data are emerging. Some forms adhere to the classical one-dimensional sequential observations, which represent the randomness in the data sources. Others differ from the classical type, in the sense that they represent relationships between two or more data sources or objects. Structured relational data is one such new form of data, which has gained prominence in graph- and network-based technologies, where pairwise relationships between network nodes are of interest. For example, structured relational data has proved essential in biology as a tool to summarize complex multi-way relationships amongst organisms, and to predict unknown interactions. Applications of such data extend to many other fields.
The statistical community, on the other hand, is consistently developing empirical models to analyze newly emerging forms of data. For the case of relational data, some popular recently developed models are the blockmodel (Wang and Wong, 1987), the latent distance model (Hoff et al., 2002), and the infinite relational model (Kemp et al., 2006), together with their variations. Under certain assumptions, some models provide strong theoretical and asymptotic results; nonetheless, most are intrinsically misspecified for many real-world applications, especially for large networks, where a sparseness property is essential (Newman, 2010). Sparseness is generally defined in terms of the proportion of relations (edges) to the number of objects (nodes): if edges grow linearly with respect to the number of nodes, the dataset is described as sparse; otherwise, as dense.
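The sparse-versus-dense distinction can be sketched numerically. The snippet below is a minimal illustration only; the function name and the two growth rates (linear, 3n edges, versus quadratic, n(n−1)/4 edges) are hypothetical choices, not taken from the thesis:

```python
def edge_density(n_nodes: int, n_edges: int) -> float:
    """Fraction of the n(n-1)/2 possible edges that are present."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))

sizes = (100, 1000, 10000)

# Sparse regime: edges grow linearly in the number of nodes,
# so the density vanishes as the network grows.
sparse = [edge_density(n, 3 * n) for n in sizes]

# Dense regime: edges grow quadratically, so the density stays constant.
dense = [edge_density(n, n * (n - 1) // 4) for n in sizes]
```

Under these assumed growth rates, `sparse` shrinks toward zero as n grows while `dense` stays at one half, which is the sense in which most large real-world networks are sparse.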
The lack of the sparseness property in most initial models is due to the theoretical foundation those models are built upon, which regards relational data as random observations of random arrays or matrices. Progress has been achieved recently in this domain by adopting a different theoretical foundation that builds on continuous-time stochastic processes. In this sense, a real-world relational dataset is seen as a sample from an unknown continuous-time process (Borgs et al., 2014a,b, 2015, 2016; Caron and Fox, 2014; Janson, 2016; Veitch and Roy, 2015). The new framework provides many rich results associated with stochastic processes, though challenges still exist in the area of nonparametric estimation of such multidimensional processes.
Models for relational data were initially influenced by the introduction of probabilistic methods to graph theory, most notably the work of Erdös and Rényi (1959), which studied the asymptotic probability of graph connectivity. This introduction gave rise to a branch of mathematics known as random graph theory, which includes most of the probabilistic models for relational data. A larger branch of graph theory existed much earlier, dating back to 1736 and a paper written by Leonhard Euler on the Seven Bridges problem (Biggs et al., 1976). Since then, research in graph theory has mostly fallen within the domain of discrete mathematics and produced many rich results, such as the characterizations of different types of graphs and their properties, which have also seen great applications in statistics outside the field of relational data modelling.
Decomposable (chordal) graphs are one such type of well-studied object in discrete mathematics
that has seen wide applicability in statistics. A graph is said to be decomposable
if, and only if, every cycle of four or more nodes has a chord: an edge joining two
non-consecutive nodes of the cycle, and hence not belonging to the cycle itself. This property
ensures that the graph can be decomposed into multiple conditionally independent components,
known as maximal cliques. If one views graph nodes as random
variates and graph edges as pairwise relations between variates, then the decomposability property
translates to conditional independence between subsets of variates, or what is known as the
Markov property. This analogy enabled decomposable graphs to be used as functional priors
over large covariance matrices or as priors over hierarchies of model parameters, which
gave rise to a branch of statistics known as graphical models, aiming to infer the conditional
dependency structure jointly with the model parameters. Other types of graphs have
also seen applications in graphical models; nonetheless, the explicit interpretation
of conditional dependencies in decomposable graphs has earned them special attention
in statistics, primarily because they greatly simplify the observational data likelihood. For
example, the Gaussian graphical model has seen success in a variety of applications of such
dependency nature (Lauritzen, 1996; Whittaker, 2009). In fact, the earliest introductions of
decomposable graphs to statistics were in the field of Bayesian model determination, by
Darroch et al. (1980) and Wermuth and Lauritzen (1983), as a generating class of decomposable
log-linear models on multidimensional contingency tables.
A few efforts in statistics exist that utilize decomposable graphs beyond graphical models,
for example, the work of Tank et al. (2015) applying decomposable graphs to structural
learning of time series, and the work of Caron and Doucet (2009) on Bayesian nonparametric
models. The lack of broader statistical applicability of decomposable graphs can
be attributed to two aspects, a combinatorial one and a statistical one. The combinatorial
issues include, for example, efficient methods for testing for decomposability in large graphs,
and finding the largest fully connected component, where the latter is still an open problem.
The statistical issues include efficient sampling methods, where only recently a uniform,
though intricate, sampling algorithm was proposed in Thomas and Green (2009), with a more
efficient local update scheme by Stingo and Marchetti (2015).
A main focus of this work is to extend the recent developments in the modelling of relational
data to the modelling of, what we term, decomposable random graphs. In this framework,
we propose a generative model of decomposable graphs, where a sample of the model is a
random-size biadjacency matrix, from which a decomposable graph is obtained by a
deterministic mapping function. Edges are generated sequentially through probabilities driven by
node-specific parameters. The sequential generation guarantees decomposability of the graph
at each step, and is natural in this context given the Markovian interpretation of
decomposability. The model builds on the work of Thomas and Green (2009), and adopts
a bipartite representation of the graph, between nodes and the maximal fully connected
components; the maximal cliques. This representation is later extended to allow for a new
application of decomposable graphs, where one is able not only to model the graph, but also
to simultaneously model latent sub-clusters within maximal cliques. The clustering mechanism
of the model evades two limitations of most clustering algorithms: choosing the correct
number of clusters, and choosing a proper distance metric for clustering. Both limitations
are addressed through the generation process and the construction of the model.
In practice, hierarchical clustering is fundamental to many applications; indeed, it arises
naturally in many real-world systems, for example, the evolutionary tree of organisms
in biology, or the categorization of documents into topics. Thus, we
anticipate a wide range of applications for the proposed model.
Another area of focus in this work is the topic of link prediction in networks with
presence-only data. Conventionally, the existence of an edge in a network is an indication of
dependence or interaction between the pair of nodes connected by the edge; conversely, a pair
of nodes are conditionally independent if there is no edge connecting them. This is the
assumption underlying most network-based models of relational data. Yet, the absence of an edge
in certain types of relational data is only an indication of unknown information, where the
true edge could exist but is currently unobserved, or is forbidden as in the case of conditional
independence.
One example of interest to this work is the case of identifying undocumented or potential
interactions among species from the set of available documented interactions. In an aim
to guide the sampling of ecological networks by identifying the most likely undocumented
interactions, this work tackles the problem by proposing a network-based Bayesian latent
score model, in which scores are assigned to observed edges, much like in conventional
network-based models. The proposed model improves on these by incorporating a Markov random
field component, in this case the phylogenetic information, which also depends on observed
edges. After estimating the parameters of the model, the posterior distribution is used
to predict undocumented interactions. Since it is hard to distinguish truly forbidden
interactions from merely unobserved ones, a measure of uncertainty is built that attempts to
estimate the false negative rate of the data source. This rate is then used to gauge the
predicted number of potential interactions.
1.1 Thesis contribution
The following is a list summarizing the contributions of this thesis.
• The class of decomposable graphs is extensively applied in the context of graphical
models, primarily due to its explicit interpretation of conditional dependencies, which
greatly simplifies the observational data likelihood. Chapter 3 attempts to extend the
statistical use of decomposable graphs by proposing a different modelling framework.
In the classical setting, decomposable graphs are modelled via the adjacency matrix;
instead, the proposed framework models them via their biadjacency matrix, which
represents the connections between the graph nodes and the conditionally independent
components of the graph, known as maximal cliques. The decomposable graph is
retrieved by a deterministic mapping function. The framework represents maximal
cliques as latent communities with their own membership parameters, mimicking
those of the graph node parameters. The likelihood of a node becoming part of a
latent community depends on both of their specific parameters.
• The proposed biadjacency representation of decomposable graphs in Chapter 3 yields
simple Markov update rules, enabling a form of parallelization in the Markov chain
Monte Carlo methods. As a result, the convergence time is reduced. Section
3.4 illustrates results on mixing time (time until convergence) for the proposed
modelling framework. As a consequence of decoupling the graph nodes from the maximal
cliques, it is possible to compute the expected number of cliques per node, which is the
contribution of Section 3.7. This expectation, though exact, requires a certain set of
assumptions relating to the dependency structure among the maximal cliques, known
as the junction tree of the graph. Therefore, it is characterized conditionally.
• Chapter 4 generalizes the framework of Chapter 3 to open the door for a new application
of decomposable graphs. This is done by extending the biadjacency representation
to allow for interactions between graph nodes and subgraphs of maximal cliques.
Subgraphs of maximal cliques can naturally be seen as sub-clusters within each maximal
clique. In fact, combinatorially, a maximal clique of N nodes has 2^N − 1 unique clique
subgraphs, including single nodes. The ability of the biadjacency representation to
account for such sub-clusters adds to its richness. Rather than solely modelling
decomposable graphs, as in the classical setting, it is now possible to model the decomposable
graph and, at the same time, the latent dynamics forming within each maximal clique.
There are a few ways to address the dynamics of interactions between maximal cliques
and sub-clusters. Section 4.3 lists a series of propositions and corollaries illustrating
possible connect and disconnect moves in the biadjacency representation that guarantee
the decomposability of the graph. Section 4.5 summarizes all possible moves in
simpler Markov update steps.
• Most ecological networks, while highly critical to the functioning of ecosystems, are
only partially observed, and fully characterizing them requires substantial sampling
effort that is not feasible in most situations (Jordano, 2015). Many ecological networks
are based on presence-only data, where an unobserved interaction may be either
present or absent. Chapter 5 introduces a latent score model for link prediction in
ecological networks, motivated by the class of auto-models of Besag (1974). The proposed
model is a combination of two separate models: (i) an affinity-based exchangeable
random network model; (ii) a Markov random field network model that is informed by
phylogeny. To account for uncertainty in unobserved interactions, influenced by the
work of Jiang et al. (2011), Section 5.3 incorporates a measure of the proportion of
missing links in the observed data, which strengthens the posterior predictive accuracy
of the model. Section 5.4.3 compares the predictive performance of the proposed model
and three of its variants to a nearest-neighbour algorithm. The model is validated using
two host-parasite networks constructed from published databases, the Global Mammal
Parasite Database and the Enhanced Infectious Diseases database.
1.2 Thesis outline
This work is organized as follows. Chapter 2 is a literature review of specific topics that
relate to different parts of this work, where preliminary notation is also introduced. One
of the main results discussed in this chapter is the newly developed framework of random
graphs that builds on continuous-time stochastic processes, and its contrast with the initially
used framework of random arrays and matrices. Chapter 3 contains one of the main contributions
of this work, proposing a model for decomposable random graphs. The preliminary
notation and background on decomposable graphs are also introduced in this chapter.
Chapter 4 is another contribution related to decomposable graphs, where the new application of
sub-clustering is introduced with a series of propositions addressing possible update moves.
Chapter 5 is the final contribution of this work, which deals with link prediction in bipartite
networks with presence-only data. This chapter was initially submitted for publication
in a peer-reviewed journal, and is thus formatted with appendices containing details on
computational aspects and additional convergence results relating to simulations and real data
examples. Finally, Chapter 6 summarizes the research contributions and discusses possible
future research directions.
Chapter 2
Background
2.1 Poisson process
The Poisson process, or Poisson point process, is one of the most studied point processes
across many disciplines, due to its simplicity and favourable mathematical properties. It is
defined on a measurable space, most commonly a Euclidean space for practical reasons.
Typical examples are the arrival times of customers, or the number of heads appearing in a
sequence of coin tosses. The name derives directly from the process's relation to the Poisson
distribution, where a random variable N is said to have a Poisson distribution with parameter µ
if the probability of an event occurring n ≥ 0 discrete times is
P(N = n) = \frac{µ^n e^{-µ}}{n!}.    (2.1)
The parameter µ represents the expected number of occurrences, as
E[N] = \sum_{n=0}^{∞} n \, P(N = n) = µ.    (2.2)
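As a quick numerical check of (2.1) and (2.2), the pmf can be evaluated directly; the following is a minimal sketch using only the standard library, where the rate value µ = 3.0 is an arbitrary choice:

```python
import math

def poisson_pmf(n, mu):
    """P(N = n) = mu^n * exp(-mu) / n!, as in (2.1)."""
    return mu ** n * math.exp(-mu) / math.factorial(n)

mu = 3.0
# Truncating the infinite sums at 100 terms is numerically exact here:
# the pmf sums to 1, and its mean recovers mu, as in (2.2).
total = sum(poisson_pmf(n, mu) for n in range(100))
mean = sum(n * poisson_pmf(n, mu) for n in range(100))
print(total, mean)  # 1.0 and 3.0, up to floating-point error
```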
Poisson processes are constructed by letting the random variable N of (2.1) be a
count function over measurable subsets A of a space S, such that N(A), the count
of event occurrences in A, is also distributed as a Poisson random variable with
parameter µ(A). More formally, following the language of Billingsley (2008), define
the probability triple (S, F, P), where S is the elementary set of events, F is the σ-algebra
of subsets ("events") of S, and P : F → [0, 1] is a probability measure on the measurable
subsets of S.
Definition 1. A Poisson process, defined on a probability space (S, F, P), is a point process
Π of countable sets of points in S, such that, if A is a measurable subset of S, then the
number of points of Π in A is a well-defined random variable

N(A) = \#(Π \cap A),    (2.3)

satisfying the following properties:

(i) for any disjoint countable subsets A_1, A_2, \dots \subset S, the random variables N(A_1), N(A_2), \dots
are independent;

(ii) for each i ∈ N, N(A_i) is a Poisson random variable with mean 0 ≤ µ(A_i) ≤ ∞.

Note that if µ(A) is non-finite, Π ∩ A is countably infinite with probability 1, and if µ(A)
is finite, Π ∩ A is finite with probability 1.
Remark. If S = R^d, a d-dimensional Euclidean space, then the measurable subsets A_1, A_2, \dots
in Definition 1 are the Borel sets, which form the smallest σ-algebra containing the open
sets. For d = 1 (the real line) they are generated by the open intervals (a, b), a, b ∈ R, and by
open rectangles for d = 2.
The function µ of the Poisson process is generally called a mean measure, since it
satisfies the formal definition of a measure:

(i) non-negativity: for all A ⊂ S, µ(A) ≥ 0;

(ii) measure zero of the empty set: µ(∅) = 0;

(iii) countable additivity: for any countable collection of pairwise disjoint measurable subsets
A_1, A_2, \dots \subset S,

µ\Big(\bigcup_i A_i\Big) = \sum_i µ(A_i).    (2.4)
Moreover, µ is strictly a non-atomic measure: by contradiction, if µ had an atom at a
point x with mass µ({x}) > 0, a non-zero probability would be assigned to counts larger than
one over that atom, as

P(N(\{x\}) ≥ 2) = 1 - e^{-µ(\{x\})} - µ(\{x\}) e^{-µ(\{x\})} > 0.    (2.5)
When S = R^d, µ is defined with respect to a positive measurable rate or intensity function
λ, often called the Lévy measure in the language of stochastic processes. This
relation categorizes Poisson processes into two classes, inhomogeneous and homogeneous.

Inhomogeneous Poisson processes are defined with µ(A) taking the general form

µ(A) = \int_A λ(x) \, dx,    (2.6)

for the d-dimensional measure dx := dx_1 \, dx_2 \cdots dx_d. In most cases the integral above
converges, so that µ(A) is finite and N(A) is finite with probability 1.

Homogeneous Poisson processes are a special case where λ is a constant, so that

µ(A) = λ |A|,    (2.7)

where |A| is the Lebesgue measure of A in R^d. A unit rate Poisson process is a
homogeneous Poisson process with λ = 1.
As an example, for a Poisson process Π defined on the real line with a homogeneous
intensity λ > 0, the probability that the interval (a, b] contains n points, for any a, b ∈ R
with a ≤ b, is

P(N(a, b] = n) = \frac{[λ(b-a)]^n}{n!} e^{-λ(b-a)}.    (2.8)
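Equation (2.8) also suggests a direct simulation recipe for a homogeneous Poisson process on an interval: draw the total count N(a, b] from a Poisson distribution with mean λ(b − a), then scatter that many points uniformly over (a, b]. A minimal standard-library sketch, where the rate λ = 2 and the interval (0, 1] are arbitrary choices:

```python
import math
import random

def sample_poisson(mu, rng):
    """Draw N ~ Poisson(mu) by Knuth's method (multiplying uniforms)."""
    threshold, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def homogeneous_poisson(lam, a, b, rng):
    """Points of a rate-lam Poisson process on (a, b]:
    a Poisson total count, then uniform locations (conditional uniformity)."""
    n = sample_poisson(lam * (b - a), rng)
    return sorted(rng.uniform(a, b) for _ in range(n))

rng = random.Random(42)
counts = [len(homogeneous_poisson(2.0, 0.0, 1.0, rng)) for _ in range(20000)]
print(sum(counts) / len(counts))  # empirical mean, close to lam * (b - a) = 2
```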
The following section introduces some of the key mathematical properties of Poisson
processes.
2.1.1 Key properties of Poisson processes
A reason for its fame is that the Poisson process has many key mathematical properties which
yield surprisingly simple calculations, most of which are immediate consequences of the
properties of the Poisson distribution. This section lists, without proofs, some of the most
important properties. For an extensive review and formulation of general properties of the
Poisson process, refer to Kingman (1993, ch. 2 and 3).
Theorem 1 (Superposition Theorem (Kingman, 1993, ch. 2.2)). Let Π_1, Π_2, \dots be a countable
collection of independent Poisson processes on a measurable space S, where for each i ∈ N,
Π_i has mean measure µ_i. Then their superposition (countable union)

Π = \bigcup_{i=1}^{∞} Π_i,    (2.9)

is a Poisson process with mean measure

µ = \sum_{i=1}^{∞} µ_i.    (2.10)
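For two processes, Theorem 1 reduces to the familiar fact that the sum of independent Poisson counts is Poisson with the summed mean. A sketch verifying this by direct convolution of the two pmfs (the rates 1.0 and 1.5 are arbitrary choices):

```python
import math

def poisson_pmf(n, mu):
    """P(N = n) = mu^n * exp(-mu) / n!."""
    return mu ** n * math.exp(-mu) / math.factorial(n)

mu1, mu2 = 1.0, 1.5
# Counts of the superposed process in a set A satisfy N(A) = N1(A) + N2(A).
# The pmf of the sum is the convolution of the two Poisson pmfs, and it
# matches a single Poisson pmf with mean mu1 + mu2, as Theorem 1 asserts.
for n in range(8):
    conv = sum(poisson_pmf(k, mu1) * poisson_pmf(n - k, mu2) for k in range(n + 1))
    assert abs(conv - poisson_pmf(n, mu1 + mu2)) < 1e-12
print("convolution matches Poisson(mu1 + mu2)")
```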
The superposition property follows directly from the countable additivity of independent
Poisson random variables, together with (i) and (ii) of Definition 1. Moreover, a restricted
Poisson process is still a Poisson process, though with a different mean measure; this is
another important property, stated formally in the following theorem.
Theorem 2 (Restriction Theorem (Kingman, 1993, ch. 2.2)). Let Π be a Poisson process
with mean measure µ on S, and let S_1 be a measurable subset of S. Then the random
countable set

Π_1 = Π ∩ S_1,    (2.11)

can be regarded either as a Poisson process on S with mean measure

µ_1(A) = µ(A ∩ S_1),    (2.12)

or as a Poisson process on S_1 with mean measure the restriction of µ to S_1.
The superposition and restriction properties above describe unions and decompositions of
Poisson processes. A related concept is the mapping of Poisson processes, defined
in the following theorem.
Theorem 3 (Mapping Theorem (Kingman, 1993, ch. 2.3)). Let Π be a Poisson process with
σ-finite mean measure µ on the state space S, and let f : S → Ω be a measurable function
such that the induced measure

µ_Ω(A) = µ(f^{-1}(A)), \qquad f^{-1}(A) = \{x ∈ S : f(x) ∈ A\},    (2.13)

has no atoms. Then f(Π) is a Poisson process on Ω with the induced measure µ_Ω as its
mean measure.
The Mapping theorem above has many implications; for example, it helps in defining
sums over Poisson processes, as shown by Campbell's theorem below. Nonetheless, the
Mapping theorem requires µ to be a σ-finite measure, an extra condition that is
not required by the Superposition and Restriction theorems. A measure is called σ-finite if
there exists a countable partition of the space such that the measure of each part is finite.
Finally, the following result, known as Campbell's theorem, defines the distribution of sums
of functions over Poisson processes.
Theorem 4 (Campbell's Theorem (Kingman, 1993, ch. 3.2)). Let Π be a Poisson process
with mean measure µ on the state space S, and let f : S → R be a measurable function.
Then the sum

Σ = \sum_{X ∈ Π} f(X)    (2.14)

is absolutely convergent with probability 1 if, and only if,

\int_S \min(|f(x)|, 1) \, µ(dx) < ∞.    (2.15)

If this condition holds, then the characteristic function of Σ in (2.14) is

E[e^{itΣ}] = \exp\Big( -\int_S \big(1 - e^{itf(x)}\big) \, µ(dx) \Big),    (2.16)

for t ∈ R, where i is the imaginary unit. Moreover, the expectation of Σ exists if, and
only if, the corresponding integral converges, in which case

E[Σ] = \int_S f(x) \, µ(dx).    (2.17)

If the expectation converges, then the variance is

Var(Σ) = \int_S f(x)^2 \, µ(dx).    (2.18)
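A Monte Carlo sketch of (2.17) and (2.18) for a homogeneous process on S = [0, 1] with f(x) = x², so that E[Σ] = λ/3 and Var(Σ) = λ/5; the rate λ = 4 and the choice of f are arbitrary:

```python
import random

def poisson_points(lam, t_max, rng):
    """Points of a rate-lam homogeneous Poisson process on [0, t_max],
    generated via i.i.d. Exponential(lam) interarrival gaps."""
    points, t = [], rng.expovariate(lam)
    while t <= t_max:
        points.append(t)
        t += rng.expovariate(lam)
    return points

rng = random.Random(3)
lam, n_rep = 4.0, 20000
# Sigma = sum of f(X) over the points of the process, with f(x) = x^2.
sums = [sum(x * x for x in poisson_points(lam, 1.0, rng)) for _ in range(n_rep)]
mean = sum(sums) / n_rep
var = sum((s - mean) ** 2 for s in sums) / n_rep
print(mean, var)  # close to lam/3 = 1.333... and lam/5 = 0.8
```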
2.1.2 The Cox process
The Cox process is a generalization of the Poisson process, also known as the doubly
stochastic Poisson process. It was introduced by Cox (1955) as a Poisson process whose
intensity function λ, defined in (2.6), is itself a stochastic process.

Definition 2 (The Cox process (Kingman, 1993, ch. 6.1)). A process Π defined on a probability
space (S, F, P) with non-atomic random measure µ on S is called a Cox process if the conditional
distribution of Π given µ is a Poisson process with mean measure µ.
Therefore, for the count function N of (2.3), if A_1, A_2, \dots, A_n are disjoint measurable
subsets of S, the unconditional joint distribution of N(A_1), N(A_2), \dots, N(A_n) is

E\big[ P(N(A_1), N(A_2), \dots, N(A_n) \mid µ) \big] = E\Big[ \prod_{i=1}^{n} P(N(A_i) \mid µ) \Big],    (2.19)

where, given µ, each N(A_i) is a Poisson random variable with mean µ(A_i). The unconditional
expectation is

E[N(A_i)] = E_µ\big[ E[N(A_i) \mid µ] \big] = E_µ\Big[ \int_{A_i} λ(x) \, dx \Big] = \int_{A_i} E[λ(x)] \, dx,    (2.20)

where λ(x) is a real-valued measurable random process on S.
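A minimal sketch of Definition 2 and the identity (2.20), for the simplest case of a random constant intensity over a unit window; the choice λ ~ Gamma(2, 1.5) is arbitrary:

```python
import random

def poisson_count(mu, rng):
    """N ~ Poisson(mu): count the Exponential(1) arrivals landing in [0, mu]."""
    n, t = 0, rng.expovariate(1.0)
    while t <= mu:
        n += 1
        t += rng.expovariate(1.0)
    return n

rng = random.Random(11)
alpha, beta, n_rep = 2.0, 1.5, 20000
# Cox sampling: first draw the random intensity, then, given it,
# draw a Poisson count -- a doubly stochastic Poisson draw.
draws = [poisson_count(rng.gammavariate(alpha, beta), rng) for _ in range(n_rep)]
print(sum(draws) / n_rep)  # close to E[lambda] = alpha * beta = 3, as in (2.20)
```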
Many point processes that are not Poisson can be made into one by conditioning, as
in Definition 2. The Cox process enjoys many of the mathematical properties of a Poisson
process. Sections 2.2.4 and 2.2.5 use the unit rate Poisson process to introduce a class
of exchangeable random graphs, where in certain applications a Cox process is used. The
class of completely random measures introduced in Section 2.3 depends heavily on Poisson
processes, both in representation and in sampling.
2.2 Bayesian models for exchangeable graphs
Structured relational data are commonly used in a variety of applications where encoding
relationships between two or more objects is needed. Special cases of structured relational data
are graphs and networks that encode pairwise relationships between objects, and are naturally
represented by adjacency matrices or 2-dimensional data arrays. Much recent work has
been done on statistical modelling of graph and network data. For such models to be viable
for any form of data, the distribution of the data, or at least some of its properties, should be
recoverable from existing observations. In Bayesian modelling, it is common to represent a
series of 1-dimensional observations as an exchangeable sequence, for which de Finetti's
theorem (De Finetti, 1931) and the law of large numbers provide a fundamental theoretical
foundation and an indispensable tool in recovering the distributional characteristics of
the data. As different forms of data become widely available, much work has been done
to extend the de Finetti framework of exchangeable sequences, in particular, to higher
dimensions of structured relational data, such as d-dimensional arrays, or simply d-arrays.
The Aldous-Hoover theorem (Aldous, 1981; Hoover, 1979) and the convergence results of
Kallenberg (1999) played a central role in this work: the former gave an exact characterization
of the conditional independence structure of a random 2-array satisfying a
form of exchangeability, and the latter gave theoretical convergence results for estimation
problems. These results have inspired much of the recent work in Bayesian models
of graphs and 2-arrays, where the first such work applying the Aldous-Hoover theorem
is attributed to Hoff (2008). Albeit, the first known work on random graphs is due to Erdös
and Rényi (1959), and since then many random graph and 2-array models have been proposed,
most of which are covered by the following books and surveys: Newman (2010, 2003),
Bollobás (2001), Durrett (2007), Fienberg (2012), Goldenberg et al. (2010), and Orbanz and
Roy (2015).
Currently, much of the literature is converging toward a more general nonparametric
framework, based on a generative model of random functions or, equivalently, random
measures. The framework builds on two notions of exchangeability: the exchangeability of
discrete random structures, as in the Aldous-Hoover theorem, and the exchangeability of
continuous-space point processes, as in the Kallenberg theorem (Kallenberg, 2005). The
reason for adopting two notions of exchangeability is the known fact that random graphs
represented by an exchangeable discrete 2-array are either trivially empty or dense (Orbanz
and Roy, 2015). Following the terminology of Bollobás and Riordan (2007), a graph of n
nodes is called dense if the number of edges is of the order O(n^2), and called sparse if it is of
the order o(n). On the other hand, the notion of exchangeability based on continuous-space
point processes, under certain conditions, yields sparse graphs, as shown by Caron and Fox
(2014), Veitch and Roy (2015) and Borgs et al. (2014b). The sparseness property of the
model is crucial in many applications, especially for real-world large networks, as shown by
Newman (2010).

This work adopts and builds on the random graph framework based on the exchangeability
notion of continuous-space point processes. The rest of this section is dedicated to introducing
all necessary preliminaries and notation, first starting with the definition of exchangeable
sequences, building up to the definition of exchangeable 2-arrays based on the Aldous-Hoover
theorem, and then extending to the continuous counterpart of exchangeable point processes.
2.2.1 The de Finetti representation of sequences
The de Finetti representation of exchangeable sequences is at the heart of most Bayesian
models; though not always discussed, it is implicitly invoked through the better-known concept
of independent, identically distributed (i.i.d.) random variables. An exchangeable sequence
is an infinite sequence of random variables (ξ_1, ξ_2, \dots) taking values in a space S, whose joint
distribution admits the equality

P(ξ_1 ∈ A_1, ξ_2 ∈ A_2, \dots) = P(ξ_1 ∈ A_{π(1)}, ξ_2 ∈ A_{π(2)}, \dots),    (2.21)

for every collection (A_1, A_2, \dots) of measurable subsets and every permutation π of N :=
\{1, 2, \dots\}. In principle, this indicates an equality in distribution between any two random
permutations of the sequence. For simplicity, let (ξ_n) indicate a sequence of random variables
with implicit index n ∈ N, and let \overset{d}{=} denote equality in distribution. Thus an
exchangeable sequence admits the equality (ξ_n) \overset{d}{=} (ξ_{π(n)}) for every index permutation π of N.
The de Finetti representation theorem connects exchangeable sequences to i.i.d. random
variables by showing that for any exchangeable sequence (ξ_n) there is a random probability
measure Φ such that the sequence (ξ_n) is i.i.d. given Φ, as stated in the following theorem.

Theorem 5 (de Finetti exchangeability (De Finetti, 1931)). Let ξ_1, ξ_2, \dots be an infinite
sequence of random variables with values in a space S. Then ξ_1, ξ_2, \dots are exchangeable if,
and only if, there is a random probability measure Φ on S such that ξ_1, ξ_2, \dots are i.i.d. given
Φ. In addition, the joint distribution is

P(ξ_1 ∈ A_1, ξ_2 ∈ A_2, \dots) = \int_{M(S)} \prod_{i=1}^{∞} θ(A_i) \, φ(dθ),    (2.22)

where M(S) is the set of probability distributions on S, and φ is the distribution of Φ. φ is
often called the mixing or de Finetti measure, and Φ the directing random measure, often
known as the distribution function. Furthermore, the empirical distribution

E_n(\,·\,) := \frac{1}{n} \sum_{i=1}^{n} δ_{ξ_i}(\,·\,), \qquad n ∈ N,    (2.23)

converges to Φ as n → ∞ with probability 1 under φ, for every measurable subset A ⊂ S; that
is,

E_n(A) → Φ(A) \quad as \ n → ∞.    (2.24)
The product form in the integral of (2.22) is commonly known in statistics as the
likelihood of i.i.d. random variables given a known distributional family Φ. A further
generalization is when the distributional family is unknown, and the de Finetti measure
acts as a distribution over all probability distributions, or general measures, on the space S.
Sampling random variables using the de Finetti representation theorem therefore proceeds in
two steps: first draw a probability distribution from φ, and then sample the random
variables directly, as

Φ ∼ φ, \qquad ξ_1, ξ_2, \dots \mid Φ \overset{i.i.d.}{∼} Φ.
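A minimal sketch of this two-stage scheme on the binary space S = {0, 1}, where the random measure Φ is a Bernoulli(p) law with p ~ Beta(2, 2) standing in for φ (an arbitrary choice); the empirical distribution then converges to the drawn Φ, as in (2.24):

```python
import random

rng = random.Random(5)
# Stage 1: draw the random measure Phi, here parametrized by p ~ Beta(2, 2).
p = rng.betavariate(2.0, 2.0)
# Stage 2: given Phi, the observations are i.i.d. Bernoulli(p).
xs = [1 if rng.random() < p else 0 for _ in range(50000)]
# The empirical distribution E_n({1}) converges to Phi({1}) = p:
empirical = sum(xs) / len(xs)
print(abs(empirical - p))  # small for large n
```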
The de Finetti representation theorem for sequences yields a very strong tool for
statisticians: for any partially observed exchangeable sequence from an unknown distribution,
it guarantees the existence of a de Finetti measure φ that allows an i.i.d. representation.
Moreover, the law of large numbers in (2.23) guarantees the recovery of the generating
distribution from observational data. The next section introduces an equivalent of the de
Finetti representation theorem for random graphs and 2-arrays.
2.2.2 The Aldous-Hoover representation theorem for random graphs
A random matrix or 2-array is a further generalization of a sequence of random variables.
Much like infinite sequences, one can define an infinite matrix ξ_∞ as

ξ_∞ = (ξ_{ij}) = \begin{pmatrix} ξ_{11} & ξ_{12} & \dots \\ ξ_{21} & ξ_{22} & \dots \\ \vdots & \vdots & \ddots \end{pmatrix},    (2.25)

where the entries ξ_{ij} are random variables taking values in a space S. If S is binary, for
example S = \{0, 1\}, a random matrix is then called a random graph, as it corresponds to the
adjacency matrix of a graph.
Extending the notion of exchangeability from sequences of random variables to random
matrices is especially interesting for practical reasons. For one, most observed networks of
graph-valued data are finite; thus a notion of exchangeability like de Finetti's
would regard the observed matrix as a partial observation of an infinite random matrix.
With asymptotic results, one might then be able to recover the generating distribution, up
to some uncertainty, from the observed matrix, much like the law of large numbers acts on
exchangeable sequences.

Intuitively, extending the notion of exchangeability requires special attention to the labels of
rows and columns, as they become focal in the permutation step. For example, when rows
and columns of a matrix represent the same set of objects, one might view exchangeability as
a joint permutation of both rows and columns, simultaneously. If rows represent a different
set of objects than columns, separate permutations are instead desirable. The following
definition summarizes the exchangeability notion for random matrices.
Definition 3. A random matrix (ξ_{ij}) is called jointly exchangeable if

(ξ_{ij}) \overset{d}{=} (ξ_{π(i)π(j)}),    (2.26)

for every permutation π of N. It is called separately exchangeable if

(ξ_{ij}) \overset{d}{=} (ξ_{π(i)π'(j)}),    (2.27)

for every pair of permutations π and π' of N.

Remark (Random variables as random functions). It is well established that a random variable
can be equally represented by its cumulative distribution function (CDF), in the sense that,
for a random variable ξ_i taking values in the space S = [a, b] with CDF D, ξ_i can be
sampled using a uniform random variable as

ξ_i \overset{d}{=} D^{-1}(U_i), \qquad U_i ∼ Uniform[0, 1].    (2.28)

D^{-1} is known as the right-continuous inverse of the CDF D, defined as

D^{-1}(u) = \inf\{ξ ∈ [a, b] \mid u ≤ D(ξ)\}.    (2.29)

Thus for an exchangeable sequence of random variables ξ_1, ξ_2, \dots, there is a random function
f acting like an inverse CDF, such that

(ξ_1, ξ_2, \dots) \overset{d}{=} (f(U_1), f(U_2), \dots), \qquad U_1, U_2, \dots \overset{i.i.d.}{∼} Uniform[0, 1].    (2.30)
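The inverse-CDF construction (2.28)-(2.29) can be sketched for one concrete choice, the CDF D(ξ) = ξ² on S = [0, 1] (an arbitrary illustrative example), whose right-continuous inverse is D⁻¹(u) = √u:

```python
import math
import random

def cdf(x):
    """D(x) = x^2 on [0, 1], an arbitrary illustrative CDF."""
    return x * x

def inv_cdf(u):
    """Right-continuous inverse of D, as in (2.29): inf{x in [0,1] : u <= D(x)}."""
    return math.sqrt(u)

rng = random.Random(9)
draws = [inv_cdf(rng.random()) for _ in range(50000)]
# Under the density D'(x) = 2x, E[xi] = integral of x * 2x over [0, 1] = 2/3.
print(sum(draws) / len(draws))  # close to 2/3
```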
Without further ado, we now present the two versions of the Aldous-Hoover theorem (Aldous,
1981; Hoover, 1979) for jointly and separately exchangeable random matrices, the
equivalent of de Finetti's theorem for exchangeable random sequences.
Theorem 6 (Aldous-Hoover, jointly exchangeable). A random 2-array (ξ_{ij}) taking values in
a space S is jointly exchangeable if, and only if, there exists a measurable random function
f : [0, 1]^3 → S such that

(ξ_{ij}) \overset{d}{=} (f(U_i, U_j, U_{ij})),    (2.31)

where the sequence (U_i) and the 2-array (U_{ij}) are both i.i.d. Uniform[0, 1] random variables
with U_{ij} = U_{ji}, and are independent of f.

The array (U_{ij}) thus represents an upper-triangular matrix of uniform random variables.
Moreover, the function f need not be symmetric in its first two arguments; in fact, if
f(a, b, ·) = f(b, a, ·) for all a, b, then ξ_{ij} = ξ_{ji} for all i and j.
Theorem 7 (Aldous, separately exchangeable). A random 2-array (ξ_{ij}) taking values in a
space S is separately exchangeable if, and only if, there exists a measurable random function
f : [0, 1]^3 → S such that

(ξ_{ij}) \overset{d}{=} (f(U^{row}_i, U^{col}_j, U_{ij})),    (2.32)

where the sequences (U^{row}_i) and (U^{col}_j) and the 2-array (U_{ij}) are all i.i.d. Uniform[0, 1]
random variables, independent of f.

Notice that the only difference between Theorems 6 and 7 is the indexing of the 2-array
(U_{ij}), where the former requires the additional condition U_{ij} = U_{ji}, since rows
and columns represent the same set of objects. The exchangeability results for 2-arrays and
random matrices introduced above suggest a simple generative framework for exchangeable
random graphs, which is the topic of the next section.
2.2.3 Exchangeable graphs as exchangeable 2-arrays
Given the Aldous-Hoover representation theorems for jointly and separately exchangeable
2-arrays, it is now straightforward to define an exchangeable graph. For a graph with adjacency
matrix (ξ_{ij}), the graph is exchangeable in the sense of (2.31) if, and only if, (ξ_{ij}) is jointly
exchangeable, and in the sense of (2.32) if, and only if, (ξ_{ij}) is separately exchangeable.
In the case of simple graphs, namely undirected graphs with no self-loops, one can further
simplify the representation of the random function in (2.31) and (2.32) by considering a lower
dimensional random function W : [0, 1]2 ↦→ [0, 1], such that
(ξij)d= (IUij < W (Ui, Uj)), (2.33)
where 1{A} = 1 if event A occurs, and (Ui) and (Uij) are independent i.i.d. Uniform[0, 1] random variables that are independent of W. Further, for a symmetric graph, W must be symmetric in its arguments, namely W(x, y) = W(y, x), and Uij = Uji. For a separately exchangeable graph, W(Ui, Uj) is replaced by W(U_i^row, U_j^col) as in (2.32), for two independent sequences of i.i.d. uniform random variables (U_i^row) and (U_j^col) that are also independent of W. The random measurable function W is often called a graphon. Thus, given a distribution φ over the space of all graphons, the generative model of a jointly exchangeable random graph as a 2-array is
\[
\begin{aligned}
W &\sim \varphi, \\
U_i &\overset{iid}{\sim} \mathrm{Uniform}[0, 1], \quad \forall i \in \mathbb{N}, \\
\xi_{ij} \mid W, U_i, U_j &\sim \mathrm{Bernoulli}\big(W(U_i, U_j)\big), \quad \forall i, j \in \mathbb{N}.
\end{aligned} \tag{2.34}
\]
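As an illustration, the restriction of the generative model (2.34) to a finite set of n nodes is straightforward to simulate. The sketch below assumes a user-supplied graphon W; the product-form W used in the example is purely illustrative, not a model from the text:

```python
import numpy as np

def sample_graphon_graph(n, W, rng=None):
    """Sample an n-node simple graph from the generative model (2.34):
    U_i iid Uniform[0,1], then edge {i,j} ~ Bernoulli(W(U_i, U_j))."""
    rng = np.random.default_rng(rng)
    U = rng.uniform(size=n)                   # latent node variables
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):             # simple graph: no self-loops
            A[i, j] = A[j, i] = int(rng.uniform() < W(U[i], U[j]))
    return A, U

# Illustrative symmetric graphon (an assumption, not from the text)
A, U = sample_graphon_graph(50, lambda x, y: 0.9 * x * y, rng=1)
```

Joint exchangeability shows up here as invariance of the law of the sampled adjacency matrix under any relabelling of the n nodes.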
This simple generative model is quite powerful, as it encompasses many known models. Table 2.1 lists some, but not all, known graph models and their equivalent graphon parametrizations. Nonetheless, these models are intrinsically misspecified for many real-world applications with sparse graph structures, where a different form of the exchangeability notion in Definition 3 is desired (Orbanz and Roy, 2015). For that reason, the next section introduces a slightly different notion of exchangeability and its representation theorem, which grants sparsity under certain conditions.
Table 2.1: Summary of some known models admitting the graphon representation.

Model | Graphon (W)
Latent class (1987) | m_{U_i,U_j} (•), U_i ∈ {1, . . . , K}
Infinite relational model (2006) | m_{U_i,U_j} (•), U_i ∈ {1, . . . , K}
Latent distance (2002) | −|U_i − U_j|
Eigenmodel (2008) | U_i^⊺ D U_j (†)
Latent feature relational model (2009) | U_i^⊺ D U_j (†), U_i ∈ {0, 1}^∞
Probabilistic matrix factorization (2011) | U_i^⊺ V_j (‡)
Latent attribute model (2012) | Σ_k 1{U_ik} 1{U_jk} D^(k)_{U_ik U_jk} (†), U_i ∈ {0, . . . , ∞}^∞

(•) m_{U_i,U_j} is a form of an expected value of a sum of Bernoulli random variables parameterized by (U_i).
(†) D is a random diagonal matrix.
(‡) V_j is a vector of latent feature scores.
2.2.4 The Kallenberg representation theorem for random graphs
The Aldous-Hoover representation theorem models random graphs as discrete 2-array adjacency matrices. Analogous to this framework is the Kallenberg representation theorem, which models random graphs as a point process on the continuous space R_+^2. This is achieved by embedding the graph nodes (θi) in the continuous space R_+, so that the adjacency matrix ξ becomes a purely atomic measure on R_+^2, as
\[ \xi = \sum_{i,j\in\mathbb{N}} z_{ij}\, \delta_{(\theta_i,\theta_j)}, \tag{2.35} \]
where zij = 1 if (θi, θj) is an edge of the graph, and zij = 0 otherwise. Therefore, the exchangeability notion for point processes on R_+^2, due to Kallenberg (1990, 2005), is now directly applicable. This exchangeability notion is slightly different from the one introduced in Definition 3, and is stated in the following definition.
Definition 4. A random measure ξ on R_+^2 is called jointly exchangeable if for every measure preserving transformation T on R_+ we have
\[ \xi \overset{d}{=} \xi \circ (T \otimes T)^{-1}, \tag{2.36} \]
where ⊗ is the tensor product. It is called separately exchangeable if for every pair of measure preserving transformations T1 and T2 on R_+ we have
\[ \xi \overset{d}{=} \xi \circ (T_1 \otimes T_2)^{-1}. \tag{2.37} \]
To parallel the random permutation notion of Definition 3, a common way to define a measure preserving transformation is by permuting a random partitioning of R_+. More precisely, a random measure ξ on R_+^2 is separately exchangeable if, and only if, for any h > 0 and for any permutations π and π' of N, we have
\[ \big(\xi(A_i \times A_j)\big) \overset{d}{=} \big(\xi(A_{\pi(i)} \times A_{\pi'(j)})\big), \tag{2.38} \]
where A_i = [h(i − 1), hi) for i ∈ N; for joint exchangeability, π' = π. Even though the exchangeability form in (2.38) seems comparable to that of (2.26) and (2.27) for very small h, the exchangeability notion underlying the Aldous-Hoover representation theorem relies on the hidden assumption that the number of nodes is fixed and known when exchangeability is invoked. This assumption restricts the generative ability of the Aldous-Hoover representation theorem to non-sparse graphs. On the other hand, the exchangeability notion of continuous-space point processes does not rely on the number of nodes, but rather on partition sizes of R_+, with a random (possibly infinite) number of nodes in each partition. This notion of partition-size dependence becomes more apparent as we introduce the Kallenberg representation theorem and the generative process of random graphs in the next section. Note that the Aldous-Hoover representation constitutes a projective family of the Kallenberg representation theorem; thus the latter is seen as a generalization of the former.
The following results of Kallenberg (1990, 2005) give a de Finetti-style representation theorem for exchangeable measures on R_+^2 in the sense of Definition 4. Let Λ denote the Lebesgue measure on R_+, and Λ_D the Lebesgue measure on the diagonal of R_+^2.
Theorem 8. (Kallenberg, jointly exchangeable) A random measure ξ on R_+^2 is jointly exchangeable if, and only if, almost surely
\[
\begin{aligned}
\xi ={}& \sum_{i,j} f(\alpha, \vartheta_i, \vartheta_j, U_{ij})\, \delta_{(\theta_i,\theta_j)} \\
&+ \sum_{j,k} \big( g(\alpha, \vartheta_j, \chi_{jk})\, \delta_{(\theta_j,\sigma_{jk})} + g'(\alpha, \vartheta_j, \chi_{jk})\, \delta_{(\sigma_{jk},\theta_j)} \big) \\
&+ \sum_{k} \big( l(\alpha, \eta_k)\, \delta_{(\rho_k,\rho'_k)} + l'(\alpha, \eta_k)\, \delta_{(\rho'_k,\rho_k)} \big) \\
&+ \sum_{j} \big( h(\alpha, \vartheta_j)(\delta_{\theta_j} \otimes \Lambda) + h'(\alpha, \vartheta_j)(\Lambda \otimes \delta_{\theta_j}) \big) \\
&+ \beta\,\Lambda_D + \gamma\,\Lambda^2,
\end{aligned} \tag{2.39}
\]
for some measurable functions f ≥ 0 on R_+^4, g, g' ≥ 0 on R_+^3 and h, h', l, l' ≥ 0 on R_+^2, some collection of independent uniformly distributed random variables (Uij) on [0, 1] with Uij = Uji, some independent unit rate Poisson processes (θj, ϑj) and (σij, χij) on R_+^2 and (ρj, ρ'j, ηj) on R_+^3, for i, j ∈ N, and some independent set of random variables α, β, γ ≥ 0.
Theorem 9. (Kallenberg, separately exchangeable) A random measure ξ on R_+^2 is separately exchangeable if, and only if, almost surely
\[
\begin{aligned}
\xi ={}& \sum_{i,j} f(\alpha, \vartheta_i, \vartheta'_j, U_{ij})\, \delta_{(\theta_i,\theta'_j)} \\
&+ \sum_{j,k} \big( g(\alpha, \vartheta_j, \chi_{jk})\, \delta_{(\theta_j,\sigma_{jk})} + g'(\alpha, \vartheta'_j, \chi'_{jk})\, \delta_{(\theta'_j,\sigma'_{jk})} \big) \\
&+ \sum_{k} l(\alpha, \eta_k)\, \delta_{(\rho_k,\rho'_k)} \\
&+ \sum_{j} \big( h(\alpha, \vartheta_j)(\delta_{\theta_j} \otimes \Lambda) + h'(\alpha, \vartheta'_j)(\Lambda \otimes \delta_{\theta'_j}) \big) + \gamma\,\Lambda^2,
\end{aligned} \tag{2.40}
\]
for some measurable functions f ≥ 0 on R_+^4, g, g' ≥ 0 on R_+^3 and h, h', l ≥ 0 on R_+^2, some collection of independent uniformly distributed random variables (Uij) on [0, 1], some independent unit rate Poisson processes (θj, ϑj), (θ'j, ϑ'j), (σij, χij) and (σ'ij, χ'ij) on R_+^2 and (ρj, ρ'j, ηj) on R_+^3, for i, j ∈ N, and some independent set of random variables α, γ ≥ 0.
Theorems 8 and 9 do not quite resemble Theorems 6 and 7 of the Aldous-Hoover representation, particularly in the second, third and fourth terms of (2.39) and (2.40). Nonetheless, to give more interpretation in the context of atomic measures, first note that all terms associated with Lebesgue measures must have measure zero, as is the case for the random functions h and h', and the variables β and γ. Moreover, the random functions g and g' contribute only star-shaped structures to the graph, as shown in the indexing of their δ components. The random functions l and l', also by construction, contribute only isolated disconnected nodes. Thus, the primary part of the Kallenberg representation is the random function f, which contributes most of the interesting structure of a graph, and parallels that of (2.31) and (2.32) in the Aldous-Hoover representation. This brings us to the topic of the next section, a general framework for random graphs as exchangeable random measures.
2.2.5 Exchangeable graphs as exchangeable measures on R_+^2
Following the Kallenberg representation theorem for exchangeable random measures on R_+^2, one can now characterize exchangeable graphs in the sense of Definition 4 using a graphon-type representation, as seen in Section 2.2.3. In fact, the work of Veitch and Roy (2015) does exactly that: an atomic measure ξ on R_+^2 is jointly exchangeable if, and only if, it can be represented by a triple (I, S, W) of measurable random functions, where I : R_+ → R_+, S : R_+^2 → R_+ and W : R_+^3 → [0, 1] with W(α, ·, ·) symmetric for every α ∈ R_+, such that
\[
\begin{aligned}
\xi ={}& \sum_{i,j} \mathbb{1}\{U_{ij} \le W(\alpha, \vartheta_i, \vartheta_j)\}\, \delta_{(\theta_i,\theta_j)} \\
&+ \sum_{j,k} \mathbb{1}\{\chi_{jk} \le S(\alpha, \vartheta_j)\}\, \big(\delta_{(\theta_j,\sigma_{jk})} + \delta_{(\sigma_{jk},\theta_j)}\big) \\
&+ \sum_{k} \mathbb{1}\{\eta_k \le I(\alpha)\}\, \big(\delta_{(\rho_k,\rho'_k)} + \delta_{(\rho'_k,\rho_k)}\big),
\end{aligned} \tag{2.41}
\]
where all symbols are as in Theorem 8. For a separately exchangeable random measure, the characterization is slightly different:
\[
\begin{aligned}
\xi ={}& \sum_{i,j} \mathbb{1}\{U_{ij} \le W(\alpha, \vartheta_i, \vartheta'_j)\}\, \delta_{(\theta_i,\theta'_j)} \\
&+ \sum_{j,k} \big( \mathbb{1}\{\chi_{jk} \le S(\alpha, \vartheta_j)\}\, \delta_{(\theta_j,\sigma_{jk})} + \mathbb{1}\{\chi'_{jk} \le S'(\alpha, \vartheta'_j)\}\, \delta_{(\theta'_j,\sigma'_{jk})} \big) \\
&+ \sum_{k} \mathbb{1}\{\eta_k \le I(\alpha)\}\, \big(\delta_{(\rho_k,\rho'_k)} + \delta_{(\rho'_k,\rho_k)}\big),
\end{aligned} \tag{2.42}
\]
where all symbols are as in Theorem 9, S' : R_+^2 → R_+ is also a measurable random function, and W(α, ·, ·) need not be symmetric in its arguments. The triple (I, S, W) of random functions corresponds to the triple (l, g, f) in (2.39), where the remaining terms in (2.39) are omitted due to their zero-measure contribution.
The characterization of graphs as exchangeable random measures on R_+^2 lays the theoretical foundation for a family of Bayesian models of sparse graphs, which is unattainable with the Aldous-Hoover representation. Indeed, the work of Veitch and Roy (2015) shows that the triple (0, 0, W) yields dense graphs with probability 1 if an integrable W has compact support, and sparse graphs otherwise. This result was first conveyed in Caron and Fox (2014) by using a Cox process (Definition 2) form for W, as
\[ W(U_i, U_j) = 1 - \exp\big(-2\, \vartheta^{-1}(U_i)\, \vartheta^{-1}(U_j)\big), \qquad i \neq j, \tag{2.43} \]
where (ϑi) are points of a Poisson process, though parameterized as the jumps of a completely random measure (CRM). The next section discusses in more detail the parametrization of CRMs and their connection to atomic measures and the Kallenberg representation theorem.
For completeness, a possible generative process for simple finite graphs with the parametrization (0, 0, W) and W(x, x) = 0 uses a cut-restriction of a unit rate Poisson process. Let [0, v]^2 be the cubic restriction of R_+^2, where only nodes with location θ ≤ v are considered. Define a unit rate Poisson process (θi, ϑi) on [0, v] × [0, c]; then a generative model is
\[
\begin{aligned}
N_v &\sim \mathrm{Poisson}(cv), \\
\theta_i \mid N_v &\overset{iid}{\sim} \mathrm{Uniform}[0, v], \\
\vartheta_i \mid N_v &\overset{iid}{\sim} \mathrm{Uniform}[0, 1], \\
(\theta_i, \theta_j) \mid W, \vartheta_i, \vartheta_j &\sim \mathrm{Bernoulli}\big(W(\vartheta_i, \vartheta_j)\big),
\end{aligned} \tag{2.44}
\]
where N_v is the number of nodes. Non-active nodes are implicitly discarded post-sampling. As discussed earlier, the above generative model does not depend on the number of nodes, but rather on the cut-restrictions c and v, as enforced by discarding non-active nodes, which is not the case in (2.34). Figure 2.1 shows graphically an example of a simple graph generated using the Kallenberg representation.
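A minimal simulation of (2.44) can be sketched as follows. The particular W used at the end is in the spirit of (2.43), but takes the uniform weights ϑ directly as arguments; this parametrization is an assumption made for the sketch, not the CRM form used in the text:

```python
import numpy as np

def sample_kallenberg_graph(v, c, W, rng=None):
    """Sketch of the generative model (2.44): a finite simple graph from a
    unit rate Poisson process restricted to [0, v] x [0, c]."""
    rng = np.random.default_rng(rng)
    n = rng.poisson(c * v)                      # N_v ~ Poisson(cv)
    theta = rng.uniform(0.0, v, size=n)         # node locations
    vartheta = rng.uniform(0.0, 1.0, size=n)    # node weights (affinities)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):               # W(x, x) = 0: no self-loops
            A[i, j] = A[j, i] = int(rng.uniform() < W(vartheta[i], vartheta[j]))
    keep = A.sum(axis=1) > 0                    # discard non-active nodes
    return A[np.ix_(keep, keep)], theta[keep]

# Illustrative cohesion function in the spirit of (2.43)
W = lambda x, y: 1.0 - np.exp(-2.0 * x * y)
A, theta = sample_kallenberg_graph(v=5.0, c=10.0, W=W, rng=0)
```

Unlike (2.34), the size of the sampled graph is random and is controlled by the cut-restrictions v and c rather than by a fixed node count.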
2.3 Completely random measures
In the previous section we introduced the concept of a completely random measure (CRM) when discussing possible parametrizations of the function W in the Kallenberg representation theorem. This section briefly illustrates the theoretical concept of a CRM.
The idea of CRMs stems from the simple observation that, for a Poisson process Π on a measurable space (S, F), where F is a σ-algebra, the simple count function
\[ N(A) = \#\{\Pi \cap A\} \tag{2.45} \]
is a random measure. First, it is a measure because it satisfies the three measure properties shown in Section 2.1. Second, the random summand on the right of (2.4) constitutes a number of independent random variables, as a by-product of the Poisson process. With this observation, Kingman (1967) suggested the concept of completely random measures. Besides N(A_1), N(A_2), . . .
being a set of random variables, one can also let the function N itself be a random non-negative function that admits the three measure properties.

Figure 2.1: An example of a simple graph generated under the Kallenberg representation. The top left corner shows a generated Poisson point process (θi, ϑi), with restrictions on the location (x-axis) and weight (y-axis) domains shown in dotted grey lines; points outside the restricted cube are shown with grey circles. Using the point process and the cohesion function W, shown by the heat map in the top right corner, a random simple graph is generated, as shown in the bottom left corner, where only nodes with active edges are shown: black circles are nodes within the restricted cube, and grey circles are nodes outside the restricted cube that nonetheless have active edges. The graph is shown in the bottom right corner with the same colour coding.

Notably, let ν : S → R_+ be a random
non-negative function such that for any collection of pairwise disjoint measurable subsets A_1, A_2, · · · ⊂ S, the random variables ν(A_1), ν(A_2), . . . are independent, and
\[ \nu\Big(\bigcup_i A_i\Big) = \sum_i \nu(A_i). \tag{2.46} \]
The definition above, while simple, is much richer than the definition of random measures based on Poisson processes. However, it does not explain the wide applicability of CRMs in the recent literature on Bayesian nonparametrics. This development is related to two other observations of Kingman (1967): (i) the natural construction of a wide range of CRMs from nonhomogeneous Poisson processes, thus inheriting the rich mathematical properties of the latter; (ii) the general characterization of the joint distribution function of ν(A) using the Laplace transform (generating functions). These two observations proved very significant for Bayesian modelling: they allow a straightforward sampling procedure, and they permit the use of flexible classes of priors over functional spaces, some with strong conjugacy properties, as in the case of random graphs.
To show this concisely, let ν be a CRM defined on a measurable space (S, F). Kingman (1967) showed that if ν is σ-finite, then by the Lévy-Khinchin representation (Sato, 1999), the Laplace transform for any measurable subset A ⊂ S and t > 0 is
\[ \mathbb{E}\big[e^{-t\nu(A)}\big] = \exp\Big( -\int_{A\times\mathbb{R}_+} (1 - e^{-t\omega})\, \mu(d\theta, d\omega) \Big), \tag{2.47} \]
for some measure µ on S × R_+ that makes the above integral converge.
Note that a σ-finite measure requires that there is a countable dissection of the space S = ∪_i S_i such that ν(S_i) is finite with positive probability. To ensure this property, the following condition must be satisfied:
\[ \int_{A\times\mathbb{R}_+} (1 - e^{-\omega})\, \mu(d\theta, d\omega) < \infty. \tag{2.48} \]
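As a concrete check of (2.47), consider the Gamma-process jump intensity ρ(dω) = ω^{-1} e^{-τω} dω (a special case of Example 2.3.1 below with σ = 0): the inner integral then has the closed form log(1 + t/τ), so ν(A) is Gamma(λ(A), τ) distributed. A numerical sketch:

```python
import numpy as np
from scipy.integrate import quad

# Laplace exponent psi(t) = integral of (1 - e^{-t w}) rho(dw) for the Gamma
# process, rho(dw) = w^{-1} e^{-tau w} dw; closed form: log(1 + t/tau).
tau, t = 2.0, 1.5
psi_numeric, _ = quad(lambda w: (1 - np.exp(-t * w)) * np.exp(-tau * w) / w,
                      0.0, np.inf)
psi_closed = np.log(1.0 + t / tau)
# By (2.47), E[exp(-t nu(A))] = exp(-lambda(A) * psi(t)),
# i.e. nu(A) ~ Gamma(lambda(A), tau).
```

The integrand is bounded near zero (it tends to t as ω → 0), so (2.48) holds even though ρ itself has infinite total mass.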
The characterization in (2.47) shows that the joint distribution function of ν(A) is uniquely determined by the artificially extended measure µ(dθ, dω), which is referred to as the Lévy measure. The compelling part of this formulation is its direct connection to the distribution function of a Poisson point process. Consider a nonhomogeneous Poisson point process Π on the product space S × R_+, with σ-finite mean measure µ(dθ, dω), where the pairs (θi, ωi) are the points of the Poisson process. The distribution function of Π(A) for any t > 0 can easily be shown to be
\[ \mathbb{E}\big[e^{-t\Pi(A)}\big] = \exp\Big( -\int_{A\times\mathbb{R}_+} (1 - e^{-t\omega})\, \mu(d\theta, d\omega) \Big). \tag{2.49} \]
From (2.47) and (2.49), we see that the set of σ-finite CRMs can be completely characterized by Poisson processes on the extended space S × R_+ via the Poisson mean measure µ(dθ, dω), which facilitates the sampling procedure significantly. Moreover, this observation indicates that the measure ν is purely atomic, from its resemblance to the Poisson process, and can be specified as
\[ \nu = \sum_{i=1}^{\infty} \omega_i\, \delta_{\theta_i}, \tag{2.50} \]
where δ_x is the Dirac delta measure at x ∈ S, and the pairs (θi, ωi) ∈ S × R_+ are the points of the Poisson process.
Further, when µ(dθ, dω) decomposes into a product of two measures, for example µ(dθ, dω) = λ(dθ)ρ(dω), the CRM is called homogeneous, which implies that the atoms (θi) are independent of the weights (ωi). We denote these measures by CRM(ρ, λ), where ρ is often called the jump intensity of the Lévy measure; more generally, ρ characterizes the independent increments of the process, and is directly related to the intensity function in (2.6). In this work, and in much of the Bayesian nonparametric literature, the measure ρ is of particular interest, as it plays a key role in defining the jump density of any measurable subset A ⊂ S. We say a CRM is infinitely active if it has an infinite number of jumps in any measurable subset A, which is satisfied if
\[ \int_0^{\infty} \rho(d\omega) = \infty. \tag{2.51} \]
In other words, when the integral in (2.6) diverges. Otherwise, we say the CRM is finitely active, as the number of jumps will then be finite almost surely. Moreover, the number of atoms in A is infinite if, and only if, µ(A × R_+) = λ(A)ρ(R_+) = ∞. For a comprehensive review of CRMs see Kingman (1992), and for examples of CRM applications in Bayesian nonparametrics see Lijoi and Prünster (2010); Regazzini et al. (2003).
2.3.1 Sampling CRM from unit rate Poisson processes
The characterization of σ-finite CRMs as Poisson processes, as shown in (2.47) and (2.49), enables a direct sampling procedure using unit rate Poisson processes. This section lists the conditions a CRM must satisfy to admit such a representation, and an exact sampler is given for a few examples (Orbanz and Williamson, 2011).
Theorem 10 (Poisson representation of CRMs, Orbanz and Williamson (2011)). Let ν be a CRM having the form
\[ \nu = \sum_{i=1}^{\infty} \omega_i\, \delta_{\theta_i}, \tag{2.52} \]
for random variables (θi, ωi) ∈ S × R_+, where S is a Polish space. Let µ(dθ, dω) be the Lévy measure, as defined in (2.47), and let ν satisfy the following conditions:

(i) ν is Σ-finite: there exists a disjoint countable partition (Si) of S where P(ν(Si) < ∞) > 0 for all i;

(ii) no jumps of size 0: µ(S, {0}) = 0;

(iii) the Lévy measure µ is σ-finite, that is, µ(S, (w, ∞)) < ∞.

Denote by µ̄ the tail of the Lévy measure µ,
\[ \bar{\mu}(x) = \mu(S, x) = \iint_{S\times(x,\infty)} \mu(d\theta, dw). \tag{2.53} \]
Then there is a probability kernel p : S × R_+ → [0, 1] such that
\[ \nu = \sum_{i=1}^{\infty} \omega_i\, \delta_{\theta_i} \overset{d}{=} \sum_{i=1}^{\infty} \bar{\mu}^{-1}(w_i)\, \delta_{\theta_i}, \tag{2.54} \]
where (wi) are the points of Π, a unit rate Poisson process on R_+, and (θi) are independent random variables with θi ∼ p( · , µ̄^{-1}(wi)). Moreover, p is unique up to equivalence, and is defined by
\[ \mu(A, B) = \int_B p(A, w)\, \mu(S, dw). \tag{2.55} \]
If µ is σ-finite, then (2.55) simplifies to
\[ p(d\theta, w) := \frac{\mu(d\theta,\, \cdot\,)}{\mu(S,\, \cdot\,)}(w), \qquad d\theta \in \mathcal{B}(S), \tag{2.56} \]
where B(S) are the Borel sets of S.
Remark. We made a distinction between Σ-finite and σ-finite in conditions (i) and (iii), where the former is defined for ν. Condition (ii) is easily satisfied for a continuous measure µ(S, ·), such as the Gamma or Beta processes (Brix, 1999; Hjort, 1990; Lijoi et al., 2007). Condition (iii) relates to the finitely active homogeneous CRMs of (2.51). Moreover, for certain CRMs, S might have to be restricted to a compact subset for the condition to be satisfied.
Therefore, a unit rate Poisson process, combined with the tail of the Lévy measure, gives an exact sampler for the weight dimension (wi) of a CRM satisfying conditions (i)-(iii). For the location dimension (θi), the probability kernel in (2.55) and (2.56) allows direct sampling. Theorem 10 is a close relative of the Ferguson-Klass representation of pure-jump Lévy processes (Ferguson and Klass, 1972). The rest of this section illustrates a few practical examples of sampling CRMs by unit rate Poisson processes.
2.3.1.1 Homogeneous CRMs
As mentioned in the previous section, for a decomposable Lévy measure of the form µ(dθ, dω) = λ(dθ)ρ(dω), the associated CRM is called homogeneous. Thus, for a σ-finite µ, the probability kernel (2.56) simplifies to p(dθ, w) := λ(dθ)/λ(S), and the two dimensions are sampled independently.
Example 2.3.1 (Generalized Gamma process, (Brix, 1999; Lijoi et al., 2007)). The Lévy measure of the generalized Gamma process is given by
\[ \mu(d\theta, dw) = \frac{1}{\Gamma(1-\sigma)}\, w^{-1-\sigma} e^{-\tau w}\, dw\, \lambda(d\theta), \tag{2.57} \]
where Γ is the gamma function, and the two parameters (σ, τ) satisfy
\[ (\sigma, \tau) \in (-\infty, 0] \times (0, \infty) \quad \text{or} \quad (\sigma, \tau) \in (0, 1) \times [0, \infty). \tag{2.58} \]
Thus the tail µ̄ is
\[ \bar{\mu}(x) = \int_x^{\infty} \frac{1}{\Gamma(1-\sigma)}\, w^{-1-\sigma} e^{-\tau w}\, dw\; \lambda(S) = \begin{cases} \dfrac{\tau^{\sigma}\, \Gamma(-\sigma, \tau x)}{\Gamma(1-\sigma)}\, \lambda(S) & \text{if } \tau > 0, \\[6pt] \dfrac{x^{-\sigma}}{\sigma\, \Gamma(1-\sigma)}\, \lambda(S) & \text{if } \tau = 0, \end{cases} \tag{2.59} \]
where Γ(−σ, τx) is the upper incomplete gamma function. Special cases of (2.57) are the Gamma process (σ = 0, τ > 0), the stable process (σ ∈ (0, 1), τ = 0), and the inverse-Gaussian process (σ = 1/2, τ > 0).
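For the Gamma-process special case (σ = 0, τ > 0), the tail in (2.59) reduces to µ̄(x) = λ(S) E_1(τx), with E_1 the exponential integral, and Theorem 10 suggests a simple sampler: invert µ̄ at the points of a unit rate Poisson process. The sketch below assumes a homogeneous CRM with λ uniform on [0, 1] and inverts the tail numerically by bisection (the bracketing interval is a practical choice, not part of the theory):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import exp1  # exponential integral E_1

def sample_gamma_crm(n_atoms, tau, lam_total, rng=None):
    """Sample the n_atoms largest jumps of a Gamma-process CRM
    (sigma = 0, tau > 0 in (2.57)) via Theorem 10: jumps are the inverse
    tail mu_bar^{-1} evaluated at unit rate Poisson points."""
    rng = np.random.default_rng(rng)
    arrivals = np.cumsum(rng.exponential(size=n_atoms))  # unit rate PP on R+
    tail = lambda x, t: lam_total * exp1(tau * x) - t    # mu_bar(x) - t
    # mu_bar is decreasing, so we invert by bisection on a wide bracket;
    # the jumps then come out in decreasing order
    weights = np.array([brentq(tail, 1e-15, 1e4, args=(t,)) for t in arrivals])
    locations = rng.uniform(size=n_atoms)  # lambda uniform on [0, 1] (assumed)
    return locations, weights
```

Since µ̄ is decreasing and the Poisson arrivals are increasing, the sampler delivers the jumps from largest to smallest, which is convenient for truncation.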
2.3.1.2 Inhomogeneous CRMs
The Beta process is an inhomogeneous CRM, with Lévy measure
\[ \mu(d\theta, dw) = c(\theta)\, w^{-1} (1-w)^{c(\theta)-1}\, dw\, \lambda(d\theta), \tag{2.60} \]
where c(θ) is assumed to be a non-negative piecewise continuous function (Hjort, 1990). Condition (iii) in Theorem 10 requires that the Lévy measure is σ-finite. For the state space S = R_+, the Beta process can have an infinite measure; therefore, restricting it to a subspace S = [0, θmax) for some θmax ∈ R_+ satisfies the finiteness condition. Moreover, the tail measure of the Beta process involves evaluating a degenerate incomplete beta function, which has no analytical solution but can be computed numerically. On the other hand, for some choices of c(θ), the probability kernel (2.56) has a closed form, as shown in the following example.
Example 2.3.2 (Beta process with c(θ) = e^{−λ(θ)}). For S = [0, θmax), (2.56) becomes
\[ p([0, \theta], w) = \frac{(1-w) - (1-w)^{\exp(-\lambda(\theta))}}{(1-w) - (1-w)^{\exp(-\lambda(\theta_{\max}))}}. \tag{2.61} \]
If λ(θ) is invertible, then p([0, ·], w)^{-1}(x) has an analytical expression.
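For instance, with the invertible choice λ(θ) = θ (an assumption made for illustration), (2.61) inverts in closed form; plugging the result back into (2.61) recovers the input probability:

```python
import numpy as np

def inverse_location_kernel(u, w, theta_max):
    """Invert (2.61) for the Beta process with c(theta) = exp(-lambda(theta)),
    under the illustrative choice lambda(theta) = theta.
    Solves p([0, theta], w) = u for theta, given jump size w in (0, 1)."""
    base = 1.0 - w
    denom = base - base ** np.exp(-theta_max)
    target = base - u * denom          # equals (1 - w)^{exp(-theta)}
    return -np.log(np.log(target) / np.log(base))

w, theta_max, u = 0.3, 2.0, 0.6
theta = inverse_location_kernel(u, w, theta_max)
p = ((1 - w) - (1 - w) ** np.exp(-theta)) / \
    ((1 - w) - (1 - w) ** np.exp(-theta_max))   # evaluates (2.61) at theta
```

Given a jump size w from the weight dimension, sampling u ~ Uniform[0, 1] and applying the inverse kernel yields the corresponding location θ.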
Chapter 3
Decomposable random graphs
3.1 Introduction
In high dimensional multivariate data with unknown dependency structure, graphical models are used to simultaneously infer model parameters and the conditional dependencies among variates. The class of decomposable graphs is extensively applied in this context, primarily due to its explicit interpretation of conditional dependencies, which greatly simplifies the observational data likelihood. The Gaussian graphical model (Lauritzen, 1996; Whittaker, 2009) has seen success in a variety of applications of this dependency nature. Nonetheless, most work related to decomposable graphs focuses on their utility as functional priors over large covariance matrices or as priors over hierarchies of model parameters. Few efforts in the statistical literature exist beyond this framework; examples include structural learning of time series (Tank et al., 2015) and Bayesian nonparametric models on decomposable graphs (Caron and Doucet, 2009).
In parallel, the literature on random graphs has seen much interest recently, generally focused on modelling structural relational data in the form of random d-arrays of binary or count data. The first work on random graphs is credited to Erdös and Rényi (1959), and since then many random graph and 2-array models have been proposed: for example, the blockmodel (Wang and Wong, 1987), the latent distance model (Hoff et al., 2002), the infinite relational model (Kemp et al., 2006), and many others; refer to Newman (2010, 2003) for a good introduction. A general principle of random graph models is to assume a latent affinity parameter for each node in the network, governing its likelihood of forming edges with other nodes. Affinities are hence seen as the drivers of the observed network structure, and modelling interest is mostly focused on their inference. Moreover, recent developments in random graphs point towards a unified modelling framework, based on the Aldous-Hoover and the Kallenberg representation theorems (Aldous, 1981; Hoover, 1979; Kallenberg, 1999). Both representation theorems model random graphs as infinite objects, where a realization is a sample from such an object through a random function indexed by node affinities.
This work attempts to bridge the gap between the sole use of decomposable graphs in graphical models for Bayesian model determination and the recent affinity-based random graph framework. Motivated by the Kallenberg representation theorem, a decomposable random graph model is proposed. This work builds on the junction tree representation of decomposable graphs and its direct connection to some of the combinatorial properties of such graphs. The explicit interpretation of conditional dependencies allows for the construction of Markov update rules for edge probabilities that yield an easy sampling scheme.
Section 3.2 introduces preliminaries on the combinatorial structure of decomposable graphs, their relation to junction trees, the decomposition of the observable data likelihood they induce, and some of the current models on decomposable graphs. Section 3.3 introduces a decomposable random graph model conditioned on a junction tree and discusses certain issues related to the Kallenberg representation theorem of graphs. Sections 3.4 and 3.5 illustrate an iterative sampling procedure for the proposed model. Section 3.6 shows a sample of practical examples and some of their properties. Section 3.7 gives exact expressions for some expectations, conditional on certain types of trees.
3.2 Preliminaries
3.2.1 Decomposable graphs
Let G = (Θ, E) be an undirected graph with a set of nodes Θ = {θi}_{i∈N} and edges E = {{θi, θj}}_{i,j∈N}. A pair of nodes θi, θj ∈ Θ are adjacent if {θi, θj} ∈ E; the set notation is used since the edge (θi, θj) is identical to (θj, θi) in an undirected graph. A graph G' = (Θ', E') is called a subgraph of G if Θ' ⊂ Θ and (or) E' ⊂ E. For simplicity, let G(Θ') be the subgraph induced by Θ' ⊂ Θ, where only edges between the nodes Θ' are included; similarly, G(E') is a subgraph where only nodes forming edges in E' are included. A subset C ⊆ Θ is said to be complete if every two distinct nodes in C are adjacent; thus G(C) is a complete subgraph of G, commonly called a clique of G. It is worth noting that subgraphs of cliques are also cliques; thus one can define a maximal clique as a clique that cannot be extended by including any adjacent node while remaining complete. Consequently, all subgraphs of maximal cliques are also cliques, but not necessarily maximal.
In graph theory, there are many types of graphs, categorized by their overall structure or by certain properties, for example connectivity. In this work, we mainly focus on graphs admitting the decomposable (chordal) property. The graph G is said to be decomposable if, and only if, any cycle of four or more nodes has a chord, that is, an edge joining two nodes of the cycle that is not itself part of the cycle. An equivalent definition is given by Lauritzen (1996) as follows.

Definition 5. (Decomposable graphs, (Lauritzen, 1996)) A graph G = (Θ, E) is decomposable if it can be partitioned into a triple (A, S, B) of disjoint subsets of Θ such that A ⊥_G B | S and S is complete. In other words, A is conditionally independent of B given S.
A well known property of decomposable graphs is the perfect ordering of their maximal cliques. Denote the set of maximal cliques of G by C, and let |C| = K. Define a permutation π : {1, . . . , K} → {1, . . . , K} such that
\[ H_{\pi(j)} = \bigcup_{i=1}^{j} C_{\pi(i)}, \qquad S_{\pi(j)} = H_{\pi(j-1)} \cap C_{\pi(j)}, \qquad C_{\cdot} \in \mathcal{C}. \tag{3.1} \]
Then a sequence G(C_π) = (G(C_{π(1)}), . . . , G(C_{π(K)})) is called a perfect ordering sequence (POS) of G if, and only if, for all j > 1 there exists an i < j such that G(S_{π(j)}) ⊆ G(C_{π(i)}). The latter is known as the running intersection property (RIP) of the sequence.
The set C_K = {C_1, . . . , C_K} is called the cliques of G, as each component G(C_i) forms a maximal clique, and the set S_K = {S_1, . . . , S_K} is called the minimal separators of G, where each component G(S_i) decomposes G in the sense of Definition 5. While each maximal clique appears exactly once in C_π(K), separators can repeat multiple times in S_π(K); hence the name minimal separators refers to the unique set of separators. The POS is a strong property: a graph G is decomposable if, and only if, its maximal cliques can be numbered in a way that adheres to the RIP, thus forming a POS. Nonetheless, a decomposable graph can be characterized by multiple distinct POSs of its maximal cliques. For example, consider a graph formed of four triangles {ABC, BCE, CDE, BEF}, as shown in Figure 3.1. Table 3.1 lists three possible perfect orderings.
Figure 3.1: An undirected decomposable graph of 4 cliques of size 3: ABC, BEF, BCE, CDE.
Despite the non-uniqueness of the POSs, Lauritzen (1996) showed that the multiplicity of the minimal separators does not depend on the perfect ordering, implying a unique set of separators S across all POSs. Moreover, enumerating all POSs of a graph is directly related to enumerating its junction trees. A tree T = (C, E) is called a junction tree of cliques of G, or simply the junction tree, if the nodes of T are the maximal cliques of G, and each edge in E corresponds to a minimal separator S ∈ S.

Table 3.1: Possible perfect orderings of the cliques of Figure 3.1

perfect ordering (Cπ(1), Cπ(2), Cπ(3), Cπ(4)) | separators (Sπ(2), Sπ(3), Sπ(4))
(ABC, BCE, CDE, BEF) | (BC, CE, BE)
(CDE, BCE, BEF, ABC) | (CE, BE, BC)
(BEF, BCE, ABC, CDE) | (BE, BC, CE)

The junction tree concept
is generally expressed in a broader sense: for any collection C of subsets of a finite set of nodes Θ, not necessarily the maximal cliques, a tree T = (C, E) is called a junction tree if any pairwise intersection C1 ∩ C2 of pairs C1, C2 ∈ C is contained in every node on the unique path in T between C1 and C2. Equivalently, for any node θ ∈ Θ, the set of subsets in C containing θ induces a connected subtree of T. There is a direct theoretical link between junction trees and POSs, as shown in Cowell et al. (2006).
Theorem 11. (Junction tree, (Cowell et al., 2006)) A graph G is decomposable if, and only if, there exists a junction tree of its cliques.
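A standard constructive test for decomposability, equivalent to the existence of a perfect ordering, is maximum cardinality search (MCS): a graph is decomposable if, and only if, in the MCS order every node's earlier-numbered neighbours form a clique (Tarjan and Yannakakis, 1984). A sketch, with the dictionary-of-sets adjacency encoding chosen for illustration:

```python
def is_decomposable(adj):
    """Chordality test via maximum cardinality search (MCS).

    adj: dict mapping each node to the set of its neighbours (undirected).
    The graph is decomposable iff, in the MCS order, every node's
    earlier-numbered neighbours form a clique."""
    weight = {v: 0 for v in adj}
    order, numbered = [], set()
    for _ in range(len(adj)):
        # pick an unnumbered node with the most numbered neighbours
        v = max((u for u in adj if u not in numbered), key=lambda u: weight[u])
        order.append(v)
        numbered.add(v)
        for u in adj[v]:
            if u not in numbered:
                weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        if earlier:
            parent = max(earlier, key=lambda u: pos[u])
            if not (earlier - {parent}) <= adj[parent]:
                return False
    return True

# The four-triangle graph of Figure 3.1: cliques ABC, BCE, CDE, BEF
adj = {"A": {"B", "C"}, "B": {"A", "C", "E", "F"},
       "C": {"A", "B", "D", "E"}, "D": {"C", "E"},
       "E": {"B", "C", "D", "F"}, "F": {"B", "E"}}
```

The check passes on the graph of Figure 3.1, while a chordless four-cycle fails it.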
Despite the guaranteed existence of a junction tree, a decomposable graph may admit more than one junction tree, a direct consequence of the non-uniqueness of the POSs. Nonetheless, since the set of separators is unique, the junction tree edge set E is unique and characterizes all junction trees (Cowell et al., 2006). The connection between POSs and junction trees can be succinctly summarized in a bipartite network between the two sets, as shown in Hara and Takemura (2006) and illustrated by the example in Figure 3.2.

The bipartite link between maximal cliques and junction trees in decomposable graphs plays a central role in the generative Bayesian model proposed in this work. We use this dichotomy to move around the space of decomposable graphs by alternating between the two sets. For a broader scope, the next section discusses some existing models for decomposable graphs and their implications for this work.
Figure 3.2: A decomposable graph and its bipartite graph linking junction trees of cliques and perfect orderings: (a) a decomposable graph with a clique of size 3 (C2) and two cliques of size 4 (C1, C3); (b) a connected bipartite graph between the junction trees of cliques (T1, T2) and the six perfect orderings of (C1, C2, C3).
3.2.2 Models for decomposable graphs
The earliest introduction of decomposable graphs in statistics was by Darroch et al. (1980) and Wermuth and Lauritzen (1983), as a generating class of decomposable log-linear models on multidimensional contingency tables. As a result of the direct connection between decomposability, as in Definition 5, and the notion of conditional independence, decomposable graphs helped reduce the number of factors in contingency tables without altering the maximum likelihood estimates: factors belonging to the same maximal clique were collapsed. Models using decomposable graphs have since appeared in various areas of statistics, for example the work of Spiegelhalter et al. (1993) on Bayesian expert systems, and Cowell et al. (2006), a recent book on this topic. Giudici and Green (1999) and Frydenberg and Steen (1989) used the decomposability structure to factorize the likelihood for Bayesian model determination and mixed graphical interaction models, respectively. Stingo and Marchetti (2015) proposed efficient local updates for undirected graphical models by updating the junction tree. Most recent work uses decomposable graphs as a latent interaction structure or as a clustering prior (Bornn and Caron, 2011; Ni et al., 2016).
The relatively wide use of decomposable graphs stems from the separation property of cliques and separators, which leads to a partitioning of the likelihood. In particular, Dawid and Lauritzen (1993) showed that a random variable X = (X_i)_{i<n} with a Markov distribution p has a conditional dependency structure abiding to a decomposable graph G if, and only if, the likelihood factorizes as

p(X | G) = ∏_{C ∈ C} p(X_C) / ∏_{S ∈ S} p(X_S),   (3.2)

where C and S are the sets of maximal cliques and minimal separators, respectively, and X_A = {X_i : i ∈ A}.
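The factorization (3.2) can be sketched in a few lines of code. The snippet below is a minimal illustration (not from the thesis): it assumes a user-supplied `marginal(A, x)` function returning the marginal probability p(x_A), instantiated here for a toy chain of three perfectly correlated fair coin flips, with maximal cliques {0,1} and {1,2} and separator {1}.

```python
def decomposable_likelihood(x, cliques, separators, marginal):
    """Evaluate p(x | G) = prod_C p(x_C) / prod_S p(x_S) for a
    decomposable graph with maximal cliques `cliques` and minimal
    separators `separators`.  `marginal(A, x)` is a hypothetical
    user-supplied function returning the marginal p(x_A)."""
    num = 1.0
    for C in cliques:
        num *= marginal(C, x)
    den = 1.0
    for S in separators:
        den *= marginal(S, x)
    return num / den

def marginal(A, x):
    # toy model: each variable uniform on {0,1}, neighbours equal w.p. 1,
    # so any marginal is 0.5 on constant configurations and 0 otherwise
    vals = [x[i] for i in sorted(A)]
    return 0.5 if all(v == vals[0] for v in vals) else 0.0

# p((0,0,0)) = (0.5 * 0.5) / 0.5 = 0.5, the correct joint probability
p = decomposable_likelihood((0, 0, 0), [{0, 1}, {1, 2}], [{1}], marginal)
```

The toy marginals are chosen so the factorized product reproduces the true joint of the perfectly correlated chain.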
Despite the broad use of decomposable graphs in statistics, little work has been done on the sampling aspect. The lack of sampling methods is partly due to the complexity of testing for decomposability in large graphs; for example, determining the size of the largest maximal clique is still an open problem. It is also partly due to the lack of explicit methods that generate and quantify the space of junction trees or perfect orderings associated with a given graph. The recent notable work of Thomas and Green (2009) and Stingo and Marchetti (2015) takes steps in this direction, where both focus on updating the junction tree for faster mixing times. Nonetheless, computational complexity is still the largest obstacle. As noted by Thomas and Green (2009), one of the best available clique tree search algorithms is that of Tarjan and Yannakakis (1984), which is of order O(|Θ| + |E|). Yet, for most dense graphs |E| is of order O(|Θ|²), and at best O(|Θ|) for sparse graphs.
This work adopts a more general modelling objective, where decomposable graphs are seen as special cases of random graphs, in the sense discussed in Sections 2.2.3 and 2.2.5 and surveyed in Orbanz and Roy (2015). Much of the earlier work on decomposable graphs focused on the junction tree representation, for its simplicity and computational efficiency. The next section introduces a model for decomposable random graphs that also builds on the junction tree representation.
3.3 Decomposable random graphs by conditioning on junction trees
By definition, the building blocks of decomposable graphs are their maximal cliques and the set of minimal separators. The smallest possible clique is a complete graph on two nodes (a stick). For our modelling purposes, we will regard the smallest possible clique to be an isolated node, where two isolated nodes form two maximal cliques, and connected they form a single maximal clique. Hence, an n-node graph can have a maximum of n maximal cliques, where all nodes are isolated, and a minimum of a single clique, where all nodes are connected, forming an n-complete graph.

Relating the number of nodes to the range of possible cliques reflects the fact that cliques can be seen as latent communities, observed in clique form through the attainment of node memberships. A decomposable graph is then an interaction between two sets of objects: the graph nodes and the latent communities. In the discrete case, when the number of nodes is known, out of the n possible communities of an n-node graph, only 1 ≤ k ≤ n communities are observable in the form of maximal cliques. The remaining n − k clique-communities are either latent with no visible node members or subgraphs of maximal cliques; either way, they are unobservable.
Let G be a decomposable graph with T_G being one of its junction trees. In classical settings, G is modelled via its adjacency matrix, T_G is a function of G, and research interest is in modelling the probability of node interactions.

Classical representation:

Given: G = (Θ, E), T_G = f(G);   interest: P({θ_i, θ_j} ∈ E).   (3.3)
This work models decomposable graphs via their biadjacency matrix. By separating the notion of nodes and maximal cliques, the biadjacency matrix connects the graph nodes to the latent community nodes representing maximal cliques. Let θ′_1, θ′_2, · · · ∈ Θ′ be a set of latent communities connected via the tree T = (Θ′, E). We define Z to be the biadjacency matrix of a decomposable graph G, where z_ki = 1 implies node θ_i is a member of clique θ′_k, and otherwise z_ki = 0. In essence, Z represents a bipartite interaction between the two sets Θ′ and Θ, such that {θ′_k, θ_i} ∈ E_Z, the edge set of Z, also implies that node θ_i is a member of clique θ′_k. The interest is in modelling the probability of node-clique interactions.

Alternative representation:

Given: G = (Θ, E), T = (Θ′, E), Z = (Θ′, Θ, E_Z);   interest: P(z_ki = 1).   (3.4)

G is a deterministic function of Z, since its adjacency matrix is

A = (a_ij)_{ij} = ( min{z⊤_{.i} z_{.j}, 1} )_{ij},   (3.5)

where z_{.i} is the i-th column of Z. Essentially, members of the same community, a row in Z, are connected in G.
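The mapping (3.5) is a one-line matrix operation. The sketch below assumes a 0/1 NumPy array Z with cliques as rows and nodes as columns; zeroing the diagonal (no self-loops) is a convention added here, not stated in (3.5).

```python
import numpy as np

def adjacency_from_biadjacency(Z):
    """Map a clique-membership (biadjacency) matrix Z (cliques x nodes)
    to the adjacency matrix A of (3.5): two nodes are adjacent iff they
    share at least one clique.  The diagonal is zeroed by convention."""
    A = np.minimum(Z.T @ Z, 1)
    np.fill_diagonal(A, 0)
    return A

# Z encodes two cliques {0,1,2} and {2,3}; node 2 belongs to both
Z = np.array([[1, 1, 1, 0],
              [0, 0, 1, 1]])
A = adjacency_from_biadjacency(Z)
```

Members of the same row of Z come out pairwise connected in A, as the text describes.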
This assumes that an observed junction tree T_G of G is, in some way, a subtree of T, since the maximal cliques C of G are a subset of Θ′. A more precise relation is T_G = f(T), a function of T; it is a subtree when T(C) is a fully connected tree, that is, when all community nodes representing maximal cliques are connected, with no sub-maximal nodes in between.

To fully capture the dynamics of decomposable graphs, a model for Z ought to be iterative, first modelling Z | T and then iteratively T | Z. Classical models for decomposable graphs, such as the work of Green and Thomas (2013), adopt a similar tree-dependent iterative scheme, where the conditional T update relies on the bipartite relation in Figure 3.2 between trees and perfect orderings. This work models T | Z in a similar manner; thus, the focus is on proposing a model for Z | T.
Sampling edges in a decomposable graph is highly dependent on the current configuration of the graph; otherwise, (dis)connecting an arbitrary edge might render the graph non-decomposable. Figure 3.3 illustrates an example where the decomposable graph in 3.3a stays decomposable in 3.3b, when node E joins clique AD, though with a different junction tree. It becomes non-decomposable in 3.3c, when node F joins clique ABC, thus forming the cycle ADEF with no inner chord.
Figure 3.3: An example of arbitrarily adding an edge between nodes in a decomposable graph. (a) The original decomposable graph, with junction tree ABC - AD - DE - EFG. (b) Node E joins clique AD, causing a change in the junction tree (ABC - ADE - EFG) while preserving decomposability. (c) Node F joins clique ABC, abolishing decomposability by forming the cycle ADEF with no inner chord.
The Markov local dependency in decomposable graphs, shown in Figure 3.3, translates directly to the biadjacency representation Z. Given T, sampling z_ki is highly dependent on the current configuration of Z, including the current value of z_ki. Green and Thomas (2013) have illustrated conditional (dis)connect moves on G | T that ensure decomposability. The following proposition illustrates the permissible moves in Z | T that ensure Z maps to a decomposable graph through (3.5).
Remark. The notation θ′_k indexes the rows of Z and also represents the tree nodes in T. To avoid ambiguity, let the term "node(s)" refer to the graph nodes, and "clique-node(s)" to the nodes of the latent clique communities of Θ′ in the given tree T, unless otherwise specified. For simplicity, we will often use the term "clique θ′_k" to refer to the maximal clique represented by the tree node θ′_k, with the shorthand notation G(θ′_k).
Proposition 1 (Permissible moves in Z | T). Let T = (Θ′, E) be an arbitrary tree over the set of clique-nodes Θ′. For a decomposable graph G = (Θ, E) with a junction tree that is a subtree of T, let Z be the biadjacency matrix of G, where z_ki = 1 implies node θ_i ∈ Θ is a member of the maximal clique represented by θ′_k ∈ Θ′. For an arbitrary node θ_i ∈ Θ, let T|i denote the subtree of T induced by the node θ_i as

T|i = T({θ′_s ∈ Θ′ : θ_i ∈ G(θ′_s)}),   (3.6)

where θ_i ∈ G(θ′_s) implies z_si = 1. Moreover, let T|i_bd refer to the boundary clique-nodes of the induced tree T|i, those of degree 1 (leaf nodes), and T|i_nei to the clique-nodes in T neighbouring T|i, as

T|i_bd = {θ′_s ∈ Θ′ : θ_i ∈ G(θ′_s), deg(θ′_s, T|i) = 1},
T|i_nei = {θ′_s ∈ Θ′ : (θ′_k, θ′_s) ∈ E, z_ki = 1, z_si = 0}.   (3.7)

Suppose θ′_k ∈ T|i_bd ∪ T|i_nei, and let Z′ be the biadjacency matrix formed by one of the following moves:

• connect move: if θ′_k ∈ T|i_nei, set z_ki = 1;
• disconnect move: if θ′_k ∈ T|i_bd, set z_ki = 0.

Then Z′ represents a decomposable graph G′, through the mapping implied by the matrix in (3.5), with junction tree T′_G′ = f(T).
Proof. The boundary and neighbouring sets of (3.7) do not guarantee that non-empty rows of Z′ represent maximal cliques in G′. For example, (dis)connecting a node from a maximal clique can cause the clique to become sub-maximal. However, one can always construct a junction tree of G′ given T; thus, by Theorem 11, G′ is decomposable.

If all clique-nodes of maximal cliques of G′ are adjacent in T, a junction tree of G′ is simply the induced tree T′_G′ = T({θ′_s ∈ Θ′ : θ′_s is maximal in G′}). Otherwise, since every non-maximal clique θ′_s is contained in some maximal clique θ′_k that is adjacent to it in T, {θ′_s, θ′_k} ∈ E, all edges in T incident to θ′_s, except {θ′_s, θ′_k}, can be rewired to θ′_k. This process forms the tree T′ where all maximal clique-nodes are adjacent and non-maximal clique-nodes are leaf nodes. A junction tree T′_G′ is then the induced tree on T′ as T′_G′ = T′({θ′_s ∈ Θ′ : θ′_s is maximal in G′}).
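The induced-tree sets of (3.6) and (3.7) can be computed directly from Z and the tree. The sketch below is an illustrative implementation under the assumption that clique-nodes are plain integer indices and T is given as an adjacency dictionary; the singleton-member convention in the comment is an assumption added here.

```python
def boundary_and_neighbour_sets(i, Z, tree_adj):
    """The simple sets T|i_bd and T|i_nei of (3.7) for graph node i.
    Z[k][i] is the membership indicator z_ki; tree_adj maps each
    clique-node index to the set of its tree neighbours in T.
    (A sketch: the thesis indexes clique-nodes by locations theta'_k,
    here they are plain integers.)"""
    # node set of the induced subtree T|i of (3.6)
    members = {k for k in tree_adj if Z[k][i] == 1}
    # leaves of T|i; a singleton member is treated as a leaf here
    bd = {k for k in members if len(tree_adj[k] & members) <= 1}
    # clique-nodes adjacent in T to T|i that do not yet contain node i
    nei = {s for k in members for s in tree_adj[k] if s not in members}
    return bd, nei

# path C0 - C1 - C2; node 0 belongs to cliques C0 and C1 only
tree_adj = {0: {1}, 1: {0, 2}, 2: {1}}
Z = [[1], [1], [0]]
bd, nei = boundary_and_neighbour_sets(0, Z, tree_adj)
```

In this example both members of the induced path are leaves, so both are boundary clique-nodes, and C2 is the sole neighbour.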
The boundary and neighbouring sets in (3.7) of Proposition 1 do not ensure that cliques remain maximal after a (dis)connect move in Z. For such cliques to remain maximal, we impose extra conditions on both T|i_bd and T|i_nei. In the case of T|i_bd, we impose the extra condition that a boundary clique stays maximal after a node's disconnection. Similarly, for T|i_nei, we impose the extra condition that a neighbouring clique stays maximal after a node's connection. Formally, with abuse of notation, let T|i_bd and T|i_nei be as in (3.7), though with the extra imposed conditions, as

T|i_bd = T|i_bd ∩ {θ′_k ∈ Θ′ : θ′_k \ θ_i ⊄ θ′_s, s ≠ k},
T|i_nei = T|i_nei ∩ {θ′_k ∈ Θ′ : θ′_k ∪ θ_i ⊄ θ′_s, s ≠ k},   (3.8)

where θ′_k \ θ_i refers to the subgraph formed by disconnecting the node θ_i from clique θ′_k, and θ′_k ∪ θ_i to the opposite, the subgraph formed by connecting node θ_i to clique θ′_k.
The extra conditions imposed in (3.8) are arguably restrictive and computationally expensive in large graphs; however, for a coherent introduction to the model, we will retain them. Section 3.3.2 examines a related issue, for which a practical solution is proposed that also softens these conditions.

Proposition 1 illustrated the permissible moves in Z that ensure its representability as a biadjacency matrix of a decomposable graph. The next section introduces a model for random decomposable graphs as realizations from point processes on R²₊.
3.3.1 Decomposable graphs as point processes

Drawing from the point process representation of graphs in Sections 2.2.4 and 2.2.5, let {(θ_i, ϑ_i)} and {(θ′_i, ϑ′_i)} be unit-rate Poisson processes on R²₊ representing the set of nodes Θ and clique-nodes Θ′, respectively. Refer to θ as the node location and ϑ as the node weight. Given a tree T = (Θ′, E), the biadjacency matrix Z takes the form of a bipartite atomic measure on R²₊, as

Z = Σ_{k,i} z_ki δ_{(θ′_k, θ_i)},   (3.9)

where z_ki := I{U_ki ≤ W(α, ϑ′_k, ϑ_i)}, for some random variable α ∈ R₊, a uniform random array (U_ki) on [0, 1], and a random measurable function W : R³₊ → [0, 1]. The decomposable graph G represented by Z is then characterized as

G = Σ_{i,j} min( Σ_k z_ki z_kj δ_{(θ′_k, θ_i)} δ_{(θ′_k, θ_j)}, 1 ) δ_{(θ_i, θ_j)}.   (3.10)

Again, G is completely determined by Z. The following definitions introduce useful graph functions and notation used in this work.
Definition 6. Denote by v and e the operators returning the node and edge sets of graph-like structures, respectively, so that v(G(x)) ⊆ Θ are the nodes of the subgraph G(x) and e(G(x)) ⊆ E are its edges. Since Z also represents a bipartite graph, let v(Z(y)) be the subset of nodes and clique-nodes in Θ′ ∪ Θ for the subgraph Z(y), and e(Z(y)) the node-clique membership edges. To distinguish between nodes and clique-nodes in Z, denote v_n(Z(y)) := v(Z(y)) \ Θ′ as the set of graph nodes, and v_c(Z(y)) := v(Z(y)) \ Θ as the set of clique-nodes. For the subtree T(t), v(T(t)) ⊆ Θ′ and e(T(t)) ⊆ E.
Definition 7. Following the notation of Definition 6, denote by nei the operator returning the set of neighbouring nodes, and by deg the degree of a specific node, so that nei(θ_i, G) are the neighbouring nodes of θ_i in G and deg(θ_i, G) = |nei(θ_i, G)| is the node degree. The junction tree follows similarly. For Z, nei(θ_i, Z) = v_c(Z( . ∩ θ_i)) and nei(θ′_k, Z) = v_n(Z(θ′_k ∩ . )).
Given the characterization of neighbouring and boundary cliques (Eq. (3.8)) and the characterization of z_ki in (3.9), we can define the (n+1)-th Markov update step for z_ki^(n+1) given the current configuration Z^(n), as

P(z_ki^(n+1) = 1 | Z^(n), T) = W^(n+1)(ϑ′_k, ϑ_i) =
  0              if z_ki^(n) = 0 and θ′_k ∉ T^(n)|i_nei,
  1              if z_ki^(n) = 1 and θ′_k ∉ T^(n)|i_bd,
  W(ϑ′_k, ϑ_i)   if z_ki^(n) = 1 and θ′_k ∈ T^(n)|i_bd,
  W(ϑ′_k, ϑ_i)   if z_ki^(n) = 0 and θ′_k ∈ T^(n)|i_nei.   (3.11)

Note that θ′_k ∈ T^(n)|i_bd at step n only if θ_i is a member of the clique θ′_k, that is, z_ki^(n) = 1. Similarly, θ′_k is a neighbour of T^(n)|i only if z_ki^(n) = 0. Otherwise, as in the first and second cases of (3.11), z_ki^(n+1) = z_ki^(n). Then (3.11) simplifies to

P(z_ki^(n+1) = 1 | Z^(n), T) = W^(n+1)(ϑ′_k, ϑ_i) =
  W(ϑ′_k, ϑ_i)   if θ′_k ∈ T^(n)|i_bd ∪ T^(n)|i_nei,
  z_ki^(n)       otherwise.   (3.12)
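A single step of (3.12) can be sketched as follows. This is a minimal illustration: the function and weight names are hypothetical, W stands in for any [0,1]-valued link function of the clique and node weights, and the simple sets of (3.7) are used for the boundary/neighbour test.

```python
import random

def markov_update(i, k, Z, tree_adj, W, clique_w, node_w):
    """One Markov step for z_ki per (3.12): resample from W only when
    clique-node k is a boundary clique of T|i or a neighbour of T|i;
    otherwise z_ki is kept as is.  Z[k][i] holds z_ki, tree_adj maps
    clique-node index -> set of tree neighbours (assumed encodings)."""
    members = {s for s in tree_adj if Z[s][i] == 1}
    bd = {s for s in members if len(tree_adj[s] & members) <= 1}
    nei = {t for s in members for t in tree_adj[s] if t not in members}
    if k in bd or k in nei:
        Z[k][i] = 1 if random.random() <= W(clique_w[k], node_w[i]) else 0
    return Z[k][i]

tree_adj = {0: {1}, 1: {0, 2}, 2: {1}}
cw, nw = [1.0, 1.0, 1.0], [1.0]

Z = [[1], [1], [0]]
z_new = markov_update(0, 2, Z, tree_adj, lambda a, b: 1.0, cw, nw)   # connect move fires
Z2 = [[1], [1], [1]]
z_keep = markov_update(0, 1, Z2, tree_adj, lambda a, b: 0.0, cw, nw) # interior clique: kept
```

With W ≡ 1 the neighbouring clique-node always connects; an interior clique-node (neither boundary nor neighbour) keeps its current value regardless of W, as the second branch of (3.12) dictates.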
For simplicity, W is specified in (3.11) and (3.12) with its last two arguments relating to the weight parameters of the nodes and clique-nodes. Nonetheless, the form of W in (3.12) is still unspecified, and for it to be a sensible modelling object, the most general definition would require it to be at least measurable with respect to a probability space. Depending on the sampling method, more conditions might be required. Notably, the framework used thus far mimics that of the Kallenberg representation theorem introduced in Section 2.2.5. A realization from such a random infinite measure is seen as a cubic truncation [0, r]², r > 0, of R²₊, and in that sense the point process on the finite region [0, r]² might not be finite. In practice, a realization from a finite restriction is desired to be finite. Kallenberg (2005, Prop. 9.25) has given necessary and sufficient conditions for an exchangeable measure to be a.s. locally finite. Since we are not particularly focused on exchangeable random measures, yet still interested in the finiteness of a realization, the following definition simplifies Kallenberg's condition by taking the random functions S = S′ = I = 0 in Section 2.2.5.
Definition 8 (locally finite). Let ξ be a random atomic measure on R²₊, such that for a measurable random function W : R³₊ → [0, 1], ξ takes the form

ξ = Σ_{i,j} I{U_ij ≤ W(α, ϑ′_i, ϑ_j)} δ_{(θ′_i, θ_j)},

where {(θ′_i, ϑ′_i)} and {(θ_j, ϑ_j)} are two independent unit-rate Poisson processes on R²₊, and (U_ij) is a 2-array of [0, 1]-uniformly distributed random variables. Then, for a fixed α, the random measure ξ is a.s. locally finite if, and only if, the following conditions are satisfied:

(i) Λ{W̄₁ = ∞} = Λ{W̄₂ = ∞} = 0,
(ii) Λ{W̄₁ > 1} < ∞ and Λ{W̄₂ > 1} < ∞,
(iii) ∫_{R²₊} W(x, y) I{W̄₁(y) ≤ 1} I{W̄₂(x) ≤ 1} dx dy < ∞,

where W̄₁(y) = ∫_{R₊} W(x, y) dx and W̄₂(x) = ∫_{R₊} W(x, y) dy, and Λ is the Lebesgue measure. In summary, if W is integrable then ξ is a.s. locally finite.
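As a quick sanity check of Definition 8, the sketch below verifies the conditions numerically for the assumed separable choice W(x, y) = exp(−x − y), which is not a choice made in this work but is a convenient integrable example: W̄₁(y) = e^{−y} ≤ 1 and W̄₂(x) = e^{−x} ≤ 1, so (i) and (ii) hold trivially and (iii) reduces to the double integral of W.

```python
import math

def W(x, y):
    # assumed illustrative link function, integrable on R^2_+
    return math.exp(-x - y)

def double_integral(f, hi=20.0, n=400):
    """Crude midpoint rule on [0, hi]^2; the tail beyond hi is
    negligible for this exponentially decaying W."""
    h = hi / n
    pts = [(j + 0.5) * h for j in range(n)]
    return sum(f(x, y) for x in pts for y in pts) * h * h

# exact value of the integral is (1 - exp(-20))^2, i.e. about 1,
# which is finite, so condition (iii) holds for this W
total = double_integral(W)
```

A non-integrable choice, e.g. W ≡ 1, would make the same integral diverge with the truncation level, signalling a realization that is not locally finite.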
Thus far, we have introduced the general framework of the proposed model, the Markov update scheme at each step, and the conditions on W required to ensure a finite realization of the model. We now give a formal definition of decomposable random graphs.

Definition 9 (Decomposable random graph). A decomposable random graph G is a random graph associated with a biadjacency atomic random measure Z taking the form in (3.9), with the random function W : R³₊ → [0, 1] satisfying the conditions of Definition 8, where Z is constructed by means of a Markov process having the update steps of (3.12). A realization of such a measure takes the form of an (r′, r)-truncation as Z_{r′,r} = Z( . ∩ [0, r′] × [0, r]), for r′, r > 0.
The definition of decomposable random graphs in Definition 9 specified the Markov update process of (3.12), which itself depends on the boundary and neighbouring sets of (3.8). Instead, the simpler boundary and neighbouring sets of (3.7) could be used since, as shown in Proposition 1, a formation of Z using the sets in (3.7) results in a decomposable graph under the mapping of (3.10). However, in this case, the latent tree T connecting the clique-nodes (θ′_k, ϑ′_k) would not be seen as the limit of a junction tree of the graph as more nodes enter the truncation, a direct result of the fact that the clique-nodes in T can represent both maximal and non-maximal cliques under (3.7). Such a treatment of decomposable graphs is still possible; in Chapter 4 we showcase an application resulting from it, where we treat sub-maximal cliques as sub-clusters of maximal cliques.

The sampling notion of an (r′, r)-truncation mentioned in Definition 9 is not yet fully discussed, in particular how it assures decomposability with the scaling of r or r′. The next section formalizes this notion, where certain issues relating to decomposability are disclosed along with some proposed solutions.
3.3.2 Finite graphs formed from domain restrictions

Having made the choice of representing the biadjacency matrix Z as an infinite point process on R²₊, a finite observation of Z can be seen as the graph resulting from the cubic restriction [0, r′] × [0, r] of R²₊, where clique-nodes and nodes are visible only if they appear in some edge of Z_{r′,r} = Z( . ∩ [0, r′] × [0, r]), with locations satisfying θ′_k < r′ and θ_i < r. In this case, we refer to the appearing clique-nodes and nodes as active.

There is a clear ambiguity relating to the influence of each domain restriction on the other, especially due to the Markov formation of the graph. Nonetheless, if we neglect for a moment the formation method and regard the biadjacency Z as an infinite fixed object sampled by the (r′, r)-truncation as Z_{r′,r}, there is still doubt on how an achieved realization forms a decomposable graph. For example, a random embedding of the clique-node locations θ′_1, θ′_2, . . . in R₊ can result in an empty realization even for large values of r′, influenced
by the inter-dependence between clique-nodes in T. Essentially, what is required is that a realization from Z_{r′,r} is decomposable with a junction tree that is a function or subtree of T, albeit not necessarily completely connected. In fact, the observable part of T forms a collection, perhaps connected, of parts of the junction tree parameterizing the graph mapped from Z_{r′,r}. Even if such a notion is allowed, there is still no promise that every active clique-node in a realization is a maximal clique, since a visible portion of a clique, part of which is located outside the truncation, might be contained in another clique within the truncation. A simple way to address these two issues is to ensure that the restriction point r is magnitudes larger than r′, to allow enough active nodes such that all active clique-nodes are maximal. Gauging the truncation size can be done by ensuring that the following set A₀ is empty:

A₀ := {θ′_k < r′ : Z_{r′,r}(θ′_k ∩ . ) ⊆ Z_{r′,r}(θ′_s ∩ . ), for some θ′_s < r′, s ≠ k, θ′_k active} = ∅.   (3.13)
Note that θ′_s need not be active in (3.13), as for non-active cliques Z_{r′,r}(θ′_k ∩ . ) = ∅. Essentially, the conditions in A₀ are the same conditions added to the boundary and neighbouring sets in (3.8).

To scale the (r′, r)-truncation freely while ensuring A₀ is empty, one can extend the truncated node location domain by an "edge-greedy" partition (r, r_o] as [0, r] ∪ (r, r_o]. Rather than trimming all external edges connecting from (r, r_o] to [0, r′], as is done with edges outside the cube [0, r′] × [0, r], we allow a maximum of a single edge per node θ_i ∈ (r, r_o] to connect to an active clique-node θ′_k ∈ [0, r′], and only if it causes Z_{r′,r_o}(θ′_k ∩ . ) to be maximal when it was not in Z_{r′,r}(θ′_k ∩ . ). In other words, let e(Z_{r′,r}) be the edge set formed in Z_{r′,r}; then |e(Z_{r′,r_o}(θ_i ∩ . ))| ≤ 1 for each r < θ_i ≤ r_o, and (θ′_k, θ_i) ∈ e(Z_{r′,r_o}) only if Z_{r′,r}(θ′_k ∩ . ) ⊆ Z_{r′,r}(θ′_s ∩ . ) for some θ′_s, θ′_k ≤ r′, k ≠ s. Figure 3.4 illustrates this process for a realization of a decomposable graph using the restriction [0, r′] × [0, r] with the edge-greedy partition (r, r_o], where one extra node, θ*, is included in clique θ′_3 to ensure the set in (3.13) is empty.
Remark. Allowing r to be much larger than r′ relates directly to the notion discussed at the beginning of Section 3.3, that the set of maximal cliques of T is only partially observable given the nodes.
Figure 3.4: A realization of a decomposable graph in 3.4d from the point process in 3.4a and the tree in 3.4b. The grey area in 3.4a is the edge-greedy partition (r, r_o], where only one extra node (in blue) was needed to guarantee that all active cliques are maximal, since Z_{r′,r}(θ′_3 ∩ . ) is a subset of Z_{r′,r}(θ′_6 ∩ . ) and Z_{r′,r}(θ′_7 ∩ . ). Panel 3.4c is the biadjacency matrix of active (clique-)nodes representing the graph. (Panels: (a) graph as a point process; (b) latent tree T; (c) sampled and extended biadjacency matrix; (d) mapped decomposable graph.)
Condition A₀ is not such a computational burden given the biadjacency representation, since one can compare the off-diagonal to the diagonal entries of Z_{r′,r} Z⊤_{r′,r} to determine where to allow an edge in the edge-greedy partition. In a graph of N_c active clique-nodes, A₀ checks each of the N_c clique-nodes against its neighbours; thus, for a d-regular junction tree, where all clique-nodes are of degree d, the computational complexity is of linear order O(dN_c). Nonetheless, a simpler solution is possible by an identity matrix augmentation, as shown in the next subsection.
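The Gram-matrix comparison just described can be sketched as follows. This is an illustrative implementation, not the thesis's own code; note that two identical cliques would flag both rows, a corner case the sketch does not resolve.

```python
import numpy as np

def subsumed_cliques(Z):
    """Find active clique-nodes whose member set is contained in
    another clique's, via the Gram matrix Z Z^T: row k is subsumed by
    row s when the overlap (ZZ^T)[k, s] equals the clique size
    (ZZ^T)[k, k], mirroring the off- vs on-diagonal comparison in the
    text.  Returns the indices violating condition A0 of (3.13)."""
    G = Z @ Z.T
    sizes = np.diag(G)
    bad = set()
    for k in range(Z.shape[0]):
        if sizes[k] == 0:            # inactive clique-nodes are ignored
            continue
        for s in range(Z.shape[0]):
            if s != k and G[k, s] == sizes[k]:
                bad.add(k)
    return bad

# clique 1 = {2,3} is a subset of clique 0 = {1,2,3}: A0 is violated
Z = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1]])
```

In the edge-greedy scheme, each flagged row is where a single extra node from (r, r_o] would be attached to restore maximality.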
3.3.2.1 Augmentation by an identity matrix

To avoid the checks in A₀ altogether, one can simply augment a realization by an identity matrix, after the removal of empty rows, as per the Kallenberg representation of random graphs. In essence, this operation artificially adds a maximum of N_c extra nodes to the graph, each connected to a single unique clique, thus uniquely indexing the clique set. This process is summarized in Figure 3.5, with the removal of empty rows and an identity augmentation applied to the realization of Figure 3.4c.

Such augmentation, though it seems artificial, is a natural consequence of the framework used and of the edge-greedy partition. In a sense, given the (r′, r)-truncation method, for any realization over Z_{r′,r}, with probability 1 there exists an r_o > r such that the edge-greedy partition (r, r_o] embeds an identity matrix. To show this, we extend the results of Veitch and Roy (2015) concerning the degree distribution of the Kallenberg exchangeable graph, discussed in Section 2.2.5, to the case of a biadjacency measure.
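The augmentation itself is a small matrix operation. A sketch, assuming a 0/1 NumPy array with cliques as rows:

```python
import numpy as np

def augment_identity(Z):
    """Drop empty rows (inactive clique-nodes) and append one
    artificial node per remaining clique, i.e. stack an identity block
    onto Z, so that every active clique-node is trivially maximal."""
    Z = Z[Z.sum(axis=1) > 0]                     # remove empty rows
    return np.hstack([Z, np.eye(Z.shape[0], dtype=Z.dtype)])

Z = np.array([[1, 1, 0],
              [0, 0, 0],                          # inactive clique-node
              [0, 1, 1]])
Za = augment_identity(Z)                          # 2 cliques, 3 + 2 nodes
```

Each appended column is one of the artificial single-membership nodes of Figure 3.5, uniquely indexing its clique.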
Consider the random biadjacency atomic measure generated using the Kallenberg representation and taking the following simplified form:

G = Σ_{k,i} I{U_ki ≤ W(α, ϑ′_k, ϑ_i)} δ_{(θ′_k, θ_i)},   (3.14)

where all notations and conditions follow those of Definition 8. A realization from G is also an (r′, r)-truncation as G_{r′,r} = G( . ∩ [0, r′] × [0, r]), where only edge-connected nodes are visible. The construction of G differs from that of the decomposable graph in Definition 9,
Figure 3.5: Relaxation of (3.13) by removing the empty rows in the realization of Figure 3.4c and augmenting the result with an identity matrix. (Panels: (a) augmented clique-node bipartite matrix, the reduced clique-node matrix with an appended identity block; (b) mapped decomposable graph.)
where the latter is conditioned on a latent tree structure while the former is not. Given a realization G_{r′,r}, we can then define the degree distribution of any point in the domain of the x-truncated Poisson process Π_x (or of either of the two domains, by symmetry). Let

deg((θ, ϑ), Π_r, Π′_{r′}, (U_ij))   (3.15)

be the degree of the point (θ, ϑ) in the domain of Π_r, conditioned on (θ, ϑ) ∈ Π_r. However, the probability that (θ, ϑ) ∈ Π_r is 0; thus, as noted by Veitch and Roy (2015) and discussed more generally in Chiu et al. (2013), this conditioning is ambiguous and ill formulated. Nonetheless, a version of the required conditioning can be obtained through the Palm theory of measures on point sequences. The Slivnyak-Mecke theorem states that the distribution of a Poisson process Π conditioned on a point x is equal to the distribution of Π ∪ {x}; with this, we characterize the degree distribution in the following lemma.

Lemma 1. For a biadjacency measure G_{r′,r}, defined in (3.14), with a non-random W : R²₊ → [0, 1] and a fixed α, the degree distribution of a point (θ, ϑ) ∈ R²₊, θ < r, is deg((θ, ϑ), Π_r ∪ {(θ, ϑ)}, Π′_{r′}, (U_ki)) ~ Poisson(r′ W̄₁(ϑ)), and, by symmetry of construction, deg((θ′, ϑ′), Π′_{r′} ∪ {(θ′, ϑ′)}, Π_r, (U_ki)) ~ Poisson(r W̄₂(ϑ′)).
Proof. Since (θ, ϑ) ∈ Π_r with probability 0, using Palm theory we have

deg((θ, ϑ), Π_r ∪ {(θ, ϑ)}, Π′_{r′}, (U_ki)) = Σ_{(θ′_k, ϑ′_k) ∈ Π′_{r′}} I{U_{θ′_k, θ} ≤ W(ϑ′_k, ϑ)}.   (3.16)

By Definition 8, W is a.s. finite; thus, by a version of Campbell's theorem (Kingman, 1993, ch. 5.3), the characteristic function of (3.16) is

E[exp(it deg((θ, ϑ), . ))] = exp( ∫_{R₊} ∫_{[0,1]} (e^{it I{u ≤ W(x, ϑ)}} − 1) r′ du dx )
                           = exp( r′ W̄₁(ϑ)(e^{it} − 1) ),   (3.17)

where W̄₁(y) = ∫_{R₊} W(x, y) dx, which is the characteristic function of a Poisson(r′ W̄₁(ϑ)) random variable. The result follows similarly for the second domain. For a random W, the same result can be achieved by conditioning.
Now that the degree distribution in G_{r′,r} is well defined, we can show that the identity matrix augmentation of Figure 3.5 is a natural consequence.

Proposition 2. For a realization over Z_{r′,r} from a biadjacency measure as defined in Definition 9, let Θ′_{r′} be the finite set of active clique-nodes in Z_{r′,r}, where |Θ′_{r′}| > 1. Then, with probability 1, there exists an r_o > r such that each θ′_k ∈ Θ′_{r′} is indexed by a unique node r < θ_{π(k)} < r_o that is not connected to any other active clique-node. Thus, the partition (r, r_o] embeds an identity matrix.
Proof. Given a realization over Z_{r′,r}, index the almost surely finite set of active clique-nodes as θ′_1, θ′_2, . . . ∈ Θ′_{r′}. For t > 0, let

Y^(k)_t = |e(Z_{r′,r+t}(θ′_k ∩ . ))| − |e(Z_{r′,r}(θ′_k ∩ . ))|
        = deg((θ′_k, ϑ′_k), Π′_{r′}, Π_{r+t}, (U_ki)) − deg((θ′_k, ϑ′_k), Π′_{r′}, Π_r, (U_ki)),

the degree of the k-th active clique θ′_k ∈ Θ′_{r′} over the partition (r, r + t]. For finite t, Y^(k)_t is an almost surely non-negative finite process, by finiteness of the generating measure (Definition 8). For the filtration F := σ(α, (θ′_k, ϑ′_k), T), define τ^(k) to be the stopping time of the event that an edge appears between a node in a unit interval and θ′_k, while no edge in the same interval appears for the rest of the active clique-nodes. Formally,

τ^(k) := min{ t ∈ N : {Y^(k)_{t+1} − Y^(k)_t > 0} ∩ ⋂_{s ≠ k} {Y^(s)_{t+1} − Y^(s)_t = 0}, θ′_s ∈ Θ′_{r′} }.   (3.18)

It then suffices to show that τ^(k) < ∞ with probability 1 for each θ′_k ∈ Θ′_{r′} and take r_o = max_k(τ^(k)). Through the conditioning on the latent tree, the (Y^(k)_t)_k are not independent and do not yield an accessible distribution. Nonetheless, if we let (Ȳ^(k)_t)_k be the analogous process under the standard biadjacency measure of (3.14), then the (Ȳ^(k)_t)_k are independent with a well defined distribution (Lemma 1). Moreover, for each k, Y^(k)_t is dominated by Ȳ^(k)_t as Y^(k)_t ≤ Ȳ^(k)_t, since the latter can be seen as induced by an infinite complete graph K_G, where T ⊂ K_G. For the analogous filtration F̄ := σ(α, (θ′_k, ϑ′_k), K_G) and the stopping time τ̄^(k) under (Ȳ^(k)_t)_k, we have

P(τ^(k) ≥ n) ≤ P(τ̄^(k) ≥ n) ≤ (1/n) E[τ̄^(k) | F̄]
  = (1/n) E[ Σ_{t≥1} t I{s < t : s ≠ τ̄^(k)} I{τ̄^(k) = t} | F̄ ]
  ≤ (1/n) Σ_{t≥1} t [ ∏_{i=1}^{t−1} {1 − P(Ȳ^(k)_{i+1} − Ȳ^(k)_i > 0)} P( ⋂_{s≠k} {Ȳ^(s)_{i+1} − Ȳ^(s)_i = 0} ) ]
  ≤ (1/n) Σ_{t≥1} t [ 1 − exp( − Σ_{s≠k} W̄₂(ϑ′_s) ) ]^{t−1}
  = (1/n) exp( 2 Σ_{s≠k} W̄₂(ϑ′_s) ) −→ 0 as n −→ ∞.

The inequalities above follow from the Markov inequality, the independence of the (Ȳ^(k)_t)_k, the removal of the first probability in the third line, the direct application of the geometric series sum, and finally condition (i) in Definition 8. The proof can also be achieved via the Borel-Cantelli lemma.
This section formalized the notion of a realization from a decomposable random graph of Definition 9 through the means of an (r′, r)-truncation. A realization with active non-maximal cliques, if it occurs, can be corrected by an edge-greedy partition fulfilling condition A₀ (Eq. (3.13)), or by an identity matrix augmentation, where the latter is possible with probability 1 for a fixed set of active clique-nodes and an r_o < ∞. This section discussed more generally the issues of post-generation embedding in R²₊ while ignoring the Markovian nature of the generation process. Section 3.4 fills the gap by illustrating a practical sampling procedure for such a process, where the results of this section become useful. Meanwhile, Section 3.3.2.2 demonstrates some interesting results relating to likelihood factorization in terms of the Z representation.
3.3.2.2 Likelihood factorization with respect to Z

Definition 9 introduced decomposable random graphs and a process forming through sequential Markov updates using (3.12). In (3.12), the restrictive boundary and neighbouring induced-tree sets of (3.8) were used to ensure that every active clique-node in Z represents a maximal clique in G. Otherwise, the simpler sets in (3.7) can be used.

In the field of graphical models, decomposable graphs are used to factorize the likelihood of a multivariate distribution into a product of likelihoods over conditionally independent components, as illustrated in (3.2). An interesting question is whether one can factorize the likelihood of a multivariate distribution, with conditional dependency abiding to a decomposable graph G, with respect to its Z representation instead, and whether this factorization is equivalent to the one in (3.2). If factorization is possible, do active non-maximal clique-nodes influence the factorization; in other words, can the sets (3.7) be used instead of (3.8)?

Theorem 12 (Likelihood factorization with respect to Z). Let Z be an N_c × N_v biadjacency matrix generated from the Markov process (3.12) over the latent community tree T = (Θ′, E), where the simpler boundary and neighbouring sets of (3.7) are used in (3.12). Let G be the decomposable graph generated from Z by (3.5) or (3.10), with junction tree T_G. Moreover, let X = (X_i)_{i<N_v} be a random variable with a Markov distribution p and conditional dependency abiding to G. Then the likelihood of X | Z can be represented as

p(X | Z) = ∏_{θ′_k ∈ Θ′} p(X_{θ′_k}) / ∏_{{θ′_k, θ′_j} ∈ E} p(X_{θ′_k ∩ θ′_j}).   (3.19)

In fact, p(X | Z) = p(X | G) of (3.2).
Proof. Assume that X_∅ = ∅, such that p(X_∅ = x_∅ | Z) = 1, thus discarding all empty clique-nodes from the numerator and denominator of (3.19). Since not all θ′_k ∈ Θ′ are maximal, we will show that every non-maximal clique in the numerator of (3.19) cancels out with an equivalent factor in the denominator, leaving the minimal separator set S of G as in (3.2).

Active clique-nodes that are not maximal can either be: i) on the path between two maximal cliques; or ii) on a boundary branch of T stemming out of a maximal clique.

For case i), let θ′_{k_1}, θ′_{k_2}, . . . , θ′_{k_{n−1}} be sub-maximal cliques on the path between two maximal cliques, θ′_{k_0} and θ′_{k_n}, that are adjacent in some junction tree T_G of G. Let S = θ′_{k_0} ∩ θ′_{k_n} be the separator representing the edge {θ′_{k_0}, θ′_{k_n}} in T_G. It is straightforward to show that S ⊆ θ′_{k_i} for all i = 1, . . . , n−1, as otherwise the RIP is violated. There are n edges for the n−1 sub-maximal clique-nodes in a path between two maximal cliques. For each of the sub-maximal cliques θ′_{k_i}, i = 1, . . . , n−1, by the RIP, either θ′_{k_i} ⊆ θ′_{k_{i−1}}, or θ′_{k_i} ⊆ θ′_{k_{i+1}}, or both. If θ′_{k_i} ⊆ θ′_{k_{i−1}}, then p(X_{θ′_{k_i} ∩ θ′_{k_{i−1}}}) = p(X_{θ′_{k_i}}), thus eliminating the same factor in the numerator of (3.19). The opposite holds when θ′_{k_i} ⊆ θ′_{k_{i+1}}. This process reduces the path to the single edge {θ′_{k_0}, θ′_{k_n}} representing S.

For case ii), all sub-maximal clique-nodes on a boundary branch of T stemming out of a maximal clique, say θ′_{k_0}, are contained in θ′_{k_0}. By the RIP, all their edges can be rewired to θ′_{k_0}. The intersection in the denominator of (3.19) returns the sub-maximal factors as in case i), hence eliminating them from the numerator.
The results of Theorem 12 enable one to use the faster mixing sets of (3.7) in the Markov
update process without affecting the likelihood of interest. This enables specifying a
multivariate distribution completely in terms of Z, avoiding the transformation to G.
3.4 Exact sampling conditional on a junction tree
Sampling from the proposed model can be done in multiple ways, primarily due to the
Markovian nature of decomposable graphs. This section illustrates two methods: one based
on a sequential procedure with a finite number of steps, while the second adapts a Markov
update method, where samples are obtained from the stopped process. Nonetheless, both
methods overlap in the sampling and embedding of the Poisson process and the assignment
of clique-nodes, which is discussed below.

To sample a decomposable graph from an (r′, r)-truncation, let T = (Θ′, E) be an infinite
tree with clique-nodes Θ′ = (θ′1, θ′2, . . . ). Thus far, only the location dimension of the underlying
Poisson process is considered in the (r′, r)-truncation. This risks infinite values for the weight
dimension (ϑ). It is only natural to assume a Poisson process on the [0, r] × [0, c] cube, where
only points with θ < r and ϑ < c are kept. A standard generative model of nodes and their
location embedding can be:
Nv ∼ Poisson(cr),        Nc ∼ Poisson(c′r′),
(θi) | Nv iid∼ Uniform[0, r],        (θ′k) | Nc iid∼ Uniform[0, r′],
(ϑi) | Nv iid∼ Uniform[0, 1],        (ϑ′k) | Nc iid∼ Uniform[0, 1],        (3.20)

where Nv is the number of nodes and Nc is the number of clique-nodes.
The iterative sampling of T | Z is discussed later, in Section 3.5. This section
only samples a subtree of a given tree by adopting a random-walk type of sampler of clique
edges, to avoid the high probability of disjoint components associated with random sampling.
The latter could be the case when the tree is known to be finite. The assignment process is
then:
θ′1 ≡ θ′_{σ(1)},
θ′_{n+1} | θ′1, . . . , θ′n ∼ Uniform({θ′k ∈ Θ′ : {θ′k, θ′s} ∈ E, s ≤ n}),        (3.21)

where σ(1) is a randomly selected clique-node as the root of the sampled tree, and the uniform
distribution samples from the clique-nodes in Θ′ neighbouring the already assigned ones.
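The growth process (3.21) amounts to repeatedly drawing a clique-node uniformly among the tree neighbours of the already assigned set. A minimal Python sketch follows (an illustration only; the function name and the edge-list input are assumptions, not the thesis' notation):

```python
import random

def sample_subtree(edges, n_sample, rng=random):
    """Random-walk assignment of (3.21): start from a random root and
    repeatedly add a clique-node drawn uniformly among the unassigned
    neighbours of the already assigned set."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    root = rng.choice(sorted(adj))      # theta'_{sigma(1)}
    assigned = [root]
    frontier = set(adj[root])           # neighbours of the assigned set
    while len(assigned) < n_sample and frontier:
        nxt = rng.choice(sorted(frontier))
        assigned.append(nxt)
        frontier |= adj[nxt]
        frontier -= set(assigned)
    return assigned
```

Because every new clique-node is drawn from the neighbours of the current set, the sampled subtree is connected by construction, which is precisely what avoids the disjoint components mentioned above.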
Recall that T^(n)|i is the θi-induced subtree of T at the n-th Markov step, as defined
in (3.6); T^(n)|i_bd is the set of boundary clique-nodes and T^(n)|i_nei the set of neighbouring
clique-nodes, as defined in (3.8). Note that all subtree quantities are defined prior to the
(r′, r)-truncation; thus, we implicitly assume that they abide by the condition θ′k < r′,
particularly for T^(n)|i_nei.
3.4.1 Sequential sampling with finite steps
Because of the dependency induced by T, and as discussed in Section 3.3.2, some nodes
might only connect to clique-nodes outside the (r′, r)-truncation (non-active). Then, for
i = 1, . . . , Nv, a node is active within the truncation proportionally to the [c′, r′]-truncation
total mass:

θi is active | W, c′, ϑi ind∼ W_1(c′, ϑi) / W_1(ϑi),        (3.22)

where W_1(c′, ϑ) = ∫_0^{c′} W(x, ϑ) dx.
For each active θi, sample edges as:

• sample the first edge as

{θ′_{π(k)}, θi} | (ϑ′k), W ind∼ W(ϑ′_{π(k)}, ϑi) / W_1(c′, ϑi);        (3.23)

• at the (n + 1)-th step, sample edges to neighbouring clique-nodes sequentially as

θ′_{π(n+1)} | (θ′_{π(k)})_{k≤n} ∼ Uniform(T^(n)|i_nei \ (θ′_{π(k)})_{k≤n}),
{θ′_{π(n+1)}, θi} | ϑ′_{π(n+1)}, ϑi, W ∼ Bernoulli(W(ϑ′_{π(n+1)}, ϑi) / W_1(c′, ϑi)).        (3.24)
3.4.2 Sampling using a Markov stopped process
A Markov chain sampling of decomposable graphs depends on a stopped process, where a
Markov chain is run and a realization is obtained by stopping the chain at a specific time.
Such a process is slower in nature than the sequential sampling process discussed in the
previous section. In principle, one samples edges uniformly and decides whether they appear
at the current step given the current configuration of the biadjacency matrix. For the
(n + 1)-th Markov step, sample edge indices uniformly as

k | Nc ∼ Uniform{1, . . . , Nc},
i | Nv ∼ Uniform{1, . . . , Nv}.        (3.25)
Sample the {θ′k, θi} edge as

{θ′k, θi} | ϑ′k, ϑi, W, T ∼ Bernoulli(W(ϑ′k, ϑi) I{θ′k ∈ T^(n)|i_bd ∪ T^(n)|i_nei ∪ χ^(n)|i_0}),        (3.26)

with

χ^(n)|i_0(θ′) = { θ′ if |v(T^(n)|i)| = 0; ∅ otherwise.        (3.27)
A realization is then the result of stopping the above iterative process at a random time
t > 0. Ideally, the stopping time should be chosen after the Markov chain has reached
stationarity; such a time is referred to as the mixing time of the Markov chain. The next
section gives a mixing-time result for the Markov stopped process illustrated here.
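As a rough sketch of (3.25)-(3.26), the update below draws a uniformly random (clique-node, node) pair and resamples the corresponding entry of Z. The induced-subtree bookkeeping of (3.7)/(3.8) and the χ fallback of (3.27) are abstracted into a callable `allowed`, which is an assumption of this sketch rather than part of the model's notation:

```python
import random

def stopped_process_sample(Nc, Nv, W, allowed, t_stop, rng=random):
    """Markov stopped process: at each step pick indices (k, i) uniformly
    as in (3.25) and, when theta'_k lies in the allowed set for node i,
    resample the edge as a Bernoulli(W[k][i]) draw as in (3.26)."""
    Z = [[0] * Nv for _ in range(Nc)]
    for _ in range(t_stop):
        k = rng.randrange(Nc)
        i = rng.randrange(Nv)
        if allowed(Z, k, i):
            Z[k][i] = 1 if rng.random() < W[k][i] else 0
    return Z
```

Stopping at a time past the mixing time of the chain, defined in the next section, yields an approximate draw from the stationary distribution.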
3.4.2.1 Mixing time of the stopped process
For a precise definition of the mixing time, let Ω be the state space of a Markov chain
(Xt)_{t≥0} with transition matrix P. Let P^t(x, y) = P(Xt = y | X0 = x), for x, y ∈ Ω, be the
probability of the chain reaching state y in t steps given it started at state x. Define the
total variation distance d(t) between the transition matrix P^t, at step t, and the stationary
distribution p as

d(t) := max_{x∈Ω} ∥P^t(x, ·) − p∥_TV,        (3.28)

where ∥·∥_TV is the total variation norm. Then the mixing time t_mix is defined as

t_mix := min{t > 0 : d(t) < 1/4}.        (3.29)
Variations of mixing times for other thresholds ϵ ≠ 1/4 exist, though it can be shown
that t_mix(ϵ) ≤ ⌈log2(ϵ^{−1})⌉ t_mix(1/4). Therefore, it suffices to work with (3.29). For
an excellent introduction to Markov chain mixing times, refer to the book of Levin et al.
(2009).
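The definitions (3.28)-(3.29) can be evaluated exactly for a small chain; the two-state transition matrix in the note below is an arbitrary toy example, unrelated to the graph sampler:

```python
def tv_distance(row, pi):
    """Total variation distance between two distributions on a finite space."""
    return 0.5 * sum(abs(a - b) for a, b in zip(row, pi))

def mixing_time(P, pi, eps=0.25, t_max=1000):
    """First t with max_x ||P^t(x, .) - pi||_TV < eps, as in (3.28)-(3.29)."""
    n = len(P)
    Pt = [row[:] for row in P]
    for t in range(1, t_max + 1):
        if max(tv_distance(Pt[x], pi) for x in range(n)) < eps:
            return t
        # advance one step: Pt <- Pt P
        Pt = [[sum(Pt[x][z] * P[z][y] for z in range(n)) for y in range(n)]
              for x in range(n)]
    return None
```

For P = [[0.9, 0.1], [0.2, 0.8]] with stationary distribution p = (2/3, 1/3), one has d(t) = (2/3)(0.7)^t, so the 1/4 threshold is first crossed at t_mix = 3.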
For the proposed sampling method (Eqs. (3.25) and (3.26)), a unique stationary distribution
p exists, since by construction the chain is irreducible, that is, for any two configurations
x, y ∈ Ω, P^t(x, y) > 0 for some t ∈ N (Levin et al., 2009, Coro. 1.17, Prop. 1.19). Then, it
remains to find a lower bound for t_mix.
A known method to establish lower bounds for mixing times over irreducible Markov
chains is by bounding the probability of the first time a coupling over the chain meets.
Given an irreducible Markov chain over a state space Ω, with transition probability P, a
coupling is a process of running two Markov chains (Xt)_t and (Yt)_t, both with the same P,
though with different starting points. A coupling meets when the two chains visit a state
at the same time and move together at all times after they meet. More precisely,

if Xs = Ys, then Xt = Yt for t ≥ s.        (3.30)
Theorem 13 (Levin et al. (2009, Theo. 5.2)). Let (Xt, Yt) be a coupling with transition
matrix P satisfying (3.30), for which X0 = x and Y0 = y. Let τ_couple be the first time the
chains meet:

τ_couple := min{t > 0 : Xt = Yt}.        (3.31)

Then

d(t) ≤ max_{x,y∈Ω} P_{x,y}(τ_couple > t).        (3.32)
An example of a coupling on an n-node rooted binary tree is given by two lazy random
walks (Xt, Yt), started at nodes X0 = x, Y0 = y, where at each step a fair coin decides
which chain to move. Then, uniformly move the chosen chain to a neighbouring node, while
keeping the other chain fixed. Once the two chains are at the same level from the root node,
couple them by moving them further or closer to the root simultaneously. In this case, the
first coupling time is less than the commute time (τ_{0,∂B}), the time a chain commutes from
the root to the set of leaves ∂B and back. By τ_{0,∂B} the coupling would have occurred.
Proposition 3 (Commute Time Identity (Levin et al., 2009, Prop. 10.6)). Given a finite
tree Tn with n nodes, a root node x0, and a set of leaves ∂B, let τ_{0,∂B} be the commute time
defined as

τ_{0,∂B} := min{t ≥ τ_{∂B} : Xt = X0 = x0, X_{τ∂B} ∈ ∂B},        (3.33)

for a random walk (Xt)_t on Tn. Then

E[τ_{0,∂B}] = 2(n − 1) ∑_k 1/Γ_{x0,k},        (3.34)

where Γ_{x0,k} is the number of nodes at distance k from the root.
Remark. The maximum commute time is attained for a lazy random walk on a straight-line
(path) tree with n nodes at each side of the root, where E[τ_{0,∂B}] = 4n². For a lazy random
walk that moves with probability p (staying at the same configuration otherwise), it is easy
to see that the expected commute time (3.34) becomes E[τ_{0,∂B}]/p.
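The identity (3.34) is easy to evaluate on a concrete tree; the sketch below computes Γ_{x0,k} by breadth-first search (the function name is an assumption of this illustration):

```python
from collections import deque

def commute_time_identity(adj, root):
    """E[tau_{0,dB}] = 2(n - 1) * sum_k 1/Gamma_k of (3.34), with Gamma_k
    the number of nodes at distance k from the root, found by BFS."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    gamma = {}
    for d in dist.values():
        if d > 0:
            gamma[d] = gamma.get(d, 0) + 1
    return 2 * (len(dist) - 1) * sum(1.0 / g for g in gamma.values())
```

On a path with n = 3 nodes on each side of a central root (7 nodes in total), Γ_k = 2 for k = 1, 2, 3, giving 2 · 6 · 3/2 = 18 = 2n²; dividing by a move probability p = 1/2 recovers the 4n² of the remark.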
A similar approach can be applied to the proposed sampling scheme of (3.25) and
(3.26). First, note that sampling edges for a fixed node θi depends on the configuration
of other nodes. This dependence is enforced by the extra conditions added to T^(n)|i_bd and
T^(n)|i_nei in (3.8) versus (3.7). However, as discussed in Section 3.3.2, by using the edge-greedy
partition one can relax both of those conditions, either by satisfying A0 (Eq. (3.13)) using
the minimum number of steps in post-sampling, or by an identity matrix augmentation as in
Figure 3.5. Moreover, (3.7) will still result in a decomposable graph, as shown by Proposition
1, though not all active clique-nodes are maximal.

The objective of breaking down the dependency between nodes is to reduce the problem
of studying the mixing time on the whole graph to studying it on each node independently,
over the given tree. In this case, the process in (3.26) does not map directly to a random
walk process, where we can apply the commute time identity, for three reasons: (i) for each
node θi, the edges of the junction tree are directional and weighted by W(ϑ′_·, ϑi); (ii) the
variable χ^(n)|i_0 in (3.27) acts like a transporting hub to a random clique-node whenever the
random walk returns to the starting position; (iii) the commute time in Proposition 3 depends
on a root node, which is not a property of the proposed sampling method. Nonetheless, all
three reasons can be handled. Reason (i): for a non-atomic W, a uniform expected weight of
E[W] = ∫∫_{R²₊} W(x, y) dx dy        (3.35)

can be used. It is attained by a direct application of the Mapping theorem of Kingman
(1993), as in Figure 3.6. Reason (ii): the transport-hub property only speeds up the commute
time, thus an upper bound is still the commute time of (3.33). Reason (iii): ∑_k 1/Γ_k
is smallest when the designated root node is the centre of the tree, where each side is
symmetric. It becomes larger as the designated root node moves away from the centre, with
the maximum of L_max/2, half the maximum distance between two leaf nodes.
Figure 3.6: A realization of a 5-node junction tree from (3.21); on the left is the original
directed weighted tree, where Wk = W(ϑ′k, ϑi) for a random ϑi; on the right is the undirected
tree by expectation, where W∗ = E(W).
Lemma 2. For the Markov update process of Section 3.4.2, given a connected tree with Nc
clique-nodes, the lower bound on the expected mixing time for each node, holding all other
nodes constant, is

t_mix ≥ (8Nc / ∫∫_{R²₊} W(x, y) dx dy) · (L_max/2) ≥ (8Nc / ∫∫_{R²₊} W(x, y) dx dy) ∑_{k=1}^{Nc} 1/Γ_k,        (3.36)

where Γ_k is the number of nodes at distance k from a root node θ′0, selected randomly from the
non-leaf nodes of the tree, and L_max is the maximum distance between two leaf clique-nodes.
If nodes are sampled independently, when (3.7) is used instead of (3.8), then (3.36) is the
global mixing time achieved by parallel sampling.
The proof follows directly from Theorem 13 and Proposition 3 by a lazy random walk
with move probability as in (3.35).
3.5 Edge updates on a junction tree
Section 3.3 proposed a model for decomposable random graphs by conditioning on a fixed
junction tree, where graph edges are formed conditionally through a Markov process, as
shown in (3.12) and Section 3.4. Nonetheless, conditioning the model on a fixed junction
tree is quite restrictive, for two main reasons: (i) the junction tree representation is not
unique; (ii) a junction tree is oftentimes unknown and an estimate is desired. Sampling
of junction trees is possible, for example, by single edge updates on the given tree. This
connection is summarized by Hara and Takemura (2006), through a connected bipartite
graph between the set of possible junction trees and the set of POSs, as shown in Figure 3.2.

Despite the non-uniqueness of junction trees and POSs, Lauritzen (1996) showed
that the set of minimal separators, the edges of the junction tree, is unique, with varying
multiplicity for each separator. The separator multiplicity relates to the number of ways its
corresponding edge can be formed, and thus the number of trees that are a unit distance, or
a single move, away. Therefore, for two adjacent maximal cliques θ′k and θ′s in some junction
tree T, if G(θ′k) ∩ G(θ′s) ⊂ G(θ′m), for a third maximal clique θ′m, then one can alter the edge
{θ′k, θ′s} by severing it on one side and reconnecting it to θ′m. Certainly, the connectivity of
the junction tree must be respected. For example, in Figure 3.2, moving from the junction
tree T1 in Subfigure 3.2b to T2 requires severing the edge {C2, C3} from the C3 side
and reconnecting it to C1, as shown in Figure 3.7. The separating nodes between C2 and
C3 are G(C2) ∩ G(C3) = {C, D}, and are contained in the clique C1 = ABCD.
Figure 3.7: Moving along the bipartite graph of Figure 3.2, from junction tree T1 to T2,
through severing and reconnecting the edge {C2, C3} (dotted lines) to {C2, C1}.
The set of clique-nodes a severed edge can reconnect to is the same set of clique-nodes
that satisfy the running intersection property of the POSs, introduced in (3.1). To formalize
this notion, for some tree T = (Θ′, E) and edge {θ′k, θ′s} ∈ E, let J_(k,s−) be the set of maximal
cliques that satisfy the RIP when the edge is severed at θ′s's side, as

J_(k,s−) = {θ′m ∈ Θ′ : θ′k ∩ θ′s ⊂ θ′m, {θ′k, θ′s} ∈ E, θ′s ∼ θ′m ∈ T(E \ {θ′k, θ′s})}.        (3.37)
The notation θ′s ∼ θ′m ∈ T(E \ {θ′k, θ′s}) indicates the existence of a path between θ′s
and θ′m in T when the edge {θ′k, θ′s} is removed. Note that θ′s ∈ J_(k,s−), as it satisfies the
RIP. Let ε_{k(s→m)} = 1 be the indicator that the edge {θ′k, θ′s} is replaced by {θ′k, θ′m}. Using
a uniform prior, the probability of such a move is

P(ε_{k(s→m)} = 1 | Z, T) = { 1/|J_(k,s−)| if θ′m ∈ J_(k,s−); 0 otherwise.        (3.38)
A weighted version can also be formed. For example, when larger cliques are favoured
over smaller ones, the update distribution can take the form

P(ε_{k(s→m)} = 1 | Z, T) = { |v(G(θ′m))| / ∑_{x∈J_(k,s−)} |v(G(x))| if θ′m ∈ J_(k,s−); 0 otherwise,        (3.39)

where v(G(x)) are the nodes of the subgraph G(x) for clique x.
To combine these results with those of Section 3.4, an iterative sampling of the decomposable
graph and the junction tree is:

(i) generate Nv, (θi), Nc and (θ′k) as in (3.20);

(ii) sample an initial junction tree by the random assignment process in (3.21);

(iii) at the n-th Markov step:

• sample Z^(n+1) | T^(n) according to the samplers of Section 3.4;

• sample T^(n+1) | Z^(n+1) according to (3.38), or its weighted version.
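The tree-update step can be sketched in code: the candidate set J_(k,s−) of (3.37) is the component of θ′s after severing {θ′k, θ′s}, filtered to cliques containing the separator. The function names and edge-list input are assumptions of this illustration; separator containment is checked non-strictly, which keeps θ′s itself as a candidate:

```python
import random

def candidate_set(cliques, edges, k, s):
    """J_(k,s-) of (3.37): clique-nodes m containing the separator
    cliques[k] & cliques[s] that remain on s's side of the tree once the
    edge {k, s} is severed."""
    sep = cliques[k] & cliques[s]
    adj = {i: set() for i in cliques}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    adj[k].discard(s)
    adj[s].discard(k)
    comp, stack = {s}, [s]      # component of s after severing {k, s}
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in comp:
                comp.add(v)
                stack.append(v)
    return {m for m in comp if sep <= cliques[m]}

def update_edge(cliques, edges, k, s, rng=random):
    """Uniform move (3.38): replace {k, s} by {k, m}, with m drawn from
    J_(k,s-); drawing m = s leaves the tree unchanged."""
    m = rng.choice(sorted(candidate_set(cliques, edges, k, s)))
    if m == s:
        return edges
    return [e for e in edges if set(e) != {k, s}] + [(k, m)]
```

On the example of Figure 3.7, with C1 = {A, B, C, D}, C2 = {C, D, E}, C3 = {C, D, F} and tree edges {C1, C3}, {C3, C2}, severing {C2, C3} at the C3 side gives the candidate set {C1, C3}.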
The conditional sampling process proposed in this section preserves the number of clique-nodes
in the junction tree, in line with the assumptions of Section 3.3. Chapter 4 proposes
a more elaborate modelling scheme for decomposable graphs, which introduces a notion of
sub-clustering and a method for sampling junction trees of varying sizes.
3.6 Examples
The framework presented in this work builds on the point process representation of random
graphs (see Section 2.2.5). The Poisson process thus arises naturally as a suitable generating
class of many σ-finite random functions (measures). This section aims to showcase a few
practical examples of decomposable graphs under different choices of W and the (r′, r)-truncation,
where the unit rate Poisson process is used as a sampling mechanism.

In some recent work, for example Gao et al. (2015) and Wolfe and Olhede (2013), the
function W is treated as a limit object of a series of graph realizations. In other work, such
as Caron and Fox (2014), W is treated as a deterministic function of completely random
measures, as in Section 2.3, where inference also accounts for the truncation point r. This
section follows the latter by letting W be a deterministic function of some known parametric
distributions, and the interest is in estimating the distributional parameters given a
realization.
Sampling from parametric distributions is usually done through the right-continuous
inverse of the distributional CDF by means of a uniform random variable. There is a direct
link between unit rate Poisson processes and uniform random variables that can be shown
in a few ways. For example, using distributional equality, the unit rate Poisson observations
(ϑi) can be ordered such that ϑ_(1) < ϑ_(2) < . . . ; then ϑ_(i+1) − ϑ_(i) ∼ Exponential(1), as the
inter-arrival times between events. As a result, exp(−(ϑ_(i+1) − ϑ_(i))) ∼ Uniform[0, 1].
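This link can be demonstrated directly: cumulative Exponential(1) draws play the role of the ordered unit rate Poisson observations, and the transformed gaps behave as Uniform[0, 1] variables (a simulation sketch; the function name is an assumption):

```python
import math
import random

def poisson_to_uniform(n, rng):
    """Generate n uniforms from a unit rate Poisson process: inter-arrival
    times are Exponential(1), and exp(-(gap)) is Uniform[0, 1]."""
    arrivals, t = [], 0.0
    for _ in range(n + 1):
        t += rng.expovariate(1.0)   # next arrival of the unit rate process
        arrivals.append(t)
    return [math.exp(-(b - a)) for a, b in zip(arrivals, arrivals[1:])]
```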
The biadjacency representation of decomposable graphs results in a simple expression for
the conditional joint distribution; however, the conditioning choice is important, as shown
in the following subsection.
3.6.0.2 On the joint distribution of a realization
The Markov nature of decomposable graphs forces nodes to establish their clique connections
in Z via a path over T. For example, a node θi initially connects to the clique θ′_{σ(1)}; attempts
unsuccessfully to connect to neighbouring clique-nodes of θ′_{σ(1)} in T; then makes a successful
connection to θ′_{σ(2)}; θi attempts the neighbours of θ′_{σ(2)} that are not yet attempted, and so
on. This results in T|i, which defines the successful connection path of θi; the unsuccessful
attempts are defined by T|i_nei.

Disregarding the initial starting clique for node θi, by conditioning on all other connections
and a tree T, the joint distribution of z·i, the i-th column of Z, can be defined as

P(z·i | Z−(·i), T) = ∏_{θ′∈v(T|i)} P(z_{k(θ′)i} = 1) ∏_{θ′∈v(T|i_nei)} P(z_{k(θ′)i} = 0)
 = ∏_{θ′∈v(T|i)} W(ϑ′_{k(θ′)}, ϑi) ∏_{θ′∈v(T|i_nei)} (1 − W(ϑ′_{k(θ′)}, ϑi)),        (3.40)

where k(θ′) is the index of θ′ and Z−(·i) is Z excluding the i-th column.
Therefore, for an observed Nv-node decomposable graph G with Nc maximal cliques
forming a junction tree T, let Z be its Nc × Nv biadjacency matrix, with no empty rows or
columns. Define the following neighbourhood indicator as

δ^nei_ki = { 1 if θ′k ∈ T|i_nei; 0 otherwise,        (3.41)

where T|i_nei is as in (3.8). Then (3.40) simplifies to

P(z·i | Z−(·i), T) = ∏_{k=1}^{Nc} W(ϑ′k, ϑi)^{zki} (1 − W(ϑ′k, ϑi))^{(1−zki)δ^nei_ki}.        (3.42)
The dependence on all other node-clique connections Z−(·i) in (3.40) is a direct result of
using the quantity T|i_nei, which includes clique-nodes neighbouring T|i that do not
cause a maximal clique to be sub-maximal (Eq. (3.8)). Not all columns of Z exhibit such
dependence; nonetheless, the conditions causing z·i to depend on Z−(·i) in (3.8) are only an
artifact of the proposed sampling process in (3.12), to force every non-empty node of a finite
T to be maximal at each Markov step. Proposition 1 and the discussion of Section 3.3.2
both suggest that the dependence is not essential; even with non-empty non-maximal nodes
in T, the result is a decomposable graph. However, non-empty non-maximal cliques are not
identifiable in the mapped biadjacency of an observed decomposable graph. Therefore, the
dependence of z·i on Z−(·i) is only meaningful when conditioning on the true tree T used
in the generation process. When the true T is unknown and the junction tree TG is used
instead, such dependence is obsolete.
Therefore, for an observed Nv-node decomposable graph G with a connected junction
tree TG, its Nc × Nv biadjacency matrix Z has the joint distribution

P(Z | TG) = ∏_{i=1}^{Nv} P(z·i | TG) = ∏_{i=1}^{Nv} ∏_{k=1}^{Nc} W(ϑ′k, ϑi)^{zki} (1 − W(ϑ′k, ϑi))^{(1−zki)δ^nei_ki},        (3.43)
where δ^nei_ki now depends on TG. In fact, (3.43) shows that the choice of a junction tree only
affects the joint distribution through the component δ^nei_ki. Therefore, assuming a uniform
distribution over the set of possible junction trees, the choice of TG over an alternative
junction tree T′G can be made with posterior ratios as

log [P(TG | Z) / P(T′G | Z)] = ∑_{k=1}^{Nc} ∑_{i=1}^{Nv} (1 − zki)(δ^nei_ki − δ^nei′_ki) log(1 − W(ϑ′k, ϑi)).
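The joint distribution (3.43) translates directly into a log-likelihood: each pair (k, i) contributes log W when connected, and log(1 − W) when θ′k is a neighbouring clique-node left unconnected. A minimal sketch, with Z, W and δ^nei passed as Nc × Nv arrays (an assumed calling convention):

```python
import math

def log_likelihood(Z, W, delta_nei):
    """log P(Z | T_G) of (3.43): W^z (1 - W)^((1 - z) * delta_nei),
    accumulated in the log scale over all clique-node/node pairs."""
    ll = 0.0
    for k, row in enumerate(Z):
        for i, z in enumerate(row):
            if z:
                ll += math.log(W[k][i])
            elif delta_nei[k][i]:
                ll += math.log(1.0 - W[k][i])
    return ll
```

Since a junction tree enters only through δ^nei, comparing two candidate trees amounts to comparing two such evaluations with different indicator arrays.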
The following section applies (3.43) to specific examples.
3.6.1 The multiplicative model
Many network generating models fall under the random function characterization, as
illustrated in Table 2.1. The multiplicative model of linkage probability encompasses a wide
class of such models, where the link probability of the (i, j)-th edge has the general form

(i, j) | pi, pj ∼ Bernoulli(pi pj), pi ∈ [0, 1].        (3.44)

Examples of such models are Bickel and Chen (2009); Chung and Lu (2002) and Olhede
and Wolfe (2012). A multiplicative form of the function W can be defined as

W(x, y) = f(x)f(y), x, y ∈ R₊, for an integrable f : R₊ → [0, 1].        (3.45)

The marginals are also functions of f, as W_1(s) = W_2(s) = a f(s), where a = ∫_{R₊} f(x) dx.
A natural choice for f is a continuous density function, where a = 1; more generally, a
cumulative distribution function (CDF) or the complementary (tail distribution) CDF can
also be used.
Example 3.6.1 below illustrates a case where the tail of an exponential distribution is
used.

Example 3.6.1 (Tail of an exponential distribution, fast decay). Let f be the tail of an
exponential distribution, f(x) = ∫_x^∞ λ exp(−λs) ds = exp(−λx), such that

W(x, y) = e^{−λ1x} e^{−λ2y}.        (3.46)

The marginals are W_1(y) = exp(−λ2y)/λ1 and W_2(x) = exp(−λ1x)/λ2, where W(x, y) =
λ1λ2 W_1(y) W_2(x). Figure 3.8 shows the density of (3.46), where λ1 = λ2 = 1.
Figure 3.9 illustrates different size realizations from (3.46) for the same 10-node tree
(Subfigure 3.9a), sampled according to (3.21), where λ1 = 1. Each realization in the top
panel is based on a different (c, r)-truncation of the node domain with λ2 = 1.

Figure 3.8: Density of W(x, y) = exp(−(x + y)).

The middle panel illustrates the effect of varying the scaling parameter λ2; therefore, a single node
parameter set (θi, ϑi) is used across the panel. The bottom panel plots the adjacency
matrix of the corresponding decomposable graph in the upper subfigure. It is evident that high
values of λ2 separate the graph (3.9d), while lower values support more cohesion (3.9f).
Example 3.6.2 (Beta multiplicative priors). Let fi(x) ∼ Beta(αi, 1), for x ∈ R₊; a
multiplicative form for W with Beta kernels is

W(x, y) = f1(x) f2(y).        (3.47)

By the ordering of the unit rate Poisson process (ϑi), a Beta random variable can be sampled
as exp(−(ϑ_(i+1) − ϑ_(i))/α) ∼ Beta(α, 1). Therefore, using distributional equalities, the
generating sequential scheme in (3.24) can equivalently be used with the following modification:

ϑ′k | α1 iid∼ Beta(α1, 1),
ϑi | α2 iid∼ Beta(α2, 1),
W(ϑ′k, ϑi) = ϑ′k ϑi.        (3.48)
3.6.1.1 Posterior distribution for the special case of a single marginal
Figure 3.9: Different size realizations from W(x, y) = exp(−(λ1x + λ2y)): (a) the sampled
10-node junction tree, drawn according to (3.21) with a (c′ = 1, r′ = 10)-truncation; (b)-(d)
decomposable graphs from (c = 2, r = 10), (c = 2, r = 20) and (c = 2, r = 50) truncations
with λ2 = 1; (e)-(f) the (c = 2, r = 50) truncation with λ2 = 5 and λ2 = 1/5; (g)-(i) the
adjacency matrices corresponding to (d)-(f). The middle panel illustrates the effect of
varying λ2 for the same parameter set (θi, ϑi) generated from a (c = 2, r = 50)-truncation.

A node-clique connection probability under a single marginal is when W is of the form
W(x, y) = f(x) or W(x, y) = f(y), with f : R₊ → [0, 1].
Under such a parametrization, a posterior distribution of f | Z, TG is possible. For the
special case of Example 3.6.2 and the generating process in (3.48), for an observed Nv-node
decomposable graph G with Nc maximal cliques forming a junction tree TG, let Z be its
Nc × Nv biadjacency matrix, with no empty rows or columns. By (3.43), the joint conditional
distribution of Z | TG is

P(Z | (ϑi), TG, f) = ∏_{k=1}^{Nc} ∏_{i=1}^{Nv} f(ϑi)^{zki} (1 − f(ϑi))^{(1−zki)δ^nei_ki} = ∏_{i=1}^{Nv} f(ϑi)^{mi} (1 − f(ϑi))^{m^δnei_i},        (3.49)

where mi = ∑_{k=1}^{Nc} zki and m^δnei_i = ∑_{k=1}^{Nc} δ^nei_ki.
When f(ϑi) = fi ∼ Beta(α, 1), as in Example 3.6.2, the posterior distribution of fi | Z, TG
is

fi | Z, TG ∼ Beta(α + mi, 1 + m^δnei_i).        (3.50)
The marginal joint distribution is

P(Z | TG) = ∏_{i=1}^{Nv} ∫ fi^{mi} (1 − fi)^{m^δnei_i} p(fi | α) dfi = α^{Nv} ∏_{i=1}^{Nv} Γ(α + mi) Γ(m^δnei_i + 1) / Γ(α + mi + m^δnei_i + 1).        (3.51)
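The updates (3.50)-(3.51) are plain Beta-Bernoulli conjugacy and can be sketched as follows (function names are assumptions of this illustration):

```python
import math

def beta_posterior(alpha, m_i, m_nei_i):
    """Posterior parameters of (3.50): connections m_i and neighbouring
    non-connections m_nei_i act as Beta pseudo-counts."""
    return alpha + m_i, 1 + m_nei_i

def log_marginal(alpha, counts):
    """Log marginal likelihood (3.51); `counts` lists (m_i, m_nei_i)
    pairs, one per node."""
    ll = len(counts) * math.log(alpha)
    for m, mn in counts:
        ll += (math.lgamma(alpha + m) + math.lgamma(mn + 1)
               - math.lgamma(alpha + m + mn + 1))
    return ll
```

For a single node with mi = 1, m^δnei_i = 0 and α = 1, (3.51) gives Γ(2)Γ(1)/Γ(3) = 1/2.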
Figure 3.10 shows the posterior distributions of three selected f(ϑi)'s for a decomposable
graph of 50 cliques and 201 nodes, when f(ϑi) ∼ Beta(α, 1). Under the same prior, Figure
3.11 shows the posterior distribution when f(ϑi) = f(ϑ) for all i, for a decomposable graph
of 20 cliques and 103 nodes.

A joint conditional distribution can be achieved for the case when both marginals are
used. Nonetheless, the product form in (3.49) does not grant easy access to the posteriors.
Section 3.6.2 introduces an alternative parametrization that transforms the product to a sum
in the log-scale, thus allowing direct access to the posteriors.
Figure 3.10: Junction tree (a, 50 nodes), decomposable graph (b, 201 nodes), and posterior
MCMC trace plots for three randomly selected nodes, where fi iid∼ Beta(α, 1), for the single
marginal distribution of W(x, y) = f(y).
Figure 3.11: Junction tree (a, 20 nodes), decomposable graph (b, 103 nodes), and the
posterior MCMC trace plot (c) of ϑi = ϑ = 0.3, for the case W(ϑ′k, ϑi) = ϑ.
3.6.1.2 Inference by Gibbs sampling
Under the Beta prior of Example 3.6.2, Gibbs sampling is possible for an observed
biadjacency matrix Z. Suppose W is of the form W(ϑ′k, ϑi) = ϑi, where the random variables
(ϑi) are i.i.d. Beta(α, 1). The interest is in deriving the distribution P(zki = 1 | z−(ki), TG),
where z−(ki) denotes the entries of column i excluding zki. Gibbs sampling is done by
integrating over the distribution of ϑi, conditioning on the given junction tree. Thus, the
conditional distribution of node θi connecting to θ′k is

P(zki = 1 | z−(ki), TG) = ∫_0^1 P(zki | ϑi) p(ϑi | z−(ki), T|i) dϑi
 = (∑_{s≠k} zsi + α) / (∑_{s≠k} zsi + ∑_{s≠k} (1 − zsi)δ^nei_si + α + 1)
 = (m−k,i + α) / (m−k,i + m^δnei_−k,i + α + 1),        (3.52)

where m−k,i = ∑_{s≠k} zsi, m^δnei_−k,i = ∑_{s≠k} (1 − zsi)δ^nei_si, and δ^nei_si is as in (3.41).
The general conditional distribution mimics that of (3.11) and (3.12), as

P(z^new_ki = 1 | z−(ki), TG) = { (m−k,i + α) / (m−k,i + m^δnei_−k,i + α + 1) if θ′k ∈ T|i_bd ∪ T|i_nei; z^old_ki otherwise.        (3.53)
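A sketch of the update (3.52)-(3.53) follows; the boundary/neighbouring membership check is abstracted into a boolean argument, which is an assumption of this illustration:

```python
import random

def gibbs_prob(Z, alpha, delta_nei, k, i):
    """P(z_ki = 1 | z_-(ki), T_G) of (3.52), with theta_i integrated out."""
    m = sum(Z[s][i] for s in range(len(Z)) if s != k)
    m_nei = sum((1 - Z[s][i]) * delta_nei[s][i]
                for s in range(len(Z)) if s != k)
    return (m + alpha) / (m + m_nei + alpha + 1.0)

def gibbs_update(Z, alpha, delta_nei, k, i, in_active_set, rng=random):
    """One step of (3.53): resample z_ki only when theta'_k lies in the
    boundary or neighbouring set of node i; otherwise keep the old value."""
    if not in_active_set:
        return Z[k][i]
    return 1 if rng.random() < gibbs_prob(Z, alpha, delta_nei, k, i) else 0
```

With two connections and no neighbouring non-connections in a column, and α = 1, (3.52) gives (2 + 1)/(2 + 0 + 1 + 1) = 3/4.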
3.6.2 The log transformed multiplicative model
Many models for random graphs parameterize the probability of an edge through a
multiplicative form on the logarithmic scale, for example, the work of Caron (2012); Caron and
Fox (2014) and a few examples in Veitch and Roy (2015). Under such a parameterization, the
form of W is

W(x, y) = 1 − exp(−xy), x, y ∈ R₊.        (3.54)

The form in (3.54) is generally referred to as the Cox process of Definition 2, since it
can be seen as the probability of at least one event of a Poisson random variable having a
mean measure as a unit rate Poisson process, hence a doubly stochastic distribution.
3.6.2.1 Posterior distribution for the two marginals
Let G be an observed Nv-node decomposable graph with a connected junction tree TG of Nc
maximal cliques. Let Z be its Nc × Nv biadjacency matrix, with no empty rows or columns.
According to (3.43), the joint conditional distribution of Z | (ϑ′k), (ϑi), TG is

P(Z | (ϑ′k), (ϑi), TG) = ∏_{k=1}^{Nc} ∏_{i=1}^{Nv} (1 − exp(−ϑ′k ϑi))^{zki} exp(−ϑ′k ϑi (1 − zki) δ^nei_ki),        (3.55)

where δ^nei_ki is as in (3.41).
The product form in (3.55) does not grant simple posterior expressions. By introducing
an intermediary latent variable, as a computational trick, one can transform the product
of densities to a sum in the exponential scale, in a manner similar to the Swendsen-Wang
algorithm (Swendsen and Wang, 1987). Reparametrize zki using a latent variable ϕki > 0
such that

zki = { 1 if ϕki < 1; 0 if ϕki = 1,        (3.56)

in the sense that zki is completely determined by ϕki. Moreover, let ϕki = min(ϕ∗ki, 1), where
ϕ∗ki is distributed as an exponential random variable with parameter ϑ′k ϑi. The conditional
joint density of (zki, ϕki), given θ′k ∈ T|i_bd ∪ T|i_nei, is

p(zki, ϕki | ϑ′k, ϑi, T) = ϑ′k ϑi exp(−ϑ′k ϑi ϕki) I{ϕki < 1} + exp(−ϑ′k ϑi) I{ϕki = 1},        (3.57)

such that

P(zki = 1 | ϑ′k, ϑi, TG) = P(ϕki < 1 | ϑ′k, ϑi, T) = 1 − exp(−ϑ′k ϑi).
Therefore, attaining the joint conditional distribution of (Z, Φ), where Φ = (ϕki), is
straightforward:

P(Z, Φ | (ϑ′k), (ϑi), TG) = ∏_{k=1}^{Nc} ∏_{i=1}^{Nv} (ϑ′k ϑi exp(−ϑ′k ϑi ϕki))^{zki} exp(−ϑ′k ϑi (1 − zki) δ^nei_ki)
 = [∏_{k=1}^{Nc} ϑ′k^{mk}] [∏_{i=1}^{Nv} ϑi^{ni}] exp(−∑_{k=1}^{Nc} ∑_{i=1}^{Nv} ϑ′k ϑi ϕki (zki + δ^nei_ki)),        (3.58)

where mk = ∑_{i=1}^{Nv} zki and ni = ∑_{k=1}^{Nc} zki.
The work of Chapter 5 applies a similar trick; refer to Appendix A.1.1 for a complete
derivation. There are different parameterization choices that can have equivalent results,
for example, letting ϕki be a Poisson(xy) random variable, where 1 − exp(−xy) is the
probability of at least one event. Nonetheless, under (3.58) the posterior distributions of the
affinity parameters are

P(ϑ′k | Z, Φ, (ϑi), TG) ∝ ϑ′k^{mk} exp(−ϑ′k ∑_{i=1}^{Nv} ϑi ϕki (zki + δ^nei_ki)) p(ϑ′k),
P(ϑi | Z, Φ, (ϑ′k), TG) ∝ ϑi^{ni} exp(−ϑi ∑_{k=1}^{Nc} ϑ′k ϕki (zki + δ^nei_ki)) p(ϑi),        (3.59)

where p is the prior distribution. A natural conjugate prior for (3.59) is the Gamma
distribution. Conditionally updating ϕki can be done with a truncated exponential distribution
at 1, as

ϕki | Z, ϑ′k, ϑi ∼ { χ1 if zki = 0; tExp(ϑ′k ϑi, 1) if zki = 1,        (3.60)

where χ1 is the atomic measure at 1, and tExp(λ, x) is the exponential distribution with
parameter λ and truncated at x.
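The conditional update (3.60) and the Gamma conjugacy for (3.59) can be sketched as follows; the truncated exponential is drawn by inverse CDF, and only posterior parameters are returned (function names are assumptions of this illustration):

```python
import math
import random

def sample_phi(z, rate, rng=random):
    """Update (3.60): phi_ki is the atom at 1 when z_ki = 0, and an
    Exponential(rate) truncated at 1 when z_ki = 1 (inverse-CDF draw),
    with rate = theta'_k * theta_i."""
    if z == 0:
        return 1.0
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-rate))) / rate

def theta_prime_posterior(a, b, m_k, weighted_sum):
    """Gamma conjugacy for the first line of (3.59): under a Gamma(a, b)
    prior, the posterior of theta'_k is Gamma(a + m_k, b + S), where
    S = sum_i theta_i * phi_ki * (z_ki + delta_ki)."""
    return a + m_k, b + weighted_sum
```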
3.7 Model properties: Expected number of cliques per
node
Lemma 1 in Section 3.3.2 defined the degree function of a regular bipartite measure (Eq.
(3.14)) as

deg((θ, ϑ), Πr ∪ {(θ, ϑ)}, Π_{r′}, (Uki)).

The set Πr ∪ {(θ, ϑ)} is used to properly define the conditioning on the null set {(θ, ϑ) ∈ Πr},
by application of the Slivnyak-Mecke theorem.
The expression of the degree function in (3.16) does not hold for the proposed decomposable
random graph model of Definition 9: first, because the set Π_{r′} of clique-nodes carries a
dependency structure based on T; second, because the Markovian nature of the process restricts
the node-clique (dis)connections to the set of boundary and neighbouring clique-nodes by means
of (3.7) or (3.8). Nonetheless, with the following Poisson process identity, an analogous
degree function can be defined.
Lemma 3 (Product of distinct Poisson processes (Kingman, 1993, ch. 3.1)). For a Poisson process Π defined on a probability space (S, F, P) with mean measure μ, let f_1, f_2, …, f_n be a collection of real-valued functions such that f_i : S → R_+. Then the following distinct product identity holds:

E[ Σ_{p_1, p_2, …, p_n ∈ Π; p_i ≠ p_j, i ≠ j} ∏_{i=1}^{n} f_i(p_i) ] = ∏_{i=1}^{n} E[ Σ_{p_i ∈ Π} f_i(p_i) ].    (3.61)
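As an aside, (3.61) is easy to illustrate by Monte Carlo for n = 2, using a homogeneous Poisson process on [0, 1] with rate λ and f_1(x) = x, f_2(x) = x²: both sides then equal (λ/2)(λ/3). The following sketch is ours, not part of the proof:

```python
import math
import random

def poisson_points(rate, rng):
    """Homogeneous Poisson process on [0, 1]: Poisson(rate) many uniform points."""
    n, p, target = 0, math.exp(-rate), rng.random()
    cum = p
    while cum < target:          # draw N ~ Poisson(rate) by inversion
        n += 1
        p *= rate / n
        cum += p
    return [rng.random() for _ in range(n)]

def distinct_pair_sum(pts, f1, f2):
    """Sum of f1(p) * f2(q) over ordered pairs of DISTINCT points."""
    s1 = sum(f1(p) for p in pts)
    s2 = sum(f2(p) for p in pts)
    diag = sum(f1(p) * f2(p) for p in pts)
    return s1 * s2 - diag

rng = random.Random(1)
rate, reps = 2.0, 100_000
f1, f2 = (lambda x: x), (lambda x: x * x)
lhs = sum(distinct_pair_sum(poisson_points(rate, rng), f1, f2)
          for _ in range(reps)) / reps
rhs = (rate / 2) * (rate / 3)    # rate * integral(f1) times rate * integral(f2)
print(lhs, rhs)                  # the Monte Carlo estimate is close to 2/3
```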
The proof is derived by induction on the correlation of sums over Poisson processes. We first simplify notation by remarking that a degree function only depends on the weight dimension ϑ of Π, since the location dimension θ carries no probability information other than what is encoded in T. Therefore, it suffices to work with a projected Poisson process on the weight domain. Let Π^ϑ_r = {ϑ_i : (θ_i, ϑ_i) ∈ Π_r} and, accordingly, Π^{ϑ′}_{r′} = {ϑ′_k : (θ′_k, ϑ′_k) ∈ Π′_{r′}} be the projections of the Poisson processes on the ϑ dimension for nodes and clique-nodes, respectively, where Π_x is a rate-x homogeneous Poisson process.
Lemma 4. For a biadjacency measure Z resulting from the decomposable random graph process of Definition 9, though with the boundary and neighbouring sets of (3.7), with a non-random W : R²_+ → [0, 1] and a fixed α, the degree function in Z of a node (θ, ϑ) ∈ R²_+, given an initial connection to clique-node (θ′_0, ϑ′_0) ∈ R²_+, is

deg(ϑ, Π^ϑ_r ∪ {ϑ}, Π^{ϑ′}_{r′} ∪ {ϑ′_0}, (U_{ki}), T) = Σ_{k=0}^{∞} Σ_{y ∈ Γ^{ϑ′_0}_k} ∏_{s ∈ P(ϑ′_0 → y)} I{U_{k(s) i(ϑ)} ≤ W(s, ϑ)},    (3.62)

where (U_{ki}) is a 2-array of uniform[0,1] random variables, with k(x) and i(y) being abbreviations for the indices of x and y. Γ^{ϑ′_0}_k is the set of clique-nodes in Π^{ϑ′}_{r′} at distance k from ϑ′_0, where Γ^{ϑ′_0}_0 = {ϑ′_0}. P(ϑ′_0 → y) is the set of clique-nodes on the path from ϑ′_0 to y. Moreover, the expectation of (3.62) is

E[deg(ϑ, ϑ′_0, ·)] = Σ_{k=0}^{∞} Γ^{ϑ′_0}_k (r′ W_1(ϑ))^{k+1},    (3.63)

where deg(ϑ, ϑ′_0, ·) is an abbreviation for the left-hand side of (3.62), and, with a slight abuse of notation, Γ^{ϑ′_0}_k = |Γ^{ϑ′_0}_k| denotes the cardinality of the set. For a random W, the results can be seen by conditioning.
Proof. By invoking the Slivnyak-Mecke theorem twice, for the events ϑ ∈ Π^ϑ_r and ϑ′_0 ∈ Π^{ϑ′}_{r′}. Let y_0, y_1, y_2, …, y_n ∈ Π^{ϑ′}_{r′} be a series of clique-nodes on the path from y_0 to y_n, where y_s is at distance s from y_0. By the Markovian nature of decomposable graphs, for the edge (y_s, ϑ) to form with probability larger than 0, y_s must be a neighbouring clique-node to T_{|ϑ}, that is, y_s ∈ T^{nei}_{|ϑ} of (3.7), implying y_0, y_1, …, y_{s−1} ∈ T_{|ϑ}. Thus the event that (y_s, ϑ) ∈ Z amounts to

∏_{j=0}^{s} I{U_{k(y_j), i(ϑ)} ≤ W(y_j, ϑ)}.    (3.64)
By uniqueness of paths in trees, and the ordering of clique-nodes by distance k in the sets Γ^{ϑ′_0}_k from the assumed initial point ϑ′_0, (3.62) is obtained. For (3.63), the identity of Lemma 3 is directly applicable, since the set Γ^{ϑ′_0}_k contains distinct points of Π^{ϑ′}_{r′}. Thus, for each y ∈ Γ^{ϑ′_0}_k the inner sum of (3.62) becomes

∏_{s ∈ P(ϑ′_0 → y)} E[ Σ_{x ∈ Π^{ϑ′}_{r′}} I{U_{k(x) i(ϑ)} ≤ W(x, ϑ)} ] = ∏_{s ∈ P(ϑ′_0 → y)} ∫_{R_+} W(x, ϑ) r′ dx.    (3.65)

For y ∈ Γ^{ϑ′_0}_k, the path length is |P(ϑ′_0 → y)| = k + 1, and (3.63) follows.
The degree expectation (3.63) is invariant to the value of the initial clique-node ϑ′_0 itself, but depends on it through the number of clique-nodes at each distance from the initial point, as indicated by (Γ_k)_k. Therefore, for certain tree structures, for example d-regular trees, the sizes of (Γ_k)_k are explicit functions of the tree degree distribution. In such cases, a more explicit characterization of (3.63) is achievable. The following corollary gives a compact expression of the expected clique-degree of a node for d-regular trees, where d ≥ 3. The clique-degree is the number of cliques a node connects to.
Remark. The name d-regular tree is sometimes associated with trees of degree d − 1 for the root node and d for all other nodes, where a binary tree (2 children per parent node) is then a 3-regular tree. In this work, we refer to d-regular trees as trees where all non-leaf nodes are of degree d; thus, a binary tree is also a 3-regular tree, as shown in Figure 3.12.
Figure 3.12: A binary 3-regular tree with 10 nodes, including the root node ϑ′_0, over two levels (L = 2).
Corollary 1. Let T be a d-regular junction tree with d ≥ 3, a root clique-node ϑ′_0, and L ∈ N levels, such that each clique-node ϑ′_k has degree d, except for leaf nodes with degree 1. A clique-node ϑ′^ℓ_k is said to be at level ℓ ∈ {0, 1, …, L} if the distance between ϑ′_0 and ϑ′^ℓ_k is ℓ. For a decomposable random graph with junction tree T, the expected number of clique connections (clique-degree) of a node ϑ with an initial connection to clique-node ϑ′^ℓ_k is

E[deg(ϑ, ϑ′^ℓ_k, T, ·) | ϑ′^ℓ_k ∈ ℓ] = ζ + ζ²d ((d̄ζ)^{L−ℓ} − 1)/(d̄ζ − 1) + ζ²(d̄ζ)^{L−ℓ}(d̄ζ + 1) ((d̄ζ²)^ℓ − 1)/(d̄ζ² − 1),    (3.66)

where ζ = r′W_1(ϑ) and d̄ = d − 1. For an initial starting point at the root ϑ′_0, the expected value simplifies to ζ + ζ²d(d̄^L ζ^L − 1)/(d̄ζ − 1). For ϑ′^L_k, a node in level L, it is ζ + ζ²(d̄ζ + 1)(d̄^L ζ^{2L} − 1)/(d̄ζ² − 1).
Proof. With some simple algebra, a few properties of d-regular trees are accessible; for example, the number of clique-nodes at distance 0 < k ≤ L from the root ϑ′_0 is Γ^{ϑ′_0}_k = d(d − 1)^{k−1}, where Γ^{ϑ′_0}_k = 0 for k > L. Other properties require more combinatorial work; the interest is in defining the more general expression Γ^ℓ_k, the number of clique-nodes at distance k from a clique-node at level ℓ ∈ {0, 1, 2, …, L}. The following is a list of simple properties of d-regular trees with root node ϑ′_0, which will come in useful in defining Γ^ℓ_k:

(a) for a fixed ℓ, the maximum distance is max{k : Γ^ℓ_k > 0} = L + ℓ, that is, 2L for ℓ = L;

(b) for a tree with n nodes (|v(T)| = n), Σ_{k≥0} Γ^ℓ_k = n for all ℓ, where Γ^ℓ_0 = 1;

(c) Γ^ℓ_k = Γ^{ϑ′_0}_k for all k ≤ L − ℓ;

(d) by the geometric sum, the number of nodes in T is expressible in d and L as

|v(T)| = Σ_{k=0}^{L} Γ^{ϑ′_0}_k = 1 + d((d − 1)^L − 1)/(d − 2) = (d d̄^L − 2)/(d − 2),    (3.67)

where d̄ = d − 1.
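The counting properties above can be verified on an explicitly built tree. The sketch below is ours and assumes the convention of Figure 3.12: the root has d children, every other internal node has d − 1 children, and leaves sit at level L.

```python
from collections import deque

def build_regular_tree(d, L):
    """Adjacency list of a d-regular tree in the sense of Figure 3.12:
    the root has d children, every other internal node has d - 1 children,
    and leaves sit at level L (so all internal nodes have degree d)."""
    adj = [[]]
    frontier = [0]
    for _ in range(L):
        nxt = []
        for parent in frontier:
            n_children = d if parent == 0 else d - 1
            for _ in range(n_children):
                child = len(adj)
                adj.append([parent])
                adj[parent].append(child)
                nxt.append(child)
        frontier = nxt
    return adj

def gamma_counts(adj, source):
    """Gamma_k: how many nodes sit at each BFS distance k from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = {}
    for k in dist.values():
        counts[k] = counts.get(k, 0) + 1
    return counts

d, L = 3, 2
adj = build_regular_tree(d, L)
assert len(adj) == (d * (d - 1) ** L - 2) // (d - 2) == 10   # Eq. (3.67)
g = gamma_counts(adj, 0)
assert all(g[k] == d * (d - 1) ** (k - 1) for k in range(1, L + 1))
```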
Properties (b) and (c) show that, for distances larger than L − ℓ, the standard distribution rule Γ^{ϑ′_0}_k = d(d − 1)^{k−1} does not apply. By combinatorial counting and induction, Table 3.2 summarizes the general expression of Γ^ℓ_k for different values of ℓ, where d̄ = d − 1 and ⌊x⌋ is the floor operator. The values in the top left corner of Table 3.2, bordered by the ladder shape, correspond to property (c) above. Moreover, for each row, the values under the ladder come in pairs as a result of the floor operator, except the first (i.e., 1 for row L, d̄ for row L − 1) and the last value (d̄^L). Therefore, the total number of clique-nodes at all distances for clique-nodes in level ℓ is
Σ_{k=1}^{2L} Γ^ℓ_k = Σ_{k=1}^{L−ℓ} d(d − 1)^{k−1}    [part above the ladder in Table 3.2]
                  + Σ_{k=0}^{ℓ} (d − 1)^{L−k} [δ_{(0,L]}(k) + δ_{(0,L]}(k + 1)]    [part under the ladder]

                = Σ_{k=1}^{L−ℓ} d(d − 1)^{k−1} + Σ_{k=0}^{ℓ} 2(d − 1)^{L−k} − [(d − 1)^L + (d − 1)^{L−ℓ}]    [with correction for the first and last values],    (3.68)

where δ_{(0,L]}(s) = 1 if 0 < s ≤ L and 0 otherwise.
Table 3.2: A summary of the number of clique-nodes at distance k from a clique-node at level ℓ ≤ L, for a d-regular tree with L levels, where ⌊x⌋ is the floor operator and d̄ = d − 1.

                         Number of clique-nodes at distance k
ℓ      1   2     3      …   L−1                L                  L+1                …   2L−2       2L−1       2L
0      d   dd̄   dd̄^2   …   dd̄^{L−2}          dd̄^{L−1}          0                  …   0          0          0
1      d   dd̄   dd̄^2   …   dd̄^{L−2}          d̄^{L−1}           d̄^L               …   0          0          0
2      d   dd̄   dd̄^2   …   d̄^{L−2}           d̄^{L−1}           d̄^{L−1}           …   0          0          0
⋮      ⋮   ⋮     ⋮      …   ⋮                  ⋮                  ⋮                  …   ⋮          ⋮          ⋮
L−2    d   dd̄   d̄^2    …   d̄^{L−⌊L/2⌋}       d̄^{L−⌊(L−1)/2⌋}   d̄^{L−⌊(L−2)/2⌋}   …   d̄^L       0          0
L−1    d   d̄    d̄^2    …   d̄^{L−⌊(L+1)/2⌋}   d̄^{L−⌊L/2⌋}       d̄^{L−⌊(L−1)/2⌋}   …   d̄^{L−1}   d̄^L       0
L      1   d̄    d̄      …   d̄^{L−⌊(L+2)/2⌋}   d̄^{L−⌊(L+1)/2⌋}   d̄^{L−⌊L/2⌋}       …   d̄^{L−1}   d̄^{L−1}   d̄^L
To arrive at (3.66), using (3.63), with ζ = r′W_1(ϑ) and d̄ = d − 1, the logic used in (3.68) gives

Σ_{k=0}^{∞} Γ^ℓ_k ζ^{k+1} = ζ + Σ_{k=1}^{2L} Γ^ℓ_k ζ^{k+1}

    = ζ + Σ_{k=1}^{L−ℓ} d d̄^{k−1} ζ^{k+1} + Σ_{k=0}^{ℓ} d̄^{L−k} [ζ^{L+ℓ−2k+2} δ_{(0,L]}(k) + ζ^{L+ℓ−2k+1} δ_{(0,L]}(k + 1)]

    = ζ + Σ_{k=1}^{L−ℓ} d d̄^{k−1} ζ^{k+1} + Σ_{k=0}^{ℓ} d̄^{L−k} ζ^{L+ℓ−2k+1}(ζ + 1) − [d̄^L ζ^{L+ℓ+2} + d̄^{L−ℓ} ζ^{L−ℓ+1}].    (3.69)

The form in (3.66) follows directly from multiple applications of the geometric series sum and simplification.
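Equation (3.66) can be checked numerically: averaging it over levels with weights Γ_ℓ/|v(T)|, as in Corollary 2 below, reproduces the closed forms derived in Example 3.7.1. The sketch is ours (function names included); `db` stands for d̄ = d − 1.

```python
def dreg_level_expectation(zeta, ell, L, d):
    """Eq. (3.66): expected clique-degree for a d-regular junction tree,
    given an initial clique-node at level ell, with db = d - 1."""
    db = d - 1
    geom = lambda x, n: sum(x ** j for j in range(n))   # = (x**n - 1)/(x - 1)
    return (zeta
            + zeta ** 2 * d * geom(db * zeta, L - ell)
            + zeta ** 2 * (db * zeta) ** (L - ell) * (db * zeta + 1)
              * geom(db * zeta ** 2, ell))

def dreg_expectation(zeta, L, d):
    """Average of (3.66) over levels with weights Gamma_ell / |v(T)|."""
    db = d - 1
    weights = [1] + [d * db ** (ell - 1) for ell in range(1, L + 1)]
    return (sum(w * dreg_level_expectation(zeta, ell, L, d)
                for ell, w in enumerate(weights)) / sum(weights))

# at zeta = 1 every clique-node is reached: the binary tree of Figure 3.12
# (d = 3, L = 2) has |v(T)| = 10 clique-nodes
assert abs(dreg_expectation(1.0, 2, 3) - 10.0) < 1e-9
```

The `geom` helper evaluates the geometric sums of (3.66) in a form that is also well defined at ζ = 1, where the closed-form ratios would be 0/0.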
The expectation in Corollary 1, though it required some combinatorial work, is still restrictive: it conditions on the level of the initial clique-node. The following corollary generalizes the result to an arbitrary initial point of any level.
Corollary 2. Following the settings of Corollary 1, for a decomposable random graph with junction tree T, the expected clique-degree of a node ϑ for an arbitrary initial starting point ϑ′ is

E[deg(ϑ, ϑ′, T, ·)] = (d − 2)/(d d̄^L − 2) [ ζ + d̄ζ² ((d̄ζ)^L − 1)/(d̄ζ − 1) − ζd̄ (ζ + 1)/(d̄ζ − 1) · (d̄^L − 1)/(d̄ − 1)

    + ζ²d̄ (d̄^L ζ^L − 1)/(d̄ζ² − 1) · (ζ + 1)/(d̄ζ − 1) + ζ³d̄ (d̄ζ)^L (d̄ζ + 1)/(d̄ζ² − 1) · ((d̄ζ)^L − 1)/(d̄ζ − 1) ].    (3.70)
The proof is directly obtainable by linearity of expectations on disjoint domains, where the probability that the initial point is in level ℓ is Γ^{ϑ′_0}_ℓ / |v(T)|.
Corollaries 1 and 2 illustrate the case of d-regular trees, where d ≥ 3, with d = 3 being the binary tree. For the case of a path junction tree, where d = 2, a very similar and simpler result is obtained below.
Corollary 3. Let T be a 2-regular tree with root clique-node ϑ′_0 and L ∈ N levels, such that each clique-node ϑ′_k has degree 2, except leaf nodes with degree 1. Then, for a decomposable random graph with junction tree T, the expected clique-degree of a node ϑ with an initial clique-node ϑ′_ℓ in level ℓ ∈ {0, 1, …, L} is

E[deg(ϑ, ϑ′_ℓ, T, ·) | ϑ′_ℓ ∈ ℓ] = ζ + 2ζ² (ζ^{L−ℓ} − 1)/(ζ − 1) + ζ^{L−ℓ+1} (ζ^{2ℓ} − 1)/(ζ − 1),    (3.71)

where ζ = r′W_1(ϑ). For an arbitrary initial point ϑ′, the expectation becomes

E[deg(ϑ, ϑ′, T, ·)] = ζ/(2L + 1) [ 1 − 2L (ζ + 1)/(ζ − 1) + 2(ζ^{L+1} + ζ² + ζ − 1)(ζ^L − 1)/(ζ − 1)² ].    (3.72)
The proof of Corollary 3 follows the same derivation method as Corollaries 1 and 2, and is thus omitted. We now illustrate a few expectation examples for small d-regular trees.
Example 3.7.1. According to Corollary 2, for the binary junction tree in Figure 3.12, with L = 2, the expected clique-degree of an arbitrary node ϑ is

E[deg(ϑ, ·)] = ζ/5 (12ζ^4 + 12ζ^3 + 12ζ^2 + 9ζ + 5).    (3.73)

For L = 3 it is

E[deg(ϑ, ·)] = ζ/11 (48ζ^6 + 48ζ^5 + 48ζ^4 + 36ζ^3 + 30ζ^2 + 21ζ + 11).    (3.74)

Example 3.7.2. By Corollary 3, for a path junction tree with L = 2 levels (5 clique-nodes), the expected clique-degree of an arbitrary node ϑ is

E[deg(ϑ, ·)] = ζ/5 (2ζ^3 + 6ζ^2 + 10ζ + 7).    (3.75)

For L = 3 (7 clique-nodes), it is

E[deg(ϑ, ·)] = ζ/7 (2ζ^5 + 4ζ^4 + 8ζ^3 + 12ζ^2 + 14ζ + 9).    (3.76)
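Corollary 3 and the level-averaging rule used for Corollary 2 can be checked numerically: averaging (3.71) over levels with weights Γ_ℓ/|v(T)| reproduces the polynomials of Example 3.7.2. A sketch (function names are ours):

```python
def path_level_expectation(zeta, ell, L):
    """Eq. (3.71): expected clique-degree for a path junction tree (d = 2),
    given an initial clique-node at level ell."""
    geom = lambda x, n: sum(x ** j for j in range(n))  # = (x**n - 1)/(x - 1), safe at x = 1
    return (zeta
            + 2 * zeta ** 2 * geom(zeta, L - ell)
            + zeta ** (L - ell + 1) * geom(zeta, 2 * ell))

def path_expectation(zeta, L):
    """Eq. (3.72) via averaging: weight 1/(2L+1) for the root level,
    2/(2L+1) for every other level."""
    weights = [1] + [2] * L
    return sum(w * path_level_expectation(zeta, ell, L)
               for ell, w in enumerate(weights)) / (2 * L + 1)

# cross-check against the closed forms (3.75) and (3.76)
for z in (0.3, 0.7, 1.0, 1.5):
    poly_L2 = z / 5 * (2 * z ** 3 + 6 * z ** 2 + 10 * z + 7)
    poly_L3 = z / 7 * (2 * z ** 5 + 4 * z ** 4 + 8 * z ** 3
                       + 12 * z ** 2 + 14 * z + 9)
    assert abs(path_expectation(z, 2) - poly_L2) < 1e-9
    assert abs(path_expectation(z, 3) - poly_L3) < 1e-9
```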
3.8 Discussion
Instead of modelling the adjacency matrix of a decomposable graph, this work adopts a different approach by modelling its biadjacency matrix. This is achieved by representing decomposable graphs as deterministic functions of bipartite point processes Z, which describe the interactions of nodes with latent communities that act as potential maximal cliques. Those interactions are driven by the affinity parameters of the nodes and of the community nodes, termed clique-nodes. Like other decomposable graph models, such as Green and Thomas (2013), the proposed model adopts an iterative modelling procedure by conditioning on a junction tree T, sampling Z | T and iteratively T | Z.
The proposed framework has several benefits. Most importantly, it enables fast sampling algorithms even for very large graphs, which is achieved by the simplicity of the Markov update conditions in the bipartite representation. The probability of node θ_i connecting to, or disconnecting from, a maximal clique only depends on whether the latter is a boundary or a neighbouring clique-node of the θ_i-induced junction tree T_{|i}, as defined in (3.12). This sampling algorithm is clearly much faster when the simpler boundary (T^{bd}_{|i}) and neighbouring (T^{nei}_{|i}) clique-node sets of (3.7) are used in (3.12). The boost in speed is attributed to two aspects: (i) all quantities of T^{bd}_{|i} and T^{nei}_{|i} can be computed using simple matrix operations on Z; (ii) T^{bd}_{|i} and T^{nei}_{|i} decouple the generative Markov chain into parallel chains, one for each node, see Lemma 2. However, the added speed does not come without cost: by using the simpler boundary and neighbouring sets, a realization of Z might display active communities (non-empty rows) that are sub-maximal cliques. This contradicts the assumption that those communities represent maximal cliques, even though the resulting graph is still decomposable. Section 3.3 initially proposed the solution of using the greedier boundary and neighbouring sets instead, though later proposed augmenting sub-maximal non-empty rows of Z with an extra node each, as shown in Figures 3.4 and 3.5. The latter solution is justified by the Kallenberg representation of graphs (Section 2.2.4), where a realization is treated as a truncation on R²_+, and under certain truncations sub-maximal cliques can occur. Nonetheless, Proposition 2 of Section 3.3.2 shows that there exists, with probability 1, a larger truncation of the node domain that guarantees all non-empty cliques to be maximal. Therefore, one can approximate such truncations by adding extra nodes to sub-maximal cliques. Another appealing benefit of this framework is its easy access to the set of maximal cliques and, consequently, a junction tree of the realization, a direct result of the bipartite representation used.
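The kind of matrix operations alluded to in point (i) can be sketched as follows. Since (3.7) is not restated here, the sets below are one plausible reading, with our own names and conventions, not the thesis's algorithm: member cliques come from a column of Z, neighbours from one multiplication with the junction-tree adjacency, and boundary clique-nodes as leaves of the induced subtree.

```python
def node_clique_sets(Z, A, i):
    """For node i: member cliques, neighbouring clique-nodes of the induced
    subtree T|i, and its boundary (leaf) clique-nodes, computed from the
    biadjacency matrix Z (Nc x Nv, 0/1) and the junction-tree adjacency A."""
    Nc = len(Z)
    member = [Z[k][i] == 1 for k in range(Nc)]
    # touches[k]: number of member cliques adjacent to clique-node k in T
    touches = [sum(A[k][j] for j in range(Nc) if member[j]) for k in range(Nc)]
    nei = [k for k in range(Nc) if touches[k] > 0 and not member[k]]
    bd = [k for k in range(Nc) if member[k] and touches[k] <= 1]
    return member, nei, bd

# toy example: three clique-nodes on a path, node 0 belongs to cliques 0 and 1
Z = [[1], [1], [0]]
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
member, nei, bd = node_clique_sets(Z, A, 0)
assert nei == [2]      # clique 2 neighbours the induced subtree {0, 1}
assert bd == [0, 1]    # both member cliques are leaves of the induced subtree
```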
With this framework, we can derive a Markov update scheme together with its mixing time, conditional on a given junction tree (Lemma 2). Moreover, we can explicitly derive the expected number of cliques per node when the junction tree is a d-regular tree. This expectation is obtained by first conditioning on an initial starting clique-node, and then generalizing to an arbitrary starting clique-node (Corollaries 1 and 2).
This work can be improved in a few directions. First, the lower bound of the mixing time in Lemma 2 depends on the structure of the junction tree through the component Σ_k 1/Γ_k. It might be possible to replace this component by a general measure of tree density that can be easily computed from the graph. This might increase the lower bound while simplifying its computation. Second, it is possible to extend the expectation results of Section 3.7 to include, for example, the expected number of nodes per clique. Since the sum of the columns (nodes) of Z is equal to the sum of its rows (cliques), the column-wise expectation can be used to derive the row-wise expectation. In addition, the dependency of the node expectation on tree quantities could potentially be substituted by general tree measures, of length and density, analogous to the proposed generalization of the mixing time. This could simplify the expression of the expectation, though it would replace the equality with lower and upper bounds.
Chapter 4
Sub-clustering in decomposable graphs
and size-varying junction trees
4.1 Introduction
The bipartite representation of decomposable graphs proposed in Section 3.3 assumes that the latent communities (θ′_1, θ′_2, …) represent possible maximal cliques of a decomposable graph. Therefore, interactions between graph nodes and those community nodes, in the biadjacency matrix Z, had to abide by specific rules (see Eq. (3.12)). This chapter extends this assumption by allowing latent communities to also represent subgraphs of maximal cliques, thus forming a type of sub-clustering.

The interpretation of latent communities as cliques still holds, since, by definition, non-empty subgraphs of cliques are completely connected components and therefore are cliques. The introduction of sub-cliques extends the representation of Z from a biadjacency matrix to a bipartite graph that is a generator of decomposable graphs through a specific mapping function, for example (3.10). Nonetheless, the new representation allows for plenty of interesting dynamics in the interaction and interchangeability of sub-cliques with their ascendant maximals. Those dynamics call for more extensive Markov update rules, first to ensure decomposability, and second to guard the representability of Z as a node-clique bipartite graph. For the set of maximal cliques, the node-clique interaction rules are very similar to the ones in (3.11). For sub-cliques, the rules differ, as cross-clique interactions are possible without reshaping the graph.
To formalize those notions, the following section first illustrates combinatorial properties of cliques alongside their relation to the biadjacency representation Z. Then, rules for each possible (dis)connect move are formulated, each in its separate section. Finally, a new Markov update scheme is introduced.
4.2 Subgraphs of cliques as sub-clusters
A clique of size N has 2^N − 1 possible unique subgraphs that are smaller-sized cliques, which we initially termed sub-cliques. The uniqueness of those sub-cliques is related to the node labels, and not to the number of nodes in the subgraph. The number 2^N − 1 is derived by counting the number of ways a subgraph of size N or smaller can be formed from a set of N nodes. Figure 4.1 illustrates an example of a 4-node clique with all its unique subgraphs forming smaller cliques, including single-node cliques. This amounts to 15 unique subgraphs, with \binom{N}{n} subgraphs of size n, for a total of

Σ_{n=1}^{N} \binom{N}{n} = 2^N − 1.    (4.1)
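The count in (4.1) is a one-line numerical check (the function name is ours):

```python
from math import comb

def n_subcliques(N):
    """Number of unique (labelled) sub-cliques of an N-node clique, Eq. (4.1)."""
    return sum(comb(N, n) for n in range(1, N + 1))

assert n_subcliques(4) == 15                              # the example of Figure 4.1
assert all(n_subcliques(N) == 2 ** N - 1 for N in range(1, 12))
```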
Figure 4.1: A 4-node clique (left) and all its unique subgraphs, including single-node cliques, for a total of 15 subgraphs.
In the biadjacency matrix representation, restricting the number of unique sub-cliques of each maximal clique to the combinatorial number 2^N − 1 requires a tremendous amount of bookkeeping that is deemed unnecessary. Rather, we adopt a representation analogous to that of multi-graphs, where many latent communities could represent the same unique sub-clique, dropping the uniqueness requirement of the latter. Therefore, at each step, the Markov update scheme will only keep track of those latent communities representing maximal cliques, and all others will be flagged as sub-cliques. Moreover, the relation between sub-cliques and their ascendant maximals is not exclusive, since sub-cliques within separators can be linked to multiple maximal cliques. For example, Subfigure 4.2a shows a biadjacency matrix realization with sub-cliques, where maximal cliques are starred and in red. The corresponding decomposable graph, shown in Subfigure 4.2c, consists of one 4-node, three 3-node, and one 2-node maximal cliques. Some sub-cliques are contained in multiple maximal cliques, as shown with dashed lines in the junction tree of Subfigure 4.2b, where the sub-clique CD, also a separator, is contained in both ABCD and CDF. Similarly for CF and the single-node sub-clique D: both are subsets of multiple maximal cliques.

As shown in the example of Figure 4.2, sub-cliques in the biadjacency matrix do not affect the decomposable graph directly; if disregarded, the graph is unchanged. This confirms the fact that nodes can freely connect to and disconnect from sub-cliques without risking decomposability, if all members of a sub-clique are also members of a single maximal clique. Other types of interactions are possible, with conditions illustrated in the following sections.
4.3 Permissible moves in the bipartite relation
Recall that a decomposable graph is specified by the tuple (G, Z, T), where G is the decomposable graph composed of nodes (θ_i), Z is its node-clique bipartite relation matrix represented by an infinite point process on R²_+, and T = (Θ′, E) is the maximal-clique junction tree, where Θ′ is the set of latent community nodes representing maximal cliques of G and E are edges formed by minimal separators. In this chapter, we will regard Z as a fixed-size biadjacency matrix of the bipartite node-clique relations, where the number of rows and columns is fixed.
[Figure 4.2: three panels, (a) biadjacency matrix, (b) junction tree, (c) decomposable graph.]

Figure 4.2: An example of a biadjacency matrix (left), with 5 maximal cliques, starred and in red, and 10 sub-cliques. The corresponding junction tree (top right) has all sub-cliques and their ascendants circled and connected with dashed lines, with maximal cliques in red solid lines. The decomposable graph (bottom right) summarizes the biadjacency matrix.
Since the clique-nodes Θ′ = (θ′_k) now represent maximal and sub-maximal cliques, to avoid confusion, let C represent the subset of maximal cliques and C̄ the subset of sub-maximal cliques, such that Θ′ = C ∪ C̄; θ′_k ∈ C̄ implies that G(θ′_k), if not empty, is a sub-clique in G, while θ′_k ∈ C implies it is maximal.

For the clique ascendant relation, we use the subset notation θ′_s ⊂ θ′_k, or equivalently G(θ′_s) ⊂ G(θ′_k), to indicate that θ′_s is a sub-clique of θ′_k. Moreover, we refer to a node θ_i as "connected to" the clique θ′_k when (θ′_k, θ_i) is an edge in the bipartite relation represented by Z, simply θ_i ∈ θ′_k. Additionally, we refer to the move of removing the edge (θ′_k, θ_i) as "disconnecting" the node θ_i from the clique θ′_k.

Lastly, the set of permissible moves is organized into four main cases: disconnecting single-clique nodes, disconnecting multi-clique nodes, connecting nodes, and promoting a sub-clique to be maximal.
4.3.1 Disconnecting single-clique nodes
Single-clique nodes are those that are members of a single maximal clique, for example, nodes A and B in clique ABCD, node E in clique CEF, node G in FGH, and node I in HI, in Subfigure 4.2c. Single-clique nodes differ in their effect on maximal cliques when disconnected. While some cause maximal cliques to become sub-maximal, like node E in CEF and I in HI, others have no effect. Each case influences the junction tree differently, and distinguishing between the two can be achieved through the maximal clique separators.
Proposition 4 (Disconnecting single-clique nodes). In the biadjacency matrix Z of a decomposable graph G with some junction tree T = (C, E), let (θ_i) and (θ′_k) index the set of nodes and clique-nodes, respectively. The graph G′, formed by disconnecting a single-clique node θ_i from a maximal clique θ′_k ∈ C, is decomposable with junction tree T′ = (C′, E′). Moreover,

(i) if θ′_k has other single-clique nodes, or when it contains multiple unique non-overlapping separators, then θ′_k ∈ C′, since G(θ′_k \ {θ_i}) ⊄ G(θ′_s) for all θ′_s ∈ C;

(ii) otherwise, if θ_i is the sole single-clique node in G(θ′_k), then θ′_k ∉ C′, since G(θ′_k \ {θ_i}) ⊆ G(θ′_s) for some θ′_s ∈ C.

In (ii), all separators in G(θ′_k) are subsets of G(θ′_s), implying T′ = (C′ = C \ {θ′_k}, E′) with

E′ = (E \ {{θ′_k, θ′_m} : {θ′_k, θ′_m} ∈ E, θ′_m ∈ C}) ∪ {{θ′_s, θ′_m} : {θ′_k, θ′_m} ∈ E, θ′_m ∈ C},

formed by removing clique-node θ′_k from C and rewiring all its tree edges to θ′_s. In (i), T′ = T. If θ′_k is a sub-maximal clique-node (θ′_k ∈ C̄), disconnecting θ_i does not affect the decomposable graph.
Proof. It is straightforward to show that disconnecting single-clique nodes preserves decomposability; therefore, we will prove (i) and (ii). In (i), by definition of maximal cliques, it is clear that G(θ′_k \ {θ_i}) is maximal if G(θ′_k) has multiple single-clique nodes. Moreover, assume that S_1 and S_2 are two unique non-overlapping separators contained in G(θ′_k), that is, S_1 ⊄ S_2, S_2 ⊄ S_1, and no third separator S_3 ⊂ G(θ′_k) exists with S_1 ∪ S_2 ⊆ S_3. Then, if G(θ′_k \ {θ_i}) ⊆ G(θ′_s) for some θ′_s ∈ C, G(θ′_k \ {θ_i}) is a separator in G(θ′_k) that contains both S_1 and S_2, contradicting their uniqueness. (ii) follows directly from (i) alongside the construction of T′.
In Figure 4.2, disconnecting A or B from ABCD, or G from FGH, follows (i) of Proposition 4, while disconnecting E from CEF, or I from HI, follows (ii). In the latter case, a rewiring of the junction tree is necessary to account for the loss of a maximal clique.
By including sub-cliques in the representation of Z, any type of disconnection from a maximal clique could allow a sub-clique to become maximal in a secondary move after the disconnection. For example, in the case of single-clique nodes in Figure 4.2, the clique θ′_1 stays maximal, as BCD, after disconnecting A. However, A is still a member of the sub-cliques θ′_6 (AB) and θ′_10 (AC), where each could now be maximal, but not both. If AB becomes maximal, then AC is no longer a sub-clique of any maximal clique and hence should be discarded; the opposite is true if AC were to become maximal.
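Whether a row of Z currently represents a maximal clique, or merely a candidate for promotion, reduces to a support-containment check between rows. The following is an illustrative sketch with our own naming, not the thesis's algorithm; choosing which flagged candidate to actually promote is the subject of Section 4.4.

```python
def maximal_rows(Z):
    """Flag the rows of a 0/1 biadjacency matrix whose node support is not
    strictly contained in another row's support (duplicates are kept once);
    these rows are the communities currently acting as maximal cliques."""
    supports = [frozenset(i for i, z in enumerate(row) if z) for row in Z]
    flags = []
    for k, s in enumerate(supports):
        dominated = any((s < t) or (s == t and j < k)
                        for j, t in enumerate(supports) if j != k)
        flags.append(bool(s) and not dominated)
    return flags

# rows: ABCD, AB, AC, DE over nodes A..E
Z = [[1, 1, 1, 1, 0], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 0, 0, 1, 1]]
assert maximal_rows(Z) == [True, False, False, True]
# disconnect A from the first row (ABCD -> BCD): both AB and AC now flag as
# candidates, only one of which may actually be promoted
Z[0][0] = 0
assert maximal_rows(Z) == [True, True, True, True]
```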
The choice of which sub-clique should become maximal is left for a more detailed discussion in Section 4.4. Nonetheless, we will describe such a secondary move as a promotion of a sub-clique, and define it as follows.
Definition 10. Suppose θ_i is a node of a decomposable graph G with maximal clique set C, such that θ_i ∈ θ′_k ∈ C. Let the term "promoted" to be maximal characterize a sub-clique θ′_s ⊂ θ′_k such that disconnecting θ_i from θ′_k admits θ′_s as a maximal clique in the newly formed graph.
In regard to Proposition 4, the update on the junction tree as a result of promoting a
sub-clique is considered in the following corollary.
Corollary 4. Following the settings of Proposition 4 and Definition 10, let θ′_m be a sub-clique that is promoted to be maximal after the single-clique node θ_i disconnects from θ′_k, where θ_i ∈ θ′_m. Then, a new junction tree is formed as T′′ = (C′ ∪ {θ′_m}, E′′), where E′′ = E′ ∪ {{θ′_k, θ′_m}} when θ′_k ∈ C′, as in (i) of Proposition 4, and E′′ = E′ ∪ {{θ′_s, θ′_m}} when θ′_k ⊂ θ′_s ∈ C′, as in (ii) of Proposition 4. If |v(G(θ′_m))| = 1, then E′′ = E′. Moreover, all sub-cliques containing θ_i that are not subsets of G(θ′_m) are discarded.
Proof. The proof follows directly from the definitions and Proposition 4.
Figure 4.3 is a graphical illustration of disconnecting single-clique nodes in the example of Figure 4.2, following the steps in Proposition 4 and Corollary 4. For case (i) of Proposition 4, Subfigures 4.3b and 4.3d show the junction tree change when disconnecting nodes A and G from their maximal cliques, respectively. A sub-clique is promoted to be maximal in each case, by adding an extra clique-node to the tree with the relevant edges. The decomposable graph is shown to the left of each case, in Subfigures 4.3a and 4.3c, respectively. Subfigure 4.3f illustrates case (ii) of Proposition 4, when node E is disconnected from CEF; thus all clique-nodes previously connected to CEF are now connected to CDF, since CF ⊂ CDF. The newly formed maximal clique EF is also connected to CDF.
4.3.2 Disconnecting multi-clique nodes
Multi-clique nodes are those that are members of multiple maximal cliques, and thus are subsets of minimal separators. Disconnecting a multi-clique node from a maximal clique requires the latter to be adjacent, in some junction tree, to some maximal clique containing the node, as shown in Section 3.3 (Eq. (3.11)). This condition is restrictive, though necessary to ensure decomposability. By introducing sub-cliques to the biadjacency matrix Z, this condition can be relaxed.
Proposition 5. Following the settings of Proposition 4, for a decomposable graph G, let θ_i be a multi-clique node in the maximal clique θ′_k ∈ C. Define S_{(θ′_k)} ⊂ G(θ′_k) to be the set of separators contained in G(θ′_k), and S_{(θ′_k, θ_i)} ⊆ S_{(θ′_k)} the subset containing θ_i, such that θ_i ∈ s for every s ∈ S_{(θ′_k, θ_i)}. Let G′ be the graph formed by partitioning θ′_k into two cliques, θ′_{k1}
[Figure 4.3: six panels, (a) disconnecting A from ABCD to form AB, (b) corresponding new junction tree, (c) disconnecting G from FGH to form GH, (d) corresponding new junction tree, (e) disconnecting E from CEF to form EF, (f) corresponding new junction tree.]

Figure 4.3: Examples of disconnecting single-clique nodes of the graph in Figure 4.2. The top panel shows the case of disconnecting node A from clique ABCD (top left), where BCD is still maximal and the previous sub-clique AB is now maximal, adding another clique-node to the junction tree joined at BCD (top right), while discarding all other sub-cliques that contain A with nodes C or D, such as AC. The middle row shows the case of disconnecting node G from FGH (middle left), where FH is still maximal, while the previous sub-clique GH is now maximal, adding an extra clique-node to the junction tree (middle right) connected to FH. The bottom panel shows the case when a maximal clique becomes sub-maximal, by disconnecting node E from CEF (bottom left), where CF is now a sub-clique of CDF (shown dashed and in blue), thus removing the corresponding clique-node from the junction tree (bottom right), while connecting all previous CEF edges to CDF. The new maximal clique-node EF adds an edge to the tree with CDF.
and θ′_{k2}, such that θ′_{k1} ∪ θ′_{k2} = θ′_k, G′(θ′_{k2}) = G(θ′_k \ {θ_i}) and S_{(θ′_k, θ_i)} ⊆ G′(θ′_{k1}). Then G′ is decomposable.
Proof. Note that neither θ′_{k1} nor θ′_{k2} is guaranteed to be maximal in G′. To ensure decomposability of G′, it suffices to show that G′ has a junction tree (Theorem 11). The only part of the junction tree of G that is affected by the partition in Proposition 5 is the set of edges connected to θ′_k, and by proper rewiring we can guarantee the existence of a junction tree. The simplest case is when G(θ′_{k2}) ⊂ G(θ′_{k1}), that is, G(θ′_{k1}) = G(θ′_k), implying G′ = G. The second case is when G(θ′_{k2}) ⊄ G(θ′_{k1}). By construction S_{(θ′_k, θ_i)} ⊆ G(θ′_{k1}); therefore, the separator set S_{(θ′_k)} is intact, since S_{(θ′_k)} \ S_{(θ′_k, θ_i)} ⊂ G(θ′_{k2}). Hence, all junction tree edges previously joined at θ′_k can now be rewired, according to the separators, to θ′_{k1} or θ′_{k2} if they are maximal, and otherwise to the maximal cliques containing them. Finally, θ′_{k1} and θ′_{k2}, or their maximal cliques, are joined by an edge if their intersection is non-empty. This amounts to a junction tree of G′, though not necessarily completely connected. Other junction trees are possible, since S_{(θ′_k)} \ S_{(θ′_k, θ_i)} can also be a subset of G(θ′_{k1}).
Proposition 5 permits a multi-clique node to disconnect from a maximal clique that is not a boundary clique-node in the restrictive tree sets defined in (3.7) and (3.8). The proposition permits the disconnection provided the separator set S_{(θ′_k, θ_i)} stays intact in a second maximal clique. This second maximal clique can also be a sub-clique that is promoted to be maximal after the disconnection, which allows more flexibility in the possible disconnect moves. The following proposition illustrates such cases and their effect on the junction tree.
Proposition 6 (Disconnecting multi-clique nodes). Following the settings of Proposition
5 and Denition 10, let θi be a multi-clique node of some maximal clique θ′k ∈ C. The
biadjacency matrix Z′ formed by disconnecting θi from θ′k represents a decomposable graph G ′
if there exists a clique θ′s ∈ C∪C, such that S(θ′k,θi)⊆ G(θ′s). Moreover, if G(θ′k\θi) ⊂ S(θ′k,θi)
,
then G ′ = G. Otherwise, the junction tree T ′ = (C ′, E ′) of G ′ is formed by rewiring the
separator sets S(θ′k,θi)and S(θ′k)
\ S(θ′k,θi)as follows:
4 Sub-clustering in decomposable graphs and size-varying junction trees 97
(i) for edges represented by S(θ′k,θi):

(a) if θ′s ∈ C, then θ′s ∈ C′, and the edges of S(θ′k,θi) are rewired to θ′s in T′;

(b) if θ′s is a sub-clique of θ′s1 ∈ C, then the edges of S(θ′k,θi) are rewired to θ′s1 in T′, as θ′s1 ∈ C′;

(c) if θ′s is a sub-clique of θ′k that is promoted to be maximal, then θ′s ∈ C′ and the edges of S(θ′k,θi) are rewired to θ′s.

(ii) for edges represented by S(θ′k) \ S(θi,θ′k):

(a) if θ′k ∈ C′, then the edges of S(θ′k) \ S(θi,θ′k) are preserved in T′;

(b) if θ′k ∉ C′, then the edges of S(θ′k) \ S(θi,θ′k) are rewired to θ′s ∈ C′, where G(θ′k) ⊂ G(θ′s).

The clique-nodes θ′s and θ′k, or the maximal cliques containing them in G′, form an edge in T′ if their intersection is non-empty. Finally, all sub-cliques of θ′k containing θi are discarded from the sub-clique set.
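The admissibility check at the heart of Proposition 6 reduces to a few set operations. The following Python sketch is illustrative only: cliques are represented as frozensets of node labels, and the names (separator_with_node, can_disconnect) are assumptions for exposition, not code from this thesis.

```python
# Sketch: test whether a multi-clique node i may disconnect from maximal
# clique k_clique, per Proposition 6. Cliques are frozensets of node labels;
# this representation is an illustrative assumption.

def separator_with_node(k_clique, i, cliques):
    """Union of the separators of k_clique that contain node i: S_(k,i)."""
    sep = set()
    for c in cliques:
        if c != k_clique and i in c:
            sep |= (k_clique & c)
    return sep

def can_disconnect(i, k_clique, maximal, sub):
    """i may leave k_clique if some other clique (maximal or sub) covers S_(k,i)."""
    s_ki = separator_with_node(k_clique, i, maximal)
    return any(s_ki <= c for c in (maximal | sub) if c != k_clique)
```

On the maximal cliques of Figure 4.2, this sketch reproduces the pattern of Table 4.1: disconnecting D from ABCD is admissible (the separator CD is covered by CDF), while disconnecting C from CDF is not (no other clique covers its separators).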
The proof follows directly from Proposition 5. It is worth mentioning that Proposition 6 states the conditions that ensure Z′ is representative of G′, not the decomposability of the latter. Figure 4.4 shows a case where disconnecting a multi-clique node results in a decomposable graph, but not a representative biadjacency matrix.
The example in Figure 4.2 has four multi-clique nodes (C, D, F, H), where Proposition 6 can be applied in a number of ways, two of which are illustrated graphically in Figure 4.5. The first is the case of disconnecting C from ABCD (θ′1, Figure 4.2a), while promoting the sub-clique ACD (θ′8) to be maximal. This applies (i.c) from Proposition 6, and since ABD is still maximal, (ii.a) is applied. The second is the case of disconnecting H from FGH (θ′3), while discarding the possibly-maximal sub-clique GH (θ′15). This applies (i.a) and (ii.a) from Proposition 6. For a complete list of possible disconnections of the multi-clique nodes in Figure 4.2, refer to Table 4.1. Most disconnections do not result in a new maximal clique, unless a sub-clique becomes maximal; such cliques are listed in the last column of Table 4.1.
[Figure 4.4 appears here, with panels: (a) biadjacency matrix Z; (b) decomposable graph G; (c) biadjacency matrix Z′; (d) decomposable graph G′.]

Figure 4.4: An example: disconnecting a multi-clique node D from the maximal clique ABCD in Z and G, where the resulting graph G′ is decomposable albeit Z′ is not its representative bipartite matrix, missing the maximal clique BCD in G′.
Table 4.1: Multi-clique nodes of the example in Figure 4.2, their disconnections from maximal cliques, separator sets, and possible sub-cliques to become maximal.

Node (θi) | Maximal clique (θ′k) | S(θ′k) \ S(θ′k,θi) | S(θ′k,θi) | θ′s : S(θ′k,θi) ⊆ G(θ′s) | Sub-clique promoted to maximal
C | ABCD | ∅ | C, CD | CDF, ACD, CD | ACD
C | CDF | F | CD, CF | ∅ | ∅
C | CEF | F | C, CF | CDF, CF | ∅
D | ABCD | ∅ | CD | CDF, ACD, CD | ACD
D | CDF | F, CF | CD | ABCD, CD | ∅
F | CDF | CD | CF, F | CEF, CF | ∅
F | CEF | C | CF, F | CDF, CF | ∅
F | FGH | H | F | CDF, CEF | ∅
H | FGH | F | H | HI, GH | GH
H | HI | ∅ | H | FGH, HI | HI
[Figure 4.5 appears here, with panels: (a) disconnecting C from ABCD to form ACD; (b) corresponding new junction tree; (c) disconnecting H from FGH to form FG; (d) corresponding new junction tree.]

Figure 4.5: Examples of disconnecting multi-clique nodes of the example in Figure 4.2. The graph in the top panel (top left) shows the example of disconnecting C from ABCD, cases (i.c) and (ii.a) of Proposition 6, where the separator CD belongs to the sub-clique ACD, making it maximal. The junction tree (top right) is rewired accordingly, and no sub-clique is discarded. The graph in the bottom panel (bottom left) illustrates the case of disconnecting H from FGH to form FG, while discarding the sub-clique GH, as in (i.a) and (ii.a) of Proposition 6; since FG ∩ HI is empty, the junction tree (bottom right) is rewired accordingly.
4.3.3 Connecting nodes
The last piece of the puzzle is the node connection move. Recall that in (3.11), nodes connect to maximal cliques that are adjacent, in some junction tree, to cliques already containing the node's connection. Section 3.3 assumed that a junction tree is known; while that did not guarantee the full connectivity of a sampled graph, it ensured an underlying tree structure which can partly, if not entirely, be discerned from the sampled graph. Nonetheless, as we sample junction trees simultaneously with the graph, in certain cases multiple disconnected junction trees and single-node cliques can exist, as shown in Subfigure 4.5c. While this does not demand broad changes to the previously allowed connect moves, it calls for more
bookkeeping, which is illustrated by the following proposition.
Proposition 7. Let G be a decomposable graph, where G consists of two disjoint components
(Gt)t=1,2, such that no element in G1 is connected to an element of G2. Suppose that θ′s is a
non-empty clique in G1, and θi is a node of G that is not connected to any element of θ′s. If
any of the following holds:
(i) θi ⊂ G2;
(ii) θ′s is maximal in G1 and adjacent to θ′k in some junction tree, where θi ⊂ G(θ′k);
(iii) θ′s is a sub-clique in a maximal clique θ′m that is adjacent to θ′k in some junction tree,
where θi ⊂ G(θ′k) and G(θ′m) ∩ G(θ′k) ⊂ G(θ′s).
Then, the graph G ′ formed by connecting θi to every element of θ′s is decomposable.
Proof. For cases (i) and (ii), the proof of decomposability is direct by applying Theorem 11, where in (i) a junction tree is formed by combining the junction trees of both disjoint parts. In (ii), since {θ′k, θ′s} ∈ E for some junction tree T = (C, E), adding θi to θ′s does not alter any separator, and thus a junction tree exists. For (iii), since {θ′k, θ′m} ∈ E for some junction tree T = (C, E) and G(θ′m) ∩ G(θ′k) ⊂ G(θ′s), the separators of each maximal clique remain intact, while θ′s ∪ θi becomes maximal and enters the tree between the two maximal cliques.
The following corollary builds on Proposition 7 by listing the junction tree effects for each move type.
Corollary 5 (Connecting nodes). Following the settings in Proposition 7, for some junction tree T = (C, E) of G, suppose θi ∈ θ′k and θi ∉ θ′s, for some cliques θ′k, θ′s; then

(i) when θi and θ′s are in two disjoint components of G and θ′s is a sub-clique of θ′m ∈ C, connecting θi to every element of θ′s results in a decomposable graph G′ with junction tree

T′ = (C ∪ {θ′s}, E ∪ {{θ′k, θ′s}, {θ′m, θ′s}}),

where θ′k is not a single-node clique; otherwise T′ = (C ∪ {θ′s}, E ∪ {{θ′m, θ′s}}). If θ′s is maximal, then T′ = (C, E ∪ {{θ′k, θ′s}}) when θ′k is not a single-node clique; otherwise T′ = T.

(ii) when θ′s is maximal and {θ′k, θ′s} ∈ E, if θ′s differs from θ′k by only θi, then connecting θi to every element of θ′s results in a junction tree T′ = (C′, E′) with

C′ = C \ {θ′k},  E′ = (E \ {{θ′k, θ′m} : {θ′k, θ′m} ∈ E}) ∪ {{θ′s, θ′m} : {θ′k, θ′m} ∈ E};

otherwise T′ = T.

(iii) when θ′s is a sub-clique of some maximal clique θ′m ∈ C, such that {θ′k, θ′m} ∈ E, and if G(θ′k) ∩ G(θ′m) ⊂ G(θ′s), then connecting θi to every element of θ′s results in a junction tree T′ = (C′, E′) with

C′ = C ∪ {θ′s},  E′ = (E \ {{θ′k, θ′m}}) ∪ {{θ′k, θ′s}, {θ′s, θ′m}}.
Remark. The connect move does not require discarding or modifying sub-cliques, since in all three cases of Proposition 7, other sub-cliques retain their status.

Figure 4.6 illustrates an example of connecting a node to an adjacent sub-clique, such that (iii) of Corollary 5 applies.
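Condition (iii) of Proposition 7 reduces to a containment test against the junction tree. A minimal sketch, using the same assumed frozenset representation of cliques (function and argument names are hypothetical):

```python
# Sketch of the connect-move admissibility test of Proposition 7 (iii):
# node i, belonging to clique k, may connect to a sub-clique s of a maximal
# clique m adjacent to k in the junction tree, provided m ∩ k ⊆ s.

def may_connect(i, k, s, m, tree_edges):
    """k, s, m are frozensets of node labels; tree_edges holds frozenset pairs."""
    adjacent = frozenset((k, m)) in tree_edges  # {k, m} is a junction-tree edge
    return adjacent and i in k and i not in s and (m & k) <= s
```

For the example of Figure 4.6, k = FGH, m = CEF and s = EF: the intersection m ∩ k = {F} is contained in EF, so H may connect to EF, forming the new maximal clique EFH.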
4.4 Promoting a sub-clique to be maximal

Both disconnect moves, single-node and multi-node, are associated with a secondary post-disconnection move that allows a sub-clique to become maximal, as in Definition 10. The number of candidate sub-cliques may be large. In essence, the only probabilistic quantities that could drive such a choice are the clique-node affinity parameters. Section 3.3.1 characterized those parameters with a latent unit-rate Poisson process (θ′k, ϑ′k) ∈ Π′ on
[Figure 4.6 appears here, with panels: (a) connecting H to EF to form EFH; (b) corresponding new junction tree.]

Figure 4.6: An example of connecting a node to a sub-clique in an adjacent maximal clique. Node H connects to the sub-clique EF (left) from the example in Figure 4.2; by (iii) of Corollary 5, this forms the new maximal clique EFH, connecting the maximal cliques CEF and FGH.
R²₊, where (θ′k) index the locations and (ϑ′k) the weights of those clique-nodes. At each update step, the contents and size of the possible sub-cliques may differ to a large extent. Nonetheless, by their intrinsic nature, decomposable graphs favour large connected components, such as maximal cliques. To mimic this tendency, while avoiding the heavy work of accounting for all combinatorially possible sub-cliques, we take advantage of the continuity of the affinity parameters by promoting the sub-clique with the largest weight.
Definition 11 (Promoting a sub-clique to be maximal). Following the settings of Propositions 4 and 6, let θi be a node of a maximal clique θ′k ∈ C in a decomposable graph G. Let S(θ′k,θi) be the set of separators of θ′k containing θi, such that

S(θ′k,θi) = {θ′k ∩ θ′s : θi ∈ θ′s ∈ C}.

Let C(θ′k,θi) be the set of sub-cliques of θ′k, indexed by their weights, that could be maximal if θi disconnects from θ′k, as

C(θ′k,θi) = {ϑ′m : (θ′m, ϑ′m) ∈ Π′, θi ∈ θ′m, S(θ′k,θi) ⊂ θ′m ⊂ θ′k}.

Then, if C(θ′k,θi) ≠ ∅, (θ′o(k,i), ϑ′o(k,i)) is promoted to be maximal if the disconnection occurs, where

o(k, i) = {s ∈ N : ϑ′s = max(C(θ′k,θi))},

the index of the largest element in C(θ′k,θi) with respect to the natural ordering in R₊.
Definition 11 elaborates on Definition 10, and thus applies to the disconnect moves of Corollary 4 and Proposition 6 (i.c). In the connect move of Corollary 5, a sub-clique can also become maximal; however, this is a direct result of the connect move and not a secondary move. In this case, no promotion occurs.
Definition 11 pins down the choice of which sub-clique becomes maximal after a disconnection to a single candidate, if any. This streamlines the Markov update from three steps (a disconnect move, a secondary sub-clique promotion move, and a junction tree update move) to two, by eliminating the secondary sub-clique move. The next section summarizes all update steps in a concise iterative Markov update scheme.
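The promotion rule of Definition 11 is a simple argmax over the weights of the admissible sub-cliques. A sketch (the dict-based representation of candidates is an illustrative assumption):

```python
# Sketch of the promotion rule of Definition 11: among the sub-cliques of
# θ'_k that remain admissible after θi disconnects, promote the one with the
# largest affinity weight ϑ'_m; promote nothing if there is no candidate.

def promote(candidates):
    """candidates: dict mapping sub-clique (frozenset) -> weight ϑ'_m."""
    if not candidates:
        return None  # C_(k,i) is empty: no sub-clique is promoted
    return max(candidates, key=candidates.get)
```

For instance, when C is disconnected from ABCD in Figure 4.5, the candidate with the largest weight (here ACD) is the one promoted to be maximal.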
4.5 Markov updates under size-varying junction trees

Following the notation of (3.6), (3.7) and (3.8), for a decomposable graph G with some junction tree T = (C, E), define the θi-induced junction tree at the n-th update step, T(n)|i, as in (3.6). Expand the definition of boundary clique-nodes of (3.7) to include maximal cliques with multi-clique separator sets, as well as sub-cliques. Moreover, expand the definition of neighbouring clique-nodes of (3.7) to include sub-cliques of maximal cliques in T(n)|i and sub-cliques of neighbouring cliques that retain the set-intersection nodes, as follows:

T(n)|i_bd = { θ′s : θi ∈ θ′s, (θ′s ∈ C ∧ S(θ′s,θi) ⊆ θ′k s.t. s ≠ k) ∨ (θ′s is a sub-clique) },

T(n)|i_nei = { θ′s : θi ∉ θ′s, ∃ θ′k ∈ T(n)|i s.t. ({θ′k, θ′s} ∈ E) ∨ (θ′s ⊂ θ′m ∧ θ′m ∩ θ′k ⊂ θ′s ∧ {θ′k, θ′m} ∈ E) ∨ (θ′s ⊂ θ′k) }.   (4.2)
Define the (n+1)-th Markov iterative update step for the bipartite matrix Z with sub-cliques and a size-varying junction tree as:

(i) update the edge z(n+1)_ki given the current configuration Z(n) as

P(z(n+1)_ki = 1 | Z(n), T) = W(n+1)(ϑ′k, ϑi) = W(ϑ′k, ϑi) if θ′k ∈ T(n)|i_bd ∪ T(n)|i_nei, and z(n+1)_ki = z(n)_ki otherwise;   (4.3)

(ii) given the new edge z(n+1)_ki, update the junction tree T as follows:

• for a connect move: update T according to Corollary 5;

• for a disconnect move, using Definition 11:
  – if θi is a single-clique node, update T as in Proposition 4 and Corollary 4;
  – if θi is a multi-clique node, update T as in Proposition 6.

The Markov update steps can be iterated until convergence.
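Step (i) of the scheme can be sketched as a single edge update. The affinity kernel W(ϑ′k, ϑi) = 1 − exp(−ϑ′k ϑi) below is an assumed form in the spirit of the models of the earlier chapters, not a quote of the thesis's exact W:

```python
import math
import random

# Toy sketch of update step (4.3): an edge z_ki is resampled only when θ'_k
# lies in the boundary/neighbour sets T|i_bd ∪ T|i_nei; otherwise it is kept.
# The kernel W(w_k, w_i) = 1 - exp(-w_k * w_i) is an illustrative assumption.

def update_edge(z_ki, in_restricted_set, w_k, w_i, rng=random):
    if not in_restricted_set:
        return z_ki                    # edge frozen outside the restricted sets
    p = 1.0 - math.exp(-w_k * w_i)     # P(z_ki = 1), assumed affinity kernel
    return 1 if rng.random() < p else 0
```

A full sweep would apply this to every admissible (clique, node) pair, followed by the junction tree update of step (ii).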
4.6 Discussion
This chapter has introduced a method to model sub-clusters within decomposable graphs. This is done by extending the biadjacency representation to allow for interactions between graph nodes and subgraphs of maximal cliques. Subgraphs of maximal cliques, termed sub-maximal cliques, can naturally be seen as sub-clusters within each maximal clique. The ability of the biadjacency representation to account for such sub-cliques adds richness to this representation and opens doors for new applications of decomposable graphs. Rather than solely modelling decomposable graphs, as in the classical settings, it is now possible to model both the decomposable graph and the latent dynamics forming within each maximal clique. Such dynamics are generally seen in behavioural data, as in behavioural economics or politics. For example, maximal cliques can represent firms or political entities, where interactions flow through specific channels. Sub-clustering dynamics can then capture interactions within each entity or larger maximal clique. An interesting dynamic captured by this model is when larger entities conglomerate into even larger maximal cliques, or when sub-cliques separate to form independent entities.
The flexibility and depth gained by allowing for sub-clustering in the biadjacency matrix come with extra complexities, primarily related to the dynamics between maximal and sub-maximal cliques. It is not clear how these dynamics should be structured; for example, when disconnecting a node from a maximal clique, does it also disconnect from all sub-maximal cliques of the former? This chapter adopts the notion that a node does not disconnect from a sub-maximal clique when disconnecting from a maximal one. Instead, in the junction tree update move, using the continuity of the affinity parameters, the sub-maximal clique with the highest affinity parameter is labelled as maximal, if possible, and added to the junction tree. The node then disconnects from all other sub-maximal cliques that became improper with this disconnection. In the connect move, the update rules are less complex. Contrary to the treatment of decomposable graphs in Chapter 3, allowing for sub-clustering requires a series of rules addressing the change in the junction tree after every (dis)connect move. In some update steps, a maximal clique might become sub-maximal, and vice versa, varying the size of the junction tree at every step. A major part of this chapter is dedicated to such update rules.
The clustering mechanism proposed in this chapter does not depend on choosing the correct number of clusters, nor on choosing a proper clustering distance. As discussed in Section 3.3, an n-node graph can have a maximum of n maximal cliques, with n isolated nodes, and a minimum of 1, with a fully connected graph. This chapter adopted a fixed-size biadjacency matrix Z; therefore, as long as the number of rows is larger than the number of columns, one can potentially infer the correct number of maximal cliques. All other latent communities would be labelled as sub-clusters.
One possible improvement to this work is a method for deciding how many sub-clusters are desired. Do we use a square biadjacency matrix Z, or double the number of rows relative to columns? A possible solution would be to use a very large number of rows, and then filter out all small sub-clusters, for example, single-node sub-clusters. A possible direction for future work is to adopt a sub-clustering framework that lies between the proposed method of this chapter and the initial treatment of decomposable graphs of Chapter 3. Tree nodes θ′1, θ′2, . . . were initially treated as the maximal cliques of a decomposable graph, such that the Markov update step of (3.12) used the boundary and neighbouring sets of (3.8). This guarantees that all active cliques are maximal. Nonetheless, as shown in Proposition 1, using the boundary and neighbouring sets of (3.7) also guarantees that the mapping in (3.10) results in a decomposable graph, though not all active cliques in the biadjacency representation are maximal. This amounts to another direction of sub-clustering in decomposable graphs, where the sub-clusters are the non-empty non-maximal nodes of the tree. This method could potentially lead to less complex update steps, though the interpretation of sub-clusters differs from the one proposed in this chapter. The difference is that sub-maximal cliques are potentially maximal as more nodes are added to the model, and are thus only temporary sub-clusters.
Chapter 5
A Bayesian model for link prediction in
ecological networks
Identifying undocumented or potential interactions among species is a challenge facing modern ecologists. Our aim is to guide the sampling of ecological networks by identifying the most likely undocumented interactions. We frame this problem using a bipartite graph structure, where edges represent interactions between pairs of species. We first construct a prior network of associations by drawing from available literature. To predict undocumented interactions, we use a hierarchical Bayesian latent score framework for bipartite graphs and incorporate a Markov network dependence informed by phylogenetic relationships among species. The addition of phylogenetic information to the model yields a significant improvement in predictive accuracy. We show that such a model can easily incorporate count or binary data, and different forms of neighbourhood structure. We demonstrate this model using two host-parasite networks constructed from published databases, the Global Mammal Parasite Database and the Enhanced Infectious Diseases database, each with thousands of pairwise interactions. We additionally extend the model by integrating a correction mechanism for missing interactions in the observed data, which proves valuable in reducing uncertainty in unobserved interactions.
5 A Bayesian model for link prediction in ecological networks 108
5.1 Introduction
Ecological interactions impact the structure of populations and communities, drive co-
evolution, and can determine the functioning of ecosystems (Heleno et al., 2014). Analysis
of species interaction networks can be used to better understand the generation and stability
of ecosystems, and to identify communities and species that are vulnerable to environmental
change (Araújo et al., 2011; Ings et al., 2009). However, most ecological networks are only
partially observed and fully characterizing all interactions via systematic sampling involves
substantial eort and investment that is not feasible in most situations (Jordano, 2015).
The interest in inferring undocumented interactions and projecting interactions into the fu-
ture have made predicting species interactions a major challenge in ecology (Kissling and
Schleuning, 2015; Morales-Castilla et al., 2015). One approach to eectively ll-in gaps in
interaction networks would be targeted sampling based on verifying highly probable, yet
previously undocumented links.
In this chapter, we propose a new framework for predicting ecological interactions, and evaluate it using two host-parasite networks. This approach departs from what has been considered in the literature to date in three main ways. First, we describe a hierarchical Bayesian latent variable framework for link prediction based on generative models of bipartite graphs. The latent variable acts as an underlying scoring system, with higher scores attributed to more probable links. This framework is motivated by recent work in recommender systems, such as Ekstrand et al. (2011) and Breese et al. (1998), which offer generalized methods for identifying novel interactions in partially observed bipartite networks. For a thorough review of recommender systems, see Ricci et al. (2011).
Second, we incorporate a flexible Markov network dependence among nodes that we encode using phylogenetic information in the form of a species similarity matrix. Phylogeny is a representation of the evolutionary relationships among species, which provides a means to quantify ecological similarity (Wiens et al., 2010). Just as many species traits co-vary with phylogeny, species interactions are also phylogenetically structured in both antagonistic (e.g., herbivory, parasitism) and mutualistic (e.g., pollination, seed dispersal) networks (Gómez et al., 2010). Encoding the Markov network as a similarity matrix allows for straightforward expansion to different forms of dependence if phylogenetic information is unavailable, or if other dependence structures are preferred.
Third, we integrate a mechanism that accounts for uncertainty in undocumented interactions, which proves valuable in reducing the overlap in posterior probability densities for interacting and non-interacting pairs. A limitation of observational data is the nature of unobserved interactions, as most data sources provide information only for documented interactions (Morales-Castilla et al., 2015). Thus, the absence of a documented interaction cannot be taken as evidence that a species pair would not interact given sufficient opportunity.
We demonstrate this model by predicting undocumented interactions in subsets of two published host-parasite databases. Each database consists of thousands of documented interactions based on evidence presented in peer-reviewed articles or mined from genetic sequence metadata.
5.2 Bayesian hierarchical model for prediction of ecological interactions
5.2.1 Network-based latent score model
Given an interaction matrix for two sets of species, for example H hosts and J parasites, of which we only observe a portion of the possible interactions, our interest is to predict missing interactions and rank them starting with the most likely ones. Let the binary variable zhj denote whether an interaction between host h and parasite j has been observed, such that zhj = 1 if it is established that host h carries parasite j, and zhj = 0 otherwise, for h = 1, . . . , H and j = 1, . . . , J. Moreover, assume a continuous affinity (popularity) parameter for each host and each parasite based on its observed number of interactions in the network. This parameter governs the general propensity of each organism to interact with members of the other class. The larger the value of the affinity parameter, the more likely an organism is to interact: for example, a host would be susceptible to a larger number of parasites, or a parasite would infect a larger number of hosts. Let γh > 0 be the affinity parameter of host h, and ρj > 0 that of parasite j. Using a log-multiplicative form, we define the affinity-only model by setting the conditional probability of interaction to

P(zhj = 1 | Z−(hj)) = 1 − exp(−γh ρj),   (5.1)

where Z−(hj) is the matrix Z excluding zhj.
The affinity-only model can result in a workable network prediction model, as has been shown in the literature on exchangeable random networks in Bickel and Chen (2009); Chung and Lu (2006); Hoff et al. (2002), and others. However, the log-linear form in (5.1) tends to generate an adjacency matrix with many hyperactive columns and rows. This is due to the fact that whenever a node has a sufficiently high affinity parameter, it forms edges with almost all other nodes, which may be unrealistic for most ecological networks. To improve the affinity-only model, we add a Markov network dependency that is based on a normalized host similarity matrix informed by host phylogeny. Let ∆ be an H × H matrix that quantifies the pairwise similarity between hosts, where higher values imply stronger correlations. ∆ is normalized such that 0 < ∆hi < 1 for all h, i ∈ {1, . . . , H}, h ≠ i. Thus, we define the full model to be

P(zhj = 1 | Z−(hj)) = 1 − exp(−γh ρj δηhj),   δηhj = ∑_{i=1, i≠h}^{H} ∆ηhi zij.   (5.2)
The intuition behind this construction is that a host h is more likely to connect to a parasite j if there are many hosts that are similar to h and, at the same time, connected with j. This is done by summing, in δηhj, the scaled similarities between h and those hosts having an edge connection with j, increasing the probability for high values of δηhj and penalizing it for low values.
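The full conditional probability (5.2) translates directly into code. A sketch, assuming numpy arrays for the interaction matrix Z, the similarity matrix ∆, and the affinity vectors (all names are illustrative):

```python
import numpy as np

# Sketch of the full conditional interaction probability (5.2):
# P(z_hj = 1 | rest) = 1 - exp(-gamma_h * rho_j * delta_hj), where
# delta_hj = sum over i != h of Delta[h, i]**eta * Z[i, j].

def interaction_prob(h, j, Z, gamma, rho, Delta, eta):
    d = sum(Delta[h, i] ** eta * Z[i, j]
            for i in range(Z.shape[0]) if i != h)   # dependence term δ^η_hj
    return 1.0 - np.exp(-gamma[h] * rho[j] * d)
```

With η = 0 every similarity weight collapses to 1, so the dependence term reduces to a column sum, recovering the affinity-only behaviour described below.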
The scaling coefficient η adjusts the trade-off between rewarding and penalizing the interaction probabilities: large values of η pressure δηhj to take more of a penalizing role, while small values allow more rewards. Nonetheless, very small values of η suggest a weaker explanatory power of the similarity measure ∆, since δηhj → ∑_{i=1, i≠h}^{H} zij as η → 0, which is simply the column sum driving the parasite affinity parameter ρj. In other words, for very small values of η, the model in (5.2) converges to an affinity-only model under a different parametrization.
We remark that the full model in (5.2) can also be seen as a layering of two models, where the first is the bipartite affinity-only network model of (5.1), and the second is the phylogeny-only model, that is,

P(zhj = 1 | Z−(hj)) = 1 − exp(−δηhj),   (5.3)

where δηhj is as in (5.2).
Later, in Section 5.4.3, we show that both models, the affinity-only and the phylogeny-only, independently result in suitable predictive models that are adequate to represent some variation in the data. However, each model captures different characteristics of the graph, and by layering them, as in (5.2), we obtain a non-trivial improvement. This is primarily because the affinity-only model (5.1) results in a highly dense posterior interaction matrix, and penalization by the phylogeny-only model (5.3) helps in reducing this phenomenon.
Driven by recent work in network modelling, such as Hoff et al. (2002) and Hoff (2005), we find it advantageous to use latent variables in modelling the binary variables zhj. This facilitates the construction of the network joint distribution in this model, and it eases the integration of a Markov network dependence that accounts for similarities among hosts. The latter is crucial to address the ambiguity associated with the case zhj = 0, which entails two possibilities: a yet-to-be-observed positive interaction, or a true absence of interaction due to incompatibility. Thus, for each zhj we define a latent score shj ∈ R such that

zhj = 1 if shj > 0, and zhj = 0 otherwise.   (5.4)
The values of the latent scores, although unobserved, completely determine the binary variables zhj. The conditional model in (5.2) can be completely specified in terms of the latent score as

P(zhj = 1 | Z−(hj)) = E[I{shj > 0} | Z−(hj)] = P(shj > 0 | S−(hj)),   (5.5)

where IA is the indicator function, equal to 1 if A occurs and 0 otherwise, and S−(hj) represents the score matrix S excluding shj, replacing Z as it carries the same probability events. Given the construction above, we use a zero-inflated Gumbel distribution for the latent score, with density

p(shj | S−(hj)) = τhj exp(−shj − τhj e^{−shj}) I{shj > 0} + exp(−τhj) I{shj = 0},   (5.6)

where τhj = γh ρj δηhj. Hence, the conditional joint distribution becomes

P(zhj = 1, shj | Z−(hj)) = P(zhj = 1 | shj) p(shj | S−(hj)) = p(shj | S−(hj)) I{shj > 0}.   (5.7)

The construction used in this section reduces the number of parameters to estimate from H × J to H + J + 1, by taking advantage of the bipartite graph structure.
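The zero-inflated Gumbel density (5.6) can be transcribed directly. The sketch below takes the rate τhj as a precomputed argument; note that the Gumbel part integrates over (0, ∞) to 1 − exp(−τhj), so the atom at zero makes the total mass one:

```python
import math

# Sketch of the zero-inflated Gumbel density (5.6) for the latent score s_hj,
# with rate tau = gamma_h * rho_j * delta_hj. s > 0 corresponds to z = 1,
# and the point mass at s = 0 corresponds to z = 0.

def latent_density(s, tau):
    if s > 0:
        return tau * math.exp(-s - tau * math.exp(-s))  # Gumbel part (z = 1)
    if s == 0:
        return math.exp(-tau)                           # atom at zero (z = 0)
    return 0.0                                          # no mass below zero
```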
5.2.2 Prior and Posterior distribution of choice parameters
The choice of a zero-inflated Gumbel was made to facilitate the construction of the joint distribution, in a manner similar to the Swendsen-Wang algorithm (Swendsen and Wang, 1987), where the product of densities transforms into a sum on the exponential scale. Alternatively, a similar parametrization can be achieved using a truncated exponential distribution, as shown in Appendix A.1.1, though it does not admit the direct interpretability as a latent score that the Gumbel distribution offers.
Let shj be distributed as in (5.6), such that zhj is completely determined by shj. By a conditional construction, the latent score joint distribution is

P(S, Z | γ, ρ, η) = ∏_{j=1}^{J} P(S·j, Z·j | γ, ρ, η)

= ∏_{j=1}^{J} ∏_{h=1}^{H} [ (γh ρj δ̃ηhj exp(−shj − γh ρj δ̃ηhj e^{−shj}))^{zhj} (e^{−γh ρj δ̃ηhj})^{1−zhj} ]

= [∏_{j=1}^{J} ρj^{mj}] [∏_{h=1}^{H} γh^{nh}] [∏_{h,j} (δ̃ηhj)^{zhj}] exp(−∑_{h,j} (shj zhj + ρj γh δ̃ηhj e^{−shj zhj})),   (5.8)

where S·j and Z·j denote the j-th columns of S and Z, mj = ∑_{h=1}^{H} zhj, nh = ∑_{j=1}^{J} zhj, and

δ̃ηhj = ∑_{i=1}^{h−1} ∆ηhi zij = δηhj − ∑_{i=h+1}^{H} ∆ηhi zij,

with the convention that δ̃η1j = 1, as the initial parasite infection is assumed to follow a different design. Moreover, by constructing the joint distribution from conditioning, the order of observation does influence the joint distribution, as seen in the δ̃ηhj component. This dependence is omitted but implicitly assumed; nonetheless, the joint distribution is still valid by conditioning on a fixed order. Using the Hammersley-Clifford theorem (Robert and Casella, 2013), one can prove the existence of a full joint distribution, and that this joint distribution is not affected by the ordering of observations; see Appendix A.1. Moreover, (5.8) allows one to derive the full joint distribution for any order.
As a result of the embedded Markov random field structure in δηhj, it is harder to work with the marginal distribution P(Z | γ, ρ, η), since it has to be specified conditionally. Therefore, we build the joint posterior distribution in terms of the latent score as

P(S, γ, ρ, η | Z) ∝ P(Z | S) P(S | γ, ρ, η) P(γ) P(ρ) P(η).   (5.9)
For the prior specifications, we choose a gamma distribution for both γ and ρ for its conjugacy property. Thus, let γh ∼iid Gamma(αγ, τγ) and ρj ∼iid Gamma(αρ, τρ); the conditional posterior distributions of ρj and γh, respectively, are

ρj | S, γ, η, Z ∼ Gamma(αρ + mj, τρ + ∑_{h=1}^{H} γh δ̃ηhj e^{−shj}),

γh | S, ρ, η, Z ∼ Gamma(αγ + nh, τγ + ∑_{j=1}^{J} ρj δ̃ηhj e^{−shj}).   (5.10)
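The conjugate updates in (5.10) translate to one-line Gamma draws. A sketch, assuming a matrix delta_tilde holding the ordered dependence terms δ̃ηhj (names and array layout are illustrative assumptions):

```python
import numpy as np

# Sketch of the conjugate Gibbs updates (5.10). Z is the H x J interaction
# matrix, S the latent scores, delta_tilde the ordered dependence terms.
# Note numpy's Gamma is parametrized by shape and scale = 1 / rate.

def sample_rho_j(j, Z, S, gamma, delta_tilde, a_rho, t_rho, rng):
    m_j = Z[:, j].sum()
    rate = t_rho + np.sum(gamma * delta_tilde[:, j] * np.exp(-S[:, j]))
    return rng.gamma(shape=a_rho + m_j, scale=1.0 / rate)

def sample_gamma_h(h, Z, S, rho, delta_tilde, a_gam, t_gam, rng):
    n_h = Z[h, :].sum()
    rate = t_gam + np.sum(rho * delta_tilde[h, :] * np.exp(-S[h, :]))
    return rng.gamma(shape=a_gam + n_h, scale=1.0 / rate)
```

The shape parameters grow with the margin totals mj and nh, so well-connected hosts and parasites receive larger posterior affinities, as intended.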
Many ecological and other real-world networks display power-law degree distributions (Albert and Barabasi, 2002). This is also the case with the host-parasite databases used in this chapter, where both margins, hosts and parasites, exhibit power-law degree distributions (see Appendix Figure 5.2). The affinity-only model (5.1) has been shown to generate power-law behaviour when a generalized gamma process is used (Brix, 1999; Caron and Fox, 2014; Lijoi et al., 2007). In fact, when γh = γ for all h, the affinity-only model behaves much like the Stable Indian Buffet process of Teh and Gorur (2009), which has power-law behaviour. Nonetheless, the full model of (5.2) does show a significant improvement in predictive accuracy over the affinity-only model, though it does not yield a degree distribution with a power law.

In the case of the scaling parameter η, for simplicity and computational stability, we assume a flat non-informative prior, uniform[0, 100], although this could readily be modified to any required subjective prior.
Finally, the latent score is updated, given all other parameters, as

shj | Z, ρ, γ, η ∼ χ0 if zhj = 0, and shj | Z, ρ, γ, η ∼ tGumbel(log(γh ρj δηhj), 1, 0) if zhj = 1,   (5.11)

where χ0 is an atomic measure at zero and tGumbel(τ, 1, 0) is the zero-truncated Gumbel distribution, with density

exp(−(s − τ + e^{−(s−τ)})) / (1 − exp(−e^{τ})) · χ(0,∞)(s).
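Draws from the zero-truncated Gumbel in (5.11) follow by inverting its CDF restricted to (0, ∞). A sketch, with location mu = log(γh ρj δηhj) and unit scale:

```python
import math
import random

# Sketch: inverse-CDF draw from the zero-truncated Gumbel of (5.11).
# The Gumbel(mu, 1) CDF is F(s) = exp(-e^{-(s - mu)}), so F(0) = exp(-e^{mu}),
# matching the normalizing constant 1 - exp(-e^{tau}) in the density above.

def rtgumbel(mu, rng=random):
    f0 = math.exp(-math.exp(mu))         # CDF mass below the truncation point 0
    v = f0 + rng.random() * (1.0 - f0)   # uniform draw on (F(0), 1)
    return mu - math.log(-math.log(v))   # inverse Gumbel CDF
```

Every draw lands in (0, ∞), consistent with the latent score being positive exactly when zhj = 1.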
5.2.3 Markov Chain Monte Carlo algorithm
By introducing a Markov network dependence among hosts through the variable δηhj, the marginal posterior predictive distribution of each interaction zhj can only be constructed conditionally on all other interactions, Z−(hj), as shown in (5.2). To preserve the MCMC convergence conditions, one should update all parameters after sampling each latent score shj in a sweeping manner. Thus, to get a single sample of an H × J matrix S, one needs to sample all parameters H × J times. To speed up computations, we apply a block sampler. First, note that each latent score shj depends only on row h and column j via the parameters γh and ρj, and on η via the dependence variable δηhj. Hence, for H ≤ J, one can update all affinity parameters related to the elements of the diagonal block {shh : h = 1, . . . , H} in parallel while retaining convergence conditions. This reduces the sampling of a single S to J MCMC cycles, where the elements of each diagonal block

{sh,(h+i) mod J : h = 1, . . . , H}

are sampled in parallel for i = 0, . . . , J − 1. For example, the parameters of the i-th diagonal block are γ1, . . . , γH and ρ(1+i) mod J, . . . , ρ(H+i) mod J. Both γ and ρ are sampled
5 A Bayesian model for link prediction in ecological networks 116
using direct sampling from the posterior in (5.10). The scale parameter η is sampled using
an Adaptive Metropolis-Hastings algorithm (Haario et al., 2001), where a new proposal η is
sampled from a log-normal distribution as q(η | η) = lognormal(log(η), σ2η) given a at prior,
and the proposal acceptance probability is
\[
\min\left\{ 1,\;
\left[\, \prod_{h,j}^{H,J} \left( \frac{\delta^{\tilde{\eta}}_{hj}}{\delta^{\eta}_{hj}} \right)^{z_{hj}} \right]
\exp\!\left( - \sum_{h,j}^{H,J} \gamma_h \rho_j\, e^{-s_{hj}} \big( \delta^{\tilde{\eta}}_{hj} - \delta^{\eta}_{hj} \big) \right)
\right\}. \tag{5.12}
\]
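The acceptance ratio in (5.12) is most conveniently evaluated on the log scale. Below is a sketch of such an evaluation on toy inputs, assuming precomputed dependence matrices at the current and proposed η; variable names are ours, and the e^{−s_hj} reading (with s_hj = 0 for unobserved pairs) is our interpretation of the likelihood factors.

```python
import numpy as np

def log_accept_ratio(Z, S, gamma, rho, delta_old, delta_new):
    """Log of the Metropolis-Hastings acceptance ratio in (5.12).

    delta_old and delta_new are the H x J dependence matrices evaluated at
    the current and proposed eta; s_hj = 0 for unobserved pairs, so the
    exp(-s_hj) factor covers both likelihood cases.
    """
    rate = np.outer(gamma, rho)                         # gamma_h * rho_j
    term1 = (Z * np.log(delta_new / delta_old)).sum()   # prod (d_new/d_old)^z, log scale
    term2 = -(rate * np.exp(-S) * (delta_new - delta_old)).sum()
    return term1 + term2

rng = np.random.default_rng(1)
H, J = 4, 6
Z = (rng.uniform(size=(H, J)) < 0.4).astype(float)
S = np.where(Z > 0, 0.5 + rng.uniform(size=(H, J)), 0.0)
gamma, rho = rng.gamma(1.0, 1.0, H), rng.gamma(1.0, 1.0, J)
d = rng.uniform(0.5, 1.5, size=(H, J))
accept_prob = min(1.0, np.exp(log_accept_ratio(Z, S, gamma, rho, d, 1.1 * d)))
assert 0.0 <= accept_prob <= 1.0
```

A sanity check on the design: proposing the same η (identical dependence matrices) must give an acceptance ratio of exactly one.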
Iteratively, after updating the model parameters, we use an Adaptive Metropolis-Hastings algorithm to also update the hyperparameters (α_γ, τ_γ, α_ρ, τ_ρ); for more details, refer to Appendix B.
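The diagonal block scheme described above can also be checked mechanically: the J blocks partition the H × J index set, and no block repeats a row or a column, which is what allows parallel updates. A small illustrative sketch (0-based indices; names ours):

```python
def diagonal_blocks(H, J):
    """Enumerate the J diagonal blocks {(h, (h + i) mod J) : h = 1..H}.

    Indices are 0-based; within a block no row or column repeats, which is
    what allows the corresponding scores to be updated in parallel.
    """
    return [[(h, (h + i) % J) for h in range(H)] for i in range(J)]

H, J = 4, 7                                   # toy dimensions with H <= J
blocks = diagonal_blocks(H, J)
cells = [c for b in blocks for c in b]
assert len(set(cells)) == H * J               # the J blocks partition all H*J entries
for b in blocks:
    assert len({h for h, _ in b}) == H        # no repeated row within a block
    assert len({j for _, j in b}) == H        # no repeated column within a block
```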
So far, we have assumed that the information given in Z is definite: that the observed links are presences and the unobserved ones are absences. However, as discussed previously, we believe this will not be the case for many ecological networks. The next section introduces a method for dealing with such cases.
5.3 Uncertainty in unobserved interactions
In ecological networks it is unlikely that all potential links among species will occur. Some unobserved links exist but are undocumented due to limited or biased sampling, while others may be true absences or "forbidden" links (Morales-Castilla et al., 2015). Evidence used to support an interaction will vary depending on the nature of the system, but it is often assumed that an interaction exists if at least one piece of evidence indicates so (Jordano, 2015). This kind of construction raises concern about the uncertainty of interactions in two ways. The first concern is due to uncertainty in documented interactions, as false positive detection errors may occur, potentially as a result of species misidentification, sample contamination, or unanticipated cross-reactions in serological tests. We believe it would be useful for the scientific community to identify weakly supported interactions that may require additional supporting evidence; however, our primary motivation is the identification of "novel" interactions, which is complicated by uncertainty in unobserved interactions.
The second concern arises when unobserved associations are by default assumed to be negative. As discussed earlier, ecological networks are often under-sampled, and some fraction of unobserved interactions may occur but are currently undocumented, or represent potential interactions that are likely to occur given sufficient contact. Based on this assumption, we build a measure of uncertainty in unobserved interactions by modifying the proposed model. In (5.4), we have assumed that z_hj is a deterministic quantity given s_hj | Z_−(hj), and thus we have only sampled positive scores for the case z_hj = 1, as shown in (5.11). As a result, in the prediction stage, the posterior predictive distribution in (5.2) is only considered for the case when a pair has no documented associations (z_hj = 0), and it is deterministic with probability 1 otherwise, underlining the assumption that the dataset is complete and trusted. In reality, this assumption does not hold. Thus, to account for uncertainty in unobserved associations, we attempt to measure the percentage of positive scores for which the observed entry is zero (z_hj = 0), as
\[
p(z_{hj} = 0 \mid s_{hj}, g) =
\begin{cases}
1 & \text{if } s_{hj} = 0,\\
g & \text{if } s_{hj} > 0.
\end{cases}
\tag{5.13}
\]
In a sense, the construction above attempts to measure the proportion of missing links in the observed data, where g is the probability that an interaction is unobserved when the latent score indicates an interaction should exist. If g is close to 1, many of the unobserved interactions are likely to exist. Introducing g to the model affects all parameter estimates and the notion of Z; therefore, in the post-prediction stage, the posterior predictive distribution is now considered for both cases. For the case of a documented association, the probability of an interaction is defined in (5.2), and for the case of no documentation the same probability is weighted by g, as shown in more detail in (5.14).

This kind of construction was used earlier by Weir and Pettitt (2000) when modelling spatial distributions to account for uncertainty in regions with unobserved statistics, and later by Jiang et al. (2011) in modelling uncertainty in protein functions.
5.3.1 Markov Chain Monte Carlo algorithm
Introducing a measure of uncertainty in the model does not alter the MCMC sampling schemes introduced in Section 5.2.3. The variables γ, ρ and η are still only associated with S; nonetheless, by introducing the measure of uncertainty, the conditional sampling of each individual s_hj is now
\[
p(s_{hj} \mid S_{-(hj)}, Z, g) =
\begin{cases}
\dfrac{1}{\psi(s_{hj})}\, \tau_{hj} \exp\!\big( -(s_{hj} + \tau_{hj} e^{-s_{hj}}) \big) & s_{hj} > 0,\; z_{hj} = 1,\\[6pt]
0 & s_{hj} = 0,\; z_{hj} = 1,\\[6pt]
\dfrac{g}{\theta(g, s_{hj})}\, \tau_{hj} \exp\!\big( -(s_{hj} + \tau_{hj} e^{-s_{hj}}) \big) & s_{hj} > 0,\; z_{hj} = 0,\\[6pt]
\dfrac{1}{\theta(g, s_{hj})} \big( 1 - \psi(s_{hj}) \big) & s_{hj} = 0,\; z_{hj} = 0,
\end{cases}
\tag{5.14}
\]
where τ_hj = γ_h ρ_j δ^η_hj, ψ(s_hj) = ∫₀^∞ p(s | S_−(hj)) ds = 1 − exp(−γ_h ρ_j δ^η_hj), and θ(g, s_hj) = g ψ(s_hj) + 1 − ψ(s_hj).
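For an unobserved pair, (5.14) is a two-part mixture: an atom at zero with probability (1 − ψ)/θ and a zero-truncated Gumbel with probability gψ/θ. A sketch of one such draw follows; the inverse-CDF step and all names are ours.

```python
import numpy as np

def sample_score_unobserved(tau, g, rng):
    """Draw s_hj for an unobserved pair (z_hj = 0) following (5.14).

    With probability g*psi/theta the score is a zero-truncated Gumbel draw
    (via inverse CDF); otherwise it is the atom at zero.
    """
    psi = 1.0 - np.exp(-tau)                  # P(s_hj > 0) before weighting by g
    theta = g * psi + 1.0 - psi
    if rng.uniform() < g * psi / theta:
        loc = np.log(tau)                     # Gumbel location log(gamma*rho*delta^eta)
        u = rng.uniform(np.exp(-np.exp(loc)), 1.0)
        return loc - np.log(-np.log(u))
    return 0.0

rng = np.random.default_rng(2)
draws = [sample_score_unobserved(tau=0.8, g=0.35, rng=rng) for _ in range(4000)]
assert all(d >= 0 for d in draws)
assert 0 < np.mean([d > 0 for d in draws]) < 1   # mixture of zeros and positive scores
```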
Moreover, sampling the uncertainty variable is performed using the conditional distribution
\[
P(g \mid S, Z) \propto P(Z \mid S, g)\, P(S \mid g)\, P(g) \propto g^{N_{-+}} (1 - g)^{N_{++}},
\tag{5.15}
\]
where N_−+ = #{(h, j) : z_hj = 0, s_hj > 0}, N_++ = #{(h, j) : z_hj = 1, s_hj > 0}, and P(g) is a uniform prior.
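With a uniform prior, (5.15) is the kernel of a Beta(N_−+ + 1, N_++ + 1) distribution, so g can be sampled directly. A minimal sketch (names ours):

```python
import numpy as np

def sample_g(S, Z, rng):
    """Draw g from (5.15): with a uniform prior the conditional is
    Beta(N_minus_plus + 1, N_plus_plus + 1)."""
    n_mp = int(np.sum((Z == 0) & (S > 0)))   # unobserved pairs with positive scores
    n_pp = int(np.sum((Z == 1) & (S > 0)))   # observed pairs with positive scores
    return rng.beta(n_mp + 1, n_pp + 1)

rng = np.random.default_rng(3)
Z = np.array([[1, 0, 0], [0, 1, 0]])
S = np.array([[0.7, 0.2, 0.0], [0.0, 1.1, 0.9]])
g = sample_g(S, Z, rng)
assert 0.0 < g < 1.0
# Here N_minus_plus = 2 and N_plus_plus = 2, so E[g | S, Z] = 1/2.
```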
5.4 A case study with host-parasite networks
5.4.1 Data
We implement this model on two databases: the Global Mammal Parasite Database (GMPD), available at mammalparasites.org and documented by Nunn and Altizer (2005), and the Enhanced Infectious Diseases (EID2) database, available at zoonosis.ac.uk/EID2 and documented in McIntyre et al. (2013) and Wardeh et al. (2015). Both databases are periodically updated and contain associations between hosts and their parasites based on thousands of published reports and scientific studies. The assumed interactions are based on peer-reviewed articles that present empirical observations of associations between host-parasite pairs using a variety of evidence types (visual identification, serological tests, or detection of genetic material from a parasite species in one or more host individuals). Associations are reported along with their publication or genetic sequence reference. More than one reference might be reported per association, and by aggregation we can determine the count of unique references per interaction.
The GMPD gathers data on wild mammals and their parasites (including both micro- and macroparasites), which is separated into three primary databases based on host taxonomy: Primates, Carnivora, and ungulates (terrestrial hooved mammals from Artiodactyla and Perissodactyla). For analyses we used the ungulate and Carnivora subsets updated by Huang et al. (2015) to include articles published up to 2010. Counts of unique evidence supporting each association were constructed according to the number of citations for each host-parasite pair.
The EID2 database contains a broader scope of organism interactions and includes additional host groups not represented in the GMPD, including domesticated animals. However, the host groups in the GMPD are not as well represented in the EID2 database. According to Wardeh et al. (2015), ≈ 64% of unique interactions listed in the GMPD are found in EID2, and ≈ 30% of those in EID2 are found in the GMPD. For analyses, we used a static version of the EID2 published by Wardeh et al. (2015). We subset the database to include only mammal hosts and removed interactions involving Homo sapiens. Counts of unique evidence supporting each association were constructed by summing the number of publications and unique genetic sequences reported for each host-parasite pair.
The GMPD and EID2 databases as described above were used to construct the binary presence-only matrix Z, where z_hj = 1 for pairs with documented associations and z_hj = 0 otherwise. We let the pairwise similarity matrix (∆) be the mammal phylogeny of Fritz et al. (2009), taken as the inverse of the phylogenetic dissimilarity matrix calculated by the function cophenetic in the R package ape (Paradis et al., 2004). Incorporating the phylogeny required host names to be standardized to the taxonomy of Wilson and Reeder (2005), which involved collapsing subspecies. In addition, we removed parasites reported only to the genus level. This resulted in a GMPD subset with 3966 pairs of interactions among 246 hosts and 743 parasites, and an EID2 subset with 3730 pairs of interactions among 694 hosts and 783 parasites. We find both subsets sufficiently large to yield proper numerical results.
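The data preparation described above can be sketched as follows. The elementwise inverse of the cophenetic distance matrix is our reading of "the inverse of the phylogenetic dissimilarity matrix", and all names and toy values are illustrative.

```python
import numpy as np

def build_inputs(pairs, hosts, parasites, phylo_dist):
    """Assemble the binary interaction matrix Z and host similarity Delta.

    `pairs` lists (host, parasite) names with documented associations, and
    `phylo_dist` is the H x H cophenetic distance matrix. The elementwise
    inverse of the distances is one reading of "inverse of the phylogenetic
    dissimilarity matrix"; the diagonal is zeroed out.
    """
    h_idx = {h: i for i, h in enumerate(hosts)}
    p_idx = {p: j for j, p in enumerate(parasites)}
    Z = np.zeros((len(hosts), len(parasites)))
    for h, p in pairs:
        Z[h_idx[h], p_idx[p]] = 1.0
    with np.errstate(divide="ignore"):
        Delta = 1.0 / phylo_dist      # elementwise inverse of distances
    np.fill_diagonal(Delta, 0.0)      # self-similarity is excluded
    return Z, Delta

hosts, parasites = ["h1", "h2"], ["p1", "p2", "p3"]
dist = np.array([[0.0, 4.0],
                 [4.0, 0.0]])
Z, Delta = build_inputs([("h1", "p1"), ("h2", "p3")], hosts, parasites, dist)
assert Z.sum() == 2 and Z[0, 0] == 1 and Z[1, 2] == 1
assert Delta[0, 1] == 0.25 and Delta[0, 0] == 0.0
```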
Figure 5.1: Left-ordered interaction matrix Z of the GMPD (left) and EID2 (right) databases.
Figure 5.1 shows the left-ordered interaction matrix Z of the GMPD on the left and the EID2 on the right. Both matrices are more or less of equal size. The EID2 has a few hosts that interact with a large number of parasites, as seen in the horizontal stripes, while the GMPD shows a more even distribution across rows. Nonetheless, both matrices are quite sparse, and the degree distributions of both hosts and parasites exhibit a power-law structure, as shown in Figure 5.2. The degree distribution of parasites (blue stars) for the GMPD interaction network shows a steeper slope compared to the host degree distribution (red crosses). On the other hand, the degree distributions of hosts and parasites seem to have comparable slopes in the EID2 interaction network.
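The degree distributions plotted in Figure 5.2 are simply the counts of nodes at each margin degree of Z. A minimal sketch (names ours):

```python
import numpy as np

def degree_counts(Z, axis):
    """Degrees along one margin of Z and the number of nodes at each degree.

    axis=1 gives host degrees (row sums), axis=0 gives parasite degrees; the
    (degree, count) pairs are what Figure 5.2 plots on a log-log scale.
    """
    deg = Z.sum(axis=axis).astype(int)
    values, counts = np.unique(deg[deg > 0], return_counts=True)
    return values, counts

Z = np.array([[1, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
deg, n = degree_counts(Z, axis=1)        # host degrees are 3, 1, 1
assert list(deg) == [1, 3] and list(n) == [2, 1]
```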
Figure 5.2: Degree distribution of hosts (red crosses) and parasites (blue stars) on log scale, for the GMPD (left) and EID2 (right) databases.
5.4.2 Parameter estimation
Using the GMPD and EID2 databases, we first fit the model proposed in Section 5.2.1. We run 12000 MCMC iterations for posterior estimates, with 4000 burn-in. In total we have J + H + 1 parameters to estimate: an affinity parameter for each host and each parasite, and a scaling parameter for the similarity matrix. As well, for each database, we iteratively sample the set of affinity hyperparameters.
Standard convergence diagnostics showed that all parameters had converged. It is worth noting that, for the GMPD, the posterior distributions of the host parameters (γ) show large variation, which reflects that some hosts are more likely to interact with parasites, or have been more intensively studied. In the EID2 database, the variation among the hosts is more prominent, which confirms our earlier observation that row densities of the EID2 interaction matrix are less balanced (see Figure 5.1 and Appendix Figure C.1). In both databases, the magnitude of the scaling parameter η is significantly greater than zero, which indicates the importance of phylogeny in the dependence structure. For the GMPD, η is found to concentrate around 1.57; for the EID2, around 1.15. For additional convergence and diagnostic plots, please refer to Appendix C.
5.4.3 Prediction comparison by cross-validation
To validate the predictive performance of the proposed latent score full model, we compare it to three other variations and to a regular nearest-neighbour (NN) algorithm. So far, the latent score full model in Section 5.2.1 is implemented using the presence-only matrix Z. We vary this model in three ways. First, we implement it without a dependence term, that is, the affinity-only model as in (5.1). Second, we implement it with only the dependence term, that is, the phylogeny-only model in (5.3). Third, since the binary matrix Z ignores the count of available evidence, which may be useful for increasing predictive performance in some cases, we implement a weighted-by-counts version where the number of documented references for each interaction is taken as edge weights in constructing the similarity input variable δ in (5.2), such that
\[
\delta^{\eta}_{hj} = \sum_{i=1}^{H} \Delta^{\eta}_{hi} \log(1 + y_{ij}),
\]
where y_ij is the documented association count for the (i, j)-th host-parasite pair, with y_ij = 0 if there are no documented associations.
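The weighted-by-counts dependence variable above reduces to a single matrix product once ∆ is raised elementwise to the power η. A sketch (names ours):

```python
import numpy as np

def weighted_delta(Delta, Y, eta):
    """Dependence variable for the weighted-by-counts variation:
    delta[h, j] = sum_i Delta[h, i]**eta * log(1 + Y[i, j])."""
    return (Delta ** eta) @ np.log1p(Y)

Delta = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
Y = np.array([[3, 0],
              [0, 1]])
d = weighted_delta(Delta, Y, eta=2.0)
# Entry (0, 1): 0.0**2 * log(1 + 0) + 0.5**2 * log(1 + 1) = 0.25 * log(2)
assert np.isclose(d[0, 1], 0.25 * np.log(2.0))
assert d[0, 0] == 0.0    # host 0 has no weighted neighbours for parasite 0
```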
Finally, we compare the latent score full model and the three variations to a regular NN algorithm, in which we set the distances between hosts proportional to the number of parasite species they share, namely ∆ = ZZ⊺ while enforcing a zero diagonal. This particular similarity matrix does not require additional data beyond the observed interaction matrix Z. The similarity matrix determines the host-neighbour structure; thus, conditional on all the rest, we let the probability of a host-parasite interaction equal the average number of host-neighbours with a documented association to the parasite, within the k closest host-neighbours. Consequently, we evaluate the model with different values of k and use the value that results in the highest predictive accuracy.
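The NN baseline can be sketched in a few lines; here k is fixed, whereas the text tunes it by predictive accuracy, and all names are ours.

```python
import numpy as np

def nn_scores(Z, k):
    """Nearest-neighbour interaction scores as described in the text.

    Host similarity is the number of shared parasites (Z @ Z.T with a zero
    diagonal); the score for (h, j) is the fraction of h's k most similar
    hosts that have a documented association with parasite j.
    """
    sim = Z @ Z.T
    np.fill_diagonal(sim, 0)
    scores = np.zeros_like(Z, dtype=float)
    for h in range(Z.shape[0]):
        nbrs = np.argsort(sim[h])[::-1][:k]   # k most similar hosts
        scores[h] = Z[nbrs].mean(axis=0)
    return scores

Z = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]], dtype=float)
P = nn_scores(Z, k=1)
assert P[0, 2] == 1.0    # host 0's nearest neighbour (host 1) hosts parasite 2
```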
Because the formulation of the host dependence structure requires parasites to have at least one interaction, the hold-out portion for cross-validation was restricted to parasites with one or more associations in the data, and AUC values were calculated using only this portion. The predictive performance of each model is evaluated using the average of 5-fold cross-validations, where each fold sets approximately a random 17% of the observed interactions (z_hj = 1) in Z to unknowns (z_hj = 0) while attempting to predict them using the remaining portion. For the weighted-by-counts version we also set the corresponding counts to zero (y_hj = 0). For each of the folds, we run a standard MCMC simulation to infer the parameters of interest and to calculate the mean posterior probability of an interaction for each of the unknowns. By uniformly thresholding those probabilities from 0 to 1, where probabilities above the threshold are assumed to represent an interaction, we calculate the true positive and negative rates, and the false positive and negative rates. By this process, we finally obtain the receiver-operating characteristic (ROC) curves, and the posterior interaction matrix resulting from the threshold that maximizes the area under the ROC curve (AUC). Figure 5.3 illustrates the resulting ROC curves for the GMPD and EID2 databases, under the tested models.
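The thresholding procedure for the ROC curve and AUC can be sketched as follows; this is a simplified stand-in for the cross-validated computation, with names and toy data ours.

```python
import numpy as np

def roc_auc(probs, labels, n_thresholds=101):
    """ROC curve and AUC via a uniform grid of thresholds over [0, 1],
    mirroring the thresholding procedure described in the text."""
    fpr, tpr = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred = probs > t
        tpr.append(np.sum(pred & (labels == 1)) / max(labels.sum(), 1))
        fpr.append(np.sum(pred & (labels == 0)) / max((labels == 0).sum(), 1))
    fpr, tpr = np.array(fpr)[::-1], np.array(tpr)[::-1]   # ascending in fpr
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)

probs = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, 0, 0])
auc = roc_auc(probs, labels)
assert auc > 0.9        # perfectly separated toy scores give an AUC close to 1
```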
Even though the auxiliary information in the similarity matrix ∆ is used for both the full database and the cross-validated portion, it does not hold any prior knowledge of interactions, as it only informs the similarity between hosts in terms of phylogeny.
Figure 5.3: ROC comparison of the latent score (LS) network model with three variations and the regular NN algorithm. The proposed LS full model in black, the affinity-only variation in cyan, the phylogeny-only variation in grey, and the weighted-by-counts version in green. The regular NN algorithm in brown. All ROC curves are based on an average of 10-fold cross-validations.
As is evident from the ROC curves and Table 5.1, the latent score full model outperforms the NN algorithm and all three variations, for both databases. For the GMPD, the NN algorithm performs almost equally to the phylogeny-only model, which is not the case for the EID2 database. This might be attributed to the fact that the GMPD database is focused on specific host clades and is, in general, better sampled than the EID2 database. Nevertheless, neither the affinity-only nor the phylogeny-only model performed on par with the full model, which confirms the notion that each of the simpler models captures different characteristics of the data, and layering them yields better results.
For a visual interpretation, Figure 5.4 illustrates the posterior association matrices of the affinity-only (5.4a, 5.4d), phylogeny-only (5.4b, 5.4e) and the full model (5.4c, 5.4f) for the GMPD and EID2 databases, respectively. From the figures, the affinity-only model did not account for any neighbouring structure and results in hyperactive hosts, while the phylogeny-only model, based on host-neighbourhoods, results in greater differences among parasites. The full model then combines characteristics of both simpler models. Moreover, for an analytical comparison, we followed the recommendation of Demšar (2006) to use the two-sided Wilcoxon signed rank test on the 5-fold cross-validations; nonetheless, different Bayesian comparison procedures are possible. We obtain p-values of 0.043 and < 0.005 when comparing the full model with the NN algorithm for the GMPD and EID2 databases, respectively, indicating that, at the 5% significance level, the full model outperforms the NN algorithm in both databases. When comparing the full model to all three variations, the p-value is < 0.005 across the board in favour of the full model, as seen in Table 5.2, except when comparing to the weighted-by-counts model in the GMPD database, where the AUC results are comparable.
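Such a per-fold comparison can be carried out with SciPy's implementation of the two-sided Wilcoxon signed rank test. The AUC values below are illustrative placeholders, not the thesis results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold AUC values for two models (illustrative only).
auc_full = np.array([0.920, 0.931, 0.912, 0.944, 0.925])
auc_nn = np.array([0.861, 0.873, 0.855, 0.888, 0.870])
stat, p = wilcoxon(auc_full, auc_nn, alternative="two-sided")
# With n = 5 distinct, same-sign differences the exact two-sided p is 2/32.
assert np.isclose(p, 0.0625)
```

Note that with only five folds the smallest attainable exact two-sided p-value is 0.0625, which is one reason a larger number of folds tightens the comparison.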
Table 5.1: Area under the curve and prediction values for tested models

                                      GMPD               EID2
  Model                           AUC    Prediction   AUC    Prediction
  LS-network: full model          92.11  0.84         94.29  0.87
  LS-network: affinity-only       85.51  0.78         88.41  0.78
  LS-network: phylogeny-only      87.60  0.80         84.93  0.74
  LS-network: weighted-by-counts  91.56  0.83         87.13  0.74
  Nearest-neighbour               86.03  0.84         86.47  0.79
Table 5.2: Two-sided Wilcoxon signed rank test to compare model AUCs (p-values; columns follow the same model order as the rows, for the GMPD and EID2 respectively)

  GMPD:
  full model           1.000  0.000  0.000  0.000  0.000
  phylogeny-only       0.043  1.000  0.000  0.000  0.000
  Nearest-neighbour    0.043  0.043  1.000  0.000  0.000
  affinity-only        0.043  0.043  0.225  1.000  0.000
  weighted-by-counts   0.225  0.043  0.043  0.043  1.000

  EID2:
  full model           1.000  0.000  0.000  0.000  0.000
  phylogeny-only       0.043  1.000  0.000  0.000  0.000
  Nearest-neighbour    0.043  0.043  1.000  0.000  0.000
  affinity-only        0.043  0.043  0.043  1.000  0.000
  weighted-by-counts   0.043  0.043  0.225  0.043  1.000
Figure 5.4: Posterior association matrix comparison for the GMPD (top panel) and EID2 (bottom panel), between the affinity-only (left), phylogeny-only (middle) and full model (right).
5.4.4 Uncertainty in unobserved interactions
We improve on the latent score model by accounting for uncertainty in unobserved interactions, as shown in Section 5.3. This addition increases the posterior predictive accuracy by estimating the proportion of missing interactions in the observed data, and reducing scores for unobserved interactions. Using the model in Section 5.3, we infer the false negative rate g for both databases, using 10000 MCMC iterations with 2000 burn-in. The posterior mean of g is found to be 0.34 for the GMPD and 0.38 for the EID2 database; for posterior histograms refer to Appendix Figure C.4. The EID2 false negative rate is larger than that of the GMPD, which reflects the differences in search strategies and sources used in the creation of each database. Documented associations in the GMPD are identified through systematic searches of common online reference databases to find peer-reviewed articles that support an interaction. EID2, on the other hand, identifies associations that are supported by information in genetic sequence databases and citations found in the biomedical search engine PubMed.
Incorporating the proportion of missing interactions is designed to improve posterior predictive accuracy. To show this, we divided the databases into two sets, a training and a validation set. Since associations in the GMPD are sourced only from peer-reviewed articles, we were able to use information on article publication dates to create the two sets. This mimics the discovery of interactions in the system, rather than a random hold-out of observations. Taking the earliest annotated year for each association, we set the training set as all associations documented prior to and including 2004, and the validation set as all associations up to 2010. Prior to and including 2004, there are 3462 pairs of documented associations. By 2010, the associations had increased to 3966, approximately a 15% increase. The static EID2 database does not have any temporal information readily accessible; therefore, we created the training set by randomly removing 10% of the observed associations, where the validation set holds all associations. This amounts to 3357 unique association pairs in the training set and 3730 in the validation set.
For the training sets, we used an average of 5-fold cross-validations to estimate the parameters of the model, where each fold ran for 10000 iterations with 2000 burn-in. Due to the overlap between the two databases, we validated the model on distinct subsets of hosts for each database. For the GMPD we used the Carnivora clade, and for the EID2 we used the Rodentia clade.
Figure 5.5 illustrates the improvement in predictive accuracy between the models with and without g. For the GMPD-Carnivora, the AUC is 0.935 and 0.843 for the models with and without g, respectively. For the EID2-Rodentia, the AUC is 0.899 and 0.832 for the models with and without g, respectively. In both cases, the model with g is a significant improvement.

Figure 5.5: Comparison of ROC curves for the model with g (black) and without g (grey), for the GMPD-Carnivora on the left and the EID2-Rodentia on the right.
Essentially, incorporating the proportion of missing interactions g reduces the overlap in posterior probability densities between interacting and non-interacting pairs. To show this, one can simply view the posterior histograms of log-probabilities for both categories, the observed and unobserved interactions, under the models with and without g. For the model with g, the overlap between the two histograms is lower than for the model without g, making the partition between the two categories clearer. For an example, refer to Appendix Figure C.5.
Table 5.3 summarizes the AUC results when applying the two model variations to both databases and the discussed subsets. In all cases, incorporating g results in more accurate posterior predictions. For the EID2 database, despite the higher value of g, the AUC difference is small. This might be attributed to the way the training set is designed, since the EID2 training set was created by random elimination of observed interactions, compared to temporal sub-setting for the GMPD. In particular, research documenting host-parasite associations may be driven by previous research findings, and this bias towards particular hosts or parasites may be captured in the temporal structure of the database, but not by random elimination. By applying the two-sided Wilcoxon signed rank test, we found the p-value to be < 0.005 in favour of the model with g for both databases and subsets; thus, significant prediction gains are attained when incorporating g.
Table 5.3: AUC comparison between models with g and without g on the GMPD and EID2 databases and clade subsets

  Model       GMPD-Carnivora   GMPD    EID2-Rodentia   EID2
  with g      0.935            0.924   0.899           0.938
  without g   0.843            0.891   0.832           0.916
For AUC results that include sub-models, refer to Appendix Tables C.2 and C.4.

Modelling uncertainty by incorporating the proportion of missing interactions certainly improves the posterior prediction, as seen in Table 5.3, where in all sets the AUC is higher for models including g. In terms of the proportion of observed interactions recovered, the results differ. Table 5.4 shows the percentage of observed interactions correctly predicted in the held-out portion of the validation set (in parentheses) and in the full data, for each database type and model. That is, for the GMPD using the model with g, the percentage of predicted interactions for documented associations from 2005 to 2010 is 0.683, and 0.832 for all documented associations up to 2010. Using the simpler model without g, the equivalent values are 0.788 and 0.811. For the GMPD-Carnivora, both percentages only account for the Carnivora host subset. For the EID2 database, when modelling with g, the percentage of predicted interactions for the 10% held out is 0.92, and 0.919 for the full database. When modelling without g, the equivalent values are 0.92 and 0.85. It is clear that the model with g performs better on the full database, but lags when it comes to predicting the held-out portions of the validation set. The model without g simply overestimates the number of associations, which yields a higher recovery of observed interactions in the held-out portion, but also predicts a greater number of unobserved interactions as present, which reduces the AUC.
Table 5.4: Percentage of observed interactions correctly predicted in the held-out portion of the validation set (in parentheses) and in the full data, for the GMPD and EID2 databases

  Model       GMPD-Carnivora   GMPD            EID2-Rodentia   EID2
  with g      (0.373) 0.827    (0.683) 0.832   (0.809) 0.825   (0.92) 0.919
  without g   (0.573) 0.784    (0.788) 0.811   (0.681) 0.665   (0.92) 0.85
For more prediction results that include sub-models, refer to Appendix Tables C.3 and C.5; for more diagnostic plots and results, please refer to Appendix C.
5.5 Discussion
In this chapter we introduce a latent score model for link prediction in ecological networks and illustrate it using two host-parasite networks. The proposed model is a combination of two separate models: an affinity-based exchangeable random network model (5.1), overlaid with a Markov network dependence informed by phylogeny (5.3). The affinity-only model is characterized by independent affinity parameters for each species, while the phylogeny-only model is characterized by a scaled species similarity matrix. Both parts perform reasonably well alone compared to the combined model, as shown in Figure 5.5. However, modelling with only the affinity parameters results in a highly dense posterior interaction matrix, in which a slightly elevated affinity parameter results in predicted interactions with all other species. This situation is unlikely from a biological standpoint, as species that are known to associate with only particular evolutionary groups are predicted to associate with all others, regardless of species identity. On the other hand, modelling using only the phylogenetic dependence structure allows no independent influence of the number of documented interactions per species. By overlaying the affinity-only model with a phylogeny-only dependence structure, the posterior prediction is significantly improved and the sparseness of the original interaction matrix is preserved.
While we incorporated phylogeny as the dependence structure, the model can easily accommodate different similarity matrices or types of dependence in an additive manner. For host-parasite networks, host traits or geographic overlap, or parasite similarity based on phylogeny, taxonomy, or traits, may improve prediction (Davies and Pedersen, 2008; Luis et al., 2015; Pedersen et al., 2005). Introducing different similarity measures affects the model characteristics in two ways: it changes the topology of the probability domain, and it increases the number of parameters to estimate due to the introduced scaling parameters. The latter is easily integrated, since the number of estimated parameters increases by one for each new scaling parameter.
A particular dependence structure that does not require additional data is similarity based on the number of shared interactions, as used in the NN algorithm (see Section 5.4.3). However, this method under-performed compared to the phylogeny-based similarity. The magnitude of the scaling parameter for both databases indicates the utility of the phylogenetic information. In host-parasite networks, parasite community similarity is often well predicted by evolutionary distance among hosts (Davies and Pedersen, 2008; Gilbert and Webb, 2007). In this case, the NN similarity is likely capturing some of the phylogenetic structure in the network and could be a reasonable approach if a reliable phylogeny is unavailable. However, as phylogeny is estimated independently from the interaction data, it will likely be more robust to incomplete sampling of the original network than NN-type dependence structures.
Many ecological networks are often based on presence-only data (Morales-Castilla et al.,
2015), where an unobserved interaction may be either present or absent. Thus, to account for
uncertainty in unobserved interactions we incorporate the proportion of missing interactions
in the observed data, which strengthens the posterior predictive accuracy of the model. We
additionally present a variation that includes a weighted-by-counts component, although, as
shown in Section 5.4.3, we nd the original model outperforms it. One might assume that
the count of peer-reviewed articles or unique genetic sequences reects the strength of the
underlying support. However, certain species, such as domesticated animals, or organisms
that are threats to public health, may receive signicantly more research interest (Wiethoelter
et al., 2015). This elevated study effort may reveal additional interactions and increase
the number of studies reporting previously known associations. In the weighted-by-counts
model, these inflated counts decrease overall predictive accuracy by estimating many weakly
supported interactions as absent. For example, an interaction between two rarely studied
species may be supported by a single valid piece of evidence, the strength of which is not
reflected by the count of unique pieces of evidence.
While the intent of this research is to identify undocumented interactions, this model can
also account for uncertainty in missing interactions. In this case, the model may be used to
identify weakly supported interactions that are false positives or sampling artifacts in the
literature and that may benefit from additional investigation. We hope that this work inspires
new research on the modelling of host-parasite networks, and in particular, methods that
allow for the uncertainty in unobserved interactions. We believe frameworks such as ours
will be valuable tools for better understanding the structures of species interaction networks,
and could form an integral component of proactive surveillance systems for emerging diseases
(Farrell et al., 2013).
Appendix A
Latent formulation and sampling
For an H × J matrix Z of interactions with no empty columns or rows, of h = 1, . . . , H
hosts and j = 1, . . . , J parasites, let γ_h > 0 be the affinity parameter of host h, and
ρ_j > 0 that of parasite j. Let ∆ be an H × H matrix that quantifies pairwise similarities
between hosts, where higher values imply stronger correlations and 0 < ∆_{hk} < 1 for all
h, k ∈ {1, . . . , H}, h ≠ k.
Suppose that the probability of an edge z_{hj} conditional on all other edges Z_{-(hj)} is defined
as
$$P(z_{hj} = 1 \mid Z_{-(hj)}) = 1 - \exp(-\tau_{hj}), \tag{A.1}$$
where \tau_{hj} = \gamma_h \rho_j \delta^{\eta}_{hj} and \eta is a scaling coefficient of the similarity matrix.
To facilitate modelling, suppose that z_{hj} is completely determined by a latent score s_{hj},
such that
$$z_{hj} = \begin{cases} 1 & s_{hj} > 0 \\ 0 & s_{hj} = 0, \end{cases}$$
with
$$P(z_{hj} = 1 \mid Z_{-(hj)}) = E[\mathbb{I}_{s_{hj}>0} \mid Z_{-(hj)}] = P(s_{hj} > 0 \mid S_{-(hj)}) = 1 - \exp(-\tau_{hj}).$$
Such a characterization prompts a conditional joint distribution of the form
$$\begin{aligned}
P(z_{hj} = 1, s_{hj} \mid Z_{-(hj)}) &= P(z_{hj} = 1 \mid s_{hj})\, p(s_{hj} \mid S_{-(hj)}) = p(s_{hj} \mid S_{-(hj)})\, \mathbb{I}_{s_{hj}>0} \\
P(z_{hj} = 0, s_{hj} \mid Z_{-(hj)}) &= P(z_{hj} = 0 \mid s_{hj})\, p(s_{hj} \mid S_{-(hj)}) = p(s_{hj} \mid S_{-(hj)})\, \mathbb{I}_{s_{hj}=0}.
\end{aligned} \tag{A.2}$$
Moreover, it can be verified that
$$p(s_{hj} \mid z_{hj}, Z_{-(hj)}) = \begin{cases} \dfrac{1}{1 - \exp(-\tau_{hj})}\, p(s_{hj} \mid S_{-(hj)})\, \mathbb{I}_{s_{hj}>0} & z_{hj} = 1 \\[6pt] \dfrac{1}{\exp(-\tau_{hj})}\, p(s_{hj} \mid S_{-(hj)})\, \mathbb{I}_{s_{hj}=0} & z_{hj} = 0. \end{cases}$$
It remains to define the distribution of s_{hj} \mid Z_{-(hj)} to satisfy the property that
$$P(z_{hj} = 1 \mid Z_{-(hj)}) = 1 - \exp(-\tau_{hj}) = \int_{\mathbb{R}} p(s \mid S_{-(hj)})\, \mathbb{I}_{s>0}\, ds.$$
One possible choice is the partitioned Gumbel density
$$p(s_{hj} \mid S_{-(hj)}) = \tau_{hj} \exp(-s_{hj} - \tau_{hj} e^{-s_{hj}})\, \mathbb{I}_{s_{hj}>0} + \exp(-\tau_{hj})\, \mathbb{I}_{s_{hj}=0}.$$
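As a quick numerical check (an illustrative Python sketch, not thesis code), the partitioned Gumbel density is exactly a Gumbel(log τ_hj, 1) variable censored at zero: the continuous part matches the Gumbel density and the censoring mass at zero is exp(−τ_hj), so P(z_hj = 1) = 1 − exp(−τ_hj) can be verified by simulation:

```python
import math
import random

def sample_latent_score(tau, rng=random):
    # Gumbel(location = log(tau), scale = 1) draw via the inverse CDF.
    g = math.log(tau) - math.log(-math.log(rng.random()))
    # Censoring at zero reproduces the partitioned density: an atom of
    # mass exp(-tau) at s = 0, and the truncated Gumbel part on (0, inf).
    return max(0.0, g)

random.seed(1)
tau = 0.7
draws = [sample_latent_score(tau) for _ in range(200_000)]
frac_positive = sum(d > 0 for d in draws) / len(draws)
print(abs(frac_positive - (1 - math.exp(-tau))) < 0.01)
```

The censoring view also gives a cheap sampler for s_hj used later in the MCMC steps.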
The latent score is only used as a modelling tool to make the joint distribution more
tractable, as
$$p(z_{hj}, s_{hj} \mid Z_{-(hj)}) = \left[\tau_{hj} \exp\!\left(-(s_{hj} + \tau_{hj} e^{-s_{hj}})\right) \mathbb{I}_{s_{hj}>0}\right]^{z_{hj}} \left[\exp(-\tau_{hj})\, \mathbb{I}_{s_{hj}=0}\right]^{1-z_{hj}} = \tau_{hj}^{z_{hj}} \exp\!\left(-s_{hj} - \tau_{hj} e^{-s_{hj}}\right). \tag{A.3}$$
By construction, the neighbourhood structure represented by \underline{\delta}^{\eta}_{hj} depends only on the
host phylogeny; hence, the joint distribution of each column of Z is independent of all others.
Since Z has no empty columns, assuming that z_{hj} represents the first observed interaction for
the j-th column Z_{\cdot j}, by conditioning, the column joint distribution is
$$\begin{aligned}
P(Z_{\cdot j}, S_{\cdot j}) &= \prod_{h=1}^{H} \left(\rho_j \gamma_h \underline{\delta}^{\eta}_{hj}\right)^{z_{hj}} \exp\!\left(-s_{hj} z_{hj} - \rho_j \gamma_h \underline{\delta}^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right) \\
&= \rho_j^{m_j} \left[\prod_{h=1}^{H} \gamma_h^{z_{hj}}\right] \left[\prod_{h=1}^{H} \left(\underline{\delta}^{\eta}_{hj}\right)^{z_{hj}}\right] \exp\!\left(-\sum_{h=1}^{H} s_{hj} z_{hj} + \rho_j \gamma_h \underline{\delta}^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right),
\end{aligned} \tag{A.4}$$
where \underline{\delta}^{\eta}_{hj} = \sum_{k=1}^{h-1} \Delta_{hk} z_{kj} = \delta^{\eta}_{hj} - \sum_{k=h+1}^{H} \Delta_{hk} z_{kj} = \delta^{\eta}_{hj} - \bar{\delta}^{\eta}_{hj}, and m_j = \sum_{h=1}^{H} z_{hj}, with the
convention that \underline{\delta}^{\eta}_{1j} = 1. The full joint distribution of Z is then
$$P(Z, S) = \left[\prod_{j=1}^{J} \rho_j^{m_j}\right] \left[\prod_{h=1}^{H} \gamma_h^{n_h}\right] \left[\prod_{h,j=1}^{J,H} \left(\underline{\delta}^{\eta}_{hj}\right)^{z_{hj}}\right] \exp\!\left(-\sum_{h,j=1}^{J,H} s_{hj} z_{hj} + \rho_j \gamma_h \underline{\delta}^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right), \tag{A.5}$$
where n_h = \sum_{j=1}^{J} z_{hj}.
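Written cell by cell, each term of (A.5) contributes z_hj log τ_hj − s_hj z_hj − τ_hj e^{−s_hj z_hj}. A minimal evaluation sketch (hypothetical helper, not thesis code, with `delta` standing in for the neighbourhood terms):

```python
import math

def log_joint(Z, S, rho, gamma, delta):
    # Log of the joint P(Z, S) in (A.5); delta[h][j] plays the role of the
    # (order-dependent) neighbourhood term for host h and parasite j.
    ll = 0.0
    for h, row in enumerate(Z):
        for j, z in enumerate(row):
            tau = rho[j] * gamma[h] * delta[h][j]
            if z:
                ll += math.log(tau) - S[h][j]
            ll -= tau * math.exp(-S[h][j] * z)
    return ll

# A single absent interaction with s = 0 contributes exactly -tau = -1.
print(log_joint([[0]], [[0.0]], [1.0], [1.0], [[1.0]]))  # -> -1.0
```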
For priors \pi(\,\cdot\,), the posterior distributions of the affinity parameters are
$$\begin{aligned}
p(\rho_j \mid \cdot) &\propto \rho_j^{m_j} \exp\!\left(-\rho_j \sum_{h=1}^{H} \gamma_h \underline{\delta}^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right) \pi(\rho_j) \\
&\propto \rho_j^{m_j} \exp\!\left(-\rho_j \sum_{h=1}^{H} \gamma_h \delta^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right) \exp\!\left(\rho_j \sum_{h=1}^{H} \gamma_h \bar{\delta}^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right) \pi(\rho_j), \\
p(\gamma_h \mid \cdot) &\propto \gamma_h^{n_h} \exp\!\left(-\gamma_h \sum_{j=1}^{J} \rho_j \delta^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right) \exp\!\left(\gamma_h \sum_{j=1}^{J} \rho_j \bar{\delta}^{\eta}_{hj} e^{-s_{hj} z_{hj}}\right) \pi(\gamma_h).
\end{aligned} \tag{A.6}$$
Let \mathrm{tGumbel}(\tau, 1, 0) be a zero-truncated Gumbel random variable with scale parameter
1, having the density
$$\frac{\exp\!\left(-(s - \tau + e^{-(s-\tau)})\right)}{1 - \exp(-e^{\tau})}\, \chi_{(0,\infty)}(s).$$
Sampling the posterior latent score follows
$$s_{hj} \mid Z, \rho, \gamma, \eta \sim \begin{cases} \chi_0 & \text{if } z_{hj} = 0 \\ \mathrm{tGumbel}\!\left(\log(\gamma_h \rho_j \delta^{\eta}_{hj}), 1, 0\right) & \text{if } z_{hj} = 1, \end{cases} \tag{A.7}$$
where \chi_0 is an atomic measure at zero.
The joint distributions in (A.4) and (A.5) depend on the order of observations for the j-th
parasite, (z_{1j}, z_{2j}, \ldots, z_{Hj}). The dependence is omitted from the notation but implicitly assumed. That is, each
subscript hj should be \sigma_j(h)j, where \sigma_j : \{1, \ldots, H\} \mapsto \{1, \ldots, H\} is an independent permutation
of the order of observations for the j-th parasite. Nonetheless, the joint distribution
is valid for each fixed permutation, and the model is run as such.
The joint distribution, in general, is not tractable, primarily due to the influence of the
order of observations. This order dependence could be partially omitted in a way similar to
the Ising model. In particular, let \underline{\delta}^{\eta}_{hj} be parameterized on the exponential scale for
some similarity matrix \Delta as
$$\underline{\delta}^{\eta}_{hj} = \exp\!\left(-\eta \sum_{i=1}^{h-1} \Delta_{hi} z_{ij}\right).$$
Then, the third product in (A.5) becomes
$$\prod_{j=1}^{J} \prod_{h=1}^{H} \left(\underline{\delta}^{\eta}_{hj}\right)^{z_{hj}} = \prod_{j=1}^{J} \prod_{h=1}^{H} \exp\!\left(-\eta z_{hj} \sum_{i=1}^{h-1} \Delta_{hi} z_{ij}\right) = \prod_{j=1}^{J} \exp\!\left(-\frac{\eta}{2} \sum_{h,k=1}^{H} z_{hj} \Delta_{hk} z_{kj}\right).$$
However, as mentioned earlier, this transformation only partially relaxes the influence
of order dependence, as it only affects the third product of (A.5) and not the dependence
seen in the exponential part. Moreover, the exponential-scale transformation above alters the
interpretation of the neighbourhood structure in (5.2). Initially, \delta^{\eta}_{hj} was strictly non-negative,
penalizing the expected score for values less than one and complementing it for values
larger than one. In contrast, the exponential-scale transformation only penalizes the
expected score, as \delta^{\eta}_{hj} then takes values strictly within (0, 1). We find the parametrization in (5.2)
to have better prediction performance.
A.1 Existence of the joint distribution
Theorem 14. (Hammersley-Cliord,(Robert and Casella, 2013)) Under marginal positively
conditions, the joint distribution of random variables Z = (z1, z2, . . . , zn) is proportional to
P(X)
P(X∗)=
n∏i=1
P(xi | x1, . . . , xi−1, x∗i+1, . . . , x
∗n)
P(x∗i | x1, . . . , xi−1, x∗i+1, . . . , x∗n)
(A.8)
where x∗i are xed observations, for example x∗i = 1.
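The ratio identity (A.8) can be checked numerically on a toy example. The sketch below (hypothetical: three binary variables with a randomly generated strictly positive joint) verifies that the telescoping product of conditionals recovers P(X)/P(X*):

```python
import itertools
import random

# Numerical check of the Hammersley-Clifford ratio (A.8) on a small,
# strictly positive joint distribution over three binary variables.
random.seed(7)
states = list(itertools.product([0, 1], repeat=3))
weights = {s: random.uniform(0.1, 1.0) for s in states}
total = sum(weights.values())
P = {s: weights[s] / total for s in states}

def cond(i, value, prefix, star_suffix):
    # P(x_i = value | x_1..x_{i-1} = prefix, x_{i+1}..x_n = star_suffix).
    den = sum(P[prefix + (v,) + star_suffix] for v in (0, 1))
    return P[prefix + (value,) + star_suffix] / den

x, x_star = (1, 0, 1), (1, 1, 1)
ratio = 1.0
for i in range(3):
    ratio *= (cond(i, x[i], x[:i], x_star[i + 1:])
              / cond(i, x_star[i], x[:i], x_star[i + 1:]))
print(abs(ratio - P[x] / P[x_star]) < 1e-12)
```

Each factor telescopes: the i-th ratio equals P(x_1..x_i, x*_{i+1}..x*_n) / P(x_1..x_{i-1}, x*_i..x*_n), so the product collapses to P(X)/P(X*).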
Regarding the conditional probability in (A.1), assume the phylogeny-only model where
\tau_{hj} = \underline{\delta}^{\eta}_{hj}, with \underline{\delta}^{\eta}_{hj} as in (5.2). Since each column of Z is independent, it suffices to show that
the joint distribution exists for each column. Applying the Hammersley-Clifford theorem, we
have
$$\frac{P(z_{hj} \mid z_{1j}, \ldots, z_{(h-1)j}, z^*_{(h+1)j}, \ldots, z^*_{Hj})}{P(z^*_{hj} \mid z_{1j}, \ldots, z_{(h-1)j}, z^*_{(h+1)j}, \ldots, z^*_{Hj})} = \left[\frac{\exp(-\tau_{hj})}{1 - \exp(-\tau_{hj})}\right]^{1-z_{hj}},$$
where z^*_{hj} = 1 and
$$\tau_{hj} = \sum_{i=1}^{h-1} \Delta^{\eta}_{hi} z_{ij} + \sum_{i=h+1}^{H} \Delta^{\eta}_{hi}, \qquad \tau_{1j} = \sum_{i=2}^{H} \Delta^{\eta}_{1i}, \qquad \tau_{Hj} = \sum_{i=1}^{H-1} \Delta^{\eta}_{Hi} z_{ij}.$$
Essentially, by removing the event of no interactions, z_{\cdot j} = (0, 0, \ldots, 0), and setting
\tau_{hj} = 1 whenever it is 0, the joint distribution exists by the Hammersley-Clifford theorem.
A.1.1 Parametrization using an exponential distribution
Rather than using a Gumbel distribution, one can achieve an equivalent parametrization
using the exponential distribution. Suppose that z_{hj} is completely determined by a latent
variable u_{hj}, such that
$$z_{hj} = \begin{cases} 1 & u_{hj} < 1 \\ 0 & u_{hj} = 1. \end{cases}$$
A possible choice for the distribution of u_{hj} \mid Z_{-(hj)} is the density of a partitioned
exponential distribution,
$$p(u_{hj} \mid Z_{-(hj)}) = \tau_{hj} \exp\!\left(-\tau_{hj} u_{hj}\right) \mathbb{I}_{u_{hj}<1} + \exp\!\left(-\tau_{hj}\right) \mathbb{I}_{u_{hj}=1},$$
where \tau_{hj} = \gamma_h \rho_j \delta^{\eta}_{hj}. The joint distribution becomes
$$p(z_{hj}, u_{hj} \mid Z_{-(hj)}) = \left[\tau_{hj} \exp\!\left(-\tau_{hj} u_{hj}\right) \mathbb{I}_{u_{hj}<1}\right]^{z_{hj}} \left[\exp(-\tau_{hj})\, \mathbb{I}_{u_{hj}=1}\right]^{1-z_{hj}} = \tau_{hj}^{z_{hj}} \exp\!\left(-\tau_{hj} u_{hj}\right).$$
A.2 Latent score sampling with uncertainty
Modelling the uncertainty parameter g as
$$p(z_{hj} = 0 \mid s_{hj}, g) = \begin{cases} 1 & \text{if } s_{hj} = 0 \\ g & \text{if } s_{hj} > 0, \end{cases}$$
one arrives at the conditional joint distributions
$$\begin{aligned}
P(z_{hj} = 1, s_{hj} \mid g, Z_{-(hj)}) &= P(z_{hj} = 1 \mid g, s_{hj})\, p(s_{hj} \mid Z_{-(hj)}) = p(s_{hj} \mid Z_{-(hj)})\, \mathbb{I}_{s_{hj}>0} \\
P(z_{hj} = 0, s_{hj} \mid g, Z_{-(hj)}) &= P(z_{hj} = 0 \mid g, s_{hj})\, p(s_{hj} \mid Z_{-(hj)}) = p(s_{hj} \mid Z_{-(hj)}) \left[g\, \mathbb{I}_{s_{hj}>0} + \mathbb{I}_{s_{hj}=0}\right].
\end{aligned} \tag{A.9}$$
The conditional sampling of the latent truncated score variable s_{hj} becomes
$$p(s_{hj} \mid z_{hj}, Z_{-(hj)}, g) = \frac{P(z_{hj} \mid s_{hj}, g)\, p(s_{hj} \mid Z_{-(hj)})}{\int P(z_{hj} \mid s, g)\, p(s \mid Z_{-(hj)})\, ds} = C \cdot p(s_{hj} \mid Z_{-(hj)}),$$
such that
$$C = \frac{P(z_{hj} \mid s_{hj}, g)}{\int_{s>0} P(z_{hj} \mid s, g)\, p(s \mid Z_{-(hj)})\, ds + \int_{s \le 0} P(z_{hj} \mid s, g)\, p(s \mid Z_{-(hj)})\, ds}$$
$$= \begin{cases} \dfrac{P(z_{hj} \mid s_{hj}, g)}{\int_{s>0} 1 \cdot p(s \mid Z_{-(hj)})\, ds + \int_{s \le 0} 0 \cdot p(s \mid Z_{-(hj)})\, ds} & \text{when } z_{hj} = 1, \\[8pt] \dfrac{P(z_{hj} \mid s_{hj}, g)}{\int_{s>0} g \cdot p(s \mid Z_{-(hj)})\, ds + \int_{s \le 0} 1 \cdot p(s \mid Z_{-(hj)})\, ds} & \text{when } z_{hj} = 0, \end{cases}$$
$$= \begin{cases} \dfrac{1}{\psi(s_{hj})} & s_{hj} > 0,\; z_{hj} = 1, \\[6pt] 0 & s_{hj} = 0,\; z_{hj} = 1, \\[6pt] \dfrac{g}{g\psi(s_{hj}) + 1 - \psi(s_{hj})} & s_{hj} > 0,\; z_{hj} = 0, \\[6pt] \dfrac{1}{g\psi(s_{hj}) + 1 - \psi(s_{hj})} & s_{hj} = 0,\; z_{hj} = 0, \end{cases}$$
where \psi(s_{hj}) = \int_0^{\infty} p(s \mid S_{-(hj)}, \gamma_h, \rho_j, \eta)\, ds = 1 - \exp\!\left(-\gamma_h \rho_j \delta^{\eta}_{hj}\right).
Moreover, sampling the uncertainty variable uses the conditional distribution
$$P(g \mid S, Z) \propto P(Z \mid S, g)\, P(g) \propto g^{N_{-+}} (1-g)^{N_{++}},$$
where N_{-+} = \#\{(h,j) : z_{hj} = 0,\, s_{hj} > 0\} and N_{++} = \#\{(h,j) : z_{hj} = 1,\, s_{hj} > 0\}.
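With a flat prior on g, this kernel is the Beta(N−+, N++) distribution used in Appendix B. A minimal sketch of the conjugate draw (hypothetical helper names, flattened z and s for brevity):

```python
import random

def update_g(z, s, rng=random):
    # Direct conjugate draw for g: with a flat prior, the conditional
    # above is a Beta(N_minus_plus, N_plus_plus) kernel over the counts
    # of unobserved (z = 0) vs observed (z = 1) pairs with positive score.
    n_mp = sum(1 for zi, si in zip(z, s) if zi == 0 and si > 0)
    n_pp = sum(1 for zi, si in zip(z, s) if zi == 1 and si > 0)
    return rng.betavariate(n_mp, n_pp)

random.seed(3)
z = [0, 0, 1, 1, 1, 0]
s = [0.4, 0.0, 1.2, 0.3, 2.1, 0.9]
print(0.0 < update_g(z, s) < 1.0)
```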
Appendix B
Details on the MCMC algorithm
Given the observed matrix Z, we approximate the posterior density p(\rho, \gamma, \eta, \phi_\rho, \phi_\gamma, \phi_\eta \mid Z),
where \phi_x = (\alpha_x, \tau_x) denotes the gamma hyperparameters for parameter x. Note that
each interaction z_{hj} depends only on row h and column j via the parameters \gamma_h and \rho_j,
and on \eta via the dependency structure \delta. Hence, one can update the parameters related
to the diagonal \{z_{hh} : h = 1, \ldots, H\} in parallel while retaining convergence conditions.
Generalizing this allows the parallel update of the parameters related to the i = 0, \ldots, J-1
diagonals
$$\{z_{h,(h+i) \bmod J} : h = 1, \ldots, H\},$$
in the following Metropolis-Hastings steps:
1) update \phi_\rho, \phi_\gamma and \phi_\eta given (\rho, \gamma, \eta),
2) update \rho = (\rho_j)_{j=1}^{J} in parallel given (\phi_\rho, \phi_\gamma, \phi_\eta, \gamma, \eta) via (A.6),
3) update \gamma = (\gamma_h)_{h=1}^{H} in parallel given (\phi_\rho, \phi_\gamma, \phi_\eta, \rho, \eta) via (A.6),
4) update \eta given (\phi_\rho, \phi_\gamma, \phi_\eta, \rho, \gamma) with a proposal acceptance probability of \min(1, a), where
$$a = \left[\prod_{h,j=1}^{J,H} \left(\frac{\delta^{\tilde\eta}_{hj}}{\delta^{\eta}_{hj}}\right)^{z_{hj}}\right] \exp\!\left(-\sum_{h,j=1}^{J,H} \rho_j \gamma_h e^{-s_{hj} z_{hj}} \left(\delta^{\tilde\eta}_{hj} - \delta^{\eta}_{hj}\right)\right),$$
with \tilde\eta the proposed value.
A new proposal \tilde\eta is sampled from a log-normal distribution, q(\tilde\eta \mid \eta) = \mathrm{lognormal}(\log(\eta), \sigma^2_\eta),
given a flat prior.
5) update the latent variables of the diagonal \{s_{h,(h+i) \bmod J} : h = 1, \ldots, H\} as
$$s_{hj} \mid Z, \rho, \gamma, \eta \sim \begin{cases} \chi_0 & \text{if } z_{hj} = 0 \\ \mathrm{tGumbel}\!\left(\log(\gamma_h \rho_j \delta^{\eta}_{hj}), 1, 0\right) & \text{if } z_{hj} = 1, \end{cases}$$
for (h, j) \in \{(h, x) : x = (h+i) \bmod J\} for diagonal i, where \gamma^* \approx 0.5772 is the
Euler-Mascheroni constant, and \chi_0 is an atomic measure at zero. \mathrm{tGumbel}(\tau, 1, 0) is a
zero-truncated Gumbel distribution with probability density function as in (A.7).
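The parallel-safety of the diagonal index sets used in these steps can be checked directly. The following sketch (illustrative only) verifies that within one diagonal no two cells share a row or, for H ≤ J, a column, and that the J diagonals partition the matrix:

```python
# Each "diagonal" {(h, (h + i) mod J) : h = 1..H} touches every row at
# most once and, when H <= J, every column at most once, so its cells
# share no gamma_h or rho_j and can be updated in parallel.
H, J = 4, 6
diagonals = [[(h, (h + i) % J) for h in range(H)] for i in range(J)]

for diag in diagonals:
    rows = [h for h, _ in diag]
    cols = [j for _, j in diag]
    assert len(set(rows)) == len(rows)
    assert len(set(cols)) == len(cols)

# Together the J diagonals cover every cell of the H x J matrix exactly once.
cells = [cell for diag in diagonals for cell in diag]
print(len(set(cells)) == H * J == len(cells))
```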
Updating the uncertainty parameter g:
When correcting for uncertainty, just after step 4) above, sample g given (\rho, \gamma, \eta) directly
from a Beta(N_{-+}, N_{++}) distribution. That is, for h = 1, \ldots, H and k = 1, \ldots, J,
$$N_{-+} = \#\{(h,k) : z_{hk} = 0,\, s_{hk} > 0\}, \qquad N_{++} = \#\{(h,k) : z_{hk} = 1,\, s_{hk} > 0\}.$$
Step 5) becomes
$$s_{hj} \mid Z, \rho, \gamma, \eta, g \sim \begin{cases} \dfrac{1}{\theta(g, s_{hj})}\, \chi_0 + \dfrac{g}{\theta(g, s_{hj})}\, \mathrm{tGumbel}\!\left(\log(\gamma_h \rho_j \delta^{\eta}_{hj}), 1\right) & \text{if } z_{hj} = 0 \\[8pt] \mathrm{tGumbel}\!\left(\log(\gamma_h \rho_j \delta^{\eta}_{hj}), 1\right) & \text{if } z_{hj} = 1, \end{cases}$$
for (h, j) \in \{(h, x) : x = (h+i) \bmod J\} for diagonal i.
Updating hyperparameters \phi_\rho, \phi_\gamma:
Since the host and parasite priors are characterized by two-parameter Gamma distributions
(\alpha, \tau), we omit the subscripts and work with a general hyperparameter update
mechanism. Independently, for each of the parameter sets \phi_\rho and \phi_\gamma given the other
parameters and the latent variables \phi^* = (\rho, \gamma, \eta, S), update them using a Metropolis-Hastings
step, with proposals \tilde\phi = (\tilde\alpha, \tilde\tau) from q(\tilde\alpha, \tilde\tau \mid \alpha, \tau). The acceptance probability is \min(1, a),
where
$$a = \frac{p(\tilde\phi \mid \phi^*)}{p(\phi \mid \phi^*)} \times \frac{q(\phi \mid \tilde\phi)}{q(\tilde\phi \mid \phi)} = \frac{\prod_{i=1}^{N_x} \int_{\mathbb{R}^+} p(x_i \mid \tilde\phi, \phi^*)\, p(\tilde\phi)\, dx_i}{\prod_{i=1}^{N_x} \int_{\mathbb{R}^+} p(x_i \mid \phi, \phi^*)\, p(\phi)\, dx_i} \times \frac{q(\phi \mid \tilde\phi)}{q(\tilde\phi \mid \phi)}.$$
The symbols \rho and \gamma substitute for x above, where N_\rho = J and N_\gamma = H. The joint
distribution in (5.8) is independent of the hyperparameters, and is thus left out.
Independent proposals are used, q(\tilde\phi = (\tilde\alpha, \tilde\tau) \mid \alpha, \tau) = q(\tilde\alpha \mid \alpha)\, q(\tilde\tau \mid \tau), where
$$q(\tilde\alpha \mid \alpha) \sim \mathrm{lognormal}(\log(\alpha), \sigma^2_\alpha), \qquad q(\tilde\tau \mid \tau) \sim \mathrm{lognormal}(\log(\tau), \sigma^2_\tau).$$
With improper priors
$$p(\alpha, \tau) = p(\alpha)\, p(\tau), \qquad p(\alpha) \propto \frac{1}{\alpha}, \qquad p(\tau) \propto \frac{1}{\tau},$$
the general form of the acceptance probability a simplifies to
$$a = \prod_{i=1}^{N_x} \left[\frac{\int_{\mathbb{R}^+} p(x_i \mid \tilde\alpha, \tilde\tau, \phi^*)\, dx_i}{\int_{\mathbb{R}^+} p(x_i \mid \alpha, \tau, \phi^*)\, dx_i}\right] \times \frac{\alpha\tau}{\tilde\alpha\tilde\tau} \times \frac{\tilde\alpha\tilde\tau}{\alpha\tau} = \prod_{i=1}^{N_x} \frac{\int_{\mathbb{R}^+} p(x_i \mid \tilde\alpha, \tilde\tau, \phi^*)\, dx_i}{\int_{\mathbb{R}^+} p(x_i \mid \alpha, \tau, \phi^*)\, dx_i}.$$
The acceptance probability a for each case is:
• \phi_\gamma = (\alpha_\gamma, \tau_\gamma):
$$a = \left[\frac{\tilde\tau_\gamma^{\tilde\alpha_\gamma}}{\tau_\gamma^{\alpha_\gamma}} \frac{\Gamma(\alpha_\gamma)}{\Gamma(\tilde\alpha_\gamma)}\right]^{H} \prod_{h=1}^{H} \frac{\Gamma(n_h + \tilde\alpha_\gamma)}{\Gamma(n_h + \alpha_\gamma)} \frac{(\tau_\gamma + \Psi'_h)^{n_h + \alpha_\gamma}}{(\tilde\tau_\gamma + \Psi'_h)^{n_h + \tilde\alpha_\gamma}}, \qquad \Psi'_h = \sum_{j=1}^{J} \rho_j \delta^{\eta}_{hj} e^{-s_{hj}}, \quad n_h = \sum_{j=1}^{J} z_{hj},$$
• \phi_\rho = (\alpha_\rho, \tau_\rho):
$$a = \left[\frac{\tilde\tau_\rho^{\tilde\alpha_\rho}}{\tau_\rho^{\alpha_\rho}} \frac{\Gamma(\alpha_\rho)}{\Gamma(\tilde\alpha_\rho)}\right]^{J} \prod_{j=1}^{J} \frac{\Gamma(m_j + \tilde\alpha_\rho)}{\Gamma(m_j + \alpha_\rho)} \frac{(\tau_\rho + \Psi_j)^{m_j + \alpha_\rho}}{(\tilde\tau_\rho + \Psi_j)^{m_j + \tilde\alpha_\rho}}, \qquad \Psi_j = \sum_{h=1}^{H} \gamma_h \delta^{\eta}_{hj} e^{-s_{hj}}, \quad m_j = \sum_{h=1}^{H} z_{hj}.$$
Appendix C
Additional results
C.1 Posterior distributions
Figure C.1 shows posterior boxplots for the parameters with the 80 highest posterior medians
and the posterior distribution of the scaling parameter η, for the GMPD (top panel) and
EID2 (bottom panel). As shown, for the GMPD, the parasite parameters (ρ) vary moderately,
which reflects the balance of column densities in the left-ordered interaction matrix in
Figure 5.1. The host parameters (γ) show more variation, reflecting that some
hosts are more likely to interact with parasites, or have been more intensively studied. In
the EID2 database, the variation among the hosts is more prominent, which confirms our
earlier observation that row densities of the EID2 interaction matrix are less balanced (see
Figure 5.1). The ρ parameters, on the other hand, do not show much variation in
either database, as seen in the column densities of the interaction matrices.
[Figure C.1 panels: (a) GMPD parasite parameter ρ, (b) GMPD host parameter γ, (c) GMPD scale parameter η, (d) EID2 parasite parameter ρ, (e) EID2 host parameter γ, (f) EID2 scale parameter η.]
Figure C.1: Boxplots of posterior estimates for the host and parasite parameters with the 80 highest medians, and the posterior distributions of the scale parameter; dashed horizontal lines are the posterior mean and 95% credible intervals, for the GMPD (top panel) and EID2 (bottom panel).
C.2 Representative trace plots and diagnostics
[Figure C.2 panels: (a) GMPD, (b) EID2; trace plots over 10,000 iterations for the host, parasite, and scale parameters.]
Figure C.2: Trace plots for the GMPD and EID2: host (top) and parasite (middle) of highest median posterior, and the similarity matrix scaling parameter (bottom).
[Figure C.3 panels: (a) GMPD, effective sample sizes: host (most active) 1639, parasite (most active) 1881, scale parameter 404; (b) EID2, effective sample sizes: host 1000, parasite 885, scale parameter 330.]
Figure C.3: ACF plots and effective sample sizes for the GMPD and EID2: host (top) and parasite (middle) of highest median posterior, and the similarity matrix scaling parameter (bottom).
C.3 Parameter numerical results
Table C.1: Posterior means, Monte Carlo standard errors and credible intervals for the highest affinity parameters and the scale parameter.

GMPD network
Parameter  Estimate  Standard dev  95% credible interval
ρ(1)        2.18      0.94         (0.93, 3.91)
ρ(2)        1.84      0.48         (1.13, 2.68)
ρ(3)        1.81      0.64         (0.88, 2.97)
ρ(4)        1.70      0.51         (0.97, 2.62)
ρ(5)        1.69      0.20         (1.38, 2.04)
γ(1)       10.88      1.87         (7.94, 14.01)
γ(2)        8.11      1.18         (6.25, 10.15)
γ(3)        7.69      0.82         (6.39, 9.08)
γ(4)        7.62      1.17         (5.81, 9.61)
γ(5)        7.09      1.01         (5.47, 8.86)
η           1.57      0.01         (1.54, 1.59)

EID2 network
Parameter  Estimate  Standard dev  95% credible interval
ρ(1)        1.70      0.54         (0.92, 2.68)
ρ(2)        1.56      0.18         (1.28, 1.87)
ρ(3)        1.45      0.64         (0.55, 2.63)
ρ(4)        1.43      0.22         (1.09, 1.8)
ρ(5)        1.42      0.67         (0.52, 2.67)
γ(1)       53.59      3.90         (47.41, 60.13)
γ(2)       32.10      3.67         (26.2, 38.06)
γ(3)       18.95      2.80         (14.68, 23.85)
γ(4)       16.55      2.62         (12.42, 21.08)
γ(5)        9.83      1.78         (7.09, 12.87)
η           1.15      0.01         (1.14, 1.17)
C.4 Uncertainty - histograms
[Figure C.4 panels: posterior histograms of g.]
Figure C.4: Posterior histogram of g for the GMPD (left) and EID2 (right) databases.
Figure C.5 shows histograms of the posterior log-probabilities when using the model without
g (left) and the model with g (right), for the GMPD-Carnivora subset. For the model
without g, the right mode (cyan) is the histogram of the posterior log-probabilities of all the
observed interactions in the 2010 validation set, while the left mode (pink) is the histogram
of the posterior log-probabilities of unobserved interactions. For the model with g, the
overlap of the posterior log-probabilities of the two categories, observed and unobserved, is
significantly reduced by lowering scores for the unobserved interactions. This causes a clearer
partition in probabilities between the two categories, and only unobserved interactions with
very high posterior probability are then classified as possible interactions.
[Figure C.5 panels: (a) without g, (b) with g; densities of log-probability for observed and unobserved associations.]
Figure C.5: Comparison in posterior log-probability between observed and unobserved interactions, for the model without g (left) and with g (right), for the GMPD-Carnivora database.
C.5 Interaction matrices for subsets - Carnivora and Rodentia
[Figure C.6 panels: (a) GMPD-Carnivora, (b) without g, (c) with g.]
Figure C.6: Association matrices of the whole GMPD-Carnivora subset: observed (left), posterior for the model without g (middle), posterior for the model with g (right).
[Figure C.7 panels: (a) EID2-Rodentia, (b) without g, (c) with g.]
Figure C.7: Association matrices of the whole EID2-Rodentia subset: observed (left), posterior for the model without g (middle), posterior for the model with g (right).
C.6 ROC with and without g for full GMPD and EID2 databases
[Figure C.8 panels: (a) GMPD, (b) EID2; ROC curves (sensitivity against 1 − specificity) for the LS-network model with and without g.]
Figure C.8: Comparison of ROC curves for the full dataset, for the models with(out) g.
Table C.2: AUC comparison between models with g and without g, on the GMPD databases and clade subsets, with different variations of the model.

Model       GMPD-Carnivora  GMPD   GMPD Affinity  GMPD Phylogeny
with g      0.935           0.924  0.926          0.853
without g   0.843           0.891  0.856          0.825
Table C.3: Percentage of observed interactions correctly predicted in the held-out portion of the validation set (in parentheses) and in the full data, for the GMPD database.

Model       GMPD-Carnivora  GMPD           GMPD Affinity  GMPD Phylogeny
with g      (0.373) 0.827   (0.683) 0.832  (0.437) 0.824  (0.728) 0.803
without g   (0.573) 0.784   (0.788) 0.811  (0.607) 0.796  (0.698) 0.726
[Figure C.9 panels: (a) GMPD without g, (b) GMPD with g, (c) EID2 without g, (d) EID2 with g.]
Figure C.9: Posterior association matrices for the full datasets.
Table C.4: AUC comparison between models with g and without g, on the EID2 databases and clade subsets, with different variations of the model.

Model       EID2-Rodentia  EID2   EID2 Affinity  EID2 Phylogeny
with g      0.899          0.938  0.942          0.845
without g   0.832          0.916  0.913          0.801
Table C.5: Percentage of observed interactions correctly predicted in the held-out portion of the validation set (in parentheses) and in the full data, for the EID2 database.

Model       EID2-Rodentia  EID2          EID2 Affinity  EID2 Phylogeny
with g      (0.809) 0.825  (0.92) 0.919  (0.893) 0.934  (0.847) 0.822
without g   (0.681) 0.665  (0.92) 0.85   (0.834) 0.797  (0.786) 0.666
C.7 Percentage of recovered pairwise interactions
One way to compare model performance other than the ROC curve is to look at
predictive performance by measuring the proportion of recovered true interactions in the
data. To show this, for each model, sort all pairwise interactions in descending order of their
posterior predictive probabilities, then count the number of true interactions recovered
within the x pairs with highest probabilities. Scaling x from 1 to 1000 gives the
following comparative plots.
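The counting procedure above can be sketched as follows (hypothetical helper names, not thesis code):

```python
def recovery_curve(pred_probs, truth, max_x=1000):
    # Sort all pairwise interactions by posterior predictive probability,
    # descending, and count how many true interactions fall in the top x.
    order = sorted(range(len(pred_probs)), key=lambda i: -pred_probs[i])
    curve, recovered = [], 0
    for i in order[:max_x]:
        recovered += truth[i]
        curve.append(recovered)
    return curve

probs = [0.9, 0.2, 0.8, 0.6, 0.1]
truth = [1, 0, 1, 0, 1]
print(recovery_curve(probs, truth, max_x=4))  # -> [1, 2, 2, 2]
```

Plotting the curve against x, together with the x = y line, gives the comparative plots below.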
[Figure C.10 panels: (a) GMPD, (b) EID2; number of recovered pairwise interactions against the number of validated pairwise interactions (1 to 1000), for the full model, affinity-only, and phylogeny-only variants, each with and without uncertainty, plus the x = y line.]
Figure C.10: Number of pairwise recovered interactions from the original data.
C.8 Posterior degree distribution
[Figure C.11 panels: (a) hosts - full model, (b) parasites - full model, (c) hosts - full model with g, (d) parasites - full model with g; observed and estimated degree against number of nodes.]
Figure C.11: Comparison of degree distribution on log-scale, for the full model (without accounting for uncertainty) and the model with g, GMPD dataset.
[Figure C.12 panels: (a) hosts - full model, (b) parasites - full model, (c) hosts - model with g, (d) parasites - model with g; observed and estimated degree against number of nodes.]
Figure C.12: Comparison of degree distribution on log-scale, for the full model (without accounting for uncertainty) and the model with g, EID2 dataset.
C.9 Hyperparameters and effective size
[Figure C.13 panels: expected value of host affinity (top) and parasite affinity (bottom) over 8,000 iterations.]
Figure C.13: Trace plots of convergence of three chains started at different values for the expected value of the hyperparameter, for the GMPD dataset.
Chapter 6
Conclusion and future research
This thesis has contributed to two sub-fields of statistical network analysis: the
modelling of random graphs, and link prediction. On the former, this work has
proposed a new way of modelling decomposable graphs, achieved by adopting a non-classical
representation of decomposable graphs as deterministic functions of bipartite point
processes. On link prediction, this work has adopted methods of measuring the proportion of
missing links at the data source and applied them to correct link prediction in presence-only
data.
Building on recent work on models for random graphs, Chapter 3 proposed a
framework for modelling decomposable graphs that is driven by node-specific affinity
parameters. Rather than modelling the probability of an edge forming between two nodes,
the proposed framework models the probability of nodes attaining membership in maximal
cliques of the graph. The maximal cliques are represented by latent communities that are
connected into a tree, mimicking the junction tree representation of decomposable
graphs. The bipartite interactions between the graph nodes and these latent communities
can be mapped deterministically to an adjacency matrix of a decomposable graph.
The adopted representation of decomposable graphs yields simple Markov update steps.
Conditional on the latent clique communities, the node-clique relationship is assigned, and
iteratively, the tree connectivity of the communities is updated according to their node
memberships. The adopted iterative procedure is native to many models of decomposable
graphs, due to their conditional dependency structure. Section 3.4 illustrated two sampling
mechanisms for the proposed model, one based on sequential sampling with finite steps, and
the other based on a Markov stopped process. A lower bound on the mixing time for the Markov
stopped process is specified. The bipartite representation of decomposable graphs permits
easy computation of the expected number of maximal cliques per node, which is the topic
of Section 3.7.
One of the main benefits of the proposed decomposable graphs framework of Chapter 3 is
the new application of sub-clustering, shown in Chapter 4. The bipartite representation can
easily be extended to account for subgraphs (sub-cliques) of maximal cliques, adding much
richness to the model. In classical settings, one models solely the decomposable graph;
the proposed model adds to that by flexibly modelling the latent dynamics forming within
each maximal clique. Nonetheless, introducing sub-clustering to the model comes with extra
complexities related to the dynamics between maximal and sub-maximal cliques. Few
methods exist for modelling these dynamics; this work adopted a method that utilizes the
continuous nature of the community-specific affinity parameters. Contrary to the treatment
of decomposable graphs in Chapter 3, allowing for sub-clustering requires a series of rules
addressing the change in the junction tree after every (dis)connect move. In some update
steps, a maximal clique might become sub-maximal, and vice versa, varying the size of the
junction tree at every step. A major part of this work is dedicated to such update rules.
On the second area of contribution of this work, Chapter 5 introduced a Bayesian latent
score model for link prediction in presence-only networks. The proposed model assigns scores
to observed edges of a network in an attempt to rank edges from the most probable down
to the least. In the first instance, the model adopts classical affinity-based representations
of networks. To improve the scoring efficiency, the model is augmented with an informed
Markov random field component that also depends only on observed links. Since it is hard
to distinguish actual true interactions from forbidden ones, drawing on some
of the work of Jiang et al. (2011), a measure of uncertainty is built which attempts to
estimate the false negative rate in the data source. This rate is then used to gauge the
predicted number of potential interactions. The model is validated using two host-parasite
networks constructed from published databases, the Global Mammal Parasite Database and
the Enhanced Infectious Diseases database, each with thousands of pairwise interactions.
6.1 Future research
The following is a list of future research directions.
• Bounds for mixing times: Chapter 3, Lemma 2 specified the lower bound of the mixing
time for the MCMC method of the proposed framework for decomposable graphs.
The lemma depends on the structure of the junction tree, through the component
\sum_k 1/\Gamma_k. A possible research direction is to generalize this lower bound to depend on
a general measure of tree densities, which could be assumed on the junction tree of the
graph. Arriving at an expression for the upper bound would also be helpful.
• Expectation results on decomposable graphs: Assuming that the junction tree
of the graph is a d-regular tree, Section 3.7 gave an exact expression for the expected
number of maximal cliques per node. This result could possibly be extended to column-wise
expectations, such as the expected size of a maximal clique. In addition, the given
expectation depends on some tree quantities, for example the number of edges of each
tree node and the length of the tree. It is desirable to have an expression that depends
only on general tree measures, which could be extended to general non-regular
trees.
• A second sub-clustering framework: Chapter 4 illustrated a new application of
decomposable graphs, motivated by a sub-clustering method. This method depends
on the latent communities (θ′1, θ′2, . . . ) being classified into maximal and sub-maximal
cliques, the latter being treated as sub-clusters. In Chapter 3, those latent communities
are all assumed to represent maximal cliques. A possible research direction is to adopt
a second sub-clustering method, intermediate between the proposed sub-clustering
method of Chapter 4 and the initial treatment of decomposable graphs of Chapter 3.
In particular, Proposition 1 shows that by using the boundary and neighbouring sets
of (3.7) in the Markov update step of (3.12), the graph resulting from the mapping in
(3.10) is decomposable, though not all active cliques in the biadjacency representation
are maximal. In such a case, the non-empty non-maximal cliques can be seen as
sub-clusters. This definition of sub-clusters might lead to less complex update steps
than the ones in Chapter 4. However, the interpretation of sub-clusters differs, since
as more nodes join the graph, the non-empty non-maximal cliques are potentially
maximal, which is not the case in the proposed model of Chapter 4.
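One ingredient of any such framework that is easy to prototype is the decomposability check itself. The sketch below is a minimal illustration, not the thesis construction: it assumes a binary node-by-community matrix Z, connects two nodes whenever they share an active community (in the spirit of the mapping in (3.10)), and tests chordality via maximum cardinality search (Tarjan and Yannakakis, 1984); all function names are mine.

```python
def graph_from_biadjacency(Z):
    """Nodes i, j are adjacent when rows i and j of the binary matrix Z
    share at least one active community (illustrative stand-in for (3.10))."""
    n = len(Z)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if any(a and b for a, b in zip(Z[i], Z[j])):
                adj[i].add(j)
                adj[j].add(i)
    return adj

def mcs_order(adj):
    """Maximum cardinality search: repeatedly visit the unvisited vertex
    with the most already-visited neighbours."""
    weight = {v: 0 for v in adj}
    unvisited, order = set(adj), []
    while unvisited:
        v = max(unvisited, key=lambda u: weight[u])
        order.append(v)
        unvisited.remove(v)
        for u in adj[v]:
            if u in unvisited:
                weight[u] += 1
    return order

def is_decomposable(adj):
    """A graph is chordal (equivalently, decomposable) iff the reverse of an
    MCS visit order is a perfect elimination ordering."""
    pos = {v: i for i, v in enumerate(mcs_order(adj))}
    for v in adj:
        # the earlier-visited neighbours of v must induce a clique
        earlier = [u for u in adj[v] if pos[u] < pos[v]]
        for a in earlier:
            for b in earlier:
                if a != b and b not in adj[a]:
                    return False
    return True

# Two communities {0,1,2} and {1,2,3}: the induced graph is two triangles
# glued on the edge 1-2, which is decomposable.
Z = [[1, 0], [1, 1], [1, 1], [0, 1]]
print(is_decomposable(graph_from_biadjacency(Z)))  # True
```

A check of this form could sit inside the Markov update step to verify, after each move on the biadjacency matrix, that the induced graph remains decomposable.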
• Hubs of authority using decomposable graphs: The work of Chapter 5 proposed
a link prediction model for presence-only data. To account for the uncertainty in
missing interactions, a mechanism is built to estimate the proportion of missing links
in the data source. This rate is then used to gauge the predicted number of potential
interactions. The motivating data example is host-parasite networks, which are
constructed from documented interactions based on peer-reviewed articles. Using the
time of publication and authorship information, it is possible to integrate the work
done on decomposable graphs in accounting for uncertainty in missing interactions,
for example, by clustering authors on different types of interactions or host-interest
groups. Each cluster could be defined as a maximal clique of a decomposable graph.
Of course, this assumes conditional independence between clusters given a set of joint
authors. Nonetheless, cluster sizes could be used to promote confidence in a specific
pair's interaction: in a sense, the larger the number of cliques of authors publishing
on a specific interaction, the more confidence it receives. Other measures of confidence
could also be explored.
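As a toy illustration of this last point, suppose each documented interaction carries the set of authors who published on it, and that the co-authorship structure has already been clustered into cliques. A simple clique-size-weighted score, with all names and data hypothetical, could read:

```python
def clique_support(interaction_authors, author_cliques):
    """Score an interaction by the author cliques supporting it: sum the
    sizes of cliques containing at least one author who documented it.
    All names here are illustrative, not from the thesis."""
    score = 0
    for clique in author_cliques:
        if clique & interaction_authors:
            score += len(clique)
    return score

# Hypothetical example: two author groups (maximal cliques of a
# decomposable co-authorship graph) and two documented interactions.
cliques = [{"ann", "bo", "cy"}, {"cy", "dee"}]
well_studied = {"ann", "bo", "cy", "dee"}   # authors on interaction 1
sparse = {"dee"}                            # authors on interaction 2
print(clique_support(well_studied, cliques))  # 5: both cliques support it
print(clique_support(sparse, cliques))        # 2: only the smaller clique
```

Under such a score, an interaction documented by several large, independent author cliques would receive more confidence than one reported by a single small group.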
Bibliography
Albert, R. and A. L. Barabasi (2002). Statistical mechanics of complex networks. Reviews
of Modern Physics 74 (1), 47–97.
Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables.
Journal of Multivariate Analysis 11 (4), 581–598.
Araújo, M. B., A. Rozenfeld, C. Rahbek, and P. A. Marquet (2011). Using species co-
occurrence networks to assess the impacts of climate change. Ecography 34 (6), 897–908.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal
of the Royal Statistical Society. Series B (Methodological), 192–236.
Bickel, P. J. and A. Chen (2009). A nonparametric view of network models and Newman–Girvan
and other modularities. Proceedings of the National Academy of Sciences 106 (50),
21068–21073.
Biggs, N., E. K. Lloyd, and R. J. Wilson (1976). Graph Theory, 1736-1936. Oxford University
Press.
Billingsley, P. (2008). Probability and Measure. John Wiley & Sons.
Bollobás, B. (2001). Random Graphs, volume 73 of Cambridge studies in advanced mathe-
matics. Cambridge University Press, Cambridge.
Bollobás, B. and O. Riordan (2007). Metrics for sparse graphs. arXiv preprint
arXiv:0708.1919 .
Borgs, C., J. T. Chayes, H. Cohn, and S. Ganguly (2015). Consistent nonparametric esti-
mation for heavy-tailed sparse graphs. arXiv preprint arXiv:1508.06675 .
Borgs, C., J. T. Chayes, H. Cohn, and N. Holden (2016). Sparse exchangeable graphs and
their limits via graphon processes. arXiv preprint arXiv:1601.07134 .
Borgs, C., J. T. Chayes, H. Cohn, and Y. Zhao (2014a). An lp theory of sparse
graph convergence II: LD convergence, quotients, and right convergence. arXiv preprint
arXiv:1408.0744 .
Borgs, C., J. T. Chayes, H. Cohn, and Y. Zhao (2014b). An lp theory of sparse graph
convergence I: limits, sparse random graph models, and power law distributions. arXiv
preprint arXiv:1401.2906 .
Bornn, L. and F. Caron (2011). Bayesian clustering in decomposable graphs. Bayesian
Analysis 6 (4), 829–846.
Breese, J. S., D. Heckerman, and C. Kadie (1998). Empirical analysis of predictive algorithms
for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in
Artificial Intelligence, pp. 43–52. Morgan Kaufmann Publishers Inc.
Brix, A. (1999). Generalized Gamma measures and shot-noise Cox processes. Advances in
Applied Probability, 929–953.
Caron, F. (2012). Bayesian nonparametric models for bipartite graphs. In Advances in
Neural Information Processing Systems 25, pp. 2051–2059. Curran Associates, Inc.
Caron, F. and A. Doucet (2009). Bayesian nonparametric models on decomposable graphs.
In Advances in Neural Information Processing Systems, pp. 225–233.
Caron, F. and E. B. Fox (2014). Sparse graphs using exchangeable random measures. arXiv
preprint arXiv:1401.1137 .
Chiu, S. N., D. Stoyan, W. S. Kendall, and J. Mecke (2013). Stochastic Geometry and its
Applications. John Wiley & Sons.
Chung, F. and L. Lu (2002). Connected components in random graphs with given expected
degree sequences. Annals of Combinatorics 6 (2), 125–145.
Chung, F. and L. Lu (2006). Complex graphs and networks, Volume 107 of CBMS Regional
Conference Series in Mathematics. Conference Board of the Mathematical Sciences,
Washington, DC.
Cowell, R. G., P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter (2006). Probabilistic Net-
works and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer
Science & Business Media.
Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the
Royal Statistical Society. Series B (Methodological), 129–164.
Darroch, J. N., S. L. Lauritzen, and T. P. Speed (1980). Markov fields and log-linear
interaction models for contingency tables. The Annals of Statistics 8 (3), 522–539.
Davies, T. J. and A. B. Pedersen (2008). Phylogeny and geography predict pathogen com-
munity similarity in wild primates and humans. In Proceedings. Biological sciences - The
Royal Society, Volume 275, pp. 1695–1701.
Dawid, A. P. and S. L. Lauritzen (1993). Hyper Markov laws in the statistical analysis of
decomposable graphical models. The Annals of Statistics 21 (3), 1272–1317.
De Finetti, B. (1931). Funzione caratteristica di un fenomeno aleatorio.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research 7, 1–30.
Durrett, R. (2007). Random Graph Dynamics, Volume 200. Cambridge University Press,
Cambridge.
Ekstrand, M. D., J. T. Riedl, and J. A. Konstan (2011). Collaborative ltering recommender
systems. Foundations and Trends in Human-Computer Interaction 4 (2), 81173.
Erdös, P. and A. Rényi (1959). On random graphs, I. Publicationes Mathematicae (Debre-
cen) 6, 290–297.
Farrell, M. J., L. Berrang-Ford, and T. J. Davies (2013). The study of parasite sharing for
surveillance of zoonotic diseases. Environmental Research Letters 8 (1), 015036.
Ferguson, T. S. and M. J. Klass (1972). A representation of independent increment processes
without Gaussian components. The Annals of Mathematical Statistics 43 (5), 1634–1643.
Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open
challenges. Journal of Computational and Graphical Statistics 21 (4), 825–839.
Fritz, S. A., O. R. P. Bininda-Emonds, and A. Purvis (2009). Geographical variation in
predictors of mammalian extinction risk: big is bad, but only in the tropics. Ecology
Letters 12 (6), 538–549.
Frydenberg, M. and S. L. Lauritzen (1989). Decomposition of maximum likelihood in mixed
graphical interaction models. Biometrika 76 (3), 539–555.
Gao, C., Y. Lu, H. H. Zhou, et al. (2015). Rate-optimal graphon estimation. The Annals of
Statistics 43 (6), 2624–2652.
Gilbert, G. S. and C. O. Webb (2007). Phylogenetic signal in plant pathogen-host range.
Proceedings of the National Academy of Sciences of the United States of America 104 (12),
4979–4983.
Giudici, P. and P. Green (1999). Decomposable graphical Gaussian model determination.
Biometrika 86 (4), 785–801.
Goldenberg, A., A. X. Zheng, S. E. Fienberg, and E. M. Airoldi (2010). A survey of statistical
network models. Foundations and Trends in Machine Learning 2 (2), 129–233.
Gómez, J. M., M. Verdú, and F. Perfectti (2010). Ecological interactions are evolutionarily
conserved across the entire tree of life. Nature 465 (7300), 918–921.
Green, P. J. and A. Thomas (2013). Sampling decomposable graphs using a Markov chain
on junction trees. Biometrika 100 (1), 91–110.
Haario, H., E. Saksman, and J. Tamminen (2001). An adaptive Metropolis algorithm.
Bernoulli, 223–242.
Hara, H. and A. Takemura (2006). Boundary cliques, clique trees and perfect sequences of
maximal cliques of a chordal graph. arXiv:cs.DM/0607055.
Heleno, R., C. Garcia, P. Jordano, A. Traveset, J. M. Gómez, N. Blüthgen, J. Memmott,
M. Moora, J. Cerdeira, S. Rodríguez-Echeverría, H. Freitas, and J. M. Olesen (2014).
Ecological networks: delving into the architecture of biodiversity. Biology letters 10 (1),
20131000.
Hjort, N. L. (1990). Nonparametric Bayes estimators based on Beta processes in models for
life history data. The Annals of Statistics, 1259–1294.
Hoff, P. (2008). Modeling homophily and stochastic equivalence in symmetric relational
data. In Advances in Neural Information Processing Systems, pp. 657–664.
Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American
Statistical Association 100 (469), 286–295.
Hoff, P. D., A. E. Raftery, and M. S. Handcock (2002). Latent space approaches to social
network analysis. Journal of the American Statistical Association 97 (460), 1090–1098.
Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables.
Preprint, Institute for Advanced Study, Princeton, NJ 2.
Huang, S., J. M. Drake, J. L. Gittleman, and S. Altizer (2015). Parasite diversity declines
with host evolutionary distinctiveness: A global analysis of carnivores. Evolution 69 (3),
621–630.
Ings, T. C., J. M. Montoya, J. Bascompte, N. Blüthgen, L. Brown, C. F. Dormann, F. Ed-
wards, D. Figueroa, U. Jacob, J. I. Jones, R. B. Lauridsen, M. E. Ledger, H. M. Lewis,
J. M. Olesen, F. J. F. van Veen, P. H. Warren, and G. Woodward (2009). Ecological
networks – beyond food webs. The Journal of Animal Ecology 78 (1), 253–269.
Janson, S. (2016). Graphons and cut metric on sigma-finite measure spaces. arXiv preprint
arXiv:1608.01833 .
Jiang, X., D. Gold, and E. D. Kolaczyk (2011). Network-based auto-probit modeling for
protein function prediction. Biometrics 67 (3), 958–966.
Jordano, P. (2015). Sampling networks of ecological interactions. bioRxiv , 025734.
Kallenberg, O. (1990). Exchangeable random measures in the plane. Journal of Theoretical
Probability 3 (1), 81–136.
Kallenberg, O. (1999). Multivariate sampling and the estimation problem for exchangeable
arrays. Journal of Theoretical Probability 12 (3), 859–883.
Kallenberg, O. (2005). Probabilistic Symmetries and Invariance Principles. Springer Science
& Business Media.
Kemp, C., J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda (2006). Learning
systems of concepts with an infinite relational model. In AAAI, Volume 3, pp. 5.
Kingman, J. (1967). Completely random measures. Pacific Journal of Mathematics 21 (1),
59–78.
Kingman, J. F. C. (1992). Poisson processes, Volume 3. Clarendon Press.
Kingman, J. F. C. (1993). Poisson Processes. Wiley Online Library.
Kissling, W. D. and M. Schleuning (2015). Multispecies interactions across trophic levels at
macroscales: retrospective and future directions. Ecography 38 (4), 346–357.
Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
Levin, D. A., Y. Peres, and E. L. Wilmer (2009). Markov Chains and Mixing Times. American
Mathematical Society.
Lijoi, A., R. H. Mena, and I. Prünster (2007). Controlling the reinforcement in Bayesian
nonparametric mixture models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 69 (4), 715–740.
Lijoi, A. and I. Prünster (2010). Models beyond the Dirichlet process. Bayesian Nonpara-
metrics 28, 80.
Luis, A. D., T. J. O'Shea, D. T. S. Hayman, J. L. N. Wood, A. A. Cunningham, A. T. Gilbert,
J. N. Mills, and C. T. Webb (2015). Network analysis of host-virus communities in bats
and rodents reveals determinants of cross-species transmission. Ecology Letters 18 (11),
1153–1162.
McIntyre, K. M., C. Setzkorn, M. Wardeh, P. J. Hepworth, A. D. Radford, and M. Baylis
(2013). Using open-access taxonomic and spatial information to create a comprehensive
database for the study of mammalian and avian livestock and pet infections. Preventive
Veterinary Medicine.
Miller, K., M. I. Jordan, and T. L. Griffiths (2009). Nonparametric latent feature models
for link prediction. In Advances in Neural Information Processing Systems, pp. 1276–1284.
Morales-Castilla, I., M. G. Matias, D. Gravel, and M. B. Araújo (2015). Inferring biotic
interactions from proxies. Trends in Ecology & Evolution 30 (6), 347–356.
Newman, M. (2010). Networks: an introduction. Oxford University Press.
Newman, M. E. (2003). The structure and function of complex networks. SIAM Review 45 (2),
167–256.
Ni, Y., F. C. Stingo, and V. Baladandayuthapani (2016). Sparse multi-dimensional graphical
models: A unified Bayesian framework. Journal of the American Statistical Association,
1–44.
Nunn, C. L. and S. M. Altizer (2005). The global mammal parasite database: an online
resource for infectious disease records in wild primates. Evolutionary Anthropology: Issues,
News, and Reviews 14 (1), 1–2.
Olhede, S. C. and P. J. Wolfe (2012). Degree-based network models. arXiv preprint
arXiv:1211.6537 .
Orbanz, P. and D. M. Roy (2015). Bayesian models of graphs, arrays and other exchange-
able random structures. Pattern Analysis and Machine Intelligence, IEEE Transactions
on 37 (2), 437–461.
Orbanz, P. and S. Williamson (2011). Unit rate Poisson representations of completely random
measures. Technical report.
Palla, K., D. Knowles, and Z. Ghahramani (2012). An infinite latent attribute model for
network data. arXiv preprint arXiv:1206.6416.
Paradis, E., J. Claude, and K. Strimmer (2004). APE: analyses of phylogenetics and evolution
in R language. Bioinformatics 20 (2), 289–290.
Pedersen, A. B., S. Altizer, M. Poss, A. A. Cunningham, and C. L. Nunn (2005). Patterns of
host specificity and transmission among parasites of wild primates. International Journal
for Parasitology 35 (6), 647–657.
Regazzini, E., A. Lijoi, and I. Prünster (2003). Distributional results for means of normalized
random measures with independent increments. Annals of Statistics, 560–585.
Ricci, F., L. Rokach, and B. Shapira (2011). Introduction to Recommender Systems Hand-
book. Springer.
Robert, C. and G. Casella (2013). Monte Carlo statistical methods. Springer Science &
Business Media.
Salakhutdinov, R. and A. Mnih (2011). Probabilistic matrix factorization. In NIPS, Vol-
ume 20, pp. 1–8.
Sato, K. (1999). Lévy processes and infinitely divisible distributions. Cambridge University
Press, Cambridge.
Spiegelhalter, D. J., A. P. Dawid, S. L. Lauritzen, and R. G. Cowell (1993). Bayesian analysis
in expert systems. Statistical Science 8 (3), 219–247.
Stingo, F. and G. M. Marchetti (2015). Efficient local updates for undirected graphical
models. Statistics and Computing 25 (1), 159–171.
Swendsen, R. H. and J.-S. Wang (1987). Nonuniversal critical dynamics in Monte Carlo
simulations. Physical Review Letters 58, 86–88.
Tank, A., N. Foti, and E. Fox (2015). Bayesian structure learning for stationary time series.
arXiv preprint arXiv:1505.03131 .
Tarjan, R. E. and M. Yannakakis (1984). Simple linear-time algorithms to test chordality of
graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM
Journal on Computing 13 (3), 566–579.
Teh, Y. W. and D. Gorur (2009). Indian buffet processes with power-law behavior. In
Advances in Neural Information Processing Systems, pp. 1838–1846.
Thomas, A. and P. J. Green (2009). Enumerating the junction trees of a decomposable
graph. Journal of Computational and Graphical Statistics 18 (4), 930940.
Veitch, V. and D. M. Roy (2015). The class of random graphs arising from exchangeable
random measures. arXiv preprint arXiv:1512.03099 .
Wang, Y. J. and G. Y. Wong (1987). Stochastic blockmodels for directed graphs. Journal
of the American Statistical Association 82 (397), 8–19.
Wardeh, M., C. Risley, M. K. McIntyre, C. Setzkorn, and M. Baylis (2015). Database of
host-pathogen and related species interactions, and their global distribution. Scientific
Data 2.
Weir, I. S. and A. N. Pettitt (2000). Binary probability maps using a hidden conditional
autoregressive Gaussian process with an application to Finnish common toad data. Journal
of the Royal Statistical Society: Series C (Applied Statistics) 49 (4), 473–484.
Wermuth, N. and S. L. Lauritzen (1983). Graphical and recursive models for contingency
tables. Biometrika 70 (3), 537–552.
Whittaker, J. (2009). Graphical Models in Applied Multivariate Statistics. Wiley Publishing.
Wiens, J. J., D. D. Ackerly, A. P. Allen, B. L. Anacker, L. B. Buckley, H. V. Cornell, E. I.
Damschen, T. Jonathan Davies, J.-A. Grytnes, S. P. Harrison, B. A. Hawkins, R. D. Holt,
C. M. McCain, and P. R. Stephens (2010). Niche conservatism as an emerging principle
in ecology and conservation biology. Ecology Letters 13 (10), 1310–1324.
Wiethoelter, A. K., D. Beltrán-Alcrudo, R. Kock, and S. M. Mor (2015). Global trends in
infectious diseases at the wildlife-livestock interface. Proceedings of the National Academy
of Sciences of the United States of America 112 (31).
Wilson, D. E. and D. M. Reeder (2005). Mammal Species of the World: A Taxonomic and
Geographic Reference (3 ed.). Baltimore, Maryland: Johns Hopkins University Press.