Causal Inference for Social Network Data
Elizabeth L. Ogburn∗, Oleg Sofrygin†, Iván Díaz‡, and Mark J. van der Laan§
February 19, 2020
Abstract
We describe semiparametric estimation and inference for causal effects using observational data from
a single social network. Our asymptotic result is the first to allow for dependence of each observation
on a growing number of other units as sample size increases. While previous methods have generally
implicitly focused on one of two possible sources of dependence among social network observations,
we allow for both dependence due to transmission of information across network ties, and for
dependence due to latent similarities among nodes sharing ties. We describe estimation and inference
for new causal effects that are specifically of interest in social network settings, such as interventions
on network ties and network structure. Using our methods to reanalyze the Framingham Heart
Study data used in one of the most influential and controversial causal analyses of social network
data, we find that after accounting for network structure there is no evidence for the causal effects
claimed in the original paper.
Keywords: Statistical dependence, Causal inference, Social networks, Semiparametric inference
∗Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
†Kaiser Permanente Division of Research, 2000 Broadway, Oakland, CA, 94612, USA
‡Division of Biostatistics and Epidemiology, Weill Cornell Medicine, New York, NY, USA
§Department of Biostatistics, University of California Berkeley, 2121 Berkeley Way, Berkeley, CA, 94720, USA
arXiv:1705.08527v5 [stat.ME] 17 Feb 2020
1. INTRODUCTION
Many aspects of social networks are of interest to researchers, from the clustering of individuals
into communities to the probability distributions that describe the generation of new relationships
between individuals in the network. There is increasing interest in identifying and estimating
causal effects in the context of social networks, that is, causal effects that one individual’s behavior,
treatment assignment, beliefs, or health outcome could have on his or her social contacts’ behaviors,
exposures, beliefs, or health statuses. But methodology has not kept pace with interest in causal
inference using data from individuals connected in a social network, and many researchers have
resorted to using inappropriate statistical methods to analyze this new type of data. There have
been a number of high profile articles that use standard methods like generalized linear models
(GLM) and generalized estimating equations (GEE) to attempt to infer causal peer effects from
network data (e.g. Christakis and Fowler, 2007, 2008, 2010), and this work has inspired several
research programs that study peer effects using the same statistical methods (Ali and Dwyer, 2010;
Cacioppo et al., 2009; Madan et al., 2010; Rosenquist et al., 2010; Wasserman, 2013). However, these
methods have come under considerable criticism from the statistical community (Cohen-Cole and
Fletcher, 2008; Lyons, 2011; Shalizi and Thomas, 2011), in part because these statistical models are
not equipped to deal with dependence across individuals and are rarely appropriate for estimating
effects using network data (Ogburn and VanderWeele, 2014).
Recently, researchers interested in causal inference for interconnected subjects have begun to
develop methods designed specifically for the network setting (e.g. Aronow and Samii, 2013; Athey
et al., 2018; Basse and Airoldi, 2015, 2018; Basse et al., 2019; Bowers et al., 2013; Cai et al.,
2019; Eck et al., 2018; Eckles et al., 2014; Forastiere et al., 2016; Graham et al., 2010; Halloran
and Struchiner, 1995; Halloran and Hudgens, 2011; Hong and Raudenbush, 2006; Hudgens and
Halloran, 2008; Jagadeesan et al., 2017; Kao et al., 2012; Leung, 2016; Liu and Hudgens, 2014; Liu
et al., 2016; Papadogeorgou et al., 2019; Puelz et al., 2019; Rosenbaum, 2007; Rubin, 1990; Sävje
et al., 2017; Sävje, 2019; Sobel, 2006; Tchetgen Tchetgen and VanderWeele, 2012; Toulis et al.,
2018; VanderWeele, 2010). However, the inferential methods developed in this context generally
require observing multiple independent groups of units, which corresponds to observing multiple
independent networks, or else they require treatment to be randomized. Ideally, we would like to be
able to perform inference even when all observations are sampled from a single social network and
in observational settings in addition to randomized experiments. Tchetgen Tchetgen et al. (2017),
which was developed in parallel to this work, is the only other proposed solution to this problem
of which we are aware. Their approach is quite different from ours, primarily because it assumes
that the outcomes of interest comprise a single realization of a specific type of Markov random field
over the network. This corresponds to certain types of equilibrium distributions and is incompatible
with the traditional causal data-generating mechanisms that we work with in this paper, namely
causal structural equation models and directed acyclic graph (DAG) models (for a discussion of
these compatibility issues see Lauritzen and Richardson, 2002; Ogburn et al., 2018).
We build upon recent methods for causal inference from a single collection of interconnected
units when each unit is known to be independent of all but a small number of other units, with
asymptotic results relying on the number of dependent units being fixed as the total number of
units goes to infinity (van der Laan, 2014). We introduce novel causal estimands and corresponding
estimators for interventions on the network ties and structure and, as far as we are aware, provide
the first asymptotic results for this setting that allow the number of ties per node to increase as the
network grows. While previous methods (including van der Laan, 2014 and Tchetgen Tchetgen et al.,
2017) have implicitly focused on one of two possible sources of dependence among social network
observations, we allow for both dependence due to contagion or transmission of information across
network ties, and dependence due to latent similarities among nodes sharing ties. We describe
estimation and inference for causal effects that are specifically of interest in social network settings
(details about the implementation and computation of the estimation procedures can be found
in a companion paper (Sofrygin and van der Laan, 2015), written in tandem with this one). In
order to demonstrate the importance of principled methods designed to handle the complexity of
observational social network data, we reanalyze the Framingham Heart Study data used in Christakis
and Fowler (2007), which purported to find evidence that obesity is socially contagious. Our method,
which accounts for network structure and the resulting causal and statistical dependence, gives
strongly null results in contrast with the original analysis, which treated subjects as i.i.d.
In Section 2 we give some background on causal inference for social network data, discussing
briefly the relationship between causal structural equation models and network edges, the types of
statistical dependence likely to be found in social network data, and asymptotic growth. In Section
3 we present our target of inference and the identifying assumptions that we will use in the methods
that follow. We present the efficient influence function for our target parameter under the conditional
independence assumptions from van der Laan (2014). When these independence assumptions are
relaxed, this will still be an influence function for our target parameter but it may not be efficient. We
describe estimation procedures that will be efficient under the stronger independence assumptions
but still consistent and asymptotically normal under the weaker independence assumptions. In
Section 3.5 we prove our main result, which is the asymptotic normality of our estimator under
an asymptotic regime in which the number of ties per node grows with n. In Section 4 we discuss
estimation of causal effects that are specifically of interest in social network settings. Section 5
demonstrates the performance of our methods in simulations, and the data analysis in Section 6
shows how our principled methods accounting for both causal and statistical dependence undermine
the claims of a highly influential study on social contagion (Christakis and Fowler, 2007). Section
7 concludes.
2. BACKGROUND AND SETTING
2.1 Networks and structural equation models
A network is a collection of units, or nodes, and information about the presence or absence of
pairwise ties between them. The presence of a tie between two units indicates that the units share
some kind of a relationship; for example, in a social network we might define a tie to include
familial relatedness, friendship, or shared place of work. Some types of relationships are mutual, for
example familial relatedness; others, like friendship, can go in only one direction. For simplicity we
will assume all networks are undirected in what follows, but our methods are equally applicable to
directed networks. In an undirected network, the degree of a node is the number of ties it has. The
alters of node i are the nodes with which i shares ties.
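As a minimal sketch of these definitions, the snippet below computes degrees and alters from a symmetric 0/1 adjacency matrix. The four-node network and the helper names `degree` and `alters` are illustrative assumptions, not part of the paper.

```python
# Hypothetical 4-node undirected network, stored as a symmetric 0/1 adjacency matrix.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def degree(A, i):
    """Number of ties of node i (row sum of the adjacency matrix)."""
    return sum(A[i])

def alters(A, i):
    """Indices of the nodes with which node i shares a tie."""
    return [j for j, tie in enumerate(A[i]) if tie == 1]

degrees = [degree(A, i) for i in range(len(A))]
alters_of_2 = alters(A, 2)
```

Here node 2 has three alters (nodes 0, 1, and 3), so its degree is 3. Nodes are indexed from 0 in the code, whereas the text indexes subjects from 1.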
Underlying inquiries into causal effects across network nodes is a representation of the network as
a structural equation model. Consider a network of n subjects, indexed by i, with binary undirected
ties Aij ≡ I {subjects i and j share a tie}. The matrix A with entries Aij is the adjacency matrix
for the network. Associated with each subject is a vector of random variables, Oi, including an
outcome Yi, covariates Ci, and an exposure or treatment variable Xi, all possibly indexed by time t.
In numerous applications across the social, political, and health sciences, researchers are interested
in ascertaining the presence of and estimating causal interactions across alter-ego pairs. Is there
interference, i.e. does the treatment of subject i have a causal effect on the outcome of subject j
when i and j share a network tie? Is there peer influence, i.e. does the outcome of subject i at time
t have a causal effect on a future outcome of subject j when i and j are adjacent in the network?
These inquiries can be formalized with the help of a causal structural equation model, informed by
the network.
A structural equation model is a system of equations of the form yi = fi [pai(Y ), εi], where
pai(Y ), the set of parents of Y , is a collection of variables that are causes of Y for subject i, and
εi is an error term that may include omitted causes of Y . In general Ci and Xi will be included
in pai(Y ) (Pearl, 2000). When causal inference is performed on network data, the network ties
inform which variables are to be included in pai(Y ). For example, if interference might be present,
then the collection of treatment variables for i’s alters, {Xj : Aij = 1}, must be included in the set
pai(Y ) (Sussman and Airoldi, 2017). If contagion might be present then {Yj,t−k : Aij = 1} must be
included in the set pai(Yt), where t indexes time and k is an outcome-specific lag time such that no
causal effect can be transmitted from one person to another in less than k time steps (Ogburn and
VanderWeele, 2013).
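The way network ties determine the parent set can be sketched as follows. This is an illustration only: the function name `parents_of_outcome`, the tuple labels, and the three-node network are assumptions made for the example, and only the interference case (alters' treatments as parents) is shown.

```python
def parents_of_outcome(A, i, interference=True):
    """Labels of the variables in pa_i(Y) for subject i.

    Always includes i's own covariates and treatment; if interference
    may be present, also includes the treatments of i's alters.
    """
    parents = [("C", i), ("X", i)]
    if interference:
        parents += [("X", j) for j, tie in enumerate(A[i]) if tie == 1 and j != i]
    return parents

# Hypothetical path network 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

pa_1 = parents_of_outcome(A, 1)
pa_1_no_interference = parents_of_outcome(A, 1, interference=False)
```

For the middle node, the parent set grows from its own covariates and treatment to include the treatments of both alters once interference is allowed.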
It is important that the network be completely and accurately specified; missing ties are akin
to missing components of a multidimensional treatment vector because they result in important
elements of exposure of interest being left out of the SEM. Whenever an inquiry into causal effects is
informed by a social network, measurement error in the network is tantamount to measurement error
in the exposure of interest, and missing edges or nodes may also result in unmeasured confounding.
This is obviously a huge burden on data collection in many settings, but would be straightforward
for online social networks. The network for which data is collected must be calibrated to the causal
question of interest. If we are interested in peer effects on academic achievement among elementary
school children and think that being in the same classroom is the relationship that determines
whether or not two children affect one another’s outcomes, then being in the same classroom is
the relationship that determines whether or not a network tie exists, and a network that captures
interaction during playground sports is not informative or useful. In other words, a tie between
nodes i and j represents the possibility of a causal effect of an element of Oi on an element of Oj
at a later time, and vice versa. These issues have not been made explicit in much of the existing
literature on causal inference for network data; equating a network with the underlying SEM can
help to make them precise.
2.2 Networks and dependence
Perhaps the greatest challenge and barrier to causal and statistical inference using observations
from a single, interconnected social network is dependence among observations. The literature on
statistics for dependent data is vast and multifaceted, but very little has been written about the
dependence that arises when observations are sampled from a single network. Most of the literature
on dependent random variables assumes that the domain from which observations are sampled (e.g.
time or geographic space) has an underlying Euclidean geometry. The principles behind asymptotic
results in the Euclidean dependence literature are simple and intuitive. They rely on a combination
of stationarity assumptions, i.e. assumptions that certain features of the data generating process
do not depend on an observation’s location in the sample domain, and assumptions that bound the
nature and the amount of dependence in the data. Most frequently these are mixing assumptions,
which describe the decay of the correlation between observations as a function of the distance
between them. Intuitively, in order to extract an increasing amount of information from a growing
sample of dependent observations, old observations must be predictive of new observations, which
is ensured by stationarity assumptions, and the amount of independence in the sample must grow
faster than the amount of dependence, which is ensured by mixing conditions.
This literature is not immediately applicable to the network setting. Roughly, this is due to the
difference between Euclidean and network topology. While it is possible to embed a network in $\mathbb{R}^d$
in such a way that preserves distances, to do so is to allow d to increase as n increases. Euclidean
dependence results generally require d to be fixed, implying that, as new observations are sampled
at the boundary of a Euclidean domain, the average and maximum pairwise distance between
observations increases. Networks, on the other hand, often do not have a clear boundary to which
we can add observations in such a way that ensures growth in the sample domain. In a large sample
with Euclidean dependence, most observations will be distant from most other observations. This is
not necessarily the case in networks. The maximum distance between two nodes can be small even
in very large networks, and even if the maximum distance between two nodes is large, there may be
many nodes that are close to one another. Therefore, mixing conditions do not necessarily result
in more independence than dependence in a large sample from a network. Research indicates that
social networks generally have the small-world property (sometimes referred to as the “six degrees
of separation” property), meaning that the average distance between two nodes is small (Watts and
Strogatz, 1998). Therefore distances in real-world networks may grow slowly with sample size. Of
course some types of networks, e.g. lattices, embed in $\mathbb{R}^d$ as n grows, but these are generally trivial
cases that are not useful for naturally occurring networks like social networks.
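The contrast between the two topologies can be illustrated with two deliberately extreme toy graphs. In this sketch, a path graph stands in for a one-dimensional Euclidean-like domain and a star graph stands in for a network whose maximum distance stays small as n grows. Neither is meant as a realistic social network, and all function names are assumptions for the example.

```python
from collections import deque

def distances_from(adj, source):
    """Breadth-first search: shortest path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Maximum shortest-path distance over all pairs (graph assumed connected)."""
    return max(max(distances_from(adj, s).values()) for s in adj)

def path_graph(n):
    """Nodes 0..n-1 arranged on a line; each node is tied to its immediate neighbors."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def star_graph(n):
    """Node 0 is tied to every other node; no other ties."""
    adj = {0: list(range(1, n))}
    for i in range(1, n):
        adj[i] = [0]
    return adj

path_diameters = [diameter(path_graph(n)) for n in (10, 100)]
star_diameters = [diameter(star_graph(n)) for n in (10, 100)]
```

On the path, the maximum distance grows linearly with n, so most pairs end up far apart; on the star it stays at 2 no matter how many nodes are added.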
Dependence in networks is of two varieties–latent variable dependence and dependence due to
direct transmission–each with its own implications for inference. In the literature on spatial and
temporal dependence, dependence is often implicitly assumed to be the result of latent traits that
are more similar for observations that are close in Euclidean distance than for distant observations.
This type of dependence is likely to be present in many network contexts as well. In networks, edges
present opportunities to transmit traits or information, and this direct transmission is an important
additional source of dependence that depends on the underlying network structure.
Latent variable dependence will be present in data sampled from a network whenever observations
from nodes that are close to one another are more likely to share unmeasured traits than are
observations from distant nodes. Homophily, or the tendency of people who share similar traits to
form network ties, is a paradigmatic example of latent variable dependence. If the outcome under
study in a social network has a genetic component, then we would expect latent variable dependence
due to the fact that family members, who share latent genetic traits, are more likely to be close in
social distance than people who are unrelated. If the outcome were affected by geography or physical
environment, latent variable dependence could arise because people who live close to one another
are more likely to be friends than those who are geographically distant. Of course, these traits can
create dependence whether they are latent or observed. But if they are observed then conditioning
on them renders observations independent; therefore the methodological challenges are greater when
they are latent. Just like in the spatial and temporal dependence context, there is often little reason
to think that we could identify, let alone measure, all of these sources of dependence. In order to
make any progress towards valid inference in the presence of latent trait dependence, some structure
must be assumed, namely that the range of influence of the latent traits is primarily local in the
network and that any long-range effects are negligible. In a structural equation model, latent trait
dependence would be captured by dependence among the error terms across subjects.
Dependence due to direct transmission will be present whenever one subject’s treatments, outcomes,
or covariates affect other subjects’ treatments, outcomes, or covariates. This kind of dependence,
which arises from causal effects between subjects, has structure lacking in latent trait
dependence. Figure 1 depicts contagion in a network of three individuals. This diagram is the
directed acyclic graph representation (Pearl, 1995; Ogburn and VanderWeele, 2013) of the following
structural equation model: At each time t, $Y_i^t$ is affected by i’s own past outcomes and those of i’s
social contacts. Individual 2 shares ties with 1 and 3 but individuals 1 and 3 are not connected.
This structure implies conditional independences: $Y_1^{t-2} \perp Y_3^{t} \mid Y_2^{t-1}$ because any transmission from
individual 1 to 3 must pass through 2; $Y_1^{t-2} \perp Y_2^{t-2}$ because information cannot be transmitted
instantaneously. If outcomes are observed at closely spaced time intervals then these conditional
independences can be harnessed for inference. There is no reason to think that any such conditional
independences would hold with latent variable dependence. If some time points are not observed
then the structure is lost and dependence due to direct transmission is indistinguishable from latent
variable dependence.
In this paper, we accommodate both dependence due to direct transmission and dependence
due to latent traits. We assume that both kinds of dependence are limited to dependence neighborhoods
determined by the underlying social network: each subject, or node, i can directly transmit
information, outcomes, or exposures to the nodes with which i shares a network tie, and each node
i can share latent traits with the nodes with which i shares a network tie or a mutual connection.
That a subject can only transmit to his or her immediate social contacts may be a reasonable
assumption (indeed, our definition of network ties makes this true), but it is likely unrealistic to
assume that latent variable dependence only affects nodes at a distance of one or two ties, as we
assume throughout. Furthermore, harnessing the structure of direct transmission requires detailed
data that may not be available in practice in many settings. This represents a first step towards
valid statistical and causal inference under more realistic assumptions than have been required by
previous work, but future work is needed to address more realistic (i.e. longer-range) forms of latent
variable dependence.
Figure 1: A simple example of dependence due to direct transmission.
3. METHODS
In this section we describe estimation of and inference about the causal effect of a treatment or
exposure, X, including randomized and non-randomized exposures subject to interference. The
approach we describe below is different from traditional approaches to interference in that it is
justified when partial interference does not hold. As far as we are aware, this is the first approach
to interference that references an asymptotic regime in which the number of ties for a given individual
may grow with sample size. The estimating procedure that we describe in this section is based on
van der Laan (2014), but we generalize the results to a broader class of causal effects and to more
general and pervasive forms of dependence among observations. The conditions under which the
resulting estimators are consistent and asymptotically normally distributed are different and weaker
here than those in van der Laan (2014).
For the remainder of Section 3, we describe consistent and asymptotically normal (CAN)
estimators of causal effects under two different sets of assumptions. One set of assumptions allows
dependence due to direct transmission but not latent variable dependence, as in van der Laan
(2014); under this set of assumptions our estimators inherit the efficiency properties from van der
Laan (2014). The other set of assumptions allows dependence due to direct transmission and latent
variable dependence; under this set of assumptions our estimators are CAN but may not be efficient.
Our main result, given in Section 3.5, is asymptotic normality under an asymptotic regime in which
the number of ties for a given individual may grow with sample size.
In Section 3.6 we describe statistical inference for the estimators introduced in Section 3.2. We
consider two different classes of estimators: estimators that marginalize over baseline covariates and
estimators that condition on baseline covariates. In some cases, variance estimation is facilitated
by conditioning on covariates. Under the assumptions encoded in the structural equation model
in Section 3.1, the conditional estimator is in fact consistent for the marginal estimand. However,
conditional estimators have smaller variance and inference about the conditional estimand cannot be
interpreted as inference about the marginal estimand. All of our estimands and estimators condition
on the observed network as given by the adjacency matrix A. A table summarizing the different
assumptions and properties can be found in the Appendix.
We focus throughout on single time-point treatments. Longitudinal interventions are also possible
under the theory introduced here but we leave the details for future work. We state our results
under the assumption that all variables take values on discrete sets. Analogous results are valid for
other types of random variables: it is straightforward to extend our notation and central limit theorem
to continuous covariates and outcomes (though all efficiency results require discrete covariates),
but continuous treatments are more complicated (see van der Laan, 2014).
3.1 Structural equation model
Let $K_i = \sum_{j=1}^{n} A_{ij}$, that is, $K_i$ is the degree of node i, or the number of individuals sharing a
tie with individual i. The degree of subject i and the degrees of i’s alters may be included in the
covariate vector Ci. We define Y = (Y1, ..., Yn) and C and X analogously. We use a structural
equation model to define the causal effects of interest, as in Section 2, but note that analogous
definitions may be achieved within the potential outcome framework (Pearl, 2012).
We assume that the data are generated by sequentially evaluating the following set of equations:
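\begin{align*}
C_i &= f_C[\varepsilon_{C_i}], & i = 1, \ldots, n \\
X_i &= f_X[\{C_j : A_{ij} = 1\}, \varepsilon_{X_i}], & i = 1, \ldots, n \\
Y_i &= f_Y[\{X_j : A_{ij} = 1\}, \{C_j : A_{ij} = 1\}, \varepsilon_{Y_i}], & i = 1, \ldots, n, \tag{1}
\end{align*}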
where fC , fX , and fY are unknown and unspecified functions and εi = (εCi , εXi , εYi) is a vector of
exogenous, unobserved errors for individual i. This set of equations corresponds to observational
settings when fX depends on C and to randomized settings when it does not. Both X and Y may
depend on A only through C. Time ordering is a fundamental component of a structural causal
model. For example, we assume that C is first drawn for all units, so that, in addition to Ci, the
other components of the vector C–corresponding to i’s social contacts–may affect the value of Xi.
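The sequential evaluation of these equations can be sketched as a small simulation. Everything specific here is an assumption made for illustration: the logistic forms standing in for the unknown functions, the coefficient values, the binary variables, and the function names. The uniform draws play the role of the error terms and are independent across units, which corresponds to the stronger i.i.d. error assumptions rather than to latent variable dependence. Each unit's own covariate and treatment are passed explicitly alongside those of its alters.

```python
import math
import random

def expit(z):
    """Logistic function, used to turn a linear score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate(A, seed=0):
    """Draw (C, X, Y) by evaluating the three structural equations in order."""
    rng = random.Random(seed)
    n = len(A)
    nbrs = [[j for j in range(n) if A[i][j] == 1] for i in range(n)]

    # Step 1: baseline covariates are drawn first, for all units.
    C = [int(rng.random() < 0.5) for _ in range(n)]

    # Step 2: treatment depends on own and alters' covariates (observational setting).
    X = []
    for i in range(n):
        p = expit(-0.5 + 1.0 * C[i] + 0.5 * sum(C[j] for j in nbrs[i]))
        X.append(int(rng.random() < p))

    # Step 3: outcome depends on own and alters' treatments and covariates.
    Y = []
    for i in range(n):
        score = (-1.0 + 0.8 * X[i] + 0.4 * sum(X[j] for j in nbrs[i])
                 + 0.3 * C[i] + 0.2 * sum(C[j] for j in nbrs[i]))
        Y.append(int(rng.random() < expit(score)))
    return C, X, Y

# Hypothetical path network 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
C, X, Y = simulate(A, seed=1)
```

Dropping the covariate terms from the treatment step would give a randomized setting, since the treatment equation would then no longer depend on C.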
In addition, nonparametric identification of causal effects requires the following assumptions on
the error terms from the SEM:
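\begin{align*}
& (\varepsilon_{X_1}, \ldots, \varepsilon_{X_n}) \perp (\varepsilon_{Y_1}, \ldots, \varepsilon_{Y_n}) \mid C, \tag{A1} \\
& \varepsilon_{X_1}, \ldots, \varepsilon_{X_n} \mid C \text{ and } \varepsilon_{Y_1}, \ldots, \varepsilon_{Y_n} \mid C, X \text{ are identically distributed}, \tag{A2a} \\
& \varepsilon_{X_i} \perp \varepsilon_{X_j} \mid C \text{ and } \varepsilon_{Y_i} \perp \varepsilon_{Y_j} \mid C, X \text{ for } i, j \text{ s.t. } A_{ij} = 0 \text{ and } \nexists k \text{ with } A_{ik} = A_{kj} = 1, \tag{A2b} \\
& \varepsilon_{C_i},\ i = 1, \ldots, n, \text{ are identically distributed, and} \tag{A3a} \\
& \varepsilon_{C_i} \perp \varepsilon_{C_j} \text{ for } i, j \text{ s.t. } A_{ij} = 0 \text{ and } \nexists k \text{ with } A_{ik} = A_{kj} = 1. \tag{A3b}
\end{align*}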
Assumption (A1) ensures that C suffices to control for confounding of the effect of X on Y. It
implies that any latent variable dependence affects X and Y separately; in general a latent variable
that affected X and Y jointly would constitute a violation of this assumption. Assumptions (A2b)
and (A3b) ensure that any unmeasured sources of dependence–i.e. latent trait dependence–only
affect pairs of observations up to a distance of two network ties–that is, friends or friends-of-friends.
Assumption (A3) can be omitted if attention is restricted to causal effects conditional on C.
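To make the reach of (A2b) and (A3b) concrete, the sketch below marks which pairs of nodes are allowed to have dependent error terms, namely pairs that share a tie or have a mutual connection. The function name `may_be_dependent` and the four-node path network are assumptions for the example.

```python
def may_be_dependent(A):
    """D[i][j] is True when i and j share a tie or have at least one mutual connection."""
    n = len(A)
    D = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shares_tie = A[i][j] == 1
            mutual = any(A[i][k] == 1 and A[k][j] == 1 for k in range(n))
            D[i][j] = shares_tie or mutual
    return D

# Hypothetical path network 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
D = may_be_dependent(A)
```

In this network nodes 0 and 2 may share latent traits through their mutual connection 1, while the errors of nodes 0 and 3, which are three ties apart, are assumed independent.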
Although our main result, given in Theorem 1 below, holds for inference in the SEM defined by
assumptions (A1)–(A3b), some asymptotic properties are guaranteed only when stronger versions
of assumptions (A2b) and (A3b) hold. We therefore introduce alternative assumptions
εX1 , ..., εXn | C are i.i.d. and εY1 , ..., εYn | C,X are i.i.d., and (A4)
εCi , i = 1, ..., n, are i.i.d. (A5)
These assumptions are consistent with dependence due to direct transmission but not latent variable
dependence.
Note that, although the variance-covariance structure of the SEM given in (1) is affected by
the dependence allowed in (A2b) and (A3b), the mean structure is unaltered by the choice of
assumptions (A2) and (A3) or (A4) and (A5). This rules out the possibility that any latent sources
of dependence introduce confounding, and in particular while it allows limited forms of homophily
to induce dependence it rules out confounding due to homophily, which is a strong and often unrealistic
assumption (Shalizi and Thomas, 2011). Therefore, any estimator that is unbiased under (A4) and
(A5) will remain unbiased when these are relaxed to (A2) and (A3). In Section 3.2 we discuss
nonparametric identification of causal parameters, which is agnostic to the choice of the weaker or
stronger independence assumptions. In Section 3.3 we derive estimators under assumptions (A1),
(A4), and (A5)–that is, in the presence of dependence due to direct transmission but not latent
variable dependence. We use the stronger assumptions because the resulting model is amenable to
familiar tools for deriving semiparametric estimators. In Section 3.5 we prove that the estimators
derived under assumptions (A1), (A4), and (A5) are CAN under the weaker set of assumptions
(A1)–(A3b). In Section 3.6 we discuss inference under each of the two sets of assumptions.
3.2 Definition and nonparametric identification of causal effects
In principle it is possible to perform statistical inference in the model defined by assumptions
(A1)-(A3b) or by assumptions (A1), (A4), and (A5). However, in practice we may need to
make dimension-reducing assumptions on the forms of $f_X$ and $f_Y$. This is done by considering
summary functions $s_X$ and $s_Y$ and random variables $W_i = s_{X,i}(\{C_j : A_{ij} = 1\})$ and
$V_i = s_{Y,i}(\{C_j : A_{ij} = 1\}, \{X_j : A_{ij} = 1\})$ such that the model may be written as
\begin{align*}
C_i &= f_C[\varepsilon_{C_i}], & i = 1, \ldots, n \\
X_i &= f_X[W_i, \varepsilon_{X_i}], & i = 1, \ldots, n \\
Y_i &= f_Y[V_i, \varepsilon_{Y_i}], & i = 1, \ldots, n.
\end{align*}
For example, $s_{X,i}(\{C_j : A_{ij} = 1\}) = \left(C_i, \sum_{j : A_{ij} = 1} C_j\right)$ implies that the treatment status of unit i
only depends on i’s own covariate value and on the sum of the covariate values of the units sharing a
tie with i. Analogously, $s_{Y,i}(\{C_j : A_{ij} = 1\}, \{X_j : A_{ij} = 1\}) = \left(C_i, \sum_{j : A_{ij} = 1} C_j, X_i, \sum_{j : A_{ij} = 1} X_j\right)$
is an example of a summary function for $f_Y$. For convenience we use the notation sX,i(C) and
sY,i(C,X) below; however, this notation should not undermine the important fact that Wi can
only depend on the subset {Cj : Aij = 1} and Vi can only depend on the subsets {Cj : Aij = 1}
and {Xj : Aij = 1} of C and X, as these are the only components of C and X that are parents of
X and Y, respectively, in the network-as-structural-causal-model. For notational convenience, in
what follows we augment the observed data random vector with Vi and Wi, recognizing that these
are deterministic functionals of Ci and Xi, defined by sY,i and sX,i, and are therefore technically
redundant.
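The example summary functions described above (own value plus the sum over alters) can be sketched in a few lines. The function names `s_X` and `s_Y`, the three-node network, and the covariate and treatment values are assumptions chosen for illustration.

```python
def s_X(i, C, A):
    """W_i: own covariate and the sum of the alters' covariates."""
    nbrs = [j for j in range(len(A)) if A[i][j] == 1]
    return (C[i], sum(C[j] for j in nbrs))

def s_Y(i, C, X, A):
    """V_i: own covariate, sum of alters' covariates, own treatment, sum of alters' treatments."""
    nbrs = [j for j in range(len(A)) if A[i][j] == 1]
    return (C[i], sum(C[j] for j in nbrs), X[i], sum(X[j] for j in nbrs))

# Hypothetical path network 0 - 1 - 2 with made-up covariates and treatments.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
C = [1, 0, 1]
X = [0, 1, 1]

W = [s_X(i, C, A) for i in range(3)]
V = [s_Y(i, C, X, A) for i in range(3)]
```

Whatever the degree of a node, these summaries have fixed dimension (two entries for W and four for V), which is the dimension reduction the summary functions are meant to provide.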
A hypothetical intervention that deterministically sets $X_i$ to a user-given value $x_i^*$ for $i = 1, \ldots, n$
is given by
\begin{align*}
C_i &= f_C[\varepsilon_{C_i}], & i = 1, \ldots, n \\
X_i &= x_i^*, & i = 1, \ldots, n \\
Y_i(x^*) &= f_Y[V_i(x^*), \varepsilon_{Y_i}], & i = 1, \ldots, n,
\end{align*}
where $x^* = (x_1^*, \ldots, x_n^*)$. Here $Y_i(x^*)$ denotes the potential or counterfactual outcome of individual
i in a hypothetical world in which $P(X = x^*) = 1$. Analogously, $V_i(x^*) = s_{Y,i}(C, x^*)$ is a
counterfactual random variable in a hypothetical world in which $P(X = x^*) = 1$. Note that, although
$V_i(x^*)$ is counterfactual, its value is determined by the observed realization of C and by the
user-specified value $x^*$, and it is therefore known. In order to streamline notation as we describe
increasingly complex interventions, we denote the counterfactual variables $V_i(x^*)$ and $Y_i(x^*)$ by $V_i^*$
and $Y_i^*$, respectively.
The causal parameter of interest throughout is the expected average potential outcome in this
same hypothetical world, i.e. $E[\bar{Y}_n^*]$, where $\bar{Y}_n^* = \frac{1}{n} \sum_{i=1}^{n} Y_i^*$. This parameter is conditional on
the observed adjacency matrix and, unlike typical causal parameters in i.i.d. settings, is allowed
to depend on n. Causal effects are contrasts for two different hypothetical intervention vectors x∗.
The overall effect of treatment (Hudgens and Halloran, 2008; Tchetgen Tchetgen and VanderWeele,
2012), for example, contrasts the intervention in which everyone is treated to the intervention in
which nobody is treated. In Section 4 we discuss other types of causal effects of interest in social
network settings.
We are now ready to define notation that we will use throughout the remainder of the paper
for functionals of the distribution of O. Let $p_C(c) = P(C = c)$, $g(x \mid w) = P(X = x \mid W = w)$,
$g_i(x \mid w) = P(X_i = x \mid W_i = w)$, $p_Y(y \mid v) = P(Y = y \mid V = v)$, and $p_{Y,i}(y \mid v) = P(Y_i = y \mid V_i = v)$.
Define the two marginal distributions $h_i(v) = P(V_i = v)$ and $h_{i,x^*}(v) = P(V_i^* = v)$, noting that
both $h_i$ and $h_{i,x^*}$ are determined by $g$ and $p_C$ and are therefore observed data quantities. Finally,
$m(v) = \sum_{y} y \, p_Y(y \mid v)$ is the conditional expectation of Y given V = v.
In addition to assumptions (A1)-(A3b) or (A1), (A4), and (A5), identification of $E[\bar{Y}_n^*]$ requires
the positivity assumption that, for all c in the support of C,
$$P(V = v \mid C = c) > 0 \text{ for all } v \text{ in the range of } V_i^*. \tag{A6}$$
This assumption states that, within levels of C, the values of V determined by the hypothetical
intervention x∗ have positive probability under the observed-data-generating distribution. Now the
causal parameter $E[\bar{Y}^*_n]$ is identified by
$$\psi = \frac{1}{n}\sum_{i=1}^{n} E[m(V^*_i)] = \frac{1}{n}\sum_{i=1}^{n}\sum_{v} m(v)\, h_{i,x^*}(v). \tag{2}$$
This identification result is equivalent to
$$\psi = \frac{1}{n}\sum_{i=1}^{n}\sum_{c} m(s_{Y,i}(c, x^*))\, p_C(c). \tag{3}$$
From (3), it is clear that the conditional causal parameter $E[\bar{Y}^*_n \mid C = c]$ is identified by $\frac{1}{n}\sum_{i=1}^{n} m(s_{Y,i}(c, x^*))$.
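The plug-in form of the conditional parameter can be sketched in a few lines of Python. This is only an illustration: the summary function `summary_sY` (own treatment, own covariate, number of treated alters) and the regression function `m` are hypothetical choices made for the example, not the specification used in the paper.

```python
import numpy as np

def summary_sY(A, C, x):
    """Hypothetical s_{Y,i}: own treatment, own covariate, number of treated alters."""
    treated_alters = A @ x
    return np.column_stack([x, C, treated_alters])

def plug_in_conditional(A, C, x_star, m):
    """Evaluate (1/n) * sum_i m(s_{Y,i}(c, x*)) at the observed covariates c."""
    V_star = summary_sY(A, C, x_star)
    return float(np.mean(m(V_star)))

# Toy network: a path 0 - 1 - 2. The function m is assumed known here.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
C = np.array([0.0, 1.0, 0.0])
m = lambda V: 0.1 + 0.2 * V[:, 0] + 0.1 * V[:, 1] + 0.05 * V[:, 2]

psi_all_treated = plug_in_conditional(A, C, np.ones(3), m)
psi_none_treated = plug_in_conditional(A, C, np.zeros(3), m)
overall_effect = psi_all_treated - psi_none_treated
```

The last line is the overall-effect contrast described above: everyone treated versus nobody treated, both evaluated on the observed network.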
3.3 Estimation
Estimation of and inference about $E[\bar{Y}^*_n]$ requires a statistical model M for the distribution of the
observed data P (O). That is, M is a collection of distributions over O of which one element is
the true data-generating distribution. Our target of inference is a pathwise differentiable mapping
Ψ : M → R such that ψ is Ψ(P ), the mapping evaluated at the true data-generating distribution.
Under assumptions (A1), (A4), and (A5) the probability distribution of the observed data may be
factorized as
$$P(O = o) = P(C = c)\, g(x|w)\, p_Y(y|v), \tag{4}$$
suggesting that M requires three components: a model for pC , a model for g, and a model for
P (Y |V ). Furthermore, the identification results in (2) and (3) indicate that ψ depends on P (Y |V )
only through m. The empirical distribution $\hat{p}_C$ can be used throughout to nonparametrically estimate
pC, but, when C is high-dimensional, g and m cannot be nonparametrically estimated at
rates of convergence that are fast enough to satisfy the regularity conditions of Theorem 1 (see
Appendix). Therefore, in order to define the parameter mapping we require a statistical model
M = Mg × Mm, where Mg is a collection of conditional distributions for X given W such that the
true conditional distribution is a member, and Mm is a collection of expectations of Y relative to
conditional distributions of Y given V such that the true conditional expectation of Y given V is
a member. Estimation of ψ is based on the efficient influence function for the parameter mapping
Ψ : M → R. Under assumptions (A1), (A4), and (A5), the efficient influence function, D, evaluated
at a fixed value o of O was derived by van der Laan (2014) and is given by
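$$D(o) = \sum_{j=1}^{n} \frac{1}{n}\sum_{i=1}^{n} E[m(V^*_i) \mid C_j = c_j] - \psi + \frac{1}{n}\sum_{i=1}^{n} \frac{h_{x^*}(v_i)}{h(v_i)}\{y_i - m(v_i)\}, \tag{5}$$
where $h(v_i) = \frac{1}{n}\sum_{j=1}^{n} h_j(v_i)$, $h_{x^*}(v_i) = \frac{1}{n}\sum_{j=1}^{n} h_{j,x^*}(v_i)$, $v_i = s_{Y,i}(c, x)$, and $V^*_i = s_{Y,i}(C, x^*)$. The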
influence function has expected value equal to 0 at the true ψ; this fact can be used to generate
unbiased estimating equations for ψ. van der Laan (2014) showed that estimating equations based on
this efficient influence function are doubly robust: the right hand side of Equation (5) has expected
value equal to 0 if m(·) is replaced with an arbitrary functional of V or if g(·) is replaced with an
arbitrary functional of W , as long as one of the two remains correctly specified. (Recall that g(·),
along with pC , determines hx∗(vi) and h(vi).) This implies that an estimating equation based on
Equation (5) will be unbiased for ψ if either model Mm for m(·) or model Mg for g(·) is correctly
specified, i.e. contains the truth, even if the other is not. This influence function is efficient in that, when
m(·) is correctly specified, it has the smallest variance among all influence functions in model Mg.
This sense of efficiency derives from the Convolution Theorem (Bickel et al., 1998), which holds
under local asymptotic normality (Van der Vaart, 1998; van der Laan, 2014) and therefore in our
setting.
The efficient influence function in a model that does not make any distributional assumptions
about C, that is under assumptions (A1) and (A4) only, is given in equation (6) below.
$$D'(o) = \frac{1}{n}\sum_{i=1}^{n}\left( E[m(V^*_i) \mid C = c] - \psi + \frac{h_{x^*}(v_i)}{h(v_i)}\{y_i - m(v_i)\} \right). \tag{6}$$
We use this influence function in what follows. This is also the influence function used to derive
estimators conditional on C, in which case the first two terms cancel out; we will denote the
conditional influence function by DC(o).
In the Appendix we describe a targeted maximum loss-based estimator (TMLE) of ψ; however,
all of the results that follow are equally applicable to a standard estimating equation approach. The
estimator inherits the double robustness property we described above: it will be consistent for ψ if
either the working model $\hat{g}$ for g or the working model $\hat{m}$ for m is correctly specified. The resulting
estimator remains consistent and asymptotically normal (CAN) for ψ under assumptions (A2) and (A3) instead of (A4) and (A5), and the
same procedure can be used to estimate the parameter conditional on C.
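As an illustration of the estimating equation route (not the TMLE described in the Appendix), setting the empirical version of the conditional form of Equation (6) to zero and solving for ψ gives a plug-in term plus a weighted residual correction. The sketch below assumes the fitted values and the weights $h_{x^*}(v_i)/h(v_i)$ have already been computed; the function and argument names are hypothetical.

```python
import numpy as np

def one_step_estimate(m_obs, m_star, weights, y):
    """Solve the empirical estimating equation for psi.

    m_obs   : fitted m(v_i) at the observed summaries
    m_star  : fitted m(V_i^*) at the counterfactual summaries
    weights : estimated h_{x*}(v_i) / h(v_i)
    y       : observed outcomes
    """
    plug_in = np.mean(m_star)
    correction = np.mean(weights * (y - m_obs))
    return float(plug_in + correction)

# Toy inputs, chosen only to make the arithmetic easy to follow.
m_obs = np.array([0.2, 0.5, 0.4])
m_star = np.array([0.6, 0.6, 0.6])
weights = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 1.0])

psi_hat = one_step_estimate(m_obs, m_star, weights, y)
psi_no_residual = one_step_estimate(m_obs, m_star, weights, m_obs)
```

When the outcome model fits the data exactly (`y == m_obs`), the correction vanishes and the estimate equals the plug-in term, which is one way to see the role of the second term in the influence function.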
3.4 A note on asymptotic growth
There are many complex issues surrounding asymptotic growth of networks (e.g. Diaconis and
Janson, 2007; Shalizi and Rinaldo, 2013), and a large literature on graph limits (Lovász, 2012). These
issues are largely beyond the scope of this paper, but we believe that our methods are consistent with
realistic social networks. In particular, observed social networks and models proposed for generating
social networks tend to have heavy-tailed degree distributions, with most nodes having low degree
but a non-trivial proportion of nodes having high degree, with the maximum degree dependent on
the size of the network, resulting in asymptotically sparse networks. Some researchers speculate
that the heavy right tails of social network degree distributions tend to approximately follow power
laws: $\Pr(\text{degree} = k) \sim k^{-\alpha}$ for 2 < α < 3 (Barabási and Albert, 1999; Lovász, 2012; Newman and
Park, 2003), in which case $\Pr(\text{degree} > k) = O(k^{1-\alpha})$ for any fixed k. Even if degree distributions
depart from power law distributions (Clauset et al., 2009) they are frequently incompatible with
the assumption of bounded degrees, which has been used in previous methods for inference about
observations sampled from a single social network. Our new methods are not able to accommodate
the most highly connected nodes from a power law degree sequence, but they can nevertheless
be used to perform inference about the other nodes in a network that has a power law degree
distribution (see Section 4.4).
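The tail bound quoted above follows from a standard integral comparison. As a worked illustration, assume (for the sake of the example only) that the degree probability mass function satisfies $\Pr(\text{degree} = j) \le c\, j^{-\alpha}$ for some constant c > 0, some α > 1, and all j ≥ 1. Then for k ≥ 1:

```latex
\Pr(\text{degree} > k)
  = \sum_{j=k+1}^{\infty} \Pr(\text{degree} = j)
  \le c \sum_{j=k+1}^{\infty} j^{-\alpha}
  \le c \int_{k}^{\infty} t^{-\alpha}\, dt
  = \frac{c}{\alpha - 1}\, k^{1-\alpha}.
```

For α > 2 the right-hand side is summable in k, so the mean degree is finite under this assumption even though individual degrees are unbounded.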
Our theoretical results require an asymptotic regime in which the number of nodes in the net-
work, n, goes to infinity. Formalizing asymptotic growth of network-generating models, in particular
for models with sparse limits, is an active area of research (Caron and Fox, 2017; Kolaczyk and Kriv-
itsky, 2015) and is beyond the scope of this paper. We take for granted a sequence of networks with
increasing n such that the structural equation model that specifies the distributions of covariates,
treatment, and outcome is preserved, along with key features of the network topology.
The role of the central limit theorem below is to license the use of approximate 95% confidence
intervals and normal approximations in finite samples, and as with any data-adaptive parameter
we use asymptotic arguments to show that as n → ∞, 95% confidence intervals approach nominal
coverage rates. Because our parameter of interest is conditional on A and may depend on n, it is
most natural to think of inference about the true causal parameter for the given, observed network.
However, researchers may have reason to believe that the causal parameter does not depend on n
or on A except through the distribution of C and X, in which case inference about other similar
networks may be warranted.
3.5 Asymptotic normality
In order to accommodate more realistic models of asymptotic growth in the network context, we
consider an asymptotic regime in which Ki may grow as n→∞.
Theorem 1: Let $K_{\max,n} = \max_i\{K_i\}$ for a fixed network with n nodes. Suppose that $K^2_{\max,n}/n \to 0$
as n → ∞. Under independence assumptions (A1) through (A3b), positivity assumption (A6),
and regularity conditions (see Appendix),
$$\sqrt{C_n}\,(\hat{\psi} - \psi) \xrightarrow{d} N(0, \sigma^2), \qquad n/K^2_{\max,n} \le C_n \le n.$$
The asymptotic variance of $\hat{\psi}$, $\sigma^2$, is given by the variance of the influence
curve of the estimator.
In Section 4.4, below, we discuss settings in which the conditions for this theorem fail to hold,
and ways to recover valid inference for conditional estimands in some of these settings. The proof
of Theorem 1 is in the Appendix. Broadly, the proof has two parts: first, to show that the second
order terms in the expansion of $\hat{\psi} - \psi$ are stochastically less than $1/\sqrt{C_n}$, and second, to show that
the first order terms converge to a normal distribution when scaled by a factor of order $\sqrt{C_n}$. The
proof that the second order terms are stochastically less than $1/\sqrt{C_n}$ is an extension of the empirical
process theory of Van Der Vaart and Wellner (1996) and follows the same format as the proof in
van der Laan (2014). For the proof that the first order terms converge to a normal distribution,
we rely on Stein’s method of central limit theorem proof (Stein, 1972). Stein’s method allows us
to derive a bound on the distance between our first order term (properly scaled) and a standard
normal distribution; this bound depends on the degree distribution $K_1, \dots, K_n$. We show that this
bound converges to 0 as n → ∞ under regularity conditions and our running assumption that
$K^2_{\max,n} = o(n)$.
When all nodes have the same number of ties, i.e. $K_i = K_{\max,n}$ for all i, the rate of
convergence will be given by $\sqrt{C_n} = \sqrt{n/K^2_{\max,n}}$. When $K_{\max,n}$ is bounded above as n → ∞, as in
van der Laan (2014), the rate of convergence will be $\sqrt{n}$. When $K_{\max,n} \to \infty$ but some nodes
have fewer than $K_{\max,n}$ ties, the exact rate of convergence is between $\sqrt{n/K^2_{\max,n}}$ and $\sqrt{n}$ but is
difficult or impossible to determine analytically, as it may depend intricately on the structure of the
network. The inferential procedures that we describe below do not require knowledge of the rate of
convergence.
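Although the exact rate need not be known, the bounds in Theorem 1 are easy to compute for a given network. The following sketch assumes that $K_i$ is taken to be the degree of node i in a symmetric 0/1 adjacency matrix; the function name is hypothetical.

```python
import numpy as np

def rate_bounds(A):
    """Return (K_max, lower bound n / K_max^2, upper bound n) for C_n."""
    A = np.asarray(A)
    n = A.shape[0]
    degrees = A.sum(axis=1)
    k_max = int(degrees.max())
    # With no ties at all, observations are unconnected and the bound is n.
    lower = n / k_max**2 if k_max > 0 else float(n)
    return k_max, float(lower), float(n)

# Star network on 5 nodes: node 0 is tied to everyone else.
A = np.zeros((5, 5), dtype=int)
A[0, 1:] = 1
A[1:, 0] = 1
k_max, c_lower, c_upper = rate_bounds(A)
```

In this toy star network the lower bound falls below 1, a small-scale picture of how a single highly connected node can make the guaranteed rate uninformative.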
3.6 Inference
A 95% confidence interval for ψ is given by $\hat{\psi}_n \pm 1.96\,\sigma/\sqrt{C_n}$. In practice neither σ nor $C_n$ is
likely to be known, but available variance estimation methods estimate the variance of $\hat{\psi}_n$ directly,
incorporating the rate of convergence without requiring it to be known a priori.
In principle, the variance of $\hat{\psi}$ can be estimated using the empirical average of the square of the
influence function, substituting $\hat{\psi}$ for ψ and the fitted values from the working models $\hat{g}$ and $\hat{m}$ for g
and m. Although this variance estimate may be anticonservative if one, but not both, of the working models $\hat{g}$
and $\hat{m}$ is correctly specified, using flexible or non-parametric specifications for these models increases
opportunities to estimate both consistently. However, unlike in i.i.d. settings, the expectation of
the square of the empirical version of the influence function given in Equation (5) does not reduce
to the sum of squared influence terms for each observation. Instead, it includes double sums for
all pairs of observations that are not marginally independent of one another. These terms capture
covariances between dependent observations; these extra covariance terms reflect a larger variance
and a slower rate of convergence due to dependence across observations.
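The difference between the i.i.d. variance formula and one that keeps the cross terms can be sketched as follows. This is an illustration under stated assumptions, not the estimator of Sofrygin and van der Laan (2015): `d` holds hypothetical per-unit influence function contributions, and two units are treated as potentially dependent if they share a tie or a mutual alter.

```python
import numpy as np

def dependence_matrix(A):
    """True where units i and j are the same, share a tie, or share an alter."""
    A = (np.asarray(A) > 0).astype(int)
    n = A.shape[0]
    shares_alter = (A @ A) > 0
    return np.eye(n, dtype=bool) | (A > 0) | shares_alter

def variance_estimates(d, A):
    """Variance of the mean of d: i.i.d. version and version with cross terms."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    dep = dependence_matrix(A).astype(float)
    iid_var = float(np.sum(d**2)) / n**2
    dep_var = float(d @ dep @ d) / n**2
    return iid_var, dep_var

# Four units; only units 0 and 1 share a tie.
A = np.zeros((4, 4), dtype=int)
A[0, 1] = A[1, 0] = 1
d = np.ones(4)
iid_var, dep_var = variance_estimates(d, A)
```

Note that a truncated double sum of this kind is not guaranteed to be nonnegative in small samples, which is one reason dedicated estimators are preferable in practice.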
When dependence is due to direct transmission, that is, under assumptions (A1), (A4), and
(A5), two alternative variance estimation procedures are available. One option is to estimate the
variance of the influence function D′(o) given by Equation (6). Our TMLE is based on D′(o), but
because this is the efficient influence function in a model that makes fewer assumptions than (A1),
(A4), and (A5), it has larger variance than D(o) and provides a valid (asymptotically conservative)
variance estimate even when estimation is based on D(o). For consistent and computationally
feasible estimators for the variance of D′(o) see Sofrygin and van der Laan (2015).
An alternative approach to estimate the variance of ψ under assumptions (A1), (A4), and (A5) is
to employ the following version of a parametric bootstrap, which might offer improvements in finite-
sample performance over the previously described approach. For each of B bootstrap iterations,
indexed by b = 1, . . . , B, first n covariates $C^b = (C^b_1, \dots, C^b_n)$ are sampled with replacement, then the
existing model fit $\hat{g}$ is used to sample n exposures $X^b = (X^b_1, \dots, X^b_n)$, followed by a sample
of n outcomes $Y^b = (Y^b_1, \dots, Y^b_n)$ based on the existing outcome model fit $\hat{m}$. The corresponding
bootstrap random summaries $W^b_i$ and $V^b_i$, for i = 1, . . . , n, are constructed by applying the summary
functions sX and sY to $C^b$ and $(C^b, X^b)$, respectively. This bootstrap sample is then used to
obtain the predicted values from the existing auxiliary covariate fit $(\hat{h}_{x^*}/\hat{h})(V^b_i)$, for i = 1, . . . , n,
followed by a bootstrap-based fitting of ε, and finally, evaluation of the bootstrap TMLE. Note that
the TMLE model update is the only model fitting step needed at each iteration of the bootstrap,
which significantly lowers the computational burden of this procedure. The variance estimate is then
obtained by taking the empirical variance of the bootstrap TMLE samples $\hat{\psi}^b$. Because the parametric
bootstrap relies on known or assumed independences, and because only the TMLE model (i.e. not
the full likelihood) is fit at each iteration, this procedure consistently estimates the variance of the
first order terms in the expansion of $\hat{\psi} - \psi$, and we prove in the Appendix that the higher order terms
are asymptotically negligible. However, due to dependence across observations, one must be judicious
with applications of the bootstrap. For example, the parametric bootstrap procedure described
above requires conditional independence of Xi given Wi and Yi given Vi, along with the consistent
modeling of the corresponding factors of the likelihood. It may seem natural to sample Vi directly
from its corresponding auxiliary model fit, but this is likely to result in anti-conservative variance
estimates, since the conditional independence of Vi is unlikely to hold by virtue of its construction
as a summary measure of the network.
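The sampling order of the parametric bootstrap can be sketched as below. Several simplifications are assumptions of this example rather than features of the procedure above: exposure and outcome are binary, the network A is held fixed across iterations, and the user-supplied `estimator` callable stands in for the TMLE update step. All function names are hypothetical.

```python
import numpy as np

def parametric_bootstrap_variance(C, A, g_hat, m_hat, s_X, s_Y, estimator,
                                  B=200, seed=0):
    """Empirical variance of the estimator over B parametric bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(C)
    estimates = np.empty(B)
    for b in range(B):
        C_b = C[rng.integers(0, n, size=n)]   # covariates, with replacement
        W_b = s_X(C_b, A)                     # summaries feeding the exposure model
        X_b = rng.binomial(1, g_hat(W_b))     # exposures from the fitted g
        V_b = s_Y(C_b, X_b, A)                # summaries feeding the outcome model
        Y_b = rng.binomial(1, m_hat(V_b))     # outcomes from the fitted m
        estimates[b] = estimator(C_b, X_b, Y_b, A)
    return float(np.var(estimates, ddof=1))

# Toy setup: ring network of 50 nodes with made-up "fitted" models.
n = 50
idx = np.arange(n)
A = np.zeros((n, n), dtype=int)
A[idx, (idx + 1) % n] = 1
A[(idx + 1) % n, idx] = 1
C = np.linspace(0.0, 1.0, n)

s_X = lambda C, A: C
g_hat = lambda W: np.full(len(W), 0.25)
s_Y = lambda C, X, A: np.column_stack([X, A @ X])
m_hat = lambda V: 0.2 + 0.3 * V[:, 0] + 0.1 * (V[:, 1] > 0)
estimator = lambda C, X, Y, A: Y.mean()

v1 = parametric_bootstrap_variance(C, A, g_hat, m_hat, s_X, s_Y, estimator, B=50, seed=1)
v2 = parametric_bootstrap_variance(C, A, g_hat, m_hat, s_X, s_Y, estimator, B=50, seed=1)
```

Note that V is always rebuilt from the resampled covariates and exposures through `s_Y`, never sampled directly, in line with the caution above.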
When latent variable dependence is present, that is under assumptions (A1) through (A3),
consistent and computationally feasible variance estimation procedures are not currently available
for either D′(o) or D(o), because existing methods require bootstrapping some of the observed
data. Without latent variable dependence we can take advantage of marginal and conditional
independences to employ i.i.d. or parametric bootstrap methods, but latent variable dependence
requires new methods for dependent data bootstrap. For this reason, we instead estimate the
conditional parameter with influence function DC(o). A simple plug-in estimator is available for the
variance of this influence function (see the Appendix and van der Laan, 2014).
4. EXTENSIONS
In this section we extend the estimation procedure to two causal effects of great interest in the con-
text of social networks: social contagion, or peer effects, and interventions on the network structure
itself, i.e. interventions on $A = [A_{ij} : i, j \in \{1, \dots, n\}]$ where, as above, $A_{ij} \equiv I\{\text{subjects } i \text{ and } j \text{ share a tie}\}$.
First we introduce dynamic and stochastic interventions.
4.1 Dynamic and stochastic interventions
A dynamic intervention assigns treatment as a user-specified function dX(·) of C; this corresponds to
substituting dX,i(C) for x∗i in the intervention model, definitions, and estimating procedure above.
Treatment is deterministically specified conditional on covariates but is allowed to depend
(“dynamically”) on covariates. A stochastic intervention (Muñoz and van der Laan, 2012; Haneuse
and Rotnitzky, 2013; Young et al., 2014) that replaces fX with a new, user-specified function rX
is represented by an intervention SEM that replaces the equation for Xi with X∗i = rX [W ∗i , εXi ].
The intervention changes the distribution of X but does not eliminate the stochasticity introduced
by εX. In the social network setting, stochastic interventions that change the dependence of Xi
on C and of Yi on C and X are of particular interest. For example, consider data generated
by an SEM in which fX depends on C only through $W_i = \frac{1}{|A_i|}\sum_{j:A_{ij}=1} C_j$, i.e. the mean of C
among the set of alters of i. We might be interested in the mean counterfactual outcome under
a stochastic intervention that forces fX to depend instead on $W^*_i = \max_{j:A_{ij}=1}\{C_j\}$, i.e. the
maximum value of C among the alters of i. This particular stochastic intervention modifies fX only
through W; it is represented by an intervention SEM that replaces the equation for Xi with
$X^*_i = f_X[W^*_i, \varepsilon_{X_i}]$. For each x in the support of X, Xi is set by the intervention to x with probability
$P[X = x \mid W = \max_{j:A_{ij}=1}\{C_j\}]$.
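The two summaries in this example, the mean and the maximum of C among the alters of each node, can be computed directly from the adjacency matrix. A minimal sketch, assuming a symmetric 0/1 adjacency matrix and a scalar covariate; nodes with no alters are given `nan` here, which is a choice made for the example.

```python
import numpy as np

def alter_mean(A, C):
    """W_i: mean of C over the alters of i (nan if i has no alters)."""
    A = np.asarray(A, dtype=float)
    C = np.asarray(C, dtype=float)
    deg = A.sum(axis=1)
    return np.divide(A @ C, deg, out=np.full(len(C), np.nan), where=deg > 0)

def alter_max(A, C):
    """W_i^*: maximum of C over the alters of i (nan if i has no alters)."""
    A = np.asarray(A)
    C = np.asarray(C, dtype=float)
    masked = np.where(A > 0, C[None, :], -np.inf)
    out = masked.max(axis=1)
    out[out == -np.inf] = np.nan
    return out

# Path network 0 - 1 - 2 with covariate values 1, 5, 3.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
C = np.array([1.0, 5.0, 3.0])
W = alter_mean(A, C)
W_star = alter_max(A, C)
```

For the middle node the mean of its alters' covariates is 2 while the maximum is 3, so the intervention changes the input to fX for that node only.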
We formally define the class of stochastic interventions that alter the dependence of Xi on C and
of Yi on (C,X), discuss identifying assumptions and estimation procedures, and then describe
some such interventions of particular interest. Let s∗X,i(·) and s∗Y,i(·, ·) be user-specified functionals.
They are denoted by an asterisk because they index hypothetical interventions rather than realized
data-generating mechanisms. Let $W^*_i = s^*_{X,i}(C)$ and $V^*_i = s^*_{Y,i}(C, X^*)$. We are concerned with the
class of stochastic interventions given by
$$\begin{aligned}
C_i &= f_C[\varepsilon_{C_i}], & i &= 1, \dots, n \\
X^*_i &= f_X[W^*_i, \varepsilon_{X_i}], & i &= 1, \dots, n \\
Y^*_i &= f_Y[V^*_i, \varepsilon_{Y_i}], & i &= 1, \dots, n.
\end{aligned} \tag{7}$$
This can be interpreted as an intervention where, for each x∗ in the support of X and for i = 1, ..., n,
Xi is set to x∗ with probability $P[X = x^* \mid W = s^*_{X,i}(C)]$ and Vi is set to $s^*_{Y,i}(C, x^*)$ deterministically
for each possible realization x∗. Because Y depends on X only through V, this is equivalent to an
intervention that sets Vi to v with probability $P[X \in \{x^* : s^*_{Y,i}(C, x^*) = v\} \mid W = s^*_X(C)]$, where
$s^*_X(C) = (s^*_{X,1}(C), \dots, s^*_{X,n}(C))$.
This intervention is identified under the same assumptions as the deterministic interventions
described above, with the exception of a positivity assumption that is a slight modification of (A6).
Define $\mathcal{X}^* = \{x^* : P[X = x^* \mid W = s^*_X(C)] > 0\}$ to be the set of treatment vectors x∗ that have
positive probability under the stochastic intervention defined by (7). We assume that, for all c in
the support of C,
$$\min_{v \in \mathcal{V}^*} P(V = v \mid C = c) > 0 \quad \text{for } \mathcal{V}^* = \{s^*_{Y,i}(C, x^*) : x^* \in \mathcal{X}^*\}. \tag{8}$$
That is, the conditional support of V ∗ must be included in the conditional support of V in order
for the intervention to be supported by the data. Note that, in order for this positivity assumption
to hold, the supports of s∗X(·) and s∗Y (·, ·) must be of the same dimensions as the supports of sX(·)
and sY (·, ·), respectively.
The causal parameter of interest is the expected average potential outcome under this hypothetical
intervention, $E[\bar{Y}^*_n]$. Define $h^*_i(v) = P[V^*_i = v] = P[s^*_{Y,i}(C, X^*) = v]$. Then $E[\bar{Y}^*_n]$ is
identified by
$$\begin{aligned}
\psi &= \frac{1}{n}\sum_{i=1}^{n}\sum_{c,x} E[Y_i \mid s^*_{Y,i}(c, x)]\, P[X = x \mid W = s^*_X(c)]\, p_C(c) \\
&= \frac{1}{n}\sum_{i=1}^{n} E[m(V^*_i)] = \frac{1}{n}\sum_{i=1}^{n}\sum_{v} m(v)\, h^*_i(v).
\end{aligned}$$
An influence function for ψ, evaluated at a fixed value of the observed data, o, is given by
$$D^{\dagger}(o) = \sum_{j=1}^{n} \frac{1}{n}\sum_{i=1}^{n} E[m(V^*_i) \mid C_j = c_j] - \psi + \frac{1}{n}\sum_{i=1}^{n} \frac{h^*(v_i)}{h(v_i)}\{y_i - m(v_i)\},$$
where $h^*(v_i) = \frac{1}{n}\sum_{j=1}^{n} h^*_j(v_i)$. (When h∗ is known, this is the efficient influence function under
assumptions (A4) and (A5).) Estimation of h∗ is carried out by substituting $\hat{g}$ and $\hat{p}_C$ for g and pC
in the expression
$$h^*(v) = \frac{1}{n}\sum_{j=1}^{n}\sum_{c,x} I\left(s^*_{Y,j}(c, x) = v\right) g(x \mid s^*_X(c))\, p_C(c).$$
Since $\hat{p}_C$ is an empirical distribution that puts mass one on the observed value c, the estimator $\hat{h}^*$
reduces to
$$\hat{h}^*(v) = \frac{1}{n}\sum_{j=1}^{n}\sum_{x} I\left(s^*_{Y,j}(C, x) = v\right) \hat{g}(x \mid s^*_X(C)).$$
We denote by $\hat{h}$ and $\hat{h}^*$ the corresponding estimates of h and h∗. Now the TMLE of ψ is computed
according to the steps outlined in Section 3, but with V ∗ and Y ∗ defined as immediately above.
A special case of this class of stochastic interventions intervenes only on sX , like the example
discussed above in which the intervention forces fX to depend on $W^*_i = \max_{j:A_{ij}=1}\{C_j\}$ but does
not alter the functional form of sY. $E[\bar{Y}^*_n]$ under this type of intervention is identified by
$$\begin{aligned}
\psi &= \frac{1}{n}\sum_{i=1}^{n}\sum_{c,x} E[Y_i \mid C = c, X = x]\, P[X = x \mid W = s^*_X(c)]\, p_C(c) \\
&= \frac{1}{n}\sum_{i=1}^{n} E[m(V^*_i)] = \frac{1}{n}\sum_{i=1}^{n}\sum_{v} m(v)\, h^*_i(v).
\end{aligned}$$
With V ∗i defined as sY,i(C,X∗), estimation of this class of intervention proceeds as immediately
above. The fact that X∗ is random does not affect the estimation algorithm.
4.2 Peer effects
Define $Y^0_i$ to be the outcome variable measured at a time previous to the primary outcome
measurement Yi. Peer effects are the class of causal effects of $Y^0_j$ on Yi for Aij = 1: the effects of
individuals’ outcomes on the subsequent outcomes of their alters. We can operationalize peer effects
as the effects of dynamic interventions where the counterfactual exposure for subject i is given by
a user-specified function dX(·) of $\{Y^0_j : A_{ij} = 1\}$. In order to maintain the identifying
assumptions (A2b) and (A3b), the time elapsed between $Y^0$ and Y must permit transmission only between
nodes and their immediate alters. Otherwise, if the outcome could have spread contagiously more
broadly, there will be more dependence present than our methods can account for, and also possible
confounding of the effect of $Y^0_i$ on Yj for Aij = 1 due to mutual friends.
4.3 Interventions on network structure
An intervention on the network, i.e. an intervention that adds, removes, or relocates ties in the net-
work, is a special case of a joint intervention on sX(·) and sY (·). To see this, note that the network
structure, codified by the adjacency matrix A, enters the data-generating SEM (1) only through
sX(·) and sY (·); therefore we can represent any modification to A via the corresponding modifica-
tion to sX(·) and sY (·). This represents a strong assumption; if network structure can affect Y not
through sX(·) and sY (·) then estimating these effects is more challenging (Ogburn et al., 2014; Toulis
et al., 2018). Consider an intervention that replaces the observed adjacency matrix A with a
user-specified adjacency matrix A∗. This is a stochastic intervention, with $s^*_{X,i}(C)$ replaced by
$s^{A^*}_{X,i}(C) \equiv s_{X,i}(\{C_j : A^*_{ij} = 1\})$ and $s^*_{Y,i}(C, X^*)$ by
$s^{A^*}_{Y,i}(C, X^*) \equiv s_{Y,i}(\{X^*_j : A^*_{ij} = 1\}, \{C_j : A^*_{ij} = 1\})$.
The intervention SEM differs from the data-generating SEM only in that Xi depends on the covariate
values for the individuals with whom i shares ties in the intervention adjacency matrix A∗ and Yi
depends on the counterfactual treatments and observed covariate values for those same individuals.
Interventions on summary features of the adjacency matrix can also be viewed as stochastic
interventions. Instead of replacing A with A∗, an intervention on features of the network structure
replaces A with the members of a class $\mathcal{A}^*$ of n × n adjacency matrices that share the intervention
features, stochastically according to some probability distribution $g_{\mathcal{A}^*}$ over $\mathcal{A}^*$. For example, we
might be interested in interventions that constrain the degree distribution of the network, e.g. fixing
the maximum degree to be smaller than some D. We might specify $g_{\mathcal{A}^*}(A) = \frac{1}{|\mathcal{A}^*|} I\{A \in \mathcal{A}^*\}$, giving
equal weight to each realization in the class $\mathcal{A}^*$. Effectively, this kind of intervention sets Vi to v
with probability
$$P\left[X \in \{x^* : s^{A^*}_{Y,i}(C, x^*) = v\} \mid W = s^{A^*}_X(C) \text{ for some } A^* \in \mathcal{A}^*\right],$$
where $s^{A^*}_X(C) = (s^{A^*}_{X,1}(C), \dots, s^{A^*}_{X,n}(C))$.
As with the stochastic interventions discussed in the previous section, positivity is a crucial
assumption for identifying interventions on A: the support of V ∗ must be the same as the support
of V. If replacing A with A∗ (either deterministically or as a random selection from the class A∗)
assigns to unit i a value of V that is not observed in the real data for a unit in the same C stratum
as i, then the effect of the intervention that replaces A with A∗ is not identified for unit i.
In general it will be possible to identify interventions on local but not global features of network
structure. Examples of local features of network structure include the degree of subject i and local
clustering around subject i: they depend on A only through subject i and subject i’s immediate
contacts. A local clustering coefficient for node i can be defined as the proportion of potential
triangles that include i as one vertex and that are completed, or the number of pairs of neighbors
of i who are connected divided by the total number of pairs of neighbors of i (Newman, 2009).
This measure of triangle completion captures the extent to which “the friend of my friend is also
my friend”: triangle completion is high whenever two subjects who share a mutual contact are more
likely to themselves share a tie than are two subjects chosen at random from the network. Positivity
could hold if, within each level of C, subjects were observed to have a wide range of degrees and
of triangle completion among their contacts. In contrast with degree and local clustering, network
centrality is a node-specific attribute that nevertheless depends on the entire network structure. It
captures the intuitive notion that some nodes are central and some nodes are fringe in any given
network. It can be measured in many different ways, based, for example, on the number of network
paths that intersect node i, on the probability that a random walk on the network will intersect
node i, or on the mean distance between node i and the other nodes in the network (see Chapter 7 of
Newman, 2009 for a comprehensive discussion of these and other centrality measures). Centrality is
given by a univariate measure for each node in a network, but each node’s measure depends crucially
on the entire graph. In reality it is not generally possible to intervene on centrality without altering
the entire adjacency matrix A, and the positivity assumption is unlikely to hold.
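The two local features mentioned above, degree and the local clustering coefficient, depend on A only through a node's immediate neighborhood and are simple to compute. A minimal sketch, assuming a symmetric 0/1 adjacency matrix with zero diagonal; nodes with fewer than two neighbors are assigned a coefficient of 0 here, which is a convention chosen for the example.

```python
import numpy as np

def local_clustering(A):
    """Connected pairs of neighbors of i divided by all pairs of neighbors of i."""
    A = (np.asarray(A) > 0).astype(int)
    deg = A.sum(axis=1)
    # diag(A^3)_i counts closed walks of length 3 from i: twice the triangles at i.
    triangles = np.diag(A @ A @ A) / 2
    possible = deg * (deg - 1) / 2
    return np.divide(triangles, possible,
                     out=np.zeros(len(deg)), where=possible > 0)

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (0, 3)]:
    A[i, j] = A[j, i] = 1
degree = A.sum(axis=1)
clustering = local_clustering(A)
```

Node 0 has three neighbors but only one connected pair among them, so its coefficient is 1/3, while nodes 1 and 2 each have a single, connected pair of neighbors.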
4.4 Too many friends, too much influence
The conditions of Theorem 1 will be violated for any asymptotic regime in which the degree of one or
more nodes grows at a rate equal to or faster than $\sqrt{n}$. This is problematic because social networks
frequently have a small number of “hubs”, that is, nodes with very high degree (Newman, 2009),
and the occurrence of hubs is a feature of many of the network-generating models that have been
proposed for social networks. When a small number of individuals wield influence over a significant
portion of the rest of the population, two problems arise for statistical inference. First, the number
of hubs may stay small as n increases. If the hubs are systematically different from the rest of the
population, then a fixed or slowly growing number of hubs would not allow for consistent inference
about this distinct subpopulation. Second, and more importantly, the sweeping influence of hubs
creates dependence among all of the influenced nodes that undermines inference. Our methods rely
on the independence of Yi and Yj whenever nodes i and j do not share a tie or a mutual alter.
When hubs are present, a significant proportion of nodes will share a connection to one of these
hubs, undermining our methods.
We can recover valid inference using our methods if we condition on the hubs, treating them
as features of the background network environment rather than as observations. This results in
different causal effects or statistical estimands, as all of our inference is conditional on the identity
and characteristics of the hubs. Imagine a social network comprised of the residents of a city in which
a cultural or political leader is connected to almost all of the other nodes. It may be impossible
to disentangle the influence of this leader, which affects every other node, from other processes
simultaneously occurring among the other residents of the city. It will certainly be impossible to
statistically learn about the hub, as the sample size for the hub subgroup is 1. But it may make
sense to consider the hub as a feature of the city rather than a member of the network. We could
then learn about other processes occurring among the other residents of the city, conditional on the
behavior and characteristics of the leader. For example, we could evaluate the effect of a public
health initiative encouraging residents to talk to their friends about the importance of exercise, but
we could not evaluate a similar program targeting the leader’s communication about exercise.
Practically speaking, for real and finite datasets, this implies that the methods we have proposed
are inappropriate for networks in which the degree is large, compared to n, for one or more nodes.
If many nodes are connected to a significant fraction of other nodes, this problem is intractable.
However, if only a small number of nodes are highly connected we can condition on them to recover
approximately valid inference using our methods for conditional estimands. There is a theoretical
tradeoff between the rate of convergence of our estimators and the order of K relative to n that, in
finite samples, becomes a practical tradeoff between generality and variance. Increasing the number
of nodes classified as hubs will increase the rate of convergence by decreasing the size of K for
the remaining, non-hub nodes (assuming that the number of hubs remains small compared to n so
that the sample size does not decrease significantly when we exclude hubs from the analysis). On
the other hand, classifying more nodes as hubs results in analyses that are increasingly specific:
conditioning on a single hub may preserve generalizability to other networks (similar cities with
similar leaders), but conditioning on many hubs is likely to limit the generalizability of the resulting
inference.
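One way to explore this tradeoff in a finite dataset is to flag candidate hubs by a degree cutoff and inspect the maximum degree among the remaining nodes. The sketch below uses $\sqrt{n}$ as the default cutoff, motivated by the rate condition in Theorem 1; that default, like the function name, is an assumption of this example, and in practice the cutoff is a judgment call.

```python
import numpy as np

def split_hubs(A, threshold=None):
    """Return hub indices, non-hub indices, and K_max among non-hubs only."""
    A = np.asarray(A)
    n = A.shape[0]
    deg = A.sum(axis=1)
    if threshold is None:
        threshold = np.sqrt(n)
    hubs = np.flatnonzero(deg >= threshold)
    non_hubs = np.flatnonzero(deg < threshold)
    # Ties to hubs are treated as background, so they are dropped here.
    A_sub = A[np.ix_(non_hubs, non_hubs)]
    k_max_sub = int(A_sub.sum(axis=1).max()) if len(non_hubs) else 0
    return hubs, non_hubs, k_max_sub

# Ten nodes: node 0 is tied to everyone, and nodes 1 and 2 share a tie.
A = np.zeros((10, 10), dtype=int)
A[0, 1:] = 1
A[1:, 0] = 1
A[1, 2] = A[2, 1] = 1
hubs, non_hubs, k_max_sub = split_hubs(A)
```

Lowering the threshold classifies more nodes as hubs and shrinks the remaining maximum degree, at the cost of conditioning on more of the network.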
5. SIMULATIONS
We conducted a simulation study that evaluated the finite sample and asymptotic behavior of the
TMLE procedure described in Section 3.3. We generated social networks of size n = 500, n = 1,000,
and n = 10, 000 according to the preferential attachment model (Barabási and Albert, 1999), where
the node degree (number of friends) distribution followed a power law with α = 0.5. We generated
data with two different types of dependence: first with dependence due to direct transmission only,
and second with both latent variable dependence and dependence due to direct transmission. Details
of the simulations, along with results for networks generated under the small world model (Watts
and Strogatz, 1998), are in the Appendix.
Our simulations mimicked a hypothetical study designed to increase the level of physical activity
in a population comprised of members of a social network. For each community member indexed
by i = 1, . . . , n, the study collected data on i’s baseline covariates, denoted Ci, which included the
indicator of being physically active, denoted PAi, and the network of friends of each subject, Fi. The
exposure or treatment, Xi, was assigned randomly to 25% of the community. For example, one can
imagine a study where treated individuals received various economic incentives to attend a local gym.
The outcome Yi was a binary indicator of maintaining gym membership for a pre-determined follow-
up period. We estimated the average of the mean counterfactual outcomes $E[\bar{Y}^*_n]$ under various
hypothetical interventions g∗ on such a community. First, we considered a stochastic intervention
g∗1 which assigned each individual to treatment with a constant probability of 0.35; this differs from
the observed allocation of treatment to 25% of the community members. We also considered a
scenario in which the economic incentive was resource constrained and could only be allocated to
up to 10% of community members. We estimated the effects of various targeted approaches to
allocating the exposure. For example, we considered an intervention g∗2 that targeted only the top
10% most connected members of the community, as such a targeted intervention would be expected
to have a higher impact on the overall average probability of maintaining gym membership among
the community, when compared to purely random assignment of exposure to 10% of the community.
Another hypothetical intervention g∗3 assigned an additional physically active friend to individuals
with fewer than 10 friends. This is an intervention on the structure of the social network itself.
Finally, we estimated the combined effect of simultaneously implementing intervention g∗2 and the
network-based intervention g∗3 on the same community. For simplicity, this simulation study only
reports the expected outcome under each of these interventions; causal effects defined as contrasts
of these interventions can be easily estimated based on the same methods.
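The treatment allocations behind g∗1 and g∗2 can be illustrated with a minimal sketch. It assumes the network is stored as a dictionary mapping each node to its set of friends; the function names, the toy network, and the tie-breaking rule are hypothetical and are not taken from the authors' R code.

```python
import random


def assign_random(nodes, prob, seed=0):
    """In the spirit of g*_1: treat each node independently with probability `prob`."""
    rng = random.Random(seed)
    return {i: int(rng.random() < prob) for i in nodes}


def assign_top_connected(friends, fraction):
    """In the spirit of g*_2: treat the `fraction` of nodes with the most friends."""
    n_treated = int(fraction * len(friends) + 1e-9)
    # Rank by degree (descending); ties are broken by node label (an assumption).
    ranked = sorted(friends, key=lambda i: (-len(friends[i]), i))
    treated = set(ranked[:n_treated])
    return {i: int(i in treated) for i in friends}


# Toy star network (hypothetical): node 0 is friends with nodes 1..9.
friends = {0: set(range(1, 10))}
for i in range(1, 10):
    friends[i] = {0}

x_random = assign_random(friends, prob=0.35)
x_targeted = assign_top_connected(friends, fraction=0.10)
```

In this toy network the targeted rule treats only the hub, whereas the random rule treats each node with probability 0.35 regardless of its position in the network.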
We estimated the expected counterfactual outcomes under the four interventions and evaluated
their finite sample biases. For the simulations under dependence due to direct transmission, we
estimated the marginal parameter E[Y ∗n] and compared three different estimators of the asymptotic
variance and the coverage of the corresponding confidence intervals. First, we looked at the naive
plug-in i.i.d. estimator (“IID Var”) for the variance of the influence curve, which treated observations
as if they were i.i.d. Second, we used the plug-in variance estimator based on the efficient influence
curve, which adjusted for the correlated observations (“dependent IC Var”) (Sofrygin and van der
Laan, 2015). Finally, we used the parametric bootstrap variance estimator (“bootstrap Var”)
described in Section 3.6. The mean length and coverage of these three CI types are shown in Figure 4.
The results from the simulations with latent variable dependence are in Figure 3. For these
simulations we estimated the conditional parameter E[Y ∗n] and compared two plug-in
variance estimators based on the conditional influence function DC: one that assumes conditionally
i.i.d. outcomes (conditional on X and C), which would be true if all dependence were due to direct
transmission but is violated in the presence of latent variable dependence (“IID Var”), and one that
does not make this assumption (“dependent IC Var”). In the Appendix we compare histograms of
the estimates to the predicted normal limiting distribution.
[Figure 2 plot omitted. Left panel: mean estimate and 95% CI length; right panel: coverage. Rows: g∗1 (random 35%), g∗2 (dynamic intervention), g∗3 (network intervention), and g∗2 + g∗3, shown for N = 500, 1000, and 10000. CI types: dependent IC Var, bootstrap Var, iid Var.]
Figure 2: Mean 95% CI length (left panel) and coverage (right panel) for the TMLE in preferential attachment network with dependence due to direct transmission, by sample size, intervention and CI type.
[Figure 3 plot omitted. Left panel: mean estimate and 95% CI length; right panel: coverage. Rows: g∗1 (random 35%), g∗2 (dynamic intervention), g∗3 (network intervention), and g∗2 + g∗3, shown for N = 500, 1000, and 10000. CI types: dependent IC Var, iid Var.]
Figure 3: Mean 95% CI length (left panel) and coverage (right panel) for the TMLE in preferential attachment network with latent variable dependence, by sample size, intervention and CI type.
One of the lessons of our simulation study is that by leveraging the structure of the network
it might be possible to achieve a larger overall intervention effect on a population level (Harling
et al., 2016). For example, the results in the left panel of Figure 4 show that by targeting the
exposure assignment to highly connected and physically active individuals, intervention g∗2 increases
the mean probability of sustaining gym membership compared to a similar level of un-targeted
coverage of the exposure. We also demonstrated the feasibility of estimating effects of interventions
on the observed network structure itself, such as intervention g∗3, which can also be combined with
economic incentives, as in our hypothetical intervention g∗2 + g∗3. These combined
interventions could be particularly useful in resource-constrained environments, since they may
result in larger community-level effects at lower coverage of the exposure assignment.
Results from simulations with dependence due to direct transmission show that conducting
inference while ignoring the nature of the dependence in such datasets generally results in
anticonservative variance estimates and under-coverage of CIs, which can be as low as 50% even for very
large sample sizes (“IID Var” in the right panel of Figure 4). The CIs based on the dependent
variance estimates (“dependent IC Var”) obtain nearly nominal coverage of 95% for large enough sample
sizes, but can suffer in smaller sample sizes due to lack of asymptotic normality and near-positivity
violations. Notably, the CIs based on the parametric bootstrap variance estimates provide the most
robust coverage for smaller sample sizes, while attaining the nominal 95% coverage in large sample
sizes for nearly all of the simulation scenarios (“bootstrap Var”). The apparent robustness of the
parametric bootstrap method for inference in small sample sizes, even as low as n = 500, was one
of the surprising findings of this simulation study. Future work will explore the assumptions under
which this parametric bootstrap works and its sensitivity to violations of those assumptions.
Similarly, in the simulations with latent variable dependence the variance estimates that assume
conditionally i.i.d. outcomes, i.e. that dependence may be due to direct transmission but not to
latent variables, are anticonservative.
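The gap between the “IID Var” and “dependent IC Var” intervals can be illustrated with a minimal sketch. It assumes a simplified variance estimator for the sample mean of estimated influence curve values in which only cross-products within dependency neighborhoods are retained; this form, the function names, and the toy numbers are illustrative assumptions and may differ from the estimators used in the simulations.

```python
def iid_variance(ic):
    """Variance of the mean of (mean-zero) influence curve values, treating units as i.i.d."""
    n = len(ic)
    return sum(v * v for v in ic) / n ** 2


def dependent_variance(ic, neighborhoods):
    """Same target, but also keeping cross-products ic[i] * ic[j] for j in the
    dependency neighborhood of i (assumed to contain i itself)."""
    n = len(ic)
    total = 0.0
    for i, d_i in enumerate(neighborhoods):
        for j in d_i:
            total += ic[i] * ic[j]
    return total / n ** 2


# Toy example (hypothetical): two pairs of positively dependent units.
ic = [1.0, 1.0, -1.0, -1.0]
neighborhoods = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]

var_iid = iid_variance(ic)                        # ignores the cross-products
var_dep = dependent_variance(ic, neighborhoods)   # includes them
```

With positively dependent neighbors the second estimate is larger than the first, which is the direction of the anticonservative bias described above.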
6. IS OBESITY SOCIALLY CONTAGIOUS IN THE FRAMINGHAM HEART STUDY?
The Framingham Heart Study (FHS), initiated in 1948, is an ongoing cohort study of participants
from the town of Framingham, Massachusetts, designed to study cardiovascular epidemiology. It has
grown over the years to include five cohorts with a total
sample of over 15,000. Study participants are followed through exams every 2 to 8 years. In
between exams, participants are regularly monitored through phone calls. Detailed information on
data collected in the FHS can be found in Tsao and Vasan (2015). Public versions of FHS data
through 2008 are available from the dbGaP database.
In addition to its important role in cardiovascular disease epidemiology, the FHS plays a uniquely
influential part in the study of social networks and peer effects. In the early 2000s, researchers
Christakis and Fowler (CF) discovered an untapped resource buried in the FHS data collection
tracking sheets: information on social ties that, combined with existing data on connections among
the FHS participants, allowed them to reconstruct the (partial) social network underlying the cohort.
They then leveraged this social network data to study peer effects for obesity (Christakis and
Fowler, 2007), smoking (Christakis and Fowler, 2008), and happiness (Fowler and Christakis, 2008).
Researchers have since used the same methods as Christakis and Fowler to study peer effects in the
FHS and in many other social network settings (e.g. Trogdon et al., 2008; Fowler and Christakis,
2008; Rosenquist et al., 2010).
Even though the hypotheses of interest imply non-independent subjects, these researchers relied
on models, like generalized estimating equations, that assume independent subjects (while accounting
for repeated measurements within subject). To assess peer influence for obesity using FHS data,
Christakis and Fowler (2007) fit longitudinal logistic regression models of each individual’s obesity
status at exam k = 2, 3, 4, 5, 6, 7 onto each of the individual’s social contacts’ obesity statuses at
exam k and k − 1 with a separate entry into the model for each contact, controlling for individual
covariates and for the node’s own obesity status at exam k − 1. They used generalized estimating
equations to account for correlation within individual over time, but their model assumes indepen-
dence across individuals. CF fit this model separately for ten different types of social connections,
including siblings, spouses, and immediate neighbors, with estimates of the increased risk of obesity
ranging from 27% to 171%, many of which were statistically significant. Using each network tie as
an independent entry into the regression model can result in incoherent models for the full network
(Lyons, 2011; Ogburn and VanderWeele, 2014). Furthermore, Lee and Ogburn (2019) found evi-
dence of significant network dependence across observations, suggesting that even if the model were
coherent the analysis is invalid due to unaccounted statistical dependence. However, until now no
method has been available to reanalyze these data taking into account the network structure and
corresponding causal and statistical dependence.
We reanalyzed data from the first two exams, using all ten types of social connections
simultaneously (n = 3766). The full R code for this analysis is available in a GitHub repository
(github.com/osofr/Ogburn_etal_simulations). Instead of specifying pairwise models and treating
each pair (i.e. each network tie) as an independent observation, our methods account for the
entire social network structure and allow for considerable causal and statistical dependence among
subjects. For each subject i we specified m(Vi) to be the regression model used in CF (2007), but
with proportion of obese friends replacing the indicator that a single friend is obese at each visit.
That is, we specified that the expected probability of obesity for subject i at visit 2 is a function
of the proportion of i’s friends who were obese at visit 2 (this is the exposure of interest), subject
i’s obesity at visit 1, the proportion of i’s friends who were obese at visit 1, and subject i-specific
covariates age, sex, and education. CF argue that controlling for friends’ obesity status at visit 1
controls for confounding due to homophily. It is more likely that confounding due to homophily
cannot be controlled using these data (Shalizi and Thomas, 2011; Cohen-Cole and Fletcher, 2008;
Noel and Nyhan, 2011) and we do not purport to be estimating a true, unconfounded causal effect.
However, under CF’s assumption of unconfoundedness, we can estimate the expected proportion of
subjects who would be obese at visit 2 under various hypothetical interventions on each subject’s
friends’ obesity statuses.
The pairwise parameter that CF estimated is not well-defined in a model that accounts for
more than one tie simultaneously. Instead we estimated the expected probability of obesity at
visit 2 under a hypothetical intervention to increase the number of each subject’s obese friends by
1. This is similar to CF’s pairwise parameter in that it estimates the effect of a single friend’s
change in obesity status. The observed empirical probability of obesity at visit 2 was 0.137. The
predicted outcome under intervention was identical up to three decimal places with a 95% parametric
bootstrap confidence interval of (0.127, 0.147). We also estimated the effect of a change in the
average BMI of each subject’s friends. At visit 2, the observed empirical mean BMI across all
subjects was 25.51 with a standard deviation of 4.42. We estimated the effect of a hypothetical
intervention that adds half of a standard deviation, or 2.21, to the average BMI for each subject’s
group of friends. The predicted outcome under this intervention was 25.76 with a 95% parametric
bootstrap confidence interval of (25.04, 26.49). Our analysis is consistent with the hypothesis that
the strong results in CF are spurious, due to dependence and/or model misspecification rather than
true associations or effects.
Our estimates are not directly comparable to CF’s because (a) their pairwise parameter is not
well-defined in the context of the full network and (b) we only used data from the first two visits.
While (b) results in less power than an analysis on data from all visits, our confidence intervals
are reasonably narrow. We caution against interpreting our estimates as true causal effects, both
because of unobserved confounding in the FHS data and because the exposure was measured at the
same time as the outcome. However, this is still an instructive comparison between our methods
and the naive methods that are currently in common use. Accounting for the interdependence of the
subjects in the FHS data undermines the findings of strong contagion effects for obesity. Looked at
together, the results of CF’s analyses and of our analyses are consistent with a network-wide shift
towards increasing BMI. This could be due in part to peer effects that are undetectable in these
data, but it could also be due to common secular trends or to shared environment.
7. CONCLUSION
We proposed new methods that allow for causal and statistical inference using observations sampled
from members of a single interconnected social network when the observations evince dependence
due to network ties. In contrast to existing methods, our methods do not require randomization of
an exogenous treatment and they have proven performance under asymptotic regimes in which the
number of network ties grows (slowly) with sample size. In the absence of appropriate methods for
assessing peer effects researchers have routinely relied on naive methods developed for independent
units, and our analysis of peer effects for obesity in the Framingham Heart Study illustrates the
dangers of that approach and the importance of new methods like ours.
In future work we plan to address a key limitation of the present proposal, namely the assumption
that the network is observed fully and without error. We also plan to develop data-adaptive methods
for estimating the summary measures sX and sY , as it may be unreasonable to expect these to be
known a priori. Finally, we plan to develop estimating algorithms for longitudinal settings; the
influence function and asymptotic results for these settings are straightforward extensions of the
results presented here, but estimation can be challenging.
ACKNOWLEDGEMENTS
The authors are grateful to Caleb Miles, Eric Tchetgen Tchetgen and Victor De Gruttola for helpful
comments. Elizabeth L. Ogburn was supported by ONR grant N000141512343 and N000141812760.
Oleg Sofrygin and Mark van der Laan were supported by NIH grant R01 AI074345-07.
Supplementary Material
ESTIMATION PROCEDURE
Below we propose a targeted maximum loss-based estimator (TMLE) of ψ; however, all of the results
that follow are equally applicable to a standard estimating equation approach. TMLEs are substitution
estimators and are not as sensitive to the near violations of the positivity assumption that can
occur in finite samples and result in extreme values of $h_{x^*}(v_i)/h(v_i)$. Targeted maximum likelihood
estimation is a general template for estimation of smooth parameters in semi- and nonparametric
models. The estimation algorithm is constructed to solve the efficient influence function estimating
equation, thereby yielding, under regularity conditions, asymptotically linear estimators with the
same semiparametric efficiency property as the estimating equation approach described above. In
our setting, a TMLE is constructed using three elements: (i) a valid loss function L for the outcome
regression model m, (ii) initial working estimators $\hat m$ of m and $\hat g$ of g, and (iii) a parametric
submodel $\hat m_\varepsilon$ of $\mathcal{M}$, the score of which corresponds to a particular component of the score based
on the efficient influence function D(o) and such that $\hat m_0 = \hat m(\cdot)$. The TMLE is then defined by
an iterative procedure that, at each step, estimates ε by minimizing the empirical risk of the loss
function L at $\hat m_\varepsilon$. An updated estimate is then computed as $\hat m_{\hat\varepsilon}$, and the process is repeated until
convergence. The TMLE is the estimator obtained in the final step of the iteration. The result of
this iterative procedure is that, at the final step, the efficient influence function estimating
equation is solved. For more details about targeted maximum likelihood estimation, see Van der
Laan and Rose (2011). In the present setting, the TMLE for ψ based on D′(o) requires only one
iteration for convergence (Van der Laan and Rose, 2011). We use influence function D′(o) to derive
the TMLE, instead of D(o), because it is computationally more tractable and because the choice
of influence function does not matter for the conditional parameter that we are interested in when
latent variable dependence is present.
Initial estimators $\hat m$ and $\hat g$ of m and g may be found through maximum likelihood or loss-based
estimation methods like standard regression models; under the conditions required for Theorem
1 to hold, a similar argument shows that an M-estimator for either of the nuisance models will be
CAN for its expectation. Under a conditional independence structure analogous to that implied
by assumptions (A1), (A4), and (A5), Benkeser et al. (2018) showed that super learning (van der
Laan et al., 2007) can be used to estimate the nuisance models. The empirical distribution $\hat p_C$
is used to estimate $p_C$. An estimate $\hat h$ of h(v) optimizes the log likelihood function
$\sum_{i=1}^n \log h(V_i \mid W_i)$,
as if the pooled sample $(V_i, W_i)$ were i.i.d. It can be shown that this results in a valid loss function
for h, even for dependent observations $(V_i, W_i)$, for i = 1, . . . , n (van der Laan, 2014; Sofrygin and
van der Laan, 2015). Similarly, one can construct a direct estimator $\hat h_{x^*}$ of $h_{x^*}$ by first creating a
sample $(V_i^*, W_i)$ and then directly optimizing the log likelihood function
$\sum_{i=1}^n \log h_{x^*}(V_i^* \mid W_i)$, as if
the pooled sample $(V_i^*, W_i)$ were i.i.d. We perform estimation of the conditional mixture density h
using a conditional histogram approach, previously described for i.i.d. data in Munoz and van der
Laan (2011). The approach relies on fitting the conditional hazards of individual bins from the
support of $V_i$ (given $W_i$) using separate parametric logistic regression models.
In our highly-dependent network settings, the operational characteristics of the direct estima-
tor of h are unclear. Similarly, it is unclear how to appropriately conduct cross-validation with
our proposed direct estimation approach for h. However, lacking any other reasonable estimation
alternatives, we believe that the enormous computational advantages offered by this direct estima-
tion route, along with the encouraging results obtained from our extensive simulations, merit the
description of this estimator. We also realize that more theoretical work is needed to justify and
improve upon this approach. For additional simulation results that demonstrate the performance
of the direct estimation approach for mixture density h, we refer to Sofrygin et al. (2017, 2018).
Now the TMLE of ψ is computed as follows:
1. Define the auxiliary weights $H_i$ as the ratio of estimated densities of V ∗ and V evaluated at
the observed value $V_i$. Compute the auxiliary weights as
$$H_i = \frac{\hat h_{x^*}(V_i)}{\hat h(V_i)}.$$
2. Compute initial predicted outcome values $\hat Y_i \equiv \hat m(V_i)$ and predicted potential outcome values
$\hat Y_i^* \equiv \hat m(V_i^*)$ evaluated at the counterfactual value $V_i^* = s_{Y,i}(C, x^*)$.
3. Construct a TMLE model update $\hat m_\varepsilon$ of $\hat m$ by running a weighted intercept-only logistic
regression model with weights $H_i$ defined in step 1, $Y_i$ as the outcome, and the logit of $\hat Y_i$ as
an offset. That is, define $\hat\varepsilon$ as the estimate of the intercept parameter ε from the following
weighted logistic regression model
$$\operatorname{logit} \hat m_\varepsilon(v) = \operatorname{logit} \hat m(v) + \varepsilon,$$
where $\operatorname{logit}(x) = \log\left(\frac{x}{1-x}\right)$.
4. Compute updated predicted potential outcomes $\tilde Y_i^*$ as the fitted values of the regression from
step 3, evaluated at v∗ rather than v (that is, at $\hat Y_i^*$ instead of $\hat Y_i$):
$$\tilde Y_i^* = \operatorname{expit}\{\operatorname{logit} \hat Y_i^* + \hat\varepsilon\},$$
where $\operatorname{expit}(x) = \frac{1}{1+e^{-x}}$, i.e., the inverse of the logit function.
5. Compute the TMLE $\hat\psi$ as
$$\hat\psi = \frac{1}{n}\sum_{i=1}^{n} \tilde Y_i^*.$$
The TMLE is doubly robust: it will be consistent for ψ if either the working model $\hat g$ for g or
the working model $\hat m$ for m is correctly specified. The resulting estimator remains CAN for ψ
under assumptions (A2) and (A3) instead of (A4) and (A5), and the same procedure can be used
to estimate the parameter conditional on C.
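The update in steps 3 to 5 can be sketched in a few lines. This is a minimal illustration, not the authors' R implementation: it assumes the weights and the initial predictions from steps 1 and 2 are already available as plain lists, and it fits the weighted intercept-only logistic regression with an offset by Newton's method. The function names and the toy inputs are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_epsilon(y, y_hat, weights, max_iter=100, tol=1e-12):
    """Step 3: weighted intercept-only logistic regression with offset logit(y_hat).

    Solves sum_i H_i * (y_i - expit(logit(y_hat_i) + eps)) = 0 by Newton's method.
    """
    offsets = [logit(p) for p in y_hat]
    eps = 0.0
    for _ in range(max_iter):
        fitted = [expit(o + eps) for o in offsets]
        score = sum(w * (yi - p) for w, yi, p in zip(weights, y, fitted))
        info = sum(w * p * (1.0 - p) for w, p in zip(weights, fitted))
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps


def tmle(y, y_hat, y_star_hat, weights):
    """Steps 3 to 5: fit eps, update the counterfactual predictions, and average them."""
    eps = fit_epsilon(y, y_hat, weights)
    updated = [expit(logit(p) + eps) for p in y_star_hat]
    return sum(updated) / len(updated), eps


# Toy inputs (hypothetical): constant initial predictions and unit weights.
y = [1, 1, 1, 0]
y_hat = [0.5, 0.5, 0.5, 0.5]
y_star_hat = [0.5, 0.5, 0.5, 0.5]
weights = [1.0, 1.0, 1.0, 1.0]
psi_hat, eps_hat = tmle(y, y_hat, y_star_hat, weights)
```

In this toy case the fitted intercept moves the constant prediction 0.5 to the observed mean 0.75, so the weighted score equation is solved exactly. In practice the weights would come from the estimated density ratio in step 1, and predictions close to 0 or 1 would need to be bounded away from the boundary before taking logits.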
8. PROOF OF THEOREM 1
8.1 Regularity conditions
For a real-valued function $c \mapsto f(c)$, let the $L^2(P)$-norm of f(c) be denoted by $\|f\| = E[f(C)^2]^{1/2}$.
Define $\mathcal{M}_m$ and $\mathcal{M}_h$ as the classes of possible functions that can be used for estimating the two
nuisance parameters m and $\bar h \equiv h_{x^*}/h$, respectively. Note that a model for g plus the empirical
distribution of covariates C determines $\bar h$. Equivalent assumptions could be stated in terms of g
instead of $\bar h$, but we focus on $\bar h$ because that is the functional of g and C that we model in our
estimating procedure. Assume that the TMLE update $\hat m_\varepsilon \in \mathcal{M}_m$ with probability 1 and assume
that $\hat h_{x^*}/\hat h \in \mathcal{M}_h$ with probability 1. Finally, define the following dissimilarity measure on the
Cartesian product $\mathcal{F} \equiv \mathcal{M}_m \times \mathcal{M}_h$:
$$d\big((\bar h, m), (\tilde h, \tilde m)\big) = \max\Big(\sup_{v \in \mathcal{V}} |\bar h - \tilde h|(v),\ \sup_{v \in \mathcal{V}} |m - \tilde m|(v)\Big).$$
The following are the regularity conditions required for Theorem 1, i.e. for asymptotic normality
of the TMLE ψ∗.
Uniform consistency: Assume that
$$d\big((\hat h_{x^*}/\hat h,\ \hat m_\varepsilon),\ (h_{x^*}/h,\ m)\big) \to 0$$
in probability as n→∞. Note that this assumption is only needed for proving the asymptotic
equicontinuity of our process; it is not needed for proofs of relevant convergence rates for the
second order terms.
Bounded entropy integral: Assume that there exists some η > 0 such that
$\int_0^{\eta} \sqrt{\log N(\varepsilon, \mathcal{F}, d)}\, d\varepsilon < \infty$,
where $N(\varepsilon, \mathcal{F}, d)$ is the number of balls of size ε w.r.t. metric d needed to cover $\mathcal{F}$.
Universal bound: Assume $\sup_{f \in \mathcal{F}, O} |f|(O) < \infty$, where the supremum over O is taken over a set
that contains O with probability one. This assumption will typically be a consequence of
choosing a specific function class $\mathcal{F}$ that satisfies the above entropy condition.
Positivity: Assume
$$\sup_{v \in \mathcal{V}} \frac{h_{x^*}(v)}{h(v)} < \infty.$$
Consistency and rates for estimators of nuisance parameters: Assume that
$\|\hat m - m\|\,\|\hat h - h\| = o_P\big((C_n)^{-1/2}\big)$. Note that this rate is achievable if, for example, estimation of h relies on some
pre-specified parametric model, or if both h and m are estimated at rate $C_n^{-1/4}$.
Rate of the second order term: Assume that
$$R_{n1} \equiv -\int_v \left(\frac{\hat h_{x^*}}{\hat h} - \frac{h_{x^*}}{h}\right)(\hat m_\varepsilon - m)(v)\, h(v)\, d\mu(v) = o_P\big(1/\sqrt{C_n}\big).$$
Note that this condition is provided here purely for the sake of completeness, since it will be
satisfied based on the previously assumed rates of convergence for $\|\hat m - m\|\,\|\hat h - h\|$. This
follows from the fact that the parametric TMLE update step $\hat m_\varepsilon$ of $\hat m$ will have a negligible
effect on the rate of convergence of the initial estimator $\hat m$, that is, $\hat m_\varepsilon$ will converge at “nearly”
the same rate as $\hat m$.
Limited connectivity and limited dependence of Y, X and C: Let $K_{\max,n} = \max_i\{K_i\}$ for
a fixed network with n nodes. Assume that $K_{\max,n}^2/n$ converges to 0 in probability as n → ∞.
A key condition is consistency and rates for estimators of nuisance parameters. This condition
will be satisfied, for example, if both models converge to the truth at rate $C_n^{-1/4}$. It can in fact be
weakened, but for a more general discussion and the corresponding technical conditions we refer to
the Appendix of van der Laan (2014). With the exception of the rates of convergence, the more
general conditions for asymptotic normality of the TMLE presented in that paper apply to our
setting as well.
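The limited connectivity condition is easy to check numerically for a given network. The following minimal sketch assumes the network is stored as a dictionary from each node to its set of friends; the function names and the ring network are hypothetical illustrations.

```python
def max_degree_ratio(friends):
    """Return K_max and K_max^2 / n for a network given as {node: set of friends}."""
    n = len(friends)
    k_max = max(len(f) for f in friends.values())
    return k_max, k_max ** 2 / n


def ring_network(n):
    """Toy network in which every node is tied to its two neighbors on a ring."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


for n in (100, 1000, 10000):
    k_max, ratio = max_degree_ratio(ring_network(n))
    print(n, k_max, ratio)
```

For the ring network the maximum degree stays at 2, so the ratio is 4/n and shrinks as the network grows. In a network with a hub tied to every other node, the ratio would instead grow with n and the condition would fail.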
8.2 Overview of the proof of Theorem 1
We want to show that $\sqrt{C_n}(\hat\psi - \psi)$ converges in law to a Normal limit as n goes to infinity for
some rate $\sqrt{C_n}$ such that $\sqrt{n/(K_{\max}(n))^2} \le \sqrt{C_n} \le \sqrt{n}$, where the rate $\sqrt{C_n}$ is the order of the
variance of the sum of the first-order linear approximation of $(\hat\psi - \psi)$.
Broadly, the proof has two parts: First, we require that the second order terms in the expansion
of $\hat\psi - \psi$ are stochastically less than $1/\sqrt{C_n}$, that is, that
$$\hat\psi_n - \psi = \frac{1}{n}\sum_{i=1}^{n} \{f_i(O) - E[f_i(O)]\} + o_P\big(1/\sqrt{C_n}\big),$$
where $f_i(O)$ is the contribution of the ith observation to the estimator. Specifically, for our influence
function
$$D(o) = \sum_{j=1}^{n} \frac{1}{n}\sum_{i=1}^{n} E[m(V_i^*) \mid C_j = c_j] - \psi + \frac{1}{n}\sum_{i=1}^{n} \frac{h_{x^*}(v_i)}{h(v_i)}\{y_i - m(v_i)\},$$
the contribution of the ith observation is
$$f_i(o) = \sum_{j=1}^{n} E[m(V_i^*) \mid C_j = c_j] + \frac{h_{x^*}(v_i)}{h(v_i)}\{y_i - m(v_i)\}.$$
Then proving asymptotic normality of the TMLE amounts to the asymptotic analysis of the sum
$\frac{1}{n}\sum_{i=1}^{n}\{f_i(O) - E[f_i(O)]\}$, and the second part of the proof establishes that the first order terms
converge to a normal distribution when scaled by $\sqrt{C_n}$, that is, that
$\sqrt{C_n}\,\frac{1}{n}\sum_{i=1}^{n}\{f_i(O) - E[f_i(O)]\} \to_d N(0, \sigma^2)$ for some finite $\sigma^2$.
The proof that the second order terms are stochastically less than 1/√Cn is an extension of
the empirical process theory of Van Der Vaart and Wellner (1996) and follows the same format
as the proof in van der Laan (2014). Indeed, the proof offered by van der Laan (2014) holds
immediately after replacing the rate or scaling factor $\sqrt{n}$ with $\sqrt{C_n}$ throughout. Only one step in
the van der Laan (2014) proof relies on the network structure, which is the major difference between
the setting in that paper, where the number of network connections is fixed and bounded as n goes
to infinity, and the present setting: the proof requires bounding the Orlicz norms of several empirical
processes corresponding to components of the influence function for ψ, and a key step is bounding
the expectation $E[|X_n(f)|^p]$, where $X_n(f)$ is the stochastic process that describes the difference
between the empirical (indexed by n) and the true distribution functions of a component of the
influence function for ψ. This step relies on a combinatorial argument about the nature of overlapping
friend groups in the underlying network, and the argument for the case of growing Ki is subsumed
by the argument for fixed K in van der Laan (2014).
The proof that the first order terms converge to a normal distribution requires a central limit
theorem for dependent data with growing and possibly irregularly sized dependency neighborhoods,
where a dependency neighborhood for unit i is a collection of observations on which the observations
for unit i may be dependent. We prove such a CLT in Lemmas 1 and 2. In the next section we use the
CLT for growing and irregular dependency neighborhoods, along with an orthogonal decomposition
of the first order terms, to prove the remainder of Theorem 1.
8.3 Central limit theorem for first order terms
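Proving asymptotic normality of the TMLE amounts to the asymptotic analysis of the sum
$\frac{1}{n}\sum_{i=1}^{n}\{f_i(O) - E[f_i(O)]\}$.
As a start, decompose $\sum_{i=1}^{n}\{f_i(O) - E[f_i(O)]\}$ into a sum of three orthogonal components:
$$f_{Y,i}(Y, X, C) = f_i(O) - E[f_i(O) \mid X, C],$$
$$f_{X,i}(X, C) = E[f_i(O) \mid X, C] - E[f_i(O) \mid C], \text{ and}$$
$$f_{C,i}(C) = E[f_i(O) \mid C] - E[f_i(O)].$$
Note that
$$f_i(O) - E[f_i(O)] = f_{Y,i}(Y, X, C) + f_{X,i}(X, C) + f_{C,i}(C)$$
and with slight abuse of notation we will also write $f_{Y,i}(O)$, $f_{X,i}(O)$ and $f_{C,i}(O)$. Let
$f_Y(O) = \sum_{i=1}^{n} f_{Y,i}(O)$, $f_X(O) = \sum_{i=1}^{n} f_{X,i}(O)$ and $f_C(O) = \sum_{i=1}^{n} f_{C,i}(O)$. For i = 1, . . . , n, let
$$Z_{Y,i} = \frac{f_{Y,i}(Y, X, C)}{\sqrt{Var\left(\sum_{i=1}^{n} f_{Y,i}(Y, X, C)\right)}}, \quad
Z_{X,i} = \frac{f_{X,i}(X, C)}{\sqrt{Var\left(\sum_{i=1}^{n} f_{X,i}(X, C)\right)}}, \quad
Z_{C,i} = \frac{f_{C,i}(C)}{\sqrt{Var\left(\sum_{i=1}^{n} f_{C,i}(C)\right)}},$$
and
$$Z'_{Y,i} = \frac{f_{Y,i}(Y, X, C) \mid (X, C)}{\sqrt{Var\left(\sum_{i=1}^{n} f_{Y,i}(Y, X, C) \mid (X, C)\right)}}, \quad
Z'_{X,i} = \frac{f_{X,i}(X, C) \mid C}{\sqrt{Var\left(\sum_{i=1}^{n} f_{X,i}(X, C) \mid C\right)}}.$$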
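We use the prime to denote conditional random variables: $Z'_{Y,i}$ conditions $f_{Y,i}(O)$ on (X, C) and
rescales it by the standard error of $f_Y(O) \mid (X, C)$. Similarly, $Z'_{X,i}$ conditions $f_{X,i}(O)$ on C and
rescales it by the standard error of $f_X(O) \mid C$. Let
$$\sigma^2_{nY}(x, c) = Var\left(\sum_{i=1}^{n} f_{Y,i}(Y, x, c) \mid (X = x, C = c)\right), \qquad
\sigma^2_{nY} = E_{P_{X,C}}\left[\sigma^2_{nY}(X, C)\right],$$
$$\sigma^2_{nX}(c) = Var\left(\sum_{i=1}^{n} f_{X,i}(X, c) \mid C = c\right), \qquad
\sigma^2_{nX} = E_{P_C}\left[\sigma^2_{nX}(C)\right],$$
and
$$\sigma^2_{nC} = Var\left(\sum_{i=1}^{n} f_{C,i}(C)\right).$$
Note that by the law of total variance $\sigma^2_{nX} = Var\left(\sum_{i=1}^{n} f_{X,i}(X, C)\right)$ and
$\sigma^2_{nY} = Var\left(\sum_{i=1}^{n} f_{Y,i}(Y, X, C)\right)$.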
Let $Z'_{nY}$ denote $\sum_{i=1}^{n} Z'_{Y,i}$, $Z'_{nX}$ denote $\sum_{i=1}^{n} Z'_{X,i}$, $Z_{nY}$ denote $\sum_{i=1}^{n} Z_{Y,i}$, $Z_{nX}$ denote $\sum_{i=1}^{n} Z_{X,i}$,
and $Z_{nC}$ denote $\sum_{i=1}^{n} Z_{C,i}$. We will establish convergence in distribution of each of the three terms
separately. Because $Z'_{nY}$ and $Z'_{nX}$ converge to distributions that do not depend on their
conditioning events, conditional convergence in distribution implies convergence of $Z_{nY}$ and $Z_{nX}$ to the
same limiting distributions. Since $f_Y(O)$, $f_X(O)$, and $f_C(O)$ are orthogonal by construction, the
variance of the limiting distribution of their sum is the sum of their marginal variances. If the three
processes converge at the same rate the limiting variance will be the sum of the variances of the
three processes. However, the three terms may converge at different rates, in which case the limiting
distribution of $\hat\psi - \psi$ will be given by the limiting distribution of the term(s) with the slowest rate
of convergence.
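As a brief worked step, the orthogonality of the three components follows from the tower property of conditional expectation, using only the definitions above. Since $E[f_{Y,i}(O) \mid X, C] = 0$ and $E[f_{X,i}(O) \mid C] = 0$, and since $f_{X,j}$ and $f_{C,j}$ are functions of (X, C) and of C respectively, for any i and j
$$E[f_{Y,i} f_{X,j}] = E\big[E[f_{Y,i} \mid X, C]\, f_{X,j}\big] = 0, \qquad
E[f_{Y,i} f_{C,j}] = E\big[E[f_{Y,i} \mid X, C]\, f_{C,j}\big] = 0, \qquad
E[f_{X,i} f_{C,j}] = E\big[E[f_{X,i} \mid C]\, f_{C,j}\big] = 0.$$
Summing over i and j, all cross-covariances between $f_Y(O)$, $f_X(O)$ and $f_C(O)$ vanish, so that
$$Var\big(f_Y(O) + f_X(O) + f_C(O)\big) = Var\big(f_Y(O)\big) + Var\big(f_X(O)\big) + Var\big(f_C(O)\big).$$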
In order to show that $Z'_{nX}$, $Z'_{nY}$, and $Z_{nC}$ all converge in distribution to a N(0, 1) random
variable, we can use three separate applications of the central limit theorem given in Lemma 1,
which is based on Stein’s method.
Stein’s method (Stein, 1972) quantifies the error in approximating a sample average with a
normal distribution. (For an introduction to Stein’s method see Ross, 2011.) Stein’s method
has been used to prove CLTs for dependent data with dependence structure given by dependency
neighborhoods (Chen and Shao, 2004): the dependency neighborhood for observation i is a set of
indices $D_i$ such that observation i is independent of observation j, for any $j \notin D_i$. Conditionally
on C, $f_{X,i}$ and $f_{X,j}$ are independent for any nodes i and j such that $A_{ij} = 0$ and there is no k with
$A_{ik} = A_{jk} = 1$, that is, for any nodes that do not share a tie or have any mutual network contacts.
The same is true for $f_{Y,i}$ and $f_{Y,j}$ conditional on X and C and for $f_{C,i}$ and $f_{C,j}$. Thus the three
collections of random variables $Z'_{X,1}, \ldots, Z'_{X,n}$, $Z'_{Y,1}, \ldots, Z'_{Y,n}$, and $Z_{C,1}, \ldots, Z_{C,n}$ each has a dependency
neighborhood structure with $D_i = \{i\} \cup \{j : A_{ij} = 1\} \cup \{k : A_{jk} = 1 \text{ for } j : A_{ij} = 1\}$, that is, node i together with the
“friends” and “friends of friends” of node i. Define R(i, j) for any $(i, j) \in \{1, \ldots, n\}^2$
to be an indicator of dependence between $Z_{X,i}$ and $Z_{X,j}$: R(i, j) = 1 iff $j \in D_i$ or, equivalently, iff
$i \in D_j$. For any $i \in \{1, \ldots, n\}$ the set $\{Z'_{X,j} : R(i, j) = 1,\ j \in \{1, \ldots, n\}\}$ forms the dependency
neighborhood of $Z'_{X,i}$ and the collection $\{Z'_{X,j} : R(i, j) = 0,\ j \in \{1, \ldots, n\}\}$ is independent of $Z'_{X,i}$.
The same logic applies to defining the dependency neighborhoods for $Z'_{Y,1}, \ldots, Z'_{Y,n}$ conditional on X
and C, and for $Z_{C,1}, \ldots, Z_{C,n}$ based on (unconditional) independence of each $f_{C,i}(O)$ and $f_{C,j}(O)$,
as determined by the network structure and the distributional assumptions made for the baseline
covariates C.
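The “friends and friends of friends” neighborhoods and the indicator R(i, j) can be computed directly from the network. The sketch below is a minimal illustration that assumes an undirected network stored as a dictionary from each node to its set of friends; the function names and the path network are hypothetical.

```python
def dependency_neighborhoods(friends):
    """D_i = {i} plus the friends of i plus the friends of those friends."""
    neighborhoods = {}
    for i, f_i in friends.items():
        d_i = {i} | set(f_i)
        for j in f_i:
            d_i |= set(friends[j])
        neighborhoods[i] = d_i
    return neighborhoods


def r_indicator(neighborhoods, i, j):
    """R(i, j) = 1 if j belongs to the dependency neighborhood of i, else 0."""
    return int(j in neighborhoods[i])


# Toy path network (hypothetical): 0 - 1 - 2 - 3.
friends = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
D = dependency_neighborhoods(friends)
```

On the path network, node 0 is at distance three from node 3, so R(0, 3) = 0, while nodes 0 and 2 share the mutual contact 1 and therefore belong to each other's neighborhoods.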
Applied to $Z'_{nX}$, Stein’s method provides the following upper bound
$$d(Z'_{nX}, Z) \le \sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|Z'_{X,i} Z'_{X,j} Z'_{X,k}\right| + \sqrt{\frac{2}{\pi}} \sqrt{Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} Z'_{X,i} Z'_{X,j}\right)},$$
where Z ∼ N(0, 1) and d(·, ·) is the Wasserstein distance metric (Vallender, 1974).
In order to show that $Z'_{nX}$ converges in distribution to Z, we must show that the righthand side
of the inequality converges to zero as n goes to infinity. We will first show that this convergence
holds when $K_i = |F_i| = K_{\max}(n)$ for all i, that is, when all nodes have the same number of ties. We
will then show that removing any tie from the network preserves an upper bound on the righthand
side of the inequality. This completes our proof that for any network such that $K_i \le K_{\max}(n)$ for
all i and $K_{\max}(n)^2/n$ converges to zero as n goes to infinity, $Z'_{nX}$ converges in distribution to a
standard normal distribution. The same argument applied to $Z_{nC}$ proves that it has a Normal limiting
distribution as well.
Lemma 1 (Applying Stein’s Method to the dependent sum). Consider a network of nodes given
by adjacency matrix A. Let $U_1, \ldots, U_n$ be bounded mean-zero random variables with finite fourth
moments and with dependency neighborhoods $D_i = \{i\} \cup \{j : A_{ij} = 1\} \cup \{k : A_{jk} = 1 \text{ for } j : A_{ij} = 1\}$,
and let $K_i$ be the degree of node i. If $K_i = K_{\max}(n)$ for all i and $K_{\max}(n)^2/n \to 0$, then
$$\frac{\sum U_i}{\sqrt{var\left(\sum U_i\right)}} \xrightarrow{D} N(0, 1).$$
Proof of Lemma 1. Let $U'_i = U_i / \sqrt{var\left(\sum U_i\right)}$. Application of Stein’s method often involves defining the
so-called “Stein coupling” (W, W′, G) (Fang, 2011; Fang et al., 2015). Consider the following sum
of dependent variables $W = \sum_{i=1}^{n} U'_i$. Define a discrete random variable I distributed uniformly
over {1, . . . , n} and define another random variable $W' = W - \sum_{j=1}^{n} R(I, j) U'_j$. Finally, define
$G = -nU'_I$ and note that (W, W′, G) forms a Stein coupling (Fang, 2011; Fang et al., 2015). We
also let $D = W' - W = -\sum_{j=1}^{n} R(I, j) U'_j$. This Stein coupling then allows us to derive the upper
bound
$$d(W, Z) \le \sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|U'_i U'_j U'_k\right| + \sqrt{\frac{2}{\pi}} \sqrt{Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U'_i U'_j\right)}, \tag{9}$$
as shown in Ross (2011). We will now show that, for any network structure,
$$\sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|U'_i U'_j U'_k\right| + \sqrt{\frac{2}{\pi}} \sqrt{Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U'_i U'_j\right)} = O\left(\frac{\sum_{i,j,k} R(i, j) R(i, k)}{\left[\sum_{i,j} R(i, j)\right]^{3/2}}\right). \tag{10}$$
The righthand side of the above equation is equal to $\sqrt{K_{max}(n)^2 / n}$ under the assumption of
$K_{max}(n)$ ties for each node $i \in \{1, \ldots, n\}$. By assumption, we also have that
$K_{max}(n)/\sqrt{n}$ converges to zero as $n$ goes to infinity, and therefore if we can show equation (10)
we have proved that $\sum_i U_i / \sqrt{var\left(\sum_i U_i\right)} \xrightarrow{D} N(0, 1)$.
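Consider the term
$$\sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|U'_i U'_j U'_k\right| = \frac{1}{var\left(\sum_i U_i\right)^{3/2}} \sum_{i=1}^{n} E\left|U_i \left(\sum_{j \in D_i} U_j\right)^{2}\right|.$$
By the assumption of bounded 4th moments,
$var\left(\sum_i U_i\right)^{3/2} = O\left(\left[\sum_{i,j} R(i, j)\right]^{3/2}\right)$, that is,
$var\left(\sum_i U_i\right)$ stabilizes to a constant when scaled by $\sum_{i,j} R(i, j)$. Using the fact that
each $|U_i|$ is bounded we get
$$\sum_{i=1}^{n} E\left|U_i \left(\sum_{j \in D_i} U_j\right)^{2}\right| \le M \sum_{i=1}^{n} \sum_{j,k} R(i, j) R(i, k) = M \sum_{i,j,k} R(i, j) R(i, k),$$
for some positive constant $M < \infty$. Combining the above expressions, we get
$$\sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|U'_i U'_j U'_k\right| = O\left(\frac{\sum_{i,j,k} R(i, j) R(i, k)}{\left[\sum_{i,j} R(i, j)\right]^{3/2}}\right).$$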
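Now consider the second term:
$$\sqrt{Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U'_i U'_j\right)} = \sqrt{\frac{Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U_i U_j\right)}{var\left(\sum_i U_i\right)^{2}}}.$$
There are $\sum_{i,j} R(i, j)$ terms in $\sum_{i=1}^{n} \sum_{j \in D_i} U_i U_j$, and the number of terms
$U_k U_l$ with which $U_i U_j$ has non-zero covariance is
$|D_i \cup D_j| \le \sum_k R(i, k) + \sum_k R(j, k)$, so
$Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U_i U_j\right) \le M \sum_{i,j} R(i, j) \sum_k R(i, k)$ for some finite
$M$. Therefore $Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U_i U_j\right) = O\left(\sum_{i,j,k} R(i, j) R(i, k)\right)$.
Moreover, $var\left(\sum_i U_i\right)^{2} = O\left(\left[\sum_{i,j} R(i, j)\right]^{2}\right)$, so the second term is of
smaller order than the first term. Therefore we have only to consider the first term and we have
completed the proof.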
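To make the order of the bound in equation (10) concrete, the following is a minimal numerical sketch rather than part of the proof. It assumes that $R(i, j) = 1$ exactly when $j \in D_i$ (including $j = i$), and it uses a ring lattice as an illustrative network in which every node has the same degree. The function names `ring_lattice`, `dependency_neighborhoods` and `bound_order` are hypothetical.

```python
def ring_lattice(n, k):
    """Adjacency sets for a ring in which each node is tied to its k nearest
    neighbors on each side, so every node has degree 2k (assumes n > 4k)."""
    return [{(i + s) % n for s in range(-k, k + 1) if s != 0} for i in range(n)]


def dependency_neighborhoods(adj):
    """D_i = {i} plus friends of i plus friends of friends of i, as in Lemma 1."""
    hoods = []
    for i, friends in enumerate(adj):
        d = {i} | friends
        for j in friends:
            d |= adj[j]
        hoods.append(d)
    return hoods


def bound_order(adj):
    """sum_{i,j,k} R(i,j) R(i,k) / (sum_{i,j} R(i,j))^(3/2),
    under the assumption R(i,j) = 1 iff j is in D_i."""
    sizes = [len(d) for d in dependency_neighborhoods(adj)]
    numerator = sum(s * s for s in sizes)   # sum over i of |D_i|^2
    denominator = sum(sizes) ** 1.5         # (sum over i of |D_i|)^(3/2)
    return numerator / denominator


if __name__ == "__main__":
    # Degree is held fixed at 4 while n grows, so K_max(n)^2 / n -> 0.
    for n in (100, 400, 1600):
        print(n, bound_order(ring_lattice(n, 2)))
```

In this toy network every dependency neighborhood has 9 members, so the ratio reduces to $3/\sqrt{n}$ and shrinks as $n$ grows with the degree held fixed, in line with the condition $K_{max}(n)^2/n \to 0$.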
Lemma 2 (Bound goes to zero when $K_i \le K_{max}(n)$ for all $i$). Convergence to zero of the righthand
side of Equation (9) is preserved under the removal of ties and holds as long as $K_i \le K_{max}(n)$ for
all $i$ and $K_{max}(n)^2/n$ converges to zero as $n$ goes to infinity.
Proof of Lemma 2. Consider a sequence of networks with $n$ going to infinity such that the righthand
side of Equation (9) converges to 0, i.e.
$$\sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|U'_i U'_j U'_k\right| + \sqrt{\frac{2}{\pi}} \sqrt{Var\left(\sum_{i=1}^{n} \sum_{j \in D_i} U'_i U'_j\right)} \to 0.$$
Because the second term is of the same or smaller order than the first, we only have to consider
the first term. For this sequence of networks, define
$A_n = \sum_{i=1}^{n} \sum_{j,k \in D_i} E\left|U'_i U'_j U'_k\right|$. Removing a single tie from the underlying
network has the effect of rendering independent some pairs that were previously dependent. We now
consider the effect of rendering a single dependent pair independent but otherwise leaving the
distributions of the random variables the same. Suppose the pair rendered independent is $(l, m)$.
Define a new sequence of networks with $n$ going to infinity to be identical to the previous sequence
but with pair $(l, m)$ independent, and let $A'_n$ be the first term in the righthand side of Equation (9)
for this new sequence. Then
$$A'_n = A_n - 2 \sum_{k \in D_l \cup D_m} E\left|U'_l U'_m U'_k\right|,$$
which is bounded above by $A_n$.
This completes the proof that Z ′nX , Z′nY , and ZnC have Normal limiting distributions.
Lemma 3 (Conditional CLT implies marginal CLT). $Z'_{nX}$ converges to a Normal distribution after
marginalizing over $C$ (but conditioning on the network as captured by the adjacency matrix $A$) and
$Z'_{nY}$ converges to a Normal distribution after marginalizing over $(X, C)$. That is, $Z_{nX}$ and
$Z_{nY}$ both converge to Normal distributions.
Proof of Lemma 3. For illustration consider $Z'_{nX} = \sum_{i=1}^{n} Z'_{X,i}$, where
$$Z'_{X,i} = \left(f_{X,i}(X, C) \mid C\right) / \sqrt{\sigma^2_{nX}(C)},$$
and note that the proof of the convergence of $Z_{nY}$ is nearly identical. The conditional CLT results
from Lemma 1 show that
$$P\left[Z'_{nX} \le x \mid C = c\right] = P\left[\sum_{i=1}^{n} \frac{f_{X,i}(X, c)}{\sqrt{\sigma^2_{nX}(c)}} \le x \,\middle|\, C = c\right]$$
converges to $\Phi(x)$ for each $x$ and almost every $c$, where $\Phi$ is the cumulative distribution function
of the standard Normal random variable and $C$ is a given sequence $(C_i : i = 1, \ldots, n)$. Let $P_C$
denote the distribution of $C$. Then
$$P(Z_{nX} \le x) \equiv P\left[\sum_{i=1}^{n} \frac{f_{X,i}(X, C)}{\sqrt{\sigma^2_{nX}}} \le x\right] = \int_c P(Z'_{nX} \le x \mid C = c)\, dP_C(c).$$
For a given $x$, the dominated convergence theorem is now applied with
$f_n(c) = P(Z'_{nX} \le x \mid C = c)$ and the limit given by $f(c) = \Phi(x) = m$, where $m$ is some constant
that does not depend on $c$. From the previous conditional CLT result it follows that $f_n(c)$ converges
to $f(c)$ pointwise for almost every $c$. The next step is to find an integrable function $g$ such that
$|f_n| \le g$ and $\int g(c)\, dP_C(c) < \infty$. Because each $f_n$ is a probability, the proof is
completed by choosing $g = 1$.
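Written out as a single chain, the dominated convergence step in the proof of Lemma 3 reads as follows. This is only a restatement in the notation of the proof, with the interchange of limit and integral justified by the bound $|f_n| \le 1$:

```latex
\lim_{n \to \infty} P(Z_{nX} \le x)
  = \lim_{n \to \infty} \int f_n(c)\, dP_C(c)
  = \int \lim_{n \to \infty} f_n(c)\, dP_C(c)
  = \int \Phi(x)\, dP_C(c)
  = \Phi(x).
```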
We have now shown that ZnY , ZnX , and ZnC are asymptotically normally distributed. We now
show that the sum of the three processes converges in distribution to a Normal random variable.
Consider three cases: (1) the three processes have the same rate of marginal convergence in distribu-
tion, (2) one of the three processes converges faster than the other two, and (3) two of the processes
converge faster than the third. In all three cases the rate of convergence for the sum will be the
slowest of the three marginal rates. In case (3), the limiting distribution of the sum is determined
entirely by the one process that converges with a slower rate than the other two: the other two
processes will converge to constants (specifically to their expected values of 0) when standardized
by the slower rate; Slutsky’s theorem concludes the proof. We focus on case (1) below; case (2)
follows immediately by applying the proof below to the two processes that converge at the same
slower rate and applying Slutsky’s to the third, faster converging process.
For convenience, in order to show that the sum of the three dependent processes also converges
to a Normal distribution, define
$$C^*_n := \sigma^2_{nY} + \sigma^2_{nX} + \sigma^2_{nC}.$$
Note that $C^*_n$ is related to $C_n$ as follows: $C_n = O(n^2 / C^*_n)$.
Lemma 4 (CLT for the sum of the three orthogonal processes). If all three processes have the same
marginal rate of convergence, then
$$\frac{1}{\sqrt{C^*_n}} \left(f_Y(Y, X, C) + f_X(X, C) + f_C(C)\right) \to N(0, 1).$$
Proof of Lemma 4. Without loss of generality, we prove that $Z_{nX} + Z_{nC} \to N(0, 2)$ and note
that the general result for $(Z_{nY} + Z_{nX} + Z_{nC})$ follows by applying a similar set of arguments.
Consider the random vector $(Z_{nX}, Z_{nC})$ taking values in $\mathbb{R}^2$. Let
$F_n(x_1, x_2) \equiv P(Z_{nX} \le x_1, Z_{nC} \le x_2)$, where $(x_1, x_2) \in \mathbb{R}^2$. Let
$\Phi_2(x_1, x_2) \equiv P(Z_X \le x_1) P(Z_C \le x_2)$, for $Z_X \sim N(0, 1)$ and $Z_C \sim N(0, 1)$, that is,
$\Phi_2(x_1, x_2)$ defines the CDF of the bivariate standard normal distribution, for
$(x_1, x_2) \in \mathbb{R}^2$. The goal is to show that $F_n(x_1, x_2) \to \Phi_2(x_1, x_2)$ for any
$(x_1, x_2) \in \mathbb{R}^2$. The convergence in distribution for $Z_{nX} + Z_{nC}$ will follow by applying
the Cramér-Wold theorem (1936).
Note that
$$P(Z_{nX} \le x_1, Z_{nC} \le x_2) = P(Z_{nX} \le x_1 \mid Z_{nC} \le x_2)\, P(Z_{nC} \le x_2).$$
First, from the previous application of Stein's method, we have that
$$P(Z_{nC} \le x_2) \to \Phi(x_2),$$
where $\Phi(x_2) \equiv P(Z_C \le x_2)$, $Z_C \sim N(0, 1)$ and $x_2 \in \mathbb{R}$. Also note that
$$P(Z_{nX} \le x_1 \mid Z_{nC} \le x_2) = \sum_{c \in \mathcal{C}} P(Z_{nX} \le x_1 \mid C = c)\, P(C = c \mid Z_{nC} \le x_2),$$
where $\mathcal{C}$ denotes the support of $C$, $Z_{nX} = \frac{1}{\sqrt{C^*_n}} f_X(X, C)$,
$Z_{nC} = \frac{1}{\sqrt{C^*_n}} f_C(C)$ and
$$P(C = c \mid Z_{nC} \le x_2) = \frac{P(C = c)\, I\left((1/\sqrt{C^*_n}) f_C(c) \le x_2\right)}{P\left((1/\sqrt{C^*_n}) f_C(C) \le x_2\right)}.$$
By another application of Stein's method, it was shown that
$$P(Z_{nX} \le x_1 \mid C = c) \to \Phi(x_1),$$
for any realization $c \in \mathcal{C}$. That is, we have shown that the limiting distribution of $Z_{nX}$
conditional on $C = c$ does not itself depend on the conditioning event $C = c$. Applying Lemma 3, we
finally conclude that $F_n(x_1, x_2) \to \Phi_2(x_1, x_2)$ for any $(x_1, x_2) \in \mathbb{R}^2$, and the result
follows.
8.4 Variance estimation
The estimate of the variance of the TMLE of $\psi$ can be obtained from the sum, scaled by $1/n^2$, of
the three plug-in estimators of
$$\sigma^2_{nY} = \sum_{i,j} E\left(f_{Y,i}(O) f_{Y,j}(O)\right),$$
$$\sigma^2_{nX} = \sum_{i,j} E\left(f_{X,i}(O) f_{X,j}(O)\right),$$
$$\sigma^2_{nC} = \sum_{i,j} E\left(f_{C,i}(O) f_{C,j}(O)\right).$$
Alternatively, one can estimate the variance from a single plug-in estimator of
$$\frac{1}{n^2} \sum_{i,j} E\left(f_i(O) f_j(O)\right).$$
Note that the contribution to these variances of any pair $i, j$ not in each other's dependency
neighborhoods will be 0. Therefore, it is acceptable to sum only over pairs $i, j$ sharing a tie or a
mutual contact in the underlying network. Finally, note that we do not need to know the true rate of
convergence $\sqrt{C_n}$ to obtain a valid estimate of the C.I. for $\psi$; this rate is captured by the
number of non-zero terms in the variance sums.
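The single plug-in estimator above can be sketched in a few lines. This is an illustrative sketch, not the tmlenet implementation: it assumes the estimated influence-function values are already available as a list `f` of mean-zero numbers and that the network is an undirected graph stored as a list of neighbor sets `adj`. Both names, and the functions `dependent_variance` and `iid_variance`, are hypothetical.

```python
def dependent_variance(f, adj):
    """Plug-in variance (1/n^2) * sum of f[i] * f[j] over pairs (i, j) that are
    the same node, share a tie, or share a mutual contact in the network."""
    n = len(f)
    total = 0.0
    for i in range(n):
        hood = {i} | adj[i]           # node i and its friends
        for j in adj[i]:
            hood |= adj[j]            # add friends of friends
        total += f[i] * sum(f[j] for j in hood)
    return total / n ** 2


def iid_variance(f):
    """Naive plug-in variance that keeps only the i == j terms."""
    n = len(f)
    return sum(v * v for v in f) / n ** 2


if __name__ == "__main__":
    # Toy path network 0 - 1 - 2 - 3 with made-up influence-function values.
    adj = [{1}, {0, 2}, {1, 3}, {2}]
    f = [0.5, -0.2, 0.3, -0.6]
    print(dependent_variance(f, adj), iid_variance(f))
```

The two estimators differ only in which cross terms are kept, which is why the number of non-zero terms, rather than an explicit rate, drives the dependent version.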
9. SIMULATIONS
All simulation and estimation was carried out in the R language (R Core Team, 2015) with the packages
simcausal (Sofrygin et al., 2015) and tmlenet (Sofrygin and van der Laan, 2015). The full R code
for this simulation study is available in a separate GitHub repository (github.com/osofr/Ogburn_etal_simulations).
Sofrygin and van der Laan (2015) and Sofrygin et al. (2017, 2018) provide additional details on
implementation, computation, and simulations for asymptotic regimes with a bounded number of ties
per node and with no latent variable dependence.
The simulations were repeated for community sizes of n = 500, n = 1,000 and n = 10,000. The
estimation was repeated by sampling 1,000 such datasets, conditional on the same network (sampled
only once for each sample size). For the simulations with dependence due to direct transmission, the
baseline covariates were independently and identically distributed. The probability of success for
each $Y_i$ was a logit-linear function of $i$'s exposure $X_i$ (indicator of receiving the economic
incentive), the baseline covariates $C_i$ and the three summary measures of $i$'s friends' exposures and
baseline covariates. In particular, we also assumed that the probability of maintaining gym membership
increased on a logit-linear scale as a function of the following network summaries: the total number
of $i$'s friends who were exposed ($\sum_{j:A_{ij}=1} X_j$), the total number of $i$'s friends who were
physically active at baseline ($\sum_{j:A_{ij}=1} PA_j$) and the product of the two summaries
($\sum_{j:A_{ij}=1} X_j \times \sum_{j:A_{ij}=1} PA_j$).
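The three network summaries and the logit-linear outcome model described above can be sketched as follows. The coefficient names and values in `beta` are hypothetical placeholders, not the values used in the simulation study, and the data layout (a list of neighbor sets `adj` plus 0/1 lists `x` and `pa`) is an assumption made for illustration.

```python
import math


def network_summaries(i, adj, x, pa):
    """Number of i's friends who were exposed, number who were physically
    active at baseline, and the product of the two counts."""
    n_exposed = sum(x[j] for j in adj[i])
    n_active = sum(pa[j] for j in adj[i])
    return n_exposed, n_active, n_exposed * n_active


def outcome_probability(i, adj, x, pa, beta):
    """Logit-linear success probability for Y_i given i's own exposure and
    baseline activity and the three summaries of i's friends."""
    s1, s2, s3 = network_summaries(i, adj, x, pa)
    linear = (beta["intercept"] + beta["x"] * x[i] + beta["pa"] * pa[i]
              + beta["n_exposed"] * s1 + beta["n_active"] * s2
              + beta["product"] * s3)
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    adj = [{1, 2}, {0}, {0}]   # node 0 is tied to nodes 1 and 2
    x = [1, 0, 1]              # exposure indicators
    pa = [0, 1, 1]             # baseline physical activity indicators
    beta = {"intercept": -1.0, "x": 0.5, "pa": 0.5,
            "n_exposed": 0.1, "n_active": 0.1, "product": 0.05}  # hypothetical
    print(network_summaries(0, adj, x, pa), outcome_probability(0, adj, x, pa, beta))
```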
The summary measures and the outcome regression model were correctly specified, but we do
not know (and therefore did not a priori correctly specify a model for) the true density h.
The economic incentive to attend a local gym had a small direct effect on each individual who was
not physically active at baseline and no direct effect on those who were already physically active.
However, physically active individuals were more likely to maintain gym membership over the
follow-up period if they had at least one physically active friend at baseline. We repeated these
simulations with the addition of latent variable dependence, which we introduced by generating
unobserved latent variables for each node which affected the node's own outcome as well as the
outcomes of its friends.
In addition to the preferential attachment network model with both latent variable dependence
and dependence due to direct transmission (results in the main text), we also simulated under
dependence due to direct transmission only. We estimated the marginal parameter $E[\bar{Y}^*_n]$ and
compared three different estimators of the asymptotic variance and the coverage of the corresponding
confidence intervals. First, we looked at the naive plug-in i.i.d. estimator ("IID Var") for the
variance of the influence curve, which treated observations as if they were i.i.d. Second, we used the
plug-in variance estimator based on the efficient influence curve, which adjusted for the correlated
observations ("dependent IC Var") (Sofrygin and van der Laan, 2015). Finally, we used the parametric
bootstrap variance estimator ("bootstrap Var") described in Section 3.6. The mean length and coverage
of these three CI types are shown in Figure 4.
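The two summaries reported for each CI type, mean length and coverage, can be computed from the replicated estimates as in the sketch below. It assumes Wald-type intervals (estimate plus or minus 1.96 standard errors), which is an assumption about how the intervals were formed; the function names and the toy numbers are hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def wald_ci(estimate, variance):
    """Wald-type 95% confidence interval from a point estimate and its variance."""
    half_width = Z_975 * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


def summarize_cis(estimates, variances, truth):
    """Mean CI length and empirical coverage of the true parameter value
    across simulation replicates."""
    cis = [wald_ci(e, v) for e, v in zip(estimates, variances)]
    mean_length = sum(hi - lo for lo, hi in cis) / len(cis)
    coverage = sum(lo <= truth <= hi for lo, hi in cis) / len(cis)
    return mean_length, coverage


if __name__ == "__main__":
    # Three made-up replicates; the third interval misses the truth of 0.5.
    print(summarize_cis([0.50, 0.52, 0.70], [0.01, 0.01, 0.0025], truth=0.5))
```

Running the same summary once per variance estimator (i.i.d., dependent, bootstrap) yields the kind of comparison plotted in the figures.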
[Figure 4 plot omitted. Left panel: mean estimate and 95% CI length; right panel: coverage. Rows: N = 500, 1000, 10000, each showing interventions g∗1 (random 35%), g∗2 (dynamic intervention), g∗3 (network intervention) and g∗2 + g∗3. CI types: dependent IC Var, bootstrap Var, iid Var.]
Figure 4: Mean 95% CI length (left panel) and coverage (right panel) for the TMLE in preferential attachment network with dependence due to direct transmission, by sample size, intervention and CI type.
Results from simulations with dependence due to direct transmission show that conducting
inference while ignoring the nature of the dependence in such datasets generally results in
anticonservative variance estimates and under-coverage of CIs, which can be as low as 50% even for
very large sample sizes ("IID Var" in the right panel of Figure 4). The CIs based on the dependent
variance estimates ("dependent IC Var") obtain nearly nominal coverage of 95% for large enough sample
sizes, but can suffer in smaller sample sizes due to lack of asymptotic normality and near-positivity
violations. Notably, the CIs based on the parametric bootstrap variance estimates provide the most
robust coverage for smaller sample sizes, while attaining the nominal 95% coverage in large sample
sizes for nearly all of the simulation scenarios ("bootstrap Var"). The apparent robustness of the
parametric bootstrap method for inference in small sample sizes, even as low as n = 500, was one
of the surprising findings of this simulation study. Future work will explore the assumptions under
which this parametric bootstrap works and its sensitivity to violations of those assumptions.
We also simulated social networks from the small world network model (Watts and Strogatz,
1998) with a rewiring probability of 0.1. The results of these simulations are in Figures 5 and 6.
[Figure 5 plot omitted. Left panel: mean estimate and 95% CI length; right panel: coverage. Rows: N = 500, 1000, 10000, each showing interventions g∗1 (random 35%), g∗2 (dynamic intervention), g∗3 (network intervention) and g∗2 + g∗3. CI types: dependent IC Var, bootstrap Var, iid Var.]
Figure 5: Mean 95% CI length (left panel) and coverage (right panel) for the TMLE in small world network with dependence due to direct transmission, by sample size, intervention and CI type. Results are shown for the estimates of the average expected outcome under four hypothetical interventions (g∗1, g∗2, g∗3 and g∗2 + g∗3).
[Figure 6 plot omitted. Left panel: mean estimate and 95% CI length; right panel: coverage. Rows: N = 500, 1000, 10000, each showing interventions g∗1 (random 35%), g∗2 (dynamic intervention), g∗3 (network intervention) and g∗2 + g∗3. CI types: dependent IC Var, iid Var.]
Figure 6: Mean 95% CI length (left panel) and coverage (right panel) for the TMLE in small world network with latent variable dependence, by sample size, intervention and CI type. Results are shown for the estimates of the average expected outcome under four hypothetical interventions (g∗1, g∗2, g∗3 and g∗2 + g∗3).
Figure 7: Comparing re-scaled empirical TMLE distributions (black) to their theoretical normal limit (red) with varying sample size (x-axis) and intervention type (y-axis). TMLEs were centered at the truth and then re-scaled by true SD. Results shown for the preferential attachment network (left) and the small world network (right).
We examined the empirical distribution of the transformed TMLEs, comparing their histogram
estimates to the predicted normal limiting distribution, with the results shown in Figure 7, where
the histogram plots are displayed by sample size (horizontal axis) and the intervention type (vertical
axis). The estimates were first centered at the corresponding true parameter values and then re-
scaled by their corresponding true standard deviation (SD). We note that our results indicate that
the estimators converge to their normal theoretical limiting distribution, even in networks with
power law node degree distribution, such as the preferential attachment network model, as well
as in the densely connected networks obtained under the small world network model. The results
shown in Figure 7 were generated from simulations with dependence due to direct transmission;
simulations with latent variable dependence (not shown) evinced similar approximate normality.
10. COMPARISON OF ESTIMANDS
Table 1 summarizes the relationships among the two sets of assumptions (with and without latent
variable dependence) and the two classes of estimands (marginal over C and conditional on C)
according to their properties and according to the limitations of our proposed methods.
Table 1: Properties of marginal estimands and of estimands conditional on C

Properties that we have demonstrated for the                              Estimand class
two classes of estimands                                                  Marginal   Conditional
nonparametrically identified with or without latent variable (LV) dependence   yes        yes
estimator is CAN with or without LV dependence                            yes        yes
efficient estimator is available with LV dependence                       no         no
efficient estimator is available without LV dependence                    yes        yes
consistent and tractable variance estimation with LV dependence           no         yes
consistent and tractable variance estimation without LV dependence        yes        yes
11. GLOSSARY OF NOTATION
$A$ with entries $A_{ij} \equiv I\{\text{subjects } i \text{ and } j \text{ share a tie}\}$ is the adjacency matrix for the network.
$K_i = \sum_{j=1}^{n} A_{ij}$, that is, $K_i$ is the degree of node $i$, or the number of individuals sharing a tie with
individual $i$.
$F_i = \{j : A_{ij} = 1\}$ is the set of nodes with which node $i$ shares a tie (node $i$'s "friends").
$C_i$ denotes the covariates of individual $i$.
$X_i$ denotes the exposure of individual $i$.
$Y_i$ denotes the outcome of individual $i$.
$s_X$ is a summary function of $C$ upon which $X$ depends.
$s_Y$ is a summary function of $C, X$ upon which $Y$ depends.
$W_i = s_{X,i}(\{C_j : A_{ij} = 1\})$
$V_i = s_{Y,i}(\{C_j : A_{ij} = 1\}, \{X_j : A_{ij} = 1\})$
$O_i = (C_i, W_i, X_i, V_i, Y_i)$
$x^*_i$ represents a user-specified intervention value of $X_i$.
$Y_i(x^*)$, shorthand $Y^*_i$, denotes the potential or counterfactual outcome of individual $i$ in a
hypothetical world in which $P(X = x^*) = 1$.
$V_i(x^*)$, shorthand $V^*_i$, is equal to $s_{Y,i}(C, x^*)$ and is a counterfactual random variable in a
hypothetical world in which $P(X = x^*) = 1$.
$\bar{Y}^*_n = \frac{1}{n} \sum_{i=1}^{n} Y^*_i$.
$p_C(c) = P(C = c)$
$g(x|w) = P(X = x \mid W = w)$
$g_i(x|w) = P(X_i = x \mid W_i = w)$
$p_Y(y|v) = P(Y = y \mid V = v)$
$p_{Y,i}(y|v) = P(Y_i = y \mid V_i = v)$
$h_i(v) = P(V_i = v)$
$h_{i,x^*}(v) = P(V^*_i = v)$
$m(v) = \sum_y y\, p_Y(y|v)$ is the conditional expectation of $Y$ given $V = v$.
$h(v_i) = \frac{1}{n} \sum_{j=1}^{n} h_j(v_i)$
$h_{x^*}(v_i) = \frac{1}{n} \sum_{j=1}^{n} h_{j,x^*}(v_i)$
$v_i = s_{Y,i}(c, x)$
$V^*_i = s_{Y,i}(C, x^*)$
$D(o)$ is the efficient influence function under assumptions (A1), (A4) and (A5).
$D'(o)$ is the efficient influence function under assumptions (A1) and (A4).
$D_c(o)$ is an influence function conditional on $C = c$.
$K_{max,n} = \max_i\{K_i\}$
$\sqrt{C_n}$ is the rate of convergence in Theorem 1.
REFERENCES
Ali, M. M. and D. S. Dwyer (2010). Social network effects in alcohol consumption among adolescents.
Addictive behaviors 35 (4), 337–342.
Aronow, P. M. and C. Samii (2013). Estimating average causal effects under general interference.
Technical report, Yale University.
Athey, S., D. Eckles, and G. W. Imbens (2018). Exact p-values for network interference. Journal
of the American Statistical Association 113 (521), 230–240.
Barabási, A.-L. and R. Albert (1999). Emergence of scaling in random networks. science 286 (5439),
509–512.
Basse, G., A. Feller, and P. Toulis (2019). Randomization tests of causal effects under interference.
Biometrika 106 (2), 487–494.
Basse, G. W. and E. M. Airoldi (2015). Optimal design of experiments in the presence of network-
correlated outcomes. ArXiv e-prints.
Basse, G. W. and E. M. Airoldi (2018). Model-assisted design of experiments in the presence of
network-correlated outcomes. Biometrika 105 (4), 849–858.
Benkeser, D., C. Ju, S. Lendle, and M. van der Laan (2018). Online cross-validation-based ensemble
learning. Statistics in medicine 37 (2), 249–260.
Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998).
Efficient and adaptive estimation for semiparametric models, Volume 2. Springer New York.
Bowers, J., M. M. Fredrickson, and C. Panagopoulos (2013). Reasoning about interference between
units: A general framework. Political Analysis 21, 97–124.
Cacioppo, J. T., J. H. Fowler, and N. A. Christakis (2009). Alone in the crowd: the structure and
spread of loneliness in a large social network. Journal of personality and social psychology 97 (6),
977.
Cai, X., W. W. Loh, and F. W. Crawford (2019). Identification of causal intervention effects under
contagion. arXiv preprint arXiv:1912.04151 .
Caron, F. and E. B. Fox (2017). Sparse graphs using exchangeable random measures. Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 79 (5), 1295–1366.
Chen, L. H. and Q.-M. Shao (2004). Normal approximation under local dependence. The Annals
of Probability 32 (3), 1985–2028.
Christakis, N. and J. Fowler (2007). The spread of obesity in a large social network over 32 years.
New England Journal of Medicine 357 (4), 370–379.
Christakis, N. and J. Fowler (2008). The collective dynamics of smoking in a large social network.
New England journal of medicine 358 (21), 2249–2258.
Christakis, N. and J. Fowler (2010). Social network sensors for early detection of contagious out-
breaks. PloS one 5 (9), e12948.
Clauset, A., C. R. Shalizi, and M. E. Newman (2009). Power-law distributions in empirical data.
SIAM review 51 (4), 661–703.
Cohen-Cole, E. and J. Fletcher (2008). Is obesity contagious? social networks vs. environmental
factors in the obesity epidemic. Journal of Health Economics 27 (5), 1382–1387.
Diaconis, P. and S. Janson (2007). Graph limits and exchangeable random graphs. arXiv preprint
arXiv:0712.2749 .
Eck, D. J., O. Morozova, and F. W. Crawford (2018). Randomization for the direct effect of an
infectious disease intervention in a clustered study population. arXiv preprint arXiv:1808.05593 .
Eckles, D., B. Karrer, and J. Ugander (2014). Design and analysis of experiments in networks:
Reducing bias from interference. arXiv preprint arXiv:1404.7530 .
Fang, X. (2011). Multivariate, combinatorial and discretized normal approximations by Stein’s
method. Ph. D. thesis.
Fang, X., A. Röllin, et al. (2015). Rates of convergence for multivariate normal approximation
with applications to dense graphs and doubly indexed permutation statistics. Bernoulli 21 (4),
2157–2189.
Forastiere, L., E. M. Airoldi, and F. Mealli (2016). Identification and estimation of treatment and
interference effects in observational studies on networks. arXiv preprint arXiv:1609.06245 .
Fowler, J. H. and N. A. Christakis (2008). Dynamic spread of happiness in a large social network:
longitudinal analysis over 20 years in the framingham heart study. Bmj 337, a2338.
Graham, B., G. Imbens, and G. Ridder (2010). Measuring the effects of segregation in the presence
of social spillovers: A nonparametric approach. Technical report, National Bureau of Economic
Research.
Halloran, M. and M. Hudgens (2011). Causal inference for vaccine effects on infectiousness. The
University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series,
20.
Halloran, M. and C. Struchiner (1995). Causal inference in infectious diseases. Epidemiology ,
142–151.
Haneuse, S. and A. Rotnitzky (2013). Estimation of the effect of interventions that modify the
received treatment. Statistics in medicine 32 (30), 5260–5277.
Harling, G., R. Wang, J.-P. Onnela, and V. DeGruttola (2016). Leveraging contact network struc-
ture in the design of cluster randomized trials. Harvard University Biostatistics Working Paper
Series (Working Paper 199).
Hong, G. and S. Raudenbush (2006). Evaluating kindergarten retention policy. Journal of the
American Statistical Association 101 (475), 901–910.
Hudgens, M. and M. Halloran (2008). Toward causal inference with interference. Journal of the
American Statistical Association 103 (482), 832–842.
Jagadeesan, R., N. Pillai, and A. Volfovsky (2017). Designs for estimating the treatment effect in
networks with interference. arXiv preprint arXiv:1705.08524 .
Kao, E., P. Toulis, E. Airoldi, and D. Rubin (2012). Causal estimation of peer influence effects. In
Proceedings of the NIPS Workshop on Social Network and Social Media Analysis.
Kolaczyk, E. D. and P. N. Krivitsky (2015). On the question of effective sample size in network mod-
eling: An asymptotic inquiry. Statistical science: a review journal of the Institute of Mathematical
Statistics 30 (2), 184.
Lauritzen, S. L. and T. S. Richardson (2002). Chain graph models and their causal interpretations.
Journal of the Royal Statistical Society: Series B 64 (3), 321–348.
Lee, Y. and E. L. Ogburn (2019). Network dependence and confounding by network structure lead
to invalid inference. arXiv preprint arXiv:1908.00520 .
Leung, M. P. (2016). Treatment and spillover effects under network interference. Review of Eco-
nomics and Statistics, 1–42.
Liu, L. and M. G. Hudgens (2014). Large sample randomization inference of causal effects in the
presence of interference. Journal of the american statistical association 109 (505), 288–301.
Liu, L., M. G. Hudgens, and S. Becker-Dreps (2016). On inverse probability-weighted estimators in
the presence of interference. Biometrika 103 (4), 829–842.
Lovász, L. (2012). Large networks and graph limits, Volume 60. American Mathematical Soc.
Lyons, R. (2011). The spread of evidence-poor medicine via flawed social-network analysis. Statistics,
Politics, and Policy 2 (1).
Madan, A., S. T. Moturu, D. Lazer, and A. S. Pentland (2010). Social sensing: obesity, unhealthy
eating and exercise in face-to-face networks. In Wireless Health 2010, pp. 104–110. ACM.
Muñoz, I. D. and M. van der Laan (2012). Population intervention causal effects based on stochastic
interventions. Biometrics 68 (2), 541–549.
Munoz, I. D. and M. J. van der Laan (2011). Super learner based conditional density estimation
with application to marginal structural models. The International Journal of Biostatistics 7 (1),
1–20.
Newman, M. (2009). Networks: an introduction. Oxford: Oxford University Press.
Newman, M. E. and J. Park (2003). Why social networks are different from other types of networks.
Physical Review E 68 (3), 036122.
Noel, H. and B. Nyhan (2011). The “unfriending” problem: The consequences of homophily in
friendship retention for causal estimates of social influence. Social Networks 33 (3), 211–218.
Ogburn, E. and T. J. VanderWeele (2013). Causal diagrams for interference. Technical report,
Harvard University.
Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and chain graphs.
arXiv preprint arXiv:1812.04990 .
Ogburn, E. L. and T. J. VanderWeele (2014). Vaccines, contagion, and social networks. arXiv
preprint arXiv:1403.1241 .
Ogburn, E. L., T. J. VanderWeele, et al. (2014). Causal diagrams for interference. Statistical
science 29 (4), 559–578.
Papadogeorgou, G., F. Mealli, and C. M. Zigler (2019). Causal inference with interfering units for
cluster and population level treatment allocation programs. Biometrics 75 (3), 778–787.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika 82 (4), 669–688.
Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge Univ Press.
Pearl, J. (2012). The causal foundations of structural equation modeling. Technical report, CALI-
FORNIA UNIV LOS ANGELES DEPT OF COMPUTER SCIENCE.
Puelz, D., G. Basse, A. Feller, and P. Toulis (2019). A graph-theoretic approach to randomization
tests of causal effects under general interference. arXiv preprint arXiv:1910.10862 .
R Core Team (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria:
R Foundation for Statistical Computing.
Rosenbaum, P. (2007). Interference between units in randomized experiments. Journal of the
American Statistical Association 102 (477), 191–200.
Rosenquist, J. N., J. Murabito, J. H. Fowler, and N. A. Christakis (2010). The spread of alcohol
consumption behavior in a large social network. Annals of Internal Medicine 152 (7), 426–433.
Ross, N. F. (2011). Fundamentals of stein’s method. Probability Surveys 8, 210–293.
Rubin, D. (1990). Comment: Neyman (1923) and causal inference in experiments and observational
studies. Statistical Science 5 (4), 472–480.
Sävje, F. (2019). Causal inference with misspecified exposure mappings. Technical report,
Yale University.
Sävje, F., P. M. Aronow, and M. G. Hudgens (2017). Average treatment effects in the presence of
unknown interference. arXiv preprint arXiv:1711.06399 .
Shalizi, C. and A. Thomas (2011). Homophily and contagion are generically confounded in obser-
vational social network studies. Sociological Methods & Research 40 (2), 211–239.
Shalizi, C. R. and A. Rinaldo (2013). Consistency under sampling of exponential random graph
models. Annals of Statistics 41 (2), 508–535.
Sobel, M. (2006). What do randomized studies of housing mobility demonstrate? Journal of the
American Statistical Association 101 (476), 1398–1407.
Sofrygin, O., R. Neugebauer, and M. J. van der Laan (2017). Conducting simulations in causal
inference with networks-based structural equation models. arXiv preprint arXiv:1705.10376 .
Sofrygin, O., E. L. Ogburn, and M. J. van der Laan (2018). Single time point interventions in
network-dependent data. In Targeted Learning in Data Science, pp. 373–396. Springer.
Sofrygin, O. and M. J. van der Laan (2015). Semi-Parametric Estimation and Inference for the
Mean Outcome of the Single Time-Point Intervention in a Causally Connected Population. U.C.
Berkeley Division of Biostatistics Working Paper Series (Working Paper 344).
Sofrygin, O. and M. J. van der Laan (2015). tmlenet: Targeted Maximum Likelihood Estimation for
Network Data. R package version 0.1.0.
Sofrygin, O., M. J. van der Laan, and R. Neugebauer (2015). simcausal: Simulating Longitudinal
Data with Causal Inference Applications. R package version 0.5.0.
Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum
of dependent random variables. In Proc. Sixth Berkeley Symp. Math. Stat. Prob., pp. 583–602.
Sussman, D. L. and E. M. Airoldi (2017). Elements of estimation theory for causal effects in the
presence of network interference. arXiv preprint arXiv:1702.03578 .
Tchetgen Tchetgen, E. J., I. Fulcher, and I. Shpitser (2017, 09). Auto-g-computation of causal
effects on a network. Technical report.
Tchetgen Tchetgen, E. J. and T. VanderWeele (2012). On causal inference in the presence of
interference. Statistical Methods in Medical Research 21 (1), 55–75.
Toulis, P., A. Volfovsky, and E. M. Airoldi (2018). Propensity score methodology in the presence
of network entanglement between treatments. arXiv preprint arXiv:1801.07310 .
Trogdon, J. G., J. Nonnemaker, and J. Pais (2008). Peer effects in adolescent overweight. Journal
of health economics 27 (5), 1388–1399.
Tsao, C. W. and R. S. Vasan (2015). Cohort profile: The framingham heart study (fhs): overview
of milestones in cardiovascular epidemiology. International journal of epidemiology 44 (6), 1800–
1813.
Vallender, S. (1974). Calculation of the wasserstein distance between probability distributions on
the line. Theory of Probability & Its Applications 18 (4), 784–786.
van der Laan, M. J. (2014). Causal inference for a population of causally connected units. Journal
of Causal Inference 0 (0), 2193–3677.
Van der Laan, M. J. and S. Rose (2011). Targeted learning: causal inference for observational and
experimental data. Springer Science & Business Media.
van der Laan, M. J., E. C. Polley, et al. (2007). Super learner. Statistical Applications in
Genetics and Molecular Biology 6 (1), 1–23.
Van der Vaart, A. W. (1998). Asymptotic statistics, Volume 3. Cambridge university press.
Van Der Vaart, A. W. and J. A. Wellner (1996). Weak convergence. In Weak Convergence and
Empirical Processes, pp. 16–28. Springer.
VanderWeele, T. (2010). Direct and indirect effects for neighborhood-based clustered and
longitudinal data. Sociological Methods & Research 38 (4), 515–544.
Wasserman, S. (2013). Comment on “Social contagion theory: Examining dynamic social networks
and human behavior” by Nicholas Christakis and James Fowler. Statistics in Medicine 32 (4),
578–580.
Watts, D. J. and S. H. Strogatz (1998). Collective dynamics of small-world networks.
Nature 393 (6684), 440–442.
Young, J. G., M. A. Hernán, and J. M. Robins (2014). Identification, estimation and approximation
of risk under interventions that depend on the natural value of treatment using observational
data. Epidemiologic Methods 3 (1), 1–19.
REFERENCES
Ali, M. M. and D. S. Dwyer (2010). Social network effects in alcohol consumption among adolescents.
Addictive behaviors 35 (4), 337–342.
Aronow, P. M. and C. Samii (2013). Estimating average causal effects under general interference.
Technical report, Yale University.
Athey, S., D. Eckles, and G. W. Imbens (2018). Exact p-values for network interference. Journal
of the American Statistical Association 113 (521), 230–240.
Barabási, A.-L. and R. Albert (1999). Emergence of scaling in random networks. science 286 (5439),
509–512.
Basse, G., A. Feller, and P. Toulis (2019). Randomization tests of causal effects under interference.
Biometrika 106 (2), 487–494.
64
Basse, G. W. and E. M. Airoldi (2015). Optimal design of experiments in the presence of network-
correlated outcomes. ArXiv e-prints.
Basse, G. W. and E. M. Airoldi (2018). Model-assisted design of experiments in the presence of
network-correlated outcomes. Biometrika 105 (4), 849–858.
Benkeser, D., C. Ju, S. Lendle, and M. van der Laan (2018). Online cross-validation-based ensemble
learning. Statistics in medicine 37 (2), 249–260.
Bickel, P. J., C. A. Klaassen, P. J. Bickel, Y. Ritov, J. Klaassen, J. A. Wellner, and Y. Ritov (1998).
Efficient and adaptive estimation for semiparametric models, Volume 2. Springer New York.
Bowers, J., F. M. M, and P. C (2013). Reasoning about interference between units: A general
framework. Political Analysis 21, 97–124.
Cacioppo, J. T., J. H. Fowler, and N. A. Christakis (2009). Alone in the crowd: the structure and
spread of loneliness in a large social network. Journal of personality and social psychology 97 (6),
977.
Cai, X., W. W. Loh, and F. W. Crawford (2019). Identification of causal intervention effects under
contagion. arXiv preprint arXiv:1912.04151 .
Caron, F. and E. B. Fox (2017). Sparse graphs using exchangeable random measures. Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 79 (5), 1295–1366.
Chen, L. H. and Q.-M. Shao (2004). Normal approximation under local dependence. The Annals
of Probability 32 (3), 1985–2028.
Christakis, N. and J. Fowler (2007). The spread of obesity in a large social network over 32 years.
New England Journal of Medicine 357 (4), 370–379.
Christakis, N. and J. Fowler (2008). The collective dynamics of smoking in a large social network.
New England journal of medicine 358 (21), 2249–2258.
Christakis, N. and J. Fowler (2010). Social network sensors for early detection of contagious out-
breaks. PloS one 5 (9), e12948.
65
Clauset, A., C. R. Shalizi, and M. E. Newman (2009). Power-law distributions in empirical data.
SIAM review 51 (4), 661–703.
Cohen-Cole, E. and J. Fletcher (2008). Is obesity contagious? social networks vs. environmental
factors in the obesity epidemic. Journal of Health Economics 27 (5), 1382–1387.
Diaconis, P. and S. Janson (2007). Graph limits and exchangeable random graphs. arXiv preprint
arXiv:0712.2749 .
Eck, D. J., O. Morozova, and F. W. Crawford (2018). Randomization for the direct effect of an
infectious disease intervention in a clustered study population. arXiv preprint arXiv:1808.05593 .
Eckles, D., B. Karrer, and J. Ugander (2014). Design and analysis of experiments in networks:
Reducing bias from interference. arXiv preprint arXiv:1404.7530 .
Fang, X. (2011). Multivariate, combinatorial and discretized normal approximations by Stein’s
method. Ph. D. thesis.
Fang, X., A. Röllin, et al. (2015). Rates of convergence for multivariate normal approximation
with applications to dense graphs and doubly indexed permutation statistics. Bernoulli 21 (4),
2157–2189.
Forastiere, L., E. M. Airoldi, and F. Mealli (2016). Identification and estimation of treatment and
interference effects in observational studies on networks. arXiv preprint arXiv:1609.06245 .
Fowler, J. H. and N. A. Christakis (2008). Dynamic spread of happiness in a large social network:
longitudinal analysis over 20 years in the framingham heart study. Bmj 337, a2338.
Graham, B., G. Imbens, and G. Ridder (2010). Measuring the effects of segregation in the presence
of social spillovers: A nonparametric approach. Technical report, National Bureau of Economic
Research.
Halloran, M. and M. Hudgens (2011). Causal inference for vaccine effects on infectiousness. The
University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series,
20.
66
Halloran, M. and C. Struchiner (1995). Causal inference in infectious diseases. Epidemiology ,
142–151.
Haneuse, S. and A. Rotnitzky (2013). Estimation of the effect of interventions that modify the
received treatment. Statistics in medicine 32 (30), 5260–5277.
Harling, G., R. Wang, J.-P. Onnela, and V. DeGruttola (2016). Leveraging contact network struc-
ture in the design of cluster randomized trials. Harvard University Biostatistics Working Paper
Series (Working Paper 199).
Hong, G. and S. Raudenbush (2006). Evaluating kindergarten retention policy. Journal of the
American Statistical Association 101 (475), 901–910.
Hudgens, M. and M. Halloran (2008). Toward causal inference with interference. Journal of the
American Statistical Association 103 (482), 832–842.
Jagadeesan, R., N. Pillai, and A. Volfovsky (2017). Designs for estimating the treatment effect in
networks with interference. arXiv preprint arXiv:1705.08524 .
Kao, E., P. Toulis, E. Airoldi, and D. Rubin (2012). Causal estimation of peer influence effects. In
Proceedings of the NIPS Workshop on Social Network and Social Media Analysis.
Kolaczyk, E. D. and P. N. Krivitsky (2015). On the question of effective sample size in network mod-
eling: An asymptotic inquiry. Statistical science: a review journal of the Institute of Mathematical
Statistics 30 (2), 184.
Lauritzen, S. L. and T. S. Richardson (2002). Chain graph models and their causal interpretations.
Journal of the Royal Statistical Society: Series B 64 (3), 321–348.
Lee, Y. and E. L. Ogburn (2019). Network dependence and confounding by network structure lead
to invalid inference. arXiv preprint arXiv:1908.00520 .
Leung, M. P. (2016). Treatment and spillover effects under network interference. Review of Eco-
nomics and Statistics, 1–42.
67
Liu, L. and M. G. Hudgens (2014). Large sample randomization inference of causal effects in the
presence of interference. Journal of the american statistical association 109 (505), 288–301.
Liu, L., M. G. Hudgens, and S. Becker-Dreps (2016). On inverse probability-weighted estimators in
the presence of interference. Biometrika 103 (4), 829–842.
Lovász, L. (2012). Large networks and graph limits, Volume 60. American Mathematical Soc.
Lyons, R. (2011). The spread of evidence-poor medicine via flawed social-network analysis. Statistics,
Politics, and Policy 2 (1).
Madan, A., S. T. Moturu, D. Lazer, and A. S. Pentland (2010). Social sensing: obesity, unhealthy
eating and exercise in face-to-face networks. In Wireless Health 2010, pp. 104–110. ACM.
Muñoz, I. D. and M. van der Laan (2012). Population intervention causal effects based on stochastic
interventions. Biometrics 68 (2), 541–549.
Munoz, I. D. and M. J. van der Laan (2011). Super learner based conditional density estimation
with application to marginal structural models. The International Journal of Biostatistics 7 (1),
1–20.
Newman, M. (2009). Networks: an introduction. Oxford: Oxford University Press.
Newman, M. E. and J. Park (2003). Why social networks are different from other types of networks.
Physical Review E 68 (3), 036122.
Noel, H. and B. Nyhan (2011). The “unfriending” problem: The consequences of homophily in
friendship retention for causal estimates of social influence. Social Networks 33 (3), 211–218.
Ogburn, E. and T. J. VanderWeele (2013). Causal diagrams for interference. Technical report,
Harvard University.
Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and chain graphs.
arXiv preprint arXiv:1812.04990 .
Ogburn, E. L. and T. J. VanderWeele (2014). Vaccines, contagion, and social networks. arXiv
preprint arXiv:1403.1241 .
68
Ogburn, E. L., T. J. VanderWeele, et al. (2014). Causal diagrams for interference. Statistical
science 29 (4), 559–578.
Papadogeorgou, G., F. Mealli, and C. M. Zigler (2019). Causal inference with interfering units for
cluster and population level treatment allocation programs. Biometrics 75 (3), 778–787.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika 82 (4), 669–688.
Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge Univ Press.
Pearl, J. (2012). The causal foundations of structural equation modeling. Technical report, CALI-
FORNIA UNIV LOS ANGELES DEPT OF COMPUTER SCIENCE.
Puelz, D., G. Basse, A. Feller, and P. Toulis (2019). A graph-theoretic approach to randomization
tests of causal effects under general interference. arXiv preprint arXiv:1910.10862 .
R Core Team (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria:
R Foundation for Statistical Computing.
Rosenbaum, P. (2007). Interference between units in randomized experiments. Journal of the
American Statistical Association 102 (477), 191–200.
Rosenquist, J. N., J. Murabito, J. H. Fowler, and N. A. Christakis (2010). The spread of alcohol
consumption behavior in a large social network. Annals of Internal Medicine 152 (7), 426–433.
Ross, N. F. (2011). Fundamentals of stein’s method. Probability Surveys 8, 210–293.
Rubin, D. (1990). Comment: Neyman (1923) and causal inference in experiments and observational
studies. Statistical Science 5 (4), 472–480.
Sävje, F. (2019). Causal inference with misspecified exposure mappings. Technical report, Technical
report, Technical report, Yale University.
Sävje, F., P. M. Aronow, and M. G. Hudgens (2017). Average treatment effects in the presence of
unknown interference. arXiv preprint arXiv:1711.06399 .
Shalizi, C. and A. Thomas (2011). Homophily and contagion are generically confounded in obser-
vational social network studies. Sociological Methods & Research 40 (2), 211–239.
69
Shalizi, C. R. and A. Rinaldo (2013). Consistency under sampling of exponential random graph
models. Annals of Statistics 41 (2), 508–535.
Sobel, M. (2006). What do randomized studies of housing mobility demonstrate? Journal of the
American Statistical Association 101 (476), 1398–1407.
Sofrygin, O., R. Neugebauer, and M. J. van der Laan (2017). Conducting simulations in causal
inference with networks-based structural equation models. arXiv preprint arXiv:1705.10376 .
Sofrygin, O., E. L. Ogburn, and M. J. van der Laan (2018). Single time point interventions in
network-dependent data. In Targeted Learning in Data Science, pp. 373–396. Springer.
Sofrygin, O. and M. J. van der Laan (2015). Semi-Parametric Estimation and Inference for the
Mean Outcome of the Single Time-Point Intervention in a Causally Connected Population. U.C.
Berkeley Division of Biostatistics Working Paper Series (Working Paper 344).
Sofrygin, O. and M. J. van der Laan (2015). tmlenet: Targeted Maximum Likelihood Estimation for
Network Data. R package version 0.1.0.
Sofrygin, O., M. J. van der Laan, and R. Neugebauer (2015). simcausal: Simulating Longitudinal
Data with Causal Inference Applications. R package version 0.5.0.
Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum
of dependent random variables. In Proc. Sixth Berkeley Symp. Math. Stat. Prob., pp. 583–602.
Sussman, D. L. and E. M. Airoldi (2017). Elements of estimation theory for causal effects in the
presence of network interference. arXiv preprint arXiv:1702.03578 .
Tchetgen Tchetgen, E. J., I. Fulcher, and I. Shpitser (2017, 09). Auto-g-computation of causal
effects on a network. Technical report.
Tchetgen Tchetgen, E. J. and T. VanderWeele (2012). On causal inference in the presence of
interference. Statistical Methods in Medical Research 21 (1), 55–75.
Toulis, P., A. Volfovsky, and E. M. Airoldi (2018). Propensity score methodology in the presence
of network entanglement between treatments. arXiv preprint arXiv:1801.07310 .
70
Trogdon, J. G., J. Nonnemaker, and J. Pais (2008). Peer effects in adolescent overweight. Journal
of health economics 27 (5), 1388–1399.
Tsao, C. W. and R. S. Vasan (2015). Cohort profile: The framingham heart study (fhs): overview
of milestones in cardiovascular epidemiology. International journal of epidemiology 44 (6), 1800–
1813.
Vallender, S. (1974). Calculation of the wasserstein distance between probability distributions on
the line. Theory of Probability & Its Applications 18 (4), 784–786.
van der Laan, M. J. (2014). Causal inference for a population of causally connected units. Journal
of Causal Inference 0 (0), 2193–3677.
Van der Laan, M. J. and S. Rose (2011). Targeted learning: causal inference for observational and
experimental data. Springer Science & Business Media.
van der Laan Mark, J., C. Polley Eric, et al. (2007). Super learner. Statistical Applications in
Genetics and Molecular Biology 6 (1), 1–23.
Van der Vaart, A. W. (1998). Asymptotic statistics, Volume 3. Cambridge university press.
Van Der Vaart, A. W. and J. A. Wellner (1996). Weak convergence. In Weak Convergence and
Empirical Processes, pp. 16–28. Springer.
VanderWeele, T. (2010). Direct and indirect effects for neighborhood-based clustered and longitu-
dinal data. Sociological Methods & Research 38 (4), 515–544.
Wasserman, S. (2013). Comment on “social contagion theory: Examining dynamic social networks
and human behavior” by nicholas christakis and james fowler. Statistics in Medicine 32 (4),
578–580.
Watts, D. J. and S. H. Strogatz (1998). Collective dynamics of small-world networks. Na-
ture 393 (6684), 440–442.
Young, J. G., M. A. Hernán, and J. M. Robins (2014). Identification, estimation and approximation
71
of risk under interventions that depend on the natural value of treatment using observational
data. Epidemiologic Methods 3 (1), 1–19.
72