

Source: proceedings.mlr.press/v15/xiang11a/xiang11a.pdf

Relational Learning with One Network: An Asymptotic Analysis

Rongjing Xiang
Department of Computer Science, Purdue University

Jennifer Neville
Department of Computer Science and Statistics, Purdue University

Abstract

Theoretical analysis of structured learning methods has focused primarily on domains where the data consist of independent (albeit structured) examples. Although the statistical relational learning (SRL) community has recently developed many classification methods for graph and network domains, much of this work has focused on modeling domains where there is a single network for learning. For example, we could learn a model to predict the political views of users in an online social network, based on the friendship relationships among users. In this example, the data would be drawn from a single large network (e.g., Facebook) and increasing the data size would correspond to acquiring a larger graph. Although SRL methods can successfully improve classification in these types of domains, there has been little theoretical analysis addressing the issue of single network domains. In particular, the asymptotic properties of estimation are not clear if the size of the model grows with the size of the network. In this work, we focus on outlining the conditions under which learning from a single network will be asymptotically consistent and normal. Moreover, we compare the properties of maximum likelihood estimation (MLE) with that of generalized maximum pseudolikelihood estimation (MPLE) and use the resulting understanding to propose novel MPLE estimators for single network domains. We include empirical analysis on both synthetic and real network data to illustrate the findings.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

1 Introduction

Statistical relational learning methods aim to exploit the relationships among instances, which naturally exist in network data, to improve both descriptive and predictive modeling performance compared to conventional learning methods that assume independent and identically distributed (i.i.d.) instances. Relational modeling techniques have been applied in many application domains, such as social networks, the world-wide web, and citation analysis, with great empirical success. However, there has been relatively little theoretical analysis of the properties of relational models, nor has there been much focus on precisely defining the estimation and learning tasks.

From the breadth of past work on real-world applications, it is clear that there are two distinct learning scenarios implicitly considered by relational learning researchers and analysts. In the first scenario, the domain consists of a population of independent graph samples (e.g., chemical compounds). Here each graph is structured, but the domain consists of a set of independent graphs, so we can reason theoretically about the characteristics of algorithms in the limit, as the number of available graph samples increases. In the second scenario, the domain consists of a single, large graph (e.g., Facebook). In this case, an increase in dataset size corresponds to acquiring a larger sample from the network. Since relational models typically focus on modeling the joint distribution of attributes (e.g., class labels) over a specific network structure, the asymptotic properties of the methods (as the size of the network grows) are not clear. Although many relational applications focus on learning and prediction in a single network, theoretical analysis of the models has focused on the former scenario (i.e., with multiple, independent network samples). In this work, we attempt to close this gap by outlining the conditions under which learning from a single network will be asymptotically consistent and normal.

The purpose of our asymptotic analysis is twofold. First, the asymptotic consistency argument provides theoretical justification for current relational learning


approaches that practitioners have been applying to network data comprised of a single graph. Moreover, our analysis provides some theoretical insight into a central question in relational learning: whether joint learning over the full data graph outperforms disjoint learning over subgraphs from the original network, and if so, to what extent and in which situations. Our asymptotic normality and efficiency results illustrate how disjoint and joint learning approaches exploit the relational dependencies in the data differently, and this understanding points to a spectrum of learning approaches in between, obtained by varying the level of joint learning.

More specifically, we investigate this tradeoff through a comparative analysis of maximum likelihood estimators (MLE) and generalized maximum pseudolikelihood estimators (MPLE). To the best of our knowledge, this is the first formal analysis of MLE and MPLE in the machine learning community for the context of single-network estimation. Although the asymptotic behavior of MLE and MPLE has been well studied for i.i.d. data, the analysis for network data requires different techniques and careful specification of modeling assumptions. Since probabilistic limit theorems do not hold in full generality for dependent data, asymptotic consistency and normality of relational learners can only be established under certain assumptions about the relational structure. Previous work in the statistical physics community has analyzed the asymptotic behavior of estimators for Gibbs measures on lattice data under various structural assumptions (see e.g. Comets, 1992). However, the typical assumptions made for lattice data, such as shift invariance, are not applicable to relational learning on heterogeneous networks. Alternatively, we base our analysis on two assumptions that are more suitable for single-network domains. First, we assume that node degrees remain bounded as the size of the data graph grows. Second, we formalize the intuition that correlation decays rapidly in the model as graph distance increases, with a notion of weak dependence which requires that the total correlation of each clique with all other cliques across the network is finite.

The rest of this paper is organized as follows. In Section 2, we describe a templated Markov network model for relational data, state our modeling assumptions, and prove basic convergence. In Section 3, we prove the asymptotic consistency and normality of MLE and MPLE in the single-network scenario. We also compare the asymptotic efficiency of MLE with different MPLEs. In Section 4, we illustrate the findings with an empirical study on both synthetic data and a real-world social network. Finally, we conclude the paper with a review of related work and a discussion.

2 Model Formulation

In this paper, our analysis is based on the framework of Markov networks. We chose a Markov network framework for the following reasons: (1) their formulations are used widely in relational learning, (2) their rich representation ability (e.g., Markov networks can naturally translate data relationships into a model of possibly cyclic probabilistic dependencies), (3) their ability to derive a consistent global distribution from local specifications, which is essential for consistent learning from subgraphs drawn from an unknown, larger population graph, and (4) their amenability to theoretical analysis due to the desirable analytical properties of exponential families. While we will use the general terminology of Markov networks throughout this paper, our formulation encompasses a rich class of undirected graphical model-based relational learners, including relational Markov networks (Taskar et al., 2002), Markov logic networks (Richardson and Domingos, 2006), relational dependency networks (Neville and Jensen, 2007), conditional random fields for relational learning (Sutton and McCallum, 2006), and P* models (Robins et al., 2007).

2.1 Templated Markov Networks

Templated Markov networks for relational data can generally be written as:
$$P_G(\mathbf{y}\mid\mathbf{x}) = \frac{1}{Z}\prod_{T\in\mathcal{T}}\prod_{C\in\mathcal{C}_T(G)} \Phi_T(\mathbf{x}_C,\mathbf{y}_C;\theta_T)$$
where $\mathcal{T}$ is the set of clique templates, and $Z$ is the partition function. Each clique $C$ is defined on a small subgraph of the whole data graph, while the parameters of cliques within the same template $T$ are tied, denoted by $\theta_T$. Therefore, we use a single potential function $\Phi_T$ for each template $T$. Each potential is further formulated as a log-linear function of a set of features $\phi_T$, computed from the vector of attributes ($\mathbf{x}$) and labels ($\mathbf{y}$) within the corresponding clique $C$:
$$\Phi_T = \exp\big\{\langle\theta_T,\,\phi_T(\mathbf{x}_C,\mathbf{y}_C)\rangle\big\}.$$

Throughout this paper, to keep the notation simple, we assume that only one template is defined, although the analysis can easily be generalized to a finite number of templates. As such, we drop the subscript $T$ in the above formulation. Furthermore, let $n$ denote the number of cliques in graph $G$, i.e., $n = |\mathcal{C}(G)|$. Then the model can be compactly written as follows.

$$P(\mathbf{y}\mid\mathbf{x}) = \frac{1}{Z_n(\theta;\mathbf{x})} \exp\Big\{\Big\langle \theta,\, \sum_{C\in\mathcal{C}(G)} \phi(\mathbf{x}_C,\mathbf{y}_C) \Big\rangle\Big\} \qquad (1)$$

where the partition function is $Z_n(\theta;\mathbf{x}) = \int_{\mathcal{Y}_G} \exp\big\langle \theta,\, \sum_{C\in\mathcal{C}(G)} \phi(\mathbf{x}_C,\mathbf{y}_C) \big\rangle\, d\mathbf{y}$.
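For intuition, Equation (1) can be computed exactly on a toy instance by brute-force enumeration. The 3-node chain, the single "label agreement" feature, and all numbers below are our own illustrative assumptions, not the paper's model:

```python
import itertools, math

# Hypothetical toy instance of Eq. (1): a 3-node chain with an edgewise
# clique template and a single "label agreement" feature per clique.
edges = [(0, 1), (1, 2)]          # cliques C(G); n = |C(G)| = 2
x = [0.5, -1.0, 2.0]              # observed attributes (unused by this feature)

def phi(yC):
    # feature phi(x_C, y_C): 1 if the two labels agree, else 0
    return 1.0 if yC[0] == yC[1] else 0.0

def unnormalized_logp(y, theta):
    # <theta, sum_C phi(x_C, y_C)> from Eq. (1)
    return theta * sum(phi((y[i], y[j])) for i, j in edges)

def log_Z(theta, num_nodes=3):
    # partition function Z_n(theta; x): sum over all joint labelings y
    return math.log(sum(math.exp(unnormalized_logp(y, theta))
                        for y in itertools.product([0, 1], repeat=num_nodes)))

def conditional_prob(y, theta):
    # P(y | x) of Eq. (1)
    return math.exp(unnormalized_logp(y, theta) - log_Z(theta))

# with theta > 0, the fully agreeing labelings are most probable
probs = {y: conditional_prob(y, theta=1.0)
         for y in itertools.product([0, 1], repeat=3)}
print(max(probs, key=probs.get))   # one of the all-equal labelings
```

Note the exponential cost of `log_Z`: the sum ranges over all $2^{|V|}$ labelings, which is exactly the obstacle that motivates the pseudolikelihood estimators discussed later.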

Within this model, we consider learning over training graphs with an increasing number of cliques $n \to \infty$. Depending on the model specification, as the size of the graph grows, $n$ can grow in the order of the number of nodes (e.g., node-centric clique specifications),

Page 3: Relational Learning with One Network: An Asymptotic Analysisproceedings.mlr.press/v15/xiang11a/xiang11a.pdf · Relational Learning with One Network: An Asymptotic Analysis approaches

781

Rongjing Xiang, Jennifer Neville

• $E_\theta$: expectation taken w.r.t. $P_\theta$
• $G$: training graph
• $C \in \mathcal{C}(G)$: template potential cliques
• $n$: number of cliques ($|\mathcal{C}(G)|$)
• $S \in \mathcal{S}(G)$: template pseudolikelihood components
• $m$: number of pseudolikelihood components ($|\mathcal{S}(G)|$)
• $L_n(\theta;\mathbf{x})$: the conditional log-likelihood
• $L_m(\theta;\mathbf{x})$: the conditional log-pseudolikelihood
• $f_n$ ($f_m$): the gradient of $L_n$ ($L_m$)
• $\mathbf{y}_R$ ($\mathbf{x}_R$): joint instantiation of $\mathbf{y}$ ($\mathbf{x}$) over the node set $R$
• $\phi_C = \phi(\mathbf{x}_C,\mathbf{y}_C)$: features in the potential function
• $\hat\theta_n$: the maximum likelihood estimate
• $\hat\theta_m$: the maximum pseudolikelihood estimate

Table 1: Notation used in this paper

the number of edges (e.g., pairwise clique specifications), or higher. The MLE $\hat\theta_n$ can be written as $\hat\theta_n = \arg\max_\theta L_n(\theta;\mathbf{x})$, where the normalized¹ log-likelihood function is:

$$L_n(\theta;\mathbf{x}) = \Big\langle \theta,\, \frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C \Big\rangle - \frac{1}{n}\log Z_n(\theta;\mathbf{x}) \qquad (2)$$
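On the same kind of toy model (our own construction, not the paper's), the maximizer of Equation (2) can be found by gradient ascent, since the exact gradient is available via full enumeration; the graph, feature, starting point, and learning rate are illustrative assumptions:

```python
import itertools, math

# Gradient-ascent MLE on a hypothetical 3-node chain with an edgewise
# "label agreement" feature; n = 2 cliques.
edges = [(0, 1), (1, 2)]
n = len(edges)
labelings = list(itertools.product([0, 1], repeat=3))

def phi_sum(y):
    # sum_C phi_C: number of agreeing edges in labeling y
    return sum(1.0 if y[i] == y[j] else 0.0 for i, j in edges)

def gradient(theta, y_obs):
    # (1/n) sum_C phi_C minus the model expectation of the same quantity
    lz = math.log(sum(math.exp(theta * phi_sum(y)) for y in labelings))
    expected = sum(phi_sum(y) * math.exp(theta * phi_sum(y) - lz)
                   for y in labelings)
    return (phi_sum(y_obs) - expected) / n

def mle(y_obs, lr=0.5, steps=2000):
    theta = 1.0                     # arbitrary starting point
    for _ in range(steps):
        theta += lr * gradient(theta, y_obs)
    return theta

# for y_obs = (0, 0, 1), the observed feature count already matches the
# model expectation at theta = 0, so the ascent settles near zero
print(mle((0, 0, 1)))
```

At convergence the estimate satisfies the usual exponential-family moment-matching condition: the model's expected feature count equals the observed one.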

Instead of optimizing the joint distribution over the whole training graph, the generalized maximum pseudolikelihood estimator (MPLE) partitions the graph into small subgraph components. In this paper, we use a templating formulation, similar to that of the potential cliques $\mathcal{C}(G)$, for the MPLE components $\mathcal{S}(G)$, where $\{S : \cup S = G\}$ and each component $S$ intersects with $m_S$ cliques. Let $m$ denote the number of pseudolikelihood components in $G$, i.e., $m = |\mathcal{S}(G)|$. Since $m_S \ll n$, $m = O(n)$. Let $\partial S$ be the set of cliques which intersect with $S$'s Markov blanket, i.e., $\partial S = \{C : C \in \mathcal{C}(G) \text{ and } C\cap S \neq \emptyset \text{ and } \exists v \notin S \text{ s.t. } v \in C\}$. The MPLE objective is defined as a product of local conditional probabilities:

$$PL(\theta;\mathbf{x}) = \prod_{S\in\mathcal{S}(G)} P(\mathbf{y}_S\mid\mathbf{y}_{G\setminus S},\mathbf{x}) = \prod_{S\in\mathcal{S}(G)} P(\mathbf{y}_S\mid\mathbf{y}_{\partial S},\mathbf{x}) = \prod_{S\in\mathcal{S}(G)} \frac{1}{Z_S(\theta;\mathbf{x})} \exp\Big\langle \theta,\, \sum_{C:\,C\cap S\neq\emptyset} \phi(\mathbf{x}_C,\mathbf{y}_C) \Big\rangle \qquad (3)$$

where the partition function is $Z_S(\theta;\mathbf{x}) = \int_{\mathcal{Y}_S} \exp\big\langle \theta,\, \sum_{C:\,C\cap S\neq\emptyset} \phi(\mathbf{x}_C,\mathbf{y}_C)\big\rangle\, d\mathbf{y}$.

The MPLE can be written as $\hat\theta_m = \arg\max_\theta L_m(\theta;\mathbf{x})$, where the normalized log-pseudolikelihood function is:

$$L_m(\theta;\mathbf{x}) = \frac{1}{m}\sum_{S\in\mathcal{S}(G)} \Big( \sum_{C:\,C\cap S\neq\emptyset} \langle\theta,\phi_C\rangle - \log Z_S(\theta;\mathbf{x}) \Big) \qquad (4)$$

Clearly, the key difference between the two estimators lies in the partition function. The MLE does not partition the graph and needs to enumerate all $\mathbf{y}$ variables in $G$ jointly in the partition function. In general, the space over which the MLE integral is taken grows exponentially with the training data size $n$, which alludes to the computational infeasibility of MLE for relational learning and distinguishes it from i.i.d. learning. In contrast, the MPLE partitions the graph into components of bounded size and only enumerates the variables within each component. Therefore, the domain of the MPLE integral remains fixed as the training data size grows. The common practice in relational learning of considering subgraphs from the training data as i.i.d. samples and learning from them independently is thus, in effect, MPLE learning.

¹We normalize the log-likelihood because the unnormalized value goes to infinity as $n\to\infty$. The normalized version is also convenient to work with for proving consistency.
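As a concrete (hypothetical) illustration of Equation (4), the sketch below evaluates the normalized log-pseudolikelihood with singleton components on a 3-node chain, so each local partition function $Z_S$ sums over a single variable rather than the whole labeling space:

```python
import itertools, math

# Minimal sketch of the MPLE objective in Eq. (4): singleton components
# S = {v} on a 3-node chain with an edgewise "label agreement" feature.
# All modeling choices here are illustrative assumptions.
edges = [(0, 1), (1, 2)]
nodes = [0, 1, 2]

def phi(yi, yj):
    return 1.0 if yi == yj else 0.0

def component_score(v, y, theta):
    # sum over cliques C with C ∩ S nonempty of <theta, phi_C>, for S = {v}
    return theta * sum(phi(y[i], y[j]) for i, j in edges if v in (i, j))

def log_Z_S(v, y, theta):
    # local partition function Z_S: enumerate y_v only, conditioning on y_∂S
    total = 0.0
    for yv in (0, 1):
        y_local = list(y)
        y_local[v] = yv
        total += math.exp(component_score(v, y_local, theta))
    return math.log(total)

def log_pseudolikelihood(y, theta):
    # normalized log-pseudolikelihood L_m(theta; x), with m = |S(G)| = 3
    m = len(nodes)
    return sum(component_score(v, y, theta) - log_Z_S(v, y, theta)
               for v in nodes) / m

print(log_pseudolikelihood((0, 0, 0), theta=1.0))
```

Each term touches only one variable and its Markov blanket, so the cost stays bounded per component as the graph grows, in contrast with the joint partition function of the MLE.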

By standard results on exponential families (see, e.g., Wainwright and Jordan, 2008), the gradients of $L_n$ and $L_m$, denoted by $f_n$ and $f_m$ respectively, are:

$$f_n(\theta) = \frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C - E_\theta[\phi\mid\mathbf{x}] \qquad (5)$$

$$f_m(\theta) = \frac{1}{m}\sum_{S\in\mathcal{S}(G)} \sum_{C:\,C\cap S\neq\emptyset} \Big( \phi_C - E_\theta\big[\phi\mid\mathbf{y}_{\partial S},\mathbf{x}\big] \Big) \qquad (6)$$
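Equation (5) can be sanity-checked numerically: on a toy chain model (our own construction), the closed-form gradient should match a central finite difference of the normalized log-likelihood in Equation (2):

```python
import itertools, math

# Finite-difference check that Eq. (5) is the gradient of Eq. (2),
# on a hypothetical 3-node chain with an "agreement" feature per edge.
edges = [(0, 1), (1, 2)]
n = len(edges)
labelings = list(itertools.product([0, 1], repeat=3))

def phi_sum(y):
    # sum_C phi_C with the agreement feature on each edge clique
    return sum(1.0 if y[i] == y[j] else 0.0 for i, j in edges)

def log_Z(theta):
    return math.log(sum(math.exp(theta * phi_sum(y)) for y in labelings))

def L_n(theta, y_obs):
    # normalized log-likelihood, Eq. (2)
    return (theta * phi_sum(y_obs)) / n - log_Z(theta) / n

def f_n(theta, y_obs):
    # gradient, Eq. (5): empirical mean feature minus model expectation
    lz = log_Z(theta)
    expected = sum(phi_sum(y) * math.exp(theta * phi_sum(y) - lz)
                   for y in labelings)
    return phi_sum(y_obs) / n - expected / n

y_obs, theta, eps = (0, 0, 1), 0.7, 1e-5
numeric = (L_n(theta + eps, y_obs) - L_n(theta - eps, y_obs)) / (2 * eps)
print(abs(numeric - f_n(theta, y_obs)))  # tiny, up to finite-difference error
```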

2.2 Weak Dependence and Convergence

We provide theoretical justification of learning from subgraphs by studying the asymptotic behavior of the estimators. To this end, we first need the notion of Markov networks of infinite size. Based on standard arguments in probability theory (see also Singla and Domingos, 2007), the following local finiteness assumptions are sufficient for our model (1) to be well defined at infinity.

Definition 1 (Local finiteness). The infinite Markov network is locally finite if the number of variables in each variable $v$'s Markov blanket is bounded, i.e., there exists $0 < d < \infty$ such that $\sum_{C:\,v\in C} |C| \le d$ for every $v \in G$. In addition, the feature values must be bounded as well: $\|\phi(\cdot)\| \le M < \infty$.

A further sufficient condition for the learners to be well behaved is a notion of weak dependence. The data are weakly dependent if the total covariance of any clique with all other cliques in the network is finite.

Definition 2 (Weak dependence). The Markov network satisfies the weak dependence condition if $\lim_{n\to\infty} E_\theta[V[\phi]]$ exists, where
$$V[\phi] = \frac{1}{n}\sum_{C_1\in\mathcal{C}(G)}\sum_{C_2\in\mathcal{C}(G)} \mathrm{Cov}\big[\phi_{C_1},\phi_{C_2}\big]$$
is the autocovariance of feature values across the whole network.
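Definition 2 can be illustrated numerically. The sketch below uses a hypothetical chain of cliques whose features share local Gaussian noise (not the paper's model): correlation vanishes beyond lag 1, so the Monte Carlo estimate of $V[\phi]$ stays bounded as $n$ grows:

```python
import random

random.seed(0)

def sample_features(n):
    # clique features phi_C on a chain of n cliques; each shares a latent
    # Gaussian with its successor, so Cov[phi_C1, phi_C2] = 0 beyond lag 1
    z = [random.gauss(0, 1) for _ in range(n + 1)]
    return [z[i] + 0.5 * z[i + 1] for i in range(n)]

def V_phi(n, trials=4000):
    # Monte Carlo estimate of V[phi] = Var(sum_C phi_C) / n, which equals
    # the normalized double-sum autocovariance of Definition 2
    sums = [sum(sample_features(n)) for _ in range(trials)]
    mean = sum(sums) / trials
    return sum((s - mean) ** 2 for s in sums) / trials / n

print(V_phi(20), V_phi(80))  # both stay near the same finite limit
```

If the correlation instead extended across the whole chain (e.g., every feature sharing one global latent variable), the same estimate would grow linearly in $n$, violating weak dependence.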

In the literature, various forms of weak dependence (see e.g. Dedecker et al., 2007) have been used to prove limit theorems for dependent data. The definition presented herein differs from past definitions in two main aspects. First, previous definitions usually involve testing all bounded functions, while we


relax it to only consider the covariance of the feature function $\phi$, which suffices for our purpose. Second, in related work, weak dependence conditions are primarily considered for time series and lattice data. These conditions can be adapted for graphs of subexponential growth to give sufficient conditions for our weak dependence assumption. For example, exponential decay in correlation (with respect to graph distance) implies that Definition 2 holds. While it is computationally infeasible to check weak dependence conditions empirically, part of our current work is deriving sufficient conditions for weak dependence which are convenient to work with while remaining widely applicable.

Throughout this paper, we focus on networks where weak dependence holds. We note that infinite Markov networks which fail to satisfy the weak dependence condition are not well behaved in general, and the corresponding learning algorithm will tend to perform poorly. For example, perturbation of values at a single node may cause global label changes, which could make the algorithm oscillate between very different labelings. The breakdown of weak dependence is closely related (although not identical) to the phenomenon of phase transition in statistical physics (Dobruschin, 1968). Loosely speaking, phase transitions in infinite Markov networks occur when different configurations at the boundary cause different probability distributions over inner nodes. In many application areas of relational learning, such as social networks and web graphs, such dramatic long-range influence is not likely.

An immediate consequence of the weak dependence condition is the following law of large numbers for interdependent feature values.

Proposition 1. Assuming weak dependence and the true parameter vector $\theta$, the following holds:

i) $\frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C \xrightarrow{P} E_\theta[\phi]$.

ii) $\mathcal{V}(\theta) \equiv \lim_{n\to\infty} E_\theta[V_n[\phi\mid\mathbf{x}]]$ exists.

iii) $E_\theta\big[\phi\mid\mathbf{x}_{G_n}\big] \xrightarrow{P} E_\theta[\phi]$.

Proof Sketch. i): For any $t > 0$, by Markov's inequality,
$$P\Big(\Big\|\frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C - E_\theta[\phi]\Big\|^2 \ge t\Big) \le \frac{1}{t}\, E_\theta\Big[\Big\|\frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi(\mathbf{x}_C,\mathbf{y}_C) - E_\theta[\phi]\Big\|^2\Big] = \frac{1}{tn} V[\phi]$$
Letting $n\to\infty$, $P\big(\|\frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C - E_\theta[\phi]\|^2 \ge t\big) \to 0$.

ii): By the law of total covariance, $E_\theta[V[\phi\mid\mathbf{x}]] \le E_\theta[V[\phi]]$. Thus, the increasing sequence $\{E_\theta[V[\phi\mid\mathbf{x}]]\}$ is bounded, and its limit exists.

iii) is proved in the same way as i).

3 Asymptotic Analysis of Estimators

In this section, we establish asymptotic consistency and normality arguments for MLE and MPLE, in order to theoretically justify relational learning from a single network. We also analyze the efficiency of the learners by comparing their asymptotic variance. Throughout this section, compactness of the parameter space and identifiability of parameters² are assumed for the convenience of theoretical analysis.

3.1 Asymptotic Consistency

We use the following lemma, which appears in van der Vaart (2000, Theorem 5.7), to prove consistency of the estimators.

Lemma 1. Let $\hat\theta_n$ be a sequence of estimators, $M_n$ be random functions, and $M$ be a fixed function of $\theta$ such that for every $\delta > 0$,
$$\sup_\theta |M_n(\theta) - M(\theta)| \xrightarrow{P} 0, \qquad (7)$$
$$\sup_{\theta:\,\|\theta-\theta_0\|\ge\delta} M(\theta) < M(\theta_0), \qquad (8)$$
$$M_n(\hat\theta_n) \ge M_n(\theta_0) - \varepsilon(n), \qquad (9)$$
where $\varepsilon(n) \xrightarrow{P} 0$. Then $\hat\theta_n \xrightarrow{P} \theta_0$.

In order to discuss consistency, we need to consider the generative distribution for the data, which is the full Markov network parameterized by $\theta_0$:
$$P_{\theta_0}(\mathbf{x},\mathbf{y}) = \frac{1}{Z_n(\theta_0)} \exp\Big\{\Big\langle \theta_0,\, \sum_{C\in\mathcal{C}(G)} \phi(\mathbf{x}_C,\mathbf{y}_C) \Big\rangle\Big\} \qquad (10)$$
where $Z_n(\theta_0) = \int_{\mathcal{X}_G,\mathcal{Y}_G} \exp\big\langle \theta_0,\, \sum_{C\in\mathcal{C}(G)} \phi(\mathbf{x}_C,\mathbf{y}_C) \big\rangle\, d(\mathbf{x},\mathbf{y})$.

Our general conditional model (1) differs from the generative distribution (10) in that, while the generative distribution is over the whole space $(\mathcal{X}_G,\mathcal{Y}_G)$, the conditional learner only varies over $\mathcal{Y}_G$. We thus relate the estimate based on the conditional likelihood (resp. pseudolikelihood) to that based on the full generative likelihood (resp. pseudolikelihood). The log partition function of the generative likelihood (resp. pseudolikelihood) is denoted by $A(\theta)$ (resp. $\bar A(\theta)$) in the following proofs. Weak dependence, local finiteness, and the convexity of log partition functions are instrumental in establishing the conditions of Lemma 1.

²Nonidentifiable parameterizations could be useful in practice due to interpretability. In such circumstances the validity of learning needs to be established based on equivalence class arguments.


Proposition 2. Let $\hat\theta_n$ be the maximum likelihood estimator based on graph $G$ with $n$ cliques, generated from $P_{\theta_0}$ (as defined by Equation 10). Then $\hat\theta_n$ is asymptotically consistent: $\hat\theta_n \xrightarrow{P} \theta_0$.

Proof Sketch. Let $A_n(\theta) = \frac{1}{n}\log\int_{\mathcal{X}_G,\mathcal{Y}_G} \exp\big\langle \theta,\, \sum_{C\in\mathcal{C}(G)} \phi_C \big\rangle\, d(\mathbf{x},\mathbf{y})$. Given local finiteness, $\{A_n(\theta)\}$ is bounded and its limit exists, which we denote by $A(\theta) = \lim_{n\to\infty} A_n(\theta)$. Furthermore, $A_n(\theta)$ is convex and we denote its conjugate function by $A_n^*$, i.e., $A_n^*(\phi) = \sup_\theta \langle\theta,\phi\rangle - A_n(\theta)$. In Lemma 1, let $M_n(\theta) = \big\langle \theta,\, \frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C \big\rangle - A_n(\theta)$ and $M(\theta) = \langle\theta,\, E_{\theta_0}[\phi]\rangle - A(\theta)$. We prove consistency by checking each of the assumptions. Equation (7) follows from Proposition 1 and the compactness of the parameter space. Moreover, for all $\delta > 0$ and all $\theta$ s.t. $\|\theta-\theta_0\| \ge \delta$:
$$M(\theta_0) - M(\theta) = \lim_{n\to\infty} \frac{1}{n}\, E_{\theta_0}\big[\log P_{\theta_0}(\mathbf{y}_G\mid\mathbf{x}_G) - \log P_\theta(\mathbf{y}_G\mid\mathbf{x}_G)\big] \qquad (11)$$
which is the limit of the KL divergence (rescaled by $\frac{1}{n}$) between the distributions parameterized by $\theta_0$ and $\theta$. To show this limit is strictly positive, pick a largest subset of cliques $\tilde G$ from $G$ such that $\forall C_i, C_j \in \tilde G$, $C_i \cap C_j = \emptyset$. Due to local finiteness, $|\tilde G| \ge \lceil n/d^2 \rceil$. Define $\bar G = G \setminus \tilde G$. Furthermore, let $P_\theta(R)$ denote $P_\theta(\mathbf{x}_R,\mathbf{y}_R)$ and $P_\theta(R\mid T)$ denote $P_\theta(\mathbf{x}_R,\mathbf{y}_R\mid\mathbf{x}_T,\mathbf{y}_T)$ for any node sets $R$ and $T$. We rewrite Equation (11) as
$$\frac{1}{n}\,\mathrm{KL}\big(P_{\theta_0}(G)\,\|\,P_\theta(G)\big) = \frac{1}{n}\,\mathrm{KL}\big(P_{\theta_0}(\tilde G\mid\bar G)\,\|\,P_\theta(\tilde G\mid\bar G)\big) + \frac{1}{n}\,\mathrm{KL}\big(P_{\theta_0}(\bar G)\,\|\,P_\theta(\bar G)\big)$$
$$\ge \frac{1}{n}\,\mathrm{KL}\big(P_{\theta_0}(\tilde G\mid\bar G)\,\|\,P_\theta(\tilde G\mid\bar G)\big) \ge \frac{1}{d^2}\,\mathrm{KL}\big(P_{\theta_0}(C\mid\partial C)\,\|\,P_\theta(C\mid\partial C)\big)$$
Note that due to the identifiability of the parameters, the KL divergence between the local probabilities is positive, i.e., $\exists \eta > 0$ such that $M(\theta_0) - M(\theta) \ge \frac{1}{d^2}\,\mathrm{KL}\big(P_{\theta_0}(C\mid\partial C)\,\|\,P_\theta(C\mid\partial C)\big) \ge \frac{\eta}{d^2} > 0$, i.e., Equation (8) holds.

Finally, we check Equation (9). By Equation (5), since $f_n(\hat\theta_n) = 0$, $E_{\hat\theta_n}[\phi\mid\mathbf{x}] = \frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C$, while $\frac{1}{n}\sum_{C\in\mathcal{C}(G)} \phi_C \xrightarrow{P} E_{\theta_0}[\phi]$ by Proposition 1, so $E_{\hat\theta_n}[\phi\mid\mathbf{x}] \xrightarrow{P} E_{\theta_0}[\phi]$. On the other hand, $E_{\hat\theta_n}[\phi\mid\mathbf{x}] \xrightarrow{P} E_{\hat\theta_n}[\phi]$. Therefore, $E_{\hat\theta_n}[\phi] \xrightarrow{P} E_{\theta_0}[\phi]$. Now
$$M_n(\theta_0) - M_n(\hat\theta_n) = A_n^*\big(E_{\theta_0}[\phi]\big) - A_n^*\big(E_{\hat\theta_n}[\phi]\big) + \big\langle \hat\theta_n,\, E_{\hat\theta_n}[\phi]\big\rangle - \big\langle \theta_0,\, E_{\theta_0}[\phi]\big\rangle.$$
Since $E_{\hat\theta_n}[\phi] \xrightarrow{P} E_{\theta_0}[\phi]$, $A_n^*(E_{\hat\theta_n}[\phi]) \xrightarrow{P} A_n^*(E_{\theta_0}[\phi])$ by the continuous mapping theorem, and $\theta_0$ and $\hat\theta_n$ are bounded. Therefore, the right-hand side converges to 0, and we get Equation (9).

Proposition 3. Let $\hat\theta_m$ be the maximum pseudolikelihood estimator based on graph $G$ with $m$ pseudolikelihood components, generated from $P_{\theta_0}$. Then $\hat\theta_m$ is asymptotically consistent: $\hat\theta_m \xrightarrow{P} \theta_0$.

Proof Sketch. Similar to the previous proof, let $\bar A_m(\theta) = \frac{1}{m}\sum_{S\in\mathcal{S}(G)} \log \int_{\mathcal{X}_S,\mathcal{Y}_S} \exp\big\langle \theta,\, \sum_{C:\,C\cap S\neq\emptyset} \phi_C \big\rangle\, d(\mathbf{x},\mathbf{y})$, and $\bar A(\theta) = \lim_{m\to\infty} \bar A_m(\theta)$. In Lemma 1, let $M_m(\theta) = \frac{1}{m}\sum_{S\in\mathcal{S}(G)} \sum_{C:\,C\cap S\neq\emptyset} \langle\theta,\phi_C\rangle - \bar A_m(\theta)$, and $M(\theta) = \langle\theta,\, E_{\theta_0}[\phi]\rangle - \bar A(\theta)$. Then Equations (7) and (9) are verified in the same way as in the MLE case. We only check Equation (8) here. For all $\delta > 0$ and all $\theta$ s.t. $\|\theta - \theta_0\| \ge \delta$:
$$M(\theta_0) - M(\theta) = \langle\theta_0,\, E_{\theta_0}[\phi]\rangle - \bar A(\theta_0) - \big(\langle\theta,\, E_{\theta_0}[\phi]\rangle - \bar A(\theta)\big)$$
$$= \lim_{m\to\infty} \frac{1}{m}\sum_{S\in\mathcal{S}(G)} E_{\theta_0}\big[\log P_{\theta_0}(S\mid\partial S) - \log P_\theta(S\mid\partial S)\big] = \lim_{m\to\infty} \frac{1}{m}\sum_{S\in\mathcal{S}(G)} \mathrm{KL}\big(P_{\theta_0}(S\mid\partial S)\,\|\,P_\theta(S\mid\partial S)\big)$$
where again the local KL divergence is positive due to identifiability of the parameters. Due to local finiteness of the graph and boundedness of the pseudolikelihood components, as $m$ increases, all possible pseudolikelihood components can be divided into a finite number of isometric classes. Since the number of different pseudolikelihood classes is finite, there exists $\eta$ such that for any $S$, $\mathrm{KL}\big(P_{\theta_0}(S\mid\partial S)\,\|\,P_\theta(S\mid\partial S)\big) \ge \eta > 0$. Therefore, $M(\theta_0) - M(\theta) \ge \eta > 0$.

Our consistency results justify both joint and disjoint learning over observed subgraphs from the underlying whole graph.

3.2 Asymptotic Normality

To establish asymptotic normality for relational estimators, we need appropriate forms of central limit theorems for dependent data. Lemmas 2 and 3 serve this purpose (see appendix for proofs).

Lemma 2. Let $f_n$ be the gradient of the log-likelihood function, as defined in Equation (5), and $\mathcal{V}(\theta_0)$ as defined in Proposition 1. Then $\sqrt{n}\, f_n(\theta_0) \xrightarrow{D} \mathcal{N}\big(0, \mathcal{V}(\theta_0)\big)$.

Proposition 4. Assuming weak dependence, the maximum likelihood estimator is asymptotically normal:
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{D} \mathcal{N}\big(0,\, \mathcal{V}(\theta_0)^{-1}\big) \qquad (12)$$

Proof Sketch. We apply the method described in van der Vaart (2000, Section 5.3). Apply a Taylor expansion to $f_n$, and note that the MLE $\hat\theta_n$ is a zero of $f_n(\theta)$:
$$0 = f_n(\hat\theta_n) = f_n(\theta_0) + (\hat\theta_n - \theta_0)\,\nabla f_n(\theta_0) + O\big(\|\hat\theta_n - \theta_0\|^2\big).$$
Rearranging the terms and multiplying both sides by $\sqrt{n}$, we get $\sqrt{n}\,(\hat\theta_n - \theta_0) = -\nabla f_n(\theta_0)^{-1}\sqrt{n}\, f_n(\theta_0) + o_P(1)$. Since $\nabla f_n(\theta_0) \to -\mathcal{V}(\theta_0)$, and $\sqrt{n}\, f_n(\theta_0) \xrightarrow{D} \mathcal{N}(0,\mathcal{V}(\theta_0))$ by Lemma 2, Equation (12) follows.


Next, we consider the maximum pseudolikelihood estimator. The asymptotic normality of the MPLE is closely related to two covariance matrices. Intuitively, they can be viewed as the "within component" autocovariance and the "across network" covariance. The autocovariance of feature values within each component is $\bar{\mathcal{V}}(\theta) = E_\theta\big[\bar V[\phi\mid S,\mathbf{x}]\big]$, where:
$$\bar V[\phi\mid S,\mathbf{x}] = \frac{1}{m}\sum_{S\in\mathcal{S}} \sum_{C_1:\,C_1\cap S\neq\emptyset}\; \sum_{C_2:\,C_2\cap S\neq\emptyset} \mathrm{Cov}\big[u_{C_1}, u_{C_2}\big]$$
and $u_{C_i} = \phi_{C_i} - E_\theta[\phi\mid\mathbf{y}_{\partial S},\mathbf{x}]$. We further denote the covariance of component conditional values of features across the network by $\mathcal{C}(\theta) = E_\theta\big[C[\phi\mid\mathbf{x}]\big]$, where:
$$C[\phi\mid\mathbf{x}] = \frac{1}{m}\sum_{S_1,S_2\in\mathcal{S}(G)}\; \sum_{C_1:\,C_1\cap S_1\neq\emptyset}\; \sum_{C_2:\,C_2\cap S_2\neq\emptyset} \mathrm{Cov}\big[u_{C_1}, u_{C_2}\big] = \frac{1}{m}\sum_{C_1,C_2:\,C_1\bowtie C_2} \mathrm{Cov}\big[u_{C_1}, u_{C_2}\big]$$
Here $C_1 \bowtie C_2$ refers to the set of clique pairs for which $\exists S \in \mathcal{S}(G)$ s.t. $C_1 \cap (S\cup\partial S) \neq \emptyset$ and $C_2 \cap (S\cup\partial S) \neq \emptyset$. The second equality follows from the fact that cliques outside of the Markov blanket of each other are conditionally independent.

We are now ready to state the following central limit theorem for $f_m$, and consequently the asymptotic normality of the MPLE.

Lemma 3. $\sqrt{m}\, f_m(\theta_0) \xrightarrow{D} \mathcal{N}\big(0, \mathcal{C}(\theta_0)\big)$.

Proposition 5. The maximum pseudolikelihood estimator is asymptotically normal:
$$\sqrt{m}\,(\hat\theta_m - \theta_0) \xrightarrow{D} \mathcal{N}\big(0,\, \bar{\mathcal{V}}(\theta_0)^{-1}\,\mathcal{C}(\theta_0)\,\bar{\mathcal{V}}(\theta_0)^{-1}\big) \qquad (13)$$

Proof Sketch. Similar to the proof for the MLE, we apply a Taylor expansion to $f_m$:
$$0 = f_m(\hat\theta_m) = f_m(\theta_0) + (\hat\theta_m - \theta_0)\,\nabla f_m(\theta_0) + O\big(\|\hat\theta_m - \theta_0\|^2\big).$$
Observe that $\nabla f_m(\theta_0) \to -E_{\theta_0}\big[\bar V[\phi\mid S,\mathbf{x}]\big] = -\bar{\mathcal{V}}(\theta_0)$. Combined with Lemma 3, this gives Equation (13).

Although the MLE can be regarded as an MPLE with just one component, for which $\bar{\mathcal{V}}$ and $\mathcal{C}$ are well defined and the resulting asymptotic variance coincides with $\mathcal{V}^{-1}$, Proposition 4 and Proposition 5 are based on two very different conditions: the former assumes a single component of infinite size, while the latter assumes infinitely many components of fixed size. In practice, due to the intractability of MLE on large networks, we tend to use a finite number of small subgraphs for estimation. In this case, the asymptotic variance of the MPLE suggests that partitioning the graph so as to maximize the within-component variance $\bar{\mathcal{V}}$ would make the learner perform better, as it reduces the asymptotic variance. This will be further discussed in Section 4.
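Given samples of the per-component score terms $u_S$, the sandwich form in Equation (13) can be estimated numerically. Below is a 1-D sketch in which a hypothetical correlated-score simulation stands in for a real network; with scalar features the sandwich reduces to $\mathcal{C}/\bar{\mathcal{V}}^2$:

```python
import random

# 1-D sketch of the sandwich variance in Eq. (13). Per-component score
# terms u_S are simulated with positive correlation between neighbouring
# components; all quantities are hypothetical, for illustration only.
random.seed(1)

def simulate_scores(m):
    z = [random.gauss(0, 1) for _ in range(m + 1)]
    return [z[i] + 0.8 * z[i + 1] for i in range(m)]  # correlated u_S

def sandwich_variance(m, trials=3000):
    within, total = 0.0, 0.0
    for _ in range(trials):
        u = simulate_scores(m)
        within += sum(ui * ui for ui in u) / m   # estimates V-bar
        total += sum(u) ** 2 / m                 # estimates C (mean-zero u)
    V_bar, C = within / trials, total / trials
    return C / V_bar ** 2                        # V-bar^{-1} C V-bar^{-1}

print(sandwich_variance(50))
```

With independent components the cross terms vanish, $\mathcal{C} = \bar{\mathcal{V}}$, and the sandwich collapses to $\bar{\mathcal{V}}^{-1}$, mirroring the i.i.d. case noted below; here the positive cross-component correlation inflates it.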

In addition, when the pseudolikelihood components are independent of each other, then $\bar{\mathcal{V}} = \frac{1}{m}\mathcal{C} = \frac{1}{n}\mathcal{V}$ and the asymptotic variances of the MLE and the MPLE become identical. Thus, the asymptotic normality results for both MLE and MPLE naturally generalize those of the i.i.d. case. In general, however, the MLE is asymptotically more efficient than the MPLE, in the sense of a matrix determinant comparison. See appendix for proof.

Proposition 6. The MLE is asymptotically more efficient than the MPLE, i.e.,
$$\det\Big(\frac{1}{n}\,\mathcal{V}^{-1}\Big) \le \det\Big(\frac{1}{m}\,\bar{\mathcal{V}}^{-1}\mathcal{C}\,\bar{\mathcal{V}}^{-1}\Big) \qquad (14)$$

Finally, we compare maximum pseudolikelihood estimators with different component partitions. Consider two pseudolikelihood estimators $\hat\theta^1$ and $\hat\theta^2$, which consist of components $\mathcal{S}_1(G)$ and $\mathcal{S}_2(G)$ respectively. We say that $\mathcal{S}_2(G)$ is a finer partition than $\mathcal{S}_1(G)$ if for every $S_2 \in \mathcal{S}_2$ there is some $S_1 \in \mathcal{S}_1$ s.t. $S_2 \subseteq S_1$. See appendix for proof.

Proposition 7. If $\mathcal{S}_2(G)$ is finer than $\mathcal{S}_1(G)$, then $\hat\theta^1$ is asymptotically more efficient than $\hat\theta^2$, i.e.,
$$\det\Big(\frac{1}{m_1}\,\bar{\mathcal{V}}_1^{-1}\mathcal{C}_1\,\bar{\mathcal{V}}_1^{-1}\Big) \le \det\Big(\frac{1}{m_2}\,\bar{\mathcal{V}}_2^{-1}\mathcal{C}_2\,\bar{\mathcal{V}}_2^{-1}\Big) \qquad (15)$$

4 Experiments

In this section, we investigate the empirical performance of relational learners on finite data. Since MLE is not tractable on networks of moderate to large size, we only demonstrate its performance on small synthetic data. Our main focus is on comparing different generalized MPLEs. For high-order MPLE components, different constructions will result in different covariance structures, and thus their form is expected to impact the performance (i.e., convergence) of the model. According to Proposition 5, in order to maximize performance, component construction should aim to maximize the within-component covariance $\bar{\mathcal{V}}$ and to minimize the between-component covariance $\mathcal{C}$.³

However, it is computationally intractable to directly optimize either of these covariance matrices in practice, because the evaluation would involve running inference across the whole network while varying each clique. Therefore, we need to resort to some heuristic choice of components to approximate the analytical objective. Since network links directly affect component covariance, it may appear that graph clustering could be used as a heuristic to select components that

³Minimizing C is generally a secondary concern, but usually it can be optimized simultaneously with V.


satisfy this objective. However, we note that conventional graph clustering algorithms are not immediately applicable, since we are mainly interested in small, and possibly overlapping, subgraphs for the MPLE components. As such, we investigate the following methods of component construction:

• Singleton: Each node in the network forms a component.

• Edgewise: Each pair of nodes connected by an edge in the network forms a component.

• Random: Starting from edgewise components, we randomly expand each component to a particular size. The component is expanded with a random choice of BFS or DFS at each step.

• Heuristic: Starting from edgewise components, we greedily expand each component by choosing, at each step, the node with the maximum number of links to the nodes already in the component, with a discount to reflect the number of times that the node (and edge) have already appeared in other components. This heuristic is designed to maximize V.
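A minimal sketch of the heuristic construction described above; the component size and the exact discount schedule are illustrative choices of ours, since the text does not pin them down:

```python
from collections import defaultdict

def heuristic_components(edges, size=4, discount=0.5):
    """Greedily grow one component per edge: repeatedly add the node
    with the most links into the current component, down-weighting
    nodes that already appear in previously built components."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    used = defaultdict(int)  # how often each node appears in components
    components = []
    for u, v in edges:
        comp = {u, v}
        while len(comp) < size:
            # candidate neighbors of the current component
            cands = {w for x in comp for w in adj[x]} - comp
            if not cands:
                break
            # score = links into comp, discounted by prior appearances
            best = max(cands, key=lambda w: len(adj[w] & comp)
                                            - discount * used[w])
            comp.add(best)
        for x in comp:
            used[x] += 1
        components.append(frozenset(comp))
    return components

comps = heuristic_components([(1, 2), (2, 3), (1, 3), (3, 4)], size=3)
print(len(comps))  # one component per edge: 4
```

Setting `size=2` and `discount=0` recovers the plain edgewise construction; replacing the greedy choice with a random neighbor would approximate the random construction.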

4.1 Experiments on Synthetic Data

We use synthetic data to directly evaluate the quality of parameter estimation under two levels of network connectivity and two levels of model dependence. These experiments are intended to provide further insight into model performance in practical situations when only limited data is available, and to explore whether behavior in the finite-data regime matches that predicted by the asymptotic analysis.

We use the Erdos-Renyi random graph model to generate synthetic network structures with high and low linkage (average degree of 10 and 5, respectively). Then we used a manually-specified Markov network model to generate a binary attribute and class label on the nodes, exploring both high and low interaction parameters. Details are in the appendix.

For all learners we use the BFGS optimizer. For MLE we use Gibbs sampling to compute the gradient and value of the likelihood function in each optimization step, while we perform exact computation by brute-force enumeration for all MPLE components. We report mean squared error ($\mathrm{MSE} = E\|\hat\theta - \theta_0\|^2$) for evaluation. Figure 1 plots MSE as training data size increases, averaged over [200, 200, 80, 20] trials for network sizes [100, 400, 1000, 5000], respectively. In general, the learning algorithms converge more quickly when both the level of linkage and local interactions are low, and most slowly when both the linkage and interaction levels are high. This is because both denser graph structure and larger interaction parameters result in higher component variance. In all

but the high-linkage, high-interaction case, the behavior of the learners is consistent with the asymptotic analysis. While the MLE attains the lowest estimation error, it is only feasible for small networks. The heuristic component construction achieves the lowest error among all pseudolikelihood methods. In the exceptional high-linkage, high-interaction case, the performance of the learners in the small-data regime seems to contradict the asymptotic prediction. While the heuristic approach still achieves the best performance at large network sizes, the random construction attains the best performance for small to moderate network sizes. This is because when both linkage and interaction levels are high, the covariance across the network is too large relative to the network size and our assumption of weak dependence is violated, so the asymptotic analysis is no longer applicable. While the empirical results seem to suggest that MPLE components based on random construction are more stable in such scenarios, this warrants further investigation before drawing that conclusion.
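The per-component brute-force enumeration mentioned above can be sketched as follows for a toy pairwise model with binary labels; the model form and parameter names here are illustrative, not the exact model used in the experiments:

```python
from itertools import product
from math import exp, log

def component_log_likelihood(nodes, edges, labels, theta_node, theta_edge):
    """Exact log P(y_S | y_outside, x) for a small component S by
    brute-force enumeration of all 2^|S| binary label configurations.
    `labels` holds observed labels for every node; `nodes` is the
    component S; `edges` are the network edges touching S."""
    def score(assign):
        s = sum(theta_node * assign[u] for u in nodes)
        for u, v in edges:
            yu = assign.get(u, labels[u])  # boundary nodes stay observed
            yv = assign.get(v, labels[v])
            s += theta_edge * (yu == yv)
        return s

    observed = score({u: labels[u] for u in nodes})
    logZ = log(sum(exp(score(dict(zip(nodes, cfg))))
                   for cfg in product([0, 1], repeat=len(nodes))))
    return observed - logZ

# Component {1, 2} inside a 3-node chain 1-2-3; node 3 is the boundary.
ll = component_log_likelihood([1, 2], [(1, 2), (2, 3)],
                              {1: 1, 2: 1, 3: 0},
                              theta_node=0.5, theta_edge=0.3)
print(ll < 0)  # a valid log-probability
```

The generalized pseudolikelihood objective is the sum of such terms over all components, which is why enumeration is feasible only when each component is small.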

4.2 Experiments on Facebook Data

We next apply the relational learners to data from a large "University" Facebook network. The data include friendship links, transactions among users (e.g., wall posts), as well as user profile attributes. We divide the network into 8 articulated subnetworks of comparable sizes for training and testing (e.g., "Class of 2008") and evaluate performance on four different classification tasks (predicting profile attributes based on other profile attributes and connections in the friendship network). See appendix for details.

The classification results are shown in Figure 2. We use one test subnetwork for evaluation, while varying the number of subnetworks used for training, and report average results over the 8 subnetworks. As a baseline for comparison, we also include an independent learning approach which treats each user as an i.i.d. instance. As the size of the training graph grows, the independent learning curve remains relatively flat, while the relational learners improve their performance. This is because relational models exploit the covariance structure of the data, and take more training instances to converge to their optimum. When the autocorrelation in the data is high (as is the case for the tasks "gender" and "political view"), the relational learners perform consistently better than i.i.d. learning. When the autocorrelation is low (as is the case for the tasks "relationship status" and "religious view"), the relational estimates tend to be biased for small dataset sizes, but still outperform independent learning as more data becomes available. Similar to the synthetic-data results, the heuristic component construction achieves superior performance in most cases among all the relational learners. While the


Figure 1: Average estimation MSE over randomly-generated synthetic-data training networks (log-log scale). Panels: (a) low linkage, low interaction; (b) low linkage, high interaction; (c) high linkage, low interaction; (d) high linkage, high interaction.

computational cost is the same, the random construction performs much worse. This demonstrates the benefit of exploiting the structure of the graph in the component construction. The singleton and edgewise approaches are inferior to the high-order constructions when only a moderate amount of training data is available. However, we observe that they eventually converge to almost the same optimum as the high-order approaches at the maximum training network size.

Figure 2: Classification results on Facebook, as the number of training subnetworks K is varied. Panels: (a) gender: male? (b) relationship status: single? (c) political view: conservative? (d) religious view: Christian?

5 Related Work

Markov networks that tie parameters through templating are a popular approach to modeling relational and network data. Our analysis builds upon the models and learning algorithms proposed by Taskar et al. (2002), Richardson and Domingos (2006), and Neville and Jensen (2007). These algorithms have been successfully applied in single-network settings. However, to date, the theoretical issues of model convergence, consistency, and efficiency in single-network domains with data interdependence have not been explored.

Asymptotic analysis of i.i.d. estimators and its implications for finite-data learning have received considerable attention in the machine learning community. In Liang and Jordan (2008), the asymptotic analysis is motivated by comparing the performance of generative, discriminative, and pseudolikelihood estimators. Dillon and Lebanon (2009) exploited an understanding of the asymptotic variance to make better choices of the pseudolikelihood component weights. In relational domains, analyzing the asymptotic covariance provides us with additional insight into how the covariance structure of the network affects learning, and how to exploit that understanding to construct better pseudolikelihood components.

6 Conclusion

In this paper, we have performed an asymptotic analysis of relational learning methods applied to a single network. We illustrate the findings with an empirical exploration of the performance of MPLEs with different component construction schemes under various conditions. This provides a starting point for more sophisticated approaches to constructing MPLE components and controlling the level of joint learning in network data. We believe that further exploration in this direction is important for the relational learning community, since relational models grow with the size of the training data, making full MLE intractable to apply in practice.

As part of future work, we plan to explore other important theoretical issues which have not been addressed in this paper. One of particular interest is accounting for model mis-specification, i.e., how the learners perform when the true underlying distribution falls outside of the model family. While our experimental results on Facebook data seem to suggest that MPLE based on heuristic component construction is quite robust, a formal analysis of this setting is worth pursuing.

Acknowledgements

We thank the anonymous reviewers for helpful comments. This research is supported by NSF under contract numbers SES-0823313, IIS-1017898, and CCF-0939370. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon.


References

E. Bolthausen. On the central limit theorem for stationary mixing random fields. Ann. Probab., 10(4):1047–1050, 1982.

Francis Comets. On consistency of a class of estimators for exponential families of Markov random fields on the lattice. The Annals of Statistics, 20(1):455–468, 1992.

J. Dedecker, P. Doukhan, G. Lang, J. R. Leon, S. Louhichi, and C. Prieur. Weak Dependence: With Examples and Applications. Springer, 2007.

Joshua Dillon and Guy Lebanon. Statistical and computational tradeoffs in stochastic composite likelihood. In AISTATS, 2009.

P. L. Dobruschin. The description of a random field by means of conditional probabilities and conditions of its regularity. Theory Probab. Appl., 13:197–224, 1968.

Percy Liang and Michael I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In ICML '08: Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.

Jennifer Neville and David Jensen. Relational dependency networks. J. Mach. Learn. Res., 8:653–692, 2007.

Matthew Richardson and Pedro Domingos. Markov logic networks. Mach. Learn., 62(1-2):107–136, 2006.

Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191, May 2007.

P. Singla and P. Domingos. Markov logic in infinite domains. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2007.

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Proc. Sixth Berkeley Symp. on Math. Statist. and Prob., 2:583–602, 1972.

Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.

Benjamin Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In Proc. of the 18th Conference on Uncertainty in Artificial Intelligence, 2002.

A. W. van der Vaart. Asymptotic Statistics. Cambridge, 2000.

Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.

A APPENDIX—SUPPLEMENTARY MATERIAL

A.1 Proofs of Central Limit Theorems

To prove the essential central limit theorems, we use a standard technique for dependent data, developed by Bolthausen (1982), which is based on the well-known Stein's method (Stein, 1972).

Proof Sketch of Lemma 2

Let $a$ be any nonzero vector of the same dimension as $f_n$, and define $u_k = \big\langle a,\; \phi(x_{C_k}, y_{C_k}) - E_{\theta_0}[\phi_{C_k} \mid x] \big\rangle$ and $z_n \equiv \frac{1}{\sigma_n}\sum_{k=1}^{n} u_k = \frac{1}{\sigma_n}\langle a, n f_n \rangle$, where $\sigma_n^2 \equiv E_{\theta_0}\big[\big(\sum_{k=1}^{n} u_k\big)^2\big] \to n\, a^T \mathbf{V}(\theta_0; x)\, a$. The proof then reduces to showing that $z_n$ is asymptotically standard normal. By Lemma 2 in Bolthausen (1982), it suffices to show, for any $\lambda \in \mathbb{R}$,

$$\lim_{n\to\infty} E_{\theta_0}\big[(i\lambda - z_n)\exp(i\lambda z_n)\big] = 0 \qquad (16)$$

Apply the decomposition

$$(i\lambda - z_n)\exp(i\lambda z_n) = Q_1 + Q_2$$

where

$$Q_1 = i\lambda \exp(i\lambda z_n)\Big(1 - \frac{1}{\sigma_n}\sum_{k=1}^{n} u_k z_n\Big), \qquad Q_2 = \frac{1}{\sigma_n}\exp(i\lambda z_n)\sum_{k=1}^{n} u_k (i\lambda z_n - 1).$$

We have

$$E_{\theta_0}|Q_1| \le c\, E_{\theta_0}\Big|1 - \frac{1}{\sigma_n}\sum_{k=1}^{n} u_k z_n\Big| = 0$$

and $E_{\theta_0}|Q_2| \le c\, E_{\theta_0}\big|\frac{1}{\sigma_n}\sum_{k=1}^{n} u_k z_n\big| + c'\, E_{\theta_0}\big|\frac{1}{\sigma_n}\sum_{k=1}^{n} u_k\big|$. Note that $\sigma_n = O(n)$ and $E_{\theta_0}\big|\sum_{k=1}^{n} u_k z_n\big| \to a^T \mathbf{V}(\theta_0; x)\, a$. Therefore we get $E_{\theta_0}|Q_2| \to 0$.

Proof Sketch of Lemma 3

Similarly, let $u_k = \big\langle a,\, \sum_{C : C \cap S_k \ne \emptyset} \phi(x_C, y_C) \big\rangle - m_{S_k} E_{\theta_0}[\phi \mid y_{\partial S_k}, x]$ and $z_m \equiv \frac{1}{\sigma_m}\sum_{k=1}^{m} u_k = \frac{1}{\sigma_m}\langle a, m f_m \rangle$, where $\sigma_m^2 \equiv E_{\theta_0}\big[\big(\sum_{k=1}^{m} u_k\big)^2\big] \to m\, a^T \mathbf{C}(\theta_0; x)\, a$. Furthermore, define $z_{m,k} = \frac{1}{\sigma_m}\sum_{j : S_j \cap S_k \ne \emptyset} u_j$; $z_{m,k}$ is bounded by local finiteness. We show Equation (16) based on the following decomposition:

$$(i\lambda - z_m)\exp(i\lambda z_m) = Q_1 + Q_2 + Q_3$$

where

$$Q_1 = i\lambda \exp(i\lambda z_m)\Big(1 - \frac{1}{\sigma_m}\sum_{k=1}^{m} u_k z_m\Big)$$
$$Q_2 = \frac{1}{\sigma_m}\exp(i\lambda z_m)\sum_{k=1}^{m} u_k\big(i\lambda z_{m,k} + \exp(-i\lambda z_{m,k}) - 1\big)$$
$$Q_3 = -\frac{1}{\sigma_m}\sum_{k=1}^{m} u_k \exp\big(i\lambda(z_m - z_{m,k})\big)$$

Again, $E_{\theta_0}|Q_1| = 0$ holds. Furthermore,

$$E_{\theta_0}|Q_2| \le c\, E_{\theta_0}\Big|\frac{1}{\sigma_m}\sum_{k=1}^{m} \frac{u_k (\lambda z_{m,k})^2}{2}\Big|$$

where we have used the identity $|iv + \exp(-iv) - 1| \le \frac{v^2}{2}$. Since $z_{m,k} = O(\frac{1}{m})$,

$$E_{\theta_0}|Q_2| \le c'\, E_{\theta_0}\Big|\frac{1}{\sigma_m}\sum_{k=1}^{m} u_k\Big| \to 0$$

Finally, since $u_k$ is independent of $z_m - z_{m,k}$, $E_{\theta_0}|Q_3| = E_{\theta_0}\big|\frac{1}{\sigma_m} u_k\big|\, E_{\theta_0}\big|\exp\big(i\lambda(z_m - z_{m,k})\big)\big| = 0$.

A.2 Asymptotic Efficiency Proofs

Proof Sketch of Proposition 6

We consider the autocovariance of the joint vector of likelihood and pseudolikelihood gradients, $h = \binom{f}{\bar f}$ (writing $f$ for the likelihood gradient and $\bar f$ for the pseudolikelihood gradient):

$$E[\operatorname{Var}[h \mid \mathcal{S}, x]] = \begin{pmatrix} E[\operatorname{Var}[f \mid x]] & E[\operatorname{Cov}[f, \bar f \mid \mathcal{S}, x]] \\ E[\operatorname{Cov}[\bar f, f \mid \mathcal{S}, x]] & E[\operatorname{Var}[\bar f \mid \mathcal{S}, x]] \end{pmatrix} \qquad (17)$$

Since the above covariance matrix is positive semidefinite, we have:

$$\det\big(E[\operatorname{Var}[f \mid x]]\big) \ge \frac{\det\big(E[\operatorname{Cov}[f, \bar f \mid \mathcal{S}, x]]\; E[\operatorname{Cov}[\bar f, f \mid \mathcal{S}, x]]\big)}{\det\big(E[\operatorname{Var}[\bar f \mid \mathcal{S}, x]]\big)} \qquad (18)$$

Also recognize that:

$$\operatorname{Var}[f \mid x] = \frac{1}{n^2}\sum_{C_1, C_2} \operatorname{Cov}(\phi_{C_1}, \phi_{C_2} \mid x) = \frac{1}{n}\mathbf{V}[\phi \mid x] \qquad (19)$$

$$\operatorname{Var}[\bar f \mid \mathcal{S}, x] = \frac{1}{m^2}\operatorname{Var}\Big[\sum_{S \in \mathcal{S}} \sum_{C : C \cap S \ne \emptyset} u_C\Big] = \frac{1}{m^2}\sum_{C_1, C_2 : C_1 \bowtie C_2} \operatorname{Cov}\big(u_{C_1}, u_{C_2}\big) = \frac{1}{m}\mathbf{C}[\phi \mid x] \qquad (20)$$

$$\operatorname{Cov}[f, \bar f \mid \mathcal{S}, x] = \operatorname{Cov}[\bar f, f \mid \mathcal{S}, x] = \frac{1}{nm}\sum_{C_1 \in \mathcal{C}} \sum_{S \in \mathcal{S}} \sum_{C_2 : C_2 \cap S \ne \emptyset} \operatorname{Cov}\big[\phi_{C_1}, u_{C_2}\big] = \frac{1}{nm}\sum_{S \in \mathcal{S}} \sum_{C_1 : C_1 \cap S \ne \emptyset} \sum_{C_2 : C_2 \cap S \ne \emptyset} \operatorname{Cov}\big[u_{C_1}, u_{C_2}\big] = \frac{1}{n}\mathbf{V}[\phi \mid \mathcal{S}, x] \qquad (21)$$

Plugging Equations (19)–(21) into (18) and rearranging factors, we obtain (14).

Proof Sketch of Proposition 7

Again consider the covariance matrix of the joint feature vector as in Equation (17). By the law of total covariance, we have $\operatorname{Var}[h_1 \mid \mathcal{S}_1, x] \le \operatorname{Var}[h_2 \mid \mathcal{S}_2, x]$. Using a block matrix representation, we have:

$$\frac{\det\big(E\big[\operatorname{Cov}[f, \bar f \mid \mathcal{S}_1, x]\big]\; E\big[\operatorname{Cov}[\bar f, f \mid \mathcal{S}_1, x]\big]\big)}{\det\big(E\big[\operatorname{Var}[\bar f \mid \mathcal{S}_1, x]\big]\big)} \le \frac{\det\big(E\big[\operatorname{Cov}[f, \bar f \mid \mathcal{S}_2, x]\big]\; E\big[\operatorname{Cov}[\bar f, f \mid \mathcal{S}_2, x]\big]\big)}{\det\big(E\big[\operatorname{Var}[\bar f \mid \mathcal{S}_2, x]\big]\big)}$$

By applying Equations (20) and (21) and rearranging factors, we get (15).

A.3 Dataset Information

Synthetic Data Generation

For the synthetic data experiments, we used an Erdos-Renyi random graph model to generate synthetic network structures, and then generated a binary attribute and class label on the nodes using the Markov network model specified below. In the Erdos-Renyi model, we generate a network of $N$ nodes by including an edge between each of the $N(N-1)/2$ pairs of nodes independently at random, with probability $p$ chosen so that the expected average degree equals $d_{avg}$ (i.e., $p = d_{avg}/(N-1)$). In this work, we considered $d_{avg} = 5$ for the low-linkage setting and $d_{avg} = 10$ for the high-linkage setting. In the Markov network, we specified two clique types: a singleton clique defined on each node, $\Phi_{y=k} = \exp\{\theta_k I(x = k)\}$ for $k = 1, 2$, and an interaction clique defined on each edge $(i, j)$: $\Phi_{y_i = y_j} = \exp\{\theta_3\}$. The attributes ($x$) and labels ($y$) are then generated by Gibbs sampling with 1,000,000 iterations. For the low-interaction setting we used $\theta_3 = 0.1$, and for the high-interaction setting we used $\theta_3 = 0.3$. The singleton clique parameters were held constant across all experiments at $\theta_1 = 1.2$, $\theta_2 = 0.8$.

Facebook Data and Model

The Facebook dataset consists of 7,315 nodes and 46,817 edges, comprising all the students and alumni with public profiles from a large "University" network. The data include user profile attributes (gender, relationship status, political view, and religious view), as well as friendship links and transactions (wall posting and picture tagging) between users. We divide the network into 8 articulated subnetworks of comparable sizes for training and testing (e.g., "Class of 2008").

For classification, we build a pairwise Markov network based on friendship links. Two types of cliques are specified. The singleton clique is a linear weighting of node attributes:

$$\Phi_{y=k} = \exp\Big\{\sum_{j=1}^{m} \theta^{node}_{jk}\, I(x = k)\Big\},$$

and the edgewise clique is a linear weighting of the transactions between users $i$ and $j$:

$$\Phi_{y_i = y_j = k} = \exp\Big\{\sum_{l=1}^{t} \theta^{edge}_{lk}\, s_l(i, j) + \theta^{edge}_{0k}\Big\},$$

where $s_l(i, j)$ denotes the strength of transaction type $l$. We perform classification tasks based on the four aforementioned profile attributes. When classifying each attribute $y$, the model is conditioned on the remaining 3 attributes $x_1, x_2, x_3$. The two aforementioned transaction types are used in the model, and the strength $s_l(i, j)$ is computed as the logarithm of the total count of transactions of type $l$ occurring between $i$ and $j$.

strength of the transaction l. We perform classificationtasks based on the four aforementioned profile attributes.When classifying each attribute y, the model is conditionedon the remaining 3 attributes x1, x2, x3. The two afore-mentioned transactions are applied in the model, and thestrength sl(i, j) is computed as the logarithm of the totalcount of transaction type l occurred between i and j.