


Learning the Network Structure of Heterogeneous Data via Pairwise Exponential Markov Random Fields

Youngsuk Park, David Hallac, Stephen Boyd, Jure Leskovec
Stanford University

Abstract

Markov random fields (MRFs) are a useful tool for modeling relationships present in large and high-dimensional data. Often, this data comes from various sources and can have diverse distributions, for example a combination of numerical, binary, and categorical variables. Here, we define the pairwise exponential Markov random field (PE-MRF), an approach capable of modeling exponential family distributions in heterogeneous domains. We develop a scalable method of learning the graphical structure across the variables by solving a regularized approximated maximum likelihood problem. Specifically, we first derive a tractable upper bound on the log-partition function. We then use this upper bound to derive the group graphical lasso, a generalization of the classic graphical lasso problem to heterogeneous domains. To solve this problem, we develop a fast algorithm based on the alternating direction method of multipliers (ADMM). We also prove that our estimator is sparsistent, with guaranteed recovery of the true underlying graphical structure, and that it has a polynomially faster runtime than the current state-of-the-art method for learning such distributions. Experiments on synthetic and real-world examples demonstrate that our approach is both efficient and accurate at uncovering the structure of heterogeneous data.

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

1 Introduction

Markov random fields (MRFs) are a fundamental tool for many applications in machine learning [5, 20]. Often, it is necessary to model MRFs between heterogeneous entities, where the nodes in the graphical model can refer to different types of objects. For example, modeling medical patients may require joint reasoning about the relationship between categorical and continuous variables (i.e., age, gender, and medical history events) [1]. Or, when analyzing protein interactions, data about different types of proteins might be captured using different experimental methods [17]. In order to faithfully model such heterogeneous domains using graphical models, nodes of the model must follow different types of distributions (Gaussian, Poisson, binary, etc.). While the structure of such heterogeneous graphical models is typically learned through observational data, estimating them is challenging for both computational and mathematical reasons, and scalable inference methods have not yet been developed.

In this paper, we propose a class of multivariate exponential family distributions, which we call the pairwise exponential Markov random field (PE-MRF). PE-MRFs explicitly reveal the Markov structure across different variables and can cover many common distributions, such as Ising models, Gaussian MRFs, and mixed (heterogeneous) models. Our approach extends previous methods of graphical inference [15, 21, 25] by using a different representation of the joint distribution. This allows for a compact representation of heterogeneous variables, which eventually leads to a much faster structure/parameter learning method ($O(p^3)$, compared to $O(p^4)$, to learn a $p$-node distribution with $O(p^2)$ unknown parameters).

After formally defining the PE-MRF model, we propose a method of estimating its parameters. Because solving the exact maximum likelihood problem is computationally intractable in general high-dimensional settings, we extend an approach known as the approximated maximum likelihood, which previously has only been used for solving Ising models [2, 22]. This approach relies on deriving a tractable upper bound on the (intractable) log-partition function of the PE-MRF. We then show that the estimator can be simplified into solving a convex problem, minimizing the log-determinant divergence [19] plus a group sparsity penalty [10]. We call this the group graphical lasso for PE-MRFs, since it turns out to be a generalization of the well-known graphical lasso [9], a popular method of learning Gaussian MRFs. The graphical lasso is just a special case of our method, which is more generally able to reveal the Markov structure in heterogeneous settings. Furthermore, we prove that our estimator is sparsistent [19, 25], meaning that, under some mathematical assumptions, we are asymptotically guaranteed to recover the true underlying Markov structure of the distribution.

By converting the problem into the group graphical lasso, we are able to infer the structure in a scalable way. In contrast to the pseudo-likelihood, a commonly-used alternative approach which in the general case requires Newton-type methods [4, 14, 21, 25], we develop an algorithm based on the alternating direction method of multipliers (ADMM) [6], and we solve for closed-form updates for each of the ADMM subproblems. These fast updates speed up the solver and make ADMM more scalable than other methods. Finally, we test our approach's accuracy and scalability on both synthetic and real datasets.

Summary of Contributions. The main contributions of this paper are as follows:

• We propose a pairwise exponential family distribution (PE-MRF), explicitly revealing the Markov structure across heterogeneous variables.

• We formulate an approximated maximum likelihood problem by deriving a tractable upper bound on the log-partition function.

• We develop a scalable ADMM algorithm with closed-form updates.

• We prove that our estimator is sparsistent.

1.1 Related Work on Pairwise Models

The PE-MRF is related to several recently suggested models for inferring Markov random fields. Our primary contribution is that the PE-MRF model presents the most scalable method to date for learning the Markov structure of heterogeneous distributions. Furthermore, previously proposed approaches satisfy up to two of our three desired conditions (generality, scalability, and sparsistency), but PE-MRFs are the first to attain all three. We examine several alternative methods in more detail below.

Limited Heterogeneous Distributions. When the distributions at every node are uniparameter, Yang et al. [25] proposed learning the parameters via a pseudo-likelihood approach, solved by Newton's method, and provided sparsistency guarantees. However, their model is unable to generalize to multiparameter settings. Consider the case of a Gaussian MRF: this approach cannot model the problem unless either the mean or the variance is known at every node beforehand. Similarly, Lee et al. [14] used a pseudo-likelihood approach to learn the Markov structure of discrete-Gaussian mixed models. However, their model did not provide any sparsistency guarantee, nor did it generalize to other exponential family distributions. Our PE-MRFs, on the other hand, provide sparsistency and can also solve for distributions with multiparameter and multivariate variables at each node, a much broader class of problems.

Vector-Space MRFs (VS-MRFs). A separate approach, VS-MRF [21], is capable of learning general heterogeneous distributions. In fact, their approach and ours can cover the same classes of pairwise exponential families. However, there is a significant contrast between VS-MRF and PE-MRF in terms of scalability. By modeling the problem differently, we can derive an approximated maximum likelihood, which allows for a very scalable algorithm (but had previously only been used in homogeneous settings [2, 22]). Instead, VS-MRFs rely on the pseudo-likelihood. From an algorithmic perspective, to learn a $p$-node graphical model in $k$ ADMM iterations [6], our algorithm has a runtime of $O(kp^3)$ whereas VS-MRF takes $O(kp^4)$. Note that there are on the order of $p^2$ unknown parameters. We leverage this speedup in Section 6, running our algorithm on large models, where we empirically discover that VS-MRF is several orders of magnitude slower.

Node-wise Regression. In node-wise regression, each node estimates the associated edge structure of its local neighborhood. For Gaussian MRFs and discrete models, this method has been used to learn the Markov structure in a scalable way [15, 16, 18]. In addition, Banerjee et al. [2] developed a block coordinate descent method for solving such homogeneous models, which can be viewed as iterative node-wise regressions with a penalty. This type of method, however, has only been suggested for a limited class of homogeneous distributions, and not for broader problems in general heterogeneous domains. As such, there is no known method of using node-wise regression to learn heterogeneous distributions, or to guarantee sparsistency in these cases.


2 Problem Setup

Consider a set of $n$ independent multivariate observations $\{x^1, x^2, \ldots, x^n\}$. We assume that these are sampled i.i.d. from an exponential family distribution $p(x;\theta)$ represented by a $p$-node graphical model with natural parameter $\theta$. Here, the $p$ nodes may have heterogeneous domains. For example, some elements may be defined over the set of real numbers, while others may have a finite discrete domain (i.e., $\{1, 2, \ldots, m\}$) or a countably infinite one. We use these samples to estimate the underlying distribution. More specifically, we infer a Markov random field described by $G = (V, E)$ with $|V| = p$. This structure can be encoded in the exponential family parameter $\theta$.

2.1 Pairwise Exponential Markov Random Fields

We define the pairwise exponential Markov random field (PE-MRF), a subclass of the multivariate exponential family distribution that can explicitly reveal the Markov structure across heterogeneous variables.

Here, we denote $\langle A, B \rangle_F = \mathrm{Tr}(AB)$ as the Frobenius inner product between two matrices and $\mathrm{vec}[a_1, \ldots, a_k] = [a_1^T, \ldots, a_k^T]^T$ as the concatenation of a set of vectors.

Definition. For a random vector $X = \{X_1, \ldots, X_p\}$ defined over (heterogeneous) domains¹ $\mathcal{X} = \otimes\{\mathcal{X}_r\}_{r=1}^p$, suppose the conditional distribution of each variable $X_r$ given the remaining $p-1$ variables $X_{\setminus r}$ follows a (known) exponential family distribution on the domain $\mathcal{X}_r$. This distribution is specified by an $m_r$-dimensional node-potential $B_r(X_r)$ and scalar base measure $C_r(X_r)$. Then, a random vector $X$ is defined as a PE-MRF if, for $x = \{x_1, \ldots, x_p\} \in \mathcal{X}$, it follows the joint distribution
\[
p(x;\theta) = \exp\Big\{\sum_{r=1}^p \theta_r^T B_r(x_r) + \sum_{s,t=1}^p \big\langle \Theta_{st},\, B_t(x_t) B_s(x_s)^T \big\rangle_F + \sum_{r=1}^p C_r(x_r) - A(\theta)\Big\}. \tag{1}
\]
Here $\theta = \{\theta_1, \ldots, \theta_p, \Theta_{11}, \Theta_{12}, \ldots, \Theta_{pp}\}$ is the natural parameter, consisting of node-parameters $\theta_r \in \mathbb{R}^{m_r}$ and edge-parameters $\Theta_{st} \in \mathbb{R}^{m_s \times m_t}$. $A(\theta)$ is the (finite-valued) log-partition function²
\[
A(\theta) = \log \int_{\mathcal{X}} \exp\Big\{\sum_{r=1}^p \theta_r^T B_r(x_r) + \sum_{s,t=1}^p \big\langle \Theta_{st},\, B_t(x_t) B_s(x_s)^T \big\rangle_F + \sum_{r=1}^p C_r(x_r)\Big\}\,\nu(dx).
\]

Note that the edge-parameters $\{\Theta_{st}\}_{s,t=1}^p$ explicitly reveal the Markov structure, because $\Theta_{st} \in \mathbb{R}^{m_s \times m_t}$ is a zero matrix if and only if $X_s$ and $X_t$ are conditionally independent given all other variables [12].

¹Here, $\otimes$ refers to the Kronecker product.
²$\nu$ is a proper measure on $\mathcal{X}$. Refer to [23].

Exponential Family Expression. Distribution (1) can be written as
\[
p(x;\theta) = \exp\big\{\langle \theta, B(x) \rangle + C(x) - A(\theta)\big\}, \tag{2}
\]
with base measure $C(x) = \sum_{r=1}^p C_r(x_r)$, sufficient statistic $B(x)$, and inner product $\langle \theta, B(x) \rangle$ given by
\[
B(x) = \big(\{B_r(x_r)\}_{r=1}^p,\; \{B_s(x_s) B_t(x_t)^T\}_{s,t=1}^p\big),
\qquad
\langle \theta, B(x) \rangle = \sum_{r=1}^p \theta_r^T B_r(x_r) + \sum_{s,t=1}^p \big\langle \Theta_{st},\, B_t(x_t) B_s(x_s)^T \big\rangle_F.
\]

Alternative Quadratic Expression. The model in (1) can also be written in quadratic form, as
\[
p(x;\theta) = \exp\big\{b(x)^T \Theta\, b(x) + C(x) - A(\theta)\big\},
\]
where we introduce the (extended) node-potential vector $b(x) = \mathrm{vec}[1, B_1(x_1), \ldots, B_p(x_p)]$ with base measure $C(x) = \sum_{r=1}^p C_r(x_r) - 1$, and
\[
\Theta =
\begin{bmatrix}
1 & \theta_1^T/2 & \cdots & \theta_p^T/2 \\
\theta_1/2 & \Theta_{11} & \cdots & \Theta_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_p/2 & \Theta_{p1} & \cdots & \Theta_{pp}
\end{bmatrix}.
\]

Node-Conditional Distribution. The conditional distribution of $X_r$ given $X_{\setminus r}$ follows
\[
p(x_r \,|\, x_{\setminus r};\theta) \propto \exp\Big\{\Big(\theta_r + \sum_{t \neq r} \Theta_{rt} B_t(x_t)\Big)^T B_r(x_r) + \big\langle \Theta_{rr},\, B_r(x_r) B_r(x_r)^T \big\rangle_F + C_r(x_r)\Big\}. \tag{3}
\]
This is just an exponential family with sufficient statistic $\{B_r(x_r),\, B_r(x_r) B_r(x_r)^T\}$ and base measure $C_r(x_r)$.

2.2 Examples of PE-MRFs

PE-MRFs can model many popular distributions, ranging from homogeneous pairwise models (exponential, Poisson, gamma, etc.) to general mixed ones. From the node-conditional distribution in Equation (3), we can design PE-MRFs on a node-by-node basis, by choosing the desired potential $B_r$ and base measure $C_r$. Note that, in order to get a valid joint distribution, we must consider domain constraints $\mathcal{D}$ on the parameter $\theta$ to guarantee a finite log-partition function $A(\theta)$.

Gaussian MRF (GMRF). PE-MRFs can model GMRFs by setting the node-potential $B_r(x_r) = x_r$, so that the corresponding sufficient statistic at each node becomes $\{x_r, x_r^2\}$. This is a valid joint distribution under the (negative definite) domain constraints $\mathcal{D} = \{(\theta_{\mathrm{node}}, \Theta_{\mathrm{edge}}) \mid \Theta_{\mathrm{edge}} \prec 0\}$. For a zero-mean GMRF, we add the restriction $\theta_{\mathrm{node}} = 0$. If a variable $X_r$ has known variance $\sigma_r^2$, then we can additionally assign $\Theta_{rr} = -\frac{1}{2\sigma_r^2}$.

Ising and Discrete Models. PE-MRFs can also model a discrete distribution with domain $\mathcal{X}_r = \{0, 1, \ldots, m_r\}$, by choosing the node-potential $B_r(x_r) = [I(x_r = 1), \ldots, I(x_r = m_r)]^T$. Moreover, restricting $\theta_r = 0$ and $\Theta_{rr} = \mathrm{diag}([\Theta_{rr:11}, \ldots, \Theta_{rr:m_r m_r}])$ gives the minimal representation of a discrete model.

Mixed Models. Likewise, by choosing suitable node potentials, we can design any associated mixed model using a PE-MRF. The exponential family distributions that PE-MRFs can cover include, but are not limited to, Poisson, quadratic Poisson [26] (which can capture both positive and negative correlations), lognormal, gamma, Dirichlet, and any combination thereof, under proper domain constraints.
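To make the node-by-node construction concrete, the sketch below (ours, not taken from the paper's released code; all function names are illustrative) writes down node-potentials $B_r$ and base measures $C_r$ for a small Gaussian/Bernoulli/Poisson mixed model.

import numpy as np
from math import lgamma

# Illustrative node-potential definitions for a 3-node mixed PE-MRF
# (Gaussian, Bernoulli, Poisson). Each B_r maps a scalar observation to an
# m_r-dimensional potential vector; C_r is the scalar base measure.

def B_gaussian(x):           # m_r = 1; sufficient statistic {x, x^2} via Theta_rr
    return np.array([float(x)])

def C_gaussian(x):
    return 0.0               # base measure w.r.t. Lebesgue measure

def B_bernoulli(x):          # m_r = 1; indicator potential I(x = 1)
    return np.array([float(x == 1)])

def C_bernoulli(x):
    return 0.0

def B_poisson(x):            # m_r = 1; identity potential on counts
    return np.array([float(x)])

def C_poisson(x):
    return -lgamma(x + 1.0)  # log(1/x!), the Poisson base measure

node_potentials = [B_gaussian, B_bernoulli, B_poisson]
node_base_measures = [C_gaussian, C_bernoulli, C_poisson]
m = [1, 1, 1]                # node-potential dimensions m_r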

3 Learning the Structure: Approximate Maximum Likelihood Approach

In order to learn the Markov structure of a PE-MRF distribution, we formulate the following (negative) maximum likelihood problem with regularization,
\[
\underset{\theta}{\text{minimize}} \quad -l(\theta) + R_\lambda(\theta). \tag{4}
\]
Here, $l(\theta) = \langle \theta, \hat{\mu} \rangle - A(\theta)$ is the log-likelihood of $\theta$ (up to scale and constant), where $\hat{\mu} = \frac{1}{n}\sum_{k=1}^n B(x^k)$ is the empirical mean parameter, i.e., the sufficient statistic $B(x^k)$ averaged over the samples $\{x^k\}_{k=1}^n$,
\[
\hat{\mu} = \Big(\Big\{\tfrac{1}{n}\sum_{k=1}^n B_r(x^k_r)\Big\}_{r=1}^p,\; \Big\{\tfrac{1}{n}\sum_{k=1}^n B_s(x^k_s) B_t(x^k_t)^T\Big\}_{s,t=1}^p\Big).
\]
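A minimal sketch of how this empirical mean parameter can be assembled from samples, assuming the illustrative `node_potentials` list above (the block layout is our choice, not prescribed by the paper):

import numpy as np

def empirical_mean_parameter(samples, node_potentials):
    """Compute mu_hat = (1/n) sum_k B(x^k) as node blocks mu_r and
    pairwise blocks mu_st, following the definition above."""
    n = len(samples)
    p = len(node_potentials)
    # evaluate B_r(x^k_r) for every sample k and node r
    Bvals = [[node_potentials[r](x[r]) for r in range(p)] for x in samples]
    mu_node = [sum(Bvals[k][r] for k in range(n)) / n for r in range(p)]
    mu_pair = [[sum(np.outer(Bvals[k][s], Bvals[k][t]) for k in range(n)) / n
                for t in range(p)] for s in range(p)]
    return mu_node, mu_pair

Given $n$ mixed observations stored as rows, the returned node and pairwise blocks are exactly the quantities that enter $M_1[\hat{\mu}]$ below.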

Regularization. $R_\lambda(\theta)$ is a regularization function with tuning parameter $\lambda$ that encourages structural sparsity. We use the $\ell_1/\ell_2$ group lasso penalty [11],
\[
R_\lambda(\theta) = \lambda \sum_{s \neq t} w_{st} \|\Theta_{st}\|_F, \tag{5}
\]
where $\|\cdot\|_F$ is the Frobenius norm, i.e., $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. This encourages the $st$-th block, for every $s \neq t$, to be a zero matrix. Note that if the $\Theta_{st}$'s are all scalar parameters, then this becomes a standard lasso penalty. Here, $\{w_{st}\}_{s,t=1}^p$ is a set of scalar values, typically depending on the size and variance of the associated $B_s(X_s) B_t(X_t)^T$, chosen to balance the weights on $\{\|\Theta_{st}\|_F\}_{s,t=1}^p$ [14].
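For illustration, a direct transcription of the penalty (5) with the weight choice $w_{st} = \sqrt{m_s m_t}$ used later in Section 6 (the nested-list block layout is an assumption of this sketch):

import numpy as np

def group_lasso_penalty(Theta_blocks, lam, m):
    """R_lambda(theta) = lam * sum_{s != t} w_st * ||Theta_st||_F,
    with w_st = sqrt(m_s * m_t); Theta_blocks[s][t] is the m_s x m_t block."""
    p = len(m)
    total = 0.0
    for s in range(p):
        for t in range(p):
            if s != t:
                w_st = np.sqrt(m[s] * m[t])
                total += w_st * np.linalg.norm(Theta_blocks[s][t], 'fro')
    return lam * total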

3.1 Approximated Maximum Likelihood

When the exponential family has a tractable $A(\theta)$, for example with a Gaussian MRF, we can attempt to exactly solve the maximum likelihood in Problem (4). However, in the general case, $A(\theta)$ involves a high-dimensional integral and is typically intractable to compute. In order to overcome this, we use an approximated maximum likelihood approach [2], where we replace $A(\theta)$ in Problem (4) with a tractable (convex) upper bound $U(\theta)$,
\[
\underset{\theta}{\text{minimize}} \quad -\langle \theta, \hat{\mu} \rangle + U(\theta) + R_\lambda(\theta). \tag{6}
\]

3.1.1 Upper Bound on Log-Partition Function

In this section, we use the notions of mean parameters and variational analysis [23]. These allow us to obtain an alternative expression for $A(\theta)$, which we use to eventually derive the upper bound $U(\theta)$.

Notation. Recall that the natural parameter $\theta$ and sufficient statistic $B(x)$ of a PE-MRF are defined over the domain $\mathcal{Y} = \mathbb{R}^{m_1} \times \cdots \times \mathbb{R}^{m_p} \times \mathbb{R}^{m_1 m_1} \times \mathbb{R}^{m_1 m_2} \times \cdots \times \mathbb{R}^{m_p m_p}$. Here, we define the inner product over this domain as $\langle a, b \rangle = \sum_{r=1}^p a_r^T b_r + \sum_{s,t=1}^p \langle A_{st}, B_{st} \rangle_F$ for elements $a, b \in \mathcal{Y}$, which is consistent with the definition in (2).

We define the set of realizable expected sufficient statistics $B(\cdot)$ over all valid distributions, the mean parameters [23], as
\[
\mathcal{M}(B) = \Big\{\mu := \big\{\{\mu_r\}_{r=1}^p,\, \{\mu_{st}\}_{s,t=1}^p\big\} \in \mathcal{Y} \;\Big|\; \exists\, p(\cdot) \in \Delta_d \text{ s.t. } \mathbb{E}[B_r(X_r)] = \mu_r,\; \mathbb{E}[B_s(X_s) B_t(X_t)^T] = \mu_{st}\Big\},
\]
where $d = \sum_{r=1}^p m_r$ is the sum of the dimensions of the node potentials and $\Delta_d$ is a probability simplex in $\mathbb{R}^d$.

Let $\nu \in \mathbb{R}$ be a fixed scalar. We define a map $M_\nu : \mathcal{Y} \to \mathbb{R}^{(d+1)\times(d+1)}$, for $\mu \in \mathcal{Y}$, as
\[
M_\nu[\mu] =
\begin{bmatrix}
\nu & \mu_1^T & \mu_2^T & \cdots & \mu_p^T \\
\mu_1 & \mu_{11} & \mu_{12} & \cdots & \mu_{1p} \\
\mu_2 & \mu_{21} & \mu_{22} & \cdots & \mu_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mu_p & \mu_{p1} & \mu_{p2} & \cdots & \mu_{pp}
\end{bmatrix}.
\]

With these notations, we now derive an upper bound. Assume that the PE-MRF is regular [23] and that the conditional distribution at each node comes from a known exponential family. Let each node $r$ have the distance $c_r = \inf_{a \neq b \in \mathcal{X}_r} \|B_r(a) - B_r(b)\|_\infty$ in the domain of its sufficient statistic $B_r(X_r)$. We assume $c_r = 0$ for continuous nodes and that the discrete nodes $I_D$ are separable with respect to the sufficient statistic, meaning $c_r > 0$.

Theorem 3.1. For a PE-MRF, the log-partition function $A(\theta)$ has the following upper bound:
\[
A(\theta) \;\le\; \max_{\mu \in \mathcal{M}(B)} \Big\{\langle \mu, \theta \rangle + \tfrac{1}{2}\log\det\big(M_1[\mu] + D\big)\Big\} + f_1,
\]
where $D = \mathrm{diag}([0, l_1, \ldots, l_p])$, $l_r = \frac{c_r^3}{12}[\underbrace{1, \ldots, 1}_{m_r}]$, and $f_1 = \frac{d}{2}\log(2\pi e) - \sum_{r \in I_D} m_r \log c_r$.

The proof follows from Wainwright et al. [22], where the problem is alternatively represented via the Shannon entropy $H(X)$ over the mean parameter. In particular, we attain the upper bound from the relationship between the entropy $H(X)$ and the entropy of the node-potentials $H(\{B_r(X_r)\}_{r=1}^p)$, in addition to a different choice of $\{l_r\}_{r=1}^p$ for heterogeneous domains.
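To make the objects in this bound concrete, here is a small sketch (ours; the dense-array layout is an assumption) that assembles $M_\nu[\mu]$ and the diagonal correction $D$ from node and pairwise blocks:

import numpy as np

def build_M(nu, mu_node, mu_pair):
    """Assemble the (d+1)x(d+1) matrix M_nu[mu]: corner nu, first row/column
    the node blocks mu_r, and (s,t) blocks mu_st."""
    m = [len(v) for v in mu_node]
    d = sum(m)
    offsets = np.concatenate(([0], np.cumsum(m)))
    M = np.zeros((d + 1, d + 1))
    M[0, 0] = nu
    for r, mu_r in enumerate(mu_node):
        M[0, 1 + offsets[r]:1 + offsets[r + 1]] = mu_r
        M[1 + offsets[r]:1 + offsets[r + 1], 0] = mu_r
    for s in range(len(m)):
        for t in range(len(m)):
            M[1 + offsets[s]:1 + offsets[s + 1],
              1 + offsets[t]:1 + offsets[t + 1]] = mu_pair[s][t]
    return M

def build_D(c, m):
    """D = diag([0, l_1, ..., l_p]) with l_r = (c_r^3 / 12) * ones(m_r);
    c_r = 0 for continuous nodes, c_r > 0 for separable discrete nodes."""
    l = np.concatenate([np.full(m_r, (c_r ** 3) / 12.0)
                        for c_r, m_r in zip(c, m)])
    return np.diag(np.concatenate(([0.0], l)))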

By taking the relaxation of the dual, we can convert the high-dimensional problem from Theorem 3.1 into the following tractable form.

Corollary 3.2. The log-partition function $A(\theta)$ has the following upper bound:
\[
A(\theta) \;\le\; \frac{1}{2}\min_{\nu \in \mathbb{R}} \Big\{-\frac{1}{2}\log\det\big(-M_{1+\nu}[\theta']\big) - \nu\Big\} - \frac{1}{2}\big\langle M_1[\theta'], D\big\rangle_F + f_2,
\]
where $\theta' = \{\theta_1/2, \ldots, \theta_p/2, \Theta_{11}, \Theta_{12}, \ldots, \Theta_{pp}\} \in \mathcal{Y}$ and $f_2 = \frac{3}{2} + d + \frac{d}{2}\log(2\pi e) - \sum_{r \in I_D} m_r \log c_r$.

3.1.2 Approximated Maximum Likelihood: Group Graphical Lasso

By plugging the upper bound from Corollary 3.2 into the approximated maximum likelihood problem (6), we obtain the following optimization problem.

Theorem 3.3. For a PE-MRF, the approximated negative maximum log-likelihood problem is equivalent to
\[
\min_{\Theta \in S^{d+1}_{++}} \quad \big\langle \Theta,\, M_1[\hat{\mu}] + D \big\rangle_F - \log\det\Theta + R_\lambda(\Theta), \tag{7}
\]
with parameter matrix $\Theta = -M_{\nu_\theta}[\theta']$, where $\nu_\theta \in \mathbb{R}$ is a dummy parameter and $R_\lambda(\Theta) = R_\lambda(\theta)$.

Note that there may be additional parameter constraints for some PE-MRF distributions, which can easily be incorporated into (7). We can view this problem as an $\ell_1/\ell_2$-regularized log-determinant divergence with respect to the empirical average of the sufficient statistics, defined by the Bregman divergence corresponding to the log-determinant function³ [19]. We call problem (7) the group graphical lasso for a PE-MRF, since it is an extension of the classic graphical lasso problem [9, 15, 19] to heterogeneous settings.

3.2 Gaussianization of the Group Graphical Lasso

For a zero-mean Gaussian MRF, we set the additional constraints $[\theta_1, \ldots, \theta_p]^T = 0$. In fact, in this case our problem becomes equivalent to the graphical lasso. In the general case, however, the naive graphical lasso optimizes with respect to the empirical covariance matrix, whereas our approach optimizes using the sample average of the sufficient statistics of a PE-MRF. Still, the following corollary shows how the group graphical lasso in (7) can be interpreted as a generalization of the classic graphical lasso model to heterogeneous domains.

Corollary 3.4. The group graphical lasso (7) is equivalent to the (exact) maximum likelihood problem with a group lasso penalty for a GMRF $Z \sim \mathcal{N}(\mu^*, \Sigma^*)$, where
\[
\Sigma^* = \mathrm{Cov}[b_{\mathrm{node}}(X)] + \mathrm{diag}([l_1, \ldots, l_p]),
\qquad
\mu^* = [\Sigma^*]^{-1}\, \mathbb{E}[b_{\mathrm{node}}(X)],
\]
and $b_{\mathrm{node}}(X) = \mathrm{vec}[B_1(X_1), \ldots, B_p(X_p)]$.

The proof is immediate from the fact that the (log-)likelihood function of $\mathcal{N}(\mu^*, \Sigma^*)$ can be expressed as $g(\Theta) = \langle \Theta, M_1[\mathbb{E}[B(X)]] + D\rangle - \log\det\Theta$ by using the Schur complement.

Therefore, Corollary 3.4 implies that the estimator of the group graphical lasso for a PE-MRF is equivalent to solving a distinct graphical lasso problem for $b_{\mathrm{node}}(X)$ (instead of $X$), where the lasso regularization term is replaced by a group lasso penalty. Here, we treat $b_{\mathrm{node}}(X)$ as following a Gaussian distribution with the specified mean and covariance.
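Read this way, one could in principle form the matrix $\Sigma^*$ from data and hand it to any graphical-lasso-style solver with a group penalty. The sketch below is our illustration only: it replaces the population covariance with its sample estimate and merely constructs the input matrix.

import numpy as np

def gaussianized_input(samples, node_potentials, c, m):
    """Sample version of Sigma* = Cov[b_node(X)] + diag([l_1, ..., l_p]),
    the matrix on which (7) acts like a (group) graphical lasso."""
    # stack b_node(x^k) = vec[B_1(x_1), ..., B_p(x_p)] for every sample
    Bstack = np.array([np.concatenate([node_potentials[r](x[r])
                                       for r in range(len(m))])
                       for x in samples])
    cov = np.cov(Bstack, rowvar=False)      # sample estimate of Cov[b_node(X)]
    l = np.concatenate([np.full(m_r, (c_r ** 3) / 12.0)
                        for c_r, m_r in zip(c, m)])
    return cov + np.diag(l)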

4 Optimization Algorithm

Here, we propose an algorithm to solve the group graphical lasso (7) for PE-MRFs. Our approach is based on the alternating direction method of multipliers (ADMM) [6]. To solve, we introduce a consensus variable $Z = \Theta$ and rewrite (7) as the equivalent problem, with $A = M_1[\hat{\mu}] + D$,
\[
\min_{Z = \Theta,\; \Theta \in S^{d+1}_{++}} \quad \langle \Theta, A \rangle_F - \log\det\Theta + \lambda_n \sum_{i \neq j} w_{ij} \|Z_{ij}\|_F.
\]

³$D_{\log\det(\cdot)}(\Theta \,\|\, \bar{B}) = -\log\det\Theta + \log\det\bar{B} + \big\langle \bar{B}^{-1},\, \Theta - \bar{B} \big\rangle_F$.


ADMM solves the corresponding augmented Lagrangian in an iterative manner with respect to the primal variable $\Theta$, the consensus variable $Z$, and the (scaled) dual variable $U$. For iteration $k+1$, closed-form updates for each of the three subproblems are provided below. ADMM is guaranteed to converge to the global optimum for convex problems, with a standard (primal and dual residual) stopping criterion [6].

$\Theta$-Update. The $\Theta$ update is
\[
\Theta^{k+1} := \frac{1}{2\eta}\, Q\big(\Lambda + \sqrt{\Lambda^2 + 4\eta I}\big)\, Q^T,
\]
where $\eta = \rho/n$ and $Q \Lambda Q^T$ is the eigendecomposition of $\eta(Z^k - U^k) - \big(M_1[\hat{\mu}] + D\big)$ [24].

Z-Update. The $Z^{k+1}$ update is $M_{\nu_z^{k+1}}\big[\{z_1^{k+1}, \ldots, z_p^{k+1}, Z_{11}^{k+1}, Z_{12}^{k+1}, \ldots, Z_{pp}^{k+1}\}\big]$, where, for $i, j \in \{1, \ldots, p\}$,
\[
\nu_z^{k+1} = \nu_\theta^{k+1} + \nu_u^{k}, \qquad
z_i = \theta_i^{k+1} + u_i^{k}, \qquad
Z_{ii} = \Theta_{ii}^{k+1} + U_{ii}^{k},
\]
\[
Z_{ij,\, i \neq j} =
\begin{cases}
\Big(1 - \dfrac{\eta_{ij}}{\beta_{ij}}\Big)\big(\Theta_{ij}^{k+1} + U_{ij}^{k}\big) & \beta_{ij} \ge \eta_{ij}, \\[4pt]
0 & \text{otherwise},
\end{cases}
\]
with $\eta_{ij} = \lambda_n w_{ij}$ and $\beta_{ij} = \big\|\Theta_{ij}^{k+1} + U_{ij}^{k}\big\|_F$ [6].

U-Update. The $U$ update is
\[
U^{k+1} := U^{k} + \Theta^{k+1} - Z^{k+1}.
\]

Algorithmic Complexity. Note that the eigendecomposition in the $\Theta$ update is the main computational task in our algorithm, with a runtime of $O(p^3)$. For $k$ ADMM iterations, the computational cost is $O(kp^3)$, which is very efficient considering that the total number of parameters to estimate is $O(p^2)$. On the other hand, the pseudo-likelihood approach requiring Newton-type methods [14, 21] needs to compute $O(p^3)$ operations every iteration at each of the $p$ nodes, requiring $O(kp^4)$ in total. We will see in Section 6 how this leads to a significant difference in scalability on large problems.
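For concreteness, a minimal sketch of one pass of the three updates is given below. The penalty parameter `rho`, the block bookkeeping through `offsets`, and the exact scaling of the soft-threshold are our assumptions (a standard scaled-ADMM form); the repository referenced in Section 6 is the actual implementation.

import numpy as np

def admm_step(A, Z, U, rho, lam, w, offsets):
    """One ADMM pass for the group graphical lasso (7):
    min <Theta, A>_F - log det Theta + lam * sum_{i!=j} w[i][j] * ||Z_ij||_F,
    with A = M_1[mu_hat] + D.  offsets[i]:offsets[i+1] indexes block i
    (block 0 is the scalar corner); rho is the ADMM penalty parameter."""
    # Theta-update: closed form via eigendecomposition
    lam_eig, Q = np.linalg.eigh(rho * (Z - U) - A)
    theta_eig = (lam_eig + np.sqrt(lam_eig ** 2 + 4.0 * rho)) / (2.0 * rho)
    Theta = Q @ np.diag(theta_eig) @ Q.T

    # Z-update: start from Theta + U, then group-soft-threshold the
    # off-diagonal node blocks (corner, first row/column, and diagonal
    # blocks are simply copied)
    V = Theta + U
    Z_new = V.copy()
    nblk = len(offsets) - 1
    for i in range(1, nblk):
        for j in range(1, nblk):
            if i == j:
                continue
            blk = V[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]]
            nrm = np.linalg.norm(blk, 'fro')
            thr = lam * w[i - 1][j - 1] / rho
            Z_new[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = (
                max(0.0, 1.0 - thr / nrm) * blk if nrm > 0 else 0.0)

    # U-update: scaled dual variable
    U_new = U + Theta - Z_new
    return Theta, Z_new, U_new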

5 Sparsistency

In this section, we present the conditions under which we are guaranteed to recover the underlying graphical structure embedded in the true parameter matrix $\Theta^{\mathrm{true}}$. From Corollary 3.4, recall that the solution of the group graphical lasso (7) is equivalent to the estimator of a (distinct) Gaussian MRF regularized by a group lasso penalty. This implies that our estimate $\hat{\Theta}$ from (7) may have a different value from $\Theta^{\mathrm{true}}$, unless the node potential $b_{\mathrm{node}}(X)$ follows a proper normal distribution, which happens for example in GMRFs and lognormal MRFs. Nonetheless, under some mathematical assumptions for general PE-MRFs, we can demonstrate that the estimator of the group graphical lasso (7) is sufficient to recover the underlying Markov structure, represented by $E(\Theta^{\mathrm{true}}) = \{(s, t) \mid \|\Theta^{\mathrm{true}}_{st}\|_2 > 0 \text{ for } 1 \le s \neq t \le p\}$.

Notation. We introduce several norms. We let $\|M\|_\infty = \max_{i,j} |M_{ij}|$. For a group of vectors $\{g_i\}_{i=1}^k$ and its concatenation $g = \mathrm{vec}[g_1, \ldots, g_k]$, we define the (group) norm $\|g\|_{\infty,2} = \big\|[\|g_1\|_2, \ldots, \|g_k\|_2]\big\|_\infty$. This norm induces an operator norm $|||A|||_{\infty,2}$.

Additionally, we introduce some parameters related to the group graphical lasso. For the log-likelihood function $g(\Theta) = \langle \Theta, M_1[\hat{\mu}] + D\rangle - \log\det\Theta$ in (7), we define its (asymptotic) estimator as $\Theta^* = \big[\mathbb{E}[M_1[B(X)]] + D\big]^{-1}$. The corresponding gradient and Hessian are denoted as $\Sigma^* := \nabla g(\Theta^*) = (\Theta^*)^{-1}$ and $\Gamma^* := \nabla^2 g(\Theta^*) = (\Theta^*)^{-1} \otimes (\Theta^*)^{-1}$, respectively [7]. Here, $\Sigma^*$ and $\Gamma^*$ are not necessarily the covariance and Fisher information.

We denote $S$ as the edge set of $\Theta^{\mathrm{true}}$ (including all self-edges), and $S^c$ as its complement. Then, $\Gamma^*_{S S^c} \in \mathbb{R}^{|S| \times (p^2 - |S|)}$ is defined as the submatrix of $\Gamma^*$ indexed by $S$ and $S^c$. We define $\{\kappa\} = \{\kappa_{\Sigma^*}, \kappa_{\Gamma^*}, \kappa_{\mathrm{Cov}[B]}, \kappa_{\mathrm{Cov\,min}}\}$ with $\kappa_{\Sigma^*} := |||\Sigma^*|||_{\infty,2}$, $\kappa_{\Gamma^*} := |||\Gamma^*|||_{\infty,2}$, $\kappa_{\mathrm{Cov}[B]} := \|\mathrm{Cov}[B(X)]\|_\infty$, and $\kappa_{\mathrm{Cov\,min}}$ the minimum value among the non-zero elements of $(\mathrm{Cov}[b_{\mathrm{node}}(X)])^{-1}$. We denote $m_{\max} = \max_r m_r$, $w_{\max} = \max_{s,t} w_{st}$, and $w_{\min} = \min_{s,t} w_{st}$.

Assumptions. We make three assumptions:

1. The PE-MRF has an underlying graphical structure with singleton separator sets and maximum degree $d$.

2. Incoherence condition: $|||\Gamma^*_{S^c S}(\Gamma^*_{SS})^{-1}|||_{\infty,2} \le \frac{w_{\min}}{w_{\max}}(1 - \alpha)$ for some $\alpha \in (0, 1]$.

3. Boundedness condition: $\mathbb{E}[B(X)]$ and $\mathrm{Cov}[B(X)]$ are bounded.

Intuitively, the incoherence condition limits the correlation between edge variables and non-edge variables; see Loh et al. [15] and Ravikumar et al. [19].

Lemma 5.1. Suppose the boundedness condition holds. Then, for $\lambda_n \ge 2 m_{\max} \sqrt{\kappa_{\mathrm{Cov}[B]}} \sqrt{\frac{\log(m_{\max} p)}{n}}$, there exists a universal constant $c_1 > 0$ satisfying
\[
\Pr\Big[\big\|M_1[\hat{\mu}] - M_1[\mathbb{E}[B(X)]]\big\|_{\infty,2} < \lambda_n\Big] \ge 1 - e^{-c_1 n}.
\]


Theorem 5.2. Suppose a PE-MRF satisfies all three assumptions. For a regularization parameter $\lambda_n > \frac{8(w_{\max} + w_{\min})}{\alpha\, w_{\max} w_{\min}} \sqrt{\kappa_{\mathrm{Cov}[B]}} \sqrt{\frac{\log(m_{\max} p)}{n}}$, let $\hat{\Theta}$ be the unique solution of the group graphical lasso (7). If the number of samples satisfies $n > c_2(\{\kappa\}, \{w_{st}\}, m_{\max}, d, \alpha)\, \log(m_{\max} p)$, then the following two statements about $\hat{\Theta}$ hold with probability at least $1 - e^{-c_1 n}$:

1. $\big\|\hat{\Theta} - \Theta^*\big\|_{\infty,2} < 2\kappa_{\Sigma^*}\Big(\frac{w_{\max} w_{\min}}{4(w_{\max} + w_{\min})} + w_{\max}\Big)\lambda_n$, where $\Theta^* = \big(M_1[\mathbb{E}[B(X)]] + D\big)^{-1}$.

2. The recovered edge set $E(\hat{\Theta}) = \Big\{(s, t) \;\Big|\; \big\|\hat{\Theta}_{st}\big\|_2 > 2\kappa_{\Sigma^*}\Big(\frac{w_{\max} w_{\min}}{4(w_{\max} + w_{\min})} + w_{\max}\Big)\lambda_n\Big\}$ becomes the same as the true edge set $E(\Theta^{\mathrm{true}})$.

Here $c_2(\{\kappa\}, \{w_{st}\}, m_{\max}, d, \alpha) = 16^2\,\kappa_{\Gamma^*}\Big(1 + \frac{4(w_{\max}+w_{\min})}{w_{\min}}\Big)^{2} \kappa_{\mathrm{Cov}[B]}\,\max\Big\{9^2\kappa_{\Sigma^*} d^2 m_{\max},\; 9^6\kappa_{\Sigma^*}^2\kappa_{\Gamma^*} d^2 m_{\max}^2,\; \kappa_{\mathrm{Cov\,min}}^{2}/4\Big\}$.

The first statement, an error bound, can be derived from Lemma 5.1 and the primal-dual witness approach [19], albeit with different inequalities due to the group lasso penalty. The second is based on the relationship between the graphical structure with singleton separator sets and the generalized inverse covariance matrix [15], with the additional constraint that the error is less than the threshold $\kappa_{\mathrm{Cov\,min}}/2$ for large $n$.

From Theorem 5.2, for any PE-MRF satisfying our three assumptions, we are asymptotically guaranteed to recover the true underlying Markov structure. Note that the first two assumptions are likely to hold for sparse graphs with well-balanced weights $\{w_{st}\}$, and the last one holds for PE-MRFs consisting of known exponential family distributions at each node. Moreover, even if all parts of a graph are not separated by singleton sets, we can still recover the graphical structure of the sub-parts that are separated [15].

6 Experiments

Synthetic Data. We analyze performance⁴ on a heterogeneous synthetic network containing 32 nodes: eight Bernoulli, eight gamma, eight Gaussian, and eight Dirichlet (k=3). We consider two cases: sparse (10% of all potential edges exist) and dense (50% exist). For both cases, we run 30 independent trials, with random edge weights, each time taking 1000 samples from the distribution via Gibbs sampling. We compare with VS-MRF [21], which is the only other approach that can explicitly model such a diverse distribution. In our algorithm, we choose $w_{st} = \sqrt{m_s \times m_t}$ and normalize the raw data so that the rows are numerically well-balanced. We plot the ROC curves for edge recovery percentage (since we care more about capturing the structure than the precise weights of each edge) in Figure 1a. As shown, both are better able to recover sparse graphs than dense ones. Overall, the two methods attain similar accuracies, and as shown in Figure 2, our PE-MRF model scales much better to large datasets. In this experiment, we keep the same proportion of Bernoulli, gamma, Gaussian, and Dirichlet nodes, but vary the total problem size. We compare the runtimes of PE-MRF and VS-MRF on the same machine. As shown, PE-MRF is several orders of magnitude faster, opening up new applications that previously could not be modeled in a heterogeneous way. Our PE-MRF solver can learn a 100-node distribution in just 5 minutes. In contrast, VS-MRF [21] takes over 31 hours to converge.

⁴All code is available at https://github.com/youngsuk0723/PE-MRF-Code.

Heterogeneous Genomic Networks. Inferring the structure of genomic regulatory networks from experimental data is a fundamental task in computational biology. Gaussian graphical models are particularly popular and well-suited for this task, as gene expression measurements generated by microarray technology approximately follow a Gaussian distribution. However, new technologies for DNA sequencing can produce data with heterogeneous distributions that violate the Gaussianity assumption. Here, we use a PE-MRF to infer a genomic regulatory network. We use Level III public data from The Cancer Genome Atlas (TCGA) [17] for 290 breast cancer patients. The data consists of miRNA sequencing counts mapped back to a reference genome, which follow a Poisson distribution, and microarray gene expression profiles, which are Gaussian. We employ three common steps to process the data: adjust for sequencing depth, remove genes whose mutations are known to have low functional impact on cancer progression, and filter out miRNAs with low variance across samples. In total, the dataset contains expression profiles for 500 genes and 314 miRNAs. Our goal is to infer a heterogeneous MRF with a total of 814 nodes, 500 of which are Gaussian (genes) and 314 of which are Poisson (miRNAs).

We infer a PE-MRF, choosing the regularization parameter $\lambda$ that minimizes the Akaike information criterion (AIC), and plot the resulting network in Figure 1b. The network contains three types of edges: between two genes, between two miRNAs, and between one of each. All three edge types are interesting and potentially worth exploring further, but we demonstrate here the utility of the gene-gene subnetwork due to the availability of comprehensive gene data.


Figure 1: (a) ROC curves (true positive rate vs. false positive rate) comparing our PE-MRF method with VS-MRF under high- and low-sparsity structures, (b) inferred multi-modal network of miRNAs and genes, (c) known interactions (co-expression, co-localization, shared protein domains, physical interactions, genetic interactions) from a biomedical database for the k-core of the inferred network.

In particular, we consider a k-core of the gene-gene subnetwork, the subnetwork of genes whose inferred degree within the selected subnetwork is at least k (we choose k = 65). To validate our model, we use external data and observe how many gold-standard edges exist between genes from the tightly connected 65-core. Using GeneMANIA [27], we find the core of the inferred gene subnetwork is well supported by many established interactions, as shown in Figure 1c. We infer this network from only the 290 patient samples, yet we notice the abundance of interaction edges in Figure 1c, which match the connectivity of the inferred gene subnetwork. The many genetic interactions (in green) between genes from the core are particularly encouraging for our model, since genetic interactions correspond closely with partial correlations [3, 13], which is precisely what our PE-MRF attempts to model.

Figure 2: Scalability comparison between our PE-MRF model and VS-MRF [21] on heterogeneous data (Bernoulli, gamma, Gaussian, and Dirichlet nodes).

7 Conclusion

In this paper, we have proposed a method of learning Markov networks from observational data. Our approach models an underlying distribution as a pairwise exponential Markov random field (PE-MRF), a class of multivariate exponential families that is well-suited for heterogeneous distributions. To estimate the parameters in a scalable way, we derive the approximated maximum likelihood problem and develop an ADMM algorithm with closed-form updates. We then prove sparsistency, or guaranteed recovery of the true Markov structure, of our method. Our promising results, as well as the widespread applications with heterogeneous data sources, lead to many potential extensions of this work. For example, if the Markov structure changes over time, we could use the timestamped observations to estimate a time-varying network, instead of inferring a single network. Similarly, one could learn multiple heterogeneous distributions (separate yet coupled) at once, similar to the joint graphical lasso [8]. Moreover, PE-MRFs can be extended to include higher-order interactions, which open up new applications and can help increase the potential impact of our work.

Acknowledgements. This work was supported by NSF IIS-1149837, NIH BD2K, DARPA XDATA, DARPA SIMPLEX, Stanford Data Science Initiative, Boeing, Bosch, Lightspeed, SAP, and Volkswagen.


References

[1] E. Aramaki, Y. Miura, M. Tonoike, T. Ohkuma, H. Masuichi, K. Waki, and K. Ohe. Extraction of adverse drug effects from clinical records. Proceedings of the 13th World Congress on Medical Informatics, 160(Pt 1):739–43, 2010.

[2] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9:485–516, 2008.

[3] B. Barzel and A.-L. Barabasi. Network link prediction by global silencing of indirect correlations. Nature Biotechnology, 31(8):720–725, 2013.

[4] J. Besag. Statistical analysis of non-lattice data. The Statistician, pages 179–195, 1975.

[5] C. Bishop. Pattern recognition. Machine Learning, 128, 2006.

[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[8] P. Danaher, P. Wang, and D. M. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):373–397, 2014.

[9] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[10] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML, pages 433–440. ACM, 2009.

[11] A. Jalali, P. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models using group-sparse regularization. In AISTATS, pages 378–387, 2011.

[12] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[13] T. Le, L. Liu, A. Tsykin, G. J. Goodall, B. Liu, B.-Y. Sun, and J. Li. Inferring microRNA–mRNA causal regulatory relationships from expression data. Bioinformatics, 2013.

[14] J. Lee and T. Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2015.

[15] P.-L. Loh, M. Wainwright, et al. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics, 41(6):3022–3049, 2013.

[16] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.

[17] C. G. A. Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.

[18] P. Ravikumar, M. Wainwright, J. Lafferty, et al. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

[19] P. Ravikumar, M. Wainwright, G. Raskutti, B. Yu, et al. High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

[20] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications. CRC Press, 2005.

[21] W. Tansey, O. H. M. Padilla, A. S. Suggala, and P. Ravikumar. Vector-space Markov random fields via exponential families. ICML, 2015.

[22] M. Wainwright and M. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099–2109, 2006.

[23] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[24] D. M. Witten and R. Tibshirani. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):615–636, 2009.

[25] E. Yang, Y. Baker, P. Ravikumar, G. Allen, and Z. Liu. Mixed graphical models via exponential families. In AISTATS, pages 1042–1050, 2014.

[26] E. Yang, P. Ravikumar, G. Allen, and Z. Liu. On Poisson graphical models. In NIPS, pages 1718–1726, 2013.

[27] K. Zuberi, M. Franz, H. Rodriguez, J. Montojo, C. T. Lopes, G. D. Bader, and Q. Morris. GeneMANIA prediction server 2013 update. Nucleic Acids Research, 41(W1):W115–W122, 2013.