Lifted Tree-Reweighted Variational Inference∗

Hung Hai Bui
Natural Language Understanding Lab
Nuance Communications
Sunnyvale, CA, USA
[email protected]

Tuyen N. Huynh
John von Neumann Institute
Vietnam National University
Ho Chi Minh City
[email protected]

David Sontag
Courant Institute of Mathematical Sciences
New York University
[email protected]

∗ Main article appeared in UAI 2014. This version also includes the supplementary material.

Abstract

We analyze variational inference for highly symmetric graphical models such as those arising from first-order probabilistic models. We first show that for these graphical models, the tree-reweighted variational objective lends itself to a compact lifted formulation which can be solved much more efficiently than the standard TRW formulation for the ground graphical model. Compared to earlier work on lifted belief propagation, our formulation leads to a convex optimization problem for lifted marginal inference and provides an upper bound on the partition function. We provide two approaches for improving the lifted TRW upper bound. The first is a method for efficiently computing maximum spanning trees in highly symmetric graphs, which can be used to optimize the TRW edge appearance probabilities. The second is a method for tightening the relaxation of the marginal polytope using lifted cycle inequalities and novel exchangeable cluster consistency constraints.

1 Introduction

Lifted probabilistic inference focuses on exploiting symmetries in probabilistic models for efficient inference [5, 2, 3, 10, 17, 18, 21]. Work in this area has demonstrated the possibility of performing very efficient inference in highly connected, large tree-width, but symmetric models, such as those arising in the context of relational (first-order) probabilistic models and exponential family random graphs [19]. These models also arise frequently in probabilistic programming languages, an area of increasing importance as demonstrated by DARPA's PPAML program (Probabilistic Programming for Advancing Machine Learning).

Even though lifted inference can sometimes offer order-of-magnitude improvements in performance, approximation is still necessary. A topic of particular interest is the interplay between lifted inference and variational approximate inference. Lifted loopy belief propagation (LBP) [13, 21] was one of the first attempts at exploiting symmetry to speed up loopy belief propagation; subsequently, counting belief propagation (CBP) [16] provided additional insight into the nature of symmetry in BP. Nevertheless, these works were largely procedural and specific to the choice of message-passing algorithm (in this case, loopy BP). More recently, Bui et al. [3] proposed a general framework for lifting a broad class of convex variational techniques by formalizing the notion of symmetry (defined via automorphism groups) of graphical models and of the corresponding variational optimization problems themselves, independent of any specific methods or solvers.

Our goal in this paper is to extend the lifted variational framework in [3] to address the important case of approximate marginal inference. In particular, we show how to lift the tree-reweighted (TRW) convex formulation of marginal inference [28]. As far as we know, our work presents the first lifted convex variational marginal inference, with the following benefits over previous work: (1) a lifted convex upper bound on the log-partition function, (2) a new tightening of the relaxation of the lifted marginal polytope exploiting exchangeability, and (3) a convergent inference algorithm. We note that convex upper bounds on the log-partition function immediately lead to concave lower bounds on the log-likelihood, which can serve as useful surrogate loss functions in learning and parameter estimation [29, 13].

To achieve the above goal, we first analyze the symmetry of the TRW log-partition and entropy bounds. Since the TRW bounds depend on the choice of the edge appearance probabilities ρ, we prove that the quality of the TRW bound is not affected if one only works with suitably symmetric ρ. Working with symmetric ρ gives rise to an explicit lifted formulation of the TRW optimization problem that is equivalent but much more compact. This convex objective function can be convergently optimized via a Frank-Wolfe (conditional gradient) method where each Frank-Wolfe iteration solves a lifted MAP inference problem. We then discuss the optimization of the edge-appearance vector ρ, effectively yielding a lifted algorithm for computing maximum spanning trees in symmetric graphs.

As in Bui et al.'s framework, our work can benefit from any tightening of the local polytope, such as the use of cycle inequalities [1, 23]. In fact, each method for relaxing the marginal polytope immediately yields a variant of our algorithm. Notably, in the case of exchangeable random variables, radically sharper tightening (sometimes even an exact characterization of the lifted marginal polytope) can be obtained via a set of simple and elegant linear constraints which we call exchangeable polytope constraints. We provide extensive simulation studies comparing the behaviors of different variants of our algorithm with exact inference (when available) and lifted LBP, demonstrating the advantages of our approach. The supplementary material [4] provides additional proofs and algorithm details.

2 Background

We begin by reviewing variational inference and the tree-reweighted (TRW) approximation. We focus on inference in Markov random fields, which are distributions in the exponential family given by $\Pr(x;\theta) = \exp\{\langle \Phi(x), \theta\rangle - A(\theta)\}$, where $A(\theta)$ is called the log-partition function and serves to normalize the distribution. We assume that the random variables $x \in \mathcal{X}^n$ are discrete-valued, and that the features $(\Phi_i), i \in \mathcal{I}$ factor according to the graphical model structure G; Φ can be non-pairwise and is assumed to be overcomplete. This paper focuses on the inference tasks of estimating the marginal probabilities $p(x_i)$ and approximating the log-partition function. Throughout the paper, the domain $\mathcal{X}$ is the binary domain {0, 1}; however, except for the construction of exchangeable polytope constraints in Section 6, this restriction is not essential.
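As a concrete reference point, the following brute-force computation of $A(\theta)$ and the node marginals for a tiny pairwise binary MRF may be helpful. This is a minimal sketch with illustrative potentials of our own choosing, not code from the paper, and it is only feasible for very small n:

```python
import itertools, math

# A tiny pairwise binary MRF on a triangle: x in {0,1}^3. The potentials
# below are illustrative choices, not taken from the paper.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]
theta_node = {v: 0.5 for v in nodes}     # single-node ("local field") weights
theta_edge = {e: -0.1 for e in edges}    # repulsive pairwise weights

def score(x):
    """<Phi(x), theta> for one joint configuration x."""
    return (sum(theta_node[v] * x[v] for v in nodes)
            + sum(theta_edge[(u, v)] * x[u] * x[v] for (u, v) in edges))

# Log-partition function A(theta) = log sum_x exp <Phi(x), theta>.
A = math.log(sum(math.exp(score(x))
                 for x in itertools.product([0, 1], repeat=3)))

# Exact node marginals p(x_v = 1): the quantities that variational
# inference approximates when enumeration is intractable.
marginals = {v: sum(math.exp(score(x) - A)
                    for x in itertools.product([0, 1], repeat=3) if x[v] == 1)
             for v in nodes}
print(A, marginals)
```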

Variational inference approaches view the log-partition function as a convex optimization problem over the marginal polytope, $A(\theta) = \sup_{\mu \in \mathcal{M}(G)} \langle \mu, \theta \rangle - A^*(\mu)$, and seek tractable approximations of the negative entropy $A^*$ and the marginal polytope $\mathcal{M}$ [27]. Formally, $-A^*(\mu)$ is the entropy of the maximum entropy distribution with moments µ. Observe that $-A^*(\mu)$ is upper bounded by the entropy of the maximum entropy distribution consistent with any subset of the expected sufficient statistics µ. To arrive at the TRW approximation [26], one uses the subset given by the pairwise moments of a spanning tree.¹ Hence, for any distribution ρ over spanning trees, an upper bound on $-A^*$ is obtained by taking a convex combination of tree entropies: $-B^*(\tau,\rho) = \sum_{s \in V(G)} H(\tau_s) - \sum_{e \in E(G)} I(\tau_e)\rho_e$. Since ρ is a distribution over spanning trees, it must belong to the spanning tree polytope $\mathbb{T}(G)$, with $\rho_e$ denoting the edge appearance probability of edge e. Combined with a relaxation of the marginal polytope $\text{OUTER} \supset \mathcal{M}$, an upper bound B on the log-partition function is obtained:

$$A(\theta) \leq B(\theta, \rho) = \sup_{\tau \in \text{OUTER}(G)} \langle \tau, \theta \rangle - B^*(\tau, \rho) \quad (1)$$

We note that $B^*$ is linear w.r.t. ρ, and for $\rho \in \mathbb{T}(G)$, $B^*$ is convex w.r.t. τ. On the other hand, B is convex w.r.t. ρ and θ.

¹ If the original model contains non-pairwise potentials, they can be represented as cliques in the graphical model, and the bound based on spanning trees still holds.
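To make the entropy bound concrete, here is a minimal sketch of $-B^*(\tau,\rho)$ for pairwise pseudo-marginals; the data-structure choices (dicts keyed by nodes and edges) are ours, not the paper's:

```python
import math

def H(p):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

def neg_B_star(tau_node, tau_edge, rho):
    """Convex combination of tree entropies:
    -B*(tau, rho) = sum_s H(tau_s) - sum_e rho_e * I(tau_e),
    where I(tau_e) = H(tau_u) + H(tau_v) - H(tau_uv) is the mutual
    information of the edge pseudo-marginal."""
    ent = sum(H(p) for p in tau_node.values())
    for (u, v), p_uv in tau_edge.items():
        mutual_info = H(tau_node[u]) + H(tau_node[v]) - H(p_uv)
        ent -= rho[(u, v)] * mutual_info
    return ent

# A single edge with uniform pseudo-marginals and rho_e = 1 (a tree):
# the edge mutual information is 0, so the bound equals 2 ln 2.
tau_node = {0: [0.5, 0.5], 1: [0.5, 0.5]}
tau_edge = {(0, 1): [0.25, 0.25, 0.25, 0.25]}  # joint over (x0, x1)
print(neg_B_star(tau_node, tau_edge, {(0, 1): 1.0}))
```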

The optimal solution $\tau^*(\rho, \theta)$ of the optimization problem (1) can be used as an approximation to the mean parameter µ(θ). Typically, the local polytope LOCAL, given by pairwise consistency constraints, is used as the relaxation OUTER; in this paper, we also consider tightenings of the local polytope.

Since (1) holds for any edge appearance ρ in the spanning tree polytope $\mathbb{T}$, the TRW bound can be further improved by optimizing ρ:

$$\inf_{\rho \in \mathbb{T}(G)} B(\theta, \rho) \quad (2)$$

The resulting $\rho^*$ is then plugged into (1) to find the marginal approximation. In practice, one might choose to work with some fixed choice of ρ, for example the uniform distribution over all spanning trees. [14] proposed using the most uniform edge weights, $\arg\inf_{\rho \in \mathbb{T}(G)} \sum_{e \in E} \big(\rho_e - \frac{|V|-1}{|E|}\big)^2$, which can be found via conditional gradient, where each direction-finding step solves a maximum spanning tree problem.

Several algorithms have been proposed for optimizing the TRW objective (1) given fixed edge appearance probabilities. [27] derived the tree-reweighted belief propagation algorithm from the fixed point conditions. [8] show how to solve the dual of the TRW objective, which is a geometric program. Although this algorithm has the advantage of guaranteed convergence, it is non-trivial to generalize this approach to use tighter relaxations of the marginal polytope, which we show is essential for lifted inference. [14] use an explicit set of spanning trees and then use dual decomposition to solve the optimization problem. However, as we show in the next section, to maintain symmetry it is essential that one not work directly with spanning trees but rather use symmetric edge appearance probabilities. [23] optimize TRW over the local and cycle polytopes using a Frank-Wolfe (conditional gradient) method, where each iteration requires solving a linear program. We follow this latter approach in our paper.

To optimize the edge appearances in (2), [26] proposed using conditional gradient. They observed that $\frac{\partial B(\theta,\rho)}{\partial \rho_e} = -\frac{\partial B^*(\tau^*,\rho)}{\partial \rho_e} = -I(\tau^*_e)$, where $\tau^*$ is the solution of (1). The direction-finding step in conditional gradient thus reduces to solving $\sup_{\rho \in \mathbb{T}} \langle \rho, I \rangle$, again equivalent to finding the maximum spanning tree with the edge mutual informations $I(\tau^*_e)$ as weights. We discuss the corresponding lifted problem in Section 5.
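The direction-finding step is thus an ordinary maximum spanning tree computation. Below is a minimal Kruskal sketch with union-find, applied to illustrative mutual-information weights of our own choosing (in the lifted setting, Section 5 replaces this with the orbit-level version):

```python
def kruskal_max_spanning_tree(nodes, weighted_edges):
    """Maximum-weight spanning tree via Kruskal's algorithm with union-find.
    weighted_edges: list of (weight, u, v). Returns the list of tree edges."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                       # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

# The gradient of B w.r.t. rho_e is -I(tau*_e), so sup_{rho in T} <rho, I>
# is solved by an MST with the edge mutual informations as weights.
mi = {(0, 1): 0.30, (1, 2): 0.05, (0, 2): 0.20}  # example mutual informations
edges = [(w, u, v) for (u, v), w in mi.items()]
print(kruskal_max_spanning_tree([0, 1, 2], edges))  # -> [(0, 1), (0, 2)]
```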


3 Lifted Variational Framework

We build on the key element of the lifted variational framework introduced in [3]. The automorphism group of a graphical model, or more generally of an exponential family, is defined as the group $\mathbb{A}$ of permutation pairs (π, γ), where π permutes the set of variables and γ permutes the set of features, in such a way that they preserve the feature function: $\Phi^{\gamma^{-1}}(x^\pi) = \Phi(x)$. Note that this construction of $\mathbb{A}$ is based entirely on the structure of the model and does not depend on the particular choice of the model parameters; nevertheless, the group stabilizes² (preserves) the key characteristics of the exponential family, such as the marginal polytope $\mathcal{M}$, the log-partition function A, and the entropy $-A^*$.

As shown in [3], the automorphism group is particularly useful for exploiting symmetries when parameters are tied. For a given parameter-tying partition ∆ such that $\theta_i = \theta_j$ for i, j in the same cell³ of ∆, the group $\mathbb{A}$ gives rise to a subgroup called the lifting group $\mathbb{A}_\Delta$ that stabilizes the tied-parameter vector θ as well as the exponential family. The orbit partition of the lifting group can be used to formulate equivalent but more compact variational problems. More specifically, let ϕ = ϕ(∆) be the orbit partition induced by the lifting group on the feature index set $\mathcal{I} = \{1 \dots m\}$, let $\mathbb{R}^m_{[\varphi]}$ denote the symmetrized subspace $\{r \in \mathbb{R}^m \text{ s.t. } r_i = r_j \ \forall i, j \text{ in the same cell of } \varphi\}$, and define the lifted marginal polytope $\mathcal{M}_{[\varphi]}$ as $\mathcal{M} \cap \mathbb{R}^m_{[\varphi]}$; then (see Theorem 4 of [3])

$$\sup_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle - A^*(\mu) = \sup_{\mu \in \mathcal{M}_{[\varphi]}} \langle \theta, \mu \rangle - A^*(\mu) \quad (3)$$

In practice, we need to work with convex variational approximations of the LHS of (3), where $\mathcal{M}$ is relaxed to an outer bound OUTER(G) and $A^*$ is approximated by a convex function $B^*(\mu)$. We now state a similar result for lifting general convex approximations.

Theorem 1. If $B^*(\mu)$ is convex and stabilized by the lifting group $\mathbb{A}_\Delta$, i.e., for all $(\pi, \gamma) \in \mathbb{A}_\Delta$, $B^*(\mu^\gamma) = B^*(\mu)$, then ϕ is the lifting partition for the approximate variational problem

$$\sup_{\mu \in \text{OUTER}(G)} \langle \theta, \mu \rangle - B^*(\mu) = \sup_{\mu \in \text{OUTER}[\varphi]} \langle \theta, \mu \rangle - B^*(\mu) \quad (4)$$

The importance of Theorem 1 is that it shows that it is equivalent to optimize over a subset of OUTER(G) where pseudo-marginals in the same orbit are restricted to take the same value. As we will show in Section 4.2, this allows us to combine many of the terms in the objective, which is where the computational gains derive from. A sketch of its proof is as follows. Consider a single pseudo-marginal vector µ. Since the objective value is the same for every $\mu^\gamma$ with $(\pi, \gamma) \in \mathbb{A}_\Delta$ and since the objective is concave, the average of these, $\frac{1}{|\mathbb{A}_\Delta|} \sum_{(\pi,\gamma) \in \mathbb{A}_\Delta} \mu^\gamma$, must have at least as good an objective value. Furthermore, note that this averaged vector lives in the symmetrized subspace. Thus, it suffices to optimize over OUTER[ϕ].

² Formally, $\mathbb{G}$ stabilizes x if $x^g = x$ for all $g \in \mathbb{G}$.
³ If $\Delta = \{\Delta_1 \dots \Delta_K\}$ is a partition of S, then each subset $\Delta_k \subset S$ is called a cell.
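The averaging step in this proof sketch is just orbit averaging, i.e., projection onto the symmetrized subspace. A minimal sketch in our own notation (the partition and vector below are illustrative):

```python
import numpy as np

def orbit_average(mu, partition):
    """Project mu onto the symmetrized subspace R^m[phi] by replacing each
    coordinate with the mean of its orbit (cell of the partition)."""
    out = np.empty_like(mu, dtype=float)
    for cell in partition:
        out[list(cell)] = mu[list(cell)].mean()
    return out

# Partition phi of the index set {0,...,4} into two orbits.
phi = [[0, 1, 2], [3, 4]]
mu = np.array([0.2, 0.4, 0.6, 0.1, 0.3])
print(orbit_average(mu, phi))  # -> [0.4 0.4 0.4 0.2 0.2]

# Equivalently, the averaged vector lies in the image of the characteristic
# matrix D of the partition, with one lifted value per orbit.
D = np.zeros((5, 2)); D[[0, 1, 2], 0] = 1; D[[3, 4], 1] = 1
mu_lifted = np.array([0.4, 0.2])
assert np.allclose(D @ mu_lifted, orbit_average(mu, phi))
```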

4 Lifted Tree-Reweighted Problem

4.1 Symmetry of TRW Bounds

We now show that Theorem 1 can be used to lift the TRW optimization problem (1). Note that the applicability of Theorem 1 is not immediately obvious, since $B^*$ depends on the distribution over trees implicit in ρ. In establishing that the condition in Theorem 1 holds, we need to be careful that the choice of the distribution over trees ρ does not destroy the symmetry of the problem.

The result below ensures that, with no loss in optimality, ρ can be assumed to be suitably symmetric. More specifically, let $\varphi_E = \varphi_E(\Delta)$ be the set of G's edge orbits induced by the action of the lifting group $\mathbb{A}_\Delta$; the edge weights $\rho_e$ for all e in the same edge orbit can be constrained to be the same, i.e., ρ can be restricted to $\mathbb{T}[\varphi_E]$.

Theorem 2. For any $\rho \in \mathbb{T}$, there exists a symmetrized $\bar\rho \in \mathbb{T}[\varphi_E]$ that yields at least as good an upper bound, i.e.,

$$B(\theta, \bar\rho) \leq B(\theta, \rho) \quad \forall \theta \in \Theta[\Delta]$$

As a consequence, in optimizing the edge appearances, ρ can be restricted to the symmetrized spanning tree polytope $\mathbb{T}[\varphi_E]$:

$$\forall \theta \in \Theta[\Delta], \quad \inf_{\rho \in \mathbb{T}} B(\theta, \rho) = \inf_{\rho \in \mathbb{T}[\varphi_E]} B(\theta, \rho)$$

Proof. Let ρ be the argmin of the LHS, and define $\bar\rho = \frac{1}{|\mathbb{A}_\Delta|} \sum_{\pi \in \mathbb{A}_\Delta} \rho^\pi$, so that $\bar\rho \in \mathbb{T}[\varphi_E]$. For all $(\pi, \gamma) \in \mathbb{A}_\Delta$ and all tied-parameter vectors $\theta \in \Theta[\Delta]$, $\theta^\pi = \theta$, so $B(\theta, \rho^\pi) = B(\theta^\pi, \rho^\pi)$. By Theorem 1 of [3], π must be an automorphism of the graph G. By Lemma 7 (see the supplementary material), $B(\theta^\pi, \rho^\pi) = B(\theta, \rho)$. Thus $B(\theta, \rho^\pi) = B(\theta, \rho)$. Since B is convex w.r.t. ρ, by Jensen's inequality we have $B(\theta, \bar\rho) \leq \frac{1}{|\mathbb{A}_\Delta|} \sum_{\pi \in \mathbb{A}_\Delta} B(\theta, \rho^\pi) = B(\theta, \rho)$.

Using a symmetric choice of ρ, the TRW bound $B^*$ then satisfies the condition of Theorem 1, enabling the applicability of the general lifted variational inference framework.

Theorem 3. For a fixed $\rho \in \mathbb{T}[\varphi_E]$, ϕ is the lifting partition for the TRW variational problem

$$\sup_{\tau \in \text{OUTER}(G)} \langle \tau, \theta \rangle - B^*(\tau, \rho) = \sup_{\tau \in \text{OUTER}[\varphi]} \langle \tau, \theta \rangle - B^*(\tau, \rho) \quad (5)$$


4.2 Lifted TRW Problems

We now give the explicit lifted formulation of the TRW optimization problem (5). As in [3], we restrict τ to OUTER[ϕ] by introducing a lifted variable $\bar\tau_j$ for each cell $\varphi_j$ and, for all $i \in \varphi_j$, enforcing that $\tau_i = \bar\tau_j$. Effectively, we substitute every occurrence of $\tau_i$, $i \in \varphi_j$, by $\bar\tau_j$; in vector form, τ is substituted by $D\bar\tau$, where D is the characteristic matrix of the partition ϕ: $D_{ij} = 1$ if $i \in \varphi_j$ and 0 otherwise. This results in the lifted form of the TRW problem

$$\sup_{D\bar\tau \in \text{OUTER}} \langle \bar\tau, \bar\theta \rangle - \bar{B}^*(\bar\tau, \bar\rho) \quad (6)$$

where $\bar\theta = D^\top \theta$; $\bar{B}^*$ is obtained from $B^*$ via the above substitution; and $\bar\rho$ is the edge appearance per edge orbit: for every edge orbit $\bar{e}$ and every edge $e \in \bar{e}$, $\bar\rho_{\bar e} = \rho_e$. Using the alternative but equivalent form $B^* = -\sum_{v \in V} \big(1 - \sum_{e \in \text{Nb}(v)} \rho_e\big) H(\tau_v) - \sum_{e \in E} \rho_e H(\tau_e)$, we obtain the following explicit form:

$$\bar{B}^*(\bar\tau, \bar\rho) = -\sum_{\bar{v} \in \bar{V}} \Big( |\bar{v}| - \sum_{\bar{e} \in N(\bar{v})} |\bar{e}| \, d(\bar{v}, \bar{e}) \, \bar\rho_{\bar{e}} \Big) H(\tau_{\bar{v}}) - \sum_{\bar{e} \in \bar{E}} |\bar{e}| \, \bar\rho_{\bar{e}} H(\tau_{\bar{e}}) \quad (7)$$

Intuitively, the above can be viewed as a combination of node and edge entropies defined on the nodes and edges of the lifted graph $\bar{G}$. Nodes of $\bar{G}$ are the node orbits of G, while its edges are the edge orbits of G. $\bar{G}$ is not a simple graph: it can have self-loops or multi-edges between the same node pair (see Fig. 1). We encode the incidence structure of this graph as follows: $d(\bar{v}, \bar{e}) = 0$ if $\bar{v}$ is not incident to $\bar{e}$; $d(\bar{v}, \bar{e}) = 1$ if $\bar{v}$ is incident to $\bar{e}$ and $\bar{e}$ is not a loop; and $d(\bar{v}, \bar{e}) = 2$ if $\bar{e}$ is a loop incident to $\bar{v}$. The entropy at the node orbit $\bar{v}$ is defined as

$$H(\tau_{\bar{v}}) = -\sum_{t \in \mathcal{X}} \tau_{\bar{v}:t} \ln(\tau_{\bar{v}:t})$$

and the entropy at the edge orbit $\bar{e}$ is

$$H(\tau_{\bar{e}}) = -\sum_{t,h \in \mathcal{X}} \tau_{\overline{\{e_1:t,\, e_2:h\}}} \ln\big(\tau_{\overline{\{e_1:t,\, e_2:h\}}}\big)$$

where $\{e_1, e_2\}$, $e_1, e_2 \in V$, is a representative (any element) of $\bar{e}$, $\{e_1{:}t, e_2{:}h\}$ is an assignment of the ground edge $\{e_1, e_2\}$, and $\overline{\{e_1:t,\, e_2:h\}}$ is the assignment orbit. As in [3], we write $\overline{\{e_1:t,\, e_2:t\}}$ as $\bar{e}{:}t$ and, for t < h, $\overline{\{e_1:t,\, e_2:h\}}$ as $\bar{a}{:}(t, h)$, where $\bar{a}$ is the arc orbit of $(e_1, e_2)$.

When OUTER is the local or cycle polytope, the constraints $D\bar\tau \in \text{OUTER}$ yield the lifted local (or cycle) polytope, respectively. For these constraints, we use the same form given in [3]. In Section 6, we describe a set of additional constraints for further tightening when some clusters of nodes are exchangeable.

Example. Consider the MRF shown in Fig. 1 (left) with 10 binary variables that we denote $B_i$ (for the blue nodes) and $R_i$ (for the red nodes). The node and edge coloring denotes shared parameters. Let $\theta_b$ and $\theta_r$ be the single-node potentials used for the blue and red nodes, respectively. Let $\theta_{re}$ be the edge potential used for the red edges connecting the blue and red nodes, $\theta_{be}$ the edge potential used for the blue edges $(B_i, B_{i+1})$, and $\theta_{ke}$ the edge potential used for the black edges $(B_i, B_{i+2})$.

Figure 1: Left: ground graphical model; same-colored nodes and edges have the same parameters. Right: lifted graph showing 2 node orbits (b and r) and 3 edge orbits. The numbers on the lifted graph are the incidence degrees $d(\bar{v}, \bar{e})$ between an edge orbit and a node orbit.

There are two node orbits: $\mathbf{b} = \{B_1, \dots, B_5\}$ and $\mathbf{r} = \{R_1, \dots, R_5\}$. There are three edge orbits: $\mathbf{re}$ for the red edges, $\mathbf{be}$ for the blue edges, and $\mathbf{ke}$ for the black edges. The sizes of the node and edge orbits are all 5 (e.g., $|\mathbf{b}| = |\mathbf{be}| = 5$), and $d(\mathbf{b}, \mathbf{re}) = d(\mathbf{r}, \mathbf{re}) = 1$, $d(\mathbf{b}, \mathbf{be}) = d(\mathbf{b}, \mathbf{ke}) = 2$. Suppose that ρ corresponds to the uniform distribution over spanning trees, which satisfies the symmetry needed by Theorem 2. We then have $\bar\rho_{\mathbf{re}} = 1$ and $\bar\rho_{\mathbf{be}} = \bar\rho_{\mathbf{ke}} = \frac{2}{5}$. Putting all of this together, the lifted TRW entropy is given by $\bar{B}^*(\bar\tau, \bar\rho) = 8H(\tau_{\mathbf{b}}) - 5H(\tau_{\mathbf{re}}) - 2H(\tau_{\mathbf{be}}) - 2H(\tau_{\mathbf{ke}})$. We illustrate the expansion of the entropy of the red edge orbit, $H(\tau_{\mathbf{re}})$. This edge orbit has 2 corresponding arc orbits: $\mathbf{rba} = \{(R_i, B_i)\}$ and $\mathbf{bra} = \{(B_i, R_i)\}$. Thus, $H(\tau_{\mathbf{re}}) = -\tau_{\mathbf{re}:00} \ln \tau_{\mathbf{re}:00} - \tau_{\mathbf{re}:11} \ln \tau_{\mathbf{re}:11} - \tau_{\mathbf{rba}:01} \ln \tau_{\mathbf{rba}:01} - \tau_{\mathbf{bra}:01} \ln \tau_{\mathbf{bra}:01}$.

Finally, the linear term in the objective is given by $\langle \bar\tau, \bar\theta \rangle = 5\langle \tau_{\mathbf{b}}, \theta_b \rangle + 5\langle \tau_{\mathbf{r}}, \theta_r \rangle + 5\langle \tau_{\mathbf{re}}, \theta_{re} \rangle + 5\langle \tau_{\mathbf{be}}, \theta_{be} \rangle + 5\langle \tau_{\mathbf{ke}}, \theta_{ke} \rangle$, where, as an example, $\langle \tau_{\mathbf{re}}, \theta_{re} \rangle = \tau_{\mathbf{re}:00}\theta_{re,00} + \tau_{\mathbf{re}:11}\theta_{re,11} + \tau_{\mathbf{bra}:01}\theta_{re,01} + \tau_{\mathbf{rba}:01}\theta_{re,10}$.
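The orbit-level coefficients in this example follow mechanically from Eq. (7). A small sketch of ours, with the orbit data hard-coded from the example above, recovers them:

```python
# Orbit-level data for the Fig. 1 example: two node orbits (b, r) and three
# edge orbits (re, be, ke), as given in the text.
node_orbits = {"b": 5, "r": 5}                  # |v| for each node orbit
edge_orbits = {"re": 5, "be": 5, "ke": 5}       # |e| for each edge orbit
rho = {"re": 1.0, "be": 0.4, "ke": 0.4}         # lifted edge appearances
# Incidence degrees d(v, e); d = 2 when the edge orbit is a loop at the node
# orbit in the lifted graph.
d = {("b", "re"): 1, ("r", "re"): 1, ("b", "be"): 2, ("b", "ke"): 2}

# Coefficient of H(tau_v) in Eq. (7): -( |v| - sum_e |e| d(v,e) rho_e ).
for v, size in node_orbits.items():
    coef = sum(edge_orbits[e] * d[(v, e)] * rho[e]
               for e in edge_orbits if (v, e) in d) - size
    print(f"H(tau_{v}) coefficient: {coef:+g}")
# Coefficient of H(tau_e) in Eq. (7): -|e| rho_e.
for e, size in edge_orbits.items():
    print(f"H(tau_{e}) coefficient: {-size * rho[e]:+g}")
# Output matches B*bar = 8 H(tau_b) + 0 H(tau_r)
#                        - 5 H(tau_re) - 2 H(tau_be) - 2 H(tau_ke).
```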

4.3 Optimization using Frank-Wolfe

What remains is to describe how to optimize Eq. 6. Ourlifted tree-reweighted algorithm is based on Frank-Wolfe,also known as the conditional gradient method [7, 11].First, we initialize with a pseudo-marginal vector corre-sponding to the uniform distribution, which is guaranteedto be in the lifted marginal polytope. Next, we solve the lin-ear program whose objective is given by the gradient of theobjective Eq. 6 evaluated at the current point, and whoseconstraint set is OUTER. When using the lifted cycle re-laxation, we solve this linear program using a cutting-planealgorithm [3, 23]. We then perform a line search to find theoptimal step size using a golden section search (a type of bi-nary search that finds the maxima of a unimodal function),and finally repeat using the new pseudo-marginal vector.We warm start each linear program using the optimal basisfound in the previous run, which makes the LP solves ex-

Page 5: Lifted Tree-Reweighted Variational Inferencepeople.csail.mit.edu/dsontag/papers/BuiHuySon_uai14.pdf · each direction-finding step solves a maximum spanning tree problem. Several

tremely fast after the first couple of iterations. Although weuse a generic LP solver in our experiments, it is also possi-ble to use dual decomposition to derive efficient algorithmsspecialized to graphical models [24].
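The loop structure is generic Frank-Wolfe over a polytope accessed only through an LP oracle. The sketch below is ours, not the paper's code: a toy quadratic objective over the probability simplex stands in for the lifted TRW objective and OUTER, and `golden_section`, `frank_wolfe`, and the toy problem are all illustrative names:

```python
import numpy as np

def golden_section(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    inv_phi = (5 ** 0.5 - 1) / 2
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

def frank_wolfe(obj, grad, lp_oracle, x0, iters=1000, tol=1e-4):
    """Maximize a concave obj over a polytope accessed via an LP oracle
    lp_oracle(g) = argmax_{s in polytope} <g, s>."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        s = lp_oracle(g)                    # direction-finding step (one LP)
        if g @ (s - x) < tol:               # Frank-Wolfe duality gap
            break
        step = golden_section(lambda a: -obj(x + a * (s - x)), 0.0, 1.0)
        x = x + step * (s - x)              # stays in the polytope by convexity
    return x

# Toy stand-in: maximize -||x - c||^2 over the simplex; the LP oracle
# returns the best vertex (a one-hot vector).
c = np.array([0.2, 0.5, 0.3])
obj = lambda x: -np.sum((x - c) ** 2)
grad = lambda x: -2 * (x - c)
lp_oracle = lambda g: np.eye(3)[np.argmax(g)]
print(frank_wolfe(obj, grad, lp_oracle, np.ones(3) / 3))  # converges toward c
```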

5 Lifted Maximum Spanning Tree

Optimizing the TRW edge appearance probabilities ρ requires finding the maximum spanning tree (MST) in the ground graphical model. For lifted TRW, we need to perform MST computations using only information from the node and edge orbits, without referring to the ground graph. In this section, we present a lifted MST algorithm for symmetric graphs which works at the orbit level.

Suppose that we are given a weighted graph (G, w), its automorphism group $\mathbb{A} = \text{Aut}(G)$, and its node and edge orbits. We aim to derive an algorithm analogous to Kruskal's algorithm, but with complexity depending only on the number of node and edge orbits of G. However, if the algorithm has to return an actual spanning tree of G, then clearly its complexity cannot be less than O(|V|). Instead, we consider an equivalent problem: solving a linear program on the spanning tree polytope,

$$\sup_{\rho \in \mathbb{T}(G)} \langle \rho, w \rangle \quad (8)$$

The same mechanism for lifting convex optimization problems (Lemma 1 in [3]) applies to this problem. Let $\varphi_E$ be the edge orbit partition; then an equivalent lifted problem is

$$\sup_{\bar\rho \in \mathbb{T}[\varphi_E]} \langle \bar\rho, \bar{w} \rangle \quad (9)$$

Since $\rho_e$ is constrained to be the same for edges in the same orbit, it is now possible to solve (9) with complexity depending only on the number of orbits. Any solution ρ of the LP (8) can be turned into a solution $\bar\rho$ of (9) by letting $\bar\rho(\bar{e}) = \frac{1}{|\bar{e}|} \sum_{e' \in \bar{e}} \rho(e')$.
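The orbit-averaging map at the end of this paragraph is one line of code. A minimal sketch (names and example data are ours):

```python
def symmetrize(rho, edge_orbits):
    """Turn a ground solution rho of LP (8) into a lifted solution of (9):
    average the edge appearance probabilities over each edge orbit."""
    return {name: sum(rho[e] for e in orbit) / len(orbit)
            for name, orbit in edge_orbits.items()}

# Illustrative: two edge orbits of a 4-cycle and one ground MST solution.
rho = {(0, 1): 1.0, (2, 3): 1.0, (1, 2): 1.0, (3, 0): 0.0}
orbits = {"horizontal": [(0, 1), (2, 3)], "vertical": [(1, 2), (3, 0)]}
print(symmetrize(rho, orbits))  # {'horizontal': 1.0, 'vertical': 0.5}
```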

5.1 Lifted Kruskal’s Algorithm

Kruskal's algorithm first sorts the edges in order of decreasing weight. Then, starting from an empty graph, at each step it greedily attempts to add the next edge while maintaining the property that the used edges form a forest (containing no cycle). The forest obtained at the end of this algorithm is a maximum-weight spanning tree.

Imagine how Kruskal's algorithm would operate on a weighted graph G with non-trivial automorphisms. Let $\bar{e}_1, \dots, \bar{e}_k$ be the list of edge orbits sorted in order of decreasing weight (the weights w of all edges in the same orbit by definition must be the same). The main question therefore is how many edges in each edge orbit $\bar{e}_i$ will be added to the spanning tree by Kruskal's algorithm. Let $G_i$ be the subgraph of G formed by the set of all edges and nodes in $\bar{e}_1, \dots, \bar{e}_i$. Let V(G) and C(G) denote the set of nodes and the set of connected components of a graph, respectively. Then (see the supplementary material for the proof):

Lemma 4. The number of edges in $\bar{e}_i$ appearing in the MST found by Kruskal's algorithm is $\delta_V^{(i)} - \delta_C^{(i)}$, where $\delta_V^{(i)} = |V(G_i)| - |V(G_{i-1})|$ and $\delta_C^{(i)} = |C(G_i)| - |C(G_{i-1})|$. Thus a solution of the linear program (9) is

$$\bar\rho(\bar{e}_i) = \frac{\delta_V^{(i)} - \delta_C^{(i)}}{|\bar{e}_i|}$$

5.2 Lifted Counting of the Number of Connected Components

We note that counting the number of nodes can be done simply by adding the sizes of the node orbits. The remaining difficulty is how to count the number of connected components of a given graph⁴ G using only information at the orbit level. Let $\bar{G}$ be the lifted graph of G. Then (see the supplementary material for the proof):

Lemma 5. If $\bar{G}$ is connected, then all connected components of G are isomorphic. Thus, if furthermore G′ is a connected component of G, then $|C(G)| = |V(G)| / |V(G')|$.

To find just one connected component, we can choose an arbitrary node u and compute $\bar{G}[u]$, the lifted graph fixing u (see Section 8.1 in [3]), then search for the connected component in $\bar{G}[u]$ that contains {u}. Finally, if $\bar{G}$ is not connected, we simply apply Lemma 5 to each connected component of $\bar{G}$.

The final lifted Kruskal's algorithm combines Lemmas 4 and 5 while keeping track of the set of connected components of $G_i$ incrementally. The full algorithm is given in the supplementary material.

⁴ Since we are only interested in connectivity in this subsection, the weights of G play no role. Thus, the orbits in this subsection can also be generated by the automorphism group of the unweighted version of G.
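To illustrate Lemma 4 concretely, the sketch below computes $\bar\rho(\bar{e}_i)$ by tracking $|V(G_i)|$ and $|C(G_i)|$ with a union-find over edge-orbit prefixes. For clarity it touches the ground edges; the paper's full algorithm instead obtains these counts at the orbit level via Lemma 5. The example graph is the one from Fig. 1:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
    def add(self, v):
        self.parent.setdefault(v, v)
    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v
    def union(self, u, v):
        self.parent[self.find(u)] = self.find(v)

def lifted_mst_weights(edge_orbits):
    """edge_orbits: list of edge lists, sorted by decreasing orbit weight.
    Returns rho_bar(e_i) = (delta_V - delta_C) / |e_i| per Lemma 4."""
    uf, rho = UnionFind(), []
    n_nodes = n_comps = 0           # |V(G_i)| and |C(G_i)| so far
    prev_nodes = prev_comps = 0
    for orbit in edge_orbits:
        for u, v in orbit:
            for w in (u, v):
                if w not in uf.parent:
                    uf.add(w); n_nodes += 1; n_comps += 1
            if uf.find(u) != uf.find(v):
                uf.union(u, v); n_comps -= 1
        delta_v, delta_c = n_nodes - prev_nodes, n_comps - prev_comps
        rho.append((delta_v - delta_c) / len(orbit))
        prev_nodes, prev_comps = n_nodes, n_comps
    return rho

# Fig. 1 graph: red bridge edges, then the blue 5-cycle, then black chords.
red   = [(f"B{i}", f"R{i}") for i in range(5)]
blue  = [(f"B{i}", f"B{(i + 1) % 5}") for i in range(5)]
black = [(f"B{i}", f"B{(i + 2) % 5}") for i in range(5)]
print(lifted_mst_weights([red, blue, black]))  # -> [1.0, 0.8, 0.0]
```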

6 Tightening via Exchangeable Polytope Constraints

One type of symmetry often found in first-order probabilistic models is large sets of exchangeable random variables. In certain cases, exact inference with exchangeable variables is possible via lifted counting elimination and its generalization [17, 2]. The drawback of these exact methods is that they do not apply to many models (e.g., those with transitive clauses). Lifted variational inference methods do not have this drawback; however, the local and cycle relaxations can be shown to be loose in the exchangeable setting, a potentially serious limitation compared to earlier work.

To remedy this situation, we now show how to take advantage of highly symmetric subsets of variables to tighten the relaxation of the lifted marginal polytope.

We call a set of random variables χ an exchangeable cluster iff χ can be arbitrarily permuted while preserving the probability model. Mathematically, the lifting group $\mathbb{A}_\Delta$ acts on χ, and the image of the action is isomorphic to S(χ), the symmetric group on χ. The distribution of the random variables in χ is also exchangeable in the usual sense.

Our method for tightening the relaxation of the marginal polytope is based on lift-and-project, wherein we introduce auxiliary variables specifying the joint distribution of a large cluster of variables and then enforce consistency between the cluster distribution and the pseudo-marginal vector [20, 24, 27]. In the ground model, one typically works with small clusters (e.g., triplets) because the number of variables grows exponentially with the cluster size. The key (and nice) difference in the lifted case is that we can make use of very large clusters of highly symmetric variables: while the grounded relaxation would clearly blow up, the corresponding lifted relaxation can still remain compact.

Specifically, for an exchangeable cluster χ of arbitrary size, one can add cluster consistency constraints for the entire cluster and still maintain tractability. To keep the exposition simple, we assume that the variables are binary. Let C denote a χ-configuration, i.e., a function C : χ → {0, 1}. The set $\{\tau_{\chi C} \mid \forall \text{ configurations } C\}$ is the collection of χ-cluster auxiliary variables. Since χ is exchangeable, all nodes in χ belong to the same node orbit, which we call v(χ). Similarly, e(χ) and a(χ) denote the single edge and arc orbit that contain all edges and arcs in χ, respectively. Let $u_1, u_2$ be two distinct nodes in χ. To enforce consistency between the cluster χ and the edge $\{u_1, u_2\}$ in the ground model, we introduce the constraints

$$\exists \tau_\chi : \sum_{C \text{ s.t. } C(u_i) = s_i} \tau_{\chi C} = \tau_{u_1:s_1,\, u_2:s_2} \quad \forall s_i \in \{0, 1\} \quad (10)$$

These constraints correspond to using intersection sets of size two, which can be shown to give the exact characterization of the marginal polytope involving variables in χ if the graphical model has only pairwise potentials. If higher-order potentials are present, a tighter relaxation could be obtained by using larger intersection sets together with the techniques described below.

The constraints in (10) can be methodically lifted by replacing occurrences of ground variables with lifted variables at the orbit level. First, observe that in place of the ground variables $\tau_{u_1:s_1,\, u_2:s_2}$, the lifted local relaxation has three corresponding lifted variables: $\tau_{e(\chi):00}$, $\tau_{e(\chi):11}$, and $\tau_{a(\chi):01}$. Second, we consider the orbits of the set of configurations C. Since χ is exchangeable, there can be only |χ| + 1 χ-configuration orbits; each orbit contains all configurations with precisely k ones, for k = 0 ... |χ|. Thus, instead of the $2^{|\chi|}$ ground auxiliary variables, we only need |χ| + 1 lifted cluster variables. Further manipulation leads to the following set of constraints, which we call the lifted exchangeable polytope constraints.

Theorem 6. Let χ be an exchangeable cluster of size n; let e(χ) and a(χ) be the single edge and arc orbit of the graphical model that contain all edges and arcs in χ, respectively; and let $\bar\tau$ be the lifted marginals. Then there exist $c_k^\chi \geq 0$, k = 0 ... n, such that

$$\sum_{k=0}^{n-2} \frac{(n-k)(n-k-1)}{n(n-1)}\, c_k^\chi = \tau_{e(\chi):00}$$

$$\sum_{k=0}^{n-2} \frac{(k+1)(k+2)}{n(n-1)}\, c_{k+2}^\chi = \tau_{e(\chi):11}$$

$$\sum_{k=0}^{n-2} \frac{(n-k-1)(k+1)}{n(n-1)}\, c_{k+1}^\chi = \tau_{a(\chi):01}$$

Proof. See the supplementary material.

In contrast to the lifted local and cycle relaxations, the number of variables and constraints in the lifted exchangeable relaxation depends linearly on the domain size of the first-order model. From the lifted local constraints given by [3], $\tau_{e(\chi):00} + \tau_{e(\chi):11} + 2\tau_{a(\chi):01} = 1$. Substituting in the expressions involving $c_k^\chi$, we get $\sum_{k=0}^{n} c_k^\chi = 1$. Intuitively, $c_k^\chi$ represents the approximation of the marginal probability $\Pr(\sum_{i \in \chi} x_i = k)$ of having precisely k ones in χ.
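The coefficients in Theorem 6 are exactly the pairwise moments of an exchangeable cluster expressed through its count distribution. The sketch below (our own code) builds this map and checks it on an i.i.d. Bernoulli cluster, whose counts are binomial and whose pairwise marginals must therefore factor:

```python
import numpy as np
from math import comb

def pairwise_from_counts(c):
    """Map count variables c_k ~ Pr(sum_i x_i = k), k = 0..n, of an
    exchangeable binary cluster to the lifted pairwise marginals
    (tau_e:00, tau_e:11, tau_a:01) as in Theorem 6."""
    n = len(c) - 1
    k = np.arange(n + 1)
    denom = n * (n - 1)
    tau00 = np.sum((n - k) * (n - k - 1) / denom * c)  # both variables 0
    tau11 = np.sum(k * (k - 1) / denom * c)            # both variables 1
    tau01 = np.sum(k * (n - k) / denom * c)            # one arc direction
    return tau00, tau11, tau01

# n i.i.d. Bernoulli(p) variables form an exchangeable cluster with
# binomial counts, so the recovered pairwise marginals must factor.
n, p = 4, 0.3
c = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
t00, t11, t01 = pairwise_from_counts(c)
assert np.isclose(t00, (1 - p)**2) and np.isclose(t11, p**2)
print(t00, t11, t01, t00 + t11 + 2 * t01)  # last value is 1
```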

As proved by [2], the groundings of a unary predicate in a Markov Logic Network (MLN) give rise to an exchangeable cluster. Thus, for MLNs, the above theorem immediately suggests a tightening of the relaxation: for every unary predicate of an MLN, add a new set of constraints as above to the existing lifted local (or cycle) optimization problem. Although it is not the focus of our paper, we note that this should also improve the lifted MAP inference results of [3]. For example, in the case of a symmetric complete graphical model, lifted MAP inference using the linear program given by these new constraints would find the exact k that maximizes $\Pr(x_\chi)$, hence recovering the same solution as counting elimination. Marginal inference may still be inexact due to the tree-reweighted entropy approximation. We re-emphasize that the complexity of variational inference with lifted exchangeable constraints is guaranteed to be polynomial in the domain size, unlike exact methods based on lifted counting elimination and variable elimination.

7 Experiments

In this section, we provide an empirical evaluation of our lifted tree-reweighted (LTRW) algorithm. As a baseline, we use a dampened version of the lifted belief propagation (LBP-Dampening) algorithm from [21]. Our lifted algorithm has all of the advantages of the tree-reweighted approach over belief propagation, which we illustrate in the results: (1) a convex objective that can be convergently solved to optimality, (2) upper bounds on the partition function, and (3) the ability to easily improve the approximation by tightening the relaxation. Our evaluation includes four variants of the LTRW algorithm corresponding to different outer bounds: lifted local polytope (LTRW-L), lifted cycle polytope (LTRW-C), lifted local polytope with exchangeable polytope constraints (LTRW-LE), and lifted cycle polytope with exchangeable constraints (LTRW-CE). The conditional gradient optimization of the lifted TRW objective terminates when the duality gap is less than $10^{-4}$ or when a maximum of 1000 iterations is reached. To solve the LP problems arising during conditional gradient, we use Gurobi⁵.

Figure 2: An example of the ground graphical model for the Clique-Cycle model (domain size = 3).

⁵ http://www.gurobi.com/

We evaluate all the algorithms on several first-order probabilistic models. We assume that no evidence has been observed, which results in a large amount of symmetry. Even without evidence, performing marginal inference in first-order probabilistic models can be very useful for maximum likelihood learning [13]. Furthermore, the fact that our lifted tree-reweighted variational approximation provides an upper bound on the partition function enables us to maximize a lower bound on the likelihood [29], which we demonstrate in Sec. 7.5. To find the lifted orbit partition, we use the renaming group as in [3], which exploits the symmetry of the unobserved constants in the model.

Rather than optimize over the spanning tree polytope, which is computationally intensive, most TRW implementations use a single fixed choice of edge appearance probabilities, e.g. an (un)weighted distribution obtained using the matrix-tree theorem. In these experiments, we initialize the lifted edge appearance probabilities $\bar\rho$ to the most uniform per-orbit edge appearance probabilities by solving the optimization problem $\inf_{\bar\rho \in \mathbb{T}[\varphi_E]} \sum_{e \in E} \big(\rho_e - \frac{|V|-1}{|E|}\big)^2$ using conditional gradient. Each direction-finding step of this conditional gradient solves a lifted MST problem of the form $\sup_{\bar\rho' \in \mathbb{T}[\varphi_E]} \big\langle -2\big(\bar\rho - \frac{|V|-1}{|E|}\big), \bar\rho' \big\rangle$ using our lifted Kruskal's algorithm, where $\bar\rho$ is the current solution. After this initialization, we fix the lifted edge appearance probabilities and do not attempt to optimize them further.

7.1 Test models

Fig. 3 describes the four test models in MLN syntax. We focus on the repulsive case, since for attractive models all TRW variants and lifted LBP give similar results. The parameter W denotes the weight that is varied during the experiments. In all models except Clique-Cycle, W acts like the "local field" potential in an Ising model; a negative (or positive) value of W means the corresponding variable tends to be in the 0 (or 1) state. Complete-Graph is equivalent to an Ising model on the complete graph of size n (the domain size) with homogeneous parameters. Exact marginals and the log-partition function can be computed in closed form using lifted counting elimination. The weight of the interaction clause is set to −0.1 (repulsive). Friends-Smokers (negated) is a variant of the Friends-Smokers model [21] where the weight of the final clause is set to −1.1 (repulsive). We use the method in [2] to compute the exact marginal for the Cancer predicate and the exact value of the log-partition function. Lovers-Smokers is the same MLN used in [3], with a full transitivity clause, and we vary the prior of the Loves predicate. Clique-Cycle is a model with 3 cliques and 3 bipartite graphs in between. Its corresponding ground graphical model is shown in Fig. 2.

7.2 Accuracy of Marginals

Fig. 4 shows the marginals computed by all the algorithms, as well as the exact marginals, on the Complete-Graph and Friends-Smokers models. We do not know how to efficiently perform exact inference in the remaining two models, and thus do not measure accuracy for them. The result on complete graphs illustrates the clear benefit of tightening the relaxation: LTRW-Local and LBP are inaccurate for moderate W, whereas cycle constraints and, especially, exchangeable constraints drastically improve accuracy. As discussed earlier, for the case of symmetric complete graphical models, the exchangeable constraints suffice to exactly characterize the marginal polytope. As a result, the approximate marginals computed by LTRW-LE and LTRW-CE are almost the same as the exact marginals; the very small difference is due to the entropy approximation. On the Friends-Smokers (negated) model, all LTRW variants give accurate marginals, while lifted LBP, even with very strong dampening (0.9 weight given to the previous iterations' messages), fails to converge for W < 2. We observed that LTRW-LE gives the best trade-off between accuracy and running time for this model. Note that we do not compare to ground versions of the lifted TRW algorithms because, by Theorem 3, the marginals and log-partition function are the same for both.

7.3 Quality of Log-Partition Upper bounds

Fig. 5 plots the values of the upper bounds obtained by the LTRW algorithms on the four test models. The results clearly show the benefit of adding each type of constraint to the LTRW, with the best upper bound obtained by tightening the lifted local polytope with both lifted cycle and exchangeable constraints. For the Complete-Graph and Friends-Smokers models, the log-partition approximation using exchangeable polytope constraints is very close to exact. In addition, we illustrate lifted LBP's approximation of the log-partition function on the Complete-Graph (note it is non-convex and not an upper bound).

Page 8: Lifted Tree-Reweighted Variational Inferencepeople.csail.mit.edu/dsontag/papers/BuiHuySon_uai14.pdf · each direction-finding step solves a maximum spanning tree problem. Several

Complete Graph:
    W     V(x)
    −0.1  [x ≠ y ∧ (V(x) ⇔ V(y))]

Friends-Smokers (Negated):
    W     [x ≠ y ∧ ¬Friends(x, y)]
    1.4   ¬Smokes(x)
    2.3   ¬Cancer(x)
    1.5   Smokes(x) ⇒ Cancer(x)
    −1.1  [x ≠ y ∧ Smokes(x) ∧ Friends(x, y) ⇒ Smokes(y)]

Lovers-Smokers:
    W     [x ≠ y ∧ Loves(x, y)]
    100   Male(x) ⇔ ¬Female(x)
    2     Male(x) ∧ Smokes(x)
    1     Female(x) ∧ Smokes(x)
    0.5   [x ≠ y ∧ Male(x) ∧ Female(y) ∧ Loves(x, y)]
    1     [x ≠ y ∧ Loves(x, y) ∧ (Smokes(x) ⇔ Smokes(y))]
    −100  [x ≠ y ∧ y ≠ z ∧ z ≠ x ∧ Loves(x, y) ∧ Loves(y, z) ∧ Loves(x, z)]

Clique-Cycle:
    W     x ≠ y ∧ (Q1(x) ⇔ ¬Q2(y))
    W     x ≠ y ∧ (Q2(x) ⇔ ¬Q3(y))
    W     x ≠ y ∧ (Q3(x) ⇔ ¬Q1(y))
    −W    x ≠ y ∧ (Q1(x) ⇔ Q1(y))
    −W    x ≠ y ∧ (Q2(x) ⇔ Q2(y))
    −W    x ≠ y ∧ (Q3(x) ⇔ Q3(y))

Figure 3: Test models (clause weights on the left, clauses in MLN syntax on the right).

[Plot: four panels of log-partition upper bound vs. parameter W — Complete Graph (domain size = 100; curves: Exact, LBP-Dampening, LTRW-L, LTRW-LE, LTRW-C, LTRW-CE), Friends-Smokers (domain size = 200; curves: Exact, LTRW-L, LTRW-LE, LTRW-C, LTRW-CE), Clique-Cycle Graph (domain size = 100; LTRW variants only), and Lovers-Smokers (domain size = 200; LTRW variants only).]

Figure 5: Approximations of the log-partition function on the four test models from Fig. 3 (best viewed in color).

[Plot: two panels of marginal probability vs. parameter W.]

Figure 4: Left: marginal accuracy for the complete graph model (domain size = 100). Right: marginal accuracy for Pr(Cancer(x)) in Friends-Smokers (negated, domain size = 200). Lifted TRW variants using different outer bounds: L = local, C = cycle, LE = local+exchangeable, CE = cycle+exchangeable (best viewed in color).

7.4 Running time

As shown in Table 1, the lifted variants of TRW are orders of magnitude faster than the ground version. Interestingly, lifted TRW with local constraints is observed to get faster as the domain size increases; this is probably due to the fact that as the domain size increases, the distribution becomes more peaked, so marginal inference becomes more similar to MAP inference. Lifted TRW with local and exchangeable constraints requires a smaller number of conditional gradient iterations and is thus faster; note, however, that its running time slowly increases, since the exchangeable constraint set grows linearly with the domain size.

Domain size   10      20      30       100   200
TRW-L         138370  609502  1525140  -     -
LTRW-L        3255    3581    3438     1626  1416
LTRW-LE       681     703     721      1033  1307

Table 1: Ground vs. lifted TRW runtime on Complete-Graph (milliseconds).

LBP's lack of convergence makes it difficult to make a meaningful timing comparison with LBP. For example, LBP did not converge for about half of the values of W in the Lovers-Smokers model, even after using very strong dampening. We did observe that when LBP converges, it is much faster than LTRW. We hypothesize that this is due to the message passing nature of LBP, which is based on a fixed point update, whereas our algorithm is based on Frank-Wolfe.

7.5 Application to Learning

We now describe an application of our algorithm to the task of learning relational Markov networks for inferring protein-protein interactions from noisy, high-throughput experimental assays [12]. This is equivalent to learning the parameters of an exponential family random graph model [19] where edges in the random graph represent the protein-protein interactions. Despite fully observed data, maximum likelihood learning is challenging because of the intractability of computing the log-partition function and its gradient. In particular, this relational Markov network has over 330K random variables (one for each possible interaction among 813 proteins) and tertiary potentials. However, Jaimovich et al. [13] observed that the partition function in relational Markov networks is highly symmetric, and use lifted LBP to efficiently perform approximate learning in running time that is independent of the domain size. They use their lifted inference algorithm to visualize the (approximate) likelihood landscape for different values of the parameters, which among other uses characterizes the robustness of the model to parameter changes.

Figure 6: Log-likelihood lower bound obtained using lifted TRW with the cycle and exchangeable constraints (CE) for the same protein-protein interaction data used in [13] (left) (cf. Fig. 7 in [13]). Improvement in the lower bound after tightening the local constraints (L) with CE (right).

We use precisely the same procedure as [13], substituting our new lifted TRW algorithms for lifted BP. The model has three parameters: θ₁, used in the single-node potential to specify the prior probability of a protein-protein interaction; θ₁₁₁, part of the tertiary potentials, which encourages cliques of three interacting proteins; and θ₀₁₁, also part of the tertiary potentials, which encourages chain-like structures where proteins A, B interact and B, C interact, but A and C do not (see the supplementary material for the full model specification as an MLN). We follow their two-step estimation procedure, first estimating θ₁ in the absence of the other parameters (the maximum likelihood, BP, and TRW estimates of this parameter coincide, and estimation can be performed in closed form: $\theta_1^* = -5.293$). Next, for each setting of θ₁₁₁ and θ₀₁₁ we estimate the log-partition function using lifted TRW with the cycle+exchangeable vs. local constraints only. Since TRW gives an upper bound on the log-partition function, these estimates provide lower bounds on the likelihood.
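The likelihood bound used here is immediate from Eq. (1): for fully observed data $\hat{x}$, $\ell(\theta) = \langle \Phi(\hat{x}), \theta \rangle - A(\theta) \geq \langle \Phi(\hat{x}), \theta \rangle - B(\theta, \rho)$. Below is a sketch of the landscape sweep; all function names, including the `lifted_trw_logZ` stand-in, are hypothetical placeholders, not the paper's code:

```python
import numpy as np

def loglik_lower_bound(suff_stats, theta, logZ_upper):
    """l(theta) = <Phi(x), theta> - A(theta) >= <Phi(x), theta> - B(theta, rho),
    valid because TRW upper-bounds the log-partition function."""
    return suff_stats @ theta - logZ_upper(theta)

def landscape(suff_stats, theta1, grid_011, grid_111, lifted_trw_logZ):
    """Evaluate the likelihood lower bound over a (theta_011, theta_111) grid,
    with theta_1 held fixed, as in the two-step procedure of Sec. 7.5."""
    L = np.empty((len(grid_011), len(grid_111)))
    for i, t011 in enumerate(grid_011):
        for j, t111 in enumerate(grid_111):
            theta = np.array([theta1, t011, t111])
            L[i, j] = loglik_lower_bound(suff_stats, theta, lifted_trw_logZ)
    return L

# Demo with a dummy convex stand-in for the lifted TRW bound B(theta, rho):
dummy_logZ = lambda th: float(np.logaddexp(0.0, th.sum()))
stats = np.array([1.0, 2.0, 0.5])  # illustrative sufficient statistics
print(landscape(stats, -5.293, [0.0, 0.01], [0.0, 0.02], dummy_logZ))
```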

Our results are shown in Fig. 6 and should be compared to Fig. 7 of [13]. The overall shapes of the likelihood landscapes are similar. However, the lifted LBP estimates of the likelihood have several local optima, which cause gradient-based learning with lifted LBP to reach different solutions depending on the initial setting of the parameters. In contrast, since TRW is convex, any gradient-based procedure would reach the global optimum, and thus learning is much easier. Interestingly, we see that our estimates of the likelihood have a significantly smaller range over these parameter settings than those estimated by lifted LBP. For the lifted TRW algorithms there is a sudden decrease in the likelihood at θ₀₁₁ > 0 and θ₁₁₁ > 0.02 (so as not to affect the scale, we show this as white).

8 Discussion and Conclusion

Lifting partitions used by lifted and counting BP [21, 16] can be coarser than orbit partitions. In graph-theoretic terms, these partitions are called equitable partitions. If each equitable partition cell is thought of as a distinct node color, then among nodes with the same color, their neighbors must have the same color histogram. It is known that orbit partitions are always equitable; however, the converse is not always true [9].

Since equitable partitions can be computed more efficiently and potentially lead to more compact lifted problems, the following question naturally arises: can we use equitable partitions in lifting the TRW problem? Unfortunately, a complete answer is non-trivial. We point out here a theoretical barrier due to the interplay between the spanning tree polytope and the equitable partition of a graph.

Let ε be the coarsest equitable partition of the edges of G. We give an example graph in the supplementary material (see Example 9) where the symmetrized spanning tree polytope corresponding to the equitable partition ε, $\mathbb{T}[\varepsilon] = \mathbb{T}(G) \cap \mathbb{R}^{|E|}_{[\varepsilon]}$, is an empty set. When $\mathbb{T}[\varepsilon]$ is empty, the consequence is that if we want ρ to be within $\mathbb{T}$, so that B(·, ρ) is guaranteed to be a convex upper bound on the log-partition function, we cannot restrict ρ to be consistent with the equitable partition. In lifted and counting BP, ρ ≡ 1, so it is clearly consistent with the equitable partition; however, one loses convexity and the upper bound guarantee as a result. This suggests that there might be a trade-off between the compactness of the lifting partition and the quality of the entropy approximation, a topic deserving the attention of future work.

In summary, we presented a formalization of lifted marginal inference as a convex optimization problem and showed that it can be efficiently solved using a Frank-Wolfe algorithm. Compared to previous lifted variational inference algorithms, in particular lifted belief propagation, our approach comes with convergence guarantees, upper bounds on the partition function, and the ability to improve the approximation (e.g., by introducing additional constraints) at the cost of a small additional running time.

A limitation of our lifting method is that as the amount of soft evidence (the number of distinct individual objects) approaches the domain size, the behavior of lifted inference approaches that of ground inference. The wide difference in running time between ground and lifted inference suggests that significant efficiency can be gained by solving an approximation of the original problem that is more symmetric [25, 15, 22, 6]. One of the most interesting open questions raised by our work is how to use the variational formulation to perform approximate lifting. Since our lifted TRW algorithm provides an upper bound on the partition function, it is possible that one could use the upper bound to guide the choice of approximation when deciding how to re-introduce symmetry into an inference task.

Acknowledgements: Work by DS was supported by the DARPA PPAML program under AFRL contract no. FA8750-14-C-0005.


References

[1] F. Barahona and A. R. Mahjoub. On the cut polytope. Mathematical Programming, 36:157–173, 1986.

[2] Hung Hai Bui, Tuyen N. Huynh, and Rodrigo de Salvo Braz. Lifted inference with distinct soft evidence on every object. In AAAI-2012, 2012.

[3] Hung Hai Bui, Tuyen N. Huynh, and Sebastian Riedel. Automorphism groups of graphical models and lifted variational inference. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI-2013. AUAI Press, 2013.

[4] Hung Hai Bui, Tuyen N. Huynh, and David Sontag. Lifted tree-reweighted variational inference. arXiv preprint, 2014. http://arxiv.org/abs/1406.4200.

[5] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), pages 1319–1325, 2005.

[6] Rodrigo de Salvo Braz, Sriraam Natarajan, Hung Bui, Jude Shavlik, and Stuart Russell. Anytime lifted belief propagation. In 6th International Workshop on Statistical Relational Learning (SRL 2009), 2009.

[7] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

[8] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In Uncertainty in Artificial Intelligence, 2007.

[9] Chris Godsil and Gordon Royle. Algebraic Graph Theory. Springer, 2001.

[10] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 256–265, 2011.

[11] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th ICML, volume 28, pages 427–435. JMLR Workshop and Conference Proceedings, 2013.

[12] Ariel Jaimovich, Gal Elidan, Hanah Margalit, and Nir Friedman. Towards an integrated protein-protein interaction network: a relational Markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.

[13] Ariel Jaimovich, Ofer Meshi, and Nir Friedman. Template based inference in symmetric relational Markov random fields. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pages 191–199. AUAI Press, 2007.

[14] Jeremy Jancsary and Gerald Matz. Convergent decomposition solvers for tree-reweighted free energies. Journal of Machine Learning Research - Proceedings Track, 15:388–398, 2011.

[15] K. Kersting, Y. El Massaoudi, B. Ahmadi, and F. Hadiji. Informed lifting for message-passing. In Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10). AAAI Press, 2010.

[16] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In Proceedings of the 25th Annual Conference on Uncertainty in AI (UAI '09), 2009.

[17] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pages 1062–1068, 2008.

[18] Mathias Niepert. Markov chains on orbits of permutation groups. In UAI-2012, 2012.

[19] Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p∗) models for social networks. Social Networks, 29(2):173–191, 2007.

[20] H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.

[21] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pages 1094–1099, 2008.

[22] Parag Singla, Aniruddh Nath, and Pedro Domingos. Approximate lifted belief propagation. In Workshop on Statistical Relational Artificial Intelligence (StaR-AI 2010), 2010.

[23] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.

[24] David Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proceedings of the 24th Annual Conference on Uncertainty in AI (UAI '08), 2008.

[25] Guy Van den Broeck and Adnan Darwiche. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems, pages 2868–2876, 2013.

[26] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335, 2005.

[27] Martin Wainwright and Michael Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers, 2008.

[28] Martin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:1120–1146, 2003.

[29] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.


Supplementary Materials for "Lifted Tree-Reweighted Variational Inference"

We present (1) proofs not given in the main paper, (2) full pseudo-code for the lifted Kruskal's algorithm for finding maximum spanning trees in symmetric graphs, and (3) additional details of the protein-protein interaction model.

Proof of Theorem 1.

Proof. The lifting group A∆ stabilizes both the objective function and the constraints of the convex optimization problem in the LHS of Eq. (4). The equality is then established using Lemma 1 in [3].

We state and prove a lemma about the symmetry of the bounds B and B∗ that will be used in subsequent proofs.

Lemma 7. Let π be an automorphism of the graphical model G. Then B∗(τ, ρ) = B∗(τπ, ρπ) and B(θ, ρ) = B(θπ, ρπ).

Proof. The intuition is that since the entropy bound B∗ is defined on the graph structure of the graphical model G, it inherits the symmetry of G. This can be verified by viewing the graph automorphism π as a bijection from nodes to nodes and edges to edges, so that B∗(τπ, ρπ) simply rearranges the summation inside B∗(τ, ρ):

B∗(τ, ρ) = −∑_{v∈V(G)} H(τ_v) + ∑_{e∈E(G)} I(τ_e) ρ_e
         = −∑_{v∈V(G)} H(τ_{π(v)}) + ∑_{e∈E(G)} I(τ_{π(e)}) ρ_{π(e)}
         = B∗(τπ, ρπ)

We now show that the same symmetry applies to the log-partition upper bound B. Let OUTER(G) be an outer bound of the marginal polytope M(G) such that M(G) ⊂ OUTER(G) ⊂ LOCAL(G). Note that π acts on and stabilizes OUTER, i.e., OUTERπ = OUTER. Thus

B(θ, ρ) = sup_{τ∈OUTER} ⟨θ, τ⟩ − B∗(τ, ρ)
        = sup_{τπ∈OUTER} ⟨θ, τ⟩ − B∗(τ, ρ)
        = sup_{τ∈OUTER} ⟨θ, τπ⁻¹⟩ − B∗(τπ⁻¹, ρ)
        = sup_{τ∈OUTER} ⟨θπ, τ⟩ − B∗(τ, ρπ)
        = B(θπ, ρπ)
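As an informal sanity check of Lemma 7 (our own illustration, not part of the original paper), the following Python snippet verifies B∗(τ, ρ) = B∗(τπ, ρπ) numerically on a 3-cycle whose rotation is a graph automorphism; the pseudomarginals and edge appearance probabilities below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

def H(p):                         # entropy of a node pseudomarginal
    return -np.sum(p * np.log(p))

def I(t):                         # mutual information of an edge pseudomarginal
    pu, pv = t.sum(1), t.sum(0)
    return np.sum(t * np.log(t / np.outer(pu, pv)))

# 3-cycle; pi(v) = (v+1) mod 3 is a graph automorphism
V = [0, 1, 2]
E = [(0, 1), (1, 2), (2, 0)]
pi_v = lambda v: (v + 1) % 3
pi_e = lambda e: ((e[0] + 1) % 3, (e[1] + 1) % 3)   # maps E onto E here

tau_v = {v: rng.dirichlet([1, 1]) for v in V}
tau_e = {e: rng.dirichlet([1, 1, 1, 1]).reshape(2, 2) for e in E}
rho = {(0, 1): 0.9, (1, 2): 0.8, (2, 0): 0.7}

def Bstar(tv, te, r):
    return -sum(H(tv[v]) for v in V) + sum(I(te[e]) * r[e] for e in E)

# (tau^pi)_v = tau_{pi(v)}, (tau^pi)_e = tau_{pi(e)}, (rho^pi)_e = rho_{pi(e)}
tv_pi = {v: tau_v[pi_v(v)] for v in V}
te_pi = {e: tau_e[pi_e(e)] for e in E}
rho_pi = {e: rho[pi_e(e)] for e in E}

print(np.isclose(Bstar(tau_v, tau_e, rho), Bstar(tv_pi, te_pi, rho_pi)))  # True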

Proof of Theorem 3.

Proof. We will show that the condition of Theorem 1 holds. Let us fix a ρ ∈ T[ϕE]. Then ρπ = ρ for all (π, γ) ∈ A∆.

Thus B∗(τπ, ρ) = B∗(τπ, ρπ). On the other hand, by Lemma 7, B∗(τπ, ρπ) = B∗(τ, ρ). Thus B∗(τπ, ρ) = B∗(τ, ρ). Note that in the case of the overcomplete representation, the group A∆ acts by the permuting action of π; thus, the TRW bound B∗(τ, ρ) (for fixed ρ ∈ T[ϕE]) is stabilized by the lifting group A∆.

Conditional Gradient (Frank-Wolfe) Algorithm for Lifted TRW

The pseudo-code is given in Algorithm 1. Step 2 essentially solves a lifted MAP problem, for which we used the same algorithms presented in [3] with Gurobi as the main linear programming engine. Step 3 solves a 1-D constrained convex problem via golden section search to find the optimal step size.

Algorithm 1 Conditional gradient for optimizing the lifted TRW problem
1: k = 0; τ(0) ← uniform
2: Direction finding via lifted MAP:
      s_k = arg max_{s∈O} ⟨s, θ − ∇_τ B∗(τ_k, ρ)⟩
3: Step size finding via golden section search:
      λ_k = arg max_{λ∈[0,1]} λ⟨s_k − τ_k, θ⟩ − B∗(τ_k(1 − λ) + s_k λ, ρ)
4: Update τ_{k+1} = τ_k(1 − λ_k) + s_k λ_k
5: k ← k + 1
6: if not converged, go to 2
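As a rough illustration of this loop (our own sketch, not the paper's implementation; the objective, lifted_map_oracle, and B_star below are toy stand-ins for the lifted MAP solver and the lifted TRW bound), the following Python code runs conditional gradient with golden section line search on a small concave problem over the simplex:

import numpy as np

def golden_section(f, lo=0.0, hi=1.0, tol=1e-6):
    # maximize a unimodal 1-D function f on [lo, hi]
    g = (np.sqrt(5.0) - 1) / 2
    while hi - lo > tol:
        a, b = hi - g * (hi - lo), lo + g * (hi - lo)
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

theta = np.array([0.2, 1.0, 0.5])

def B_star(tau):
    # toy convex stand-in (negative entropy), NOT the real TRW entropy bound
    return np.sum(tau * np.log(np.clip(tau, 1e-12, None)))

def grad_B_star(tau):
    return np.log(np.clip(tau, 1e-12, None)) + 1.0

def lifted_map_oracle(grad):
    # stand-in for Step 2: argmax of <s, grad> over the polytope's vertices;
    # for the simplex, the maximizing vertex is an indicator vector
    s = np.zeros_like(grad)
    s[np.argmax(grad)] = 1.0
    return s

tau = np.full(3, 1.0 / 3)                                          # Step 1: uniform start
for k in range(100):
    s = lifted_map_oracle(theta - grad_B_star(tau))                # Step 2
    f = lambda lam: theta @ ((1 - lam) * tau + lam * s) - B_star((1 - lam) * tau + lam * s)
    lam = golden_section(f)                                        # Step 3
    tau = (1 - lam) * tau + lam * s                                # Step 4
print(tau)  # approaches softmax(theta), the exact maximizer of this toy objective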

Proof of Lemma 4.

Proof. After considering all edges in e1 ∪ . . . ∪ ei, Kruskal's algorithm must have formed a spanning forest of Gi (the forest is spanning since, if there remained an edge that could be used without forming a cycle, Kruskal's algorithm would have used it already). Since the forest is spanning, the number of edges used by Kruskal's algorithm at this point is precisely |V(Gi)| − |C(Gi)|. Similarly, just after considering all edges in e1 ∪ . . . ∪ ei−1, the number of edges used is |V(Gi−1)| − |C(Gi−1)|. Therefore, the number of ei-edges used must be

|V(Gi)| − |C(Gi)| − [|V(Gi−1)| − |C(Gi−1)|] = δ_V^(i) − δ_C^(i),

which is the difference between the number of new nodes (which must be non-negative) and the number of new connected components (which could be negative) induced by considering the edges in ei. Any MST solution ρ̄ can be turned into a solution ρ of (9) by letting ρ(e) = (1/|e|) ∑_{e′∈e} ρ̄(e′). Thus, we obtain a solution with ρ(ei) = (δ_V^(i) − δ_C^(i)) / |ei|.
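To see the counting in action (our own sketch; the toy graph and its orbit grouping are assumptions, not from the paper), running ordinary Kruskal orbit-by-orbit on the ground graph recovers ρ(ei) = (δ_V^(i) − δ_C^(i)) / |ei| as the fraction of each orbit's edges that it accepts:

# Kruskal over edge orbits: all edges in an orbit share a weight, so we can
# process orbit-by-orbit; the fraction of each orbit's edges accepted equals
# rho(e_i) from Lemma 4. The 4-cycle with two edge orbits is illustrative.

parent = {}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):                        # returns True iff an edge was accepted
    rx, ry = find(x), find(y)
    if rx == ry:
        return False
    parent[rx] = ry
    return True

orbits = [[(0, 1), (2, 3)],             # orbit e1 (heavier)
          [(1, 2), (3, 0)]]             # orbit e2 (lighter)
for (u, v) in [e for orb in orbits for e in orb]:
    parent.setdefault(u, u)
    parent.setdefault(v, v)

rho = []
for orbit in orbits:
    used = sum(union(u, v) for (u, v) in orbit)
    rho.append(used / len(orbit))       # = (delta_V - delta_C) / |orbit|
print(rho)                              # [1.0, 0.5]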

Proof of Lemma 5.

We first need an intermediate result.

Lemma 8. If u and v are two distinct node orbits, and u and v are reachable from one another in the lifted graph G, then for any u ∈ u, there is some v ∈ v such that v is reachable from u.

Proof. We proceed by induction on the length of the u-v path. Base case: if u and v are adjacent, there exists an edge orbit e incident to both u and v. Therefore, there exists a ground edge {u0, v0} in e such that u0 ∈ u and v0 ∈ v. The automorphism mapping u0 7→ u will map v0 7→ v for some node v. Clearly {u, v} is an edge, and v ∈ v. Main case: assume the statement is true for all pairs of orbits with path length ≤ n. Suppose u − v is a path of length n + 1. Take the orbit z immediately preceding v in the path, so that u − z is a path of length n, and z and v are adjacent. By the inductive assumption, there exists z ∈ z such that u is connected to z.



Applying the same argument as in the base case, there exists v ∈ v such that {z, v} is an edge. Thus u is connected to v.

We now return to the main proof of Lemma 5.

Proof. If the ground graph G has only one connected component then the claim is trivially true. Let G1 and G2 be two distinct connected components of G, let u1 be a node in G1, and let u be the orbit containing u1. Let v2 be any node in G2. Since the lifted graph G is connected, all orbits are reachable from one another in G. By the above lemma, there must be some node u2 ∈ u reachable from v2, hence u2 ∈ G2 (if v2 ∈ u then we just take u2 = v2). This establishes that the node orbit u intersects both G1 and G2. Note that u1 ≠ u2, since otherwise G1 = G2. Let π be the automorphism that takes u1 to u2.

We now show that π(G1) = G2. Since π maps edges to edges and non-edges to non-edges, it is sufficient to show that π(V(G1)) = V(G2). Let z1 be a node of G1 and z2 = π(z1). Since G1 is connected, there exists a path from u1 to z1. But π must map this path to a path from u2 to z2, hence z2 ∈ V(G2). Thus π(V(G1)) ⊂ V(G2). Now, let z2 be a node of G2 and let z1 = π⁻¹(z2). By a similar argument, π⁻¹ must map any path from u2 to z2 to a path from u1 to z1, hence z1 ∈ V(G1). Thus V(G2) ⊂ π(V(G1)), and we have indeed shown that π(V(G1)) = V(G2).

Hence all connected components of G are isomorphic. Given one connected component G′, the number of connected components of G is |C(G)| = |V(G)| / |V(G′)|.

Lifted Kruskal’s Algorithm

See Algorithm 2. The algorithm keeps track of the list of connected components of the lifted graph Gi in a disjoint-set data structure, as in Kruskal's algorithm. Lines 17 and 20 follow from Lemma 5 and Lemma 4, respectively. The number of ground nodes of a lifted graph is computed as the sum of the sizes of its node orbits.

Algorithm 2 Lifted Kruskal's algorithm
Find: ρ, a solution of (9) at the orbit level
Input: lifted graph G and its set of edge orbits
1: sort edge orbits in decreasing weight order (e1, . . . , ek)
2: C = ∅            ▷ set of connected components of the lifted graph Gi, as a disjoint set
3: GC = empty map   ▷ hashmap from elements of C to their number of ground components
4: numGC = 0        ▷ number of ground connected components of Gi
5: numGV = 0        ▷ number of ground nodes of Gi
6: GVmax = #GroundNode(G)   ▷ number of ground nodes of G
7: for i = 1, . . . , k do
8:    Cold = find elements of C containing the end nodes of ei
9:    GCold = ∑_{S∈Cold} GC(S)
10:   δV = ∑_{v̄∈Nb(ei), v̄∉Cold} |v̄|
11:   numGV = numGV + δV
12:   H = union of the members of Cold and ei
13:   C ← remove the members of Cold and add H
14:   ū = pick a node orbit of H; u = representative(ū)
15:   Hfixed = lifted graph of H after fixing the node u
16:   K = the component containing u in Hfixed
17:   GC(H) ← #GroundNode(H) / #GroundNode(K)
18:   δC = GC(H) − GCold
19:   numGC ← numGC + δC
20:   ρ(ei) = (1/|ei|)[δV − δC]
21:   if numGV = GVmax and numGC = 1 then   ▷ no new ground nodes, one ground connected component
22:      break                              ▷ future δV and δC must be 0
23:   end if
24: end for
25: ρ(ej) = 0 for all j = i + 1, . . . , k
26: return ρ

Proof of Theorem 6.

We give two different proofs of this theorem. The first proof demonstrates how the constraints are derived methodically by lifting the ground constraints of the exact marginal polytope. The second proof is more intuitive and illustrates what each variable in the system of constraints represents conceptually.

Proof 1.

Proof. First consider the configurations C such that C(u1, u2) = (0, 0). From Eq. (10), after substituting the ground variables τχC by the lifted variables τχ,|C|, where |C| denotes the number of 1's in the configuration, we obtain the lifted constraint

∃ τχ,0, . . . , τχ,n :  ∑_{C : C(u1,u2)=(0,0)} τχ,|C| = τe(χ):00

Let us now simplify the summation. Since C(u1, u2) = (0, 0), |C| can range from 0 to n − 2. For each value |C| = k there are (n−2 choose k) different configurations. As a result, we can compactly write the above lifted constraint as

∑_{k=0}^{n−2} (n−2 choose k) τχ,k = τe(χ):00

Note that every edge {u1, u2} results in exactly the same constraint.

Similarly, when C(u1, u2) = (1, 1) we obtain the lifted constraint

∑_{k=0}^{n−2} (n−2 choose k) τχ,k+2 = τe(χ):11

and when C(u1, u2) = (0, 1) we obtain

∑_{k=0}^{n−2} (n−2 choose k) τχ,k+1 = τa(χ):01

There is no need to consider the (1, 0) case since we can always re-order the pair (u1, u2).



Finally, letting cχ,k = (n choose k) τχ,k, we arrive at the set of constraints of Theorem 6.
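These constraints are easy to verify by brute force on a small exchangeable distribution (our own check, not from the paper; the weights w defining the distribution are an arbitrary choice): enumerate all binary configurations of n nodes, let τχ,k be the common probability of any configuration with k ones, and confirm that the binomial sum recovers the pairwise marginal.

import itertools, math

n = 5
# an exchangeable distribution: the probability of a configuration depends
# only on its number of ones (the weights below are arbitrary)
w = [math.exp(-0.3 * k) for k in range(n + 1)]
Z = sum(w[sum(x)] for x in itertools.product([0, 1], repeat=n))
p = lambda x: w[sum(x)] / Z

tau_chi = [w[k] / Z for k in range(n + 1)]   # prob. of one configuration with k ones
tau_00 = sum(p(x) for x in itertools.product([0, 1], repeat=n)
             if x[0] == 0 and x[1] == 0)     # tau_{e(chi):00} for the edge {u1, u2}

lhs = sum(math.comb(n - 2, k) * tau_chi[k] for k in range(n - 1))
print(abs(lhs - tau_00) < 1e-12)             # True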

Proof 2.

Proof. Let C(0, 0) denote the number of (0, 0) edges in χ, where a (0, 0) edge is an edge whose two end-nodes both receive the assignment 0. To show that the first equality holds, we compute the expectation of C(0, 0) in two different ways.

First, E[C(0, 0)] is the sum, over all edges, of the probability that the edge is (0, 0). By exchangeability, these probabilities are all equal to τe(χ):00, so

E[C(0, 0)] = (n(n − 1)/2) τe(χ):00

Second, conditioned on the event that there are precisely k 1's in χ, C(0, 0) equals (n − k)(n − k − 1)/2 for k ≤ n − 2 and 0 for k > n − 2. Letting cχ,k be the probability of this event, we get

E[C(0, 0)] = E[ E[C(0, 0) | ∑_{i∈χ} x_i = k] ] = ∑_{k=0}^{n−2} ((n − k)(n − k − 1)/2) cχ,k

This shows that the first equality holds. The second equality is obtained similarly by considering the expectation of C(1, 1), the number of (1, 1) edges in χ. The last equality is obtained by considering the expectation of C(0, 1), the number of (0, 1) arcs.
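For concreteness, here is a small worked example that we add (n = 3; not in the original): there are n(n − 1)/2 = 3 edges, and a configuration with k ones has (3 − k)(2 − k)/2 zero-zero edges, i.e., 3 for k = 0, 1 for k = 1, and 0 for k ≥ 2. The two computations of the expectation then give

E[C(0, 0)] = 3 τe(χ):00 = 3 cχ,0 + cχ,1

and dividing by 3 and using cχ,0 = τχ,0 and cχ,1 = 3 τχ,1 recovers the lifted constraint τe(χ):00 = τχ,0 + τχ,1 = ∑_{k=0}^{1} (1 choose k) τχ,k from Proof 1.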

Protein-Protein Interaction (PPI) Model

The PPI model we consider is exactly an exponential family random graph model with three graph statistics: edge count, 2-chain count (a triangle with a missing edge), and triangle count. The model specification in MLN (with an additional clause enforcing undirectedness of the random graph) is

(1/2) θ1     r(x, y)
(1/2) θ110   r(x, y) ∧ r(x, z) ∧ !r(y, z)
(1/6) θ111   r(x, y) ∧ r(y, z) ∧ r(x, z)
−∞           !(r(x, y) ↔ r(y, x))

For this model, the edge appearance ρ is initialized so that the edges corresponding to the hard clause are always selected.
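To make the three statistics concrete, the following sketch (ours; ergm_stats is a hypothetical helper operating on a symmetric 0/1 adjacency matrix with zero diagonal) counts them directly:

import numpy as np

def ergm_stats(A):
    # edge, open 2-chain (path of length 2 with no closing edge), and
    # triangle counts of a simple undirected graph
    deg = A.sum(axis=1)
    edges = A.sum() / 2
    triangles = np.trace(A @ A @ A) / 6                 # each triangle counted 6 times
    two_chains = (deg * (deg - 1) / 2).sum() - 3 * triangles  # open 2-paths only
    return edges, two_chains, triangles

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
print(ergm_stats(A))   # (4.0, 2.0, 1.0)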


[Figure 7: Log-partition bounds for the PPI model (domain size 813). Two panels plot the log Z upper bounds of LTRW-L, LTRW-C, LTRW-LE, and LTRW-CE against θ011; plot data not reproduced here.]

Figure 8: G and its equitable partition. No element of the spanning tree polytope assigns uniform weight within each cell of the equitable partition.

The set {r(x, y) | x fixed, y varies} is an exchangeable cluster, so exchangeable constraints can be enforced on the orbit corresponding to this cluster. The log-partition upper bounds of lifted TRW with different outer bounds are shown in Fig. 7. The left part shows the region with negative θ011 and the right part shows the region with positive θ011 (θ111 is held fixed at −0.021). For negative θ011, a combination of cycle and exchangeable constraints (CE) is crucial for improving the upper bound. For positive θ011, exchangeable constraints (LE) are already sufficient to yield the best bound.

Equitable Partition and the Spanning Tree Polytope

Example 9. We give an example where the symmetrized spanning tree polytope corresponding to the (edge) equitable partition ε, T[ε] = T(G) ∩ R^{|E|}_{[ε]}, is empty. The example graph G is shown in Fig. 8. There are two edge types in the coarsest equitable partition: solid and dashed. The dashed edge in the middle is a bridge, so it must appear in every spanning tree, and hence every ρ ∈ T must assign weight 1 to this edge. If T[ε] ≠ ∅ then there is some ρ ∈ T[ε] that assigns the same weight to all dashed edges; it must therefore assign 1 to the two dashed edges on the left- and right-hand sides as well. The remaining solid edges have equal weight, and since the total weight is |V| − 1 = 7, each solid edge has weight (7 − 3)/6 = 2/3. Now consider the triangle on the left. Its total weight is 1 + 4/3 > 2, which violates the constraints of the spanning tree polytope. Thus, for this graph, T[ε] must be empty.
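The arithmetic of the violation can be checked in a few lines (our addition; it assumes, as stated above, that the left triangle contains one dashed edge of weight 1 and two solid edges):

from fractions import Fraction

V, dashed, solid = 8, 3, 6                  # |V| = 8, so total tree weight is 7
solid_w = Fraction(V - 1 - dashed, solid)   # (7 - 3)/6 = 2/3 per solid edge
triangle = 1 + 2 * solid_w                  # one dashed + two solid edges
print(triangle, triangle > 2)               # 7/3 True: exceeds the bound of 2
                                            # on any 3-node cycle's total weight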