
Annular Augmentation Sampling

Francois Fagan, Jalaj Bhandari, John P. Cunningham
Columbia University

Abstract

The exponentially large sample space of general binary probabilistic models renders intractable standard operations such as exact marginalization, inference, and normalization. Typically, researchers deal with these distributions via deterministic approximations, the class of belief propagation methods being a prominent example. Comparatively, Markov Chain Monte Carlo methods have been significantly less used in this domain. In this work, we introduce an auxiliary variable MCMC scheme that samples from an annular augmented space, translating to a great circle path around the hypercube of the binary sample space. This annular augmentation sampler explores the sample space more effectively than coordinate-wise samplers and has no tunable parameters, leading to substantial performance gains in estimating quantities of interest in large binary models. We extend the method to incorporate into the sampler any existing mean-field approximation (such as from belief propagation), leading to further performance improvements. Empirically, we consider a range of large Ising models and an application to risk factors for heart disease.

1 INTRODUCTION

Binary probabilistic models are a fundamental and widely used framework for encoding dependence between discrete variables, with applications from physics (Wang and Landau, 2001) and computer vision (Nowozin and Lampert, 2011) to natural language processing (Johnson et al., 2007). The exponential nature of these distributions (a d-dimensional binary model has a sample space of size 2^d) renders intractable key operations like exact marginalization and exact inference. Indeed these operations are formally hard: calculating the partition function is in general #P-complete (Chandrasekaran et al., 2008). This intractability has prompted extensive research into approximation techniques, with deterministic approximation methods such as belief propagation (Murphy et al., 1999; Wainwright and Jordan, 2008) and its variants (Murphy, 2012) enjoying substantial impact. Weighted model counting is another promising approach that has received increasing attention (Chavira and Darwiche, 2008; Chakraborty et al., 2014). However, Markov Chain Monte Carlo (MCMC) methods have been relatively unexplored, and in many settings basic Metropolis-Hastings (MH) is still the default choice of sampler for binary probabilistic models (Zhang et al., 2012).

This scarcity of MCMC methods contrasts with the literature on continuous probabilistic models, where MCMC methods like Hamiltonian Monte Carlo (HMC) (Neal, 2011) and slice sampling (Neal, 2003) have enjoyed substantial success. Whether explicitly or implicitly stated, the key idea for many of these state-of-the-art continuous samplers is to utilize a two-operator construction: at each iteration of the algorithm, first a subset of the full sample space is chosen, and second a new point is sampled from that subspace. Consider HMC, which samples a point from the Hamiltonian flow (the subset) induced by a particular sample of the momentum variable. This two-operator construction is enabled by an auxiliary variable augmentation (e.g. the momentum variable in HMC) that separates each MCMC operator. Recently this two-operator concept for HMC has been modified to sample binary distributions (Zhang et al., 2012; Pakman and Paninski, 2013), and its improved performance over MH and Gibbs sampling has been demonstrated, giving promise to the notion of auxiliary augmentations for binary samplers.

Another continuous example of this two-operator formulation, which inspires our present work, is Elliptical Slice Sampling (ESS) (Murray et al., 2010): auxiliary variables are introduced into a latent Gaussian model such that an elliptical contour (the subset) is defined by the first sampling operator, and then the second operator draws the next sample point from that ellipse.

Here we leverage this central idea of auxiliary augmentation to create a two-operator MCMC sampler for binary distributions. At each iteration, we first choose a subset of 2d points (≪ 2^d, the size of the sample space) defined over an annulus in the auxiliary variable space (or alternatively, a great circle around the hypercube corresponding to the sample space), and in the second operator we sample over that annulus. Critically, this augmentation strategy leads to an overrelaxed sampler, allowing long range exploration of the sample space (flipping multiple dimensions of the binary vector), unlike more basic coordinate-by-coordinate MH strategies, which require geometrically many iterations to flip many dimensions of the binary vector. We also extend this auxiliary formulation to exploit available deterministic approximations to improve the quality of the sampler. We evaluate our sampler on a synthetic problem (2D Ising models) and a real-world problem of modeling heart disease risk factors; it significantly outperforms MH and HMC techniques for binary distributions. Overall, our contributions include:

• a novel annular augmentation that maps a continuous path in the auxiliary space to the discrete sample space of binary distributions (Section 2.1);

• an extension of this augmentation to incorporate deterministic posterior approximations such as mean-field belief propagation (Section 2.1);

• a Rao-Blackwellized Gibbs sampler that exploits this annular augmentation to achieve the desired MCMC sampler, a sampler which is both free of tuning parameters and straightforward to implement with generic code (Section 2.3);

• and experimental results demonstrating the performance improvements achieved by annular augmentation sampling (Section 3).

We conclude by discussing some extensions to annular augmentation, and future directions for the class of augmented binary MCMC techniques (Section 4).

2 ANNULAR AUGMENTATION SAMPLING

We are interested in sampling from a probability distribution p(s) defined over d-dimensional binary vectors s ∈ {−1,+1}^d. The density p(s) is given in terms of a function f(s) as

$$p(s) = \frac{1}{Z} f(s),$$

where Z is the normalizing constant. As in other works using augmentation strategies (Pakman and Paninski, 2013; Murray et al., 2010), we rewrite s as a deterministic function s = h(θ, t) of auxiliary random variables (θ, t) (to be defined), where both the function and the distribution of the auxiliary variables are chosen to preserve the measure p(s) up to a normalizing constant. It is perhaps most straightforward to take this step by partitioning p into the product of a tractable distribution p̂, which will implicitly characterize the deterministic function, and the remainder L:

$$p(s) \propto L(s)\,\hat{p}(s), \qquad s = h(\theta, t),$$

where L(s) = f(s)/p̂(s) can be thought of as a likelihood with respect to the prior p̂. We must choose p̂ such that it is tractable to find a p̂-preserving function h from the auxiliary space (θ, t) to s. Furthermore, we would like p̂ to capture the structure of p (else samples will be wasted, similar to a bad importance sampler). We now demonstrate how to do so with any fully factorized p̂ and annular auxiliary variables.
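For instance, with a fully factorized p̂ the likelihood L(s) = f(s)/p̂(s) can be assembled directly. The sketch below is ours (helper names make_likelihood and p_hat, and a toy pairwise f, are not from the paper) and simply illustrates the decomposition:

    import numpy as np

    def make_likelihood(f, p_hat):
        # Build L(s) = f(s) / p_hat(s) for a fully factorized prior with marginals
        # p_hat[i] = p̂(s_i = +1), for s in {-1,+1}^d.
        def L(s):
            prior = np.prod(np.where(s == 1, p_hat, 1.0 - p_hat))
            return f(s) / prior
        return L

    # Toy pairwise model f(s) = exp(sum_{i<j} beta_ij s_i s_j) on d = 3 nodes.
    beta = np.array([[0.0, 0.5, -0.3],
                     [0.0, 0.0, 0.2],
                     [0.0, 0.0, 0.0]])
    f = lambda s: np.exp(s @ beta @ s)               # upper-triangular beta counts each pair once
    L = make_likelihood(f, p_hat=np.full(3, 0.5))    # uniform p̂ gives L(s) = 8 f(s)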

2.1 Annular Augmentations

Let us first consider the choice of a uniform prior: p̂(s) = ∏_{i=1}^{d} p̂(s_i) with p̂(s_i = +1) = p̂(s_i = −1) = 1/2. We can now replace s with auxiliary variables (θ, t) as

$$\theta \sim \mathrm{unif}(0, 2\pi), \qquad t_i \sim \mathrm{unif}(0, 2\pi), \qquad s_i = h(\theta, t_i) = \mathrm{sgn}[\cos(t_i - \theta)], \tag{1}$$

where t = (t_1, ..., t_d) ∈ [0, 2π]^d. This choice of distribution on (θ, t) and the map s = h(θ, t) maintains the distribution p̂(s), but induces conditional dependence given the auxiliary variables. Critical to this work is the observation that, conditioned on a value of t, the sample s can take on 2d possible values (not 2^d); which value s takes is determined by θ. Thus t defines a subset of the sample space over which we can sample s using θ.

Geometrically, this fact can be seen in two ways. First, Figure 1a shows each t_i as a red, green, or blue point, where the interval t_i ± π/2 is shown as a similarly colored bar; this bar defines the threshold for flipping the i-th coordinate of s (i.e., the threshold of the sgn function in Equation 1). As θ traverses 0 → 2π, 2d bit flips alter s_0 to −s_0 and back. Alternatively, a second view of this geometric structure is to consider a great circle path around the d-dimensional hypercube, as shown in Figure 1b. Owing to this fundamental ring structure, we call this auxiliary variable scheme annular augmentation. We use this augmentation to create an MCMC sampler in Section 2.2.
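To see the 2d-versus-2^d distinction concretely, the following small sketch (ours, assuming NumPy; variable names are not from the paper) enumerates the states reachable by moving θ around the annulus for one fixed draw of t:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3
    t = rng.uniform(0.0, 2.0 * np.pi, size=d)        # fixed thresholds

    def h(theta, t):
        # Equation 1: s_i = sgn[cos(t_i - theta)]
        return np.where(np.cos(t - theta) > 0, 1, -1)

    # s is piecewise constant in theta, changing only at the 2d edges t_i +/- pi/2;
    # evaluating h at the midpoint of every arc enumerates the reachable states.
    edges = np.sort(np.concatenate([t - np.pi / 2, t + np.pi / 2]) % (2 * np.pi))
    lengths = np.diff(np.append(edges, edges[0] + 2 * np.pi))
    midpoints = edges + lengths / 2
    states = {tuple(h(m, t)) for m in midpoints}
    print(len(states))    # 2*d = 6 distinct states, not 2**d = 8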


Figure 1: Diagram of the annular augmentation. (a) The auxiliary annulus. Each threshold t_i defines a semicircle where s_i = +1, which is indicated by a darker shade. The value of s changes as θ moves around the annulus. (b) The implied path of the annulus on the binary hypercube. Hypercube faces are colored to match the annulus.

Before introducing the sampler, we introduce an important extension. In some settings we may have access to a mean-field posterior approximation p̂ (e.g., from loopy belief propagation), which should in general serve us better than the uniform p̂ just introduced: a posterior approximation should "flatten" the likelihood L(s) = f(s)/p̂(s) in the spirit of Wang and Landau (2001) and Fagan et al. (2016), thereby improving the efficiency of our sampler. Our annular augmentation generalizes nicely to this setting, by altering Equation 1 to

$$s_i = \mathrm{sgn}[\cos(t_i - \theta) - \cos(\pi\,\hat{p}(s_i = 1))]. \tag{2}$$

Figure 2a demonstrates that the proportion of the annulus where s_i = 1 becomes

$$\frac{(t_i + \pi\,\hat{p}(s_i = 1)) - (t_i - \pi\,\hat{p}(s_i = 1))}{2\pi} = \hat{p}(s_i = 1),$$

resulting in the implied prior measure on s being precisely the mean-field approximation p̂. Indeed, when p̂(s_i = 1) = 1/2, Equation 2 reverts back to Equation 1, so the augmentations are consistent. Geometrically, this change to Equation 1 induces stretched thresholds around the annulus, as shown in Figure 2b. We use the stretched annulus to good effect in Section 3.
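As a quick numerical check of Equation 2 (a sketch under our own variable names, not the paper's code), the marginal induced on s_i by the stretched threshold matches p̂(s_i = 1):

    import numpy as np

    rng = np.random.default_rng(1)
    p_hat_i = 0.8                                    # target marginal p̂(s_i = 1)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=200_000)
    t_i = rng.uniform(0.0, 2.0 * np.pi, size=200_000)
    # Equation 2: the sgn threshold is shifted by cos(pi * p_hat_i)
    s_i = np.where(np.cos(t_i - theta) > np.cos(np.pi * p_hat_i), 1, -1)
    print((s_i == 1).mean())                         # approximately 0.8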

2.2 Generic Sampler

As outlined in Section 1, to sample from p(s) we define two MCMC operators to sample our auxiliary variables (θ, t), and then we read out the corresponding values of s = h(θ, t) as given by Equation 1 or, more generally, Equation 2. The posterior over the auxiliary variables (θ, t) is

$$p(\theta, t) \propto L(h(\theta, t)), \tag{3}$$

where the uniform 1/(2π) priors for θ and t have been absorbed into the constant of proportionality. In Operator 1, we use t to define a subspace, and in Operator 2 we sample over θ:

• Operator 1: Sample t subject to the constraint that s = h(θ, t) is fixed. The simplest such operator uniformly samples t_i from the domain where s_i is fixed:
$$t_i \sim \mathrm{unif}\big(\theta + (1 - s_i)\pi/2 - \pi\,\hat{p}(s_i), \;\; \theta + (1 - s_i)\pi/2 + \pi\,\hat{p}(s_i)\big).$$

• Operator 2: Sample θ using an MCMC transition operator (e.g., Gibbs sampling).

First note that, in Operator 1, the density of the proposed point is equal to that of the current point since L(h(θ, t)) is unchanged. Thus the sample from Operator 1 is accepted with probability 1 and is a valid MCMC step (which follows from evaluating the standard MH acceptance probability). Second, Operator 2 is also a valid MCMC transition by definition. Together, this two-operator sampling approach leaves the target distribution in Equation 3 invariant and is ergodic, and thus the Markov chain will converge to the stationary distribution p.
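The Operator 1 interval can be written down directly. The following sketch (our notation, assuming the Equation 2 readout; operator1_resample_t is a hypothetical helper, not the authors' code) resamples t and checks that the current s is preserved, so the move is indeed accepted with probability 1:

    import numpy as np

    def operator1_resample_t(s, p_hat, theta, rng):
        # Redraw each t_i uniformly over the arc on which s_i keeps its current value.
        prob_cur = np.where(s == 1, p_hat, 1.0 - p_hat)      # p̂(s_i) at the current value of s_i
        centre = theta + (1 - s) * np.pi / 2                 # theta if s_i = +1, theta + pi if s_i = -1
        return centre + np.pi * prob_cur * rng.uniform(-1.0, 1.0, size=s.shape)

    rng = np.random.default_rng(2)
    d, theta = 5, 0.0
    p_hat = rng.uniform(0.1, 0.9, size=d)                    # p_hat[i] = p̂(s_i = +1)
    s = rng.choice([-1, 1], size=d)
    t = operator1_resample_t(s, p_hat, theta, rng)
    # The readout of Equation 2 is unchanged, so the move is accepted with probability 1.
    assert np.all(np.where(np.cos(t - theta) > np.cos(np.pi * p_hat), 1, -1) == s)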

2.3 Operator 2: Rao-Blackwellized Gibbs

Figure 2: Augmentation with mean-field prior p̂. (a) Equation 2 in graphical form; in blue, the implied interval where s_1 = 1 along the annulus. We denote p̂_1 = p̂(s_1 = 1). (b) Stretched annular augmentation; cf. Figure 1a.

In the previous section we proposed a generic sampler for the annular augmentation; it now remains to specify Operator 2 to sample θ conditioned on t. Though a number of choices are available, a particularly effective choice is Rao-Blackwellized Gibbs sampling, which we now explain.

Gibbs sampling in the annular augmentation is made possible by the fact that s = h(θ, t) is piecewise constant over the annulus (see braces in Figure 1a). In each of the intervals between threshold edges, constant s = h(θ, t) implies L(h(θ, t)) is also constant. In Gibbs sampling, the probability of sampling a point within an interval is proportional to the integral of the density over that interval; here it is (proportional to) the product of the interval length and the value of L on the interval. Therefore we can Gibbs sample θ by first sampling an interval and then uniformly sampling a point within it. We further improve estimates from Gibbs sampling via Rao-Blackwellization (RB) (Douc et al., 2011). To calculate the expectation of some function g(s), the RB estimator is

$$\hat{\mathbb{E}}[g] = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{2d} r^n_k \, g(s^n_k),$$

where r^n_k is the normalized density of the k-th interval between threshold edges in the n-th iteration and s^n_k is its corresponding value. Annular augmentation permits an efficient implementation of RB Gibbs sampling, which we show in pseudocode in Algorithm 1. As an implementation note, the rotational invariance of t and θ allows us to set θ = 0 at the beginning of each iteration without loss of generality, thereby eliminating θ from Algorithm 1. Hereafter we will refer to Gibbs sampling on the annular augmentation as Annular Augmentation Gibbs sampling (AAG) and its RB version as Rao-Blackwellized Annular Augmentation Gibbs sampling (AAG + RB).

Even though the AAG sampler may seem to involve more work than a traditional method such as MH, both have similar computational complexity per sample. To see this, note that each MH iteration requires evaluating f with a change in sign in one coordinate and a uniform random variable for acceptance-rejection. In AAG we require d uniform random variables to sample t, must sort the threshold edges, and perform 2d function evaluations of f, each with a change in sign in one coordinate. Although this is more expensive, using RB we get 2d samples, hence the cost per sample is similar. For uniform priors it is possible to avoid the cost of sorting the threshold edges if we constrain t to lie on a lattice, i.e. t_i = k_i π/d where k_i ∈ {1, ..., 2d}.

We noted previously that the annular augmentation can be sampled with methods other than Gibbs. Indeed, in our experiments we will compare AAG to a similarly defined Annular Augmentation Slice sampler (AAS) (Neal, 2003). Suwa and Todo (2010) recently proposed a non-reversible MCMC sampler for discrete spaces with state-of-the-art mixing times; this Suwa-Todo (ST) algorithm can also be used in Operator 2. In Section 3.1, we extensively explore the benefits of using different Operator 2 approaches on Ising models.

2.4 Choice And Effect Of The Prior p̂

Annular augmentation requires a distribution p̂ which, for optimal performance, should approximately match the marginals of p. A natural choice is Loopy Belief Propagation (LBP), a deterministic approximation which incurs minimal computational overhead. Deterministic priors have been used before to speed up samplers, both in MCMC algorithms (De Freitas et al., 2001; Fagan et al., 2016) and in importance sampling (Liu et al., 2015). Often this makes the samplers more efficient without significant extra computational cost, as we will see in Section 3.

Annular augmentation explores the subset defined by the annular thresholds t, and the behavior of the sampler will depend significantly on the choice of prior p̂. Presuming p̂ approximately matches the true marginals of p, it is instructive to consider two extreme cases that demonstrate the quality and flexibility of the annular augmentation: strongly non-uniform p̂, where p̂(s_i = 1) ≈ 1; and uniform priors, where p̂(s_i = 1) ≈ 1/2 for all i = 1, ..., d.


Algorithm 1: Rao-Blackwellized Annular Augmentation Gibbs sampling (AAG + RB)

    Input     : unnormalized density L, prior p̂, number of iterations N, function g
    Output    : Monte Carlo approximation Ê[g]
    Initialize: s ∈ {−1,+1}^d; Ê[g] ← 0
    for n = 1, ..., N do
        /* Operator 1 */
        for i = 1, ..., d do
            t_i ← (π/2)(s_i − 1) + π p̂(s_i) · unif(−1, +1)        // threshold value
            e_i ← (t_i − π p̂(s_i = +1), t_i + π p̂(s_i = +1))      // threshold edges
        end
        /* Operator 2 */
        q ← sort_edge_indices(e_1, ..., e_d)      // order of coordinate flips on the annulus
        ℓ ← length(q; e_1, ..., e_d)              // lengths of the ordered intervals
        initialize r^n ∈ R^{2d}                   // interval densities
        for k = 1, ..., 2d do                     // move around the annulus
            s_{q(k)} ← −s_{q(k)}
            s^n_k ← s
            r^n_k ← ℓ(k) · L(s^n_k)
        end
        r^n ← r^n / ||r^n||_1                     // normalize probabilities
        i ∼ Multinomial(r^n);  s ← s^n_i          // Gibbs sample
        Ê[g] ← Ê[g] + (1/N) Σ_{k=1}^{2d} r^n_k · g(s^n_k)
    end
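To make the bookkeeping in Algorithm 1 concrete, the following is a minimal NumPy sketch of AAG + RB written by us (not the authors' code): it fixes θ = 0 at each iteration as described above, and all helper and variable names (aag_rb, p_hat, flip_order, etc.) are our own.

    import numpy as np

    def aag_rb(L, p_hat, d, n_iters, g, rng=None):
        # L     : unnormalized density L(s) = f(s) / p_hat(s), callable on s in {-1,+1}^d
        # p_hat : length-d array of prior marginals p_hat[i] = p̂(s_i = +1)
        # g     : function whose expectation under p is to be estimated
        rng = np.random.default_rng() if rng is None else rng
        s = rng.choice([-1, 1], size=d)
        E_g = 0.0
        for _ in range(n_iters):
            # Operator 1: resample t given s (theta is fixed to 0 by rotational invariance).
            half = np.pi * np.where(s == 1, p_hat, 1.0 - p_hat)   # half-width of the arc keeping s_i fixed
            centre = (np.pi / 2.0) * (s - 1.0)                    # 0 if s_i = +1, -pi if s_i = -1
            t = centre + half * rng.uniform(-1.0, 1.0, size=d)
            # Threshold edges: the two angles at which coordinate i flips, wrapped to [0, 2*pi).
            edges = np.concatenate([t - np.pi * p_hat, t + np.pi * p_hat]) % (2.0 * np.pi)
            coords = np.concatenate([np.arange(d), np.arange(d)])
            order = np.argsort(edges)
            sorted_edges, flip_order = edges[order], coords[order]
            lengths = np.diff(np.append(sorted_edges, sorted_edges[0] + 2.0 * np.pi))
            # Operator 2: Gibbs over the 2d piecewise-constant intervals, with RB weights r.
            states, r = [], np.empty(2 * d)
            s_walk = s.copy()
            for k in range(2 * d):
                s_walk[flip_order[k]] *= -1          # cross one threshold edge
                states.append(s_walk.copy())
                r[k] = lengths[k] * L(s_walk)
            r /= r.sum()
            s = states[rng.choice(2 * d, p=r)]       # Gibbs sample of the next state
            E_g += sum(rk * g(sk) for rk, sk in zip(r, states)) / n_iters   # RB estimate
        return E_g

With p_hat = np.full(d, 0.5) this reduces to the uniform-prior annular augmentation of Equation 1; supplying LBP marginals instead gives the stretched thresholds of Equation 2.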

In the non-uniform case, each threshold is stretched to cover nearly all of the annulus, inducing a constant s around the annulus except for d tiny regions where only one coordinate is flipped (with high probability). Our sampler reduces to an MH-type algorithm where only one coordinate is flipped at a time. This is appropriate and desired: nearly all of the density in p will be at s = (+1, ..., +1), with most of the remaining density residing in adjacent s′ that have only one negative-signed coordinate. AAG in such a case would quickly move to s = (+1, ..., +1), guided by the LBP prior, and then only consider flipping one coordinate at a time.

On the other hand, for uniform priors the annular augmentation will do a form of overrelaxed sampling. In overrelaxed sampling, proposed points lie on the opposite side of a mode or mean of a distribution (Neal, 1998), which can suppress random walks. In the annular augmentation with uniform prior, we can observe from Figure 1a that points on opposite sides of the annulus have opposite signs, i.e. s = h(θ, t) = −h(θ + π, t). Annular augmentation considers all such points and has a similar flavor to an overrelaxed sampler. This is particularly useful for pairwise binary distributions without bias, where f(s) = ∏_{i<j} exp(β_{ij} s_i s_j). In this case E_p(s) = 0, since the density is symmetric with respect to signs. Here annular augmentation samplers simultaneously explore points of opposite signs, resulting in an extremely accurate estimate of node marginals.

3 EXPERIMENTAL RESULTS

We evaluate our annular augmentation framework on Ising models and a real-world Boltzmann Machine (BM) application. Ising models have historically been an important benchmark for assessing the performance of binary samplers (Newman and Barkema, 1999; Zhang et al., 2012; Pakman and Paninski, 2013). We use a set-up similar to Zhang et al. (2012) to interpolate between unimodal and more challenging multi-modal target distributions, which allows us to compare the performance of each sampler in different regimes and understand the benefits of RB and the LBP prior. BMs, also known as Markov random fields, are a fundamental paradigm in machine learning (Ackley et al., 1985; Barber, 2012), with their deep counterpart having found numerous applications (Salakhutdinov and Hinton, 2009). Inference in fully connected BMs is extremely challenging. In our experiments we consider the heart disease dataset (Edwards and Toma, 1985), as its small size enables exact computations which can be used to measure the error of each sampler.

We compare against other general purpose state-of-the-art binary samplers: Exact-HMC (Pakman and Paninski, 2013) and Coordinate Metropolis-Hastings (CMH), an MH sampler which proposes to flip only one coordinate, s_i, per iteration. We note that for certain classes of binary distributions, specialized samplers have been developed, e.g. the Wolff algorithm (Newman and Barkema, 1999); as annular augmentation is a general binary sampling scheme, we do not compare to such specialized methods. To make the comparisons particularly challenging, we developed a novel method to incorporate the LBP prior in the CMH sampler and further improved its performance by using discrete event simulation. We refer to this as CMH + LBP (details in the Appendix).

To distinguish the contributions of annular augmentation, RB, and the LBP pseudoprior, we show results separately for the AAG sampler, its RB version (AAG + RB), and with the LBP prior (AAG + RB + LBP). As mentioned in Section 2.3, we also provide results for the Annular Augmentation Slice (AAS) and Suwa-Todo (AAST) sampling counterparts. We run all samplers for a fixed number of density function (L) evaluations. This choice is sensible since density function evaluation dominates the run-time cost of each sampler, ensuring a fair comparison between different methods. All samplers were started from the same initial point, and for the LBP approximation we used Matlab code (Schmidt, 2007). We report results for estimation of i) node marginals, ii) pairwise marginals, and iii) the normalizing factor (partition function), where the exact values were obtained by running the junction tree algorithm (Schmidt, 2007).

3.1 2D Ising Model

We consider a 2D Ising model on a square lattice of size 9 × 9 with periodic boundary conditions, where p(s) ∝ exp(−βE[s]). Here

$$E[s] = -\sum_{\langle i,j \rangle} W_{ij} s_i s_j - \sum_i b_i s_i$$

is the energy of the system and the sum over ⟨i, j⟩ is over all adjacent nodes on the square lattice. The strength of interaction between node pairs (i, j) is denoted by W_ij, b_i is the bias applied to each node, and β is the inverse temperature of the system. As is often done, we fix β = 1, W_ij = W for all (i, j) pairs, and apply a scaled bias b_i ∼ c · Unif[−1, 1] to each node. Each MCMC method is run 20 times for 1000 equivalent density function (L) evaluations.
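As a concrete reference for this target, the snippet below is one way (our sketch, not the experimental code) to write the unnormalized log-density of the periodic-lattice Ising model above:

    import numpy as np

    def ising_log_f(S, W, b):
        # S : (L, L) array of spins in {-1,+1}; W : scalar coupling; b : (L, L) per-node biases.
        # Periodic boundary conditions: rolling by one row/column pairs each node with its
        # right and down neighbours, so every adjacent pair is counted exactly once.
        pair_term = W * (np.sum(S * np.roll(S, -1, axis=0)) + np.sum(S * np.roll(S, -1, axis=1)))
        return pair_term + np.sum(b * S)      # log f(s) = -E[s] with beta = 1

    rng = np.random.default_rng(3)
    L_side, W_strength, c = 9, 0.4, 0.2
    b = c * rng.uniform(-1.0, 1.0, size=(L_side, L_side))
    S = rng.choice([-1, 1], size=(L_side, L_side))
    print(ising_log_f(S, W_strength, b))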

In Figures 3, 4, and 5, we report the RMSE of node marginal and pairwise marginal estimates, each averaged over 20 runs, for different settings of W (referred to as "Strength") and c (referred to as "Bias scale"). A higher value in the heatmaps (more yellow, less red) indicates a larger error. The values of strength and bias scale determine the difficulty of the problem, and demonstrate the performance of annular augmentation across a range of settings.

In Figure 3, we fix c = 0.2 and progressively increase W. These correspond to hard cases or "frustrated systems" with the target distribution being multimodal: increasing W increases the energy barrier between modes, making the target more difficult to explore. All samplers do similarly well for easy, low-strength cases (left columns of Figure 3), but AAS, AAS + RB, AAG, AAG + RB and AAST, AAST + RB outperform on difficult problems (high W values, redder colors). Also interesting to note is that, in the high strength regime, the addition of LBP hurts performance (LBP itself, not annular augmentation, has broken down; see Appendix, Table 1 for numerical values).

In Figure 4, no bias is applied to any node and we progressively increase W. This corresponds to a target distribution with two modes: the all-ones and all-negative-ones vectors. Samplers based on annular augmentation significantly outperform (lower error, more red) due to the overrelaxation property. For the zero-bias cases, the LBP approximation is equal to the true node marginals, p̂(s_i = 1) = 0.5, which is equivalent to the uniform prior case, hence we expect annular augmentation with and without LBP to perform similarly. A significant performance gain in estimating node marginals comes from RB (Appendix, Table 9 has numerical values). For the settings in Figures 3 and 4, the HMC, CMH and CMH + LBP samplers tend to get stuck in one mode, leading to poor performance.

Finally, in Figure 5, we fix W = 0.2 and progressively increase c, making the target distribution unimodal. In this simpler setting, HMC and CMH perform well and are on par with AAS + RB + LBP, AAG + RB + LBP and AAST + RB + LBP. We see this performance is due in part to the accuracy of LBP in this setting (Appendix, Table 2), shown also by the underperformance of annular augmentation without LBP in this case.

The annular augmented Gibbs, slice and ST methods have similar performance in all experiments. Gibbs always performs marginally better than slice, and typically outperforms ST when Rao-Blackwellized. This suggests that Gibbs makes better use of RB, even though its mixing time may be slower (Suwa and Todo, 2010).

Regarding the size of these problems, note that annular augmentation has no issue scaling to much larger sizes; we stopped at 9 × 9 grids to be able to create the baseline with the junction tree algorithm within a day or so of computation.

3.2 Heart Disease Risk Factors

The heart disease dataset (Edwards and Toma, 1985) lists six binary features (risk factors) relevant for coronary heart disease in 1841 men. These factors are modeled as a fully connected BM, which defines a probability distribution over a vector of binary variables s = [s_1, . . . , s_n], s_i ∈ {0, 1}:

$$p(s \mid W, b) = \frac{1}{Z(W, b)} \exp\Big( \sum_{i<j} W_{i,j} s_i s_j + \sum_i b_i s_i \Big).$$

A symmetric weight matrix W, with zeros only along the diagonal, and a bias vector b parameterize the distribution, with W_{i,j} denoting the interaction strength between units i and j. Given a data set of binary vectors D = {s^{(n)}, n = 1, . . . , N}, our goal is to learn the posterior over the parameters (W, b). Given some prior p(W, b), we define the joint model as

$$p(W, b, D) = p(W, b) \cdot p(D \mid W, b), \qquad p(D \mid W, b) = \frac{\exp\big( \sum_{n,\, i<j} W_{i,j} s^{(n)}_i s^{(n)}_j + \sum_{n,\, i} b_i s^{(n)}_i \big)}{Z(W, b)^N}.$$

Figure 3: Bias scale c = 0.2 and the bias for each node is drawn as b_i ∼ c · Unif[−1, 1]. The target distribution here is multi-modal. As we increase strength W, AAS, AAS + RB and AAG, AAG + RB increase their outperformance over all other samplers, including those using LBP (the LBP approximation degrades for W > 0.6).

Figure 4: No bias is applied to any node, making the target distribution bimodal, with the modes becoming more peaked as strength W is increased. All annular augmentation samplers very significantly outperform.

Figure 5: W is fixed to 0.2 and bias scale c is increased, making the target distribution unimodal. HMC and CMH find the mode quickly, as do annular augmentation samplers that leverage LBP. That group outperforms annular augmentation samplers with no access to LBP.

Figure 6: Absolute error in estimating the log partition function ratio for a) the real data (left) and b) the fake data simulation (right). [Bars: HMC, CMH, AAS + RB, AAG + RB under the approximate MH sampler; y-axis: approximation error.] For zero bias, the LBP prior yields no additional benefit; we omit results for LBP-driven samplers.

Bayesian learning here is hard: consider a simple MH sampler where, starting from (W, b), a symmetric proposal distribution is used to propose (W′, b′), which is accepted with probability

$$a = \min\left(1, \; \frac{p(W', b')}{p(W, b)} \, \frac{p(D \mid W', b')}{p(D \mid W, b)}\right), \tag{4}$$

which depends on the intractable partition function ratio (Z(W, b)/Z(W′, b′))^N. As in Murray and Ghahramani (2004), we use MCMC to approximate

$$\frac{Z(W, b)}{Z(W', b')} \approx \frac{\tilde{Z}(W, b)}{\tilde{Z}(W', b')} = \Big\langle \exp\Big( \sum_{i<j} (W'_{i,j} - W_{i,j}) s_i s_j + \sum_i (b'_i - b_i) s_i \Big) \Big\rangle^{-1}, \tag{5}$$

where ⟨·⟩ denotes E_{p(s|W,b)}(·). We will use different MCMC methods to draw samples from p(s | W, b) to compute the expectation in Equation (5), which is then used to compute the approximate MH acceptance probability as in Equation (4). Following Murray and Ghahramani (2004), bias terms are taken to be zero.
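For concreteness, here is a small sketch (ours, with zero biases as in the experiment; the function name log_partition_ratio is not from the paper) of the estimator in Equation (5), given MCMC samples from p(s | W, b):

    import numpy as np

    def log_partition_ratio(samples, W, W_new):
        # Estimate log[ Z(W) / Z(W_new) ] via Equation (5), with zero biases.
        # samples : (n, d) array of binary vectors drawn from p(s | W)
        # W, W_new: (d, d) symmetric weight matrices with zero diagonals
        dW = np.triu(W_new - W, k=1)                      # keep each pair i < j exactly once
        delta = np.einsum('ni,ij,nj->n', samples, dW, samples)
        # <exp(delta)> estimates Z(W_new)/Z(W); Equation (5) inverts it, so negate the log.
        log_mean = np.logaddexp.reduce(delta) - np.log(len(delta))
        return -log_mean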

We run 1000 such Markov chains, each with 100 samples (drawn using the approximate MH scheme). To approximate the partition function ratio in Equation (5), we use 1000 samples from 4 different samplers: i) HMC, ii) CMH, iii) AAS + RB, and iv) AAG + RB. For each Markov chain, we compute the average absolute error in approximating the log partition function ratio, i.e.

$$\Big\langle \Big| \log \frac{Z(W, b)}{Z(W', b')} - \log \frac{\tilde{Z}(W, b)}{\tilde{Z}(W', b')} \Big| \Big\rangle_{p(W, b \mid D)},$$

which is then averaged over 1000 such Markov chains. Following Murray and Ghahramani (2004), to see the effect of increasing dimensionality we also simulate data for 10-dimensional risk factors, using a random W matrix, and repeat the experiment (details are given in the Appendix). Figure 6 plots the overall mean and standard deviation of the approximation error for both real and fake data. The exact partition function ratio, required for computing the error metric above, was evaluated by enumeration. For the real data example, both AAS + RB and AAG + RB outperform CMH and perform on par with HMC, while for the higher-dimensional fake data simulation, annular augmentation outperforms both CMH and HMC. Accordingly, across both real and simulated data, annular augmentation provides the best overall performance. We note that small differences in Figure 6 can affect the convergence of the approximate MH sampler to the correct posterior distribution, as the approximation error for MH acceptance probabilities is proportional to the approximation error for the partition function ratio (we plotted the log of this quantity) taken to the Nth power.

4 CONCLUSION

We have presented a new framework for sampling from binary distributions. The annular augmentation sampler with subsequent Rao-Blackwellization has a number of desirable properties, including overrelaxation, no tuning parameters, and the ability to easily incorporate approximate posterior marginal information such as that from LBP. Taken together, these advantages lead to significant performance gains over CMH and Exact-HMC samplers, across a range of settings.

In this work we only considered uniform sampling of the thresholds, t_i ∼ unif(0, 2π); a future direction is to group thresholds together to flip clusters of correlated coordinates in the spirit of the Wolff algorithm. Furthermore, our inclusion of LBP in both annular augmentation and CMH suggests a similar extension for Exact-HMC (Pakman and Paninski, 2013).

Acknowledgements

We would like to thank the anonymous reviewers, especially for suggesting the Suwa-Todo method, as well as Ari Pakman and Liam Paninski for helpful discussions.

FF is supported by a grant from Bloomberg. JPC is supported by the Simons Foundation (SCGB#325171 and SCGB#325233), the Sloan Foundation, the McKnight Foundation, and the Grossman Center.


References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147-169.

Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.

Chakraborty, S., Fremont, D. J., Meel, K. S., Seshia, S. A., and Vardi, M. Y. (2014). Distribution-aware sampling and weighted model counting for SAT. arXiv preprint arXiv:1404.2984.

Chandrasekaran, V., Srebro, N., and Harsha, P. (2008). Complexity of inference in graphical models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 70-78. AUAI Press.

Chavira, M. and Darwiche, A. (2008). On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6):772-799.

De Freitas, N., Højen-Sørensen, P., Jordan, M. I., and Russell, S. (2001). Variational MCMC. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 120-127. Morgan Kaufmann Publishers Inc.

Douc, R., Robert, C. P., et al. (2011). A vanilla Rao-Blackwellization of Metropolis-Hastings algorithms. The Annals of Statistics, 39(1):261-277.

Edwards, D. and Toma, H. (1985). A fast procedure for model search in multidimensional contingency tables. Biometrika, 72(2):339-351.

Fagan, F., Bhandari, J., and Cunningham, J. P. (2016). Elliptical slice sampling with expectation propagation. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence.

Johnson, M., Griffiths, T. L., and Goldwater, S. (2007). Bayesian inference for PCFGs via Markov chain Monte Carlo.

Liu, Q., Fisher III, J. W., and Ihler, A. T. (2015). Probabilistic variational bounds for graphical models. In Advances in Neural Information Processing Systems, pages 1432-1440.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Murphy, K. P., Weiss, Y., and Jordan, M. I. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467-475. Morgan Kaufmann Publishers Inc.

Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 392-399. AUAI Press.

Murray, I., Adams, R. P., and MacKay, D. J. C. (2010). Elliptical slice sampling. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS).

Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31:705-741.

Neal, R. M. (1998). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. In Learning in Graphical Models, pages 205-228. Springer.

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2.

Newman, M. E. J. and Barkema, G. T. (1999). Monte Carlo Methods in Statistical Physics. Clarendon Press, Oxford.

Nowozin, S. and Lampert, C. H. (2011). Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4).

Pakman, A. and Paninski, L. (2013). Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems, pages 2490-2498.

Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS, volume 1, page 3.

Schmidt, M. (2007). UGM: A Matlab toolbox for probabilistic undirected graphical models.

Suwa, H. and Todo, S. (2010). Markov chain Monte Carlo method without detailed balance. Physical Review Letters, 105(12):120603.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305.

Wang, F. and Landau, D. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86(10):2050.

Zhang, Y., Ghahramani, Z., Storkey, A. J., and Sutton, C. A. (2012). Continuous relaxations for discrete Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 3194-3202.