
Source: proceedings.mlr.press/v48/simsekli16.pdf

Stochastic Quasi-Newton Langevin Monte Carlo

Umut Şimşekli¹ umut.simsekli@telecom-paristech.fr
Roland Badeau¹ roland.badeau@telecom-paristech.fr
A. Taylan Cemgil² taylan.cemgil@boun.edu.tr
Gaël Richard¹ gael.richard@telecom-paristech.fr

1: LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France
2: Department of Computer Engineering, Boğaziçi University, 34342, Bebek, İstanbul, Turkey

Abstract

Recently, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods have been proposed for scaling up Monte Carlo computations to large data problems. Whilst these approaches have proven useful in many applications, vanilla SG-MCMC might suffer from poor mixing rates when random variables exhibit strong couplings under the target densities or big scale differences. In this study, we propose a novel SG-MCMC method that takes the local geometry into account by using ideas from Quasi-Newton optimization methods. These second-order methods directly approximate the inverse Hessian by using a limited history of samples and their gradients. Our method uses dense approximations of the inverse Hessian while keeping the time and memory complexities linear with the dimension of the problem. We provide a formal theoretical analysis where we show that the proposed method is asymptotically unbiased and consistent with the posterior expectations. We illustrate the effectiveness of the approach on both synthetic and real datasets. Our experiments on two challenging applications show that our method achieves fast convergence rates similar to Riemannian approaches while at the same time having low computational requirements similar to diagonal preconditioning approaches.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

1. Introduction

Markov Chain Monte Carlo (MCMC) methods are one of the most important families of algorithms in Bayesian machine learning. These methods lie at the core of various applications such as the estimation of Bayesian predictive densities and Bayesian model selection. They also provide important advantages over optimization-based point estimation methods, such as having better predictive accuracy and being more robust to over-fitting (Chen et al., 2014; Ahn et al., 2015). Despite their well-known benefits, during the last decade conventional MCMC methods have lost their charm, since they are often criticized as being computationally very demanding. Indeed, classical approaches based on batch Metropolis-Hastings require passing over the whole data set at each iteration, and the acceptance-rejection test makes the methods even more impractical for modern data sets.

Recently, alternative approaches have been proposed to scale up MCMC inference to the large-scale regime. An important attempt was made by Welling & Teh (2011), where the authors combined ideas from Langevin Monte Carlo (LMC) (Rossky et al., 1978; Roberts & Stramer, 2002; Neal, 2010) and stochastic gradient descent (SGD) (Robbins & Monro, 1951; Kushner & Yin, 2003), and developed a scalable MCMC framework referred to as stochastic gradient Langevin dynamics (SGLD). Unlike conventional batch MCMC methods, SGLD uses subsamples of the data per iteration, similar to SGD. In this manner, SGLD is able to scale up to large datasets while at the same time being a valid MCMC method that forms a Markov chain asymptotically sampling from the target density. Several extensions of SGLD have been proposed (Ahn et al., 2012; Patterson & Teh, 2013; Ahn et al., 2014; Chen et al., 2014; Ding et al., 2014; Ma et al., 2015; Chen et al., 2015; Shang et al., 2015; Li et al., 2016), which are coined under the term Stochastic Gradient MCMC (SG-MCMC).

One criticism that is often directed at SGLD is that it suffers from poor mixing rates when the target densities exhibit strong couplings and scale differences across dimensions. This problem is caused by the fact that SGLD is based on an isotropic Langevin diffusion, which implicitly assumes that different components of the latent variable are uncorrelated and have the same scale, a situation which is hardly encountered in practical applications.

In order to be able to generate samples from non-isotropic target densities in an efficient manner, SGLD has been extended in several ways. The common theme in these methods is to incorporate the geometry of the target density into the sampling algorithm. In this direction, a first attempt was made by Ahn et al. (2012), where the authors proposed Stochastic Gradient Fisher Scoring (SGFS), a preconditioning schema that incorporates the curvature via Fisher scoring. Later, Patterson & Teh (2013) presented Stochastic Gradient Riemannian Langevin Dynamics (SGRLD), where they defined the sampler on a Riemannian manifold by borrowing ideas from (Girolami & Calderhead, 2011). SGRLD incorporates the local curvature information through the expected Fisher information matrix (FIM), which defines a natural Riemannian metric tensor for probability distributions. SGRLD has shown significant performance improvements over SGLD; however, it is computationally very intensive since it requires storing and inverting huge matrices, computing Cholesky factors, and performing expensive matrix products for generating each sample. It further requires computing third-order derivatives and the expected FIM to be analytically tractable, which limits the applicability of the method to a narrow variety of statistical models (Calderhead & Sustik, 2012).

In a very recent study, Li et al. (2016) proposed Preconditioned SGLD (PSGLD), which aims to handle the scale differences in the target density by making use of an adaptive preconditioning matrix that is chosen to be diagonal for computational reasons. Even though this method is computationally less intensive than SGRLD, it still cannot handle highly correlated target densities since the preconditioning matrix is forced to be diagonal. Besides, in order to reduce the computational burden, the authors discard a correction term in practical applications, which introduces permanent bias that does not vanish asymptotically.

In this study, we propose Hessian Approximated MCMC (HAMCMC), a novel SG-MCMC method that computes the local curvature of the target density in an accurate yet computationally efficient manner. Our method is built upon ideas similar to the Riemannian approaches; however, instead of using the expected FIM, we consider the local Hessian of the negative log posterior, whose expectation coincides with the expected FIM. Instead of computing the local Hessian and inverting it for each sample, we borrow ideas from Quasi-Newton optimization methods and directly approximate the inverse Hessian by using a limited history of samples and their stochastic gradients. By this means, HAMCMC (i) is computationally more efficient than SGRLD, since its time and memory complexities are linear with the dimension of the latent variable, (ii) can be applied to a wide variety of statistical models without requiring the expected FIM to be analytically tractable, and (iii) is more powerful than diagonal preconditioning approaches such as PSGLD, since it uses dense approximations of the inverse Hessian and is hence able to deal with correlations along with scale differences.

We provide a rigorous theoretical analysis for HAMCMC, where we show that HAMCMC is asymptotically unbiased and consistent with posterior expectations. We evaluate HAMCMC on a synthetic dataset and two challenging real datasets. In particular, we apply HAMCMC to two different matrix factorization problems and evaluate its performance on a speech enhancement and a distributed link prediction application. Our experiments demonstrate that HAMCMC achieves fast convergence rates similar to SGRLD while at the same time having low computational requirements similar to SGLD and PSGLD.

We note that SGLD has also been extended by making use of higher-order dynamics (Chen et al., 2014; Ding et al., 2014; Ma et al., 2015; Shang et al., 2015). These extensions are rather orthogonal to our contributions, and HAMCMC can easily be extended in the same directions.

2. Preliminaries

Let θ be a random variable in R^D. We aim to sample from the posterior distribution of θ, given as p(θ|x) ∝ exp(−U(θ)), where U(θ) is often called the potential energy and is defined as U(θ) = −[log p(x|θ) + log p(θ)]. Here, x ≜ {x_n}_{n=1}^N is a set of observed data points, with each x_n ∈ R^P. We assume that the data instances are independent and identically distributed, so that we have:

    U(θ) = −[log p(θ) + ∑_{n=1}^N log p(x_n|θ)].    (1)

We define an unbiased estimator Û(θ) of U(θ) as follows:

    Û(θ) = −[log p(θ) + (N/N_Ω) ∑_{n∈Ω} log p(x_n|θ)]    (2)

where Ω ⊂ {1, …, N} is a random data subsample that is drawn with replacement, and N_Ω = |Ω| is the number of elements in Ω. We will occasionally write Û_Ω(θ) to refer to Û(θ) when computed on a specific subsample Ω.
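As a concrete illustration of Eqs. 1-2, the short sketch below (a didactic example of ours, not the paper's code; the linear Gaussian likelihood and all names are assumptions) checks the unbiasedness property: averaging Û(θ) over all single-element subsamples recovers U(θ) exactly.

```python
import numpy as np

def U_full(theta, x, A, sigma2):
    """Potential energy of Eq. 1 for a linear Gaussian model (constants dropped)."""
    log_prior = -0.5 * theta @ theta                 # theta ~ N(0, I)
    r = x - A @ theta
    log_lik = -0.5 * np.sum(r ** 2) / sigma2         # x_n ~ N(a_n^T theta, sigma2)
    return -(log_prior + log_lik)

def U_hat(theta, x, A, sigma2, idx):
    """Unbiased estimator of Eq. 2, computed on the subsample idx."""
    N = len(x)
    log_prior = -0.5 * theta @ theta
    r = x[idx] - A[idx] @ theta
    log_lik = (N / len(idx)) * (-0.5 * np.sum(r ** 2) / sigma2)
    return -(log_prior + log_lik)
```

Since each singleton subsample is equally likely, the mean of Û over all of them reproduces the N/N_Ω rescaling exactly, which is the sense in which Eq. 2 is unbiased.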

2.1. Stochastic Gradient Langevin Dynamics

By combining ideas from LMC and SGD, Welling & Teh (2011) presented SGLD, which asymptotically generates a sample θ_t from the posterior distribution by iteratively applying the following update equation:

    θ_t = θ_{t−1} − ε_t ∇Û(θ_{t−1}) + η_t    (3)


where ε_t is the step-size and η_t is Gaussian noise: η_t ∼ N(0, 2ε_t I), where I stands for the identity matrix.

This method can be seen as the Euler discretization of the Langevin dynamics described by the following stochastic differential equation (SDE):

    dθ_t = −∇U(θ_t)dt + √2 dW_t    (4)

where W_t is the standard Brownian motion. In the simulations, ∇U(θ) is replaced by the stochastic gradient ∇Û(θ), which is an unbiased estimator of ∇U(θ). The convergence of SGLD has been studied in (Sato & Nakagawa, 2014; Teh et al., 2016). Since W_t is isotropic, SGLD often suffers from poor mixing rates when the target density is highly correlated or contains big scale differences.
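A minimal SGLD loop (Eq. 3) for a toy conjugate 1-D Gaussian model might look as follows. This is our own illustrative sketch, not the paper's code; the model, step size, and minibatch settings are assumptions chosen for readability.

```python
import numpy as np

def sgld(x, n_sub, eps, T, rng):
    """SGLD (Eq. 3) for the model theta ~ N(0, 1), x_n ~ N(theta, 1).

    The full gradient is grad U(theta) = theta - sum_n (x_n - theta); the
    minibatch version rescales the data term by N / n_sub, as in Eq. 2.
    """
    N = len(x)
    theta, chain = 0.0, []
    for t in range(T):
        idx = rng.integers(0, N, size=n_sub)      # subsample with replacement
        grad = theta - (N / n_sub) * np.sum(x[idx] - theta)
        theta = theta - eps * grad + np.sqrt(2 * eps) * rng.standard_normal()
        chain.append(theta)
    return np.array(chain)
```

For this conjugate model the posterior is N(Σ_n x_n/(N+1), 1/(N+1)), so the long-run chain average should approach Σ_n x_n/(N+1), which gives an easy sanity check.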

2.2. The L-BFGS algorithm

In this study, we consider a limited-memory Quasi-Newton (QN) method, namely the L-BFGS algorithm (Nocedal & Wright, 2006), which iteratively applies the following equation in order to find the maximum a-posteriori estimate:

    θ_t = θ_{t−1} − ε_t H_t ∇U(θ_{t−1})    (5)

where H_t is an approximation to the inverse Hessian at θ_{t−1}. The L-BFGS algorithm directly approximates the inverse of the Hessian by using the M most recent values of the past iterates. At each iteration, the inverse Hessian is approximated by applying the following recursion (using H_t = H_t^{M−1}):

    H_t^m = (I − (s_τ y_τ^⊤)/(y_τ^⊤ s_τ)) H_t^{m−1} (I − (y_τ s_τ^⊤)/(y_τ^⊤ s_τ)) + (s_τ s_τ^⊤)/(y_τ^⊤ s_τ)    (6)

where s_t ≜ θ_t − θ_{t−1}, y_t ≜ ∇U(θ_t) − ∇U(θ_{t−1}), τ = t − M + m, and the initial approximation is chosen as H_t^1 = γI for some γ > 0. The matrix-vector product H_t ∇U(θ_{t−1}) is often implemented either by using the two-loop recursion (Nocedal & Wright, 2006) or by using the compact form (Byrd et al., 1994), which both have linear time complexity O(MD). Besides, since L-BFGS needs to store only the latest M − 1 values of s_t and y_t, the space complexity is also O(MD), as opposed to classical QN methods that have quadratic memory complexity.
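The two-loop recursion mentioned above computes the product H_t∇U in O(MD) time without ever forming H_t. A compact sketch of the standard algorithm (variable names are ours):

```python
import numpy as np

def lbfgs_product(grad, pairs, gamma=1.0):
    """Two-loop recursion: returns H @ grad, where H is the L-BFGS
    inverse-Hessian approximation built from pairs = [(s_1, y_1), ...]
    ordered oldest first."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):            # first loop: newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    r = gamma * q                           # initial approximation H^1 = gamma * I
    for (s, y), a in zip(pairs, reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r
```

A quick sanity check: the BFGS update of Eq. 6 guarantees the secant condition H y = s for the most recent pair, so feeding the latest y into `lbfgs_product` must return the latest s.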

Surprisingly, QN methods have not attracted much attention in the MCMC literature. Zhang & Sutton (2011) combined classical Hamiltonian Monte Carlo with L-BFGS for achieving better mixing rates on small- and medium-sized problems. This method considers a batch scenario with full gradient computations, which are then appended with costly Metropolis acceptance steps. Recently, Dahlin et al. (2015) developed a particle Metropolis-Hastings schema based on (Zhang & Sutton, 2011).

In this study, we consider a stochastic QN (SQN) framework for MCMC. The main difference in such approaches is that they use stochastic gradients ∇Û(θ) in the L-BFGS computations. These methods have been shown to provide better scalability properties than QN methods, and more favorable convergence properties than SGD-based approaches. However, it has been shown that straightforward extensions of L-BFGS to the stochastic case fail in practice, and therefore special attention is required (Schraudolph et al., 2007; Byrd et al., 2014).

3. Stochastic Quasi-Newton LMC

Recent studies have shown that Langevin and Hamiltonian Monte Carlo methods have strong connections with optimization techniques. Therefore, one might hope to develop stronger MCMC algorithms by borrowing ideas from the optimization literature (Qi & Minka, 2002; Zhang & Sutton, 2011; Bui-Thanh & Ghattas, 2012; Pereyra, 2013; Chaari et al., 2015; Bubeck et al., 2015; Li et al., 2016). However, incorporating ideas from optimization methods into an MCMC framework requires careful design, and straightforward extensions often do not yield proper algorithms (Chen et al., 2014; Ma et al., 2015). In this section, we will first show that a naïve way of developing an SG-MCMC algorithm based on L-BFGS would not target the correct distribution. Afterwards, we will present our proposed algorithm HAMCMC and show that it targets the correct distribution and is asymptotically consistent with posterior expectations.

3.1. A Naïve Algorithm

A direct way of using SQN ideas in an LMC framework would be to treat H_t as a preconditioning matrix in an SGLD context (Welling & Teh, 2011; Ahn et al., 2012) by combining Eq. 3 and Eq. 5, which would yield the following update equation:

    θ_t = θ_{t−1} − ε_t H_t(θ_{t−M:t−1}) ∇Û(θ_{t−1}) + η_t    (7)

where H_t(·) is the approximate inverse Hessian computed via L-BFGS, M denotes the memory size in L-BFGS, and η_t ∼ N(0, 2ε_t H_t(θ_{t−M:t−1})). Here, we use a slightly different notation and denote the L-BFGS approximation by H_t(θ_{t−M:t−1}) instead of H_t, in order to explicitly indicate the samples that are used in the L-BFGS calculations.

Despite the fact that such an approach would introduce several challenges, which will be described in the next section, for now let us assume that computing Eq. 7 is feasible. Even so, this approach does not result in a proper MCMC schema.

Proposition 1. The Markov chain described in Eq. 7 does not target the posterior distribution p(θ|x) unless ∑_{j=1}^D ∂H_ij(θ)/∂θ_j = 0 for all i ∈ {1, …, D}.


Proof. Eq. 7 can be formulated as a discretization of a continuous-time diffusion given as follows:

    dθ_t = −H(θ_t)∇U(θ_t)dt + √(2H(θ_t)) dW_t    (8)

where ∇U(θ_t) is approximated by ∇Û(θ_t) in simulations. Theorems 1 and 2 of (Ma et al., 2015) together show that SDEs with state-dependent volatilities leave the posterior distribution invariant only if the drift term contains an additional 'correction' term, Γ(θ_t), defined as follows:¹

    Γ_i(θ) = ∑_{j=1}^D ∂H_ij(θ)/∂θ_j.    (9)

By the hypothesis, Eq. 8 clearly does not target the posterior distribution; therefore Eq. 7 does not target the posterior distribution either.
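The correction term of Eq. 9 is easy to probe numerically. The hedged sketch below (our own illustration, with hypothetical helper names) evaluates Γ(θ) by central finite differences and confirms that it vanishes for a constant H but not for a state-dependent one.

```python
import numpy as np

def correction_term(H_fn, theta, h=1e-5):
    """Gamma_i(theta) = sum_j dH_ij(theta)/dtheta_j (Eq. 9), via central differences.

    H_fn: callable mapping theta (shape (D,)) to a D x D matrix H(theta).
    """
    D = len(theta)
    gamma = np.zeros(D)
    for j in range(D):
        e = np.zeros(D); e[j] = h
        dH = (H_fn(theta + e) - H_fn(theta - e)) / (2 * h)   # dH/dtheta_j
        gamma += dH[:, j]
    return gamma
```

For example, H(θ) = diag(1 + θ²) gives Γ_i(θ) = 2θ_i, while any constant H gives Γ = 0 — exactly the property that HAMCMC's construction exploits later in the section.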

In fact, this result is neither surprising nor new. It has been shown that in order to ensure targeting the correct distribution, we should instead consider the following SDE (Girolami & Calderhead, 2011; Xifara et al., 2014; Patterson & Teh, 2013; Ma et al., 2015):

    dθ_t = −[H(θ_t)∇U(θ_t) + Γ(θ_t)]dt + √(2H(θ_t)) dW_t

where Γ(θ_t) must be taken into account when generating the samples. If we choose H(θ_t) = E[∇²U(θ_t)]^{−1} (i.e., the inverse of the expected FIM), discretize this SDE with the Euler scheme, and approximate ∇U(θ_t) with ∇Û(θ_t), we obtain SGRLD. Similarly, we obtain PSGLD if H(θ_t) has the following form:

    H(θ_t) = diag(1 ⊘ (λ + √v(θ_t)))
    v(θ_t) = α v(θ_{t−1}) + (1 − α) ḡ_{Ω_{t−1}}(θ_{t−1}) ⊙ ḡ_{Ω_{t−1}}(θ_{t−1})

where ḡ_Ω(θ) ≜ (1/N_Ω) ∑_{n∈Ω} ∇log p(x_n|θ), the symbols ⊘ and ⊙ respectively denote element-wise division and multiplication, and α ∈ [0, 1] and λ ∈ R₊ are the parameters to be tuned.
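In code, the PSGLD preconditioner above is an RMSprop-style running average. A minimal sketch (our own, with illustrative hyper-parameter defaults):

```python
import numpy as np

def psgld_preconditioner(v_prev, g_bar, alpha=0.99, lam=1e-5):
    """One update of the diagonal PSGLD preconditioner (equations above).

    v_prev: previous moving average v(theta_{t-1})
    g_bar:  mean stochastic gradient of the log-likelihood on the current subsample
    Returns (v, H_diag), where H_diag is the diagonal of H(theta_t).
    """
    v = alpha * v_prev + (1.0 - alpha) * g_bar * g_bar   # element-wise square
    H_diag = 1.0 / (lam + np.sqrt(v))                    # element-wise division
    return v, H_diag
```

Dimensions with large gradient magnitude get a small entry in H (small steps), which flattens scale differences; but since H is diagonal, correlations between dimensions remain untreated, as discussed above.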

The correction term Γ(θ_t) does not vanish except for a very limited variety of models, and computing this term requires the computation of the second derivatives of U(θ_t) in our case (whereas it requires computing the third derivatives in the Riemannian methods). This conflicts with our motivation of using QN methods to avoid the Hessian computations in the first place.

In classical Metropolis-Hastings settings, where Langevin dynamics is only used in the proposal distribution, the correction term Γ(θ_t) can be discarded without violating the convergence properties, thanks to the acceptance-rejection step (Qi & Minka, 2002; Girolami & Calderhead, 2011; Zhang & Sutton, 2011; Calderhead & Sustik, 2012). However, such an approach would result in slower convergence (Girolami & Calderhead, 2011). On the other hand, in SG-MCMC settings, which do not include an acceptance-rejection step, this term can be problematic and should be handled with special attention.

¹ The correction term presented in (Girolami & Calderhead, 2011) and (Patterson & Teh, 2013) has a different form than Eq. 9. Xifara et al. (2014) later showed that this term corresponded to a non-Lebesgue measure and that Eq. 9 should be used instead for the Lebesgue measure.

In (Ahn et al., 2012), the authors discarded the correction term in a heuristic manner. Recently, Li et al. (2016) showed that discarding the correction term induces permanent bias, which deprives the methods of their asymptotic consistency and unbiasedness. In the sequel, we will show that the computation of Γ(θ_t) can be avoided without inducing permanent bias by using a special construction that exploits the limited-memory structure of L-BFGS.

3.2. Hessian Approximated MCMC

In this section, we present our proposed method HAMCMC. HAMCMC applies the following update rule for generating samples from the posterior distribution:

    θ_t = θ_{t−M} − ε_t H_t(θ^{¬(t−M)}_{t−2M+1:t−1}) ∇Û(θ_{t−M}) + η_t    (10)

where θ^{¬c}_{a:b} ≡ (θ_{a:b} \ {θ_c}), η_t ∼ N(0, 2ε_t H_t(·)), and we assume that the initial 2M + 1 samples, i.e. θ_0, …, θ_{2M}, are already provided.

As opposed to the naïve approach, HAMCMC generates the sample θ_t based on θ_{t−M}, and it uses a history of samples θ_{t−2M+1}, …, θ_{t−M−1}, θ_{t−M+1}, …, θ_{t−1} in the L-BFGS computations. Here, we define the displacements and the gradient differences in a slightly different manner from usual L-BFGS, as follows: s_t = θ_t − θ_{t−M} and y_t = ∇Û(θ_t) − ∇Û(θ_{t−M}). Therefore, M must be chosen as 2 or higher. HAMCMC requires storing only the latest M samples and the latest M − 1 memory components, i.e. θ_{t−M:t−1}, s_{t−M+1:t−1}, and y_{t−M+1:t−1}, resulting in linear memory complexity.

An important property of HAMCMC is that the correction term Γ(θ) vanishes due to the construction of our algorithm, which will be formally proven in the next section. Informally, since H_t is independent of θ_{t−M} conditioned on θ^{¬(t−M)}_{t−2M+1:t−1}, a change in θ_{t−M} does not affect H_t, and therefore all the partial derivatives in Eq. 9 become zero. This property saves us from expensive higher-order derivative computations, and from inducing permanent bias by neglecting the correction term. Geometrically, our approach corresponds to choosing a flat Riemannian manifold specified by the metric tensor H_t, whose curvature is constant (i.e. Γ(θ) = 0).

We encounter several challenges due to the usage of stochastic gradients that are computed on random subsets of the data. The first challenge is data consistency (Schraudolph et al., 2007; Byrd et al., 2014). Since we draw different data subsamples Ω_t at different epochs, the gradient differences y_t would simply be 'inconsistent'. Handling this problem in incremental QN settings (i.e., when the data subsamples are selected via a deterministic mechanism, rather than being chosen at random) is not as challenging; however, the randomness in the stochastic setting prevents simple remedies such as computing the differences in a cyclic fashion. Therefore, in order to ensure consistent gradient differences, we slightly increase the computational needs of HAMCMC and perform an additional stochastic gradient computation at the end of each iteration, i.e. y_t = ∇Û_{Ω_t}(θ_t) − ∇Û_{Ω_t}(θ_{t−M}), where both gradients are evaluated on the same subsample Ω_t.

Algorithm 1: Hessian Approximated MCMC
1  input: M, γ, λ, N_Ω, θ_0, …, θ_{2M}
2  for t = 2M + 1, …, T do
3      Draw Ω_t ⊂ {1, …, N}
4      Generate z_t ∼ N(0, I)
       // L-BFGS computations
5      ξ_t = H_t(θ^{¬(t−M)}_{t−2M+1:t−1}) ∇Û_{Ω_t}(θ_{t−M})    (Eq. 6)
6      η_t = S_t(θ^{¬(t−M)}_{t−2M+1:t−1}) z_t    (Eq. 11)
       // Generate the new sample
7      θ_t = θ_{t−M} − ε_t ξ_t + √(2ε_t) η_t    (Eq. 10)
       // Update the L-BFGS memory
8      s_t = θ_t − θ_{t−M},  y_t = ∇Û_{Ω_t}(θ_t) − ∇Û_{Ω_t}(θ_{t−M}) + λ s_t

In order to have a valid sampler, we require H_t(·) to be positive definite. However, even if U(θ) were strongly convex, due to data subsampling L-BFGS is no longer guaranteed to produce positive definite approximations. One way to address this problem is to use variance reduction techniques as presented in (Byrd et al., 2014); however, such an approach would introduce further computational burden. In this study, we follow a simpler approach and address this problem by approximating the inverse Hessian in a trust region that is parametrized by λ, i.e. we use L-BFGS to approximate (∇²U(θ) + λI)^{−1}. For large enough λ, this approach ensures positive definiteness. The L-BFGS updates for this problem can be obtained by simply modifying the gradient differences as y_t ← y_t + λ s_t (Schraudolph et al., 2007).

The final challenge is to generate η_t whose covariance is a scaled version of the L-BFGS output. This requires computing the square root of H_t, i.e. S_t S_t^⊤ = H_t, so that we can generate η_t as √(2ε_t) S_t z_t, where z_t ∼ N(0, I). Fortunately, we can directly compute the product S_t z_t within the L-BFGS framework by using the following recursion (Brodlie et al., 1973; Zhang & Sutton, 2011):

    H_t^m = S_t^m (S_t^m)^⊤,  S_t^m = (I − p_τ q_τ^⊤) S_t^{m−1}
    B_t^m = C_t^m (C_t^m)^⊤,  C_t^m = (I − u_τ v_τ^⊤) C_t^{m−1}

    v_τ = s_τ / (s_τ^⊤ B_t^{m−1} s_τ),  u_τ = √((s_τ^⊤ B_t^{m−1} s_τ)/(s_τ^⊤ y_τ)) y_τ + B_t^{m−1} s_τ,

    p_τ = s_τ / (s_τ^⊤ y_τ),  q_τ = √((s_τ^⊤ y_τ)/(s_τ^⊤ B_t^{m−1} y_τ)) B_t^{m−1} s_τ − y_τ    (11)

where S_t ≡ S_t^{M−1} and B_t^m = (H_t^m)^{−1}. With this approach, η_t can be computed in O(M²D) time, which is still linear in D. We illustrate HAMCMC in Algorithm 1.
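To make the overall loop concrete, here is a small, self-contained sketch of Algorithm 1 for a toy Gaussian target. It is didactic only: for readability it forms H densely via the recursion of Eq. 6 and replaces the O(M²D) square-root recursion of Eq. 11 with a dense Cholesky factorization (so it costs O(D³) per step), it simplifies the memory bookkeeping to consecutive (s, y) pairs, and it uses full-data gradients (N_Ω = N). All function names are ours.

```python
import numpy as np

def bfgs_inverse_hessian(pairs, gamma=1.0):
    """Dense BFGS recursion (Eq. 6); didactic O(D^2)-per-update variant."""
    D = pairs[0][0].shape[0]
    H = gamma * np.eye(D)
    for s, y in pairs:
        rho = 1.0 / (y @ s)
        V = np.eye(D) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
    return H

def hamcmc(grad_U, theta_init, eps, lam, M, T, rng):
    """Simplified sketch of Algorithm 1: generate theta_t from theta_{t-M}."""
    thetas = list(theta_init)                    # the initial 2M+1 samples
    D = thetas[0].shape[0]
    pairs = []                                   # latest M-1 (s, y) pairs
    for t in range(len(thetas), T):
        theta_old = thetas[-M]
        g = grad_U(theta_old)
        H = bfgs_inverse_hessian(pairs) if pairs else np.eye(D)
        S = np.linalg.cholesky(H)                # dense stand-in for Eq. 11
        z = rng.standard_normal(D)
        theta_new = theta_old - eps * (H @ g) + np.sqrt(2.0 * eps) * (S @ z)
        s = theta_new - theta_old
        y = grad_U(theta_new) - g + lam * s      # consistent, lambda-regularized
        pairs = (pairs + [(s, y)])[-(M - 1):]
        thetas.append(theta_new)
    return np.array(thetas)
```

For a Gaussian target U(θ) = ½(θ−μ)^⊤Λ(θ−μ) we have y = (Λ + λI)s, so y^⊤s > 0 and every H stays positive definite; the long-run chain mean should approach μ.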

Note that a large M would result in a large gap between two consecutive samples and therefore might require a larger λ for ensuring positive definite L-BFGS approximations. Such a circumstance would be undesirable, since H_t would get closer to the identity matrix, making HAMCMC behave similarly to SGLD. In our computational studies, we have observed that choosing M between 2 and 5 is often adequate in several applications.

3.3. Convergence Analysis

We are interested in approximating posterior expectations by using sample averages:

    h̄ ≜ ∫_{R^D} h(θ) π(dθ) ≈ ĥ ≜ (1/W_T) ∑_{t=1}^T ε_t h(θ_t),    (12)

where π(θ) ≜ p(θ|x), W_T ≜ ∑_{t=1}^T ε_t, and θ_t denotes the samples generated by HAMCMC. In this section we will analyze the asymptotic properties of this estimator. We consider a theoretical framework similar to the one presented in (Chen et al., 2015). In particular, we are interested in analyzing the bias |E[ĥ] − h̄| and the mean squared error E[(ĥ − h̄)²] of our estimator.
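The estimator ĥ of Eq. 12 is simply a step-size-weighted average of the chain; a tiny sketch (function name is ours):

```python
def weighted_average(h_vals, eps_vals):
    """h_hat of Eq. 12: sum_t eps_t * h(theta_t) / W_T, with W_T = sum_t eps_t."""
    W = sum(eps_vals)
    return sum(e * h for e, h in zip(eps_vals, h_vals)) / W
```

With a constant step size this reduces to the ordinary Monte Carlo average; with a decreasing schedule such as ε_t = (a_ε/t)^0.51 (used in Section 4), later samples receive smaller weights.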

A useful way of analyzing dynamics-based MCMC approaches is to first analyze the underlying continuous dynamics, and then consider the MCMC schema as a discrete-time approximation (Roberts & Stramer, 2002; Sato & Nakagawa, 2014; Chen et al., 2015). However, this approach is not directly applicable in our case. Inspired by (Roberts & Gilks, 1994) and (Zhang & Sutton, 2011), we convert Eq. 10 into a first-order Markov chain defined on the product space R^{D(2M−1)}, such that Θ_t ≡ {θ_{t−2M+2}, …, θ_t}. In this way, we can show that HAMCMC uses a transition kernel which modifies only one element of Θ_t per iteration and rearranges the samples that will be used in the L-BFGS computations. Therefore, the overall HAMCMC kernel can be interpreted as a composition of M different preconditioned SGLD kernels whose volatilities are independent of their current states. By considering this property, and assuming that certain conditions provided in the Supplement hold, we can bound the bias and MSE as shown in Theorem 1.

Figure 1. Illustration of the samples generated by different SG-MCMC methods. The contours of the true posterior distribution are shown with dashed lines. We set M = 2 for HAMCMC.

Theorem 1. Assume the number of iterations is chosen as T = KM, where K ∈ N₊, and {θ_t}_{t=1}^T are obtained by HAMCMC (Eq. 10). Define L_K ≜ ∑_{k=1}^K ε_{kM}, Y_K ≜ ∑_{k=1}^K ε²_{kM}, and the operator ΔV_t ≜ (∇Û(θ_t) − ∇U(θ_t))^⊤ H_t ∇. Then, we can bound the bias and MSE of HAMCMC as follows:

    (a) |E[ĥ] − h̄| = O(1/L_K + Y_K/L_K)

    (b) E[(ĥ − h̄)²] = O(∑_{k=1}^K (ε²_{kM}/L²_K) E‖ΔV_{k★}‖² + 1/L_K + Y²_K/L²_K)

where h is smooth, ΔV_{k★} = ΔV_{m★_k + kM} with m★_k = arg max_{1≤m≤M} E‖ΔV_{m+kM}‖², and ‖ΔV_t‖ denotes the operator norm.

The proof is given in the Supplement. The theorem implies that as T goes to infinity, the bias and the MSE of the estimator vanish. Therefore, HAMCMC is asymptotically unbiased and consistent.

4. Experiments

4.1. Linear Gaussian Model

We conduct our first set of experiments on synthetic data, where we consider a rather simple Gaussian model whose posterior distribution is analytically available. The model is given as follows:

    θ ∼ N(0, I),  x_n|θ ∼ N(a_n^⊤ θ, σ²_x),    (13)

for all n. Here, we assume that {a_n}_{n=1}^N and σ²_x are known, and we aim to draw samples from the posterior distribution p(θ|x). We compare the performance of HAMCMC against SGLD, PSGLD, and SGRLD. In these experiments, we determine the step size as ε_t = (a_ε/t)^{0.51}. In all our experiments, for each method, we have tried several values for the hyper-parameters with the same rigor and we report the best results. We also report all of the hyper-parameters used in the experiments (including Sections 4.2-4.3) in the Supplement. These experiments are conducted on a standard laptop computer with a 2.5GHz quad-core Intel Core i7 CPU, and all the methods are implemented in Matlab.

Figure 2. The error between the true posterior mean θ̄ and the Monte Carlo estimates θ̂ for D = 10 and D = 100. The left column shows the error versus iterations; the right column shows the error versus wall-clock time.

In the first experiment, we first generate the true θ and the observations x by using the generative model given in Eq. 13. We set D = 2 for visualization purposes and σ_x² = 10, then we randomly generate the vectors {a_n}_{n=1}^N in such a way that the posterior distribution will be correlated. Then we generate T = 20000 samples by using the SG-MCMC methods, where the size of the data subsamples is selected as N_Ω = N/100. We discard the first half of the samples as burn-in. Figure 1 shows the typical results of this experiment. We can clearly observe that SGLD cannot explore the posterior efficiently and gets stuck around the mode of the distribution. PSGLD captures the shape of the distribution more accurately than SGLD since it is able to take the scale differences into account; however, it still cannot explore lower probability areas efficiently. Besides, we can see that SGRLD performs much better than SGLD and PSGLD. This is due to the fact that it uses the inverse expected FIM of the full problem for generating each sample, where this matrix is simply (Σ_n a_n a_n^⊤ / σ_x² + I)^{-1} for this model. Therefore, SGRLD requires much heavier computations as it needs to store a D × D matrix, compute the Cholesky factors of this matrix, and perform matrix-vector products at each iteration. Furthermore, in more realistic probabilistic models, the expected FIM almost always depends on θ; hence, the computational burden of SGRLD is further increased by the computation of the inverse of the D × D matrix at each iteration. On the other hand, Fig. 1 shows that HAMCMC takes the best of both worlds: it is able to achieve the quality of SGRLD since it makes use of dense approximations to the inverse Hessian, and it has low computational requirements similar to SGLD and PSGLD. This observation becomes clearer in our next experiment.
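For this linear-Gaussian model, the objects that SGRLD must form at every iteration are available in closed form, which makes the cost comparison concrete. A small NumPy sketch (our own illustration, not the paper's code) of the D × D preconditioner and its Cholesky factor:

```python
import numpy as np

def sgrld_preconditioner(A, sigma2_x):
    """Inverse expected FIM (sum_n a_n a_n^T / sigma2_x + I)^{-1} for the model in
    Eq. 13, plus its Cholesky factor: the O(D^2)-memory, O(D^3)-time objects that
    SGRLD manipulates at each iteration."""
    D = A.shape[1]
    fim = A.T @ A / sigma2_x + np.eye(D)   # expected FIM (includes the N(0, I) prior)
    G_inv = np.linalg.inv(fim)             # dense D x D preconditioner
    chol = np.linalg.cholesky(G_inv)       # factor used to scale the injected noise
    return G_inv, chol
```

The O(D²) storage and O(D³) factorization are exactly the costs that HAMCMC avoids by never materializing a dense matrix.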


In our second experiment, we again generate the true θ and the observations by using Eq. 13. Then, we approximate the posterior mean θ̄ = ∫ θ p(θ|x) dθ by using the aforementioned MCMC methods. For comparison, we monitor ‖θ̄ − θ̂‖₂, where θ̂ is the estimate that is obtained from MCMC. Figure 2 shows the results of this experiment for D = 10 and D = 100. In the left column, we show the error as a function of iterations. These results show that the convergence rate of SGRLD (with respect to iterations) is much faster than that of SGLD and PSGLD. As expected, HAMCMC yields better results than SGLD and PSGLD even in the simplest configuration (M = 2), and it performs very similarly to SGRLD when we set M = 3. The performance difference becomes more prominent as we increase D.
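The error metric used here can be computed as follows. This is a toy sketch that draws exact posterior samples in place of an MCMC chain, so that the Monte Carlo estimate θ̂ demonstrably converges to θ̄; the helper name is our own.

```python
import numpy as np

def posterior_mean_error(A, x, sigma2_x, n_samples=20000, seed=0):
    """Track ||theta_bar - theta_hat||_2, where theta_bar is the exact posterior
    mean of the model in Eq. 13 and theta_hat is the running Monte Carlo estimate
    computed from (here: exact i.i.d.) posterior samples."""
    rng = np.random.default_rng(seed)
    D = A.shape[1]
    prec = A.T @ A / sigma2_x + np.eye(D)        # posterior precision
    cov = np.linalg.inv(prec)
    theta_bar = cov @ (A.T @ x) / sigma2_x       # exact posterior mean
    samples = rng.multivariate_normal(theta_bar, cov, size=n_samples)
    theta_hat = np.cumsum(samples, axis=0) / np.arange(1, n_samples + 1)[:, None]
    return np.linalg.norm(theta_hat - theta_bar, axis=1)  # error after each sample
```

For a correlated MCMC chain the same curve decays more slowly, which is what Figure 2 compares across samplers.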

An important advantage of HAMCMC is revealed when we take into account the total running time of these methods. These results are depicted in the right column of Fig. 2, where we show the error as a function of wall-clock time². We can observe that, since SGLD, PSGLD, and HAMCMC have similar computational complexities, the shapes of their corresponding plots do not differ significantly. However, SGRLD appears to be much slower than the other methods as we increase D.

4.2. Alpha-Stable Matrix Factorization

In our second set of experiments, we consider a recently proposed probabilistic matrix factorization model, namely alpha-stable matrix factorization (αMF), that is given as follows (Simsekli et al., 2015):

W_ik ∼ GG(a_w, b_w, −2/α),   H_kj ∼ GG(a_h, b_h, −2/α),
X_ij | W, H ∼ SαS_c([Σ_k W_ik H_kj]^{1/α}),    (14)

where X ∈ C^{I×J} is the observed matrix with complex entries, and W ∈ R_+^{I×K} and H ∈ R_+^{K×J} are the latent factors. Here, GG denotes the generalized gamma distribution (Stacy, 1962) and SαS_c denotes the complex symmetric α-stable distribution (Samoradnitsky & Taqqu, 1994). Stable distributions are heavy-tailed distributions and they are the limiting distributions in the generalized central limit theorem. They are often characterized by four parameters: S(α, β, σ, µ), where α ∈ (0, 2] is the characteristic exponent, β ∈ [−1, 1] is the skewness parameter, σ ∈ (0, ∞) is the dispersion parameter, and µ ∈ (−∞, ∞) is the location parameter. The distribution is called symmetric (SαS) if β = 0. The location parameter is also chosen as 0 in αMF.
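Although the density has no closed form, simulating SαS variables is easy, e.g. with the classical Chambers-Mallows-Stuck (CMS) method. A sketch for the symmetric case (our own illustration; the paper's Gibbs sampler draws its stable variables inside the inference loop rather than with this exact routine):

```python
import numpy as np

def sample_sas(alpha, sigma=1.0, size=1, seed=None):
    """Draw symmetric alpha-stable (SaS) samples via the Chambers-Mallows-Stuck
    transform, valid for alpha in (0, 2] with alpha != 1; sigma is the dispersion."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit-mean exponential
    x = (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
         * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
    return sigma * x
```

For α = 2 the formula reduces to a Gaussian with variance 2σ², which is a convenient sanity check on the implementation.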

The probability density function of the stable distributions cannot be written in closed form except for certain special cases; however, it is easy to draw samples from them. Therefore, MCMC has become the de facto inference tool for α-stable models like αMF (Godsill & Kuruoglu, 1999). By using data augmentation and the product property of stable distributions, Simsekli et al. (2015) proposed a partially collapsed Gibbs sampler for making inference in this model. The performance of this approach was then illustrated on a speech enhancement problem.

²The inverse of the FIM for this model can be precomputed. However, this will not be the case in more general scenarios. Therefore, for a fair comparison we perform the matrix inversion and Cholesky decomposition operations at each iteration.

Figure 3. The performance of HAMCMC on a speech denoising application in terms of SDR (higher is better). [Plot: test SDR (dB) versus iteration t for SGLD, PSGLD, and HAMCMC.]

In this section, we derive a new inference algorithm for αMF that is based on SG-MCMC. By using the product property of stable distributions (Kuruoglu, 1999), we express the model as conditionally Gaussian:

W_ik ∼ GG(a_w, b_w, −2/α),   H_kj ∼ GG(a_h, b_h, −2/α),
Φ_ij ∼ S(α/2, 1, 2(cos(πα/4))^{2/α}, 0),
X_ij | W, H, Φ ∼ N_c(0, Φ_ij [Σ_k W_ik H_kj]^{2/α}),    (15)

where N_c denotes the complex Gaussian distribution. Here, by assuming α is already provided, we propose a Metropolis and SG-MCMC within Gibbs approach, where we generate samples from the full conditional distributions of the latent variables: we use a Metropolis step for sampling Φ, where the proposal is the prior, and we use SG-MCMC for sampling W and H from their full conditionals. Since all the elements of W and H must be non-negative, we apply mirroring at the end of each iteration, where we replace negative elements with their absolute values (Patterson & Teh, 2013).
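When the proposal is the prior, the prior terms cancel in the Metropolis ratio and only the likelihood ratio remains. A generic scalar sketch of such an update (our own illustration; here `loglik` and `sample_prior` are placeholders for the complex-Gaussian likelihood term and the stable prior of Φ_ij):

```python
import numpy as np

def metropolis_prior_proposal(phi, loglik, sample_prior, n_steps=1, rng=None):
    """Metropolis update with the prior as proposal: since the prior factors cancel,
    phi' is accepted with probability min(1, p(x | phi') / p(x | phi))."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(n_steps):
        phi_prop = sample_prior(rng)                 # independent draw from the prior
        log_ratio = loglik(phi_prop) - loglik(phi)   # likelihood ratio only
        if np.log(rng.uniform()) < log_ratio:
            phi = phi_prop
    return phi
```

This independence-sampler step is what makes the Φ update cheap: no gradient of the intractable stable density is ever needed.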

We evaluate HAMCMC on the same speech enhancement application described in (Simsekli et al., 2015) by using the same settings. The aim here is to recover clean speech signals that are corrupted by different types of real-life noise signals (Hu & Loizou, 2007). We use the Matlab code that can be downloaded from the authors' websites in order to generate the same experimental setting. We first train W on a clean speech corpus; then we fix W during denoising and sample the rest of the variables. For each test signal (out of 80), we have I = 256 and J = 140. The rank is set to K = 105, where the first 100 columns of W are reserved for clean speech; we set α = 1.2 for speech and α = 1.89 for noise. Note that the model used for denoising is slightly different from Eq. 14 and we refer the readers to the original paper for further details of the experimental setup.


Figure 4. The illustration of the data partitioning that is used in the distributed matrix factorization experiments. The numbers in the blocks denote the computer that stores the corresponding block.

In this experiment, we compare HAMCMC against SGLD and PSGLD. SGRLD cannot be applied to this model since the expected FIM is not tractable. For each method, we set N_Ω = IJ/10 and we generate 2500 samples, where we compute the Monte Carlo averages by using the last 200 samples. For HAMCMC, we set M = 5. Fig. 3 shows the denoising results on the test set in terms of the signal-to-distortion ratio (SDR) when the mixture signal-to-noise ratio is 5dB. The results show that SGLD and PSGLD perform similarly, where the quality of the outputs of PSGLD is slightly higher than those obtained from SGLD. We can observe that HAMCMC provides a significant performance improvement over SGLD and PSGLD, thanks to its use of local curvature information.

4.3. Distributed Matrix Factorization

In our final set of experiments, we evaluate HAMCMC on a distributed matrix factorization problem, where we consider the following probabilistic model: W_ik ∼ N(0, σ_w²), H_kj ∼ N(0, σ_h²), X_ij | · ∼ N(Σ_k W_ik H_kj, σ_x²). Here, W ∈ R^{I×K}, H ∈ R^{K×J}, and X ∈ R^{I×J}. This model is similar to (Salakhutdinov & Mnih, 2008) and is often used in distributed matrix factorization problems (Gemulla et al., 2011; Simsekli et al., 2015b).
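For this Gaussian model, the gradient of the log posterior with respect to W has a simple closed form, which every SG-MCMC variant in this experiment evaluates blockwise. A NumPy sketch in our own notation:

```python
import numpy as np

def grad_logpost_W(X, W, H, sigma2_x, sigma2_w):
    """Gradient of log p(W, H | X) w.r.t. W for the Gaussian matrix factorization
    model: likelihood term (X - WH) H^T / sigma2_x plus the prior term -W / sigma2_w."""
    return (X - W @ H) @ H.T / sigma2_x - W / sigma2_w
```

The gradient with respect to H is symmetric (W^⊤(X − WH)/σ_x² − H/σ_h²), and in the distributed setting each node evaluates only the terms involving its own block of X.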

Inspired by Distributed SGLD (DSGLD) for matrix factorization (Ahn et al., 2015) and Parallel SGLD (Simsekli et al., 2015a), we partition X into disjoint subsets, where each subset is further partitioned into disjoint blocks. This scheme is illustrated in Fig. 4, where X is partitioned into 4 disjoint subsets Ω^(1), ..., Ω^(4) and each subset contains 4 different blocks that will be distributed among different computation nodes. By using this approach, we extend PSGLD and HAMCMC to the distributed setting similarly to DSGLD, where at each iteration the data subsample Ω_t is selected as Ω^(i) with probability |Ω^(i)| / Σ_j |Ω^(j)|. At the end of each iteration, the algorithms transfer a small block of H to a neighboring node depending on Ω_{t+1}. PSGLD and HAMCMC further need to communicate other variables that are required for computing H_t, where the communication complexity is still linear in D. Finally, HAMCMC requires computing six vector dot products at each iteration that involve all the computation nodes, such as s_t^⊤ y_t. These computations can be easily implemented by using a reduce operation, which has a communication complexity of O(M²). By using this approach, the methods have the same theoretical properties except that the stochastic gradients would be expected to have larger variances.

Figure 5. The performance of HAMCMC on a distributed matrix factorization problem. [Plot: test RMSE versus wall-clock time (sec) for DSGLD, PSGLD, and HAMCMC.]
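Each of the six global dot products can be assembled from per-node partial sums. A sketch that simulates the reduce step without an actual MPI runtime (the node layout and helper name are our own; in the C implementation this would be a single MPI reduce call):

```python
import numpy as np

def distributed_dot(s_parts, y_parts):
    """Each node holds one shard of s_t and y_t and computes a local partial dot
    product; the reduce operation then sums the resulting scalars, so only O(1)
    data per node is exchanged instead of full D-dimensional vectors."""
    partials = [sp @ yp for sp, yp in zip(s_parts, y_parts)]  # local computation
    return sum(partials)                                      # the 'reduce' (sum) step
```

Because only scalars are reduced, the per-iteration communication for these quantities is independent of D.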

For this experiment, we have implemented all the algorithms in C by using a low-level message passing protocol, namely the OpenMPI library. In order to keep the implementation simple, for HAMCMC we set M = 2. We conduct our experiments on a small-sized cluster with 4 interconnected computers, each of them with 4 Intel Xeon 2.93GHz CPUs and 192 GB of memory.

We compare distributed HAMCMC against DSGLD and distributed PSGLD on a large movie ratings dataset, namely the MovieLens 1M (grouplens.org). This dataset contains about 1 million ratings applied to I = 3883 movies by J = 6040 users, resulting in a sparse observed matrix X with 4.3% non-zero entries. We randomly select 10% of the data as the test set and partition the rest of the data as illustrated in Fig. 4. The rank of the factorization is chosen as K = 10. We set σ_w² = σ_h² = σ_x² = 1. In these experiments, we use a constant step size for each method and discard the first 50 samples as burn-in. Note that for a constant step size, we no longer have asymptotic unbiasedness; however, the bias and MSE are still bounded.

Fig. 5 shows the comparison of the algorithms in terms of the root mean squared error (RMSE) obtained on the test set. We can observe that DSGLD and PSGLD have similar convergence rates. On the other hand, HAMCMC converges faster than these methods while having almost the same computational requirements.

5. Conclusion

We presented HAMCMC, a novel SG-MCMC algorithm that achieves high convergence rates by taking the local geometry into account via the local Hessian, which is efficiently computed by the L-BFGS algorithm. HAMCMC is both efficient and powerful since it uses dense approximations of the inverse Hessian while having linear time and memory complexity. We provided a formal theoretical analysis and showed that HAMCMC is asymptotically consistent. Our experiments showed that HAMCMC achieves better convergence rates than the state-of-the-art while keeping the computational cost almost the same. As a next step, we will explore the use of HAMCMC in model selection applications (Simsekli et al., 2016).


Acknowledgments

The authors would like to thank S. Ilker Birbil and Figen Öztoprak for fruitful discussions on quasi-Newton methods. The authors would also like to thank Charles Sutton for his explanations on the Ensemble Chain Adaptation framework. This work is partly supported by the French National Research Agency (ANR) as a part of the EDISON 3D project (ANR-13-CORD-0008-02). A. Taylan Cemgil is supported by TÜBITAK 113M492 (Pavera) and Bogaziçi University BAP 10360-15A01P1.

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, 2012.

Ahn, S., Shahbaba, B., and Welling, M. Distributed stochastic gradient MCMC. In International Conference on Machine Learning, 2014.

Ahn, S., Korattikara, A., Liu, N., Rajan, S., and Welling, M. Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In International Conference on Knowledge Discovery and Data Mining, August 2015.

Brodlie, K. W., Gourlay, A. R., and Greenstadt, J. Rank-one and rank-two corrections to positive definite matrices expressed in product form. IMA Journal of Applied Mathematics, 11(1):73–82, 1973.

Bubeck, S., Eldan, R., and Lehec, J. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems, pp. 1243–1251, 2015.

Bui-Thanh, T. and Ghattas, O. A scaled stochastic Newton algorithm for Markov Chain Monte Carlo simulations. SIAM Journal on Uncertainty Quantification, 2012.

Byrd, R. H., Nocedal, J., and Schnabel, R. B. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129–156, 1994.

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. arXiv preprint arXiv:1401.7020, 2014.

Calderhead, B. and Sustik, M. Sparse approximate manifolds for differential geometric MCMC. In Advances in Neural Information Processing Systems, pp. 2879–2887, 2012.

Chaari, L., Tourneret, J. Y., Chaux, C., and Batatia, H. A Hamiltonian Monte Carlo method for non-smooth energy sampling. arXiv preprint arXiv:1401.3988, 2015.

Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2269–2277, 2015.

Chen, T., Fox, E. B., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, 2014.

Dahlin, J., Lindsten, F., and Schön, T. B. Quasi-Newton particle Metropolis-Hastings. In IFAC Symposium on System Identification, 2015.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211, 2014.

Gemulla, R., Nijkamp, E., Haas, P. J., and Sismanis, Y. Large-scale matrix factorization with distributed stochastic gradient descent. In ACM SIGKDD, 2011.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123–214, 2011.

Godsill, S. and Kuruoglu, E. E. Bayesian inference for time series with heavy-tailed symmetric α-stable noise processes. Heavy Tails '99, Applications of Heavy Tailed Distributions in Economics, Engineering and Statistics, pp. 3–5, 1999.

Hu, Y. and Loizou, P. C. Subjective comparison and evaluation of speech enhancement algorithms. Speech Communication, 49(7):588–601, 2007.

Kuruoglu, E. E. Signal processing in α-stable noise environments: a least lp-norm approach. PhD thesis, University of Cambridge, 1999.

Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York, 2003.

Li, C., Chen, C., Carlson, D., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI Conference on Artificial Intelligence, 2016.

Ma, Y. A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2899–2907, 2015.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54, 2010.

Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, Berlin, 2006.

Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, 2013.

Pereyra, M. Proximal Markov Chain Monte Carlo algorithms. arXiv preprint arXiv:1306.0187, 2013.

Qi, Y. and Minka, T. P. Hessian-based Markov Chain Monte Carlo algorithms. In First Cape Cod Workshop on Monte Carlo Methods, 2002.

Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.

Roberts, G. O. and Gilks, W. R. Convergence of adaptive direction sampling. Journal of Multivariate Analysis, 49(2):287–298, 1994.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, December 2002.

Rossky, P. J., Doll, J. D., and Friedman, H. L. Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics, 69(10):4628–4633, 1978.

Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov Chain Monte Carlo. In International Conference on Machine Learning, pp. 880–887, 2008.

Samoradnitsky, G. and Taqqu, M. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance, volume 1. CRC Press, 1994.

Sato, I. and Nakagawa, H. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In International Conference on Machine Learning, pp. 982–990, 2014.

Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 436–443, 2007.

Shang, X., Zhu, Z., Leimkuhler, B., and Storkey, A. J. Covariance-controlled adaptive Langevin thermostat for large-scale Bayesian sampling. In Advances in Neural Information Processing Systems, pp. 37–45, 2015.

Simsekli, U., Liutkus, A., and Cemgil, A. T. Alpha-stable matrix factorization. IEEE Signal Processing Letters, 22(12):2289–2293, 2015.

Simsekli, U., Koptagel, H., Güldas, H., Cemgil, A. T., Öztoprak, F., and Birbil, S. I. Parallel stochastic gradient Markov Chain Monte Carlo for matrix factorisation models. arXiv preprint arXiv:1506.01418, 2015a.

Simsekli, U., Koptagel, H., Öztoprak, F., Birbil, S. I., and Cemgil, A. T. HAMSI: Distributed incremental optimization algorithm using quadratic approximations for partially separable problems. arXiv preprint arXiv:1509.01698, 2015b.

Simsekli, U., Badeau, R., Richard, G., and Cemgil, A. T. Stochastic thermodynamic integration: efficient Bayesian model selection via stochastic gradient MCMC. In 41st International Conference on Acoustics, Speech and Signal Processing, 2016.

Stacy, E. W. A generalization of the gamma distribution. Annals of Mathematical Statistics, 33(3):1187–1192, 1962.

Teh, Y. W., Thiéry, A., and Vollmer, S. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7):1–33, 2016.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.

Xifara, T., Sherlock, C., Livingstone, S., Byrne, S., and Girolami, M. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Statistics & Probability Letters, 91:14–19, 2014.

Zhang, Y. and Sutton, C. A. Quasi-Newton methods for Markov Chain Monte Carlo. In Advances in Neural Information Processing Systems, pp. 2393–2401, 2011.