Stability of adaptive random-walk Metropolis algorithms


DESCRIPTION

"Stability of adaptive random-walk Metropolis algorithms", M. Vihola's talk at BigMC, 5th May 2011

TRANSCRIPT

Page 1: Stability of adaptive random-walk Metropolis algorithms

Stability of adaptive random-walk Metropolis algorithms

Matti Vihola ([email protected])
University of Jyväskylä, Department of Mathematics and Statistics

Joint work with Eero Saksman (University of Helsinki)

Paris, May 5th 2011

Page 2: Adaptive MCMC

◮ Adaptive MCMC algorithms have been successfully applied to ‘automatise’ the tuning of proposal distributions.

◮ In general, they are based on a collection of Markov kernels \(\{P_\theta\}_{\theta \in \Theta}\), each having \(\pi\) as the unique invariant distribution.

◮ The aim is to find a good kernel (a good value of \(\theta\)) automatically.

◮ The stochastic approximation framework is the most common way to construct such algorithms [Andrieu and Robert, 2001].

Page 3: Stochastic approximation framework

◮ One starts with some \((X_1, \Theta_1) \equiv (x_1, \theta_1) \in \mathbb{R}^d \times \mathbb{R}^e\), and for \(n \ge 2\),
\[
X_n \sim P_{\Theta_{n-1}}(X_{n-1}, \,\cdot\,), \qquad
\Theta_n = \Theta_{n-1} + \gamma_n H(\Theta_{n-1}, X_n),
\]
where \((\gamma_n)_{n \ge 2}\) is a non-negative, vanishing step size sequence and the function \(H : \mathbb{R}^e \times \mathbb{R}^d \to \mathbb{R}^e\) determines the adaptation.

◮ The adaptation aims at \(\Theta_n \to \theta_*\) such that \(h(\theta_*) = 0\), where
\[
h(\theta) := \int H(\theta, x) \, \pi(\mathrm{d}x).
\]
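As a concrete illustration (an addition, not from the talk), a minimal Python sketch of this recursion; the names sample_kernel, H and gamma are placeholders for the kernel sampler, adaptation function and step size sequence:

```python
import numpy as np

def adaptive_mcmc(sample_kernel, H, x1, theta1, n_iter, gamma, seed=0):
    """Generic stochastic-approximation adaptive MCMC loop (a sketch).

    sample_kernel(theta, x, rng) draws X_n ~ P_theta(X_{n-1}, .);
    H(theta, x) is the adaptation function; gamma(n) the step size.
    """
    rng = np.random.default_rng(seed)
    x, theta = np.asarray(x1, float), np.asarray(theta1, float)
    xs = [x.copy()]
    for n in range(2, n_iter + 1):
        x = sample_kernel(theta, x, rng)        # X_n ~ P_{Theta_{n-1}}(X_{n-1}, .)
        theta = theta + gamma(n) * H(theta, x)  # Theta_n = Theta_{n-1} + gamma_n H(Theta_{n-1}, X_n)
        xs.append(x.copy())
    return np.array(xs), theta
```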

Page 4: Stability? I

◮ Usually, one can identify a Lyapunov function \(w : \mathbb{R}^e \to [0, \infty)\) such that
\[
\langle \nabla w(\theta), h(\theta) \rangle < 0 \quad \text{whenever } h(\theta) \neq 0.
\]

◮ This suggests that, “on average”, there should be a “drift” of \(\Theta_n\) towards a \(\theta_*\) such that \(h(\theta_*) = 0\).

◮ To have sufficient “averaging”, the sequence of Markov kernels \(P_{\Theta_n}\) should have nice properties.

◮ To have nice properties of \(P_{\Theta_n}\), the sequence \((\Theta_n)_{n \ge 1}\) should behave nicely...

Page 5: Stability? II

◮ The usual way to overcome the stability issue is to enforce stability, or to use the adaptive reprojections approach of Andrieu and Moulines [2006] and Andrieu, Moulines, and Priouret [2005].

◮ There is a strong belief (based on overwhelming empirical evidence) that many algorithms are intrinsically stable (and do not require stabilisation).

Page 6: Adaptive Metropolis algorithm I

The AM algorithm [Haario, Saksman, and Tamminen, 2001]:

◮ Define \(X_1 \equiv x_1 \in \mathbb{R}^d\) such that \(\pi(x_1) > 0\).

◮ Choose parameters \(t > 0\) and \(\epsilon \ge 0\).

For \(n = 2, 3, \ldots\),
\[
Y_n = X_{n-1} + \Sigma_{n-1}^{1/2} W_n, \quad \text{where } W_n \sim N(0, I), \text{ and}
\]
\[
X_n = \begin{cases} Y_n, & \text{with probability } \alpha(X_{n-1}, Y_n), \\ X_{n-1}, & \text{otherwise,} \end{cases}
\]
where \(\Sigma_{n-1} := t^2 \operatorname{Cov}(X_1, \ldots, X_{n-1}) + \epsilon I\), and \(I \in \mathbb{R}^{d \times d}\) stands for the identity matrix. (Here \(\alpha(x, y) := \min\{1, \pi(y)/\pi(x)\}\) is the usual Metropolis acceptance probability for this symmetric proposal.)

Page 7: Adaptive Metropolis algorithm II

\(\operatorname{Cov}(\cdots)\) is some consistent covariance estimator (not necessarily the standard sample covariance!).

◮ In what follows, consider the definition \(\operatorname{Cov}(X_1, \ldots, X_n) := S_n\), where \((M_1, S_1) \equiv (x_1, s_1)\) with \(s_1 \in \mathbb{R}^{d \times d}\) symmetric and positive definite, and
\[
M_n = (1 - \gamma_n) M_{n-1} + \gamma_n X_n,
\]
\[
S_n = (1 - \gamma_n) S_{n-1} + \gamma_n (X_n - M_{n-1})(X_n - M_{n-1})^T.
\]
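A minimal Python sketch of the AM chain with this recursive estimator (an added illustration; \(\gamma_n = 1/n\), \(s_1 = I\) and the other defaults are assumptions, not prescriptions from the talk):

```python
import numpy as np

def am_chain(log_pi, x1, n_iter, t=None, eps=1e-6, seed=0):
    """Sketch of the AM chain with the recursive estimator (M_n, S_n)
    defined above; gamma_n = 1/n and s_1 = I are illustrative choices."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x1, dtype=float)
    d = x.size
    t = 2.38 / np.sqrt(d) if t is None else t  # common rule of thumb
    m, s = x.copy(), np.eye(d)                 # (M_1, S_1) = (x_1, I)
    xs = [x.copy()]
    for n in range(2, n_iter + 1):
        sigma = t**2 * s + eps * np.eye(d)     # Sigma_{n-1} = t^2 S_{n-1} + eps I
        y = x + np.linalg.cholesky(sigma) @ rng.standard_normal(d)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):  # accept w.p. alpha(x, y)
            x = y
        g = 1.0 / n                            # gamma_n
        m_prev = m.copy()
        m = (1 - g) * m + g * x                # M_n
        s = (1 - g) * s + g * np.outer(x - m_prev, x - m_prev)  # S_n
        xs.append(x.copy())
    return np.array(xs)
```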

Page 8: Example run of AM

Figure: The first 10, 100 and 1000 samples of the AM algorithm started with \(s_1 = (0.01)^2 I\), \(t = 2.38/\sqrt{2}\), \(\epsilon = 0\) and \(\gamma_n := n^{-1}\). [Three scatter plots omitted; axes roughly \(x \in [-2, 2]\), \(y \in [-10, 0]\).]

Page 9: Adaptive scaling Metropolis algorithm I

This algorithm, essentially proposed by Gilks, Roberts, and Sahu [1998] and Andrieu and Robert [2001], adjusts the size of the proposal jumps and tries to attain a given mean acceptance rate.

◮ Define \(X_1 \equiv x_1 \in \mathbb{R}^d\) such that \(\pi(x_1) > 0\).

◮ Let \(\Theta_1 \equiv \theta_1 > 0\).

◮ Define the desired mean acceptance rate \(\alpha_* \in (0, 1)\). (Usually \(\alpha_* = 0.44\) or \(\alpha_* = 0.234\)...)

Page 10: Adaptive scaling Metropolis algorithm II

For \(n = 2, 3, \ldots\), iterate
\[
Y_n = X_{n-1} + \Theta_{n-1} W_n, \quad \text{where } W_n \sim N(0, I), \text{ and}
\]
\[
X_n = \begin{cases} Y_n, & \text{with probability } \alpha(X_{n-1}, Y_n), \\ X_{n-1}, & \text{otherwise,} \end{cases}
\]
\[
\log \Theta_n = \log \Theta_{n-1} + \gamma_n \big[ \alpha(X_{n-1}, Y_n) - \alpha_* \big].
\]
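A sketch of this iteration in Python (added for illustration; \(\gamma_n = n^{-3/4}\) matches the example run on the next page, the rest is assumed):

```python
import numpy as np

def asm_chain(log_pi, x1, n_iter, theta1=1.0, alpha_star=0.44, seed=0):
    """Sketch of the adaptive scaling Metropolis recursion above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x1, dtype=float)
    log_theta = np.log(theta1)
    xs = [x.copy()]
    for n in range(2, n_iter + 1):
        y = x + np.exp(log_theta) * rng.standard_normal(x.size)
        alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))   # acceptance probability
        if rng.uniform() < alpha:
            x = y
        log_theta += n**-0.75 * (alpha - alpha_star)      # log Theta_n update
        xs.append(x.copy())
    return np.array(xs)
```

Note that the update uses the acceptance probability \(\alpha(X_{n-1}, Y_n)\) itself, not the accept/reject indicator.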

Page 11: Example run of ASM

Figure: The first 10, 100 and 1000 samples of the ASM algorithm started with \(\theta_1 = 0.01\) and using \(\gamma_n = n^{-3/4}\). [Three scatter plots omitted; axes roughly \(x \in [-2, 2]\), \(y \in [-10, 0]\).]

Page 12: Adaptive scaling within adaptive Metropolis

The AM and ASM adaptations can be used simultaneously [Atchade and Fort, 2010, Andrieu and Thoms, 2008]:
\[
Y_n = X_{n-1} + \Theta_{n-1} \Sigma_{n-1}^{1/2} W_n, \quad \text{where } W_n \sim N(0, I), \text{ and}
\]
\[
X_n = \begin{cases} Y_n, & \text{with probability } \alpha(X_{n-1}, Y_n), \\ X_{n-1}, & \text{otherwise,} \end{cases}
\]
\[
\log \Theta_n = \log \Theta_{n-1} + \gamma_n \big[ \alpha(X_{n-1}, Y_n) - \alpha_* \big],
\]
\[
M_n = (1 - \gamma_n) M_{n-1} + \gamma_n X_n,
\]
\[
S_n = (1 - \gamma_n) S_{n-1} + \gamma_n (X_n - M_{n-1})(X_n - M_{n-1})^T,
\]
\[
\Sigma_n = t^2 S_n + \epsilon I.
\]
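A sketch of one combined iteration (added; the step size \(\gamma_n = n^{-3/4}\) and the name aswam_step are illustrative assumptions):

```python
import numpy as np

def aswam_step(x, log_theta, m, s, n, log_pi, rng,
               alpha_star=0.234, t=1.0, eps=1e-6):
    """One adaptive-scaling-within-AM iteration, as defined above."""
    d = x.size
    chol = np.linalg.cholesky(t**2 * s + eps * np.eye(d))  # Sigma_{n-1}^{1/2}
    y = x + np.exp(log_theta) * (chol @ rng.standard_normal(d))
    alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))
    if rng.uniform() < alpha:
        x = y
    g = n**-0.75                                  # gamma_n (assumed)
    log_theta += g * (alpha - alpha_star)         # scaling update
    m_prev = m.copy()
    m = (1 - g) * m + g * x                       # M_n
    s = (1 - g) * s + g * np.outer(x - m_prev, x - m_prev)  # S_n
    return x, log_theta, m, s
```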

Page 13: Robust adaptive Metropolis algorithm I

The RAM algorithm is a multidimensional ‘extension’ of the ASM adaptation mechanism [Vihola, 2010b].

◮ Define \(X_1 \equiv x_1 \in \mathbb{R}^d\) such that \(\pi(x_1) > 0\).

◮ Let \(\Sigma_1 \equiv s_1 \in \mathbb{R}^{d \times d}\) be symmetric and positive definite.

◮ Define the desired mean acceptance rate \(\alpha_* \in (0, 1)\).

Page 14: Robust adaptive Metropolis algorithm II

For \(n = 2, 3, \ldots\), iterate
\[
Y_n = X_{n-1} + \Sigma_{n-1}^{1/2} W_n, \quad \text{where } W_n \sim N(0, I), \text{ and}
\]
\[
X_n = \begin{cases} Y_n, & \text{with probability } \alpha(X_{n-1}, Y_n), \\ X_{n-1}, & \text{otherwise,} \end{cases}
\]
\[
\Sigma_n = \Sigma_{n-1} + \gamma_n \big[ \alpha(X_{n-1}, Y_n) - \alpha_* \big] \, \Sigma_{n-1}^{1/2} \frac{W_n W_n^T}{\|W_n\|^2} \Sigma_{n-1}^{1/2},
\]
where \(\Sigma_n\) will be symmetric and positive definite whenever \(\gamma_n \big[ \alpha(X_{n-1}, Y_n) - \alpha_* \big] > -1\).
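A sketch of one RAM iteration (added for illustration). Here \(\Sigma_n\) is re-factorised with a full \(O(d^3)\) Cholesky for simplicity; a rank-one Cholesky update would give the \(O(d^2)\) cost mentioned later:

```python
import numpy as np

def ram_step(x, L, n, log_pi, rng, alpha_star=0.234):
    """One RAM iteration; L is any factor with Sigma_{n-1} = L L^T.
    gamma_n = min(1, 2 n^{-3/4}) matches the example run below."""
    g = min(1.0, 2.0 * n**-0.75)
    w = rng.standard_normal(x.size)
    y = x + L @ w                              # Y_n = X_{n-1} + Sigma_{n-1}^{1/2} W_n
    alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))
    if rng.uniform() < alpha:
        x = y
    u = (L @ w) / np.linalg.norm(w)            # Sigma_{n-1}^{1/2} W_n / ||W_n||
    sigma = L @ L.T + g * (alpha - alpha_star) * np.outer(u, u)
    # Positive definite since g * (alpha - alpha_star) > -1 here.
    return x, np.linalg.cholesky(sigma)
```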

Page 15: Robust adaptive Metropolis algorithm III

Figure: The RAM update with \(\gamma_n [\alpha(X_{n-1}, Y_n) - \alpha_*] = \pm 0.8\), respectively. The ellipsoids defined by \(\Sigma_{n-1}\) (\(\Sigma_n\)) are drawn in solid (dashed), and the vector \(\Sigma_{n-1}^{1/2} W_n / \|W_n\|\) as a dot. [Illustration omitted.]

Page 16: Example run of RAM

Figure: The first 10, 100 and 1000 samples of the RAM algorithm started with \(\Sigma_1 = 0.01 \times I\) and using \(\gamma_n = \min\{1, 2 \cdot n^{-3/4}\}\). [Three scatter plots omitted; axes roughly \(x \in [-2, 2]\), \(y \in [-10, 0]\).]

Page 17: RAM vs. ASwAM

◮ Computationally, the two are of similar complexity (a Cholesky update of order \(O(d^2)\) per iteration).

◮ The RAM adaptation does not require that \(\pi\) has a finite second moment (or any moments at all).

◮ In RAM, the adaptation is ‘direct’, while estimating the covariance is ‘indirect’ (as it also requires a mean estimate).

◮ There are also situations where ASwAM seems to be more efficient than RAM...

Page 18: Previous conditions on validity I

◮ What conditions are sufficient to guarantee that a law of large numbers (LLN) holds:
\[
\frac{1}{n} \sum_{k=1}^{n} f(X_k) \xrightarrow{n \to \infty} \int f(x) \, \pi(x) \, \mathrm{d}x \;?
\]

◮ Key conditions due to Roberts and Rosenthal [2007]:

Diminishing adaptation (DA): the effect of the adaptation becomes smaller and smaller,
\[
\| P_{\Theta_n} - P_{\Theta_{n-1}} \| \xrightarrow{n \to \infty} 0.
\]

Containment (C): the kernels \(P_{\Theta_n}\) have, at all times, sufficiently good mixing properties:
\[
\Big( \sup_{n \ge 1} \sup_{k \ge m} \| P_{\Theta_n}^k (X_n, \,\cdot\,) - \pi \| \Big) \xrightarrow{m \to \infty} 0.
\]

Page 19: Previous conditions on validity II

◮ DA is usually easier to verify.

◮ Containment is often tightly related to the stability of the process \((\Theta_n)_{n \ge 1}\).

◮ Establishing containment (or stability) is often difficult before showing that a LLN holds...

◮ The typical solution has been to enforce containment.

◮ In the case of the AM, ASM and RAM algorithms, this means that the eigenvalues of \(\Sigma_n\), \(\Theta_n\) are constrained to lie within \([a, b]\) for some constants \(0 < a \le b < \infty\).

Page 20: Previous conditions on validity III

There are also other results in the literature on the ergodicity of adaptive MCMC under conditions similar to DA and C:

◮ The original work on AM [Haario, Saksman, and Tamminen, 2001].

◮ Atchade and Rosenthal [2005] analyse the ASM algorithm following the original mixingale approach.

◮ The results on the adaptive reprojections approach: Andrieu and Moulines [2006], Andrieu, Moulines, and Priouret [2005].

◮ The recent paper by Atchade and Fort [2010] is based on a resolvent kernel approach.

Page 21: New results I

Key ideas in the new results:

◮ Containment is, in fact, not a necessary condition for a LLN.

◮ That is, the ergodic properties of \(P_{\Theta_n}\) can become more and more unfavourable, and still a (strong) LLN can hold.

◮ The adaptation mechanism may include a strong explicit ‘drift’ away from ‘bad values’ of \(\theta\).

Page 22: General ergodicity result I

◮ Assume \(K_1 \subset K_2 \subset \cdots \subset \mathbb{R}^e\) are increasing subsets of the adaptation space. Assume that the adaptation follows the stochastic approximation dynamic
\[
X_n \sim P_{\Theta_{n-1}}(X_{n-1}, \,\cdot\,), \qquad
\Theta_n = \Theta_{n-1} + \gamma_n H(\Theta_{n-1}, X_n),
\]
and that the adaptation parameter satisfies \(\Theta_n \in K_n\) for all \(n \ge 1\).

◮ One may also enforce \(\Theta_n \in K_n\)...

◮ Consider the following conditions for constants \(c < \infty\) and \(0 \le \epsilon \ll 1\):

Page 23: General ergodicity result II

(A1) Drift and minorisation:

There is a drift function \(V : \mathbb{R}^d \to [1, \infty)\) such that for all \(n \ge 1\) and all \(\theta \in K_n\),
\[
P_\theta V(x) \le \lambda_n V(x) + \mathbf{1}_{C_n}(x) \, b_n \quad \text{and} \tag{1}
\]
\[
P_\theta(x, A) \ge \mathbf{1}_{C_n}(x) \, \delta_n \nu_n(A), \tag{2}
\]
where \(C_n \subset \mathbb{R}^d\) are Borel sets, \(\delta_n, \lambda_n \in (0, 1)\) and \(b_n < \infty\) are constants, and \(\nu_n\) is concentrated on \(C_n\). Furthermore, the constants are polynomially bounded so that
\[
(1 - \lambda_n)^{-1} \vee \delta_n^{-1} \vee b_n \le c \, n^{\epsilon}.
\]

Page 24: General ergodicity result III

(A2) Continuity:

For all \(n \ge 1\) and any \(r \in (0, 1]\), there is \(c' = c'(r) \ge 1\) such that for all \(\theta, \theta' \in K_n\),
\[
\| P_\theta f - P_{\theta'} f \|_{V^r} \le c' n^{\epsilon} \, \| f \|_{V^r} \, |\theta - \theta'|.
\]

(A3) Bound for the adaptation function:

There is a \(\beta \in [0, 1/2]\) such that for all \(n \ge 1\), \(\theta \in K_n\) and \(x \in \mathbb{R}^d\),
\[
|H(\theta, x)| \le c \, n^{\epsilon} V^{\beta}(x).
\]

Page 25: General ergodicity result IV

Theorem (Saksman and Vihola [2010])

Assume (A1)–(A3) hold and let \(f\) be a function with \(\| f \|_{V^{\alpha}} < \infty\) for some \(\alpha \in (0, 1 - \beta)\). Assume \(\epsilon < \kappa_*^{-1} \big[ (1/2) \wedge (1 - \alpha - \beta) \big]\), where \(\kappa_* \gg 1\) is an independent constant, and that \(\sum_{k=1}^{\infty} k^{\kappa_* \epsilon - 1} \gamma_k < \infty\). Then,
\[
\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} f(X_k) = \int f(x) \, \pi(x) \, \mathrm{d}x \quad \text{almost surely.}
\]

Page 26: Verifiable assumptions I

Definition (Strongly super-exponential target)

The target density \(\pi\) is continuously differentiable and has regular tails that decay super-exponentially:
\[
\limsup_{|x| \to \infty} \frac{x}{|x|} \cdot \frac{\nabla \pi(x)}{|\nabla \pi(x)|} < 0
\qquad \text{and} \qquad
\lim_{|x| \to \infty} \frac{x}{|x|^{\rho}} \cdot \nabla \log \pi(x) = -\infty,
\]
for some \(\rho > 1\).

(The “super-exponential case”, \(\rho = 1\), is due to Jarner and Hansen [2000], who use it to ensure the geometric ergodicity of a nonadaptive random-walk Metropolis process.)
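As a quick added check (not from the slides): the standard Gaussian target \(\pi(x) \propto e^{-|x|^2/2}\) is strongly super-exponential, since \(\nabla \log \pi(x) = -x\) and \(\nabla \pi(x) = -x \, \pi(x)\) give
\[
\frac{x}{|x|} \cdot \frac{\nabla \pi(x)}{|\nabla \pi(x)|} = \frac{x}{|x|} \cdot \Big( -\frac{x}{|x|} \Big) = -1 < 0
\qquad \text{and} \qquad
\frac{x}{|x|^{\rho}} \cdot \nabla \log \pi(x) = -|x|^{2 - \rho} \xrightarrow{|x| \to \infty} -\infty
\]
for any \(\rho \in (1, 2)\).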

Page 27: Verifiable assumptions II

Figure: The condition \(\limsup_{|x| \to \infty} \frac{x}{|x|} \cdot \frac{\nabla \pi(x)}{|\nabla \pi(x)|} < 0\) implies that there is an \(\varepsilon > 0\) such that for any sufficiently large \(|x|\), the angle \(\alpha < \pi/2 - \varepsilon\). [Sketch omitted; it shows a point \(x\), the origin \(0\), the gradient \(\nabla \pi(x)\) and the angle \(\alpha\).]

Page 28: AM when π has unbounded support

◮ Saksman and Vihola [2010]: the first ergodicity results for the original AM algorithm, without upper bounding the eigenvalues \(\lambda(\Sigma_n)\).

◮ Use the proposal covariance \(\Sigma_n = t^2 S_n + \epsilon I\), with \(\epsilon > 0\).

◮ SLLN and CLT for strongly super-exponential \(\pi\) and functions satisfying \(\sup_x |f(x)| \pi^{-p}(x) < \infty\) for some \(p \in (0, 1)\).

Page 29: AM without the covariance lower bound I

Vihola [2011b] contains partial results on the case \(\Sigma_n = t^2 S_n\), that is, without lower bounding the eigenvalues \(\lambda(\Sigma_n)\).

◮ First, an ‘adaptive random walk’ is analysed: AM run with the ‘flat target’ \(\pi \equiv 1\),
\[
X_{n+1} = X_n + t S_n^{1/2} W_{n+1},
\]
\[
M_{n+1} = (1 - \gamma_{n+1}) M_n + \gamma_{n+1} X_{n+1},
\]
\[
S_{n+1} = (1 - \gamma_{n+1}) S_n + \gamma_{n+1} (X_{n+1} - M_n)^2.
\]

◮ Ten sample paths of this (univariate) process started at \(x_0 = 0\), \(s_0 = 1\) and with \(t = 0.01\) and \(\gamma_n = n^{-1}\) are shown on the next two pages:

Page 30: AM without the covariance lower bound II

[Figure: ten sample paths of \(X_n\) (top panel, values within about \(\pm 0.1\)) and \(S_n\) (bottom panel, log scale from \(10^{-4}\) to \(10^{0}\)) over the first 10,000 iterations.]

Page 31: AM without the covariance lower bound III

[Figure: the same sample paths continued to \(2 \times 10^6\) iterations; \(X_n\) now ranges over hundreds, and \(S_n\) is shown on a log scale from \(10^{-5}\) to \(10^{5}\).]

Page 32: AM without the covariance lower bound IV

◮ It is shown that \(S_n \to \infty\) almost surely.

◮ The speed of growth is \(\mathbb{E}[S_n] \sim \exp\big( t \sum_{i=2}^{n} \sqrt{\gamma_i} \big)\). (In particular, \(\mathbb{E}[S_n] \sim e^{2 t \sqrt{n}}\) when \(\gamma_n = n^{-1}\).)

◮ Using the same techniques, one can show the stability (and ergodicity) of AM run with a univariate Laplace target \(\pi\).

◮ These results have little direct practical value, but they indicate that the AM covariance parameter \(S_n\) does not tend to collapse.
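A minimal sketch reproducing this ‘flat target’ experiment (added; parameter names are illustrative):

```python
import numpy as np

def adaptive_random_walk(n_iter, t=0.01, x0=0.0, s0=1.0, seed=0):
    """Univariate AM run with the flat target pi = 1 (every proposal accepted)."""
    rng = np.random.default_rng(seed)
    x, m, s = x0, x0, s0
    xs, ss = [x], [s]
    for n in range(1, n_iter):
        x = x + t * np.sqrt(s) * rng.standard_normal()  # X_{n+1}
        g = 1.0 / (n + 1)                               # gamma_{n+1} = 1/(n+1)
        m_prev = m
        m = (1 - g) * m + g * x                         # M_{n+1}
        s = (1 - g) * s + g * (x - m_prev)**2           # S_{n+1}
        xs.append(x)
        ss.append(s)
    return np.array(xs), np.array(ss)

# Ten paths as in the figures: e.g. [adaptive_random_walk(10_000, seed=k) for k in range(10)].
# The divergence S_n -> infinity is slow: E[S_n] ~ exp(2 t sqrt(n)) for gamma_n = 1/n.
```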

Page 33: AM with a fixed proposal component I

◮ Instead of lower bounding the eigenvalues \(\lambda(\Sigma_n)\), employ a fixed proposal component with probability \(\beta \in (0, 1)\) [Roberts and Rosenthal, 2009].

◮ This corresponds to an algorithm where the proposals \(Y_n\) are generated by
\[
Y_n = X_{n-1} + \begin{cases} \Sigma_0^{1/2} W_n, & \text{with probability } \beta, \\ \Sigma_{n-1}^{1/2} W_n, & \text{otherwise,} \end{cases}
\]
with some fixed symmetric and positive definite \(\Sigma_0\).

◮ In other words, one employs a mixture of ‘adaptive’ and ‘nonadaptive’ Markov kernels:
\[
P(X_n \in A \mid X_1, \ldots, X_{n-1}) = (1 - \beta) P_{\Sigma_{n-1}}(X_{n-1}, A) + \beta P_{\Sigma_0}(X_{n-1}, A).
\]
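A sketch of the mixture proposal draw (added; L0 and L stand for factors of \(\Sigma_0\) and \(\Sigma_{n-1}\)):

```python
import numpy as np

def mixture_proposal(x, L0, L, beta, rng):
    """Y_n = X_{n-1} + Sigma_0^{1/2} W_n with prob. beta, else Sigma_{n-1}^{1/2} W_n."""
    w = rng.standard_normal(x.size)
    return x + (L0 if rng.uniform() < beta else L) @ w
```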

Page 34: AM with a fixed proposal component II

Vihola [2011b] shows that, for example with a bounded and compactly supported \(\pi\), or with a super-exponential \(\pi\), the eigenvalues \(\lambda(\Sigma_n)\) are bounded away from zero.

◮ For a strongly super-exponential target, the SLLN (resp. CLT) holds for functions satisfying \(\sup_x |f(x)| \pi^{-p}(x) < \infty\) for some \(p \in (0, 1)\) (resp. \(p \in (0, 1/2)\)).

◮ Explicit upper and lower bounds for \(\Sigma_n\) are unnecessary.

◮ The mixture proposal may be better in practice than the lower bound \(\epsilon I\).

Bai, Roberts, and Rosenthal [2008] also analyse this algorithm:

◮ A completely different approach, with different assumptions.

◮ Exponentially decaying \(\pi\) are also considered; in this case the fixed covariance \(\Sigma_0\) must be large enough.

Page 35: Ergodicity of unconstrained ASM algorithm

Vihola [2011a] shows two results for the unmodified adaptation, without any (upper or lower) bounds for \(\Theta_n\). Assume:

◮ the desired acceptance rate \(\alpha_* \in (0, 1/2)\), and

◮ the adaptation weights satisfy \(\sum \gamma_n^2 < \infty\) (e.g. \(\gamma_n = n^{-\gamma}\) with \(\gamma \in (1/2, 1]\)).

Two cases:

1. \(\pi\) is bounded, bounded away from zero on its support, and the support \(\mathsf{X} = \{x : \pi(x) > 0\}\) is compact and has a smooth boundary. Then the SLLN holds for bounded functions.

2. \(\pi\) is strongly super-exponential with tails having uniformly smooth contours. Then the SLLN holds for functions satisfying \(\sup_x |f(x)| \pi^{-p}(x) < \infty\) for some \(p \in (0, 1)\).

Page 36: Ergodicity of ASM within AM

◮ The results in [Vihola, 2011a] also cover the adaptive scaling within adaptive Metropolis algorithm.

◮ However, one has to assume that the eigenvalues of the covariance factor \(\Sigma_n\) are bounded within \([a, b]\) for some constants \(0 < a \le b < \infty\).

Page 37: Final remarks I

◮ The current results show that some adaptive MCMC algorithms are intrinsically stable, requiring no additional stabilisation structures.

◮ This is easier for practitioners: there are fewer parameters to ‘tune’.

◮ It shows that the methods are fairly ‘safe’ to apply.

◮ The results are related to the more general question of the stability of Robbins–Monro stochastic approximation with Markovian noise.

◮ A manuscript on this is in preparation with C. Andrieu...

◮ Fort, Moulines, and Priouret [2010] also consider interacting MCMC (e.g. the equi-energy sampler) and extend the approach of [Saksman and Vihola, 2010].

Page 38: Final remarks II

◮ It is not necessary to use Gaussian proposal distributions. All the above results apply to elliptical proposals satisfying a particular tail decay condition. For example, the results apply to multivariate Student distributions of the form
\[
q_c(z) \propto \big( 1 + \| c^{-1/2} z \|^2 \big)^{-\frac{d + p}{2}},
\]
where \(p > 0\) [Vihola, 2011a].
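A sketch of drawing such a Student increment via the standard normal/chi-square construction (added; the factor \(p\) is absorbed into the scale \(c\)):

```python
import numpy as np

def student_increment(L, p, rng):
    """Elliptical Student draw: Z = L W / sqrt(G/p), W ~ N(0, I), G ~ chi^2_p.
    The density of Z is q_c above with c proportional to L L^T."""
    w = rng.standard_normal(L.shape[0])
    g = rng.chisquare(p)
    return (L @ w) / np.sqrt(g / p)
```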

◮ The current results apply only to targets with rapidly decaying tails. It would be important to establish similar results for heavy-tailed targets.

◮ Overall, there is a need for more general yet practically verifiable conditions for checking the validity of the methods.

Page 39: Final remarks III

◮ There is also free software available for testing several adaptive random-walk MCMC algorithms, including the new RAM approach [Vihola, 2010a]: http://iki.fi/mvihola/grapham/


Thank you!

References

C. Andrieu and E. Moulines. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab., 16(3):1462–1505, 2006.

C. Andrieu and C. P. Robert. Controlled MCMC for optimal sampling. Technical Report Ceremade 0125, Université Paris Dauphine, 2001.

C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statist. Comput., 18(4):343–373, Dec. 2008.

C. Andrieu, E. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim., 44(1):283–312, 2005.


Y. Atchade and G. Fort. Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Bernoulli, 16(1):116–154, Feb. 2010.

Y. F. Atchade and J. S. Rosenthal. On adaptive Markov chain Monte Carlo algorithms. Bernoulli, 11(5):815–828, 2005.

Y. Bai, G. O. Roberts, and J. S. Rosenthal. On the containment condition for adaptive Markov chain Monte Carlo algorithms. Preprint, July 2008. URL http://probability.ca/jeff/research.html.

G. Fort, E. Moulines, and P. Priouret. Convergence of adaptive MCMC algorithms: ergodicity and law of large numbers. Technical report, 2010.


W. R. Gilks, G. O. Roberts, and S. K. Sahu. Adaptive Markov chain Monte Carlo through regeneration. J. Amer. Statist. Assoc., 93(443):1045–1054, 1998.

H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.

S. F. Jarner and E. Hansen. Geometric ergodicity of Metropolis algorithms. Stochastic Process. Appl., 85:341–361, 2000.

G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J. Appl. Probab., 44(2):458–475, 2007.

G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. J. Comput. Graph. Statist., 18(2):349–367, 2009.


E. Saksman and M. Vihola. On the ergodicity of the adaptive Metropolis algorithm on unbounded domains. Ann. Appl. Probab., 20(6):2178–2203, Dec. 2010.

M. Vihola. Grapham: Graphical models with adaptive random walk Metropolis algorithms. Comput. Statist. Data Anal., 54(1):49–54, Jan. 2010a.

M. Vihola. Robust adaptive Metropolis algorithm with coerced acceptance rate. Preprint, arXiv:1011.4381v1, Nov. 2010b.

M. Vihola. On the stability and ergodicity of adaptive scaling Metropolis algorithms. Preprint, arXiv:0903.4061v3, Apr. 2011a.

M. Vihola. Can the adaptive Metropolis algorithm collapse without the covariance lower bound? Electron. J. Probab., 16:45–75, 2011b.
