An introduction to advanced (?) MCMC methods
Christian P. Robert
Université Paris-Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
Royal Statistical Society, October 13, 2010
1 Motivating example
2 The Metropolis-Hastings Algorithm
Motivating example

Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models
f(x|θ) = ∫ f⋆(x, x⋆|θ) dx⋆
If (x, x⋆) observed, fine!
If only x observed, trouble!
Example (Mixture models)
Models of mixtures of distributions:
X ∼ fj with probability pj,
for j = 1, 2, . . . , k, with overall density
X ∼ p1 f1(x) + · · · + pk fk(x).
For a sample of independent random variables (X1, . . . , Xn), the sample density is
∏_{i=1}^n {p1 f1(xi) + · · · + pk fk(xi)}.
Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
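As a small illustration (not from the slides, and with made-up parameter values), the mixture likelihood never needs the k^n-term expansion: evaluating the product of n sums directly costs only O(nk). A minimal NumPy sketch:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out to stay self-contained."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_loglik(x, p, mu, sigma):
    """Log-likelihood of a k-component Gaussian mixture, computed directly
    as a product of n sums (O(nk) work), never expanding the k**n terms."""
    comp = normal_pdf(x[:, None], mu, sigma)  # shape (n, k)
    return float(np.log(comp @ p).sum())

# usage on data simulated from a two-component mixture (toy values)
rng = np.random.default_rng(0)
x = np.where(rng.uniform(size=200) < 0.3,
             rng.normal(0.0, 1.0, 200), rng.normal(2.5, 1.0, 200))
ll = mixture_loglik(x, np.array([0.3, 0.7]),
                    np.array([0.0, 2.5]), np.array([1.0, 1.0]))
```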
[Figure: log-likelihood surface of the mixture 0.3 N(µ1, 1) + 0.7 N(µ2, 1), as a function of (µ1, µ2)]
A typology of Bayes computational problems
(i) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii) use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii) use of a huge dataset;
(iv) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v) use of a complex inferential procedure, as for instance Bayes factors
Bπ01(x) = { P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x) } / { π(θ ∈ Θ0) / π(θ ∈ Θ1) }.
The Metropolis-Hastings Algorithm

1 Motivating example
2 The Metropolis-Hastings Algorithm
  Monte Carlo Methods based on Markov Chains
  The Metropolis–Hastings algorithm
  A collection of Metropolis-Hastings algorithms
  Extensions
  Convergence assessment
Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains
Fact: It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
We can obtain X1, . . . , Xn ∼ f (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
Ensures the convergence in distribution of (X(t)) to a random variable from f.
For a "large enough" T0, X(T0) can be considered as distributed from f.
Produces a dependent sample X(T0), X(T0+1), . . ., which is generated from f, sufficient for most approximation purposes.
The Metropolis–Hastings algorithm

Problem: How can one build a Markov chain with a given stationary distribution?
MH basics
Algorithm that converges to the objective (target) density f using an arbitrary transition kernel density q(x, y), called the instrumental (or proposal) distribution.
The MH algorithm
Algorithm (Metropolis–Hastings)
Given x(t),
1 Generate Yt ∼ q(x(t), y).
2 Take
X(t+1) = Yt with prob. ρ(x(t), Yt), and X(t+1) = x(t) with prob. 1 − ρ(x(t), Yt),
where
ρ(x, y) = min{ [f(y)/f(x)] · [q(y, x)/q(x, y)], 1 }.
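The two steps above translate almost verbatim into code. The sketch below is illustrative (not code from the slides); the target and the deliberately non-symmetric proposal in the usage part are our own toy choices:

```python
import numpy as np

def metropolis_hastings(logf, proposal_sample, proposal_logpdf, x0, n_iter, rng):
    """Generic Metropolis–Hastings sketch.
    logf            : log of the (possibly unnormalised) target density f
    proposal_sample : x -> one draw Y ~ q(x, .)
    proposal_logpdf : (x, y) -> log q(x, y)
    """
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = proposal_sample(x)
        # log acceptance ratio: log f(y) - log f(x) + log q(y, x) - log q(x, y)
        log_rho = (logf(y) - logf(x)
                   + proposal_logpdf(y, x) - proposal_logpdf(x, y))
        if np.log(rng.uniform()) < log_rho:
            x = y                # accept the proposal
        chain[t] = x             # otherwise repeat the current value
    return chain

# usage: N(0, 1) target with a shifted (hence non-symmetric) N(x + 0.1, 1) proposal
rng = np.random.default_rng(0)
chain = metropolis_hastings(
    logf=lambda x: -0.5 * x**2,
    proposal_sample=lambda x: x + 0.1 + rng.normal(),
    proposal_logpdf=lambda x, y: -0.5 * (y - x - 0.1)**2,
    x0=0.0, n_iter=5000, rng=rng)
```

Note that only ratios of f and q appear, so normalizing constants are never needed, exactly as the Features slide states.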
Features
Independent of normalizing constants for both f and q(x, ·) (i.e., those constants independent of x)
Never move to values with f(y) = 0
The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure
The sequence (yt)t is usually not a Markov chain
Satisfies the detailed balance condition
f(x) K(x, y) = f(y) K(y, x)
[Green, 1995]
Convergence properties
1 The M-H Markov chain is reversible, with invariant/stationary density f.
2 As f is a probability measure, the chain is positive recurrent.
3 If
Pr[ f(Yt) q(Yt, X(t)) / {f(X(t)) q(X(t), Yt)} ≥ 1 ] < 1,   (1)
i.e., if the event {X(t+1) = X(t)} occurs with positive probability, then the chain is aperiodic.
Convergence properties (2)
4 If
q(x, y) > 0 for every (x, y),   (2)
the chain is irreducible.
5 For M-H, f-irreducibility implies Harris recurrence.
6 Thus, under conditions (1) and (2):
(i) For h with Ef |h(X)| < ∞,
lim_{T→∞} (1/T) ∑_{t=1}^T h(X(t)) = ∫ h(x) f(x) dx   a.e. f.
(ii) and
lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0
for every initial distribution µ, where K^n(x, ·) denotes the kernel for n transitions.
A collection of Metropolis-Hastings algorithms

The Independent Case
The instrumental distribution q(x, ·) is independent of x and is denoted g
Algorithm (Independent Metropolis-Hastings)
Given x(t),
1 Generate Yt ∼ g(y)
2 Take
X(t+1) = Yt with prob. min{ f(Yt) g(x(t)) / [f(x(t)) g(Yt)], 1 }, and X(t+1) = x(t) otherwise.
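A sketch of the independent case (illustrative code, not from the slides; the Beta-like target and uniform proposal are our own choices, picked so that f/g is bounded, anticipating the uniform ergodicity theorem below):

```python
import numpy as np

def independent_mh(logf, sample_g, logg, x0, n_iter, rng):
    """Independent Metropolis–Hastings sketch: the proposal g does not
    depend on the current state of the chain."""
    x, lfx = x0, logf(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = sample_g()
        lfy = logf(y)
        # acceptance ratio f(y) g(x) / [f(x) g(y)], on the log scale
        if np.log(rng.uniform()) < lfy - lfx + logg(x) - logg(y):
            x, lfx = y, lfy
        chain[t] = x
    return chain

# usage: unnormalised Beta(2, 5) target, x (1 - x)^4, with a Uniform(0, 1)
# proposal, so that f/g is bounded on the support
rng = np.random.default_rng(0)
chain = independent_mh(
    logf=lambda x: np.log(x) + 4.0 * np.log(1.0 - x),
    sample_g=lambda: rng.uniform(),
    logg=lambda x: 0.0,
    x0=0.5, n_iter=20000, rng=rng)
```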
Properties
The resulting sample is not iid but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a constant M such that
f(x) ≤ M g(x),   x ∈ supp f.
In this case,
‖K^n(x, ·) − f‖_TV ≤ (1 − 1/M)^n.
[Mengersen & Tweedie, 1996]
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
xt+1 = ϕ xt + εt+1,   εt ∼ N(0, τ²),
and observables
yt | xt ∼ N(xt², σ²)
The distribution of xt given xt−1, xt+1 and yt is proportional to
exp −(1/2τ²) { (xt − ϕ xt−1)² + (xt+1 − ϕ xt)² + (τ²/σ²)(yt − xt²)² }.
Example (Noisy AR(1) too)
Use for proposal the N(µt, ωt²) distribution, with
µt = ϕ(xt−1 + xt+1)/(1 + ϕ²)   and   ωt² = τ²/(1 + ϕ²).
The ratio
π(x)/q_ind(x) ∝ exp{ −(yt − xt²)²/2σ² }
is bounded
[Figure: (top) last 500 realisations of the chain {Xk}k out of 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution]
Random walk Metropolis–Hastings
Instead, use a local perturbation as proposal,
Yt = X(t) + εt,
where εt ∼ g, independent of X(t). The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if g is symmetric:
g(x) = g(−x)
Algorithm (Random walk Metropolis)
Given x(t),
1 Generate Yt ∼ g(y − x(t))
2 Take
X(t+1) = Yt with prob. min{ 1, f(Yt)/f(x(t)) }, and X(t+1) = x(t) otherwise.
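With a symmetric g, the instrumental ratio cancels and only f(Yt)/f(x(t)) remains. An illustrative sketch (our own toy two-mode target, echoing the mixture figures further on; the scale is the tuning knob discussed later):

```python
import numpy as np

def rw_metropolis(logf, x0, scale, n_iter, rng):
    """Random walk Metropolis sketch with Gaussian increments."""
    x, lfx = x0, logf(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()          # symmetric proposal g
        lfy = logf(y)
        if np.log(rng.uniform()) < lfy - lfx:  # min{1, f(y)/f(x)}
            x, lfx = y, lfy
        chain[t] = x
    return chain

# usage: two-component target .7 N(0, 1) + .3 N(4, 1); a small scale gets
# stuck near one mode while a larger one visits both
rng = np.random.default_rng(0)
def log_mixture(x):
    return np.logaddexp(np.log(0.7) - 0.5 * x**2,
                        np.log(0.3) - 0.5 * (x - 4.0)**2)
chain = rw_metropolis(log_mixture, x0=0.0, scale=2.0, n_iter=20000, rng=rng)
```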
Probit illustration
Likelihood and posterior given by
π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^n Φ(xiᵀβ)^{yi} (1 − Φ(xiᵀβ))^{ni−yi}
under the flat prior.
A random walk proposal works well for a small number of predictors. Use the maximum likelihood estimate β̂ as starting value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂, as scale.
MCMC algorithm
Probit random-walk Metropolis-Hastings
Initialization: Set β(0) = β̂ and compute Σ̂
Iteration t:
1 Generate β̃ ∼ N_{k+1}(β(t−1), τ Σ̂)
2 Compute
ρ(β(t−1), β̃) = min( 1, π(β̃|y)/π(β(t−1)|y) )
3 With probability ρ(β(t−1), β̃) set β(t) = β̃; otherwise set β(t) = β(t−1).
R bank benchmark
Probit modelling with no intercept over the four measurements. Three different scales τ = 1, 0.1, 10: best mixing behavior is associated with τ = 1. Average of the parameters over 9,000 MCMC iterations gives the plug-in estimate
[Figure: traces, histograms and autocorrelations of the four coefficients for the three scales]
p̂i = Φ(−1.2193 xi1 + 0.9540 xi2 + 0.9795 xi3 + 1.1481 xi4).
Example (Mixture models)
π(θ|x) ∝ ∏_{j=1}^n ( ∑_{ℓ=1}^k pℓ f(xj|µℓ, σℓ) ) π(θ)
Metropolis-Hastings proposal:
θ(t+1) = θ(t) + ωε(t) if u(t) < ρ(t), and θ(t+1) = θ(t) otherwise,
where
ρ(t) = π(θ(t) + ωε(t)|x) / π(θ(t)|x) ∧ 1
and ω scaled for good acceptance rate
[Figure: random walk MCMC output for .7 N(µ1, 1) + .3 N(µ2, 1) and scale 1, shown at iterations 1, 10, 100, 500 and 1000]
[Figure: random walk MCMC output for .7 N(µ1, 1) + .3 N(µ2, 1) and scale √.1, shown at iterations 10, 100, 500, 1000, 5000 and 10,000]
Convergence properties
Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain (X(t)) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail effect
Example (Comparison of tail effects)
Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)⁻³
[Figure: 90% confidence envelopes of the means, derived from 500 parallel independent chains, for cases (a) and (b)]
Extensions

There are many other families of MH algorithms:
Adaptive Rejection Metropolis Sampling
Reversible Jump
Langevin algorithms
to name just a few...
Langevin Algorithms
Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation
dLt = dBt + (1/2) ∇ log f(Lt) dt,
where Bt is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
Discretization
Because continuous time cannot be simulated, consider the discretised sequence
x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt,   εt ∼ Np(0, Ip),
where σ² corresponds to the discretisation step
[Figure: histogram fit for the example f(x) = exp(−x⁴), with σ² = .1]
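The discretised (unadjusted) recursion is a one-liner in code. An illustrative sketch on the slides' example f(x) = exp(−x⁴), where ∇ log f(x) = −4x³ (the step size and iteration count are our own choices, small enough to stay in the stable regime):

```python
import numpy as np

def ula(grad_logf, x0, sigma2, n_iter, rng):
    """Unadjusted Langevin discretisation sketch:
    x_{t+1} = x_t + (sigma2 / 2) grad log f(x_t) + sigma * eps_t."""
    sigma = np.sqrt(sigma2)
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        x = x + 0.5 * sigma2 * grad_logf(x) + sigma * rng.normal()
        chain[t] = x
    return chain

# usage: f(x) = exp(-x**4), so grad log f(x) = -4 x**3
rng = np.random.default_rng(0)
chain = ula(lambda x: -4.0 * x**3, x0=0.0, sigma2=0.01, n_iter=50000, rng=rng)
```

Without the Metropolis correction below, the chain only targets an approximation of f, and (as the transience slide warns) too large a σ² can make it explode.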
[Figure: the same histogram fit for σ² = .01, .001, .0001 and .0001∗]
Unfortunately, the discretized chain may be transient, for instance when
lim_{x→±∞} |σ² ∇ log f(x)| / |x| > 1
Example: f(x) = exp(−x⁴) when σ² = .2
MH correction
Accept the new value Yt with probability
[f(Yt)/f(x(t))] · exp{ −‖Yt − x(t) − (σ²/2)∇ log f(x(t))‖²/2σ² } / exp{ −‖x(t) − Yt − (σ²/2)∇ log f(Yt)‖²/2σ² } ∧ 1.
Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998]
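The corrected (Metropolis-adjusted) step can be sketched as follows; an illustrative one-dimensional transcription, not code from the slides, again on the toy target f(x) = exp(−x⁴):

```python
import numpy as np

def mala_step(logf, grad_logf, x, sigma2, rng):
    """One Metropolis-adjusted Langevin step (1-d sketch)."""
    sigma = np.sqrt(sigma2)
    mean_fwd = x + 0.5 * sigma2 * grad_logf(x)
    y = mean_fwd + sigma * rng.normal()       # Langevin proposal
    mean_bwd = y + 0.5 * sigma2 * grad_logf(y)
    # log acceptance ratio: target ratio times the ratio of the two
    # Gaussian proposal densities, matching the displayed probability
    log_rho = (logf(y) - logf(x)
               - (x - mean_bwd) ** 2 / (2.0 * sigma2)
               + (y - mean_fwd) ** 2 / (2.0 * sigma2))
    return y if np.log(rng.uniform()) < log_rho else x

# usage: f(x) = exp(-x**4), now sampled exactly thanks to the correction
rng = np.random.default_rng(0)
chain = np.empty(20000)
x = 0.0
for t in range(20000):
    x = mala_step(lambda z: -z**4, lambda z: -4.0 * z**3, x, sigma2=0.1, rng=rng)
    chain[t] = x
```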
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of view. Most common alternatives:
1 a fully automated algorithm like ARMS;
2 an instrumental density g which approximates f, such that f/g is bounded for uniform ergodicity to apply;
3 a random walk
In both cases (2) and (3), the choice of g is critical.
Case of the random walk
Different approach to acceptance rates: a high acceptance rate does not indicate that the algorithm is moving correctly, since it indicates that the random walk is moving too slowly on the surface of f.
If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), yt is accepted with probability
min( f(yt)/f(x(t)), 1 ) ≃ 1.
For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f(yt)tend to be small compared with f(x(t)), which means that therandom walk moves quickly on the surface of f since it oftenreaches the “borders” of the support of f
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
[Figure: Markov chain based on a random walk with scale ω = .1]
[Figure: Markov chain based on a random walk with scale ω = .5]
Where do we stand?
MCMC in a nutshell:
Running a sequence Xt+1 = Ψ(Xt, Yt) provides an approximation to the target density f when the detailed balance condition holds:
f(x) K(x, y) = f(y) K(y, x)
Easiest implementation of the principle is random walk Metropolis-Hastings:
Yt = X(t) + εt
Practical convergence requires sufficient energy from the proposal, which is calibrated by trial and error.
Convergence assessment

Convergence diagnostics
How many iterations?
Rule # 1 There is no absolute number of simulations, i.e. 1,000 is neither large, nor small.
Rule # 2 It takes [much] longer to check for convergence than for the chain itself to converge.
Rule # 3 MCMC is a "what-you-get-is-what-you-see" algorithm: it fails to tell about unexplored parts of the space.
Rule # 4 When in doubt, run MCMC chains in parallel and check for consistency.
Many "quick-&-dirty" solutions in the literature, but not necessarily 100% trustworthy.
Example (Bimodal target)
Density
f(x) = [exp(−x²/2)/√(2π)] · [4(x − .3)² + .01] / [4(1 + (.3)²) + .01]
[Figure: the bimodal density on (−4, 4)]
and use of a random walk Metropolis–Hastings algorithm with variance .04. Evaluation of the missing mass by
∑_{t=1}^{T−1} [θ(t+1) − θ(t)] f(θ(t))
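One reading of this estimator (an illustrative sketch, with our own interpretation that the simulated values are first sorted so that the increments θ(t+1) − θ(t) form a Riemann grid over the visited range):

```python
import numpy as np

def riemann_mass(chain, f):
    """Riemann-sum evaluation of the mass covered by the chain:
    sort the simulated values and sum (theta_(t+1) - theta_(t)) * f(theta_(t)).
    A value clearly below 1 signals unexplored regions of the support."""
    theta = np.sort(np.asarray(chain, dtype=float))
    return float(np.sum(np.diff(theta) * f(theta[:-1])))

# usage with an iid normal sample, which covers its target well
rng = np.random.default_rng(0)
sample = rng.normal(size=5000)
phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
mass = riemann_mass(sample, phi)
```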
[Figure: sequence (in blue) and mass evaluation (in brown) over 2,000 iterations]
[Philippe & Robert, 2001]
Effective sample size
How many iid simulations from π are equivalent to N simulations from the MCMC algorithm?
Based on the estimated k-th order auto-correlation,
ρk = corr( x(t), x(t+k) ),
the effective sample size is
N^ess = n ( 1 + 2 ∑_{k=1}^{T0} ρk )^{−1/2}.
Only a partial indicator that fails to signal chains stuck in one mode of the target.
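A direct transcription of the formula above (with the exponent as printed on the slide; the truncation lag T0 is a tuning choice of ours):

```python
import numpy as np

def ess(chain, max_lag=50):
    """Effective sample size, transcribing the slide's formula
    N_ess = n * (1 + 2 * sum_{k=1}^{T0} rho_k) ** (-1/2),
    with rho_k the empirical lag-k autocorrelation and max_lag as T0."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    xc = x - x.mean()
    var = xc @ xc / n
    rho = np.array([(xc[:-k] @ xc[k:]) / (n * var)
                    for k in range(1, max_lag + 1)])
    return float(n * (1.0 + 2.0 * rho.sum()) ** -0.5)
```

For an iid sample the autocorrelations are near zero and N^ess ≈ n; a strongly autocorrelated chain gives a much smaller value.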
Tempering
Facilitate exploration of π by flattening the target: simulate from πα(x) ∝ π(x)^α for α > 0 small enough
Determine where the modal regions of π are (possibly with parallel versions using different α's)
Recycle simulations from πα into simulations from π by importance sampling
Simple modification of the Metropolis–Hastings algorithm, with new acceptance
{ (π(θ′|x)/π(θ|x))^α · q(θ|θ′)/q(θ′|θ) } ∧ 1
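For a symmetric proposal the q ratio cancels and the modification is a single exponent in the acceptance step. An illustrative sketch (our own toy two-mode target; with α = 0.2 the flattened chain crosses the valley between the modes):

```python
import numpy as np

def tempered_rw_mh(logpi, alpha, x0, scale, n_iter, rng):
    """Random walk MH on the flattened target pi**alpha: the log target
    is alpha * log pi, so the acceptance becomes
    (pi(y)/pi(x))**alpha ∧ 1 for a symmetric proposal."""
    x, lpx = x0, logpi(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()
        lpy = logpi(y)
        if np.log(rng.uniform()) < alpha * (lpy - lpx):
            x, lpx = y, lpy
        chain[t] = x
    return chain

# usage: equal-weight mixture of N(0, 1) and N(6, 1)
rng = np.random.default_rng(0)
logpi = lambda x: np.logaddexp(-0.5 * x**2, -0.5 * (x - 6.0)**2)
chain = tempered_rw_mh(logpi, alpha=0.2, x0=0.0, scale=2.0,
                       n_iter=20000, rng=rng)
```

The draws target π^α, not π, which is why the slide recycles them into simulations from π by importance sampling.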
Tempering with the mean mixture
[Figure: sample paths on the (µ1, µ2) surface for α = 1, 0.5 and 0.2]