An introduction to advanced (?) MCMC methods
Christian P. Robert
Université Paris-Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
Royal Statistical Society, October 13, 2010
1 Motivating example
2 The Metropolis-Hastings Algorithm
Motivating example

Latent structures make life harder!
Even simple models may lead to computational complications, as in latent variable models
f(x|θ) = ∫ f⋆(x, x⋆|θ) dx⋆
If (x, x⋆) observed, fine!
If only x observed, trouble!
Example (Mixture models)
Models of mixtures of distributions:
X ∼ fj with probability pj,
for j = 1, 2, . . . , k, with overall density
X ∼ p1 f1(x) + · · · + pk fk(x).
For a sample of independent random variables (X1, . . . , Xn), the sample density is
∏_{i=1}^n {p1 f1(xi) + · · · + pk fk(xi)}.
Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
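As a small illustration (not from the slides, and with made-up parameter values), the mixture likelihood never needs the k^n-term expansion: evaluating the product of n sums directly costs only O(nk). A minimal NumPy sketch:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out to stay self-contained."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_loglik(x, p, mu, sigma):
    """Log-likelihood of a k-component Gaussian mixture, computed directly
    as a product of n sums (O(nk) work), never expanding the k**n terms."""
    comp = normal_pdf(x[:, None], mu, sigma)  # shape (n, k)
    return float(np.log(comp @ p).sum())

# usage on data simulated from a two-component mixture (toy values)
rng = np.random.default_rng(0)
x = np.where(rng.uniform(size=200) < 0.3,
             rng.normal(0.0, 1.0, 200), rng.normal(2.5, 1.0, 200))
ll = mixture_loglik(x, np.array([0.3, 0.7]),
                    np.array([0.0, 2.5]), np.array([1.0, 1.0]))
```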
[Figure: log-likelihood surface of the mixture 0.3 N(µ1, 1) + 0.7 N(µ2, 1), as a function of (µ1, µ2)]
A typology of Bayes computational problems
(i) use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii) use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii) use of a huge dataset;
(iv) use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v) use of a complex inferential procedure, as for instance Bayes factors
Bπ01(x) = { P(θ ∈ Θ0 | x) / P(θ ∈ Θ1 | x) } / { π(θ ∈ Θ0) / π(θ ∈ Θ1) }.
The Metropolis-Hastings Algorithm

1 Motivating example
2 The Metropolis-Hastings Algorithm
  Monte Carlo Methods based on Markov Chains
  The Metropolis–Hastings algorithm
  A collection of Metropolis-Hastings algorithms
  Extensions
  Convergence assessment
Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains
Fact: It is not necessary to use a sample from the distribution f to approximate the integral
I = ∫ h(x) f(x) dx.
We can obtain X1, . . . , Xn ∼ f (approximately) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.
Running Monte Carlo via Markov Chains (2)
Idea
For an arbitrary starting value x(0), an ergodic chain (X(t)) is generated using a transition kernel with stationary distribution f.
Ensures the convergence in distribution of (X(t)) to a random variable from f.
For a "large enough" T0, X(T0) can be considered as distributed from f.
Produces a dependent sample X(T0), X(T0+1), . . ., which is generated from f, sufficient for most approximation purposes.
The Metropolis–Hastings algorithm

Problem: How can one build a Markov chain with a given stationary distribution?
MH basics
Algorithm that converges to the objective (target) density f using an arbitrary transition kernel density q(x, y), called the instrumental (or proposal) distribution.
The MH algorithm
Algorithm (Metropolis–Hastings)
Given x(t),
1 Generate Yt ∼ q(x(t), y).
2 Take
X(t+1) = Yt with prob. ρ(x(t), Yt), and X(t+1) = x(t) with prob. 1 − ρ(x(t), Yt),
where
ρ(x, y) = min{ [f(y)/f(x)] · [q(y, x)/q(x, y)], 1 }.
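The two steps above translate almost verbatim into code. The sketch below is illustrative (not code from the slides); the target and the deliberately non-symmetric proposal in the usage part are our own toy choices:

```python
import numpy as np

def metropolis_hastings(logf, proposal_sample, proposal_logpdf, x0, n_iter, rng):
    """Generic Metropolis–Hastings sketch.
    logf            : log of the (possibly unnormalised) target density f
    proposal_sample : x -> one draw Y ~ q(x, .)
    proposal_logpdf : (x, y) -> log q(x, y)
    """
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = proposal_sample(x)
        # log acceptance ratio: log f(y) - log f(x) + log q(y, x) - log q(x, y)
        log_rho = (logf(y) - logf(x)
                   + proposal_logpdf(y, x) - proposal_logpdf(x, y))
        if np.log(rng.uniform()) < log_rho:
            x = y                # accept the proposal
        chain[t] = x             # otherwise repeat the current value
    return chain

# usage: N(0, 1) target with a shifted (hence non-symmetric) N(x + 0.1, 1) proposal
rng = np.random.default_rng(0)
chain = metropolis_hastings(
    logf=lambda x: -0.5 * x**2,
    proposal_sample=lambda x: x + 0.1 + rng.normal(),
    proposal_logpdf=lambda x, y: -0.5 * (y - x - 0.1)**2,
    x0=0.0, n_iter=5000, rng=rng)
```

Note that only ratios of f and q appear, so normalizing constants are never needed, exactly as the Features slide states.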
Features
Independent of normalizing constants for both f and q(x, ·) (i.e., those constants independent of x)
Never move to values with f(y) = 0
The chain (x(t))t may take the same value several times in a row, even though f is a density wrt Lebesgue measure
The sequence (yt)t is usually not a Markov chain
Satisfies the detailed balance condition
f(x) K(x, y) = f(y) K(y, x)
[Green, 1995]
Convergence properties
1 The M-H Markov chain is reversible, with invariant/stationary density f.
2 As f is a probability measure, the chain is positive recurrent.
3 If
Pr[ f(Yt) q(Yt, X(t)) / {f(X(t)) q(X(t), Yt)} ≥ 1 ] < 1,   (1)
i.e., if the event {X(t+1) = X(t)} occurs with positive probability, then the chain is aperiodic.
Convergence properties (2)
4 If
q(x, y) > 0 for every (x, y),   (2)
the chain is irreducible.
5 For M-H, f-irreducibility implies Harris recurrence.
6 Thus, under conditions (1) and (2):
(i) For h with Ef |h(X)| < ∞,
lim_{T→∞} (1/T) ∑_{t=1}^T h(X(t)) = ∫ h(x) f(x) dx   a.e. f.
(ii) and
lim_{n→∞} ‖ ∫ K^n(x, ·) µ(dx) − f ‖_TV = 0
for every initial distribution µ, where K^n(x, ·) denotes the kernel for n transitions.
A collection of Metropolis-Hastings algorithms

The Independent Case
The instrumental distribution q(x, ·) is independent of x and is denoted g
Algorithm (Independent Metropolis-Hastings)
Given x(t),
1 Generate Yt ∼ g(y)
2 Take
X(t+1) = Yt with prob. min{ f(Yt) g(x(t)) / [f(x(t)) g(Yt)], 1 }, and X(t+1) = x(t) otherwise.
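A sketch of the independent case (illustrative code, not from the slides; the Beta-like target and uniform proposal are our own choices, picked so that f/g is bounded, anticipating the uniform ergodicity theorem below):

```python
import numpy as np

def independent_mh(logf, sample_g, logg, x0, n_iter, rng):
    """Independent Metropolis–Hastings sketch: the proposal g does not
    depend on the current state of the chain."""
    x, lfx = x0, logf(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = sample_g()
        lfy = logf(y)
        # acceptance ratio f(y) g(x) / [f(x) g(y)], on the log scale
        if np.log(rng.uniform()) < lfy - lfx + logg(x) - logg(y):
            x, lfx = y, lfy
        chain[t] = x
    return chain

# usage: unnormalised Beta(2, 5) target, x (1 - x)^4, with a Uniform(0, 1)
# proposal, so that f/g is bounded on the support
rng = np.random.default_rng(0)
chain = independent_mh(
    logf=lambda x: np.log(x) + 4.0 * np.log(1.0 - x),
    sample_g=lambda: rng.uniform(),
    logg=lambda x: 0.0,
    x0=0.5, n_iter=20000, rng=rng)
```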
Properties
The resulting sample is not iid but there exist strong convergence properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a constant M such that
f(x) ≤ M g(x),   x ∈ supp f.
In this case,
‖K^n(x, ·) − f‖_TV ≤ (1 − 1/M)^n.
[Mengersen & Tweedie, 1996]
Example (Noisy AR(1))
Hidden Markov chain from a regular AR(1) model,
xt+1 = ϕ xt + εt+1,   εt ∼ N(0, τ²),
and observables
yt | xt ∼ N(xt², σ²)
The distribution of xt given xt−1, xt+1 and yt is proportional to
exp −(1/2τ²) { (xt − ϕ xt−1)² + (xt+1 − ϕ xt)² + (τ²/σ²)(yt − xt²)² }.
Example (Noisy AR(1) too)
Use for proposal the N(µt, ωt²) distribution, with
µt = ϕ(xt−1 + xt+1)/(1 + ϕ²)   and   ωt² = τ²/(1 + ϕ²).
The ratio
π(x)/q_ind(x) ∝ exp{ −(yt − xt²)²/2σ² }
is bounded
[Figure: (top) last 500 realisations of the chain {Xk}k out of 10,000 iterations; (bottom) histogram of the chain, compared with the target distribution]
Random walk Metropolis–Hastings
Instead, use a local perturbation as proposal,
Yt = X(t) + εt,
where εt ∼ g, independent of X(t). The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if g is symmetric:
g(x) = g(−x)
Algorithm (Random walk Metropolis)
Given x(t),
1 Generate Yt ∼ g(y − x(t))
2 Take
X(t+1) = Yt with prob. min{ 1, f(Yt)/f(x(t)) }, and X(t+1) = x(t) otherwise.
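With a symmetric g, the instrumental ratio cancels and only f(Yt)/f(x(t)) remains. An illustrative sketch (our own toy two-mode target, echoing the mixture figures further on; the scale is the tuning knob discussed later):

```python
import numpy as np

def rw_metropolis(logf, x0, scale, n_iter, rng):
    """Random walk Metropolis sketch with Gaussian increments."""
    x, lfx = x0, logf(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()          # symmetric proposal g
        lfy = logf(y)
        if np.log(rng.uniform()) < lfy - lfx:  # min{1, f(y)/f(x)}
            x, lfx = y, lfy
        chain[t] = x
    return chain

# usage: two-component target .7 N(0, 1) + .3 N(4, 1); a small scale gets
# stuck near one mode while a larger one visits both
rng = np.random.default_rng(0)
def log_mixture(x):
    return np.logaddexp(np.log(0.7) - 0.5 * x**2,
                        np.log(0.3) - 0.5 * (x - 4.0)**2)
chain = rw_metropolis(log_mixture, x0=0.0, scale=2.0, n_iter=20000, rng=rng)
```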
Probit illustration
Likelihood and posterior given by
π(β|y, X) ∝ ℓ(β|y, X) ∝ ∏_{i=1}^n Φ(xiᵀβ)^{yi} (1 − Φ(xiᵀβ))^{ni−yi}
under the flat prior.
A random walk proposal works well for a small number of predictors. Use the maximum likelihood estimate β̂ as starting value and the asymptotic (Fisher) covariance matrix of the MLE, Σ̂, as scale.
MCMC algorithm
Probit random-walk Metropolis-Hastings
Initialization: Set β(0) = β̂ and compute Σ̂
Iteration t:
1 Generate β̃ ∼ N_{k+1}(β(t−1), τ Σ̂)
2 Compute
ρ(β(t−1), β̃) = min( 1, π(β̃|y)/π(β(t−1)|y) )
3 With probability ρ(β(t−1), β̃) set β(t) = β̃; otherwise set β(t) = β(t−1).
R bank benchmark
Probit modelling with no intercept over the four measurements. Three different scales τ = 1, 0.1, 10: best mixing behavior is associated with τ = 1. Average of the parameters over 9,000 MCMC iterations gives the plug-in estimate
[Figure: traces, histograms and autocorrelations of the four coefficients for the three scales]
p̂i = Φ(−1.2193 xi1 + 0.9540 xi2 + 0.9795 xi3 + 1.1481 xi4).
Example (Mixture models)
π(θ|x) ∝ ∏_{j=1}^n ( ∑_{ℓ=1}^k pℓ f(xj|µℓ, σℓ) ) π(θ)
Metropolis-Hastings proposal:
θ(t+1) = θ(t) + ωε(t) if u(t) < ρ(t), and θ(t+1) = θ(t) otherwise,
where
ρ(t) = π(θ(t) + ωε(t)|x) / π(θ(t)|x) ∧ 1
and ω scaled for good acceptance rate
[Figure: random walk MCMC output for .7 N(µ1, 1) + .3 N(µ2, 1) and scale 1, shown at iterations 1, 10, 100, 500 and 1000]
[Figure: random walk MCMC output for .7 N(µ1, 1) + .3 N(µ2, 1) and scale √.1, shown at iterations 10, 100, 500, 1000, 5000 and 10,000]
Convergence properties
Uniform ergodicity prohibited by random walk structure
At best, geometric ergodicity:
Theorem (Sufficient ergodicity)
For a symmetric density f, log-concave in the tails, and a positive and symmetric density g, the chain (X(t)) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail effect
Example (Comparison of tail effects)
Random-walk Metropolis–Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)⁻³
[Figure: 90% confidence envelopes of the means, derived from 500 parallel independent chains, for cases (a) and (b)]
Extensions

There are many other families of MH algorithms:
Adaptive Rejection Metropolis Sampling
Reversible Jump
Langevin algorithms
to name just a few...
Langevin Algorithms
Proposal based on the Langevin diffusion Lt, defined by the stochastic differential equation
dLt = dBt + (1/2) ∇ log f(Lt) dt,
where Bt is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.
Discretization
Because continuous time cannot be simulated, consider the discretised sequence
x(t+1) = x(t) + (σ²/2) ∇ log f(x(t)) + σ εt,   εt ∼ Np(0, Ip),
where σ² corresponds to the discretisation step
[Figure: histogram fit for the example f(x) = exp(−x⁴), with σ² = .1]
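The discretised (unadjusted) recursion is a one-liner in code. An illustrative sketch on the slides' example f(x) = exp(−x⁴), where ∇ log f(x) = −4x³ (the step size and iteration count are our own choices, small enough to stay in the stable regime):

```python
import numpy as np

def ula(grad_logf, x0, sigma2, n_iter, rng):
    """Unadjusted Langevin discretisation sketch:
    x_{t+1} = x_t + (sigma2 / 2) grad log f(x_t) + sigma * eps_t."""
    sigma = np.sqrt(sigma2)
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        x = x + 0.5 * sigma2 * grad_logf(x) + sigma * rng.normal()
        chain[t] = x
    return chain

# usage: f(x) = exp(-x**4), so grad log f(x) = -4 x**3
rng = np.random.default_rng(0)
chain = ula(lambda x: -4.0 * x**3, x0=0.0, sigma2=0.01, n_iter=50000, rng=rng)
```

Without the Metropolis correction below, the chain only targets an approximation of f, and (as the transience slide warns) too large a σ² can make it explode.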
[Figure: the same histogram fit for σ² = .01, .001, .0001 and .0001∗]
Unfortunately, the discretized chain may be transient, for instance when
lim_{x→±∞} |σ² ∇ log f(x)| / |x| > 1
Example: f(x) = exp(−x⁴) when σ² = .2
MH correction
Accept the new value Yt with probability
[f(Yt)/f(x(t))] · exp{ −‖Yt − x(t) − (σ²/2)∇ log f(x(t))‖²/2σ² } / exp{ −‖x(t) − Yt − (σ²/2)∇ log f(Yt)‖²/2σ² } ∧ 1.
Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998]
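The corrected (Metropolis-adjusted) step can be sketched as follows; an illustrative one-dimensional transcription, not code from the slides, again on the toy target f(x) = exp(−x⁴):

```python
import numpy as np

def mala_step(logf, grad_logf, x, sigma2, rng):
    """One Metropolis-adjusted Langevin step (1-d sketch)."""
    sigma = np.sqrt(sigma2)
    mean_fwd = x + 0.5 * sigma2 * grad_logf(x)
    y = mean_fwd + sigma * rng.normal()       # Langevin proposal
    mean_bwd = y + 0.5 * sigma2 * grad_logf(y)
    # log acceptance ratio: target ratio times the ratio of the two
    # Gaussian proposal densities, matching the displayed probability
    log_rho = (logf(y) - logf(x)
               - (x - mean_bwd) ** 2 / (2.0 * sigma2)
               + (y - mean_fwd) ** 2 / (2.0 * sigma2))
    return y if np.log(rng.uniform()) < log_rho else x

# usage: f(x) = exp(-x**4), now sampled exactly thanks to the correction
rng = np.random.default_rng(0)
chain = np.empty(20000)
x = 0.0
for t in range(20000):
    x = mala_step(lambda z: -z**4, lambda z: -4.0 * z**3, x, sigma2=0.1, rng=rng)
    chain[t] = x
```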
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of view. Most common alternatives:
1 a fully automated algorithm like ARMS;
2 an instrumental density g which approximates f, such that f/g is bounded for uniform ergodicity to apply;
3 a random walk
In both cases (2) and (3), the choice of g is critical.
Case of the random walk
Different approach to acceptance rates: a high acceptance rate does not indicate that the algorithm is moving correctly, since it indicates that the random walk is moving too slowly on the surface of f.
If x(t) and yt are close, i.e. f(x(t)) ≃ f(yt), yt is accepted with probability
min( f(yt)/f(x(t)), 1 ) ≃ 1.
For multimodal densities with well separated modes, the negative effect of limited moves on the surface of f clearly shows.
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f(yt)tend to be small compared with f(x(t)), which means that therandom walk moves quickly on the surface of f since it oftenreaches the “borders” of the support of f
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!
Example (Noisy AR(1) continued)
For a Gaussian random walk with scale ω small enough, the random walk never jumps to the other mode. But if the scale ω is sufficiently large, the Markov chain explores both modes and gives a satisfactory approximation of the target distribution.
[Figure: Markov chain based on a random walk with scale ω = .1]
[Figure: Markov chain based on a random walk with scale ω = .5]
Where do we stand?
MCMC in a nutshell:
Running a sequence Xt+1 = Ψ(Xt, Yt) provides an approximation to the target density f when the detailed balance condition holds:
f(x) K(x, y) = f(y) K(y, x)
Easiest implementation of the principle is random walk Metropolis-Hastings:
Yt = X(t) + εt
Practical convergence requires sufficient energy from the proposal, which is calibrated by trial and error.
Convergence assessment

Convergence diagnostics
How many iterations?
Rule # 1 There is no absolute number of simulations, i.e. 1,000 is neither large, nor small.
Rule # 2 It takes [much] longer to check for convergence than for the chain itself to converge.
Rule # 3 MCMC is a "what-you-get-is-what-you-see" algorithm: it fails to tell about unexplored parts of the space.
Rule # 4 When in doubt, run MCMC chains in parallel and check for consistency.
Many "quick-&-dirty" solutions in the literature, but not necessarily 100% trustworthy.
Example (Bimodal target)
Density
f(x) = [exp(−x²/2)/√(2π)] · [4(x − .3)² + .01] / [4(1 + (.3)²) + .01]
[Figure: the bimodal density on (−4, 4)]
and use of a random walk Metropolis–Hastings algorithm with variance .04. Evaluation of the missing mass by
∑_{t=1}^{T−1} [θ(t+1) − θ(t)] f(θ(t))
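One reading of this estimator (an illustrative sketch, with our own interpretation that the simulated values are first sorted so that the increments θ(t+1) − θ(t) form a Riemann grid over the visited range):

```python
import numpy as np

def riemann_mass(chain, f):
    """Riemann-sum evaluation of the mass covered by the chain:
    sort the simulated values and sum (theta_(t+1) - theta_(t)) * f(theta_(t)).
    A value clearly below 1 signals unexplored regions of the support."""
    theta = np.sort(np.asarray(chain, dtype=float))
    return float(np.sum(np.diff(theta) * f(theta[:-1])))

# usage with an iid normal sample, which covers its target well
rng = np.random.default_rng(0)
sample = rng.normal(size=5000)
phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
mass = riemann_mass(sample, phi)
```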
[Figure: sequence (in blue) and mass evaluation (in brown) over 2,000 iterations]
[Philippe & Robert, 2001]
Effective sample size
How many iid simulations from π are equivalent to N simulations from the MCMC algorithm?
Based on the estimated k-th order auto-correlation,
ρk = corr( x(t), x(t+k) ),
the effective sample size is
N^ess = n ( 1 + 2 ∑_{k=1}^{T0} ρk )^{−1/2}.
Only a partial indicator that fails to signal chains stuck in one mode of the target.
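A direct transcription of the formula above (with the exponent as printed on the slide; the truncation lag T0 is a tuning choice of ours):

```python
import numpy as np

def ess(chain, max_lag=50):
    """Effective sample size, transcribing the slide's formula
    N_ess = n * (1 + 2 * sum_{k=1}^{T0} rho_k) ** (-1/2),
    with rho_k the empirical lag-k autocorrelation and max_lag as T0."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    xc = x - x.mean()
    var = xc @ xc / n
    rho = np.array([(xc[:-k] @ xc[k:]) / (n * var)
                    for k in range(1, max_lag + 1)])
    return float(n * (1.0 + 2.0 * rho.sum()) ** -0.5)
```

For an iid sample the autocorrelations are near zero and N^ess ≈ n; a strongly autocorrelated chain gives a much smaller value.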
Tempering
Facilitate exploration of π by flattening the target: simulate from πα(x) ∝ π(x)^α for α > 0 small enough
Determine where the modal regions of π are (possibly with parallel versions using different α's)
Recycle simulations from πα into simulations from π by importance sampling
Simple modification of the Metropolis–Hastings algorithm, with new acceptance
{ (π(θ′|x)/π(θ|x))^α · q(θ|θ′)/q(θ′|θ) } ∧ 1
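For a symmetric proposal the q ratio cancels and the modification is a single exponent in the acceptance step. An illustrative sketch (our own toy two-mode target; with α = 0.2 the flattened chain crosses the valley between the modes):

```python
import numpy as np

def tempered_rw_mh(logpi, alpha, x0, scale, n_iter, rng):
    """Random walk MH on the flattened target pi**alpha: the log target
    is alpha * log pi, so the acceptance becomes
    (pi(y)/pi(x))**alpha ∧ 1 for a symmetric proposal."""
    x, lpx = x0, logpi(x0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()
        lpy = logpi(y)
        if np.log(rng.uniform()) < alpha * (lpy - lpx):
            x, lpx = y, lpy
        chain[t] = x
    return chain

# usage: equal-weight mixture of N(0, 1) and N(6, 1)
rng = np.random.default_rng(0)
logpi = lambda x: np.logaddexp(-0.5 * x**2, -0.5 * (x - 6.0)**2)
chain = tempered_rw_mh(logpi, alpha=0.2, x0=0.0, scale=2.0,
                       n_iter=20000, rng=rng)
```

The draws target π^α, not π, which is why the slide recycles them into simulations from π by importance sampling.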
Tempering with the mean mixture
[Figure: sample paths on the (µ1, µ2) surface for α = 1, 0.5 and 0.2]