MCMC Methods: Gibbs Sampling and the Metropolis-Hastings Algorithm

Patrick Lam

Outline

Introduction to Markov Chain Monte Carlo

Gibbs Sampling

The Metropolis-Hastings Algorithm


    What is Markov Chain Monte Carlo (MCMC)?

Markov Chain: a stochastic process in which future states are independent of past states given the present state.

    Monte Carlo: simulation

Up until now, we've done a lot of Monte Carlo simulation to find integrals rather than computing them analytically, a process called Monte Carlo Integration.

Basically, a fancy way of saying we can compute quantities of interest of a distribution from simulated draws from the distribution.


    Monte Carlo Integration

Suppose we have a distribution p(θ) (perhaps a posterior) that we want to take quantities of interest from.

To derive them analytically, we need to take integrals:

$$ I = \int_\Theta g(\theta)\, p(\theta)\, d\theta $$

where g(θ) is some function of θ (g(θ) = θ for the mean and g(θ) = (θ − E(θ))² for the variance).

We can approximate the integral via Monte Carlo Integration by simulating M values from p(θ) and calculating

$$ \hat{I}_M = \frac{1}{M} \sum_{i=1}^{M} g\!\left(\theta^{(i)}\right) $$


For example, we can compute the expected value of the Beta(3,3) distribution analytically:

$$ E(\theta) = \int_\Theta \theta\, p(\theta)\, d\theta = \int_\Theta \theta\, \frac{\Gamma(6)}{\Gamma(3)\,\Gamma(3)}\, \theta^{2}(1-\theta)^{2}\, d\theta = \frac{1}{2} $$

    or via Monte Carlo methods:

> M <- 10000   # the exact M used in the slides was lost in transcription; any large M works
> beta.sims <- rbeta(M, 3, 3)
> sum(beta.sims)/M
[1] 0.5013

Our Monte Carlo approximation Î_M is a simulation-consistent estimator of the true value I: Î_M → I as M → ∞.

    We know this to be true from the Strong Law of Large Numbers.


    Strong Law of Large Numbers (SLLN)

Let X1, X2, … be a sequence of independent and identically distributed random variables, each having a finite mean μ = E(Xi).

Then with probability 1,

$$ \frac{X_1 + X_2 + \cdots + X_M}{M} \to \mu \quad \text{as } M \to \infty $$

In our previous example, each simulation draw was independent and distributed from the same Beta(3,3) distribution.

This also works with variances and other quantities of interest, since functions of i.i.d. random variables are themselves i.i.d. random variables.
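As a quick sketch of the same idea for the variance (this snippet reuses beta.sims and M from the simulation above; it is an illustration, not code from the original slides):

> g <- function(theta) (theta - mean(beta.sims))^2
> sum(g(beta.sims))/M   # close to the true variance of Beta(3,3), 1/28 ≈ 0.0357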

But what if we can't generate draws that are independent?


Suppose we want to draw from our posterior distribution p(θ|y), but we cannot sample independent draws from it.

    For example, we often do not know the normalizing constant.

However, we may be able to sample draws from p(θ|y) that are slightly dependent.

If we can sample slightly dependent draws using a Markov chain, then we can still find quantities of interest from those draws.

What is a Markov Chain?

Definition: a stochastic process in which future states are independent of past states given the present state.

Stochastic process: a consecutive set of random (not deterministic) quantities defined on some known state space Θ.

think of Θ as our parameter space.

consecutive implies a time component, indexed by t.

Consider a draw of θ^(t) to be a state at iteration t. The next draw θ^(t+1) is dependent only on the current draw θ^(t), and not on any past draws. This satisfies the Markov property:

$$ p\left(\theta^{(t+1)} \mid \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(t)}\right) = p\left(\theta^{(t+1)} \mid \theta^{(t)}\right) $$


So our Markov chain is a bunch of draws of θ that are each slightly dependent on the previous one. The chain wanders around the parameter space, remembering only where it has been in the last period.

What are the rules governing how the chain jumps from one state to another at each period?

The jumping rules are governed by a transition kernel, which is a mechanism that describes the probability of moving to some other state based on the current state.

    Transition Kernel


For a discrete state space (k possible states): a k × k matrix of transition probabilities.

Example: Suppose k = 3. The 3 × 3 transition matrix P would be

$$
P = \begin{pmatrix}
p(\theta_A^{(t+1)} \mid \theta_A^{(t)}) & p(\theta_B^{(t+1)} \mid \theta_A^{(t)}) & p(\theta_C^{(t+1)} \mid \theta_A^{(t)}) \\
p(\theta_A^{(t+1)} \mid \theta_B^{(t)}) & p(\theta_B^{(t+1)} \mid \theta_B^{(t)}) & p(\theta_C^{(t+1)} \mid \theta_B^{(t)}) \\
p(\theta_A^{(t+1)} \mid \theta_C^{(t)}) & p(\theta_B^{(t+1)} \mid \theta_C^{(t)}) & p(\theta_C^{(t+1)} \mid \theta_C^{(t)})
\end{pmatrix}
$$

where the subscripts index the 3 possible values that θ can take.

The rows sum to one and define a conditional PMF, conditional on the current state. The columns are the marginal probabilities of being in a certain state in the next period.

For a continuous state space (infinitely many possible states), the transition kernel is a bunch of conditional PDFs: f(θj^(t+1) | θi^(t)).
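As a small illustration in R (the probabilities below are made up for this sketch, not taken from the slides), a 3-state transition kernel is just a matrix whose rows are conditional PMFs:

P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.2, 0.2, 0.6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
rowSums(P)   # each row sums to one: a conditional PMF given the current state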

    How Does a Markov Chain Work? (Discrete Example)


1. Define a starting distribution Q^(0) (a 1 × k vector of probabilities that sum to one).

2. At iteration 1, our distribution Q^(1) (from which θ^(1) is drawn) is

$$ \underset{(1 \times k)}{Q^{(1)}} = \underset{(1 \times k)}{Q^{(0)}} \times \underset{(k \times k)}{P} $$

3. At iteration 2, our distribution Q^(2) (from which θ^(2) is drawn) is

$$ Q^{(2)} = Q^{(1)} \times P $$

4. At iteration t, our distribution Q^(t) (from which θ^(t) is drawn) is

$$ Q^{(t)} = Q^{(t-1)} \times P = Q^{(0)} \times P^{t} $$
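A minimal sketch of these steps in R, reusing the hypothetical P defined above; the starting distribution Q^(0) is an arbitrary choice:

Q <- c(1, 0, 0)                # Q^(0): all mass on state A
for (t in 1:50) Q <- Q %*% P   # Q^(t) = Q^(t-1) x P
Q                              # after many iterations, Q essentially stops changing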

    Stationary (Limiting) Distribution


Define a stationary distribution π to be some distribution such that π = πP.

For all the MCMC algorithms we use in Bayesian statistics, the Markov chain will typically converge to π regardless of our starting points.

So if we can devise a Markov chain whose stationary distribution π is our desired posterior distribution p(θ|y), then we can run this chain to get draws that are approximately from p(θ|y) once the chain has converged.
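One way to find π numerically for the hypothetical P above (a sketch, not from the slides): since π = πP, π is a left eigenvector of P with eigenvalue 1, i.e. an eigenvector of t(P):

e <- eigen(t(P))                    # left eigenvectors of P
pi.stat <- Re(e$vectors[, 1])       # eigen() sorts by modulus, so eigenvalue 1 comes first
pi.stat <- pi.stat / sum(pi.stat)   # normalize to a probability vector
pi.stat %*% P                       # equals pi.stat: the condition pi = pi P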

    Burn-in


Since convergence usually occurs regardless of our starting point, we can usually pick any feasible starting point (for example, starting draws that are in the parameter space).

However, the time it takes for the chain to converge varies depending on the starting point.

As a matter of practice, most people throw out a certain number of the first draws, known as the burn-in. This is to make our draws closer to the stationary distribution and less dependent on the starting point.

However, it is unclear how much we should burn in, since our draws are all slightly dependent and we don't know exactly when convergence occurs.

    Monte Carlo Integration on the Markov Chain


Once we have a Markov chain that has converged to the stationary distribution, the draws in our chain appear to be like draws from p(θ|y), so it seems like we should be able to use Monte Carlo Integration methods to find quantities of interest.

One problem: our draws are not independent, which we required for Monte Carlo Integration to work (remember the SLLN).

Luckily, we have the Ergodic Theorem.

    Ergodic Theorem


Let θ^(1), θ^(2), …, θ^(M) be M values from a Markov chain that is aperiodic, irreducible, and positive recurrent (then the chain is ergodic), and E[g(θ)] < ∞. Then with probability 1,

$$ \frac{1}{M} \sum_{i=1}^{M} g(\theta_i) \to \int_\Theta g(\theta)\, \pi(\theta)\, d\theta $$

as M → ∞, where π is the stationary distribution.

This is the Markov chain analog to the SLLN, and it allows us to ignore the dependence between draws of the Markov chain when we calculate quantities of interest from the draws.

But what does it mean for a chain to be aperiodic, irreducible, and positive recurrent, and therefore ergodic?

    Aperiodicity


A Markov chain is aperiodic if the only length of time for which the chain repeats some cycle of values is the trivial case with cycle length equal to one.

Let A, B, and C denote the states (analogous to the possible values of θ) in a 3-state Markov chain. The following chain is periodic with period 3, where the period is the number of steps that it takes to return to a certain state.

[Diagram: A → B → C → A, where each transition occurs with probability 1 (and each self-transition has probability 0).]

As long as the chain is not repeating an identical cycle, then the chain is aperiodic.
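A quick numerical check of the period-3 chain in R (a sketch, not from the slides): its transition matrix is a cyclic permutation, so three steps return the chain exactly to where it started:

P3 <- matrix(c(0, 1, 0,
               0, 0, 1,
               1, 0, 0), nrow = 3, byrow = TRUE)   # A -> B -> C -> A, each w.p. 1
P3 %*% P3 %*% P3                                   # the 3 x 3 identity matrix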

    Irreducibility


A Markov chain is irreducible if it is possible to go from any state to any other state (not necessarily in one step).

The following chain is reducible, or not irreducible.

[Diagram: a 3-state chain in which A stays at A with probability 0.5 and moves to B with probability 0.5; B stays with probability 0.7 and moves to C with probability 0.3; C stays with probability 0.6 and moves to B with probability 0.4. No arrows lead back to A.]

The chain is not irreducible because we cannot get to A from B or C regardless of the number of steps we take.
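A sketch of the same point in R; the matrix below encodes one reading of the diagram (the transcription is ambiguous, so treat the exact probabilities as an assumption):

Pr <- matrix(c(0.5, 0.5, 0.0,     # from A: stay, or move to B
               0.0, 0.7, 0.3,     # from B: stay, or move to C (never to A)
               0.0, 0.4, 0.6),    # from C: stay, or move to B (never to A)
             nrow = 3, byrow = TRUE)
Pn <- diag(3)
for (i in 1:100) Pn <- Pn %*% Pr  # Pr to the 100th power
round(Pn, 4)                      # column 1 (state A) is ~0: A is unreachable from B and C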

    Positive Recurrence


A Markov chain is recurrent if, for any given state i, the chain starting at i will eventually return to i with probability 1.

A Markov chain is positive recurrent if the expected return time to state i is finite; otherwise it is null recurrent.

So if our Markov chain is aperiodic, irreducible, and positive recurrent (all the ones we use in Bayesian statistics usually are), then it is ergodic, and the ergodic theorem allows us to do Monte Carlo Integration by calculating quantities of interest from our draws, ignoring the dependence between draws.

    Thinning the Chain


In order to break the dependence between draws in the Markov chain, some have suggested only keeping every dth draw of the chain. This is known as thinning.

Pros:

Perhaps gets you a little closer to i.i.d. draws.

Saves memory since you only store a fraction of the draws.

Cons:

Unnecessary given the ergodic theorem.

Shown to increase the variance of your Monte Carlo estimates.
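As a one-line sketch in R (not from the slides), thinning by keeping every dth draw might look like:

draws <- rnorm(10000)                             # stand-in for a chain's output (hypothetical)
d <- 10
thinned <- draws[seq(d, length(draws), by = d)]   # keep every d-th draw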

    So Really, What is MCMC?


MCMC is a class of methods in which we can simulate draws that are slightly dependent and are approximately from a (posterior) distribution.

We then take those draws and calculate quantities of interest for the (posterior) distribution.

In Bayesian statistics, there are generally two MCMC algorithms that we use: the Gibbs Sampler and the Metropolis-Hastings algorithm.

    Outline


    Introduction to Markov Chain Monte Carlo

    Gibbs Sampling

    The Metropolis-Hastings Algorithm

    Gibbs Sampling


Suppose we have a joint distribution p(θ1, …, θk) that we want to sample from (for example, a posterior distribution).

We can use the Gibbs sampler to sample from the joint distribution if we know the full conditional distributions for each parameter.

For each parameter, the full conditional distribution is the distribution of that parameter conditional on the known information and all the other parameters: p(θj | θ−j, y).

How can we know the joint distribution simply by knowing the full conditional distributions?

    The Hammersley-Clifford Theorem (for two blocks)

Suppose we have a joint density f(x, y). The theorem proves that we can write out the joint density in terms of the conditional densities f(x|y) and f(y|x):

$$ f(x, y) = \frac{f(y \mid x)}{\int \frac{f(y \mid x)}{f(x \mid y)}\, dy} $$

We can write the denominator as

$$ \int \frac{f(y \mid x)}{f(x \mid y)}\, dy = \int \frac{f(x, y)/f(x)}{f(x, y)/f(y)}\, dy = \int \frac{f(y)}{f(x)}\, dy = \frac{1}{f(x)} $$

Thus, our right-hand side is

$$ \frac{f(y \mid x)}{1/f(x)} = f(y \mid x)\, f(x) = f(x, y) $$

The theorem shows that knowledge of the conditional densities allows us to get the joint density.

This works for more than two blocks of parameters.

But how do we figure out the full conditionals?

    Steps to Calculating Full Conditional Distributions


Suppose we have a posterior p(θ|y). To calculate the full conditionals for each θ, do the following:

1. Write out the full posterior, ignoring constants of proportionality.

2. Pick a block of parameters (for example, θ1) and drop everything that doesn't depend on θ1.

3. Use your knowledge of distributions to figure out what the normalizing constant is (and thus what the full conditional distribution p(θ1|θ−1, y) is).

4. Repeat steps 2 and 3 for all parameter blocks.
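As a worked sketch of these steps (a standard normal-model illustration chosen for this transcript, not an example from the original slides): suppose y1, …, yn are i.i.d. N(μ, 1/τ), with a flat prior on μ and a Gamma(a, b) prior on τ. Step 1 gives

$$ p(\mu, \tau \mid y) \propto \tau^{n/2} \exp\!\left(-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right) \tau^{a-1} e^{-b\tau} $$

Picking the block μ and dropping everything that does not depend on μ (step 2) leaves exp(−(τ/2) Σ(yi − μ)²) ∝ exp(−(nτ/2)(μ − ȳ)²), which we recognize (step 3) as a Normal kernel:

$$ \mu \mid \tau, y \sim N\!\left(\bar{y},\; \frac{1}{n\tau}\right) $$

Repeating for τ (step 4) leaves τ^(a + n/2 − 1) exp(−τ(b + ½ Σ(yi − μ)²)), a Gamma kernel:

$$ \tau \mid \mu, y \sim \text{Gamma}\!\left(a + \frac{n}{2},\; b + \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right) $$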

    Gibbs Sampler Steps

Let's suppose that we are interested in sampling from the posterior p(θ|y), where θ is a vector of three parameters, θ1, θ2, θ3.

The steps to a Gibbs Sampler (and the analogous steps in the MCMC process) are

1. Pick a vector of starting values θ^(0). (Defining a starting distribution Q^(0) and drawing θ^(0) from it.)

2. Start with any θ (order does not matter, but I'll start with θ1 for convenience). Draw a value θ1^(1) from the full conditional p(θ1 | θ2^(0), θ3^(0), y).

3. Draw a value θ2^(1) (again, order does not matter) from the full conditional p(θ2 | θ1^(1), θ3^(0), y). Note that we must use the updated value of θ1^(1).

4. Draw a value θ3^(1) from the full conditional p(θ3 | θ1^(1), θ2^(1), y) using both updated values. (Steps 2-4 are analogous to multiplying Q^(0) and P to get Q^(1) and then drawing θ^(1) from Q^(1).)

5. Draw θ^(2) using θ^(1) and continually using the most updated values.

6. Repeat until we get M draws, with each draw being a vector θ^(t).

7. Optional burn-in and/or thinning.

    3.  Draw a value  θ(1)

    2   (again order does not matter) from the fullconditional  p (θ2|θ

    (1)1   , θ

    (0)3   , y). Note that we must use the

    updated value of  θ(1)1   .

    4.  Draw a value  θ(1)3   from the full conditional p (θ3|θ

    (1)1   , θ

    (1)2   , y)

    using both updated values.   (Steps 2-4 are analogous to multiplyingQ(0) and  P  to get

    http://goforward/http://find/http://goback/

  • 8/21/2019 MCMC Methods: Gibbs Sampling and the Metropolis-Hastings Algorithm

    142/233

    g pQ(1) and then drawing  θ(1) from

    Q(1).)

    5.   Draw  θ(2) using  θ(1) and continually using the most updatedvalues.

    6.  Repeat until we get  M  draws, with each draw being a vectorθ

    (t ).

    7.  Optional burn-in and/or thinning.

    Our result is a Markov chain with a bunch of draws of  θ  that areapproximately from our posterior.

    3.  Draw a value  θ(1)

    2   (again order does not matter) from the fullconditional  p (θ2|θ

    (1)1   , θ

    (0)3   , y). Note that we must use the

    updated value of  θ(1)1   .

    4.  Draw a value  θ(1)3   from the full conditional p (θ3|θ

    (1)1   , θ

    (1)2   , y)

    using both updated values.   (Steps 2-4 are analogous to multiplyingQ(0) and  P  to get

    http://find/

  • 8/21/2019 MCMC Methods: Gibbs Sampling and the Metropolis-Hastings Algorithm

    143/233

    Q(1) and then drawing  θ(1) fromQ(1).)

    5.   Draw  θ(2) using  θ(1) and continually using the most updatedvalues.

    6.  Repeat until we get  M  draws, with each draw being a vectorθ

    (t ).

    7.  Optional burn-in and/or thinning.

    Our result is a Markov chain with a bunch of draws of  θ  that areapproximately from our posterior. We can do Monte CarloIntegration on those draws to get quantities of interest.
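To make the loop structure concrete, here is a minimal sketch in R. It is only a sketch: draw1, draw2, and draw3 are hypothetical stand-ins for functions that return one draw from each full conditional.

# Generic Gibbs skeleton for theta = (theta1, theta2, theta3).
# draw1(), draw2(), draw3() are hypothetical samplers for the full
# conditionals p(theta_k | theta_{-k}, y).
gibbs.skeleton <- function(n.sims, theta.start, draw1, draw2, draw3, y) {
  draws <- matrix(NA, nrow = n.sims, ncol = 3)   # one row per iteration
  theta <- theta.start                           # step 1: theta^(0)
  for (s in 1:n.sims) {
    theta[1] <- draw1(theta[2], theta[3], y)     # step 2
    theta[2] <- draw2(theta[1], theta[3], y)     # step 3: uses updated theta1
    theta[3] <- draw3(theta[1], theta[2], y)     # step 4: uses both updates
    draws[s, ] <- theta                          # steps 5-6: record theta^(s)
  }
  draws                                          # step 7 (burn-in/thinning) is left to the user
}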

An Example (Robert and Casella, 10.17)¹

Suppose we have data on the number of failures (yi) for each of 10 pumps in a nuclear plant. We also have the times (ti) at which each pump was observed.

> y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
> t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
> rbind(y, t)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
y    5    1    5   14    3   19    1    1    4    22
t   94   16   63  126    5   31    1    1    2    10

We want to model the number of failures with a Poisson likelihood, where the expected number of failures λi differs for each pump. Since the time over which we observed each pump differs, we need to scale each λi by its observed time ti.

Our likelihood is ∏_{i=1}^{10} Poisson(λi ti).

¹ Robert, Christian P. and George Casella. 2004. Monte Carlo Statistical Methods, 2nd edition. Springer.

Let's put a Gamma(α, β) prior on λi with α = 1.8, so the λi s are drawn from the same distribution.

Also, let's put a Gamma(γ, δ) prior on β, with γ = 0.01 and δ = 1.

So our model has 11 parameters that are unknown (10 λi s and β).

Our posterior is

p(λ, β|y, t) ∝ [∏_{i=1}^{10} Poisson(λi ti) × Gamma(α, β)] × Gamma(γ, δ)

             = [∏_{i=1}^{10} (e^{−λi ti} (λi ti)^{yi} / yi!) × (β^α / Γ(α)) λi^{α−1} e^{−βλi}] × (δ^γ / Γ(γ)) β^{γ−1} e^{−δβ}

             ∝ [∏_{i=1}^{10} e^{−λi ti} (λi ti)^{yi} × β^α λi^{α−1} e^{−βλi}] × β^{γ−1} e^{−δβ}

             = [∏_{i=1}^{10} λi^{yi+α−1} e^{−(ti+β)λi}] β^{10α+γ−1} e^{−δβ}

Finding the full conditionals:

p(λi|λ−i, β, y, t) ∝ λi^{yi+α−1} e^{−(ti+β)λi}

p(β|λ, y, t) ∝ e^{−β(δ + Σ_{i=1}^{10} λi)} β^{10α+γ−1}

p(λi|λ−i, β, y, t) is a Gamma(yi + α, ti + β) distribution.

p(β|λ, y, t) is a Gamma(10α + γ, δ + Σ_{i=1}^{10} λi) distribution.

Coding the Gibbs Sampler

1.  Define starting values for β (we only need to define β here because we will draw λ first and it only depends on β and other given values).
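A minimal sketch of this step; the particular value below is an assumption, and any reasonable positive number would do, since the chain forgets its starting point:

> beta.cur <- 1  # assumed starting value for beta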

2.  Draw λ^(1) from its full conditional (we're drawing all the λi s as a block because they all only depend on β and not on each other).
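A sketch of this update, drawing each λi from its Gamma(yi + α, ti + β) full conditional derived above; the function body here is an assumption consistent with that derivation:

lambda.update <- function(alpha, beta, y, t) {
  # one Gamma(y_i + alpha, t_i + beta) draw per pump, vectorized over i
  rgamma(length(y), shape = y + alpha, rate = t + beta)
}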

3.  Draw β^(1) from its full conditional, using the updated draws λ^(1).
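Similarly, a sketch for β, drawing from its Gamma(10α + γ, δ + Σλi) full conditional; again the body is an assumption consistent with the derivation:

beta.update <- function(alpha, gamma, delta, lambda, y) {
  # one draw; shape 10*alpha + gamma (length(y) = 10 pumps), rate delta + sum(lambda)
  rgamma(1, shape = length(y) * alpha + gamma, rate = delta + sum(lambda))
}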

4.  Repeat using most updated values until we get M draws.

5.  Optional burn-in and thinning.

6.  Make it into a function (a sketch follows).

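A minimal sketch of such a function, assuming it is assembled from the lambda.update and beta.update sketches above; the argument names and the returned list structure (lambda.draws, beta.draws) are assumptions, chosen to match the posterior$lambda.draws and posterior$beta.draws calls below:

gibbs <- function(n.sims, beta.start, alpha, gamma, delta, y, t,
                  burnin = 0, thin = 1) {
  # storage: one row per draw of lambda, one entry per draw of beta
  lambda.draws <- matrix(NA, nrow = n.sims, ncol = length(y))
  beta.draws <- rep(NA, n.sims)
  beta.cur <- beta.start
  for (i in 1:n.sims) {
    lambda.cur <- lambda.update(alpha = alpha, beta = beta.cur, y = y, t = t)
    beta.cur <- beta.update(alpha = alpha, gamma = gamma, delta = delta,
                            lambda = lambda.cur, y = y)
    lambda.draws[i, ] <- lambda.cur
    beta.draws[i] <- beta.cur
  }
  # optional burn-in and thinning (step 5)
  keep <- seq(burnin + 1, n.sims, by = thin)
  list(lambda.draws = lambda.draws[keep, , drop = FALSE],
       beta.draws = beta.draws[keep])
}

# Hypothetical call (n.sims and beta.start are assumptions; alpha, gamma,
# delta are the prior values given above):
# posterior <- gibbs(n.sims = 10000, beta.start = 1, alpha = 1.8,
#                    gamma = 0.01, delta = 1, y = y, t = t)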

7.  Do Monte Carlo Integration on the resulting Markov chain, whose draws are approximately samples from the posterior.

> posterior <- gibbs(...)
> colMeans(posterior$lambda.draws)
 [1] 0.07113 0.15098 0.10447 0.12321 0.65680 0.62212 0.86522 0.85465
 [9] 1.35524 1.92694

> mean(posterior$beta.draws)
[1] 2.389

> apply(posterior$lambda.draws, 2, sd)
 [1] 0.02759 0.08974 0.04012 0.03071 0.30899 0.13676 0.55689 0.54814
 [9] 0.60854 0.40812

> sd(posterior$beta.draws)
[1] 0.6986

    Outline

    Introduction to Markov Chain Monte Carlo


    Gibbs Sampling

    The Metropolis-Hastings Algorithm

Suppose we have a posterior p(θ|y) that we want to sample from, but

  - the posterior doesn't look like any distribution we know (no conjugacy)

  - the posterior consists of more than 2 parameters (grid approximations intractable)

  - some (or all) of the full conditionals do not look like any distributions we know (no Gibbs sampling for those whose full conditionals we don't know)

If all else fails, we can use the Metropolis-Hastings algorithm, which will always work.

Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm proceeds in the following steps:

1.  Choose a starting value θ^(0).

2.  At iteration t, draw a candidate θ* from a jumping distribution J_t(θ*|θ^(t−1)).

3.  Compute an acceptance ratio (probability):

    r = [p(θ*|y) / J_t(θ*|θ^(t−1))] / [p(θ^(t−1)|y) / J_t(θ^(t−1)|θ*)]

4.  Accept θ* as θ^(t) with probability min(r, 1). If θ* is not accepted, then θ^(t) = θ^(t−1).
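To make the steps concrete, here is a minimal sketch in R of the random-walk Metropolis special case, where the jumping distribution is a symmetric Normal so the J_t terms cancel in r. The target log.post and the tuning scale are hypothetical stand-ins:

# Random-walk Metropolis sketch (symmetric Normal jumps, so r reduces
# to p(theta*|y) / p(theta^(t-1)|y)); log.post and scale are assumptions.
metropolis <- function(n.sims, theta.start, log.post, scale = 1) {
  draws <- rep(NA, n.sims)
  theta.cur <- theta.start                               # step 1: starting value
  for (t in 1:n.sims) {
    theta.star <- rnorm(1, mean = theta.cur, sd = scale) # step 2: candidate
    log.r <- log.post(theta.star) - log.post(theta.cur)  # step 3: ratio, on the log scale
    if (log(runif(1)) < log.r) theta.cur <- theta.star   # step 4: accept w.p. min(r, 1)
    draws[t] <- theta.cur                                # if rejected, keep theta^(t-1)
  }
  draws
}

# Hypothetical usage: sample from a standard Normal "posterior"
# draws <- metropolis(10000, theta.start = 0,
#                     log.post = function(x) dnorm(x, log = TRUE))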