Required Part 2
TRANSCRIPT
7/26/2019
Monte Carlo Methods with R: Monte Carlo Optimization [80]
Monte Carlo Optimization
Introduction
Optimization problems can mostly be seen as one of two kinds:
Find the extrema of a function $h(\theta)$ over a domain $\Theta$
Find the solution(s) to an implicit equation $g(\theta) = 0$ over a domain $\Theta$.
The problems are exchangeable:
The second one is a minimization problem for a function like $h(\theta) = g^2(\theta)$,
while the first one is equivalent to solving $\partial h(\theta)/\partial\theta = 0$.
We only focus on the maximization problem
Monte Carlo Methods with R: Monte Carlo Optimization [81]
Monte Carlo Optimization
Deterministic or Stochastic
Similar to integration, optimization can be deterministic or stochastic
Deterministic: performance dependent on properties of the function, such as convexity, boundedness, and smoothness
Stochastic (simulation):
Properties of $h$ play a lesser role in simulation-based approaches.
Therefore, if $h$ is complex or irregular, choose the stochastic approach.
Monte Carlo Methods with R: Monte Carlo Optimization [82]
Monte Carlo Optimization
Numerical Optimization
R has several embedded functions to solve optimization problems
The simplest one is optimize (one-dimensional)
Example: Maximizing a Cauchy $C(\theta, 1)$ likelihood
When maximizing the likelihood of a Cauchy $C(\theta, 1)$ sample,
$$\ell(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2},$$
the sequence of maxima (MLEs) converges to $\theta^\star = 0$ when $n \to \infty$.
But the journey is not a smooth one...
Monte Carlo Methods with R: Monte Carlo Optimization [83]
Monte Carlo Optimization
Cauchy Likelihood
MLEs (left) at each sample size, $n = 1, \ldots, 500$, and plot of the final likelihood (right).
Why are the MLEs so wiggly?
The likelihood is not as well behaved as it seems
Monte Carlo Methods with R: Monte Carlo Optimization [84]
Monte Carlo Optimization
Cauchy Likelihood-2
The likelihood $\ell(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{1 + (x_i - \theta)^2}$
is like a polynomial of degree $2n$
The derivative has $2n$ zeros
Hard to see if $n = 500$; here is $n = 5$
R code
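A minimal sketch of this maximization with optimize(); the simulated sample and the search interval are illustrative assumptions, not the book's exact code:

```r
# Maximizing the Cauchy C(theta, 1) log-likelihood with optimize()
set.seed(1)
xm <- rcauchy(500)                         # illustrative sample, n = 500
loglik <- function(theta)
  -sum(log(1 + (xm - theta)^2))            # log of prod 1/(1 + (x_i - theta)^2)
fit <- optimize(loglik, interval = c(-10, 10), maximum = TRUE)
fit$maximum                                # MLE, close to theta* = 0 for large n
```

Working on the log scale avoids the underflow that the product form would cause for large $n$.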
Monte Carlo Methods with R: Monte Carlo Optimization [85]
Monte Carlo Optimization
Newton-Raphson
Similarly, nlm is a generic R function that uses the Newton-Raphson method
Based on the recurrence relation
$$\theta_{i+1} = \theta_i - \left[\nabla^2 h(\theta_i)\right]^{-1} \nabla h(\theta_i),$$
where the matrix $\nabla^2 h(\theta_i)$ of second derivatives is called the Hessian
This method is perfect when $h$ is quadratic
But may also deteriorate when $h$ is highly nonlinear
It also obviously depends on the starting point $\theta_0$ when $h$ has several minima.
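A minimal nlm() sketch on the same Cauchy likelihood; since nlm() minimizes, we pass the negative log-likelihood, and the starting point (the sample median) is an assumption:

```r
# nlm() runs Newton-type iterations on the negative log-likelihood
set.seed(2)
xm <- rcauchy(100)
nll <- function(theta) sum(log(1 + (xm - theta)^2))  # negative log-likelihood
fit <- nlm(nll, p = median(xm))                      # start near the bulk of the data
fit$estimate                                         # a local maximizer of the likelihood
```

A different starting point can land on one of the many other local modes of this likelihood.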
Monte Carlo Methods with R: Monte Carlo Optimization [86]
Monte Carlo Optimization
Newton-Raphson; Mixture Model Likelihood
Bimodal mixture model likelihood: $\frac{1}{4} N(\mu_1, 1) + \frac{3}{4} N(\mu_2, 1)$
Sequences go to the closest mode
Starting point $(1, 1)$ has a steep gradient
Bypasses the main mode $(0.68, 1.98)$
Goes to the other mode (lower likelihood)
Monte Carlo Methods with R: Monte Carlo Optimization [87]
Stochastic search
A Basic Solution
A natural, if rudimentary, way of using simulation to find $\max_\theta h(\theta)$:
Simulate points over $\Theta$ according to an arbitrary distribution $f$ positive on $\Theta$
Until a high value of $h(\theta)$ is observed
Recall $h(x) = [\cos(50x) + \sin(20x)]^2$
Max $= 3.8325$
Histogram of 1000 runs
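A sketch of one such run; the uniform sampling distribution on $[0,1]$ and the number of draws are illustrative assumptions:

```r
# Crude stochastic search: sample uniformly, record the best h value seen
h <- function(x) (cos(50 * x) + sin(20 * x))^2
set.seed(3)
u <- runif(1e5)                  # arbitrary positive density on [0, 1]
max(h(u))                        # approaches the true maximum 3.8325
```

Repeating this over many runs gives the histogram of approximate maxima mentioned above.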
Monte Carlo Methods with R: Monte Carlo Optimization [88]
Stochastic search
Stochastic Gradient Methods
Generating direct simulations from the target can be difficult.
A different stochastic approach to maximization:
Explore the surface in a local manner.
Can use $\theta_{j+1} = \theta_j + \epsilon_j$
A Markov chain
The random component $\epsilon_j$ can be arbitrary
Can also use features of the function: a Newton-Raphson variation
$$\theta_{j+1} = \theta_j + \alpha_j \nabla h(\theta_j), \qquad \alpha_j > 0,$$
where $\nabla h(\theta_j)$ is the gradient and $\alpha_j$ the step size
Monte Carlo Methods with R: Monte Carlo Optimization [89]
Stochastic search
Stochastic Gradient Methods-2
In difficult problems,
The gradient sequence will most likely get stuck in a local extremum of $h$.
Stochastic variation:
$$\nabla h(\theta_j) \approx \frac{h(\theta_j + \beta_j \zeta_j) - h(\theta_j - \beta_j \zeta_j)}{2 \beta_j}\, \zeta_j = \frac{\Delta h(\theta_j, \beta_j \zeta_j)}{2 \beta_j}\, \zeta_j,$$
where $(\beta_j)$ is a second decreasing sequence and $\zeta_j$ is uniform on the unit sphere $\|\zeta\| = 1$.
We then use
$$\theta_{j+1} = \theta_j + \frac{\alpha_j}{2 \beta_j}\, \Delta h(\theta_j, \beta_j \zeta_j)\, \zeta_j$$
Monte Carlo Methods with R: Monte Carlo Optimization [90]
Stochastic Search
A Difficult Minimization
Many Local Minima
Global Min at (0, 0)
Code in the text
Monte Carlo Methods with R: Monte Carlo Optimization [91]
Stochastic Search
A Difficult Minimization 2
Scenario    1                  2                  3               4
alpha_j     1/log(j+1)         1/100 log(j+1)     1/(j+1)         1/(j+1)
beta_j      1/log(j+1)^{0.1}   1/log(j+1)^{0.1}   1/(j+1)^{0.5}   1/(j+1)^{0.1}
Conditions: $\alpha_j \to 0$ slowly, $\sum_j \alpha_j = \infty$; $\beta_j \to 0$ more slowly, $\sum_j (\alpha_j/\beta_j)^2 < \infty$
Monte Carlo Methods with R: Monte Carlo Optimization [92]
Simulated Annealing
Introduction
This name is borrowed from Metallurgy:
A metal manufactured by a slow decrease of temperature (annealing)
Is stronger than a metal manufactured by a fast decrease of temperature.
The fundamental idea of simulated annealing methods:
A change of scale, or temperature,
Allows for faster moves on the surface of the function $h$ to maximize.
Rescaling partially avoids the trapping attraction of local maxima.
As $T$ decreases toward 0, the values simulated from this distribution become concentrated in a narrower and narrower neighborhood of the local maxima of $h$
Monte Carlo Methods with R: Monte Carlo Optimization [93]
Simulated Annealing
Metropolis Algorithm/Simulated Annealing
Simulation method proposed by Metropolis et al. (1953)
Starting from $\theta_0$, $\zeta$ is generated from a uniform distribution in a neighborhood of $\theta_0$.
The new value of $\theta$ is generated as
$$\theta_1 = \begin{cases} \zeta & \text{with probability } \rho = \exp(\Delta h / T) \wedge 1 \\ \theta_0 & \text{with probability } 1 - \rho, \end{cases}$$
where $\Delta h = h(\zeta) - h(\theta_0)$
If $h(\zeta) \geq h(\theta_0)$, $\zeta$ is accepted
If $h(\zeta) < h(\theta_0)$, $\zeta$ may still be accepted
This allows escape from local maxima
Monte Carlo Methods with R: Monte Carlo Optimization [94]
Simulated Annealing
Metropolis Algorithm - Comments
Simulated annealing typically modifies the temperature $T$ at each iteration
It has the form
1. Simulate $\zeta$ from an instrumental distribution with density $g(|\zeta - \theta_i|)$;
2. Accept $\theta_{i+1} = \zeta$ with probability $\rho_i = \exp\{\Delta h_i / T_i\} \wedge 1$; take $\theta_{i+1} = \theta_i$ otherwise.
3. Update $T_i$ to $T_{i+1}$.
All positive moves are accepted
As $T \downarrow 0$:
Harder to accept downward moves; no big downward moves
Not a Markov chain - difficult to analyze
Monte Carlo Methods with R: Monte Carlo Optimization [95]
Simulated Annealing
Simple Example
Trajectory: $T_i = 1/(1+i)^2$
Log trajectory also works
Can guarantee finding the global max
R code
Monte Carlo Methods with R: Monte Carlo Optimization [96]
Simulated Annealing
Normal Mixture
Previous normal mixture
Most sequences find max
They visit both modes
Monte Carlo Methods with R: Monte Carlo Optimization [97]
Stochastic Approximation
Introduction
We now consider methods that work with the objective function $h$
Rather than being concerned with fast exploration of the domain $\Theta$.
Unfortunately, the use of those methods results in an additional level of error
Due to this approximation of $h$.
But the objective function in many statistical problems can be expressed as $h(x) = E[H(x, Z)]$
This is the setting of so-called missing-data models
Monte Carlo Methods with R: Monte Carlo Optimization [98]
Stochastic Approximation
Optimizing Monte Carlo Approximations
If $h(x) = E[H(x, Z)]$, a Monte Carlo approximation is
$$\hat h(x) = \frac{1}{m} \sum_{i=1}^{m} H(x, z_i),$$
where the $Z_i$s are generated from the conditional distribution $f(z \mid x)$.
This approximation yields a convergent estimator of $h(x)$ for every value of $x$
This is a pointwise convergent estimator
Its use in optimization setups is not recommended:
A changing sample of $Z_i$s gives an unstable sequence of evaluations
And a rather noisy approximation to $\arg\max h(x)$
Monte Carlo Methods with R: Monte Carlo Optimization [99]
Stochastic Approximation
Bayesian Probit
Example: Bayesian analysis of a simple probit model
$Y \in \{0, 1\}$ has a distribution depending on a covariate $X$:
$$P(Y = 1 \mid X = x) = 1 - P(Y = 0 \mid X = x) = \Phi(\theta_0 + \theta_1 x),$$
Illustrate with the Pima.tr dataset: $Y$ = diabetes indicator, $X$ = BMI
Typically infer $\theta_0$ from the marginal posterior
$$\arg\max_{\theta_0} \int \prod_{i=1}^{n} \Phi(\theta_0 + \theta_1 x_i)^{y_i}\, \Phi(-\theta_0 - \theta_1 x_i)^{1 - y_i} \, d\theta_1 = \arg\max_{\theta_0} h(\theta_0)$$
For a flat prior on $\theta$ and a sample $(x_1, \ldots, x_n)$.
Monte Carlo Methods with R: Monte Carlo Optimization [100]
Stochastic Approximation
Bayesian Probit Importance Sampling
No analytic expression for $h$
The conditional distribution of $\theta_1$ given $\theta_0$ is also nonstandard
Use importance sampling with a $t$ distribution with 5 df
Take $\mu = 0.1$ and $\sigma = 0.03$ (MLEs)
Importance sampling approximation:
$$\hat h(\theta_0) = \frac{1}{M} \sum_{m=1}^{M} \prod_{i=1}^{n} \Phi(\theta_0 + \theta_1^m x_i)^{y_i}\, \Phi(-\theta_0 - \theta_1^m x_i)^{1 - y_i}\; t_5(\theta_1^m; \mu, \sigma)^{-1},$$
Monte Carlo Methods with R: Monte Carlo Optimization [101]
Stochastic Approximation
Importance Sampling Evaluation
Plotting this approximation of $h$ with $t$ samples simulated for each value of $\theta_0$:
The maximization of the represented $\hat h$ function is not to be trusted as an approximation to the maximization of $h$.
But if we use the same $t$ sample for all values of $\theta_0$,
We obtain a much smoother function
We use importance sampling based on a single sample of $Z_i$s
Simulated from an importance function $g(z)$ for all values of $x$
Estimate $h$ with
$$\hat h_m(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{f(z_i \mid x)}{g(z_i)}\, H(x, z_i).$$
Monte Carlo Methods with R: Monte Carlo Optimization [102]
Stochastic Approximation
Importance Sampling Likelihood Representation
Top: 100 runs, different samples
Middle: 100 runs, same sample
Bottom: averages over 100 runs
The averages over 100 runs are the same - but we will not do 100 runs
R code: run pimax(25) from mcsm
Monte Carlo Methods with R: Monte Carlo Optimization [103]
Stochastic Approximation
Comments
This approach is not absolutely fool-proof:
The precision of $\hat h_m(x)$ has no reason to be independent of $x$
The number $m$ of simulations has to reflect the most varying case.
As in every importance sampling experiment, the choice of the candidate $g$ is influential
In obtaining a good (or a disastrous) approximation of $h(x)$.
Checking for the finite variance of the ratio $f(z_i \mid x) H(x, z_i) / g(z_i)$
Is a minimal requirement in the choice of $g$
Monte Carlo Methods with R: Monte Carlo Optimization [104]
Missing-Data Models and Demarginalization
Introduction
Missing data models are special cases of the representation $h(x) = E[H(x, Z)]$
These are models where the density of the observations can be expressed as
$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz .$$
This representation occurs in many statistical settings:
Censoring models and mixtures
Latent variable models (tobit, probit, arch, stochastic volatility, etc.)
Genetics: missing SNP calls
Monte Carlo Methods with R: Monte Carlo Optimization [105]
Missing-Data Models and Demarginalization
Mixture Model
Example: Normal mixture model as a missing-data model
Start with a sample $(x_1, \ldots, x_n)$
Introduce a vector $(z_1, \ldots, z_n) \in \{1, 2\}^n$ such that
$$P(Z_i = 1) = 1 - P(Z_i = 2) = 1/4, \qquad X_i \mid Z_i = z \sim N(\mu_z, 1).$$
The (observed) likelihood is then obtained as $E[H(x, Z)]$ for
$$H(x, z) \propto \prod_{i;\, z_i = 1} \frac{1}{4} \exp\left\{-(x_i - \mu_1)^2/2\right\} \prod_{i;\, z_i = 2} \frac{3}{4} \exp\left\{-(x_i - \mu_2)^2/2\right\}.$$
We recover the mixture model $\frac{1}{4} N(\mu_1, 1) + \frac{3}{4} N(\mu_2, 1)$
As the marginal distribution of $X_i$.
Monte Carlo Methods with R: Monte Carlo Optimization [106]
Missing-Data Models and Demarginalization
Censored-Data Likelihood
Example: Censored-data likelihood
Censored data may come from experiments
Where some potential observations are replaced with a lower bound
Because they take too long to observe.
Suppose that we observe $Y_1, \ldots, Y_m$, iid, from $f(y - \theta)$
And the $(n - m)$ remaining $(Y_{m+1}, \ldots, Y_n)$ are censored at the threshold $a$.
The corresponding likelihood function is
$$L(\theta \mid y) = [1 - F(a - \theta)]^{n-m} \prod_{i=1}^{m} f(y_i - \theta),$$
where $F$ is the cdf associated with $f$
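A sketch of this likelihood for the normal case $f = N(\theta, 1)$; the simulated data and the true value $\theta = 0.8$ are illustrative assumptions:

```r
# Censored-data log-likelihood for a N(theta, 1) model censored at a
set.seed(5)
n <- 100; a <- 1; theta.true <- 0.8
yfull <- rnorm(n, theta.true)
y <- yfull[yfull <= a]                     # the m uncensored observations
m <- length(y)
loglik <- function(theta)
  (n - m) * log(1 - pnorm(a - theta)) +    # (n-m) censored terms [1 - F(a - theta)]
  sum(dnorm(y - theta, log = TRUE))        # m observed terms f(y_i - theta)
optimize(loglik, c(-3, 3), maximum = TRUE)$maximum   # direct MLE of theta
```

This direct maximization gives the benchmark that the EM iterations below should reproduce.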
Monte Carlo Methods with R: Monte Carlo Optimization [107]
Missing-Data Models and Demarginalization
Recovering the Observed-Data Likelihood
If we had observed the last $n - m$ values,
Say $z = (z_{m+1}, \ldots, z_n)$, with $z_i \geq a$ $(i = m+1, \ldots, n)$,
We could have constructed the (complete-data) likelihood
$$L^c(\theta \mid y, z) = \prod_{i=1}^{m} f(y_i - \theta) \prod_{i=m+1}^{n} f(z_i - \theta).$$
Note that
$$L(\theta \mid y) = E[L^c(\theta \mid y, Z)] = \int_{\mathcal{Z}} L^c(\theta \mid y, z)\, f(z \mid y, \theta) \, dz,$$
where $f(z \mid y, \theta)$ is the density of the missing data conditional on the observed data:
The product of the $f(z_i - \theta)/[1 - F(a - \theta)]$s,
i.e., $f(z - \theta)$ restricted to $(a, +\infty)$.
Monte Carlo Methods with R: Monte Carlo Optimization [108]
Missing-Data Models and Demarginalization
Comments
When we have the relationship
$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz,$$
$Z$ merely serves to simplify calculations;
it does not necessarily have a specific meaning
We have the complete-data likelihood $L^c(\theta \mid x, z) = f(x, z \mid \theta)$
The likelihood we would obtain
Were we to observe $(x, z)$, the complete data
REMEMBER: $g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz$.
Monte Carlo Methods with R: Monte Carlo Optimization [109]
The EM Algorithm
Introduction
The EM algorithm is a deterministic optimization technique
Dempster, Laird and Rubin (1977)
Takes advantage of the missing-data representation
Builds a sequence of easier maximization problems
Whose limit is the answer to the original problem
We assume that we observe $X_1, \ldots, X_n \sim g(x \mid \theta)$ that satisfies
$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz,$$
And we want to compute $\hat\theta = \arg\max L(\theta \mid x) = \arg\max g(x \mid \theta)$.
Monte Carlo Methods with R: Monte Carlo Optimization [110]
The EM Algorithm
First Details
With the relationship $g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta) \, dz$ and
$$(X, Z) \sim f(x, z \mid \theta),$$
The conditional distribution of the missing data $Z$ given the observed data $x$ is
$$k(z \mid \theta, x) = \frac{f(x, z \mid \theta)}{g(x \mid \theta)}.$$
Taking the logarithm of this expression leads to the following relationship:
$$\underbrace{\log L(\theta \mid x)}_{\text{Obs. Data}} = \underbrace{E_{\theta_0}[\log L^c(\theta \mid x, Z)]}_{\text{Complete Data}} - \underbrace{E_{\theta_0}[\log k(Z \mid \theta, x)]}_{\text{Missing Data}},$$
where the expectation is with respect to $k(z \mid \theta_0, x)$.
In maximizing $\log L(\theta \mid x)$, we can ignore the last term
Monte Carlo Methods with R: Monte Carlo Optimization [111]
The EM Algorithm
Iterations
Denoting $Q(\theta \mid \theta_0, x) = E_{\theta_0}[\log L^c(\theta \mid x, Z)]$,
the EM algorithm indeed proceeds by maximizing $Q(\theta \mid \theta_0, x)$ at each iteration:
If $\hat\theta_{(1)} = \arg\max_\theta Q(\theta \mid \theta_0, x)$, then $\hat\theta_{(0)} \to \hat\theta_{(1)}$.
This gives a sequence of estimators $\{\hat\theta_{(j)}\}$, where
$$\hat\theta_{(j)} = \arg\max_\theta Q(\theta \mid \hat\theta_{(j-1)}).$$
This iterative scheme
Contains both an expectation step
And a maximization step
Giving the algorithm its name.
Monte Carlo Methods with R: Monte Carlo Optimization [112]
The EM Algorithm
The Algorithm
Pick a starting value $\hat\theta_{(0)}$ and set $m = 0$. Repeat:
1. Compute (the E-step)
$$Q(\theta \mid \hat\theta_{(m)}, x) = E_{\hat\theta_{(m)}}[\log L^c(\theta \mid x, Z)],$$
where the expectation is with respect to $k(z \mid \hat\theta_{(m)}, x)$.
2. Maximize $Q(\theta \mid \hat\theta_{(m)}, x)$ in $\theta$ and take (the M-step)
$$\hat\theta_{(m+1)} = \arg\max_\theta Q(\theta \mid \hat\theta_{(m)}, x)$$
and set $m = m + 1$,
until a fixed point is reached; i.e., $\hat\theta_{(m+1)} = \hat\theta_{(m)}$.
Monte Carlo Methods with R: Monte Carlo Optimization [113]
The EM Algorithm
Properties
Jensen's inequality: The likelihood increases at each step of the EM algorithm,
$$L(\hat\theta_{(j+1)} \mid x) \geq L(\hat\theta_{(j)} \mid x),$$
with equality holding if and only if $Q(\hat\theta_{(j+1)} \mid \hat\theta_{(j)}, x) = Q(\hat\theta_{(j)} \mid \hat\theta_{(j)}, x)$.
Every limit point of an EM sequence $\{\hat\theta_{(j)}\}$ is a stationary point of $L(\theta \mid x)$
Not necessarily the maximum likelihood estimator
In practice, we run EM several times with different starting points.
Implementing the EM algorithm thus means being able to
(a) Compute the function $Q(\theta' \mid \theta, x)$
(b) Maximize this function.
Monte Carlo Methods with R: Monte Carlo Optimization [114]
The EM Algorithm
Censored Data Example
The complete-data likelihood is
$$L^c(\theta \mid y, z) \propto \prod_{i=1}^{m} \exp\{-(y_i - \theta)^2/2\} \prod_{i=m+1}^{n} \exp\{-(z_i - \theta)^2/2\},$$
with expected complete-data log-likelihood
$$Q(\theta \mid \theta_0, y) = -\frac{1}{2} \sum_{i=1}^{m} (y_i - \theta)^2 - \frac{1}{2} \sum_{i=m+1}^{n} E_{\theta_0}[(Z_i - \theta)^2],$$
where the $Z_i$ are distributed from a normal $N(\theta_0, 1)$ distribution truncated at $a$.
M-step: differentiating $Q(\theta \mid \theta_0, y)$ in $\theta$ and setting it equal to 0 gives
$$\hat\theta = \frac{m \bar y + (n - m) E_{\theta_0}[Z_1]}{n}, \qquad \text{with } E_{\theta_0}[Z_1] = \theta_0 + \frac{\varphi(a - \theta_0)}{1 - \Phi(a - \theta_0)}.$$
Monte Carlo Methods with R: Monte Carlo Optimization [115]
The EM Algorithm
Censored Data MLEs
EM sequence:
$$\hat\theta_{(j+1)} = \frac{m}{n}\,\bar y + \frac{n - m}{n}\left[ \hat\theta_{(j)} + \frac{\varphi(a - \hat\theta_{(j)})}{1 - \Phi(a - \hat\theta_{(j)})} \right]$$
Climbing the Likelihood
R code
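A sketch of this EM sequence in R; the data-generating step (truth $\theta = 0.8$, threshold $a = 1$) is an illustrative assumption:

```r
# EM iterations for the censored normal mean, following the update above
set.seed(5)
n <- 100; a <- 1; theta.true <- 0.8
yfull <- rnorm(n, theta.true)
y <- yfull[yfull <= a]                     # observed (uncensored) values
m <- length(y)
theta <- mean(y)                           # starting value
for (j in 1:100) {
  ez <- theta + dnorm(a - theta) / (1 - pnorm(a - theta))  # E[Z | Z > a]
  theta <- (m * mean(y) + (n - m) * ez) / n                # EM update
}
theta    # EM fixed point = MLE of theta
```

Each pass computes the truncated-normal mean (E-step) and plugs it into the closed-form M-step.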
Monte Carlo Methods with R: Monte Carlo Optimization [116]
The EM Algorithm
Normal Mixture
Normal mixture bimodal likelihood:
$$Q(\theta' \mid \theta, x) = -\frac{1}{2} \sum_{i=1}^{n} E\left[ Z_i (x_i - \mu_1)^2 + (1 - Z_i)(x_i - \mu_2)^2 \,\middle|\, x \right].$$
Solving the M-step then provides the closed-form expressions
$$\mu_1' = \frac{E\left[ \sum_{i=1}^{n} Z_i x_i \mid x \right]}{E\left[ \sum_{i=1}^{n} Z_i \mid x \right]} \qquad\text{and}\qquad \mu_2' = \frac{E\left[ \sum_{i=1}^{n} (1 - Z_i) x_i \mid x \right]}{E\left[ \sum_{i=1}^{n} (1 - Z_i) \mid x \right]},$$
since
$$E[Z_i \mid x] = \frac{\varphi(x_i - \mu_1)}{\varphi(x_i - \mu_1) + 3\,\varphi(x_i - \mu_2)}.$$
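A compact EM sketch for this mixture; the simulated sample (true means $(0, 2.5)$) and the starting point are illustrative assumptions:

```r
# EM for the mixture (1/4) N(mu1, 1) + (3/4) N(mu2, 1)
em_mix <- function(x, mu, iters = 100) {
  for (j in 1:iters) {
    # E-step: E[Z_i | x] = phi(x_i - mu1) / (phi(x_i - mu1) + 3 phi(x_i - mu2))
    z <- dnorm(x, mu[1]) / (dnorm(x, mu[1]) + 3 * dnorm(x, mu[2]))
    # M-step: weighted means
    mu <- c(sum(z * x) / sum(z), sum((1 - z) * x) / sum(1 - z))
  }
  mu
}
set.seed(6)
x <- c(rnorm(100, 0), rnorm(300, 2.5))   # sample with (mu1, mu2) = (0, 2.5)
em_mix(x, mu = c(-1, 1))                 # converges near (0, 2.5)
```

Different starting points can lead to the other (lower-likelihood) mode, as the next slide illustrates.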
Monte Carlo Methods with R: Monte Carlo Optimization [117]
The EM Algorithm
Normal Mixture MLEs
EM run five times with various starting points
Two out of five sequences reach the higher mode
The others converge to the lower mode
Monte Carlo Methods with R: Monte Carlo Optimization [118]
Monte Carlo EM
Introduction
If the computation of $Q(\theta \mid \theta_0, x)$ is difficult, we can use Monte Carlo
For $Z_1, \ldots, Z_T \sim k(z \mid x, \hat\theta_{(m)})$, maximize
$$\hat Q(\theta \mid \theta_0, x) = \frac{1}{T} \sum_{i=1}^{T} \log L^c(\theta \mid x, z_i)$$
Better: use importance sampling
Since
$$\arg\max_\theta L(\theta \mid x) = \arg\max_\theta \log \frac{g(x \mid \theta)}{g(x \mid \theta_{(0)})} = \arg\max_\theta \log E_{\theta_{(0)}}\!\left[\left.\frac{f(x, z \mid \theta)}{f(x, z \mid \theta_{(0)})} \right| x\right],$$
use the approximation to the log-likelihood (up to an additive constant)
$$\log L(\theta \mid x) \approx \log \frac{1}{T} \sum_{i=1}^{T} \frac{L^c(\theta \mid x, z_i)}{L^c(\theta_{(0)} \mid x, z_i)},$$
Monte Carlo Methods with R: Monte Carlo Optimization [119]
Monte Carlo EM
Genetics Data
Example: Genetic linkage.
A classic example of the EM algorithm
Observations $(x_1, x_2, x_3, x_4)$ are gathered from the multinomial distribution
$$\mathcal{M}\left(n;\ \frac{1}{2} + \frac{\theta}{4},\ \frac{1}{4}(1 - \theta),\ \frac{1}{4}(1 - \theta),\ \frac{\theta}{4}\right).$$
Estimation is easier if the $x_1$ cell is split into two cells
We create the augmented model
$$(z_1, z_2, x_2, x_3, x_4) \sim \mathcal{M}\left(n;\ \frac{1}{2},\ \frac{\theta}{4},\ \frac{1}{4}(1 - \theta),\ \frac{1}{4}(1 - \theta),\ \frac{\theta}{4}\right)$$
with $x_1 = z_1 + z_2$.
Complete-data likelihood: $\theta^{z_2 + x_4}(1 - \theta)^{x_2 + x_3}$
Observed-data likelihood: $(2 + \theta)^{x_1}\, \theta^{x_4} (1 - \theta)^{x_2 + x_3}$
Monte Carlo Methods with R: Monte Carlo Optimization [120]
Monte Carlo EM
Genetics Linkage Calculations
The expected complete log-likelihood function is
$$E_{\theta_0}[(Z_2 + x_4) \log\theta + (x_2 + x_3) \log(1 - \theta)] = \left( \frac{\theta_0}{2 + \theta_0}\, x_1 + x_4 \right) \log\theta + (x_2 + x_3) \log(1 - \theta),$$
which can easily be maximized in $\theta$, leading to the EM step
$$\hat\theta_1 = \left. \left( \frac{\theta_0\, x_1}{2 + \theta_0} + x_4 \right) \right/ \left( \frac{\theta_0\, x_1}{2 + \theta_0} + x_2 + x_3 + x_4 \right).$$
Monte Carlo EM: Replace the expectation with
$$\bar z_m = \frac{1}{m} \sum_{i=1}^{m} z_i, \qquad z_i \sim \mathcal{B}(x_1,\ \theta_0/(2 + \theta_0)).$$
The MCEM step would then be
$$\hat\theta^*_1 = \frac{\bar z_m + x_4}{\bar z_m + x_2 + x_3 + x_4},$$
which converges to $\hat\theta_1$ as $m$ grows to infinity.
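A sketch of the exact EM step in R; the counts (125, 18, 20, 34) are the classic data often used with this example, an assumption here since the slide does not list them:

```r
# EM for the genetic linkage multinomial model
x <- c(125, 18, 20, 34)    # classic linkage counts (assumed)
theta <- 0.5               # starting value
for (j in 1:100) {
  ez2 <- theta * x[1] / (2 + theta)                 # E-step: E[Z2 | x1, theta]
  theta <- (ez2 + x[4]) / (ez2 + x[2] + x[3] + x[4])  # M-step
}
theta    # converges to the MLE, about 0.6268 for these counts
```

Replacing `ez2` by the Monte Carlo average $\bar z_m$ of binomial draws gives the MCEM variant described above.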
Monte Carlo Methods with R: Monte Carlo Optimization [121]
Monte Carlo EM
Genetics Linkage MLEs
Note the variation in the MCEM sequence
Can control it with the number of simulations
R code
Monte Carlo Methods with R: Monte Carlo Optimization [122]
Monte Carlo EM
Random effect logit model
Example: Random effect logit model
In a random effect logit model,
$y_{ij}$ is distributed conditionally on one covariate $x_{ij}$ as a logit model
$$P(y_{ij} = 1 \mid x_{ij}, u_i, \beta) = \frac{\exp\{\beta x_{ij} + u_i\}}{1 + \exp\{\beta x_{ij} + u_i\}},$$
where $u_i \sim N(0, \sigma^2)$ is an unobserved random effect.
$(U_1, \ldots, U_n)$ therefore corresponds to the missing data $Z$
Monte Carlo Methods with R: Monte Carlo Optimization [123]
-
7/26/2019 Required Part 2
44/80
Monte Carlo EM
Random effect logit model likelihood
For the complete-data likelihood with $\theta = (\beta, \sigma)$,
$$Q(\theta' \mid \theta, x, y) = \sum_{i,j} y_{ij}\, E[\beta' x_{ij} + U_i \mid \beta, \sigma, x, y] - \sum_{i,j} E\left[\log\left(1 + \exp\{\beta' x_{ij} + U_i\}\right) \,\middle|\, \beta, \sigma, x, y\right] - \sum_i E[U_i^2 \mid \beta, \sigma, x, y]/2\sigma'^2 - n \log\sigma',$$
it is impossible to compute the expectations in $U_i$.
Were those available, the M-step would be difficult but feasible
MCEM: Simulate the $U_i$s conditionally on $\beta, \sigma, x, y$ from
$$\pi(u_i \mid \beta, \sigma, x, y) \propto \frac{\exp\left\{ \sum_j y_{ij} u_i - u_i^2/2\sigma^2 \right\}}{\prod_j \left[1 + \exp\{\beta x_{ij} + u_i\}\right]}$$
Monte Carlo Methods with R: Monte Carlo Optimization [124]
Monte Carlo EM
Random effect logit MLEs
Top: Sequence of $\hat\beta$s from the MCEM algorithm
Bottom: Sequence of completed likelihoods
MCEM sequence:
Increases the number of Monte Carlo steps at each iteration
MCEM algorithm:
Does not have the EM monotonicity property
Monte Carlo Methods with R: MetropolisHastings Algorithms [125]
Chapter 6: Metropolis-Hastings Algorithms
"How absurdly simple!", I cried. "Quite so!", said he, a little nettled. "Every problem becomes very childish when once it is explained to you."
Arthur Conan Doyle, The Adventure of the Dancing Men
This Chapter
The first of two chapters on simulation methods based on Markov chains
The Metropolis-Hastings algorithm is one of the most general MCMC algorithms
And one of the simplest.
There is a quick refresher on Markov chains, just the basics.
We focus on the most common versions of the Metropolis-Hastings algorithm.
We also look at calibration of the algorithm via its acceptance rate
Monte Carlo Methods with R: MetropolisHastings Algorithms [126]
Metropolis-Hastings Algorithms
Introduction
We now make a fundamental shift in the choice of our simulation strategy.
Up to now we have typically generated iid variables
The Metropolis-Hastings algorithm generates correlated variables from a Markov chain
The use of Markov chains broadens our scope of applications:
The requirements on the target $f$ are quite minimal
Efficient decompositions of high-dimensional problems into a sequence of smaller problems.
This has been part of a Paradigm Shift in Statistics
Monte Carlo Methods with R: MetropolisHastings Algorithms [127]
Metropolis-Hastings Algorithms
A Peek at Markov Chain Theory
A minimalist refresher on Markov chains
Basically to define terms; see Robert and Casella (2004, Chapter 6) for more of the story
A Markov chain $\{X^{(t)}\}$ is a sequence of dependent random variables
$$X^{(0)}, X^{(1)}, X^{(2)}, \ldots, X^{(t)}, \ldots$$
where the probability distribution of $X^{(t)}$ depends only on $X^{(t-1)}$.
The conditional distribution of $X^{(t)} \mid X^{(t-1)}$ is a transition kernel $K$:
$$X^{(t+1)} \mid X^{(0)}, X^{(1)}, \ldots, X^{(t)} \sim K(X^{(t)}, X^{(t+1)}).$$
Monte Carlo Methods with R: MetropolisHastings Algorithms [128]
Markov Chains
Basics
For example, a simple random walk Markov chain satisfies
$$X^{(t+1)} = X^{(t)} + \epsilon_t, \qquad \epsilon_t \sim N(0, 1),$$
The Markov kernel $K(X^{(t)}, X^{(t+1)})$ corresponds to a $N(X^{(t)}, 1)$ density.
Markov chain Monte Carlo (MCMC) Markov chains typically have a very strong stability property:
They have a stationary probability distribution,
A probability distribution $f$ such that if $X^{(t)} \sim f$, then $X^{(t+1)} \sim f$, so we have the equation
$$\int_{\mathcal{X}} K(x, y)\, f(x)\, dx = f(y).$$
Monte Carlo Methods with R: MetropolisHastings Algorithms [129]
Markov Chains
Properties
MCMC Markov chains are also irreducible, or else they are useless:
The kernel $K$ allows for free moves all over the state-space
For any $X^{(0)}$, the sequence $\{X^{(t)}\}$ has a positive probability of eventually reaching any region of the state-space
MCMC Markov chains are also recurrent, or else they are useless:
They will return to any arbitrary nonnegligible set an infinite number of times
Monte Carlo Methods with R: MetropolisHastings Algorithms [130]
Markov Chains
AR(1) Process
AR(1) models provide a simple illustration of continuous Markov chains
Here
$$X_n = \rho X_{n-1} + \epsilon_n, \qquad \epsilon_n \sim N(0, \sigma^2)$$
If the $\epsilon_n$s are independent,
$X_n$ is independent from $X_{n-2}, X_{n-3}, \ldots$ conditionally on $X_{n-1}$.
The stationary distribution $\pi(x \mid \rho, \sigma^2)$ is
$$N\left(0, \frac{\sigma^2}{1 - \rho^2}\right),$$
which requires $|\rho| < 1$.
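A quick simulation check of this stationary variance; the values $\rho = 0.8$, $\sigma = 1$ and the chain length are illustrative assumptions:

```r
# Simulate an AR(1) chain and compare its sample variance with the
# stationary value sigma^2 / (1 - rho^2)
set.seed(7)
rho <- 0.8; sigma <- 1; niter <- 1e5
x <- numeric(niter)
for (t in 2:niter) x[t] <- rho * x[t - 1] + rnorm(1, 0, sigma)
var(x)    # close to 1 / (1 - 0.8^2) = 2.78
```

With $|\rho| \geq 1$ the same loop produces a transient chain with no stationary distribution.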
Markov Chains
Statistical Language
We associate the probabilistic language of Markov chains with the statistical language of data analysis:
Statistics: marginal distribution  <->  Markov chain: invariant distribution
Statistics: proper marginals       <->  Markov chain: positive recurrent
If the marginals are not proper, or if they do not exist,
Then the chain is not positive recurrent.
It is either null recurrent or transient, and both are bad.
Monte Carlo Methods with R: MetropolisHastings Algorithms [132]
Markov Chains
Pictures of the AR(1) Process
AR(1) Recurrent and Transient - Note the Scale
[Four trace plots of the AR(1) chain, for $\rho = 0.4$, $\rho = 0.8$, $\rho = 0.95$, and $\rho = 1.001$; note how the axis range grows from roughly $\pm 3$ to $\pm 20$ as $\rho$ approaches and exceeds 1.]
R code
Monte Carlo Methods with R: MetropolisHastings Algorithms [133]
Markov Chains
Ergodicity
In recurrent chains, the stationary distribution is also a limiting distribution
If $f$ is the limiting distribution,
$$X^{(t)} \to X \sim f \quad \text{for any initial value } X^{(0)}$$
This property is also called ergodicity
For integrable functions $h$, the standard average
$$\frac{1}{T} \sum_{t=1}^{T} h(X^{(t)}) \longrightarrow E_f[h(X)],$$
The Law of Large Numbers
Sometimes called the Ergodic Theorem
Monte Carlo Methods with R: MetropolisHastings Algorithms [134]
Markov Chains
In Bayesian Analysis
There is one case where convergence never occurs:
When, in a Bayesian analysis, the posterior distribution is not proper
The use of improper priors $f(x)$ is quite common in complex models
Sometimes the posterior is proper, and MCMC works (recurrent)
Sometimes the posterior is improper, and MCMC fails (transient)
These transient Markov chains may present all the outer signs of stability
More later
Monte Carlo Methods with R: MetropolisHastings Algorithms [135]
Basic Metropolis-Hastings algorithms
Introduction
The working principle of Markov chain Monte Carlo methods is straightforward:
Given a target density $f$,
We build a Markov kernel $K$ with stationary distribution $f$
Then generate a Markov chain $(X^{(t)})$ such that $X^{(t)} \to X \sim f$
Integrals can be approximated thanks to the Ergodic Theorem
The Metropolis-Hastings algorithm is an example of those methods.
Given the target density $f$, we simulate from a candidate $q(y \mid x)$
Only need that the ratio $f(y)/q(y \mid x)$ is known up to a constant
Monte Carlo Methods with R: MetropolisHastings Algorithms [136]
Basic MetropolisHastings algorithms
A First Metropolis-Hastings Algorithm
Metropolis-Hastings: Given $x^{(t)}$,
1. Generate $Y_t \sim q(y \mid x^{(t)})$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \rho(x^{(t)}, Y_t), \\ x^{(t)} & \text{with probability } 1 - \rho(x^{(t)}, Y_t), \end{cases}$$
where
$$\rho(x, y) = \min\left\{ \frac{f(y)}{f(x)}\, \frac{q(x \mid y)}{q(y \mid x)},\ 1 \right\}.$$
$q$ is called the instrumental or proposal or candidate distribution
$\rho(x, y)$ is the Metropolis-Hastings acceptance probability
Looks like Simulated Annealing - but constant temperature
Metropolis-Hastings explores rather than maximizes
Monte Carlo Methods with R: MetropolisHastings Algorithms [137]
Basic MetropolisHastings algorithms
Generating Beta Random Variables
Target density $f$ is the Be(2.7, 6.3); candidate $q$ is uniform
Notice the repeats
Repeats must be kept!
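A sketch of this sampler; the chain length and starting value are illustrative assumptions:

```r
# Independent-candidate Metropolis-Hastings: Be(2.7, 6.3) target,
# uniform U(0, 1) candidate
set.seed(8)
a <- 2.7; b <- 6.3; niter <- 1e4
X <- numeric(niter)
X[1] <- runif(1)
for (t in 2:niter) {
  Y <- runif(1)                                    # candidate, q(y|x) = 1
  rho <- min(dbeta(Y, a, b) / dbeta(X[t - 1], a, b), 1)
  X[t] <- if (runif(1) < rho) Y else X[t - 1]      # rejected draws repeat x(t)
}
mean(X)    # approaches the Be(2.7, 6.3) mean a/(a+b) = 0.3
```

Deleting the repeated values would change the stationary distribution of the recorded sample, hence "repeats must be kept".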
Monte Carlo Methods with R: MetropolisHastings Algorithms [138]
Basic MetropolisHastings algorithms
Comparing Beta densities
Comparison with independent sampling:
Histograms indistinguishable
Moments match
K-S test accepts
R code
Monte Carlo Methods with R: MetropolisHastings Algorithms [139]
Basic MetropolisHastings algorithms
A Caution
The MCMC and exact sampling outcomes look identical, but
The Markov chain Monte Carlo sample has correlation; the iid sample does not
This means that the quality of the sample is necessarily degraded
We need more simulations to achieve the same precision
This is formalized by the effective sample size for Markov chains - later
Monte Carlo Methods with R: MetropolisHastings Algorithms [140]
Basic MetropolisHastings algorithms
Some Comments
In the symmetric case $q(x \mid y) = q(y \mid x)$,
$$\rho(x_t, y_t) = \min\left\{ \frac{f(y_t)}{f(x_t)},\ 1 \right\}.$$
The acceptance probability is independent of $q$
Metropolis-Hastings always accepts values of $y_t$ such that
$$f(y_t)/q(y_t \mid x^{(t)}) > f(x^{(t)})/q(x^{(t)} \mid y_t)$$
Values $y_t$ that decrease the ratio may also be accepted
Metropolis-Hastings only depends on the ratios
$$f(y_t)/f(x^{(t)}) \qquad\text{and}\qquad q(x^{(t)} \mid y_t)/q(y_t \mid x^{(t)}),$$
Independent of normalizing constants
Monte Carlo Methods with R: MetropolisHastings Algorithms [141]
Basic MetropolisHastings algorithms
The Independent Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm allows a general candidate $q(y \mid x)$
We can use $q(y \mid x) = g(y)$, a special case
Independent Metropolis-Hastings: Given $x^{(t)}$,
1. Generate $Y_t \sim g(y)$.
2. Take
$$X^{(t+1)} = \begin{cases} Y_t & \text{with probability } \min\left\{ \dfrac{f(Y_t)\, g(x^{(t)})}{f(x^{(t)})\, g(Y_t)},\ 1 \right\}, \\ x^{(t)} & \text{otherwise.} \end{cases}$$
Monte Carlo Methods with R: MetropolisHastings Algorithms [142]
Basic MetropolisHastings algorithms
Properties of the Independent Metropolis-Hastings algorithm
Straightforward generalization of the Accept-Reject method
Candidates are independent, but still form a Markov chain
The Accept-Reject sample is iid, but the Metropolis-Hastings sample is not
The Accept-Reject acceptance step requires calculating the bound $M$
Metropolis-Hastings is Accept-Reject for the lazy person
Monte Carlo Methods with R: MetropolisHastings Algorithms [143]
Basic MetropolisHastings algorithms
Application of the Independent MetropolisHastings algorithm
We now look at a somewhat more realistic statistical example
Get preliminary parameter estimates from a model
Use an independent proposal with those parameter estimates.
For example, to simulate from a posterior distribution (|x) ()f(x|)
Take a normal or atdistribution centered at the MLE
Covariance matrix equal to the inverse of Fishers information matrix.
Monte Carlo Methods with R: MetropolisHastings Algorithms [144]
Independent MetropolisHastings algorithm
Braking Data
The cars dataset relates braking distance ($y$) to speed ($x$) in a sample of cars.
Model:
$$y_{ij} = a + b x_i + c x_i^2 + \epsilon_{ij}$$
The likelihood function is
$$\left( \frac{1}{\sigma^2} \right)^{N/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{ij} \left(y_{ij} - a - b x_i - c x_i^2\right)^2 \right\},$$
where $N = \sum_i n_i$
Monte Carlo Methods with R: MetropolisHastings Algorithms [145]
Independent MetropolisHastings algorithm
Braking Data Least Squares Fit
Candidate from Least Squares
R command: x2 = x^2; summary(lm(y ~ x + x2))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.63328   14.80693   0.178    0.860
x            0.88770    2.03282   0.437    0.664
x2           0.10068    0.06592   1.527    0.133
Residual standard error: 15.17 on 47 degrees of freedom
Monte Carlo Methods with R: MetropolisHastings Algorithms [146]
Independent MetropolisHastings algorithm
Braking Data Metropolis Algorithm
Candidate: normal centered at the MLEs,
$$a \sim N(2.63, (14.8)^2), \qquad b \sim N(0.887, (2.03)^2), \qquad c \sim N(0.100, (0.065)^2),$$
and an inverted gamma
$$\sigma^{-2} \sim \mathcal{G}\left(n/2,\ (n - 3)(15.17)^2\right)$$
See the variability of the curves associated with the simulation.
Monte Carlo Methods with R: MetropolisHastings Algorithms [147]
Independent MetropolisHastings algorithm
Braking Data Coefficients
Distributions of estimates
Credible intervals
See the skewness
Note that these are marginal distributions
Monte Carlo Methods with R: MetropolisHastings Algorithms [148]
Independent MetropolisHastings algorithm
Braking Data Assessment
50,000 iterations
See the repeats
Intercept may not have converged
R code
Monte Carlo Methods with R: MetropolisHastings Algorithms [149]
Random Walk MetropolisHastings
Introduction
Implementation of independent Metropolis-Hastings can sometimes be difficult:
Construction of the proposal may be complicated
Independent proposals ignore local information
An alternative is to gather information stepwise,
Exploring the neighborhood of the current value of the chain
Can take into account the value previously simulated to generate the next value
Gives a more local exploration of the neighborhood of the current value
Monte Carlo Methods with R: MetropolisHastings Algorithms [150]
Random Walk MetropolisHastings
Some Details
The implementation of this idea is to simulate $Y_t$ according to
$$Y_t = X^{(t)} + \epsilon_t,$$
where $\epsilon_t$ is a random perturbation with distribution $g$, independent of $X^{(t)}$
Uniform, normal, etc...
The proposal density $q(y \mid x)$ is now of the form $g(y - x)$
Typically, $g$ is symmetric around zero, satisfying $g(-t) = g(t)$
The Markov chain associated with $q$ is a random walk
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [151]
Random Walk Metropolis-Hastings
The Algorithm
Given x^(t),
1. Generate Y_t ~ g(y - x^(t)).
2. Take
X^(t+1) = Y_t with probability min{ 1, f(Y_t) / f(x^(t)) },
X^(t+1) = x^(t) otherwise.
The g chain is a random walk
Due to the Metropolis-Hastings acceptance step, the {X^(t)} chain is not
The acceptance probability does not depend on g
But different g's result in different ranges and different acceptance rates
Calibrating the scale of the random walk is important for good exploration
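The two steps above can be sketched in a few lines. The book's code is in R; here is a Python stand-in with an assumed two-component normal-mixture target and a normal perturbation g, run at three scales to show the calibration issue:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed stand-in target: equal mixture of N(-2, 1) and N(2, 1), unnormalized
def f(x):
    return np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2)

def rw_metropolis(scale, T=20000, x0=0.0):
    """Random-walk Metropolis: Y_t = X^(t) + eps_t with eps_t ~ N(0, scale^2)."""
    x, accept = x0, 0
    chain = np.empty(T)
    for t in range(T):
        y = x + scale * rng.normal()
        # g is symmetric, so q cancels: accept with probability min(1, f(y)/f(x))
        if rng.uniform() < f(y) / f(x):
            x, accept = y, accept + 1
        chain[t] = x
    return chain, accept / T

for scale in (0.1, 1.0, 10.0):
    chain, rate = rw_metropolis(scale)
    print(f"scale={scale:5}: acceptance rate {rate:.2f}, "
          f"visits both modes: {chain.min() < -1 < 1 < chain.max()}")
```

A tiny scale accepts almost everything but can stay trapped near one mode; a huge scale proposes into the tails and rejects most moves. The intermediate scale trades the two off.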
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [152]
Random Walk Metropolis-Hastings
Normal Mixtures
Explore the likelihood with a random walk
Similar to simulated annealing, but with constant temperature (scale)
Multimodal target: the scale is important
Too small: get stuck
Too big: miss modes
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [153]
Random Walk Metropolis-Hastings
Normal Mixtures - Different Scales
Left to right: Scale = 1, Scale = 2, Scale = 3
Scale = 1: too small, gets stuck
Scale = 2: just right, finds both modes
Scale = 3: too big, misses a mode
R code
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [154]
Random Walk Metropolis-Hastings
Model Selection or Model Choice
Random walk Metropolis-Hastings algorithms also apply to discrete targets.
As an illustration, we consider a regression
The swiss dataset in R: y = logarithm of the fertility in 47 districts of Switzerland in 1888
The covariate matrix X involves five explanatory variables
> names(swiss)
[1] "Fertility"   "Agriculture"   "Examination"   "Education"
[5] "Catholic"    "Infant.Mortality"
Compare the 2^5 = 32 models corresponding to all possible subsets of covariates.
If we include squares and two-way interactions:
2^20 = 1,048,576 models, same R code
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [155]
Random Walk Metropolis-Hastings
Model Selection using Marginals
Given an ordinary linear regression with n observations,
y | β, σ², X ~ N_n(Xβ, σ² I_n),  where X is an (n, p) matrix
The likelihood is
ℓ(β, σ² | y, X) = (2πσ²)^(-n/2) exp( -(1/(2σ²)) (y - Xβ)^T (y - Xβ) )
Using Zellner's g-prior, with the constant g = n,
β | σ², X ~ N_(k+1)(β̃, n σ² (X^T X)^(-1))  and  π(σ² | X) ∝ σ^(-2)
The marginal distribution of y is a multivariate t distribution,
m(y|X) ∝ [ y^T ( I_n - (n/(n+1)) X (X^T X)^(-1) X^T ) y - (1/(n+1)) β̃^T X^T X β̃ ]^(-n/2)
Find the model with maximum marginal probability
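The displayed marginal can be evaluated up to a constant. As a hedged Python sketch (the deck's code is in R), assuming prior mean β̃ = 0 so the last term drops, and dropping constants not shown on the slide:

```python
import numpy as np

def log_marginal(y, X):
    """log m(y|X) up to an additive constant, for Zellner's g-prior with g = n
    and prior mean beta-tilde = 0 (assumptions of this sketch)."""
    n = len(y)
    fit = X @ np.linalg.solve(X.T @ X, X.T @ y)   # X (X'X)^{-1} X' y
    quad = y @ y - n / (n + 1) * y @ fit          # y'( I_n - n/(n+1) H ) y
    return -0.5 * n * np.log(quad)

# Hypothetical illustration (not the swiss data): two informative covariates
# should attain a larger marginal than just one of them.
rng = np.random.default_rng(3)
n = 47
X = rng.normal(size=(n, 3))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)
print(log_marginal(y, X[:, :2]), log_marginal(y, X[:, :1]))
```

For a careful comparison across models of different dimension, dimension-dependent normalizing factors of the g-prior would also matter; they are omitted here to stay close to the displayed proportionality.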
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [156]
Random Walk Metropolis-Hastings
Random Walk on Model Space
To go from γ^(t) to γ^(t+1):
First, get a candidate γ*:
choose a component of γ^(t) at random, and flip 1 -> 0 or 0 -> 1
For example,
γ^(t) = (1, 0, 1, 1, 0)  gives the candidate  γ* = (1, 0, 0, 1, 0)
Accept the proposed model with probability
min{ m(y | X, γ*) / m(y | X, γ^(t)), 1 }
The candidate is symmetric
Note: This is not the Metropolis-Hastings algorithm in the book - it is simpler
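The flip-one-component walk is easy to sketch. Here is a hedged Python stand-in (the deck uses R and the swiss data; this uses synthetic data with five covariates, two of them truly active, and the g-prior marginal with prior mean 0 as an assumption):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the swiss regression: 5 covariates, columns 2 and 3 active
n, p = 47, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 2] + 1.5 * X[:, 3] + rng.normal(size=n)

def log_marginal(gamma):
    """log m(y | X_gamma) up to a constant (g-prior, g = n, prior mean 0)."""
    cols = [j for j in range(p) if gamma[j] == 1]
    if not cols:
        return -0.5 * n * np.log(y @ y)           # empty model: quad = y'y
    Xg = X[:, cols]
    fit = Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T @ y)
    return -0.5 * n * np.log(y @ y - n / (n + 1) * y @ fit)

T = 2000
gamma = tuple([1] * p)                            # start from the full model
visits = {}
for t in range(T):
    j = rng.integers(p)                           # pick one component at random...
    prop = list(gamma)
    prop[j] = 1 - prop[j]                         # ...and flip it: 1 -> 0 or 0 -> 1
    prop = tuple(prop)
    # the candidate is symmetric, so the ratio only involves the marginals
    if np.log(rng.uniform()) < log_marginal(prop) - log_marginal(gamma):
        gamma = prop
    visits[gamma] = visits.get(gamma, 0) + 1

best = max(visits, key=visits.get)
print("most visited model:", best)
```

Because the two active covariates carry strong signal, the most visited model should include them while the chain occasionally toggles the noise covariates in and out.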
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [157]
Random Walk Metropolis-Hastings
Results from the Random Walk on Model Space
Last iterations of the MH search
The chain goes down often
Top Ten Models
Marg.   γ
7.95    1 0 1 1 1
7.19    0 0 1 1 1
6.27    1 1 1 1 1
5.44    1 0 1 1 0
5.45    1 0 1 1 0
Best model excludes the variable Examination
γ = (1, 0, 1, 1, 1)
Inclusion rates:
 Agri   Exam   Educ   Cath  Inf.Mort
0.661  0.194  1.000  0.904  0.949
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [158]
Metropolis-Hastings Algorithms
Acceptance Rates
Infinite number of choices for the candidate q in a Metropolis-Hastings algorithm
Is there an optimal choice?
The choice of q = f, the target distribution? Not practical.
A criterion for comparison is the acceptance rate
It can be easily computed as the empirical frequency of acceptance
In contrast to the Accept-Reject algorithm, maximizing the acceptance rate is not necessarily best
Especially for random walks
Also look at the autocovariance
Monte Carlo Methods with R: Metropolis-Hastings Algorithms [159]
Acceptance Rates
Normals from Double Exponentials
In the Accept-Reject algorithm
To generate a N(0, 1) from a double-exponential L(α)
The choice α = 1 optimizes the acceptance rate
In an independent Metropolis-Hastings algorithm
We can use the double-exponential as an independent candidate q
Compare the behavior of the Metropolis-Hastings algorithm
When using the L(1) candidate or the L(3) candidate
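This comparison can be sketched directly (in Python rather than the deck's R; both densities are used up to constants):

```python
import numpy as np

rng = np.random.default_rng(2)

def indep_mh_laplace(alpha, T=20000):
    """Independent MH for a N(0,1) target with a double-exponential L(alpha) candidate."""
    log_f = lambda x: -0.5 * x * x        # N(0,1) target, up to a constant
    log_q = lambda x: -alpha * abs(x)     # L(alpha) candidate, up to a constant
    x, accept = 0.0, 0
    chain = np.empty(T)
    for t in range(T):
        y = rng.laplace(scale=1.0 / alpha)
        # independence sampler: accept with prob min(1, f(y)q(x) / (f(x)q(y)))
        log_r = log_f(y) - log_f(x) + log_q(x) - log_q(y)
        if np.log(rng.uniform()) < log_r:
            x, accept = y, accept + 1
        chain[t] = x
    return chain, accept / T

for alpha in (1.0, 3.0):
    chain, rate = indep_mh_laplace(alpha)
    print(f"alpha={alpha}: acceptance {rate:.2f}, mean {chain.mean():+.2f}")
```

The L(1) candidate matches the normal target much better than the over-concentrated L(3) candidate, so it should show a visibly higher acceptance rate and less sticky paths.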