Required Part 2

Posted 13-Apr-2018

  • 7/26/2019 Required Part 2

    1/80

    Monte Carlo Methods with R: Monte Carlo Optimization [80]

    Monte Carlo Optimization

    Introduction

    Optimization problems can mostly be seen as one of two kinds:

Find the extrema of a function h(θ) over a domain Θ

    Find the solution(s) to an implicit equation g(θ) = 0 over a domain Θ.

    The two problems are exchangeable:

    The second one is a minimization problem for a function like h(θ) = g^2(θ),

    while the first one is equivalent to solving ∂h(θ)/∂θ = 0

    We only focus on the maximization problem


    Monte Carlo Methods with R: Monte Carlo Optimization [81]

    Monte Carlo Optimization

    Deterministic or Stochastic

    Similar to integration, optimization can be deterministic or stochastic

Deterministic: performance dependent on properties of the function, such as convexity, boundedness, and smoothness

    Stochastic (simulation):

    Properties of h play a lesser role in simulation-based approaches.

    Therefore, if h is complex or irregular, choose the stochastic approach.


    Monte Carlo Methods with R: Monte Carlo Optimization [82]

    Monte Carlo Optimization

    Numerical Optimization

R has several embedded functions to solve optimization problems

    The simplest one is optimize (one-dimensional)

    Example: Maximizing a Cauchy likelihood C(θ, 1)

    When maximizing the likelihood of a Cauchy C(θ, 1) sample,

ℓ(θ | x_1, . . . , x_n) = ∏_{i=1}^n 1 / [1 + (x_i − θ)^2],

The sequence of maxima (MLEs) converges to θ* = 0 when n → ∞.

    But the journey is not a smooth one...
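As a concrete sketch of this example (the simulated sample, seed, and search interval are assumptions, not from the slides), optimize can be applied to the Cauchy log-likelihood:

```r
# Maximizing a Cauchy C(theta, 1) log-likelihood with optimize()
# (simulated data with true theta = 0)
set.seed(42)
x <- rcauchy(500)
loglik <- function(theta) sum(dcauchy(x, location = theta, log = TRUE))
fit <- optimize(loglik, interval = c(-5, 5), maximum = TRUE)
fit$maximum   # the MLE, close to the true value 0
```

Because the likelihood has many small wiggles, the point optimize returns is only a local maximizer within the interval, which is exactly the instability the next slides illustrate.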


    Monte Carlo Methods with R: Monte Carlo Optimization [83]

    Monte Carlo Optimization

    Cauchy Likelihood

MLEs (left) at each sample size, n = 1, . . . , 500, and plot of the final likelihood (right).

    Why are the MLEs so wiggly?

    The likelihood is not as well-behaved as it seems


    Monte Carlo Methods with R: Monte Carlo Optimization [84]

    Monte Carlo Optimization

    Cauchy Likelihood-2

The likelihood ℓ(θ | x_1, . . . , x_n) = ∏_{i=1}^n 1 / [1 + (x_i − θ)^2]

    Is like a polynomial of degree 2n

    The derivative has 2n zeros

    Hard to see if n = 500; here is n = 5

    R code


    Monte Carlo Methods with R: Monte Carlo Optimization [85]

    Monte Carlo Optimization

    Newton-Raphson

Similarly, nlm is a generic R function that uses the Newton–Raphson method

    Based on the recurrence relation

    θ_{i+1} = θ_i − [∇^2 h(θ_i)]^{−1} ∇h(θ_i)

    where the matrix ∇^2 h of the second derivatives is called the Hessian

    This method is perfect when h is quadratic

    But may also deteriorate when h is highly nonlinear

    It also obviously depends on the starting point θ_0 when h has several minima.
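A minimal sketch of this starting-point dependence (the toy double-well function is an assumption made for illustration):

```r
# nlm() minimizes; h below has two local minima, and the minimum
# reached depends on the starting point
h <- function(x) (x^2 - 1)^2 + 0.2 * x
fit1 <- nlm(h, p = -2)   # converges near x = -1, the global minimum
fit2 <- nlm(h, p = 2)    # converges near x = +1, a local minimum
c(fit1$estimate, fit2$estimate)
```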


    Monte Carlo Methods with R: Monte Carlo Optimization [86]

    Monte Carlo Optimization

    Newton-Raphson; Mixture Model Likelihood

Bimodal mixture model likelihood: (1/4) N(μ_1, 1) + (3/4) N(μ_2, 1)

    Sequences go to the closest mode

    Starting point (1, 1) has a steep gradient

    Bypasses the main mode (0.68, 1.98)

    Goes to the other mode (lower likelihood)


    Monte Carlo Methods with R: Monte Carlo Optimization [87]

    Stochastic search

    A Basic Solution

A natural, if rudimentary, way of using simulation to find max_θ h(θ):

    Simulate points over Θ according to an arbitrary distribution f positive on Θ

    Until a high value of h(θ) is observed

    Recall h(x) = [cos(50x) + sin(20x)]^2

    Max = 3.8325

    Histogram of 1000 runs
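The basic search can be sketched as follows (the seed and the number of draws are arbitrary choices):

```r
# Crude stochastic search: simulate uniformly over [0, 1] and record
# the highest value of h observed
h <- function(x) (cos(50 * x) + sin(20 * x))^2
set.seed(1)
u <- runif(1e5)
max(h(u))   # approaches the true maximum 3.8325
```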


    Monte Carlo Methods with R: Monte Carlo Optimization [88]

    Stochastic search

    Stochastic Gradient Methods

    Generating direct simulations from the target can be difficult.

A different stochastic approach to maximization:

    Explore the surface in a local manner.

    Can use θ_{j+1} = θ_j + ε_j

    A Markov chain

    The random component ε_j can be arbitrary

    Can also use features of the function: Newton–Raphson variation

    θ_{j+1} = θ_j + α_j ∇h(θ_j),   α_j > 0,

    where ∇h(θ_j) is the gradient and α_j the step size


    Monte Carlo Methods with R: Monte Carlo Optimization [89]

    Stochastic search

    Stochastic Gradient Methods-2

    In difficult problems

    The gradient sequence will most likely get stuck in a local extremum ofh.

    Stochastic Variation

∇h(θ_j) ≈ { [h(θ_j + β_j ζ_j) − h(θ_j − β_j ζ_j)] / (2 β_j) } ζ_j = { Δh(θ_j, β_j ζ_j) / (2 β_j) } ζ_j,

    where (β_j) is a second decreasing sequence and ζ_j is uniform on the unit sphere ||ζ|| = 1.

    We then use

    θ_{j+1} = θ_j + (α_j / 2β_j) Δh(θ_j, β_j ζ_j) ζ_j
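The update above can be sketched on a toy target (the function h(θ) = −(θ_1^2 + θ_2^2), sequences, and seed are assumptions made for illustration):

```r
# Stochastic gradient with the finite-difference variation, used here
# to MAXIMIZE h(theta) = -(theta1^2 + theta2^2)
h <- function(th) -sum(th^2)
set.seed(7)
theta <- c(2, -2)
for (j in 1:5000) {
  alpha <- 1 / (j + 1)              # decreasing step size
  beta  <- 1 / (j + 1)^0.5          # second decreasing sequence
  ang   <- runif(1, 0, 2 * pi)
  zeta  <- c(cos(ang), sin(ang))    # uniform on the unit sphere
  delta <- h(theta + beta * zeta) - h(theta - beta * zeta)
  theta <- theta + alpha / (2 * beta) * delta * zeta
}
theta   # drifts toward the maximizer (0, 0)
```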


    Monte Carlo Methods with R: Monte Carlo Optimization [90]

    Stochastic Search

    A Difficult Minimization

    Many Local Minima

    Global Min at (0, 0)

    Code in the text


    Monte Carlo Methods with R: Monte Carlo Optimization [91]

    Stochastic Search

    A Difficult Minimization 2

Scenario   1                  2                    3              4
    α_j        1/log(j+1)         1/[100 log(j+1)]     1/(j+1)        1/(j+1)
    β_j        1/log(j+1)^0.1     1/log(j+1)^0.1       1/(j+1)^0.5    1/(j+1)^0.1

    α_j → 0 slowly, Σ_j α_j = ∞; β_j → 0 more slowly, Σ_j (α_j/β_j)^2 < ∞


    Monte Carlo Methods with R: Monte Carlo Optimization [92]

    Simulated Annealing

    Introduction

    This name is borrowed from Metallurgy:

    A metal manufactured by a slow decrease of temperature (annealing)

    Is stronger than a metal manufactured by a fast decrease of temperature.

    The fundamental idea of simulated annealing methods

    A change of scale, ortemperature

Allows for faster moves on the surface of the function h to maximize.

    Rescaling partially avoids the trapping attraction of local maxima.

    As T decreases toward 0, the values simulated from this distribution become concentrated in a narrower and narrower neighborhood of the local maxima of h


    Monte Carlo Methods with R: Monte Carlo Optimization [93]

    Simulated Annealing

    Metropolis Algorithm/Simulated Annealing

    Simulation method proposed by Metropolis et al. (1953)

Starting from θ_0, ζ is generated from a uniform distribution in a neighborhood of θ_0.

    The new value of θ is generated as

    θ_1 = ζ with probability ρ = exp(Δh/T) ∧ 1, and θ_1 = θ_0 with probability 1 − ρ,

    where Δh = h(ζ) − h(θ_0)

    If h(ζ) ≥ h(θ_0), ζ is accepted

    If h(ζ) < h(θ_0), ζ may still be accepted

    This allows escape from local maxima


    Monte Carlo Methods with R: Monte Carlo Optimization [94]

    Simulated Annealing

    Metropolis Algorithm - Comments

Simulated annealing typically modifies the temperature T at each iteration

    It has the form

    1. Simulate ζ from an instrumental distribution with density g(|ζ − θ_i|);

    2. Accept θ_{i+1} = ζ with probability ρ_i = exp{Δh_i/T_i} ∧ 1; take θ_{i+1} = θ_i otherwise.

    3. Update T_i to T_{i+1}.

    All positive moves accepted

    As T ↓ 0:

    Harder to accept downward moves; no big downward moves

    Not a (homogeneous) Markov chain, so difficult to analyze


    Monte Carlo Methods with R: Monte Carlo Optimization [95]

    Simulated Annealing

    Simple Example

Trajectory: T_i = 1/(1 + i)^2

    Log trajectory also works

    Can guarantee finding global max

    R code
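A minimal simulated-annealing sketch for the earlier target h(x) = [cos(50x) + sin(20x)]^2, with the T_i = 1/(1+i)^2 schedule (the proposal width, seed, and iteration count are assumptions):

```r
# Simulated annealing with a uniform proposal in a neighborhood of x
h <- function(x) (cos(50 * x) + sin(20 * x))^2
set.seed(3)
x <- runif(1)
best <- h(x)
for (i in 1:1e4) {
  Ti   <- 1 / (1 + i)^2
  prop <- min(max(x + runif(1, -0.1, 0.1), 0), 1)     # stay inside [0, 1]
  if (runif(1) < exp((h(prop) - h(x)) / Ti)) x <- prop # accept with prob rho
  best <- max(best, h(x))
}
best   # best value visited; the global maximum is 3.8325
```

With this fast-decaying schedule the chain freezes quickly, so a single run is not guaranteed to reach the global maximum; the slide's guarantee applies to slower (logarithmic) schedules.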


    Monte Carlo Methods with R: Monte Carlo Optimization [96]

    Simulated Annealing

    Normal Mixture

    Previous normal mixture

    Most sequences find max

    They visit both modes


    Monte Carlo Methods with R: Monte Carlo Optimization [97]

    Stochastic Approximation

    Introduction

We now consider methods that work with an approximation of the objective function h

    Rather than being concerned with fast exploration of the domain Θ.

    Unfortunately, the use of those methods results in an additional level of error

    Due to this approximation of h.

    But, the objective function in many statistical problems can be expressed as h(x) = E[H(x, Z)]

    This is the setting of so-called missing-data models


    Monte Carlo Methods with R: Monte Carlo Optimization [98]

    Stochastic Approximation

    Optimizing Monte Carlo Approximations

If h(x) = E[H(x, Z)], a Monte Carlo approximation is

    ĥ(x) = (1/m) Σ_{i=1}^m H(x, z_i),

    where the z_i's are generated from the conditional distribution f(z|x).

    This approximation yields a convergent estimator of h(x) for every value of x

    This is a pointwise convergent estimator

    Its use in optimization setups is not recommended

    Changing the sample of z_i's at each x gives an unstable sequence of evaluations

    And a rather noisy approximation to arg max h(x)


    Monte Carlo Methods with R: Monte Carlo Optimization [99]

    Stochastic Approximation

    Bayesian Probit

    Example: Bayesian analysis of a simple probit model

Y ∈ {0, 1} has a distribution depending on a covariate X:

    P(Y = 1 | X = x) = 1 − P(Y = 0 | X = x) = Φ(β_0 + β_1 x),

    Illustrate with the Pima.tr dataset: Y = diabetes indicator, X = BMI

Typically infer from the marginal posterior

    arg max_{β_0} ∫ ∏_{i=1}^n Φ(β_0 + β_1 x_i)^{y_i} Φ(−β_0 − β_1 x_i)^{1−y_i} dβ_1 = arg max_{β_0} h(β_0)

    For a flat prior on β and a sample (x_1, . . . , x_n).

    Monte Carlo Methods with R: Monte Carlo Optimization [100]


    Stochastic Approximation

    Bayesian Probit Importance Sampling

No analytic expression for h

    The conditional distribution of β_1 given β_0 is also nonstandard

    Use importance sampling with a t distribution with 5 df

    Take μ = 0.1 and σ = 0.03 (the MLEs)

    Importance Sampling Approximation

ĥ(β_0) = (1/M) Σ_{m=1}^M ∏_{i=1}^n Φ(β_0 + β_1^{(m)} x_i)^{y_i} Φ(−β_0 − β_1^{(m)} x_i)^{1−y_i} t_5(β_1^{(m)}; μ, σ)^{−1},

    Monte Carlo Methods with R: Monte Carlo Optimization [101]


    Stochastic Approximation

    Importance Sampling Evaluation

Plotting this approximation of h with t samples simulated for each value of β_0:

    The maximization of the represented ĥ function is not to be trusted as an approximation to the maximization of h.

    But if we use the same t sample for all values of β_0

    We obtain a much smoother function

    We use importance sampling based on a single sample of z_i's

    Simulated from an importance function g(z) for all values of x

Estimate h with

    ĥ_m(x) = (1/m) Σ_{i=1}^m [f(z_i|x) / g(z_i)] H(x, z_i).
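The effect of reusing one common sample can be sketched on a toy target where h(x) = E[H(x, Z)] is known in closed form (the choice H(x, z) = exp(−(x − z)^2) with Z ~ N(0, 1) is an assumption made for illustration; then h(x) = exp(−x^2/3)/√3):

```r
# One common z sample gives a smooth, uniformly accurate curve in x;
# redrawing z for each x would give a noisy, wiggly one
H <- function(x, z) exp(-(x - z)^2)
set.seed(10)
z  <- rnorm(1e4)                      # single common sample
hc <- function(x) mean(H(x, z))       # common-sample estimate of h(x)
xs <- seq(-2, 2, by = 0.1)
err <- max(abs(sapply(xs, hc) - exp(-xs^2 / 3) / sqrt(3)))
err   # small uniform error over the whole grid
```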

    Monte Carlo Methods with R: Monte Carlo Optimization [102]


    Stochastic Approximation

    Importance Sampling Likelihood Representation

    Top: 100 runs, different samples

    Middle: 100 runs, same sample

    Bottom: averages over 100 runs

The averages over 100 runs are the same, but we will not do 100 runs

    R code: Run pimax(25) from mcsm

    Monte Carlo Methods with R: Monte Carlo Optimization [103]


    Stochastic Approximation

    Comments

This approach is not absolutely fool-proof:

    The precision of ĥ_m(x) has no reason to be independent of x

    The number m of simulations has to reflect the most varying case.

    As in every importance sampling experiment, the choice of the candidate g is influential

    In obtaining a good (or a disastrous) approximation of h(x).

    Checking for the finite variance of the ratio

    f(z_i|x) H(x, z_i) / g(z_i)

    Is a minimal requirement in the choice of g

    Monte Carlo Methods with R: Monte Carlo Optimization [104]


    Missing-Data Models and Demarginalization

    Introduction

Missing-data models are special cases of the representation h(x) = E[H(x, Z)]

    These are models where the density of the observations can be expressed as

    g(x|θ) = ∫_Z f(x, z|θ) dz.

    This representation occurs in many statistical settings:

    Censoring models and mixtures

    Latent variable models (tobit, probit, arch, stochastic volatility, etc.)

    Genetics: missing SNP calls

    Monte Carlo Methods with R: Monte Carlo Optimization [105]


    Missing-Data Models and Demarginalization

    Mixture Model

    Example: Normal mixture model as a missing-data model

Start with a sample (x_1, . . . , x_n)

    Introduce a vector (z_1, . . . , z_n) ∈ {1, 2}^n such that

    P(Z_i = 1) = 1 − P(Z_i = 2) = 1/4,   X_i | Z_i = z ~ N(μ_z, 1),

    The (observed) likelihood is then obtained as E[H(x, Z)] for

    H(x, z) ∝ ∏_{i; z_i=1} (1/4) exp{−(x_i − μ_1)^2/2} ∏_{i; z_i=2} (3/4) exp{−(x_i − μ_2)^2/2},

    We recover the mixture model (1/4) N(μ_1, 1) + (3/4) N(μ_2, 1)

    As the marginal distribution of X_i.

    Monte Carlo Methods with R: Monte Carlo Optimization [106]


    Missing-Data Models and Demarginalization

Censored Data Likelihood

    Example: Censored-data likelihood

    Censored data may come from experiments

    Where some potential observations are replaced with a lower bound

    Because they take too long to observe.

    Suppose that we observe Y_1, . . . , Y_m, iid, from f(y − θ)

    And the (n − m) remaining (Y_{m+1}, . . . , Y_n) are censored at the threshold a.

    The corresponding likelihood function is

    L(θ|y) = [1 − F(a − θ)]^{n−m} ∏_{i=1}^m f(y_i − θ),

    where F is the cdf associated with f

    Monte Carlo Methods with R: Monte Carlo Optimization [107]


    Missing-Data Models and Demarginalization

    Recovering the Observed Data Likelihood

If we had observed the last n − m values

    Say z = (z_{m+1}, . . . , z_n), with z_i ≥ a (i = m + 1, . . . , n),

    We could have constructed the (complete-data) likelihood

    L^c(θ|y, z) = ∏_{i=1}^m f(y_i − θ) ∏_{i=m+1}^n f(z_i − θ).

    Note that

    L(θ|y) = E[L^c(θ|y, Z)] = ∫_Z L^c(θ|y, z) f(z|y, θ) dz,

    where f(z|y, θ) is the density of the missing data

    Conditional on the observed data:

    The product of the f(z_i − θ)/[1 − F(a − θ)]'s,

    i.e., f(z − θ) restricted to (a, +∞).

    Monte Carlo Methods with R: Monte Carlo Optimization [108]


    Missing-Data Models and Demarginalization

    Comments

When we have the relationship

    g(x|θ) = ∫_Z f(x, z|θ) dz,

    Z merely serves to simplify calculations

    It does not necessarily have a specific meaning

    We have the complete-data likelihood L^c(θ|x, z) = f(x, z|θ)

    The likelihood we would obtain

    Were we to observe (x, z), the complete data

    REMEMBER: g(x|θ) = ∫_Z f(x, z|θ) dz.

    Monte Carlo Methods with R: Monte Carlo Optimization [109]


    The EM Algorithm

    Introduction

The EM algorithm is a deterministic optimization technique

    Dempster, Laird and Rubin (1977)

    Takes advantage of the missing-data representation

    Builds a sequence of easier maximization problems

    Whose limit is the answer to the original problem

    We assume that we observe X_1, . . . , X_n ~ g(x|θ) that satisfies

    g(x|θ) = ∫_Z f(x, z|θ) dz,

    And we want to compute θ̂ = arg max L(θ|x) = arg max g(x|θ).

    Monte Carlo Methods with R: Monte Carlo Optimization [110]


    The EM Algorithm

    First Details

With the relationship g(x|θ) = ∫_Z f(x, z|θ) dz and

    (X, Z) ~ f(x, z|θ),

    The conditional distribution of the missing data Z

    Given the observed data x is

    k(z|θ, x) = f(x, z|θ) / g(x|θ).

    Taking the logarithm of this expression leads to the following relationship

    log L(θ|x) = E_{θ_0}[log L^c(θ|x, Z)] − E_{θ_0}[log k(Z|θ, x)],

    Obs. Data = Complete Data − Missing Data

    where the expectation is with respect to k(z|θ_0, x).

    In maximizing log L(θ|x), we can ignore the last term

    Monte Carlo Methods with R: Monte Carlo Optimization [111]


    The EM Algorithm

    Iterations

Denoting Q(θ|θ_0, x) = E_{θ_0}[log L^c(θ|x, Z)],

    the EM algorithm indeed proceeds by maximizing Q(θ|θ_0, x) at each iteration

    If θ̂_(1) = arg max Q(θ|θ_0, x), then θ̂_(0) → θ̂_(1)

    Sequence of estimators {θ̂_(j)}, where

    θ̂_(j) = arg max Q(θ|θ̂_(j−1))

    This iterative scheme

    Contains both an expectation step

    And a maximization step

    Giving the algorithm its name.

    Monte Carlo Methods with R: Monte Carlo Optimization [112]


    The EM Algorithm

    The Algorithm

Pick a starting value θ̂_(0) and set m = 0.

    Repeat

    1. Compute (the E-step)

    Q(θ|θ̂_(m), x) = E_{θ̂_(m)}[log L^c(θ|x, Z)],

    where the expectation is with respect to k(z|θ̂_(m), x).

    2. Maximize Q(θ|θ̂_(m), x) in θ and take (the M-step)

    θ̂_(m+1) = arg max_θ Q(θ|θ̂_(m), x)

    and set m = m + 1,

    until a fixed point is reached; i.e., θ̂_(m+1) = θ̂_(m).

    Monte Carlo Methods with R: Monte Carlo Optimization [113]


    The EM Algorithm

    Properties

Jensen's inequality: the likelihood increases at each step of the EM algorithm,

    L(θ̂_(j+1)|x) ≥ L(θ̂_(j)|x),

    with equality holding if and only if Q(θ̂_(j+1)|θ̂_(j), x) = Q(θ̂_(j)|θ̂_(j), x).

    Every limit point of an EM sequence {θ̂_(j)} is a stationary point of L(θ|x)

    Not necessarily the maximum likelihood estimator

    In practice, we run EM several times with different starting points.

    Implementing the EM algorithm thus means being able to

    (a) compute the function Q(θ′|θ, x)

    (b) maximize this function.

    Monte Carlo Methods with R: Monte Carlo Optimization [114]


    The EM Algorithm

    Censored Data Example

The complete-data likelihood is

    L^c(θ|y, z) ∝ ∏_{i=1}^m exp{−(y_i − θ)^2/2} ∏_{i=m+1}^n exp{−(z_i − θ)^2/2},

    With expected complete-data log-likelihood

    Q(θ|θ_0, y) = −(1/2) Σ_{i=1}^m (y_i − θ)^2 − (1/2) Σ_{i=m+1}^n E_{θ_0}[(Z_i − θ)^2],

    where the Z_i are distributed from a normal N(θ_0, 1) distribution truncated at a.

    M-step: differentiating Q(θ|θ_0, y) in θ and setting it equal to 0 gives

    θ̂ = [m ȳ + (n − m) E_{θ_0}[Z_1]] / n,

    with E_{θ_0}[Z_1] = θ_0 + φ(a − θ_0)/[1 − Φ(a − θ_0)],

    Monte Carlo Methods with R: Monte Carlo Optimization [115]


    The EM Algorithm

    Censored Data MLEs

EM sequence:

    θ̂_(j+1) = (m/n) ȳ + [(n − m)/n] { θ̂_(j) + φ(a − θ̂_(j)) / [1 − Φ(a − θ̂_(j))] }

    Climbing the Likelihood

    R code
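The EM recursion above can be run directly (the simulated data, censoring threshold, and seed are assumptions made for illustration):

```r
# EM for the mean of a normal sample right-censored at a
set.seed(5)
n <- 100; a <- 1
yfull <- rnorm(n, mean = 1.5)      # true theta = 1.5
y <- yfull[yfull <= a]             # observed (uncensored) values
m <- length(y)
theta <- mean(y)                   # starting value
for (j in 1:200)
  theta <- (m / n) * mean(y) +
           ((n - m) / n) * (theta + dnorm(a - theta) / (1 - pnorm(a - theta)))
theta   # EM fixed point: the MLE, larger than the naive mean(y)
```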

    Monte Carlo Methods with R: Monte Carlo Optimization [116]


    The EM Algorithm

    Normal Mixture

    Normal Mixture Bimodal Likelihood

Q(θ′|θ, x) = −(1/2) Σ_{i=1}^n E[ Z_i (x_i − μ_1)^2 + (1 − Z_i)(x_i − μ_2)^2 | x ].

    Solving the M-step then provides the closed-form expressions

    μ_1 = E[ Σ_{i=1}^n Z_i x_i | x ] / E[ Σ_{i=1}^n Z_i | x ]

    and

    μ_2 = E[ Σ_{i=1}^n (1 − Z_i) x_i | x ] / E[ Σ_{i=1}^n (1 − Z_i) | x ].

    Since

    E[Z_i|x] = φ(x_i − μ_1) / [φ(x_i − μ_1) + 3φ(x_i − μ_2)],
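These two steps assemble into a complete EM loop for the mixture means (the simulated data with true values (0, 2.5), seed, and starting point are assumptions):

```r
# EM for mu = (mu1, mu2) in the mixture (1/4) N(mu1, 1) + (3/4) N(mu2, 1)
set.seed(8)
n  <- 500
zz <- sample(1:2, n, replace = TRUE, prob = c(1, 3) / 4)
x  <- rnorm(n, mean = c(0, 2.5)[zz])
mu <- c(-1, 1)                      # starting point
for (j in 1:500) {
  w  <- dnorm(x - mu[1]) / (dnorm(x - mu[1]) + 3 * dnorm(x - mu[2])) # E-step
  mu <- c(sum(w * x) / sum(w), sum((1 - w) * x) / sum(1 - w))        # M-step
}
mu   # close to the true means (0, 2.5)
```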

    Monte Carlo Methods with R: Monte Carlo Optimization [117]


    The EM Algorithm

    Normal Mixture MLEs

EM run five times with various starting points

    Two out of five sequences reach the higher mode

    The others converge to the lower mode

    Monte Carlo Methods with R: Monte Carlo Optimization [118]


    Monte Carlo EM

    Introduction

If the computation of Q(θ|θ_0, x) is difficult, we can use Monte Carlo

    For Z_1, . . . , Z_T ~ k(z|x, θ̂_(m)), maximize

    Q̂(θ|θ_0, x) = (1/T) Σ_{i=1}^T log L^c(θ|x, z_i)

    Better: use importance sampling

    Since

    arg max_θ L(θ|x) = arg max_θ log [ g(x|θ) / g(x|θ_(0)) ] = arg max_θ log E_{θ_(0)}[ f(x, z|θ) / f(x, z|θ_(0)) | x ],

    Use the approximation to the log-likelihood (up to an additive constant)

    log L(θ|x) ≈ log { (1/T) Σ_{i=1}^T L^c(θ|x, z_i) / L^c(θ_(0)|x, z_i) },

    Monte Carlo Methods with R: Monte Carlo Optimization [119]


    Monte Carlo EM

    Genetics Data

Example: Genetic linkage

    A classic example of the EM algorithm

    Observations (x_1, x_2, x_3, x_4) are gathered from the multinomial distribution

    M( n; 1/2 + θ/4, (1/4)(1 − θ), (1/4)(1 − θ), θ/4 ).

    Estimation is easier if the x_1 cell is split into two cells

    We create the augmented model

    (z_1, z_2, x_2, x_3, x_4) ~ M( n; 1/2, θ/4, (1/4)(1 − θ), (1/4)(1 − θ), θ/4 )

    with x_1 = z_1 + z_2.

    Complete-data likelihood: θ^{z_2 + x_4} (1 − θ)^{x_2 + x_3}

    Observed-data likelihood: (2 + θ)^{x_1} θ^{x_4} (1 − θ)^{x_2 + x_3}

    Monte Carlo Methods with R: Monte Carlo Optimization [120]


    Monte Carlo EM

    Genetics Linkage Calculations

The expected complete log-likelihood function is

    E_{θ_0}[(Z_2 + x_4) log θ + (x_2 + x_3) log(1 − θ)] = [ (θ_0/(2 + θ_0)) x_1 + x_4 ] log θ + (x_2 + x_3) log(1 − θ),

    which can easily be maximized in θ, leading to the EM step

    θ_1 = [ θ_0 x_1/(2 + θ_0) + x_4 ] / [ θ_0 x_1/(2 + θ_0) + x_2 + x_3 + x_4 ].

    Monte Carlo EM: replace the expectation with

    z̄_m = (1/m) Σ_{i=1}^m z_i,   z_i ~ B(x_1, θ_0/(2 + θ_0))

    The MCEM step would then be

    θ̂_1 = (z̄_m + x_4) / (z̄_m + x_2 + x_3 + x_4),

    which converges to θ_1 as m grows to infinity.

    Monte Carlo Methods with R: Monte Carlo Optimization [121]


    Monte Carlo EM

    Genetics Linkage MLEs

Note the variation in the MCEM sequence

    Can be controlled with the number of simulations

    R code
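The EM and MCEM steps can be compared directly; the counts (125, 18, 20, 34) are the classic linkage data of Dempster et al., an assumption here since the slide does not list them:

```r
# Exact EM versus Monte Carlo EM for the linkage model
x <- c(125, 18, 20, 34)              # (x1, x2, x3, x4), assumed data
em_step <- function(th) {
  ez <- th * x[1] / (2 + th)         # E[Z2 | x1, theta]
  (ez + x[4]) / (ez + x[2] + x[3] + x[4])
}
theta <- 0.5
for (j in 1:100) theta <- em_step(theta)
theta                                # the MLE, about 0.627

set.seed(2)
thm <- 0.5; m <- 1e4
for (j in 1:100) {                   # MCEM: replace E[Z2 | ...] by zbar_m
  zbar <- mean(rbinom(m, x[1], thm / (2 + thm)))
  thm  <- (zbar + x[4]) / (zbar + x[2] + x[3] + x[4])
}
thm                                  # close to the EM answer, but noisy
```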

    Monte Carlo Methods with R: Monte Carlo Optimization [122]


    Monte Carlo EM

    Random effect logit model

    Example: Random effect logit model

    Random effect logit model,

y_ij is distributed conditionally on one covariate x_ij as a logit model

    P(y_ij = 1 | x_ij, u_i, β) = exp{βx_ij + u_i} / (1 + exp{βx_ij + u_i}),

    where u_i ~ N(0, σ^2) is an unobserved random effect.

    (U_1, . . . , U_n) therefore corresponds to the missing data Z

    Monte Carlo Methods with R: Monte Carlo Optimization [123]


    Monte Carlo EM

    Random effect logit model likelihood

For the complete-data likelihood with θ = (β, σ),

    Q(θ′|θ, x, y) = Σ_{i,j} y_ij E[β′x_ij + U_i | β, σ, x, y] − Σ_{i,j} E[ log(1 + exp{β′x_ij + U_i}) | β, σ, x, y ] − Σ_i E[U_i^2 | β, σ, x, y]/2σ′^2 − n log σ′,

    and it is impossible to compute the expectations in U_i.

    Were those available, the M-step would be difficult but feasible

    MCEM: simulate the U_i's conditional on β, σ, x, y from

    π(u_i | β, σ, x, y) ∝ exp{ Σ_j y_ij u_i − u_i^2/2σ^2 } / ∏_j [1 + exp{βx_ij + u_i}]

    Monte Carlo Methods with R: Monte Carlo Optimization [124]


    Monte Carlo EM

    Random effect logit MLEs

Top: sequence of β's from the MCEM algorithm

    Bottom: sequence of completed likelihoods

    MCEM sequence:

    Increases the number of Monte Carlo steps at each iteration

    MCEM algorithm:

    Does not have the EM monotonicity property

    Monte Carlo Methods with R: MetropolisHastings Algorithms [125]


Chapter 6: Metropolis–Hastings Algorithms

    "How absurdly simple!", I cried. "Quite so!", said he, a little nettled. "Every problem becomes very childish when once it is explained to you."

    Arthur Conan Doyle

    The Adventure of the Dancing Men

This Chapter

    The first of two chapters on simulation methods based on Markov chains

    The Metropolis–Hastings algorithm is one of the most general MCMC algorithms

    And one of the simplest.

    There is a quick refresher on Markov chains, just the basics.

    We focus on the most common versions of the Metropolis–Hastings algorithm.

    We also look at calibration of the algorithm via its acceptance rate

    Monte Carlo Methods with R: MetropolisHastings Algorithms [126]


    MetropolisHastings Algorithms

    Introduction

We now make a fundamental shift in the choice of our simulation strategy.

    Up to now we have typically generated iid variables

    The Metropolis–Hastings algorithm generates correlated variables

    From a Markov chain

    The use of Markov chains broadens our scope of applications:

    The requirements on the target f are quite minimal

    Efficient decompositions of high-dimensional problems

    Into a sequence of smaller problems.

    This has been part of a Paradigm Shift in Statistics

    Monte Carlo Methods with R: MetropolisHastings Algorithms [127]


    MetropolisHastings Algorithms

    A Peek at Markov Chain Theory

    A minimalist refresher on Markov chains

Basically to define terms; see Robert and Casella (2004, Chapter 6) for more of the story

    A Markov chain {X^(t)} is a sequence of dependent random variables

    X^(0), X^(1), X^(2), . . . , X^(t), . . .

    where the probability distribution of X^(t) depends only on X^(t−1).

    The conditional distribution of X^(t) | X^(t−1) is a transition kernel K:

    X^(t+1) | X^(0), X^(1), X^(2), . . . , X^(t) ~ K(X^(t), X^(t+1)).

    Monte Carlo Methods with R: MetropolisHastings Algorithms [128]


    Markov Chains

    Basics

For example, a simple random walk Markov chain satisfies

    X^(t+1) = X^(t) + ε_t,   ε_t ~ N(0, 1),

    The Markov kernel K(X^(t), X^(t+1)) corresponds to a N(X^(t), 1) density.

    Markov chain Monte Carlo (MCMC) Markov chains typically have a very strong stability property:

    They have a stationary probability distribution

    A probability distribution f such that if X^(t) ~ f, then X^(t+1) ~ f, so we have the equation

    ∫_X K(x, y) f(x) dx = f(y).

    Monte Carlo Methods with R: MetropolisHastings Algorithms [129]


    Markov Chains

    Properties

MCMC Markov chains are also irreducible, or else they are useless:

    The kernel K allows for free moves all over the state space

    For any X^(0), the sequence {X^(t)} has a positive probability of eventually reaching any region of the state space

    MCMC Markov chains are also recurrent, or else they are useless:

    They will return to any arbitrary nonnegligible set an infinite number of times

    Monte Carlo Methods with R: MetropolisHastings Algorithms [130]


    Markov Chains

    AR(1) Process

AR(1) models provide a simple illustration of continuous Markov chains

    Here

    X_n = ϱ X_{n−1} + ε_n,   with ε_n ~ N(0, σ^2)

    If the ε_n's are independent

    X_n is independent from X_{n−2}, X_{n−3}, . . . conditionally on X_{n−1}.

    The stationary distribution π(x | ϱ, σ^2) is

    N( 0, σ^2/(1 − ϱ^2) ),

    which requires |ϱ| < 1.
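Both regimes are easy to simulate (chain length, seed, and the ϱ values below are choices made for illustration):

```r
# AR(1): recurrent when |rho| < 1, transient when |rho| > 1
ar1 <- function(rho, len = 1000) {
  x <- numeric(len)                      # starts at x[1] = 0
  for (t in 2:len) x[t] <- rho * x[t - 1] + rnorm(1)
  x
}
set.seed(4)
xr <- ar1(0.4)      # recurrent chain
xt <- ar1(1.05)     # transient chain
sd(xr)              # near the stationary sd 1/sqrt(1 - 0.4^2), about 1.09
max(abs(xt))        # blows up: no stationary distribution exists
```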


    Markov Chains

    Statistical Language

    We associate the probabilistic language of Markov chains

    With the statistical language of data analysis.

Statistics              |  Markov chain
    marginal distribution   |  invariant distribution
    proper marginals        |  positive recurrent

    If the marginals are not proper, or if they do not exist

    Then the chain is not positive recurrent.

    It is either null recurrent or transient, and both are bad.

    Monte Carlo Methods with R: MetropolisHastings Algorithms [132]


    Markov Chains

    Pictures of the AR(1) Process

AR(1) Recurrent and Transient (note the scale)

    [Figure: four AR(1) sample paths, for ϱ = 0.4, 0.8, 0.95, and 1.001; note the scale of each panel.]

    R code

    Monte Carlo Methods with R: MetropolisHastings Algorithms [133]


    Markov Chains

    Ergodicity

In recurrent chains, the stationary distribution is also a limiting distribution

    If f is the limiting distribution:

    X^(t) → X ~ f,   for any initial value X^(0)

    This property is also called ergodicity

    For integrable functions h, the standard average

    (1/T) Σ_{t=1}^T h(X^(t)) → E_f[h(X)],

    The Law of Large Numbers

    Sometimes called the Ergodic Theorem

    Monte Carlo Methods with R: MetropolisHastings Algorithms [134]


    Markov Chains

    In Bayesian Analysis

There is one case where convergence never occurs:

    When, in a Bayesian analysis, the posterior distribution is not proper

    The use of improper priors f(x) is quite common in complex models

    Sometimes the posterior is proper, and MCMC works (recurrent)

    Sometimes the posterior is improper, and MCMC fails (transient)

    These transient Markov chains may present all the outer signs of stability

    More later

    Monte Carlo Methods with R: MetropolisHastings Algorithms [135]


    Basic Metropolis Hastings algorithms

    Introduction

The working principle of Markov chain Monte Carlo methods is straightforward:

    Given a target density f

    We build a Markov kernel K with stationary distribution f

    Then generate a Markov chain (X^(t)) converging to X ~ f

    Integrals can be approximated thanks to the Ergodic Theorem

    The Metropolis–Hastings algorithm is an example of those methods.

    Given the target density f, we simulate from a candidate q(y|x)

    Only need that the ratio f(y)/q(y|x) is known up to a constant

    Monte Carlo Methods with R: MetropolisHastings Algorithms [136]


    A First MetropolisHastings Algorithm

Metropolis–Hastings: Given x^(t),

    1. Generate Y_t ~ q(y|x^(t)).

    2. Take

    X^(t+1) = Y_t with probability ρ(x^(t), Y_t), and X^(t+1) = x^(t) with probability 1 − ρ(x^(t), Y_t),

    where

    ρ(x, y) = min{ [f(y)/f(x)] [q(x|y)/q(y|x)], 1 }.

    q is called the instrumental or proposal or candidate distribution

    ρ(x, y) is the Metropolis–Hastings acceptance probability

    Looks like simulated annealing, but at constant temperature:

    Metropolis–Hastings explores rather than maximizes

    Monte Carlo Methods with R: MetropolisHastings Algorithms [137]


    Generating Beta Random Variables

Target density f is the Be(2.7, 6.3); candidate q is uniform

    Notice the repeats

    Repeats must be kept!
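A minimal sketch of this example (seed and chain length are arbitrary choices):

```r
# Metropolis-Hastings for a Be(2.7, 6.3) target with a U(0, 1) candidate;
# with a uniform candidate the q-ratio is 1
set.seed(6)
Niter <- 1e4
X <- numeric(Niter)
X[1] <- runif(1)
for (t in 2:Niter) {
  y   <- runif(1)
  rho <- min(dbeta(y, 2.7, 6.3) / dbeta(X[t - 1], 2.7, 6.3), 1)
  X[t] <- if (runif(1) < rho) y else X[t - 1]   # repeats are kept
}
mean(X)   # near the Be(2.7, 6.3) mean 2.7/9 = 0.3
```

Dropping the repeated values would bias the sample toward the candidate, not the target.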

    Monte Carlo Methods with R: MetropolisHastings Algorithms [138]


    Comparing Beta densities

Comparison with independent sampling:

    Histograms indistinguishable

    Moments match

    K-S test accepts

    R code

    Monte Carlo Methods with R: MetropolisHastings Algorithms [139]


    A Caution

The MCMC and exact sampling outcomes look identical, but

    The Markov chain Monte Carlo sample has correlation; the iid sample does not

    This means that the quality of the sample is necessarily degraded

    We need more simulations to achieve the same precision

    This is formalized by the effective sample size for Markov chains (later)

    Monte Carlo Methods with R: MetropolisHastings Algorithms [140]


    Some Comments

In the symmetric case q(x|y) = q(y|x),

    ρ(x_t, y_t) = min{ f(y_t)/f(x_t), 1 }.

    The acceptance probability is independent of q

    Metropolis–Hastings always accepts values of y_t such that

    f(y_t)/q(y_t|x^(t)) > f(x^(t))/q(x^(t)|y_t)

    Values y_t that decrease the ratio may also be accepted

    Metropolis–Hastings only depends on the ratios

    f(y_t)/f(x^(t))   and   q(x^(t)|y_t)/q(y_t|x^(t)),

    Independent of normalizing constants

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [141]

Basic Metropolis–Hastings algorithms

The Independent Metropolis–Hastings algorithm

The Metropolis–Hastings algorithm allows a general q(y|x)

We can use q(y|x) = g(y), a special case

Independent Metropolis–Hastings: Given x^(t),

1. Generate Y_t ∼ g(y).

2. Take

   X^(t+1) = Y_t with probability min{ [f(Y_t) g(x^(t))] / [f(x^(t)) g(Y_t)], 1 },
   X^(t+1) = x^(t) otherwise.

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [142]

Basic Metropolis–Hastings algorithms

Properties of the Independent Metropolis–Hastings algorithm

A straightforward generalization of the Accept–Reject method

Candidates are independent, but the output is still a Markov chain

The Accept–Reject sample is iid, but the Metropolis–Hastings sample is not

The Accept–Reject acceptance step requires calculating the bound M

Metropolis–Hastings is Accept–Reject for the lazy person

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [143]

Basic Metropolis–Hastings algorithms

Application of the Independent Metropolis–Hastings algorithm

We now look at a somewhat more realistic statistical example

Get preliminary parameter estimates from a model

Use an independent proposal based on those parameter estimates

For example, to simulate from a posterior distribution π(θ|x) ∝ π(θ)f(x|θ):

Take a normal or a t distribution centered at the MLE

Covariance matrix equal to the inverse of Fisher's information matrix

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [144]

Independent Metropolis–Hastings algorithm

Braking Data

The cars dataset relates braking distance (y) to speed (x) in a sample of cars.

Model

   y_ij = a + b x_i + c x_i² + ε_ij

The likelihood function is

   (1/σ²)^{N/2} exp( −(1/(2σ²)) Σ_ij (y_ij − a − b x_i − c x_i²)² ),  where N = Σ_i n_i

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [145]

Independent Metropolis–Hastings algorithm

Braking Data Least Squares Fit

Candidate from Least Squares

R command: x2=x^2; summary(lm(y~x+x2))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.63328   14.80693   0.178    0.860
x            0.88770    2.03282   0.437    0.664
x2           0.10068    0.06592   1.527    0.133

Residual standard error: 15.17 on 47 degrees of freedom

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [146]

Independent Metropolis–Hastings algorithm

Braking Data Metropolis Algorithm

Candidate: normal distributions centered at the MLEs,

   a ∼ N(2.63, (14.8)²),

   b ∼ N(0.887, (2.03)²),

   c ∼ N(0.100, (0.065)²),

and an inverted gamma for the variance,

   σ⁻² ∼ G(n/2, (n − 3)(15.17)²)

See the variability of the curves associated with the simulation.
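Putting the pieces together, here is a sketch of the independent Metropolis–Hastings sampler for the braking model. It refits the candidate from R's built-in cars data (so the estimates differ slightly from the numbers on the slide) and assumes a flat π(a, b, c, σ²) ∝ 1/σ² prior, which is an assumption of this sketch, not something stated on the slide:

```r
# Independent MH for y = a + b x + c x^2 + eps, using cars; the 1/sigma^2
# prior is an assumption of this sketch
set.seed(3)
x <- cars$speed; y <- cars$dist; n <- length(y)
fit <- lm(y ~ x + I(x^2))
mle <- coef(fit)
se  <- summary(fit)$coefficients[, 2]
s2  <- summary(fit)$sigma^2

logpost <- function(th, sig2)   # log posterior, up to a constant
  sum(dnorm(y, th[1] + th[2] * x + th[3] * x^2, sqrt(sig2), log = TRUE)) -
    log(sig2)
logcand <- function(th, sig2)   # log candidate density, with the Jacobian
  sum(dnorm(th, mle, se, log = TRUE)) +   # of the map sigma^-2 -> sigma^2
    dgamma(1 / sig2, n / 2, rate = (n - 3) * s2, log = TRUE) - 2 * log(sig2)

Nsim  <- 5000
theta <- matrix(mle, Nsim, 3, byrow = TRUE)
sig2  <- rep(s2, Nsim)
for (t in 2:Nsim) {
  thp <- rnorm(3, mle, se)                          # normals at the MLEs
  s2p <- 1 / rgamma(1, n / 2, rate = (n - 3) * s2)  # inverted gamma draw
  lr  <- logpost(thp, s2p) - logcand(thp, s2p) -
         logpost(theta[t - 1, ], sig2[t - 1]) + logcand(theta[t - 1, ], sig2[t - 1])
  if (log(runif(1)) < lr) {
    theta[t, ] <- thp; sig2[t] <- s2p
  } else {
    theta[t, ] <- theta[t - 1, ]; sig2[t] <- sig2[t - 1]
  }
}
```

Plotting the fitted curves a + b x + c x² over the rows of theta shows the variability of the curves mentioned above. Note that, since the proposal is independent of the current state, the candidate densities must appear in the acceptance ratio.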

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [147]

Independent Metropolis–Hastings algorithm

Braking Data Coefficients

Distributions of estimates

Credible intervals

See the skewness

Note that these are marginal distributions

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [148]

Independent Metropolis–Hastings algorithm

Braking Data Assessment

50,000 iterations

See the repeats

The intercept may not have converged

R code

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [149]

Random Walk Metropolis–Hastings

Introduction

Implementation of independent Metropolis–Hastings can sometimes be difficult

Construction of the proposal may be complicated

Independent proposals ignore local information

An alternative is to gather information stepwise

Exploring the neighborhood of the current value of the chain

Can take into account the value previously simulated to generate the next value

Gives a more local exploration of the neighborhood of the current value

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [150]

Random Walk Metropolis–Hastings

Some Details

The implementation of this idea is to simulate Y_t according to

   Y_t = X^(t) + ε_t,

where ε_t is a random perturbation with distribution g, independent of X^(t)

Uniform, normal, etc.

The proposal density q(y|x) is now of the form g(y − x)

Typically, g is symmetric around zero, satisfying g(−t) = g(t)

The Markov chain associated with q is a random walk

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [151]

Random Walk Metropolis–Hastings

The Algorithm

Given x^(t),

1. Generate Y_t ∼ g(y − x^(t)).

2. Take

   X^(t+1) = Y_t with probability min{ 1, f(Y_t)/f(x^(t)) },
   X^(t+1) = x^(t) otherwise.

The proposal chain is a random walk

Due to the Metropolis–Hastings acceptance step, the {X^(t)} chain is not

The acceptance probability does not depend on g

But different g's result in different ranges and different acceptance rates

Calibrating the scale of the random walk is crucial for good exploration
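The scale effect is easy to see in R. The sketch below runs the random-walk algorithm above on a two-component normal mixture (the mixture used here, with modes at ±2, is illustrative and not necessarily the one on the slides):

```r
# Random-walk Metropolis on 0.5 N(-2,1) + 0.5 N(2,1); the scale of the
# normal perturbation controls how well the two modes are explored
set.seed(4)
f <- function(x) 0.5 * dnorm(x, -2) + 0.5 * dnorm(x, 2)
rwmh <- function(scale, Nsim = 10000, x0 = -2) {
  x <- numeric(Nsim)
  x[1] <- x0
  for (t in 2:Nsim) {
    y <- x[t - 1] + rnorm(1, 0, scale)            # random perturbation
    x[t] <- if (runif(1) < f(y) / f(x[t - 1])) y else x[t - 1]
  }
  x
}
# fraction of time spent around the right-hand mode for each scale
sapply(c(0.1, 1, 3), function(s) mean(rwmh(s) > 0))
```

With a tiny scale the chain rarely crosses the valley between the modes; a moderate scale visits both; an overly large scale has many proposals rejected.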

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [152]

Random Walk Metropolis–Hastings

Normal Mixtures

Explore the likelihood with a random walk

Similar to Simulated Annealing, but at constant temperature (scale)

Multimodal target: the scale is important

Too small: get stuck; too big: miss modes

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [153]

Random Walk Metropolis–Hastings

Normal Mixtures - Different Scales

Left to right: Scale = 1, Scale = 2, Scale = 3

Scale = 1: too small, gets stuck

Scale = 2: just right, finds both modes

Scale = 3: too big, misses a mode

R code

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [154]

Random Walk Metropolis–Hastings

Model Selection or Model Choice

Random walk Metropolis–Hastings algorithms also apply to discrete targets.

As an illustration, we consider a regression

The swiss dataset in R: y = logarithm of the fertility in 47 districts of Switzerland in 1888

The covariate matrix X involves five explanatory variables

> names(swiss)
[1] "Fertility"        "Agriculture"      "Examination"      "Education"
[5] "Catholic"         "Infant.Mortality"

Compare the 2^5 = 32 models corresponding to all possible subsets of covariates.

If we include squares and two-way interactions:

2^20 = 1,048,576 models, same R code

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [155]

Random Walk Metropolis–Hastings

Model Selection using Marginals

Given an ordinary linear regression with n observations,

   y | β, σ², X ∼ N_n(Xβ, σ² I_n),  where X is an (n, p) matrix

The likelihood is

   ℓ(β, σ² | y, X) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) )

Using Zellner's g-prior, with the constant g = n,

   β | σ², X ∼ N_{k+1}(β̃, n σ² (XᵀX)⁻¹)  and  π(σ²|X) ∝ σ⁻²

The marginal distribution of y is a multivariate t distribution,

   m(y|X) ∝ (n + 1)^{−(k+1)/2} [ yᵀ( I_n − (n/(n+1)) X(XᵀX)⁻¹Xᵀ ) y − (1/(n+1)) β̃ᵀXᵀXβ̃ ]^{−n/2}

Find the model with maximum marginal probability

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [156]

Random Walk Metropolis–Hastings

Random Walk on Model Space

To go from γ^(t) to γ^(t+1):

First, get a candidate by choosing a component of γ^(t) at random and flipping 1 → 0 or 0 → 1, e.g.

   γ^(t) = (1, 0, 1, 1, 0)  →  γ* = (1, 0, 0, 1, 0)

Accept the proposed model with probability

   min{ m(y|X, γ*) / m(y|X, γ^(t)), 1 }

The candidate is symmetric

Note: This is not the Metropolis–Hastings algorithm in the book - it is simpler
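The flip-one-component walk can be sketched in R for the swiss data. The g-prior log-marginal below takes prior mean β̃ = 0 and includes the (n+1)^{−(k+1)/2} dimension factor; both are assumptions of this sketch, so the rates it produces need not match the slides' numbers exactly:

```r
# Random walk over the 2^5 = 32 submodels of the swiss regression;
# beta-tilde = 0 is an assumption of this sketch
set.seed(5)
y <- log(swiss$Fertility)                  # log fertility as response
X <- as.matrix(swiss[, -1])
n <- length(y)
lmarg <- function(gam) {                   # log m(y|X, gamma), up to a constant
  Xg <- cbind(1, X[, gam == 1, drop = FALSE])
  P  <- Xg %*% solve(crossprod(Xg), t(Xg))
  -(ncol(Xg) / 2) * log(n + 1) -
    (n / 2) * log(sum(y * ((diag(n) - (n / (n + 1)) * P) %*% y)))
}
Nsim <- 2000
gam  <- matrix(0, Nsim, 5)
gam[1, ] <- rep(1, 5)                      # start from the full model
for (t in 2:Nsim) {
  prop <- gam[t - 1, ]
  j <- sample(5, 1)
  prop[j] <- 1 - prop[j]                   # flip one component at random
  gam[t, ] <- if (log(runif(1)) < lmarg(prop) - lmarg(gam[t - 1, ]))
    prop else gam[t - 1, ]
}
colMeans(gam)                              # inclusion rates of the 5 covariates
```

Averaging the visited γ's column by column gives the inclusion rate of each covariate.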

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [157]

Random Walk Metropolis–Hastings

Results from the Random Walk on Model Space

Last iterations of the MH search

The chain goes down often

Top Ten Models

   Marg.   γ
   7.95    1 0 1 1 1
   7.19    0 0 1 1 1
   6.27    1 1 1 1 1
   5.44    1 0 1 1 0
   5.45    1 0 1 1 0

Best model excludes the variable Examination

   γ* = (1, 0, 1, 1, 1)

Inclusion rates:

   Agri   Exam   Educ   Cath   Inf.Mort
   0.661  0.194  1.000  0.904  0.949

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [158]

Metropolis–Hastings Algorithms

Acceptance Rates

There is an infinite number of choices for the candidate q in a Metropolis–Hastings algorithm

Is there an optimal choice?

The choice q = f, the target distribution, is not practical

A criterion for comparison is the acceptance rate

It can easily be computed as the empirical frequency of acceptance

In contrast to the Accept–Reject algorithm, maximizing the acceptance rate is not necessarily best

Especially for random walks

Also look at autocovariance

Monte Carlo Methods with R: Metropolis–Hastings Algorithms [159]

Acceptance Rates

Normals from Double Exponentials

In the Accept–Reject algorithm

To generate a N(0, 1) from a double exponential L(α)

The choice α = 1 optimizes the acceptance rate

In an independent Metropolis–Hastings algorithm

We can use the double exponential as an independent candidate q

Compare the behavior of the Metropolis–Hastings algorithm

When using the L(1) candidate or the L(3) candidate
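This comparison is easy to run. The sketch below (function names are illustrative) estimates the empirical acceptance rate of the independent Metropolis–Hastings algorithm for a N(0, 1) target with a double-exponential L(α) candidate:

```r
# Empirical MH acceptance rates for double-exponential candidates
set.seed(6)
ddexp <- function(x, a) 0.5 * a * exp(-a * abs(x))     # L(alpha) density
rdexp <- function(a) sample(c(-1, 1), 1) * rexp(1, a)  # one L(alpha) draw
mh_rate <- function(a, Nsim = 10000) {
  x <- 0; acc <- 0
  for (t in 1:Nsim) {
    y   <- rdexp(a)
    rho <- min(1, dnorm(y) * ddexp(x, a) / (dnorm(x) * ddexp(y, a)))
    if (runif(1) < rho) { x <- y; acc <- acc + 1 }
  }
  acc / Nsim
}
c(mh_rate(1), mh_rate(3))   # L(1) is accepted much more often than L(3)
```

The L(3) candidate is too concentrated near zero for a N(0, 1) target, so tail values are rarely proposed and the chain moves less.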