simple methods to improve mcmc efficiency in random effect models william browne*, mousa...

Simple methods to improve MCMC efficiency in random effect models

William Browne*, Mousa Golalizadeh*, Martin Green and Fiona Steele

Universities of Bristol and Nottingham

*Thanks to ESRC for supporting this work

Summary

• Introduction.• Application 1 – Clutch size in great tits.• Method 1: Hierarchical centering.• Method 2: Parameter expansion.• Application 2 – Mastitis incidence in dairy cattle.• Method 3: Orthogonal predictors.• Application 3 – Contraceptive discontinuation in

Indonesia.• Conclusions.

Introduction/Synopsis

• MCMC methods allow easy fitting of complex random effects models

• The simplest (default) MCMC algorithms can produce poorly mixing chains.

• By reparameterising the model one can greatly improve mixing.

• These reparameterisations are easy to implement in WinBUGS (or MLwiN)

• The choice of reparameterisation depends in part on the model/dataset.

Application 1: Great tit nesting behaviour (crossed random effects)• Original work was collaborative research with

Richard Pettifor (Institute of Zoology, London), and Robin McCleery and Ben Sheldon (University of Oxford).

Application 1: Great tit nesting behaviour (crossed random effects)• A longitudinal study of great tits

nesting in Wytham Woods, Oxfordshire.

• 6 responses : 3 continuous & 3 binary. • Clutch size, lay date and mean

nestling mass.• Nest success, male and female

survival.• Data: 4165 nesting attempts over a

period of 34 years. • There are 4 higher-level classifications

of the data: female parent, male parent, nestbox and year.

• We only consider Clutch size here

Data background

Source Numberof IDs

Median#obs

Mean#obs

Year 34 104 122.5

Nestbox 968 4 4.30

Male parent 2986 1 1.39

Female parent 2944 1 1.41

Note there is very little information on each individual male and female bird but we can get some estimates of variability via a random effects model.

The data structure can be summarised as follows:

MCMC efficiency for clutch size response

• The MCMC algorithm used in the univariate analysis of clutch size was a simple 10-step Gibbs sampling algorithm.

• .

• To compare methods for each parameter we can look at the effective sample sizes (ESS) which give an estimate of how many ‘independent samples we have’ for each parameter as opposed to 50,000 dependent samples.

• ESS = # of iterations/,

1

)(21k

k

)2,0(~

),2)5(

,0(~)5()(

),2)4(

,0(~)4()(

),2)3(

,0(~)3()(

),2)2(

,0(~)2()(

)5()(

)4()(

)3()(

)2()(

eN

ie

uN

iyearu

uN

inestboxu

uN

ifemaleu

uN

imaleu

ie

iyearu

inestboxu

ifemaleu

imaleu

iy

Clutch SizeParameter Estimate (S.E.) Percentage

variance β 8.808 (0.109) -

2)5(u (Year) 0.365 (0.101) 14.3%

2)4(u (Nest box) 0.107 (0.026) 4.2%

2)3(u (Male) 0.046 (0.043) 1.8%

2)2(u (Female) 0.974 (0.062) 38.1%

2e (Observation) 1.064 (0.055) 41.6%

Here we see that the average clutch size is just below 9 eggs with large variability between female birds and some variability between years. Male birds and nest boxes have less impact.

Effective Sample sizes

Parameter MLwiN WinBUGS

Fixed Effect 671 602

Year 30632 29604

Nestbox 833 788

Male 36 33

Female 3098 3685

Observation 110 135

Time 519s 2601s

We will now consider methods that will improve theESS values for particular parameters. We will firstly consider the fixed effect parameter.

Trace and autocorrelation plots for fixed effect using standard Gibbs sampling algorithm

Hierarchical Centering

This method was devised by Gelfand et al. (1995) for use in nested models. Basically (where feasible) parameters are moved up the hierarchy in a model reformulation. For example:

),0(~),,0(~, 220 eijujijjij NeNueuy

is equivalent to

),0(~),,(~, 220 eijujijjij NeNey

The motivation here is we remove the strong negative correlation between the fixed and random effects by reformulation.

Hierarchical Centering

).,(~),,(~),,(~

),,(~),,(~,1

),,0(~),,(~),,0(~

),,0(~),,0(~

,

1212)5(

12)4(

12)3(

12)2(0

22)5(0

)5()(

2)4(

)4()(

2)3(

)3()(

2)2(

)2()(

)5()(

)4()(

)3()(

)2()(

euu

uu

eiuiyearuinestbox

uimaleuifemale

iiyearinestboximaleifemalei

NeNNu

NuNu

euuuy

In our cross-classified model we have 4 possible hierarchies up which we can move parameters. We have chosen to move the fixed effect up the year hierarchy as it’s variance had biggest ESS although this choice is rather arbitrary.

The ESS for the fixed effect increases 50-fold from 602 to 35,063 while for the year level variance we have a smaller improvement from 29,604 to 34,626. Note this formulation also runs faster 1864s vs 2601s (in WinBUGS).

Trace and autocorrelation plots for fixed effect using hierarchical centering formulation

Parameter Expansion

• We next consider the variances and in particular the between-male bird variance. • When the posterior distribution of a variance parameter has some mass near zero this can hamper the mixing of the chains for both the variance parameter and the associated random effects. • The pictures over the page illustrate such poor mixing.• One solution is parameter expansion (Liu et al. 1998). • In this method we add an extra parameter to the model to improve mixing.

Trace plots for between males variance and a sample male effect using standard Gibbs sampling

algorithm

Autocorrelation plot for male variance and a sample male effect using standard Gibbs sampling

algorithm

Parameter Expansion

),(~,5,..2),,(~,1,1

),,0(~),,0(~),,0(~

),,0(~),,0(~

,

1212)(0

22)5(

)5()(

2)4(

)4()(

2)3(

)3()(

2)2(

)2()(

)5()(5

)4()(4

)3()(3

)2()(20

ekvk

eiviyearvinestbox

vimalevifemale


k

NeNvNv

NvNv

evvvvy

In our example we use parameter expansion for all 4 hierarchies. Note the parameters have an impact on both the random effects and their variance.

The original parameters can be found by:2

)(22

)()()( and kvkkukik

ki vu

Note the models are not identical as we now have different prior distributions for the variances.

Parameter Expansion

• For the between males variance we have a 20-fold increase in ESS from 33 to 600. • The parameter expanded model has different prior distributions for the variances although these priors are still ‘diffuse’.• It should be noted that the point and interval estimate of the level 2 variance has changed from 0.034 (0.002,0.126) to 0.064 (0.000,0.172).• Parameter expansion is computationally slower 3662s vs 2601s for our example.

Trace plots for between males variance and a sample male effect using parameter expansion.

Combining the two methods

),(~,5,..2),,(~,1,1

),,0(~),,(~),,0(~

),,0(~),,0(~

,

1212)(0

22)5(0

)5()(

2)4(

)4()(

2)3(

)3()(

2)2(

)2()(

)5()(

)4()(4

)3()(3

)2()(20

ekvk

eiviyearuinestbox

vimalevifemale


k

NeNNv

NvNv

evvvy

Hierarchical centering and parameter expansion can easily be combined in the same model. Here we perform centering on the year classification and parameter expansion on the other 3 hierarchies.

Effective Sample sizes

Parameter WinBUGS originally

WinBUGS combined

Fixed Effect 602 34296

Year 29604 34817

Nestbox 788 5170

Male 33 557

Female 3685 8580

Observation 135 1431

Time 2601s 2526s

As we can see below the effective sample sizes for all parameters are improved for this formulation while running time remains approximately the same.

Applications 2 & 3: Multilevel discrete-time survival analysis

• Used to model the durations until the occurrence of events e.g. length of time to death.

• Different from standard regression due to right-censoring and time varying covariates.

• In many applications events may be repeatable and the outcome is duration of continuous exposure to the risk of an event.

• In our applications cows may suffer mastitis more than once and women can initiate and discontinue use of contraceptives several times.

• We begin with two levels of data episodes nested within individuals.

• With a discrete time approach we expand the data to create a lower level – time interval, which represents a fixed period of time and is nested within episode.

Data Expansion

• Suppose we have time intervals (e.g. days, weeks) indexed by t = 1,...,K where K is the maximum duration of an episode.

• Let tij denote the number of intervals for which individual i is observed in episode j.

• Before modelling we now need to expand the data for each episode ij to obtain tij records.

• We have ytij=0 for t=1,…,tij-1 and the response for t= tij is 1 if an episode ends in an event and 0 for censored episodes.

• Example: Individual observed for 7 months and has event after 4 months:

Response vector (0,0,0,1,0,0,0), Time intervals (1,2,3,4,1,2,3)

Modelling

• Having restructured the data we can analyse the event occurrence indicator ytij using standard methods for clustered binary response data.

• Here we model the probability of an event occurrence πtij as a function of duration (zt) and covariates xtij with uj being an individual specific random effect representing unobserved time-invariant individual characteristics.

),0(~

)(logit

)( Bernouilli ~

2uj

jtijttij

tijtij

Nu

u

y

βxαz

MCMC algorithm for random effect logistic regression models

),(~)(,1)(

),0(~ ,)(logit

)( Bernoulli~

120

20

u

ujjij

ijij

pp

Nuu

y

060,...,1, ju j

Consider the following simple model:

A standard MCMC algorithm (as used in MLwiN (Rasbash et al. (2000), Browne (2003)) is

Step 1: Update using random walk Metropolis sampling.

Step 2: Update using random walk Metropolis sampling.

Step 3: Update from its inverse Gamma full conditional using Gibbs Sampling

2u

Hierarchical centered formulation

),(~)(,1)(

),(~ ,)(logit

)(Bernoulli~

120

20

**

u

ujjij

ijij

pp

Nuu

y

We can reparameterise by replacing the residuals uj

with random effects u*j= β0+uj. Here the u*j are

(hierarchically) centred around β0.

This formulation allows Gibbs sampling to be used for the fixed effect β0 in addition to the random effects variance. It is interesting to consider how an MCMC algorithm for this centered formulation can be expressed in terms of the original parameterisation.

MCMC Algorithm

At iteration t+1• Step 1: Update β0 from it’s normal conditional

distribution:

Adjust to keep u*

j fixed.

The other steps are unchanged by reparameterisation as the random walk Metropolis sampling for u*

j is equivalent to the same step for uj and the final step is unchanged.

.60

ˆ,60

ˆ

)ˆ,ˆ(~),,|(260

1

)(00

02*

0

u

j

jt

u

Du

DNyp

u

60,...,1)1(0

)(0

)( juu tttjj

Why should hierarchical centering be useful for discrete time survival models ?

• As we have seen discrete time survival models are a special case of random effects logistic regression models.

• Expansion results in many higher level predictors that can be (hierarchically) centered around.

• Hierarchical centering should improve mixing but also speed up estimation.

• Will be particularly useful when we have many level 1 units per level 2 unit and large level 2 variance, where the speed up will be greatest – see application 2.

• Not always useful but other potential solutions – see application 3.

Application 2: Mastitis incidence in dairy cattle

• Mastitis – inflammation of mammary gland of dairy cows, usually caused by a bacterial infection. Infections arising in dry (non-lactating) period often result in clinical mastitis in early lactation.

• Green et al. (2007) use multilevel survival models to investigate how cow, farm and management factors in the dry period correlate with mastitis incidence.

• Data – 2 years x 52 dairy farms – 103 farm years – 8,710 cow lactation periods following dry periods expands to 256,382 records in total!

Mastitis model

• The Model has many predictors.

• The duration terms zt consists of an intercept plus polynomials in log time to order 3.

• The 4 levels are farm, farm year, cow dry period and weekly obs. although no random effects occur at the dry period level.

• The model can easily be converted to a hierarchical centered formulation by centering around the 10 farm-year level fixed effects

)2,2(~)(),,(~)(

15,...,1,1)(,3,...,0,1)(

),0(~),,0(~

Parity1)(logit

)( Bernouilli ~

112

2

10

pvu

ll

vkujk

ijkkkjktijkttijk

tijktijk

SIWpp

lplp

MVNvNu

vvu

y

βxαz

Mastitis predictors

• Predictors at dry period level (6) are :• Parity of cow (5 categories – 4 dummies)• Dummies for Somatic cell count high before drying off, and whether

two SCC readings were available.• At the farm level (9):• One dummy for cows remaining standing for 30 minutes after dry cow

treatment, • Two dummy variables to indicate whether cubicle bedding is

disinfected in the early dry period and whether this is not applicable due to the system used.

• Two dummy variables to indicate if transition cow cubicles are bedded at least once daily and whether there are in fact transition cow cubicles.

• A dummy variable as to whether the cubicle bedding is disinfected in the transition dry period

• Three dummy variables to indicate whether the transition cubicle feed and loaf area is scraped daily, more often than daily or doesn't exist.

Mastitis results

• The model was run for 50,000 iterations after an adapting period and a further burnin of 500 iterations in MLwiN.

• Due to the number of parameters in model we will in the table that follows simply give effective sample size (ESS) values although it should also be noted that parameter values for the non-centred formulation were rather variable due to the small ESS

• The estimation took ~19 hours for the non-centred formulation and 11½ hours for the centred formulation.

Effective sample sizes

Param N Centred

Centred Param N Centred Centred

α0 32 2695 β7 291 3187

α1 875 926 β8 703 3554

α2 98 75 β9 43 5359

α3 99 75 β10 44 3448

β1 1095 1058 β11 34 2674

β2 3747 2087 β12 118 2268

β3 4759 3275 β13 57 2524

β4 5165 4271 β14 288 1945

β5 2765 1438 β15 50 5183

β6 1386 1248

σ2v00 630 764 σ2

v11 2439 2275

σ2v01 1555 1566 σ2

u 796 1052

Method 3a: Orthogonal polynomials

• It is noticeable that the parameters α1 -α3 have small ESS. These correspond to the predictor polynomials in log duration and the three correlations between the predictors are -0.61, 0.79 and -0.90 respectively.

• We can without altering the rest of the analysis replace this group of predictors via a reparameterisation that makes the predictors less correlated and in fact we choose to make them orthogonal i.e.

• To do this we can keep the first predictor and then add the rest in one at a time solving the above at each stage.

• This results in predictors z* :

• logt, logt2+0.94logt, logt3+ 1.50logt2-2.05logt

256382

1

** ,,0i

kiji kjkjzz

Results of orthogonal polynomials

Fitting this model will give coefficient α* that can be easily converted back to the coefficients for the original predictors:

α1= α*1+0.94 α *

2-2.05 α *3, α2= α*

2+1.50 α *3, , α3= α*

3

Effective sample size comparison:

Parameter Standard Orthogonal

α0 2695 3337

α1 926 1997

α2 75 779

α3 75 1038

Application 3: Contraceptive discontinuation in Indonesia

• Steele et al. (2004) use multilevel multistate models to study transitions in and out of contraceptive use in Indonesia. Here we consider a simplification of their model which considers only the transition from use to non-use, commonly referred to as contraceptive discontinuation.

• The data come from the 1997 Indonesia Demographic and Health Survey. Contraceptive use histories were collected retrospectively for the six-year period prior to the survey, and include information on the month and year of starting and ceasing use, the method used, and the reason for discontinuation.

• The analysis is based on 17,833 episodes of contraceptive use for 12,594 women, where an episode is defined as a continuous period of using the same method of contraception.

• Restructuring the data to discrete-time format with monthly time intervals leads to 365,205 records. To reduce the size, monthly intervals are grouped into six-month intervals and a binomial response is defined with denominator equal to the number of months of use within a six-month interval. Aggregation of intervals leads to a dataset with 68,515 records.

Model

• We here have intervals nested within episodes nested within women.

10,...,1,1)(,4,...,0,1)(

).,(~)(),,0(~

)(logit

),( Binomial ~

122

lplp

pNu

u

ny

ll

uuj

jtijttij

tijtijtij

βxαz

Here we include discrete categories for the duration terms zt – a piecewise constant hazard with categories representing 6-11 months, 12-23 months, 24-35 months and >35 months with a base category of 0-5 months.

Predictors

• At the episode level we have:

Age categorized as <25,25-34 and 34-49.

Contraceptive method categorized as pill/injectable, Norplant/IUD, other modern and traditional.

• At the woman level we have:

Education (3 categories).

Type of region of residence (urban/rural).

Socioeconomic status (low, medium or high).

Results

Param N Centred Centred Param N Centred Centred

α0 1665 28 β4 20517 18820

α1 13405 91 β5 21118 19313

α2 11553 81 β6 2036 50

α3 11900 126 β7 2093 44

α4 10463 161 β8 16965 79

β1 13855 163 β9 6488 33

β2 15933 11840 β10 6876 55

β3 19911 20562 σ2u 14 14

The table below gives the effective sample sizes based on runs of 250,000 iterations.

Here the hierarchical centered formulation does really badly. This is because the cluster variance σ2

u is very small: estimates of 0.041 and 0.022 for the two methods

A closer look at the residuals

• It is well known that hierarchical centering works best if the cluster level variance is substantial.

• Here we see that both the variance is small and the distribution of the residuals is not very normal.

• This is due to a few women who discontinue usage very quickly and often, whilst many women never discontinue!

Normal scores

Std

res

idua

ls

Simple logistic regression

• We will consider first removing the random effects from the model (due to their small variance) which will result in a simple logistic regression model.

• It will then be impossible to perform hierarchical centering however we will consider an extension to the orthogonalisation performed in the previous application.

• Note that Hills and Smith (1992) talk about using orthogonal parameterisations and Roberts and Gilks give it one sentence in ‘MCMC in Practice’. Here we consider it in combination with the simple (single site) random walk Metropolis sampler where reduction of correlation in the posterior is perhaps most important.

Method 3b:Orthogonal parameterisation

• For simplicity assume we have all predictors in one matrix P and that we can write ztα+xtijβ as ptijθ where θ=(α,β).

• Step 1: Number the predictors in some ordering 1,…,N.

• Step 2: Take each predictor in turn and replace it with a predictor that is orthogonal to all the already considered predictors.

• For predictor pk.

• Note this requires solving k-1 equations in k-1 unknown w parameters.

• A different orthogonal set of predictors results from each ordering.

kipp

ppwpwpwp

kT

i

kkkkkkk

0)( that so

...

Create

**

1,12,21,1*

Orthogonal parameterisation

• The second step of the algorithm produces both a set of orthogonal predictors that span the same space as the original predictors and a group of w coefficients that can be combined to form a lower diagonal matrix W.

• We can fit this model and recover the coefficients for the original predictors by pre-multiplication by WT.

• It is worth noting here that we use improper Uniform priors for the coefficients and if we used proper priors we would need to also calculate the Jacobian for the reparameterisation to ensure the same priors are used.

• We ordered the predictors in what follows so that the level 2 predictors were last before performing reparameterisation.

Results

• The following is based on 50,000 iterations:

Param Original Orthogonal Param Original Orthogonal

α0 403 11578 β4 11306 12010

α1 4249 12049 β5 9877 12023

α2 3617 11878 β6 500 10945

α3 4643 11920 β7 514 11610

α4 5260 11864 β8 5646 10591

β1 5114 12908 β9 1466 11214

β2 7787 10188 β10 1686 10249

β3 10291 8518Here we see almost universal benefit of the orthogonal parameterisation with virtually zero time costs and very little programming!

Combining orthogonalisation with parameter expansion

• Combining orthogonalisation and parameter expansion we have:

222*

12*

62

**

and ,),( that so

),(~)(,14,...,0,1)(

)10,0(~)(),,0(~

)(logit

),(Binomial~

vukkT

vl

vj

jtijtij

tijtijtij

vuW

plp

NpNv

vp

ny

We ran this model using WinBUGS and only 25,000 iterations following a burnin of 500 iterations which took 34 hours compared to 23½ for 250k in MLwiN without parameter expansion. The results are given overleaf.

Results for full model

• Here we compare simply using the orthogonal approach in MLwiN for 250k with both orthogonal predictors and parameter expansion in WinBUGS for 25k. Note this takes ~1.5 times as long:

Param Orthog.

MLwiN

Param

Expan

Param Orthog.

MLwiN

Param

Expan

α0 22714 14009 β4 23533 23488

α1 23931 22017 β5 24498 23792

α2 24136 15024 β6 23816 22546

α3 23303 4859 β7 23428 24422

α4 22457 1881 β8 22860 22995

β1 23347 22609 β9 23697 23960

β2 22779 20883 β10 23624 23383

β3 22105 14032 σ2u 20 318

Trace plots of variance chain

Before Parameter Expansion

After Parameter Expansion

Here we see the far greater mixing of the variance chain after parameter expansion.

It is worth noting that parameter expansion uses a different prior for σ2

u and results in an estimate of 0.059 (0.048) as opposed to 0.008 (0.006) without and earlier estimates of 0.041(0.026) and 0.022 (0.018) before orthogonalisation.

Note however that all estimates bar parameter expansion are based on very low ESS!

Conclusions• Hierarchical centering – as is well known - works well when

the cluster variance is big but is no good for small variances • Other research building on hierarchical centering includes

Papaspiliopoulus et al. (2003,2007) showing how to construct partially non-centred parameterisations.

• Parameter expansion works well to improve mixing when the cluster variance is small but results in a different prior for the variance.

• Orthogonalisation of predictors appears to be a good idea generally but is slightly more involved than the other reparameterisations i.e. the predictors need orthogonalising outside WinBUGS and the chains need transforming back.

• An interesting area of further research is choosing the order for orthogonalisation i.e. which set of orthogonal predictors to use.

References• Browne, W.J. (2003). MCMC Estimation in MLwiN. London: Institute of

Education, University of London• Browne, W.J. (2004). An illustration of the use of reparameterisation

methods for improving MCMC efficiency in crossed random effect models Multilevel Modelling Newsletter 16 (1): 13-25

• Gamerman D. (1997) Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing. 7, 57--68.

• Gelfand, A.E., Sahu, S.K. and Carlin, B.P. (1995) Efficient parameterisations for normal linear mixed models. Biometrika 83, 479--488

• Gelman, A., Huang, Z., van Dyk, D., and Boscardin, W.J. (2007). Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics (to appear).

• Green, M.J., Bradley, A.J., Medley, G.F. and Browne, W.J. (2007) Cow, Farm and Management Factors during the Dry Period that determine the rate of Clinical Mastitis after calving. Journal of Dairy Science 90: 3764--3776.

• Hills, S.E. and Smith, A.F.M. (1992) Parameterization Issues in Bayesian Inference. In Bayesian Statistics 4, (J M Bernardo, J O Berger, A P Dawid, and A F M Smith, eds), Oxford University Press, UK, pp. 227--246.

References cont.• Liu, C., Rubin, D.B., and Wu, Y.N. (1998) Parameter expansion to accelerate

EM: The PX-EM algorithm. Biometrika 85 (4): 755-770.• Liu, J.S., Wu, Y.N. (1999) Parameter Expansion for Data Augmentation.

Journal Of The American Statistical Association 94: 1264-1274• Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2003) Non-centred

Parameterisations for Hierarchical Models and Data Augmentation. In Bayesian Statistics 7, (J M Bernardo, M J Bayarri, J O Berger, A P Dawid, D Heckerman, A F M Smith and M West, eds), Oxford University Press, UK, pp. 307--32

• Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2007) A General Framework for the Parametrization of Hierarchical Models. Statistical Science 22, 59--73.

• Rasbash, J., Browne, W.J., Healy, M, Cameron, B and Charlton, C. (2000). The MLwiN software package version 1.10. London: Institute of Education, University of London.

• Steele, F., Goldstein, H. and Browne, W.J. (2004). A general multilevel multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics. Statistical Modelling 4: 145--159

• Van Dyk, D.A., and Meng, X-L. (2001) The Art of Data Augmentation. Journal of Computational and Graphical Statistics. 10, 1--50.

simple methods to improve mcmc efficiency in random effect models william browne*, mousa...

Documents

variance parameter

work slide

fixed effect parameter

male variance

sample male effect

random effects model

extra parameter

male bird variance