bayesian parametric and semi- parametric hierarchical models: an application to disinfection...

Bayesian Parametric and Semi-Parametric Hierarchical Models: An Application to Disinfection By-Products and Spontaneous Abortion

Rich MacLehose

November 9th, 2006

Outline

1. Brief introduction to hierarchical models

2. Introduce 2 ‘standard’ parametric models

3. Extend them with 2 semi-parametric models

4. Applied example of Disinfection by-products and spontaneous abortion

Bayesian Hierarchical Models

• Natural way to model many applied problems

$$y_i \sim N(\mu_i, \sigma^2)$$

$$\mu_i \sim G$$

• μi are assumed exchangeable

• G may depend on further coefficients that extend the hierarchy


• Frequently, researchers will be interested in estimating a large number of coefficients, possibly with a large amount of correlation between predictors.

• Hierarchical models may be particularly useful by allowing ‘borrowing’ of information between groups.


• For example, we wish to estimate the effect of 13 chemicals on early pregnancy loss.

• The chemicals are highly correlated, making standard frequentist methods unstable.

Hierarchical Regression Models

• Traditional models treat the outcome as random

• Hierarchical models also treat coefficients as random

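Traditional model:

$$\operatorname{logit}(\Pr(y_i = 1)) = \alpha + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$

Hierarchical model:

$$\operatorname{logit}(\Pr(y_i = 1)) = \alpha + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$

$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$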

Some Parametric and Semi-Parametric Bayesian Hierarchical Models

1. Simplest hierarchical model (P1)

2. Fully Bayesian hierarchical model (P2)

3. Dirichlet process prior

4. Dirichlet process prior with selection component

1: The first parametric model (P1)

• A simple “one-level” hierarchical model

• Popularized by Greenland in Epidemiology

– he refers to it as “semi-Bayes”

– may refer to asymptotic methods commonly used in fitting semi-Bayes models

• Has seen use in nutritional, genetic, occupational, and cancer research

Hierarchical Models: Bayes and Shrinkage

• μ is our prior belief about the size of the effect and ϕ² is our uncertainty in it

• Effect estimates from hierarchical models are ‘shrunk’ (moved) toward the prior distribution

• Shrinkage:

– for Bayesian: natural consequence of combining prior with data

– for frequentist: introduce bias to reduce MSE (biased but more precise)

• Amount of shrinkage depends on prior variance

$$\beta_j \sim N(\mu, \phi^2)$$

Hierarchical Models: Bayes and Shrinkage

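In the simple Normal-Normal setting,

$$y_i \sim N(x_i'\beta, \sigma^2)$$

$$\beta \sim N(\beta_0, \phi^2 I)$$

So the posterior is:

$$\beta \mid y \sim N(\hat{E}, \hat{V})$$

Where:

$$\hat{V} = (X'X/\sigma^2 + I/\phi^2)^{-1}$$

$$\hat{E} = \hat{V}\,(X'y/\sigma^2 + \beta_0/\phi^2)$$

And I is the p×p identity matrix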

yi may be either a continuous response or an imputed latent response (via Albert and Chib)

Model P1: Shrinkage

SB model: μ=0 and ϕ² = 2.0, 1.0, and 0.5
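The Normal-Normal posterior for model P1 can be sketched numerically. The block below is a minimal illustration, not the code behind the slides: the data are simulated, σ² is treated as known and set to 1, and the function name `normal_normal_posterior` is hypothetical. It evaluates the posterior mean at the three prior variances listed above, with μ=0.

```python
import numpy as np

def normal_normal_posterior(X, y, sigma2, beta0, phi2):
    """Posterior mean E and covariance V of beta when
    y ~ N(X beta, sigma2 I) and beta ~ N(beta0, phi2 I)."""
    p = X.shape[1]
    V = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / phi2)
    E = V @ (X.T @ y / sigma2 + beta0 / phi2)
    return E, V

# Simulated data (illustrative assumption): two highly correlated
# predictors and one independent predictor.
rng = np.random.default_rng(0)
n, p = 200, 3
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.1 * rng.normal(size=n),
    z + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Maximum likelihood (least squares) estimate for comparison.
beta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior means under prior mean 0 and decreasing prior variance.
beta0 = np.zeros(p)
posterior_means = {
    phi2: normal_normal_posterior(X, y, 1.0, beta0, phi2)[0]
    for phi2 in (2.0, 1.0, 0.5)
}
for phi2, E in posterior_means.items():
    print(phi2, np.round(E, 3))
```

As ϕ² decreases, the posterior means move closer to the prior mean of zero, which is the constant shrinkage that model P1 applies to every coefficient.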

The problem with model P1

• Assumes the prior variance is known with certainty

– constant shrinkage of all coefficients

• Sensitivity analyses address changes to results with different prior variances

• Data contain information on prior variance

2: A richer parametric model (P2)

• Places prior distribution on ϕ²

– reduces dependence on prior variance

• Could place prior on μ as well (in some situations)

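$$\operatorname{logit}(\Pr(y_i = 1)) = \alpha + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$

$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$

$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$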

Properties of model P2

• Prior distribution on ϕ² allows it to be updated by the data

• As variability of estimates from prior mean increases, so does ϕ²

• As variability of estimates from prior mean decreases, so does ϕ²

• Adaptive shrinkage of all coefficients

Posterior Sampling for Model P2

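In the simple Normal-Normal setting,

$$y_i \sim N(x_i'\beta, \sigma^2)$$

$$\beta \sim N(\beta_0, \phi^2 I)$$

$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$

So the conditional posteriors are:

$$\beta \mid \phi^2, y \sim N(\hat{E}, \hat{V})$$

$$\phi^2 \mid \beta \sim IG\big(\alpha_1 + p/2,\; \alpha_2 + (\beta - \beta_0)'(\beta - \beta_0)/2\big)$$

Where:

$$\hat{V} = (X'X/\sigma^2 + I/\phi^2)^{-1}$$

$$\hat{E} = \hat{V}\,(X'y/\sigma^2 + \beta_0/\phi^2)$$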
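The two conditional posteriors above suggest a simple Gibbs sampler. What follows is a minimal sketch under stated assumptions rather than the implementation used for the talk: σ² is treated as known, the second IG parameter is read as the scale of the inverse gamma (so 1/ϕ² is gamma with that rate), the data are simulated, and the function name `gibbs_p2` and the hyperparameter values a1=3.0, a2=1.0 are illustrative only.

```python
import numpy as np

def gibbs_p2(X, y, sigma2, beta0, a1, a2, n_iter=2000, seed=0):
    """Gibbs sampler for y ~ N(X beta, sigma2 I), beta ~ N(beta0, phi2 I),
    phi2 ~ IG(a1, a2), with sigma2 assumed known."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    phi2 = 1.0  # arbitrary starting value
    betas = np.empty((n_iter, p))
    phi2s = np.empty(n_iter)
    for t in range(n_iter):
        # beta | phi2, y ~ N(E, V)
        V = np.linalg.inv(XtX / sigma2 + np.eye(p) / phi2)
        E = V @ (Xty / sigma2 + beta0 / phi2)
        beta = E + np.linalg.cholesky(V) @ rng.standard_normal(p)
        # phi2 | beta ~ IG(a1 + p/2, a2 + (beta - beta0)'(beta - beta0)/2),
        # drawn as the reciprocal of a gamma variate with that shape and rate.
        shape = a1 + p / 2
        rate = a2 + 0.5 * (beta - beta0) @ (beta - beta0)
        phi2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        betas[t], phi2s[t] = beta, phi2
    return betas, phi2s

# Simulated example (illustrative values only).
rng = np.random.default_rng(1)
true_beta = np.array([0.5, -0.5, 0.0, 0.2])
X = rng.normal(size=(300, 4))
y = X @ true_beta + rng.normal(size=300)

betas, phi2s = gibbs_p2(X, y, sigma2=1.0, beta0=np.zeros(4), a1=3.0, a2=1.0)
print(np.round(betas[500:].mean(axis=0), 2), round(float(phi2s[500:].mean()), 2))
```

Because ϕ² is redrawn at every iteration from the current spread of the coefficients around the prior mean, the amount of shrinkage adapts to the data instead of being fixed in advance.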

Adaptive Shrinkage of Model P2

Model   Prior variance   ϕ² | data   Shrinkage

P1      Fixed            Constant    Constant

P2      Random           ↓           ↑

Adaptive Shrinkage of Model P2

Model   Prior variance   ϕ² | data   Shrinkage

P1      Fixed            Constant    Constant

P2      Random           ↑           ↓

The Problem with Model P2

• How sure are we of our parametric specification of the prior?

• Can we do better by grouping coefficients into clusters and then shrinking the cluster-specific coefficients separately?

– Amount of shrinkage varies by coefficient

Clustering Coefficients

3: Dirichlet Process Priors

• Popular Bayesian non-parametric approach

• Rather than specifying that βj ~ N(μ, ϕ²), we specify βj ~ D

– D is an unknown distribution

– D needs a prior distribution: D ~ DP(λ, D0)

• D0 is a base distribution such as N(μ, ϕ²)

• λ is a precision parameter. As λ gets large, D converges to D0

Dirichlet Process Prior

An extension of the finite mixture model David presented last week:

$$y_i \mid Z_i = h \sim N(\mu_h, \sigma_h^2)$$

$$Z_i \mid p \sim \sum_{h=1}^{k} p_h \delta_h$$

$$p \sim \text{Dirichlet}(\alpha/k, \ldots, \alpha/k)$$

$$(\mu_h, \sigma_h^2) \sim \text{N-Inv.Gamma}$$

As k becomes infinitely large, this specification becomes equivalent to a DPP

Equivalent Representations of DPP

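Polya Urn Representation:

$$[\mu_i, \sigma_i^2 \mid \mu^{(i)}, \sigma^{2(i)}] \sim \frac{\lambda}{\lambda + P - 1} D_0 + \frac{1}{\lambda + P - 1} \sum_{j \ne i} \delta_{(\mu_j, \sigma_j^2)}$$

Stick Breaking Representation:

$$y_i \mid Z_i = h \sim N(\mu_h, \sigma_h^2)$$

$$Z_i \mid p \sim \sum_{h=1}^{\infty} p_h \delta_h$$

$$p_h = p'_h \prod_{l=1}^{h-1} (1 - p'_l)$$

$$p'_h \sim \text{beta}(1, \lambda)$$

$$(\mu_h, \sigma_h^2) \sim \text{N-Inv.Gamma}$$

Where P is the number of coefficients, the superscript (i) denotes all elements other than the ith, and D0 ~ N-Inv.Gamma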

Realizations from a Dirichlet Process

[Figure panels: λ=1; λ=100; D0=N(0,1)]
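Realizations like those in the figure can be approximated with a truncated stick-breaking construction. This is a minimal sketch, assuming a base distribution D0 = N(0,1) and a truncation at 1000 atoms; the function name `stick_breaking_draw` and the truncation level are illustrative choices, not part of the original slides.

```python
import numpy as np

def stick_breaking_draw(lam, n_atoms=1000, seed=0):
    """One truncated draw D ~ DP(lam, D0) with D0 = N(0, 1).
    Returns the weights p_h and the atom locations."""
    rng = np.random.default_rng(seed)
    # p'_h ~ beta(1, lam)
    p_prime = rng.beta(1.0, lam, size=n_atoms)
    # Stick length remaining before each break: prod_{l<h} (1 - p'_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - p_prime[:-1])))
    # p_h = p'_h * prod_{l<h} (1 - p'_l)
    weights = p_prime * remaining
    # Atom locations drawn independently from the base distribution D0
    atoms = rng.normal(0.0, 1.0, size=n_atoms)
    return weights, atoms

w1, atoms1 = stick_breaking_draw(1.0)
w100, atoms100 = stick_breaking_draw(100.0)
print("lambda=1:   largest weight", round(float(w1.max()), 3))
print("lambda=100: largest weight", round(float(w100.max()), 3))
```

With λ=1 a few atoms carry most of the probability mass, so the realization is visibly discrete; with λ=100 the mass is spread over many atoms and the realization looks much more like D0.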


• Discrete nature of DPP implies clustering

• Probability of clustering increases as λ decreases

• In this application, we want to cluster coefficients

• Soft clustering: coefficients are clustered at each iteration of the Gibbs sampler, not assumed to be clustered together with certainty
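The link between λ and clustering can be checked by simulating from the Polya urn. The sketch below uses the sequential form of the urn, which is equivalent by exchangeability to the conditional form, and assumes D0 = N(0,1) and 13 coefficients; the function name `polya_urn_draw` and the grid of λ values are illustrative.

```python
import numpy as np

def polya_urn_draw(P, lam, rng):
    """Draw P coefficients sequentially from a Polya urn with base D0 = N(0, 1)."""
    values = []
    for i in range(P):
        # With probability lam / (lam + i) draw a new value from D0;
        # otherwise copy one of the i previous values chosen uniformly.
        if rng.random() < lam / (lam + i):
            values.append(rng.normal())
        else:
            values.append(values[rng.integers(i)])
    return np.array(values)

rng = np.random.default_rng(1)
mean_clusters = {}
for lam in (0.1, 1.0, 10.0, 100.0):
    counts = [len(np.unique(polya_urn_draw(13, lam, rng))) for _ in range(2000)]
    mean_clusters[lam] = float(np.mean(counts))
    print(lam, round(mean_clusters[lam], 2))
```

Small λ yields only one or two distinct values among the 13 coefficients on average, while large λ leaves almost every coefficient in its own cluster.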


prior for β1 for a given D0 and β2 to β10

Posterior Inference for DPP

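$$y_i \sim N(\alpha + X_i\beta, \sigma^2)$$

$$\beta_j \sim D; \quad D \sim DPP(\lambda, D_0); \quad D_0 = N(\mu, \phi^2)$$

Use Polya Urn Scheme:

$$\beta_j \mid \beta^{(j)}, y \sim w_{0j} D_{0j} + \sum_{k \ne j} w_{kj}\, \delta_{\beta_k}$$

where

$$w_{0j} \propto \lambda \int N(\beta_j \mid \mu, \phi^2)\, N(y^* \mid X_j\beta_j, \sigma^2)\, d\beta_j$$

$$w_{kj} \propto N(y^* \mid X_j\beta_k, \sigma^2)$$

$$D_{0j} \propto N(y^* \mid X_j\beta_j, \sigma^2)\, N(\beta_j \mid \mu, \phi^2)$$

$$y_i^* = y_i - \alpha - x_i^{(j)}\beta^{(j)}$$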

Posterior Inference for DPP

• Coefficients are assigned to clusters based on the weights, w0j to wop.

• After assignment, cluster specific coefficients are updated to improve mixing

• DPP precision parameter can be random as well
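The cluster assignment weights for a single coefficient can be sketched for the Normal case. This is an illustration under assumptions, not the talk's implementation: σ² is treated as known, y* is the residual after removing the intercept and all other coefficients, and the integral in w0j is evaluated in closed form as N(y* | x μ, σ²I + ϕ² x x'). All function names and numeric values are hypothetical.

```python
import numpy as np

def log_normal_iid(resid, sigma2):
    """Log density of independent N(0, sigma2) residuals."""
    n = resid.size
    return -0.5 * (n * np.log(2 * np.pi * sigma2) + resid @ resid / sigma2)

def log_marginal(y_star, x, mu, phi2, sigma2):
    """Log of the integral of N(y*|x b, sigma2 I) N(b|mu, phi2) over b,
    which equals the N(y* | x mu, sigma2 I + phi2 x x') log density."""
    n = y_star.size
    cov = sigma2 * np.eye(n) + phi2 * np.outer(x, x)
    resid = y_star - x * mu
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def cluster_weights(y_star, x, other_betas, lam, mu, phi2, sigma2):
    """Normalized weights: index 0 is a new draw from the base distribution,
    index k >= 1 is joining the cluster of other_betas[k - 1]."""
    log_w = [np.log(lam) + log_marginal(y_star, x, mu, phi2, sigma2)]
    log_w += [log_normal_iid(y_star - x * b, sigma2) for b in other_betas]
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

# Simulated residuals whose true coefficient is 1.0 (illustrative only).
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y_star = 1.0 * x + 0.5 * rng.normal(size=100)
weights = cluster_weights(y_star, x, other_betas=[1.0, -2.0, 0.0],
                          lam=1.0, mu=0.0, phi2=0.31, sigma2=0.25)
print(np.round(weights, 4))
```

In this example the coefficient is far more likely to join the existing cluster at 1.0 than the clusters at -2.0 or 0.0, because that value explains the residuals best.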

4: Dirichlet Process Prior with Variable Selection Models

• Minor modification to Dirichlet process prior model

• We may desire a more parsimonious model

• If some DBPs have no effect, we would prefer to eliminate them from the model

– forward/backward selection: results in inappropriate confidence intervals

Dirichlet Process Prior with Variable Selection Models

• We incorporate a selection model in the Dirichlet Process’s base distribution:

$$D_0 = \pi\delta_0 + (1 - \pi)\, N(\mu, \phi^2)$$

• π is the probability that a coefficient has no effect

• (1 − π) is the probability that it is N(μ, ϕ²)

Dirichlet Process with Variable Selection

• A coefficient is equal to zero (no effect) with probability π

• A priori, we expect this to happen (π × 100)% of the time

• We place a prior distribution on π to allow the data to guide inference
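Draws from the selection base distribution are a mixture of a point mass at zero and a Normal. A minimal sketch, assuming a fixed π=0.5, μ=0 and ϕ²=0.31 purely for illustration (in the model π has its own prior); the function name `draw_from_selection_base` is hypothetical.

```python
import numpy as np

def draw_from_selection_base(size, pi, mu, phi2, rng):
    """Draw from D0 = pi * delta_0 + (1 - pi) * N(mu, phi2)."""
    is_null = rng.random(size) < pi             # True with probability pi
    draws = rng.normal(mu, np.sqrt(phi2), size=size)
    draws[is_null] = 0.0                        # null coefficients sit exactly at zero
    return draws

rng = np.random.default_rng(2)
draws = draw_from_selection_base(10_000, pi=0.5, mu=0.0, phi2=0.31, rng=rng)
frac_null = float(np.mean(draws == 0.0))
print(round(frac_null, 3))
```

About half of the draws are exactly zero, which is what allows coefficients to cluster at the null.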

Posterior Inference

• Gibbs sampling proceeds as in previous model, except weights are modified.

• Additional weight for null cluster

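$$w_{0j} \propto \lambda (1 - \pi) \int N(\beta_j \mid \mu, \phi^2)\, N(y^* \mid X_j\beta_j, \sigma^2)\, d\beta_j$$

$$w_{null,j} \propto \pi\, N(y^* \mid 0, \sigma^2)$$

$$w_{kj} \propto (1 - \pi)\, N(y^* \mid X_j\beta_k, \sigma^2)$$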

Dirichlet Process Prior with Variable Selection

prior for β1 for a given D0 and β2 to β10

Simulations

• Four hierarchical models, how do they compare?

• The increased complexity of these hierarchical models seems to make sense, but what does it gain us?

• Simulated datasets of size n=500

MSE of Hierarchical Models

Example: Spontaneous Abortion and Disinfection By-Products

• Pregnancy loss prior to 20 weeks of gestation

• Very common (>30% of all pregnancies)

• Relatively little known about its causes

– maternal age, smoking, prior pregnancy loss, occupational exposures, caffeine

– disinfection by-products (DBPs)

Disinfection By-Products (DBPs)

• A vast array of DBPs are formed in the disinfection process

• We focus on 2 main types:

– trihalomethanes (THMs): CHCl3, CHBr3, CHCl2Br, CHClBr2

– haloacetic acids (HAAs): ClAA, Cl2AA, Cl3AA, BrAA, Br2AA, Br3AA, BrClAA, Br2ClAA, BrCl2AA

Specific Aim

• To estimate the effect of each of the 13 constituent DBPs (4 THMs and 9 HAAs) on SAB

• The Problem: DBPs are very highly correlated – for example:

• ρ=0.91 between Cl2AA and Cl3AA

Right From the Start

• Enrolled 2507 women from three metropolitan areas in US

• 2001-2004

• Recruitment:

– Prenatal care practices (52%)

– Health department (32%)

– Promotional mailings (3%)

– Drug stores, referral, etc (13%)

Preliminary Analysis

• Discrete time hazard model including all 13 DBPs (categorized into 32 coefficients)

– time to event: gestational weeks until loss

$$\operatorname{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + \gamma_1 z_{1i} + \cdots + \gamma_p z_{pi} + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$

• α’s are week specific intercepts (weeks 5…20)

• z’s are confounders: smoking, alcohol use, ethnicity, maternal age

• xkij is the concentration of the kth category of DBP for the ith individual in the jth week
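A discrete time hazard model of this form is commonly fit as a logistic regression on person-week records, one row per woman per week at risk. The sketch below shows that expansion only; it is an assumption about data preparation, not a description of the study's code. The field names (`id`, `last_week`, `loss`) are hypothetical, and delayed entry into follow-up is ignored for simplicity.

```python
def expand_person_weeks(records, first_week=5, last_week=20):
    """Expand one record per woman into one row per gestational week at risk.

    Each record is a dict with hypothetical keys:
      'id'        : identifier
      'last_week' : last gestational week observed
      'loss'      : True if a pregnancy loss occurred in 'last_week'
    """
    rows = []
    for r in records:
        end = min(r["last_week"], last_week)
        for week in range(first_week, end + 1):
            # The event indicator is 1 only in the week the loss occurs.
            event = int(r["loss"] and week == r["last_week"])
            rows.append({"id": r["id"], "week": week, "event": event})
    return rows

records = [
    {"id": 1, "last_week": 7, "loss": True},    # loss in week 7
    {"id": 2, "last_week": 20, "loss": False},  # no loss through week 20
]
rows = expand_person_weeks(records)
print(len(rows), sum(row["event"] for row in rows))  # 19 1
```

Week-specific intercepts then correspond to indicator variables for `week`, and week-varying DBP concentrations would be merged onto each row.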

Results of Logistic Regression

Results of Logistic Regression

• Several large but imprecise effects are seen

• 4 of 32 coefficients are statistically significant

• Imprecision makes us question results

– better analytic approach

DBPs and SAB: model P1

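$$\operatorname{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + \gamma' z_i + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$

$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$

Little prior evidence of effect: specify μ=0

Calculate ϕ² from existing literature

largest effect: OR=3.0

ϕ² = ((ln(3.0) − ln(1/3)) / (2 × 1.96))² = 0.3142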
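The prior variance arithmetic can be checked directly. The sketch assumes the usual reading of this calculation: the range from OR=1/3 to OR=3 is treated as a 95% prior interval on the log odds ratio scale, its width is divided by 2 × 1.96 to give a standard deviation, and that standard deviation is squared to give ϕ².

```python
import math

or_upper = 3.0             # largest plausible effect from the literature
or_lower = 1.0 / or_upper  # symmetric lower limit on the odds ratio scale

# Width of the 95% interval on the log scale, converted to a standard deviation.
phi = (math.log(or_upper) - math.log(or_lower)) / (2 * 1.96)
phi2 = phi ** 2

print(round(phi, 4), round(phi2, 4))  # 0.5605 0.3142
```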

Semi-Bayes Results

Red=ML Estimates Black= SB Estimates

DBPs and SAB: Model P2

• μ=0

• ϕ² is random. Choose α1=3.39, α2=1.33

– E(ϕ²)=0.31 (as in Semi-Bayes analysis)

– V(ϕ²)=0.07 (at ϕ²’s 95th percentile, 95% of β’s will fall between OR=6 and OR=1/6…the most extreme we believe to be possible)

$$\operatorname{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + \gamma' z_i + \beta_1 x_{1ij} + \cdots + \beta_k x_{kij}$$

$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$

$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$
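The stated prior mean and variance of ϕ² can be reproduced from α1 and α2, but only under a particular parameterization, which the slides do not spell out. The sketch below assumes that 1/ϕ² follows a gamma distribution with shape α1 and scale α2; under the other common convention (α2 as the inverse-gamma scale) the mean would be α2/(α1 − 1) ≈ 0.56 instead, so treat this as one plausible reading.

```python
a1, a2 = 3.39, 1.33

# Assumption: 1/phi2 ~ Gamma(shape=a1, scale=a2), so phi2 is inverse gamma
# with shape a1 and scale 1/a2.
mean_phi2 = 1.0 / (a2 * (a1 - 1.0))       # defined for a1 > 1
var_phi2 = mean_phi2 ** 2 / (a1 - 2.0)    # defined for a1 > 2

print(round(mean_phi2, 2), round(var_phi2, 2))  # 0.31 0.07
```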

Fully-Bayes Results

Red=ML & semi-Bayes Black=fully-Bayes

DBP and SAB: Dirichlet Process Priors

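• μ=0, α1=3.39, α2=1.33

• ν1=1, ν2=1: uninformative choice for λ

$$\operatorname{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + \gamma' z_i + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$

$$\beta_j \sim D; \quad D \sim DPP(\lambda, D_0)$$

$$\lambda \sim G(\nu_1, \nu_2); \quad D_0 = N(\mu, \phi^2)$$

$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$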

Dirichlet Process Priors Results

DBPs and SAB: Dirichlet Process Priors with Selection Component

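• μ=0, α1=3.39, α2=1.33, ν1=1, ν2=1

• ω1=1.5, ω2=1.5: E(π)=0.5, 95% CI (0.01, 0.99)

$$\operatorname{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + \gamma' z_i + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$

$$\beta_j \sim D; \quad D \sim DPP(\lambda, D_0)$$

$$\lambda \sim G(\nu_1, \nu_2); \quad D_0 = \pi\delta_0 + (1 - \pi)\, N(\mu, \phi^2)$$

$$\phi^2 \sim IG(\alpha_1, \alpha_2); \quad \pi \sim \text{beta}(\omega_1, \omega_2)$$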

Selection Component Results

Conclusions (Hierarchical Models)

• Semi-Bayes: Assumes β random

• Fully-Bayes: Assumes ϕ² random

• Dirichlet Process: Assumes prior distribution is random

• Dirichlet Process with Selection Component: Assumes prior distribution is random and allows coefficients to cluster at the null

• Can improve performance (MSE) with increasing complexity

Conclusions (DBPs and SAB)

• Semi-Bayes models provided the least shrinkage; Dirichlet Process models, the most

• These results are in contrast to previous research

• Very little evidence of an effect of any constituent DBP on SAB

Future Directions

• Very high-dimensional data

– e.g. SNPs

– Cluster effects to reduce dimensions

– Algorithmic problems in large datasets

• retrospective DP
