Bayesian Parametric and Semi-Parametric Hierarchical Models: An Application to Disinfection By-Products and Spontaneous Abortion
Rich MacLehose
November 9th, 2006
Outline
1. Brief introduction to hierarchical models
2. Introduce 2 ‘standard’ parametric models
3. Extend them with 2 semi-parametric models
4. Applied example of Disinfection by-products and spontaneous abortion
Bayesian Hierarchical Models
• Natural way to model many applied problems
$$y_i \sim N(\mu_i, \sigma^2)$$
$$\mu_i \sim G$$
• μi are assumed exchangeable
• G may depend on further coefficients that extend the hierarchy
Bayesian Hierarchical Models
• Frequently, researchers will be interested in estimating a large number of coefficients, possibly with a large amount of correlation between predictors.
• Hierarchical models may be particularly useful by allowing ‘borrowing’ of information between groups.
Bayesian Hierarchical Models
• For example, we wish to estimate the effect of 13 chemicals on early pregnancy loss.
• The chemicals are highly correlated, making standard frequentist methods unstable.
Hierarchical Regression Models
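• Traditional models treat the outcome as random
$$\text{logit}(\Pr(y_i = 1)) = \alpha + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$
• Hierarchical models also treat coefficients as random
$$\text{logit}(\Pr(y_i = 1)) = \alpha + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$
$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$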
Some Parametric and Semi-Parametric Bayesian Hierarchical Models
1. Simplest hierarchical model (P1)
2. Fully Bayesian hierarchical model (P2)
3. Dirichlet process prior
4. Dirichlet process prior with selection component
1: The first parametric model (P1)
• A simple “one-level” hierarchical model
• Popularized by Greenland in Epidemiology
– he refers to it as “semi-Bayes”
– may refer to asymptotic methods commonly used in fitting semi-Bayes models
• Has seen use in nutritional, genetic, occupational, and cancer research
Hierarchical Models: Bayes and Shrinkage
$$\beta_j \sim N(\mu, \phi^2)$$
• μ is our prior belief about the size of the effect and ϕ² is the uncertainty in it
• Effect estimates from hierarchical models are ‘shrunk’ (moved) toward the prior distribution
• Shrinkage:
– for Bayesian: natural consequence of combining prior with data
– for frequentist: introduce bias to reduce MSE (biased but more precise)
• Amount of shrinkage depends on prior variance
Hierarchical Models: Bayes and Shrinkage
In the simple Normal-Normal setting,
$$y_i \sim N(x_i'\beta, \sigma^2)$$
$$\beta \sim N(\beta_0, \phi^2 I)$$
So the posterior is:
$$\beta \mid y \sim N(\hat{E}, \hat{V})$$
Where:
$$\hat{V} = (X'X/\sigma^2 + I/\phi^2)^{-1}$$
$$\hat{E} = \hat{V}\,(X'y/\sigma^2 + \beta_0/\phi^2)$$
And I is the p×p identity matrix
yi may be either a continuous response or an imputed latent response (via Albert and Chib)
Model P1: Shrinkage
Semi-Bayes (SB) model: μ=0 and ϕ² = 2.0, 1.0, and 0.5
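The Normal-Normal posterior for model P1 can be sketched numerically. This is a minimal illustration, not the analysis code behind the slides: the simulated data, `true_beta`, and the function name `normal_normal_posterior` are assumptions made for the example, and σ² is treated as known. It evaluates the posterior mean at the three prior variances listed for the SB model.

```python
import numpy as np

def normal_normal_posterior(X, y, sigma2, beta0, phi2):
    """Posterior mean E and covariance V of beta when
    y ~ N(X beta, sigma2 I) and beta ~ N(beta0, phi2 I)."""
    p = X.shape[1]
    V = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / phi2)
    E = V @ (X.T @ y / sigma2 + beta0 / phi2)
    return E, V

# Hypothetical simulated data (not from the study)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, -0.5, 0.0])
y = X @ true_beta + rng.normal(size=n)
beta0 = np.zeros(p)  # prior mean mu = 0

ml = np.linalg.lstsq(X, y, rcond=None)[0]  # unshrunk least-squares estimate
print("ML:", np.round(ml, 3))
for phi2 in (2.0, 1.0, 0.5):
    E, _ = normal_normal_posterior(X, y, 1.0, beta0, phi2)
    print("phi2 =", phi2, "posterior mean:", np.round(E, 3))
```

With a prior mean of zero, a smaller ϕ² pulls the posterior mean closer to zero, and a very large ϕ² recovers the least-squares estimate.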
The problem with model P1
• Assumes the prior variance is known with certainty
– constant shrinkage of all coefficients
• Sensitivity analyses address changes to results with different prior variances
• Data contain information on prior variance
2: A richer parametric model (P2)
• Places prior distribution on ϕ²
– reduces dependence on prior variance
• Could place prior on μ as well (in some situations)
$$\text{logit}(\Pr(y_i = 1)) = \alpha + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}$$
$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$
$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$
Properties of model P2
• Prior distribution on ϕ² allows it to be updated by the data
• As variability of estimates from prior mean increases, so does ϕ²
• As variability of estimates from prior mean decreases, so does ϕ²
• Adaptive shrinkage of all coefficients
Posterior Sampling for Model P2
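In the simple Normal-Normal setting,
$$y_i \sim N(x_i'\beta, \sigma^2)$$
$$\beta \sim N(\beta_0, \phi^2 I)$$
$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$
So the conditional posteriors are:
$$\beta \mid \phi^2, y \sim N(\hat{E}, \hat{V})$$
$$\phi^2 \mid \beta \sim IG\left(\alpha_1 + p/2,\; \alpha_2 + (\beta - \beta_0)'(\beta - \beta_0)/2\right)$$
Where:
$$\hat{V} = (X'X/\sigma^2 + I/\phi^2)^{-1}$$
$$\hat{E} = \hat{V}\,(X'y/\sigma^2 + \beta_0/\phi^2)$$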
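The two conditional posteriors for model P2 suggest a two-step Gibbs sampler. The sketch below alternates between them, assuming σ² is known and that IG(a1, a2) is parameterized so that a2 is the inverse-gamma scale. The function name `gibbs_p2`, the simulated data, and the hyperparameter values are illustrative assumptions, not the settings used in the talk.

```python
import numpy as np

def gibbs_p2(X, y, sigma2, beta0, a1, a2, n_iter=1000, seed=0):
    """Gibbs sampler for y ~ N(X beta, sigma2 I), beta ~ N(beta0, phi2 I),
    phi2 ~ IG(a1, a2), with a2 taken as the inverse-gamma scale (assumption)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    phi2 = 1.0  # arbitrary starting value
    beta_draws = np.empty((n_iter, p))
    phi2_draws = np.empty(n_iter)
    for t in range(n_iter):
        # beta | phi2, y ~ N(E, V)
        V = np.linalg.inv(XtX / sigma2 + np.eye(p) / phi2)
        V = (V + V.T) / 2  # enforce symmetry against round-off
        E = V @ (Xty / sigma2 + beta0 / phi2)
        beta = rng.multivariate_normal(E, V)
        # phi2 | beta ~ IG(a1 + p/2, a2 + (beta - beta0)'(beta - beta0)/2)
        d = beta - beta0
        shape = a1 + p / 2
        scale = a2 + d @ d / 2
        phi2 = scale / rng.gamma(shape)  # scale / Gamma(shape, 1) is IG(shape, scale)
        beta_draws[t], phi2_draws[t] = beta, phi2
    return beta_draws, phi2_draws

# Hypothetical simulated data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
true_beta = np.array([0.8, -0.8, 0.0, 0.3])
y = X @ true_beta + rng.normal(size=300)
beta_draws, phi2_draws = gibbs_p2(X, y, 1.0, np.zeros(4), a1=3.0, a2=1.0)
print(np.round(beta_draws[200:].mean(axis=0), 2), round(phi2_draws[200:].mean(), 2))
```

Because ϕ² is redrawn at every iteration from the current spread of the coefficients around the prior mean, the amount of shrinkage adapts to the data instead of being fixed in advance.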
Adaptive Shrinkage of Model P2
Model   Prior variance   ϕ² | data   Shrinkage
P1      Fixed            Constant    Constant
P2      Random           ↓           ↑
Adaptive Shrinkage of Model P2
Model   Prior variance   ϕ² | data   Shrinkage
P1      Fixed            Constant    Constant
P2      Random           ↑           ↓
The Problem with Model P2
• How sure are we of our parametric specification of the prior?
• Can we do better by grouping coefficients into clusters and then shrinking the cluster-specific coefficients separately?
– Amount of shrinkage varies by coefficient
Clustering Coefficients
3: Dirichlet Process Priors
• Popular Bayesian non-parametric approach
• Rather than specifying that βj~N(μ,ϕ²), we specify βj~D
– D is an unknown distribution
– D needs a prior distribution: D~DP(λ,D0)
• D0 is a base distribution such as N(μ,ϕ²)
• λ is a precision parameter. As λ gets large, D converges to D0
Dirichlet Process Prior
An extension of the finite mixture model David presented last week:
$$y_i \mid Z_i = h \sim N(\mu_h, \sigma_h^2)$$
$$Z_i \mid p \sim \sum_{h=1}^{k} p_h \delta_h$$
$$p \sim \text{Dirichlet}(\alpha/k, \ldots, \alpha/k)$$
$$(\mu_h, \sigma_h^2) \sim \text{N-Inv.Gamma}$$
As k becomes infinitely large, this specification becomes equivalent to a DPP
Equivalent Representations of DPP
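Polya Urn Representation:
$$[\mu_i, \sigma_i^2 \mid \mu^{(i)}, \sigma^{2(i)}] \sim \frac{\lambda}{\lambda + P - 1} D_0 + \frac{1}{\lambda + P - 1} \sum_{j \ne i} \delta_{(\mu_j, \sigma_j^2)}$$
Where P is the number of coefficients and D0 ~ N-Inv.Gamma; the superscript (i) denotes all elements except the ith
Stick Breaking Representation:
$$y_i \mid Z_i = h \sim N(\mu_h, \sigma_h^2)$$
$$Z_i \mid p \sim \sum_{h=1}^{\infty} p_h \delta_h$$
$$p_h = p'_h \prod_{l=1}^{h-1} (1 - p'_l)$$
$$p'_h \sim \text{beta}(1, \lambda)$$
$$(\mu_h, \sigma_h^2) \sim \text{N-Inv.Gamma}$$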
Realizations from a Dirichlet Process
λ=1 λ=100 D0=N(0,1)
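A realization like those pictured can be approximated with the stick-breaking construction. The sketch below truncates the infinite sum at a large number of atoms, which is an approximation introduced here. The function name `stick_breaking_draw` and the truncation level are assumptions, while D0 = N(0,1) and the two values of λ follow the slide.

```python
import numpy as np

def stick_breaking_draw(lam, n_atoms=2000, seed=0):
    """One truncated stick-breaking realization of D ~ DP(lam, D0), with D0 = N(0, 1)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, lam, size=n_atoms)  # p'_h ~ beta(1, lam)
    # Stick length remaining before break h: prod_{l<h} (1 - p'_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining               # p_h = p'_h * prod_{l<h} (1 - p'_l)
    atoms = rng.normal(0.0, 1.0, size=n_atoms)  # atom locations drawn from D0
    return weights, atoms

for lam in (1.0, 100.0):
    w, atoms = stick_breaking_draw(lam)
    print("lambda =", lam,
          "| largest weight:", round(float(w.max()), 3),
          "| atoms with weight > 0.01:", int((w > 0.01).sum()))
```

With λ=1 a handful of atoms carry most of the mass, so the realization is visibly discrete. With λ=100 the mass is spread over many small atoms and the realization looks much more like D0.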
Dirichlet Process Prior
• Discrete nature of DPP implies clustering
• Probability of clustering increases as λ decreases
• In this application, we want to cluster coefficients
• Soft clustering: coefficients are clustered at each iteration of the Gibbs sampler, not assumed to be clustered together with certainty
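The link between λ and clustering can be seen by simulating cluster labels from the Polya urn prior alone, with no data involved. This is a sketch under assumptions: the function name `polya_urn_labels` is hypothetical, and 32 coefficients is used only because that is the number of coefficients in the applied example.

```python
import random

def polya_urn_labels(n_coef, lam, rng):
    """Sequentially assign cluster labels under the Polya urn prior."""
    labels = []
    for i in range(n_coef):
        # The (i+1)-th coefficient starts a new cluster with probability lam / (lam + i);
        # otherwise it copies the label of a uniformly chosen earlier coefficient.
        if rng.random() < lam / (lam + i):
            labels.append(len(set(labels)))
        else:
            labels.append(rng.choice(labels))
    return labels

def mean_clusters(n_coef, lam, reps=500, seed=0):
    """Average number of distinct clusters over repeated prior draws."""
    rng = random.Random(seed)
    return sum(len(set(polya_urn_labels(n_coef, lam, rng))) for _ in range(reps)) / reps

for lam in (0.1, 1.0, 100.0):
    print("lambda =", lam, "| mean number of clusters among 32 coefficients:",
          round(mean_clusters(32, lam), 1))
```

Smaller values of λ give fewer distinct clusters on average, which matches the statement that the probability of clustering increases as λ decreases.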
Dirichlet Process Prior
prior for β1 for a given D0 and β2 to β10
Posterior Inference for DPP
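$$y \sim N(\alpha + X\beta, \sigma^2)$$
$$\beta_j \sim D; \quad D \sim DPP(\lambda, D_0); \quad D_0 = N(\mu, \phi^2)$$
Use Polya Urn Scheme:
$$\beta_j \mid \beta^{(j)}, y \sim w_{0j} D_{0j} + \sum_{k \ne j} w_{kj} \delta_{\beta_k}$$
where
$$w_{0j} \propto \lambda \int N(y_j^* \mid X_j \beta_j, \sigma^2)\, N(\beta_j \mid \mu, \phi^2)\, d\beta_j$$
$$w_{kj} \propto N(y_j^* \mid X_j \beta_k, \sigma^2)$$
$$D_{0j} \propto N(y_j^* \mid X_j \beta_j, \sigma^2)\, N(\beta_j \mid \mu, \phi^2)$$
$$y_i^* = y_i - \alpha - x_i^{(j)} \beta^{(j)}$$
The superscript (j) denotes all elements except the jth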
Posterior Inference for DPP
• Coefficients are assigned to clusters based on the weights, w0j to wop.
• After assignment, cluster specific coefficients are updated to improve mixing
• DPP precision parameter can be random as well
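One Polya urn update of a single coefficient can be sketched for a continuous (or latent) Normal response. This is an illustration under stated assumptions rather than the sampler used in the talk: σ², μ, ϕ², and λ are held fixed, the function names are hypothetical, and the integral in w0j is evaluated in closed form using the Normal-Normal marginal likelihood.

```python
import numpy as np

def log_lik(resid, sigma2):
    """Log density of independent N(0, sigma2) residuals."""
    return -0.5 * (resid.size * np.log(2 * np.pi * sigma2) + resid @ resid / sigma2)

def log_marginal(y_star, xj, sigma2, mu, phi2):
    """Log of the integral of N(y* | xj*b, sigma2 I) N(b | mu, phi2) over b,
    i.e. the log density of N(xj*mu, sigma2 I + phi2 xj xj') at y*."""
    r = y_star - xj * mu
    denom = sigma2 + phi2 * (xj @ xj)
    quad = (r @ r - phi2 * (xj @ r) ** 2 / denom) / sigma2
    logdet = r.size * np.log(sigma2) + np.log(denom / sigma2)
    return -0.5 * (r.size * np.log(2 * np.pi) + logdet + quad)

def update_beta_j(j, beta, X, y, alpha, sigma2, mu, phi2, lam, rng):
    """Redraw beta[j] given all other coefficients via the Polya urn weights."""
    beta = beta.copy()
    xj = X[:, j]
    others = np.delete(np.arange(beta.size), j)
    y_star = y - alpha - X[:, others] @ beta[others]  # partial residual y*
    # First entry: w_0j (fresh draw); remaining entries: w_kj for each k != j
    log_w = np.array([np.log(lam) + log_marginal(y_star, xj, sigma2, mu, phi2)]
                     + [log_lik(y_star - xj * beta[k], sigma2) for k in others])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    pick = rng.choice(w.size, p=w)
    if pick == 0:
        # New value from D_0j: Normal posterior of beta_j given y*
        V = 1.0 / (xj @ xj / sigma2 + 1.0 / phi2)
        E = V * (xj @ y_star / sigma2 + mu / phi2)
        beta[j] = rng.normal(E, np.sqrt(V))
    else:
        beta[j] = beta[others[pick - 1]]  # join the cluster of coefficient k
    return beta

# Hypothetical data in which the first two coefficients share one value
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 1.0, 0.0]) + rng.normal(size=100)
beta = np.zeros(3)
for sweep in range(50):
    for j in range(3):
        beta = update_beta_j(j, beta, X, y, 0.0, 1.0, 0.0, 0.3, 1.0, rng)
print(np.round(beta, 2))
```

Each update either copies the value of another coefficient or draws a fresh one, so cluster membership can change from one iteration to the next. That is the soft clustering described earlier.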
4: Dirichlet Process Prior with Variable Selection Models
• Minor modification to Dirichlet process prior model
• We may desire a more parsimonious model
• If some DBPs have no effect, would prefer to eliminate them from the model
• Forward/backward selection results in inappropriate confidence intervals
Dirichlet Process Prior with Variable Selection Models
• We incorporate a selection model in the Dirichlet Process’s base distribution:
$$D_0 = \pi \delta_0 + (1 - \pi) N(\mu, \phi^2)$$
• π is the probability that a coefficient has no effect
• (1- π) is the probability that it is N(μ,ϕ²)
Dirichlet Process with Variable Selection
• A coefficient is equal to zero (no effect) with probability π
• A priori, we expect this to happen (π × 100)% of the time
• We place a prior distribution on π to allow the data to guide inference
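The base distribution with a selection component is a two-part mixture: a point mass at zero and a Normal. A minimal sketch of drawing from it follows. The function name `draw_from_base` and the parameter values are assumptions for illustration, and π is held fixed here rather than given a prior.

```python
import random

def draw_from_base(pi, mu, phi2, rng):
    """One draw from D0 = pi * (point mass at 0) + (1 - pi) * N(mu, phi2)."""
    if rng.random() < pi:
        return 0.0                    # coefficient has no effect
    return rng.gauss(mu, phi2 ** 0.5)  # gauss takes a standard deviation

rng = random.Random(0)
draws = [draw_from_base(0.5, 0.0, 0.31, rng) for _ in range(10000)]
share_null = sum(d == 0.0 for d in draws) / len(draws)
print("share of draws exactly at the null:", round(share_null, 3))
```

About π × 100% of the draws are exactly zero, which is the behavior the prior is meant to encode.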
Posterior Inference
• Gibbs sampling proceeds as in previous model, except weights are modified.
• Additional weight for null cluster
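$$w_{0j} \propto \lambda (1 - \pi) \int N(y_j^* \mid X_j \beta_j, \sigma^2)\, N(\beta_j \mid \mu, \phi^2)\, d\beta_j$$
$$w_{null,j} \propto \pi\, N(y_j^* \mid 0, \sigma^2)$$
$$w_{kj} \propto (1 - \pi)\, N(y_j^* \mid X_j \beta_k, \sigma^2)$$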
Dirichlet Process Prior with Variable Selection
prior for β1 for a given D0 and β2 to β10
Simulations
• Four hierarchical models, how do they compare?
• The increased complexity of these hierarchical models seems to make sense, but what does it gain us?
• Simulated datasets of size n=500
MSE of Hierarchical Models
Example: Spontaneous Abortion and Disinfection By-Products
• Pregnancy loss prior to 20 weeks of gestation
• Very common (>30% of all pregnancies)
• Relatively little known about its causes
– maternal age, smoking, prior pregnancy loss, occupational exposures, caffeine
– disinfection by-products (DBPs)
Disinfection By-Products (DBPs)
• A vast array of DBPs are formed in the disinfection process
• We focus on 2 main types:
– trihalomethanes (THMs): CHCl3, CHBr3, CHCl2Br, CHClBr2
– haloacetic acids (HAAs): ClAA, Cl2AA, Cl3AA, BrAA, Br2AA, Br3AA, BrClAA, Br2ClAA, BrCl2AA
Specific Aim
• To estimate the effect of each of the 13 constituent DBPs (4 THMs and 9 HAAs) on SAB
• The Problem: DBPs are very highly correlated – for example:
• ρ=0.91 between Cl2AA and Cl3AA
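To get a rough sense of what a correlation of 0.91 does to an unpenalized estimate, consider the simplest case, which is an assumption made here for illustration: ordinary linear regression with two predictors. There the sampling variance of each coefficient is inflated by 1/(1 − ρ²) relative to uncorrelated predictors. The function name `variance_inflation` is hypothetical.

```python
def variance_inflation(rho):
    """Variance inflation factor for two predictors with correlation rho."""
    return 1.0 / (1.0 - rho ** 2)

print(round(variance_inflation(0.91), 2))
```

The applied model here is a discrete-time hazard model with many more predictors, so this figure is only indicative of the direction and rough size of the problem.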
Right From the Start
• Enrolled 2507 women from three metropolitan areas in the US
• 2001-2004
• Recruitment:
– Prenatal care practices (52%)
– Health department (32%)
– Promotional mailings (3%)
– Drug stores, referral, etc. (13%)
Preliminary Analysis
• Discrete time hazard model including all 13 DBPs (categorized into 32 coefficients)
– time to event: gestational weeks until loss
$$\text{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + \gamma_1 z_{1i} + \cdots + \gamma_p z_{pi} + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$
• α’s are week specific intercepts (weeks 5…20)
• z’s are confounders: smoking, alcohol use, ethnicity, maternal age
• xkij is the concentration of kth category of DBP for the ith individual in the jth week
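A discrete-time hazard model is commonly fit as a logistic regression on person-period records, with one row per woman per gestational week at risk. The sketch below shows that expansion. The function name `person_weeks` and the record layout are assumptions for illustration, not the study's data format.

```python
def person_weeks(first_week, last_week, lost):
    """Expand one pregnancy into weekly records.
    event = 1 only in the week of loss; all other weeks at risk get 0."""
    rows = []
    for week in range(first_week, last_week + 1):
        rows.append({"week": week, "event": 1 if (lost and week == last_week) else 0})
    return rows

# Hypothetical woman followed from week 5 with a loss in week 8
for row in person_weeks(5, 8, lost=True):
    print(row)
```

Each row would then carry the week-specific intercept, the confounders, and that week's DBP categories as covariates.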
Results of Logistic Regression
Results of Logistic Regression
• Several large but imprecise effects are seen
• 4 of 32 coefficients are statistically significant
• Imprecision makes us question results– better analytic approach
DBPs and SAB: model P1
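$$\text{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + z_i'\gamma + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$
$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$
Little prior evidence of effect: specify μ=0
Calculate ϕ² from existing literature
largest effect: OR=3.0
ϕ² = ((ln(3.0) − ln(1/3)) / (2 × 1.96))² = 0.3142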
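The prior variance calculation can be checked directly. The idea is that 95% of the prior mass for an odds ratio should fall between 1/3 and 3, so the prior standard deviation on the log-odds scale is the width of that interval divided by 2 × 1.96, and the variance is its square. The function name `prior_variance_from_or` is an assumption for this sketch.

```python
import math

def prior_variance_from_or(or_upper, z=1.96):
    """Prior variance on the log-odds scale such that 95% of the prior mass
    for the odds ratio lies between 1/or_upper and or_upper."""
    sd = (math.log(or_upper) - math.log(1.0 / or_upper)) / (2.0 * z)
    return sd ** 2

phi2 = prior_variance_from_or(3.0)
print(round(phi2, 4))
```

Note that the bracketed ratio on its own is the prior standard deviation (about 0.56); squaring it gives the variance of about 0.31 used in the later models.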
Semi-Bayes Results
Red=ML Estimates Black= SB Estimates
DBPs and SAB: Model P2
• μ=0
• ϕ² is random. Choose α1=3.39, α2=1.33
– E(ϕ²)=0.31 (as in Semi-Bayes analysis)
– V(ϕ²)=0.07 (at ϕ²’s 95th percentile, 95% of β’s will fall between OR=6 and OR=1/6, the most extreme we believe to be possible)
$$\text{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + z_i'\gamma + \beta_1 x_{1ij} + \cdots + \beta_k x_{kij}$$
$$\beta_j \sim N(\mu, \phi^2); \quad j = 1, \ldots, k$$
$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$
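The stated prior mean and variance of ϕ² can be checked against the chosen hyperparameters. The slides do not say which inverse-gamma parameterization is used. As a hedged observation, the values 0.31 and 0.07 appear to correspond to the one in which 1/ϕ² follows a Gamma distribution with shape α1 and scale α2, so that the inverse-gamma scale is 1/α2. The function name `inverse_gamma_moments` is an assumption.

```python
def inverse_gamma_moments(shape, scale):
    """Mean and variance of an inverse-gamma distribution with the given
    shape and scale (the mean needs shape > 1, the variance shape > 2)."""
    mean = scale / (shape - 1.0)
    var = scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))
    return mean, var

a1, a2 = 3.39, 1.33
# Assumed parameterization: 1/phi2 ~ Gamma(shape=a1, scale=a2), so IG scale = 1/a2
mean, var = inverse_gamma_moments(a1, 1.0 / a2)
print(round(mean, 2), round(var, 2))
```

If α2 were instead read directly as the inverse-gamma scale, the prior mean would be about 0.56 rather than 0.31, so the parameterization matters when reproducing these settings in other software.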
Fully-Bayes Results
Red=ML & semi-Bayes Black=fully-Bayes
DBP and SAB: Dirichlet Process Priors
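• μ=0, α1=3.39, α2=1.33
• ν1=1, ν2=1: uninformative choice for λ
$$\text{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + z_i'\gamma + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$
$$\beta_j \sim D; \quad D \sim DPP(\lambda, D_0)$$
$$\lambda \sim G(\nu_1, \nu_2); \quad D_0 = N(\mu, \phi^2)$$
$$\phi^2 \sim IG(\alpha_1, \alpha_2)$$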
Dirichlet Process Priors Results
DBPs and SAB: Dirichlet Process Priors with Selection Component
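• μ=0, α1=3.39, α2=1.33, ν1=1, ν2=1
• ω1=1.5, ω2=1.5: E(π)=0.5, 95%CI(0.01, 0.99)
$$\text{logit}(\Pr(T_i = j \mid T_i \ge j, \cdot)) = \alpha_j + z_i'\gamma + \beta_1 x_{1ij} + \cdots + \beta_{32} x_{32ij}$$
$$\beta_j \sim D; \quad D \sim DPP(\lambda, D_0)$$
$$\lambda \sim G(\nu_1, \nu_2); \quad D_0 = \pi \delta_0 + (1 - \pi) N(\mu, \phi^2)$$
$$\phi^2 \sim IG(\alpha_1, \alpha_2); \quad \pi \sim \text{beta}(\omega_1, \omega_2)$$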
Selection Component Results
Conclusions (Hierarchical Models)
• Semi-Bayes: Assumes β random
• Fully-Bayes: Assumes ϕ² random
• Dirichlet Process: Assumes prior distribution is random
• Dirichlet Process with Selection Component: Assumes prior distribution is random and allows coefficients to cluster at the null
• Can improve performance (MSE) with increasing complexity
Conclusions (DBPs and SAB)
• Semi-Bayes models provided the least shrinkage; Dirichlet Process models, the most
• These results are in contrast to previous research
• Very little evidence of an effect of any constituent DBP on SAB
Future Directions
• Very high-dimensional data
– e.g. SNPs
– Cluster effects to reduce dimensions
– Algorithmic problems in large datasets
• retrospective DP