Gene-Environment Case-Control Studies
Raymond J. Carroll
Department of Statistics
Faculty of Nutrition
Texas A&M University
http://stat.tamu.edu/~carroll
Outline
• Problem: Can more efficient inference be done assuming gene (G) and environment (X) independence?
• Gene-Environment independence: the case-only method
• Profile likelihood approach
• Efficiency gains
• Example
• Conclusions
Acknowledgment
• This work is joint with Nilanjan Chatterjee, National Cancer Institute
• Papers in: Biometrika, Genetic Epidemiology
http://dceg.cancer.gov/people/ChatterjeeNilanjan.html
Outline
• Theoretical Methods:
  • With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood
  • (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data
  • (This allows) generalization to any parametric model for G given X.
A Little Terminology
• Epidemiologists: Case control sample
• Econometricians: Choice-based sample
• These are exactly the same problems
• Subjects have two choices (or disease states)
• Subjects have their covariates sampled conditional on their choices, i.e.,
  • Random sample from those with disease
  • Random sample from those without disease
Basic Problem Formalized
• Case control sample: D = disease
• Gene expression: G
• Environment: X
• Strata: S
• We are interested in main effects for G and (X,S) along with their interaction
Prospective Models
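• Simplest logistic model
$$\mathrm{pr}(D=1 \mid G,X) = H(\beta_0 + \beta_1 G + \beta_2 X + \beta_3 G X)$$
• General logistic model
$$\mathrm{pr}(D=1 \mid G,X) = H\{\beta_0 + m(G,X,\beta_1)\}$$
• Here $H(u) = 1/\{1+\exp(-u)\}$ is the logistic distribution function
• The function $m(G,X,\beta_1)$ is completely general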
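The two prospective models above can be sketched in a few lines of Python. This is only an illustration: the function names (`H`, `m_linear`, `pr_disease`) and all parameter values are hypothetical, and the linear-plus-interaction form is just one possible choice of m.

```python
import numpy as np

def H(u):
    """Logistic distribution function H(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def m_linear(g, x, beta1):
    """One possible m(G, X, beta1): main effects plus a G*X interaction."""
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x

def pr_disease(g, x, beta0, beta1, m=m_linear):
    """General logistic model pr(D=1 | G, X) = H{beta0 + m(G, X, beta1)}."""
    return H(beta0 + m(g, x, beta1))

# Hypothetical parameter values, for illustration only.
beta0, beta1 = -3.0, (0.5, 0.3, 0.4)
p_carrier_exposed = pr_disease(1, 1, beta0, beta1)
p_noncarrier_unexposed = pr_disease(0, 0, beta0, beta1)
```

Any other function of (G, X) and a parameter vector can be passed as `m`, which is the sense in which the general model leaves m unrestricted.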
Case-Control Data
• Case-control data are not a random sample
• We observe (G,X) given D, i.e., we observe the covariates given the response, not vice-versa
• If we had a random sample, linear logistic regression would be used to fit the model
• Obvious idea: ignore the sampling plan and pretend you have a random sample
Case-Control Data
• Known Fact: The intercept is not identified, rest of the model is identified
• Retrospective odds is given as
$$\frac{\mathrm{pr}(G=g,X=x \mid D=1)}{\mathrm{pr}(G=g,X=x \mid D=0)} = \exp\{\beta_0 + m(g,x,\beta_1) - \log(\pi_1/\pi_0)\}, \qquad \pi_d = \mathrm{pr}(D=d)$$
Alternative Derivation: Ignore Sampling Plan
• Consider a prospective study
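• Let Δ = 1 mean selection into the study
• Pretend
$$\mathrm{pr}(\Delta=1 \mid D=d,G,X) \propto n_d / \mathrm{pr}(D=d); \qquad n_d = \text{\# of observations with } D=d$$
• Then compute
$$\mathrm{logit}\{\mathrm{pr}(D=1 \mid \Delta=1,X=x,G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1), \qquad \pi_d = \mathrm{pr}(D=d)$$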
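The selection argument can be checked numerically. The sketch below assumes a hypothetical population with binary G and X and made-up parameter values; it computes the logit of disease given selection directly from Bayes' rule and compares it with the shifted-intercept formula.

```python
import numpy as np

def H(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical population values, chosen only for illustration.
beta0, b_g, b_x, b_gx = -3.0, 0.5, 0.3, 0.4   # risk model parameters
p_g, p_x = 0.2, 0.4                            # pr(G=1) and pr(X=1)
n1, n0 = 400, 800                              # numbers of cases and controls

def m(g, x):
    return b_g * g + b_x * x + b_gx * g * x

def pr_gx(g, x):
    return (p_g if g == 1 else 1 - p_g) * (p_x if x == 1 else 1 - p_x)

cells = [(g, x) for g in (0, 1) for x in (0, 1)]

# Marginal disease probabilities pi_1 = pr(D=1) and pi_0 = pr(D=0).
pi1 = sum(pr_gx(g, x) * H(beta0 + m(g, x)) for g, x in cells)
pi0 = 1.0 - pi1

def logit_given_selection(g, x):
    """logit pr(D=1 | Delta=1, g, x) when pr(Delta=1 | D=d) is proportional to n_d / pi_d."""
    p1 = H(beta0 + m(g, x))
    w1 = (n1 / pi1) * p1
    w0 = (n0 / pi0) * (1.0 - p1)
    return np.log(w1 / w0)

# Largest discrepancy between the direct calculation and the closed form.
max_gap = max(
    abs(logit_given_selection(g, x)
        - (beta0 + np.log(n1 / n0) - np.log(pi1 / pi0) + m(g, x)))
    for g, x in cells
)
```

Only the intercept changes under selection; the m(g,x,β1) part is untouched, which is why ordinary logistic regression recovers everything except β0.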
Case-Control Data
• Fact: all parameters except the intercept can be estimated consistently while ignoring the sampling plan
• Standard Errors: Those computed ignoring the sampling plan are asymptotically correct
$$\mathrm{logit}\{\mathrm{pr}(D=1 \mid \Delta=1,X=x,G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1), \qquad \pi_d = \mathrm{pr}(D=d)$$
Case-Control Data
• The intercept is determined by pr(D=1) in the population, hence not identified from these data
• Little Known Fact: Adding information about pr(D=1) adds no information about $\beta_1$
$$\mathrm{logit}\{\mathrm{pr}(D=1 \mid \Delta=1,X=x,G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1), \qquad \pi_d = \mathrm{pr}(D=d)$$
Gene-Environment Independence
• In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata
• This assumption is often used in gene-environment interaction studies
G-E Independence: Discussion
• Does not always hold!
• Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction
• If False: Possible severe bias (Albert, et al., 2001, our own simulations)
G-E Independence: Discussion
• It is reasonable in many problems
• Example: Environment is a treatment in a randomized study under nested case-control sampling
• Example: Reasonable when exposure is not directly controlled by individual behavior
  • Radiation exposure for A-bomb survivors
  • Carcinogenic exposure of employees
  • Pesticide exposure in a rural community
Generalizations
• I have phrased this problem as one where G and X are independent given strata
• This makes sense contextually in genetic epidemiology
• All the results I will describe go through if you can write down a probability model for G given (X,S): I do this in the Israeli Study.
Generalizations
• If G is binary, it is natural to apply our approach
• Posit a parametric or semiparametric model for G given (X,S)
• Consequences:
  • More efficient estimation of G effects
  • Much more efficient estimation of G and (X,S) interactions.
Gene-Environment Independence
• Rare Disease Approximation: Rare disease for all values of (G,X)
• May be unreasonable for important genes such as BRCA1/2
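• Case-only estimate of multiplicative interaction (Piegorsch, et al., 1994)
$$\mathrm{pr}(D=1 \mid G,X) = H\{\beta_0 + \beta_x X + \beta_g G + \beta_{xg} XG\}$$
$$\exp(\beta_{xg}) \approx \frac{\mathrm{pr}(G=1 \mid X=1,D=1)\,\mathrm{pr}(G=0 \mid X=0,D=1)}{\mathrm{pr}(G=0 \mid X=1,D=1)\,\mathrm{pr}(G=1 \mid X=0,D=1)}$$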
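A small exact calculation shows when the case-only odds ratio tracks the interaction. The sketch assumes binary G and X and hypothetical parameter values; `case_only_log_or` and the two population tables are illustrative names, not part of the original analysis. It evaluates the case-only log odds ratio for a rare disease under independence, for a common disease under independence, and for a rare disease when G and X are associated in the population.

```python
import numpy as np

def H(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-u))

def case_only_log_or(beta0, b_x, b_g, b_xg, pop):
    """Log odds ratio between G and X among cases (D=1).

    pop[(g, x)] is pr(G=g, X=x) in the underlying population.
    """
    joint = {(g, x): pop[(g, x)] * H(beta0 + b_x * x + b_g * g + b_xg * x * g)
             for g in (0, 1) for x in (0, 1)}
    return np.log(joint[(1, 1)] * joint[(0, 0)] / (joint[(0, 1)] * joint[(1, 0)]))

def independent_pop(p_g, p_x):
    """Population table with G and X independent."""
    return {(g, x): (p_g if g else 1 - p_g) * (p_x if x else 1 - p_x)
            for g in (0, 1) for x in (0, 1)}

# Hypothetical values, for illustration only.
b_x, b_g, b_xg = 0.3, 0.5, 0.4
indep = independent_pop(0.2, 0.4)
# A population in which G and X are associated (odds ratio 2.5).
dep = {(0, 0): 0.50, (0, 1): 0.30, (1, 0): 0.08, (1, 1): 0.12}

rare = case_only_log_or(-7.0, b_x, b_g, b_xg, indep)      # rare disease, independence
common = case_only_log_or(-1.0, b_x, b_g, b_xg, indep)    # common disease, independence
dependent = case_only_log_or(-7.0, b_x, b_g, b_xg, dep)   # rare disease, dependence
```

Under these assumptions `rare` is close to `b_xg`, `common` is noticeably off, and `dependent` is shifted by roughly the log of the population G-X odds ratio, which is the bias that arises when independence fails.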
Gene-Environment Independence: Case-Only Analysis
• Positive Consequence: Often much more powerful than standard analysis
  • Power advantage of this method often has led researchers to discard information on controls
• Negative Consequence: no ability to estimate other risk parameters, which are often of greater interest (see example later)
• Restrictions: Can only handle multiplicative interaction, requires rare disease in all values of (G,X)
Gene-Environment Independence
• Fact: gain in power for inference about a multiplicative interaction
• Consequence: There is thus (Fisher) information in the assumption
• Conjecture: Can handle general models and improve efficiency for all parameters
• We do this via a semiparametric profile likelihood approach
• We start though from a different likelihood
Prentice-Pyke Calculation
• Methodology: Start with the retrospective likelihood
• The distribution of (X,G) in the population is left unspecified
• Semiparametric MLE is usual logistic regression
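$$\mathrm{pr}(G=g,X=x \mid D=d) = \frac{\mathrm{pr}(X=x,G=g)\,\exp[d\{\beta_0+m(g,x,\beta_1)\}]\,[1-H\{\beta_0+m(g,x,\beta_1)\}]}{\sum_{x',g'} \mathrm{pr}(X=x',G=g')\,\exp[d\{\beta_0+m(g',x',\beta_1)\}]\,[1-H\{\beta_0+m(g',x',\beta_1)\}]}$$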
Environment and Gene Expression
• Methodology: Start with the retrospective likelihood
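• Note how independence of G and X is used here: pr(X=x,G=g) is replaced by the product pr(X=x)pr(G=g) (the red expressions on the slide)
• We do not want to model the often multivariate distribution of X
• Gene distribution model can be standard
$$\mathrm{pr}(G=g,X=x \mid D=d) = \frac{\mathrm{pr}(X=x)\,\mathrm{pr}(G=g)\,\exp[d\{\beta_0+m(g,x,\beta_1)\}]\,[1-H\{\beta_0+m(g,x,\beta_1)\}]}{\sum_{x',g'} \mathrm{pr}(X=x')\,\mathrm{pr}(G=g')\,\exp[d\{\beta_0+m(g',x',\beta_1)\}]\,[1-H\{\beta_0+m(g',x',\beta_1)\}]}$$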
Environment and Gene Expression
• Methodology: Compute a profile estimate
  • Parametric/semiparametric distribution for G
  • Nonparametric distribution for X (possibly high dimensional)
• Result: Explicit profile likelihood
Environment and Gene Expression
• Methodology: Treat $\lambda_i = \mathrm{pr}(X=x_i)$ as distinct parameters
• Let G have parametric structure: $\mathrm{pr}(G=g) = f(g,\theta)$
• Construct the profile likelihood, having estimated the $\lambda_i$ as functions of data and other parameters
• The result is a function of $\Omega = \{\theta, \beta_0, \beta_1, \mathrm{pr}(D=1)\}$: this function can be calculated explicitly!
Profile Likelihood
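• Result:
$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\};$$
$$S(d,g,x,\Omega) = \frac{f(g,\theta)\,\exp[d\{\kappa + m(g,x,\beta_1)\}]}{1+\exp\{\beta_0 + m(g,x,\beta_1)\}}$$
$$\text{Profile Likelihood} = L(\beta_0,\beta_1,\kappa,\theta) = L(\Omega) = \frac{S(D,G,X,\Omega)}{\sum_{d=0}^{1}\int S(d,g,X,\Omega)\,d\mu(g)}$$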
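The explicit form makes the profile likelihood easy to code. The sketch below is a minimal illustration under added assumptions: G is binary with pr(G=1) = θ (so the integral over g becomes a sum over {0,1}), m is linear with a G*X interaction, and the toy data and parameter values are invented. The names `S`, `contribution`, `kappa_from` and `log_profile_likelihood` are hypothetical.

```python
import numpy as np

def m(g, x, beta1):
    """Assumed form of m(g, x, beta1): main effects plus interaction."""
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x

def f(g, theta):
    """Assumed gene model: binary G with pr(G=1) = theta."""
    return theta if g == 1 else 1.0 - theta

def kappa_from(beta0, n1, n0, pi1):
    """kappa = beta0 + log(n1/n0) - log{pr(D=1)/pr(D=0)}."""
    return beta0 + np.log(n1 / n0) - np.log(pi1 / (1.0 - pi1))

def S(d, g, x, beta0, beta1, kappa, theta):
    """S(d,g,x,Omega) = f(g,theta) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    eta = m(g, x, beta1)
    return f(g, theta) * np.exp(d * (kappa + eta)) / (1.0 + np.exp(beta0 + eta))

def contribution(d, g, x, beta0, beta1, kappa, theta):
    """One observation's likelihood: S(d,g,x) / sum over d', g' of S(d',g',x)."""
    denom = sum(S(dd, gg, x, beta0, beta1, kappa, theta)
                for dd in (0, 1) for gg in (0, 1))
    return S(d, g, x, beta0, beta1, kappa, theta) / denom

def log_profile_likelihood(D, G, X, beta0, beta1, kappa, theta):
    """Sum of log contributions over the sample."""
    return sum(np.log(contribution(d, g, x, beta0, beta1, kappa, theta))
               for d, g, x in zip(D, G, X))

# Invented toy data and parameter values, for illustration only.
D = [1, 1, 0, 0, 1, 0]
G = [1, 0, 0, 1, 0, 0]
X = [0.5, -1.2, 0.3, 1.1, 2.0, -0.4]
beta0, beta1, theta = -3.0, (0.5, 0.3, 0.4), 0.2
kappa = kappa_from(beta0, n1=3, n0=3, pi1=0.05)
loglik = log_profile_likelihood(D, G, X, beta0, beta1, kappa, theta)

# For any fixed x the contributions over (d, g) form a probability distribution.
total = sum(contribution(d, g, 0.5, beta0, beta1, kappa, theta)
            for d in (0, 1) for g in (0, 1))
```

In practice this function would be handed to a numerical optimizer over (β0, β1, κ, θ); that step is omitted here.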
Alternative Derivation
• Consider a prospective study
• Let Δ = 1 mean selection into the study
• Pretend
$$\mathrm{pr}(\Delta=1 \mid D=d,G,X) \propto n_d / \mathrm{pr}(D=d); \qquad n_d = \text{\# of observations with } D=d$$
• Then compute
$$\mathrm{pr}(D=d,G=g \mid \Delta=1,X)$$
• This is exactly our profile pseudo-likelihood!
Alternative Derivation
• We compute:
$$\mathrm{pr}(D=d,\,G=g \mid \Delta=1,X)$$
• Standard approach computes
$$\mathrm{pr}(D=d \mid G=g,\Delta=1,X)$$
• It is this insight that allows us to greatly generalize the work past independence of G and X.
Computation
• Intercept: The logistic intercept, and hence pr(D=1), is weakly identified by itself
• Disease rate: If pr(D=1) is known, or a good bound for it is specified, can have significant gains in efficiency.
  • This does not happen for a regular case-control study
Interesting Technical Point
• Profile pseudo-likelihood acts like a likelihood
• Information Asymptotics are (almost) exact
• Missing G data handled seamlessly (see next)
  • Missing genotype
  • Unphased haplotype data
Missing Data
• We have a formal likelihood:
$$\mathrm{pr}(D=d,G=g \mid \Delta=1,X)$$
• If the gene is missing, this suggests the formal likelihood
$$\mathrm{pr}(D=d \mid \Delta=1,X) = \sum_{g^*} \mathrm{pr}(D=d,G=g^* \mid \Delta=1,X)$$
• Result: Inference as if the data were a random sample with missing data
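The missing-genotype rule amounts to summing the joint contribution over the possible values of G. A minimal sketch, assuming binary G with pr(G=1) = θ, a linear m with interaction, and invented parameter values (`S`, `contribution` and `params` are hypothetical names):

```python
import numpy as np

def S(d, g, x, beta0, beta1, kappa, theta):
    """S(d,g,x,Omega) = f(g,theta) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    b_g, b_x, b_gx = beta1
    eta = b_g * g + b_x * x + b_gx * g * x       # assumed form of m(g, x, beta1)
    f = theta if g == 1 else 1.0 - theta         # assumed binary gene model
    return f * np.exp(d * (kappa + eta)) / (1.0 + np.exp(beta0 + eta))

def contribution(d, g, x, params):
    """pr(D=d, G=g | Delta=1, X=x), or pr(D=d | Delta=1, X=x) when g is None."""
    denom = sum(S(dd, gg, x, *params) for dd in (0, 1) for gg in (0, 1))
    if g is None:
        # Genotype missing: sum the joint probability over g* in {0, 1}.
        num = sum(S(d, gg, x, *params) for gg in (0, 1))
    else:
        num = S(d, g, x, *params)
    return num / denom

# Invented values of (beta0, beta1, kappa, theta), for illustration only.
params = (-3.0, (0.5, 0.3, 0.4), -0.05, 0.2)
observed = contribution(1, 1, 0.7, params)       # case with genotype observed
missing = contribution(1, None, 0.7, params)     # case with genotype missing
```

Subjects with and without genotype data then enter the same log-likelihood sum, exactly as in a random sample with missing data.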
Measurement Error
• The likelihood formulation also allows us to deal with measurement error in the environmental variables
First Simulation
• MSE Efficiency of Profile method: 0.02 < pr(D=1) < 0.07
[Figure: MSE efficiency (vertical axis, 0 to 4) for the G, X and G times X effects, shown for pr(G)=.05 and pr(G)=.20]
Israeli Ovarian Cancer Study
• Population based case-control study
• Study the interplay of BRCA1/2 mutations (G) and two known risk factors (E or X) of ovarian cancer:
  • oral contraceptive (OC) use
  • parity.
• Missing Data: Approximately 50% of the controls were not genotyped, and 10% of the cases
Israeli Ovarian Cancer Study
• Results reported in Modan et al., NEJM (2001).
• Their analysis involves
  • Assumption that parity and OC use are independent of BRCA1/2 mutation status
  • Simple but approximate methods for exploiting the G and E independence assumption (including case-only estimate of interaction)
• Risk model adjusted for Age, Race, Family History, History of Gynecological Surgery
Israeli Ovarian Cancer Study
• Disease risk model including same covariates as Modan et al (2001)
• In addition, we explicitly adjusted for the possibility of both G and E being related to S
• FH = family history (breast cancer = 1, ovarian or >= 2 breast cancer = 2)
$$\mathrm{logit}\{\mathrm{Pr}(G=1 \mid S)\} = \beta_0 + \beta_{Ash}\, I(Ash)$$
Israeli Ovarian Cancer Study
• Question: Can carriers be protected via OC-use?
• The logarithm of the odds ratio is the sum of
  • The main effect for OC-use
  • The interaction term between OC-use and being a carrier, i.e., interaction between gene and environment
• Note how this involves main effects and interactions
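Because the carrier-specific log odds ratio is a sum of two coefficients, its standard error needs their covariance as well as their variances. The sketch below uses invented coefficient estimates and an invented covariance matrix, not results from the study, to show the arithmetic.

```python
import numpy as np

# Invented estimates: main effect of OC use and the carrier-by-OC interaction.
b_oc, b_g_oc = -0.2, 0.3
# Invented covariance matrix of (b_oc, b_g_oc).
cov = np.array([[0.04, -0.01],
                [-0.01, 0.09]])

# The log odds ratio for OC use among carriers is the contrast (1, 1).
contrast = np.array([1.0, 1.0])
log_or_carriers = contrast @ np.array([b_oc, b_g_oc])
se = np.sqrt(contrast @ cov @ contrast)

# Odds ratio with a Wald-type 95% interval on the odds-ratio scale.
odds_ratio = np.exp(log_or_carriers)
ci_low = np.exp(log_or_carriers - 1.96 * se)
ci_high = np.exp(log_or_carriers + 1.96 * se)
```

A case-only analysis supplies only the interaction coefficient, so this contrast cannot be formed from it.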
Israeli Ovarian Cancer Study
• Question: Is there a carrier/OC interaction?
• The case-only method can only answer this question
Israeli Ovarian Cancer Study
• Interaction of OC and BRCA1/2:
Israeli Ovarian Cancer Study
• Main Effect of BRCA1/2:
Israeli Ovarian Cancer Study
• Odds ratio for OC use among carriers = 1.04 (0.98, 1.09)
• No evidence for protective effect
• Not available from case-only analysis
• The interval is half as long as the one from the usual analysis
Features of the Method
• Allows estimation of all parameters of logistic regression model and can be used to examine interaction in alternative scales
• Can be used to estimate OR for non-rare diseases
  • Important for studying major genes such as BRCA1/2
Features of the Method
• Allows incorporation of external information on Pr(D=1)
  • Unlike with logistic regression in case-control studies, this information improves efficiency of estimation
Colorectal Adenoma Study
• PLCO Study: 772 cases, 772 controls
• Three SNPs in the calcium-sensing receptor region
• HWE assumed
• Interest in the interaction of number of copies of one haplotype (GCG) and calcium intake from diet
Colorectal Adenoma Study
• Method #1: Write down the prospective likelihood and apply missing data techniques
  • A standard analysis
• If ignoring the case-control sampling scheme works for ordinary logistic regression, it should work for missing haplotype regression too, right?
• Wrong! Biased estimates and standard errors
• Method #2: Our method
Colorectal Adenoma Study
Conclusions
• Standard case-control (choice-based) studies
• Specify a model for G given X, e.g., G-E independence in population after conditioning on strata
• No assumptions made about X (high dimensional)
• All parameters estimable, no rare-disease assumption
• Handle missing G data
• Large gains in efficiency versus usual method
• Large gains in efficiency for effects of environment given the gene
Conclusions
• Theoretical Methods:
  • With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood
  • (Key insight) Equivalent to a device of pretending that the study is a regular random sample subject to missing data
  • (This allows) generalization to any parametric model for G given X.
Acknowledgment
• Two graduate students have worked on this project
  • Iryna Lobach, Yale
  • Christie Spinka, U of Missouri
Thanks!
http://stat.tamu.edu/~carroll