gene-environment case-control studies

Gene-Environment Case-Control Studies

Raymond J. Carroll

Department of Statistics

Faculty of Nutrition

Texas A&M University

http://stat.tamu.edu/~carroll

Outline

• Problem: Can more efficient inference be done assuming gene (G) and environment (X) independence?

• Gene-Environment independence: the case-only method

• Profile likelihood approach

• Efficiency gains

• Example

• Conclusions

Acknowledgment

• This work is joint with Nilanjan Chatterjee, National Cancer Institute

• Papers in: Biometrika, Genetic Epidemiology

http://dceg.cancer.gov/people/ChatterjeeNilanjan.html

Outline

• Theoretical Methods:

• With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood

• (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data

• (This allows) generalization to any parametric model for G given X.

A Little Terminology

• Epidemiologists: Case control sample

• Econometricians: Choice-based sample

• These are exactly the same problems

• Subjects have two choices (or disease states)

• Subjects have their covariates sampled conditional on their choices, i.e.,

• Random sample from those with disease

• Random sample from those without disease

Basic Problem Formalized

• Case control sample: D = disease

• Gene expression: G

• Environment: X

• Strata: S

• We are interested in main effects for G and (X,S) along with their interaction

Prospective Models

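• Simplest logistic model

$$
\mathrm{pr}(D=1 \mid G,X) = H(\beta_0 + \beta_1 G + \beta_2 X + \beta_3 G{*}X)
$$

• General logistic model

$$
\mathrm{pr}(D=1 \mid G,X) = H\{\beta_0 + m(G,X,\beta_1)\}
$$

• The function $m(G,X,\beta_1)$ is completely general

• Here $H(u) = 1/\{1+\exp(-u)\}$ is the logistic distribution function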
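As a minimal sketch of the two prospective models, the following Python evaluates pr(D=1 | G,X) for one choice of m. The function names (`H`, `m_linear`, `pr_disease`) and the coefficient values are hypothetical, chosen only for illustration; the slides leave m completely general.

```python
import math

def H(u):
    """Logistic distribution function H(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def m_linear(g, x, beta1):
    """One possible m(G, X, beta1): main effects plus a G*X interaction."""
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x

def pr_disease(g, x, beta0, beta1, m=m_linear):
    """pr(D=1 | G=g, X=x) = H{beta0 + m(g, x, beta1)}."""
    return H(beta0 + m(g, x, beta1))

# Hypothetical coefficients, not taken from the talk.
beta0, beta1 = -3.0, (0.5, 0.3, 0.4)
p = pr_disease(1, 1, beta0, beta1)
```

Passing a different function as `m` gives the general model; `m_linear` reproduces the simplest logistic model with interaction.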

Case-Control Data

• Case-control data are not a random sample

• We observe (G,X) given D, i.e., we observe the covariates given the response, not vice-versa

• If we had a random sample, linear logistic regression would be used to fit the model

• Obvious idea: ignore the sampling plan and pretend you have a random sample

Case-Control Data

• Known Fact: The intercept is not identified; the rest of the model is identified

• The retrospective odds are given by

$$
\frac{\mathrm{pr}(G=g, X=x \mid D=1)}{\mathrm{pr}(G=g, X=x \mid D=0)} = \exp\{\beta_0 + m(g,x,\beta_1) - \log(\pi_1/\pi_0)\}, \qquad \pi_d = \mathrm{pr}(D=d)
$$
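The retrospective odds on this slide can be checked with Bayes' rule. A short derivation, writing $\pi_d = \mathrm{pr}(D=d)$ and using only the prospective model $\mathrm{pr}(D=1 \mid G,X) = H\{\beta_0 + m(G,X,\beta_1)\}$:

```latex
\begin{align*}
\frac{\mathrm{pr}(G=g,X=x \mid D=1)}{\mathrm{pr}(G=g,X=x \mid D=0)}
&= \frac{\mathrm{pr}(D=1 \mid g,x)\,\mathrm{pr}(g,x)/\pi_1}
        {\mathrm{pr}(D=0 \mid g,x)\,\mathrm{pr}(g,x)/\pi_0} \\
&= \frac{H\{\beta_0 + m(g,x,\beta_1)\}}{1 - H\{\beta_0 + m(g,x,\beta_1)\}}
   \cdot \frac{\pi_0}{\pi_1} \\
&= \exp\{\beta_0 + m(g,x,\beta_1) - \log(\pi_1/\pi_0)\}.
\end{align*}
```

The population distribution pr(g,x) cancels, and $\beta_0$ appears only in the combination $\beta_0 - \log(\pi_1/\pi_0)$, which is why the intercept cannot be separated from pr(D=1).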

Alternative Derivation: Ignore Sampling Plan

• Consider a prospective study

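• Let Δ = 1 mean selection into the study

• Pretend

$$
\mathrm{pr}(\Delta=1 \mid D=d, G, X) \propto n_d / \mathrm{pr}(D=d); \qquad n_d = \text{\# of observations with } D=d
$$

• Then compute

$$
\mathrm{logit}\{\mathrm{pr}(D=1 \mid \Delta=1, X=x, G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1)
$$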
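The intercept shift can be illustrated by simulation. The sketch below is not from the talk: the population size, coefficients, sample sizes, and the helper `fit_logistic` are all hypothetical. It draws a population from the prospective model, takes a case-control sample, and fits ordinary logistic regression while ignoring the sampling plan.

```python
import numpy as np

def fit_logistic(Z, d, iters=25):
    """Newton-Raphson fit of an ordinary (prospective) logistic regression."""
    Z1 = np.column_stack([np.ones(len(d)), Z])      # add intercept column
    b = np.zeros(Z1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z1 @ b))
        grad = Z1.T @ (d - p)
        hess = (Z1 * (p * (1.0 - p))[:, None]).T @ Z1
        b = b + np.linalg.solve(hess, grad)
    return b

rng = np.random.default_rng(0)

# Hypothetical population: binary G and continuous X, drawn independently.
N = 400_000
beta0, beta1 = -4.0, np.array([0.7, 0.5, 0.4])      # (G, X, G*X) coefficients
G = rng.binomial(1, 0.2, N)
X = rng.normal(size=N)
Z = np.column_stack([G, X, G * X])
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-(beta0 + Z @ beta1))))
pi1 = D.mean()                                      # pr(D=1) in this population

# Case-control sampling: n1 cases and n0 controls, sampled given D.
n1 = n0 = 4000
idx = np.concatenate([
    rng.choice(np.flatnonzero(D == 1), n1, replace=False),
    rng.choice(np.flatnonzero(D == 0), n0, replace=False),
])

est = fit_logistic(Z[idx], D[idx])                  # ignores the sampling plan
shifted_intercept = beta0 + np.log(n1 / n0) - np.log(pi1 / (1.0 - pi1))
```

Under these assumptions `est[1:]` should land near `beta1`, while `est[0]` should land near `shifted_intercept` rather than `beta0`.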

Case-Control Data

• Fact: all parameters except the intercept can be estimated consistently while ignoring the sampling plan

• Standard Errors: Those computed ignoring the sampling plan are asymptotically correct

$$
\mathrm{logit}\{\mathrm{pr}(D=1 \mid \Delta=1, X=x, G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1)
$$

Case-Control Data

• The intercept is determined by pr(D=1) in the population, hence not identified from these data

• Little Known Fact: Adding information about pr(D=1) adds no information about $\beta_1$

$$
\mathrm{logit}\{\mathrm{pr}(D=1 \mid \Delta=1, X=x, G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1)
$$

Gene-Environment Independence

• In many situations, it may be reasonable to assume G and X are independently distributed in the underlying population, possibly after conditioning on strata

• This assumption is often used in gene-environment interaction studies

G-E Independence: Discussion

• Does not always hold!

• Example: polymorphisms in the smoking metabolism pathway may affect the degree of addiction

• If False: Possible severe bias (Albert, et al., 2001, our own simulations)

G-E Independence: Discussion

• It is reasonable in many problems

• Example: Environment is a treatment in a randomized study under nested case-control sampling

• Example: Reasonable when exposure is not directly controlled by individual behavior

• Radiation exposure for A-bomb survivors

• Carcinogenic exposure of employees

• Pesticide exposure in a rural community

Generalizations

• I have phrased this problem as one where G and X are independent given strata

• This makes sense contextually in genetic epidemiology

• All the results I will describe go through if you can write down a probability model for G given (X,S): I do this in the Israeli Study.

Generalizations

• If G is binary, it is natural to apply our approach

• Posit a parametric or semiparametric model for G given (X,S)

• Consequences:

• More efficient estimation of G effects

• Much more efficient estimation of G and (X,S) interactions.


• Rare Disease Approximation: Rare disease for all values of (G,X)

• May be unreasonable for important genes such as BRCA1/2

• Case-only estimate of multiplicative interaction (Piegorsch, et al., 1994)

$$
\mathrm{pr}(D=1 \mid G,X) = H\{\beta_0 + \beta_x X + \beta_g G + \beta_{xg} XG\}
$$

$$
\exp(\beta_{xg}) \approx \frac{\mathrm{pr}(G=1 \mid X=1, D=1)\,\mathrm{pr}(G=0 \mid X=0, D=1)}{\mathrm{pr}(G=0 \mid X=1, D=1)\,\mathrm{pr}(G=1 \mid X=0, D=1)}
$$
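A minimal sketch of the case-only calculation for binary G and X follows. The function names and all numbers are hypothetical. The first function computes the estimate from a 2x2 table of cases only, with the usual Woolf standard error for a log odds ratio. The second evaluates the population value of the case-only odds ratio under G-E independence, to show how the rare disease approximation behaves.

```python
import math

def H(u):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-u))

def case_only_log_interaction(n11, n01, n10, n00):
    """Case-only estimate of beta_xg; n_gx = number of CASES with G=g, X=x."""
    log_or = math.log((n11 * n00) / (n01 * n10))
    se = math.sqrt(1.0 / n11 + 1.0 / n01 + 1.0 / n10 + 1.0 / n00)  # Woolf SE
    return log_or, se

def population_case_only_or(beta0, b_x, b_g, b_xg):
    """Odds ratio between G and X among cases when G and X are independent.

    Since pr(G=g | X=x, D=1) is proportional to pr(D=1 | g, x) pr(G=g),
    the pr(G=g) terms cancel and the ratio reduces to H11*H00 / (H01*H10).
    """
    h = lambda g, x: H(beta0 + b_x * x + b_g * g + b_xg * x * g)
    return h(1, 1) * h(0, 0) / (h(0, 1) * h(1, 0))

# Hypothetical case counts and coefficients, for illustration only.
est, se = case_only_log_interaction(n11=40, n01=60, n10=20, n00=80)
rare = population_case_only_or(-6.0, 0.3, 0.5, 0.4)    # rare disease
common = population_case_only_or(0.0, 0.3, 0.5, 0.4)   # common disease
```

With the rare-disease intercept the population value is close to exp(0.4); with the common-disease intercept it is not, which is the restriction noted on the slide.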

Gene-Environment Independence: Case-Only Analysis

• Positive Consequence: Often much more powerful than standard analysis

• The power advantage of this method has often led researchers to discard information on controls

• Negative Consequence: no ability to estimate other risk parameters, which are often of greater interest (see example later)

• Restrictions: Can only handle multiplicative interaction, requires rare disease in all values of (G,X)


• Fact: gain in power for inference about a multiplicative interaction

• Consequence: There is thus (Fisher) information in the assumption

• Conjecture: Can handle general models and improve efficiency for all parameters

• We do this via a semiparametric profile likelihood approach

• We start though from a different likelihood

Prentice-Pyke Calculation

• Methodology: Start with the retrospective likelihood

• The distribution of (X,G) in the population is left unspecified

• Semiparametric MLE is usual logistic regression

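$$
\mathrm{pr}(G=g, X=x \mid D=d) = \frac{\exp[d\{\beta_0 + m(g,x,\beta_1)\}]\,[1 - H\{\beta_0 + m(g,x,\beta_1)\}]\,\mathrm{pr}(X=x, G=g)}{\sum_{x',g'} \exp[d\{\beta_0 + m(g',x',\beta_1)\}]\,[1 - H\{\beta_0 + m(g',x',\beta_1)\}]\,\mathrm{pr}(X=x', G=g')}
$$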

Environment and Gene Expression

• Methodology: Start with the retrospective likelihood

• Note how independence of G and X is used here: see the product pr(X=x)pr(G=g) in the numerator and denominator (shown in red on the original slide)

• We do not want to model the often multivariate distribution of X

• Gene distribution model can be standard

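$$
\mathrm{pr}(G=g, X=x \mid D=d) = \frac{\exp[d\{\beta_0 + m(g,x,\beta_1)\}]\,[1 - H\{\beta_0 + m(g,x,\beta_1)\}]\,\mathrm{pr}(X=x)\,\mathrm{pr}(G=g)}{\sum_{x',g'} \exp[d\{\beta_0 + m(g',x',\beta_1)\}]\,[1 - H\{\beta_0 + m(g',x',\beta_1)\}]\,\mathrm{pr}(X=x')\,\mathrm{pr}(G=g')}
$$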


• Methodology: Compute a profile estimate

• Parametric/semiparametric distribution for G

• Nonparametric distribution for X (possibly high dimensional)

• Result: Explicit profile likelihood


• Methodology: Treat $\lambda_i = \mathrm{pr}(X = x_i)$ as distinct parameters

• Let G have parametric structure: $\mathrm{pr}(G=g) = f(g,\theta)$

• Construct the profile likelihood, having estimated the $\lambda_i$ as functions of data and other parameters

• The result is a function of $\Omega = \{\theta, \beta_0, \beta_1, \mathrm{pr}(D=1)\}$: this function can be calculated explicitly!

Profile Likelihood

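• Result:

$$
\kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\};
$$

$$
S(d,g,x,\Omega) = \frac{f(g,\theta)\exp[d\{\kappa + m(g,x,\beta_1)\}]}{1 + \exp\{\beta_0 + m(g,x,\beta_1)\}}
$$

$$
\text{Profile Likelihood} = L(\beta_0,\beta_1,\kappa,\theta) = L(\Omega) = \frac{S(D,G,X,\Omega)}{\sum_{d=0}^{1} \int S(d,g,X,\Omega)\,d\mu(g)}
$$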
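The explicit form of the profile likelihood makes it straightforward to code. The sketch below is an illustration under stated assumptions, not the authors' implementation: G is binary with f(1,θ) = θ (so the integral over μ is a sum over g = 0, 1), m has main effects plus a G*X interaction, pr(D=1) is supplied as `pi1`, and the data and parameter values are invented.

```python
import math

def S(d, g, x, theta, beta0, beta1, kappa):
    """S(d,g,x,Omega) = f(g,theta) exp[d{kappa+m}] / [1 + exp{beta0+m}]."""
    f = theta if g == 1 else 1.0 - theta        # assumed Bernoulli model for G
    b_g, b_x, b_gx = beta1
    m = b_g * g + b_x * x + b_gx * g * x        # assumed form of m(g, x, beta1)
    return f * math.exp(d * (kappa + m)) / (1.0 + math.exp(beta0 + m))

def contribution(d, g, x, theta, beta0, beta1, kappa):
    """One observation's term: S(d,g,x) / sum over d', g' of S(d',g',x)."""
    denom = sum(S(dd, gg, x, theta, beta0, beta1, kappa)
                for dd in (0, 1) for gg in (0, 1))
    return S(d, g, x, theta, beta0, beta1, kappa) / denom

def log_profile_likelihood(data, theta, beta0, beta1, pi1):
    """Sum of log contributions; data is a list of (d, g, x) triples."""
    n1 = sum(d for d, _, _ in data)
    n0 = len(data) - n1
    kappa = beta0 + math.log(n1 / n0) - math.log(pi1 / (1.0 - pi1))
    return sum(math.log(contribution(d, g, x, theta, beta0, beta1, kappa))
               for d, g, x in data)

# Toy data and parameter values, invented for illustration.
data = [(1, 1, 0.5), (1, 0, 1.2), (0, 0, -0.3), (0, 1, 0.1)]
ll = log_profile_likelihood(data, theta=0.2, beta0=-3.0,
                            beta1=(0.5, 0.3, 0.4), pi1=0.05)
```

In practice this function would be handed to a numerical optimizer over (θ, β0, β1), with pr(D=1) either fixed, bounded, or estimated.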

Alternative Derivation

• Consider a prospective study

• Let Δ = 1 mean selection into the study

• Pretend

$$
\mathrm{pr}(\Delta=1 \mid D=d, G, X) \propto n_d / \mathrm{pr}(D=d); \qquad n_d = \text{\# of observations with } D=d
$$

• Then compute

$$
\mathrm{pr}(D=d, G=g \mid \Delta=1, X)
$$

• This is exactly our profile pseudo-likelihood!

Alternative Derivation

• We compute:

$$
\mathrm{pr}(D=d, G=g \mid \Delta=1, X)
$$

• Standard approach computes

$$
\mathrm{pr}(D=d \mid G=g, \Delta=1, X)
$$

• It is this insight that allows us to greatly generalize the work past independence of G and X.
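One way to read the contrast between the two quantities on this slide is through the usual factorization of a joint probability:

```latex
\mathrm{pr}(D=d, G=g \mid \Delta=1, X)
  = \mathrm{pr}(D=d \mid G=g, \Delta=1, X)\;
    \mathrm{pr}(G=g \mid \Delta=1, X)
```

The standard approach keeps only the first factor. The second factor is where a model for G given X can contribute, which suggests why the joint version can be more informative.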

Computation

• Intercept: The logistic intercept, and hence pr(D=1), is weakly identified by itself

• Disease rate: If pr(D=1) is known, or a good bound for it is specified, can have significant gains in efficiency.

• This does not happen for a regular case-control study

Interesting Technical Point

• Profile pseudo-likelihood acts like a likelihood

• Information Asymptotics are (almost) exact

• Missing G data handled seamlessly (see next)

• Missing genotype

• Unphased haplotype data

Missing Data

• We have a formal likelihood:

$$
\mathrm{pr}(D=d, G=g \mid \Delta=1, X)
$$

• If gene is missing, suggests the formal likelihood

$$
\mathrm{pr}(D=d \mid \Delta=1, X) = \sum_{g^*} \mathrm{pr}(D=d, G=g^* \mid \Delta=1, X)
$$

• Result: Inference as if the data were a random sample with missing data
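In the notation of the Profile Likelihood slide, and assuming a discrete G so that the integral over μ is a sum, the contribution of a subject with missing genotype can be written out as:

```latex
\mathrm{pr}(D=d \mid \Delta=1, X)
  = \frac{\sum_{g^*} S(d, g^*, X, \Omega)}
         {\sum_{d'=0}^{1} \sum_{g'} S(d', g', X, \Omega)}
```

The denominator is the same as for a fully observed subject; only the numerator is summed over the unobserved genotype.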

Measurement Error

• The likelihood formulation also allows us to deal with measurement error in the environmental variables

First Simulation

• MSE Efficiency of Profile method: 0.02 < pr(D=1) < 0.07

[Figure: MSE efficiency of the profile method (vertical axis from 0 to 4) for the effects of G, X, and G times X; series pr(G)=.05 and pr(G)=.20]

Israeli Ovarian Cancer Study

• Population based case-control study

• Study the interplay of BRCA1/2 mutations (G) and two known risk factors (E or X) of ovarian cancer:

• oral contraceptive (OC) use

• parity.

• Missing Data: Approximately 50% of the controls were not genotyped, and 10% of the cases


• Results reported in Modan et al., NEJM (2001).

• Their analysis involves

• Assumption that parity and OC use are independent of BRCA1/2 mutation status

• Simple but approximate methods for exploiting the G and E independence assumption (including case-only estimate of interaction)

• Risk model adjusted for Age, Race, Family History, History of Gynecological Surgery


• Disease risk model including same covariates as Modan et al (2001)

• In addition, we explicitly adjusted for the possibility of both G and E being related to S

• FH = family history (breast cancer = 1, ovarian or >= 2 breast cancer = 2)

$$
\mathrm{logit}\{\Pr(G=1 \mid S)\} = \beta_0 + \beta_{\mathrm{Ash}}\, I(\mathrm{Ash})
$$


• Question: Can carriers be protected via OC-use?

• The logarithm of the odds ratio for OC-use among carriers is the sum of

• The main effect for OC-use

• The interaction term between OC-use and being a carrier, i.e., interaction between gene and environment

• Note how this involves main effects and interactions
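The sum described above can be turned into an odds ratio and a Wald interval once the two coefficients and their covariance are available. A minimal sketch, in which the function name and every number are hypothetical (none are results from the study):

```python
import math

def carrier_odds_ratio(b_oc, b_int, var_oc, var_int, cov, z=1.96):
    """Odds ratio for OC use among carriers, exp(b_oc + b_int), with Wald CI.

    var(b_oc + b_int) = var_oc + var_int + 2 * cov, so the covariance
    between the main effect and the interaction must be included.
    """
    log_or = b_oc + b_int
    se = math.sqrt(var_oc + var_int + 2.0 * cov)
    ci = (math.exp(log_or - z * se), math.exp(log_or + z * se))
    return math.exp(log_or), ci

# Hypothetical estimates and (co)variances, for illustration only.
or_carriers, (lo, hi) = carrier_odds_ratio(
    b_oc=-0.05, b_int=0.10, var_oc=0.0004, var_int=0.0009, cov=-0.0002)
```

Because both a main effect and an interaction enter, this quantity needs the full model fit and cannot be formed from a case-only estimate of the interaction alone.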


• Question: Is there a carrier/OC interaction?

• The case-only method can answer only this question


• Interaction of OC and BRCA1/2:


• Main Effect of BRCA1/2:


• Odds ratio for OC use among carriers = 1.04 (0.98, 1.09)

• No evidence for protective effect

• Not available from case-only analysis

• The interval is ½ the length of the one from the usual analysis

Features of the Method

• Allows estimation of all parameters of logistic regression model and can be used to examine interaction in alternative scales

• Can be used to estimate OR for non-rare diseases

• Important for studying major genes such as BRCA1/2

Features of the Method

• Allows incorporation of external information on Pr(D=1)

• Unlike with logistic regression in case-control studies, this information improves efficiency of estimation

Colorectal Adenoma Study

• PLCO Study: 772 cases, 772 controls

• Three SNPs in the calcium-sensing receptor region

• HWE assumed

• Interest in the interaction of number of copies of one haplotype (GCG) and calcium intake from diet


• Method #1: Write down the prospective likelihood and apply missing data techniques

• A standard analysis

• If ignoring the case-control sampling scheme works for ordinary logistic regression, it should work for missing haplotype regression too, right?

• Wrong! Biased estimates and standard errors

• Method #2: Our method

Conclusions

• Standard case-control (choice-based) studies

• Specify a model for G given X, e.g., G-E independence in population after conditioning on strata

• No assumptions made about X (high dimensional)

• All parameters estimable, no rare-disease assumption

• Handle missing G data

• Large gains in efficiency versus usual method

• Large gains in efficiency for effects of environment given the gene

Conclusions

• Theoretical Methods:

• With real G and X independence, we used a profile likelihood method based on nonparametric maximum likelihood

• (Key insight) Equivalent to a device of pretending the study is a regular random sample subject to missing data

• (This allows) generalization to any parametric model for G given X.

Acknowledgment

• Two graduate students have worked on this project

Iryna Lobach, Yale

Christie Spinka, U of Missouri

Thanks!

http://stat.tamu.edu/~carroll
