seminario - testing gene-environment interaction in generalized … · breslow, n. and clayton, d,...
TRANSCRIPT
Seminario Escuela de Estadística UNALMED
SeminarioTesting gene-environment interaction in generalized linear mixed
models with family data
20 de noviembre de 2017
Seminario Introduction
Sección 1
Introduction
Seminario Introduction
Gene, Environment and Disease
Seminario Introduction
Gene-environment interaction
Seminario Introduction
Family data and kinship matrix
Seminario Introduction
Notation
Consider a sample of N independent families.ni : number of members in the ith family (i = 1, . . . ,N).Yij : reponse variable for the phenotype (discrete orcontinuous).Xij = (X 1
ij , . . . ,Xpij )T : p non-environmental covariates.
Gij = (G1ij , . . . ,G
qij )T : q observed genotypes at certain
targeted genetic marker loci.Eij : some environmental exposure factor of interest.Sij = (EijG1
ij , . . . ,EijGqij )T : GE interaction.
Seminario Introduction
Observed genotypes - Gij -
Each column in Gij is a Single Nucleotide Polymorphism (SNP).The genetic information is represented according to the followingcodification:
Gkij =
2, subject j at ith family is homozygous BB1, subject j at ith family is heterozygous Bb or bB0, subject j at ith family is homozygous bb,
with k = 1, . . . , q. B and b represent the dominant and recessivealleles, respectively. In addition, B is the allele that occurs at minorfrequency.
Seminario Introduction
Observed genotypes - Gij -
Seminario Model
Sección 2
Model
Seminario Model
Generalized linear mixed model (GLMM) for Subject j at ith family
g [E (Yij |αij)] = XTij β1 + Eijβ2 + GT
ij θ + STij γ + αij , (1)
Var(Yij |αij) = φω−1ij ν [E (Yij |αij)] ,
αi = (αi1, . . . , αini )T ∼ N(0, 2σ2Φi),where,
g(.): monotone known function.Yij |αij follows a distribution in the exponential family.ν(.): known function.φ: a scale parameter that may be known or may need to beestimated.ωij : known weights (commonly equal to 1).Φi : the kinship matrix and σ2 is a parameter to be estimated.
Seminario Model
Examples of link functions
Family Link g(·) Trait ν(·)Binomial Logit ln
(µ
1−µ
)Binary µ(1 − µ)
Gaussian Identity µ Continuous φGamma Inverse 1/µ Continuous φµ2
Inverse.gaussian Inverse squared 1/µ2 Continuous φµ3
Poisson Log ln(µ) Count µQuasi Identity µ Continuous φ
Quasibinomial Logit ln(
µ1−µ
)Binary φµ(1 − µ)
Quasipoisson Log ln(µ) Count φµ
Seminario Model
GLMM for ith Family
g(µb
i
)= Xiβ1 + Eiβ2 + Giθ + Siγ + Kibi , (2)
where,2Φi = KiKT
i (Cholesky decomposition).αi = Kibi , with bi ∼ N(0, σ2Ini ) and Ini : identity matrix.
µbi = E (Yi |bi).
g(µbi ) = (g(µb
i1), . . . , g(µbini ))
T .Xi = [Xi1 . . .Xini ]T .
Gi = [Gi1 . . .Gini ]T .Ei = (Ei1, . . . ,Eini )T .Si = [Si1 . . .Sini ]T .
Seminario Model
General GLMM
g(µb)
= Xβ + Gθ + Sγ + Kb, (3)
whereµb = E (Y |b).X = [X1 . . .XN ]T .G = [G1 . . .GN ]T .E = (E1, . . . ,EN)T
S = [S1 . . .SN ]T .
K = diag{K1 . . .KN}.b = [b1 . . .bN ]T .X = [X E ]T .β = (βT
1 , β2)T .
We are interested in testing the hypthotesis H0 : γ = 0.
Seminario Model
Important facts
If γ is treated as a fixed vector and the null hyphotesis is testedwith a q degrees of freedom score test, it can result in loss ofpower (Lin et. al., 2013).
Another common strategy is to use a single SNP at time to testGE interaction.Assume γ as a random vector following a multivariate normaldistribution N(0, τ Iq) and to test the equivalent null hypothesisH0 : τ = 0.
= Xβ + Gθ
Seminario Model
Important facts
If γ is treated as a fixed vector and the null hyphotesis is testedwith a q degrees of freedom score test, it can result in loss ofpower (Lin et. al., 2013).Another common strategy is to use a single SNP at time to testGE interaction.
Assume γ as a random vector following a multivariate normaldistribution N(0, τ Iq) and to test the equivalent null hypothesisH0 : τ = 0.
= Xβ + Gθ
Seminario Model
Important facts
If γ is treated as a fixed vector and the null hyphotesis is testedwith a q degrees of freedom score test, it can result in loss ofpower (Lin et. al., 2013).Another common strategy is to use a single SNP at time to testGE interaction.Assume γ as a random vector following a multivariate normaldistribution N(0, τ Iq) and to test the equivalent null hypothesisH0 : τ = 0.
= Xβ + Gθ
Seminario Model
Important facts
If γ is treated as a fixed vector and the null hyphotesis is testedwith a q degrees of freedom score test, it can result in loss ofpower (Lin et. al., 2013).Another common strategy is to use a single SNP at time to testGE interaction.Assume γ as a random vector following a multivariate normaldistribution N(0, τ Iq) and to test the equivalent null hypothesisH0 : τ = 0.
g(µb,γ
)= Xβ + Gθ + Sγ + Kb︸ ︷︷ ︸
random
Seminario Model
Important facts
If γ is treated as a fixed vector and the null hyphotesis is testedwith a q degrees of freedom score test, it can result in loss ofpower (Lin et. al., 2013).Another common strategy is to use a single SNP at time to testGE interaction.Assume γ as a random vector following a multivariate normaldistribution N(0, τ Iq) and to test the equivalent null hypothesisH0 : τ = 0.
g(µb)
= Xβ + Gθ+Sγ + Kb︸︷︷︸random
Seminario Model
Linkage disequilibrium (LD) -PPARG-
Seminario Model
Linkage disequilibrium (LD) -CDKAL1-
Seminario Model
Linkage disequilibrium (LD) -CDKAL1-
Seminario Model
Linkage disequilibrium (LD) -CDKAL1-
Seminario Model
Linkage disequilibrium (LD) -CDKAL1-
Seminario Model
Remedial considerations to face LD
Given the high correlation among the SNPs in G, Lin et. al.(2013) proposed (for independet subjects) to penalize the esti-mation of parameter θ by using ridge regression and introducinga penalization term λ in the quasi-likelihood function. The tun-ning parameter λ is selected using generalized cross validation(Fu, 2005).
Shen et. al. (2013) proposed for generalized linear models (GLM)that ridge regression is equivalent to assume the penalized pa-rameters as independent random variables.
Seminario Model
Remedial considerations to face LD
Given the high correlation among the SNPs in G, Lin et. al.(2013) proposed (for independet subjects) to penalize the esti-mation of parameter θ by using ridge regression and introducinga penalization term λ in the quasi-likelihood function. The tun-ning parameter λ is selected using generalized cross validation(Fu, 2005).Shen et. al. (2013) proposed for generalized linear models (GLM)that ridge regression is equivalent to assume the penalized pa-rameters as independent random variables.
Seminario Model
Remedial considerations to face LD
It is also equivalent for GLMM and assuming θ ∼ N(0, σ2θ Iq), it is
possible to show that λ = φ/σ2θ .
= Xβ +
Seminario Model
Remedial considerations to face LD
It is also equivalent for GLMM and assuming θ ∼ N(0, σ2θ Iq), it is
possible to show that λ = φ/σ2θ .
g(µb,γ,θ
)= Xβ + Gθ + Sγ + Kb︸ ︷︷ ︸
random
Seminario Model
Remedial considerations to face LD
It is also equivalent for GLMM and assuming θ ∼ N(0, σ2θ Iq), it is
possible to show that λ = φ/σ2θ .
g(µb,θ
)= Xβ + Gθ︸︷︷︸
random
+Sγ + Kb︸︷︷︸random
Seminario Null model estimation
Sección 3
Null model estimation
Seminario Null model estimation
Null model estimation
g(µd1,d2
)= Xβ + d1 + d2
with d1 = Gθ, d2 = Kb.
Breslow and Clayton (1993) proposed aFisher scoring solution that may be expressed as the iterative solutionto the system XT W X XT W XT W
W X φ
σ2θ
(GGT )−1 + W WW X W φ
σ2 (KKT )−1 + W
( βd1d2
)=
XT W YW YW Y
Y = Xβ + d1 + d2 + ε: working vector; ε ∼ N(0, φW−1);
W = diag{ωij/
[ν(µd
ij )g ′(µdij )2]}
.Ridge
Seminario Null model estimation
Null model estimation
g(µd1,d2
)= Xβ + d1 + d2
with d1 = Gθ, d2 = Kb. Breslow and Clayton (1993) proposed aFisher scoring solution that may be expressed as the iterative solutionto the system XT W X XT W XT W
W X φ
σ2θ
(GGT )−1 + W WW X W φ
σ2 (KKT )−1 + W
( βd1d2
)=
XT W YW YW Y
Y = Xβ + d1 + d2 + ε: working vector; ε ∼ N(0, φW−1);
W = diag{ωij/
[ν(µd
ij )g ′(µdij )2]}
.Ridge
Seminario Null model estimation
Null model estimation
β = (XT Σ−1X)−1XT Σ−1Y ,
(d1d2
)=
σ2θ
(GGT
)Σ−1(Y − Xβ)
σ2(KKT
)Σ−1(Y − Xβ)
,with Σ = σ2
θGGT + σ2KKT + φW−1.
Seminario GE interaction test
Sección 4
GE interaction test
Seminario GE interaction test
Testing GE interaction
Following the approach developed by Zhang and Lin (2003), wepropose as score statitistic the quadratic form
Uτ = Uτ(β, π
)= 1
2
{(Y − Xβ
)TΣ−1SST Σ−1
(Y − Xβ
)} ∣∣∣∣β.π
.
with π is the estimator of π = (σ2θ , σ
2, φ)T .
To correct for bias, weuse the restricted maximum likelihood (REML) estimators (Breslowand Clayton, 1993) in the GLMM framework to obtain β and πunder the null hypothesis.
Seminario GE interaction test
Testing GE interaction
Following the approach developed by Zhang and Lin (2003), wepropose as score statitistic the quadratic form
Uτ = Uτ(β, π
)= 1
2
{(Y − Xβ
)TΣ−1SST Σ−1
(Y − Xβ
)} ∣∣∣∣β.π
.
with π is the estimator of π = (σ2θ , σ
2, φ)T .To correct for bias, weuse the restricted maximum likelihood (REML) estimators (Breslowand Clayton, 1993) in the GLMM framework to obtain β and πunder the null hypothesis.
Seminario GE interaction test
Testing GE interaction
Zhang and Lin (2003) showed that under H0 : τ = 0, Uτ followsapproximately a mixture of one degree of freedom independent chi-square distributions. However, for computational ease, we use theSatterthwaite method (Satterthwaite, 1941) to approximate the dis-tribution of Uτ by a scaled chi-square distribution κχ2
ξ .
When REML estimates are used to calculate Uτ showed that themean and variance of Uτ can be approximated, respectively, by
tr(PSST
) ∣∣∣∣π
and Iτ = 12{tr(PSST PSST )− JT M−1J
} ∣∣∣∣β,π
Seminario GE interaction test
Testing GE interaction
Zhang and Lin (2003) showed that under H0 : τ = 0, Uτ followsapproximately a mixture of one degree of freedom independent chi-square distributions. However, for computational ease, we use theSatterthwaite method (Satterthwaite, 1941) to approximate the dis-tribution of Uτ by a scaled chi-square distribution κχ2
ξ .When REML estimates are used to calculate Uτ showed that themean and variance of Uτ can be approximated, respectively, by
tr(PSST
) ∣∣∣∣π
and Iτ = 12{tr(PSST PSST )− JT M−1J
} ∣∣∣∣β,π
Seminario GE interaction test
Testing GE interaction
J =(
tr[PSST PGGT ] tr[PSST PKKT ] tr[PSST PW−1])T
M =
tr[PGGT PGGT ] tr[PGGT PKKT ] tr[PGGT PW−1]
tr[PKKT PGT G] tr[PKKT PKKT ] tr[PKKT PW−1]
tr[PW−1PGT G] tr[PW−1PKKT ] tr[PW−1PW−1]
,
and P = Σ−1 −Σ−1X(XT Σ−1X
)−1XT Σ−1.
Seminario GE interaction test
Testing GE interaction
Since the mean and variance of κχ2ξ are given by κξ and 2κ2ξ,
respectively, we obtain the equations tr(PSST ) = κξ and Iτ =2κ2ξ, where P denotes the matrix P evaluated in π. By solvingthese equations, we demonstrate that
κ = Iτ/[2 tr(PSST )]
andξ = 2
[tr(PSST )
]2/Iτ .
Uτκ ∼ χ
2ξ
Seminario GE interaction test
Testing GE interaction
Since the mean and variance of κχ2ξ are given by κξ and 2κ2ξ,
respectively, we obtain the equations tr(PSST ) = κξ and Iτ =2κ2ξ, where P denotes the matrix P evaluated in π. By solvingthese equations, we demonstrate that
κ = Iτ/[2 tr(PSST )]
andξ = 2
[tr(PSST )
]2/Iτ .
Uτκ ∼ χ
2ξ
Seminario Simulations
Sección 5
Simulations
Seminario Simulations
Simulated model
Eij = 2 + 0,01Ageij + 0,1I(Femaleij) + γi + εij ;
logit [P (Yij = 1|Ageij ,Femaleij ,Eij ,G1ij ,G2ij)]= 0,1 + 0,01Ageij + 0,1I(Femaleij) + 0,1Eij + 0,3G1ij
+0,3G2ij + γ1(G1ij × Eij) + γ2(G2ij × Eij) + αij
withI(·) is theindicatorfunction;
εi = (εi1, . . . , εi10)T ∼ N(0, 4I10);αi = (αi1, . . . , αi10)T ∼ N(0, 2Φi);γi ∼ N(0, 4)
Φi is the kinship matrix corresponding to the aforementionedfamily pedigree .
Seminario Simulations
Correlation of the 50 simulated SNPs in LD
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
123456789
1011121314151617181920212223242526272829303132333435363738394041424344454647484950
Seminario Simulations
Type I error
We first compared the empirical type I error of the different methodsat 0.05 α-level. To evaluate type I error, we set γ1 = γ2 = 0 andvaried the number of SNPs q in the gene. The SNPs were eitherindependent or in LD.
SNPs category q σ2 σ2θ λ = 1/σ2
θ Score MinP VCT
Independent5 1.247 0.034 29.4 0.020 0.031 0.034
10 1.240 0.017 58.8 0.023 0.026 0.02450 1.222 0.003 333 0.004 0.022 0.014
LD5 1.243 0.021 47.6 0.025 0.026 0.031
10 1.239 0.009 111 0.004 0.034 0.03050 1.222 0.002 500 0.000 0.028 0.022
Seminario Simulations
Type I error
We first compared the empirical type I error of the different methodsat 0.05 α-level. To evaluate type I error, we set γ1 = γ2 = 0 andvaried the number of SNPs q in the gene. The SNPs were eitherindependent or in LD.
SNPs category q σ2 σ2θ λ = 1/σ2
θ Score MinP VCT
Independent5 1.247 0.034 29.4 0.020 0.031 0.034
10 1.240 0.017 58.8 0.023 0.026 0.02450 1.222 0.003 333 0.004 0.022 0.014
LD5 1.243 0.021 47.6 0.025 0.026 0.031
10 1.239 0.009 111 0.004 0.034 0.03050 1.222 0.002 500 0.000 0.028 0.022
Seminario Simulations
Empirical power for independent SNPs
0.00 0.02 0.04 0.06 0.08 0.10
0.0
0.2
0.4
0.6
0.8
1.0
q = 5
γ1 = γ2
Em
piric
al P
ower
ScoreMinPVCT
0.00 0.02 0.04 0.06 0.08 0.10
0.0
0.2
0.4
0.6
0.8
1.0
q = 10
γ1 = γ2
0.00 0.02 0.04 0.06 0.08 0.10
0.0
0.2
0.4
0.6
0.8
1.0
q = 50
γ1 = γ2
Seminario Simulations
Empirical power for dependent SNPs
0.00 0.02 0.04 0.06 0.08 0.10
0.0
0.2
0.4
0.6
0.8
1.0
q = 5
γ1 = γ2
Em
piric
al P
ower
ScoreMinPVCT
0.00 0.02 0.04 0.06 0.08 0.10
0.0
0.2
0.4
0.6
0.8
1.0
q = 10
γ1 = γ2
0.00 0.02 0.04 0.06 0.08 0.10
0.0
0.2
0.4
0.6
0.8
1.0
q = 50
γ1 = γ2
Seminario Application
Sección 6
Application
Seminario Application
Baependi Heart Study
Seminario Application
Variables
Phenotype: Type II diabetesEnvironmental variable: Body Mass Index (BMI)Genotype: Were considered the following three genetic regions:
Peroxisome-Proliferator-Activated Receptors gamma (PPARG)with 16 variants genotyped;Fat Mass and Obesity associated protein (FTO) with 149 va-riants genotyped;Cyclin-dependent kinase 5 regulatory subunit associated protein1-like 1 (CDKAL1) with 186 variants genotyped.
Covariates: Age, sex, and the first two principal components ofthe entire genotype data of Baependi.
Seminario Application
Summary of cases per subjects and families
Gene Subjects FamiliesControl Cases Total Control Cases Total
PPARG 845 83 928 43 42 85FTO 712 71 783 47 38 85
CDKAL1 661 69 730 47 38 85
bits.3.20GHz with a RAM memory 8.00 GB and operating system 64-R version 3.3.1 and a processor Intel(R) Core(TM) i5-6500 CPU @
Seminario Application
Summary of cases per subjects and families
Gene Subjects FamiliesControl Cases Total Control Cases Total
PPARG 845 83 928 43 42 85FTO 712 71 783 47 38 85
CDKAL1 661 69 730 47 38 85
bits.3.20GHz with a RAM memory 8.00 GB and operating system 64-R version 3.3.1 and a processor Intel(R) Core(TM) i5-6500 CPU @
Seminario Application
Sample size, GLMM parameters, p-values and execution times
Seminario References
Sección 7
References
Seminario References
References
Breslow, N. and Clayton, D, 1993. Approximate inference ingeneralized linear mixed models. J. Am. Stat. Assoc., 88, 9-25.Coombes, B., Basu, S. and Mcgue, M. , 2017. A combinationtest for detection of gene-environment interaction in cohort stu-dies. Genet. Epidemiol., 41, 396-412.Chen, H. and Conomos, M., 2016. GMMAT: Generalized Li-near Mixed Model Association Tests; R Package Version 0.7.Available online:https://content.sph.harvard.edu/xlin/dat/GMMAT_user_manual_v0.7.pdf(accessed on 16 January 2017).Lin, X., Lee, S., Chistiani, D. and Lin, X., 2013. Test for in-teractions between a genetic marker set and environment ingeneralized linear models. Biostatistics, 14, 667-681.
Seminario References
References
Lin, X., Lee, S., Wu, M., Wang, C., Chen, H., Li, Z. and Lin,X., 2016. Test for rare variants by environment interactions insequencing association studies. Biometrics, 72, 156-164.Lin, X, 1997. Variance component testing in generalised linearmodels with random effects. Biometrika, 84, 309-326.Oliveira, C., Pereira, A., de Andrade, M., Soler, J. and Krieger,J., 2008. Heritability of cardiovascular risk factors in a brazilianpopulation: Baependi heart study. BMC Med. Genet., 9, 32.Shen, X., Alam, M., Fikse, F. and Ronnegard, L., 2013. A novelgeneralized ridge regression method for quantitative genetics.Genetics, 193, 1255-1268.Zhang, D. and Lin, X, 2003. Hyphotesis testing in semiparame-tric additive mixed models. Biostatistics 2003, 4, 7-74.
Seminario References
References
Satterthwaite, F., 1941. Synthesis of variance. Psychometrika,6, 309-316.Fu, W.J., 2005.Nonlinear GCV and quasi-GCV for shrinkagemodels. Journal of Statistical Planning and Inference, 131, 333-347.Mazo, M. A., Coombes, B. and de Andrade, M., 2017. AnEfficient Test for Gene-Environment Interaction in GeneralizedLinear Mixed Models with Family Data. International JournalOf Environmental Research And Public Health, 14, 1134-1146.
Seminario References
References
Chen, H., Wang, C., Conomos, M., Stilp, A., Li, Z., Sofer,T., Szpiro, A., Chen, W., Brehm, J., Celedon, J., Redline, S.,Papanicolaou, G., Thornton, T., Laurie, C., Rice, K. and Lin,X., 2016. Control for population structure and relatedness forbinary traits in genetic association studies via logistic mixedmodels. The American journal of human genetics, 98, 653-666.Leal, S., Yan, K. and Muller-Myhsokb, B, 2005. SimPed: ASimulation Program to Generate Haplotype and Genotype 308Data for Pedigree Structures. Hum Hered., 119-122.
Seminario References
Family data and kinship matrix
Return
Seminario References
Ridge regression estimation
g(µd2) = Xβ + Gθ + d2 (d2 = Kb)
ql(β, θ, φR , σR) = −12 log
∣∣∣∣σ2RφR
(KKT )WR + In
∣∣∣∣+
N∑i=1
ni∑j=1
qlij(β, θ, φR ; d2) − 12 dT
2(σ2
RKKT)−1 d2,
where d2 is choosen to maximize the sum of the last two terms,WR = diag
{ωij/
[ν(µd2
ij )g ′(µd2ij )2
]}and
qlij(β,θ, φR ; d2) =∫ µ
d2ij
Yij
ωij(Yij − µ)φRν(µ) dµ.
Seminario References
Ridge regression estimation
g(µd2) = Xβ + Gθ + d2 (d2 = Kb)
ql(β, θ, φR , σR) = −12 log
∣∣∣∣σ2RφR
(KKT )WR + In
∣∣∣∣+
N∑i=1
ni∑j=1
qlij(β, θ, φR ; d2) − 12 dT
2(σ2
RKKT)−1 d2,
where d2 is choosen to maximize the sum of the last two terms,WR = diag
{ωij/
[ν(µd2
ij )g ′(µd2ij )2
]}and
qlij(β,θ, φR ; d2) =∫ µ
d2ij
Yij
ωij(Yij − µ)φRν(µ) dµ.
Seminario References
Ridge regression estimation
Ridge regression estimator of θ, is obtained by minimazing the fun-ction
[ql(β,θ, φR , σR)× φR ]− 12λθ
Tθ.
where λ is a penalizing factor.
XT WRX XT WR XT WRWRX WR + λ(GGT )−1 WR
WRX WR WR + φRσ2
R(KKT )−1
( βd1d2
)=
XT WRYWRYWRY
where d1 = Gθ. Return