fei zou department of biostatistics email: [email protected]
DESCRIPTION
Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mappping. Fei Zou Department of Biostatistics Email: [email protected]. Outline. Introduction Experimental crosses Existing QTL Mapping Methods Bayesian semi-parametric QTL Mapping Results - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/1.jpg)
Bayesian Variable Selection in Semiparametric Regression
Modeling with Applications to Genetic Mappping
Fei ZouDepartment of BiostatisticsEmail: [email protected]
![Page 2: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/2.jpg)
Outline
• Introduction– Experimental crosses– Existing QTL Mapping Methods
• Bayesian semi-parametric QTL Mapping• Results• Remarks and Conclusions
![Page 3: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/3.jpg)
http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt
![Page 4: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/4.jpg)
Overview
• One gene one trait: very unlikely• The vast majority of biological traits are
caused by complex polygenes– Potentially interacting with each other
• Most traits have significant environmental exposure components– Potentially interacting with polygenes
![Page 5: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/5.jpg)
ParentsP1
P2
Experimental Crosses: F2
![Page 6: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/6.jpg)
Experimental Crosses
• F2 Backcross(BC)
F1 F1
P1 P2
F2: BC:
P1 P2
P1 F1
AA BB
AB AB
BB AB AB
AA BB
AA AB
AB AA AB
![Page 7: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/7.jpg)
QTL Data Format
0: homozygous AA, 2: homozygous BB, 1: heterozygote AB.Marker positions:
![Page 8: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/8.jpg)
Linkage Analysis
• Data structure: – Marker data (genotypes plus positions)– Phenotypic trait(s)– Other nongenetic covariates, such as age,
gender, environmental conditions etc• Quantitative trait loci (QTL): a particular
region of the genome containing one or more genes that are associated with the trait being assayed or measured
![Page 9: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/9.jpg)
QTL Mapping of Experimental Crosses
• Single QTL Mapping• Single marker analysis (Sax, 1923 Genetics)• Interval mapping: Lander & Botstein (1989,
Genetics)• Multiple QTL mapping
• Composite interval mapping (Zeng 1993 PNAS, 1994 Genetics; Jansen & Stam, 1994 Genetics)
• Multiple interval mapping (Kao et al., 1999 Genetics)
• Bayesian analysis (Satagopan et al., 1997 Genetics)
![Page 10: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/10.jpg)
Single QTL Interval Mapping
• For backcross, the model assumes
• QTL analysis:
• If QTL genotypes are observed, the analysis is trivial: simple t-test!
• However, QTL position is unknown and therefore QTL genotypes are unobserved
2(QTL genotype is AA) ~ ( , )i AAy N 2(QTL genotype is Aa) ~ ( , )i Aay N
0 : vs :AA Aa A AA AaH H
![Page 11: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/11.jpg)
Interval Mapping
• For QTL between markers– QTL genotypes missing: can use marker genotypes to
infer the conditional probabilities of the QTL genotypes for a given QTL position
– Profile likelihood (LOD score) calculated across the whole genome or candidate regions using EM algorithm
– In any region where the profile exceeds a (genome-wide) significance threshold, a QTL declared at the position with the highest LOD score.
![Page 12: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/12.jpg)
0
2
4
6
8
Chromosome
lod
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16171819 X
Profile LOD
![Page 13: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/13.jpg)
Multiple QTL Mapping
• Most complicated traits are caused by multiple (potentially interacting) genes, which also interact with environment stimuli – Single QTL interval mapping
• Ghost QTL (Lander & Botstein 1989)• Low power
![Page 14: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/14.jpg)
Multiple QTL Mapping• Composite interval mapping (Zeng 1993, 1994;
Jansen & Stam1993): searching for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust the effects of other QTLs outside the region
• which background markers to include; window size etc
• Multiple interval mapping (Kao et al 1999): fitting multiple QTLs simultaneously
• Computationally intensive; how many QTLs to include?
![Page 15: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/15.jpg)
Multiple QTL Mapping• Bayesian methods (Stephens and Fisch 1998
Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetic Research, and Yi et al. 2003 Genetics): treat the number of QTLs as a parameter by using reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika)
• change of dimensionality, the acceptance probability for such dimension change, which in practice, may not be handled correctly (Ven 2004 Genetics)
![Page 16: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/16.jpg)
Multiple QTL Mapping• Alternative, multiple QTL mapping can be
viewed as a variable selection problem– Forward and step-wise selection procedures (Broman
and Speed 2002 JRSSB) – LASSO, etc– Bayesian QTL mapping
• Xu (2003 Genetics), Wang et al (2005 Genetics) Huang et al (2007 Genetics): Bayesian shrinkage
• Yi et al (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA)
• Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat)
• Software: R/qtlbim by Yi’s group
![Page 17: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/17.jpg)
Multiple QTL Mapping
• Limitations of existing QTL mapping methods – do not model covariates at all or only
model covariate effect linearly – do not model interactions at all or
model only lower order interactions, such as two way interactions
![Page 18: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/18.jpg)
• The multiple QTL mapping is a very large variable selection problem: for p potential genes, with p being in the hundreds or thousands, there are possible main effect models, possible two-way interactions and possible higher order (k > 2) interactions.
2 p 22p
2pk
![Page 19: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/19.jpg)
Semiparmetric Multiple (Potentially Interacting) QTL Mapping
• Goal: map multiple potentially interacting QTLs without specifically model all potential main and higher order interaction effects
• Semiparametric model:
where function is unspecified, QTL genotypes and represent all non- genetics factors/covariates.
• When equals : non-explicitly modeling the two way interaction between genes 1 and 2 and the gene-environmental interaction between gene 3 and covariate 1.
21 2 1( , ,... , ,..., ) , 1, , with (0, )~
iid
i i i ip i iq i iy x x x t t e i n e N
1 2, ,...i i ipx x x1,...,i iqt t
1 2 3 1i i i ix x x t
![Page 20: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/20.jpg)
Bayesian Semi/non-parametric Methods
– Dirichlet process (Muller et al. 1996)– Splines (Smith and Kohn 1996; Denison et al. 1998
and DiMatteo et al. 2001)– Wavelets (Abramovich et al. 1998 JRSSB)– Kernel models (Liang et al 2007) – Gaussian process (Neal 1997; 1996)
• Gaussian process priors have a large support in the space of all smooth functions through an appropriate choice of covariance kernel.
• Gaussian process is flexible for curve estimation because of their flexible sample path shapes
• Gaussian process related to smoothing spline somehow (Wahba 1978 JRSSB)
![Page 21: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/21.jpg)
Prior Specification on
• A Gaussian process such that all possible finite dimensional distributions follow multivariate normal with mean 0 and covariance function
where , s and s are hyperparameters and
1( ,..., )Tn
2 2' ' '
1 1
1cov( , ) exp[ ( ) ( ) ]p q
i i xk ik i k tj ij i jk j
x x t t
xk
1 1( ,..., , ,..., )i i ip i iqx x t t tj
![Page 22: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/22.jpg)
• Hyperparameter defines the vertical scale of variations, i.e., controls the magnitude of the exponential part. Hyperparameters related to length scales which characterize the distance in that particular direction over which y is expected to vary significantly– controls the smoothness of : when
the posterior mean of almost interpolates the data while centered around the prior mean function if
– When = 0, y is expected to be an essentially constant function of that input variable xj, which is therefore deemed irrelevant (Mackay 1998).
( )xk tj 1/ xk
xk
0
![Page 23: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/23.jpg)
Priors on • The original papers on the Gaussian process
(Mackay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on the parameters. However, does provide information about the relevance of any QTL with value near zero indicating an irrelevant QTL.
• For variable selection purpose, we can impose the following Gamma mixture priors on
1/xk xk
s
![Page 24: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/24.jpg)
Prior Specifications
• Inverse Gamma distributions are used for the priors of and . 2
![Page 25: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/25.jpg)
Simulations
• Set ups:– backcross population – 200 or 500 individuals– 151 evenly spaced markers at 5cM intervals– Four QTLs with varying heritabilities:
• Main effect model: all four QTL act additively• Main plus two way interactions• Four way interactions only
![Page 26: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/26.jpg)
![Page 27: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/27.jpg)
![Page 28: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/28.jpg)
![Page 29: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/29.jpg)
![Page 30: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/30.jpg)
![Page 31: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/31.jpg)
![Page 32: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/32.jpg)
n=500 and pure 4 way-interaction model
![Page 33: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/33.jpg)
n=500 and pure additive model
![Page 34: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/34.jpg)
Real Data Analysis
• A mouse study– # samples: 187 backcross samples– # markers: 85 with average marker distance
20 cM – Phenotypes: inguinal, gonadal, retroperitoneal
and mesenteric fat pad weights
![Page 35: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/35.jpg)
![Page 36: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/36.jpg)
![Page 37: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/37.jpg)
Remarks
• For studies with large # of samples and/or large # of markers, MCMC converges very slowly– We employed the hybrid Monte Carlo method, which
merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation.
– We also estimated the maximum a posteriori (MAP) via conjugate gradient method (Hestenes et al 1952 J. Research of National Bureau of Standards)
• point estimate
![Page 38: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/38.jpg)
Real Study: Cardiovascular Disease
• 2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease
• 820 individuals• Non-genetic covariates: gender, smoking
status, age
![Page 39: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/39.jpg)
![Page 40: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/40.jpg)
Remarks
• Semiparemetric mapping is powerful in mapping multiple (potentially interacting with higher orders) QTL – Picks up genes related to the trait regardless
of their marginal main effects or joint epistasis effects
– Cannot readily differentiates genetic contributions
• main effect? interaction? or both?• Fine tuned parametric model with selected genes
![Page 41: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/41.jpg)
Remarks and Future Research• How to extend the methodologies to human
genome-wide association (GWA) studies, where hundreds of thousands of markers are available– Is it possible?– potential solutions: pathway analysis; data reduction
techniques• How to extend the method to human pedigree
analysis where mixed effect model is used for correlated family members?– Use inheritance vector: so far results are very
promising
![Page 42: Fei Zou Department of Biostatistics Email: fzou@bios.unc](https://reader035.vdocuments.us/reader035/viewer/2022062501/56816065550346895dcf901c/html5/thumbnails/42.jpg)
Acknowledgement
• Joint work with – Hanwen Huang – Haibo Zhou– Fuxia Cheng– Ina Hoeschele
• Funding support– NIH R01 GM074175