association analysis of rare genetic variants
1
Association Analysis of Rare Genetic Variants
Qunyuan Zhang
Division of Statistical Genomics
Course M21-621 Computational Statistical Genetics
2
Rare Variants
Low allele frequency: usually less than 1%
Low power: for most analyses, due to less variation of observations
High false positive rate: for some model-based analyses, due to sparse distribution of data, unstable/biased parameter estimation and inflated p-value.
3
An Example of Low Power
Jonathan C. Cohen, et al. Science 305, 869 (2004)
An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished)
[Figure: four Q-Q plot panels: N=~2500, MAF>0.03; N=~2500, MAF<0.03; N=~2500, MAF<0.03, permuted; N=50000, MAF<0.03, bootstrapped]
5
Three Levels of Rare Variant Data
Level 1: Individual-level
Level 2: Summarized over subjects
Level 3: Summarized over both subjects and variants
6
Level 1: Individual-level
Subject V1 V2 V3 V4 Trait-1 Trait-2
1 1 0 0 0 90.1 1
2 0 1 0 . 99.2 1
3 0 0 0 0 105.9 0
4 0 0 0 0 89.5 0
5 0 . 0 0 97.6 0
6 0 0 0 0 110.5 0
7 0 0 1 0 88.8 0
8 0 0 0 1 95.4 1
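How the individual-level table relates to the two summarized levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0/1 entries are treated as per-subject variant counts, "." is treated as a missing genotype, Trait-2 is used as the grouping variable, and the function names are hypothetical.

```python
# Level 1 rows from the slide: (subject, [V1..V4], Trait-1, Trait-2).
# None stands for the "." (missing genotype) entries.
LEVEL1 = [
    (1, [1, 0, 0, 0], 90.1, 1),
    (2, [0, 1, 0, None], 99.2, 1),
    (3, [0, 0, 0, 0], 105.9, 0),
    (4, [0, 0, 0, 0], 89.5, 0),
    (5, [0, None, 0, 0], 97.6, 0),
    (6, [0, 0, 0, 0], 110.5, 0),
    (7, [0, 0, 1, 0], 88.8, 0),
    (8, [0, 0, 0, 1], 95.4, 1),
]


def summarize_level2(rows, n_variants=4):
    """Level 2: per group and per variant, [variant count, non-missing subjects]."""
    level2 = {}
    for _subject, genotypes, _trait1, group in rows:
        counts = level2.setdefault(group, [[0, 0] for _ in range(n_variants)])
        for i, g in enumerate(genotypes):
            if g is None:  # skip missing genotypes, so sample size can differ by variant
                continue
            counts[i][0] += g
            counts[i][1] += 1
    return level2


def summarize_level3(level2):
    """Level 3: per group, (total variant count, total non-missing genotypes)."""
    return {
        group: (sum(c[0] for c in counts), sum(c[1] for c in counts))
        for group, counts in level2.items()
    }


level2 = summarize_level2(LEVEL1)
level3 = summarize_level3(level2)
print(level2)
print(level3)
```

Because missing genotypes are skipped per variant, the Level 2 summary naturally carries a different sample size for each variant.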
7
Level 2: Summarized over subjects (by group)
Jonathan C. Cohen, et al. Science 305, 869 (2004)
Level 3: Summarized over subjects (by group) and variants (usually by gene)
                 Variant allele number   Reference allele number   Total
Low-HDL group    20                      236                       256
High-HDL group   2                       254                       256
Total            22                      490                       512
9
Methods For Level 3 Data
10
Single-variant Test vs. Total Frequency Test (TFT)
Jonathan C. Cohen, et al. Science 305, 869 (2004)
11
What we have learned …
Single-variant tests of rare variants have very low power for detecting association, due to extremely low allele frequency (usually < 0.01)
Testing the collective effect of a set of rare variants may increase the power (sum test, collective test, group test, collapsing test, burden test…)
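A total frequency test on the Level 3 counts (20 variant and 236 reference alleles in the low-HDL group, 2 and 254 in the high-HDL group) could be sketched as below. The slides do not name the test applied to the pooled 2×2 table, so Fisher's exact test is an assumption here, and the function names are hypothetical.

```python
import math


def table_probability(x, row1, row2, col1):
    """Hypergeometric probability that the top-left cell equals x (margins fixed)."""
    return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(row1 + row2, col1)


def total_frequency_test(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, row1, row2, col1)
    p_value = 0.0
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1, row2, col1)
        if p <= p_observed * (1 + 1e-9):
            p_value += p
    return p_value


# Rows: low-HDL group, high-HDL group; columns: variant alleles, reference alleles.
p_tft = total_frequency_test(20, 236, 2, 254)
print(f"TFT p-value: {p_tft:.2e}")
```

Pooling all variants into one table is what gives the test its power here; any single variant contributes only a handful of the 22 variant alleles.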
12
Methods For Level 2 Data
Allowing different sample sizes for different variants
Different variants can be weighted differently
13
CAST: A cohort allelic sums test
Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56
Under H0: S(cases)/2N(cases) − S(controls)/2N(controls) = 0
S: variant number; N: sample size
T = S(cases) − S(controls) × N(cases)/N(controls) = S(cases) − S*(controls)
(S can be calculated variant by variant and can be weighted differently; the final T = sum(WiSi))
Z = T/SQRT(Var(T)) ~ N(0,1)
Var(T) = Var(S(cases) − S*(controls)) = Var(S(cases)) + Var(S*(controls)) = Var(S(cases)) + Var(S(controls)) × [N(cases)/N(controls)]^2
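The CAST statistic can be worked through numerically. The sketch below follows the T, Var(T), and Z formulas above, but the slide does not say how Var(S) is estimated; treating each S as a Poisson count (so Var(S) = S) is an assumption made here for illustration. The example counts are the Level 3 totals (20 vs. 2 variant alleles, with 256 alleles, i.e. N = 128 subjects, per group), and the function name is hypothetical.

```python
import math


def cast_z(s_cases, s_controls, n_cases, n_controls):
    """Z statistic for CAST, assuming Var(S) = S (Poisson counts)."""
    ratio = n_cases / n_controls
    # T = S(cases) - S*(controls), with S* rescaled to the case sample size.
    t = s_cases - s_controls * ratio
    # Var(T) = Var(S(cases)) + Var(S(controls)) * ratio^2, with Var(S) = S assumed.
    var_t = s_cases + s_controls * ratio ** 2
    return t / math.sqrt(var_t)


z = cast_z(s_cases=20, s_controls=2, n_cases=128, n_controls=128)
# Two-sided p-value from the standard normal distribution.
p = math.erfc(abs(z) / math.sqrt(2))
print(f"Z = {z:.3f}, two-sided p = {p:.2e}")
```

With equal group sizes the rescaling factor is 1, so T is simply the difference in variant counts.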
14
C-alpha
PLOS Genetics, 2011 | Volume 7 | Issue 3 | e1001322
Effect direction problem
15
C-alpha
QQ Plots of Existing Methods (under the null)
[Figure: Q-Q plot panels for CAST, C-alpha, EFT, and TFT]
• EFT and C-alpha: inflated with false positives
• TFT and CAST: no inflation, but assuming a single effect direction
• Objective: more general, powerful methods …
17
More Generalized Methods For Level 2 Data
Structure of Level 2 data
[Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]
Strategy
Instead of testing total frequency/number, we test the randomness of all tables.
Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
$P_i = \binom{n_{i1}}{a_{i1}} \binom{n_{i2}}{a_{i2}} \Big/ \binom{N_i}{A_i}$
2. Calculating the log-transformed joint probability (L) for all k tables
$L = \sum_{i=1}^{k} \log(P_i)$
3. Enumerating all possible tables and L scores
4. Calculating the p-value, P = Prob.( )
ASHG Meeting 1212, Zhang
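The four EPT steps can be sketched for a small number of variants. This is a rough illustration, not the author's implementation. It assumes each variant's table is described by variant counts a1, a2 out of n1, n2 alleles in two groups. It also assumes the p-value in step 4 is the total probability of all table sets whose L score is no larger than the observed one; the transcript leaves that expression blank. Full enumeration grows quickly with k, so this only suits toy inputs.

```python
import itertools
import math


def table_prob(a1, n1, a2, n2):
    """Step 1: hypergeometric probability of one table with fixed margins."""
    return math.comb(n1, a1) * math.comb(n2, a2) / math.comb(n1 + n2, a1 + a2)


def ept_p_value(tables):
    """tables: list of (a1, n1, a2, n2), one tuple per variant."""
    # Step 2: observed log joint probability over all k tables.
    l_observed = sum(math.log(table_prob(*t)) for t in tables)

    # Step 3: for each variant, list the probability of every table that keeps
    # the same margins (group sizes and total variant count).
    per_variant = []
    for a1, n1, a2, n2 in tables:
        total = a1 + a2
        probs = [
            table_prob(x, n1, total - x, n2)
            for x in range(max(0, total - n2), min(n1, total) + 1)
        ]
        per_variant.append(probs)

    # Step 4 (assumed form): add up the joint probability of every combination
    # of tables whose L score does not exceed the observed L.
    p_value = 0.0
    for combo in itertools.product(*per_variant):
        l_score = sum(math.log(p) for p in combo)
        if l_score <= l_observed + 1e-12:
            p_value += math.prod(combo)
    return p_value


# Toy example: variant 1 seen only in group 1, variant 2 only in group 2.
p_ept = ept_p_value([(3, 20, 0, 20), (0, 20, 2, 20)])
print(f"EPT p-value: {p_ept:.4f}")
```

Because the score depends only on how unlikely each table is, variants enriched in either group contribute in the same way, which is how the test sidesteps the effect-direction problem.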
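Likelihood Ratio Test (LRT)
$LR = 2\log\frac{\Pr(a_1, b_1, a_2, b_2, \ldots : H_A)}{\Pr(a_1, b_1, a_2, b_2, \ldots : H_0)} \sim \chi^2_{df=k}$
Probabilities are computed from the binomial distribution (product over variants i = 1, …, k)
ASHG Meeting 1212, Zhang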
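One plausible reading of the LRT can be sketched as follows; the slide's full formula did not survive transcription, so the details are assumptions. Here a_i and b_i are taken to be the variant counts of variant i in two groups with n1_i and n2_i alleles. H0 gives each variant one shared frequency, HA gives each group its own frequency, and the statistic is compared with a chi-square distribution with k degrees of freedom. The chi-square tail uses the standard recurrence for integer degrees of freedom.

```python
import math


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(count, total, freq):
    """Binomial log-likelihood without the constant binomial coefficient."""
    return xlogy(count, freq) + xlogy(total - count, 1.0 - freq)


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # df = 1
    while 2 * a < df:                          # Q(a+1, z) = Q(a, z) + z^a e^-z / Gamma(a+1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt(tables):
    """tables: list of (a, n1, b, n2) per variant. Returns (LR, p-value)."""
    lr = 0.0
    for a, n1, b, n2 in tables:
        pooled = (a + b) / (n1 + n2)
        ll_null = binom_loglik(a, n1, pooled) + binom_loglik(b, n2, pooled)
        ll_alt = binom_loglik(a, n1, a / n1) + binom_loglik(b, n2, b / n2)
        lr += 2.0 * (ll_alt - ll_null)
    return lr, chi2_sf(lr, len(tables))


lr_stat, lr_p = lrt([(5, 200, 0, 200), (0, 200, 4, 200)])
print(f"LR = {lr_stat:.3f}, p = {lr_p:.4f}")
```

The chi-square approximation is asymptotic, so with very sparse counts its p-values should be read with caution.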
Q-Q Plots of EPT and LRT (under the null)
[Figure: four Q-Q plot panels: EPT, N=500; EPT, N=3000; LRT, N=500; LRT, N=3000]
Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 80%, neutral 20%, negative causal 0%
[Figure: power vs. sample size]
Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 60%, neutral 20%, negative causal 20%
[Figure: power vs. sample size]
Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 40%, neutral 20%, negative causal 40%
[Figure: power vs. sample size]
25
Methods For Level 1 Data
• Including covariates
• Extended to quantitative traits
• Better control for population structure
• More sophisticated models
26
Collapsing (C) test
Step 1
Step 2
logit(y) = a + b*X + e (logistic regression)
Li and Leal, The American Journal of Human Genetics 2008 (83): 311–321
27
Variant Collapsing
(+) (+) (.) (.)
Subject V1 V2 V3 V4 Collapsed Trait
1 1 0 0 0 1 1
2 0 1 0 0 1 1
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
7 0 0 1 0 1 0
8 0 0 0 1 1 1
28
WSS
29
WSS
30
WSS
31
Weighted Sum Test
$s = \sum_{i=1}^{m} w_i g_i$
Collapsing test (Li & Leal, 2008): wi = 1, and s = 1 if s > 1
Weighted-sum test (Madsen & Browning, 2009): wi calculated based on allele frequency in the control group
aSum, adaptive sum test (Han & Pan, 2010): wi = -1 if b < 0 and p < 0.1, otherwise wi = 1
KBAC (Liu and Leal, 2010): wi = left-tail p-value
RBT (Ionita-Laza et al., 2011): wi = log-scaled probability
PWST, p-value weighted sum test (Zhang et al., 2011): wi = rescaled left-tail p-value, incorporating both significance and direction
EREC (Lin et al., 2011): wi = estimated effect size
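The methods listed above share one score, the weighted sum of genotypes, and differ only in how the weights are chosen. A minimal sketch of that shared score follows, with the collapsing rule (wi = 1, s capped at 1) as a special case; the function name and the example genotypes are hypothetical.

```python
def weighted_sum(genotypes, weights, collapse=False):
    """Score s = sum_i w_i * g_i for one subject.

    genotypes: recoded genotypes g_1..g_m of the subject
    weights:   one weight per variant
    collapse:  if True, cap the score at 1 (carrier / non-carrier indicator)
    """
    s = sum(w * g for w, g in zip(weights, genotypes))
    if collapse and s > 1:
        s = 1
    return s


subject = [1, 0, 1, 0]  # carries variants 1 and 3

# Collapsing test: unit weights, score capped at 1.
print(weighted_sum(subject, [1, 1, 1, 1], collapse=True))

# Direction-aware weights in the spirit of aSum: a variant judged protective gets -1.
print(weighted_sum(subject, [1, 1, -1, 1]))
```

With signed weights, a risk variant and a protective variant carried by the same subject cancel in the score instead of both counting as "carrier".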
32
When there are only causal(+) variants …
(+) (+)
Subject V1 V2 Collapsed Trait
1 1 0 1 3.00
2 0 1 1 3.10
3 0 0 0 1.95
4 0 0 0 2.00
5 0 0 0 2.05
6 0 0 0 2.10
Collapsing (Li & Leal, 2008) works well, power increased
33
When there are causal(+) and non-causal(.) variants …
(+) (+) (.) (.)
Subject V1 V2 V3 V4 Collapsed Trait
1 1 0 0 0 1 3.00
2 0 1 0 0 1 3.10
3 0 0 0 0 0 1.95
4 0 0 0 0 0 2.00
5 0 0 0 0 0 2.05
6 0 0 0 0 0 2.10
7 0 0 1 0 1 2.00
8 0 0 0 1 1 2.10
Collapsing still works, power reduced
34
When there are causal(+), non-causal(.) and causal(-) variants …
(+) (+) (.) (.) (-) (-)
Subject V1 V2 V3 V4 V5 V6 Collapsed Trait
1 1 0 0 0 0 0 1 3.00
2 0 1 0 0 0 0 1 3.10
3 0 0 0 0 0 0 0 1.95
4 0 0 0 0 0 0 0 2.00
5 0 0 0 0 0 0 0 2.05
6 0 0 0 0 0 0 0 2.10
7 0 0 1 0 0 0 1 2.00
8 0 0 0 1 0 0 1 2.10
9 0 0 0 0 1 0 1 0.95
10 0 0 0 0 0 1 1 1.00
Power of collapsing test significantly down
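The loss of signal across the three scenarios can be checked directly from the trait values on the slides. The sketch below compares the mean trait of collapsed carriers with that of non-carriers. The mean difference is used only as a simple stand-in for the strength of the collapsing signal, not as the test itself, and the function name is hypothetical.

```python
def carrier_mean_difference(collapsed, trait):
    """Mean trait of carriers (collapsed = 1) minus mean trait of non-carriers."""
    carriers = [y for c, y in zip(collapsed, trait) if c == 1]
    others = [y for c, y in zip(collapsed, trait) if c == 0]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


# Trait and collapsed indicator for subjects 1..10, as on the slides.
TRAIT = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]
COLLAPSED = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]

# Subjects 1-6: only causal(+) variants.
only_causal = carrier_mean_difference(COLLAPSED[:6], TRAIT[:6])
# Subjects 1-8: causal(+) and non-causal(.) variants.
with_neutral = carrier_mean_difference(COLLAPSED[:8], TRAIT[:8])
# Subjects 1-10: causal(+), non-causal(.) and causal(-) variants.
bidirectional = carrier_mean_difference(COLLAPSED, TRAIT)

print(f"{only_causal:.3f} {with_neutral:.3f} {bidirectional:.3f}")
```

In the third scenario the carriers of (+) and (-) variants offset each other, so the collapsed carrier group looks like the non-carrier group on average.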
35
P-value Weighted Sum Test (PWST)
(+) (+) (.) (.) (-) (-)
Subject V1 V2 V3 V4 V5 V6 Collapsed pSum Trait
1 1 0 0 0 0 0 1 0.86 3.00
2 0 1 0 0 0 0 1 0.90 3.10
3 0 0 0 0 0 0 0 0.00 1.95
4 0 0 0 0 0 0 0 0.00 2.00
5 0 0 0 0 0 0 0 0.00 2.05
6 0 0 0 0 0 0 0 0.00 2.10
7 0 0 1 0 0 0 1 -0.02 2.00
8 0 0 0 1 0 0 1 0.08 2.10
9 0 0 0 0 1 0 1 -0.90 0.95
10 0 0 0 0 0 1 1 -0.88 1.00
t 1.61 1.84 -0.04 0.11 -1.84 -1.72
p(x≤t) 0.93 0.95 0.49 0.54 0.05 0.06
2*(p-0.5) 0.86 0.90 -0.02 0.08 -0.90 -0.88
Rescaled left-tail p-value [-1,1] is used as weight
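The weights in the PWST table can be recomputed from the genotypes and trait values. The slide does not say which single-variant statistic was used. The sketch below assumes a pooled-variance two-sample t statistic (carriers vs. non-carriers, n − 2 degrees of freedom), which appears to match the slide's t row to rounding. The t-distribution CDF is obtained by simple numerical integration, and all function names are hypothetical.

```python
import math

TRAIT = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]
# One row per subject, columns V1..V6.
GENOTYPES = [
    [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]


def pooled_t(carriers, others):
    """Two-sample t statistic with pooled variance (assumed single-variant test)."""
    n1, n2 = len(carriers), len(others)
    m1, m2 = sum(carriers) / n1, sum(others) / n2
    ss = sum((y - m1) ** 2 for y in carriers) + sum((y - m2) ** 2 for y in others)
    pooled_var = ss / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def t_cdf(t, df, steps=2000):
    """P(X <= t) for Student's t, by Simpson integration of the density on [0, |t|]."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = abs(t) / steps
    area = pdf(0.0) + pdf(abs(t))
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + math.copysign(area * h / 3, t)


def pwst_weights(genotypes, trait):
    """Per variant: (t, left-tail p, weight), with weight = 2 * (p - 0.5) in [-1, 1]."""
    results = []
    for i in range(len(genotypes[0])):
        carriers = [y for g, y in zip(genotypes, trait) if g[i] > 0]
        others = [y for g, y in zip(genotypes, trait) if g[i] == 0]
        t = pooled_t(carriers, others)
        p_left = t_cdf(t, len(trait) - 2)
        results.append((t, p_left, 2 * (p_left - 0.5)))
    return results


stats = pwst_weights(GENOTYPES, TRAIT)
weights = [w for _t, _p, w in stats]
# pSum per subject: weighted sum of genotypes using the rescaled p-values.
p_sum = [sum(w * g for w, g in zip(weights, row)) for row in GENOTYPES]
for t, p_left, w in stats:
    print(f"t = {t:6.2f}  p(x<=t) = {p_left:.2f}  weight = {w:5.2f}")
```

The slide computes the weights from p-values already rounded to two decimals, so recomputed weights can differ from the table in the second decimal place. Because these weights are learned from the trait itself, the resulting score cannot be tested directly against its usual null distribution.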
36
P-value Weighted Sum Test (PWST)
Power of the collapsing test is retained even when there are bidirectional effects
37
PWST: Q-Q Plots Under the Null
Direct test: inflation of type I error
Corrected by permutation test (permutation of phenotype)
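A phenotype permutation test can be sketched generically. The key point for data-driven weights is that the whole statistic, including any weight estimation, is recomputed on every permuted phenotype. The statistic used in the example (carrier vs. non-carrier mean difference on collapsed genotypes) is only a placeholder, and the function names, the default of 1000 permutations, and the seed are all assumptions for illustration.

```python
import random


def permutation_p_value(statistic, genotypes, trait, n_perm=1000, seed=0):
    """Two-sided permutation p-value, shuffling the phenotype across subjects."""
    rng = random.Random(seed)
    observed = abs(statistic(genotypes, trait))
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # The statistic (and any data-driven weights inside it) is recomputed
        # from scratch on each permuted phenotype.
        if abs(statistic(genotypes, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def collapsed_mean_difference(genotypes, trait):
    """Placeholder statistic: mean trait of carriers minus mean of non-carriers."""
    carriers = [y for g, y in zip(genotypes, trait) if any(g)]
    others = [y for g, y in zip(genotypes, trait) if not any(g)]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


GENOTYPES = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
]
TRAIT = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10]

p_perm = permutation_p_value(collapsed_mean_difference, GENOTYPES, TRAIT)
print(f"permutation p-value: {p_perm:.3f}")
```

Shuffling phenotypes treats subjects as exchangeable, which holds for unrelated samples but not for relatives.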
Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)
38
GLMM & WST
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon$
Y: quantitative trait or logit(binary trait)
α: intercept
β: regression coefficient of weighted sum
m: number of RVs to be collapsed
wi: weight of variant i
gi: genotype (recoded) of variant i
Σwigi: weighted sum (WS)
X: covariate(s), such as population structure variable(s)
τ: fixed effect(s) of X
Z: design matrix corresponding to γ
γ: random polygenic effects for individual subjects, ~N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygenic genetic variance
ε: residual
39
Weight
Based on allele frequency: binary (0,1) or continuous, fixed or variable threshold;
Based on functional annotation/prediction: SIFT, PolyPhen, etc.;
Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.);
Data-driven: using both genotype and phenotype data, learning weights from data or adaptive selection, permutation test;
Any combination …
40
Application 1: Family Data
Adjusting relatedness in family data for non-data-driven test of rare variants.
$\sum_{i=1}^{m} w_i g_i$
41
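Unadjusted: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + \varepsilon$
Adjusted: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$, $\gamma \sim N(0, 2\sigma^2 K)$
Q-Q Plots of –log10(P) under the Null
[Figure: two Q-Q plot panels]
Li & Leal’s collapsing test, ignoring family structure: inflation of type I error
Li & Leal’s collapsing test, modeling family structure via GLMM: inflation is corrected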
42
(From Zhang et al, 2011, BMC Proc.)
Application 2: Permuting Family Data
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$
[Annotations on the equation: "Permuted"; "Non-permuted, subject IDs fixed"]
43
MMPT: Mixed Model-based Permutation Test
Adjusting relatedness in family data for data-driven permutation test of rare variants.
γ ~ N(0, 2σ²K)
Q-Q Plots under the Null
[Figure: Q-Q plot panels for WSS, SPWST, PWST, and aSum]
Permutation test, ignoring family structure: inflation of type I error
44
(From Zhang et al, 2011, IGES Meeting)
Q-Q Plots under the Null
[Figure: Q-Q plot panels for WSS, SPWST, PWST, and aSum]
Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected
(From Zhang et al, 2011, IGES Meeting)
Burden Test vs. Non-burden Test
46
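Burden test
$Y = \alpha + \beta \left( \sum_{i=1}^{k} w_i x_i \right) + \varepsilon$, $H_0: \beta = 0$
T-test, Likelihood Ratio Test, F-test, score test, …
Non-burden test
$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$, $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
SKAT: sequence kernel association test
SKAT: sequence kernel association test
$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$, $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$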
Extension of SKAT to Family Data
[Equation annotations: kinship matrix; polygenic heritability of the trait; residual]
Han Chen et al., 2012, Genetic Epidemiology
Other problems
49
Missing genotypes & imputation
Genotyping errors & QC (family consistency, sequence review)
Population stratification
Inherited variants and de novo mutations
Family data & linkage information
Variant validation and association validation
Public databases
And more …