Association Analysis of Rare Genetic Variants

Qunyuan Zhang, Division of Statistical Genomics

Course M21-621 Computational Statistical Genetics

Rare Variants

Low allele frequency: usually less than 1%

Low power: in most analyses, because the genotypes vary little across observations

High false positive rate: in some model-based analyses, because the data are sparse, parameter estimates are unstable or biased, and significance is inflated (p-values that are too small).


An Example of Low Power

Jonathan C. Cohen, et al. Science 305, 869 (2004)

An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished)

Q-Q plot panels:

N=~2500, MAF>0.03

N=~2500, MAF<0.03

N=~2500, MAF<0.03, permuted

N=50000, MAF<0.03, bootstrapped

Three Levels of Rare Variant Data

Level 1: Individual-level

Level 2: Summarized over subjects

Level 3: Summarized over both subjects and variants

Level 1: Individual-level

Subject  V1  V2  V3  V4  Trait-1  Trait-2
1        1   0   0   0   90.1     1
2        0   1   0   .   99.2     1
3        0   0   0   0   105.9    0
4        0   0   0   0   89.5     0
5        0   .   0   0   97.6     0
6        0   0   0   0   110.5    0
7        0   0   1   0   88.8     0
8        0   0   0   1   95.4     1


Level 2: Summarized over subjects (by group)

Jonathan C. Cohen, et al. Science 305, 869 (2004)

Level 3: Summarized over subjects (by group) and variants (usually by gene)

                Variant allele number  Reference allele number  Total
Low-HDL group   20                     236                      256
High-HDL group  2                      254                      256
Total           22                     490                      512
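A total-frequency comparison like the one in this table can be checked with an exact test on the 2x2 allele counts. The sketch below assumes a two-sided Fisher exact test, which the slides do not name as the test behind the TFT. The function names are illustrative only.

```python
from math import comb

def hypergeom_pmf(a, row1, col1, total):
    """P(top-left cell = a) in a 2x2 table with fixed margins."""
    return comb(col1, a) * comb(total - col1, row1 - a) / comb(total, row1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - total), min(row1, col1)
    probs = [hypergeom_pmf(x, row1, col1, total) for x in range(lo, hi + 1)]
    p_obs = hypergeom_pmf(a, row1, col1, total)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

# Allele counts from the Level 3 table: (variant, reference) per group.
p_value = fisher_exact_two_sided(20, 236, 2, 254)
print(f"TFT (Fisher exact, two-sided): p = {p_value:.2e}")
```

Pooling the counts over variants is what turns many near-empty single-variant tables into one table with enough variant alleles to test.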


Methods For Level 3 Data

Single-variant Test vs Total Freq. Test (TFT)

Jonathan C. Cohen, et al. Science 305, 869 (2004)

What we have learned …

Single-variant tests of rare variants have very low power to detect association, because the allele frequency is extremely low (usually < 0.01)

Testing the collective effect of a set of rare variants may increase power (sum test, collective test, group test, collapsing test, burden test…)

Allowing different sample sizes for different variants

Different variants can be weighted differently

CAST: A cohort allelic sums test

Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56

Under H0: $S_{cases}/(2N_{cases}) - S_{controls}/(2N_{controls}) = 0$

S: variant number; N: sample size

$T = S_{cases} - S_{controls} \cdot N_{cases}/N_{controls} = S_{cases} - S^{*}_{controls}$

(S can be calculated variant by variant and can be weighted differently; the final $T = \sum_i W_i S_i$)

$Z = T/\sqrt{Var(T)} \sim N(0,1)$

$Var(T) = Var(S_{cases} - S^{*}_{controls}) = Var(S_{cases}) + Var(S^{*}_{controls}) = Var(S_{cases}) + Var(S_{controls}) \times [N_{cases}/N_{controls}]^2$
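The CAST statistic can be worked through numerically. The sketch below makes two assumptions that are not on the slide: the variant counts are treated as Poisson, so Var(S) = S, and the low-HDL group from the Cohen et al. table plays the role of "cases" with 128 subjects per group (256 alleles each). The function name `cast_z` is hypothetical.

```python
from math import erfc, sqrt

def cast_z(s_cases, s_controls, n_cases, n_controls):
    """Z = T / sqrt(Var(T)) with T = S(cases) - S(controls) * N(cases)/N(controls)."""
    ratio = n_cases / n_controls
    t = s_cases - s_controls * ratio
    # Assumption: Poisson counts, so Var(S) = S; the slide leaves Var(S) unspecified.
    var_t = s_cases + s_controls * ratio ** 2
    return t / sqrt(var_t)

z = cast_z(s_cases=20, s_controls=2, n_cases=128, n_controls=128)
p_two_sided = erfc(abs(z) / sqrt(2))  # two-sided tail of N(0, 1)
print(f"Z = {z:.3f}, two-sided p = {p_two_sided:.2e}")
```

With equal group sizes the ratio is 1, so T is simply the difference in variant counts.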


C-alpha

PLOS Genetics, 2011 | Volume 7 | Issue 3 | e1001322

Effect direction problem

QQ Plots of Existing Methods (under the null)

Panels: CAST, C-alpha, EFT, TFT

• EFT and C-alpha: inflated with false positives

• TFT and CAST: no inflation, but assuming a single effect direction

• Objective: more general, powerful methods …

More Generalized Methods For Level 2 Data

Structure of Level 2 data: one table per variant (variant 1, variant 2, variant 3, …, variant i, …, variant k)

Strategy

Instead of testing the total frequency/number, we test the randomness of all tables.

Exact Probability Test (EPT)

1. Calculating the probability of each table based on the hypergeometric distribution

$P_i = \binom{n_{i1}}{a_{i1}} \binom{n_{i2}}{a_{i2}} \Big/ \binom{N_i}{A_i}$

2. Calculating the log-transformed joint probability (L) for all k tables

$L = \sum_{i=1}^{k} \log(P_i)$

3. Enumerating all possible tables and L scores

4. Calculating the p-value: P = Prob.( )

ASHG Meeting 1212, Zhang
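The four EPT steps can be sketched for a small set of variants. The sketch reads each table as (a1, n1, a2, n2): variant-allele count and total allele count in each of two groups, with N = n1 + n2 and A = a1 + a2. It also assumes that the p-value is the total probability of all table sets whose L is no larger than the observed L. The slides do not show that definition, so treat it as one plausible reading. Full enumeration is only practical for small k and small margins.

```python
from itertools import product
from math import comb, exp, log

def table_prob(a1, n1, n2, A):
    """Hypergeometric probability of one variant's table given its margins."""
    return comb(n1, a1) * comb(n2, A - a1) / comb(n1 + n2, A)

def ept_pvalue(tables):
    """tables: one (a1, n1, a2, n2) tuple per variant."""
    l_obs = sum(log(table_prob(a1, n1, n2, a1 + a2)) for a1, n1, a2, n2 in tables)
    # Every value of a1 that is compatible with each variant's fixed margins.
    ranges = [range(max(0, a1 + a2 - n2), min(n1, a1 + a2) + 1)
              for a1, n1, a2, n2 in tables]
    p_value = 0.0
    for combo in product(*ranges):
        l = sum(log(table_prob(x, n1, n2, a1 + a2))
                for x, (a1, n1, a2, n2) in zip(combo, tables))
        # Assumption: "at least as extreme" means L <= observed L.
        if l <= l_obs + 1e-9:
            p_value += exp(l)
    return p_value

# Hypothetical counts: two variants with opposite directions plus one balanced variant.
tables = [(4, 20, 0, 20), (0, 20, 4, 20), (1, 20, 1, 20)]
print(f"EPT p = {ept_pvalue(tables):.4f}")
```

Because L only measures how improbable the tables are, the first two variants both count as evidence even though their effects point in opposite directions.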

Likelihood Ratio Test (LRT)

$LR = -2 \log \frac{\Pr(a_1, b_1, a_2, b_2 : H_0)}{\Pr(a_1, b_1, a_2, b_2 : H_1)} \sim \chi^2_{df=k}$

(probabilities computed over the k variants from the binomial distribution)

ASHG Meeting 1212, Zhang

Q-Q Plots of EPT and LRT (under the null)

Panels: EPT, N=500; EPT, N=3000; LRT, N=500; LRT, N=3000

Power Comparison (significance level = 0.00001)

Figure panels plot power (y-axis) against sample size (x-axis) for three variant-proportion scenarios:

Scenario  Positive causal  Neutral  Negative causal
1         80%              20%      0%
2         60%              20%      20%
3         40%              20%      40%


• Including covariates

• Extended to quantitative traits

• Better control for population structure

• More sophisticated models

Collapsing (C) test

Step 1

Step 2

logit(y)=a + b* X + e (logistic regression)

Li and Leal, The American Journal of Human Genetics 2008 (83): 311–321

Variant Collapsing

         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          1
2        0    1    0    0    1          1
3        0    0    0    0    0          0
4        0    0    0    0    0          0
5        0    0    0    0    0          0
6        0    0    0    0    0          0
7        0    0    1    0    1          0
8        0    0    0    1    1          1
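The collapsing step in this table is a one-line rule: a subject is coded 1 if any of the variants is present. A minimal sketch follows, using the eight subjects above. Treating missing genotypes (`None`) as non-carriers is an assumption of the sketch, not something the slide specifies.

```python
from collections import Counter

# Rows are subjects 1..8, columns are V1..V4, as in the table above.
genotypes = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
trait = [1, 1, 0, 0, 0, 0, 0, 1]

def collapse(row):
    """1 if the subject carries any rare variant, else 0 (None = missing, skipped)."""
    return int(any(g for g in row if g is not None))

collapsed = [collapse(row) for row in genotypes]
# Cross-tabulate the collapsed indicator against the binary trait.
counts = Counter(zip(collapsed, trait))
print(collapsed)
print({key: counts[key] for key in [(1, 1), (1, 0), (0, 1), (0, 0)]})
```

The collapsed indicator is then the single predictor X in the logistic regression of Step 2.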

WSS

Weighted Sum Test

$s = \sum_{i=1}^{m} w_i g_i$

Collapsing test (Li & Leal, 2008): wi = 1, and s = 1 if s > 1

Weighted-sum test (Madsen & Browning, 2009): wi calculated based on allele frequency in the control group

aSum, adaptive sum test (Han & Pan, 2010): wi = -1 if b < 0 and p < 0.1, otherwise wi = 1

KBAC (Liu and Leal, 2010): wi = left-tail p-value

RBT (Ionita-Laza et al., 2011): wi = log-scaled probability

PWST, p-value weighted sum test (Zhang et al., 2011): wi = rescaled left-tail p-value, incorporating both significance and direction

EREC (Lin et al., 2011): wi = estimated effect size
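As one concrete instance of the weighted sum, the sketch below computes frequency-based weights from the control group. It assumes the weighting usually attributed to Madsen & Browning (2009): q_i = (m_i + 1) / (2 n_u + 2), where m_i is the variant-allele count among the n_u controls, and a variant's contribution is divided by sqrt(n q_i (1 - q_i)). In the notation s = Σ w_i g_i that makes w_i the reciprocal. The data are the eight hypothetical subjects from the collapsing example, and the function names are illustrative.

```python
from math import sqrt

def control_based_weights(genotypes, trait):
    """w_i = 1 / sqrt(n * q_i * (1 - q_i)), q_i estimated from controls (trait == 0)."""
    n = len(genotypes)
    controls = [row for row, y in zip(genotypes, trait) if y == 0]
    weights = []
    for i in range(len(genotypes[0])):
        m_u = sum(row[i] for row in controls)        # variant alleles in controls
        q = (m_u + 1) / (2 * len(controls) + 2)      # smoothed control frequency
        weights.append(1 / sqrt(n * q * (1 - q)))
    return weights

def weighted_sum(row, weights):
    """s = sum_i w_i * g_i for one subject."""
    return sum(w * g for w, g in zip(weights, row))

genotypes = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
             [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
trait = [1, 1, 0, 0, 0, 0, 0, 1]

weights = control_based_weights(genotypes, trait)
scores = [weighted_sum(row, weights) for row in genotypes]
print([round(s, 3) for s in scores])
```

Variants that also appear in controls (here V3) get a smaller weight than variants seen only in cases, so they contribute less to a subject's score.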

When there are only causal(+) variants …

         (+)  (+)
Subject  V1   V2   Collapsed  Trait
1        1    0    1          3.00
2        0    1    1          3.10
3        0    0    0          1.95
4        0    0    0          2.00
5        0    0    0          2.05
6        0    0    0          2.10

Collapsing (Li & Leal, 2008) works well, power increased

When there are causal(+) and non-causal(.) variants …

         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          3.00
2        0    1    0    0    1          3.10
3        0    0    0    0    0          1.95
4        0    0    0    0    0          2.00
5        0    0    0    0    0          2.05
6        0    0    0    0    0          2.10
7        0    0    1    0    1          2.00
8        0    0    0    1    1          2.10

Collapsing still works, power reduced

When there are causal(+), non-causal(.) and causal(-) variants …

         (+)  (+)  (.)  (.)  (-)  (-)
Subject  V1   V2   V3   V4   V5   V6   Collapsed  Trait
1        1    0    0    0    0    0    1          3.00
2        0    1    0    0    0    0    1          3.10
3        0    0    0    0    0    0    0          1.95
4        0    0    0    0    0    0    0          2.00
5        0    0    0    0    0    0    0          2.05
6        0    0    0    0    0    0    0          2.10
7        0    0    1    0    0    0    1          2.00
8        0    0    0    1    0    0    1          2.10
9        0    0    0    0    1    0    1          0.95
10       0    0    0    0    0    1    1          1.00

Power of the collapsing test drops significantly

P-value Weighted Sum Test (PWST)

           (+)    (+)    (.)    (.)    (-)    (-)
Subject    V1     V2     V3     V4     V5     V6     Collapsed  pSum   Trait
1          1      0      0      0      0      0      1          0.86   3.00
2          0      1      0      0      0      0      1          0.90   3.10
3          0      0      0      0      0      0      0          0.00   1.95
4          0      0      0      0      0      0      0          0.00   2.00
5          0      0      0      0      0      0      0          0.00   2.05
6          0      0      0      0      0      0      0          0.00   2.10
7          0      0      1      0      0      0      1          -0.02  2.00
8          0      0      0      1      0      0      1          0.08   2.10
9          0      0      0      0      1      0      1          -0.90  0.95
10         0      0      0      0      0      1      1          -0.88  1.00
t          1.61   1.84   -0.04  0.11   -1.84  -1.72
p(x≤t)     0.93   0.95   0.49   0.54   0.05   0.06
2*(p-0.5)  0.86   0.90   -0.02  0.08   -0.90  -0.88

The rescaled left-tail p-value, in [-1, 1], is used as the weight
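The weights in this example can be recomputed from the trait values. The sketch below assumes that t is a pooled-variance two-sample t statistic (carriers minus non-carriers) with n - 2 degrees of freedom. That choice is an inference from the t row of the table, which the slide does not document. The Student-t CDF is integrated numerically so that only the standard library is needed. All function names are illustrative.

```python
from math import gamma, pi, sqrt

def t_cdf(t, df, steps=2000):
    """Student-t CDF, integrating the density from 0 to t with Simpson's rule."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps  # negative for t < 0, so the signed area is negative too
    area = pdf(0.0) + pdf(t)
    for j in range(1, steps):
        area += (4 if j % 2 else 2) * pdf(j * h)
    return 0.5 + area * h / 3

def pooled_t(trait, carrier):
    """Two-sample t statistic (carriers minus non-carriers), pooled variance."""
    x1 = [y for y, g in zip(trait, carrier) if g]
    x0 = [y for y, g in zip(trait, carrier) if not g]
    m1, m0 = sum(x1) / len(x1), sum(x0) / len(x0)
    ss = sum((y - m1) ** 2 for y in x1) + sum((y - m0) ** 2 for y in x0)
    df = len(trait) - 2
    return (m1 - m0) / sqrt(ss / df * (1 / len(x1) + 1 / len(x0))), df

trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]
# Variants V1..V6 are each carried by a single subject (0-based index below).
carriers = [[1 if s == c else 0 for s in range(10)] for c in (0, 1, 6, 7, 8, 9)]

weights = []
for g in carriers:
    t, df = pooled_t(trait, g)
    weights.append(2 * (t_cdf(t, df) - 0.5))  # rescaled left-tail p-value in [-1, 1]

# pSum for each subject: weighted sum of that subject's genotypes.
p_sum = [sum(w * g[s] for w, g in zip(weights, carriers)) for s in range(10)]
print([round(w, 2) for w in weights])
```

Under these assumptions the weights agree with the last row of the table up to rounding. Variants with opposite effect directions receive weights of opposite sign, so they no longer cancel in the sum.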

P-value Weighted Sum Test (PWST)

Power of the collapsing test is retained even when there are bidirectional effects

PWST: Q-Q Plots Under the Null

Direct test: inflation of type I error

Corrected by permutation test (permutation of phenotype)
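A phenotype permutation test can be sketched in a few lines. The example uses the six-subject, two-variant data from the earlier collapsing slide and, for brevity, the absolute difference in mean trait between carriers and non-carriers as the statistic. That statistic is a stand-in chosen for the sketch. For a data-driven test such as PWST, the statistic passed in would have to re-estimate the weights on every permuted phenotype. The function names, seed and number of permutations are arbitrary.

```python
import random

def abs_mean_diff(trait, score):
    """|mean trait of carriers - mean trait of non-carriers| for a 0/1 score."""
    carriers = [y for y, s in zip(trait, score) if s]
    others = [y for y, s in zip(trait, score) if not s]
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))

def permutation_pvalue(statistic, trait, score, n_perm=2000, seed=1):
    """Permute the phenotype, keep genotypes fixed, and compare to the observed statistic."""
    rng = random.Random(seed)
    observed = statistic(trait, score)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Any data-driven step (e.g. weight estimation) must be redone inside statistic().
        if statistic(shuffled, score) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10]
collapsed = [1, 1, 0, 0, 0, 0]
p_perm = permutation_pvalue(abs_mean_diff, trait, collapsed)
print(f"permutation p = {p_perm:.3f}")
```

Adding 1 to both the numerator and the denominator keeps the estimated p-value away from zero with a finite number of permutations.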

Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)

$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon$

Y: quantitative trait or logit(binary trait)

α: intercept

β: regression coefficient of the weighted sum

m: number of RVs to be collapsed

wi: weight of variant i

gi: genotype (recoded) of variant i

Σwigi: weighted sum (WS)

X: covariate(s), such as population structure variable(s)

τ: fixed effect(s) of X

Z: design matrix corresponding to γ

γ: random polygenic effects for individual subjects, ~N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygenic variance

ε: residual

Weight

Based on allele frequency: binary (0, 1) or continuous, fixed or variable threshold;

Based on functional annotation/prediction: SIFT, PolyPhen, etc.;

Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.);

Data-driven: using both genotype and phenotype data, learning weights from data or adaptive selection, permutation test;

Any combination …

Application 1: Family Data

Adjusting for relatedness in family data in non-data-driven tests of rare variants.

Weighted sum: $\sum_{i=1}^{m} w_i g_i$

Q-Q Plots of –log10(P) under the Null

Unadjusted: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + \varepsilon$

Li & Leal's collapsing test, ignoring family structure: inflation of type-1 error

Adjusted: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$, with $\gamma \sim N(0, 2\sigma^2 K)$

Li & Leal's collapsing test, modeling family structure via GLMM: inflation is corrected

(From Zhang et al, 2011, BMC Proc.)

Application 2: Permuting Family Data

MMPT: Mixed Model-based Permutation Test

Adjusting for relatedness in family data in data-driven permutation tests of rare variants.

$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$, with $\gamma \sim N(0, 2\sigma^2 K)$

Labels on the model terms: "Permuted" and "Non-permuted, subject IDs fixed"

Q-Q Plots under the Null

Panels: WSS, SPWST, PWST, aSum

Permutation test, ignoring family structure: inflation of type-1 error

(From Zhang et al, 2011, IGES Meeting)

Q-Q Plots under the Null

Panels: WSS, SPWST, PWST, aSum

Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected

(From Zhang et al, 2011, IGES Meeting)

Burden Test vs. Non-burden Test

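Burden test

$Y = \alpha + \beta \left( \sum_{i=1}^{k} w_i x_i \right) + \varepsilon$

$H_0: \beta = 0$

T-test, Likelihood Ratio Test, F-test, score test, …

Non-burden test

$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$

$H_0: \beta_i = 0 \; (\beta_1 = \beta_2 = \ldots = \beta_k = 0)$

SKAT: sequence kernel association test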

SKAT: sequence kernel association test

$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$

$H_0: \beta_i = 0 \; (\beta_1 = \beta_2 = \ldots = \beta_k = 0)$
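The practical difference between the two hypotheses shows up when effects point in opposite directions. The sketch below contrasts a burden-style score with a SKAT-style kernel score on the ten-subject example with (+), (.) and (-) variants. It uses per-variant scores S_i = Σ_j x_ij (y_j - ȳ), unit weights and no covariates, and it computes the statistics only, not p-values. These simplifications are assumptions of the sketch rather than the full SKAT procedure.

```python
def variant_scores(genotypes, trait):
    """S_i = sum over subjects of x_ij * (y_j - mean(y)) for each variant i."""
    ybar = sum(trait) / len(trait)
    k = len(genotypes[0])
    return [sum(row[i] * (y - ybar) for row, y in zip(genotypes, trait))
            for i in range(k)]

def burden_score(scores, weights):
    """Square of the weighted sum of scores: opposite directions cancel."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def kernel_score(scores, weights):
    """Weighted sum of squared scores: direction does not matter."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

trait = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]
# Subjects 1..10 by variants V1..V6; each variant has a single carrier.
carrier_of = {0: 0, 1: 1, 6: 2, 7: 3, 8: 4, 9: 5}
genotypes = [[1 if carrier_of.get(s) == v else 0 for v in range(6)]
             for s in range(10)]

scores = variant_scores(genotypes, trait)
weights = [1.0] * 6
burden = burden_score(scores, weights)
kernel = kernel_score(scores, weights)
print(f"burden-style = {burden:.4f}, kernel-style = {kernel:.4f}")
```

In this constructed example the positive and negative scores cancel almost exactly in the burden-style statistic, while the kernel-style statistic keeps the contribution of every variant.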

Extension of SKAT to Family Data

Kinship matrix

Polygenic heritability of the trait

Residual

Han Chen et al., 2012, Genetic Epidemiology

Other problems

Missing genotypes & imputation

Genotyping errors & QC (family consistency, sequence review)

Population stratification

Inherited variants and de novo mutation

Family data & linkage information

Variant validation and association validation

Public databases

And more …
