statistical tests for differential expression in count ... › images › 4 › 43 ›...

34
Microarrays NGS Normalization Methylation Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Upload: others

Post on 24-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Statistical testsfor differential expression

in count data (1)

Erik van Zwet, LUMC

NBIC Advanced RNA-seq course25-26 August 2011

Academic Medical Center, Amsterdam

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 2: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

The analysis of a microarray experiment

I Pre-processI image analysis to produce relative abundancesI take the logarithmI remove experimental artifacts (normalize)

I perform 20,000 t tests

I Correct for multiple testing

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 3: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

The model

For any particular gene, the log expression levels are assumed to benormally distributed:

fit=lm(y~group)

or perhaps

fit=lm(y~age+group)

I One might further assume that the gene variances come froma common (Gamma) distribution. In R: limma.

I Results in shrinkage: the empirical variances are pulledtowards their common mean.

I Favours genes with a large variance.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 4: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Multiple Testing

K tests at level α = 0.05. Expect 0.05K false discoveries.

Bonferroni:K tests at level α = 0.05/K . Expect 0.05 false discoveries.

In other words, we expect to make one false discovery in twentyexperiments. Perhaps too conservative?

Simple solution: increase the level.

Unfortunately, this is taboo.

The Concise Oxford Dictionary (1996) (1) a system or the act of setting a person or thing apart as sacred,

prohibited, or accursed. (2) a prohibition or restriction imposed on certain behaviour, word usage, etc., by social

custom.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 5: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Multiple Testing

H0 true H0 false

H0 rejected V S

H0 not rejected U T

Note: V are the false discoveries.

Bonferroni (Holm) controls the FWER

FWER = P(V > 0) ≤ 0.05.

Benjamini–Hochberg controls the FDR

FDR = E

(V

V + S

)≤ 0.05.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 6: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Multiple Testing

FDR = E

(V

V + S

)≤ 0.05.

I We allow not more than 1 in 20 of our discoveries to be false(on average).

I FDR is less conservative than FWER, yet somehow it does notfeel like violating the taboo.

Same for NGS data.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 7: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Testing sets of genes (pathways)

I Next speaker.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 8: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Next Generation Sequencing

An NGS experiment produces very many counts.

I not well approximated by the normal distribution (unlesscounts are large and you take the square root.)

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 9: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Example from Baggerly et al. 2003

library 1 2 3 4 5 6 7 8pheno T+ T+ T+ T+ T+ T− T− T−size 100474 96631 92510 95785 18705 95155 91593 98220

tag 1 129 167 71 61 6 43 247 509tag 2···

We want to test if tag 1 is more/less abundant in the tumor versusthe control group.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 10: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

In the tumor group we have

434

404105= 0.0011

and in the control group we have

799

284968= 0.0028.

So, tag 1 occurs more than twice as often in the control group.

Is this statistically significant?

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 11: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

I Total count for tag 1 in the tumor group has a binomialdistribution with n1 = 404105 and some probability p1.

I Total count for tag 1 in the control group has a binomialdistribution with n2 = 284968 and some probability p2.

We can use R to test the null hypothesis H0 : p1 = p2

x=c(434,799)n=c(404105,284968)prop.test(x,n)

p-value < 2.2e-16.

Giddyup!

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 12: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

A more flexible approach is to fit a logistic regression model for theprobability of observing tag 1. In such a model we could add othercovariates besides the presence of absence of the tumor.

x=c(129,167,71,61,6,43,247,509)n=c(100474,96631,92510,95785,18705,95155,91593,98220)group=factor(c(0,0,0,0,0,1,1,1))y=x/nfit=glm(y~group,family=binomial(),weights=n)

Again, we find a very significant effect.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 13: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

We put all subjects in each group together, and assumed binomialdistributions for the total counts of tag 1. This is only valid if theproportions for each subject within a group are the same.

library 1 2 3 4 5 6 7 8pheno T+ T+ T+ T+ T+ T− T− T−size 100474 96631 92510 95785 18705 95155 91593 98220

tag 1 129 167 71 61 6 43 247 509prop. (%) 0.13 0.17 0.08 0.06 0.03 0.05 0.27 0.52

Do subjects 1 and 2 have the same proportion? No! p=0.012.

x=c(129,167)n=c(100474,96631)prop.test(x,n)

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 14: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Overdispersion

I There is biological/measurement variation between subjects.

I This will increase the variance of the total count:overdispersion.

I Failure to recognize this, will lead to underestimation of thevariance.

I Underestimation of the variance, will lead to many falsepositives.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 15: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

We need to account for extra (biological) variation.

I For every subject, let Pi be random.

I For every subject, X ∼ Bin(ni ,Pi ).

I Test if the Pi differ on average between the two groups.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 16: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Many ways to deal with overdispersion:

I log(Pi/1− Pi ) normally distributed (GLMM)I too slow to do a million tests or more.

I Pi beta distributed. R package BBSeqI seems to be too slow as well.

I If n is large and p is small, then we can use Poisson-Gammaor negative binomial. R packages baySeq, DESeq, edgeR.

I we’ll discuss now.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 17: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Recall, Binomial(n, p) distribution has

mean = np and variance = np(1− p).

DESeq and edgeR assume Negative-Binomial(µ, φ) distributionwith

mean = µ and variance = µ+ φµ2.

Much effort goes into estimating the dispersion φ when very few(biological) replicates are available.

I Here we should point out that it is generally better to increasethe sample size, if you can.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 18: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Recall, Binomial(n, p) distribution has

mean = np and variance = np(1− p).

DESeq and edgeR assume Negative-Binomial(µ, φ) distributionwith

mean = µ and variance = µ+ φµ2.

Much effort goes into estimating the dispersion φ when very few(biological) replicates are available.

I Here we should point out that it is generally better to increasethe sample size, if you can.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 19: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

From the edgeR manual

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 20: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Much more on baySeq, DESeq and edgeR later today.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 21: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Quasi-Binomial “distribution”

A cheap trick is to add an extra dispersion parameter ϕ to thebinomial

variance = np(1− p) becomes variance = ϕnp(1− p).

To estimate, we maximize a “quasi likelihood”.

fit=glm(y~group,family=binomial(),weights=n)

simply becomes

fit=glm(y~group,family=quasibinomial(),weights=n)

Now we find p = 0.12.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 22: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

fit=glm(y~group,family=quasibinomial(),weights=n)

I accounts for overdispersion.

I few assumptions, hence robust.

I standard GLM: fast and reliable software available.

I standard GLM: theory well-developed.

I easy to include covariates.

I extends to dependent data: GEE.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 23: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

I We condition on the library sizes, and test if the proportion ofany given tag relative to all other tags is different betweengroups.

I But, if any tag differs between groups then other tags mustdo so as well, simply because

∑i pi = 1.

I In a two sample binomial experiment with

X1 ∼ Bin(n1, p1) and X2 ∼ Bin(n2, p2)

it makes no sense to test both

H0 : p1 = p2 and H0 : (1− p1) = (1− p2).

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 24: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

I The problem is very noticeable if there are tags that are bothvery abundant and different beween groups.

I Anders and Huber (2010) and Robinson and Oshlack (2010)suggest to modify the library sizes so that single tags havelittle influence.

I Alternatively, we might take a set of tags that we know won’tdiffer between groups and test all tags of interest relative tothat set.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 25: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Methylation

Joint with

I Jelle Goeman.

I Bas Heijmans.

I Elmar Tobi.

n is not large and p is not small so baySeq, DESeq and edgeRcannot be used. Approximation with the normal distribution is alsonot appropriate.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 26: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Methylation

I addition of methyl group (CH3) to cytosine.

I occurs at CpG sites.

I inherited during mitosis, not meiosis.

I stably affects transcription.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 27: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Jirtle and Skinner, 2007

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 28: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Dutch Famine 1944–1945

I Well documented, brief period allows the study of the effectsof famine on human health.

I Children exposed to famine in utero compared to siblings:I diabetesI obesityI cardiovascular disease

How can we measure methylation?

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 29: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Reduced Representation Bisulphite Sequencing

Align to reduced representation &Infer methylation

M--CCGG-----CCGG----GGCC-----GGCC—

M

MCGG-----C

C-----GGC

M MCGG-----CCGGCC-----GGCM

MspI digestion &Size selection (40-220 bp)

End repair &Adaptor ligation

M MCGG-----CCG

GCC-----GGUM

Bisulfite treatment

CGG-----CCGGCC-----GGC

CGG-----CCAGCC-----GGT

CGG-----

-----GGT

Sequencing ofshort reads

PCR

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 30: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

Data

24 same sex sibling pairs. One sib was exposed in utero to DutchFamine.

family 9900 9900 9930 9930 18240 18240 17650 17650 17910 17910 ...

case 0 1 0 1 0 1 0 1 0 1 ...

cg02824864

coverage 16 8 5 7 4 10 7 1 12 33 ...

methylated 11 5 4 7 4 8 3 1 9 25 ...

...

Three million different CpGs.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 31: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

I First, we remove all CpG with insufficient coverage.

I Next, we test for differences in methylation fraction betweenexposed and unexposed individuals for every CpGs.

I Finally, we test various groups of CpG simultaneously.

We can account for relatedness by introducing a fixed effect perfamily, very similar to a paired t test.

For more complicated designs, we would need GEE.

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 32: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

family 9900 9900 9930 9930 18240 18240 17650 17650 17910 17910 ...

case 0 1 0 1 0 1 0 1 0 1 ...

cg02824864

coverage (n) 16 8 5 7 4 10 7 1 12 33 ...

methyl. (x) 11 5 4 7 4 8 3 1 9 25 ...

...

y=x/nfit=glm(y~case+fam,family=quasibinomial(),weights=n)

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 33: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

How to test sets of CpGs?

Next speaker!

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC

Page 34: Statistical tests for differential expression in count ... › images › 4 › 43 › 20110825_Vanzwet_statisticaltesting.pdfMicroarrays NGS Normalization Methylation I Total count

Microarrays NGS Normalization Methylation

How to test sets of CpGs?

Next speaker!

Statistical tests for differential expression in count data (1) Erik van Zwet, LUMC