

Statistical tests for differential expression in count data (1)

Erik van Zwet, LUMC

NBIC Advanced RNA-seq course, 25-26 August 2011

Academic Medical Center, Amsterdam



The analysis of a microarray experiment

- Pre-process
  - image analysis to produce relative abundances
  - take the logarithm
  - remove experimental artifacts (normalize)

- Perform 20,000 t tests

- Correct for multiple testing



The model

For any particular gene, the log expression levels are assumed to be normally distributed:

fit=lm(y~group)

or perhaps

fit=lm(y~age+group)

- One might further assume that the gene variances come from a common (inverse Gamma) distribution. In R: limma.

- Results in shrinkage: the empirical variances are pulled towards their common mean.

- Compared with the ordinary t test, this favours genes with a large empirical variance.
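As a minimal sketch of the per-gene model, the lines below fit lm(y~group) for one gene and check that the group test coincides with the classical equal-variance two-sample t test. The data are simulated and hypothetical; the group labels, sample size and effect size are assumptions made only for illustration.

```r
# Simulated log expression for a single gene (hypothetical data).
set.seed(1)
group <- factor(rep(c("control", "treated"), each = 6))
y <- rnorm(12, mean = ifelse(group == "treated", 9, 8), sd = 0.5)

# Linear model with a group effect.
fit <- lm(y ~ group)
p_lm <- summary(fit)$coefficients["grouptreated", "Pr(>|t|)"]

# The same test written as a two-sample t test with equal variances.
p_t <- t.test(y ~ group, var.equal = TRUE)$p.value
```

In a microarray analysis this fit would be repeated for each of the roughly 20,000 genes.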



Multiple Testing

K tests at level α = 0.05. Expect up to 0.05K false discoveries (exactly 0.05K if every null hypothesis is true).

Bonferroni: K tests at level α = 0.05/K. Expect at most 0.05 false discoveries.

In other words, we expect to make one false discovery in twenty experiments. Perhaps too conservative?
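The arithmetic can be checked with a small simulation. This sketch assumes that every null hypothesis is true, so that the p-values are uniform on (0, 1); K = 20000 mirrors the microarray example and the seed is arbitrary.

```r
set.seed(1)
K <- 20000
alpha <- 0.05

# Under H0 (and a continuous test statistic) p-values are uniform on (0, 1).
p <- runif(K)

n_unadjusted <- sum(p < alpha)      # expected value: alpha * K = 1000
n_bonferroni <- sum(p < alpha / K)  # expected value: alpha = 0.05
```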

Simple solution: increase the level.

Unfortunately, this is taboo.

The Concise Oxford Dictionary (1996): taboo, (1) a system or the act of setting a person or thing apart as sacred, prohibited, or accursed; (2) a prohibition or restriction imposed on certain behaviour, word usage, etc., by social custom.



Multiple Testing

                   H0 true   H0 false
H0 rejected           V         S
H0 not rejected       U         T

Note: V are the false discoveries.

Bonferroni (Holm) controls the FWER

FWER = P(V > 0) ≤ 0.05.

Benjamini–Hochberg controls the FDR

FDR = E( V / (V + S) ) ≤ 0.05.



Multiple Testing

FDR = E( V / (V + S) ) ≤ 0.05.

- We allow not more than 1 in 20 of our discoveries to be false (on average).

- FDR is less conservative than FWER, yet somehow it does not feel like violating the taboo.

Same for NGS data.
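In R, all three corrections are available through p.adjust. The sketch below applies them to simulated p-values; the mixture of 50 true effects and 950 nulls is a hypothetical choice for illustration only.

```r
set.seed(1)
K <- 1000

# Hypothetical p-values: 50 genes with a real effect (small p) and 950 nulls.
p <- c(rbeta(50, 1, 2000), runif(K - 50))

# Number of rejections at level 0.05 after each adjustment.
rej_bonf <- sum(p.adjust(p, method = "bonferroni") <= 0.05)  # FWER
rej_holm <- sum(p.adjust(p, method = "holm") <= 0.05)        # FWER
rej_bh   <- sum(p.adjust(p, method = "BH") <= 0.05)          # FDR
```

Holm never rejects fewer hypotheses than Bonferroni, and Benjamini–Hochberg never rejects fewer than Holm.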



Testing sets of genes (pathways)

- Next speaker.



Next Generation Sequencing

An NGS experiment produces very many counts.

- not well approximated by the normal distribution (unless counts are large and you take the square root).



Example from Baggerly et al. 2003

library        1      2      3      4      5      6      7      8
pheno         T+     T+     T+     T+     T+     T−     T−     T−
size      100474  96631  92510  95785  18705  95155  91593  98220

tag 1        129    167     71     61      6     43    247    509
tag 2        ···

We want to test if tag 1 is more/less abundant in the tumor versus the control group.



In the tumor group we have

434 / 404105 = 0.0011

and in the control group we have

799 / 284968 = 0.0028.

So, tag 1 occurs more than twice as often in the control group.

Is this statistically significant?
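The pooled totals and proportions follow directly from the table. A minimal sketch, in which the vector names size, tag1 and pheno are our own labels for the table rows (with an ASCII "T-" for the control group):

```r
size  <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)
tag1  <- c(129, 167, 71, 61, 6, 43, 247, 509)
pheno <- factor(c(rep("T+", 5), rep("T-", 3)), levels = c("T+", "T-"))

# Pool the libraries within each group.
x <- sapply(split(tag1, pheno), sum)   # tag 1 counts per group
n <- sapply(split(size, pheno), sum)   # total library size per group

prop  <- x / n                          # proportion of tag 1 per group
ratio <- prop[["T-"]] / prop[["T+"]]    # control relative to tumor
```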



- Total count for tag 1 in the tumor group has a binomial distribution with n1 = 404105 and some probability p1.

- Total count for tag 1 in the control group has a binomial distribution with n2 = 284968 and some probability p2.

We can use R to test the null hypothesis H0 : p1 = p2

x=c(434,799)
n=c(404105,284968)
prop.test(x,n)

p-value < 2.2e-16.

Giddyup!



A more flexible approach is to fit a logistic regression model for the probability of observing tag 1. In such a model we could add other covariates besides the presence or absence of the tumor.

x=c(129,167,71,61,6,43,247,509)
n=c(100474,96631,92510,95785,18705,95155,91593,98220)
group=factor(c(0,0,0,0,0,1,1,1))
y=x/n
fit=glm(y~group,family=binomial(),weights=n)

Again, we find a very significant effect.
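To read off the effect and its p-value, one can inspect the coefficient table of the fit. The sketch below uses the equivalent cbind(successes, failures) form of the same binomial model; the object names coefs, odds_ratio and p_value are ours.

```r
x <- c(129, 167, 71, 61, 6, 43, 247, 509)
n <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)
group <- factor(c(0, 0, 0, 0, 0, 1, 1, 1))   # 0 = tumor, 1 = control

# Same model as glm(y ~ group, weights = n), written with counts.
fit <- glm(cbind(x, n - x) ~ group, family = binomial())

coefs <- summary(fit)$coefficients
odds_ratio <- exp(coefs["group1", "Estimate"])  # control versus tumor
p_value <- coefs["group1", "Pr(>|z|)"]          # Wald test of no group effect
```

Further covariates would simply be added to the right-hand side of the formula.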



We put all subjects in each group together, and assumed binomial distributions for the total counts of tag 1. This is only valid if the proportions for each subject within a group are the same.

library        1      2      3      4      5      6      7      8
pheno         T+     T+     T+     T+     T+     T−     T−     T−
size      100474  96631  92510  95785  18705  95155  91593  98220

tag 1        129    167     71     61      6     43    247    509
prop. (%)   0.13   0.17   0.08   0.06   0.03   0.05   0.27   0.52

Do subjects 1 and 2 have the same proportion? No! p=0.012.

x=c(129,167)
n=c(100474,96631)
prop.test(x,n)
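The same check can be extended to all libraries within a group, since prop.test accepts more than two groups and then tests whether all proportions are equal. A sketch using the table above:

```r
x <- c(129, 167, 71, 61, 6, 43, 247, 509)
n <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)

# Subjects 1 and 2 only, as on the slide.
p_12 <- prop.test(x[1:2], n[1:2])$p.value

# Equality of proportions across all tumor libraries (1 to 5)
# and across all control libraries (6 to 8).
p_tumor   <- prop.test(x[1:5], n[1:5])$p.value
p_control <- prop.test(x[6:8], n[6:8])$p.value
```

Small p-values here indicate that pooling the libraries within a group is not justified.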



Overdispersion

- There is biological/measurement variation between subjects.

- This will increase the variance of the total count: overdispersion.

- Failure to recognize this will lead to underestimation of the variance.

- Underestimation of the variance will lead to many false positives.



We need to account for extra (biological) variation.

- For every subject i, let Pi be random.

- For every subject i, Xi ∼ Bin(ni, Pi).

- Test if the Pi differ on average between the two groups.
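A small simulation illustrates why ignoring the random Pi is dangerous. In this sketch the Pi of both groups come from the same Beta distribution, so the null hypothesis holds, yet the pooled binomial test rejects far too often. The Beta(2, 1998) distribution, the library size of 100000 and the 4 versus 4 design are hypothetical choices.

```r
set.seed(1)
n_sim <- 500
n_lib <- rep(1e5, 8)              # eight libraries of equal size
group <- rep(c(1, 2), each = 4)   # four subjects per group

p_values <- replicate(n_sim, {
  # Same distribution of Pi in both groups: H0 is true.
  P <- rbeta(8, shape1 = 2, shape2 = 1998)
  X <- rbinom(8, size = n_lib, prob = P)
  # Pool within groups and apply the naive binomial test.
  x <- as.vector(tapply(X, group, sum))
  n <- as.vector(tapply(n_lib, group, sum))
  prop.test(x, n)$p.value
})

# Fraction of simulated experiments declared significant at level 0.05.
false_positive_rate <- mean(p_values < 0.05)
```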



Many ways to deal with overdispersion:

- log(Pi / (1 − Pi)) normally distributed (GLMM)
  - too slow to do a million tests or more.

- Pi beta distributed. R package BBSeq
  - seems to be too slow as well.

- If n is large and p is small, then we can use Poisson-Gamma or negative binomial. R packages baySeq, DESeq, edgeR.
  - we’ll discuss now.



Recall, Binomial(n, p) distribution has

mean = np and variance = np(1− p).

DESeq and edgeR assume Negative-Binomial(µ, φ) distribution with

mean = µ and variance = µ + φµ².

Much effort goes into estimating the dispersion φ when very few (biological) replicates are available.

- Here we should point out that it is generally better to increase the sample size, if you can.
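The mean-variance relation can be checked by simulation. In base R, rnbinom with arguments mu and size has variance mu + mu^2/size, so size = 1/φ gives the parametrization above. The values µ = 100 and φ = 0.2 are arbitrary illustrative choices.

```r
set.seed(1)
mu  <- 100
phi <- 0.2

# Negative binomial draws with mean mu and dispersion phi (size = 1 / phi).
x <- rnbinom(1e5, mu = mu, size = 1 / phi)

sample_mean <- mean(x)
sample_var  <- var(x)
theory_var  <- mu + phi * mu^2   # 2100, versus 100 for a Poisson
```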



From the edgeR manual



Much more on baySeq, DESeq and edgeR later today.



Quasi-Binomial “distribution”

A cheap trick is to add an extra dispersion parameter ϕ to the binomial:

variance = np(1 − p) becomes variance = ϕnp(1 − p).

To estimate, we maximize a “quasi likelihood”.

fit=glm(y~group,family=binomial(),weights=n)

simply becomes

fit=glm(y~group,family=quasibinomial(),weights=n)

Now we find p = 0.12.
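The two fits can be compared side by side. The sketch below reuses the tag 1 data; the object names are ours. Both fits give the same estimate, and the quasi-binomial standard error is the binomial one multiplied by the square root of the estimated dispersion.

```r
x <- c(129, 167, 71, 61, 6, 43, 247, 509)
n <- c(100474, 96631, 92510, 95785, 18705, 95155, 91593, 98220)
group <- factor(c(0, 0, 0, 0, 0, 1, 1, 1))
y <- x / n

fit_bin   <- glm(y ~ group, family = binomial(), weights = n)
fit_quasi <- glm(y ~ group, family = quasibinomial(), weights = n)

coef_bin   <- summary(fit_bin)$coefficients
coef_quasi <- summary(fit_quasi)$coefficients

dispersion <- summary(fit_quasi)$dispersion   # estimate of the extra parameter
p_bin   <- coef_bin["group1", 4]              # z test, dispersion fixed at 1
p_quasi <- coef_quasi["group1", 4]            # t test, dispersion estimated
se_ratio <- coef_quasi["group1", "Std. Error"] / coef_bin["group1", "Std. Error"]
```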



fit=glm(y~group,family=quasibinomial(),weights=n)

- accounts for overdispersion.

- few assumptions, hence robust.

- standard GLM: fast and reliable software available.

- standard GLM: theory well-developed.

- easy to include covariates.

- extends to dependent data: GEE.



- We condition on the library sizes, and test if the proportion of any given tag relative to all other tags is different between groups.

- But, if any tag differs between groups then other tags must do so as well, simply because

∑i pi = 1.

- In a two sample binomial experiment with

X1 ∼ Bin(n1, p1) and X2 ∼ Bin(n2, p2)

it makes no sense to test both

H0 : p1 = p2 and H0 : (1 − p1) = (1 − p2).
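A toy calculation makes the constraint concrete. The abundances below are hypothetical: only tag1 changes between the groups, yet the proportions of all other tags change as well.

```r
# Hypothetical absolute abundances of four tags in two groups.
abundance_a <- c(tag1 = 1000, tag2 = 100, tag3 = 100, tag4 = 100)
abundance_b <- abundance_a
abundance_b["tag1"] <- 4000   # only tag 1 differs between the groups

# Proportions are what a test conditional on library size compares.
prop_a <- abundance_a / sum(abundance_a)
prop_b <- abundance_b / sum(abundance_b)

round(rbind(prop_a, prop_b), 3)
```

Here the proportion of tag2 drops from 100/1300 to 100/4300 although its abundance is unchanged.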



- The problem is very noticeable if there are tags that are both very abundant and different between groups.

- Anders and Huber (2010) and Robinson and Oshlack (2010) suggest modifying the library sizes so that single tags have little influence.

- Alternatively, we might take a set of tags that we know won’t differ between groups and test all tags of interest relative to that set.
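One way to make library sizes insensitive to single tags is a median-of-ratios size factor, in the spirit of Anders and Huber (2010). The base R sketch below is an illustration under stated assumptions, not the DESeq implementation: the count matrix is hypothetical and all counts are assumed to be positive.

```r
# Hypothetical counts: library 2 is sequenced twice as deep as library 1,
# and tag5 is both very abundant and different between the libraries.
counts <- cbind(
  lib1 = c(10, 20, 30, 40, 1000),
  lib2 = c(20, 40, 60, 80, 10000)
)
rownames(counts) <- paste0("tag", 1:5)

# Log of the geometric mean of each tag across libraries (needs counts > 0).
log_geo_mean <- rowMeans(log(counts))

# Size factor per library: median ratio of its counts to the geometric means.
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col) - log_geo_mean))
})

naive_ratio  <- sum(counts[, "lib2"]) / sum(counts[, "lib1"])  # driven by tag5
robust_ratio <- size_factors[["lib2"]] / size_factors[["lib1"]]
```

The ratio of total counts is dominated by tag5, while the median-based ratio recovers the factor of two shared by the other tags.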



Methylation

Joint with

- Jelle Goeman.

- Bas Heijmans.

- Elmar Tobi.

n is not large and p is not small so baySeq, DESeq and edgeR cannot be used. Approximation with the normal distribution is also not appropriate.



Methylation

- addition of methyl group (CH3) to cytosine.

- occurs at CpG sites.

- inherited during mitosis, not meiosis.

- stably affects transcription.



Jirtle and Skinner, 2007



Dutch Famine 1944–1945

- Well documented, brief period allows the study of the effects of famine on human health.

- Children exposed to famine in utero compared to siblings:
  - diabetes
  - obesity
  - cardiovascular disease

How can we measure methylation?



Reduced Representation Bisulphite Sequencing

[Figure: RRBS workflow. Steps shown: MspI digestion and size selection (40-220 bp); end repair and adaptor ligation; bisulfite treatment; PCR; sequencing of short reads; alignment to the reduced representation and inference of methylation.]



Data

24 same-sex sibling pairs. One sib was exposed in utero to the Dutch Famine.

family        9900  9900  9930  9930 18240 18240 17650 17650 17910 17910  ...
case             0     1     0     1     0     1     0     1     0     1  ...

cg02824864
coverage        16     8     5     7     4    10     7     1    12    33  ...
methylated      11     5     4     7     4     8     3     1     9    25  ...
...

Three million different CpGs.



- First, we remove all CpGs with insufficient coverage.

- Next, we test for differences in methylation fraction between exposed and unexposed individuals for every CpG.

- Finally, we test various groups of CpGs simultaneously.

We can account for relatedness by introducing a fixed effect per family, very similar to a paired t test.

For more complicated designs, we would need GEE.



family        9900  9900  9930  9930 18240 18240 17650 17650 17910 17910  ...
case             0     1     0     1     0     1     0     1     0     1  ...

cg02824864
coverage (n)    16     8     5     7     4    10     7     1    12    33  ...
methyl. (x)     11     5     4     7     4     8     3     1     9    25  ...
...

y=x/n
fit=glm(y~case+fam,family=quasibinomial(),weights=n)
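A runnable sketch of this fit for cg02824864, using only the five sibling pairs visible in the table. The remaining pairs are elided in the source, so the resulting numbers are illustrative only; fam is assumed to be the family identifier coded as a factor.

```r
# First five sibling pairs from the table (the rest are not shown there).
fam  <- factor(c(9900, 9900, 9930, 9930, 18240, 18240, 17650, 17650, 17910, 17910))
case <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)           # 1 = exposed in utero
n    <- c(16, 8, 5, 7, 4, 10, 7, 1, 12, 33)       # coverage
x    <- c(11, 5, 4, 7, 4, 8, 3, 1, 9, 25)         # methylated reads

y <- x / n
# One fixed effect per family, plus the exposure effect of interest.
fit <- glm(y ~ case + fam, family = quasibinomial(), weights = n)

case_row <- summary(fit)$coefficients["case", ]    # estimate, SE, t value, p-value
p_case <- unname(case_row[4])
```

With three million CpGs, the same call would be applied to each site in turn.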



How to test sets of CpGs?

Next speaker!

