the problem of detecting differentially expressed genes
TRANSCRIPT
![Page 1: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/1.jpg)
The Problem of Detecting Differentially Expressed Genes
![Page 2: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/2.jpg)
Expression data matrix: rows are genes (Gene 1, Gene 2, ..., Gene N); columns are samples (Sample 1, Sample 2, ..., Sample M).
![Page 3: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/3.jpg)
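Expression data matrix: rows are genes (Gene 1, Gene 2, ..., Gene N); columns are samples (Sample 1, Sample 2, ..., Sample M). The samples are divided into two classes, Class 1 and Class 2.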
![Page 4: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/4.jpg)
Fold Change is the Simplest Method
Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
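The fold-change rule can be sketched in a few lines. This is only an illustration: the gene names, the expression values, and the helper names `log2_fold_change` and `fold_change_calls` are made up, and the class means are assumed to be on the raw (positive) scale.

```python
import math

def log2_fold_change(class1, class2):
    """Log2 ratio of the mean expression in class 2 to that in class 1."""
    mean1 = sum(class1) / len(class1)
    mean2 = sum(class2) / len(class2)
    return math.log2(mean2 / mean1)

def fold_change_calls(expr, cutoff=1.0):
    """Genes whose absolute log2 ratio exceeds the cutoff (1.0 means two-fold)."""
    return [gene for gene, (c1, c2) in expr.items()
            if abs(log2_fold_change(c1, c2)) > cutoff]

# Hypothetical expression values for two genes, three samples per class.
expr = {
    "gene_a": ([100, 120, 110], [450, 430, 470]),
    "gene_b": ([200, 210, 190], [220, 205, 215]),
}
calls = fold_change_calls(expr)
```

The cutoff is arbitrary, and no variability enters the calculation, which is why the rule is not a statistical test.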
![Page 5: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/5.jpg)
Test of a Single Hypothesis
(1) For gene j, consider the null hypothesis of no association between its expression level and its class membership.
(2) Decide on a level of significance (commonly 5%).
(3) Perform a test (e.g. Student’s t-test) for each gene.
(4) Obtain the P-value corresponding to that test statistic.
(5) Compare the P-value with the significance level. Then either reject or retain the null hypothesis.
![Page 6: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/6.jpg)
Two-Sample t-Statistic
Student’s t-statistic:
$$W_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{\sqrt{s_{1j}^2/n_1 + s_{2j}^2/n_2}}$$
![Page 7: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/7.jpg)
Two-Sample t-Statistic
Pooled form of the Student’s t-statistic, assumed common variance in the two classes:
$$W_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j\sqrt{1/n_1 + 1/n_2}}$$
![Page 8: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/8.jpg)
Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001):
$$W_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j\sqrt{1/n_1 + 1/n_2} + a_0}$$
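The three forms of the two-sample statistic can be compared side by side. The sketch below assumes the usual pooled standard deviation for s_j (a weighted average of the two class variances); the function name `t_statistics`, the value of a0, and the data are illustrative only.

```python
import math
from statistics import mean, variance

def t_statistics(y1, y2, a0=0.0):
    """Return (unpooled, pooled, modified) two-sample t-statistics for one gene."""
    n1, n2 = len(y1), len(y2)
    diff = mean(y1) - mean(y2)
    s1_sq, s2_sq = variance(y1), variance(y2)  # sample variances per class
    unpooled = diff / math.sqrt(s1_sq / n1 + s2_sq / n2)
    # Pooled standard deviation under the common-variance assumption.
    s_pooled = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    std_err = s_pooled * math.sqrt(1 / n1 + 1 / n2)
    pooled = diff / std_err
    modified = diff / (std_err + a0)  # a0 > 0 damps genes with tiny variance
    return unpooled, pooled, modified

unpooled, pooled, modified = t_statistics([1, 2, 3], [4, 5, 6], a0=0.5)
```

With equal class sizes and equal sample variances the first two forms coincide; the constant a0 always shrinks the modified statistic towards zero.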
![Page 9: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/9.jpg)
Types of Errors in Hypothesis Testing

| TRUE \ PREDICTED | Retain Null | Reject Null |
|---|---|---|
| Null | | False positive |
| Non-null | False negative | |
![Page 10: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/10.jpg)
Multiplicity Problem
When many hypotheses are tested, the probability of at least one false positive increases sharply with the number of hypotheses.
Further: genes are co-regulated, and consequently there is correlation between the test statistics.
![Page 11: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/11.jpg)
Example
Suppose we measure the expression of 10,000 genes in a microarray experiment.
If all 10,000 genes were not differentially expressed, then we would expect for:
P = 0.05 for each test, 500 false positives.
P = 0.05/10,000 for each test, 0.05 false positives.
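The arithmetic of this example is easy to reproduce. The second function additionally assumes independent tests, which is an idealisation given the correlation between genes; both function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests

per_test = expected_false_positives(10_000, 0.05)
bonferroni = expected_false_positives(10_000, 0.05 / 10_000)
fwer_per_test = prob_at_least_one_false_positive(10_000, 0.05)
fwer_bonferroni = prob_at_least_one_false_positive(10_000, 0.05 / 10_000)
```

At the 5% level per test a false positive is practically certain; with the divided level the chance of any false positive stays just below 5%.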
![Page 12: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/12.jpg)
Controlling the Error Rate
• Methods for controlling false positives, e.g. Bonferroni, are too strict for microarray analyses
• Use the False Discovery Rate (FDR) instead (Benjamini and Hochberg 1995)
![Page 13: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/13.jpg)
Methods for dealing with the Multiplicity Problem
• The Bonferroni Method controls the family-wise error rate (FWER), i.e. the probability that at least one false positive error will be made. Too strict for gene expression data: it tries to make it unlikely that even one false rejection of the null is made, which may lead to missed findings.
• The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes. Good for gene expression data: it says something about the chosen genes.
![Page 14: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/14.jpg)
False Discovery Rate, Benjamini and Hochberg (1995)
$$\mathrm{FDR} \approx \frac{\#(\text{false positives})}{\#(\text{rejected hypotheses})}$$
The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.
![Page 15: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/15.jpg)
Possible Outcomes for N Hypothesis Tests

| | Accept Null | Reject Null | Total |
|---|---|---|---|
| Null True | N00 | N01 | N0 |
| Non-True | N10 | N11 | N1 |
| Total | N - Nr | Nr | N |
![Page 16: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/16.jpg)
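$$\mathrm{FDR} = E\left\{\frac{N_{01}}{N_r \vee 1}\right\}, \quad \text{where } N_r \vee 1 = \max(1, N_r)$$
$$\mathrm{FNDR} = E\left\{\frac{N_{10}}{(N - N_r) \vee 1}\right\}$$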
![Page 17: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/17.jpg)
Positive FDR
$$\mathrm{pFDR} = E\{N_{01}/N_r \mid N_r > 0\} = \mathrm{FDR}/\mathrm{pr}\{N_r > 0\}$$
![Page 18: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/18.jpg)
Lindsay, Kettenring, and Siegmund (2004).
A Report on the Future of Statistics.
Statist. Sci. 19.
![Page 19: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/19.jpg)
![Page 20: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/20.jpg)
Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
• Genovese and Wasserman (2002)
• Storey (2002, 2003)
• Storey and Tibshirani (2003a, 2003b)
• Storey, Taylor and Siegmund (2004)
• Black (2004)
• Cox and Wong (2004)
![Page 21: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/21.jpg)
Benjamini-Hochberg (BH) Procedure
Controls the FDR at level $\alpha$ when the P-values following the null distribution are independent and uniformly distributed.
(1) Let $p_{(1)} \le \cdots \le p_{(N)}$ be the observed P-values.
(2) Calculate $\hat{k} = \arg\max_{1 \le k \le N} \{k : p_{(k)} \le \alpha k / N\}$.
(3) If $\hat{k}$ exists then reject null hypotheses corresponding to $p_{(1)} \le \cdots \le p_{(\hat{k})}$. Otherwise, reject nothing.
![Page 22: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/22.jpg)
Example: Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are carried out leading to the following ordered P-values:
0.00017 0.00448 0.00671 0.00907 0.01220 0.33626 0.39341 0.53882 0.58125 0.98617
(a) With α = 0.05, the Bonferroni test rejects any hypothesis whose P-value is less than α / 10 = 0.005.
Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that P(k) ≤ kα / N. Here k = 5 (since 0.01220 ≤ 5 × 0.05 / 10 = 0.025 while 0.33626 > 0.03), thus we reject the first five hypotheses.
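Both procedures fit in a few lines of plain Python. The sketch below applies them to the ten P-values of the example at α = 0.05; the function names are illustrative and nothing here comes from a particular package.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject when the P-value is below alpha divided by the number of tests."""
    n = len(pvalues)
    return [p < alpha / n for p in pvalues]

def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Reject the k_hat smallest P-values, k_hat = max{k : p_(k) <= alpha * k / N}."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k_hat = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / n:
            k_hat = rank  # keep the largest rank that satisfies the bound
    reject = [False] * n
    for i in order[:k_hat]:
        reject[i] = True
    return reject

pvals = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
         0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
n_bonferroni = sum(bonferroni_reject(pvals))
n_bh = sum(benjamini_hochberg_reject(pvals))
```

Note that BH rejects every hypothesis ranked at or below k_hat, even one whose own P-value exceeds its individual bound.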
![Page 23: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/23.jpg)
q-VALUE
The q-value of a gene j is the expected proportion of false positives when calling that gene significant.
The P-value is the probability under the null hypothesis of obtaining a value of the test statistic as or more extreme than its observed value.
The q-value for an observed test statistic can be viewed as the expected proportion of false positives among all genes with their test statistics as or more extreme than the observed value.
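One common way to estimate q-values from a list of P-values is sketched below. It is an assumption of this sketch, not something stated above, that the estimate takes the form q(i) = min over k ≥ i of π0 N p(k) / k, with π0 set conservatively to 1 unless an estimate is supplied; the function name `q_values` is illustrative.

```python
def q_values(pvalues, pi0=1.0):
    """Step-up q-value estimates; pi0 = 1 gives the most conservative version."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so each q is a minimum over larger ranks.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * n * pvalues[i] / rank)
        q[i] = running_min
    return q

pvals = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
         0.33626, 0.39341, 0.53882, 0.58125, 0.98617]
q = q_values(pvals)
n_significant = sum(qj < 0.05 for qj in q)
```

For these ten P-values, thresholding the estimated q-values at 0.05 selects the five smallest.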
![Page 24: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/24.jpg)
LIST OF SIGNIFICANT GENES
Call all genes significant if pj < 0.05
or
Call all genes significant if qj < 0.05
to produce a set of significant genes so that a proportion of them (< 0.05) is expected to be false (at least for a large number of genes, not necessarily independent)
![Page 25: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/25.jpg)
BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive tumours, N = 3226 genes
P = 0.001 gave 51 genes differentially expressed
P = 0.0001 gave 9-11 genes
![Page 26: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/26.jpg)
Using q < 0.05, 160 genes are taken to be significant.
This means that approximately 8 of these 160 genes are expected to be false positives.
Also, it is estimated that 33% of the genes are differentially expressed.
![Page 27: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/27.jpg)
![Page 28: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/28.jpg)
![Page 29: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/29.jpg)
Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the order of the number of permutations.
If we perform B permutations, then the P-value will be estimated with a resolution of 1/B.
If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.
![Page 30: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/30.jpg)
Using just the B permutations of the class labels for the gene-specific statistic Wj, the P-value for Wj = wj is assessed as:
$$p_j = \frac{\#\{b : |w_{0j}^{(b)}| \ge |w_j|\}}{B},$$
where $w_{0j}^{(b)}$ is the null version of wj after the bth permutation of the class labels.
![Page 31: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/31.jpg)
If we pool over all N genes, then:
$$p_j = \sum_{b=1}^{B} \frac{\#\{i : |w_{0i}^{(b)}| \ge |w_j|,\ i = 1, \ldots, N\}}{NB}$$
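The gene-specific and pooled permutation P-values can be sketched as follows. For brevity the statistic is a plain difference of class means rather than a t-statistic, the permutations are drawn at random with a fixed seed, and all names and data are made up for illustration.

```python
import random
from statistics import mean

def mean_diff(values, labels):
    """Difference of class means; a stand-in for the t-statistic w_j."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 2]
    return mean(g1) - mean(g2)

def permutation_pvalues(data, labels, n_perm=200, seed=0):
    """Return (gene-specific, pooled) permutation P-values for each gene."""
    rng = random.Random(seed)
    n_genes = len(data)
    observed = [mean_diff(row, labels) for row in data]
    null = []  # null[b][i]: statistic of gene i after the b-th permutation
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        null.append([mean_diff(row, perm) for row in data])
    gene_specific = [
        sum(abs(null[b][j]) >= abs(observed[j]) for b in range(n_perm)) / n_perm
        for j in range(n_genes)
    ]
    pooled_null = [abs(w) for row in null for w in row]
    pooled = [
        sum(w >= abs(observed[j]) for w in pooled_null) / (n_genes * n_perm)
        for j in range(n_genes)
    ]
    return gene_specific, pooled

data = [[1, 2, 3, 10, 11, 12],   # hypothetical gene with a large class difference
        [5, 6, 5, 6, 5, 6]]      # hypothetical gene with no class difference
labels = [1, 1, 1, 2, 2, 2]
gene_specific, pooled = permutation_pvalues(data, labels)
```

The pooled version divides by N × B, which is where the finer resolution of 1/(NB) comes from.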
![Page 32: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/32.jpg)
Null Distribution of the Test Statistic: Example
Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.

| | Class 1 | Class 2 |
|---|---|---|
| Gene 1 | A1(1) A2(1) A3(1) | B4(1) B5(1) B6(1) |
| Gene 2 | A1(2) A2(2) A3(2) | B4(2) B5(2) B6(2) |
![Page 33: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/33.jpg)
| | Class 1 | Class 2 |
|---|---|---|
| Gene 1 | A1(1) A2(1) A3(1) | B4(1) B5(1) B6(1) |
| Gene 2 | A1(2) A2(2) A3(2) | B4(2) B5(2) B6(2) |

To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1) so that:

Gene 1: A1(1) A2(1) A3(1) | A4(1) A5(1) A6(1)

And permute the class labels:

Perm. 1: A1(1) A2(1) A4(1) | A3(1) A5(1) A6(1)
...
There are 10 distinct permutations.
![Page 34: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/34.jpg)
Ten Permutations of Gene 1

| Class 1 | Class 2 |
|---|---|
| A1(1) A2(1) A3(1) | A4(1) A5(1) A6(1) |
| A1(1) A2(1) A4(1) | A3(1) A5(1) A6(1) |
| A1(1) A2(1) A5(1) | A3(1) A4(1) A6(1) |
| A1(1) A2(1) A6(1) | A3(1) A4(1) A5(1) |
| A1(1) A3(1) A4(1) | A2(1) A5(1) A6(1) |
| A1(1) A3(1) A5(1) | A2(1) A4(1) A6(1) |
| A1(1) A3(1) A6(1) | A2(1) A4(1) A5(1) |
| A1(1) A4(1) A5(1) | A2(1) A3(1) A6(1) |
| A1(1) A4(1) A6(1) | A2(1) A3(1) A5(1) |
| A1(1) A5(1) A6(1) | A2(1) A3(1) A4(1) |
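The count of ten can be checked by enumeration. There are 20 ways to choose three of six samples for the first class, but swapping the two class labels only changes the sign of a two-sample statistic, so a split and its mirror image are counted once. The helper name `distinct_splits` and the sample names are illustrative.

```python
from itertools import combinations

def distinct_splits(samples, n1):
    """Splits into a first group of size n1 and the rest, counting mirror images once."""
    seen = set()
    splits = []
    for group1 in combinations(samples, n1):
        group2 = tuple(s for s in samples if s not in group1)
        key = frozenset([group1, group2])  # same key for a split and its mirror
        if key not in seen:
            seen.add(key)
            splits.append((group1, group2))
    return splits

splits = distinct_splits(["A1", "A2", "A3", "A4", "A5", "A6"], 3)
```

The enumeration returns the splits in the same order as the table, each with A1 in the first group.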
![Page 35: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/35.jpg)
As there are only 10 distinct permutations here, the null distribution based on these permutations is too granular.
Hence consideration is given to permuting the labels of each of the other genes and estimating the null distribution of a gene based on the pooled permutations so obtained.
But there is a problem with this method in that the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.
![Page 36: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/36.jpg)
Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1.
Suppose that Gene 2 is differentially expressed. Then the null values of the test statistic for Gene 2 will have a mixed distribution.
![Page 37: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/37.jpg)
| | Class 1 | Class 2 |
|---|---|---|
| Gene 1 | A1(1) A2(1) A3(1) | B4(1) B5(1) B6(1) |
| Gene 2 | A1(2) A2(2) A3(2) | B4(2) B5(2) B6(2) |

Gene 2: A1(2) A2(2) A3(2) | B4(2) B5(2) B6(2)

Permute the class labels:

Perm. 1: A1(2) A2(2) B4(2) | A3(2) B5(2) B6(2)
...
![Page 38: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/38.jpg)
Example of a null case: with 7 N(0,1) points and 8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with t13 density superimposed.
![Page 39: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/39.jpg)
Example of a non-null case: with 7 N(0,1) points and 8 N(10,9) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with t13 density superimposed.
![Page 40: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/40.jpg)
The SAM Method
Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001).
The order statistics t(1), ... , t(N) are plotted against their null expectations:
$$\bar{w}_{0(j)} = (1/B) \sum_{b=1}^{B} w_{0(j)}^{(b)} \quad (j = 1, \ldots, N).$$
A good test in situations where there are more genes being over-expressed than under-expressed, or vice-versa.
![Page 41: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/41.jpg)
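The FDR and other error rates

| TRUE \ PREDICTED | Retain Null | Reject Null |
|---|---|---|
| Null | a | b (false positive) |
| Non-null | c (false negative) | d |

FDR ≈ b / R, where R = b + d is the number of rejected null hypotheses.
FNDR ≈ c / (a + c)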
![Page 42: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/42.jpg)
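The FDR and other error rates

| TRUE \ PREDICTED | Retain Null | Reject Null |
|---|---|---|
| Null | a | b (false positive) |
| Non-null | c (false negative) | d |

FDR ≈ b / R, where R = b + d is the number of rejected null hypotheses.
FNR = c / (c + d)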
![Page 43: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/43.jpg)
Toy Example with 10,000 Genes

| True \ Test Result | non-DE | DE | Total |
|---|---|---|---|
| non-DE | A = 9025 | B = 475 | 9500 |
| DE | C = 100 | D = 400 | 500 |
| Total | 9125 | 875 | 10000 |

FDR = B / (B + D) = 475 / 875 = 54%
FNDR = C / (A + C) = 100 / 9125 = 1%
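The toy counts can be turned into error rates directly. The sketch uses the cell labels A to D of the table; the FNR and FPR lines apply the usual definitions c / (c + d) and b / (a + b), and the function name `error_rates` is illustrative.

```python
def error_rates(a, b, c, d):
    """Error rates from a 2x2 table: a, b are true nulls; c, d are true non-nulls."""
    return {
        "FDR": b / (b + d),    # false positives among the rejected
        "FNDR": c / (a + c),   # false negatives among the retained
        "FNR": c / (c + d),    # false negatives among the truly DE genes
        "FPR": b / (a + b),    # false positives among the truly non-DE genes
    }

rates = error_rates(a=9025, b=475, c=100, d=400)
```

The 5% false positive rate looks harmless per test, yet more than half of the 875 selected genes are false discoveries because true nulls dominate.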
![Page 44: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/44.jpg)
Two-component mixture model
$$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$$
$\pi_0$ is the proportion of genes that are not differentially expressed, and $\pi_1 = 1 - \pi_0$.
![Page 45: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/45.jpg)
Use of the P-Value as a Summary Statistic (Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj in the test of the null hypothesis of no difference in expression between the two classes.
The distribution of the P-value is modelled by the h-component mixture model
$$f(p_j) = \sum_{i=1}^{h} \pi_i f(p_j; \alpha_{i1}, \alpha_{i2}),$$
where $\alpha_{11} = \alpha_{12} = 1$.
![Page 46: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/46.jpg)
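Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for the jth gene, pj will have a uniform distribution on the unit interval; i.e. the $\beta_{1,1}$ distribution.
The $\beta_{\alpha_1,\alpha_2}$ density is given by
$$f(u; \alpha_1, \alpha_2) = \{u^{\alpha_1 - 1}(1 - u)^{\alpha_2 - 1} / B(\alpha_1, \alpha_2)\}\, I_{(0,1)}(u),$$
where
$$B(\alpha_1, \alpha_2) = \Gamma(\alpha_1)\Gamma(\alpha_2) / \Gamma(\alpha_1 + \alpha_2).$$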
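A small sketch of the beta mixture density for P-values follows. The first component is the uniform β(1,1) null; the second component's parameters (1, 4) and the mixing weights are hypothetical values chosen only to show the shape of the calculation.

```python
import math

def beta_density(u, a1, a2):
    """Beta(a1, a2) density on the open unit interval, zero elsewhere."""
    if not 0.0 < u < 1.0:
        return 0.0
    b = math.gamma(a1) * math.gamma(a2) / math.gamma(a1 + a2)
    return u ** (a1 - 1) * (1 - u) ** (a2 - 1) / b

def mixture_density(p, weights, params):
    """h-component beta mixture density evaluated at the P-value p."""
    return sum(w * beta_density(p, a1, a2) for w, (a1, a2) in zip(weights, params))

weights = [0.7, 0.3]              # hypothetical mixing proportions
params = [(1.0, 1.0), (1.0, 4.0)]  # uniform null plus one hypothetical non-null component
density_at_half = mixture_density(0.5, weights, params)
```

A non-null component with α2 > 1 puts extra mass on small P-values, which is the excess over uniformity that the mixture is meant to capture.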
![Page 47: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/47.jpg)
![Page 48: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/48.jpg)
• Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96,1151-1160.
• Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104.
• Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.
• Efron B (2005) Local False Discovery Rates.
• Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.
![Page 49: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/49.jpg)
• McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.
• McLachlan GJ, Bean RW, Ben-Tovim Jones L. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.
![Page 50: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/50.jpg)
Two component mixture model
π0 is the proportion of genes that are not differentially expressed. The two-component mixture model is:
$$f(w_j) = \pi_0 f_0(w_j) + \pi_1 f_1(w_j)$$
Using Bayes’ Theorem, we calculate the posterior probability that gene j is not differentially expressed:
$$\tau_0(w_j) = \frac{\pi_0 f_0(w_j)}{f(w_j)}$$
![Page 51: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/51.jpg)
If we conclude that gene j is differentially expressed if:
$$\hat{\tau}_0(w_j) \le c_0,$$
then this decision minimizes the (estimated) Bayes risk
![Page 52: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/52.jpg)
where
$$\mathrm{Risk} = (1 - c_0)\pi_0 e_{01} + c_0 \pi_1 e_{10},$$
and e01 is the probability of a false positive and e10 is the probability of a false negative.
![Page 53: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/53.jpg)
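Estimated FDR
$$\widehat{FDR} = \sum_{j=1}^{N} \hat{\tau}_0(w_j)\, I_{[0,c_0]}(\hat{\tau}_0(w_j)) / N_r,$$
where
$$N_r = \sum_{j=1}^{N} I_{[0,c_0]}(\hat{\tau}_0(w_j))$$
is the number of selected genes.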
![Page 54: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/54.jpg)
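Similarly, the false positive rate is given by
$$\widehat{FPR} = \sum_{j=1}^{N} \hat{\tau}_0(w_j)\, I_{[0,c_0]}(\hat{\tau}_0(w_j)) \Big/ \sum_{j=1}^{N} \hat{\tau}_0(w_j)$$
and the false non-discovery rate and false negative rate by:
$$\widehat{FNDR} = \sum_{j=1}^{N} \hat{\tau}_1(w_j)\, I_{(c_0,\infty)}(\hat{\tau}_0(w_j)) / (N - N_r)$$
$$\widehat{FNR} = \sum_{j=1}^{N} \hat{\tau}_1(w_j)\, I_{(c_0,\infty)}(\hat{\tau}_0(w_j)) \Big/ \sum_{j=1}^{N} \hat{\tau}_1(w_j)$$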
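Given estimated posterior probabilities of non-differential expression, these error rates reduce to sums over the selected and unselected genes. The sketch assumes τ̂1 = 1 − τ̂0 and that a gene is selected when τ̂0 ≤ c0; the function name and the four posterior values are made up for illustration.

```python
def estimated_error_rates(tau0, c0):
    """Estimate Nr, FDR, FNDR, FPR and FNR from posterior null probabilities."""
    n = len(tau0)
    selected = [t <= c0 for t in tau0]
    n_r = sum(selected)
    # Posterior mass of "null" among selected genes, and of "non-null" among the rest.
    null_mass_selected = sum(t for t, s in zip(tau0, selected) if s)
    nonnull_mass_unselected = sum(1 - t for t, s in zip(tau0, selected) if not s)
    total_null = sum(tau0)
    total_nonnull = n - total_null
    return {
        "Nr": n_r,
        "FDR": null_mass_selected / n_r if n_r else 0.0,
        "FNDR": nonnull_mass_unselected / (n - n_r) if n_r < n else 0.0,
        "FPR": null_mass_selected / total_null if total_null else 0.0,
        "FNR": nonnull_mass_unselected / total_nonnull if total_nonnull else 0.0,
    }

rates = estimated_error_rates([0.02, 0.08, 0.5, 0.9], c0=0.1)
```

The estimated FDR is simply the average τ̂0 over the selected genes, so it can never exceed the threshold c0.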
![Page 55: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/55.jpg)
Glonek and Solomon (2003)
F0: N(0,1), π0 = 0.9; F1: N(1,1), π1 = 0.1. Reject H0 if z ≥ 2. τ0(2) = 0.99972 but FDR = 0.17.
F0: N(0,1), π0 = 0.6; F1: N(1,1), π1 = 0.4. Reject H0 if z ≥ 2. τ0(2) = 0.251 but FDR = 0.177.
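The contrast between the posterior probability at the cutoff and the FDR of the whole rejection region can be reproduced for the configuration with π0 = 0.6. The sketch takes F0 = N(0,1) and F1 = N(1,1) as printed and uses only the standard library; the function names are illustrative.

```python
from statistics import NormalDist

def tau0(z, pi0, f0, f1):
    """Posterior probability that an observation at z comes from the null."""
    null_part = pi0 * f0.pdf(z)
    return null_part / (null_part + (1 - pi0) * f1.pdf(z))

def tail_fdr(z, pi0, f0, f1):
    """Proportion of nulls among all observations at or beyond z."""
    null_tail = pi0 * (1 - f0.cdf(z))
    nonnull_tail = (1 - pi0) * (1 - f1.cdf(z))
    return null_tail / (null_tail + nonnull_tail)

f0, f1 = NormalDist(0, 1), NormalDist(1, 1)
local_at_cutoff = tau0(2.0, 0.6, f0, f1)
fdr_of_region = tail_fdr(2.0, 0.6, f0, f1)
```

The value at the boundary z = 2 is higher than the rate for the region z ≥ 2 because the region also contains more extreme, more convincing observations.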
![Page 56: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/56.jpg)
![Page 57: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/57.jpg)
![Page 58: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/58.jpg)
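Gene Statistics: Two-Sample t-Statistic
Student’s t-statistic:
$$w_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{\sqrt{s_{1j}^2/n_1 + s_{2j}^2/n_2}}$$
Pooled form of the Student’s t-statistic, assumed common variance in the two classes:
$$w_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j\sqrt{1/n_1 + 1/n_2}}$$
Modified t-statistic of Tusher et al. (2001):
$$w_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j\sqrt{1/n_1 + 1/n_2} + a_0}$$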
![Page 59: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/59.jpg)
The Procedure
1. Obtain the z-score for each of the genes: $z_j = \Phi^{-1}(1 - p_j)$
2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).
3. The posterior probability of non-differential expression of gene j is given by $\tau_0(z_j)$.
4. Conclude gene j to be differentially expressed if $\tau_0(z_j) < c_0$
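Step 1 can be sketched with the standard library alone, assuming the z-score is defined as zj = Φ⁻¹(1 − pj), so that small P-values map to large positive z-scores. The function name `z_scores` is illustrative; `NormalDist.inv_cdf` requires each P-value to lie strictly between 0 and 1.

```python
from statistics import NormalDist

def z_scores(pvalues):
    """Map P-values to z-scores via the standard normal quantile function."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(1.0 - p) for p in pvalues]

z = z_scores([0.5, 0.025, 0.001])
# Ranking by decreasing z gives the same order as ranking by increasing P-value.
ranking = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
```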
![Page 60: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/60.jpg)
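$$\tau_0(z_j) \le c \iff \frac{\pi_0 f_0(z_j)}{\pi_0 f_0(z_j) + \pi_1 f_1(z_j)} \le c \iff \frac{f_1(z_j)}{f_0(z_j)} \ge \frac{1 - c}{c} \cdot \frac{\pi_0}{\pi_1}$$
With π0/π1 = 9 (that is, π0 = 0.9), this becomes
$$\frac{f_1(z_j)}{f_0(z_j)} \ge \frac{9(1 - c)}{c},$$
e.g. c = 0.2 gives
$$\frac{f_1(z_j)}{f_0(z_j)} \ge 36.$$
A larger ratio π0/π1 raises this threshold further.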
![Page 61: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/61.jpg)
Suppose π0 ≥ 0.9, so that π0/π1 ≥ 9.
Then τ0(zj) ≤ 0.2 requires
$$\frac{f_1(z_j)}{f_0(z_j)} \ge 36.$$
Much stronger level of evidence against the null than in standard one-at-a-time testing.
![Page 62: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/62.jpg)
e.g.
Zj ~ N(μ,1)
H0 : μ = 0 vs H1 : μ = 2.8
Rejecting H0 if |Zj| ≥ 1.96 yields a two-sided test of size 0.05 and power 0.80.
f1(1.96)/f0(1.96) = 4.8.
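The numbers in this example can be checked directly. The sketch computes the likelihood ratio at the conventional cutoff, the power of the two-sided test, and the ratio that a posterior threshold of c = 0.2 would demand when π0 = 0.9; the function names are illustrative.

```python
from statistics import NormalDist

def likelihood_ratio(z, mu1):
    """f1(z) / f0(z) for N(mu1, 1) against N(0, 1)."""
    return NormalDist(mu1, 1).pdf(z) / NormalDist(0, 1).pdf(z)

def required_ratio(c, pi0):
    """Smallest f1/f0 for which tau0 <= c, given the null proportion pi0."""
    return (1 - c) / c * pi0 / (1 - pi0)

alt = NormalDist(2.8, 1)
ratio_at_cutoff = likelihood_ratio(1.96, 2.8)
power = (1 - alt.cdf(1.96)) + alt.cdf(-1.96)  # both rejection tails under H1
posterior_threshold = required_ratio(0.2, 0.9)
```

A ratio of about 4.8 is enough to reject in the single test, far short of the 36 needed under the posterior rule.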
![Page 63: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/63.jpg)
[Figure: four panels of z-scores: null case, +1, +2, +3]
![Page 64: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/64.jpg)
zj j
Use of Wilson-Hilferty Transformation as in Broet et al. (2004)
![Page 65: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/65.jpg)
Plot of approximate z-scores obtained by Wilson-Hilferty transformation of simulated values of F-statistic with 1 and 8 degrees of freedom.
[Figure: histogram; x-axis: Transformed t-statistics, y-axis: Frequency]
![Page 66: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/66.jpg)
EMMIX-FDR
A program has been written in R which interfaces with EMMIX to implement the algorithm described in McLachlan et al (2006).
We fit a mixture of two normal components to test statistics calculated from the gene expression data.
![Page 67: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/67.jpg)
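When we equate the sample mean and variance of the mixture to their population counterparts, we obtain:
$$\bar{z} = \hat{\pi}_0 \hat{\mu}_0 + \hat{\pi}_1 \hat{\mu}_1$$
$$s_z^2 = \hat{\pi}_0 \hat{\sigma}_0^2 + \hat{\pi}_1 \hat{\sigma}_1^2 + \hat{\pi}_0 \hat{\pi}_1 (\hat{\mu}_0 - \hat{\mu}_1)^2$$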
![Page 68: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/68.jpg)
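When we are working with the theoretical null, we can easily estimate the mean and variance of the non-null component with the following formulae.
$$\hat{\mu}_1 = \bar{z} / (1 - \hat{\pi}_0)$$
$$\hat{\sigma}_1^2 = \{s_z^2 - \hat{\pi}_0 - \hat{\pi}_0 (1 - \hat{\pi}_0) \hat{\mu}_1^2\} / (1 - \hat{\pi}_0)$$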
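These moment estimates are a direct calculation once the sample mean, the sample variance, and an estimate of π0 are in hand. The sketch assumes the theoretical N(0,1) null; the function name and the numbers in the check are made up, with the moments built from a mixture whose non-null component has mean 2 and variance 1.5.

```python
def nonnull_from_moments(z_bar, s2_z, pi0):
    """Mean and variance of the non-null component under a N(0, 1) null."""
    pi1 = 1.0 - pi0
    mu1 = z_bar / pi1
    var1 = (s2_z - pi0 - pi0 * pi1 * mu1 ** 2) / pi1
    return mu1, var1

# Moments implied by pi0 = 0.8, mu1 = 2, sigma1^2 = 1.5:
# mean = 0.2 * 2 = 0.4, variance = 0.8 + 0.2 * 1.5 + 0.8 * 0.2 * 4 = 1.74.
mu1_hat, var1_hat = nonnull_from_moments(z_bar=0.4, s2_z=1.74, pi0=0.8)
```

With real data the variance estimate can come out negative if π̂0 is too large, so the result is best treated as a starting value rather than a final fit.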
![Page 69: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/69.jpg)
Following the approach of Storey and Tibshirani (2003) we can obtain an initial estimate for π0 as follows:
$$\hat{\pi}_0^{(0)}(\xi) = \#\{z_j : z_j < \xi\} / \{N \Phi(\xi)\}$$
This estimate of π0 is used as input to EMMIX as part of the initial parameter estimates. Thus no random or k-means starts are required, as is usually the case.
There are two different versions of EMMIX used, the standard version for the empirical null and a modified version for the theoretical null which fixes the mean and variance of the null component to be 0 and 1 respectively.
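The initial estimate of π0 counts the z-scores below a threshold ξ and compares the count with what a standard normal null alone would produce. The sketch below uses ξ = 0 as a hypothetical default, and the z-scores are made up; the idea is that genes with z below ξ are taken to be predominantly null.

```python
from statistics import NormalDist

def initial_pi0(z, xi=0.0):
    """Initial estimate of pi0: #{z_j < xi} / (N * Phi(xi))."""
    count_below = sum(1 for zj in z if zj < xi)
    return count_below / (len(z) * NormalDist().cdf(xi))

z = [-1.0, -0.5, 0.2, 1.0, 2.5, -0.1, 0.3, 3.0, -2.0, 1.5]
pi0_start = initial_pi0(z)
```

The estimate is not constrained to be at most 1, so in practice it may need to be truncated before being used as a starting value.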
![Page 70: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/70.jpg)
Theoretical and empirical nulls
Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance.
![Page 71: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/71.jpg)
Examples
We examined the performance of EMMIX-FDR on three well-known data sets from the literature.
• Alon colon cancer data (1999)
• Hedenfalk breast cancer data (2001)
• van’t Wout HIV data (2003)
![Page 72: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/72.jpg)
Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to obtain gene expression profiles of tumours from carriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers), as well as sporadic breast cancer.
We consider their data set of M = 15 patients, comprising two patient groups: BRCA1 (7) versus BRCA2 - mutation positive (8), with N = 3,226 genes.
The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
![Page 73: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/73.jpg)
Two component model for the Breast Cancer Data
Fit
$$\pi_0 N(0, 1) + \pi_1 N(\mu_1, \sigma_1^2)$$
to the N values of wj (based on pooled two-sample t-statistic).
The jth gene is taken to be differentially expressed if:
$$\hat{\tau}_0(w_j) \le c_0$$
![Page 74: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/74.jpg)
Fitting two component mixture model to Hedenfalk data
[Figure: Histogram of z-scores for 3226 Hedenfalk genes, with the null and non-null (DE genes) components marked; x-axis: z, y-axis: Frequency]
![Page 75: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/75.jpg)
Estimates of π0 for Hedenfalk data
• 0.52 (Broet, 2004)
• 0.64 (Gottardo, 2006)
• 0.61 (Ploner et al, 2006)
• 0.47 (Storey, 2002)
Using a theoretical null, we estimated π0 to be 0.65.
![Page 76: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/76.jpg)
We used
pj = 2 F13(-|wj|)
Similar results were obtained when the pj were obtained by permutation methods.
![Page 77: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/77.jpg)
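Ranking and Selecting the Genes

| Gene j | Pj | zj | τ̂0(zj) (Local FDR) |
|---|---|---|---|
| Gene 1 | . | . | 0.002 |
| ... | | | ... |
| Gene R | . | . | 0.1 |
| ... | | | 0.12, 0.18, ... |
| Gene R+R1 | . | . | 0.20 |
| ... | | | ... |
| Gene N | . | . | . |

c0 = 0.1
FDR = Sum/R = 0.06
Proportion of False Negatives = 1 – Sum1/R1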
![Page 78: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/78.jpg)
Estimated FDR for various levels of c0

| c0 | R | $\widehat{FDR}$ |
|---|---|---|
| 0.5 | 906 | 0.26 |
| 0.4 | 694 | 0.21 |
| 0.3 | 518 | 0.16 |
| 0.2 | 326 | 0.11 |
| 0.1 | 143 | 0.06 |
![Page 79: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/79.jpg)
Estimated FDR and other error rates for various levels of threshold c0 applied to the posterior probability of nondifferential expression for the breast cancer data (Nr = number of selected genes)

| c0 | Nr | $\widehat{FDR}$ | $\widehat{FNDR}$ | $\widehat{FNR}$ | $\widehat{FPR}$ |
|---|---|---|---|---|---|
| 0.1 | 143 | 0.06 | 0.32 | 0.88 | 0.004 |
| 0.2 | 338 | 0.11 | 0.28 | 0.73 | 0.02 |
| 0.3 | 539 | 0.16 | 0.25 | 0.60 | 0.04 |
| 0.4 | 742 | 0.21 | 0.22 | 0.48 | 0.08 |
| 0.5 | 971 | 0.27 | 0.18 | 0.37 | 0.12 |
![Page 80: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/80.jpg)
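Comparison of identified DE genes
Our method (143), Hedenfalk (175), Storey and Tibshirani (160)
[Venn diagram with region counts 101, 6, 12, 8, 39, 29, 24. The only assignment consistent with the three totals is: all three methods 101; our method and Hedenfalk only 6; our method and Storey and Tibshirani only 12; Hedenfalk and Storey and Tibshirani only 39; our method only 24; Hedenfalk only 29; Storey and Tibshirani only 8.]
Storey and Tibshirani (2003) PNAS, 100, 9440-9445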
![Page 81: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/81.jpg)
Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2

| Gene | GO Term |
|---|---|
| UBE2B, DDB2 | DNA repair |
| (UBE2V1) | (cell cycle) |
| RAB9, RHOC | small GTPase signal transduction |
| ITGB5, ITGA3 | integrin mediated signalling pathway |
| PRKCBP1 | regulation of transcription |
| HDAC3, MIF | negative regulation of apoptosis |
| KIF5B, spindle body pole protein | cytoskeleton organisation |
| CTCL1 | vesicle mediated transport |
| TNAFIP1 | cation transport |
| HARS, HSD17B7 | metabolism |
![Page 82: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/82.jpg)
The Procedure
1. Obtain the z-score for each of the genes: $z_j = \Phi^{-1}(1 - p_j)$
2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, $p_j$).
3. The posterior probability of non-differential expression of gene j is given by $\tau_0(z_j)$.
4. Conclude gene j to be differentially expressed if $\tau_0(z_j) < c_0$.
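The four steps of the procedure can be sketched in a few lines. This is an illustration only, assuming a two-component normal mixture with a theoretical N(0,1) null and a single normal non-null component whose parameters (`pi0`, `mu1`, `sigma1`) have already been estimated; the function names and the parameter values in the example are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # theoretical N(0,1) null


def z_score(p):
    """Step 1: z_j = Phi^{-1}(1 - p_j); requires 0 < p < 1."""
    return STD_NORMAL.inv_cdf(1.0 - p)


def tau0(z, pi0, mu1, sigma1):
    """Step 3: posterior probability that a gene with z-score z is not DE."""
    null = pi0 * STD_NORMAL.pdf(z)
    non_null = (1.0 - pi0) * NormalDist(mu1, sigma1).pdf(z)
    return null / (null + non_null)


def select_de(pvalues, pi0, mu1, sigma1, c0):
    """Steps 2 and 4: rank genes by decreasing z and keep those with tau0 < c0."""
    scored = [(z_score(p), j) for j, p in enumerate(pvalues)]
    scored.sort(reverse=True)  # largest z-scores first
    return [j for z, j in scored if tau0(z, pi0, mu1, sigma1) < c0]


# Hypothetical fitted parameters and P-values
selected = select_de([0.0001, 0.5, 0.9], pi0=0.7, mu1=1.5, sigma1=1.0, c0=0.1)
```

Only the first gene is selected in this example: its small P-value maps to a large z-score, where the non-null component dominates the mixture density.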
![Page 83: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/83.jpg)
Breast cancer data: plot of fitted two-component normal mixture model with theoretical N(0,1) null and non-null components (weighted respectively by the estimated proportions of null and non-null genes) imposed on histogram of z-scores.
[Figure: x-axis z-scores, -4 to 4; y-axis Frequency, 0 to 100]
![Page 84: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/84.jpg)
Hedenfalk breast cancer data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportions of null and non-null genes) imposed on histogram of z-scores.
[Figure: x-axis z-scores, -4 to 4; y-axis Frequency, 0 to 100]
![Page 85: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/85.jpg)
Histogram of N = 3,226 z-values from the Breast Cancer Study. The theoretical N(0,1) null is much narrower than the central peak, which has (δ0, σ0) = (-0.02, 1.58). In this case the central peak seems to include the entire histogram.
![Page 86: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/86.jpg)
Alon Colon Cancer Data
• Affymetrix arrays, samples from colon tissues from 62 patients
• Dataset of M = 62 patients: 40 tumor vs 22 normal, with N = 2,000 genes.
Alon et al. (1999) PNAS, 96, 6745-6750
![Page 87: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/87.jpg)
Alon data theoretical null
We estimated π0 to be 0.39 using the theoretical null. With the empirical null, it is 0.53.
Six smooth muscle genes are included in the 2,000 genes and each has an estimated posterior probability of non-DE less than 0.015.
![Page 88: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/88.jpg)
Colon cancer data: plot of fitted two-component normal mixture model with theoretical N(0,1) null and non-null components (weighted respectively by the estimated proportions of null and non-null genes) imposed on histogram of z-scores.
[Figure: x-axis z-scores, -6 to 6; y-axis Frequency, 0 to 60]
![Page 89: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/89.jpg)
Colon cancer data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportions of null and non-null genes) imposed on histogram of z-scores.
![Page 90: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/90.jpg)
van’t Wout HIV Data
• Four experiments using the same RNA preparation on 4 different slides
• Expression levels of transcripts assessed in CD4-T-cell lines 24 hours after infection with HIV type 1
• Two samples – one corresponds to HIV infected cells and one to non-infected cells
• Dataset of M = 8 samples: 4 HIV-infected vs 4 non-infected, with N = 7,680 genes. Analysed by Gottardo et al (2006)
van’t Wout et al (2003), J Virology 77, 1392-1402
Gottardo et al (2006), Biometrics 62, 10-18
![Page 91: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/91.jpg)
HIV data: plot of fitted two-component normal mixture model with empirical null and non-null components (weighted respectively by the estimated proportions of null and non-null genes) imposed on histogram of z-scores.
[Figure: x-axis z-scores, -6 to 6; y-axis Frequency, 0 to 350]
![Page 92: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/92.jpg)
Conclusions
• Mixture-model based approach to finding DE genes can yield new information
• Gives a measure of the posterior probability that a gene is not DE (i.e. a local FDR rather than a global one)
• Provides estimates of global rates – the FDR and the false negative rate FNR (FNR = 1 - sensitivity)
![Page 93: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/93.jpg)
Mixture Model-Based Approach: Advantages
• Provides a local FDR and a measure of the FNR, as well as the global FDR for the selected genes.
• We show that it can yield new information
![Page 94: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/94.jpg)
Allison Mice Simulation
Allison generated data for 10 mice over 3,000 genes. The data are generated in six groups of 500 with a value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the 500 × 500 covariance matrix used to generate each group.
For a random 20% of the genes, a value d of 0, 4, or 8 is added to the gene expression levels of the last five mice.
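A data set of this shape can be generated with the sketch below. It is not the original simulation code: it assumes normally distributed expression levels with unit variances, a single value of ρ shared by all six groups within one run, and independence between groups and between mice. The equicorrelated structure is obtained from a shared factor per mouse and group, which requires ρ ≥ 0. All function and parameter names are hypothetical.

```python
import math
import random


def simulate_group(n_genes, n_mice, rho, rng):
    """One group of genes (rows) by mice (columns).

    Within a mouse, any two genes in the group have correlation rho:
    x = sqrt(rho) * shared + sqrt(1 - rho) * own, with unit variance.
    """
    data = [[0.0] * n_mice for _ in range(n_genes)]
    for m in range(n_mice):
        shared = rng.gauss(0.0, 1.0)  # factor common to all genes in this group
        for g in range(n_genes):
            own = rng.gauss(0.0, 1.0)
            data[g][m] = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own
    return data


def simulate(rho, d, n_groups=6, group_size=500, n_mice=10, frac_de=0.2, seed=0):
    """Return (data, de) where de is the set of truly DE gene indices."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_groups):
        data.extend(simulate_group(group_size, n_mice, rho, rng))
    n_genes = len(data)
    # A random 20% of genes get d added for the last half of the mice
    de = set(rng.sample(range(n_genes), round(frac_de * n_genes)))
    for g in de:
        for m in range(n_mice // 2, n_mice):
            data[g][m] += d
    return data, de


data, de = simulate(rho=0.8, d=4.0)
```

With ρ = 0.8 the per-gene test statistics are strongly dependent within each group, which is the setting in which the theoretical and empirical nulls are compared on the following pages.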
![Page 95: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/95.jpg)
Theoretical null, ρ=0.8, d=4
![Page 96: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/96.jpg)
Empirical null, ρ=0.8, d=4
![Page 97: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/97.jpg)
Theoretical null, ρ=0.8, d=8
![Page 98: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/98.jpg)
Empirical null, ρ=0.8, d=8
![Page 99: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/99.jpg)
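Suppose $\tau_0(w)$ is monotonic decreasing in $w$. Then

$$\hat{\tau}_0(w_j) \le c_0 \quad \text{for } w_j \ge w_0 .$$

$$\widehat{\mathrm{FDR}} = \frac{\hat{\pi}_0\,[1 - F_0(w_0)]}{1 - \hat{F}(w_0)} \approx E\{\tau_0(w) \mid w \ge w_0\}$$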
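The link between the tail-area ratio and the conditional expectation of the posterior probability can be checked in one line. The derivation below is a sketch for the population quantities, assuming the mixture density $f(w) = \pi_0 f_0(w) + \pi_1 f_1(w)$ with distribution function $F$, and the posterior probability $\tau_0(w) = \pi_0 f_0(w) / f(w)$; the symbols $f$, $f_0$ and $f_1$ are introduced here for illustration.

```latex
E\{\tau_0(W) \mid W \ge w_0\}
  = \frac{\int_{w_0}^{\infty} \tau_0(w)\, f(w)\, dw}{1 - F(w_0)}
  = \frac{\int_{w_0}^{\infty} \pi_0 f_0(w)\, dw}{1 - F(w_0)}
  = \frac{\pi_0\,[1 - F_0(w_0)]}{1 - F(w_0)}
```

Replacing $\pi_0$ and $F$ by their estimates gives the estimated FDR for the genes with $w_j \ge w_0$.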
![Page 100: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/100.jpg)
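Suppose $\tau_0(w)$ is monotonic decreasing in $w$. Then

$$\hat{\tau}_0(w_j) \le c_0 \quad \text{for } w_j \ge w_0 .$$

$$\widehat{\mathrm{FDR}} = \frac{\hat{\pi}_0\,[1 - F_0(w_0)]}{1 - \hat{F}(w_0)}$$

where $F_0(w_0) = \Phi(w_0)$ and

$$\hat{F}(w_0) = \hat{\pi}_0\,\Phi(w_0) + \hat{\pi}_1\,\Phi\!\left(\frac{w_0 - \hat{\mu}_1}{\hat{\sigma}_1}\right)$$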
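The estimated FDR at a threshold w0 is a ratio of two upper-tail areas: the null tail weighted by the estimated null proportion, and the tail of the fitted mixture. A minimal sketch follows, assuming a theoretical N(0,1) null and a single normal non-null component; the function name and the parameter values in the example are hypothetical.

```python
from statistics import NormalDist


def estimated_fdr(w0, pi0, mu1, sigma1):
    """FDR-hat(w0) = pi0 * [1 - F0(w0)] / [1 - F-hat(w0)] for a two-component mixture.

    F0 is the N(0,1) distribution function and
    F-hat(w) = pi0 * Phi(w) + (1 - pi0) * Phi((w - mu1) / sigma1).
    """
    null_tail = 1.0 - NormalDist().cdf(w0)
    non_null_tail = 1.0 - NormalDist(mu1, sigma1).cdf(w0)
    mixture_tail = pi0 * null_tail + (1.0 - pi0) * non_null_tail
    return pi0 * null_tail / mixture_tail


# Hypothetical fitted parameters: the estimate falls as the threshold rises
fdr_low = estimated_fdr(1.0, pi0=0.7, mu1=2.0, sigma1=1.0)
fdr_high = estimated_fdr(3.0, pi0=0.7, mu1=2.0, sigma1=1.0)
```

As w0 decreases towards minus infinity every gene is selected and the estimate tends to pi0, the proportion of null genes.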
![Page 101: The Problem of Detecting Differentially Expressed Genes](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649ebb5503460f94bc405d/html5/thumbnails/101.jpg)
For a desired control level $\alpha$, say $\alpha = 0.05$, define

$$w_0 = \arg\min_w \{\widehat{\mathrm{FDR}}(w) \le \alpha\} \qquad (1)$$

If

$$\frac{\hat{\pi}_0\,[1 - F_0(w)]}{1 - \hat{F}(w)}$$

is monotonic in $w$, then using (1) to control the FDR [with $\hat{\pi}_0 = 1$ and $\hat{F}(w_0)$ taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic $w_j$.
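The stated equivalence can be checked numerically. The sketch below assumes upper-tail P-values p_j = 1 - Φ(w_j) under a theoretical N(0,1) null, sets the null proportion to 1, and takes 1 - F-hat(w) to be the empirical proportion of statistics that are at least w. The threshold rule is evaluated only at the observed statistics, which is enough to determine the selected set. All function names and the simulated statistics are hypothetical.

```python
import random
from statistics import NormalDist

PHI = NormalDist()  # theoretical N(0,1) null


def bh_select(pvals, alpha):
    """Benjamini-Hochberg step-up: reject the k smallest P-values,
    where k is the largest rank with p_(k) <= k * alpha / n."""
    n = len(pvals)
    order = sorted(range(n), key=lambda j: pvals[j])
    k_max = 0
    for k, j in enumerate(order, start=1):
        if pvals[j] <= k * alpha / n:
            k_max = k
    return set(order[:k_max])


def threshold_select(w, alpha):
    """Rule (1) with pi0 = 1 and the empirical distribution function:
    find the smallest observed w0 whose estimated FDR is at most alpha."""
    n = len(w)
    best = None
    for w0 in w:
        null_tail = 1.0 - PHI.cdf(w0)
        empirical_tail = sum(1 for x in w if x >= w0) / n
        if null_tail / empirical_tail <= alpha and (best is None or w0 < best):
            best = w0
    if best is None:
        return set()
    return {j for j, x in enumerate(w) if x >= best}


def simulate_statistics(n=500, seed=0):
    """Hypothetical statistics: about 20% drawn from N(3,1), the rest from N(0,1)."""
    rng = random.Random(seed)
    return [rng.gauss(3.0, 1.0) if rng.random() < 0.2 else rng.gauss(0.0, 1.0)
            for _ in range(n)]


w = simulate_statistics()
pvals = [1.0 - PHI.cdf(x) for x in w]
same = bh_select(pvals, 0.05) == threshold_select(w, 0.05)
```

At the k-th largest statistic the estimated FDR equals n p_(k) / k, so the condition that it is at most α is exactly the Benjamini-Hochberg condition p_(k) ≤ kα/n.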