Download - Statistical Basics for Micro Array Studies
-
8/3/2019 Statistical Basics for Micro Array Studies
1/81
Statistical Basics forMicroarray Studies
Naomi AltmanPenn State University
Dept. of Statistics
Bioinformatics Consulting Center
Sept. 12/06
mailto:[email protected]:[email protected] -
8/3/2019 Statistical Basics for Micro Array Studies
2/81
-
8/3/2019 Statistical Basics for Micro Array Studies
3/81
Data Used for this Talk
Data:Golub et al (1999) demonstrated that microarray gene expression
could be used to distinguish among human leukemia types:
acute myeloid leukemia (AML) and acute lymphoblasticleukemia (ALL).
There are 72 Affymetrix Hgu6800 microarrays (47 ALL, 25 AML),
with data on the expression of 7129 genes.The data were preprocessed to reduce array to array variability.
The entire dataset is available from www.bioconductor.org.We use a subset that is available in R when you load the affy
package.
-
8/3/2019 Statistical Basics for Micro Array Studies
4/81
Data Used for this Talk
Data: (e.g. 5 probe-sets, 19 randomly selectedcases from golubMergeSub in R)
ALL ALL ALL ALL ALL AML AML AML AML ALL ALL ALL ALL AML ALL AML AML ALL ALL
L37378_at 176 160 139 203 157 15 245 225 170 88 129 353 185 398 184 295 21 135 1585
L37792_at 7 10 14 23 0 3 34 23 22 33 -27 35 -5 18 36 34 6 -12 29
L37868_s_at 284 135 183 212 185 -61 319 182 185 281 68 -132 201 652 132 362 380 230 -382
L37882_at 251 514 441 571 230 516 342 234 183 134 271 439 351 1123 574 53 141 385 258
L37936_at 31 27 22 34 -23 287 6 8 42 84 20 16 26 50 8 14 1 -2 33
It usually pays to have a quick look at the numericalvalues of your data. e.g. in this data set, we havesome negative expression values (which is not
biologically supportable).
-
8/3/2019 Statistical Basics for Micro Array Studies
5/81
Statistical Software
Any standard statistical package (and probably Excel)can do the plots and analyses I will show.
I will use R because: Free from www.bioconductor.org
Available: Windows, Mac, Unix, Linux Programmable Great graphics
Bioconductor Project for genomics Lots of resources (R + Splus communities)
The learning curve for R is steep.
http://www.bioconductor.org/http://www.bioconductor.org/ -
8/3/2019 Statistical Basics for Micro Array Studies
6/81
What is statistics?
Statistical methods are used to:
Describe what was observed.
Make inference to what was not observed.
Determine how to collect the dataappropriately.
-
8/3/2019 Statistical Basics for Micro Array Studies
7/81
Description versus inference
Statistical methods are used both to:
Describe what was observed:
requires data
Make inference to what was not observed:
requires data collected using statistical
methods such as randomization andreplication.
-
8/3/2019 Statistical Basics for Micro Array Studies
8/81
Describing the Data
Graphically
Histogram
Density Plot Boxplot
Scatterplot
Numerically
Quantiles Averages
Variance, Standard Deviation
Correlation
-
8/3/2019 Statistical Basics for Micro Array Studies
9/81
Some R commands
# Load software and data
# get data
# see what is in the data
# extract expression values
# extract variables describing# samples
library(affy)
data(golubMergeSub)golubMergeSub
gExprs=exprs(golubMergeSub)
gPheno=pData(golubMergeSub)
gExprs[1:5,]
gPheno[1:5,]
-
8/3/2019 Statistical Basics for Micro Array Studies
10/81
Looking at the data:
Here are 4 of the genes,aggregated over the 72 arrays.
hist(gExprs[i,])
The "shapes" of the distributionsvary.
Most of the data for L38500_at arenegative.
We have combined ALL and AMLdata.
-
8/3/2019 Statistical Basics for Micro Array Studies
11/81
Looking at the data:Here are 2 of the genes, aggregated overthe 72 arrays with histograms organizedby cancer type
hist(gExprs[i,,gPheno$ALL.AML=="ALL"])
There are some obvious features, (e.g.outlier) but it is hard to compare the AMLand ALL samples because the x-axesdiffer and even the bin widths are not the
same.
-
8/3/2019 Statistical Basics for Micro Array Studies
12/81
Looking at the data:
for (i in c(1,11)) {low=min(golubExprs[i,])-1high=max(golubExprs[i,])+1b=seq(low,high,length.out=20)hist(golubExprs[i,ALL.AML=="ALL"],
main=paste(rownames(golubExprs)[i],"ALL"),xlab="Expression",breaks=b)hist(golubExprs[i,ALL.AML=="AML"],main=paste(rownames(golubExprs)[i],"AML"),xlab="Expression",breaks=b)}
Here we have used the same x-axis forboth cancer types (but varying by gene).
We can see that the AML patients tend tobe higher on L38503, but there is notmuch difference for L37378
-
8/3/2019 Statistical Basics for Micro Array Studies
13/81
Boxplot
The red line is themedian, the blue are the
the 75th quantiles.
A boxplot is another
summary thatemphasizes the central50% of the data and theoutliers and is extremelyuseful for comparisons.
-
8/3/2019 Statistical Basics for Micro Array Studies
14/81
Boxplot
The "whiskers" on the plot extend tothe furthest data value that is no more
than 1.5 boxwidths from the edge ofthe box. Any values beyond aredenoted by a circle.
We can see immediately that L38503is differentially expressed.
As well, L37868_s is more variable inthe AML group, which is not useful fordetecting AML, but may have somebiological meaning.
Statisticians love to use
-
8/3/2019 Statistical Basics for Micro Array Studies
15/81
Statisticians love to uselogarithms
Because:
1. Right skewed data become more symmetric.2. If the variance of the data increases with the size, logs
equalize variance.
3. Log(x/y)=Log(x)-Log(y)
Note that Logb(x) = Logc(b) Logc(x) so we statisticiansdon't really care what base.
In microarray analysis, many people use Log2.
In these units a "2-fold" difference (x/y=2) is Log(x/y)=1.
-
8/3/2019 Statistical Basics for Micro Array Studies
16/81
Why we use log2(expression)In this particular dataset,many of the genes did nothave skewed distributions,and the spread in both
cancers was about thesame, but usually we useLog for all or none of thegenes.
-
8/3/2019 Statistical Basics for Micro Array Studies
17/81
Scatterplots
Scatterplots give us the
opportunity to plot 2quantities and colorcoding adds information.
We can clearly see thateven using both genes,
we cannot completelyseparate the AML andALL patients
plot
-
8/3/2019 Statistical Basics for Micro Array Studies
18/81
Scatterplots
Scatterplots are the most
useful tool for qualitycontrol. I always ploteach array against otherarrays. Even when thearrays come fromdifferent conditions, weshould get a diagonal
"cloud".
Note array X5523 whichwas a dud. (These arefrom a local project.)
-
8/3/2019 Statistical Basics for Micro Array Studies
19/81
Golub Data
-
8/3/2019 Statistical Basics for Micro Array Studies
20/81
LO(W)ESS Trend Curves
It is sometimes useful to
add a trend line to ascatterplot (and this isan integral part of the
"normalization"process.
http://www.stat.unc.edu/faculty/marron/marron_movies.html
-
8/3/2019 Statistical Basics for Micro Array Studies
21/81
A plot is worth 50,000 p-values
Most people are far to eager to get to a "p-value"
Look at the data table Plot everything you can
-
8/3/2019 Statistical Basics for Micro Array Studies
22/81
Numerical Summaries
Measuring the Center
Mean=average=
Median=50th quantile
Mode=highest point on histogram or density
Geometric mean=antilog(mean(log(data))
Y
-
8/3/2019 Statistical Basics for Micro Array Studies
23/81
Numerical Summaries
Measuring the Spread
Variance= average squared deviation from the mean
Standard deviation=
Interquartile range=length of boxplot box=75th quantile 25th quantile
var1
)( 2
=
n
YYi
-
8/3/2019 Statistical Basics for Micro Array Studies
24/81
Numerical SummariesMeasuring Linearity
Correlation: -1 r 1If r=0 there is no linear relationship
If r>0, the relationship is increasing.
If r
-
8/3/2019 Statistical Basics for Micro Array Studies
25/81
Populations and StatisticalInference
What do we mean when we state:
Gene X expresses (or differentially expresses)?
Sepals are more similar to petals than to leaves?
Populations and Statistical
-
8/3/2019 Statistical Basics for Micro Array Studies
26/81
Populations and StatisticalInference
What do we mean when we state:
Gene X expresses (or differentially expresses)?
Sepals are more similar to petals than to leaves?
The population is the general class of organismsfor which we want to make an inferentialstatement.
Populations and Statistical
-
8/3/2019 Statistical Basics for Micro Array Studies
27/81
Populations and Statistical
Inference
If we could observe the entire population with nomeasurement error, we would not need
inferential statistics.
Alternatively, if all the members of the populationare identical and there is no measurementerror, we would not need inferential statistics.
Populations and Statistical
-
8/3/2019 Statistical Basics for Micro Array Studies
28/81
Populations and Statistical
InferenceIf we could observe the entire population with nomeasurement error, we would not need
inferential statistics.Alternatively, if all the members of the population
are identical and there is no measurementerror, we would not need inferential statistics.
Statistical inference is used to make a statementabout the population in the presence of errorbecause we observe only a sample from a
varying population and there is measurementerror.
Samples and replication
-
8/3/2019 Statistical Basics for Micro Array Studies
29/81
Samples and replication
A sample is the set of observed items fromthe population.
Generally, for statistical inference we need
information about how variable theseitems are.
If there is also measurement error, we mightwant to make multiple measurements on
the same individual, because averagingthese reduces the measurement error.
Hypotheses
-
8/3/2019 Statistical Basics for Micro Array Studies
30/81
Hypotheses
Classical statistical tests are based on comparing 2 hypotheses
H0: the null hypothesis (which we hope to disprove)
HA: the alternative hypothesis (which we accept if there isevidence to reject the null hypothesis)
e.g.
H0: the gene does not express
HA: the gene expresses
H0: the gene does not differentially express
HA: the gene differentially expresses
-
8/3/2019 Statistical Basics for Micro Array Studies
31/81
Classical hypothesis testingHypotheses are statements about the population:
e.g.HA: the gene expresses in leaf guard cells
Means that the average expression level oversome population of leaf guard cells is positive.
Since there is both measurement error and
biological variability from leaf to leaf, plant toplant, etc., we need to:
What goes into a hypothesis
-
8/3/2019 Statistical Basics for Micro Array Studies
32/81
What goes into a hypothesis
test?1. Determine the population of cells, leaves,
plants for which we want to make theinference.
2. Ensure that the sample represents thepopulation of interest, not just one member of
the population.
This has 2 implications:
Essential data for hypothesis
-
8/3/2019 Statistical Basics for Micro Array Studies
33/81
Essential data for hypothesis
testingWe need to sample several individuals from our
population.
We need to estimate the biological variability ofour population (using the sample).
Sampling should be done at random.
Tests about the population
-
8/3/2019 Statistical Basics for Micro Array Studies
34/81
Tests about the population
average.The usual tests require that the data areindependent and have the samevariability.
Same variability take logarithms
Independent dependence is induced by
taking multiple measurements of the same
object - e.g. organism, RNA sample,microarray
Tests of one sample mean or
-
8/3/2019 Statistical Basics for Micro Array Studies
35/81
Tests of one sample mean or
median t-test
Wilcoxon test Permutation and bootstrap tests
Tests about the population
-
8/3/2019 Statistical Basics for Micro Array Studies
36/81
p paverage.
Hypotheses about expression levels are usually treatedas hypotheses about population averages (t-test) or
population medians (Wilcoxon test).
We want to make a guessabout the value of the redline, but we do not get tomeasure the entire
population.
Tests about the population
-
8/3/2019 Statistical Basics for Micro Array Studies
37/81
p paverage.
Hypotheses about expression levels are usually treatedas hypotheses about population averages (t-test) orpopulation medians (Wilcoxon test).
We want to make a guessabout the value of the red line,
but we do not get to measurethe entire population.
We estimate the parameterfrom the data and then see ifit is too far from the
hypothesized value to beplausible
-
8/3/2019 Statistical Basics for Micro Array Studies
38/81
-
8/3/2019 Statistical Basics for Micro Array Studies
39/81
Classical testing is basedon the distribution of thetest statistic when the null
hypothesis is true.
This depends on theoriginal population, butmay be the sampleaverage, sample "t" value,etc.
The distribution of the test
statistic is computedthrough statistical theory orsimulation, or by"resampling" from the data(bootstrap and permutation
methods).
2 i l id f t t
-
8/3/2019 Statistical Basics for Micro Array Studies
40/81
2 simple ideas for tests
If the population mean is , then the sample
mean should be close to .t-test, permutation test, bootstrap test
If the population median is , then about50% of the observations should be above and 50% below.
Sign and Wilcoxon test
Some Questions of Interest
-
8/3/2019 Statistical Basics for Micro Array Studies
41/81
Some Questions of Interest
Does genei express under this condition?
Does the expression level of genei differ under
these 2 (or more) conditions?
Do genei and genej have similar patterns ofexpression under several conditions
Some Questions of Interest
-
8/3/2019 Statistical Basics for Micro Array Studies
42/81
Some Questions of Interest
Does genei express under this condition?
H0: i= 0 HA: i> 0
Since we measure intensity, which is always
non-negative, we cannot test this. We use a
small constant (how small?) and do a one-
sided test.
This is often considered to be inaccurate - PCRis often used to check at least some of theconclusions
Some Questions of Interest
-
8/3/2019 Statistical Basics for Micro Array Studies
43/81
Some Questions of Interest
Does the expression level of genei differ underthese 2 (or more) conditions?
H0: ij=ik HA: ijik
This is considered the most appropriate setting
for microarray data
t t f
-
8/3/2019 Statistical Basics for Micro Array Studies
44/81
tests of means:
1. Is the population mean 0? (2-sided)
, (1-sided).
2. Is there a difference in populationmeans? (usually 2-sided)
1.5=2.5 Is the mean difference equal 0?(usually 2-sided)
S ti
-
8/3/2019 Statistical Basics for Micro Array Studies
45/81
Some necessary assumptions
1. The observations are independent.
2. The observations are a random samplefrom a known population.
3. If there are 2 populations, the populationhave the same shape.
t test
-
8/3/2019 Statistical Basics for Micro Array Studies
46/81
t-test
RNA was spiked into the sample sothat the expression level for the oligo
should have been 24. Was this levelachieved?
0: =0
: 0
nS
Yt
Y/
* 0=
We obtain the p-valuefrom a table or our
software (d.f.=n-1)
Classical Inference: P-values
-
8/3/2019 Statistical Basics for Micro Array Studies
47/81
We assume that the NULL distribution is true.The test statistic then has a distribution that is
either known or can be simulated under thisassumption.
The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.
The p-value can be thought of as a measurement
of "surprise". If small, we reject Ho.
Classical Inference: P-values
-
8/3/2019 Statistical Basics for Micro Array Studies
48/81
We assume that the NULL distribution is true.The test statistic then has a distribution that is
either known or can be simulated under thisassumption.
The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.
The test is 2-sided if the null hypothesis proposesequality of 2 values. Then we consider theabsolute value of the test statistic.
Classical Inference: P-values
-
8/3/2019 Statistical Basics for Micro Array Studies
49/81
We assume that the NULL distribution is true.The test statistic then has a distribution that is
either known or can be simulated under thisassumption.
The p-value is the probability of observing a valueof the test statistic more extreme than theobserved value when Ho is true.
The test is 1-sided if the null hypothesis proposes
or
t test
-
8/3/2019 Statistical Basics for Micro Array Studies
50/81
t-test
When the data come from a distribution that
is approximately symmetricaland are independent
and the null hypothesis is true
nS
Yt
Y /* 0
=
has a sampling distribution that iscalled Student's t distribution on n-
1 degrees of freedom.The degrees of freedom are ameasure of how often outlying
values are observed
t distribution
-
8/3/2019 Statistical Basics for Micro Array Studies
51/81
t distribution
The t-test is the usualexample.
We compute a value oft from the data (formulato follow.)
If the null hypothesis is
true, the distribution isgiven by one of thecurves on the plot.
The vertical lines aredrawn so the 2.5% ofthe distribution is moreextreme than the line.
observed t
t distribution
-
8/3/2019 Statistical Basics for Micro Array Studies
52/81
t distribution
P-valuefor
0
observed t
Fail to reject: this is atypical value if 0
-
8/3/2019 Statistical Basics for Micro Array Studies
53/81
t distribution
-
8/3/2019 Statistical Basics for Micro Array Studies
54/81
t distribution
P-valuefor
= 0
Reject: this is not atypical value if 0
observed t
Confidence Interval
-
8/3/2019 Statistical Basics for Micro Array Studies
55/81
Confidence Interval
A 1- confidence interval for the mean is all
the hypothesized values that would not berejected by doing the 2-sided test at level.
Looking at the test statistic, we can readilysee that this is
nstYn
/1,2/1
t-test
-
8/3/2019 Statistical Basics for Micro Array Studies
56/81
t testExample:
Ho: = 4Sample average=3.6 n=49Sample variance=0.041
One Sample t-testdata: R.gene1.array1
t = -13.5428, df = 48, p-value < 2.2e-16alternative hypothesis: true mean is not equal to 4
What is required for a one
-
8/3/2019 Statistical Basics for Micro Array Studies
57/81
sample t-test?The data should be independent. (This isessential.)
The data histogram should be symmetric andunimodal (i.e. the data should be "bell-shaped,
although normality is not essential).
Testing the difference in means
-
8/3/2019 Statistical Basics for Micro Array Studies
58/81
Testing the difference in means
If X1 Xn and Y1 Ym are independent
samples, another form of the t-test can be usedto test if there is a difference in the mean. (e.g.to test differential expression)
0: X = Y
2-sample t-test
-
8/3/2019 Statistical Basics for Micro Array Studies
59/81
We now test:
H0: difference in populations averages is 0
HA: difference in population averages is not 0
m
S
n
S
YXt
YX
22*
+
=
Again we use the t-table, with d.f. n+m-2
2-sample t-test
-
8/3/2019 Statistical Basics for Micro Array Studies
60/81
e.g. Do genes L37868 and L38503 differentially express betweenALL and AML patients (a test for each gene).
Note that these were independent arrays for each patient.
t.test(log(gExprs[1,],2)~gPheno$ALL.AML)
L37868t = -0.4401, df = 64.69, p-value = 0.6613alternative hypothesis: true difference in means is not equal to 095 percent confidence interval: -0.9250874 0.5910186sample estimates:mean in group ALL 7.283753 mean in group AML 7.450787
L38503t = -4.9775, df = 64.913, p-value = 5.021e-06alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -1.8834056 -0.8047853sample estimates:mean in group ALL 10.03113 mean in group AML 11.37522
Some notes:
-
8/3/2019 Statistical Basics for Micro Array Studies
61/81
Some notes:
Notes:
1. Statistical significance is not the same aspractical significance.
2. There is an alternative form of the
denominator of the test if we assume that 2populations have the same spread.
3. We need to be careful about theindependence of the samples.
Another example:
-
8/3/2019 Statistical Basics for Micro Array Studies
62/81
A muscle tissue sample was taken from healthy volunteers. Theywere then infused with insulin and another sample was taken.
The 2 samples from each volunteer were hybridized to a 2-channel microarray
H0: gene expression is the same before and after infusion
HA: the gene differentially expresses
This looks like the two-sample t-test, but some of the samples are
not independent becausea) they are from the same subject
b) they are hybridized to the same array
Pairs of Dependent Samples
-
8/3/2019 Statistical Basics for Micro Array Studies
63/81
There is a simple solution:Combine the dependent samples.
In this case, the relevant quantity is thelog(ratio)=M.
For the M-values, we have hypotheses:
H0: =0 HA: 0
We do a 1-sample t-test using M. This issometimes called the aired t-test.
Here are the results for 4 of the genes:
t = 5 1619 df = 5 p value = 0 003579
-
8/3/2019 Statistical Basics for Micro Array Studies
64/81
t = -5.1619, df = 5, p-value = 0.003579
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -1.5292 -0.5124
sample estimates: mean of x -1.020839
t = 10.095, df = 5, p-value = 0.0001634
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: 0.7928 1.3345sample estimates: mean of x 1.063659
Here are the results for 4 of the genes:
t = 4 9325 df = 5 p value = 0 00435
-
8/3/2019 Statistical Basics for Micro Array Studies
65/81
t = -4.9325, df = 5, p-value = 0.00435
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -2.124 -0.6686
sample estimates: mean of x -1.396339
t = -1.0535, df = 5, p-value = 0.3403
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval: -0.752 0.315sample estimates: mean of x -0.2188503
Alternatives to the t-test
-
8/3/2019 Statistical Basics for Micro Array Studies
66/81
The t-test is not the only choice for testing
hypotheses about average expression levels.
These methods require equal variance andindependence, but not bell-shaped
Permutation test
Bootstrap test
Wilcoxon test
Permutation test (2-sample)
-
8/3/2019 Statistical Basics for Micro Array Studies
67/81
Compute t* as usual, but do not use t table.
Instead: Make one big sample X1XnY1Ym
For all possible choices of selection n items
(or a sample, if this is too many)
Select n observations at random to be the Xs;the remainder are the Ys.
Compute t* for each selection.
Permutation test (2-sample)
-
8/3/2019 Statistical Basics for Micro Array Studies
68/81
( p )
The p-value is based onwhere t* is in the
histogram of T*
Multiply the tailprobability by 2 for a 2-sided test
Bootstrap test (2-sample)
-
8/3/2019 Statistical Basics for Micro Array Studies
69/81
( )
Compute t* as usual, but do not use t-table.
Instead: Make one big sample X1XnY1Ym
For all possible choices of selection n items withreplacement to be the Xs and m items withreplacement to be the Ys.
Proceed as for the permutation test.
The Wilcoxon test (2-sample)
-
8/3/2019 Statistical Basics for Micro Array Studies
70/81
aka Mann-Whitney testMake one big sample X1XnY1Ym
Rank the observations from lowest (1)
to highest (n+m)
W*= rank(Xs) rank(Ys)
The p-value is computed by the permutation test,
(and these are available in tables based on nand m).
Note:
-
8/3/2019 Statistical Basics for Micro Array Studies
71/81
There are 1-sample and paired sample versionsof these tests.
The Wilcoxon test is often used instead of the t-test. The Wilcoxon p-value is usually very close
to the t-test p-value when the t-test isappropriate, and is more accurate when the t-
test is not appropriate. On the other hand, theWilcoxon test does not readily generalize to themore powerful analyses based on Bayesian
methods.
Power, False Discovery,
-
8/3/2019 Statistical Basics for Micro Array Studies
72/81
NondiscoveryAccept Reject total
H0 true U V m0H0 false T S m-m0
total m-R R m
Correct
Mistake
Power probability of rejecting when H0 false
FDR =V/R
FNR = T/(m-R)
Power, False Discovery,
-
8/3/2019 Statistical Basics for Micro Array Studies
73/81
Nondiscovery
nS
Yt
Y /
* 0
=The p-value is the probability ofrejecting the null hypothesis when it is
true.Under the alternative:
We can see that t* will be larger when the difference between the means is
larger, the variance is smaller or n islarger.
The only item under our control is thesample size.
Power, False Discovery,
-
8/3/2019 Statistical Basics for Micro Array Studies
74/81
Nondiscovery
nS
Yt
Y /
* 0
=For fixed p-value at which we declarestatistical significance, increasing the
sample size:Increases power
Reduces FDRReduces FNR
Power, False Discovery,
-
8/3/2019 Statistical Basics for Micro Array Studies
75/81
Nondiscovery
nS
Yt
Y /
* 0
=For fixed FDR at which we declarestatistical significance, increasing the
sample size:Reduces FNR
The sample size in most microarray studies is too
small.This leads to large FNR which needs to beconsidered when known genes are not found to bestatistically significant.
Improving Power
-
8/3/2019 Statistical Basics for Micro Array Studies
76/81
Besides increasing the sample size we can:
1. Pick the best tests. (Generally t-test andWilcoxon are more powerful thanresampling when appropriate.)
2. Use (Empirical) Bayes tests (SAM,
LIMMA, )
Bayes Methods
-
8/3/2019 Statistical Basics for Micro Array Studies
77/81
We have lots of genes.
Gene i has mean i and varianceBayesian methods assume that the means and
variances come from known distributions (thepriors).
Empirical Bayes methods assume that the means
and variances have distributions that are estimatedfrom the data.
2
i
Empirical Bayes Methods
-
8/3/2019 Statistical Basics for Micro Array Studies
78/81
Most focus is on the distribution of thevariances.
SAM add a constant based on a quantileof the distribution of the S2 over all thegenes. (SAM also uses permutationtests.)
LIMMA more formal eBayes analysis
- results in replacing gene variancesby a weighted average of the genevariance and the mean variance of all
genes.
m
S
n
S
YXt
YX
22*
+
=
True Bayes
-
8/3/2019 Statistical Basics for Micro Array Studies
79/81
Combine the prior information with the data.
Instead of p-values, we obtain the posteriorodds ratio of the hypothesis being true.
Increasing power
-
8/3/2019 Statistical Basics for Micro Array Studies
80/81
Classical methods are known to be lesspowerful than eBayes and Bayes methods.
We will read some papers about this.
Multiple comparisons problemAccept Reject total
-
8/3/2019 Statistical Basics for Micro Array Studies
81/81
The problem with p-valuesis that:
If we declare significancewhen P