![Page 1: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/1.jpg)
04/18/23 1
Microarray Data Analysis
![Page 2: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/2.jpg)
Copyright notice
• Many of the images in this PowerPoint presentation are from other people. The copyright belongs to the original authors. Thanks!
![Page 3: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/3.jpg)
Gene Expression Matrix
• After image processing, we obtain a data matrix.
• The final gene expression matrix (on the right in the figure) is needed for higher-level analysis and mining.
[Figure: a spots × images matrix of spot/image quantitations is converted into a genes × samples matrix of gene expression levels.]
![Page 4: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/4.jpg)
Missing data in microarray
• Randomly missing values:
  – the fact that the value is missing is independent of its value
  – methods are available for dealing with randomly missing data
• Non-randomly missing values:
  – the fact that the value is missing is dependent on its value (i.e. the value is missing because it is low expression, or the value is missing because it is high expression)
  – available methods do not adequately deal with the situation of non-randomly missing data
![Page 5: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/5.jpg)
Missing data in microarray
Randomly missing data:
– spotting problems
– dust
– fingerprints
– poor hybridization
– inadequate resolution
– fabrication errors (e.g. scratches)
– image corruption
– omission of suspect values*
* could also be non-random
Non-randomly missing data:
– low expression (e.g. background exceeds signal)
– censored data
[Figure: expression versus arrays, with a line marking the maximum observable intensity.]
![Page 6: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/6.jpg)
Dealing with missing data
• The problem:
  – many analyses require complete data matrices
    • classification algorithms
    • clustering algorithms
    • dimension-reduction methods
• Solutions:
  – remove all genes (rows) and arrays (columns) with missing values
  – estimate missing values
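The first solution can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the toy matrix, the use of `None` for a missing value, and both function names are assumptions.

```python
# Toy expression matrix: rows are genes, columns are arrays; None marks a missing value.
matrix = [
    [1.2, 0.8, None, 1.1],
    [0.5, 0.7, 0.6, 0.4],
    [2.1, None, 1.9, 2.0],
    [0.9, 1.0, 1.1, 1.2],
]

def drop_incomplete_genes(rows):
    """Keep only the genes (rows) that have no missing values."""
    return [row for row in rows if all(v is not None for v in row)]

def drop_incomplete_arrays(rows):
    """Keep only the arrays (columns) that have no missing values."""
    keep = [j for j in range(len(rows[0])) if all(row[j] is not None for row in rows)]
    return [[row[j] for j in keep] for row in rows]

complete_genes = drop_incomplete_genes(matrix)    # genes 2 and 4 survive
complete_arrays = drop_incomplete_arrays(matrix)  # arrays 1 and 4 survive
```

Even in this small example, two missing values cost half of the genes or half of the arrays, which is why estimating the missing values is usually preferred.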
![Page 7: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/7.jpg)
Imputation methods
• Naive approaches:
  – missing values = row (gene) average
  – missing values = column (array) average
• Smarter approaches have been proposed:
  – K-nearest neighbors
  – regression-based methods
  – singular value decomposition
    • like principal components for matrices with unequal numbers of rows and columns
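The naive row-average approach can be illustrated as follows. This is a minimal sketch with made-up values; `impute_row_mean` is a hypothetical helper name, and `None` is assumed to mark a missing value.

```python
def impute_row_mean(rows):
    """Replace each missing value (None) with the mean of the observed values in its row."""
    filled = []
    for row in rows:
        observed = [v for v in row if v is not None]
        # Assumes every gene has at least one observed value.
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled

genes = [
    [1.0, 2.0, None, 3.0],
    [4.0, None, 6.0, 5.0],
]
imputed = impute_row_mean(genes)
```

The column (array) average works the same way after transposing the matrix.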
![Page 8: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/8.jpg)
K-Nearest Neighbors (KNN)
[Figure: expression versus arrays, with "?" marking a randomly missing datum.]
• choose the k genes that are most similar to the gene with the missing value (MV)
• estimate the MV as the weighted mean of the neighbors
• considerations:
  – number of neighbors
  – distance metric
  – normalization step
![Page 9: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/9.jpg)
KNN - considerations
• parameter k
  – 10 usually works (5-15)
• distance metric
  – Euclidean distance
  – correlation-based distance
[Figure: expression versus arrays, with "?" marking the missing value.]
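A rough sketch of KNN imputation in Python, not taken from the slides. The inverse-distance weighting, the averaging of squared differences over shared columns, and the names `distance` and `knn_impute` are assumptions made for illustration; the slides only say "weighted mean".

```python
import math

def distance(a, b, skip):
    """Euclidean-type distance over columns observed in both genes, ignoring column `skip`."""
    cols = [j for j in range(len(a))
            if j != skip and a[j] is not None and b[j] is not None]
    if not cols:
        return math.inf
    # Averaging over the shared columns keeps genes with more gaps comparable.
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in cols) / len(cols))

def knn_impute(rows, gene, col, k=10):
    """Estimate rows[gene][col] from the k nearest genes that have a value in `col`."""
    candidates = [
        (distance(rows[gene], other, col), other[col])
        for i, other in enumerate(rows)
        if i != gene and other[col] is not None
    ]
    candidates.sort(key=lambda pair: pair[0])
    neighbors = candidates[:k]
    # Assumed weighting: inverse distance; the small constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)

rows = [
    [1.0, 2.0, None],    # gene with the missing value
    [1.0, 2.0, 3.0],     # identical profile on the observed arrays
    [1.1, 2.1, 3.2],
    [5.0, 9.0, 12.0],    # dissimilar gene, ignored when k = 2
]
estimate = knn_impute(rows, gene=0, col=2, k=2)
```

Here the estimate is dominated by the identical neighbor, so it lands very close to 3.0. A correlation-based distance could be swapped in for `distance` without changing the rest.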
![Page 10: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/10.jpg)
Ordinary Least Squares (OLS)
• regression-based approach
• also uses k neighbors
• algorithm:
  – choose k neighbors (Euclidean or correlation distance; normalize or not)
  – regress the gene with the MV on each neighbor gene, one at a time (i.e. simple regression)
  – for each neighbor, predict the MV from the fitted regression model
  – impute the MV as the weighted average of the k predictions
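The steps above might look like this in Python. It is a sketch under assumptions: the neighbors are taken as already chosen, equal weights are used unless others are supplied (the slides do not specify the weights), and the function names are hypothetical.

```python
def simple_regression(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # assumes the neighbor gene is not constant
    return mean_y - slope * mean_x, slope

def ols_impute(target, neighbors, col, weights=None):
    """Predict target[col] from each neighbor by simple regression, then average."""
    cols = [j for j in range(len(target)) if j != col]  # assumed fully observed
    predictions = []
    for nb in neighbors:
        a, b = simple_regression([nb[j] for j in cols], [target[j] for j in cols])
        predictions.append(a + b * nb[col])
    if weights is None:
        weights = [1.0] * len(predictions)  # assumption: equal weights
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

target = [2.0, 4.0, 6.0, None]
neighbors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 5.0, 7.0, 9.0],
]
imputed_value = ols_impute(target, neighbors, col=3)
```

Both toy neighbors are perfectly linearly related to the target on the observed arrays, so both regressions predict 8.0.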
![Page 11: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/11.jpg)
Singular Value Decomposition (SVD)
• goal:
  – use the strongest patterns of correlation within the data matrix to estimate the MVs
• algorithm:
  – set the MVs to the row average (a starting point is needed)
  – decompose the expression matrix into orthogonal components, "eigengenes"
  – use the proportion, p, of eigengenes corresponding to the largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve the estimate)
  – use an EM approach to iteratively improve the estimates of the MVs until convergence
![Page 12: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/12.jpg)
Other Imputation Methods:
• Local Singular Value Decomposition (LSVD)
  – combines KNN and SVD
  – algorithm:
    • start with an n genes × m arrays matrix
    • select k neighbor genes (Euclidean or correlation distance; normalize or not)
    • perform SVD on the k × m matrix
• Partial Least Squares (PLS) regression
  – uses all genes and the available data from the target gene
• Factor Analysis (FA) regression
![Page 13: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/13.jpg)
Which imputation method to use?
• KNN is the most widely-used; current standard
• many alternative choices: OLS, SVD, LSVD, PLS, (FA)
• algorithms require user-supplied parameters: k, p, distance metric, etc.
• No set of rules for choosing which method to use
![Page 14: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/14.jpg)
Characteristics of data that may affect choice of imputation method
• dimensionality
• percentage of values missing
• experimental design (time series, case/control, etc.)
• entropy - patterns of correlation in data
• others?
![Page 15: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/15.jpg)
Data Analysis
• Determine differential gene expression
  • Identify up- and down-regulated genes
  • Gene lists produced using the Factor 2 Rule or t-test based methods
• Co-regulation of genes
  • Clustering algorithms
• Identify genes that regulate other genes
  • Networks (e.g. Bayesian)
![Page 16: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/16.jpg)
Methods to Decide Differential Expression
• Compare treatment to the control
  – The fold approach
  – The t-test
  – Variations of the t-test
    • SAM: significance analysis of microarrays
• Compare several treatments
  – ANOVA: analysis of variance
  – MAANOVA: http://www.jax.org/staff/churchill/labsite/software/anova/index.html
![Page 17: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/17.jpg)
Fold Change
• Measure ratios of gene expression levels.
• Ratio = Ti/Ci: the ratio of the measured treatment intensity to the control intensity for the ith spot.
• The log2 ratio treats up- and down-regulated genes equally
  – e.g. when looking for genes with more than 2-fold variation in expression, a 2-fold increase gives log2(2) = +1 and a 2-fold decrease gives log2(1/2) = -1
![Page 18: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/18.jpg)
The Fold Approach
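• In northern analysis, a 2-fold change can be seen with the naked eye.
• Thus biologists tend to use 2-fold as the threshold of differential expression.
• On the log2 scale, with x1 and x2 the log2 ratios of a gene (e.g. from two replicates):
  – mean(x1, x2) > 1: up-regulated by more than 2-fold
  – mean(x1, x2) < -1: down-regulated by more than 2-fold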
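A small Python sketch of the fold approach on the log2 scale. The function names and the "up"/"down"/"unchanged" labels are illustrative assumptions, as is the reading of x1 and x2 as log2 ratios from replicate measurements.

```python
import math

def log2_ratio(treatment, control):
    """log2 of the treatment/control intensity ratio for one spot."""
    return math.log2(treatment / control)

def call_two_fold(log_ratios, threshold=1.0):
    """Classify a gene from the mean of its replicate log2 ratios."""
    mean_lr = sum(log_ratios) / len(log_ratios)
    if mean_lr > threshold:
        return "up"
    if mean_lr < -threshold:
        return "down"
    return "unchanged"

doubled = log2_ratio(200.0, 100.0)  # a 2-fold increase maps to +1
halved = log2_ratio(50.0, 100.0)    # a 2-fold decrease maps to -1
```

The symmetry of `doubled` and `halved` around zero is the reason the log2 ratio treats up- and down-regulation equally, whereas the raw ratios 2 and 0.5 are not symmetric around 1.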
![Page 19: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/19.jpg)
Illustration of the benefit of using Log ratios
![Page 20: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/20.jpg)
Two-fold up-regulation
• Problems with this approach:
  – Only identifies the most changed genes.
  – Also identifies noise and highly variable genes.
  – The ratio is unstable when the denominator is small.
![Page 21: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/21.jpg)
Ratios are unstable
• Initial measurements:
30/60 = 0.5
500/1000 = 0.5
• Add random noise (+15 numerator and -15 denominator):
45/45 = 1.0
515/985 = 0.52
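The same arithmetic in Python, as a quick check of the numbers on this slide; `perturbed_ratio` is a hypothetical helper.

```python
def perturbed_ratio(numerator, denominator, noise=15):
    """Ratio after adding `noise` to the numerator and subtracting it from the denominator."""
    return (numerator + noise) / (denominator - noise)

low_intensity = perturbed_ratio(30, 60)      # 45 / 45
high_intensity = perturbed_ratio(500, 1000)  # 515 / 985
```

The same absolute noise doubles the low-intensity ratio (0.5 to 1.0) but barely moves the high-intensity one (0.5 to about 0.52).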
![Page 22: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/22.jpg)
Types of tests
• The standard t-test assumes the samples are drawn from normal distributions with equal variance, and tests whether their means differ.
• Welch's t-test allows for different variances between classes.
• Mann-Whitney (Wilcoxon rank-sum) converts the data to ranks, and does not assume a particular distribution.
• Permutation test computes the t-statistic for many random permutations of the labels.
![Page 23: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/23.jpg)
The Student’s t-test
• For small samples (fewer than about 30 observations) we use a t distribution rather than the normal distribution.
• We make use of this distribution in the two-sample Student's t-test.
• This test is used to test whether two samples come from distributions with the same mean.
• The samples are assumed to come from Gaussian (normal) distributions.
• The two samples must have similar dispersions.
![Page 24: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/24.jpg)
The Student's t distribution
• The Student's t distribution
  – is mound shaped
  – is symmetrical about zero
  – is more widely dispersed than the standard normal distribution
  – its actual shape depends on the sample size
• different t distributions are identified by their degrees of freedom (df), where df = n - 1 for a single sample of size n
![Page 25: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/25.jpg)
The Student's t distribution (cont.)
[Figure: examples of t distributions with df = 15, df = 30, and df = 120 (= z), plotted against standard errors from -4 to 4; not to scale.]
![Page 26: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/26.jpg)
Mean and Median
• The mean is the most common measure of the location of a set of points.
• However, the mean is very sensitive to outliers.
• Thus, the median or a trimmed mean is also commonly used.
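A short illustration with made-up values, using Python's standard `statistics` module; `trimmed_mean` is a hypothetical helper that drops a fixed number of values from each end.

```python
from statistics import mean, median

def trimmed_mean(values, trim=1):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)

with_outlier = [1.0, 1.1, 0.9, 1.2, 10.0]  # made-up values; 10.0 is the outlier

plain_mean = mean(with_outlier)             # pulled up to about 2.84 by the outlier
robust_median = median(with_outlier)        # 1.1
robust_trimmed = trimmed_mean(with_outlier) # about 1.1
```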
![Page 27: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/27.jpg)
Range and Variance
• Range is the difference between the max and min.
• The variance or standard deviation sx is the most common measure of the spread of a set of points.
• Because of outliers, other measures are often used.
![Page 28: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/28.jpg)
Statistical Analysis
[Figure: two distributions, one around the control group mean and one around the treatment group mean.]
Is there a difference?
![Page 29: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/29.jpg)
What does difference mean?
[Figure: three pairs of distributions with medium, high, and low variability.]
The mean difference is the same for all three cases.
![Page 30: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/30.jpg)
What does difference mean?
[Figure: three pairs of distributions with medium, high, and low variability.]
Which one shows the greatest difference?
![Page 31: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/31.jpg)
What does difference mean?
• a statistical difference is a function of the difference between means relative to the variability
• a small difference between means with large variability could be due to chance
• like a signal-to-noise ratio
[Figure: the low-variability pair of distributions. Which one shows the greatest difference?]
![Page 32: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/32.jpg)
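So we estimate

$$\frac{\text{signal}}{\text{noise}} = \frac{\text{difference between group means}}{\text{variability of groups}} = \frac{\bar{X}_T - \bar{X}_C}{SE(\bar{X}_T - \bar{X}_C)} = t\text{-value}$$

[Figure: the low-variability pair of distributions.]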
![Page 33: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/33.jpg)
Probability - p
• From the t-value we obtain a probability p and use it to reject, or not reject, the null hypothesis.
• You reject if p < 0.05 (or a smaller threshold).
• The smaller p is, the more significant the difference between the means (groups).
![Page 34: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/34.jpg)
Important notes on two sample comparisons
• Type I errors (false positives)
  – we accept that a difference is real when it is not (at the 95% confidence level we expect this in 5% of the comparisons where there is no real difference)
  – we can make the significance level more stringent (a smaller p-value threshold) to decrease these errors
• Type II errors (false negatives)
  – if we make the significance level more stringent, we risk missing some real differences
• Convention is that we should reduce Type I errors and be conservative
• Both can be minimised by increasing the sample size
![Page 35: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/35.jpg)
Paired and unpaired tests
• There are different formulas for the t-test depending on whether we have paired or unpaired data
  – Paired: making observations of N individuals in two different situations
    • In this situation we can consider the difference for each individual rather than calculate separate means and SEs for the two effects
  – Unpaired: two separate samples drawn from the same parent population
    • Can have different sample sizes
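For the paired case, the statistic is computed from the per-individual differences. A minimal sketch with made-up measurements; `paired_t` is a hypothetical helper built on the standard `statistics` module.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """t statistic for paired data: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Made-up measurements on the same four individuals in two situations.
situation_1 = [1.0, 2.0, 3.0, 4.0]
situation_2 = [0.5, 1.4, 2.6, 3.3]
t_paired = paired_t(situation_1, situation_2)  # df = N - 1 = 3
```

Although the individuals differ a lot from each other, the differences are consistent (0.4 to 0.7), so the paired statistic is large.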
![Page 36: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/36.jpg)
Tails
• Two-tailed: Do set A and set B come from different distributions?
• One-tailed: Does set A come from a distribution with larger mean than set B?
• This corresponds to finding differentially regulated genes versus finding up-regulated genes.
![Page 37: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/37.jpg)
Selecting genes with a t-test

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}}$$

where
μi = mean expression value in class i
ni = number of examples in class i
v = pooled variance across both classes

http://mathworld.wolfram.com/Studentst-Distribution.html
Zar. Biostatistical Analysis. 1999.
![Page 38: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/38.jpg)
Standard T Test: An example
• Observed gene expression values:
Treatment A: 0.45 0.57 1.02 0.97
Treatment B: 1.50 2.07 0.51 1.63
• Compute mean:
mean (A) = 3.01 / 4 = 0.7525
mean (B) = 5.71 / 4 = 1.4275
![Page 39: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/39.jpg)
Pooled variance
• The standard t-test assumes samples are drawn from distributions with the same variance.
• Pooled variance
  = (SS1 + SS2) / (n1 + n2 - 2)
  = (0.243675 + 1.300875) / (4 + 4 - 2)
  = 0.2574
  where SSi is the sum of squared deviations from the mean in class i
![Page 40: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/40.jpg)
Selecting genes with a t-test

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}}$$

t = (0.7525 - 1.4275) / sqrt(0.2574/4 + 0.2574/4) = -1.8815, so |t| = 1.8815
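The worked example can be reproduced in plain Python. The data are the treatment A and B values from the example slides; the helper names are illustrative.

```python
import math

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]

def mean(values):
    return sum(values) / len(values)

def sum_of_squares(values):
    """Sum of squared deviations from the mean (SS)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

n1, n2 = len(treatment_a), len(treatment_b)
pooled_variance = (sum_of_squares(treatment_a) + sum_of_squares(treatment_b)) / (n1 + n2 - 2)
t_statistic = (mean(treatment_a) - mean(treatment_b)) / math.sqrt(
    pooled_variance / n1 + pooled_variance / n2
)
```

The statistic is negative because treatment A has the smaller mean; its magnitude is what gets compared against the t distribution with n1 + n2 - 2 = 6 degrees of freedom in a two-tailed test.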
![Page 41: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/41.jpg)
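If the Sample Variances are Unlikely to be Equal
• Use Welch's t-test

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}}$$

• degrees of freedom

$$\nu = \frac{(A + B)^2}{\dfrac{A^2}{n_x - 1} + \dfrac{B^2}{n_y - 1}}$$

• where

$$A = \frac{s_x^2}{n_x}, \qquad B = \frac{s_y^2}{n_y}$$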
![Page 42: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/42.jpg)
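Welch's approximation

t-test:

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v}{n_1} + \dfrac{v}{n_2}}}$$

Welch's:

$$t = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{v_1}{n_1} + \dfrac{v_2}{n_2}}}$$

t-test: |t| = 1.8815
Welch's: |t| = |0.7525 - 1.4275| / sqrt(0.081225/4 + 0.433625/4) = 1.8815

Here v1 = 0.243675/3 = 0.081225 and v2 = 1.300875/3 = 0.433625 are the sample variances of the two treatments. Because n1 = n2, the two statistics coincide; the tests differ only in their degrees of freedom.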
![Page 43: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/43.jpg)
Degrees of freedom
• For the t-test, dof = n1 + n2 - 2.
• For Welch's approximation, it is not so simple. Let Ai = vari / ni. Then

$$\text{dof} = \left\lfloor \frac{(A_1 + A_2)^2}{\dfrac{A_1^2}{n_1 - 1} + \dfrac{A_2^2}{n_2 - 1}} \right\rfloor$$
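As a sketch, Welch's statistic and its degrees of freedom for the same treatment A and B data can be computed with the standard library (`statistics.variance` is the sample variance with an n - 1 denominator); the function name `welch` is an assumption.

```python
import math
from statistics import mean, variance

def welch(x, y):
    """Welch's t statistic and floored degrees of freedom for two samples."""
    a_x = variance(x) / len(x)  # A1 = var1 / n1
    a_y = variance(y) / len(y)  # A2 = var2 / n2
    t = (mean(x) - mean(y)) / math.sqrt(a_x + a_y)
    dof = (a_x + a_y) ** 2 / (a_x ** 2 / (len(x) - 1) + a_y ** 2 / (len(y) - 1))
    return t, math.floor(dof)

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
welch_t, welch_dof = welch(treatment_a, treatment_b)
```

For these data the degrees of freedom drop from 6 (standard t-test) to 4, because the two sample variances are quite different.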
![Page 44: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/44.jpg)
Non-parametric p-value
• The t-test assumes the t-distribution
  – a parametric method
  – compute the test statistic
  – use the t pdf to determine the p-value
• A non-parametric method
  – data are labeled as X and Y
  – compute the test statistic with the true labels
  – randomly permute the individual labels 10000 times, and compute the test statistic each time
  – find the rank of the true test statistic among the test statistics of the random permutations
  – for example, if there are 10 permutations with test statistics larger than the true test statistic, then the p-value is 10/10000 = 0.001
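The permutation procedure can be sketched as below. Several choices here are assumptions rather than anything stated on the slide: the absolute difference of means is used as the test statistic (any statistic could be plugged in), ties with the observed value are counted as "at least as extreme", and a fixed seed is used so the run is repeatable.

```python
import random

def mean_difference(x, y):
    return sum(x) / len(x) - sum(y) / len(y)

def permutation_p_value(x, y, n_perm=10000, seed=0):
    """Fraction of label permutations whose statistic is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean_difference(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling the values is equivalent to permuting the labels
        stat = abs(mean_difference(pooled[:len(x)], pooled[len(x):]))
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    return hits / n_perm

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
p_value = permutation_p_value(treatment_a, treatment_b)
```

With only 8 values there are just 70 distinct splits, so the p-value cannot be very small here; with more samples, resolving a small p-value needs many permutations.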
![Page 45: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/45.jpg)
Mann-Whitney u-test
• Mann-Whitney, also known as the Wilcoxon rank-sum test, is a non-parametric test.
• Begin by converting to ranks:
Values:
Treatment A: 0.45 0.57 1.02 0.97
Treatment B: 1.50 2.07 0.51 1.63
Ranks:
Treatment A: 1 3 5 4
Treatment B: 6 8 2 7
![Page 46: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/46.jpg)
Mann-Whitney u statistic
• The u statistic is

$$U = \max\left( n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1,\; n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \right)$$

where Ri is the sum of the ranks in class i.
• U = 16 + 10 - 13 = 13
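The rank conversion and the U statistic for the treatment A and B example, as a short Python sketch. It assumes there are no tied values (ties would need averaged ranks); the function names are illustrative.

```python
def rank_sums(x, y):
    """Rank the pooled values (1 = smallest) and return the rank sum of each class."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}  # assumes no ties
    return sum(rank[v] for v in x), sum(rank[v] for v in y)

def mann_whitney_u(x, y):
    n1, n2 = len(x), len(y)
    r1, r2 = rank_sums(x, y)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - r2
    return max(u1, u2)

treatment_a = [0.45, 0.57, 1.02, 0.97]
treatment_b = [1.50, 2.07, 0.51, 1.63]
r_a, r_b = rank_sums(treatment_a, treatment_b)  # 13 and 23
u = mann_whitney_u(treatment_a, treatment_b)    # max(13, 3) = 13
```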
![Page 47: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/47.jpg)
Permutation test
![Page 48: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/48.jpg)
Cost-benefits analysis
• The t-test assumes both samples are drawn from normal distributions with the same variance.
• Welch's approximation allows the samples to be drawn from normal distributions with different variances.
• Mann-Whitney makes no assumption about the distribution.
• The tests, as listed, yield decreasing power.
• The permutation test gives the most flexibility in choosing a test statistic that reflects prior knowledge, but it can be computationally expensive for small p-values.
![Page 49: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/49.jpg)
Multiple testing correction
• On an array of 10,000 spots, a p-value of 0.0001 may not be significant.
• For significance of 0.05 with 10,000 spots, you need a p-value of 5 × 10^-6 (0.05 / 10,000).
![Page 50: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/50.jpg)
Family-wise Error-rate
• FWER: the chance of any false positives
• Assume a 0.01 significance level for one gene
• Multiply by the number of genes: many false positives (e.g. 0.01 × 10,000 genes = 100 expected by chance)
• Bonferroni correction: divide 0.01 by the number of genes
• Bonferroni is conservative because it treats all genes as independent tests, whereas the expression levels of many genes are correlated.
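The arithmetic behind these bullets, as a small Python sketch. The function names are illustrative, and the family-wise error rate formula assumes independent tests, which is only an approximation for real arrays.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of tests flagged by chance when no gene is truly changed."""
    return alpha * n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests

def fwer_if_independent(per_test_alpha, n_tests):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - per_test_alpha) ** n_tests

by_chance = expected_false_positives(0.01, 10000)    # about 100 genes
per_test = bonferroni_threshold(0.05, 10000)         # about 5e-6
uncorrected_fwer = fwer_if_independent(0.01, 10000)  # essentially 1
corrected_fwer = fwer_if_independent(per_test, 10000)
```

With the Bonferroni threshold, `corrected_fwer` stays just below 0.05, while without correction a false positive is practically certain.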
![Page 51: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/51.jpg)
False discovery rate
• The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
• False positive rate: percentage of non-differentially expressed genes that are flagged.
• False discovery rate: percentage of flagged genes that are not differentially expressed.
Example counts: 5 FP, 13 TP, 33 TN, 5 FN
FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%
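The two rates for the example counts, written as Python functions (the names are illustrative):

```python
def false_discovery_rate(fp, tp):
    """Share of flagged genes that are not differentially expressed."""
    return fp / (fp + tp)

def false_positive_rate(fp, tn):
    """Share of non-differentially expressed genes that are flagged."""
    return fp / (fp + tn)

fdr = false_discovery_rate(fp=5, tp=13)  # 5/18
fpr = false_positive_rate(fp=5, tn=33)   # 5/38
```

The numerator is the same in both; only the denominator changes, from the flagged genes (FDR) to the truly unchanged genes (FPR).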
![Page 52: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/52.jpg)
Bonferroni vs. FDR
• Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive.
• FDR is the proportion of false positives among the genes that are flagged as differentially expressed.
![Page 53: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/53.jpg)
Controlling the FDR
• Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm.
• To control FDR at level α,

$$j^* = \max\left\{ j : p_j \le \frac{j}{m}\,\alpha \right\}$$

• Reject the null hypothesis for j = 1, …, j*.
• This approach is conservative if many genes are differentially expressed.
(Benjamini & Hochberg, 1995)
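A compact Python sketch of this step-up rule; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of the hypotheses rejected when controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing p-value
    j_star = 0
    for j, idx in enumerate(order, start=1):
        if p_values[idx] <= j * alpha / m:
            j_star = j  # keep the largest j that satisfies the condition
    return sorted(order[:j_star])

# Made-up p-values, deliberately unsorted.
p_values = [0.9, 0.03, 0.02, 0.024]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

Note the step-up behaviour: the smallest p-value (0.02) fails its own threshold of 0.0125, but it is still rejected because the third smallest (0.03) meets its threshold of 0.0375, which sets j* = 3.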
![Page 54: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/54.jpg)
q-value
• The p-value for a particular gene G is the probability that a randomly generated expression profile would be as or more extremely differentially expressed.
• The q-value for a particular gene G is the proportion of false positives among all genes that are as or more extremely differentially expressed.
• Equivalently, the q-value is the minimal FDR at which this gene appears significant.
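One simple way to compute a "minimal FDR at which the gene appears significant" is the Benjamini-Hochberg adjusted p-value, sketched below. This is an assumption-laden stand-in: the q-value software referenced in these slides additionally estimates the proportion of truly unchanged genes, which this sketch takes to be 1. The function name and p-values are made up.

```python
def bh_q_values(p_values):
    """Smallest BH level alpha at which each hypothesis would be rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each gene gets the minimum over
    # all positions at or below it in the ranked list.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q

q_values = bh_q_values([0.02, 0.024, 0.03, 0.9])
```

The first three genes all receive the same value (about 0.04), since the gene ranked third determines the level at which all three become significant together.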
![Page 55: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/55.jpg)
Q-value software
http://faculty.washington.edu/~jstorey/qvalue/
![Page 56: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/56.jpg)
SAM
Significance analysis of microarrays applied to the ionizing radiation response
Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu
Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 9, 5116-5121, April 24, 2001
![Page 57: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/57.jpg)
Abstract
• Method for gene filtering: find genes that change significantly across samples
• Significance Analysis of Microarrays (SAM) assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements.
• For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR).
![Page 58: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/58.jpg)
Introduction
• Suitable for oligo, cDNA, protein arrays
• Does not normalize the data!
• Challenge:
  – methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance. For an array with 10000 genes, a significance level of alpha = 0.01 would identify 100 genes by chance.
  – Experiments are expensive.
![Page 59: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/59.jpg)
Introduction
• Solution based on SAM:
  – assimilate a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene.
  – Instead of more replicates, generate permutations of the data (mix the labels)
• Genes with scores greater than a threshold are deemed potentially significant. The percentage of such genes identified by chance is the false discovery rate (FDR). To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements.
• The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set. To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR).
![Page 60: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/60.jpg)
04/18/23 61
Motivating Experiment
Design: human cell lines (1, 2) × treatment (Irradiated (I), Unirradiated (U)).
One RNA sample for each combination of cell line and treatment.
![Page 61: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/61.jpg)
04/18/23 62
Motivating Experiment

| Human cell line | Irradiated (I) | Unirradiated (U) |
|---|---|---|
| 1 | I1A, I1B | U1A, U1B |
| 2 | I2A, I2B | U2A, U2B |

After labeling, each RNA sample was split into two aliquots denoted A and B.
![Page 62: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/62.jpg)
04/18/23 63
Motivating Experiment
Samples: I1A, I1B, I2A, I2B (irradiated); U1A, U1B, U2A, U2B (unirradiated).
8 GeneChips, one for each sample, were used to obtain measures of expression.
![Page 63: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/63.jpg)
04/18/23 64
First glance at the data
[Figure panels: linear scatter plot of gene expression; cube root scatter plot of gene expression.]
![Page 64: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/64.jpg)
04/18/23 65
How to find the significant changes? Naïve method
Cube root scatter plot of average gene expression from the four hybridizations with uninduced cells (avg xU) and induced cells 4 h after exposure to 5 Gy of IR (avg xI). Some of the genes that responded to IR are indicated by arrows.
![Page 65: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/65.jpg)
04/18/23 66
Test Statistic for the ith Gene

$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$$

• $\bar{x}_I(i)$: average of 4 normalized measures from irradiated samples
• $\bar{x}_U(i)$: average of 4 normalized measures from unirradiated samples
• $s(i)$: the usual standard deviation in the denominator of a two-sample t-statistic
• $s_0$: a constant common to all genes that is added to make variation in d(i) similar across genes of all intensity levels
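As a rough illustration, the statistic can be computed for all genes at once with NumPy. This is a minimal sketch, not the SAM software: the function name `sam_statistic` and the genes-by-samples array layout are assumptions, and s(i) is taken to be the pooled standard error used in an equal-variance two-sample t-statistic.

```python
import numpy as np


def sam_statistic(x_i, x_u, s0):
    """Relative difference d(i) for each gene (row).

    x_i, x_u: arrays of shape (genes, replicates) for irradiated and
    unirradiated samples (layout is an assumption of this sketch).
    s0: the constant added to the denominator.
    """
    x_i = np.asarray(x_i, dtype=float)
    x_u = np.asarray(x_u, dtype=float)
    n_i, n_u = x_i.shape[1], x_u.shape[1]
    mean_i = x_i.mean(axis=1)
    mean_u = x_u.mean(axis=1)
    # Pooled sum of squared deviations around each group mean.
    ss = ((x_i - mean_i[:, None]) ** 2).sum(axis=1) \
        + ((x_u - mean_u[:, None]) ** 2).sum(axis=1)
    # Standard error of the difference of means (two-sample t denominator).
    a = (1.0 / n_i + 1.0 / n_u) / (n_i + n_u - 2)
    s = np.sqrt(a * ss)
    d = (mean_i - mean_u) / (s + s0)
    return d, s


# One toy gene: irradiated values vary, unirradiated values are constant.
d_plain, s_plain = sam_statistic([[2, 4, 2, 4]], [[1, 1, 1, 1]], s0=0.0)
d_damped, _ = sam_statistic([[2, 4, 2, 4]], [[1, 1, 1, 1]], s0=3.3)
```

With s0 = 0 the result is the ordinary t-statistic; a positive s0 shrinks |d(i)|, most strongly for genes whose s(i) is small.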
![Page 66: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/66.jpg)
04/18/23 67
Selecting the constant s0
• At low expression levels, variance in d(i) can be high because of small values of s(i).
• To stabilize the variance of d(i) across genes, a small positive constant s0 was used in the denominator of the test statistic.
• “The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation.”
• s0 was chosen to be 3.3 for the ionizing radiation data.
![Page 67: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/67.jpg)
04/18/23 68
More Detail on Selecting s0
• The d(i) are separated into approximately 100 groups. The 1% of the d(i) values with the smallest s(i) values are placed in the first group, the 1% of the d(i) values with the next smallest s(i) are placed in the second group, etc.
• The median absolute deviation (MAD) of the d(i) values is computed separately for each group.
• The coefficient of variation (CV) of these 100 MAD values is computed.
![Page 68: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/68.jpg)
04/18/23 69
More Detail on Selecting s0 (continued)
• This process is repeated for values of s0 equal to the minimum of s(i) over i, the 5th percentile of the s(i) values, the 10th percentile of the s(i) values,..., the 95th percentile of the s(i) values, and the maximum of the s(i) values.
• The value of s0 that minimizes the CV of the 100 MAD values over candidate s0 described above is selected as the constant s0.
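The s0 selection steps above can be sketched as follows. This is an illustrative reading of the slides, not the reference implementation: the function name `choose_s0`, the input `r` (the numerator of d(i), i.e. the difference of group means) and the synthetic test data are all assumptions.

```python
import numpy as np


def choose_s0(r, s, n_groups=100):
    """Pick s0 from percentiles of s(i) by minimizing the CV of group MADs.

    r: numerator of d(i) for each gene (difference of group means).
    s: gene-specific scatter s(i).
    Requires at least n_groups genes so that no group is empty.
    """
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    # Group genes by s(i): smallest 1%, next 1%, and so on.
    groups = np.array_split(np.argsort(s), n_groups)
    # Candidates: min, 5th, 10th, ..., 95th percentile, max of s(i).
    candidates = np.percentile(s, np.arange(0, 101, 5))
    best_s0, best_cv = None, np.inf
    for s0 in candidates:
        d = r / (s + s0)
        # Median absolute deviation of d(i) within each group.
        mads = np.array(
            [np.median(np.abs(d[g] - np.median(d[g]))) for g in groups]
        )
        cv = mads.std(ddof=1) / mads.mean()
        if cv < best_cv:
            best_s0, best_cv = s0, cv
    return best_s0, best_cv


# Synthetic data only, to show the call shape.
rng = np.random.default_rng(0)
s_demo = rng.gamma(2.0, 1.0, size=2000) + 0.01
r_demo = rng.normal(0.0, 1.0, size=2000) * (s_demo + 0.5)
s0_demo, cv_demo = choose_s0(r_demo, s_demo)
```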
![Page 69: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/69.jpg)
04/18/23 70
Balancing the Permutations
• There are differences between the two cell lines.
• Balanced permutations minimize the effects of these differences.
• A permutation is balanced if each group of four experiments contains two experiments from line 1 and two from line 2.
• There are 36 balanced permutations.
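The count of 36 can be reproduced by enumeration: choose 2 of the 4 samples from line 1 and 2 of the 4 samples from line 2 to form the group labeled "irradiated", giving C(4,2) × C(4,2) = 36. A minimal sketch (the function name `balanced_permutations` is hypothetical; sample names follow the slides):

```python
from itertools import combinations

LINE1 = ["I1A", "I1B", "U1A", "U1B"]
LINE2 = ["I2A", "I2B", "U2A", "U2B"]


def balanced_permutations(line1, line2):
    """Enumerate balanced relabelings as (group_I, group_U) pairs.

    Each group of four contains two samples from each cell line.
    """
    everything = set(line1) | set(line2)
    perms = []
    for from_line1 in combinations(line1, 2):
        for from_line2 in combinations(line2, 2):
            group_i = set(from_line1) | set(from_line2)
            group_u = everything - group_i
            perms.append((sorted(group_i), sorted(group_u)))
    return perms


perms = balanced_permutations(LINE1, LINE2)
print(len(perms))
```

The original labeling (all I samples in one group, all U samples in the other) is itself one of the 36.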
![Page 70: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/70.jpg)
04/18/23 71
Example Permutations
Samples by human cell line (the treatment labels Irradiated (I) and Unirradiated (U) are shuffled within each line):
Line 1: I1A, I1B, U1A, U1B
Line 2: I2A, I2B, U2A, U2B
![Page 71: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/71.jpg)
04/18/23 72
• Scatter plots of relative difference in gene expression d(i) vs. gene-specific scatter s(i).
![Page 72: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/72.jpg)
04/18/23 73
A Permutation Procedure for Assessing Significance
1. The irradiated and unirradiated GeneChips were shuffled within each cell line.
2. The d(i) statistic was computed for each gene and ordered across genes from smallest to largest to obtain d1(1) < d1(2) < ... < d1(g), where g denotes the number of genes.
3. Steps 1 and 2 were repeated for all possible data permutations described in step 1 to obtain dp(1) < dp(2) < ... < dp(g) for p = 1, ..., 36, where 36 = C(4,2) × C(4,2).
![Page 73: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/73.jpg)
04/18/23 74
A Permutation Procedure for Assessing Significance (continued)
4. For each i, d1(i), ..., d36(i) were averaged to obtain dE(i), the “expected relative difference.”
5. The original d(i) statistics were also sorted so that d(1) < d(2) < ... < d(g).
6. Genes for which | d(i) – dE(i) | > Δ were declared significant, where Δ is a user-specified cutoff for significance.
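Steps 4 to 6 can be sketched in a few lines, assuming the d statistics for every permutation are already available as an array. The function name `expected_and_calls` and the array shapes are assumptions of this sketch.

```python
import numpy as np


def expected_and_calls(d_obs, d_perm, delta):
    """Order statistics, expected relative differences, and calls.

    d_obs: shape (genes,), statistics from the original labeling.
    d_perm: shape (permutations, genes), statistics from each permutation.
    delta: the user-specified cutoff.
    """
    d_sorted = np.sort(np.asarray(d_obs, dtype=float))
    # Sort within each permutation, then average the i-th smallest values.
    d_exp = np.sort(np.asarray(d_perm, dtype=float), axis=1).mean(axis=0)
    significant = np.abs(d_sorted - d_exp) > delta
    return d_sorted, d_exp, significant


# Toy data: three genes, two permutations.
d_sorted, d_exp, sig = expected_and_calls(
    [3.0, -3.0, 0.1], [[0.0, 1.0, -1.0], [1.0, -1.0, 0.0]], delta=1.0
)
```

Note that the flags refer to positions in the sorted order, so a real analysis would also keep the sort index to map calls back to gene identifiers.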
![Page 74: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/74.jpg)
04/18/23 75
Example
![Page 75: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/75.jpg)
04/18/23 76
Plot of Observed vs. “Expected” Test Statistics
[Figure: observed d(i) plotted against expected dE(i). Annotations mark points for genes with evidence of induction and points for genes with evidence of repression.]
![Page 76: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/76.jpg)
04/18/23 77
Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data
[Figure: d(i) plotted against log10 s(i). Annotations: 24 induced genes, 22 repressed genes.]
![Page 77: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/77.jpg)
04/18/23 78
Estimating FDR for a Selected Δ
1. Find the smallest d(i) among those d(i) for which d(i) – dE(i) > Δ and call it dup.
2. Find the largest d(i) among those d(i) for which d(i) – dE(i) < –Δ and call it ddown.
3. For each permuted data set, find the number of genes with d(i) >= dup or d(i) <= ddown and denote these counts by n1, ..., n36.
4. FDR is estimated by n̄ / n, where n̄ is the average of n1, ..., n36 and n is the number of genes identified as significant in the original data.
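The four steps can be put together in a short sketch. The function name `estimate_fdr` and the array shapes are assumptions; here n is counted as the number of original genes at or beyond the two cutoffs, which is one reasonable reading of "identified as significant".

```python
import numpy as np


def estimate_fdr(d_obs, d_perm, delta):
    """Estimate FDR for a chosen delta from permutation statistics.

    d_obs: shape (genes,); d_perm: shape (permutations, genes).
    Returns (d_up, d_down, n_called, per-permutation counts, fdr).
    """
    d_perm = np.asarray(d_perm, dtype=float)
    d_sorted = np.sort(np.asarray(d_obs, dtype=float))
    d_exp = np.sort(d_perm, axis=1).mean(axis=0)
    diff = d_sorted - d_exp
    up = d_sorted[diff > delta]
    down = d_sorted[diff < -delta]
    # If no gene passes on one side, that cutoff is never reached.
    d_up = up.min() if up.size else np.inf
    d_down = down.max() if down.size else -np.inf
    n_called = int(np.sum((d_sorted >= d_up) | (d_sorted <= d_down)))
    counts = np.sum((d_perm >= d_up) | (d_perm <= d_down), axis=1)
    fdr = counts.mean() / n_called if n_called else float("nan")
    return d_up, d_down, n_called, counts, fdr


# Toy data: three genes, two permutations.
d_up, d_down, n_called, counts, fdr = estimate_fdr(
    [-3.0, 0.1, 3.0], [[0.0, 1.0, -1.0], [3.0, -1.0, 0.0]], delta=0.5
)
```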
![Page 78: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/78.jpg)
04/18/23 79
FDR cont’d
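$$\mathrm{FDR} = \frac{\frac{1}{36}\sum_{p=1}^{36} \#\{\, i \mid d_p(i) \ge t_2 \ \text{or}\ d_p(i) \le t_1 \,\}}{\#\{\, i \mid d(i) \ge t_2 \ \text{or}\ d(i) \le t_1 \,\}}$$

Here t1 and t2 denote the lower and upper cutoffs (ddown and dup).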
• Note: Cutoffs are asymmetric
![Page 79: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/79.jpg)
04/18/23 80
Counts of Genes beyond the Threshold For Each Permutation
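| Perm | Count | Perm | Count | Perm | Count |
|---|---|---|---|---|---|
| 1 | 45 | 13 | 4 | 25 | 4 |
| 2 | 5 | 14 | 1 | 26 | 2 |
| 3 | 2 | 15 | 3 | 27 | 1 |
| 4 | 3 | 16 | 9 | 28 | 1 |
| 5 | 4 | 17 | 12 | 29 | 5 |
| 6 | 11 | 18 | 31 | 30 | 9 |
| 7 | 8 | 19 | 31 | 31 | 11 |
| 8 | 5 | 20 | 12 | 32 | 4 |
| 9 | 1 | 21 | 9 | 33 | 3 |
| 10 | 1 | 22 | 3 | 34 | 2 |
| 11 | 3 | 23 | 1 | 35 | 5 |
| 12 | 4 | 24 | 4 | 36 | 46 |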
![Page 80: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/80.jpg)
04/18/23 81
Mean count = 8.472
FDR estimate = 8.472/46 = 18.4%
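The mean count and FDR estimate can be reproduced directly from the 36 per-permutation counts on the slide. The variable names are illustrative; the counts are read off the table (permutations 1 to 36, in order).

```python
# Counts of genes beyond the thresholds for permutations 1..36,
# as listed on the slide.
counts = [
    45, 5, 2, 3, 4, 11, 8, 5, 1, 1, 3, 4,     # permutations 1-12
    4, 1, 3, 9, 12, 31, 31, 12, 9, 3, 1, 4,   # permutations 13-24
    4, 2, 1, 1, 5, 9, 11, 4, 3, 2, 5, 46,     # permutations 25-36
]
n_significant = 46  # genes called significant in the original data

mean_count = sum(counts) / len(counts)
fdr_estimate = mean_count / n_significant

print(f"Mean count = {mean_count:.3f}")
print(f"FDR estimate = {fdr_estimate:.1%}")
```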
![Page 81: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/81.jpg)
04/18/23 82
How to choose Δ?
Omitting s0 caused higher FDR.
![Page 82: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/82.jpg)
04/18/23 83
Plot of Observed vs. “Expected” Test Statistics
[Figure: observed d(i) plotted against expected dE(i), with thresholds marked at -4.073859 and 4.054688.]
![Page 83: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/83.jpg)
04/18/23 84
Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data
[Figure: d(i) plotted against log10 s(i), with thresholds marked at -4.073859 and 4.054688.]
![Page 84: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/84.jpg)
04/18/23 85
Same Plot for One of the Permuted Data Sets
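[Figure: d(i) plotted against log10 s(i) for one permuted data set, with thresholds marked at -4.073859 and 4.054688. Only 5 genes lie beyond the thresholds, compared to 46 for the original data.]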
![Page 85: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/85.jpg)
04/18/23 86
SAM vs. R fold
• R-fold method:
– Gene i is significant if r(i) > R or r(i) < 1/R, where

$$r(i) = \frac{\bar{x}_I(i)}{\bar{x}_U(i)}$$

– FDR 73%-84%: unacceptable.
• Pairwise fold change: at least 12 out of 16 pairings must satisfy the criteria. FDR 60%-71%: unacceptable.
• Why doesn’t it work?
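For comparison with the SAM statistic, the R-fold rule is easy to state in code. This is a minimal sketch with a hypothetical function name `r_fold_calls`; it assumes strictly positive expression values arranged as genes by replicates.

```python
import numpy as np


def r_fold_calls(x_i, x_u, R):
    """Flag genes whose ratio of mean expression exceeds R or falls below 1/R.

    x_i, x_u: arrays of shape (genes, replicates), strictly positive.
    """
    mean_i = np.asarray(x_i, dtype=float).mean(axis=1)
    mean_u = np.asarray(x_u, dtype=float).mean(axis=1)
    r = mean_i / mean_u
    return r, (r > R) | (r < 1.0 / R)


# Three toy genes: induced 4-fold, repressed 4-fold, unchanged.
r, called = r_fold_calls(
    [[4, 4], [1, 1], [2, 2]], [[1, 1], [4, 4], [2, 2]], R=2.0
)
```

Unlike d(i), this rule takes no account of the variability of the repeated measurements.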
![Page 86: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/86.jpg)
04/18/23 87
Fold change vs. SAM: validation
![Page 87: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/87.jpg)
04/18/23 88
![Page 88: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/88.jpg)
04/18/23 89
SAM vs. Multiple t-Tests
• Multiple t-tests try to control the FDR or the FWER (family-wise error rate).
• Why doesn’t it work?
– FWER: too stringent (Bonferroni, Westfall and Young)
– FDR: too granular (Benjamini and Hochberg)
• SAM does not assume normal distribution of the data.
• SAM works effectively even with small sample size.
![Page 89: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/89.jpg)
04/18/23 90
Conclusion SAM
• SAM is a method for identifying genes on a microarray with statistically significant changes in expression.
• SAM provides an estimate of the FDR for each value of the tuning parameter. The estimated FDR is computed from permutations of the data.
• SAM can be generalized to other types of experiments and outcomes by redefining d(i)
• http://www-stat-class.stanford.edu/SAM/SAMServlet.
![Page 90: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/90.jpg)
04/18/23 91
ANOVA
• The t-test and its variants only work when there are two sample pools.
• Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.
• A tutorial is available here: http://cran.at.r-project.org/doc/contrib/Faraway-PRA.pdf
![Page 91: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/91.jpg)
04/18/23 92
A simple experiment
• Measure response to a drug treatment in two different mouse strains.
• Repeat each measurement five times.
• Total experiment = 2 strains * 2 treatments * 5 repetitions = 20 arrays
• If you look for treatment effects using a t-test, then you ignore the strain effects.
![Page 92: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/92.jpg)
04/18/23 93
ANOVA lingo
• Factor: a variable that is under the control of the experimenter (strain, treatment).
• Level: a possible value of a factor (drug, no drug).
• Main effect: an effect that involves only one factor.
• Interaction effect: an effect that involves two or more factors simultaneously.
• Balanced design: an experiment in which each factor and level is measured an equal number of times.
![Page 93: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/93.jpg)
04/18/23 94
Two-factor design
![Page 94: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/94.jpg)
04/18/23 95
Fixed and random effects
• Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.
• Random effect: a term for which the levels would not repeat in a replicated experiment.
• In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.
![Page 95: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/95.jpg)
04/18/23 96
ANOVA model
$$E_{ijk} = \mu + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}, \quad i = 1, \ldots, n,\ j = 1, \ldots, m,\ k = 1, \ldots, p$$

• $\mu$ is the mean expression level of the gene.
• T and S are main effects (treatment, strain) with n and m levels, respectively.
• TS is an interaction effect.
• p is the number of replicates per group.
• $\varepsilon$ represents random error (to be minimized).
![Page 96: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/96.jpg)
04/18/23 97
ANOVA steps
• For each gene on the array:
– Fit the parameters T and S, minimizing ε.
– Test T, S and TS for difference from zero, yielding three F statistics.
– Convert the F statistics into p-values.
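For a single gene in a balanced design, these steps can be sketched with the classical sums-of-squares decomposition. This is an illustration under stated assumptions, not a recommended pipeline: the function name `two_way_anova` and the (treatment, strain, replicate) array layout are hypothetical, and SciPy's F distribution is used for the p-values.

```python
import numpy as np
from scipy import stats


def two_way_anova(y):
    """Balanced two-factor ANOVA with interaction for one gene.

    y: array of shape (n, m, p) = (treatment levels, strain levels,
    replicates). Returns {"T": (F, p), "S": (F, p), "TS": (F, p)}.
    """
    y = np.asarray(y, dtype=float)
    n, m, p = y.shape
    grand = y.mean()
    t_mean = y.mean(axis=(1, 2))   # treatment means
    s_mean = y.mean(axis=(0, 2))   # strain means
    cell = y.mean(axis=2)          # treatment-by-strain cell means
    ss_t = m * p * np.sum((t_mean - grand) ** 2)
    ss_s = n * p * np.sum((s_mean - grand) ** 2)
    ss_ts = p * np.sum(
        (cell - t_mean[:, None] - s_mean[None, :] + grand) ** 2
    )
    ss_err = np.sum((y - cell[:, :, None]) ** 2)
    df = {"T": n - 1, "S": m - 1, "TS": (n - 1) * (m - 1)}
    df_err = n * m * (p - 1)
    ms_err = ss_err / df_err
    result = {}
    for name, ss in (("T", ss_t), ("S", ss_s), ("TS", ss_ts)):
        f_stat = (ss / df[name]) / ms_err
        # Upper-tail probability of the F distribution gives the p-value.
        result[name] = (f_stat, stats.f.sf(f_stat, df[name], df_err))
    return result


# Synthetic gene: 2 treatments x 2 strains x 5 replicates,
# with a strong treatment effect added.
rng = np.random.default_rng(1)
y_demo = rng.normal(0.0, 1.0, size=(2, 2, 5))
y_demo[1] += 5.0
anova_demo = two_way_anova(y_demo)
```

In a microarray analysis this would be repeated for every gene, followed by a multiple-comparison correction.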
![Page 97: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/97.jpg)
04/18/23 98
ANOVA assumptions
• For a given gene, the random error terms are independent, normally distributed and have uniform variance.
• The main effects and their interactions are linear.
![Page 98: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/98.jpg)
04/18/23 99
Summary
• Individual measurements from microarray experiments are not trustworthy.
• Repetition or independent verification (e.g., RT-PCR) are the best means of verification.
• For simple designs, use Welch’s approximation of the t-test.
• For complex designs, use ANOVA.
• Correct for multiple comparisons using FDR and q-values.
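For the simple two-group case, the recommendations above might look like the sketch below. It uses SciPy's `ttest_ind` with `equal_var=False` (Welch's test) and a hand-written Benjamini-Hochberg adjustment. The BH-adjusted p-values are a simple stand-in here and are not identical to Storey-style q-values; the function name `welch_with_bh` and the data layout are assumptions.

```python
import numpy as np
from scipy import stats


def welch_with_bh(group_a, group_b):
    """Per-gene Welch t-test followed by Benjamini-Hochberg adjustment.

    group_a, group_b: arrays of shape (genes, replicates).
    Returns (t statistics, raw p-values, BH-adjusted p-values).
    """
    t, p = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)
    g = p.size
    order = np.argsort(p)
    # p_(k) * g / k for the k-th smallest p-value.
    ranked = p[order] * g / np.arange(1, g + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(g)
    adjusted[order] = np.clip(adj_sorted, 0.0, 1.0)
    return t, p, adjusted


# Synthetic data only: 50 genes, 5 replicates per group, no real effect.
rng = np.random.default_rng(2)
a_demo = rng.normal(0.0, 1.0, size=(50, 5))
b_demo = rng.normal(0.0, 1.0, size=(50, 5))
t_demo, p_demo, adj_demo = welch_with_bh(a_demo, b_demo)
```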
![Page 99: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/99.jpg)
04/18/23 100
Bioconductor
• Bioconductor is an open source project to design and provide high quality software and documentation for bioinformatics.
• Current focus: microarrays and gene (transcript) annotation
• Most of the early developments are in the form of R packages.
• Open to (your?) contributions
• Software and documentation are available from www.bioconductor.org.
![Page 100: 6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong](https://reader036.vdocuments.us/reader036/viewer/2022062714/56649d365503460f94a0d799/html5/thumbnails/100.jpg)
04/18/23 101
Bioconductor packages
• General infrastructure
– Biobase
– annotate, AnnBuilder
– tkWidgets
• Pre-processing for Affymetrix data
– affy
• Pre-processing for cDNA data
– marrayClasses, marrayInput, marrayNorm, marrayPlots
• Differential expression
– edd, genefilter, multtest, ROC
• etc.