This is a good time to be doing Microarray Data Analysis
12th Nov 2006
Aedín Culhane, Dana-Farber Cancer Institute/Harvard School of Public Health.
The Genome Era
Capacity of Microarrays
[Timeline figure, 1995 to 2005. Year labels: 1995, 1996, 1996, 1999, 2004, 2005. Milestones shown: First cDNA Microarrays (45 Arabidopsis genes); 864 Yeast genes; 1000 human cancer genes; 7,000 Genes on arrays; 2003 Genome on 2 arrays; Whole genome (coding) arrays; 2005 Exon, tiling arrays]
Public Microarray Data
• ArrayExpress: 1602 Experiments (48,386 arrays; Statistics Aug 06)
• GEO: 4,419 Experiments (104,314 arrays)
• CIBEX: 5 Experiments (472 arrays)
• SMD: 11081 Expts (63329 incl private data)
(31st Oct 2006)
~160,000 arrays x $500 = $80,000,000
Cancer Studies account for >14% of all studies in databases…
Impact of Microarrays (for Patients)
2004 First microarray approved for treatment decisions by FDA. Affymetrix's AmpliChip Cytochrome P450 Genotyping Test: identifies variations in 2 genes affecting response to a wide variety of drugs.
2005 FDA issued guidelines for applications of genomics in drug development, with the stated hope that genomics will improve the safety and effectiveness of medicines.
2006 Genomics applications in clinical trials rising. ~20% U.S. clinical trials use some sort of genomics approach, with the highest percentage in oncology trials.
Typical Microarray study
1. Read in Data
2. Explore Raw Data
3. Preprocess & Normalize Data… Then again Explore data
4. Unsupervised data analysis (Exploratory Analysis)
5. Select Features of Interest
6. Annotate with biological Information (GO, KEGG, Sequence motifs etc)
7. Other Supervised analysis or Machine Learning
Initial Data QC
Affy QC Values
Boxplot and Histogram of data
Overview of the raw data
Box: 25% to 75%, the inter-quartile range (IQR)
Middle line: Median
Whiskers: reach the most extreme values lying within roughly 1.5 * IQR of the box; points beyond are drawn as outliers
Median
"Middle value" of a list.
Odd number of entries; median = middle entry of sorted list
Even number of entries; median = sum of the two middle (after sorting) numbers divided by two.
Median can be estimated from a histogram by finding the smallest number such that the area under the histogram to the left of that number is 50%
Inter-quartile range (IQR)
• Another Dataset: 35 47 48 50 51 53 54 70 75
• Split into two halves, each including the median:
  – 35 47 48 50 51
  – 51 53 54 70 75
• Find median of each half.
• 1st quartile = 48
• 3rd quartile = 54
• IQR = 54 - 48 = 6
So what is the IQR for 35 47 48 50 51 53 54 60 70 75?
• Split the data into two halves:
  – 35 47 48 50 51
  – 53 54 60 70 75
• Median of each half: 1st = 48; 3rd = 60.
• Hence IQR is 60 - 48 = 12.
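The two worked examples can be checked in R. This is a minimal sketch: the helper name `hinge_iqr` is made up for illustration, and it uses `fivenum()` (Tukey's hinges), which reproduces the split-into-halves method for these two datasets. R's built-in `IQR()` uses a different default quantile definition and may not agree with the hand calculation.

```r
# The two example datasets from the slides
x9  <- c(35, 47, 48, 50, 51, 53, 54, 70, 75)
x10 <- c(35, 47, 48, 50, 51, 53, 54, 60, 70, 75)

# Hypothetical helper: IQR from Tukey's hinges.
# fivenum() returns min, lower hinge, median, upper hinge, max.
hinge_iqr <- function(x) {
  h <- fivenum(x)
  h[4] - h[2]
}

median(x9)      # middle entry of the sorted list
hinge_iqr(x9)   # IQR of the 9-value dataset
hinge_iqr(x10)  # IQR of the 10-value dataset
IQR(x10)        # built-in IQR, default quantile definition
```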
[Figure: Histogram of celfile.data. x axis: celfile.data (6 to 14); y axis: Frequency. Vertical lines mark the Median and the Mean.]
# Histogram of the expression values, with the mean (blue) and median (red) marked
hist(celfile.data)
abline(v=mean(celfile.data), col="blue", lwd=2)
abline(v=median(celfile.data), col="red", lwd=2)
Log(ratio) Histogram
[Figure: histogram of Log(ratio) values from -2 to 2; y axis: Frequency (0 to 3000)]
Log2(ratio) measures treat up- and down-regulated genes equally
log2(1) = 0, log2(2) = 1, log2(1/2) = -1
Initial Data Quality Checks
• Boxplot, Histogram
• RNA digestion plot
• Affymetrix QC parameters
  – bioB spike-ins, %P, average background, scale factor.
  – affy.qc in library(simpleaffy)
• Image plots of probe level measures (affyPLM)
  – Residuals.
    • Larger residuals (darker) indicate deviations from model
  – Normalized Unscaled Standard Errors (NUSE) plot.
    • Gene standard error estimates from fitPLM standardized across arrays (median SE=1). An array with elevated SEs relative to other arrays is typically of lower quality.
  – Relative Log Expression (RLE) values.
    • probeset expression value - median expression value across all arrays. Ideally RLE ~ 0.
• Exploratory analysis: Clustering/COA
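The RLE definition above can be sketched in base R without affyPLM. The expression matrix here is simulated (an assumption for illustration, with one array deliberately shifted), and the name `rle_values` is made up; real data would come from a fitted probe-level model.

```r
set.seed(42)
# Simulated log2 expression matrix: 100 probesets (rows) x 4 arrays (columns)
expr <- matrix(rnorm(400, mean = 8, sd = 1), nrow = 100,
               dimnames = list(NULL, paste0("array", 1:4)))
expr[, "array4"] <- expr[, "array4"] + 2   # simulate one poorly behaved array

# RLE: each value minus the median of its probeset across all arrays
probeset_median <- apply(expr, 1, median)
rle_values <- sweep(expr, 1, probeset_median, "-")

# Per-array median RLE; a well-behaved array should sit near 0
array_median_rle <- apply(rle_values, 2, median)
print(array_median_rle)
boxplot(rle_values, main = "RLE", ylab = "Relative log expression")
```

An array whose RLE box is shifted away from 0, or is much wider than the others, is a candidate for closer inspection.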
Preprocessing, normalisation, error models, quality control
Goal of a microarray study
• Detect number of RNA molecules
• Actually measure fluorescence intensity of spot
INDIRECT MEASUREMENT
Normalisation aims to reduce systematic noise introduced in measurement
Expt1 Expt2 Expt3 Expt4 Expt5 Expt6
Gene 1 -3 -3 -1 0 2 3
Gene 2 -2 -2 0 1 2 2
Gene 3 -3 -2 0 1 2 3
Gene 4 3 2 0 -1 -2 -3
Gene 5 2 2 1 0 -2 -3
Gene 6 3 2 1 0 -2 -3
Gene 7 2 2 2 2 2 2
Gene 8 -2 -2 -2 -2 -2 -2
Raw data are not mRNA concentrations
o tissue contamination
o clone identification and mapping
o image segmentation
o RNA degradation
o PCR yield, contamination
o signal quantification
o amplification efficiency
o spotting efficiency
o ‘background’ correction
o reverse transcription efficiency
o DNA-support binding
o hybridization efficiency and specificity
o other array manufacturing-related issues
Early Normalization Approaches: Total Intensity
Conceptually simple
Assumption: Total RNA (mass) is the same in all samples.
Use a scaling factor… (Still used in MAS5.0)
Normalization Factor:
$$N = \frac{\sum_{k=1}^{N_{\text{array}}} R_k}{\sum_{k=1}^{N_{\text{array}}} G_k}$$
Normalization: $G'_k = N\,G_k$ and $R'_k = R_k$.
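A minimal sketch of total intensity normalization for one two-colour array. The intensities are made-up values for illustration; `R` and `G` stand for the red and green channel intensities of each spot k.

```r
# Hypothetical raw intensities for five spots on one array
R <- c(1200, 800, 450, 3000, 150)   # red channel
G <- c( 600, 400, 250, 1400, 100)   # green channel

# Normalization factor: ratio of total red to total green intensity
N <- sum(R) / sum(G)

# Rescale the green channel; the red channel is left unchanged
G_norm <- N * G
R_norm <- R

# After scaling, both channels have the same total intensity
c(total_R = sum(R_norm), total_G = sum(G_norm))
log2(R_norm / G_norm)   # normalized log-ratios
```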
Normalize to scaling factor
Normalized to the 75th percentile
Not influenced by outliers
Still too much below the line
Why a scaling factor is not sufficient
[Figure: log-ratio plots; panels labelled same-same, 2 fold, same-same]
The two-component model
raw scale log scale
“additive” noise
“multiplicative” noise
B. Durbin, D. Rocke, JCB 2001
Quantile Normalisation
Outliers are not tolerated
The distribution of intensities on every slide is forced to be the same.
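A minimal base-R sketch of quantile normalisation, to show what "forced to be the same" means in practice. The function name `quantile_normalise` and the small matrix are made up for illustration, and ties are broken by order of appearance, which is a simplification compared with packaged implementations.

```r
# Hypothetical helper: rows are probes, columns are arrays
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                         # sort each array
  ref    <- rowMeans(sorted)                          # mean of each quantile
  out    <- apply(ranks, 2, function(r) ref[r])       # map ranks to the reference
  dimnames(out) <- dimnames(x)
  out
}

x <- cbind(a = c(5, 2, 3, 4),
           b = c(4, 1, 4, 2),
           c = c(3, 4, 6, 8))
qn <- quantile_normalise(x)
qn
apply(qn, 2, sort)   # every array now has the same sorted values
```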
Observe: Intensity-dependent structure
Lowess Normalization
Straightens the banana!
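A minimal sketch of lowess normalisation on simulated data. It assumes the usual MvA convention (M is the log-ratio, A the average log-intensity); the curved trend and all numbers are invented for illustration, and the smoother span `f = 0.4` is an arbitrary choice.

```r
set.seed(1)
n <- 500
A <- runif(n, 6, 14)                                 # average log-intensity
M <- 0.5 - 0.05 * (A - 10)^2 + rnorm(n, sd = 0.2)    # simulated "banana"

# Fit the intensity-dependent trend with lowess
fit <- lowess(A, M, f = 0.4)

# lowess() returns the fit at sorted A; interpolate back to the original order
trend <- approx(fit$x, fit$y, xout = A, ties = mean)$y

# Subtract the trend so log-ratios are centred on 0 at every intensity
M_norm <- M - trend

op <- par(mfrow = c(1, 2))
plot(A, M, main = "Before"); lines(fit, col = "red", lwd = 2)
plot(A, M_norm, main = "After"); abline(h = 0, col = "red", lwd = 2)
par(op)
```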
Standard deviation regularization (in TM4 MIDAS)
Assumption: log-ratio standard deviations within each block or slide are the same.
Variance regularization can remove the bias
Platform Problems
Spotted Array Platform Specific
– “In house” printing effects
– Regional effects within and between print-tips
– Need regional plate and print-tip lowess normalisation
PCR plates
Spotting pin quality decline
after delivery of 3x10^5 spots
after delivery of 5x10^5 spots
H. Sueltmann DKFZ/MGA
Affymetrix Platform Specific
– Probe level effect. Need a gene expression measure from the 11 probes in a probeset
Affymetrix Probe sets
Probe set summarization
$PM_{ijg}$, $MM_{ijg}$: intensities for perfect match and mismatch probe j for gene g in chip i
Need to summarize each probe set, i.e. 16 PM, MM pairs, into a single expression measure.
Expression measures: MAS 4.0
Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:
o sort $d_j = PM_j - MM_j$
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average
$$\mathrm{AvDiff} = \frac{1}{\#J}\sum_{j \in J} (PM_j - MM_j)$$
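A minimal sketch of the AvDiff steps listed above. The function name `avdiff` and the probe intensities are made up, and the slide leaves one detail open: here the average and standard deviation are computed after dropping the highest and lowest difference, and J is then taken from all pairs. That reading is an assumption, not a statement of what MAS 4.0 actually does.

```r
# Hypothetical helper: pm and mm are intensities for one probe set
avdiff <- function(pm, mm) {
  d <- sort(pm - mm)                       # sorted PM - MM differences
  trimmed <- d[-c(1, length(d))]           # exclude highest and lowest value
  centre <- mean(trimmed)
  spread <- sd(trimmed)
  keep <- abs(d - centre) <= 3 * spread    # J: pairs within 3 SD of the average
  mean(d[keep])
}

pm <- c(110, 120, 130, 140, 1000)   # one probe pair is an obvious outlier
mm <- c(100, 100, 100, 100,  100)
avdiff(pm, mm)
mean(pm - mm)    # the untrimmed mean is dominated by the outlier
```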
Expression measures MAS 5.0
Instead of MM, use "repaired" version CT:
$$CT = \begin{cases} MM & \text{if } MM < PM \\ PM / \text{"typical log-ratio"} & \text{if } MM \ge PM \end{cases}$$
"Signal" = Tukey.Biweight(log(PM - CT)) (… ≈ median)
Tukey Biweight: $B(x) = (1 - (x/c)^2)^2$ if $|x| < c$, 0 otherwise
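A minimal sketch of a one-step Tukey biweight location estimate, to show why it behaves roughly like a median. The function name `tukey_biweight`, the tuning constant `c = 5` and the small `eps` guard are assumptions for illustration, not values taken from the slides.

```r
# Hypothetical helper: one-step Tukey biweight estimate of location
tukey_biweight <- function(x, c = 5, eps = 1e-4) {
  m <- median(x)
  s <- mad(x, center = m, constant = 1)       # raw median absolute deviation
  u <- (x - m) / (c * s + eps)                # scaled distance from the median
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)     # biweight: zero weight beyond c * MAD
  sum(w * x) / sum(w)
}

x <- c(1, 2, 3, 4, 100)     # one gross outlier
tukey_biweight(x)           # close to the median
median(x)
mean(x)                     # pulled towards the outlier
```

Points far from the median get weight zero, so making the outlier even larger does not change the estimate.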
Expression measures: Li & Wong
dChip fits a model for each gene:
$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$
where
– θi: expression index for the gene on chip i
– φj: probe sensitivity
Need at least 10 or 20 chips.
Invariant set
Robust expression measures RMA: Irizarry et al. (2002)
AvDiff-like:
$$\mathrm{RMA} = \frac{1}{|A|}\sum_{j \in A} \log_2(PM_j - BG_j)$$
with A a set of “suitable” pairs.
Estimate RMA = a_i for chip i using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
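Median polish is available in base R as `medpolish()`. The sketch below applies it to a made-up probe set (rows are probes, columns are chips, values are background-corrected log2 intensities built to be exactly additive), and takes the chip-level expression as the overall term plus the chip effect. This illustrates the summarisation step only, not the full RMA pipeline.

```r
# Made-up additive data: probe effects plus chip-level expression
probe_effect <- c(p1 = -1, p2 = 0, p3 = 1)
chip_level   <- c(chip1 = 7, chip2 = 8, chip3 = 10)
x <- outer(probe_effect, chip_level, "+")
x

# Successively remove row and column medians until convergence
fit <- medpolish(x, trace.iter = FALSE)

# Expression estimate for each chip: overall term + column (chip) effect
chip_expr <- fit$overall + fit$col
chip_expr
fit$residuals    # all zero here because the data are exactly additive
```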
Comparative MvA plots
[Figure panels: MAS5, dChip, RMA. From Irizarry et al.]
Affymetrix: I_PM = I_MM + I_specific ?
[Figure: log(PM/MM). From: R. Irizarry et al., Biostatistics 2002]
Probe-response calibration
$$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$$
[Figure: w_i plotted against position i]
position- and sequence-specific effects w_i(s): Naef et al., Phys Rev E 68 (2003)
Comparison of these Affy methods
• 2 test datasets
  – Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  – Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
• 15 quality benchmarks
  – reproducibility
  – sensitivity
  – specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu
affycomp results (28 Sep 2003)
[Figure: benchmark results on a scale from good to bad]
Raw Data
[Figure: genes >2 fold different after Mas5.0, VSN and gcRMA normalisation]
Red >2 fold difference in gcRMA normalised data
[Figure panels: gcRMA, vsn, MAS5.0]
Selected 5 “follow up” vsn genes. These had similar profiles in gcRMA and MAS5.0
[Figure: Fold Increase (0 to 10) by Method (gcRMA, MAS5, VSN) for Gzmg:1422867_at, Gzmd:1420343_at, Birc1e:1421525_a_at, Tgfbi:1415871_at, Pdgfb:1450413_at]
Recap
• Normalisation
  – Log or glog
  – Scale to a number, lowess, quantile, variance stabilising
• Spotted
  – Within & between plate, print-tip etc
• Affymetrix
  – MAS4.0, MAS5.0, RMA, gcRMA, Li&Wong
• With above normalisation methods
Are these methods always valid?
• Mas5.0, RMA, gcRMA and vsn all assume that the sum of RNA is constant (same no of genes up and down)
• THIS IS NOT ALWAYS TRUE
  – k/o of pol II
  – Blocking methylation/translation etc
Normalising to an external set of genes
• Housekeeping
  – Not a good idea
• Li & Wong
  – Transform using non-linear smooth curves
  – Uses rank invariant probes
  – Available in dChip and R
  – Cheng Li & Wing Hung Wong (2001a) PNAS 98, 31-36
• Spike in Controls
  – External RNA
  – van de Peppel et al., (2003) EMBO Rep. 4(4):387-93.
Colon Cancer Data
• Fresh-frozen human colorectal tumours.
• N=6
  – Whole tumour N=3
  – Parenchymal fraction (LCM dissected)
• On Affymetrix U133plus2 chips
  – 54675 probesets
Normalised data: MAS, RMA
[Figure panels: RMA Normalisation; MAS5.0 Normalisation]
Normalised data: VSN, Li & Wong
[Figure panels: Li & Wong Normalisation; VSN Normalisation]
Normalisation Matters!
[Figure panels: Li & Wong; RMA; MAS 5.0]
Many Normalisation methods
Need to consider best one for your experimental design
Most normalisation methods assume sum of mRNAs is equal
Exploratory Data Analysis: Clustering and Ordination
Aedín Culhane, Dana-Farber Cancer Institute/Harvard School of Public Health.
Microarray data analysis
Microarrays produce:
• Simultaneously 10,000’s of variables
• Multivariate data
• Essential to use exploratory data analysis to “get a handle” on the data
Typical Analysis of Microarrays
1. Read in Data
2. Explore Raw Data
3. Preprocess & Normalize Data… Then again Explore data
4. Unsupervised data analysis (Exploratory Analysis)
5. Select Features of Interest. Include additional biological Information (GO, KEGG, Sequence motifs etc)
6. Other Supervised analysis or Machine Learning
Importance of Data Exploration
• Exploration of Data is Critical
  – Detect unpredicted patterns in data
  – Decide what questions to ask
• Clustering
  – Hierarchical
  – Flat (k-means)
• Ordination (Dimension Reduction)
  – Principal Component analysis, Correspondence analysis
A Distance Metric
• The choice of metric is fundamental
• Exploratory analysis: you only discover where you explore.
Expt1 Expt2 Expt3 Expt4 Expt5 Expt6
Gene 1 -3 -3 -1 0 2 3
Gene 2 -2 -2 0 1 2 2
Gene 3 -3 -2 0 1 2 3
Gene 4 3 2 0 -1 -2 -3
Gene 5 2 2 1 0 -2 -3
Gene 6 3 2 1 0 -2 -3
Gene 7 2 2 2 2 2 2
Gene 8 -2 -2 -2 -2 -2 -2
Sample set of gene expression values
Back to our 8 Genes – Create a distance matrix
[Figure: Expression of 8 genes in 6 arrays. x axis: arrays (1 to 6); y axis: log ratio (-4 to 4); one line each for Gene 1 to Gene 8]
Distance Metrics
• Euclidean distance
• Pearson correlation coefficient
• Spearman rank
• Manhattan distance
• Mutual information
• etc
Each has different properties and can reveal different features of the data
[Figure: scale running from Distance to Similarity; points p_A and p_B]
        Exp 1  Exp 2  Exp 3  Exp 4  Exp 5  Exp 6
Gene A  x1A    x2A    x3A    x4A    x5A    x6A
Gene B  x1B    x2B    x3B    x4B    x5B    x6B
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
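The three metrics can be compared on a pair of profiles. This sketch uses Gene 1 and Gene 2 from the sample set of gene expression values; turning the Pearson correlation into a distance as `1 - r` is one common convention and is an assumption here.

```r
gene1 <- c(-3, -3, -1, 0, 2, 3)
gene2 <- c(-2, -2,  0, 1, 2, 2)

euclidean <- sqrt(sum((gene1 - gene2)^2))   # 1. Euclidean distance
manhattan <- sum(abs(gene1 - gene2))        # 2. Manhattan distance
pearson   <- cor(gene1, gene2)              # 3. Pearson correlation (a similarity)

c(euclidean = euclidean,
  manhattan = manhattan,
  pearson = pearson,
  pearson_distance = 1 - pearson)
```

The same pair can look close under one metric and far under another, which is why the choice of metric matters.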
Distance Metrics
Distance Is Defined by a Metric
[Figure: two pairs of log2(cy5/cy3) profiles, scale -3 to 3, each pair annotated with its distance D]
Distance Metric:  Euclidean  Pearson*
D                 6.0        +1.00
D                 1.4        -0.05
Warning: Correlations gone wrong
[Figure: two scatter plots of y against x, labelled corr=0.87 and corr=0.04]
Clustering: Distance metrics
Euclidean distance
Expt1 Expt2 Expt3 Expt4 Expt5 Expt6
Gene 1 -3 -3 -1 0 2 3
Gene 2 -2 -2 0 1 2 2
Gene 3 -3 -2 0 1 2 3
Gene 4 3 2 0 -1 -2 -3
Gene 5 2 2 1 0 -2 -3
Gene 6 3 2 1 0 -2 -3
Gene 7 2 2 2 2 2 2
Gene 8 -2 -2 -2 -2 -2 -2
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Dist(gene 1, gene 2) = √((-3+2)² + (-3+2)² + (-1-0)² + (0-1)² + (2-2)² + (3-2)²) = √5 = 2.236068 ≈ 2.24
Expression of 8 genes in 6 arrays
-4
-3
-2
-1
0
1
2
3
4
1 2 3 4 5 6
arrays
log
ratio
Gene 1Gene 2Gene 3Gene 4Gene 5Gene 6Gene 7Gene 8
Distance Matrix
        Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  Gene 7  Gene 8
Gene 1  0       2.24    1.73    10.72   10.30   10.82   8.00    6.93
Gene 2          0       1.41    9.27    8.66    9.17    6.08    6.71
Gene 3                  0       10.39   9.75    10.30   6.86    7.42
Gene 4                          0       1.73    1.41    7.42    6.86
Gene 5                                  0       1       6.78    6.78
Gene 6                                          0       6.86    7.42
Gene 7                                                  0       9.80
Gene 8                                                          0
The matrix is symmetric. Now we need to decide which genes are closest.
Comparison of Linkage Methods
         Single  Average  Complete
Join by  min     average  max
In the distance matrix, genes 5 and 6 are closest (dist = 1), so merge these first
          Gene 1  Gene 2  Gene 3  Gene 4  Gene 5,6  Gene 7  Gene 8
Gene 1    0       2.24    1.73    10.72   10.30     8.00    6.93
Gene 2            0       1.41    9.27    8.66      6.08    6.71
Gene 3                    0       10.39   9.75      6.86    7.42
Gene 4                            0       1.41      7.42    6.86
Gene 5,6                                  0         6.78    6.78
Gene 7                                              0       9.80
Gene 8                                                      0
[Figure: partial dendrogram with leaves Gene 5, Gene 6, Gene 2, Gene 3]
          Gene 1  Gene 2,3  Gene 4  Gene 5,6  Gene 7  Gene 8
Gene 1    0       1.73      10.72   10.30     8.00    6.93
Gene 2,3          0         9.27    8.66      6.08    6.71
Gene 4                      0       1.41      7.42    6.86
Gene 5,6                            0         6.78    6.78
Gene 7                                        0       9.80
Gene 8                                                0

              Gene 1  Gene 2,3  Gene 4,(5,6)  Gene 7  Gene 8
Gene 1        0       1.73      10.30         8.00    6.93
Gene 2,3              0         8.66          6.08    6.71
Gene 4,(5,6)                    0             6.78    6.78
Gene 7                                        0       9.80
Gene 8                                                0
… continue, join 1 to (2,3) at 1.73
…. until done
[Figure: partial dendrogram with leaves Gene 1 to Gene 6]
Hierarchical clustering assembles a number of items into a tree where items are joined by short branches if they are very similar to each other and by increasingly longer branches as their similarity decreases.
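The merging steps can be reproduced with base R's `dist()` and `hclust()`. This sketch uses the 8-gene example and single linkage (join by the minimum distance), which matches the walk-through in which genes 5 and 6 merge first at distance 1; other linkage methods would give different join heights.

```r
genes <- rbind(
  "Gene 1" = c(-3, -3, -1,  0,  2,  3),
  "Gene 2" = c(-2, -2,  0,  1,  2,  2),
  "Gene 3" = c(-3, -2,  0,  1,  2,  3),
  "Gene 4" = c( 3,  2,  0, -1, -2, -3),
  "Gene 5" = c( 2,  2,  1,  0, -2, -3),
  "Gene 6" = c( 3,  2,  1,  0, -2, -3),
  "Gene 7" = c( 2,  2,  2,  2,  2,  2),
  "Gene 8" = c(-2, -2, -2, -2, -2, -2)
)
colnames(genes) <- paste0("Expt", 1:6)

d <- dist(genes)                    # Euclidean distance matrix between genes
round(as.matrix(d), 2)

hc <- hclust(d, method = "single")  # single linkage: join by min distance
hc$merge[1, ]                       # first merge (negative = original item)
round(hc$height, 2)                 # distance at which each join happens
plot(hc, main = "Cluster Dendrogram")
```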
[Figure: Cluster Dendrogram. Leaves in order: Gene 4, Gene 5, Gene 6, Gene 8, Gene 7, Gene 1, Gene 2, Gene 3; y axis: Height (1 to 7)]
Heatmap (Eisen Plots)
[Figure: panels A and B]
Interpreting a Dendrogram
Hierarchical analysis results viewed using a dendrogram (tree)
• Distance between nodes (Scale)
• Ordering of nodes not important (like baby mobile)
Limitations of hierarchical clustering
• Samples compared in a pair wise manner
• Hierarchy forced on data
• Sometimes difficult to visualise if large data
• Overlapping clustering or time/dose gradients ?
Complementary Approach: ordination
Not this kind of ordination
Ordination- In multivariate statistics
1. Arrangement of units in some order
2. Representation of objects as points along one or several axes of reference (Gower 1984)
Complementary methods
Cluster analysis generally investigates pairwise distances/similarities among objects looking for fine relationships
Ordination in reduced space considers the variance of the whole dataset thus highlighting general gradients/patterns
(Legendre and Legendre, 1998)
Many publications present both
Ordination
• Also referred to as
  – Latent variable analysis, Dimension reduction
• Aim:
Find axes onto which data can be projected so as to explain as much of the variance in the data as possible
Principal Axes
• Project new axes through data which capture variance. Each represents a different trend in the data.
• Orthogonal (decorrelated)
• Typically ranked: First axes most important
• Principal axis, Principal component, latent variable or eigenvector
[Figure: original axes x, y, z]
Dimension Reduction (Ordination)
Principal Components pick out the directions in the data that capture the greatest variability
[Figure: New Axis 1, New Axis 2 and New Axis 3 drawn through the data]
Eigenvalues
• Describe the amount of variance (information) in eigenvectors
• Ranked. First eigenvalue is the largest.
• Generally only examine 1st few components – scree plot
Choosing number of Eigenvalues: Scree Plot
[Figure: scree plots of eigenvalues by component]
Maximum number of Eigenvalues/Eigenvectors = min(nrow, ncol) - 1
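A minimal PCA sketch using base R's `prcomp()` on the 8-gene example, treating arrays as observations and genes as variables (that orientation is a choice made for illustration). It shows the eigenvalues, a scree-style bar plot, and that the number of non-zero eigenvalues does not exceed min(nrow, ncol) - 1 for centred data.

```r
genes <- rbind(
  "Gene 1" = c(-3, -3, -1,  0,  2,  3),
  "Gene 2" = c(-2, -2,  0,  1,  2,  2),
  "Gene 3" = c(-3, -2,  0,  1,  2,  3),
  "Gene 4" = c( 3,  2,  0, -1, -2, -3),
  "Gene 5" = c( 2,  2,  1,  0, -2, -3),
  "Gene 6" = c( 3,  2,  1,  0, -2, -3),
  "Gene 7" = c( 2,  2,  2,  2,  2,  2),
  "Gene 8" = c(-2, -2, -2, -2, -2, -2)
)
colnames(genes) <- paste0("Expt", 1:6)

pca <- prcomp(t(genes))            # centred, unscaled PCA of the arrays
eig <- pca$sdev^2                  # eigenvalues = variance along each axis
prop_var <- eig / sum(eig)         # proportion of variance per component

# Count eigenvalues that are not numerically zero
n_nonzero <- sum(eig > 1e-8 * eig[1])
n_nonzero

# Scree-style plot, then the array projection on the first two axes
barplot(prop_var, names.arg = paste0("PC", seq_along(eig)),
        ylab = "Proportion of variance")
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```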
Typical Analysis
[Workflow figure: data matrix X; Ordination; Plot of eigenvalues, select number; Plot PC1 v PC2 etc; Array Projection and Gene Projection]
Ordination of Gene Expression Data
Ordination Methods
• Most common:
  – Principal component analysis (PCA)
  – Correspondence analysis (COA or CA)
  – Nonmetric multidimensional scaling (NMDS, MDS)
  – Principal co-ordinate analysis (PCoA)
Books/Book Chapters:
1. Legendre, P., and Legendre, L. 1998. Numerical Ecology, 2nd English Edition. Elsevier, Amsterdam.
2. Wall, M., Rechtsteiner, A., and Rocha, L. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis (eds. D.P. Berrar, W. Dubitzky, and M. Granzow), pp. 91-109. Kluwer, Norwell, MA.
Papers:
1. Alter, O., Brown, P.O., and Botstein, D. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A 97: 10101-10106.
2. Culhane, A.C., Perriere, G., Considine, E.C., Cotter, T.G., and Higgins, D.G. 2002. Between-group analysis of microarray data. Bioinformatics 18: 1600-1608.
3. Culhane, A.C., Perriere, G., and Higgins, D.G. 2003. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 4: 59.
4. Fellenberg, K., Hauser, N.C., Brors, B., Neutzner, A., Hoheisel, J.D., and Vingron, M. 2001. Correspondence analysis applied to microarray data. Proc Natl Acad Sci U S A 98: 10781-10786.
5. Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572.
6. Raychaudhuri, S., Stuart, J.M., and Altman, R.B. 2000. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput: 455-466.
7. Wouters, L., Gohlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., and Lewi, P.J. 2003. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 59: 1131-1139.
Reviews:
1. Quackenbush, J. 2001. Computational analysis of microarray data. Nat Rev Genet 2: 418-427.
Detecting differentially expressed genes
Normal distribution
σ = standard deviation of the distribution
X = μ (mean of the distribution)
Estimating a mean
All had the same mean and SD
[Figure: Population 1 (Mean 1), Population 2 (Mean 2) and the sample mean “s”]
Less than a 5 % chance that the sample with mean s came from Population 1
s is significantly different from Mean 1 at the p < 0.05 significance level.
But we cannot reject the hypothesis that the sample came from Population 2
Probability distributions
The probability of an event is the likelihood of its occurring.
It is sometimes computed as a relative frequency (rf), where
$$rf = \frac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$$
The probability of an event can sometimes be inferred from a “theoretical” probability distribution, such as a normal distribution.
Normality, Probability and Expression Data
Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.
But expression measurements? Probably not.
Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.
Randomization / resampling based tests can be used to get around the violation of the normality assumption.
IMPORTANT CONCEPT No 2
                       True Value (with Disease)
Test Prediction        TRUE            FALSE
+ve                    True Positive   False Positive   Positive Predictive Value
-ve                    False Negative  True Negative    Negative Predictive Value
                       Sensitivity     Specificity      Accuracy
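The margins of the table can be computed directly from the four cell counts. The counts below are invented purely for illustration.

```r
# Hypothetical counts for a diagnostic test
TP <- 40   # test +ve, truly with disease
FP <- 10   # test +ve, truly without disease
FN <- 5    # test -ve, truly with disease
TN <- 45   # test -ve, truly without disease

sensitivity <- TP / (TP + FN)   # column margin: true cases detected
specificity <- TN / (TN + FP)   # column margin: true non-cases cleared
ppv         <- TP / (TP + FP)   # row margin: positive predictive value
npv         <- TN / (TN + FN)   # row margin: negative predictive value
accuracy    <- (TP + TN) / (TP + FP + FN + TN)

round(c(sensitivity = sensitivity, specificity = specificity,
        PPV = ppv, NPV = npv, accuracy = accuracy), 3)
```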
IMPORTANT CONCEPT No 2
←bias accuracy→
←precision variance→
Another view: ROC Curve
[Figure: ROC curve, Sensitivity against 1 - specificity, with candidate operating points marked X]
Basic dogma of data analysis: you can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the sweet spot.
(It can also be possible to increase both by better choice of method / model)
Finding Significant Genes
Our goal is to find genes that are significantly different between classes
How?
• Fold Change
• T-statistic
• Modified t-statistic
• Other methods
Fold Change
• Only looks at the difference in the means of two groups
• Unreliable in microarrays
• Why? We can’t get a good estimate of the mean due to too few cases
Average Fold Change Difference for each gene suffers from being arbitrary and not taking into account systematic variation in the data
t-test for each gene
• Tests whether the means of the query and reference groups are the same
• Essentially measures signal-to-noise
• Calculate p-value (permutations or distributions)
• May suffer from intensity-dependent effects
Finding Significant Genes
$$t = \frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{s_Y^2}{N} + \dfrac{s_X^2}{M}}}$$
t-statistic = signal / noise = difference between means / variability of groups
Where $\bar{Y}$ and $\bar{X}$: the means of the two groups
$s^2$: square of the SD, i.e. the variances; N and M: the group sizes
[Figure: t-tests. One pair of distributions labelled "A significant difference", the other "Probably not"]
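The t-statistic can be computed by hand and checked against base R's `t.test()`, whose default (unequal variances) uses the same denominator. The expression values are invented for illustration.

```r
# Hypothetical log expression values for one gene
y <- c(7.9, 8.4, 8.1, 8.6)   # query group, N = 4
x <- c(7.1, 7.3, 6.9, 7.4)   # reference group, M = 4

signal <- mean(y) - mean(x)                           # difference between means
noise  <- sqrt(var(y) / length(y) + var(x) / length(x))  # variability of groups
t_manual <- signal / noise

# Built-in Welch t-test for comparison
tt <- t.test(y, x)
t_builtin <- unname(tt$statistic)

c(manual = t_manual, builtin = t_builtin, p_value = tt$p.value)
```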
Estimating the variance
The t-test compares the difference between group means to the standard deviation of the data within groups
F-test (ANOVA) is a generalization of this idea to more than 2 groups
But with few replicates, estimates of SE are not stable. This explains why the t-test is not powerful
Moderated t-statistics
• There are many proposals for estimating variation
• Many share information across genes
• Empirical Bayesian Approaches are popular
• SAM, an ad-hoc procedure, is even more popular
• Many are what some call “moderated” t-tests
Some Examples of Tests
Notation:
– T is average log expression of Treatment group
– C is average log expression of Control group
– S is SD
• Tests:
– Average log fold-change: (T-C)
– t-statistic: (T-C) / S
– SAM shrunken t-statistic: (T-C) / (S + S0)
– Bayesian posteriors: (T-C) / √(S² + K²)
– Wilcoxon Rank test
Note taking log before average is important
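A small sketch of why the shrunken statistic helps with few replicates. The two genes and their values are invented; S is taken here as the standard error of the difference, and S0 is set to the median of S across genes. Both choices are assumptions for illustration (SAM itself chooses S0 by its own procedure).

```r
# Hypothetical log expression: rows are genes, columns are replicates
treat <- rbind(g1 = c(8.10, 8.11, 8.09),    # tiny change, tiny variance
               g2 = c(10.0, 11.0, 12.0))    # large change, larger variance
ctrl  <- rbind(g1 = c(8.00, 8.01, 7.99),
               g2 = c(8.0,  9.0, 10.0))

T_mean <- rowMeans(treat)
C_mean <- rowMeans(ctrl)
S <- sqrt(apply(treat, 1, var) / ncol(treat) +
          apply(ctrl,  1, var) / ncol(ctrl))   # per-gene variability

log_fc <- T_mean - C_mean          # average log fold-change: (T - C)
t_stat <- log_fc / S               # t-statistic: (T - C) / S
S0 <- median(S)                    # assumed fudge factor
sam_stat <- log_fc / (S + S0)      # shrunken statistic: (T - C) / (S + S0)

round(cbind(log_fc, S, t_stat, sam_stat), 3)
```

With the plain t-statistic the gene with a tiny change but a tiny variance ranks first; adding S0 to the denominator puts the gene with the large change on top.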
One final problem
Once you have a score for each gene, how do you decide on a cut-off? p-values are popular. Are they appropriate?
Test for each gene the null hypothesis: no differential expression.
Notice that if you look at 10,000 genes for which the null is true, you expect to see 500 attain p-values below 0.05
This is called the multiple comparison problem. Statisticians fight about it. But not about the above.
Main message: p-values can’t be interpreted in the usual way
Multiple testing
Popular solutions are either
• slash the p-value – Bonferroni or permutation correction
• or report FDR instead of FPR.
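Both solutions are available through base R's `p.adjust()`. The sketch below simulates 10,000 genes for which the null hypothesis is true (uniform p-values, an assumption for illustration) and counts how many pass a 0.05 cut-off before and after adjustment, using Bonferroni and the Benjamini-Hochberg FDR method.

```r
set.seed(1)
p <- runif(10000)     # p-values for 10,000 genes with no differential expression

n_raw  <- sum(p < 0.05)                                    # unadjusted
n_bonf <- sum(p.adjust(p, method = "bonferroni") < 0.05)   # slash the p-value
n_bh   <- sum(p.adjust(p, method = "BH") < 0.05)           # control the FDR

c(raw = n_raw, bonferroni = n_bonf, BH = n_bh)
```

The unadjusted count lands near the 500 false positives expected by chance, while both corrections remove most or all of them.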
Error Rates
A useful plot
The volcano plot shows, for a particular test, negative log p-value against the effect size (M)
Volcano plot
Comparison of Feature Selection methods
Assessed
1. the gene list produced by 9 different methods
2. the ability of the top genes from each method to form a classifier
Overlap in Gene Ranking in top 200 genes (binary distance, average linkage)
Testing performance of gene-lists as classifiers
• For each dataset:
  – Divide dataset into training and test.
  – Apply feature selection method to training data.
  – Rank genes using feature selection method (t-statistic, SAM, template matching, etc).
  – Select K top genes, where K is between 3 and 100.
  – Train classifier using these genes.
  – Test discriminating power using test set
  – Record performance of classifier
• Repeat for each gene selection method
Conclusion:
The empirical Bayes t-statistic is a robust and accurate way to identify regulated genes.
Rank Products is also effective in data with low sample size.
Sample permutation of t-statistic and SAM are not effective in datasets with few samples or with low signal:noise
For larger, or high signal:noise datasets, most methods work well. Area under the ROC curve method and MaxT outperform other approaches
Jeffery IB, Higgins DG, Culhane AC. Comparison and evaluation of microarray feature selection methods. BMC Bioinformatics. Submitted
Finding out more about Genes
We know lots about genes
• Chromosome location
• Pathways (KEGG)
• Gene Ontology
  – Sub-cellular location (eg nucleus, cytosol)
  – Biological process (cell signalling)
  – Molecular function (kinase)
Structure of a GO annotation
Each gene can have several annotated GOs and each GO can have several splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits
Gene Sets Score
• Fisher exact test (chi-square test)
• Kolmogorov- Smirnov statistic• weighted KS statistics
• Simple matrix multiplication of t-statistics x counts
Is a GO term specific for a set?
Contingency Table
                                                      count genes with GO term  count genes without GO term  total
count in set (e.g. differentially expressed genes)    51                        416                          467
Count in reference set (e.g. all genes on array)      125                       8588                         8713
total                                                 173                       9004                         9177
P-value: 8x10^-52
Fisher's exact test or chi-square test
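Both tests are in base R. The sketch below uses the four inner cell counts from the contingency table and treats the two rows as disjoint groups, which is an assumption about how the table was built; the p-value obtained depends on that layout, so no particular value is asserted here.

```r
# Inner cells of the contingency table (row layout is an assumption)
tab <- matrix(c(51,  416,
                125, 8588),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("in set", "reference"),
                              GO_term = c("with", "without")))
tab

fisher <- fisher.test(tab)    # exact test on the 2 x 2 table
chisq  <- chisq.test(tab)     # chi-square approximation

fisher$p.value
fisher$estimate               # odds ratio: > 1 means the term is enriched in the set
chisq$p.value
```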
Gene Ontology: FatiGO
Many options
• GSEA
• IGA/Rank Prod
• GenMAPP, and MAPPFinder
• FatiGO
• Segal et al., 2004
Gene Set Enrichment
• proposed by Mootha et al (2003)
• similar but more complex and computationally expensive
• A Kolmogorov-Smirnov running sum is computed
Gene Set Enrichment
• For each gene set S
  • genes are ordered according to some criterion (t-test; fold change).
  • Start at top ranking gene
  • A running sum increases when a gene in set S is encountered and decreases otherwise
  • The enrichment score (ES) for a set S is defined to be the largest value of the running sum.
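A minimal sketch of the running sum. The function name `enrichment_score`, the gene names and the step sizes are assumptions for illustration: hits add sqrt((N - G) / G) and misses subtract sqrt(G / (N - G)), a weighting chosen so that the running sum returns to zero after the last gene (N genes in total, G of them in the set).

```r
# Hypothetical helper: ranked_genes is ordered from top ranking to bottom
enrichment_score <- function(ranked_genes, gene_set) {
  hit <- ranked_genes %in% gene_set
  N <- length(ranked_genes)
  G <- sum(hit)
  step <- ifelse(hit, sqrt((N - G) / G), -sqrt(G / (N - G)))
  max(cumsum(step))      # ES: largest value of the running sum
}

ranked <- paste0("g", 1:10)                             # g1 is the top ranking gene
es_top    <- enrichment_score(ranked, c("g1", "g2", "g3"))   # set at the top
es_bottom <- enrichment_score(ranked, c("g8", "g9", "g10"))  # set at the bottom

c(top = es_top, bottom = es_bottom)
```

A set concentrated at the top of the ranking gives a large ES, while a set at the bottom gives an ES near zero; permuting the class labels then provides the reference distribution.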
Kolmogorov-Smirnov test
Running sum over statistics. Compare distance to random distribution
Gene Set Enrichment
• The maximal ES (MES), over all sets S under consideration is recorded.
• For each of B permutations of the class label, ES and MES values are computed.
• The observed MES is then compared to the B values of MES that have been computed, via permutation.
• This is a single p-value for all tests and hence needs no correction
Selection of Categories
• pathways (KEGG, cMAP, BioCarta)
• GO molecular function, biological process, cellular location
• published literature
• Genome Info: regions of synteny; cytogenetic bands
Take care when selecting categories a priori
num categories >>>> num genes (multiple comparison problem)
Dr. Frederick Frankenstein: Igor, would you mind telling me whose brain I did put in?
Igor: And you won't be angry?
Dr. Frederick Frankenstein: I will NOT be angry.
Igor: Abby someone.
Dr. Frederick Frankenstein: Abby someone. Abby who?
Igor: Abby Normal.
Dr. Frederick Frankenstein: Abby Normal?
Igor: I'm almost sure that was the name.
Dr. Frederick Frankenstein: Are you saying that I put an abnormal brain into a seven and a half foot long, fifty-four inch wide GORILLA? IS THAT WHAT YOU'RE TELLING ME?
From the film Young Frankenstein, 1974
Good Experimental Design & Sample Processing is Critical