henrik bengtsson [email protected] mathematical statistics, centre for mathematical sciences,
DESCRIPTION
Statistical analysis of microarray data. Henrik Bengtsson [email protected] Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden Bioinformatics 2 , Bioinformatics Centre, University of Copenhagen. February 19, 2004. Outline. - PowerPoint PPT PresentationTRANSCRIPT
Henrik [email protected]
Mathematical Statistics, Centre for Mathematical Sciences,
Lund University, Sweden
Bioinformatics 2, Bioinformatics Centre, University of Copenhagen.
February 19, 2004
Statistical analysis of Statistical analysis of microarray datamicroarray data
2 of 91
OutlinePart I – (Very short) Background
• Central Dogma of Biology• Idea behind the microarray technology• Experimental design
Part II - Printing, Hybridization, Scanning & Image Analysis• From clone to slide• From samples to hybridization• From scanning to raw data
Part III - Preprocessing of data (IMPORTANT)• Exploratory data analysis and log transforms• Background correction• Normalization
Part IV - Identifying differentially expressed genes• Cut-off by log-ratios values, the t-statistics and cut-off by T values• Multiple testing, adjusting the p-values• Validation
3 of 91
The cDNA microarray technology-
PART I: (Very short) Background
1. The Central Dogma of Biology2. Idea behind the microarray technology3. Experimental design
4 of 91
Idea of microarrays (and other gene expression techniques):
Measure the amount of mRNA to find genes that are expressed (active).
(measuring protein might be better, but is currently much harder)
The Central Dogma of Biology
5 of 91
History
cDNA microarrays have evolved from Southern & Northern blots, with clone libraries gridded out on nylon membrane filters being an important and still widely used intermediate.
Things took off about 1995 with the introduction of non-porous solid supports, such as glass - these permitted miniaturization - and fluorescence based detection.
Terry Speed et al.
6 of 91
The cDNA Microarray Technique
1. Put a large number of DNA sequences or synthetic DNA oligomers onto a glass slide- 5000-50000 gene expressions at the same time.
2. Measure amounts of cDNA bound to each spot3. Identify genes that behave differently in different cell
populations- tumor cells vs healthy cells- brain cells vs liver cells- same tissue different organisms
• Time series experiments- gene expressions over time after treatment, e.g. T=5,10,30,... min vs T=0.
• ...
7 of 91
Experimental Design1. Experimental design is very important.
Example: Assume we have 8 control (C1, ... C8) and 8 reference (R1, ..., R8) samples, should we compare a) C1-R1, ..., C8-R8, or b) C1-R’, ..., C8-R’ where R’ is a pool of all R, or c) something else?
2. Wrong design can be very wrong and the correct can be very efficient and require less arrays (= less lab work and cheaper experiments!).
3. Statistical theory exists and is helpful. Recently, statisticans has started to look at the microarray design problem in more detail. Progress will for be made sure! Experience from previous setups is also of great value.
We will not discuss this problem anymore in this talk.
8 of 91
The cDNA microarray technology-
PART II: Printing, Hybridization,
Scanning & Image Analysis
1. From clone to slide2. From samples to hybridization3. From scans to raw data
9 of 91
Overview
scanning
data: (Rfg,Gfg,Rbg,Gbg, ...)
cDNA clones
PCR product amplificationpurification
printing
Hybridize
RNA
Test sample
cDNA
RNA
Reference sample
cDNA
excitationred lasergreen
laser
emission
overlay images
Production
10 of 91
Printing / spotting
Arrayer (approx 100,000 EUR)
11 of 91 J. Vallon-Christersson, Dept Oncology, Lund Univ.
Microarray slide preparation
12 of 91
Terminology: probe and target• As defined in Nature 1999:
– The probes are the immobilized DNA sequences spotted on the array, i.e. spot, oligo, immobile substrate.
– The targets are the labeled cDNA sequences to be hybridized to the array, i.e. mobile substrate.
The opposite usage can also be seen in some references. However, think of probes as the measuring device (which you can buy), and the targets (that you provide) as what you want to measure.
13 of 91
RNA extraction & hybridization
Hybridize
RNA
Tumor sample
cDNA
RNA
Reference sample
cDNA
1. Extract mRNA from samples2. Reverse transcription of mRNA to
cDNA in presence of aminoallyl-dUTP (aa-dUTP)
3. Label with Cy3 or Cy5 fluorescent dye (links to aa-dUTP)
4. Hybridize labeled cDNA to slide5. Wash slide
Figure: Hybridization chamber.
(probes)
(targets)
Preparation incl. array (approx 100-200 EUR/array)
14 of 91
ReferencesOriginal cDNA microarray paper:• Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. Quantitative monitoring of gene
expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, October 1995.
General:• Mark Schena. Microarrays Analysis. John Wiley & Sons, Inc., Hoboken, New Jersey, 2003.
• David J. Duggan, Michael Bittner, Yidong Chen, and Paul Meltzer & Jeffrey M. Trent. Expression profiling using cDNA microarrays. Nature Genetics, 21(1 Supplement):10–14, January 1999.
15 of 91
Scanning
Terry Speed et al.
16 of 91Scanner (approx 50,000-100,000 EUR)
Software (approx 0-5,000 EUR)
17 of 91
Two-channel scanningexcitation
red lasergreen
laser
emission
overlay images
higher frequency, more energy
lower frequency,
less energy
18 of 91
Combined color image for visualization
19 of 91
Image Analysis
Terry Speed et al.
20 of 91
Signal quantification
1. AddressingLocate spot centers.
2. SegmentationClassification of pixels either as signal or background (using circles, seeded region growing or other).
3. Signal quantificationa) foreground estimatesb) background estimatesc) ... (shape, size etc)
Terry Speed et al.
21 of 91
Does one size fit all?
Terry Speed et al.
22 of 91
Segmentation and quantification of spot regions
Seeded region growing Fixed-sized or adaptive-sized circles
Inside the boundaries is foreground (the spots) and outside is the background. We estimate both...
Terry Speed et al.
23 of 91
Robust signal estimates:mean vs median
• When a statistician talks about a robust measure he/she normally means a measure that is not sensitive to outliers, e.g.
x = (8, 85, 7, 9, 5, 4, 13, 6, 8)
– The mean of all x’s, i.e. (x1+x2+...+xK)/K, will be affected if one of the measures, say x2, is wrong;
mean(x) = 16.11
– The median of all x’s, i.e. the middle value of (x1+x2+...+xK), will not be affected if a “few” (< 50%) are wrong;
median(x) = 8.0
For this reason we prefer to use median instead of the mean in cases were we expect artifacts.
(If there are a lot of measurements and the errors are symmetrically distributed the median will give the same result as the mean without outliers.)
24 of 91
Typical data output
25 of 91
Summary
scanning
data: (Rfg,Gfg,Rbg,Gbg, ...)
cDNA clones
PCR product amplificationpurification
printing
Hybridize
RNA
Test sample
cDNA
RNA
Reference sample
cDNA
excitationred lasergreen
laser
emission
overlay images
Production
26 of 91
ReferencesImage analysis:• Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of
methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000.
• Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.
27 of 91
The cDNA microarray technology-
PART III: Preprocessing of data (IMPORTANT)
1. Exploratory Data Analysis and log-transforms
2. Background correction3. Normalization
28 of 91
Exploratory Data Analysis
Terry Speed et al.
29 of 91
Scatter plot: R vs G
“Observed” data {(R,G)}n=1..5184:
R = signal in red channel, G = signal in green channel
up-regulated genes
down-regulated genes
non-differentially expressed genes along the diagonal:
R = G
Most genes have low gene expression levels. What happens here?
30 of 91
Scatter plot: log2R vs log2G
“Observed” data {(log2R,log2G)}n=1..5184:
R = signal in red channel, G = signal in green channel
up-regulated genes
down-regulated genes
non-differentially expressed genes are still along the diagonal:
log2R = log2G
Low gene expression levels are “blown up”.
31 of 91
Scatter plot: M vs A (recommended)
up-regulated genes
down-regulated genes
non-differentially expressed genes are now along the horizontal line:
M = 0
log2R - log2G = 0
Transformed data {(M,A)}n=1..5184:
M = log2(R) - log2(G) (minus)
A = ½·[log2(R) + log2(G)] (add)
Note: M vs A is basically a rotation of the log2R vs log2G scatter plot.
Why: Now the quantity of interest, i.e. the fold change, is contained in one variable, namely M!
If M > 0, up-regulated.If M < 0, down-regulated.
32 of 91
Details on M vs A
Log-ratios:
M = log2(R) – log2(G) = [logarithmic rules] = log2(R/G)
Average log-intensities:
A = ½·[log2(R) + log2(G)] = [logarithmic rules] = ½·log2(R·G)
There is a one-to-one relationship between (M,A) and (R,G):
R=(22A+M)1/2, G=(22A-M)1/2
33 of 91
More on why log and why M vs A?• It makes the distribution
symmetric around zero
R:G M=log(R/G) R:G M=log(R/G) comment
1:1 0 20=1
2:1 +1 1:2 -1 21=2, 2-1=1/2
4:1 +2 1:4 -2 22=4, 2-2=1/4
8:1 +3 1:8 -3 23=8, 2-3=1/8
16:1 +4 1:16 -4 24=16, 2-4=1/16
32:1 +5 1:32 -5 25=32, 2-5=1/32
Before: After:
• Logs stretch out region we are most interested in and makes the distribution more normal. • Easier to see artifacts of
the data, .e.g. as intensity dependent variation and dye-bias.
• Log base 2 because the raw data is binary data (max intensity is 216-1 = 65535). It is also naturally to think of 2-, 4-, 8-fold etc up and down regulated genes. For the actual analysis, any log-base will do.
34 of 91
M = log2(R/G) (log-ratio),
A = ½·log2(R·G) (log-intensity)
Summary of (R,G) (log2R,log2G) (M,A)
R = red channel signalG = green channel signal
R vs G log(R) vs log(G) M vs A
plot(rg, log=FALSE); plot(rg); ma <- as.MAData(rg); plot(ma)
35 of 91
Density plots of all signals
log2R = red channel signallog2G = green channel signal
Example: Signals in the red channel seem to be slightly weaker than the signals in the green channel, compare with M vs A plot:
...more green.
plotDensity(rg)
36 of 91
Spatial plot of log-ratios (M values)
Print-tip 1
Print-tip 16
Print-tip 8
Print-tip 9
plotSpatial(ma)
37 of 91
Print-tip box plot of log-ratios1
16
boxplot(ma, groupBy=“printtip”)
38 of 91
Printing order of spots1
16
6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...
Hmm... why the horizontal stripes?
39 of 91
Print-order plot of log-ratiosThe spots are order according to when they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on all arrays.
plotPrintorder(ma)
40 of 91
Print-dip plot of log-ratiosAverage values of the 16 log-ratios at each dip from each of the 399 print turns.
plotDiporder(ma)
41 of 91
ReferencesExploratory data analysis for microarrays:• Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA
microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering.
• Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002.
• Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003
42 of 91
Background correction and background estimates revisited...
Terry Speed et al.
43 of 91
Alt. 1: Local background estimatesOne for the red and one for the green channel.
A. Bengtsson, Lund University, 2003.
Axon GenePix Pro- size of bg region depends on sizeof the spot. Give too much variation!
Fixed-size regions- same number of background pixelsgive same variation for all estimate.
44 of 91
Alt. 2: Filtered background estimatesOne for the red and one for the green channel.
A. Bengtsson, Lund University, 2003.
In general, filter by moving a window (mask) over the image:
1) At each position replace the value in the center (red dot) by, say, the median pixel intensity of all spots in the window.
2) Move to next pixel and so on.
3) Result:
(more sophisticated filters than the median filter are used in practice)
Idea: With filters we do not have to know the where the spots are. More robust.
Also known as: Morphological filters
45 of 91
• The foreground region of the spots are assumed to be contaminated with background noise/signal.
• Sources of background: probe unspecifically sticking on slide, irregular / dirty slide surface, dust, noise in the scanner measurement.
• From the image analysis we have foreground and background estimates for each spot in both channels:
(Rfg, Rbg, Gfg, Gbg)
• For each spot on the slide we do background subtraction
Red intensity: R = Rfg – Rbg
Green intensity: G = Gfg - Gbg
• Not considered: real cross-hybridisation and unspecific hybridisation to the probe. Still not known if this is a real problem.
Background subtraction
Terry Speed et al.
Example of an extremely bad array.
46 of 91
Which background estimate should we use?
Axon GenePix Pro- size of bg region depends on sizeof the spot. Give too much variation! Fixed-size regions
- same number of background pixelsgive same variation for all estimate.
Filtered estimates- no spot segmentation.
Fixed-sized are better than variable-sized region methods, but filtered estimates may be even better. Should we do background correction at all? (A. Bengtsson, 2003) is a good start, but more research is needed.
47 of 91
Background correction or not?non-background subtracted background subtracted
-seems better, but...
M = log2(Rfg/Gfg) Mbg = log2({Rfg-Rbg} / {Gfg-Gbg})
48 of 91
Background problem still not solved!
GenePix background:
Spotbackground:
(morphological opening)
Foreground & background Background subtracted
-Still curvature left.Too little back-
ground correction?!
-Curvature in the other direction now!
Too much back-ground correction?!
49 of 91
ReferencesBackground estimation and correction:• Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for
image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000.
• Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.
• Henrik Bengtsson, Göran Jönsson, and Johan Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003.
• Charles Kooperberg, Thomas G. Fazzio, Jeffrey J. Delrow, and Thoshio Tsukiyama. Improved background correction for spotted DNA microarrays. Journal of Computational Biology, 9:55–66, 2002.
50 of 91
Normalization
Terry Speed et al.
51 of 91
Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0.
Idea: Do various exploratory plots to see if this assumption is met. For example, M vs A, spatial plots, density & boxplots plots, print-order plots etc.
Result: We commonly observe something else:Measured value = real value + systematic errors + noise
Correction: If so, normalize the data such that the expectations are met:Corrected value = real value + systematic errors + noise
Normalization
52 of 91
Sources of systematic effects, but also noise and natural variability
• Biological variability • RNA extraction• Probe labeling (dye differences)• Printing (print-order, plate-order, clone variation, etc.)• Hybridization (temperature, time, mixing, etc.)• Human (variation between lab researchers)• Scanning (laser and detector, chemistry of the fluorescent label)• Image analysis (identification, quantification, background methods etc.)
53 of 91
Examples of practical problems that can lead to artifacts in data
• Surface chemistry: uneven surface may lead to high background.
• Dipping the pin into large volume. It helps to pre-printing on a dummy array to drain off excess sample.
• Spot variation can be due to mechanical difference between pins. Pins could be clogged during the printing process.
• Spot size and density depends on surface and solution properties.
• Pins need good washing between samples to prevent sample carryover.
• Dust while printing and while hybridizing the arrays.• ... and much more!
54 of 91
Within-slide normalizationCurrent methods widely used today:
1. No normalization2. Median normalization
Shift of log-ratios.
3. Curve-fit normalization methods.A.k.a. lo(w)ess normalization.
4. Dye-swap normalizationActually an experimental design.
55 of 91
Within-slide normalization cont.The median, the curve-fit and the dye-swap normalization and others, can be applied to different groups of spots:
1. all spots2. print-tip groups3. plate groups4. ...but also1. spatially2. on spots ordered in print order3. ...
56 of 91
No normalization – raw dataNon-normalized data {(M,A)}n=1..N:
M = log2(R/G)
aroma:´plot(ma, slide=2)
57 of 91
Global median normalization
M’ = M – median(M)
Assumption: Changes roughly symmetric.
aroma:normalizeWithinSlide(ma, method=“m”)
58 of 91
Global curve-fit normalizationGlobal curve-fit normalized data {(M,A)}n=1..N:
Mnorm = M - c(A)
where c(A) is an intensity dependent function (loess, splines, ...).
aroma:plot(ma); lowessCurve(ma)
59 of 91
Global curve-fit normalization cont.
(Biased towards the green channel & intensity dependent artifacts)
aroma:
plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma)
curve-fit normalization done for
all spots together
60 of 91
Print-tip curve-fit normalizationPrint-tip curve-fit normalized data {(M,A)}n=1..N:
Mp,norm = Mp - cp(A); p = print tip 1-16
where cp(A) is an intensity dependent function for print tip p.
aroma:plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)
61 of 91
Print-tip curve-fit normalization cont.
aroma:
normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma)
curve-fit normalization done for each print-tip group
separately
62 of 91
Scaled print-tip curve-fit normalizationScaled print-tip curve-fit normalized data {(M,A)}n=1..N:
Mp,norm = sp·(Mp - cp(A)); p=print tip 1-16
where sp is a scale factor for print tip p (Median Absolute Deviation).After print-tip normalization After scaled print-tip normalization
aroma:
normalizeWithinSlide(ma, “p”); normalizeWithinSlide(ma, “s”)
curve-fit normalization+ rescaling
done for each print-tip group
separately
63 of 91
Spatial artifacts after normalization
No normalization Global curve-fit normalization
Print-tip curve-fit normalization Scaled print-tip curve-fit normalization
64 of 91
Another example
Scaled print-tip curve-fit normalization:
65 of 91
Dye-swap normalizationIdea: We have observed dye-dependent biases. What if we do the same hybridization a second time, but with reversed labeling of the dyes? Any dye-effects will hopefully cancel out, i.e. self-normalize, if the dyes are swapped and the signals are averaged.Requires: Paired arrays, i.e. 2, 4, 6, ... arrays.
Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then:
M = (M1-M2)/2, A = (A1+A2)/2
66 of 91
Dye-swap normalization cont.
What is done? Let {(R1,G1)} and {(R2,G2)} be the corresponding red and green signals from array 1 and 2. Then
M = (M1-M2)/2 = [log2(R1/G1)-log2(R2/G2)]/2= [log2R1-log2G1-(log2R2-log2G2)]/2 = [(log2R1-log2R2)+(log2G2-log2R1)]/2
= [log2(R1/R2) + log2(G2/G1)]/2 = (M1’+M2’)/2
Similarly for A = (A1+A2)/2 = ... = (A1’+A2’)/2. Now, if the red signals are equally biased, then the effects are hopefully cancelled-out. Same for the green channel.
Thus, dye-swap is averaging two virtual (hopefully) self-normalized slides.
Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then:
M = (M1-M2)/2, A = (A1+A2)/2
67 of 91
Median plate normalization
Will remove some of the intensity dependent effects...
...and some of the spatial artifacts.
68 of 91
Within-slide normalization summaryAlternatives:
1. No normalization
2. Median normalizationa) all spotsb) print-tipc) plate-group
3. Curve-fit normalizationa) Global curve-fit normalizationb) Print-tip curve-fit normalizationc) Scaled print-tip curve-fit normalization
4. Dye-swap normalization
69 of 91
Across-slides normalization
The within-slide normalization makes the red and the green signals comparable. What about signals from different arrays? Across-slide normalization attacks this problem.
Idea: Log-ratios (M values) from different slides should have equal variation.
Example: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.
70 of 91
Idea: Log-ratios (M values) from different slides should have equal variation.
True ratio is M’ij where j represents different spots and i represents different arrays. Observed is Mij, where Mij = ai * M’ij. Robust estimate of ai is
Corrected values are calculated as:
0 < ai < 1: how much spread slide i has compared to the average of the other slides.
(Conceptually we could have used the standard deviation instead, but it is not robust!)
aroma:normalizeAcrossSlides(ma)
Across-slides normalization cont.
71 of 91
Combining data from several slides
1. Within-slide normalization
2. Across-slides normalization
3. Averaging
+ more information
aroma:normalizeWithinSlide(ma, “p”); normalizeAcrossSlides(ma); tma <- as.TMAData(ma)
72 of 91
The average of all normalized slidesThe “average” slide:
aroma:plot(tma)
73 of 91
ReferencesNormalization:• Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA microarray data. In
Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering.
• Henrik Bengtsson and Ola Hössjer. Methodological study of affine transformations of gene expression data with proposed normalization method. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003.
• Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression . Bioinformatics, 1:1–9, 2002.
• Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002.
• Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003
• Henrik Bengtsson. aroma - An R Object-oriented Microarray Analysis environment. Preprint in Mathematical Sciences (manuscript in progress), Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2004. URL: http://www.braju.com/R/
74 of 91
The cDNA microarray technology-
PART IV: Identifying differentially expressed
genes1. Cut-off by M values2. The t-statistics and cut-off by T values3. Multiple testing and adjusted the p-values4. Validation
75 of 91
Average of all normalized slidesThe “average” slide:
aroma:plot(tma)
76 of 91
Cut-off by log-ratios (naive)Top 5% of the absolute M values:
aroma:plot(tma); highlight(tma, top(abs(tma$M), 0.05), col=“red”)
77 of 91
The t-statisticsIdea: For replicated data, i.e. multiple measurements of the same thing, we trust the estimate of the average (mean or median) more if the deviation (std.dev. or MAD) is small. If the deviation is large, we do not trust it that much.
• The T statistics down-weight the importance of the average if the deviation is large and vice versa;
T = mean(x) / SE(x)
where SE(x)=std.dev(x)/N (standard error of the mean)
•Example: The blue and the red genes have almost the same average log-ratio, but we are more confident with the measure of the blue gene since its variability across replicates is smaller.
aroma:tma <- as.TMAData(ma); summary(ma$T)
78 of 91
Cut-off by T values (better)
Top 5% of the absolute T values:
aroma:plot(tma); highlight(tma, top(abs(tma$T), 0.05), col=“blue”)
79 of 91
Where are the M values compared to the T values?
Top 5% of the absolute M values in red:
aroma:plot(tma); plot(tma, “TvsSE”); ...
80 of 91
ReferencesIdentification of differentially expressed genes:• Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical
methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000.
• M. Callow, S. Dudoit, E. Gong, T. Speed, and E. Rubin. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research, 10(12):2022–9, December 2000.
• Ingrid Lönnstedt and Terence P. Speed. Replicated microarray data. Statistical Sinica, 12(1), 2002.
81 of 91
Multiple testing• In a microarray experiment, each gene (each
probe or probe set) is really a separate experiment.
• Yet, if you treat each gene as an independent comparison, you will always find some with significant differences.
82 of 91
False positive and false negativeFalse Positive and False Negative: Statistical test vs. truth
If truth is T ≠ 0: If truth is T=0:
Statistical test decision: Reject H0: T=0.
Correctly reject H0 False Positive(Type I Error)
Statistical test decision:Do not reject H0: T=0.
False Negative(Type II Error)
Correctly not reject H0
In cDNA microarray experiments we commonly test the hypothesis H0 that T=0 against T≠0 (non-DE or not) for every gene separately. For the genes for which we reject H0, we say they are differentially expressed.
However, by chance we will reject H0 for some genes that are not DE. We call these findings false positive (Type I). Genes that are DE, but for which we do not reject H0 are called false negative (Type II).
83 of 91
False Discovery Rate• False Discovery Rate (FDR) is equal to the p-value
of the test, p = P(|T| > tα/2), times the number of genes in the array– For a p-value of 0.01·10,000 genes = 100 false DEs.– You cannot eliminate false positives, but by choosing a
more stringent p-value, you can keep them manageable (try p=0.001 10 false positives)
• The FDR must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values. It’s tricky, but at least you should be aware of it.
84 of 91
Multiple testing cont.
• Multiple testing pitfalls– thousands of tests, i.e. each gene is tested against
H0: T=0. By chance some will “fail”.– false positives problems more serious.– need to adjust p-values.
• Different adjustment procedures– Bonferroni, Sidak, Duncan, Holm, etc. Not discussed
here, but available automatically in the better microarray analysis software.
85 of 91
Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.
Terry Speed et al.
86 of 91
ReferencesMultiple testing and adjusted p-values:
• Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000.
• Y. Ge, S. Dudoit & T. P. Speed, Resampling-based multiple testing for microarray data hypothesis (submitted to Test, Spain). Technical Report #633 of UCB Statistics, 2003.
87 of 91
Validation of identified genes
88 of 91
Exploratory validation plots Were are the differentially expressed genes located?
Does it make sense? If there is a pattern, you might need to redo the normalization.
89 of 91
Biological verification1. Having a top 20 list of interesting genes, what biological
relevance do they have in this experiment? 2. Have you found interesting genes, but with very noisy
data? Do more experiments and see if you can replicate the results. Otherwise, you might just have found noise!
3. Verify interesting genes by other means such as Southern or Northern blots and quantative real-time PCR (QRT-PCR) or other microarray technologies etc.
90 of 91
Summary: Identification of DEs• You need replication and statistics to find real
differences.• Cutoff by log ratios is not enough/correct.• Cutoff by t-statistics is much better.• Multiple testing => must adjust the p values.• Validate your results
91 of 91
Take-home messages1. Good image analysis is essential. Some software are obsolete
and not that good. Commercial software like GenePix could be better (and hopefully will be soon).
2. Background correction or not is not solved. Progress has been done, but more research is needed.
3. Normalization is needed. We understand more now than two years ago.
4. Use at least the t-statistics to identify differentially expressed genes. Do not rely exclusively on log-ratios.
5. Multiple testing must be considered; adjust your p-values.
6. Talk to a statistician before doing the experiments! We do think about these kind of problems for a living.