henrik bengtsson hb@maths.lth.se mathematical statistics, centre for mathematical sciences,

Post on 16-Mar-2016

43 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Statistical analysis of microarray data. Henrik Bengtsson hb@maths.lth.se Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden Bioinformatics 2 , Bioinformatics Centre, University of Copenhagen. February 19, 2004. Outline. - PowerPoint PPT Presentation

TRANSCRIPT

Henrik Bengtssonhb@maths.lth.se

Mathematical Statistics, Centre for Mathematical Sciences,

Lund University, Sweden

Bioinformatics 2, Bioinformatics Centre, University of Copenhagen.

February 19, 2004

Statistical analysis of Statistical analysis of microarray datamicroarray data

2 of 91

OutlinePart I – (Very short) Background

• Central Dogma of Biology• Idea behind the microarray technology• Experimental design

Part II - Printing, Hybridization, Scanning & Image Analysis• From clone to slide• From samples to hybridization• From scanning to raw data

Part III - Preprocessing of data (IMPORTANT)• Exploratory data analysis and log transforms• Background correction• Normalization

Part IV - Identifying differentially expressed genes• Cut-off by log-ratios values, the t-statistics and cut-off by T values• Multiple testing, adjusting the p-values• Validation

3 of 91

The cDNA microarray technology-

PART I: (Very short) Background

1. The Central Dogma of Biology2. Idea behind the microarray technology3. Experimental design

4 of 91

Idea of microarrays (and other gene expression techniques):

Measure the amount of mRNA to find genes that are expressed (active).

(measuring protein might be better, but is currently much harder)

The Central Dogma of Biology

5 of 91

History

cDNA microarrays have evolved from Southern & Northern blots, with clone libraries gridded out on nylon membrane filters being an important and still widely used intermediate.

Things took off about 1995 with the introduction of non-porous solid supports, such as glass - these permitted miniaturization - and fluorescence based detection.

Terry Speed et al.

6 of 91

The cDNA Microarray Technique

1. Put a large number of DNA sequences or synthetic DNA oligomers onto a glass slide- 5000-50000 gene expressions at the same time.

2. Measure amounts of cDNA bound to each spot3. Identify genes that behave differently in different cell

populations- tumor cells vs healthy cells- brain cells vs liver cells- same tissue different organisms

• Time series experiments- gene expressions over time after treatment, e.g. T=5,10,30,... min vs T=0.

• ...

7 of 91

Experimental Design1. Experimental design is very important.

Example: Assume we have 8 control (C1, ... C8) and 8 reference (R1, ..., R8) samples, should we compare a) C1-R1, ..., C8-R8, or b) C1-R’, ..., C8-R’ where R’ is a pool of all R, or c) something else?

2. Wrong design can be very wrong and the correct can be very efficient and require less arrays (= less lab work and cheaper experiments!).

3. Statistical theory exists and is helpful. Recently, statisticans has started to look at the microarray design problem in more detail. Progress will for be made sure! Experience from previous setups is also of great value.

We will not discuss this problem anymore in this talk.

8 of 91

The cDNA microarray technology-

PART II: Printing, Hybridization,

Scanning & Image Analysis

1. From clone to slide2. From samples to hybridization3. From scans to raw data

9 of 91

Overview

scanning

data: (Rfg,Gfg,Rbg,Gbg, ...)

cDNA clones

PCR product amplificationpurification

printing

Hybridize

RNA

Test sample

cDNA

RNA

Reference sample

cDNA

excitationred lasergreen

laser

emission

overlay images

Production

10 of 91

Printing / spotting

Arrayer (approx 100,000 EUR)

11 of 91 J. Vallon-Christersson, Dept Oncology, Lund Univ.

Microarray slide preparation

12 of 91

Terminology: probe and target• As defined in Nature 1999:

– The probes are the immobilized DNA sequences spotted on the array, i.e. spot, oligo, immobile substrate.

– The targets are the labeled cDNA sequences to be hybridized to the array, i.e. mobile substrate.

The opposite usage can also be seen in some references. However, think of probes as the measuring device (which you can buy), and the targets (that you provide) as what you want to measure.

13 of 91

RNA extraction & hybridization

Hybridize

RNA

Tumor sample

cDNA

RNA

Reference sample

cDNA

1. Extract mRNA from samples2. Reverse transcription of mRNA to

cDNA in presence of aminoallyl-dUTP (aa-dUTP)

3. Label with Cy3 or Cy5 fluorescent dye (links to aa-dUTP)

4. Hybridize labeled cDNA to slide5. Wash slide

Figure: Hybridization chamber.

(probes)

(targets)

Preparation incl. array (approx 100-200 EUR/array)

14 of 91

ReferencesOriginal cDNA microarray paper:• Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. Quantitative monitoring of gene

expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, October 1995.

General:• Mark Schena. Microarrays Analysis. John Wiley & Sons, Inc., Hoboken, New Jersey, 2003.

• David J. Duggan, Michael Bittner, Yidong Chen, and Paul Meltzer & Jeffrey M. Trent. Expression profiling using cDNA microarrays. Nature Genetics, 21(1 Supplement):10–14, January 1999.

15 of 91

Scanning

Terry Speed et al.

16 of 91Scanner (approx 50,000-100,000 EUR)

Software (approx 0-5,000 EUR)

17 of 91

Two-channel scanningexcitation

red lasergreen

laser

emission

overlay images

higher frequency, more energy

lower frequency,

less energy

18 of 91

Combined color image for visualization

19 of 91

Image Analysis

Terry Speed et al.

20 of 91

Signal quantification

1. AddressingLocate spot centers.

2. SegmentationClassification of pixels either as signal or background (using circles, seeded region growing or other).

3. Signal quantificationa) foreground estimatesb) background estimatesc) ... (shape, size etc)

Terry Speed et al.

21 of 91

Does one size fit all?

Terry Speed et al.

22 of 91

Segmentation and quantification of spot regions

Seeded region growing Fixed-sized or adaptive-sized circles

Inside the boundaries is foreground (the spots) and outside is the background. We estimate both...

Terry Speed et al.

23 of 91

Robust signal estimates:mean vs median

• When a statistician talks about a robust measure he/she normally means a measure that is not sensitive to outliers, e.g.

x = (8, 85, 7, 9, 5, 4, 13, 6, 8)

– The mean of all x’s, i.e. (x1+x2+...+xK)/K, will be affected if one of the measures, say x2, is wrong;

mean(x) = 16.11

– The median of all x’s, i.e. the middle value of (x1+x2+...+xK), will not be affected if a “few” (< 50%) are wrong;

median(x) = 8.0

For this reason we prefer to use median instead of the mean in cases were we expect artifacts.

(If there are a lot of measurements and the errors are symmetrically distributed the median will give the same result as the mean without outliers.)

24 of 91

Typical data output

25 of 91

Summary

scanning

data: (Rfg,Gfg,Rbg,Gbg, ...)

cDNA clones

PCR product amplificationpurification

printing

Hybridize

RNA

Test sample

cDNA

RNA

Reference sample

cDNA

excitationred lasergreen

laser

emission

overlay images

Production

26 of 91

ReferencesImage analysis:• Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of

methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000.

• Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.

27 of 91

The cDNA microarray technology-

PART III: Preprocessing of data (IMPORTANT)

1. Exploratory Data Analysis and log-transforms

2. Background correction3. Normalization

28 of 91

Exploratory Data Analysis

Terry Speed et al.

29 of 91

Scatter plot: R vs G

“Observed” data {(R,G)}n=1..5184:

R = signal in red channel, G = signal in green channel

up-regulated genes

down-regulated genes

non-differentially expressed genes along the diagonal:

R = G

Most genes have low gene expression levels. What happens here?

30 of 91

Scatter plot: log2R vs log2G

“Observed” data {(log2R,log2G)}n=1..5184:

R = signal in red channel, G = signal in green channel

up-regulated genes

down-regulated genes

non-differentially expressed genes are still along the diagonal:

log2R = log2G

Low gene expression levels are “blown up”.

31 of 91

Scatter plot: M vs A (recommended)

up-regulated genes

down-regulated genes

non-differentially expressed genes are now along the horizontal line:

M = 0

log2R - log2G = 0

Transformed data {(M,A)}n=1..5184:

M = log2(R) - log2(G) (minus)

A = ½·[log2(R) + log2(G)] (add)

Note: M vs A is basically a rotation of the log2R vs log2G scatter plot.

Why: Now the quantity of interest, i.e. the fold change, is contained in one variable, namely M!

If M > 0, up-regulated.If M < 0, down-regulated.

32 of 91

Details on M vs A

Log-ratios:

M = log2(R) – log2(G) = [logarithmic rules] = log2(R/G)

Average log-intensities:

A = ½·[log2(R) + log2(G)] = [logarithmic rules] = ½·log2(R·G)

There is a one-to-one relationship between (M,A) and (R,G):

R=(22A+M)1/2, G=(22A-M)1/2

33 of 91

More on why log and why M vs A?• It makes the distribution

symmetric around zero

R:G M=log(R/G) R:G M=log(R/G) comment

1:1 0 20=1

2:1 +1 1:2 -1 21=2, 2-1=1/2

4:1 +2 1:4 -2 22=4, 2-2=1/4

8:1 +3 1:8 -3 23=8, 2-3=1/8

16:1 +4 1:16 -4 24=16, 2-4=1/16

32:1 +5 1:32 -5 25=32, 2-5=1/32

Before: After:

• Logs stretch out region we are most interested in and makes the distribution more normal. • Easier to see artifacts of

the data, .e.g. as intensity dependent variation and dye-bias.

• Log base 2 because the raw data is binary data (max intensity is 216-1 = 65535). It is also naturally to think of 2-, 4-, 8-fold etc up and down regulated genes. For the actual analysis, any log-base will do.

34 of 91

M = log2(R/G) (log-ratio),

A = ½·log2(R·G) (log-intensity)

Summary of (R,G) (log2R,log2G) (M,A)

R = red channel signalG = green channel signal

R vs G log(R) vs log(G) M vs A

plot(rg, log=FALSE); plot(rg); ma <- as.MAData(rg); plot(ma)

35 of 91

Density plots of all signals

log2R = red channel signallog2G = green channel signal

Example: Signals in the red channel seem to be slightly weaker than the signals in the green channel, compare with M vs A plot:

...more green.

plotDensity(rg)

36 of 91

Spatial plot of log-ratios (M values)

Print-tip 1

Print-tip 16

Print-tip 8

Print-tip 9

plotSpatial(ma)

37 of 91

Print-tip box plot of log-ratios1

16

boxplot(ma, groupBy=“printtip”)

38 of 91

Printing order of spots1

16

6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...

Hmm... why the horizontal stripes?

39 of 91

Print-order plot of log-ratiosThe spots are order according to when they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on all arrays.

plotPrintorder(ma)

40 of 91

Print-dip plot of log-ratiosAverage values of the 16 log-ratios at each dip from each of the 399 print turns.

plotDiporder(ma)

41 of 91

ReferencesExploratory data analysis for microarrays:• Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA

microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering.

• Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002.

• Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003

42 of 91

Background correction and background estimates revisited...

Terry Speed et al.

43 of 91

Alt. 1: Local background estimatesOne for the red and one for the green channel.

A. Bengtsson, Lund University, 2003.

Axon GenePix Pro- size of bg region depends on sizeof the spot. Give too much variation!

Fixed-size regions- same number of background pixelsgive same variation for all estimate.

44 of 91

Alt. 2: Filtered background estimatesOne for the red and one for the green channel.

A. Bengtsson, Lund University, 2003.

In general, filter by moving a window (mask) over the image:

1) At each position replace the value in the center (red dot) by, say, the median pixel intensity of all spots in the window.

2) Move to next pixel and so on.

3) Result:

(more sophisticated filters than the median filter are used in practice)

Idea: With filters we do not have to know the where the spots are. More robust.

Also known as: Morphological filters

45 of 91

• The foreground region of the spots are assumed to be contaminated with background noise/signal.

• Sources of background: probe unspecifically sticking on slide, irregular / dirty slide surface, dust, noise in the scanner measurement.

• From the image analysis we have foreground and background estimates for each spot in both channels:

(Rfg, Rbg, Gfg, Gbg)

• For each spot on the slide we do background subtraction

Red intensity: R = Rfg – Rbg

Green intensity: G = Gfg - Gbg

• Not considered: real cross-hybridisation and unspecific hybridisation to the probe. Still not known if this is a real problem.

Background subtraction

Terry Speed et al.

Example of an extremely bad array.

46 of 91

Which background estimate should we use?

Axon GenePix Pro- size of bg region depends on sizeof the spot. Give too much variation! Fixed-size regions

- same number of background pixelsgive same variation for all estimate.

Filtered estimates- no spot segmentation.

Fixed-sized are better than variable-sized region methods, but filtered estimates may be even better. Should we do background correction at all? (A. Bengtsson, 2003) is a good start, but more research is needed.

47 of 91

Background correction or not?non-background subtracted background subtracted

-seems better, but...

M = log2(Rfg/Gfg) Mbg = log2({Rfg-Rbg} / {Gfg-Gbg})

48 of 91

Background problem still not solved!

GenePix background:

Spotbackground:

(morphological opening)

Foreground & background Background subtracted

-Still curvature left.Too little back-

ground correction?!

-Curvature in the other direction now!

Too much back-ground correction?!

49 of 91

ReferencesBackground estimation and correction:• Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for

image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000.

• Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.

• Henrik Bengtsson, Göran Jönsson, and Johan Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003.

• Charles Kooperberg, Thomas G. Fazzio, Jeffrey J. Delrow, and Thoshio Tsukiyama. Improved background correction for spotted DNA microarrays. Journal of Computational Biology, 9:55–66, 2002.

50 of 91

Normalization

Terry Speed et al.

51 of 91

Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0.

Idea: Do various exploratory plots to see if this assumption is met. For example, M vs A, spatial plots, density & boxplots plots, print-order plots etc.

Result: We commonly observe something else:Measured value = real value + systematic errors + noise

Correction: If so, normalize the data such that the expectations are met:Corrected value = real value + systematic errors + noise

Normalization

52 of 91

Sources of systematic effects, but also noise and natural variability

• Biological variability • RNA extraction• Probe labeling (dye differences)• Printing (print-order, plate-order, clone variation, etc.)• Hybridization (temperature, time, mixing, etc.)• Human (variation between lab researchers)• Scanning (laser and detector, chemistry of the fluorescent label)• Image analysis (identification, quantification, background methods etc.)

53 of 91

Examples of practical problems that can lead to artifacts in data

• Surface chemistry: uneven surface may lead to high background.

• Dipping the pin into large volume. It helps to pre-printing on a dummy array to drain off excess sample.

• Spot variation can be due to mechanical difference between pins. Pins could be clogged during the printing process.

• Spot size and density depends on surface and solution properties.

• Pins need good washing between samples to prevent sample carryover.

• Dust while printing and while hybridizing the arrays.• ... and much more!

54 of 91

Within-slide normalizationCurrent methods widely used today:

1. No normalization2. Median normalization

Shift of log-ratios.

3. Curve-fit normalization methods.A.k.a. lo(w)ess normalization.

4. Dye-swap normalizationActually an experimental design.

55 of 91

Within-slide normalization cont.The median, the curve-fit and the dye-swap normalization and others, can be applied to different groups of spots:

1. all spots2. print-tip groups3. plate groups4. ...but also1. spatially2. on spots ordered in print order3. ...

56 of 91

No normalization – raw dataNon-normalized data {(M,A)}n=1..N:

M = log2(R/G)

aroma:´plot(ma, slide=2)

57 of 91

Global median normalization

M’ = M – median(M)

Assumption: Changes roughly symmetric.

aroma:normalizeWithinSlide(ma, method=“m”)

58 of 91

Global curve-fit normalizationGlobal curve-fit normalized data {(M,A)}n=1..N:

Mnorm = M - c(A)

where c(A) is an intensity dependent function (loess, splines, ...).

aroma:plot(ma); lowessCurve(ma)

59 of 91

Global curve-fit normalization cont.

(Biased towards the green channel & intensity dependent artifacts)

aroma:

plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma)

curve-fit normalization done for

all spots together

60 of 91

Print-tip curve-fit normalizationPrint-tip curve-fit normalized data {(M,A)}n=1..N:

Mp,norm = Mp - cp(A); p = print tip 1-16

where cp(A) is an intensity dependent function for print tip p.

aroma:plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)

61 of 91

Print-tip curve-fit normalization cont.

aroma:

normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma)

curve-fit normalization done for each print-tip group

separately

62 of 91

Scaled print-tip curve-fit normalizationScaled print-tip curve-fit normalized data {(M,A)}n=1..N:

Mp,norm = sp·(Mp - cp(A)); p=print tip 1-16

where sp is a scale factor for print tip p (Median Absolute Deviation).After print-tip normalization After scaled print-tip normalization

aroma:

normalizeWithinSlide(ma, “p”); normalizeWithinSlide(ma, “s”)

curve-fit normalization+ rescaling

done for each print-tip group

separately

63 of 91

Spatial artifacts after normalization

No normalization Global curve-fit normalization

Print-tip curve-fit normalization Scaled print-tip curve-fit normalization

64 of 91

Another example

Scaled print-tip curve-fit normalization:

65 of 91

Dye-swap normalizationIdea: We have observed dye-dependent biases. What if we do the same hybridization a second time, but with reversed labeling of the dyes? Any dye-effects will hopefully cancel out, i.e. self-normalize, if the dyes are swapped and the signals are averaged.Requires: Paired arrays, i.e. 2, 4, 6, ... arrays.

Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then:

M = (M1-M2)/2, A = (A1+A2)/2

66 of 91

Dye-swap normalization cont.

What is done? Let {(R1,G1)} and {(R2,G2)} be the corresponding red and green signals from array 1 and 2. Then

M = (M1-M2)/2 = [log2(R1/G1)-log2(R2/G2)]/2= [log2R1-log2G1-(log2R2-log2G2)]/2 = [(log2R1-log2R2)+(log2G2-log2R1)]/2

= [log2(R1/R2) + log2(G2/G1)]/2 = (M1’+M2’)/2

Similarly for A = (A1+A2)/2 = ... = (A1’+A2’)/2. Now, if the red signals are equally biased, then the effects are hopefully cancelled-out. Same for the green channel.

Thus, dye-swap is averaging two virtual (hopefully) self-normalized slides.

Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then:

M = (M1-M2)/2, A = (A1+A2)/2

67 of 91

Median plate normalization

Will remove some of the intensity dependent effects...

...and some of the spatial artifacts.

68 of 91

Within-slide normalization summaryAlternatives:

1. No normalization

2. Median normalizationa) all spotsb) print-tipc) plate-group

3. Curve-fit normalizationa) Global curve-fit normalizationb) Print-tip curve-fit normalizationc) Scaled print-tip curve-fit normalization

4. Dye-swap normalization

69 of 91

Across-slides normalization

The within-slide normalization makes the red and the green signals comparable. What about signals from different arrays? Across-slide normalization attacks this problem.

Idea: Log-ratios (M values) from different slides should have equal variation.

Example: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.

70 of 91

Idea: Log-ratios (M values) from different slides should have equal variation.

True ratio is M’ij where j represents different spots and i represents different arrays. Observed is Mij, where Mij = ai * M’ij. Robust estimate of ai is

Corrected values are calculated as:

0 < ai < 1: how much spread slide i has compared to the average of the other slides.

(Conceptually we could have used the standard deviation instead, but it is not robust!)

aroma:normalizeAcrossSlides(ma)

Across-slides normalization cont.

71 of 91

Combining data from several slides

1. Within-slide normalization

2. Across-slides normalization

3. Averaging

+ more information

aroma:normalizeWithinSlide(ma, “p”); normalizeAcrossSlides(ma); tma <- as.TMAData(ma)

72 of 91

The average of all normalized slidesThe “average” slide:

aroma:plot(tma)

73 of 91

ReferencesNormalization:• Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA microarray data. In

Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering.

• Henrik Bengtsson and Ola Hössjer. Methodological study of affine transformations of gene expression data with proposed normalization method. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003.

• Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression . Bioinformatics, 1:1–9, 2002.

• Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002.

• Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003

• Henrik Bengtsson. aroma - An R Object-oriented Microarray Analysis environment. Preprint in Mathematical Sciences (manuscript in progress), Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2004. URL: http://www.braju.com/R/

74 of 91

The cDNA microarray technology-

PART IV: Identifying differentially expressed

genes1. Cut-off by M values2. The t-statistics and cut-off by T values3. Multiple testing and adjusted the p-values4. Validation

75 of 91

Average of all normalized slidesThe “average” slide:

aroma:plot(tma)

76 of 91

Cut-off by log-ratios (naive)Top 5% of the absolute M values:

aroma:plot(tma); highlight(tma, top(abs(tma$M), 0.05), col=“red”)

77 of 91

The t-statisticsIdea: For replicated data, i.e. multiple measurements of the same thing, we trust the estimate of the average (mean or median) more if the deviation (std.dev. or MAD) is small. If the deviation is large, we do not trust it that much.

• The T statistics down-weight the importance of the average if the deviation is large and vice versa;

T = mean(x) / SE(x)

where SE(x)=std.dev(x)/N (standard error of the mean)

•Example: The blue and the red genes have almost the same average log-ratio, but we are more confident with the measure of the blue gene since its variability across replicates is smaller.

aroma:tma <- as.TMAData(ma); summary(ma$T)

78 of 91

Cut-off by T values (better)

Top 5% of the absolute T values:

aroma:plot(tma); highlight(tma, top(abs(tma$T), 0.05), col=“blue”)

79 of 91

Where are the M values compared to the T values?

Top 5% of the absolute M values in red:

aroma:plot(tma); plot(tma, “TvsSE”); ...

80 of 91

ReferencesIdentification of differentially expressed genes:• Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical

methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000.

• M. Callow, S. Dudoit, E. Gong, T. Speed, and E. Rubin. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research, 10(12):2022–9, December 2000.

• Ingrid Lönnstedt and Terence P. Speed. Replicated microarray data. Statistical Sinica, 12(1), 2002.

81 of 91

Multiple testing• In a microarray experiment, each gene (each

probe or probe set) is really a separate experiment.

• Yet, if you treat each gene as an independent comparison, you will always find some with significant differences.

82 of 91

False positive and false negativeFalse Positive and False Negative: Statistical test vs. truth

If truth is T ≠ 0: If truth is T=0:

Statistical test decision: Reject H0: T=0.

Correctly reject H0 False Positive(Type I Error)

Statistical test decision:Do not reject H0: T=0.

False Negative(Type II Error)

Correctly not reject H0

In cDNA microarray experiments we commonly test the hypothesis H0 that T=0 against T≠0 (non-DE or not) for every gene separately. For the genes for which we reject H0, we say they are differentially expressed.

However, by chance we will reject H0 for some genes that are not DE. We call these findings false positive (Type I). Genes that are DE, but for which we do not reject H0 are called false negative (Type II).

83 of 91

False Discovery Rate• False Discovery Rate (FDR) is equal to the p-value

of the test, p = P(|T| > tα/2), times the number of genes in the array– For a p-value of 0.01·10,000 genes = 100 false DEs.– You cannot eliminate false positives, but by choosing a

more stringent p-value, you can keep them manageable (try p=0.001 10 false positives)

• The FDR must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values. It’s tricky, but at least you should be aware of it.

84 of 91

Multiple testing cont.

• Multiple testing pitfalls– thousands of tests, i.e. each gene is tested against

H0: T=0. By chance some will “fail”.– false positives problems more serious.– need to adjust p-values.

• Different adjustment procedures– Bonferroni, Sidak, Duncan, Holm, etc. Not discussed

here, but available automatically in the better microarray analysis software.

85 of 91

Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.

Terry Speed et al.

86 of 91

ReferencesMultiple testing and adjusted p-values:

• Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000.

• Y. Ge, S. Dudoit & T. P. Speed, Resampling-based multiple testing for microarray data hypothesis (submitted to Test, Spain). Technical Report #633 of UCB Statistics, 2003.

87 of 91

Validation of identified genes

88 of 91

Exploratory validation plots Were are the differentially expressed genes located?

Does it make sense? If there is a pattern, you might need to redo the normalization.

89 of 91

Biological verification1. Having a top 20 list of interesting genes, what biological

relevance do they have in this experiment? 2. Have you found interesting genes, but with very noisy

data? Do more experiments and see if you can replicate the results. Otherwise, you might just have found noise!

3. Verify interesting genes by other means such as Southern or Northern blots and quantative real-time PCR (QRT-PCR) or other microarray technologies etc.

90 of 91

Summary: Identification of DEs• You need replication and statistics to find real

differences.• Cutoff by log ratios is not enough/correct.• Cutoff by t-statistics is much better.• Multiple testing => must adjust the p values.• Validate your results

91 of 91

Take-home messages1. Good image analysis is essential. Some software are obsolete

and not that good. Commercial software like GenePix could be better (and hopefully will be soon).

2. Background correction or not is not solved. Progress has been done, but more research is needed.

3. Normalization is needed. We understand more now than two years ago.

4. Use at least the t-statistics to identify differentially expressed genes. Do not rely exclusively on log-ratios.

5. Multiple testing must be considered; adjust your p-values.

6. Talk to a statistician before doing the experiments! We do think about these kind of problems for a living.

top related