henrik bengtsson [email protected] mathematical statistics, centre for mathematical sciences,

91
Henrik Bengtsson [email protected] Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden Bioinformatics 2, Bioinformatics Centre, University of Copenhagen. February 19, 2004 Statistical analysis of Statistical analysis of microarray data microarray data

Upload: kailey

Post on 16-Mar-2016

43 views

Category:

Documents


0 download

DESCRIPTION

Statistical analysis of microarray data. Henrik Bengtsson [email protected] Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden Bioinformatics 2 , Bioinformatics Centre, University of Copenhagen. February 19, 2004. Outline. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

Henrik [email protected]

Mathematical Statistics, Centre for Mathematical Sciences,

Lund University, Sweden

Bioinformatics 2, Bioinformatics Centre, University of Copenhagen.

February 19, 2004

Statistical analysis of Statistical analysis of microarray datamicroarray data

Page 2: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

2 of 91

OutlinePart I – (Very short) Background

• Central Dogma of Biology• Idea behind the microarray technology• Experimental design

Part II - Printing, Hybridization, Scanning & Image Analysis• From clone to slide• From samples to hybridization• From scanning to raw data

Part III - Preprocessing of data (IMPORTANT)• Exploratory data analysis and log transforms• Background correction• Normalization

Part IV - Identifying differentially expressed genes• Cut-off by log-ratios values, the t-statistics and cut-off by T values• Multiple testing, adjusting the p-values• Validation

Page 3: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

3 of 91

The cDNA microarray technology-

PART I: (Very short) Background

1. The Central Dogma of Biology2. Idea behind the microarray technology3. Experimental design

Page 4: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

4 of 91

Idea of microarrays (and other gene expression techniques):

Measure the amount of mRNA to find genes that are expressed (active).

(measuring protein might be better, but is currently much harder)

The Central Dogma of Biology

Page 5: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

5 of 91

History

cDNA microarrays have evolved from Southern & Northern blots, with clone libraries gridded out on nylon membrane filters being an important and still widely used intermediate.

Things took off about 1995 with the introduction of non-porous solid supports, such as glass - these permitted miniaturization - and fluorescence based detection.

Terry Speed et al.

Page 6: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

6 of 91

The cDNA Microarray Technique

1. Put a large number of DNA sequences or synthetic DNA oligomers onto a glass slide- 5000-50000 gene expressions at the same time.

2. Measure amounts of cDNA bound to each spot3. Identify genes that behave differently in different cell

populations- tumor cells vs healthy cells- brain cells vs liver cells- same tissue different organisms

• Time series experiments- gene expressions over time after treatment, e.g. T=5,10,30,... min vs T=0.

• ...

Page 7: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

7 of 91

Experimental Design1. Experimental design is very important.

Example: Assume we have 8 control (C1, ... C8) and 8 reference (R1, ..., R8) samples, should we compare a) C1-R1, ..., C8-R8, or b) C1-R’, ..., C8-R’ where R’ is a pool of all R, or c) something else?

2. Wrong design can be very wrong and the correct can be very efficient and require less arrays (= less lab work and cheaper experiments!).

3. Statistical theory exists and is helpful. Recently, statisticans has started to look at the microarray design problem in more detail. Progress will for be made sure! Experience from previous setups is also of great value.

We will not discuss this problem anymore in this talk.

Page 8: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

8 of 91

The cDNA microarray technology-

PART II: Printing, Hybridization,

Scanning & Image Analysis

1. From clone to slide2. From samples to hybridization3. From scans to raw data

Page 9: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

9 of 91

Overview

scanning

data: (Rfg,Gfg,Rbg,Gbg, ...)

cDNA clones

PCR product amplificationpurification

printing

Hybridize

RNA

Test sample

cDNA

RNA

Reference sample

cDNA

excitationred lasergreen

laser

emission

overlay images

Production

Page 10: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

10 of 91

Printing / spotting

Arrayer (approx 100,000 EUR)

Page 11: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

11 of 91 J. Vallon-Christersson, Dept Oncology, Lund Univ.

Microarray slide preparation

Page 12: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

12 of 91

Terminology: probe and target• As defined in Nature 1999:

– The probes are the immobilized DNA sequences spotted on the array, i.e. spot, oligo, immobile substrate.

– The targets are the labeled cDNA sequences to be hybridized to the array, i.e. mobile substrate.

The opposite usage can also be seen in some references. However, think of probes as the measuring device (which you can buy), and the targets (that you provide) as what you want to measure.

Page 13: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

13 of 91

RNA extraction & hybridization

Hybridize

RNA

Tumor sample

cDNA

RNA

Reference sample

cDNA

1. Extract mRNA from samples2. Reverse transcription of mRNA to

cDNA in presence of aminoallyl-dUTP (aa-dUTP)

3. Label with Cy3 or Cy5 fluorescent dye (links to aa-dUTP)

4. Hybridize labeled cDNA to slide5. Wash slide

Figure: Hybridization chamber.

(probes)

(targets)

Preparation incl. array (approx 100-200 EUR/array)

Page 14: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

14 of 91

ReferencesOriginal cDNA microarray paper:• Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. Quantitative monitoring of gene

expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, October 1995.

General:• Mark Schena. Microarrays Analysis. John Wiley & Sons, Inc., Hoboken, New Jersey, 2003.

• David J. Duggan, Michael Bittner, Yidong Chen, and Paul Meltzer & Jeffrey M. Trent. Expression profiling using cDNA microarrays. Nature Genetics, 21(1 Supplement):10–14, January 1999.

Page 15: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

15 of 91

Scanning

Terry Speed et al.

Page 16: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

16 of 91Scanner (approx 50,000-100,000 EUR)

Software (approx 0-5,000 EUR)

Page 17: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

17 of 91

Two-channel scanningexcitation

red lasergreen

laser

emission

overlay images

higher frequency, more energy

lower frequency,

less energy

Page 18: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

18 of 91

Combined color image for visualization

Page 19: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

19 of 91

Image Analysis

Terry Speed et al.

Page 20: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

20 of 91

Signal quantification

1. AddressingLocate spot centers.

2. SegmentationClassification of pixels either as signal or background (using circles, seeded region growing or other).

3. Signal quantificationa) foreground estimatesb) background estimatesc) ... (shape, size etc)

Terry Speed et al.

Page 21: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

21 of 91

Does one size fit all?

Terry Speed et al.

Page 22: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

22 of 91

Segmentation and quantification of spot regions

Seeded region growing Fixed-sized or adaptive-sized circles

Inside the boundaries is foreground (the spots) and outside is the background. We estimate both...

Terry Speed et al.

Page 23: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

23 of 91

Robust signal estimates:mean vs median

• When a statistician talks about a robust measure he/she normally means a measure that is not sensitive to outliers, e.g.

x = (8, 85, 7, 9, 5, 4, 13, 6, 8)

– The mean of all x’s, i.e. (x1+x2+...+xK)/K, will be affected if one of the measures, say x2, is wrong;

mean(x) = 16.11

– The median of all x’s, i.e. the middle value of (x1+x2+...+xK), will not be affected if a “few” (< 50%) are wrong;

median(x) = 8.0

For this reason we prefer to use median instead of the mean in cases were we expect artifacts.

(If there are a lot of measurements and the errors are symmetrically distributed the median will give the same result as the mean without outliers.)

Page 24: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

24 of 91

Typical data output

Page 25: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

25 of 91

Summary

scanning

data: (Rfg,Gfg,Rbg,Gbg, ...)

cDNA clones

PCR product amplificationpurification

printing

Hybridize

RNA

Test sample

cDNA

RNA

Reference sample

cDNA

excitationred lasergreen

laser

emission

overlay images

Production

Page 26: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

26 of 91

ReferencesImage analysis:• Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of

methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000.

• Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.

Page 27: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

27 of 91

The cDNA microarray technology-

PART III: Preprocessing of data (IMPORTANT)

1. Exploratory Data Analysis and log-transforms

2. Background correction3. Normalization

Page 28: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

28 of 91

Exploratory Data Analysis

Terry Speed et al.

Page 29: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

29 of 91

Scatter plot: R vs G

“Observed” data {(R,G)}n=1..5184:

R = signal in red channel, G = signal in green channel

up-regulated genes

down-regulated genes

non-differentially expressed genes along the diagonal:

R = G

Most genes have low gene expression levels. What happens here?

Page 30: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

30 of 91

Scatter plot: log2R vs log2G

“Observed” data {(log2R,log2G)}n=1..5184:

R = signal in red channel, G = signal in green channel

up-regulated genes

down-regulated genes

non-differentially expressed genes are still along the diagonal:

log2R = log2G

Low gene expression levels are “blown up”.

Page 31: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

31 of 91

Scatter plot: M vs A (recommended)

up-regulated genes

down-regulated genes

non-differentially expressed genes are now along the horizontal line:

M = 0

log2R - log2G = 0

Transformed data {(M,A)}n=1..5184:

M = log2(R) - log2(G) (minus)

A = ½·[log2(R) + log2(G)] (add)

Note: M vs A is basically a rotation of the log2R vs log2G scatter plot.

Why: Now the quantity of interest, i.e. the fold change, is contained in one variable, namely M!

If M > 0, up-regulated.If M < 0, down-regulated.

Page 32: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

32 of 91

Details on M vs A

Log-ratios:

M = log2(R) – log2(G) = [logarithmic rules] = log2(R/G)

Average log-intensities:

A = ½·[log2(R) + log2(G)] = [logarithmic rules] = ½·log2(R·G)

There is a one-to-one relationship between (M,A) and (R,G):

R=(22A+M)1/2, G=(22A-M)1/2

Page 33: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

33 of 91

More on why log and why M vs A?• It makes the distribution

symmetric around zero

R:G M=log(R/G) R:G M=log(R/G) comment

1:1 0 20=1

2:1 +1 1:2 -1 21=2, 2-1=1/2

4:1 +2 1:4 -2 22=4, 2-2=1/4

8:1 +3 1:8 -3 23=8, 2-3=1/8

16:1 +4 1:16 -4 24=16, 2-4=1/16

32:1 +5 1:32 -5 25=32, 2-5=1/32

Before: After:

• Logs stretch out region we are most interested in and makes the distribution more normal. • Easier to see artifacts of

the data, .e.g. as intensity dependent variation and dye-bias.

• Log base 2 because the raw data is binary data (max intensity is 216-1 = 65535). It is also naturally to think of 2-, 4-, 8-fold etc up and down regulated genes. For the actual analysis, any log-base will do.

Page 34: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

34 of 91

M = log2(R/G) (log-ratio),

A = ½·log2(R·G) (log-intensity)

Summary of (R,G) (log2R,log2G) (M,A)

R = red channel signalG = green channel signal

R vs G log(R) vs log(G) M vs A

plot(rg, log=FALSE); plot(rg); ma <- as.MAData(rg); plot(ma)

Page 35: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

35 of 91

Density plots of all signals

log2R = red channel signallog2G = green channel signal

Example: Signals in the red channel seem to be slightly weaker than the signals in the green channel, compare with M vs A plot:

...more green.

plotDensity(rg)

Page 36: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

36 of 91

Spatial plot of log-ratios (M values)

Print-tip 1

Print-tip 16

Print-tip 8

Print-tip 9

plotSpatial(ma)

Page 37: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

37 of 91

Print-tip box plot of log-ratios1

16

boxplot(ma, groupBy=“printtip”)

Page 38: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

38 of 91

Printing order of spots1

16

6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...

Hmm... why the horizontal stripes?

Page 39: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

39 of 91

Print-order plot of log-ratiosThe spots are order according to when they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on all arrays.

plotPrintorder(ma)

Page 40: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

40 of 91

Print-dip plot of log-ratiosAverage values of the 16 log-ratios at each dip from each of the 399 print turns.

plotDiporder(ma)

Page 41: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

41 of 91

ReferencesExploratory data analysis for microarrays:• Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA

microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering.

• Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002.

• Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003

Page 42: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

42 of 91

Background correction and background estimates revisited...

Terry Speed et al.

Page 43: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

43 of 91

Alt. 1: Local background estimatesOne for the red and one for the green channel.

A. Bengtsson, Lund University, 2003.

Axon GenePix Pro- size of bg region depends on sizeof the spot. Give too much variation!

Fixed-size regions- same number of background pixelsgive same variation for all estimate.

Page 44: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

44 of 91

Alt. 2: Filtered background estimatesOne for the red and one for the green channel.

A. Bengtsson, Lund University, 2003.

In general, filter by moving a window (mask) over the image:

1) At each position replace the value in the center (red dot) by, say, the median pixel intensity of all spots in the window.

2) Move to next pixel and so on.

3) Result:

(more sophisticated filters than the median filter are used in practice)

Idea: With filters we do not have to know the where the spots are. More robust.

Also known as: Morphological filters

Page 45: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

45 of 91

• The foreground region of the spots are assumed to be contaminated with background noise/signal.

• Sources of background: probe unspecifically sticking on slide, irregular / dirty slide surface, dust, noise in the scanner measurement.

• From the image analysis we have foreground and background estimates for each spot in both channels:

(Rfg, Rbg, Gfg, Gbg)

• For each spot on the slide we do background subtraction

Red intensity: R = Rfg – Rbg

Green intensity: G = Gfg - Gbg

• Not considered: real cross-hybridisation and unspecific hybridisation to the probe. Still not known if this is a real problem.

Background subtraction

Terry Speed et al.

Example of an extremely bad array.

Page 46: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

46 of 91

Which background estimate should we use?

Axon GenePix Pro- size of bg region depends on sizeof the spot. Give too much variation! Fixed-size regions

- same number of background pixelsgive same variation for all estimate.

Filtered estimates- no spot segmentation.

Fixed-sized are better than variable-sized region methods, but filtered estimates may be even better. Should we do background correction at all? (A. Bengtsson, 2003) is a good start, but more research is needed.

Page 47: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

47 of 91

Background correction or not?non-background subtracted background subtracted

-seems better, but...

M = log2(Rfg/Gfg) Mbg = log2({Rfg-Rbg} / {Gfg-Gbg})

Page 48: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

48 of 91

Background problem still not solved!

GenePix background:

Spotbackground:

(morphological opening)

Foreground & background Background subtracted

-Still curvature left.Too little back-

ground correction?!

-Curvature in the other direction now!

Too much back-ground correction?!

Page 49: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

49 of 91

ReferencesBackground estimation and correction:• Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for

image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000.

• Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December 2003. 2003:E40.

• Henrik Bengtsson, Göran Jönsson, and Johan Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003.

• Charles Kooperberg, Thomas G. Fazzio, Jeffrey J. Delrow, and Thoshio Tsukiyama. Improved background correction for spotted DNA microarrays. Journal of Computational Biology, 9:55–66, 2002.

Page 50: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

50 of 91

Normalization

Terry Speed et al.

Page 51: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

51 of 91

Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0.

Idea: Do various exploratory plots to see if this assumption is met. For example, M vs A, spatial plots, density & boxplots plots, print-order plots etc.

Result: We commonly observe something else:Measured value = real value + systematic errors + noise

Correction: If so, normalize the data such that the expectations are met:Corrected value = real value + systematic errors + noise

Normalization

Page 52: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

52 of 91

Sources of systematic effects, but also noise and natural variability

• Biological variability • RNA extraction• Probe labeling (dye differences)• Printing (print-order, plate-order, clone variation, etc.)• Hybridization (temperature, time, mixing, etc.)• Human (variation between lab researchers)• Scanning (laser and detector, chemistry of the fluorescent label)• Image analysis (identification, quantification, background methods etc.)

Page 53: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

53 of 91

Examples of practical problems that can lead to artifacts in data

• Surface chemistry: uneven surface may lead to high background.

• Dipping the pin into large volume. It helps to pre-printing on a dummy array to drain off excess sample.

• Spot variation can be due to mechanical difference between pins. Pins could be clogged during the printing process.

• Spot size and density depends on surface and solution properties.

• Pins need good washing between samples to prevent sample carryover.

• Dust while printing and while hybridizing the arrays.• ... and much more!

Page 54: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

54 of 91

Within-slide normalizationCurrent methods widely used today:

1. No normalization2. Median normalization

Shift of log-ratios.

3. Curve-fit normalization methods.A.k.a. lo(w)ess normalization.

4. Dye-swap normalizationActually an experimental design.

Page 55: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

55 of 91

Within-slide normalization cont.The median, the curve-fit and the dye-swap normalization and others, can be applied to different groups of spots:

1. all spots2. print-tip groups3. plate groups4. ...but also1. spatially2. on spots ordered in print order3. ...

Page 56: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

56 of 91

No normalization – raw dataNon-normalized data {(M,A)}n=1..N:

M = log2(R/G)

aroma:´plot(ma, slide=2)

Page 57: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

57 of 91

Global median normalization

M’ = M – median(M)

Assumption: Changes roughly symmetric.

aroma:normalizeWithinSlide(ma, method=“m”)

Page 58: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

58 of 91

Global curve-fit normalizationGlobal curve-fit normalized data {(M,A)}n=1..N:

Mnorm = M - c(A)

where c(A) is an intensity dependent function (loess, splines, ...).

aroma:plot(ma); lowessCurve(ma)

Page 59: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

59 of 91

Global curve-fit normalization cont.

(Biased towards the green channel & intensity dependent artifacts)

aroma:

plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma)

curve-fit normalization done for

all spots together

Page 60: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

60 of 91

Print-tip curve-fit normalizationPrint-tip curve-fit normalized data {(M,A)}n=1..N:

Mp,norm = Mp - cp(A); p = print tip 1-16

where cp(A) is an intensity dependent function for print tip p.

aroma:plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)

Page 61: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

61 of 91

Print-tip curve-fit normalization cont.

aroma:

normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma)

curve-fit normalization done for each print-tip group

separately

Page 62: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

62 of 91

Scaled print-tip curve-fit normalizationScaled print-tip curve-fit normalized data {(M,A)}n=1..N:

Mp,norm = sp·(Mp - cp(A)); p=print tip 1-16

where sp is a scale factor for print tip p (Median Absolute Deviation).After print-tip normalization After scaled print-tip normalization

aroma:

normalizeWithinSlide(ma, “p”); normalizeWithinSlide(ma, “s”)

curve-fit normalization+ rescaling

done for each print-tip group

separately

Page 63: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

63 of 91

Spatial artifacts after normalization

No normalization Global curve-fit normalization

Print-tip curve-fit normalization Scaled print-tip curve-fit normalization

Page 64: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

64 of 91

Another example

Scaled print-tip curve-fit normalization:

Page 65: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

65 of 91

Dye-swap normalizationIdea: We have observed dye-dependent biases. What if we do the same hybridization a second time, but with reversed labeling of the dyes? Any dye-effects will hopefully cancel out, i.e. self-normalize, if the dyes are swapped and the signals are averaged.Requires: Paired arrays, i.e. 2, 4, 6, ... arrays.

Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then:

M = (M1-M2)/2, A = (A1+A2)/2

Page 66: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

66 of 91

Dye-swap normalization cont.

What is done? Let {(R1,G1)} and {(R2,G2)} be the corresponding red and green signals from array 1 and 2. Then

M = (M1-M2)/2 = [log2(R1/G1)-log2(R2/G2)]/2= [log2R1-log2G1-(log2R2-log2G2)]/2 = [(log2R1-log2R2)+(log2G2-log2R1)]/2

= [log2(R1/R2) + log2(G2/G1)]/2 = (M1’+M2’)/2

Similarly for A = (A1+A2)/2 = ... = (A1’+A2’)/2. Now, if the red signals are equally biased, then the effects are hopefully cancelled-out. Same for the green channel.

Thus, dye-swap is averaging two virtual (hopefully) self-normalized slides.

Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then:

M = (M1-M2)/2, A = (A1+A2)/2

Page 67: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

67 of 91

Median plate normalization

Will remove some of the intensity dependent effects...

...and some of the spatial artifacts.

Page 68: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

68 of 91

Within-slide normalization summaryAlternatives:

1. No normalization

2. Median normalizationa) all spotsb) print-tipc) plate-group

3. Curve-fit normalizationa) Global curve-fit normalizationb) Print-tip curve-fit normalizationc) Scaled print-tip curve-fit normalization

4. Dye-swap normalization

Page 69: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

69 of 91

Across-slides normalization

The within-slide normalization makes the red and the green signals comparable. What about signals from different arrays? Across-slide normalization attacks this problem.

Idea: Log-ratios (M values) from different slides should have equal variation.

Example: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.

Page 70: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

70 of 91

Idea: Log-ratios (M values) from different slides should have equal variation.

True ratio is M’ij where j represents different spots and i represents different arrays. Observed is Mij, where Mij = ai * M’ij. Robust estimate of ai is

Corrected values are calculated as:

0 < ai < 1: how much spread slide i has compared to the average of the other slides.

(Conceptually we could have used the standard deviation instead, but it is not robust!)

aroma:normalizeAcrossSlides(ma)

Across-slides normalization cont.

Page 71: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

71 of 91

Combining data from several slides

1. Within-slide normalization

2. Across-slides normalization

3. Averaging

+ more information

aroma:normalizeWithinSlide(ma, “p”); normalizeAcrossSlides(ma); tma <- as.TMAData(ma)

Page 72: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

72 of 91

The average of all normalized slidesThe “average” slide:

aroma:plot(tma)

Page 73: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

73 of 91

ReferencesNormalization:• Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA microarray data. In

Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June 2001. The International Society for Optical Engineering.

• Henrik Bengtsson and Ola Hössjer. Methodological study of affine transformations of gene expression data with proposed normalization method. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003.

• Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression . Bioinformatics, 1:1–9, 2002.

• Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002.

• Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003

• Henrik Bengtsson. aroma - An R Object-oriented Microarray Analysis environment. Preprint in Mathematical Sciences (manuscript in progress), Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2004. URL: http://www.braju.com/R/

Page 74: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

74 of 91

The cDNA microarray technology-

PART IV: Identifying differentially expressed

genes1. Cut-off by M values2. The t-statistics and cut-off by T values3. Multiple testing and adjusted the p-values4. Validation

Page 75: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

75 of 91

Average of all normalized slidesThe “average” slide:

aroma:plot(tma)

Page 76: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

76 of 91

Cut-off by log-ratios (naive)Top 5% of the absolute M values:

aroma:plot(tma); highlight(tma, top(abs(tma$M), 0.05), col=“red”)

Page 77: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

77 of 91

The t-statisticsIdea: For replicated data, i.e. multiple measurements of the same thing, we trust the estimate of the average (mean or median) more if the deviation (std.dev. or MAD) is small. If the deviation is large, we do not trust it that much.

• The T statistics down-weight the importance of the average if the deviation is large and vice versa;

T = mean(x) / SE(x)

where SE(x)=std.dev(x)/N (standard error of the mean)

•Example: The blue and the red genes have almost the same average log-ratio, but we are more confident with the measure of the blue gene since its variability across replicates is smaller.

aroma:tma <- as.TMAData(ma); summary(ma$T)

Page 78: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

78 of 91

Cut-off by T values (better)

Top 5% of the absolute T values:

aroma:plot(tma); highlight(tma, top(abs(tma$T), 0.05), col=“blue”)

Page 79: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

79 of 91

Where are the M values compared to the T values?

Top 5% of the absolute M values in red:

aroma:plot(tma); plot(tma, “TvsSE”); ...

Page 80: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

80 of 91

ReferencesIdentification of differentially expressed genes:• Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical

methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000.

• M. Callow, S. Dudoit, E. Gong, T. Speed, and E. Rubin. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research, 10(12):2022–9, December 2000.

• Ingrid Lönnstedt and Terence P. Speed. Replicated microarray data. Statistical Sinica, 12(1), 2002.

Page 81: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

81 of 91

Multiple testing• In a microarray experiment, each gene (each

probe or probe set) is really a separate experiment.

• Yet, if you treat each gene as an independent comparison, you will always find some with significant differences.

Page 82: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

82 of 91

False positive and false negativeFalse Positive and False Negative: Statistical test vs. truth

If truth is T ≠ 0: If truth is T=0:

Statistical test decision: Reject H0: T=0.

Correctly reject H0 False Positive(Type I Error)

Statistical test decision:Do not reject H0: T=0.

False Negative(Type II Error)

Correctly not reject H0

In cDNA microarray experiments we commonly test the hypothesis H0 that T=0 against T≠0 (non-DE or not) for every gene separately. For the genes for which we reject H0, we say they are differentially expressed.

However, by chance we will reject H0 for some genes that are not DE. We call these findings false positive (Type I). Genes that are DE, but for which we do not reject H0 are called false negative (Type II).

Page 83: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

83 of 91

False Discovery Rate• False Discovery Rate (FDR) is equal to the p-value

of the test, p = P(|T| > tα/2), times the number of genes in the array– For a p-value of 0.01·10,000 genes = 100 false DEs.– You cannot eliminate false positives, but by choosing a

more stringent p-value, you can keep them manageable (try p=0.001 10 false positives)

• The FDR must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values. It’s tricky, but at least you should be aware of it.

Page 84: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

84 of 91

Multiple testing cont.

• Multiple testing pitfalls– thousands of tests, i.e. each gene is tested against

H0: T=0. By chance some will “fail”.– false positives problems more serious.– need to adjust p-values.

• Different adjustment procedures– Bonferroni, Sidak, Duncan, Holm, etc. Not discussed

here, but available automatically in the better microarray analysis software.

Page 85: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

85 of 91

Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.

Terry Speed et al.

Page 86: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

86 of 91

ReferencesMultiple testing and adjusted p-values:

• Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000.

• Y. Ge, S. Dudoit & T. P. Speed, Resampling-based multiple testing for microarray data hypothesis (submitted to Test, Spain). Technical Report #633 of UCB Statistics, 2003.

Page 87: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

87 of 91

Validation of identified genes

Page 88: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

88 of 91

Exploratory validation plots Were are the differentially expressed genes located?

Does it make sense? If there is a pattern, you might need to redo the normalization.

Page 89: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

89 of 91

Biological verification1. Having a top 20 list of interesting genes, what biological

relevance do they have in this experiment? 2. Have you found interesting genes, but with very noisy

data? Do more experiments and see if you can replicate the results. Otherwise, you might just have found noise!

3. Verify interesting genes by other means such as Southern or Northern blots and quantative real-time PCR (QRT-PCR) or other microarray technologies etc.

Page 90: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

90 of 91

Summary: Identification of DEs• You need replication and statistics to find real

differences.• Cutoff by log ratios is not enough/correct.• Cutoff by t-statistics is much better.• Multiple testing => must adjust the p values.• Validate your results

Page 91: Henrik Bengtsson hb@maths.lth.se Mathematical Statistics,  Centre for Mathematical Sciences,

91 of 91

Take-home messages1. Good image analysis is essential. Some software are obsolete

and not that good. Commercial software like GenePix could be better (and hopefully will be soon).

2. Background correction or not is not solved. Progress has been done, but more research is needed.

3. Normalization is needed. We understand more now than two years ago.

4. Use at least the t-statistics to identify differentially expressed genes. Do not rely exclusively on log-ratios.

5. Multiple testing must be considered; adjust your p-values.

6. Talk to a statistician before doing the experiments! We do think about these kind of problems for a living.