MIDAS: Microarray Data Analysis System (version 2.19)

Wei Liang

October 2004

Microarray Data Flow

[Figure: data-flow diagram. Labels: Printer, Scanner, .tiff Image File, Image Analysis, Raw Gene Expression Data, Normalization / Filtering, Normalized Data with Gene Annotation, Expression Analysis, Interpretation of Analysis Results, Data Entry / Management, Gene Annotation, and the databases MAD, AGED, and others.]

MIDAS is a normalization and filtering tool for microarray data analysis.

It serves as a data pre-processor for clustering analysis (MeV).

Why Normalization and Filtering?

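[Figure: two-color cDNA microarray experiment. Sample1 mRNA and Sample2 mRNA are reverse-transcribed (RT) into Cy3- and Cy5-labeled cDNA and hybridized to the cDNA array; scanning gives Cy3 and Cy5 intensities, and the .tiff image files are processed into a raw data file.]

Systematic experimental error:

• Wavelength dependent

• Intensity dependent

• Uneven hybridization gel

• Print-tip variations

• Background variations

• Image processing algorithm-dependent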

Why Normalization and Filtering?

• The hypothesis underlying microarray analysis is that the measured intensity for each arrayed gene represents its relative expression level.

• We use these intensities to identify biologically relevant patterns of expression by comparing measured levels between states on a gene-by-gene basis.

• However, before the levels can be appropriately compared, one generally performs a number of transformations on the data to eliminate questionable or low-quality data, to adjust the measured intensities to facilitate comparisons, and to select those genes that are significantly differentially expressed.

MIDAS data analysis methods

• 8 normalization/transformation methods

Total Intensity normalization

• 10 quality control filtering methods

Invalid-intensity checking

LOWESS (Locfit) normalization

Iterative linear regression normalization

Iterative log mean centering normalization

Ratio Statistics normalization

Low intensity filter

Standard deviation regularization

Slice analysis (non-statistical)

In-slide replicates analysis

Flip-dye consistency checking

Ratio Statistics confidence interval checking

Signal/Noise checking

Cross-file-trim

Spot QC flag checking

MA-ANOVA

Cross-slide replicates t-test (statistical)

Cross-slide one-class SAM (statistical)

• 3 significant genes identification methods

Graphical scripting language

• Read input files

• Define the analysis pipeline and set parameters for each analysis module

• Write output files


Sample data

Pair #   1st file name       2nd file name
1        NFE005d0001.mev     NFE005d00020.mev
…
11       NFE005d00010.mev    NFE005d00029.mev
12       NFE005d00011.mev    NFE005d00030.mev
13       NFE005d00012.mev    NFE005d00031.mev
14       NFE005d00013.mev    NFE005d00032.mev
15       NFE005d00014.mev    NFE005d00033.mev
16       NFE005d00015.mev    NFE005d00034.mev
17       NFE005d00016.mev    NFE005d00035.mev
18       NFE005d00017.mev    NFE005d00036.mev
19       NFE005d00018.mev    NFE005d00037.mev
20       NFE005d00019.mev    NFE005d00038.mev


R-I plot: logRatio vs. logIntensityProduct

[Figure, panel A: R-I plot of the data before normalization. x-axis: log(Cy3*Cy5), 7 to 14; y-axis: log ratio, -3 to 3; SD = 0.346]

• Observations

1. Tilted tails at the low-intensity and high-intensity ends

2. Mean not centered at 0 (intensity dependent)
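The two coordinates of an R-I plot can be computed per spot as in the following minimal sketch. The function name `ri_coordinates` and the sample intensities are made up for illustration; the log bases (log10 for the intensity product, log2 for the ratio) follow the axis descriptions used elsewhere in these slides.

```python
import math

def ri_coordinates(cy3, cy5):
    """Return (log10(Cy3*Cy5), log2(Cy5/Cy3)) for one spot."""
    if cy3 <= 0 or cy5 <= 0:
        raise ValueError("intensities must be positive")
    return math.log10(cy3 * cy5), math.log2(cy5 / cy3)

# Hypothetical (Cy3, Cy5) intensity pairs, not taken from the sample data.
spots = [(1000.0, 1000.0), (500.0, 2000.0), (4000.0, 1000.0)]
points = [ri_coordinates(cy3, cy5) for cy3, cy5 in spots]
```

A spot with equal Cy3 and Cy5 intensities lands on the horizontal line at 0; spots above the line have more Cy5 signal, spots below it more Cy3 signal.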


[Figure, panel A: the same R-I plot with Gene X marked; its offset from 0 is split into an "Exp factor" part and a "Bio factor" part. SD = 0.346]

• If a gene is equally expressed in the Cy3 and Cy5 samples, log2(Cy5/Cy3) = 0.

• Two factors contribute to the apparent up-regulation of gene X:

1. Biological factors (which we are interested in)

2. Experimental factors, e.g. different sensitivity to the red and green lasers (which we are NOT interested in and want to remove)

We need a way to extract the experimental factors.

Approach: Assume that similar experimental factors apply to genes that are close to each other in the logProd-logRatio plot. Predict the Exp factor from a group of locally neighboring data points, which is equivalent to a curve-fitting problem.



• Local linear regression model

• Tri-cube weight function

• Least Squares

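Estimated values of log2(Cy5/Cy3) as a function of log10(Cy3*Cy5):

\[ x_i = \log_{10}(Cy3_i \cdot Cy5_i), \qquad y_i = \log_2(Cy5_i / Cy3_i) \]

\[ \min_{\beta_0,\,\beta_1} \sum_i w_i(x)\,\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2 \]

\[ \hat{\beta} = (X'WX)^{-1} X'WY, \qquad y(x) = \hat{\beta}_0 + \hat{\beta}_1 x \]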

[Figure, panel A: R-I plot, log ratio vs. log(Cy3*Cy5); SD = 0.346]


Use the estimated curve y(xi) to correct raw data

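[Figure, panel A: R-I plot with Gene X marked; the fitted value y(xi) is labeled "Exp factor" and the remainder "Bio factor". SD = 0.346]

\[ \log_2(R_i'/G_i') = \log_2(R_i/G_i) - y(x_i) \]

\[ \log_2(R_i'/G_i') = \log_2(R_i/G_i) - \log_2 2^{y(x_i)} \]

\[ \log_2(R_i'/G_i') = \log_2\!\left( \frac{R_i}{G_i} \cdot \frac{1}{2^{y(x_i)}} \right) \]

\[ R_i' = R_i, \qquad G_i' = G_i \cdot 2^{y(x_i)} \]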
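The fit-and-correct steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Locfit-based implementation MIDAS uses: the function names, the `frac` smoothing parameter, and the nearest-neighbor bandwidth rule are assumptions. The tri-cube weight is the standard form (1 - |u|^3)^3 for |u| < 1, and Cy5 is taken as R and Cy3 as G.

```python
import math

def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def lowess_fit(x, y, frac=0.5):
    """Weighted local linear fit; returns the fitted y at every x."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # points per local window (assumed rule)
    fitted = []
    for x0 in x:
        # Bandwidth: distance to the k-th nearest point (x0 itself counts).
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [tricube((xi - x0) / h) for xi in x]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted

def lowess_correct(cy3, cy5, frac=0.5):
    """Apply Ri' = Ri, Gi' = Gi * 2**y(xi) with R = Cy5 and G = Cy3."""
    x = [math.log10(g * r) for g, r in zip(cy3, cy5)]
    y = [math.log2(r / g) for g, r in zip(cy3, cy5)]
    y_fit = lowess_fit(x, y, frac)
    new_cy3 = [g * 2.0 ** yf for g, yf in zip(cy3, y_fit)]
    return new_cy3, list(cy5)
```

On data where every spot carries the same dye bias, the fitted curve is flat and the correction removes the bias completely; on real data the curve follows the intensity-dependent trend and only that trend is subtracted.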


LOWESS-corrected R-I plot

[Figure, panel B: R-I plot after LOWESS correction. x-axis: log(Cy5*Cy3), 7 to 14; y-axis: log ratio, -3 to 3; SD labels: 0.346 and 0.338]


Standard deviation regularization (SD-Reg)

Assumption: Within each block and each slide, spots should have the same spread of log2(Cy5/Cy3) values.

SD-Reg scales the (Cy3, Cy5) intensity pair for each spot so that the set of spots within each block or slide has the same standard deviation as the other blocks or slides.


• Let a_ij be the raw log ratio for the jth spot in the ith block (or slide), and a'_ij the scaled log ratio for the same spot:

\[ a_{ij} = \log_2 \frac{Cy5_{ij}}{Cy3_{ij}} \]

\[ a'_{ij} = \frac{a_{ij}}{\sqrt{\dfrac{\sum_{j=1}^{N_i} (a_{ij} - \bar{a}_i)^2}{\left[ \prod_{k=1}^{M} \sum_{j=1}^{N_k} (a_{kj} - \bar{a}_k)^2 \right]^{1/M}}}} \]

where N_i denotes the number of genes in the ith block (or ith slide), M denotes the number of blocks (or slides), and \bar{a}_i denotes the log ratio mean of the ith block (or ith slide).
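The scaling idea can be illustrated with a short sketch that works on log ratios only. It is an assumption-laden simplification: the function name `sd_regularize` is hypothetical, the population standard deviation is used for each block, and the common target spread is taken to be the geometric mean of the block standard deviations. How the scaled log ratio is mapped back onto the (Cy3, Cy5) pair is not shown here.

```python
import math
import statistics

def sd_regularize(blocks):
    """Scale each block's log2 ratios so that all blocks share one SD.

    blocks: one list of log2(Cy5/Cy3) values per block (or slide).
    """
    sds = [statistics.pstdev(block) for block in blocks]
    if any(sd == 0 for sd in sds):
        raise ValueError("every block needs a non-zero spread")
    # Geometric mean of the block SDs: the common target spread (assumed).
    target = math.exp(sum(math.log(sd) for sd in sds) / len(sds))
    return [[a / (sd / target) for a in block]
            for block, sd in zip(blocks, sds)]

# Two hypothetical blocks; the second is twice as spread out as the first.
scaled = sd_regularize([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0]])
```

After scaling, both blocks have the same standard deviation, which lies between the two original values.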

Flip dye replicates consistency filter

• The intensities in the file pair are flipped, i.e. R1/G1 ~ G2/R2, or R1 ~ G2 and G1 ~ R2

[Figure: flip-dye file pair. Columns G1, R1 (file 1) and G2, R2 (file 2); rows Gene1 through Gene8]

• Flip dye experiments help reduce random error

Flip dye replicates consistency filter

• Calculate expression levels for all genes in the flip-dye pair

• Filter out genes with inconsistent expression levels between flip-dye replicates

• For genes that pass the consistency check, take the geometric mean of the corresponding intensities from the replicated pair

How is consistency measured between replicates?

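For each gene, File 1 provides (R1, G1) and File 2 provides (R2, G2). For a flip-dye pair:

\[ \frac{R_1}{G_1} = \frac{G_2}{R_2} \quad\Longleftrightarrow\quad \frac{R_1 R_2}{G_1 G_2} = 1 \]

100% consistency:

\[ \log_2 \frac{R_1 / G_1}{G_2 / R_2} = \log_2 \frac{R_1 R_2}{G_1 G_2} = 0 \]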

• SD cut vs. Threshold cut

SD cut: Regardless of the dataset, always cuts the same percentage of genes for the same SD cutoff.

Threshold cut: The percentage cut depends on the specified log-ratio consistency range, for example

\[ -1 < \log_2 \frac{R_1 R_2}{G_1 G_2} < 1 \]

which is equivalent to

\[ \frac{1}{2} < \frac{R_1 R_2}{G_1 G_2} < 2 \]
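The consistency measure and the two cut modes can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `mode`, `log_range`, and `n_sd` parameters, and the mean ± n_sd × SD window for the SD cut are hypothetical, and the merged intensities pair R1 with G2 and G1 with R2 because those are the corresponding intensities in a flip-dye pair.

```python
import math
import statistics

def flip_dye_filter(pairs, mode="threshold", log_range=1.0, n_sd=2.0):
    """Check flip-dye consistency and merge the consistent genes.

    pairs: list of (R1, G1, R2, G2) tuples, one per gene.
    Returns one entry per gene: a merged intensity pair, or None if filtered.
    """
    scores = [math.log2((r1 * r2) / (g1 * g2)) for r1, g1, r2, g2 in pairs]
    if mode == "threshold":
        # Fixed window, e.g. -1 < log2(R1*R2 / (G1*G2)) < 1.
        low, high = -log_range, log_range
    else:
        # SD cut (assumed form): window of n_sd standard deviations
        # around the mean score of this dataset.
        mean = statistics.mean(scores)
        sd = statistics.pstdev(scores)
        low, high = mean - n_sd * sd, mean + n_sd * sd
    merged = []
    for (r1, g1, r2, g2), score in zip(pairs, scores):
        if low < score < high:
            # Geometric mean of corresponding intensities (R1 ~ G2, G1 ~ R2).
            merged.append((math.sqrt(r1 * g2), math.sqrt(g1 * r2)))
        else:
            merged.append(None)
    return merged

# Hypothetical genes: the first is perfectly consistent, the second is not.
result = flip_dye_filter([(2000.0, 1000.0, 1000.0, 2000.0),
                          (2000.0, 1000.0, 2000.0, 1000.0)])
```

The threshold mode keeps a gene only when its score falls inside a fixed range, so the fraction removed varies between datasets; the SD mode derives the range from the score distribution of the dataset itself.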


Slice Analysis filter

• Remove genes with Z-scores outside a range of interest

[Figure, panel B: R-I plot. x-axis: log(Cy5*Cy3), 7 to 14; y-axis: log ratio, -3 to 3; SD labels: 0.346 and 0.338]

• Define a slice window

• Slide the window along the log(IntensityProduct) axis

• Calculate logRatioMean and logRatioSD of the data points within each slice window

• Calculate the Z-score of each data point:

Z-score = (logRatio - logRatioMean) / logRatioSD

• Trim data with Z-scores outside the range of interest
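The sliding-window steps above can be sketched as follows. The window definition is an assumption: here each point's slice is the `window` points nearest to it in intensity order (truncated at the ends), and the population standard deviation is used. The function names and the `z_max` cutoff are hypothetical.

```python
import statistics

def slice_zscores(log_prod, log_ratio, window=11):
    """Z-score of each point relative to its local slice window."""
    n = len(log_prod)
    order = sorted(range(n), key=lambda i: log_prod[i])
    half = window // 2
    z = [0.0] * n
    for pos, i in enumerate(order):
        # Slice: neighbors in intensity order, truncated at the ends.
        lo, hi = max(0, pos - half), min(n, pos + half + 1)
        values = [log_ratio[order[k]] for k in range(lo, hi)]
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        z[i] = (log_ratio[i] - mean) / sd if sd > 0 else 0.0
    return z

def slice_filter(log_prod, log_ratio, window=11, z_max=2.0):
    """Return the indices of points whose |Z-score| is within z_max."""
    z = slice_zscores(log_prod, log_ratio, window)
    return [i for i, zi in enumerate(z) if abs(zi) <= z_max]

# Hypothetical data: small alternating log ratios with one outlier.
log_prod = [7.0 + 0.3 * i for i in range(21)]
log_ratio = [0.1 if i % 2 == 0 else -0.1 for i in range(21)]
log_ratio[10] = 5.0
kept = slice_filter(log_prod, log_ratio)
```

Because the mean and SD are local to each slice, a log ratio that is ordinary at low intensity can still be flagged at high intensity, where the spread is usually narrower.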

[Figures: two plots of log2(Cy5/Cy3) vs. log(Cy3*Cy5), with the x-axis running from 7 to 14; the y-axis runs from -4 to 4 in the first plot and from -8 to 8 in the second]

Analysis packaging

myAnalysis.prj

MIDAS graphing

• R-I plot (.prc)

• Box plot (.box)

• FlipDye Diagnostic plot (.rrc)

• Intensity plot (.ity, .lty)

• Z-score Distribution plot (.his)

• SAM plot (.sam)

MIDAS data viewer

Statistically significant gene identification methods

Two methods implemented in this release of MIDAS:

• Cross-slide replicates one-class T-test

• Cross-slide replicates one-class SAM
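For the one-class t-test, the statistic for a single gene can be sketched as below. The function name and the default reference mean `mu0 = 0` are assumptions; the sketch stops at the t statistic, since turning it into a p-value needs the t distribution with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
import math
import statistics

def one_class_t(log_ratios, mu0=0.0):
    """t statistic for H0: mean log ratio across slides equals mu0."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("need at least two replicate slides")
    sd = statistics.stdev(log_ratios)  # sample SD (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("zero variance across replicates")
    return (statistics.mean(log_ratios) - mu0) / (sd / math.sqrt(n))

# Hypothetical log2 ratios of one gene on five replicate slides.
t_stat = one_class_t([1.0, 1.2, 0.8, 1.1, 0.9])
```

A large |t| means the gene's mean log ratio is far from the reference value relative to its slide-to-slide variability.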

SAM (Significance Analysis of Microarrays)

A statistical technique for finding significant genes in a set of microarray experiments.

Reference:

Tusher, V.G., R. Tibshirani and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences USA 98: 5116-5121.

Designs:

• two-class unpaired

• two-class paired

• multi-class unpaired

• censored survival

• one-class (available in this release)


One-class SAM:

Identify genes whose mean expression across experiments is different from a user-specified mean.

• Assign a score (d) to each gene based on its change in expression relative to the standard deviation of repeated measurements for that gene

• Genes with scores greater than a threshold (Δ) are deemed potentially significant

• For these potentially significant genes, estimate the False Discovery Rate (FDR): the proportion of them likely to have been identified by chance

• The goal is to pick a set of differentially expressed genes with an FDR acceptable to the user
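The scoring and FDR steps listed above can be sketched as follows. This is a simplified illustration, not the published SAM procedure: the cutoff `delta` is applied directly to |d|, as in the bullet list, whereas SAM itself compares ordered observed scores with their expected values under permutation. The function names, the fudge constant `s0`, and the use of all sign flips as the null distribution are assumptions.

```python
import itertools
import math
import statistics

def d_scores(data, mu0=0.0, s0=0.1):
    """Score per gene: (mean - mu0) / (standard error + s0)."""
    scores = []
    for values in data:
        se = statistics.stdev(values) / math.sqrt(len(values))
        scores.append((statistics.mean(values) - mu0) / (se + s0))
    return scores

def estimate_fdr(data, delta, mu0=0.0, s0=0.1):
    """Return (genes called, estimated FDR) for the cutoff |d| > delta."""
    n = len(data[0])
    observed = d_scores(data, mu0, s0)
    called = sum(1 for d in observed if abs(d) > delta)
    if called == 0:
        return 0, None
    false_counts = []
    # Null distribution: flip the sign of each slide's deviation from mu0.
    for signs in itertools.product((1, -1), repeat=n):
        flipped = [[mu0 + s * (v - mu0) for s, v in zip(signs, values)]
                   for values in data]
        perm = d_scores(flipped, mu0, s0)
        false_counts.append(sum(1 for d in perm if abs(d) > delta))
    return called, min(1.0, statistics.median(false_counts) / called)

# Hypothetical log2 ratios: three genes on four replicate slides.
data = [[2.0, 2.1, 1.9, 2.0],
        [0.1, -0.1, 0.05, -0.05],
        [-0.2, 0.1, 0.0, 0.1]]
called, fdr = estimate_fdr(data, delta=5.0)
```

Raising `delta` calls fewer genes and usually lowers the estimated FDR; lowering it does the opposite, which is the trade-off the user tunes.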


[Figure: SAM plot. Labels: Δ adjustment, FDR, positively significant genes]

Automated report generation

TM4 MIDAS web page

http://www.tigr.org/software/tm4/midas.html

http://www.tm4.org/midas.html
