
This is a good time to be doing Microarray Data Analysis

12th Nov 2006

Aedín Culhane, Dana-Farber Cancer Institute/Harvard School of Public Health.

http://www.r-project.org/index.html

The Genome Era

Capacity of Microarrays

1995 First cDNA Microarrays (45 Arabidopsis genes)
1996 864 Yeast genes
1996 1000 human cancer genes
1999 7,000 genes on arrays
2003 Genome on 2 arrays
2004 Whole genome (coding) arrays
2005 Exon, tiling arrays

Public Microarray Data

• ArrayExpress: 1,602 experiments (48,386 arrays; statistics Aug 06)
• GEO: 4,419 experiments (104,314 arrays)
• CIBEX: 5 experiments (472 arrays)
• SMD: 11,081 experiments (63,329 incl. private data)

(31st Oct 2006)

~160,000 arrays x $500 = $80,000,000

Cancer Studies account for >14% of all studies in databases…

Impact of Microarrays (for Patients)

2004 First microarray approved for treatment decisions by FDA. Affymetrix's AmpliChip Cytochrome P450 Genotyping Test: identifies variations in 2 genes affecting response to a wide variety of drugs.

2005 FDA issued guidelines for applications of genomics in drug development, with the stated hope that genomics will improve the safety and effectiveness of medicines.

2006 Genomics applications in clinical trials are rising. ~20% of U.S. clinical trials use some sort of genomics approach, with the highest percentage in oncology trials.

Typical Microarray study

1. Read in Data

2. Explore Raw Data

3. Preprocess & Normalize Data, then explore the data again

4. Unsupervised data analysis (Exploratory Analysis)

5. Select Features of Interest

6. Annotate with biological Information (GO, KEGG, Sequence motifs etc)

7. Other Supervised analysis or Machine Learning

Initial Data QC

Affy QC Values

Boxplot and Histogram of data: an overview of the raw data

Box: 25% to 75%, the inter-quartile range (IQR)

Middle line: Median

Whiskers: extend roughly 1.5 * IQR beyond the box (for roughly normal data this covers well over 95% of the values)

Median

"Middle value" of a list.

Odd number of entries; median = middle entry of sorted list

Even number of entries; median = sum of the two middle (after sorting) numbers divided by two.

Median can be estimated from a histogram by finding the smallest number such that the area under the histogram to the left of that number is 50%

Inter-quartile range (IQR)

• Another Dataset: 35 47 48 50 51 53 54 70 75

• Split into two halves, each including the median:
  – 35 47 48 50 51
  – 51 53 54 70 75

• Find median of each half.

• 1st quartile = 48

• 3rd quartile = 54

• IQR = 54 - 48 = 6

So what is the IQR for

35 47 48 50 51 53 54 60 70 75

• Split the data into two halves:
  – 35 47 48 50 51
  – 53 54 60 70 75

• Median of each half: 1st = 48; 3rd = 60.

• Hence IQR is 60 - 48 = 12.
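The two worked examples above can be checked in R. This is a minimal sketch: the helper name `hinge_iqr` is made up for illustration, and it relies on base R's `fivenum()`, which returns Tukey's hinges (the "median of each half" rule used on the slides).

```r
# The two datasets from the slides
x9  <- c(35, 47, 48, 50, 51, 53, 54, 70, 75)
x10 <- c(35, 47, 48, 50, 51, 53, 54, 60, 70, 75)

# IQR by the "median of each half" rule (Tukey hinges).
# fivenum() returns: min, lower hinge, median, upper hinge, max
hinge_iqr <- function(x) {
  h <- fivenum(x)
  unname(h[4] - h[2])
}

median(x9)      # middle entry of the sorted, odd-length list
median(x10)     # mean of the two middle entries
hinge_iqr(x9)   # 54 - 48
hinge_iqr(x10)  # 60 - 48

# R's IQR() uses an interpolating quantile rule by default,
# so it need not agree with the hinge-based value for x10
IQR(x10)
```

Several quartile conventions exist; the hinge rule is the one that reproduces the numbers on the slides.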

[Figure: Histogram of celfile.data; x-axis: celfile.data, y-axis: Frequency; vertical lines mark the Median and the Mean]

hist(celfile.data)
abline(v=mean(celfile.data), col="blue", lwd=2)
abline(v=median(celfile.data), col="red", lwd=2)

[Figure: Log(ratio) Histogram; x-axis: Log(ratio) from -2 to 2, y-axis: Frequency]

Log2(ratio) measures treat up- and down-regulated genes equally

log2(1) = 0, log2(2) = 1, log2(1/2) = -1
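A short R illustration of this symmetry, using made-up fold changes: on the raw ratio scale a 2-fold increase and a 2-fold decrease sit at different distances from "no change", while on the log2 scale they are mirror images.

```r
# Hypothetical expression ratios (treated / control)
ratio <- c(1, 2, 1/2, 4, 1/4)

# Raw scale: distance from "no change" (ratio = 1) is asymmetric
raw_distance <- ratio - 1

# log2 scale: up- and down-regulation are symmetric around 0
log_ratio <- log2(ratio)

data.frame(ratio, raw_distance, log_ratio)
```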

Initial Data Quality Checks

• Boxplot, Histogram
• RNA digestion plot
• Affymetrix QC parameters
  – bioB spike-ins, %P, average background, scale factor.
  – affy.qc in library(simpleaffy)
• Image plots of probe level measures (affyPLM)
  – Residuals. Larger residuals (darker) indicate deviations from model
  – Normalized Unscaled Standard Errors (NUSE) plot. Gene standard error estimates from fitPLM standardized across arrays (median SE=1). An array with elevated SEs relative to other arrays is typically of lower quality.
  – Relative Log Expression (RLE) values. Probeset expression value - median expression value across all arrays. Ideally RLE ~ 0.
• Exploratory analysis: Clustering/COA
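The RLE definition above (probeset value minus the median across arrays) is simple enough to sketch in base R. The matrix below is simulated and the array names are hypothetical; this does not call affyPLM, it only illustrates the arithmetic.

```r
set.seed(1)

# Simulated log2 expression: 100 probesets x 4 arrays (hypothetical data)
expr <- matrix(rnorm(400, mean = 8), nrow = 100,
               dimnames = list(NULL, paste0("array", 1:4)))
expr[, 4] <- expr[, 4] + 1   # pretend array4 is systematically too bright

# RLE = probeset expression value - median of that probeset across all arrays
probeset_median <- apply(expr, 1, median)
rle <- sweep(expr, 1, probeset_median)   # sweep() subtracts by default

# Per-array summary: a good array has median RLE close to 0
array_median_rle <- apply(rle, 2, median)
array_median_rle
```

A boxplot of the columns of `rle` gives the usual RLE plot; here the shifted array stands out with a median well above 0.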

Preprocessing, normalisation, error models, quality control

Goal of a microarray study

• Detect number of RNA molecules

• Actually measure fluorescence intensity of spot

INDIRECT MEASUREMENT

Normalisation aims to reduce systematic noise introduced in measurement

Expt1 Expt2 Expt3 Expt4 Expt5 Expt6

Gene 1 -3 -3 -1 0 2 3

Gene 2 -2 -2 0 1 2 2

Gene 3 -3 -2 0 1 2 3

Gene 4 3 2 0 -1 -2 -3

Gene 5 2 2 1 0 -2 -3

Gene 6 3 2 1 0 -2 -3

Gene 7 2 2 2 2 2 2

Gene 8 -2 -2 -2 -2 -2 -2

Raw data are not mRNA concentrations

o tissue contamination

o clone identification and mapping

o image segmentation

o RNA degradation

o PCR yield, contamination

o signal quantification

o amplification efficiency

o spotting efficiency

o ‘background’ correction

o reverse transcription efficiency

o DNA-support binding

o hybridization efficiency and specificity

o other array manufacturing-related issues

Early Normalization Approaches: Total Intensity

Conceptually simple

Assumption: Total RNA (mass) is the same in all samples.

Use a scaling factor... (Still used in MAS5.0)

Normalization Factor:

$N = \dfrac{\sum_{k=1}^{N_{array}} R_k}{\sum_{k=1}^{N_{array}} G_k}$

Normalization: $G'_k = N G_k$ and $R'_k = R_k$.
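A minimal R sketch of total intensity normalization for one two-colour array: the factor N is the ratio of summed red to summed green intensities, the green channel is rescaled by N and the red channel is left alone. The intensity values are invented for illustration.

```r
# Hypothetical red (R) and green (G) channel intensities for the spots on one array
R <- c(1200, 340, 5600, 80, 910)
G <- c(800, 260, 3900, 70, 560)

# Normalization factor: total red intensity over total green intensity
N <- sum(R) / sum(G)

G_norm <- N * G   # G'_k = N * G_k
R_norm <- R       # R'_k = R_k

# After scaling, both channels have the same total intensity
c(sum(R_norm), sum(G_norm))

# Log ratios are then computed from the normalized channels
log_ratio <- log2(R_norm / G_norm)
```

This applies one global factor per array, which is why it cannot remove intensity-dependent effects.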

Normalize to scaling factor

Normalized to the 75th percentile

Not influenced by outliers

Still too much below the line

Why a scaling factor is not sufficient

[Figure: log-ratio plot with panels labelled "same-same" and "2 fold"]

The two-component model

raw scale log scale

“additive” noise

“multiplicative” noise

B. Durbin, D. Rocke, JCB 2001

Quantile Normalisation

Outliers are not tolerated: the distribution of intensities on every slide is forced to be the same.
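The idea can be sketched in a few lines of base R. This is an illustrative implementation of the usual rank-and-average recipe, not the code of any particular package; the function name `quantile_normalise` and the toy matrix are assumptions, and the sketch assumes no missing values.

```r
quantile_normalise <- function(x) {
  # x: intensity matrix, rows = probes, columns = arrays (no NAs assumed)
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each probe within its array
  sorted <- apply(x, 2, sort)                         # each array sorted independently
  target <- rowMeans(sorted)                          # mean of each quantile across arrays
  out <- apply(ranks, 2, function(r) target[r])       # replace each value by its quantile mean
  dimnames(out) <- dimnames(x)
  out
}

# Toy data: 4 probes on 3 arrays
x <- cbind(a = c(5, 2, 3, 4),
           b = c(4, 1, 4.5, 2),
           c = c(3, 4, 6, 8))
xn <- quantile_normalise(x)
xn
```

After normalisation every column contains exactly the same set of values; only the order of the probes within each array differs.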

Observe: Intensity-dependent structure

Lowess Normalization

Straightens the banana!
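A hedged sketch of the idea with base R's `lowess()`: fit a smooth trend of the log ratio M against the average log intensity A, then subtract it. The data are simulated with an artificial curved ("banana") bias, and the span `f = 0.3` is an arbitrary choice for illustration.

```r
set.seed(2)
n <- 500

# Simulated MA data: A = average log intensity, M = log ratio
A <- runif(n, 6, 14)
M <- 0.05 * (A - 10)^2 - 0.3 + rnorm(n, sd = 0.2)   # curved, intensity-dependent bias

# Fit the intensity-dependent trend (lowess returns points sorted by A)
fit <- lowess(A, M, f = 0.3)

# Evaluate the fitted trend at each spot and subtract it
trend  <- approx(fit$x, fit$y, xout = A, ties = mean)$y
M_norm <- M - trend

# The curvature is largely gone after normalization
c(before = cor(M, (A - 10)^2), after = cor(M_norm, (A - 10)^2))
```

Print-tip or regional lowess follows the same pattern, with the fit repeated within each print-tip group or region.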

Standard deviation regularization (in TM4 MIDAS)

Assumption: log-ratio standard deviations within each block or slide are the same.

Variance regularization can remove the bias

Platform Problems

Spotted Array Platform Specific

– “In house” printing effects

– Regional effects within and between print-tips

– Need regional plate and print-tip lowess normalisation

PCR plates

Spotting pin quality decline

[Figure panels: after delivery of 3x10^5 spots; after delivery of 5x10^5 spots]

H. Sueltmann DKFZ/MGA

Affymetrix Platform Specific

– Probe level effect. Need a gene expression measure from the 11 probes in a probeset

Affymetrix Probe sets

Probe set summarization

$PM_{ijg}$, $MM_{ijg}$: intensities for perfect match and mismatch probe j for gene g in chip i

Need to summarize each probe set, i.e., 16 PM, MM pairs, into a single expression measure.

Expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:

o sort $d_j = PM_j - MM_j$
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average

$\mathrm{AvDiff} = \dfrac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$
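The AvDiff recipe can be sketched in R as follows. This is one reading of the three steps on the slide (mean and SD computed after dropping the extreme differences, then all pairs within 3 SD of that mean are averaged); the function name and the probe values are hypothetical, and this is not Affymetrix's own code.

```r
avdiff <- function(pm, mm) {
  d <- sort(pm - mm)                       # d_j = PM_j - MM_j, sorted
  trimmed <- d[-c(1, length(d))]           # exclude lowest and highest value
  centre  <- mean(trimmed)
  spread  <- sd(trimmed)
  in_J <- abs(d - centre) <= 3 * spread    # J: pairs within 3 SD of the average
  mean(d[in_J])                            # (1 / #J) * sum over J of (PM_j - MM_j)
}

# Hypothetical probe set with one wildly outlying probe pair
mm <- rep(100, 9)
pm <- mm + c(10, 12, 11, 9, 13, 10, 12, 11, 500)

avdiff(pm, mm)    # the outlying pair is excluded
mean(pm - mm)     # the plain mean is dominated by it
```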

Expression measures: MAS 5.0

Instead of MM, use a "repaired" version CT:

CT = MM if MM < PM
CT = PM / "typical log-ratio" if MM >= PM

"Signal" = Tukey.Biweight(log(PM - CT)) (... ≈ median)

Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise
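To show how the biweight function B(x) turns into a robust average, here is a one-step Tukey biweight sketch in base R. The tuning constant `c = 5` and the small `epsilon` are assumptions chosen for illustration, and `tukey_biweight` is a hypothetical helper name rather than a packaged function.

```r
tukey_biweight <- function(x, c = 5, epsilon = 1e-4) {
  m <- median(x)
  s <- median(abs(x - m))              # median absolute deviation (unscaled)
  u <- (x - m) / (c * s + epsilon)     # scaled distance from the median
  w <- ifelse(abs(u) < 1, (1 - u^2)^2, 0)   # B(u): weight falls to 0 for distant points
  sum(w * x) / sum(w)                  # weighted mean
}

# Hypothetical log(PM - CT) values with one outlying probe
x <- c(1, 2, 3, 4, 100)

tukey_biweight(x)   # close to the median
median(x)
mean(x)             # dragged upwards by the outlier
```

Points far from the median get weight 0, which is why the result stays near the median, as the slide notes.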

Expression measures: Li & Wong

dChip fits a model for each gene:

$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$

where
– $\theta_i$: expression index for the gene on chip i
– $\phi_j$: probe sensitivity

Need at least 10 or 20 chips.

Invariant set

Robust expression measures RMA: Irizarry et al. (2002)

AvDiff-like:

$\mathrm{RMA} = \dfrac{1}{|A|} \sum_{j \in A} \log_2 (PM_j - BG_j)$

with A a set of “suitable” pairs.

Estimate RMA = $a_i$ for chip i using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
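Median polish is available in base R as `medpolish()`. The sketch below applies it to a made-up probes x chips matrix of background-corrected log2 PM values for a single probe set; the chip-level summary is taken as overall effect plus column (chip) effect. It illustrates the summarization step only, not the full RMA pipeline (no background correction or quantile normalisation here).

```r
# Hypothetical probe set: 5 probes (rows) on 3 chips (columns), log2 scale
probe_effect <- c(-1, 0, 1, 0.5, -0.5)   # probe affinities
chip_level   <- c(8, 10, 12)             # true expression on each chip
y <- outer(probe_effect, chip_level, "+")

# One misbehaving probe on chip 3
y[2, 3] <- y[2, 3] + 5

# Median polish: alternately remove row and column medians until convergence
fit <- medpolish(y, trace.iter = FALSE)

# Chip-level expression estimate a_i = overall + chip (column) effect
rma_like <- unname(fit$overall + fit$col)
rma_like

# The outlying probe ends up in the residuals, not in the estimate
round(fit$residuals, 2)
```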

Comparative MvA plots

MAS5

dChip

RMA

Irizarry et al.

Affymetrix: $I_{PM} = I_{MM} + I_{specific}$ ?

[Figure: distribution of log(PM/MM). From: R. Irizarry et al., Biostatistics 2002]

Probe-response calibration

$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$

Position- and sequence-specific effects $w_i(s)$: Naef et al., Phys Rev E 68 (2003)

Comparison of these Affy methods

• 2 test datasets
  – Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  – Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts

• 15 quality benchmarks: reproducibility, sensitivity, specificity. Put together by Rafael Irizarry (Johns Hopkins)

http://affycomp.biostat.jhsph.edu

affycomp results (28 Sep 2003)

[Figure: benchmark results, scale from good to bad]

Raw Data

Genes >2 fold different

Mas5.0 VSN gcRMA

Normalisation

Red: >2 fold difference in gcRMA normalised data

gcRMA vsn MAS5.0

Selected 5 “follow up” vsn genes (Gzmg: 1422867_at, Gzmd: 1420343_at, Birc1e: 1421525_a_at, Tgfbi: 1415871_at, Pdgfb: 1450413_at). These had similar profiles in gcRMA and MAS5.0

[Figure: fold increase / fold change by method (gcRMA, MAS5, VSN)]

Recap

• Normalisation
  – Log or glog
  – Scale to a number, lowess, quantile, variance stabilising

• Spotted
  – Within & between plate, print-tip etc

• Affymetrix
  – MAS4.0, MAS5.0, RMA, gcRMA, Li & Wong

• With above normalisation methods

Are these methods always valid?

• Mas5.0, RMA, gcRMA and vsn all assume that the sum of RNA is constant (same no of genes up and down)

• THIS IS NOT ALWAYS TRUE
  – k/o of pol II
  – Blocking methylation/translation etc

Normalising to an external set of genes

• Housekeeping
  – Not a good idea

• Li & Wong
  – Transform using non-linear smooth curves
  – Uses rank invariant probes
  – Available in dChip and R
  – Cheng Li & Wing Hung Wong (2001a) PNAS 98, 31-36

• Spike in Controls
  – External RNA
  – van de Peppel et al., (2003) EMBO Rep. 4(4):387-93.

Colon Cancer Data

• Fresh-frozen human colorectal tumours.

• N=6
  – Whole tumour N=3
  – Parenchymal fraction (LCM dissected)

• On Affymetrix U133plus2 chips
  – 54675 probesets

Normalised data: MAS, RMA

[Figure panels: RMA Normalisation; MAS5.0 Normalisation]

Normalised data: VSN, Li & Wong

[Figure panels: Li & Wong Normalisation; VSN Normalisation]

Normalisation Matters!

[Figure panels: Li & Wong; RMA; MAS 5.0]

Many Normalisation methods

Need to consider best one for your experimental design

Most normalisation methods assume sum of mRNAs is equal

Exploratory Data Analysis: Clustering and Ordination


Microarray data analysis

Microarrays produce:

• 10,000's of variables simultaneously

• Multivariate data

• Essential to use exploratory data analysis to “get a handle” on the data

Typical Analysis of Microarrays

1. Read in Data

2. Explore Raw Data

3. Preprocess & Normalize Data, then explore the data again

4. Unsupervised data analysis (Exploratory Analysis)

5. Select Features of Interest. Include additional biological information (GO, KEGG, sequence motifs etc.)

6. Other Supervised analysis or Machine Learning

Importance of Data Exploration

• Exploration of Data is Critical
  – Detect unpredicted patterns in data
  – Decide what questions to ask

• Clustering
  – Hierarchical
  – Flat (k-means)

• Ordination (Dimension Reduction)
  – Principal Component analysis, Correspondence analysis

A Distance Metric

• The choice of metric is fundamental

• Exploratory analysis– only discover where you explore..

Expt1 Expt2 Expt3 Expt4 Expt5 Expt6

Gene 1 -3 -3 -1 0 2 3

Gene 2 -2 -2 0 1 2 2

Gene 3 -3 -2 0 1 2 3

Gene 4 3 2 0 -1 -2 -3

Gene 5 2 2 1 0 -2 -3

Gene 6 3 2 1 0 -2 -3

Gene 7 2 2 2 2 2 2

Gene 8 -2 -2 -2 -2 -2 -2

Sample set of gene expression values

Back to our 8 Genes – Create a distance matrix

[Figure: Expression of 8 genes in 6 arrays; x-axis: arrays (1 to 6), y-axis: log ratio (-4 to 4); one line each for Gene 1 to Gene 8]

Distance Metrics

• Euclidean distance
• Pearson correlation coefficient
• Spearman rank
• Manhattan distance
• Mutual information
• etc

Each has different properties and can reveal different features of the data

Distance / Similarity

         Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Gene A   x_1A    x_2A    x_3A    x_4A    x_5A    x_6A
Gene B   x_1B    x_2B    x_3B    x_4B    x_5B    x_6B

1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$

2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$

3. Pearson correlation

[Figure: genes A and B drawn as points p_A and p_B]

Distance Is Defined by a Metric

Distance Metric:   Euclidean   Pearson*
D                  6.0         +1.00
D                  1.4         -0.05

[Figure: pairs of expression profiles; y-axis: log2(cy5/cy3), from -3 to 3]

Warning: Correlations gone wrong

[Figure: two x-y scatter plots, one with corr=0.87 and one with corr=0.04]

Clustering: Distance metrics

Euclidean distance

Expt1 Expt2 Expt3 Expt4 Expt5 Expt6

Gene 1 -3 -3 -1 0 2 3

Gene 2 -2 -2 0 1 2 2

Gene 3 -3 -2 0 1 2 3

Gene 4 3 2 0 -1 -2 -3

Gene 5 2 2 1 0 -2 -3

Gene 6 3 2 1 0 -2 -3

Gene 7 2 2 2 2 2 2

Gene 8 -2 -2 -2 -2 -2 -2

Dist(gene 1, gene 2) $= \sqrt{(-3+2)^2 + (-3+2)^2 + (-1-0)^2 + (0-1)^2 + (2-2)^2 + (3-2)^2} = \sqrt{5} = 2.236068 \approx 2.24$

In general: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
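The worked distance can be reproduced in R with the 8-gene example table. The sketch uses base R's `dist()` and `cor()`; the object names are arbitrary. Pearson correlation is only computed for genes 1 to 6, because genes 7 and 8 are flat (zero variance) and their correlation is undefined.

```r
# The sample set of gene expression values from the slides
genes <- rbind(
  gene1 = c(-3, -3, -1,  0,  2,  3),
  gene2 = c(-2, -2,  0,  1,  2,  2),
  gene3 = c(-3, -2,  0,  1,  2,  3),
  gene4 = c( 3,  2,  0, -1, -2, -3),
  gene5 = c( 2,  2,  1,  0, -2, -3),
  gene6 = c( 3,  2,  1,  0, -2, -3),
  gene7 = c( 2,  2,  2,  2,  2,  2),
  gene8 = c(-2, -2, -2, -2, -2, -2)
)
colnames(genes) <- paste0("Expt", 1:6)

# Distances between rows (genes)
euclid <- as.matrix(dist(genes, method = "euclidean"))
manhat <- as.matrix(dist(genes, method = "manhattan"))

euclid["gene1", "gene2"]   # sqrt(5)
round(euclid, 2)           # full distance matrix

# Pearson correlation between gene profiles (genes 1 to 6 only)
pearson <- cor(t(genes[1:6, ]))
round(pearson, 2)
```

Genes 1 and 4 are far apart by Euclidean distance yet strongly (negatively) correlated, which is the sense in which the choice of metric determines what "similar" means.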

[Figure: Expression of 8 genes in 6 arrays; x-axis: arrays (1 to 6), y-axis: log ratio (-4 to 4); one line each for Gene 1 to Gene 8]

Distance Matrix

         Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  Gene 7  Gene 8
Gene 1   0       2.24    1.73    10.72   10.30   10.82   8.00    6.93
Gene 2           0       1.41    9.27    8.66    9.17    6.08    6.71
Gene 3                   0       10.39   9.75    10.30   6.86    7.42
Gene 4                           0       1.73    1.41    7.42    6.86
Gene 5                                   0       1       6.78    6.78
Gene 6                                           0       6.86    7.42
Gene 7                                                   0       9.80
Gene 8                                                           0

The matrix is symmetric. Now we need to decide which genes are closest.

Comparison of Linkage Methods

           Single   Average   Complete
Join by    min      average   max


5,6 are closest (dist = 1) so merge these

           Gene 1  Gene 2  Gene 3  Gene 4  Gene 5,6  Gene 7  Gene 8
Gene 1     0       2.24    1.73    10.72   10.30     8.00    6.93
Gene 2             0       1.41    9.27    8.66      6.08    6.71
Gene 3                     0       10.39   9.75      6.86    7.42
Gene 4                             0       1.41      7.42    6.86
Gene 5,6                                   0         6.78    6.78
Gene 7                                               0       9.80
Gene 8                                                       0

[Figure: dendrogram so far, with leaves Gene 5, Gene 6, Gene 2, Gene 3]

           Gene 1  Gene 2,3  Gene 4  Gene 5,6  Gene 7  Gene 8
Gene 1     0       1.73      10.72   10.30     8.00    6.93
Gene 2,3           0         9.27    8.66      6.08    6.71
Gene 4                       0       1.41      7.42    6.86
Gene 5,6                             0         6.78    6.78
Gene 7                                         0       9.80
Gene 8                                                 0

               Gene 1  Gene 2,3  Gene 4,(5,6)  Gene 7  Gene 8
Gene 1         0       1.73      10.30         8.00    6.93
Gene 2,3               0         8.66          6.08    6.71
Gene 4,(5,6)                     0             6.78    6.78
Gene 7                                         0       9.80
Gene 8                                                 0

… continue, join 1 to (2,3) at 1.73

…. until done
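The merge sequence walked through above takes the minimum distance between clusters at each step, which corresponds to single linkage. A sketch with base R's `hclust()` on the same 8-gene table reproduces the merge heights; the object names are arbitrary.

```r
# The sample set of gene expression values from the slides
genes <- rbind(
  gene1 = c(-3, -3, -1,  0,  2,  3),
  gene2 = c(-2, -2,  0,  1,  2,  2),
  gene3 = c(-3, -2,  0,  1,  2,  3),
  gene4 = c( 3,  2,  0, -1, -2, -3),
  gene5 = c( 2,  2,  1,  0, -2, -3),
  gene6 = c( 3,  2,  1,  0, -2, -3),
  gene7 = c( 2,  2,  2,  2,  2,  2),
  gene8 = c(-2, -2, -2, -2, -2, -2)
)

d  <- dist(genes)                    # Euclidean distance matrix
hc <- hclust(d, method = "single")   # join clusters by their minimum distance

round(hc$height, 2)   # heights at which successive merges happen
cutree(hc, k = 4)     # cluster membership when the tree is cut into 4 groups

# Other linkage rules only change how between-cluster distance is defined
hc_average  <- hclust(d, method = "average")
hc_complete <- hclust(d, method = "complete")
```

Calling `plot(hc)` draws the dendrogram; swapping `method` shows how the linkage rule changes the tree even though the distance matrix is identical.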

[Figure: dendrogram with leaves Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6]

Hierarchical clustering assembles a number of items into a tree, where items are joined by short branches if they are very similar to each other and by increasingly longer branches as their similarity decreases.

[Figure: Cluster Dendrogram; y-axis: Height (1 to 7); leaf order: Gene 4, Gene 5, Gene 6, Gene 8, Gene 7, Gene 1, Gene 2, Gene 3]

Heatmap….. Eisen Plots

A B

Interpreting a Dendrogram

Hierarchical analysis results viewed using a dendrogram (tree)

• Distance between nodes (Scale)
• Ordering of nodes not important (like baby mobile)

Limitations of hierarchical clustering

• Samples compared in a pair wise manner

• Hierarchy forced on data

• Sometimes difficult to visualise if large data

• Overlapping clustering or time/dose gradients ?

Complementary Approach: ordination

Not this kind of ordination

Ordination- In multivariate statistics

1. Arrangement of units in some order

2. Representation of objects as points along one or several axes of reference (Gower 1984)

Complementary methods

Cluster analysis generally investigates pairwise distances/similarities among objects looking for fine relationships

Ordination in reduced space considers the variance of the whole dataset thus highlighting general gradients/patterns

(Legendre and Legendre, 1998)

Many publications present both

Ordination

• Also referred to as
  – Latent variable analysis, Dimension reduction

• Aim:

Find axes onto which data can be projected so as to explain as much of the variance in the data as possible

Principal Axes

• Project new axes through data which capture variance. Each represents a different trend in the data.

• Orthogonal (decorrelated)

• Typically ranked: First axes most important

• Principal axis, Principal component, latent variable or eigenvector

Dimension Reduction (Ordination)

Principal Components pick out the directions in the data that capture the greatest variability

[Figure: data cloud on axes x, y, z with New Axis 1, New Axis 2 and New Axis 3 drawn through it]

Eigenvalues

• Describe the amount of variance (information) in eigenvectors

• Ranked. First eigenvalue is the largest.

• Generally only examine 1st few components – scree plot

Choosing number of Eigenvalues: Scree Plot

[Figure: scree plot of eigenvalues]

Maximum number of Eigenvalues/Eigenvectors = min(nrow, ncol) - 1
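A small PCA sketch in base R, using `prcomp()` on the 8-gene example with the 6 arrays as observations. Eigenvalues are the squared standard deviations of the components; with 6 centred observations, at most 5 components can carry variance. Scaling is switched off because genes 7 and 8 are constant.

```r
# The sample set of gene expression values from the slides (genes x arrays)
genes <- rbind(
  gene1 = c(-3, -3, -1,  0,  2,  3),
  gene2 = c(-2, -2,  0,  1,  2,  2),
  gene3 = c(-3, -2,  0,  1,  2,  3),
  gene4 = c( 3,  2,  0, -1, -2, -3),
  gene5 = c( 2,  2,  1,  0, -2, -3),
  gene6 = c( 3,  2,  1,  0, -2, -3),
  gene7 = c( 2,  2,  2,  2,  2,  2),
  gene8 = c(-2, -2, -2, -2, -2, -2)
)
colnames(genes) <- paste0("Expt", 1:6)

# PCA with arrays as observations (rows) and genes as variables (columns)
pca <- prcomp(t(genes), center = TRUE, scale. = FALSE)

eig <- pca$sdev^2                 # eigenvalues, ranked from largest to smallest
var_explained <- eig / sum(eig)   # proportion of variance per component
round(var_explained, 3)

n_axes <- sum(eig > 1e-12)        # components that carry any variance

# Coordinates of the arrays on the first two principal axes
head(pca$x[, 1:2])
```

`barplot(eig)` gives the scree plot, and `plot(pca$x[, 1:2])` the usual PC1 v PC2 projection of the arrays.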

Typical Analysis

Data matrix X: ordination; plot of eigenvalues, select number; plot PC1 v PC2 etc.

[Figure panels: scree plot of eigenvalues; Array Projection; Gene Projection]

Ordination of Gene Expression Data

Ordination Methods

• Most common:
  – Principal component analysis (PCA)
  – Correspondence analysis (COA or CA)
  – Nonmetric multidimensional scaling (NMDS, MDS)
  – Principal co-ordinate analysis (PCoA)

Books/Book Chapters:

1. Legendre, P., and Legendre, L. 1998. Numerical Ecology, 2nd English Edition. Elsevier, Amsterdam.

2. Wall, M., Rechtsteiner, A., and Rocha, L. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis (eds. D.P. Berrar, W. Dubitzky, and M. Granzow), pp. 91-109. Kluwer, Norwell, MA.

Papers:

1. Alter, O., Brown, P.O., and Botstein, D. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A 97: 10101-10106.

2. Culhane, A.C., Perriere, G., Considine, E.C., Cotter, T.G., and Higgins, D.G. 2002. Between-group analysis of microarray data. Bioinformatics 18: 1600-1608.

3. Culhane, A.C., Perriere, G., and Higgins, D.G. 2003. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 4: 59.

4. Fellenberg, K., Hauser, N.C., Brors, B., Neutzner, A., Hoheisel, J.D., and Vingron, M. 2001. Correspondence analysis applied to microarray data. Proc Natl Acad Sci U S A 98: 10781-10786.

5. Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572.

6. Raychaudhuri, S., Stuart, J.M., and Altman, R.B. 2000. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput: 455-466.

7. Wouters, L., Gohlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., and Lewi, P.J. 2003. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 59: 1131-1139.

Reviews:

1. Quackenbush, J. 2001. Computational analysis of microarray data. Nat Rev Genet 2: 418-427.

Detecting differentially expressed genes

Normal distribution

σ = standard deviation of the distribution

X = μ (mean of the distribution)

Estimating a mean

All had the same mean and SD

[Figure: Population 1 with Mean 1, Population 2 with Mean 2, and a sample mean “s”]

Less than a 5% chance that the sample with mean s came from Population 1

s is significantly different from Mean 1 at the p < 0.05 significance level.

But we cannot reject the hypothesis that the sample came from Population 2

Probability distributions

The probability of an event is the likelihood of its occurring.

It is sometimes computed as a relative frequency (rf), where

rf = (the number of “favorable” outcomes for an event) / (the total number of possible outcomes for that event)

The probability of an event can sometimes be inferred from a “theoretical” probability distribution, such as a normal distribution.

Normality, Probability and Expression Data

Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.

But expression measurements? Probably not.

Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.

Randomization / resampling based tests can be used to get around the violation of the normality assumption.

IMPORTANT CONCEPT No 2

                   True Value (with Disease)
Test Prediction    TRUE              FALSE
+ve                True Positive     False Positive    Positive Predictive Value
-ve                False Negative    True Negative     Negative Predictive Value
                   Sensitivity       Specificity       Accuracy
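The margins of this table are simple ratios of the four cells. A short R sketch with invented counts (the numbers are hypothetical and only illustrate the arithmetic):

```r
# Hypothetical counts for a test evaluated on 1000 cases
TP <- 40    # test +ve, truly with disease
FP <- 10    # test +ve, truly without disease
FN <- 5     # test -ve, truly with disease
TN <- 945   # test -ve, truly without disease

sensitivity <- TP / (TP + FN)   # fraction of true cases the test detects
specificity <- TN / (TN + FP)   # fraction of non-cases the test clears
ppv <- TP / (TP + FP)           # positive predictive value
npv <- TN / (TN + FN)           # negative predictive value
accuracy <- (TP + TN) / (TP + FP + FN + TN)

round(c(sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, accuracy = accuracy), 3)
```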

IMPORTANT CONCEPT No 2

[Figure: one axis running from bias to accuracy, the other from precision to variance]

Another view: ROC Curve

[Figure: ROC curve; y-axis: Sensitivity, x-axis: 1 - specificity; operating points marked X]

Basic dogma of data analysis: you can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the sweet spot.

(It can also be possible to increase both by better choice of method / model)

Our goal is to find genes that are significantly different between classes

Finding Significant Genes

How?

• Fold Change
• T-statistic
• Modified t-statistic
• Other methods

Fold Change

• Only looks at the difference in the means of two groups

• Unreliable in microarrays

• Why? We can’t get good estimate of mean due to too few cases

Average Fold Change Difference for each gene

Suffers from being arbitrary and not taking into account systematic variation in the data

t-test for each gene

• Tests whether the means of the query and reference groups are the same
• Essentially measures signal-to-noise
• Calculate p-value (permutations or distributions)
• May suffer from intensity-dependent effects

Finding Significant Genes

$t = \dfrac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{s_Y^2}{N} + \dfrac{s_X^2}{M}}}$

t-statistic = signal / noise = difference between means / variability of groups

Where $\bar{Y}$ and $\bar{X}$: the group means (from N and M samples)

$s^2$: square of the SD, i.e. the variance

[Figure: t-tests; one pair of distributions labelled "A significant difference", another labelled "Probably not"]
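The t-statistic formula can be checked against base R's `t.test()`, whose default (Welch) form uses the same unpooled denominator. The expression values below are simulated for a single hypothetical gene.

```r
set.seed(4)

# Simulated log2 expression of one gene in two groups
Y <- rnorm(5, mean = 9, sd = 0.5)   # query group, N = 5
X <- rnorm(6, mean = 8, sd = 0.5)   # reference group, M = 6

# signal / noise: difference between means over its standard error
t_manual <- (mean(Y) - mean(X)) /
  sqrt(var(Y) / length(Y) + var(X) / length(X))

# Built-in test (Welch t-test is the default: var.equal = FALSE)
tt <- t.test(Y, X)
t_builtin <- unname(tt$statistic)

c(manual = t_manual, builtin = t_builtin, p_value = tt$p.value)
```

For a whole array the same calculation is repeated for every gene, for example by applying it to each row of the expression matrix.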

Estimating the variance

The t-test compares the difference between group means to the standard deviation of the data within groups

F-test (ANOVA) is a generalization of this idea to more than 2 groups

But with few replicates, estimates of SE are not stable. This explains why t-test is not powerful

Moderated t-statistics

• There are many proposals for estimating variation

• Many share information across genes
• Empirical Bayesian Approaches are popular
• SAM, an ad-hoc procedure, is even more popular
• Many are what some call “moderated” t-tests

Some Examples of Tests

Notation:
– T is average log expression of Treatment group
– C is average log expression of Control group
– S is SD

• Tests:
– Average log fold-change: (T-C)
– t-statistic: (T-C) / S
– SAM shrunken t-statistic: (T-C) / (S + S0)
– Bayesian posteriors: (T-C) / √(S^2+K^2)
– Wilcoxon Rank test

Note: taking the log before averaging is important
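To make the notation concrete, the sketch below computes the first three scores for simulated data. Several choices are assumptions made only for illustration: S is taken as the standard error of the difference, and the fudge factor S0 is set to the median of S across genes, which is a placeholder and not the rule SAM itself uses.

```r
set.seed(5)
n_genes <- 200

# Simulated log2 expression: 3 treatment and 3 control arrays
treat   <- matrix(rnorm(n_genes * 3, mean = 8), nrow = n_genes)
control <- matrix(rnorm(n_genes * 3, mean = 8), nrow = n_genes)
treat[1:10, ] <- treat[1:10, ] + 2        # first 10 genes are truly up-regulated

T_mean <- rowMeans(treat)                 # average log expression, Treatment
C_mean <- rowMeans(control)               # average log expression, Control
log_fc <- T_mean - C_mean                 # average log fold-change: (T - C)

# S: per-gene standard error of the difference (assumption, see lead-in)
S <- sqrt(apply(treat, 1, var) / 3 + apply(control, 1, var) / 3)

t_stat <- log_fc / S                      # ordinary t-statistic

S0 <- median(S)                           # placeholder fudge factor (assumption)
sam_like <- log_fc / (S + S0)             # shrunken, SAM-style statistic

# Genes with tiny S no longer dominate the ranking after shrinkage
head(order(-abs(t_stat)))
head(order(-abs(sam_like)))
```

With so few replicates S is unstable, so adding S0 to the denominator damps the genes whose large t-statistics come only from an accidentally small variance estimate.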

One final problem

Once you have a score for each gene, how do you decide on a cut-off? p-values are popular. Are they appropriate?

Test for each gene the null hypothesis: no differential expression.

Notice that if you look at 10,000 genes for which the null is true, you expect about 500 to attain p-values below 0.05

This is called the multiple comparison problem. Statisticians fight about it. But not about the above.

Main message: p-values can’t be interpreted in the usual way

Multiple testing

Popular solutions are either

• slash the p-value: Bonferroni or permutation correction

• or report FDR instead of FPR.
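Both kinds of correction are available through base R's `p.adjust()`. The p-values below are invented; "BH" (Benjamini-Hochberg) is used here as one common FDR method, which is a choice made for this sketch rather than something the slides specify.

```r
# Expected number of false positives when all 10,000 nulls are true
n_genes <- 10000
alpha   <- 0.05
expected_false_positives <- n_genes * alpha   # 500

# Hypothetical raw p-values for 5 genes
p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p_bonferroni <- p.adjust(p, method = "bonferroni")

# FDR: Benjamini-Hochberg adjusted p-values
p_fdr <- p.adjust(p, method = "BH")

data.frame(p, p_bonferroni, p_fdr)
```

Bonferroni controls the chance of any false positive and is therefore the stricter of the two; FDR control accepts a stated proportion of false positives among the genes called significant.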

Error Rates

A useful plot

The volcano plot shows, for a particular test, negative log p-value against the effect size (M)

Volcano plot

Comparison of Feature Selection methods

Assessed

1. the gene list produced by 9 different methods

2. the ability of the top genes from each method to form a classifier

Overlap in Gene Ranking in top 200 genes (binary distance, average linkage)

Testing performance of gene-lists as classifiers

• For each dataset:
  – Divide dataset into training and test.
  – Apply feature selection method to training data.
  – Rank genes using feature selection method (t-statistic, SAM, template matching, etc).
  – Select K top genes, where K is between 3 and 100
  – Train classifier using these genes.
  – Test discriminating power using test set
  – Record performance of classifier

• Repeat for each gene selection method

Conclusion:

The empirical Bayes t-statistic is a robust and accurate way to identify regulated genes.

Rank Products is also effective in data with low sample size.

Sample permutation of t-statistic and SAM are not effective in datasets with few samples or with low signal:noise

For larger, or high signal:noise datasets, most methods work well. Area under the ROC curve method and MaxT outperform other approaches

Jeffery IB, Higgins DG, Culhane AC Comparison and evaluation of microarray feature selection methods. BMC Bioinformatics. Submitted

Finding out more about Genes

We know lots about genes

• Chromosome location
• Pathways (KEGG)
• Gene Ontology
  – Sub-cellular location (eg nucleus, cytosol)
  – Biological process (cell signalling)
  – Molecular function (kinase)

Structure of a GO annotation

Each gene can have several annotated GOs and each GO can have several splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits

Gene Sets Score

• Fisher exact test (chi-square test)

• Kolmogorov- Smirnov statistic• weighted KS statistics

• Simple matrix multiplication of t-statistics x counts

Is a GO term specific for a set?

Contingency table:

                          Count in set (e.g.                Count in reference set
                          differentially expressed genes)   (e.g. all genes on array)   Total
Genes with GO term        51                                416                         467
Genes without GO term     125                               8588                        8713
Total                     173                               9004                        9177

P-value: 8x10^-52 (Fisher's exact test or chi-square test)
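Both tests are one call each in base R. The sketch below enters only the four inner cells of the table (51, 416, 125, 8588) and lets R recompute the margins from them; the dimension labels are illustrative. The resulting p-values depend on the test and on how the table is set up, so they are printed rather than quoted here.

```r
# 2 x 2 contingency table built from the four inner cell counts
tab <- matrix(c(51, 125, 416, 8588), nrow = 2,
              dimnames = list(GO    = c("with term", "without term"),
                              group = c("in set", "reference")))
tab
addmargins(tab)          # margins recomputed from the cells

ft <- fisher.test(tab)   # Fisher's exact test
cs <- chisq.test(tab)    # chi-square test (with continuity correction by default)

ft$p.value
ft$estimate              # odds ratio: > 1 means the term is over-represented in the set
cs$p.value
```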

Gene Ontology: FatiGO

Many options

• GSEA
• IGA/Rank Prod
• GenMAPP, and MAPPFinder
• FatiGO
• Segal et al., 2004

Gene Set Enrichment

• proposed by Mootha et al (2003)
• similar but more complex and computationally expensive
• A Kolmogorov-Smirnov running sum is computed

Gene Set Enrichment

• For each gene set S
• genes are ordered according to some criterion (t-test; fold change).
• Start at top ranking gene
• A running sum increases when a gene in set S is encountered and decreases otherwise
• The enrichment score (ES) for a set S is defined to be the largest value of the running sum.

Kolmogorov-Smirnov test

Running sum over statistics. Compare distance to random distribution
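The running sum can be sketched in a few lines of R. The step sizes used here (up by sqrt((N-G)/G) for a gene in the set, down by sqrt(G/(N-G)) otherwise, with N genes in total and G in the set) are one common unweighted choice that makes the sum return to zero at the end; they are an assumption of this sketch, as are the function and gene names.

```r
enrichment_score <- function(ranked_genes, gene_set) {
  N <- length(ranked_genes)
  in_set <- ranked_genes %in% gene_set
  G <- sum(in_set)
  # Step up for genes in the set, down for the rest; steps sum to zero overall
  step <- ifelse(in_set, sqrt((N - G) / G), -sqrt(G / (N - G)))
  running_sum <- cumsum(step)
  max(running_sum)   # ES: largest value reached by the running sum
}

# Hypothetical ranked list of 10 genes (g1 = most differentially expressed)
ranked <- paste0("g", 1:10)

es_top    <- enrichment_score(ranked, c("g1", "g2"))    # set at the top of the list
es_middle <- enrichment_score(ranked, c("g3", "g8"))    # set spread out
es_bottom <- enrichment_score(ranked, c("g9", "g10"))   # set at the bottom

c(top = es_top, middle = es_middle, bottom = es_bottom)
```

Significance is then judged by recomputing the score after permuting the class labels, as described on the following slide.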

Gene Set Enrichment

• The maximal ES (MES), over all sets S under consideration is recorded.

• For each of B permutations of the class label, ES and MES values are computed.

• The observed MES is then compared to the B values of MES that have been computed, via permutation.

• This is a single p-value for all tests and hence needs no correction

Selection of Categories

• pathways (KEGG, cMAP, BioCarta)
• GO molecular function, biological process, cellular location
• published literature
• Genome Info: regions of synteny; cytogenetic bands

Take care when selecting categories a priori

num categories >>>> num genes (multiple comparison problem)

Dr. Frederick Frankenstein: Igor, would you mind telling me whose brain I did put in?
Igor: And you won't be angry?
Dr. Frederick Frankenstein: I will NOT be angry.
Igor: Abby someone.
Dr. Frederick Frankenstein: Abby someone. Abby who?
Igor: Abby Normal.
Dr. Frederick Frankenstein: Abby Normal?
Igor: I'm almost sure that was the name.
Dr. Frederick Frankenstein: Are you saying that I put an abnormal brain into a seven and a half foot long, fifty-four inch wide GORILLA? IS THAT WHAT YOU'RE TELLING ME?

From the film Young Frankenstein, 1974

Good Experimental Design & Sample Processing is Critical
