This is a good time to be doing Microarray Data Analysis. 12th Nov 2006. Aedín Culhane, Dana-Farber Cancer Institute/Harvard School of Public Health.



The Genome Era

Capacity of Microarrays

1995 First cDNA Microarrays (45 Arabidopsis genes)
1996 864 Yeast genes
1996 1000 human cancer genes
1999 7,000 Genes on arrays
2003 Genome on 2 arrays
2004 Whole genome (coding) arrays
2005 Exon, tiling arrays

Public Microarray Data

ArrayExpress • 1602 Experiments (48,386 arrays, Statistics Aug 06)

GEO • 4,419 Experiments (104,314 arrays)

CIBEX • 5 Experiments (472 arrays)

SMD • 11081 Expts (63329 incl private data)

(31st Oct 2006)


~160,000 arrays x $500 = $80,000,000

Cancer Studies account for >14% of all studies in databases…


Impact of Microarrays (for Patients)

2004 First microarray approved for treatment decisions by FDA. Affymetrix's AmpliChip Cytochrome P450 Genotyping Test: identifies variations in 2 genes affecting response to a wide variety of drugs.

2005 FDA issued guidelines for applications of genomics in drug development, with the stated hope that genomics will improve the safety and effectiveness of medicines.

2006 Genomics applications in clinical trials rising. ~20% U.S. clinical trials use some sort of genomics approach, with the highest percentage in oncology trials.

Typical Microarray study

1. Read in Data

2. Explore Raw Data

3. Preprocess & Normalize Data… Then again Explore data

4. Unsupervised data analysis (Exploratory Analysis)

5. Select Features of Interest

6. Annotate with biological Information (GO, KEGG, Sequence motifs etc)

7. Other Supervised analysis or Machine Learning

Initial Data QC

Affy QC Values

Boxplot and Histogram of data


Overview of the raw data

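Box – spans the 25th to 75th percentiles; its length is the inter-quartile range (IQR)

Middle line – Median

Whiskers – extend to the most extreme data points lying within roughly 1.5 * IQR of the box; points beyond are drawn individually as outliers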


Median

"Middle value" of a list.

Odd number of entries; median = middle entry of sorted list

Even number of entries; median = sum of the two middle (after sorting) numbers divided by two.

Median can be estimated from a histogram by finding the smallest number such that the area under the histogram to the left of that number is 50%
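
A minimal sketch of the odd/even rule above. The function name `median_of` and the two small example lists are made up for illustration.

```python
def median_of(values):
    """Return the "middle value" of a list using the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of entries: the middle entry of the sorted list.
        return ordered[mid]
    # Even number of entries: sum of the two middle entries divided by two.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = [7, 1, 5]       # sorted: 1 5 7
even_example = [7, 1, 5, 3]   # sorted: 1 3 5 7
print(median_of(odd_example), median_of(even_example))
```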


Inter-quartile range (IQR)

• Another dataset: 35 47 48 50 51 53 54 70 75

• Split into two halves, each including the median (51):
  – 35 47 48 50 51
  – 51 53 54 70 75

• Find the median of each half.

• 1st quartile = 48
• 3rd quartile = 54
• IQR = 54 - 48 = 6

So what is the IQR for 35 47 48 50 51 53 54 60 70 75?

• Split the data into two halves (an even number of entries, so no value is shared):
  – 35 47 48 50 51
  – 53 54 60 70 75

• Median of each half: 1st quartile = 48; 3rd quartile = 60.

• Hence the IQR is 60 - 48 = 12.
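
The split-into-halves rule can be sketched as below. The function names are hypothetical, and other quartile conventions (for example interpolated percentiles) can give slightly different values on the same data.

```python
from statistics import median


def quartiles_by_halves(values):
    """Quartiles as the medians of the lower and upper halves.

    With an odd number of entries the overall median is included in
    both halves, as in the first worked example.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(upper)


def iqr(values):
    q1, q3 = quartiles_by_halves(values)
    return q3 - q1


nine = [35, 47, 48, 50, 51, 53, 54, 70, 75]
ten = [35, 47, 48, 50, 51, 53, 54, 60, 70, 75]
print(iqr(nine), iqr(ten))
```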

[Figure: Histogram of celfile.data. x-axis: celfile.data (6 to 14); y-axis: Frequency; vertical lines mark the Median and the Mean]

hist(celfile.data)
abline(v=mean(celfile.data), col="blue", lwd=2)
abline(v=median(celfile.data), col="red", lwd=2)

[Figure: Log(ratio) Histogram. x-axis: Log(ratio), -2 to 2; y-axis: Frequency, 0 to 3000]

Log2(ratio) measures treat up- and down-regulated genes equally

log2(1) = 0    log2(2) = 1    log2(1/2) = -1
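
A small sketch of why the log2 scale is symmetric: a ratio and its reciprocal map to values of equal size and opposite sign. The list of ratios is an illustrative choice.

```python
import math

# Fold changes up (2, 4) and the matching fold changes down (1/2, 1/4).
ratios = [1, 2, 1 / 2, 4, 1 / 4]
log_ratios = [math.log2(r) for r in ratios]

for ratio, log_ratio in zip(ratios, log_ratios):
    print(f"ratio {ratio:<5} -> log2 ratio {log_ratio:+.1f}")
```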

Initial Data Quality Checks

• Boxplot, Histogram

• RNA digestion plot

• Affymetrix QC parameters – bioB spike-ins, %P, average background, scale factor.
  – affy.qc in library(simpleaffy)

• Image plots of probe level measures (affyPLM)
  – Residuals.
    • Larger residuals (darker) indicate deviations from model
  – Normalized Unscaled Standard Errors (NUSE) plot.
    • Gene standard error estimates from fitPLM standardized across arrays (median SE=1). An array with elevated SEs relative to other arrays is typically of lower quality.
  – Relative Log Expression (RLE) values.
    • probeset expression value - median expression value across all arrays. Ideally RLE ~ 0.

• Exploratory analysis: Clustering/COA
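
The RLE definition in the list above (probeset value minus that probeset's median across all arrays) can be sketched as follows. The probeset names and log2 values are invented for illustration, and this is not the affyPLM implementation.

```python
from statistics import median


def relative_log_expression(expression):
    """expression maps a probeset id to its log2 values, one per array."""
    rle = {}
    for probeset, values in expression.items():
        centre = median(values)  # median of this probeset across all arrays
        rle[probeset] = [v - centre for v in values]
    return rle


def array_medians(rle):
    """Median RLE per array; ideally close to 0 for every array."""
    n_arrays = len(next(iter(rle.values())))
    return [median(vals[a] for vals in rle.values()) for a in range(n_arrays)]


# Made-up data: 3 probesets on 4 arrays, where the 4th array reads high.
expression = {
    "probeset_a": [7.0, 7.2, 6.8, 9.0],
    "probeset_b": [10.0, 10.2, 9.8, 12.0],
    "probeset_c": [5.0, 5.2, 4.8, 7.0],
}
rle = relative_log_expression(expression)
medians = array_medians(rle)
print([round(m, 2) for m in medians])
```

An array whose median RLE sits far from 0, like the fourth one here, would be flagged for a closer look.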


Preprocessing, normalisation, error models, quality control


Goal of a microarray study

• Detect number of RNA molecules

• Actually measure fluorescence intensity of spot

INDIRECT MEASUREMENT

Normalisation aims to reduce systematic noise introduced in measurement


Expt1 Expt2 Expt3 Expt4 Expt5 Expt6

Gene 1 -3 -3 -1 0 2 3

Gene 2 -2 -2 0 1 2 2

Gene 3 -3 -2 0 1 2 3

Gene 4 3 2 0 -1 -2 -3

Gene 5 2 2 1 0 -2 -3

Gene 6 3 2 1 0 -2 -3

Gene 7 2 2 2 2 2 2

Gene 8 -2 -2 -2 -2 -2 -2

Raw data are not mRNA concentrations

o tissue contamination

o clone identification and mapping

o image segmentation

o RNA degradation

o PCR yield, contamination

o signal quantification

o amplification efficiency

o spotting efficiency

o ‘background’ correction

o reverse transcription efficiency

o DNA-support binding

o hybridization efficiency and specificity

o other array manufacturing-related issues

Early Normalization Approaches: Total Intensity

Conceptually simple

Assumption: total RNA (mass) is the same in all samples.

Use a scaling factor… (still used in MAS5.0)

Normalization factor:

$$N = \frac{\sum_{k=1}^{N_{\mathrm{array}}} R_k}{\sum_{k=1}^{N_{\mathrm{array}}} G_k}$$

Normalization: $G'_k = N\,G_k$ and $R'_k = R_k$.
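
A sketch of total intensity normalization for one two-colour array, assuming the normalization factor is the ratio of summed red to summed green intensities and that only the green channel is rescaled. The function names and intensity values are made up.

```python
def total_intensity_factor(red, green):
    """Ratio of total red to total green intensity over all spots."""
    return sum(red) / sum(green)


def normalize_total_intensity(red, green):
    """Rescale green so both channels have the same total; red is unchanged."""
    factor = total_intensity_factor(red, green)
    return list(red), [factor * g for g in green]


# Hypothetical spot intensities for the two channels of one array.
red = [1200.0, 300.0, 4500.0, 800.0]
green = [600.0, 150.0, 2000.0, 450.0]

factor = total_intensity_factor(red, green)
red_norm, green_norm = normalize_total_intensity(red, green)
print(factor, sum(red_norm), sum(green_norm))
```

A single factor shifts every log-ratio by the same constant, so it cannot correct intensity-dependent bias.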

Normalize to scaling factor

Normalized to the 75th percentile

Not influenced by outliers

Still too much below the line

Why a scaling factor is not sufficient

[Figure: panels labelled same-same, 2 fold, log-ratio, same-same]


The two-component model

raw scale log scale

“additive” noise

“multiplicative” noise

B. Durbin, D. Rocke, JCB 2001
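
For reference, the two-component model is usually written as below. The symbols are not on the slide and are supplied here as an assumption: y is the measured intensity, μ the true expression level, α the mean background, η the "multiplicative" error and ε the "additive" error.

```latex
y = \alpha + \mu\, e^{\eta} + \varepsilon,
\qquad \eta \sim N(0, \sigma_\eta^2),
\qquad \varepsilon \sim N(0, \sigma_\varepsilon^2)
```

At high μ the multiplicative term dominates, so the spread is roughly constant on the log scale; at low μ the additive term dominates, so the spread is roughly constant on the raw scale.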

Quantile Normalisation

Outliers are not tolerated

The distribution of intensities across every slide is forced to be the same.
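
A minimal sketch of quantile normalisation: each value is replaced by the mean, across slides, of the values sharing its rank, so every slide ends up with an identical distribution. Tie handling is simplified here (ties are broken by position), and the two example slides are made up.

```python
def quantile_normalize(arrays):
    """Force every array (a list of intensities) onto one shared distribution."""
    n_values = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value over all arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_values)
    ]
    normalized = []
    for a in arrays:
        # Indices of this array ordered from smallest to largest intensity.
        order = sorted(range(n_values), key=lambda i: a[i])
        out = [0.0] * n_values
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


slide_1 = [2.0, 5.0, 4.0, 3.0]
slide_2 = [4.0, 14.0, 8.0, 9.0]
norm_1, norm_2 = quantile_normalize([slide_1, slide_2])
print(norm_1)
print(norm_2)
```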

Observe: Intensity-dependent structure

Lowess Normalization


Straightens the banana!

Standard deviation regularization (in TM4 MIDAS)

Assumption: log-ratio standard deviations within each block or slide are the same.

Variance regularization can remove the bias


Platform Problems


Spotted Array Platform Specific

– “In house” printing effects

– Regional effects within and between print-tips

– Need regional plate and print-tip lowess normalisation

PCR plates


spotting pin quality decline

after delivery of 3 x 10^5 spots

after delivery of 5 x 10^5 spots

H. Sueltmann DKFZ/MGA

Affymetrix Platform Specific

– Probe level effect. Need a single gene expression measure from the 11 probes in a probeset


Affymetrix Probe sets

Probe set summarization

$PM_{ijg}$, $MM_{ijg}$: intensities for perfect match and mismatch probe j for gene g in chip i

Need to summarize each probe set, i.e., 16 PM, MM pairs, into a single expression measure.

Expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:

o sort $d_j = PM_j - MM_j$
o exclude highest and lowest value
o J := those pairs within 3 standard deviations of the average

$$\mathrm{AvDiff} = \frac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$$

Expression measures: MAS 5.0

Instead of MM, use "repaired" version CT:

CT = MM if MM < PM
CT = PM / "typical log-ratio" if MM >= PM

"Signal" = Tukey.Biweight(log(PM - CT)) (… ≈ median)

Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x| < c, 0 otherwise
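
A sketch of a one-step Tukey biweight location estimate built on the weight function B above. The scaling by the median absolute deviation and the constants `c=5.0` and `epsilon=1e-4` are assumptions made for this illustration, not values taken from the slide.

```python
from statistics import median


def biweight(u):
    """B(u) = (1 - u^2)^2 for |u| < 1, else 0 (u is x / c, already scaled)."""
    return (1 - u * u) ** 2 if abs(u) < 1 else 0.0


def tukey_biweight_location(values, c=5.0, epsilon=1e-4):
    """Weighted mean that down-weights values far from the median."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)  # median absolute deviation
    weights = [biweight((v - centre) / (c * mad + epsilon)) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical log(PM - CT) values for one probe set, with one wild probe.
probe_values = [1.0, 2.0, 3.0, 4.0, 100.0]
signal = tukey_biweight_location(probe_values)
print(round(signal, 2), sum(probe_values) / len(probe_values))
```

The outlying probe receives weight 0, so the estimate stays near the median while the plain mean is pulled far upwards.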

Expression measures: Li & Wong

dChip fits a model for each gene

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$

where
– θi: expression index for the gene in chip i
– φj: probe sensitivity

Need at least 10 or 20 chips.

Invariant set

Robust expression measures RMA: Irizarry et al. (2002)

AvDiff-like

$$\mathrm{RMA} = \frac{1}{|A|} \sum_{j \in A} \log_2 (PM_j - BG_j)$$

with A a set of “suitable” pairs.

Estimate RMA = $a_i$ for chip i using the robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
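
The median polish step (successively remove row and column medians, accumulate terms, until convergence) can be sketched as below. Rows stand for probes and columns for chips; the table is made-up, purely additive data, and this is an illustration of the idea rather than the RMA implementation.

```python
from statistics import median


def median_polish(table, max_iter=20, tol=1e-9):
    """Fit table[i][j] ~ overall + row[i] + col[j] by sweeping out medians."""
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and accumulate them in the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Remove column medians and accumulate them in the column effects.
        col_med = [
            median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)
        ]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
            col_eff[j] += col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


# Hypothetical log2 intensities: 3 probes (rows) on 3 chips (columns).
probe_effects = [-1.0, 0.0, 1.0]
chip_effects = [0.0, 0.5, 2.0]
table = [[8.0 + p + c for c in chip_effects] for p in probe_effects]

overall, row_eff, col_eff, resid = median_polish(table)
chip_expression = [overall + c for c in col_eff]  # one value per chip
print(chip_expression)
```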


Comparative MvA plots

MAS5

dChip

RMA

Irizarry et al.

Affymetrix: $I_{PM} = I_{MM} + I_{specific}$ ?

[Figure: distribution of log(PM/MM), with 0 marked]

From: R. Irizarry et al., Biostatistics 2002

Probe-response calibration

$$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$$

position- and sequence-specific effects $w_i(s)$: Naef et al., Phys Rev E 68 (2003)

Comparison of these Affy methods

• 2 test datasets
  – Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
  – Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts

• 15 quality benchmarks
  – reproducibility
  – sensitivity
  – specificity

Put together by Rafael Irizarry (Johns Hopkins)

http://affycomp.biostat.jhsph.edu

affycomp results (28 Sep 2003)

[Figure: benchmark results, scale running from good to bad]


Raw Data


Genes >2 fold different


Mas5.0 VSN gcRMA


Normalisation

Red >2 fold difference in gcRMA normalised data


Red >2 fold difference in gcRMA normalised data

[Figure: Fold Increase by Method (gcRMA, vsn, MAS5.0) for Gzmg:1422867_at, Gzmd:1420343_at, Birc1e:1421525_a_at, Tgfbi:1415871_at, Pdgfb:1450413_at. x-axis: Method; y-axis: Fold Increase, 0 to 10]

Selected 5 “follow up” vsn genes. These had similar profiles in gcRMA and MAS5.0

[Figure: Fold Change for gcRMA, MAS5, VSN]

Recap

• Normalisation
  – Log or glog
  – Scale to a number, lowess, quantile, variance stabilising

• Spotted
  – Within & between plate, print-tip etc

• Affymetrix
  – MAS4.0, MAS5.0, RMA, gcRMA, Li&Wong

• With above normalisation methods

Are these methods always valid?

• Mas5.0, RMA, gcRMA and vsn all assume that the sum of RNA is constant (same no of genes up and down)

• THIS IS NOT ALWAYS TRUE
  – k/o of pol II
  – Blocking methylation/translation etc

Normalising to an external set of genes

• Housekeeping
  – Not a good idea

• Li & Wong
  – Transform using non-linear smooth curves
  – Uses rank invariant probes
  – Available in dChip and R
  – Cheng Li & Wing Hung Wong (2001a) PNAS 98, 31-36

• Spike in Controls
  – External RNA
  – van de Peppel et al., (2003) EMBO Rep. 4(4):387-93.

Colon Cancer Data

• Fresh-frozen human colorectal tumours.

• N=6
  – Whole tumour N=3
  – Parenchymal fraction (LCM dissected)

• On Affymetrix U133plus2 chips
  – 54675 probesets

Normalised data MAS, RMA

[Figure panels: RMA Normalisation; MAS5.0 Normalisation]

Normalised data VSN, Li & Wong

[Figure panels: Li & Wong Normalisation; VSN Normalisation]

Normalisation Matters!

Li & Wong

RMA

MAS 5.0

Many Normalisation methods

Need to consider best one for your experimental design

Most normalisation methods assume sum of mRNAs is equal


Exploratory Data Analysis: Clustering and Ordination

Aedín Culhane, Dana-Farber Cancer Institute/Harvard School of Public

Health.

Microarray data analysis

Microarrays produce:

• 10,000s of variables simultaneously

• Multivariate data

• Essential to use exploratory data analysis to “get handle” on data


Typical Analysis of Microarrays

1. Read in Data 2. Explore Raw Data 3. Preprocess & Normalize Data…

Then again Explore data

4. Unsupervised data analysis (Exploratory Analysis)

5. Select Features of Interest.. Include additional biological Information (GO, KEGG, Sequence motifs etc)

6. Other Supervised analysis or Machine Learning

Importance of Data Exploration

• Exploration of Data is Critical
  – Detect unpredicted patterns in data
  – Decide what questions to ask

• Clustering
  – Hierarchical
  – Flat (k-means)

• Ordination (Dimension Reduction)
  – Principal Component analysis, Correspondence analysis

A Distance Metric

• The choice of metric is fundamental

• Exploratory analysis – you only discover where you explore.


Expt1 Expt2 Expt3 Expt4 Expt5 Expt6

Gene 1 -3 -3 -1 0 2 3

Gene 2 -2 -2 0 1 2 2

Gene 3 -3 -2 0 1 2 3

Gene 4 3 2 0 -1 -2 -3

Gene 5 2 2 1 0 -2 -3

Gene 6 3 2 1 0 -2 -3

Gene 7 2 2 2 2 2 2

Gene 8 -2 -2 -2 -2 -2 -2

Sample set of gene expression values

Back to our 8 Genes – Create a distance matrix

[Figure: Expression of 8 genes in 6 arrays. x-axis: arrays, 1 to 6; y-axis: log ratio, -4 to 4; one line each for Gene 1 to Gene 8]

Distance Metrics

• Euclidean distance
• Pearson correlation coefficient
• Spearman rank
• Manhattan distance
• Mutual information
• etc

Each has different properties and can reveal different features of the data

Distance

Similarity

Distance Metrics

         Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Gene A   x_1A    x_2A    x_3A    x_4A    x_5A    x_6A
Gene B   x_1B    x_2B    x_3B    x_4B    x_5B    x_6B

1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$

2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$

3. Pearson correlation

[Figure: points p_A and p_B]
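
A sketch of the three metrics for two gene profiles measured over six experiments. The profiles are invented: gene B has exactly the shape of gene A, shifted up by 10, so the two distances are large while the correlation is +1.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A flat profile has no shape to correlate with.
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]  # same shape, shifted by 10

print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b), pearson(gene_a, gene_b))
```

Euclidean and Manhattan distances respond to the size of the difference, Pearson correlation only to the shape of the profile, which is why the choice of metric changes what a clustering finds.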

Distance Is Defined by a Metric

Distance Metric:   Euclidean   Pearson*
D                  6.0         +1.00
D                  1.4         -0.05

[Figure: pairs of expression profiles. y-axis: log2(cy5/cy3), -3 to 3]

Warning: Correlations gone wrong

[Figure: two scatter plots of y against x, labelled corr=0.87 and corr=0.04]

Clustering: Distance metrics

Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Expt1 Expt2 Expt3 Expt4 Expt5 Expt6

Gene 1 -3 -3 -1 0 2 3

Gene 2 -2 -2 0 1 2 2

Gene 3 -3 -2 0 1 2 3

Gene 4 3 2 0 -1 -2 -3

Gene 5 2 2 1 0 -2 -3

Gene 6 3 2 1 0 -2 -3

Gene 7 2 2 2 2 2 2

Gene 8 -2 -2 -2 -2 -2 -2

$$\mathrm{Dist}(\text{gene 1}, \text{gene 2}) = \sqrt{(-3+2)^2 + (-3+2)^2 + (-1-0)^2 + (0-1)^2 + (2-2)^2 + (3-2)^2} = \sqrt{5} = 2.236068 \approx 2.24$$

[Figure: Expression of 8 genes in 6 arrays. x-axis: arrays, 1 to 6; y-axis: log ratio, -4 to 4; one line each for Gene 1 to Gene 8]

Distance Matrix

         Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  Gene 7  Gene 8
Gene 1   0       2.24    1.73    10.72   10.30   10.82   8.00    6.93
Gene 2           0       1.41    9.27    8.66    9.17    6.08    6.71
Gene 3                   0       10.39   9.75    10.30   6.86    7.42
Gene 4                           0       1.73    1.41    7.42    6.86
Gene 5                                   0       1       6.78    6.78
Gene 6                                           0       6.86    7.42
Gene 7                                                   0       9.80
Gene 8                                                           0

Symmetric. Now need to decide what is closest.
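
The matrix above can be reproduced from the sample set of 8 genes with a few lines. This is a sketch using the standard library's `math.dist` (Python 3.8+); the dictionary name `GENES` is an illustrative choice.

```python
import math

# Sample set of gene expression values (8 genes in 6 experiments).
GENES = {
    "Gene 1": [-3, -3, -1, 0, 2, 3],
    "Gene 2": [-2, -2, 0, 1, 2, 2],
    "Gene 3": [-3, -2, 0, 1, 2, 3],
    "Gene 4": [3, 2, 0, -1, -2, -3],
    "Gene 5": [2, 2, 1, 0, -2, -3],
    "Gene 6": [3, 2, 1, 0, -2, -3],
    "Gene 7": [2, 2, 2, 2, 2, 2],
    "Gene 8": [-2, -2, -2, -2, -2, -2],
}

names = list(GENES)
# Euclidean distance between every pair of gene profiles.
distance = {(a, b): math.dist(GENES[a], GENES[b]) for a in names for b in names}

for a in names:
    print(a, " ".join(f"{distance[a, b]:6.2f}" for b in names))

# The closest pair of distinct genes is the first candidate for merging.
pairs = [(a, b) for a in names for b in names if a < b]
closest = min(pairs, key=lambda pair: distance[pair])
print(closest, distance[closest])
```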

Comparison of Linkage Methods

          Single   Average   Complete
Join by   min      average   max

         Gene 1  Gene 2  Gene 3  Gene 4  Gene 5  Gene 6  Gene 7  Gene 8
Gene 1   0       2.24    1.73    10.72   10.30   10.82   8.00    6.93
Gene 2           0       1.41    9.27    8.66    9.17    6.08    6.71
Gene 3                   0       10.39   9.75    10.30   6.86    7.42
Gene 4                           0       1.73    1.41    7.42    6.86
Gene 5                                   0       1       6.78    6.78
Gene 6                                           0       6.86    7.42
Gene 7                                                   0       9.80
Gene 8                                                           0

5,6 are closest (dist = 1) so merge these

           Gene 1  Gene 2  Gene 3  Gene 4  Gene 5,6  Gene 7  Gene 8
Gene 1     0       2.24    1.73    10.72   10.30     8.00    6.93
Gene 2             0       1.41    9.27    8.66      6.08    6.71
Gene 3                     0       10.39   9.75      6.86    7.42
Gene 4                             0       1.41      7.42    6.86
Gene 5,6                                   0         6.78    6.78
Gene 7                                               0       9.80
Gene 8                                                       0

[Figure: partial dendrogram joining Gene 5 with Gene 6, and Gene 2 with Gene 3]

           Gene 1  Gene 2,3  Gene 4  Gene 5,6  Gene 7  Gene 8
Gene 1     0       1.73      10.72   10.30     8.00    6.93
Gene 2,3           0         9.27    8.66      6.08    6.71
Gene 4                       0       1.41      7.42    6.86
Gene 5,6                             0         6.78    6.78
Gene 7                                         0       9.80
Gene 8                                                 0

               Gene 1  Gene 2,3  Gene 4,(5,6)  Gene 7  Gene 8
Gene 1         0       1.73      10.30         8.00    6.93
Gene 2,3               0         8.66          6.08    6.71
Gene 4,(5,6)                     0             6.78    6.78
Gene 7                                         0       9.80
Gene 8                                                 0

… continue, join 1 to (2,3) at 1.73

…. until done
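
The merging procedure can be sketched as a short agglomerative loop. The merged tables take the minimum distance between clusters, which corresponds to single linkage; passing `max` or the `average` helper instead gives complete or average linkage. The function names are hypothetical and the brute-force search is only meant for small examples.

```python
import math
from itertools import combinations

GENES = {
    "Gene 1": [-3, -3, -1, 0, 2, 3],
    "Gene 2": [-2, -2, 0, 1, 2, 2],
    "Gene 3": [-3, -2, 0, 1, 2, 3],
    "Gene 4": [3, 2, 0, -1, -2, -3],
    "Gene 5": [2, 2, 1, 0, -2, -3],
    "Gene 6": [3, 2, 1, 0, -2, -3],
    "Gene 7": [2, 2, 2, 2, 2, 2],
    "Gene 8": [-2, -2, -2, -2, -2, -2],
}


def average(distances):
    return sum(distances) / len(distances)


def agglomerate(profiles, linkage=min):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []

    def between(a, b):
        # Linkage rule applied to all pairwise distances between two clusters.
        return linkage(
            [math.dist(profiles[x], profiles[y]) for x in a for y in b]
        )

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


single = agglomerate(GENES, linkage=min)
for left, right, height in single:
    print(f"{height:5.2f}  {left} + {right}")
```

The printed heights are the levels at which branches join in the dendrogram.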

[Figure: partial dendrogram with leaves Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6]

Hierarchical clustering assembles a number of items into a tree in which items are joined by short branches if they are very similar to each other, and by increasingly longer branches as their similarity decreases.

[Figure: Cluster Dendrogram. Leaf order: Gene 4, Gene 5, Gene 6, Gene 8, Gene 7, Gene 1, Gene 2, Gene 3; y-axis: Height, 1 to 7]


Heatmap….. Eisen Plots

Interpreting a Dendrogram

Hierarchical analysis results viewed using a dendrogram (tree)

• Distance between nodes (Scale)
• Ordering of nodes not important (like baby mobile)

[Figure: panels A and B]


Limitations of hierarchical clustering

• Samples compared in a pair wise manner

• Hierarchy forced on data

• Sometimes difficult to visualise if large data

• Overlapping clustering or time/dose gradients ?


Complementary Approach: ordination


Not this kind of ordination


Ordination- In multivariate statistics

1. Arrangement of units in some order

2. Representation of objects as points along one or several axes of reference (Gower 1984)


Complementary methods

Cluster analysis generally investigates pairwise distances/similarities among objects looking for fine relationships

Ordination in reduced space considers the variance of the whole dataset thus highlighting general gradients/patterns

(Legendre and Legendre, 1998)


Many publications present both

Page 83: This is a good time to be doing Microarray Data Analysis · Typical Microarray study 1. Read in Data 2. Explore Raw Data 3. Preprocess & Normalize Data… Then again Explore data

Ordination

• Also referred to as latent variable analysis or dimension reduction

• Aim:

Find axes onto which the data can be projected so as to explain as much of the variance in the data as possible


Principal Axes

• Project new axes through the data which capture variance. Each represents a different trend in the data.

• Orthogonal (decorrelated)

• Typically ranked: First axes most important

• Principal axis, Principal component, latent variable or eigenvector


Dimension Reduction (Ordination)

Principal Components pick out the directions in the data that capture the greatest variability

[Figure: data cloud on the original x, y, z axes with New Axis 1, New Axis 2 and New Axis 3 drawn through it]


Eigenvalues

• Describe the amount of variance (information) in eigenvectors

• Ranked. First eigenvalue is the largest.

• Generally only examine 1st few components – scree plot
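
As a rough illustration of principal axes and their eigenvalues, the sketch below extracts the first two axes of a tiny, made-up expression matrix (rows are arrays, columns are genes). It uses power iteration with deflation on the covariance matrix purely to stay dependency-free; this is an assumption for illustration, not the method used in the talk, and the names `covariance_matrix`, `leading_eigenpair` and `principal_axes` are hypothetical.

```python
import math
import random


def covariance_matrix(rows):
    """Covariance between columns (genes); rows are observations (arrays)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[v - m for v, m in zip(row, means)] for row in rows]
    p = len(means)
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    rng = random.Random(seed)
    p = len(matrix)
    vec = [rng.random() + 0.1 for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    # Rayleigh quotient gives the eigenvalue for a unit vector.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p))
                for i in range(p))
    return value, vec


def principal_axes(rows, n_axes):
    """Return [(eigenvalue, eigenvector), ...] ranked from largest eigenvalue."""
    cov = covariance_matrix(rows)
    p = len(cov)
    axes = []
    for _ in range(n_axes):
        value, vec = leading_eigenpair(cov)
        axes.append((value, vec))
        # Deflation: remove the variance captured by this axis.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    return axes


# Hypothetical data: 5 arrays x 3 genes; genes 1 and 2 rise together.
expression = [
    [2.0, 1.9, 0.50],
    [4.1, 4.0, 0.40],
    [6.0, 6.2, 0.60],
    [8.1, 7.9, 0.50],
    [9.9, 10.1, 0.45],
]
full_cov = covariance_matrix(expression)
total_variance = sum(full_cov[i][i] for i in range(len(full_cov)))
axes = principal_axes(expression, 2)
explained = [value / total_variance for value, _ in axes]
print("proportion of variance per axis:", explained)
```

The `explained` values are what a scree plot would display: here the first axis carries almost all of the variance, so only that component would normally be examined.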


Choosing number of Eigenvalues: Scree Plot

[Figure: scree plot of the eigenvalues]

Maximum number of eigenvalues/eigenvectors = min(nrow, ncol) - 1


Typical Analysis

• Ordination of the data matrix X
• Plot of eigenvalues, select number.
• Plot PC1 v PC2, etc.
• Array Projection, Gene Projection


Ordination of Gene Expression Data


Ordination Methods

• Most common:
– Principal component analysis (PCA)
– Correspondence analysis (COA or CA)
– Nonmetric multidimensional scaling (NMDS, MDS)
– Principal co-ordinate analysis (PCoA)


Books/Book Chapters:
1. Legendre, P., and Legendre, L. 1998. Numerical Ecology, 2nd English Edition. Elsevier, Amsterdam.
2. Wall, M., Rechtsteiner, A., and Rocha, L. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis (eds. D.P. Berrar, W. Dubitzky, and M. Granzow), pp. 91-109. Kluwer, Norwell, MA.

Papers:
1. Alter, O., Brown, P.O., and Botstein, D. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A 97: 10101-10106.
2. Culhane, A.C., Perriere, G., Considine, E.C., Cotter, T.G., and Higgins, D.G. 2002. Between-group analysis of microarray data. Bioinformatics 18: 1600-1608.
3. Culhane, A.C., Perriere, G., and Higgins, D.G. 2003. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics 4: 59.
4. Fellenberg, K., Hauser, N.C., Brors, B., Neutzner, A., Hoheisel, J.D., and Vingron, M. 2001. Correspondence analysis applied to microarray data. Proc Natl Acad Sci U S A 98: 10781-10786.
5. Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572.
6. Raychaudhuri, S., Stuart, J.M., and Altman, R.B. 2000. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput: 455-466.
7. Wouters, L., Gohlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., and Lewi, P.J. 2003. Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 59: 1131-1139.

Reviews:
1. Quackenbush, J. 2001. Computational analysis of microarray data. Nat Rev Genet 2: 418-427.


Detecting differentially expressed genes


Normal distribution

X = μ (mean of the distribution)

σ = standard deviation of the distribution


Estimating a mean


Estimating a mean


All had the same mean and SD


Population 1 (Mean 1)

Population 2 (Mean 2)

Sample mean “s”

Less than a 5% chance that the sample with mean s came from Population 1

s is significantly different from Mean 1 at the p < 0.05 significance level.

But we cannot reject the hypothesis that the sample came from Population 2


Probability distributions

The probability of an event is the likelihood of its occurring.

It is sometimes computed as a relative frequency (rf), where

$$rf = \frac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$$

The probability of an event can sometimes be inferred from a “theoretical” probability distribution, such as a normal distribution.


Normality, Probability and Expression Data

Many biological variables, such as height and weight, can reasonably be assumed to approximate the normal distribution.

But expression measurements? Probably not.

Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests.

Randomization / resampling based tests can be used to get around the violation of the normality assumption.
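
To make the randomization idea concrete, here is a minimal sketch of an exact permutation test on the difference in group means. It makes no normality assumption: the null distribution is built by re-dividing the pooled values in every possible way. The function name `permutation_p_value` and the toy values are hypothetical, and full enumeration is only practical for very small groups.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    as_extreme = 0
    total = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group A".
    for chosen in combinations(range(len(pooled)), n_a):
        chosen = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total


# Toy log-expression values for one gene in two classes (made up).
p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
print(p)
```

With three samples per class there are only 20 possible relabellings, so the smallest attainable two-sided p-value is 2/20; this is one reason permutation tests struggle with few replicates.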


IMPORTANT CONCEPT No 2


IMPORTANT CONCEPT No 2

                         True Value (with Disease)
Test Prediction    TRUE               FALSE
+ve                True Positive      False Positive     Positive Predictive Value
-ve                False Negative     True Negative      Negative Predictive Value
                   Sensitivity        Specificity        Accuracy
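
The quantities in the margins of this table follow directly from the four cell counts. A small sketch, using standard definitions and made-up counts (the function name `confusion_metrics` is hypothetical):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Summary rates from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true cases detected
        "specificity": tn / (tn + fp),   # fraction of non-cases correctly rejected
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Made-up counts for illustration only.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```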


[Figure: one axis labelled "← bias / accuracy →", the other "← precision / variance →"]


Another view: ROC Curve

[Figure: ROC curve, Sensitivity (y-axis) against 1 - specificity (x-axis)]


Basic dogma of data analysis:

You can always increase sensitivity at the cost of specificity, or vice versa; the art is to find the sweet spot.

(It can also be possible to increase both by a better choice of method / model)
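
The trade-off can be seen by sweeping a decision threshold over classifier scores and recording sensitivity and 1 - specificity at each step, which traces out an ROC curve. The sketch below is illustrative only: the scores, labels and function names `roc_points` and `area_under_curve` are made up.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up scores (higher = more likely positive) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve)
print(auc)
```

Each lower threshold moves along the curve: sensitivity never decreases, but the false positive rate never decreases either. A better method shifts the whole curve toward the top-left corner.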


Finding Significant Genes

Our goal is to find genes that are significantly different between classes


How?

• Fold Change
• T-statistic
• Modified t-statistic
• Other methods


Fold Change

• Only looks at the difference in the means of two groups

• Unreliable in microarrays

• Why? We can’t get a good estimate of the mean because there are too few cases


Average Fold Change Difference for each gene

Suffers from being arbitrary and not taking into account systematic variation in the data


Finding Significant Genes

t-test for each gene

• Tests whether the means of the query and reference groups are the same
• Essentially measures signal-to-noise
• Calculate p-value (permutations or distributions)
• May suffer from intensity-dependent effects


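$$t\text{-statistic} = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between means}}{\text{variability of groups}} = \frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{s_Y^2}{N} + \dfrac{s_X^2}{M}}}$$

Where $\bar{Y}$ and $\bar{X}$: the means of the two groups

$s^2$: square of the SD, i.e. the variances

$N$ and $M$: the number of samples in each group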
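
A minimal sketch of this statistic for a single gene, using the unequal-variance form in which each group's sample variance is divided by its own group size. The function name `t_statistic` and the toy values are hypothetical.

```python
import math
from statistics import mean, variance


def t_statistic(y, x):
    """(mean(Y) - mean(X)) / sqrt(s_Y^2 / N + s_X^2 / M) for one gene."""
    signal = mean(y) - mean(x)
    noise = math.sqrt(variance(y) / len(y) + variance(x) / len(x))
    return signal / noise


# Toy log-expression values for one gene (made up).
query = [5.0, 6.0, 7.0]
reference = [1.0, 2.0, 3.0]
t = t_statistic(query, reference)
print(t)
```

Note that `statistics.variance` is the sample variance (divides by n - 1), which is the usual choice for s² in this formula.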


t-tests

[Figure panels: "A significant difference"; "Probably not"]


Estimating the variance

The t-test compares the difference between group means to the standard deviation of the data within groups

F-test (ANOVA) is a generalization of this idea to more than 2 groups

But with few replicates, estimates of SE are not stable. This explains why the t-test is not powerful


Moderated t-statistics

• There are many proposals for estimating variation
• Many share information across genes
• Empirical Bayesian Approaches are popular
• SAM, an ad-hoc procedure, is even more popular
• Many are what some call “moderated” t-tests


Some Examples of Tests

Notation:

– T is average log expression of Treatment group
– C is average log expression of Control group
– S is SD

• Tests:
– Average log fold-change: (T-C)
– t-statistic: (T-C) / S
– SAM shrunken t-statistic: (T-C) / (S + S0)
– Bayesian posteriors: (T-C) / √(S² + K²)
– Wilcoxon Rank test

Note: taking the log before averaging is important
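
The scores above can be compared side by side on two toy genes. This is only a sketch under stated assumptions: S is taken to be the standard error of the difference in means, and the constants `s0` and `k` are arbitrary placeholders rather than values estimated from data as SAM or an empirical Bayes method would do. The function name `score_gene` is hypothetical.

```python
import math
from statistics import mean, variance


def score_gene(treatment, control, s0=0.5, k=0.5):
    """Fold change, t, SAM-style and Bayes-style scores for one gene.

    Inputs are assumed to be log-scale expression values, so the mean is
    taken after the log transform. s0 and k are arbitrary illustrative
    constants (assumption), not fitted values.
    """
    diff = mean(treatment) - mean(control)                    # T - C
    s = math.sqrt(variance(treatment) / len(treatment)
                  + variance(control) / len(control))          # S (assumed SE)
    return {
        "fold_change": diff,
        "t": diff / s,
        "sam": diff / (s + s0),
        "bayes": diff / math.sqrt(s * s + k * k),
    }


# Gene with a tiny difference but almost no measured variability.
quiet = score_gene([8.01, 8.00, 8.02], [7.90, 7.91, 7.89])
# Gene with a large difference and ordinary variability.
strong = score_gene([10.0, 11.0, 12.0], [7.0, 8.0, 9.0])
print(quiet)
print(strong)
```

The ordinary t-statistic ranks the "quiet" gene first because its variance estimate happens to be tiny; adding a constant to the denominator reverses the ranking, which is the motivation for moderated statistics.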


One final problem

Once you have a score for each gene, how do you decide on a cut-off? p-values are popular. Are they appropriate?

For each gene, test the null hypothesis: no differential expression.

Notice that if you look at 10,000 genes for which the null is true, you expect to see 500 attain p-values below 0.05

This is called the multiple comparison problem. Statisticians fight about it. But not about the above.

Main message: p-values can’t be interpreted in the usual way
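
The arithmetic behind the 500 is simply the number of null genes times the cut-off, and a quick simulation shows the same thing. The sketch relies on the fact that p-values are uniformly distributed when the null hypothesis is true; the seed and function names are arbitrary.

```python
import random


def expected_false_positives(n_null_genes, alpha):
    """Expected number of null genes with p < alpha."""
    return n_null_genes * alpha


def simulate_null_p_values(n_genes, seed=1):
    """Under the null, p-values are uniform on [0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_genes)]


expected = expected_false_positives(10_000, 0.05)
p_values = simulate_null_p_values(10_000)
observed = sum(p < 0.05 for p in p_values)
print(expected, observed)
```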


Multiple testing

Popular solutions are either

• slash the p-value – Bonferroni or permutation correction

• or report the false discovery rate (FDR) instead of the false positive rate (FPR).
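
Both kinds of solution can be sketched in a few lines. The slide does not name a specific FDR procedure, so the Benjamini-Hochberg step-up adjustment is used here as an assumption; the p-values are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is therefore stricter; the FDR adjustment tolerates a stated proportion of false positives among the genes called significant.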


Error Rates


A useful plot

The volcano plot shows, for a particular test, negative log p-value against the effect size (M)
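
The coordinates of such a plot are easy to compute; the sketch below assumes a base-10 logarithm for the p-value axis and uses arbitrary illustrative cut-offs (`min_abs_m`, `max_p`) to flag genes in the upper corners. All names and values are hypothetical.

```python
import math


def volcano_coordinates(log_fold_changes, p_values):
    """(M, -log10 p) for each gene."""
    return [(m, -math.log10(p)) for m, p in zip(log_fold_changes, p_values)]


def flag_genes(points, min_abs_m=1.0, max_p=0.01):
    """True for genes with a large effect AND a small p-value."""
    cutoff = -math.log10(max_p)
    return [abs(m) >= min_abs_m and y >= cutoff for m, y in points]


# Made-up effect sizes (M) and p-values for three genes.
points = volcano_coordinates([2.0, -1.5, 0.1], [0.001, 0.2, 0.0001])
flags = flag_genes(points)
print(points)
print(flags)
```

The third gene illustrates why the plot is useful: its p-value is very small, yet its effect size is negligible, so it sits in the top centre rather than in a corner.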


Volcano plot


Comparison of Feature Selection methods


Assessed

1. the gene list produced by 9 different methods

2. the ability of the top genes from each method to form a classifier


Overlap in Gene Ranking in top 200 genes (binary distance, average linkage)


Testing performance of gene-lists as classifiers

• For each dataset:
– Divide dataset into training and test.
– Apply feature selection method to training data.
– Rank genes using feature selection method
  • (t-statistic, SAM, template matching, etc).
– Select K top genes.
  • where K is between 3 and 100
– Train classifier using these genes.
– Test discriminating power using test set
– Record performance of classifier
• Repeat for each gene selection method
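
The procedure above can be sketched end to end on simulated data. Everything here is an assumption for illustration: the data are synthetic (five informative genes out of fifty), the ranking uses an absolute t-statistic, and the classifier is a simple nearest-centroid rule, which is not necessarily the classifier used in the study. The key point it shows is that genes are ranked on the training data only.

```python
import math
import random
from statistics import mean, variance


def t_score(a, b):
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(samples, labels):
    """Gene indices ordered by decreasing |t| on the given samples."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, l in zip(samples, labels) if l == 1]
        b = [s[g] for s, l in zip(samples, labels) if l == 0]
        scores.append(abs(t_score(a, b)))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


def nearest_centroid_accuracy(train, train_labels, test, test_labels, genes):
    """Train class centroids on the selected genes, score on the test set."""
    centroids = {}
    for cls in (0, 1):
        members = [s for s, l in zip(train, train_labels) if l == cls]
        centroids[cls] = [mean(s[g] for s in members) for g in genes]
    correct = 0
    for sample, label in zip(test, test_labels):
        dist = {cls: sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
                for cls, centroid in centroids.items()}
        correct += min(dist, key=dist.get) == label
    return correct / len(test)


# Synthetic dataset: 40 samples x 50 genes; genes 0-4 differ between classes.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
samples = [[rng.gauss(6.0 if (label == 1 and g < 5) else 0.0, 1.0)
            for g in range(50)] for label in labels]

# Divide into training and test, rank on training data only.
train, test = samples[:20], samples[20:]
train_labels, test_labels = labels[:20], labels[20:]
ranked = rank_genes(train, train_labels)

# Select the K top genes, train, test and record performance.
results = {}
for k in (3, 5, 10):
    results[k] = nearest_centroid_accuracy(
        train, train_labels, test, test_labels, ranked[:k])
print(results)
```

To compare feature selection methods, `rank_genes` would be swapped for each method in turn while the split and the classifier stay fixed.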


Conclusion:

The empirical Bayes t-statistic is a robust and accurate way to identify regulated genes.

Rank Products is also effective in data with low sample size.

Sample permutation of t-statistic and SAM are not effective in datasets with few samples or with low signal:noise

For larger or high signal:noise datasets, most methods work well. Area under the ROC curve method and MaxT outperform other approaches

Jeffery IB, Higgins DG, Culhane AC. Comparison and evaluation of microarray feature selection methods. BMC Bioinformatics. Submitted


Finding out more about Genes


We know lots about genes

• Chromosome location
• Pathways (KEGG)
• Gene Ontology
– Sub-cellular location (e.g. nucleus, cytosol)
– Biological process (e.g. cell signalling)
– Molecular function (e.g. kinase)


Structure of a GO annotation

Each gene can have several annotated GOs and each GO can have several splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits


Gene Sets Score

• Fisher exact test (chi-square test)

• Kolmogorov-Smirnov statistic
• Weighted KS statistics

• Simple matrix multiplication of t-statistics x counts


Is a GO term specific for a set?

Contingency Table

                                                      Genes with GO term   Genes without GO term   Total
Count in set (e.g. differentially expressed genes)            51                   416               467
Count in reference set (e.g. all genes on array)             125                  8588              8713
Total                                                        173                  9004              9177

P-value: 8 x 10^-52 (Fisher's exact test or chi-square test)
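
A one-sided Fisher's exact test for over-representation is the upper tail of a hypergeometric distribution, which can be sketched without any libraries. The example feeds in the four inner counts from the table (51, 416, 125, 8588), treating the second row as the genes outside the set; that reading of the table, the one-sided alternative and the function names are assumptions, and the sketch is not claimed to reproduce the p-value quoted on the slide.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_enrichment_p(in_set_with, in_set_without, rest_with, rest_without):
    """P(at least `in_set_with` annotated genes in the set) under the null."""
    set_size = in_set_with + in_set_without
    with_term = in_set_with + rest_with
    total = set_size + rest_with + rest_without
    log_denominator = log_choose(total, set_size)
    p = 0.0
    # Sum hypergeometric probabilities for the observed count and above.
    for k in range(in_set_with, min(set_size, with_term) + 1):
        p += math.exp(log_choose(with_term, k)
                      + log_choose(total - with_term, set_size - k)
                      - log_denominator)
    return p


p_enriched = fisher_enrichment_p(51, 416, 125, 8588)
print(p_enriched)
```

Working with log binomial coefficients avoids overflow when the gene counts are in the thousands.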


Gene Ontology: FatiGO


Many options

• GSEA
• IGA/Rank Prod
• GenMAPP, and MAPPFinder
• FatiGO
• Segal et al., 2004


Gene Set Enrichment

• Proposed by Mootha et al (2003)
• Similar but more complex and computationally expensive
• A Kolmogorov-Smirnov running sum is computed


Gene Set Enrichment

• For each gene set S
• Genes are ordered according to some criterion (t-test; fold change).
• Start at top ranking gene
• A running sum increases when a gene in set S is encountered and decreases otherwise
• The enrichment score (ES) for a set S is defined to be the largest value of the running sum.
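
The running sum can be sketched as follows. The slide does not give the step sizes, so the sketch assumes increments of sqrt((N-G)/G) for genes in the set and -sqrt(G/(N-G)) otherwise, where N is the number of ranked genes and G the number of them in the set; with these steps the sum returns to zero at the end of the list. Gene names and the function name `enrichment_score` are made up.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Largest value reached by the running sum over the ranked list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        return 0.0  # the set is absent or covers everything: no signal
    hit = math.sqrt((n - g) / g)      # step up for a gene in the set (assumed)
    miss = -math.sqrt(g / (n - g))    # step down otherwise (assumed)
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best


# Genes ordered from most to least differentially expressed (made up).
ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]
es_top = enrichment_score(ranked, {"a", "b"})      # set sits at the top
es_bottom = enrichment_score(ranked, {"g", "h"})   # set sits at the bottom
print(es_top, es_bottom)
```

A set whose members cluster at the top of the ranking drives the sum up early and earns a high ES; a set at the bottom never lifts the sum above zero.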


Kolmogorov-Smirnov test

Running sum over statistics. Compare distance to random distribution


Gene Set Enrichment

• The maximal ES (MES), over all sets S under consideration is recorded.

• For each of B permutations of the class label, ES and MES values are computed.

• The observed MES is then compared to the B values of MES that have been computed, via permutation.

• This is a single p-value for all tests and hence needs no correction


Selection of Categories

• Pathways (KEGG, cMAP, BioCarta)
• GO molecular function, biological process, cellular location
• Published literature
• Genome info: regions of synteny; cytogenetic bands

Take care when selecting categories a priori

num categories >>>> num genes (multiple comparison problem)


Dr. Frederick Frankenstein: Igor, would you mind telling me whose brain I did put in?
Igor: And you won't be angry?
Dr. Frederick Frankenstein: I will NOT be angry.
Igor: Abby someone.
Dr. Frederick Frankenstein: Abby someone. Abby who?
Igor: Abby Normal.
Dr. Frederick Frankenstein: Abby Normal?
Igor: I'm almost sure that was the name.
Dr. Frederick Frankenstein: Are you saying that I put an abnormal brain into a seven and a half foot long, fifty-four inch wide GORILLA? IS THAT WHAT YOU'RE TELLING ME?

From the film Young Frankenstein, 1974

Good Experimental Design & Sample Processing is Critical