Statistical Analyses of High Density Oligonucleotide Arrays Rafael A. Irizarry Department of Biostatistics, JHU (joint work with Bridget Hobbs and Terry Speed, Walter & Eliza Hall Institute of Medical Research and Francois Collin, Gene Logic) http://biosun01.biostat.jhsph.edu/~ririzarr


Page 1: Statistical Analyses of High Density Oligonucleotide Arrays Rafael A. Irizarry Department of Biostatistics, JHU (joint work with Bridget Hobbs and Terry

Statistical Analyses of High Density Oligonucleotide Arrays

Rafael A. Irizarry, Department of Biostatistics, JHU

(joint work with Bridget Hobbs and Terry Speed, Walter & Eliza Hall Institute of Medical Research and Francois Collin, Gene Logic)

http://biosun01.biostat.jhsph.edu/~ririzarr

Page 2

Summary

• Review of technology
• Data exploration
• Probe level summaries (expression measures)
• Normalization
• Evaluate and compare 4 expression measures through bias, variance and model fit
• Use Gene Logic spike-in and dilution study
• Conclusion/future work

Page 3

Probe Arrays

[Figure: GeneChip probe array (1.28 cm) and a hybridized probe cell (24 µm). Labels: millions of copies of a specific oligonucleotide probe; >200,000 different complementary probes; single stranded, labeled RNA target; oligonucleotide probe; image of hybridized probe array. Compliments of D. Gerhold]

Page 4

PM MM

Page 5

Data and Notation

$PM_{ijn}$, $MM_{ijn}$ = intensity for perfect match/mismatch probe cell j, in chip i, in gene n

i = 1,…, I (ranging from 1 to hundreds)
j = 1,…, J (usually 16 or 20)
n = 1,…, N (between 8,000 and 12,000)

Page 6

The Big Picture

• Summarize 20 PM, MM pairs (probe level data) into one number for each gene
• We call this number an expression measure
• Affymetrix GeneChip’s software uses AvDiff as expression measure
• Does it work? Can it be improved?

Page 7

What is the evidence? Lockhart et al., Nature Biotechnology 14 (1996)

Page 8

Competing Measures of Expression

• GeneChip® software uses Avg.diff

$$\text{Avg.diff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$

with A a set of “suitable” pairs chosen by software.
• Log ratio version is also used.
• For differential expression Avg.diffs are compared between chips.
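As a rough illustration of Avg.diff, the average of PM − MM over the suitable pairs A, here is a minimal Python sketch. The function name `avg_diff`, the `suitable` argument, and the intensities are hypothetical; the slide does not say how the software chooses A, so the sketch simply takes the index set as input and defaults to all pairs.

```python
from typing import Optional, Sequence


def avg_diff(pm: Sequence[float], mm: Sequence[float],
             suitable: Optional[Sequence[int]] = None) -> float:
    """Average of PM_j - MM_j over the probe pairs j in the set A (`suitable`)."""
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    # Assumption: when no set A is supplied, every probe pair counts as suitable.
    indices = range(len(pm)) if suitable is None else suitable
    diffs = [pm[j] - mm[j] for j in indices]
    if not diffs:
        raise ValueError("no suitable probe pairs")
    return sum(diffs) / len(diffs)


# Made-up intensities for one probe set with four pairs.
pm = [120.0, 340.0, 95.0, 800.0]
mm = [100.0, 150.0, 110.0, 300.0]
print(avg_diff(pm, mm))             # 173.75
print(avg_diff(pm, mm, [0, 1, 3]))  # drops the pair where MM exceeds PM
```

Note that a pair with MM > PM contributes a negative difference, which is one reason a selection of “suitable” pairs matters for this measure.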

Page 9

Competing Measures of Expression

• GeneChip® new version uses something else

$$\text{signal} = \text{TukeyBiweight}\{\log(PM_j - MM_j^*)\}$$

with MM* a version of MM that is never bigger than PM.
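The slide does not spell out the biweight or how MM* is built, so the following Python sketch rests on stated assumptions: a one-step Tukey biweight centred on the median and scaled by the median absolute deviation, with tuning constant `c = 5` and a small `eps` to avoid division by zero, log base 2, and MM* values passed in by the caller and strictly smaller than PM. The names `tukey_biweight` and `signal` are illustrative, not the vendor's API.

```python
import math
from statistics import median
from typing import Sequence


def tukey_biweight(x: Sequence[float], c: float = 5.0, eps: float = 1e-4) -> float:
    """One-step Tukey biweight: a weighted mean that downweights outliers."""
    m = median(x)
    s = median([abs(v - m) for v in x])  # median absolute deviation
    num = den = 0.0
    for v in x:
        u = (v - m) / (c * s + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0  # zero weight far from the median
        num += w * v
        den += w
    return num / den


def signal(pm: Sequence[float], mm_star: Sequence[float]) -> float:
    """Tukey biweight of log2(PM_j - MM*_j) over the probe pairs of one probe set."""
    if any(p <= m for p, m in zip(pm, mm_star)):
        # Assumption: MM* must be strictly below PM so the log is defined.
        raise ValueError("each MM* must be strictly smaller than its PM")
    return tukey_biweight([math.log2(p - m) for p, m in zip(pm, mm_star)])


# Made-up intensities: the log2 differences are 3, 4 and 5.
print(signal([12.0, 20.0, 36.0], [4.0, 4.0, 4.0]))
# An outlier gets zero weight and barely moves the summary.
print(tukey_biweight([1.0, 2.0, 3.0, 100.0]))
```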

Page 10

Competing Measures of Expression

• Li and Wong fit a model

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$$

Consider $\theta_i$ expression in chip i
• Efron et al. consider log PM – 0.5 log MM
• Another is second largest PM

Page 11

Competing Measures of Expression

• Why not stick to what has worked for cDNA?

$$\text{AvLog}(PM - BG) = \frac{1}{|A|} \sum_{j \in A} \log(PM_j - BG)$$

with A a set of “suitable” pairs.

Page 12

Features of Probe Level Data

Page 13

SD vs. Avg of Defective Probes

Page 14

ANOVA: Strong probe effect, 5 times bigger than gene effect

Page 15

Histograms of log2(PM/MM) stratified by log2(PM×MM)/2 for mouse chip, for defective and normal probes

Page 16

Normalization at Probe Level

Page 17

Spike-In Experiments

• Set A: 11 control cRNAs were spiked in, all at the same concentration, which varied across chips.

• Set B: 11 control cRNAs were spiked in, all at different concentrations, which varied across chips. The concentrations were arranged in a 12x12 cyclic Latin square (with 3 replicates).

Page 18

Set A: Probe Level Data (12 chips)

Page 19

What Did We Learn?

• Don’t subtract or divide by MM
• Probe effect is additive on log scale
• Take logs

Page 20

Why Remove Background?

Page 21

Background Distribution

Page 22

Average Log2(PM-BG)

• Normalize probe level data
• Compute BG = background mean by estimating the mode of the MM distribution
• Subtract BG from each PM
• If PM-BG < 0 use minimum of positives divided by 2
• Take average
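The steps on this slide can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the data are taken as already normalized, the mode of the MM distribution is estimated crudely as the midpoint of the fullest histogram bin (the slide does not name an estimator, and `bin_width` is a made-up parameter), the "minimum of positives" is taken within the probe set, and values with PM − BG equal to 0 are treated like negative ones so that the log is defined.

```python
import math
from collections import Counter
from typing import Sequence


def estimate_mode(values: Sequence[float], bin_width: float) -> float:
    """Crude mode estimate: midpoint of the most populated histogram bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    # Ties between bins are broken in favour of the lowest bin (an arbitrary choice).
    best_bin = max(counts, key=lambda b: (counts[b], -b))
    return (best_bin + 0.5) * bin_width


def av_log_pm_bg(pm: Sequence[float], bg: float) -> float:
    """Average of log2(PM_j - BG), replacing non-positive differences."""
    adjusted = [p - bg for p in pm]
    positives = [a for a in adjusted if a > 0]
    if not positives:
        raise ValueError("no PM value exceeds the background")
    floor = min(positives) / 2.0  # "minimum of positives divided by 2"
    logs = [math.log2(a if a > 0 else floor) for a in adjusted]
    return sum(logs) / len(logs)


# Made-up intensities, assumed to be normalized already.
mm = [92.0, 95.0, 98.0, 99.0, 104.0, 110.0, 250.0, 400.0]
pm = [122.0, 218.0, 346.0, 80.0]
bg = estimate_mode(mm, bin_width=20.0)
print(bg)                    # 90.0
print(av_log_pm_bg(pm, bg))  # 6.0
```

In the example the last PM falls below the background, so its difference is replaced by 32 / 2 = 16 before taking logs.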

Page 23

Expression after Normalization

Page 24

Expression Level Comparison

Page 25

Spike-In B

Probe Set   Conc 1   Conc 2   Rank
BioB-5      100      0.5      1
BioB-3      0.5      25.0     2
BioC-5      2.0      75.0     4
BioB-M      1.0      37.5     4
BioDn-3     1.5      50.0     5
DapX-3      35.7     3.0      6
CreX-3      50.0     5.0      7
CreX-5      12.5     2.0      8
BioC-3      25.0     100      9
DapX-5      5.0      1.5      10
DapX-M      3.0      1.0      11

Later we consider 23 different combinations of concentrations

Page 26

Differential Expression

Page 27

Differential Expression

Page 28

Differential Expression

Page 29

Differential Expression

Page 30

Observed Ranks

Gene      AvDiff   MAS 5.0   Li&Wong   AvLog(PM-BG)
BioB-5    6        2         1         1
BioB-3    16       1         3         2
BioC-5    74       6         2         5
BioB-M    30       3         7         3
BioDn-3   44       5         6         4
DapX-3    239      24        24        7
CreX-3    333      73        36        9
CreX-5    3276     33        3128      8
BioC-3    2709     8572      681       6431
DapX-5    2709     102       12203     10
DapX-M    165      19        13        6
Top 15    1        5         6         10

Page 31

Observed vs True Ratio

Page 32

Dilution Experiment

• cRNA hybridized to human chip (HGU95) in range of proportions and dilutions
• Dilution series begins at 1.25 µg cRNA per GeneChip array, and rises through 2.5, 5.0, 7.5, 10.0, to 20.0 µg per array. 5 replicate chips were used at each dilution
• Normalize just within each set of 5 replicates
• For each probe set compute expression, average and SD over replicates, and fit a line to log expression vs. log concentration
• Regression line should have slope 1 and high R²
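To make the slope and R² criterion concrete, here is a minimal Python sketch of an ordinary least squares fit of log expression on log concentration for one probe set. The function name `fit_line`, the use of log base 2, and the expression values are assumptions for illustration; the expression values are made up to be exactly proportional to concentration, so the fit returns a slope and R² of essentially 1.

```python
import math
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares fit of y on x; returns (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take more than one distinct value")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Concentrations from the dilution series (in µg per array).
conc = [1.25, 2.5, 5.0, 7.5, 10.0, 20.0]
# Hypothetical average expression per dilution, proportional to concentration.
expr = [40.0 * c for c in conc]

slope, intercept, r2 = fit_line([math.log2(c) for c in conc],
                                [math.log2(e) for e in expr])
print(slope, r2)  # both close to 1 for perfectly proportional data
```

With real data, a slope below 1 would indicate that the expression measure compresses differences in concentration.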

Page 33

Dilution Experiment Data

Page 34

Expression and SD

Page 35

Slope Estimates and R²

Page 36

Model check

• Compute observed SD of 5 replicate expression estimates
• Compute RMS of 5 nominal SDs
• Compare by taking the log ratio
• Closeness of observed and nominal SD taken as a measure of goodness of fit of the model
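The model check above can be sketched in Python as follows. The direction of the ratio (observed over nominal), the log base 2, the function name `model_check`, and the numbers are assumptions; the slide only says the two quantities are compared through a log ratio.

```python
import math
from statistics import stdev
from typing import Sequence


def model_check(estimates: Sequence[float], nominal_sds: Sequence[float]) -> float:
    """log2 of (observed SD of replicate estimates) / (RMS of nominal SDs)."""
    observed_sd = stdev(estimates)  # sample SD across the replicate chips
    rms_nominal = math.sqrt(sum(s * s for s in nominal_sds) / len(nominal_sds))
    return math.log2(observed_sd / rms_nominal)


# Made-up expression estimates for one probe set on 5 replicate chips,
# with the SDs reported by the model for each estimate.
estimates = [7.9, 8.1, 8.0, 8.2, 7.8]
nominal_sds = [0.15, 0.16, 0.17, 0.15, 0.16]
print(model_check(estimates, nominal_sds))
```

Under this convention a value near 0 means the model's nominal SDs agree with the replicate-to-replicate variation, a positive value means the model understates it, and a negative value means it overstates it.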

Page 37

Observed vs. Model SE

Page 38

Observed vs. Model SE

Page 39

Conclusion

• Take logs
• PMs need to be normalized
• Using global background improves on use of probe-specific MM
• Gene Logic spike-in and dilution study show all four expression measures performed very well
• AvLog(PM-BG) is arguably the best in terms of bias, variance and model fit
• Future: better BG; robust/resistant summaries

Page 40

Acknowledgements

• Gene Brown’s group at Wyeth/Genetics Institute, and Uwe Scherf’s Genomics Research & Development Group at Gene Logic, for generating the spike-in and dilution data
• Gene Logic for permission to use these data
• Ben Bolstad (UC Berkeley)
• Magnus Åstrand (Astra Zeneca Mölndal)
• Skip Garcia, Tom Cappola, and Joshua Hare (JHU)