gene expression index stat 115 2012. 2 outline gene expression index –mas4, average –mas5, tukey...

Gene Expression Index

Stat 115

Outline

• Gene expression index

– MAS4, average

– MAS5, Tukey Biweight

– dChip, model based, multi-array

– RMA, model based, multi-array

– Method comparison

• Latin Square spike-in experiment

– Importance of probe mapping

These are perhaps the most popular of the many methods for normalizing and computing expression measures from Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.

cDNA Microarrays

• Fold change: ratio Cy5 / Cy3

• When fold change is negative

Log2(Cy5 / Cy3)
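A minimal sketch of why the log ratio is convenient: on the log2 scale a two-fold increase and a two-fold decrease are symmetric around zero. The intensity values below are hypothetical and only serve as an illustration.

```python
import math

def log2_ratio(cy5, cy3):
    """Log2 fold change of the Cy5 channel relative to the Cy3 channel."""
    return math.log2(cy5 / cy3)

# Hypothetical intensities: two-fold up, unchanged, two-fold down
up = log2_ratio(2000.0, 1000.0)
same = log2_ratio(1000.0, 1000.0)
down = log2_ratio(500.0, 1000.0)
print(up, same, down)
```

The raw ratios 2 and 0.5 sit at unequal distances from 1, while their log2 values are equally far from 0.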

Arrays

     array 1   array 2   array 3   array 4   array 5   …
1       0.46      0.30      0.80      1.51      0.90   ...
2      -0.10      0.49      0.24      0.06      0.46   ...
3       0.15      0.74      0.04      0.10      0.20   ...
4      -0.45     -1.03     -0.79     -0.56     -0.32   ...
5      -0.06      1.06      1.35      1.09     -1.09   ...

Affymetrix Microarray Expression Index

• How to summarize probes in a probeset?

Brighter PM usually carries more information, but not always the case (cross-hybridization)

MAS4

• GeneChip® older software Microarray Analysis Software 4.0 uses AvgDiff

• A: a set of suitable pairs chosen by software
– Remove highest/lowest
– Calculate mean, sd from remaining probes
– Eliminate probes more than 3 sd from mean

• Drawback (naïve algorithm):
– Can omit 30-40% probes
– Can give negative values

$$\mathrm{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$$
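The AvgDiff steps above can be sketched as follows. This is an illustration under stated assumptions, not the MAS4 implementation: the function name `avg_diff` and the example intensities are hypothetical, and the sketch assumes the 3 sd filter is applied to all probe pairs using the mean and sd computed after dropping the highest and lowest difference.

```python
from statistics import mean, stdev

def avg_diff(pm, mm, n_sd=3.0):
    """Average PM - MM difference over the 'suitable' probe pairs A."""
    diffs = [p - m for p, m in zip(pm, mm)]
    # Drop the highest and lowest difference before estimating mean and sd
    trimmed = sorted(diffs)[1:-1]
    center, spread = mean(trimmed), stdev(trimmed)
    # Assumption: A is every pair within n_sd standard deviations of the mean
    suitable = [d for d in diffs if abs(d - center) <= n_sd * spread]
    return mean(suitable)

# Hypothetical probe set: one very bright pair and one pair with MM > PM
pm = [120, 200, 150, 900, 180, 160]
mm = [100, 90, 110, 100, 210, 95]
result = avg_diff(pm, mm)
print(result)
```

Because the summary is an average of differences, a probe set in which most MM values exceed PM yields a negative AvgDiff, which is the drawback noted on the slide.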

MAS5

• GeneChip® newest version
• CT* (change threshold): a version of MM that is never bigger than PM
– If MM < PM, CT* = MM
– If MM > PM, estimate a typical-case MM for the PM
  • Tukey biweight of MMs with similar PM values (~70% PM)
– If typical MMs > PM, set CT* = PM − ε for a small positive ε, so that PM − CT* stays positive
• Robust weighting to down-weight outliers

$$\mathrm{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - CT^*_j)\}$$
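A minimal sketch of a one-step Tukey biweight applied to log(PM − CT*), to show how robust weighting down-weights an outlying probe. The tuning constant `c = 5` and the small `eps = 0.0001` are assumptions of this sketch, as are the function names and the example intensities; the CT* values are taken as given rather than computed.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean centered on the median."""
    m = median(values)
    mad = median([abs(v - m) for v in values])
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)  # scaled distance from the median
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def log_signal(pm, ct):
    """Tukey biweight of log2(PM_j - CT*_j); requires CT*_j < PM_j."""
    return tukey_biweight([math.log2(p - c) for p, c in zip(pm, ct)])

# Hypothetical probe set; the fourth pair is an outlier
pm = [190, 220, 190, 1000, 210]
ct = [100, 120, 80, 100, 110]
signal = log_signal(pm, ct)
plain_mean = sum(math.log2(p - c) for p, c in zip(pm, ct)) / len(pm)
print(signal, plain_mean)
```

The outlying pair gets weight zero here, so the biweight stays close to the median while the plain mean is pulled upward.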

Li & Wong (dChip)

Important observation: relative values of probes within a probeset are very stable across multiple samples.

Model-Based Expression Index

• Look at multiple samples at a time, give different probes a different weight

• Each probe signal is proportional to
– Amount of target sample: θi
– Affinity of specific probe sequence to the target: φj

[Figure: intensities of probes 1, 2, 3 in sample 1 and sample 2]

Li & Wong (dChip)

• Model

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$$

• Iteratively estimate θi and φj to minimize εij

• Try to minimize the sum of squared errors

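$$
\begin{pmatrix}
(PM-MM)_{11} & (PM-MM)_{12} & (PM-MM)_{13} & \cdots \\
(PM-MM)_{21} & (PM-MM)_{22} & (PM-MM)_{23} & \cdots \\
(PM-MM)_{31} & (PM-MM)_{32} & (PM-MM)_{33} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\approx
\begin{pmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{pmatrix}
\begin{pmatrix} \phi_1 & \phi_2 & \phi_3 & \cdots \end{pmatrix}
$$

Rows are samples (Sample 1, Sample 2, Sample 3, …) and columns are probes (Probe 1, Probe 2, Probe 3, …). θ is the concentration and φ is the probe affinity.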
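The iterative estimation of θi and φj can be sketched as alternating least squares on the matrix of PM − MM differences. This is a simplified illustration, not dChip itself: the function name `fit_rank_one` and the data are hypothetical, the sketch assumes the scale is fixed by setting the sum of φj² equal to the number of probes, and it omits any outlier handling.

```python
def fit_rank_one(y, n_iter=20):
    """Alternating least squares for y[i][j] ~ theta[i] * phi[j].

    y holds PM - MM differences: rows are samples, columns are probes.
    """
    n_samples, n_probes = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_samples
    for _ in range(n_iter):
        # Least-squares theta for the current phi
        phi_ss = sum(p * p for p in phi)
        theta = [sum(v * p for v, p in zip(row, phi)) / phi_ss for row in y]
        # Least-squares phi for the current theta
        theta_ss = sum(t * t for t in theta)
        phi = [sum(y[i][j] * theta[i] for i in range(n_samples)) / theta_ss
               for j in range(n_probes)]
        # Fix the scale: sum(phi_j ** 2) equals the number of probes
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi

# Hypothetical noise-free data: 3 samples (rows) by 3 probes (columns)
y = [[50.0, 100.0, 150.0],
     [100.0, 200.0, 300.0],
     [200.0, 400.0, 600.0]]
theta, phi = fit_rank_one(y)
print(theta, phi)
```

In this example the probe pattern is identical across samples, so the fitted θ values differ only by the sample concentration ratios (1 : 2 : 4), which is the stability the model exploits.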

RMA = Robust Multi-chip Analysis

• Irizarry & Speed, 2003

• Eliminates MM probes

• Probe intensity background adjustment

• Quantile normalize the background adjusted PM

• Take Log of PM

• Robust probe summary
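The quantile normalization step in the list above can be sketched as follows: every array is forced to share the same distribution by replacing each value with the mean of the values holding the same rank across arrays. The function name and data are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per array)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda k: a[k])  # indices by rank
        out = [0.0] * n
        for rank, k in enumerate(order):
            out[k] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical background-adjusted PM values for three arrays
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 6.0, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization each array contains exactly the same set of values, while the ordering of probes within each array is preserved.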

RMA Background Subtraction

• Signal + BG = PM

• Signal ~ exponential; BG ~ normal

Signal + Noise = Observed

RMA Background Subtraction

• BG distribution
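Under the model above (exponential signal plus normal background), the expected signal given an observed PM has a closed form, sketched below. The assumptions are loud ones: the background mean `mu`, its standard deviation `sigma`, and the exponential rate `alpha` are taken as known (in practice they are estimated from the data), the normal background is treated as untruncated, and all names and numbers are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    # erfc keeps precision in the lower tail
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def expected_signal(observed, mu, sigma, alpha):
    """E[signal | observed] for signal ~ Exp(alpha), BG ~ Normal(mu, sigma^2).

    Given the observation, the signal follows a normal distribution with
    mean a and sd sigma, truncated to be non-negative.
    """
    a = observed - mu - sigma ** 2 * alpha
    return a + sigma * normal_pdf(a / sigma) / normal_cdf(a / sigma)

# Hypothetical parameters: background around 100 with sd 20
low = expected_signal(80.0, mu=100.0, sigma=20.0, alpha=0.01)
high = expected_signal(1000.0, mu=100.0, sigma=20.0, alpha=0.01)
print(low, high)
```

Unlike subtracting MM, this adjustment never produces a negative value: an observation below the background mean is mapped to a small positive signal, and a bright observation is reduced by roughly the background level.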

Why Log(PM)

• Captures the fact that higher-value probes are more variable
• Assume probe noise is comparable on the log scale
• For each probe set, PMij = θi φj, so

$$\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$$

• Fit the model:

$$\log_2 n(\mathrm{bg}(PM_{ij})) = a_i + b_j + \varepsilon_{ij}$$

– ai is the expression index, bj is the probe effect
– log2 n() stands for logarithm after quantile normalization of n samples
• Iteratively refit ai and bj (similar to dChip)
– Main difference is to minimize error at log PM

RMA model fitting: Median Polish

• For a given probe set with J probe pairs, let yij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.

• Assume yij = μi + αj + eij where α1 + α2 + ... + αJ = 0.

• Perform Tukey’s Median Polish on the matrix of yij values with yij in the ith row and jth column.

μi: gene expression of the probe set on GeneChip i

αj: probe affinity effect for the jth probe in the probe set

eij: residual

An Example (from Dan Nettleton)

Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.

              Probe 1   Probe 2   Probe 3   Probe 4   Probe 5
GeneChip 1       4         3         6         4         7
GeneChip 2       8         1        10         5        11
GeneChip 3       6         2         7         8         8
GeneChip 4       9         4        12         9        12
GeneChip 5       7         5         9         6        10

An Example (continued)

Original matrix, with row medians on the right:

   4    3    6    4    7   |   4
   8    1   10    5   11   |   8
   6    2    7    8    8   |   7
   9    4   12    9   12   |   9
   7    5    9    6   10   |   7

Matrix after removing row medians, with column medians below:

   0   -1    2    0    3
   0   -7    2   -3    3
  -1   -5    0    1    1
   0   -5    3    0    3
   0   -2    2   -1    3
  ----------------------
   0   -5    2    0    3

Matrix after subtracting column medians, with row medians on the right:

   0    4    0    0    0   |   0
   0   -2    0   -3    0   |   0
  -1    0   -2    1   -2   |  -1
   0    0    1    0    0   |   0
   0    3    0   -1    0   |   0

Matrix after removing row medians, with column medians below:

   0    4    0    0    0
   0   -2    0   -3    0
   0    1   -1    2   -1
   0    0    1    0    0
   0    3    0   -1    0
  ----------------------
   0    1    0    0    0

Matrix after subtracting column medians:

   0    3    0    0    0
   0   -3    0   -3    0
   0    0   -1    2   -1
   0   -1    1    0    0
   0    2    0   -1    0

All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.

Original matrix:

   4    3    6    4    7
   8    1   10    5   11
   6    2    7    8    8
   9    4   12    9   12
   7    5    9    6   10

Residuals from median polish:

   0    3    0    0    0
   0   -3    0   -3    0
   0    0   -1    2   -1
   0   -1    1    0    0
   0    2    0   -1    0

Matrix of fitted values (original minus residuals), with row means on the right:

   4    0    6    4    7   |   4.2
   8    4   10    8   11   |   8.2
   6    2    8    6    9   |   6.2
   9    5   11    9   12   |   9.2
   7    3    9    7   10   |   7.2

The row means are the estimates of μi: the RMA expression measures for the 5 GeneChips.
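The worked example can be reproduced with a short sketch of median polish that follows the same order of operations as the slides: remove row medians, then column medians, and repeat until every median is zero. The function name is hypothetical, and taking row means of the fitted values as the expression measure follows this example rather than any particular software implementation.

```python
from statistics import median

def median_polish_residuals(y, max_iter=10):
    """Alternately sweep out row and column medians; return the residuals."""
    resid = [row[:] for row in y]
    for _ in range(max_iter):
        row_med = [median(row) for row in resid]
        resid = [[v - m for v in row] for row, m in zip(resid, row_med)]
        col_med = [median(col) for col in zip(*resid)]
        resid = [[v - m for v, m in zip(row, col_med)] for row in resid]
        # Converged once a full sweep changes nothing
        if all(m == 0 for m in row_med) and all(m == 0 for m in col_med):
            break
    return resid

# Rows are GeneChips, columns are probes (values from the example)
y = [[4, 3, 6, 4, 7],
     [8, 1, 10, 5, 11],
     [6, 2, 7, 8, 8],
     [9, 4, 12, 9, 12],
     [7, 5, 9, 6, 10]]

resid = median_polish_residuals(y)
fitted = [[v - r for v, r in zip(y_row, r_row)]
          for y_row, r_row in zip(y, resid)]
expression = [sum(row) / len(row) for row in fitted]
print(expression)  # [4.2, 8.2, 6.2, 9.2, 7.2]
```

Using medians rather than means keeps a single aberrant probe, such as the second probe on GeneChip 2, from distorting the expression estimate.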

Method Comparison Standard

• Spike-ins: introduce markers with known concentration (intensity) to RNA samples
– Should cover a broad range of concentrations
– Run two samples with and without spike-in, see whether algorithm can detect the spike-in (differential expression)
• Dilutions:
– Serial dilutions: 1:2, 1:4, 1:8…
• Latin square spike-in captures both approaches above
• Compare methods both qualitatively (accuracy of detecting the spike-ins) and quantitatively (expression index values)
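A minimal sketch of how a Latin square spike-in design can be laid out: concentrations are rotated cyclically so that every spike-in group is measured at every concentration exactly once, and every experiment covers the full concentration range. The function name and the two-fold concentration series are hypothetical and do not describe the actual Affymetrix design.

```python
def latin_square_design(concentrations):
    """Cyclic Latin square: rows are experiments, columns are spike-in groups."""
    k = len(concentrations)
    return [[concentrations[(group + experiment) % k] for group in range(k)]
            for experiment in range(k)]

# Hypothetical two-fold concentration series
levels = [0.25, 0.5, 1.0, 2.0]
design = latin_square_design(levels)
for row in design:
    print(row)
```

Comparing two adjacent experiments gives a known fold change for each spike-in group, which is what allows both detection accuracy and the estimated expression values to be checked.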

Latin Square Spike-ins

[Figure: four panels (MAS4, MAS5, dChip, RMA); red numbers indicate spiked genes]

Method Comparison of Spike-in

Method Comparison Conclusion

• No one uses MAS4 now
• With fold change, RMA > dChip > MAS5
• With p-value, RMA ~ MAS5 > dChip
• MAS 5.0 does a good job on abundant genes
• dChip and RMA do better on less abundant genes
• Affy developed multi-chip model-based PLIER, currently open source, although no documentation
• All five models are implemented in Bioconductor (open source R package)

214019_at: CCND1

Probe Mapping in Affymetrix Expression arrays

• Inconsistencies in ~5% of NetAffx probe-to-gene annotations (Perez-Iratxeta et al. 2005).

• Remapping all the probes with documented human transcripts resulted in the redefinition of ~37% of probes in Affy’s newest U133 Plus 2.0 array (Harbig et al. 2005).
– Provide new and better .cdf file for probe mapping

• Evolving gene/transcript definitions can cause ~30% difference in the differentially expressed genes (Dai et al. 2005).

Acknowledgment

• Terry Speed, Rafael Irizarry & group
• Kevin Coombes & Keith Baggerly
• Erick Rouchka
• Wing Wong & Cheng Li
• Mark Reimers
• Erin Conlon
• Larry Hunter
• Zhijin Wu
• Wei Li
