p. j. munson, national institutes of health, nov. 2001page 1 a "consistency" test for...

P. J. Munson, National Institutes of Health, Nov. 2001

A "Consistency" Test for Determining the Significance of Gene Expression Changes on Replicate Samples

and

Two Convenient Variance-stabilizing Transformations

Peter J. Munson, Ph.D.

Mathematical and Statistical Computing Laboratory

DCB, CIT, NIH



Introduction

• Math. Stat. Comp. Lab. at NIH
• Run Affy LIMS database
– Started Dec 2000, stores >700 chips
– Serves 3 core facilities at NIH
• Study 1
– 2 treatments, 5 time points, 6 subjects, 60 U95A chips, PBMC cells
• Study 2
– 3 treatments, 5 time points, 5 subjects, 75 Hu6800 chips, human cells in culture
• Study 3
– 4 doses, 2 time points, 20 subjects, 20 RG U34A chips, blood cells


Outline

• Development of Consistency Test
• Variance-stabilizing transforms
– Generalized Logarithm, GLog
– Adaptive transform for Average Difference (AD), TAD
• Normalization
– Normal quantile + adaptive transform
• Application
• Probe-pair data visualization:
– Parallel Axis Coordinate Display


Comparing Two Cell Lines

Data from Carlisle et al., Mol. Carcinogen., 2000

• Don’t subtract background
• Ignore background-level points
• Calibrate on median intensity of each cell type
• Over 3-fold change = outside dashed lines
• Are these expression level changes significant? Real?


Duplicate Experiments and "Consistency" Plot Identifies Real Changes in Expression

[Figure: consistency plot of duplicate experiments; labeled points: Vimentin, Keratin 5]


Replication Permits Calculation of Significance (P-values)

4 false positives out of 5760 spots:

P ≈ 4/5760 = 0.0007


Consistency Plot

• Compare duplicate experiments, Log Ratio scale

• Set Cutoffs for Over-, Under-expression

• Calculate number detected, D

• Assume independence between the duplicates; calculate the expected number, E, falling above both cutoffs or below both cutoffs

• Estimate false positive rate, E/D

[Figure: consistency plots of log ratios from replicate experiments, both axes from -1 to 1. Panels: L21b**exp45 vs. L12b**exp44; L21**exp64 vs. L12**exp63; L21**exp45 vs. L12**exp44. The two corner regions of consistent change are marked D=24 and D=16.]

Counts in the 3 x 3 grid formed by the cutoffs, observed (expected under independence):

            Left         Middle           Right        Total
Top         0 (0.3)      22 (45.2)        24 (0.6)     46
Middle      11 (26.1)    4074 (4036.6)    28 (50.4)    4113
Bottom      16 (0.6)     74 (88.4)        0 (1.1)      90
Total       27           4170             52           4249

Top right: D=24, E=0.6, E/D=3%
Bottom left: D=16, E=0.6
p53 +/+ cells 6 hrs, replicate reciprocal experiment
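The E and E/D figures on this slide can be rechecked with a few lines of Python. This is a minimal sketch. It assumes the slide's numbers form a 3 x 3 table of observed counts with row totals 46, 4113, 90 and column totals 27, 4170, 52 (4249 spots in all). The function name `expected_count` is illustrative and not from the talk.

```python
def expected_count(row_total, col_total, n):
    """Expected count in one cell if the two experiments are independent."""
    return row_total * col_total / n

n = 4249  # total number of spots in the table (from the slide)

# Corner with D = 24 detected genes: row total 46, column total 52.
d_a, e_a = 24, expected_count(46, 52, n)
# Corner with D = 16 detected genes: row total 90, column total 27.
d_b, e_b = 16, expected_count(90, 27, n)

for d, e in ((d_a, e_a), (d_b, e_b)):
    # E/D estimates the false positive rate among the D detected genes.
    print(f"D = {d}, E = {e:.2f}, E/D = {e / d:.1%}")
```

Both expected counts round to the E = 0.6 shown on the slide, so only a small percentage of the genes detected in either corner would be expected by chance.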


Consistency Test on Relative Expression

DEFINE:

$x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)

$F_i(x)$ = empirical cdf of $x(\cdot, i)$ across genes (spots)

$c = \min_j x(g, j)$, across experiments

THEN, assuming that $\{ x(g, i),\ g = 1, \dots, n \}$ are an independent sample from distribution $F_i$, the probability that $x(g, i)$ is consistently large is:

$p_{up}(g) = \Pr(X_i \ge c \text{ for all } i) = \prod_i (1 - F_i(c))$


Consistency Test on Relative Expression - 2

DEFINE:

$x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)

$p_{up}(g) = \prod_i (1 - F_i(\min_j x(g, j)))$

$p_{dn}(g) = \prod_i F_i(\max_j x(g, j))$

THEN

Expected number of false positives: $E(g) = n \cdot p(g)$
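The definitions above can be sketched in plain Python. This is an illustrative sketch and not the author's code. It assumes the upper tail is counted as the fraction of genes with a value at or above c (so a gene counts itself and the p-value is never zero), and the lower tail as the fraction at or below the maximum. The function names and the toy data are made up.

```python
from bisect import bisect_left, bisect_right

def tail_fractions(values):
    """Return (upper, lower): upper(c) = fraction >= c, lower(c) = fraction <= c."""
    ordered = sorted(values)
    n = len(ordered)
    upper = lambda c: (n - bisect_left(ordered, c)) / n
    lower = lambda c: bisect_right(ordered, c) / n
    return upper, lower

def consistency_pvalues(x):
    """x[g][i] = relative expression of gene g in experiment i.

    Returns (p_up, p_dn), one value per gene: products over experiments of
    the upper tail at min_j x[g][j] and the lower tail at max_j x[g][j].
    """
    m = len(x[0])
    tails = [tail_fractions([row[i] for row in x]) for i in range(m)]
    p_up, p_dn = [], []
    for row in x:
        lo, hi = min(row), max(row)
        up = dn = 1.0
        for upper, lower in tails:
            up *= upper(lo)
            dn *= lower(hi)
        p_up.append(up)
        p_dn.append(dn)
    return p_up, p_dn

# Toy data: 4 genes, 2 replicate experiments (made-up log ratios).
x = [[2.0, 1.8], [0.1, -0.2], [-0.1, 0.3], [-1.5, -1.9]]
p_up, p_dn = consistency_pvalues(x)
# Expected number of false positives at each gene's own cutoff: n * p(g).
e_up = [len(x) * p for p in p_up]
print(p_up, p_dn, e_up)
```

Sorting genes by p and keeping those with n * p(g) below a chosen limit gives a gene list with a stated expected number of false positives.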


Assumptions of Consistency Test

• Independence between experiments

• “Exchangeability” of genes

• Homogeneity of variance across genes (i.e. across expression intensity)

Does NOT require:

• Identical distribution in separate experiments

But variance homogeneity is violated for Affy Average Difference (Avg. Diff., AD) data


Variance Stabilizing Transformations

• Logarithm

• Box-Cox, power

• Generalized Logarithm, GLog

• Adaptive, TAD


Model Variance as Function of Mean AD


Model Variance as Function of Mean AD

$\mathrm{Var}(y) = a_0$

$\mathrm{Var}(y) = a_0 + a_1 y$

$\mathrm{Var}(y) = a_0 + a_1 y + a_2 y^2$

$\mathrm{Var}(y) = a_2 y^2$ => use logarithms

What about:

$\mathrm{Var}(y) = a_0 + a_2 y^2$


Generalized Log Transform (G-Log)

$\mathrm{Var}(y) = a_0 + a_2 y^2 = a_0 \left( 1 + (y/c)^2 \right)$, where $c = \sqrt{a_0 / a_2}$

$c$ = (s.d. at $y = 0$) / CV, e.g. $c = 10 / 0.1 = 100$

$\mathrm{GLog}(y; c) = \mathrm{sign}(y) \, \ln\left\{ |y/c| + \sqrt{1 + y^2/c^2} \right\}$
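A minimal Python sketch of the GLog formula follows, using the slide's example value c = 100. The function name `glog` and the sample inputs are illustrative. Algebraically the expression equals the inverse hyperbolic sine of y/c, so the sketch prints `math.asinh(y / c)` alongside for comparison.

```python
import math

def glog(y, c):
    """GLog(y; c) = sign(y) * ln(|y/c| + sqrt(1 + y^2/c^2))."""
    r = abs(y / c)
    return math.copysign(math.log(r + math.sqrt(1.0 + r * r)), y)

c = 100.0  # e.g. s.d. at y = 0 of 10, divided by a CV of 0.1

for y in (-1000.0, -10.0, 0.0, 10.0, 1000.0, 100000.0):
    # The second and third columns should agree to rounding error.
    print(y, glog(y, c), math.asinh(y / c))
```

Near zero the transform is close to y/c (linear), and for large y it approaches ln(2y/c), so it behaves like a logarithm at high intensity while remaining defined for zero and negative values.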


Quantile Normalization for AD (before)


Quantile Normalization for AD (after)


Normal Quantile Transform after GLog(AD) (it’s almost linear)
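The slides do not spell out how the normal quantile (NQ) transform is computed. One common rank-based version is sketched below as an assumption, not as the author's method: each value is replaced by the standard normal quantile of (rank + 0.5)/n. The function name and the sample values are made up, and ties are simply broken by input order.

```python
from statistics import NormalDist

def normal_quantile_transform(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    std_normal = NormalDist()  # mean 0, s.d. 1
    nq = [0.0] * n
    for rank, k in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        nq[k] = std_normal.inv_cdf((rank + 0.5) / n)
    return nq

ad = [1500.0, -20.0, 35.0, 220.0, 90.0]  # made-up Average Difference values
nq = normal_quantile_transform(ad)
print(nq)
```

Because only ranks are used, every chip ends up with the same set of values after the transform, which is what makes it usable as a normalization step.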


Adaptive Transform of AD (TAD) - 1

Model variance (over many replicates) vs. mean AD

Plot:

Log(SD), or the Wilson-Hilferty transform SD^(2/3), vs. mean of NQ(AD)

Fit a smooth function, g, which predicts SD

$T(X) = \int_{-\infty}^{X} \frac{1}{g(u)} \, du$
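The integral defining T can be tabulated numerically once g is available. The sketch below is illustrative only. It integrates from a finite lower limit `x_min` instead of minus infinity (an assumption; only differences of T matter), uses the trapezoid rule, and substitutes a hypothetical closed-form g, sqrt(a0 + a2*u^2), for the smooth fit described on the slide.

```python
import math

def adaptive_transform(g, x_min, x_max, steps=2000):
    """Tabulate T(x) = integral from x_min to x of du / g(u) on a grid.

    Returns (grid, T) with T[0] = 0 at x_min, using the trapezoid rule.
    """
    h = (x_max - x_min) / steps
    grid = [x_min + k * h for k in range(steps + 1)]
    inv = [1.0 / g(u) for u in grid]
    T = [0.0]
    for k in range(1, steps + 1):
        T.append(T[-1] + 0.5 * h * (inv[k - 1] + inv[k]))
    return grid, T

# Hypothetical SD model standing in for the fitted smooth function g.
a0, a2 = 100.0, 0.01

def g(u):
    return math.sqrt(a0 + a2 * u * u)

grid, T = adaptive_transform(g, 0.0, 1000.0)
c = math.sqrt(a0 / a2)
# For this particular g the integral has a closed form: asinh(x / c) / sqrt(a2).
print(T[-1], math.asinh(1000.0 / c) / math.sqrt(a2))
```

With this choice of g the adaptive transform reduces to a rescaled GLog, which suggests reading TAD as a data-driven generalization of GLog. In practice g would come from the fit of SD against mean NQ(AD).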

Adaptive Transform of AD (TAD) - 2


Adaptive Transform of AD (TAD)


Consistency Test p-values

[Figure: histograms of consistency test p-values (x axis from 0 to 1, y axis: count) in four panels: Treatment and Sham, each for Time 1 vs. Time 0 and Time 2 vs. Time 0]


Results of Study 1 (5 time points, 2 treatments, 6 subjects)

Table 1. Number of genes detected by consistency test with expected false positives set to 1.0

Group       Any time   1-0   2-0   3-0   4-0
Treated     385        13    340   22    19
Controls    83         21    23    26    24
Both        2          0     1     2     1

Table 3. Number of genes detected by Maximum TAD greater than 1

Group       Any time   1-0   2-0   3-0   4-0
Treated     275        5     264   4     5
Controls    6          1     2     4     4
Both        1          0     0     0     1


Probe Pair Data, Delta TAD = 2

Parallel Axis Coordinate Display


Probe Pair Data, Delta TAD = 0.5


Probe Pair Data, Delta TAD = -1.5


Probe Pair Data, Delta TAD = -0.5


Acknowledgements

Lynn Young, MSCL
Vinay Prabhu, MSCL
Jennifer Barb, MSCL
Howard Shindel, MSCL
Andrew Schwartz, CIT
Steve Bailey, CIT

Robert Danner, CC
Anthony Suffredini, CC
Peter Eichacker, CC
James Shelhamer, CC
Eric Gerstenberger, CC

Sayed Daoud, NCI
Yves Pommier, NCI
John Weinstein, NCI

David Krizman, NCI
Alex Carlisle, NCI

David Rocke, UC Davis
