P. J. Munson, National Institutes of Health, Nov. 2001
A "Consistency" Test for Determining the Significance of Gene Expression
Changes on Replicate Samples
and
Two Convenient Variance-stabilizing Transformations
Peter J. Munson, Ph.D.
Mathematical and Statistical Computing Laboratory
DCB, CIT, NIH
Introduction
• Math. Stat. Comp. Lab. at NIH
• Run Affy LIMS database
  – Started Dec 2000, stores >700 chips
  – Serves 3 core facilities at NIH
• Study 1
  – 2 treatments, 5 time points, 6 subjects, 60 U95A chips, PBMC cells
• Study 2
  – 3 treatments, 5 time points, 5 subjects, 75 Hu6800 chips, human cells in culture
• Study 3
  – 4 doses, 2 time points, 20 subjects, 20 RG U34A chips, blood cells
Outline
• Development of Consistency Test
• Variance-stabilizing transforms
  – Generalized Logarithm, GLog
  – Adaptive transform for Average Diff, TAD
• Normalization
  – Normal quantile + adaptive transform
• Application
• Probe-pair data visualization:
  – Parallel Axis Coordinate Display
Comparing Two Cell Lines
Data from Carlisle, et al., Mol. Carcinogen., 2000
• Don’t subtract background
• Ignore background-level points
• Calibrate on median intensity of each cell type
• Over 3-fold change = outside dashed lines
• Are these expression level changes significant? Real?
Duplicate Experiments and "Consistency" Plot
Identifies Real Changes in Expression
[Figure: consistency plot of duplicate experiments; labeled genes: Vimentin, Keratin 5]
Replication Permits Calculation of Significance (P-values)
4 false positives out of 5760 spots:
P ≈ 4/5760 = 0.0007
Consistency Plot
• Compare duplicate experiments, Log Ratio scale
• Set Cutoffs for Over-, Under-expression
• Calculate number detected, D
• Assume Independence, calculate expected number, E, above both, below both cutoffs
• Estimate false positive rate, E/D
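The steps in this list can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function names, the cutoff arguments `lo` and `hi`, and the use of NumPy are all assumptions. The expected count E under independence is taken as (number beyond the cutoff in replicate 1) × (number beyond the cutoff in replicate 2) / (total spots).

```python
import numpy as np

def consistency_counts(rep1, rep2, lo, hi):
    """Count spots beyond the cutoffs in both replicates (D) and the
    number expected if the replicates were independent (E).

    rep1, rep2: log ratios for the same spots in two replicate experiments.
    lo, hi: hypothetical under- and over-expression cutoffs.
    """
    rep1 = np.asarray(rep1, dtype=float)
    rep2 = np.asarray(rep2, dtype=float)
    n = rep1.size

    up1, up2 = rep1 > hi, rep2 > hi   # above cutoff in each replicate
    dn1, dn2 = rep1 < lo, rep2 < lo   # below cutoff in each replicate

    return {
        "D_up": int(np.sum(up1 & up2)),            # detected above both
        "E_up": float(up1.sum() * up2.sum() / n),  # expected above both
        "D_dn": int(np.sum(dn1 & dn2)),            # detected below both
        "E_dn": float(dn1.sum() * dn2.sum() / n),  # expected below both
    }

def false_positive_rate(d, e):
    """Estimated false positive rate E/D (NaN when nothing is detected)."""
    return e / d if d > 0 else float("nan")
```

If the counts 46 and 52 on the slide are read as the marginal counts for the D=24 corner of the plot (an interpretation of the figure, not something the slide states), then `46 * 52 / 4249` is about 0.56, consistent with the displayed E=0.6.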
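[Figure: consistency plot, log ratio scale, both axes from -1 to 1. y-axis: L21b**exp45; x-axis: L12b**exp44. Observed counts in each region of the plot, with the count expected under independence in parentheses:]

          Left         Middle           Right       Total
Top       0 (0.3)      22 (45.2)        24 (0.6)    46
Middle    11 (26.1)    4074 (4036.6)    28 (50.4)   4113
Bottom    16 (0.6)     74 (88.4)        0 (1.1)     90
Total     27           4170             52          4249

Top right: D=24, E=0.6, E/D=3%
Bottom left: D=16, E=0.6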
[Figure: two consistency plots, log ratio axes from -1 to 1: L21**exp64 vs. L12**exp63, and L21**exp45 vs. L12**exp44]
p53 +/+ cells 6 hrs, replicate reciprocal experiment
Consistency Test on Relative Expression
DEFINE:
  $x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)
  $F_i(x)$ = empirical cdf of $x_i$ across genes (spots)
  $c = \min_j x(g, j)$, across experiments
THEN, assuming that $\{ x(g, i),\ g = 1, \dots, n \}$ are an independent sample from distribution $F_i$, the probability that $x(g, i)$ is consistently large is:
  $p_{up}(g) = \Pr(X_i \ge c \text{ for all } i) = \prod_i \left(1 - F_i(c)\right)$
Consistency Test on Relative Expression - 2
DEFINE:
  $x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)
  $p_{up}(g) = \prod_i \left(1 - F_i(\min_j x(g, j))\right)$
  $p_{dn}(g) = \prod_i F_i(\max_j x(g, j))$
THEN
  Expected number of false positives: $E(g) = n \cdot p(g)$
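The definitions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the function name and the NumPy-based layout (genes in rows, experiments in columns) are hypothetical, and the "consistently large" probability is computed as the empirical fraction of values at or above the cutoff, i.e. the Pr(X_i ≥ c) form, so that the gene's own value counts toward its tail.

```python
import numpy as np

def consistency_pvalues(x):
    """Consistency test on an (n genes) x (m experiments) array.

    Returns (p_up, p_dn, e_up, e_dn), where e = n * p is the expected
    number of false positives at each gene's own cutoff.
    """
    x = np.asarray(x, dtype=float)
    n, m = x.shape

    c_up = x.min(axis=1)   # min_j x(g, j): weakest "up" evidence per gene
    c_dn = x.max(axis=1)   # max_j x(g, j): weakest "down" evidence per gene

    p_up = np.ones(n)
    p_dn = np.ones(n)
    for i in range(m):
        col = np.sort(x[:, i])
        # Empirical Pr(X_i >= c): fraction of experiment i at or above c_up
        p_up *= (n - np.searchsorted(col, c_up, side="left")) / n
        # Empirical F_i(c): fraction of experiment i at or below c_dn
        p_dn *= np.searchsorted(col, c_dn, side="right") / n

    return p_up, p_dn, n * p_up, n * p_dn
```

For example, with four genes and two experiments, a gene that ranks highest in both experiments gets p_up = (1/4)(1/4) = 0.0625 and an expected false positive count of 4 × 0.0625 = 0.25.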
Assumptions of Consistency Test
• Independence between experiments
• “Exchangeability” of genes
• Homogeneity of variance across genes (i.e. across expression intensity)
Does NOT require:
• Identical distribution in separate experiments
But variance homogeneity is violated for Affy Avg. Diff. (AD) data
Variance Stabilizing Transformations
• Logarithm
• Box-Cox, power
• Generalized Logarithm, GLog
• Adaptive, TAD
Model Variance as Function of Mean AD
$\mathrm{Var}(y) = a_0$
$\mathrm{Var}(y) = a_0 + a_1 y$
$\mathrm{Var}(y) = a_0 + a_1 y + a_2 y^2$
$\mathrm{Var}(y) = a_2 y^2$ =>> use logarithms
What about:
$\mathrm{Var}(y) = a_0 + a_2 y^2$
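Why a purely quadratic variance model points to the logarithm can be seen from the standard first-order (delta-method) approximation. The slide does not spell this step out, so the following is a sketch of the usual argument; the symbol $\mu$ for the mean of $y$ is introduced here for illustration.

```latex
\begin{align*}
\mathrm{Var}\big(f(y)\big) &\approx f'(\mu)^2 \, \mathrm{Var}(y) \\
f(y) = \ln y, \quad \mathrm{Var}(y) = a_2 \mu^2
  &\;\Rightarrow\; \mathrm{Var}(\ln y) \approx \frac{1}{\mu^2} \, a_2 \mu^2 = a_2
\end{align*}
```

The transformed variance no longer depends on the mean, which is what "variance-stabilizing" means here. The same argument fails for the logarithm once the constant term $a_0$ is present, since $a_0 / \mu^2$ grows without bound as $\mu \to 0$.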
Generalized Log Transform (G-Log)
$\mathrm{Var}(y) = a_0 + a_2 y^2 = a_0 \left(1 + (y/c)^2\right)$, where $c = \sqrt{a_0 / a_2}$ = (s.d. at $y = 0$) / CV, e.g. $c = 10 / 0.1 = 100$
$\mathrm{GLog}(y; c) = \mathrm{sign}(y) \, \ln\left\{ |y/c| + \sqrt{1 + y^2/c^2} \right\}$
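The GLog formula on this slide can be written directly in Python. The sketch below assumes NumPy; the helper `sd_after_glog` is hypothetical and only checks numerically, via a finite-difference slope, that the transform flattens the standard deviation under the model Var(y) = a0 + a2*y^2. Algebraically, the formula is the inverse hyperbolic sine of y/c.

```python
import numpy as np

def glog(y, c):
    """Generalized logarithm: sign(y) * ln(|y/c| + sqrt(1 + y^2/c^2))."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.log(np.abs(y / c) + np.sqrt(1.0 + (y / c) ** 2))

def sd_after_glog(y, a0, a2, h=1e-3):
    """Approximate s.d. of GLog(y) as slope * s.d.(y) (delta method),
    using a central finite difference for the slope."""
    y = np.asarray(y, dtype=float)
    c = np.sqrt(a0 / a2)
    slope = (glog(y + h, c) - glog(y - h, c)) / (2.0 * h)
    return slope * np.sqrt(a0 + a2 * y ** 2)
```

With the slide's example values (s.d. of 10 at y = 0 and CV of 0.1, so a0 = 100, a2 = 0.01, c = 100), `sd_after_glog` returns approximately 0.1 at every intensity, from y = 0 up to y = 10000.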
Quantile Normalization for AD (before)
Quantile Normalization for AD (after)
Normal Quantile Transform after GLog(AD) (it’s almost linear)
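The slides do not say how the normal quantile transform is computed. One common rank-based construction is sketched below as an assumption: each value is replaced by the standard normal quantile of its rank-based plotting position (r - 0.5)/n. The function name, the plotting position, and the handling of ties (by order of appearance) are illustrative choices.

```python
from statistics import NormalDist

import numpy as np

def normal_quantile_transform(values):
    """Map values to standard normal quantiles of their ranks.

    Assumed plotting position: (rank - 0.5) / n, with ranks 1..n.
    Ties are broken by order of appearance.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    ranks = np.empty(n, dtype=float)
    ranks[np.argsort(values, kind="stable")] = np.arange(1, n + 1)
    probs = (ranks - 0.5) / n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])
```

Because the transform depends only on ranks, applying it to AD or to GLog(AD) gives the same result; the plot on this slide compares the transformed values against GLog(AD).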
Adaptive Transform of AD (TAD) - 1
Model variance (over many replicates) vs. mean AD
Plot:
Log(SD) or Wilson-Hilferty, SD^(2/3), transform
vs.
Mean of NQ(AD)
Fit smooth function, g, which predicts SD
Adaptive Transform of AD (TAD) - 2
$T(X) = \int_{-\infty}^{X} \frac{1}{g(u)} \, du$
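Numerically, the integral of 1/g can be approximated on a grid. The sketch below assumes the fitted function g has already been evaluated at sorted grid points (how g is fitted is not shown here), uses the trapezoid rule, and replaces the lower limit of minus infinity with the first grid point, so T is defined only up to an additive constant. The function name and arguments are hypothetical.

```python
import numpy as np

def adaptive_transform(grid, g_values):
    """Cumulative trapezoid integral of 1/g over a sorted grid.

    grid: increasing points on the (normalized) intensity scale.
    g_values: predicted SD at each grid point, all positive.
    Returns T at each grid point, with T(grid[0]) = 0.
    """
    grid = np.asarray(grid, dtype=float)
    integrand = 1.0 / np.asarray(g_values, dtype=float)
    steps = np.diff(grid) * 0.5 * (integrand[1:] + integrand[:-1])
    return np.concatenate(([0.0], np.cumsum(steps)))
```

As a consistency check, if g follows the GLog variance model, g(u) = 0.1 * sqrt(100^2 + u^2), the result matches arcsinh(u/100) / 0.1 on a fine grid, so GLog is the special case of this transform for that particular g.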
Adaptive Transform of AD (TAD)
Consistency Test p-values
[Figure: histograms of consistency test p-values (x-axis: p-value from 0 to 1; y-axis: count). Panels: Time 2 vs. Time 0 and Time 1 vs. Time 0, for Treatment and Sham.]
Results of Study 1 (5 time points, 2 treatments, 6 subjects)

Table 1. Number of genes detected by consistency test with expected false positives set to 1.0
Group      Any time   1-0   2-0   3-0   4-0
Treated    385        13    340   22    19
Controls   83         21    23    26    24
Both       2          0     1     2     1

Table 3. Number of genes detected by Maximum TAD greater than 1
Group      Any time   1-0   2-0   3-0   4-0
Treated    275        5     264   4     5
Controls   6          1     2     4     4
Both       1          0     0     0     1
Probe Pair Data, Delta TAD = 2
Parallel Axis Coordinate Display
Probe Pair Data, Delta TAD = 0.5
Probe Pair Data, Delta TAD = -1.5
Probe Pair Data, Delta TAD = -0.5
Acknowledgements
Lynn Young, MSCL
Vinay Prabhu, MSCL
Jennifer Barb, MSCL
Howard Shindel, MSCL
Andrew Schwartz, CIT
Steve Bailey, CIT
Robert Danner, CC
Anthony Suffredini, CC
Peter Eichacker, CC
James Shelhamer, CC
Eric Gerstenberger, CC
Sayed Daoud, NCI
Yves Pommier, NCI
John Weinstein, NCI
David Krizman, NCI
Alex Carlisle, NCI
David Rocke, UC Davis