Strong Control of the Familywise Type I Error Rate in DNA
Microarray Analysis Using Exact Step-Down Permutation Tests
Peter H. Westfall
Texas Tech University
Microarrays
• Glass (1 cm²)
• ~6,500 genes
Different cDNA sequence
Example
Group 1: Acute Myeloid Leukemia (AML), n1 = 11
Group 2: Acute Lymphoblastic Leukemia (ALL), n2 = 27
Data:
OBS   TYPE   G1   G2   G3   …   G7000
1     AML    (Gene expression levels)
2     AML
…     …
11    AML
12    ALL
…     …
38    ALL
Testing for 7000 Gene Expression Levels
Goal: Test $H_{0i}: F_{ALL,i} = F_{AML,i}$ for $i = 1, \dots, 7000$.
Here, “F” denotes cdf.
Many choices for test statistics.
Multiplicity problem: If tests are done at $\alpha = .05$, and there are 6600 equivalent genes, then about .05 × 6600 = 330 of them are expected to be determined “non-equivalent.”
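The arithmetic behind the multiplicity problem can be checked in a few lines. This is a minimal sketch using the numbers on the slide; the Bonferroni per-test level is added only as a familiar point of comparison and is not the method developed in this talk.

```python
# Numbers from the slide: 6600 truly equivalent genes, each tested at alpha = .05.
alpha = 0.05
n_true_nulls = 6600

# Expected number of false discoveries when every test is run at level alpha.
expected_false_positives = alpha * n_true_nulls

# Bonferroni per-test level for all 7000 genes (shown for comparison only).
m = 7000
bonferroni_level = alpha / m

print(round(expected_false_positives))  # 330
print(bonferroni_level)
```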
Closed Testing to Control False Discoveries
Let $S = \{1, 2, \dots, 7000\}$ (gene labels).
Let $K = \{i_1, \dots, i_k\} \subseteq S$ denote a particular subset.
The Closed Testing Procedure:
1. Test $H_{0K}: F_{ALL,K} = F_{AML,K}$ for each $K \subseteq S$, using a valid $\alpha$-level test for each.
2. Reject $H_{0i}: F_{ALL,i} = F_{AML,i}$ if $H_{0K}$ is rejected for all $K \supseteq \{i\}$.
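The two steps can be sketched by brute force for a handful of hypotheses. The local test used here, a Bonferroni test within each subset, is an assumption chosen to keep the example short; the talk itself uses exact permutation tests for each subset. The function names are hypothetical, and the p-values are the unadjusted values that appear in the four-gene illustration in these slides.

```python
from itertools import combinations

def bonferroni_local_test(pvals, subset, alpha):
    """Assumed local test: reject H0K when |K| * min p_i <= alpha."""
    return len(subset) * min(pvals[i] for i in subset) <= alpha

def closed_testing(pvals, alpha=0.05, local_test=bonferroni_local_test):
    """Return a list of booleans: True where H0i is rejected by the CTP."""
    m = len(pvals)
    indices = range(m)
    # Step 1: test every non-empty subset K of S.
    rejected = {}
    for size in range(1, m + 1):
        for subset in combinations(indices, size):
            rejected[subset] = local_test(pvals, subset, alpha)
    # Step 2: reject H0i only if every K containing i was rejected.
    return [all(rej for subset, rej in rejected.items() if i in subset)
            for i in indices]

raw_p = [0.0121, 0.0142, 0.1986, 0.0191]
decisions = closed_testing(raw_p)
print(decisions)  # [True, True, False, True]
```

With m hypotheses the loop visits 2^m − 1 subsets, which is fine for four genes and hopeless for 7000.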
Theorem: CTP Strongly Controls FWE
Proof: Suppose $H_{0j_1}, \dots, H_{0j_m}$ all are true (unknown to you which ones).
You may reject at least one only when you reject the intersection $H_{0j_1} \cap \dots \cap H_{0j_m}$.
Thus,
FWE = P(reject at least one of $H_{0j_1}, \dots, H_{0j_m}$ | $H_{0j_1}, \dots, H_{0j_m}$ all are true)
    $\le$ P(reject $H_{0j_1} \cap \dots \cap H_{0j_m}$ | $H_{0j_1}, \dots, H_{0j_m}$ all are true) $= \alpha$.
Exact Tests for Composite Hypotheses H0K
Use the permutation distribution of $\min_{i \in K} p_i$, where $p_i = 2P(T_{38-2} > |t_i|)$, and

$$t_i = \frac{\bar{X}_{AML,i} - \bar{X}_{ALL,i}}{s_{p,i}\sqrt{\frac{1}{11} + \frac{1}{27}}}$$

p-value = proportion of the 38!/(27!11!) permutations for which $\min_{i \in K} P_i^* \le \min_{i \in K} p_i$.
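A minimal sketch of this exact test on toy data follows. Two simplifications are assumptions of the sketch, not part of the slide: the groups are tiny (3 versus 4 observations, so only 35 relabellings instead of more than a billion), and permutations are ranked by max |t_i| rather than min p_i. Because every p_i is computed from the same t reference distribution, the smallest p_i corresponds to the largest |t_i|, so the two rankings give the same permutation p-value. All function and variable names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pooled_t(x, y):
    """Two-sample pooled-variance t statistic (mean of x minus mean of y)."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    sp = sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))

def exact_minp_pvalue(data, labels, subset):
    """Exact permutation p-value for H0K, using max |t_i| over i in K.

    data[j][i] is the expression of gene i for observation j;
    labels[j] is True for group 1 and False for group 2.
    """
    n = len(labels)
    n1 = sum(labels)

    def max_abs_t(group1):
        g1 = set(group1)
        stats = []
        for i in subset:
            x = [data[j][i] for j in range(n) if j in g1]
            y = [data[j][i] for j in range(n) if j not in g1]
            stats.append(abs(pooled_t(x, y)))
        return max(stats)

    observed = max_abs_t([j for j in range(n) if labels[j]])
    # Enumerate every way of assigning n1 of the n observations to group 1.
    count = total = 0
    for group1 in combinations(range(n), n1):
        total += 1
        if max_abs_t(group1) >= observed - 1e-12:  # tolerance for float ties
            count += 1
    return count / total

# Toy data: gene 0 separates the groups perfectly, gene 1 does not.
toy_data = [[5.1, 2.0], [4.8, 2.3], [5.5, 1.9],
            [1.2, 2.1], [0.9, 2.4], [1.4, 1.8], [1.1, 2.2]]
toy_labels = [True] * 3 + [False] * 4

p_both = exact_minp_pvalue(toy_data, toy_labels, subset=[0, 1])
p_gene0 = exact_minp_pvalue(toy_data, toy_labels, subset=[0])
p_gene1 = exact_minp_pvalue(toy_data, toy_labels, subset=[1])
print(p_both, p_gene0, p_gene1)
```

Nothing in the procedure requires the genes to be independent, or the number of genes to be smaller than the number of observations; the dependence among genes is carried along by permuting whole observations.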
Note: Exact despite “massively singular” covariance matrix!
A Slight Problem...
There are $2^{7000} - 1$ subsets K to be tested
This might take a while...
A Fantastic Simplification
You need only test 7000 of the $2^{7000} - 1$ subsets!
Why? Because
$P(\min_{i \in K} P_i^* \le c) \le P(\min_{i \in K'} P_i^* \le c)$ when $K \subseteq K'$.
Significance for most lower order subsets is determined by significance of higher order subsets.
Illustration with Four Genes

H{1234}: min p = .0121, p{1234} = .0379

H{123}: min p = .0121, p{123} < .0379
H{124}: min p = .0121, p{124} < .0379
H{134}: min p = .0121, p{134} < .0379
H{234}: min p = .0142, p{234} = .0351

H{12}: min p = .0121, p{12} < .0379
H{13}: min p = .0121, p{13} < .0379
H{14}: min p = .0121, p{14} < .0379
H{23}: min p = .0142, p{23} < .0351
H{24}: min p = .0142, p{24} < .0351
H{34}: min p = .0191, p{34} = .0355

H1: p1 = 0.0121, p{1} < .0379
H2: p2 = 0.0142, p{2} < .0351
H3: p3 = 0.1986, p{3} = .1991
H4: p4 = 0.0191, p{4} < .0355

(Start at bottom.)
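The four subsets with exact p-values in the illustration are the ones the shortcut needs: order the genes by unadjusted p-value, then test each gene together with all less significant genes. The sketch below rebuilds those subsets and turns their p-values into step-down adjusted p-values by taking a running maximum, which is the usual way to enforce monotonicity and is an assumption here, since the slide does not spell it out. Variable names are hypothetical.

```python
# Unadjusted p-values from the four-gene illustration.
raw_p = {1: 0.0121, 2: 0.0142, 3: 0.1986, 4: 0.0191}

# Order genes from most to least significant; the j-th needed subset holds
# the j-th gene and every less significant gene.
order = sorted(raw_p, key=raw_p.get)
needed_subsets = [frozenset(order[j:]) for j in range(len(order))]

# Permutation p-values for those subsets, copied from the illustration.
subset_p = {
    frozenset({1, 2, 3, 4}): 0.0379,
    frozenset({2, 3, 4}): 0.0351,
    frozenset({3, 4}): 0.0355,
    frozenset({3}): 0.1991,
}

# Adjusted p-value for the j-th ordered gene: running maximum of the subset
# p-values down to step j, since a gene is rejected only if every earlier
# step was rejected as well.
adjusted = {}
running_max = 0.0
for gene, subset in zip(order, needed_subsets):
    running_max = max(running_max, subset_p[subset])
    adjusted[gene] = running_max

print(order)     # [1, 2, 4, 3]
print(adjusted)  # {1: 0.0379, 2: 0.0379, 4: 0.0379, 3: 0.1991}
```

At the .05 level genes 1, 2 and 4 are rejected and gene 3 is not, using four subset tests instead of fifteen.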
MULTTEST PROCEDURE
Tests only the needed subsets (7000, not $2^{7000} - 1$).
Samples from the permutation distribution.
Only one sample is needed, not 7000 distinct samples:
The joint distribution of minP is identical under HK and HS. (Called the “subset pivotality” condition by Westfall and Young, 1993.)
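The three points above can be sketched as a single pass over one set of random relabellings. This is an illustration of the step-down idea under stated assumptions, not a description of how PROC MULTTEST is implemented: it ranks by max |t| instead of min p (equivalent when every gene uses the same t reference distribution), it runs on simulated toy data, and all names are hypothetical.

```python
import random
from math import sqrt

def abs_pooled_t(values, in_group1):
    """Absolute pooled-variance two-sample t statistic for one gene."""
    x = [v for v, g in zip(values, in_group1) if g]
    y = [v for v, g in zip(values, in_group1) if not g]
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    sp = sqrt(ss / (n1 + n2 - 2))
    return abs(m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))

def stepdown_adjusted_p(genes, in_group1, n_perm=2000, seed=0):
    """genes[i] lists the expression values of gene i across observations."""
    rng = random.Random(seed)
    m = len(genes)
    observed = [abs_pooled_t(g, in_group1) for g in genes]
    # Least significant gene first, so successive maxima build the nested subsets.
    order = sorted(range(m), key=lambda i: observed[i])
    counts = [0] * m
    labels = list(in_group1)
    for _ in range(n_perm):
        rng.shuffle(labels)  # one relabelling, shared by all genes and all subsets
        running_max = 0.0
        for i in order:
            running_max = max(running_max, abs_pooled_t(genes[i], labels))
            if running_max >= observed[i]:
                counts[i] += 1
    adjusted = [c / n_perm for c in counts]
    # Enforce monotonicity from most to least significant gene.
    running = 0.0
    for i in reversed(order):
        running = max(running, adjusted[i])
        adjusted[i] = running
    return adjusted

# Simulated toy data (not the leukemia data): 5 genes, 5 versus 6 observations.
data_rng = random.Random(1)
in_group1 = [True] * 5 + [False] * 6
genes = [[data_rng.gauss(0, 1) for _ in in_group1] for _ in range(5)]
# Shift group 1 of gene 0 to plant a real difference.
genes[0] = [v + 4 if g else v for v, g in zip(genes[0], in_group1)]

adj = stepdown_adjusted_p(genes, in_group1)
print(adj)
```

Each relabelling is reused for every nested subset, which is what subset pivotality licenses; without it, each subset would need its own resampling under its own null hypothesis.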
PROC MULTTEST code
proc multtest noprint out=adjp holm hoc stepperm n=200000;
    class type;  /* AML or ALL */
    test mean (gene1-gene7123);
    contrast 'AML vs ALL' -1 1;
run;
proc sort data=adjp(where=(raw_p le .0005));
    by raw_p;
run;
proc print;
    var _var_ raw_p stppermp;
run;
PROC MULTTEST Output

OBS   Variable   Raw p-value   Stepdown permutation p-value
1     GENE3320   1.4E-10       0.0001
2     GENE4847   2.4E-10       0.0001
3     GENE2020   6.6E-10       0.0001
4     GENE1745   1.0E-08       0.0002
5     GENE5039   1.0E-08       0.0002
6     GENE1834   1.5E-08       0.0003
7     GENE461    3.6E-08       0.0005
8     GENE4196   6.2E-08       0.0007
9     GENE3847   7.2E-08       0.0008
10    GENE2288   8.9E-08       0.0010
11    GENE1249   1.7E-07       0.0017
12    GENE6201   1.8E-07       0.0017
13    GENE2242   2.0E-07       0.0019
14    GENE3258   2.1E-07       0.0020
15    GENE1882   3.2E-07       0.0029
16    GENE2111   3.7E-07       0.0033
17    GENE2121   5.8E-07       0.0048
18    GENE6200   6.2E-07       0.0051
19    GENE6373   8.2E-07       0.0065
20    GENE6539   1.1E-06       0.0086
21    GENE2043   1.3E-06       0.0095
22    GENE2759   1.3E-06       0.0097
23    GENE6803   1.4E-06       0.0104
24    GENE1674   1.5E-06       0.0106
25    GENE2402   1.5E-06       0.0109
26    GENE2186   1.7E-06       0.0118
27    GENE6376   2.1E-06       0.0142
28    GENE3605   2.6E-06       0.0169
29    GENE6806   2.6E-06       0.0170
30    GENE1829   2.7E-06       0.0177
31    GENE6797   3.0E-06       0.0194
32    GENE6677   3.4E-06       0.0216
33    GENE4052   3.7E-06       0.0231
34    GENE1394   4.9E-06       0.0290
35    GENE6405   5.4E-06       0.0311
36    GENE248    6.4E-06       0.0359
37    GENE2267   6.5E-06       0.0364
38    GENE6041   7.8E-06       0.0429
39    GENE6005   8.0E-06       0.0439
40    GENE5772   9.0E-06       0.0480
(50 minutes for 200,000 samples)
Imbalance Issues
Use of Student t statistics does result in an exact, closed multiple testing procedure, but ...
There is imbalance: less power for gene types that are highly kurtotic than for normally distributed types.
Solutions:
• Use exact unadjusted p-values
  – Already available for binary data
  – Computational difficulties otherwise
• Rank-transform the data prior to analysis
Rank Transform for Better Balance
proc rank;
    var gene1-gene7123;
run;
proc multtest noprint out=adjp holm hoc stepperm n=200000;
    class type;  /* AML or ALL */
    test mean (gene1-gene7123);
    contrast 'AML vs ALL' -1 1;
run;
proc sort data=adjp(where=(raw_p le .0005));
    by raw_p;
run;
proc print;
    var _var_ raw_p stppermp;
run;
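What the rank step does to one gene can be sketched as follows: each expression value is replaced by its rank among all 38 observations, pooled across both groups, and tied values share the mean of their ranks. Mean ranks for ties are, to my understanding, the PROC RANK default, but treat that as an assumption; the function name and the example values are hypothetical.

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block while the next sorted value ties with the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

print(midranks([310.0, 25.0, 1800.0, 25.0, 90.0]))  # [4.0, 1.5, 5.0, 1.5, 3.0]
```

After the transform every gene has essentially the same marginal distribution, so an outlying value such as 1800 counts only as "largest", which is what puts heavy-tailed and normally distributed genes on a more equal footing.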
Rank Transformed Results

OBS   Variable   Raw p-value   Stepdown permutation p-value
1     GENE4847   5.0E-09       0.0000
2     GENE1882   1.1E-08       0.0001
3     GENE3320   3.3E-08       0.0003
4     GENE6218   3.3E-08       0.0003
5     GENE1834   4.6E-08       0.0004
6     GENE760    4.6E-08       0.0004
7     GENE2020   6.4E-08       0.0005
8     GENE5039   6.4E-08       0.0005
9     GENE1745   6.4E-08       0.0005
10    GENE4499   6.4E-08       0.0005
11    GENE2267   8.8E-08       0.0008
12    GENE5772   1.2E-07       0.0011
13    GENE6041   3.1E-07       0.0030
14    GENE4377   4.1E-07       0.0040
15    GENE2354   4.1E-07       0.0040
16    GENE6855   4.1E-07       0.0040
17    GENE3252   7.2E-07       0.0070
18    GENE2121   7.2E-07       0.0070
19    GENE248    9.4E-07       0.0090
20    GENE2015   9.4E-07       0.0090
21    GENE312    1.2E-06       0.0119
22    GENE1630   1.2E-06       0.0119
23    GENE1144   1.2E-06       0.0119
24    GENE4535   1.2E-06       0.0119
26    GENE2233   1.6E-06       0.0152
27    GENE4780   1.6E-06       0.0152
28    GENE4107   1.6E-06       0.0152
29    GENE6539   1.6E-06       0.0152
30    GENE3847   2.1E-06       0.0194
31    GENE6201   2.1E-06       0.0194
32    GENE1928   2.7E-06       0.0247
33    GENE2759   3.0E-06       0.0260
34    GENE4373   3.4E-06       0.0291
35    GENE50     3.4E-06       0.0307
36    GENE4196   3.4E-06       0.0307
37    GENE804    3.4E-06       0.0307
38    GENE3507   3.4E-06       0.0307
39    GENE4328   3.4E-06       0.0307
40    GENE4546   3.8E-06       0.0323
41    GENE4052   4.3E-06       0.0383
42    GENE5501   4.3E-06       0.0383
43    GENE6806   4.3E-06       0.0383
44    GENE6378   4.9E-06       0.0404
45    GENE1249   5.5E-06       0.0469
46    GENE2402   5.5E-06       0.0469
47    GENE6797   5.5E-06       0.0469
48    GENE6803   5.5E-06       0.0469
49    GENE1539   6.1E-06       0.0495
Comparing ALL and AML for Gene 6128
[Figure: GENE6128 expression level (vertical axis, 0 to 2000) by TYPE (ALL, AML).]
Is Better Balance Good?
• Maybe not: imbalance induces a more powerful multiple testing procedure
  – Bonferroni multiplier implicitly reduced through imbalance
  – Serendipity!
Summary
• Westfall-Young Method is an exact, closed testing method, despite large p, small n
• Detected genes are “honestly significant”
• Robust (nonparametric)