
Strong Control of the Familywise Type I Error Rate in DNA Microarray Analysis Using Exact Step-Down Permutation Tests

Peter H. Westfall

Texas Tech University

Microarrays

• Glass (1 cm²)
• ~6,500 genes
• Different cDNA sequence

Example

Group 1: Acute Myeloid Leukemia (AML), n1 = 11

Group 2: Acute Lymphoblastic Leukemia (ALL), n2 = 27

Data:

OBS   TYPE   G1   G2   G3   …   G7000
1     AML    (Gene expression levels)
2     AML
…     …
11    AML
12    ALL
…     …
38    ALL

Testing for 7000 Gene Expression Levels

Goal: Test $H_{0i}: F_{ALL,i} = F_{AML,i}$ for $i = 1,\dots,7000$.

Here, “F” denotes cdf.

Many choices for test statistics.

Multiplicity problem: If each test is done at α = .05 and 6600 of the genes are truly equivalent, then about .05 × 6600 = 330 equivalent genes are expected to be declared “non-equivalent.”
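The arithmetic behind the multiplicity problem can be sketched in a few lines of Python. The expected count uses only the numbers on the slide; the familywise error figure additionally assumes independent tests, which is an assumption made here for illustration and not a claim from the slides.

```python
alpha = 0.05          # per-test level, from the slide
true_nulls = 6600     # number of truly equivalent genes, from the slide

# Expected number of equivalent genes falsely declared "non-equivalent".
expected_false_positives = alpha * true_nulls

# Chance of at least one false positive IF the tests were independent
# (an illustrative assumption; gene expression levels are correlated).
fwe_if_independent = 1 - (1 - alpha) ** true_nulls

print(expected_false_positives)
print(fwe_if_independent)
```

Without a multiplicity adjustment, at least one false discovery is a near certainty, which is what motivates controlling the familywise error rate.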

Closed Testing to Control False Discoveries

Let S = {1,2,…,7000} (gene labels).

Let $K = \{i_1,\dots,i_k\} \subseteq S$ denote a particular subset.

The Closed Testing Procedure:

1. Test $H_{0K}: F_{ALL,K} = F_{AML,K}$ for each $K \subseteq S$, using a valid $\alpha$-level test for each.

2. Reject $H_{0i}: F_{ALL,i} = F_{AML,i}$ if $H_{0K}$ is rejected for all $K \supseteq \{i\}$.
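The two steps can be sketched by brute force for a handful of hypotheses. This is a minimal Python illustration, not the procedure used in the slides: the local test here is a Bonferroni test on each subset (an assumption chosen because it needs no data), and the gene labels and raw p-values are hypothetical.

```python
from itertools import combinations

def closed_testing(raw_p, local_p, alpha=0.05):
    """Brute-force closed testing: reject H_i only if every subset K
    containing i is rejected by its own alpha-level local test."""
    labels = list(raw_p)
    subsets = [frozenset(K)
               for size in range(1, len(labels) + 1)
               for K in combinations(labels, size)]
    rejected = {K for K in subsets if local_p(K, raw_p) <= alpha}
    return [i for i in labels
            if all(K in rejected for K in subsets if i in K)]

def bonferroni_local(K, raw_p):
    # Assumed local test for the intersection hypothesis H_K.
    return min(1.0, len(K) * min(raw_p[i] for i in K))

# Hypothetical raw p-values for three genes.
raw_p = {"g1": 0.001, "g2": 0.02, "g3": 0.30}
print(closed_testing(raw_p, bonferroni_local))
```

With a Bonferroni local test the closure reduces to Holm's method; the enumeration above touches all 2^m − 1 subsets, so it is only feasible for small m.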

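Theorem: The CTP strongly controls the FWE.

Proof: Suppose $H_{0j_1},\dots,H_{0j_m}$ all are true (which ones is unknown to you).

You can reject at least one of them only when you reject the intersection $H_{0j_1} \cap \dots \cap H_{0j_m}$.

Thus,

$$\mathrm{FWE} = P(\text{reject at least one of } H_{0j_1},\dots,H_{0j_m} \mid H_{0j_1},\dots,H_{0j_m} \text{ all are true})$$

$$\le P(\text{reject } H_{0j_1} \cap \dots \cap H_{0j_m} \mid H_{0j_1},\dots,H_{0j_m} \text{ all are true}) \le \alpha.$$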

Exact Tests for Composite Hypotheses H0K

Use the permutation distribution of $\min_{i \in K} p_i$, where $p_i = 2P(T_{38-2} > |t_i|)$, and

$$t_i = \frac{\bar{X}_{AML,i} - \bar{X}_{ALL,i}}{s_{p,i}\sqrt{\frac{1}{11} + \frac{1}{27}}}.$$

p-value = proportion of the 38!/(27!11!) permutations for which $\min_{i \in K} P_i^* \le \min_{i \in K} p_i$.
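A minimal Python sketch of this exact subset test on a toy data set follows. Because every gene shares the same degrees of freedom, $p_i$ is a decreasing function of $|t_i|$, so the smallest p-value belongs to the largest $|t_i|$; the sketch uses that equivalence to avoid evaluating the t distribution. The data, the group sizes (3 and 4 instead of 11 and 27), and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def abs_t(x, y):
    """Absolute pooled-variance two-sample t statistic."""
    n1, n2 = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    s_p = sqrt(ss / (n1 + n2 - 2))
    return abs(mx - my) / (s_p * sqrt(1 / n1 + 1 / n2))

def max_abs_t(data, group1_rows, K):
    """Largest |t| over the genes in K (equivalent to the smallest p)."""
    best = 0.0
    for g in K:
        x = [row[g] for r, row in enumerate(data) if r in group1_rows]
        y = [row[g] for r, row in enumerate(data) if r not in group1_rows]
        best = max(best, abs_t(x, y))
    return best

def exact_subset_p(data, n1, K):
    """Proportion of all group relabelings whose max |t| over K is at
    least the observed one; the first n1 rows are the observed group 1."""
    n = len(data)
    observed = max_abs_t(data, set(range(n1)), K)
    hits = total = 0
    for rows in combinations(range(n), n1):
        total += 1
        if max_abs_t(data, set(rows), K) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical toy data: 7 subjects (rows) by 2 genes (columns).
toy = [[10.0, 5.0], [11.0, 4.0], [12.0, 6.5],               # group 1
       [1.0, 5.5], [2.0, 4.5], [3.0, 6.0], [4.0, 5.2]]      # group 2
print(exact_subset_p(toy, 3, [0, 1]))
```

Whole rows are relabeled together, so the dependence between genes is carried into the permutation distribution without ever estimating a covariance matrix.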

Note: Exact despite “massively singular” covariance matrix!

A Slight Problem...

There are $2^{7000} - 1$ subsets K to be tested

This might take a while...

A Fantastic Simplification

You need only test 7000 of the $2^{7000} - 1$ subsets!

Why? Because

$$P\left(\min_{i \in K} P_i^* \le c\right) \le P\left(\min_{i \in K'} P_i^* \le c\right) \quad \text{when } K \subseteq K'.$$

Significance for most lower order subsets is determined by significance of higher order subsets.
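The inequality follows from a one-line containment argument, spelled out here as a short derivation (standard reasoning, written in the notation of the slide):

```latex
\begin{align*}
K \subseteq K'
  &\;\Longrightarrow\; \min_{i \in K'} P_i^* \;\le\; \min_{i \in K} P_i^* \\
  &\;\Longrightarrow\; \Bigl\{ \min_{i \in K} P_i^* \le c \Bigr\}
     \;\subseteq\; \Bigl\{ \min_{i \in K'} P_i^* \le c \Bigr\} \\
  &\;\Longrightarrow\; P\Bigl( \min_{i \in K} P_i^* \le c \Bigr)
     \;\le\; P\Bigl( \min_{i \in K'} P_i^* \le c \Bigr).
\end{align*}
```

So when K and K' share the same observed minimum p-value, the p-value for $H_K$ cannot exceed the p-value for $H_{K'}$, and rejecting the larger subset already settles the smaller one.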

Illustration with Four Genes

H{1234}:  min p = .0121,  p{1234} = .0379

H{123}:   min p = .0121,  p{123} < .0379
H{124}:   min p = .0121,  p{124} < .0379
H{134}:   min p = .0121,  p{134} < .0379
H{234}:   min p = .0142,  p{234} = .0351

H{12}:    min p = .0121,  p{12} < .0379
H{13}:    min p = .0121,  p{13} < .0379
H{14}:    min p = .0121,  p{14} < .0379
H{23}:    min p = .0142,  p{23} < .0351
H{24}:    min p = .0142,  p{24} < .0351
H{34}:    min p = .0191,  p{34} = .0355

H1:       p1 = 0.0121,  p{1} < .0379
H2:       p2 = 0.0142,  p{2} < .0351
H3:       p3 = 0.1986,  p{3} = .1991
H4:       p4 = 0.0191,  p{4} < .0355

(Start at bottom.)
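Only four of the fifteen subset p-values in the illustration are stated exactly: those for {1234}, {234}, {34} and {3}, the nested subsets obtained by removing the gene with the smallest raw p-value one at a time. A gene's adjusted p-value is the largest subset p-value among the subsets containing it, which here is the running maximum along that chain. The Python sketch below reproduces that bookkeeping with the numbers from the illustration; the variable names are illustrative.

```python
# Subset p-values read off the illustration, in step-down order:
# {1234} settles gene 1, {234} gene 2, {34} gene 4, {3} gene 3.
steps = [("gene1", 0.0379), ("gene2", 0.0351),
         ("gene4", 0.0355), ("gene3", 0.1991)]

adjusted = {}
running_max = 0.0
for gene, subset_p in steps:
    # A gene cannot be more significant than any superset hypothesis.
    running_max = max(running_max, subset_p)
    adjusted[gene] = running_max

print(adjusted)
```

Genes 1, 2 and 4 all receive an adjusted p-value of .0379, and gene 3 receives .1991.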

MULTTEST PROCEDURE

Tests only the needed subsets (7000, not $2^{7000} - 1$).

Samples from the permutation distribution.

Only one sample is needed, not 7000 distinct samples:

The joint distribution of min P is identical under $H_K$ and $H_S$. (Called the “subset pivotality” condition by Westfall and Young, 1993.)
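A compact Python sketch of the single-sample step-down idea follows. It is not the PROC MULTTEST implementation: it draws random relabelings with the standard library, uses the largest |t| in place of the smallest p-value (equivalent when every gene has the same degrees of freedom), and runs on simulated toy data. All names, sizes and the seed are hypothetical.

```python
import random
from math import sqrt
from statistics import mean

def abs_t(values, labels):
    """Absolute pooled-variance t statistic for a 0/1 grouping."""
    x = [v for v, g in zip(values, labels) if g == 1]
    y = [v for v, g in zip(values, labels) if g == 0]
    mx, my = mean(x), mean(y)
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    if ss == 0:
        return float("inf") if mx != my else 0.0
    s_p = sqrt(ss / (len(x) + len(y) - 2))
    return abs(mx - my) / (s_p * sqrt(1 / len(x) + 1 / len(y)))

def stepdown_adjusted_p(data, labels, n_samples=2000, seed=0):
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*data)]
    m = len(columns)
    observed = [abs_t(col, labels) for col in columns]
    order = sorted(range(m), key=lambda g: -observed[g])  # most significant first
    counts = [0] * m
    perm = list(labels)
    for _ in range(n_samples):
        rng.shuffle(perm)            # one relabeling serves every subset
        stats = [abs_t(col, perm) for col in columns]
        running_max = 0.0
        for rank in range(m - 1, -1, -1):   # least significant gene first
            g = order[rank]
            running_max = max(running_max, stats[g])
            if running_max >= observed[g]:
                counts[rank] += 1
    adj = [c / n_samples for c in counts]
    for rank in range(1, m):         # enforce step-down monotonicity
        adj[rank] = max(adj[rank], adj[rank - 1])
    return {order[rank]: adj[rank] for rank in range(m)}

# Simulated toy data: 12 subjects, 6 genes, gene 0 shifted in group 1.
data_rng = random.Random(1)
labels = [1] * 5 + [0] * 7
data = [[data_rng.gauss(0, 1) + (10 if g == 0 and lab == 1 else 0)
         for g in range(6)] for lab in labels]
adjusted = stepdown_adjusted_p(data, labels)
print(adjusted)
```

Each relabeling is reused for all nested subsets by accumulating successive maxima from the least significant gene upward, which is why a single set of resamples suffices.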

PROC MULTTEST code

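/* Holm, Hochberg, and step-down permutation adjustments; 200,000 resamples */
proc multtest noprint out=adjp holm hoc stepperm n=200000;
   class type;                      /* AML or ALL */
   test mean (gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;

/* Keep only genes with very small raw p-values, most significant first */
proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;
run;

proc print;
   var _var_ raw_p stppermp;
run;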

PROC MULTTEST Output

                            Stepdown                                  Stepdown
                  Raw       permutation                     Raw       permutation
OBS   Variable    p-value   p-value       OBS   Variable    p-value   p-value

  1   GENE3320    1.4E-10   0.0001         21   GENE2043    1.3E-06   0.0095
  2   GENE4847    2.4E-10   0.0001         22   GENE2759    1.3E-06   0.0097
  3   GENE2020    6.6E-10   0.0001         23   GENE6803    1.4E-06   0.0104
  4   GENE1745    1.0E-08   0.0002         24   GENE1674    1.5E-06   0.0106
  5   GENE5039    1.0E-08   0.0002         25   GENE2402    1.5E-06   0.0109
  6   GENE1834    1.5E-08   0.0003         26   GENE2186    1.7E-06   0.0118
  7   GENE461     3.6E-08   0.0005         27   GENE6376    2.1E-06   0.0142
  8   GENE4196    6.2E-08   0.0007         28   GENE3605    2.6E-06   0.0169
  9   GENE3847    7.2E-08   0.0008         29   GENE6806    2.6E-06   0.0170
 10   GENE2288    8.9E-08   0.0010         30   GENE1829    2.7E-06   0.0177
 11   GENE1249    1.7E-07   0.0017         31   GENE6797    3.0E-06   0.0194
 12   GENE6201    1.8E-07   0.0017         32   GENE6677    3.4E-06   0.0216
 13   GENE2242    2.0E-07   0.0019         33   GENE4052    3.7E-06   0.0231
 14   GENE3258    2.1E-07   0.0020         34   GENE1394    4.9E-06   0.0290
 15   GENE1882    3.2E-07   0.0029         35   GENE6405    5.4E-06   0.0311
 16   GENE2111    3.7E-07   0.0033         36   GENE248     6.4E-06   0.0359
 17   GENE2121    5.8E-07   0.0048         37   GENE2267    6.5E-06   0.0364
 18   GENE6200    6.2E-07   0.0051         38   GENE6041    7.8E-06   0.0429
 19   GENE6373    8.2E-07   0.0065         39   GENE6005    8.0E-06   0.0439
 20   GENE6539    1.1E-06   0.0086         40   GENE5772    9.0E-06   0.0480

(50 minutes for 200,000 samples)

Imbalance Issues

Use of Student t statistics does result in an exact, closed multiple testing procedure, but ...

There is imbalance: less power for gene types that are highly kurtotic than for normally distributed types.

Solutions:

• Use exact unadjusted p-values
  – Already available for binary data
  – Computational difficulties otherwise

• Rank-transform the data prior to analysis

Rank Transform for Better Balance

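/* No RANKS statement, so the ranks replace the gene values in the output data set */
proc rank;
   var gene1-gene7123;
run;

/* Same analysis as before, now run on the ranked data */
proc multtest noprint out=adjp holm hoc stepperm n=200000;
   class type;                      /* AML or ALL */
   test mean (gene1-gene7123);
   contrast 'AML vs ALL' -1 1;
run;

proc sort data=adjp(where=(raw_p le .0005));
   by raw_p;
run;

proc print;
   var _var_ raw_p stppermp;
run;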
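For readers without SAS, the rank transform itself can be sketched in Python. The sketch ranks each gene across all subjects and gives tied values their average rank; that tie rule and the toy numbers are assumptions made for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        avg = (start + end) / 2 + 1   # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def rank_transform(data):
    """Rank each gene (column) across all subjects (rows)."""
    ranked_columns = [average_ranks(list(col)) for col in zip(*data)]
    return [list(row) for row in zip(*ranked_columns)]

# Hypothetical expression values for one gene with an extreme observation.
print(average_ranks([10.0, 2000.0, 10.0, 35.0]))
```

The extreme value 2000 becomes rank 4, one step above its neighbor, so a single outlying subject can no longer dominate the pooled standard deviation of a heavy-tailed gene.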

Rank Transformed Results

                            Stepdown                                  Stepdown
                  Raw       permutation                     Raw       permutation
OBS   Variable    p-value   p-value       OBS   Variable    p-value   p-value

  1   GENE4847    5.0E-09   0.0000         26   GENE2233    1.6E-06   0.0152
  2   GENE1882    1.1E-08   0.0001         27   GENE4780    1.6E-06   0.0152
  3   GENE3320    3.3E-08   0.0003         28   GENE4107    1.6E-06   0.0152
  4   GENE6218    3.3E-08   0.0003         29   GENE6539    1.6E-06   0.0152
  5   GENE1834    4.6E-08   0.0004         30   GENE3847    2.1E-06   0.0194
  6   GENE760     4.6E-08   0.0004         31   GENE6201    2.1E-06   0.0194
  7   GENE2020    6.4E-08   0.0005         32   GENE1928    2.7E-06   0.0247
  8   GENE5039    6.4E-08   0.0005         33   GENE2759    3.0E-06   0.0260
  9   GENE1745    6.4E-08   0.0005         34   GENE4373    3.4E-06   0.0291
 10   GENE4499    6.4E-08   0.0005         35   GENE50      3.4E-06   0.0307
 11   GENE2267    8.8E-08   0.0008         36   GENE4196    3.4E-06   0.0307
 12   GENE5772    1.2E-07   0.0011         37   GENE804     3.4E-06   0.0307
 13   GENE6041    3.1E-07   0.0030         38   GENE3507    3.4E-06   0.0307
 14   GENE4377    4.1E-07   0.0040         39   GENE4328    3.4E-06   0.0307
 15   GENE2354    4.1E-07   0.0040         40   GENE4546    3.8E-06   0.0323
 16   GENE6855    4.1E-07   0.0040         41   GENE4052    4.3E-06   0.0383
 17   GENE3252    7.2E-07   0.0070         42   GENE5501    4.3E-06   0.0383
 18   GENE2121    7.2E-07   0.0070         43   GENE6806    4.3E-06   0.0383
 19   GENE248     9.4E-07   0.0090         44   GENE6378    4.9E-06   0.0404
 20   GENE2015    9.4E-07   0.0090         45   GENE1249    5.5E-06   0.0469
 21   GENE312     1.2E-06   0.0119         46   GENE2402    5.5E-06   0.0469
 22   GENE1630    1.2E-06   0.0119         47   GENE6797    5.5E-06   0.0469
 23   GENE1144    1.2E-06   0.0119         48   GENE6803    5.5E-06   0.0469
 24   GENE4535    1.2E-06   0.0119         49   GENE1539    6.1E-06   0.0495

Comparing ALL and AML for Gene 6128

[Figure: GENE6128 expression level (vertical axis, 0 to 2000) plotted by TYPE (ALL, AML).]

Is Better Balance Good?

• Maybe not: imbalance induces a more powerful multiple testing procedure
  – Bonferroni multiplier implicitly reduced through imbalance
  – Serendipity!

Summary

• Westfall-Young Method is an exact, closed testing method, despite large p, small n

• Detected genes are “honestly significant”

• Robust (nonparametric)
