

STATISTICS IN MEDICINE
Statist. Med. 2000; 19:1319–1328

Pros and cons of permutation tests in clinical trials†

Vance W. Berger*,‡

Food and Drug Administration, Center for Biologics Evaluation and Research, 1401 Rockville Pike 200S, HFM-215, Rockville, MD 20852-1448, U.S.A.

SUMMARY

Hypothesis testing, in which the null hypothesis specifies no difference between treatment groups, is an important tool in the assessment of new medical interventions. For randomized clinical trials, permutation tests that reflect the actual randomization are design-based analyses for such hypotheses. This means that only such design-based permutation tests can ensure internal validity, without which external validity is irrelevant. However, because of the conservatism of permutation tests, the virtues of permutation tests continue to be debated in the literature, and conclusions are generally of the type that permutation tests should always be used or permutation tests should never be used. A better conclusion might be that there are situations in which permutation tests should be used, and other situations in which permutation tests should not be used. This approach opens the door to broader agreement, but begs the obvious question of when to use permutation tests. We consider this issue from a variety of perspectives, and conclude that permutation tests are ideal to study efficacy in a randomized clinical trial which compares, in a heterogeneous patient population, two or more treatments, each of which may be most effective in some patients, when the primary analysis does not adjust for covariates. We propose the p-value interval as a novel measure of the conservatism of a permutation test that can be defined independently of the significance level. This p-value interval can be used to ensure that the permutation test has both good global power and an acceptable degree of conservatism. Copyright © 2000 John Wiley & Sons, Ltd.

1. INTRODUCTION

Recent publications regarding the evaluation of carvedilol [1–3] have illustrated the key role played in the drug approval process of hypothesis testing, in which the null hypothesis specifies no between-group difference. In this article, we consider only randomized clinical trials (RCTs), involving a convenience (not random) sample of patients who are randomly allocated to treatment groups. For RCTs, design-based analyses that test the null hypothesis of no treatment effect are permutation tests (PTs) [4–6] that reflect the design of the study, including stratification, minimization, or other restrictions that may have been used for the randomization [7, 8]. When we speak of PTs, we are referring to only such design-based PTs. Only such PTs will strictly preserve the type I error rate of between-group tests in RCTs. We note that a type II error (failing to reject the null hypothesis when it is false) might have equally serious consequences as a type I error (rejecting a true null hypothesis) if a type II error were actually an error, that is, if failure to reject the null hypothesis were interpreted as establishing the truth of the null hypothesis. In fact, this is not the case. Negative studies can be explained, and used to design better or larger RCTs. However, a type I error is rarely recognized as being a type I error, and there is a better chance that the results will be accepted on face value (making it unlikely that the study will be repeated) when the null hypothesis is rejected than when it is not [9]. As such, the key property of a hypothesis test is exactness or internal validity [3, 6], which is a prerequisite for external validity [3] (Reference [10], p. 228).

* Correspondence to: Vance Berger, Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, Executive Plaza North, Suite 344, Bethesda, MD 20892-7354, U.S.A.
‡ E-mail: vance.berger@nih.gov
† The views, opinions and assertions expressed in this article are those of the author. No endorsement by the Food and Drug Administration is intended or should be inferred.

Contract/grant sponsor: FDA; contract/grant number: RSR-96-004A

Received July 1998. Accepted September 1999.

The view was recently expressed [4] that because the overwhelming majority of randomized biomedical studies are RCTs, PTs are more appropriate than approximate tests (ATs) in some general sense. This argument fails to account for the possibility that the decision to use a PT or not can be based on the specifics of the situation. Regardless of the frequency with which studies use random sampling or random allocation, it is best to use methods appropriate to random sampling when random sampling is used, and to use methods appropriate to RCTs for RCTs. ATs differ from PTs in that their reference distributions (RDs) are not design-based, but rather are based on distributional or other models. Even a test based on re-randomization is an AT if this re-randomization does not reflect the actual randomization used in the study. On the other hand, if making use of distributional assumptions results in the same PT that would have resulted from design-based re-randomization, then this test is a PT. We note that ATs are used quite frequently in RCTs, despite the fact that only PTs strictly preserve the type I error rate. In Sections 2 and 3 we refute misconceptions that likely increase the use of ATs and PTs, respectively, in RCTs. In Section 4 we discuss limitations of PTs in a RCT. In Section 5 we summarize the pros and cons of PTs in RCTs, with an eye towards delineating those situations in which a PT might prove useful from those in which it might not. The reader interested in only this article's conclusions might wish to skip to Sections 4.1 and 5.
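The distinction between a design-based reference distribution and a re-randomization that ignores the design can be made concrete with a small enumeration. The following is a minimal sketch only: the data, the helper names (`mean_diff`, `allocations`, `permutation_p`) and the choice of a stratified design with fixed numbers treated per stratum are all assumptions made for illustration, not taken from the article.

```python
from itertools import combinations, product

def mean_diff(y, z):
    """Difference in mean outcome, treated (z=1) minus control (z=0)."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def allocations(z, strata):
    """Yield every allocation the (assumed) design could have produced:
    within each stratum the number treated is held at its observed value."""
    per_stratum = []
    for s in sorted(set(strata)):
        idx = [i for i, si in enumerate(strata) if si == s]
        k = sum(z[i] for i in idx)
        per_stratum.append([set(c) for c in combinations(idx, k)])
    for choice in product(*per_stratum):
        treated = set().union(*choice)
        yield [1 if i in treated else 0 for i in range(len(z))]

def permutation_p(y, z, strata):
    """One-sided p-value P0{T >= t*} over the chosen reference set."""
    t_obs = mean_diff(y, z)
    ref = [mean_diff(y, zz) for zz in allocations(z, strata)]
    # Small tolerance so ties with the observed value are counted.
    return sum(t >= t_obs - 1e-12 for t in ref) / len(ref)

# Hypothetical data: two strata of four patients, two treated in each.
y      = [5.1, 4.8, 3.9, 4.0, 9.2, 8.7, 7.9, 8.1]
z      = [1,   1,   0,   0,   1,   1,   0,   0]
strata = [0,   0,   0,   0,   1,   1,   1,   1]

p_design  = permutation_p(y, z, strata)         # 6 * 6 = 36 allocations
p_ignored = permutation_p(y, z, [0] * len(y))   # C(8, 4) = 70 allocations
print(p_design, p_ignored)
```

The two p-values differ because the two reference sets differ; in the article's terminology only the first is a PT for a stratified design, while the second would be an AT even though it is computed by re-randomization.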

2. MISCONCEPTIONS AGAINST PERMUTATION TESTS

This section will explore some of the reasons cited for using ATs, and not PTs, in RCTs.

2.1. Hypothesis Testing Is Not Useful Anyway, So Who Cares How It Is Done?

Hypothesis testing plays a prominent role, and will likely continue to play a prominent role, in medical decision-making. In fact, two RCTs, each with a statistically significant benefit in favour of the experimental treatment, are often required for FDA approval [1]. Debate of the merits of allowing exceptions to this usual FDA policy [2, 3] is outside the scope of this article, as is debate concerning hypothesis testing relative to other approaches to inference [2, 11]. Even if one allows for exceptions to the typical FDA rule, a one-to-one correspondence between significant p-values and favourable results (approval of new products, publication of manuscripts, funding of grants etc.) is not required to argue that the p-values used, however they are used, should not misrepresent the extent of significance. Put differently, low p-values are persuasive only when solid methodology (PTs that preserve the type I error rate) is employed [3]. See also Section 3.1. Consequently, any improvement in the way hypothesis testing is conducted will result in a widespread benefit to the practice of biostatistics.

In addition, exact hypothesis tests can be used to generate exact confidence intervals. For a location family, PTs can test the null hypothesis of a non-zero treatment effect. This is accomplished by making use of the strong null hypothesis (see Section 4.1) that specifies a non-zero shift in a patient's outcome had the patient been randomized to the other treatment group. A confidence interval for the difference between means can then be constructed as the set of values that cannot be rejected by such a PT. The resulting confidence interval will be exact. This means that even those who prefer confidence intervals to hypothesis testing can benefit from improvement in the methodology for computing p-values.
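The test-inversion construction just described can be sketched as follows. This is an illustration under stated assumptions: unrestricted randomization with fixed group sizes, a two-sided difference-in-means statistic, a finite grid of candidate shifts, and made-up data. The function names are hypothetical.

```python
from itertools import combinations

def perm_p_two_sided(y, z):
    """Exact two-sided permutation p-value for the difference in means,
    enumerating every allocation with the observed group sizes."""
    n, k = len(y), sum(z)
    total = sum(y)

    def stat(treated_sum):
        return abs(treated_sum / k - (total - treated_sum) / (n - k))

    t_obs = stat(sum(yi for yi, zi in zip(y, z) if zi))
    ref = [stat(sum(y[i] for i in c)) for c in combinations(range(n), k)]
    return sum(t >= t_obs - 1e-12 for t in ref) / len(ref)

def shift_confidence_set(y, z, grid, alpha=0.05):
    """Shifts on the grid that the PT of 'treatment adds delta to every
    outcome' fails to reject at level alpha."""
    kept = []
    for delta in grid:
        # Under the hypothesised shift, subtracting delta from treated
        # outcomes recovers what each patient would have shown on control.
        y0 = [yi - delta * zi for yi, zi in zip(y, z)]
        if perm_p_two_sided(y0, z) > alpha:
            kept.append(delta)
    return kept

# Hypothetical outcomes: four treated, four control.
y = [12.1, 11.4, 13.0, 12.6, 9.8, 10.5, 10.1, 9.3]
z = [1, 1, 1, 1, 0, 0, 0, 0]
grid = [d / 10 for d in range(-10, 51)]   # candidate shifts -1.0, ..., 5.0
ci = shift_confidence_set(y, z, grid)
print(min(ci), max(ci))
```

A grid is used only for simplicity; the retained set is reported by its smallest and largest members, and a finer grid or a root search would sharpen the endpoints.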

2.2. Permutation Tests Do Not Make Inference to the Target Population

One perceived weakness of PTs is that the inference extends to the patients studied, but not the target population [12]. We argue that the difficulty in generalizing to a target population is a weakness not of the PT, however, but rather of the study design. If we accept that RCT results cannot apply more to the general population than to the patients studied, then external validity is limited by internal validity. By ensuring internal validity, the PT actually enhances the ability to extrapolate results. Why is external validity so elusive? In addition to non-random sampling, entry criteria (including the willingness to be randomized), run-in selection [13], and a corollary to the Heisenberg uncertainty principle that states that one cannot observe a patient without altering that patient's response, especially if the patient is aware of being observed (Reference [14], p. 18), selection into a RCT confers upon a patient special ancillary care and attention which would not be provided to unselected patients, making care on any arm of a RCT potentially better than care outside the RCT [15]. As such, the patients in the RCT differ fundamentally from those of the target population, if for no other reason than by virtue of their selection into the RCT. In essence, the patients not studied in the RCT comprise an unrepresented (in the RCT) segment of the population, to which the results of the RCT need not apply [16], even if the sample appears representative of the target population [4, 17, 18]. If it were known that the results could not be generalized, then there would be no reason to conduct the RCT. However, the results may be generalizable, provided there is internal validity. As such, PTs are a prerequisite for internal validity, which is a prerequisite for external validity (see Section 5).

2.3. If the Assumptions Look Reasonable, then Approximate Tests Are Valid

Let $p(\mathrm{PT})$ and $p(\mathrm{AT})$ be, respectively, the p-value of the PT and AT. Let strategies 1 and 4 be 'always use $p(\mathrm{AT})$' and 'always use $p(\mathrm{PT})$', respectively. Because distributional assumptions will at times clearly fail to hold (the lack of random sampling of patients in RCTs necessarily precludes the possibility of random sampling from a specified distribution), strategy 2, which is 'report $p(\mathrm{AT})$ or $p(\mathrm{PT})$ as the AT assumptions do or do not look reasonable', has replaced strategy 1 in much of biostatistical practice. For example, the chi-square test, once easier to compute than Fisher's exact test for 2×2 tables, is often recommended (for example, Reference [14], p. 307), but only if all expected cell counts exceed five (this is the check of the chi-square assumption). Clearly, the quantity $\Delta = p(\mathrm{AT}) - p(\mathrm{PT})$ is of greater interest than the extent to which the AT assumptions look reasonable. Unfortunately, one cannot say that $\Delta$ is small with any certainty, even if the AT assumptions appear to be met, without actually computing $\Delta$ [8, 11]. Consider, for example, the sotalol data on page 251 of Reference [10].

The data are presented as two 2×2 contingency tables, one for confirmed reinfarction and one for any reinfarction. These data can be presented more succinctly as a single 2×3 contingency table, (33, 5, 545; 29, 8, 836), with two rows (placebo and sotalol) and three ordered columns (confirmed reinfarction, unconfirmed reinfarction, or no reinfarction). For example, there were 33 confirmed reinfarctions in the placebo arm. The Smirnov test [19] yields Monte Carlo exact (StatXact, 10 000 samples, seed = 8029780) p-values of 0.0485 (two-sided) and 0.0258 (one-sided). The asymptotic p-values are 0.9910 (two-sided) and 0.6823 (one-sided). It is unclear how one would determine the advisability of the AT, but if one were to 'think unconditionally', then the small middle margin would not be a concern. The large sample sizes (over 500 per group), coupled with expected cell counts that all exceed five, would certainly be reassuring, yet because of the small middle margin, we find that $\Delta = 0.9425$ (two-sided) and $\Delta = 0.6565$ (one-sided) for these data.
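The checks described for the sotalol table can be retraced with a short enumeration. The sketch below is hedged in two ways: it assumes the Smirnov statistic is the largest gap between the two arms' cumulative outcome proportions, which may not match StatXact's definition exactly, and it enumerates all tables with the observed margins instead of using Monte Carlo sampling, so its p-values need not coincide with the quoted ones. All variable and function names are hypothetical.

```python
from fractions import Fraction
from math import comb

placebo = (33, 5, 545)   # confirmed, unconfirmed, no reinfarction
sotalol = (29, 8, 836)
n1, n2 = sum(placebo), sum(sotalol)
cols = [a + b for a, b in zip(placebo, sotalol)]
N = n1 + n2

# Expected cell counts under independence (the usual chi-square check).
expected = [[r * c / N for c in cols] for r in (n1, n2)]

def smirnov(a, b, one_sided=False):
    """Largest gap between cumulative proportions, given the placebo
    counts a and b in the first two columns (margins held fixed)."""
    d1 = Fraction(a, n1) - Fraction(cols[0] - a, n2)
    d2 = Fraction(a + b, n1) - Fraction(cols[0] + cols[1] - a - b, n2)
    return max(d1, d2) if one_sided else max(abs(d1), abs(d2))

def exact_p(one_sided=False):
    """Sum the multivariate hypergeometric weights of every table whose
    statistic is at least as large as the observed one."""
    t_obs = smirnov(placebo[0], placebo[1], one_sided)
    hits = 0
    for a in range(cols[0] + 1):
        for b in range(cols[1] + 1):
            if smirnov(a, b, one_sided) >= t_obs:
                hits += (comb(cols[0], a) * comb(cols[1], b)
                         * comb(cols[2], n1 - a - b))
    return hits / comb(N, n1)

p_two, p_one = exact_p(), exact_p(one_sided=True)
delta_two = 0.9910 - 0.0485   # Delta from the p-values quoted in the text
print(expected, p_two, p_one, round(delta_two, 4))
```

Exact fractions are used for the statistic so that tables tied with the observed one are counted reliably.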

Strategy 3 is 'use $p(\mathrm{AT})$ or $p(\mathrm{PT})$ as $|\Delta| \le \Delta^*$ or $|\Delta| > \Delta^*$', where $\Delta^*$ is a prospectively defined tolerance threshold. The actual significance level can reach $\alpha + \Delta^*$. Strategy 3 reduces to strategy 1 or 4 if $\Delta^* = 1$ or if $\Delta^* = 0$, respectively, and accomplishes what strategy 2 purports to accomplish for small $\Delta^*$. If $\Delta^* \le |\alpha - p(\mathrm{PT})|$, then strategies 3 and 4 give the same rejection decisions. The very fact that strategy 1 is not used routinely suggests that primary interest lies in the PT, and the AT is used, at least when the decision to do so is made not a priori but rather based on the result of a preliminary test, not because it is exact for the problem for which it is exact (see Section 3.1), but rather because it is purported to be a good approximation for the problem of interest. Consequently, in this situation, an AT provides the wrong answer to the right question, and the right answer to the wrong question, and constitutes a type III error [20]. As such, strategy 3 follows the perverse logic of reporting $p(\mathrm{PT}) + \Delta$, and not $p(\mathrm{PT})$, even though $p(\mathrm{PT})$ is the quantity of interest and is known, because $\Delta$ is close to zero with high probability. This is tantamount to adding random error to any other important piece of data simply because the error tends to be small. We conclude that the comparison of strategies 4 and 2, which may be difficult to make directly, is facilitated by consideration of strategy 3 because strategy 4 is preferred to strategy 3, which in turn is preferred to strategy 2.
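The four reporting strategies can be written down compactly, which makes it easy to see that strategy 3 collapses to strategy 1 or 4 at the extremes of the tolerance. The function below is a hypothetical helper for illustration; its name and the `assumptions_ok` flag (standing in for whatever check is used in strategy 2) are assumptions.

```python
def reported_p(p_at, p_pt, strategy, tol=0.0, assumptions_ok=True):
    """Return the p-value that each strategy of Section 2.3 would report."""
    if strategy == 1:                      # always use p(AT)
        return p_at
    if strategy == 2:                      # p(AT) if its assumptions look fine
        return p_at if assumptions_ok else p_pt
    if strategy == 3:                      # p(AT) only if |Delta| <= tolerance
        return p_at if abs(p_at - p_pt) <= tol else p_pt
    if strategy == 4:                      # always use p(PT)
        return p_pt
    raise ValueError("strategy must be 1, 2, 3 or 4")

# Two-sided sotalol p-values quoted in the text.
p_at, p_pt = 0.9910, 0.0485
print(reported_p(p_at, p_pt, 2))            # expected counts pass the check
print(reported_p(p_at, p_pt, 3, tol=0.01))  # Delta far exceeds the tolerance
```

Note that strategy 3 requires computing $p(\mathrm{PT})$ in order to decide whether to report it, which is the point the text makes about the quantity of interest already being known.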

2.4. Nature Randomized the Patient Selection

While this argument is often made, we feel hard-pressed to find any merit in it. Absence of evidence of non-randomness is not evidence of absence of non-randomness, as per Section 15.2.1 of Reference [21].

2.5. Different Test Statistics Lead to Different Conclusions

Use of a PT does not obviate the need for careful consideration of the test statistic. It is a testimony to the flexibility of PTs that any AT (with either discrete or continuous data) can be converted to a PT by using the permutation reference distribution of the test statistic to compute the p-value. In addition, there are PTs, such as convex hull tests [19], that have no analogous ATs.

2.6. The Establishment Will Not Accept Permutation Tests

Because many journal editors, funding agencies, and regulatory bodies prefer 'standard' analyses, the connection between design and analysis is often ignored. As Bross [22, 23] wrote:


'Mathematical statisticians, or some of them, have burned their bridges to reality … I don't want to interfere with the fun that the theoretical statisticians are having. The only thing that concerns me is that perhaps some scientists might take this seriously. … In many experimental situations the usual models are unrealistic and we do need more realistic foundations for statistical inference. But, this work is almost unpublishable.' [22].

and

'Methods that require assumptions that go beyond the limits of the data (or are not verifiable) lack statistical validity. Such methods should not be used by biostatisticians to obtain scientific findings … The failure to consider scientific context has resulted in widespread current use of [fraudulent statistical] methods … [which] has resulted in the publication of false, misleading, or outright fraudulent biomedical "findings" in … leading scientific and medical journals. The problem is only compounded when the "findings" are endorsed by "blue-ribbon" scientific panels. This does not make them any less fraudulent but it shows how gullible scientists can be when it comes to statistical methods. It may be no exaggeration to say that scientific fraud in biomedical research has reached epidemic proportions.' [23].

The familiarity of and precedent for using ATs has contributed to the reluctance of applied biostatisticians to use PTs [5], while the simplicity of PTs has caused theoretical statisticians to look elsewhere for difficult problems to work on. Consequently, the PT tends to be endorsed by neither the theoretical nor the applied statistician. However, the use of PTs is good theory and good practice, and is easily justified on these grounds.

3. MISCONCEPTIONS IN FAVOUR OF PERMUTATION TESTS

Misconceptions that might increase the usage of PTs in RCTs are the focus of Section 3.

3.1. Every Permutation Test Is Exact

If $t^*$ is the observed value of the test statistic $T$, then $p(\mathrm{PT}) = P_0\{T \ge t^*\} = P\{X_p \ge t^*\}$, where $X_p$ is a random variable having the same distribution as $T$'s null distribution. If it is surmised that $T$'s null distribution is actually $X_a$ ($X_a$ may be, for example, the standard normal distribution), then $p(\mathrm{AT})$ would be computed as $P\{X_a \ge t^*\}$. If the nominal significance level is $\alpha$, then the actual significance levels are $\alpha_p = P_0\{p(\mathrm{PT}) \le \alpha\} \le \alpha$ and $\alpha_a = P_0\{p(\mathrm{AT}) \le \alpha\}$, which could exceed $\alpha$, possibly by quite a bit. However, the AT which makes the distributional assumption $X_a$ is still an exact test, but it is exact at level $\alpha_a$, and not necessarily at level $\alpha$. Consequently, every test is exact at some level. If we redefine $\alpha_a$ so that the probability in its definition is taken to mean according to the null distribution $X_a$, then $\alpha_a = \alpha$, and the AT is exact at level $\alpha$. That is, the AT provides the exact answer, and the only exact answer, to the questions 'How extreme is $t^*$ relative to $X_a$?', answered by $p(\mathrm{AT})$ and 'Is $t^*$ among the $100\alpha$ per cent most extreme outcomes of $X_a$?', answered by the rejection decision.

What is wrong with this self-validation approach is that one can select desired p-value $k$, satisfying $0 < k < 1$, and then guarantee an exact p-value of $k$ by taking $X_a$ to be the uniform distribution on $[t^* - (1 - k),\ t^* + k]$. Unlike the aforementioned uniform distribution, in practice $X_a$ typically bears at least some resemblance to $X_p$, and does not depend on $t^*$. However, there is nothing unique about an exact p-value, and the term 'exact test' is somewhat of a misnomer. The trick is being exact for the questions 'How likely would one be, if one were to repeat the RCT, to find $T \ge t^*$?' and 'Is $t^*$ among the $100\alpha$ per cent most extreme outcomes that could have resulted from the RCT?', or exactness when the probability is computed using $X_p$, as dictated by the design of the study. Only the PT, which uses $X_p$, meets this criterion.

Observing $p(\mathrm{AT}) < \alpha$, when using $X_a$, may be indicative not of a treatment effect but rather of an unwarranted assumption, and may not even be a rare (as defined by probability less than $\alpha$ when computed using $X_p$) event. This is why both a low p-value and (not or) strong methodology (the PT, using $X_p$) is required [3]. It may turn out that $p(\mathrm{AT}) = p(\mathrm{PT})$ if certain percentiles of $X_a$ and $X_p$ happen to line up. However, $X_p$ is necessarily discrete, so $X_a$ cannot coincide with $X_p$ if $X_a$ is continuous. A PT which uses the wrong distribution, ignoring restrictions on the randomization, is actually an AT, and is not exact relative to $X_p$ any more than any other AT will be.
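To see why the uniform reference distribution mentioned in this section returns any pre-selected p-value, note that the interval $[t^* - (1 - k),\ t^* + k]$ has length one, so the probability of landing at or above $t^*$ is simply the length of the part of the interval to the right of $t^*$:

$$P\{X_a \ge t^*\} = \frac{(t^* + k) - t^*}{(t^* + k) - \bigl(t^* - (1 - k)\bigr)} = \frac{k}{1} = k.$$

The computation holds for every $t^*$ and every $0 < k < 1$, which is what makes this form of exactness uninformative.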

3.2. Any Permutation Test Will Do

Many study protocols refer to 'the exact test', as if there were but one exact test. The power of a PT (or an AT) depends heavily on the test statistic it uses (as in Section 2.5). This means that a given PT can be inappropriate in a given situation, not by virtue of being a PT, but rather by virtue of using an inappropriate test statistic. It is important to use a good PT, with good power.

3.3. Permutation Tests Allow for Data-Dependent Test Statistics

Salsburg (Reference [24], Section 9.1) states 'the test statistic … need not be defined before examining the data'. This is not to be confused with selecting the test statistic based on (nearly) ancillary statistics to be conditioned on, such as the margins of a contingency table [19]. If, having observed the data $X^*$, one could define the test statistic to be the indicator function $T(X) = 1$ if $X = X^*$ or $T(X) = 0$ if $X \ne X^*$, then the p-value would be $P_0\{X^*\}$ as $X^*$ would constitute the entire extreme region. If $P_0\{X\} < \alpha$ for each point $X$ in the sample space (with $\alpha = 0.05$ and complete random allocation, this is the case with at least three patients per treatment group), then the experiment is guaranteed to find statistical significance ($p < \alpha$), and the actual type I error rate is 100 per cent. Selecting the test statistic based on the data is tantamount to throwing a dart at a blank canvas and then drawing the bull's eye around wherever the dart happens to hit, and is categorically not valid.
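The dart-and-canvas argument can be checked by brute force. The sketch below rests on an assumption that should be read carefully: 'complete random allocation' is interpreted here as an independent fair coin flip per patient, so that every allocation of N patients has null probability $2^{-N}$. Other readings of the design would give different point probabilities. The function name is hypothetical.

```python
from itertools import product

def dart_type_one_error(n_patients, alpha=0.05):
    """Type I error rate when the test statistic is chosen after the fact
    as the indicator of whichever allocation was observed.
    Assumes independent fair coin-flip allocation to E or S."""
    sample_space = list(product("ES", repeat=n_patients))
    p_point = 1 / len(sample_space)   # P0{X} for every point X
    # Whatever X* occurs, the extreme region is {X*}, so p = P0{X*}.
    rejections = sum(1 for _ in sample_space if p_point < alpha)
    return rejections / len(sample_space)

print(dart_type_one_error(6))   # three patients per group on average
print(dart_type_one_error(4))
```

Under this reading, six patients already make every point probability smaller than 0.05, so the post hoc indicator statistic rejects regardless of the data, whereas with four patients no point probability falls below 0.05.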

4. LIMITATIONS OF PERMUTATION TESTS IN CLINICAL TRIALS

While it may be difficult or impossible to conduct PTs in some cases [25], our concern extends to cases in which PTs can be performed easily. Even in such cases, there are limitations of PTs, and so ATs may need to be considered. These limitations are discussed in Section 4.

4.1. Permutation Tests Need Not Be Exact when Testing the Weak Null Hypothesis

Suppose that our RCT compares experimental treatment E to standard of care S, with a binary primary efficacy endpoint. Let $p(0, 1)$ be the proportion of patients in the population who would respond to E but not to S, with $p(0, 0)$, $p(1, 0)$, and $p(1, 1)$ defined similarly. As no patient is observed on both treatments, these proportions involve the unobservable counterfactual (Reference [21], Section 3.3) and cannot be estimated directly. With treatment response rates $p(E) = p(0, 1) + p(1, 1)$ and $p(S) = p(1, 0) + p(1, 1)$, the unconditional (weak) null hypothesis (WNH) specifies common response rates, $p(E) = p(S)$. When covariates (for example, gender) are prognostic, a more restrictive conditional null hypothesis (for example, equality, for E and S, of gender-specific response rates) is often selected. The strong null hypothesis (SNH) conditions on all covariates, so each stratum consists of a single patient, who then must respond either to both treatments or to neither treatment. Specifically, the SNH specifies that $p(0, 1) = p(1, 0) = 0$, so any observed responder was a 'responder-to-be' who would have responded to either treatment. Response is then a latent baseline patient characteristic.

It is important to recognize that this SNH is not an assumption underlying the validity of PTs, but rather a null hypothesis to be tested, for which PTs are exact. Because the SNH is not typically the negation of the intended alternative hypothesis, and can be false in an uninteresting way, Gail et al. [26] argued that the WNH is of greater practical interest than the SNH, and showed that PTs need not be exact under the WNH. However, we argue that in many cases the SNH may actually be of greater interest than the WNH. If the target population is heterogeneous with respect to various prognostic factors, then even if the WNH is true, future studies may allow for the delineation of the (0, 1)-type patients (who should receive E) from the (1, 0)-type patients (who should receive S), thereby increasing the overall response rate from $p(1, 0) + p(1, 1)$ to $p(1, 0) + p(1, 1) + p(0, 1)$. Clearly, then, the truth of the WNH is not a typical null situation in which E serves no benefit to public health, unless there is also perfect overlap of the patients responding to each treatment. However, this is the definition of the SNH, so it is the SNH, and not the WNH, that corresponds to the null condition.
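A small numerical illustration may help separate the two null hypotheses. The response-type proportions below are invented for the purpose and are not estimates of anything; they are indexed as in the text, with the first entry indicating response to S and the second response to E.

```python
# Hypothetical proportions of each counterfactual response type,
# keyed as (responds to S, responds to E).
p = {(0, 0): 0.40, (0, 1): 0.15, (1, 0): 0.15, (1, 1): 0.30}

p_E = p[(0, 1)] + p[(1, 1)]   # response rate if everyone receives E
p_S = p[(1, 0)] + p[(1, 1)]   # response rate if everyone receives S

wnh_holds = abs(p_E - p_S) < 1e-12                 # equal marginal rates
snh_holds = p[(0, 1)] == 0 and p[(1, 0)] == 0      # no discordant patients

# Response rate if each patient could be given the treatment that suits them.
rate_if_targeted = p[(1, 0)] + p[(1, 1)] + p[(0, 1)]

print(wnh_holds, snh_holds, p_S, rate_if_targeted)
```

In this made-up population the WNH is true while the SNH is false, and targeting treatment would raise the response rate above what either treatment achieves alone, which is the situation the text argues is not a genuine null condition.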

4.2. Permutation Tests Tend to Be Conservative

Because PTs have discrete distributions, p-values may not be stable. For example, Fisher's exact test yields $p = 0.014$ (two-sided) when comparing 6/100 to 0/101, yet $p = 0.029$ when comparing 6/101 to 0/100. In addition, the discreteness leads to conservatism of PTs (not every significance level is attainable). With test statistic $T$, if $T = t^*$ is observed, then we define the p-value interval (PVI) to be the range of possible p-values, $(P_0\{T > t^*\},\ P_0\{T \ge t^*\}]$. The length $P_0\{T = t^*\} > 0$ of the PVI measures the conservatism of the PT independently of $\alpha$. If $P_0\{T = t^*\}$ is large, then the PT will lose power because of its conservatism. It is this ensuing loss of power, and not the conservatism itself, that is a concern. That is, if a conservative PT is still more powerful than an AT, then the PT would certainly be preferable. If the AT is more powerful, then it should be preferred to the PT only if its power advantage can be attributed exclusively to the conservatism of the PT. However, if $p(\mathrm{AT}) < P_0\{T > t^*\}$, then the power advantage cannot be attributed to the conservatism of the PT, and the increase in power of the AT likely comes with an increase in actual significance level.
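The Fisher's exact test figures quoted in this section, together with the corresponding p-value intervals, can be retraced with a short enumeration. The sketch assumes the two-sided p-value sums the probabilities of all tables no more likely than the observed one, a common convention that may or may not be the one used for the quoted values; the function name `fisher_pvi` is hypothetical.

```python
from math import comb

def fisher_pvi(x1, n1, x2, n2):
    """Return (P0{T > t*}, P0{T >= t*}) for a two-sided Fisher test in which
    a table counts as more extreme when it is less probable."""
    m = x1 + x2
    total = comb(n1 + n2, m)
    # Unnormalised hypergeometric weights, kept as integers so that ties
    # with the observed table are compared exactly.
    w = {k: comb(n1, k) * comb(n2, m - k)
         for k in range(max(0, m - n2), min(m, n1) + 1)}
    lower = sum(v for v in w.values() if v < w[x1]) / total
    upper = sum(v for v in w.values() if v <= w[x1]) / total
    return lower, upper

lo_a, hi_a = fisher_pvi(6, 100, 0, 101)
lo_b, hi_b = fisher_pvi(6, 101, 0, 100)
print(f"6/100 vs 0/101: p = {hi_a:.3f}, PVI = ({lo_a:.3f}, {hi_a:.3f}]")
print(f"6/101 vs 0/100: p = {hi_b:.3f}, PVI = ({lo_b:.3f}, {hi_b:.3f}]")
```

The upper endpoint of each interval is the reported p-value, and the two cases differ because moving one patient between arms changes which of the two extreme tables is the less probable.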

4.3. The Most Powerful Test May Not Be a Conditional (Permutation) Test

The most powerful test unconditionally need not be conditionally exact [27], in which case the AT has better unconditional power but variable conditional size (sometimes exceeding $\alpha$) and the PT has lower power but fixed conditional size. The RCT design involves a convenience sample, and has no probability space with which to consider patient selection to be a true experiment. As such, we agree with Cox [27] that the patients are fixed, meaning that the margins should be conditioned upon. That is, the PT is preferable despite its lower unconditional power.

4.4. It Is Difficult or Impossible to Compute Sample Sizes for Permutation Tests

The PT RD is specific to the RCT and generally cannot be tabulated without the results of the RCT in hand. This often makes it impossible to calculate a required sample size. The AT RD, however, was tabulated before the RCT was even conceived. Recall from Section 3.1 that the goal is to attain one of the $100\alpha$ per cent most extreme outcomes among those outcomes that are attainable with the experiment performed. If the reality is that the set of attainable outcomes cannot be enumerated prior to conducting the study, then it is a disservice to cover this up by replacing the PT with an AT. Note that computing a sample size for a PT is no more difficult than estimating a sample size for a test which adjusts for covariates that are not observed until the study is complete. One possible solution is to base sample size considerations on unconditional tests, but use the PT for analysis.

4.5. It Is Difficult to Adjust for Covariates with Permutation Tests

It can be tricky to adjust for covariates with PTs, even in problems for which it would have been easy to derive a PT had there been no covariates. This is a valid consideration.

5. CONCLUSION

External validity cannot be guaranteed in RCTs, but can be ruled out if there is no internal validity. Strict preservation of the type I error rate (by using permutation tests), in conjunction with a concerted effort to minimize biases, can ensure internal validity. This, in conjunction with non-statistical arguments, can allow the ability to extrapolate the results of the RCT. For this reason, we recommend that the PT get the right to first refusal when the strong null hypothesis (SNH) is of interest. The question 'What about this particular problem makes you believe that a PT is needed?' should be replaced by 'What evidence can you offer that the PT fails and the AT does not?' When a platinum standard (as Tukey [7] called the design-based PT as applied to the RCT) is available, justification is not needed for using it. Specifically, we recommend the following:

(i) Make explicit whether the null hypothesis of interest is the WNH or the SNH, bearing in mind that the null hypothesis should be a statement of what one wants to rule out. If the WNH does not imply that E is useless unless the SNH is also true (as in Section 4.1 when there was a heterogeneous patient population), then the SNH is of interest. On the other hand, if it is suspected that $p(1, 0) = 0$ (so that for no patient is S better than E), then the WNH is equivalent to the SNH, and either can be used. Likewise, a study in a baseline homogeneous population offers little hope of predicting which patients will respond better to each treatment, so the WNH is likely of interest. Of course, a patient population that appears baseline homogeneous may actually be heterogeneous [28].

(ii) If the WNH is of interest, then use either an AT or a PT based on power penalized by any excess in the actual significance level over the nominal significance level.

(iii) If the SNH is of interest but the size of the study precludes the possibility of enumerating all permutations for constructing the PT RD, then use Monte Carlo sampling from the PT RD, if possible, to construct an approximate PT (whose size is typically closer to $\alpha$ than is that of the AT), or, if this is not possible, then use the AT but state explicitly that the PT was unavailable and why.

(iv) If the SNH is of interest and the PT is available with desired test statistic $T$, then compute the p-value interval $(P_0\{T > t^*\}, P_0\{T \ge t^*\}]$. If the PT is not overly conservative, then report it, without computing the AT or testing the AT assumptions. If the PT is overly conservative when $T = t^*$ is observed, then state this explicitly (report the p-value interval) and use either another PT or the AT based on $T$, but do not use the AT based on $T$ if it produces a p-value which is lower than $P_0\{T > t^*\}$.

(v) More research should go into deriving powerful PTs for complicated problems, and software with which to compute these PTs. The two questions which need to be addressed in any testing problem are 'What is the sample space?' and 'What (partial or complete) ordering on the sample space is induced by the definition of extremity?' These questions may not be easy to answer when there are multiple stratification variables or covariates, arbitrarily censored data, multiple endpoints, longitudinal data, or complicated allocation procedures, but these complications present absolutely no difficulties above and beyond the difficulties they present in the solution to these two questions.
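Recommendations (iii) and (iv) can be illustrated with a minimal sketch, which is not part of the original analysis. It assumes a two-arm trial with unrestricted randomization to fixed group sizes (so that every allocation of k patients to E is equally likely), integer-valued outcomes, and a test statistic T taken to be the sum of outcomes in the E group. The function names, the example data, and the (hits + 1)/(draws + 1) Monte Carlo estimator are illustrative assumptions. A trial that used blocked or stratified randomization would need the enumeration and the sampling restricted to the allocations its design could actually have produced.

```python
from fractions import Fraction
from itertools import combinations
import random


def p_value_interval(outcomes, treated):
    """Exact interval (P0{T > t*}, P0{T >= t*}] by full enumeration.

    outcomes: list of integer responses, one per patient.
    treated:  set of patient indices allocated to E in the trial as run.
    T is the sum of outcomes in the E group; with fixed group sizes it
    orders allocations the same way as the difference in group means.
    Assumes every size-k subset was an equally likely allocation.
    """
    n, k = len(outcomes), len(treated)
    t_star = sum(outcomes[i] for i in treated)
    greater = at_least = total = 0
    for alloc in combinations(range(n), k):
        t = sum(outcomes[i] for i in alloc)
        greater += t > t_star    # allocations strictly more extreme
        at_least += t >= t_star  # allocations at least as extreme
        total += 1
    return Fraction(greater, total), Fraction(at_least, total)


def monte_carlo_p_value(outcomes, treated, draws=10000, seed=0):
    """Approximate P0{T >= t*} by sampling allocations at random.

    Intended for studies too large to enumerate. The observed allocation
    is counted once in numerator and denominator (an assumed, commonly
    used estimator), so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    n, k = len(outcomes), len(treated)
    t_star = sum(outcomes[i] for i in treated)
    hits = 0
    for _ in range(draws):
        alloc = rng.sample(range(n), k)
        hits += sum(outcomes[i] for i in alloc) >= t_star
    return (hits + 1) / (draws + 1)


# Hypothetical data: 3 of 4 responders on E (indices 0-3), 1 of 4 on S.
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
treated = {0, 1, 2, 3}
lower, upper = p_value_interval(outcomes, treated)
approx = monte_carlo_p_value(outcomes, treated)
```

For these hypothetical data the exact interval is (1/70, 17/70]. Its length, 16/70, equals the null probability of the observed value of T and reflects how coarse the reference distribution is at that point. The Monte Carlo estimate targets the upper endpoint.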

ACKNOWLEDGEMENTS

This research was supported in part by grant number RSR-96-004A from the Center for Drug Evaluation and Research (CDER) of the FDA. The author is obliged to Saul Mazolowski, Thomas Permutt, Bruce Stadel and an anonymous referee for helpful comments.

REFERENCES

1. Fisher LD, Moye LA. Carvedilol and the Food and Drug Administration approval process: an introduction. Controlled Clinical Trials 1999; 20:1–15.
2. Fisher LD. Carvedilol and the Food and Drug Administration approval process: the FDA paradigm and reflections on hypothesis testing. Controlled Clinical Trials 1999; 20:16–39.
3. Moye LA. End-point interpretation in clinical trials: the case for discipline. Controlled Clinical Trials 1999; 20:40–49.
4. Ludbrook J, Dudley H. Why permutation tests are superior to t and F tests in biomedical research. American Statistician 1998; 52(2):127–132.
5. Feinstein AR. Permutation tests and statistical significance. MD Computing 1993; 10(1):28–41.
6. Edgington ES. Randomization Tests, third edn. Marcel Dekker: New York, 1995.
7. Tukey J. Tightening the clinical trial. Controlled Clinical Trials 1993; 14:266–285.
8. Senn S. Fisher's game with the Devil. Statistics in Medicine 1994; 13:217–230.
9. Kolata GB. Controversy over study of diabetes drugs continues for nearly a decade. Science 1979; 203:986–990.
10. Elwood JM. Critical appraisal of epidemiological studies and clinical trials. Oxford University Press: Oxford, 1998.
11. Royall RM. Comment on Hansen et al. Journal of the American Statistical Association 1983; 78:794–796.
12. Koch GG, Edwards S. Clinical efficacy trials with categorical data. In Biopharmaceutical Statistics for Drug Development, Peace KE (ed.). Marcel Dekker: New York, 1988.
13. Leber PD, Davis CS. Threats to the validity of clinical trials employing enrichment strategies for sample selection. Controlled Clinical Trials 1998; 19:178–187.
14. Fisher LD, van Belle G. Biostatistics. Wiley: New York, 1993.
15. Mike V. Understanding uncertainties in medical evidence: professional and public responsibilities. In Acceptable Evidence, Mayo DG, Hollander RD (eds). Oxford University Press: New York, 1991.
16. Senn SJ. Falsificationism and clinical trials. Statistics in Medicine 1991; 10:1679–1692.
17. Royall RM. Ethics and statistics in randomized clinical trials (with comments). Statistical Science 1991; 6(1):52–88.
18. Olschewski M, Schumacher M, Davis KB. Analysis of randomized and nonrandomized patients in clinical trials using the comprehensive cohort follow-up study design. Controlled Clinical Trials 1992; 13:226–239.
19. Berger VW, Permutt T, Ivanova A. The Convex Hull Test for ordered categorical data. Biometrics 1998; 54(4):1541–1550.
20. Kimball AW. Error of the third kind in statistical consulting. Journal of the American Statistical Association 1957; 52:133–142.
21. Senn S. Statistical Issues in Drug Development. Wiley: Chichester, 1997.
22. Bross I. Comment on Birnbaum. Journal of the American Statistical Association 1962; 57:298, 309–311.
23. Bross ID. How to eradicate fraudulent statistical methods: statisticians must do science. Biometrics 1990; 46:1213–1225.
24. Salsburg DS. The Use of Restricted Significance Tests in Clinical Trials. Springer-Verlag: New York, 1992.
25. Mehta CR. The exact analysis of contingency tables in medical research. In Recent Advances in Clinical Trial Design and Analysis, Thall P (ed.). Kluwer: Boston, 1995.
26. Gail MH, Mark SD, Carroll RJ, Green SB, Pee D. On design considerations and randomization-based inference for community intervention trials. Statistics in Medicine 1996; 15:1069–1092.
27. Cox DR. Some problems connected with statistical inference. Annals of Mathematical Statistics 1958; 15:357–372.
28. Temple RJ. Special study designs: early escape, enrichment, studies in non-responders. Communications in Statistics 1994; 23(2):499–531.
