stopping guidelines vs stopping rules: a practitioner's point of view

This article was downloaded by: [Massachusetts Institute of Technology]On: 26 November 2014, At: 19:55Publisher: Taylor & FrancisInforma Ltd Registered in England and Wales Registered Number: 1072954 Registeredoffice: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Communications in Statistics - Theory andMethodsPublication details, including instructions for authors and subscriptioninformation:http://www.tandfonline.com/loi/lsta20

Stopping guidelines vs stopping rules: apractitioner's point of viewDavid L. DeMets Ph.D aa Departments of Statistics and Human Oncology , University ofWisconsin-Madison , Room 6763 Medical Science Center 420 NorthCharter Street, Madison, Wisconsin, 53706Published online: 27 Jun 2007.

To cite this article: David L. DeMets Ph.D (1984) Stopping guidelines vs stopping rules: a practitioner'spoint of view, Communications in Statistics - Theory and Methods, 13:19, 2395-2417, DOI:10.1080/03610928408828833

To link to this article: http://dx.doi.org/10.1080/03610928408828833

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the“Content”) contained in the publications on our platform. However, Taylor & Francis, ouragents, and our licensors make no representations or warranties whatsoever as to theaccuracy, completeness, or suitability for any purpose of the Content. Any opinions andviews expressed in this publication are the opinions and views of the authors, and are notthe views of or endorsed by Taylor & Francis. The accuracy of the Content should not berelied upon and should be independently verified with primary sources of information. Taylorand Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs,expenses, damages, and other liabilities whatsoever or howsoever caused arising directly orindirectly in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantialor systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply,

http://www.tandfonline.com/loi/lsta20

http://www.tandfonline.com/action/showCitFormats?doi=10.1080/03610928408828833

http://dx.doi.org/10.1080/03610928408828833

or distribution in any form to anyone is expressly forbidden. Terms & Conditions of accessand use can be found at http://www.tandfonline.com/page/terms-and-conditions

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

http://www.tandfonline.com/page/terms-and-conditions

COMMUN. STATIST.-THEOR. METH., l 3 ( l 9 ) , 2395-2417 (1984)

STOPPING GUIDELINES VS STOPPING RULES: A PRACTITIONER'S POINT OF VIEW

David L. DeMets, Ph.D.

Departments of Statistics and Human Oncology University of Wisconsin-Madison Room 6763 Medical Science Center

420 North Charter Street Madison, Wisconsin 53706

K e y Words and P h r a s e s : c l i n i c a l t r i a l ; e a r l y s t o p p i n g ; d a t a m n i t o r i n g ; d e c i s i o n r u l e s .

ABSTRACT

Monitoring interim accumulating data in a clinical trial for

evidence of therapeutic benefit or toxicity is a frequent policy,

usually carried out by an independent scientific committee. While

statistical methodology has been developed to assess the

significance of these interim analyses, such methods should not be

viewed as absolute rules but only serve as useful guides. The

decision process to terminate a trial early is very complex and

many factors must be taken into account. The complexity of this

decision process is illustrated by reviewing the experience of

several recent clinical trials.

INTRODUCTION

The randomized clinical trial has become one of the primary

means of testing the benefits of a new therapy or procedure

against established practice as a control. During the past two

decades, the methodology for conducting clinical trials has

Copyright @ 1984 by Marcel Dekker, Inc.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

evolved, especially for multicenter trials. This methodology has

been described by authors such as Friedman, Furberg, and DeMet~

(1981). Most clinical trials sponsored by the National Institutes

of Health (NIH) follow a particular model, referred to here as the

NIH model although it is used more broadly than simply NIH

activity. In general terms, the NIH model has various functioning

bodies which include: 1) the participating investigators and

clinics, 2) an executive or steering committee representing the

investigators, 3) central laboratories as needed, such as a

clinical chemistry or an electrocardiogram reading laboratory,

4) a statistical-data management center, and 5) a policy advisory

and data monitoring committee. Within this administrative

structure, each component contributes to the overall design,

conduct, quality, and analyses of the clinical trial. The purpose

of this paper is to discuss, by way of examples, how many factors

must be considered in the decision making process. At this time

many clinical trials have been conducted with a data monitoring

committee. Unfortunately, only a few trials have been published

on this experience, but those available will be the focus of this

discussion.

THE REPEATED SIGNIF-NCE TESTING PROBLEM

For the NIH clinical trials model, the data monitoring

committee is an independent group of scientists, not directly

involved with patient recruitment and care for the particular

trial, who periodically review progress of the trial. This

committee often includes physicians, epidemiologists,

statisticians and an ethicist. The responsibilities of this

committee may be broadly defined but generally include the ethical

and scientific responsibility of monitoring for evidence of

possible toxicity or early therapeutic benefit. The rationale for

such a committee has been described by Shaw and Chalmers (1970),

Chalmers (1979), and Canner (1979). One important role is to

remove from the participating investigator the possible ethical

dilemma of seeing interim data and yet having to recruit

additional patients.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

S T O P P I N G G U I D E L I N E S VS S T O P P I N G RULES

The process of repeatedly reviewing data has important

implications for the evaluation of statistical analyses and tests

of hypotheses. One consequence is that the Type I error or false

positive rate is increased if a classical critical value is used

at each test of hypothesis. For example, the probability of

falsely rejecting the null hypothesis is increased to levels

beyond 0.05 if a critcal value of 1.96 were used at each test, the

amount depending on the number of such repeated evaluations.

Statisticians have long recognized the repeated testing issue and

have proposed various methods for preserving traditional Type I

error levels. Many of the early sequential models were found not

to be used very often. A review of these issues is provided by

Gail (1982) and DeMets and Lan (1984). In recent years, however,

two basic ideas have been proposed which seem to be useful.

Pocock (1977) and later O'Brien and Fleming (1979) introduced a

data monitoring concept, usually referred to as group sequential

boundaries, which adjusts the critical value used at each test to

a more conservative (i.e., larger) value. This adjustment allows

a prespecified a level to be achieved. Lan and DeMets (1983) have

further developed this idea and Tsiatis (1981), for example, has

shown how to use group sequential methods in survival analysis.

A second idea, referred to as stochastic curtailment,

considers whether the interim results are so extreme, either for

or against the null hypothesis, that the conclusions from test of

hypothesis to be performed at the end of the trial are not likely

to change. If the conclusions are unlikely to change, early

termination of the trial may be considered. Since the point of

reference is always what would happen at the end of the study if

it were continued, the Type I error is also controlled. This

approach was in part developed by Lan, Simon, and Halperin (1982)

and the application discussed in Halperin, Lan, Ware, Johnson, and

DeMets ( 1982).

Given that such statistical models have been proposed and are

available, the temptation might be to strictly adhere to them in

the monitoring and decision making process. The models, in Fact,

assume that the trial would be terminated as soon as the specific

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

DE METS

criteria are met. However, the decision process to terminate a

study early is a very complex one, involving many more factors

than simply observing whether or not a boundary is exceeded, a

point of view held by Meier ( 1 9 7 9 ) , the Coronary Drug Project

Research Group ( 1 9 8 1 ) . DeMets, Hardy, Friedman, and Lan ( .1984) ,

and Armitage ( 1 9 7 9 ) . To use these models as strict stopping rules

would be unwise. Nevertheless, such models offer a practical

guide to data monitoring committees in assessing the

"significance" of observed interim results.

UNIVERSITY GROUP DIABETES PROJECT - AN EARLY EXPERIENCE One of the earliest multicenter clinical trials sponsored by

the National Institutes of Health was the University Group

Diabetes Program or the UGDP (1970, 1971) . This was a randomized

double blind trial in adult onset diabetics using a placebo

control group and four therapeutic strategies. The four therapies

were: 1) a fixed dose of insulin, 2) a variable dose of insulin,

3) tolbutamide alone, and 4 ) phenformin alone which was added

later in the study. The primary outcome variables were measures

of retinal damage in addition to possible morbidity, such as

infection.

However, in 1970 the data monitoring committee observed what

was felt to be an excess cardiovascular mortality in the

tolbutamide group compared to the placebo group ( 1 2 . 7 % vs.

4 . 9 % ) . The total mortality rate was not statistically

significant, but was in the same direction (14.7% vs. 10.2%).

This treatment group was subsequently terminated. Later, the

phenfomin group was also terminated due to excess total mortality

(15.2% vs. 9.4%) as well as excess cardiovascular mortality (12.7%

vs. 3.1%). Regardless of treatment, diabetics are at higher risk

of cardiovascular disease than the non-diabetic population. As

summarized by Kolata (19791, the tolbutamide decision in

particular has caused a great deal of controversy and, in fact,

led to an NIH review of the UGDP by a special camnittee appointed

by the Biometric Society (see the Report of the Committee for the

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

STOPPING GUIDELINES VS STOPPING RULES

Assessment of Biometric Aspects of Controlled Trials of

Hypoglycemic Agents, (1975)).

Among the criticisms of the decision was the fact that

tolbutamide patients had more cardiovascular risk factors than did

placebo patients, possibly explaining the observed excess

deaths. Subsequent analyses by the special committee suggested

the deaths to be excessive even after adjusting for the baseline

risk factor imbalance. Another criticism was the issue of

accurately determining cause of death, particularly for the

tolbutamide group. This, of course, is a major concern for any

mortality study and is a reason why cause specific mortality in

itself is not adequate as a sole outcome measure.

While the controversy surrounding the UGDP may not be

resolved, it is also unlikely that any of the data monitoring

methods now available would have altered the UGDP decision process

as described. For sure, the decision process was more involved

than simply noting whether a test statistic was impressively

large.

THE CORONARY DRUG PROJECT

Another one of the very early NIH multicenter clinical trials

was the Coronary Drug Project (CDP), a randomized double blind

trial to evaluate the effectiveness of lipid-lowering drugs in the

secondary prevention of coronary heart disease under the direction

of the Coronary Drug Project Research Group (1970, 1972, 1973,

1975). The study evaluated five drug regimens (different drugs or

varying dosages) compared to a placebo. During a three year

period from 1966-1969, 53 participating clinical centers recruited

over 8,000 patients who had a recent documented myocardial

infarction. Patients were followed carefully from time of

recruitment until 1974, a minimum follow-up of five years. At

randomization, patients were classified into two risk groups, one

being a more complicated myocardial infarction situation.

A data and safety monitoring committee (DSMC) was established

which met periodically, approximately every six months, during the

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

follow-up period. The DSMC membership included statisticians,

physicians and basic scientists. This committee made

recommendations to an NIH advisory committee and to the NIH. The

structure of the CDP became a prototype for many subsequent NIH

funded clinical trials, a structure referred to earlier as the NIH

model. The CDP did, in fact, terminate three of the treatment

regimens upon recommendation of the DSMC and that process has been

described in detail by the Coronary Drug Project Research Group

(1981). A brief sumrnary is presented here.

Decision 1: Termination of ~ i g h Dose Estrogen

Approximately six months after the last patient was recruited,

the high dose estrogen group was observed to have increased rate

of non-fatal myocardial infarction (MI) compared to placebo

(6.2% vs 3.2%, 2=4.11) as well as increased total mortality

(8.1% vs 6.9%, z=1.33). one primary concern was the potential for

bias in the diagnosis of non-fatal myocardial infarction, given

that estrogen therapy has side effects which could potentially

unblind the patient and physician. After extensive analysis

investigating possible over or under diagnosis, no evidence was

found that such biases were present. Further analyses then

focused on risk subgroups. The non-fatal MI was primarily seen in

the lower risk group (6.7% vs 2.9%, 2=4.3) but the mortality

difference was in the higher risk group (13.9% vs 8 .5%, z=3.0).

In this case, the decision was reached after some debate to

terminate the high dose estrogen in both the high and low risk

groups, thus dropping that arm of the trial. I£ only a single

outcome variable had been considered, different decisions might

have been reached.

Decision 2: Termination of Dextrothyroxine

A year and a half after the estrogen decision, a decision was

reached to discontinue to dextrothyroxine (DT4) treatment arm, due

to increased mortality overall (14.8% vs 12.5%) and especially in

the higher risk subgroup (22.5% vs 15.4%, 2=3.1). The decision

process focused on whether this result was consistent across

certain subgroups; that is, could this adverse result be further

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

STOPPING GUIDELINES VS STOPPING RULES 2401

isolated. After numerous subgroups had been defined, one such set

(subgroups A and B) drew special attention because the mortality

rate was in favor of DT4 in subgroup A (4.1% vs 7.7%, z=-2.6) and

against (16.4% vs 11.2%, z=3.3) in subgroup B. Recognizing the

dangers of multiple and repeated subgroup analyses, the DMSC

decided to wait until the next meeting to review subsequent

mortality in subgroups A and B. The results continued to suggest

an adverse effect in subgroup B (5.4% vs 3 . 8 % ) , but there was no

trend for a beneficial effect in subgroup A. Based on that

information, along with previous analyses, the decision to

terminate was reached. The authors comment that if this

particular subgrouping had been stated in advance, the decision

process might have been less complicated. Even so, however, there

is no guarantee that subgroups stated in advance will be the only

ones of concern during the trial.

Decision 3: Termination of Low Dose Estrogen

In 1973, the low dose estrogen treatment regimen was dropped

due to an excess of venous thromboembolism, an excess of cancer

deaths (not statistically significant) and total mortality

(19.9% vs 18.8%, also not statistically significant). The

decision considered that since there was a slight adverse trend,

was there any chance at this stage in the trial of showing a

beneficial effect at the scheduled end of the trial? This is an

early application of the curtailed sampling idea. Furthermore,

the DSMC did not want to prove beyond a doubt that estrogen was

harmful, an example of asymmetric boundaries. After a substantial

amount of analyses, it was felt very unlikely, if not impossible,

for the low dose estrogen group to achieve significantly lower

mortality than the placebo group. It was estimated, for example,

that the future percentage of deaths would have to be 1.5% in low

dose estrogen and 6.4% in placebo for the overall comparison to be

just significant. Given this result and the other trends as well

as the earlier high dose estrogen decision, low dose estrogen was

terminated.

Decision 4: Continuation of the clofibrate Group

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

DE METS

MONTH OF FOLLOW-UP

FIG. 1 Life table cumulative mortality rates, Coronary Drug Research Project Group. (Reprinted by permission of the publisher from "~ractical aspects of decision making in clinical trials" by the Coronary Drug Project Research Group, Controlled Clinical Trials, 1, 323. Copyright 1981 by Elsevier Science Publishing Co., Inc.)

One treatment group that was not terminated during the trial

was the clofibrate treated group. At the end of the trial

mortality was 25.5% in clofibrate vs. 25.4% in placebo

(see Figure 1). During the course of the trial, however, the

results were more intriguing. Early on after 20-30 months into

the trial, the test statistic fluctuated around -2.0, in favor of

clofibrate. By 40 months, the test statistic was slightly

positive, in favor of placebo. A t five years, the test statistic

was again near -2.0 with the eventual test statistic to be nearly

zero (see Figure 2 ) . Interestingly, this pattern is seen but not

with such wide swings if the life table results are used as a

retrospective look. The DSMC did recognize the need for

conservatism in the interpretation of interim analyses and used

methods available at that time.

The CDP group concluded their experience by stating that

"Although a number of rather sophisticated statistical tools are

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014


Coronary Drug Project

FIG

N MONTH -1 .o

-2.0

2 z values for clofibrate-placebo differences in proportion of deaths by calendar month since beginning of study (Month 0 = March 1966, Month 100 = July 1974). (Reprinted by permission of the publisher from "Practical aspects of decision making in clinical trials" by the Coronary Drug Project Research Group, Controlled Clinical Trials, 1, 372. Copyright 1981 by Elsevier Science Publishing Co . , ~ n c . )

available in the decision making process, these are at best red

flags that warn of possible treatment problems and can never be

used by themselves as hard and fast decision rules". Subsequent

published experience to be described below only affirms their

opinion.

NOCTURNAL OXYGEN THERAPY TRIAL EXPERIENCE

The Nocturnal Oxygen Therapy Trial (NOTT) was an NIH sponsored

randomized clinical trial under the direction of the Nocturnal

Oxygen Therapy Group (1980) comparing two levels of oxygen therapy

to individuals with advanced chronic obstructive pulmonary

disease. The NOTT entered 201 patients across six clinical

centers. The two groups were those who received standard

continuous oxygen therapy (COT) compared with those who received

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

DE METS

only nocturnal oxygen therapy (NOT). Since COT was considered to

be the standard, the hypothesis was whether less oxygen could be

administered and still have the same benefits; that is, was 12

hours of oxygen as good as 24 hours? The primary outcome

variables were pulmonary function and quality of life measures.

As in the CDP study, a policy advisory and data monitoring

committee was established to review interim data. The monitoring

experience of this trial has been described previously by DeMets

et al. ( 1982).

One aspect of the data monitoring is of particular interest.

While mortality was not defined to be the primary outcome

variable, during the course of the trial a mortality trend

favoring COT began to emerge (see Table I). This mortality result

was not anticipated as a benefit and so it drew considerable

attention. At a committee meeting about one year prior to the

scheduled study termination, the significance levels for mortality

comparisons between therapies considering all randomized patients

were in the neighborhood of 0.10, using the log rank statistic.

One of the questions raised, similar to the CDP dextrothyroxine

situation, was whether this mortality trend was consistent across

subgroups.

A special subcommittee was asked to review the mortality data

carefully before the next scheduled meeting. One such subgroup

was defined on the basis of one second forced expiratory volume

(FEV~), an index of pulmonary function with lower levels

indicating more severe disease. At an interim meeting in the

summer of 1979, the analysis for this subgroup gave a significance

level of 0.01 for survival curve comparisons. Extensive analyses

were done to assure that the result was not due to baseline

imbalances between the two treatment groups. The debate centered

on whether the trial should be stopped entirely, stopped only for

the low FEVl subgroup or continued. To stop only the low FEVl

subgroup could, in practice, have had the effect of terminating

the entire study and the subgroup issue would not be totally

resolved. Furthermore, third party payers were quite interested

in this study since oxygen therapy is very expensive and were

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014


Table I

Significance Levels for Mortality Comparisons at Various Interim Analyses as Originally

Presented and in Retrospect, for the Total Group and for the Lower FEV, subgroup

Meeting Date

Oct, 1978

March, 1979

June, 1979

Sept, 1979

Oct, 1979

Jan, 1980

March, 1980

June, 1980

Original Analyses

Total Group Subgroup

.30

.15

.07 .01

.09 -008

. 5 2 25

. I 5 .05

-06 .O 5

.O1

likely to make decisions

Retrospective Analyses

Total Group subgroup

based on the NOTT results. Given t

this subgroup had not been defined in advance and that several

subgroups had been considered, the decision was to continue to the

next full committee meeting in September of 1979, with continued

interim reports (see Table I). At the next full committee

meeting, the total mortality comparison p-value was 0.09 with the

FEVl subgroup being 0.008.

While the monitoring committee continued to struggle with

this, the statisticians were uncomfortable that all mortality data

might not be available and up to date. A few more deaths in

either group could change the results substantially. It was

agreed that this matter had to be cleared up before any decision

could be made. Special efforts were made to ascertain the vital

status of each patient, taking care not to alarm the participating

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

2406 DE METS

investigators. After this process had been completed, neither the

overall or subgroup comparisons were near statistical

significance. At the January meeting, interest in early

termination decreased. As the final results of the trial emerged,

mortality turned out to be quite significant (41/102 vs 23/101,

P = 0.01) with the results for the two FEV, subgroups being

similar, each in favor of COT. Had the trial been terminated

early on the basis of a subgroup, the study might have suggested

that COT was more effective in one subgroup alone.

In retrospect, had the mortality been as up to date as ideally

possible, the subgroup issue would not have been as great a

concern since the P-value for the low FEV1 was .06 at the

September meeting as shown in Table I. Realistically, mortality

data and other outcome variable data will not be totally up to

date, but the NOTT group strongly encouraged this lag to be

minimized.

Another aspect of this trial is that once the hypothesis

involved morbidity and quality of life, a multitude of potential

outcome measures were available. It is difficult, if not

impossible, to isolate a single variable to fully describe what

superficially might seem to be a simple question. Hence the

amount of data to evaluate can be overwhelming if some selected

group of measures is not identified in advance. Even though that

was done in the NOTT, the task was still not easy and as it turned

out attention focused on another variable, mortality.

THE DIABETIC RETINOPATHY STUDY EXPERIENCE

The Diabetic Retinopathy Study (DRS) was a randomized

controlled clinical trial which entered over 1,700 patient in 15

participating centers. various aspects of this trial have been

presented by Cornfield (1976), Ederer (1981), and the Diabetic

Retinopathy Study Research Group (1976, 1978, 1981). The primary

objective was to compare the effectiveness of photocoagulation

with no therapy to either delay or even prevent visual loss in

diabetic patients with proliferative retinopathy. A unique

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

STOPPING G U I D E L I N E S V S S T O P P I N G RULES 2407

feature of this study is that patient's eyes were randomized, one

to experimental therapy and the other to no treatment. A five

year follow-up was planned and the primary outcome was the

incidence of blindness. Here, as in the other trials described, a

data monitoring committee was established to review interim data

and part of their experience has been reported by Ederer (1981).

Shortly after recruitment of patients was completed and after

two years of the scheduled five years of follow-up, a 60%

reduction in the incidence of blindness was observed in the

treated eyes in contrast to the untreated eyes (16.3% vs 6.4%,

2=5.5). This result was highly statistically significant. The

question the data monitoring comittee was faced with was whether

the early effect could be negated by later harmful effects. A

previously published study suggested such a possibility. To stop

early if there might be long term adverse effects would not be

desirable since study patients might have the untreated eye

treated and other nonstudy patients also might have eyes

treated. on the other hand, if treatment were continued and no

long term adverse effects were observed, the therapeutic benefits

would be delayed for the study patients' untreated eyes as well as

deferring treatment for all other patients not in the study.

Projections were made as to how adverse the long term effects

would have to be in order to reverse the observed early benefit, a

curtailed sampling idea. These types of calculations suggested it

was unlikely that the current trend could be reversed.

As in previous studies discussed, the issue of subgroups was

also a factor. Apparently, in looking at one possible set of

three mutually exclusive subgroups, only one suggested a

"signficant" difference. In some further subdivisions, small

numbers of subjects in each gave the resulting data an

inconsistent or perhaps a random appearance. It was not clear

whether to terminate the entire study or some part of it.

After much discussion and extensive statistical analyses, the

following decisions were reached.

a) The protocol was modified to treat all untreated eyes in

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

DE METS

study patients defined to be at "high risk".

b) Results to date were published for the benefit of all

nonstudy patients.

C) Al1.study patients would be followed to see if any long

term adverse effects were observed, although the

comparison would be contaminated to some degree by this

"later" treatment of untreated high risk eyes.

The five year results confirmed the two year results and no

adverse effects were manifest. The decision process involved

incorporating basic analysis of two treatment groups, subgroup

analyses, other published data, and projections of possible

outcomes.

BETA-BLOCKER HEART ATTACK TRIAL

Like the Diabetic Retinopathy Study, the Beta-Blocker Heart

Attack Trial (BHAT) was terminated earlier than scheduled. BHAT

was a randomized double blind trial of 3,837 patients with a

recent myocardial infarction, conducted in over 30 centers,

sponsored by NIH, and directed by the Beta-Blocker Heart Attack

Trial Research Group (1981, 1982). Patients were randomized into

a propranolol treated goup or a placebo treated group during a two

year period of recruitmerit with an additional two years of

follow-up. Thus, the scheduled follow-up would be at least two

years and at most four years, with the average being three

years. Mortality from all causes was the primary outcome

measure. AS in the others described, the NIH clinical trial model

was used and a data monitoring committee was established. This

trial used group sequential boundaries and stochastic curtailment

methods. The results of their experience has been presented by

DeMets et al. (1984).

Approximately one year after the last patient was randomized,

the data monitoring committee of BHAT recommended to the NIH that

the trial be terminated and the results be made available, The

mortality curves for the two treatment groups are represented in

Figure 3. The logrank test statistic for comparison of these two

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014


Beta-Blocker Heart Attack Trial

Cumulative Mortality Rate (Ole)

0 6 12 18 24 30 Months of Follow-up

F I G . 3 Cumulative mortality curves for 30 months of follow-up, comparing propranolol and placebo treated patients. (Reprinted by permission of the publisher from the "Beta-Blocker Heart Attack Trialn by the Beta-Blocker Heart Attack Study Group. J. Amer. Med. Assoc., 2 4 6 ( 1 8 ) : 2 0 7 3 , NOV. 6 , 1981.)

curves is 2 . 8 2 . This mortality result did not appear suddenly but

rather the trend was seen very early in the life of the study as

shown in Table 11. As discussed earlier, repeatedly testing data

increases the probability of Type I error. One way to achieve an

overall Type I error is to use conservative critical values at

interim analyses. The method proposed by O'Brien and Fleming

(1979) and adopted by BHAT uses a large critical value at early

interim analyses, relaxing this critical value at the trial

progresses, with a more conventional value at the final scheduled

analysis. The 5% significance level group sequential boundaries

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

Table I1

DE METS

O'Brien-Fleming Monitoring Boundaries ( 2 a = .05) for the Beta-Blocker Heart Attack Trial,

Assuming Seven Scheduled Analyses

Meetins

May, 1979

Oct, 1979

Mar, 1980

Oct, 1980

Apr, 1981

Oct, 1987

June, 1982

Log Rank Statistic

1.68

2.24

2.37

2.30

2.34

2.82

Upper Boundary

5.40

3.82

3.12

2.70

2.41

2.20

2.04

of the O'Brien-Fleming family are also shown in Table 11, given

the seven scheduled data monitoring committee meetings. The early

survival comparisons did not exceed these boundaries even though

the test statistic did exceed nominal values. The critical value

at the sixth meeting clearly exceeded this 5% significance level

boundary. Moreover, at that point, the results were so strong

that using stochastic curtailment ideas, it appeared highly

unlikely that the results would be reversed if the trial

continued, short of a major catastrophe. That was also considered

unlikely since this drug has been used extensively in patients of

this type for other indications.

Although the committee felt that this early effect of

propranolol on the reduction in mortality to be statistically and

clinically significant, the decision to terminate was not easy.

The issue focused on how long the drug should be given or how long

was the drug effective in preventing fatal events in this patient

population. This was felt by the clinicians to be an important

recommendation for the BHAT to make. The dilemma for BHAT was

similar to that of the DRS. To continue for another year meant

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014


delaying the potential benefit of therapy to study patients as

well as others. To terminate early meant not having another year

of follow-up to ascertain whether the three year benefit persisted

for at least another year, implying continued treatment for at

least four years.

The statistical approach taken to help resolve this issue was

similar in spirit to the DRS approach, although done

independently. Projected four year life tables were constructed

based on the existing three years of experience. It was argued

that while some information would be obtained, it would not be

very precise due to the few numbers of anticipated events,

assuming the force of mortality to be quite small for years three

and four in contrast to the first year. After considerable

discussion, the committee voted for termination on the basis that

the observed early benefits outweighed the marginal gain in

information from an additional year. As it turned out, two

additional large trials reported similar results, one just prior

to the BHAT decision and one afterwards, confirming the early

benefit.

OTHER TRIALS

While few other trials have published their data monitoring

experience in detail, three recent NIH clinical trials at least

suggest that studies may be continued even when interim results

are quite positive or very negative. The Hypertension Detection

and Follow-up Program (1979), also referred to as the NDFP,

published positive results suggesting that aggressive treatment of

mild hypertension can lead to a reduction in mortality from all

causes (P=.Of). While the mortality results as they were seen by

their data monitoring committee have not been published, one can

get some idea of what those results must have been by looking at

the published life table, recognizing the long period of follow-up

relative to the recruitment period. The survival curves steadily

separate during the five years of follow-up. It is likely that

their results were "significant" somewhere in the middle of the

follow-up period, yet the trial continued. One might speculate

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

DE METS

that there was concern over acceptability of the results had the

trial stopped early and a need to establish toxicity information

on long term usage. For whatever reasons, the committee felt

justified continuing the trial, given the interim results.

Two other randomized multicenter trials, the Aspirin

Myocardial Infarction Study (AMIS) and the Multiple Risk Factor

Intervention Trial (MRFIT) published negative results (1980, .- 1982). AMIS compared aspirin vs. placebo in patients with a

previous myocardial infarction. After several years of follow-up,

the mortality results and the survival curves in the two groups

were nearly identical. Again, one can only speculate what the

rationale to continue might have been. Aspirin is certainly

readily available and there already existed public sentiment about

the benefits of aspirin from earlier published studies. Since no

serious harm was being done other than known side effects of

aspirin usage, continuation may have seemed in this case quite

reasonable.

One might also speculate that MRFIT continued for quite

another reason. his trial involving over 12,000 people was aimed at primary prevention of fatal cardiovascular events in people

identified to be at high risk due to either high cholesterol, high

blood pressure, cigarette smoking or a combination. Two forms of

therapy were compared, a very aggressive special care and a

"regular" care approach. After seven years of follow-up, the

published results suggested no treatment group differences as far

as total mortality or cardiovascular mortality was concerned.

Again the iterim results during the trial must have looked

discouraging. One factor that may well have been argued is that

no one is sure how immediate the impact of such therapy would be,

even assuming it to be effective. While early results were not

impressive, one might conceive that the therapy might eventually

become beneficial if given enough time to reverse years of an

individual's life style.

It is, of course, dangerous to speculate on decision processes

without benefit of the details for these studies' experience.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

STOPPING GUIDELINES VS STOPPING RULES 2 4 1 3

Even with such details, one might not be able to totally recapture

the issues and factors. The main message from these studies is,

however, that studies were continued given interim data that

strongly suggested the final outcome.

SUMMARY

The CDP experience led the investigators to suggest a number

of issues which must be resolved before any decision for early

termination can or should be reached. Those issues have been

affirmed by others as well. A brief summary of the issues (with

some modifications) is described below.

Are the results explained by a possible imbalance in the

distribution of baseline patient characteristics?

Could the patient evaluation be biased as far as prilnary

outcome measures is concerned?

Is there a consistency of results for other primary or

secondary response variables?

Is there a consistency of results across various

subgroups of the patients and across clinical centers?

What is the risk and benefit ratio of the therapy?

Could the results be explained by patient compliance to

prescribed therapy or some other therapy?

Have the results been interpreted in view of the effect

of repeated testing of accumulating data?

Could the current trends likely be reversed if the trial

continued to the end?

How much additional information would be obtained by

continuing?

What would the impact of early termination have on the

acceptability of the results to medical and scientific

colleagues?

Are the results consistent with other studies or

available information?

Was it ethical to continue further treatment and

follow-up given the early evidence of therapeutic

benefit.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

2414 D E METS

All of this discussion has attempted to demonstrate that early

termination of any clinical trial is a complex, difficult if not

at times tortuous, and not an easily defined process. Even for

studies which implemented some of the recent statistical

methodology, the decision process was not straightforwardly

dictated by that statistical methodology. These methods clearly

were not stopping rules as is often the manner in which they are

described.

Nevertheless, such statistical methods are essential. Even

though many of the methods assume termination to occur if a

boundary were crossed, the fact that this assumption might be

violated in practice does not lessen their value. Just as

navigators need positional sites to help them judge their

location, data monitoring committees need statistical methods to

help them judge their position with regard to cautious

interpretation of interim results from accumulating data. While

they are not stopping rules, such methods can be useful guides in

the decision making process. As the methods become further

developed and can more closely reflect the clinical trial.

situation, the more precise guides the methods will be, but

ultimately these methods will still only be practical guides and

not absolute rules.

ACKNOWLEDGEmaNTS -----

This work was supported in part by NIH Grants No. NCI

P30-CA-14520-11 and 2-R01-CA18332. The author would like to thank

Ms. Terry Metcalf for typing this manuscript.

BIBLIOGRAPHY

Aspirin Myocardial Infarction Study Research Group, (1980). A randomized controlled trial of aspirin in persons recovered from myocardial infarction. J. Amex. Med. Assoc., 243, 661-669.

Amitage, P., (19791. The analysis of data from clinical trials. The Statistician, 28, 171-1133.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014


Beta-Blocker Heart At tack Study Group, (1981). The Beta-Blocker Heart At tack T r i a l - p r e l i m i n a r y r e p o r t . J. Amer. Med. Assoc., 246, 2073-2074.

Beta-Blocker Heart At tack Study Group, (1982). The Beta-Blocker Heart At tack T r i a l - f i n a l r e p o r t . J . Amer. Med. Assoc., 247, 1707-1714.

Canner, P. L . , (1979). Monitoring c l i n i c a l t r i a l d a t a f o r ev idence o f a d v e r s e o r b e n e f i t i c i a l t r e a t m e n t e f f e c t s . E d i t e d by J. P. B o i s s e l and C. R. K l i m t . In : M u l t i c e n t e r C o n t r o l l e d T r i a l s : P r i n c i p l e s and Problems. P a r i s : INSERM.

Chalmers, T. C., (1979). I n v i t e d remarks. C l i n . Pharmacol. Ther . , 25, 649-650.

C o r n f i e l d , J., (1976). Recent methodolog ica l c o n t r i b u t i o n s t o c l i n i c a l t r i a l s . Am. J. Epidemiol. , 104, 408-421 .

Coronary Drug P r o j e c t Research Group, (1973). The Coronary Drug P r o j e c t . Design, methods, and b a s e l i n e r e s u l t s . C i r c u l a t i o n , 4 7 ( s u p p l . I), 11-179.

Coronary Drug P r o j e c t Research Group, (1970). The Coronary Drug P r o j e c t . I n i t i a l f i n d i n g s l e a d i n g t o m o d i f i c a t i o n s o f i t s r e s e a r c h p r o t o c o l . J . Amer. Med. Assoc., 214, 1303-1313.

Coronary Drug P r o j e c t Research Group, (1972). The Coronary Drug P r o j e c t . F ind ings l e a d i n g t o f u r t h e r m o d i f i c a t i o n s o f i t s p r o t o c o l wi th r e s p e c t t o d e x t r o t h y r o x i n e . J . Amer. Med. Assoc., 220,

Coronary Drug P r o j e c t Research P r o j e c t . F ind ings l e a d i n g 2.5 mg/day e s t r o g e n group. 226, 652-657.

Coronary Drug P r o j e c t Research

Group, (1973). The Coronary Drug t o d i s c o n t i n u a t i o n of t h e J. A m e r . Med. Assoc.,

Group, (1975). C l o f i b r a t e and n i a c i n i n coronary h e a r t d i s e a s e . J. Amer. Med. Assoc., 231, 360-381.

Coronary Drug P r o j e c t Research Group, (1981). P r a c t i c a l a s p e c t s o f d e c i s i o n making i n c l i n i c a l t r i a l s : The Coronary Drug P r o j e c t a s a c a s e s t u d y . C o n t r o l l e d C l i n i c a l T r i a l s , 1, 363-376.

DeMets, D. and Lan, G . , ( 1 9 8 4 ) . An overview o f s e q u e n t i a l methods and t h e i r a p p l i c a t i o n i n c l i n i c a l t r i a l s . Corn. S t a t i s t . , l3(9), 2315-2338.

DeMets, D. L . , Williams, G. W . , Brown, B. W . , and t h e NOTT Research Group, (1982). A c a s e r e p o r t o f d a t a moni tor ing

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

DE METS

experience: The nocturnal oxygen therapy trial. Controlled Clinical Trials, 3, 113-124.

DeMets, D., Hardy, R., Friedman, L., and Lan, G., (1984). Statistical aspects of early termination in the beta-blocker heart attack trial. Controlled Clinical Trials, to appear.

Ederer, F., (1981). Early stopping of a clinical trial: A case study - the Diabetic Retinopathy study. Presented at the Spring ENAR Biometric Society Meeting, Richmond, Virginia.

Friedman, L. M., Furberg, C. D., and DeMets, D. L., (1981). Fundamentals of Clinical Trials. Boston: Wright-PSG.

Gail, M., (1982). Monitoring and stopping clinical trials. In: Statistics in Medical Research: Methods and Issues. Edited by V. Mike1 and K. Stanley. New york: John Wiley and Sons.

Halperin, M., Lan, K. K. G., Ware, J. H., Johnson, N. J., and DeMets, D. L., (1982). An aid to data monitoring in long-term clinical trials. Controlled Clinical Trials, 3, 311-323.

Hypertension Detection and Followup Program Cooperative Group, (1979). Five-year findings of the Hypertension Detection and Followup Program. J. Amer. Med. Assoc., 242, 2562-2576.

Kolata, G. B., (1979). Controversy over study of diabetes drugs continues for nearly a decade. Science, 203, 986-990.

Lan, K. K. G. and DeMets, D. L., (1983). Discrete sequential boundaries for clinical trials. Biometrika, in press.

Lan, K. K. G . , Simon, R., and Halperin, M., (1982). Stochastically curtailed tests in long-term clinical trials. Comm. Statist., C(Sequentia1 Analysis), 1, 207-219.

Meier, P., (1979). Terminating a trial - the ethical problem. Clin. Pharmacol. Ther., 25, 633-640.

Multiple Risk Factor Intervention Trial Research Group, (1982). The multiple risk factor intervential trial risk factor changes and mortality results. J. Amer. Med. Assoc., 248, 1465-1477.

Nocturnal Oxygen Therapy Group, (1980). Continuous or nocturnal oxygen therapy in hypoxemic chronic obstructive lung disease. Ann. Intern. Med., 93, 391-398.

O'Brien, P. C. and Fleming, T. R., (1979). A multiple testing procedure for clinical trials. Biometrics, 35, 549-556.

Pocock, S. J., (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64, 191-199.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

STOPPING GUIDELINES VS STOPPING RULES 2417

Report of the Committee for the Assessment of Biometric Aspects of Controlled Trials of Hypoglycemic Agents, (1975). J. Amer. Med. Assoc., 231, 583-608.

shaw, L. W. and Chalmers, T. C., (1970). Ethics incooperative clinical trials. Ann. N.Y. Acad. Sci., 169, 487-495.

The Diabetic Retinopathy Study Research Group, (1981). Diabetic Retinopathy Study Report No. 6: Design, methods, and baseline results. Invest. Ophthalmol. Vis. Sci., 21, 149-209.

The Diabetic Retinopathy Study Research Group, (1978). Photocoagulation treatment of proliferative diabetic retinopathy. The second report of the Diabetic Retinopathy Study. Ophthalmol., 85, 82-106.

The ~iabetic Retinopathy Study Research Group, (1976). Preliminary report on effects of photocoagulation therapy. Am. J. Ophthalmol., 81, 383-396.

Tsiatis, A. A . , (1981). The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika, 68, 311-315.

University Group Diabetes Program, (1970). A study of the effects of hypoglycemic agents on vascular complications in patients 21th adult-onset diabetes. 11. Mortality results. Diabetes, 19(suppl 2 ) , 787-830.

University Group Diabetes Program, (1971). Effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. IV. A preliminary report on phenformin results. J. Amer. Med. Assoc., 217, 777-784.

Dow

nloa

ded

by [

Mas

sach

uset

ts I

nstit

ute

of T

echn

olog

y] a

t 19:

55 2

6 N

ovem

ber

2014

stopping guidelines vs stopping rules: a practitioner's point of view

Documents