Interim analyses and sequential designs in phase III studies

Susan Todd, Anne Whitehead, Nigel Stallard & John Whitehead

Medical and Pharmaceutical Statistics Research Unit, The University of Reading, PO Box 240, Earley Gate, Reading, Berkshire, RG6 6FN

Recruitment of patients to a clinical trial usually occurs over a period of time, resulting

in the steady accumulation of data throughout the trial's duration. Yet, according to

traditional statistical methods, the sample size of the trial should be determined in

advance, and data collected on all subjects before analysis proceeds. For ethical and

economic reasons, the technique of sequential testing has been developed to enable the

examination of data at a series of interim analyses. The aim is to stop recruitment to the

study as soon as there is sufficient evidence to reach a firm conclusion. In this paper we

present the advantages and disadvantages of conducting interim analyses in phase III

clinical trials, together with the key steps to enable the successful implementation of

sequential methods in this setting. Examples are given of completed trials, which have

been carried out sequentially, and references to relevant literature and software are

provided.

Keywords: clinical trials, error rates, monitoring, sequential trials

Introduction

In this, the first in a series of three papers dealing with the

opportunities and dangers presented by interim analyses in

clinical trials, we focus on phase III clinical studies. A phase

III clinical trial is a large-scale study, typically comparing a

promising experimental treatment with a control (placebo

or active). Its purpose is to seek firm evidence to support a

claim that the experimental treatment has clinical benefits.

In this paper we show how sequential methodology can

play an important role in such trials.

The traditional approach to conducting phase III

clinical trials has been to calculate a single fixed sample

size in advance of the study, which depends upon a

specified significance level and power and the treatment

advantage to be detected. Data on all patients are then

collected before any formal analyses are performed. While

such a framework is logical when observations are available

simultaneously, as in an agricultural field trial, it may be

less suitable for medical studies, in which patients are

recruited over months if not years, and data are available

sequentially. Here, results from patients who enter the trial

early on are available for analysis while later patients are

still being enrolled. It is natural to be interested in such

results, but the uncontrolled examination of data can lead

to misleading and sometimes wholly inappropriate

conclusions, an issue which is considered further in this article.

Some routine monitoring of trial progress, usually

blinded to treatment allocation, is often undertaken as part

of a phase III trial. This can range from simple checking

of protocol compliance and the accurate completion of

record forms, to monitoring adverse events in trials of

serious conditions so that prompt action can be taken.

Such monitoring may be undertaken in conjunction with

a data and safety monitoring board (DSMB), established

to review the information collected. It would therefore

appear that assessment of interim treatment differences is

a logical and worthwhile extension. However, the

handling of treatment comparisons while a trial is still in

progress poses problems in medical ethics, statistical

analysis and practical organization [1]. In methodological

terms, the approach presented in this paper is known as the

frequentist approach and is the most widely used

framework in clinical trials. An alternative school of thought,

not discussed here, but mentioned for completeness, is the

Bayesian approach as described by Spiegelhalter et al. [2].

Opportunities and dangers

The most appealing reason for monitoring trial data for treatment differences is that, ethically, it is desirable to terminate or change a trial when evidence has emerged that one treatment is clearly superior to the other. This is particularly important when life-threatening diseases are involved. Alternatively, the data may support the conclusion that the experimental treatment and the control do not differ by some predetermined clinically relevant magnitude, in which case it would be desirable, both ethically and economically, to stop the study and divert resources elsewhere. Finally, if information in a trial is accruing more slowly than expected, perhaps because of a low event rate, then extension of recruitment until a large enough sample has been recruited may be appropriate.

Correspondence: Dr S. Todd, Medical and Pharmaceutical Statistics Research Unit, The University of Reading, PO Box 240, Earley Gate, Reading, Berkshire, RG6 6FN. Tel.: 0118 9318917; Fax: 0118 9753169; E-mail: s.c.todd@reading.ac.uk

Received 18 April 2000, accepted 9 February 2001.

© 2001 Blackwell Science Ltd Br J Clin Pharmacol, 51, 394–399

Unfortunately multiple analyses of accumulating data

lead to problems in the interpretation of results. The main

problem occurs when significance testing is undertaken at

the various interim looks. Even if the treatments are really

equally effective, the more often one analyses the

accumulating data, the greater the chance of eventually

and wrongly detecting a difference, thereby drawing

incorrect conclusions from the trial. Armitage et al. [3]

were the first to compute numerically the extent to which

the type I error probability (the probability of incorrectly

declaring the experimental treatment as different from

control) is increased over its nominal level if a standard

hypothesis test is conducted at each of a series of interim

looks. They studied the problem of testing a normal mean

with known variance and set the significance level or

type I error probability for the trial to be 5%. If one

interim analysis and one final analysis are performed

this error rises to 8%. If four interim analyses and a final

analysis are undertaken this figure is 14%. Similar figures

can be anticipated for other response types. In order to

make use of the advantages of monitoring the treatment

difference, methodology is required to maintain the

overall type I error rate at an acceptable level.
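The inflation described above can be checked with a small simulation. The sketch below is not the numerical calculation of Armitage et al. [3]. It is a Monte Carlo approximation that assumes equally spaced looks, normally distributed responses with known variance, and an unadjusted two-sided test at the nominal 5% level (critical value 1.96) at every look. The function name and its parameters are illustrative only.

```python
import math
import random


def repeated_test_error_rate(n_looks, n_sims=20000, z_crit=1.96, seed=1):
    """Estimate the overall type I error rate when an unadjusted two-sided
    z-test is applied at each of `n_looks` equally spaced analyses."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cumulative = 0.0  # running sum of standardized group results under H0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)  # standardized statistic at this look
            if abs(z) > z_crit:
                rejections += 1  # a (false) significant difference is declared
                break
    return rejections / n_sims


if __name__ == "__main__":
    for looks in (1, 2, 5):
        print(looks, repeated_test_error_rate(looks))
```

Under these assumptions the estimates for two and five looks should lie close to the 8% and 14% quoted above, up to simulation error.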

A second problem concerns the final analysis. When

data are inspected at interim looks, the analysis appropriate

for fixed sample size studies is no longer valid. Quantities

such as P values, point estimates and confidence intervals

are still well defined, but new methods of calculation are

required. If a traditional analysis is performed at the end

of a trial that stops because the experimental treatment

is found better than control, the P value will be too small

(too significant), the point estimate too large and the

confidence interval too narrow.

To deal with the above problems, special techniques

are required. These can be broadly termed sequential

methods. In the following section a brief overview of this

methodology and related issues is given.
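The second problem, overestimation after early stopping, can also be illustrated by simulation. The following sketch assumes that each response is a treatment difference with unit variance and true mean `theta`, that looks occur after every `group_size` responses, and that the trial stops the first time the z-statistic exceeds a constant critical value. The value 2.413 is a tabulated Pocock-type constant for five looks, used here purely for illustration; none of these settings comes from the trials discussed in this paper.

```python
import math
import random


def estimates_at_upper_crossing(theta, n_looks=5, group_size=20,
                                crit=2.413, n_sims=5000, seed=2):
    """Simulate trials with true mean difference `theta` and return the naive
    estimates (sample means) from trials that crossed the upper boundary."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        total = 0.0
        for look in range(1, n_looks + 1):
            # The sum of `group_size` N(theta, 1) responses is N(g * theta, g).
            total += rng.gauss(group_size * theta, math.sqrt(group_size))
            n = look * group_size
            if total / math.sqrt(n) > crit:
                estimates.append(total / n)  # naive fixed-sample estimate
                break
    return estimates


if __name__ == "__main__":
    est = estimates_at_upper_crossing(theta=0.2)
    print(len(est), sum(est) / len(est))
```

Because stopping requires the estimate to exceed `crit` divided by the square root of the current sample size, every naive estimate from a stopped trial in this configuration is above 0.24, although the true value is 0.2.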

Sequential methodology

In his 1999 paper [4], Whitehead lists the key ingredients

required to conduct a trial sequentially (see Figure 1). The

first two ingredients are common to both fixed sample

size and sequential studies, but are worth emphasizing

for completeness. The second two are solutions to the

particular problems of error rates and analysis in the

sequential setting. Any combination of choices for the

four ingredients is permissible, but, largely for historical

reasons, particular combinations preferred by authors in

the field have been extensively developed, incorporated

into software (see below) and used in practice. Each of

the four ingredients will now be considered briefly in turn.

Parameterization of the treatment difference

As with a fixed sample size study the first stage in designing

a phase III sequential clinical trial is to establish a primary

measure of efficacy. The authority of any clinical trial

will be greatly enhanced if a single primary response is

specified in the protocol and is subsequently found to

show significant benefit of the experimental treatment.

The choice should depend upon such criteria as clinical

relevance, ease of obtaining accurate measurements

and familiarity to clinicians. Appropriate choice for the

associated parameter measuring treatment difference

can then be made. This should depend upon such criteria

as interpretability, for example whether a measurement

based on a difference or a ratio is more familiar, and

precision of the resulting analysis. A wide variety of

continuous and discrete data types can be dealt with.

Suppose that in a clinical trial the appropriate response

is identified as survival time following treatment for

cancer, then a suitable parameter of interest might be the

log-hazard ratio. If the primary response is a continuous

measure such as the reduction in blood pressure after

1 month of antihypertensive medication then the

difference in true (unknown) means is of interest. Finally, if

we are considering a dichotomous variable, such as the

occurrence (or not) of deep vein thrombosis following

hip replacement, the log-odds ratio may be the parameter

of interest.
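The three parameters mentioned above can be written down explicitly. The sketch below is a minimal illustration with hypothetical function names. The log-hazard ratio function assumes proportional hazards, so that the survival proportions at a common time point satisfy S_E = S_C raised to the power of the hazard ratio.

```python
import math


def log_odds_ratio(p_experimental, p_control):
    """Log-odds ratio for a dichotomous response, given the two event
    (or success) probabilities."""
    return (math.log(p_experimental / (1 - p_experimental))
            - math.log(p_control / (1 - p_control)))


def difference_in_means(mean_experimental, mean_control):
    """Difference in means for a continuous response."""
    return mean_experimental - mean_control


def log_hazard_ratio(surv_experimental, surv_control):
    """Log-hazard ratio implied by survival proportions at a common time,
    assuming proportional hazards."""
    return math.log(math.log(surv_experimental) / math.log(surv_control))


if __name__ == "__main__":
    print(log_odds_ratio(0.60, 0.25))     # about 1.50
    print(log_hazard_ratio(0.50, 0.25))   # log(0.5), about -0.69
    print(difference_in_means(12.0, 8.0))
```

With these sign conventions, a log-hazard ratio below zero and a log-odds ratio of successes above zero both favour the experimental treatment.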

Test statistics for use in interim analyses

A sequential test monitors a statistic summarizing the

current difference between the experimental treatment

and control at a series of times during the trial. If the

absolute value of this statistic exceeds some specified

critical value, the trial is stopped and the null hypothesis

of no difference between treatments is rejected. The

timing of the interim looks can be measured directly in

terms of number of patients, or more flexibly in terms

of information. It should be noted that the test statistic

measuring treatment difference may increase or decrease

between looks, while the statistic measuring information

will always increase. Early work in this area prescribed


designs whereby traditional test statistics, such as the

t-statistic or the chi-squared statistic, were monitored

after each patient's response was obtained. Examples can

be found in the book by Armitage [5]. Later work

by Pocock [6] and O'Brien & Fleming [7] allowed

inspections after the responses from each group of k

patients were obtained, where k was predefined. Since

then, statisticians have developed more flexible ways

of conducting sequential trials when considering the

number and the timing of interim inspections. Whitehead

[8] monitors a statistic measuring treatment difference

known in technical terms as the efficient score and times

the interim looks in terms of a second statistic

approximately proportional to study sample size known as observed

Fisher's information. Jennison & Turnbull [9] use a direct

estimate of the treatment difference itself as the test statistic

of interest and record inspections in terms of a function of

its standard error.
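For a dichotomous response the efficient score and Fisher's information have simple closed forms. The sketch below assumes the standard expressions for the log-odds ratio given by Whitehead [8]: Z = (n_C S_E − n_E S_C)/n and V = n_E n_C S F / n³, where S and F are the total numbers of successes and failures and n is the total sample size. The function name is illustrative only.

```python
def score_and_information(successes_e, n_e, successes_c, n_c):
    """Efficient score Z and Fisher's information V for the log-odds ratio,
    computed from the 2x2 table of successes and failures."""
    n = n_e + n_c
    successes = successes_e + successes_c
    failures = n - successes
    # Observed successes on the experimental arm minus the number expected
    # if the treatments were equivalent.
    z = (n_c * successes_e - n_e * successes_c) / n
    # Null variance of Z, which grows as data accumulate.
    v = n_e * n_c * successes * failures / n ** 3
    return z, v


if __name__ == "__main__":
    # First interim look reported in Figure 2: 5/6 on Viagra, 1/6 on placebo.
    print(score_and_information(5, 6, 1, 6))
```

Applied to the first look reported in Figure 2 (5/6 against 1/6), this gives Z = 2, matching the value plotted there, together with V = 0.75.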

Stopping rules for sequential trials

As highlighted above, a sequential test compares the test

statistic measuring treatment difference with appropriate

critical values. These critical values form a stopping rule

or boundary for the trial. At any stage in the trial, if

the boundary is crossed, the study is stopped and an

appropriate conclusion drawn. If the statistic stays within

the test boundary then there is not enough evidence to

come to a conclusion at present and a further interim

look should be taken. It is possible to look after every

patient or to have just one or two interim analyses. When

interims are performed after groups of patients this may

be referred to as a `group sequential trial'. The advantage of

looking after every patient is that a trial can be stopped

as soon as an additional patient response results in the

boundary being crossed. In contrast, performing just one

or two looks reduces the potential for stopping, and hence

delays it. However, the logistics of performing interim

analyses after groups of subjects are far easier to manage. In

practice, planning for between 4 and 8 interim analyses

appears sensible.
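The monitoring procedure just described amounts to a simple loop over the interim looks. The sketch below is illustrative: the helper names are hypothetical, the sequence of z-values is invented, and the constants 2.413 (Pocock-type) and 2.040 (O'Brien & Fleming-type) are tabulated values for five equally spaced looks at the two-sided 5% level, quoted from standard tables rather than from this paper.

```python
import math


def pocock_bounds(n_looks, c):
    """Constant critical value at every look."""
    return [c] * n_looks


def obrien_fleming_bounds(n_looks, c):
    """Critical values that start high and fall to `c` at the final look."""
    return [c * math.sqrt(n_looks / k) for k in range(1, n_looks + 1)]


def first_crossing(z_values, bounds):
    """Return the 1-based look at which |Z| first exceeds its critical
    value, or None if the boundary is never crossed."""
    for look, (z, bound) in enumerate(zip(z_values, bounds), start=1):
        if abs(z) > bound:
            return look
    return None


if __name__ == "__main__":
    z_path = [1.0, 2.2, 2.5, 2.1, 2.0]  # invented standardized statistics
    print(first_crossing(z_path, pocock_bounds(5, 2.413)))
    print(first_crossing(z_path, obrien_fleming_bounds(5, 2.040)))
```

The same sample path can therefore lead to stopping under one rule and continuation under another, which is why the choice of boundary must be fixed in advance.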

Once it had been established that there was a problem

with in¯ating the type I error when using traditional tests

and the usual fixed sample size critical values, designs had

to be suggested which adjusted for this. It is the details of

the derivation of the stopping rule that introduce much of

the variety of sequential methodology. Key early work in

the area includes the tests of Pocock [6] and O'Brien &

Fleming [7]. A more flexible approach, referred to as the

alpha-spending method, was proposed by Lan & DeMets

[10] and extended by Kim & DeMets [11]. A collection of

designs based on straight line boundaries, which builds

on work that has steadily accumulated since the 1940s, is

discussed by Whitehead [8], the best known and most

widely implemented of these being the triangular test.
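The idea behind the alpha-spending method is that the overall type I error is allocated gradually as information accrues. The sketch below shows two spending functions commonly associated with Lan & DeMets [10], written as functions of the information fraction t between 0 and 1. The function names are illustrative, and computing the corresponding critical values (which requires numerical integration) is not attempted here.

```python
import math
from statistics import NormalDist


def obf_type_spending(t, alpha=0.05):
    """O'Brien & Fleming-type spending: very little alpha is spent early."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))


def pocock_type_spending(t, alpha=0.05):
    """Pocock-type spending: alpha is spent more evenly across looks."""
    return alpha * math.log(1 + (math.e - 1) * t)


if __name__ == "__main__":
    for t in (0.2, 0.4, 0.6, 0.8, 1.0):
        print(t, round(obf_type_spending(t), 5), round(pocock_type_spending(t), 5))
```

Both functions reach the full 5% at t = 1, but at half the planned information the O'Brien & Fleming-type function has spent roughly 0.006 while the Pocock-type function has spent roughly 0.031.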

The important issues to focus upon are the desirable reasons for stopping or continuing a study. Reasons for stopping may include:

• The experimental treatment is obviously worse than the control
• The experimental treatment is already obviously better
• There is little chance of showing that the experimental treatment is better.

Reasons for continuing may include:

• A moderate advantage of the experimental treatment is likely and it is desired to estimate the magnitude carefully
• The event rate is low and more patients are needed to achieve power.

These will determine the type of stopping rule that is

appropriate for the study under consideration. Stopping

rules are now available for testing superiority,

noninferiority, equivalence and even safety aspects of clinical trials.

As an example, consider a clinical trial conducted by the Medical Research Council Renal Cancer Collaborators between 1992 and 1997 [12]. Patients with metastatic renal carcinoma were randomly assigned to treatment with either the biological therapy, interferon-α, or the hormone therapy, oral medroxyprogesterone acetate (MPA). The use of interferon-α was experimental and this treatment is known to be both toxic and costly. Consequently its benefits over MPA needed to be substantial to justify its wider use. A stopping rule was required to satisfy the following requirements:

• Early stopping if data showed a clear advantage of interferon-α over oral MPA
• Early stopping if data showed no worthwhile advantage of interferon-α (either interferon-α obviously worse or little difference between treatments).

Figure 1 Key ingredients for conducting a sequential trial.


This suggested use of an asymmetric stopping rule. The

design chosen was the triangular test [8], similar in

appearance to the stopping rule in Figure 2. Interim analyses

were planned every 6 months from the start of the trial.

The precise form of the stopping rule is defined, as is

the sample size in a fixed sample size trial, by consideration

of significance level, power and desired treatment

advantage, with reference to the primary endpoint. The primary

endpoint in the MRC study was survival time and the

treatment difference was measured by the log-hazard

ratio. It was decided that if a difference in 2-year survival

from 20% on MPA to 32% on interferon-α (log-hazard

ratio −0.342) was present, then a significant treatment

difference at the two-sided 5% significance level should

be detected with 90% power.
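To show how significance level, power and treatment advantage combine, the sketch below computes the number of deaths that a fixed sample size comparison would need for the design values quoted above. It uses a widely used approximation that is not stated in this paper and is assumed here: with equal allocation, about 4(z_{α/2} + z_β)²/θ² events are required, where θ is the log-hazard ratio. The function name is illustrative, and the result is not the sample size of the sequential design actually used.

```python
import math
from statistics import NormalDist


def required_events(log_hazard_ratio, alpha=0.05, power=0.90):
    """Approximate number of events for a fixed sample size two-arm
    survival comparison with 1:1 allocation (assumed approximation)."""
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return math.ceil(4 * z_total ** 2 / log_hazard_ratio ** 2)


if __name__ == "__main__":
    print(required_events(-0.342))
```

Under this approximation, halving the treatment advantage to be detected would roughly quadruple the number of events required.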

Analysis following a sequential trial

Once a sequential trial has stopped, an analysis will be

performed. The interim analyses determine only whether

stopping should take place; they do not provide a complete

interpretation of the data. An appropriate final analysis

must take account of the fact that a sequential design was

used. Unfortunately, many trials which have been

terminated at an interim analysis are finally reported

with analyses which take no statistical account of the

inspections made [13]. In a sequential trial, although the

meaning and interpretation of data summaries such as

significance levels, point estimates and confidence intervals

remain as for fixed sample size trials, various methods of

calculation have been proposed. These lead to slightly

different results when applied to the same set of data. The

user of a computer package such as those referenced

below may accept the convention of the package and

use the resulting analysis without being concerned about

the details of calculation. Readers who wish to develop

a deeper understanding of statistical analysis following a

sequential trial are referred to Chapter 5 of Whitehead [8]

and Chapter 8 of Jennison & Turnbull [14].

Sequential clinical trials in practice

Increasingly, sequential procedures are being implemented

in modern clinical trials. Peace [15] presents case studies

of several applications, some of which have formed part of

New Drug Applications (NDAs) that have been approved

by the Food and Drug Administration (FDA). Additional

examples can be found in the proceedings of two

workshops, one on practical issues in data monitoring sponsored

by the US National Institutes of Health held in 1992

(published in issues 5 and 6 of volume 12, 1993, of

Statistics in Medicine) and the other on early stopping

rules in cancer clinical trials held at Cambridge University

in 1993 (published in issues 13 and 14 of volume 13,

1994, of Statistics in Medicine). The medical literature also

demonstrates the widening use of sequential methods.

Examples of such studies include trials of corticosteroids

for AIDS-induced pneumonia [16], of enoxaparin for

prevention of deep vein thrombosis resulting from hip

replacement surgery [17] and of implanted defibrillators in

coronary heart disease [18]. Two books dealing exclusively

with the implementation of sequential methods in clinical

trials are those by Whitehead [8] and Jennison & Turnbull

[14]. In addition, there are three commercial software

packages currently available. The package PEST [19] is

based on straight line boundaries. The package EaSt [20]

implements the boundaries of Wang &

Tsiatis [21] and Pampallona & Tsiatis [22]. A recent

addition to the package S-Plus is the S+ SeqTrial module

[23]. PEST and EaSt have both been developed over a

number of years and are the leading packages in this field.

Both packages allow construction of stopping rules for a

variety of practical circumstances, and provide a valid

final analysis. PEST also includes computation of

appropriate test statistics at each interim analysis, together with

some additional final analysis options. A good review of

the capabilities of earlier versions is given by Emerson

[24]. The S-Plus module is new this year and

consequently has not yet been used as extensively. An

example of the design and implementation of an actual

sequential trial is given in Figure 2.
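The fixed sample size of 57 quoted in Figure 2 can be approximately reproduced from the design values given there (25% improvement on placebo, 60% on Viagra, power 0.8, two-sided significance level 0.05). The sketch below assumes the information-based approximation in which the required information is V = ((z_{α/2} + z_β)/θ)² for log-odds ratio θ, and V is approximately n p̄(1 − p̄)/4 under equal allocation, with p̄ the average of the two success rates. The function name is illustrative, and this is not claimed to be the exact calculation performed by the software packages mentioned above.

```python
import math
from statistics import NormalDist


def fixed_sample_size_binary(p_control, p_experimental, alpha=0.05, power=0.8):
    """Approximate total fixed sample size for comparing two success
    rates via the log-odds ratio, with 1:1 allocation."""
    norm = NormalDist()
    theta = (math.log(p_experimental / (1 - p_experimental))
             - math.log(p_control / (1 - p_control)))
    # Information needed to detect theta with the stated error rates.
    required_information = (
        (norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)) / theta) ** 2
    p_bar = (p_control + p_experimental) / 2
    return math.ceil(4 * required_information / (p_bar * (1 - p_bar)))


if __name__ == "__main__":
    print(fixed_sample_size_binary(0.25, 0.60))
```

Under these assumptions the calculation gives 57 subjects, in line with the figure quoted, against the 26 men actually treated in the sequential trial.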

When planning any clinical trial sequentially, the

implications of introducing a stopping rule need to be

thought out carefully in advance of the study. In addition,

all involved in the trial should be consulted with regard to

the choice of a clinically relevant difference, specification

of an appropriate power requirement, and the selection of

a suitable stopping rule. As part of the protocol for the

study the operation of any sequential procedure should be

described clearly in the statistical section.

If a DSMB is appointed, one of its roles should be

to scrutinize any proposed sequential stopping rule prior

to the start of the study and to review the protocol in

collaboration with the trial Steering Committee. The

procedure for undertaking the interim analyses should

also be finalized in advance of the trial start-up. The

DSMB would then review results of the interim analyses

as they are reported. Membership of the DSMB and its

relationship with other parties in a clinical trial has been

considered in the 1993 Statistics in Medicine volume

referenced above and by Whitehead [25]. It is important

that the interim results of an ongoing trial are not

circulated widely as this may have an undesirable effect on

the future progress of the trial. Investigators' attitudes will

clearly be affected by whether a treatment looks good

or bad as the trial progresses. It is usual for the DSMB to be



supplied with full information and, ideally, the only other

individual to have knowledge of the treatment comparison

would be the statistician who performs the actual analyses.

Decision making as part of a sequential trial (whether

by a DSMB or another party involved in the trial) is both

important and time sensitive. A decision taken to stop a

study not only affects the current trial, but often affects

future trials planned in the same therapeutic area.

However, continuing a trial too long puts participants at

unnecessary risk and delays the dissemination of important

information. It is essential to make important scientific

and ethical decisions with confidence. Wondering

whether the data supporting interim analyses are accurate

and up-to-date is unsettling and makes the decision

process harder. It is therefore necessary for the statistician

performing the interim analyses to have both timely

and accurate data. Unfortunately, a trade-off exists: it

takes time to ensure accuracy. Potential problems can

be alleviated if data for interim analyses are reported

separately from the other trial data, as part of a `fast-track'

system. A smaller volume of data can be validated more quickly.

If timeliness and accuracy are not in balance, not only

may real-time decisions be made on old data, but more

seriously, differential reporting may lead to inappropriate

study conclusions.

Discussion

Sequential methodology in phase III clinical trials is not

new, but it is true to say that it is the more recent

theoretical developments, together with the availability

of software, which have precipitated its wider use. The

methodology is flexible as it enables choice of a stopping

rule from a number of alternatives, allowing the trial

design to meet the study objectives. One important point

is that a stopping rule should not govern the trial

completely. If external circumstances change the

appropriateness of the trial or assumptions made when choosing

the design are suspected to be false, it can and should

be overridden, although the reasons for doing so must

be carefully documented.

Methodology for conducting a phase III clinical trial

sequentially has been extensively developed, evaluated and

documented. Error rates can be accurately preserved and

valid inferences drawn. It is important that this fact

is recognized and that individuals contemplating the use

of interim analyses conduct them correctly. Neither the FDA

nor the Medicines Control Agency (MCA) looks

favourably on evidence from trials incorporating

unplanned looks at data. In the US, the Federal Register (1985) published regulations for NDAs which included the requirement that the analysis of a phase III trial `assess...the effects of any interim analyses performed'. The FDA guidelines were updated by publication of `E9 Statistical Principles for Clinical Trials' in a later Federal Register (1998). Section 3 of this document discusses group sequential designs and Section 4 covers trial conduct including trial monitoring, interim analysis, early stopping, sample size adjustment and the role of an independent DSMB. With such acknowledgement from regulatory authorities the future for sequential methodology within clinical trials is encouraging.

[Figure 2 plot: the statistic Z (vertical axis) against V (horizontal axis), showing the stopping boundaries of the triangular test and the points observed at each interim look.]

Amongst the studies conducted in the development of Viagra was a small trial in men suffering erectile dysfunction as a result of spinal cord injury [26]. An efficient trial methodology for reaching a reliable conclusion with as few subjects as possible was required. It was felt that spontaneous improvement of their erections would be reported by 25% of men on placebo. An increase in the percentage of improvements from 25% on control to 60% on Viagra was felt to be clinically relevant. It was desired to detect this with power 0.8. A significance level of 0.05 was specified. When the objectives of the trial were considered in detail, an appropriate stopping rule known as the triangular test was chosen.

Eligible men attending clinics in Southport, Belfast and Stoke Mandeville, who had a regular female partner, were randomised between Viagra and a matching placebo pill. After 4 weeks they were asked whether the treatment received had improved their erections. By January 1996, 12 men had completed 4 weeks of treatment with 5/6 on Viagra and 1/6 on placebo reporting improvement. The first point plotted on the figure (x) represents those data. The statistic Z signifies the advantage seen so far on Viagra and is calculated from the observed number of successes on Viagra minus the number of successes that would have been expected if Viagra had no effect. The expected number of successes can be found by multiplying the total number of successes (6) by the proportion of men receiving Viagra (1/2), giving 3, so that Z is equal to 5 − 3 = 2 as plotted in the figure. The statistic V measures the information on which that comparison is based. This is the variance of Z. The inner dotted boundaries, known as the Christmas tree correction for discrete looks, form the stopping boundary: reach this and the trial is complete. Crossing the upper boundary results in a positive trial conclusion. The data were studied again in February, where 6/8 improved on Viagra and 1/8 improved on placebo, and in March, by which time improvement rates were 8/10 on Viagra and 1/10 on placebo. The upper boundary was reached and recruitment closed. When the results on the 6 men under treatment at that time were added, the rates became 9/12 and 1/14, respectively. By using a series of interim looks, the design allowed a strong positive conclusion to be drawn after only 26 men had been treated. A total of 57 subjects would have been entered into a fixed sample size trial.

Figure 2 Statistics for Viagra.

The authors are grateful to the two referees for their comments and

suggestions.

References

1 Pocock SJ. Clinical trials: a practical approach. New York: Wiley, 1983.

2 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials. J Roy Statist Soc Series A 1994; 157: 357–416.

3 Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J Roy Statist Soc Series A 1969; 132: 235–244.

4 Whitehead J. A unified theory for sequential clinical trials. Statistics Med 1999; 18: 2271–2286.

5 Armitage P. Sequential medical trials (2nd edn). Oxford, UK: Blackwell, 1975.

6 Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–199.

7 O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556.

8 Whitehead J. The design and analysis of sequential clinical trials (revised 2nd edn). Chichester, UK: John Wiley & Sons Ltd, 1997.

9 Jennison C, Turnbull BW. Group sequential analysis incorporating covariate information. J Am Statist Assoc 1997; 92: 1330–1341.

10 Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663.

11 Kim K, DeMets DL. Design and analysis of group sequential tests based on the type I error spending rate function. Biometrika 1987; 74: 149–154.

12 Medical Research Council Renal Cancer Collaborators. Interferon-α and survival in metastatic renal carcinoma: early results of a randomised controlled trial. Lancet 1999; 353: 14–17.

13 Facey KM, Lewis JA. The management of interim analyses in drug development. Statistics Med 1998; 17: 1801–1809.

14 Jennison C, Turnbull BW. Group sequential methods with applications to clinical trials. Boca Raton, USA: Chapman & Hall/CRC, 2000.

15 Peace KE. Biopharmaceutical sequential statistical applications. New York: Marcel Dekker, 1992.

16 Montaner JSG, Lawson LM, Levitt N, et al. Corticosteroids prevent early deterioration in patients with moderately severe Pneumocystis carinii pneumonia and the acquired immunodeficiency syndrome (AIDS). Ann Intern Med 1990; 113: 14–20.

17 Whitehead J. Sequential designs for pharmaceutical clinical trials. Pharmaceut Med 1992; 6: 179–191.

18 Moss AJ, Hall WJ, Cannom DS, et al. Improved survival with implanted defibrillator in patients with coronary disease at high risk of ventricular arrhythmia. N Engl J Med 1996; 335: 1933–1940.

19 MPS Research Unit. PEST 4: operating manual. The University of Reading, UK, 2000.

20 Cytel Software Corporation. EaSt. A software package for the design and interim monitoring of group-sequential clinical trials. Cytel Software Corporation, Cambridge, Mass, 2000.

21 Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 1987; 43: 193–199.

22 Pampallona S, Tsiatis AA, Kim K. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favour of the null hypothesis. J Statistical Planning Inference 1994; 42: 19–35.

23 MathSoft Inc. S-Plus. MathSoft Inc, Seattle, Washington, 2000.

24 Emerson SS. Statistical packages for group sequential methods. Amer Statist 1996; 50: 183–192.

25 Whitehead J. On being the statistician on a data and safety monitoring board. Statistics Med 1999; 18: 3425–3434.

26 Derry FA, Dinsmore WW, Fraser M, et al. Efficacy and safety of oral sildenafil (Viagra) in men with erectile dysfunction caused by spinal cord injury. Neurology 1998; 51: 1629–1633.


