borderline significance

Interpreting and reporting clinical trials with results of borderline significance

Amy Kirkwood and Professor Allan Hackshaw

CRUK and UCL Cancer Trials Centre

What is the problem?

• After submitting several phase III trials to high-profile journals, we noticed a disparity in the language they allowed us to use when forming conclusions about borderline results.

• Some journals initially stated that a p-value just above 0.05 indicated that there was ‘no effect’, despite a clinically relevant effect size.

• Should these results really just be ignored?

Why do we stick to a 0.05 limit?

• The cut-off of 0.05 was first suggested by RA Fisher in 1925 as being low enough to make decisions.

• It has become widely adopted.

• It is an arbitrary cut-off, but many researchers seem to adhere to it strictly

Examples

Relative risk of 0.75 (95% CI: 0.57-0.99)

p-value = 0.048

Clear evidence of an effect?

Relative risk of 0.75 (95% CI: 0.55-1.03)

p-value = 0.07

No effect?

Size of treatment effects

• In the past, many new interventions were compared to minimal or no treatment, so we were looking for (and finding) large treatment effects.

• Today, new interventions are compared to standard treatments that already work well, so smaller differences are expected.

• It is therefore not as easy to get very small p-values.

How are p-values determined?

The size of a p-value depends on:

• Size of the treatment effect (eg hazard ratio, relative risk, absolute risk difference, or mean difference)

• Size of the standard error, which is influenced by:

   • Number of subjects

   • Number of events*

   • Standard deviation*

Interpretation

• Very small p-values (easy to interpret) arise when the effect size is large and the standard error is small.

• Borderline p-values arise when either:

• We have a clinically meaningful treatment effect but a moderate or large standard error (usually when there are insufficient participants or events).

Or

• The treatment effect is smaller than expected (should have had a larger trial).
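The dependence of the p-value on both the effect size and the standard error can be sketched in a few lines. This is a minimal illustration using a normal approximation on the log scale, not the method of any particular trial; the function name `two_sided_p` and the two standard errors (0.14 and 0.16) are assumptions chosen for illustration.

```python
from math import log
from statistics import NormalDist


def two_sided_p(log_effect: float, se: float) -> float:
    """Two-sided p-value for a log ratio, using a normal approximation."""
    z = log_effect / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same relative risk of 0.75, two hypothetical standard errors (log scale).
log_rr = log(0.75)
p_precise = two_sided_p(log_rr, 0.14)    # smaller standard error
p_imprecise = two_sided_p(log_rr, 0.16)  # slightly larger standard error

print(f"SE 0.14: p = {p_precise:.3f}")
print(f"SE 0.16: p = {p_imprecise:.3f}")
```

With the effect size held fixed, a modest increase in the standard error is enough to move the p-value from one side of 0.05 to the other.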

An example: EICESS-92 phase III trial

• A trial comparing standard chemotherapy with or without etoposide for treating Ewing’s sarcoma at high risk of recurrence/death (mainly a childhood cancer).

• Primary endpoint: Event-free survival

• Powered to detect a hazard ratio of 0.60

• Sample size: 400 patients (but 492 actually recruited).

The EICESS-92 Phase III Trial

• Observed HR: 0.83 (95% CI: 0.65-1.05) p=0.12

• P>0.05

• Should we conclude no effect?

• But a 17% risk reduction is clinically significant, although smaller than the 40% initially expected.

• How do we interpret this?

EICESS-92: The Confidence Interval

• Most people understand that the true effect is likely to lie somewhere within the CI range - hence the possibility of it being 1.0 (no effect).

• But there is a common misconception that it lies anywhere within this range with equal probability.


• The true HR is more likely to lie around the estimated HR (0.83) than at the extremes of the confidence interval.

There is a 50% chance that the range 0.77 to 0.90 contains the true hazard ratio.

Similarly, there is a 75% chance that the range 0.72 to 0.95 contains the true HR.


• The upper limit of the confidence interval is 1.05 and only just exceeds 1.0.

There is only a 6% chance that the true HR is ≥1.0.
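The narrower ranges and the 6% figure can be approximately reproduced from the reported HR and 95% CI alone. The sketch below assumes a normal approximation on the log hazard ratio scale and reads it as a statement about where the true HR lies, as the slides do; the function names are illustrative and not taken from the trial report.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()


def log_se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error of log(HR) implied by a reported confidence interval."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return (log(upper) - log(lower)) / (2 * z)


def interval(hr: float, se: float, level: float) -> tuple[float, float]:
    """Interval for the HR at the requested confidence level."""
    z = STD_NORMAL.inv_cdf(0.5 + level / 2)
    return exp(log(hr) - z * se), exp(log(hr) + z * se)


def prob_hr_at_least_one(hr: float, se: float) -> float:
    """Approximate chance that the true HR is 1.0 or above."""
    return 1 - STD_NORMAL.cdf((0.0 - log(hr)) / se)


# EICESS-92 figures as reported on the slide: HR 0.83 (95% CI 0.65-1.05)
se = log_se_from_ci(0.65, 1.05)
ci50 = interval(0.83, se, 0.50)
ci75 = interval(0.83, se, 0.75)
p_harm = prob_hr_at_least_one(0.83, se)

print(f"50% range: {ci50[0]:.2f} to {ci50[1]:.2f}")
print(f"75% range: {ci75[0]:.2f} to {ci75[1]:.2f}")
print(f"Chance that HR >= 1.0: {p_harm:.1%}")
```

The results come out close to the ranges quoted above; small differences are expected because the published HR and CI are rounded to two decimal places.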

• The conclusion reported in the paper was that “the addition of etoposide seemed to be beneficial”.

• This is the only randomised trial of etoposide in these children.

• The disorder is uncommon: it took 6.5 years to recruit 492 patients across Europe. Another trial is unlikely.

• Although the target sample size was exceeded, the treatment effect was smaller than expected (HR 0.83 vs 0.60), which is probably why the result was not statistically significant (i.e. trial was not big enough).
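To see why a trial sized for HR 0.60 is too small for HR 0.83, the number of events required can be approximated with Schoenfeld's formula for a two-arm trial with 1:1 allocation. This is a rough sketch only: the 5% two-sided significance level and 80% power are assumptions, since the slides do not give the trial's design parameters beyond the target HR, and the result counts events rather than patients.

```python
from math import ceil, log
from statistics import NormalDist


def events_needed(hr: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate events required to detect `hr` (Schoenfeld, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hr) ** 2)


events_planned = events_needed(0.60)          # effect the trial was powered for
events_observed_effect = events_needed(0.83)  # effect actually observed

print(f"Events needed for HR 0.60: {events_planned}")
print(f"Events needed for HR 0.83: {events_observed_effect}")
```

Under these assumptions, detecting HR 0.83 needs about 900 events against about 120 for HR 0.60, which is consistent with the point that the trial was not big enough for the smaller effect.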

Are these sorts of results common, and how are they reported?

The Literature Search

• We conducted a literature search to see how often trials with borderline results arose and how they were reported.

• We looked through every issue of 6 major journals in 2009. The journals chosen were:

• The BMJ

• The Lancet

• JAMA

• New England Journal of Medicine

• Journal of the National Cancer Institute

• Journal of Clinical Oncology

The Literature Search

To be selected a paper had to:

• Report the results of a phase III randomised trial.

• Have borderline results for the primary outcome measure.

What counted as borderline?

To count as a borderline result we needed to see:

• A non-zero effect size

AND

• A p-value between 0.05 and 0.1

OR

• one end of the 95% confidence interval close to the no effect value (eg for ratios, the upper tail of the CI had to be <1.1 or the lower tail >0.90)
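For ratio measures, the selection rule above can be written as a small check. This is one possible reading of the criteria, not the authors' actual screening procedure: the function name, the treatment of the 0.05 and 0.1 boundaries, and the requirement that the confidence interval includes 1.0 are all assumptions.

```python
def is_borderline(ratio: float, ci_lower: float, ci_upper: float,
                  p_value: float) -> bool:
    """Apply the borderline criteria to a ratio measure (HR, RR or OR)."""
    if ratio == 1.0:
        return False  # no observed effect at all

    # Criterion 1: p-value between 0.05 and 0.1 (boundary handling assumed)
    p_borderline = 0.05 < p_value <= 0.10

    # Criterion 2: CI includes 1.0 (assumed) and one end is close to it
    ci_includes_null = ci_lower <= 1.0 <= ci_upper
    ci_borderline = ci_includes_null and (ci_upper < 1.1 or ci_lower > 0.90)

    return p_borderline or ci_borderline


# EICESS-92: HR 0.83 (95% CI 0.65-1.05), p=0.12
print(is_borderline(0.83, 0.65, 1.05, 0.12))
# Earlier example: RR 0.75 (95% CI 0.57-0.99), p=0.048
print(is_borderline(0.75, 0.57, 0.99, 0.048))
```

Under this reading, EICESS-92 qualifies through the confidence interval criterion even though its p-value is above 0.1, while the conventionally significant example does not qualify.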

Literature search results

The table below shows the number of phase III trials found in each journal and the number with borderline p-values.

Journal      Number of phase III trials   Number with borderline p-values
BMJ          44                           2
The Lancet   64                           3
JAMA         40                           2
NEJM         70                           8
JNCI         6                            3
JCO          64                           6
Total        288                          24 (1 in 12)

Literature Search Results

• We examined the conclusion given in the abstract because this is what most people focus on.

• Was the language used appropriate?

• Some authors discussed their results further in the Discussion section.

Conclusion             Number of studies   Range of p-values
No effect              10                  0.06 - 0.17
Some evidence          11                  0.06 - 0.13
Confidence in effect   3                   0.056 - 0.1

Literature Search Results

Example 1

• Interventions and patient group: Nurse-led psycho-educational intervention versus usual care for palliative care in patients with advanced cancer; N=322

• Primary endpoint: Symptom intensity, assessed by an assessment scale (quality of life and resource use were other endpoints)

• Main result: Mean difference -27.8 scores (95% CI -57.2 to +1.6), P=0.06

• Conclusion reported in the abstract: “Those receiving nurse-led… intervention… had higher scores for quality of life and mood, but did not have improvements in symptom intensity scores”

Bakitas et al, JAMA 2009;302:741-9.




Example 2

• Interventions and patient group: Tailored care plan versus usual care in patients with coronary heart disease; N=903

• Primary endpoint: Patients with systolic blood pressure >140 mm Hg at 18 months (hospital admission was another endpoint)

• Main result: Odds ratio 0.66 (95% CI 0.43 to 1.01), P=0.06

• Conclusion reported in the abstract: “Admissions to hospital were significantly reduced…but no other clinical benefits were shown”

Murphy et al, BMJ 2009;339:b4220.




Example 3

• Interventions and patient group: Pre-surgical chemoradiotherapy versus chemotherapy in patients with locally advanced cancer of the esophagogastric junction; N=126 (target was 576)

• Primary endpoint: Overall survival

• Main result: Hazard ratio 0.67 (95% CI 0.41 to 1.07), P=0.07

• Conclusion reported in the abstract: “Although… statistical significance was not achieved, results point to a survival advantage for preoperative chemoradiotherapy”

Stahl et al, J Clin Oncol 2009;27:851-6.




Example 4

• Interventions and patient group: Aerobic exercise training plus usual care versus usual care alone, in patients with chronic heart failure; N=2331

• Primary endpoint: All-cause mortality or hospitalisation

• Main result: Hazard ratio 0.93 (95% CI 0.84 to 1.02), P=0.13

• Conclusion reported in the abstract: “…exercise training resulted in non-significant reductions in the primary endpoint….”

O’Connor et al, JAMA 2009;301:1439-50.




Example 5

• Interventions and patient group: Artesunate suppository versus placebo in patients with severe malaria who cannot be treated orally; N=12,068

• Primary endpoint: Mortality

• Main result: Risk difference -0.4% (95% CI -1.0 to +0.2%), P=0.1

• Conclusion reported in the abstract: “…..a single inexpensive artesunate suppository… substantially reduces the risk of death or permanent disability”

Gomes et al, Lancet 2009;373:557-66.




Example 6

• Interventions and patient group: Telephone counselling using cognitive behavioural skills vs. no intervention to encourage smoking cessation in adolescents; N=2151

• Primary endpoint: 6-months prolonged abstinence from smoking

• Main result: Absolute risk difference 4.0% (95% CI -0.2 to 8.1%), P=0.06

• Conclusion reported in the abstract: “…personalized motivational interviewing...is effective in increasing teen smoking cessation”

Peterson et al, J Natl Cancer Inst 2009;101:1378-92.

Papers with borderline negative results

• What if a new intervention appears to show harm but has a borderline p-value?

• Perhaps authors would be inclined to be firmer with conclusions than if a new intervention shows possible benefit?

• We found two such papers where the authors only concluded that it did not show benefit.

Papers with borderline negative results

• Trial of calcium dobesilate vs placebo for the prevention of clinically significant macular oedema (CSME) in 635 patients with type 2 diabetes.

• 86 patients in the calcium dobesilate group and 69 in the placebo group developed CSME.

• Hazard ratio 1.32 (95% CI 0.96-1.81), p=0.08

“Calcium dobesilate did not reduce the risk of development of CSME.”

Borderline results elsewhere

We were sent this table after our paper was published, showing the results and conclusions from papers on statins and mortality.

Meta-analysis                          Risk estimate (95% CI)   Authors’ conclusion
Arch Intern Med 2005;165:725-730       0.87 (0.81 - 0.94)       Decreases mortality
Arch Intern Med 2006;166:2307-2313     0.92 (0.84 - 1.01)       No effect
J Am Coll Cardiol 2008;52:1769-81      0.93 (0.87 - 0.99)       Decreases mortality
BMJ 2009;338:b2376                     0.88 (0.81 - 0.96)       Decreases mortality
Arch Intern Med 2010;170:1024-1031     0.91 (0.83 - 1.01)       No effect

Possible solutions

• Design trials with larger numbers. But this is not always feasible (eg high costs or rare disorder).

• However, even a relatively large trial can produce an effect size smaller than expected (Ewing’s sarcoma example).

• Meta-analyses. Example (doublet chemotherapy for pancreatic cancer):

   • One trial: HR 0.86, 95% CI 0.72-1.02, p=0.08

   • Meta-analysis of 3 trials: HR 0.86, 95% CI 0.75-0.98, p=0.02
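How pooling narrows the confidence interval can be sketched with a fixed-effect, inverse-variance meta-analysis on the log hazard ratio scale. The inputs below are hypothetical: the slides give only one of the three pancreatic cancer trials, so the sketch simply pools three identical copies of that trial to show the mechanism. It does not reproduce the published meta-analysis, and the function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def pool_fixed_effect(trials):
    """Pool (hr, ci_lower, ci_upper) tuples with 95% CIs by inverse variance.

    Returns (pooled_hr, ci_lower, ci_upper, two_sided_p).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for hr, lower, upper in trials:
        se = (log(upper) - log(lower)) / (2 * Z95)  # SE implied by the CI
        weight = 1 / se ** 2
        total_weight += weight
        weighted_sum += weight * log(hr)
    pooled_log = weighted_sum / total_weight
    pooled_se = sqrt(1 / total_weight)
    p = 2 * (1 - NormalDist().cdf(abs(pooled_log) / pooled_se))
    return (exp(pooled_log),
            exp(pooled_log - Z95 * pooled_se),
            exp(pooled_log + Z95 * pooled_se),
            p)


# HYPOTHETICAL data: the single reported trial, repeated three times.
one_trial = [(0.86, 0.72, 1.02)]
three_trials = one_trial * 3

single = pool_fixed_effect(one_trial)
pooled = pool_fixed_effect(three_trials)

print("One trial:    HR %.2f (%.2f-%.2f), p=%.3f" % single)
print("Three trials: HR %.2f (%.2f-%.2f), p=%.3f" % pooled)
```

The point estimate is unchanged, but the pooled standard error shrinks, so the upper confidence limit drops below 1.0 and the p-value falls below 0.05.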

Conclusions

• Borderline results cannot be used as strong evidence either in favour of or against an intervention

• But do not completely dismiss an effect if p>0.05 when the treatment effect is clinically meaningful

• Do not conclude ‘no effect’; look at other endpoints, and other evidence

• A lack of statistical significance does not mean lack of an effect (Altman & Bland BMJ 1995)

Conclusions

• Say that there is probably evidence of an effect but use appropriate language, eg words such as “suggestion”, “indication” and “seems”

• The same principles apply to other areas of research (eg risk factors)
