TRANSCRIPT
Interpreting and reporting clinical trials with results of borderline significance
Amy Kirkwood and Professor Allan Hackshaw
CRUK and UCL Cancer Trials Centre
What is the problem?
• After submitting several phase III trials to high-profile journals, we noticed a disparity in the language they allowed us to use when drawing conclusions about borderline results.
• Some initially stated that a p-value just above 0.05 indicated that there was ‘no effect’, despite a clinically relevant effect size.
• Should these results really just be ignored?
Why do we stick to a 0.05 limit?
• The cut-off of 0.05 was first suggested by RA Fisher in 1925 as being low enough to make decisions.
• It has become widely adopted.
• It is an arbitrary cut-off, but many researchers seem to adhere to it strictly
Examples
Relative risk of 0.75 (95% CI: 0.57-0.99)
p-value = 0.048
Clear evidence of an effect?
Relative risk of 0.75 (95% CI: 0.55-1.03)
p-value = 0.07
No effect?
Size of treatment effects
• In the past, many ‘new’ interventions were compared with minimal or no treatment, so we were looking for (and finding) large treatment effects.
• Today, new interventions are compared with standard treatments that already work well, so smaller differences are expected.
• It is therefore not as easy to get very small p-values.
How are p-values determined?
The size of a p-value depends on:
• Size of the treatment effect, eg hazard ratio, relative risk, absolute risk difference, or mean difference
• Size of the standard error, which is influenced by:
– Number of subjects
– Number of events*
– Standard deviation*
Interpretation
• Very small p-values (easy to interpret) arise when the effect size is large and the standard error is small.
• Borderline p-values arise when either:
• We have a clinically meaningful treatment effect but a moderate or large standard error (usually when there are insufficient participants or events).
Or
• The treatment effect is smaller than expected (should have had a larger trial).
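The relationship between effect size, standard error and p-value can be sketched in a few lines of Python. This is only an illustration, assuming a normal approximation on the log scale for ratio measures; the function name `p_value_from_ratio_ci` is ours, not from the talk. It recovers the standard error from a reported 95% CI and applies it to the two relative-risk examples shown earlier. Because the published limits are rounded, the recovered p-values are approximate and may differ slightly from the quoted ones.

```python
import math
from statistics import NormalDist


def p_value_from_ratio_ci(estimate, lower, upper, level=0.95):
    """Approximate SE, z statistic and two-sided p-value for a ratio measure.

    Assumes the CI is symmetric on the log scale (normal approximation).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    # The CI spans 2 * z_crit standard errors on the log scale
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, z, p


# Same effect size (RR 0.75), different standard errors
for lower, upper in [(0.57, 0.99), (0.55, 1.03)]:
    se, z, p = p_value_from_ratio_ci(0.75, lower, upper)
    print(f"RR 0.75 (95% CI {lower}-{upper}): "
          f"SE(log RR)={se:.3f}, z={z:.2f}, p={p:.3f}")
```

The effect size is identical in both cases; only the standard error changes, and that alone moves the p-value from one side of 0.05 to the other.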
An example: EICESS-92 phase III trial
• A trial comparing standard chemotherapy with or without etoposide for treating Ewing’s Sarcoma at high risk of recurrence/death (mainly a childhood cancer).
• Primary endpoint: Event-free-survival
• Powered to detect a hazard ratio of 0.60
• Sample size: 400 patients (but 492 actually recruited).
The EICESS-92 Phase III Trial
• Observed HR: 0.83 (95% CI: 0.65-1.05) p=0.12
• P>0.05
• Should we conclude no effect?
• But a 17% risk reduction is clinically significant, although smaller than the 40% initially expected.
• How do we interpret this?
EICESS-92: The Confidence Interval
• Most people understand that the true effect is likely to lie somewhere within the CI range - hence the possibility of it being 1.0 (no effect).
• But there is a common misconception that it lies anywhere within this range with equal probability.
EICESS-92: The Confidence Interval
• The true HR is more likely to lie around the estimated HR (0.83) than at the extremes of the confidence interval.
• There is a 50% chance that the range 0.77 to 0.90 contains the true hazard ratio.
EICESS-92: The Confidence Interval
• Similarly, there is a 75% chance that the range 0.72 to 0.95 contains the true HR.
EICESS-92: The Confidence Interval
• The upper limit of the confidence interval is 1.05 and only just exceeds 1.0.
• There is only a 6% chance that the true HR is 1.0 or greater.
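The narrower intervals and the chance of the HR being 1.0 or more can be approximated from the published result alone. The sketch below is one way to do it, assuming a normal curve on the log scale centred on the observed HR, with the standard error recovered from the 95% CI; the helper names are illustrative and this is not necessarily how the original figures were produced. Since the inputs are rounded, the output should agree with the slide values only to within about 0.01.

```python
import math
from statistics import NormalDist


def log_se_from_ci(lower, upper, level=0.95):
    """Standard error of log(HR), recovered from a reported CI."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z_crit)


def interval(estimate, se, level):
    """Interval of the given level around the estimate, on the ratio scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = math.log(estimate)
    return math.exp(centre - z * se), math.exp(centre + z * se)


def prob_ratio_at_least(estimate, se, threshold=1.0):
    """Area of the normal curve (log scale) at or above the threshold."""
    return 1 - NormalDist(math.log(estimate), se).cdf(math.log(threshold))


# EICESS-92: HR 0.83 (95% CI 0.65-1.05)
se = log_se_from_ci(0.65, 1.05)
for level in (0.50, 0.75, 0.95):
    low, high = interval(0.83, se, level)
    print(f"{level:.0%} interval: {low:.2f} to {high:.2f}")
print(f"Chance that HR >= 1.0: {prob_ratio_at_least(0.83, se):.1%}")
```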
• The conclusion reported in the paper was that “the addition of etoposide seemed to be beneficial”.
• This is the only randomised trial of etoposide in these children.
• The disorder is uncommon: 6.5 years to recruit 492 patients across Europe. Another trial is unlikely.
• Although the target sample size was exceeded, the treatment effect was smaller than expected (HR 0.83 vs 0.60), which is probably why the result was not statistically significant (i.e. trial was not big enough).
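To see why a trial sized for HR 0.60 struggles with HR 0.83, one can estimate the number of events each would need. The sketch below uses Schoenfeld's approximation for a two-arm trial with 1:1 allocation. The 80% power and two-sided 5% significance level are our assumptions; the talk does not state the design values used in EICESS-92, and the result is a number of events, not of patients.

```python
import math
from statistics import NormalDist


def events_needed(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate number of events required (Schoenfeld, 1:1 allocation).

    alpha is two-sided; alpha and power are ASSUMED design values here.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)


for hr in (0.60, 0.83):
    print(f"HR {hr}: about {events_needed(hr)} events")
```

Under these assumptions the required number of events grows with 1 / (log HR)^2, so detecting the smaller observed effect would need several times as many events as the effect the trial was designed for.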
Are these sorts of results common, and how are they reported?
The Literature Search
• We conducted a literature search to see how often trials with borderline results arose and how they were reported.
• We looked through every issue of 6 major journals in 2009. The journals chosen were:
• The BMJ
• The Lancet
• JAMA
• New England Journal of Medicine
• Journal of the National Cancer Institute
• Journal of Clinical Oncology
The Literature Search
To be selected a paper had to:
• Report the results of a phase III randomised trial.
• Have borderline results for the primary outcome measure.
What counted as borderline?
To count as a borderline result we needed to see:
• A non-zero effect size
AND
• A p-value between 0.05 and 0.1
OR
• one end of the 95% confidence interval close to the no effect value (eg for ratios, the upper tail of the CI had to be <1.1 or the lower tail >0.90)
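These selection rules can be written as a small check for ratio measures. This is a sketch of our reading of the criteria, not the authors' actual screening procedure: the function name is hypothetical, the p-value bounds are treated as inclusive, and the CI rule is applied only when the interval includes 1.0 (both are assumptions).

```python
def is_borderline_ratio(estimate, lower, upper, p_value=None):
    """Return True if a ratio result (HR, RR, OR) looks borderline.

    ASSUMED reading of the criteria: a non-null effect size AND either
    0.05 <= p <= 0.10 OR a 95% CI that includes 1.0 with one end close
    to it (upper limit < 1.1 or lower limit > 0.90).
    """
    if estimate == 1.0:
        return False  # no effect at all
    p_rule = p_value is not None and 0.05 <= p_value <= 0.10
    crosses_null = lower <= 1.0 <= upper
    ci_rule = crosses_null and (upper < 1.1 or lower > 0.90)
    return p_rule or ci_rule


# EICESS-92: p is above 0.1, but the upper CI limit is close to 1.0
print(is_borderline_ratio(0.83, 0.65, 1.05, p_value=0.12))
# The conventionally significant example shown earlier
print(is_borderline_ratio(0.75, 0.57, 0.99, p_value=0.048))
```

The CI rule explains how a result with p above 0.1 can still count as borderline.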
Literature search results
The table below shows the number of phase III trials found in each journal and the number with borderline p-values.

Journal      Number of phase III trials   Number with borderline p-values
BMJ          44                           2
The Lancet   64                           3
JAMA         40                           2
NEJM         70                           8
JNCI         6                            3
JCO          64                           6
Total        288                          24 (1 in 12)
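The totals and the "1 in 12" figure can be checked directly from the per-journal counts. The snippet below simply re-adds the numbers reported in the table; the variable names are ours.

```python
# (phase III trials, trials with borderline p-values) per journal, 2009
counts = {
    "BMJ": (44, 2),
    "The Lancet": (64, 3),
    "JAMA": (40, 2),
    "NEJM": (70, 8),
    "JNCI": (6, 3),
    "JCO": (64, 6),
}

total_trials = sum(trials for trials, _ in counts.values())
total_borderline = sum(borderline for _, borderline in counts.values())

for journal, (trials, borderline) in counts.items():
    print(f"{journal}: {borderline}/{trials} = {borderline / trials:.0%}")
print(f"Overall: {total_borderline}/{total_trials}, "
      f"about 1 in {total_trials / total_borderline:.0f}")
```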
Literature Search Results
• We examined the conclusion given in the abstract because this is what most people focus on.
• Was the language used appropriate?
• Some authors discussed their results further in the Discussion section.
Conclusion             Number of studies   Range of p-values
No effect              10                  0.06 - 0.17
Some evidence          11                  0.06 - 0.13
Confidence in effect   3                   0.056 - 0.1
Literature Search Results
Example 1
• Interventions and patient group: Nurse-led psycho-educational intervention versus usual care for palliative care in patients with advanced cancer
• Sample size: N=322
• Primary endpoint: Symptom intensity, assessed by an assessment scale (quality of life and resource use were other endpoints)
• Main result: Mean difference in scores -27.8 (95% CI -57.2 to +1.6), P=0.06
• Conclusion reported in the Abstract: “Those receiving nurse-led… intervention… had higher scores for quality of life and mood, but did not have improvements in symptom intensity scores”
Bakitas et al, JAMA 2009;302:741-9.
Example 2
• Interventions and patient group: Tailored care plan versus usual care in patients with coronary heart disease
• Sample size: N=903
• Primary endpoint: Patients with systolic blood pressure >140 mm Hg at 18 months (hospital admission was another endpoint)
• Main result: Odds ratio 0.66 (95% CI 0.43 to 1.01), P=0.06
• Conclusion reported in the Abstract: “Admissions to hospital were significantly reduced…but no other clinical benefits were shown”
Murphy et al, BMJ 2009;339:b4220.
Example 3
• Interventions and patient group: Pre-surgical chemoradiotherapy versus chemotherapy in patients with locally advanced cancer of the esophagogastric junction
• Sample size: N=126 (target was 576)
• Primary endpoint: Overall survival
• Main result: Hazard ratio 0.67 (95% CI 0.41 to 1.07), P=0.07
• Conclusion reported in the Abstract: “Although… statistical significance was not achieved, results point to a survival advantage for preoperative chemoradiotherapy”
Stahl et al, J Clin Oncol 2009;27:851-6.
Example 4
• Interventions and patient group: Aerobic exercise training plus usual care versus usual care alone, in patients with chronic heart failure
• Sample size: N=2331
• Primary endpoint: All-cause mortality or hospitalisation
• Main result: Hazard ratio 0.93 (95% CI 0.84 to 1.02), P=0.13
• Conclusion reported in the Abstract: “…exercise training resulted in non-significant reductions in the primary endpoint….”
O’Connor et al, JAMA 2009;301:1439-50.
Example 5
• Interventions and patient group: Artesunate suppository versus placebo in patients with severe malaria who cannot be treated orally
• Sample size: N=12,068
• Primary endpoint: Mortality
• Main result: Risk difference -0.4% (95% CI -1.0 to +0.2%), P=0.1
• Conclusion reported in the Abstract: “…..a single inexpensive artesunate suppository… substantially reduces the risk of death or permanent disability”
Gomes et al, Lancet 2009;373:557-66.
Example 6
• Interventions and patient group: Telephone counselling using cognitive behavioural skills vs. no intervention to encourage smoking cessation in adolescents
• Sample size: N=2151
• Primary endpoint: 6-months prolonged abstinence from smoking
• Main result: Absolute risk difference 4.0% (95% CI -0.2 to 8.1%), P=0.06
• Conclusion reported in the Abstract: “…personalized motivational interviewing...is effective in increasing teen smoking cessation”
Peterson et al, J Natl Cancer Inst 2009;101:1378-92.
Papers with borderline negative results
• What if a new intervention appears to show harm but has a borderline p-value?
• Perhaps authors would be inclined to be firmer with conclusions than if a new intervention shows possible benefit?
• We found two such papers; in both, the authors concluded only that the intervention did not show benefit.
Papers with borderline negative results
• Trial of calcium dobesilate vs placebo for the prevention of clinically significant macular oedema (CSME) in 635 patients with type 2 diabetes.
• 86 patients in the calcium dobesilate group and 69 in the placebo group developed CSME.
• Hazard ratio 1.32 (95% CI 0.96-1.81), p=0.08
“Calcium dobesilate did not reduce the risk of development of CSME.”
Borderline results elsewhere
We were sent this table after our paper was published, showing the results and conclusions from papers on statins and mortality.

Meta-analysis                          Risk estimate (95% CI)   Authors’ conclusions
Arch Intern Med 2005;165:725-730       0.87 (0.81-0.94)         Decreases mortality
Arch Intern Med 2006;166:2307-2313     0.92 (0.84-1.01)         No effect
J Am Coll Cardiol 2008;52:1769-81      0.93 (0.87-0.99)         Decreases mortality
BMJ 2009;338:b2376                     0.88 (0.81-0.96)         Decreases mortality
Arch Intern Med 2010;170:1024-1031     0.91 (0.83-1.01)         No effect
Possible solutions
• Design trials with larger numbers, although this is not always feasible (eg high costs or a rare disorder).
• However, even a relatively large trial can produce an effect size smaller than expected (the Ewing’s sarcoma example).
• Meta-analyses. Example (doublet chemotherapy for pancreatic cancer):
– One trial: HR 0.86, 95% CI 0.72-1.02, p=0.08
– Meta-analysis of 3 trials: HR 0.86, 95% CI 0.75-0.98, p=0.02
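How pooling narrows the confidence interval can be illustrated with a fixed-effect (inverse-variance) meta-analysis on the log hazard ratio scale. The sketch below is not the analysis behind the pancreatic cancer figures: only the first trial comes from the slide, the other two trials are HYPOTHETICAL values invented for illustration, and the fixed-effect model is our choice of method.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)


def pool_fixed_effect(trials):
    """Inverse-variance pooling of hazard ratios.

    trials: list of (hr, lower, upper) tuples with 95% CIs.
    Returns (pooled_hr, lower, upper, two_sided_p).
    """
    weights, weighted_logs = [], []
    for hr, lower, upper in trials:
        se = (math.log(upper) - math.log(lower)) / (2 * Z95)
        weight = 1 / se ** 2  # more precise trials get more weight
        weights.append(weight)
        weighted_logs.append(weight * math.log(hr))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    p = 2 * (1 - NormalDist().cdf(abs(pooled_log / pooled_se)))
    return (math.exp(pooled_log),
            math.exp(pooled_log - Z95 * pooled_se),
            math.exp(pooled_log + Z95 * pooled_se),
            p)


trials = [
    (0.86, 0.72, 1.02),  # the single trial quoted on the slide
    (0.88, 0.62, 1.25),  # HYPOTHETICAL trial, for illustration only
    (0.84, 0.60, 1.18),  # HYPOTHETICAL trial, for illustration only
]
hr, low, high, p = pool_fixed_effect(trials)
print(f"Pooled HR {hr:.2f} (95% CI {low:.2f}-{high:.2f}), p={p:.3f}")
```

The pooled estimate stays close to the individual estimates, but its standard error is smaller than that of any single trial, which is what can move a borderline result to a clearer one.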
Conclusions
• Borderline results cannot be used as strong evidence either for or against an intervention
• But do not completely dismiss an effect if p>0.05 when the treatment effect is clinically meaningful
• Do not conclude ‘no effect’; look at other endpoints, and other evidence
• A lack of statistical significance does not mean lack of an effect (Altman & Bland BMJ 1995)
Conclusions
• Say that there is probably evidence of an effect but use appropriate language, eg words such as “suggestion”, “indication” and “seems”
• The same principles apply to other areas of research (eg risk factors)