Subgroup Analysis

“Fun to look at but don’t believe them!” (P Sleight, 2000)

Deciding on analysis after looking at the data is “dangerous, useful, and often done.” (IJ Good, 1983)

“We cannot necessarily, perhaps very rarely, pass from (the overall result of a trial) to stating exactly what effect the treatment will have on a particular patient. But there is surely, no way and no method of deciding that.”

A. Bradford Hill, 1952

“Several factors are threatening to create a path to “depersonalized” medicine despite advances both in fundamental science and clinical therapeutics. The tendency to focus on statistics for the group rather than the individual clinical features of the patient is one factor.”

Horwitz RI, et al, (De)Personalized Medicine, Science, 8 March 2013

Most trials report subgroup analyses (median=4 subgroups)

Assmann SF, Lancet 2000; 355:1064-1069

Influence of Study Characteristics on Reporting of Subgroups

• 44% of 469 randomized trials published in major journals reported subgroup analyses

• Subgroup analyses were more likely to be reported in high impact journals, non-surgical trials, and large trials.

• There was an interaction between source of funding and reporting of subgroups in trials without significant overall results – a subgroup finding!

Sun et al, BMJ, 2011

Conclusion: Sun et al, BMJ 2011

“Industry funded randomised controlled trials, in the absence of statistically significant primary outcomes, are more likely to report subgroup analyses than non-industry funded trials. Industry funded trials less frequently test for interaction than non-industry funded trials. Subgroup analyses from industry funded trials with negative results for the primary outcome should be viewed with caution.”

Aims of Subgroup Analysis

• To show consistency of trial findings for major endpoints for important patient subsets

• To assess whether there are large differences in the treatment effect among different types of patients and, if so, identify hypotheses for future research. (Assess the possibility of treatment × subgroup or covariate interactions.)

The aim should not be to salvage a trial for which the overall results were not as hoped for!

Subgroup Analysis by Astrological Birth Sign

ISIS-2: Streptokinase and Aspirin for Acute MI

Birth sign          Percentage Reduction in 5-Week Vascular Mortality
Gemini or Libra     9% (NS)
Other signs         -28% (p < 0.00001)
Overall             -23% (p < 0.00001)

“Lack of evidence of benefit just in one particular subgroup is not good evidence of lack of benefit.”

Subgrouping Considerations

• Most trials are not designed to look at subgroups; sample size and power based on overall treatment effect (power is lower for subgroups than overall comparison).

• For subgroup analysis, it is often not clear how to control for type 1 error (the more subgroups examined, the greater the risk of a type 1 error).

• Not all subgroups of interest can be pre-specified (we are not that smart).

• The subgroup may not be what it appears to be (it may be a marker or label for some other characteristic).
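The type 1 error point can be made concrete with a small calculation. The sketch below assumes the subgroup tests are independent and that there is no true effect anywhere, which real subgroup analyses only approximate (subgroups overlap and outcomes are correlated). The function name is illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false-positive result among n_tests
    independent tests, each run at level alpha, when no effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How quickly the risk grows with the number of subgroups examined
for k in (1, 4, 10, 20):
    print(f"{k:2d} subgroup tests: P(at least one false positive) = "
          f"{familywise_error_rate(k):.3f}")
```

Under these assumptions, four tests at the 0.05 level already carry close to a 1-in-5 chance of a spurious "significant" subgroup.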

Subgroup Definitions

• Proper subgroup – grouping of patients according to baseline characteristics

• Improper subgroup – grouping of patients according to characteristics following randomization (i.e., factors potentially affected by treatment)

• Interaction – evidence that treatment effects differ by subgroup (quantitative versus qualitative)

Yusuf S, et al., JAMA, 266:93-98, 1991.

A Priori and A Posteriori Subgroups

• A priori: written in the protocol in advance of the study (hypothesis driven)

• A posteriori (post hoc or exploratory):

- specified … later

- before unblinding

- after unblinding

• Both have inflated error rates, but more of a problem with a posteriori defined subgroups.

INSIGHT START Protocol: Early Treatment for HIV

“Subgroup analyses for the primary endpoint and major secondary outcomes will be performed to determine whether the treatment effect (early versus deferred) differs qualitatively across various baseline-defined subgroups. Subgroup analysis will be performed by age, gender, race/ethnicity, geographic region, the presence of risk factors for serious non-AIDS conditions, baseline CD4+ cell count, baseline HIV RNA level, calendar date of enrollment to assess the effect of different treatment patterns that may emerge, and the ART-regimen pre-specified at the time of randomization….An overall test of heterogeneity will provide evidence of whether the magnitude of the treatment difference varies across baseline subgroups.”

Pre- and Post-Stratification and Subgroup Analysis

• Pre-stratification variables are often, but not always, subgroups of interest.

• Aim of post-stratified analysis is to obtain a “better” estimate of overall treatment effect.

• Aim of subgroup analysis is to determine whether treatment differences are consistent.

• Like post-stratification, plans for subgroup analysis should be pre-specified, though sometimes there are surprises.

Subgrouping vs. Stratification

Grouping               Purpose
Pre-stratification     “insurance” for balance in randomization
Post-stratification    increase the accuracy of estimates of treatment effect
Subgroups              check the consistency of the treatment effect

Stratified Design for Comparing Treatments

              Treatment
Stratum       A        B        Total
1             m1A      m1B      m1
2             m2A      m2B      m2
3             m3A      m3B      m3
4             m4A      m4B      m4
Total         na       nb

• Typical situation: m1 ≠ m2 ≠ m3 ≠ m4

• Study is designed/powered based on na and nb

• Goal: miA = miB for all i.

Subgrouping Factors Determined Experimentally

[Figure: two 2 x 2 layouts of A / No A by B / No B. In the 2 x 2 factorial, B / No B is determined by randomization; in the other layout, B / No B is a baseline characteristic.]

NIH Policy on Subgroups

“When an NIH-defined Phase III clinical trial is proposed, evidence must be reviewed to show whether or not clinically important sex/gender and race/ethnicity differences in the intervention effect are to be expected.”

“Inclusion of the results of sex/gender, race/ethnicity and relevant subpopulations analyses is strongly encouraged in all publication submissions.”

http://grants.nih.gov/grants/funding/women_min/guidelines_amended_10_2001.htm

ICH Guidelines on Subgroups

• If the size of the study permits, important demographic or baseline value-defined subgroups should be examined.

• These analyses are not intended to “salvage” an otherwise unsupportive study.

• Subgroup analyses may suggest hypotheses to be examined in other studies

• If there is a prior hypothesis about a subgroup, this should be part of the statistical analysis plan.

Issues to Consider

• Appropriate significance level? Bonferroni method may be too conservative – loss of power in a situation where power is already low.

• Should subgroup analysis be performed if the overall result is negative? Much harder sell.

• Should only a priori subgroups be described? Not always that smart.

• How should subgroup analyses be presented? Interaction tests important.

• Should analyses be based on post-randomization measures? No

A Consumer’s (and Producer’s?) Guide to Subgroup Analysis

• Document heterogeneity between subgroups

• Argue consistency with biologic phenomena

• Argue consistency with other data from the trial

• Argue consistency with other studies

• It is easy to build a story after the fact!

Data from Neonatal Hypocalcemia Trial: All Calcium Levels in mmol/l

                    Breast-fed                Bottle-fed
                    Supplement   Placebo      Supplement   Placebo
Treatment mean      2.445        2.408        2.300        2.195
No. babies          64           102          169          285
SE                  0.0365       0.0311       0.0211       0.0189

Treatment effect    0.037                     0.105
SE                  0.0480                    0.0283
P-value             0.44                      0.0002

Reference: Cockburn et al, BMJ, 281:11-14; 1980. See also Pocock, Clinical Trials: A Practical Approach.

Data from Neonatal Hypocalcemia Trial (cont.)

$$Z = \frac{0.037 - 0.105}{\left(0.0365^2 + 0.0311^2 + 0.0211^2 + 0.0189^2\right)^{1/2}} = \frac{-0.068}{0.0557} = -1.22$$

P-value = 0.22
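The interaction test for the neonatal hypocalcemia data can be reproduced from the four group standard errors. This is a minimal sketch using a two-sided normal approximation; the function name is illustrative and not taken from the source.

```python
import math


def interaction_z_test(effect_1, se_1, effect_2, se_2):
    """Z test for the difference between two independent subgroup
    treatment effects (two-sided, normal approximation)."""
    se_diff = math.sqrt(se_1 ** 2 + se_2 ** 2)
    z = (effect_1 - effect_2) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# SE of each subgroup's treatment effect, from the SEs of the group means
se_breast = math.sqrt(0.0365 ** 2 + 0.0311 ** 2)  # about 0.0480
se_bottle = math.sqrt(0.0211 ** 2 + 0.0189 ** 2)  # about 0.0283

z, p = interaction_z_test(0.037, se_breast, 0.105, se_bottle)
print(f"Z = {z:.2f}, P-value = {p:.2f}")
```

The bottle-fed effect is highly significant on its own (P = 0.0002) and the breast-fed effect is not (P = 0.44), yet the direct comparison of the two effects gives P = 0.22, so the data do not show that the effects differ.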

HDFP Study

Race, Sex, Age    Deaths (SC)   Deaths (RC)   Percent Difference in Mortality
Black men         112           140           -18.5
Black women       70            98            -27.8
White men         109           126           -14.7
White women       58            55            +2.1
30-49             81            82            -5.7
50-59             115           159           -25.3
60-69             153           178           -16.4
Overall           349           419           -16.9

HDFP Subgroups

Black Men (1)       Dead    Alive
  SC                112     952
  RC                140     944
  $\hat{O}_1 = 0.79$, $w_1 = 55.0$

Black Women (2)     Dead    Alive
  SC                70      1274
  RC                98      1256
  $\hat{O}_2 = 0.70$, $w_2 = 38.3$

White Men (3)       Dead    Alive
  SC                109     1783
  RC                126     1735
  $\hat{O}_3 = 0.84$, $w_3 = 54.8$

White Women (4)     Dead    Alive
  SC                58      1026
  RC                55      1101
  $\hat{O}_4 = 1.13$, $w_4 = 26.8$

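$$\sum_{i=1}^{4} w_i = 174.9$$

$$\log \hat{O}_p = \frac{(55.0)\log(0.79) + (38.3)\log(0.70) + (54.8)\log(0.84) + (26.8)\log(1.13)}{174.9} = -0.188$$

$$\hat{O}_p = 0.83$$

$$X^2(3) = \sum_{i=1}^{4} w_i \left(\log \hat{O}_i - \log \hat{O}_p\right)^2 = 0.134 + 1.111 + 0.008 + 2.551 = 3.804; \quad p = 0.28$$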
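The HDFP calculation can be sketched in code from the four 2 x 2 tables. One assumption here is not stated on the slides: each weight is taken to be the inverse variance of the log odds ratio, 1/(1/a + 1/b + 1/c + 1/d). That choice does reproduce the slide's weights of 55.0, 38.3, 54.8, and 26.8. Because the code uses unrounded odds ratios, the heterogeneity statistic comes out slightly different from the slide's 3.804, which was computed from rounded values. Function names are illustrative.

```python
import math

# (dead, alive) for SC and RC in each race-sex subgroup, from the slides
tables = {
    "Black men":   ((112, 952), (140, 944)),
    "Black women": ((70, 1274), (98, 1256)),
    "White men":   ((109, 1783), (126, 1735)),
    "White women": ((58, 1026), (55, 1101)),
}


def log_odds_ratio_and_weight(sc, rc):
    """Log odds ratio (SC vs. RC) and its inverse-variance weight."""
    a, b = sc
    c, d = rc
    log_or = math.log((a * d) / (b * c))
    weight = 1.0 / (1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, weight


def pooled_or_and_heterogeneity(tables):
    """Weighted pooled log odds ratio and the heterogeneity chi-square."""
    stats = [log_odds_ratio_and_weight(sc, rc) for sc, rc in tables.values()]
    total_weight = sum(w for _, w in stats)
    pooled_log_or = sum(w * lo for lo, w in stats) / total_weight
    x2 = sum(w * (lo - pooled_log_or) ** 2 for lo, w in stats)
    return pooled_log_or, x2, total_weight


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


pooled_log_or, x2, total_weight = pooled_or_and_heterogeneity(tables)
p_value = chi2_sf_3df(x2)
print(f"sum of weights = {total_weight:.1f}")
print(f"pooled OR = {math.exp(pooled_log_or):.2f}")
print(f"X2(3) = {x2:.2f}, p = {p_value:.2f}")
```

Either way the conclusion matches the slide: no convincing evidence that the odds ratio differs across the four race-sex subgroups, despite the apparently different result in white women.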

Aspirin and Risk of Stroke or Death

Men                Event    No Event
  Aspirin          29       171
  No Aspirin       56       150

Women              Event    No Event
  Aspirin          17       73
  No Aspirin       12       77

Among men, aspirin reduced the risk of stroke or death by 48% (p=0.004); among women, aspirin increased the risk of stroke or death by 42% (p=0.35).

Overall, aspirin reduced the risk of stroke or death by 31% (p=0.05).

Conclusion: “We conclude that aspirin is an efficacious drug for men with threatened stroke”.

N Engl J Med 1978; 299:53-59.

[Figure: Absolute effects of antiplatelet therapy on vascular events in the 29 trials in high risk patients with separate information available on each patient, subdivided by age and sex and by diastolic blood pressure and diabetes. Annotation: no difference by gender.]

Antiplatelet Trialists' Collaboration, BMJ 1994;308:81-106

©1994 by British Medical Journal Publishing Group

Cox Model for Interaction

• Treatment x gender interaction

Z1 = 1 if eplerenone; 0 if placebo

Z2 = 1 if male; 0 if female

Z3 = Z1 x Z2

H0 : β3 = 0

h(t; Z) = h0 (t) exp[β1 Z1 + β2 Z2 + β3 Z3]
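A short derivation shows why the test of β3 is the interaction test. It uses only the model and the coding of Z1, Z2, and Z3 given above; the labels HR_female and HR_male are introduced here for illustration.

```latex
% Hazard ratio for eplerenone vs. placebo within each gender,
% obtained by setting Z_1 = 1 vs. Z_1 = 0 at fixed Z_2 (so Z_3 = Z_1 Z_2).
\begin{align*}
\mathrm{HR}_{\text{female}} &= \frac{h_0(t)\exp[\beta_1]}{h_0(t)\exp[0]} = \exp(\beta_1)
  && (Z_2 = 0) \\
\mathrm{HR}_{\text{male}} &= \frac{h_0(t)\exp[\beta_1 + \beta_2 + \beta_3]}{h_0(t)\exp[\beta_2]}
  = \exp(\beta_1 + \beta_3) && (Z_2 = 1) \\
\frac{\mathrm{HR}_{\text{male}}}{\mathrm{HR}_{\text{female}}} &= \exp(\beta_3)
\end{align*}
```

So H0: β3 = 0 states that the treatment hazard ratio is the same in men and women. The main effect β2 cancels out of both subgroup hazard ratios.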

Subgroup Analyses According to Follow-up Time

• Heart and Estrogen/progestin Replacement Study (HERS)
– JAMA 1998; 280: 605-613.

• Adenomatous Polyp Prevention on Vioxx (APPROVe) Trial
– N Engl J Med 2005; 352:1092-1102
– Lancet 2008; 372:1756-1764.

HERS

                       Estrogen-Progestin   Placebo     Hazard Ratio
                       (n=1380)             (n=1383)    (95% CI)
Primary CHD events     172                  176         0.99 (0.80 – 1.22)
  Year 1               57                   38          1.52
  Year 2               47                   48          1.00
  Year 3               35                   41          0.87
  Year 4               33                   49          0.67

P=.009 for interaction

APPROVe

                               Rofecoxib    Placebo     Hazard Ratio
                               (n=1287)     (n=1299)    (95% CI)
Confirmed thrombotic events    46           26          1.92 (1.19 – 3.11)
  Months 0-18                  22           20          1.18
  Months 19-36                 24           6           4.45

P=.01 for failed test of proportional hazards (interaction)

Later determined that a different test for interaction was pre-specified and inclusion of events after treatment discontinuation changed findings.

Barrett-Connor on HERS*

A Fable: Looking for the Pony

A man has 2 sons, one a hopeless pessimist and the other an unrealistic optimist. Determined to change their thinking to a less extreme position, the man buys a room full of toys for the pessimist and a room full of horse manure for the optimist.

When he returns, the pessimist is crying because he has broken all of his toys. In contrast, the optimist is shoveling through his gift and proclaims: “with all that manure there must be a pony in there somewhere.”

Circulation 2002;105:902-903.

“New Study Reassures Most Users of Hormones. For Newly Menopausal, There’s No Heart Risk; A Reversal of Findings.”

“At Issue is something called the P value…”

Wall Street Journal, April 4, 2007

Cardiovascular and Global Index Events by Years Since Menopause at Baseline (WHI Study)

Group sizes: <10 years, Hormone Therapy (n=3608), Placebo (n=3529); 10-19 years, Hormone Therapy (n=4483), Placebo (n=3529); ≥20 years, Hormone Therapy (n=3608), Placebo (n=3529)

Outcome /                  No. of Cases        No. of Cases
Years Since Menopause      Hormone Therapy     Placebo         HR (95% CI)

CHD‡ (P value for trend† = .02)
  <10                      39                  51              0.76 (0.50-1.16)
  10-19                    113                 103             1.10 (0.84-1.45)
  ≥20                      194                 158             1.28 (1.03-1.58)

Stroke (P value for trend† = .36)
  <10                      41                  23              1.77 (1.05-2.98)
  10-19                    100                 79              1.23 (0.92-1.66)
  ≥20                      142                 113             1.26 (0.98-1.62)

Total Mortality (P value for trend† = .51)
  <10                      53                  67              0.76 (0.53-1.09)
  10-19                    142                 149             0.98 (0.78-1.24)
  ≥20                      267                 240             1.14 (0.96-1.36)

Global Index§ (P value for trend† = .62)
  <10                      222                 203             1.05 (0.86-1.27)
  10-19                    482                 440             1.12 (0.98-1.27)
  ≥20                      675                 632             1.09 (0.98-1.22)

† Test for trend (interaction) using years since menopause as continuous (linear) form of categorical coded values. Cox regression models stratified according to active vs. placebo and trial, including terms for years since menopause and the interaction between trials and years since menopause

JAMA 2007;297:1465-1477

CHD Events by Years Since Menopause at Baseline

Years Since Menopause    HR (95% CI)
<10                      0.76 (0.50-1.16)
10-19                    1.10 (0.84-1.45)
≥20                      1.28 (1.03-1.58)

P-value for trend† = .02 (CHD‡)

“These analyses, although not definitive, suggest that the health consequences of hormone therapy may vary by distance from menopause…”

AIDS Vaccine Trial (Science 28 February 2003)

             Infected    Not Infected    Total
Vaccine      191         3,139           3,330
Placebo      98          1,581           1,679
Total        289         4,720           5,009

Infected: 5.7% vs. 5.8%

$\widehat{OR} = 0.98$; 95% CI (0.78 to 1.24)

AIDS Vaccine Trial Subgroup Analysis

White and Hispanic       Infected    Not Infected
  Vaccine                179         2,824
  Placebo                81          1,427
  Infected: 6.0 vs. 5.4%
  $\widehat{OR}_1 = 1.12$ (95% CI: 0.85 to 1.46)

Black, Asian, Other      Infected    Not Infected
  Vaccine                12          315
  Placebo                17          154
  Infected: 3.7 vs. 9.9%
  $\widehat{OR}_2 = 0.35$ (95% CI: 0.16 to 0.74)

$\hat{O}_p = 1.02$; $\chi^2_1 = 8.6$ for homogeneity of odds ratio; p = 0.003

Example: ACTG 155

Randomization (allocation ratio)
Arms: AZT (2), ddC (2), AZT + ddC (3)

Primary outcome: disease progression (AIDS/death)

Secondary outcome: CD4+ cell count change, toxicities

Sample Size: 991

Subgrouping          Number
CD4 < 50             269
50 ≤ CD4 < 150       336
CD4 ≥ 150            386

“We found no overall benefits of zalcitabine used alone or with zidovudine. However, a trend analysis suggested a better outcome for combination therapy compared with zidovudine as the pretreatment CD4 cell count increased”.

“Our study suggests that combination therapy may be beneficial in patients with higher CD4 cell counts”.

Pooled Analysis of AZT + ddX vs. AZT: Treatment Naïve Patients

Baseline CD4+    No. AIDS/Death Events    Hazard Ratio*
< 100            382                      0.66 (0.53 - 0.82)
100 - 199        319                      0.63 (0.50 - 0.81)
200 - 299        186                      0.62 (0.45 - 0.84)
300 - 499        90                       0.63 (0.40 - 0.98)

*AZT + ddX vs. AZT

Some Lessons From ACTG 155 Presentation

1. What does “a priori” mean?

If it is important, amend the protocol.

2. Confusion about stratification and subgrouping.

Lessons Continued

3. It is easy to develop explanations for possible subgroup effects.

4. By chance some subgroups will be more extreme than others.

Lessons Continued

5. For an ordered/continuous variable, test for trend is important.

CD4+: < 50, 50 - 149, 150+

4 df test for interaction (3 treatment groups and 3 CD4 categories) or

2 df test (3 treatment groups and continuous CD4)

6. “Subgroup label” may be a marker for something else.

Guidelines to Follow for Interpreting Subgroup Analysis

• Assess magnitude of interaction before focusing on separate subgroups and their tests of significance

• Assess consistency with biologic phenomenon realizing that “human imagination is capable of developing a rationale for most findings” (Ware, NEJM, 2003).

• Assess consistency with other data from trial

• Assess consistency with other studies

Guidelines For Reporting Subgroup Analyses (NEJM 2007;2189-2194)

• Abstract: Only if based on primary outcome and pre-specified

• Methods: Number pre-specified; any of special interest; endpoint; methods used to assess heterogeneity; number performed; potential effect on type 1 error

• Results: present tests of heterogeneity; forest plot

• Discussion: Cautious in interpretation; state limitations; cite supporting or contradictory data

Criteria Used to Assess Credibility of Subgroup Effect (BMJ 2012:344:e1553)

• Design
– Baseline characteristic?
– Stratification factor?
– A priori specified?
– Fewer than 5 subgroups tested?

• Analysis
– Test for interaction performed?
– If multiple interactions, independent?

• Context
– Direction correctly pre-specified?
– Consistent with evidence from previous studies?
– Consistent across outcomes?
– Indirect evidence (e.g., biologic rationale) supports finding?

Methods Section of ESPRIT Paper

N Engl J Med 2009; 361: p. 1550

“Data on the primary end point were summarized for pre-specified subgroups defined according to baseline characteristics. A total of 12 subgroup analyses were pre-specified. The heterogeneity of hazard-ratio estimates between subgroups were assessed by including an interaction term between treatment and subgroup in expanded Cox models. The results of subgroup analyses should be interpreted with caution; a significant interaction could be due to chance, because there was no adjustment made to type 1 error for the number of subgroups examined.”

5 subgroups were reported: age, gender, race/ethnicity, baseline CD4+ count and baseline HIV RNA level.

Summary

• P-values for individual subgroups are misleading – report CIs.

• Calculate subgroup by treatment interactions, but be cognizant of low power and risks of type 1 error if multiple subgroups are examined

• Keep in mind most trials are designed assuming no interaction.

• Define key subgroups to be investigated in the protocol.

• Report subgroup findings very cautiously – ultimately want validation in another study or meta-analysis.
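The low power of interaction tests can be illustrated with a simple variance calculation. This sketch rests on assumptions that are not in the slides: a continuous outcome with common standard deviation sigma, 1:1 allocation, and two subgroups of equal size. Function names are illustrative.

```python
import math


def se_difference(sigma: float, n_patients: float) -> float:
    """SE of a difference in means when n_patients are split 1:1
    between two treatment arms."""
    per_arm = n_patients / 2
    return math.sqrt(sigma ** 2 / per_arm + sigma ** 2 / per_arm)


def se_interaction(sigma: float, n_patients: float) -> float:
    """SE of the difference between the treatment effects estimated
    in two equal-sized subgroups of a trial with n_patients."""
    se_subgroup = se_difference(sigma, n_patients / 2)
    return math.sqrt(se_subgroup ** 2 + se_subgroup ** 2)


sigma, n = 1.0, 1000  # hypothetical values for illustration
ratio = se_interaction(sigma, n) / se_difference(sigma, n)
print(f"SE(interaction) / SE(overall effect) = {ratio:.1f}")
```

Under these assumptions the interaction estimate has twice the standard error of the overall treatment effect, so a trial powered for the overall effect would need roughly four times as many patients to detect an interaction of the same size with the same power.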

“Only one thing is worse than doing subgroup analyses --- believing the results.” Richard Peto
