use and abuse of p values

NY, 14 December 2007

Emmanuel Lesaffre
Biostatistical Centre, K.U.Leuven, Leuven, Belgium
Dept of Biostatistics, Erasmus MC, Rotterdam, the Netherlands

Use and abuse of P values

Clinical Research Methodology Course
Randomized Clinical Trials and the “REAL WORLD”


Contents
1. P-value: What is it?
2. Type I error
3. Multiple testing
4. Type II error
5. Sample size calculation
6. Negative studies
7. Testing at baseline
8. Statistical significance versus clinical relevance
9. Confidence interval versus P-value
10. P-value of clinical trial versus of epidemiological study
11. Take home messages


1. P-value: What is it?


1. P-value: What is it?
Etoricoxib versus Placebo
– WOMAC Pain Subscale: difference in means = -15.07
– What does this result mean?
– What do you expect if etoricoxib = placebo? Difference ≈ 0
– But even if etoricoxib = placebo, the result will vary around 0
– What is a large/small difference?
– What is the play of chance?
The same questions arise for the other scores & comparisons


1. P-value: What is it?
Etoricoxib versus Placebo
– Suppose H0: E = P
– P = 0.05: the result belongs to the 5% extreme results that could happen under H0 (if H0 is true)
– P = 0.01: the result belongs to the 5% extreme results that could happen under H0 (if H0 is true), and only 1% of results are as extreme or MORE EXTREME
– P < 0.0001: the result belongs to the 5% extreme results that could happen under H0 (if H0 is true) and IS VERY EXTREME
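To make the "5% extreme results" idea concrete, here is a minimal sketch. It assumes the test statistic is approximately standard normal under H0 (the slides do not name the test used), and the function name `two_sided_p` is purely illustrative.

```python
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """Two-sided P-value for a z statistic.

    Assumption: under H0 the statistic follows a standard normal distribution.
    The P-value is the share of results under H0 that are at least as extreme
    (in either direction) as the observed one.
    """
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical z statistics, chosen only for illustration
for z in (1.0, 1.96, 2.58, 4.0):
    print(f"z = {z:.2f} -> two-sided P = {two_sided_p(z):.5f}")
```

The further the observed statistic lies from 0, the smaller the share of results under H0 that are at least as extreme, and hence the smaller the P-value.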


1. P-value: What is it?
GENERAL RULE
– When P < 0.05 (= significance level α):
    Result is considered to be TOO EXTREME to believe that H0 is true
    H0 is rejected
    We do NOT believe that E = P
    Significant at 0.05 (*, **, ***)
– When P ≥ 0.05:
    Result could have happened when H0 is true
    H0 is NOT rejected
    It is possible that E = P
    Result is ≠ 0, but this could be due to PLAY OF CHANCE
    NOT significant at 0.05 (NS)


1. P-value: What is it?
Results: E, C, P
– E versus P, WOMAC Pain: P < 0.0001, significant at 0.05 (***). We do NOT believe that E = P
– E versus C, WOMAC Physical Function: P = 0.367, NS. It could be that E = C; the result may be PLAY OF CHANCE
– E versus C, Patient Global Assessment: P = 0.051, NS. It could be that E = C; the result may be PLAY OF CHANCE


1. P-value: What is it?
Previous decision rule = hypothesis testing
– Test H0: E = P versus HA: E ≠ P
– Using a statistical test (t-test, χ²-test, etc.)
– With 2-sided significance level α = 0.05
– In clinical trial setting, the above test is interpreted as:
    H0: E ≤ P versus HA: E > P
    at 1-sided significance level α/2 = 0.05/2 = 0.025 (2.5%)
When the result is on the wrong side (E < P) with P < 0.05, efficacy of E over P is not demonstrated


1. P-value: What is it?
What if H0: E = P is true & P = 0.023?
– We will reject H0
– We will make an ERROR = Type I error
P(Type I error) = False-positive rate
  = Probability that result belongs to 5% extreme results if H0 is true
  = 0.05
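The 5% false-positive rate can be checked by simulation. The sketch below is an illustration only: it assumes normally distributed outcomes with known SD = 1, a two-sample z-test, and arbitrary trial sizes; none of these settings come from the etoricoxib study.

```python
import random
from statistics import NormalDist

def simulate_false_positive_rate(n_trials=2000, n_per_arm=50, alpha=0.05, seed=1):
    """Fraction of simulated trials that reject H0 although H0 (E = P) is true.

    Assumptions (illustrative): both arms are drawn from N(0, 1), the SD is
    treated as known, and a two-sided z-test is used.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_arm) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_trials):
        mean_e = sum(rng.gauss(0, 1) for _ in range(n_per_arm)) / n_per_arm
        mean_p = sum(rng.gauss(0, 1) for _ in range(n_per_arm)) / n_per_arm
        if abs((mean_e - mean_p) / se) > z_crit:
            rejections += 1  # a Type I error: H0 is true but rejected
    return rejections / n_trials

print(f"Simulated false-positive rate: {simulate_false_positive_rate():.3f}")
```

With enough simulated trials the rejection fraction should settle near the chosen significance level.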


2. Type I error
Type I error: Practical implications
– Suppose H0 is TRUE
– Risk = 5% implications:
    100 studies: on average 5 studies reach the wrong conclusion
    Prob(at least 1 study reaches the wrong conclusion) ≈ 1
Regulatory agencies mandate a strict control of the overall false-positive rate
False positive trial findings could lead to approval of inefficacious drugs
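The arithmetic behind the "100 studies" statement can be sketched as follows, assuming the studies are independent and each is tested at the same significance level (an assumption made here for illustration; the function names are hypothetical).

```python
def expected_wrong_conclusions(n_studies: int, alpha: float = 0.05) -> float:
    """Expected number of false-positive studies when H0 is true in all of them."""
    return n_studies * alpha

def prob_at_least_one_wrong(n_studies: int, alpha: float = 0.05) -> float:
    """P(at least one false positive), assuming independent studies."""
    return 1 - (1 - alpha) ** n_studies

for n in (1, 10, 100):
    print(n, expected_wrong_conclusions(n), round(prob_at_least_one_wrong(n), 4))
```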


3. Multiple testing
Multiple testing: Definition
– Suppose H0 is TRUE
– Test 1 (WOMAC Pain Subscale): risk = 5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5%
– Test 1 & Test 2: risk ≈ 5% + 5% = 10% of claiming that the 2 treatments are different (on one of the tests) when they are not
– If no adjustment: multiple testing problem


3. Multiple testing
Multiple testing: Typical cases
– 2 treatments are compared for several endpoints
– More than 2 treatments are compared
– 2 treatments are compared in several subgroups
– 2 treatments are compared at several time points


3. Multiple testing: example
2 treatments are compared for several endpoints


3. Multiple testing: example
More than 2 treatments are compared


3. Multiple testing: example
2 treatments are compared in several subgroups
– Treatments were not significantly different overall
– Then, treatments were compared in subgroups:
    Males & Females
    < 60 yrs & ≥ 60 yrs
    Diabetes & no diabetes
    ...
– Suppose in 1 subgroup: P < 0.05, meaning? The significant result may well be a play of chance


3. Multiple testing: example
2 treatments are compared at several time points
Comparison at each time point: PLAY OF CHANCE!


3. Multiple testing: example
Protocol specified:
2.2 Administration of visits
Patients will be examined at baseline (day 0), day 7, day 14 and day 28. At each visit the systolic BP, etc. will be measured.
9.4 Primary endpoint
The primary endpoint for the comparison of treatment A versus B is systolic BP.


3. Multiple testing: example
This “scientific finding” was printed in the Belgian newspapers!
It was even stated that those who awake before 7.21 AM have a statistically significantly higher stress level during the day than those who awake after 7.21 AM!


3. Multiple testing: example
Signs of the times: Feb 22nd 2007 | SAN FRANCISCO
From The Economist print edition
Interesting finding?
PEOPLE born under the astrological sign of Leo are 15% more likely to be admitted to hospital with gastric bleeding than those born under the other 11 signs. Sagittarians are 38% more likely than others to land up there because of a broken arm.
Those are the conclusions that many medical researchers would be forced to make from a set of data presented to the American Association for the Advancement of Science by Peter Austin of the Institute for Clinical Evaluative Sciences in Toronto. At least, they would be forced to draw them if they applied the lax statistical methods of their own work to the records of hospital admissions in Ontario, Canada, used by Dr Austin.


3. Multiple testing
Multiple testing: Solution??
– Choose 1 primary endpoint: risk = 5%
– What if more than one endpoint is needed?
    Construct combined endpoint based on clinical/statistical reasoning
    Correct for multiple testing
– What about other (secondary + tertiary) endpoints?
    Call analyses EXPLORATORY
    Correct for multiple testing


3. Multiple testing
Multiple testing: Solution??
– Test 1 (WOMAC Pain Subscale): risk = 5%
– Test 2 (WOMAC Physical Function Subscale): risk = 5%
– Test 1 & Test 2: risk = 10%
– Both tests claim significance if P < 0.05
– Bonferroni adjustment: significance if P < 0.05/2 = 0.025
    Risk per test 2.5% + 2.5% = 5%: family-wise error rate = 0.05
– More sophisticated approaches of Simes, Holm, Hochberg and Hommel, Closed Testing procedures, ...
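As an illustration of the adjustments named above, the sketch below implements the Bonferroni rule and a Holm step-down procedure for a list of P-values. The P-values in the example are hypothetical, the function names are illustrative, and the strict inequality follows the "P < 0.025" convention on the slide (many texts use "less than or equal").

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for each test whose P-value is below alpha / (number of tests)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure.

    Sort the P-values from smallest to largest and compare the k-th smallest
    (k = 0, 1, ...) with alpha / (m - k); stop at the first non-rejection.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger P-values are not rejected either
    return reject

# Hypothetical P-values for two endpoints (not taken from the study)
p_values = [0.030, 0.010]
print("Bonferroni:", bonferroni_reject(p_values))
print("Holm:      ", holm_reject(p_values))
```

Both procedures keep the family-wise error rate at or below 0.05; Holm is never less powerful than Bonferroni, which is why it can reject the first hypothesis in this example while Bonferroni does not.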


3. Multiple testing
CPMP guidance document
“Points to consider on multiplicity issues in clinical trials” (Sept 19, 2002)
“A clinical study that requires no adjustment of the Type I error is one that consists of two treatment groups, that uses a single primary variable, and has a confirmatory statistical strategy that pre-specifies just one single null hypothesis relating to the primary variable and no interim analysis”


4. Type II error
Type I error:
– Result is statistically significant (P < 0.05)
– Risk of making an error when H0 is true = 5%
– (We do NOT know if H0 is true)
Type II error:
– Result is NOT statistically significant (P ≥ 0.05)
– Risk of making an error when H0 is NOT true = ???
– (We do NOT know if H0 is NOT true)


5. Sample size calculation
P(Type II error) = 1 - Power
– LARGE(R) in small studies
– Can be controlled by adapting study (sample) size
– Calculation of sample size:
    Determine clinically important difference
    Search for information
      – % rate control group
      – SD of measurements
    Fix P(Type II) ≤ 0.20, i.e. Power ≥ 0.80 (80%)
    Look for statistician ((s)he will look for computer program)
    Pray
    Let computer work: sample size
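What the "computer program" does can be sketched for the simplest case: comparing two means with a normal approximation. This is one standard textbook formula, not necessarily the method used for the studies discussed here; the inputs (difference of 5, SD of 10) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per group to detect a difference in means `delta`.

    Normal-approximation formula for a two-sided test at level alpha:
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / delta^2
    Assumptions: equal group sizes, common known SD, continuous outcome.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Hypothetical inputs: clinically important difference 5, SD 10
print(n_per_group(delta=5, sd=10))               # power 80%
print(n_per_group(delta=5, sd=10, power=0.95))   # higher power needs more patients
```

Halving the clinically important difference quadruples the required sample size, which is why the choice of that difference deserves the most care.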


5. Sample size calculation: example
[Figure: sample size calculation example, with highlighted values α = 0.05, power = 0.95, 20% and n = 2 x 300]


5. Sample size calculation: example
????


6. Negative studies
Negative study: Not significant study
– Sample size calculation done (power at least 80%)?
– Yes: Difference between treatments is probably smaller than the clinically important difference
– No: Message ????
DOES NOT imply: NO difference between treatments


6. Negative studies: example

Sample size calculation????

Message????


6. Negative studies: “Trend”
Trend in the data:
– P > 0.05, but difference is in the good direction
– One speaks of a “trend in the data”
– OK?
    No, for confirmatory study
    Perhaps, for pilot study or exploratory studies


7. Testing at baseline
Why no P-values?
How many significant (at 0.05) tests would you expect?
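In a randomized trial H0 holds at baseline by construction, so each baseline test has a 5% chance of coming out significant. A small sketch of the expected count follows, assuming independent tests with uniformly distributed P-values under H0; the figure of 20 baseline variables is hypothetical.

```python
import random

def expected_significant(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' baseline tests when H0 is true for all."""
    return n_tests * alpha

def simulate_baseline_tables(n_tests=20, n_tables=5000, alpha=0.05, seed=7):
    """Average number of significant tests per simulated baseline table.

    Assumption: under H0 each P-value is uniform on [0, 1] and tests are independent.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_tables):
        total += sum(rng.random() < alpha for _ in range(n_tests))
    return total / n_tables

print(expected_significant(20))              # theoretical expectation
print(round(simulate_baseline_tables(), 3))  # simulated average
```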


8. Statistical significance versus clinical relevance
Statistical significance:
– P < 0.05
– Message: two treatments are (probably/possibly) different
Clinical relevance:
– Difference is clinically relevant


8. Statistical significance versus clinical relevance: Example
Compare two treatments
– Response = 10-year mortality
– 2 x 200 patients
– A: 2%, B: 10%
– Chi-square test: P < 0.001
Measures of effect
– ar = 10% - 2% = 8% (abs risk reduction)
– rr = 10%/2% = 5 (risk ratio)
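The numbers in the 2 x 200 example can be reproduced with a short sketch. It assumes a Pearson chi-square test without continuity correction (the slide does not say which variant was used) and converts 2% and 10% of 200 patients into 4 and 20 deaths.

```python
from math import erfc, sqrt

def chi_square_2x2(events_a, n_a, events_b, n_b):
    """Pearson chi-square statistic (no continuity correction) and its P-value.

    For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, erfc(sqrt(chi2 / 2))

def effect_measures(events_a, n_a, events_b, n_b):
    """Absolute risk reduction (B minus A) and risk ratio (B over A)."""
    risk_a, risk_b = events_a / n_a, events_b / n_b
    return risk_b - risk_a, risk_b / risk_a

chi2, p = chi_square_2x2(4, 200, 20, 200)   # A: 2% of 200, B: 10% of 200
ar, rr = effect_measures(4, 200, 20, 200)
print(f"chi2 = {chi2:.2f}, P = {p:.5f}, ar = {ar:.2%}, rr = {rr:.1f}")
```

The P-value says the difference is unlikely under H0; the absolute risk reduction and the risk ratio say how large the difference is, which is what matters for clinical relevance.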


8. Statistical significance versus clinical relevance: Example
Compare two treatments
– Response = 10-year mortality
– 2 x 100,000 patients
– A: 0.002%, B: 0.010%
– Chi-square test: P < 0.001
Measures of effect
– ar = 0.010% - 0.002% = 0.008% (abs risk reduction)
– rr = 0.010%/0.002% = 5 (risk ratio)


8. Statistical significance versus clinical relevance: Conclusion
Conclusion
– For each (small) difference (≠ 0), there is a sample size such that H0 is rejected with high probability
Implications
– Clinical trials are often too small to detect rare safety issues
– When registered and on the market, after several years a safety issue appears (Vioxx story)
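The conclusion that any non-zero difference becomes detectable with a large enough sample can be illustrated with the power of a two-sample z-test. This is a sketch under a normal approximation with known SD; the difference of 0.1 SD and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(delta: float, sd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 with a two-sided z-test when the true difference is delta.

    Assumptions: equal group sizes, known common SD, normally distributed means.
    """
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / se
    # Rejection happens in either tail of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# A tiny true difference (0.1 SD) with growing sample sizes
for n in (10, 100, 1000, 10000):
    print(n, round(power_two_sample(delta=0.1, sd=1.0, n_per_group=n), 3))
```

The same reasoning explains the safety implication: for rare events the detectable difference at usual trial sizes is far larger than the true one, so power is low.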


8. Statistical significance versus clinical relevance: Further reflections
Practical conclusions
– Even if result is not significant, we will NOT conclude that H0 is true
– Why do the significance test, if we don’t believe in it?
– Better: estimate difference in treatment effect + uncertainty

                          Reality: treatments
Conclusion from sample    difference = 0    difference ≠ 0
same                      OK                type II
different                 type I            OK

Classical table indicating two types of errors (decision-theoretic approach of Neyman-Pearson). Indicates that we can conclude in practice that the 2 treatments are equally good.
It is not possible in statistics to show that 2 treatments are equally good (non-inferiority talk).
We even DO NOT BELIEVE that H0 is TRUE in practice!


9. Confidence interval versus P-value


9. Confidence interval versus P-value
95% confidence interval
– Expresses uncertainty about true difference
– When the interval is narrow: good idea about true treatment effect
Examples
– WOMAC Pain Subscale:
    E versus C: 95% CI = [-7.02, 0.77]: 0 is possible
    E versus P: 95% CI = [-19.72, -10.41]: E is better
    C versus P: 95% CI = [-16.57, -7.32]: C is better
GENERAL RESULT: P < 0.05 if and only if the 95% CI does not contain 0
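The link between the 95% CI and the P-value can be made explicit by recovering an approximate P-value from the interval. The sketch assumes the interval is symmetric and based on a normal approximation (estimate ± 1.96 x SE), which may not be exactly how the published intervals were computed, so the resulting P-values are approximations only.

```python
from statistics import NormalDist

def p_from_ci(lower: float, upper: float, level: float = 0.95):
    """Point estimate and approximate two-sided P-value recovered from a CI.

    Assumption: the CI is estimate +/- z * SE with z from the normal distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - (1 - level) / 2)
    estimate = (lower + upper) / 2
    se = (upper - lower) / (2 * z_crit)
    p = 2 * (1 - z.cdf(abs(estimate / se)))
    return estimate, p

# WOMAC Pain Subscale intervals quoted above
for label, (lo, hi) in {"E vs C": (-7.02, 0.77),
                        "E vs P": (-19.72, -10.41),
                        "C vs P": (-16.57, -7.32)}.items():
    est, p = p_from_ci(lo, hi)
    contains_zero = lo <= 0 <= hi
    print(f"{label}: estimate = {est:.2f}, approx P = {p:.4f}, CI contains 0: {contains_zero}")
```

Under this approximation an interval that contains 0 always gives P above 0.05 and an interval that excludes 0 gives P below 0.05, while the interval additionally shows the size and precision of the effect.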


9. Confidence interval versus P-value
[Figure: 95% confidence intervals (mmHg, axis from -6 to 12) for medication A in studies 1 to 5, with significance labels: study 1 NS, study 2 NS, study 3 *, study 4 **, study 5 ***]

Two anti-hypertensive drugs

95% CI gives a clearer message


10. P-value: clinical trial versus epi study
Clinical trial
– Randomized
– No confounding
– P < 0.05: causal effect of treatment on patient’s condition
Epidemiological study
– Observational
– Possible confounding
– P < 0.05: at most association, correction for confounding


10. P-value: clinical trial versus epi study


11. Biased set up & reporting


11. Biased set up & reporting
– Bias in set up of studies, e.g. inappropriate doses of competing drug
– Choice of patient populations, e.g. exclusion of patients who were previously nonresponders to treatment
– Noninferiority designs with different thresholds
– Biased reporting, e.g. minimal information on negative aspects of drug of sponsor


12. Take home messages
– If possible, take 1 primary endpoint
– Always determine necessary sample size
– Always WATCH OUT for problem of multiple testing
– Always and ONLY interpret NS as NOT possible to show “difference”
– Always be careful when talking about “trend”
– Always determine 95% confidence intervals
Thank you for your attention
