Biomarker discovery: fact or artefact?
statweb.stanford.edu/~tibs/sta306bfiles/foll_lym.pdf

Page 1

SLDM II Trevor Hastie and Rob Tibshirani, Stanford University 1

A case study in genomics:

Biomarker discovery: fact or artefact?

Page 2

Outline

• a recent controversy in cancer genomics

• some general suggestions for dealing with this problem

Page 3

A recent experience

• Dave et al. published a high-profile study in NEJM, reporting that they had found two sets of genes whose expression was highly predictive of survival in patients with Follicular Lymphoma.

• The paper got a lot of attention at the recent ASH meeting, because the genes in the clusters were largely expressed in non-tumor cells, suggesting that the host response was the important factor.

• One of my medical collaborators, Ron Levy, asked me to look over their paper; he wanted to apply their model to the Stanford FL patient population.

Page 4

N Engl J Med 351;21  www.nejm.org  November 18, 2004  p. 2159

The New England Journal of Medicine, established in 1812
November 18, 2004  Vol. 351 No. 21

Prediction of Survival in Follicular Lymphoma Based on Molecular Features of Tumor-Infiltrating Immune Cells

Sandeep S. Dave, M.D., George Wright, Ph.D., Bruce Tan, M.D., Andreas Rosenwald, M.D., Randy D. Gascoyne, M.D., Wing C. Chan, M.D., Richard I. Fisher, M.D., Rita M. Braziel, M.D., Lisa M. Rimsza, M.D., Thomas M. Grogan, M.D., Thomas P. Miller, M.D., Michael LeBlanc, Ph.D., Timothy C. Greiner, M.D., Dennis D. Weisenburger, M.D., James C. Lynch, Ph.D., Julie Vose, M.D., James O. Armitage, M.D., Erlend B. Smeland, M.D., Ph.D., Stein Kvaloy, M.D., Ph.D., Harald Holte, M.D., Ph.D., Jan Delabie, M.D., Ph.D., Joseph M. Connors, M.D., Peter M. Lansdorp, M.D., Ph.D., Qin Ouyang, Ph.D., T. Andrew Lister, M.D., Andrew J. Davies, M.D., Andrew J. Norton, M.D., H. Konrad Muller-Hermelink, M.D., German Ott, M.D., Elias Campo, M.D., Emilio Montserrat, M.D., Wyndham H. Wilson, M.D., Ph.D., Elaine S. Jaffe, M.D., Richard Simon, Ph.D., Liming Yang, Ph.D., John Powell, M.S., Hong Zhao, M.S., Neta Goldschmidt, M.D., Michael Chiorazzi, B.A., and Louis M. Staudt, M.D., Ph.D.

ABSTRACT

From National Cancer Institute (S.S.D., G.W., B.T., A.R., W.H.W., E.S.J., R.S., H.Z., N.G., M.C., L.M.S.); Center for Information Technology (L.Y., J.P.); and National Heart, Lung, and Blood Institute (S.S.D.), all in Bethesda, Md.; British Columbia Cancer Center, Vancouver, Canada (R.D.G., J.M.C., P.M.L., Q.O.); University of Nebraska Medical Center, Omaha (W.C.C., T.C.G., D.D.W., J.C.L., J.V., J.O.A.); Southwest Oncology Group, San Antonio, Tex. (R.I.F., T.M.G., T.P.M., M.L.); University of Rochester School of Medicine, Rochester, N.Y. (R.I.F.); Oregon Health and Science University, Portland (R.M.B.); University of Arizona Cancer Center, Tucson (L.M.R., T.M.G., T.P.M.); Fred Hutchinson Cancer Research Center, Seattle (M.L.); Norwegian Radium Hospital, Oslo (E.B.S., S.K., H.H., J.D.); Cancer Research UK, St. Bartholomew's Hospital, London (T.A.L., A.J.D., A.J.N.); University of Würzburg, Würzburg, Germany (A.R., H.K.M.-H., G.O.); and University of Barcelona, Barcelona, Spain (E.C., E.M.). Address reprint requests to Dr. Staudt at the National Cancer Institute, Bldg. 10, Rm. 4N114, NIH, Bethesda, MD 20892, or at lstaudt@mail.nih.gov.

N Engl J Med 2004;351:2159-69.

Copyright © 2004 Massachusetts Medical Society.

BACKGROUND

Patients with follicular lymphoma may survive for periods of less than 1 year to more than 20 years after diagnosis. We used gene-expression profiles of tumor-biopsy specimens obtained at diagnosis to develop a molecular predictor of the length of survival.

METHODS

Gene-expression profiling was performed on 191 biopsy specimens obtained from patients with untreated follicular lymphoma. Supervised methods were used to discover expression patterns associated with the length of survival in a training set of 95 specimens. A molecular predictor of survival was constructed from these genes and validated in an independent test set of 96 specimens.

RESULTS

Individual genes that predicted the length of survival were grouped into gene-expression signatures on the basis of their expression in the training set, and two such signatures were used to construct a survival predictor. The two signatures allowed patients with specimens in the test set to be divided into four quartiles with widely disparate median lengths of survival (13.6, 11.1, 10.8, and 3.9 years), independently of clinical prognostic variables. Flow cytometry showed that these signatures reflected gene expression by nonmalignant tumor-infiltrating immune cells.

CONCLUSIONS

The length of survival among patients with follicular lymphoma correlates with the molecular features of nonmalignant immune cells present in the tumor at diagnosis.

Page 5

Summary of their findings

• They started with the expression of approximately 49,000 genes measured on 189 patient samples, derived from DNA microarrays. A survival time (possibly censored) was available for each patient.

• They randomly split the data into a training set of 89 patients and a test set of 90 patients.

• Using a multi-step procedure (described below), they extracted two clusters of genes, called IR1 (immune response 1) and IR2 (immune response 2).

• They averaged the gene expression of the genes in each cluster, to create two "supergenes".

Page 6

... continued

• They then fit these supergenes together in a Cox model for survival, and applied it to the training and test sets. The p-value was < 10^-7 in the training set and 0.003 in the test set. IR1 correlates with good prognosis; IR2 with poor prognosis.

• In the remainder of the paper they interpret the genes in their model.

Page 7

What happened next...

• I downloaded the data.

• Applied some familiar statistical tools, e.g. SAM (Significance Analysis of Microarrays), some less familiar ones (supervised principal components), and also gave the data to Brad Efron. Our initial finding: no significant correlation between gene expression and survival.

• I spent 2-3 weeks emailing back and forth with their statistician (George Wright) and programming in R, to recreate their analysis.

• Confession: it was fun being a "forensic statistician".

Page 8

Page 9

[Figure, two panels. Left: "SAM plot", observed score vs. expected score. Right: "Ordered p-values", p-value vs. gene rank, on log scales.]

Page 10

Details of their analysis

1) Divide the data randomly into training and test sets of approximately equal numbers of patients. Apply the following recipe [steps 2-6] to the training set.

2) Choose all genes with univariate Cox score > 1.5 in absolute value. This reduced the number of genes from roughly 49,000 to roughly 3,000, with about a 50-50 split between good prognosis genes (negative scores) and poor prognosis genes (positive scores).

3) Do separate hierarchical clusterings (correlation metric, average linkage) of the good and poor prognosis genes.
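A quick sanity check on step 2 (mine, not part of their analysis): if each standardized univariate Cox score were roughly N(0, 1) under the null hypothesis of no survival association, the fraction of genes expected to pass the |score| > 1.5 filter by chance alone follows from the normal tail:

```python
import math

def two_sided_tail(z: float) -> float:
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2.0))

p_pass = two_sided_tail(1.5)
expected_null = 49_000 * p_pass

print(f"P(|Z| > 1.5) = {p_pass:.4f}")                 # about 0.134
print(f"genes passing under pure noise: {expected_null:.0f}")  # about 6,500
```

Under this rough approximation, about 6,500 of 49,000 pure-noise genes would survive the filter, so the roughly 3,000 that actually passed is fewer than chance alone would produce, consistent with the weak-signal picture in the SAM and ordered p-value plots.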

Page 11

Details continued...

4) Find all clusters in the dendrograms (clustering trees) containing between 25 and 50 genes, with internal correlation at least 0.5. Represent each cluster by the average expression of all genes in the cluster (a "supergene"). Try every pair of supergenes as predictors in Cox models for predicting survival.

5) Choose the most significant pair from this process. The authors call the resulting pair of clusters IR1 (good prognosis) and IR2 (poor prognosis).

6) Finally, use the pair (IR1, IR2) in a Cox model to predict survival in the test set.

Page 12

[Flowchart: Training set (89 patients, 49,000 genes, expression data plus survival times) → Cox scores → keep genes with score > 1.5 or < −1.5 (~1,500 positive genes, ~1,500 negative genes) → hierarchical clustering → extract clusters with between 25 and 50 genes and correlation > 0.5 → average each cluster into a supergene → try all possible pairs of supergenes in a Cox survival model → apply best model to the test set (90 patients).]

Page 13

[Reproduction of Figure 1 from Dave et al., N Engl J Med 351;21, November 18, 2004, p. 2163: the analysis flow diagram (191 follicular lymphoma biopsy specimens split into a training set of 95 and a test set of 96; identify genes with expression patterns associated with favorable and with poor prognosis; use hierarchical clustering to identify survival signatures; average gene-expression levels for genes in each survival signature; create optimal multivariate model with survival-signature averages; validate survival predictor in test set), the gene lists for the Immune-Response 1 and Immune-Response 2 signatures, a heat map of relative expression in the training set, and a Kaplan-Meier survival curve.]

Page 14

[Reproduction of Figure 2 from Dave et al., N Engl J Med 351;21, November 18, 2004, p. 2165.]

Figure 2. Development of a Molecular Predictor of Survival in Follicular Lymphoma.

Panel A shows overall survival among the patients with biopsy specimens in the test set, according to the quartile of the survival-predictor score (SPS). Panel B shows overall survival according to the International Prognostic Index (IPI) risk group for all the patients for whom these data were available. Panel C shows overall survival among the patients with specimens in the test set for the indicated IPI risk group, stratified according to the quartile of the SPS.

[Kaplan-Meier panels with numbers at risk: median survival by SPS quartile in the test set was 13.6, 11.1, 10.8, and 3.9 years; P < 0.001 in each panel.]

Page 15

p-values of all cluster pairs

[Scatterplot: test-set p-value vs. train-set p-value for all cluster pairs, log-log scales.]

The total number of points (cluster pairs) with test-set p-values less than 0.05 (239) is far fewer than we would expect to see by chance (735).

Page 16

Univariate p-values

[Two scatterplots of test p-value vs. training p-value for individual genes: poor prognosis genes (left) and good prognosis genes (right), log-log scales.]

Page 17

Swapping train and test sets

[Scatterplot: test-set p-value vs. train-set p-value for all cluster pairs, with the roles of the training and test sets swapped.]

There are only 85 pairs out of 11,628 that are significant in the test set at the 0.05 level, while we would expect 11,628 × 0.05 ≈ 581 pairs just by chance.
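The arithmetic behind the 581 figure, plus a rough binomial yardstick; rough because the 11,628 pairs are far from independent, so the binomial SD understates the real variability:

```python
import math

n_pairs = 11_628
alpha = 0.05
observed = 85

expected = n_pairs * alpha                      # 581.4 significant pairs by chance
sd = math.sqrt(n_pairs * alpha * (1 - alpha))   # valid only under independence

print(f"expected significant pairs by chance: {expected:.1f}")
print(f"binomial SD under independence: {sd:.1f}")
print(f"observed {observed} is {(expected - observed) / sd:.1f} SDs below expectation")
```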

Page 18

Swapping train and test sets, ctd...

[Two scatterplots of test p-value vs. training p-value under the swapped split: poor prognosis genes (left) and good prognosis genes (right), log-log scales.]

Page 19

Cluster size ranges (30,60) rather than (25,50)

[Scatterplot: test-set p-value vs. train-set p-value for all cluster pairs when the cluster-size range is (30, 60).]

Page 20

The aftermath

• I published a short letter to NEJM in March 2005; full details of my re-analysis appear on my website.

• The authors published a rebuttal in the same issue.

"Nothing in Tibshirani's analysis calls into dispute the fact that we discovered and validated a strong association between gene expression in follicular lymphoma and overall survival."

Their arguments:

– (1) we followed standard statistical procedures, found a small p-value on the test set, therefore our finding is correct;

– (2) our method found an interaction, which SAM can't find;

– (3) we get small p-values if we apply our model to random halves of the data (????!!!!!!)

Page 21

N Engl J Med 352;14  www.nejm.org  April 7, 2005. Correspondence, p. 1497.

Our predictor could not have been discovered with the SAM method, which relies solely on univariate associations with survival. Rather, our predictor derives its strength from the synergistic combination of two gene-expression signatures in a multivariate model. Tibshirani confuses the ability of our method to discover a survival association with the fact that we actually found one that validated the association. When he exchanged the training set for the test set, he was unable to rediscover our gene-expression predictor because some genes in our predictor fell below the P value threshold for association with survival in the test set. This does not negate the fact that our model is highly associated with survival in the test set.

Hong et al. have made three errors. First, in our sorted subpopulations, the CD19− fraction contained, on average, 12.6 percent contamination with follicular-lymphoma cells, not 25 percent, as they claim. To believe that the higher expression of the immune-response signatures in the CD19− fraction is due to this 12.6 percent contamination requires that the lymphoma cells in the CD19− fraction have expression of the immune-response signatures that was more than eight times as high as that in their counterparts in the CD19+ fraction. Second, Hong et al. incorrectly discount the immune-response 1 signature, which contributes significantly to the survival model in the test set (P<0.001). Third, many of the immune-response signature genes are selectively expressed in T cells, monocytes, or dendritic cells, or in more than one of these, but not in B cells, making the contention of Hong et al. even more implausible.

Louis M. Staudt, M.D., Ph.D.
George Wright, Ph.D.
Sandeep Dave, M.D.
National Cancer Institute, Bethesda, MD
[email protected]

1. Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 2004;4:309-14.


Figure 1. Results of Samplings of Half-Sets and Association between Gene Expression and Survival.

[Histogram: number of half-sets, out of 2000 samplings, vs. P value (from 10^-11 to 10^0) for association of the gene-expression-based model with survival.]

Page 22

General comments

• Their finding is fragile. I don't believe that it is real or reproducible.

• This experience uncovers a problem that is of general importance to our field:

– with many predictors, it is too easy to overfit the data and find spurious results;

– we can inadvertently mislead the reader, and mislead ourselves. I have been guilty of this too.

Page 23

Some recommendations

• encourage authors to publish not only the raw data, but a script of their analysis

• encourage authors to use "canned" methods/packages, with built-in cross-validation to validate the model search process; see "supervised principal components" later today

• develop crude, global measures of the "information content" in a dataset; see "tail strength" later today

• develop measures of the fragility of an analysis
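A toy illustration of the cross-validation point, with a hypothetical stand-in for the real pipeline: "pick the feature most correlated with the outcome" replaces the full filter/cluster/pair-search, but the moral is the same — the entire search must be re-run inside every fold. On pure noise, the search looks impressive on the data it searched and collapses on the held-out folds:

```python
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((x - mb) ** 2 for x in b) ** 0.5
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (sa * sb)

random.seed(1)
n, p = 100, 500   # pure noise: no feature is truly related to the outcome
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

k = 5
idx = list(range(n))
random.shuffle(idx)

picked, held_out = [], []
for f in range(k):
    test_ids = set(idx[f::k])
    tr = [i for i in range(n) if i not in test_ids]
    te = sorted(test_ids)
    # run the WHOLE search inside the fold: best-looking feature on tr only
    j = max(range(p),
            key=lambda c: abs(corr([X[i][c] for i in tr], [y[i] for i in tr])))
    picked.append(abs(corr([X[i][j] for i in tr], [y[i] for i in tr])))
    held_out.append(abs(corr([X[i][j] for i in te], [y[i] for i in te])))

print(f"selected |corr| on the searching folds: {sum(picked)/k:.2f}")
print(f"same features on the held-out folds:   {sum(held_out)/k:.2f}")
```

Only the held-out number is an honest estimate; cross-validating after the search, rather than around it, would report the first number instead.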

Page 24

A sobering point of view

Page 25

PLoS Medicine | www.plosmedicine.org | Essay | Open access, freely available online | August 2005 | Volume 2 | Issue 8 | e124 | p. 0696

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1-3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6-8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9-11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. "Negative" research is also very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of "true relationships" to "no relationships" among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus


Why Most Published Research Findings Are False
John P. A. Ioannidis

Citation: Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.

Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License.

Abbreviation: PPV, positive predictive value

John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America. E-mail: [email protected]

Competing Interests: The author has declared that no competing interests exist.

DOI: 10.1371/journal.pmed.0020124

Summary

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

It can be proven that most claimed research findings are false.

Page 26

• Let R = the ratio of "true relationships" to "no relationships" among those tested in a field.

• We carry out a study with type I error α and power 1 − β, probing c relationships. Then, given a positive finding from a study, the author derives its PPV (positive predictive value).

• He models the effect of bias u = the proportion of analyses that would not have been "research findings" if the study had been carried out in an unbiased way.
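The essay's formulas are easy to turn into code. A sketch (mine, not from the slides): the no-bias case reduces to PPV = (1 − β)R/(R − βR + α) as quoted in the extract above, and with Ioannidis's bias term the true positives become (1 − β)R + uβR and the false positives α + u(1 − α):

```python
def ppv(R: float, alpha: float = 0.05, beta: float = 0.2, u: float = 0.0) -> float:
    """Post-study probability that a claimed finding is true (Ioannidis 2005).

    R     -- pre-study odds of a true relationship among those tested
    alpha -- type I error rate
    beta  -- type II error rate (power is 1 - beta)
    u     -- bias: proportion of analyses reported as findings only through bias
    """
    true_pos = (1 - beta) * R + u * beta * R
    false_pos = alpha + u * (1 - alpha)
    return true_pos / (true_pos + false_pos)

# adequately powered study of a fairly plausible hypothesis, no bias:
print(f"{ppv(R=0.2):.2f}")            # about 0.76
# the same study with modest bias:
print(f"{ppv(R=0.2, u=0.2):.2f}")     # about 0.41
# exploratory search: 1 true relationship per 1,000 probed, no bias:
print(f"{ppv(R=0.001):.3f}")          # about 0.016
```

The last case is the one closest to a 49,000-gene screen: with tiny pre-study odds, even a "significant" finding is almost certainly false.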

Page 27

Main results

Research findings are less likely to be true:

• the smaller the studies conducted in a scientific field,

• the smaller the effect sizes in a scientific field,

• the greater the number and the lesser the selection of tested relationships in a scientific field,

• the greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field,

• the greater the financial and other interests and prejudices in a scientific field,

• the hotter a scientific field (with more scientific teams involved).

Page 28

Page 29

An example


PLoS Medicine | www.plosmedicine.org 0699

alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].

These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields

In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable

Box 1. An Example: Science at Low Pre-Study Odds

Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10^-4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^-4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10^-4.

Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10^-4. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^-4, hardly any higher than the probability we had before any of this extensive research was undertaken!

[Figure 2. PPV (probability that a research finding is true) as a function of the pre-study odds, for various numbers of conducted studies, n. Panels correspond to power of 0.20, 0.50, and 0.80. DOI: 10.1371/journal.pmed.0020124.g002]

August 2005 | Volume 2 | Issue 8 | e124