Teaching reproducible research in bioinformatics

Robert Castelo [email protected]

@robertclab

Dept. of Experimental and Health Sciences

Universitat Pompeu Fabra

1st. HEIRRI Conference - Barcelona - March 18th, 2016


Bioinformatics


Funding and scientific output

Funding in genomics research has reached $2.9 billion/year worldwide (Pohlhaus & Cook-Deegan, BMC Genomics, 9:472, 2008).

The number of articles published every year in the biomedical literature (as indexed in PubMed) has doubled in the last 15 years.

[Bar chart: Nr. of articles in PubMed per year (y-axis, 0 to 1,200,000) by year of publication (x-axis, 1950 and 2001-2015).]

Source: http://www.ncbi.nlm.nih.gov/pubmed


Problems: drug development stagnation

The number of new drugs approved per billion US dollars spent on R&D has halved every 9 years since 1950, falling around 80-fold in inflation-adjusted terms (Scannell et al., Nat Rev Drug Discov, 11:191-200, 2012).

Source: Fig. 1. Scannell et al. Diagnosing the decline in pharmaceutical R&D efficiency. Nat Rev Drug Discov, 11:191-200, 2012.


Problems: lack of reproducibility

[Screenshot: Ioannidis JPA et al. Repeatability of published microarray gene expression analyses. Nat Genet, 41:149-155, 2009. Two teams of analysts independently evaluated one table or figure from each of 18 microarray-based gene expression articles published in Nature Genetics in 2005-2006: two analyses were reproduced in principle, six partially or with some discrepancies, and ten could not be reproduced, mainly because of data unavailability and incomplete annotation of data processing and analysis.]

[Screenshot: Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov, 2011. Analysis of the reproducibility of published data in 67 in-house target identification and validation projects (oncology, women's health and cardiovascular diseases): exciting published key data could often not be reproduced in-house, and inconsistencies frequently led to project termination (Fig. 1 of the correspondence compares reproducible and irreproducible projects).]


Problems: growth of retractions

[Screenshot: Fig. 1 of Brembs et al., “Current trends in the reliability of science”: (A) exponential fit of PubMed retraction notices; (B) individual study effect size declines significantly with year of publication (candidate gene studies of DRD2 and alcoholism); (C) the bias score of individual studies is significantly positively correlated with journal impact factor; (D) linear regression between impact factor and Fang and Casadevall's Retraction Index.]


Source: Fig. 1. Brembs et al. Deep impact: unintended consequences of journal rank. Front Hum Neurosci, 7:291, 2013.

[Screenshot: Fang FC, Steen RG, Casadevall A. Misconduct accounts for the majority of retracted scientific publications. PNAS, 109:17028-17033, 2012. A review of all 2,047 retracted biomedical and life-science articles indexed by PubMed found that only 21.3% of retractions were attributable to error, whereas 67.4% were attributable to misconduct, including fraud or suspected fraud (43.4%), duplicate publication (14.2%) and plagiarism (9.8%); retractions for fraud have increased roughly 10-fold since 1975.]

The rate of retractions grows and is higher in high-impact-factor (HIF) journals.

“Too good to be true” results, reviewer conflicts of interest and poor methodology seem to be potential causes (Brembs et al., 2013).


Society, journals and funding agencies start reacting

[Screenshot: Collins FS, Tabak LA. NIH plans to enhance reproducibility. Nature, 505:612-613, 2014. The NIH leaders argue that, with rare exceptions, irreproducibility is not caused by scientific misconduct; contributing factors include poor training in experimental design, emphasis on provocative statements over technical detail, incomplete reporting, publication bias and the incentives set by funding agencies, academic centres and publishers, with preclinical research being the area currently most susceptible.]

“Poor training is probably responsible for at least some of the challenges” (of reproducibility).


Reproducibility vs Replicability

Reproducibility: start from the same samples/data, use the same methods, get the same results.

Replicability: conduct the experiment again with independent samples and/or methods to get confirmatory results.

Replicability = Reproducibility + Conducting the experiment again

Replicability and reproducibility might be challenging in epidemiology (recruiting a cohort again) or molecular biology (complex cell manipulation).


Reproducible research in bioinformatics

In computationally oriented research (e.g., bioinformatics), reproducibility is always feasible (Claerbout and Karrenbach. Electronic documents give reproducible research a new meaning, 1992).

The software toolkit is already there (literate programming, version control systems, unit testing, data and code repositories, etc.), as sketched below.
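As a minimal, hedged illustration of two of these toolkit pieces (not part of the original slides), the sketch below pairs a scripted analysis step in R with a unit test written with the testthat package; the function name and the toy matrices are hypothetical.

```r
## Minimal sketch, assuming the testthat package is installed; the analysis
## function and the toy matrices are hypothetical examples.
library(testthat)

## A scripted analysis step: compute per-gene log2 fold changes between two
## groups of expression values (rows = genes, columns = samples).
log2_fold_change <- function(treated, control) {
  log2(rowMeans(treated) / rowMeans(control))
}

## A unit test guarding the analysis step: doubling every value should give
## a log2 fold change of exactly 1 for every gene.
test_that("log2_fold_change recovers a two-fold change", {
  treated <- matrix(4, nrow = 2, ncol = 3)
  control <- matrix(2, nrow = 2, ncol = 3)
  expect_equal(log2_fold_change(treated, control), c(1, 1))
})
```

Kept under a version control system such as git and re-run after every change, a test like this documents that the analysis code still behaves as intended.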

[Screenshot: Peng RD. Reproducible research in computational science. Science, 334:1226-1227, 2011. Reproducibility, i.e., making available the data and the computer code used to analyze them, is proposed as an attainable minimum standard for assessing scientific claims when full independent replication is not feasible. Fig. 1 depicts the reproducibility spectrum: from publication only (not reproducible), through publication plus code, plus code and data, plus linked and executable code and data, up to full replication (the gold standard).]


“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

–Jon Claerbout, Stanford University


Reproducible research in bioinformatics

Short project in the “Information Extraction from Omics Technologies” (IEO) subject of the UPF MSc Program in Bioinformatics for Health.

Students pick a publicly available high-throughput gene expression profiling data set.

They try to reproduce one of the figures of the accompanying paper or address a simple question on differential expression with a subset of the data.

Data analyses must be reproducible: coded in R (statistical language/software) within a dynamic markdown document (literate programming) that can automatically regenerate results and figures (a minimal sketch is shown after this list).

Results must be described in a short 4-6 page report following a given template, where the coded analyses constitute its supplementary material.

Evaluation uses a rubric the students know in advance: justified analysis decisions, paper sectioning, text flow, figure labeling, self-contained captions, etc.
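As a minimal sketch of what such a dynamic markdown document could look like (a hypothetical illustration, not the actual course template; the title, chunk names and simulated data are placeholders):

````markdown
---
title: "Supplementary material (hypothetical example)"
output: html_document
---

```{r session-info}
# Reporting the R session makes the computational environment part of the
# report itself.
sessionInfo()
```

```{r figure-1, fig.cap = "A figure regenerated every time the document is rendered."}
# Placeholder analysis: in the real project this chunk would load the chosen
# public expression data set and reproduce one figure of the paper.
set.seed(123)
expr <- rnorm(5000, mean = 8, sd = 2)
hist(expr, breaks = 50, main = "Simulated expression values")
```
````

Rendering the file, for instance with rmarkdown::render("supplementary.Rmd"), re-executes every chunk, so the figures and numbers in the report cannot drift away from the code that produced them.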


Concluding remarks

Replicability is at the core of the scientific method, but might be challenging to achieve.

Reproducibility is a minimum standard and is always feasible in bioinformatics, provided that code and data are available.

The software toolkit for reproducible research is available and growing (see, e.g., Roger Peng's Coursera materials at http://github.com/rdpeng/courses).

“The most important tool is the mindset, when starting, that the end product will be reproducible”

–Keith Baggerly, MD Anderson (shared via Twitter by @kwbroman)
