Reproducible Research
Statistical Consulting Topic
Discussion based on article in AMSTAT News, January 2011: SCIENCE POLICY Reproducible Research Keith A. Baggerly and Donald A. Berry, MD Anderson Cancer Center
• http://www.nature.com/nature/focus/reproducibility/
• Special issue.
Reproducible Research (RR)
1. Can another lab collect new data and reproduce the findings that have been published? • Perhaps the most common way to think of RR.
2. Can another statistician use the SAME DATA (provided by the author) and find the same published results and conclusions? • Topic of this article.
Why not reproducible?
• Ethical issue – Perhaps not presented honestly
• Vaccine/autism, cold fusion, high voltage wires/cancer
• Technical issue – Coding issue – Data issue
• Should be reproducible if you have these things… – Code – Data – Analysis
Raw Data → 'normalization' (pre-processing) → Analysis
Data Explosion
• “As data sets become more voluminous and complex, RR is increasingly important because our intuition fails in high dimensions.”
– e.g. Gene expression analysis
Many choices
Non-RR discussed in AMSTAT Article
• Earlier Published Report: Gene expression patterns could predict patient response to cancer chemotherapy.
– Method description in paper was incomplete – Tried to re-construct the analysis – Uncovered basic errors (e.g. mis-labeling) – “Bioinformatic Forensics” was published
– Big consequences (resignation, retraction)
Responsible research
• Be well organized • Double-check data & coding • Include ‘spot-checks’ (do a few plots of data) • Mistakes can happen to the best of us – Disconcerting to have others dissect your work – Done for accuracy (not accusation) – Done to evolve methods to the next step
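A minimal sketch of what 'double-check' and 'spot-check' steps could look like in R. The data, object names, and group labels below are hypothetical toy values invented for illustration; they are not from the article.

```r
# Hypothetical toy data: 6 arrays, 3 per group, with sample labels stored separately.
set.seed(1)  # fixed seed so the toy data are identical on every run
expr <- matrix(rnorm(5 * 6), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("array", 1:6)))
labels <- data.frame(array = paste0("array", 1:6),
                     group = rep(c("control", "treated"), each = 3),
                     stringsAsFactors = FALSE)

# Check 1: sample labels line up with the data columns (guards against mis-labeling).
stopifnot(identical(colnames(expr), labels$array))

# Check 2: group sizes match the intended design.
print(table(labels$group))
stopifnot(all(table(labels$group) == 3))

# Check 3: no missing or non-finite values slipped in during pre-processing.
stopifnot(all(is.finite(expr)))

# Spot-check plot: one box per array; an array that looks very different
# from the others deserves a closer look.
boxplot(expr, main = "Per-array distributions (spot-check)", las = 2)
```

Because each check is a stopifnot() call, the script halts at the first failed check instead of carrying a labeling or data error into the analysis.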
Journal Editors Taking Note
• Biostatistics journal • Associate editor (AE) of reproducibility – Grading submitted papers
• D: Data is available • C: Code is available • R: AE can re-run the code and reproduce the results without much difficulty
Peng, Roger D. (2009). Reproducible research and Biostatistics. Biostatistics, Vol. 10: 405-408.
Journal Editors May Also Be Skeptical of Forensic Process
• JASA past editor (David Banks): “… my own sense is that very few applied papers are perfectly reproducible. Most do not come with code or data, and even if they did, I expect careful check would find discrepancies from the publish paper. The reasons are innocent: code written by graduate students is continually tweaked and has sketchy documentation.”
(but he acknowledges the need)
“Clearly, the problem of RR is fundamental to the integrity of science and public trust… We should be optimistic about the future, and proud of the role that statisticians have played in creating it.”
Submission-Publication Timeline*
[Figure: timeline with year ticks 2005 through 2009 (July 2007 marked). Events shown: Submission, Reviewer comments, Re-Submit, Accepted, Appears. Month labels on the figure: Feb, Apr, May, Sept, Feb.]
*Genetics: A Publication of The Genetics Society of America
Requests for data (month labels on the figure: Oct, Aug, Sept)
• co-author, Weizmann Inst. of Science (Israel)
• Johns Hopkins
4+ years for accessing data
Code/Data for the Public
• Be prepared, you never know which article will be popular with requests – Some public databases for uploading
• Easy to follow organization is helpful for user (and yourself)
• readme.txt • description_of_files.txt
• Commenting your code • Transparency
Example Code
Master file content.
File holding all functions called by master file.
DeCook, R., Nettleton, D., Foster, C., and Wurtele, E.S. (2006). Identifying differentially expressed genes in unreplicated multiple-treatment microarray timecourse experiments. Computational Statistics and Data Analysis, Vol. 50: 518-532.
readme.txt
We had requests for data and code from this paper from 2006 – 2010.
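The master-file / functions-file / readme.txt organization can be sketched roughly as follows. This is only an illustration: the file names (master.R, functions.R), the helper mean_diff(), and the toy data are all hypothetical and are not the actual files distributed with the 2006 paper. The sketch writes into a temporary directory so it can be run as is.

```r
# Hypothetical layout; file names are illustrative, not the authors' actual files.
project_dir <- file.path(tempdir(), "rr_example")
dir.create(project_dir, showWarnings = FALSE)

# functions.R: holds every function the master file calls.
writeLines(c(
  "# Difference in group means for one gene (group is a 0/1 vector).",
  "mean_diff <- function(y, group) mean(y[group == 1]) - mean(y[group == 0])"
), file.path(project_dir, "functions.R"))

# readme.txt: tells a reader where to start and what each file is.
writeLines(c(
  "master.R     - run this file; it reproduces the analysis from start to finish",
  "functions.R  - all functions called by master.R"
), file.path(project_dir, "readme.txt"))

# --- what master.R would contain ---
source(file.path(project_dir, "functions.R"))  # load the helper functions
set.seed(2006)                  # fix the seed so reruns give the same numbers
y <- rnorm(6)                   # toy data standing in for one gene
group <- c(1, 1, 1, 0, 0, 0)    # 3 arrays per group
result <- mean_diff(y, group)
print(result)
print(R.version.string)         # record the R version used for the run
```

Keeping functions in their own file means a requester only needs to read the master file to see the order of the analysis, and the readme tells them which file that is.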
Reading Other Author’s Code
Design matrix for a 2-sample (t-test) permutation test
######## FDR by permutations
choose(6,3)   # number of unique permutations for 6 arrays, 3 per group
samp.1=expand.grid(0:1,0:1,0:1,0:1,0:1,0:1)   # 2^6=64 possible 6-element vectors
samp.1=samp.1[apply(samp.1,1,sum)==3,]   # subset to 20 rows with exactly three 1's
samp.1=as.matrix(samp.1)
> samp.1
   Var1 Var2 Var3 Var4 Var5 Var6
8     1    1    1    0    0    0
12    1    1    0    1    0    0
14    1    0    1    1    0    0
….
53    0    0    1    0    1    1
57    0    0    0    1    1    1
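One way a matrix like samp.1 could be used, sketched for a single gene: compute the two-sample t statistic under each of the 20 relabelings and compare with the observed labeling. The expression values y and the helper t_stat() are hypothetical, and this shows only a per-gene permutation p-value, not the FDR procedure of the paper.

```r
# Rebuild the 20 x 6 assignment matrix: each row marks which 3 of the 6 arrays
# are placed in group 1 (coded 1) versus group 0.
samp.1 <- expand.grid(0:1, 0:1, 0:1, 0:1, 0:1, 0:1)
samp.1 <- as.matrix(samp.1[rowSums(samp.1) == 3, ])

# Hypothetical expression values for one gene on 6 arrays;
# arrays 1-3 are observed in group 1, arrays 4-6 in group 0.
y <- c(5.1, 4.8, 5.4, 3.9, 4.2, 4.0)

# Pooled-variance two-sample t statistic for a given 0/1 assignment g.
t_stat <- function(y, g) {
  unname(t.test(y[g == 1], y[g == 0], var.equal = TRUE)$statistic)
}

# Statistic under every one of the 20 relabelings (one per row of samp.1).
perm_t <- apply(samp.1, 1, function(g) t_stat(y, g))

# Statistic for the observed labeling 1 1 1 0 0 0.
obs_t <- t_stat(y, c(1, 1, 1, 0, 0, 0))

# Two-sided permutation p-value: share of relabelings at least as extreme
# as the observed one (small tolerance guards against floating-point ties).
p_perm <- mean(abs(perm_t) >= abs(obs_t) - 1e-12)
print(p_perm)
```

With 3 arrays per group there are only 20 relabelings, so the smallest attainable two-sided p-value for one gene is 2/20 = 0.1.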
Reading Other Author’s Code
Read journal articles → better journal article writer
Read other’s R code → better R coder
Read other’s SAS code → better SAS coder
References
Peng, Roger D. (2009). Reproducible research and Biostatistics. Biostatistics, Vol. 10: 405-408.
Baggerly, K.A., Morris, J.S., Edmonson, S.R. and K.R. Coombes (2005). Signal to noise: Evaluating reported reproducibility of serum proteomic tests for ovarian cancer. Journal Natl Cancer Inst, Vol. 97: 307-309.
Baggerly, K.A. and D.A. Berry (2011). Science policy: Reproducible research. AMSTAT News, January(403): 16-17.
Banks, D. (2011). Reproducible research: A range of response. Statistics, Politics, and Policy, Vol. 2, Issue 1, Article 4.