Software and Workflow
Christensen
Introduction
Problems: Irreproducible Workflow
Solutions: Workflow
Literate Programming
Workflow Suggestions
Version Control
Dynamic Documents
Conclusion
Software and Workflow for Reproducible Research
Garret Christensen
UC Berkeley: Berkeley Initiative for Transparency in the Social Sciences; Berkeley Institute for Data Science
Annual Meeting, December 2015
Outline
1 Introduction
2 Problems
    Irreproducible Workflow
3 Solutions
    Workflow
    Literate Programming
    Workflow Suggestions
    Version Control
    Dynamic Documents
4 Conclusion
Reproducibility & Transparency
What are practical tools to implement reproducibility solutions?
Problems
Data not available
Code not available/unintelligible
Code and data cannot reproduce original results
Irreproducible Workflow
Even with the help of the original author (yourself?), you can’t get the data to reproduce the published results. Or you just can’t find the data to begin with.
Journal of Money, Credit, and Banking Project (Dewald et al., AER 1986).
Martin Feldstein on Social Security and private savings; Reinhart and Rogoff on debt and GDP growth.
Reproducible Workflow
Literate Programming
Version Control
    GitHub
    OSF
Dynamic Documents
    R Markdown and R Studio
    Ketchup in Stata
Data Sharing
    Harvard’s Dataverse
Literate Programming
First, programming is key to reproducibility. Working in Excel is not reproducible.
See the Reinhart and Rogoff “Growth in a Time of Debt” controversy:
    Original paper, AER P&P 2010
    Herndon et al. (2013) finding
    New Yorker summary
Random number generation in Excel: set the seed with the Data Analysis Toolpak.
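The same point about seeds applies in any scripted language. Below is a minimal sketch in Python, using only the standard library; the function name and the seed value are made up for illustration and are not from the talk.

```python
import random


def simulate_sample(seed, n=5):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    # A private generator keeps the draws independent of any other code
    # that might also use the global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Hypothetical seed, chosen only for this illustration.
first_run = simulate_sample(seed=20151210)
second_run = simulate_sample(seed=20151210)

# The same seed gives the same draws, so a simulation can be re-run exactly.
print(first_run == second_run)
```

Recording the seed in the script itself, rather than setting it by hand in a dialog box, means the choice is saved along with the rest of the analysis.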
Literate Programming
If you are using SPSS, using ‘syntax’ to record all the commands you run is simple. (See UCLA tutorial.)
Similarly in Stata, use the command log (‘cmdlog’).
Better is to write scripts: R, Stata, SAS, Python, or whatever you please.
Open source has some advantages (being free, for one), but you’re going to use what everyone in your field uses.
Literate Programming
Second, literate programming is key to reproducibility. Write code to be read by a human being, with the code for the computer secondary.
Literate Programming
“I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
(cont.)
Literate Programming
“The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.”
–Donald Knuth, The Computer Journal, 1984
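As a rough illustration of writing for the human reader first, here is a short Python sketch. The function, the threshold, and the observations are all hypothetical and invented for this example; the point is only the style: names that say what they hold, and an explanation placed where the reader needs it.

```python
# Share of GDP above which an observation counts as "high debt" in this
# illustration. The value is an assumption made for the example.
HIGH_DEBT_THRESHOLD = 0.90


def mean_growth_for_high_debt(observations, threshold=HIGH_DEBT_THRESHOLD):
    """Return mean GDP growth over observations whose debt ratio exceeds `threshold`.

    Each observation is a (debt_to_gdp_ratio, gdp_growth_percent) pair.
    Observations at or below the threshold are excluded on purpose, and the
    exclusion rule is stated here so a reader can check it against the paper.
    """
    high_debt_growth = [
        growth for debt_ratio, growth in observations if debt_ratio > threshold
    ]
    if not high_debt_growth:
        raise ValueError("no observations above the threshold")
    return sum(high_debt_growth) / len(high_debt_growth)


# Made-up numbers, not real data.
made_up_observations = [(0.95, 1.0), (1.10, 3.0), (0.40, 4.0)]
print(mean_growth_for_high_debt(made_up_observations))  # prints 2.0
```

A spreadsheet formula can compute the same average, but the selection rule would be hidden in a cell range rather than written out where a reader can see it.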
Organizing and Recording Workflow
“Reproducibility is just collaboration with people you don’t know, including yourself next week”
—Philip Stark, UC Berkeley Statistics
Organizing and Recording Workflow
Practical coding and organizational suggestions
Long (2008), The Workflow of Data Analysis Using Stata
Making any changes to a file that has been posted/shared means it gets a new name.
Use version commands to ensure others get the same results.
Keep a daily research log.
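Two of these suggestions, noting the software version and keeping a dated log, can be partly automated. The sketch below is one possible approach in Python, using only the standard library; the function names, log format, and file name are assumptions made for illustration. Note that it only records the interpreter version; it does not make a newer interpreter behave like an older one.

```python
import datetime
import os
import platform
import tempfile


def environment_record():
    """Collect the date and the software details behind a set of results."""
    return {
        "recorded_on": datetime.date.today().isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }


def append_to_research_log(log_path, note):
    """Append one dated line, tagged with the Python version, to a plain-text log."""
    record = environment_record()
    entry = f"{record['recorded_on']} | Python {record['python_version']} | {note}"
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(entry + "\n")
    return entry


# Demonstration in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as scratch_dir:
    log_path = os.path.join(scratch_dir, "research_log.txt")  # hypothetical name
    append_to_research_log(log_path, "Renamed shared file after editing it")
    append_to_research_log(log_path, "Re-ran main regressions")
    with open(log_path, encoding="utf-8") as log_file:
        logged_lines = log_file.read().splitlines()

print(len(logged_lines))  # prints 2
```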
Version Control
Using version control (AKA revision control) can help to make your work more reproducible.
What is version control?
“Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.”
–Git, About Version Control
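To make the definition concrete, here is a toy sketch in Python of "recording changes so that you can recall specific versions later". It is not how Git is implemented and is no substitute for it; the class name, its methods, and the file contents are invented for illustration.

```python
import hashlib


class TinyHistory:
    """Toy in-memory history of one file's contents (illustration only)."""

    def __init__(self):
        self._snapshots = {}  # version id -> full content at that version
        self._entries = []    # (version id, message) pairs, oldest first

    def commit(self, content, message):
        """Store a snapshot and return a short id derived from the content."""
        version_id = hashlib.sha1(content.encode("utf-8")).hexdigest()[:8]
        self._snapshots[version_id] = content
        self._entries.append((version_id, message))
        return version_id

    def recall(self, version_id):
        """Return the exact content that was stored under `version_id`."""
        return self._snapshots[version_id]

    def log(self):
        """Return the (version id, message) pairs in the order they were made."""
        return list(self._entries)


history = TinyHistory()
first_id = history.commit("regress wage education\n", "first specification")
second_id = history.commit("regress wage education age\n", "add age control")

# The earlier version is still available after the file has changed.
print(history.recall(first_id), end="")
```

A real system such as Git adds what this toy leaves out: storage on disk, many files at once, authorship, branching, and merging.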
Version Control
With version control you can:
    Collaborate
    Track who made every change
    Easily switch between versions of files
    Compare versions of files
    Backup
    Work with the same files on different machines
    Experiment with a new version of code without breaking things
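Comparing versions is the easiest of these capabilities to show in a few lines. The sketch below uses Python's standard difflib module to produce the kind of line-by-line comparison a version control system reports; the file name and the two versions of the do-file are hypothetical.

```python
import difflib

# Two hypothetical versions of the same Stata do-file, as lists of lines.
old_version = [
    "use survey.dta\n",
    "regress wage education\n",
]
new_version = [
    "use survey.dta\n",
    "regress wage education age\n",
]

# unified_diff yields header lines, then lines prefixed with '-' (removed),
# '+' (added), or a space (unchanged context).
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="regressions.do (old)",
        tofile="regressions.do (new)",
    )
)
print("".join(diff_lines), end="")
```

The output marks the old regression line as removed and the new one as added, which is the information the "date and initial" file names leave you to reconstruct by hand.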
Version Control
Places you’re already using version control without knowing it:
    Google Docs
    Wikipedia
    Every piece of software you use.
Version Control
Isn’t this just a complicated version of the “date and initial” method?
    regressions2015.08.24.do
    regressions2015.08.25.do
    regressions2015.08.25GC.do
Hassle
Confusion
Version Control
Here is a good rule of thumb: If you are trying to solve a problem, and there are multi-billion dollar firms whose entire business model depends on solving the same problem, and there are whole courses at your university devoted to how to solve that problem, you might want to figure out what the experts do and see if you can’t learn something from it....
Not one piece of commercial software you have on your PC, your phone, your tablet, your car, or any other modern computing device was written with the “date and initial” method.
–Matthew Gentzkow and Jesse M. Shapiro, “Code and Data for the Social Sciences: A Practitioner’s Guide”
Examples
GitHub and OSF examples:
Slides for this workshop on GitHub: http://www.github.com/bitss/annual2015
Slides also available on the Open Science Framework: https://osf.io/7pbm5/
Dynamic Documents
Even if you write perfect (version controlled) code, you can still run into problems going from your code to paper. This is where dynamic documents come in.
A dynamic document includes your data, code, analysis, and output all in one place. Because the process is fully automated, mistakes from copying and pasting are eliminated.
Do this with R Markdown in R Studio or Markdoc in Stata.
Dynamic Documents
Include tables by linking to a file, instead of a static image.
Include numbers by linking to a value calculated by an analysis file, instead of a static number typed manually.
Automatically update tables and numbers.
Produce the entire paper with one or two clicks.
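The idea of linking a number to the analysis can be sketched without R Markdown or Markdoc. The Python example below, using only the standard library, fills a sentence of the "paper" from computed values; the data, variable names, and sentence are made up, and a real dynamic document tool does much more (tables, figures, full rendering).

```python
import statistics
from string import Template

# Made-up data standing in for the output of an analysis file.
hourly_wages = [12.0, 15.0, 18.0]

# The sentence holds placeholders instead of numbers typed by hand.
sentence_template = Template(
    "The mean hourly wage in the sample is ${mean_wage} dollars (N = ${n})."
)

# Every number in the text is computed, so changing the data updates the text.
rendered_sentence = sentence_template.substitute(
    mean_wage=f"{statistics.mean(hourly_wages):.2f}",
    n=len(hourly_wages),
)
print(rendered_sentence)
# prints: The mean hourly wage in the sample is 15.00 dollars (N = 3).
```

If an observation is added or corrected, re-running the script regenerates the sentence, so the text and the analysis cannot drift apart.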
Examples
R Studio Example
Stata Example
Conclusion
Simple tools exist to help you transparently and reproducibly take your research from beginning to end.
    Version Control
    Open Science Framework
    Dynamic Documents
    Trusted Public Data Archive
Read more in my Manual of Best Practices in Transparent Social Science Research on GitHub.