software and workflow for reproducible research · simple tools exist to help you transparently and...

34
Software and Workflow Christensen Introduction Problems Irreproducible Workflow Solutions Workflow Literate Programming Workflow Suggestions Version Control Dynamic Documents Conclusion Software and Workflow for Reproducible Research Garret Christensen 1 1 UC Berkeley: Berkeley Initiative for Transparency in the Social Sciences Berkeley Institute for Data Science Annual Meeting, December 2015

Upload: others

Post on 08-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Software and Workflow for ReproducibleResearch

Garret Christensen1

1UC Berkeley: Berkeley Initiative for Transparency in the Social SciencesBerkeley Institute for Data Science

Annual Meeting, December 2015

Page 2: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Outline

1 Introduction

2 ProblemsIrreproducible Workflow

3 SolutionsWorkflowLiterate ProgrammingWorkflow SuggestionsVersion ControlDynamic Documents

4 Conclusion

Page 3: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Reproducibility & Transparency

What are practical tools to implementreproducibility solutions?

Page 4: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Problems

Data not availableCode not available/unintelligibleCode and data cannot reproduce original results

Page 5: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Irreproducible Workflow

Even with the help of the original author (yourself?),you can’t get the data to reproduce the publishedresults. Or you just can’t find the data to begin with.Journal of Money, Credit, and Banking Project. (Dewaldet al., AER 1986)Martin Feldstein on Social Security and private savings,Reinhart and Rogoff on debt and GDP growth.

Page 6: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Reproducible Workflow

Literate ProgramingVersion control

GithubOSF

Dynamic DocumentsR Markdown and R StudioKetchup in Stata

Data SharingHarvard’s Dataverse

Page 7: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Literate Programming

First, programming is key to reproducibility. Working inExcel is not reproducible.See Reinhart and Rogoff “Growth in a Time of Debt”controversy:

Original Paper, AER P & P 2010Herndon et. al (2013) finding.New Yorker summary.

Random number generation in Excel–set seed withData Analysis Toolpak.

Page 8: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Literate Programming

If you are using SPSS, use of ‘syntax’ to record all thecommands you run is simple. (See UCLA tutorial.)Similarly in Stata, ‘commandlog’.Better is to write scripts. R, Stata, SAS, Python, orwhatever you please.Open source has some advantages (being free, forone) but you’re going to use what everyone in your fielduses.

Page 9: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Literate Programming

Second, literate programming is key to reproducibility.Write code to be read by a human being, with the codefor the computer secondary.

Page 10: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Literate Programming

“I believe that the time is ripe for significantly betterdocumentation of programs, and that we can bestachieve this by considering programs to be worksof literature. Hence, my title: “LiterateProgramming.”Let us change our traditional attitude to theconstruction of programs: Instead of imagining thatour main task is to instruct a computer what to do,let us concentrate rather on explaining to humanbeings what we want a computer to do.

(cont.)

Page 11: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Literate Programming

“The practitioner of literate programming can beregarded as an essayist, whose main concern iswith exposition and excellence of style. Such anauthor, with thesaurus in hand, chooses the namesof variables carefully and explains what eachvariable means. He or she strives for a programthat is comprehensible because its concepts havebeen introduced in an order that is best for humanunderstanding, using a mixture of formal andinformal methods that reinforce each other.”

–Donald Knuth The Computer Journal, 1984 Quotes Original

Page 12: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Organizing and Recording Workflow

“Reproducibility is just collaboration with peopleyou don’t know, including yourself next week”

—Philip Stark, UC Berkeley Statistics

Page 13: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Organizing and Recording Workflow

Practical coding and organizational suggestionsLong (2008) The Workflow of Data Analysis Using StataMaking any changes to a file that has beenposted/shared means it gets a new name.Use version commands to ensure others get sameresults.Keep a daily research log.

Page 14: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Using version control (AKA revision control) can help tomake your work more reproducible.What is version control?

Version control is a system that recordschanges to a file or set of files over time sothat you can recall specific versions later. Forthe examples in this book you will usesoftware source code as the files beingversion controlled, though in reality you can dothis with nearly any type of file on a computer.

–Git, About Version Control

Page 15: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Using version control (AKA revision control) can help tomake your work more reproducible.What is version control?

Version control is a system that recordschanges to a file or set of files over time sothat you can recall specific versions later. Forthe examples in this book you will usesoftware source code as the files beingversion controlled, though in reality you can dothis with nearly any type of file on a computer.

–Git, About Version Control

Page 16: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Using version control (AKA revision control) can help tomake your work more reproducible.What is version control?

Version control is a system that recordschanges to a file or set of files over time sothat you can recall specific versions later. Forthe examples in this book you will usesoftware source code as the files beingversion controlled, though in reality you can dothis with nearly any type of file on a computer.

–Git, About Version Control

Page 17: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

With version control you can:CollaborateTrack who made every changeEasily switch between versions of filesCompare versions of filesBackupWork with the same files on different machinesExperiment with a new version of code withoutbreaking things

Link1 Link2 Link3

Page 18: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Places you’re already using version control without knowingit:

Page 19: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science
Page 20: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science
Page 21: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Places you’re already using version control without knowingit:

Google DocsWikipediaEvery piece of software you use.

Page 22: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Isn’t this just a complicated version of the “date and initial”method?

regressions2015.08.24.doregressions2015.08.25.doregressions2015.08.25GC.doHassleConfusion

Page 23: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Version Control

Here is a good rule of thumb: If you are trying tosolve a problem, and there are multi-billion dollarfirms whose entire business model depends onsolving the same problem, and there are wholecourses at your university devoted to how to solvethat problem, you might want to figure out what theexperts do and see if you can’t learn somethingfrom it....Not one piece of commercial software you have onyour PC, your phone, your tablet, your car, or anyother modern computing device was written withthe “date and initial” method.

–Matthew Gentzkow and Jesse M. Shapiro “Code and Datafor the Social Sciences: A Practitioner’s Guide”

Page 24: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science
Page 26: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science
Page 27: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science
Page 28: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Examples

GitHub and OSF Examples:Slides for this workshop on Github.comhttp://www.github.com/bitss/annual2015

Slides also available on the Open Science Frameworkhttps://osf.io/7pbm5/

Page 29: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Dynamic Documents

Even if you write perfect (version controlled) code, youcan still run into problems going from your code topaper. This is where dynamic documents come in.A dynamic document includes your data, code,analysis, and output all in one place. Fully automated,you can guarantee no mistakes from copying andpasting.Do this with R Markdown in R Studio or Markdoc inStata.

Page 30: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Dynamic Documents

Even if you write perfect (version controlled) code, youcan still run into problems going from your code topaper. This is where dynamic documents come in.A dynamic document includes your data, code,analysis, and output all in one place. Fully automated,you can guarantee no mistakes from copying andpasting.Do this with R Markdown in R Studio or Markdoc inStata.

Page 31: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Dynamic Documents

Even if you write perfect (version controlled) code, youcan still run into problems going from your code topaper. This is where dynamic documents come in.A dynamic document includes your data, code,analysis, and output all in one place. Fully automated,you can guarantee no mistakes from copying andpasting.Do this with R Markdown in R Studio or Markdoc inStata.

Page 32: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Dynamic Documents

Include tables by linking to a file, instead of a staticimage.Include number by linking to a value calculated by ananalysis file, instead of a static number typed manually.Automatically update tables and numbers.Produce entire paper with one or two clicks.

Page 33: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Examples

R Studio ExampleStata Example

Page 34: Software and Workflow for Reproducible Research · Simple tools exist to help you transparently and reproducibly take your research from beginning to end. Version Control Open Science

Software andWorkflow

Christensen

Introduction

ProblemsIrreproducibleWorkflow

SolutionsWorkflow

LiterateProgramming

WorkflowSuggestions

Version Control

Dynamic Documents

Conclusion

Conclusion

Simple tools exist to help you transparently and reproduciblytake your research from beginning to end.

Version ControlOpen Science FrameworkDynamic DocumentsTrusted Public Data Archive

Read more in my Manual of Best Practices in TransparentSocial Science Research on GitHub.