reproducible data science with r

16
Reproducible Data Science with R David Smith R Community Lead, Microsoft @revodavid

Upload: revolution-analytics

Post on 21-Jan-2018

13.801 views

Category:

Software


0 download

TRANSCRIPT

Page 1: Reproducible Data Science with R

Reproducible Data Science with R

David SmithR Community Lead, Microsoft

@revodavid

Page 2: Reproducible Data Science with R

What is reproducibility?

“Two honest researchers wouldget the same result”

– John Mount

• Transparent data sourcing and availability

• Fully automated analysis pipeline (code provided)

• Traceability from published results back to data

2

Page 3: Reproducible Data Science with R

Why Reproducibility?

• Save time

• Better science

• More authoritative research

• Reduce risk of errors

• Facilitate collaboration

3

Page 4: Reproducible Data Science with R

Why Reproducibility?

• Save time

• Better science

• More authoritative research

• Reduce risk of errors

• Facilitate collaboration

4

Page 5: Reproducible Data Science with R

Why Reproducibility?

• Save time

• Better science

• More authoritative research

• Reduce risk of errors

• Facilitate collaboration

5

Page 6: Reproducible Data Science with R

Why Reproducibility?

• Save time

• Better science

• More authoritative research

• Reduce risk of errors

• Facilitate collaboration

6

Page 7: Reproducible Data Science with R

Why Reproducibility?

• Save time

• Better science

• More authoritative research

• Reduce risk of errors

• Facilitate collaboration

7

Page 8: Reproducible Data Science with R

Accessing Data

• Data sources– databases, sensor logs, spreadsheets, file download, APIs, …

– Remember, these are sources: don’t modify them!

• Snapshot data into local static files (text is good)– You will likely include some ETL steps here

– Record a timestamp of when data was extracted: Sys.time()

– With big data, sample during development, scale up to finalize

– Document how this was all done, preferably with code (source file)!

• Import text files using R functions– Recommended package: readr (avoids issues with locale, dates, etc)

8

Page 9: Reproducible Data Science with R

Analysis process

• Interactively explore data & develop analyses as usual

• Capture the entire process in an R script– library(tidyverse) is helpful for cleaning, feature generation, etc.

– Re-usable components can be shared as (private) packages

• Generate artifacts using scripts– Graphics (please, no JPGs!)

– Tables

– Documents

• Include timestamps in output: Sys.time()

9

Page 10: Reproducible Data Science with R

A reproducible analysis environment

• Operating system and R version– For most purposes, not the biggest cause of issues

• But do document your R session info: sessionInfo()

– For production, consider VM / container

• A clean R environment– Organize work into independent projects (directories)

– Use relative paths in scripts

– Avoid use of .Rprofile

– Set explicit random seeds

– Do not save R workspace

10

Page 11: Reproducible Data Science with R

Managing a changing R package ecosystem

• One command to lock package versions to a specific date:checkpoint("2017-04-11")

(For collaborators, same command downloads required package versions)

• Applies just to this project– Global package upgrades won’t break reproducibility in other projects

• checkpoint package avail on CRAN, included with Microsoft R

11

Page 12: Reproducible Data Science with R

Presenting results

• Eliminate manual processes (as far as possible)– Annotations (graphs / tables)– Cut-and-paste into documents

• Notebooks– Combines code, output and

narrative– Good for collaboration with

other researchers

• Document Generation– Best for automating reports

12

Page 13: Reproducible Data Science with R

knitr / Rmarkdown

• Generate HTML, Word, or PDF reports

– Or books, blogs, slides, …

• Combine narrative and R code in a single document

• Human-readable, easy to edit

• Single-click update!

13

Page 14: Reproducible Data Science with R

Collaboration and sharing

• Just share R project folder

• Publish on Github

– Happy Git and GitHub for the useR, Jenny Bryanhttp://happygitwithr.com/

– Version retention and tracking

– Collaboration (code and comments)

14

Page 15: Reproducible Data Science with R

Take-Aways

Reproducibility is Beneficial

• Saves time

• Produces better science

• More trusted research

• Reduced risk of errors

• Encourages collaboration

Reproducibility is Simple

• Document and automate processes with R scripts

• Read and clean data with tidyverse

• Use checkpoint to manage package versions

• Generate documents with knitr

• Share reproducible projects with Github

15

Page 16: Reproducible Data Science with R

Reproducible Data Science with R

David SmithR Community Lead, Microsoft

@revodavid