reproducibility pi manifesto - icerm.brown.edu · reproducibility pi manifesto lorena a. barba...


Page 1: Reproducibility PI Manifesto - icerm.brown.edu · Reproducibility PI Manifesto Lorena A. Barba Mechanical Engineering, Boston University Introduction Importance of an Identity Identity

Reproducibility

PI Manifesto

Lorena A. Barba

Mechanical Engineering, Boston University


1. I will teach my graduate students about reproducibility

a. Lab notebook
b. Version control
c. Workflow
d. Publication-quality plots at group meetings

2. All our research code (and writing) is under version control.

a. Local svn repo for prototypes in Python/Matlab/CUDA C and for LaTeX documents (reports, manuscripts, etc.)
b. Google Code for released research codes
c. Bitbucket or GitHub for collaborative projects

3. We will always carry out verification and validation

V&V reports are posted to figshare

Example:

Validation of the cuIBM code for Navier-Stokes equations with immersed boundary methods. Anush Krishnan, Lorena A. Barba. figshare.

Retrieved 18:16, Dec 12, 2012 (GMT)

http://dx.doi.org/10.6084/m9.figshare.92789


4. For main results in a paper, we will share data, plotting script & figure under CC-BY

Posted to figshare

Get a DOI and use in the paper under CC-BY, with citation.

Example:

Weak scaling of parallel FMM vs. FFT up to 4096 processes. Lorena A. Barba, Rio Yokota. figshare.

Retrieved 18:23, Dec 12, 2012 (GMT)

http://dx.doi.org/10.6084/m9.figshare.92425

[Figure: parallel efficiency (0 to 1.2) versus number of processes (1 to 4096), for the FMM and the spectral method]


5. We will upload the preprint to arXiv at the time of submission of a paper.

... & update the preprint post peer-review.

Caveat

Collaboration with biologist: still an arXiv-unfriendly field (but author website posting allowed in this case)

[Figure: monthly submission rate for arXiv. First 16.7 years to 29 March ’08 = 470,055]

6. We will release code at the time of submission of a paper.

under MIT license

As a preparatory measure: I will declare this intention in grant proposals.

I have endorsed the Science Code Manifesto.

http://sciencecodemanifesto.org

7. We will add a “Reproducibility” declaration at the end of each paper.

8. I will keep an up-to-date web presence

Corollary: I will develop a consistent open science policy

Why do so many scientists have a terrible (or no) website?

@LorenaABarba

Please visit us at: http://barbagroup.bu.edu

Three themes

1. New publication models
2. Workflow standards
3. Social dynamic

ITN TrialShare: Promoting reproducible research and transparency in clinical trials

Adam Asare, PhD

Director, Data Analysis 12/11/2012

Reproducibility of results in published literature

Ioannidis, J. P. A. et al. Repeatability of published microarray gene expression analyses. Nat Genet 41(2), Feb 2009

Immune Tolerance Network (ITN)

• Clinical trials consortium whose mission is to accelerate the development of immune tolerance therapies
– Therapeutic areas include: allergy/asthma, transplantation, and autoimmune diseases
– Over 50 Phase II and III clinical trials at various stages of development
– NIH/NIAID funded for over 12 years, ~30M/year
– Numerous academic and industry partners

• Core objectives:
– Rigorous and reproducible research
– Adherence to standardized platforms and processes
– Use best available statistical methodology and data management practices
– Dissemination of data and results to research community

Web access to data and analyses: itntrialshare.org

Data analysis study archive

Manuscript figures and analysis code console

Download data and code base
Re-run analysis within system

Create a new paradigm for research publishing

– Generate high-profile manuscripts that are interactive at the data and code level
• Query for information regarding how the published data was generated
• Allow exploration of alternative analysis approaches

– Benefits realized
• Allows rapid review of data and analysis approaches by study team internally for manuscript development
• Public resource to research community on reference datasets and analysis approaches

– Expect additional functions to be phased in over time

Funding Attribution Slide: General ITN

FUNDED BY:

National Institute of Diabetes & Digestive & Kidney Diseases

National Institute of Allergy & Infectious Diseases

Food Allergy Initiative


Finding Canonically Represented Theorems

Sara Billey University of Washington

Reproducibility in Computational and Experimental Math

ICERM December 13, 2012

The Standard Inquiry

“Do you know anything about the following math problem?”

How many permutations of {1,2,…,n} have exactly two peaks in positions 2 and n-1?

For n=5 there are 16 such permutations, including
12 with peaks 4,5 such as [1 4 2 5 3]
4 with peaks 3,5 such as [1 3 2 5 4]

The Standard Reply

“Probably not!” But in this case I know where to find out if it is known. Let’s check OEIS!

Calculate a few terms:
n=4: 0
n=5: 16
n=6: 80
n=7: 288
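The terms on this slide can be recomputed by brute force. The sketch below is only an illustration (the function names are made up here, and it is not the speaker's code): it enumerates all permutations of {1,…,n} and keeps those whose peak set is exactly {2, n-1}.

```python
from itertools import permutations


def peak_positions(w):
    """Return the 1-based positions of the peaks of w.

    A peak is an interior entry that is larger than both of its neighbours.
    """
    return {i + 1 for i in range(1, len(w) - 1) if w[i - 1] < w[i] > w[i + 1]}


def count_with_peak_set(n, peaks):
    """Count permutations of {1, ..., n} whose peak set is exactly `peaks`."""
    target = set(peaks)
    return sum(
        1 for w in permutations(range(1, n + 1)) if peak_positions(w) == target
    )


# Terms for peaks in positions 2 and n-1, as on the slide.
terms = {n: count_with_peak_set(n, {2, n - 1}) for n in range(4, 8)}
print(terms)
```

The resulting sequence of integers is what one would paste into the OEIS search box.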

Experimentation to Theorem

Theorem [Billey-Burdzy-Sagan 2012] The number of permutations in $S_n$ with peak set S is given by $2^{n-|S|}$ times a fixed polynomial depending on S, provided n is larger than the maximum in S.
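A small experiment makes the shape of this statement visible. The sketch below assumes the peak set S = {2} (chosen here only for illustration), counts permutations by brute force, divides by the power of two as it reads on the slide, and checks that the quotients have constant first differences, as values of a degree-one polynomial in n would.

```python
from fractions import Fraction
from itertools import permutations


def peak_positions(w):
    """Return the 1-based positions of the peaks of w."""
    return {i + 1 for i in range(1, len(w) - 1) if w[i - 1] < w[i] > w[i + 1]}


def count_with_peak_set(n, peaks):
    """Count permutations of {1, ..., n} whose peak set is exactly `peaks`."""
    target = set(peaks)
    return sum(
        1 for w in permutations(range(1, n + 1)) if peak_positions(w) == target
    )


S = {2}  # illustrative choice of peak set
ns = range(3, 8)
# Quotient of the count by 2^(n - |S|), kept exact with Fraction.
ratios = [Fraction(count_with_peak_set(n, S), 2 ** (n - len(S))) for n in ns]
first_differences = [b - a for a, b in zip(ratios, ratios[1:])]
print(ratios)
print(first_differences)
```

Constant first differences over a handful of values are evidence, not proof; the theorem is what guarantees the pattern continues.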

Theorem Finders

The OEIS is a “Theorem Finder”. The theorems in the OEIS all have a canonical form represented by terms in an integer sequence that allow lookup.

What other types of theorems have canonical forms?
• Permutation Patterns (Tenner’s Database)
• Graphs
• Hypergeometric Series
• Finite Simple Groups

Theorem Finders

Goal: Each canonical theorem class should have a Theorem Finder database doing Community+Machine Learning.

Why should we do this?
• Enhances experimental mathematics.
• Makes unexpected connections between fields.
• Improves refereeing of articles.

See my Theorem Finder Homepage for more info.

David Koop

VisTrails Provenance SDK

Maya

ParaView

GIMP

Provenance and reproducibility are important in many tools/apps; don’t want to instrument each

Using ProvSDK

Capture

Reproduce/Playback

[Figure: version tree with nodes Volume Rendering, Create Reader, Apply, New Colors, Clip With Error, Slice Only, Color Change, vtkSMRepresentationProxy, Isosurface, Multiple Isosurfaces]

Versions are materialized by a sequence of actions; serialization is delegated to the app/plugin
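The idea that a version is just the sequence of actions leading to it can be sketched in a few lines. Everything below is hypothetical (class name, methods, and the `apply_action` callback are invented for illustration and are not the ProvSDK API): each version stores only its parent and one action, and a version is materialized by replaying the actions from the root, with the application deciding what an action means.

```python
class VersionTree:
    """Toy version tree: each version stores only its parent and one action."""

    def __init__(self):
        self.parent = {0: None}  # version id -> parent id (0 is the empty root)
        self.action = {0: None}  # version id -> action that created it
        self._next_id = 1

    def add_version(self, parent_id, action):
        """Record a new version obtained by applying `action` to `parent_id`."""
        version_id = self._next_id
        self._next_id += 1
        self.parent[version_id] = parent_id
        self.action[version_id] = action
        return version_id

    def actions_for(self, version_id):
        """Return the actions from the root to `version_id`, in replay order."""
        path = []
        while version_id != 0:
            path.append(self.action[version_id])
            version_id = self.parent[version_id]
        return list(reversed(path))

    def materialize(self, version_id, apply_action, initial_state):
        """Rebuild a version by replaying its actions through the app callback."""
        state = initial_state
        for action in self.actions_for(version_id):
            state = apply_action(state, action)
        return state


def apply_action(state, action):
    """Hypothetical application callback: an action sets one key to a value."""
    key, value = action
    new_state = dict(state)
    new_state[key] = value
    return new_state


tree = VersionTree()
reader = tree.add_version(0, ("source", "reader"))
iso = tree.add_version(reader, ("filter", "isosurface"))
sliced = tree.add_version(reader, ("filter", "slice"))
recolored = tree.add_version(iso, ("colormap", "new colors"))

print(tree.materialize(recolored, apply_action, {}))
print(tree.materialize(sliced, apply_action, {}))
```

Because the tree never inspects the actions themselves, how they are serialized stays with the application, which is the division of labour the slide describes.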

ProvSDK Ingredients

Browse/Search

Efficiency

Data

Teaching/Analysis

[Figure labels: INVERSE, CHECKPOINT, PERFORMED, FILE NOT FOUND, Tasks 1 to 6]

www.vistrails.com


Operated by Los Alamos National Security, LLC for NNSA

U N C L A S S I F I E D Slide 1

Silent Data Corruption and

Other Anomalies

Sarah Michalak

Statistical Sciences Group

Los Alamos National Laboratory

[email protected]

Silent Data Corruption (SDC)

• “SDC occurs when incorrect data is delivered by a computing system to the user without any error being logged” C. Constantinescu (AMD)
• SDC can have many causes [Constantinescu 2008]
  – Environmental (temperature/voltage fluctuations; particles), manufacturing residues, oxide breakdown, electro-static discharge
• New technologies may experience more data corruptions [Borkar 2012]
  – Increasing number of bits/latches, noise levels; decreasing voltages
• Can affect desktops/laptops and larger-scale systems
• Greater issue with increasing scale
  – ECC/parity provide protection, but perfect systems can’t be assumed
• Impact is hard to quantify precisely: SDC is silent and rare
  – Different applications may have different susceptibility

LANL SDC Research

• Platform Testing:
  – 9 platforms tested: all with HPL; 5 with HPL and Crisscross interconnect tester
  – HPL SDC observed on 4 platforms post-decommissioning
• Laboratory Testing:
  – Dual-core 65nm processor on a high-performance overclocking MOBO
  – Manipulate frequency, voltage, fan speed/temp while running Linpack
  – Multiple forms of SDC outside of nominal conditions
  – Incorrect Linpack results, erroneous timestamps/environmental data
  – Other errors: program termination with no error, program crash, system crash
• Neutron Beam Testing of Roadrunner (1st petaflop/s supercomputer):
  – 2 SDCs observed on Cell processor; 2 SDCs observed on Opteron processor
  – Estimate 1 cosmic-ray-neutron-induced SDC every 1.5 months of operation

References: Michalak et al. 2012, 2010

A single node participated in 70 incorrect Linpack calculations

Other Anomalies

• Data parsing (grep?) issue: different counts of failed HPL calculations
  – 5 false positives in 10M lines parsed; rate replicated on second cluster
  – Appears related to a binary failing to link to a required dynamic library
  – NOT silent if you capture all messages; how many people do this?
• R duplicated lines when reading in a file using the read.table command

Computational Reproducibility Must Account for SDC and Other Anomalies
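One way to "capture all messages" is to record the exit code and both output streams of every run, not just the numbers that are parsed afterwards. The following is a minimal sketch using only the Python standard library; the child command is a stand-in invented for this example, not one of the tools discussed in the talk.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run `cmd` and return its exit code, stdout and stderr as text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Hypothetical child process: prints a result, warns on stderr, exits nonzero.
child = (
    "import sys; "
    "print('residual 1.0e-12'); "
    "print('warning: library not found', file=sys.stderr); "
    "sys.exit(3)"
)
code, out, err = run_and_capture([sys.executable, "-c", child])

# Keep everything together so a later parse of stdout cannot hide the warning.
log_record = {"returncode": code, "stdout": out.strip(), "stderr": err.strip()}
print(log_record)
```

A result whose record has a nonzero return code or a non-empty stderr can then be flagged instead of silently counted.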

Collaborators

• Sean Blanchard
• Andy DuBois
• Dave DuBois
• Andrea Manuzzato
• Dave Modl
• Heather Quinn
• Bill Rust
• Ruben Salazar
• Curt Storlie
• Bruce Takala
• Steve Wender
• And many others…

Dec 2012 Ian M. Mitchell (UBC Computer Science)

Reproducibility(?) Review Proposal

• Proposal to evaluate reproducibility of submissions to an annual CS / engineering conference
– Conference is ACM sponsored & published
– Accepts 25 to 30 papers / year, perhaps 70% include a computational element
– Program committee has expressed support
• Inspired by SIGMOD “repeatability & workability” evaluation procedures
– Bonnet et al, SIGMOD Record, DOI: 10.1145/2034863.2034873
• This conference has much more homogeneous computational efforts than SIGMOD
– Typically a handful of plots generated by a few hundred lines of Matlab that runs in a few hours on a laptop

Procedure

• Repeatability Evaluation Committee (REC)
– Get recommendations for postdocs / senior grad students from members of the program committee (PC)
– Papers go through normal PC review process
– Authors of accepted papers are invited to submit a repeatability package (RP) at time of final paper submission
– Authors and REC are provided evaluation criteria in advance
• RP contains a document, software and data
– Document explains what elements of the paper are repeatable, system requirements and a procedure for installation, execution and extraction of results
– Software can be provided by: link to public repository, archive file, VM, AMI, runmycode.org, …?

Evaluation Criteria

• Three criteria, each rated 0-4
– “Repeatable” if average score of 2, all scores > 0
– Not clear how to combine scores from different reviewers
– Not clear what elements of the reviews should be public
– If not repeatable, no effect on the paper
– If repeatable, the instruction document must be included in ACM DL supplemental material, small software and data could be included
• Criterion 1: Coverage
– 0: no computational elements are repeatable
– 1: at least one repeatable element
– 2: majority of elements are repeatable
– 3: all repeatable and/or most extensible
– 4: all extensible

Evaluation Criteria

• Criterion 2: Instructions
– 0: none included
– 1: installation instructions but little else
– 2: for every computational element that is repeatable there is a specific instruction explaining how to repeat it
– 3: there is a single command that almost exactly recreates each repeatable element
– 4: additional explanations of design decisions, extensions, …
• Criterion 3: Quality(?)
– 0: No evidence of documentation or testing
– 1: The purpose of almost all files is documented
– 2: Almost all elements within source code and all data file formats are documented
– 3: At least some components of the code have some testing
– 4: Significant unit and system test coverage
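The decision rule for a single reviewer can be written down directly. The sketch below is one possible reading, not the committee's procedure: it assumes "average score of 2" means an average of at least 2, and it deliberately leaves open how scores from several reviewers would be combined, since the slides flag that as undecided. The function name and dictionary keys are made up for the example.

```python
def is_repeatable(scores, threshold=2.0):
    """Apply the rule for one reviewer's scores.

    `scores` maps each of the three criteria to a rating from 0 to 4.
    Assumption: "average score of 2" is read as an average of at least 2.
    """
    values = list(scores.values())
    if any(not 0 <= v <= 4 for v in values):
        raise ValueError("each score must be between 0 and 4")
    average = sum(values) / len(values)
    return average >= threshold and all(v > 0 for v in values)


good = {"coverage": 3, "instructions": 2, "quality": 1}
zero_quality = {"coverage": 4, "instructions": 4, "quality": 0}
low_average = {"coverage": 1, "instructions": 2, "quality": 2}

print(is_repeatable(good), is_repeatable(zero_quality), is_repeatable(low_average))
```

The second example shows why the "all scores > 0" clause matters: a high average alone does not qualify a package with no documentation or testing.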

Towards turnkey reproducibility

Geoffrey M. Oxberry

Lawrence Livermore National Laboratory
Computational Engineering Division
Energy Conversion and Storage

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

December 13, 2012

(LLNL-PRES-607580-DRAFT) Turnkey reproducibility ICERM 2012 1 / 6

Problem: Reproducing someone’s work can be hard

Need to install necessary software (assume open source)
– Takes time, expertise, patience, privileges
– Could affect system stability

Could wrap source in VM (virtual machine) image
– Usually requires ≥ 300 MB to host; big, unwieldy
– No separation of source code & environment means no flexibility

High barrier means people don’t run the code

Lower hosting & time barrier by specifying environment in separate repo using configuration management software

Solution: Specify environment with configuration management software

Config management tools specify config in text files
– Shell scripts (simplest, fewest prepackaged features)
– Puppet (puppetlabs.com)
– Chef (wiki.opscode.com)
– Fabric (http://docs.fabfile.org/en/1.5/)
– Related: Hashdist, Blueprint, Reprozip, others...

Instantiate config using virtualization tools
– Serial, small parallel jobs: Vagrant (vagrantup.com) + VirtualBox (virtualbox.org)
– StarCluster (star.mit.edu/cluster)
– CloudFormation (aws.amazon.com/cloudformation)
– Any other virtualization software + hardware
– Use web services instead (like Wakari, RunMyCode)

Idea is flexibility: pick & choose (even none, mix)
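The core idea, an environment described in a small text file kept apart from the source code, can be illustrated without any of the tools listed above. The sketch below is a toy stand-in and not Puppet, Chef, or Fabric: the `name==version` file format, the package names, and the version numbers are all hypothetical.

```python
def parse_spec(text):
    """Parse lines of the form 'name==version' into a dict, skipping comments."""
    spec = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            name, version = line.split("==")
            spec[name.strip()] = version.strip()
    return spec


def diff_environment(required, installed):
    """List the ways `installed` deviates from the `required` specification."""
    problems = []
    for name, version in sorted(required.items()):
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: missing (need {version})")
        elif found != version:
            problems.append(f"{name}: have {found}, need {version}")
    return problems


# Hypothetical specification file contents and hypothetical installed versions.
spec_text = """
# environment for the example
python==2.7.3
numpy==1.6.1
ipython==0.13
"""
installed = {"python": "2.7.3", "numpy": "1.5.0"}

required = parse_spec(spec_text)
print(diff_environment(required, installed))
```

A real configuration management tool goes further and installs what is missing; the point here is only that a few lines of text are much cheaper to host and review than a VM image.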

Example: Install Python interface for DASSL

DASSL: differential-algebraic equation solver package in Fortran (L. Petzold)

PyDAS: Python interface to DASSL (J. W. Allen, on GitHub)

Example: Solve Robertson problem in IPython notebook using PyDAS

Presentation, environment and source repos on https://github.com/goxberry (all labeled with ICERM-2012)
– Requires Vagrant + VirtualBox
– Vagrantfile to specify VM to create (here, Ubuntu 12.04)
– Configuration in Puppet
– README with directions for running software

Acknowledgments

Dr. Matt McNenly
Dr. Dan Flowers
Dr. David I. Ketcheson
Dr. Aron Ahmadia
US DOE
DOE CSGF

Gurpreet Singh, program manager for the DOE EERE Advanced Combustion Engine Program, for his continued support of the Advanced Combustion Numerics project at LLNL


Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D Slide 1

EPSum -- The Problem

• Different results when varying no. of processors
• Order of operations change with number of processors
• Precision errors same order of magnitude as programming errors
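The effect is easy to reproduce on one machine by imitating what a parallel reduction does: split the data into contiguous chunks, sum each chunk, then add the partial sums. The sketch below is an illustration with made-up data, not the EPSum test case. It uses an explicit loop rather than the built-in `sum`, because recent Python versions compensate floating-point sums internally.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, n_chunks):
    """Sum `values` as contiguous partial sums, then add the partials.

    This mimics summing on n_chunks processes followed by a reduction.
    """
    size = math.ceil(len(values) / n_chunks)
    partials = [naive_sum(values[k:k + size]) for k in range(0, len(values), size)]
    return naive_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]  # illustrative data with heavy cancellation
results = {p: chunked_sum(values, p) for p in (1, 2, 4)}
exact = math.fsum(values)  # correctly rounded reference sum

print(results, exact)
```

None of the three "process counts" returns the reference value, and two of them disagree with each other, although the same numbers are being added each time.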

EPSum – Kahan Sum

• Add carry digits through a second double precision number
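A serial version of the same update makes the role of the second number concrete. This sketch follows the variable names used in the C code on the later slides, but it is a plain single-process illustration with invented test data, not the MPI implementation.

```python
import math


def naive_sum(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error in a second double."""
    total = 0.0
    correction = 0.0
    for x in values:
        corrected_next_term = x + correction
        new_sum = total + corrected_next_term
        # What was actually added is (new_sum - total); the rest is carried.
        correction = corrected_next_term - (new_sum - total)
        total = new_sum
    return total


# Ten terms that are each too small to change 1.0 when added one at a time.
values = [1.0] + [1e-16] * 10
exact = math.fsum(values)  # correctly rounded reference sum
naive_error = abs(naive_sum(values) - exact)
kahan_error = abs(kahan_sum(values) - exact)

print(naive_error, kahan_error)
```

The naive loop never moves away from 1.0, while the compensated loop accumulates the small terms in `correction` until they are large enough to register.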

[Figure: worked example of a Kahan sum with its correction term; the digits are not recoverable from the extraction]

EPSum – The Code

• In practice, the Kahan sum produced as good a result as the Knuth sum at lower cost
• The setup

struct esum_type {
    double sum;
    double correction;
};
int commutative = 1;
MPI_Datatype MPI_TWO_DOUBLES;
MPI_Op KAHAN_SUM;
MPI_Type_contiguous(2, MPI_DOUBLE, &MPI_TWO_DOUBLES);
MPI_Type_commit(&MPI_TWO_DOUBLES);
MPI_Op_create((MPI_User_function *)kahan_sum, commutative, &KAHAN_SUM);

• And the cleanup

MPI_Op_free(&KAHAN_SUM);
MPI_Type_free(&MPI_TWO_DOUBLES);

EPSum – The Code

• The MPI Operation

/* User-defined reduction: combine two (sum, correction) pairs. */
void kahan_sum(struct esum_type *in, struct esum_type *inout, int *len,
               MPI_Datatype *MPI_TWO_DOUBLES)
{
    double corrected_next_term, new_sum;
    /* in and inout are pointers, so members are accessed with -> */
    corrected_next_term = in->sum + (in->correction + inout->correction);
    new_sum = inout->sum + corrected_next_term;
    inout->correction = corrected_next_term - (new_sum - inout->sum);
    inout->sum = new_sum;
}

EPSum – The Code

double simple_parallel_kahan_sum(double **var)
{
    int i, j;  /* loop indices; isize and jsize are assumed to be defined elsewhere */
    double corrected_next_term, new_sum;
    struct esum_type local, global;

    local.sum = 0.0;
    local.correction = 0.0;
    /* Kahan sum over the local block */
    for (j = 0; j < jsize; j++) {
        for (i = 0; i < isize; i++) {
            corrected_next_term = var[j][i] + local.correction;
            new_sum = local.sum + corrected_next_term;
            local.correction = corrected_next_term - (new_sum - local.sum);
            local.sum = new_sum;
        }
    }
    /* Combine the per-process (sum, correction) pairs with the custom operation */
    MPI_Allreduce(&local, &global, 1, MPI_TWO_DOUBLES, KAHAN_SUM, MPI_COMM_WORLD);
    return global.sum;
}

EPSum -- Results

Proc   Normal   Long-Double   Kahan   Knuth
1      -68009   63.017        0       0
2      -51316   33.374        -2      -2
4      -32740   -3.325        2       2
8      2586     -3.004        -2      -2
16     2360     -2.500        -2      -2
32     1968     -1.621        -2      -2
64     1078     0.134         -4      -4
128    -726     -1.031        -2      -2
256    192      -0.998        -2      -2

*In multiples of machine epsilon

The role of reproducibility in number theory, from my perspective

Michael Rubinstein

University of Waterloo

ICERM, Dec 13, 2012

Computation in number theory can:
• contribute to rigorous mathematical proof.
• help verify conjectures experimentally.
• stimulate mathematical discovery, theorems, relationships, conjectures, and uncover phenomena.
• motivate work on algorithms and complexity (exs: factoring, primality testing, Ghaith Hiary’s algorithm for computing the zeta function).
• involve automated theorem proving and proof verification.

What does it mean for a computation to be rigorous? Even assuming that the algorithms/methods are correct in principle, how does one certify, potentially, thousands of lines of code as being correct when there are many places where things can go wrong:
• mathematical coding errors (ex: signs, branches of log).
• programming errors (ex: wrong loop or array bounds, memory issues, confusing types),
• loss of precision (accumulated round-off, cancellation), wrong error analysis/inequalities.
• reliance on black boxes: computer chips (Intel division bug), compilers and optimizers (constantly releasing new versions with bug fixes), computer memory (can get corrupted), closed source software (Mathematica), other people’s packages, and using them in ways that were not originally intended or foreseen.

At present we usually declare a program to be bug free once it produces output that is consistent with our expectations. We tend to give more trust to computational results that are reproducible using separate code and hardware, or for which there is more than one method that gives the same output, or for which the correctness of the result can be easily tested (examples: factorization of an integer, explicit formula for zeros of an L-function).
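Integer factorization shows what "easily tested" means in practice: finding the factors may be hard, but checking a claimed answer only needs a product and a few primality tests, and the check can be written independently of the search. The sketch below uses deliberately simple trial division for both parts; the function names and the sample integer are choices made for this illustration.

```python
def trial_division(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def is_prime(p):
    """Primality check by trial division, adequate for small factors."""
    if p < 2:
        return False
    d = 2
    while d * d <= p:
        if p % d == 0:
            return False
        d += 1
    return True


def verify_factorization(n, factors):
    """Check a claimed factorization without trusting how it was produced."""
    product = 1
    for p in factors:
        product *= p
    return product == n and all(is_prime(p) for p in factors)


n = 600851475143
factors = trial_division(n)
print(factors, verify_factorization(n, factors))
```

The verifier never looks at how `factors` was computed, so it would catch a wrong answer from any factoring code, including one running on faulty hardware.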


Excerpt from Tao’s paper:


Rigour and bugs in computation
Once upon a time I had a very frustrating bug that took me 3 days to find. It had to do with my project, in 1994 (published 1998), with Bjorn Poonen on determining the number of intersection points formed by the diagonals of a regular n-gon.


(Poonen, R. 1994)

Theorem
For n ≥ 3, the number of interior intersection points formed by the diagonals of a regular n-gon, I(n), is given by

$$
\begin{aligned}
I(n) = \binom{n}{4} &+ \frac{-5n^3 + 45n^2 - 70n + 24}{24}\,\delta_2(n) - \frac{3n}{2}\,\delta_4(n) \\
&+ \frac{-45n^2 + 262n}{6}\,\delta_6(n) + 42n\,\delta_{12}(n) + 60n\,\delta_{18}(n) \\
&+ 35n\,\delta_{24}(n) - 38n\,\delta_{30}(n) - 82n\,\delta_{42}(n) - 330n\,\delta_{60}(n) \\
&- 144n\,\delta_{84}(n) - 96n\,\delta_{90}(n) - 144n\,\delta_{120}(n) - 96n\,\delta_{210}(n).
\end{aligned}
$$

The form of I(n) was obtained by studying an equation involving a dozen roots of unity. The specific coefficients were found by fitting I(n) to actual intersection counts for various n ≤ 420. Our derivation was, thus, reduced to a finite computation.
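The formula in the theorem can be evaluated directly. The sketch below takes δ_m(n) to be 1 when m divides n and 0 otherwise; the slide does not spell this out, so treat it as an assumption. The function name is hypothetical, and exact rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction
from math import comb


def delta(m: int, n: int) -> int:
    """Assumed meaning of delta_m(n): 1 if m divides n, else 0."""
    return 1 if n % m == 0 else 0


def interior_intersections(n: int) -> int:
    """Evaluate I(n) term by term, as written in the theorem."""
    if n < 3:
        raise ValueError("the formula is stated for n >= 3")
    total = Fraction(comb(n, 4))
    total += Fraction(-5 * n**3 + 45 * n**2 - 70 * n + 24, 24) * delta(2, n)
    total -= Fraction(3 * n, 2) * delta(4, n)
    total += Fraction(-45 * n**2 + 262 * n, 6) * delta(6, n)
    linear_terms = [(12, 42), (18, 60), (24, 35), (30, -38), (42, -82),
                    (60, -330), (84, -144), (90, -96), (120, -144), (210, -96)]
    for m, coeff in linear_terms:
        total += coeff * n * delta(m, n)
    if total.denominator != 1:
        raise ArithmeticError("I(n) should be an integer")
    return int(total)


print([interior_intersections(n) for n in range(3, 13)])
```

For a square the two diagonals meet once, and for a regular pentagon no three diagonals are concurrent, so the first values are easy to check by hand.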


    n  a2(n)/n  a3(n)/n  a4(n)/n  a5(n)/n  a6(n)/n  a7(n)/n  (I(n)−1)/n
    6        2        -        -        -        -        -           2
   12       19        5        1        -        -        -          25
   18       84       12        3        3        -        -         102
   24      256       36       11        1        -        -         304
   30      460       75       14        6        4        1         560
   36     1179      109       11        6        -        -        1305
   42     1786      194       27       13        -        -        2020
   48     3168      220       25        7        -        -        3420
   54     4722      288       24       12        -        -        5046
   60     6251      422       63       12        -        5        6753
   66     9172      460       35       15        -        -        9682
   72    12428      504       35       13        -        -       12980
   78    15920      642       42       18        -        -       16622
   84    20007      805       43       28        -        -       20883
   90    25230      863       45       21        4        1       26164
   96    31240      948       53       19        -        -       32260
  102    37786     1096       56       24        -        -       38962
  108    45447     1201       53       24        -        -       46725
  114    53768     1368       63       27        -        -       55226
  120    62652     1601       95       31        -        5       64384
  126    73676     1658       72       34        -        -       75440
  132    85319     1825       71       30        -        -       87245
  138    97990     2002       77       33        -        -      100102
  144   112100     2136       77       31        -        -      114344
  150   127070     2345       84       36        4        1      129540
  156   143635     2549       85       36        -        -      146305
  162   161520     2736       87       39        -        -      164382
  168   180504     3008       95       47        -        -      183654
  174   201448     3178       98       42        -        -      204766
  180   223251     3470      129       42        -        5      226897
  186   247562     3630      105       45        -        -      251342
  192   273144     3844      109       43        -        -      277140
  198   300294     4092      108       48        -        -      304542
  204   329171     4357      113       48        -        -      333689
  210   359556     4661      125       55        4        1      364402
  216   392564     4848      119       49        -        -      397580
  222   426836     5166      126       54        -        -      432182
  228   463303     5441      127       54        -        -      468925
  234   501762     5718      129       57        -        -      507666
  240   541612     6121      165       61        -        5      547964
  246   584782     6340      140       60        -        -      591322
  252   629399     6693      137       70        -        -      636299
  258   676580     6972      147       63        -        -      683762
  264   725976     7276      151       61        -        -      733464
  270   777420     7643      150       66        4        1      785284
  276   831575     7969      155       66        -        -      839765
  282   887986     8326      161       69        -        -      896542
  288   947132     8640      161       67        -        -      956000
  294  1008358     9056      174       76        -        -     1017664
  300  1072171     9462      203       72        -        5     1081913
  306  1139436     9780      171       75        -        -     1149462
  312  1208944    10164      179       73        -        -     1219360
  318  1281100    10582      182       78        -        -     1291942
  324  1356315    10957      179       78        -        -     1367529
  330  1434110    11375      189       81        4        1     1445760
  336  1514816    11856      193       89        -        -     1526954
  342  1598970    12216      192       84        -        -     1611462
  348  1685843    12661      197       84        -        -     1698785
  354  1775788    13108      203       87        -        -     1789186
  360  1868312    13669      231       91        -        5     1882308
  366  1965272    14010      210       90        -        -     1979582
  372  2064919    14465      211       90        -        -     2079685
  378  2167754    14930      219       97        -        -     2183000
  384  2274136    15396      221       91        -        -     2289844
  390  2383690    15885      224       96        4        1     2399900
  396  2496999    16369      221       96        -        -     2513685
  402  2613536    16896      231       99        -        -     2630762
  408  2733888    17380      235       97        -        -     2751600
  414  2857752    17898      234      102        -        -     2875986
  420  2984383    18598      273      112        -        5     3003371

Table: The number of intersection points for one piece of the pie (i.e., 2π/n radians), n = 6, 12, ..., 420.
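A table like this can be cross-checked without recomputing any intersection points. The sketch below assumes that a_k(n) counts the interior points, other than the center, where exactly k diagonals meet (the slide does not define it), and that an empty cell means zero. Under that reading, each row must sum to its last column, and the pairs of crossing diagonals must total C(n,4), because every choice of four vertices gives exactly one crossing pair. The rows are copied from the table; the function name is hypothetical.

```python
from math import comb

# (n, [a2/n, a3/n, a4/n, a5/n, a6/n, a7/n], (I(n)-1)/n); empty cells entered as 0.
ROWS = [
    (6, [2, 0, 0, 0, 0, 0], 2),
    (12, [19, 5, 1, 0, 0, 0], 25),
    (30, [460, 75, 14, 6, 4, 1], 560),
    (60, [6251, 422, 63, 12, 0, 5], 6753),
    (420, [2984383, 18598, 273, 112, 0, 5], 3003371),
]


def row_is_consistent(n: int, per_slice: list[int], total_per_slice: int) -> bool:
    """Check one row against two identities that need no geometry code."""
    # Identity 1: the non-central points per slice add up to (I(n) - 1) / n.
    sums_ok = sum(per_slice) == total_per_slice
    # Identity 2: a point on k diagonals accounts for C(k, 2) crossing pairs,
    # and the center (for even n) lies on n/2 diagonals.
    pairs = sum(comb(k, 2) * a * n for k, a in zip(range(2, 8), per_slice))
    pairs += comb(n // 2, 2)
    return sums_ok and pairs == comb(n, 4)


for n, per_slice, total in ROWS:
    print(n, row_is_consistent(n, per_slice, total))
```

The second identity is sensitive to which column a value sits in, so it also catches a count that was entered under the wrong multiplicity.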


Initially we formed this table by computing floating point approximations to the intersection points and, experimentally, declaring points equal if they agreed to 12, and then 40 decimal places.
The patterns in our table broke down around n ≈ 150, when we first ran our computation. It took me 3 days to track down the bug to the value of

π = 3.14159265389793238462643383279502884197 . . .

Later, we redid the computation rigorously, by also checking that the intersection points of multiplicity > 1 fell into our classification of possible intersection points (Theorem 4 of our paper). That boiled down to comparing integers and we could do it exactly. Not only did it provide a rigorous count of the intersection points, but it also served as an extra check that our classification was complete.
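A hard-coded constant of this kind can be checked mechanically against an independent source. The sketch below assumes the value printed on the slide is the one that was in the program, and compares it digit by digit with a reference string for π; the function and variable names are hypothetical.

```python
import math

SLIDE_PI = "3.14159265389793238462643383279502884197"  # as printed on the slide
TRUE_PI = "3.14159265358979323846264338327950288419"   # first 38 decimals of pi


def first_mismatch(a: str, b: str):
    """Return the 1-based decimal place where a and b first differ, or None."""
    digits_a = a.split(".")[1]
    digits_b = b.split(".")[1]
    for place, (x, y) in enumerate(zip(digits_a, digits_b), start=1):
        if x != y:
            return place
    return None


print("first differing decimal place:", first_mismatch(SLIDE_PI, TRUE_PI))
print("error seen in double precision:", float(SLIDE_PI) - math.pi)
```

A constant that is only good to nine decimal places cannot support equality tests at 12 or 40 places, however much working precision the rest of the computation carries.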


Another anecdote
Using number theoretic heuristics, and guided by techniques and results from random matrix theory, Conrey, Farmer, Keating, R., and Snaith conjectured:

For positive integer k, and any ε > 0,

$$
\int_0^T |\zeta(1/2 + it)|^{2k}\,dt \sim \int_0^T P_k(\log t)\,dt,
$$

where $P_k$ is the polynomial of degree $k^2$ given implicitly by a complicated 2k-fold residue.


From the residue, we developed elaborate formulas for the coefficients of $P_k(x)$ and found, for example (coefficients are rounded):

$$
\begin{aligned}
P_3(x) ={} & 0.000005708527034652788398376841445252313\,x^9 \\
& + 0.00040502133088411440331215332025984\,x^8 \\
& + 0.011072455215246998350410400826667\,x^7 \\
& + 0.14840073080150272680851401518774\,x^6 \\
& + 1.0459251779054883439385323798059\,x^5 \\
& + 3.984385094823534724747964073429\,x^4 \\
& + 8.60731914578120675614834763629\,x^3 \\
& + 10.274330830703446134183009522\,x^2 \\
& + 6.59391302064975810465713392\,x \\
& + 0.9165155076378930590178543.
\end{aligned}
$$
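To experiment with these numbers, the rounded coefficients can be kept as decimal strings so that no digits are lost on input. The sketch below evaluates P3 by Horner's rule with Python's `decimal` module; the working precision of 50 digits and the function name are assumptions made for this example, and the coefficients are the rounded values from the slide, not exact ones.

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # assumed working precision for this sketch

# Rounded coefficients of P3 as printed on the slide, highest degree first.
P3_COEFFS = [Decimal(s) for s in (
    "0.000005708527034652788398376841445252313",
    "0.00040502133088411440331215332025984",
    "0.011072455215246998350410400826667",
    "0.14840073080150272680851401518774",
    "1.0459251779054883439385323798059",
    "3.984385094823534724747964073429",
    "8.60731914578120675614834763629",
    "10.274330830703446134183009522",
    "6.59391302064975810465713392",
    "0.9165155076378930590178543",
)]


def p3(x) -> Decimal:
    """Evaluate P3 at x by Horner's rule in decimal arithmetic."""
    x = Decimal(x)
    acc = Decimal(0)
    for c in P3_COEFFS:
        acc = acc * x + c
    return acc


print(p3(0))
print(p3("2.5"))
```

Passing x as a string or integer avoids first rounding it to a binary double.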

A completely different method was then developed, using another approach/formula and very high precision (thousands of digits) to see our way through high order poles that cancelled, to obtain the same coefficients to about 20-25 digits.


[Figure, k = 3: data(T)/conjecture(T) − 1 plotted against T; vertical axis from −0.003 to 0.003, horizontal axis from 0 to 8e+07.]

Graph of

$$
\frac{\int_0^T |\zeta(1/2+it)|^6\,dt}{\int_0^T P_3(\log(t)/(2\pi))\,dt} - 1, \quad \text{for } 0 < T < 8 \times 10^7.
$$

(R.-Yamagishi) Agreement is to about 4-5 decimal places out of 15.


ReproZip: Packing Experiments for Sharing and Publication

Fernando Chirigati, Juliana Freire | NYU-Poly

Dennis Shasha | NYU


Motivation
• Published articles are not made reproducible
• Computational reproducibility may be difficult to achieve
• Some current solutions require the user to adopt a system
  o GenePattern [1], Madagascar [2], Scientific Workflow Systems [3]
• Other solutions rely on capturing information about the computational environment
  o Virtual Machines
  o CDE [4]

Author: How to encapsulate my experiment? Too many dependencies… Too many files to keep track of… Sigh.

Reviewers, Collaborators: How to compile this program? How to execute it? How to explore it? Sigh.

ReproZip: Packing Experiments for Sharing and Publication
Fernando Chirigati, NYU-Poly


ReproZip
• ReproZip is a packaging solution
  o It makes it easier for authors to pack experiments and for reviewers to verify computational results
• It creates reproducible packages from existing experiments on computational environment E
  o No need to port experiments to another system
  o Leverages provenance of computational results
• It unpacks an experiment on computational environment E’
• It generates a workflow specification that encapsulates the execution of the experiment
  o Eases the verification process
  o Allows users to explore the experiment, while keeping track of provenance
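The pack-on-E, unpack-on-E’ idea can be illustrated with a toy example. This is not ReproZip's implementation or interface: the sketch below only bundles a given list of files with a manifest of their hashes and checks the hashes after extraction, whereas the slides describe capturing provenance of the execution as well. All names, including `MANIFEST.json`, are hypothetical.

```python
import hashlib
import io
import json
import tarfile
import tempfile
from pathlib import Path

MANIFEST = "MANIFEST.json"  # hypothetical file name chosen for this sketch


def sha256(path) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def pack(files, archive) -> None:
    """On environment E: bundle the files together with a manifest of hashes."""
    manifest = {Path(f).name: sha256(f) for f in files}
    payload = json.dumps(manifest, indent=2).encode()
    with tarfile.open(archive, "w:gz") as tar:
        for f in files:
            tar.add(str(f), arcname=Path(f).name)
        info = tarfile.TarInfo(MANIFEST)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))


def unpack(archive, dest) -> bool:
    """On environment E': extract, then report whether every hash still matches."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                (dest / Path(member.name).name).write_bytes(data)
    manifest = json.loads((dest / MANIFEST).read_text())
    return all(sha256(dest / name) == digest for name, digest in manifest.items())


def roundtrip_demo() -> bool:
    """Pack two small files and unpack them into a fresh directory."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        script = tmp / "run.py"
        script.write_text("print('hello')\n")
        data = tmp / "input.dat"
        data.write_text("1 2 3\n")
        archive = tmp / "experiment.tar.gz"
        pack([script, data], archive)
        return unpack(archive, tmp / "unpacked")


print(roundtrip_demo())
```

A real packaging tool also has to discover which files and binaries an experiment touches; here that list is simply passed in by hand.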


Overview
[Diagram with three stages:]
• packing (on environment E): Experiment; Workflow Provenance Tree; Reproducible Package (files + binaries + workflow)
• unpacking (on environment E’): Reproducible Package; Experiment Extraction (files + binaries + workflow)
• verification and exploration


References
1. GenePattern. http://www.broadinstitute.org/cancer/software/genepattern/
2. Madagascar. http://www.ahay.org/wiki/Main_Page
3. S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, pages 1345-1350, 2008.
4. P. Guo. CDE: A Tool for Creating Portable Experimental Software Packages. Computing in Science and Engineering, 14(4):32-35, 2012.
5. SystemTap. http://sourceware.org/systemtap/
6. MongoDB. http://www.mongodb.org/


Thank You!
Fernando Chirigati
[email protected]
http://vgc.poly.edu/~fchirigati