Reproducible scientific workflows - Acting on Change 2016
Reproducible scientific workflows
Tomasz Miksa
Vienna University of Technology
& SBA Research, Austria
Tomasz Miksa [email protected]
eScience and Research Infrastructures
Scientists exchange
- facilities
- resources
- services
- datasets
Research requires
- special tooling and software
- workflows to
• capture
• transform
• visualize
• interpret the data
Taverna Workflow
Workflows and Context
‘Workflows’ can be
- ad hoc commands and scripts executed manually
- well-structured processes executed within a controlled environment
Workflows
- share infrastructure with other processes
- delegate tasks to tools installed in the system
- require specific configurations
- can use distributed systems
#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
Script
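Because a script like this delegates its work to tools installed in the system, one simple precaution is to check that those tools can be found before anything runs. The following is a minimal sketch, not part of the original slides; `missing_tools` is a hypothetical helper name, and the tool list is taken from the commands the script calls.

```bash
#!/bin/bash
# Hypothetical helper: print the names of the given tools that are not on PATH.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Drop the leading space left by the concatenation above.
    echo "${missing# }"
}

# The tools the workflow script relies on.
missing=$(missing_tools java unzip R pdflatex)
if [ -n "$missing" ]; then
    echo "missing tools: $missing" >&2
fi
```

A check of this kind only shows that a tool is present; it says nothing about which version is installed or how it is configured.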
Reproducibility
Current studies show very low reproducibility in
- medicine
- economics
- computer science
Reproducibility requires
- well-documented research workflows
- precise information on the experiment's environment
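One way to capture part of that environment information is to record the operating system, the machine type and the tool versions alongside each run. The sketch below is an illustration under assumptions, not a method from the talk: `record_version` and `describe_environment` are hypothetical helper names, and the tools listed are the ones used by the example workflow script.

```bash
#!/bin/bash
# Hypothetical helper: print "<tool>=<first line of its version output>".
# $1 is the tool name; all arguments together form the version command.
record_version() {
    if command -v "$1" >/dev/null 2>&1; then
        # Some tools (java, for example) print their version on stderr.
        echo "$1=$("$@" 2>&1 | head -n 1)"
    else
        echo "$1=not installed"
    fi
}

# Hypothetical helper: describe the machine, the OS and the tools used.
describe_environment() {
    echo "recorded=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "os=$(uname -s)"
    echo "os_release=$(uname -r)"
    echo "machine=$(uname -m)"
    record_version java -version
    record_version R --version
    record_version pdflatex --version
}

describe_environment
```

Redirecting the output to a file stored next to the results, for example `describe_environment > environment.txt` (a hypothetical file name), keeps the snapshot with the data it describes. Such a snapshot covers only a small part of the environment; libraries, configuration files and external services are not captured.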
Reproducibility Neuroanatomical studies
FreeSurfer Software
- cortical thickness and volume of neuroanatomical structures
Different
- FreeSurfer versions
• v4.3.1, v4.5.0, v5.0.0
- Workstation
• Mac, Hewlett‐Packard
- Operating system version
• OSX 10.5, OSX 10.6
E. Gronenschild, P. Habets, H. I. L. Jacobs, R. Mengelers, N. Rozendaal, J. van Os, and M. Marcelis, “The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements,” 2012.
Reproducibility Computer Science
613 papers in 8 ACM conferences
C. Collberg and T. Proebsting, “Measuring reproducibility in computer systems research,” 2014. [Online]. Available: http://reproducibility.cs.arizona.edu/tr.pdf
Reproducibility Computer Science
E-mail responses from authors
- Wrong version
- Code will be available soon
- Programmer left
- Bad backup practices
- Commercial code
- Proprietary academic code
- Intellectual property
- No intention to release
- …
Variety of solutions
- Workflow systems
- Interactive notebooks
- Virtualisation
- Containers
- Code repositories
- Automated builds
- Service monitoring
- Metadata standards
- Provenance
- Preservation planning
- Repositories
TIMBUS - Process preservation
- Digital preservation of business processes
- Based on risk management
- Context modelling is the key
TIMBUS - Context modelling
- Context model
- Automated extractors
- Process execution monitoring
- Service monitoring
TIMBUS - Risk mitigation strategies
Metadata and documentation
Migration
- File formats
- Storage media
- Alternative services
• Open source service
• In‐housing of services
Emulation
Virtualisation
Mock‐up of systems
Summary
Scientific experiments
- workflows for data processing with software dependencies
Risks affecting reproducibility
- reproducibility is low due to insufficient experiment description
Solutions for improving reproducibility
- improve data management, sharing and reuse
TIMBUS approach for process preservation
- based on risk management practices
- using context modelling to evaluate preservation alternatives