bioexcel.eu
Scientific Workflow Systems
Stian Soiland-Reyes
eScience Lab, The University of Manchester
2017-11-03, Aix-en-Provence
CESAB workshop: Reproducible Workflows
orcid.org/0000-0001-9842-9718 @soilandreyes
This work is licensed under a Creative Commons Attribution 4.0 International License.
What is a Workflow?
Orchestrating computational tasks
Managing the control and data flow
Homogeneous or heterogeneous tasks:
– Local / remote
– Own / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning infrastructure
– Various access controls
BioExcel: Biomolecular recognition
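As a rough illustration of "orchestrating computational tasks" and "managing the control and data flow", the following is a minimal sketch of a workflow as a dependency graph, assuming Python 3.9+ for `graphlib`. The task names, the `run_workflow` helper and the sample input are hypothetical and not taken from any workflow system mentioned in these slides.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, inputs):
    """Run tasks in dependency order, handing each task its upstream results.

    tasks: task name -> function taking a dict of upstream results
    dependencies: task name -> set of names it depends on
    inputs: externally provided values, keyed by name
    """
    results = dict(inputs)
    # static_order() yields every node after all of its predecessors
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:  # external input, nothing to execute
            continue
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


# Hypothetical three-step workflow: clean text, count words, write a report line
tasks = {
    "clean": lambda up: up["raw"].strip().lower(),
    "count": lambda up: len(up["clean"].split()),
    "report": lambda up: f"{up['count']} words: {up['clean']}",
}
dependencies = {
    "clean": {"raw"},
    "count": {"clean"},
    "report": {"clean", "count"},
}
results = run_workflow(tasks, dependencies, {"raw": "  Hello Workflow World  "})
print(results["report"])
```

The control flow (execution order) is derived from the data flow (who consumes whose output), which is the core idea most workflow systems share regardless of whether the tasks are local, remote or third-party.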
Not on the agenda: Business workflows
Control flow of who has responsibility for what
BPM
Business workflows + computational workflows
IBISBA
Why use workflows?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure & handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities
Provenance – reporting
– Capture, report and utilize log and data lineage
– Auto-documentation
– Traceable evolution, audit, transparency
– Compare
Findable
Accessible
Interoperable
Reusable
(Reproducible)
Adapted from Bertram Ludäscher at WORKS2015 https://www.slideshare.net/ludaesch/works-2015provenancemileage
The humble Makefile
https://github.com/vak/makefile2dot
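The rule behind a Makefile can be sketched in a few lines: rebuild a target only if it is missing or older than one of its prerequisites. The sketch below is an illustrative Python analogue, not how Make itself is implemented; the helper names `needs_rebuild`, `build` and `concatenate` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(target, prerequisites):
    """True if the target is missing or older than any prerequisite."""
    target = Path(target)
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(Path(p).stat().st_mtime > target_mtime for p in prerequisites)


def build(target, prerequisites, recipe):
    """Run the recipe only when needed; return True if it ran."""
    if needs_rebuild(target, prerequisites):
        recipe(target, prerequisites)
        return True
    return False


def concatenate(target, prerequisites):
    # Hypothetical recipe: the target is the concatenation of its prerequisites
    Path(target).write_text("".join(Path(p).read_text() for p in prerequisites))


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "input.txt")
    target = Path(tmp, "output.txt")
    source.write_text("data\n")
    first = build(target, [source], concatenate)   # target missing: recipe runs
    second = build(target, [source], concatenate)  # up to date: recipe skipped
    newer = target.stat().st_mtime + 10
    os.utime(source, (newer, newer))               # simulate editing the source
    third = build(target, [source], concatenate)   # source newer: recipe runs
```

Skipping up-to-date targets is what makes re-running a pipeline after a small change cheap, and it is the behaviour that Make-inspired workflow tools build on.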
Laser Interferometer Gravitational-Wave Observatory
First detection of gravitational waves from colliding black holes
https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/
https://pegasus.isi.edu/
Workflow Environment Ecosystem
https://s.apache.org/existing-workflow-systems
https://www.knime.org/
https://www.openphacts.org/
Pharmacological queries
target, compound and pathway data
https://doi.org/10.1371/journal.pone.0115460
http://www.myexperiment.org/workflows/4292
Stop Press!
GUIs not essential!
GUI: Canvas, drag-drop blocks, arrows, run button, data visualization
Script: Textual, command line, view data externally. Script easily run from other apps.
Scripts can be workflows!
Workflow systems ⇆ Scripts
Scripts on ASAP meter:
Automation: ★ ★ ★ ★ ★
Scaling: ★ ★
Abstraction: ★
Provenance: ★ ★
https://www.nextflow.io/
Script-like, define flow as channels
Streaming
Automatic Parallelism
Checkpoints
Virtualization and packaging
Portable
Reproducibility
Snakemake
Makefile + Python ⇝ Snakemake
Filename patterns
Shell commands
Inline Python, R
Scalable to grid/cloud
https://snakemake.readthedocs.io/
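To illustrate the "filename patterns" idea, here is a hedged Python sketch of how a requested output file can be matched against a pattern such as `sorted/{sample}.bam` to fill in the input name and the shell command. This is not Snakemake's implementation or API; `match_pattern`, the `rule` dictionary and the file names are hypothetical, and the sketch assumes each wildcard appears only once per pattern.

```python
import re


def match_pattern(pattern, filename):
    """Return wildcard values if filename fits the pattern, else None.

    Assumes each {wildcard} occurs once in the pattern.
    """
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[pos:m.start()])  # literal text before wildcard
        regex += f"(?P<{m.group(1)}>.+)"             # named group for the wildcard
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None


# Hypothetical rule in the spirit of "filename patterns + shell commands"
rule = {
    "output": "sorted/{sample}.bam",
    "input": "mapped/{sample}.bam",
    "shell": "samtools sort -o {output} {input}",
}

requested = "sorted/A.bam"
wildcards = match_pattern(rule["output"], requested)
input_file = rule["input"].format(**wildcards)
command = rule["shell"].format(input=input_file, output=requested)
print(command)
```

Working backwards from the requested output to the required inputs is the same direction in which Make resolves targets, which is why the two are often compared.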
YesWorkflow
Declare workflow steps as
#annotations in existing scripts
Graphical visualization of workflow
http://yesworkflow.org/
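The idea of declaring steps as comment annotations can be sketched with a small parser. The example below is a simplified illustration and not the YesWorkflow tool: it only recognises the `@begin`, `@end`, `@in` and `@out` keywords, ignores nesting, and the sample script and function names are hypothetical.

```python
import re

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_steps(script_text):
    """Collect steps with their declared inputs and outputs from comments."""
    steps = []
    current = None
    for line in script_text.splitlines():
        for keyword, name in ANNOTATION.findall(line):
            if keyword == "begin":
                current = {"name": name, "in": [], "out": []}
            elif keyword == "end" and current is not None:
                steps.append(current)
                current = None
            elif keyword in ("in", "out") and current is not None:
                current[keyword].append(name)
    return steps


def data_edges(steps):
    """Link each step to the steps that consume one of its outputs."""
    producers = {out: s["name"] for s in steps for out in s["out"]}
    return [(producers[i], s["name"]) for s in steps for i in s["in"] if i in producers]


# Hypothetical annotated script; the code lines themselves are never executed here
SCRIPT = """\
# @begin load_data
# @in raw_file
# @out table
table = open('raw.csv').read()
# @end load_data
# @begin plot
# @in table
# @out figure
print(len(table))
# @end plot
"""

steps = extract_steps(SCRIPT)
edges = data_edges(steps)
print(edges)
```

Because only the comments are read, the same approach works for scripts in any language with line comments, and the resulting edges are what a graphical view of the workflow would draw.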
https://github.com/chapmanb/bcbio-nextgen
Distributed workflows for Next-Gen Sequencing analysis
Domain-specific language
Focus on parameters, algorithms
Workflow fixed – no command lines!
https://bcbio-nextgen.readthedocs.org
http://commonwl.org/
Workflow interoperability
Common workflow format
Community based standards effort
Designed for clusters & clouds
Use containers (e.g. Docker)
Textual YAML files
(GUIs available)
Workflow: Steps with data dependencies
Step: command line or inline scripts
Scatter/gather on steps
Rich annotations
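CWL expresses scatter/gather declaratively on a workflow step. As a language-neutral illustration of the pattern itself, the following Python sketch runs one step per input item in parallel and gathers the results in input order. It is an analogue only, not CWL syntax or a CWL runner; `scatter_gather`, `count_bases` and the sample sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(step, items, max_workers=4):
    """Scatter: run step once per item in parallel.
    Gather: collect the results in the same order as the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(step, items))


def count_bases(sequence):
    # Hypothetical step: any function of a single input item would do
    return len(sequence)


lengths = scatter_gather(count_bases, ["ACGT", "GG", "TTTAAA"])
total = sum(lengths)  # a downstream step consuming the gathered list
print(lengths, total)
```

Declaring the scatter on the step rather than writing the loop by hand lets the workflow engine decide how to distribute the jobs, for example across cluster nodes or cloud instances.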
bioexcel.eu
http://www.commonwl.org/
Containers
Linux Container technology
... a light-weight "virtual" virtual machine
A container is started from an image
Images downloaded from Docker Hub
Dockerfile: Layer-based recipe
Philosophy: One service, one image → microservices
Cloud's best friend: scalable, reproducible, customizable
Publish your own container images
https://hub.docker.com/r/openphacts/
Dockerfile
https://view.commonwl.org/
http://doi.org/10.7490/f1000research.1114375.1
Running workflows, tracking provenance
Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
http://www.w3.org/TR/prov-overview/
Provenance
W3C standard: PROV
But multiple formats
Multiple styles
Multiple extensions
Best practice for Workflow Provenance?
wfprov (Research Object, Taverna): https://w3id.org/ro/2016-01-28/wfprov/
OPMW/P-Plan (WINGS): http://www.opmw.org
ProvONE (DataONE): http://vcvcomputing.com/provone/provone.html
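Whatever the format or extension, workflow provenance in the PROV style comes down to statements relating entities (data), activities (step runs) and agents. The sketch below records such statements as plain tuples while running a step. The relation names `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom` and `prov:wasAssociatedWith` come from W3C PROV; the identifier scheme, the `run_with_provenance` helper and the sample step are assumptions for illustration and do not follow wfprov, OPMW or ProvONE.

```python
import hashlib


def entity_id(data: bytes) -> str:
    """Identify a data entity by its content hash (an assumed scheme)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def run_with_provenance(step_name, func, data, agent):
    """Run func on data and return (output, PROV-style statements)."""
    output = func(data)
    activity = f"run:{step_name}"
    used, generated = entity_id(data), entity_id(output)
    statements = [
        (activity, "prov:used", used),
        (generated, "prov:wasGeneratedBy", activity),
        (generated, "prov:wasDerivedFrom", used),
        (activity, "prov:wasAssociatedWith", agent),
    ]
    return output, statements


# Hypothetical step and agent name
output, statements = run_with_provenance("uppercase", bytes.upper, b"acgt", "agent:alice")
for subject, relation, obj in statements:
    print(subject, relation, obj)
```

A real workflow engine would emit these statements automatically for every step and serialise them in one of the PROV formats, which is where the differing styles and extensions listed above come in.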
https://twitter.com/ianholmes/status/288689712636493824
https://doi.org/10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
Research Object Bundle
http://www.researchobject.org/
Acknowledgements
Carole Goble
Michael R. Crusoe
Apache Taverna
BioExcel
Common Workflow Language
Research Object