The Rhetoric of Research Objects
Professor Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
researchobject.org
ISWC 2017 SemSci Workshop, Vienna, 21 October 2017
Acknowledgements
Stian Soiland-Reyes, Catarina Martins
Scholarly Communication
“The art of discourse, wherein a writer or speaker strives to inform, persuade or motivate particular audiences in specific situations”
https://en.wikipedia.org/wiki/Rhetoric
Rhetoric
Papers should describe the results and provide a clear enough method to allow successful repetition and extension
• announce a result
• convince readers the result is correct
Virtual Witnessing
Accessible Reproducible Research, Science, 22 January 2010, Vol. 327 no. 5964, pp. 415-416, DOI 10.1126/science.1179653
Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985), Shapin and Schaffer
From Manuscripts to Research Objects
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research”, 1995
Datasets, Data collections
Standard operating procedures
Software, algorithms
Configurations, Tools and apps, services
Codes, code libraries
Workflows, scripts
System software, Infrastructure, Compilers, hardware
Research Components in a study and backing an article are Many and Various
workflow commons
Collection in a Data Catalogue
Third party remote web services or command line tools
Workflows of local or remotely executed codes
16 datafiles (kinetic, flux inhibition, runout)
19 models (kinetics, validation)
13 SOPs
3 studies (model analysis, construction, validation)
24 assays/analyses (simulations, model characterisations)
Penkler G, du Toit F, Adams W, Rautenbach M, Palm DC, van Niekerk DD and Snoep JL (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237
Research Components in a study and backing an article are Many and Various
Investigation
Study Analysis
Data
Model
SOP (Assay)
https://fairdomhub.org/investigations/56
Systems Biology Commons
Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…
Spans repository silos
Regardless of location
In house…
External - subject specific, general
Structured organisation
Retaining context over fragmentation
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context
• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobject.org
Container
Research Object in a nutshell
Packaging Frameworks: Zip Archives, BagIt, Docker images
Platforms: FAIRDOM, myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects: exchange, portability and maintenance
components packaged into various containers
ISA-TAB, checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies…
Reproducibility, Preservation
Release, Exchange
Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, DOI 10.1007/978-3-642-37186-8_1
FAIR Commons
Currency of Scholarship
Interpretation, Comparison
Preservation, Repair
Portability, Reuse
Execution
Active Research
Evolving codes
New data
Software Release
Executable Papers
Scientific Instruments
Machines
Interpretation, Comparison
Portability, Reuse
Credit, Citation
22/10/2017
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec. of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory
computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments Break
• Technologies, materials and methods change
• Scope of use, robustness
• Blackboxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics, doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study, J Biomedical Semantics 5, 41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al, Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results, biorxiv.org, 2017, https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine
NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described, textually or semantically
• extensibility point for community-driven standards
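The three roles of the manifest (identification, aggregation, annotation) inside a structured ZIP file can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the key names (`id`, `aggregates`, `annotations`), the context URL, the `.ro/manifest.json` path and the mimetype string are recalled from the RO Bundle specification and should be checked against it, and every file name and the example.org URI are hypothetical.

```python
import io
import json
import zipfile

# Assumed values, recalled from the RO Bundle specification; verify before use.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_manifest(aggregates, annotations):
    """Return a manifest dict that identifies, aggregates and annotates resources."""
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",  # identification: the bundle root
        "aggregates": list(aggregates),  # embedded paths and external URIs
        "annotations": list(annotations),  # statements about aggregated things
    }


def write_bundle(stream, manifest, embedded_files):
    """Write the mimetype marker, the embedded files and the manifest to a ZIP."""
    with zipfile.ZipFile(stream, "w") as bundle:
        # OCF/UCF-style containers put an uncompressed mimetype entry first.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for path, payload in embedded_files.items():
            bundle.writestr(path, payload)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))


# Hypothetical content: one embedded data file, one external reference.
manifest = build_manifest(
    aggregates=[
        {"uri": "/data/flux.csv", "mediatype": "text/csv"},
        {"uri": "https://example.org/models/glycolysis"},  # referenced, not embedded
    ],
    annotations=[
        {"about": "/data/flux.csv", "content": "/annotations/flux-notes.txt"},
    ],
)
buffer = io.BytesIO()
write_bundle(buffer, manifest, {
    "data/flux.csv": "time,flux\n0,1.2\n",
    "annotations/flux-notes.txt": "Hypothetical note about the flux data.",
})

# Read the bundle back to show that the manifest travels with the content.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as bundle:
    entries = bundle.namelist()
    reloaded = json.loads(bundle.read(MANIFEST_PATH))
print(entries)
```

The external model is listed in `aggregates` without being copied into the archive, which is the "embedded and referenced resources" distinction the slides draw.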
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Exchange and Reuse
http://www.researchobject.org/specifications
Manifest
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages…
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project, Lab
Specific
Community-based Types
Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas, Bioschemas, Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity
Off the Shelf tools
App eco-system
Lightweight
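"Published and harvested without APIs or special feeds" can be illustrated with nothing but the Python standard library: the publisher embeds a schema.org record in an ordinary web page, and the harvester reads it back out of the HTML. This is a sketch under stated assumptions: the dataset record, its name and the example.org URL are hypothetical, and only generic schema.org `Dataset` properties are used rather than any particular Bioschemas profile.

```python
import json
from html.parser import HTMLParser


def render_page(record):
    """Publisher side: embed a schema.org record in an ordinary HTML page."""
    markup = json.dumps(record, indent=2)
    return (
        "<html><head>\n"
        '<script type="application/ld+json">\n' + markup + "\n</script>\n"
        "</head><body><h1>" + record["name"] + "</h1></body></html>"
    )


class JsonLdHarvester(HTMLParser):
    """Harvester side: collect every JSON-LD script block found in a page."""

    def __init__(self):
        super().__init__()
        self._chunks = None  # None means "not inside a JSON-LD script"
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.records.append(json.loads("".join(self._chunks)))
            self._chunks = None


# Hypothetical record; a real deployment would follow a Bioschemas profile.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example kinetic measurements",
    "description": "Hypothetical dataset used only to illustrate the mark-up.",
    "url": "https://example.org/datasets/1",
    "keywords": ["glycolysis", "kinetics"],
}
harvester = JsonLdHarvester()
harvester.feed(render_page(dataset))
harvester.close()
print(harvester.records[0]["@type"])
```

Because the metadata rides inside the page itself, any off-the-shelf crawler that understands JSON-LD can consume it; no repository-specific client is needed.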
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data, IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with ”
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
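The three kinds of check named on this slide (JSON structure, URI patterns, cross-references between identifiers) might look like the following when written as a bespoke validator. It is a sketch only: the manifest layout reuses the `id` / `aggregates` / `annotations` keys as an assumption, the allowed prefixes are illustrative, and all URIs in the sample are hypothetical.

```python
def check_manifest(manifest, allowed_prefixes):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []

    # 1. JSON structure: the top-level keys this sketch expects.
    for key in ("id", "aggregates", "annotations"):
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI patterns: every aggregated URI must start with an allowed prefix.
    aggregated = {manifest.get("id")}
    for entry in manifest.get("aggregates", []):
        uri = entry.get("uri", "")
        aggregated.add(uri)
        if not uri.startswith(tuple(allowed_prefixes)):
            problems.append(f"unexpected URI pattern: {uri}")

    # 3. Cross-references: annotations may only be about aggregated things.
    for annotation in manifest.get("annotations", []):
        target = annotation.get("about")
        if target not in aggregated:
            problems.append(f"annotation about unknown resource: {target}")

    return problems


# Hypothetical manifest with one bad URI and one dangling annotation.
manifest = {
    "id": "/",
    "aggregates": [
        {"uri": "/data/flux.csv"},
        {"uri": "https://identifiers.org/example:123"},
        {"uri": "ftp://old.example.org/raw.dat"},
    ],
    "annotations": [
        {"about": "/data/flux.csv", "content": "/annotations/notes.txt"},
        {"about": "/data/missing.csv", "content": "/annotations/orphan.txt"},
    ],
}
problems = check_manifest(manifest, ["/", "https://identifiers.org/"])
for problem in problems:
    print(problem)
```

Checks like these are cheap to hand-write for one profile; the question the talk raises is how to express them declaratively so that generic tools can run them for any profile.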
The manifest ties everything together
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
Format Specification
The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
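Several of these best practices are mechanical enough to check in code. The sketch below lints a CWL description that has already been parsed into a Python dict; it is not part of CWL Viewer or any CWL tooling. It assumes the map form of `inputs` and `outputs`, only recognises simple string types such as `File` or `File[]`, and the function names and sample tool are hypothetical.

```python
# Names the guidance above calls generic and uninformative.
GENERIC_NAMES = {"result", "input", "output"}


def check_parameters(section, parameters):
    """Check identifiers and File formats in one inputs/outputs mapping."""
    problems = []
    for identifier, spec in parameters.items():
        if isinstance(spec, str):  # shorthand such as `reads: File`
            spec = {"type": spec}
        if identifier.lower() in GENERIC_NAMES:
            problems.append(f"{section}.{identifier}: generic identifier")
        if str(spec.get("type", "")).startswith("File") and "format" not in spec:
            problems.append(f"{section}.{identifier}: File without format")
    return problems


def check_best_practices(description):
    """Return best-practice problems for a parsed tool or workflow description."""
    problems = []
    for field in ("label", "doc"):
        if not description.get(field):
            problems.append(f"missing top-level {field}")
    for section in ("inputs", "outputs"):
        problems += check_parameters(section, description.get(section, {}))
    return problems


# Hypothetical tool description, as it might look after loading the YAML.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "label": "Sort lines",
    "$namespaces": {"iana": "https://www.iana.org/assignments/media-types/"},
    "inputs": {"input": {"type": "File"}, "reverse": {"type": "boolean"}},
    "outputs": {"sorted_lines": {"type": "File", "format": "iana:text/plain"}},
}
problems = check_best_practices(tool)
for problem in problems:
    print(problem)
```

A hand-written linter like this hardcodes one community's rules, which is the motivation for capturing the same expectations as an RDF shape instead.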
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not Graph Link Following
Info for Constraints is
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine
NGS pipelines, regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair, STM Future Labs Committee; CEO EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Centre, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org, JSON-LD, RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what?
Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: credit map with numeric edge weights linking contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
  • contributors + components
2. Weighted contribution
3. Networked Credit maps
  • Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD, Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
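The idea of weighted contributions propagating through a network of components can be made concrete with a small calculation. The sketch below is one reading of transitive credit, not the authors' implementation: each product carries a credit map whose weights sum to 1, entries that are themselves products pass their share on to their own contributors, and the graph is assumed to be acyclic. The product names and all weights are hypothetical; Alice, Bob and Charlie are borrowed from the figure.

```python
def transitive_credit(product, credit_maps, _path=()):
    """Return each person's fractional credit in a product, following dependencies."""
    if product in _path:
        raise ValueError(f"cyclic dependency through {product}")
    totals = {}
    for name, weight in credit_maps[product].items():
        if name in credit_maps:
            # The entry is a component: pass its weight on to its contributors.
            inherited = transitive_credit(name, credit_maps, _path + (product,))
            for person, share in inherited.items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:
            # The entry is a person: credit them directly.
            totals[name] = totals.get(name, 0.0) + weight
    return totals


# Hypothetical credit maps; weights within each map sum to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.3, "dataset": 0.2},
    "workflow": {"Bob": 0.75, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}
credit = transitive_credit("paper", credit_maps)
for person in sorted(credit):
    print(person, round(credit[person], 3))
```

In this example Charlie never touched the paper directly, yet receives credit through both the dataset and the workflow that reused it, which is the indirect contribution the credit maps are meant to surface.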
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Acknowledgements
Stian Soiland-Reyes Catarina Martins
Scholarly Communication
ldquoThe art of discourse wherein a writer or speaker strives to inform persuade or motivate particular audiences in specific situationsrdquo
httpsenwikipediaorgwikiRhetoric
Rhetoric
papers should describe the results and provide a clear enough method to allow successful repetition and extensionbull announce a resultbull convince readers the
result is correct
Virtual Witnessing
Accessible Reproducible Research Science 22 January 2010 Vol 327 no 5964 pp 415-416 DOI 101126science1179653
Leviathan and the Air-Pump Hobbes Boyle and the Experimental Life (1985) Shapin and Schaffer
From Manuscripts to Research Objects
ldquoAn article about computational science in a scientific publication is not the scholarship itself it is merely advertising of the scholarship The actual scholarship is the complete software development environment [the complete data] and the complete set of instructions which generated the figuresrdquo David Donoho ldquoWavelab and Reproducible Researchrdquo 1995
Datasets Data collectionsStandard operating proceduresSoftware algorithmsConfigurations Tools and apps servicesCodes code librariesWorkflows scriptsSystem software Infrastructure Compilers hardware
Research Components in a study and backing an article are Many and Various
workflow commons
Collection in a Data Catalogue
Third party remote web services or command line tools
Workflows of local or remotely executed codes
16 datafiles (kinetic flux inhibition runout)
19 models (kinetics validation)
13 SOPs
3 studies (model analysis construction validation)
24 assaysanalyses (simulations model characterisations)
Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237
Research Components in a study and backing an article are Many and Various
Investigation
Study Analysis
Data
Model
SOP(Assay)
httpsfairdomhuborginvestigations56
Systems Biology Commons
Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip
Spans repository silos Regardless of locationIn househellipExternal - subject specific general
Structured organisation
Retaining contextover fragmentation
A Research Object bundles and relates digital resources of a scientific
experimentinvestigation + context
bull Data used and results produced in experimental study
bull Methods employed to produce and analyse that data
bull Provenance and settings for the experiments
bull People involved in the investigation
bull Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobjectorg
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.
Format Specification
The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types use the IANA media type list with $namespaces { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of, and link to, the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
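The best practices above can be sketched as a small lint pass. This is only an illustration, not part of CWL Viewer: it assumes a description that has already been parsed into a Python dict, with inputs and outputs in map form and types given as plain strings. The function name lint_description, the GENERIC_NAMES set and the example tool are all hypothetical.

```python
# Hypothetical lint pass over a parsed CWL-like description (not a CWL Viewer API).
GENERIC_NAMES = {"result", "input", "output"}


def lint_description(doc):
    """Return best-practice warnings for one parsed tool or workflow description."""
    warnings = []
    if not doc.get("label"):
        warnings.append("missing top-level label")
    if not doc.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for identifier, param in doc.get(section, {}).items():
            # Conceptual Identifiers: flag generic, uninformative names.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}: generic identifier '{identifier}'")
            # Format Specification: every File should declare a format.
            if param.get("type") == "File" and not param.get("format"):
                warnings.append(f"{section}: File '{identifier}' has no format")
    return warnings


# Made-up example description, for illustration only.
example_tool = {
    "class": "CommandLineTool",
    "label": "Sort a table",
    "inputs": {
        "unsorted_table": {"type": "File", "format": "iana:text/tab-separated-values"},
        "input": {"type": "File"},
    },
    "outputs": {
        "result": {"type": "File", "format": "iana:text/tab-separated-values"},
    },
}

for warning in lint_description(example_tool):
    print(warning)
```

Since the doc string is only recommended "if useful", a real checker would probably report its absence at a lower severity than a missing label or format.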
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
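As a rough illustration of what the iana and edam shapes test, the sketch below checks format IRIs against the same "starts with" patterns using regular expressions. It is not a ShEx validator. The function name format_kind is hypothetical, and accepting both http and https for EDAM is an assumption made here because the slide shows both schemes.

```python
import re

# IRI patterns mirroring the "iana" and "edam" shapes (assumed, simplified).
IANA_PATTERN = re.compile(r"^https://www\.iana\.org/assignments/media-types/.+")
EDAM_PATTERN = re.compile(r"^https?://edamontology\.org/format_\d+$")


def format_kind(iri):
    """Classify a format IRI as 'iana', 'edam', or None if it matches neither."""
    if IANA_PATTERN.match(iri):
        return "iana"
    if EDAM_PATTERN.match(iri):
        return "edam"
    return None


print(format_kind("https://www.iana.org/assignments/media-types/text/plain"))
print(format_kind("http://edamontology.org/format_1915"))
print(format_kind("text/plain"))
```

A pattern test like this only looks at the shape of the IRI. It cannot confirm that the term really exists in EDAM or is a subclass of format_1915, which needs the ontology itself.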
ShEx is SPO testing, not Graph Link Following
Info for Constraints is
• Embedded in a specific format
  - Extract/convert from domain-specific formats
• Embedded in annotation resources
  - Use existing schema.org annotations
• Needs to be acquired, e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations post-processing
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
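The "merge to single graph" pre-processing step can be sketched with plain Python sets of (subject, predicate, object) triples. This is a toy model rather than an RDF library: the function names, the prefixed strings standing in for IRIs and the two example graphs are all assumptions made for illustration.

```python
def merge_graphs(*graphs):
    """Union several iterables of (subject, predicate, object) triples into one set."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def subjects_missing(graph, rdf_type, required_predicate):
    """List subjects of the given type that lack the required predicate."""
    typed = {s for (s, p, o) in graph if p == "rdf:type" and o == rdf_type}
    described = {s for (s, p, o) in graph if p == required_predicate}
    return sorted(typed - described)


# Hypothetical RO content: the type lives in the manifest graph,
# the label lives in a separate annotation resource.
manifest_graph = {("wf", "rdf:type", "cwl:Workflow")}
annotation_graph = {("wf", "rdfs:label", "Align reads")}

# Checked alone, the manifest graph appears to violate "should have label".
print(subjects_missing(manifest_graph, "cwl:Workflow", "rdfs:label"))

# After merging, the same check passes.
merged = merge_graphs(manifest_graph, annotation_graph)
print(subjects_missing(merged, "cwl:Workflow", "rdfs:label"))
```

The point of the sketch is the ordering: a shape check over subject-predicate-object triples only sees what is already in the graph it is handed, so the unpacking and merging has to happen first.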
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases
as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines, regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Centre, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD, RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles.
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: credit-mapping graph with numeric labels, for contributors Alice, Bob and Charlie]
Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software v2(1), e20, pp 1-4, 2014. DOI: 10.5334/jors.be
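The idea of weighted contributions travelling with components can be sketched as follows. This is a minimal reading of transitive credit, not the scheme from the cited papers. It assumes each product has a credit map whose weights sum to 1, that contributors may be people or other products, and that the dependency graph has no cycles. The names Alice, Bob and Charlie come from the slide, while the products and weights are made up.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Propagate fractional credit from a product down to the people behind it.

    credit_maps maps a product name to {contributor: share}. A contributor that
    is itself a key of credit_maps is treated as a component whose own credit
    map is followed recursively (the graph is assumed to be acyclic).
    """
    if totals is None:
        totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # Component: pass the weighted share on to its own contributors.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps: a paper that reuses a dataset.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.3, "dataset": 0.2},
    "dataset": {"Charlie": 0.75, "Alice": 0.25},
}

for person, credit in sorted(transitive_credit("paper", credit_maps).items()):
    print(f"{person}: {credit:.2f}")
```

In this example Charlie never touched the paper but still receives a share through the dataset, which is the kind of indirect contribution the bullets above describe.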
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK, Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Research Components in a study and backing an article are Many and Various
Investigation
Study Analysis
Data
Model
SOP (Assay)
https://fairdomhub.org/investigations/56
Systems Biology Commons
Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…
Spans repository silos
Regardless of location
In house…
External - subject specific, general
Structured organisation
Retaining context over fragmentation
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context
• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable, Reproducible, Packaging
researchobject.org
Container
Research Object in a nutshell
Packaging Frameworks: Zip Archives, BagIt, Docker images
Platforms: FAIRDOM, myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects: exchange, portability and maintenance
components packaged into various containers
ISA-TAB, checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies…
Reproducibility, Preservation
Release, Exchange
Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI: 10.1007/978-3-642-37186-8_1
FAIR Commons
Currency of Scholarship
Interpretation, Comparison
Preservation, Repair
Portability, Reuse
Execution
Active Research
Evolving codes
New data
Software Release
Executable Papers
Scientific Instruments
Machines
Interpretation, Comparison
Portability, Reuse
Credit, Citation
22/10/2017
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments Break
• Technologies, materials and methods change
• Scope of use, robustness
• Blackboxes - dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org 2017. https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine
NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Exchange and Reuse
http://www.researchobject.org/specifications
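A structured ZIP with a manifest that aggregates and annotates its contents can be sketched in a few lines. The layout below is loosely modelled on the Research Object Bundle conventions (a "mimetype" entry stored first and uncompressed, and a manifest at .ro/manifest.json). The exact field names, the media type string and the example files should be treated as assumptions and checked against the specification.

```python
import io
import json
import zipfile

# Media type and manifest path assumed from the RO Bundle conventions.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_bundle(files, annotations):
    """Return the bytes of a ZIP holding the files plus a manifest describing them.

    files: {path: bytes}; annotations: list of (about_path, content_path) pairs.
    """
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        # Aggregates: link things together.
        "aggregates": [{"uri": "/" + path} for path in files],
        # Annotations: statements about things and their relationships.
        "annotations": [
            {"about": "/" + about, "content": "/" + content}
            for about, content in annotations
        ],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # ePub OCF style: the mimetype entry comes first and is not compressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for path, content in files.items():
            bundle.writestr(path, content)
    return buffer.getvalue()


# Hypothetical payload: one data file and a free-text note about it.
bundle_bytes = build_bundle(
    files={
        "data/flux.csv": b"t,flux\n0,1.0\n",
        "annotations/flux-notes.txt": b"Hypothetical note about the flux data.\n",
    },
    annotations=[("data/flux.csv", "annotations/flux-notes.txt")],
)

with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
    print(bundle.namelist())
```

Because the manifest is ordinary JSON inside an ordinary ZIP, generic tools can open the container even when they know nothing about Research Objects.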
Manifest
Artist’s Impression
The real manifest
• A Manifest for 27 A4 pages…
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types, Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
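Publishing metadata "without APIs or special feeds" usually means embedding JSON-LD in the page itself. The sketch below emits a plain schema.org Dataset record inside a script tag. It does not reproduce any particular Bioschemas profile, the function name dataset_markup is hypothetical, and the example values simply reuse the investigation title and URL shown earlier in the slides.

```python
import json

OPEN_TAG = '<script type="application/ld+json">'
CLOSE_TAG = "</script>"


def dataset_markup(name, description, url, identifier):
    """Return an HTML snippet carrying a schema.org Dataset description as JSON-LD."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }
    return OPEN_TAG + json.dumps(record, indent=2) + CLOSE_TAG


snippet = dataset_markup(
    name="My Investigation",
    description="Illustrative description of an investigation (made up for this sketch).",
    url="https://fairdomhub.org/investigations/56",
    identifier="https://doi.org/10.15490/seek.1.investigation.56",
)
print(snippet)
```

A harvester only needs to fetch the page and parse the script element, which is why commodity tools and search engines can consume this kind of mark-up.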
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with ”
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
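The three kinds of check named above (JSON structure, URI patterns and cross-references between identifiers) can be sketched against a manifest held as a Python dict. The key names follow the manifest layout assumed elsewhere in these notes, and the allowed prefixes and the example manifest are hypothetical.

```python
def check_manifest(manifest, allowed_prefixes=("/", "https://doi.org/", "https://identifiers.org/")):
    """Return a list of problems found in a manifest dict (empty if none)."""
    problems = []

    # 1. JSON structure: required top-level keys.
    for key in ("id", "aggregates", "annotations"):
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI patterns: every aggregated URI must start with an allowed prefix.
    aggregated = set()
    for entry in manifest.get("aggregates", []):
        uri = entry.get("uri")
        if not isinstance(uri, str) or not uri.startswith(allowed_prefixes):
            problems.append(f"unexpected URI: {uri!r}")
        else:
            aggregated.add(uri)

    # 3. Cross-references: annotations may only be about aggregated resources.
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in aggregated:
            problems.append(f"annotation about non-aggregated resource: {about!r}")

    return problems


# Made-up manifest with one bad URI and one dangling annotation.
example_manifest = {
    "id": "/",
    "aggregates": [
        {"uri": "/data/flux.csv"},
        {"uri": "ftp://example.org/model.xml"},
    ],
    "annotations": [
        {"about": "/data/flux.csv", "content": "/annotations/flux-notes.txt"},
        {"about": "/data/missing.csv", "content": "/annotations/other.txt"},
    ],
}

for problem in check_manifest(example_manifest):
    print(problem)
```

Checks of this kind need only the manifest, so they can run before any payload is unpacked or any external link is followed.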
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
bull ldquoMust have a workflow that analyses next-gen sequencing datardquo
bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo
bull ldquoAll required data files must be providedrdquo
bull ldquoGeneric names should be avoidedrdquo
General Tools that do their best at unpacking and handing off
Did anyone take any notice
Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool
Standardised packing of Systems Biology models
European Space Agency RO LibraryEverest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Heath Learning Systems
Asthma Research e-Labsharing and computing statistical cohort studies
Precision medicineNGS pipelines regulation
Did anyone take any notice httpwwwyoutubecomwatchv=p-W4iLjLTrQamplist=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner Chair STMFuture Labs Committee CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT January 2014 Lorenz Centre Leiden
Ted SlaterYARC OpenBEL
A trendhellip Using JSON(-LD) + schemaorg
httpsdokieli
httpslinkedresearchorg
Manifest Schemaorg JSON-LD RDFArchive targz
Reproducible Document Stack project
eLife Substance and Stencila
BagIT data profile + schemaorg JSON-LD annotations
We should have called this ldquoResearch Objectsrdquo Donrsquot be too clever about your titles
Combining ISA-based Research Objects with nanopublicatiionsComplementary approaches
Take-up Analogy Start Ups
Community
Driver
ToolsEasy to make
Hard to consume
Workflows
Reproducibility
Portability between
platforms
Platform amp user buy-in from the get-go
Passionate dedicated leadership
Stan
dard
s
Open Questions
Stewardshipbull owners sites authors
Spanningbull platforms researchers
Lifecyclebull composition forkinghellip
Governance
Creditbull micro-credit amp citation
propagation attribution
Tamper proofingbull blockchain ethereum
Maintenancebull of evolving contentbull incrementality amp
degradation
Manifestsbull profile amp
template makingbull auto
manufacture
Who gets credit for whatUsing Provenance for Credit Mapping
[Paolo Missier]
1
3
2
2
34
11
1
2
2
5
3
3
4
3Alice
Charlie
Bob
Paolo Missier Data Trajectories tracking reuse of published data for transitive credit attribution IDCC 2016
W3C PROVdependency graph
ldquoProvletsrdquo
Granularity Atomicity Aggregation
bull Tracking RO usage and indirect contributions
bull Awarding fractional credit to contributors
1 ldquoContriponentsrdquo
bull contributors + components
2 Weighted contribution
3 Networked Credit maps
bull Travel with the contriponents
Transitive Credit contribution[Dan Katz and Arfon Smith]
Katz DS amp Smith AM (2015) Transitive Credit and JSON-LD Journal of Open Research Software 3(1) pe7 DOI httpdoiorg105334jorsby
D S Katz Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products Journal of Open Research Software v2(1) e20 pp 1-4 2014 DOI 105334jorsbe
bull Manifests using semantics
bull Commons of components
bull A new scholarly currency
bull Necessity for reproducible machines
bull Foundation of release of research
bull Ramps rather than Revolution
The Rhetoric of Research Objectsresearchobjectorg
Reports of the death of the
scientific paper are greatly
exaggerated
All the members of the Wf4Ever teamColleagues in Manchesterrsquos Information Management GroupELIXIR-UK Bioschemas
httpwwwresearchobjectorg
httpwwwwf4ever-projectorg
httpwwwfair-domorg
httpseek4scienceorg
httprightfieldorguk
httpwwwbioschemasorg
httpwwwcommonwlorg
httpwwwbioexceleu
Mark RobinsonAlan WilliamsJo McEntyreNorman MorrisonStian Soiland-ReyesPaul GrothTim ClarkAlejandra Gonzalez-BeltranPhilippe Rocca-SerraIan CottamSusanna SansoneKristian GarzaDaniel GarijoCatarina MartinsAlasdair GrayRafael JimenezIain BuchanCaroline JayMichael CrusoeKaty Wolstencroft
Barend MonsSean BechhoferPhilip BourneMatthew GambleRaul PalmaJun ZhaoNeil Chue HongJosh SommerMatthias ObstJacky SnoepDavid GavaghanRebecca LawrenceStuart OwenFinn BacallPaolo MissierPhil CrouchOscar CorchoDan Katz Arfon Smith
From Manuscripts to Research Objects
ldquoAn article about computational science in a scientific publication is not the scholarship itself it is merely advertising of the scholarship The actual scholarship is the complete software development environment [the complete data] and the complete set of instructions which generated the figuresrdquo David Donoho ldquoWavelab and Reproducible Researchrdquo 1995
Datasets Data collectionsStandard operating proceduresSoftware algorithmsConfigurations Tools and apps servicesCodes code librariesWorkflows scriptsSystem software Infrastructure Compilers hardware
Research Components in a study and backing an article are Many and Various
workflow commons
Collection in a Data Catalogue
Third party remote web services or command line tools
Workflows of local or remotely executed codes
16 datafiles (kinetic flux inhibition runout)
19 models (kinetics validation)
13 SOPs
3 studies (model analysis construction validation)
24 assaysanalyses (simulations model characterisations)
Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237
Research Components in a study and backing an article are Many and Various
Investigation
Study Analysis
Data
Model
SOP(Assay)
httpsfairdomhuborginvestigations56
Systems Biology Commons
Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip
Spans repository silos Regardless of locationIn househellipExternal - subject specific general
Structured organisation
Retaining contextover fragmentation
A Research Object bundles and relates digital resources of a scientific
experimentinvestigation + context
bull Data used and results produced in experimental study
bull Methods employed to produce and analyse that data
bull Provenance and settings for the experiments
bull People involved in the investigation
bull Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobjectorg
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Exchange and Reuse
http://www.researchobject.org/specifications
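The three manifest roles listed above (identification, aggregation, annotation) inside a structured ZIP file can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the media type string, the `.ro/manifest.json` path, the context URL and the `aggregates` / `annotations` key names are assumptions based on a reading of the Research Object Bundle draft and should be checked against the specifications linked above. The file names and the `build_bundle` helper are hypothetical.

```python
import io
import json
import zipfile

# Assumed values, taken from a reading of the RO Bundle draft; verify before use.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_bundle(files, external_uris, annotations):
    """Return the bytes of a ZIP container holding the files plus a manifest."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        # Aggregates: link embedded and externally referenced resources together.
        "aggregates": [{"uri": "/" + name} for name in sorted(files)]
        + [{"uri": uri} for uri in external_uris],
        # Annotations: statements about things and their relationships.
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # ePub OCF / Adobe UCF convention: 'mimetype' comes first, uncompressed.
        archive.writestr(
            zipfile.ZipInfo("mimetype"), MIMETYPE, compress_type=zipfile.ZIP_STORED
        )
        for name, content in sorted(files.items()):
            archive.writestr(name, content)
        archive.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
    return buffer.getvalue()


# Hypothetical content: two embedded files and one external reference.
bundle = build_bundle(
    files={
        "data/flux.csv": "t,flux\n0,1.0\n",
        "workflow.cwl": "cwlVersion: v1.0\n",
    },
    external_uris=["https://doi.org/10.1101/191783"],
    annotations=[{"about": "/workflow.cwl", "content": "https://example.org/notes"}],
)
```

Keeping the manifest as ordinary JSON inside the archive is what lets commodity tools open the container even when they know nothing about Research Objects.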
Manifest
Artist’s Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project, Lab Specific
Community-based Types, Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Training Resource
Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity, off-the-shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with …”
Check JSON Structure
Different levels: from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
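The three kinds of check named on this slide (JSON structure, URI patterns, cross-references between identifiers) can be sketched as a small Python function. This is an illustrative sketch under stated assumptions: the `aggregates`, `annotations`, `uri` and `about` keys mirror a Research Object style manifest.json but are assumed here, and the example data and the `check_manifest` helper are hypothetical.

```python
def check_manifest(manifest, allowed_prefixes):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []
    # JSON structure: both top-level keys must be present and hold lists.
    for key in ("aggregates", "annotations"):
        if not isinstance(manifest.get(key), list):
            problems.append(f"'{key}' must be a list")
    if problems:
        return problems

    known = set()
    for entry in manifest["aggregates"]:
        uri = entry.get("uri", "") if isinstance(entry, dict) else ""
        if not uri:
            problems.append("aggregate without a 'uri'")
            continue
        known.add(uri)
        # URI patterns: external references must start with an allowed prefix.
        is_local = uri.startswith("/")
        if not is_local and not uri.startswith(tuple(allowed_prefixes)):
            problems.append(f"unexpected URI pattern: {uri}")

    # Cross-references: every annotation must be about an aggregated resource.
    for annotation in manifest["annotations"]:
        target = annotation.get("about")
        if target not in known:
            problems.append(f"annotation about unknown resource: {target}")
    return problems


# Hypothetical manifest with one bad URI and one dangling annotation.
example = {
    "aggregates": [
        {"uri": "/data/flux.csv"},
        {"uri": "https://doi.org/10.1109/BigData.2016.7840618"},
        {"uri": "ftp://example.org/raw.dat"},
    ],
    "annotations": [
        {"about": "/data/flux.csv", "content": "/annotations/flux.json"},
        {"about": "/data/missing.csv", "content": "/annotations/missing.json"},
    ],
}
problems = check_manifest(example, ["https://doi.org/", "https://identifiers.org/"])
```

Returning a list of problems rather than raising on the first one suits a monitoring tool, which usually wants to report everything wrong with a Research Object in one pass.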
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
Permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow.
Labels give the user an easy, human-readable version of the name for the tool or workflow.
For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs.
For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished.
Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.
Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces iana: https://www.iana.org/assignments/media-types/, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files.
Ontologies will be parsed and the name of, and link to, the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
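To make the intent of the profile concrete, the same three rules (has a label, has a doc, File ports carry an IANA or EDAM format IRI) can be sketched over plain Python dictionaries. This is a stand-in for illustration only, not a ShEx validator: the dictionary layout, the `check_description` helper and the example descriptions are hypothetical, and the check looks only at IRI prefixes, so it cannot confirm that a format term really exists in the ontology.

```python
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"
# Both schemes are accepted because the profile text shows each form.
EDAM_PREFIXES = ("http://edamontology.org/format_", "https://edamontology.org/format_")


def check_description(description):
    """Check one workflow or tool description (a plain dict) against the profile."""
    problems = []
    name = description.get("id", "<unnamed>")
    if not description.get("label"):
        problems.append(f"{name}: missing label")
    if not description.get("doc"):
        problems.append(f"{name}: missing doc")
    for port in description.get("inputs", []) + description.get("outputs", []):
        if port.get("type") != "File":
            continue  # the format rule only applies to File ports
        fmt = port.get("format", "")
        if not fmt.startswith((IANA_PREFIX,) + EDAM_PREFIXES):
            problems.append(
                f"{name}/{port.get('id')}: format is not an IANA or EDAM IRI"
            )
    return problems


# Hypothetical descriptions: one that satisfies the profile, one that does not.
good = {
    "id": "align",
    "label": "Align reads",
    "doc": "Aligns sequencing reads against a reference.",
    "inputs": [
        {"id": "reads", "type": "File", "format": "http://edamontology.org/format_1930"}
    ],
    "outputs": [
        {"id": "report", "type": "File", "format": IANA_PREFIX + "text/plain"}
    ],
}
bad = {
    "id": "step2",
    "label": "",
    "inputs": [{"id": "result", "type": "File"}, {"id": "threads", "type": "int"}],
    "outputs": [],
}
good_problems = check_description(good)
bad_problems = check_description(bad)
```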
ShEx is SPO testing, not Graph Link Following
Info for Constraints are:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: W3C PROV dependency graph of “Provlets”, with numeric labels, crediting contributors Alice, Bob and Charlie]
Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz, D.S. & Smith, A.M. (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D. S. Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, v2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
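The idea of weighted contributions that travel through networked credit maps can be illustrated with a short Python sketch. It assumes, for illustration only, that each product carries a credit map whose fractions sum to 1 and that credit passes through components by multiplying weights along each path. The products, weights and the `transitive_credit` helper are hypothetical and are not taken from the cited papers.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate each person's share of credit for `product`.

    credit_maps maps a product name to {contributor_or_component: fraction}.
    A key that is itself a product in credit_maps is treated as a component,
    and its own credit map is followed transitively.
    Assumes the dependency graph has no cycles.
    """
    if totals is None:
        totals = {}
    for name, fraction in credit_maps[product].items():
        share = weight * fraction
        if name in credit_maps:
            # Component: pass its share on to whoever is credited for it.
            transitive_credit(name, credit_maps, share, totals)
        else:
            # Person: add the share to their running total.
            totals[name] = totals.get(name, 0.0) + share
    return totals


# Hypothetical credit maps for a paper built on a workflow and a dataset.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "dataset": 0.5},
    "dataset": {"Charlie": 1.0},
}
credit = transitive_credit("paper", credit_maps)
# credit == {"Alice": 0.5, "Bob": 0.125, "Charlie": 0.375}
```

Charlie never touched the paper directly, yet receives credit through both the dataset and the workflow that reused it, which is the indirect contribution the slide refers to.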
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK
Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Research Components in a study and backing an article are Many and Various
Investigation
Study Analysis
Data
Model
SOP (Assay)
https://fairdomhub.org/investigations/56
Systems Biology Commons
Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…
Spans repository silos
Regardless of location
In house…
External - subject specific, general
Structured organisation
Retaining context over fragmentation
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context
• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobject.org
Container
Research Object in a nutshell
Packaging Frameworks: Zip Archives, BagIt, Docker images
Platforms: FAIRDOM, myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects: exchange, portability and maintenance
components packaged into various containers
ISA-TAB
checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies …
Reproducibility, Preservation
Release, Exchange
Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI: 10.1007/978-3-642-37186-8_1
FAIR Commons
Currency of Scholarship
Interpretation, Comparison
Preservation, Repair
Portability, Reuse
Execution
Active Research
Evolving codes
New data
Software Release
Executable Papers
Scientific Instruments
Machines
Interpretation, Comparison
Portability, Reuse
Credit, Citation
22/10/2017
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser: [ … ] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise.”
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments Break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5, 41
workflow commons
Collection in a Data Catalogue
Third party remote web services or command line tools
Workflows of local or remotely executed codes
16 datafiles (kinetic flux inhibition runout)
19 models (kinetics validation)
13 SOPs
3 studies (model analysis construction validation)
24 assaysanalyses (simulations model characterisations)
Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237
Research Components in a study and backing an article are Many and Various
Investigation
Study Analysis
Data
Model
SOP(Assay)
httpsfairdomhuborginvestigations56
Systems Biology Commons
Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip
Spans repository silos Regardless of locationIn househellipExternal - subject specific general
Structured organisation
Retaining contextover fragmentation
A Research Object bundles and relates digital resources of a scientific
experimentinvestigation + context
bull Data used and results produced in experimental study
bull Methods employed to produce and analyse that data
bull Provenance and settings for the experiments
bull People involved in the investigation
bull Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobjectorg
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser: […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise.”
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org 2017. https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine: NGS workflow exchange, FDA regulatory review submissions. Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
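The three construction roles (identification, aggregates, annotations) can be sketched as a small data structure. This is only an illustration: the key names are loosely modelled on a JSON manifest and are assumptions, as are the function name and the example identifiers.

```python
import json


def build_manifest(ro_id, creator_orcid, resources, annotations):
    """Assemble a manifest dictionary.

    resources:   iterable of (uri, mediatype) pairs -> what is aggregated
    annotations: iterable of (about_uri, content_uri) pairs -> descriptions
                 of aggregated things and their relationships
    """
    return {
        # Identification: the object and its creator are located by identifiers
        "id": ro_id,
        "createdBy": {"orcid": creator_orcid},
        # Aggregates: the resources the Research Object links together
        "aggregates": [
            {"uri": uri, "mediatype": mediatype} for uri, mediatype in resources
        ],
        # Annotations: metadata about the things and their relationships
        "annotations": [
            {"about": about, "content": content} for about, content in annotations
        ],
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "/",
        "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
        [("/data/flux.csv", "text/csv"), ("/models/kinetics.xml", "application/xml")],
        [("/data/flux.csv", "/annotations/flux-provenance.ttl")],
    )
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest as plain metadata about resources, rather than the resources themselves, is what makes a Research Object a metadata object.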
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages, ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF), Adobe Universal Container Format (UCF)
Manifest Construction
Identification: to locate and resolve things (RRI, DOI, URI, ORCID)
Aggregates: to link things together (Open Archives Initiative Object Reuse and Exchange)
Annotations: about things & their relationships (W3C Web Annotation Vocabulary)
Structured ZIP file, based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described, textually or semantically
• extensibility point for community-driven standards
http://www.researchobject.org/specifications
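A structured ZIP file in the OCF/UCF style can be sketched with the standard library. The sketch assumes the usual convention that an uncompressed `mimetype` entry comes first; the media type string and the `.ro/manifest.json` path are assumptions here and should be checked against the published specification.

```python
import io
import json
import zipfile

# Assumed media type for the bundle; confirm against the specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"  # assumed manifest location


def write_bundle(target, manifest, files):
    """Write a structured ZIP: mimetype first (stored), manifest, then payload."""
    with zipfile.ZipFile(target, "w") as bundle:
        # First entry, uncompressed, so the type can be sniffed from raw bytes
        bundle.writestr(
            zipfile.ZipInfo("mimetype"), MIMETYPE, compress_type=zipfile.ZIP_STORED
        )
        bundle.writestr(
            MANIFEST_PATH,
            json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )
        for name, payload in files.items():
            bundle.writestr(name, payload, compress_type=zipfile.ZIP_DEFLATED)


def read_manifest(source):
    """Check the mimetype convention, then return the parsed manifest."""
    with zipfile.ZipFile(source) as bundle:
        first = bundle.infolist()[0]
        if first.filename != "mimetype" or first.compress_type != zipfile.ZIP_STORED:
            raise ValueError("first entry must be an uncompressed 'mimetype'")
        return json.loads(bundle.read(MANIFEST_PATH))
```

Because the manifest sits at a fixed path, a generic tool can open any such archive and find the description without knowing what else is inside.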
Manifest
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from, its evolution
what else is needed
what should be there for types
Manifest
Project / Lab specific
Community-based Types, Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
[Diagram: data repositories and a training resource carry Bioschemas mark-up, consumed by search engines, registries and data aggregators]
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity, off-the-shelf tools; app eco-system; lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events, Laboratory protocols, Workflows and Tools
See Alasdair Gray's Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
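As a rough illustration of schema.org mark-up with a minimum-information check, the sketch below builds a JSON-LD description of a dataset and reports missing properties. The required-property list is an assumption made for the example, not a published Bioschemas profile, and the example values are placeholders.

```python
import json

# Assumed minimum information for the "Dataset" content type (illustrative only)
REQUIRED = ("name", "description", "url", "identifier")


def dataset_markup(name, description, url, identifier, keywords=()):
    """Return a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": list(keywords),
    }


def missing_properties(markup, required=REQUIRED):
    """List required properties that are absent or empty."""
    return [prop for prop in required if not markup.get(prop)]


def as_script_tag(markup):
    """Wrap the JSON-LD so it can be embedded in a web page and harvested."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Embedding the result in an ordinary web page is what lets search engines and registries harvest the metadata without a dedicated API.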
Manifest Description Profiles
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble: MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with ”
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
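The three kinds of check listed on this slide (JSON structure, URI patterns, cross-references between identifiers) can be sketched as one function. The manifest keys and the ORCID pattern are assumptions chosen for illustration, not rules taken from a specification.

```python
import re

# Assumed URI pattern: an ORCID written as an https URI
ORCID_PATTERN = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
REQUIRED_KEYS = ("id", "aggregates", "annotations")  # assumed structure


def check_manifest(manifest):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []

    # 1. JSON structure: are the expected top-level keys present?
    for key in REQUIRED_KEYS:
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI pattern: does the creator identifier look like an ORCID URI?
    orcid = manifest.get("createdBy", {}).get("orcid")
    if orcid is not None and not ORCID_PATTERN.match(orcid):
        problems.append(f"bad ORCID pattern: {orcid}")

    # 3. Cross-reference: every annotation must be about an aggregated resource
    aggregated = {entry.get("uri") for entry in manifest.get("aggregates", [])}
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in aggregated:
            problems.append(f"annotation about unaggregated resource: {about}")

    return problems
```

Checks like these need only the manifest, so they can run before any of the aggregated files are fetched or unpacked.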
Case study: Back to Workflows
Workflow description, Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO: CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow.
Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step, if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step, if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.
Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed, and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
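Several of these recommendations are mechanical enough to check automatically. The sketch below is a hypothetical linter, not part of CWL Viewer: it assumes a description already parsed into a Python dictionary, with `inputs` and `outputs` in map form whose values are dictionaries, and it covers only labels, docs, generic identifiers and File formats.

```python
# Identifiers treated as generic; this list is an assumption for the example
GENERIC_NAMES = {"result", "input", "output"}


def check_description(doc):
    """Return warnings for a parsed tool or workflow description."""
    warnings = []

    # Label Strings and Doc Strings: both expected at the top level
    if not doc.get("label"):
        warnings.append("missing top-level label")
    if not doc.get("doc"):
        warnings.append("missing top-level doc")

    for section in ("inputs", "outputs"):
        for identifier, param in doc.get(section, {}).items():
            # Conceptual Identifiers: avoid uninformative names
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}/{identifier}: generic identifier")
            # Format Specification: every File should declare a format
            if param.get("type") == "File" and not param.get("format"):
                warnings.append(f"{section}/{identifier}: File without format")

    return warnings


if __name__ == "__main__":
    tool = {
        "class": "CommandLineTool",
        "label": "Sort alignments",
        "inputs": {"input": {"type": "File"}, "threads": {"type": "int"}},
        "outputs": {"sorted_bam": {"type": "File", "format": "edam:format_2572"}},
    }
    for warning in check_description(tool):
        print(warning)
```

Separation of Concerns is deliberately left out, since judging whether a tool does a single operation needs human review.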
shouldHaveDoc: ( a cwl:Workflow | a cwl:Tool ), rdfs:comment LITERAL
shouldHaveLabel: ( a cwl:Workflow | a cwl:Tool ), rdfs:label LITERAL
step: a cwl:Step, cwl:inputs shouldHaveFormat, cwl:outputs shouldHaveFormat
shouldHaveFormat: cwl:File, dct:format ( iana | edam )
iana: IRI ^https://www.iana.org/assignments/media-types/
edam: IRI ^https://edamontology.org/format_, rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not Graph Link Following
Info for Constraints are:
• Embedded in a specific format
– Extract/convert from domain-specific formats
• Embedded in annotation resources
– Use existing schema.org annotations
• Need to be acquired
– e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
– Pre-declare ontologies
– Add derived annotations post-processing
RDF must already be in a single graph
Can't check if resource exists (e.g. 404)
Can't test format/representation of resource (“is it actually an Excel file?”)
Can't apply nested RDF shapes to Linked Data resources
Can't say “Must be term from any resolvable ontology”
Can't check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject's Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
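The pre-processing step and the subject-predicate-object style of testing can be illustrated without an RDF library. In this sketch a graph is simply a set of (subject, predicate, object) string triples, which ignores blank nodes and literal datatypes; the shape being tested and the allowed prefixes are assumptions for the example.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_FORMAT = "http://purl.org/dc/terms/format"
# Assumed allowed prefixes for format IRIs
ALLOWED_FORMAT_PREFIXES = (
    "https://www.iana.org/assignments/media-types/",
    "http://edamontology.org/format_",
)


def merge_graphs(*graphs):
    """Pre-processing: union several annotation graphs into a single graph."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def conforms(graph, node):
    """SPO test: node has a label and every format IRI has an allowed prefix.

    Only the text of the IRI is tested. Nothing is dereferenced, so this
    cannot tell whether the term really exists in the ontology.
    """
    has_label = bool(objects(graph, node, RDFS_LABEL))
    formats = objects(graph, node, DCT_FORMAT)
    formats_ok = bool(formats) and all(
        f.startswith(ALLOWED_FORMAT_PREFIXES) for f in formats
    )
    return has_label and formats_ok
```

If the label and the format live in separate annotation files, the node fails against either graph alone and passes only after the merge, which is why the single-graph requirement matters.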
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software: Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library, Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Centre, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD, RDF. Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don't be too clever about your titles
Combining ISA-based Research Objects with nanopublications: complementary approaches
Take-up Analogy: Start-Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: W3C PROV dependency graph with numbered edges linking the contributors Alice, Bob and Charlie; “Provlets”]
Paolo Missier: Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D S Katz: Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
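The arithmetic behind fractional, transitive credit can be shown with a short worked example. It assumes each product carries a credit map whose weights sum to 1, and that credit assigned to a component is passed on to that component's own contributors. The products, people and weights below are invented for illustration.

```python
def transitive_credit(credit_maps, product, weight=1.0, totals=None, path=()):
    """Distribute `weight` of credit for `product` down to people.

    credit_maps: {product: {contributor_or_component: fraction}}
    A contributor that has its own credit map is treated as a component,
    and its share is passed on to its contributors in turn.
    """
    if totals is None:
        totals = {}
    if product in path:
        raise ValueError(f"cyclic credit map at {product}")
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # Component: recurse, scaling by the share it received
            transitive_credit(
                credit_maps, contributor, weight * share, totals, path + (product,)
            )
        else:
            # Person: accumulate fractional credit
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


if __name__ == "__main__":
    # Invented example weights
    maps = {
        "paper": {"Alice": 0.5, "workflow": 0.3, "dataset": 0.2},
        "workflow": {"Bob": 0.75, "Alice": 0.25},
        "dataset": {"Charlie": 1.0},
    }
    for person, credit in sorted(transitive_credit(maps, "paper").items()):
        print(person, round(credit, 3))
```

With these weights Alice receives 0.5 directly plus 0.3 × 0.25 through the workflow, Bob receives 0.3 × 0.75, and Charlie receives the full 0.2 assigned to the dataset, so the total remains 1.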
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects, researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team; colleagues in Manchester's Information Management Group; ELIXIR-UK; Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Spans repository silos Regardless of locationIn househellipExternal - subject specific general
Structured organisation
Retaining contextover fragmentation
A Research Object bundles and relates digital resources of a scientific
experimentinvestigation + context
bull Data used and results produced in experimental study
bull Methods employed to produce and analyse that data
bull Provenance and settings for the experiments
bull People involved in the investigation
bull Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobjectorg
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
bull ldquoMust have a workflow that analyses next-gen sequencing datardquo
bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo
bull ldquoAll required data files must be providedrdquo
bull ldquoGeneric names should be avoidedrdquo
General Tools that do their best at unpacking and handing off
Did anyone take any notice
Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool
Standardised packing of Systems Biology models
European Space Agency RO LibraryEverest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Heath Learning Systems
Asthma Research e-Labsharing and computing statistical cohort studies
Precision medicineNGS pipelines regulation
Did anyone take any notice httpwwwyoutubecomwatchv=p-W4iLjLTrQamplist=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner Chair STMFuture Labs Committee CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT January 2014 Lorenz Centre Leiden
Ted SlaterYARC OpenBEL
A trendhellip Using JSON(-LD) + schemaorg
httpsdokieli
httpslinkedresearchorg
Manifest Schemaorg JSON-LD RDFArchive targz
Reproducible Document Stack project
eLife Substance and Stencila
BagIT data profile + schemaorg JSON-LD annotations
We should have called this ldquoResearch Objectsrdquo Donrsquot be too clever about your titles
Combining ISA-based Research Objects with nanopublicatiionsComplementary approaches
Take-up Analogy Start Ups
Community
Driver
ToolsEasy to make
Hard to consume
Workflows
Reproducibility
Portability between
platforms
Platform amp user buy-in from the get-go
Passionate dedicated leadership
Stan
dard
s
Open Questions
Stewardshipbull owners sites authors
Spanningbull platforms researchers
Lifecyclebull composition forkinghellip
Governance
Creditbull micro-credit amp citation
propagation attribution
Tamper proofingbull blockchain ethereum
Maintenancebull of evolving contentbull incrementality amp
degradation
Manifestsbull profile amp
template makingbull auto
manufacture
Who gets credit for whatUsing Provenance for Credit Mapping
[Paolo Missier]
1
3
2
2
34
11
1
2
2
5
3
3
4
3Alice
Charlie
Bob
Paolo Missier Data Trajectories tracking reuse of published data for transitive credit attribution IDCC 2016
W3C PROVdependency graph
ldquoProvletsrdquo
Granularity Atomicity Aggregation
bull Tracking RO usage and indirect contributions
bull Awarding fractional credit to contributors
1 ldquoContriponentsrdquo
bull contributors + components
2 Weighted contribution
3 Networked Credit maps
bull Travel with the contriponents
Transitive Credit contribution[Dan Katz and Arfon Smith]
Katz DS amp Smith AM (2015) Transitive Credit and JSON-LD Journal of Open Research Software 3(1) pe7 DOI httpdoiorg105334jorsby
D S Katz Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products Journal of Open Research Software v2(1) e20 pp 1-4 2014 DOI 105334jorsbe
bull Manifests using semantics
bull Commons of components
bull A new scholarly currency
bull Necessity for reproducible machines
bull Foundation of release of research
bull Ramps rather than Revolution
The Rhetoric of Research Objectsresearchobjectorg
Reports of the death of the
scientific paper are greatly
exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK, Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Investigation
Study Analysis
Data
Model
SOP (Assay)
https://fairdomhub.org/investigations/56
Systems Biology Commons
Multi-results & Versions
Data of many types… Primary, secondary, tertiary… Methods, models, scripts…
Spans repository silos, regardless of location
In house… External - subject specific, general
Structured organisation
Retaining context over fragmentation
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context
• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobject.org
Container
Research Object in a nutshell
Packaging Frameworks: Zip Archives, BagIt, Docker images
Platforms: FAIRDOM, myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects: exchange, portability and maintenance
components packaged into various containers
ISA-TAB, checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies …
Reproducibility
Preservation
Release
Exchange
Goble, De Roure, Bechhofer: Accelerating Knowledge Turns, DOI 10.1007/978-3-642-37186-8_1
FAIR Commons
Currency of Scholarship
Interpretation, Comparison
Preservation, Repair
Portability, Reuse
Execution
Active Research
Evolving codes
New data
Software Release
Executable Papers
Scientific Instruments
Machines
Interpretation, Comparison
Portability, Reuse
Credit, Citation
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise.”
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory
computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run
Provenance
Inputs, Outputs
Intermediates
Parameters
Configs
Checksum
Community ontologies & formats
Narrative
Linked Data
JSON-LD
RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al: Enabling Precision Medicine via standard communication of NGS provenance, analysis and results. biorxiv.org 2017, https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine: NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
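Several of these container formats carry fixity information alongside the payload. As a minimal sketch, assuming a BagIt-style payload manifest with one `<sha256>  data/<path>` line per file, checksums can be produced and re-verified as below. The helper names `payload_manifest` and `verify` and the in-memory file contents are hypothetical.

```python
import hashlib


def payload_manifest(files):
    """Return BagIt-style manifest lines: '<sha256>  data/<path>'."""
    return [
        f"{hashlib.sha256(content).hexdigest()}  data/{path}"
        for path, content in sorted(files.items())
    ]


def verify(files, manifest_lines):
    """True if every recorded checksum still matches the current content."""
    return payload_manifest(files) == list(manifest_lines)


# Hypothetical payload held in memory; a real tool would read files from disk.
files = {"flux.csv": b"t,flux\n0,1.2\n", "empty.txt": b""}
manifest = payload_manifest(files)
print("\n".join(manifest))
```

Changing a single byte of any payload file makes `verify` return `False`, which is the property the checksum entries in a bundle rely on.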
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described, textually or semantically
• extensibility point for community-driven standards
RRI, DOI, URI, ORCID
W3C Web Annotation Vocabulary
Open Archives Initiative Object Reuse and Exchange (OAI-ORE)
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Reuse and Exchange (OAI-ORE)
http://www.researchobject.org/specifications/
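A structured ZIP file with a manifest that identifies, aggregates and annotates its contents can be sketched as follows. This is a minimal illustration, not a conformant implementation: the media type, the `.ro/manifest.json` path and the field names reflect my reading of the Research Object Bundle specification and should be checked against it. The function `build_bundle` and the example files are hypothetical.

```python
import io
import json
import zipfile

# Assumed media type for RO bundles; verify against the specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(files, external_uris=(), annotations=()):
    """Return the bytes of a ZIP bundle holding `files` plus a JSON manifest."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        # Aggregates: embedded files first, then referenced external resources.
        "aggregates": [{"uri": "/" + name} for name in sorted(files)]
        + [{"uri": uri} for uri in external_uris],
        # Annotations: statements about aggregated resources.
        "annotations": list(annotations),
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # As in ePub OCF, "mimetype" is the first entry and is stored uncompressed.
        bundle.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for name, content in files.items():
            bundle.writestr(name, content)
    return buffer.getvalue()


bundle_bytes = build_bundle(
    files={"workflow.cwl": "cwlVersion: v1.0\n", "README.txt": "Example bundle\n"},
    external_uris=["https://fairdomhub.org/investigations/56"],
    annotations=[{"about": "/workflow.cwl", "content": "/README.txt"}],
)
print(len(bundle_bytes), "bytes")
```

Keeping external resources in the same `aggregates` list as embedded files is what lets one manifest span both packaged and referenced content.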
Manifest
Artist’s Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project, Lab
Specific
Community-based Types
Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off-the-Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including the GigaScience data journal
Minimum information for one content type
Common properties among content types
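Standardised mark-up that can be harvested without an API usually means JSON-LD embedded in the web page itself. A minimal sketch, assuming plain schema.org `Dataset` properties rather than any particular Bioschemas profile; the function names and all example values are hypothetical.

```python
import json


def dataset_markup(name, description, url, identifier):
    """Build schema.org Dataset mark-up as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }


def as_script_tag(markup):
    """Wrap the mark-up so it can be embedded in an HTML page for harvesting."""
    body = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical dataset description; every value here is a placeholder.
markup = dataset_markup(
    name="Example flux measurements",
    description="Placeholder description of an example dataset.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/datasets/1",
)
tag = as_script_tag(markup)
print(tag)
```

A search engine or aggregator then only needs to crawl the page and parse the script element; no special feed is involved.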
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble: MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with ”
Check JSON structure
Different levels: from whole studies to complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
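The three kinds of check named above (JSON structure, URI patterns, cross-references between identifiers) can be sketched against a manifest held as a dictionary. The field names `aggregates`, `annotations`, `uri` and `about`, the allowed prefixes and the function `check_manifest` are assumptions for illustration, not a published profile.

```python
def check_manifest(manifest, allowed_prefixes):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []

    # 1. JSON structure: both top-level fields must be lists.
    for field in ("aggregates", "annotations"):
        if not isinstance(manifest.get(field), list):
            problems.append(f"'{field}' must be a list")
    if problems:
        return problems

    aggregated = [entry.get("uri", "") for entry in manifest["aggregates"]]

    # 2. URI patterns: absolute URIs must start with an allowed prefix.
    for uri in aggregated:
        if "://" in uri and not uri.startswith(tuple(allowed_prefixes)):
            problems.append(f"unexpected URI prefix: {uri}")

    # 3. Cross-references: annotations may only describe aggregated resources.
    for annotation in manifest["annotations"]:
        about = annotation.get("about")
        if about not in aggregated:
            problems.append(f"annotation about unknown resource: {about}")

    return problems


# Hypothetical manifest with one bad URI and one dangling annotation.
manifest = {
    "aggregates": [
        {"uri": "/data/flux.csv"},
        {"uri": "https://doi.org/10.15490/seek.1.investigation.56"},
        {"uri": "ftp://example.org/model.xml"},
    ],
    "annotations": [
        {"about": "/data/flux.csv", "content": "/notes.txt"},
        {"about": "/data/missing.csv", "content": "/notes.txt"},
    ],
}
problems = check_manifest(
    manifest, allowed_prefixes=["https://doi.org/", "https://identifiers.org/"]
)
print(problems)
```

Checks of this kind need only the manifest, so they can run without unpacking or fetching any of the aggregated resources.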
The manifest ties everything together
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
Format Specification
The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types use the IANA media type list, with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
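These recommendations are mechanical enough to check automatically. The sketch below is not part of CWL Viewer or any CWL tooling: it assumes a tool description already parsed into a dictionary with `inputs` and `outputs` in map form, and the function `check_tool` and the example tool are hypothetical.

```python
# Identifiers the guidance above calls generic and uninformative.
GENERIC_NAMES = {"result", "input", "output"}


def check_tool(tool):
    """Return best-practice warnings for a parsed CWL tool description."""
    warnings = []
    if not tool.get("label"):
        warnings.append("missing label")
    if not tool.get("doc"):
        warnings.append("missing doc")
    for section in ("inputs", "outputs"):
        for identifier, param in tool.get(section, {}).items():
            if identifier in GENERIC_NAMES:
                warnings.append(f"{section}/{identifier}: generic identifier")
            if param.get("type") == "File" and "format" not in param:
                warnings.append(f"{section}/{identifier}: File without format")
    return warnings


# Hypothetical tool description, as it might look after loading the YAML.
tool = {
    "class": "CommandLineTool",
    "label": "Sort a TSV table",
    "inputs": {
        "input": {"type": "File"},
        "sort_column": {"type": "int"},
    },
    "outputs": {
        "sorted_table": {
            "type": "File",
            "format": "iana:text/tab-separated-values",
        },
    },
}
warnings = check_tool(tool)
print(warnings)
```

A check like this only inspects the description itself; whether a declared format is a real EDAM or IANA term still needs a lookup against the ontology.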
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not graph link following
Info for Constraints are
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines, regulation
Did anyone take any notice?
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
STM Innovations Seminar 2012
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD, RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles.
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start-Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Investigation
Study Analysis
Data
Model
SOP(Assay)
httpsfairdomhuborginvestigations56
Systems Biology Commons
Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip
Spans repository silos Regardless of locationIn househellipExternal - subject specific general
Structured organisation
Retaining contextover fragmentation
A Research Object bundles and relates digital resources of a scientific
experimentinvestigation + context
bull Data used and results produced in experimental study
bull Methods employed to produce and analyse that data
bull Provenance and settings for the experiments
bull People involved in the investigation
bull Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobjectorg
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
bull ldquoMust have a workflow that analyses next-gen sequencing datardquo
bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo
bull ldquoAll required data files must be providedrdquo
bull ldquoGeneric names should be avoidedrdquo
General Tools that do their best at unpacking and handing off
Did anyone take any notice
Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool
Standardised packing of Systems Biology models
European Space Agency RO LibraryEverest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Heath Learning Systems
Asthma Research e-Labsharing and computing statistical cohort studies
Precision medicineNGS pipelines regulation
Did anyone take any notice httpwwwyoutubecomwatchv=p-W4iLjLTrQamplist=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner Chair STMFuture Labs Committee CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT January 2014 Lorenz Centre Leiden
Ted SlaterYARC OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: W3C PROV dependency graph with numeric labels; contributors Alice, Bob and Charlie]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
"Provlets"
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. "Contriponents"
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p. e7. DOI http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software v2(1) e20, pp. 1-4, 2014. DOI 10.5334/jors.be
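The idea of weighted contributions travelling through networked credit maps can be sketched in a few lines. This is a minimal illustration, assuming each product carries a credit map of fractional weights that sum to 1 and that the map graph has no cycles; the products, contributors and numbers below are invented for the example and are not taken from the cited papers.

```python
from collections import defaultdict

# Hypothetical credit maps: each product lists its direct contributors,
# which may be people or other products, with fractional weights.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.3, "dataset": 0.2},
    "workflow": {"Bob": 0.75, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}

def transitive_credit(product, maps, weight=1.0, totals=None):
    """Propagate credit from a product down to the people behind it.

    Credit reaching a person is the product of the weights along each
    path, summed over all paths. Assumes the maps form an acyclic graph.
    """
    totals = defaultdict(float) if totals is None else totals
    for contributor, share in maps[product].items():
        if contributor in maps:
            # The contributor is itself a product: pass the credit on.
            transitive_credit(contributor, maps, weight * share, totals)
        else:
            totals[contributor] += weight * share
    return dict(totals)

credit = transitive_credit("paper", credit_maps)
print(credit)
```

Here Charlie never touched the paper directly but still receives credit through the dataset and through the workflow that reused it, which is the "indirect contribution" the slide refers to.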
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…
Spans repository silos
Regardless of location
In house…
External - subject specific, general
Structured organisation
Retaining context over fragmentation
A Research Object bundles and relates digital resources of a scientific experiment/investigation + context
• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobject.org
Container
Research Object in a nutshell
Packaging Frameworks: Zip Archives, BagIt, Docker images
Platforms: FAIRDOM, myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects: exchange, portability and maintenance
components packaged into various containers
ISA-TAB
checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies …
Reproducibility
Preservation
Release
Exchange
Goble, De Roure, Bechhofer: Accelerating Knowledge Turns. DOI 10.1007/978-3-642-37186-8_1
FAIR Commons
Currency of Scholarship
Interpretation, Comparison
Preservation, Repair
Portability, Reuse
Execution
Active Research
Evolving codes
New data
Software Release
Executable Papers
Scientific Instruments
Machines
Interpretation, Comparison
Portability, Reuse
Credit, Citation
22/10/2017
An "evolving manuscript" would begin with a pre-publication, pre-peer review "beta 0.9" version of an article, followed by the approved published article itself, [ … ] "version 1.0".
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the "accretion of confirmation [and] reputation".
Ottoline Leyser […] assessment criteria in science revolve around the individual. "People have stopped thinking about the scientific enterprise".
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments Break
• Technologies, materials and methods change
• Scope of use, robustness
• Blackboxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis and results. biorxiv.org 2017 https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine: NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
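The three parts of a manifest listed above (identification, aggregates, annotations) can be sketched as a small JSON document. This is an illustrative sketch only: the key names follow my reading of the RO Bundle manifest format and should be checked against the specification, and the file paths and example.org URL are invented.

```python
import json

def build_manifest(resources, annotations):
    """Assemble a minimal manifest.

    resources:   list of dicts with a 'uri' plus optional attribution keys
    annotations: list of (about, content) pairs linking a resource to
                 a further description of it
    """
    return {
        "@context": ["https://w3id.org/bundle/context"],  # assumed context URI
        "id": "/",
        "aggregates": [dict(resource) for resource in resources],
        "annotations": [
            {"about": about, "content": content}
            for about, content in annotations
        ],
    }

manifest = build_manifest(
    resources=[
        # Embedded resource, with attribution for credit
        {"uri": "/data/kinetics.csv", "mediatype": "text/csv",
         "authoredBy": {"name": "Jane Doe"}},
        # External resource, aggregated by reference only
        {"uri": "http://example.org/external/model.xml"},
    ],
    annotations=[("/data/kinetics.csv", "annotations/kinetics-notes.ttl")],
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Keeping external resources in the same "aggregates" list as embedded ones is what lets a bundle span repository silos without copying everything into the archive.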
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Exchange and Reuse
http://www.researchobject.org/specifications
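A structured ZIP in the ePub OCF / Adobe UCF style puts an uncompressed "mimetype" entry first so that the archive type can be recognised from its leading bytes. The sketch below uses only the Python standard library; the media type string and the ".ro/manifest.json" location are as I recall them from the RO Bundle specification and should be treated as assumptions, as should the example file.

```python
import io
import json
import zipfile

# Assumed media type for RO Bundles; verify against the specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"

def write_bundle(target, manifest, files):
    """Write a bundle to a path or file-like object.

    files maps archive paths to text content.
    """
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The mimetype entry goes first and is stored without compression.
        bundle.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                        compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, content in files.items():
            bundle.writestr(path, content)

buffer = io.BytesIO()
write_bundle(
    buffer,
    {"id": "/", "aggregates": [{"uri": "/data/kinetics.csv"}]},
    {"data/kinetics.csv": "t,flux\n0,1.2\n"},
)

with zipfile.ZipFile(buffer) as bundle:
    names = bundle.namelist()
    first = bundle.infolist()[0]
print(names)
```

Because the result is an ordinary ZIP, any off-the-shelf unzip tool can open it even if it knows nothing about Research Objects.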
Manifest
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types
Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity
Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray's Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
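Publishing metadata "without APIs or special feeds" usually means embedding schema.org JSON-LD in the web page itself, where search engines and aggregators can harvest it. A minimal sketch, assuming a schema.org Dataset with a handful of common properties; the dataset name, description, URL and keywords are invented, and a real Bioschemas profile would state which properties are required.

```python
import json

def dataset_markup(name, description, url, keywords):
    """Return an HTML <script> element carrying schema.org Dataset JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
    }
    return ('<script type="application/ld+json">\n'
            + json.dumps(doc, indent=2)
            + "\n</script>")

snippet = dataset_markup(
    name="Glycolysis kinetics (example)",              # hypothetical values
    description="Kinetic flux measurements used for model validation.",
    url="http://example.org/datasets/glycolysis-kinetics",
    keywords=["kinetics", "systems biology"],
)
print(snippet)
```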
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. "starts with "
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
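The three kinds of check named above (JSON structure, URI patterns, cross-references between identifiers) can each be expressed in a few lines. This is a hedged sketch over a simplified manifest: the required keys, the "doi" key and the DOI pattern are assumptions chosen for the example rather than rules from a published profile.

```python
import re

REQUIRED_KEYS = {"id", "aggregates"}   # assumed minimal structure
DOI_PATTERN = re.compile(r"^https://doi\.org/10\.\d{4,9}/\S+$")

def check_manifest(manifest):
    """Return a list of error strings; an empty list means all checks passed."""
    errors = []

    # JSON structure: required top-level keys must be present.
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")

    # Cross-reference: every annotation must be about an aggregated resource.
    aggregated = {entry.get("uri") for entry in manifest.get("aggregates", [])}
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in aggregated:
            errors.append(f"annotation about unaggregated resource: {about}")

    # URI pattern: an identifier, if given, must look like a DOI URL.
    doi = manifest.get("doi")
    if doi is not None and not DOI_PATTERN.match(doi):
        errors.append(f"identifier does not look like a DOI URL: {doi}")
    return errors

good = {
    "id": "/",
    "aggregates": [{"uri": "/a.csv"}],
    "annotations": [{"about": "/a.csv", "content": "notes.ttl"}],
    "doi": "https://doi.org/10.15490/seek.1.investigation.56",
}
bad = {"aggregates": [], "annotations": [{"about": "/b.csv"}], "doi": "doi:10.1/x"}
print(check_manifest(good), check_manifest(bad))
```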
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top level short label summarising each tool and workflow.
Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not Graph Link Following
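The same profile (labels, docs, and IANA or EDAM formats on Files) can be approximated directly over a parsed CWL document, which shows what the shapes are testing. This is plain Python, not ShEx, and it is a sketch under assumptions: the CWL document is taken to be already loaded into a dict, with inputs and outputs in map form, and the example tool is invented.

```python
# Accepted prefixes: compact names (iana:, edam:) or the expanded IRIs.
ALLOWED_FORMAT_PREFIXES = (
    "iana:",
    "edam:",
    "https://www.iana.org/assignments/media-types/",
    "http://edamontology.org/format_",
)

def check_cwl_profile(doc):
    """Return warnings for a CWL Workflow or CommandLineTool given as a dict."""
    warnings = []
    if doc.get("class") in ("Workflow", "CommandLineTool"):
        if not doc.get("label"):
            warnings.append("missing label")
        if not doc.get("doc"):
            warnings.append("missing doc")
    for section in ("inputs", "outputs"):
        for name, param in doc.get(section, {}).items():
            # Only File parameters are expected to declare a format.
            if not isinstance(param, dict) or param.get("type") != "File":
                continue
            fmt = param.get("format", "")
            if not fmt.startswith(ALLOWED_FORMAT_PREFIXES):
                warnings.append(f"{section}.{name}: File without iana/edam format")
    return warnings

tool = {
    "class": "CommandLineTool",
    "label": "Sort table",
    "inputs": {"table": {"type": "File",
                         "format": "iana:text/tab-separated-values"}},
    "outputs": {"sorted_table": {"type": "File"}},
}
cwl_warnings = check_cwl_profile(tool)
print(cwl_warnings)
```

Like the ShEx version, this only tests what is written in the document; it cannot confirm that a declared EDAM term actually exists in the ontology, which would require following the link.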
Info for Constraints are
• Embedded in a specific format
– Extract/convert from domain-specific formats
• Embedded in annotation resources
– Use existing schema.org annotations
• Need to be acquired
– e.g. URI look-ups (ORCID -> author name)
A Research Object bundles and relates digital resources of a scientific
experimentinvestigation + context
bull Data used and results produced in experimental study
bull Methods employed to produce and analyse that data
bull Provenance and settings for the experiments
bull People involved in the investigation
bull Annotations about these resources to improve understanding and interpretation
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobjectorg
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
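The pre-processing and bespoke steps listed above could look roughly like this. It is a sketch under stated assumptions: graphs are plain sets of triples, the manifest is a dict with an "aggregates" list of {"uri": ...} entries, and both function names are hypothetical.

```python
def merge_graphs(*graphs):
    """Union several triple sets into the single graph a shape validator expects."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def missing_resources(manifest, bundle_paths):
    """Bespoke check: every aggregated local path must exist in the bundle.

    Remote URIs are skipped here; checking them would need a network look-up.
    """
    return sorted(
        entry["uri"]
        for entry in manifest["aggregates"]
        if entry["uri"].startswith("/") and entry["uri"] not in bundle_paths
    )


# Invented example data.
manifest_graph = {("/wf.cwl", "rdf:type", "cwl:Workflow")}
annotation_graph = {("/wf.cwl", "rdfs:label", "Align reads")}
manifest = {
    "aggregates": [
        {"uri": "/wf.cwl"},
        {"uri": "/data/reads.fastq"},
        {"uri": "http://example.org/remote"},
    ]
}
bundle_paths = {"/wf.cwl"}

single_graph = merge_graphs(manifest_graph, annotation_graph)
print(len(single_graph), missing_resources(manifest, bundle_paths))
```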
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
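One way to read "do their best at unpacking and handing off" is a general runner that passes an unpacked RO to pluggable domain rules and keeps going when a rule cannot be evaluated. The sketch below assumes an invented dict layout for the unpacked RO and a hypothetical project name; only two of the rules quoted above are shown.

```python
def all_required_files_present(ro):
    """Rule: "All required data files must be provided"."""
    missing = sorted(set(ro["required_files"]) - set(ro["files"]))
    if missing:
        return False, f"missing data files: {missing}"
    return True, "all required data files provided"


def part_of_investigation(ro, funded_project="ExampleProject"):
    """Rule: "Must be part of $fundedProject's Investigation" (project name is made up)."""
    project = ro.get("investigation_project")
    return project == funded_project, f"investigation project is {project!r}"


def run_rules(ro, rules):
    """Hand the RO to each rule; a rule that raises is reported, not fatal."""
    report = []
    for rule in rules:
        try:
            ok, message = rule(ro)
        except Exception as error:  # best effort: record the failure and carry on
            ok, message = False, f"could not evaluate: {error!r}"
        report.append((rule.__name__, ok, message))
    return report


research_object = {
    "files": ["reads.fastq"],
    "required_files": ["reads.fastq", "samples.tsv"],
    "investigation_project": "ExampleProject",
}
report = run_rules(research_object, [all_required_files_present, part_of_investigation])
for name, ok, message in report:
    print(name, ok, message)
```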
Did anyone take any notice?
Research Object Bundles for Data Releases as if they were software: Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library, Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines, regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 (STM Innovations Seminar 2012)
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD RDF. Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles.
Combining ISA-based Research Objects with nanopublications: complementary approaches
Take-up Analogy: Start-Ups
Community
Driver
Tools: Easy to make, Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping [Paolo Missier]
[Figure: diagram with numeric labels and the contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software v2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
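The idea of weighted contributions propagating through networked credit maps can be sketched as follows. The credit maps, names and weights are invented, and the recursion assumes the maps contain no cycles; this illustrates the general idea and is not the authors' implementation.

```python
# Each product lists fractional credit for its contributors and for the
# components it builds on. All names and weights here are invented.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "dataset": 0.25},
    "dataset": {"Charlie": 0.75, "Bob": 0.25},
}


def transitive_credit(product, maps, weight=1.0, totals=None):
    """Push credit through components until it reaches people (assumes no cycles)."""
    totals = {} if totals is None else totals
    for contributor, share in maps[product].items():
        if contributor in maps:
            # A component with its own credit map: recurse with the scaled weight.
            transitive_credit(contributor, maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


credit = transitive_credit("paper", credit_maps)
print(credit)
```

Here Bob collects credit twice: directly for the paper, and indirectly through the dataset the paper reuses.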
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team, colleagues in Manchester’s Information Management Group, ELIXIR-UK, Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Standards-based metadata framework for bundling embedded and referenced resources with context
Citable Reproducible Packaging
researchobject.org
Container
Research Object in a nutshell
Packaging Frameworks: Zip Archives, BagIt, Docker images
Platforms: FAIRDOM, myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects: exchange, portability and maintenance
components packaged into various containers
ISA-TAB, checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe. Title: My Investigation. Date: September 2016. DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies …
Reproducibility, Preservation
Release, Exchange
Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, DOI: 10.1007/978-3-642-37186-8_1
FAIR Commons, Currency of Scholarship
Interpretation Comparison, Preservation Repair
Portability Reuse, Execution
Active Research, Evolving codes
New data
Software Release, Executable Papers
Scientific Instruments, Machines
Interpretation Comparison, Portability Reuse
Credit, Citation
22/10/2017
An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.
Ottoline Leyser [ … ] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Blackboxes - dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics, doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study, J Biomedical Semantics 5:41
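As a rough illustration of what a workflow run record with inputs, outputs, parameters and checksums might hold, here is a minimal sketch. The field names are invented for the example and are not W3C PROV terms or the Workflow Research Object vocabulary; file contents are passed as bytes so the example stays self-contained.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest, used as the checksum for a file's content."""
    return hashlib.sha256(data).hexdigest()


def run_record(workflow, parameters, inputs, outputs):
    """Build a plain-dict record of one workflow run (illustrative field names)."""

    def describe(files):
        return [
            {"name": name, "sha256": sha256_of(content)}
            for name, content in sorted(files.items())
        ]

    return {
        "workflow": workflow,
        "parameters": parameters,
        "inputs": describe(inputs),
        "outputs": describe(outputs),
    }


# Invented run: file names, parameters and contents are placeholders.
record = run_record(
    "align.cwl",
    {"threads": 4},
    inputs={"reads.fastq": b"ACGT\n"},
    outputs={"report.txt": b"ok\n"},
)
print(json.dumps(record, indent=2))
```

Recording a checksum next to each input and output is what later lets a consumer tell whether a re-run used, and produced, the same bytes.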
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al, Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results, biorxiv.org 2017, https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine: NGS workflow exchange, FDA regulatory review submissions. Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF), Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
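A structured ZIP of this kind can be sketched with the standard library. The mimetype string, the .ro/manifest.json path and the aggregates/annotations keys are assumed here to follow the Research Object Bundle specification and should be checked against it; the file names and annotation entry are invented.

```python
import io
import json
import zipfile

# Assumed RO Bundle media type; verify against the specification before relying on it.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(files, annotations):
    """Return the bytes of a ZIP holding the files plus a manifest that aggregates them."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # As in ePub OCF, the mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name, content in sorted(files.items()):
            bundle.writestr(name, content)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


bundle_bytes = build_bundle(
    {"workflow.cwl": "cwlVersion: v1.0\n"},
    [{"about": "/workflow.cwl", "content": "annotations/workflow.ttl"}],
)
print(zipfile.ZipFile(io.BytesIO(bundle_bytes)).namelist())
```

Keeping the manifest as an ordinary file inside the archive means any ZIP tool can open the bundle, while RO-aware tools read the manifest for identification, aggregation and annotation.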
Manifest Construction
Identification: to locate and resolve things (RRI, DOI, URI, ORCID)
Aggregates: to link things together (Open Archives Initiative Object Exchange and Reuse)
Annotations: about things & their relationships (W3C Web Annotation Vocabulary)
Container: structured ZIP-file based on ePub (OCF) & Adobe UCF specifications
http://www.researchobject.org/specifications
Manifest
Artist’s Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from, its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types / Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas, Bioschemas, Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools, App eco-system, Lightweight
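Publishing metadata "without APIs or special feeds" usually means embedding schema.org JSON-LD in the page itself. A minimal sketch follows, with invented dataset values and hypothetical helper names; Bioschemas profiles add their own minimum and recommended properties, which this does not model.

```python
import json


def dataset_markup(name, description, url, keywords):
    """Build schema.org Dataset markup as a plain dict (values supplied by the caller)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
    }


def as_script_tag(markup):
    """Wrap the JSON-LD in the script element that harvesters look for in a page."""
    return '<script type="application/ld+json">\n' + json.dumps(markup, indent=2) + "\n</script>"


# Invented example record.
markup = dataset_markup(
    name="Example glycolysis model data",
    description="Illustrative record only.",
    url="https://example.org/datasets/1",
    keywords=["systems biology", "kinetic model"],
)
print(as_script_tag(markup))
```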
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events, Laboratory protocols, Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data, IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
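A checklist in this spirit can be sketched as a list of requirements with levels. The MUST/SHOULD levels, the rule format and the RO dict layout below are assumptions for illustration only; they do not reproduce the Minim vocabulary or the SPIN-based tooling.

```python
# Each requirement: (level, label, test over an unpacked RO dict). All invented.
CHECKLIST = [
    ("MUST", "has a workflow definition", lambda ro: any(f.endswith(".cwl") for f in ro["files"])),
    ("MUST", "has a title", lambda ro: bool(ro.get("title"))),
    ("SHOULD", "names at least one author", lambda ro: bool(ro.get("authors"))),
]


def evaluate(ro, checklist=CHECKLIST):
    """Return (satisfied, results); only MUST requirements decide satisfaction."""
    results = [(level, label, bool(test(ro))) for level, label, test in checklist]
    satisfied = all(ok for level, _, ok in results if level == "MUST")
    return satisfied, results


research_object = {"files": ["align.cwl"], "title": "My Investigation", "authors": []}
satisfied, results = evaluate(research_object)
for level, label, ok in results:
    print(level, label, "ok" if ok else "not met")
```

Separating levels lets a tool report a missing author as advice while still treating a missing workflow as a failure.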
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with ”
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
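The three manifest checks named above (JSON structure, URI patterns, cross-references between identifiers) can be sketched in a few lines. The manifest keys (aggregates, annotations, authoredBy, orcid) and the required URI prefix are assumptions chosen for the example, as is the function name.

```python
import json


def check_manifest(text, author_prefix="https://orcid.org/"):
    """Return a list of problems found in a manifest.json document."""
    try:
        manifest = json.loads(text)  # JSON structure: the document must parse
    except json.JSONDecodeError as error:
        return [f"not valid JSON: {error.msg}"]
    if not isinstance(manifest, dict):
        return ["top level must be a JSON object"]
    problems = []
    for key in ("aggregates", "annotations"):
        if not isinstance(manifest.get(key), list):
            problems.append(f"'{key}' must be a list")
    if problems:
        return problems
    # Cross-reference: every annotation must be about an aggregated resource.
    aggregated = {entry.get("uri") for entry in manifest["aggregates"]}
    for annotation in manifest["annotations"]:
        if annotation.get("about") not in aggregated:
            problems.append(f"annotation about unknown resource {annotation.get('about')}")
    # URI pattern: author identifiers must start with the expected prefix.
    for author in manifest.get("authoredBy", []):
        if not str(author.get("orcid", "")).startswith(author_prefix):
            problems.append(f"author identifier does not start with {author_prefix}")
    return problems


# Invented manifest with one dangling annotation and one badly formed author id.
example = json.dumps({
    "aggregates": [{"uri": "/workflow.cwl"}],
    "annotations": [
        {"about": "/workflow.cwl", "content": "annotations/workflow.ttl"},
        {"about": "/missing.csv", "content": "annotations/missing.ttl"},
    ],
    "authoredBy": [{"name": "Jane Doe", "orcid": "http://example.org/jane"}],
})
problems = check_manifest(example)
print(problems)
```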
Case study: Back to Workflows
Workflow description, Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO: CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step, if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to give a more detailed description of the tool’s application in the particular step, if preferred.
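The priority rule for labels and docs described above amounts to a short fallback chain. The sketch below is an assumption about how a viewer might resolve it, not the CWL Viewer's code; the dicts, identifiers and the function name are invented, and falling back to the step identifier is this sketch's own choice.

```python
def display_text(field, step, tool, fallback=""):
    """Step-level value wins over the tool's top-level value; otherwise use the fallback."""
    return step.get(field) or tool.get(field) or fallback


# Invented tool and step descriptions.
tool = {"id": "samtools-sort", "label": "Sort alignments", "doc": "Sorts a BAM file by coordinate."}
step = {"id": "sort_tumour", "label": "Sort tumour alignments"}  # no step-level doc

step_label = display_text("label", step, tool, fallback=step["id"])
step_doc = display_text("doc", step, tool)
print(step_label, "|", step_doc)
```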
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.
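A check for generic identifiers could be as simple as the sketch below. The word list and the rule of ignoring trailing digits, underscores and hyphens are assumptions for illustration; a real profile would define its own list.

```python
import re

# Hypothetical list of names treated as uninformative.
GENERIC_NAMES = {"result", "results", "input", "output", "in", "out", "file", "data"}


def generic_identifiers(identifiers):
    """Return identifiers that are generic once trailing digits/underscores/hyphens are removed."""
    flagged = []
    for identifier in identifiers:
        stem = re.sub(r"[\d_\-]+$", "", identifier.lower())
        if stem in GENERIC_NAMES:
            flagged.append(identifier)
    return flagged


print(generic_identifiers(["result", "input_1", "aligned_reads", "output2"]))
```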
Container
Research Object in a nutshell
Packaging FrameworksZip Archives BagIt Docker images
Platforms FAIRDOM myExperiment
Rhetorical Analogy 1
Systems Biology Research Objects exchange portability and maintenance
components packaged into
various containers
ISA-TABchecksum
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Can't check if resource exists (e.g. 404)
Can't test format/representation of resource ("is it actually an Excel file?")
Can't apply nested RDF shapes to Linked Data resources
Can't say "Must be term from any resolvable ontology"
Can't check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators / unpackers to iterate over the RO
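The pre-processing step (merge to a single graph, then follow links) can be sketched without any RDF library. This is an illustration only: triples are plain tuples, and the `ex:` terms, the two sample graphs and the helper names are hypothetical stand-ins for a real manifest and a real ontology snapshot.

```python
SUBCLASS_OF = "rdfs:subClassOf"
HAS_FORMAT = "dct:format"


def merge_graphs(*graphs):
    """Union several sets of (subject, predicate, object) triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def is_subclass_of(graph, term, ancestor):
    """Follow rdfs:subClassOf links transitively, guarding against cycles."""
    seen, frontier = set(), {term}
    while frontier:
        current = frontier.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            frontier |= objects(graph, current, SUBCLASS_OF)
    return False


# Hypothetical triples: one from the RO manifest, two from an ontology.
manifest_graph = {("ex:data.tsv", HAS_FORMAT, "ex:format_tsv")}
ontology_graph = {
    ("ex:format_tsv", SUBCLASS_OF, "ex:format_textual"),
    ("ex:format_textual", SUBCLASS_OF, "ex:format_root"),
}

merged = merge_graphs(manifest_graph, ontology_graph)
declared = next(iter(objects(merged, "ex:data.tsv", HAS_FORMAT)))
print(is_subclass_of(merged, declared, "ex:format_root"))
```

Run against `manifest_graph` alone, the same check fails, which is the "must already be in a single graph" limitation in miniature.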
Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"
General tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases as if they were software: Dataset "build" tool
Standardised packing of Systems Biology models
European Space Agency RO Library, Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines, regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD, RDF. Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles.
Combining ISA-based Research Objects with nanopublications: complementary approaches
Take-up Analogy: Start-Ups
Community
Driver
Tools
Easy to make, hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: graph with numeric labels and the contributors Alice, Charlie and Bob]
Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016
W3C PROV dependency graph
"Provlets"
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. "Contriponents"
• contributors + components
2. Weighted contribution
3. Networked credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software v2(1) e20, pp 1-4, 2014. DOI: 10.5334/jors.be
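The idea of weighted, networked credit maps can be illustrated with a small worked example. This is a sketch under assumptions: each product carries a credit map whose fractional weights sum to 1, entries may name people or other products, and the maps are acyclic. The product names, weights and the function `transitive_credit` are invented for illustration and are not taken from the cited papers.

```python
# Hypothetical credit maps: each product splits a total weight of 1.0
# between people and other products it depends on.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.3, "workflow": 0.2},
    "workflow": {"Charlie": 0.75, "dataset": 0.25},
    "dataset": {"Bob": 1.0},
}


def transitive_credit(product, maps, weight=1.0, totals=None):
    """Propagate fractional credit through nested credit maps.

    Credit given to another product is passed on to that product's own
    contributors, scaled by the weight of the link. Assumes no cycles.
    """
    totals = {} if totals is None else totals
    for contributor, share in maps[product].items():
        if contributor in maps:
            # The contributor is itself a product: recurse into its map.
            transitive_credit(contributor, maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


totals = transitive_credit("paper", credit_maps)
for person in sorted(totals):
    print(person, round(totals[person], 3))
```

Here Bob receives direct credit for the paper plus indirect credit through the dataset used by the workflow, which is the kind of indirect contribution the slide refers to.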
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK, Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Systems Biology Research Objects: exchange, portability and maintenance
Components packaged into various containers
ISA-TAB, checksum
RO Commons and Currency
Author List: Joe Bloggs, Jane Doe. Title: My Investigation. Date: September 2016. DOI: https://doi.org/10.15490/seek
https://doi.org/10.15490/seek.1.investigation.56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies…
Reproducibility
Preservation
Release
Exchange
Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI: 10.1007/978-3-642-37186-8_1
FAIR Commons
Currency of Scholarship
Interpretation, Comparison
Preservation, Repair
Portability, Reuse
Execution
Active Research
Evolving codes
New data
Software Release
Executable Papers
Scientific Instruments
Machines
Interpretation, Comparison
Portability, Reuse
Credit, Citation
22/10/2017
An "evolving manuscript" would begin with a pre-publication, pre-peer review "beta 0.9" version of an article, followed by the approved published article itself [ … ] "version 1.0".
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the "accretion of confirmation [and] reputation".
Ottoline Leyser: […] assessment criteria in science revolve around the individual. "People have stopped thinking about the scientific enterprise."
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5: 41
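Recording checksums for the inputs, outputs and intermediates of a workflow run is one concrete ingredient of preservation and repair: a later re-run can be compared against the record to spot drift. A minimal sketch, assuming SHA-256 and local files; the helper names and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so that large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_record(paths):
    """Map each file name to its SHA-256 digest."""
    return {os.path.basename(path): sha256_of(path) for path in paths}


def changed_files(recorded, current):
    """Names whose digest differs from, or is missing in, the current run."""
    return sorted(n for n in recorded if current.get(n) != recorded[n])


# Demo with a throwaway file standing in for a workflow output.
with tempfile.TemporaryDirectory() as workdir:
    output_path = os.path.join(workdir, "output.tsv")
    with open(output_path, "wb") as handle:
        handle.write(b"time\tflux\n")
    first_run = checksum_record([output_path])
    with open(output_path, "wb") as handle:
        handle.write(b"time\tflux\n0\t1.5\n")
    second_run = checksum_record([output_path])
    drifted = changed_files(first_run, second_run)
    print(drifted)
```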
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org 2017. https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine: NGS workflow exchange, FDA regulatory review submissions. Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
Pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed
Repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
RRI, DOI, URI, ORCID
W3C Web Annotation Vocabulary
Open Archives Initiative Object Reuse and Exchange
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Reuse and Exchange
http://www.researchobject.org/specifications
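The three manifest roles (identification, aggregation, annotation) and the structured ZIP container can be sketched together. This is an illustration rather than a reference implementation: the manifest keys, the `.ro/manifest.json` location and the `mimetype` entry follow the RO Bundle conventions as recalled here and should be checked against the specification; the function `build_bundle`, the sample file and the example.org URIs are hypothetical.

```python
import io
import json
import zipfile

# Media type written to the leading 'mimetype' entry (OCF/UCF convention).
BUNDLE_MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(files, external_uris, annotations):
    """Build an in-memory bundle.

    files: {path: (mediatype, content_bytes)} for resources stored inside.
    external_uris: URIs aggregated by reference only.
    annotations: list of {"about": ..., "content": ...} dicts.
    """
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": (
            [{"uri": "/" + path, "mediatype": mediatype}
             for path, (mediatype, _content) in files.items()]
            + [{"uri": uri} for uri in external_uris]
        ),
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", BUNDLE_MIMETYPE,
                        compress_type=zipfile.ZIP_STORED)
        for path, (_mediatype, content) in files.items():
            bundle.writestr(path, content)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue(), manifest


bundle_bytes, manifest = build_bundle(
    files={"data/flux.tsv": ("text/tab-separated-values",
                             b"time\tflux\n0\t1.5\n")},
    external_uris=["https://example.org/hypothetical/model"],
    annotations=[{"about": "/data/flux.tsv",
                  "content": "https://example.org/hypothetical/description"}],
)
print(len(manifest["aggregates"]), "aggregated resources")
```

Aggregating the external model by URI rather than by copy reflects the point that a manifest covers outside references as well as packaged files.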
Manifest
Artist's Impression
The real manifest
• A manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types, Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas, Bioschemas, Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols, Workflows and Tools
See Alasdair Gray's Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
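Publishing metadata "without APIs or special feeds" in practice means embedding schema.org JSON-LD in the resource's own web page, where search engines and aggregators can harvest it. A minimal sketch using generic schema.org `Dataset` properties; it does not reproduce any particular Bioschemas profile, and the dataset values, the example.org URL and the helper names are hypothetical.

```python
import json


def dataset_markup(name, description, url, identifier, keywords):
    """Assemble schema.org Dataset mark-up as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
    }


def as_script_tag(markup):
    """Wrap JSON-LD in the script element used to embed it in a web page."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Illustrative values only.
markup = dataset_markup(
    name="Glycolysis kinetic model data",
    description="Kinetic, flux and inhibition data files for a model.",
    url="https://example.org/hypothetical/dataset/1",
    identifier="hypothetical-dataset-1",
    keywords=["systems biology", "kinetics"],
)
html_snippet = as_script_tag(markup)
print(html_snippet)
```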
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
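A checklist of the kind Minim defines can be approximated as a list of graded requirements evaluated against a research object. The sketch below is loosely modelled on that idea and makes several assumptions: the MUST/SHOULD/MAY levels, the verdict wording, the dictionary standing in for a research object and the function `evaluate_checklist` are all illustrative, not the Minim vocabulary itself.

```python
def evaluate_checklist(research_object, checklist):
    """Evaluate (level, description, test) requirements against an RO.

    Returns a verdict string and the failed requirements by level.
    """
    failed = {"MUST": [], "SHOULD": [], "MAY": []}
    for level, description, test in checklist:
        if not test(research_object):
            failed[level].append(description)
    if failed["MUST"]:
        verdict = "does not satisfy"
    elif failed["SHOULD"]:
        verdict = "minimally satisfies"
    elif failed["MAY"]:
        verdict = "nominally satisfies"
    else:
        verdict = "fully satisfies"
    return verdict, failed


# Hypothetical checklist for a workflow-centric research object.
workflow_checklist = [
    ("MUST", "has a workflow description", lambda ro: bool(ro.get("workflow"))),
    ("SHOULD", "provides input data", lambda ro: bool(ro.get("inputs"))),
    ("MAY", "has a README", lambda ro: bool(ro.get("readme"))),
]

# Hypothetical research object summary.
example_ro = {"workflow": "analysis.cwl", "inputs": ["reads.fastq"], "readme": None}

verdict, failed = evaluate_checklist(example_ro, workflow_checklist)
print(verdict, failed)
```

Keeping the requirements as data rather than code is what lets the same evaluator serve different project, lab or community profiles.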
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction: check cross-reference constraints on identifiers
Check URI patterns, e.g. "starts with "
Check JSON structure
Different levels, from whole studies to complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
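The three manifest checks listed above (JSON structure, URI patterns, cross-references between identifiers) can be sketched as one small validator. Assumptions are labelled: the manifest keys mirror an RO Bundle style `manifest.json`, each annotation's "about" is treated as a single string, and the allowed URI prefixes, the function `validate_manifest` and both sample manifests are hypothetical.

```python
def validate_manifest(manifest, uri_prefixes=("/", "http://", "https://")):
    """Return a list of problems found in a manifest dictionary."""
    errors = []
    # Structure: required top-level keys.
    for key in ("id", "aggregates"):
        if key not in manifest:
            errors.append(f"missing key: {key}")
    # URI patterns: every aggregated URI must start with an allowed prefix.
    aggregated = set()
    for entry in manifest.get("aggregates", []):
        uri = entry.get("uri")
        if not isinstance(uri, str) or not uri.startswith(uri_prefixes):
            errors.append(f"bad aggregate uri: {uri!r}")
        else:
            aggregated.add(uri)
    # Cross-references: annotations must be about the RO or a resource in it.
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in aggregated and about != manifest.get("id"):
            errors.append(f"annotation about unknown resource: {about!r}")
    return errors


good_manifest = {
    "id": "/",
    "aggregates": [{"uri": "/workflow.cwl"}],
    "annotations": [{"about": "/workflow.cwl",
                     "content": "/annotations/workflow.jsonld"}],
}
bad_manifest = {
    "aggregates": [{"uri": "ftp://example.org/data"}],
    "annotations": [{"about": "/missing.txt"}],
}

print(validate_manifest(good_manifest))
for problem in validate_manifest(bad_manifest):
    print(problem)
```

Checks like these work on the manifest alone; whether an aggregated URI actually resolves is a separate question that needs network access.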
The manifest ties everything together
Case study: Back to Workflows
Workflow description, Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
Permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO: CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those specifically relevant to the viewer are detailed below, but we suggest that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
RO Commons and Currency
Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek
httpsdoiorg1015490seek1investigation56
Active entry evolves
Version
information travels with the data and models
Rhetorical Analogies hellip
ReproducibilityPreservation
ReleaseExchange
Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1
FAIR CommonsCurrency of Scholarship
Interpretation ComparisonPreservation Repair
Portability ReuseExecution
Active ResearchEvolving codes
New data
Software ReleaseExecutable Papers
Scientific InstrumentsMachines
Interpretation ComparisonPortability Reuse
Credit Citation
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Can't check if resource exists (e.g. 404)
Can't test format/representation of resource ("is it actually an Excel file")
Can't apply nested RDF shapes to Linked Data resources
Can't say "Must be term from any resolvable ontology"
Can't check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"
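Two of the domain-specific rules above are simple enough to sketch as a bespoke validator in Python. The Research Object is reduced here to a plain dictionary. The file paths, the identifiers and the list of names treated as generic are all hypothetical choices for this example.

```python
# Names treated as uninformative; an assumed list, extend as a community sees fit.
GENERIC_NAMES = {"result", "results", "input", "output", "data", "file"}


def missing_files(required, provided):
    """Required paths that the Research Object does not contain."""
    return sorted(set(required) - set(provided))


def generic_identifiers(identifiers):
    """Identifiers that say nothing about what the value represents."""
    return sorted(name for name in identifiers if name.lower() in GENERIC_NAMES)


def validate(ro, required):
    """Collect human-readable problems for both rules."""
    problems = []
    for path in missing_files(required, ro["files"]):
        problems.append(f"required data file not provided: {path}")
    for name in generic_identifiers(ro["identifiers"]):
        problems.append(f"generic identifier should be renamed: {name}")
    return problems


# Hypothetical, already-unpacked Research Object.
example_ro = {
    "files": ["data/flux.csv", "workflow/analysis.cwl"],
    "identifiers": ["aligned_reads", "result", "Output"],
}
problems = validate(example_ro, required=["data/flux.csv", "data/inhibition.csv"])
```

Rules such as "must analyse next-gen sequencing data" need domain knowledge that a list comparison cannot supply, which is where the bespoke tooling comes in.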
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software
Dataset "build" tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines, regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles.
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make, hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: dependency graph with numeric edge weights linking contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
"Provlets"
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. "Contriponents"
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v2(1), e20, pp 1-4, 2014. DOI 10.5334/jors.be
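One way to read "weighted contribution" and "networked credit maps" is sketched below in Python. Each product carries a credit map whose shares sum to 1. A share may go to a person or to another product, in which case it is passed on through that product's own map. The contributor names echo the slide, but the products, the weights and the function are invented for this example, and the sketch assumes the dependency graph has no cycles.

```python
from collections import defaultdict

# Hypothetical credit maps: shares per product sum to 1.0.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "workflow": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.25, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}


def transitive_credit(product, maps, weight=1.0, totals=None):
    """Accumulate each person's fractional credit for a product.

    Shares given to components are multiplied along the path and passed
    on to the components' own contributors (assumes an acyclic graph).
    """
    if totals is None:
        totals = defaultdict(float)
    for contributor, share in maps[product].items():
        if contributor in maps:
            # A component with its own credit map: recurse with reduced weight.
            transitive_credit(contributor, maps, weight * share, totals)
        else:
            totals[contributor] += weight * share
    return dict(totals)


paper_credit = transitive_credit("paper", credit_maps)
```

In this toy case Charlie never appears on the paper's own credit map, yet receives a fraction of its credit through the workflow and the dataset, which is the indirect contribution the slide refers to.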
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Rhetorical Analogies …
Reproducibility, Preservation
Release, Exchange
Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, DOI 10.1007/978-3-642-37186-8_1
FAIR Commons, Currency of Scholarship
Interpretation, Comparison, Preservation, Repair
Portability, Reuse, Execution
Active Research: evolving codes, new data
Software Release, Executable Papers
Scientific Instruments, Machines
Interpretation, Comparison, Portability, Reuse
Credit, Citation
22/10/2017
An "evolving manuscript" would begin with a pre-publication, pre-peer review "beta 0.9" version of an article, followed by the approved published article itself [ … ] "version 1.0".
Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the "accretion of confirmation [and] reputation".
Ottoline Leyser […] assessment criteria in science revolve around the individual. "People have stopped thinking about the scientific enterprise".
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Release
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments Break
• Technologies, materials and methods change
• Scope of use, robustness
• Blackboxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics, doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study, J Biomedical Semantics 5:41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al., Enabling Precision Medicine via standard communication of NGS provenance, analysis and results, biorxiv.org 2017, https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine
NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Identification: to locate and resolve things
  RRI, DOI, URI, ORCID
Aggregates: to link things together
  Open Archives Initiative Object Exchange and Reuse
Annotations: about things & their relationships
  W3C Web Annotation Vocabulary
Container: structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
http://www.researchobject.org/specifications
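The three parts of manifest construction can be sketched in Python by writing a small bundle into an in-memory ZIP archive. The field names (`aggregates`, `annotations`, `about`, `content`), the `.ro/manifest.json` path and the mimetype string follow my reading of the Research Object Bundle specification and should be checked against it. The file names and contents are made up for the example.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"

manifest = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    # Aggregates: link things together, inside or outside the archive.
    "aggregates": [
        {"uri": "/workflow/analysis.cwl"},
        {"uri": "http://example.org/external-dataset"},
    ],
    # Annotations: say something about an aggregated thing.
    "annotations": [
        {"about": "/workflow/analysis.cwl",
         "content": "/annotations/analysis-notes.txt"},
    ],
}

# Hypothetical payload files, keyed by their path inside the archive.
files = {
    "workflow/analysis.cwl": "cwlVersion: v1.0\nclass: Workflow\n"
                             "inputs: {}\noutputs: {}\nsteps: {}\n",
    "annotations/analysis-notes.txt": "Free-text note about the workflow.\n",
}


def build_bundle(manifest, files):
    """Write payload files plus the manifest into a ZIP held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # ePub-style convention: an uncompressed 'mimetype' entry comes first.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for path, content in files.items():
            bundle.writestr(path, content)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


bundle_bytes = build_bundle(manifest, files)
```

The external dataset is aggregated by reference only, so the archive stays small while the manifest still records that the resource belongs to the Research Object.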
Manifest
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types, Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
[Figure: two data repositories and a training resource, each marked up with Bioschemas; search engines, registries, data aggregators]
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity, Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols, Workflows and Tools
See Alasdair Gray's Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
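As a sketch of what such mark-up can look like, the Python below builds a schema.org `Dataset` description, checks it against a minimum property list, and wraps it in the script tag that harvesters read from a web page. The property list is a stand-in invented for this example and is not an official Bioschemas profile. The dataset values and function names are likewise hypothetical.

```python
import json

# Assumed minimum properties per content type; NOT an official profile.
MINIMUM_PROPERTIES = {"Dataset": ["name", "description", "url"]}

# Hypothetical dataset description using schema.org terms.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example kinetic model data",
    "description": "Illustrative record for a data repository landing page.",
    "url": "http://example.org/datasets/42",
    "keywords": ["kinetics", "glycolysis"],
}


def missing_properties(markup):
    """Minimum properties for the declared type that the mark-up lacks."""
    required = MINIMUM_PROPERTIES.get(markup.get("@type"), [])
    return [prop for prop in required if prop not in markup]


def to_script_tag(markup):
    """Embed JSON-LD in the script element used for page mark-up."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


snippet = to_script_tag(dataset)
```

Because the description travels inside the page itself, it can be published and harvested without a dedicated API, which is the lightweight route the slides describe.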
Manifest Description Profiles
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble, MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data, IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
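A checklist of this kind can be sketched in a few lines of Python. The MUST / SHOULD / MAY levels and the verdict wording are modelled only loosely on Minim and are assumptions of this example, as are the three requirements and the Research Object dictionary. A real Minim checklist is expressed in RDF rather than as Python functions.

```python
# Each requirement: (level, label, test over a Research Object dictionary).
checklist = [
    ("MUST", "aggregates a workflow description",
     lambda ro: any(path.endswith(".cwl") for path in ro["aggregates"])),
    ("SHOULD", "aggregates a README",
     lambda ro: "README.md" in ro["aggregates"]),
    ("MAY", "names a creator",
     lambda ro: bool(ro.get("createdBy"))),
]


def evaluate(ro, checklist):
    """Return an overall verdict and the list of failed requirements."""
    failed = [(level, label) for level, label, test in checklist if not test(ro)]
    failed_levels = {level for level, _ in failed}
    if "MUST" in failed_levels:
        verdict = "not satisfied"
    elif "SHOULD" in failed_levels:
        verdict = "minimally satisfied"
    elif "MAY" in failed_levels:
        verdict = "nominally satisfied"
    else:
        verdict = "fully satisfied"
    return verdict, failed


# Hypothetical Research Object: has a workflow and a creator, but no README.
example_ro = {
    "aggregates": ["analysis.cwl", "data/flux.csv"],
    "createdBy": "Alice",
}
verdict, failed = evaluate(example_ro, checklist)
```

Keeping the requirements as data rather than code is what lets one generic evaluator serve many community profiles.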
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. "starts with "
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
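The three manifest checks listed above (JSON structure, cross-references between identifiers, URI patterns) could look roughly like this in Python. The manifest layout reuses the `aggregates` / `annotations` fields of a Research Object manifest. The entries and the required `https://identifiers.org/` prefix are assumptions chosen for illustration.

```python
# Hypothetical manifest content, as loaded from a manifest.json file.
manifest = {
    "aggregates": [
        {"uri": "/data/flux.csv"},
        {"uri": "https://identifiers.org/taxonomy:9606"},
        {"uri": "http://example.org/unregistered/thing"},
    ],
    "annotations": [
        {"about": "/data/flux.csv", "content": "/annotations/flux.json"},
        {"about": "/data/missing.csv", "content": "/annotations/missing.json"},
    ],
}


def structure_errors(manifest):
    """JSON structure: both top-level keys must be present and be lists."""
    return [key for key in ("aggregates", "annotations")
            if not isinstance(manifest.get(key), list)]


def dangling_annotations(manifest):
    """Cross-reference: every annotation must be about an aggregated URI."""
    aggregated = {entry["uri"] for entry in manifest["aggregates"]}
    return [a["about"] for a in manifest["annotations"]
            if a["about"] not in aggregated]


def bad_external_uris(manifest, prefix):
    """URI pattern: external (non-local) URIs must start with the prefix."""
    return [entry["uri"] for entry in manifest["aggregates"]
            if not entry["uri"].startswith("/")
            and not entry["uri"].startswith(prefix)]
```

None of these checks needs RDF tooling, which is why they can run on the plain JSON before any shape validation is attempted.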
The manifest ties everything together
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those that are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer that you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name of the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
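The precedence rules for labels and docs can be summarised in a short Python sketch. It reflects only the behaviour described in the text (step level first, then the tool's top level, with the identifier as the last resort for labels) and is not taken from the CWL Viewer source code. The dictionaries and function names are hypothetical.

```python
def display_label(step_id, step, tool):
    """Step-level label wins, then the tool's top-level label, then the id."""
    return step.get("label") or tool.get("label") or step_id


def display_doc(step, tool):
    """Step-level doc wins over the tool's top-level doc; may be None."""
    return step.get("doc") or tool.get("doc")


# Hypothetical tool description and two steps that run it.
tool = {"label": "Sort reads", "doc": "Sorts alignments by coordinate."}
step_with_label = {"label": "Sort tumour reads"}
step_without_label = {}

tumour_label = display_label("sort_tumour", step_with_label, tool)
normal_label = display_label("sort_normal", step_without_label, tool)
bare_label = display_label("sort_normal", {}, {})
```

Giving the step its own label is useful when one tool is used twice in a workflow, since the two boxes in the visualisation would otherwise carry the same name.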
22102017
An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo
Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo
Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo
httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article
Release
Instrument Analogy
Methodstechniques algorithms spec of the steps models versions robustness
Materialsdatasets parameters thresholds versions algorithm seeds
Experiment
Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets
Laboratory
computational environment versions
Setup
Report
Run
Instrument Analogy
bull Instruments Break
bull Technologies materials and methods change
bull Scope of use robustness
bull Blackboxes ndashdark and complicated
Workflow preservation amp repair
Reports + Machines Workflow Research Objects
bull W3C PROVbull Provenance
Templatesbull Trajectory
mapping
workflow engine
Workflow RunProvenance
Inputs Outputs
Intermediates
ParametersConfigs
Checksum
Communityontologies amp formats
Narrative
Linked DataJSON-LDRDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41
BioCompute Objects
Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783
Linked Data JSON-LD Ontologies (EDAM SWO)
Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
bull ldquoMust have a workflow that analyses next-gen sequencing datardquo
bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo
bull ldquoAll required data files must be providedrdquo
bull ldquoGeneric names should be avoidedrdquo
General Tools that do their best at unpacking and handing off
Did anyone take any notice
Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool
Standardised packing of Systems Biology models
European Space Agency RO LibraryEverest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Heath Learning Systems
Asthma Research e-Labsharing and computing statistical cohort studies
Precision medicineNGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: graph with numeric labels and contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v2(1), e20, pp 1-4, 2014. DOI 10.5334/jors.be
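The weighted, networked credit maps listed above can be sketched as follows. This is an illustration only, assuming each product carries a map of fractional shares (summing to 1.0) to people or to other products; the product names and weights are hypothetical and are not taken from the cited papers.

```python
from collections import defaultdict

# Hypothetical credit maps: each product assigns fractional credit either
# to a person or to another product that it builds on.
CREDIT_MAPS = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "workflow": 0.25},
    "workflow": {"Charlie": 0.8, "dataset": 0.2},
    "dataset": {"Bob": 1.0},
}


def transitive_credit(product, credit_maps, weight=1.0, totals=None, path=()):
    """Push `weight` units of credit from `product` down to the people behind it."""
    if totals is None:
        totals = defaultdict(float)
    if product in path:
        raise ValueError(f"cycle in credit maps at {product!r}")
    for target, share in credit_maps[product].items():
        if target in credit_maps:
            # Another product: recurse with the scaled-down weight.
            transitive_credit(target, credit_maps, weight * share, totals, path + (product,))
        else:
            # A person: accumulate their fraction.
            totals[target] += weight * share
    return dict(totals)


credit = transitive_credit("paper", CREDIT_MAPS)
for person in sorted(credit):
    print(person, round(credit[person], 3))
```

In this toy example Bob receives credit twice, once directly from the paper and once indirectly through the dataset used by the workflow, which is the indirect contribution the credit maps are meant to surface.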
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Instrument Analogy
Methods: techniques, algorithms, spec of the steps, models, versions, robustness
Materials: datasets, parameters, thresholds, versions, algorithm seeds
Experiment
Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
Laboratory: computational environment, versions
Setup
Report
Run
Instrument Analogy
• Instruments Break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated
Workflow preservation & repair
Reports + Machines: Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs
Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics, doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study, J Biomedical Semantics 5:41
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al, Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results, biorxiv.org 2017, https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine: NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich, self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
Manifest Construction
Identification, to locate and resolve things: RRI, DOI, URI, ORCID
Aggregates, to link things together: Open Archives Initiative Object Reuse and Exchange (OAI-ORE)
Annotations, about things & their relationships: W3C Web Annotation Vocabulary
Container: structured ZIP file based on ePub (OCF) & Adobe UCF specifications
http://www.researchobject.org/specifications
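A minimal sketch of how the three parts (identification, aggregates, annotations) might come together in a JSON manifest. The key names follow my reading of the Research Object Bundle manifest and should be treated as assumptions; the file paths and the ORCID are placeholders.

```python
import json


def build_manifest(resources, annotations, creator_name, creator_orcid):
    """Assemble a manifest dict: who made it, what it aggregates, what is said about it."""
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        # Identification: resolvable identifiers for people and things.
        "createdBy": {"name": creator_name, "orcid": creator_orcid},
        # Aggregates: everything the RO links together, local or remote.
        "aggregates": [{"uri": uri, "mediatype": mediatype} for uri, mediatype in resources],
        # Annotations: statements about aggregated resources, held in other resources.
        "annotations": [{"about": about, "content": content} for about, content in annotations],
    }


manifest = build_manifest(
    resources=[
        ("/workflow/analysis.cwl", "text/x-yaml"),      # hypothetical local file
        ("http://example.org/data/reads.tsv", "text/tab-separated-values"),
    ],
    annotations=[("/workflow/analysis.cwl", "/annotations/workflow-description.ttl")],
    creator_name="A. Researcher",                        # placeholder
    creator_orcid="https://orcid.org/0000-0000-0000-0000",  # placeholder, not a real ORCID
)
print(json.dumps(manifest, indent=2))
```

The manifest would then be written into the ZIP container alongside the aggregated files, so the description travels with the content.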
Manifest
Artist’s Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types / Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns e.g. “starts with ”
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
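The three kinds of check named above (JSON structure, URI patterns, cross-references between identifiers) can be sketched in a few lines. This is an illustrative validator, not an existing tool; the manifest keys and the required prefix are assumptions chosen for the example.

```python
def check_manifest(manifest, remote_prefix):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []

    # 1. JSON structure: required top-level keys.
    for key in ("id", "aggregates"):
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI patterns: every remote aggregate must start with the agreed prefix.
    aggregated = set()
    for entry in manifest.get("aggregates", []):
        uri = entry.get("uri")
        if not uri:
            problems.append("aggregate without a uri")
            continue
        aggregated.add(uri)
        is_remote = uri.startswith(("http://", "https://"))
        if is_remote and not uri.startswith(remote_prefix):
            problems.append(f"unexpected URI pattern: {uri}")

    # 3. Cross-references: annotations may only be about the RO or its aggregates.
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about != manifest.get("id") and about not in aggregated:
            problems.append(f"annotation about unknown resource: {about}")

    return problems


example = {
    "id": "/",
    "aggregates": [
        {"uri": "/data/reads.tsv"},
        {"uri": "http://example.org/other/model.xml"},   # hypothetical remote resource
    ],
    "annotations": [{"about": "/data/missing.tsv", "content": "/annotations/a.ttl"}],
}
for problem in check_manifest(example, remote_prefix="http://identifiers.org/"):
    print(problem)
```

Checks like these operate at the level of the whole manifest; constraints on individual complex types are where RDF shapes take over.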
The manifest ties everything together
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows.
Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top level short label summarising each tool and workflow.
Labels give the user an easy human-readable version of the name for the tool or workflow.
For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs.
For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more detailed description of the tool’s application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished.
Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files.
Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
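The label, doc, identifier and format recommendations above lend themselves to a simple lint pass. The sketch below assumes the CWL document has already been parsed (e.g. from YAML) into a dict and that inputs and outputs use the map form; the function name and the list of generic names are illustrative, not part of CWL Viewer.

```python
GENERIC_NAMES = {"result", "input", "output"}


def lint_cwl(doc):
    """Return best-practice warnings for a parsed CWL tool or workflow (a dict)."""
    warnings = []
    if not doc.get("label"):
        warnings.append("missing top-level label")
    if not doc.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for name, spec in doc.get(section, {}).items():
            # Conceptual identifiers: flag generic, uninformative names.
            if name.lower() in GENERIC_NAMES:
                warnings.append(f"{section}.{name}: generic identifier")
            # Format specification: every File should declare a format.
            if isinstance(spec, dict) and spec.get("type") == "File" and not spec.get("format"):
                warnings.append(f"{section}.{name}: File without format")
    return warnings


# Hypothetical tool description, already parsed into a dict.
tool = {
    "class": "CommandLineTool",
    "label": "Sort table",
    "inputs": {"input": {"type": "File"}},
    "outputs": {"sorted_table": {"type": "File", "format": "iana:text/plain"}},
}
for warning in lint_cwl(tool):
    print(warning)
```

A lint pass like this covers the syntactic recommendations; whether a declared format really belongs to an ontology is the kind of constraint the shape expressions that follow try to capture.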
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Reports + Machines
Workflow Research Objects
• W3C PROV
• Provenance Templates
• Trajectory mapping
workflow engine
Workflow Run Provenance
Inputs, Outputs, Intermediates
Parameters, Configs
Checksum
Community ontologies & formats
Narrative
Linked Data, JSON-LD, RDF
EDAM
Errors
tools
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41
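The workflow run provenance listed on this slide (inputs, outputs, parameters, checksums) could be captured along the following lines. This is a minimal sketch using plain dictionaries; the field names, file names and parameter are hypothetical and are not the W3C PROV or Wf4Ever vocabularies.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_run(inputs, outputs, parameters):
    """Record which files went in and out of a run, with checksums."""
    def entries(paths):
        return [{"file": Path(p).name, "sha256": sha256_of(p)} for p in paths]

    return {
        "inputs": entries(inputs),
        "outputs": entries(outputs),
        "parameters": parameters,
    }


# Hypothetical run: one input file, one output file, one parameter.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir, "reads.txt")
    source.write_bytes(b"ACGT\n")
    result = Path(workdir, "counts.txt")
    result.write_bytes(b"4\n")
    run_record = describe_run([source], [result], {"min_quality": 20})

run_json = json.dumps(run_record, indent=2)
print(run_json)
```

Storing a checksum next to each file lets a later reader confirm that the inputs and outputs inside a Research Object are the ones the run actually used.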
BioCompute Objects
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org 2017. https://doi.org/10.1101/191783
Linked Data, JSON-LD, Ontologies (EDAM, SWO)
Precision Medicine
NGS workflow exchange
FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse
How do we build manifests?
Rich self-describing semantic descriptions about resources and their relationships…
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
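A structured ZIP container of this kind can be sketched with the Python standard library. The sketch assumes the conventions of the Research Object Bundle specification as understood here: a first, uncompressed "mimetype" entry (as in ePub OCF and UCF) and a manifest at ".ro/manifest.json". Treat both the media type string and the manifest path as assumptions to check against the specification; the payload file is hypothetical.

```python
import io
import json
import zipfile

# Assumed media type for RO Bundles; verify against the specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(manifest, files):
    """Return the bytes of a ZIP archive laid out like an RO Bundle."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # OCF/UCF convention: "mimetype" comes first and is stored
        # uncompressed so tools can sniff the container type.
        bundle.writestr(
            zipfile.ZipInfo("mimetype"), MIMETYPE,
            compress_type=zipfile.ZIP_STORED,
        )
        bundle.writestr(
            ".ro/manifest.json", json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )
        for name, data in files.items():
            bundle.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Hypothetical content: one aggregated data file.
bundle_bytes = build_bundle(
    manifest={"id": "/", "aggregates": [{"uri": "/data/results.csv"}]},
    files={"data/results.csv": "sample,value\nA,1\n"},
)

with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as archive:
    entry_names = archive.namelist()
    mimetype_info = archive.getinfo("mimetype")
    mimetype_text = archive.read("mimetype").decode("ascii")
```

Any ordinary ZIP tool can still open the result, which is the point of choosing an "old favourite" container.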
Manifest Construction
Identification: to locate and resolve things
Aggregates: to link things together
Annotations: about things & their relationships
RRI, DOI, URI, ORCID
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives Initiative Object Reuse and Exchange (OAI-ORE)
http://www.researchobject.org/specifications/
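The three parts of manifest construction (identify, aggregate, annotate) might look like this when written out as JSON. The key names ("aggregates", "annotations", "about", "content", "createdBy", "mediatype") and the context URL follow the Research Object Bundle manifest as understood here and should be checked against the specification. The file names, the person and the all-zero ORCID are hypothetical placeholders, and the DOI is used only as an illustrative external reference.

```python
import json


def make_manifest(resources, annotations):
    """Build a manifest dict that identifies, aggregates and annotates."""
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        # Aggregates: every resource, local or external, identified by URI.
        "aggregates": [
            dict(uri=uri, **details) for uri, details in resources.items()
        ],
        # Annotations: (about, content) pairs linking a resource to a
        # description of it.
        "annotations": [
            {"about": about, "content": content}
            for about, content in annotations
        ],
    }


manifest = make_manifest(
    resources={
        "/data/results.csv": {"mediatype": "text/csv"},
        "/workflow.cwl": {
            "createdBy": {
                "name": "A. Researcher",  # hypothetical person
                "orcid": "https://orcid.org/0000-0000-0000-0000",  # placeholder
            },
        },
        "https://doi.org/10.1101/191783": {},  # external resource by DOI
    },
    annotations=[("/workflow.cwl", "annotations/workflow-description.ttl")],
)

manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```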
Manifest
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types
Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Data repository
Training Resource
Bioschemas, Bioschemas, Bioschemas
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols
Workflows and Tools
See Alasdair Gray's Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including Gigascience data journal
Minimum information for one content type
Common properties among content types
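Standardised schema.org mark-up that search engines and aggregators can harvest without an API is typically embedded in a web page as JSON-LD. The sketch below builds such mark-up for a Dataset and checks it against a "minimum information" list. The property names are real schema.org Dataset properties, but the minimum list, the dataset and the example.org URL are hypothetical and are not an actual Bioschemas profile.

```python
import json

# Hypothetical minimum-information list for one content type.
MINIMUM_PROPERTIES = ("name", "description", "url", "identifier")


def dataset_markup(**properties):
    """Return schema.org Dataset mark-up as a JSON-LD dictionary."""
    markup = {"@context": "https://schema.org", "@type": "Dataset"}
    markup.update(properties)
    return markup


def missing_properties(markup, minimum=MINIMUM_PROPERTIES):
    """List the minimum properties that are absent or empty."""
    return [name for name in minimum if not markup.get(name)]


def as_script_tag(markup):
    """Wrap the mark-up so it can be embedded in an HTML page."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


markup = dataset_markup(
    name="Example kinetic flux measurements",
    description="Illustrative dataset entry for a data catalogue.",
    url="https://example.org/datasets/42",
    keywords=["glycolysis", "kinetics"],
)
gaps = missing_properties(markup)  # "identifier" was not supplied
script_tag = as_script_tag(markup)
print(gaps)
print(script_tag)
```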
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. "starts with "
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
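The three kinds of manifest check named on this slide (JSON structure, URI patterns, cross-reference constraints on identifiers) can be sketched as one small function. The manifest layout is assumed to follow the "aggregates" and "annotations" keys of an RO Bundle manifest.json; the allowed URI prefixes and the wording of the messages are hypothetical.

```python
def check_manifest(manifest, allowed_prefixes=("/", "http://", "https://")):
    """Return a list of problems found in a manifest dictionary."""
    problems = []

    # 1. JSON structure: both sections must be lists.
    sections = {}
    for key in ("aggregates", "annotations"):
        value = manifest.get(key)
        if isinstance(value, list):
            sections[key] = value
        else:
            sections[key] = []
            problems.append(f"'{key}' must be a list")

    # 2. URI patterns: each aggregated URI must start with a known prefix.
    known = {"/"}  # the Research Object itself
    for entry in sections["aggregates"]:
        uri = entry.get("uri") if isinstance(entry, dict) else None
        if not isinstance(uri, str):
            problems.append("aggregate without a 'uri' string")
        elif not uri.startswith(allowed_prefixes):
            problems.append(f"unexpected URI pattern: {uri}")
        else:
            known.add(uri)

    # 3. Cross-references: annotations may only be about known resources.
    for note in sections["annotations"]:
        about = note.get("about") if isinstance(note, dict) else None
        targets = about if isinstance(about, list) else [about]
        for target in targets:
            if not isinstance(target, str) or target not in known:
                problems.append(f"annotation about unknown resource: {target}")

    return problems


good = {
    "aggregates": [{"uri": "/data/results.csv"}],
    "annotations": [{"about": "/data/results.csv", "content": "notes.ttl"}],
}
bad = {
    "aggregates": [{"uri": "ftp://example.org/a.csv"}],
    "annotations": [{"about": "/missing.csv", "content": "notes.ttl"}],
}
print(check_manifest(good))
print(check_manifest(bad))
```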
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows.
Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top level short label summarising each tool and workflow.
Labels give the user an easy human-readable version of the name for the tool or workflow.
For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs.
For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished.
Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files.
Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
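Several of these best practices lend themselves to an automatic check. The sketch below inspects a CWL description that has already been parsed into a Python dictionary (for example from YAML) and reports missing labels, missing docs, generic identifiers and File parameters without a format. It is not part of CWL Viewer or any CWL tool; the function name, the set of "generic" names and the example workflow are hypothetical, inputs and outputs are assumed to use CWL's map form, and the EDAM identifiers are illustrative.

```python
# Hypothetical list of names considered too generic to be helpful.
GENERIC_NAMES = {"result", "input", "output"}
FILE_TYPES = {"File", "File?", "File[]"}


def lint_cwl(document):
    """Return best-practice warnings for a parsed CWL document."""
    warnings = []
    for field in ("label", "doc"):
        if not document.get(field):
            warnings.append(f"missing top-level {field}")
    for section in ("inputs", "outputs"):
        for identifier, spec in document.get(section, {}).items():
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"generic identifier '{identifier}' in {section}")
            is_file = isinstance(spec, dict) and spec.get("type") in FILE_TYPES
            if is_file and "format" not in spec:
                warnings.append(f"File '{identifier}' in {section} has no format")
    return warnings


# Hypothetical workflow: no doc string, one badly named input without format.
workflow = {
    "class": "Workflow",
    "label": "Align reads",
    "inputs": {
        "input": {"type": "File"},
        "reference_genome": {"type": "File", "format": "edam:format_1929"},
    },
    "outputs": {
        "aligned_reads": {"type": "File", "format": "edam:format_2572"},
    },
}

warnings = lint_cwl(workflow)
for message in warnings:
    print(message)
```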
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not Graph Link Following
Info for Constraints is
• Embedded in a specific format
– Extract/convert from domain-specific formats
• Embedded in annotation resources
– Use existing schema.org annotations
• Need to be acquired
– e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
– Pre-declare ontologies
– Add derived annotations post-processing
RDF must already be in a single graph
Can't check if resource exists (e.g. 404)
Can't test format/representation of resource ("is it actually an Excel file?")
Can't apply nested RDF shapes to Linked Data resources
Can't say "Must be term from any resolvable ontology"
Can't check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"
General Tools that do their best at unpacking and handing off
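One of the checks a shape language cannot express, "is it actually an Excel file?", is the kind of thing a bespoke validator would do while iterating over the RO. A minimal sketch follows, assuming two well-known signatures: legacy Office files start with the OLE2 magic bytes, and .xlsx files are ZIP archives containing "xl/workbook.xml". The function name is hypothetical, and the OLE2 test alone cannot tell .xls from other legacy Office formats.

```python
import io
import zipfile

# Magic bytes of the OLE2 compound file format used by legacy .xls files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_excel(data):
    """Guess from the bytes whether content is an Excel workbook."""
    if data.startswith(OLE2_MAGIC):
        # Legacy container: .xls, but also other old Office formats.
        return True
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            return "xl/workbook.xml" in archive.namelist()
    return False


def make_zip(names):
    """Build an in-memory ZIP with empty members, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    return buffer.getvalue()


xlsx_like = make_zip(["[Content_Types].xml", "xl/workbook.xml"])
other_zip = make_zip(["data/results.csv"])
csv_bytes = b"name,value\n" * 10

print(looks_like_excel(xlsx_like), looks_like_excel(other_zip),
      looks_like_excel(csv_bytes))
```

A validator could run such a test on every aggregated resource whose declared media type claims to be a spreadsheet, and hand anything it does not understand off to a more specialised tool.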
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software
Dataset "build" tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what?
Using Provenance for Credit Mapping [Paolo Missier]
[Figure: W3C PROV dependency graph with numbered nodes and contributors Alice, Bob and Charlie]
Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016
W3C PROV dependency graph
"Provlets"
Granularity, Atomicity, Aggregation
How do we build manifests
Rich self-describing semantic descriptions about resources and
their relationshipshellip
Manifest Construction
Manifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their
relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed
Manifest
Containers are Many and Various
pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
repository of gt2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
bull all resources including external resources and outside references
bull attribution and provenance of each resource for credit and right versions
bull any part of the RO to be further described textually or semantically
bull extensibility point for community-driven standards
Manifest ConstructionManifest
Identificationto locate things
Aggregatesto link things together
Annotationsabout things amp their relationships
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
RRI DOI URI ORCID
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow.
Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.
Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces iana: https://www.iana.org/assignments/media-types/, for example iana:text/plain or iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed, and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
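Several of these best practices are mechanical enough to lint. The sketch below assumes the CWL document has already been parsed into a Python dictionary and that `inputs` and `outputs` use the map form (identifier to specification); `lint_cwl` and the `GENERIC_NAMES` set are hypothetical names chosen for illustration, and the check is far from a full CWL validator.

```python
# Identifiers the best practices call generic and uninformative.
GENERIC_NAMES = {"result", "input", "output"}


def lint_cwl(doc):
    """Return best-practice warnings for a parsed CWL tool or workflow."""
    warnings = []

    # Label Strings / Doc Strings: expect both at the top level.
    for field in ("label", "doc"):
        if not doc.get(field):
            warnings.append(f"missing top-level {field}")

    for section in ("inputs", "outputs"):
        for identifier, spec in doc.get(section, {}).items():
            # Conceptual Identifiers: avoid generic names.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}: generic identifier '{identifier}'")
            # Format Specification: every File should declare a format.
            is_file = isinstance(spec, dict) and spec.get("type") == "File"
            if is_file and "format" not in spec:
                warnings.append(f"{section}: File '{identifier}' has no format")
    return warnings


# Hypothetical parsed workflow used only to exercise the checks.
workflow = {
    "label": "Align reads",
    "inputs": {
        "reads": {"type": "File", "format": "edam:format_1930"},
        "input": {"type": "File"},
    },
    "outputs": {"result": {"type": "string"}},
}
warnings = lint_cwl(workflow)
```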
shouldHaveDoc: ( a cwl:Workflow | a cwl:Tool ), rdfs:comment LITERAL
shouldHaveLabel: ( a cwl:Workflow | a cwl:Tool ), rdfs:label LITERAL
step: a cwl:Step, cwl:inputs @shouldHaveFormat, cwl:outputs @shouldHaveFormat
shouldHaveFormat: cwl:File, dct:format ( @iana | @edam )
iana: IRI ^https://www.iana.org/assignments/media-types/
edam: IRI ^https?://edamontology.org/format_, rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
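The two format shapes boil down to IRI prefix tests, which can be sketched in a few lines. The regular expressions below are assumptions reconstructed from the profile (in particular, accepting both http and https for EDAM is a guess), and `format_kind` is a hypothetical helper.

```python
import re

# Prefix patterns assumed from the profile; adjust to the real schema.
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/")
EDAM = re.compile(r"^https?://edamontology\.org/format_")


def format_kind(iri):
    """Classify a format IRI as 'iana', 'edam', or None if it matches neither."""
    if IANA.match(iri):
        return "iana"
    if EDAM.match(iri):
        return "edam"
    return None
```

A prefix test like this only inspects the IRI string. It cannot tell whether the term really exists in the EDAM ontology or is a subclass of format_1915; that needs the ontology itself to be loaded.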
ShEx is SPO (subject-predicate-object) testing, not graph link following
Info for constraints is:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing
RDF must already be in a single graph
Can't check if resource exists (e.g. 404)
Can't test format/representation of resource ("is it actually an Excel file?")
Can't apply nested RDF shapes to Linked Data resources
Can't say "Must be term from any resolvable ontology"
Can't check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
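The pre-processing step can be pictured as two small operations: merge the separate annotation graphs into one, then add derived statements (such as an author name looked up from an ORCID) before any shape is tested. In this sketch triples are plain Python tuples rather than a real RDF library, the look-up table stands in for a live URI look-up, and the ORCID and names are made up.

```python
def merge_graphs(graphs):
    """Union several graphs (iterables of (s, p, o) triples) into one set."""
    merged = set()
    for graph in graphs:
        merged.update(graph)
    return merged


def add_author_names(graph, names_by_orcid):
    """Return a copy of the graph with a foaf:name triple for each known creator."""
    enriched = set(graph)
    for subject, predicate, obj in graph:
        if predicate == "dct:creator" and obj in names_by_orcid:
            enriched.add((obj, "foaf:name", names_by_orcid[obj]))
    return enriched


# Hypothetical annotation graphs taken from two files inside one RO.
orcid = "https://orcid.org/0000-0000-0000-0000"  # placeholder, not a real ORCID
provenance = {("/workflow.cwl", "dct:creator", orcid)}
description = {("/workflow.cwl", "rdfs:label", "Align reads")}

merged = merge_graphs([provenance, description])
enriched = add_author_names(merged, {orcid: "A. Researcher"})
```

Only after a step like this can a shape such as "the creator must have a name" be checked with plain SPO testing.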
Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"
General tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for data releases, as if they were software: dataset "build" tool
Standardised packing of Systems Biology models
European Space Agency RO Library, Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles.
Combining ISA-based Research Objects with nanopublications: complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools: easy to make, hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate, dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: weighted credit graph with contributors Alice, Charlie and Bob]
Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016.
W3C PROV dependency graph
"Provlets"
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. "Contriponents"
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p. e7. DOI: http://doi.org/10.5334/jors.by
D. S. Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, v2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
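The arithmetic behind transitive credit is short enough to sketch. Each product carries a credit map whose weights sum to 1; a contributor that is itself a product passes its share on to its own contributors. The function name, the product names and the weights below are all invented for illustration, and the sketch assumes the credit maps contain no cycles.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate each person's fractional credit for a product.

    credit_maps maps a product name to {contributor: share}; a contributor
    that also appears as a key is treated as a product and expanded.
    """
    if totals is None:
        totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # Indirect contribution: follow the component's own credit map.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps: a paper built on a workflow and a dataset.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 1.0},
    "dataset": {"Bob": 0.5, "Charlie": 0.5},
}
credit = transitive_credit("paper", credit_maps)
```

Bob never appears on the paper's own credit map, yet he ends up with 0.25 + 0.125 = 0.375 of the credit through the workflow and the dataset.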
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Container
Research Objects = Metadata Objects
Manifest Description
Type checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Containers are Many and Various
Pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
Repository of >2700 bioinformatics packages ready to use with conda install
Old Favourites
Zip Archives
BagIt Archives
ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
Manifest Construction
Manifest
Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
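A bundle of this kind can be sketched with the standard library alone: a ZIP file whose first entry is an uncompressed `mimetype` file, plus a JSON manifest listing what is aggregated and what annotates what. The media type, the `.ro/manifest.json` path, the context URI and the field names reflect my reading of the RO Bundle specification and should be checked against it; `build_bundle` and the file names are hypothetical.

```python
import io
import json
import zipfile

# Values assumed from the RO Bundle specification; verify before relying on them.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_bundle(files, annotations):
    """Return the bytes of a ZIP bundle holding the files and a manifest.

    files maps archive paths to text content; annotations is a list of
    {"about": ..., "content": ...} dictionaries.
    """
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # ePub-style convention: mimetype comes first and is stored uncompressed
        # (a bare ZipInfo defaults to ZIP_STORED).
        bundle.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE)
        for name in sorted(files):
            bundle.writestr(name, files[name], compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Hypothetical content for illustration.
data = build_bundle(
    {"data/flux.csv": "t,v\n0,1\n", "annotations/flux.ttl": "# description\n"},
    [{"about": "/data/flux.csv", "content": "annotations/flux.ttl"}],
)
```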
RRI, DOI, URI, ORCID
W3C Web Annotation Vocabulary
Open Archives Initiative Object Exchange and Reuse
http://www.researchobject.org/specifications
Manifest
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from, its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types / Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
Data repository
Training resource
Bioschemas
Search engines
Registries
Data aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity off-the-shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & events
Laboratory protocols
Workflows and tools
See Alasdair Gray's poster
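Publishing metadata "without APIs or special feeds" means embedding schema.org JSON-LD in the web page itself, where any harvester can pick it up. A minimal sketch follows; it uses only generic schema.org `Dataset` properties, does not model any Bioschemas profile, and `dataset_markup` and the example values are invented. It assumes trusted input: no escaping is applied to the values.

```python
import json


def dataset_markup(name, description, url, keywords):
    """Return an HTML script element carrying schema.org Dataset JSON-LD."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
    }
    body = json.dumps(record, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical dataset page, for illustration only.
snippet = dataset_markup(
    "Glycolysis kinetic flux data",
    "Flux measurements used to validate a kinetic model.",
    "https://example.org/datasets/flux",
    ["kinetics", "glycolysis"],
)
```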
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
bull ldquoMust have a workflow that analyses next-gen sequencing datardquo
bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo
bull ldquoAll required data files must be providedrdquo
bull ldquoGeneric names should be avoidedrdquo
General Tools that do their best at unpacking and handing off
Did anyone take any notice
Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool
Standardised packing of Systems Biology models
European Space Agency RO LibraryEverest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Heath Learning Systems
Asthma Research e-Labsharing and computing statistical cohort studies
Precision medicineNGS pipelines regulation
Did anyone take any notice httpwwwyoutubecomwatchv=p-W4iLjLTrQamplist=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner Chair STMFuture Labs Committee CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT January 2014 Lorenz Centre Leiden
Ted SlaterYARC OpenBEL
A trendhellip Using JSON(-LD) + schemaorg
httpsdokieli
httpslinkedresearchorg
Manifest Schemaorg JSON-LD RDFArchive targz
Reproducible Document Stack project
eLife Substance and Stencila
BagIT data profile + schemaorg JSON-LD annotations
We should have called this ldquoResearch Objectsrdquo Donrsquot be too clever about your titles
Combining ISA-based Research Objects with nanopublicatiionsComplementary approaches
Take-up Analogy Start Ups
Community
Driver
ToolsEasy to make
Hard to consume
Workflows
Reproducibility
Portability between
platforms
Platform amp user buy-in from the get-go
Passionate dedicated leadership
Stan
dard
s
Open Questions
Stewardshipbull owners sites authors
Spanningbull platforms researchers
Lifecyclebull composition forkinghellip
Governance
Creditbull micro-credit amp citation
propagation attribution
Tamper proofingbull blockchain ethereum
Maintenancebull of evolving contentbull incrementality amp
degradation
Manifestsbull profile amp
template makingbull auto
manufacture
Who gets credit for whatUsing Provenance for Credit Mapping
[Paolo Missier]
1
3
2
2
34
11
1
2
2
5
3
3
4
3Alice
Charlie
Bob
Paolo Missier Data Trajectories tracking reuse of published data for transitive credit attribution IDCC 2016
W3C PROVdependency graph
ldquoProvletsrdquo
Granularity Atomicity Aggregation
bull Tracking RO usage and indirect contributions
bull Awarding fractional credit to contributors
1 ldquoContriponentsrdquo
bull contributors + components
2 Weighted contribution
3 Networked Credit maps
bull Travel with the contriponents
Transitive Credit contribution[Dan Katz and Arfon Smith]
Katz DS amp Smith AM (2015) Transitive Credit and JSON-LD Journal of Open Research Software 3(1) pe7 DOI httpdoiorg105334jorsby
D S Katz Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products Journal of Open Research Software v2(1) e20 pp 1-4 2014 DOI 105334jorsbe
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Manifest Construction
Identification: to locate and resolve things
• RRI, DOI, URI, ORCID
Aggregates: to link things together
• Open Archives Initiative Object Exchange and Reuse
Annotations: about things & their relationships
• W3C Web Annotation Vocabulary
Structured ZIP file based on ePub (OCF) & Adobe UCF specifications
• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described, textually or semantically
• extensibility point for community-driven standards
http://www.researchobject.org/specifications/
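A minimal sketch of what a manifest with identification, aggregates and annotations might look like when packed into a structured ZIP file. The key names, the `.ro/manifest.json` path, the context URL and the media type follow my reading of the Research Object Bundle specification and should be treated as assumptions to check against it; the file paths, the creator and the ORCID are purely hypothetical.

```python
import io
import json
import zipfile


def build_manifest(aggregates, annotations, created_by):
    """Assemble a manifest dictionary (key names assumed from the RO Bundle spec)."""
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "createdBy": created_by,
        "aggregates": aggregates,    # things linked together, local or external
        "annotations": annotations,  # statements about things
    }


def write_bundle(manifest, files):
    """Return the bytes of a ZIP bundle holding `files` plus the manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # As in ePub OCF, the mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", "application/vnd.wf4ever.robundle+zip",
                        compress_type=zipfile.ZIP_STORED)
        for path, content in files.items():
            bundle.writestr(path, content, compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


manifest = build_manifest(
    aggregates=[
        {"uri": "/workflow/analysis.cwl"},           # hypothetical local file
        {"uri": "https://doi.org/10.5334/jors.by"},  # external reference by DOI
    ],
    annotations=[
        {"about": "/workflow/analysis.cwl",
         "content": "/.ro/annotations/analysis.jsonld"},  # hypothetical path
    ],
    created_by={"name": "Alice",
                "orcid": "https://orcid.org/0000-0000-0000-0000"},  # placeholder
)
bundle_bytes = write_bundle(manifest, {"workflow/analysis.cwl": "cwlVersion: v1.0\n"})
```

The bundle is built in memory here; writing `bundle_bytes` to a file gives an archive that ordinary ZIP tools can open.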
Manifest
Artist’s Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from, its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types, Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
[Figure: data repositories and a training resource marked up with Bioschemas, feeding search engines, registries and data aggregators]
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity off-the-shelf tools, app eco-system, lightweight
Sample Catalogue: BBMRI-ERIC Directory
Training materials & Events, Laboratory protocols, Workflows and Tools
See Alasdair Gray’s Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
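As a rough illustration of standardised mark-up that can be harvested without APIs or special feeds, the sketch below emits schema.org JSON-LD for a dataset and wraps it in the script tag that web pages use for embedding. The properties used (`name`, `description`, `url`, `identifier`) are standard schema.org `Dataset` properties; the dataset itself and its URL are invented, and no particular Bioschemas profile is implied.

```python
import json


def dataset_markup(name, description, url, identifier):
    """Describe a dataset with a handful of schema.org properties."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }


def as_script_tag(markup):
    """Wrap JSON-LD in the script element that crawlers look for in a page."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset, for illustration only.
markup = dataset_markup(
    name="Example kinetic model data",
    description="Illustrative record; not a real dataset.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/datasets/1",
)
tag = as_script_tag(markup)
print(tag)
```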
Manifest Description Profiles
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with”
Check JSON Structure
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
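The three kinds of check named above (JSON structure, URI patterns, cross-references between identifiers) can be sketched over a manifest held as a dictionary. The key names (`id`, `aggregates`, `annotations`, `uri`, `about`) and the allowed prefixes are assumptions chosen for the illustration, not requirements of any particular profile.

```python
# Hypothetical policy: external resources must use one of these prefixes.
ALLOWED_PREFIXES = ("https://doi.org/", "https://identifiers.org/")


def check_manifest(manifest, allowed_prefixes=ALLOWED_PREFIXES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # 1. JSON structure: required top-level keys.
    for key in ("id", "aggregates"):
        if key not in manifest:
            problems.append(f"missing key: {key}")

    aggregated = {entry.get("uri") for entry in manifest.get("aggregates", [])}
    aggregated.discard(None)

    # 2. URI patterns: anything not local ("/...") must start with an allowed prefix.
    for uri in sorted(aggregated):
        if not uri.startswith("/") and not uri.startswith(tuple(allowed_prefixes)):
            problems.append(f"bad URI pattern: {uri}")

    # 3. Cross-references: annotations may only be about aggregated resources.
    for annotation in manifest.get("annotations", []):
        if annotation.get("about") not in aggregated:
            problems.append(f"annotation about unknown resource: {annotation.get('about')}")

    return problems


good = {
    "id": "/",
    "aggregates": [{"uri": "/data/a.csv"},
                   {"uri": "https://identifiers.org/taxonomy:9606"}],
    "annotations": [{"about": "/data/a.csv", "content": "/.ro/a.jsonld"}],
}
bad = {
    "aggregates": [{"uri": "ftp://example.org/x"}],
    "annotations": [{"about": "/missing.csv"}],
}
print(check_manifest(good))
print(check_manifest(bad))
```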
Case study: Back to Workflows
Workflow description, Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO: CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.
Format Specification
The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
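Several of the best practices above are mechanical enough to check automatically. The following is a minimal sketch, assuming the CWL document has already been parsed into a Python dictionary with inputs and outputs in map form and explicit `type` fields. It is not how CWL Viewer itself works, and the list of generic names is only the examples given in the text.

```python
# Names the best practices call generic and uninformative.
GENERIC_NAMES = {"result", "input", "output"}


def lint_cwl(document):
    """Return warnings for missing labels, docs, formats and generic identifiers."""
    warnings = []
    if not document.get("label"):
        warnings.append("missing top-level label")
    if not document.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for identifier, parameter in document.get(section, {}).items():
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}.{identifier}: generic identifier")
            # The format field should be given for every File parameter.
            if parameter.get("type") == "File" and not parameter.get("format"):
                warnings.append(f"{section}.{identifier}: File without format")
    return warnings


# Hypothetical tool description, already parsed from YAML.
tool = {
    "class": "CommandLineTool",
    "label": "Sort lines",
    "inputs": {
        "input": {"type": "File"},
        "unsorted_text": {"type": "File", "format": "iana:text/plain"},
    },
    "outputs": {
        "sorted_text": {"type": "File", "format": "iana:text/plain"},
    },
}
for warning in lint_cwl(tool):
    print(warning)
```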
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not Graph Link Following
Info for Constraints are
• Embedded in a specific format
– Extract/convert from domain-specific formats
• Embedded in annotation resources
– Use existing schema.org annotations
• Need to be acquired
– e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
– Pre-declare ontologies
– Add derived annotations post-processing
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators, unpackers to iterate over the RO
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library, Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines, regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of the United States)
Manifest Construction
Identificationto locate and resolve things
Aggregatesto link things together
Annotationsabout things amp their relationships
RRI DOI URI ORCID
Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications
W3C Web Annotation Vocabulary
Open Archives InitiativeObject Exchange and Reuse
httpwwwresearchobjectorgspecifications
Manifest
Artists Impression
The real manifest
bull A Manifest
for 27 A4 pages hellip
RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators / unpackers to iterate over the RO
Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software: Dataset "build" tool
Standardised packing of Systems Biology models
European Space Agency RO Library, Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Centre, Leiden
Ted Slater, YARC OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles.
Combining ISA-based Research Objects with nanopublications: complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: W3C PROV dependency graph with numeric labels; contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
"Provlets"
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. "Contriponents"
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v2(1), e20, pp 1-4, 2014. DOI: 10.5334/jors.be
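As a rough illustration of weighted, transitive credit, the Python sketch below passes fractional credit through a small credit map. The map layout, the names (credit_maps, transitive_credit) and the example weights are invented for illustration and are not taken from the cited papers. It assumes each product's weights sum to 1 and that the dependency graph has no cycles.

```python
def transitive_credit(product, credit_maps, share=1.0, totals=None):
    """Distribute `share` of the credit for `product` to people, following
    weights through any products it depends on (acyclic graphs only)."""
    if totals is None:
        totals = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass the credit on.
            transitive_credit(contributor, credit_maps, share * weight, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + share * weight
    return totals


# Hypothetical example: a paper credits two people and a dataset,
# and the dataset in turn credits two people.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "dataset": 0.25},
    "dataset": {"Charlie": 0.5, "Bob": 0.5},
}
```

Calling transitive_credit("paper", credit_maps) gives Bob credit both directly and through the dataset, which is the indirect contribution the slide refers to.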
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Artist's Impression
The real manifest
• A Manifest for 27 A4 pages …
RO manifest from FAIRDOM https://doi.org/10.15490/seek.1.investigation.5
The need for embedded tools
Manifest Description Profiles
where it came from, its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types
Context
All
VoID
OmicsDI
Trend: JSON(-LD) + Schemas
Manifest: schema.org tailored to the Biosciences
[Diagram labels: Data repository, Training Resource, Bioschemas; Search engines, Registries, Data Aggregators]
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools
App eco-system
Lightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events, Laboratory protocols, Workflows and Tools
See Alasdair Gray's Poster
Manifest: schema.org tailored to the Biosciences
13 public datasets marked up, including GigaScience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data, IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. "starts with "
Check JSON Structure
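The three manifest checks above can be sketched with the Python standard library. The manifest layout assumed here (an aggregates list of objects with a uri, and an annotations list whose about points at aggregated resources) is an assumption modelled on the Research Object Bundle manifest; the function name and the required prefix are hypothetical.

```python
import json


def check_manifest(text, required_prefix):
    """Return a list of problems found in a manifest given as JSON text."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]

    # JSON structure: 'aggregates' must be a list of objects with a 'uri'.
    aggregates = manifest.get("aggregates")
    if not isinstance(aggregates, list):
        return ["'aggregates' must be a list"]

    problems = []
    uris = set()
    for entry in aggregates:
        uri = entry.get("uri") if isinstance(entry, dict) else None
        if not isinstance(uri, str):
            problems.append("aggregate without a 'uri'")
            continue
        uris.add(uri)
        # URI pattern: every aggregated resource must start with the prefix.
        if not uri.startswith(required_prefix):
            problems.append(f"{uri}: does not start with {required_prefix}")

    # Cross-reference: annotations may only be about aggregated resources.
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about") if isinstance(annotation, dict) else None
        targets = about if isinstance(about, list) else [about]
        for target in targets:
            if not isinstance(target, str) or target not in uris:
                problems.append(f"annotation about unknown resource: {target}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a generic tool report everything it found before handing off to domain-specific validators.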
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
Case study: Back to Workflows
Workflow description, Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO: CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
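A small Python sketch of two rules stated above: flagging generic identifiers, and preferring a label over the identifier when one is provided. The list of generic names and both function names are assumptions for illustration; the CWL Viewer's actual behaviour is not shown in these slides.

```python
# Names treated as uninformative; this list is an assumption for illustration.
GENERIC_NAMES = {"result", "results", "input", "output", "in", "out", "file"}


def generic_identifiers(identifiers):
    """Return the identifiers that look generic, preserving their order."""
    return [name for name in identifiers if name.lower() in GENERIC_NAMES]


def display_name(identifier, label=None):
    """Prefer the label over the identifier when a label is provided."""
    return label if label else identifier
```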
The need for embedded tools
Manifest Description Profiles
where it came fromits evolution
what else is needed
what should be there for types
Manifest
Project LabSpecific
Community-based TypesContext
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schemaorg tailored to the Biosciences
Datarepository
Datarepository
TrainingResource
Bioschemas BioschemasBioschemas
Search engines
RegistriesData
Aggregators
Standardised metadatamark-up
Metadata published and harvested without APIs or special feeds
CommodityOff the Shelf toolsApp eco-systemLightweight
Sample Catalogue
BBMRI-ERIC Directory
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
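To make the shouldHaveFormat shape concrete, here is a rough Python analogue that tests triples held in memory. It is only a sketch under stated assumptions: triples are plain (subject, predicate, object) tuples, the prefixes are taken from the slide, and `should_have_format` is a hypothetical helper rather than part of any ShEx library.

```python
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"
# Both spellings of the EDAM prefix appear on the slide, so accept either.
EDAM_PREFIXES = (
    "https://edamontology.org/format_",
    "http://edamontology.org/format_",
)
DCT_FORMAT = "dct:format"


def should_have_format(triples, node):
    """True if node has at least one dct:format and every value is an IANA or EDAM IRI."""
    formats = [o for s, p, o in triples if s == node and p == DCT_FORMAT]
    allowed = (IANA_PREFIX,) + EDAM_PREFIXES
    return bool(formats) and all(o.startswith(allowed) for o in formats)


# Illustrative data; the ex: identifiers are invented.
triples = [
    ("ex:reads", DCT_FORMAT, IANA_PREFIX + "text/plain"),
    ("ex:table", DCT_FORMAT, "xlsx"),
]

print(should_have_format(triples, "ex:reads"))    # format is an IANA IRI
print(should_have_format(triples, "ex:table"))    # format is not a recognised IRI
print(should_have_format(triples, "ex:missing"))  # no format triple at all
```

Note that the check only inspects the IRI string. It does not confirm that the term exists in EDAM or is a subclass of format_1915, which would require following links.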
ShEx is SPO (subject-predicate-object) testing, not Graph Link Following
Info for Constraints are:
• Embedded in a specific format
  - Extract/convert from domain-specific formats
• Embedded in annotation resources
  - Use existing schema.org annotations
• Need to be acquired
  - e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations post-processing
RDF must already be in a single graph
Can't check if resource exists (e.g. 404)
Can't test format/representation of resource ("is it actually an Excel file?")
Can't apply nested RDF shapes to Linked Data resources
Can't say "Must be term from any resolvable ontology"
Can't check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
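The pre-processing step can be pictured as a union of triples followed by the addition of derived statements. The sketch below assumes triples are plain tuples and replaces the real URI look-up (ORCID -> author name) with a hard-coded dictionary; `merge_research_object`, the predicates and the placeholder ORCID are all hypothetical.

```python
def merge_research_object(manifest, annotations, lookups=None):
    """Union manifest and annotation triples, then add triples derived from look-ups."""
    merged = set(manifest)
    for triples in annotations:
        merged |= set(triples)
    lookups = lookups or {}
    # Iterate over a snapshot so the set can grow while derived triples are added.
    for s, p, o in sorted(merged):
        if p == "dct:creator" and o in lookups:
            merged.add((o, "foaf:name", lookups[o]))
    return merged


# Placeholder identifier and name, standing in for a real look-up service.
ORCID = "https://orcid.org/0000-0000-0000-0000"
manifest = [("ro:bundle", "ro:aggregates", "workflow.cwl")]
annotations = [
    [("workflow.cwl", "dct:creator", ORCID)],
    [("workflow.cwl", "rdfs:label", "Example workflow")],
]

graph = merge_research_object(manifest, annotations, {ORCID: "Example Author"})
for triple in sorted(graph):
    print(triple)
```

Once everything sits in one graph, an off-the-shelf shape validator can be handed the result; anything it cannot test (existence, file representation) stays with the bespoke tooling.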
Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"
General Tools that do their best at unpacking and handing off
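One way to picture a general tool that hands off to domain-specific checks is a small runner that applies a list of check functions to an unpacked Research Object. This is a sketch only: the RO is modelled as a dict with an `aggregates` list of `uri` entries (an assumption about the unpacked form), and the check functions and file names are invented for illustration.

```python
def check_required_files(ro, required):
    """'All required data files must be provided'."""
    present = {item["uri"] for item in ro["aggregates"]}
    return [f"missing required file: {f}" for f in sorted(set(required) - present)]


def check_generic_names(ro, generic=("result", "input", "output")):
    """'Generic names should be avoided' (judged on the file name without extension)."""
    problems = []
    for item in ro["aggregates"]:
        stem = item["uri"].rsplit("/", 1)[-1].split(".", 1)[0]
        if stem.lower() in generic:
            problems.append(f"generic name: {item['uri']}")
    return problems


def validate(ro, checks):
    """General runner: apply each domain-specific check and collect the findings."""
    report = []
    for check in checks:
        report.extend(check(ro))
    return report


ro = {"aggregates": [{"uri": "data/kinetics.csv"}, {"uri": "data/result.csv"}]}
checks = [
    lambda r: check_required_files(r, ["data/kinetics.csv", "sop/assay.pdf"]),
    check_generic_names,
]
print(validate(ro, checks))
```

The runner knows nothing about the domain; each community supplies its own checks.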
Did anyone take any notice?
Research Object Bundles for Data Releases, as if they were software
Dataset "build" tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine
NGS pipelines regulation
Did anyone take any notice?
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
STM Innovations Seminar 2012
Howard Ratner, Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorenz Centre, Leiden
Ted Slater
YARC OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: Schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife Substance and Stencila
BagIT data profile + schema.org JSON-LD annotations
We should have called this "Research Objects". Don't be too clever about your titles.
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what?
Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: credit map with numeric edge weights linking contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
"Provlets"
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. "Contriponents"
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution
[Dan Katz and Arfon Smith]
Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software, 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
Katz, D.S. (2014). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, 2(1), e20, pp. 1-4. DOI: 10.5334/jors.be
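The idea of weighted, networked credit maps can be illustrated with a few lines of Python. This is a simplified sketch of the general idea, not the scheme from the cited papers: each product carries a credit map whose shares sum to 1, a share may point to a person or to another product, and credit flows through products by multiplying shares along the path. The function name, the sample maps and the weights are invented, and the dependency graph is assumed to be acyclic.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate each person's share of `product`, following credited products."""
    totals = {} if totals is None else totals
    for name, share in credit_maps[product].items():
        if name in credit_maps:
            # The credited item is itself a product: pass its share downstream.
            transitive_credit(name, credit_maps, weight * share, totals)
        else:
            # The credited item is a person: add the weighted share.
            totals[name] = totals.get(name, 0.0) + weight * share
    return totals


# Illustrative credit maps; weights are arbitrary and sum to 1 per product.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "dataset": 0.25},
    "dataset": {"Charlie": 0.75, "Alice": 0.25},
}

print(transitive_credit("paper", credit_maps))
```

Here Charlie never appears in the paper's own credit map, yet receives a fraction of its credit through the dataset the paper depends on.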
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Manifest Description Profiles
where it came from
its evolution
what else is needed
what should be there for types
Manifest
Project / Lab Specific
Community-based Types
Context
All
VoID
OmicsDI
Trend JSON(-LD) + Schemas
Manifest schema.org tailored to the Biosciences
[Figure: two data repositories and a training resource, each marked up with Bioschemas]
Search engines
Registries
Data Aggregators
Standardised metadata mark-up
Metadata published and harvested without APIs or special feeds
Commodity Off the Shelf tools
App eco-system
Lightweight
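Publishing and harvesting without special feeds works because the mark-up travels inside the web page itself. The sketch below shows the round trip using only the standard library: `dataset_markup` embeds schema.org JSON-LD in a script element and `harvest` reads it back. Both function names, the sample values and the single-block assumption in `harvest` are illustrative; a real Bioschemas profile would require more properties than shown here.

```python
import json


def dataset_markup(name, description, url):
    """Embed a minimal schema.org Dataset description in a JSON-LD script element."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2)
        + "\n</script>"
    )


def harvest(html):
    """Read the JSON-LD back; assumes the snippet holds exactly one script block."""
    start = html.index(">") + 1
    end = html.rindex("</script>")
    return json.loads(html[start:end])


snippet = dataset_markup(
    "Example kinetics dataset",              # invented sample values
    "Illustrative dataset description.",
    "https://example.org/datasets/1",
)
print(snippet)
print(harvest(snippet)["name"])
```

Any crawler that understands JSON-LD can consume the same page, which is what keeps the approach lightweight.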
Sample Catalogue
BBMRI-ERIC Directory
Training materials & Events
Laboratory protocols, Workflows and Tools
See Alasdair Gray's Poster
Manifest schema.org tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
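A checklist of this kind can be sketched as a list of named rules, each tagged with a requirement level. The code below is a loose illustration rather than the Minim vocabulary itself: the MUST/SHOULD/MAY levels, the verdict labels, the rule functions and the sample Research Object are all assumptions made for the example.

```python
def evaluate_checklist(resource, checklist):
    """Run each (level, name, rule) item and summarise which levels were missed."""
    missing = {"MUST": [], "SHOULD": [], "MAY": []}
    for level, name, rule in checklist:
        if not rule(resource):
            missing[level].append(name)
    # The verdict is driven by the strictest level that has a missing item.
    if missing["MUST"]:
        verdict = "fails"
    elif missing["SHOULD"]:
        verdict = "minimally satisfies"
    elif missing["MAY"]:
        verdict = "nominally satisfies"
    else:
        verdict = "fully satisfies"
    return verdict, missing


# Hypothetical checklist for a workflow-centric Research Object.
checklist = [
    ("MUST", "has a workflow", lambda ro: any(f.endswith(".cwl") for f in ro["files"])),
    ("SHOULD", "has a README", lambda ro: "README.md" in ro["files"]),
    ("MAY", "has a licence", lambda ro: "LICENSE" in ro["files"]),
]

ro = {"files": ["workflow.cwl", "LICENSE"]}
verdict, missing = evaluate_checklist(ro, checklist)
print(verdict, missing)
```

Separating the checklist (data) from the evaluator (code) is what lets one generic tool serve many communities.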
Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. "starts with "
Check JSON Structure
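The three kinds of manifest check listed above can be sketched in one small function. This is an illustration under assumptions, not a validator for any particular specification: the manifest is taken to be JSON with an `aggregates` list of `uri` entries and an `annotations` list of `about` entries, the URI prefix is supplied by the caller, and `check_manifest` and the sample data are invented.

```python
import json


def check_manifest(text, uri_prefix):
    """Check JSON structure, URI patterns and cross-references of a manifest."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    # JSON structure: both sections must be present and be lists.
    problems = [
        f"'{key}' must be a list"
        for key in ("aggregates", "annotations")
        if not isinstance(manifest.get(key), list)
    ]
    if problems:
        return problems
    uris = [entry.get("uri", "") for entry in manifest["aggregates"]]
    # URI patterns: every aggregated resource must start with the given prefix.
    for uri in uris:
        if not uri.startswith(uri_prefix):
            problems.append(f"URI does not start with {uri_prefix}: {uri}")
    # Cross-references: annotations may only describe aggregated resources.
    for ann in manifest["annotations"]:
        if ann.get("about") not in uris:
            problems.append(f"annotation about unknown resource: {ann.get('about')}")
    return problems


sample = json.dumps({
    "aggregates": [{"uri": "/data/a.csv"}, {"uri": "http://example.org/b"}],
    "annotations": [{"about": "/data/a.csv"}, {"about": "/data/missing.csv"}],
})
print(check_manifest(sample, "/data/"))
```

Checks like these need no RDF tooling at all, which is part of the appeal of a plain JSON manifest.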
Different levels, from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows.
Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow.
Labels give the user an easy, human-readable version of the name for the tool or workflow.
For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step, if preferred.
Training materials amp EventsLaboratory protocols Workflows and Tools
See Alasdair Grayrsquos Poster
Manifest schemaorg tailored to the Biosciences
13 public datasets marked up including Gigascience data journal
Minimum informationfor one content type
Common propertiesamong content
types
Manifest Description ProfilesManifest
Minim model for defining checklists
Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489
httppurlorgminimdescription
Validation and Monitoring Toolsrich RDF-based generated from the workflow systems
Bespoke tooling SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools
bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)
Manifest construction Check cross-reference
constraints on identifiers
Check URI patterns eg ldquostarts with rdquo
Check JSON Structure
Different levelsfrom Whole studies to Complex types
identifiersorg
PROV
JSON
manifestjson
httpsdoiorg101109BigData20167840618
The manifest ties everything together
Case study Back to Workflows
Workflow descriptionTool description
EDAM OntologySWO OntologyData FormatsBioschemasorg
Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability
Based on wf4ever wfdesc
bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for
workflow structure matches 11 with YAML
bull schemaorg annotations
Download as a Research Object Bundle
Over an active githubentry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO CWL RO + added richness
Lift out parts into the manifest
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for constraints is:
• Embedded in a specific format
  - Extract/convert from domain-specific formats
• Embedded in annotation resources
  - Use existing schema.org annotations
• Needing to be acquired, e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations in post-processing
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators / unpackers to iterate over the RO
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General Tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for Data Releases
as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab
sharing and computing statistical cohort studies
Precision medicine
NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater, YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles.
Combining ISA-based Research Objects with nanopublications
Complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what?
Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: graph with numeric labels linking contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz, D.S. & Smith, A.M. (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
D. S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
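The idea of weighted, networked credit maps can be sketched in a few lines. This is an illustrative reading of transitive credit, not the authors' implementation: each product carries a map of fractional weights over its contributors and the components it builds on, and credit given to a component is passed on, multiplied by the weight, to that component's own map. The product names and weights are invented, and the sketch assumes the maps contain no cycles.

```python
def transitive_credit(credit_maps, product, share=1.0, totals=None):
    """Distribute `share` of credit for `product` down to people.

    credit_maps maps a product name to {contributor_or_component: weight}.
    Any name that is itself a key of credit_maps is treated as a component
    whose credit is passed on; any other name is treated as a person.
    """
    if totals is None:
        totals = {}
    for name, weight in credit_maps.get(product, {}).items():
        if name in credit_maps:
            transitive_credit(credit_maps, name, share * weight, totals)
        else:
            totals[name] = totals.get(name, 0.0) + share * weight
    return totals


# Hypothetical credit maps; the weights in each map sum to 1.
example_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.5},
    "workflow": {"Bob": 0.6, "Charlie": 0.2, "dataset": 0.2},
    "dataset": {"Charlie": 1.0},
}
credit = transitive_credit(example_maps, "paper")
for person in sorted(credit):
    print(person, round(credit[person], 3))
```

In this example Charlie receives credit along two paths, directly from the workflow and indirectly through the dataset, which is the kind of indirect contribution the credit-map bullets describe.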
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK, Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Minimum information for one content type
Common properties among content types
Manifest Description Profiles
Manifest
Minim model for defining checklists
Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489
http://purl.org/minim/description
Validation and Monitoring Tools
rich RDF-based, generated from the workflow systems
Bespoke tooling, SPIN-based checking
How can we express the Syntax and Semantics of Profiles to make generic tools?
• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
Manifest construction
Check cross-reference constraints on identifiers
Check URI patterns, e.g. “starts with ”
Check JSON structure
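The three manifest checks listed above (JSON structure, URI patterns, cross-references between identifiers) can be sketched as follows. The sketch assumes a manifest with an `aggregates` list of `{"uri": ...}` entries and an `annotations` list whose `about` field points at aggregated resources, in the style of a Research Object Bundle manifest.json. The required prefix is a caller-supplied assumption, since the slide leaves the pattern unspecified, and the function name and example values are hypothetical.

```python
import json


def check_manifest(text: str, uri_prefix: str) -> list:
    """Return a list of human-readable problems found in a manifest."""
    manifest = json.loads(text)
    aggregates = manifest.get("aggregates")
    if not isinstance(aggregates, list):          # JSON structure check
        return ["'aggregates' must be a list"]

    problems = []
    uris = set()
    for entry in aggregates:
        uri = entry.get("uri") if isinstance(entry, dict) else None
        if not isinstance(uri, str):
            problems.append("aggregate without a 'uri' string")
            continue
        uris.add(uri)
        if not uri.startswith(uri_prefix):        # URI pattern check
            problems.append(f"{uri} does not start with {uri_prefix}")

    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        targets = about if isinstance(about, list) else [about]
        for target in targets:                    # cross-reference check
            if target not in uris:
                problems.append(f"annotation about unknown resource: {target}")
    return problems


# Illustrative manifest with one out-of-pattern URI and one dangling reference.
example_manifest = json.dumps({
    "aggregates": [{"uri": "/data/reads.csv"}, {"uri": "/tmp/scratch.txt"}],
    "annotations": [
        {"about": "/data/reads.csv", "content": "annotations/reads.ttl"},
        {"about": "/data/missing.csv", "content": "annotations/missing.ttl"},
    ],
})
found = check_manifest(example_manifest, "/data/")
print(found)
```

Checks of this kind are generic in the sense the slide asks for: they depend only on the manifest layout, not on what the aggregated files contain.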
Different levels
from Whole studies to Complex types
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
Case study: Back to Workflows
Workflow description
Tool description
EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO
CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
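The priority rule for labels can be expressed as a simple fallback chain. This is a sketch of the behaviour as described in the text, not CWL Viewer's actual code; the function name and the example step and tool dicts are hypothetical.

```python
def display_name(step_id: str, step: dict, tool: dict) -> str:
    """Step-level label wins, then the tool's own label, then the identifier."""
    return step.get("label") or tool.get("label") or step_id


# Illustrative step and tool descriptions.
sort_tool = {"class": "CommandLineTool", "label": "Sort records"}
print(display_name("sort_by_name", {"label": "Sort reads by name"}, sort_tool))
print(display_name("sort_by_name", {}, sort_tool))
print(display_name("sort_by_name", {}, {}))
```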
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL
shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL
step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat
shouldHaveFormat cwlFile dctformat ( iana | edam )
iana IRI
^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf
lthttpedamontologyorgformat_1915gt
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
Info for Constraints arebull Embedded in a specific format
ndash Extractconvert from domain-specific formats
bull Embedded in annotation resourcesndash Use existing schemaorg
annotations
bull Need to be acquired ndash eg URI look-ups (ORCID -gt author
name)
bull Custom amp hardcoded namespacesndash Pre-declare ontologies
ndash Add derived annotations post-processing
RDF must already be in a single graph
Canrsquot check if resource exists (eg 404)
Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)
Canrsquot apply nested RDF shapes to Linked Data resources
Canrsquot say ldquoMust be term from any resolvable ontology
Canrsquot check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
Domain specific
bull ldquoMust have a workflow that analyses next-gen sequencing datardquo
bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo
bull ldquoAll required data files must be providedrdquo
bull ldquoGeneric names should be avoidedrdquo
General Tools that do their best at unpacking and handing off
Did anyone take any notice
Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool
Standardised packing of Systems Biology models
European Space Agency RO LibraryEverest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Heath Learning Systems
Asthma Research e-Labsharing and computing statistical cohort studies
Precision medicineNGS pipelines regulation
Did anyone take any notice httpwwwyoutubecomwatchv=p-W4iLjLTrQamplist=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner Chair STMFuture Labs Committee CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT January 2014 Lorenz Centre Leiden
Ted SlaterYARC OpenBEL
A trendhellip Using JSON(-LD) + schemaorg
httpsdokieli
httpslinkedresearchorg
Manifest Schemaorg JSON-LD RDFArchive targz
Reproducible Document Stack project
eLife Substance and Stencila
BagIT data profile + schemaorg JSON-LD annotations
We should have called this ldquoResearch Objectsrdquo Donrsquot be too clever about your titles
Combining ISA-based Research Objects with nanopublicatiionsComplementary approaches
Take-up Analogy Start Ups
Community
Driver
ToolsEasy to make
Hard to consume
Workflows
Reproducibility
Portability between
platforms
Platform amp user buy-in from the get-go
Passionate dedicated leadership
Stan
dard
s
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: weighted dependency graph with contributors Alice, Charlie and Bob]
Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI http://doi.org/10.5334/jors.by
D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software v2(1), e20, pp 1-4, 2014. DOI 10.5334/jors.be
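The idea of weighted, networked credit maps can be sketched as a small recursive calculation. This is an illustrative reading of transitive credit rather than the published method: each product carries a credit map whose weights sum to 1, and any entry that is itself a product is expanded through its own map. The products, people and weights below are invented for the example, and the dependency graph is assumed to be acyclic.

```python
def transitive_credit(credit_maps, product):
    """Resolve a product's credit map down to people.

    credit_maps maps a product name to {contributor_or_product: weight},
    with the weights of each map summing to 1. Any key that is itself a
    product is expanded recursively (acyclic graph assumed).
    """
    totals = {}
    for name, weight in credit_maps[product].items():
        if name in credit_maps:  # a component with its own credit map
            for person, share in transitive_credit(credit_maps, name).items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:  # a person
            totals[name] = totals.get(name, 0.0) + weight
    return totals

# Hypothetical credit maps: a paper built on a workflow and a dataset.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 1.0},
    "dataset": {"Charlie": 0.5, "Alice": 0.5},
}
credit = transitive_credit(credit_maps, "paper")
print(credit)
```

Because every map sums to 1, the resolved shares for the paper also sum to 1, so indirect contributors such as Bob and Charlie receive fractional credit without inflating the total.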
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK, Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
identifiers.org
PROV
JSON
manifest.json
https://doi.org/10.1109/BigData.2016.7840618
The manifest ties everything together
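As a rough illustration of how a manifest can tie a package together, the sketch below zips a handful of resources alongside a JSON manifest that lists them. The layout is loosely modelled on the Research Object Bundle idea; the `.ro/manifest.json` path, the context URL and the `aggregates` key are written from memory and should be checked against the specification before being relied on. The function name and example files are hypothetical.

```python
import io
import json
import zipfile

def build_bundle(resources):
    """Pack resources (path -> bytes) plus a manifest into an in-memory ZIP."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],  # assumed context URL
        "id": "/",
        # The manifest enumerates every aggregated resource.
        "aggregates": [{"uri": "/" + path} for path in sorted(resources)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        for path, content in resources.items():
            bundle.writestr(path, content)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()

bundle_bytes = build_bundle({
    "workflow.cwl": b"cwlVersion: v1.0\n",
    "data/input.txt": b"hello",
})
```

A consumer only needs to read the manifest to discover what the archive contains, which is the sense in which it "ties everything together".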
Case study: Back to Workflows
Workflow description, Tool description
EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org
Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.
Based on wf4ever wfdesc
• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations
Download as a Research Object Bundle
Over an active GitHub entry for an actively developing workflow
permalink to snapshot the GitHub entry and RO identifier
Common Workflow Language Viewer
CWL files packaged in a RO: CWL RO + added richness
Lift out parts into the manifest
Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.
Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.
Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
Format Specification
The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types use the IANA media type list with $namespaces iana: https://www.iana.org/assignments/media-types/, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.
Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.
https://view.commonwl.org/about
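Several of the practices above are mechanical enough to lint. The sketch below assumes a CWL document already loaded into a Python dictionary with `inputs` and `outputs` written in map form; the `lint_cwl` function and the `GENERIC_IDS` set are hypothetical helpers, not part of CWL Viewer or any CWL tooling.

```python
# Illustrative list of identifiers considered generic.
GENERIC_IDS = {"result", "input", "output", "in", "out"}

def lint_cwl(doc):
    """Return best-practice warnings for a CWL tool or workflow given as a dict."""
    warnings = []
    if not doc.get("label"):
        warnings.append("missing top-level label")
    if not doc.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for ident, spec in doc.get(section, {}).items():
            if ident.lower() in GENERIC_IDS:
                warnings.append(f"{section}.{ident}: generic identifier")
            is_file = isinstance(spec, dict) and spec.get("type") == "File"
            if is_file and "format" not in spec:
                warnings.append(f"{section}.{ident}: File without format")
    return warnings

# Hypothetical tool description, following the IANA namespace advice.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "label": "Count lines",
    "$namespaces": {"iana": "https://www.iana.org/assignments/media-types/"},
    "inputs": {"text_file": {"type": "File", "format": "iana:text/plain"}},
    "outputs": {"result": {"type": "File"}},
}
for warning in lint_cwl(tool):
    print(warning)
```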
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing not Graph Link Following
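What subject-predicate-object testing means in practice can be shown without a ShEx engine. The sketch below is plain Python over triples held as tuples, not ShEx itself: it checks that each file has a `dct:format` whose IRI matches an IANA or EDAM prefix pattern. The `ex:` identifiers and the helper name are invented for the example, and the EDAM pattern is relaxed to accept both http and https.

```python
import re

DCT_FORMAT = "http://purl.org/dc/terms/format"
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/")
EDAM = re.compile(r"^https?://edamontology\.org/format_")

def files_with_bad_format(triples, files):
    """Return the files lacking a dct:format that matches the IANA or EDAM pattern."""
    bad = []
    for f in files:
        formats = [o for s, p, o in triples if s == f and p == DCT_FORMAT]
        if not any(IANA.match(o) or EDAM.match(o) for o in formats):
            bad.append(f)
    return bad

# Hypothetical triples; the last format IRI is not a real EDAM term.
triples = [
    ("ex:a", DCT_FORMAT, "http://edamontology.org/format_1915"),
    ("ex:b", DCT_FORMAT, "https://example.org/made-up"),
    ("ex:c", DCT_FORMAT, "http://edamontology.org/format_does_not_exist"),
]
print(files_with_bad_format(triples, ["ex:a", "ex:b", "ex:c"]))
```

Note that `ex:c` passes: the check only inspects the shape of the IRI in the graph it was given and never follows the link to see whether the term exists.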
Info for Constraints is
• Embedded in a specific format
– Extract/convert from domain-specific formats
• Embedded in annotation resources
– Use existing schema.org annotations
• Need to be acquired
– e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
– Pre-declare ontologies
– Add derived annotations post-processing
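A pre-processing step of the kind listed above might merge the separate annotation graphs and add derived triples before any shape is tested. The sketch below is only an illustration: triples are plain tuples, the `schema:` property names are written as strings, and `orcid_names` is a local lookup table standing in for a real ORCID look-up, which is not performed here.

```python
def preprocess(graphs, orcid_names):
    """Merge several triple sets into one and add derived author-name triples."""
    merged = set().union(*graphs)
    # Every subject or object that appears in the lookup table gets a name triple.
    nodes = {s for s, _, _ in merged} | {o for _, _, o in merged}
    derived = {
        (node, "schema:name", orcid_names[node])
        for node in nodes
        if node in orcid_names
    }
    return merged | derived

# Placeholder identifier and name, for illustration only.
AUTHOR = "https://orcid.org/0000-0000-0000-0000"
workflow_graph = {("ro:workflow", "schema:author", AUTHOR)}
data_graph = {("ro:data", "schema:author", AUTHOR)}
single_graph = preprocess([workflow_graph, data_graph], {AUTHOR: "Example Author"})
```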
RDF must already be in a single graph
Can’t check if resource exists (e.g. 404)
Can’t test format/representation of resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here
Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred
Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided
Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana
httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format
Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more
httpsviewcommonwlorgabout
shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL
shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL
step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat
shouldHaveFormat cwl:File dct:format ( iana | edam )
iana IRI ^https://www.iana.org/assignments/media-types/
edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
Capturing Common Workflow Language Profile as ShEx
ShEx is SPO testing, not Graph Link Following
Info for constraints is:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Needs to be acquired, e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations in post-processing
RDF must already be in a single graph
Can’t check if a resource exists (e.g. 404)
Can’t test the format/representation of a resource (“is it actually an Excel file?”)
Can’t apply nested RDF shapes to Linked Data resources
Can’t say “Must be a term from any resolvable ontology”
Can’t check the format is actually in the EDAM ontology
RDF Shape that indicates to follow links
RO pre-processing to merge to single graph
Bespoke validators unpackers to iterate over the RO
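As a rough illustration of the pre-processing step, the sketch below merges triples from several resources into one graph and then applies a prefix test in the spirit of the shouldHaveFormat shape. It assumes triples are plain Python tuples rather than a real RDF library; the ro:/ resource names, the function names and the choice of the http://edamontology.org/format_ prefix are assumptions made for the example.

```python
IANA = "https://www.iana.org/assignments/media-types/"
EDAM = "http://edamontology.org/format_"  # assumed prefix for EDAM format IRIs
DCT_FORMAT = "http://purl.org/dc/terms/format"


def merge_graphs(*graphs):
    """Merge several iterables of (subject, predicate, object) into one set."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def bad_formats(graph):
    """Return (subject, format) pairs whose format is neither IANA nor EDAM."""
    return sorted(
        (s, o)
        for s, p, o in graph
        if p == DCT_FORMAT and not o.startswith((IANA, EDAM))
    )


# Hypothetical triples from a manifest and a separate annotation resource.
manifest = [
    ("ro:/data/flux.tsv", DCT_FORMAT, IANA + "text/tab-separated-values"),
]
annotations = [
    ("ro:/models/model.xml", DCT_FORMAT, EDAM + "1915"),
    ("ro:/notes.doc", DCT_FORMAT, "application/msword"),
]

graph = merge_graphs(manifest, annotations)
problems = bad_formats(graph)
print(problems)
```

Like the shape itself, this is only a string test on the IRI: it does not resolve the link, so it cannot tell whether the resource exists or whether the term is really in the EDAM ontology.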
Domain specific
• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”
General tools that do their best at unpacking and handing off
Did anyone take any notice?
Research Object Bundles for data releases, as if they were software
Dataset “build” tool
Standardised packing of Systems Biology models
European Space Agency RO Library
Everest Project
Metagenomics pipelines and LARGE datasets
U Rostock
ISI USC
Public Health Learning Systems
Asthma Research e-Lab: sharing and computing statistical cohort studies
Precision medicine: NGS pipelines regulation
Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012
Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)
FAIRPORT, January 2014, Lorentz Center, Leiden
Ted Slater
YARC, OpenBEL
A trend… Using JSON(-LD) + schema.org
https://dokie.li
https://linkedresearch.org
Manifest: schema.org JSON-LD RDF
Archive: tar.gz
Reproducible Document Stack project
eLife, Substance and Stencila
BagIt data profile + schema.org JSON-LD annotations
We should have called this “Research Objects”. Don’t be too clever about your titles.
Combining ISA-based Research Objects with nanopublications: complementary approaches
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation attribution
Tamper proofing
• blockchain, ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: graph with numeric labels and contributors Alice, Charlie and Bob]
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
“Provlets”
Granularity, Atomicity, Aggregation
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
  • contributors + components
2. Weighted contribution
3. Networked credit maps
  • Travel with the contriponents
Transitive Credit contribution [Dan Katz and Arfon Smith]
Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software, 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by
Katz, D.S. (2014). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, 2(1), e20, pp. 1-4. DOI: 10.5334/jors.be
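The idea of weighted contributions travelling through networked credit maps can be sketched in a few lines. This is an illustrative reading of transitive credit, not the authors' implementation: each product lists fractional shares for its contributors, a contributor that is itself a product (a component) passes its share on to its own contributors, and the maps are assumed to be acyclic. The product names and all weights below are invented for the example.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Propagate fractional credit from a product down to people.

    credit_maps maps a product name to {contributor: share}. A contributor
    that also appears as a key is treated as a component and expanded.
    Assumes the credit maps contain no cycles.
    """
    totals = {} if totals is None else totals
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # A component: hand its share on to its own contributors.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps; shares within each map sum to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
    "dataset": {"Charlie": 1.0},
}

credit = transitive_credit("paper", credit_maps)
print(credit)
```

Here Charlie receives credit for the paper only indirectly, through the workflow and the dataset, which is the kind of indirect contribution the credit maps are meant to surface.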
• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution
The Rhetoric of Research Objects
researchobject.org
Reports of the death of the scientific paper are greatly exaggerated
All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK Bioschemas
http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft,
Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith
Take-up Analogy: Start Ups
Community
Driver
Tools
Easy to make
Hard to consume
Workflows
Reproducibility
Portability between platforms
Platform & user buy-in from the get-go
Passionate dedicated leadership
Standards
Open Questions
Stewardship
• owners, sites, authors
Spanning
• platforms, researchers
Lifecycle
• composition, forking…
Governance
Credit
• micro-credit & citation propagation, attribution
Tamper proofing
• blockchain, Ethereum
Maintenance
• of evolving content
• incrementality & degradation
Manifests
• profile & template making
• auto manufacture
Who gets credit for what? Using Provenance for Credit Mapping
[Paolo Missier]
[Figure: numbered graph with contributors Alice, Charlie and Bob]