The Rhetoric of Research Objects. Professor Carole Goble, The University of Manchester, UK. carole.goble@manchester.ac.uk, researchobject.org. ISWC2017 SemSci Workshop, Vienna, 21 October 2017



Acknowledgements

Stian Soiland-Reyes, Catarina Martins

Scholarly Communication

“The art of discourse, wherein a writer or speaker strives to inform, persuade or motivate particular audiences in specific situations”

https://en.wikipedia.org/wiki/Rhetoric

Rhetoric

Papers should describe the results and provide a clear enough method to allow successful repetition and extension
• announce a result
• convince readers the result is correct

Virtual Witnessing

Accessible Reproducible Research, Science, 22 January 2010, Vol. 327, no. 5964, pp. 415-416, DOI: 10.1126/science.1179653

Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985), Shapin and Schaffer

From Manuscripts to Research Objects

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research”, 1995

Datasets, data collections
Standard operating procedures
Software, algorithms
Configurations, tools and apps, services
Codes, code libraries
Workflows, scripts
System software, infrastructure, compilers, hardware

Research Components in a study and backing an article are Many and Various

workflow commons

Collection in a Data Catalogue

Third party remote web services or command line tools

Workflows of local or remotely executed codes

16 data files (kinetic, flux inhibition, run-out)

19 models (kinetics, validation)

13 SOPs

3 studies (model analysis, construction, validation)

24 assays/analyses (simulations, model characterisations)

Penkler, G., du Toit, F., Adams, W., Rautenbach, M., Palm, D. C., van Niekerk, D. D. and Snoep, J. L. (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J, 282: 1481–1511. doi:10.1111/febs.13237

Research Components in a study and backing an article are Many and Various

Investigation

Study Analysis

Data

Model

SOP (Assay)

https://fairdomhub.org/investigations/56

Systems Biology Commons

Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…

Spans repository silos
Regardless of location
In house…
External - subject specific, general

Structured organisation

Retaining context over fragmentation

A Research Object bundles and relates digital resources of a scientific experiment/investigation + context

• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources, to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobject.org

Container

Research Object in a nutshell

Packaging frameworks: Zip Archives, BagIt, Docker images

Platforms: FAIRDOM, myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects: exchange, portability and maintenance

components packaged into various containers

ISA-TAB
checksum

RO Commons and Currency

Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek

https://doi.org/10.15490/seek.1.investigation.56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies…

Reproducibility
Preservation

Release
Exchange

Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, DOI: 10.1007/978-3-642-37186-8_1

FAIR Commons
Currency of Scholarship

Interpretation, Comparison
Preservation, Repair

Portability, Reuse
Execution

Active Research
Evolving codes

New data

Software Release
Executable Papers

Scientific Instruments
Machines

Interpretation, Comparison
Portability, Reuse

Credit, Citation

22/10/2017

An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, […] “version 1.0”.

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.

Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec. of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory

computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance Templates
• Trajectory mapping

workflow engine

Workflow Run Provenance

Inputs, Outputs

Intermediates

Parameters, Configs

Checksum

Community ontologies & formats

Narrative

Linked Data
JSON-LD
RDF

EDAM

Errors

tools

Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics, doi:10.1016/j.websem.2015.01.003

Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study, J Biomedical Semantics 5:41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al., Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results, biorxiv.org 2017, https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision Medicine
NGS workflow exchange
FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)

Manifest Construction

Manifest

Identification: to locate things
Aggregates: to link things together
Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards


Manifest Construction

Identification: to locate and resolve things
RRI, DOI, URI, ORCID

Aggregates: to link things together
Open Archives Initiative Object Reuse and Exchange (OAI-ORE)

Annotations: about things & their relationships
W3C Web Annotation Vocabulary

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

http://www.researchobject.org/specifications
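
The three manifest roles on these slides (identification, aggregation, annotation) inside a structured ZIP file can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the `mimetype` value, the `.ro/manifest.json` location, the context URL and the manifest keys follow my reading of the RO Bundle specification and should be checked against researchobject.org. The file names, the ORCID and the example.org URL are hypothetical.

```python
import io
import json
import zipfile

# Assumed from the RO Bundle specification; verify before relying on these values.
RO_MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_manifest(resources, annotations):
    """Return a manifest dict listing aggregated resources and annotations."""
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": resources,      # what the bundle links together
        "annotations": annotations,   # statements about those resources
    }


def write_bundle(files, manifest):
    """Pack files and the manifest into an in-memory ZIP and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # ePub OCF convention: 'mimetype' is the first entry, stored uncompressed.
        bundle.writestr("mimetype", RO_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for path, content in files.items():
            bundle.writestr(path, content, compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Hypothetical content: one embedded data file, one external reference, one annotation.
files = {
    "data/flux.csv": "time,flux\n0,1.2\n",
    ".ro/annotations/flux-notes.txt": "Measured at 37 C.\n",
}
manifest = build_manifest(
    resources=[
        {"uri": "/data/flux.csv", "mediatype": "text/csv",
         "createdBy": {"name": "Jane Doe",
                       "orcid": "https://orcid.org/0000-0000-0000-0000"}},
        {"uri": "https://example.org/external/model.xml"},
    ],
    annotations=[{"about": "/data/flux.csv",
                  "content": "/.ro/annotations/flux-notes.txt"}],
)
bundle_bytes = write_bundle(files, manifest)
```

Embedded files and remote references sit side by side in `aggregates`, which is how a bundle can span repository silos without copying everything into the archive.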

Manifest

Artist’s Impression

The real manifest

• A Manifest for 27 A4 pages…

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from
its evolution
what else is needed
what should be there for types

Manifest

Project, Lab Specific

Community-based Types
Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

Data repository

Training Resource

Bioschemas

Search engines
Registries
Data Aggregators

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity Off the Shelf tools
App eco-system
Lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events
Laboratory protocols
Workflows and Tools

See Alasdair Gray’s Poster
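
Publishing metadata "without APIs or special feeds" comes down to embedding schema.org JSON-LD in the page a repository already serves. The sketch below builds such a description for a dataset and wraps it in a script tag. It uses only generic schema.org `Dataset` properties and makes no claim to satisfy any particular Bioschemas profile. The title reuses the placeholder from the earlier slide; the URL, identifier and keywords are hypothetical.

```python
import json


def dataset_markup(name, description, url, identifier, keywords):
    """Describe a dataset with generic schema.org properties."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
    }


def as_script_tag(markup):
    """Wrap JSON-LD so it can be embedded in an HTML page for harvesters."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical record for illustration only.
markup = dataset_markup(
    name="My Investigation",
    description="Data files, models and SOPs backing a systems biology study.",
    url="https://example.org/investigations/1",
    identifier="https://doi.org/10.0000/example.1",
    keywords=["systems biology", "kinetic model"],
)
tag = as_script_tag(markup)
```

A search engine, registry or aggregator that understands schema.org can read the block straight from the HTML, which is what keeps the approach lightweight.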

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data, IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description
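
A checklist in the Minim sense is a set of graded requirements evaluated against a Research Object. The sketch below shows the idea in plain Python rather than in the Minim RDF vocabulary. The MUST/SHOULD/MAY grading is my reading of the model, and the class, function names and individual rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Requirement:
    level: str                    # "MUST", "SHOULD" or "MAY"
    description: str
    test: Callable[[dict], bool]  # True when the manifest satisfies the rule


def evaluate(manifest: dict, checklist: List[Requirement]) -> Dict[str, List[str]]:
    """Return the descriptions of failed requirements, grouped by level."""
    failures: Dict[str, List[str]] = {"MUST": [], "SHOULD": [], "MAY": []}
    for requirement in checklist:
        if not requirement.test(manifest):
            failures[requirement.level].append(requirement.description)
    return failures


def is_complete(failures: Dict[str, List[str]]) -> bool:
    """An object passes the checklist when no MUST requirement failed."""
    return not failures["MUST"]


# Hypothetical checklist for a data-release profile.
CHECKLIST = [
    Requirement("MUST", "at least one resource is aggregated",
                lambda m: len(m.get("aggregates", [])) > 0),
    Requirement("MUST", "every resource has a uri",
                lambda m: all("uri" in r for r in m.get("aggregates", []))),
    Requirement("SHOULD", "every resource names its creator",
                lambda m: all("createdBy" in r for r in m.get("aggregates", []))),
    Requirement("MAY", "at least one annotation is present",
                lambda m: len(m.get("annotations", [])) > 0),
]

example = {"aggregates": [{"uri": "/data/flux.csv"}], "annotations": []}
report = evaluate(example, CHECKLIST)
```

Keeping the rules as data, separate from the evaluator, is what lets a project, a lab or a community swap in its own profile without new tooling.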

Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. “starts with”

Check JSON Structure

Different levels, from Whole studies to Complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together
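
The three manifest checks named on this slide (JSON structure, URI patterns, cross-references between identifiers) are simple enough to sketch without an RDF toolchain. The code below is an illustration under stated assumptions: the manifest keys mirror an RO-Bundle-style `manifest.json`, and the list of allowed prefixes is a hypothetical policy, not something the slides specify.

```python
REQUIRED_KEYS = ("id", "aggregates")

# Hypothetical policy: local paths plus a few resolvable identifier schemes.
ALLOWED_PREFIXES = (
    "/",
    "https://doi.org/",
    "https://orcid.org/",
    "https://identifiers.org/",
)


def check_manifest(manifest):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []

    # 1. JSON structure: the keys a consumer relies on must be present.
    for key in REQUIRED_KEYS:
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI patterns: every aggregated URI starts with an allowed prefix.
    uris = [entry["uri"] for entry in manifest.get("aggregates", []) if "uri" in entry]
    for uri in uris:
        if not uri.startswith(ALLOWED_PREFIXES):
            problems.append(f"unexpected URI pattern: {uri}")

    # 3. Cross-references: annotations must be about an aggregated resource.
    known = set(uris)
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in known:
            problems.append(f"annotation about unknown resource: {about}")

    return problems


sample = {
    "id": "/",
    "aggregates": [{"uri": "/data/flux.csv"}, {"uri": "ftp://example.org/raw.dat"}],
    "annotations": [{"about": "/data/missing.csv", "content": "/notes.txt"}],
}
issues = check_manifest(sample)
```

None of these checks resolve the URIs or open the files, which is the same limitation the later slides note for shape-based validation.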

Case study: Back to Workflows

Workflow description
Tool description

EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO
CWL RO + added richness

Lift out parts into the manifest

Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings
Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings
If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.

Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as "result" or "input"/"output" should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification
The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of, and link to, the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx
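
The rules these shapes capture (a label, a doc string, conceptual identifiers, a recognised format on every File) can also be written as ordinary code over a parsed CWL document. The sketch below is not a ShEx validator and is not how CWL Viewer works. It assumes the CWL YAML has already been loaded into a dictionary with map-style `inputs` and `outputs`, and that format values have been expanded to full IRIs. The example workflow is hypothetical.

```python
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"
# Both schemes accepted, since the slide shows the EDAM prefix with http and https.
EDAM_PREFIXES = ("http://edamontology.org/format_", "https://edamontology.org/format_")
GENERIC_NAMES = {"result", "input", "output"}


def check_cwl(doc):
    """Return best-practice warnings for a CWL document given as a dict."""
    warnings = []
    if not doc.get("label"):
        warnings.append("missing label")
    if not doc.get("doc"):
        warnings.append("missing doc")
    for section in ("inputs", "outputs"):
        for name, param in doc.get(section, {}).items():
            if name in GENERIC_NAMES:
                warnings.append(f"{section}.{name}: generic identifier")
            if param.get("type") == "File":
                fmt = param.get("format", "")
                if not fmt.startswith((IANA_PREFIX,) + EDAM_PREFIXES):
                    warnings.append(f"{section}.{name}: File without IANA or EDAM format")
    return warnings


# Hypothetical workflow: labelled, but undocumented and with a vague output.
workflow = {
    "class": "Workflow",
    "label": "Align reads",
    "inputs": {
        "reads": {"type": "File", "format": "http://edamontology.org/format_1930"},
        "threads": {"type": "int"},
    },
    "outputs": {
        "result": {"type": "File"},
    },
}
cwl_warnings = check_cwl(workflow)
```

Like the shapes, this only inspects what is written in the document. It cannot tell whether an EDAM IRI actually exists in the ontology or whether a file really has the declared format.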

ShEx is SPO testing, not Graph Link Following

Info for Constraints are:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can’t check if resource exists (e.g. 404)

Can’t test format/representation of resource (“is it actually an Excel file?”)

Can’t apply nested RDF shapes to Linked Data resources

Can’t say “Must be term from any resolvable ontology”

Can’t check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators, unpackers to iterate over the RO

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software
Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library
Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine
NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater
YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD, RDF
Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications
Complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools
Easy to make
Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what?
Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: provenance graph with numeric labels, crediting Alice, Charlie and Bob]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

W3C PROV dependency graph

“Provlets”

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
• contributors + components

2. Weighted contribution

3. Networked Credit maps
• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz, D.S. & Smith, A.M. (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
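
Weighted contribution and networked credit maps can be illustrated with a small calculation. In the sketch below each product carries a credit map that splits a total of 1.0 between people and the other products it builds on. A person's transitive credit is the sum, over every path to them, of the products of the weights along that path. The product names and weights are hypothetical, the graph is assumed to be acyclic, and this is a reading of the transitive credit idea rather than code from the cited papers.

```python
def transitive_credit(credit_maps, product, weight=1.0, totals=None):
    """Accumulate each person's share of `product`, following dependencies.

    credit_maps maps a product name to {contributor: fraction}. A contributor
    that is itself a key of credit_maps is treated as a product and expanded.
    Assumes the dependency graph has no cycles.
    """
    if totals is None:
        totals = {}
    for contributor, fraction in credit_maps[product].items():
        share = weight * fraction
        if contributor in credit_maps:
            # Another product: pass this share on to its own contributors.
            transitive_credit(credit_maps, contributor, share, totals)
        else:
            # A person: record the credit reached along this path.
            totals[contributor] = totals.get(contributor, 0.0) + share
    return totals


# Hypothetical credit maps; each one sums to 1.0.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.3, "dataset": 0.2},
    "workflow": {"Bob": 0.75, "Alice": 0.25},
    "dataset": {"Charlie": 1.0},
}
credit = transitive_credit(credit_maps, "paper")
```

Here Alice's share of the paper is her direct half plus a quarter of the workflow's 0.3, Bob receives credit only through the workflow, and Charlie only through the dataset. The shares still sum to 1.0 because every credit map does.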

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects
researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 2: The Rhetoric of Research Objects

Acknowledgements

Stian Soiland-Reyes Catarina Martins

Scholarly Communication

ldquoThe art of discourse wherein a writer or speaker strives to inform persuade or motivate particular audiences in specific situationsrdquo

httpsenwikipediaorgwikiRhetoric

Rhetoric

papers should describe the results and provide a clear enough method to allow successful repetition and extensionbull announce a resultbull convince readers the

result is correct

Virtual Witnessing

Accessible Reproducible Research Science 22 January 2010 Vol 327 no 5964 pp 415-416 DOI 101126science1179653

Leviathan and the Air-Pump Hobbes Boyle and the Experimental Life (1985) Shapin and Schaffer

From Manuscripts to Research Objects

ldquoAn article about computational science in a scientific publication is not the scholarship itself it is merely advertising of the scholarship The actual scholarship is the complete software development environment [the complete data] and the complete set of instructions which generated the figuresrdquo David Donoho ldquoWavelab and Reproducible Researchrdquo 1995

Datasets Data collectionsStandard operating proceduresSoftware algorithmsConfigurations Tools and apps servicesCodes code librariesWorkflows scriptsSystem software Infrastructure Compilers hardware

Research Components in a study and backing an article are Many and Various

workflow commons

Collection in a Data Catalogue

Third party remote web services or command line tools

Workflows of local or remotely executed codes

16 datafiles (kinetic flux inhibition runout)

19 models (kinetics validation)

13 SOPs

3 studies (model analysis construction validation)

24 assaysanalyses (simulations model characterisations)

Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237

Research Components in a study and backing an article are Many and Various

Investigation

Study Analysis

Data

Model

SOP(Assay)

httpsfairdomhuborginvestigations56

Systems Biology Commons

Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip

Spans repository silos Regardless of locationIn househellipExternal - subject specific general

Structured organisation

Retaining contextover fragmentation

A Research Object bundles and relates digital resources of a scientific

experimentinvestigation + context

bull Data used and results produced in experimental study

bull Methods employed to produce and analyse that data

bull Provenance and settings for the experiments

bull People involved in the investigation

bull Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobjectorg

Container

Research Object in a nutshell

Packaging FrameworksZip Archives BagIt Docker images

Platforms FAIRDOM myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects exchange portability and maintenance

components packaged into

various containers

ISA-TABchecksum

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces iana: https://www.iana.org/assignments/media-types/ (for example iana:text/plain, iana:text/tab-separated-values).

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
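The label, doc, identifier and format recommendations above can be checked mechanically. The following is a minimal sketch, assuming the CWL description has already been parsed into a Python dict using the map form of inputs and outputs. The function name `check_best_practices`, the `GENERIC_NAMES` set and the example tool are hypothetical illustrations, not part of CWL Viewer.

```python
# Assumption: identifiers treated as "generic and uninformative" for this sketch.
GENERIC_NAMES = {"result", "input", "output"}


def check_best_practices(doc):
    """Return a list of warnings for one parsed CWL tool or workflow (a dict)."""
    warnings = []
    # Label Strings / Doc Strings: both should be present at the top level.
    if not doc.get("label"):
        warnings.append("missing top-level label")
    if not doc.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for ident, param in doc.get(section, {}).items():
            # Conceptual Identifiers: flag generic names.
            if ident.lower() in GENERIC_NAMES:
                warnings.append(f"{section}/{ident}: generic identifier")
            # Format Specification: every File parameter should declare a format.
            # A parameter may be a shorthand type string or a dict of fields.
            ptype = param if isinstance(param, str) else param.get("type")
            fmt = None if isinstance(param, str) else param.get("format")
            if ptype == "File" and not fmt:
                warnings.append(f"{section}/{ident}: File without format")
    return warnings


# Hypothetical tool description used only to exercise the checks.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "label": "Sort a tab-separated table",
    "inputs": {
        "unsorted_table": {"type": "File", "format": "iana:text/tab-separated-values"},
        "input": {"type": "File"},
    },
    "outputs": {
        "sorted_table": {"type": "File", "format": "iana:text/tab-separated-values"},
    },
}

report = check_best_practices(tool)
for warning in report:
    print(warning)
```

The example tool has a label but no doc string, and one input that is both generically named and missing a format, so three warnings are reported.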

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following
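To illustrate what "SPO testing" means in practice, here is a minimal sketch of the shouldHaveFormat constraint evaluated over subject-predicate-object triples held in memory. It is plain Python rather than a ShEx validator, and the node names (`ex:out1`, `ex:out2`) and the function `formats_conform` are hypothetical. Note that the check only inspects triples already present in the graph and never dereferences an IRI.

```python
# IRI prefixes taken from the profile: IANA media types and EDAM formats.
IANA = "https://www.iana.org/assignments/media-types/"
EDAM = "https://edamontology.org/format_"


def formats_conform(triples):
    """Return the cwl:File nodes whose dct:format is missing or outside IANA/EDAM."""
    files = {s for s, p, o in triples if p == "rdf:type" and o == "cwl:File"}
    failures = []
    for node in sorted(files):
        formats = [o for s, p, o in triples if s == node and p == "dct:format"]
        # Pure pattern test on the object IRI: no look-up, no link following.
        if not formats or not all(f.startswith((IANA, EDAM)) for f in formats):
            failures.append(node)
    return failures


# Hypothetical graph: one well-described file and one with an ad hoc format.
graph = [
    ("ex:out1", "rdf:type", "cwl:File"),
    ("ex:out1", "dct:format", IANA + "text/plain"),
    ("ex:out2", "rdf:type", "cwl:File"),
    ("ex:out2", "dct:format", "xlsx"),
]

bad_nodes = formats_conform(graph)
print(bad_nodes)
```

A prefix test like this can say that a format IRI looks like an EDAM term, but not that the term actually exists in the ontology.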

Info for constraints is:

• Embedded in a specific format
  - Extract/convert from domain-specific formats
• Embedded in annotation resources
  - Use existing schema.org annotations
• Need to be acquired
  - e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators/unpackers to iterate over the RO
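The pre-processing step can be sketched as follows: merge the separate graphs found in an RO into one, then add derived annotations (here, author names for ORCID identifiers) before any shape validation runs. This is a minimal sketch using sets of triples. The function names and the in-memory `orcid_names` table are hypothetical stand-ins for a real URI look-up, and the ORCID shown is the fictitious example identifier published by ORCID.

```python
def merge_graphs(*graphs):
    """Union several graphs (iterables of triples) into a single set of triples."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def add_author_names(graph, orcid_names):
    """Add a schema:name triple for every schema:author whose ORCID is known."""
    derived = {
        (author, "schema:name", orcid_names[author])
        for _, predicate, author in graph
        if predicate == "schema:author" and author in orcid_names
    }
    return graph | derived


# Hypothetical RO content: a manifest graph plus a separate annotation graph.
manifest_graph = [("ex:workflow", "rdf:type", "cwl:Workflow")]
annotation_graph = [
    ("ex:workflow", "schema:author", "https://orcid.org/0000-0002-1825-0097"),
]
# Stand-in for a look-up service; no network access is made in this sketch.
orcid_names = {"https://orcid.org/0000-0002-1825-0097": "Josiah Carberry"}

single_graph = add_author_names(
    merge_graphs(manifest_graph, annotation_graph), orcid_names
)
for triple in sorted(single_graph):
    print(triple)
```

After this step a shape can require an author name directly, because the information that had to be acquired elsewhere is now part of the single graph being tested.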

Domain specific

• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases as if they were software

Dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library

Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner, Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend... Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org JSON-LD RDF

Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations
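As a rough illustration of this trend, the sketch below assembles a BagIt-style bag whose payload is described by a schema.org JSON-LD annotation and packs it as a tar.gz archive, entirely in memory. The bag declaration and checksum manifest follow the BagIt layout. The location `metadata/manifest.jsonld`, the bag name `example-bag`, the function names and the payload file are assumptions made for this example, not the layout of any particular project named above.

```python
import hashlib
import io
import json
import tarfile


def build_bag(payload, metadata):
    """Map bag-relative paths to bytes for a BagIt-style bag with JSON-LD metadata."""
    bag = {"bagit.txt": b"BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"}
    manifest_lines = []
    for path, content in sorted(payload.items()):
        bag[f"data/{path}"] = content  # payload files live under data/
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{path}\n")
    bag["manifest-sha256.txt"] = "".join(manifest_lines).encode()
    # Assumption: the JSON-LD annotation is kept in a metadata/ tag directory.
    bag["metadata/manifest.jsonld"] = json.dumps(metadata, indent=2).encode()
    return bag


def to_targz(bag, root="example-bag"):
    """Serialise the bag as a gzip-compressed tar archive and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, content in sorted(bag.items()):
            info = tarfile.TarInfo(name=f"{root}/{path}")
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


# Hypothetical payload and its schema.org description.
payload = {"results.tsv": b"gene\tcount\nabc\t3\n"}
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example data release",
    "hasPart": [
        {
            "@type": "MediaObject",
            "contentUrl": "data/results.tsv",
            "encodingFormat": "text/tab-separated-values",
        }
    ],
}

bag = build_bag(payload, metadata)
archive = to_targz(bag)
print(sorted(bag))
```

Keeping the checksum manifest and the JSON-LD annotation as ordinary files inside the archive means commodity tools can unpack and inspect the bag without any RO-specific software.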

We should have called this "Research Objects". Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start-Ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking...

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: W3C PROV dependency graph with numeric labels; contributors Alice, Bob and Charlie]

"Provlets"

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

Granularity, Atomicity, Aggregation

Transitive Credit contribution [Dan Katz and Arfon Smith]

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. "Contriponents"
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
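The idea of weighted, networked credit maps can be sketched in a few lines. In this illustration each product carries a credit map whose weights sum to 1, and a contributor's share of a product is accumulated by multiplying weights along every path through the components it uses. The function name, the products and the weights are invented for the example, and the recursion assumes the credit maps contain no cycles.

```python
def transitive_credit(product, credit_maps, share=1.0, totals=None):
    """Accumulate each person's fractional credit for `product`.

    credit_maps maps a product to {contributor or component: weight}.
    Entries that have their own credit map are treated as components
    and their share is passed on to their contributors.
    """
    totals = {} if totals is None else totals
    for name, weight in credit_maps[product].items():
        if name in credit_maps:
            # A component: propagate this fraction of the credit downstream.
            transitive_credit(name, credit_maps, share * weight, totals)
        else:
            # A person: record the fraction earned along this path.
            totals[name] = totals.get(name, 0.0) + share * weight
    return totals


# Hypothetical credit maps; weights are illustrative only.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
    "dataset": {"Charlie": 1.0},
}

credit = transitive_credit("paper", credit_maps)
print(credit)
```

Here Charlie never appears in the paper's own credit map, yet receives 0.125 through the workflow and 0.25 through the dataset, which is the indirect contribution that travels with the components.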

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team

Colleagues in Manchester's Information Management Group

ELIXIR-UK Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith



Page 4: The Rhetoric of Research Objects

From Manuscripts to Research Objects

ldquoAn article about computational science in a scientific publication is not the scholarship itself it is merely advertising of the scholarship The actual scholarship is the complete software development environment [the complete data] and the complete set of instructions which generated the figuresrdquo David Donoho ldquoWavelab and Reproducible Researchrdquo 1995

Datasets Data collectionsStandard operating proceduresSoftware algorithmsConfigurations Tools and apps servicesCodes code librariesWorkflows scriptsSystem software Infrastructure Compilers hardware

Research Components in a study and backing an article are Many and Various

workflow commons

Collection in a Data Catalogue

Third party remote web services or command line tools

Workflows of local or remotely executed codes

16 datafiles (kinetic flux inhibition runout)

19 models (kinetics validation)

13 SOPs

3 studies (model analysis construction validation)

24 assaysanalyses (simulations model characterisations)

Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237

Research Components in a study and backing an article are Many and Various

Investigation

Study Analysis

Data

Model

SOP(Assay)

httpsfairdomhuborginvestigations56

Systems Biology Commons

Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip

Spans repository silos Regardless of locationIn househellipExternal - subject specific general

Structured organisation

Retaining contextover fragmentation

A Research Object bundles and relates digital resources of a scientific

experimentinvestigation + context

bull Data used and results produced in experimental study

bull Methods employed to produce and analyse that data

bull Provenance and settings for the experiments

bull People involved in the investigation

bull Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobjectorg

Container

Research Object in a nutshell

Packaging FrameworksZip Archives BagIt Docker images

Platforms FAIRDOM myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects exchange portability and maintenance

components packaged into

various containers

ISA-TABchecksum

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there

Provenance: where it came from

Versioning: its evolution

Dependencies: what else is needed

Manifest

Containers are Many and Various

Pre-packaged Docker images, each containing a bioinformatics tool and a standardised interface through which data and parameters are passed

Repository of >2700 bioinformatics packages, ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)

Adobe Universal Container Format (UCF)

Manifest Construction

Identification: to locate and resolve things (RRI, DOI, URI, ORCID)

Aggregates: to link things together (Open Archives Initiative Object Exchange and Reuse)

Annotations: about things & their relationships (W3C Web Annotation Vocabulary)

Structured ZIP file based on the ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references

• attribution and provenance of each resource, for credit and right versions

• any part of the RO to be further described textually or semantically

• extensibility point for community-driven standards

http://www.researchobject.org/specifications
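The three manifest roles above (identification, aggregation, annotation) and the structured ZIP container can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the media type, context URI and `.ro/manifest.json` path are assumptions taken from the Research Object Bundle specification and should be checked against researchobject.org, and the file names, ORCID and URLs in the example data are hypothetical.

```python
import io
import json
import zipfile

# Assumed values from the RO Bundle specification; verify before relying on them.
RO_MIMETYPE = "application/vnd.wf4ever.robundle+zip"
RO_CONTEXT = "https://w3id.org/bundle/context"
MANIFEST_PATH = ".ro/manifest.json"


def build_manifest(aggregated, annotations):
    """Build a manifest dict covering identification, aggregation and annotation."""
    return {
        "@context": [RO_CONTEXT],
        "id": "/",
        # Aggregates: every resource, embedded or external, is listed by URI.
        "aggregates": [{"uri": uri, **details} for uri, details in aggregated.items()],
        # Annotations: statements about a resource, held in another resource.
        "annotations": [
            {"about": about, "content": content} for about, content in annotations
        ],
    }


def write_bundle(target, manifest, files):
    """Write a structured ZIP: uncompressed mimetype first, payload, then manifest."""
    with zipfile.ZipFile(target, "w") as bundle:
        bundle.writestr("mimetype", RO_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            bundle.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(
            MANIFEST_PATH,
            json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )


# Hypothetical example content: one embedded file and one external reference.
aggregated = {
    "/data/results.csv": {
        "createdBy": {"uri": "https://orcid.org/0000-0000-0000-0000", "name": "Alice"}
    },
    "https://example.org/external/dataset": {},
}
annotations = [("/data/results.csv", "https://example.org/notes/results")]

manifest = build_manifest(aggregated, annotations)
buffer = io.BytesIO()
write_bundle(buffer, manifest, {"data/results.csv": "x,y\n1,2\n"})
```

Writing to an in-memory buffer keeps the sketch side-effect free; passing a file path to `write_bundle` instead would produce a bundle on disk.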

Manifest

Artist's Impression

The real manifest

• A Manifest for 27 A4 pages …

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from

its evolution

what else is needed

what should be there for types

Manifest

Project / Lab Specific

Community-based Types / Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

Data repository

Training Resource

Bioschemas

Search engines

Registries

Data Aggregators

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity, off-the-shelf tools

App eco-system

Lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events

Laboratory protocols

Workflows and Tools

See Alasdair Gray's Poster

13 public datasets marked up, including the GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations

• Validate the profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction: check cross-reference constraints on identifiers

Check URI patterns, e.g. "starts with "

Check JSON structure

Different levels: from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618
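The three kinds of check listed above (JSON structure, URI patterns, cross-reference constraints on identifiers) can be illustrated with a small hand-written validator. This is a sketch of the bespoke-tooling style, not a ShEx or SHACL validator: the required keys, the ORCID pattern and the rule that annotations must point at aggregated resources are assumptions chosen for illustration, and all URIs in the sample manifests are hypothetical.

```python
import re

# Assumed minimal structure for a manifest; a real profile would define this.
REQUIRED_KEYS = ("@context", "aggregates")
# URI pattern checks of the "starts with" kind.
ORCID_PATTERN = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
DOI_PREFIX = "https://doi.org/10."


def check_manifest(manifest):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []

    # 1. JSON structure: required top-level keys are present.
    for key in REQUIRED_KEYS:
        if key not in manifest:
            problems.append(f"missing key: {key}")

    aggregates = manifest.get("aggregates", [])
    known = {entry.get("uri") for entry in aggregates} | {manifest.get("id", "/")}

    # 2. URI patterns on aggregated resources and their authors.
    for entry in aggregates:
        uri = entry.get("uri")
        if not uri:
            problems.append("aggregate without a uri")
            continue
        if uri.startswith("https://doi.org/") and not uri.startswith(DOI_PREFIX):
            problems.append(f"malformed DOI: {uri}")
        author = entry.get("createdBy", {}).get("uri")
        if author and not ORCID_PATTERN.match(author):
            problems.append(f"author is not an ORCID URI: {author}")

    # 3. Cross-references: annotations must be about something aggregated.
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in known:
            problems.append(f"annotation about unaggregated resource: {about}")

    return problems


# Hypothetical manifests: one that passes and one that breaks each kind of check.
good = {
    "@context": ["https://w3id.org/bundle/context"],
    "id": "/",
    "aggregates": [
        {
            "uri": "/workflow.cwl",
            "createdBy": {"uri": "https://orcid.org/0000-0000-0000-0000"},
        }
    ],
    "annotations": [{"about": "/workflow.cwl", "content": "/notes.txt"}],
}
bad = {
    "aggregates": [
        {"uri": "/workflow.cwl", "createdBy": {"uri": "http://example.org/alice"}}
    ],
    "annotations": [{"about": "/missing.csv", "content": "/notes.txt"}],
}

good_problems = check_manifest(good)
bad_problems = check_manifest(bad)
```

Checks like these only inspect the manifest itself; they do not dereference any URI, so they cannot tell whether a referenced resource actually exists.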

The manifest ties everything together

Case study: Back to Workflows

Workflow description, Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate

Supports containers for portability

Based on wf4ever wfdesc

• Richly described

• Multi-tiered descriptions

• Lots of files

• CWL in RDF…

• CWL vocabulary for workflow structure matches 1:1 with YAML

• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as "result" or "input"/"output" should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain, iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
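Several of the best practices above are mechanical enough to check automatically. The sketch below runs over a CWL description that has already been parsed into a Python dict (for example from YAML) and reports missing labels, missing doc strings, generic identifiers and File parameters without a format. It is an illustrative assumption, not part of CWL Viewer: it handles only the map form of `inputs` and `outputs`, and the example tool and its identifiers are hypothetical.

```python
# Identifiers the best practices call generic and uninformative.
GENERIC_NAMES = {"result", "input", "output"}


def check_best_practices(description):
    """Return warnings for a parsed CWL tool or workflow description (a dict)."""
    warnings = []

    # Label Strings and Doc Strings: both expected at the top level.
    for field in ("label", "doc"):
        if not description.get(field):
            warnings.append(f"missing top-level {field}")

    for section in ("inputs", "outputs"):
        for identifier, spec in description.get(section, {}).items():
            # CWL allows a bare type string as shorthand for {"type": ...}.
            if isinstance(spec, str):
                spec = {"type": spec}

            # Conceptual Identifiers: avoid generic names.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}.{identifier}: generic identifier")

            # Format Specification: every File should declare a format.
            if spec.get("type") == "File" and not spec.get("format"):
                warnings.append(f"{section}.{identifier}: File without format")

    return warnings


# Hypothetical tool description; format IRIs are abbreviated with an EDAM prefix.
tool = {
    "class": "CommandLineTool",
    "label": "Sort aligned reads",
    "inputs": {
        "input": {"type": "File"},
        "aligned_reads": {"type": "File", "format": "edam:format_2572"},
    },
    "outputs": {
        "sorted_reads": {"type": "File", "format": "edam:format_2572"},
    },
}

tool_warnings = check_best_practices(tool)
```

A check like this cannot judge whether a label is actually informative or whether a tool really performs a single operation; those practices still need a human reader.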

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following

Info for Constraints are:

• Embedded in a specific format
  - Extract/convert from domain-specific formats

• Embedded in annotation resources
  - Use existing schema.org annotations

• Need to be acquired
  - e.g. URI look-ups (ORCID -> author name)

• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators unpackers to iterate over the RO

Domain specific

• "Must have a workflow that analyses next-gen sequencing data"

• "Must be part of $fundedProject's Investigation"

• "All required data files must be provided"

• "Generic names should be avoided"

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software

Dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines regulation

Did anyone take any notice?

http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO, EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org, JSON-LD, RDF

Archive: .tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this "Research Objects". Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications

Complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: credit map with numbered nodes and contributors Alice, Charlie and Bob]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

W3C PROV dependency graph

"Provlets"

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. "Contriponents"

• contributors + components

2. Weighted contribution

3. Networked Credit maps

• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software 2(1), e20, pp 1-4, 2014. DOI: 10.5334/jors.be
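The idea of weighted contribution and networked credit maps can be made concrete with a short sketch. Each product carries a credit map whose fractional weights sum to 1 and may name either people or other products; credit then flows through products by multiplying weights along the chain. This is a simplified reading of transitive credit under stated assumptions: the dependency graph is acyclic, and the product names and all weights below are hypothetical (the contributor names are borrowed from the slide's figure).

```python
def transitive_credit(credit_maps, product, weight=1.0, totals=None):
    """Propagate fractional credit from a product down to individual contributors.

    credit_maps maps each product to {contributor_or_product: fraction}.
    Assumes the dependency graph has no cycles.
    """
    if totals is None:
        totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass the weighted share on.
            transitive_credit(credit_maps, contributor, weight * share, totals)
        else:
            # The contributor is a person: accumulate their fractional credit.
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps: a paper built on a workflow built on a dataset.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.5},
    "workflow": {"Bob": 0.75, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}

paper_credit = transitive_credit(credit_maps, "paper")
# Alice: 0.5, Bob: 0.5 * 0.75 = 0.375, Charlie: 0.5 * 0.25 * 1.0 = 0.125
```

Because each map sums to 1, the credit reaching people for any product also sums to 1, which is what lets indirect contributions such as the dataset be counted without inflating the total.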

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team

Colleagues in Manchester's Information Management Group

ELIXIR-UK Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 6: The Rhetoric of Research Objects

workflow commons

Collection in a Data Catalogue

Third party remote web services or command line tools

Workflows of local or remotely executed codes

16 datafiles (kinetic flux inhibition runout)

19 models (kinetics validation)

13 SOPs

3 studies (model analysis construction validation)

24 assaysanalyses (simulations model characterisations)

Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237

Research Components in a study and backing an article are Many and Various

Investigation

Study Analysis

Data

Model

SOP(Assay)

https://fairdomhub.org/investigations/56

Systems Biology Commons

Multi-results & versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…

Spans repository silos
Regardless of location
In house…
External - subject specific, general

Structured organisation

Retaining context over fragmentation

A Research Object bundles and relates digital resources of a scientific experiment/investigation + context

• Data used and results produced in experimental study

• Methods employed to produce and analyse that data

• Provenance and settings for the experiments

• People involved in the investigation

• Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobject.org

Container

Research Object in a nutshell

Packaging frameworks: Zip archives, BagIt, Docker images

Platforms: FAIRDOM, myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects: exchange, portability and maintenance

components packaged into various containers

ISA-TAB
checksum

RO Commons and Currency

Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek

https://doi.org/10.15490/seek.1.investigation.56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies…

Reproducibility
Preservation

Release
Exchange

Goble, De Roure, Bechhofer: Accelerating Knowledge Turns. DOI 10.1007/978-3-642-37186-8_1

FAIR Commons
Currency of Scholarship

Interpretation, Comparison
Preservation, Repair

Portability, Reuse
Execution

Active Research
Evolving codes

New data

Software Release
Executable Papers

Scientific Instruments
Machines

Interpretation, Comparison
Portability, Reuse

Credit, Citation

22/10/2017

An "evolving manuscript" would begin with a pre-publication, pre-peer review "beta 0.9" version of an article, followed by the approved published article itself [ … ] "version 1.0".

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the "accretion of confirmation [and] reputation".

Ottoline Leyser: […] assessment criteria in science revolve around the individual. "People have stopped thinking about the scientific enterprise."

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory: computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments break

• Technologies, materials and methods change

• Scope of use, robustness

• Black boxes – dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance Templates
• Trajectory mapping

[Figure labels: workflow engine; Workflow Run Provenance; Inputs, Outputs; Intermediates; Parameters, Configs; Checksum; Community ontologies & formats; Narrative; Linked Data, JSON-LD, RDF; EDAM; Errors; tools]

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003

Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5: 41
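To make the workflow-run provenance listed above (inputs, outputs, parameters, checksums) more concrete, here is a minimal Python sketch of a run record. It is only an illustration: the keys `used` and `generated` loosely echo the W3C PROV relations of the same flavour, but the record is not a PROV serialisation, and the workflow name, file names, contents and parameter are hypothetical.

```python
import hashlib

def describe_run(workflow, inputs, outputs, parameters):
    """Record what one workflow run used and generated, with a checksum per artefact."""
    def entity(name, content):
        return {"name": name, "sha256": hashlib.sha256(content).hexdigest()}

    return {
        "activity": workflow,
        "used": [entity(name, data) for name, data in inputs.items()],
        "generated": [entity(name, data) for name, data in outputs.items()],
        "parameters": dict(parameters),
    }

# Hypothetical run: names, contents and the parameter are placeholders.
run = describe_run(
    "align.cwl",
    inputs={"reads.fastq": b"@read1\nACGT\n+\nFFFF\n"},
    outputs={"aligned.sam": b"@HD\tVN:1.6\n"},
    parameters={"threads": 4},
)
print(run["used"][0]["name"], run["used"][0]["sha256"][:12])
```

Keeping a checksum next to each input and output is what later lets a consumer tell whether a preserved workflow is still operating on the same artefacts.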

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision medicine
NGS workflow exchange
FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed

Manifest
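The three manifest roles above (identification, aggregation, annotation) can be sketched as a small JSON document. The following is a minimal illustration only: the field names (`id`, `aggregates`, `annotations`, `about`, `content`) are assumptions modelled loosely on the Research Object Bundle manifest, and the file paths and annotation file are hypothetical.

```python
import json

def build_manifest(ro_id, resources, annotations):
    """Build a minimal manifest: identify the RO, aggregate resources, annotate them."""
    return {
        "id": ro_id,  # identification: where the RO itself lives
        "aggregates": [  # aggregation: embedded files and external references alike
            {"uri": uri, "mediatype": mediatype} for uri, mediatype in resources
        ],
        "annotations": [  # annotation: statements about aggregated things
            {"about": about, "content": content} for about, content in annotations
        ],
    }

# Hypothetical contents: one embedded file and one referenced web resource.
manifest = build_manifest(
    "/",
    [
        ("data/flux.csv", "text/csv"),
        ("https://fairdomhub.org/investigations/56", "text/html"),
    ],
    [("data/flux.csv", "annotations/flux-provenance.ttl")],
)
print(json.dumps(manifest, indent=2))
```

Because external references sit in the same `aggregates` list as embedded files, the manifest can describe resources that span repositories without copying them into the container.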

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)
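Checksums are what let a packaged container be verified after exchange. As a rough sketch of the BagIt idea, the Python below writes one manifest line per payload file (a digest followed by a path under `data/`) and re-checks it later. The payload names and contents are hypothetical, and this is a simplified illustration rather than a complete BagIt implementation.

```python
import hashlib

def payload_manifest(files):
    """Return manifest lines of the form '<sha256 digest> data/<path>' per payload file."""
    return [
        f"{hashlib.sha256(files[path]).hexdigest()} data/{path}"
        for path in sorted(files)
    ]

def verify(files, manifest_lines):
    """True if every payload file still matches its recorded checksum."""
    return payload_manifest(files) == list(manifest_lines)

# Hypothetical payload held in memory as bytes.
payload = {"flux.csv": b"t,flux\n0,1.2\n", "model.xml": b"<sbml/>"}
lines = payload_manifest(payload)
print("\n".join(lines))

# Changing a single file makes verification fail.
tampered = {**payload, "flux.csv": b"t,flux\n0,9.9\n"}
print(verify(payload, lines), verify(tampered, lines))
```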

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references

• attribution and provenance of each resource, for credit and right versions

• any part of the RO to be further described textually or semantically

• extensibility point for community-driven standards
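A structured ZIP of this kind can be sketched with the Python standard library. The convention assumed here follows ePub OCF: a `mimetype` entry comes first and is stored uncompressed, so tools can identify the container from its first bytes. The media type string and the `.ro/manifest.json` location are taken from the Research Object Bundle specification as I understand it and should be treated as assumptions; the payload file is hypothetical.

```python
import io
import json
import zipfile

# Media type assumed from the Research Object Bundle specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"

def write_bundle(target, manifest, files):
    """Write a structured ZIP: 'mimetype' first and stored, then manifest and payload."""
    with zipfile.ZipFile(target, "w") as zf:
        # A ZipInfo entry defaults to ZIP_STORED, i.e. no compression.
        zf.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE)
        zf.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                    compress_type=zipfile.ZIP_DEFLATED)
        for name, content in files.items():
            zf.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)

# Hypothetical bundle written to memory rather than to disk.
manifest = {"id": "/", "aggregates": [{"uri": "/data/flux.csv"}], "annotations": []}
buffer = io.BytesIO()
write_bundle(buffer, manifest, {"data/flux.csv": "t,flux\n0,1.2\n"})

with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    mimetype_stored = zf.getinfo("mimetype").compress_type == zipfile.ZIP_STORED
print(names, mimetype_stored)
```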


Manifest Construction

Identification: to locate and resolve things
RRI, DOI, URI, ORCID

Aggregates: to link things together
Open Archives Initiative Object Exchange and Reuse

Annotations: about things & their relationships
W3C Web Annotation Vocabulary

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

http://www.researchobject.org/specifications

Manifest

Artist's Impression

The real manifest

• A Manifest for 27 A4 pages…

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from
its evolution

what else is needed

what should be there for types

Manifest

Project, Lab Specific

Community-based
Types
Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

[Figure: data repositories and training resources marked up with Bioschemas, harvested by search engines, registries and data aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity, off-the-shelf tools
App eco-system
Lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events
Laboratory protocols
Workflows and Tools

See Alasdair Gray's Poster
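Publishing metadata "without APIs or special feeds" means embedding schema.org mark-up in the web page itself. The Python sketch below generates such a JSON-LD block for a dataset. `Dataset`, `name`, `description`, `url` and `identifier` are standard schema.org terms; the title and description text are hypothetical, and any Bioschemas-specific profile requirements are not modelled here.

```python
import json

def dataset_markup(name, description, url, identifier):
    """Return schema.org Dataset mark-up as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }

# Hypothetical title and description; the URL and DOI appear earlier in the slides.
markup = dataset_markup(
    "Example investigation",
    "Data, models and SOPs backing a published article.",
    "https://fairdomhub.org/investigations/56",
    "https://doi.org/10.15490/seek.1.investigation.56",
)

# Embedded in a page, this is what a search engine or registry would harvest.
html_snippet = '<script type="application/ld+json">' + json.dumps(markup) + "</script>"
print(html_snippet)
```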

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including the GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling: SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction: check cross-reference constraints on identifiers

Check URI patterns, e.g. "starts with"

Check JSON structure

Different levels: from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together
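The three kinds of check named above (JSON structure, URI patterns, cross-reference constraints on identifiers) can be sketched in a few lines of Python. This is an illustrative checker only: the manifest keys, the allowed URI prefixes and the example manifests are assumptions, not a published profile.

```python
def check_manifest(manifest, allowed_prefixes):
    """Return a list of problems found in a manifest dictionary (empty if none)."""
    problems = []
    # JSON structure: the top-level keys must be present.
    for key in ("id", "aggregates", "annotations"):
        if key not in manifest:
            problems.append(f"missing key: {key}")
    uris = [entry.get("uri", "") for entry in manifest.get("aggregates", [])]
    # URI patterns: external references must start with an allowed prefix.
    for uri in uris:
        if "://" in uri and not uri.startswith(allowed_prefixes):
            problems.append(f"unexpected URI pattern: {uri}")
    # Cross-reference: annotations may only be about aggregated resources.
    for annotation in manifest.get("annotations", []):
        if annotation.get("about") not in uris:
            problems.append(
                f"annotation about unaggregated resource: {annotation.get('about')}"
            )
    return problems

PREFIXES = ("https://doi.org/", "https://identifiers.org/")  # assumed policy

good = {
    "id": "/",
    "aggregates": [
        {"uri": "data/flux.csv"},
        {"uri": "https://doi.org/10.15490/seek.1.investigation.56"},
    ],
    "annotations": [{"about": "data/flux.csv", "content": "annotations/flux.ttl"}],
}
bad = {
    "id": "/",
    "aggregates": [{"uri": "ftp://example.org/raw.dat"}],
    "annotations": [{"about": "data/missing.csv"}],
}
print(check_manifest(good, PREFIXES))
print(check_manifest(bad, PREFIXES))
```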

Case study: Back to Workflows

Workflow description
Tool description

EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO
CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain, iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing, not Graph Link Following
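The profile above can also be read as a set of plain checks. The Python sketch below applies the same three requirements (label, doc, and a format IRI on every File) to a tool description held as a dictionary. It is not ShEx and not a CWL parser: the dictionary layout (inputs and outputs as maps from identifier to `type` and `format`) is a simplifying assumption, both EDAM prefixes are accepted because the slide shows both, and the example tool is hypothetical.

```python
IANA = "https://www.iana.org/assignments/media-types/"
# Both schemes are accepted because the slide shows both forms.
EDAM = ("http://edamontology.org/format_", "https://edamontology.org/format_")

def check_description(description):
    """Return profile violations for a tool or workflow description (a dict)."""
    problems = []
    # shouldHaveLabel / shouldHaveDoc: non-empty strings at the top level.
    for field in ("label", "doc"):
        value = description.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"should have {field}")
    # shouldHaveFormat: every File port needs an IANA or EDAM format IRI.
    for section in ("inputs", "outputs"):
        for port_id, port in description.get(section, {}).items():
            if port.get("type") != "File":
                continue
            if not port.get("format", "").startswith((IANA,) + EDAM):
                problems.append(f"{section}/{port_id}: format should be an IANA or EDAM IRI")
    return problems

# Hypothetical tool descriptions.
good_tool = {
    "label": "Sort table",
    "doc": "Sorts a tab-separated table by its first column.",
    "inputs": {"table": {"type": "File", "format": IANA + "text/tab-separated-values"}},
    "outputs": {"sorted_table": {"type": "File",
                                 "format": "http://edamontology.org/format_1915"}},
}
bad_tool = {"label": "", "inputs": {"input": {"type": "File"}}}
print(check_description(good_tool))
print(check_description(bad_tool))
```

Unlike a shape validator, this cannot confirm that a format IRI really is a term in the EDAM ontology; it only tests the prefix, which is one of the limitations the slides go on to list.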

Info for constraints is:

• Embedded in a specific format
  – Extract/convert from domain-specific formats

• Embedded in annotation resources
  – Use existing schema.org annotations

• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)

• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators / unpackers to iterate over the RO
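The pre-processing step mentioned above, merging the RO's separate descriptions into a single graph before validation, can be illustrated without any RDF library by treating a graph as a set of (subject, predicate, object) tuples. This is a toy sketch: the prefixed names are written as plain strings, the resource identifiers are hypothetical, and a real pipeline would parse the manifest and annotation files with an RDF toolkit.

```python
AGGREGATES = "ore:aggregates"
FORMAT = "dct:format"

def merge_graphs(*graphs):
    """Union several sets of (subject, predicate, object) triples into one graph."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged

def resources_missing_format(graph):
    """List aggregated resources that have no format statement in this graph."""
    aggregated = {o for s, p, o in graph if p == AGGREGATES}
    with_format = {s for s, p, o in graph if p == FORMAT}
    return sorted(aggregated - with_format)

# Hypothetical RO: the manifest aggregates a file, an annotation gives its format.
manifest_graph = {("ro:/", AGGREGATES, "ro:/data/flux.csv")}
annotation_graph = {("ro:/data/flux.csv", FORMAT, "iana:text/csv")}

# Checked alone, the manifest appears to violate the "should have format" rule.
print(resources_missing_format(manifest_graph))
# After merging, the same check passes.
print(resources_missing_format(merge_graphs(manifest_graph, annotation_graph)))
```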

Domain specific

• "Must have a workflow that analyses next-gen sequencing data"

• "Must be part of $fundedProject's Investigation"

• "All required data files must be provided"

• "Generic names should be avoided"

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software
Dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library
Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine
NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater
YARC OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org JSON-LD RDF
Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this "Research Objects". Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications
Complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools
Easy to make
Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: W3C PROV dependency graph of "Provlets" with numbered nodes and contributors Alice, Charlie and Bob]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. "Contriponents"
   • contributors + components

2. Weighted contribution

3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1): e7. DOI: http://doi.org/10.5334/jors.by

D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
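The idea of weighted contributions that propagate through components can be shown with a short worked example. The Python sketch below assumes each product carries a credit map whose weights sum to 1, and that a contributor may be either a person or another product; credit assigned to a product is passed on to its own contributors in proportion. The products and weights are hypothetical, the names Alice, Bob and Charlie are borrowed from the figure above, and the dependency graph is assumed to be acyclic.

```python
def transitive_credit(product, credit_maps):
    """Return person -> fractional credit for a product, following components recursively."""
    totals = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # A component product: pass its share on to its own contributors.
            for person, share in transitive_credit(contributor, credit_maps).items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:
            # A person: credit is awarded directly.
            totals[contributor] = totals.get(contributor, 0.0) + weight
    return totals

# Hypothetical credit maps: the paper reuses a dataset made by Alice and Bob.
credit_maps = {
    "dataset": {"Alice": 0.75, "Bob": 0.25},
    "paper": {"Charlie": 0.5, "dataset": 0.5},
}
credit = transitive_credit("paper", credit_maps)
print(credit)
```

Here Alice receives 0.5 × 0.75 = 0.375 of the paper's credit and Bob 0.5 × 0.25 = 0.125, although neither is a direct author, and the shares still sum to 1.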

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects
researchobject.org

Reports of the death of the scientific paper are greatly exaggerated


Page 7: The Rhetoric of Research Objects

Collection in a Data Catalogue

Third party remote web services or command line tools

Workflows of local or remotely executed codes

16 datafiles (kinetic flux inhibition runout)

19 models (kinetics validation)

13 SOPs

3 studies (model analysis construction validation)

24 assaysanalyses (simulations model characterisations)

Penkler G du Toit F Adams W Rautenbach M Palm D C van Niekerk D D and Snoep J L (2015) Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum FEBS J 282 1481ndash1511 doi101111febs13237

Research Components in a study and backing an article are Many and Various

Investigation

Study Analysis

Data

Model

SOP(Assay)

httpsfairdomhuborginvestigations56

Systems Biology Commons

Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip

Spans repository silos Regardless of locationIn househellipExternal - subject specific general

Structured organisation

Retaining contextover fragmentation

A Research Object bundles and relates digital resources of a scientific

experimentinvestigation + context

bull Data used and results produced in experimental study

bull Methods employed to produce and analyse that data

bull Provenance and settings for the experiments

bull People involved in the investigation

bull Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobjectorg

Container

Research Object in a nutshell

Packaging FrameworksZip Archives BagIt Docker images

Platforms FAIRDOM myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects exchange portability and maintenance

components packaged into

various containers

ISA-TABchecksum

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here

Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided

Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana

httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format

Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more

httpsviewcommonwlorgabout

shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL

shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL

step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat

shouldHaveFormat cwlFile dctformat ( iana | edam )

iana IRI

^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf

lthttpedamontologyorgformat_1915gt

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following

Info for Constraints arebull Embedded in a specific format

ndash Extractconvert from domain-specific formats

bull Embedded in annotation resourcesndash Use existing schemaorg

annotations

bull Need to be acquired ndash eg URI look-ups (ORCID -gt author

name)

bull Custom amp hardcoded namespacesndash Pre-declare ontologies

ndash Add derived annotations post-processing

RDF must already be in a single graph

Canrsquot check if resource exists (eg 404)

Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)

Canrsquot apply nested RDF shapes to Linked Data resources

Canrsquot say ldquoMust be term from any resolvable ontology

Canrsquot check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators unpackers to iterate over the RO

Domain specific

bull ldquoMust have a workflow that analyses next-gen sequencing datardquo

bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo

bull ldquoAll required data files must be providedrdquo

bull ldquoGeneric names should be avoidedrdquo

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software: Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 (STM Innovations Seminar 2012)

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org, JSON-LD, RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start-Ups

Community

Driver

Tools: Easy to make, Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: numbered graph with contributors Alice, Charlie and Bob]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016.

W3C PROV dependency graph

“Provlets”

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015). Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software v2(1), e20, pp 1-4, 2014. DOI: 10.5334/jors.be
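The idea of fractional, transitive credit can be sketched as a walk over weighted credit maps. This is an illustrative sketch only, not the method from the cited papers: the function name, the dictionary layout and the example weights are assumptions, and the credit graph is assumed to be acyclic.

```python
# Hypothetical credit maps: each product lists its contributors (people or
# other products) with fractional weights that sum to 1.

def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate each person's share of `product`, following products recursively.

    Assumes the credit graph has no cycles.
    """
    totals = {} if totals is None else totals
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass a scaled share down.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


EXAMPLE_MAPS = {
    "paper": {"Alice": 0.5, "dataset": 0.5},
    "dataset": {"Bob": 0.75, "Charlie": 0.25},
}
```

With these example weights, Bob receives 0.5 × 0.75 of the credit for the paper through the dataset it reuses, even though Bob is not an author of the paper.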

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects: researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK
Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 8: The Rhetoric of Research Objects

16 datafiles (kinetic, flux, inhibition, runout)

19 models (kinetics, validation)

13 SOPs

3 studies (model analysis, construction, validation)

24 assays/analyses (simulations, model characterisations)

Penkler G, du Toit F, Adams W, Rautenbach M, Palm DC, van Niekerk DD and Snoep JL (2015). Construction and validation of a detailed kinetic model of glycolysis in Plasmodium falciparum. FEBS J 282: 1481-1511. doi:10.1111/febs.13237

Research Components in a study and backing an article are Many and Various

Investigation

Study Analysis

Data

Model

SOP(Assay)

https://fairdomhub.org/investigations/56

Systems Biology Commons

Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…

Spans repository silos
Regardless of location
In house…
External - subject specific, general

Structured organisation

Retaining context over fragmentation

A Research Object bundles and relates digital resources of a scientific experiment/investigation + context

• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobject.org

Container

Research Object in a nutshell

Packaging Frameworks: Zip Archives, BagIt, Docker images

Platforms: FAIRDOM, myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects: exchange, portability and maintenance

components packaged into various containers

ISA-TAB, checksum

RO Commons and Currency

Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek

https://doi.org/10.15490/seek.1.investigation.56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies…

Reproducibility, Preservation

Release, Exchange

Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI: 10.1007/978-3-642-37186-8_1

FAIR Commons, Currency of Scholarship

Interpretation, Comparison, Preservation, Repair

Portability, Reuse, Execution

Active Research, Evolving codes, New data

Software Release, Executable Papers

Scientific Instruments, Machines

Interpretation, Comparison, Portability, Reuse

Credit, Citation

22/10/2017

An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.

Ottoline Leyser: […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise.”

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory: computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes - dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance Templates
• Trajectory mapping

workflow engine

Workflow Run, Provenance

Inputs, Outputs

Intermediates

Parameters, Configs

Checksum

Community ontologies & formats

Narrative

Linked Data, JSON-LD, RDF

EDAM

Errors

tools

Belhajjame et al (2015). Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014). Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41
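The checksum recorded alongside a workflow run's inputs, outputs and intermediates can be sketched as follows. This is a minimal illustration, assuming SHA-256 and an in-memory mapping of path to content; the function names and entry keys are hypothetical and do not come from any Research Object specification.

```python
import hashlib


def checksum_entries(files):
    """Record a SHA-256 digest and size for each file (path -> bytes)."""
    return [
        {
            "path": path,
            "sha256": hashlib.sha256(content).hexdigest(),
            "bytes": len(content),
        }
        for path, content in sorted(files.items())
    ]


def changed_paths(entries, files):
    """Return recorded paths that are now missing or whose content differs."""
    changed = []
    for entry in entries:
        content = files.get(entry["path"])
        if content is None or hashlib.sha256(content).hexdigest() != entry["sha256"]:
            changed.append(entry["path"])
    return changed
```

Recording the digests at packaging time and re-running `changed_paths` later is one way to notice that a preserved workflow's inputs or outputs have drifted.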

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. bioRxiv 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision Medicine
NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust, safe reuse

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)

Manifest Construction

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards
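A structured ZIP container of this kind can be sketched with the Python standard library. This is a sketch under stated assumptions rather than a conforming implementation: the `mimetype` entry stored first and uncompressed follows the ePub OCF convention, while the media type string and the `.ro/manifest.json` path are assumptions based on the RO Bundle specification and should be checked against it.

```python
import io
import json
import zipfile

# Assumed media type and manifest location; verify against the RO Bundle spec.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def write_bundle(manifest, files):
    """Return the bytes of a ZIP bundle holding a manifest and its resources.

    manifest: JSON-serialisable dict.
    files: mapping of archive path -> bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # OCF convention: 'mimetype' is the first entry and is not compressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(
            MANIFEST_PATH,
            json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )
        for path, content in files.items():
            bundle.writestr(path, content, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()
```

Because the result is an ordinary ZIP file, any consumer that ignores the manifest can still unpack the resources.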


Manifest Construction

Identification: to locate and resolve things
RRI, DOI, URI, ORCID

Aggregates: to link things together
Open Archives Initiative Object Exchange and Reuse

Annotations: about things & their relationships
W3C Web Annotation Vocabulary

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

http://www.researchobject.org/specifications
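The three ingredients of a manifest (identification, aggregates, annotations) can be sketched as a plain JSON structure. The key names `id`, `aggregates`, `uri`, `createdBy`, `annotations`, `about` and `content` are assumptions loosely modelled on the RO Bundle manifest; the function and its argument layout are hypothetical and only illustrate how the parts relate.

```python
import json


def build_manifest(ro_id, resources, annotations):
    """Assemble a manifest-like dict.

    ro_id: identifier of the Research Object itself.
    resources: list of (uri, creator_name) pairs to aggregate.
    annotations: list of (about_uri, content_uri) pairs.
    """
    return {
        "id": ro_id,
        "aggregates": [
            {"uri": uri, "createdBy": {"name": creator}}
            for uri, creator in resources
        ],
        "annotations": [
            {"about": about, "content": content}
            for about, content in annotations
        ],
    }


example = build_manifest(
    "/",
    [("/data/kinetics.csv", "Jane Doe"), ("/models/glycolysis.xml", "Joe Bloggs")],
    [("/data/kinetics.csv", "/.ro/annotations/kinetics-provenance.ttl")],
)
manifest_json = json.dumps(example, indent=2)
```

Identification supplies the URIs, the aggregates list says what belongs to the object, and each annotation points from a description to the thing it is about.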

Manifest

Artist’s Impression

The real manifest

• A Manifest for 27 A4 pages…

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from, its evolution

what else is needed

what should be there for types

Manifest

Project / Lab Specific

Community-based Types, Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

[Figure: data repositories and a training resource, each carrying Bioschemas mark-up, harvested by search engines, registries and data aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity off-the-shelf tools, app eco-system, lightweight
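Metadata mark-up that can be harvested without an API is typically embedded in the web page as JSON-LD. The sketch below builds a schema.org `Dataset` description and wraps it in a script element; the function name and the example values are hypothetical, and only generic schema.org properties are used, not a specific Bioschemas profile.

```python
import json


def dataset_markup(name, description, url, identifier, keywords):
    """Return an HTML script element carrying schema.org Dataset JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
    }
    # Escape "</" so a value cannot close the script element early.
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


markup = dataset_markup(
    name="Example kinetics dataset",
    description="Illustrative entry, not a real record.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/datasets/1",
    keywords=["kinetics", "systems biology"],
)
```

A crawler that understands schema.org can read this block straight from the page, which is the lightweight, commodity route the slide describes.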

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events
Laboratory protocols
Workflows and Tools

See Alasdair Gray’s Poster

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description
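A checklist of this kind can be pictured as a list of requirements, each with a level and a test over the Research Object. The sketch below is not the Minim vocabulary itself: the MUST/SHOULD/MAY levels, the dictionary layout of the RO and the example requirements are assumptions chosen for illustration.

```python
# Hypothetical checklist: (level, description, test over the RO).
EXAMPLE_CHECKLIST = [
    ("MUST", "has at least one workflow",
     lambda ro: bool(ro.get("workflows"))),
    ("MUST", "all required data files are provided",
     lambda ro: set(ro.get("required_data", [])) <= set(ro.get("data", []))),
    ("SHOULD", "has a title",
     lambda ro: bool(ro.get("title"))),
]


def evaluate_checklist(ro, checklist):
    """Return the descriptions of failed requirements, grouped by level."""
    failures = {"MUST": [], "SHOULD": [], "MAY": []}
    for level, description, test in checklist:
        if not test(ro):
            failures[level].append(description)
    return failures


def satisfies(ro, checklist):
    """An RO passes when no MUST-level requirement fails."""
    return not evaluate_checklist(ro, checklist)["MUST"]
```

Keeping the requirements as data rather than code paths is what lets a project, lab or community swap in its own profile.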

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. “starts with ”

Check JSON Structure

Different levels, from Whole studies to Complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618
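The three kinds of check listed above (JSON structure, URI patterns, cross-references between identifiers) can be sketched over a manifest held as a Python dict. The manifest keys and the problem messages are assumptions for illustration; the ORCID pattern reflects the usual form of an ORCID iD URI.

```python
import re

# ORCID iD URI: four groups of four characters, last character a digit or X.
ORCID_URI = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def check_manifest(manifest):
    """Return a list of problems found in a manifest-like dict."""
    aggregates = manifest.get("aggregates")
    if not isinstance(aggregates, list):
        # Structure check: nothing else can be tested without this list.
        return ["'aggregates' must be a list"]

    problems = []
    known = {manifest.get("id")} | {entry.get("uri") for entry in aggregates}

    # URI pattern check on creator identifiers.
    for entry in aggregates:
        orcid = entry.get("createdBy", {}).get("orcid")
        if orcid is not None and not ORCID_URI.match(orcid):
            problems.append(f"{entry.get('uri')}: creator is not an ORCID URI")

    # Cross-reference check: annotations must point at something aggregated.
    for annotation in manifest.get("annotations", []):
        if annotation.get("about") not in known:
            problems.append(f"annotation about unknown resource {annotation.get('about')}")
    return problems
```

These checks need only the manifest, so they can run before any resource is unpacked or fetched.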

The manifest ties everything together

Case study: Back to Workflows

Workflow description, Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best Practices
In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings
Include a top level short label summarising each tool and workflow.
Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Doc Strings
If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above).
Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Conceptual Identifiers
All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.
Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification
The format field should be specified for all input and output Files.
Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.
The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns
Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
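The same profile rules (label, doc, format from IANA or EDAM, no generic identifiers) can be sketched procedurally over a CWL description loaded into a Python dict. This is an illustration only, not ShEx and not the CWL Viewer's implementation: it assumes inputs and outputs are given in CWL's map form, handles only the `File` and `File?` types, and the function name and warning texts are hypothetical.

```python
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"
EDAM_PREFIXES = (
    "http://edamontology.org/format_",
    "https://edamontology.org/format_",
)
GENERIC_IDS = {"result", "input", "output"}


def check_profile(description):
    """Return warnings for one tool or workflow description (a dict)."""
    warnings = []
    for field in ("label", "doc"):
        if not description.get(field):
            warnings.append(f"missing top-level {field}")
    for section in ("inputs", "outputs"):
        for ident, param in description.get(section, {}).items():
            if ident.lower() in GENERIC_IDS:
                warnings.append(f"{section}.{ident}: generic identifier")
            if param.get("type") in ("File", "File?"):
                fmt = param.get("format", "")
                if not fmt.startswith((IANA_PREFIX,) + EDAM_PREFIXES):
                    warnings.append(
                        f"{section}.{ident}: format is not an IANA or EDAM IRI"
                    )
    return warnings
```

A prefix test like this checks only the shape of the IRI; confirming that the term really exists in the ontology would need a look-up against the ontology itself.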


Page 9: The Rhetoric of Research Objects

Investigation

Study Analysis

Data

Model

SOP(Assay)

httpsfairdomhuborginvestigations56

Systems Biology Commons

Multi-results amp VersionsData of many typeshellipPrimary secondary tertiaryhellipMethods models scripts hellip

Spans repository silos Regardless of locationIn househellipExternal - subject specific general

Structured organisation

Retaining contextover fragmentation

A Research Object bundles and relates digital resources of a scientific

experimentinvestigation + context

bull Data used and results produced in experimental study

bull Methods employed to produce and analyse that data

bull Provenance and settings for the experiments

bull People involved in the investigation

bull Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobjectorg

Container

Research Object in a nutshell

Packaging FrameworksZip Archives BagIt Docker images

Platforms FAIRDOM myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects exchange portability and maintenance

components packaged into

various containers

ISA-TABchecksum

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best Practices

To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those specifically relevant to the viewer are detailed below, but try to meet as many as possible to increase the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer that you may need to be aware of are also described here.

Label Strings

Include a short top-level label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name of the tool or workflow. For workflows, the label is displayed at the top of the page as the title; for tools, it is displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it takes priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step, if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, the doc is displayed at the top of the page under the title; for tools, it is displayed in the table. If a doc string is given at the step level, it takes priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step, if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers make the links between steps in the CWL file easy to distinguish. Identifiers are displayed in the tables and are unique to the step. The label, if provided, is also used as a replacement for the identifier in the visualisation.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to check that file formats are compatible. Ontologies will be parsed, and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
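The best practices above lend themselves to a simple automated check. The following is a minimal sketch, not part of CWL Viewer: it assumes the tool description has already been parsed into a plain Python dict, and the function name lint_tool, the GENERIC_IDS set and the example tool are all hypothetical.

```python
# Hypothetical lint for the viewer-relevant best practices listed above.
# The dict mirrors a parsed CWL CommandLineTool; no CWL library is used.
GENERIC_IDS = {"result", "input", "output"}  # names the text calls uninformative


def lint_tool(tool):
    """Return a list of human-readable best-practice warnings."""
    warnings = []
    if not tool.get("label"):
        warnings.append("missing top-level label")
    if not tool.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for ident, param in tool.get(section, {}).items():
            if ident.lower() in GENERIC_IDS:
                warnings.append(f"{section}/{ident}: generic identifier")
            if param.get("type") == "File" and not param.get("format"):
                warnings.append(f"{section}/{ident}: File without format")
    return warnings


# Invented example: one well-described input and one poorly described input.
example_tool = {
    "class": "CommandLineTool",
    "label": "Sort TSV",
    "inputs": {
        "unsorted_table": {"type": "File", "format": "iana:text/tab-separated-values"},
        "input": {"type": "File"},
    },
    "outputs": {
        "sorted_table": {"type": "File", "format": "iana:text/tab-separated-values"},
    },
}

for warning in lint_tool(example_tool):
    print(warning)
```

The check only looks at presence of fields; whether a label is actually descriptive still needs a human reader.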

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following
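To illustrate what "SPO testing" means in practice, here is a rough Python sketch of the shouldHaveFormat idea: it tests each file node's dct:format value against IRI patterns and nothing more. It is not ShEx and uses no RDF library; the triples, node names and the function format_violations are invented for the example, and the EDAM pattern accepts both http and https as an assumption.

```python
import re

# IRI patterns corresponding to the "iana" and "edam" shapes (assumed forms).
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/")
EDAM = re.compile(r"^https?://edamontology\.org/format_")
DCT_FORMAT = "http://purl.org/dc/terms/format"


def format_violations(triples, file_nodes):
    """Return file nodes lacking a dct:format value that matches IANA or EDAM."""
    bad = []
    for node in file_nodes:
        formats = [o for s, p, o in triples if s == node and p == DCT_FORMAT]
        if not any(IANA.match(f) or EDAM.match(f) for f in formats):
            bad.append(node)
    return bad


# Invented subject-predicate-object triples for four hypothetical file nodes.
triples = [
    ("_:reads", DCT_FORMAT, "http://edamontology.org/format_1930"),
    ("_:report", DCT_FORMAT, "https://www.iana.org/assignments/media-types/text/plain"),
    ("_:notes", DCT_FORMAT, "http://example.org/my-format"),
]
print(format_violations(triples, ["_:reads", "_:report", "_:notes", "_:table"]))
```

Note that the sketch only matches the shape of the IRI. It never dereferences the IRI or follows rdfs:subClassOf, so it cannot tell whether the term really exists in EDAM, which is the limitation the slide points at.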

Information for constraints is:

• Embedded in a specific format
  - Extract/convert from domain-specific formats
• Embedded in annotation resources
  - Use existing schema.org annotations
• Need to be acquired
  - e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations in post-processing

RDF must already be in a single graph

Can’t check if resource exists (e.g. 404)

Can’t test format/representation of resource (“is it actually an Excel file?”)

Can’t apply nested RDF shapes to Linked Data resources

Can’t say “Must be term from any resolvable ontology”

Can’t check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators, unpackers to iterate over the RO

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General tools that do their best at unpacking and handing off
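A bespoke validator of the kind mentioned above could look roughly like the sketch below, which checks the constraint "All required data files must be provided" against a ZIP-based bundle. It assumes a manifest at .ro/manifest.json with an "aggregates" list of {"uri": ...} entries, which is my reading of the RO Bundle layout; the helper names and the demo bundle are hypothetical.

```python
import io
import json
import zipfile


def missing_files(bundle_bytes):
    """List aggregated local paths named in the manifest but absent from the ZIP."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        names = set(bundle.namelist())
        manifest = json.loads(bundle.read(".ro/manifest.json"))
    missing = []
    for entry in manifest.get("aggregates", []):
        uri = entry["uri"]
        # Only bundle-local paths can be checked offline; remote URIs are skipped.
        if uri.startswith("/") and uri.lstrip("/") not in names:
            missing.append(uri)
    return missing


def make_demo_bundle():
    """Build a tiny in-memory bundle with one file deliberately left out."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        manifest = {"aggregates": [
            {"uri": "/workflow.cwl"},
            {"uri": "/data/reads.fastq"},
            {"uri": "http://example.org/reference.fa"},
        ]}
        bundle.writestr(".ro/manifest.json", json.dumps(manifest))
        bundle.writestr("workflow.cwl", "cwlVersion: v1.0\n")
    return buffer.getvalue()


print(missing_files(make_demo_bundle()))
```

Remote resources are deliberately skipped here; checking whether they resolve (the 404 case) would need network access and is exactly the kind of test a shape language cannot express.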

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software
Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library
Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab
sharing and computing statistical cohort studies

Precision medicine
NGS pipelines regulation

Did anyone take any notice?
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
STM Innovations Seminar 2012

Howard Ratner, Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org JSON-LD RDF
Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications
Complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools
Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping [Paolo Missier]

[Figure: W3C PROV dependency graph of “Provlets” with numbered edges, linking contributors Alice, Bob and Charlie]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD, Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software 2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
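As a rough illustration of weighted contributions and networked credit maps, the sketch below propagates fractional credit from a product down to people by multiplying weights along the chain. It is a simplification under stated assumptions: credit maps are plain Python dicts rather than JSON-LD, the graph is acyclic, and all weights are invented (the names Alice, Bob and Charlie are borrowed from the figure).

```python
def transitive_credit(credit_maps, product, weight=1.0, totals=None):
    """Propagate fractional credit from a product down to people.

    credit_maps maps each product to {contributor: share}; a contributor
    that is itself a key in credit_maps is treated as a component whose
    own credit map is followed recursively (assumes no cycles).
    """
    if totals is None:
        totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            transitive_credit(credit_maps, contributor, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Invented weights: each product's shares sum to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
    "dataset": {"Charlie": 1.0},
}
print(transitive_credit(credit_maps, "paper"))
```

Charlie never appears on the paper's own credit map, yet ends up with credit through the workflow and the dataset, which is the indirect contribution the bullets above refer to.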

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects
researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK
Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith


Multi-results & Versions
Data of many types…
Primary, secondary, tertiary…
Methods, models, scripts…

Spans repository silos
Regardless of location
In house…
External - subject specific, general

Structured organisation

Retaining context over fragmentation

A Research Object bundles and relates digital resources of a scientific experiment/investigation + context

• Data used and results produced in experimental study
• Methods employed to produce and analyse that data
• Provenance and settings for the experiments
• People involved in the investigation
• Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobject.org

Container

Research Object in a nutshell

Packaging Frameworks: Zip Archives, BagIt, Docker images

Platforms: FAIRDOM, myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects: exchange, portability and maintenance

components packaged into various containers

ISA-TAB
checksum

RO Commons and Currency

Author List: Joe Bloggs, Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek

https://doi.org/10.15490/seek.1.investigation.56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies …

Reproducibility
Preservation

Release
Exchange

Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, DOI 10.1007/978-3-642-37186-8_1

FAIR Commons
Currency of Scholarship

Interpretation, Comparison
Preservation, Repair

Portability, Reuse
Execution

Active Research
Evolving codes

New data

Software Release
Executable Papers

Scientific Instruments
Machines

Interpretation, Comparison
Portability, Reuse

Credit, Citation

22/10/2017

An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, […] “version 1.0”.

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.

Ottoline Leyser […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory

computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes - dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance Templates
• Trajectory mapping

workflow engine

Workflow Run
Provenance

Inputs, Outputs

Intermediates

Parameters
Configs

Checksum

Community ontologies & formats

Narrative

Linked Data
JSON-LD
RDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics, doi:10.1016/j.websem.2015.01.003
Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study, J Biomedical Semantics 5:41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al, Enabling Precision Medicine via standard communication of NGS provenance, analysis and results, biorxiv.org 2017, https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision Medicine
NGS workflow exchange, FDA regulatory review submissions
Emphasis on the parametric domain and robust safe reuse

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)
Adobe Universal Container Format (UCF)

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards


Manifest Construction

Identification: to locate and resolve things

Aggregates: to link things together

Annotations: about things & their relationships

RRI, DOI, URI, ORCID

Structured ZIP-file based on ePub (OCF) & Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives Initiative Object Exchange and Reuse

http://www.researchobject.org/specifications/
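The three manifest roles above (identification, aggregates, annotations) can be sketched as a small JSON document. The key names and the @context value below follow my reading of the RO Bundle manifest and should be checked against the specification; the function build_manifest, the placeholder ORCID and the annotation file path are hypothetical.

```python
import json


def build_manifest(ro_id, creator_orcid, resources, annotations):
    """Assemble a manifest dict: identification, aggregates, annotations."""
    return {
        "@context": ["https://w3id.org/bundle/context"],  # assumed context IRI
        "id": ro_id,                                      # identification
        "createdBy": {"orcid": creator_orcid},
        "aggregates": [{"uri": uri} for uri in resources],
        "annotations": [
            {"about": about, "content": content}
            for about, content in annotations
        ],
    }


manifest = build_manifest(
    ro_id="/",
    creator_orcid="https://orcid.org/0000-0002-1825-0097",  # placeholder ORCID
    resources=["/workflow.cwl", "https://doi.org/10.1101/191783"],
    annotations=[("/workflow.cwl", "annotations/workflow-description.ttl")],
)
print(json.dumps(manifest, indent=2))
```

Aggregated resources can be local paths inside the ZIP or external identifiers such as DOIs, which is what lets a bundle reference material it does not physically contain.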

Manifest

Artist’s Impression

The real manifest

• A Manifest for 27 A4 pages …

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from
its evolution

what else is needed

what should be there for types

Manifest

Project, Lab Specific

Community-based Types
Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

[Figure: data repositories and training resources marked up with Bioschemas and harvested by search engines, registries and data aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity off-the-shelf tools
App eco-system
Lightweight
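Publishing metadata "without APIs or special feeds" usually means embedding JSON-LD in the web page itself, where harvesters can pick it up. The sketch below generates such mark-up with plain schema.org Dataset properties; it does not implement any Bioschemas profile, and the function names and the example dataset values are hypothetical.

```python
import json


def dataset_markup(name, description, url, identifier):
    """Describe a dataset with basic schema.org properties as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }


def as_script_tag(markup):
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    body = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Invented example values for illustration only.
markup = dataset_markup(
    name="Example sequencing dataset",
    description="Illustrative record for a dataset landing page.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/datasets/1",
)
print(as_script_tag(markup))
```

Because the mark-up travels with the landing page, any crawler that already reads schema.org can harvest it with off-the-shelf tooling.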

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events
Laboratory protocols
Workflows and Tools

See Alasdair Gray’s Poster

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble, MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data, IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools
rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. “starts with ”

Check JSON structure

Different levels, from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618
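The three kinds of check named above (JSON structure, URI patterns, cross-references between identifiers) can be sketched over a manifest held as a Python dict. This is an illustrative stand-in, not a ShEx or SHACL validator; the key names assume an RO Bundle style manifest, and the function check_manifest, the required keys, the id_prefix default and the example manifest are all hypothetical.

```python
def check_manifest(manifest, id_prefix="https://doi.org/"):
    """Return a list of problems found by three simple profile checks."""
    problems = []
    # 1. JSON structure: required top-level keys (assumed profile).
    for key in ("id", "aggregates"):
        if key not in manifest:
            problems.append(f"missing key: {key}")
    # 2. URI pattern: the identifier must start with the expected prefix.
    if "id" in manifest and not str(manifest["id"]).startswith(id_prefix):
        problems.append(f"id does not start with {id_prefix}")
    # 3. Cross-reference: annotations may only describe aggregated resources.
    aggregated = {entry.get("uri") for entry in manifest.get("aggregates", [])}
    for annotation in manifest.get("annotations", []):
        if annotation.get("about") not in aggregated:
            problems.append(
                f"annotation about unaggregated resource: {annotation.get('about')}"
            )
    return problems


# Invented manifest with one dangling annotation.
example_manifest = {
    "id": "https://doi.org/10.1234/example",
    "aggregates": [{"uri": "/workflow.cwl"}],
    "annotations": [
        {"about": "/workflow.cwl", "content": "annotations/a.ttl"},
        {"about": "/missing.txt", "content": "annotations/b.ttl"},
    ],
}
print(check_manifest(example_manifest))
```

Hard-coding checks like this is the bespoke tooling the slides want to move away from; expressing the same constraints as shapes would let generic validators do the work.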

The manifest ties everything together

Case study: Back to Workflows

Workflow description
Tool description

EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer


Page 11: The Rhetoric of Research Objects

A Research Object bundles and relates digital resources of a scientific

experimentinvestigation + context

bull Data used and results produced in experimental study

bull Methods employed to produce and analyse that data

bull Provenance and settings for the experiments

bull People involved in the investigation

bull Annotations about these resources to improve understanding and interpretation

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobjectorg

Container

Research Object in a nutshell

Packaging FrameworksZip Archives BagIt Docker images

Platforms FAIRDOM myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects exchange portability and maintenance

components packaged into

various containers

ISA-TABchecksum

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow description; tool description

EDAM Ontology, SWO Ontology, data formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate

Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in an RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of, and link to, the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx
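The intent of these shapes (a label, a doc string, and an IANA or EDAM format on every File) can also be sketched procedurally. The Python below is an assumption-laden illustration, not the CWL Viewer's or a ShEx validator's implementation: it works on a CWL document already parsed into a dictionary, only handles the map form of `inputs` and `outputs` with a plain `type: File`, and accepts both the http and https spellings of the EDAM prefix.

```python
IANA = "https://www.iana.org/assignments/media-types/"
EDAM_PREFIXES = ("http://edamontology.org/format_", "https://edamontology.org/format_")


def expand(identifier, namespaces):
    """Expand 'prefix:local' using the document's $namespaces, if declared."""
    prefix, separator, local = identifier.partition(":")
    if separator and prefix in namespaces:
        return namespaces[prefix] + local
    return identifier


def check_cwl(document):
    """Return warnings for a parsed CWL tool or workflow description."""
    warnings = []
    if not document.get("label"):
        warnings.append("missing top level label")
    if not document.get("doc"):
        warnings.append("missing top level doc")

    namespaces = document.get("$namespaces", {})
    for section in ("inputs", "outputs"):
        for name, parameter in document.get(section, {}).items():
            # Only File parameters written in the expanded map form are checked.
            if not isinstance(parameter, dict) or parameter.get("type") != "File":
                continue
            declared = parameter.get("format")
            if declared is None:
                warnings.append(f"{section}.{name}: File has no format")
            elif not expand(declared, namespaces).startswith((IANA,) + EDAM_PREFIXES):
                warnings.append(f"{section}.{name}: format is not an IANA or EDAM IRI")
    return warnings


if __name__ == "__main__":
    # Hypothetical tool description used only to exercise the checks.
    tool = {
        "class": "CommandLineTool",
        "label": "Sort alignments",
        "$namespaces": {"edam": "http://edamontology.org/"},
        "inputs": {"alignments": {"type": "File", "format": "edam:format_2572"}},
        "outputs": {"sorted_alignments": {"type": "File"}},
    }
    print(check_cwl(tool))
```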

ShEx is S-P-O (subject, predicate, object) testing, not graph link following

Info for constraints is:

• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• In need of being acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource (“is it actually an Excel file?”)

Can't apply nested RDF shapes to Linked Data resources

Can't say “Must be term from any resolvable ontology”

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators, unpackers to iterate over the RO
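A toy version of the last two remedies, pre-processing into a single graph and a bespoke check that iterates over the RO, might look like the Python below. Triples are plain tuples rather than a real RDF library's objects, the `ore:aggregates` predicate string and file names are illustrative, and existence is tested against the bundle's own file listing instead of the network.

```python
def merge_graphs(graphs):
    """Merge the triples of several annotation resources into one graph."""
    merged = set()
    for triples in graphs.values():
        merged.update(triples)
    return merged


def objects(graph, subject, predicate):
    """All objects of (subject, predicate, ?) in the merged graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def missing_payload(graph, bundle_files, research_object="/"):
    """Aggregated local resources that are not actually present in the bundle."""
    aggregated = objects(graph, research_object, "ore:aggregates")
    return sorted(
        uri for uri in aggregated
        if not uri.startswith(("http://", "https://")) and uri not in bundle_files
    )


if __name__ == "__main__":
    # Hypothetical annotation resources, each contributing a few triples.
    graphs = {
        "manifest": [("/", "ore:aggregates", "/workflow.cwl"),
                     ("/", "ore:aggregates", "/data/reads.fastq")],
        "annotation-1": [("/workflow.cwl", "rdfs:label", "Align reads")],
    }
    merged = merge_graphs(graphs)
    print(missing_payload(merged, bundle_files={"/workflow.cwl"}))
```

A shape validator could then be run over `merged`, while checks such as "does the file exist" stay in the bespoke layer.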

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject's Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for data releases, as if they were software: dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 (STM Innovations Seminar 2012)

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org, JSON-LD, RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start-ups

Community

Driver

Tools: easy to make, hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: numbered provenance dependency graph linking contributors Alice, Bob and Charlie]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016

W3C PROV dependency graph

“Provlets”

Granularity, Atomicity, Aggregation

Transitive Credit contribution [Dan Katz and Arfon Smith]

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked credit maps
   • Travel with the contriponents

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1), e20, pp 1-4, 2014. DOI: 10.5334/jors.be
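The arithmetic behind weighted, transitive credit can be shown with a small Python sketch. It assumes each product carries a credit map whose weights sum to 1 and whose entries are either people or other products with their own maps, and that the maps contain no cycles. The names and weights are invented for the example and are not taken from the cited papers.

```python
from collections import defaultdict


def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Propagate fractional credit from a product down to individual people.

    credit_maps maps each product to {contriponent: share}; a contriponent
    that is itself a key of credit_maps is a component, otherwise a person.
    Assumes the maps are acyclic.
    """
    if totals is None:
        totals = defaultdict(float)
    for contriponent, share in credit_maps[product].items():
        if contriponent in credit_maps:
            # A component: pass the weighted share on to its own contributors.
            transitive_credit(contriponent, credit_maps, weight * share, totals)
        else:
            # A person: accumulate the weighted share.
            totals[contriponent] += weight * share
    return totals


if __name__ == "__main__":
    # Hypothetical credit maps: the paper gives a quarter of its credit
    # to the dataset it reuses.
    credit_maps = {
        "paper": {"Alice": 0.5, "Bob": 0.25, "dataset": 0.25},
        "dataset": {"Charlie": 0.75, "Alice": 0.25},
    }
    for person, credit in sorted(transitive_credit("paper", credit_maps).items()):
        print(person, credit)
```

Because each map sums to 1, the credit reaching people from a single product also sums to 1, which is what lets credit maps travel with the components they describe.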

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team; colleagues in Manchester's Information Management Group; ELIXIR-UK; Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 12: The Rhetoric of Research Objects

Standards-based metadata framework for bundling embedded and referenced resources with context

Citable Reproducible Packaging

researchobject.org

Container

Research Object in a nutshell

Packaging frameworks: Zip archives, BagIt, Docker images

Platforms: FAIRDOM, myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects: exchange, portability and maintenance

Components packaged into various containers

ISA-TAB, checksum

RO Commons and Currency

Author List: Joe Bloggs, Jane Doe. Title: My Investigation. Date: September 2016. DOI: https://doi.org/10.15490/seek

https://doi.org/10.15490/seek.1.investigation.56

Active entry evolves

Version information travels with the data and models

Rhetorical Analogies hellip

Reproducibility, Preservation

Release, Exchange

Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI 10.1007/978-3-642-37186-8_1

FAIR Commons, Currency of Scholarship

Interpretation, Comparison, Preservation, Repair

Portability, Reuse, Execution

Active Research: evolving codes, new data

Software Release, Executable Papers

Scientific Instruments, Machines

Interpretation, Comparison, Portability, Reuse

Credit, Citation

22/10/2017

An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.

Ottoline Leyser: […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory: computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Black boxes – dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance templates
• Trajectory mapping

[Figure: a workflow engine producing workflow run provenance (inputs, outputs, intermediates, parameters, configs, checksums, errors), alongside tools, a narrative, community ontologies & formats (EDAM) and Linked Data (JSON-LD, RDF)]

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003

Hettne KM et al (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis and results. biorxiv.org 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision medicine: NGS workflow exchange; FDA regulatory review submissions. Emphasis on the parametric domain and robust, safe reuse.

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF); Adobe Universal Container Format (UCF)

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards


Manifest Construction

Identification, to locate and resolve things: RRI, DOI, URI, ORCID

Aggregates, to link things together: Open Archives Initiative Object Exchange and Reuse

Annotations, about things & their relationships: W3C Web Annotation Vocabulary

Container: structured ZIP file based on ePub (OCF) & Adobe UCF specifications

http://www.researchobject.org/specifications
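Putting identification, aggregation and annotation together, a manifest and its ZIP container could be assembled roughly as in the Python sketch below. The field names (`aggregates`, `annotations`, `about`, `content`, `createdBy`), the `.ro/manifest.json` location and the `mimetype` entry follow my reading of the Research Object Bundle specification and should be checked against researchobject.org/specifications; the file names, author and helper functions are hypothetical.

```python
import io
import json
import zipfile

# Media type as I understand the RO Bundle specification; verify before relying on it.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_manifest(aggregated, annotations, author_name=None, author_orcid=None):
    """Build a manifest dict: identify, aggregate and annotate resources."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": uri} for uri in aggregated],
        "annotations": [
            {"about": about, "content": content} for about, content in annotations
        ],
    }
    if author_name:
        creator = {"name": author_name}
        if author_orcid:
            creator["orcid"] = author_orcid
        manifest["createdBy"] = creator
    return manifest


def write_bundle(manifest, payload):
    """Return the bytes of a ZIP container holding the payload and manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for path, data in payload.items():
            bundle.writestr(path, data)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


if __name__ == "__main__":
    manifest = build_manifest(
        aggregated=["/data/results.csv", "https://example.org/external-tool"],
        annotations=[("/data/results.csv", "/notes/results.txt")],
        author_name="Jane Doe",
    )
    payload = {"data/results.csv": "a,b\n1,2\n", "notes/results.txt": "Made up."}
    print(len(write_bundle(manifest, payload)), "bytes")
```

Keeping the manifest as plain JSON-LD means the same structure can be read by generic JSON tools or interpreted as RDF.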

Manifest

Artist's Impression

The real manifest

• A manifest for 27 A4 pages …

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description: Profiles

where it came from; its evolution; what else is needed; what should be there for types

Manifest

Project / Lab specific

Community-based types, context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas


Page 13: The Rhetoric of Research Objects

Container

Research Object in a nutshell

Packaging FrameworksZip Archives BagIt Docker images

Platforms FAIRDOM myExperiment

Rhetorical Analogy 1

Systems Biology Research Objects exchange portability and maintenance

components packaged into

various containers

ISA-TABchecksum

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description: Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations

• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. "starts with …"

Check JSON Structure

Different levels: from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together
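The three kinds of check named above (JSON structure, URI patterns, cross-reference constraints on identifiers) can be sketched as a small validator. This is an illustrative sketch only: the manifest layout (`aggregates` entries with a `uri`, `annotations` entries with a single-string `about`) and the required prefix passed in by the caller are assumptions, not part of any profile shown in the talk.

```python
import json


def check_manifest(text, uri_prefix):
    """Return a list of problems found in a manifest.json string."""
    # 1. JSON structure
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(manifest, dict):
        return ["top level must be a JSON object"]
    aggregates = manifest.get("aggregates")
    if not isinstance(aggregates, list):
        return ["'aggregates' must be a list"]

    problems = []
    known = set()
    for entry in aggregates:
        uri = entry.get("uri") if isinstance(entry, dict) else None
        if not isinstance(uri, str):
            problems.append("aggregate without a 'uri' string")
            continue
        known.add(uri)
        # 2. URI pattern: absolute URIs must start with the required prefix
        if "://" in uri and not uri.startswith(uri_prefix):
            problems.append(f"URI does not start with {uri_prefix}: {uri}")

    # 3. Cross-references: annotations may only be about aggregated resources
    annotations = manifest.get("annotations", [])
    if not isinstance(annotations, list):
        problems.append("'annotations' must be a list")
        annotations = []
    for annotation in annotations:
        about = annotation.get("about") if isinstance(annotation, dict) else None
        if about not in known:
            problems.append(f"annotation about unknown resource: {about}")
    return problems
```

An empty list means the manifest passed all three checks; anything beyond these structural tests (for example, whether a resource actually resolves) needs separate tooling.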

Case study: Back to Workflows

Workflow description, Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate

Supports containers for portability

Based on wf4ever wfdesc

• Richly described

• Multi-tiered descriptions

• Lots of files

• CWL in RDF…

• CWL vocabulary for workflow structure matches 1:1 with YAML

• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO

CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types use the IANA media type list with $namespaces: { iana: https://www.iana.org/assignments/media-types/ }, for example iana:text/plain, iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
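The label, doc, identifier and format recommendations above lend themselves to a simple lint pass. The sketch below is an assumption-laden illustration: it works on an already-parsed CWL document held as a Python dict, assumes the map form of `inputs` and `outputs` (CWL also allows a list form, which is not handled here), and uses a hypothetical `GENERIC_NAMES` set. It is not how CWL Viewer itself performs its checks.

```python
# Hypothetical set of identifiers considered too generic to be informative.
GENERIC_NAMES = {"result", "input", "output"}


def check_cwl(doc):
    """Return best-practice warnings for a parsed CWL tool or workflow dict."""
    warnings = []
    # Label and doc strings should be present at the top level.
    for field in ("label", "doc"):
        if not doc.get(field):
            warnings.append(f"missing top-level {field}")
    for section in ("inputs", "outputs"):
        for identifier, spec in doc.get(section, {}).items():
            where = f"{section}.{identifier}"
            # Identifiers should reflect their conceptual identity.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{where}: generic identifier")
            # Every File input or output should declare a format.
            if isinstance(spec, dict) and spec.get("type") == "File" and not spec.get("format"):
                warnings.append(f"{where}: File without format")
    return warnings
```

In practice the dict would come from parsing the CWL YAML file with a YAML library; that step is left out to keep the sketch dependency-free.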

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing, not Graph Link Following

Info for Constraints are:

• Embedded in a specific format

– Extract/convert from domain-specific formats

• Embedded in annotation resources

– Use existing schema.org annotations

• Need to be acquired

– e.g. URI look-ups (ORCID -> author name)

• Custom & hardcoded namespaces

– Pre-declare ontologies

– Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators, unpackers to iterate over the RO

Domain specific

• "Must have a workflow that analyses next-gen sequencing data"

• "Must be part of $fundedProject's Investigation"

• "All required data files must be provided"

• "Generic names should be avoided"

General Tools that do their best at unpacking and handing off
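As an illustration of the kind of bespoke unpacker a shape language cannot replace, the sketch below iterates over a zipped RO and asks "is it actually an Excel file?" by inspecting leading bytes. It is a rough heuristic under stated assumptions: a ZIP signature only shows that a file could be an .xlsx container, not that it is a valid workbook, and the list of files declared as Excel is supplied by the caller rather than read from a manifest.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # .xlsx files are ZIP containers
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls compound files


def looks_like_excel(data):
    """Heuristic test on the leading bytes of a file."""
    return data.startswith(ZIP_MAGIC) or data.startswith(OLE2_MAGIC)


def check_declared_excel(bundle_bytes, declared_excel):
    """Return (name, reason) pairs for declared Excel files that fail the test."""
    failures = []
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
        names = set(bundle.namelist())
        for name in declared_excel:
            if name not in names:
                failures.append((name, "missing from bundle"))
            elif not looks_like_excel(bundle.read(name)):
                failures.append((name, "no Excel signature"))
    return failures
```

A general tool could run checks like this while unpacking and hand the remaining, RDF-expressible constraints to a shape validator.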

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software

Dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI, USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org, JSON-LD, RDF

Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this "Research Objects". Don't be too clever about your titles

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship

• owners, sites, authors

Spanning

• platforms, researchers

Lifecycle

• composition, forking…

Governance

Credit

• micro-credit & citation propagation, attribution

Tamper proofing

• blockchain, ethereum

Maintenance

• of evolving content

• incrementality & degradation

Manifests

• profile & template making

• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: numbered dependency graph linking contributors Alice, Bob and Charlie]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016

W3C PROV dependency graph

"Provlets"

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. "Contriponents"

• contributors + components

2. Weighted contribution

3. Networked Credit maps

• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, v2(1), e20, pp 1-4, 2014. DOI: 10.5334/jors.be
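A minimal sketch of how weighted, transitive credit could be computed over networked credit maps. It assumes each product carries a map of contriponents to fractional weights summing to 1, that a contriponent with its own map is a component and anything else is a person, and that the dependency graph is acyclic. The products, names and weights in the usage note are invented for illustration and do not come from the cited papers or the figure.

```python
def transitive_credit(credit_maps, product, share=1.0, totals=None):
    """Propagate fractional credit from a product down to its contributors.

    credit_maps maps a product name to {contriponent: weight}. A contriponent
    that has its own entry in credit_maps is treated as a component and its
    share is passed on; otherwise it is a contributor and keeps the share.
    Assumes the dependency graph has no cycles.
    """
    totals = {} if totals is None else totals
    for contriponent, weight in credit_maps.get(product, {}).items():
        passed_on = share * weight
        if contriponent in credit_maps:
            transitive_credit(credit_maps, contriponent, passed_on, totals)
        else:
            totals[contriponent] = totals.get(contriponent, 0.0) + passed_on
    return totals
```

With hypothetical maps `{"paper": {"Alice": 0.5, "workflow": 0.5}, "workflow": {"Bob": 0.75, "dataset": 0.25}, "dataset": {"Charlie": 1.0}}`, calling `transitive_credit(maps, "paper")` gives Alice 0.5, Bob 0.375 and Charlie 0.125, so Charlie receives indirect credit for a paper that only reuses the dataset through the workflow.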

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team

Colleagues in Manchester's Information Management Group

ELIXIR-UK, Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 14: The Rhetoric of Research Objects

Systems Biology Research Objects: exchange, portability and maintenance

components packaged into various containers

ISA-TAB, checksum

RO Commons and Currency

Author List: Joe Bloggs, Jane Doe

Title: My Investigation

Date: September 2016

DOI: https://doi.org/10.15490/seek

https://doi.org/10.15490/seek.1.investigation.56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies …

Reproducibility, Preservation

Release, Exchange

Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI: 10.1007/978-3-642-37186-8_1

FAIR Commons, Currency of Scholarship

Interpretation, Comparison, Preservation, Repair

Portability, Reuse, Execution

Active Research: evolving codes, new data

Software Release, Executable Papers

Scientific Instruments, Machines

Interpretation, Comparison, Portability, Reuse

Credit, Citation

22/10/2017

An "evolving manuscript" would begin with a pre-publication, pre-peer review "beta 0.9" version of an article, followed by the approved published article itself, [ … ] "version 1.0".

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the "accretion of confirmation [and] reputation".

Ottoline Leyser: [ … ] assessment criteria in science revolve around the individual. "People have stopped thinking about the scientific enterprise".

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory: computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments Break

• Technologies, materials and methods change

• Scope of use, robustness

• Black boxes – dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV

• Provenance Templates

• Trajectory mapping

workflow engine

Workflow Run Provenance

Inputs, Outputs

Intermediates

Parameters, Configs

Checksum

Community ontologies & formats

Narrative

Linked Data, JSON-LD, RDF

EDAM

Errors

tools

Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003

Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5: 41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis and results. biorxiv.org, 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision Medicine: NGS workflow exchange, FDA regulatory review submissions

Emphasis on the parametric domain and robust, safe reuse

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there

Provenance: where it came from

Versioning: its evolution

Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)

Adobe Universal Container Format (UCF)
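To make one of the "old favourites" concrete, here is a rough sketch of the checksum manifest that travels inside a BagIt archive: one line per payload file, giving a digest and a path under `data/`. The helper names are hypothetical and the sketch covers only the manifest text, not a complete bag (no `bagit.txt` declaration or tag files).

```python
import hashlib


def bagit_manifest(payload):
    """Build the text of a manifest-sha256.txt for an in-memory payload.

    payload maps relative file paths to their bytes; BagIt keeps payload
    files under the 'data/' directory.
    """
    lines = []
    for path in sorted(payload):
        digest = hashlib.sha256(payload[path]).hexdigest()
        lines.append(f"{digest}  data/{path}")
    return "\n".join(lines) + "\n"


def changed_files(manifest_text, payload):
    """Return payload paths whose current digest differs from the manifest."""
    recorded = {}
    for line in manifest_text.splitlines():
        digest, path = line.split(None, 1)
        recorded[path] = digest
    return [
        path
        for path in sorted(payload)
        if recorded.get(f"data/{path}") != hashlib.sha256(payload[path]).hexdigest()
    ]
```

Because the checksums are stored alongside the data, a consumer can detect a corrupted or altered file without contacting the original repository.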

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP file, based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references

• attribution and provenance of each resource, for credit and right versions

• any part of the RO to be further described textually or semantically

• extensibility point for community-driven standards

Page 15: The Rhetoric of Research Objects

RO Commons and Currency

Author List Joe Bloggs Jane DoeTitle My Investigation Date September 2016DOI httpsdoiorg1015490seek

httpsdoiorg1015490seek1investigation56

Active entry evolves

Version

information travels with the data and models

Rhetorical Analogies hellip

ReproducibilityPreservation

ReleaseExchange

Goble De Roure Bechhofer Accelerating Knowledge Turns DOI 101007978-3-642-37186-8_1

FAIR CommonsCurrency of Scholarship

Interpretation ComparisonPreservation Repair

Portability ReuseExecution

Active ResearchEvolving codes

New data

Software ReleaseExecutable Papers

Scientific InstrumentsMachines

Interpretation ComparisonPortability Reuse

Credit Citation

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events, Laboratory protocols, Workflows and Tools

See Alasdair Gray’s Poster

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including the GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description
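The idea of a checklist evaluated against a research object can be sketched as follows. This is not the Minim vocabulary itself: it is a plain-Python illustration that assumes requirements come at three levels (MUST, SHOULD, MAY) and that each requirement is a test over the set of aggregated resource paths. The `Requirement` class, the `evaluate` function and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Requirement:
    label: str
    level: str  # assumed levels: "MUST", "SHOULD" or "MAY"
    test: Callable[[Set[str]], bool]


def evaluate(checklist: List[Requirement], resources: Set[str]) -> Dict[str, List[str]]:
    """Return the labels of unmet requirements, grouped by level."""
    unmet: Dict[str, List[str]] = {"MUST": [], "SHOULD": [], "MAY": []}
    for requirement in checklist:
        if not requirement.test(resources):
            unmet[requirement.level].append(requirement.label)
    return unmet


# A hypothetical checklist for a workflow-centric research object.
WORKFLOW_CHECKLIST = [
    Requirement("has a workflow definition", "MUST",
                lambda paths: any(p.endswith(".cwl") for p in paths)),
    Requirement("has a README", "SHOULD",
                lambda paths: "README.md" in paths),
    Requirement("has example outputs", "MAY",
                lambda paths: any(p.startswith("outputs/") for p in paths)),
]
```

An empty "MUST" list means the object meets the minimum bar; the other two lists tell a curator what would improve it.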

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations

• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction: check cross-reference constraints on identifiers

Check URI patterns, e.g. “starts with ”

Check JSON structure

Different levels: from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618
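The three kinds of check named above (JSON structure, URI patterns, cross-references between identifiers) can be sketched over a manifest that has already been loaded as a Python dictionary. The `check_manifest` function is hypothetical, the choice of mandatory keys is an assumption, and the required URI prefix is left as a parameter because the slide does not say which prefix is meant.

```python
def check_manifest(manifest, required_prefix):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []

    # Check JSON structure: the keys this sketch treats as mandatory.
    for key in ("id", "aggregates"):
        if key not in manifest:
            problems.append("missing key: " + key)

    aggregated = [entry.get("uri", "") for entry in manifest.get("aggregates", [])]

    # Check URI patterns: absolute URIs must start with the agreed prefix.
    for uri in aggregated:
        if "://" in uri and not uri.startswith(required_prefix):
            problems.append("unexpected URI pattern: " + uri)

    # Check cross-references: an annotation may only be about the RO itself
    # or about something the RO aggregates.
    known = set(aggregated) | {manifest.get("id")}
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        targets = about if isinstance(about, list) else [about]
        for target in targets:
            if target not in known:
                problems.append("annotation about unknown resource: " + str(target))

    return problems
```

A caller would load `.ro/manifest.json` with `json.load` and pass the result in; an empty list means none of these particular checks failed, not that the manifest is valid in any fuller sense.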

The manifest ties everything together

Case study: Back to Workflows

Workflow description, Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.

Based on wf4ever wfdesc

• Richly described

• Multi-tiered descriptions

• Lots of files

• CWL in RDF…

• CWL vocabulary for workflow structure matches 1:1 with YAML

• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to increase the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of bioinformatics tools. For plain types use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed, and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
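Several of these recommendations are mechanical enough to check automatically. The sketch below is a hypothetical linter, not part of CWL Viewer: it assumes the CWL description has already been parsed into a Python dictionary (for example from YAML), that `inputs` and `outputs` use the map form keyed by identifier, and that "generic" means one of the three names listed in `GENERIC_NAMES`.

```python
# Assumption: the names the best practices call generic and uninformative.
GENERIC_NAMES = {"result", "input", "output"}


def lint_cwl(document):
    """Return warnings for a parsed CWL tool or workflow description."""
    warnings = []

    # Label Strings and Doc Strings: expected at the top level.
    if not document.get("label"):
        warnings.append("missing top-level label")
    if not document.get("doc"):
        warnings.append("missing top-level doc")

    for section in ("inputs", "outputs"):
        for identifier, spec in document.get(section, {}).items():
            # Conceptual Identifiers: flag uninformative names.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(section + "." + identifier + ": generic identifier")
            # Format Specification: every File should declare a format.
            if (isinstance(spec, dict)
                    and spec.get("type") == "File"
                    and not spec.get("format")):
                warnings.append(section + "." + identifier + ": File without format")

    return warnings
```

For example, a tool with an output called `result` of type `File` and no `format` would produce two warnings for that one entry, which matches the way the best practices treat naming and format as separate concerns.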

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO (subject, predicate, object) testing, not graph link following
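What "SPO testing, not graph link following" means in practice can be shown with a hand-written check over a set of triples. This is a plain-Python sketch, not ShEx and not a ShEx validator: triples are modelled as `(subject, predicate, object)` tuples of IRI strings, the CWL namespace IRI is an assumption, and both `http` and `https` forms of the EDAM prefix are accepted because the slide shows both.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_FORMAT = "http://purl.org/dc/terms/format"
# Assumption: namespace IRI used for the CWL vocabulary.
CWL_FILE = "https://w3id.org/cwl/cwl#File"

ALLOWED_FORMAT_PREFIXES = (
    "https://www.iana.org/assignments/media-types/",
    "http://edamontology.org/format_",
    "https://edamontology.org/format_",
)


def files_without_valid_format(triples):
    """Return subjects typed cwl:File that lack an acceptable dct:format.

    Only triples already present in `triples` are inspected. The format
    IRI is never dereferenced, so the check can tell that an IRI looks
    like an EDAM term but not that the term exists in the ontology.
    """
    files = {s for s, p, o in triples if p == RDF_TYPE and o == CWL_FILE}
    well_formatted = {
        s for s, p, o in triples
        if p == DCT_FORMAT and o.startswith(ALLOWED_FORMAT_PREFIXES)
    }
    return sorted(files - well_formatted)
```

Anything that requires fetching the target of an IRI, such as confirming that a format really is a subclass of an EDAM term, has to be done by pre-processing that adds those triples to the graph before the shape is tested.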

Info for constraints is:

• Embedded in a specific format
– Extract/convert from domain-specific formats

• Embedded in annotation resources
– Use existing schema.org annotations

• Needing to be acquired
– e.g. URI look-ups (ORCID -> author name)

• Custom & hardcoded namespaces
– Pre-declare ontologies
– Add derived annotations post-processing

RDF must already be in a single graph

Can’t check if resource exists (e.g. 404)

Can’t test format/representation of resource (“is it actually an Excel file?”)

Can’t apply nested RDF shapes to Linked Data resources

Can’t say “Must be term from any resolvable ontology”

Can’t check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators, unpackers to iterate over the RO

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”

• “Must be part of $fundedProject’s Investigation”

• “All required data files must be provided”

• “Generic names should be avoided”

General tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software: Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner, Chair STM Future Labs Committee, CEO EVP Nature Publishing Group, Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools: easy to make, hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: credit map over a W3C PROV dependency graph (“Provlets”) for contributors Alice, Bob and Charlie; numeric edge labels not recoverable]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016.

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. “Contriponents”
• contributors + components

2. Weighted contribution

3. Networked credit maps
• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
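The arithmetic behind weighted, networked credit maps can be sketched as follows. This is a simplified reading of the transitive credit idea, not the authors' implementation: each product carries a credit map whose weights sum to 1, an entry names either a person or another product, and credit flowing through a product is multiplied by that product's weight. The `transitive_credit` function, the product names and all the weights are hypothetical, and the dependency graph is assumed to be acyclic.

```python
def transitive_credit(product, credit_maps):
    """Return each person's fractional credit for `product`.

    `credit_maps` maps a product name to {contributor_or_product: weight}.
    A name that is itself a key of `credit_maps` is treated as a product
    and its credit is passed on to its own contributors.
    """
    totals = {}
    for name, weight in credit_maps[product].items():
        if name in credit_maps:
            # Indirect contribution: scale the dependency's credit by its weight.
            for person, share in transitive_credit(name, credit_maps).items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:
            # Direct contribution by a person.
            totals[name] = totals.get(name, 0.0) + weight
    return totals


# Hypothetical credit maps: a paper that depends on a dataset.
CREDIT_MAPS = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}
```

Here `transitive_credit("paper", CREDIT_MAPS)` gives Charlie a quarter of the credit for the paper even though Charlie is not an author, which is the sense in which credit maps "travel with" the components they describe.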

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team, colleagues in Manchester’s Information Management Group, ELIXIR-UK, Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Rhetorical Analogies …

Reproducibility, Preservation

Release, Exchange

Goble, De Roure, Bechhofer. Accelerating Knowledge Turns. DOI: 10.1007/978-3-642-37186-8_1

FAIR Commons, Currency of Scholarship

Interpretation, Comparison, Preservation, Repair

Portability, Reuse, Execution

Active Research: evolving codes, new data

Software Release, Executable Papers

Scientific Instruments, Machines

Interpretation, Comparison, Portability, Reuse

Credit, Citation

22/10/2017

An “evolving manuscript” would begin with a pre-publication, pre-peer review “beta 0.9” version of an article, followed by the approved published article itself, [ … ] “version 1.0”.

Subsequently, scientists would update this paper with details of further work as the area of research develops. Versions 2.0 and 3.0 might allow for the “accretion of confirmation [and] reputation”.

Ottoline Leyser: […] assessment criteria in science revolve around the individual. “People have stopped thinking about the scientific enterprise”.

http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article

Release

Instrument Analogy

Methods: techniques, algorithms, spec of the steps, models, versions, robustness

Materials: datasets, parameters, thresholds, versions, algorithm seeds

Experiment

Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets

Laboratory: computational environment, versions

Setup

Report

Run

Instrument Analogy

• Instruments break

• Technologies, materials and methods change

• Scope of use, robustness

• Black boxes – dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV

• Provenance Templates

• Trajectory mapping

workflow engine

Workflow Run Provenance

Inputs, Outputs

Intermediates

Parameters, Configs

Checksum

Community ontologies & formats

Narrative

Linked Data, JSON-LD, RDF

EDAM

Errors

tools

Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003

Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org, 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision Medicine: NGS workflow exchange, FDA regulatory review submissions. Emphasis on the parametric domain and robust, safe reuse.

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there

Provenance: where it came from

Versioning: its evolution

Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and a standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF), Adobe Universal Container Format (UCF)

Page 17: The Rhetoric of Research Objects

22102017

An ldquoevolving manuscriptrdquo would begin with a pre-publication pre-peer review ldquobeta 09rdquo version of an article followed by the approved published article itself [ hellip ] ldquoversion 10rdquo

Subsequently scientists would update this paper with details of further work as the area of research develops Versions 20 and 30 might allow for the ldquoaccretion of confirmation [and] reputationrdquo

Ottoline Leyser [hellip] assessment criteria in science revolve around the individual ldquoPeople have stopped thinking about the scientific enterpriserdquo

httpwwwtimeshighereducationcouknewsevolving-manuscripts-the-future-of-scientific-communication2020200article

Release

Instrument Analogy

Methodstechniques algorithms spec of the steps models versions robustness

Materialsdatasets parameters thresholds versions algorithm seeds

Experiment

Instruments (by reference)tools codes services scripts underlying libraries versions workflows reference datasets

Laboratory

computational environment versions

Setup

Report

Run

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description: Profiles

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and monitoring tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the syntax and semantics of profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

• Check cross-reference constraints on identifiers
• Check URI patterns, e.g. “starts with …”
• Check JSON structure

Different levels: from whole studies to complex types

identifiers.org, PROV, JSON, manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together
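The three checks listed above (JSON structure, URI patterns, cross-references between identifiers) can be sketched without any RDF tooling. This is a minimal sketch, assuming a manifest shaped like an RO Bundle `manifest.json` with `aggregates` and `annotations` lists; `check_manifest`, the allowed prefixes and the sample manifest are hypothetical.

```python
def check_manifest(manifest, allowed_prefixes):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []

    # 1. JSON structure: the top-level keys we expect to find
    for key in ("id", "aggregates", "annotations"):
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI patterns: local paths, or remote URIs with an allowed prefix
    aggregated = set()
    for entry in manifest.get("aggregates", []):
        uri = entry.get("uri") if isinstance(entry, dict) else entry
        if not uri:
            problems.append("aggregate without uri")
            continue
        aggregated.add(uri)
        if not uri.startswith("/") and not uri.startswith(tuple(allowed_prefixes)):
            problems.append(f"unexpected URI pattern: {uri}")

    # 3. Cross-references: annotations must be about aggregated resources
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        targets = about if isinstance(about, list) else [about]
        for target in targets:
            if target not in aggregated:
                problems.append(f"annotation about unaggregated resource: {target}")

    return problems


# Hypothetical manifest with two deliberate mistakes
sample = {
    "id": "/",
    "aggregates": [{"uri": "/workflow.cwl"}, {"uri": "ftp://example.org/raw.dat"}],
    "annotations": [
        {"about": "/workflow.cwl", "content": "annotations/a.jsonld"},
        {"about": "/missing.txt", "content": "annotations/b.jsonld"},
    ],
}
problems = check_manifest(sample, ["https://", "http://"])
for problem in problems:
    print(problem)
```

Checks like these are purely structural; they say nothing about whether a remote resource still resolves or contains what its media type claims.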

Case study: Back to Workflows

Workflow description, tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool’s application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of, and link to, the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
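Several of these best practices are mechanical enough to check automatically. The sketch below assumes a CWL description that has already been parsed from YAML into a dictionary, with `inputs` and `outputs` in their map form; `check_best_practices`, the set of generic names and the example tool are hypothetical, and this is not how CWL Viewer itself checks descriptions.

```python
GENERIC_NAMES = {"result", "input", "output"}


def check_best_practices(description):
    """Return warnings for a parsed CWL tool or workflow description."""
    warnings = []

    # Label Strings and Doc Strings: expected at the top level
    if not description.get("label"):
        warnings.append("missing top-level label")
    if not description.get("doc"):
        warnings.append("missing top-level doc")

    for section in ("inputs", "outputs"):
        for identifier, spec in description.get(section, {}).items():
            # Conceptual Identifiers: avoid generic names
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}.{identifier}: generic identifier")
            # Format Specification: every File should declare a format
            if isinstance(spec, dict) and spec.get("type") == "File":
                if "format" not in spec:
                    warnings.append(f"{section}.{identifier}: File without format")

    return warnings


# Hypothetical tool description with a label but no doc string
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "label": "Sort lines",
    "inputs": {"input": {"type": "File"}, "reverse": {"type": "boolean"}},
    "outputs": {"sorted_lines": {"type": "File", "format": "iana:text/plain"}},
}
warnings = check_best_practices(tool)
for warning in warnings:
    print(warning)
```

Rules such as "separation of concerns" need human judgement and are left out of this kind of check.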

Capturing Common Workflow Language Profile as ShEx

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

ShEx is SPO (subject-predicate-object) testing, not graph link following

Info for constraints are:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

• Can’t check if resource exists (e.g. 404)
• Can’t test format/representation of resource (“is it actually an Excel file?”)
• Can’t apply nested RDF shapes to Linked Data resources
• Can’t say “Must be term from any resolvable ontology”
• Can’t check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators / unpackers to iterate over the RO
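The pre-processing step mentioned above can be illustrated with plain triples. This is a minimal sketch, assuming each graph is a set of (subject, predicate, object) strings and that the ontology terms were fetched separately; the function names, prefixed names and example resources are hypothetical. A real RDF merge also has to keep blank nodes from different graphs distinct, which this sketch ignores.

```python
def merge_graphs(*graphs):
    """Union several sets of triples into the single graph a shape check needs."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def objects(graph, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def format_is_known(graph, file_node, ontology_terms):
    """True if the file declares formats and all of them are known terms."""
    formats = objects(graph, file_node, "dct:format")
    return bool(formats) and formats <= ontology_terms


# Hypothetical graphs: the manifest and a separate annotation resource
manifest_graph = {("ro:workflow.cwl", "rdf:type", "cwl:Workflow")}
annotation_graph = {("ro:input.tsv", "dct:format", "edam:format_3475")}
# Terms acquired by a separate ontology look-up, outside the shape check
edam_terms = {"edam:format_1915", "edam:format_3475"}

single_graph = merge_graphs(manifest_graph, annotation_graph)
print(format_is_known(manifest_graph, "ro:input.tsv", edam_terms))
print(format_is_known(single_graph, "ro:input.tsv", edam_terms))
```

The check only succeeds once the annotation triples sit in the same graph as the manifest, which is the limitation the slide points at.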

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General tools that do their best at unpacking and handing off

Did anyone take any notice?

• Research Object Bundles for data releases, as if they were software: dataset “build” tool
• Standardised packing of Systems Biology models
• European Space Agency RO Library, Everest Project
• Metagenomics pipelines and LARGE datasets
• U Rostock
• ISI, USC
• Public Health Learning Systems
• Asthma Research e-Lab: sharing and computing statistical cohort studies
• Precision medicine: NGS pipelines, regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 (STM Innovations Seminar 2012)

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Centre, Leiden

Ted Slater, YARC OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD RDF. Archive: tar.gz

Reproducible Document Stack project: eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start Ups

• Community
• Driver
• Tools
• Easy to make
• Hard to consume
• Workflows
• Reproducibility
• Portability between platforms
• Platform & user buy-in from the get-go
• Passionate, dedicated leadership
• Standards

Open Questions

• Stewardship: owners, sites, authors
• Spanning: platforms, researchers
• Lifecycle: composition, forking…
• Governance
• Credit: micro-credit & citation, propagation, attribution
• Tamper proofing: blockchain, ethereum
• Maintenance: of evolving content; incrementality & degradation
• Manifests: profile & template making; auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping [Paolo Missier]

[Figure: W3C PROV dependency graph of “provlets” with numbered contributions linking Alice, Bob and Charlie. Granularity, atomicity, aggregation.]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”: contributors + components
2. Weighted contribution
3. Networked credit maps: travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015). Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
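The arithmetic behind transitive credit can be shown in a few lines. This is a sketch of the general idea, not the cited authors' implementation: each product carries a credit map of fractional weights, and a share given to another product is passed on through that product's own map. The products, names and weights are invented, and the maps are assumed to be acyclic.

```python
def transitive_credit(credit_maps, product, share=1.0, totals=None):
    """Distribute `share` of credit for `product` down to individual people."""
    totals = {} if totals is None else totals
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its share along
            transitive_credit(credit_maps, contributor, share * weight, totals)
        else:
            # The contributor is a person: accumulate their fraction
            totals[contributor] = totals.get(contributor, 0.0) + share * weight
    return totals


# Hypothetical credit maps; the weights in each map sum to 1
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.3, "dataset": 0.2},
    "workflow": {"Bob": 0.75, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}
credit = transitive_credit(credit_maps, "paper")
for person, fraction in sorted(credit.items()):
    print(f"{person}: {fraction:.3f}")
```

Here Charlie receives credit for the paper twice over: directly through the dataset (0.2) and indirectly through the workflow that reused it (0.3 × 0.25).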

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects, researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team, colleagues in Manchester’s Information Management Group, ELIXIR-UK, Bioschemas

http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith


Instrument Analogy

• Methods: techniques, algorithms, spec of the steps, models, versions, robustness
• Materials: datasets, parameters, thresholds, versions, algorithm seeds
• Experiment
• Instruments (by reference): tools, codes, services, scripts, underlying libraries, versions, workflows, reference datasets
• Laboratory: computational environment, versions
• Setup, Report, Run

Instrument Analogy

• Instruments break
• Technologies, materials and methods change
• Scope of use, robustness
• Blackboxes – dark and complicated

Workflow preservation & repair

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance Templates
• Trajectory mapping

[Figure: a workflow engine produces workflow run provenance (inputs, outputs, intermediates, parameters, configs, checksum, errors), described with community ontologies & formats (EDAM), tools, narrative and Linked Data (JSON-LD, RDF).]

Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003

Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org, 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision medicine: NGS workflow exchange, FDA regulatory review submissions. Emphasis on the parametric domain and robust, safe reuse.

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction

• Identification: to locate things
• Aggregates: to link things together
• Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

• Type Checklists: what should be there
• Provenance: where it came from
• Versioning: its evolution
• Dependencies: what else is needed

Containers are Many and Various

• pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed
• repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

• Zip archives
• BagIt archives
• ePUB Open Container Format (OCF)
• Adobe Universal Container Format (UCF)

Manifest Construction

• Identification: to locate things
• Aggregates: to link things together
• Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described, textually or semantically
• extensibility point for community-driven standards
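The structured ZIP layout can be sketched with the Python standard library. This follows one reading of the OCF/UCF convention that Research Object Bundles build on: a `mimetype` entry comes first and is stored uncompressed, and the manifest sits at `.ro/manifest.json`. Treat the media type string and paths as assumptions to verify against the specification; `write_bundle` and the example content are hypothetical.

```python
import io
import json
import zipfile

# Media type as understood from the RO Bundle specification (verify before use)
BUNDLE_MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_bundle(target, manifest, files):
    """Write a bundle to `target`, a path or a binary file-like object."""
    with zipfile.ZipFile(target, "w") as bundle:
        # The mimetype entry goes first and is stored without compression
        bundle.writestr(
            zipfile.ZipInfo("mimetype"),
            BUNDLE_MIMETYPE,
            compress_type=zipfile.ZIP_STORED,
        )
        # The manifest describes everything else in the bundle
        bundle.writestr(
            ".ro/manifest.json",
            json.dumps(manifest, indent=2),
            compress_type=zipfile.ZIP_DEFLATED,
        )
        # Aggregated resources that travel inside the bundle
        for path, content in files.items():
            bundle.writestr(path, content, compress_type=zipfile.ZIP_DEFLATED)


# Hypothetical bundle held in memory
buffer = io.BytesIO()
write_bundle(
    buffer,
    manifest={"id": "/", "aggregates": [{"uri": "/data.csv"}]},
    files={"data.csv": "a,b\n1,2\n"},
)
archive = zipfile.ZipFile(buffer)
print(archive.namelist())
```

Keeping `mimetype` first and uncompressed lets a consumer identify the container type by reading a fixed offset in the file, without unpacking it.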


Page 19: The Rhetoric of Research Objects

Instrument Analogy

bull Instruments Break

bull Technologies materials and methods change

bull Scope of use robustness

bull Blackboxes ndashdark and complicated

Workflow preservation amp repair

Reports + Machines Workflow Research Objects

bull W3C PROVbull Provenance

Templatesbull Trajectory

mapping

workflow engine

Workflow RunProvenance

Inputs Outputs

Intermediates

ParametersConfigs

Checksum

Communityontologies amp formats

Narrative

Linked DataJSON-LDRDF

EDAM

Errors

tools

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects J Web Semantics doi101016jwebsem201501003Hettne KM et al (2014) Structuring research methods and data with the research object model genomics workflows as a case study J Biomedical Semantics 5 41

BioCompute Objects

Alterovitz Dean II Goble Crusoe Soiland-Reyes e t al Enabling Precision Medicine via standard communication of NGS provenance analysis and results biorxivorg 2017 httpsdoiorg101101191783

Linked Data JSON-LD Ontologies (EDAM SWO)

Precision MedicineNGS workflow exchange FDA regulatory review submissionsEmphasis on the parametric domain and robust safe reuse

How do we build manifests

Rich self-describing semantic descriptions about resources and

their relationshipshellip

Manifest Construction

Manifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their

relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklistswhat should be thereProvenancewhere it came fromVersioningits evolutionDependencies what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as "result" or "input/output" should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in the formatting of files. Ontologies will be parsed, and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
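The label, doc, identifier and format recommendations above can be checked mechanically. The following is a minimal sketch, assuming the CWL document has already been parsed into a Python dict with inputs and outputs in map form; the function name `lint_cwl`, the list of "generic" identifiers and the example tool are all illustrative assumptions, not part of CWL Viewer.

```python
# Identifiers treated as "generic and uninformative" (an illustrative assumption).
GENERIC_IDS = {"result", "input", "output", "in", "out"}


def lint_cwl(doc):
    """Return best-practice warnings for a parsed CWL document (a plain dict)."""
    warnings = []
    if not doc.get("label"):
        warnings.append("missing top-level label")
    if not doc.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for ident, spec in doc.get(section, {}).items():
            if ident.lower() in GENERIC_IDS:
                warnings.append(f"{section}.{ident}: generic identifier")
            is_file = isinstance(spec, dict) and spec.get("type") == "File"
            if is_file and not spec.get("format"):
                warnings.append(f"{section}.{ident}: File without format")
    return warnings


# Hypothetical tool description used only to exercise the checks.
example_tool = {
    "class": "CommandLineTool",
    "label": "Sort BAM",
    "inputs": {
        "aligned_reads": {"type": "File", "format": "edam:format_2572"},
        "input": {"type": "File"},
    },
    "outputs": {"sorted_reads": {"type": "File", "format": "edam:format_2572"}},
}
lint_warnings = lint_cwl(example_tool)
```

Here the example tool triggers three warnings: no doc string, one generic identifier, and one File input without a format.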

shouldHaveDoc: ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel: ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step: a cwl:Step; cwl:inputs shouldHaveFormat; cwl:outputs shouldHaveFormat

shouldHaveFormat: cwl:File dct:format ( iana | edam )

iana: IRI ^https://www.iana.org/assignments/media-types/

edam: IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO (subject, predicate, object) testing, not graph link following
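To make the "SPO testing" point concrete, here is a rough Python sketch of what such shapes amount to when the RDF is held as a set of (subject, predicate, object) tuples. It is not ShEx and does not use a ShEx library; the prefixed names are kept as plain strings, and the EDAM pattern accepting both http and https is an assumption. Note that it only inspects triples already present and never dereferences the format IRI.

```python
import re

# IRI patterns corresponding to the iana and edam shapes (http/https is an assumption).
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/")
EDAM = re.compile(r"^https?://edamontology\.org/format_")


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def should_have_label(triples, node):
    """The node carries at least one rdfs:label."""
    return len(objects(triples, node, "rdfs:label")) > 0


def should_have_format(triples, node):
    """The node has a dct:format and every value is an IANA or EDAM IRI."""
    formats = objects(triples, node, "dct:format")
    return bool(formats) and all(IANA.match(f) or EDAM.match(f) for f in formats)


# Hypothetical triples for illustration only.
triples = {
    ("tool1", "rdfs:label", "Sort BAM"),
    ("file1", "dct:format", "http://edamontology.org/format_2572"),
    ("file2", "dct:format", "https://www.iana.org/assignments/media-types/text/plain"),
    ("file3", "dct:format", "application/x-unknown"),
}
```

The check accepts any IRI that merely looks like an EDAM format; confirming that the term really exists in EDAM would need link following, which is exactly the limitation listed on the following slides.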

Info for constraints is:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• In need of being acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can’t check if resource exists (e.g. 404)

Can’t test format/representation of resource (“is it actually an Excel file?”)

Can’t apply nested RDF shapes to Linked Data resources

Can’t say “Must be term from any resolvable ontology”

Can’t check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators / unpackers to iterate over the RO
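The two workarounds, merging to a single graph and a bespoke check that iterates over the RO, might look roughly like this. It is a sketch under stated assumptions: annotation graphs are modelled as lists of triples keyed by a hypothetical path inside the RO, and the function names, paths and the ORCID value are placeholders.

```python
def merge_annotation_graphs(annotation_graphs):
    """Union the triples of every annotation graph into one set for shape validation."""
    merged = set()
    for triples in annotation_graphs.values():
        merged.update(triples)
    return merged


def missing_resources(aggregated_paths, bundle_paths):
    """Aggregated local paths absent from the bundle, a check shapes alone cannot make."""
    return sorted(set(aggregated_paths) - set(bundle_paths))


# Hypothetical annotation graphs found inside an RO.
annotation_graphs = {
    ".ro/annotations/workflow.ttl": [
        ("workflow.cwl", "rdfs:label", "Alignment workflow"),
    ],
    ".ro/annotations/authors.ttl": [
        ("workflow.cwl", "dct:creator", "https://orcid.org/0000-0002-1825-0097"),
    ],
}
merged_graph = merge_annotation_graphs(annotation_graphs)
absent = missing_resources(["workflow.cwl", "data/reads.fastq"], ["workflow.cwl"])
```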

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for data releases as if they were software: dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org JSON-LD RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start-Ups

Community

Driver

Tools: easy to make, hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping [Paolo Missier]

[Figure: W3C PROV dependency graph with numbered nodes, mapping credit to Alice, Bob and Charlie]

“Provlets”

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
• contributors + components

2. Weighted contribution

3. Networked Credit maps
• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
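The arithmetic behind weighted, networked credit maps can be sketched in a few lines. This is an illustration of the general idea, not the scheme from the cited papers: each product carries a credit map whose weights sum to 1, contributors may be people or other products, and credit flows through products to the people behind them. The products, names and weights are invented, and the dependency graph is assumed to be acyclic.

```python
def transitive_credit(product, credit_maps):
    """Resolve a product's credit map down to people, following product dependencies."""
    totals = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its credit on, scaled by weight.
            for person, share in transitive_credit(contributor, credit_maps).items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight
    return totals


# Hypothetical credit maps; weights within each map sum to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
    "dataset": {"Charlie": 1.0},
}
paper_credit = transitive_credit("paper", credit_maps)
```

Charlie never appears as a direct contributor to the paper, yet ends up with 0.375 of its credit through the workflow and the dataset, which is the "indirect contributions" point above.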

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team

Colleagues in Manchester’s Information Management Group

ELIXIR-UK, Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 20: The Rhetoric of Research Objects

Reports + Machines: Workflow Research Objects

• W3C PROV
• Provenance Templates
• Trajectory mapping

[Diagram labels: workflow engine; Workflow Run Provenance; Inputs, Outputs; Intermediates; Parameters, Configs; Checksum; Community ontologies & formats; Narrative; Linked Data, JSON-LD, RDF; EDAM; Errors; tools]

Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects. J Web Semantics. doi:10.1016/j.websem.2015.01.003

Hettne KM et al. (2014) Structuring research methods and data with the research object model: genomics workflows as a case study. J Biomedical Semantics 5:41

BioCompute Objects

Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al. Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results. biorxiv.org, 2017. https://doi.org/10.1101/191783

Linked Data, JSON-LD, Ontologies (EDAM, SWO)

Precision Medicine

NGS workflow exchange, FDA regulatory review submissions

Emphasis on the parametric domain and robust, safe reuse

How do we build manifests?

Rich, self-describing semantic descriptions about resources and their relationships…

Manifest Construction
• Identification: to locate things
• Aggregates: to link things together
• Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description
• Type Checklists: what should be there
• Provenance: where it came from
• Versioning: its evolution
• Dependencies: what else is needed

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)

Adobe Universal Container Format (UCF)

Manifest Construction
• Identification, to locate and resolve things: RRI, DOI, URI, ORCID
• Aggregates, to link things together: Open Archives Initiative Object Exchange and Reuse
• Annotations, about things & their relationships: W3C Web Annotation Vocabulary

Container: structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources, including external resources and outside references
• attribution and provenance of each resource, for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards

http://www.researchobject.org/specifications
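As a rough illustration of a structured ZIP container with a manifest, the sketch below packs some files and a JSON manifest listing aggregates and annotations. The `mimetype` entry, the `.ro/manifest.json` path, the context URL and the field names follow my reading of the Research Object Bundle specification and should be checked against it; the function name and the example files are hypothetical.

```python
import io
import json
import zipfile


def build_ro_bundle(files, annotations):
    """Pack files plus a manifest into an in-memory ZIP and return its bytes."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # As in ePub OCF, the mimetype entry comes first and is stored uncompressed.
        bundle.writestr(zipfile.ZipInfo("mimetype"), "application/vnd.wf4ever.robundle+zip")
        for name, content in sorted(files.items()):
            bundle.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


# Hypothetical content, used only to exercise the sketch.
bundle_bytes = build_ro_bundle(
    {
        "README.md": "Example research object\n",
        "workflow.cwl": "cwlVersion: v1.0\n",
        "data/input.csv": "a,b\n1,2\n",
    },
    [{"about": "/workflow.cwl", "content": "/README.md"}],
)
with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
    names = bundle.namelist()
    mimetype = bundle.read("mimetype").decode("ascii")
    manifest = json.loads(bundle.read(".ro/manifest.json"))
```

Keeping the manifest as plain JSON inside an ordinary ZIP means the bundle can still be opened and inspected with commodity tools.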

Manifest: Artist’s Impression

The real manifest

• A manifest for 27 A4 pages…

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles
• what should be there for types
• where it came from
• its evolution
• what else is needed

[Diagram labels: Project / Lab specific; Community-based; Types; Context; All; VoID; OmicsDI]

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

[Diagram: data repositories and training resources marked up with Bioschemas, harvested by search engines, registries and data aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity, off-the-shelf tools; app eco-system; lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events; Laboratory protocols; Workflows and Tools

See Alasdair Gray’s poster

13 public datasets marked up, including the GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)
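A checklist-style profile can be pictured as a list of graded requirements evaluated against the contents of an RO. The sketch below is loosely inspired by the idea of minimum-information checklists and uses MUST / SHOULD / MAY levels as an assumption; it is not the Minim vocabulary, SPIN, or a shape validator, and the resources and requirements are invented.

```python
def evaluate_checklist(resources, checklist):
    """Return (satisfied, failures): satisfied is True when no MUST requirement fails."""
    failures = {"MUST": [], "SHOULD": [], "MAY": []}
    for level, description, test in checklist:
        if not test(resources):
            failures[level].append(description)
    return not failures["MUST"], failures


# Hypothetical profile: (level, description, test over a path -> type mapping).
checklist = [
    ("MUST", "has a workflow", lambda r: "Workflow" in r.values()),
    ("MUST", "has input data", lambda r: "Dataset" in r.values()),
    ("SHOULD", "has a README", lambda r: "README.md" in r),
]
resources = {"workflow.cwl": "Workflow", "reads.fastq": "Dataset"}
satisfied, failures = evaluate_checklist(resources, checklist)
```

Hand-written predicates like these are the "bespoke tooling" end of the spectrum; expressing the same requirements as shapes is what would let a generic validator do the work.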

Manifest construction
• Check cross-reference constraints on identifiers
• Check URI patterns, e.g. “starts with ”
• Check JSON structure

Different levels, from whole studies to complex types

identifiers.org, PROV, JSON, manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together
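The three manifest checks listed above (JSON structure, cross-references between identifiers, URI patterns) could be sketched as follows. The required keys, the field names `uri` and `about`, and the example prefix are assumptions chosen for illustration, not a statement of what any particular profile requires.

```python
import json

# Keys assumed to be required at the top level of a manifest (illustrative).
REQUIRED_KEYS = ("id", "aggregates", "annotations")


def check_manifest(manifest_text, uri_prefix):
    """Return a list of problems found in a manifest given as JSON text."""
    problems = []
    manifest = json.loads(manifest_text)
    # 1. JSON structure: required top-level keys are present.
    for key in REQUIRED_KEYS:
        if key not in manifest:
            problems.append(f"missing key: {key}")
    uris = [entry.get("uri") for entry in manifest.get("aggregates", [])]
    # 2. Cross-references: every annotation is about an aggregated resource.
    for annotation in manifest.get("annotations", []):
        if annotation.get("about") not in uris:
            problems.append(f"annotation about unknown resource: {annotation.get('about')}")
    # 3. URI patterns: external URIs must start with the expected prefix.
    for uri in uris:
        external = isinstance(uri, str) and uri.startswith(("http://", "https://"))
        if external and not uri.startswith(uri_prefix):
            problems.append(f"unexpected URI pattern: {uri}")
    return problems


# Hypothetical manifest with one dangling annotation and one off-pattern URI.
manifest_text = json.dumps({
    "id": "/",
    "aggregates": [
        {"uri": "/workflow.cwl"},
        {"uri": "https://identifiers.org/example:123"},
        {"uri": "http://example.org/data.csv"},
    ],
    "annotations": [
        {"about": "/workflow.cwl", "content": "/README.md"},
        {"about": "/missing.txt", "content": "/README.md"},
    ],
})
problems = check_manifest(manifest_text, "https://identifiers.org/")
```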

Case study: Back to Workflows

Workflow description

Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate

Supports containers for portability


Page 21: The Rhetoric of Research Objects



Reports of the death of the

scientific paper are greatly

exaggerated

All the members of the Wf4Ever teamColleagues in Manchesterrsquos Information Management GroupELIXIR-UK Bioschemas

httpwwwresearchobjectorg

httpwwwwf4ever-projectorg

httpwwwfair-domorg

httpseek4scienceorg

httprightfieldorguk

httpwwwbioschemasorg

httpwwwcommonwlorg

httpwwwbioexceleu

Mark RobinsonAlan WilliamsJo McEntyreNorman MorrisonStian Soiland-ReyesPaul GrothTim ClarkAlejandra Gonzalez-BeltranPhilippe Rocca-SerraIan CottamSusanna SansoneKristian GarzaDaniel GarijoCatarina MartinsAlasdair GrayRafael JimenezIain BuchanCaroline JayMichael CrusoeKaty Wolstencroft

Barend MonsSean BechhoferPhilip BourneMatthew GambleRaul PalmaJun ZhaoNeil Chue HongJosh SommerMatthias ObstJacky SnoepDavid GavaghanRebecca LawrenceStuart OwenFinn BacallPaolo MissierPhil CrouchOscar CorchoDan Katz Arfon Smith

Page 22: The Rhetoric of Research Objects

How do we build manifests?

Rich self-describing semantic descriptions about resources and their relationships…

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Container

Research Objects = Metadata Objects

Manifest Description

Type Checklists: what should be there

Provenance: where it came from

Versioning: its evolution

Dependencies: what else is needed

Manifest

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of >2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)

Adobe Universal Container Format (UCF)

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

• all resources including external resources and outside references
• attribution and provenance of each resource for credit and right versions
• any part of the RO to be further described textually or semantically
• extensibility point for community-driven standards

Manifest Construction

Manifest

Identification: to locate things

Aggregates: to link things together

Annotations: about things & their relationships

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

RRI, DOI, URI, ORCID

W3C Web Annotation Vocabulary

Open Archives Initiative Object Exchange and Reuse

Manifest Construction

Identification: to locate and resolve things

Aggregates: to link things together

Annotations: about things & their relationships

RRI, DOI, URI, ORCID

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives Initiative Object Exchange and Reuse

http://www.researchobject.org/specifications/
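The three manifest roles (identification, aggregation, annotation) inside a structured ZIP file can be sketched as below. This is a minimal illustration, not the specification itself: the `.ro/manifest.json` path, the `mimetype` string and the JSON-LD context URL are assumptions recalled from the RO Bundle specification, and the file names and the example.org URL are hypothetical.

```python
import io
import json
import zipfile

# Assumed media type for an RO Bundle; check the specification before relying on it.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_manifest(resources, annotations):
    """Build a manifest dict covering identification, aggregation and annotation."""
    aggregates = []
    for uri, mediatype in resources:
        entry = {"uri": uri}
        if mediatype is not None:
            entry["mediatype"] = mediatype
        aggregates.append(entry)
    return {
        "@context": ["https://w3id.org/bundle/context"],  # assumed context URL
        "id": "/",  # identification: the root of the bundle
        "aggregates": aggregates,  # aggregation: local and external resources
        "annotations": [  # annotation: statements about aggregated resources
            {"about": about, "content": content} for about, content in annotations
        ],
    }


def write_bundle(manifest, payload):
    """Write an in-memory ZIP with an uncompressed 'mimetype' entry first (ePub OCF style)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
        for path, data in payload.items():
            bundle.writestr(path, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


manifest = build_manifest(
    resources=[("/workflow.cwl", "text/x-yaml"),
               ("http://example.org/reference-data", None)],
    annotations=[("/workflow.cwl", "/.ro/annotations/workflow.ttl")],
)
bundle_bytes = write_bundle(manifest, {"workflow.cwl": "cwlVersion: v1.0\n"})
with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
    names = bundle.namelist()
```

The external resource is aggregated by URI only, which mirrors the point that a manifest can reference things it does not physically contain.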

Manifest

Artist’s Impression

The real manifest

• A Manifest for 27 A4 pages …

RO manifest from FAIRDOM
https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from

its evolution

what else is needed

what should be there for types

Manifest

Project Lab Specific

Community-based Types Context

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schema.org tailored to the Biosciences

[Figure: Data repositories and a Training Resource, each carrying Bioschemas mark-up, feeding Search engines, Registries and Data Aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity Off the Shelf tools

App eco-system

Lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events

Laboratory protocols

Workflows and Tools

See Alasdair Gray’s Poster

Manifest schema.org tailored to the Biosciences

13 public datasets marked up, including GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools

rich RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. “starts with”

Check JSON Structure

Different levels: from Whole studies to Complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618
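The three manifest checks named on this slide (JSON structure, URI patterns, cross-reference constraints on identifiers) might look like the sketch below. It assumes a manifest shaped as a dict with `aggregates` and `annotations` lists; the key names, the allowed prefixes and the example paths are illustrative assumptions, not taken from the talk.

```python
def check_manifest(manifest, allowed_prefixes=("/", "http://", "https://")):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # 1. JSON structure: both top-level collections must be lists.
    aggregates = manifest.get("aggregates")
    annotations = manifest.get("annotations")
    if not isinstance(aggregates, list):
        problems.append("'aggregates' must be a list")
        aggregates = []
    if not isinstance(annotations, list):
        problems.append("'annotations' must be a list")
        annotations = []

    # 2. URI patterns: every aggregated URI must start with an allowed prefix.
    known = set()
    for entry in aggregates:
        uri = entry.get("uri") if isinstance(entry, dict) else None
        if not isinstance(uri, str) or not uri.startswith(allowed_prefixes):
            problems.append(f"unexpected URI pattern: {uri!r}")
        known.add(uri)

    # 3. Cross-references: an annotation may only be about an aggregated resource.
    for note in annotations:
        about = note.get("about") if isinstance(note, dict) else None
        if about not in known:
            problems.append(f"annotation refers to unaggregated resource: {about!r}")
    return problems


good = {
    "aggregates": [{"uri": "/data.csv"}],
    "annotations": [{"about": "/data.csv", "content": "/.ro/annotations/data.ttl"}],
}
bad = {
    "aggregates": [{"uri": "ftp://x/data.csv"}],
    "annotations": [{"about": "/missing.csv"}],
}
good_problems = check_manifest(good)
bad_problems = check_manifest(bad)
```

Returning a list of problems rather than raising on the first failure lets a tool report everything wrong with a manifest in one pass.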

The manifest ties everything together

Case study: Back to Workflows

Workflow description

Tool description

EDAM Ontology

SWO Ontology

Data Formats

Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate

Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO

CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow.

Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above).

Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided.

Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx
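For readers who do not use ShEx, the intent of the profile can be approximated in plain Python over a parsed CWL description. This is only a sketch of the same checks (label, doc, File formats drawn from IANA or EDAM, non-generic identifiers), not a ShEx validator: the dict layout, the `GENERIC_NAMES` set and the example tool are assumptions made for illustration, and like the shapes it tests IRI patterns without confirming that a term really exists in EDAM.

```python
import re

# Pattern tests only, mirroring the iana and edam shapes in the profile.
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/")
EDAM = re.compile(r"^https?://edamontology\.org/format_\d+$")

# Hypothetical list of identifiers considered too generic to be helpful.
GENERIC_NAMES = {"input", "output", "result", "in", "out"}


def check_description(doc):
    """Return warnings for a tool or workflow description given as a dict."""
    warnings = []
    for field in ("label", "doc"):
        if not doc.get(field):
            warnings.append(f"missing top-level {field}")
    for section in ("inputs", "outputs"):
        for name, port in doc.get(section, {}).items():
            if name.lower() in GENERIC_NAMES:
                warnings.append(f"{section}.{name}: generic identifier")
            if port.get("type") == "File":
                fmt = port.get("format", "")
                if not (IANA.match(fmt) or EDAM.match(fmt)):
                    warnings.append(
                        f"{section}.{name}: format is not an IANA or EDAM IRI")
    return warnings


tool = {
    "class": "CommandLineTool",
    "label": "Sort alignments",
    "doc": "Sorts an alignment file by coordinate.",
    "inputs": {
        "alignments": {"type": "File",
                       "format": "http://edamontology.org/format_2572"},
    },
    "outputs": {"result": {"type": "File"}},
}
tool_warnings = check_description(tool)
```

Here the output port is flagged twice: once for its generic name and once for its missing format.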

ShEx is SPO testing not Graph Link Following

Info for Constraints are

• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can’t check if resource exists (e.g. 404)

Can’t test format/representation of resource (“is it actually an Excel file?”)

Can’t apply nested RDF shapes to Linked Data resources

Can’t say “Must be term from any resolvable ontology”

Can’t check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators/unpackers to iterate over the RO
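Two of the gaps listed above, whether a resource exists and whether it is really an Excel file, are the kind of thing a bespoke validator has to handle outside the shape language. The sketch below iterates over a manifest's aggregated resources with an injected `fetch` function, so it can be tried without network access. The manifest layout, the media type set and the file paths are illustrative assumptions; the magic-number test is only a first sniff (a ZIP signature is necessary, not sufficient, for .xlsx).

```python
XLSX_MAGIC = b"PK\x03\x04"  # .xlsx files are ZIP containers
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls is an OLE2 compound file

EXCEL_TYPES = {
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def looks_like_excel(content):
    """Cheap sniff on the leading bytes; not a full format validation."""
    return content.startswith(XLSX_MAGIC) or content.startswith(XLS_MAGIC)


def validate_resources(manifest, fetch):
    """Check each aggregated resource.

    fetch(uri) must return bytes, or None when the resource cannot be
    retrieved (for example an HTTP 404).
    """
    report = {}
    for entry in manifest.get("aggregates", []):
        uri = entry["uri"]
        content = fetch(uri)
        if content is None:
            report[uri] = "missing"
        elif entry.get("mediatype") in EXCEL_TYPES and not looks_like_excel(content):
            report[uri] = "not an Excel file"
        else:
            report[uri] = "ok"
    return report


XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
store = {
    "/data/results.xlsx": b"PK\x03\x04...",  # plausible ZIP header
    "/data/fake.xlsx": b"a,b,c\n1,2,3\n",    # CSV mislabelled as Excel
}
manifest = {"aggregates": [
    {"uri": "/data/results.xlsx", "mediatype": XLSX},
    {"uri": "/data/fake.xlsx", "mediatype": XLSX},
    {"uri": "/data/gone.xlsx", "mediatype": XLSX},
]}
report = validate_resources(manifest, store.get)
```

Passing `fetch` in as a parameter keeps the iteration logic separate from how resources are retrieved, whether from a ZIP, a file system or the web.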

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject’s Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software

Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library

Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation

Did anyone take any notice?

http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5

STM Innovations Seminar 2012

Howard Ratner, Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD RDF

Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications

Complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what?

Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: credit map graph with numeric labels and contributors Alice, Charlie and Bob]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016

W3C PROV dependency graph

“Provlets”

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution

[Dan Katz and Arfon Smith]

Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, v.2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
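The idea of weighted, networked credit maps can be illustrated with a small calculation. In this sketch each product carries a credit map whose weights sum to 1, and a contributor may be either a person or another product with its own map; credit then flows through components to the people behind them. The product names and weights are hypothetical, and this is one reading of transitive credit rather than the authors' reference implementation.

```python
def transitive_credit(product, credit_maps, _trail=()):
    """Return {person: fraction} for a product, following component products."""
    if product in _trail:
        raise ValueError(f"cyclic credit map at {product!r}")
    totals = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its credit through, scaled.
            nested = transitive_credit(contributor, credit_maps, _trail + (product,))
            for person, share in nested.items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:
            # The contributor is a person: direct credit.
            totals[contributor] = totals.get(contributor, 0.0) + weight
    return totals


# Hypothetical credit maps; each map's weights sum to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
    "dataset": {"Alice": 0.5, "Charlie": 0.5},
}
credit = transitive_credit("paper", credit_maps)
```

In this example Alice receives credit both directly and through the dataset she helped create, which is the kind of indirect contribution the credit maps are meant to surface.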

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team

Colleagues in Manchester’s Information Management Group

ELIXIR-UK Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 24: The Rhetoric of Research Objects

Containers are Many and Various

pre-packaged Docker images containing a bioinformatics tool and standardised interface through which data and parameters are passed

repository of gt2700 bioinformatics packages ready to use with conda install

Old Favourites

Zip Archives

BagIt Archives

ePUB Open Container Format (OCF)Adobe Universal Container Format (UCF)

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

bull all resources including external resources and outside references

bull attribution and provenance of each resource for credit and right versions

bull any part of the RO to be further described textually or semantically

bull extensibility point for community-driven standards

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools

Rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

• Check cross-reference constraints on identifiers
• Check URI patterns, e.g. “starts with ”
• Check JSON structure

Different levels: from Whole studies to Complex types

identifiers.org, PROV, JSON, manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together
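The three manifest checks listed above can be sketched as one small function. This is an illustration only: the key names follow the manifest layout assumed earlier in this talk's RO Bundle material, `check_manifest` is a hypothetical helper, and the identifiers.org prefix used as the URI pattern is an assumed example because the slide does not say which prefix it had in mind.

```python
import re


def check_manifest(manifest, uri_pattern):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # 1. JSON structure: the keys a consumer expects must be present.
    for key in ("id", "aggregates", "annotations"):
        if key not in manifest:
            problems.append(f"missing key: {key}")

    # 2. URI patterns: absolute URIs must match the agreed pattern;
    #    bundle-relative paths such as "/data.csv" are left alone.
    aggregated = set()
    for entry in manifest.get("aggregates", []):
        uri = entry.get("uri")
        if uri is None:
            problems.append("aggregate without a uri")
            continue
        aggregated.add(uri)
        if "://" in uri and not re.match(uri_pattern, uri):
            problems.append(f"unexpected URI pattern: {uri}")

    # 3. Cross-references: an annotation must be about the RO itself
    #    or about something the RO aggregates.
    for note in manifest.get("annotations", []):
        about = note.get("about")
        targets = about if isinstance(about, list) else [about]
        for target in targets:
            if target != manifest.get("id") and target not in aggregated:
                problems.append(f"annotation about unknown resource: {target}")

    return problems


ASSUMED_PATTERN = r"^https://identifiers\.org/"  # illustrative prefix only
```

Returning a list of messages rather than raising on the first failure suits monitoring tools, which usually want the full report.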

Case study: Back to Workflows

Workflow description, Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain, iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc    ( a cwl:Workflow | a cwl:Tool )  rdfs:comment LITERAL

shouldHaveLabel  ( a cwl:Workflow | a cwl:Tool )  rdfs:label LITERAL

step             a cwl:Step  cwl:inputs shouldHaveFormat  cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File  dct:format ( iana | edam )

iana             IRI ^https://www.iana.org/assignments/media-types/

edam             IRI ^https://edamontology.org/format_  rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx
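The same profile can be approximated in plain Python to show what each shape asks for. This is a sketch under stated assumptions, not the ShEx used in the talk: it expects a CWL description already loaded into a dictionary, with inputs and outputs in map form and format identifiers already expanded to full IRIs. `check_description` and the list of generic names are hypothetical, and like the shapes above it only tests IRI prefixes, not whether a term really exists in EDAM.

```python
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"
# Both schemes appear on the slide, so both are accepted here.
EDAM_PREFIXES = ("http://edamontology.org/format_", "https://edamontology.org/format_")
GENERIC_NAMES = {"result", "input", "output"}  # illustrative list


def check_description(description):
    """Return the best-practice problems found in one tool or workflow."""
    problems = []
    # shouldHaveLabel / shouldHaveDoc
    if not description.get("label"):
        problems.append("missing label")
    if not description.get("doc"):
        problems.append("missing doc")
    for section in ("inputs", "outputs"):
        for name, param in description.get(section, {}).items():
            # Conceptual identifiers: reject uninformative names.
            if name in GENERIC_NAMES:
                problems.append(f"{section}/{name}: generic identifier")
            # shouldHaveFormat: every File needs an IANA or EDAM format IRI.
            if param.get("type") == "File":
                fmt = param.get("format", "")
                if not fmt.startswith((IANA_PREFIX,) + EDAM_PREFIXES):
                    problems.append(f"{section}/{name}: no IANA or EDAM format")
    return problems


well_described = {
    "class": "CommandLineTool",
    "label": "Sort lines",
    "doc": "Sorts the lines of a text file alphabetically.",
    "inputs": {"unsorted_text": {"type": "File", "format": IANA_PREFIX + "text/plain"}},
    "outputs": {"sorted_text": {"type": "File", "format": IANA_PREFIX + "text/plain"}},
}
```

A checker like this is easy to write for one community profile; the attraction of ShEx or SHACL is that the profile becomes data that generic validators can consume.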

ShEx is S-P-O (triple) testing, not graph link following

Info for constraints are:
• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Need to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource (“is it actually an Excel file”)

Can't apply nested RDF shapes to Linked Data resources

Can't say “Must be term from any resolvable ontology”

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators / unpackers to iterate over the RO

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”
• “Must be part of $fundedProject's Investigation”
• “All required data files must be provided”
• “Generic names should be avoided”

General Tools that do their best at unpacking and handing off
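The pre-processing step can be pictured with triples held as plain tuples: merge every graph found in the RO into one, then add derived annotations so that a shape validator has something to test. This is a toy sketch, not an RDF library; the function names are hypothetical, the ORCID is a placeholder, and the name table stands in for a real look-up.

```python
def merge_graphs(*graphs):
    """Union several collections of (subject, predicate, object) triples."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def add_author_names(graph, orcid_names):
    """Derive a name triple for each creator found in a look-up table."""
    derived = set(graph)
    for subject, predicate, obj in graph:
        if predicate == "schema:creator" and obj in orcid_names:
            derived.add((obj, "schema:name", orcid_names[obj]))
    return derived


PLACEHOLDER_ORCID = "https://orcid.org/0000-0000-0000-0000"  # not a real person

manifest_graph = {("/workflow.cwl", "schema:creator", PLACEHOLDER_ORCID)}
annotation_graph = {("/workflow.cwl", "rdfs:label", "Sort lines")}

single_graph = add_author_names(
    merge_graphs(manifest_graph, annotation_graph),
    {PLACEHOLDER_ORCID: "Alice"},  # stands in for a remote ORCID look-up
)
```

Once the derived triples are in the single graph, a shape that requires every creator to have a name can be checked without the validator following any links.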

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software. Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC / OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org JSON-LD RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don't be too clever about your titles

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: W3C PROV dependency graph with numbered edges, split into “Provlets”, crediting Alice, Bob and Charlie]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software, 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, v2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
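The idea of weighted, networked credit maps can be made concrete with a short calculation. This is a sketch of the general idea only, assuming each product carries a map of fractional weights over its contributors and components, and that the dependency graph has no cycles; the function name, the products and the weights are invented for illustration.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Accumulate each person's share of a product, following dependencies."""
    if totals is None:
        totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its share on
            # to whoever made it, scaled by the weight so far.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps; each map's weights sum to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "workflow": 0.25},
    "workflow": {"Charlie": 0.5, "Bob": 0.5},
}

shares = transitive_credit("paper", credit_maps)
```

With these weights Charlie, who never touched the paper, still receives an eighth of its credit through the workflow, and Bob's direct and indirect shares add up to 0.375.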

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team, colleagues in Manchester's Information Management Group, ELIXIR-UK, Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith



Page 26: The Rhetoric of Research Objects

Manifest ConstructionManifest

Identificationto locate things

Aggregatesto link things together

Annotationsabout things amp their relationships

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

RRI DOI URI ORCID

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

Manifest Construction

Identificationto locate and resolve things

Aggregatesto link things together

Annotationsabout things amp their relationships

RRI DOI URI ORCID

Structured ZIP-filebased on ePub (OCF) amp Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives InitiativeObject Exchange and Reuse

httpwwwresearchobjectorgspecifications

Manifest

Artists Impression

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here

Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided

Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana

httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format

Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more

httpsviewcommonwlorgabout

shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL

shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL

step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat

shouldHaveFormat cwlFile dctformat ( iana | edam )

iana IRI

^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf

lthttpedamontologyorgformat_1915gt

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following

Info for Constraints arebull Embedded in a specific format

ndash Extractconvert from domain-specific formats

bull Embedded in annotation resourcesndash Use existing schemaorg

annotations

bull Need to be acquired ndash eg URI look-ups (ORCID -gt author

name)

bull Custom amp hardcoded namespacesndash Pre-declare ontologies

ndash Add derived annotations post-processing

RDF must already be in a single graph

Canrsquot check if resource exists (eg 404)

Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)

Canrsquot apply nested RDF shapes to Linked Data resources

Canrsquot say ldquoMust be term from any resolvable ontology

Canrsquot check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators unpackers to iterate over the RO

Domain specific

bull ldquoMust have a workflow that analyses next-gen sequencing datardquo

bull ldquoMust be part of $fundedProjectrsquos Investigationrdquo

bull ldquoAll required data files must be providedrdquo

bull ldquoGeneric names should be avoidedrdquo

General Tools that do their best at unpacking and handing off

Did anyone take any notice

Research Object Bundles for Data Releasesas if they were softwareDataset ldquobuildrdquo tool

Standardised packing of Systems Biology models

European Space Agency RO LibraryEverest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Heath Learning Systems

Asthma Research e-Labsharing and computing statistical cohort studies

Precision medicineNGS pipelines regulation

Did anyone take any notice httpwwwyoutubecomwatchv=p-W4iLjLTrQamplist=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner Chair STMFuture Labs Committee CEO EVP Nature Publishing GroupDirector of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT January 2014 Lorenz Centre Leiden

Ted SlaterYARC OpenBEL

A trendhellip Using JSON(-LD) + schemaorg

httpsdokieli

httpslinkedresearchorg

Manifest Schemaorg JSON-LD RDFArchive targz

Reproducible Document Stack project

eLife Substance and Stencila

BagIT data profile + schemaorg JSON-LD annotations

We should have called this ldquoResearch Objectsrdquo Donrsquot be too clever about your titles

Combining ISA-based Research Objects with nanopublicatiionsComplementary approaches

Take-up Analogy Start Ups

Community

Driver

ToolsEasy to make

Hard to consume

Workflows

Reproducibility

Portability between

platforms

Platform amp user buy-in from the get-go

Passionate dedicated leadership

Stan

dard

s

Open Questions

• Stewardship: owners, sites, authors
• Spanning: platforms, researchers
• Lifecycle: composition, forking…
• Governance
• Credit: micro-credit & citation propagation, attribution
• Tamper proofing: blockchain, ethereum
• Maintenance: of evolving content; incrementality & degradation
• Manifests: profile & template making; auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping [Paolo Missier]

[Figure: W3C PROV dependency graph divided into "Provlets", with numeric weights on its edges and the contributors Alice, Bob and Charlie]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. "Contriponents"
   • contributors + components
2. Weighted contribution
3. Networked credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software, 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
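The idea of weighted contributions that travel with the components can be illustrated with a small calculation. This is a sketch under stated assumptions, not the method from the cited papers: the credit maps, product names and weights below are invented for illustration, each product is assumed to split a total weight of 1.0, and the dependency graph is assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical credit maps. Each product splits a weight of 1.0 between
# people and the other products it builds on.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "dataset": 0.5},
    "dataset": {"Charlie": 1.0},
}


def transitive_credit(product, maps, weight=1.0, totals=None):
    """Push a product's credit down to people, multiplying weights along
    each path. Assumes the graph has no cycles."""
    if totals is None:
        totals = defaultdict(float)
    for target, fraction in maps[product].items():
        share = weight * fraction
        if target in maps:
            # Another product: pass its share on to its own contributors.
            transitive_credit(target, maps, share, totals)
        else:
            # A person: accumulate their fractional credit.
            totals[target] += share
    return dict(totals)


paper_credit = transitive_credit("paper", credit_maps)
```

With these invented weights Charlie ends up with more credit for the paper than Bob, although neither is a direct author: the dataset is used both directly and through the workflow.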

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team; colleagues in Manchester's Information Management Group; ELIXIR-UK; Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 27: The Rhetoric of Research Objects

Manifest Construction

Identification: to locate and resolve things

Aggregates: to link things together

Annotations: about things & their relationships

RRI, DOI, URI, ORCID

Structured ZIP file based on ePub (OCF) & Adobe UCF specifications

W3C Web Annotation Vocabulary

Open Archives Initiative Object Reuse and Exchange

http://www.researchobject.org/specifications/
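The three manifest roles (identification, aggregation, annotation) and the structured ZIP packaging can be sketched as below. This is an illustration under assumptions: the field names, the `.ro/manifest.json` path and the `mimetype` value follow my reading of the RO Bundle specification and should be checked against it; the ORCID is a placeholder and the file names are invented.

```python
import io
import json
import zipfile


def build_manifest(resources, annotations, creator_orcid):
    """Assemble a minimal manifest: who made it, what it aggregates,
    and which annotations describe which resources."""
    return {
        "@context": ["https://w3id.org/bundle/context"],  # assumed context IRI
        "id": "/",
        "createdBy": {"orcid": creator_orcid},
        "aggregates": [{"uri": uri, "mediatype": mediatype} for uri, mediatype in resources],
        "annotations": [{"about": about, "content": content} for about, content in annotations],
    }


def write_bundle(manifest, files):
    """Write an in-memory ZIP with the mimetype entry first and uncompressed,
    as ePub-style container formats expect."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # A bare ZipInfo defaults to ZIP_STORED (no compression).
        archive.writestr(zipfile.ZipInfo("mimetype"), "application/vnd.wf4ever.robundle+zip")
        archive.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                         compress_type=zipfile.ZIP_DEFLATED)
        for path, content in files.items():
            archive.writestr(path.lstrip("/"), content, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


manifest = build_manifest(
    resources=[("/data/reads.csv", "text/csv")],
    annotations=[("/data/reads.csv", "annotations/reads-provenance.ttl")],
    creator_orcid="https://orcid.org/0000-0000-0000-0000",  # placeholder, not a real ORCID
)
bundle_bytes = write_bundle(manifest, {"/data/reads.csv": "id,count\n1,42\n"})
bundle_names = zipfile.ZipFile(io.BytesIO(bundle_bytes)).namelist()
```

Keeping the manifest as plain JSON means it can be read without any RDF tooling, while the context line lets the same file be interpreted as JSON-LD.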

Manifest

Artist's Impression

The real manifest

• A Manifest for 27 A4 pages …

RO manifest from FAIRDOM: https://doi.org/10.15490/seek.1.investigation.5

The need for embedded tools

Manifest Description Profiles

where it came from, its evolution

what else is needed

what should be there for types

Manifest

Project / Lab Specific

Community-based Types, Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

[Diagram labels: Data repository, Data repository, Training Resource (each with Bioschemas); Search engines, Registries, Data Aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity off-the-shelf tools, app eco-system, lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & Events, Laboratory protocols, Workflows and Tools

See Alasdair Gray's Poster

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. "starts with "

Check JSON structure
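The three manifest checks listed above (JSON structure, URI patterns, cross-references between identifiers) could look roughly like this in a generic tool. It is a sketch under assumptions: the required keys and the allowed URI prefixes are hypothetical choices for illustration, since the slide leaves the actual prefix unstated.

```python
import json

# Assumed minimal structure and URI policy; a real profile would supply these.
REQUIRED_KEYS = {"id", "aggregates", "annotations"}
ALLOWED_PREFIXES = ("https://doi.org/", "https://identifiers.org/")


def check_manifest(text):
    """Return a list of problems found in a manifest given as JSON text."""
    # 1. JSON structure
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - manifest.keys())]

    # 2. URI patterns: remote resources must start with an allowed prefix
    aggregates = manifest.get("aggregates", [])
    known = {entry.get("uri") for entry in aggregates}
    for entry in aggregates:
        uri = entry.get("uri", "")
        if uri.startswith(("http://", "https://")) and not uri.startswith(ALLOWED_PREFIXES):
            problems.append(f"unexpected URI pattern: {uri}")

    # 3. Cross-references: annotations must be about aggregated resources
    for annotation in manifest.get("annotations", []):
        if annotation.get("about") not in known:
            problems.append(f"annotation about unknown resource: {annotation.get('about')}")
    return problems


example = json.dumps({
    "id": "/",
    "aggregates": [
        {"uri": "/data/reads.fastq"},
        {"uri": "https://example.org/model.xml"},
    ],
    "annotations": [
        {"about": "/data/reads.fastq", "content": "annotations/reads.ttl"},
        {"about": "/data/missing.csv", "content": "annotations/missing.ttl"},
    ],
})
problems = check_manifest(example)
```

None of these checks resolves a URI or opens a file, so they say nothing about whether a resource exists or has the format it claims.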

Different levels, from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together

Case study: Back to Workflows

Workflow description, Tool description

EDAM Ontology, SWO Ontology, Data Formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as "result" or "input"/"output" should be avoided. Helpful identifiers allow the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation, if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of bioinformatics tools. For plain types, use the IANA media type list with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }, for example iana:text/plain or iana:text/tab-separated-values.

The use of formal standards for format fields enables implementations to check that file formats are compatible. Ontologies will be parsed, and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx
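The same three profile rules (should have a label, should have a doc, File parameters should have an IANA or EDAM format) can be approximated in plain Python over an already-parsed CWL document. This is a sketch, not the ShEx schema from the slide and not part of CWL Viewer: it assumes the CWL file has been loaded into a dict, that `inputs` and `outputs` use the map form, and that format identifiers are already expanded to full IRIs.

```python
import re

# Prefix patterns modelled on the profile above; the EDAM pattern accepts
# both http and https because published EDAM IRIs vary in scheme.
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/.+")
EDAM = re.compile(r"^https?://edamontology\.org/format_\d+$")


def check_profile(doc):
    """Return profile warnings for one CWL Workflow or CommandLineTool."""
    warnings = []
    if doc.get("class") in ("Workflow", "CommandLineTool"):
        if not doc.get("label"):
            warnings.append("shouldHaveLabel: missing label")
        if not doc.get("doc"):
            warnings.append("shouldHaveDoc: missing doc")
    for section in ("inputs", "outputs"):
        for name, param in doc.get(section, {}).items():
            # Only File-typed parameters are expected to declare a format.
            if not isinstance(param, dict) or param.get("type") != "File":
                continue
            fmt = param.get("format", "")
            if not (IANA.match(fmt) or EDAM.match(fmt)):
                warnings.append(f"shouldHaveFormat: {section}/{name} has no IANA or EDAM format")
    return warnings


# Illustrative document; the identifiers and format IRIs are examples only.
example_doc = {
    "class": "Workflow",
    "label": "Align reads",
    "inputs": {
        "reads": {"type": "File", "format": "http://edamontology.org/format_1930"},
        "reference": {"type": "File"},
        "threads": {"type": "int"},
    },
    "outputs": {
        "report": {
            "type": "File",
            "format": "https://www.iana.org/assignments/media-types/text/plain",
        },
    },
}
profile_warnings = check_profile(example_doc)
```

Like the ShEx version, this only tests the shape of the IRI. It does not confirm that the term exists in the EDAM ontology or is a subclass of the root format class.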

ShEx is SPO (subject-predicate-object) testing, not graph link following

Information for constraints is:

• Embedded in a specific format
  – Extract/convert from domain-specific formats
• Embedded in annotation resources
  – Use existing schema.org annotations
• Needs to be acquired
  – e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  – Pre-declare ontologies
  – Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources



Page 29: The Rhetoric of Research Objects

The real manifest

bull A Manifest

for 27 A4 pages hellip

RO manifest from FAIRDOMhttpsdoiorg1015490seek1investigation5

The need for embedded tools

Manifest Description Profiles

where it came fromits evolution

what else is needed

what should be there for types

Manifest

Project LabSpecific

Community-based TypesContext

All

VoID

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here

Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided

Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana

httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format

Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more

httpsviewcommonwlorgabout

shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL

shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL

step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat

shouldHaveFormat cwlFile dctformat ( iana | edam )

iana IRI

^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf

lthttpedamontologyorgformat_1915gt

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following

Info for Constraints arebull Embedded in a specific format

ndash Extractconvert from domain-specific formats

bull Embedded in annotation resourcesndash Use existing schemaorg

annotations

bull Need to be acquired ndash eg URI look-ups (ORCID -gt author

name)

bull Custom amp hardcoded namespacesndash Pre-declare ontologies

ndash Add derived annotations post-processing

RDF must already be in a single graph

Can't check if a resource exists (e.g. 404)

Can't test the format/representation of a resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be a term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to a single graph

Bespoke validators, unpackers to iterate over the RO
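The "merge to a single graph" pre-processing step can be sketched as follows. This is an assumption-laden illustration: graphs are plain collections of (subject, predicate, object) string triples, `merge_graphs` is a hypothetical helper, and any term starting with `_:` is treated as a blank node (so a literal that happens to start with `_:` would be mishandled).

```python
def merge_graphs(graphs):
    """Merge several graphs into one set of triples.

    graphs maps a source name (e.g. a file inside the RO) to its triples.
    Blank node labels are only unique within one source, so they are
    renamed per source to stop unrelated nodes from being fused.
    """
    merged = set()
    for source, triples in graphs.items():
        def scoped(term, source=source):
            if term.startswith("_:"):
                return f"_:{source}.{term[2:]}"
            return term

        for s, p, o in triples:
            merged.add((scoped(s), p, scoped(o)))
    return merged


if __name__ == "__main__":
    single_graph = merge_graphs({
        "manifest": [("_:b0", "rdfs:label", "A")],
        "annotations": [("_:b0", "rdfs:label", "B"), ("ex:wf", "rdfs:comment", "doc")],
    })
    for triple in sorted(single_graph):
        print(triple)
```

Once merged, the single graph can be handed to a shape validator; the existence and format checks listed above still need separate, bespoke tooling.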

Domain specific

• "Must have a workflow that analyses next-gen sequencing data"

• "Must be part of $fundedProject's Investigation"

• "All required data files must be provided"

• "Generic names should be avoided"

General tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for data releases, as if they were software: dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI, USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 (STM Innovations Seminar 2012)

Howard Ratner: Chair, STM Future Labs Committee; CEO, EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of the US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org, JSON-LD, RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this "Research Objects". Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up analogy: start-ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: W3C PROV dependency graph; "Provlets"; contributors Alice, Bob and Charlie]

Paolo Missier. Data Trajectories: tracking reuse of published data for transitive credit attribution. IDCC 2016.

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. "Contriponents"
   • contributors + components

2. Weighted contribution

3. Networked credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software, 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D.S. Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software, v2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
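The idea of weighted, networked credit maps can be sketched in a few lines. This is only an illustration of the general principle of passing fractional credit through dependencies, not the encoding used in the cited papers: the `credit_maps` structure, the names and the fractions are all made up, and the function assumes the dependency graph has no cycles.

```python
def transitive_credit(credit_maps, product, weight=1.0, totals=None):
    """Distribute `weight` units of credit for `product` to people.

    credit_maps maps each product to {contributor_or_product: fraction}.
    A share given to another product is passed on to that product's own
    contributors, so indirect contributions also receive credit.
    """
    totals = {} if totals is None else totals
    for name, fraction in credit_maps[product].items():
        share = weight * fraction
        if name in credit_maps:
            # `name` is itself a product (a "contriponent"): recurse into its map.
            transitive_credit(credit_maps, name, share, totals)
        else:
            totals[name] = totals.get(name, 0.0) + share
    return totals


# Made-up example: a paper that relies on a workflow built by Bob and Charlie.
CREDIT_MAPS = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "workflow": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
}

if __name__ == "__main__":
    print(transitive_credit(CREDIT_MAPS, "paper"))
```

In this example Charlie never touched the paper but still receives a share, because the credit map travels with the workflow that the paper reuses.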

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team

Colleagues in Manchester's Information Management Group

ELIXIR-UK, Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 30: The Rhetoric of Research Objects

The need for embedded tools

Manifest Description Profiles

where it came from, its evolution

what else is needed

what should be there for types

Manifest

Project / Lab Specific

Community-based Types

Context

All

VoID

OmicsDI

Trend: JSON(-LD) + Schemas

Manifest: schema.org tailored to the Biosciences

[Figure: data repositories and a training resource, each with Bioschemas mark-up, alongside search engines, registries and data aggregators]

Standardised metadata mark-up

Metadata published and harvested without APIs or special feeds

Commodity off-the-shelf tools, app eco-system, lightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials & events, laboratory protocols, workflows and tools

See Alasdair Gray's poster

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including the GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles

Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations

• Validate the profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. "starts with "

Check JSON structure
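The three manifest checks listed above can be sketched with the standard library alone. The layout used here (an `aggregates` list of objects with a `uri`, and an `annotations` list whose entries point at a resource through `about`) is a simplified assumption loosely modelled on a Research Object manifest, and `check_manifest`, the example URIs and the required prefix are hypothetical.

```python
import json


def check_manifest(manifest_text, uri_prefix):
    """Return a list of problems; an empty list means every check passed."""
    manifest = json.loads(manifest_text)
    problems = []

    # 1. JSON structure: both sections must be present and must be lists.
    for key in ("aggregates", "annotations"):
        if not isinstance(manifest.get(key), list):
            problems.append(f"'{key}' must be a list")
    if problems:
        return problems

    # 2. URI pattern: every aggregated resource must start with the prefix.
    aggregated = set()
    for entry in manifest["aggregates"]:
        uri = entry.get("uri", "")
        aggregated.add(uri)
        if not uri.startswith(uri_prefix):
            problems.append(f"unexpected URI: {uri}")

    # 3. Cross-reference: annotations may only describe aggregated resources.
    for annotation in manifest["annotations"]:
        about = annotation.get("about")
        if about not in aggregated:
            problems.append(f"annotation about unknown resource: {about}")
    return problems


EXAMPLE_MANIFEST = json.dumps({
    "aggregates": [
        {"uri": "https://example.org/workflow.cwl"},
        {"uri": "ftp://example.org/reads.fastq"},
    ],
    "annotations": [
        {"about": "https://example.org/workflow.cwl", "content": "annotations/workflow.ttl"},
        {"about": "https://example.org/missing.csv", "content": "annotations/missing.ttl"},
    ],
})

if __name__ == "__main__":
    for problem in check_manifest(EXAMPLE_MANIFEST, "https://"):
        print(problem)
```

Checks of this kind are cheap because they need only the manifest itself, not the aggregated resources.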

Different levels: from whole studies to complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together

Case study: Back to Workflows

Workflow description, tool description

EDAM Ontology, SWO Ontology, data formats, Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top-level short label summarising each tool and workflow. Labels give the user an easy, human-readable version of the name for the tool or workflow. For workflows, this will be displayed at the top of the page as the title; for tools, it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top-level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step, if preferred.

Doc Strings

If useful, include a top-level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows, this will be displayed at the top of the page under the title; for tools, it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top-level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step, if preferred.
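The priority rule for labels and docs can be written down as a small lookup. This is a sketch of the rule as described in the text, not the CWL Viewer's actual code: `display_text` is a hypothetical helper, and falling back to the step identifier when no label exists is an assumption.

```python
def display_text(step_id, step, tool, field="label"):
    """Pick the text shown for a step.

    A value given at the step level wins over the top-level value of the
    tool the step runs. For labels only, the step identifier is used as a
    last resort (assumed behaviour).
    """
    value = step.get(field) or tool.get(field)
    if not value and field == "label":
        return step_id
    return value


if __name__ == "__main__":
    tool = {"label": "Sort table", "doc": "Sorts the rows of a table."}
    step = {"label": "Sort variants by position"}
    print(display_text("sort_1", step, tool))
    print(display_text("sort_1", {}, tool))
    print(display_text("sort_1", {}, {}))
    print(display_text("sort_1", step, tool, field="doc"))
```

The same function serves both fields because the text describes the same step-over-tool priority for labels and doc strings.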

Page 32: The Rhetoric of Research Objects

OmicsDI

Trend JSON(-LD) + Schemas

Manifest schemaorg tailored to the Biosciences

Datarepository

Datarepository

TrainingResource

Bioschemas BioschemasBioschemas

Search engines

RegistriesData

Aggregators

Standardised metadatamark-up

Metadata published and harvested without APIs or special feeds

CommodityOff the Shelf toolsApp eco-systemLightweight

Sample Catalogue

BBMRI-ERIC Directory

Training materials amp EventsLaboratory protocols Workflows and Tools

See Alasdair Grayrsquos Poster

Manifest schemaorg tailored to the Biosciences

13 public datasets marked up including Gigascience data journal

Minimum informationfor one content type

Common propertiesamong content

types

Manifest Description ProfilesManifest

Minim model for defining checklists

Gamble Zhao Klyne Goble MIM A Minimum Information Model Vocabulary and Framework for Scientific Linked Data IEEE eScience 2012 Chicago USA October 2012) httpdxdoiorg101109eScience20126404489

httppurlorgminimdescription

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here

Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided

Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana

httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format

Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more

https://view.commonwl.org/about
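The best practices above can be approximated as a small lint pass. What follows is a minimal sketch in Python, assuming the CWL description has already been parsed into a plain dictionary with inputs and outputs in map form. The function name `lint_cwl`, the `GENERIC_NAMES` list and the example tool are illustrative assumptions, not part of CWL Viewer.

```python
# Illustrative list of uninformative identifiers (an assumption, not a CWL rule).
GENERIC_NAMES = {"result", "input", "output", "in", "out"}


def lint_cwl(description):
    """Return best-practice warnings for a parsed CWL tool or workflow."""
    warnings = []
    # Label Strings and Doc Strings: both are expected at the top level.
    if not description.get("label"):
        warnings.append("missing top-level label")
    if not description.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for identifier, param in description.get(section, {}).items():
            # Conceptual Identifiers: flag generic names.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}: generic identifier '{identifier}'")
            # Format Specification: every File should declare a format.
            is_file = isinstance(param, dict) and param.get("type") == "File"
            if is_file and not param.get("format"):
                warnings.append(f"{section}: File '{identifier}' has no format")
    return warnings


# Hypothetical tool description used only to exercise the checks.
tool = {
    "class": "CommandLineTool",
    "label": "Sort alignments",
    "inputs": {
        "input": {"type": "File"},
        "aligned_reads": {"type": "File", "format": "edam:format_2572"},
    },
    "outputs": {"sorted_reads": {"type": "File", "format": "edam:format_2572"}},
}

for warning in lint_cwl(tool):
    print(warning)
```

A check like this covers only what is stated in the description itself; whether a tool truly performs a single operation (Separation of Concerns) still needs human review.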

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>
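The URI-pattern part of the shapes above can be sketched outside a ShEx engine. The Python below is a minimal illustration, assuming format values are already expanded to full IRIs. The function name `format_kind` is hypothetical, and accepting both http and https EDAM prefixes is an assumption made because the slide shows both spellings.

```python
# Prefixes taken from the shapes above; the http/https pair for EDAM is an assumption.
IANA_PREFIX = "https://www.iana.org/assignments/media-types/"
EDAM_PREFIXES = (
    "http://edamontology.org/format_",
    "https://edamontology.org/format_",
)


def format_kind(format_iri):
    """Classify a format IRI as 'iana' or 'edam' by prefix, or None if neither."""
    if format_iri.startswith(IANA_PREFIX):
        return "iana"
    if format_iri.startswith(EDAM_PREFIXES):
        return "edam"
    return None


print(format_kind("https://www.iana.org/assignments/media-types/text/plain"))
print(format_kind("http://edamontology.org/format_1915"))
print(format_kind("http://example.org/my-format"))
```

A prefix test says nothing about whether the term exists in the ontology or is a subclass of format_1915; that would require loading the ontology itself.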

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing, not Graph Link Following

Info for Constraints are:

• Embedded in a specific format
  - Extract/convert from domain-specific formats

• Embedded in annotation resources
  - Use existing schema.org annotations

• Need to be acquired, e.g. URI look-ups (ORCID -> author name)

• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators and unpackers to iterate over the RO

Domain specific

• "Must have a workflow that analyses next-gen sequencing data"

• "Must be part of $fundedProject's Investigation"

• "All required data files must be provided"

• "Generic names should be avoided"

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases as if they were software: Dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 (STM Innovations Seminar 2012)

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: schema.org JSON-LD RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this "Research Objects". Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools: Easy to make, Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: credit map diagram with numeric labels; contributors Alice, Charlie and Bob]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

W3C PROV dependency graph

"Provlets"

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. "Contriponents"

• contributors + components

2. Weighted contribution

3. Networked Credit maps

• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz, D.S. & Smith, A.M. (2015). Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D.S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
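The idea of weighted contributions propagating through a network of credit maps can be made concrete with a small calculation. The following Python is a minimal sketch under stated assumptions: each product has a credit map whose weights sum to 1, entries that name another product pass their share on to that product's own contributors, and the dependency graph has no cycles. The function name `transitive_credit`, the product names and all weights are hypothetical; only the contributor names come from the slide.

```python
def transitive_credit(product, credit_maps, scale=1.0, totals=None):
    """Accumulate each person's fractional credit for `product`.

    credit_maps maps a product name to {contributor_or_component: weight}.
    Assumes weights per product sum to 1 and the graph is acyclic.
    """
    if totals is None:
        totals = {}
    for name, weight in credit_maps[product].items():
        if name in credit_maps:
            # A component with its own credit map: pass the share down.
            transitive_credit(name, credit_maps, scale * weight, totals)
        else:
            # A person: add their scaled share.
            totals[name] = totals.get(name, 0.0) + scale * weight
    return totals


# Hypothetical credit maps: a paper that depends on a workflow.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "workflow": 0.25},
    "workflow": {"Bob": 0.5, "Charlie": 0.5},
}

print(transitive_credit("paper", credit_maps))
# {'Alice': 0.5, 'Bob': 0.375, 'Charlie': 0.125}
```

Here Charlie never touched the paper directly but still receives 0.125 of its credit through the workflow, which is the indirect contribution the slide refers to.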

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects
researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK
Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

Page 33: The Rhetoric of Research Objects

Training materials & Events, Laboratory protocols, Workflows and Tools

See Alasdair Gray's Poster

Manifest: schema.org tailored to the Biosciences

13 public datasets marked up, including GigaScience data journal

Minimum information for one content type

Common properties among content types

Manifest Description Profiles
Manifest

Minim model for defining checklists

Gamble, Zhao, Klyne, Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. IEEE eScience 2012, Chicago, USA, October 2012. http://dx.doi.org/10.1109/eScience.2012.6404489

http://purl.org/minim/description

Validation and Monitoring Tools: rich, RDF-based, generated from the workflow systems

Bespoke tooling, SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations

• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. "starts with"

Check JSON Structure
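The three manifest checks listed above (JSON structure, URI patterns, cross-references between identifiers) can be illustrated with a few lines of Python. This is a minimal sketch, assuming a simplified manifest with an `aggregates` list of `{"uri": ...}` entries and an `annotations` list whose `about` value names an aggregated resource. The function name `check_manifest`, the `required_prefix` rule and the example paths are hypothetical and stand in for whatever a given profile would require.

```python
import json


def check_manifest(manifest_text, required_prefix="/"):
    """Return a list of problems found in a simplified manifest.json."""
    manifest = json.loads(manifest_text)
    problems = []
    # JSON structure: both top-level lists must be present.
    for key in ("aggregates", "annotations"):
        if not isinstance(manifest.get(key), list):
            problems.append(f"'{key}' must be a list")
    if problems:
        return problems
    # URI pattern: every aggregated URI must start with the required prefix.
    aggregated = set()
    for entry in manifest["aggregates"]:
        uri = entry.get("uri", "")
        aggregated.add(uri)
        if not uri.startswith(required_prefix):
            problems.append(
                f"aggregated URI '{uri}' does not start with '{required_prefix}'"
            )
    # Cross-reference: annotations may only describe aggregated resources.
    for annotation in manifest["annotations"]:
        target = annotation.get("about")
        if target not in aggregated:
            problems.append(f"annotation target '{target}' is not aggregated")
    return problems


# Hypothetical manifest used only to exercise the checks.
example = json.dumps({
    "aggregates": [{"uri": "/workflow/align.cwl"}, {"uri": "data/reads.fastq"}],
    "annotations": [
        {"about": "/workflow/align.cwl", "content": "annotations/align.ttl"},
        {"about": "/workflow/missing.cwl", "content": "annotations/x.ttl"},
    ],
})

for problem in check_manifest(example):
    print(problem)
```

Checks of this kind work on the manifest alone; they do not open the aggregated files or resolve remote URIs.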

Different levels, from Whole studies to Complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together

Case study: Back to Workflows

Workflow description
Tool description

EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org

Community-led standard way of expressing and running workflows and the command line tools they orchestrate. Supports containers for portability.

Based on wf4ever wfdesc

• Richly described
• Multi-tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best Practices
To ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.



Page 35: The Rhetoric of Research Objects

Validation and Monitoring Toolsrich RDF-based generated from the workflow systems

Bespoke tooling SPIN-based checking

How can we express the Syntax and Semantics of Profiles to make generic tools

bull Use RDF shapes (SHACL ShEx) to capture requirements amp consumer expectationsbull Validate profile using a ShEx schema and off-the-shelf validators (eg Validata)

Manifest construction Check cross-reference

constraints on identifiers

Check URI patterns eg ldquostarts with rdquo

Check JSON Structure

Different levelsfrom Whole studies to Complex types

identifiersorg

PROV

JSON

manifestjson

httpsdoiorg101109BigData20167840618

The manifest ties everything together

Case study Back to Workflows

Workflow descriptionTool description

EDAM OntologySWO OntologyData FormatsBioschemasorg

Community led standard way of expressing and running workflows and the command line tools they orchestrateSupports containers for portability

Based on wf4ever wfdesc

bull Richly describedbull Multi tiered descriptionsbull Lots of filesbull CWL in RDFhellipbull CWL vocabulary for

workflow structure matches 11 with YAML

bull schemaorg annotations

Download as a Research Object Bundle

Over an active githubentry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best PracticesIn order to ensure that your workflow is well presented in CWL Viewer we recommend the following of CWL Best Practices Those which are specifically relevant to the viewer are detailed below but it is suggested that you try to meet as many as possible to include the general quality and reproducibility of your workflowsSome limitations of the CWL Viewer which you may need to be aware of are also described here

Label StringsInclude a top level short label summarising each tool and workflowLabels give the user an easy human-readable version of the name for the tool or workflowFor workflows this will be displayed at the top of the page as the title and for tools it will be displayed in the table and as the name of the step in the visualisation If a label is given at the step level it will take priority over the top level tool label You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Doc StringsIf useful include a top level doc string providing a longer more detailed description than was provided in the label (see above)Docs give the user a detailed description of the role a tool or workflow performsFor workflows this will be displayed at the top of the page under the title and for tools it will be displayed in the table If a doc string is given at the step level it will take priority over the top level tool doc You can use this to provide a more descriptive label of the tools application in the particular step if preferred

Conceptual IdentifiersAll input and output identifiers should reflect their conceptual identity Generic and uninformative names such as result or inputoutput should be avoidedHelpful identifiers allow for the links between steps in the CWL file to be easily distinguishedIdentifiers are displayed in the tables and are unique to the step The label is also used as a replacement for the identifier in the visualisation if provided

Format SpecificationThe format field should be specified for all input and output FilesTools should use format identifiers from a relevant ontology such as the EDAM Ontology in the case of Bioinformatics tools For plain types use the IANA media type list with$namespaces iana

httpswwwianaorgassignmentsmedia-types for example ianatextplain ianatexttab-separated-values

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of filesOntologies will be parsed and the name of and link to the format displayed in the table on workflow pages Plain formats will have the ianaorg link given but will not display the name of the format

Separation of ConcernsEach CommandLineTool description should focus on a single operation only even if the (sub)command is capable of more

httpsviewcommonwlorgabout

shouldHaveDoc ( a cwlWorkflow | a cwlTool ) rdfscomment LITERAL

shouldHaveLabel ( a cwlWorkflow | a cwlTool ) rdfslabel LITERAL

step a cwlStep cwlinputs shouldHaveFormat cwloutputs shouldHaveFormat

shouldHaveFormat cwlFile dctformat ( iana | edam )

iana IRI

^httpswwwianaorgassignmentsmedia-typesedam IRI^httpsedamontologyorgformat_rdfssubClassOf

lthttpedamontologyorgformat_1915gt

Capturing Common Workflow Language Profile as ShEx

ShEx is SPO testing not Graph Link Following

Info for Constraints arebull Embedded in a specific format

ndash Extractconvert from domain-specific formats

bull Embedded in annotation resourcesndash Use existing schemaorg

annotations

bull Need to be acquired ndash eg URI look-ups (ORCID -gt author

name)

bull Custom amp hardcoded namespacesndash Pre-declare ontologies

ndash Add derived annotations post-processing

RDF must already be in a single graph

Canrsquot check if resource exists (eg 404)

Canrsquot test formatrepresentation of resource (ldquois it actually an Excel filerdquo)

Canrsquot apply nested RDF shapes to Linked Data resources

Canrsquot say ldquoMust be term from any resolvable ontology

Canrsquot check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators unpackers to iterate over the RO
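The pre-processing step can be sketched as follows: merge the triples from every annotation resource into one graph, then let a bespoke check iterate over the Research Object to find described resources that are not actually present. This is a rough Python sketch under stated assumptions: the function names, the file paths and the tuple representation of triples are all hypothetical, and a local membership test stands in for a real existence check.

```python
# Hypothetical sketch: merge annotation graphs, then look for subjects that
# the Research Object describes but does not aggregate.
def merge_annotations(annotations):
    """Union the triples of every annotation resource into one graph."""
    merged = set()
    for triples in annotations.values():
        merged |= set(triples)
    return merged


def dangling_subjects(merged, aggregated_paths):
    """Subjects described in the merged graph that the RO does not aggregate."""
    return sorted({s for (s, _p, _o) in merged} - set(aggregated_paths))


# Made-up annotation resources keyed by their path inside the RO.
annotations = {
    "annotations/workflow.ttl": [
        ("workflow.cwl", "rdfs:label", "Align"),
    ],
    "annotations/data.ttl": [
        ("data/reads.tsv", "dct:format", "iana:text/tab-separated-values"),
        ("data/missing.tsv", "dct:format", "iana:text/plain"),
    ],
}
aggregated = ["workflow.cwl", "data/reads.tsv"]

merged = merge_annotations(annotations)
dangling = dangling_subjects(merged, aggregated)
```

Shape testing would then run over `merged`, while `dangling` covers the kind of question a shape alone cannot answer.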

Domain specific

• “Must have a workflow that analyses next-gen sequencing data”

• “Must be part of $fundedProject’s Investigation”

• “All required data files must be provided”

• “Generic names should be avoided”

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases
as if they were software
Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library
Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab
sharing and computing statistical cohort studies

Precision medicine
NGS pipelines regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner, Chair, STM Future Labs Committee; CEO EVP Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater
YARC OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD RDF
Archive: tar.gz

Reproducible Document Stack project

eLife Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications
Complementary approaches

Take-up Analogy: Start-Ups

Community

Driver

Tools
Easy to make
Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what?
Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: numbered graph with contributors Alice, Charlie and Bob]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

W3C PROV dependency graph

“Provlets”

Granularity Atomicity Aggregation

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components

2. Weighted contribution

3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution
[Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
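The idea of weighted contributions that propagate through components can be sketched in a few lines of Python. This is an illustrative reading of transitive credit, not the authors' implementation: each product has a credit map whose weights sum to 1, entries are either people or other components, and a component's share is passed on to its own contributors. The function name, the product names and all weights are made up; only the contributor names echo the figure.

```python
# Hypothetical sketch: propagate fractional credit through nested credit maps.
def transitive_credit(product, credit_maps, _seen=()):
    """Return {person: fraction} for a product, following its components."""
    if product in _seen:
        raise ValueError(f"cyclic credit map at {product!r}")
    totals = {}
    for name, weight in credit_maps[product].items():
        if name in credit_maps:  # a component with its own credit map
            inner = transitive_credit(name, credit_maps, _seen + (product,))
            for person, share in inner.items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:  # a person receives the weight directly
            totals[name] = totals.get(name, 0.0) + weight
    return totals


# Made-up credit maps: a paper built on a workflow that reuses a dataset.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.5},
    "workflow": {"Bob": 0.75, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}

credit = transitive_credit("paper", credit_maps)
```

With these example weights, half of the paper's credit stays with Alice and the other half is split between Bob and Charlie according to the workflow's own map, so the fractions still sum to 1.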

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects
researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team
Colleagues in Manchester’s Information Management Group
ELIXIR-UK Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith

How can we express the Syntax and Semantics of Profiles to make generic tools?

• Use RDF shapes (SHACL, ShEx) to capture requirements & consumer expectations
• Validate profile using a ShEx schema and off-the-shelf validators (e.g. Validata)

Manifest construction

Check cross-reference constraints on identifiers

Check URI patterns, e.g. “starts with ”

Check JSON Structure
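The three manifest checks can be sketched together in Python using only the standard library. This is a simplified illustration, not a Research Object validator: it assumes a manifest with an `aggregates` list of `{"uri": ...}` entries and an `annotations` list whose `about` is a single string, and the function name, the example URIs and the required prefix are all hypothetical.

```python
# Hypothetical sketch: structure, cross-reference and URI-pattern checks
# over a manifest supplied as JSON text.
import json


def check_manifest(manifest_text, uri_prefix):
    """Return a list of problems found in the manifest."""
    manifest = json.loads(manifest_text)  # raises ValueError on malformed JSON
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    aggregates = manifest.get("aggregates")
    if not isinstance(aggregates, list):
        return ["'aggregates' must be a list"]

    problems = []
    known = {entry.get("uri") for entry in aggregates if isinstance(entry, dict)}

    # Cross-reference constraint: annotations may only be about aggregated resources.
    for annotation in manifest.get("annotations", []):
        about = annotation.get("about")
        if about not in known:
            problems.append(f"annotation about unknown resource: {about}")

    # URI pattern constraint: absolute URIs must start with the agreed prefix.
    for uri in sorted(u for u in known if u and "://" in u):
        if not uri.startswith(uri_prefix):
            problems.append(f"URI outside {uri_prefix}: {uri}")
    return problems


# Made-up manifest with one dangling annotation and one off-pattern URI.
example = json.dumps({
    "aggregates": [
        {"uri": "/workflow.cwl"},
        {"uri": "https://example.org/data/reads.tsv"},
        {"uri": "http://other.example/table.csv"},
    ],
    "annotations": [
        {"about": "/workflow.cwl", "content": "annotations/workflow.ttl"},
        {"about": "/missing.txt", "content": "annotations/missing.ttl"},
    ],
})

problems = check_manifest(example, "https://example.org/")
```

Checks of this kind need no RDF tooling at all, which is one reason to run them on the JSON before any shape validation.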

Different levels
from Whole studies to Complex types

identifiers.org

PROV

JSON

manifest.json

https://doi.org/10.1109/BigData.2016.7840618

The manifest ties everything together

Case study: Back to Workflows

Workflow description
Tool description

EDAM Ontology
SWO Ontology
Data Formats
Bioschemas.org

Community led standard way of expressing and running workflows and the command line tools they orchestrate
Supports containers for portability

Based on wf4ever wfdesc

• Richly described
• Multi tiered descriptions
• Lots of files
• CWL in RDF…
• CWL vocabulary for workflow structure matches 1:1 with YAML
• schema.org annotations

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in a RO CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more descriptive label of the tool’s application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.
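The label, doc and identifier recommendations can be expressed as a small lint, together with the stated priority of step-level labels over tool-level ones. This Python sketch is not part of CWL Viewer: the function names, the dictionary layout of the description and the list of generic names are assumptions chosen to mirror the wording of the recommendations.

```python
# Hypothetical sketch: lint a tool or workflow description and choose the
# label to display for a step.
GENERIC_NAMES = {"result", "input", "output"}


def lint_description(description):
    """Return warnings for one tool or workflow description."""
    warnings = []
    if not description.get("label"):
        warnings.append("missing top-level label")
    if not description.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for identifier in description.get(section, {}):
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"generic {section[:-1]} identifier: {identifier}")
    return warnings


def display_label(step, tool):
    """A step-level label wins over the tool label; fall back to the step id."""
    return step.get("label") or tool.get("label") or step["id"]


# Made-up tool description with a label but no doc, and two generic names.
example_tool = {
    "class": "CommandLineTool",
    "label": "Sort alignments",
    "inputs": {"input": {"type": "File"}, "unsorted_alignments": {"type": "File"}},
    "outputs": {"result": {"type": "File"}},
}

warnings = lint_description(example_tool)
```

For example, `display_label({"id": "sort", "label": "Sort by coordinate"}, example_tool)` would pick the step-level label, while a step without one falls back to the tool's label.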

Page 39: The Rhetoric of Research Objects

Download as a Research Object Bundle

Over an active GitHub entry for an actively developing workflow

Permalink to snapshot the GitHub entry and RO identifier

Common Workflow Language Viewer

CWL files packaged in an RO: CWL RO + added richness

Lift out parts into the manifest

Best Practices

In order to ensure that your workflow is well presented in CWL Viewer, we recommend following the CWL Best Practices. Those which are specifically relevant to the viewer are detailed below, but it is suggested that you try to meet as many as possible to improve the general quality and reproducibility of your workflows. Some limitations of the CWL Viewer which you may need to be aware of are also described here.

Label Strings

Include a top level short label summarising each tool and workflow. Labels give the user an easy human-readable version of the name for the tool or workflow. For workflows this will be displayed at the top of the page as the title, and for tools it will be displayed in the table and as the name of the step in the visualisation. If a label is given at the step level, it will take priority over the top level tool label. You can use this to provide a more descriptive label of the tool's application in the particular step if preferred.

Doc Strings

If useful, include a top level doc string providing a longer, more detailed description than was provided in the label (see above). Docs give the user a detailed description of the role a tool or workflow performs. For workflows this will be displayed at the top of the page under the title, and for tools it will be displayed in the table. If a doc string is given at the step level, it will take priority over the top level tool doc. You can use this to provide a more detailed description of the tool's application in the particular step if preferred.

Conceptual Identifiers

All input and output identifiers should reflect their conceptual identity. Generic and uninformative names such as result or input/output should be avoided. Helpful identifiers allow for the links between steps in the CWL file to be easily distinguished. Identifiers are displayed in the tables and are unique to the step. The label is also used as a replacement for the identifier in the visualisation if provided.

Format Specification

The format field should be specified for all input and output Files. Tools should use format identifiers from a relevant ontology, such as the EDAM Ontology in the case of Bioinformatics tools. For plain types use the IANA media type list, with $namespaces declaring iana as https://www.iana.org/assignments/media-types/ (for example iana:text/plain, iana:text/tab-separated-values).

The use of formal standards for format fields enables implementations to provide checks for compatibility in formatting of files. Ontologies will be parsed and the name of and link to the format displayed in the table on workflow pages. Plain formats will have the iana.org link given but will not display the name of the format.

Separation of Concerns

Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more.

https://view.commonwl.org/about
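The best practices above can be checked mechanically. What follows is only a minimal sketch, not part of CWL Viewer: it assumes a tool or workflow description has already been loaded into a Python dictionary with map-style `inputs` and `outputs`, and the names `lint_description` and `GENERIC_NAMES` are hypothetical. The doc string check is advisory, since the guidance says "if useful".

```python
# Hypothetical linter for the viewer-oriented best practices (illustrative only).
GENERIC_NAMES = {"result", "input", "output"}  # assumed list of uninformative identifiers


def lint_description(desc):
    """Return a list of warnings for one tool or workflow description (a dict)."""
    warnings = []
    if not desc.get("label"):
        warnings.append("missing top-level label")
    if not desc.get("doc"):
        warnings.append("missing top-level doc")
    for section in ("inputs", "outputs"):
        for identifier, param in desc.get(section, {}).items():
            # Conceptual identifiers: flag generic names.
            if identifier.lower() in GENERIC_NAMES:
                warnings.append(f"{section}.{identifier}: generic identifier")
            # Format specification: every File should declare a format.
            if param.get("type") == "File" and not param.get("format"):
                warnings.append(f"{section}.{identifier}: File without format")
    return warnings


# Made-up description; the format values are placeholders.
example = {
    "class": "CommandLineTool",
    "label": "Sort alignments",
    "inputs": {
        "alignments": {"type": "File", "format": "edam:format_2572"},
        "input": {"type": "File"},
    },
    "outputs": {
        "sorted_alignments": {"type": "File", "format": "edam:format_2572"},
    },
}

for warning in lint_description(example):
    print(warning)
```

A real checker would also need to handle list-style inputs, array types and step-level labels, which this sketch ignores.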

shouldHaveDoc ( a cwl:Workflow | a cwl:Tool ) rdfs:comment LITERAL

shouldHaveLabel ( a cwl:Workflow | a cwl:Tool ) rdfs:label LITERAL

step a cwl:Step cwl:inputs shouldHaveFormat cwl:outputs shouldHaveFormat

shouldHaveFormat cwl:File dct:format ( iana | edam )

iana IRI ^https://www.iana.org/assignments/media-types/

edam IRI ^https://edamontology.org/format_ rdfs:subClassOf <http://edamontology.org/format_1915>

Capturing Common Workflow Language Profile as ShEx
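The format constraint in the profile above comes down to testing an IRI against two prefixes. As a rough Python analogue (not ShEx, and not taken from the talk), the sketch below classifies a format IRI as IANA or EDAM by pattern alone. The function name `format_kind` is hypothetical, and accepting both http and https for EDAM is an assumption made because both forms appear on the slide.

```python
import re

# Patterns mirroring the two IRI stems named in the profile.
IANA = re.compile(r"^https://www\.iana\.org/assignments/media-types/.+")
EDAM = re.compile(r"^https?://edamontology\.org/format_\d+$")


def format_kind(iri):
    """Classify a format IRI as 'iana', 'edam', or None, by syntax only."""
    if IANA.match(iri):
        return "iana"
    if EDAM.match(iri):
        return "edam"
    return None


print(format_kind("https://www.iana.org/assignments/media-types/text/plain"))
print(format_kind("http://edamontology.org/format_1915"))
print(format_kind("http://example.org/my-format"))
```

This is purely syntactic. It cannot confirm that the term actually exists in the EDAM ontology or that the IRI resolves.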

ShEx is SPO (subject-predicate-object) testing, not Graph Link Following

Info for Constraints are:
• Embedded in a specific format
  - Extract/convert from domain-specific formats
• Embedded in annotation resources
  - Use existing schema.org annotations
• Need to be acquired
  - e.g. URI look-ups (ORCID -> author name)
• Custom & hardcoded namespaces
  - Pre-declare ontologies
  - Add derived annotations post-processing

RDF must already be in a single graph

Can't check if resource exists (e.g. 404)

Can't test format/representation of resource ("is it actually an Excel file?")

Can't apply nested RDF shapes to Linked Data resources

Can't say "Must be term from any resolvable ontology"

Can't check the format is actually in the EDAM ontology

RDF Shape that indicates to follow links

RO pre-processing to merge to single graph

Bespoke validators, unpackers to iterate over the RO
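To see why the merge step matters, here is a toy illustration in which a graph is just a Python set of (subject, predicate, object) tuples. It is a sketch under that assumption, not an RDF library or the project's tooling; the names `merge_graphs` and `has_property` and the `ro:` identifiers are made up.

```python
def merge_graphs(*graphs):
    """Union several triple sets into one graph, as a pre-processing step."""
    merged = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def has_property(graph, subject, predicate):
    """SPO-style test: does the subject have at least one value for the predicate?"""
    return any(s == subject and p == predicate for s, p, _ in graph)


# The manifest and a separate annotation resource describe the same workflow.
manifest = {("ro:workflow.cwl", "rdf:type", "cwl:Workflow")}
annotations = {("ro:workflow.cwl", "rdfs:label", "Variant calling")}

# Tested in isolation, the manifest fails the label check because the label
# lives in another resource and nothing follows the link to it.
print(has_property(manifest, "ro:workflow.cwl", "rdfs:label"))

# After merging into a single graph, the same test succeeds.
merged = merge_graphs(manifest, annotations)
print(has_property(merged, "ro:workflow.cwl", "rdfs:label"))
```

Checks such as "does this resource exist" or "is this really an Excel file" remain outside this kind of triple test and would need separate validators.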

Domain specific
• "Must have a workflow that analyses next-gen sequencing data"
• "Must be part of $fundedProject's Investigation"
• "All required data files must be provided"
• "Generic names should be avoided"

General Tools that do their best at unpacking and handing off

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software

Dataset "build" tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD RDF

Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations
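A manifest in the JSON(-LD) + schema.org style might look roughly like the one built below. This is only an illustrative sketch: it is not the Research Object Bundle manifest format or any project's actual profile. The function name `make_manifest`, the file paths and the zeroed ORCID are placeholders; the schema.org terms used (Dataset, Person, MediaObject, hasPart, contentUrl, encodingFormat) are real vocabulary terms.

```python
import json


def make_manifest(name, author_id, files):
    """Build a schema.org JSON-LD description of a package and its files."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "author": {"@type": "Person", "@id": author_id},
        "hasPart": [
            {"@type": "MediaObject", "contentUrl": path, "encodingFormat": media_type}
            for path, media_type in files
        ],
    }


manifest = make_manifest(
    "Example analysis package",                  # placeholder title
    "https://orcid.org/0000-0000-0000-0000",     # placeholder identifier
    [("data/results.tsv", "text/tab-separated-values"),
     ("README.txt", "text/plain")],
)

print(json.dumps(manifest, indent=2))
```

Because the manifest is ordinary JSON, it can sit next to the payload in an archive and still be read as linked data by tools that understand the context.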

We should have called this "Research Objects". Don't be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches

Take-up Analogy: Start Ups

Community

Driver

Tools: easy to make, hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate dedicated leadership

Standards

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture

Who gets credit for what? Using Provenance for Credit Mapping [Paolo Missier]

[Figure: W3C PROV dependency graph with numeric edge labels; contributors Alice, Bob and Charlie; "Provlets"]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

Granularity, Atomicity, Aggregation

• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors

1. "Contriponents"
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015). Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D S Katz. Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. Journal of Open Research Software 2(1), e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
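The idea of weighted, networked credit maps can be made concrete with a small calculation. The sketch below is a simplified reading of transitive credit, not Katz and Smith's implementation. It assumes each product carries a credit map whose weights sum to 1, that an entry is either a person or another product with its own map, and that the dependency graph has no cycles. The products, weights and the function name `transitive_credit` are invented for illustration.

```python
def transitive_credit(product, credit_maps, share=1.0, totals=None):
    """Propagate fractional credit from a product down to individual contributors."""
    if totals is None:
        totals = {}
    for contributor, weight in credit_maps[product].items():
        if contributor in credit_maps:
            # A component with its own credit map: pass on a scaled share.
            transitive_credit(contributor, credit_maps, share * weight, totals)
        else:
            # A person: accumulate the credit that reaches them.
            totals[contributor] = totals.get(contributor, 0.0) + share * weight
    return totals


# Invented credit maps: the paper credits Alice directly and two components.
credit_maps = {
    "paper": {"Alice": 0.5, "workflow": 0.25, "dataset": 0.25},
    "workflow": {"Bob": 0.5, "dataset": 0.5},
    "dataset": {"Charlie": 1.0},
}

totals = transitive_credit("paper", credit_maps)
print(totals)
```

In this example Charlie receives credit by two routes, directly through the dataset and indirectly through the workflow that used it. This is the kind of indirect contribution the slide refers to.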

• Manifests using semantics
• Commons of components
• A new scholarly currency
• Necessity for reproducible machines
• Foundation of release of research
• Ramps rather than Revolution

The Rhetoric of Research Objects

researchobject.org

Reports of the death of the scientific paper are greatly exaggerated

All the members of the Wf4Ever team
Colleagues in Manchester's Information Management Group
ELIXIR-UK, Bioschemas

http://www.researchobject.org
http://www.wf4ever-project.org
http://www.fair-dom.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith





Page 44: The Rhetoric of Research Objects

Did anyone take any notice?

Research Object Bundles for Data Releases, as if they were software: Dataset “build” tool

Standardised packing of Systems Biology models

European Space Agency RO Library, Everest Project

Metagenomics pipelines and LARGE datasets

U Rostock

ISI USC

Public Health Learning Systems

Asthma Research e-Lab: sharing and computing statistical cohort studies

Precision medicine: NGS pipelines, regulation


Page 45: The Rhetoric of Research Objects

Did anyone take any notice? http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5 STM Innovations Seminar 2012

Howard Ratner: Chair, STM Future Labs Committee; CEO EVP, Nature Publishing Group; Director of Development for CHORUS (Clearinghouse for the Open Research of US)

FAIRPORT, January 2014, Lorentz Center, Leiden

Ted Slater, YARC, OpenBEL


Page 46: The Rhetoric of Research Objects

A trend… Using JSON(-LD) + schema.org

https://dokie.li

https://linkedresearch.org

Manifest: Schema.org JSON-LD, RDF. Archive: tar.gz

Reproducible Document Stack project

eLife, Substance and Stencila

BagIt data profile + schema.org JSON-LD annotations
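
A minimal sketch of what a JSON-LD manifest using schema.org terms might look like, built here in Python. It is an illustration only, not the manifest format of any of the projects named above: the `build_manifest` helper, the `https://example.org/ro/1` identifier, and the file paths are all hypothetical. The schema.org terms used (`Dataset`, `hasPart`, `MediaObject`, `encodingFormat`, `creator`, `Person`) are real vocabulary terms.

```python
import json


def build_manifest(identifier, name, creator, files):
    """Return a schema.org-flavoured JSON-LD manifest as a plain dict.

    `files` is a list of (path, media_type) pairs describing the
    aggregated resources. This helper is hypothetical.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": identifier,
        "name": name,
        "creator": {"@type": "Person", "name": creator},
        # Each aggregated file becomes one part of the dataset.
        "hasPart": [
            {"@type": "MediaObject", "@id": path, "encodingFormat": media_type}
            for path, media_type in files
        ],
    }


manifest = build_manifest(
    identifier="https://example.org/ro/1",  # hypothetical identifier
    name="Example research object",
    creator="Alice",
    files=[
        ("data/results.csv", "text/csv"),
        ("workflow/main.cwl", "text/x-yaml"),
    ],
)

# Serialise the manifest so it can be stored alongside the archived files.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The manifest is plain JSON, so it can sit next to the payload in an archive and still be read as linked data by tools that understand the `@context`.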


Page 47: The Rhetoric of Research Objects

We should have called this “Research Objects”. Don’t be too clever about your titles.

Combining ISA-based Research Objects with nanopublications: complementary approaches


Page 48: The Rhetoric of Research Objects

Take-up Analogy: Start-Ups

Community

Driver

Tools

Easy to make

Hard to consume

Workflows

Reproducibility

Portability between platforms

Platform & user buy-in from the get-go

Passionate, dedicated leadership

Standards


Page 49: The Rhetoric of Research Objects

Open Questions

Stewardship
• owners, sites, authors

Spanning
• platforms, researchers

Lifecycle
• composition, forking…

Governance

Credit
• micro-credit & citation propagation, attribution

Tamper proofing
• blockchain, ethereum

Maintenance
• of evolving content
• incrementality & degradation

Manifests
• profile & template making
• auto manufacture


Page 50: The Rhetoric of Research Objects

Who gets credit for what? Using Provenance for Credit Mapping

[Paolo Missier]

[Figure: W3C PROV dependency graph; labelled contributors: Alice, Charlie, Bob]

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

“Provlets”

Granularity, Atomicity, Aggregation
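
One way to picture credit mapping over a provenance dependency graph is to walk derivation links back from a result and collect everyone attributed along the way. The sketch below is illustrative only and is not the method from the cited talk. It models two W3C PROV relations, `wasDerivedFrom` and `wasAttributedTo`, as plain dictionaries. The entity names and the `indirect_contributors` helper are hypothetical.

```python
from collections import deque


def indirect_contributors(derived_from, attributed_to, entity):
    """Collect agents attributed to `entity` or to anything it derives from.

    derived_from:  entity -> list of entities it was derived from
    attributed_to: entity -> list of agents it is attributed to
    """
    seen = {entity}
    queue = deque([entity])
    agents = set()
    while queue:
        current = queue.popleft()
        agents.update(attributed_to.get(current, ()))
        # Follow derivation links backwards, visiting each entity once.
        for source in derived_from.get(current, ()):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return agents


# Hypothetical provenance: a figure derived from an analysis of a dataset.
derived_from = {
    "figure": ["analysis"],
    "analysis": ["dataset_v2"],
    "dataset_v2": ["dataset_v1"],
}
attributed_to = {
    "figure": ["Alice"],
    "analysis": ["Alice", "Bob"],
    "dataset_v1": ["Charlie"],
}

contributors = indirect_contributors(derived_from, attributed_to, "figure")
print(sorted(contributors))
```

Charlie never touched the figure directly, yet appears in the result because the original dataset sits upstream of it.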


Page 51: The Rhetoric of Research Objects

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. “Contriponents”

• contributors + components

2. Weighted contribution

3. Networked Credit maps

• Travel with the contriponents

Transitive Credit contribution [Dan Katz and Arfon Smith]

Katz DS & Smith AM (2015) Transitive Credit and JSON-LD. Journal of Open Research Software 3(1), p.e7. DOI: http://doi.org/10.5334/jors.by

D. S. Katz, Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products, Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
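
The three steps above (contriponents, weighted contribution, networked credit maps) can be sketched as a small recursive calculation. This is a toy illustration under stated assumptions, not the JSON-LD encoding from the cited papers. Each product lists its contributors, people or other products, with weights that sum to 1. Credit flowing into a product is passed on to its own contributors, and the map is assumed to be acyclic. All names and weights are hypothetical.

```python
def transitive_credit(credit_map, product, amount=1.0, totals=None):
    """Distribute `amount` of credit for `product` down to people.

    credit_map maps a product to {contributor: weight}. A contributor
    that is itself a key of credit_map is a component whose share is
    passed on transitively. Assumes the map contains no cycles.
    """
    if totals is None:
        totals = {}
    for contributor, weight in credit_map[product].items():
        share = amount * weight
        if contributor in credit_map:
            # Component: forward its share to its own contributors.
            transitive_credit(credit_map, contributor, share, totals)
        else:
            # Person: accumulate the fractional credit.
            totals[contributor] = totals.get(contributor, 0.0) + share
    return totals


# Hypothetical credit map for a paper built on a workflow and a dataset.
credit_map = {
    "paper": {"Alice": 0.5, "workflow": 0.3, "dataset": 0.2},
    "workflow": {"Bob": 0.75, "dataset": 0.25},
    "dataset": {"Charlie": 1.0},
}

totals = transitive_credit(credit_map, "paper")
for person, credit in sorted(totals.items()):
    print(f"{person}: {credit:.3f}")
```

Charlie is credited twice over: directly through the dataset used by the paper, and indirectly through the workflow that also depends on it. The shares still sum to the one unit of credit assigned to the paper.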


Page 52: The Rhetoric of Research Objects

• Manifests using semantics

• Commons of components

• A new scholarly currency

• Necessity for reproducible machines

• Foundation of release of research

• Ramps rather than Revolution

The Rhetoric of Research Objects: researchobject.org

Reports of the death of the scientific paper are greatly exaggerated


Page 53: The Rhetoric of Research Objects

All the members of the Wf4Ever team; Colleagues in Manchester’s Information Management Group; ELIXIR-UK; Bioschemas

http://www.researchobject.org

http://www.wf4ever-project.org

http://www.fair-dom.org

http://seek4science.org

http://rightfield.org.uk

http://www.bioschemas.org

http://www.commonwl.org

http://www.bioexcel.eu

Mark Robinson, Alan Williams, Jo McEntyre, Norman Morrison, Stian Soiland-Reyes, Paul Groth, Tim Clark, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Ian Cottam, Susanna Sansone, Kristian Garza, Daniel Garijo, Catarina Martins, Alasdair Gray, Rafael Jimenez, Iain Buchan, Caroline Jay, Michael Crusoe, Katy Wolstencroft

Barend Mons, Sean Bechhofer, Philip Bourne, Matthew Gamble, Raul Palma, Jun Zhao, Neil Chue Hong, Josh Sommer, Matthias Obst, Jacky Snoep, David Gavaghan, Rebecca Lawrence, Stuart Owen, Finn Bacall, Paolo Missier, Phil Crouch, Oscar Corcho, Dan Katz, Arfon Smith