FAIR Software (and Data) Citation: Europe, Research Object Systems,
Networks and Off the Shelf Infrastructure
Professor Carole Goble
The University of Manchester, UK
Software Sustainability Institute UK
ELIXIR-UK, ELIXIR Interop Platform
[email protected] 0000-0003-1219-2137
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
Acknowledgements
U Manchester
• Stian Soiland-Reyes
• Stuart Owen
• Caroline Jay
• Robert Haines
• Norman Morrison
U Newcastle
• Paolo Missier
U Illinois Urbana-Champaign
• Dan Katz
Murphy Mitchell Consulting Ltd
• Fiona Murphy
F1000
• Liz Allen
U Oxford
• Neil Jefferies
• Lucie Burgess
ISI, USC
• Yolanda Gil
• Daniel Garijo
Force11 DCIP / Harvard
• Tim Clark
ELIXIR / BioSchemas.org
• Rafael Jimenez (Hub)
• Niall Beard (ELIXIR UK)
• Aleks Nenadic (ELIXIR UK)
• Jo McEntyre (EBI, THOR)
NIH BD2K
• Susanna Sansone (bioCADDIE, ELIXIR)
• Ian Fore (NIH)
Software Sustainability Institute
• Shoaib Sufi
• Neil Chue Hong
• Mike Jackson
STFC
• Catherine Jones
Chief Contexts
Workflow Repository
Systems and Synthetic Biology Projects
FAIR
Findable
Accessible
Interoperable
Reusable
Intelligible
Reproducible
Citable
Track & Countable
Findable
Accessible
Interoperable
Reusable
Intelligible
Change
Citable
Track & Countable
FAIR Credit
sciencecodemanifesto.org
http://www.elixir-europe.org/
17 ELIXIR members, 2 observers
Major bioinformatics service providers (~150)
Co-operation, long-term support
Data Citation in Europe PMC full text
http://dliservice.research-infrastructures.eu/#/
https://www.openaire.eu/
https://www.rd-alliance.org/groups/rdawds-publishing-data-services-wg.html
European Open Science Cloud
Technical and Human infrastructure for
Open Research
• interoperability and integration between ORCID and DataCite infrastructures
• PID e-infrastructure: promote uptake and sustain
https://project-thor.eu/
Giving Researchers Credit for their Data
https://www.jisc.ac.uk/rd/projects/research-data-spring
• Carrots for authors, “pain-free” submission
• Helper app for submitting data papers and data for papers (using DataCite and ORCID)
http://www.software.ac.uk/software-credit
Over 90 Guides
War Stories
Policy, Support
http://www.software.ac.uk/software-management-plans
Digital Curation Centre
http://dcc.ac.uk
http://openresearchsoftware.metajnl.com/
http://www.software.ac.uk/how-cite-and-describe-software
Mike Jackson
http://rse.ac.uk
Not all creditable software is a “downloadable application”
Registration is hit and miss
Metrics Indicators
Counts
Community Smarts
Software Citation Space
Science as a Service
Open Source Codes
Virtual Machines
Portable Packaging
Libraries
Applications
Scripting environments
Infrastructure
Commercial tools
Scripts /
Workflows
Packages GEMS
Dynamic Deployments
Reproducible Research: Citing your execution environment using Docker and a DOI
http://www.software.ac.uk/blog/2016-03-29-reproducible-research-citing-your-execution-environment-using-docker-and-doi
Caroline Jay, Robert Haines
http://idinteraction.cs.manchester.ac.uk
‘ABC: Using Object Tracking to Automate Behavioural Coding.’ CHI 2016.
Fixity Publishing
Service vs Science
Background vs Foreground Software
Software and data* in the foreground are the most likely to be cited. The same software and data viewed as background are not cited, or not explicitly cited, though they are equally essential.
* Wynholds, et al (2012) Data, data use, and scientific inquiry: two case studies of data practices 10.1145/2232817.2232822
The invisibility of software, esp:
• widely used
• infrastructural
• component/library
• cross-discipline
Credit Drift
[Diagram labels: Immediate team; Background team; “Foreground” software; “Background” software; Authorship; Authorship?; Cited?; Acknowledged; Mentioned; Ignored; Cited]
Transitive, Fractional Credit
Not all software is equal
* Wynholds, et al (2012) Data, data use, and scientific inquiry: two case studies of data practices 10.1145/2232817.2232822
https://mr-c.github.io/shouldacite
http://bit.ly/shouldacite
SSI Collaborations Workshop 2016
Should I cite the software?
Overcoming Barriers to Software Citation
Survey of experiences citing software in research publications
http://bit.ly/1WxWFY7
Caroline Jay, Robert Haines, University of Manchester, UK
Robin Wilson, University of Southampton, UK
Systems Biology Projects Commons
http://fair-dom.org
Systems and Synthetic Biology Projects
Linking, “Packaging” & Citing Codes, Data, Models, SOPs, Samples, Strains, Articles, People, Projects…
Repository spanning catalogue, reference (“cite”) distributed 3rd party content
Standards
Public data archives
Project data repositories
Literature archives
Public model archives
Uploaded content
Plugin Model tools
FAIRDOM
Plugin Data tools
Structured Metadata Capture
metadata sheets sample sheets
data sheets
http://www.rightfield.org.uk
[Martin Scharm, Rostock University]
Haus et al, BMC Systems Biology, 2011, 5:10
Solvent production by Clostridium acetobutylicum
https://dx.doi.org/10.1111/febs.13237
https://doi.org/10.15490/seek.1.investigation.56
http://data.datacite.org/10.15490/seek.1.investigation.56
Citation G. Penkler; F. du Toit; W. Adams; M. Rautenbach; D. C. Palm; D. D. van Niekerk; J. L. Snoep; (2014): Glucose metabolism in Plasmodium falciparum trophozoites; FAIRDOMHub. http://dx.doi.org/10.15490/seek.1.investigation.56
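The citation string above follows a recognisable pattern: creators, year, title, publisher, DOI. A minimal sketch of assembling such a string from metadata fields, assuming a simple record type (`CitationRecord` and `format_citation` are illustrative names, not part of FAIRDOM SEEK or DataCite, and the creator list is shortened for the example):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CitationRecord:
    """Illustrative container for the fields seen in the citation above."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(rec: CitationRecord) -> str:
    # Join creators with "; " and end with a resolvable DOI link.
    creators = "; ".join(rec.creators)
    return (f"{creators} ({rec.year}): {rec.title}; "
            f"{rec.publisher}. https://doi.org/{rec.doi}")


record = CitationRecord(
    creators=["G. Penkler", "F. du Toit", "J. L. Snoep"],  # shortened list
    year=2014,
    title="Glucose metabolism in Plasmodium falciparum trophozoites",
    publisher="FAIRDOMHub",
    doi="10.15490/seek.1.investigation.56",
)
citation = format_citation(record)
print(citation)
```

Keeping the fields separate until the last step means the same record could feed a landing page, a reference manager export, or a metadata registration.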
Fixity Publishing, URIs -> DOIs
"Mapping present and future predicted distribution patterns for a meso-grazer guild in the Baltic Sea" Sonja Leidenberger et al
CreditsAttributions
In Multiple Packs
Track?
Workflows
Pointer to 3rd Party Data Collection
Pointer to 3rd Party Code
Local files
Content
• Aggregated
• Granularity
• Atomicity / Subsets
• Recombined
• Distributed
• Dynamic and versioned
Contribution
• Multi-contributors
• Spans resources
• Independently stewarded
• Shift and change
• Metadata Framework: bundles and relates multi-hosted, scattered digital resources of a scientific experiment or investigation using standard mechanisms
• Exchange, Publishing, Reproducibility, Portability, Repair
See Stephen Abrams' talk yesterday
Datasets, data collections
Standard operating procedures
Software, algorithms
Configurations, tools and apps, services
Slideshare, Github, figshare, CommunityDB, Arxiv.org, Pubmed, Docker image
Codes, code libraries
Workflows, scripts
System software, infrastructure
Compilers, hardware
Input Data
Workflow Description
Provenance trace
Version of Codes / Services
Output
Manifest Construction
Manifest
Identification to locate things
Aggregates to link things together
Annotations about things & their relationships
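The three manifest construction steps (identify, aggregate, annotate) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the key names `aggregates`, `annotations`, `about` and `content` are loosely modelled on Research Object manifests, and the file names are hypothetical.

```python
import json
from typing import Dict, List


def build_manifest(resources: List[str],
                   annotations: List[Dict[str, str]]) -> Dict:
    """Aggregate identified resources and attach annotations about them."""
    # Identification: each resource is named by a URI or relative path.
    aggregated = sorted(set(resources))
    # Annotations must point at something the manifest actually aggregates.
    for note in annotations:
        if note["about"] not in aggregated:
            raise ValueError(f"annotation targets unknown resource: {note['about']}")
    return {
        "id": "/",
        "aggregates": [{"uri": uri} for uri in aggregated],
        "annotations": list(annotations),
    }


manifest = build_manifest(
    resources=[
        "data/input.csv",                                    # local file (hypothetical)
        "workflow/analysis.cwl",                             # local file (hypothetical)
        "https://doi.org/10.15490/seek.1.investigation.56",  # pointer to 3rd party content
    ],
    annotations=[
        {"about": "workflow/analysis.cwl",
         "content": "annotations/workflow-description.ttl"},  # hypothetical
    ],
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Local files and pointers to third party content sit side by side in the same aggregation, which is the point of the manifest: it links things together without requiring them to live in one place.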
Container
Metadata Objects Citable Reproducible Packaging
Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed
Manifest
Packaging content & links: Zip files, BagIt, Docker images
Catalogues & Commons Platforms: FAIRDOM SEEK, STELAR eLab
OAI ORE
W3C OADM
RO Types: Manifest Content Profiles
minimal, maximal, extensible
PID, Citation, Checklist, Version, Provenance, Dependencies
[Diagram labels: JATS, Comms, DC, DCAT, Exp, ISA, EFO, Domain, SBML, MIAME, CWL]
Common properties among content types
Minimum information for one content type
Workflow RO Bundle
ZIP or BagIt folder structure
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
JSON and YAML
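A minimal sketch of packaging a bundle as a ZIP file with Python's standard `zipfile` module. It assumes the common convention of a leading, uncompressed `mimetype` entry carrying the media type above and a JSON manifest at `.ro/manifest.json`; treat both as assumptions and check the RO Bundle specification before relying on this layout. The payload file names are hypothetical.

```python
import io
import json
import zipfile
from typing import Dict

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"  # assumed manifest location


def write_bundle(files: Dict[str, bytes], manifest: dict) -> bytes:
    """Write payload files plus a manifest into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # The media type entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
        for path, data in sorted(files.items()):
            bundle.writestr(path, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


files = {
    "data/input.csv": b"sample,value\nA,1\n",        # hypothetical payload
    "workflow/analysis.cwl": b"cwlVersion: v1.0\n",  # hypothetical payload
}
manifest = {"id": "/", "aggregates": [{"uri": "/" + path} for path in sorted(files)]}
bundle_bytes = write_bundle(files, manifest)

with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as check:
    print(check.namelist())
```

Writing to an in-memory buffer keeps the sketch side-effect free; passing a file path to `zipfile.ZipFile` instead would write the bundle to disk.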
Persistent Identification of Software: a building block to citation & curation
[email protected] B. Matthews, I. Gent, J. Tedds & S Lamerton
Project URL http://rrr.cs.st-andrews.ac.uk/
Guidelines for persistently identifying software using DataCite
https://epubs.stfc.ac.uk/work/24058274
• Most recent?
– Location indicator, crosslink
– Credit the contributors now, the version now
– Strong presumption it exists and is living
• Fixed Snapshot?
– Defend publication, Reuse
– Credit the contributors then, the version then
– Presumption it exists and is archived
• Line in the sand?
– Credit the contributors then, the version then
– Weak presumption it exists
• Warrant?
• Acknowledgement not contribution
• Don’t care if it exists
• Important “influence” citation for its contributors
What does the citation mean for the author or reader?
Identifier Resolution, Citation Persistence, Content Decay?
Commons
my Disk
Commons
• DOI proliferation
– Channelling for Counting and Landing Pages
• Authenticity: Tamper-proof Exchange and Provenance
– Hashing & Checksums
– Secure signature & probity services
– Block chain
• anti tampering transaction logging
• https://www.ethereum.org/
– Proll and Rauber, Scalable data citation in dynamic, large databases: Model and reference implementation, (2014) 10.1109/BigData.2013.6691588
• Uber Collection / Hierarchy / subsetting (cf. Dryad, DataONE, DataVerse)*
• RO author/contributor information in its manifest
• ROs manifest => constituent resources, provenance for contribution.
*Ball, A. & Duke, M. (2011). "How to Cite Datasets and Link to Publications?". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides/cite-datasets.
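The "Hashing & Checksums" point under Authenticity can be illustrated with a fixity check: record a digest per resource at publication time, then recompute and compare later. A minimal sketch using SHA-256 from the standard library; the function names and payload are illustrative, and the choice of SHA-256 is an assumption rather than something the talk prescribes.

```python
import hashlib
from typing import Dict, List


def checksum_manifest(payload: Dict[str, bytes]) -> Dict[str, str]:
    """Map each resource path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in sorted(payload.items())}


def changed_paths(payload: Dict[str, bytes], expected: Dict[str, str]) -> List[str]:
    """Return paths whose content is missing, new, or differs from the record."""
    actual = checksum_manifest(payload)
    return sorted(path for path in set(expected) | set(actual)
                  if expected.get(path) != actual.get(path))


published = {"data/input.csv": b"sample,value\nA,1\n"}  # hypothetical payload
recorded = checksum_manifest(published)

tampered = {"data/input.csv": b"sample,value\nA,2\n"}
print(changed_paths(published, recorded))  # nothing changed
print(changed_paths(tampered, recorded))   # the edited file is reported
```

A checksum detects change but says nothing about who changed what; signatures and transaction logs, the other items in the list, address that part.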
Granularity Atomicity
Aggregation
Robust Transitivity & Propagation
Citation and Credit Aggregation and Granularity
• Backward Citation
– What was this based on, who did it?
• Forward Citation
– What is using this, who did that?
• “PageRank”
Credit Aggregation
Citation Granularity
Drift
D. S. Katz, "Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products," Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
Who gets credit for what?
Using Provenance for Credit Mapping
Paolo Missier
Alice
Charlie
Bob
Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016
W3C PROV dependency graph
“Provlets”
• Tracking RO usage and indirect contributions
• Awarding fractional credit to contributors
1. “Contriponents”
• contributors + components
2. Weighted contribution
3. Networked Credit maps
• Travel with the contriponents
Transitive Credit contribution
Dan Katz and Arfon Smith
*Katz, D.S. & Smith, A.M., (2015). Transitive Credit and JSON-LD. Journal of Open Research Software. 3(1), p.e7, DOI: http://doi.org/10.5334/jors.by
D. S. Katz, "Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products," Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
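A minimal sketch of how weighted credit maps could propagate fractional credit transitively. It assumes each product carries a map of shares over its contributors and components ("contriponents"), and that credit for a component is passed on to that component's own contriponents. The weights, the product names and the function itself are illustrative assumptions, not the scheme published by Katz and Smith.

```python
from typing import Dict, FrozenSet

# product -> {contributor or component: fractional share}
CreditMaps = Dict[str, Dict[str, float]]


def transitive_credit(product: str, credit_maps: CreditMaps,
                      weight: float = 1.0,
                      seen: FrozenSet[str] = frozenset()) -> Dict[str, float]:
    """Resolve a product's credit down to the people behind its components."""
    if product in seen:
        raise ValueError(f"cyclic credit map at {product}")
    totals: Dict[str, float] = {}
    for name, share in credit_maps[product].items():
        if name in credit_maps:
            # A component: pass its share on to its own contriponents.
            nested = transitive_credit(name, credit_maps, weight * share,
                                       seen | {product})
            for person, credit in nested.items():
                totals[person] = totals.get(person, 0.0) + credit
        else:
            # A contributor: credit lands here.
            totals[name] = totals.get(name, 0.0) + weight * share
    return totals


credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "analysis-library": 0.25},
    "analysis-library": {"Charlie": 0.8, "Bob": 0.2},
}
credit = transitive_credit("paper", credit_maps)
print(credit)
```

In this example Bob receives credit twice, once directly on the paper and once through the library, which is the kind of indirect contribution the slides describe as otherwise invisible.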
How do we weight and track?
https://www.refme.com/uk/
http://depsy.org/
• Literature mining
– Duck et al, Ambiguity and variability of database and software names in bioinformatics (2015) DOI: 10.1186/s13326-015-0026-0
• Infrastructure
– Identifier and provenance infrastructure, dependency managers, metrics services, repositories, machine readable and processable metadata, reference managers
• CRediT – contributor taxonomy
– http://casrai.org/CRediT
– Time for revision?
http://mdc.lagotto.io/
http://ivory.idyll.org/blog/2015-authorship-on-software-papers.html
Find | Cite | Credit Ramps “Riding the metadata COTS-tails”
• 3rd of web pages
• Opening out -> community groups and extensions
• Builds on a shared core and data structure
• Simple embedding in web pages and CMS
• Widespread tooling, harvesters and indexing
• Search engines and Integration tools
• It’s all about the metadata and knowledge graph
Google, Bing, Yahoo, Yandex
Find | Cite | Credit Ramps “Riding the metadata COTS-tails”
Depth
DATS
Reach
http://codemeta.github.io/
http://ontosoft.org/
Find | Cite | Credit Ramps “Riding the metadata COTS-tails”
Reach
Depth
Bioschemas.org
Specification
Data model
Minimum information
Controlled vocabularies
Cardinality
Documentation
Examples
New (properties | types)
Restrictions
Constraints
Extensions
BioSchemas.org
minimal, maximal, extensible
Training materials
Events
Organizations
Data
Standards
Software
Minimum information
for one content type
Training materials
Events
Organizations
Data
Software
Standards
Common properties
among content types
Identifier, Title, Description, Author, Topics, Audience, Publication Date, …
Schema.org
BioSchemas.org, W3C FHIR WG
Daniel Mietchen et al, Adapting JATS to support data citation, Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015, Bethesda (MD): National Center for Biotechnology Information 2015.
Journal Article Tag Suite
DATS
SoftwareSourceCode
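A minimal sketch of the "simple embedding in web pages" idea: describing a piece of software with the Schema.org `SoftwareSourceCode` type as JSON-LD and wrapping it in a script tag for a landing page. The property names are standard Schema.org terms; the tool name, author, URL and DOI below are placeholders, and the helper function is illustrative rather than part of any Bioschemas tooling.

```python
import json
from typing import Dict


def software_jsonld(name: str, description: str, author: str,
                    repository: str, language: str,
                    date_published: str, identifier: str) -> Dict:
    """Describe source code using common Schema.org properties."""
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "identifier": identifier,
        "name": name,
        "description": description,
        "author": {"@type": "Person", "name": author},
        "codeRepository": repository,
        "programmingLanguage": language,
        "datePublished": date_published,
    }


markup = software_jsonld(
    name="example-analysis-tool",                        # placeholder
    description="Illustrative entry for a research software landing page.",
    author="A. Researcher",                              # placeholder
    repository="https://example.org/example-analysis-tool",
    language="Python",
    date_published="2016-06-06",
    identifier="https://doi.org/10.0000/example",        # placeholder DOI
)

# Embed in a page so that harvesters and search engines can pick it up.
snippet = ('<script type="application/ld+json">\n'
           + json.dumps(markup, indent=2)
           + "\n</script>")
print(snippet)
```

The fields mirror the common properties listed above (identifier, title, description, author, publication date), which is what lets one shared core serve several content types.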
• Stretch in all directions
– Granularity, Atomicity, Aggregation
– Only partially automatable
• Dynamic Citation
– “Citable Units”
– Buneman et al, https://tinyurl.com/bdf-cacm
• ROs & Contriponents
– Standardised metadata manifests
– Tracking fabrics
– Distributed => will break
• Keep it simple
– Incremental, Commodity based, Low Tech
– Guidelines & Conventions
– Ramps – like Bioschemas.org
– Capture metadata all along the way…
Open Questions?
Getting folks (authors, reviewers, editors) to cite software and data
For Further Information
• http://www.researchobject.org
• http://www.wf4ever-project.org
• http://www.fair-dom.org
• http://seek4science.org
• http://www.software.ac.uk
• http://www.bioschemas.org
• http://codemeta.github.io/
• http://myexperiment.org
• http://www.commonwl.org/