inserm - data management & reuse of health data - may 2017
TRANSCRIPT
On community-standards, FAIR data and scholarly communication
Susanna-Assunta Sansone, PhD
ORCID: 0000-0001-5306-5690
INSERM Workshop 246 “Management and reuse of health data: methodological issues”, Bordeaux, 14-17 May 2017
Data Consultant, Founding Academic Editor
Associate Director, Principal Investigator
www.slideshare.net/SusannaSansone
Source: https://www.dataone.org/best-practices
Simplified research data life cycle
To do better science, more efficiently, we need data that are…
• Available in a public repository
• Findable through some sort of search facility
• Retrievable in a standard format
• Self-describing so that third parties can make sense of it
• The product of careful planning, organization and stewardship
• Intended to outlive the experiment for which they were collected
Key problem: low findability and understandability
• Not always well cited and stored
o True for data as well as for any other digital asset
• Poorly described for third party reuse
o Different level of details and annotation
• Reporting and annotation activities are perceived as time consuming
o Often rushed and minimally done
We need content or reporting standards
• To harmonize the datasets with respect to the structure and level of annotation of their:
§ experimental components (e.g., design, conditions, parameters),
§ fundamental biological entities (e.g., samples, genes, cells),
§ complex concepts (such as bioprocesses, tissues, diseases),
§ analytical process and the mathematical models, and
§ their instantiation in computational simulations (from the molecular level through to whole populations of individuals)
Minimum information reporting requirements, checklists
o Report the same core, essential information
o e.g. MIAME guidelines
Controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Unambiguous identification and definition of concepts
o e.g. Gene Ontology
Conceptual model, schema, exchange formats etc.
o Define the structure and interrelation of information, and the transmission format
o e.g. FASTA
Formats Terminologies Guidelines
Types of content standards
Community-driven efforts, just a few examples
[Figure: de jure standards organizations and de facto grass-roots groups, e.g. the Nanotechnology Working Group, developing Formats, Terminologies and Guidelines]
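Content standards in numbers
[Figure: counts of Formats, Terminologies and Guidelines, each with a source link; numbers shown: 224, 115, 500+]
Guidelines, e.g.: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT, …
Formats, e.g.: SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA, CML, MITAB, …
Terminologies, e.g.: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO, …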
How to discover the ‘right’ standards for your data?
A web-based, curated and searchable portal that monitors the development and evolution of standards, their use in databases and the adoption of both in data policies, to inform and educate the user community
Data policies by funders, journals and other organizations
Content standards
Formats Terminologies Guidelines
Map this complex and evolving landscape
Databases
All records are manually curated in-house and verified by the community behind each resource
Using indicators to describe ‘status’
• Ready for use, implementation, or recommendation
• In development
• Status uncertain
• Deprecated as subsumed or superseded
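The four status indicators can be sketched as a small data model. This is an illustration only, not the portal's actual schema: the class names, the `kind` field and the example records are assumptions made here, and the statuses assigned to the example records are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The four status indicators named on the slide."""
    READY = "Ready for use, implementation, or recommendation"
    IN_DEVELOPMENT = "In development"
    UNCERTAIN = "Status uncertain"
    DEPRECATED = "Deprecated as subsumed or superseded"


@dataclass
class StandardRecord:
    """Hypothetical record for one content standard."""
    name: str
    kind: str  # "format", "terminology" or "guideline"
    status: Status


def recommendable(records):
    """Return the names of the records whose status is READY."""
    return [r.name for r in records if r.status is Status.READY]


# Hypothetical records; names and statuses are invented for illustration.
records = [
    StandardRecord("Example Guideline A", "guideline", Status.READY),
    StandardRecord("Example Format B", "format", Status.IN_DEVELOPMENT),
    StandardRecord("Example Terminology C", "terminology", Status.DEPRECATED),
]

print(recommendable(records))
```

Filtering on the status is the kind of query a data policy author might run before recommending a standard.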
Understanding how standards are used
[Figure, built up over several slides: a Guideline, the related Formats, and a Terminology]
Using indicators to indicate ‘adoption’
• Standard developing groups
• Journals, publishers
• Cross-links, data exchange
• Societies and organisations
• Institutional RDM services
• Projects, programmes
Technologically-delineated views of the world: transcriptomics, proteomics, metabolomics
[Figure labels: Arrays, Scanning; Columns, Gels, MS; Columns, MS, FTIR, NMR]
Biologically-delineated views of the world: plant biology, epidemiology, microbiology
Generic features (‘common core’)
• description of source biomaterial
• experimental design components
Duplications & lack of interoperability among standards
Hard to use them in combinations, e.g. to represent:
Proteomics-based gut microbiota profiling
Proteomics and metabolomics based gut microbiota profiling
Enhancing modularization
Proteomics-based gut microbiota profiling
Proteomics and metabolomics based gut microbiota profiling
bsg-000174
biosharing:ReportingGuideline
bsg-000161
MINSEQE
MIMARKS
sample information
sample identifier
taxonomy identifier
sequence read
geo location
High-level information about the metadata standards
Representations of the standards elements
Template elements for
el-000001
el-000002
el-000003
provenance: MINSEQE
provenance: MINSEQE and MIMARKS
provenance: MIMARKS
Serve machine-readable content metadata standards, providing provenance for their elements, rendering standards invisible to the researchers
Inform the creation of metadata templates
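A minimal sketch of how elements carrying provenance could inform a metadata template. The pairing of element identifiers, labels and provenance below is assumed for illustration (the slide does not state which label belongs to which identifier), and `TemplateElement`, `build_template` and `render` are hypothetical names, not part of any existing service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateElement:
    element_id: str
    label: str
    provenance: tuple  # names of the standards that ask for this element


# Assumption: this pairing of ids, labels and provenance is illustrative only.
ELEMENTS = [
    TemplateElement("el-000001", "sample identifier", ("MINSEQE",)),
    TemplateElement("el-000002", "taxonomy identifier", ("MINSEQE", "MIMARKS")),
    TemplateElement("el-000003", "geo location", ("MIMARKS",)),
]


def build_template(standards):
    """Select every element asked for by at least one of the given standards."""
    wanted = set(standards)
    return [e for e in ELEMENTS if wanted & set(e.provenance)]


def render(template):
    """Return only the field labels: the researcher fills in fields
    and never has to see which standard each one came from."""
    return [e.label for e in template]


print(render(build_template(["MINSEQE"])))
print(render(build_template(["MINSEQE", "MIMARKS"])))
```

An element shared by two standards appears once in a combined template, which is how per-element provenance avoids duplication when standards are used together.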
How to discover the datasets relevant to your work?
OmicsDI: Nature Biotechnology 35, 406–409 (2017) doi:10.1038/nbt.3790
omicsdi.org
datamed.org
DataMed: bioRxiv 094888; https://doi.org/10.1101/094888 Nature Genetics (in press)
DATS: bioRxiv 103143; https://doi.org/10.1101/103143 Scientific Data (in press)
Growing number of data papers and data journals, e.g.:
• Discoverability and reusability
o Complementing community databases
• Incentive, credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis
nature.com/scientificdata
Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Editorial Curator: Varsha Khodiyar
Publisher: Iain Hrynaszkiewicz
A new open-access, online-only publication for descriptions of scientifically valuable datasets
Supported by
• A peer-reviewed description of data, to maximize usage
• Citable publications that give credit for reusable data
• Requires data deposition in the appropriate repository or repositories
• Complementary to traditional articles, and may or may not be associated with them
New article type
[Figure labels: Research papers, Data records, Data Descriptors]
Value added component – complementing articles and repositories
• Title
• Abstract
• Background & Summary
• Methods
• Data Records
• Technical Validation
• Usage Notes
• Figures & Tables
• References
• Data Citations
o following the Joint Declaration of Data Citation Principles
Detailed description of the methods and technical analyses supporting the
quality of the measurements; no scientific hypotheses
Article structure
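The section list above lends itself to a simple completeness check on a draft. The sketch below is a hypothetical helper, not a journal tool; the slide does not say which sections are mandatory, so it only reports which of the listed headings are absent.

```python
# Section names copied from the article structure listed above.
EXPECTED_SECTIONS = [
    "Title",
    "Abstract",
    "Background & Summary",
    "Methods",
    "Data Records",
    "Technical Validation",
    "Usage Notes",
    "Figures & Tables",
    "References",
    "Data Citations",
]


def missing_sections(draft_headings):
    """Return the expected sections not found among the draft's headings.

    Matching ignores case and surrounding whitespace.
    """
    present = {h.strip().lower() for h in draft_headings}
    return [s for s in EXPECTED_SECTIONS if s.lower() not in present]


# Hypothetical draft that lacks three of the listed sections.
draft = ["Title", "Abstract", "Background & Summary", "methods ",
         "Data Records", "References", "Data Citations"]
print(missing_sections(draft))
```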
Focus on data peer review
• Completeness = can others reproduce?
• Consistency = were community standards followed?
• Integrity = are data in the best repository?
• Experimental rigour, technical quality = were the methods sound?
Does not focus on perceived impact, importance, size, complexity of data
Credit for data producers, data managers/curators etc.
Credit to: Varsha Khodiyar
“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”
Professor Daniele Marinazzo
Credit to: Varsha Khodiyar
Data (re)use made easier
What makes a good Data Descriptor?
• Decades old dataset
• Aggregated or curated data resources
• Computationally produced data products
• Large consortium dataset
• Data from a single experiment
• Data that YOU find valuable and that others might find useful too
• Data associated with a high impact analysis article
Data Descriptors have two components
• Article or narrative component (PDF and HTML)
• Experimental metadata or structured component (in-house curated, machine-readable formats)
Curation and discoverability
The Data Curation Editor is responsible for creating and curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository records
Created with the input of the authors, includes value-added semantic annotation of the experimental metadata
analysis method script
Data file or record in a database
Data Descriptors: structured component
Complementary roles of ISA and nanopublications
From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles of Data Models and Workflows in Bioinformatics. PLoS ONE (2015). https://doi.org/10.1371/journal.pone.0127612
The (long) road to FAIR
Responsibilities lie across several stakeholder groups
Understand the benefits of sharing FAIR datasets and enact them
Engage and assist researchers to enable them to share FAIR datasets
Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers
“As Data Science culture grows, digital research outputs (such as data, computational analysis and software) are being established as first-class citizens.
This cultural shift is required to go one step further: to recognize interoperability standards as digital objects in their own right, with their associated research, development and educational activities”.
Sansone, Susanna-Assunta; Rocca-Serra, Philippe (2016). Interoperability Standards - Digital Objects in Their Own Right. Wellcome Trust. https://dx.doi.org/10.6084/m9.figshare.4055496.v1
Philippe Rocca-Serra, PhD, Senior Research Lecturer
Alejandra Gonzalez-Beltran, PhD, Research Lecturer
Milo Thurston, DPhil, Research Software Engineer
Massimiliano Izzo, PhD, Research Software Engineer
Peter McQuilton, PhD, Knowledge Engineer
Allyson Lister, PhD, Knowledge Engineer
Eamonn Maguire, DPhil, Contractor
David Johnson, PhD, Research Software Engineer
Melanie Adekale, PhD, Biocurator Contractor
Delphine Dauga, PhD, Biocurator Contractor
We work with and for
to make data and other digital research assets
Susanna-Assunta Sansone, PhDPrincipal Investigator, Associate Director and Data Consultant for Springer Nature
enabling open science, driving science and discoveries