Seminar at CIFASIS, Rosario, Argentina
TRANSCRIPT
Data Publication in Action: BioSharing and the ISA Infrastructure
Alejandra González-Beltrán, PhD Oxford e-Research Centre, University of Oxford
[email protected] @alegonbel
Centro Internacional Franco-Argentino de Ciencias de la Información y Sistemas (CIFASIS)
19 December 2014, Rosario, Argentina
Outline
• Research data management
• Data interoperability
• BioSharing: registries of data policies, data standards and databases
• Enabling reproducible research
• ISA infrastructure
• Reproducibility study: from experimental design to publication of results
Data publication
https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
Scientific data and the effect of poor research data management
The experimental workflow
[Diagram labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment]
metadata
Data Interoperability
Reproducibility
Data Review
Data Reusability
Data standards include:
• minimum information reporting requirements, or checklists, to report the same core, essential information
• conceptual models or schemas, from which an exchange format is derived so that data can flow from one system to another
• controlled vocabularies, taxonomies, thesauri, ontologies etc., to use the same word to refer to the same 'thing'
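A minimum information checklist can be read as a list of fields that every record must fill in. The sketch below is only an illustration of that idea: the field names and the missing_fields helper are hypothetical and are not taken from any particular reporting guideline.

```python
# Hypothetical checklist: the field names are placeholders for illustration,
# not the content of a real minimum information guideline.
REQUIRED_FIELDS = {"organism", "age", "protocol"}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return sorted(
        name for name in required
        if not str(record.get(name, "")).strip()
    )


# A record that fills in two of the three required fields.
RECORD = {"organism": "Homo sapiens", "age": "35", "protocol": ""}
MISSING = missing_fields(RECORD)
print(MISSING)
```

A curation tool could run a check of this kind at data entry time, so that gaps are caught at the source rather than at publication.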
Notes in lab books (information for humans)
Spreadsheets and tables (the compromise)
Facts as RDF statements (information for machines)
Notes and narrative; spreadsheets and tables; linked data and data publication
Increase the level of annotation at the source, tracking provenance and using community standards
Enabling reproducible research and open science, driving science and discoveries
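To make the step from spreadsheet rows to RDF statements concrete, here is a minimal sketch that turns one row into subject-predicate-object triples and prints them in an N-Triples-like form. The http://example.org/ namespace, the column names and both helper functions are assumptions made for illustration; a real conversion would use terms from community ontologies and an RDF library, and would escape literal values properly.

```python
# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"


def row_to_triples(row, id_column):
    """Turn one spreadsheet row (a dict) into (subject, predicate, object) triples."""
    subject = BASE + row[id_column]
    return [
        (subject, BASE + column.replace(" ", "_"), value)
        for column, value in row.items()
        if column != id_column
    ]


def to_ntriples(triples):
    """Serialise triples as N-Triples-like lines (no literal escaping here)."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)


# One row as it might appear in a spreadsheet (column names are made up).
ROW = {"sample": "H1.sample1", "organism": "Homo sapiens", "age in years": "35"}
TRIPLES = row_to_triples(ROW, "sample")
print(to_ntriples(TRIPLES))
```

Each cell becomes one statement about the sample, which is what makes the content queryable by machines rather than readable only by people.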
http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
Also used by communities working to build a library of cellular signatures.
General-purpose, configurable format designed to support:
• description of the experimental metadata, making the annotation explicit and discoverable
• provenance tracking
• use of community standards, such as minimal reporting guidelines and terminologies
Designed to be converted to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories
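A tab-delimited metadata table of the kind described above can be read with standard tooling. The sketch below builds a small table in memory and groups data files by source. The column headers follow ISA-Tab naming conventions as far as I know them, but the table itself is a simplified, hypothetical example and not a complete or validated ISA-Tab file.

```python
import csv
import io

# Simplified, hypothetical ISA-Tab-style table (header row plus three records).
HEADER = ["Source Name", "Characteristics[organism]", "Characteristics[age]",
          "Unit", "Sample Name", "Array Data File"]
ROWS = [
    ["H1", "Homo sapiens", "35", "year", "H1.sample1", "h1-s1.cel"],
    ["H1", "Homo sapiens", "35", "year", "H1.sample2", "h1-s2.cel"],
    ["H2", "Homo sapiens", "33", "year", "H2.sample1", "h2-s1.cel"],
]
TABLE_TEXT = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def files_by_source(tab_text):
    """Group the data file names by the source they were obtained from."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    grouped = {}
    for row in reader:
        grouped.setdefault(row["Source Name"], []).append(row["Array Data File"])
    return grouped


FILES = files_by_source(TABLE_TEXT)
print(FILES)
```

Because every column states explicitly what it holds, the same table can be converted to other metadata formats without guessing at the meaning of each cell.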
[Figure: example study with two human sources, first laid out as spreadsheet rows and then as a graph]
Sources: H. sapiens, H1 (35 years) and H2 (33 years)
Samples: H1.sample1, H1.sample2, H2.sample1
Labeling: H1.sample1.labeled, H2.sample1.labeled
Scanning: h1-s1.cel, h1-s2.cel, h2-s1.cel
obi:material entity
obi:material sample
obi:material processing
obi:processed material
obi:planned process
isa:raw data file
bfo:derives from
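The derives-from relation listed above is what allows a data file to be traced back to its source material. The following sketch stores such links in a dictionary and walks them backwards. The edges are my reading of the flattened example figure and should be treated as an assumption; h1-s2.cel is left out because the transcript does not show its labeled extract.

```python
# Each key derives from its value (edges assumed from the example figure).
DERIVES_FROM = {
    "H1.sample1": "H1",
    "H1.sample2": "H1",
    "H2.sample1": "H2",
    "H1.sample1.labeled": "H1.sample1",
    "H2.sample1.labeled": "H2.sample1",
    "h1-s1.cel": "H1.sample1.labeled",
    "h2-s1.cel": "H2.sample1.labeled",
}


def lineage(node, edges=DERIVES_FROM):
    """Follow derives-from links from a node back to its original source."""
    chain = [node]
    while chain[-1] in edges:
        chain.append(edges[chain[-1]])
    return chain


LINEAGE = lineage("h1-s1.cel")
print(" <- ".join(LINEAGE))
```

The same walk answers the reverse question a reviewer might ask: which subject, of what age and organism, does this raw data file ultimately come from.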
The experimental plan - life sciences case
InnoMed PredTox Project
2-week systemic rat study using male Wistar rats (N=15 per dose group)
14 proprietary drug candidates from participating companies and 2 reference toxic compounds
• experimental design
• sample characteristic(s)
• experimental variable(s)
• technology(s)
• measurement(s)
• protocol(s)
• data file(s)
• …
The experimental plan - computational case
Evaluation of the SOAPdenovo2 tool for the de novo assembly of genomes from short DNA reads produced by next-generation sequencing, implementing improvements over the SOAPdenovo1 assembler.
• open peer review
• availability of:
  • data
  • analysis scripts
  • documentation
Predictor Variables (Factor Name, Factor Type)
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size: bacterial genome (S. aureus, R. sphaeroides), insect genome (B. impatiens), human genome (Chinese Han genome, or YH genome)
3x3 factorial design: 9 study groups
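The 3x3 factorial design follows directly from crossing the two predictor variables. As a small sketch, the nine study groups can be enumerated with a Cartesian product; the factor names and levels are taken from the slides, while the variable names in the code are my own.

```python
from itertools import product

# Factor levels as listed on the slides.
ALGORITHMS = ["SOAPdenovo2", "SOAPdenovo1", "ALL-PATHS-LG"]
GENOME_SIZES = ["bacterial genome", "insect genome", "human genome"]

# Every combination of one algorithm with one genome size is a study group.
STUDY_GROUPS = [
    {"genome assembly algorithm": algorithm, "genome size": size}
    for algorithm, size in product(ALGORITHMS, GENOME_SIZES)
]

for number, group in enumerate(STUDY_GROUPS, start=1):
    print(number, group["genome assembly algorithm"], "x", group["genome size"])
```

Listing the groups explicitly makes it easy to check that results are reported for every combination and not only for the favourable ones.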
Response Variables
• genome coverage
• computation run time
• memory consumption
SOAPdenovo2
http://isa-tools.github.io/soapdenovo2
• ISA-Tab for the identification of experimental variables and experimental design, and to specify unambiguously the relevant electronic resources (e.g. records from public repositories) with their official identifiers
• Galaxy workflows to re-enact the data analysis
• Nanopublication: represents structured data along with its provenance in a single publishable and citable entity
• ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects
“genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”
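A statement like the one quoted above is the kind of finding a nanopublication packages together with its provenance. The sketch below models the three usual parts of a nanopublication (assertion, provenance, publication information) as a plain data structure. It is not the RDF serialisation used in practice, and every value other than the quoted assertion is a placeholder.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Nanopublication:
    """Illustrative container; real nanopublications are RDF named graphs."""
    assertion: tuple         # the claim, as (subject, predicate, object)
    provenance: dict         # how the claim was obtained
    publication_info: dict   # who published the nanopublication, and when


NANOPUB = Nanopublication(
    assertion=(
        "genome coverage over the human data",
        "increased when comparing",
        "SOAPdenovo2 against SOAPdenovo1",
    ),
    # Placeholder values: the slides do not give these details.
    provenance={"derived from": "<study description placeholder>"},
    publication_info={"creator": "<author placeholder>"},
)

PARTS = [field.name for field in fields(Nanopublication)]
print(PARTS)
```

Keeping the claim and its provenance in one object is what lets the finding be cited on its own, separately from the article that reports it.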
OntoMaton: a BioPortal-powered ontology widget for Google Spreadsheets. Maguire et al. (2013), Bioinformatics
Widget for ontology annotation and tagging on Google Spreadsheets, relying on BioPortal and Linked Open Vocabularies services
NanoMaton https://github.com/ISA-tools/NanoMaton
Ontology for Biomedical Investigations
Semanticscience Integrated Ontology
SOAPdenovo2
• unambiguously identify electronic resources, such as records from public repositories, by providing their official identifiers
• be explicit about experimental design and experimental variables, identifying the goal of the experiment and the independent and response variables
• remain neutral and report all findings of similar importance with the same weight
• report the results with respect to all the identified response variables
http://www.lomasnuevo.net/otras/actualidad/big-data-una-mina-de-oro/
http://www.euroscience.org/Publications.html
Data Publication
http://gigasciencejournal.com
http://gigadb.org/dataset/100035
Experimental metadata or structured component (in-house curated, machine-readable formats)
Article or narrative component (PDF and HTML)
A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these
Credit for sharing your data
Focused on reuse and reproducibility
Peer reviewed, curated
Promoting Community Data Repositories
Open Access
FAIR data: Findable, Accessible, Interoperable, Reusable
Our areas of research:
• Data capture and curation
• Data (nano)publication
• Data provenance
• Open, community ontologies and standards
• Semantic web
• Software development
• Training
Communities we work with/for:
As part of:
• UK, European and international consortia
• Pre-competitive informatics public-private partnerships
• Standardisation initiatives
Some of the groups we engage with include:
eTRIKS – European Translational Information and Knowledge Management Services: a consortium of academic partners (Imperial College, CNRS, University of Luxembourg) and pharmaceutical companies (Janssen, Merck, AZ, Lilly, Lundbeck, Pfizer, Roche, Sanofi, Bayer, GSK) building a sustainable, open translational research informatics platform
• Nature Publishing Group's Scientific Data
• BioMedCentral and BGI's GigaScience
• F1000 Research
• Oxford University Press
StatO – Statistics Ontology
CEDAR – Centre for Expanded Data Annotation and Retrieval
BioCADDIE – Biomedical and healthCAre Data Discovery and Indexing Ecosystem
COPO – Collaborative Open Plant Omics: a consortium of academic partners (TGAC, EBI, Oxford, Warwick) building a sustainable, open research informatics platform for plant science
funders
acknowledgements
Scott Edmunds, GigaScience
Peter Li, GigaScience
Jun Zhao, Lancaster University
María Susana Avila García, Oxford University
Marco Roos, Leiden University
Mark Thompson, Leiden University
Ruibang Luo, University of Hong Kong
Tin-Lap Lee, Chinese University of Hong Kong
Tak-wah Lam, University of Hong Kong
SOAPdenovo2 use case
Eelke van der Horst, Leiden University
Rajaram Kaliyaperumal, Leiden University
Questions? You can email us...
View our blog http://isatools.wordpress.com
Follow us on Twitter @isatools
View our websites
View our Git repo & contribute http://github.com/ISA-tools
Thanks for your attention!