Seminar in CIFASIS, Rosario, Argentina

Data Publication in Action: BioSharing and the ISA infrastructure

Alejandra González-Beltrán, PhD, Oxford e-Research Centre, University of Oxford

[email protected] @alegonbel

Centro Internacional Franco-Argentino de Ciencias de la Información y Sistemas (CIFASIS)

19 December 2014, Rosario, Argentina


Post on 14-Jul-2015



Outline

Research data management

Data interoperability

BioSharing: data policies, data standards, database registries

Enabling reproducible research

ISA infrastructure

Reproducibility study: from experimental design to publication of results

Data publication

https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/

Scientific data and the effect of poor research data management

The experimental workflow

[Figure: workflow diagram. Labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment]

[Figure builds: the same workflow diagram is annotated in turn with metadata, Data Interoperability, Reproducibility, Data Review and Data Reusability]

http://betterhealth.mckesson.com/2013/04/incenting-interoperability/

data interoperability

Including minimum information reporting requirements, or checklists to report the same core, essential information

Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another

Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’

Nanotechnology Working Group

What standards can I use?

Notes in lab books (information for humans)

Spreadsheets and tables (the compromise)

Facts as RDF statements (information for machines)

Notes and narrative; spreadsheets and tables; linked data and data publication
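The step from "spreadsheets and tables" to "facts as RDF statements" can be illustrated with a minimal sketch. It assumes a hypothetical base IRI (`http://example.org/`) and made-up column names; it does not escape special characters in values and is not the conversion performed by the ISA tools.

```python
# Hypothetical base IRI and column names, used only for illustration.
BASE = "http://example.org/"

def row_to_triples(row_id, row):
    """Turn one spreadsheet row (a dict) into (subject, predicate, object) facts."""
    subject = f"<{BASE}sample/{row_id}>"
    triples = []
    for column, value in row.items():
        predicate = f"<{BASE}property/{column.replace(' ', '_')}>"
        # Values are emitted as plain string literals, without escaping.
        triples.append((subject, predicate, f'"{value}"'))
    return triples

def to_ntriples(triples):
    """Serialise the facts one per line, in N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

row = {"organism": "Homo sapiens", "age in years": "35"}
triples = row_to_triples("H1", row)
print(to_ntriples(triples))
```

Each cell becomes one statement about the row's subject, which is what makes the content addressable by machines rather than only readable by humans.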

Increase the level of annotation at the source, tracking provenance and using community standards

Enabling reproducible research and open science, driving science and discoveries

http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png


A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• also by communities working to build a library of cellular signatures

General-purpose, configurable format designed to support:

• description of the experimental metadata, making the annotation explicit and discoverable
• provenance tracking
• use of community standards, such as minimal reporting guidelines and terminologies
• conversion to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories
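As a rough illustration of what a tabular, tab-delimited metadata description looks like, the sketch below writes and re-reads a small table with Python's standard `csv` module. The column headers are written in the ISA-Tab naming style but are an assumption that should be checked against the ISA-Tab specification; the values are the sample names and ages used in the talk's example.

```python
import csv
import io

# Column headers in the ISA-Tab naming style (an assumption to verify
# against the specification); values come from the talk's example.
HEADERS = ["Source Name", "Characteristics[organism]",
           "Characteristics[age]", "Unit", "Sample Name"]
ROWS = [
    ["H1", "Homo sapiens", "35", "years", "H1.sample1"],
    ["H1", "Homo sapiens", "35", "years", "H1.sample2"],
    ["H2", "Homo sapiens", "33", "years", "H2.sample1"],
]

def write_table(headers, rows):
    """Serialise the table as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()

def read_table(text):
    """Parse tab-delimited text back into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

table_text = write_table(HEADERS, ROWS)
records = read_table(table_text)
print(table_text)
```

Keeping the annotation in explicit, named columns is what makes it discoverable and convertible to other formats.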

[Figure, shown in two layouts: sources H1 (H. sapiens, 35 years) and H2 (H. sapiens, 33 years); samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps producing H1.sample1.labeled and H2.sample1.labeled; Scanning steps producing the data files h1-s1.cel, h1-s2.cel and h2-s1.cel; further rows elided (...)]

obi:material entity

obi:material sample

obi:material processing

obi:processed material

obi:planned process

isa:raw data file

bfo:derives from
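Provenance tracking with a "derives from" relation can be sketched as a walk back through the links. The sketch below uses only the two chains whose nodes are all named in the example; which node derives from which is an assumption read off the figure, and the dictionary is a stand-in for a real graph store.

```python
# "derives from" links for the two chains that are fully named in the example.
# The direction and pairing of the links are assumptions read off the figure.
DERIVES_FROM = {
    "h1-s1.cel": "H1.sample1.labeled",
    "H1.sample1.labeled": "H1.sample1",
    "H1.sample1": "H1",
    "h2-s1.cel": "H2.sample1.labeled",
    "H2.sample1.labeled": "H2.sample1",
    "H2.sample1": "H2",
}

def trace(node, links=DERIVES_FROM):
    """Follow 'derives from' links from a node back to its original source."""
    path = [node]
    while path[-1] in links:
        parent = links[path[-1]]
        if parent in path:  # guard against accidental cycles
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
    return path

print(" derives from ".join(trace("h1-s1.cel")))
```

Starting from a raw data file, the walk recovers the labeled material, the sample and finally the source it came from.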

http://mousebreath.com/2012/03/max-monday-green-monster/

Reproducibility Study

The experimental plan - life sciences case

InnoMed PredTox Project

2-week systemic rat study using male Wistar rats (N=15 per dose group)

14 proprietary drug candidates from participating companies and 2 reference toxic compounds

experimental design

sample characteristic(s)

experimental variable(s)

technology(ies)

measurement(s)

protocol(s)

data file(s)

…

The experimental plan - computational case

Evaluation of the SOAPdenovo2 tool for the de novo assembly of genomes from short DNA segments read by next-generation sequencing, implementing improvements over the SOAPdenovo1 assembler.

• open peer-review
• availability of
  • data
  • analysis scripts
  • documentation

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG

genome size: bacterial genome, insect genome, human genome

3x3 factorial design, 9 study groups
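The 3x3 factorial design can be enumerated directly: every level of one factor is crossed with every level of the other. This is a minimal sketch using the factor levels named on the slide; the function name `study_groups` is illustrative.

```python
from itertools import product

ALGORITHMS = ["SOAPdenovo2", "SOAPdenovo1", "ALL-PATHS-LG"]
GENOME_SIZES = ["bacterial genome", "insect genome", "human genome"]

def study_groups(*factors):
    """Full factorial design: one study group per combination of factor levels."""
    return list(product(*factors))

groups = study_groups(ALGORITHMS, GENOME_SIZES)
for number, (algorithm, genome) in enumerate(groups, start=1):
    print(f"group {number}: {algorithm} x {genome}")
```

Three algorithms crossed with three genome sizes give the 9 study groups stated above.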

The experimental plan - computational case

bacterial genome: S. aureus, R. sphaeroides

insect genome: B. impatiens

human genome: Chinese Han genome (or YH genome)

The experimental plan - computational case

Response Variables

genome coverage

computation run time

memory consumption

SOAPdenovo2

http://isa-tools.github.io/soapdenovo2

ISA-Tab for the identification of experimental variables and experimental design, and to specify unambiguously relevant electronic resources (e.g. records from public repositories) with their official identifiers

Galaxy workflows to re-enact the data analysis

Nanopublication: represents structured data along with its provenance in a single publishable and citable entity

ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects
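The nanopublication idea, one assertion bundled with its provenance under a single citable identifier, can be sketched with a small data class. Everything here is illustrative: the class, its field names and the `http://example.org/nanopub/1` identifier are hypothetical, the assertion paraphrases the coverage claim examined in the reproducibility study, and real nanopublications are expressed as RDF graphs rather than Python objects.

```python
from dataclasses import dataclass, field

@dataclass
class Nanopublication:
    """A single assertion bundled with its provenance under one identifier."""
    identifier: str   # hypothetical identifier that makes the entity citable
    assertion: tuple  # (subject, predicate, object)
    provenance: dict = field(default_factory=dict)

    def cite(self):
        """Return a human-readable citation line for the whole entity."""
        subject, predicate, obj = self.assertion
        source = self.provenance.get("derived_from", "unknown source")
        return f"{subject} {predicate} {obj} [{self.identifier}; from {source}]"

nanopub = Nanopublication(
    identifier="http://example.org/nanopub/1",  # placeholder, not a real record
    assertion=("SOAPdenovo2",
               "increased genome coverage on the human data compared with",
               "SOAPdenovo1"),
    provenance={"derived_from": "http://isa-tools.github.io/soapdenovo2"},
)
print(nanopub.cite())
```

Because the assertion and its provenance travel together, a reader who cites the identifier also gets the trail back to where the claim came from.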

Reproducing SOAPdenovo2 results: Galaxy workflows

S. aureus pipeline

“genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”

Response Variables

genome coverage

computation run time

memory consumption

OntoMaton: a BioPortal powered Ontology widget for Google Spreadsheets

Maguire et al. (2013), Bioinformatics

widget for ontology annotation and tagging on Google spreadsheets, relying on BioPortal and Linked Open Vocabularies services

NanoMaton https://github.com/ISA-tools/NanoMaton

Ontology for Biomedical Investigations

Semanticscience Integrated Ontology

SOAPdenovo2

• unambiguously identify electronic resources, such as records from public repositories, by providing their official identifiers
• be explicit about experimental design and experimental variables, identifying the goal of the experiment and the independent and response variables
• remain neutral and report all findings of similar importance with the same weight
• report the results with respect to all the identified response variables

http://www.lomasnuevo.net/otras/actualidad/big-data-una-mina-de-oro/

http://www.euroscience.org/Publications.html

Data Publication

http://gigasciencejournal.com

http://gigadb.org/dataset/100035


Experimental metadata or structured component (in-house curated, machine-readable formats)

Article or narrative component (PDF and HTML)

A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these

Credit for sharing your data

Focused on reuse and reproducibility

Peer reviewed, curated

Promoting Community Data Repositories

Open Access

[Figure: the experimental workflow diagram]

FAIR data: Findable, Accessible, Interoperable, Reusable

http://goo.gl/tWRjYI

Our areas of research:

• Data capture and curation
• Data (nano)publication
• Data provenance
• Open, community ontologies and standards
• Semantic web
• Software development
• Training

Communities we work with/for:

As part of:

• UK, European and international consortia
• Pre-competitive informatics public-private partnerships
• Standardisation initiatives

Some of the groups we engage with include:

eTRIKS – European Translational Information and Knowledge Management Services. Consortium of academic partners (Imperial College, CNRS, University of Luxembourg) and pharmaceutical companies (Janssen, Merck, AZ, Lilly, Lundbeck, Pfizer, Roche, Sanofi, Bayer, GSK) building a sustainable, open translational research informatics platform

• Nature Publishing Group’s Scientific Data
• BioMedCentral and BGI’s GigaScience
• F1000 Research
• Oxford University Press

StatO – Statistics Ontology

CEDAR – Centre for Expanded Data Annotation and Retrieval

BioCADDIE – Biomedical and healthCAre Data Discovery and Indexing Ecosystem

COPO – Collaborative Open Plant Omics. Consortium of academic partners (TGAC, EBI, Oxford, Warwick) building a sustainable, open research informatics platform for plant science

funders

acknowledgements

Scott Edmunds, GigaScience

Peter Li, GigaScience

Jun Zhao, Lancaster University

María Susana Avila García, Oxford University

Marco Roos, Leiden University

Mark Thompson, Leiden University

Ruibang Luo, University of Hong Kong

Tin-Lap Lee, Chinese University of Hong Kong

Tak-wah Lam, University of Hong Kong

SOAPdenovo2 use case

Eelke van der Horst, Leiden University

Rajaram Kaliyaperumal, Leiden University

Questions? You can email us...

[email protected]

View our blog http://isatools.wordpress.com

Follow us on Twitter @isatools

View our websites

View our Git repo & contribute http://github.com/ISA-tools

Thanks for your attention!