Towards Reproducible Science: a few building blocks from my personal experience

Oscar Corcho (with contributions from Olga Giraldo, Alexander García, and Idafen Santana) http://www.oeg-upm.net/index.php/en/researchareas/3-semanticscience/index.html Ontology Engineering Group Universidad Politécnica de Madrid, Spain Towards Reproducible Science: a few building blocks from my personal experience @ocorcho 22/10/2017 S4BioDiv2017, Vienna

Upload: oscar-corcho

Post on 28-Jan-2018

TRANSCRIPT

Oscar Corcho (with contributions from Olga Giraldo, Alexander García, and Idafen Santana)

http://www.oeg-upm.net/index.php/en/researchareas/3-semanticscience/index.html

Ontology Engineering Group

Universidad Politécnica de Madrid, Spain

Towards Reproducible Science: a few building blocks from my personal experience

@ocorcho

22/10/2017

S4BioDiv2017, Vienna

Towards Reproducible Science

Introduction

Diagram labels: HYPOTHESIS; CONVINCE AUDIENCE; REPEATABLE SCIENTIFIC EXPERIMENTS

SCIENTIFIC EXPERIMENTS: IN VIVO/VITRO, IN SILICO

REPEATABILITY

Alison’s biodiversity scientists

Before continuing…

What does reproducibility mean for you?

And for your colleagues?

And for the colleagues from other disciplines?

The R* brouhaha

Source: The R* brouhaha. Goble C. RDA-Europe’s workshop on RepScience 2016.

My own take on terminology

PRESERVATION

CONSERVATION

REPLICABILITY

REPRODUCIBILITY

Experiment components

Grid labels: DATA, SCIENTIFIC PROCEDURE, EQUIPMENT (columns); IN VIVO/VITRO, IN SILICO (rows)

This has attracted most of the attention so far

Block 1. Experimental Protocols

Olga Giraldo

Alexander García

Explore alternative ways for documenting and retrieving information from experimental protocols

Using Semantics and NLP in the SMART Protocols Repository. Giraldo O, García-Castro A, Corcho O - ICBO, 2015

Using Semantics and Natural Language Processing in Experimental Protocols. Giraldo O, García-Castro A, Figueredo J, Corcho O - J Biomedical Semantics, to appear

SMART protocols: semantic representation for experimental protocols. Giraldo O, García-Castro A, Corcho O - Linked Science 2014

What is an experimental protocol

Experimental protocols are like cooking recipes:

• They have ingredients: reagents and sample
• They have appliances: equipment
• They have a list of instructions
• They have a total time
• They have critical steps…

The protocols should have complete information that allows anybody to recreate an experiment.
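The recipe analogy can be made concrete with a small record type. The following is only a sketch: the class name `Protocol`, its field names, and the example values are assumptions chosen for illustration, not the schema used by the authors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protocol:
    """Recipe-like record of an experimental protocol (hypothetical schema)."""
    title: str
    reagents: List[str] = field(default_factory=list)       # "ingredients"
    samples: List[str] = field(default_factory=list)        # "ingredients"
    equipment: List[str] = field(default_factory=list)      # "appliances"
    instructions: List[str] = field(default_factory=list)   # ordered steps
    total_time_min: Optional[int] = None
    critical_steps: List[int] = field(default_factory=list)  # indexes into instructions

    def missing_parts(self) -> List[str]:
        """Names of the recipe parts that are still empty or unset."""
        missing = [
            name
            for name in ("reagents", "samples", "equipment", "instructions")
            if not getattr(self, name)
        ]
        if self.total_time_min is None:
            missing.append("total_time_min")
        return missing


# Illustrative draft: only a reagent and one instruction have been filled in.
draft = Protocol(
    title="DNA extraction (illustrative)",
    reagents=["wash buffer"],
    instructions=["Rinse DNA briefly in 1-2 ml of wash."],
)
print(draft.missing_parts())
```

A check like `missing_parts` is one simple way to test whether a protocol carries the complete information that somebody else would need to recreate the experiment.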

Some of the issues we aim to address

• Incubate the centrifuge tubes in a water bath.
• Incubate the samples for 5 min with gentle shaking.
• Rinse DNA briefly in 1-2 ml of wash.
• Incubate at -20C overnight.

Some protocols present insufficient granularity, and the instructions can be imprecise or ambiguous due to the use of natural language.

The protocols lack structure.

Ingredients for Improving Reproducibility

• Bio-ontologies: OBI, EXPO, EXACT, BAO, IAO, ERO…
• Data repository for making data available
• Resources for reporting guidelines or Minimum Information standards

Few efforts focus on representing and standardizing experimental protocols.

For reproducibility purposes, if the data must be available, so must the experimental protocol detailing the methodology followed to derive the data.

Main research question

How to formalize the information from laboratory protocols as a knowledge base?

Our approach

• Ontology model representing lab protocols
• Gazetteer-based method: use existing lists of named entities (lists of proper nouns, which refer to real-life entities)
• Rule-based approaches: write manual extraction rules
• Development of a Gold Standard of protocols annotated manually
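A gazetteer-based method can be sketched in a few lines. This is a simplified stand-in written with plain regular expressions, not the authors' implementation; the `GAZETTEER` dictionary and the `annotate` function are hypothetical names, and the term lists are tiny samples.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical gazetteer: entity type -> known surface forms.
GAZETTEER: Dict[str, List[str]] = {
    "Instrument": ["centrifuge", "water bath", "microscope"],
    "Reagent": ["glucose", "DMEM/F12", "B27 supplement"],
    "Sample": ["neural stem cells", "insect cells"],
}


def annotate(
    text: str, gazetteer: Dict[str, List[str]] = GAZETTEER
) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, entity_type, matched_text) for every gazetteer hit."""
    hits = []
    for entity_type, terms in gazetteer.items():
        for term in terms:
            # Case-insensitive match that does not start or end inside a longer word.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.start(), m.end(), entity_type, m.group(0)))
    return sorted(hits)


sentence = "Add glucose to the neural stem cells and place the tubes in a water bath."
for start, end, entity_type, surface in annotate(sentence):
    print(f"{entity_type:<10} {surface!r} [{start}:{end}]")
```

The lookup only finds what is already in the lists, which is why the approach pairs it with extraction rules and a manually annotated gold standard.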

SMART Protocols ontology

http://vocab.linkeddata.es/SMARTProtocols/

https://smartprotocols.github.io/

The SIRO model

Sample/Specimen (whole organism, anatomical part, bodily fluids, etc.)

Instruments (equipment, devices, consumables, software)

Reagents (chemical compounds, mixtures)

Objective (purpose)

The SIRO model supports search, retrieval and classification of experimental protocols
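To illustrate how a SIRO description could support search and retrieval, here is a minimal sketch. The `SiroRecord` class, the `search` function, the protocol ids and the objective text of the second record are all assumptions made for the example; the entity names are borrowed from elsewhere in this talk.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class SiroRecord:
    """Minimal SIRO description of one protocol (field names are illustrative)."""
    protocol_id: str
    samples: Set[str] = field(default_factory=set)
    instruments: Set[str] = field(default_factory=set)
    reagents: Set[str] = field(default_factory=set)
    objective: str = ""


def search(
    records: Iterable[SiroRecord],
    sample: Optional[str] = None,
    instrument: Optional[str] = None,
    reagent: Optional[str] = None,
    objective_contains: Optional[str] = None,
) -> List[str]:
    """Return ids of the records that satisfy every criterion that was given."""
    found = []
    for r in records:
        if sample and sample not in r.samples:
            continue
        if instrument and instrument not in r.instruments:
            continue
        if reagent and reagent not in r.reagents:
            continue
        if objective_contains and objective_contains.lower() not in r.objective.lower():
            continue
        found.append(r.protocol_id)
    return found


records = [
    SiroRecord("p1", {"neural stem cells"}, {"microscope", "cell culture incubator"},
               {"glucose", "B27 supplement"}, "in vitro assessment of neural migration"),
    SiroRecord("p2", {"insect cells"}, {"centrifuge", "spinner"},
               {"phenol:chloroform"}, "illustrative objective text"),
]
print(search(records, instrument="centrifuge"))
print(search(records, objective_contains="migration"))
```

Keeping the four SIRO facets as separate fields is what makes faceted queries like these straightforward.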

Design of semantic Gazetteer and JAPE rules

• Design of semantic Gazetteers: facilitate the annotation of instances related to experimental actions, instruments, samples/organisms and reagents
• Design of grammar rules: facilitate the annotation of instructions
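The idea behind rule-based annotation of instructions can be approximated with regular expressions. Note that this is not JAPE: it is a Python sketch of the general technique, and the action list, the two patterns and the function name `parse_instruction` are assumptions for illustration only.

```python
import re
from typing import Dict, Optional

ACTIONS = ("incubate", "rinse", "centrifuge", "mix")  # hypothetical action list
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(s|sec|min|h|hr|hours?)\b", re.IGNORECASE)
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°?\s*C\b")


def parse_instruction(text: str) -> Dict[str, Optional[str]]:
    """Extract the leading action verb, a duration and a temperature, if present."""
    words = text.strip().split()
    first = words[0].lower() if words else ""
    duration = DURATION.search(text)
    temperature = TEMPERATURE.search(text)
    return {
        "action": first if first in ACTIONS else None,
        "duration": duration.group(0) if duration else None,
        "temperature": temperature.group(0) if temperature else None,
    }


for instruction in (
    "Incubate the samples for 5 min with gentle shaking.",
    "Incubate at -20C overnight.",
    "Rinse DNA briefly in 1-2 ml of wash.",
):
    print(instruction, "->", parse_instruction(instruction))
```

Fields that come back as `None` point at exactly the kind of imprecision that natural-language instructions tend to hide, such as "overnight" or "briefly" in place of a stated duration.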

Development of a Gold Standard

Materials: 100 protocols published in several repositories
• BioTechniques
• CSH-Protocols
• Current protocols
• Genet and Mol. Res
• Journal of Biolog. Methods
• Jove
• MethodsX
• Nature protocols exchange
• Nature protocols

Annotators: experts in life sciences
• Curso BIOS 2016, Colombia
• Universidad del Valle, Colombia
• Japan (Database Center for Life Science (DBCLS), Robotic Biology Institute (RBI), Spiber, Yachie-Lab, University of Tokyo)
• Universidad Santiago de Cali, Colombia

The SMART Protocols Annotation Tool: http://smart-protocols.labs.linkingdata.io/dist/dev/#/login

Guidelines about what and how to annotate

Preliminary results

Category | Entity | sample | instrument | reagent | objective
Sample | Neural cell | 3 | 0 | 0 | 0
Sample | neural stem cells (NSCs) | 3 | 0 | 0 | 0
Instrument | Cell culture centrifuge | 0 | 3 | 0 | 0
Instrument | cell culture incubator | 0 | 3 | 0 | 0
Instrument | Microscope | 0 | 3 | 0 | 0
Instrument | Millicell culture plate inserts 8-µm pore size | 0 | 3 | 0 | 0
Reagent | B27 supplement | 0 | 0 | 3 | 0
Reagent | DMEM/F12 | 0 | 0 | 3 | 0
Reagent | FGF2 neutralizing antibody | 0 | 0 | 3 | 0
Reagent | glucose | 0 | 0 | 3 | 0
Objective | Here we describe two migration assays, a matrigel migration assay and a Boyden chamber migration assay, which allow the in vitro assessment of neural migration under defined conditions (Ladewig, Koch and Brüstle, 2014). | 0 | 0 | 0 | 3

Fleiss Kappa for 3 raters = 1.0

Category | Entity | sample | instrument | reagent
Reagent - Sample/Organism | Ac-omega viral DNA | 1 | | 2
Reagent - Sample/Organism | baculoviral | 1 | | 2
Reagent - Sample/Organism | DNA insert | 2 | | 1
Reagent - Sample/Organism | I-Sce I meganuclease | 1 | | 2
Sample/Organism | Insect cells | 3 | |
Instrument | spinner | | 3 |
Instrument | Centrifuge | | 3 |
Instrument | Flask | | 3 |
Reagent | IPL-41 powdered | | | 3
Reagent | Liposome formulation | | | 3
Reagent | Phenol:chloroform | | | 3

Fleiss Kappa for 3 raters = 0.755
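For readers unfamiliar with the agreement measure reported above, Fleiss' kappa can be computed from a table of per-entity rating counts as follows. This is a generic textbook implementation, not the authors' script; the function name and the example matrix are assumptions. The matrix encodes eleven entities on which all three raters agree, across four categories.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for counts[item][category] = number of raters choosing it."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: per-item agreement P_i, averaged over all items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of ratings in each category.
    total = n_items * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Kappa is undefined when one category receives every rating;
        # this sketch reports complete agreement in that case.
        return 1.0
    return (p_bar - p_expected) / (1 - p_expected)


# Eleven entities, three raters, all unanimous (illustrative input).
unanimous = (
    [[3, 0, 0, 0]] * 2 + [[0, 3, 0, 0]] * 4 + [[0, 0, 3, 0]] * 4 + [[0, 0, 0, 3]]
)
print(round(fleiss_kappa(unanimous), 3))
```

When every rater picks the same category for every entity, the observed agreement is 1 and so is kappa; any split between categories lowers it.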

Our ongoing work

So far, this is fine for handling protocols that have already been reported in papers.

Can we actually change the way in which these protocols are produced?

Platform for publishing semantic protocols

Features:

Open semantic publishing platform
o The protocols are born semantic

Self-describing documents
o Meaningful entities
o Machine-processable workflows

Documents will reference existing URIs
o Samples/organisms
o Reagents/chemical compounds
o Instruments

Diagram labels: SMART Protocols Ontology / Gazetteers / Grammar rules; UniProt; NCBI taxonomy; PubChem; Vendors

The platform

Platform available at: http://smartprotocols.labs.linkingdata.io/app/protocols

Capturing relevant elements in the document

Organisms come from the UniProt Taxon API

After selecting an organism, the corresponding ID is automatically recorded

Reagents come from the PubChem API

Machine processable workflows

[Diagram: five boxes, each labelled "Step"]
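One way to picture a machine-processable workflow is as an ordered list of steps whose parameters and referenced entities are explicit fields rather than free text. The sketch below is an assumption about what such a representation might look like, not the platform's actual format; the `Step` class, its field names and the `example.org` URIs are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One protocol step; parameters hold values a machine can check."""
    order: int
    action: str
    parameters: Dict[str, str] = field(default_factory=dict)
    uses: List[str] = field(default_factory=list)  # URIs of samples, reagents, instruments


def to_json(steps: List[Step]) -> str:
    """Serialize the steps in execution order."""
    ordered = sorted(steps, key=lambda s: s.order)
    return json.dumps([asdict(s) for s in ordered], indent=2)


# Placeholder URIs; a real document would reference existing identifiers.
steps = [
    Step(2, "rinse", {"volume": "1-2 ml"}, ["https://example.org/reagent/wash-buffer"]),
    Step(1, "incubate", {"duration": "5 min"}, ["https://example.org/instrument/shaker"]),
]
print(to_json(steps))
```

Because each step is data, a tool can sort, validate or compare workflows without parsing prose.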

Final edited protocol, also available as Bioschemas

Block 2. Computational Environments

Idafen Santana

Is it possible to describe the main properties of the Execution Environment of a Computational Scientific Experiment and, based on this description, derive a reproduction process for generating an equivalent environment using virtualization techniques?

Conservation of Computational Scientific Execution Environments for Workflow-based Experiments Using Ontologies. Santana-Pérez I. PhD thesis, 2016. http://oa.upm.es/39520/

Experiment components

Grid labels: DATA, SCIENTIFIC PROCEDURE, EQUIPMENT; IN SILICO

A Research Object bundles and relates digital resources of a scientific experiment or investigation using standard mechanisms, “tool middleware”

http://www.w3.org/community/rosc/

http://www.researchobject.org/

Open Research Problems

Computational Infrastructures are usually a predefined element of a Computational Scientific Workflow.

Execution Environments are poorly described.

Current reproducibility approaches for computational experiments consider mostly data and procedure.

Representation

Describing execution environments

Diagram labels: FORMER EQUIPMENT; ANNOTATE; SEMANTIC ANNOTATIONS; REPRODUCE; EQUIVALENT EXECUTION ENVIRONMENT; CLOUD

WICUS ontology network

o Workflow Infrastructure Conservation Using Semantics

o http://purl.org/net/wicus

o 5 ontologies

• WICUS Workflow Execution Requirements ontology

• WICUS Software Stack ontology

• WICUS Hardware Specs ontology

• WICUS Scientific Virtual Appliance ontology

• WICUS Ontology: links the previous ontologies
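To give a feel for what semantic annotations of an execution environment might look like, the sketch below stores them as subject-predicate-object triples and asks which required software an appliance does not provide. The namespace and every term in it (`requiresSoftware`, `providesSoftware`, `hasVersion` and so on) are hypothetical placeholders; the actual WICUS vocabularies define their own classes and properties, which are not reproduced here.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace and terms, NOT the real WICUS vocabulary.
EX = "https://example.org/env#"

annotations: List[Triple] = [
    (EX + "workflowA", EX + "requiresSoftware", EX + "python"),
    (EX + "workflowA", EX + "requiresSoftware", EX + "blast"),
    (EX + "python", EX + "hasVersion", "2.7"),
    (EX + "workflowA", EX + "requiresHardware", EX + "node1"),
    (EX + "node1", EX + "minMemoryGB", "4"),
    (EX + "applianceX", EX + "providesSoftware", EX + "python"),
]


def objects(triples: List[Triple], subject: str, predicate: str) -> Set[str]:
    """All objects o such that (subject, predicate, o) is in the triples."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def unmet_software(triples: List[Triple], workflow: str, appliance: str) -> Set[str]:
    """Software the workflow requires that the appliance does not provide."""
    required = objects(triples, workflow, EX + "requiresSoftware")
    provided = objects(triples, appliance, EX + "providesSoftware")
    return required - provided


print(sorted(unmet_software(annotations, EX + "workflowA", EX + "applianceX")))
```

Separating requirements, software stack, hardware and appliances into linked descriptions is what allows a question like this to be answered by a query instead of by reading documentation.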

WICUS Workflow Execution Requirements ontology
o http://purl.org/net/wicus-reqs

WICUS Software Stack ontology
o http://purl.org/net/wicus-stack

WICUS Scientific Virtual Appliance ontology
o http://purl.org/net/wicus-sva

WICUS Hardware Specs ontology
o http://purl.org/net/wicus-hwspecs

WICUS ontology network
o http://purl.org/net/wicus

WICUS system

Overview, inputs and outputs
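Deriving a reproduction process from a description can be illustrated with a dependency-ordered installation plan. This sketch is not the WICUS system: the component names and the `install_plan` function are hypothetical, and it only shows the general idea of turning declared dependencies into an ordered deployment sequence, using the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical software stack: component -> components it depends on.
dependencies: Dict[str, Set[str]] = {
    "workflow-engine": {"python", "java"},
    "blast": {"libc"},
    "python": {"libc"},
    "java": {"libc"},
    "libc": set(),
}


def install_plan(deps: Dict[str, Set[str]]) -> List[str]:
    """Order components so that each one comes after all of its dependencies."""
    # TopologicalSorter raises graphlib.CycleError if the description is circular.
    return list(TopologicalSorter(deps).static_order())


plan = install_plan(dependencies)
print(plan)
```

The exact order among independent components may differ between runs of the algorithm, but every valid plan installs `libc` before anything that depends on it.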

Evaluation

Workflows reproduced
o 3 scientific domains
o 3 workflow management systems
o 6 different workflows

Domain: Seismic, Astronomy, Bio
WMS: dispel4py, Pegasus, Makeflow
Name: xcorr, Internal Extinction, Montage, Epigenomics, SoyKB, BLAST
(2003) (2014) (2014) (2015) (2011) (2011)

Results

Diagram labels: FORMER EQUIPMENT; ANNOTATE; SEMANTIC ANNOTATIONS; REPRODUCE; EQUIVALENT EXECUTION ENVIRONMENT; CLOUD; COMPARE

• Non-deterministic

• Standard and error output

• Generated files equivalent

• Same results

• Results from Int. Extinction may vary

• Genomic data

• Exact match
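The COMPARE step can be sketched as a file-by-file comparison of the outputs produced in the original and in the reproduced environment. This is an illustrative approach based on content hashes, not the evaluation procedure used in the thesis; the function names, directory names and file contents are assumptions. An exact-match check like this one suits deterministic outputs; non-deterministic workflows would need a looser notion of equivalence.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def digest_tree(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root to the SHA-256 of its content."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_outputs(original: Path, reproduced: Path) -> Dict[str, str]:
    """Classify every output file as 'match', 'differs' or 'missing'."""
    a, b = digest_tree(original), digest_tree(reproduced)
    report = {}
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            report[name] = "missing"
        else:
            report[name] = "match" if a[name] == b[name] else "differs"
    return report


# Illustrative run with two temporary output directories.
with tempfile.TemporaryDirectory() as tmp:
    original, reproduced = Path(tmp, "original"), Path(tmp, "reproduced")
    for folder in (original, reproduced):
        folder.mkdir()
        (folder / "hits.txt").write_text("seq1\tseq2\t98.5\n")
    (original / "run.log").write_text("started 10:00\n")
    (reproduced / "run.log").write_text("started 17:42\n")
    report = compare_outputs(original, reproduced)
print(report)
```

Log files that embed timestamps will usually differ even when the scientific results match, so a real comparison would likely treat result files and logs separately.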

Summarizing

Two building blocks towards reproducibility of scientific experiments
o In vivo/vitro
• Focus on providing structured descriptions of methods (laboratory protocols)
• Our tools: ontologies, gazetteers, NLP tools and automatic and manual annotation tools
• Challenge: make protocols more structured (and semantic) from the beginning
o In silico
• Focus on the equipment (computational infrastructure) for workflow-based experiments
• Ontologies, automatic and manual annotation tools, and an execution environment
• Challenge: keep track of all types of appliances, and make scientists work on providing annotations

Is this enough?

Clearly not, but a step forward towards ensuring reproducibility (with a focus on methods)

Light pollution (www.stars4all.eu)