Reproducible, Open Data Science in the Life Sciences. Digital Research 2013, 9th-10th September 2013. Eamonn Maguire, Lead Software Engineer, University of Oxford e-Research Centre


DESCRIPTION

Digital Research 2013 presentation on "Reproducible, Open Data Science in the Life Sciences"

TRANSCRIPT

Page 1: Reproducible, Open Data Science in the Life Sciences

Reproducible, Open Data Science in the Life Sciences

Digital Research 2013, 9th-10th September 2013

Eamonn Maguire, Lead Software Engineer

University of Oxford e-Research Centre

Page 2

What is “data science”...

“data science” - the storage, management and analysis of data sets

or...

"data science" - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis.

these days, all hard sciences are “data sciences”

we shift the focus of science from performing physical experiments, where data is the by-product used to test a hypothesis, to working directly with the data

both definitions have different levels of validity in terms of the etymology of the word “science”, but in this presentation both go very much hand in hand.

Page 3

Why reproducible and open?

open: experiments are expensive...and often funded publicly.

data from experiments may extend way beyond the realms originally intended...

one without the other is really no use at all.

reproducible: findings need to be robust...and testable by the wider scientific community...

provided with: data, metadata, analysis methods, algorithms

enabled by: data, metadata, analysis methods, algorithms

Page 4

The workflow of a “data scientist”...

[Diagram: the Data Scientist at the centre of a workflow of Planning, Data Collection (use existing data or perform new experiment), Data Management, Analysis, Visualization and Publication]

Page 5

Planning

What is your hypothesis? What do you need to prove/disprove this?

You need an experimental design. Preferably balanced groups, but enough samples to make it statistically valid.

If you need to generate the data...there’s an app for that.

+Design Wizard

Plan your experiment by answering questions about what you want to measure and how you want to measure it...then let the tool create the design

Creates the ISA-Tab stub, leaving you to fill in which files match which biological samples.
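As a rough illustration of the idea on this slide, the sketch below builds a balanced design and writes a tab-delimited stub with the file column left blank. It is not the Design Wizard's actual output: the function names, the example factors and the reduced column set (`Sample Name`, `Factor Value[...]`, `Raw Data File`) are assumptions chosen for brevity, and a real ISA-Tab study carries many more fields.

```python
import csv
import io
from itertools import product


def build_design(factors, replicates):
    """Return one row per sample for a balanced full-factorial design.

    Every combination of factor levels gets the same number of replicates,
    so all groups end up the same size.
    """
    names = list(factors)
    rows = []
    for combo in product(*(factors[name] for name in names)):
        for rep in range(1, replicates + 1):
            row = {"Sample Name": "-".join(combo) + f"-rep{rep}"}
            for name, level in zip(names, combo):
                row[f"Factor Value[{name}]"] = level
            # Left blank on purpose: filled in once the data files exist.
            row["Raw Data File"] = ""
            rows.append(row)
    return rows


def write_stub(rows, handle):
    """Write the design as a tab-delimited table with a header row."""
    writer = csv.DictWriter(
        handle, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical factors, for illustration only.
factors = {"compound": ["control", "treated"], "dose": ["low", "high"]}
rows = build_design(factors, replicates=3)

buffer = io.StringIO()
write_stub(rows, buffer)
stub_text = buffer.getvalue()
```

Here `stub_text` holds a header plus one line per sample, and the empty `Raw Data File` column is the part left for the experimenter to complete.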

Page 6

Planning

You also need a data management strategy.

Which ontologies, minimal information checklists and exchange formats can be used for my domain?

What are the requirements of my funder for data deposition?

Which databases support my data?

Page 7

Data Collection

Perform new experiment / Use existing data

Collect the data and metadata from an experiment

Or use existing data and metadata to test a hypothesis...

[Figure: two panels, each listing SAMPLE 1 to SAMPLE 11 alongside FILE 1 onward]

Page 8

Data Collection

New experiment data

Excel

Create templates to fit the type of experiments to be described

Curate your experiment using a desktop-based, platform-independent tool.

Describe & curate your experiment with geographically distributed collaborators

Check out http://isa-tools.org to download.

Enables the creation of meaningful experiment data in a simple extendable format

Page 9

Data Collection

Use existing data

Page 10

Data Management

Page 11

Data Management

Shifting towards a new system

Page 12

Analysis

The interesting bit...doing something with our data and metadata...

Analysis of ISA-Tab data in the R language. Brings together the context and data to enable more meaningful analysis.

Also suggests packages to use for analysis based on the data types in the ISA-Tab file.

Publication coming soon...
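To make "bringing together the context and data" concrete, here is a minimal sketch in Python (not R, and not the API of the R package mentioned above). It joins a sample metadata table to a measurement table on `Sample Name`, then summarises by an experimental factor. The two inline tables, the `intensity` column and all function names are hypothetical.

```python
import csv
import io

# Hypothetical, heavily simplified tables: real ISA-Tab files carry many more columns.
SAMPLES_TSV = "Sample Name\tFactor Value[dose]\nS1\tlow\nS2\thigh\nS3\tlow\n"
MEASUREMENTS_TSV = "Sample Name\tintensity\nS1\t1.0\nS2\t4.0\nS3\t3.0\n"


def read_tsv(text):
    """Parse tab-delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def annotate(measurements, samples, key="Sample Name"):
    """Attach each sample's metadata (the context) to its measurement row."""
    context = {row[key]: row for row in samples}
    return [{**context[m[key]], **m} for m in measurements]


def mean_by(rows, factor, value):
    """Average a numeric column within each level of an experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor], []).append(float(row[value]))
    return {level: sum(vals) / len(vals) for level, vals in groups.items()}


annotated = annotate(read_tsv(MEASUREMENTS_TSV), read_tsv(SAMPLES_TSV))
means = mean_by(annotated, "Factor Value[dose]", "intensity")
```

Without the metadata join the measurements are just numbers per file. With it, they can be grouped by the factors the experiment was designed around.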

Analysis of ISA-Tab data in the Galaxy Environment.

Creates Galaxy Library objects from ISA-Tab files.

Analysis of ISA-Tab data in the GenomeSpace Environment.

Load and edit files stored on distributed servers.

Created by Brad Chapman at the Harvard School of Public Health

Page 13

Visualization

Check out your experiment, visualize experimental design

Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs

Maguire et al, 2013, IEEE TVCG

Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments

Maguire et al, 2012, IEEE TVCG

Page 14

Publication

Getting your work out there...

Publish your experiments alongside your research articles and in specialised community repositories

Share, link and reason over experiments with linked data

Page 15

Publication - current work

See http://www.slideshare.net/GigaScience/scott-edmunds-ismb-talk-on-big-data-publishing for a use case showing how we achieve this.

Metadata: encapsulates all the information about the experiment, providing links to the data files, publications and analysis protocols

Analysis results

Data files

Publications

Workflows: analysis workflows in the Galaxy Environment

Presentations

Logs
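One way to picture metadata acting as the hub that links everything else is a small record type. The sketch below is purely illustrative: the class, its field names and the example file names are assumptions, not the ISA model or any repository's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ExperimentRecord:
    """Hypothetical metadata record that links out to everything else."""

    title: str
    data_files: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)
    workflows: List[str] = field(default_factory=list)
    analysis_results: List[str] = field(default_factory=list)

    def missing_links(self):
        """Name the kinds of linked material not yet attached to the record."""
        return [
            name
            for name, value in asdict(self).items()
            if name != "title" and not value
        ]


# Example values are placeholders, not real files.
record = ExperimentRecord(
    title="example study",
    data_files=["file1.raw"],
    workflows=["example-workflow.ga"],
)
manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
```

A check like `missing_links()` shows at a glance what still needs attaching before the bundle is published.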

Page 16

Box it all up

The role of a data scientist (or, in the life sciences, a bioinformatician) is manifold

We’ve presented a suite of tools to help the data scientist in the management of their data and support the creation of open, meaningful life science data

Data Scientist: (a) Planning, (b) Data Collection, (c) Data Management, (d) Analysis, (e) Visualization, (f) Publication

“data science” - the storage, management and analysis of data sets

"data science" - formalizing a hypothesis given a set of observations and assumptions, grabbing data about that hypothesis, testing it and analyzing it to either confirm or falsify the hypothesis. 

With the systems we have in place for data discovery, paired with data already created with the ISA suite of tools, we make data integration possible

Page 17

The workflow of a “data scientist”...

[Diagram: the data scientist workflow diagram, shown twice]

Making science with data possible...

Page 18

The workflow of a “data scientist”...

[Diagram: the data scientist workflow diagram, shown three times]

And this...

Making science with data possible...

Page 19

The workflow of a “data scientist”...

[Diagram: the data scientist workflow diagram, shown four times]

Making science with data possible...

Page 20

The workflow of a “data scientist”...

...

Making science with data possible...

[Diagram: the data scientist workflow diagram, repeated many times over]

Page 21

Recent Publications

Visual Compression of Workflow Visualizations with Automated Detection of Macro Motifs
Maguire et al, 2013, IEEE TVCG

Taxonomy-based Glyph Design – with a Case Study on Visualizing Workflows of Biological Experiments
Maguire et al, 2012, IEEE TVCG

ISA software suite: Overview of ISA-Tab and first set of tools
Rocca-Serra et al, 2010, Bioinformatics

Towards interoperable bioscience data: Presenting the ISA Commons, authored by more than 50 collaborators at over 30 scientific organizations around the globe.
Sansone et al, 2012, Nature Genetics

OntoMaton: a Bioportal powered ontology widget for Google Spreadsheets.
Maguire et al, 2013, Bioinformatics

The Harvard Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell, powered by ISA tools.
Ho Sui et al, 2012, Nucleic Acids Research

Taxonomy-based Glyph Design: Visualizing (ISA based) workflows of biological experiments
Maguire et al, 2012, IEEE TVCG

Standardizing data: ISA-Tab-Nano: ISA-Tab extension for nanotechnology applications, authored by over 20 organizations incl. government agencies, academia and industry.
Baker et al, 2013, Nature Nanotechnology

MetaboLights: an open-access repository for metabolomics at EBI powered by ISA.
Haug et al, 2013, Nucleic Acids Research

The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects develops the ISAtoRDF module
Kohonen et al, 2013, Molecular Informatics

Page 22

Thanks to

ISA team

Susanna-Assunta Sansone, Philippe Rocca-Serra, Eamonn Maguire, Alejandra Gonzalez-Beltran

Contributors

Marco Brandizi, Natalija Sklyar, Brad Chapman, Bob MacCallum, Kenneth Haug, Pablo Conesa, Audrey Kauffman

Funders

& Our Many Collaborators!

Stem Cell Commons

Nanotechnology Informatics Working Group

Page 23

Questions?

http://isa-tools.org

http://isacommons.org

http://biosharing.org