Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Upload: gigascience-bgi-hong-kong

Post on 28-Jan-2018

TRANSCRIPT

Page 1: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Updates to the GigaDB open access data publishing platform

Jesse Xiao

[email protected]

ORCID ID:0000-0003-3408-2852

Page 2: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

About the Journal

GigaScience is an open access, open data, open peer-review journal focusing on ‘big data’ research from the life and biomedical sciences

Page 3: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

What is the point of publishing?

• To disseminate information/knowledge/ideas.

• To present material so it can be reasonably assessed for its level of quality (and interest).

• To gain credit for career advancement.

Page 4: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Kahn, Goodman, & Mittleman. Dragging Scientific Publishing into the 21st Century 2014 http://genomebiology.com/2014/15/12/556

From Journal Delivery to PDF Delivery

Page 5: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Lack of Data and Software Availability Impacts Reproducibility

1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14

2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)

Out of 18 microarray papers, results from 10 could not be reproduced

Page 6: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Deconstructing a paper into accessible, useable, trackable, interlinked units

Need to provide credit to reward sharing and proper organization of:

• Narrative

• Data/Metadata availability/curation

• Software availability

• Interoperability

• Availability of workflows

• Transparent analyses

Data/MetaData

Software

Methods

Narrative

Page 7: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Deconstructing a paper into accessible, useable, trackable, interlinked units

Currently we provide credit for this:

• Narrative

• Data/Metadata availability/curation

• Software availability

• Interoperability

• Availability of workflows

• Transparent analyses

Data/MetaData

Software

Methods

Narrative

Sometimes we publish these as Methods Papers

Page 8: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Beyond the Narrative: Data and Tools

Page 9: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Getting past…

…look but don't touch

Page 10: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Data publishing

http://gigasciencejournal.com/

Launched July 2012. Publishes “Data Notes” for CC0 data. Uses ISA-Tab.

Page 11: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Data publishing

APC covers 1TB storage in GigaDB

Page 12: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

FAIR DATA in GigaDB

Findable Accessible Interoperable Reusable

Page 13: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Findable

We have 373 published datasets in GigaDB, totalling around 30 TB of data. Every dataset has a DOI and its own dataset page.
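Because every dataset has a DOI, a dataset page can be reached through the standard DOI resolver. The sketch below only builds the resolver URL; it makes no network request. The `10.5524` prefix is taken from the dataset DOIs quoted elsewhere in this talk, and the function names are illustrative, not part of any GigaDB tool.

```python
from urllib.parse import quote

# Prefix seen in the dataset DOIs quoted in this talk (e.g. 10.5524/100044).
GIGADB_DOI_PREFIX = "10.5524"


def normalise_doi(doi: str) -> str:
    """Strip whitespace and an optional leading 'DOI:' label."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    return doi


def doi_to_url(doi: str) -> str:
    """Turn a DOI such as '10.5524/100044' into a https://doi.org/ URL."""
    return "https://doi.org/" + quote(normalise_doi(doi), safe="/")


def is_gigadb_doi(doi: str) -> bool:
    """True if the DOI carries the 10.5524 prefix used in this talk's examples."""
    return normalise_doi(doi).startswith(GIGADB_DOI_PREFIX + "/")


if __name__ == "__main__":
    print(doi_to_url("DOI:10.5524/100044"))
```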

GigaDB provides a powerful search engine and an API search function, e.g. http://gigadb.org/api/search
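A minimal sketch of how a search request against that endpoint might be assembled. Only the endpoint URL comes from the slide; the query parameter name `keyword` is an assumption and should be checked against the GigaDB API help page before use. The code builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Endpoint quoted on the slide.
SEARCH_ENDPOINT = "http://gigadb.org/api/search"


def build_search_url(term: str, param: str = "keyword") -> str:
    """Return a search URL for `term`.

    ASSUMPTION: the query parameter is named "keyword". Pass a different
    `param` if the API documentation says otherwise.
    """
    return f"{SEARCH_ENDPOINT}?{urlencode({param: term})}"


if __name__ == "__main__":
    print(build_search_url("penguin"))
```

The resulting URL can be opened in a browser or passed to any HTTP client.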

Page 14: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Accessible

All data in GigaDB can be accessed on our public FTP servers. We provide three stable FTP sites in two geographic locations (Hong Kong and Shenzhen):

1. ftp://penguin.genomics.cn (the main FTP server)

2. ftp://ftp.cngb.org/pub/gigadb/ (a mirror FTP server in the cloud)

3. ftp://ftp2.cngb.org/pub/gigadb (a mirror FTP server in the cloud)
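With three sites available, a client could try the main server first and fall back to the mirrors. The sketch below illustrates that idea using Python's standard `ftplib`. It assumes that each site accepts anonymous logins and exposes the same relative file layout under its base URL. The talk states neither, and the example path is hypothetical.

```python
import ftplib
from typing import List, Tuple
from urllib.parse import urlsplit

# The three sites listed on the slide, main server first.
FTP_SITES = [
    "ftp://penguin.genomics.cn",
    "ftp://ftp.cngb.org/pub/gigadb/",
    "ftp://ftp2.cngb.org/pub/gigadb",
]


def candidate_urls(relative_path: str) -> List[str]:
    """Join a relative path onto each base URL.

    ASSUMPTION: the same relative layout exists on every site.
    """
    rel = relative_path.lstrip("/")
    return [site.rstrip("/") + "/" + rel for site in FTP_SITES]


def host_and_path(url: str) -> Tuple[str, str]:
    """Split an ftp:// URL into (hostname, path)."""
    parts = urlsplit(url)
    return parts.hostname, parts.path


def fetch_first_available(relative_path: str, dest: str, timeout: float = 30.0) -> str:
    """Try each site in order and return the URL that served the file."""
    last_error = None
    for url in candidate_urls(relative_path):
        host, path = host_and_path(url)
        try:
            with ftplib.FTP(host, timeout=timeout) as ftp, open(dest, "wb") as out:
                ftp.login()  # ASSUMPTION: anonymous access is allowed
                ftp.retrbinary(f"RETR {path}", out.write)
            return url
        except ftplib.all_errors as exc:
            last_error = exc  # remember the failure and try the next site
    raise RuntimeError(f"no site could serve {relative_path}") from last_error


if __name__ == "__main__":
    # Hypothetical path, shown only to illustrate the URL construction.
    for url in candidate_urls("example/readme.txt"):
        print(url)
```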

Download Speed

We are working with China National GeneBank and will use UDP-based software (Data Expedition) to provide faster download speeds.

The source code for all software and tools published in GigaDB can be accessed on GitHub: https://github.com/gigascience

Page 15: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Accessible via API

We provide a REST API that allows users to retrieve and search all metadata held in GigaDB.

The current API returns results in XML (the XML structure is based on the database schema). We plan to add the option to return results in JSON or ISA2.0-JSON in the next version.
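Until a JSON option is available, a client can convert the XML response itself. The sketch below shows one generic way to do that with the Python standard library. The sample XML and its element names are invented for illustration and do not reproduce the real GigaDB schema. The output is plain JSON, not ISA2.0-JSON.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample; element names are NOT taken from the GigaDB schema.
SAMPLE_XML = """
<dataset id="0">
  <title>Example dataset</title>
  <authors>
    <author>A. Author</author>
    <author>B. Author</author>
  </authors>
</dataset>
"""


def element_to_dict(elem):
    """Recursively convert an element into a str, dict, or list structure."""
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        if not node:
            return text  # leaf without attributes becomes a plain string
        node["#text"] = text
        return node
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text: str) -> str:
    """Parse an XML string and serialise it as indented JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```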

Page 16: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Accessible via API

The website http://www.gigadb.org/site/help#0.1_API provides detailed instructions on how to use the GigaDB API

Page 17: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Interoperable and reusable

Integrating tools (inc. JBrowse genome browser …) to visualize data

Page 18: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

First journal with deep integration with

Launched 2nd June 2016

Reward better handling of “wet” protocols…

• Create, share, modify forkable protocols in repo.

• Download & run on smartphone app.

• Get discoverability, credit, DOIs for sharing methods.

• Create your own, or let us set up & you claim.

http://protocols.io/

Page 19: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

The GigaDB dataset page embeds the protocols.io protocol in an iframe.

e.g. RNA extraction protocol

Page 20: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Interoperable and reusable

GigaDB provides an online submission wizard and an Excel spreadsheet to help users curate their own metadata

Page 21: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

https://codeocean.com/

Cloud-based executable research platform

Browse, share & run code on AWS

Creates compute capsule: encapsulation of the data, code, and computation environment

Integration into the paper, share via DOIs

First examples published in GigaScience

Integrated plugin into GigaDB

Share your code this way!

Interoperable and reusable

Page 22: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

gigagalaxy.net

Reward Sharing of Workflows

Interoperable and reusable

Page 23: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

http://www.gigasciencejournal.com/content/3/1/23

http://www.gigasciencejournal.com/content/4/1/19

Virtual Machines/containers

• Downloadable as virtual harddisk/available as Amazon Machine Image

• Now publishing many container (docker) submissions

Interoperable and reusable

Page 24: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

How FAIR can we get?

Data sets

Analyses

Open-Paper

Open-Review

DOI:10.1186/2047-217X-1-18

>50,000 accesses & >1000 citations

Open-Code

7 reviewers tested data in ftp server & named reports published

DOI:10.5524/100044

Open-Pipelines

Open-Workflows

DOI:10.5524/100038

Open-Data

78GB CC0 data

Code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/

>40,000 downloads

Enabled the code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2

Page 25: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612

Quantifying how FAIR can we get

Page 26: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

[Diagram: "Rewarding the" … with labels Idea, Study, Data, Metadata, Methods, Software, Analysis (Pipelines), Workflows/Environments, Answer; outputs labelled Publication, DOI, etc.]

Page 27: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

www.gigasciencejournal.com

Give us your data, papers & pipelines

Help GigaPanda make it happen!

[email protected]@gigasciencejournal.com

Contact us:

Page 28: Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing platform

Thanks to:

Laurie Goodman, Editor in Chief

Nicole Nogoy, Editor

Hans Zauner, Assistant Editor

Peter Li, Lead Data Manager

Chris Hunter, Lead BioCurator

Xiao (Jesse) Si Zhe, Database Developer

Chen Qi, Shenzhen Office

All of BGI

@GigaScience

facebook.com/GigaScience

gigasciencejournal.com/blog/

Follow us:

www.gigasciencejournal.com

www.gigadb.org

+ Weibo & WeChat