Jesse Xiao at CODATA 2017: Updates to the GigaDB open access data publishing platform
Updates to the GigaDB open access data publishing platform
Jesse Xiao
ORCID ID:0000-0003-3408-2852
About the Journal
GigaScience is an open access, open data, open peer-review journal focusing on ‘big data’ research from the life and biomedical sciences
What is the point of publishing?
• To disseminate information/knowledge/ideas.
• To present material so it can be reasonably assessed for its level of quality (and interest).
• To gain credit for career advancement.
Kahn, Goodman, & Mittleman. Dragging Scientific Publishing into the 21st Century 2014 http://genomebiology.com/2014/15/12/556
From Journal Delivery to PDF Delivery
Lack of Data and Software Availability Impacts Reproducibility
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results from 10 could not be reproduced
Deconstructing a paper into accessible, useable, trackable, interlinked units
Need to provide credit to reward sharing and proper organization of:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
Data/MetaData
Software
Methods
Narrative
Deconstructing a paper into accessible, useable, trackable, interlinked units
Currently we provide credit for this:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
Data/MetaData
Software
Methods
Narrative
Sometimes we publish these as Methods Papers
Beyond the Narrative: Data and Tools
Getting past…
…look but don't touch
Data publishing
http://gigasciencejournal.com/
Launched July 2012. Publishes “Data Notes” for CC0 data. Uses ISA-Tab.
Data publishing
APC covers 1TB storage in GigaDB
FAIR DATA in GigaDB
Findable Accessible Interoperable Reusable
Findable
We have 373 published datasets in GigaDB and around 30 TB of data. Every dataset has a DOI and an individual dataset page.
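Because every dataset has a DOI, a dataset page can be reached through the standard DOI resolver. The sketch below is only an illustration: it assumes GigaDB DOIs use the 10.5524 prefix seen in the DOIs quoted in this talk, and the helper names are hypothetical.

```python
from urllib.parse import quote

# Prefix taken from the dataset DOIs quoted in this talk (e.g. 10.5524/100044).
# Treating it as "the" GigaDB prefix is an assumption of this sketch.
GIGADB_DOI_PREFIX = "10.5524"


def normalize_doi(doi: str) -> str:
    """Strip whitespace and an optional leading 'DOI:' label."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    return doi


def doi_to_url(doi: str) -> str:
    """Build a link to the public DOI resolver for the given DOI."""
    return "https://doi.org/" + quote(normalize_doi(doi), safe="/")


def is_gigadb_doi(doi: str) -> bool:
    """Return True when the DOI carries the assumed GigaDB prefix."""
    return normalize_doi(doi).startswith(GIGADB_DOI_PREFIX + "/")


if __name__ == "__main__":
    print(doi_to_url("DOI:10.5524/100044"))
```

The resolver redirects to whatever landing page the DOI is registered against, so the link stays valid even if the dataset page URL changes.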
GigaDB provides a powerful search engine and an API search function, e.g. http://gigadb.org/api/search
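A minimal sketch of calling the search endpoint from a script. Only the endpoint URL comes from the slide; the `keyword` query parameter and both function names are assumptions for illustration and should be checked against the API help page before use.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "http://gigadb.org/api/search"  # endpoint shown on the slide


def build_search_url(keyword: str) -> str:
    """Build a search URL. The 'keyword' parameter name is an assumption."""
    return SEARCH_ENDPOINT + "?" + urlencode({"keyword": keyword})


def run_search(keyword: str, timeout: float = 30.0) -> str:
    """Fetch the raw response body as text (needs network access)."""
    with urlopen(build_search_url(keyword), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds and prints the URL; call run_search() to query the live service.
    print(build_search_url("rice genome"))
```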
Accessible
All data in GigaDB can be accessed on the public FTP server. We provide three stable FTP sites in 2 geographic locations (HK & Shenzhen):
1. ftp://penguin.genomics.cn // The main ftp server
2. ftp://ftp.cngb.org/pub/gigadb/ // The mirror ftp server in the cloud
3. ftp://ftp2.cngb.org/pub/gigadb // The mirror ftp server in the cloud
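As a rough illustration of scripted access to these servers, the sketch below lists a directory with Python's standard `ftplib`. It assumes the servers accept anonymous logins, which the slide implies ("public ftp server") but does not state; the function names are hypothetical.

```python
from ftplib import FTP
from urllib.parse import urlparse

# The three sites listed on the slide.
FTP_SITES = [
    "ftp://penguin.genomics.cn",
    "ftp://ftp.cngb.org/pub/gigadb/",
    "ftp://ftp2.cngb.org/pub/gigadb",
]


def split_ftp_url(url: str) -> tuple:
    """Split an ftp:// URL into (host, path), defaulting the path to '/'."""
    parts = urlparse(url)
    return parts.hostname, parts.path or "/"


def list_directory(url: str, timeout: float = 30.0) -> list:
    """Return the names in the directory behind `url` (needs network access)."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # anonymous login: an assumption of this sketch
        ftp.cwd(path)
        return ftp.nlst()


if __name__ == "__main__":
    for site in FTP_SITES:
        print(split_ftp_url(site))
```

Trying the sites in order and falling back to the next one on a connection error would be one simple way to take advantage of the mirrors.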
Download Speed
We are working with China National Gene Bank and will use UDP-based software (Data Expedition) to provide faster download speeds.
The source code for all software and tools published in GigaDB can be accessed on GitHub: https://github.com/gigascience
Accessible via API
We provide a REST API that lets users retrieve and search all metadata held in GigaDB.
The current API returns results in XML (the XML is based on the database schema), and we plan to add the option to return results in JSON or ISA2.0-JSON in our next version.
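Until a JSON option exists, a client can convert the XML itself. The sketch below shows one generic way to do that with the Python standard library. The sample XML, its element names, and the placeholder DOI are made up for illustration and are not the actual GigaDB schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response: element names and values are invented for this sketch.
SAMPLE_XML = """<gigadb_entry>
  <dataset id="1" doi="10.5524/000000">
    <title>Example dataset</title>
    <dataset_type>Genomic</dataset_type>
  </dataset>
</gigadb_entry>"""


def element_to_dict(elem):
    """Convert an element to plain Python data.

    Attributes become keys, leaf elements become strings, and repeated
    child tags are collected into lists.
    """
    node = dict(elem.attrib)
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        if not node:
            return text
        if text:
            node["text"] = text
        return node
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text: str) -> str:
    """Parse an XML document and serialize it as indented JSON."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, indent=2)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

This mapping is lossy in edge cases (for example, an attribute and a child element sharing a name end up in one list), so a schema-aware converter would be preferable for production use.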
Accessible via API
The website http://www.gigadb.org/site/help#0.1_API provides detailed instructions on how to use the GigaDB API.
Interoperable and reusable
Integrating tools (inc. JBrowse genome browser …) to visualize data
First journal with deep integration with
Launched 2nd June 2016
Reward better handling of “wet” protocols…
• Create, share, modify forkeable protocols in repo.
• Download & run on smartphone app.
• Get discoverability, credit, DOIs for sharing methods.
• Create your own, or let us set up & you claim.
http://protocols.io/
The GigaDB dataset page embeds the protocols.io protocol in an iframe.
e.g. RNA extraction protocol
Interoperable and reusable
GigaDB provides an online submission wizard and an Excel spreadsheet to help users curate their own metadata
https://codeocean.com/
Cloud-based executable research platform
Browse, share & run code on AWS
Creates compute capsule: encapsulation of the data, code, and computation environment
Integration into the paper, share via DOIs
First examples published in GigaScience
Integrated plugin into GigaDB
Share your code this way!
Interoperable and reusable
gigagalaxy.net
Reward Sharing of Workflows
Interoperable and reusable
http://www.gigasciencejournal.com/content/3/1/23
http://www.gigasciencejournal.com/content/4/1/19
Virtual Machines/containers
• Downloadable as virtual harddisk/available as Amazon Machine Image
• Now publishing many container (docker) submissions
Interoperable and reusable
How FAIR can we get?
Data sets
Analyses
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>50,000 accesses & >1,000 citations
Open-Code
7 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/
>40,000 downloads
Enabled code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612
Quantifying how FAIR we can get
[Figure labels: Idea; Study; Data; Metadata; Methods; Software; Analysis (Pipelines); Workflows/Environments; Answer; Publication; DOI, etc.; Rewarding the]
www.gigasciencejournal.com
Give us your data, papers & pipelines
Help GigaPanda make it happen!
[email protected]@gigasciencejournal.com
Contact us:
Thanks to:
Laurie Goodman, Editor in Chief
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Xiao (Jesse) Si Zhe, Database Developer
Chen Qi, Shenzhen Office
All of BGI
@GigaScience
facebook.com/GigaScience
gigasciencejournal.com/blog/
Follow us:
www.gigasciencejournal.com
www.gigadb.org