Scott Edmunds: GigaScience - Big Data, Data Citation and Future Data Handling
DESCRIPTION
Scott Edmunds' talk on GigaScience, big data, data citation and future data handling at the International Conference of Genomics on 15 November 2011.
TRANSCRIPT
Scott Edmunds
GigaScience: Big Data, Data Citation and Future Data Handling
www.gigasciencejournal.com
cc Flickr allan*
William Gibson: "Information is the currency of the future world"
Data Tsunami?
Flickr cc: opensourceway
Data Bonanza?
[Chart: rice vs. wheat by year, 1996 to 2008; y-axis 0 to 700]
Rice v Wheat: consequences of publicly available genome data.
Sharing aids everyone…
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Every 10 datasets collected contribute to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a
“Trans-Omics”
Objective: to integrate data from:
• Genomics
• Transcriptomics
• Proteomics
• Metabolomics
Problems?
Flickr cc: opensourceway
[Chart: sequencing cost ($ per Mbp) compared with Moore’s Law; ~100,000X. Source: E Lander/Broad]
Sequencing Output
Data
Moore’s/Kryder’s Law
Storage
Sequencing Output
Data
Dissemination?
Publication
Potential sequencing capacity
1 Illumina HiSeq 2000 (+TruSeq upgrade) = 600 Gb/run (12 days)
x 128 HiSeq = 6 Tb/day = >2 Pb/year
= ~2,000 human genomes/day
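The capacity figures on this slide can be checked with a few lines of arithmetic. The sketch below uses the slide's numbers (600 Gb per 12-day run, 128 machines) and reads Gb/Tb/Pb as bases. The human genome size of about 3.2 Gb and the 1x-coverage reading of "human genomes/day" are assumptions that are not stated in the slide.

```python
# Figures taken from the slide.
GB_PER_RUN = 600       # gigabases produced by one HiSeq 2000 run
DAYS_PER_RUN = 12      # duration of one run in days
N_MACHINES = 128       # number of HiSeq instruments

# Assumption (not in the slide): ~3.2 Gb haploid human genome, counted at 1x coverage.
HUMAN_GENOME_GB = 3.2


def daily_output_gb(gb_per_run, days_per_run, n_machines):
    """Return the combined output of all machines in gigabases per day."""
    return gb_per_run / days_per_run * n_machines


per_day_gb = daily_output_gb(GB_PER_RUN, DAYS_PER_RUN, N_MACHINES)
per_day_tb = per_day_gb / 1_000               # terabases per day
per_year_pb = per_day_gb * 365 / 1_000_000    # petabases per year
genomes_per_day = per_day_gb / HUMAN_GENOME_GB

print(f"{per_day_tb:.1f} Tb/day")             # prints 6.4 Tb/day
print(f"{per_year_pb:.2f} Pb/year")           # prints 2.34 Pb/year
print(f"~{genomes_per_day:.0f} genome-equivalents/day (1x)")
```

Under these assumptions the slide's rounded values (6 Tb/day, >2 Pb/year, ~2,000 genomes/day) are consistent with each other; at a typical resequencing depth the number of genomes per day would be correspondingly lower.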
Flickr cc: opensourceway
Difficulties keeping up…
Flickr cc: opensourceway
Do we have models for long term funding?
Human Gene Mutation Database
?
Kyoto Encyclopedia of Genes and Genomes
?
Are there now too many hurdles?
Technical: volumes too large, too heterogeneous, no home for many data types, too time consuming
Economic: too expensive, no long-term funding
Cultural: inertia, no incentives to share, unaware of how
Potential solutions?
Incentives/credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
Prepublication data sharing (Toronto International Data Release Workshop)
“Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.”
Nature 461, 168-170 (2009)
Data citation: DataCite and DOIs
Digital Object Identifiers (DOIs) offer a solution
Most widely used identifier for scientific articles
Researchers, authors, publishers know how to use them
Put datasets on the same playing field as articles
Dataset:
Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
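As a rough illustration of why DOIs make datasets as easy to reference as articles, the sketch below turns a DOI written in any of the forms used in these slides into a resolvable link. The function names and the shape check are hypothetical, and the regular expression is a heuristic rather than a full DOI grammar. The `http://dx.doi.org/` resolver prefix is the one used in the dataset citation later in this talk.

```python
import re

# Heuristic shape check (an assumption, not a complete DOI grammar):
# "10.", a numeric registrant code, a slash, then a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")
KNOWN_PREFIXES = ("doi:", "http://dx.doi.org/", "https://doi.org/")


def normalize_doi(raw):
    """Strip a leading 'doi:' label or resolver URL and return the bare DOI."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    if not DOI_SHAPE.match(text):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return text


def doi_to_url(raw, resolver="http://dx.doi.org/"):
    """Build a resolver link for a DOI given in any of the accepted forms."""
    return resolver + normalize_doi(raw)


print(doi_to_url("doi:10.1594/PANGAEA.587840"))
# prints http://dx.doi.org/10.1594/PANGAEA.587840
```

The same helper accepts a dataset DOI such as `doi:10.5524/100004` exactly as it would an article DOI, which is the point of the slide.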
Data citation: DataCite and DOIs
>1 million DOIs since Dec 2009
Central metadata repository to link with WoS/ISI
- finally can track and credit use!
www.gigasciencejournal.com
Large-Scale Data Journal/Database
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Assistant Editor: Alexandra Basford, PhD
In conjunction with:
Now taking submissions…
Editorial Board: International
Stephan Beck, UK
Alvis Brazma, UK
Ann-Shyn Chiang, Taiwan
Richard Durbin, UK
Paul Flicek, UK
Robert Hanner, Canada
Yoshihide Hayashizaki, Japan
Henning Hermjakob, UK
Wolfgang Huber, Germany
Gary King, USA
Tin-Lap Lee, Hong Kong
Donald Moerman, Canada
Karen Nelson, USA
Francis Ouellette, Canada
Stephen O'Brien, USA
Hanchuan Peng, USA
Russell Poldrack, USA
Ming Qi, China/USA
Susanna-Assunta Sansone, UK
Michael Schatz, USA
David Schwartz, USA
Fritz Sommer, USA
Lincoln Stein, Canada
Sumio Sugano, Japan
Thomas Wachtler, Germany
Jun Wang, China
Alistair Young, New Zealand
Zang Yufeng, China
Marie Zins, France
Editorial Board: International
Stephan Beck, Epigenomics
Alvis Brazma, Transcriptomics
Ann-Shyn Chiang, Neuroscience
Richard Durbin, Genetics/Genomics
Paul Flicek, Genomics
Robert Hanner, DNA Barcoding/Ecology
Yoshihide Hayashizaki, Genomics
Henning Hermjakob, Proteomics
Wolfgang Huber, Functional Genomics
Gary King, Medicine
Tin-Lap Lee, Genomics
Donald Moerman, Functional Genomics
Karen Nelson, Metagenomics
Francis Ouellette, Genomics
Stephen O'Brien, Genomics
Hanchuan Peng, Imaging/Neuro
Russell Poldrack, Neuroscience
Ming Qi, Genetics
Susanna-Assunta Sansone, Standards
Michael Schatz, Cloud Computing
David Schwartz, Optical Mapping
Fritz Sommer, Neuroscience
Lincoln Stein, Cloud Computing
Sumio Sugano, Genomics
Thomas Wachtler, Neuroscience
Jun Wang, Genomics
Alistair Young, Medical Imaging
Zang Yufeng, Neuroscience
Marie Zins, Medicine
Criteria and Focus of Journal/Database
Reproducibility/Reuse
Utility/Usability
Standards/Searchability/Scale/Sharing
Data publishing/DOI
Use of Data = Importance + Usability
easier to assess
subjective?
Reproducibility/Reuse
BGI Cloud Computing resources for handling and analyzing large-scale data.
Integrated tools to promote more widespread access, viewing, and analysis of data.
Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files).
Special Series/Hub for cloud-based tools
Technical notes: test tools in the BGI-Cloud.
Tools + Test Data (BGI or user) in one place.
Aids reproducibility.
Aids reviewers (free)
Aids authors: visibility (PubMed, etc.), hosting (included/free offers)
–contact us: [email protected]
Oledoe flickr cc
Standards/Searchability/Sharing
ISA-Tab compatibility to aid and promote best practice in metadata reporting.
All supporting data must be publicly available.
Ask for MIBBI compliance and use of reporting checklists.
Part of the Biosharing network.
Data publishing/DOI
New journal format combines standard manuscript publication with an extensive database to host all associated data.
Data hosting will follow standard funding agency and community guidelines.
DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking.
of data use/release?
The era of the data consumer?
Free access to data, but analysis hubs/nodes will form around it
Genomic Data Submission and Analytical platform
Big data from the “Sequencing Farm”
Data Modeling
Pipeline design
Validation
Commercial applications
GDSAP:
Data, Data, Data…
Tin-Lap Lee, CUHK
“Apps”
www.gigaDB.org
New Database
BGI Datasets Get DOI®s
doi:10.5524/100004
Plants
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
Microbe
E. coli O104:H4 TY-2482
Cell-Line
Chinese Hamster Ovary
Human
Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
Ancient DNA (coming soon)
- Saqqaq Eskimo
- Aboriginal Australian
Vertebrates
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Naked mole rat
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
Sheep
Tibetan antelope
Invertebrate
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Silkworm
Many unpublished…
Data also submitted to NCBI (including SV data to dbVar)
Complemented by citable form, and data-types including:
Assemblies of 3 strains
Raw Data
SNPs
InDels
CNVs
SV
Coming soon…
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. http://dx.doi.org/10.5524/100001
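The citation above follows a simple pattern: authors, year, title, publisher, DOI link. A minimal sketch of assembling such a string from metadata fields is given below. The function name and field names are hypothetical, and the author list in the usage example is shortened purely for illustration.

```python
def format_dataset_citation(authors, year, title, publisher, doi,
                            resolver="http://dx.doi.org/"):
    """Assemble 'Authors (Year) Title. Publisher. DOI-link' from metadata fields."""
    author_text = "; ".join(authors)
    return f"{author_text} ({year}) {title}. {publisher}. {resolver}{doi}"


# Usage: author list truncated here for illustration only.
citation = format_dataset_citation(
    authors=["Li, D", "Xi, F", "Zhao, M"],
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)
print(citation)
```

Keeping the fields separate and generating the string from them means the same metadata record can feed both the human-readable citation and a machine-readable registry entry.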
Our first DOI:
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
“The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.”
“German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”
www.gigasciencejournal.com
We want your data!
@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/