Trends in Genomics Big Data, NCBI Perspective, and 1,000 Genomes in the Cloud

Post on 25-Feb-2016

Page 1: Trends in Genomics Big Data, NCBI perspective, and 1,000 Genomes in the Cloud

Trends in Genomics Big Data, NCBI Perspective, and 1,000 Genomes in the Cloud

Don Preuss, NCBI/NLM/NIH

Every decade a new, lower priced computer class forms with new programming platform, network, and interface resulting in new usage and industry.

- Bell’s Law of computer classes

Page 2

National Center for Biotechnology Information (NCBI)

Outline

Emerging trends in "big data", large-scale networking, and "the cloud" in the genomics community

Trends in data transfer and data compression

Cloud initiatives: 1,000 Genomes in the cloud

Page 3

National Center for Biotechnology Information

Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to:

Create automated systems for knowledge about molecular biology, biochemistry, and genetics

Perform research into advanced methods of analyzing and interpreting molecular biology data.

Enable biotechnology researchers and medical care personnel to use the systems and methods developed.

The NCBI advances science and health by providing access to biomedical and genomic information.

Builders and providers of GenBank, Entrez, BLAST, PubMed, dbGaP, SRA, dbSNP, PubChem, and much, much more.

Center for basic research and training in computational biology.

Page 4

NCBI Daily Users

[Chart: NCBI Daily Users; vertical axis from 0 to 3,500,000]

Web users: 3.1 million per day

Data downloaded: 26.6 TB per day

Peak web hits: 7,000 per second

Web page views: 28 million per day
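To put the daily figures on this slide in network terms, the sketch below converts them to average rates. It assumes decimal units (1 TB = 10^12 bytes) and an even spread over 24 hours, neither of which the slide states; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_gbps(terabytes_per_day: float) -> float:
    """Average rate in gigabits per second, assuming 1 TB = 1e12 bytes."""
    bits_per_day = terabytes_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def average_per_second(events_per_day: float) -> float:
    """Average number of events per second over a 24-hour day."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures taken from the slide: 26.6 TB downloaded and 28 million page views per day.
    print(f"Average download rate: {average_gbps(26.6):.2f} Gbit/s")
    print(f"Average page views: {average_per_second(28e6):.0f} per second")
```

Under these assumptions the downloads average roughly 2.5 Gbit/s sustained. The page-view average (about 324 per second) is not directly comparable with the peak of 7,000 hits per second, since one page view can generate many hits.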

Page 5

Page 6

Sequencers

Page 7

DNA Sequencing Caught in Deluge of Data

BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day. BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data, via FedEx.

Page 8

Big Data in Scientific Discovery

Physics: Large Hadron Collider

Biology: 1000 Genomes Project

Trunnell 2012

Page 9

NLM I2 Traffic Stats 2009-2012

Page 10

Getting exponential growth under control

[Chart: Trace and SRA Holdings, Oct-00 to Sep-11; logarithmic vertical axis from 10^8 to 10^15; series: TraceArchive Bases, SRA Bases, SRA Bytes]

Page 11

What is the Big Data Problem in Biology?

Example: Reducing the 1000 Genome Dataset

[Chart: size in terabytes of each representation; vertical axis from 0 to 300]

Total Project Size (submitted BAM): 250 TB. Read IDs as strings; original quality and recalibrated quality scores; additional analysis tags.

Lossless cSRA: 85 TB. Read IDs as integers; 40-level read qualities using recalibrated quality scores.

Lossy cSRA: 30 TB. 8-level qualities for all sites; uniform binning of recalibrated quality scores.

Analysis Genotypes (Variant Call Format, VCF): 0.1 TB. Genotype likelihoods for all variants.
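The reduction steps on this slide can be illustrated with a short sketch. It computes the size reduction factors from the slide's figures and shows one way "uniform binning" of quality scores into 8 levels might work. The score range of 0 to 40, the equal-width bins, and the function names are assumptions made for illustration; the slide does not describe the actual cSRA binning scheme.

```python
def reduction_factor(original_tb: float, reduced_tb: float) -> float:
    """How many times smaller the reduced representation is."""
    return original_tb / reduced_tb


def bin_quality(q: int, levels: int = 8, max_q: int = 40) -> int:
    """Map a quality score in 0..max_q to one of `levels` equal-width bins.

    Assumption: scores above max_q are clamped into the top bin.
    """
    if q < 0:
        raise ValueError("quality score must be non-negative")
    q = min(q, max_q)
    return q * levels // (max_q + 1)


def bin_midpoint(index: int, levels: int = 8, max_q: int = 40) -> int:
    """Representative score for a bin (integer midpoint of its range)."""
    return (2 * index + 1) * (max_q + 1) // (2 * levels)


if __name__ == "__main__":
    # Sizes in terabytes, taken from the slide.
    sizes = {"Lossless cSRA": 85, "Lossy cSRA": 30, "Analysis genotypes (VCF)": 0.1}
    for name, size in sizes.items():
        print(f"{name}: {reduction_factor(250, size):.1f}x smaller than submitted BAM")

    scores = [2, 11, 23, 37, 40]
    print([bin_midpoint(bin_quality(q)) for q in scores])
```

With 8 levels, each score fits in 3 bits instead of the 6 bits needed for 41 distinct values, and long runs of identical binned scores compress better. That is one plausible reason the lossy form is so much smaller, though the slide does not break down where the savings come from.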

Page 12

Flicek

Page 13

Problem: Enable Access to Data

The 1,000 Genomes data set is very large.

Many sites do not have capacity for 50-200 TB downloads.

Request: can the 1,000 Genomes Project store the data in the cloud?

This would reduce cost for extramural investigators and increase accessibility of the data.

In addition, it supports Federal Open Data:

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government…

Latest release announced at #ICGH2011, more releases coming. Part of the National Big Data Initiative Announcement
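To make the 50-200 TB download problem on this slide concrete, here is a rough estimate of transfer time over a network link. The link speeds, the `efficiency` parameter, and the use of decimal units (1 TB = 10^12 bytes) are assumptions for illustration, not figures from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `size_tb` terabytes over a `link_mbps` megabit/s link.

    `efficiency` is the fraction of nominal bandwidth actually achieved (0 < e <= 1).
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for size_tb in (50, 200):
        for link_mbps in (100, 1000, 10000):
            days = transfer_days(size_tb, link_mbps)
            print(f"{size_tb} TB over {link_mbps} Mbit/s: {days:.1f} days")
```

Even at full, uncontended line rate, 50 TB over a 100 Mbit/s link takes about 46 days, which is why hosting the data next to the compute is attractive for sites without high-capacity links.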

Page 14

Why is NCBI interested in cloud computing?

Quantity of Data

NCBI has petabytes of sequence data that are made available to researchers around the world.

Bandwidth

NIH has a good bit of network capacity, and network capacity is available for many sites to download data sets, especially those on Internet2. However, for many more it is not available, reducing their practical access to research data.

Analysis Tools and Platforms

Some need simple tools: extract a portion of the data (chromosome, area of interest).

Others use more complex tools: genome browsers, analysis tools for epigenomics using Elastic MapReduce.

If we can bring compute to the data, we can improve access to the data.

References in this talk to any specific commercial products, process, service, manufacturer, company, or trademark does not constitute its endorsement or recommendation by the U.S. Government, HHS, or NIH. As an agency of the U.S. Government, NIH cannot endorse or appear to endorse any specific commercial products or services.
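The "simple tools" case on this slide, extracting a chromosome or area of interest, might look like the sketch below for VCF-style text, where the first two tab-separated columns are the chromosome and the 1-based position. The function name and the sample records are hypothetical. For large files, indexed tools would normally be used instead of a linear scan.

```python
from typing import Iterable, Iterator


def extract_region(lines: Iterable[str], chrom: str, start: int, end: int) -> Iterator[str]:
    """Yield header lines plus records on `chrom` with start <= POS <= end."""
    for line in lines:
        if line.startswith("#"):
            # Keep meta-information and column header lines so the output stays valid.
            yield line
            continue
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT",
        "20\t14370\trs1\tG\tA",
        "20\t1110696\trs2\tA\tG",
        "21\t14370\trs3\tT\tC",
    ]
    for record in extract_region(sample, "20", 10000, 20000):
        print(record)
```

Running a filter like this next to the data means only the selected region has to leave the data center, which is the "bring compute to the data" point made above.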

Page 15

1,000 Genomes in the Clouds

The 1,000 Genomes Project files are loaded in Amazon S3.

Millions of files have been uploaded (200 TB).

AMIs have been developed to analyze and review the data: CloudBioLinux, Galaxy.

This is a public data set with storage provided by AWS.

NIH is funding several efforts to port genome pipelines to cloud computing environments.

Research labs, such as those at Emory and UCSC, have placed versions of their software in AWS to make 1,000 Genomes data readily accessible through browser interfaces in the cloud.
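Because the data set is public, a client can in principle browse it over plain HTTPS with the S3 REST listing interface. The sketch below builds a listing request and parses the XML response using only the standard library. The bucket name `1000genomes` and the assumption that anonymous listing is permitted are not stated in the talk, so treat both as assumptions to verify before relying on them.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "1000genomes"  # assumption: name of the public bucket, not given in the talk
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def listing_url(prefix: str = "", max_keys: int = 10, bucket: str = BUCKET) -> str:
    """Build an S3 ListObjectsV2 URL (virtual-hosted style)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data) -> list:
    """Return (key, size_in_bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_data)
    return [
        (
            item.findtext("s3:Key", namespaces=S3_NS),
            int(item.findtext("s3:Size", namespaces=S3_NS)),
        )
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_objects(prefix: str = "", max_keys: int = 10) -> list:
    """Fetch one page of the listing; requires network access."""
    with urllib.request.urlopen(listing_url(prefix, max_keys), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling `list_objects(max_keys=5)` would return the first few object keys and sizes if the bucket allows anonymous listing. Analysis code running inside the same cloud region could read those objects without a large download to a local site.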

Page 16

What is Galaxy?

Galaxy is a framework for integrating computational tools. It allows nearly any tool that can be run from the command line to be wrapped in a structured, well-defined interface.

On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analyses, a workflow system for convenient reuse, data management, sharing, publishing, and more.

Even more: Galaxy has made it easy for researchers to extend their compute power into cloud compute systems.

Tools like Galaxy make it possible for a researcher to take advantage of much greater compute power without having to worry about the infrastructure details.

http://usegalaxy.org

From ASMB tutorial

Page 17

Page 18

Summary/Questions

Compression will help slow this big data problem.

Other big data problems remain.

New file formats will compress data close to the sequencers.

Last-mile networking is a big issue that prevents access for researchers.

Cloud will enable access for many more researchers internationally and at underserved institutions.

Email: [email protected]