Transcript
Page 1: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

NEXTBIO 2008

Leveraging HBase for the World's Largest Curated Genomic Data Collection

Satnam Alag, Ph.D.
VP of [email protected]

© 2012 NextBio | All rights reserved | This information is proprietary and confidential.

Page 2: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Technology Generating Exponential Data

Page 3: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Genomic Big Data

Page 4: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Use Case 1: HBase to Store Variant Data
• Each genome has ~4 million variants
• Immutable: write once, never change, read many times
• Bloom filters are useful
• Batch import of data: HFile
• Data to be accessed is collocated in a region
• Separate HBase cluster from Hadoop
• All the smarts are in the keys for the various tables

In HBase:
1 genome: 10 million rows
100 genomes: 1 billion rows
100K genomes: 1 trillion rows
100M genomes: 1 quadrillion rows (1,000,000,000,000,000)

Fortunately, HBase cluster access can be partitioned by the application when required

Page 5: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Accessing Data with Pagination

Table 1: Key: Bioset Id + Display Order

Columns

Pagination example: Page 5, Page Size = 100

Retrieve 100 rows from Display Order = 400-500

Number of rows = 1 per SNP, on the order of 4 million
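The pagination example above can be sketched as a row-key range computation. This is a minimal sketch, not NextBio's implementation: the 8-byte big-endian encoding and the names `row_key` and `page_range` are assumptions made for illustration. Fixed-width big-endian integers are used so that byte-wise key ordering (which is how HBase sorts rows) matches numeric ordering.

```python
import struct
from typing import Tuple

def row_key(bioset_id: int, display_order: int) -> bytes:
    """Hypothetical Table 1 key: bioset id followed by display order,
    each packed as an 8-byte big-endian unsigned integer."""
    return struct.pack(">QQ", bioset_id, display_order)

def page_range(bioset_id: int, page: int, page_size: int) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) for a 1-based page number.
    start_row is inclusive and stop_row is exclusive, as in an HBase scan."""
    first = (page - 1) * page_size
    return row_key(bioset_id, first), row_key(bioset_id, first + page_size)

# Page 5 with a page size of 100 covers display orders 400 up to (not including) 500.
# The bioset id 42 is an arbitrary illustrative value.
start, stop = page_range(bioset_id=42, page=5, page_size=100)
```

A scan bounded by `start` and `stop` would then touch exactly one page of contiguous rows, which is why the display order is part of the key.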

Page 6: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Accessing Data with Keys

Table 1: Key: Bioset Id + Display Order

Keys returned by search index

Page 7: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Filtering Data with Pagination

Table 1: Key: Bioset Id + Display Order

Example: Gene: ESR1, Class: Missense, Page Size = 100

Retrieve rows from Table 2
Retrieve rows by keys from Table 1

Number of rows: on the order of 0.5 million per dataset (# genes x classes)

Table 2: Id + GeneId + MutationClass

Column: Counts, Keys to Table
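The two-step lookup on this slide (read an index row from Table 2, then fetch rows by key from Table 1) can be sketched with plain dictionaries standing in for the HBase tables. Everything here is an assumption for illustration: the dictionary layout, the function name `filtered_page`, the use of a gene symbol in place of a GeneId, and the sample rows, which are made up.

```python
from typing import Dict, List, Tuple

# Stand-in for Table 1: (bioset id, display order) -> row contents (made-up data).
table1: Dict[Tuple[int, int], dict] = {
    (7, 0): {"gene": "ESR1", "class": "Missense"},
    (7, 1): {"gene": "TP53", "class": "Nonsense"},
    (7, 2): {"gene": "ESR1", "class": "Missense"},
}

# Stand-in for Table 2: (id, gene, mutation class) -> count plus keys into Table 1.
table2: Dict[Tuple[int, str, str], dict] = {
    (7, "ESR1", "Missense"): {"count": 2, "keys": [(7, 0), (7, 2)]},
    (7, "TP53", "Nonsense"): {"count": 1, "keys": [(7, 1)]},
}

def filtered_page(bioset_id: int, gene: str, mutation_class: str,
                  page: int, page_size: int) -> Tuple[int, List[dict]]:
    """Return (total matches, rows on the requested 1-based page)."""
    entry = table2.get((bioset_id, gene, mutation_class))
    if entry is None:
        return 0, []
    first = (page - 1) * page_size
    page_keys = entry["keys"][first:first + page_size]
    # Second step: fetch the selected rows from Table 1 by key.
    return entry["count"], [table1[key] for key in page_keys]

total, rows = filtered_page(7, "ESR1", "Missense", page=1, page_size=100)
```

Keeping the count in the index row means the total number of matches is available without reading Table 1 at all.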

Page 8: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Powering the Genome Browser

Table 1: Key: Bioset Id + Display Order

Example: Chr: 6, Specified Range

Retrieve all rows

1 row per SNP, ~4 million per dataset

Table 2: Id + GeneId + MutationClass

Table 3: Id + ChromosomeId + Range + DisplayOrder
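A genome-browser query over Table 3 can be sketched as a single key-range scan. The slides do not say how the "Range" component is encoded, so this sketch assumes it is a fixed-width position bucket; `BIN_SIZE`, `KEY_FORMAT`, and the function names are all hypothetical, and the SNP positions are made up. A sorted Python list with `bisect` stands in for HBase's sorted row storage.

```python
import bisect
import struct

BIN_SIZE = 1_000_000  # assumed bucket width in base pairs, not from the slides
KEY_FORMAT = ">QHIQ"  # bioset id, chromosome id, bucket index, display order

def table3_key(bioset_id, chromosome_id, position, display_order):
    """Hypothetical Table 3 key for one SNP."""
    return struct.pack(KEY_FORMAT, bioset_id, chromosome_id,
                       position // BIN_SIZE, display_order)

def scan_bounds(bioset_id, chromosome_id, start_pos, end_pos):
    """Inclusive start key and exclusive stop key covering every bucket
    that overlaps [start_pos, end_pos] on one chromosome."""
    start = struct.pack(KEY_FORMAT, bioset_id, chromosome_id, start_pos // BIN_SIZE, 0)
    stop = struct.pack(KEY_FORMAT, bioset_id, chromosome_id, end_pos // BIN_SIZE + 1, 0)
    return start, stop

def scan(sorted_keys, start, stop):
    """Simulate a range scan over lexicographically sorted keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

# Made-up SNPs as (chromosome id, position); display order is the list index.
snps = [(6, 500_000), (6, 31_500_000), (6, 32_200_000), (6, 90_000_000), (7, 31_600_000)]
keys = sorted(table3_key(1, chrom, pos, order) for order, (chrom, pos) in enumerate(snps))

start, stop = scan_bounds(1, 6, 31_000_000, 32_999_999)
hits = scan(keys, start, stop)
```

Because whole buckets are scanned, a caller that needs exact boundaries would still filter the returned rows by position.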

Page 9: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Use Case 2: Correlation Data

Page 10: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Use Case 2
• Each correlation score stored as a row
• HFile created for new score
• Over 20 billion correlations

[Figure: correlation matrix with biosets B1, B2, …, Bn, Bn+1 as both rows and columns]

T1: scorebioset (base table) key: biosetid_1 [+] biosetid_2
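The `biosetid_1 [+] biosetid_2` key can be sketched as follows. This is an illustrative sketch only: the 8-byte big-endian encoding, the helper names, and the choice to store each pair in both orders (so that all partners of a bioset share a key prefix) are assumptions, and the scores are made up. A dictionary stands in for the `scorebioset` table.

```python
import struct

def score_key(bioset_id_1: int, bioset_id_2: int) -> bytes:
    """Hypothetical row key: two bioset ids, each 8 bytes, big-endian."""
    return struct.pack(">QQ", bioset_id_1, bioset_id_2)

def partners(table: dict, bioset_id: int) -> dict:
    """Emulate a prefix scan: every row whose key starts with bioset_id.
    Returns {other bioset id: score}."""
    start = score_key(bioset_id, 0)      # inclusive
    stop = score_key(bioset_id + 1, 0)   # exclusive
    return {
        struct.unpack(">QQ", key)[1]: score
        for key, score in sorted(table.items())
        if start <= key < stop
    }

# Stand-in for the scorebioset table, with made-up scores.
scorebioset = {}
for b1, b2, score in [(1, 2, 0.91), (1, 3, -0.40), (2, 3, 0.12)]:
    scorebioset[score_key(b1, b2)] = score
    scorebioset[score_key(b2, b1)] = score  # assumed: pair stored in both orders

scores_for_3 = partners(scorebioset, 3)
```

With one row per score, adding a bioset only appends new rows, which fits the write-once, bulk-loaded pattern described on the slide.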

Page 11: HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data Collection

Lessons Learnt
• HBase Works Well For
-- Immutable Data
-- Insertions Using HFiles
-- Billions of Rows
-- Intelligence in Key Definition
• Road to Production
-- Redundant Data in Database
