robinson bosc2010 bio_hdf

Page 1: Robinson bosc2010 bio_hdf

www.hdfgroup.org

The HDF Group

July 9, 2010

BioHDF: Open Binary File Formats for Next-Generation Sequencing Data

Dana Robinson

The HDF Group

[email protected]

Copyright © 2010 The HDF Group. All Rights Reserved

Current Status and Future Directions

Page 2: Robinson bosc2010 bio_hdf

NGS Data Challenges

Very large quantities of data (100s of GB)

"Drinking from the firehose"

Analysis methods vary greatly, so a flexible yet unified data store would be useful.

Page 3: Robinson bosc2010 bio_hdf

What is Needed

A Data Model: a data model which accurately describes the data and can be expanded to contain new types of data

A Data Store: a file format or data store which is efficient in access time and storage size and which scales well

A Toolkit: a flexible software toolkit that can be used to create tools and pipelines based on the data model and file format

Page 4: Robinson bosc2010 bio_hdf

What is BioHDF?

An open-source, community-driven project, funded by an NIH SBIR grant and led by Geospiza, Inc. in collaboration with The HDF Group.

BioHDF is a particular arrangement of objects in an HDF5 file (similar to a database schema)

BioHDF is a library and C API which can be used to write applications (coming soon)

BioHDF is a set of command line tools for storing, retrieving and manipulating data in BioHDF files

Page 5: Robinson bosc2010 bio_hdf

HDF = Hierarchical Data Format

[Figure: an example of how data is stored in HDF5. A file, somefile.h5, is shown with its groups, datasets, and attributes called out; labels in the diagram include /Reads/, Alignments/, References, and is_sorted.]
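To make the three kinds of object on this slide concrete, here is a toy in-memory model in Python. It is not the HDF5 library and not the BioHDF schema: the class names, the `sequences` dataset, and the placement of `is_sorted` on the `Alignments` group are assumptions made only for illustration. It shows groups acting as containers, datasets holding values, attributes carrying small pieces of metadata, and objects being found by a slash-separated path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """A named collection of values, plus optional metadata (attributes)."""
    data: List[str]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """A container, much like a directory, holding groups and datasets."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def lookup(self, path: str) -> Union["Group", Dataset]:
        """Resolve a slash-separated path such as '/Alignments/References'."""
        node: Union[Group, Dataset] = self
        for name in filter(None, path.split("/")):
            if not isinstance(node, Group):
                raise KeyError(f"{name!r} lies below a dataset, not a group")
            node = node.children[name]
        return node


# Hypothetical layout; the real BioHDF arrangement may differ.
root = Group(children={
    "Reads": Group(children={
        "sequences": Dataset(["ACGT", "GGCA"]),
    }),
    "Alignments": Group(
        children={"References": Dataset(["chr1", "chr2"])},
        attrs={"is_sorted": True},  # an attribute attached to a group
    ),
})

refs = root.lookup("/Alignments/References")
print(refs.data)
print(root.lookup("/Alignments").attrs["is_sorted"])
```

In a real HDF5 file the same roles are played by the library's group, dataset, and attribute objects, and datasets are typed multidimensional arrays rather than Python lists.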

Page 6: Robinson bosc2010 bio_hdf

Benefits of BioHDF

• Portability and data sharing: Platform independent, endian independent, self-describing, common data models.

• High performance: Fast random access and efficient, scalable, petabyte-level compressed storage.

• Widespread adoption: MATLAB, IDL, NASA Earth Observing System, Pacific Biosciences, SOLiD, hundreds of products.

• 20-year history: Robust, performance tuned, and well supported by The HDF Group, an independent non-profit entity.

Page 7: Robinson bosc2010 bio_hdf

HDF in Bioinformatics

• Baylor Imaging Group

• Life Technologies

• Pacific Biosciences

• Oxford Nanopore

• GenomeData (UW)

• Geospiza

• Others

Page 8: Robinson bosc2010 bio_hdf

Data Stored

The prototype BioHDF stores:

Reads

Alignments

Annotations

Clusters of Aligned Reads

Reference Sequences

Indexes (NCList or simple)
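The slides do not say how the "simple" index works, so the following is only one plausible form, sketched in Python under that assumption: keep the aligned intervals sorted by start position and use binary search to narrow a region query. The class name `SimpleIndex`, the half-open `[start, end)` convention, and the sample coordinates are all hypothetical. An NCList (nested containment list) answers the same overlap query but is designed to stay efficient when some intervals are fully contained in others.

```python
import bisect
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one reference


class SimpleIndex:
    """Overlap queries over intervals sorted by start position."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        self.intervals: List[Interval] = sorted(intervals)
        self.starts = [start for start, _ in self.intervals]
        # The longest interval bounds how far left an overlap can begin.
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def overlapping(self, qstart: int, qend: int) -> List[Interval]:
        """Return every stored interval that overlaps [qstart, qend)."""
        # An overlapping interval must start after qstart - max_len ...
        lo = bisect.bisect_right(self.starts, qstart - self.max_len)
        # ... and must start before qend.
        hi = bisect.bisect_left(self.starts, qend)
        # Among those candidates, keep the ones that end after qstart.
        return [(s, e) for s, e in self.intervals[lo:hi] if e > qstart]


index = SimpleIndex([(100, 150), (120, 170), (300, 350), (10, 60)])
print(index.overlapping(140, 310))
print(index.overlapping(60, 100))
```

The binary search only narrows the candidates; the final filter still scans them, which is why a single very long interval makes this approach slower and why a structure such as NCList can be preferable.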

Page 9: Robinson bosc2010 bio_hdf

Data Stored

Additional user-specific data can be stored without breaking the library or tools.

BioHDF Data

User-Specific Data

This is similar to how adding tables to a database schema does not invalidate existing queries.
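The database analogy can be checked directly with Python's built-in `sqlite3` module. This is an illustration of the analogy only, not of BioHDF itself; the table names `reads` and `lab_notes` and their contents are made up for the example. A query written against the original schema returns the same rows before and after an unrelated table is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The "standard" part of the schema that existing tools know about.
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO reads (sequence) VALUES (?)",
    [("ACGT",), ("GGCA",)],
)

QUERY = "SELECT sequence FROM reads ORDER BY id"
before = conn.execute(QUERY).fetchall()

# A user adds their own table; nothing about `reads` changes.
conn.execute("CREATE TABLE lab_notes (read_id INTEGER, note TEXT)")
conn.execute("INSERT INTO lab_notes VALUES (1, 'low quality tail')")

after = conn.execute(QUERY).fetchall()
print(before == after)
conn.close()
```

By the same reasoning, a tool that opens only the objects it knows by path in a file has no reason to notice extra user-specific objects stored beside them.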

Page 10: Robinson bosc2010 bio_hdf

Project Stages

A "pipeline prototype" set of tools to demonstrate the suitability of HDF5 for NGS data storage.

A version 1.0 release of a BioHDF library and C API targeting the functionality of samtools.

A higher-level C API that abstracts out and hides the underlying storage technology.

Page 11: Robinson bosc2010 bio_hdf

[Figure: software stack diagram. Layers labeled in the diagram: BioHDF Applications and Wrappers (e.g. Perl, Python); High-Level API; BioHDF API; HDF5 API and Applications; HDF5 API; Physical Storage.]

Page 12: Robinson bosc2010 bio_hdf

A Higher-Level API

[Figure: diagram with boxes labeled tool, wrapper, high-level C API, and low-level C APIs (BioHDF API, samtools BAM API).]

A high-level API will encapsulate and hide the underlying storage technology.
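The encapsulation idea can be sketched as an abstract interface with interchangeable backends. The planned API is a C API and the slides give none of its functions, so everything named below (`AlignmentStore`, `FlatListBackend`, `PerReferenceBackend`, `count_in_region`) is hypothetical, and the two backends are in-memory stand-ins for different storage technologies, not BioHDF or BAM. Python is used here only to keep the sketch short.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, List, Tuple

Alignment = Tuple[str, int, str]  # (reference name, position, read name)


class AlignmentStore(ABC):
    """Hypothetical high-level interface; tools depend only on this."""

    @abstractmethod
    def add(self, alignment: Alignment) -> None:
        """Store one alignment."""

    @abstractmethod
    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        """Return alignments on `reference` with start <= position < end."""


class FlatListBackend(AlignmentStore):
    """Stand-in for one storage technology: a single flat list."""

    def __init__(self) -> None:
        self._rows: List[Alignment] = []

    def add(self, alignment: Alignment) -> None:
        self._rows.append(alignment)

    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        return sorted(a for a in self._rows
                      if a[0] == reference and start <= a[1] < end)


class PerReferenceBackend(AlignmentStore):
    """Stand-in for another technology: alignments grouped by reference."""

    def __init__(self) -> None:
        self._by_ref: Dict[str, List[Alignment]] = defaultdict(list)

    def add(self, alignment: Alignment) -> None:
        self._by_ref[alignment[0]].append(alignment)

    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        return sorted(a for a in self._by_ref.get(reference, [])
                      if start <= a[1] < end)


def count_in_region(store: AlignmentStore, reference: str,
                    start: int, end: int) -> int:
    """A 'tool' written once against the interface, not a backend."""
    return len(store.fetch(reference, start, end))


data = [("chr1", 100, "read_a"), ("chr1", 250, "read_b"), ("chr2", 120, "read_c")]
counts = []
for backend in (FlatListBackend(), PerReferenceBackend()):
    for aln in data:
        backend.add(aln)
    counts.append(count_in_region(backend, "chr1", 0, 200))
print(counts)
```

Because `count_in_region` never touches a backend's internals, swapping the storage layer requires no change to the tool, which is the property the slide attributes to the high-level API.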

Page 13: Robinson bosc2010 bio_hdf

Acknowledgements

BioHDF is supported by NIH SBIR Phase II grant HG003792 awarded to Geospiza, Inc.

Geospiza: Todd Smith, Mark Welsh

The HDF Group: Mike Folk

Page 14: Robinson bosc2010 bio_hdf

Thank you for your time!

If you are interested in using or contributing to BioHDF, please contact us!

Dana Robinson ([email protected])

http://www.biohdf.org

BOSC BoF: Friday 5:10-6:00

ISMB Poster J18: Monday, July 12: 12:40-2:30

ISMB BoF: Tuesday, July 13 1-2 pm, room 306