chambwe bosc2010

THE GOBY FRAMEWORK: TOWARDS EFFICIENT NEXT-GENERATION SEQUENCING DATA ANALYSIS

Nyasha Chambwe, Kevin C. Dorff, Marko Srdanovic, Xutao Deng, Stuart J.D. Andrews, Fabien Campagne

The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine; Department of Physiology and Biophysics, Weill Medical College of Cornell University

http://goby.campagnelab.org

Upload: bosc-2010

Post on 11-Jun-2015

Page 2: Chambwe bosc2010

Applications of Next Generation Sequencing

(McPherson J.D. Nat Methods. 2009)

Page 3: Chambwe bosc2010

Next Generation Sequencers

                      Roche/454         Illumina/Solexa         Life Technologies       Helicos BioSciences
                      GS FLX Titanium   GA IIe                  SOLiD 3                 Heliscope
NGS chemistry         Pyrosequencing    Reversible terminators  Sequencing by ligation  Reversible terminators
Avg read length (bp)  330               75                      50                      32
Run time (days)       0.35              4                       7                       8
Gigabases/run         0.45              18                      30                      37
Million reads/run     1.36              240                     600                     1156

Metzker, M.L. Nat Rev Genet. 2010
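
The rows of the sequencer comparison are tied together by simple arithmetic: gigabases per run should be roughly the number of reads times the average read length. The sketch below is a sanity check written for this transcript, not part of the original slides. It recomputes that product from the figures quoted from Metzker (2010) and derives an approximate gigabases-per-day rate. The dictionary keys and function name are illustrative.

```python
# Per-run figures copied from the slide's table (Metzker, Nat Rev Genet. 2010).
PLATFORMS = {
    "Roche/454 GS FLX Titanium": {"read_len_bp": 330, "run_days": 0.35,
                                  "gb_per_run": 0.45, "million_reads": 1.36},
    "Illumina/Solexa GA IIe": {"read_len_bp": 75, "run_days": 4,
                               "gb_per_run": 18, "million_reads": 240},
    "Life Technologies SOLiD 3": {"read_len_bp": 50, "run_days": 7,
                                  "gb_per_run": 30, "million_reads": 600},
    "Helicos BioSciences Heliscope": {"read_len_bp": 32, "run_days": 8,
                                      "gb_per_run": 37, "million_reads": 1156},
}


def estimated_gb(million_reads, read_len_bp):
    """Gigabases per run implied by read count and average read length."""
    return million_reads * 1e6 * read_len_bp / 1e9


for name, p in PLATFORMS.items():
    implied = estimated_gb(p["million_reads"], p["read_len_bp"])
    per_day = p["gb_per_run"] / p["run_days"]
    print(f"{name}: {implied:.2f} Gb/run implied by reads x length, "
          f"{per_day:.2f} Gb/day")
```

The implied values agree with the quoted gigabases-per-run column to within rounding, which is a useful check that the flattened table was reassembled with the right column order.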

Page 4: Chambwe bosc2010

Next Generation Sequence Data Formats

Key Limitations

• Text-based formats do not scale well to large amounts of data

• Naïve compression prevents semi-random access

Page 5: Chambwe bosc2010

File Format Wish List

• Structured schema/data representation
• Well specified and documented (not ambiguous)
• Fast parsing speed
• Language and operating system portability
• Backward and forward compatibility
• Compression
• Random access
• Streaming

Page 6: Chambwe bosc2010

The Goby Software Framework

[Diagram: framework layers]

File Formats: reads, alignments, histograms
Low Level APIs (Java, C++, Python): Readers, Writers, Iterators
Tools/Utilities: File Format Conversions, Alignment Processing, Visualization
Applications: RNA-Seq Pipeline, IGV Plug-in

Page 8: Chambwe bosc2010

Structured non-ambiguous representation

Goby uses Protocol Buffers (PB) to provide “a flexible, efficient, automated mechanism for serializing structured data” (PB website)

• PB generates parsers in many languages, e.g., Java, C++, Python, Perl, R, C, C#, Visual Basic, PHP, Objective C, Ruby, Common Lisp

• PB provides forward and backward compatibility

Page 9: Chambwe bosc2010

Goby compact formats

Data is represented by Protocol Buffers as a message defined by a .proto file
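
To make the idea of a schema-defined binary message concrete, here is a minimal sketch of the Protocol Buffers wire encoding for integer fields: each field is written as a tag (field number shifted left by three bits, combined with a wire type) followed by a base-128 varint value. This is a hand-rolled illustration, not Goby code and not the output of the Protocol Buffers compiler. The field names and numbers (`query_index`, `target_index`, `position`, `score`) are hypothetical and do not come from Goby's .proto files. It also shows why an older reader keeps working when a newer writer adds a field: unknown field numbers are simply skipped.

```python
WIRE_VARINT = 0  # Protocol Buffers wire type for variable-length integers


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields):
    """Serialize {field_number: int} as tag/value pairs (varint fields only)."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        out += encode_varint((number << 3) | WIRE_VARINT)
        out += encode_varint(value)
    return bytes(out)


def decode_message(buf, known_fields):
    """Parse varint fields, keeping only field numbers the reader knows."""
    pos = 0
    result = {}
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x7
        if wire_type != WIRE_VARINT:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(buf, pos)
        if number in known_fields:
            result[known_fields[number]] = value
    return result


# Hypothetical schemas: the newer one adds field 4 ("score").
OLD_SCHEMA = {1: "query_index", 2: "target_index", 3: "position"}
NEW_SCHEMA = {1: "query_index", 2: "target_index", 3: "position", 4: "score"}

encoded = encode_message({1: 12345, 2: 3, 3: 1500000, 4: 42})
print(len(encoded), "bytes")
print(decode_message(encoded, NEW_SCHEMA))
print(decode_message(encoded, OLD_SCHEMA))  # field 4 is skipped, not an error
```

Small integers occupy one or two bytes instead of a fixed-width or textual representation, which is one reason a Protocol Buffers message tends to be smaller than the equivalent line of a text format.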

Page 11: Chambwe bosc2010

Goby compact formats

Chunking:
• Semi-random access
• Efficient parallel processing
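
The contrast between naïve whole-file compression and chunking can be sketched as follows. Records are compressed in independent chunks and an index of chunk offsets is kept, so a reader can decompress only the chunk that holds the record it needs. This is a generic illustration using Python's standard `zlib` module. It is not Goby's actual on-disk layout, and the function names, the in-memory index, and the chunk size are assumptions made for the example.

```python
import zlib


def write_chunked(records, chunk_size):
    """Compress records in independent chunks; return (blob, index).

    index[i] is the (byte_offset, byte_length) of compressed chunk i in blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(records), chunk_size):
        payload = "\n".join(records[start:start + chunk_size]).encode("utf-8")
        compressed = zlib.compress(payload)
        index.append((len(blob), len(compressed)))
        blob += compressed
    return bytes(blob), index


def read_chunk(blob, index, chunk_number):
    """Decompress a single chunk without touching the others."""
    offset, length = index[chunk_number]
    payload = zlib.decompress(blob[offset:offset + length])
    return payload.decode("utf-8").split("\n")


def read_record(blob, index, chunk_size, record_number):
    """Fetch one record by decompressing only the chunk that contains it."""
    chunk = read_chunk(blob, index, record_number // chunk_size)
    return chunk[record_number % chunk_size]


# Toy "reads": an identifier and a short sequence per record.
records = [f"read_{i}\tACGT" for i in range(10_000)]
blob, index = write_chunked(records, chunk_size=1_000)

print(len(index), "chunks,", len(blob), "compressed bytes")
print(read_record(blob, index, 1_000, 7_654))
```

Access is "semi-random" because the unit of retrieval is a chunk rather than a single record. Since chunks are independent, each one can also be handed to a separate worker, which is what makes parallel processing straightforward.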

Page 13: Chambwe bosc2010

Goby File Size Comparisons

MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050), sequenced on four next-gen platforms

Page 16: Chambwe bosc2010

Alignment Iterator

Code fragment to:
1. Scan through two alignments (input1, input2)
2. Print information for each entry
3. Print information for chromosomes 1, 2, X only
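
The code fragment itself did not survive in this transcript. The three steps listed above can be sketched generically as follows. This is not the Goby API: the `Entry` type, the `iterate_alignments` function, and the assumption that both inputs are sorted by target and position (so they can be merged rather than concatenated) are all invented for illustration.

```python
import heapq
from collections import namedtuple

# Hypothetical, simplified alignment entry; not Goby's actual entry type.
Entry = namedtuple("Entry", ["target", "position", "query_index"])


def iterate_alignments(sorted_inputs, keep_targets=None):
    """Yield entries from several sorted inputs in merged order.

    Each input must already be sorted by (target, position). If keep_targets
    is given, only entries on those targets (chromosomes) are yielded.
    """
    merged = heapq.merge(*sorted_inputs, key=lambda e: (e.target, e.position))
    for entry in merged:
        if keep_targets is None or entry.target in keep_targets:
            yield entry


# Step 1: two toy alignments, each sorted by (target, position).
input1 = [Entry("1", 100, 0), Entry("2", 50, 1), Entry("3", 10, 2)]
input2 = [Entry("1", 150, 3), Entry("X", 5, 4), Entry("Y", 7, 5)]

# Steps 2 and 3: print each entry, restricted to chromosomes 1, 2 and X.
for e in iterate_alignments([input1, input2], keep_targets={"1", "2", "X"}):
    print(e.target, e.position, e.query_index)
```

Because the function is a generator over `heapq.merge`, entries are produced one at a time and neither input has to be held in memory in full, which is the property an iterator API needs for large alignment files.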

Page 19: Chambwe bosc2010

RNA-Seq Pipeline

• Objective: To determine levels of expression in samples and perform differential expression analysis

• Supports:
  Mapping to full genome
  Mapping to annotated cDNAs (reads match inside exons and across exon-exon boundaries)

• Sequencing platform independent

• Published normalization methods implemented
  Mortazavi A et al. Nat Methods. 2008
  Bullard JH et al. BMC Bioinformatics. 2010

• Bias correction for platform specific biases
  Hansen KD et al. Nucleic Acids Res. 2010
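
As an illustration of the kind of normalization cited above, the sketch below computes RPKM (reads per kilobase of transcript per million mapped reads), the measure introduced by Mortazavi et al. (2008). It is a standalone example of the formula, not the Goby pipeline's implementation, and the function name and the sample numbers are made up.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the
    transcript, N the total number of mapped reads in the sample, and L the
    transcript length in base pairs.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Made-up example: the same transcript in two samples of different depth.
sample_a = rpkm(read_count=500, transcript_length_bp=2_000,
                total_mapped_reads=10_000_000)
sample_b = rpkm(read_count=1_000, transcript_length_bp=2_000,
                total_mapped_reads=40_000_000)
print(f"sample A: {sample_a:.1f} RPKM, sample B: {sample_b:.1f} RPKM")
```

Sample B has twice the raw read count of sample A but four times the sequencing depth, so its normalized expression is lower. Raw counts alone would suggest the opposite, which is why normalization precedes differential expression analysis.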

Page 20: Chambwe bosc2010

Sample RNA-Seq Results

Page 21: Chambwe bosc2010

Conclusion

• Goby file formats are efficient and non-ambiguous

• Alignments are about five times smaller than BAM alignments

• API makes it easy to write efficient code to handle large datasets

• Framework provides utilities and analysis pipelines for common NGS data analysis tasks

Page 22: Chambwe bosc2010

Acknowledgements

Campagne Lab: Fabien Campagne, Kevin C. Dorff, Marko Srdanovic, Stuart J.D. Andrews

Broad Institute: Jim Robinson

FDA/NCTR: Leming Shi

Sequencing Quality Control Project (SEQC)

Helicos, Illumina, Life Technologies, Roche

http://goby.campagnelab.org

Page 24: Chambwe bosc2010

cDNA Search