Chambwe, BOSC 2010
THE GOBY FRAMEWORK: TOWARDS EFFICIENT NEXT-GENERATION SEQUENCING DATA ANALYSIS
Nyasha Chambwe, Kevin C. Dorff, Marko Srdanovic, Xutao Deng, Stuart J.D. Andrews, Fabien Campagne
The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine; Department of Physiology and Biophysics, Weill Medical College of Cornell University
http://goby.campagnelab.org
McPherson J.D. Nat Methods. 2009
Applications of Next Generation Sequencing
Next Generation Sequencers
Platform | NGS Chemistry | Avg Read Length (bp) | Run Time (days) | Gigabases/run | Million reads/run
Roche/454 GS FLX Titanium | Pyrosequencing | 330 | 0.35 | 0.45 | 1.36
Illumina/Solexa GA IIe | Reversible Terminators | 75 | 4 | 18 | 240
Life Technologies SOLiD 3 | Sequencing by ligation | 50 | 7 | 30 | 600
Helicos BioSciences Heliscope | Reversible Terminators | 32 | 8 | 37 | 1156
Metzker, M.L. Nat Rev Genet. 2010
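The figures in the sequencer comparison above can be cross-checked with simple arithmetic: yield per run should be roughly read length times number of reads. The sketch below uses only the numbers from the slide; the function and variable names are made up for illustration.

```python
# Values copied from the slide: (avg read length in bp, million reads/run,
# reported gigabases/run, run time in days).
platforms = {
    "Roche/454 GS FLX Titanium": (330, 1.36, 0.45, 0.35),
    "Illumina/Solexa GA IIe": (75, 240, 18, 4),
    "Life Technologies SOLiD 3": (50, 600, 30, 7),
    "Helicos BioSciences Heliscope": (32, 1156, 37, 8),
}


def estimated_gigabases(read_len_bp, million_reads):
    """Estimate yield per run in gigabases: read length x reads / 1e9."""
    return read_len_bp * million_reads * 1e6 / 1e9


for name, (read_len, mreads, reported_gb, days) in platforms.items():
    estimate = estimated_gigabases(read_len, mreads)
    per_day = reported_gb / days  # throughput per day of run time
    print(f"{name}: ~{estimate:.2f} Gb/run estimated, "
          f"{reported_gb} Gb/run reported, {per_day:.2f} Gb/day")
```

The estimates agree with the reported gigabases per run after rounding, which suggests the table columns are internally consistent.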
Next Generation Sequence Data Formats
Key Limitations
• Text-based formats do not scale well to handle large amounts of data
• Naïve compression prevents semi-random access
File Format Wish List
• Structured schema/data representation
• Well specified and documented (not ambiguous)
• Fast parsing speed
• Language and operating system portability
• Backward and forward compatibility
• Compression
• Random access
• Streaming
The Goby Software Framework
[Diagram: framework layers]
• Languages: Java, C++, Python
• File Formats: reads, alignments, histograms
• Low Level APIs: Readers, Writers, Iterators
• Tools/Utilities: File Format Conversions, Alignment Processing, Visualization
• Applications: RNA-Seq Pipeline, IGV Plug-in
Structured non-ambiguous representation
Goby uses Protocol Buffers (PB) to provide “a flexible, efficient, automated mechanism for serializing structured data” (PB website)
• PB generates parsers in different languages, e.g., Java, C++, Python, Perl, R, C, C#, Visual Basic, PHP, Objective C, Ruby, Common Lisp
• PB provides forward and backward compatibility
Goby compact formats
Data is represented by Protocol Buffers as a message defined by a .proto file
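One reason a Protocol Buffers message can be smaller than a text record is that integers are stored as base-128 varints rather than as decimal digits. The sketch below illustrates that single wire-format primitive for non-negative integers; it is a simplified illustration, not Goby code, and the function names are made up for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)  # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


position = 123456789  # e.g. a genomic coordinate
as_text = str(position).encode("ascii")
as_varint = encode_varint(position)
print(len(as_text), "bytes as text vs", len(as_varint), "bytes as varint")
```

Small values such as read lengths or mismatch counts fit in a single byte, which is where most of the savings over text would come from.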
Goby compact formats
Chunking:
• Semi-random access
• Efficient parallel processing
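The idea behind chunking can be sketched as follows: records are compressed in independent blocks, and an index of block offsets lets a reader jump to any block without decompressing what comes before it. This is a minimal sketch under assumed conventions (gzip blocks, a 4-byte length prefix, newline-separated records); it does not reproduce the actual Goby on-disk layout.

```python
import gzip
import io
import struct


def write_chunked(records, out, records_per_chunk=1000):
    """Write records as independently compressed chunks; return chunk offsets."""
    offsets = []
    for start in range(0, len(records), records_per_chunk):
        chunk = records[start:start + records_per_chunk]
        compressed = gzip.compress("\n".join(chunk).encode("utf-8"))
        offsets.append(out.tell())
        out.write(struct.pack(">I", len(compressed)))  # 4-byte length prefix
        out.write(compressed)
    return offsets


def read_chunk(inp, offset):
    """Seek to one chunk and decompress only that chunk."""
    inp.seek(offset)
    (length,) = struct.unpack(">I", inp.read(4))
    return gzip.decompress(inp.read(length)).decode("utf-8").split("\n")


records = [f"read_{i}" for i in range(2500)]
buffer = io.BytesIO()
offsets = write_chunked(records, buffer)

# Jump straight to the third chunk without touching the first two.
third = read_chunk(buffer, offsets[2])
print(len(offsets), "chunks; third chunk starts with", third[0])
```

Because each chunk decompresses on its own, separate workers could each be handed a subset of the offsets, which is what makes parallel processing straightforward.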
Goby File Size Comparisons
MAQC sample B = Ambion Human Brain Reference RNA (HBRR or HBR, Catalog #6050) sequenced on four next-gen platforms
Alignment Iterator
Code fragment to:
1. Scan through two alignments (input1, input2)
2. Print information for each entry
3. Print information for chromosomes 1, 2, X only
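The code fragment from the slide is not preserved in this transcript. The three steps listed above can be sketched in plain Python as below. This does not use the Goby API: the AlignmentEntry fields, the iterate_entries function, and the sample data are all hypothetical stand-ins chosen for illustration.

```python
from itertools import chain
from typing import Iterable, Iterator, NamedTuple


class AlignmentEntry(NamedTuple):
    # Hypothetical fields for illustration; not Goby's actual entry schema.
    query_index: int
    reference: str
    position: int
    score: float


def iterate_entries(alignments: Iterable[Iterable[AlignmentEntry]],
                    keep_references: Iterable[str]) -> Iterator[AlignmentEntry]:
    """Scan several alignments in turn, yielding entries on selected references."""
    keep = set(keep_references)
    for entry in chain.from_iterable(alignments):
        if entry.reference in keep:
            yield entry


# Made-up sample data standing in for two alignment files.
input1 = [AlignmentEntry(0, "1", 1000, 30.0), AlignmentEntry(1, "3", 500, 28.0)]
input2 = [AlignmentEntry(0, "X", 42, 35.0), AlignmentEntry(1, "2", 7, 31.0),
          AlignmentEntry(2, "MT", 9, 20.0)]

kept = list(iterate_entries([input1, input2], {"1", "2", "X"}))
for entry in kept:
    print(f"query={entry.query_index} ref={entry.reference} "
          f"pos={entry.position} score={entry.score}")
```

A generator keeps memory use flat regardless of alignment size, since entries are filtered as they stream past rather than being loaded up front.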
RNA-Seq Pipeline
• Objective: To determine levels of expression in samples and perform differential expression analysis
• Supports:
  • Mapping to full genome
  • Mapping to annotated cDNAs (reads match inside exons and across exon-exon boundaries)
• Sequencing platform independent
• Published normalization methods implemented (Mortazavi A et al. Nat Methods. 2008; Bullard JH et al. BMC Bioinformatics. 2010)
• Bias correction for platform specific biases (Hansen KD et al. Nucleic Acids Res. 2010)
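As an example of what a published normalization method computes, the Mortazavi et al. 2008 paper cited above introduced RPKM (reads per kilobase of feature per million mapped reads). The sketch below shows that formula only; it assumes RPKM is among the methods the pipeline implements, and the function name and sample numbers are made up for illustration.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = 1e9 * reads on feature / (feature length in bp * total mapped reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp transcript in a sample
# with 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Dividing by both feature length and sequencing depth is what lets expression values be compared across genes of different sizes and across samples sequenced to different depths.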
Sample RNA-Seq Results
Conclusion
• Goby file formats are efficient and non-ambiguous
• Alignments are about five times smaller than BAM alignments
• API makes it easy to write efficient code to handle large datasets
• Framework provides utilities and analysis pipelines for common NGS data analysis tasks
Acknowledgements
Campagne Lab: Fabien Campagne, Kevin C. Dorff, Marko Srdanovic, Stuart J.D. Andrews
Broad Institute: Jim Robinson
FDA/NCTR: Leming Shi
Sequencing Quality Control Project (SEQC)
Helicos, Illumina, Life Technologies, Roche
http://goby.campagnelab.org
cDNA Search