CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Post on 28-Dec-2015

Page 1: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

CUGI Pilot Sequencing/Assembly Projects

Christopher Saski

Page 2: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Sequencing the Cacao Genome: 3 Megabases at a Time

• Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome

• IBM in silico assembly project
  – Testing the assembly pipeline

Page 3: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Sequencing the Cacao Genome: 3 Megabases at a Time

• Combination of:
  – “Old School Genomics”
    • BAC libraries, physical mapping, and clone-by-clone sequencing
  – Roche 454 Titanium and FLX de novo sequencing

• Key:
  – A eukaryotic genome has not yet been accurately assembled with NGS alone
  – Reduce assembly complexity

Page 4: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

3 Megabase segments

Rounsley et al., 2009

Page 5: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Advantages

• Reduce assembly complexity

• Limit number of sequencing libraries

• Prioritize critical genomic regions

• Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer

• Flexibility
  – Start slow with minimal investment
  – Could redesign strategy to reduce sequence runs

Page 6: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Strategy Components

• Integrated Physical/Genetic framework

• Pool development and sequencing:
  – BAC-end
  – Titanium 454 (paired/non-paired)
  – Draft sequence

• Assembly and integration:
  – Newbler
  – Celera (CABOG)

Page 7: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Cacao Integrated Physical/Genetic Framework

• Represents ~29X coverage (3 BAC libraries)

• Assembled into small number of large contigs

• Suggests reasonable levels of heterozygosity

• Manageable amounts of repetitive sequence

• 220 anchored genetic markers spanning 10 linkage groups
  – Resemble recombinational derived order

Page 8: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Pool Development

• Select contiguous BAC clones from the minimum tiling path (MTP)

• Pools will contain 25-30 clones
  – 20-30 kb overlap

• Complete Cacao MTP will require 120-150 pools

• Repetitive-type regions:
  – BAC-end sequence and physical map data as a predictive tool
    • Modify pools accordingly
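As a rough check on why pools of this size correspond to roughly 3 Mbp segments, the span of a pool can be sketched from the clone count and overlap. The slides do not give the BAC insert size, so `ASSUMED_INSERT_KB` below is a hypothetical placeholder, and the model assumes equally sized clones with uniform overlap.

```python
# Hypothetical average BAC insert size in kb. NOT stated in the slides;
# replace with the real value for the libraries in use.
ASSUMED_INSERT_KB = 140


def pool_span_kb(n_clones, insert_kb, overlap_kb):
    """Approximate span (kb) of n contiguous clones with uniform overlap."""
    if n_clones < 1:
        raise ValueError("a pool needs at least one clone")
    # Each clone after the first adds its length minus the shared overlap.
    return n_clones * insert_kb - (n_clones - 1) * overlap_kb


# Smallest pool with the largest overlap, and largest pool with the smallest overlap.
low = pool_span_kb(25, ASSUMED_INSERT_KB, 30)
high = pool_span_kb(30, ASSUMED_INSERT_KB, 20)
print(f"pool span: {low}-{high} kb")
```

The result scales directly with the assumed insert size, so it is only indicative.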

Page 9: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Pool Development

• Estimate contig size using Consensus Band (CB) algorithm

• Example: Cacao chloroplast (cp) genome is 160,604 bp
  – Hybridization revealed a cp-containing contig, estimated to be ~160 kb based on the CB algorithm

• Purified pool DNA can be produced at CUGI
  – Treat with ATP-dependent DNase

Page 10: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Sequencing

• 3 Levels of Sequence:
  – Paired BAC-end Sequence – 20 kb increments
  – End sequencing of pool members
  – 454 sequencing of BAC pools
    • Paired 3.5X-5.1X coverage (Roche 454/FLX)
    • Non-paired 17X-26X coverage (Titanium)

Page 11: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

454 Runs—Whole Genome

• 454 Titanium non-paired
  – 26X coverage/pool
  – 4 pools per slide (up to 150 pools total)
    • Up to 38 slide runs

• 454 FLX paired-end (3 kb)
  – 5X coverage/pool
  – 16 pools per slide (up to 150 pools total)
    • Up to 10 slide runs total
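The slide-run counts follow from dividing the maximum pool count by the pools per slide and rounding up, since a partly filled slide still costs a run. A minimal sketch of that arithmetic (the function name is illustrative):

```python
import math


def slide_runs(total_pools, pools_per_slide):
    """Slide runs needed to sequence every pool, rounding up partial slides."""
    return math.ceil(total_pools / pools_per_slide)


titanium_runs = slide_runs(150, 4)   # non-paired Titanium, 4 pools per slide
flx_runs = slide_runs(150, 16)       # paired-end FLX, 16 pools per slide
print(titanium_runs, flx_runs)       # prints "38 10"
```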

Page 12: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Assembly/Curation of 3Mbp Segment

• Preprocessing
  – Filter reads to remove:
    • Paired-end reads that did not contain both ends
    • BAC vector
    • E. coli (host DNA)

• Newbler Assembler (Roche)

• Celera Assembler (CABOG)
  – Improvements in homopolymer calls and heterogeneous read length issues
  – Recently shown to give N50 contig sizes double those of Newbler
    • Human (50% repetitive) and microbes
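Since the assemblers are compared by N50 contig size, a small helper for computing it from a list of contig lengths may be useful when comparing Newbler and CABOG output. This is a generic sketch of the standard N50 definition, not code from either assembler.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Toy example: total is 300 bases, and the two largest contigs reach 180.
print(n50([100, 80, 60, 40, 20]))  # prints 80
```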

Page 13: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Assembly Curation of 3Mbp Segment

• Assembly at various depths (5X, 10X, 15X)
  – Determine optimal sequencing coverage

• Utilize available data to scaffold contigs:
  – BAC end sequences every 20 kb
  – Genetic marker sequences
  – RNA-seq clusters
  – Arabidopsis – Cacao synteny
  – Draft Sequence (2X)

• Augment approach by covering regions missed by clones – assist in selecting MTP
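To assemble at 5X, 10X, and 15X, reads can be subsampled to a target count derived from depth = total read bases / region length. The sketch below assumes a mean read length of 400 bp, which is an illustrative value and not a figure from the slides.

```python
ASSUMED_MEAN_READ_BP = 400  # hypothetical mean read length, not from the slides


def reads_for_depth(region_bp, depth, mean_read_bp=ASSUMED_MEAN_READ_BP):
    """Reads needed to cover region_bp at the given depth (rounded up)."""
    total_bases = region_bp * depth
    # Integer ceiling division avoids floating-point rounding.
    return (total_bases + mean_read_bp - 1) // mean_read_bp


for depth in (5, 10, 15):
    print(depth, reads_for_depth(3_000_000, depth))
```

The read counts scale inversely with the assumed read length, so the real run statistics should replace the placeholder.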

Page 14: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Assembly Curation of 3Mbp Segment

• Deliverable will be a pseudomolecule sequence for the 3 Mbp region
  – Gaps will be strings of N
    • Assess and employ lab-based gap filling strategies

• Make every attempt to close gaps
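A minimal sketch of how ordered, oriented contigs could be joined into a pseudomolecule with gaps written as runs of N. The function name and the default gap length of 100 are assumptions for illustration; the slides do not specify how gap sizes are chosen.

```python
def build_pseudomolecule(contigs, gap_sizes, default_gap=100):
    """Join ordered contigs, writing each gap as a run of N.

    gap_sizes[i] is the estimated gap between contigs[i] and contigs[i + 1];
    None means the size is unknown and default_gap (an assumed placeholder
    length) is used instead.
    """
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for contig, gap in zip(contigs[1:], gap_sizes):
        n_count = default_gap if gap is None else gap
        parts.append("N" * n_count)
        parts.append(contig)
    return "".join(parts)


print(build_pseudomolecule(["ACGT", "GGCC"], [3]))  # prints ACGTNNNGGCC
```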

Page 15: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Assembly Validation and Correction

• In silico virtual digest of scaffold sequence, compared to physical map restriction fragments
  – Draft sequence integration (DSI) via FPC
    • Integrate and visualize physical map, 3 Mbp segments, and draft sequence
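The virtual digest can be sketched as cutting the scaffold at every occurrence of a recognition site and listing the fragment lengths, which can then be compared with the physical map fragments. The slides do not name the fingerprinting enzyme; HindIII (A^AGCTT) is used here purely as an assumed example, and the scaffold is treated as a linear molecule.

```python
def virtual_digest(sequence, site="AAGCTT", cut_offset=1):
    """Fragment lengths from cutting a linear sequence at every `site`.

    The defaults model HindIII (A^AGCTT), an assumed example enzyme.
    cut_offset is the number of bases from the start of the site to the cut.
    The site is palindromic, so searching one strand is sufficient.
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


toy_scaffold = "GGGG" + "AAGCTT" + "CCCCC" + "AAGCTT" + "TT"
print(virtual_digest(toy_scaffold))  # prints [5, 11, 7]
```

Real fingerprint comparisons need a size tolerance, because gel-based fragment sizes are approximate.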

Page 16: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Sequence/Assembly Pipeline

Page 17: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

IBM in silico Sequences

• IBM will provide a set of sequences that mimic the pilot cacao sequences
  – Input error
    • Indels, homopolymer calls, nucleotide substitutions

• Simulated data to test pipeline:
  – Physical map
  – Simulated BAC end sequences
  – Simulated pseudo-reads from pooled BACs
  – EST clusters
  – Indicate reference species for syntenic comparisons
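For illustration only, the kind of input error described above could be injected into a read as follows. This is not IBM's simulator; the function name and error rates are hypothetical, and homopolymer-specific miscalls are not modeled separately (they would appear only as ordinary indels).

```python
import random


def add_errors(seq, sub_rate=0.005, indel_rate=0.005, rng=None):
    """Return a copy of seq with random substitutions and single-base indels.

    Rates are per-base probabilities and are illustrative placeholders.
    Half of indel_rate is spent on deletions and half on insertions.
    """
    rng = rng or random.Random()
    out = []
    for base in seq:
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace with a different nucleotide.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate / 2:
            continue  # Deletion: drop this base.
        elif r < sub_rate + indel_rate:
            # Insertion: keep the base and add a random one after it.
            out.append(base)
            out.append(rng.choice("ACGT"))
        else:
            out.append(base)
    return "".join(out)


noisy = add_errors("ACGT" * 25, rng=random.Random(42))
print(len(noisy))
```

Passing a seeded `random.Random` makes a simulated data set reproducible across pipeline test runs.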

Page 18: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Pilot Project Budget

• BAC-end sequencing (30K BACs), 20 kb increments
  – $206,605.00

• Assembly/curation/validation of cacao 3 Mbp
  – $16,720.00

• Assembly of IBM in silico derived sequences
  – $15,400.00

Page 19: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

ESTIMATED Budget – Whole Genome Assembly

• Assembly, curation, validation of 130-150 3 Mbp segments
  – $147,620.00

• Automated structural/functional annotation
  – $8,800.00

Page 20: CUGI Pilot Sequencing/Assembly Projects Christopher Saski

Acknowledgements

• USDA-ARS
• Mars Inc.
• Dr. Alex Feltus
• Stephen Ficklin
• Dr. Keith Murphy
• Dr. Margaret Staton