Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman
DESCRIPTION
Flowering plant genomes are among the largest and most complex, a consequence of highly proliferative repetitive elements and frequent genome duplications. The sequencing revolution has now delivered over 30 plant genomes, ranging from the 82 Mbp genome of floating bladderwort to the nearly 5 Gbp genome of diploid wheat. While a high quality reference genome is now a pivotal research tool in all crop improvement efforts, many projects emphasise delivery timeframes at the expense of genome quality. Our species of interest is sugarcane (Saccharum hybrid), which possesses a highly aneupolyploid genome 10 Gbp in size. In line with international efforts, our group has contributed a range of approaches to elucidate the sugarcane genome sequence. The first of these has been an international BAC-by-BAC sequencing effort to determine a "monoploid" genome sequence for the genotype R570, in which we have assembled Illumina paired-read data for 465 BACs into one or a few contigs each. Secondly, we have applied second-generation whole-genome shotgun sequencing up to 45x to de novo assemble the genome of R570. Our preliminary assembly represents over two thirds of the expected genome size with a contig NG50 of 1200 bp. Finally, we are now progressing a third-generation sequencing approach to supplement the results of the short-read approach and progress towards a final hybrid assembly. Without a robust assembly approach, structural and functional annotation cannot inform meaningful biological interpretation. As our work approaches completion, it is becoming clear that a hybrid approach combining all of these outputs will ultimately be required for a high quality reference genome for sugarcane. There is no single technology or approach that solves this problem. With an "out-of-the-box" approach nowhere in sight, assembling high quality genome sequences will likely remain an important problem for some time yet.
TRANSCRIPT
Establishing a hybrid approach to the polyploid sugarcane genome assembly
Paul Berkman
CSIRO PLANT INDUSTRY
R570 genome progress – CSIRO overview
Research context
BAC sequencing contribution
– Assembly strategy
– Progress and results
Whole Genome Shotgun approach
– Rationale
– Approach and progress to date
Future directions: there’s no silver bullet
– Options for WGS assembly improvement
– Hybrid approach for allele detection
CSIRO Objectives
Generation of sugarcane genomic sequence
– This will be done in conjunction with the International Sugarcane Genome Consortium using next generation sequencing technology
Improvement of the genetic map of sugarcane (CSIRO)
– For use as a tool in assembling genome sequence
Development of a web-based platform to integrate the genomic sequence and genetic information (CSIRO)
– This tool will enable researchers to identify regions of the genome that are associated with a particular trait.
International objectives
R570 was designated as the reference at the Consortium meeting in August 2009
– France, Australia, Brazil, USA, South Africa
WGS sequencing thought to be unlikely approach
R570 BAC library already in existence
A comprehensive genome map is already available
Some already commenced sequencing R570 BACs
January 2013: Re-invigoration of the effort
Keygene’s whole genome profiling technology applied
Driving towards a monoploid genome sequence
Monoploid genome sequence of sugarcane
4/3/2014
Minimum tiling path of 5,029 BAC clones estimated to cover the monoploid genome sequence of sugarcane
To date around 2000 BAC clones have been sequenced
The rest will be sequenced in the following year.
Complex Genome
[Figure: genome size comparison. Sugarcane cultivar 10000 Mbp (labels: S. officinarum, S. spontaneum, recombinants); Arabidopsis 300 Mbp; rice 800 Mbp; sorghum 1600 Mbp; maize 5500 Mbp; Human 6000 Mbp; Wheat 34000 Mbp]
Our BAC strategy: High-throughput NGS sequencing
Towards the sugarcane genome | Karen Aitken
Illumina HiSeq2000
• > 600 Gbp/run
• > 35 Gbp/lane
• 96 barcodes/lane
• Better results than 454 raw assemblies
< $500 USD/BAC for extraction, sequencing, and assembly
• This could be reduced to < $200 USD using pooling
Developed in conjunction with Prof. Dave Edwards
Mr Paul Visendi (PhD Student)
Under development since July 2012
Raw data cleaning
SOAP2 alignment to E. coli genome
– Remove aligned reads from dataset
Vector sequence (pBeloBAC11) removal
– BLAST alignment to BAC sequence
– Extract aligned reads
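The E. coli filtering step above can be illustrated with a minimal sketch. It assumes the aligner output has already been reduced to a set of read names that hit the E. coli genome; the function names and the in-memory FASTQ handling are hypothetical and are not taken from the actual pipeline.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# One FASTQ record: header, sequence, separator, quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records (assumes no blank or wrapped lines)."""
    it = iter(lines)
    for header, seq, sep, qual in zip(it, it, it, it):
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               sep.rstrip("\n"), qual.rstrip("\n"))


def read_id(header: str) -> str:
    """Read name without '@', trailing comment, or a /1 or /2 mate suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def drop_contaminants(records: Iterable[FastqRecord],
                      contaminant_ids: Set[str]) -> List[FastqRecord]:
    """Keep only reads whose name is NOT in the contaminant set.

    contaminant_ids is assumed to hold the names of reads that aligned
    to the E. coli genome (hypothetical input, produced elsewhere).
    """
    return [r for r in records if read_id(r[0]) not in contaminant_ids]
```

Stripping the mate suffix before the lookup means that a pair is dropped as a whole when either mate aligned to the contaminant, provided both mates are passed through the same filter.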
Our BAC strategy: Custom assembly pipeline
Short-read dataset → SOAP2 against E. coli genome → Filtered short-read dataset → Vector sequence removal → Vector & E. coli filtered short-read dataset
Assembly
Optimal coverage ~500-fold
SaSSY BAC-assembly algorithm
– Developed at University of Queensland
– Imelfort M.R., PhD Thesis, UQ 2012
– Graph-based OLC assembler
– More robust than Eulerian approach
– Uses paired-read statistics during contig building
Scaffolding
SSPACE public tool, uses Bowtie
Collates alignment results and connects contigs
Our BAC strategy: Custom assembly pipeline
Vector & E. coli filtered short-read dataset → SaSSY Assembler → SSPACE scaffolding → Assembled BAC Scaffolds (FASTA)
Validation using BAC-end sequences (BES)
BLAST alignment of BES
Confirm alignment at ends of scaffold
If 2 scaffolds, combine based on BES position
Validation using raw data
Apply public tool HAGFISH
– NZPFR, Ross Crowhurst et al.
Uses distance between paired reads to:
– Identify raw data gaps in assembly
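The idea of using paired-read distances to find gaps can be sketched as a physical-coverage scan: every aligned pair covers the stretch from the start of one read to the end of its mate, and positions spanned by no pair are candidate gaps. This is an illustrative sketch only, not HAGFISH's implementation; the function names and the (start, end) input format are assumptions.

```python
from typing import List, Tuple


def physical_coverage(length: int, pairs: List[Tuple[int, int]]) -> List[int]:
    """Per-base count of read pairs spanning each position of a scaffold.

    pairs holds 0-based half-open (start, end) spans, from the leftmost
    base of one read to the rightmost base of its mate (assumed input).
    """
    diff = [0] * (length + 1)
    for start, end in pairs:
        start, end = max(0, start), min(length, end)
        if start < end:
            diff[start] += 1   # pair begins covering here
            diff[end] -= 1     # pair stops covering here
    coverage, running = [], 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


def uncovered_regions(coverage: List[int],
                      min_cov: int = 1) -> List[Tuple[int, int]]:
    """Return half-open intervals where coverage falls below min_cov."""
    regions, start = [], None
    for pos, cov in enumerate(coverage):
        if cov < min_cov:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions
```

For a 10 bp toy scaffold with pair spans (0, 4), (2, 6) and (8, 10), `uncovered_regions(physical_coverage(10, ...))` reports the interval (6, 8), the only stretch that no pair bridges.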
Our BAC strategy: Custom assembly pipeline
Assembled BAC Contigs (FASTA)
BES alignment
Raw data insert size analysis
Validation results (CSV statistics & PNG image files)
BAC results to date: Progress of Analysis
• ~200 Gbp total sequence data generated for 453 BACs
– Illumina HiSeq2000, 100bp paired-end reads
– Data from BGI & AGRF
• Validation component of pipeline under final development
– Final assemblies of all 453 BACs to be completed April 2014
• Roughly half randomly extracted
• Other half targeted at QTL of interest
BAC results to date: Assembly of first 166 BACs
11 BACs did not contain an insert
Total number of reads 1,583,129,457
Total volume sequence data (Mbp) 188,588.54
Bacterial sequence contamination 11.87%
Total number assembled contigs 976
Total assembly Size (bp) 16,574,234
Total assembly N50 (bp) 33,885
Overall longest contig (bp) 128,410
Number contigs > 10 kbp 459
Number contigs > 100 kbp 10
BAC results to date: Summary
Assemblies comparable to 454-based approaches
– ~500-fold coverage appears optimal for Illumina assemblies
– Combination of in-house and public tools required for best assembly
More BACs can be sequenced at lower cost
– BAC-pooling strategy in final development with UQ for future work
SUGESI project is now driving towards a full monoploid assembly
– No expressed intention to include genome-wide allelic information
R570 Whole Genome Shotgun: Rationale
Why try?
– Lots of data available quite cheaply now
– Current BAC approach unlikely to provide much allelic information
– Realistically, allelic information is critical
Success of WGS approach in other species
– Diploid wheat A and D genomes (2013)
Robust algorithms now handling complex datasets
– Assemblathons
– GAGE
Why not try?
R570 Whole Genome Shotgun: Sequencing Approaches
Illumina HiSeq2000 (Illumina)
– 90-100 bp read length
– Paired reads: fragment ends used to generate sequence
– Fragment size (insert length) ranges from 180 bp to 32,000 bp
– Very low error rate
Pacific Biosciences (PacBio)
– Long reads, 1,000-20,000 bp
– Single unpaired reads
– Moderate error rate
R570 Whole Genome Shotgun: Sequence Data
LIBRARY             CURRENT SEQUENCE   CURRENT PHYSICAL   PROJECTED SEQUENCE
                    COVERAGE (x)       COVERAGE (x)       COVERAGE (x)
Illumina 180bp      12.14              10.93              12.14
Illumina 300bp      11.06              16.59              43.39
Illumina 600bp      8.70               26.10              8.70
Illumina 2,000bp    2.83               31.44              2.83
Illumina 2,500bp    1.32               16.50              18.65
Illumina 4,500bp    1.31               29.48              18.64
Illumina 5,000bp    2.84               78.89              2.84
Illumina 7,500bp    1.21               45.38              1.21
Illumina 32,000bp   0.37               58.61              0.37
Pacbio 20,000bp     0.00               0.00               2.7
TOTAL               41.78              313.92             111.47
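The relationship between the sequence and physical coverage columns can be sketched as follows. Sequence coverage counts only the sequenced bases of each pair, while physical coverage counts the whole fragment between the pair ends. The formulas are the standard definitions; the function names are hypothetical, and the read length per library is an assumption taken from the 90-100 bp range quoted on the sequencing approaches slide.

```python
def sequence_coverage(n_pairs: int, read_len_bp: int, genome_bp: int) -> float:
    """Fold coverage from sequenced bases: two reads per pair."""
    return 2 * n_pairs * read_len_bp / genome_bp


def physical_coverage(n_pairs: int, insert_bp: int, genome_bp: int) -> float:
    """Fold coverage from whole fragments spanned by each pair."""
    return n_pairs * insert_bp / genome_bp


def physical_from_sequence(seq_cov: float, insert_bp: int,
                           read_len_bp: int) -> float:
    """Convert sequence coverage to physical coverage for one library.

    Follows from dividing the two definitions above: the ratio is
    insert_bp / (2 * read_len_bp). read_len_bp is an assumed value.
    """
    return seq_cov * insert_bp / (2 * read_len_bp)


# Assuming 100 bp reads for the 300 bp library and 90 bp reads for the
# 2,000 bp library (assumptions, not stated per library in the slides).
print(round(physical_from_sequence(11.06, 300, 100), 2))
print(round(physical_from_sequence(2.83, 2000, 90), 2))
```

Under those assumed read lengths the two printed values agree with the 16.59x and 31.44x entries in the table, which is why the long-insert libraries contribute most of the physical coverage despite their low sequence coverage.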
R570 Whole Genome Shotgun: Progress of Analysis
Assembly commenced April 2013
4 High-Performance Computing facilities have been tested
– Ixthus (192Gb), Barrine (1Gb), Cherax (4Tb), and Zythos (6Tb)
3 Assembly algorithms have been compared
– Velvet (public), AllPaths-LG (public), Biokanga (CSIRO-developed)
~20 test assemblies of 4x coverage dataset complete
– Cumulative >600 days of compute time & max. 150Gb RAM used
~20 attempts at assembling 42x coverage dataset
– Cumulative >6000 days of compute time & max. 1.5Tb RAM used
– Preliminary assembly completed March 2014
Recent development of Biokanga has optimised it to succeed
R570 Whole Genome Shotgun: Progress of Analysis
Preliminary assembly results
– Biokanga successfully assembled 42x data on Ixthus (150Gb RAM, 10 days)
Options to exploit this assembly:
– Assembly of sugarcane mitochondrial genome
– Extracting genomic context information for genes of interest
Total number assembled scaffolds 5,344,024
Total assembly Size (bp) 6,761,247,399
Total assembly N50 (bp) 2,160
Overall longest contig (bp) 67,936
Number contigs > 1 kbp 2,223,020
Number contigs > 10 kbp 7,326
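The N50 figures reported in these slides and the NG50 quoted in the description differ only in the total they are measured against: N50 uses the assembly size, NG50 the expected genome size. A minimal sketch of both, with hypothetical function names and toy contig lengths rather than the real assembly:

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], total: Optional[int] = None) -> int:
    """Length L such that contigs of length >= L hold half of `total`.

    With total=None the assembly size is used (N50). Returns 0 when the
    contigs together do not reach half of `total`.
    """
    lengths = sorted(lengths, reverse=True)
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0


def ng50(lengths: Iterable[int], genome_bp: int) -> int:
    """N50 measured against the expected genome size instead."""
    return n50(lengths, total=genome_bp)


# Toy contig lengths (illustrative only).
contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contigs))        # half of 32 is reached within the two 8s
print(ng50(contigs, 40))   # half of 40 needs the 4 as well
```

Because an incomplete assembly is smaller than the genome, NG50 can never exceed N50 for the same contigs, which is consistent with the description quoting a lower NG50 than the N50 listed here.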
Future directions: How do we get a sugarcane genome?
• Assembling sugarcane is challenging
– Perhaps a little more than I had anticipated
• THERE’S NO SILVER BULLET!
– BAC approach won’t provide information on alleles, but is tried and tested
– WGS can provide allele-scale resolution, but is complex and challenging
• Hybrid approach is most likely the best approach
– Using SUGESI monoploid assembly as a template
– Adopting WGS approach to overlay allelic data
– Identify gaps in monoploid assembly using WGS assembly
Acknowledgements
Team members: Karen Aitken, Paul Berkman, Jai Perroux, Jingchuan Li, Rosanne Casu, Anne Rae
CSIRO and BSES Ltd sugarcane marker and breeding groups
Hélène Bergès and her team at INRA, Toulouse, France
DArT Pty Ltd Team
Australian Genome Research Facility (AGRF)
UQ colleagues: Paul Visendi, Dave Edwards
Thank you CSIRO Plant Industry Paul Berkman OCE Postdoctoral Fellow
t +61 7 3214 2361 e [email protected] w www.csiro.au/pi