Establishing a hybrid approach to the polyploid sugarcane genome assembly Paul Berkman CSIRO PLANT INDUSTRY

Upload: australian-bioinformatics-network

Post on 10-May-2015


DESCRIPTION

Flowering plant genomes are amongst the largest and most complex known, owing to highly proliferative repetitive elements and frequent genome duplications. The sequencing revolution has now delivered over 30 plant genomes, ranging from the 82 Mbp genome of floating bladderwort to the nearly 5 Gbp genome of diploid wheat. While a high-quality reference genome is now a pivotal research tool in all crop improvement efforts, many projects emphasise delivery timeframes at the expense of genome quality. Our species of interest is sugarcane (Saccharum hybrid), which possesses a highly aneupolyploid genome 10 Gbp in size. In line with international efforts, our group has contributed a range of approaches to elucidate the sugarcane genome sequence. The first of these has been an international BAC-by-BAC sequencing effort to determine a "monoploid" genome sequence for the genotype R570, in which we have assembled Illumina paired-read data for 465 BACs into one or a few contigs each. Secondly, we have applied second-generation whole-genome shotgun sequencing at up to 45x coverage to assemble the genome of R570 de novo. Our preliminary assembly represents over two thirds of the expected genome size, with a contig NG50 of 1200 bp. Finally, we are now progressing a third-generation sequencing approach to supplement the results of the short-read approach and progress towards a final hybrid assembly. Without a robust approach, structural and functional annotation cannot inform meaningful biological interpretation. As our work approaches completion, it is becoming clear that a hybrid approach combining all of these outputs will ultimately be required to produce a high-quality reference genome for sugarcane. No single technology or approach solves this problem. With an "out-of-the-box" approach nowhere in sight, assembling high-quality genome sequences will likely remain an important problem for some time yet.

TRANSCRIPT

Page 1: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Establishing a hybrid approach to the polyploid sugarcane genome assembly

Paul Berkman

CSIRO PLANT INDUSTRY

Page 2: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

R570 genome progress – CSIRO overview

• Research context
• BAC sequencing contribution
– Assembly strategy
– Progress and results
• Whole Genome Shotgun approach
– Rationale
– Approach and progress to date
• Future directions: there’s no silver bullet
– Options for WGS assembly improvement
– Hybrid approach for allele detection

Page 3: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

CSIRO Objectives

• Generation of sugarcane genomic sequence
– This will be done in conjunction with the International Sugarcane Genome Consortium using next generation sequencing technology
• Improvement of the genetic map of sugarcane (CSIRO)
– For use as a tool in assembling genome sequence
• Development of a web-based platform to integrate the genomic sequence and genetic information (CSIRO)
– This tool will enable researchers to identify regions of the genome that are associated with a particular trait.

Page 4: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

International objectives

• R570 was designated as the reference at the Consortium meeting in August 2009
– France, Australia, Brazil, USA, South Africa
– WGS sequencing thought to be an unlikely approach
– R570 BAC library already in existence
– A comprehensive genome map is already available
– Some had already commenced sequencing R570 BACs
• January 2013: re-invigoration of the effort
– Keygene’s whole genome profiling technology applied
– Driving towards a monoploid genome sequence

Page 5: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Monoploid genome sequence of sugarcane

4/3/2014

• Minimum tiling path of 5,029 BAC clones estimated to cover the monoploid genome sequence of sugarcane
• To date around 2000 BAC clones have been sequenced
• The rest will be sequenced in the following year.

Page 6: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Complex Genome

[Figure: genome size comparison. Labels on the sugarcane chromosomes: S. spontaneum, S. officinarum, recombinants]

• Sugarcane cultivar: 10000 Mbp
• Maize: 5500 Mbp
• Rice: 800 Mbp
• Sorghum: 1600 Mbp
• Wheat: 34000 Mbp
• Arabidopsis: 300 Mbp

Page 7: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Complex Genome

[Figure: genome size comparison. Labels on the sugarcane chromosomes: S. spontaneum, S. officinarum, recombinants]

• Sugarcane cultivar: 10000 Mbp
• Wheat: 34000 Mbp
• Human: 6000 Mbp

Page 8: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Our BAC strategy: High-throughput NGS sequencing

Illumina HiSeq2000
• > 600 Gbp/run
• > 35 Gbp/lane
• 96 barcodes/lane
• Better results than 454 raw assemblies

< $500 USD/BAC for extraction, sequencing, and assembly
• This could be reduced to < $200 USD using pooling

Page 9: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Our BAC strategy: Custom assembly pipeline

Developed in conjunction with
• Prof. Dave Edwards
• Mr Paul Visendi (PhD Student)
Under development since July 2012

Raw data cleaning
• SOAP2 alignment to E. coli genome
– Remove aligned reads from dataset
• Vector sequence (pBeloBAC11) removal
– BLAST alignment to BAC sequence
– Extract aligned reads

Pipeline flow: Short-read dataset → SOAP2 against E. coli genome → Filtered short-read dataset → Vector sequence removal → Vector & E. coli filtered short-read dataset
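The two cleaning steps above can be sketched as a pair of ID-based filters. This is a minimal illustration only, not the pipeline's actual code: it assumes the SOAP2 and BLAST steps have each been reduced to a set of read IDs that aligned to a contaminant, that vector-aligned reads are the ones to discard, and that reads are simple (ID, sequence) tuples. The names `drop_reads` and `clean_reads` are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# Simplified stand-in for a FASTQ record: (read_id, sequence). Assumption for illustration.
Read = Tuple[str, str]


def drop_reads(reads: Iterable[Read], contaminant_ids: Set[str]) -> Iterator[Read]:
    """Yield only the reads whose ID was not reported as aligned to a contaminant."""
    for read_id, seq in reads:
        if read_id not in contaminant_ids:
            yield read_id, seq


def clean_reads(reads: Iterable[Read], ecoli_hits: Set[str], vector_hits: Set[str]) -> List[Read]:
    """Apply the filters in slide order: E. coli host reads first, then vector reads."""
    without_ecoli = drop_reads(reads, ecoli_hits)
    return list(drop_reads(without_ecoli, vector_hits))


if __name__ == "__main__":
    reads = [("r1", "ACGT"), ("r2", "GGCC"), ("r3", "TTAA")]
    # r1 hit E. coli and r3 hit the vector, so only r2 is kept.
    print(clean_reads(reads, ecoli_hits={"r1"}, vector_hits={"r3"}))
```

Chaining generators keeps the filter streaming, which matters when a single BAC carries millions of reads.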

Page 10: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Our BAC strategy: Custom assembly pipeline

Assembly
• Optimal coverage ~500-fold
• SaSSY BAC-assembly algorithm
– Developed at University of Queensland
– Imelfort M.R., PhD Thesis, UQ 2012
– Graph-based OLC assembler
– More robust than Eulerian approach
– Uses paired-read statistics during contig building

Scaffolding
• SSPACE public tool, uses Bowtie
• Collates alignment results and connects contigs

Pipeline flow: Vector & E. coli filtered short-read dataset → SaSSY Assembler → SSPACE scaffolding → Assembled BAC Scaffolds (FASTA)
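The ~500-fold target can be related to read counts with the usual coverage arithmetic (total sequenced bases divided by target length). The sketch below is illustrative only: the 100 kbp BAC insert size in the usage example is an assumed figure that does not appear on the slides, and both function names are hypothetical.

```python
def fold_coverage(n_reads: int, read_len_bp: int, target_len_bp: int) -> float:
    """Sequence coverage = total sequenced bases / length of the target."""
    return n_reads * read_len_bp / target_len_bp


def subsample_fraction(current_cov: float, target_cov: float = 500.0) -> float:
    """Fraction of reads to keep so that coverage drops to the target (never above 1)."""
    return min(1.0, target_cov / current_cov)


if __name__ == "__main__":
    # Assumed example: 2,000,000 reads of 100 bp over a 100 kbp BAC insert.
    cov = fold_coverage(2_000_000, 100, 100_000)
    print(cov, subsample_fraction(cov))
```

When a barcoded BAC is sequenced far deeper than needed, randomly keeping this fraction of read pairs is one simple way to approach the target depth.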

Page 11: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Our BAC strategy: Custom assembly pipeline

Validation using BAC-end sequences (BES)
• BLAST alignment of BES
• Confirm alignment at ends of scaffold
• If 2 scaffolds, combine based on BES position

Validation using raw data
• Apply public tool HAGFISH
– NZPFR, Ross Crowhurst et al.
• Uses distance between paired reads to:
– Identify raw data gaps in assembly

Pipeline flow: Assembled BAC Contigs (FASTA) → BES alignment → Raw data insert size analysis → Validation results (CSV statistics & PNG image files)
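The idea of using paired-read distances to find gaps can be sketched as follows. This is not HAGFISH's algorithm or interface; it is a simplified illustration that assumes each read pair has already been mapped and reduced to a (start, end) span on the contig, keeps only pairs whose span is close to the expected insert size, and then reports regions that no such pair bridges. All names and the tolerance parameter are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 0-based, half-open [start, end) span of a mapped read pair


def concordant_spans(spans: List[Span], expected_insert: int, tolerance: int) -> List[Span]:
    """Keep pairs whose mapped distance is within tolerance of the expected insert size."""
    return [(s, e) for s, e in spans if abs((e - s) - expected_insert) <= tolerance]


def uncovered_intervals(contig_len: int, spans: List[Span]) -> List[Span]:
    """Return the contig intervals that are not bridged by any of the given pairs."""
    gaps = []
    cursor = 0  # right-most position covered so far
    for start, end in sorted(spans):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < contig_len:
        gaps.append((cursor, contig_len))
    return gaps


if __name__ == "__main__":
    pairs = [(0, 300), (200, 500), (700, 1000), (100, 950)]  # last pair is far too long
    good = concordant_spans(pairs, expected_insert=300, tolerance=50)
    print(uncovered_intervals(1000, good))
```

Regions returned by `uncovered_intervals` are candidates for misassembly or missing raw data and would still need manual inspection.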

Page 12: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

BAC results to date: Progress of Analysis

• ~200 Gbp total sequence data generated for 453 BACs
– Illumina HiSeq2000, 100 bp paired-end reads
– Data from BGI & AGRF
• Validation component of pipeline under final development
– Final assemblies of all 453 BACs to be completed April 2014
– Roughly half randomly extracted
– Other half targeted at QTL of interest

Page 13: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

BAC results to date: Assembly of first 166 BACs

11 BACs did not contain an insert

Total number of reads                  1,583,129,457
Total volume sequence data (Mbp)       188,588.54
Bacterial sequence contamination       11.87%
Total number assembled contigs         976
Total assembly size (bp)               16,574,234
Total assembly N50 (bp)                33,885
Overall longest contig (bp)            128,410
Number contigs > 10 kbp                459
Number contigs > 100 kbp               10
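For readers unfamiliar with the N50 statistic reported here, and the NG50 mentioned in the description, the sketch below shows the standard definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size (N50) or half of the expected genome size (NG50). The function name and the toy contig lengths are illustrative, not taken from the talk.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], reference_size: Optional[int] = None) -> int:
    """N50 of a set of contig lengths; pass reference_size to get NG50 instead.

    Returns 0 if the contigs together cover less than half of the reference size.
    """
    lengths = sorted(lengths, reverse=True)
    total = reference_size if reference_size is not None else sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # reached half of the total without float division
            return length
    return 0


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths, sum = 300
    print(n50(contigs))                      # N50 against the assembly itself
    print(n50(contigs, reference_size=400))  # NG50 against an assumed genome size
```

Because NG50 is measured against the expected genome size rather than the assembly size, it is lower than N50 whenever the assembly is incomplete, which makes it the fairer number for comparing assemblies of the same genome.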

Page 14: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

BAC results to date: Summary

• Assemblies comparable to 454-based approaches
– ~500-fold coverage appears optimal for Illumina assemblies
– Combination of in-house and public tools required for best assembly
• More BACs can be sequenced at lower cost
– BAC-pooling strategy in final development with UQ for future work
• SUGESI project is now driving towards a full monoploid assembly
– No expressed intention to include genome-wide allelic information

Page 15: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

R570 Whole Genome Shotgun: Rationale

• Why try?
– Lots of data available quite cheaply now
– Current BAC approach unlikely to provide much allelic information
– Realistically, allelic information is critical
• Success of WGS approach in other species
– Diploid wheat A and D genomes (2013)
• Robust algorithms now handling complex datasets
– Assemblathons
– GAGE
• Why not try?

Page 16: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

R570 Whole Genome Shotgun: Sequencing Approaches

Illumina HiSeq2000 (Illumina)
• 90-100 bp read length
• Paired reads
– Fragment ends used to generate sequence
– Fragment size (insert length) ranges from 180 bp to 32,000 bp
• Very low error rate

Pacific Biosciences (PacBio)
• Long reads, 1,000-20,000 bp long
• Single unpaired reads
• Moderate error rate

Page 17: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

R570 Whole Genome Shotgun: Sequence Data

LIBRARY              CURRENT SEQUENCE   CURRENT PHYSICAL   PROJECTED SEQUENCE
                     COVERAGE (x)       COVERAGE (x)       COVERAGE (x)
Illumina 180bp       12.14              10.93              12.14
Illumina 300bp       11.06              16.59              43.39
Illumina 600bp       8.70               26.10              8.70
Illumina 2,000bp     2.83               31.44              2.83
Illumina 2,500bp     1.32               16.50              18.65
Illumina 4,500bp     1.31               29.48              18.64
Illumina 5,000bp     2.84               78.89              2.84
Illumina 7,500bp     1.21               45.38              1.21
Illumina 32,000bp    0.37               58.61              0.37
PacBio 20,000bp      0.00               0.00               2.70
TOTAL                41.78              313.92             111.47
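The slide does not state how physical coverage was derived. A common definition counts the whole fragment spanned by a read pair rather than just the sequenced bases, which gives physical coverage = sequence coverage × insert size / (2 × read length). As an inference, not a statement from the talk, this relationship appears to reproduce the table rows when a read length of 100 bp or 90 bp is assumed, depending on the library. The function name below is hypothetical.

```python
def physical_coverage(sequence_cov: float, insert_bp: int, read_len_bp: int) -> float:
    """Coverage by whole fragments: each pair sequences 2 * read_len of an insert_bp fragment."""
    return sequence_cov * insert_bp / (2 * read_len_bp)


if __name__ == "__main__":
    # 180 bp library, assuming 100 bp reads (assumption, not stated on the slide).
    print(round(physical_coverage(12.14, 180, 100), 2))
    # 5,000 bp library, assuming 90 bp reads (assumption, not stated on the slide).
    print(round(physical_coverage(2.84, 5000, 90), 2))
```

This is why the long-insert libraries contribute little sequence coverage but dominate physical coverage, which is what matters for scaffolding.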

Page 18: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

R570 Whole Genome Shotgun: Progress of Analysis

• Assembly commenced April 2013
• 4 High-Performance Computing facilities have been tested
– Ixthus (192 GB), Barrine (1 GB), Cherax (4 TB), and Zythos (6 TB)
• 3 Assembly algorithms have been compared
– Velvet (public), AllPaths-LG (public), Biokanga (CSIRO-developed)
• ~20 test assemblies of 4x coverage dataset complete
– Cumulative >600 days of compute time & max. 150 GB RAM used
• ~20 attempts at assembling 42x coverage dataset
– Cumulative >6000 days of compute time & max. 1.5 TB RAM used
– Preliminary assembly completed March 2014
• Recent development of Biokanga has optimised it to succeed

Page 19: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

R570 Whole Genome Shotgun: Progress of Analysis

Preliminary assembly results
• Biokanga successfully assembled 42x data on Ixthus (150 GB RAM, 10 days)

Options to exploit this assembly:
• Assembly of sugarcane mitochondrial genome
• Extracting genomic context information for genes of interest

Total number assembled scaffolds       5,344,024
Total assembly size (bp)               6,761,247,399
Total assembly N50 (bp)                2,160
Overall longest contig (bp)            67,936
Number contigs > 1 kbp                 2,223,020
Number contigs > 10 kbp                7,326
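Two derived figures help put these statistics in context: the fraction of the expected genome represented by the assembly, and the mean scaffold length. The sketch below uses the totals from this slide together with the 10,000 Mbp genome size quoted earlier in the talk; the function names are hypothetical and the calculation is illustrative only.

```python
ASSEMBLY_BP = 6_761_247_399          # total assembly size from this slide
SCAFFOLD_COUNT = 5_344_024           # total number of assembled scaffolds from this slide
EXPECTED_GENOME_BP = 10_000_000_000  # 10,000 Mbp sugarcane cultivar genome quoted in the talk


def fraction_assembled(assembly_bp: int, genome_bp: int) -> float:
    """Share of the expected genome size represented by the assembly."""
    return assembly_bp / genome_bp


def mean_length(assembly_bp: int, n_sequences: int) -> float:
    """Average sequence length in the assembly."""
    return assembly_bp / n_sequences


if __name__ == "__main__":
    print(f"{fraction_assembled(ASSEMBLY_BP, EXPECTED_GENOME_BP):.1%} of expected genome size")
    print(f"{mean_length(ASSEMBLY_BP, SCAFFOLD_COUNT):.0f} bp mean scaffold length")
```

The first figure is consistent with the description's statement that the preliminary assembly represents over two thirds of the expected genome size. The short mean length shows how fragmented the assembly remains.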

Page 20: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Future directions: How do we get a sugarcane genome?

• Assembling sugarcane is challenging
– Perhaps a little more than I had anticipated
• THERE’S NO SILVER BULLET!
– BAC approach won’t provide information on alleles, but is tried and tested
– WGS can provide allele-scale resolution, but is complex and challenging
• Hybrid approach is most likely the best approach
– Using SUGESI monoploid assembly as a template
– Adopting WGS approach to overlay allelic data
– Identify gaps in monoploid assembly using WGS assembly

Page 21: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Acknowledgements

• Team members: Karen Aitken, Paul Berkman, Jai Perroux, Jingchuan Li, Rosanne Casu, Anne Rae
• CSIRO and BSES Ltd sugarcane marker and breeding groups
• Hélène Bergès and her team at INRA, Toulouse, France
• DArT Pty Ltd Team
• Australian Genome Research Facility (AGRF)
• UQ colleagues: Paul Visendi, Dave Edwards

Page 22: Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman

Thank you

CSIRO Plant Industry
Paul Berkman, OCE Postdoctoral Fellow
t +61 7 3214 2361
e [email protected]
w www.csiro.au/pi