establishing a hybrid approach to the polyploid sugarcane genome assembly - paul berkman

Establishing a hybrid approach to the polyploid sugarcane genome assembly

Paul Berkman

CSIRO PLANT INDUSTRY

R570 genome progress – CSIRO overview

Research context

BAC sequencing contribution
– Assembly strategy
– Progress and results

Whole Genome Shotgun approach
– Rationale
– Approach and progress to date

Future directions: there’s no silver bullet
– Options for WGS assembly improvement
– Hybrid approach for allele detection

CSIRO Objectives

Generation of sugarcane genomic sequence
– This will be done in conjunction with the International Sugarcane Genome Consortium using next generation sequencing technology

Improvement of the genetic map of sugarcane (CSIRO)
– For use as a tool in assembling genome sequence

Development of a web-based platform to integrate the genomic sequence and genetic information (CSIRO)
– This tool will enable researchers to identify regions of the genome that are associated with a particular trait

International objectives

R570 was designated as the reference at the Consortium meeting in August 2009
– France, Australia, Brazil, USA, South Africa

WGS sequencing was thought to be an unlikely approach

R570 BAC library already in existence

A comprehensive genome map is already available

Some had already commenced sequencing R570 BACs

January 2013 Re-invigoration of the effort

Keygene’s whole genome profiling technology applied

Driving towards a monoploid genome sequence

Monoploid genome sequence of sugarcane

4/3/2014

• Minimum tiling path of 5,029 BAC clones estimated to cover the monoploid genome sequence of sugarcane
• To date around 2,000 BAC clones have been sequenced
• The rest will be sequenced in the following year

Complex Genome

[Figure: genome size comparison; sugarcane cultivar labelled with S. officinarum, S. spontaneum, and recombinants]
• Arabidopsis: 300 Mbp
• Rice: 800 Mbp
• Sorghum: 1,600 Mbp
• Maize: 5,500 Mbp
• Sugarcane cultivar: 10,000 Mbp
• Wheat: 34,000 Mbp

Complex Genome

[Figure: genome size comparison; sugarcane cultivar labelled with S. officinarum, S. spontaneum, and recombinants]
• Human: 6,000 Mbp
• Sugarcane cultivar: 10,000 Mbp
• Wheat: 34,000 Mbp

Our BAC strategy: High-throughput NGS sequencing


Illumina HiSeq2000

• > 600 Gbp/run

• > 35 Gbp/lane

• 96 barcodes/lane

• Better results than 454 raw assemblies

< $500 USD/BAC for extraction, sequencing, and assembly

• This could be reduced to < $200 USD using pooling
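As a rough check on the multiplexing figures above, the expected coverage per BAC can be estimated from the lane yield and the number of barcodes. The sketch below assumes reads are split evenly across barcodes and uses a hypothetical 100 kbp BAC insert size, which the slide does not state; it also ignores reads later lost to E. coli and vector filtering.

```python
def coverage_per_bac(lane_yield_bp: float, barcodes: int, bac_size_bp: float) -> float:
    """Expected fold coverage per BAC when lane yield is shared evenly by all barcodes."""
    if barcodes <= 0 or bac_size_bp <= 0:
        raise ValueError("barcodes and bac_size_bp must be positive")
    return lane_yield_bp / barcodes / bac_size_bp


LANE_YIELD_BP = 35e9           # "> 35 Gbp/lane" from the slide, taken as a lower bound
BARCODES_PER_LANE = 96         # "96 barcodes/lane" from the slide
ASSUMED_BAC_SIZE_BP = 100_000  # assumption for illustration, not from the slide

fold = coverage_per_bac(LANE_YIELD_BP, BARCODES_PER_LANE, ASSUMED_BAC_SIZE_BP)
print(f"Expected coverage per BAC: ~{fold:.0f}-fold")
```

Under these assumptions each BAC receives several thousand-fold coverage, which suggests there is headroom for the pooling strategy mentioned above.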

Developed in conjunction with Prof. Dave Edwards

Mr Paul Visendi (PhD Student)

Under development since July 2012

Raw data cleaning
• SOAP2 alignment to E. coli genome
– Remove aligned reads from dataset
• Vector sequence (pBeloBAC11) removal
– BLAST alignment to BAC sequence
– Extract aligned reads
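The "remove aligned reads" step can be sketched as a simple ID filter. This is a minimal illustration, not the pipeline's actual code: it assumes the IDs of reads that aligned to the E. coli genome have already been collected from the aligner output into a set, and the names `parse_fastq` and `drop_aligned` are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str]  # (read_id, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from well-formed 4-line FASTQ text (no wrapped sequences)."""
    clean: List[str] = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(clean) - 3, 4):
        header, seq, _plus, qual = clean[i:i + 4]
        # Read ID is the first token after the leading '@'.
        yield header[1:].split()[0], seq, qual


def drop_aligned(records: Iterable[FastqRecord], aligned_ids: Set[str]) -> Iterator[FastqRecord]:
    """Keep only reads whose ID was NOT reported as aligned to the contaminant."""
    for record in records:
        if record[0] not in aligned_ids:
            yield record


if __name__ == "__main__":
    fastq = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "IIII\n"]
    contaminant_ids = {"r1"}  # hypothetical: IDs parsed from the alignment output
    for read_id, seq, qual in drop_aligned(parse_fastq(fastq), contaminant_ids):
        print(read_id, seq)
```

For paired-end data one would normally drop both mates when either aligns; that bookkeeping is left out here to keep the sketch short.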

Our BAC strategy: Custom assembly pipeline

[Flowchart] Short-read dataset → SOAP2 against E. coli genome → Filtered short-read dataset → Vector sequence removal → Vector & E. coli filtered short-read dataset

Assembly
• Optimal coverage ~500-fold
• SaSSY BAC-assembly algorithm

– Developed at University of Queensland

– Imelfort M.R., PhD Thesis, UQ 2012

– Graph-based OLC assembler

– More robust than Eulerian approach

– Uses paired-read statistics during contig building
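If a BAC is sequenced well beyond the ~500-fold target, one plausible way to reach that depth is to randomly subsample read pairs before assembly. The sketch below is an assumption about how this could be done, not a description of the pipeline; the function names, the seed, and the example BAC size are hypothetical.

```python
import random
from typing import Iterable, Iterator, Tuple

ReadPair = Tuple[str, str]  # (forward read sequence, reverse read sequence)


def sampling_fraction(total_read_bp: float, bac_size_bp: float, target_fold: float = 500.0) -> float:
    """Fraction of read pairs to keep so that expected coverage is about target_fold."""
    if total_read_bp <= 0 or bac_size_bp <= 0:
        raise ValueError("total_read_bp and bac_size_bp must be positive")
    return min(1.0, target_fold * bac_size_bp / total_read_bp)


def subsample_pairs(pairs: Iterable[ReadPair], fraction: float, seed: int = 0) -> Iterator[ReadPair]:
    """Keep each pair independently with probability `fraction` (mates stay together)."""
    rng = random.Random(seed)
    for pair in pairs:
        if rng.random() < fraction:
            yield pair


if __name__ == "__main__":
    # Hypothetical BAC: 100 kbp insert with ~365 Mbp of filtered reads.
    frac = sampling_fraction(total_read_bp=364_583_333, bac_size_bp=100_000)
    print(f"Keep about {frac:.1%} of read pairs")
```

Sampling whole pairs rather than individual reads preserves the paired-read statistics the assembler relies on during contig building.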

Scaffolding
• SSPACE public tool, uses Bowtie
• Collates alignment results and connects contigs

Our BAC strategy: Custom assembly pipeline

[Flowchart] Vector & E. coli filtered short-read dataset → SaSSY Assembler → SSPACE scaffolding → Assembled BAC Scaffolds (FASTA)

Validation using BAC-end sequences
• BLAST alignment of BES
• Confirm alignment at ends of scaffold
• If 2 scaffolds, combine based on BES position
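The BES check above can be illustrated with a small helper that decides whether a BLAST hit lies at a scaffold end. This is only a sketch under stated assumptions: hit coordinates are 1-based as in BLAST tabular output, the 1,000 bp tolerance is an arbitrary illustrative value, and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def end_of_scaffold(start: int, end: int, scaffold_len: int, tolerance: int = 1000) -> Optional[str]:
    """Return 'left' or 'right' if the hit lies within `tolerance` bp of that end, else None."""
    # BLAST reports reverse-strand hits with start > end, so normalise first.
    lo, hi = min(start, end), max(start, end)
    if lo <= tolerance:
        return "left"
    if hi > scaffold_len - tolerance:
        return "right"
    return None


def bes_confirm_scaffold(hits: Dict[str, Tuple[int, int]], scaffold_len: int, tolerance: int = 1000) -> bool:
    """True when the two BES of one BAC land on opposite ends of a single scaffold."""
    ends = {end_of_scaffold(s, e, scaffold_len, tolerance) for s, e in hits.values()}
    return ends == {"left", "right"}


if __name__ == "__main__":
    # Hypothetical hits for one BAC on a 120 kbp scaffold.
    hits = {"forward_BES": (15, 720), "reverse_BES": (119_950, 119_300)}
    print(bes_confirm_scaffold(hits, scaffold_len=120_000))
```

When a BAC assembles into two scaffolds, the same end labels could be used to orient and order them, as the last bullet above describes.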

Validation using raw data
• Apply public tool HAGFISH
– NZPFR, Ross Crowhurst et al.
• Uses distance between paired reads to:
– Identify raw data gaps in assembly
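The idea of using paired reads to find unsupported regions can be sketched as a physical-coverage scan. This is not HAGFISH's implementation, only an illustration of the principle: each properly mapped pair is assumed to be given as a 0-based, half-open span from the left mate's start to the right mate's end, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end), 0-based half-open


def uncovered_regions(pair_spans: Iterable[Span], contig_len: int) -> List[Span]:
    """Return regions of the contig that no read pair spans (physical coverage of zero)."""
    # Difference array: +1 where a pair span begins, -1 where it ends.
    diff = [0] * (contig_len + 1)
    for start, end in pair_spans:
        start, end = max(0, start), min(contig_len, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    gaps: List[Span] = []
    depth = 0
    gap_start = None
    for pos in range(contig_len):
        depth += diff[pos]
        if depth == 0 and gap_start is None:
            gap_start = pos                 # a zero-coverage run begins here
        elif depth > 0 and gap_start is not None:
            gaps.append((gap_start, pos))   # the run ended just before pos
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, contig_len))
    return gaps


if __name__ == "__main__":
    spans = [(0, 40), (30, 60), (80, 100)]
    print(uncovered_regions(spans, contig_len=100))
```

A region with no spanning pairs is a candidate misjoin or raw-data gap and would merit a closer look in the validation plots.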

Our BAC strategy: Custom assembly pipeline

[Flowchart] Assembled BAC Contigs (FASTA) → BES alignment and raw data insert size analysis → Validation results (CSV statistics & PNG image files)

BAC results to date: Progress of Analysis

• ~200 Gbp total sequence data generated for 453 BACs
– Illumina HiSeq2000, 100bp paired-end reads
– Data from BGI & AGRF
• Validation component of pipeline under final development
– Final assemblies of all 453 BACs to be completed April 2014
– Roughly half randomly extracted
– Other half targeted at QTL of interest

BAC results to date: Assembly of first 166 BACs

11 BACs did not contain an insert

Total number of reads 1,583,129,457

Total volume sequence data (Mbp) 188,588.54

Bacterial sequence contamination 11.87%

Total number assembled contigs 976

Total assembly Size (bp) 16,574,234

Total assembly N50 (bp) 33,885

Overall longest contig (bp) 128,410

Number contigs > 10 kbp 459

Number contigs > 100 kbp 10
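For readers unfamiliar with the N50 statistic reported above, the following sketch shows the standard definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The example lengths are made up for illustration and are not from the BAC assemblies.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that contigs of length >= L contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # unreachable for positive lengths; kept for completeness


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # illustrative lengths only
    print(n50(toy_contigs))
```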

BAC results to date: Summary

Assemblies comparable to 454-based approaches
– ~500-fold coverage appears optimal for Illumina assemblies
– Combination of in-house and public tools required for best assembly

More BACs can be sequenced at lower cost
– BAC-pooling strategy in final development with UQ for future work

SUGESI project is now driving towards a full monoploid assembly
– No expressed intention to include genome-wide allelic information

R570 Whole Genome Shotgun: Rationale

Why try?
• Lots of data available quite cheaply now
• Current BAC approach unlikely to provide much allelic information
• Realistically, allelic information is critical
• Success of WGS approach in other species
– Diploid wheat A and D genomes (2013)
• Robust algorithms now handling complex datasets
– Assemblathons
– GAGE

Why not try?

R570 Whole Genome Shotgun: Sequencing Approaches

Illumina HiSeq2000 (Illumina)
• 90-100 bp read length
• Paired reads
– Fragment ends used to generate sequence
– Fragment size (insert length) ranges from 180 bp to 32,000 bp
• Very low error rate

Pacific Biosciences (PacBio)
• Long reads, 1,000-20,000 bp long
• Single unpaired reads
• Moderate error rate

R570 Whole Genome Shotgun: Sequence Data


LIBRARY             CURRENT SEQUENCE   CURRENT PHYSICAL   PROJECTED SEQUENCE
                    COVERAGE (x)       COVERAGE (x)       COVERAGE (x)
Illumina 180bp      12.14              10.93              12.14
Illumina 300bp      11.06              16.59              43.39
Illumina 600bp      8.70               26.10              8.70
Illumina 2,000bp    2.83               31.44              2.83
Illumina 2,500bp    1.32               16.50              18.65
Illumina 4,500bp    1.31               29.48              18.64
Illumina 5,000bp    2.84               78.89              2.84
Illumina 7,500bp    1.21               45.38              1.21
Illumina 32,000bp   0.37               58.61              0.37
PacBio 20,000bp     0.00               0.00               2.7
TOTAL               41.78              313.92             111.47
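The relationship between the two coverage columns can be made explicit. For a paired-end library, sequence coverage counts only the sequenced bases at each end of a fragment, while physical coverage counts the whole fragment, so physical coverage ≈ sequence coverage × insert length / (2 × read length). The sketch below checks this against several table rows; the per-library read lengths of 90 or 100 bp are inferred here from which value reproduces the tabulated figure and are not stated per library on the slides.

```python
def physical_coverage(sequence_cov: float, insert_bp: int, read_len_bp: int) -> float:
    """Physical (fragment-span) coverage implied by paired-end sequence coverage."""
    return sequence_cov * insert_bp / (2 * read_len_bp)


# (insert length bp, sequence coverage, assumed read length bp, tabulated physical coverage)
ROWS = [
    (180, 12.14, 100, 10.93),
    (300, 11.06, 100, 16.59),
    (600, 8.70, 100, 26.10),
    (2000, 2.83, 90, 31.44),   # read length of 90 bp is an inference
    (2500, 1.32, 100, 16.50),
    (5000, 2.84, 90, 78.89),   # read length of 90 bp is an inference
]

if __name__ == "__main__":
    for insert_bp, seq_cov, read_len, tabulated in ROWS:
        estimate = physical_coverage(seq_cov, insert_bp, read_len)
        print(f"{insert_bp:>6} bp insert: estimated {estimate:.2f}x, tabulated {tabulated:.2f}x")
```

The remaining Illumina rows agree with the same formula to within the rounding of the tabulated sequence coverage, which illustrates why the long-insert libraries add so much physical coverage from very little sequence.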

R570 Whole Genome Shotgun: Progress of Analysis

Assembly commenced April 2013

4 High-Performance Computing facilities have been tested
– Ixthus (192Gb), Barrine (1Gb), Cherax (4Tb), and Zythos (6Tb)

3 assembly algorithms have been compared
– Velvet (public), AllPaths-LG (public), Biokanga (CSIRO-developed)

~20 test assemblies of 4x coverage dataset complete
– Cumulative >600 days of compute time & max. 150Gb RAM used

~20 attempts at assembling 42x coverage dataset
– Cumulative >6000 days of compute time & max. 1.5Tb RAM used
– Preliminary assembly completed March 2014

Recent development of Biokanga has optimised it to assemble the 42x dataset successfully

R570 Whole Genome Shotgun: Progress of Analysis

Preliminary assembly results
• Biokanga successfully assembled 42x data on Ixthus (150Gb RAM, 10 days)

Options to exploit this assembly:
• Assembly of sugarcane mitochondrial genome
• Extracting genomic context information for genes of interest

Total number assembled scaffolds 5,344,024

Total assembly Size (bp) 6,761,247,399

Total assembly N50 (bp) 2,160

Overall longest contig (bp) 67,936

Number contigs > 1 kbp 2,223,020

Number contigs > 10 kbp 7,326

Future directions: How do we get a sugarcane genome?

• Assembling sugarcane is challenging
– Perhaps a little more than I had anticipated
• THERE’S NO SILVER BULLET!
– BAC approach won’t provide information on alleles, but is tried and tested
– WGS can provide allele-scale resolution, but is complex and challenging
• Hybrid approach is most likely the best approach
– Using SUGESI monoploid assembly as a template
– Adopting WGS approach to overlay allelic data
– Identify gaps in monoploid assembly using WGS assembly
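The last point, finding gaps in the monoploid template with the WGS assembly, could in principle be reduced to an interval problem once WGS contigs have been aligned to the template. The sketch below is one possible illustration rather than a planned implementation: it assumes alignments are already available as 0-based, half-open intervals on a single template sequence, and `template_gaps` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open, on the template


def template_gaps(aligned: Iterable[Interval], template_len: int) -> List[Interval]:
    """Return template regions not covered by any aligned WGS contig."""
    gaps: List[Interval] = []
    cursor = 0  # rightmost template position covered so far
    for start, end in sorted(aligned):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < template_len:
        gaps.append((cursor, template_len))
    return gaps


if __name__ == "__main__":
    # Hypothetical alignments of WGS contigs to a 2,000 bp template region.
    wgs_hits = [(100, 500), (400, 900), (1200, 1500)]
    print(template_gaps(wgs_hits, template_len=2000))
```

Template regions covered by several distinct WGS contigs would be the natural starting point for overlaying allelic data, since they may represent different alleles of the same locus.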

Acknowledgements

Team members: Karen Aitken, Paul Berkman, Jai Perroux, Jingchuan Li, Rosanne Casu, Anne Rae

CSIRO and BSES Ltd sugarcane marker and breeding groups

Hélène Bergès and her team at INRA, Toulouse, France

DArT Pty Ltd Team

Australian Genome Research Facility (AGRF)

UQ colleagues: Paul Visendi, Dave Edwards

Thank you

CSIRO Plant Industry
Paul Berkman, OCE Postdoctoral Fellow
t +61 7 3214 2361
e [email protected]
w www.csiro.au/pi
