SeRC: de novo assembly workshop. Francesco Vezzi

De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen
Francesco Vezzi, PhD, Senior Bioinformatician, NGI-Stockholm


Post on 22-Apr-2015


DESCRIPTION

A multi-technology perspective on de novo assembly projects.

TRANSCRIPT

Page 1: SeRC: de novo assembly workshop. Francesco Vezzi

De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen

Francesco Vezzi, PhD, Senior Bioinformatician, NGI-Stockholm

Page 2: SeRC: de novo assembly workshop. Francesco Vezzi

NGI Equipment (both Stockholm and Uppsala nodes)

Instrument                                 Count
Illumina HiSeq 2000/2500                   16
Illumina MiSeq                             3
Life Technologies SOLiD 5500xl             4
Life Technologies SOLiD 5500wildfire       2
Life Technologies Ion Torrent              2
Life Technologies Ion Proton               6
Life Technologies Sanger ABI3730           2
Pacific Biosciences RSII                   1
Argus Whole Genome Mapping System          1

One of the 3 best-equipped sequencing sites in Europe

Page 3: SeRC: de novo assembly workshop. Francesco Vezzi

In this talk

Illumina (Stockholm):
• 100/150 bp paired reads (low error rate)
• 900/200 Gbp in 6/2 day(s)

PacBio (Uppsala):
• 8.5 Kbp reads (max 30 Kbp, high error rate)
• 375 Mbp (1 SMRT Cell) in 10 hours

OpGen Argus System (Stockholm):
• ~300 Kbp maps
• 10 Gbp in ~1 day

Page 4: SeRC: de novo assembly workshop. Francesco Vezzi

Optical Maps

• Restriction Map
  ◦ Representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci
• An enzyme is selected and used to cut the molecules. This provides a 2D representation of the molecule structure
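The restriction-map idea above can be illustrated with an in-silico digest. This is only a minimal sketch: the function name `restriction_map`, the toy sequence, and the choice of the BamHI recognition site (GGATCC, cut after the first G) are assumptions made for illustration; the slides do not name the enzyme used.

```python
def restriction_map(sequence, site, cut_offset=0):
    """Return (cut_positions, fragment_lengths) for one recognition site.

    cut_offset is the distance from the start of the site to the cut
    (hypothetical parameter; 1 models an enzyme cutting after the
    first base of its site).
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Step by one base so overlapping sites are also reported.
        start = sequence.find(site, start + 1)
    # Fragment lengths are the distances between consecutive boundaries.
    boundaries = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return cuts, fragments


# Toy 20 bp molecule containing two GGATCC sites.
toy = "AAGGATCCTTTTGGATCCAA"
cuts, fragments = restriction_map(toy, "GGATCC", cut_offset=1)
print(cuts)       # cut coordinates along the molecule
print(fragments)  # ordered fragment lengths, i.e. the restriction map
```

The ordered list of fragment lengths is the kind of signature that can be compared between an optical map and an in-silico digest of assembled contigs.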

Page 5: SeRC: de novo assembly workshop. Francesco Vezzi

Optical Maps: workflow

Steps                                    Time
DNA extraction directly from culture     3-8h
Quality control of extracted material    1h
Prepare a chip                           1.5h
Run Argus System                         1h
Data assembly                            2-8h

Page 6: SeRC: de novo assembly workshop. Francesco Vezzi

Closing genomes with Optical Maps

De novo reconstructs parts missing in the reference strain

Correctly assembles long tandem repeats

De Novo assembly (Illumina, PacBio)

Set of unordered and unoriented contigs

Optical Map

Contigs

Page 7: SeRC: de novo assembly workshop. Francesco Vezzi

Case Study: Combining all the technologies

~15 Mbp genome sequenced at high coverage with:
• Illumina HiSeq:
  ◦ 500X PE libraries (180 bp and 650 bp insert)
  ◦ 150X MP library (3 Kbp)
  ◦ 150X MP library (7 Kbp)
• PacBio:
  ◦ 50/60X with reads longer than 2 Kbp
• OpGen:
  ◦ 3 chips (only one worked really well)
  ◦ 300X coverage
  ◦ Average map length 320 Kbp
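The coverage figures above translate into raw data volumes through coverage = sequenced bases / genome size. The sketch below works this through for the ~15 Mbp genome. The function names are hypothetical, and the SMRT cell estimate reuses the 375 Mbp per cell figure from the "In this talk" slide; since the case study only counts reads longer than 2 Kbp, the real number of cells was likely higher.

```python
import math

GENOME_SIZE_BP = 15_000_000        # ~15 Mbp, from the case study
SMRT_CELL_YIELD_BP = 375_000_000   # 375 Mbp per SMRT cell, from the talk


def bases_needed(coverage, genome_size=GENOME_SIZE_BP):
    """Bases required to reach a given fold coverage."""
    return coverage * genome_size


def smrt_cells_for(coverage, genome_size=GENOME_SIZE_BP):
    """Lower bound on SMRT cells, assuming every sequenced base is usable."""
    return math.ceil(bases_needed(coverage, genome_size) / SMRT_CELL_YIELD_BP)


illumina_pe_bases = bases_needed(500)   # 500X paired-end libraries
pacbio_cells_50x = smrt_cells_for(50)
pacbio_cells_60x = smrt_cells_for(60)

print(f"500X Illumina PE: {illumina_pe_bases / 1e9:.1f} Gbp")
print(f"PacBio 50X: at least {pacbio_cells_50x} SMRT cells")
print(f"PacBio 60X: at least {pacbio_cells_60x} SMRT cells")
```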

Page 8: SeRC: de novo assembly workshop. Francesco Vezzi

Assembly Strategy
https://github.com/vezzi/de_novo_scilife

Semi-automated pipeline for de novo assembly:
• Global configuration file: tools and system configuration
• Sample configuration file: samples description

3 modules:
1. QC-module (Illumina only):
  • Adaptor removal, kmer-analysis, fastqc, (insert size estimation)
2. Assemble-module (Illumina only):
  • Runs specified assemblers and outputs executed commands
3. Validation-module:
  • FRCbam, coverage analysis, GC-analysis, (N50)

I NEED USERS/FEEDBACK/CONTRIBUTIONS

Page 9: SeRC: de novo assembly workshop. Francesco Vezzi

QC-Module

Kmer analysis:
• Sample complexity
• Error rate
• Heterozygosity

FASTQC
Adaptor removal
Alignment (partial assembly)
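The k-mer analysis step rests on a k-mer multiplicity histogram: sequencing errors tend to show up as k-mers seen only once or twice, while genomic k-mers pile up around the sequencing depth, and heterozygous sites can add a secondary peak at roughly half that depth. The code below is a toy sketch of that histogram, not the pipeline's implementation (which presumably relies on a dedicated k-mer counter). The function names and reads are made up, and reverse complements are not merged, which a real analysis would do.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing uncalled bases
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())


# Hypothetical toy reads; the last one carries a substitution "error".
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)

for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```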

Page 10: SeRC: de novo assembly workshop. Francesco Vezzi

Assemble-Module

Illumina only:
• SOAPdenovo
• MaSuRCA
• Allpaths-LG

PacBio only:
• HGAP
• CABOG

Hybrid:
• PB-jelly (HAH)

>5000        #scaffolds  totalLength  maxContigLength  N50     N80     percentageNs
Allpaths-LG  227         14513103     596012           139364  57619   15%
MASURCA      163         18549484     1188669          526519  282507  2%
HGAP         290         14399273     763592           142483  37117   0%
PB-Jelly     179         14718213     747750           195225  85127   13%

• Trial-and-error process
• Automated pipeline developed in order to streamline these analyses
• MASURCA surprisingly the “best” assembler
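For reference, the N50 and N80 columns follow the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 80%) of the assembly size. A minimal sketch, with a hypothetical function name and toy lengths rather than the real scaffolds:

```python
def nx(lengths, percent=50):
    """Return the Nx statistic (N50 for percent=50, N80 for percent=80)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0  # empty input


# Toy scaffold lengths (made up), total 300 bp.
scaffold_lengths = [100, 80, 60, 40, 20]
n50 = nx(scaffold_lengths, 50)
n80 = nx(scaffold_lengths, 80)
print(n50, n80)
```

Because both statistics depend only on lengths, they reward long sequences regardless of correctness, which is why the validation module reports N50 alongside alignment-based checks.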

Page 11: SeRC: de novo assembly workshop. Francesco Vezzi

Validation-Module

[Figure panels: MaSuRCA, HGAP, PB-Jelly (HAH)]

Page 12: SeRC: de novo assembly workshop. Francesco Vezzi

FRCbam

Validation-Module

PacBio-only assembly is clearly outperforming the others
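The comparison above is based on feature-response curves. The general idea, sketched here in simplified form, is to sort contigs from longest to shortest and, for a growing budget of suspicious "features", report how much of the genome the contigs within that budget cover; a curve that rises faster indicates a better assembly. This is not FRCbam's actual code: the function name, the `(length, feature_count)` input format, and the toy numbers are assumptions for illustration only.

```python
def feature_response_curve(contigs, genome_size):
    """Return a list of (cumulative_features, percent_genome_covered).

    contigs: iterable of (length, feature_count) pairs, where a feature
    is any flagged suspicious region (hypothetical input format).
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    curve = []
    total_features = 0
    total_length = 0
    for length, features in ordered:
        total_features += features
        total_length += length
        curve.append((total_features, 100 * total_length / genome_size))
    return curve


# Toy assembly: three contigs on a 1000 bp genome (made-up numbers).
toy_contigs = [(300, 0), (500, 2), (200, 5)]
curve = feature_response_curve(toy_contigs, genome_size=1000)
for features, coverage in curve:
    print(features, coverage)
```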

Page 13: SeRC: de novo assembly workshop. Francesco Vezzi

Optical Maps

PacBio produces the best assembly; however, 290 contigs are produced.

Optical Maps made it possible to obtain the 2D representation of the 7 chromosomes.
N.B. chromosome number was one of the biological questions of this project!!!

But much more can be done!!!

Page 14: SeRC: de novo assembly workshop. Francesco Vezzi

Combining all the technologies

Incredible tool to finish (or almost finish) genomes

Assembly                       % contigs placed  Total size of placed contigs  % size placed contigs  % genome covered
PacBio+OpGen                   94.12             11578995                      97%                    77.05
Allpaths+OpGen                 71.88             10692027                      84%                    52.88
Allpaths+Masurca+OpGen         80.65             27506424                      92%                    69.64
Allpaths+PacBio+OpGen          82.32             22271022                      91%                    83.05
Masurca+PacBio+OpGen           94.44             28393392                      98%                    83.79
Allpaths+Masurca+PacBio+OpGen  85.42             39085419                      94%                    87.39

Page 15: SeRC: de novo assembly workshop. Francesco Vezzi

Conclusions – Take home message

Attempt to automate de novo assembly process:
• https://github.com/vezzi/de_novo_scilife
• Not 100% automated

Illumina, PacBio, Hybrid assemblies:
• PacBio alone seems to produce the best assemblies
• Hybrid assembly seems unable to correct merged-assembly problems

Mixing technologies is always a good idea:
• Possibility to compensate for technological biases
• Makes it possible to produce better assemblies

Page 16: SeRC: de novo assembly workshop. Francesco Vezzi

Thanks
https://github.com/vezzi/de_novo_scilife