serc: de novo assembly workshop. francesco vezzi
Post on 22-Apr-2015
De novo assembly, a multi-technology approach: Illumina, PacBio, and OpGen
Francesco Vezzi, PhD
Senior Bioinformatician, NGI-Stockholm
NGI Equipment
Both Stockholm and Uppsala nodes
Instrument                                 Units
Illumina HiSeq 2000/2500                   16
Illumina MiSeq                             3
Life Technologies SOLiD 5500xl             4
Life Technologies SOLiD 5500wildfire       2
Life Technologies Ion Torrent              2
Life Technologies Ion Proton               6
Life Technologies Sanger ABI3730           2
Pacific Biosciences RSII                   1
Argus Whole Genome Mapping System          1
One of the 3 best-equipped sequencing sites in Europe
In this talk
Illumina (Stockholm):
• 100/150 bp paired reads (low error rate)
• 900/200 Gbp in 6/2 day(s)
PacBio (Uppsala):
• 8.5 Kbp reads (max 30 Kbp, high error rate)
• 375 Mbp (1 SMRT Cell) in 10 hours
OpGen Argus System (Stockholm):
• ~300 Kbp maps
• 10 Gbp in ~1 day
Optical Maps
• Restriction Map
  ◦ Representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci
• An enzyme is selected and used to cut the molecules. This provides a 2D representation of the molecule structure
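As a rough illustration of what a restriction map encodes, the sketch below digests a sequence in silico and reports the ordered fragment lengths, which is the kind of signal an optical map is compared against. The function name `restriction_map`, the toy sequence, and the recognition site `CTTAAG` are assumptions chosen for the example and do not come from the talk. For simplicity the cut is placed at the start of each recognition site, and only the forward strand is scanned.

```python
def restriction_map(sequence, site):
    """Return (cut_positions, fragment_lengths) for a linear molecule.

    Simplification: the cut is placed at the first base of each
    occurrence of the recognition site on the forward strand.
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    # Fragment boundaries are the molecule ends plus every cut position.
    boundaries = [0] + cuts + [len(sequence)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return cuts, fragments


SITE = "CTTAAG"  # example six-base site, an assumption for this sketch
toy = "AA" + SITE + "GGGG" + SITE + "TTTTTT" + SITE + "A"
cuts, fragments = restriction_map(toy, SITE)
print(cuts)       # cut positions along the molecule
print(fragments)  # ordered fragment lengths, i.e. the restriction map
```

The ordered list of fragment lengths is what makes the map useful for scaffolding: a contig digested in silico with the same enzyme yields a fragment pattern that can be searched for along the much longer map.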
Optical Maps: workflow
Steps                                      Time
DNA extraction directly from culture       3-8h
Quality control of extracted material      1h
Prepare a chip                             1.5h
Run Argus System                           1h
Data assembly                              2-8h
Closing genomes with Optical Maps
De novo reconstructs parts missing in the reference strain
Correctly assembles long tandem repeats
De novo assembly (Illumina, PacBio): a set of unordered, unoriented contigs
[Figure labels: Optical Map, Contigs]
Case Study: Combining all the technologies
~15 Mbp genome sequenced at high coverage with:
• Illumina HiSeq:
  • 500X PE libraries (180 bp and 650 bp insert)
  • 150X MP library (3 Kbp)
  • 150X MP library (7 Kbp)
• PacBio:
  • 50/60X with reads longer than 2 Kbp
• OpGen:
  • 3 chips (only one worked really well)
  • 300X coverage
  • Average map length 320 Kbp
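The coverage figures above translate into raw data volumes through coverage = total bases / genome size. The sketch below does that arithmetic for the ~15 Mbp genome. The helper names `bases_needed` and `units_needed` are hypothetical, and the 100 bp read length is an assumption taken from the instrument overview, since the case study does not state which read length was used.

```python
GENOME_SIZE_BP = 15_000_000  # "~15 Mbp" from the case study


def bases_needed(coverage, genome_size=GENOME_SIZE_BP):
    """Total bases required to reach the given fold coverage."""
    return coverage * genome_size


def units_needed(coverage, unit_length, genome_size=GENOME_SIZE_BP):
    """Number of reads (or maps) of a given length needed, rounded up."""
    total = bases_needed(coverage, genome_size)
    return -(-total // unit_length)  # integer ceiling division


# 500X Illumina PE coverage, assuming 100 bp reads (assumption).
print(bases_needed(500), units_needed(500, 100))
# 300X optical-map coverage with an average map length of 320 Kbp.
print(bases_needed(300), units_needed(300, 320_000))
```

Under these assumptions, 500X corresponds to 7.5 Gbp of Illumina data and 300X to 4.5 Gbp of optical maps, which is consistent in scale with the per-run yields quoted earlier in the talk.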
Assembly Strategy
https://github.com/vezzi/de_novo_scilife
Semi-automated pipeline for de novo assembly:
• Global configuration file: tools and system configuration
• Sample configuration file: samples description
3 modules:
1. QC-module (Illumina only):
   • Adaptor removal, kmer-analysis, fastqc, (insert size estimation)
2. Assemble-module (Illumina only):
   • Runs specified assemblers and outputs executed commands
3. Validation-module:
   • FRCbam, coverage analysis, GC-analysis, (N50)
I NEED USERS/FEEDBACK/CONTRIBUTIONS
QC-Module
Kmer analysis:
• Sample complexity
• Error rate
• Heterozygosity
FASTQC
Adaptor removal
Alignment (partial assembly)
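The k-mer analysis listed above rests on a k-mer multiplicity histogram: sequencing errors tend to pile up at multiplicity 1 or 2, the main peak sits near the k-mer coverage, and heterozygosity typically shows up as a secondary peak at about half that value. The following is a minimal in-memory sketch of such a histogram, not the pipeline's own implementation; the function names are hypothetical, and real data sets would normally need a dedicated k-mer counter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Count canonical k-mers and build the multiplicity histogram.

    Returns (counts, spectrum) where counts maps k-mer -> occurrences and
    spectrum maps multiplicity -> number of distinct k-mers seen that often.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:  # skip k-mers containing ambiguous bases
                continue
            counts[canonical(kmer)] += 1
    spectrum = Counter(counts.values())
    return counts, spectrum


counts, spectrum = kmer_spectrum(["ACGTACGT", "ACGTACGA"], k=4)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

Counting canonical k-mers merges the two strands, which is the usual convention because reads come from either strand.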
Assemble-Module
Illumina only:
• SOAPdenovo
• MaSuRCA
• Allpaths-LG
PacBio only:
• HGAP
• CABOG
Hybrid:
• PB-jelly (HAH)
>5000         #scaffolds  totalLength  maxContigLength  N50     N80     percentageNs
Allpaths-LG   227         14513103     596012           139364  57619   15%
MaSuRCA       163         18549484     1188669          526519  282507  2%
HGAP          290         14399273     763592           142483  37117   0%
PB-Jelly      179         14718213     747750           195225  85127   13%
• Trial-and-error process
• Automated pipeline developed in order to streamline these analyses
• MaSuRCA is, surprisingly, the “best” assembler
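For reference, the N50 and N80 columns in the assembly statistics can be computed as follows: sort the sequences from longest to shortest and report the length at which the running sum first reaches 50% (or 80%) of the total. This is a minimal sketch of the standard definition, with a hypothetical `min_length` parameter mirroring the ">5000" filter in the table header; it is not taken from the pipeline's code.

```python
def nx(lengths, percent, min_length=0):
    """Return the Nx statistic (e.g. percent=50 for N50).

    Sequences shorter than min_length are ignored. The result is the
    length of the sequence at which the cumulative sum, taken from the
    longest sequence downwards, first reaches percent% of the total.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0  # no sequences left after filtering


toy_lengths = [10, 8, 6, 4, 2]
print(nx(toy_lengths, 50))  # N50 of the toy set
print(nx(toy_lengths, 80))  # N80 of the toy set
```

Because N50 depends only on the length distribution, a higher value does not by itself mean a more correct assembly, which is why the validation step looks at other evidence as well.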
[Figure panels: MaSuRCA, HGAP, PB-Jelly (HAH)]
Validation-Module
FRCbam
PacBio-only assembly is clearly outperforming the others
Optical Maps
PacBio produces the best assembly; however, it still consists of 290 contigs.
Optical maps made it possible to obtain a 2D representation of the 7 chromosomes.
N.B. chromosome number was one of the biological questions of this project!!!
But much more can be done!!!
Incredible tool to finish (or almost finish) genomes
Combining all the technologies
Assemblies                        % contigs placed  Total size of placed contigs  % size placed contigs  % genome covered
PacBio+OpGen                      94.12             11578995                      97%                    77.05
Allpaths+OpGen                    71.88             10692027                      84%                    52.88
Allpaths+Masurca+OpGen            80.65             27506424                      92%                    69.64
Allpaths+PacBio+OpGen             82.32             22271022                      91%                    83.05
Masurca+PacBio+OpGen              94.44             28393392                      98%                    83.79
Allpaths+Masurca+PacBio+OpGen     85.42             39085419                      94%                    87.39
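The placement statistics can be reproduced in spirit with a small calculation over contig placements. In the table, the total size of placed contigs exceeds the ~15 Mbp genome size once several assemblies are merged, so redundant contigs evidently land on the same map regions; genome coverage therefore has to be measured on the union of placed intervals rather than on summed contig lengths. The sketch below assumes a hypothetical input format of (contig, chromosome, start, end) tuples and assumes that the percentages are taken relative to all input contigs and to the genome size; the exact denominators used on the slide are not stated.

```python
def placement_stats(contig_lengths, placements, genome_size):
    """Summarise how well a set of contigs was placed on optical maps.

    contig_lengths: dict mapping contig name -> length in bp
    placements: list of (contig, chromosome, start, end) tuples
    genome_size: total map length in bp
    """
    placed = {name for name, _, _, _ in placements}
    placed_size = sum(contig_lengths[name] for name in placed)
    total_size = sum(contig_lengths.values())

    # Merge overlapping intervals per chromosome so that redundant
    # contigs covering the same region are only counted once.
    covered = 0
    cur_chrom, cur_start, cur_end = None, 0, 0
    for chrom, start, end in sorted((c, s, e) for _, c, s, e in placements):
        if chrom != cur_chrom or start > cur_end:
            covered += cur_end - cur_start
            cur_chrom, cur_start, cur_end = chrom, start, end
        else:
            cur_end = max(cur_end, end)
    covered += cur_end - cur_start

    return {
        "pct_contigs_placed": 100 * len(placed) / len(contig_lengths),
        "placed_size": placed_size,
        "pct_size_placed": 100 * placed_size / total_size,
        "pct_genome_covered": 100 * covered / genome_size,
    }


toy_lengths = {"a": 100, "b": 50, "c": 50, "d": 200}
toy_placements = [("a", "chr1", 0, 100), ("b", "chr1", 80, 130), ("c", "chr2", 0, 50)]
print(placement_stats(toy_lengths, toy_placements, genome_size=400))
```

In the toy input, contigs a and b overlap by 20 bp on chr1, so the covered length is smaller than the summed size of the placed contigs, the same effect that shows up in the merged-assembly rows of the table.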
Conclusions – Take home message
Attempt to automate de novo assembly process:
• https://github.com/vezzi/de_novo_scilife
• Not 100% automated
Illumina, PacBio, Hybrid assemblies:
• PacBio alone seems to produce the best assemblies
• Hybrid assembly does not seem able to correct merged-assembly problems
Mixing technologies is always a good idea:
• Possibility to compensate for technological biases
• Makes it possible to produce better assemblies
Thanks
https://github.com/vezzi/de_novo_scilife