genome assembly

Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick Genome Assembly


Post on 24-Feb-2016



Page 1: Genome Assembly

Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick

Genome Assembly

Page 2: Genome Assembly

Outline

• Stakeholders
• Biology
• NGS Review
• Introduction to Genome Assembly
• Challenges
• Analysis pipeline / strategy
• Tool selection
• Summary (final pipeline)

Page 3: Genome Assembly

Stakeholders

• CDC (Centers for Disease Control and Prevention)
• GaTech
• Immunocompromised individuals
• Consumers of seafood
• Prediction group (and subsequent groups)

Stakeholders / Biology / NGS Review / Introduction to Genome Assembly / Challenges /Analysis Pipeline-Strategy / Tool Selection / Summary

Page 4: Genome Assembly

Biology…

Image of V. vulnificus

Page 5: Genome Assembly

Vibrio vulnificus

• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Halophilic (salt-loving) organism
  o Abundant in estuarine ecosystems
• Major cause of seafood-related deaths

Page 6: Genome Assembly

Vibrio vulnificus – genome architecture

• Bacterial genomes are coding-dense
  o Introns rare
• Contains plasmids (pYJ016)
• V. vulnificus: ~5.2 Mbp genome
  o GC content: 45-47% (similar to E. coli, ~50%)

Page 7: Genome Assembly

Vibrio navarrensis

• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Moderately halophilic organism
  o Some strains do not grow well in moderate to high salt concentrations

Page 8: Genome Assembly

Vibrio navarrensis - genomic architecture

Page 9: Genome Assembly

NGS - Review

Page 10: Genome Assembly

Roche 454 sequencing workflow overview

• Sample input: genomic DNA, BACs, amplicons, cDNA
• Generation of small DNA fragments via shearing
• Ligation of A/B adaptors flanking single-stranded DNA fragments
• Emulsification of beads and fragments in water-in-oil microreactors
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling

One fragment, one bead, one read

400,000 reads per run

Page 11: Genome Assembly

GS FLX data analysis – flowgram generation

[Figure: flowgram. Labels: flow order T, A, C, G; signal levels 1-mer, 2-mer, 3-mer, 4-mer; called sequence TTCTGCGAA]

Example of homopolymer errors from 454 sequencing data
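The flowgram idea can be sketched in a few lines of Python. This is a minimal illustration, not the GS FLX base caller: it assumes the flow order T, A, C, G shown in the figure, rounds each signal to the nearest whole number of bases, and uses made-up signal values chosen so that the result matches the sequence in the figure.

```python
def call_bases(signals, flow_order="TACG"):
    """Turn a list of per-flow signal intensities into a base sequence."""
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round the signal to the nearest whole homopolymer length.
        run_length = int(signal + 0.5)
        bases.append(nucleotide * run_length)
    return "".join(bases)

# Hypothetical signals: four cycles of T, A, C, G flows.
signals = [2.04, 0.03, 1.02, 0.10,
           0.97, 0.05, 0.08, 1.10,
           0.02, 0.04, 0.95, 1.05,
           0.10, 1.96, 0.00, 0.00]
read = call_bases(signals)

# A true 4-mer whose signal comes out at 3.45 is called as a 3-mer.
undercalled = call_bases([3.45])
```

Here `read` is the sequence from the figure, while `undercalled` shows how a slightly weak signal for a long homopolymer drops one base. That is the kind of error the slide refers to.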

Page 12: Genome Assembly

Example of 454 sff file (text format)

Page 13: Genome Assembly

Illumina sequencing overview

[Figure: workflow diagram. Labels: 0.1 - 1.0 μg; user or core facility; cBot; GAIIx]

Page 14: Genome Assembly

Example of Illumina *.fastq file

@C3PO_0001:2:1:17:1499#0/1
TGAATTCATTGACCATAACAATCATATGCATGATGCAAATTATAATATCATTTTTAGTGACGTCGTGAATCGTTT
+C3PO_0001:2:1:17:1499#0/1
abaaaaaaaaaaa`a`aa_aaaaaaaaaaaaaaaa_a aaa`aaaaa^aaaaa`a]^`a YZYZ^`NJDJ\_Z
@C3PO_0001:2:1:17:1291#0/1
TGTTTGAGCAAATGATTCATAATAATGTATTTCAATATTTTTAGGAATATCTCCCAATATTGCGCGTGCTGAATT
+C3PO_0001:2:1:17:1291#0/1
a`_`_\a_aaaa_a^Z^^a[a^aa]a_^_a_``aa `aa`X^X^^`aa_\_]VR`\a_]W\_`_a]a]][\RZV
@C3PO_0001:2:2:1452:1316#0/1
GTCCATCCGCAGCAGCGAATTTTTGACGTCCCCCCCCGAANGGANGNGANNNNGNNGNNNTNTNNAAANGNNNNN
+C3PO_0001:2:2:1452:1316#0/1
_U a\ `]_`ZP\\_Z^[]aa^a_]XNBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB…
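Each FASTQ record spans four lines: a header starting with "@", the bases, a "+" separator and one quality character per base. The sketch below reads such records and decodes the qualities. The offset of 64 is an assumption: quality characters such as "a" and "B" in the example suggest the older Illumina Phred+64 encoding, but the slide does not state it. The record used here is synthetic.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the name
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("record does not start with '@': " + header)
        yield header[1:], sequence, quality

def phred_scores(quality, offset=64):
    """Decode a quality string; offset 64 is an assumption (older Illumina)."""
    return [ord(ch) - offset for ch in quality]

# Synthetic record, not taken from the slide.
example = io.StringIO("@read1/1\nACGTN\n+read1/1\naa`]B\n")
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
```

For files written with the Sanger-style encoding, `offset=33` would be passed instead.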

Page 15: Genome Assembly

Genome Assembly

Page 16: Genome Assembly

Input reads

V. navarrensis V. vulnificus

2423-01 2009V-1368

08-2462 06-2432

2541-90 08-2435

2756-81 08-2439

07-2444

Page 17: Genome Assembly

Introduction to genome assembly

An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target.

In addition to contigs, a set of unassembled or partially assembled reads is also given as an output.

• Reads
• Contigs: multiple sequence alignment of reads plus the consensus sequence
• Scaffolds: define the contig order and orientation
• Output (FASTA)

Page 18: Genome Assembly

How do we check the quality of our assembly?

METRICS!

• N50
• Minimum/maximum contig length
• No. of contigs
• No. of errors
• FRC (feature response curve)
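As an illustration of the size metrics listed above, N50 can be computed directly from a list of contig lengths. This is a minimal sketch using the common definition (the contig length at which the running total, taken from the longest contig down, first reaches half of all assembled bases). The lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
summary = {
    "n50": n50(lengths),
    "min": min(lengths),
    "max": max(lengths),
    "contigs": len(lengths),
}
```

With these lengths the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it, so N50 is 70.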

Page 19: Genome Assembly

Feature-by-Feature – evaluating de-novo assembly

• BREAKPOINT: Points in the assembly where leftover reads partially align
• COMPRESSION: Area representing a possible repeat collapse
• STRETCH: Area representing a possible repeat expansion
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage
• HIGH_NORMAL_CVG: Area composed of normal oriented reads but at high coverage
• HIGH_LINKING_CVG: Area composed of reads with associated mates in another scaffold
• HIGH_SPANNING_CVG: Area composed of reads with associated mates in another contig
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->)
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere)
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage
• HIGH_SNP: SNP with high coverage
• KMER_COV: Problematic k-mer distribution

Page 20: Genome Assembly

Feature-by-Feature – evaluating de-novo assembly

• Most of the traditional metrics used to evaluate assemblies (N50, mean contig size, etc.) emphasize only size, while nothing (or almost nothing) is said about how correct the assemblies are.

• A typical correctness check (especially in the NGS context) consists of aligning contigs back to an available reference. However, this naive technique simply counts the number of mis-assemblies without attempting to distinguish or categorize them any further.

• After running amosvalidate, each contig is assigned the number of features that correspond to doubtful sequences in the assembly.

• For a fixed feature threshold w, the contigs are sorted by size and, starting from the longest, contigs are tallied only while their sum of features is ≤ w. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve (FRC).
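The FRC construction described in the last bullet can be sketched as follows. This is one reading of the slide text, not the amosvalidate or FRC tool itself: the "sum of features" is taken to be the cumulative count over the contigs tallied so far, and the contig lengths, feature counts and genome size are hypothetical.

```python
def frc_point(contigs, threshold, genome_size):
    """Approximate genome coverage reachable with at most `threshold` features.

    contigs: list of (length, feature_count) pairs.
    """
    covered = 0
    features = 0
    # Walk the contigs from longest to shortest.
    for length, count in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + count > threshold:
            break  # adding this contig would exceed the feature budget
        features += count
        covered += length
    return covered / genome_size

def frc_curve(contigs, thresholds, genome_size):
    """One (threshold, coverage) point per feature threshold."""
    return [(w, frc_point(contigs, w, genome_size)) for w in thresholds]

# Hypothetical assembly: (contig length, number of features).
contigs = [(500, 2), (300, 1), (200, 5)]
curve = frc_curve(contigs, [0, 2, 3, 8], genome_size=1000)
```

A curve that rises steeply at low thresholds indicates an assembly that covers much of the genome with few suspicious features.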

Page 21: Genome Assembly

Assembly Challenges

Page 22: Genome Assembly

Challenges

Intrinsic
• Genome architecture
• Repeats
• Homopolymer runs
• Sequence complexity
• Chimeras?
• Contaminants

Technical
• Short reads
• Poisson distribution of coverage
• Sequencing errors
• Variable quality
• Sequence tags
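The "Poisson distribution of coverage" item can be made concrete with a small calculation. Under the usual simplifying assumption that reads land uniformly at random, the depth at a base is approximately Poisson with mean c = N × L / G. The read count (400,000 per run) and genome size (~5.2 Mbp) come from earlier slides. The 400 bp read length is a hypothetical value chosen for illustration.

```python
import math

def mean_coverage(num_reads, read_length, genome_size):
    """Expected depth c = N * L / G."""
    return num_reads * read_length / genome_size

def prob_depth(k, coverage):
    """Poisson probability that a base is covered by exactly k reads."""
    return math.exp(-coverage) * coverage ** k / math.factorial(k)

def expected_uncovered_bases(coverage, genome_size):
    """Expected number of bases with zero reads over them."""
    return genome_size * prob_depth(0, coverage)

# 400 bp is an assumed read length, not a figure from the slides.
c = mean_coverage(400_000, 400, 5_200_000)
gaps_at_5x = expected_uncovered_bases(5.0, 5_200_000)
```

Even at a mean depth of 5x, the model predicts tens of thousands of uncovered bases in a 5.2 Mbp genome, which is why uneven coverage appears among the technical challenges.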

Page 23: Genome Assembly

[Flowchart: analysis pipeline / strategy]

Inputs: 454 raw reads; Illumina raw reads
Pre-processing: 454 reads; Illumina reads
Statistical analysis: read stats
Reference selection: published genomes from public databases (V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O); align Illumina against the reference; compare mapping statistics; reference genome
Illumina / 454 / hybrid de novo assembly (contigs * 3):
• 454 de novo: Newbler, CABOG, SUTTA
• Illumina de novo: Allpaths LG, SOAP DeNovo, Velvet, Abyss, Taipan, Bambus2, SUTTA
• Hybrid de novo: Ray, MIRA
Align Illumina reads against 454 contigs: contigs; unmapped reads
Illumina/(454?) reference-based assembly: AMOScmp; contigs; unmapped reads
Evaluation; reference evaluation; parameter optimization
Contig merging: all possible combinations of the best 3 (Minimus, MAIA)
Genome finishing: gap filling; nucleotide identity; scaffolds; draft/finished genome
Other tools labelled in the flowchart: Fastqc, Prinseq, NGS QC; samstats; bwa; Mac vector, CLC wb; GAGE, Hawk-eye; DNA Diff; PAGIT, Mauve; GRASS, built-in
Legend: process; 454; Illumina; hybrid; info.; chosen ref.; assemblers

Page 24: Genome Assembly

Tool Selection - Assembly Algorithm profile

Page 25: Genome Assembly

Greedy / Seed-and-extension / Graph based / Branch-and-Bound

• Basic operation: given any read or contig, add one more read or contig until no more reads or contigs are available
• The contigs grow by "greedy extension", always incorporating the read that is found with the highest scoring overlap
• Makes locally optimal choice with the hope of finding a globally optimal choice
• No foresight -> misassembly
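The greedy strategy can be sketched in Python as repeated merging of the pair with the longest exact overlap. This is a toy illustration under simplifying assumptions (error-free reads, forward strand only, exact suffix/prefix matches), not the algorithm of any specific assembler. The reads are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_i is None:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ACGTAC", "TACGGA", "GGATTC"]  # hypothetical reads
result = greedy_assemble(reads)
```

Because each merge is chosen only by the current best overlap, a repeat longer than the reads can cause two unrelated regions to be joined. That is the "no foresight" weakness named on the slide.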

Page 26: Genome Assembly

Greedy / Seed-and-extension / Graph based / Branch-and-Bound

Example reads:

It was the best of
was the best of times,
the best of times, it
best of times, it was
of times, it was the
times, it was the worst
times, it was the age
it was the worst of
was the worst of times,
the worst of times, it
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
was the age of foolishness,

• It was the best of times, it was the [worst/age]

Page 27: Genome Assembly

Greedy / Seed-and-Extension / Graph based / Branch-and-Bound

• Variation of the greedy assembler
• Common in aligners, thus some assemblers/aligners may incorporate this approach
• Particularly designed for short reads based on a contig heuristic scheme
• Prefix-tree data structure
• A contig is elongated at either end contingent upon the existence of reads with a prefix of minimal length perfectly matching the end of the contig
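The extension rule in the last bullet can be sketched as below. Two simplifications are assumed and should be read as such: a plain dictionary keyed by read prefix stands in for the prefix tree, and only the right-hand (3') end of the contig is extended. The reads and the seed are hypothetical.

```python
def build_prefix_index(reads, prefix_len):
    """Map each prefix of length prefix_len to the reads that start with it."""
    index = {}
    for read in reads:
        index.setdefault(read[:prefix_len], []).append(read)
    return index

def extend_right(seed, reads, prefix_len):
    """Grow the seed while some unused read's prefix matches the contig end."""
    index = build_prefix_index(reads, prefix_len)
    longest = max(len(r) for r in reads)
    contig, used = seed, {seed}
    extended = True
    while extended:
        extended = False
        # Prefer the longest possible overlap between contig end and read start.
        for k in range(min(longest - 1, len(contig)), prefix_len - 1, -1):
            tail = contig[-k:]
            for read in index.get(tail[:prefix_len], []):
                if read not in used and len(read) > k and read.startswith(tail):
                    contig += read[k:]  # append the non-overlapping part
                    used.add(read)
                    extended = True
                    break
            if extended:
                break
    return contig

reads = ["ACGTAC", "GTACGG", "ACGGAT"]  # hypothetical reads
contig = extend_right("ACGTAC", reads, prefix_len=3)
```

Extension stops as soon as no unused read starts with the current contig end, which is where a real tool would try the other end or pick a new seed.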

Page 28: Genome Assembly

Greedy / Seed-and-extension / Graph based / Branch-and-Bound

Overlap-layout-consensus (OLC): pairwise consensus

• Overlap: find potentially overlapping reads
• Layout: lay out the reads based on matching alignment
• Consensus: derive the DNA sequence consensus by joining read sequences

..ACGATTACAATAGGTT..
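The consensus step can be illustrated with a per-column majority vote over reads that have already been placed. This sketch assumes the overlap and layout stages have produced an offset for every read and that reads align without gaps. The reads are hypothetical fragments of the sequence shown on the slide, with one deliberate base error in the second read.

```python
from collections import Counter

def consensus(layout):
    """layout: list of (offset, read) pairs placing each read on the contig."""
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Majority vote per column; '-' marks positions no read covers.
    return "".join(c.most_common(1)[0][0] if c else "-" for c in columns)

# Second read carries an error (G instead of C) at contig position 7.
layout = [(0, "ACGATTAC"), (3, "ATTAGAAT"), (6, "ACAATAGG"), (9, "ATAGGTT")]
sequence = consensus(layout)
```

The erroneous base is outvoted by the two other reads covering that column, so the consensus recovers the original sequence.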

Page 29: Genome Assembly

Greedy / Seed-and-extension / Graph based / Branch-and-Bound

Hamiltonian approach

Find an assembled sequence that explains the observed sequence = finding a path through a graph that visits every vertex once

[Figure labels: Repeat, Repeat, Repeat]

Page 30: Genome Assembly

Greedy / Seed-and-extension / Graph based / Branch-and-Bound

de Bruijn graph

• Basic operation: k-mer approach
• Eulerian approach

Reads: AAGA ACTT ACTC ACTG AGAG CCGA CGAC CTCC CTGG CTTT…

[Figure: de Bruijn graph. Node labels: AAG, AGA, GAC, ACT, CTT, TTT, CTG, TGG, GGG, GGA, CTC, TCC, CCG, CGA]

Potential genomes:
AAGACTCCGACTGGGACTTT
AAGACTGGGACTCCGACTTT
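The k-mer and Eulerian-path idea can be sketched as follows. Because the read list on the slide is truncated, this example generates its own 4-mers from the first potential genome and assumes each k-mer is observed exactly as often as it occurs. Real data would need error correction and multiplicity estimation. Nodes are 3-mers, edges are 4-mers, and the path is found with Hierholzer's algorithm.

```python
from collections import Counter, defaultdict

def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_graph(kmer_list):
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    in_degree = Counter(t for targets in graph.values() for t in targets)
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_degree[node] == 1:
            start = node  # the one node with an extra outgoing edge
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "AAGACTCCGACTGGGACTTT"
reads = kmers(genome, 4)
assembled = spell(eulerian_path(build_graph(reads)))
```

The result uses every k-mer exactly once, but it may be either of the two potential genomes on the slide: the repeated GACT lets the two loops be traversed in either order, and the graph alone cannot tell which is correct.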

Page 31: Genome Assembly

Greedy Seed-and-extension Graph based Branch-and-Bound

Basic operation: relies on “consistent layouts”; it generates all possible consistent layouts organizing them as paths in a “double tree” structure, rooted at a randomly selected seed read

Progressive evaluation of optimal criteria encoded by a set of score functions based on the set of overlaps along the layout

Page 32: Genome Assembly

Tid-bits of advice

Algorithm classes: Greedy; Seed-and-Extension; OLC; De Bruijn; Branch-and-Bound

Advantages: Guaranteed to find a solution; Sensitivity; Suitable for low coverage long reads; Repeats are immediately recognized, suitable for high coverage short reads; Algorithm allows for checks

Disadvantages: Misassembly; Easily confused by complex repeats; Can be very slow, memory usage; Computation of overlaps time intensive; RAM intensive; Ambiguities delay pruning

Page 33: Genome Assembly

Tools of Choice

Page 34: Genome Assembly

454 platform assembly

Name | Algorithm | Supporting evidence
Newbler 2.5 | OLC | Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
CABOG | OLC | Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
SUTTA | Branch-and-Bound | Feature-by-Feature – Evaluating De Novo Sequence Assembly

Page 35: Genome Assembly

Evaluation of 454 assemblers Genomes Used For Comparison

Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280

Page 36: Genome Assembly

Comparison of 454 assemblers using E. coli genome

Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280

Page 37: Genome Assembly

Comparison of 454 assemblers using E. coli genome

• The maximum value reached by the bars is the hypothetical reconstruction (HR), defined as the ratio between the assembled bases and the reference length.
• The white section represents the real reconstruction (RR), i.e. the portion of genome correctly reconstructed by assemblers.
• The difference between HR and RR, here called erroneous reconstruction (ER), is shown in black.
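The three quantities are simple ratios, sketched below. How the number of correctly reconstructed bases is obtained (for example, by aligning contigs to the reference) is outside this sketch, and the counts used are hypothetical.

```python
def reconstruction_metrics(assembled_bases, correct_bases, reference_length):
    """Return HR, RR and ER as fractions of the reference length."""
    hr = assembled_bases / reference_length   # hypothetical reconstruction
    rr = correct_bases / reference_length     # real reconstruction
    return {"HR": hr, "RR": rr, "ER": hr - rr}  # ER: erroneous reconstruction

# Hypothetical counts: 1,100 assembled bases, 950 of them correct,
# against a 1,000 bp reference.
metrics = reconstruction_metrics(1100, 950, 1000)
```

An HR above 1 means the assembly contains more bases than the reference, which by itself already signals duplicated or erroneous sequence.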

Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data Brief Bioinform (2012) 13(3): 269-280

Page 38: Genome Assembly

Illumina platform assembly

Name | Algorithm | Supporting evidence
ALLPATHS-LG | de Bruijn | GAGE: A critical evaluation of genome assemblies and assembly algorithms
Velvet | de Bruijn | Comparative studies of de novo assembly tools for next-generation sequencing technologies
Taipan | Hybrid (greedy-based and graph) | A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies
SOAPdenovo | de Bruijn | Feature-by-Feature – Evaluating De Novo Sequence Assembly
SUTTA | Branch-and-Bound | Feature-by-Feature – Evaluating De Novo Sequence Assembly

Page 39: Genome Assembly

Evaluation of illumina assemblers Genomes Used For Comparison

GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Page 40: Genome Assembly

Comparison of illumina assemblers

• The best value for each column is shown in bold. For all assemblies

• The Errors column contains the number of misjoins plus indel errors >5 bp for contigs, and the total number of misjoins for scaffolds.

• Corrected N50 values were computed after correcting contigs and scaffolds by breaking them at each error. See the evaluation section in the text for details on how errors were identified.
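The corrected N50 described above can be sketched by cutting each contig at its error positions and recomputing N50 on the pieces. This is an illustration of the idea only, not the GAGE scripts. The contig lengths and error coordinates are hypothetical.

```python
def n50(lengths):
    """Contig length at which the running total reaches half of all bases."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def break_at_errors(length, error_positions):
    """Split a contig of the given length at each error coordinate."""
    pieces, previous = [], 0
    for position in sorted(error_positions):
        pieces.append(position - previous)
        previous = position
    pieces.append(length - previous)
    return [p for p in pieces if p > 0]

def corrected_n50(contigs):
    """contigs: list of (length, [error positions]) pairs."""
    pieces = []
    for length, errors in contigs:
        pieces.extend(break_at_errors(length, errors))
    return n50(pieces)

# Hypothetical assembly: one 100 bp contig with two errors, one clean 50 bp contig.
contigs = [(100, [30, 70]), (50, [])]
raw = n50([length for length, _ in contigs])
corrected = corrected_n50(contigs)
```

The raw N50 of 100 drops to 40 once the erroneous contig is broken, showing how misassemblies can inflate size-based metrics.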

GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Page 41: Genome Assembly

Comparison of Illumina assemblers

• A "chaff" contig is defined as a single contig <200 bp in length. In many cases, these contigs can be as small as the k-mer size used to build the de Bruijn graph (e.g., 36 bp) and are too short to support any further genomic analysis.

• A duplicated repeat is one that appears in more copies than necessary in the assembly, and a compressed repeat is one that occurs in fewer copies.

GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Page 42: Genome Assembly

Comparison of Illumina assemblers

• "Misjoin" errors are perhaps the most harmful type, in that they represent a significant structural error. A misjoin occurs when an assembler incorrectly joins two distant loci of the genome, which most often occurs within a repeat sequence.

• We have tallied three types of misjoins: (1) inversions, where part of a contig or scaffold is reversed with respect to the true genome; (2) relocations, or rearrangements that move a contig or scaffold within a chromosome; and (3) translocations, or rearrangements between chromosomes.

GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Page 43: Genome Assembly

Comparison of Illumina assemblers

• Average contig (A) and scaffold (B) sizes, measured by N50 values, versus error rates, averaged over all three genomes for which the true assembly is known: S. aureus, R. sphaeroides, and human chromosome 14.
• Errors (vertical axis) are measured as the average distance between errors, in kilobases.
• In both plots, the best assemblers appear in the upper right.

GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Page 44: Genome Assembly

Applicability of assemblers Genomes used for comparison

A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12

Page 45: Genome Assembly

Comparison of illumina assemblers

A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12

Page 46: Genome Assembly

Comparison of illumina assemblers

A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. Plos One. 2011 6: 1-12

Page 47: Genome Assembly

Hybrid Platform Assembly

Name | Algorithm | Supporting evidence
Ray | SBH | Feature-by-Feature – Evaluating De Novo Sequence Assembly

Page 48: Genome Assembly

Feature-by-Feature – evaluating de-novo assembly

• COMPRESSION: Area representing a possible repeat collapse
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->)
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere)
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage
• KMER_COV: Problematic k-mer distribution

Page 49: Genome Assembly

Feature-by-Feature: evaluating de-novo assembly Real Data - Long Reads

Page 50: Genome Assembly

Feature-by-Feature – evaluating de-novo assembly Real Data - Short Reads

Page 51: Genome Assembly

Final Approach

Page 52: Genome Assembly

[Flowchart: final pipeline]

Inputs: 454 raw reads; Illumina raw reads
Pre-processing: 454 reads; Illumina reads
Statistical analysis: read stats
Reference selection: published genomes from public databases (V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O); align Illumina against the reference; compare mapping statistics; reference genome
Illumina / 454 / hybrid de novo assembly (contigs * 3):
• 454 de novo: Newbler, CABOG, SUTTA
• Illumina de novo: Allpaths LG, SOAP DeNovo, Velvet, Taipan, SUTTA
• Hybrid de novo: Ray
Align Illumina reads against 454 contigs: contigs; unmapped reads
Illumina/(454?) reference-based assembly: AMOScmp; contigs; unmapped reads
Evaluation; reference evaluation; parameter optimization
Contig merging: all possible combinations of the best 3 (Minimus, MAIA)
Genome finishing: gap filling; nucleotide identity; scaffolds; draft/finished genome
Other tools labelled in the flowchart: Fastqc, Prinseq, NGS QC; samstats; bwa; Mac vector, CLC wb; GAGE, Hawk-eye; DNA Diff, MUMmer; PAGIT, Mauve; GRASS, built-in
Legend: process; 454; Illumina; hybrid; info.; chosen ref.; assemblers

Page 53: Genome Assembly

References

1. Finotello, F., et al., Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform, 2012. 13(3): p. 269-80.
2. Vezzi, F., G. Narzisi, and B. Mishra, Feature-by-feature--evaluating de novo sequence assembly. PLoS One, 2012. 7(2): p. e31002.
3. Zhang, W., et al., A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One, 2011. 6(3): p. e17915.
4. Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res, 2012. 22(3): p. 557-67.
5. Narzisi, G. and B. Mishra, Comparing de novo genome assembly: the long and short of it. PLoS One, 2011. 6(4): p. e19175.
6. Miller, J.R., S. Koren, and G. Sutton, Assembly algorithms for next-generation sequencing data. Genomics, 2010. 95(6): p. 315-27.
7. Li, Z., et al., Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics, 2012. 11(1): p. 25-37.
8. Lin, Y., et al., Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 2011. 27(15): p. 2031-7.
9. Zhang, J., et al., The impact of next-generation sequencing on genomics. J Genet Genomics, 2011. 38(3): p. 95-109.