1
De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer
David Hernandez, Patrice François, Laurent Farinelli, Magne
Østerås, Jacques Schrenzel
Presented by Lucas Lochovsky
2
Outline
1. Introduction
2. Edena’s Methodology
   Reducing Read Redundancy
   Overlap Graph Construction
   Transitive Edge Reduction
   Graph Cleanup
   Contig Production
3. Results
   Assemblers
   Assembly tasks
4. Additional Edena Analyses
   Graph Cleaning Effectiveness
   Effective Coverage Depth
5. Conclusions
4
1) Introduction
NGS will allow us to explore strange new genomes, blah blah blah…
WGS assemblers we’ve covered so far:
  Medvedev-Brudno assembler
  Arachne
  AMOS-Cmp
  Velvet
  ALLPATHS
Think you’ve seen it all?
5
1) Introduction (cont’d)
Edena: De novo short read assembler
  Uses a classic overlap graph approach to assembly
  Anyone else get a feeling of déjà vu?
Compare to other recently published NGS read assemblers
  De novo assembly of two bacterial genomes sequenced with the Illumina/Solexa platform
7
2) Edena’s Methodology
Built around a standard overlap-layout-consensus workflow
Opted to use exact matching for overlap detection
  Reduces # of spurious overlaps
  Faster than using approximate matching
Also assumes that all reads have the same length
  Is this assumption valid?
8
2) Edena’s Methodology (cont’d)
Four major steps:
1. Remove redundant reads so that dataset size is more manageable
2. Overlap detection and overlap graph construction
3. Graph cleaning: simplification and ambiguity resolution
4. Produce contigs
9
2) Edena’s Methodology (cont’d)
1) Practice your 3 R’s: Reducing Read Redundancy
Illumina Genome Analyzer has a high amount of over-sampling → many redundant reads
Reduce dataset so it contains only a single copy of each read → non-redundant
Index all reads into a prefix tree
  Identical reads will be mapped to the same key → no duplicate reads in this structure
10
2) Edena’s Methodology (cont’d)
Prefix trees are associative arrays for strings where all descendants of a node have a common prefix
Reads and their reverse complements are considered the same read → merged into the same tree key
11
2) Edena’s Methodology (cont’d)
Ambiguous reads discarded, since they won’t work with exact matching
  Opens up possibility of coverage gaps in read data (not explored by the authors)
Original read data still useful for getting read frequencies
  Contig coverage depth
  Repeat identification
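The read-reduction step on the last three slides can be sketched in a few lines. This is a minimal illustration, not Edena's code: a plain dictionary stands in for the prefix tree, the helper names (`reverse_complement`, `canonical`, `reduce_reads`) are made up for this example, and taking the lexicographically smaller strand as the shared key is an assumption about how a read and its reverse complement might be merged.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read: str) -> str:
    """Reverse complement of an unambiguous DNA read."""
    return read.translate(_COMPLEMENT)[::-1]

def canonical(read: str) -> str:
    """One shared key for a read and its reverse complement (assumed convention)."""
    return min(read, reverse_complement(read))

def reduce_reads(reads):
    """Map each canonical read to the number of raw reads collapsed into it."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        if set(read) - set("ACGT"):   # ambiguous base such as N: discard the read
            continue
        counts[canonical(read)] += 1
    return counts

raw = ["ACGTT", "AACGT", "ACGTT", "ACNTT", "GGGCC"]   # toy reads, not real data
non_redundant = reduce_reads(raw)
```

The keys form the non-redundant dataset; the counts keep the read frequencies that are later useful for coverage depth and repeat identification.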
12
2) Edena’s Methodology (cont’d)
2) Overlap Graph Construction
Non-redundant read dataset is indexed by a suffix array
  Déjà vu moment: Almost exactly like suffix trees from MUMmer/MUMmerGPU!
Information used to produce a bidirected overlap graph
  Déjà vu moment: Just like the Medvedev-Brudno assembler! (which I presented!)
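To make exact-match overlap detection concrete, here is a rough sketch under simplifying assumptions: a dictionary of read prefixes replaces the suffix array, only one strand is considered (so the result is a plain directed graph, not a bidirected one), and the function name `exact_overlaps` is invented for this example.

```python
from collections import defaultdict

def exact_overlaps(reads, min_overlap):
    """Return (i, j, k) triples: the last k bases of reads[i] equal the first k of reads[j]."""
    length = len(reads[0])
    assert all(len(r) == length for r in reads), "reads must share one length"
    # Index every proper prefix of length >= min_overlap (stand-in for the suffix array).
    by_prefix = defaultdict(list)
    for j, read in enumerate(reads):
        for k in range(min_overlap, length):
            by_prefix[read[:k]].append(j)
    found = []
    for i, read in enumerate(reads):
        for k in range(min_overlap, length):
            for j in by_prefix.get(read[-k:], ()):
                if i != j:
                    found.append((i, j, k))
    return found

reads = ["ACGTAC", "GTACGG", "ACGGTT"]   # toy reads of equal length
overlaps = exact_overlaps(reads, min_overlap=3)
```

Each triple is one edge of the overlap graph, labelled with its overlap length. The equal-length check mirrors the assumption the authors make about their reads.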
13
2) Edena’s Methodology (cont’d)
This slide should be review for all of you!
Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends
Gives rise to three types of edges:
  Edges where one arrow points out of a vertex, and one arrow points into a vertex
  Edges with both arrows pointing out, and
  Edges with both arrows pointing in (easiest one to do in PowerPoint!)
For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex
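The walk rule is easy to check mechanically. In this sketch each edge is a small dictionary giving the arrowhead ("in" or "out") at each of its two ends; that encoding and the name `is_valid_walk` are choices made for this example only, and self-loops are not handled.

```python
def is_valid_walk(vertices, edges):
    """Check the bidirected walk rule.

    vertices: [v0, ..., vn]; edges[i] joins vertices[i] and vertices[i + 1].
    Each edge is a dict {vertex: "in" | "out"} giving the arrowhead at each end.
    """
    if len(edges) != len(vertices) - 1:
        return False
    for i, edge in enumerate(edges):
        if set(edge) != {vertices[i], vertices[i + 1]}:
            return False          # edge does not join these two vertices
    for i in range(1, len(vertices) - 1):
        v = vertices[i]
        if edges[i - 1][v] == edges[i][v]:
            return False          # entering and leaving arrowheads must differ
    return True

e_ab = {"a": "out", "b": "in"}    # one arrow out of a, one arrow into b
e_bc = {"b": "out", "c": "out"}   # both arrows point out
e_bd = {"b": "in", "d": "in"}     # both arrows point in

ok = is_valid_walk(["a", "b", "c"], [e_ab, e_bc])    # "in" at b, then "out" at b
bad = is_valid_walk(["a", "b", "d"], [e_ab, e_bd])   # "in" at b twice
```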
14
2) Edena’s Methodology (cont’d)
More review!
In a bidirected overlap graph, each vertex is a double-stranded read
Edges represent read overlaps
Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges)
Suppose we have two ds reads r1 and r2
  Each read can be oriented to the left or to the right
The three possible overlaps are:
  i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way)
  ii) r1 points left and r2 points right
  iii) r1 points right and r2 points left
15
2) Edena’s Methodology (cont’d)
Parameter: Minimum overlap size Sensitivity vs. specificity tradeoff
Small value: Higher frequency of chance overlaps → causes path branching in graph (sensitivity favoured)
Large value: Creates more dead-end (DE) paths, i.e. reads not extended by overlapping reads on one side (specificity favoured)
16
2) Edena’s Methodology (cont’d)
3a) Transitive Edge Reduction
Simplifies the graph by removing nonessential (transitive) edges
Generally speaking, if edges v1 → v2, v2 → v3 and v1 → v3 all exist, the edge v1 → v3 is redundant and can be removed: the path v1 → v2 → v3 already represents the same sequence
Reduces graph complexity by a factor of the over-sampling rate c = NL/G
  N: Number of reads
  L: Read length
  G: Genome size
17
2) Edena’s Methodology (cont’d)
For sequences, it’s about removing the overlap between two reads when a third read sits between them and overlaps each of them to a greater extent, so the chain through that read spells the same sequence
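In the usual graph-theoretic sense, transitive reduction drops an edge u → w whenever some v gives both u → v and v → w. The sketch below does exactly that on a plain directed graph. It is a naive illustration that assumes an acyclic graph and ignores strands and overlap lengths; it is not the procedure Edena implements, and `transitive_reduction` is a name chosen here.

```python
def transitive_reduction(graph):
    """graph: {node: set of successors}. Returns a reduced copy.

    An edge u -> w is dropped when some v gives u -> v and v -> w.
    """
    reduced = {u: set(succ) for u, succ in graph.items()}
    for u, succ in graph.items():
        for v in succ:
            for w in graph.get(v, ()):
                if w in succ and w != v:
                    reduced[u].discard(w)
    return reduced

# r1 overlaps r2 and r3, and r2 overlaps r3: the edge r1 -> r3 is transitive.
graph = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
reduced = transitive_reduction(graph)
```

After reduction only the chain r1 → r2 → r3 remains, which still spells the same sequence.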
18
2) Edena’s Methodology (cont’d)
3b) Graph Cleanup
Can have multiple paths branching off a single node (branching paths)
  Due to genomic repetitions, sequencing errors, and clonal polymorphisms
Genomic repetitions cannot be fixed without additional information
  But the other two can be resolved
19
2) Edena’s Methodology (cont’d)
Sequencing errors produce short dead-end (DE) paths
Attempt to elongate branching nodes up to a certain depth md (minimum depth)
Reads that cannot be extended to a depth of md are removed
Experimentally determined that md=10 is the best value
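A simplified picture of dead-end removal, assuming a plain directed graph in which every node appears as a key and depth is counted in nodes. The function `clip_dead_ends` and its exact stopping rules are illustrative guesses at the idea on this slide, not Edena's actual procedure.

```python
def clip_dead_ends(graph, md):
    """Remove branches that leave a branching node and dead-end in fewer than md nodes.

    graph: {node: set of successors}, every node present as a key.
    Returns a new graph; the input is left untouched.
    """
    preds = {u: set() for u in graph}
    for u, succ in graph.items():
        for v in succ:
            preds[v].add(u)
    doomed = set()
    for u, succ in graph.items():
        if len(succ) < 2:
            continue                      # only branching nodes are examined
        for start in succ:
            path, node = [], start
            # Follow the branch while it stays a simple chain.
            while len(path) < md and len(preds[node]) == 1:
                path.append(node)
                if len(graph[node]) != 1:
                    break
                node = next(iter(graph[node]))
            if path and len(path) < md and not graph[path[-1]]:
                doomed.update(path)       # short chain that ends in a dead end
    return {u: {v for v in succ if v not in doomed}
            for u, succ in graph.items() if u not in doomed}

# Node b branches: c-d-e is long enough to keep, x-y is a short dead end.
graph = {"a": {"b"}, "b": {"c", "x"}, "c": {"d"}, "d": {"e"},
         "e": set(), "x": {"y"}, "y": set()}
cleaned = clip_dead_ends(graph, md=3)
```

With md=3 the two-node branch x → y is removed and the three-node branch survives. A larger md would also clip the genuine branch, which is one way to see why the choice of md matters.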
20
2) Edena’s Methodology (cont’d)
21
2) Edena’s Methodology (cont’d)
Also disambiguate bubbles in the graph caused by single base substitutions (aka “p-bubbles”)
Length of p-bubble is at most ms = 4L - 2T - 1
  L: Read length
  T: Min. overlap size
Explore each branching path up to length ms (guaranteed upper bound)
Remove path with less coverage
Polymorphisms can be retained for later analysis
22
2) Edena’s Methodology (cont’d)
23
2) Edena’s Methodology (cont’d)
4) Contig Production
If run in strict mode, Edena starts generating contig sequences
In non-strict mode, one more cleaning step is performed
  Longer overlaps more reliable than shorter ones
  Save only edges at branching nodes that have the highest overlap of all edges
Produce contig sequence by following non-intersecting simple paths in overlap graph
  Nodes must have in-degree and out-degree of exactly one
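Following simple paths can be sketched as below, again on a single-strand directed graph. The data layout (`reads` as id → sequence, `edges` as id → {successor: overlap length}) and the function name are assumptions for this example. A chain is extended across an edge only when the source has out-degree one and the target has in-degree one, which is one common reading of the degree condition above; pure cycles are ignored.

```python
def contigs_from_simple_paths(reads, edges):
    """reads: {id: sequence}; edges: {id: {successor id: overlap length}}.

    Returns contig sequences spelled along maximal non-branching chains.
    """
    indeg = {r: 0 for r in reads}
    for u in edges:
        for v in edges[u]:
            indeg[v] += 1
    outdeg = {r: len(edges.get(r, {})) for r in reads}

    def chain_link(u):
        """Successor of u and overlap length, if the edge is unambiguous at both ends."""
        if outdeg[u] != 1:
            return None
        (v, ovl), = edges[u].items()
        return (v, ovl) if indeg[v] == 1 else None

    # Nodes reached through an unambiguous link sit inside a chain, not at its start.
    has_link_in = {link[0] for u in reads if (link := chain_link(u))}
    contigs = []
    for start in reads:
        if start in has_link_in:
            continue
        seq, node = reads[start], start
        while (link := chain_link(node)):
            node, ovl = link
            seq += reads[node][ovl:]      # append only the non-overlapping tail
        contigs.append(seq)
    return contigs

reads = {0: "ACGTAC", 1: "GTACGG", 2: "ACGGTT"}   # toy reads
edges = {0: {1: 4}, 1: {2: 4}}                    # overlaps of 4 bases
contigs = contigs_from_simple_paths(reads, edges)
```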
25
3) Results
Survivor: WGS Assembly
  Four assemblers
  Two challenges
  One winner
26
3) Results (cont’d)
Contestant #1: SSAKE
  Indexes reads in a prefix tree based upon first eleven 5’ bases
  Identify highest possible overlap between pairs of reads
  Use most highly-covered reads as starting points for read extension (i.e. assembly “nucleation points”)
  So far only used for partial genome sequencing for comparative metagenomic analysis (e.g. bacterial species distinction)
27
3) Results (cont’d)
Contestant #2: Velvet
  k-mer/q-gram/k-gram/q-mer de Bruijn graph representation of reads
Contestant #3: SHARCGS
  Can accept base quality scores along with read data for read filtering (low quality reads discarded)
  Also filter out reads with low coverage
  Assembly performed with a prefix tree
Contestant #4: Edena
28
3) Results (cont’d)
Reward Challenge
  Assemble the 2.82 Mbp genome sequence and the 20.7 Kbp plasmid sequence of the Staphylococcus aureus MW2 strain from Illumina reads
Immunity Challenge
  Assemble the 1.55 Mbp genome sequence and the 3.66 Kbp plasmid sequence of the Helicobacter acinonychis Sheeba strain from Illumina reads
29
3) Results (cont’d)
Staphylococcus aureus results
Evaluated each assembler on the parameter configurations that produced the best results
  Edena: Min. overlap size: 21 bases
  Velvet: k-mer value: 23
  SHARCGS: Max. gap span: 14
  SSAKE: Default parameters
30
3) Results (cont’d)
Compared contig assembly to published reference sequence
Non-strict mode tends to produce longer contigs at the expense of additional misassemblies
Velvet comparable to Edena strict
31
3) Results (cont’d)
SHARCGS unable to assemble significant contigs → insufficient coverage depth
SSAKE produced a large number of mismatches, mostly at contig boundaries
32
3) Results (cont’d)
Authors also tried combining contig results from Edena and Velvet due to significant overlaps between their contigs
N50 and mean contig size increased relative to original results
Edena non-strict has similar influence on results as previously
3) Results (cont’d)
Helicobacter acinonychis results
Best parameter settings:
  Edena: Min. overlap size: 27 (strict), 26 (non-strict)
  Velvet: k-mer value: 27
  SHARCGS: Max. gap span: 10 (also must remove last four bases from each read)
  SSAKE: Default parameters
34
3) Results (cont’d)
Results similar to those from the previous assembly challenge
35
3) Results (cont’d)
Survivor: WGS Assembly Conclusion
  Granted Immunity: Edena, Velvet
  Sent to the Tribal Council: SSAKE, SHARCGS
37
4) Additional Edena Analyses
Graph Cleaning Effectiveness
Demonstrate the effectiveness of DE path removal and p-bubble fixing
Created an ideal read pool from the S. aureus MW2 strain
  Consists of one read at every possible position
  No errors
  No polymorphisms
Distinguish between positive and negative reads
  Positive reads have at least one exact occurrence in the reference sequence
  Negative reads have none
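A toy version of the ideal read pool and the positive/negative test might look like this. It treats the reference as linear (a real bacterial chromosome is circular) and counts a match on either strand as an occurrence; both are assumptions made for this sketch, as are the function names and the toy reference.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def ideal_read_pool(reference, read_length):
    """One error-free read starting at every position of a linear reference."""
    return [reference[i:i + read_length]
            for i in range(len(reference) - read_length + 1)]

def is_positive(read, reference):
    """True if the read occurs exactly in the reference on either strand."""
    return read in reference or reverse_complement(read) in reference

reference = "ACGTTGCAAG"               # toy sequence, not real data
pool = ideal_read_pool(reference, 4)
```

Every read in `pool` is positive by construction; a read carrying a base-calling error would typically be negative.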
38
4) Additional Edena Analyses (cont’d)
Ideal dataset indicates branching nodes and p-bubbles caused by genomic repetition
Anomalies in real datasets only due to negative reads
Due to small quantity of branching nodes in the ideal dataset, branch removal procedure is extremely effective
39
4) Additional Edena Analyses (cont’d)
Though many p-bubbles consist of sequences made of negative reads, most cannot be explained by base calling errors
Thought to correspond to underrepresented clonal polymorphisms
40
4) Additional Edena Analyses (cont’d)
Since there are no DE paths in the ideal dataset, expect that DE removal should remove all DE paths in real dataset (i.e. dead-ends correspond to negative reads)
From tests with different md values (below), authors decided 10 was best
Not so clear-cut to me
41
4) Additional Edena Analyses (cont’d)
Most DE paths have length 1
  Correspond to paths created by base calling errors
Longer DE paths exist that do not appear to be caused by such errors
  Thought to be clonal polymorphisms in low abundance → can’t form a complete p-bubble
42
4) Additional Edena Analyses (cont’d)
Effective Coverage Depth
Computed effective coverage depth according to formula from Lander and Waterman
  E = N(L-T)/G
    N: # of usable reads
    L: Read length
    T: Req. overlap length
    G: Genome size
Can also estimate gaps in read coverage with N·e^(-E)
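The two formulas are quick to evaluate. The numbers below are hypothetical round values chosen to keep the arithmetic easy; they are not the read counts from the paper.

```python
import math

def effective_coverage(n_reads, read_length, min_overlap, genome_size):
    """E = N(L - T) / G."""
    return n_reads * (read_length - min_overlap) / genome_size

def expected_gaps(n_reads, effective_cov):
    """Expected number of coverage gaps, N * e^(-E)."""
    return n_reads * math.exp(-effective_cov)

# Hypothetical inputs: 1 million usable reads of 35 bases, 21-base required
# overlap, 2.8 Mbp genome.
N, L, T, G = 1_000_000, 35, 21, 2_800_000
E = effective_coverage(N, L, T, G)   # 1e6 * 14 / 2.8e6 = 5.0
gaps = expected_gaps(N, E)           # about 6.7 thousand
```

Because the gap estimate falls off exponentially in E, even a modest increase in effective coverage predicts far fewer gaps, provided reads are sampled uniformly.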
43
4) Additional Edena Analyses (cont’d)
S. aureus sequencing
  Raw coverage depth: 48x
  Effective coverage depth: 14x
H. acinonychis sequencing
  Raw coverage depth: 284x
  Effective coverage depth: 36x
Statistics imply that there should be no gaps in H. acinonychis assembly, and only a few in S. aureus
  But each actual assembly contained several hundred gaps
44
4) Additional Edena Analyses (cont’d)
Statistics assume uniform read sampling
Investigated underrepresented parts of genomes
  After alignment of reads to reference genome, extracted low coverage sequences
  These sequences have complex motifs and single base repeats → cause difficulty in replication
46
5) Conclusions
Edena holds up well against other recent assemblers, in both assembly quality and computational resources
Some assemblers are partially complementary to each other (Edena and Velvet) → can use together to produce results better than each individual assembler’s results
Rise of NGS paired read data will help produce longer contigs and clean up ambiguities
47
Is Edena The One?
The One that will herald the beginning of cost-effective whole genome assembly with NGS?
Maybe you should ask the Oracle…
48
That’s all folks!
Discussion Questions
  What were the strengths/weaknesses of Edena? How would you improve it?
  How do you think Edena compares to the other assemblers tested? Would you test it against other assemblers not tested here?
  Given Edena’s limitations, would you trust it for de novo genome assembly over traditional sequence assembly?
  Why did we have to discuss yet another NGS genome assembler today?