
De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

David Hernandez, Patrice François, Laurent Farinelli, Magne Østerås, Jacques Schrenzel

Presented by Lucas Lochovsky

Outline
1. Introduction
2. Edena’s Methodology
   - Reducing Read Redundancy
   - Overlap Graph Construction
   - Transitive Edge Reduction
   - Graph Cleanup
   - Contig Production
3. Results
   - Assemblers
   - Assembly tasks
4. Additional Edena Analyses
   - Graph Cleaning Effectiveness
   - Effective Coverage Depth
5. Conclusions


1) Introduction
- NGS will allow us to explore strange new genomes, blah blah blah…
- WGS assemblers we’ve covered so far:
  - Medvedev-Brudno assembler
  - Arachne
  - AMOS-Cmp
  - Velvet
  - ALLPATHS
- Think you’ve seen it all?

1) Introduction (cont’d)
- Edena: De novo short read assembler
  - Uses a classic overlap graph approach to assembly
  - Anyone else get a feeling of déjà vu?
- Compare to other recently published NGS read assemblers
  - De novo assembly of two bacterial genomes sequenced with the Illumina/Solexa platform


2) Edena’s Methodology
- Built around a standard overlap-layout-consensus workflow
- Opted to use exact matching for overlap detection
  - Reduces the number of spurious overlaps
  - Faster than using approximate matching
- Also assumes that all reads have the same length
  - Is this assumption valid?

2) Edena’s Methodology (cont’d)
Four major steps:
1. Remove redundant reads so that dataset size is more manageable
2. Overlap detection and overlap graph construction
3. Graph cleaning: simplification and ambiguity resolution
4. Produce contigs

1) Practice your 3 R’s: Reducing Read Redundancy
- Illumina Genome Analyzer has a high amount of over-sampling → many redundant reads
- Reduce dataset so it contains only a single copy of each read → non-redundant
- Index all reads into a prefix tree
  - Identical reads will be mapped to the same key → no duplicate reads in this structure



Prefix trees are associative arrays for strings where all descendants of a node have a common prefix

Reads and their reverse complements are considered the same read → merged into the same tree key
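
To make the prefix-tree idea concrete, here is a minimal sketch, not Edena's actual code. The class name `ReadTrie`, the helper names, and the choice of the lexicographically smaller strand as the shared key are all assumptions made for illustration.

```python
# Illustrative sketch only: names and data layout are assumptions, not Edena's code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    return read.translate(COMPLEMENT)[::-1]

def canonical(read: str) -> str:
    """Pick one representative for a read and its reverse complement."""
    return min(read, reverse_complement(read))

class ReadTrie:
    """Prefix tree of nested dicts; the 'count' entry marks the end of a read."""

    def __init__(self):
        self.root = {}

    def add(self, read: str) -> None:
        node = self.root
        for base in canonical(read):
            node = node.setdefault(base, {})
        # Identical reads (either strand) end at the same node.
        node["count"] = node.get("count", 0) + 1

    def items(self):
        """Yield (read, count) for every distinct read stored."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            for key, child in node.items():
                if key == "count":
                    yield prefix, child
                else:
                    stack.append((prefix + key, child))
```

Keeping the count at each key preserves the read frequencies even though only one copy of each read is stored.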

- Ambiguous reads discarded, since they won’t work with exact matching
  - Opens up possibility of coverage gaps in read data (not explored by the authors)
- Original read data still useful for getting read frequencies
  - Contig coverage depth
  - Repeat identification

2) Overlap Graph Construction
- Non-redundant read dataset is indexed by a suffix array
  - Déjà vu moment: Almost exactly like suffix trees from MUMmer/MUMmerGPU!
- Information used to produce a bidirected overlap graph
  - Déjà vu moment: Just like the Medvedev-Brudno assembler! (which I presented!)
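
A rough sketch of how exact overlaps could be looked up from sorted read suffixes. This is a simplified stand-in for a real suffix array: it stores the suffix strings themselves, only handles the forward strand, and the function names and the `min_overlap` parameter name are made up for illustration.

```python
from bisect import bisect_left

def build_suffix_index(reads, min_overlap):
    """Sorted list of (suffix, read_id) for every proper suffix of length >= min_overlap."""
    index = []
    for rid, read in enumerate(reads):
        for start in range(1, len(read) - min_overlap + 1):
            index.append((read[start:], rid))
    index.sort()
    return index

def exact_overlaps(reads, min_overlap):
    """Return sorted (a, b, k): the last k bases of read a equal the first k bases of read b."""
    index = build_suffix_index(reads, min_overlap)
    overlaps = []
    for b, read in enumerate(reads):
        for k in range(min_overlap, len(read)):
            prefix = read[:k]
            # Binary search for the first suffix equal to this prefix.
            i = bisect_left(index, (prefix, -1))
            while i < len(index) and index[i][0] == prefix:
                a = index[i][1]
                if a != b:  # ignore a read overlapping itself
                    overlaps.append((a, b, k))
                i += 1
    return sorted(overlaps)
```

Each tuple `(a, b, k)` would become one edge of the overlap graph. A bidirected version would also have to search the reverse complement of every read.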

2) Edena’s Methodology (cont’d)
This slide should be review for all of you!
- Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends
- Gives rise to three types of edges:
  - Edges where one arrow points out of a vertex, and one arrow points into a vertex
  - Edges with both arrows pointing out, and
  - Edges with both arrows pointing in (easiest one to do in PowerPoint!)
- For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex
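
The walk rule can be checked mechanically. Below is a small sketch under assumed conventions: each edge records an "in" or "out" arrowhead at each of its two ends, and `BiEdge`, `end_at` and `is_valid_walk` are hypothetical names. Self-loops are not handled.

```python
from typing import NamedTuple

class BiEdge(NamedTuple):
    u: str
    u_end: str  # "in" or "out": arrowhead orientation at vertex u
    v: str
    v_end: str  # "in" or "out": arrowhead orientation at vertex v

def end_at(edge: BiEdge, vertex: str) -> str:
    """Orientation this edge has at the given vertex."""
    if edge.u == vertex:
        return edge.u_end
    if edge.v == vertex:
        return edge.v_end
    raise ValueError(f"{vertex} is not an endpoint of {edge}")

def is_valid_walk(vertices, edges) -> bool:
    """vertices is v0..vn; edges[i] must join vertices[i] and vertices[i + 1]."""
    if len(edges) != len(vertices) - 1:
        return False
    for i, edge in enumerate(edges):
        if {edge.u, edge.v} != {vertices[i], vertices[i + 1]}:
            return False
    # At every inner vertex the entering and leaving orientations must differ.
    for i in range(1, len(vertices) - 1):
        if end_at(edges[i - 1], vertices[i]) == end_at(edges[i], vertices[i]):
            return False
    return True
```

For example, a walk a, b, c is valid when the first edge points into b and the second points out of b, and invalid when both edges point into b.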

2) Edena’s Methodology (cont’d)
More review!
- In a bidirected overlap graph, each vertex is a double-stranded read
- Edges represent read overlaps
- Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges)
- Suppose we have two ds reads r1 and r2
  - Each read can be oriented to the left or to the right
- The three possible overlaps are:
  i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way)
  ii) r1 points left and r2 points right
  iii) r1 points right and r2 points left

- Parameter: Minimum overlap size
  - Sensitivity vs. specificity tradeoff
  - Small value: Higher frequency of chance overlaps → causes path branching in graph (sensitivity favoured)
  - Large value: Creates more dead-end (DE) paths, i.e. reads not extended by overlapping reads on one side (specificity favoured)

3a) Transitive Edge Reduction
- Simplifies the graph by removing nonessential edges
- Generally speaking, when edges v1 → v2, v2 → v3 and v1 → v3 all exist, the edge v1 → v3 is transitive and can be removed: the path v1 → v2 → v3 represents the same sequence
- Reduces graph complexity by a factor of the over-sampling rate c = NL/G
  - N: Number of reads
  - L: Read length
  - G: Genome size
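
A minimal sketch of transitive edge reduction on a plain directed graph, plus the over-sampling rate c = NL/G. It drops an edge u → w whenever some v gives u → v → w. Only two-step paths are checked, the bidirected orientations are ignored, and the function names are assumptions.

```python
def transitive_reduction(edges):
    """edges: dict mapping node -> set of successors. Returns a reduced copy."""
    reduced = {u: set(vs) for u, vs in edges.items()}
    for u, succs in edges.items():
        for v in succs:
            if v == u:
                continue  # skip self-loops
            for w in edges.get(v, ()):
                # u -> w is implied by u -> v -> w, so it is transitive.
                if w in succs and w != v:
                    reduced[u].discard(w)
    return reduced

def oversampling_rate(n_reads, read_len, genome_size):
    """c = N * L / G."""
    return n_reads * read_len / genome_size
```

With three reads where r1 overlaps r2, r2 overlaps r3, and r1 also overlaps r3, only the edges r1 → r2 and r2 → r3 survive.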

For sequences, it’s about dropping the overlap between two reads when a third read already bridges them with longer overlaps

3b) Graph Cleanup
- Can have multiple paths branching off a single node (branching paths)
  - Due to genomic repetitions, sequencing errors, and clonal polymorphisms
- Genomic repetitions cannot be fixed without additional information
  - But the other two can be resolved



Sequencing errors produce short dead-end (DE) paths

Attempt to elongate branching nodes up to a certain depth md (minimum depth)

Reads that cannot be extended to a depth of md are removed

Experimentally determined that md=10 is the best value
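
A toy sketch of dead-end removal on a plain directed graph. It assumes a branch is measured in nodes and is dropped when it cannot be followed for `md` nodes. Function names are made up, only the forward direction is explored, and Edena's real procedure may differ in the details.

```python
def reach_depth(graph, node, limit):
    """Longest path (counted in nodes) starting at node, capped at limit."""
    if limit <= 1 or not graph.get(node):
        return 1
    return 1 + max(reach_depth(graph, nxt, limit - 1) for nxt in graph[node])

def remove_dead_ends(graph, md):
    """At each branching node, drop branches that cannot be extended to md nodes."""
    cleaned = {u: set(vs) for u, vs in graph.items()}
    for u, succs in graph.items():
        if len(succs) < 2:
            continue  # only branching nodes are examined
        for v in succs:
            if reach_depth(graph, v, md) < md:
                cleaned[u].discard(v)
    return cleaned
```

The depth cap keeps the exploration bounded even when the graph contains cycles. The sketch only cuts the edge into a short branch. Deleting the orphaned nodes would be a separate pass.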


2) Edena’s Methodology (cont’d)
- Also disambiguate bubbles in the graph caused by single base substitutions (aka “p-bubbles”)
- Length of p-bubble is at most ms = 4L - 2T - 1
  - L: Read length
  - T: Min. overlap size
- Explore each branching path up to length ms (guaranteed upper bound)
- Remove path with less coverage
- Polymorphisms can be retained for later analysis


2) Edena’s Methodology (cont’d)
4) Contig Production
- If run in strict mode, Edena starts generating contig sequences
- In non-strict mode, one more cleaning step is performed
  - Longer overlaps more reliable than shorter ones
  - Save only edges at branching nodes that have the highest overlap of all edges
- Produce contig sequence by following non-intersecting simple paths in overlap graph
  - Nodes must have in-degree and out-degree of exactly one
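
A sketch of the last step on a simplified, forward-strand-only graph. Overlaps are assumed to be given as tuples `(a, b, k)`, meaning the last k bases of read a match the first k bases of read b. That format and the name `build_contigs` are assumptions for illustration.

```python
def build_contigs(reads, overlaps):
    """Merge reads along paths whose edges join an out-degree-1 node to an in-degree-1 node."""
    n = len(reads)
    out_deg, in_deg = [0] * n, [0] * n
    succ, pred = {}, {}
    for a, b, k in overlaps:
        out_deg[a] += 1
        in_deg[b] += 1
        succ[a] = (b, k)  # only meaningful when out_deg[a] == 1
        pred[b] = a       # only meaningful when in_deg[b] == 1

    contigs, used = [], set()
    for start in range(n):
        # Skip reads that sit in the middle of an unambiguous path.
        if in_deg[start] == 1 and out_deg[pred[start]] == 1:
            continue
        if start in used:
            continue
        seq, node = reads[start], start
        used.add(start)
        while out_deg[node] == 1:
            nxt, k = succ[node]
            if in_deg[nxt] != 1 or nxt in used:
                break
            seq += reads[nxt][k:]  # append only the non-overlapping tail
            used.add(nxt)
            node = nxt
        contigs.append(seq)
    return contigs
```

Any branching node ends the contig, so an unresolved repeat splits the assembly there. A cycle made entirely of unambiguous nodes has no starting read and is skipped by this sketch.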


3) Results

Survivor: WGS Assembly
- Four assemblers
- Two challenges
- One winner

3) Results (cont’d)
Contestant #1: SSAKE
- Indexes reads in a prefix tree based upon first eleven 5’ bases
- Identify highest possible overlap between pairs of reads
- Use most highly-covered reads as starting points for read extension (i.e. assembly “nucleation points”)
- So far only used for partial genome sequencing for comparative metagenomic analysis (e.g. bacterial species distinction)

3) Results (cont’d)
Contestant #2: Velvet
- k-mer/q-gram/k-gram/q-mer de Bruijn graph representation of reads
Contestant #3: SHARCGS
- Can accept base quality scores along with read data for read filtering (low quality reads discarded)
- Also filter out reads with low coverage
- Assembly performed with a prefix tree
Contestant #4: Edena

3) Results (cont’d)
Reward Challenge
- Assemble the 2.82 Mbp genome sequence and the 20.7 Kbp plasmid sequence of the Staphylococcus aureus MW2 strain from Illumina reads
Immunity Challenge
- Assemble the 1.55 Mbp genome sequence and the 3.66 Kbp plasmid sequence of the Helicobacter acinonychis Sheeba strain from Illumina reads

Staphylococcus aureus results
- Evaluated each assembler on the parameter configurations that produced the best results
  - Edena: Min. overlap size: 21 bases
  - Velvet: k-mer value: 23
  - SHARCGS: Max. gap span: 14
  - SSAKE: Default parameters

3) Results (cont’d)
- Compared contig assembly to published reference sequence
- Non-strict mode tends to produce longer contigs at the expense of additional misassemblies
- Velvet comparable to Edena strict

3) Results (cont’d)
- SHARCGS unable to assemble significant contigs → insufficient coverage depth
- SSAKE produced a large number of mismatches, mostly at contig boundaries

3) Results (cont’d)
- Authors also tried combining contig results from Edena and Velvet due to significant overlaps between their contigs
- N50 and mean contig size increased relative to original results
- Edena non-strict has similar influence on results as previously
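
For anyone rusty on N50: it is the contig length at which contigs of that length or longer hold at least half of all assembled bases. A small sketch of the usual computation, with a made-up function name and toy lengths rather than the paper's data:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs at all
```

Merging two contigs into one longer contig leaves the total unchanged but moves bases into a longer contig, which is why combining overlapping assemblies tends to raise N50.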

Helicobacter acinonychis results
- Best parameter settings:
  - Edena: Min. overlap size: 27 (strict), 26 (non-strict)
  - Velvet: k-mer value: 27
  - SHARCGS: Max. gap span: 10 (also must remove last four bases from each read)
  - SSAKE: Default parameters

3) Results (cont’d)
- Results similar to those from the previous assembly challenge

Survivor: WGS Assembly Conclusion
- Granted Immunity: Edena, Velvet
- Sent to the Tribal Council: SSAKE, SHARCGS


4) Additional Edena Analyses
Graph Cleaning Effectiveness
- Demonstrate the effectiveness of DE path removal and p-bubble fixing
- Created an ideal read pool from the S. aureus MW2 strain
  - Consists of one read at every possible position
  - No errors
  - No polymorphisms
- Distinguish between positive and negative reads
  - Positive reads have at least one exact occurrence in the reference sequence
  - Negative reads have none
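
A small sketch of both ideas, treating the reference as a linear string for simplicity (a circular chromosome would also need reads that wrap around the end). Counting a match on either strand as positive is an assumption, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read: str) -> str:
    return read.translate(COMPLEMENT)[::-1]

def ideal_read_pool(reference: str, read_len: int):
    """One error-free read starting at every position of a linear reference."""
    return [reference[i:i + read_len]
            for i in range(len(reference) - read_len + 1)]

def is_positive(read: str, reference: str) -> bool:
    """True if the read occurs exactly in the reference on either strand."""
    return read in reference or reverse_complement(read) in reference
```

By construction every read in the ideal pool is positive, so any branching left in its graph has to come from the genome itself rather than from errors.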


4) Additional Edena Analyses (cont’d)

Ideal dataset indicates branching nodes and p-bubbles caused by genomic repetition

Anomalies in real datasets only due to negative reads

Due to small quantity of branching nodes in the ideal dataset, branch removal procedure is extremely effective



Though many p-bubbles consist of sequences made of negative reads, most cannot be explained by base calling errors

Thought to correspond to underrepresented clonal polymorphisms



Since there are no DE paths in the ideal dataset, expect that DE removal should remove all DE paths in real dataset (i.e. dead-ends correspond to negative reads)

From tests with different md values (below), authors decided 10 was best

Not so clear-cut to me

- Most DE paths have length 1
  - Correspond to paths created by base calling errors
- Longer DE paths exist that do not appear to be caused by such errors
  - Thought to be clonal polymorphisms in low abundance → can’t form a complete p-bubble

Effective Coverage Depth
- Computed effective coverage depth according to formula from Lander and Waterman
  - E = N(L-T)/G
  - N: # of usable reads
  - L: Read length
  - T: Req. overlap length
  - G: Genome size
- Can also estimate gaps in read coverage with N·e^(-E)
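
The two formulas are easy to play with in code. A minimal sketch with hypothetical function names; the symbols follow the slide (N, L, T, G).

```python
import math

def effective_coverage(n_reads, read_len, min_overlap, genome_size):
    """E = N(L - T) / G."""
    return n_reads * (read_len - min_overlap) / genome_size

def expected_gaps(n_reads, effective_cov):
    """Expected number of coverage gaps, N * e^(-E)."""
    return n_reads * math.exp(-effective_cov)
```

Because the gap estimate decays exponentially in E, even a modest rise in effective coverage pushes the expected gap count toward zero. Both formulas assume reads are sampled uniformly along the genome.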

- S. aureus sequencing
  - Raw coverage depth: 48x
  - Effective coverage depth: 14x
- H. acinonychis sequencing
  - Raw coverage depth: 284x
  - Effective coverage depth: 36x
- Statistics imply that there should be no gaps in H. acinonychis assembly, and only a few in S. aureus
- But each actual assembly contained several hundred gaps

- Statistics assume uniform read sampling
- Investigated underrepresented parts of genomes
  - After alignment of reads to reference genome, extracted low coverage sequences
  - These sequences have complex motifs and single base repeats → cause difficulty in replication


5) Conclusions
- Edena holds up well against other recent assemblers, in both assembly quality and computational resources
- Some assemblers are partially complementary to each other (Edena and Velvet) → can use together to produce results better than each individual assembler’s results
- Rise of NGS paired read data will help produce longer contigs and clean up ambiguities

Is Edena The One?
- The One that will herald the beginning of cost-effective whole genome assembly with NGS?
- Maybe you should ask the Oracle…

That’s all folks!
Discussion Questions
- What were the strengths/weaknesses of Edena? How would you improve it?
- How do you think Edena compares to the other assemblers tested? Would you test it against other assemblers not tested here?
- Given Edena’s limitations, would you trust it for de novo genome assembly over traditional sequence assembly?
- Why did we have to discuss yet another NGS genome assembler today?
