De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer

David Hernandez, Patrice François, Laurent Farinelli, Magne Østerås, Jacques Schrenzel

Presented by Lucas Lochovsky

Upload: edwin-lambert

Post on 12-Jan-2016



Outline
1. Introduction
2. Edena’s Methodology
   - Reducing Read Redundancy
   - Overlap Graph Construction
   - Transitive Edge Reduction
   - Graph Cleanup
   - Contig Production
3. Results
   - Assemblers
   - Assembly tasks
4. Additional Edena Analyses
   - Graph Cleaning Effectiveness
   - Effective Coverage Depth
5. Conclusions


1) Introduction
- NGS will allow us to explore strange new genomes, blah blah blah…
- WGS assemblers we’ve covered so far:
  - Medvedev-Brudno assembler
  - Arachne
  - AMOS-Cmp
  - Velvet
  - ALLPATHS
- Think you’ve seen it all?


1) Introduction (cont’d)
- Edena: De novo short read assembler
  - Uses a classic overlap graph approach to assembly
  - Anyone else get a feeling of déjà vu?
- Compared against other recently published NGS read assemblers
  - De novo assembly of two bacterial genomes sequenced with the Illumina/Solexa platform


2) Edena’s Methodology
- Built around a standard overlap-layout-consensus workflow
- Opted to use exact matching for overlap detection
  - Reduces the number of spurious overlaps
  - Faster than using approximate matching
- Also assumes that all reads have the same length
  - Is this assumption valid?


2) Edena’s Methodology (cont’d)

Four major steps:
1. Remove redundant reads so that dataset size is more manageable
2. Overlap detection and overlap graph construction
3. Graph cleaning: simplification and ambiguity resolution
4. Produce contigs


2) Edena’s Methodology (cont’d)

1) Practice your 3 R’s: Reducing Read Redundancy
- Illumina Genome Analyzer has a high amount of over-sampling → many redundant reads
- Reduce dataset so it contains only a single copy of each read → non-redundant
- Index all reads into a prefix tree
  - Identical reads will be mapped to the same key → no duplicate reads in this structure


2) Edena’s Methodology (cont’d)

- Prefix trees are associative arrays for strings where all descendants of a node have a common prefix
- Reads and their reverse complements are considered the same read → merged into the same tree key
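
A minimal sketch of the de-duplication idea, not Edena's implementation. It swaps the prefix tree for a plain Python `Counter` (a hash map), which gives the same "identical reads share one key" behaviour. The names `canonical_key` and `deduplicate` are hypothetical, and choosing the lexicographically smaller strand as the key is an assumption made for illustration.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def canonical_key(read: str) -> str:
    """Pick one representative for a read and its reverse complement.

    Assumption: the lexicographically smaller of the two strands is used.
    """
    return min(read, reverse_complement(read))


def deduplicate(reads):
    """Collapse identical reads (on either strand) and count occurrences.

    Reads containing anything other than A, C, G, T are skipped, since
    they cannot take part in exact matching.
    """
    counts = Counter()
    for read in reads:
        if set(read) <= set("ACGT"):
            counts[canonical_key(read)] += 1
    return counts


reads = ["ACGTA", "TACGT", "ACGTA", "ACNTA", "GGGCC"]
counts = deduplicate(reads)
```

Here `ACGTA` and `TACGT` are reverse complements of each other, so they land on the same key, and the per-key counts keep the read frequencies that the original data would otherwise provide.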


2) Edena’s Methodology (cont’d)

- Ambiguous reads (reads containing ambiguous base calls) discarded, since they won’t work with exact matching
  - Opens up possibility of coverage gaps in read data (not explored by the authors)
- Original read data still useful for getting read frequencies
  - Contig coverage depth
  - Repeat identification


2) Edena’s Methodology (cont’d)

2) Overlap Graph Construction
- Non-redundant read dataset is indexed by a suffix array
  - Déjà vu moment: Almost exactly like suffix trees from MUMmer/MUMmerGPU!
- Information used to produce a bidirected overlap graph
  - Déjà vu moment: Just like the Medvedev-Brudno assembler! (which I presented!)


2) Edena’s Methodology (cont’d)
This slide should be review for all of you!
- Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends
- Gives rise to three types of edges:
  - Edges where one arrow points out of a vertex, and one arrow points into a vertex
  - Edges with both arrows pointing out, and
  - Edges with both arrows pointing in (easiest one to do in PowerPoint!)
- For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex


2) Edena’s Methodology (cont’d)
More review!
- In a bidirected overlap graph, each vertex is a double-stranded read
- Edges represent read overlaps
- Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges)
- Suppose we have two ds reads r1 and r2
  - Each read can be oriented to the left or to the right
  - The three possible overlaps are:
    i) Both reads point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way)
    ii) r1 points left and r2 points right
    iii) r1 points right and r2 points left
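
The walk rule for bidirected graphs can be checked mechanically. The sketch below is an illustration only: the edge encoding (a 4-tuple with an `"in"` or `"out"` arrow at each end) and the name `is_valid_walk` are assumptions, not something taken from Edena or the Medvedev-Brudno assembler.

```python
def is_valid_walk(walk):
    """Check the bidirected walk rule on a list of edges.

    Each edge is a tuple (u, arrow_at_u, v, arrow_at_v), traversed from u
    to v. An arrow is "in" if the arrowhead points into the vertex and
    "out" if it points away from it (hypothetical encoding).
    """
    for entering, leaving in zip(walk, walk[1:]):
        _, _, v, arrow_entering = entering
        u, arrow_leaving, _, _ = leaving
        if v != u:
            return False  # consecutive edges must share a vertex
        if arrow_entering == arrow_leaving:
            return False  # orientations at the shared vertex must be opposite
    return True


# r1 -> r2 -> r3 with ordinary "out ... in" edges: a valid walk.
good_walk = [("r1", "out", "r2", "in"), ("r2", "out", "r3", "in")]

# The second edge has both arrows pointing in, so at r2 the walk would
# enter on an "in" arrow and try to leave on an "in" arrow: not allowed.
bad_walk = [("r1", "out", "r2", "in"), ("r2", "in", "r3", "in")]
```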


2) Edena’s Methodology (cont’d)

- Parameter: Minimum overlap size
  - Sensitivity vs. specificity tradeoff
  - Small value: Higher frequency of chance overlaps → causes path branching in graph (sensitivity favoured)
  - Large value: Creates more dead-end (DE) paths, i.e. reads not extended by overlapping reads on one side (specificity favoured)
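
To make the minimum-overlap tradeoff concrete, here is a brute-force sketch of exact suffix-prefix overlap detection. It deliberately does not use a suffix array as Edena does; it compares substrings directly, which is only practical for toy inputs. The function name `exact_overlap` and the example reads are made up for illustration.

```python
def exact_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    Only proper overlaps (shorter than both reads) are considered.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(a), len(b)) - 1
    # Try the longest candidate first so the first hit is the best one.
    for length in range(longest_possible, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


a, b = "ACGTACGT", "TACGTTTA"
long_hit = exact_overlap(a, b, 3)      # the reads share "TACGT"
too_strict = exact_overlap(a, b, 6)    # threshold above the true overlap

# A single shared base is accepted only when the threshold is tiny,
# which is how small thresholds let chance overlaps into the graph.
chance_hit = exact_overlap("AAAC", "CGGG", 1)
chance_rejected = exact_overlap("AAAC", "CGGG", 2)
```

Raising the threshold from 3 to 6 turns a genuine 5-base overlap into a dead end, while lowering it to 1 admits a one-base coincidence as an edge.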


2) Edena’s Methodology (cont’d)

3a) Transitive Edge Reduction
- Simplifies the graph by removing nonessential edges
- Generally speaking, when a path v1 → v2 → v3 exists alongside a direct edge v1 → v3, the direct edge is transitive and can be removed: the path already represents the same sequence
- Reduces graph complexity by a factor of the over-sampling rate c = NL/G
  - N: Number of reads
  - L: Read length
  - G: Genome size

Page 17: 1 De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer David Hernandez, Patrice François, Laurent Farinelli,

17

2) Edena’s Methodology (cont’d)

- In sequence terms: the overlap between two reads is dropped when an intermediate read overlaps both of them and accounts for the same sequence
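
A small sketch of transitive edge reduction in the usual overlap-graph sense: the direct edge v1 → v3 is dropped whenever a two-step path v1 → v2 → v3 exists. This is a simplification that assumes a plain directed, acyclic graph rather than Edena's bidirected graph, and it ignores overlap lengths. The name `transitive_reduction` is hypothetical.

```python
def transitive_reduction(graph):
    """Remove edges u -> w for which some v gives a path u -> v -> w.

    `graph` maps each node to the set of its successors and is assumed
    to be acyclic. A new graph is returned; the input is not modified.
    """
    reduced = {u: set(successors) for u, successors in graph.items()}
    for u, successors in graph.items():
        for v in successors:
            for w in graph.get(v, ()):
                # u -> v -> w exists, so a direct edge u -> w is redundant.
                reduced[u].discard(w)
    return reduced


# r1 overlaps r2 and r3, and r2 overlaps r3: the edge r1 -> r3 is transitive.
graph = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
reduced = transitive_reduction(graph)
```

After the reduction the three reads form a single chain r1 → r2 → r3, which spells the same sequence with one edge fewer.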


2) Edena’s Methodology (cont’d)

3b) Graph Cleanup
- Can have multiple paths branching off a single node (branching paths)
  - Due to genomic repetitions, sequencing errors, and clonal polymorphisms
- Genomic repetitions cannot be fixed without additional information
  - But the other two can be resolved


2) Edena’s Methodology (cont’d)

- Sequencing errors produce short dead-end (DE) paths
- Attempt to elongate branching nodes up to a certain depth md (minimum depth)
- Reads that cannot be extended to a depth of md are removed
- Experimentally determined that md = 10 is the best value
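
The dead-end rule can be sketched on a toy graph. This is an illustration under simplifying assumptions, not Edena's procedure: the graph is a plain directed one, depth is counted in nodes along a branch, and every node on a removed branch is assumed to have a single predecessor. The name `remove_dead_ends` is hypothetical.

```python
def remove_dead_ends(graph, md):
    """Prune branches that hit a dead end before reaching depth `md`.

    `graph` maps each node to the set of its successors. For every
    branching node (more than one successor), each branch is followed
    while it stays a simple chain. A branch that ends with no successor
    after fewer than `md` nodes is removed. Returns a pruned copy.
    """
    pruned = {u: set(successors) for u, successors in graph.items()}
    for u in list(pruned):
        if u not in pruned or len(pruned[u]) < 2:
            continue  # node already removed, or not a branching node
        for start in list(pruned[u]):
            path = [start]
            while len(path) < md and len(pruned.get(path[-1], ())) == 1:
                path.append(next(iter(pruned[path[-1]])))
            if len(path) < md and not pruned.get(path[-1]):
                # Short dead end: drop the branch (assumes its nodes have
                # no other predecessors).
                pruned[u].discard(start)
                for node in path:
                    pruned.pop(node, None)
    return pruned


# Node "a" branches into a long path (b, c, d, e) and a short dead end (x, y).
graph = {
    "a": {"b", "x"},
    "b": {"c"}, "c": {"d"}, "d": {"e"}, "e": set(),
    "x": {"y"}, "y": set(),
}
cleaned = remove_dead_ends(graph, md=3)
```

With `md=3` the two-node branch through `x` and `y` is removed while the longer branch survives. A toy value is used here because the example graph is far shorter than real read paths.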


2) Edena’s Methodology (cont’d)
- Also disambiguate bubbles in the graph caused by single base substitutions (aka “p-bubbles”)
- Length of p-bubble is at most ms = 4L - 2T - 1
  - L: Read length
  - T: Min. overlap size
- Explore each branching path up to length ms (guaranteed upper bound)
- Remove path with less coverage
- Polymorphisms can be retained for later analysis


2) Edena’s Methodology (cont’d)
4) Contig Production
- If run in strict mode, Edena starts generating contig sequences
- In non-strict mode, one more cleaning step is performed
  - Longer overlaps more reliable than shorter ones
  - Save only edges at branching nodes that have the highest overlap of all edges
- Produce contig sequence by following non-intersecting simple paths in overlap graph
  - Nodes must have in-degree and out-degree of exactly one
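
A sketch of contig production by following unambiguous chains. It assumes a plain directed overlap graph (no reverse-complement handling) and is not Edena's code. The name `build_contigs`, the read identifiers, and the toy sequences are all hypothetical. Pure cycles with no starting node are not handled.

```python
def build_contigs(reads, edges):
    """Spell contigs along maximal chains without branching.

    `reads` maps read id -> sequence. `edges` maps (u, v) -> overlap
    length, meaning the last `overlap` bases of u equal the first
    `overlap` bases of v.
    """
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)

    def extends_chain(u, v):
        # u -> v is unambiguous if u has one successor and v one predecessor.
        return len(succ.get(u, [])) == 1 and len(pred.get(v, [])) == 1

    contigs, visited = [], set()
    for node in reads:
        if node in visited:
            continue
        preds = pred.get(node, [])
        if len(preds) == 1 and extends_chain(preds[0], node):
            continue  # interior of a chain; it will be reached from its start
        sequence, current = reads[node], node
        visited.add(node)
        while len(succ.get(current, [])) == 1:
            nxt = succ[current][0]
            if nxt in visited or not extends_chain(current, nxt):
                break
            # Append only the part of the next read that is not overlapped.
            sequence += reads[nxt][edges[(current, nxt)]:]
            visited.add(nxt)
            current = nxt
        contigs.append(sequence)
    return contigs


reads = {"r1": "ACGTAC", "r2": "GTACGG", "r3": "ACGGTT", "r4": "ACGGAA"}
edges = {("r1", "r2"): 4, ("r2", "r3"): 4, ("r2", "r4"): 4}
contigs = build_contigs(reads, edges)
```

The chain r1 → r2 is merged into one contig, and the walk stops at r2 because it branches into r3 and r4, which each become their own contig.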


3) Results

Survivor: WGS Assembly
- Four assemblers
- Two challenges
- One winner


3) Results (cont’d)
Contestant #1: SSAKE
- Indexes reads in a prefix tree based upon first eleven 5’ bases
- Identify highest possible overlap between pairs of reads
- Use most highly-covered reads as starting points for read extension (i.e. assembly “nucleation points”)
- So far only used for partial genome sequencing for comparative metagenomic analysis (e.g. bacterial species distinction)


3) Results (cont’d)
Contestant #2: Velvet
- k-mer/q-gram/k-gram/q-mer de Bruijn graph representation of reads
Contestant #3: SHARCGS
- Can accept base quality scores along with read data for read filtering (low quality reads discarded)
- Also filter out reads with low coverage
- Assembly performed with a prefix tree
Contestant #4: Edena


3) Results (cont’d)

Reward Challenge
- Assemble the 2.82 Mbp genome sequence and the 20.7 Kbp plasmid sequence of the Staphylococcus aureus MW2 strain from Illumina reads
Immunity Challenge
- Assemble the 1.55 Mbp genome sequence and the 3.66 Kbp plasmid sequence of the Helicobacter acinonychis Sheeba strain from Illumina reads


3) Results (cont’d)

Staphylococcus aureus results
- Evaluated each assembler on the parameter configurations that produced the best results
  - Edena: Min. overlap size: 21 bases
  - Velvet: k-mer value: 23
  - SHARCGS: Max. gap span: 14
  - SSAKE: Default parameters


3) Results (cont’d)
- Compared contig assembly to published reference sequence
- Non-strict mode tends to produce longer contigs at the expense of additional misassemblies
- Velvet comparable to Edena strict


3) Results (cont’d)
- SHARCGS unable to assemble significant contigs → insufficient coverage depth
- SSAKE produced a large number of mismatches mostly at contig boundaries


3) Results (cont’d)
- Authors also tried combining contig results from Edena and Velvet due to significant overlaps between their contigs
- N50 and mean contig size increased relative to original results
- Edena non-strict mode has a similar influence on the combined results as before (longer contigs, more misassemblies)
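
For reference, N50 is the contig length such that contigs of that length or longer contain at least half of all assembled bases. The sketch below computes it from a list of contig lengths; the function name `n50` and the example lengths are made up and are not results from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths before and after merging overlapping contigs.
before_merge = [80, 70, 50, 40, 30, 20]
after_merge = [150, 90, 50]
n50_before = n50(before_merge)
n50_after = n50(after_merge)
```

Both lists hold the same 290 bases, but merging contigs moves the halfway point onto a longer contig, which is why combining two assemblies can raise N50.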


3) Results (cont’d)

Helicobacter acinonychis results
- Best parameter settings:
  - Edena: Min. overlap size: 27 (strict), 26 (non-strict)
  - Velvet: k-mer value: 27
  - SHARCGS: Max. gap span: 10 (also must remove last four bases from each read)
  - SSAKE: Default parameters


3) Results (cont’d)
- Results similar to those from the previous assembly challenge


3) Results (cont’d)

Survivor: WGS Assembly Conclusion
- Granted Immunity: Edena, Velvet
- Sent to the Tribal Council: SSAKE, SHARCGS


4) Additional Edena Analyses
Graph Cleaning Effectiveness
- Demonstrate the effectiveness of DE path removal and p-bubble fixing
- Created an ideal read pool from the S. aureus MW2 strain
  - Consists of one read at every possible position
  - No errors
  - No polymorphisms
- Distinguish between positive and negative reads
  - Positive reads have at least one exact occurrence in the reference sequence
  - Negative reads have none


4) Additional Edena Analyses (cont’d)

- Ideal dataset indicates which branching nodes and p-bubbles are caused by genomic repetition
- Any additional anomalies in real datasets are due only to negative reads
- Due to small quantity of branching nodes in the ideal dataset, branch removal procedure is extremely effective


4) Additional Edena Analyses (cont’d)

- Though many p-bubbles consist of sequences made of negative reads, most cannot be explained by base calling errors
  - Thought to correspond to underrepresented clonal polymorphisms


4) Additional Edena Analyses (cont’d)

- Since there are no DE paths in the ideal dataset, expect that DE removal should remove all DE paths in real dataset (i.e. dead-ends correspond to negative reads)
- From tests with different md values, authors decided 10 was best
  - Not so clear-cut to me


4) Additional Edena Analyses (cont’d)

- Most DE paths have length 1
  - Correspond to paths created by base calling errors
- Longer DE paths exist that do not appear to be caused by such errors
  - Thought to be clonal polymorphisms in low abundance → can’t form a complete p-bubble


4) Additional Edena Analyses (cont’d)

Effective Coverage Depth
- Computed effective coverage depth according to formula from Lander and Waterman
  - E = N(L-T)/G
    - N: # of usable reads
    - L: Read length
    - T: Req. overlap length
    - G: Genome size
- Can also estimate gaps in read coverage with N·e^(-E)
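
The two formulas above are easy to evaluate directly. In the sketch below the function names are hypothetical, and the read count and read length are illustrative assumptions rather than values from the paper; only the genome size (2.82 Mbp) and the overlap of 21 bases come from the S. aureus slides.

```python
import math


def effective_coverage(n_reads, read_len, min_overlap, genome_size):
    """Effective coverage depth E = N(L - T) / G."""
    return n_reads * (read_len - min_overlap) / genome_size


def expected_gaps(n_reads, eff_cov):
    """Expected number of coverage gaps, N * e^(-E)."""
    return n_reads * math.exp(-eff_cov)


# Assumed inputs: 1,000,000 usable reads of 35 bases (both hypothetical),
# T = 21 and G = 2,820,000 as quoted for S. aureus in the slides.
E = effective_coverage(1_000_000, 35, 21, 2_820_000)
gaps = expected_gaps(1_000_000, E)
```

Because the gap estimate decays exponentially in E, a modest change in the required overlap T or in the number of usable reads shifts the expected gap count by a large factor.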


4) Additional Edena Analyses (cont’d)

- S. aureus sequencing
  - Raw coverage depth: 48x
  - Effective coverage depth: 14x
- H. acinonychis sequencing
  - Raw coverage depth: 284x
  - Effective coverage depth: 36x
- Statistics imply that there should be no gaps in H. acinonychis assembly, and only a few in S. aureus
- But each actual assembly contained several hundred gaps


4) Additional Edena Analyses (cont’d)

- Statistics assume uniform read sampling
- Investigated underrepresented parts of genomes
  - After alignment of reads to reference genome, extracted low coverage sequences
  - These sequences have complex motifs and single base repeats → cause difficulty in replication


5) Conclusions
- Edena holds up well against other recent assemblers, in both assembly quality and computational resources
- Some assemblers are partially complementary to each other (Edena and Velvet) → can use together to produce results better than each individual assembler’s results
- Rise of NGS paired read data will help produce longer contigs and clean up ambiguities


Is Edena The One?
The One that will herald the beginning of cost-effective whole genome assembly with NGS?

Maybe you should ask the Oracle…


That’s all folks!
Discussion Questions
- What were the strengths/weaknesses of Edena? How would you improve it?
- How do you think Edena compares to the other assemblers tested? Would you test it against other assemblers not tested here?
- Given Edena’s limitations, would you trust it for de novo genome assembly over traditional sequence assembly?
- Why did we have to discuss yet another NGS genome assembler today?