Post on 29-Dec-2015
Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences
Song Gao, Niranjan Nagarajan, Wing-Kin Sung
National University of Singapore
Genome Institute of Singapore
Outline
• Overview
• Methods
- 1. Pre-Processing
- 2. A Special Case
- 3. Full Algorithm
- 4. Graph Contraction
- 5. Gap Estimation
• Results
• Ongoing Work
Biological Entity      Data Entity
Genome                 Genomic Sequence
Transcripts            Transcript
Microbial Community    Metagenome
[Figure labels: Sequencing Machine, Reads, Assembly, Analysis; example reads ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…]
Sequence Assembly
Reads -> Contigs -> Scaffolds, with Paired-end Reads linking contigs into scaffolds
Related Research Works
Contig Level
- OLC Framework: Celera Assembler [Myers et al., 2000], Edena [Hernandez et al., 2008], Arachne [Batzoglou et al., 2002], PE Assembler [Ariyaratne et al., 2011]
- De Bruijn Graph: EULER [Pevzner et al., 2001], Velvet [Zerbino et al., 2008], ALLPATHS [Butler et al., 2008], SOAPdenovo [Li et al., 2010]
Scaffold Level
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
- Embedded Module: EULER [Pevzner et al., 2001], Arachne [Batzoglou et al., 2002], Celera Assembler [Myers et al., 2000], Velvet [Zerbino et al., 2008]
- Standalone Module: Bambus [Pop et al., 2004], SOPRA [Dayarian et al., 2010]
Scaffolding Problem [Huson et al., 2002]
Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure
[Figure labels: Contig, Scaffold, Paired-end Read, Discordant Read; distances 1k, 3k, 2.5k]
* Huson, D. H., Reinert, K., Myers E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)
Data
- Sequencing Errors
- Read Length
- Coverage
Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
- Statistics of Assembled Genomes [Schatz et al., 2010]
Organism     Genome Size   # of Contigs   Contig N50   # of Scaffolds   Scaffold N50
Grapevine    500Mb         58,611         18.2kb       2,093            1.33Mb
Panda        2.4Gb         200,604        36.7kb       81,469           1.22Mb
Strawberry   220Mb         16,487         28.1kb       3,263            1.44Mb
Turkey       1.1Gb         128,271        12.6kb       26,917           1.5Mb
* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler. PLoS ONE 4(12) (2009)
* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336-346 (2009)
* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165-1173 (2010)
* N50: Given a set of sequences of varying lengths, the N50 length is the largest length N such that sequences of length L >= N together contain at least 50% of all bases.
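The N50 definition above can be turned into a few lines of code. This is a minimal sketch, not taken from the slides: the function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Largest N such that sequences of length >= N hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input: no sequences, no N50


# Hypothetical contig lengths (20 bases in total).
contigs = [2, 3, 4, 5, 6]
print(n50(contigs))
```

Sorting once and scanning from the longest sequence keeps the computation at O(n log n), which is enough even for the contig counts in the table above.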
NP-Complete [Huson et al., 2002]
Heuristic Methods
- Celera Assembler [Myers et al., 2000]
- Euler [Pevzner et al., 2001]
- Jazz [Chapman et al., 2002]
- Arachne [Batzoglou et al., 2002]
- Velvet [Zerbino et al., 2008]
- Bambus [Pop et al., 2004]
"True Complexity"
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parameterized Complexity [Downey et al., 1999]: Vertex Cover Problem, fixed-parameter tractability
* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108-112 (1996)
* Downey, R.G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, vol. 49 (1999)
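The slide names Vertex Cover as the standard example of fixed-parameter tractability. The sketch below shows the classic bounded search tree for it; this is a textbook technique chosen for illustration, not code from Opera, and the function name `vertex_cover` is an assumption.

```python
def vertex_cover(edges, k):
    """Return a vertex cover with at most k vertices as a set, or None if none exists."""
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is spent
    u, v = edges[0]
    # Any cover must contain u or v, so branch on the two choices.
    for pick in (u, v):
        remaining = [e for e in edges if pick not in e]
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {pick}
    return None


triangle = [(1, 2), (2, 3), (1, 3)]
print(vertex_cover(triangle, 1))  # a triangle needs two vertices
print(vertex_cover(triangle, 2))
```

The recursion depth is at most k and each level branches twice, so the running time is O(2^k * m) for m edges: exponential only in the parameter k, not in the size of the graph.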
1. Pre-Processing
Paired-end Reads -> Clusters [Huson et al., 2002]
Chimeric Noise
- Filtered by simulation
* Upper Bound of Paired-end Reads
[Figure labels: Chimera; 3]
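The step from paired-end reads to clusters can be sketched as a simple bundling of links. Everything specific here is an assumption made for illustration: the tuple layout of a link, the use of the median as the cluster distance, and the `min_support` threshold standing in for the chimeric-noise filter (the slide only says the filter is derived by simulation). Each contig pair is assumed to be reported in one canonical order.

```python
from collections import defaultdict
from statistics import median


def cluster_links(links, min_support=2):
    """Bundle read pairs that join the same two contigs in the same orientation.

    Each link is (contig_a, strand_a, contig_b, strand_b, distance).
    Bundles with fewer than min_support read pairs are dropped as likely noise.
    """
    bundles = defaultdict(list)
    for contig_a, strand_a, contig_b, strand_b, distance in links:
        bundles[(contig_a, strand_a, contig_b, strand_b)].append(distance)
    return {
        key: {"support": len(dists), "distance": median(dists)}
        for key, dists in bundles.items()
        if len(dists) >= min_support
    }


# Hypothetical links: three agree on A -> B, one lone link points to C.
links = [
    ("A", "+", "B", "+", 100),
    ("A", "+", "B", "+", 120),
    ("A", "+", "B", "+", 110),
    ("A", "+", "C", "-", 5000),
]
print(cluster_links(links))
```

A lone read pair such as the A to C link is what a chimeric read typically looks like, which is why a support threshold removes it.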
2. A Special Case
No discordant clusters in final scaffold
Naïve Solution
- Enumerate partial scaffolds: +A; +A+B, +A-B, +A+C, +A-C; +A+B+C, +A+B-C, +A-C+B, +A-C-B; …
- Exponential Time
[Figure: contigs A B C D]
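To see why the naïve enumeration takes exponential time, the sketch below lists every ordering of the contigs with every choice of orientation. The function names are illustrative assumptions; the `+`/`-` notation follows the slide.

```python
from itertools import permutations, product
from math import factorial


def signed_orderings(contigs):
    """Yield every ordering of the contigs with every choice of orientation."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(sign + contig for sign, contig in zip(signs, order))


def count_signed_orderings(n):
    """n! orderings times 2**n orientation choices."""
    return factorial(n) * 2 ** n


candidates = list(signed_orderings("ABC"))
print(len(candidates), count_signed_orderings(3))
print(count_signed_orderings(10))
```

Three contigs already give 48 candidate scaffolds, and the count grows as n! * 2^n, so checking each candidate against the clusters is only feasible for a handful of contigs.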
Dynamic Programming
- Scaffold Tail is Sufficient
- Analogous to Bandwidth Problem [Saxe, 1980]
- Orientation of Nodes
- Direction of Edges
- Discordant Edges …
[Figure labels: Upper Bound, width (w)]
* J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods 1(4), 363-369 (1980)
Equivalence class of scaffolds
- S1 and S2 have the same tail -> they are in the same class
Features of an equivalence class:
- All scaffolds in the class use the same set of contigs;
- Either all or none of them can be extended to a solution
[Figure: Tail; example scaffolds +A-B+C, +D+E, -A+C, +D+E+F, …]
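The idea that the tail is sufficient can be illustrated on the graph bandwidth problem that the slides cite as the analogue. This is a deliberately simplified sketch and not Opera's algorithm: it ignores contig orientation, contig lengths and discordant edges, and measures the width `w` in positions rather than base pairs. The function name and the choice of state, (placed set, last `w` nodes), are assumptions made for illustration.

```python
def order_with_bandwidth(nodes, edges, w):
    """Return an ordering in which every edge spans at most w positions, or None."""
    if w < 1:
        raise ValueError("w must be at least 1")
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    all_nodes = frozenset(nodes)
    dead = set()  # equivalence classes already shown to have no extension

    def extend(order, placed):
        if placed == all_nodes:
            return order
        tail = tuple(order[-w:])
        key = (placed, tail)  # same set and same tail: same class
        if key in dead:
            return None
        for n in sorted(all_nodes - placed):
            # Every neighbour already placed must still lie inside the tail.
            if all(m in tail for m in neighbours[n] & placed):
                result = extend(order + [n], placed | {n})
                if result is not None:
                    return result
        dead.add(key)
        return None

    return extend([], frozenset())


print(order_with_bandwidth("ABCD", [("A", "B"), ("B", "C"), ("C", "D")], 1))
print(order_with_bandwidth("ABC", [("A", "B"), ("B", "C"), ("A", "C")], 1))
```

Whether a partial ordering can be completed depends only on which nodes are placed and on the last `w` of them, so two partial orderings with the same key are interchangeable. Recording failed keys in `dead` means each class is explored at most once, which is what replaces the exhaustive enumeration.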
3. Full Algorithm
Consider discordant clusters
- Chimeric Reads
- Sequencing Errors
- Mapping Errors
Equivalence Class
- Number of Discordant Edges (p)
[Figure: example sequences ACCAAAATTT / ACCAAGAATTT and CTAGAA / CAAGAA, marked "?"]
4. Graph Contraction
[Figure: graph contraction illustrated over three slides; label 20k]
5. Gap Estimation
Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness
Calculate Gap Sizes
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al., 1983]
- Polynomial Time
[Figure labels: gaps g1, g2, g3; μ, σ]
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27 (1983)
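For a single gap the maximum-likelihood estimate has a closed form, which shows why the general objective is quadratic. The sketch assumes that μ and σ are the mean and standard deviation of a normally distributed library insert size, and that `spanned[i]` is the part of read pair i's insert lying on the two contigs; both are assumptions, as the slides do not spell out the model. Minimising the sum of (spanned[i] + g - μ)² / σ² over g gives g = μ - mean(spanned).

```python
from statistics import mean


def estimate_gap(mu, spanned):
    """Gaussian maximum-likelihood estimate of one gap size.

    Each read pair i has insert size spanned[i] + gap, assumed to be drawn
    from a normal distribution with mean mu; sigma cancels for a single gap.
    """
    return mu - mean(spanned)


# Hypothetical 10 kbp library with three read pairs spanning the gap.
print(estimate_gap(10000, [7000, 7200, 6800]))
```

When one read pair spans several gaps, or several libraries contribute, the squared terms couple the gap sizes g1, g2, g3 and no longer separate, which is where the quadratic programming solver named on the slide comes in.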
Runtime Comparison
         ◆E. coli   ★B. pseudomallei   ◆S. cerevisiae   ◆D. melanogaster
Bambus   50s        16m                2m               3m
SOPRA    49m        -                  2h               5h
Opera    4s         7m                 11s              30s
• Coverage of 300bp insert library: >20X
• Coverage of 10kbp insert library: 2X
• Contigs assembled using Velvet
◆ Simulated data set using MetaSim
★ In-house data
Scaffold Contiguity
[Figure: two bar charts, N50 and Max Length, comparing Velvet, Bambus, SOPRA and Opera on E. coli, B. pseudomallei, S. cerevisiae and D. melanogaster]
Scaffold Correctness
[Figure: bar chart of # of Breakpoints for Velvet, Bambus, SOPRA and Opera on E. coli, S. cerevisiae and D. melanogaster]
[Figure: bar chart of # of Discordant Edges for Velvet and Opera on E. coli, S. cerevisiae and D. melanogaster]
         E. coli   S. cerevisiae   D. melanogaster
Opera    1         3               4
Bambus   19        55              423
Ongoing Work
A Rodent Genome
         Genome Size   N50
Opera    ~2Gbp         765.5Kbp
SSPACE                 281.7Kbp
A Tree Genome
         Genome Size   N50        Max Length
Opera    ~300Mbp       209.9Kbp   921.8Kbp
Repeats
Lower bounds and better scaffold
Multiple Libraries
Other applications
Metagenomics
Cancer Genomics
Link: https://sourceforge.net/projects/operasf/
Acknowledgement
Questions?
Wing-Kin Sung Niranjan Nagarajan
Pramila N. Ariyaratne
Funding:
A*STAR of Singapore
Ministry of Education, Singapore
NUS Graduate School for Integrative Sciences and Engineering (NGS)