
Post on 29-Dec-2015


Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences

Song Gao, Niranjan Nagarajan, Wing-Kin Sung

National University of Singapore

Genome Institute of Singapore

Outline

• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

[Figure: biological entities (Genome, Transcripts, Microbial Community) and the corresponding data entities (Genomic Sequence, Transcript, Metagenome). A Sequencing Machine produces Reads, which feed Assembly and Analysis. Example reads: ACGTTTAACAGG…, TTACGATTCGATGA…, GCCATAATGCAAG…, CTTAGAATCGGATAGAC…, AGGCATAGACTAGAG…]

Sequence Assembly

Reads -> Contigs -> Scaffolds (using Paired-end Reads)

Related Research Works

(I) Contig Level
- OLC Framework: Celera Assembler [Myers et al., 2000], Edena [Hernandez et al., 2008], Arachne [Batzoglou et al., 2002], PE Assembler [Ariyaratne et al., 2011]
- De Bruijn Graph: EULER [Pevzner et al., 2001], Velvet [Zerbino et al., 2008], ALLPATHS [Butler et al., 2008], SOAPdenovo [Li et al., 2010]

(II) Scaffold Level
- Comparative Assembly: AMOScmp [Pop, 2004], ABBA [Salzberg, 2008]
- Embedded Module: EULER [Pevzner et al., 2001], Arachne [Batzoglou et al., 2002], Celera Assembler [Myers et al., 2000], Velvet [Zerbino, 2008]
- Standalone Module: Bambus [Pop et al., 2004], SOPRA [Dayarian et al., 2010]

Scaffolding Problem [Huson et al., 2002]

Value Addition
- Gap Filling: GapCloser Module of SOAPdenovo
- Repeat Resolution
- Long-Range Genomic Structure

[Figure: contigs joined into a scaffold by paired-end reads, with distance labels 1k, 3k and 2.5k; one discordant read is marked.]

* Huson, D.H., Reinert, K., Myers, E.W.: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM 49(5), 603–615 (2002)

Data
- Sequencing Errors
- Read Length
- Coverage

Analysis
- Long Insert vs. Long Read [Chaisson, 2009; Zerbino, 2009]
- Statistics of Assembled Genomes [Schatz et al., 2010]

Organism    Genome Size  # of Contigs  Contig N50  # of Scaffolds  Scaffold N50
Grapevine   500Mb        58,611        18.2kb      2,093           1.33Mb
Panda       2.4Gb        200,604       36.7kb      81,469          1.22Mb
Strawberry  220Mb        16,487        28.1kb      3,263           1.44Mb
Turkey      1.1Gb        128,271       12.6kb      26,917          1.5Mb

* Zerbino, D.R.: Pebble and rock band: heuristic resolution of repeats and scaffolding in the Velvet short-read de novo assembler. PLoS ONE 4(12) (2009)

* Chaisson, M.J., Brinza, D., Pevzner, P.A.: De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Research 19, 336–346 (2009)

* Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Research 20(9), 1165–1173 (2010)

* N50: Given a set of sequences of varying lengths, the N50 length is the largest length N such that at least 50% of all bases are in sequences of length L >= N.
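The N50 definition above can be made concrete with a short sketch. The function name and the example lengths are illustrative assumptions, not values from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where at least half of all bases are covered.
        if 2 * running >= total:
            return length


# Made-up contig lengths (total 290 bases): 80 + 70 = 150 >= 145, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 means that fewer, longer sequences account for half of the assembly, which is why the table reports it separately for contigs and scaffolds.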


NP-Complete [Huson et al, 2002]


Heuristic Methods
- Celera Assembler [Myers et al., 2000]
- Euler [Pevzner et al., 2001]
- Jazz [Chapman et al., 2002]
- Arachne [Batzoglou et al., 2002]
- Velvet [Zerbino et al., 2008]
- Bambus [Pop et al., 2004]

"True Complexity"
- Phase transition based on parameters [Hayes, 1996]: 3-SAT Problem
- Parametric Complexity [Downey et al., 1999]: Vertex Cover Problem, Fixed-parameter tractability

* Hayes, B.: Can't get no satisfaction. American Scientist 85, 108–112 (1996)

* Downey, R.G., et al.: Parameterized Complexity: A Framework for Systematically Confronting Computational Intractability. DIMACS, Vol. 49 (1999)
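Vertex Cover is the standard illustration of fixed-parameter tractability: the problem is NP-hard in general, but for a small cover size k it can be decided in time exponential only in k. Below is a minimal sketch of the classic bounded search tree; it is not taken from the slides, and the function name and edge-list format are assumptions.

```python
def vertex_cover(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    Bounded search tree: any cover must contain an endpoint of every edge,
    so pick an uncovered edge (u, v) and branch on taking u or taking v.
    The recursion depth is at most k, giving O(2^k * |E|) time.
    """
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = edges[0]
    for chosen in (u, v):
        # Edges touching the chosen vertex are now covered.
        remaining = [e for e in edges if chosen not in e]
        sub = vertex_cover(remaining, k - 1)
        if sub is not None:
            return sub | {chosen}
    return None


path = [("a", "b"), ("b", "c"), ("c", "d")]
print(vertex_cover(path, 1))  # no cover of size 1 exists for this path
print(vertex_cover(path, 2))  # a cover of size 2
```

The running time grows with the parameter k rather than with the graph size alone, which is the sense in which a hard problem can still be tractable on practical instances.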

Outline

• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

1. Pre-Processing

Paired-end Reads -> Clusters [Huson et al., 2002]

Chimeric Noise: filtered by simulation

* Upper Bound of Paired-end Reads

[Figure: paired-end read clusters between contigs; labels: 3, Chimera]
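The pre-processing step can be sketched roughly as follows: read pairs that link the same two contigs in the same relative orientation are bundled into one cluster, and clusters with too few supporting pairs are dropped as likely chimeric noise. This is a simplified illustration, not Opera's implementation. The tuple layout, the `min_support` threshold (the slides derive the filter by simulation) and the use of a plain mean are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def cluster_links(links, min_support=2):
    """Group read-pair links by (contig_a, contig_b, orientation).

    `links` holds tuples (contig_a, contig_b, orientation, distance), where
    `distance` is the gap between the two contigs implied by one read pair.
    Clusters with fewer than `min_support` read pairs are discarded.
    """
    groups = defaultdict(list)
    for contig_a, contig_b, orientation, distance in links:
        groups[(contig_a, contig_b, orientation)].append(distance)

    clusters = {}
    for key, distances in groups.items():
        if len(distances) >= min_support:
            clusters[key] = {
                "support": len(distances),
                "mean_distance": mean(distances),
            }
    return clusters


# Hypothetical links: three pairs agree on A -> B, one lone pair suggests A -> C.
links = [
    ("A", "B", "++", 1000),
    ("A", "B", "++", 1100),
    ("A", "B", "++", 900),
    ("A", "C", "+-", 5000),
]
print(cluster_links(links))
```

With these inputs only the A/B cluster survives; the single A/C pair is treated as noise.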

2. A Special Case

No discordant clusters in final scaffold

Naïve Solution: Exponential Time

[Figure: contigs A B C D and the search tree of partial scaffolds]
+A
+A+B   +A-B   +A+C   +A-C
+A+B+C   +A+B-C   +A-C+B   +A-C-B
……
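To see why the naïve search is exponential, the sketch below enumerates every ordering of the contigs together with every choice of orientation. It is an illustration only; the function names are assumptions, and unlike the slide it does not fix the first contig to +A, so each scaffold also appears in its mirrored form.

```python
from itertools import permutations, product


def signed_orderings(contigs):
    """Yield every ordering of `contigs` with every choice of orientation."""
    for order in permutations(contigs):
        for signs in product("+-", repeat=len(order)):
            yield "".join(f"{sign}{name}" for sign, name in zip(signs, order))


def count_candidates(n):
    """Count signed orderings of n contigs, which equals n! * 2^n."""
    return sum(1 for _ in signed_orderings(range(n)))


print(list(signed_orderings("AB")))
for n in range(1, 6):
    print(n, count_candidates(n))
```

Four contigs already give 384 candidates, and each extra contig multiplies the count by 2(n + 1), so exhaustive enumeration is only feasible for very small instances.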

Dynamic Programming

Scaffold Tail is Sufficient

Analogous to Bandwidth Problem [Saxe, 1980]
- Orientation of Nodes
- Direction of Edges
- Discordant Edges …

[Figure labels: width (w), Upper Bound]

* Saxe, J.: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM Journal on Algebraic and Discrete Methods 1(4), 363–369 (1980)

Equivalence class of scaffolds

S1 and S2 have the same tail -> They are in the same class

Feature of equivalence class:
- Use of the same set of contigs;
- All or none of them can be extended to a solution

[Figure: Tail. Labels: +A-B+C, +D+E, -A+C, +D+E+F]
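A rough sketch of the tail idea follows. It assumes that the tail is the shortest suffix of a partial scaffold whose contigs cover at least the paired-end upper bound, so that no read can reach past it. That definition, the data layout and the contig lengths are assumptions made for illustration. The slides state that scaffolds in one class use the same set of contigs; this sketch does not check that property.

```python
from collections import defaultdict


def tail(scaffold, lengths, upper_bound):
    """Return the shortest suffix of `scaffold` covering >= upper_bound bases.

    `scaffold` is a sequence of (orientation, contig) pairs, `lengths` maps
    contig names to lengths, and `upper_bound` is the largest distance a
    paired-end read can span. If the whole scaffold is shorter than
    `upper_bound`, the whole scaffold is returned.
    """
    covered = 0
    suffix = []
    for orientation, contig in reversed(scaffold):
        suffix.append((orientation, contig))
        covered += lengths[contig]
        if covered >= upper_bound:
            break
    return tuple(reversed(suffix))


def group_by_tail(scaffolds, lengths, upper_bound):
    """Group partial scaffolds that share a tail into one class."""
    classes = defaultdict(list)
    for scaffold in scaffolds:
        classes[tail(scaffold, lengths, upper_bound)].append(scaffold)
    return dict(classes)


# Hypothetical contig lengths in bases.
lengths = {"A": 3000, "B": 2000, "C": 4000, "D": 2500, "E": 3000}
s1 = [("+", "A"), ("-", "B"), ("+", "C"), ("+", "D"), ("+", "E")]
s2 = [("-", "B"), ("+", "A"), ("+", "C"), ("+", "D"), ("+", "E")]
s3 = [("+", "A"), ("-", "B"), ("+", "C"), ("+", "E"), ("+", "D")]
for key, members in group_by_tail([s1, s2, s3], lengths, 5000).items():
    print(key, len(members))
```

Here s1 and s2 share the tail +D+E and fall into one class, so a dynamic program would only need to extend one of them; s3 ends in +E+D and forms its own class.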

3. Full Algorithm

Consider discordant clusters

Equivalence Class
- Number of Discordant Edges (p)

Chimeric Reads

Sequencing Errors

Mapping Errors

[Figure: example sequences ACCAAAATTT / ACCAAGAATTT and CTAGAA / CAAGAA / ?]

4. Graph Contraction

[Figure: graph contraction shown over three slides; label: 20k]

5. Gap Estimation

Utility
- Genome finishing (Genome Size Estimation)
- Scaffold Correctness

Calculate Gap Sizes
- Maximum Likelihood
- Quadratic Function
- Solved through quadratic programming [Goldfarb et al., 1983]
- Polynomial Time

[Figure: scaffold with gaps g1, g2, g3 and paired-end library parameters μ, σ]

* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27 (1983)
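One way to see why maximum likelihood leads to a quadratic function is sketched below. It assumes that the distance implied by each cluster e is normally distributed with mean μ_e and standard deviation σ_e, and that e spans a set of gaps G(e) plus a fixed total length ℓ_e of intervening contigs. The symbols E, G(e), ℓ_e and d_e are notation introduced here, not taken from the slides.

```latex
\hat{g}
  = \arg\max_{g} \prod_{e \in E}
      \exp\!\left( -\frac{\bigl(d_e(g) - \mu_e\bigr)^2}{2\sigma_e^2} \right)
  = \arg\min_{g} \sum_{e \in E}
      \frac{\bigl(d_e(g) - \mu_e\bigr)^2}{2\sigma_e^2},
\qquad
d_e(g) = \ell_e + \sum_{i \in G(e)} g_i
```

Because each d_e(g) is linear in the gap sizes g1, g2, g3, …, the objective is a convex quadratic in g, which is the kind of problem a quadratic programming solver handles in polynomial time.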

Outline

• Overview
• Methods
  - 1. Pre-Processing
  - 2. A Special Case
  - 3. Full Algorithm
  - 4. Graph Contraction
  - 5. Gap Estimation
• Results
• Ongoing Work

Runtime Comparison

        ◆E. coli  ★B. pseudomallei  ◆S. cerevisiae  ◆D. melanogaster
Bambus  50s       16m               2m              3m
SOPRA   49m       -                 2h              5h
Opera   4s        7m                11s             30s

• Coverage of 300bp insert library: >20X
• Coverage of 10kbp insert library: 2X
• Contigs assembled using Velvet

◆ Simulated data set using MetaSim
★ In house data

Scaffold Contiguity

[Figure: two bar charts comparing Velvet, Bambus, SOPRA and Opera on E. coli, B. pseudomallei, S. cerevisiae and D. melanogaster. Panel 1: N50. Panel 2: Max Length.]

Scaffold Correctness

[Figure: bar chart of # of Breakpoints for Velvet, Bambus, SOPRA and Opera on E. coli, S. cerevisiae and D. melanogaster]

Scaffold Correctness

[Figure: bar chart of # of Discordant Edges for Velvet and Opera on E. coli, S. cerevisiae and D. melanogaster]

        E. coli  S. cerevisiae  D. melanogaster
Opera   1        3              4
Bambus  19       55             423

Ongoing Work

A Rodent Genome

        Genome Size  N50
Opera   ~2Gbp        765.5Kbp
SSpace               281.7Kbp

A Tree Genome

        Genome Size  N50       Max Length
Opera   ~300Mbp      209.9Kbp  921.8Kbp


Ongoing Work

Repeats

Lower bounds and better scaffold

Multiple Libraries

Other applications

Metagenomics

Cancer Genomics

Link: https://sourceforge.net/projects/operasf/

Acknowledgement

Wing-Kin Sung
Niranjan Nagarajan
Pramila N. Ariyaratne

Funding:
- A*STAR of Singapore
- Ministry of Education, Singapore
- NUS Graduate School for Integrative Sciences and Engineering (NGS)

Questions?