

Scaffolding Large Genomes Using Integer Linear Programming

JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKY, ION MANDOIU*

ACM-BCB 2012

*University of Connecticut
Georgia State University

De-novo Assembly Paradigm

the genome -> (shotgun sequencing) -> short reads -> (de novo assembly) -> short contigs -> (scaffolding) -> the scaffolds

Why Scaffolding?

• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!

[Figure: gene XYZ with its 5' UTR and 3' UTR, shown once with a scaffold and once with no scaffold]

Why Scaffolding?

[Figure: Sanger sequencing across gene XYZ, its 5' UTR, and its 3' UTR]

Biologist: There are holes in my genes!


Massive Sequencing Projects

• i5K: 5,000 insect and arthropod species
• G10K: 10,000 vertebrate species

Effects of Read Length

Genome     Coverage   Technology   N50
Dog        7.5x       Sanger       180 Kb
Chicken    6x         Illumina     12 Kb
Human      100x       Illumina     24 Kb

Fragmented Genomes

The Scaffolding Problem

GIVEN
• Contigs, paired reads

FIND
• Orientation, ordering, relative distance

GOAL
• Recreate true scaffolds

Paired Read Construction

Paired Read Styles
• Mate Pair
• Paired End

[Figure: paired reads R1 and R2 drawn in two arrangements, labeled "same strand and orientation" and "different strand and orientation"; size labels: 2kb, 2kb, 100b, 100b, 10kb]

Linkage Information

Two contigs are adjacent if:
• A read pair spans the contigs

State (A, B, C, D)
• Depends on orientation of the read
• Order of contigs is arbitrary

Each read pair can be "consistent" with one of the four states.

Possible States (mate pair)

[Figure: contig i and contig j drawn 5' to 3', with reads R1 and R2 placed in each of the four states A, B, C, D]

Scaffolding Graph

Nodes
• Nodes are contigs

Edges
• Adjacent contigs have 4 edges (one for each state)
• Weighted by overlap with repetitive region

[Figure: contig i and contig j joined by the State A edge]

\[ W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\text{bp in repeat region}}{\text{bp in read}} \right) \]
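The edge weight can be sketched in a few lines of Python. This is only an illustration of the formula on the slide: the function names and the sample numbers are hypothetical, and it assumes each read pair is summarized by two counts, the bases falling in a repeat region and the total bases in the read.

```python
from typing import Iterable, Tuple


def read_pair_score(bp_in_repeat: int, bp_in_read: int) -> float:
    """Contribution of one read pair: 1 - (bp in repeat region / bp in read)."""
    if bp_in_read <= 0:
        raise ValueError("bp_in_read must be positive")
    return 1.0 - bp_in_repeat / bp_in_read


def edge_weight(read_pairs: Iterable[Tuple[int, int]]) -> float:
    """Sum the per-pair scores over all read pairs supporting one state edge."""
    return sum(read_pair_score(rep, total) for rep, total in read_pairs)


# Hypothetical read pairs supporting state A between contig i and contig j,
# each given as (bp in repeat region, bp in read).
pairs_state_a = [(0, 100), (25, 100), (100, 100)]
w_ij_a = edge_weight(pairs_state_a)  # 1.0 + 0.75 + 0.0
print(w_ij_a)
```

A read pair that lies entirely in a repeat contributes nothing, so edges supported only by repetitive sequence receive little weight.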

Integer Linear Program Formulation

Variables
• Contig orientation: \( S_i \in \{0,1\} \)
• Adjacent contig consistency: \( S_{ij} \in \{0,1\} \)
• Contig pair state: \( A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\} \)

Objective
• Maximize weight of consistent pairs

\[ z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right) \]

Constraints

Pairwise Orientation

\[ S_{ij} \le S_j + S_i \]
\[ S_{ij} \le 2 - S_i - S_j \]
\[ S_{ij} \ge S_j - S_i \]
\[ S_{ij} \ge S_i - S_j \]

Constraints

State Variables

\[ 2A_{ij} \le (1 - S_i) + (1 - S_j) \]
\[ 2B_{ij} \le (1 - S_i) + S_j \]
\[ 2C_{ij} \le S_i + (1 - S_j) \]
\[ 2D_{ij} \le S_i + S_j \]
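Each state-variable inequality lets its indicator reach 1 only when the right-hand side equals 2, so every orientation pair (S_i, S_j) permits exactly one of the four states. The sketch below enumerates this mapping; `allowed_states` is a hypothetical helper name.

```python
from itertools import product


def allowed_states(s_i: int, s_j: int) -> list:
    """List the states whose indicator may be 1 for this orientation pair."""
    upper_bounds = {
        "A": (1 - s_i) + (1 - s_j),
        "B": (1 - s_i) + s_j,
        "C": s_i + (1 - s_j),
        "D": s_i + s_j,
    }
    # An indicator X may equal 1 only if 2 * 1 <= its right-hand side.
    return [state for state, bound in upper_bounds.items() if 2 <= bound]


for s_i, s_j in product((0, 1), repeat=2):
    print((s_i, s_j), allowed_states(s_i, s_j))
```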

Constraints

Mutual Exclusivity

\[ A_{ij} + D_{ij} \le 1 - S_{ij} \]
\[ B_{ij} + C_{ij} \le S_{ij} \]
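The pairwise orientation, state variable, and mutual exclusivity constraints can be checked together on a toy instance. The sketch below is a brute-force enumeration, not an ILP solver and not the authors' implementation. It leaves out the cycle constraints, assumes non-negative weights, and uses made-up weights for three contigs; all names are hypothetical.

```python
from itertools import product

STATES = "ABCD"


def edge_feasible(s_i, s_j, s_ij, x):
    """Check the three constraint families for one contig pair."""
    a, b, c, d = (x[s] for s in STATES)
    return (
        s_ij <= s_i + s_j and s_ij <= 2 - s_i - s_j
        and s_ij >= s_j - s_i and s_ij >= s_i - s_j
        and 2 * a <= (1 - s_i) + (1 - s_j)
        and 2 * b <= (1 - s_i) + s_j
        and 2 * c <= s_i + (1 - s_j)
        and 2 * d <= s_i + s_j
        and a + d <= 1 - s_ij
        and b + c <= s_ij
    )


def best_edge_value(s_i, s_j, weights):
    """Best objective contribution of one edge once S_i and S_j are fixed."""
    best = 0.0  # all indicators set to 0 is always feasible
    for s_ij in (0, 1):
        for bits in product((0, 1), repeat=4):
            x = dict(zip(STATES, bits))
            if edge_feasible(s_i, s_j, s_ij, x):
                best = max(best, sum(weights[s] * x[s] for s in STATES))
    return best


def solve_brute_force(n_contigs, edges):
    """Enumerate all orientations; edges maps (i, j) to per-state weights."""
    best_z, best_s = -1.0, None
    for s in product((0, 1), repeat=n_contigs):
        z = sum(best_edge_value(s[i], s[j], w) for (i, j), w in edges.items())
        if z > best_z:
            best_z, best_s = z, s
    return best_z, best_s


# Hypothetical weights for three contigs.
edges = {
    (0, 1): {"A": 5.0, "B": 1.0, "C": 0.0, "D": 0.0},
    (1, 2): {"A": 0.0, "B": 4.0, "C": 0.0, "D": 1.0},
    (0, 2): {"A": 0.0, "B": 3.0, "C": 2.0, "D": 0.0},
}
z, orientation = solve_brute_force(3, edges)
print(z, orientation)  # 12.0 (0, 0, 1)
```

The enumeration is exponential in the number of contigs, which is the reason a real instance needs an ILP solver and the graph decomposition described later in the talk.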

Constraints

Forbid 2 Cycles

\[ B_{ij} + C_{ij} \le S_{ij} \]
\[ A_{ij} + D_{ij} \le 1 - S_{ij} \]

Forbid 3 Cycles

[Figure: 3-cycle constraints; only the values "2" survive in this transcript]

* Larger cycles are broken at the end


Largest Connected Component

Graph Decomposition: Articulation Points

[Figure: graph decomposed at an articulation point, with a "solve" step]

(MIP, Salmela 2011)
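An articulation point is a node whose removal disconnects the graph. The standard depth-first search with low-link values finds all of them. The sketch below is a generic textbook version in Python, not code from SILP or MIP; it assumes a simple undirected graph given as an edge list and uses recursion, so it only suits small examples.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the set of articulation points of a simple undirected graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nxt in graph[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])  # back edge
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # Non-root case: nothing below nxt reaches above node.
                if parent is not None and low[nxt] >= disc[node]:
                    points.add(node)
        # Root case: more than one DFS child means removal splits the graph.
        if parent is None and children > 1:
            points.add(node)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points


# Hypothetical scaffolding graph: two triangles that share contig "c".
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"), ("d", "e"), ("c", "e")]
print(articulation_points(toy_edges))  # {'c'}
```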


Largest Biconnected Component

Non-Serial Dynamic Programming

A technique that exploits the sparsity of the scaffolding graph by computing the solution in stages, with each stage incorporating the results of previous stages.

Inspired by (Neumaier, 06)

Non-Serial Dynamic Programming

[Figure: a 2-cut; the four +/- orientation combinations (+,+), (+,-), (-,+), (-,-) of the two cut contigs, with values z_A, z_B, z_C, z_D]

Non-Serial Dynamic Programming

[Figure: the four +/- orientation combinations of the two cut contigs and their values z_A, z_B, z_C, z_D]

Objective Modification: z_A, z_B, z_C, z_D

SPQR-tree Based Implementation

• SPQR-tree efficiently finds 2-cuts (Tarjan, 73)
• DFS of SPQR-tree is used to schedule elimination order for NSDP

Post Processing ILP Solution

ILP solution
• May have cycles
• Not a total ordering

[Figure: ILP solution on contigs A to F, and a bipartite graph with an "outgoing" copy and an "incoming" copy of each contig A to F]

Bipartite matching, for each connected component

Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight
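As an illustration of the max cardinality objective, the sketch below runs a standard augmenting-path bipartite matching (Kuhn's algorithm) between "outgoing" and "incoming" copies of the contigs. It assumes that a matched pair (X, Y) means contig X is directly followed by contig Y. The input format and names are hypothetical, and the weighted variants are not covered.

```python
def max_cardinality_matching(candidates):
    """Match outgoing contigs to incoming contigs, one successor each.

    candidates maps each outgoing contig to the incoming contigs that the
    ILP solution allows it to precede.
    """
    match_of_incoming = {}

    def try_assign(out_node, visited):
        for in_node in candidates.get(out_node, ()):
            if in_node in visited:
                continue
            visited.add(in_node)
            holder = match_of_incoming.get(in_node)
            # Take a free incoming node, or re-route its current holder.
            if holder is None or try_assign(holder, visited):
                match_of_incoming[in_node] = out_node
                return True
        return False

    for out_node in candidates:
        try_assign(out_node, set())
    return {out: inc for inc, out in match_of_incoming.items()}


# Hypothetical ILP output in which contig A has two possible successors.
candidates = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
print(max_cardinality_matching(candidates))
# {'A': 'B', 'B': 'C', 'C': 'D'}
```

Because every contig keeps at most one successor and at most one predecessor, the matched edges form paths (and possibly cycles, which would still have to be broken separately).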

GAGE Framework

Genome                     Size (Mb)   # reads
Staphylococcus aureus      2.9         3,494,070
Rhodobacter sphaeroides    4.6         2,050,868
Human Chr14                107         22,669,408

Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet

Scaffolded using: SILP (our method), Opera, MIP, Bambus2

Testing Metrics

TPN50
• Break scaffolds at incorrect edges, then find N50
• N50: the length L such that pieces of length at least L contain at least 50% of the total bases
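The TPN50 computation can be sketched as below. It uses the usual N50 definition (the largest length L such that pieces of length at least L hold at least half of all bases) and, as a simplifying assumption, measures a scaffold piece by the sum of its contig lengths, ignoring gaps. The input layout and function names are hypothetical.

```python
def n50(lengths):
    """Largest L such that pieces of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def break_at_incorrect_edges(contig_lengths, edge_is_correct):
    """Split one scaffold wherever an adjacency is marked incorrect."""
    pieces, current = [], contig_lengths[0]
    for length, ok in zip(contig_lengths[1:], edge_is_correct):
        if ok:
            current += length
        else:
            pieces.append(current)
            current = length
    pieces.append(current)
    return pieces


def tp_n50(scaffolds):
    """N50 of the pieces left after breaking every scaffold at wrong edges."""
    pieces = []
    for contig_lengths, edge_is_correct in scaffolds:
        pieces.extend(break_at_incorrect_edges(contig_lengths, edge_is_correct))
    return n50(pieces)


# Hypothetical scaffolds: (contig lengths in order, correctness per adjacency).
scaffolds = [([100, 200, 300], [True, False]), ([50, 50], [True])]
print(tp_n50(scaffolds))  # pieces are 300, 300, 100, so the result is 300
```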

Binary Classification
• Given n contigs in a scaffold
• How many of the n-1 adjacencies can you predict?
• Metrics: PPV, Sensitivity, MCC
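The three metrics follow from the confusion-matrix counts of predicted adjacencies. The sketch below uses the standard formulas. How true negatives are counted for adjacency prediction is not stated on the slide, so the counts in the example are purely hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, sensitivity, MCC) from confusion-matrix counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# Hypothetical counts of predicted contig adjacencies.
ppv, sensitivity, mcc = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print(f"PPV={ppv:.3f} sensitivity={sensitivity:.3f} MCC={mcc:.3f}")
```

PPV penalizes wrong joins, sensitivity penalizes missed joins, and MCC combines all four counts into one value between -1 and 1.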

Results

[Figure: bar chart "Scaffolding TPN50"; x-axis: Genome (staph, rhodo, chr14); y-axis: TPN50 (bp), 0 to 450,000; series: silp, opera, mip, bambus2]

Results

[Figure: bar chart "PPV"; x-axis: Genome (staph, rhodo, chr14); y-axis: PPV, 0% to 120%; series: silp, opera, mip, bambus2]

Results

[Figure: bar chart "Sensitivity"; x-axis: Genome (staph, rhodo, chr14); y-axis: Sensitivity, 0% to 80%; series: silp, opera, mip, bambus2]

Results

[Figure: bar chart "Matthews Correlation Coefficient"; x-axis: Genome (staph, rhodo, chr14); y-axis: MCC, 0% to 90%; series: silp, opera, mip, bambus2]

Conclusions

Success
• ILP solves scaffolding problem!
• NSDP works

Improvements
• Include SOAPdenovo, Allpaths-LG scaffolds in comparison
• Look at parameter effects
• Practical considerations (read style, multi-libraries, merge contigs)

Future Work
• Where else can I apply NSDP?
• Scaffold before assembly … promising
• Structural Variation??


Questions?