Post on 14-Dec-2015
Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
University of Connecticut*, Georgia State University
De-novo Assembly Paradigm
[Figure: The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds]
Why Scaffolding?
Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!
[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold and with no scaffold]
Sanger Sequencing
[Figure: gene XYZ with its 5’ UTR and 3’ UTR]
Biologist: There are holes in my genes!
Read Pairs
Paired Read Construction
[Figure: paired reads R1 and R2, 2kb apart, same strand and orientation]
Informative Reads
Align each read against the contigs
Only accept uniquely mapped reads
Use the non-unique reads later
Both reads in a pair must map to different contigs
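The filtering rules above can be sketched in a few lines of Python. This is only an illustration: the `Hit` record, the `/1` and `/2` read-name suffixes, and the function name `informative_pairs` are assumptions made for the example, not part of the talk.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment of one read (hypothetical record layout)."""
    read_id: str   # assumed naming: "<pair>/1" and "<pair>/2"
    contig: str
    pos: int
    strand: str    # "+" or "-"


def informative_pairs(hits):
    """Keep pairs whose two reads both map uniquely, to different contigs."""
    counts = Counter(h.read_id for h in hits)
    # A read is uniquely mapped if it has exactly one alignment.
    unique = {h.read_id: h for h in hits if counts[h.read_id] == 1}
    pairs = {}
    for read_id, hit in unique.items():
        pair_id, mate = read_id.rsplit("/", 1)
        if mate != "1":
            continue
        other = unique.get(pair_id + "/2")
        if other is not None and other.contig != hit.contig:
            pairs[pair_id] = (hit, other)
    return pairs


hits = [
    Hit("p1/1", "c1", 100, "+"), Hit("p1/2", "c2", 40, "+"),   # informative
    Hit("p2/1", "c1", 500, "+"), Hit("p2/2", "c1", 900, "+"),  # same contig
    Hit("p3/1", "c3", 10, "-"), Hit("p3/2", "c4", 70, "+"),
    Hit("p3/2", "c5", 15, "+"),                                # p3/2 maps twice
]
print(sorted(informative_pairs(hits)))  # ['p1']
```

Pairs dropped here because of a non-unique read are simply ignored; the "use later" step from the slide is not modeled.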
Linkage Information
Possible States
Two contigs are adjacent if:
A read pair spans the contigs
State (A, B, C, D):
Depends on orientation of the read
Order of contigs is arbitrary
Each read pair can be “consistent” with one of the four states
[Figure: contig i and contig j drawn 5’ to 3’, with read pair R1, R2 shown in each of the four states A, B, C, D]
The Scaffolding Problem
Given
• Contigs
• Paired reads
Find
• Orientation
• Ordering
• Relative distance
Goal
• Recreate true scaffolds
Possible Objectives
• Un-weighted
• Max number of consistent read pairs
• Weighted
• Each state is weighted: $W_{ij}^{A}, W_{ij}^{B}, W_{ij}^{C}, W_{ij}^{D}$
• Overlap with repeat
• Deviation from expected distance
• …
Graph Representation
Using the input we can define a scaffolding graph:
$G = (V, E)$
$V$: set of all contigs
$E$: set of …
This is an undirected multi-graph
Assume it is connected
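As a rough sketch of how such a multi-graph could be held in memory, the snippet below stores one undirected edge per contig pair together with its multiplicity, and checks the connectivity assumption. It assumes each link is one spanning read pair between two contigs; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_scaffolding_graph(links):
    """links: iterable of (contig_i, contig_j), one entry per spanning read pair.

    Returns (vertices, edges), where edges maps an unordered contig pair to
    the number of parallel edges between them.
    """
    edges = Counter(tuple(sorted(link)) for link in links)
    vertices = {contig for edge in edges for contig in edge}
    return vertices, edges


def is_connected(vertices, edges):
    """Depth-first search from an arbitrary vertex; True if all are reached."""
    if not vertices:
        return True
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices


links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3")]
vertices, edges = build_scaffolding_graph(links)
print(edges[("c1", "c2")], is_connected(vertices, edges))  # 2 True
```

A graph that is not connected can be split into its connected components first, with each component handled on its own.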
Integer Linear Program Formulation
Variables
Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
Contig Orientation: $S_i \in \{0,1\}$
Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$
Objective
Maximize weight of consistent pairs
$\max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$
Constraints
Pairwise Orientation
$S_{ij} \le S_j + S_i$
$S_{ij} \le 2 - S_i - S_j$
$S_{ij} \ge S_j - S_i$
$S_{ij} \ge S_i - S_j$
Mutual Exclusivity
$A_{ij} + D_{ij} \le 1 - S_{ij}$
$B_{ij} + C_{ij} \le S_{ij}$
Forbid 2 and 3 Cycles Explicitly
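A quick brute-force check makes the effect of these constraints concrete. Enumerating all binary values shows that the four pairwise-orientation inequalities leave exactly one feasible value, $S_{ij} = S_i \oplus S_j$, and that the two exclusivity inequalities allow at most one of the four states per contig pair. The function names are made up for this sketch, and the cycle constraints are not covered.

```python
from itertools import product


def feasible_sij(si, sj):
    """Values of S_ij in {0,1} satisfying the four orientation constraints."""
    return [
        sij for sij in (0, 1)
        if sij <= sj + si
        and sij <= 2 - si - sj
        and sij >= sj - si
        and sij >= si - sj
    ]


def allowed_states(sij):
    """Assignments (A, B, C, D) with A + D <= 1 - S_ij and B + C <= S_ij."""
    return [
        (a, b, c, d)
        for a, b, c, d in product((0, 1), repeat=4)
        if a + d <= 1 - sij and b + c <= sij
    ]


for si, sj in product((0, 1), repeat=2):
    # The only feasible S_ij is the XOR of the two contig orientations.
    print(si, sj, feasible_sij(si, sj))

for sij in (0, 1):
    # Each allowed assignment switches on at most one state.
    print(sij, allowed_states(sij))
```

With $S_{ij} = 0$ only A or D can be selected, and with $S_{ij} = 1$ only B or C, so the orientation variables decide which states a pair may take.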
Graph Decomposition: Articulation Points
[Figure: graph split at an articulation point, with a part marked "solve"]
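Articulation points can be found with a standard depth-first search using low-link values. The sketch below is a textbook version, not the talk's implementation. It assumes the graph arrives as a list of contig pairs and collapses parallel edges, which does not change which vertices are articulation points.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the set of vertices whose removal disconnects the graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    disc, low, cut = {}, {}, set()
    timer = 0

    def dfs(u, parent):
        # Recursive for brevity; very large graphs would need an
        # iterative version to avoid Python's recursion limit.
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])      # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)                     # v's subtree cannot bypass u
        if parent is None and children > 1:
            cut.add(u)                             # root with several subtrees

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return cut


# Two triangles joined by the edge c-d: removing c or d splits the graph.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("e", "f"), ("f", "d")]
print(sorted(articulation_points(edges)))  # ['c', 'd']
```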
Graph Decomposition: 2-cuts
[Figure: a 2-cut, annotated with the orientation combinations + +, + -, - +, - -]
Non-Serial Dynamic Programming
• SPQR-tree to schedule decomposition
• Traverse tree using DFS
• NSDP utilizes solutions of previous stage in current stage
[Figure: largest connected component, largest biconnected component, largest triconnected component]
Post Processing ILP Solution
May have cycles
Not a total ordering
For each connected component:
[Figure: ILP solution on contigs A to F, redrawn as a bipartite graph with an "outgoing" copy and an "incoming" copy of each contig]
Bipartite matching
Objectives:
Max weight
Max cardinality
Max cardinality / Max weight
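For the max-cardinality objective, a minimal sketch using augmenting paths (Kuhn's algorithm) could look like the following. It assumes the ILP solution is given as a mapping from each contig to its candidate successors, which is a guess at the representation. Weights are ignored, and any cycles left after matching are not broken here.

```python
def max_cardinality_matching(succ):
    """succ: {contig: [candidate successor contigs]}.

    Returns {contig: chosen successor} so that every contig has at most one
    successor (outgoing side) and at most one predecessor (incoming side).
    """
    match_in = {}  # incoming-side contig -> outgoing-side contig matched to it

    def try_assign(u, seen):
        for v in succ.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move on.
            if v not in match_in or try_assign(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in succ:
        try_assign(u, set())
    return {u: v for v, u in match_in.items()}


# A first grabs B; D can only use B, so A is re-routed to C.
ilp_edges = {"A": ["B", "C"], "D": ["B"], "B": ["E"]}
chain = max_cardinality_matching(ilp_edges)
print(chain == {"A": "C", "D": "B", "B": "E"})  # True
```

Following the chosen successors gives the chains D, B, E and A, C, each of which reads as one ordered scaffold.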
Testing Framework
Venter Genome
Read Type    Total Reads  Total Bases  Avg length  Coverage
Sanger       31,861,976   2.79E+10     875         9.930637
SOLiD pairs  4.85E+08     2.42E+10     50          8.623028

4x Assembly
# Reads     # Bases in reads  # Contigs  # Bases in contigs  N50
112,00,000  1.1E+10           422,837    2.26E+09            7704
Testing Metrics
Computer Scientists
Finding Scaffold = Binary Classification Test
n contigs, try to predict n-1 adjacencies
TP, FP, TN, FN, Sensitivity, PPV
Biologists (main focus)
N50 (roughly a typical scaffold size: scaffolds at least this long hold half of all bases; gaps ignored)
TP50: Break scaffold at incorrect edges, then find N50
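These metrics are short enough to write out. The sketch below assumes scaffolds are ordered lists of contig names, adjacencies are unordered contig pairs, and orientation is ignored. All names are illustrative.

```python
def n50(lengths):
    """Largest length L such that pieces of length >= L hold half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def sensitivity_ppv(predicted, true):
    """predicted, true: sets of frozenset({contig_a, contig_b}) adjacencies."""
    tp = len(predicted & true)
    sensitivity = tp / len(true) if true else 0.0
    ppv = tp / len(predicted) if predicted else 0.0
    return sensitivity, ppv


def tp50(scaffolds, contig_len, true):
    """Break scaffolds at adjacencies not in `true`, then take N50 of the pieces."""
    pieces = []
    for scaffold in scaffolds:
        if not scaffold:
            continue
        size = contig_len[scaffold[0]]
        for a, b in zip(scaffold, scaffold[1:]):
            if frozenset((a, b)) in true:
                size += contig_len[b]      # correct join: keep extending
            else:
                pieces.append(size)        # incorrect join: cut here
                size = contig_len[b]
        pieces.append(size)
    return n50(pieces)


contig_len = {"c1": 100, "c2": 200, "c3": 300, "c4": 400}
true = {frozenset(p) for p in [("c1", "c2"), ("c2", "c3"), ("c3", "c4")]}
scaffolds = [["c1", "c2", "c4", "c3"]]  # the c2-c4 join is wrong
predicted = {frozenset(p) for s in scaffolds for p in zip(s, s[1:])}

print(sensitivity_ppv(predicted, true))
print(n50([sum(contig_len[c] for c in s) for s in scaffolds]))  # 1000
print(tp50(scaffolds, contig_len, true))                        # 700
```

In this toy case one wrong join leaves N50 at 1000 but drops TP50 to 700, so TP50 penalizes long scaffolds that are wrong.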
Results
test case  method  bundle size  sensitivity  ppv     N50     TP50
10%        opera   2            81.13%       99.26%  27,567  27,327
10%        mip     2            59.01%       98.94%  19,988  19,755
10%        ilp     1            79.86%       98.58%  26,814  26,459
25%        opera   2            80.44%       98.27%  27,296  26,849
25%        mip     2            58.95%       97.56%  19,842  19,518
25%        ilp     1            79.30%       96.93%  26,684  26,079
100%       opera   3            pending      …       …       …
100%       mip     3            failed       n/a     n/a     n/a
100%       ilp     1            68.25%       89.90%  20,538  19,006
Conclusions
Success
ILP solves scaffolding problem!
NSDP works.
Improvements
Finalize large test cases (then publish?!)
Practical considerations (read style, multi-libraries, merge ctgs)
Future Work
Where else can I apply NSDP?
Scaffold before assembly??
Structural Variation??
Questions?