fragment assembly
04/22/23 1
Fragment Assembly
04/22/23 2
Introduction
Fragments are typically 200-700 bp long
The “target” string is about 30k-100k bp long
Problem: given a set of fragments, reconstruct the target
04/22/23 3
Introduction
Multiple alignment of the fragments, ignoring spaces at the ends
The alignment is called the “layout”
The output is called the “consensus sequence”
An optimization problem
04/22/23 4
Complications
Base-call errors:
  Substitution errors [p 107]
  Insertion errors (possibly from the host sequence) [p 108, fig 4.3]
  Deletion errors [fig 4.4]
  Majority voting (or some other form of optimization) resolves them
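Majority voting over the columns of a layout can be sketched as below. This is only an illustration: the function name majority_consensus and the row conventions (a blank for positions a fragment does not cover, '-' for a gap inside a fragment) are assumptions, not something the slides specify.

```python
from collections import Counter

def majority_consensus(layout):
    """Return a consensus string from equal-length layout rows.

    Assumed conventions: ' ' means the fragment does not cover the
    column, '-' means a gap inside the fragment.
    """
    consensus = []
    for column in zip(*layout):
        # Only fragments that actually cover this column get a vote.
        votes = Counter(ch for ch in column if ch != ' ')
        if not votes:
            continue  # uncovered column: nothing to report
        winner, _ = votes.most_common(1)[0]
        if winner != '-':  # a winning gap means the column is dropped
            consensus.append(winner)
    return ''.join(consensus)

# The second fragment carries a substitution error (A instead of T).
layout = [
    "ACGTAC  ",
    " CGAACGT",
    "  GTACGT",
]
result = majority_consensus(layout)
```

With three fragments covering the erroneous column, the two correct reads outvote the bad base call. Ties are broken arbitrarily here, which a real assembler would handle with quality scores or a proper optimization.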
04/22/23 5
Complications
Chimeras:
  Two non-contiguous fragments get joined into a single fragment [p 109, fig 4.5]
  They need to be weeded out in a preprocessing step
  Like chimeras, contaminant fragments (possibly from the host) need to be filtered out as well
04/22/23 6
Complications
Unknown orientation:
  Fragments may come from either strand
  If a fragment comes from the opposite strand, its reverse complement must be in the target string
  Consequence: try both the forward and the reverse-complement form of each fragment (2^n combinations in the worst case, for n fragments) [p 109, fig 4.6]
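A reverse complement is cheap to compute, so an assembler can carry both orientations of every fragment. The sketch below assumes an uppercase A/C/G/T alphabet; the helper names are illustrative.

```python
# Translation table for the four DNA bases (assumes uppercase input).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(_COMPLEMENT)[::-1]

def both_orientations(fragments):
    """Pair each fragment with its reverse complement."""
    return [(f, reverse_complement(f)) for f in fragments]

pairs = both_orientations(["AACG", "GTTA"])
```

Applying reverse_complement twice returns the original fragment, which is a handy sanity check.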
04/22/23 7
Complications
Repeats:
  Regions (superstrings of some fragments) may repeat in a target
  Consequent problem: with approximate alignment, where do the fragments really come from? [p 110, fig 4.7]
  Problem 2: where should the inter-repeat fragments go? [p 111, fig 4.8, fig 4.9]
  Inverted repeats: repeats of the reverse complement [fig 4.10]
04/22/23 8
Complications
Insufficient coverage:
  The chance of covering the target increases with redundancy (a heuristic: sample fragments totalling 8 times the target length)
  The chance of covering a remaining gap drops once it is still uncovered after many fragments have been aligned: more random sampling is not a good solution here
04/22/23 9
Complications
Insufficient coverage:
  What you get with insufficient coverage is multiple “contigs,” not one contig
  A “t-contig” is a contig in which we expect an overlap of length t between adjacent pairs of fragments
  Expected number of contigs: [p 112, formula 4.1]
  Lower t means fewer contigs (more aligned segments), but a weaker consensus
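The slide cites formula 4.1 without reproducing it. Assuming it is the usual Lander-Waterman style estimate, n * exp(-n(l - t)/T) for n fragments of length l sampled from a target of length T with required overlap t, the trade-off in t can be explored numerically. Both the formula and the parameter names below are assumptions, not taken from the slides.

```python
import math

def expected_contigs(n, frag_len, target_len, t):
    """Assumed Lander-Waterman style estimate of the number of contigs.

    n          -- number of fragments
    frag_len   -- common fragment length l
    target_len -- target length T
    t          -- minimum overlap required to join two fragments
    """
    return n * math.exp(-n * (frag_len - t) / target_len)

def redundancy(n, frag_len, target_len):
    """Total sampled bases divided by target length (e.g. 8 for 8x)."""
    return n * frag_len / target_len

# 800 fragments of 500 bp over a 50 kbp target is 8x redundancy.
loose = expected_contigs(800, 500, 50_000, t=20)
strict = expected_contigs(800, 500, 50_000, t=100)
```

Under this assumed formula a smaller t yields fewer expected contigs, matching the direction stated on the slide.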
04/22/23 10
Reconstruction
Shortest common superstrings are not the best solution
Fig 4.12 vs Fig 4.13 (p 115/116)
04/22/23 11
Reconstruction
A superstring is to be reconstructed out of the fragments
This is an alignment problem with no end penalty
d_s is the edit distance score without end penalty: the edit distance d minimized over all substrings of the superstring
Fig 4.14 (p 117) for the best-aligned substring match
Note: a matched character is charged 0, a mismatch 1, a gap 2, as a “distance” rather than a “similarity”
We will write d for d_s
04/22/23 12
Reconstruction
f is an approximate substring of S at error level e if the score satisfies d(f, S) <= e|f|
e = 0 means no error is allowed
e > 0 allows insert/delete/substitution errors
Both f and its reverse complement f- should be tried for a match
04/22/23 13
Reconstruction: Problem
Input: set F of fragments, error level e
Output: shortest possible string S such that, for all f in F, min(d(f, S), d(f-, S)) <= e|f|
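The distance d (no end penalty, costs 0/1/2) and the acceptance condition min(d(f, S), d(f-, S)) <= e|f| can be checked with a small semi-global dynamic program. This is a sketch under stated assumptions: the function names are illustrative, the alphabet is uppercase A/C/G/T, and it only verifies a candidate S rather than finding the shortest one.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    return f.translate(_COMPLEMENT)[::-1]

def substring_distance(f, s, mismatch=1, gap=2):
    """Minimum edit distance between f and any substring of s."""
    # Row 0 is all zeros: f may start anywhere in s at no cost.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        curr = [prev[0] + gap]  # f[:i] against nothing costs i gaps
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (0 if f[i - 1] == s[j - 1] else mismatch)
            curr.append(min(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # Taking the minimum of the last row makes the end of s free too.
    return min(prev)

def is_valid_reconstruction(fragments, s, e):
    """True if every fragment, in some orientation, fits s at level e."""
    return all(
        min(substring_distance(f, s),
            substring_distance(reverse_complement(f), s)) <= e * len(f)
        for f in fragments
    )

S = "ACGTTA"
exact = is_valid_reconstruction(["ACGT", "TAAC"], S, e=0)
noisy = is_valid_reconstruction(["ACCT"], S, e=0.25)
```

Here "TAAC" only matches through its reverse complement "GTTA", and "ACCT" needs one substitution, which e = 0.25 permits for a fragment of length 4.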
04/22/23 14
Reconstruction: Multicontig
How much overlap do we require between strings?
Ideally, each column in the layout L should hold the same character, for all columns 1 through |L|
Fig 4.4 (p 118): t-contig for t = 3, 2, 1
Balance between t and the number of t-contigs
04/22/23 15
Reconstruction: Multicontig
S is an e-consensus sequence (multicontig) for 0 <= e <= 1 if the edit distance d(f, S) <= e|f| for every fragment f
Multicontig problem:
  Input: set F, integer t >= 0, 0 <= e <= 1
  Output: a minimum partition of F in which each part C_i is a t-contig with an e-consensus
04/22/23 16
Reconstruction: Overlap Multi-graph
Nodes are the fragments
Directed arcs are labeled with the length t of the overlap between nodes: t-suffix of the tail = t-prefix of the head
Arcs exist between all pairs of nodes, but there are no self-loops
Fig 4.15 (p 121): example
Length of a created superstring = total length of all fragments involved - total weight (overlap) along the path
A maximum-weight Hamiltonian path is what we are looking for in this graph: it gives the maximally overlapped superstring
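A minimal sketch of the overlap graph for exact overlaps follows. It is a simplification of the multigraph on the slide: only the longest overlap per ordered pair is kept, the fragments are assumed distinct and substring-free, and all names are illustrative.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def overlap_graph(fragments):
    """Arc weights for every ordered pair of distinct fragments."""
    return {(a, b): overlap(a, b) for a, b in permutations(fragments, 2)}

def superstring_from_path(path):
    """Glue fragments along a path, sharing each overlap once."""
    s = path[0]
    for a, b in zip(path, path[1:]):
        s += b[overlap(a, b):]
    return s

fragments = ["ACGT", "GTTA", "TAGC"]
graph = overlap_graph(fragments)
path_weight = graph[("ACGT", "GTTA")] + graph[("GTTA", "TAGC")]
superstring = superstring_from_path(fragments)
```

For this path the total fragment length is 12 and the path weight is 4, so the superstring has length 8, which illustrates why maximizing path weight minimizes superstring length.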
04/22/23 17
Reconstruction
Fragments that are substrings of other fragments in the set are noise: remove them
Draw the OMG of the substring-free set of fragments
A shortest common superstring always corresponds to a Hamiltonian path in this graph
04/22/23 18
Reconstruction: OMG
Thm 4.1 (p 123): if F is substring-free, then for every common superstring S there is a Ham. path P such that S(P) is in S
Fragments are strictly ordered over S: order of left endpoints = order of right endpoints (otherwise a substring would exist)
The path follows the same order of fragments in the OMG as in S
S may contain extra garbage material, so S(P) is within S
04/22/23 19
Reconstruction: OMG
If S is a shortest common superstring, then S must be within S(P), so S = S(P)
In other words, a shortest common superstring of the fragment set F corresponds to a Ham. path in the OMG of the substring-free collection F’
04/22/23 20
Reconstruction: OMG
Think of an algorithm for weeding out substrings from F
Also, weed out multi-edges by keeping the largest-weight edge between any pair of nodes
If the weight on an edge is below a threshold t, then the weight should be treated as 0
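One straightforward answer to the substring exercise, together with the threshold rule, is sketched here. It is a quadratic-time illustration with hypothetical function names; a suffix tree or similar index would be the choice for large fragment sets.

```python
def remove_substrings(fragments):
    """Drop duplicates and any fragment contained in another one."""
    # Longest first, so a fragment can only be contained in one
    # that has already been kept; ties are broken alphabetically.
    unique = sorted(set(fragments), key=lambda f: (-len(f), f))
    kept = []
    for f in unique:
        if not any(f in g for g in kept):
            kept.append(f)
    return kept

def apply_threshold(weights, t):
    """Treat every overlap shorter than t as weight 0."""
    return {arc: (w if w >= t else 0) for arc, w in weights.items()}

substring_free = remove_substrings(["ACGTT", "CGT", "GTTA", "TT", "GTTA"])
thresholded = apply_threshold({("a", "b"): 3, ("b", "a"): 1}, t=2)
```

Keeping only the largest overlap per ordered pair, as the slide suggests for multi-edges, amounts to storing one weight per arc as the dictionary above does.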
04/22/23 21
Reconstruction: OMG
Greedy algorithm to draw a Ham. path (p 125)
Collects edges from largest to smallest weight, subject to:
  (1) no cycle is created (union-find)
  (2) the indegree of each node stays <= 1 (the first node has 0)
  (3) the outdegree of each node stays <= 1 (the last node has 0)
[Does not return the Ham. path. Can you modify it to return the Ham. path?]
The algorithm is NOT optimal, example (p 126): it returns weight 3, the optimal weight is 4
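The greedy edge collection with its three constraints can be sketched as follows. This is an illustration, not the textbook listing: it uses exact overlaps, assumes distinct fragments, and adds a hypothetical order_path helper as one possible answer to the exercise of returning the path itself.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0

def greedy_edges(fragments):
    """Pick arcs by decreasing weight under the three constraints."""
    arcs = sorted(((overlap(a, b), a, b)
                   for a, b in permutations(fragments, 2)),
                  key=lambda arc: -arc[0])
    parent = {f: f for f in fragments}

    def find(x):  # union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_out, has_in, chosen = set(), set(), []
    for w, a, b in arcs:
        if a in has_out or b in has_in:
            continue            # constraints (2) and (3)
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue            # constraint (1): would close a cycle
        parent[root_a] = root_b
        has_out.add(a)
        has_in.add(b)
        chosen.append((a, b, w))
    return chosen

def order_path(fragments, chosen):
    """Turn the chosen arcs into an ordered list of fragments."""
    succ = {a: b for a, b, _ in chosen}
    heads = set(succ.values())
    start = next(f for f in fragments if f not in heads)
    path = [start]
    while path[-1] in succ:
        path.append(succ[path[-1]])
    return path

fragments = ["ACGT", "GTTA", "TAGC"]
chosen = greedy_edges(fragments)
path = order_path(fragments, chosen)
total_weight = sum(w for _, _, w in chosen)
```

Because zero-weight arcs are included, the collected arcs always link all fragments into one path; as the slide warns, that path is not guaranteed to have maximum weight.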
04/22/23 22
Reconstruction: OMG
Subinterval: a fragment that can be embedded within another one in the set
A subinterval-free and repeat-free graph that is connected at level t has a Ham. path that generates the target string
04/22/23 23
Reconstruction: OMG
If a repeat exists in the original string, then the graph will have a cycle
False positive: substrings from two different portions of the target have a t-overlap
If a cycle exists in the graph, then there must be a “false positive” (Thm 4.4, p 129): proof by contradiction, since otherwise the subinterval-free fragments could be totally ordered
04/22/23 24
Reconstruction: OMG
If there are no repeats in a subinterval-free graph, then there exists a unique Ham. path
If a cycle exists, it may not come from a repeat
04/22/23 25
Reconstruction: OMG
Example 4.6 (p 130): the greedy algorithm finds the wrong string, but the Ham. path finds the correct one
Greedy does not care about linkage (it optimizes total overlap, i.e. it looks for a shortest common superstring)
The Ham. path approach chooses any t-overlap connections: it cares about linkage only
04/22/23 26
Parameters in aligning for fragment assembly
Score on a column: traditionally {0, -1, -2} in sum-of-pairs
Entropy: sum over characters c (alphabet plus space) of -p_c log p_c, where p_c is the probability of c
  All same character: p_c = 1, entropy = 0
  For {a, t, c, g, -}, all different: p_c = 1/5, entropy = log 5
Entropy measures uniformity alone, a better metric
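The column entropy is easy to compute from character counts. The sketch below estimates p_c as the relative frequency of c in the column, which is an assumption about how the probabilities are obtained, and uses the natural logarithm.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy of one layout column, with p_c taken as the
    relative frequency of character c (gap '-' counts as a character)."""
    total = len(column)
    # -p log p is written as p log(1/p) to avoid returning -0.0.
    return sum((n / total) * math.log(total / n)
               for n in Counter(column).values())

uniform = column_entropy("aaaaa")
mixed = column_entropy("atcg-")
```

A perfectly uniform column scores 0 and five distinct characters score log 5, the two extremes listed on the slide.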
04/22/23 27
Parameters in aligning for fragment assembly
Coverage: how many fragments cover each column? (average, min, max)
This is different from the concept of t-overlap
If a column (of the target) is covered by 0 fragments, then the layout is disconnected
Expecting coverage > 1 for all columns works against the requirement of a subinterval-free collection
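Per-column coverage can be computed from fragment positions with a difference array. In this sketch the half-open (start, end) intervals and the function names are assumptions; in practice the positions would come from the layout.

```python
def column_coverage(intervals, target_len):
    """Number of fragments covering each column 0..target_len-1.

    intervals -- (start, end) pairs, half-open, in target coordinates
    """
    diff = [0] * (target_len + 1)
    for start, end in intervals:
        diff[start] += 1   # fragment begins covering here
        diff[end] -= 1     # and stops just before this column
    coverage, running = [], 0
    for delta in diff[:target_len]:
        running += delta
        coverage.append(running)
    return coverage

def coverage_summary(coverage):
    """(min, max, average) coverage; min == 0 means a disconnected layout."""
    return min(coverage), max(coverage), sum(coverage) / len(coverage)

cov = column_coverage([(0, 4), (2, 6), (7, 9)], target_len=9)
lowest, highest, average = coverage_summary(cov)
```

In this example column 6 has coverage 0, so the layout splits into two contigs even though the average coverage is above 1.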
04/22/23 28
Parameters in aligning for fragment assembly
Coverage is not enough, we need good linkage. Example: p 133
The Ham. path algorithm takes care of that
04/22/23 29
Steps in assembly
Step 1: Overlap finding
Approximate: delete, insert, and replace are allowed, using the semi-global DP algorithm with an appropriate end-gap penalty, pairwise over all fragments and their reverse complements
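Approximate overlap finding can be sketched as a semi-global alignment in which the unused prefix of the first fragment and the unused suffix of the second are free. The similarity scores (+1 match, -1 mismatch, -2 gap) and the function names are assumptions for illustration, since a distance-based score would make the empty overlap trivially optimal.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    return f.translate(_COMPLEMENT)[::-1]

def best_overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score aligning some suffix of a with some prefix of b."""
    # Row 0: b's prefix aligned against nothing is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [0]  # skipping a prefix of a is free
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # a is fully consumed; the rest of b hangs off the end for free.
    return max(prev)

def all_overlaps(fragments):
    """Scores for every ordered pair over F and its reverse complements."""
    oriented = fragments + [reverse_complement(f) for f in fragments]
    return {(a, b): best_overlap_score(a, b)
            for a in oriented for b in oriented if a != b}

forward = best_overlap_score("ACGTT", "GTTAC")
flipped = best_overlap_score("ACGTT", reverse_complement("GTAAC"))
```

The second call shows why both orientations are scored: "GTAAC" overlaps "ACGTT" only after it is reverse-complemented.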
04/22/23 30
Steps in assembly
Step 2: Construct the overlap graph over (F union F-bar) for the fragment set F (after eliminating substrings?)
Construct a Hamiltonian path in this graph
Cycles and unbalanced coverage may mean repeats
04/22/23 31
Steps in assembly
Step 3: Fine-tune the multiple alignment to get a consensus target
Manual or algorithmic
Examples on p 137-138