Post on 21-Dec-2015
Optimal Phylogenetic Networks with Constrained and
Unconstrained Recombination (The root-unknown case)
Dan Gusfield
UC Davis
Reconstructing the Evolution of Binary Bio-Sequences
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with recombination
• Phylogenetic Networks with disjoint cycles: Galled-Trees
• Phylogenetic Networks with unconstrained cycles: Blobbed-Trees
• Combinatorial Structure and Efficient Algorithms
Genealogical or Phylogenetic Networks
• The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations.
• Also relates to phylogenetic-based haplotyping.
• The algorithmic and mathematical results also have phylogenetic applications, for example in lateral gene transfer.
The Perfect Phylogeny Model for binary sequences
[Figure: a perfect phylogeny on sites 1 2 3 4 5. The root carries the ancestral sequence 00000, the site mutations 1 to 5 label the edges, and the extant sequences label the leaves.]
The tree derives the set M: 10100, 10000, 01011, 01010, 00010.
When can a set of sequences be derived on a perfect phylogeny with the all-0 root?
Classic necessary and sufficient condition (NASC): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:
0,0 and 0,1 and 1,0 and 1,1
This is the 4-Gamete Test.
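The test above can be sketched in a few lines of Python. This is only an illustration: the function names `gametes` and `passes_four_gamete_test` are made up here, and the sequences are the ones derived by the perfect phylogeny on the earlier slide, written as strings of 0s and 1s.

```python
from itertools import combinations

def gametes(rows, i, j):
    """Set of (state at column i, state at column j) pairs seen in rows (0-based columns)."""
    return {(r[i], r[j]) for r in rows}

def passes_four_gamete_test(rows):
    """True iff no pair of columns contains all four pairs 00, 01, 10 and 11."""
    m = len(rows[0])
    return all(len(gametes(rows, i, j)) < 4 for i, j in combinations(range(m), 2))

# Sequences derived by the perfect phylogeny on the earlier slide.
M = ["10100", "10000", "01011", "01010", "00010"]

print(passes_four_gamete_test(M))              # True
print(passes_four_gamete_test(M + ["10101"]))  # False: columns 4 and 5 now show all four pairs
```

The check is the naive one, comparing every pair of columns, which is enough for small examples.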
A richer model
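[Figure: the previous perfect phylogeny on sites 1 2 3 4 5, with the sequence 10101 added to M: 10100, 10000, 01011, 01010, 00010, 10101.]
The pair of sites 4, 5 now fails the three-gamete test (and, since the pair 0,0 also appears, the 4-gamete test). The sites 4 and 5 “conflict”.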
Real sequence histories often involve recombination.
Sequence Recombination
[Figure: single-crossover recombination of P = 10100 and S = 01011 at recombination point 5, producing 10101.]
The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
A recombination of P and S at recombination point 5.
Single crossover recombination
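A minimal sketch of a single-crossover recombination, assuming sequences are strings and that the recombination point r is 1-based as on the slide (sites before r come from P, sites from r onward come from S). The function name `recombine` is illustrative only.

```python
def recombine(p, s, r):
    """Single crossover: sites 1..r-1 are copied from p, sites r..m from s (r is 1-based)."""
    if len(p) != len(s):
        raise ValueError("P and S must have the same number of sites")
    if not 1 <= r <= len(p):
        raise ValueError("recombination point out of range")
    return p[:r - 1] + s[r - 1:]

# The example from the slide: P = 10100, S = 01011, recombination point 5.
print(recombine("10100", "01011", 5))  # 10101
```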
Network with Recombination
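[Figure: the previous tree on sites 1 2 3 4 5 with one recombination node added (parents P and S, recombination point 5). It derives 10100, 10000, 01011, 01010, 00010 and the new sequence 10101.]
The previous tree with one recombination event now derives all the sequences.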
Multiple Crossover Recombination
4-crossovers
2-crossovers = “gene conversion”
Elements of a Phylogenetic Network (single crossover recombination)
• Directed acyclic graph.
• Integers from 1 to m written on the edges. Each integer written only once. These represent mutations.
• A choice of ancestral sequence at the root.
• Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it.
• A node with two edges into it is a “recombination node”, with a recombination point r. One parent is P and one is S.
• The network derives the sequences that label the leaves.
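The elements listed above can be turned into a small sketch that computes node labels from the root downward. Everything in it is an assumption made for illustration: the dictionary encoding, the node names u, v, w, x, y, z, and the topology, which is just one arrangement consistent with the sequences 10100, 10000, 01011, 01010, 00010 and the recombinant 10101 from the earlier slides.

```python
def derive(network, root_seq):
    """Label every node of a network given as {node: spec}, listed parents-first.

    spec is ("mut", parent, [sites]) for a tree edge carrying mutations, or
    ("rec", P, S, r) for a recombination node with 1-based recombination point r.
    """
    labels = {"root": root_seq}
    for node, spec in network.items():
        if spec[0] == "mut":
            _, parent, sites = spec
            seq = list(labels[parent])
            for site in sites:                      # each site mutates on this edge
                seq[site - 1] = "1" if seq[site - 1] == "0" else "0"
            labels[node] = "".join(seq)
        else:
            _, p, s, r = spec                       # prefix from P, suffix from S
            labels[node] = labels[p][:r - 1] + labels[s][r - 1:]
    return labels

# Hypothetical topology: two branches from the all-0 root plus one recombination node.
network = {
    "u": ("mut", "root", [1]),   # 10000
    "x": ("mut", "u", [3]),      # 10100
    "v": ("mut", "root", [4]),   # 00010
    "w": ("mut", "v", [2]),      # 01010
    "y": ("mut", "w", [5]),      # 01011
    "z": ("rec", "x", "y", 5),   # recombination of x (P) and y (S) at point 5
}

labels = derive(network, "00000")
print(labels["z"])  # 10101
```

Each integer from 1 to 5 appears on exactly one edge, matching the one-mutation-per-site requirement.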
A Phylogenetic Network
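[Figure: a phylogenetic network rooted at 00000, with the mutations 1 to 5 on its edges and recombination nodes whose parents are marked P and S. Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]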
Which Phylogenetic Networks are meaningful?
Given M we want a phylogenetic network that derives M, but which one?
A: A perfect phylogeny (tree) if possible. As little deviation from a tree as possible, if a tree is not possible. Use as little recombination or gene-conversion as possible.
Minimizing recombinations
• Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is small in realistic sets of sequences. “Observable” depends on n and m relative to the number of recombinations.
• Two algorithmic problems: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem). Find a network generating M that has some biologically-motivated structural properties.
Minimization is NP-hard
The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000)
They explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible.
They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time.
Recombination Cycles
• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.
• The cycle specified by those two paths is called a “recombination cycle”.
Galled-Trees
A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle.
A phylogenetic network is called a “galled-tree” if every recombination cycle is a gall.
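[Figure: a galled-tree with the mutations 1 to 5 on its edges and recombination nodes whose parents are marked p and s. Leaves: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]
A galled-tree generating the sequences generated by the prior network.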
Sales pitch for Galled-Trees
Galled-trees represent a small deviation from true trees.
There are sufficient applications where it is plausible that a galled tree exists that generates the sequences.
Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable recombinations may be a galled-tree.
The number of recombinations is never more than m/2. Moreover, when M can be derived on a galled-tree, the number of recombinations used is the minimum possible over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination.
A galled-tree for M is “almost unique” - implications for reconstructing the correct history.
Old (Aug. 2003) Results
• O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence.
• Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution.
• Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic-networks with all-0 ancestral sequence.
The initial version is in the Proceedings of the 2003 IEEE CSB conference.
The expanded journal version with proof of optimality is in “Optimal, Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination” by Gusfield, Eddhu and Langley. J. Bioinformatics and Computational Biology, Vol. 2, No. 1 (2004) p. 173-213
New work
We derive the galled-tree results in a more general setting that addresses unconstrained recombination cycles and multiple-crossover recombination. This also solves the problem of finding the “most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral sequence is known in advance.
Blobbed-trees: generalizing galled-trees
• In a phylogenetic network a maximal set of intersecting cycles is called a blob.
• Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal.
• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
[Figure: a tree of blobs; one blob is annotated “Ugly tangled network inside the blob.”]
Every network is a tree of blobs. How do the tree parts and the blobs relate?
How can we exploit this relationship?
The start of technical stuff
Site Conflicts
A pair of sites (columns) of M that fail the 4-gamete test are said to conflict.
And each site in the pair is said to be conflicted.
A site that is not in such a pair is unconflicted.
THE MAIN TOOL: We represent the pairwise conflicts in a conflict graph.
M
    1 2 3 4 5
a   0 0 0 1 0
b   1 0 0 1 0
c   0 0 1 0 0
d   1 0 1 0 0
e   0 1 1 0 0
f   0 1 1 0 1
g   0 0 1 0 1
Conflict Graph
[Figure: the conflict graph of M, one node per site, with edges 1-3, 1-4 and 2-5.]
Two nodes are connected iff the pair of sites conflict, i.e., fail the 4-gamete test.
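As a rough sketch, the conflict graph of the matrix M on this slide can be computed by applying the 4-gamete test to every pair of columns. The function name `conflict_edges` is an illustrative choice, and the rows of M are written as strings in the order a to g.

```python
from itertools import combinations

def conflict_edges(rows):
    """Return the pairs of sites (1-based) that fail the 4-gamete test."""
    m = len(rows[0])
    edges = []
    for i, j in combinations(range(m), 2):
        # A pair of sites conflicts when all four of 00, 01, 10, 11 appear.
        if len({(r[i], r[j]) for r in rows}) == 4:
            edges.append((i + 1, j + 1))
    return edges

# Rows a..g of the matrix M from the slide.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]

print(conflict_edges(M))  # [(1, 3), (1, 4), (2, 5)]
```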
Simple Fact
If two sites i and j conflict, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j.
(This is a general fact for all phylogenetic networks.)
Ex: In the prior example, site 1 conflicts with 3 and 4; and site 2 conflicts with 5.
A Phylogenetic Network
[Figure: the same phylogenetic network as before, rooted at 00000, with leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]
Simple Consequence of the simple fact
All sites on the same (non-trivial) connected component of the conflict graph must be on the same blob in any blobbed-tree. Follows by transitivity.
So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the conflict graph.
Key Result about Galls: For galls, the converse of the simple consequence is also true.
Two sites that are in different (non-trivial) connected components cannot be placed on the same gall in any phylogenetic network for M.
Hence, in a galled-tree T for M each gall contains all and only the sites of one (non-trivial) connected component of the conflict graph. All unconflicted sites can be put on edges outside of the galls.
This was the key to the galled-tree solution.
[Figure: the galled-tree for a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101, shown next to the conflict graph with components {1, 3, 4} and {2, 5}.]
A galled-tree generating the sequences generated by the prior network.
Reduced Galled-Trees
• A galled-tree is called reduced if every gall only contains conflicted sites.
• Theorem: If M can be derived on a galled-tree, it can be derived on a reduced galled-tree.
• The number of recombination nodes in a reduced galled-tree equals the number of non-trivial connected components of the conflict graph, which is the minimum number of recombinations possible in any galled-tree.
• A reduced galled-tree for M (if one exists) minimizes the number of recombinations used over any phylogenetic network for M. The initial proof is based on a lower bound method.
• Another proof (together with Dean Hickerson): The minimum number of recombinations needed in any phylogenetic network for M is at least the number of non-trivial connected components of the conflict graph for M.
The main new fact
For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of the conflict graph. The unconflicted sites can always be put on edges outside of any blob. Moreover, the tree part of T(M) is unique.
This is a bit weaker than the result for galled-trees: it replaces “must” with “can”.
My proof is direct and self-contained, but the result may also be derivable from well-established results about splits, for example, from facts known about Buneman graphs. Thanks to Mike Steel for pointing this out to me.
What does T(M) look like? How do we construct it? How do we exploit it?
Let T’(M) be the tree where each blob in T(M) has been contracted to a single node.
Finding T’(M) is easy from M. Then we expand each node in T’(M) to get T(M).
How to find T’(M)
For a connected component C in the conflict graph, let M[C] be M restricted to the sites in C.
Consider each distinct sequence S in M[C] as defining a binary character (split) partitioning those rows of M[C] that contain S, and those that do not.
Let M’ be the binary matrix obtained from these characters, over all connected components in the conflict graph.
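A small sketch of this construction, under some assumptions: the components are passed in explicitly (here the two components {1, 3, 4} and {2, 5} of the running example), characters are numbered in order of first appearance, and the function name `split_matrix` is made up for illustration.

```python
def split_matrix(rows, components):
    """Build a binary matrix with one column per distinct restricted sequence per component."""
    columns = []
    for comp in components:
        # Restrict every row to the sites (1-based) of this component.
        restricted = ["".join(r[s - 1] for s in comp) for r in rows]
        # One binary character per distinct restricted sequence, in first-appearance order.
        for seq in dict.fromkeys(restricted):
            columns.append([1 if x == seq else 0 for x in restricted])
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]

# Rows a..g of the running example and the components of its conflict graph.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
M_prime = split_matrix(M, [[1, 3, 4], [2, 5]])

for name, row in zip("abcdefg", M_prime):
    print(name, *row)
```

With this numbering, columns 1 to 4 come from the component {1, 3, 4} and columns 5 to 8 from the component {2, 5}.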
Fact: M’ satisfies the 4-gamete test and so has a unique undirected perfect phylogeny, which is T’(M).
Then we need to expand each node in T’(M) that corresponds to a blob for connected component C to some network B so that the external nodes in B have the labels in M[C], when restricted to C.
[Figure: T(M), with a blob B carrying the sites from C, nodes v and w, and the leaf set of the subtree Tv.]
B: blob. C: component of the conflict graph whose sites are on B.
Key point: For any leaf in Tv, and only at those leaves, the states of the sites in C are the same as at v.
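Example:
[Figure: the galled-tree for a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101. The node v on the gall for sites 1, 3, 4 has the label 010 when restricted to sites 1, 3, 4.]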
More Formally:
Let M[C] be the matrix M restricted to the sites in C. Let S[C] be a sequence S restricted to the sites in C.
Key point: Each distinct (non-zero) sequence in M[C] must be the sequence S[C] for some sequence S labeling a branching node v on B.
So, although we don’t know much about the interior of B, or the arrangement of the exterior nodes (branching off of B), we know precisely their number and the sequences (restricted to sites of C) that label the exterior nodes on B. And we know that the states of the non-C sites are identical at each node in B.
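[Figure: the matrix M (rows a to g, sites 1 to 5) and the component C = {1, 3, 4} of its conflict graph.]
Matrix M[C] is Matrix M restricted to the columns in C.
M[C]
    1 3 4
a   0 0 1
b   1 0 1
c   0 1 0
d   1 1 0
e   0 1 0
f   0 1 0
g   0 1 0
[Figure: the blob B for C, carrying mutations 1, 3, 4 and a recombination node with parents p and s (marked 2). Its node labels are 000, 001, 010, 101 and 110, with leaves a, b, d and c, e, f, g attached to its exterior nodes.]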
Punch Line
• Each distinct non-zero sequence S[C] in M[C] corresponds to an edge e branching off of B, and exactly defines the set of sequences in M (and hence leaves in T(M)) that must be below e in T(M).
• If S[C] is the all-zero sequence, then the leaf labeled S must connect to B through the coalescent node of B.
These two facts allow the construction of the tree part of T(M) in O(nm) time, starting from M. This also defines the complete label of each exterior node.
[Figure: the matrix M (rows a to g, sites 1 to 5), the two components C1 = {1, 3, 4} and C2 = {2, 5} of its conflict graph, and the corresponding blobs B1 and B2.]
    M[C1]    M[C2]
    1 3 4    2 5
a   0 0 1    0 0
b   1 0 1    0 0
c   0 1 0    0 0
d   1 1 0    0 0
e   0 1 0    1 0
f   0 1 0    1 1
g   0 1 0    0 1
So the paths to every leaf pass through the blob B1, but only the paths to e, f, g pass through gall B2. The path from B1 to B2 exits B1 at the node whose C1-restricted label is 010.
[Figure: M, M[C1] for C1 = {1, 3, 4} and M[C2] for C2 = {2, 5}, as on the previous slide.]
M’ (characters 1 to 4 come from M[C1], characters 5 to 8 from M[C2])
    1 2 3 4 5 6 7 8
a   1 0 0 0 1 0 0 0
b   0 1 0 0 1 0 0 0
c   0 0 1 0 1 0 0 0
d   0 0 0 1 1 0 0 0
e   0 0 1 0 0 1 0 0
f   0 0 1 0 0 0 1 0
g   0 0 1 0 0 0 0 1
Character assigned to rows a to g from M[C1]: 1, 2, 3, 4, 3, 3, 3; from M[C2]: 5, 5, 5, 5, 6, 7, 8.
Algorithmically
• Finding the tree part of the blobbed-tree is easy.
• Determining the sequences labeling the exterior nodes on any blob is easy.
• Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.
• It is easy to test whether the exterior sequences on B can be generated with only a single (possibly multiple-crossover) recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.
The main open question
• Is there always a blobbed-tree where each blob has all and only the sites of a single connected component of the conflict graph, which minimizes the number of recombinations over all possible phylogenetic networks for M?
• Proof attempt by splitting blob B into B’ and B”. The catch is the possibility of a recombination node in B that is needed in both B’ and B”.
If Yes,
Then any method that computes a lower bound on the number of needed recombinations should be applied separately on sequences defined by the sites on each connected component, and the results added together.
This may be true even if the desired result is not. It is worth trying to prove this.
How to arrange the sites on a blob
Given a single connected component of the conflict graph with k sites, how do we arrange those k sites to generate the required sequences, using only one (multiple-crossover) recombination, or using a multiple-crossover recombination with the fewest cross-overs?
Arranging the sites
We will describe an O(n^3) time method to arrange all of the blobs. O(n^2) time is possible with a more complex method when only single-crossover recombinations are allowed.
A needed fact in words
Let Q be a gall for the sites on connected-component C of the conflict graph.
Let M[C] be the matrix M restricted to the sites on C.
Let LQ[C] be the sequences labeling the nodes of Q, restricted to the sites on C.
Claim: The two sets of sequences are identical, i.e., M[C] = LQ[C].
[Figure: the matrix M, the matrix M[C] for C = {1, 3, 4} (rows a: 001, b: 101, c: 010, d: 110, e: 010, f: 010, g: 010), and the gall Q for C with leaves a, b, d and c, e, f, g. The node labels shown on Q are 001, 010, 101 and 110.]
Matrix M[C] is Matrix M restricted to the columns in C.
LQ[C] are the node labels on Q restricted to the sites in C.
Fact: M[C] = LQ[C]
The idea for arranging the sites of C on B: via a short movie
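[Figure, movie frame: the gall Q with mutations 1, 3, 4, node labels 001, 010, 101, 110, leaves a, b, d and c, e, f, g, and the recombination node with parents p and s.]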
[Figure, movie frame: the same gall Q, now drawn without the p and s marks of the recombination node.]
[Figure, movie frame: blob B with mutations 1, 3, 4, node labels 001, 010, 101, 110, and exterior nodes a, b: 101, d, and c, e, f, g: 010.]
Blob B minus the recombination node is a perfect phylogeny for M[C] minus the recombinant sequence; all sites are on one or two paths from the root; and the two end sequences of those paths can recombine at point r to recreate the recombinant sequence.
The point
If we remove the recombinant node from B, we have a phylogenetic tree (no cycles) for the remaining sequences and hence a perfect phylogenetic tree for the sequences in M[C] minus the recombinant sequence of B.
The sites in this tree are on one or two paths.
Moreover, the two end sequences on that perfect phylogeny can recombine to create the removed recombinant sequence.
The algorithm for arranging a blob B for C
1. Form the matrix M[C].
2. For each row of M[C], remove the row, see if there is a perfect phylogeny for the remaining rows. If yes, see if the sites are in one or two paths, and the end sequences can generate the removed row by a recombination.
Fact: Every row that works gives a permitted arrangement of the sites on B.
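The two steps can be sketched as follows. This is a simplified reading, not the method from the talk: it works on the distinct rows of M[C], treats the tree as unrooted (so "one or two paths from the root" becomes a single path along which every site changes state at most once, which also makes it a perfect phylogeny), and considers only single-crossover recombination. All function names are illustrative.

```python
def changes_once(order):
    """True iff every site changes state at most once along the ordered sequences."""
    m = len(order[0])
    return all(
        sum(a[k] != b[k] for a, b in zip(order, order[1:])) <= 1
        for k in range(m)
    )

def path_order(seqs):
    """Order distinct sequences along a path with each site mutating once, else None."""
    for end in seqs:
        # Along such a path, Hamming distance from an end sequence strictly increases.
        order = sorted(seqs, key=lambda s: sum(x != y for x, y in zip(end, s)))
        if changes_once(order):
            return order
    return None

def crossover_point(p, s, target):
    """Return a 1-based point r with p[:r-1] + s[r-1:] == target, or None."""
    for r in range(2, len(p) + 1):
        if p[:r - 1] + s[r - 1:] == target:
            return r
    return None

def arrangements(mc_rows):
    """Try each distinct row of M[C] as the recombinant; return (row, P, S, r) for rows that work."""
    distinct = list(dict.fromkeys(mc_rows))
    found = []
    for removed in distinct:
        rest = [s for s in distinct if s != removed]
        order = path_order(rest)
        if order is None:
            continue
        a, b = order[0], order[-1]          # the two end sequences of the path
        for p, s in ((a, b), (b, a)):
            r = crossover_point(p, s, removed)
            if r is not None:
                found.append((removed, p, s, r))
                break
    return found

# M[C] for C = {1, 3, 4} in the running example (rows a..g).
MC = ["001", "101", "010", "110", "010", "010", "010"]
result = arrangements(MC)
for removed, p, s, r in result:
    print(f"recombinant {removed}: P={p}, S={s}, point {r} (within C)")
```

In this unrooted sketch, each of the four distinct rows of the example can play the role of the recombinant, including 110 (the row of d) as a recombination of 101 and 010. The crossover point is counted within the restricted columns of C, not in the original site numbering.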
Optimality
Theorem: The minimum number of recombination nodes in any phylogenetic network for M is at least the number of non-trivial connected components of the conflict graph.
Hence, when the sequences on each blob on T(M) can be generated with a single recombination node, the blobbed-tree minimizes the number of recombination nodes over all phylogenetic networks.
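The lower bound in the theorem is easy to compute. The sketch below, with the illustrative name `nontrivial_components`, groups conflicting sites with a small union-find and counts the components that contain more than one site; the matrix is the running example with rows a to g.

```python
from itertools import combinations

def nontrivial_components(rows):
    """Connected components of the conflict graph with at least two sites (1-based)."""
    m = len(rows[0])
    parent = list(range(m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(m), 2):
        # Sites i and j conflict when all four of 00, 01, 10, 11 appear.
        if len({(r[i], r[j]) for r in rows}) == 4:
            parent[find(i)] = find(j)

    groups = {}
    for site in range(m):
        groups.setdefault(find(site), []).append(site + 1)
    return sorted(g for g in groups.values() if len(g) > 1)

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
comps = nontrivial_components(M)
print(comps)       # [[1, 3, 4], [2, 5]]
print(len(comps))  # 2, so any network for M needs at least 2 recombination nodes
```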
The number of arrangements on a gall (all-0 ancestral sequence)
Following the algorithmic approach above, one can prove that the number of arrangements of any gall is at most three, and this happens only if the gall has two sites.
If the gall has more than two sites, then the number of arrangements is at most two.
If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall) then the arrangement is forced and unique.
What is the most tree-like Network?
Assume no node with only one child. Contracting each blob to a single node results in a tree. The number of edges in that tree is a measure of how tree-like the network is. The larger the number of edges, the more tree-like is the network.
Then T(M) is the most tree-like network.
OPEN QUESTIONS
• Main open question - stated earlier
• PPH problem when the haplotypes were derived on a galled-tree or blobbed tree rather than a tree.
• Better lower bounds on the number of recombinations based on a finer examination of the structure of the conflict graph.