Constructing a level-2 phylogenetic network from a dense set of input triplets
Leo van Iersel (1), Judith Keijsper (1), Steven Kelk (2), Leen Stougie (1,2)
(1) Technische Universiteit Eindhoven (TU/e)
(2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam
Email: [email protected]
Web: http://homepages.cwi.nl/~kelk
Triplet-based methods
Given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note zw|x = wz|x.)
Find a tree from which each of the triplets can be obtained as a minor, i.e., by contracting and deleting edges.
[Figure: the four input triplets zw|x, xy|z, yx|w and wz|y are fed to the algorithm, which outputs a solution tree on the leaves w, z, x, y.]
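To make the notion of a tree being consistent with a triplet concrete, here is a minimal Python sketch. It assumes a tree is written as nested tuples with leaf names as strings, and a triplet xy|z as the tuple (x, y, z); these encodings and the function names are illustrative choices, not taken from the slides. It relies on the standard characterization that xy|z holds in a rooted tree when the smallest subtree containing x and y does not contain z.

```python
def leaves(tree):
    """Return the set of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def smallest_subtree(tree, names):
    """Return the smallest subtree whose leaf set contains all of `names`."""
    if isinstance(tree, str):
        return tree
    for child in tree:
        if names <= leaves(child):
            return smallest_subtree(child, names)
    return tree


def consistent(tree, triplet):
    """True if the triplet (x, y, z), read as xy|z, is consistent with `tree`."""
    x, y, z = triplet
    if not {x, y, z} <= leaves(tree):
        return False
    # xy|z holds when z lies outside the smallest subtree containing x and y.
    return z not in leaves(smallest_subtree(tree, {x, y}))


# The example from the slide: the tree ((w,z),(x,y)) and its four triplets.
tree = (("w", "z"), ("x", "y"))
triplets = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
all_ok = all(consistent(tree, t) for t in triplets)
```

Under this encoding, the symmetric notation zw|x = wz|x needs no special handling, because the check only looks at the set {x, y}.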
From trees to networks…
• The algorithm of Aho et al. (1981) can be used to construct trees from rooted triplets.
• But…what if the algorithm fails? Why might the algorithm fail?
• Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors.
• Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like!
• Response: try to construct phylogenetic networks rather than phylogenetic trees.
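The tree-building step referred to above can be sketched as follows. This is a minimal Python version in the spirit of the algorithm of Aho et al., not a transcription of it: the triplet encoding (x, y, z) for xy|z, the nested-tuple output and the name `build_tree` are assumptions made for illustration. The idea is to join x and y for every triplet xy|z that lies inside the current leaf set, then recurse on the resulting groups; if everything ends up in one group, no tree exists.

```python
def build_tree(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None if none exists."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Union-find over the current leaves.
    parent = {v: v for v in leaves}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # For every triplet xy|z inside this leaf set, x and y must share a subtree.
    for x, y, z in triplets:
        if x in leaves and y in leaves and z in leaves:
            parent[find(x)] = find(y)

    groups = {}
    for v in leaves:
        groups.setdefault(find(v), set()).add(v)

    # A single group means the leaves cannot be split below a root: failure.
    if len(groups) == 1:
        return None

    subtrees = []
    for group in sorted(groups.values(), key=min):
        subtree = build_tree(group, triplets)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


good = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
bad = [("x", "y", "z"), ("x", "z", "y")]
tree = build_tree({"w", "x", "y", "z"}, good)   # (("w", "z"), ("x", "y"))
no_tree = build_tree({"x", "y", "z"}, bad)      # None: the algorithm fails
```

The second call shows the failure case that motivates networks: the two triplets on x, y, z cannot both be satisfied by a tree.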
From trees to networks (2)
• For example, suppose the input is {xy|z, xz|y}.
[Figure: the two triplets xy|z and xz|y, and a network with leaves x, y, z.]
(Note that there are cases where, even if there is at most one triplet per 3 species, a tree is not possible.)
Level-k phylogenetic networks
[Figure: a network with leaves x, y, z, labelling the root (only one!), a leaf-vertex, a split-vertex and a recombination-vertex.]
A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices.
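The definition above can be turned into a small check. The following Python sketch computes the smallest k for which a given network is level-k. It assumes the network is given as a list of (parent, child) arcs without parallel arcs, and that a recombination vertex is a vertex with at least two parents; the function names are hypothetical. Biconnected components are found with the standard depth-first-search method using an edge stack.

```python
from collections import defaultdict


def biconnected_components(adj):
    """Vertex sets of the biconnected components of an undirected simple graph."""
    index, low, stack, comps = {}, {}, [], []

    def dfs(u, parent):
        index[u] = low[u] = len(index)
        for v in adj[u]:
            if v == parent:
                continue
            if v not in index:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:
                    # u separates the subtree of v: pop one component.
                    comp = set()
                    while True:
                        edge = stack.pop()
                        comp.update(edge)
                        if edge == (u, v):
                            break
                    comps.append(comp)
            elif index[v] < index[u]:
                # Back edge to an ancestor.
                stack.append((u, v))
                low[u] = min(low[u], index[v])

    for u in list(adj):
        if u not in index:
            dfs(u, None)
    return comps


def network_level(arcs):
    """Largest number of recombination vertices inside one biconnected component."""
    adj, parents = defaultdict(set), defaultdict(set)
    for p, c in arcs:
        adj[p].add(c)
        adj[c].add(p)
        parents[c].add(p)
    level = 0
    for comp in biconnected_components(adj):
        # Count vertices with both parents inside this component.
        k = sum(1 for v in comp if len(parents[v] & comp) >= 2)
        level = max(level, k)
    return level


tree_arcs = [("r", "a"), ("r", "b"), ("a", "w"), ("a", "z"), ("b", "x"), ("b", "y")]
level1_arcs = [("r", "u"), ("r", "v"), ("u", "x"), ("u", "h"),
               ("v", "h"), ("v", "z"), ("h", "y")]
level2_arcs = [("r", "a"), ("r", "b"), ("a", "c"), ("a", "h1"), ("b", "h1"),
               ("b", "h2"), ("c", "h2"), ("c", "x"), ("h1", "y"), ("h2", "z")]
```

A tree has level 0 under this definition, since none of its vertices has two parents.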
Level-1 Networks
• A set of input triplets is dense iff, for every subset of 3 species, there is at least one triplet corresponding to those 3 species.
• Therefore, a dense set of input triplets for n species contains Θ(n^3) triplets (at least one and at most three per subset of 3 species).
• Jansson & Sung (2006) showed: given a dense set of triplets T for a set L of species, it is possible to determine in polynomial time whether a level-1 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
• They later showed, together with Nguyen, how to do this in time linear in |T|. They also showed that, in the non-dense case, the problem is NP-hard.
• But what about level-2 networks, and higher?
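The density condition defined above is easy to test directly. Here is a small Python sketch, assuming a triplet xy|z is stored as the tuple (x, y, z); the name `is_dense` is an illustrative choice.

```python
from itertools import combinations


def is_dense(leaves, triplets):
    """True if every subset of 3 leaves is covered by at least one triplet."""
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(c) in covered for c in combinations(sorted(leaves), 3))


triplets = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
dense = is_dense({"w", "x", "y", "z"}, triplets)           # True
not_dense = is_dense({"w", "x", "y", "z"}, triplets[:3])   # False: {w, y, z} is uncovered
```

For 4 species there are 4 subsets of size 3, so the four triplets from the earlier tree example form a dense set.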
Here is an example of a level-2 network.
Main result: Given a dense set of triplets T for a set L of species, it is possible to determine in time O(|T|^3) whether a level-2 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)
Algorithm, basic idea
• The basic idea behind Aho’s algorithm for trees is that we are able to determine, recursively, which species belong to which of the two subtrees hanging from some root vertex.
• For level-1 and level-2 networks, if there again exists such a clear dichotomy, we recurse on the two subsets.
[Figure: a root with two sub-networks hanging below it.]
Algorithm, basic idea
• The basic idea behind Aho’s algorithm for trees is that we are able to determine, recursively, which species belong to which of the two subtrees hanging from some root vertex.
• For level-1 networks, if there again exists such a clear dichotomy, we recurse on the two subsets. Otherwise there must exist a network of the following form:
[Figure: a backbone network (drawn in blue) with several sub-networks hanging from it.]
• Find the partition of the species (leaves) into the sub-networks
• Find the blue backbone network
• Treat each of the partition elements (sub-networks) as leaves to be hung on the backbone
• Recurse on the sub-networks
Algorithm, high-level idea
For level-2 networks the idea is similar:
[Figure: a backbone network (drawn in blue) with several sub-networks hanging from it.]
• Find the partition of the species (leaves) into the sub-networks. There is a complication in level-2.
• Find the blue backbone network! There are more level-2 backbone forms.
• Treat each of the partition elements (sub-networks) as (meta-)leaves to be hung on the backbone
• Recurse on the sub-networks
Definition: inducing new triplet sets from partitions of the leaf set
• Suppose I have a partition P = {P1, P2, …, Pt} of the leaf set L.
• Suppose I have a dense set of triplets T on the leaf set L.
• Let T’ be a new triplet set on leaf set {q1, q2, …, qt} defined as follows:
• qiqj|qk is in T’ if and only if i, j, k are pairwise distinct and there exists a triplet xy|z in T such that x is in Pi, y is in Pj and z is in Pk
• Then we say that T’ is the triplet set induced by the partition P of L.
• Critically: if T is dense, then T’ is also dense.
• In some sense this can be perceived as a ‘coarsening’ of the input set.
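The induced triplet set from this definition can be computed in one pass over T. The Python sketch below assumes a triplet xy|z is stored as (x, y, z) and the partition as a list of sets; the meta-leaf q_i is represented simply by the 0-based index i of its block, and the name `induce_triplets` is hypothetical.

```python
def induce_triplets(triplets, partition):
    """Return the triplet set induced on block indices by a partition of the leaves."""
    block_of = {}
    for i, block in enumerate(partition):
        for leaf in block:
            block_of[leaf] = i

    induced = set()
    for x, y, z in triplets:
        i, j, k = block_of[x], block_of[y], block_of[z]
        # Keep only triplets whose three leaves fall in three different blocks.
        if len({i, j, k}) == 3:
            # Store the first two indices in sorted order, since ij|k = ji|k.
            induced.add((min(i, j), max(i, j), k))
    return induced


triplets = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
partition = [{"w", "z"}, {"x"}, {"y"}]
induced = induce_triplets(triplets, partition)   # {(1, 2, 0)}
```

In this example both yx|w and xy|z collapse to the same induced triplet, which illustrates the ‘coarsening’ effect.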
Definition: simple level-2 networks
Lemma: There are exactly 4 different backbone networks
A simple level-2 network is any network obtained by “hanging leaves” off one of the above structures.
Here the leaves {a,b,c,d,e,f,g,h} have been ‘hung’ from structure 8a, to yield a simple level-2 network.
A picture description of the simple level-2 algorithm
Level-2 network algorithm
• Assume some oracle gives us the partition of the leaves into sub-networks
• Treat each sub-network as a leaf and construct a simple level-2 network
The simple level-2 network algorithm
• Guess the right “recombination leaf”
• Remove it and remove the triplets that contain this leaf
• One recombination vertex is left, with a caterpillar below it
Suppose we can correctly ‘guess’ that leaf g hangs directly below a recombination node.
If we remove g, and all triplets that contain g, then we know that a level-1 network must be possible on this new set of triplets (because now fewer recombination nodes are needed).
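The removal step described here is a simple filter on the triplet set. Below is a minimal Python sketch, assuming triplets are stored as (x, y, z) tuples; the name `remove_leaves` is hypothetical, and the sketch deliberately takes a set of leaves so that it covers removing either a single leaf or a whole group of leaves. How the right leaf is ‘guessed’ is not shown; one option consistent with the slides would be to try every leaf in turn.

```python
def remove_leaves(triplets, removed):
    """Drop every triplet that contains at least one of the removed leaves."""
    removed = set(removed)
    return [t for t in triplets if removed.isdisjoint(t)]


triplets = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]

# Removing a single guessed leaf, here y.
without_y = remove_leaves(triplets, {"y"})

# Trying every leaf as the guess yields one reduced triplet set per candidate.
candidates = {g: remove_leaves(triplets, {g}) for g in ("w", "x", "y", "z")}
```

If the input is dense on L, the reduced set is still dense on the remaining leaves, because every triplet on three remaining leaves survives the filter.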
Level-2 network algorithm
• Assume some oracle gives us the partition of the leaves into sub-networks
• Treat each sub-network as a leaf and construct a simple level-2 network
The simple level-2 network algorithm
• Guess the right “recombination leaf”
• Remove it and remove the triplets that contain this leaf
• One recombination vertex is left, with a caterpillar below it
• Guess the right “caterpillar set”
Caterpillar set
A caterpillar set with respect to a dense triplet set T is the set of leaves of a caterpillar subgraph of a network consistent with T.
The empty set is also a caterpillar set.
[Figure: a caterpillar.]
Suppose we subsequently guess that the caterpillar with h now hangs below a recombination node in the new network.
If we remove the h-caterpillar, and all triplets that contain leaves of it, then we know that a level-0 network (i.e., a tree) must be possible on this new set of triplets (because now even fewer recombination nodes are needed).
Level-2 network algorithm
• Assume some oracle gives us the partition of the leaves into sub-networks
• Treat each sub-network as a leaf and construct a simple level-2 network
The simple level-2 network algorithm
• Guess the right “recombination leaf”
• Remove it and remove the triplets that contain this leaf
• One recombination vertex is left, with a caterpillar below it
• Guess the right “caterpillar set”
• Remove it and remove the triplets that contain any element of this set
• Construct the unique tree for the remaining triplets [Jansson & Sung 2006]
In such a case the resulting tree is UNIQUE (J&S).
So now we have a tree. We are going to guess how to add the h-caterpillar back in, and then guess how to add leaf g back in.
Adding the h-caterpillar back in.
And finally adding leaf g back in.
Level-2 network algorithm
• Assume some oracle gives us the partition of the leaves into sub-networks
• Treat each sub-network as a leaf and construct a simple level-2 network
The simple level-2 network algorithm
• Guess the right “recombination leaf”
• Remove it and remove the triplets that contain this leaf
• One recombination vertex is left, with a caterpillar below it
• Guess the right “caterpillar set”
• Remove it and remove the triplets that contain any element of this set
• Construct the unique tree for the remaining triplets [Jansson & Sung 2006]
• Insert the caterpillar set and the recombination leaf in the tree in the correct way
• For each pair of guesses try all 4 backbone structures
Simple level-2 algorithm
Theorem: The simple level-2 network algorithm runs in O(|T|^3) time.
SN-sets to partition the set of leaves
• Jansson & Sung introduced the SN-set to partition the set of leaves
• SN-sets are special subsets of the leaves L, and are defined w.r.t. T
• All sets containing just a single leaf are SN-sets.
• Any other SN-set is any subset of leaves obtained by taking the closure of some subset S of the leaves L w.r.t. the following operation:
If x, y ∈ S and xz|y ∈ T or yz|x ∈ T then z ∈ S
The SN-set that is equal to the total leaf set L is called the trivial SN-set.
An SN-set that is non-trivial, and is not a strict subset of any other non-trivial SN-set, is called a maximal SN-set.
(If the network is a tree there are 2 maximal SN-sets: one is the set of leaves of the subtree to the right of the root, and the other is the set of leaves of the subtree to the left of the root.)
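The closure operation that defines SN-sets can be sketched in a few lines of Python. Triplets are again assumed to be stored as (x, y, z) for xy|z, and both function names are hypothetical. Note one loudly labelled assumption: `maximal_sn_sets` only closes singletons and pairs of leaves, on the assumption that this is enough to find the maximal SN-sets of a dense triplet set; the slides do not state this, so treat it as a sketch rather than the authors' procedure.

```python
from itertools import combinations


def sn_closure(start, triplets):
    """Close `start` under: if x, y in S and xz|y or yz|x in T, then z in S."""
    closed = set(start)
    changed = True
    while changed:
        changed = False
        for a, b, c in triplets:
            # Triplet ab|c with c and exactly one of a, b in S pulls in the other.
            if c in closed:
                if a in closed and b not in closed:
                    closed.add(b)
                    changed = True
                elif b in closed and a not in closed:
                    closed.add(a)
                    changed = True
    return frozenset(closed)


def maximal_sn_sets(leaves, triplets):
    """Maximal non-trivial sets among closures of singletons and pairs (assumption)."""
    leaves = frozenset(leaves)
    candidates = {frozenset([v]) for v in leaves}
    candidates |= {sn_closure(pair, triplets)
                   for pair in combinations(sorted(leaves), 2)}
    nontrivial = {s for s in candidates if s != leaves}
    return {s for s in nontrivial if not any(s < other for other in nontrivial)}


triplets = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
maximal = maximal_sn_sets({"w", "x", "y", "z"}, triplets)
```

For the triplets of the tree ((w,z),(x,y)) this gives the two sets {w, z} and {x, y}, matching the remark above about the two subtrees below the root.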
Definition: maximal SN-set
• Jansson and Sung proved that the set of maximal SN-sets indeed partitions the leaf set L. So no two maximal SN-sets overlap, and they completely cover the set of input leaves.
• All SN-sets and all maximal SN-sets can be found in polynomial time.
• Jansson & Sung solved the level-1 problem by observing that each maximal SN-set hangs as a ‘meta-leaf’ on the level-1 backbone network; each maximal SN-set can be completely separated from the rest of the network by removing just one edge.
• There are maximal SN-sets in level-2 networks that can hang under more than one edge!
Definition: highest cut-edge
In a phylogenetic network N, a cut-edge (x,y) is an edge whose removal disconnects the underlying undirected graph.
A cut-edge (x,y) is said to be a trivial cut-edge iff y is a leaf.
A cut-edge (x,y) is said to be highest iff there is no cut-edge (p,q) such that there is a directed path from q to x in N.
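These definitions can be checked by brute force on small networks. The Python sketch below assumes the network is connected and given as a list of (parent, child) arcs; the function names are hypothetical. It tests each arc by deleting it and checking connectivity, which is simple rather than fast. It also assumes that a directed path may have length zero, so that a cut-edge (p, x) directly above (x, y) already prevents (x, y) from being highest.

```python
from collections import defaultdict


def reachable(adj, start, banned=None):
    """Vertices reachable from `start` in `adj`, never crossing the `banned` edge."""
    seen, todo = {start}, [start]
    while todo:
        u = todo.pop()
        for v in adj[u]:
            if (u, v) != banned and (v, u) != banned and v not in seen:
                seen.add(v)
                todo.append(v)
    return seen


def highest_cut_edges(arcs):
    """Return (cut_edges, highest_cut_edges) of a connected network."""
    und, out = defaultdict(set), defaultdict(set)
    for p, c in arcs:
        und[p].add(c)
        und[c].add(p)
        out[p].add(c)
    nodes = set(und)

    # An arc is a cut-edge if deleting it disconnects the undirected graph.
    cut = [(p, c) for p, c in arcs if reachable(und, p, banned=(p, c)) != nodes]

    # (x, y) is highest if x is not reachable (by arcs) from the head q of any cut-edge.
    highest = [(x, y) for x, y in cut
               if not any(x in reachable(out, q) for _, q in cut)]
    return cut, highest


tree_arcs = [("r", "a"), ("r", "b"), ("a", "w"), ("a", "z"), ("b", "x"), ("b", "y")]
cycle_arcs = [("r", "u"), ("r", "v"), ("u", "x"), ("u", "h"),
              ("v", "h"), ("v", "z"), ("h", "y")]
tree_cut, tree_highest = highest_cut_edges(tree_arcs)
cycle_cut, cycle_highest = highest_cut_edges(cycle_arcs)
```

In the tree every arc is a cut-edge and only the two arcs leaving the root are highest; in the second network the arcs on the cycle are not cut-edges, so the three leaf arcs are the highest cut-edges.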
• Fact. Let (x,y) be a highest cut-edge and let L’ be the set of leaves reachable from y. Let L* be a strict subset of L’. Then L* is not a maximal SN-set.
• Proof: the set of leaves reachable from a highest cut-edge (x,y) is itself an SN-set. Clearly for any two leaves p,q in L’ and leaf r outside L’ there cannot be triplets pr|q and qr|p: the edge (x,y) forms a bottleneck. Thus, since T is dense, pq|r must exist.
[Figure: the cut-edge (x,y) with the leaf set L’ below y, and triplets on the leaves p, q, r.]
So: each maximal SN-set can be expressed as the union of the leaves reachable from one or more highest cut-edges.
Central Theorem (simplified). Suppose there is a dense triplet set T consistent with some simple level-2 network N. Then there exists a level-2 network N’ (not necessarily simple) such that, with the exception of perhaps one maximal SN-set with respect to T, every maximal SN-set appears below a single cut-edge in N’. The remaining, ‘odd-one-out’ maximal SN-set (if it exists) will be equal to the union of leaves below two cut-edges.
In other words: there exists at most one maximal SN-set which is the union of the leaves below two highest cut-edges, whereas all other maximal SN-sets consist of the leaves below one highest cut-edge.
The algorithm
• Determine the maximal SN-sets
• Guess the right SN-set to be split
• Treat the max SN-sets and the two split sets as leaves {S1,S2,…,Sq}
• Adapt T to a new triplet set T’: SiSk|Sh ∈ T’ if and only if there exist x ∈ Si, y ∈ Sk, z ∈ Sh s.t. xy|z ∈ T
• Construct a simple level-2 network for T’
• Recursively find the sub-networks for the sets S1,S2,…,Sq
Conclusions & open problems
• So we know how to efficiently construct level-2 networks from dense triplet sets. What’s next?
• Applicability: how useful is it?
• Initial implementation: programming and fine-tuning
• Improving running time: in the spirit of the “SN-tree” of J&S&N
• Complexity: what about level-3 and higher?
• Bounds: worst-case, best-case scenarios
• Building all networks
• Properties of output networks as function of input
• Different triplet restrictions
• Confidence: how good are the solutions?
• Exponential-time exact algorithms for NP-hard problems