Worst-case optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks

Jaroslaw Byrka (1,2), Steven Kelk (2), Katharina T. Hüber (3)

(1) Technische Universiteit Eindhoven (TU/e)

(2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam

(3) University of East Anglia (UEA), England

Email: [email protected]

Web: http://homepages.cwi.nl/~kelk


Phylogenetic tree reconstruction

[Figure: an evolutionary tree on Orangutan, Gorilla, Chimpanzee and Human. This tree is borrowed from a presentation by Tandy Warnow.]

Phylogenetic tree reconstruction is essentially the science of efficiently inferring and constructing plausible evolutionary trees when we only have limited input data about the ‘species’ concerned…

At the intersection of biology, bioinformatics, computer science and mathematics.


Dominant methods in phylogenetic reconstruction

• Character-based methods: Maximum Parsimony (= Minimum Steiner Tree), Maximum Likelihood, Bayesian methods (Markov Chain Monte Carlo, MCMC)

• Distance-based methods: Neighbour Joining, UPGMA

• Triplet-based methods


Triplet-based methods (1)

• Triplet-based methods are used for constructing rooted evolutionary trees: there is a root (a hypothetical most-distant ancestor) and edges are directed, explicitly denoting the direction of evolution.

• The central idea: build a single, ‘big’ evolutionary tree for a set S of species by combining smaller evolutionary trees on subsets of S such that the big tree respects the structure of the smaller trees.

• In triplet-based methods, the small input trees are always defined on size-3 subsets of the species set S (and are called rooted triplets.)


Triplet-based methods (2)

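• For example, suppose I want to reconstruct a plausible evolution for the species set {w,x,y,z}.

• I am given the set of rooted triplets zw|x, yx|w, xy|z, wz|y. A triplet xy|z states that x and y share a more recent common ancestor with each other than either does with z, so zw|x = wz|x.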

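[Figure: the four input triplets zw|x, xy|z, yx|w and wz|y are fed to an algorithm, which outputs a solution tree with leaves, from left to right, w, z, x, y.]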
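To make "the big tree respects the structure of the smaller trees" concrete, here is a minimal Python sketch that lists the rooted triplets a tree is consistent with and checks the four example triplets. The nested-tuple encoding, the helper names (`leaves`, `induced_triplets`, `parse`) and the candidate tree ((w,z),(x,y)) are assumptions made for illustration; they do not come from the slides.

```python
from itertools import combinations

def leaves(tree):
    """Return the leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def induced_triplets(tree):
    """Return every rooted triplet the tree is consistent with.

    A triplet xy|z is stored as (frozenset({x, y}), z).
    """
    if isinstance(tree, str):
        return set()
    found = set()
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        found |= induced_triplets(child)
        # Two leaves inside one child meet strictly below this vertex, so
        # they pair up against any leaf hanging from a different child.
        outside = [l for j, ls in enumerate(child_leaves) if j != i for l in ls]
        for x, y in combinations(child_leaves[i], 2):
            for z in outside:
                found.add((frozenset((x, y)), z))
    return found

def parse(text):
    """Turn 'zw|x' into (frozenset({'z', 'w'}), 'x'); one letter per species."""
    pair, z = text.split("|")
    return (frozenset(pair), z)

# Hypothetical candidate tree: two cherries, (w,z) and (x,y).
tree = (("w", "z"), ("x", "y"))
given = [parse(t) for t in ["zw|x", "yx|w", "xy|z", "wz|y"]]
print(all(t in induced_triplets(tree) for t in given))  # prints: True
```

Storing the pair as a `frozenset` makes zw|x and wz|x compare equal, which mirrors the note on the slide.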



When trees fail

• The algorithm of Aho et al. (1981) can be used to construct a tree that is consistent with all the input rooted triplets, if one exists…

• But…what if the algorithm fails? Why might the algorithm fail?

• Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors.

• Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like.

• Responses:

• try constructing a phylogenetic tree that maximises the number of input triplets it is consistent with, and/or

• try to construct phylogenetic networks instead of phylogenetic trees
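The tree-building procedure of Aho et al. is usually described as follows: join x and y in an auxiliary graph for every triplet xy|z, fail if the graph is connected, otherwise recurse on each connected component. The Python below is a minimal sketch of that textbook description, not the authors' code; the function name `build` and the tuple encoding of triplets are assumptions.

```python
def build(species, triplets):
    """Return a tree (nested tuples) consistent with all triplets, or None.

    species  -- iterable of leaf labels
    triplets -- iterable of (x, y, z) tuples, each meaning xy|z
    """
    species = set(species)
    if len(species) == 1:
        return next(iter(species))
    # Only triplets whose three leaves are all present still constrain us.
    active = [t for t in triplets if set(t) <= species]

    # Union-find over the auxiliary graph: x and y are joined for each xy|z.
    parent = {s: s for s in species}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for x, y, _ in active:
        parent[find(x)] = find(y)

    groups = {}
    for s in species:
        groups.setdefault(find(s), set()).add(s)
    if len(groups) == 1:
        return None  # a single component: no consistent tree exists

    subtrees = []
    for group in groups.values():
        subtree = build(group, active)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)

ok = build("wxyz", [("z", "w", "x"), ("y", "x", "w"),
                    ("x", "y", "z"), ("w", "z", "y")])
bad = build("xyz", [("x", "y", "z"), ("x", "z", "y")])
print(ok)   # a tree with the cherries (w,z) and (x,y), in some order
print(bad)  # prints: None
```

The second call fails because xy|z and xz|y connect all three species in the auxiliary graph, which is exactly the situation that motivates networks.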


Networks instead of trees

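• For example, suppose the input is {xy|z, xz|y}. No tree is consistent with both triplets, but a network can be.

[Figure: the two input triplets xy|z and xz|y, and a network on the leaves x, y and z.]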
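Consistency with a network is commonly defined as follows: xy|z is consistent with N if N contains distinct vertices u and v and pairwise internally vertex-disjoint paths u→x, u→y, v→u and v→z. The brute-force Python sketch below checks that definition on a small network. The adjacency-list encoding and the vertex names `r`, `a`, `b`, `h` are hypothetical, and the search is exponential, so it only suits toy examples.

```python
from itertools import product

def simple_paths(graph, start, goal):
    """Yield every directed path from start to goal as a list of vertices."""
    if start == goal:
        yield [start]
        return
    for nxt in graph.get(start, []):
        for rest in simple_paths(graph, nxt, goal):
            yield [start] + rest

def internally_disjoint(paths):
    """True if no path's internal vertex lies on any other path."""
    for i, p in enumerate(paths):
        inner = set(p[1:-1])
        for j, q in enumerate(paths):
            if i != j and inner & set(q):
                return False
    return True

def is_consistent(graph, x, y, z):
    """Check whether the DAG contains a subdivision of the triplet xy|z."""
    vertices = set(graph) | {w for ws in graph.values() for w in ws}
    for u, v in product(vertices, repeat=2):
        if u == v:
            continue
        options = [list(simple_paths(graph, s, t))
                   for s, t in ((v, u), (v, z), (u, x), (u, y))]
        if any(internally_disjoint(paths) for paths in product(*options)):
            return True
    return False

# Hypothetical level-1 network: h is the recombination vertex above x.
network = {"r": ["a", "b"], "a": ["y", "h"], "b": ["h", "z"], "h": ["x"]}
print(is_consistent(network, "x", "y", "z"))  # prints: True
print(is_consistent(network, "x", "z", "y"))  # prints: True
print(is_consistent(network, "y", "z", "x"))  # prints: False
```

This network is consistent with two of the three triplets on {x, y, z}, which no tree can achieve.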



Level-k phylogenetic networks

[Figure: a network on the leaves x, y and z, annotated with its root (only one!), a split-vertex, a recombination-vertex and the leaf-vertices (labelled with species).]

A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices.

This network here is a very simple example of a level-1 network.

In a level-1 network, the ‘cycles’ are vertex-disjoint, hence the alternative name “galled tree”.



The complexity of “LEVEL-k”

LEVEL-k

Input: Set of rooted triplets T

Output: A level-k network N consistent with all the triplets in T, or state that no such network exists.

Complexity:

• k=0: in P (Aho et al. 1981)

• k=1: NP-hard, but in P when T is “dense” (Jansson & Sung 2005)

• k=2: NP-hard, but in P when T is “dense” (Van Iersel, Keijsper, Kelk, Stougie 2007)

• k>2: open, presumably as for k=1 and k=2 (the general case is almost certainly NP-hard, but is the dense case in P?)


What about maximization?

• Gasieniec et al. (1999) showed how to find in polynomial time a tree that is consistent with at least 1/3 of the input triplets T.

• Is it possible to always find a tree that is consistent with > 1/3 of the input triplets?

• No. Let T1(n) be the full triplet set on n species. It contains $3\binom{n}{3}$ triplets.

• For example, T1(4) = {ab|c, ac|b, cb|a, ab|d, ad|b, bd|a, ac|d, dc|a, ad|c, bc|d, bd|c, dc|b}.

• For any given three species, a tree is consistent with at most one triplet on those three species. So at most 1/3 of the triplets in T1(n) can be consistent with a tree.

• So for trees, and comparing with the upper bound |T|, 1/3 is worst-case optimal.
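The counting argument can be checked directly for small n. The Python sketch below enumerates the full triplet set and counts how many of its triplets a caterpillar tree is consistent with; the function names and the choice of a caterpillar as the test tree are assumptions for illustration.

```python
from itertools import combinations
from math import comb

def full_triplet_set(species):
    """Return the full triplet set: three triplets per 3-subset of species.

    A triplet xy|z is stored as (frozenset({x, y}), z).
    """
    full = set()
    for a, b, c in combinations(sorted(species), 3):
        full |= {(frozenset((a, b)), c),
                 (frozenset((a, c)), b),
                 (frozenset((b, c)), a)}
    return full

def caterpillar_triplets(order):
    """Triplets of the caterpillar tree whose leaves are listed from the
    deepest cherry up to the root."""
    found = set()
    for k in range(2, len(order)):
        # order[k] branches off above every pair taken from order[:k].
        for x, y in combinations(order[:k], 2):
            found.add((frozenset((x, y)), order[k]))
    return found

species = "abcd"
full = full_triplet_set(species)
hit = full & caterpillar_triplets(species)
print(len(full), 3 * comb(len(species), 3), len(hit))  # prints: 12 12 4
```

For four species the caterpillar is consistent with 4 of the 12 triplets, exactly 1/3.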


Formalising the question

Assuming that we restrict the set of phylogenetic networks to some subclass, what is the maximum value of 0 ≤ p ≤ 1 such that for every input set T of rooted triplets, there exists some network N(T) from the subclass such that at least p|T| of the triplets are consistent with N(T)?

• So for level-0 networks (trees), p=1/3.

• This can be trivially converted to a 3-approximation algorithm for the problem MAX-LEVEL-0, where MAX-LEVEL-k is defined as “Given a set of triplets T, what is the maximum number of triplets from T that some level-k network can be consistent with?”

• In general, an algorithm that always achieves a fraction q of the input triplets is a (1/q)-approximation for the MAX variant. (Better approximation factors for the MAX variant are probably possible, but none are known yet!)
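The step from a guaranteed fraction to an approximation factor can be written out in one line. Here OPT(T) denotes the optimum of the MAX variant and ALG(T) the number of triplets the algorithm's network is consistent with; both symbols are introduced only for this illustration. Since no network can be consistent with more than |T| triplets, OPT(T) ≤ |T|, and so:

$$ \mathrm{ALG}(T) \;\ge\; q\,|T| \;\ge\; q\cdot\mathrm{OPT}(T) \quad\Longrightarrow\quad \frac{\mathrm{OPT}(T)}{\mathrm{ALG}(T)} \;\le\; \frac{1}{q} $$

With q = 1/3 for trees this gives the factor 3 mentioned above.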


Determining the p-fraction for level-1 and higher

• For level-1, Jansson, Nguyen and Sung (2005) showed how to find in polynomial time a level-1 network consistent with at least 5/12 ≈ 0.416… of the input triplets. So for level-1, p ≥ 5/12 ≈ 0.416…

• They also showed, given the full triplet set T1(n) on n leaves, how to build an optimal level-1 network for those triplets i.e. no other level-1 network can be consistent with a higher fraction of T1(n).

• By counting they show that such optimal level-1 networks are consistent with a fraction approaching (from above) ≈ 0.488… of the input triplets, showing that, for level-1, p ≤ 0.488…

• Obvious questions: what is the true value of p for level-1? What about higher level networks? Are networks achieving the p-fraction always polynomial-time constructable? What is the role of the full triplet set in determining p? How about p as a function of n = the number of species?

Before our result

[Figure: plot of the fraction of input triplets (vertical axis, 0 to 0.7) against n (horizontal axis, 1 to 49), with curves for trees, the level-1 LB and the level-1 UB.]


Our result: p is defined by the full triplet set!

Let N be a network that is consistent with a fraction p’ of the full triplet set T1(n). Then, for any arbitrary triplet input set T on n species, we can convert N in polynomial time into an isomorphic network N’(T) that is consistent with a fraction ≥ p’ of T. (The result also holds for weighted triplet sets.)

• All tree shapes (not just caterpillars) can be made consistent with at least 1/3 of the input triplets, because every tree is consistent with 1/3 of T1(n).

• We get a polynomial-time worst-case optimal algorithm for level-1 networks (for the |T| upper bound). This means that we can always get at least 0.48… of the input triplets. With a customized derandomization we can do this in time O(|T|n²).

• For level-2, we can in polynomial time always get at least 0.61 of the input.

• Is this bad news for the biological relevance of triplet methods and/or the level-k hierarchy?

After our result

[Figure: plot of the fraction of input triplets (vertical axis, 0 to 0.7) against n (horizontal axis, 1 to 46), with curves for trees, level-1 and the level-2 LB.]


Method: labelling an unlabelled network

• Suppose we know a network N that is consistent with a fraction p’ of the full triplet set T1(n). Let T be the input set of triplets, on n species.

• Note that if the species on the leaves of N are arbitrarily permuted, the resulting network is still consistent with a fraction p’ of T1 – because all species in T1 are indistinguishable.

• Hence, we can view N as an unlabelled network i.e. a network without species on the leaves. Only the shape of N is important.

• We argue that we can label the leaves of N with species in such a way that the resulting network N’, which will be isomorphic to N, is consistent with a fraction ≥ p’ of T.

• We use a probabilistic argument to argue the existence of such a labelling.

• We then use the method of conditional expectation to derandomize this i.e. so that the labelling can be found in polynomial time.


Choosing the labelling u.a.r. is good enough

• Suppose we know a network N that is consistent with a fraction p’ of the full triplet set T1(n). Let T be the input set of triplets, on n species.

• If we choose a random labelling of the leaves of N (i.e. randomly assign the n species from T to the n leaves of N) to get a network N’, the expected fraction of T that N’ is consistent with, is p’.


• Consider an arbitrary triplet xy|z from T. What is the probability that N’, created by randomly labelling N, is consistent with xy|z ?

• It is the probability that species x,y,z get mapped to leaves t1, t2, t3 such that t1t2|t3 is consistent with N.



• For each “leaf triplet” t1t2|t3 in N, there are 2(n-3)! labellings that map xy|z to that leaf triplet.

• A labelling that maps xy|z to a leaf triplet, cannot map xy|z to another leaf triplet.

• So the probability that the labelled network N’ is consistent with xy|z, is the probability that xy|z gets mapped to one of the leaf triplets in N. Hand-waving, the probability is thus:

$$ \frac{p' \cdot 3\binom{n}{3} \cdot 2\,(n-3)!}{n!} \;=\; p' $$


• So, for any triplet t in T, the probability that t is consistent with the labelled network N’ is p’, when N’ is made by randomly labelling N.

• Summing over all triplets (by linearity of expectation), we get that the expected fraction of T consistent with N’ is also p’.

• We conclude that there exists some labelling of N that achieves a fraction ≥ p’.

• This proves that, for a subclass of networks, the p-fraction is indeed defined by the full triplet set, and that any network obtaining the p-fraction for the full triplet set, can be relabelled to obtain the p-fraction for an arbitrary input set T.

• But how to find in polynomial time the correct labelling for a given input set T?

• Derandomization by the method of conditional expectation.
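The expectation argument can be verified by brute force on a small case. The Python sketch below averages, over all n! labellings of a fixed shape, the fraction of an arbitrary triplet set that the labelled shape is consistent with. The caterpillar shape, the helper names and the sample set `T` are assumptions; `good` may be replaced by the leaf triplets of any other shape.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import comb

def shape_triplets(order):
    """Leaf triplets of a caterpillar whose leaf positions are listed
    from the deepest cherry up to the root."""
    found = set()
    for k in range(2, len(order)):
        for t1, t2 in combinations(order[:k], 2):
            found.add((frozenset((t1, t2)), order[k]))
    return found

def expected_fraction(positions, good, species, triplets):
    """Average, over all labellings, of the fraction of `triplets`
    (tuples (x, y, z) meaning xy|z) that the labelled shape satisfies."""
    total = Fraction(0)
    count = 0
    for perm in permutations(positions):
        where = dict(zip(species, perm))  # species -> leaf position
        hits = sum((frozenset((where[x], where[y])), where[z]) in good
                   for x, y, z in triplets)
        total += Fraction(hits, len(triplets))
        count += 1
    return total / count

positions = [0, 1, 2, 3, 4]
good = shape_triplets(positions)
p_prime = Fraction(len(good), 3 * comb(5, 3))  # fraction of the full set
T = [("a", "b", "c"), ("a", "c", "b"), ("d", "e", "a"), ("b", "e", "d")]
print(p_prime, expected_fraction(positions, good, "abcde", T))  # prints: 1/3 1/3
```

The two printed values agree whatever triplets are placed in `T`, which is the content of the argument.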

Derandomizing: a sketch

• An appropriate labelling can be found in time O(m⁴n³), where m is the number of vertices in the unlabelled network N.

• We do this by labelling the leaves of N, one at a time.

• General idea: At a given iteration of the algorithm, let F be the set of leaves of N which have already been labelled with species.

• We then arbitrarily pick an unlabelled leaf t and add it to F, by labelling it. But how do we choose the species that labels it?

• We choose the species that maximises the expected fraction of T that the finished labelled network N’ will be consistent with, assuming the labelling of the leaves in F ∪ {t} is fixed, and that the remaining leaves are labelled uniformly at random.

• The main point to observe is how the probabilities can be computed in polynomial time.


• We compute the probability for each triplet independently.

• E.g. consider a triplet xy|z. Suppose x and y have already been assigned to leaves.

• What is the probability that xy|z will be in, given that the remaining leaves are labelled u.a.r.?

• Simply try all possible ways of mapping z into the remaining leaves, and count the successful mappings.

[Figure: a network in which x and y are already assigned to leaves; each remaining leaf is marked as either a good leaf or a bad leaf for z.]
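The leaf-by-leaf procedure can be sketched in Python as follows. This is an unoptimised illustration of the method of conditional expectation, not the authors' implementation: the shape is represented only by its set `good` of consistent leaf triplets, a caterpillar is used as the example shape, and all function names are assumptions.

```python
from fractions import Fraction
from itertools import combinations, permutations

def caterpillar_triplets(order):
    """Leaf triplets of a caterpillar, positions listed cherry to root."""
    found = set()
    for k in range(2, len(order)):
        for t1, t2 in combinations(order[:k], 2):
            found.add((frozenset((t1, t2)), order[k]))
    return found

def conditional_probability(triplet, placed, free, good):
    """Probability that xy|z ends up consistent when the species not yet in
    `placed` are assigned to the `free` leaves uniformly at random."""
    x, y, z = triplet
    missing = [s for s in triplet if s not in placed]
    hits = total = 0
    # Try every way of mapping the missing species into the free leaves.
    for spots in permutations(free, len(missing)):
        where = {**placed, **dict(zip(missing, spots))}
        total += 1
        hits += (frozenset((where[x], where[y])), where[z]) in good
    return Fraction(hits, total)

def derandomized_labelling(positions, good, species, triplets):
    placed = {}                # species -> leaf position (the set F)
    free = list(positions)
    unplaced = list(species)
    while free:
        leaf = free.pop()      # arbitrary unlabelled leaf t

        def score(s):
            trial = {**placed, s: leaf}
            return sum(conditional_probability(t, trial, free, good)
                       for t in triplets)

        best = max(unplaced, key=score)  # species maximising the expectation
        placed[best] = leaf
        unplaced.remove(best)
    return placed

positions = list(range(5))
good = caterpillar_triplets(positions)
T = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a"),
     ("d", "e", "a"), ("a", "d", "e"), ("c", "e", "b")]
labelling = derandomized_labelling(positions, good, "abcde", T)
hits = sum((frozenset((labelling[x], labelling[y])), labelling[z]) in good
           for x, y, z in T)
print(hits, "of", len(T), "triplets satisfied")
```

Because the best choice at each step is at least as good as the average over all choices, the final labelling satisfies at least a fraction p’ of T, here at least 2 of the 6 triplets.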

Worst-case optimal algorithm for level-1

• Jansson, Nguyen & Sung (2005) showed how to construct the galled caterpillar on n leaves, denoted C(n).

• This level-1 network C(n) has the property that no other level-1 network is consistent with a higher fraction of the full triplet set T1(n); it is thus in some sense optimal.

• It is easy to construct C(n) in time polynomial in n. Combining this with our generic derandomized labelling algorithm, we obtain a polynomial-time worst-case optimal algorithm for level-1.

• For level-1 networks, let us parameterize the p-fraction as a function of n, the number of species. Combining our result with that of Jansson, Nguyen & Sung, we get:

$$ S(0) = S(1) = S(2) = 0 $$

$$ n \ge 3:\quad S(n) = \max_{1 \le a \le n} \left\{ \binom{a}{3} + 2\binom{a}{2}(n-a) + a\binom{n-a}{2} + S(n-a) \right\} $$

$$ p(n) = \frac{S(n)}{3\binom{n}{3}} $$

• The value p(n) seems to smoothly approach a horizontal asymptote of ≈0.4880… from above. With help from Mathematica and some insights into ‘good’ values of a, we have bounded p(n) below by 0.48 for all n.
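The behaviour of p(n) can be explored numerically. The Python sketch below assumes the recurrence S(0) = S(1) = S(2) = 0 and, for n ≥ 3, S(n) = max over 1 ≤ a ≤ n of C(a,3) + 2·C(a,2)·(n−a) + a·C(n−a,2) + S(n−a), with p(n) = S(n) / (3·C(n,3)). This is a reading of the formula on the slide, so treat the numbers as illustrative.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def S(n):
    """Number of triplets of the full set on n leaves counted by the
    recurrence (assumed form, see the lead-in)."""
    if n < 3:
        return 0
    return max(comb(a, 3) + 2 * comb(a, 2) * (n - a)
               + a * comb(n - a, 2) + S(n - a)
               for a in range(1, n + 1))

def p(n):
    """Fraction of the full triplet set on n >= 3 leaves."""
    return Fraction(S(n), 3 * comb(n, 3))

for n in (3, 4, 5, 10, 50):
    print(n, float(p(n)))
```

Under this reading p(3) = 2/3, matching the three-leaf network that is consistent with two of its three triplets, and the values then decrease as n grows.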



The galled caterpillar C(17)

$$ S(n) = \max_{1 \le a \le n} \left\{ \binom{a}{3} + 2\binom{a}{2}(n-a) + a\binom{n-a}{2} + S(n-a) \right\} $$

• Galled caterpillars have a very regular structure, and this allows us to do a faster, customized derandomization, in time O(|T|n²).


Level-2

• Using a combination of our relabelling technique, Java programming, and Mathematica, we were easily (in one afternoon) able to prove a lower bound on p of 0.61 for level-2 networks.

• The real value of p for level-2 is probably somewhere around 2/3. But to prove that conclusively we need to know what optimal level-2 networks look like for the full triplet set! A nice challenge for someone...


Conclusions and open problems

• We have shown that all tree shapes are worst-case optimal; we have identified p(n) for level-1 networks, and given a lower bound on p for level-2.

• More generally: we show how, for any given subclass of networks, the p-fraction can be obtained by studying only the full triplet set and that (generic or customised) polynomial-time algorithms can be constructed around this.

• Obtaining (bounds on) p can also be a first step on the road to good approximation algorithms for the MAX variants; it gives a (1/p) approximation for the MAX variant.

• Significance for biology, for the triplet method, for the level-k hierarchy? Our result is probably bad news for the field (not much discriminatory power)

• What is the real value of p for level-2, and for higher level networks, and for other subclasses of networks?

• Are better approximation factors than (1/p) possible in polynomial time for the MAX variants?