

Post on 30-Dec-2015


CNRS - Université Montpellier 2, France

Phylogenetic Signal with Induction and non-Contradiction: the PhySIC method for building supertrees

http://atgc.lirmm.fr/SuperTree/PhySIC

Vincent Berry (1), V. Ranwez (2), A. Criscuolo (1,2), P.-H. Fabre (2), S. Guillemot (1), C. Scornavacca (1,2), E.J.P. Douzery (2)

Funded by ACI IMPBIO & BIOSTIC LR

PhySIC: Phylogenetic Signal with Induction and non-Contradiction

Introduction: use of supertrees

Supertrees are useful for:
  Producing well-resolved large phylogenies to provide a framework for broad comparative studies (Gittleman et al. 2004)
  Quantitative studies of input-tree congruence, identifying outlier taxa by tree-supertree distance measures (Wilkinson et al. 2004)
  Exploring and identifying agreement and disagreement among sets of input trees. The aim is then to reveal conflicts rather than to resolve them. Conflicts are ultimately resolved from additional data or analyses (Wilkinson et al. 2001)
  Identifying where limited overlap between the leaf sets of the input trees is an obstacle to their amalgamation, thereby guiding further research (Sanderson et al. 1996, Arné et al. 2007)

Introduction: dealing with conflicts

Dealing with topological contradictions ("conflicts") among source trees:
  Voting methods (MRP, MMC, CLANN, …) resolve conflicts based on a voting procedure (optimization approach)
  Veto methods (Strict Consensus, BUILD, SMAST) do not favor any resolution in case of conflict (consensus approach)

[Figure: two example source trees on the taxa A, B, C and D]

Veto methods

Proceed from an axiomatic approach: proposed supertrees satisfy specified theoretical properties.

Goal: obtain a reliable, if incomplete, picture of how the source trees fit together.

Motivation:
  Full congruence with the source trees can be necessary for further applications such as phylogeography, divergence time estimation, etc.
  Avoid as much as possible the inference of unsupported novel clades, unlike some existing voting methods

Overview

Some relevant properties for reliable inference:
  Decomposition of a tree into triplets
  Identifying a tree
  Property of Induction (PI)
  Property of non-Contradiction (PC)

Algorithms (sketch):
  BUILD (Aho et al.)
  PhySIC_PC
  PhySIC_PI

Biological case study: Primate supertree

Conclusion & prospects

Axiomatic approach: important properties

Police investigation    Supertree
The inspector           The supertree method
The witnesses           The source trees
The testimonies         Phylogenetic information contained within source trees

Deducing the true story:
  Pointing out contradictions in the testimonies
  Deducing new facts by cross-checking

Reliable facts are those that can be induced from testimonies and that are not incompatible with any other testimony.

Decomposition of trees into building blocks

Triplets (rooted triples): subtrees on 3 taxa

[Figure: source trees T1 = (((a,b),c),d) and T2 = (((e,b),d),c); T1 is decomposed into its four 3-taxon subtrees, on {a,c,d}, {a,b,d}, {b,c,d} and {a,b,c}]

tr(T1) = {ab|c, ab|d, ac|d, bc|d}

tr(T2) = {eb|c, eb|d, ed|c, bd|c}
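As an illustration of the decomposition above, here is a minimal Python sketch that computes tr(T) for a rooted tree. The nested-tuple encoding of trees, the single-character taxon names and the function names (`leaves`, `triplets`, `parse`) are assumptions made for this example; they are not taken from the PhySIC software.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def triplets(tree):
    """Return tr(tree) as a set of (frozenset({x, y}), z) pairs, read as xy|z."""
    if isinstance(tree, str):
        return set()
    result = set()
    for child in tree:
        # Triplets whose three leaves all sit inside one child subtree.
        result |= triplets(child)
    child_leaves = [leaves(child) for child in tree]
    # x and y in the same child, z in another child: this node separates z from {x, y}.
    for i, inside in enumerate(child_leaves):
        for j, outside in enumerate(child_leaves):
            if i == j:
                continue
            for x, y in combinations(sorted(inside), 2):
                for z in outside:
                    result.add((frozenset((x, y)), z))
    return result

def parse(text):
    """Turn 'ab|c' into the internal representation (single-character taxa only)."""
    pair, z = text.split("|")
    return (frozenset(pair), z)

# The two source trees of the slide, written as nested tuples (assumed encoding).
T1 = ((("a", "b"), "c"), "d")
T2 = ((("e", "b"), "d"), "c")
```

With this encoding, `triplets(T1)` and `triplets(T2)` contain four triplets each, matching tr(T1) and tr(T2) as listed on the slide. An unresolved node such as `("a", "b", "c")` contributes no triplet.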

Properties of interest: identification

A tree T displays a set R of triplets iff R ⊆ tr(T). In such a case R is said to be compatible: all triplets of R can be combined into a tree.

R identifies T iff T displays R AND every tree T' displaying R contains all the clades of T.

[Figure: the tree T = (((a,b),c),d) with its subtrees on {a,b,c} and {b,c,d}, and a second tree on the same taxa marked with a cross]

R = {bc|d, ab|c} identifies T.

R' = {ab|c, ab|d} does not identify T: for instance, ((a,b),(c,d)) also displays R' but does not contain the clade {a,b,c}.

Properties of interest: identification

R identifies T, yet R does not contain all triplets of tr(T): additional triplets are induced by those present in R.

R = {bc|d, ab|c}: ab|d and ac|d are induced.

[Figure: the subtrees on {b,c,d} and {a,b,c} corresponding to R, and the tree T = (((a,b),c),d) that R identifies]
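The notion of induced triplet can be tested mechanically. The sketch below relies on a standard characterisation, stated here as an assumption of the example rather than as the procedure used inside PhySIC: a compatible set R induces xy|z exactly when adding either alternative resolution, xz|y or yz|x, makes the set incompatible. Compatibility is tested with a BUILD-style recursion on connected components. The tuple encoding `(x, y, z)` for xy|z and the names `parse`, `compatible` and `induces` are hypothetical.

```python
def parse(text):
    """Turn 'ab|c' into the tuple ('a', 'b', 'c') (single-character taxa only)."""
    return (text[0], text[1], text[3])

def compatible(leaf_set, R):
    """True iff some rooted tree on leaf_set displays every triplet of R."""
    leaf_set = set(leaf_set)
    if len(leaf_set) <= 2:
        return True
    # Only triplets entirely inside the current leaf set matter.
    local = [t for t in R if set(t) <= leaf_set]
    # Connected components of the graph with an edge x-y for each xy|z.
    comp = {v: {v} for v in leaf_set}
    for x, y, _ in local:
        if comp[x] is not comp[y]:
            merged = comp[x] | comp[y]
            for v in merged:
                comp[v] = merged
    parts = {frozenset(c) for c in comp.values()}
    if len(parts) == 1:
        return False  # the graph is connected: no root split is possible
    return all(compatible(part, local) for part in parts)

def induces(R, triplet):
    """True iff the compatible set R induces the triplet xy|z."""
    x, y, z = triplet
    labels = {v for t in R for v in t} | {x, y, z}
    if not compatible(labels, R):
        raise ValueError("R must be compatible")
    return (not compatible(labels, list(R) + [(x, z, y)])
            and not compatible(labels, list(R) + [(y, z, x)]))

R = [parse("bc|d"), parse("ab|c")]
print(induces(R, parse("ab|d")), induces(R, parse("ac|d")))
```

For R = {bc|d, ab|c} both calls return True, in agreement with the slide. For R' = {ab|c, ab|d}, the triplet ac|d is not induced, since ((a,b),d),c) displays R' together with ad|c.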

Relevant properties: induction (PI)

We want to infer reliable supertrees: not making arbitrary inferences.

PI: we only accept supertrees T such that tr(T) is present in the data R or induced by hypotheses in R.

[Figure: R = {ab|c, ab|d}, from source subtrees on {a,b,c} and {a,b,d}. Three candidate supertrees on {a,b,c,d}: two are annotated with additional triplets marked "?" (ac|d? cd|b? and ac|d? bc|d?); the third is annotated only with ab|c and ab|d]

Focusing on a coherent subset of hypotheses

[Figure: source trees (((a,b),c),d) and (((a,b),d),c) give R = {ab|c, bc|d, ab|d, ac|d, ad|c, bd|c}. Supertree method: is there a tree T such that R identifies T?]

There is no chance that practical data exactly identify a (super)tree:
  Lack of overlap between the source trees: missing data
  Errors due to gene-specific evolution, systematic errors in the source tree inference (long branch attraction, estimated model of evolution)

However, there is a chance that part of the underlying "correct" tree appears uncorrupted in the data: find a subset R' of R identifying a tree (i.e., a subtree of the underlying tree).

Relevant properties: non-contradiction

We search for a subset of R identifying a tree T.

But we want to be reliable: no clade contradicted by the data.

PC: we don't accept hypotheses that are in direct contradiction with discarded hypotheses. We reject subsets R' obtained by keeping xy|z and removing xz|y.

We focus on R(T), the triplets of R resolved by T.

[Figure: R = {ab|c, ab|d, bc|d, ac|d, bd|c, ad|c}; a subset R' ⊆ R identifies a tree T on {a,b,c,d}]

Link between the properties

"R(T) identifies T" is equivalent to:
  T satisfies PC (property of non-contradiction): for any triplet ab|c displayed by T, R(T) induces neither bc|a nor ac|b
  and T satisfies PI (property of induction): every triplet ab|c displayed by T is induced by R(T)

Given a supertree T and a collection of source trees, PI and PC can be checked in polynomial time.

A given supertree can be modified in polynomial time so that it satisfies PI and PC.

Why not design a supertree method that proposes supertrees satisfying PI and PC from the start? This is the PhySIC method (Phylogenetic Signal with Induction and non-Contradiction).

Overview

Relevant properties for a veto method (reliable facts):
  Decomposition of a tree into triplets
  Tree identification
  Property of Induction (PI)
  Property of non-Contradiction (PC)

Algorithms (sketch):
  BUILD (Aho et al.)
  PhySIC_PC
  PhySIC_PI

Biological case study: Primate supertree

Conclusion & prospects

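Algorithmic ideas: BUILD (Aho et al. 81)

[Figure: BUILD applied to R = {bc|d, ab|c}, from source subtrees on {b,c,d} and {a,b,c}. The graph on {a,b,c,d} splits into the components {a,b,c} and {d}; the graph on {a,b,c} splits into {a,b} and {c}; the graph on {a,b} splits into a and b. The resulting tree is (((a,b),c),d)]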
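The recursion illustrated on this slide can be sketched in Python as follows. This is a minimal version of BUILD written for illustration, assuming triplets are encoded as tuples `(x, y, z)` meaning xy|z and trees as nested tuples; the names `build` and `parse` are hypothetical, and no attempt is made at the efficiency of a real implementation.

```python
def parse(text):
    """Turn 'ab|c' into the tuple ('a', 'b', 'c') (single-character taxa only)."""
    return (text[0], text[1], text[3])

def build(leaf_set, R):
    """Return a nested-tuple tree displaying R on leaf_set, or None if R is incompatible."""
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))
    # Keep only the triplets whose three leaves are in the current leaf set.
    local = [t for t in R if set(t) <= leaf_set]
    # Union-find over the graph that has an edge x-y for each triplet xy|z.
    parent = {v: v for v in leaf_set}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _ in local:
        parent[find(x)] = find(y)
    components = {}
    for v in leaf_set:
        components.setdefault(find(v), set()).add(v)
    if len(components) == 1:
        return None  # connected graph on two or more leaves: R is incompatible
    subtrees = []
    # Each connected component becomes one child subtree of the current node.
    for comp in sorted(components.values(), key=sorted):
        sub = build(comp, local)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)

R = [parse("bc|d"), parse("ab|c")]
print(build("abcd", R))
```

On R = {bc|d, ab|c} the sketch returns the nested tuple for (((a,b),c),d), following the same sequence of splits as the figure. It returns `None` as soon as some graph in the recursion is connected.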

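Algorithmic ideas: limits of BUILD

BUILD returns a tree only when R is compatible.

[Figure: two incompatible examples. R1 = {ab|c, ac|b, bc|d, ab|d, ac|d}, from source trees (((a,b),c),d) and (((a,c),b),d): the graph on {a,b,c,d} splits into {a,b,c} and {d}, but the graph on {a,b,c} is connected. R2 = {bc|d, bd|c, ac|d, ad|c, ab|c, ab|d}, from source trees (((a,b),c),d) and (((a,b),d),c): the graph on {a,b,c,d} is already connected]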

Algorithmic ideas: PhySIC_PC

Idea: temporarily forget the direct contradictions.

[Figure: R = {bc|d, bd|c, ac|d, ad|c, ab|c, ab|d}, from source trees (((a,b),c),d) and (((a,b),d),c); its graph on {a,b,c,d} is connected. R' is obtained from R by setting aside the triplets in direct contradiction (bc|d vs bd|c, ac|d vs ad|c); its graph separates {a,b}, c and d]

At each iteration, if there is a single connected component:
  Check whether using R' leads to several connected components
  If so, check that the tree will satisfy PC w.r.t. R
  Otherwise, propose a multifurcation on those taxa

We thus obtain a more resolved tree satisfying PC: contradictions affecting basal clades do not always prevent deeper clades from being obtained.

Algorithmic ideas: limits of BUILD (2)

[Figure: R = {ab|c, ef|c}, from source trees ((a,b),c) and ((e,f),c). The graph on {a,b,c,e,f} has the components {a,b}, {c} and {e,f}. Is a triplet such as ef|a really supported?]

When the graph contains several connected components, it is necessary to check that the triplets we are about to create are really induced by R.

Branches that create triplets not induced by R are collapsed (using graph algorithms).

Algorithmic ideas: a summary

A supertree draft is proposed by PhySIC_PC, ensuring PC.

If a clade is not "strong enough", the corresponding branch is collapsed by PhySIC_PI, which also ensures PI.

PhySIC is a polynomial-time supertree method:

1. Decomposition of the input forest into triplets: O(kn^3)

2. Creation of a tree satisfying PC: O(n^4)

3. Collapsing edges displaying triplets not induced by the source trees: O(n^4)

The algorithm requires O(kn^3 + n^4) computing time.
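A short justification of the first term and of the total, under the assumption (not stated on the slide) that k denotes the number of source trees and n the total number of taxa: each source tree T_i displays at most one triplet per set of three taxa, and the three steps run one after the other.

```latex
\[
  \sum_{i=1}^{k} \lvert \mathrm{tr}(T_i) \rvert
  \;\le\; k \binom{n}{3}
  \;=\; k \, \frac{n(n-1)(n-2)}{6}
  \;=\; O(k n^{3}),
\]
\[
  O(k n^{3}) + O(n^{4}) + O(n^{4}) \;=\; O(k n^{3} + n^{4}).
\]
```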

Overview

Relevant properties for a veto method:
  Decomposition of a tree into triplets
  Tree identification
  Property of Induction (PI)
  Property of non-Contradiction (PC)

Algorithms (intuitive presentation):
  BUILD (Aho et al.)
  PhySIC_PC
  PhySIC_PI

Biological case study: Primate supertree

Conclusion & prospects

Primate case study: source trees

ADRA2B and IRBP study (Poux et al. 04, 06)

SINEs (Roos et al. 04)

Branches with bootstrap support <50% are collapsed.

[Figure: the source trees, with the Anthropoids labelled]

Primate case study: PC & PI in action

[Figure: the source trees (ADRA2B, IRBP), the PhySIC_PC supertree and the PhySIC supertree]

Platyrrhines are unresolved due to a conflict (PC).

Arbitrary resolution among Anthropoids is removed (PI).

Labels indicating source of problems

PhySIC can tell the reason for each proposed multifurcation:
  Lack of overlap or information in the source trees (i)
  Local contradictions between the source trees (c)

This guides the correction/completion of source trees and primary data.

Pointing out "problems" in other supertrees

A supertree already built from a collection of source trees by a usual supertree method can be reanalyzed in the light of PI & PC to identify problems on some dubious nodes.

E.g., MRP is known to have some undesirable features:
  Inferring "novel clades" not supported by any input tree (Bininda-Emonds & Bryant 98, Goloboff & Pol 01, Goloboff 05)
  Being affected by a size bias, i.e. when two trees conflict on the resolution of a clade, the tree with the smallest local sampling is ignored (Purvis 95, Bininda-Emonds & Bryant 98, Goloboff 05)
  Favoring source trees that are more unbalanced (Wilkinson et al. 01)

Primate case study: MRP tree analyzed

[Figure: the source trees (ADRA2B, IRBP), the MRP supertree with node labels from the PC check, and the filtered MRP supertree]


Online server: http://atgc.lirmm.fr/SuperTree/PhySIC

Contact:

[email protected]

Conclusion & prospects

PI and PC properties, PhySIC method (http://atgc.lirmm.fr/SuperTree/PhySIC): appearing in the November issue of Syst. Biol.

Supertrees satisfying PI and PC (exact) and as resolved as possible (heuristics)

Proposes very reliable supertrees:
  Identified by the data (low type-I error)
  Polynomial-time method
  Localization of conflicts and areas with insufficient overlap
  Makes it possible to check/correct supertrees built by other methods (MRP, …)

Further developments:
  Producing more resolved trees satisfying PC and PI
  Filtering triplets based on their frequencies
  Coupling with a database (TreeBase, …)


Thanks

Emmanuel Douzery

Vincent Ranwez

Alexis Criscuolo

Sylvain Guillemot

Pierre-Henri Fabre

Celine Scornavacca

Vincent Lefort

Équipe Méthodes et Algorithmes pour la Bioinformatique (Methods and Algorithms for Bioinformatics team), LIRMM

Équipe Phylogénie Moléculaire (Molecular Phylogeny team), ISEM