Post on 21-Dec-2015
1
Disk-Covering Method
Based on the paper by D. Huson, S. Nettles, and T. Warnow
Presented by Galiya S. and Eduard S.
[Figure: Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life website, University of Arizona.]
2
Phylogenetic Tree
A phylogenetic tree is a tree showing the evolutionary interrelationships among various species.
From the Desert Vista high school, Phoenix, Arizona
3
Jukes-Cantor model
Definition 1: Let T be a fixed rooted tree with leaves labeled 1,…,n. The Jukes-Cantor model makes the following assumptions:
1. The possible states for each site are A, C, T, G.
2. The sequence length is an input parameter, and for each site the state at the root is drawn from a distribution (typically uniform).
3. The sites evolve identically and independently (i.i.d.) down the tree from the root.
[Figure: example sequences AGACTT (with one site marked), GGACTT, and AGGCCT.]
4
Jukes-Cantor model (cont.)
4. For each edge $e = (u,v) \in E(T)$ with u the parent of v, if the state of a site at v differs from its state at u, then each of the three remaining states is equally likely at v, with probability $\tfrac{1}{3}$ each.
[Figure: 4×4 substitution matrix over the states A, G, C, T with entries a, and an example tree on the sequences AGACTT, GGACTT, AGGCCT, GGGCAT, AGCCCT, GCACTT with an edge e from u to v.]
The example above is based on a CIPRES presentation, University of Texas at Austin.
5
Jukes-Cantor model (cont.)
5. To each edge e in the tree T is associated a Poisson random variable $X_e$ for the number of mutations of a randomly selected site on that edge.
6. Each edge e has an expectation $\lambda_e = E(X_e)$.
Multiple changes at a single site (hidden changes):
seq1 AGTCAG
seq2 AGTCAC
Number of changes at the fifth site:
Seq1: T → G → C → A (3 changes)
Seq2: T → A (1 change)
[Figure: ancestor AGTCTG with descendants AGTCAG and AGTCAC on the edges $e_1$ and $e_2$.]
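The edge model above can be simulated in a few lines. The following is only a sketch under stated assumptions: the function names (`poisson`, `evolve_site`, `evolve_sequence`) and the rate value 0.5 are hypothetical, and the Poisson sampler is Knuth's simple multiplication method rather than anything prescribed by the slides. It illustrates why the observed number of differing sites can be smaller than the true number of changes.

```python
import math
import random

STATES = "ACGT"

def poisson(lam, rng):
    """Sample a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def evolve_site(state, lam, rng):
    """Apply a Poisson(lam) number of changes to one site; every change
    moves to one of the three other states with equal probability.
    Returns (final state, number of changes)."""
    changes = poisson(lam, rng)
    for _ in range(changes):
        state = rng.choice([s for s in STATES if s != state])
    return state, changes

def evolve_sequence(parent, lam, rng):
    """Evolve all sites independently along a single edge."""
    child, total = [], 0
    for state in parent:
        new_state, changes = evolve_site(state, lam, rng)
        child.append(new_state)
        total += changes
    return "".join(child), total

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
child, total_changes = evolve_sequence("AGTCTG", 0.5, rng)
observed = sum(a != b for a, b in zip("AGTCTG", child))
# observed can never exceed total_changes; the gap is the hidden changes
print(child, total_changes, observed)
```

Because a site can change more than once (or change back), counting differing sites underestimates the amount of evolution on long edges.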
6
Definition 2 (split): Removing an edge e from an unrooted phylogenetic tree T partitions the leaf set S of the tree into two non-empty sets. We denote this split by $\pi_e$.
Definition 2: T is the unrooted true tree and T' is the unrooted inferred tree, both with leaves labeled 1,…,n, and e ranges over the internal edges. Define:
$C(T) = \{\pi_e \mid e \in E(T)\}$, $C(T') = \{\pi_e \mid e \in E(T')\}$
Example: S = {1,2,3,4,5}, $\pi_e$ = {1,2} | {3,4,5}
[Figure: tree T on the leaves 1 to 5 in which the edge e separates {1,2} from {3,4,5}.]
7
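Definition 2 (cont.)
Any split in $C(T) \setminus C(T')$ is called a false negative (FN).
Any split in $C(T') \setminus C(T)$ is called a false positive (FP).
An edge $e \in E(T)$ is recovered in T' if the split $\pi_e$ appears in $C(T')$.
Example:
T: $C(T) = \{\pi_{e_1}, \pi_{e_2}\}$, $\pi_{e_1}$ = {1,2} | {3,4,5}, $\pi_{e_2}$ = {1,2,3} | {4,5}
T': $C(T') = \{\pi_{e_1}, \pi_{e_2}\}$, $\pi_{e_1}$ = {1,2} | {3,4,5}, $\pi_{e_2}$ = {1,2,4} | {3,5}
The split {1,2,3} | {4,5} is a false negative and the split {1,2,4} | {3,5} is a false positive.
[Figure: the trees T and T' on the leaves 1 to 5, with the FN edge marked in T and the FP edge marked in T'.]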
8
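Definition 2 (cont.)
FN rate = (number of false negatives) / (number of internal edges in T)
FP rate = (number of false positives) / (number of internal edges in T')
Example (the trees T and T' from the previous slide): each tree has two internal edges, one split is a false negative, and one split is a false positive.
FN rate = 1/2 = 50%
FP rate = 1/2 = 50%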
9
Additive matrix
A dissimilarity matrix is a symmetric matrix that is 0 on the diagonal.
Definition 3: A matrix D is called additive if there exists a tree T with positive edge weighting w such that $D_{ij} = \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path in T between leaves i and j.
Given an additive matrix D, the tree T can be uniquely reconstructed in $O(n^2)$ time.
10
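True distance
Reminder: Let T be the unrooted true tree, and let $P_{ij}$ be the path in T between leaves i and j.
We represent the evolutionary process by a set of Poisson processes $\{X_e \mid e \in E(T)\}$.
Let $X_{ij} = \sum_{e \in P_{ij}} X_e$ and let $\lambda_{ij} = E[X_{ij}]$. Then $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e$, where $\lambda_e = E[X_e]$.
$\lambda_{ij}$ is called the true distance between i and j.
$\lambda$ is an additive matrix.
[Figure: the path between leaves i and j crosses the edges $e_1$, $e_2$, $e_3$, so $X_{ij} = X_{e_1} + X_{e_2} + X_{e_3}$.]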
11
Hamming Distance
$H(i,j)$ is the number of sites at which the sequences i and j differ. $H(i,j)$ is called the Hamming distance.
$h_{ij} = H(i,j)/k$ is the normalized Hamming distance, where k is the sequence length.
Example:
s1 CAACCCCGGT
s2 TAATTTCGGT
H(s1, s2) = 4, k = 10
h(s1, s2) = 4/10 = 0.4
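The example above can be reproduced with a short sketch. The function names `hamming` and `normalized_hamming` are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
def hamming(seq_a, seq_b):
    """Number of sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def normalized_hamming(seq_a, seq_b):
    """Hamming distance divided by the sequence length k."""
    return hamming(seq_a, seq_b) / len(seq_a)

s1 = "CAACCCCGGT"
s2 = "TAATTTCGGT"
print(hamming(s1, s2), normalized_hamming(s1, s2))  # 4 0.4
```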
12
Distance correction
The Jukes-Cantor distance correction for each pair of leaves i, j is:
$d_{ij} = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} h_{ij}\right)$ if $h_{ij} < \tfrac{3}{4}$; otherwise $d_{ij}$ is undefined.
Afterwards, compute the maximum defined Jukes-Cantor distance, multiply that value by the number n of leaves, and replace all undefined values with the result.
Example:
The 4 leaves are: 1 TTGCC, 2 TGGCC, 3 TCAAG, 4 TTGGA
The matrix d (* marks an undefined value, where $h_{ij} = 0.8$):
     1    2      3    4
1    0    0.233  *    0.572
2         0      *    1.207
3                0    *
4                     0
Replace * with 1.207 × 4 = 4.828:
     1    2      3      4
1    0    0.233  4.828  0.572
2         0      4.828  1.207
3                0      4.828
4                       0
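A minimal sketch of the correction step, assuming the standard Jukes-Cantor formula with the natural logarithm, $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}h)$. The function names and the small input matrix of normalized Hamming distances are hypothetical and chosen only for illustration.

```python
import math

def jc_distance(h):
    """Jukes-Cantor corrected distance, or None when h >= 3/4
    (the logarithm is undefined there)."""
    if h >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * h / 3.0)

def corrected_matrix(h_matrix):
    """Correct every off-diagonal entry, then replace undefined entries
    by n times the largest defined distance."""
    n = len(h_matrix)
    d = [[0.0 if i == j else jc_distance(h_matrix[i][j]) for j in range(n)]
         for i in range(n)]
    largest = max(x for row in d for x in row if x is not None)
    fill = n * largest
    return [[fill if x is None else x for x in row] for row in d]

# Hypothetical normalized Hamming distances for three sequences.
h = [[0.0, 0.2, 0.8],
     [0.2, 0.0, 0.4],
     [0.8, 0.4, 0.0]]
for row in corrected_matrix(h):
    print([round(x, 3) for x in row])
```

The fill value is deliberately large, so pairs that are too divergent to correct end up far apart in the resulting matrix.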
13
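The error
Definition 7: Let $q \ge 0$ be a real number. Then:
$e_{ij} = |d_{ij} - \lambda_{ij}|$ and $\Delta(q) = \max\{ e_{ij} \mid \min(d_{ij}, \lambda_{ij}) \le q \}$
Example: q = 3.2
λ:
     1    2    3    4
1    0    1    3    4
2         0    3.4  4
3              0    1.5
4                   0
d:
     1    2    3    4
1    0    1.2  2.8  4.3
2         0    3.1  3.8
3              0    1.1
4                   0
e:
     1    2    3    4
1    0    0.2  0.2  0.3
2         0    0.3  0.2
3              0    0.4
4                   0
The pairs with $\min(d_{ij}, \lambda_{ij}) \le 3.2$ are (1,2), (1,3), (2,3), and (3,4), so $\Delta(q) = 0.4$.
[Figure: the weighted trees corresponding to the matrices above.]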
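The example can be verified with a short sketch. The name `error_up_to` is hypothetical; it computes the largest absolute difference between the estimated and the true distance over the pairs whose smaller distance is at most q, which is how the example value 0.4 arises.

```python
def error_up_to(d, lam, q):
    """Largest |d[i][j] - lam[i][j]| over the pairs i < j with
    min(d[i][j], lam[i][j]) <= q; 0.0 if no pair qualifies."""
    n = len(d)
    errors = [abs(d[i][j] - lam[i][j])
              for i in range(n) for j in range(i + 1, n)
              if min(d[i][j], lam[i][j]) <= q]
    return max(errors, default=0.0)

# True distances and estimated distances from the example (symmetric form).
lam = [[0, 1, 3, 4],
       [1, 0, 3.4, 4],
       [3, 3.4, 0, 1.5],
       [4, 4, 1.5, 0]]
d = [[0, 1.2, 2.8, 4.3],
     [1.2, 0, 3.1, 3.8],
     [2.8, 3.1, 0, 1.1],
     [4.3, 3.8, 1.1, 0]]

print(round(error_up_to(d, lam, 3.2), 3))  # 0.4
```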
14
Threshold Graph
Let d be an $n \times n$ dissimilarity matrix and let $q \ge 0$ be any real number.
The threshold graph Thresh(d,q) is defined as:
The vertex set is {1,2,…,n}.
The edges are: (i,j) is an edge if and only if $d_{ij} \le q$.
For example: q = 4.5
d:
     1    2    3    4
1    0    2    4    6
2         0    7    5
3              0    1
4                   0
Thresh(d,4.5): the edges are (1,2) with weight 2, (1,3) with weight 4, and (3,4) with weight 1.
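The definition translates directly into code. In this sketch the function name `threshold_graph` is hypothetical, the matrix is given in full symmetric form, and vertices are numbered from 1 as on the slide.

```python
def threshold_graph(d, q):
    """Edge set of Thresh(d, q): pairs (i, j), i < j, with d_ij <= q.
    Vertices are numbered 1..n."""
    n = len(d)
    return {(i + 1, j + 1)
            for i in range(n) for j in range(i + 1, n)
            if d[i][j] <= q}

d = [[0, 2, 4, 6],
     [2, 0, 7, 5],
     [4, 7, 0, 1],
     [6, 5, 1, 0]]

print(sorted(threshold_graph(d, 4.5)))  # [(1, 2), (1, 3), (3, 4)]
```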
15
Triangulated graph
Definition: A graph is triangulated if no subset of nodes induces a cycle of length four or more.
Taken from wikipedia
16
A generic disk-covering method has four steps:
1. Decomposition: Compute a decomposition of the dataset into overlapping subsets.
2. Solution: Construct trees on the subsets using a base method.
3. Merge: Use a supertree method to merge the trees on the subsets into a tree on the full dataset.
4. Refinement: Compute the asymmetric median tree of all possible supertrees.
Disk Covering Method
The example above based on CIPRES ppt.University of Texas at Austin.
17
Simplicial elimination order
Lemma: A simplicial elimination ordering is an ordering $v_1, v_2, \dots, v_n$ of the vertices of G such that, for each i, the set
$X_i = \{ v_j : j > i, (v_i, v_j) \in E(G) \}$
forms a clique. Every triangulated graph G has a simplicial elimination ordering.
The maximal cliques in G are of the form $\{v_i\} \cup X_i$. This ordering can be found in $O(n^2)$ time, so the maximal cliques of G can also be found in $O(n^2)$ time.
Example: $\{v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8\}$
[Figure: example graph on the vertices $v_1, \dots, v_8$.]
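One way to make the lemma concrete is a naive search that repeatedly removes a vertex whose remaining neighbours form a clique. This sketch is an illustration only: it does not achieve the $O(n^2)$ bound mentioned above, and the function names and the small example graph are hypothetical.

```python
from itertools import combinations

def simplicial_elimination_order(adj):
    """adj maps each vertex to the set of its neighbours. Returns an
    ordering in which the later neighbours of every vertex form a
    clique, or None if the graph is not triangulated."""
    remaining = set(adj)
    order = []
    while remaining:
        for v in sorted(remaining):
            later = adj[v] & remaining
            if all(b in adj[a] for a, b in combinations(later, 2)):
                order.append(v)
                remaining.remove(v)
                break
        else:
            return None  # no simplicial vertex is left
    return order

def maximal_cliques(adj, order):
    """Candidate cliques {v_i} ∪ X_i, keeping only the maximal ones."""
    position = {v: i for i, v in enumerate(order)}
    candidates = [frozenset({v} | {u for u in adj[v]
                                   if position[u] > position[v]})
                  for v in order]
    return {c for c in candidates
            if not any(c < other for other in candidates)}

# Hypothetical example: a triangle 1-2-3 with a pendant vertex 4 on 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
order = simplicial_elimination_order(adj)
print(order)  # [1, 2, 3, 4]
print(sorted(sorted(c) for c in maximal_cliques(adj, order)))  # [[1, 2, 3], [3, 4]]
```

A chordless 4-cycle has no simplicial vertex, so the first function returns `None` for it.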
18
Constructing Tq
Input: d, a dissimilarity matrix; a real number q > 0.
Output: the reconstructed tree Tq.
1. Compute Thresh(d,q).
2. Triangulate Thresh(d,q) (polynomial complexity).
3. Compute Buneman trees for all maximal cliques in the triangulated Thresh(d,q).
4. Merge the subtrees into a supertree.
Overall complexity: polynomial.
[Slide annotations: $O(n^2)$ appears three times next to the steps.]
19
Intersection graph
An intersection graph is an undirected graph formed from a family of sets
$\{S_1, S_2, \dots, S_m\}$, $S_i = \{v_{i1}, v_{i2}, \dots, v_{ik}\}$,
by choosing one vertex $v_i$ for each set $S_i$ and connecting two vertices $v_i, v_j$ whenever the corresponding sets $S_i$ and $S_j$ have a non-empty intersection.
[Figure: sets $S_i$ and $S_j$ with their vertices $v_i$ and $v_j$.]
Taken from wikipedia
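A short sketch of the construction, with a hypothetical function name and a hypothetical family of three sets; the sets are numbered from 1 in the order they are given.

```python
from itertools import combinations

def intersection_graph(sets):
    """Edges (i, j), i < j, between the sets (numbered 1..m) that share
    at least one element."""
    return {(i + 1, j + 1)
            for (i, a), (j, b) in combinations(enumerate(sets), 2)
            if a & b}

family = [{1, 2}, {2, 3}, {4}]
print(sorted(intersection_graph(family)))  # [(1, 2)]
```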
20
Triangulating Thresh(d,q): Complexity
Lemma: If d is an additive matrix, then Thresh(d,q) is triangulated.
Proof: Let d be an arbitrary additive matrix, and let (T,w) be the edge-weighted tree uniquely associated with d. Let q > 0. Add intermediate vertices to the edges of T and re-weight the edges so that the path lengths between leaf pairs are unchanged, but for every pair of leaves u and v in T, if $d_{u,v} \ge q/2$ then there is a node x in the enlarged tree T' such that
$d_{T'}(u,x) = q/2$ and $d_{T'}(x,v) = d_{T'}(u,v) - q/2$.
[Figure: tree T' with the leaves u and v, the node x, and the subtree $X_u$ of T'.]
21
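Triangulating Thresh(d,q): Complexity
Now let $X_u$ denote the subtree of T' consisting of the nodes at distance at most q/2 from u. Note that $X_u \cap X_v \neq \emptyset$ if and only if $d_{u,v} \le q$, and so Thresh(d,q) is identical to the intersection graph of the $X_u$ as u ranges over the leaves of T. The intersection graph of a family of subtrees of a tree is triangulated; consequently Thresh(d,q) is triangulated.
[Figure: the tree with the subtrees $X_u$ and $X_v$ meeting at x, next to the corresponding edge (u,v) in Thresh(d,q), the intersection graph.]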
Taken from wikipedia
22
Supertree Construction Algorithm (SCA)
Step 1: First obtain a simplicial elimination ordering $v_1, \dots, v_n$ for G. Compute $C_i = \{v_i\} \cup X_i$, where
$X_i = \{ v_j : j > i, (v_i, v_j) \in E(G) \}$.
For each $C_i$, find a maximal clique C containing $C_i$ and compute a tree $t_i$ for $C_i$ by deleting the leaves in $C - C_i$ from $T_C$.
Step 2: Construct the tree: for i = n-3, n-4, …, 1, compute the tree $T_i$ formed by merging $t_i$ and $T_{i+1}$ using the Strict Consensus Subtree Merger method.
Example:
C: {1,2,3,4}
C2: {2,3,4}
C - C2: {1}
Leaves left: {2,3,4}
23
Strict Consensus Subtree Merger
This method contracts a minimum set of edges in each tree in order to make the trees identical on the subtree they induce on their shared leaves. We denote that subtree by X and call it the backbone.
Merging two trees is done by attaching the pieces of each tree appropriately to the different edges of the backbone.
A situation in which some piece of each tree attaches onto the same edge of the backbone is called a collision.
[Figure: two trees, one on the leaves 1 to 6 and one on the leaves 1, 2, 3, 4, 7, their common backbone on the leaves 1 to 4, and the merged tree on the leaves 1 to 7.]
24
Short Quartet Definition
Let (T,w) be a binary tree, edge-weighted by $w : E(T) \to \mathbb{R}^{+}$ and leaf-labeled by the set of species $S = \{1,2,\dots,n\}$. Let e be an edge in T that is not incident to a leaf of T. Around e there are four subtrees A, B, C, D. Let a, b, c, d be the leaves of the subtrees A, B, C, D, respectively, that are closest to e, where the distance between leaves i and j is measured as $\sum_{e \in P_{ij}} w(e)$. We call {a,b,c,d} a short quartet around e, and the collection of all short quartets around internal edges of T is denoted by $Q_{short}(T)$.
[Figure: internal edge e with the four subtrees A, B, C, D around it and their closest leaves a, b, c, d.]
25
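Gsq Definition
Let $\lambda$ be the additive distance matrix associated with T. The graph $G_{sq}$ on the vertex set S = {1,2,…,n} is defined by: $(i,j) \in E(G_{sq})$ if and only if i and j are in the same short quartet.
[Figure: examples showing the leaves i and j in T and the corresponding edge in $G_{sq}$.]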
26
Proof of Tq correctness
Theorem: Let T be a leaf-labeled tree, and let G be a triangulated graph such that $G_{sq} \subseteq G$. Let $\{T_A\}$ be the collection of Buneman trees computed on the maximal cliques of G, assume that every tree in this collection is the correct subtree, and let T* be the tree obtained by applying SCA to $(G, \{T_A\})$. Then T* = T.
Proof: We show that under these conditions $T_i$ and T restricted to the same vertices are identical, and that no collision occurs.
Part I: Let T be a tree whose leaves are labeled by $S = \{v_1, v_2, \dots, v_n\}$. Let G be a triangulated graph on S, and let $\{T_A\}$ be given, where $T_A$ is a tree on leaf set A for every maximal clique A in G. Let $v_1, v_2, \dots, v_n$ be a simplicial elimination ordering of G. We show that $T_i = T|\{v_i, v_{i+1}, \dots, v_n\}$ for every i.
Base: $T_{n-3} = T|\{v_{n-3}, v_{n-2}, v_{n-1}, v_n\}$. This is true since we assumed that all Buneman trees are correct.
27
Proof of Tq correctness (cont.)
Assume $T_{i+1} = T|\{v_{i+1}, \dots, v_n\}$ for some $i \in \{1,2,\dots,n-4\}$. $X_i$ forms the leaf set of the backbone of the strict consensus merger of $t_i$ and $T_{i+1}$, so we get $T_{i+1}|X_i = T|X_i = t_i|X_i$. Consequently there is no edge contraction when we compute the backbone.
Part II: There can be a collision only if the backbone contains an edge onto which both $v_i$ and some other $v_j \notin X_i$ attach; denote this edge by e. Thus some subtree t' of $T_{i+1}$ attaches onto e. Let the leaf set of t' be contained in $Y = \{v_{i+1}, \dots, v_n\} - X_i$. Let P be the path in T corresponding to the edge e and let its endpoints be a and b. Let $T_0$ be the subtree of T obtained by deleting all the nodes in T that are separated from a by the deletion of b, and vice versa. Let $A_{a,b}$ be the leaves of $T_0$. The following are true:
1. $X_i \cap A_{a,b} = \emptyset$, and $v_i$ and all leaves in t' are in $A_{a,b}$.
2. $G_{sq}$ restricted to $A_{a,b}$ is path connected.
3. [statement not recoverable from the transcript]
28
Proof of Tq correctness (cont.)
Now let P' be a path lying in $G_{sq}|A_{a,b}$ from $v_i$ to some node in Y, and let y be the first node in Y on the path P'. By (3), the part of P' between $v_i$ and y also lies entirely in $\{v_1, v_2, \dots, v_{i-1}\}$, so y is the first node of P' after $v_i$ that belongs to $\{v_{i+1}, \dots, v_n\}$. Consequently $(v_i, y) \in E(G)$, that is, $y \in X_i$. But this contradicts the earlier assumption that $y \notin X_i$.
29
Experimental Results - Buneman
The FN rate of DCM-Buneman is lower than that of Buneman for every sequence length.
The FP rate of DCM-Buneman is slightly higher than that of Buneman: 3% and 0%, respectively.
The FN rate of DCM-Buneman reaches 5% at a sequence length of 10,000; Buneman does not reach this value.
30
Experimental Results - NJ
The FN and FP rates of DCM-NJ are significantly lower than those of NJ.
DCM-NJ drops below 5% at a sequence length of 250.
DCM-NJ can reconstruct the true tree at sequence lengths beyond 900.
31
Distance Methods
The goal is a phylogenetic tree T such that the distances between species in T approximate the distances in D.
A distance matrix D is a symmetric, non-negative matrix with zero diagonal.
We now describe some distance methods.
32
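Buneman
Input: a dissimilarity matrix d. Output: a tree T.
1. The topology on every four-leaf subset is inferred using the Four-Point Method (FPM):
Input: a 4×4 dissimilarity matrix on i, j, k, l.
Output:
If $d_{ij} + d_{kl} < \min\{d_{ik} + d_{jl}, d_{il} + d_{jk}\}$, then the topology ij | kl (i, j are separated from k, l by an edge e) is returned; the topologies ik | jl and il | jk are handled symmetrically.
If $d_{ij} + d_{kl} = \min\{d_{ik} + d_{jl}, d_{il} + d_{jk}\}$ (the smallest sum is not unique), then a star tree is returned.
[Figure: the star tree on i, j, k, l and the tree ij | kl with the internal edge e.]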
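The Four-Point Method can be sketched as follows. The function name `four_point` and both example matrices are hypothetical; the sketch returns `None` for the star tree and compares the sums with exact equality, which is only appropriate for exact inputs such as these integer distances.

```python
def four_point(d, i, j, k, l):
    """Return the pairing with the strictly smallest sum of distances,
    e.g. ((i, j), (k, l)) for the topology ij | kl, or None (star tree)
    when the smallest sum is not unique."""
    sums = {
        ((i, j), (k, l)): d[i][j] + d[k][l],
        ((i, k), (j, l)): d[i][k] + d[j][l],
        ((i, l), (j, k)): d[i][l] + d[j][k],
    }
    best = min(sums.values())
    winners = [pairing for pairing, total in sums.items() if total == best]
    return winners[0] if len(winners) == 1 else None

# Additive distances of the tree 01 | 23 with all edge weights equal to 1.
d_quartet = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]
# All leaves equidistant: no pairing is preferred.
d_star = [[0, 2, 2, 2],
          [2, 0, 2, 2],
          [2, 2, 0, 2],
          [2, 2, 2, 0]]

print(four_point(d_quartet, 0, 1, 2, 3))  # ((0, 1), (2, 3))
print(four_point(d_star, 0, 1, 2, 3))     # None
```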
33
Buneman (cont.)
Let Q be the set of four-leaf trees defined by the FPM. The Buneman tree is the maximally resolved tree T satisfying: for all quartets i, j, k, l, if T restricted to i, j, k, l induces a binary tree, then the tree in Q on i, j, k, l is the same binary tree.
Lemma 1: Let d be an input dissimilarity matrix and let T be the Buneman tree defined by d. Then C(T) is the set of splits (A, B) such that:
for all $\{a, a'\} \subseteq A$ and $\{b, b'\} \subseteq B$, the tree $a,a' \mid b,b'$ is in Q.
Complexity: polynomial time.
Example: A = {1,2,3}, B = {4,5}
Q: 1,2 | 4,5; 1,3 | 4,5; 2,3 | 4,5
C(T) = {(A,B)}
34
Neighbor-Joining
Input: a distance matrix d.
Output: an unrooted binary tree T.
Algorithm description:
For every pair of species, it determines a score based on the distance matrix.
At each step the algorithm joins the pair with the minimum score: it makes a subtree whose root replaces the two chosen species in the matrix.
The distances to this new node are recalculated.
This is repeated until only two nodes remain.
Finally, it connects the remaining two vertices with an edge.
Complexity: polynomial time, $O(n^3)$.
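The slide does not spell out the score. A common choice, assumed here, is $Q(i,j) = (n-2)\,d_{ij} - \sum_k d_{ik} - \sum_k d_{jk}$. The sketch below covers only the pair-selection step, not the full algorithm; the function name `nj_best_pair` and the 5×5 example matrix are hypothetical.

```python
def nj_best_pair(d):
    """Return ((i, j), score) for the pair minimizing the assumed
    neighbor-joining score (n - 2) * d[i][j] - r_i - r_j, where r_i is
    the sum of row i. Ties are broken by the first pair found."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_score = None, None
    for i in range(n):
        for j in range(i + 1, n):
            score = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best_score is None or score < best_score:
                best_pair, best_score = (i, j), score
    return best_pair, best_score

# Hypothetical distance matrix on five species (indices 0..4).
d_nj = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]

print(nj_best_pair(d_nj))  # ((0, 1), -50)
```

Note that the pair with the smallest raw distance, (3, 4), is not the pair selected: the score also accounts for how far each species is from all the others.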