building phylogenetic trees jurgen mourik & richard vogelaars utrecht university
Post on 20-Dec-2015
Building phylogenetic trees
Jurgen Mourik & Richard Vogelaars
Utrecht University
Overview
• Background
• Making a tree from pairwise distances;
• Parsimony;
  – <break>;
• Assessing the trees: the bootstrap;
• Simultaneous alignment and phylogeny;
• Application: PHYLIP
Background
• Phylogenetic tree: diagram showing evolutionary lineages of species/genes
• Trees are used:
  – To understand lineage of various species
  – To understand how various functions evolved
  – To inform multiple alignments
Phylogenetic tree approaches
• Distance:
  – UPGMA
  – Neighbour-joining
• Parsimony:
  – Traditional parsimony
  – Weighted parsimony
Making a tree from pairwise distances
• Given a set of sequences you want to build a tree.
• Compute the distances dij between each pair i, j of the sequences.
• There are many different distance measures.
• Average distance between pairs of sequences from each cluster.
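As an illustration of the step "compute the distances dij", here is a minimal sketch of two common distance measures for aligned sequences: the fraction of mismatching sites and the Jukes-Cantor correction of it. The slides do not say which measure they use, so the function names and the choice of measures are assumptions made for this example only.

```python
import math

def p_distance(a, b):
    """Fraction of aligned sites at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(a, b):
    """Jukes-Cantor distance -3/4 * ln(1 - 4f/3), with f the mismatch fraction."""
    f = p_distance(a, b)
    if f >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * f / 3)

def distance_matrix(seqs, dist=p_distance):
    """Full symmetric matrix of pairwise distances."""
    n = len(seqs)
    return [[dist(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

# The four sequences from the parsimony example later in the talk.
seqs = ["AAG", "AAA", "GGA", "AGA"]
for row in distance_matrix(seqs):
    print(["%.2f" % v for v in row])
```

The resulting matrix can be handed to any of the distance-based tree builders discussed next.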
UPGMA
• Unweighted Pair Group Method using arithmetic Averages.
• It works by clustering the sequences, at each stage combining two clusters and at the same time creating a new node in a tree, using a distance measure.
Distance between points
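$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
• |Ci| and |Cj| denote the number of sequences in clusters i and j.
[Figure: points i, j and l with distance labels 2, 3 and 4, and the worked example $d_{il} = \frac{1}{1 \cdot 1} \cdot 4 = 4$.]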
Distance between clusters
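• Let Ck be the union of clusters Ci and Cj; then
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
• where Cl is any other cluster.
[Figure: clusters i and j merged into k, at distances 4 and 3 from l; worked example $d_{kl} = \frac{4 \cdot 1 + 3 \cdot 1}{1 + 1} = \frac{7}{2} = 3.5$.]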
Building the tree: UPGMA
Initialisation:
Assign each sequence i to its own cluster Ci.
Define one leaf of T for each sequence, and place it at height zero.
Iteration:
Determine the two clusters i, j for which $d_{ij}$ is minimal.
Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l.
Define a node k with daughter nodes i and j, and place it at height $d_{ij}/2$.
Add k to the current clusters and remove i and j.
Termination:
When only two clusters i, j remain, place the root at height $d_{ij}/2$.
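The UPGMA steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the talk: the nested-tuple tree representation, the function name and the example distances are assumptions, and ties between equally close clusters are broken arbitrarily.

```python
def upgma(names, d):
    """Cluster by UPGMA; return (tree, root_height).

    names: list of leaf labels; d: symmetric matrix of pairwise distances.
    The tree is a nested tuple of leaf labels.
    """
    # cluster id -> (subtree, number of sequences, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the two closest clusters i < j.
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        # Distance from the new cluster to every remaining cluster l.
        new_dist = {}
        for l in clusters:
            dil = dist[(min(i, l), max(i, l))]
            djl = dist[(min(j, l), max(j, l))]
            new_dist[(l, next_id)] = (dil * ni + djl * nj) / (ni + nj)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        # The new node sits at height d_ij / 2.
        clusters[next_id] = ((ti, tj), ni + nj, dij / 2)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height

# Assumed ultrametric example: A,B are close, C,D are close.
names = ["A", "B", "C", "D"]
d = [[0, 2, 6, 6],
     [2, 0, 6, 6],
     [6, 6, 0, 4],
     [6, 6, 4, 0]]
tree, height = upgma(names, d)
print(tree, height)  # (('A', 'B'), ('C', 'D')) 3.0
```

The final merge of the last two clusters plays the role of the termination step, placing the root at half their distance.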
[Figures: worked UPGMA example in five slides: initialisation, iterations 1 to 3, and termination.]
Properties of UPGMA
• Molecular clock & ultrametric property of distances
• Additivity
Properties of UPGMA: Molecular clock & ultrametric
• The molecular clock assumption: divergence of sequences is assumed to occur at the same rate at all points in the tree.
• If this holds, the distances are said to be ultrametric.
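Whether a distance matrix is ultrametric can be checked with the three-point condition: in every triple of sequences, the two largest of the three pairwise distances must be equal. A minimal sketch, with an assumed function name and tolerance:

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple the two largest distances agree."""
    for i, j, k in combinations(range(len(d)), 3):
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True

clock_like = [[0, 2, 6, 6],
              [2, 0, 6, 6],
              [6, 6, 0, 4],
              [6, 6, 4, 0]]
not_clock_like = [[0, 2, 4],
                  [2, 0, 3],
                  [4, 3, 0]]
print(is_ultrametric(clock_like), is_ultrametric(not_clock_like))  # True False
```

When the check fails, UPGMA can still be run, but it is no longer guaranteed to reconstruct the correct tree.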
Properties of UPGMA: Additivity
• Given a tree, its edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them.
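[Figure: tree in which leaves i and j meet at an internal node k, which is connected to a further leaf m.]
$$d_{im} = d_{ik} + d_{km}$$
$$d_{jm} = d_{jk} + d_{km}$$
$$d_{ij} = d_{ik} + d_{jk}$$
$$d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$$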
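The expression for the distance from the internal node k to m follows from additivity alone. Assuming leaves i and j meet at k and m lies beyond k, the short derivation is:

```latex
\begin{align*}
d_{im} + d_{jm} - d_{ij}
  &= (d_{ik} + d_{km}) + (d_{jk} + d_{km}) - (d_{ik} + d_{jk}) \\
  &= 2\, d_{km}, \\
\text{hence}\quad d_{km} &= \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij}).
\end{align*}
```

This is the same update that neighbour-joining uses when it replaces a joined pair by a new internal node.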
Neighbour-joining
• N-j constructs a tree by iteratively joining subtrees (like UPGMA).
• Produces an unrooted tree.
• Doesn’t make the molecular clock assumption, so the distances need not be ultrametric.
Distances in Neighbour-joining
• Given a new internal node k, the distance to another node m is given by:
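$$d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$$
$$d_{ik} = \tfrac{1}{2}\,(d_{ij} + d_{im} - d_{jm})$$
$$d_{jk} = d_{ij} - d_{ik}$$
[Figure: leaves i and j joined at the internal node k, with a further node m.]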
Distances in Neighbour-joining
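• Generalizing this so that the distances to all other leaves are taken into account:
$$d_{ik} = \tfrac{1}{2}\,(d_{ij} + r_i - r_j)$$
• where
$$r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$$
• and |L| denotes the size of the set L of leaves.
[Figure: leaves i and j joined at the internal node k, with a further node m.]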
Building the tree: Neighbour-joining
Initialisation:
Define T to be the set of leaf nodes, one for each given sequence, and put L = T.
Iteration:
Pick a pair i, j in L for which $D_{ij}$, defined by $D_{ij} = d_{ij} - (r_i + r_j)$ with $r_i = \frac{1}{|L| - 2} \sum_{m \in L} d_{im}$, is minimal.
Define a new node k and set $d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$ for all m in L.
Add k to T with edges of lengths $d_{ik} = \tfrac{1}{2}\,(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining k to i and j, respectively.
Remove i and j from L and add k.
Termination:
When L consists of two leaves i and j, add the remaining edge between i and j, with length $d_{ij}$.
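A compact sketch of the neighbour-joining procedure follows. The edge-list output, the `node0`, `node1`, ... labels for internal nodes and the example distances are assumptions made for illustration; leaf names are assumed to be unique strings that do not clash with the internal labels.

```python
def neighbour_joining(names, d):
    """Return the unrooted tree as a list of edges (node, node, length)."""
    nodes = list(names)
    dist = {(a, b): d[i][j] for i, a in enumerate(names)
            for j, b in enumerate(names) if i != j}
    edges = []
    next_internal = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r_i: summed distance to all other nodes, divided by |L| - 2.
        r = {a: sum(dist[(a, m)] for m in nodes if m != a) / (n - 2)
             for a in nodes}
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        # Choose the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(pairs, key=lambda p: dist[p] - r[p[0]] - r[p[1]])
        k = f"node{next_internal}"
        next_internal += 1
        dik = 0.5 * (dist[(i, j)] + r[i] - r[j])
        djk = dist[(i, j)] - dik
        edges += [(k, i, dik), (k, j, djk)]
        for m in nodes:
            if m not in (i, j):
                dkm = 0.5 * (dist[(i, m)] + dist[(j, m)] - dist[(i, j)])
                dist[(k, m)] = dist[(m, k)] = dkm
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    i, j = nodes
    edges.append((i, j, dist[(i, j)]))
    return edges

# Assumed additive but not clock-like example: branch lengths A=1, B=4,
# C=2, D=3, and an internal edge of length 1 between the two pairs.
names = ["A", "B", "C", "D"]
d = [[0, 5, 4, 5],
     [5, 0, 7, 8],
     [4, 7, 0, 5],
     [5, 8, 5, 0]]
for edge in neighbour_joining(names, d):
    print(edge)
```

In this example the smallest raw distance is between A and C, which are not neighbours in the tree that generated the distances; the correction by r_i and r_j is what lets the method pair A with B instead.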
Rooting trees
• Finding a root in an unrooted tree is sometimes accomplished by using an outgroup:
  – A species known to be more distantly related to the remaining species than they are to each other.
• The point where the outgroup joins the rest of the tree is the best candidate for the root position.
[Figure: unrooted tree with nodes i, j, k, l and m and an outgroup; the candidate root is marked where the outgroup joins the tree.]
Comments on distance based methods
• If the given data is ultrametric (and these distances represent real distances), then UPGMA will identify the correct tree.
• If the data is additive (and these distances represent real distances), then Neighbour-joining will identify the correct tree.
• Otherwise, the methods may not recover the correct tree, but they may still be reasonable heuristics.
Phylogenetic tree approaches
• Distance:
  – UPGMA
  – Neighbour-joining
• Parsimony:
  – Traditional parsimony
  – Weighted parsimony
Parsimony
• Most widely used tree building algorithm(?).
• Finds the tree that explains the data with a minimal number of changes.
• Instead of building a tree, it assigns a cost to a given tree.
• Two components of the parsimony algorithm can be distinguished:
  – The computation of a cost for a given tree;
  – A search through all trees, to find the overall minimum of this cost.
Parsimony example
• Given the following sequences: AAG,AAA,GGA,AGA.
• Several trees could explain the phylogeny
Traditional Parsimony
• Count the number of substitutions
• At each node keep:
  – a list of minimal cost residues
  – the current cost
• Post-order traversal of the tree
Traditional Parsimony
Initialisation:
Set current cost C = 0 and k = 2n-1, the number of the root node.
Recursion: to obtain the set $R_k$:
If k is a leaf node:
  Set $R_k = x^k_u$.
If k is not a leaf node:
  Compute $R_i$, $R_j$ for the daughters i, j of k, and set $R_k = R_i \cap R_j$ if this intersection is not empty, or else set $R_k = R_i \cup R_j$ and increment C.
Termination:
Minimal cost of tree = C.
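The recursion above can be sketched as a short recursive function. Instead of a global counter C, this version returns the residue set together with the cost of the subtree, which amounts to the same count. The nested-tuple tree representation and the function names are assumptions; the sequences are the four from the parsimony example slide.

```python
def fitch_site(tree, u):
    """Return (R_k, cost) for alignment column u of the given subtree.

    A tree is either a leaf sequence (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree[u]}, 0
    (ri, ci), (rj, cj) = fitch_site(tree[0], u), fitch_site(tree[1], u)
    common = ri & rj
    if common:
        return common, ci + cj
    return ri | rj, ci + cj + 1  # empty intersection: one more substitution

def parsimony_cost(tree):
    """Minimal number of substitutions, summed over all columns."""
    leaf = tree
    while not isinstance(leaf, str):
        leaf = leaf[0]
    return sum(fitch_site(tree, u)[1] for u in range(len(leaf)))

tree_1 = (("AAG", "AAA"), ("GGA", "AGA"))
tree_2 = (("AAG", "GGA"), ("AAA", "AGA"))
print(parsimony_cost(tree_1), parsimony_cost(tree_2))  # 3 4
```

Under this count the first grouping explains the four sequences with one substitution fewer than the second.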
Weighted Parsimony
• Extension of the traditional parsimony.
• Adds a cost function S(a,b) for each substitution of a by b.
• Post-order traversal of the tree
• Aim is now to minimize the cost.
Weighted Parsimony
Initialisation:
Set k = 2n-1, the number of the root node.
Recursion: compute $S_k(a)$ for all a as follows:
If k is a leaf node:
  Set $S_k(a) = 0$ for $a = x^k_u$, $S_k(a) = \infty$ otherwise.
If k is not a leaf node:
  Compute $S_i(a)$, $S_j(a)$ for all a at the daughters i, j and define
  $$S_k(a) = \min_b \bigl(S_i(b) + S(a,b)\bigr) + \min_b \bigl(S_j(b) + S(a,b)\bigr)$$
Termination:
Minimal cost of tree = $\min_a S_{2n-1}(a)$.
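A sketch of the weighted parsimony recursion is given below. The tree representation (nested tuples of leaf sequences), the function names and the unit cost function are assumptions for the example; with a cost of 1 per substitution the result should agree with the plain substitution count of traditional parsimony.

```python
import math

def sankoff_site(tree, u, alphabet, S):
    """Return {a: S_k(a)} for alignment column u of the given subtree."""
    if isinstance(tree, str):
        return {a: (0 if a == tree[u] else math.inf) for a in alphabet}
    si = sankoff_site(tree[0], u, alphabet, S)
    sj = sankoff_site(tree[1], u, alphabet, S)
    return {a: min(si[b] + S(a, b) for b in alphabet)
               + min(sj[b] + S(a, b) for b in alphabet)
            for a in alphabet}

def weighted_parsimony_cost(tree, length, alphabet, S):
    """Sum over columns of min_a S_root(a)."""
    return sum(min(sankoff_site(tree, u, alphabet, S).values())
               for u in range(length))

def unit_cost(a, b):
    """Assumed cost function: every substitution costs 1."""
    return 0 if a == b else 1

tree_1 = (("AAG", "AAA"), ("GGA", "AGA"))
tree_2 = (("AAG", "GGA"), ("AAA", "AGA"))
print(weighted_parsimony_cost(tree_1, 3, "ACGT", unit_cost),
      weighted_parsimony_cost(tree_2, 3, "ACGT", unit_cost))  # 3 4
```

Replacing `unit_cost` by a function that, for example, charges transversions more than transitions changes the costs without changing the recursion.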
Break
• Questions so far?
• After the break:
  – Assessing the trees: the bootstrap;
  – Simultaneous alignment and phylogeny;
  – Application: PHYLIP
Branch and bound
• Parsimony itself cannot build a tree!
• Using simple enumeration methods, the number of trees becomes very large very fast.
• How to build the trees?
  – Stochastically
  – Branch and bound
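To give a feeling for how fast the number of trees grows: for n leaves there are (2n-5)!! unrooted and (2n-3)!! rooted binary trees, where !! is the double factorial. A small sketch with assumed function names:

```python
def double_factorial(m):
    """m!! = m * (m-2) * (m-4) * ... down to 1 (1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def count_unrooted(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves."""
    return double_factorial(2 * n - 5)

def count_rooted(n):
    """Number of rooted binary trees on n >= 2 labelled leaves."""
    return double_factorial(2 * n - 3)

for n in (4, 10, 20):
    print(n, count_unrooted(n), count_rooted(n))
```

Already for ten sequences there are about two million unrooted trees, which is why plain enumeration is only feasible for very small datasets.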
Branch and bound
• B&B uses the parsimony algorithm.
• It is guaranteed to find the overall best tree.
• It systematically builds trees by increasing the number of leaves.
• It abandons a particular avenue of tree building whenever the current incomplete tree (T*) has cost(T*) > cost(Tmin), the lowest cost found so far for a complete tree.
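A minimal branch-and-bound sketch is shown below. It grows trees one leaf at a time and drops a partial tree as soon as its parsimony cost exceeds the best complete tree found so far; this pruning is safe because adding a leaf can never lower the substitution count. The rooted nested-tuple representation (which visits some unrooted topologies more than once), the helper names and the insertion order are assumptions for the example, not the scheme used by any particular package.

```python
import math

def fitch_cost(tree):
    """Minimal number of substitutions for a tree of equal-length leaf strings."""
    def site(t, u):
        if isinstance(t, str):
            return {t[u]}, 0
        (ri, ci), (rj, cj) = site(t[0], u), site(t[1], u)
        return (ri & rj, ci + cj) if ri & rj else (ri | rj, ci + cj + 1)
    leaf = tree
    while not isinstance(leaf, str):
        leaf = leaf[0]
    return sum(site(tree, u)[1] for u in range(len(leaf)))

def insertions(tree, leaf):
    """Yield every tree obtained by attaching leaf above some node of tree."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)

def branch_and_bound(seqs):
    """Return (cost, tree) of a most parsimonious tree for >= 2 sequences."""
    best = {"cost": math.inf, "tree": None}

    def extend(tree, remaining):
        cost = fitch_cost(tree)
        if cost > best["cost"]:
            return  # bound: this avenue cannot beat the best complete tree
        if not remaining:
            best["cost"], best["tree"] = cost, tree
            return
        for bigger in insertions(tree, remaining[0]):
            extend(bigger, remaining[1:])

    extend((seqs[0], seqs[1]), list(seqs[2:]))
    return best["cost"], best["tree"]

cost, tree = branch_and_bound(["AAG", "AAA", "GGA", "AGA"])
print(cost, tree)
```

Because the bound uses a strict inequality, trees that tie with the current best are still explored, so the search could be extended to collect all most parsimonious trees.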
The Bootstrap
• A measure of how much a tree should be trusted.
• Use the bootstrap as a method of assessing the significance of some phylogenetic feature.
The Bootstrap (2)
• The bootstrap works as follows:
  – Given a dataset consisting of an alignment of sequences.
  – Generate an artificial dataset of the same size as the original dataset by picking columns from the alignment at random with replacement.
  – Apply the tree building algorithm to this artificial dataset.
  – Repeat the selection and tree building procedure n times.
  – The frequency with which a chosen phylogenetic feature appears is taken to be a measure of the confidence we can have in this feature.
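The resampling procedure can be sketched as follows. The tree builder is passed in as a function, and `closest_pair` is only a hypothetical stand-in for a real tree-building algorithm so that the example runs on its own; the function names, the seed handling and the toy alignment are all assumptions.

```python
import random
from itertools import combinations

def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the same size."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

def bootstrap_support(alignment, build_tree, feature, n=100, seed=0):
    """Fraction of n bootstrap replicates whose tree shows the feature."""
    rng = random.Random(seed)
    hits = sum(bool(feature(build_tree(bootstrap_columns(alignment, rng))))
               for _ in range(n))
    return hits / n

def closest_pair(alignment):
    """Stand-in for a tree builder: the two rows with the fewest mismatches."""
    def mismatches(pair):
        a, b = alignment[pair[0]], alignment[pair[1]]
        return sum(x != y for x, y in zip(a, b))
    return min(combinations(range(len(alignment)), 2), key=mismatches)

alignment = ["AACGUGGCC", "AACGUGGCA", "CAUUUCGUC", "GGUAUCUCG"]
support = bootstrap_support(alignment, closest_pair,
                            feature=lambda pair: pair == (0, 1), n=200)
print(support)
```

With a real tree builder the feature would typically be the presence of a particular group of leaves as a subtree.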
Simultaneous alignment and phylogeny
• Simultaneously aligning sequences and finding a plausible phylogeny:
  – Sankoff & Cedergren’s gap-substitution algorithm;
  – Hein’s affine cost algorithm.
• Both find an optimal alignment given a tree.
Sankoff & Cedergren’s gap-substitution algorithm
• Guaranteed to find ancestral sequences, and alignments of them and the leaf sequences.
• It uses a character-substitution model of gaps.
• Together these minimize a tree-based parsimony-type cost.
• The algorithm is a combination of two known methods:
  – Dynamic programming method (Chapter 6);
  – Weighted parsimony algorithm.
Hein’s affine cost algorithm
• It uses affine gap penalties.
• Faster than the Sankoff & Cedergren algorithm.
• The aim is to find sequences z at a given node aligned to both of the sequences x and y at the daughter nodes satisfying:
$$S(x,z) + S(z,y) = S(x,y)$$
• where S is the total cost for a given alignment of two sequences (a mismatch costs 1, a match costs 0).
Hein’s affine cost algorithm
• Compared to equation (2.16) (alignment with affine gap scores), here the algorithm searches for the minimal cost path.
• The affine gap cost for a gap of length k is d + (k-1)e, where e ≤ d.
$$V^M(i,j) = \min \begin{cases} V^M(i-1,j-1) + S(x_i,y_j) \\ V^X(i-1,j-1) + S(x_i,y_j) \\ V^Y(i-1,j-1) + S(x_i,y_j) \end{cases}$$
$$V^X(i,j) = \min \begin{cases} V^M(i-1,j) + d \\ V^X(i-1,j) + e \end{cases}$$
$$V^Y(i,j) = \min \begin{cases} V^M(i,j-1) + d \\ V^Y(i,j-1) + e \end{cases}$$
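The three recurrences can be sketched directly as a dynamic programme. The function name and the default values d = 2, e = 1 and mismatch cost 1 are taken as assumptions for the example; only the minimal cost is returned, not the alignment itself.

```python
import math

def affine_cost(x, y, d=2, e=1, mismatch=1):
    """Minimal alignment cost of x and y; a gap of length k costs d + (k-1)*e."""
    n, m = len(x), len(y)
    INF = math.inf
    VM = [[INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned to y_j
    VX = [[INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    VY = [[INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    VM[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                s = 0 if x[i - 1] == y[j - 1] else mismatch
                VM[i][j] = min(VM[i - 1][j - 1], VX[i - 1][j - 1],
                               VY[i - 1][j - 1]) + s
            if i > 0:
                VX[i][j] = min(VM[i - 1][j] + d, VX[i - 1][j] + e)
            if j > 0:
                VY[i][j] = min(VM[i][j - 1] + d, VY[i][j - 1] + e)
    return min(VM[n][m], VX[n][m], VY[n][m])

x, y = "CAC", "CTCACA"
print("S(x,y) =", affine_cost(x, y))
for z in ("CAC", "CACACA", "CACAC"):
    print(z, affine_cost(x, z) + affine_cost(z, y))
```

With d = 2 and e = 1 the candidates CAC and CACACA reach the same total as S(x,y), while CACAC comes out one unit higher because it splits the gap into two separately opened gaps.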
Dynamic programming matrix for two sequences
[Figure: dynamic programming matrices $V^M$, $V^X$ and $V^Y$ for two sequences, indexed by i and j, computed with d = 2 and e = 1.]
Hein’s affine cost algorithm
• Find the z that satisfies $S(x,z) + S(z,y) = S(x,y)$, so that the total cost is minimal.
• From the matrix follows:
  – C - - A C -
  – C A C - - -
• CAC could be a possible z.
[Figure: candidate ancestor CAC(?) above the leaves CAC and CTCACA.]
Hein’s affine cost algorithm
[Figure: three candidate ancestors, CAC(?), CACACA(?) and CACAC(?), each drawn above the leaves CAC and CTCACA.]
Which z could serve best as ancestor?
Hein’s affine cost algorithm
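Candidate ancestor CAC:
$$S(\mathrm{CAC},\mathrm{CAC}) = 0, \qquad S(\mathrm{CAC},\mathrm{CTCACA}) = d + 2e + 1$$
Candidate ancestor CACACA:
$$S(\mathrm{CACACA},\mathrm{CAC}) = d + 2e, \qquad S(\mathrm{CACACA},\mathrm{CTCACA}) = 1$$
Candidate ancestor CACAC:
$$S(\mathrm{CACAC},\mathrm{CAC}) = d + e, \qquad S(\mathrm{CACAC},\mathrm{CTCACA}) = d + 1$$
Each pair of costs is compared with the cost between the two leaves:
$$S(\mathrm{CAC},\mathrm{CTCACA}) = d + 2e + 1$$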
Sequence graph
• Follow a path through the dynamic programming matrix.
• Derive a graph from this matrix.
• Whenever a cell is used by an optimal path a vertex is added to the graph.
Building phylogenetic trees45
Sequence graph
Graph 1
Building phylogenetic trees46
Sequence graph:line arrangement
Graph 1
Graph 2
Building phylogenetic trees47
Sequence graph:replacing the dummy edges
Graph 2
Graph 3
Building phylogenetic trees48
Dynamic Programming matrix:TAC – Graph 3
Building phylogenetic trees49
Ancestors
• Possible ancestral sequences for the leaf sequences TAC, CAC and CTCACA given the tree shown.
• Derived from the sequence graphs.
[Figure: tree over the leaves TAC, CAC and CTCACA, with CAC shown as ancestral sequence and edge labels 1 and 5.]
Limitations of Hein’s model
• Hein’s algorithm takes the minimal cost sequences at each node upward.
• This can fail to give the overall optimum.
• Suppose the cost for a gap of length k is:
  – 13+3(k-1)
• Mismatch:
  – 4
• Suppose the leaves are G and GTT.
Limitations of Hein’s model
• The eligible ancestors of G and GTT would be G and GTT themselves, since each gives a total cost of 13+3=16.
• GT would not be eligible, because its total cost is 2*13=26.
• Now suppose we move up to the parent of this ancestor, where a third leaf GT joins:
  – The total cost for the ineligible GT would be lower than for either G or GTT.
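The arithmetic behind this counterexample can be written out as follows, under the simplifying assumption that the same candidate sequence z is compared directly with all three leaves G, GTT and GT (gap of length k costs 13 + 3(k-1), and no mismatches occur):

```latex
\begin{align*}
z = \mathrm{G}:   \quad & S(\mathrm{G},\mathrm{G}) + S(\mathrm{G},\mathrm{GTT}) + S(\mathrm{G},\mathrm{GT})
                        = 0 + 16 + 13 = 29 \\
z = \mathrm{GTT}: \quad & S(\mathrm{GTT},\mathrm{G}) + S(\mathrm{GTT},\mathrm{GTT}) + S(\mathrm{GTT},\mathrm{GT})
                        = 16 + 0 + 13 = 29 \\
z = \mathrm{GT}:  \quad & S(\mathrm{GT},\mathrm{G}) + S(\mathrm{GT},\mathrm{GTT}) + S(\mathrm{GT},\mathrm{GT})
                        = 13 + 13 + 0 = 26
\end{align*}
```

So the sequence that was rejected at the lower node (26 against 16 there) would have given the cheaper tree overall once the third leaf is taken into account.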
Application: PHYLIP (Phylogeny Inference Package)
• Many features, among them:
  – Traditional (unrooted) parsimony
  – Branch and bound to find all most parsimonious trees
Application: PHYLIP
• Test dataset:
  Jurgen   AACGUGGCCAAAU
  Alpha    ACCGCCGCCAAAU
  Beta     AAGGUCGCCAAAC
  Gamma    CAUUUCGUCACAA
  Delta    GGUAUCUCGGCCU
  Epsilon  GAAAUCUCGAUCC
  Richard  GGGCUCUCGGCUC
Demo
Questions?