Building phylogenetic trees
Contents
Phylogeny
Phylogenetic trees
How to make a phylogenetic tree from pairwise distances
UPGMA method (+ an example)
Neighbor-Joining method (+ an example)
Comparison of methods
Conclusion
Phylogeny
Phylogeny is the evolutionary history of related species or genes
Phylogenetic tree: a diagram showing the evolutionary lineages of species or genes
The history of genes and the history of species may be very different
Genes can be homologous or analogous and still resemble each other
Homologous sequences can be divided into two groups:
  Orthologous sequences diverged by speciation from a common ancestor
  Paralogous sequences evolved by gene duplication within a species
Analogous sequences may look and function very similarly, but they do not have a common ancestor
WHEN WE WANT TO EXPLORE EVOLUTIONARY RELATIONSHIPS, WE NEED TO HANDLE ORTHOLOGOUS SEQUENCES
[Diagram: genes are either homologous or analogous; homologous genes are either orthologous or paralogous]
Phylogenetic trees
WHY construct a phylogenetic tree?
  to understand the lineage of various species
  to understand how various functions evolved
  to inform multiple alignments
Trees can be rooted (a common ancestor is known) or unrooted
Leaves are the terminal nodes that correspond to the observed sequences of genes or species (A, B, C, D)
Internal nodes are hypothetical ancestral nodes
All trees will be assumed to be binary, meaning that an edge that branches splits into two daughter edges
Each edge has a certain amount of evolutionary divergence associated with it, defined by some measure of distance between sequences or by a model of substitution of residues over the course of evolution
Phylogenetic trees
Different ways to represent a phylogenetic tree (illustrated by Treeview)
[Figure: the same tree of HRV sequences (HRV1A to HRV100, scale bar 0.1) drawn in several Treeview layouts]
Different algorithms used to infer phylogeny from sequence data
1. Distance methods
2. Parsimony
3. Likelihood
4. Probabilistic methods
5. Phylogenetic invariants
Route from the molecular sequences to the phylogenetic tree
Distance methods:
Select a set of related (orthologous) nucleotide or amino acid sequences
Perform multiple sequence alignment (Clustal series widely used)
Calculate pairwise distances between the sequences using a chosen evolutionary model of substitution (distances between sequences describe their evolution: the smaller the distance, the more closely the sequences are related)
Select the most suitable algorithm to infer phylogeny
View the tree with a suitable program (Treeview, NJPlot, ...)
Hamming Distance
Making a tree from pairwise distances
Distances $d_{ij}$ between each pair of sequences i and j in the given dataset are calculated
Different ways of defining distances
For nucleotide sequences: Jukes-Cantor, Kimura 2-parameter (K2P), HKY (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general time-reversible model, general 12-parameter model
For amino acid sequences: PAM matrices, BLOSUM matrices
A B C D
A 0 32 44 46
B 32 0 29 43
C 44 29 0 30
D 46 43 30 0
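As a rough illustration of how such a pairwise distance matrix could be obtained from aligned sequences, the sketch below counts mismatches (Hamming distance) and applies the Jukes-Cantor correction $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$, where p is the fraction of differing sites. The function names and the three short sequences are hypothetical and are not taken from the slides; gaps and ambiguous bases are not handled.

```python
import math

def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance for two aligned nucleotide sequences."""
    p = hamming_distance(seq_a, seq_b) / len(seq_a)  # fraction of differing sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs, dist_fn):
    """Full symmetric matrix of dist_fn over all pairs of sequences."""
    return [[dist_fn(a, b) for b in seqs] for a in seqs]

# Hypothetical aligned sequences, for illustration only.
example_seqs = ["ACGTACGT", "ACGTTCGA", "ACGTACGA"]
hamming_matrix = distance_matrix(example_seqs, hamming_distance)
jc_matrix = distance_matrix(example_seqs, jukes_cantor_distance)
```

The corrected distance is always at least as large as the raw fraction p, because it accounts for multiple substitutions at the same site.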
Distance matrix methods
UPGMA
Algorithm introduced by Sokal and Michener 1958
Neighbor-Joining
Algorithm introduced by Saitou and Nei 1987
Modified by Studier and Keppler 1988
Clustering method: UPGMA
UPGMA = Unweighted pair group method using arithmetic averages
Simple method
It works by clustering the sequences, at each stage connecting two clusters and creating a new node on the tree
The method assumes an equal rate of evolutionary change along all branches (molecular clock assumption)
UPGMA
UPGMA produces a rooted tree
Branch lengths satisfy a molecular clock
The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
Trees that are clocklike are rooted, and the total branch length from the root to any leaf is equal
Such trees are often referred to as ultrametric
A distance measure is ultrametric if, for any three sequences i, j and k, either all three distances are equal ($d_{ij} = d_{ik} = d_{jk}$) or two of them are equal and the third is smaller ($d_{jk} < d_{ij} = d_{ik}$)
UPGMA is guaranteed to build the correct tree if the distances are ultrametric
The method can be used for reconstructing phylogenies if evolutionary rates are assumed to be the same in all lineages; this assumption has drawn criticism in the phylogeny literature
Suitable for closely related species
Running time $O(n^2)$
[Figure: rooted tree with leaves A, C, B and D]
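The ultrametric condition can be checked mechanically: for every triplet of sequences, the two largest of the three pairwise distances must be equal. The sketch below is one possible way to test this; the function name and the clocklike matrix are made up for illustration, while the second matrix is the 4x4 matrix used in the UPGMA example of these slides.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Return True if every triplet satisfies the three-point (ultrametric) condition."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        # Either all three are equal, or two are equal and the third is smaller:
        # in both cases the two largest values coincide.
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical clocklike distances: (A,B) split at height 1, (C,D) at height 2, root at 3.
clocklike = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]

# Matrix from the UPGMA worked example (A, B, C, D).
slide_example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]

clocklike_ok = is_ultrametric(clocklike)        # True
slide_example_ok = is_ultrametric(slide_example)  # False
```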
Algorithm: UPGMA
Initialisation:
Assign each sequence i in the dataset to its own cluster $C_i$
Define one leaf of T for each sequence, and place it at height zero
Iteration:
Find the two clusters i and j for which $d_{ij}$ is the smallest (pick randomly if several distances are equal)
Define a new cluster (ij) by $C_{ij} = C_i \cup C_j$. Cluster (ij) has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$)
Connect i and j on the tree to a new node v
Place the new node v at height $d_{ij}/2$; the branch lengths from v to i and j are this height minus the heights of i and j
Algorithm: UPGMA (cont.)
Iteration (cont.)
Compute the distances between the new cluster and each remaining cluster k by using
$$d_{(ij),k} = \frac{n_i}{n_i + n_j}\, d_{ik} + \frac{n_j}{n_i + n_j}\, d_{jk}$$
Add (ij) to the current clusters and remove i and j
Termination:
When only two clusters i and j remain, place the root at height $d_{ij}/2$
An example UPGMA (1)
Distance matrix (arbitrary) for four items (sequences) A, B, C and D
These distances are not ultrametric: for A, B and C, for example, the three distances $d_{AB} = 8$, $d_{AC} = 7$ and $d_{BC} = 9$ are all different, so it is neither the case that all three are equal nor that two are equal and the third is smaller
A B C D
A 0 8 7 12
B 8 0 9 14
C 7 9 0 11
D 12 14 11 0
Step 1. Find the smallest distance $d_{ij}$ between two clusters: here A and C, with $d_{AC} = 7$
An example UPGMA (2)
Step 2. Define the new cluster (ij), which has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$)
New cluster AC: $n_{AC} = n_A + n_C = 2$
Step 3. Connect A and C on the tree to a new node v1
Step 4. The branch lengths from the new node v1 to A and C are
$$\frac{d_{AC}}{2} = \frac{7}{2} = 3.5$$
[Figure: node v1 joined to leaves A and C, each branch of length 3.5]
     A    B    C    D
A    0    8    7    12
B         0    9    14
C              0    11
D                   0
An example UPGMA (3)
Step 5. Compute the distances between the new cluster AC and the remaining clusters (B and D):
$$d_{(AC),B} = \frac{n_A}{n_A + n_C}\, d_{AB} + \frac{n_C}{n_A + n_C}\, d_{CB} = \frac{1}{2} \cdot 8 + \frac{1}{2} \cdot 9 = 8.5$$
$$d_{(AC),D} = \frac{n_A}{n_A + n_C}\, d_{AD} + \frac{n_C}{n_A + n_C}\, d_{CD} = \frac{1}{2} \cdot 12 + \frac{1}{2} \cdot 11 = 11.5$$
Step 6. Delete the columns and rows of the distance matrix that correspond to clusters A and C, and add a column and a row for cluster AC
New distance matrix
      AC    B     D
AC    0     8.5   11.5
B           0     14
D                 0
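An example UPGMA (4)
      AC    B     D
AC    0     8.5   11.5
B           0     14
D                 0
2nd iteration process
Step 1. Find the two clusters i and j for which $d_{ij}$ is the smallest (pick randomly if several distances are equal): AC and B, with $d_{(AC),B} = 8.5$
Step 2. Define the new cluster (ij), which has $n_{ij} = n_i + n_j$ members
New cluster ACB: $n_{ACB} = n_{AC} + n_B = 2 + 1 = 3$
Step 3. Connect AC and B on the tree to a new node v2
Step 4. The new node v2 is placed at height
$$\frac{d_{(AC),B}}{2} = \frac{8.5}{2} = 4.25$$
so the branch from v2 to B has length 4.25 and the branch from v2 to v1 has length 4.25 - 3.5 = 0.75
[Figure: v1 joined to A and C (branches of length 3.5); v2 joined to v1 and to B (branch of length 4.25)]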
An example UPGMA (5)
Step 5. Compute the distance between the new cluster ACB and the remaining cluster (D)
$$d_{(ACB),D} = \frac{n_{AC}}{n_{AC} + n_B}\, d_{(AC),D} + \frac{n_B}{n_{AC} + n_B}\, d_{BD} = \frac{2}{3} \cdot 11.5 + \frac{1}{3} \cdot 14 = 12.33$$
Step 6. Delete the columns and rows of the distance matrix that correspond to clusters AC and B, and add a column and a row for cluster ACB
New distance matrix
       ACB   D
ACB    0     12.33
D            0
An example UPGMA (6)
Termination: Only two clusters (ACB and D) remain
Place the root at height
$$\frac{d_{(ACB),D}}{2} = \frac{12.33}{2} = 6.17$$
       ACB   D
ACB    0     12.33
D            0
Original distance matrix and final phylogenetic tree (including the branch lengths)
     A    B    C    D
A    0    8    7    12
B         0    9    14
C              0    11
D                   0
[Figure: final rooted UPGMA tree. Branch lengths: A 3.5, C 3.5, v1 to v2 0.75, B 4.25, v2 to root 1.92, D 6.17]
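The worked example above can be reproduced with a short program. What follows is a minimal sketch of UPGMA under the update rule used in these slides, not a reference implementation: the function name, the returned tuple layout and the deterministic tie-breaking (first minimal pair instead of a random one) are choices made here for illustration.

```python
def upgma(names, dist):
    """Cluster by UPGMA; return a list of merges as
    (label_i, label_j, node_height, branch_to_i, branch_to_j)."""
    # Each cluster id maps to (label, number of members, height of its node).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): float(dist[i][j])
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair (first one found on ties)
        height = d[(i, j)] / 2.0
        label_i, n_i, h_i = clusters[i]
        label_j, n_j, h_j = clusters[j]
        merges.append((label_i, label_j, height, height - h_i, height - h_j))
        # Size-weighted average distance from the merged cluster to every other cluster.
        new_d = {}
        for k in clusters:
            if k in (i, j):
                continue
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop all entries involving i or j, then add the new cluster.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        del clusters[i], clusters[j]
        clusters[next_id] = (label_i + label_j, n_i + n_j, height)
        next_id += 1
    return merges

example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
for label_i, label_j, height, len_i, len_j in upgma(["A", "B", "C", "D"], example):
    print(f"join {label_i} + {label_j} at height {height:.2f} "
          f"(branches {len_i:.2f}, {len_j:.2f})")
```

On the example matrix the node heights come out as 3.5, 4.25 and about 6.17, matching the tree drawn on the slide; cluster labels are simply concatenated in merge order, so the three-member cluster is labelled BAC rather than ACB.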
Neighbor-Joining (N-J)
Another algorithm that works by clustering the sequences
Does not assume a molecular clock
N-J trees are unrooted
N-J assumes additivity
Def. Edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
The method uses an approximate algorithm, in which the tree is built by repeatedly finding a pair of neighboring leaves i and j that minimizes the total length of the tree and joining them
Running time $O(n^3)$ for the standard algorithm
[Figure: unrooted tree with leaves A, B, C and D]
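Whether a distance matrix is additive can be tested without building a tree, using the standard four-point condition: for every four leaves, the two largest of the three sums $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$, $d_{il} + d_{jk}$ must be equal. The four-point condition is not stated on the slides and is brought in here as a well-known equivalent test; the function name is hypothetical. Both matrices are taken from the slides.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Return True if every quartet of leaves satisfies the four-point condition."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]))
        # The two largest sums must coincide for a tree metric.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Matrix used in the UPGMA and N-J worked examples (A, B, C, D).
worked_example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]

# Matrix shown on the "Making a tree from pairwise distances" slide.
pairwise_slide = [
    [0, 32, 44, 46],
    [32, 0, 29, 43],
    [44, 29, 0, 30],
    [46, 43, 30, 0],
]

worked_example_additive = is_additive(worked_example)  # sums are 19, 21, 21
pairwise_slide_additive = is_additive(pairwise_slide)  # sums are 62, 75, 87
```

The worked-example matrix passes the test, which is why N-J can reproduce its distances exactly, while the first matrix does not.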
Algorithm: Neighbor-Joining
Initialisation:
Define T to be the set of leaf nodes, one for each given sequence
Iteration:
Compute, for each sequence i,
$$u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2}$$
where n is the number of sequences in the distance matrix
Pick a pair i and j for which $d_{ij} - u_i - u_j$ is the smallest (pick randomly if several are equal)
Join items i and j with a new node v
Compute the branch lengths from the new node v to items i and j
Compute the distances between the new node v and the remaining items
Remove i and j from the distance matrix and replace them by the new node v
Termination:
When only two items i and j remain, add the remaining edge between i and j, with length $d_{ij}$
An example N-J (1)
Step 1. Compute $u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2}$ for each row in the distance matrix
      A    B    C    D    Step 1: u_i
A     0    8    7    12   (8+7+12)/(4-2) = 13.5
B     8    0    9    14   (8+9+14)/(4-2) = 15.5
C     7    9    0    11   (7+9+11)/(4-2) = 13.5
D     12   14   11   0    (12+14+11)/(4-2) = 18.5
Step 2. Compute $d_{ij} - (u_i + u_j)$ (the lower-diagonal matrix) and choose the smallest (most negative)
      A                      B                      C                      D
A     0                      8                      7                      12
B     8-(13.5+15.5) = -21    0                      9                      14
C     7-(13.5+13.5) = -20    9-(15.5+13.5) = -20    0                      11
D     12-(13.5+18.5) = -20   14-(15.5+18.5) = -20   11-(13.5+18.5) = -21   0
The smallest value, -21, is shared by the pairs A-B and C-D; A and B are chosen
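Steps 1 and 2 are simple enough to check by machine. The sketch below computes the row terms $u_i$ and the values $d_{ij} - (u_i + u_j)$ for the lower triangle of the example matrix; the function name and the dictionary layout are illustrative choices, not part of the slides.

```python
def nj_criterion(dist):
    """Return (u, q): u[i] is the row sum divided by n - 2, and
    q[(i, j)] = d_ij - (u_i + u_j) for every pair with j < i."""
    n = len(dist)
    # The diagonal is zero, so summing the whole row equals summing over j != i.
    u = [sum(row) / (n - 2) for row in dist]
    q = {(i, j): dist[i][j] - (u[i] + u[j])
         for i in range(n) for j in range(i)}
    return u, q

example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
u, q = nj_criterion(example)
smallest = min(q.values())
best_pairs = sorted(pair for pair, value in q.items() if value == smallest)
print(u)           # row terms for A, B, C, D
print(best_pairs)  # index pairs sharing the smallest criterion value
```

With indices 0 to 3 standing for A to D, the pairs (1, 0) and (3, 2) tie for the minimum, which corresponds to A-B and C-D in the table above.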
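An example N-J (2)
Step 3. Join A and B together with a new node v1. Compute the edge lengths from A to node v1 and from B to node v1
$$v_A = \frac{d_{AB}}{2} + \frac{u_A - u_B}{2} = \frac{8}{2} + \frac{13.5 - 15.5}{2} = 3$$
$$v_B = \frac{d_{AB}}{2} + \frac{u_B - u_A}{2} = \frac{8}{2} + \frac{15.5 - 13.5}{2} = 5$$
Step 4. Compute distances between the new node v1 and remaining items (C and D)
$$d_{(AB),C} = \frac{d_{AC} + d_{BC} - d_{AB}}{2} = \frac{7 + 9 - 8}{2} = 4$$
$$d_{(AB),D} = \frac{d_{AD} + d_{BD} - d_{AB}}{2} = \frac{12 + 14 - 8}{2} = 9$$
[Figure: node v1 joined to A (edge length 3) and B (edge length 5)]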
An example N-J (3)
Step 5. Delete A and B from the distance matrix and replace them by new item AB
Step 6. Continue from step 1, because more than two items remain
New reduced distance matrix
Step 1. Compute $u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2}$ for each row in the distance matrix
      AB   C    D    Step 1: u_i
AB    0    4    9    (4+9)/(3-2) = 13
C     4    0    11   (4+11)/(3-2) = 15
D     9    11   0    (9+11)/(3-2) = 20
Step 2. Compute $d_{ij} - (u_i + u_j)$ (the lower-diagonal matrix) and choose the smallest
      AB                 C                   D
AB    0                  4                   9
C     4-(13+15) = -24    0                   11
D     9-(13+20) = -24    11-(15+20) = -24    0
All three pairs tie at -24; AB (node v1) and C are chosen
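An example N-J (4)
Step 3. Join v1 and C together with a new node v2. Compute the edge lengths from v1 to node v2 and from C to node v2
$$v_{AB} = \frac{d_{(AB),C}}{2} + \frac{u_{AB} - u_C}{2} = \frac{4}{2} + \frac{13 - 15}{2} = 1$$
$$v_C = \frac{d_{(AB),C}}{2} + \frac{u_C - u_{AB}}{2} = \frac{4}{2} + \frac{15 - 13}{2} = 3$$
Step 4. Compute the distance between the new node v2 and the remaining item (D)
$$d_{(ABC),D} = \frac{d_{(AB),D} + d_{CD} - d_{(AB),C}}{2} = \frac{9 + 11 - 4}{2} = 8$$
      AB   C    D    Step 1: u_i
AB    0    4    9    (4+9)/(3-2) = 13
C     4    0    11   (4+11)/(3-2) = 15
D     9    11   0    (9+11)/(3-2) = 20
[Figure: v1 joined to A (edge length 3) and B (edge length 5); v2 joined to v1 (edge length 1) and C (edge length 3)]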
An example N-J (5)
Step 5. Delete AB and C from the distance matrix and replace them by ABC
Step 6. Only two nodes remain: connect them with an edge of length $d_{(ABC),D} = 8$
       ABC   D
ABC    0     8
D            0
Original distance matrix and final phylogenetic tree (including the edge lengths)
     A    B    C    D
A    0    8    7    12
B         0    9    14
C              0    11
D                   0
[Figure: final unrooted N-J tree. Edge lengths: A to v1 3, B to v1 5, v1 to v2 1, C to v2 3, D to v2 8]
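The whole N-J example can be replayed with a compact program. This is a minimal sketch of the procedure as described in these slides, assuming a symmetric matrix with a zero diagonal; the function name, the node names v1, v2, ... and the tie-breaking rule (first minimal pair, with the newest node listed first) are choices made here so that the run follows the same joins as the slides.

```python
def neighbor_joining(names, dist):
    """Return the edges (node_a, node_b, length) of the unrooted N-J tree."""
    nodes = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    joined = 0
    while len(nodes) > 2:
        n = len(nodes)
        u = {a: sum(d[a][b] for b in nodes if b != a) / (n - 2) for a in nodes}
        pairs = [(a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]]
        # Pair minimising d_ij - u_i - u_j; min() keeps the first pair on ties.
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
        joined += 1
        v = f"v{joined}"
        len_i = d[i][j] / 2.0 + (u[i] - u[j]) / 2.0
        len_j = d[i][j] - len_i  # same as d_ij/2 + (u_j - u_i)/2
        edges.append((i, v, len_i))
        edges.append((j, v, len_j))
        # Distances from the new node to every remaining item.
        d[v] = {v: 0.0}
        for k in nodes:
            if k in (i, j):
                continue
            d_vk = (d[i][k] + d[j][k] - d[i][j]) / 2.0
            d[v][k] = d_vk
            d[k][v] = d_vk
        nodes = [v] + [k for k in nodes if k not in (i, j)]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # last remaining edge
    return edges

example = [
    [0, 8, 7, 12],
    [8, 0, 9, 14],
    [7, 9, 0, 11],
    [12, 14, 11, 0],
]
for a, b, length in neighbor_joining(["A", "B", "C", "D"], example):
    print(f"{a} - {b}: {length}")
```

Because the example matrix is additive, the path lengths in the resulting tree add up exactly to the original distances, for instance A to D is 3 + 1 + 8 = 12.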
Comparison
UPGMA
The total branch length from the root to any leaf is equal
Produces a rooted tree, where the root is the hypothesized ancestor of the sequences in the tree
Suitable for closely related sequences
Can be used to infer phylogenies if one can assume that evolutionary rates are the same in all lineages
Neighbor-joining
Produces an unrooted tree, where the direction of evolution is unknown
Suitable for datasets with largely varying rates of evolution
Suitable for large datasets
[Figure: the unrooted N-J tree (edge lengths A 3, B 5, C 3, D 8, internal edge 1) beside the rooted UPGMA tree (branch lengths A 3.5, C 3.5, B 4.25, D 6.17)]
Conclusion
The UPGMA method constructs a rooted phylogenetic tree correctly if there is a molecular clock with a constant rate of mutation
The UPGMA method is rarely used, because the molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within organisms, and regions within a gene
The N-J method produces an unrooted tree without the molecular clock hypothesis
The N-J method is one of the most popular methods and is widely used by molecular evolutionists
Distance methods are strongly dependent on the model of evolution used
Sequence information is reduced when transforming sequence data into distances
Distance methods are computationally fast
Reference
Durbin, R., Eddy, S., Krogh, A., Mitchison, G. 2003. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA. p. 108
Felsenstein, J. 2003. Inferring Phylogenies. Sinauer Associates, Sunderland, MA. p.147-170
Examples of phylogeny programs
Multiple sequence alignment
  Clustal series (W, V) (free, http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html)
Phylogeny packages
  PAUP (http://paup.csit.fsu.edu/)
  Phylip (free, http://evolution.gs.washington.edu)
  MEGA (free, http://www.megasoftware.net)
Viewing/plotting phylogenetic trees
  Treeview (free, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
  NJPlot (free, http://pbil.univ-lyon1.fr/software/njplot.html)
Further reading
N-J: Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-25.
N-J: Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5(6): 729-31.
UPGMA: Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.
ClustalW: Thompson, J. D., T. J. Gibson, et al. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25(24): 4876-82.