
Building phylogenetic trees

Contents

- Phylogeny
- Phylogenetic trees
- How to make a phylogenetic tree from pairwise distances
- UPGMA method (+ an example)
- Neighbor-Joining method (+ an example)
- Comparison of methods
- Conclusion

Phylogeny

- Phylogeny is the evolutionary history of related species or genes
- Phylogenetic tree: a diagram showing the evolutionary lineages of species or genes
- The history of genes and the history of species may be very different
- Genes can be homologous or analogous and still resemble each other
- Homologous sequences can be divided into two groups:
  - Orthologous sequences diverged by speciation from a common ancestor
  - Paralogous sequences evolved by gene duplication within a species
- Analogous sequences may look and function very similarly, but they do not have a common ancestor

WHEN WE WANT TO EXPLORE EVOLUTIONARY RELATIONSHIPS, WE NEED TO HANDLE ORTHOLOGOUS SEQUENCES

[Diagram: genes are either homologous or analogous; homologous genes are either orthologous or paralogous]

Phylogenetic trees

WHY construct a phylogenetic tree?
- to understand the lineage of various species
- to understand how various functions evolved
- to inform multiple alignments

Properties of trees:
- Trees can be rooted (a common ancestor is known) or unrooted
- Leaves are the terminal nodes that correspond to the observed sequences of genes or species (A, B, C, D)
- Internal nodes are hypothetical ancestral nodes
- All trees will be assumed to be binary, meaning that an edge that branches splits into two daughter edges
- Each edge has a certain amount of evolutionary divergence associated with it, defined by some measure of distance between sequences or by a model of residue substitution over the course of evolution
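The terms above (leaves, internal nodes, binary branching, edge lengths) can be made concrete with a small data structure. The following is only a sketch: the `Node` class, the branch lengths and the choice of the Newick text format for output are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted binary tree (hypothetical helper class)."""
    name: Optional[str] = None      # leaves carry the name of an observed sequence
    length: float = 0.0             # length of the edge leading to the parent
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        """Names of all leaves below this node, left to right."""
        if self.is_leaf():
            return [self.name]
        return [n for child in self.children for n in child.leaves()]

    def newick(self, is_root: bool = True) -> str:
        """Serialise the subtree in Newick format."""
        if self.is_leaf():
            text = self.name
        else:
            text = "(" + ",".join(c.newick(False) for c in self.children) + ")"
        return (text + ";") if is_root else f"{text}:{self.length:g}"


# Four leaves A-D under two internal (hypothetical ancestral) nodes.
# The edge lengths are arbitrary illustration values.
ab = Node(children=[Node("A", 1.0), Node("B", 1.0)], length=0.5)
cd = Node(children=[Node("C", 1.0), Node("D", 1.0)], length=0.5)
root = Node(children=[ab, cd])
print(root.newick())
```

Every internal node here has exactly two children, which is the binary assumption stated above.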

Phylogenetic trees

Different ways to represent a phylogenetic tree (illustrated by Treeview)

[Figure: the same tree of HRV sequences (HRV1A to HRV100) drawn in several different layouts; scale bar 0.1]

Different algorithms used to infer phylogeny from sequence data

1. Distance methods

2. Parsimony

3. Likelihood

4. Probabilistic methods

5. Phylogenetic invariants

Route from the molecular sequences to the phylogenetic tree

Distance methods:
1. Select a set of related (orthologous) nucleotide or amino acid sequences
2. Perform a multiple sequence alignment (the Clustal series is widely used)
3. Calculate pairwise distances between the sequences using a chosen evolutionary model of substitution (distances between sequences describe their evolution: the smaller the distance, the more closely related the sequences)
4. Select the most suitable algorithm to infer the phylogeny
5. View the tree with a tree-viewing program (Treeview, NJPlot, ...)

Hamming Distance

Making a tree from pairwise distances

- Distances dij between each pair of sequences i and j in the given dataset are calculated
- There are different ways of defining distances:
  - For nucleotide sequences: Jukes-Cantor, Kimura 2-parameter (K2P), HKY (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general time-reversible model, general 12-parameter model
  - For amino acid sequences: PAM matrices, BLOSUM matrices

A B C D

A 0 32 44 46

B 32 0 29 43

C 44 29 0 30

D 46 43 30 0
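As a rough illustration of how such a matrix could be filled in, the sketch below computes the proportion of differing sites (the Hamming distance divided by the alignment length) and applies the Jukes-Cantor correction d = -3/4 ln(1 - 4p/3). The function names and the three toy sequences are assumptions made for this example; they are not the data behind the matrix above.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites that differ; gap columns are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) for nucleotide data."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs: dict) -> dict:
    """Nested dict d[a][b] of Jukes-Cantor distances for aligned sequences."""
    return {a: {b: jukes_cantor(p_distance(sa, sb)) for b, sb in seqs.items()}
            for a, sa in seqs.items()}


# Toy aligned sequences (made up for illustration).
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "ACGAACGTTA"}
d = distance_matrix(toy)
print(round(d["s1"]["s2"], 4))
```

The corrected distance is always at least as large as the raw proportion p, because the correction accounts for multiple substitutions at the same site.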

Distance matrix methods

- UPGMA: algorithm introduced by Sokal and Michener (1958)
- Neighbor-Joining: algorithm introduced by Saitou and Nei (1987), modified by Studier and Keppler (1988)

Clustering method: UPGMA

- UPGMA = unweighted pair group method using arithmetic averages
- A simple method
- It works by clustering the sequences: at each stage two clusters are connected and a new node is created on the tree
- The method assumes an equal rate of evolutionary change along all branches (the molecular clock assumption)

UPGMA

- UPGMA produces a rooted tree
- Branch lengths satisfy a molecular clock: the divergence of sequences is assumed to occur at the same constant rate at all points in the tree
- Trees that are clocklike are rooted, and the total branch length from the root to any leaf is equal
- Such trees are often called ultrametric
- Distances are ultrametric if, for every triple of sequences i, j and k, either all three distances are equal (dij = dik = djk) or two of them are equal and the third is smaller (djk < dij = dik)
- UPGMA is guaranteed to build the correct tree if the distances are ultrametric
- The method can be used for reconstructing phylogenies if evolutionary rates are assumed to be the same in all lineages, an assumption that has been criticised in the phylogeny literature
- Suitable for closely related species
- Running time O(n^2)
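The ultrametric condition can be tested directly on a distance matrix. The sketch below checks every triple and requires the two largest of its three distances to be equal, which is equivalent to the definition given above. The helper names and the "clocklike" matrix are assumptions made for illustration; the second matrix is the four-sequence example shown earlier in these slides.

```python
from itertools import combinations


def symmetric(pairs: dict) -> dict:
    """Build a nested dict d[a][b] from distances given once per pair."""
    d = {}
    for (a, b), value in pairs.items():
        d.setdefault(a, {})[b] = value
        d.setdefault(b, {})[a] = value
    return d


def is_ultrametric(d: dict, tol: float = 1e-9) -> bool:
    """True if, in every triple, the two largest distances are equal."""
    for i, j, k in combinations(list(d), 3):
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


# Made-up clocklike distances: A and B split last, D split first.
clocklike = symmetric({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
                       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10})

# The four-sequence matrix from the earlier slide.
slide = symmetric({("A", "B"): 32, ("A", "C"): 44, ("A", "D"): 46,
                   ("B", "C"): 29, ("B", "D"): 43, ("C", "D"): 30})

print(is_ultrametric(clocklike), is_ultrametric(slide))
```

For the slide matrix the triple A, B, C already fails, since its two largest distances are 32 and 44.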

[Figure: example tree with leaves A, C, B and D]

Algorithm: UPGMA

Initialisation:
- Assign each sequence i in the dataset to its own cluster Ci
- Define one leaf of T for each sequence, and place it at height zero

Iteration:
- Find the two clusters i and j for which dij is the smallest (pick randomly if several distances are equal)
- Define a new cluster ij by Cij = Ci ∪ Cj. Cluster ij has nij = ni + nj members (initially ni = 1)
- Connect i and j on the tree to a new node v
- The new node v is placed at height $d_{ij}/2$; the branch lengths from v to i and j are the differences between this height and the heights of i and j
- Compute the distances between the new cluster and each remaining cluster k by using

$$d_{(ij),k} = \frac{n_i}{n_i + n_j}\, d_{ik} + \frac{n_j}{n_i + n_j}\, d_{jk}$$

- Add ij to the current clusters and remove i and j

Termination:
- When only two clusters i and j remain, place the root at height $d_{ij}/2$
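The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the use of nested tuples as cluster labels and the `frozenset` keys are choices made for this sketch, and ties are broken by taking the first minimal pair rather than a random one. The input is the four-sequence matrix used in the worked example in these slides.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster with UPGMA.

    dist maps frozenset({x, y}) to the distance between x and y.
    Returns the merges in order as (cluster, height) pairs, where a
    cluster is a nested tuple of the names it contains.
    """
    d = dict(dist)
    size = {name: 1 for name in names}      # n_i, initially 1
    merges = []
    while len(size) > 1:
        # Closest pair of current clusters (first minimal pair on ties).
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        new = (i, j)
        height = d[frozenset((i, j))] / 2   # new node sits at d_ij / 2
        # Size-weighted average distance to every remaining cluster k.
        for k in size:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    size[i] * d[frozenset((i, k))]
                    + size[j] * d[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[new] = size.pop(i) + size.pop(j)
        merges.append((new, height))
    return merges


pairs = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
         ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}
merges = upgma(list("ABCD"), {frozenset(p): v for p, v in pairs.items()})
for cluster, height in merges:
    print(cluster, round(height, 2))
```

The last merge is the root, so the loop simply runs until one cluster is left instead of treating the two-cluster case separately. Scanning all pairs on every pass makes this sketch slower than the O(n^2) bound quoted earlier.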

An example UPGMA (1)

Distance matrix (arbitrary) for four items (sequences) A, B, C and D

These distances are not ultrametric: in an ultrametric matrix the two largest distances of every triple are equal, but for A, B and C the distances are dAC = 7, dAB = 8 and dBC = 9

     A    B    C    D
A    0    8    7   12
B    8    0    9   14
C    7    9    0   11
D   12   14   11    0

Step 1. Find the smallest distance dij between two clusters: A and C, with dAC = 7

Step 2. Define new cluster ij, which has nij = ni + nj members (initially ni = 1)

New cluster AC: nAC = nA + nC = 2

Step 3. Connect A and C on the tree to a new node v1

Step 4. Place v1 at height $d_{AC}/2 = 7/2 = 3.5$, so the branches from v1 to A and from v1 to C both have length 3.5

[Figure: A and C joined at node v1, each branch of length 3.5]

     A    B    C    D
A    0    8    7   12
B         0    9   14
C              0   11
D                   0

Step 5. Compute the distances between the new cluster AC and the remaining clusters (B and D):

$$d_{(AC),B} = \frac{n_A}{n_A + n_C}\, d_{AB} + \frac{n_C}{n_A + n_C}\, d_{CB} = \frac{1}{2} \cdot 8 + \frac{1}{2} \cdot 9 = 8.5$$

$$d_{(AC),D} = \frac{n_A}{n_A + n_C}\, d_{AD} + \frac{n_C}{n_A + n_C}\, d_{CD} = \frac{1}{2} \cdot 12 + \frac{1}{2} \cdot 11 = 11.5$$

Step 6. Delete the columns and rows of the distance matrix that correspond to clusters A and C, and add a column and a row for cluster AC

New distance matrix

      AC     B      D
AC     0   8.5   11.5
B            0     14
D                   0

An example UPGMA (4)

      AC     B      D
AC     0   8.5   11.5
B            0     14
D                   0

2nd iteration process

Step 1. Find the two clusters i and j for which dij is the smallest (pick randomly if several distances are equal): AC and B, with d(AC),B = 8.5

Step 2. Define new cluster (ij), which has nij = ni + nj members (initially ni = 1). New cluster AC and B: nACB = nAC + nB = 2 + 1 = 3

Step 3. Connect AC and B on the tree to a new node v2

Step 4. Place v2 at height $d_{(AC),B}/2 = 8.5/2 = 4.25$, so the branch from v2 to B has length 4.25 and the branch from v2 to v1 has length 4.25 - 3.5 = 0.75

[Figure: A and C joined at v1 (branches 3.5 each); v1 and B joined at v2 (branch to B 4.25)]


Step 5. Compute the distance between the new cluster ACB and the remaining cluster (D)

$$d_{(ACB),D} = \frac{n_{AC}}{n_{AC} + n_B}\, d_{(AC),D} + \frac{n_B}{n_{AC} + n_B}\, d_{BD} = \frac{2}{3} \cdot 11.5 + \frac{1}{3} \cdot 14 = 12.33$$

Step 6. Delete the columns and rows of the distance matrix that correspond to clusters AC and B, and add a column and a row for cluster ACB

New distance matrix

       ACB       D
ACB      0   12.33
D                0


Termination: Only two clusters (ACB and D) remain

       ACB       D
ACB      0   12.33
D                0

Place the root at height $d_{(ACB),D}/2 = 12.33/2 = 6.17$

Original distance matrix and final phylogenetic tree (including the branch lengths)

     A    B    C    D
A    0    8    7   12
B         0    9   14
C              0   11
D                   0

[Figure: final rooted UPGMA tree. Branch lengths: A 3.5, C 3.5, B 4.25, D 6.17; internal branch v1 to v2 0.75; internal branch v2 to root 1.92]
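Because the input distances are not ultrametric, the finished UPGMA tree cannot reproduce all of them. A quick check, sketched below, compares each original distance with the distance read off the tree, which is twice the height of the node where the two leaves meet. The helper name `meeting_height` is an assumption made for this sketch; the heights are the ones computed in the worked example.

```python
original = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
            ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}

# Heights of the internal nodes from the worked example.
h_v1, h_v2, h_root = 7 / 2, 8.5 / 2, (37 / 3) / 2


def meeting_height(a: str, b: str) -> float:
    """Height of the node at which leaves a and b are joined in the tree."""
    pair = {a, b}
    if pair == {"A", "C"}:
        return h_v1
    if "D" in pair:
        return h_root
    return h_v2          # A-B or B-C meet at v2


tree = {pair: 2 * meeting_height(*pair) for pair in original}
for pair, d in original.items():
    print(pair, "matrix:", d, "tree:", round(tree[pair], 2))
```

Only the A-C distance (7) is reproduced exactly; the others are replaced by cluster averages, for example 8.5 for both A-B and B-C.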

Neighbor-Joining (N-J)

- Another algorithm that works by clustering the sequences
- Does not assume a molecular clock
- N-J trees are unrooted
- N-J assumes additivity
  Def. Edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
- The method uses an approximate algorithm: the tree is built by repeatedly finding a pair of neighboring leaves i and j that minimizes the length of the tree, and joining them
- Running time O(n^3) for the standard algorithm
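Whether a distance matrix is additive can be tested without building a tree. The sketch below uses the standard four-point condition, which is not stated in these slides: for every four leaves, the two largest of the three sums dij + dkl, dik + djl and dil + djk must be equal. The function name and the perturbed matrix are assumptions made for illustration; the first matrix is the four-sequence example used in the worked examples.

```python
from itertools import combinations


def is_additive(d: dict, tol: float = 1e-9) -> bool:
    """Four-point condition on a nested dict d[a][b] of distances."""
    for i, j, k, l in combinations(list(d), 4):
        sums = sorted((d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


names = "ABCD"
rows = [[0, 8, 7, 12],
        [8, 0, 9, 14],
        [7, 9, 0, 11],
        [12, 14, 11, 0]]
example = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

# Same matrix with the C-D distance changed to 15 (made-up perturbation).
perturbed = {a: dict(row) for a, row in example.items()}
perturbed["C"]["D"] = perturbed["D"]["C"] = 15

print(is_additive(example), is_additive(perturbed))
```

For the example matrix the three sums are 19, 21 and 21, so the matrix is additive even though it is not ultrametric.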

[Figure: unrooted tree with leaves A, B, C and D]

Algorithm: Neighbor-Joining

Initialisation:
- Define T to be the set of leaf nodes, one for each given sequence

Iteration:
- Compute for each sequence i

$$u_i = \sum_{j \neq i} \frac{d_{ij}}{n - 2}$$

  where n is the number of sequences in the distance matrix
- Pick a pair i and j for which dij - ui - uj is the smallest (pick randomly if several are equal)
- Join items i and j with a new node v
- Compute the branch lengths from the new node v to items i and j
- Compute the distances between the new node v and the remaining items
- Remove i and j from the distance matrix and replace them by the new node v

Termination:
- When only two items i and j remain, add the remaining edge between i and j, with length dij
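The iteration above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the node labels `v1`, `v2`, ... are invented here, ties are broken by taking the first minimal pair rather than a random one, and the branch-length and distance-update formulas are the ones used in the worked example in these slides, namely d_ij/2 + (u_i - u_j)/2 and (d_ik + d_jk - d_ij)/2.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree with neighbor-joining.

    dist maps frozenset({x, y}) to the distance between x and y.
    Returns the tree as a list of edges (node, node, length).
    Leaf names must not clash with the generated labels v1, v2, ...
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    count = 0
    while len(nodes) > 2:
        n = len(nodes)
        # u_i: sum of distances from i to all other items, divided by n - 2.
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # Pair minimising d_ij - u_i - u_j (first minimal pair on ties).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        count += 1
        v = f"v{count}"
        dij = d[frozenset((i, j))]
        edges.append((i, v, dij / 2 + (u[i] - u[j]) / 2))
        edges.append((j, v, dij / 2 + (u[j] - u[i]) / 2))
        # Distances from the new node to every remaining item.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((v, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [v] + [k for k in nodes if k not in (i, j)]
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


pairs = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
         ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}
nj_edges = neighbor_joining(list("ABCD"),
                            {frozenset(p): v for p, v in pairs.items()})
for edge in nj_edges:
    print(edge)
```

Recomputing every u_i and scanning all pairs on each pass is what makes the straightforward version cubic in the number of sequences.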

An example N-J (1)

Step 1. Compute $u_i = \sum_{j \neq i} d_{ij} / (n - 2)$ for each row in the distance matrix

Step 2. Compute $d_{ij} - (u_i + u_j)$ (the lower-diagonal matrix) and choose the smallest (most negative)

     A    B    C    D    Step 1: ui
A    0    8    7   12    (8+7+12)/(4-2) = 13.5
B    8    0    9   14    (8+9+14)/(4-2) = 15.5
C    7    9    0   11    (7+9+11)/(4-2) = 13.5
D   12   14   11    0    (12+14+11)/(4-2) = 18.5

dij - (ui + uj):
    A-B:  8 - (13.5 + 15.5) = -21
    A-C:  7 - (13.5 + 13.5) = -20
    B-C:  9 - (15.5 + 13.5) = -20
    A-D: 12 - (13.5 + 18.5) = -20
    B-D: 14 - (15.5 + 18.5) = -20
    C-D: 11 - (13.5 + 18.5) = -21

A-B and C-D tie at -21; the pair A and B is chosen

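An example N-J (2)

Step 3. Join A and B together with a new node v1. Compute the edge lengths from A to node v1 and from B to node v1

$$d_{A,v_1} = \frac{d_{AB}}{2} + \frac{u_A - u_B}{2} = \frac{8}{2} + \frac{13.5 - 15.5}{2} = 3$$

$$d_{B,v_1} = \frac{d_{AB}}{2} + \frac{u_B - u_A}{2} = \frac{8}{2} + \frac{15.5 - 13.5}{2} = 5$$

Step 4. Compute distances between the new node v1 and the remaining items (C and D)

$$d_{(AB),C} = \frac{d_{AC} + d_{BC} - d_{AB}}{2} = \frac{7 + 9 - 8}{2} = 4$$

$$d_{(AB),D} = \frac{d_{AD} + d_{BD} - d_{AB}}{2} = \frac{12 + 14 - 8}{2} = 9$$

[Figure: A and B joined at node v1; edge A to v1 has length 3, edge B to v1 has length 5]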

An example N-J (3)

Step 5. Delete A and B from the distance matrix and replace them by the new item AB

Step 6. Continue from step 1, because more than two items remain

Step 1. Compute $u_i = \sum_{j \neq i} d_{ij} / (n - 2)$ for each row in the distance matrix

Step 2. Compute $d_{ij} - (u_i + u_j)$ (the lower-diagonal matrix) and choose the smallest

New reduced distance matrix

      AB    C    D    Step 1: ui
AB     0    4    9    (4+9)/(3-2) = 13
C      4    0   11    (4+11)/(3-2) = 15
D      9   11    0    (9+11)/(3-2) = 20

dij - (ui + uj):
    AB-C:  4 - (13 + 15) = -24
    AB-D:  9 - (13 + 20) = -24
    C-D:  11 - (15 + 20) = -24

All three pairs tie at -24; the pair AB (node v1) and C is chosen

An example N-J (4)

Step 3. Join v1 and C together with a new node v2. Compute the edge lengths from v1 to node v2 and from C to node v2

$$d_{v_1,v_2} = \frac{d_{(AB),C}}{2} + \frac{u_{AB} - u_C}{2} = \frac{4}{2} + \frac{13 - 15}{2} = 1$$

$$d_{C,v_2} = \frac{d_{(AB),C}}{2} + \frac{u_C - u_{AB}}{2} = \frac{4}{2} + \frac{15 - 13}{2} = 3$$

Step 4. Compute the distance between the new node v2 and the remaining item (D)

$$d_{(ABC),D} = \frac{d_{(AB),D} + d_{CD} - d_{(AB),C}}{2} = \frac{9 + 11 - 4}{2} = 8$$

[Figure: A (edge 3) and B (edge 5) joined at v1; v1 (edge 1) and C (edge 3) joined at v2]

An example N-J (5)

Step 5. Delete AB and C from the distance matrix and replace them by ABC

Step 6. Only two nodes remain: connect them

       ABC    D
ABC      0    8
D             0

Original distance matrix and final phylogenetic tree (including the edge lengths)

     A    B    C    D
A    0    8    7   12
B         0    9   14
C              0   11
D                   0

[Figure: final unrooted N-J tree. Edge lengths: A to v1 3, B to v1 5, v1 to v2 1, C to v2 3, D to v2 8]
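The additivity of the final N-J tree can be checked by summing edge lengths along the path between every pair of leaves and comparing the result with the original matrix. The sketch below does this with a small recursive walk; the function and variable names are assumptions made for illustration, and the edge lengths are the ones from the worked example.

```python
# Edges of the final N-J tree from the worked example.
tree_edges = [("A", "v1", 3), ("B", "v1", 5), ("v1", "v2", 1),
              ("C", "v2", 3), ("D", "v2", 8)]

# Undirected adjacency list: node -> [(neighbour, edge length), ...]
adjacency = {}
for a, b, length in tree_edges:
    adjacency.setdefault(a, []).append((b, length))
    adjacency.setdefault(b, []).append((a, length))


def path_length(src, dst, came_from=None):
    """Sum of edge lengths on the unique path src -> dst (None if no path)."""
    if src == dst:
        return 0
    for neighbour, length in adjacency[src]:
        if neighbour != came_from:          # never walk back along an edge
            rest = path_length(neighbour, dst, src)
            if rest is not None:
                return length + rest
    return None


original = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
            ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}
for (a, b), d in original.items():
    print(a, b, "matrix:", d, "tree:", path_length(a, b))
```

Every path sum equals the corresponding matrix entry, so for this input the N-J tree reproduces the distances exactly.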

Comparison

UPGMA
- The total branch length from the root to any leaf is equal
- Produces a rooted tree, where the root is the hypothesized ancestor of the sequences in the tree
- Suitable for closely related sequences
- Can be used to infer phylogenies if one can assume that evolutionary rates are the same in all lineages

Neighbor-joining
- Produces an unrooted tree, where the direction of evolution is unknown
- Suitable for datasets with largely varying rates of evolution
- Suitable for large datasets

[Figure: the unrooted N-J tree (edge lengths 3, 5, 1, 3, 8) next to the rooted UPGMA tree (branch lengths 3.5, 3.5, 4.25, 6.17) from the two examples]

Conclusion

- The UPGMA method constructs a rooted phylogenetic tree correctly if there is a molecular clock with a constant rate of mutation
- The UPGMA method is rarely used, because the molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within organisms, and regions within a gene
- The N-J method produces an unrooted tree without the molecular clock hypothesis
- The N-J method is one of the most popular methods and is widely used by molecular evolutionists
- Distance methods are strongly dependent on the model of evolution used
- Sequence information is reduced when transforming sequence data into distances
- Distance methods are computationally fast

Reference

Durbin, R., Eddy, S., Krogh, A., Mitchison, G. 2003. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.

Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA. p. 108

Felsenstein, J. 2003. Inferring Phylogenies. Sinauer Associates, Sunderland, MA. p. 147-170

Examples of phylogeny programs

Multiple sequence alignment
- Clustal series (W, V) (free, http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html)

Phylogeny packages
- PAUP (http://paup.csit.fsu.edu/)
- Phylip (free, http://evolution.gs.washington.edu)
- MEGA (free, http://www.megasoftware.net)

Viewing/plotting phylogenetic trees
- Treeview (free, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
- NJPlot (free, http://pbil.univ-lyon1.fr/software/njplot.html)

Further reading

N-J: Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-25.

N-J: Studier, J. A., K. J. Keppler, et al. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5(6): 729-31.

UPGMA: Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.

ClustalW: Thompson, J. D., T. J. Gibson, et al. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25(24): 4876-82.
