phylogenetic trees lecture 1

48
. Phylogenetic Trees Lecture 1 Credits: N. Friedman, D. Geiger , S. Moran,

Upload: hashim-powers

Post on 03-Jan-2016

49 views

Category:

Documents


0 download

DESCRIPTION

Phylogenetic Trees Lecture 1. Credits: N. Friedman, D. Geiger , S. Moran,. Evolution. Evolution of new organisms is driven by Diversity Different individuals carry different variants of the same basic blue print Mutations - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Phylogenetic Trees Lecture 1

.

Phylogenetic Trees

Lecture 1

Credits: N. Friedman, D. Geiger , S. Moran,

Page 2: Phylogenetic Trees Lecture 1

2

Evolution

Evolution of new organisms is driven by

Diversity Different individuals

carry different variants of the same basic blue print

Mutations The DNA sequence

can be changed due to single base changes, deletion/insertion of DNA segments, etc.

Selection bias

Page 3: Phylogenetic Trees Lecture 1

3

The Tree of Life

Sou

rce:

Alb

erts

et

al

Page 4: Phylogenetic Trees Lecture 1

4

D’après Ernst Haeckel, 1891

Tree of life- a better picture

Page 5: Phylogenetic Trees Lecture 1

5

Primate evolution

A phylogeny is a tree that describes the sequence of speciation events that lead to the forming of a set of current day species; also called a phylogenetic tree.

Page 6: Phylogenetic Trees Lecture 1

6

Historical Note Until mid 1950’s phylogenies were constructed by

experts based on their opinion (subjective criteria)

Since then, focus on objective criteria for constructing phylogenetic trees

Thousands of articles in the last decades

Important for many aspects of biology Classification Understanding biological mechanisms

Page 7: Phylogenetic Trees Lecture 1

7

Morphological vs. Molecular

Classical phylogenetic analysis: morphological features: number of legs, lengths of legs, etc.

Modern biological methods allow to use molecular features

Gene sequences Protein sequences

Analysis based on homologous sequences (e.g., globins) in different species

Page 8: Phylogenetic Trees Lecture 1

8

Morphological topology

BonoboChimpanzeeManGorillaSumatran orangutanBornean orangutanCommon gibbonBarbary apeBaboonWhite-fronted capuchinSlow lorisTree shrewJapanese pipistrelleLong-tailed batJamaican fruit-eating batHorseshoe bat

Little red flying foxRyukyu flying foxMouseRatVoleCane-ratGuinea pigSquirrelDormouseRabbitPikaPigHippopotamusSheepCowAlpacaBlue whaleFin whaleSperm whaleDonkeyHorseIndian rhinoWhite rhinoElephantAardvarkGrey sealHarbor sealDogCatAsiatic shrewLong-clawed shrewSmall Madagascar hedgehogHedgehogGymnureMoleArmadilloBandicootWallarooOpossumPlatypus

Archonta

Glires

Ungulata

Carnivora

Insectivora

Xenarthra

(Based on Mc Kenna and Bell, 1997)

Page 9: Phylogenetic Trees Lecture 1

9

Rat QEPGGLVVPPTDA

Rabbit QEPGGMVVPPTDA

Gorilla QEPGGLVVPPTDA

Cat REPGGLVVPPTEG

From sequences to a phylogenetic tree

There are many possible types of sequences to use (e.g. Mitochondrial vs Nuclear proteins).

Page 10: Phylogenetic Trees Lecture 1

10

DonkeyHorseIndian rhinoWhite rhinoGrey sealHarbor sealDogCatBlue whaleFin whaleSperm whaleHippopotamusSheepCowAlpacaPigLittle red flying foxRyukyu flying foxHorseshoe batJapanese pipistrelleLong-tailed batJamaican fruit-eating bat

Asiatic shrewLong-clawed shrew

MoleSmall Madagascar hedgehogAardvarkElephantArmadilloRabbitPikaTree shrewBonoboChimpanzeeManGorillaSumatran orangutanBornean orangutanCommon gibbonBarbary apeBaboon

White-fronted capuchinSlow lorisSquirrelDormouseCane-ratGuinea pigMouseRatVoleHedgehogGymnureBandicootWallarooOpossumPlatypus

Perissodactyla

Carnivora

Cetartiodactyla

Rodentia 1

HedgehogsRodentia 2

Primates

ChiropteraMoles+ShrewsAfrotheria

XenarthraLagomorpha

+ Scandentia

Mitochondrial topology(Based on Pupko et al.,)

Page 11: Phylogenetic Trees Lecture 1

11

Nuclear topology

Round Eared Bat

Flying Fox

Hedgehog

Mole

Pangolin

Whale

Hippo

Cow

Pig

Cat

Dog

Horse

Rhino

Rat

Capybara

Rabbit

Flying Lemur

Tree Shrew

Human

Galago

Sloth

Hyrax

Dugong

Elephant

Aardvark

Elephant Shrew

Opossum

Kangaroo

1

2

3

4

Cetartiodactyla

Afrotheria

Chiroptera

Eulipotyphla

Glires

Xenarthra

CarnivoraPerissodactyla

Scandentia+Dermoptera

Pholidota

Primate

(tree by Madsenl)

(Based on Pupko et al. slide)

Page 12: Phylogenetic Trees Lecture 1

12

Theory of Evolution

Basic idea speciation events lead to creation of different

species. Speciation caused by physical separation into

groups where different genetic variants become dominant

Any two species share a (possibly distant) common ancestor

Page 13: Phylogenetic Trees Lecture 1

.

Basic Assumptions Closer related organisms have more similar

genomes.

Highly similar genes are homologous (have the same ancestor).

A universal ancestor exists for all life forms.

Molecular difference in homologous genes (or protein sequences) are positively correlated with evolution time.

Phylogenetic relation can be expressed by a dendrogram (a “tree”) .

Page 14: Phylogenetic Trees Lecture 1

14

Phylogenenetic trees

Leafs - current day species Nodes - hypothetical most recent common ancestors Edges length - “time” from one speciation to the next

Aardvark Bison Chimp Dog Elephant

Page 15: Phylogenetic Trees Lecture 1

15

Dangers in Molecular Phylogenies

We have to emphasize that gene/protein sequence can be homologous for several different reasons:

Orthologs -- sequences diverged after a speciation event

Paralogs -- sequences diverged after a duplication event

Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

Page 16: Phylogenetic Trees Lecture 1

16

Species Phylogeny

Gene Phylogenies

Speciation events

Gene Duplication

1A 2A 3A 3B 2B 1B

Phylogenies can be constructed to describe evolution genes.

Three species termed 1,2,3.Two paralog genes A and B.

Page 17: Phylogenetic Trees Lecture 1

17

Dangers of Paralogs

Speciation events

Gene Duplication

1A 2A 3A 3B 2B 1B

If we happen to consider genes 1A, 2B, and 3A of species 1,2,3, we get a wrong tree that does not represent the phylogeny of the host species of the given sequences because duplication does not create new species.

In the sequel we assume all given sequences are orthologs.

S

SS

Page 18: Phylogenetic Trees Lecture 1

18

Types of Trees

A natural model to consider is that of rooted trees

CommonAncestor

Page 19: Phylogenetic Trees Lecture 1

19

Types of treesUnrooted tree represents the same phylogeny without

the root node

Depending on the model, data from current day species does not distinguish between different placements of the root.

Page 20: Phylogenetic Trees Lecture 1

20

Rooted versus unrooted treesTree A

ab

Tree B

c

Tree C

Represents the three rooted trees

Page 21: Phylogenetic Trees Lecture 1

21

Positioning Roots in Unrooted Trees

We can estimate the position of the root by introducing an outgroup:

a set of species that are definitely distant from all the species of interest

Aardvark Bison Chimp Dog Elephant

Falcon

Proposed root

Page 22: Phylogenetic Trees Lecture 1

22

Type of Data

Distance-based Input is a matrix of distances between species Can be fraction of residue they disagree on, or

alignment score between them, or …

Character-based Examine each character (e.g., residue)

separately

Page 23: Phylogenetic Trees Lecture 1

23

Three Methods of Tree Construction

Distance- A tree that recursively combines two nodes of the smallest distance.

Parsimony – A tree with a total minimum number of character changes between nodes.

Maximum likelihood - Finding the best Bayesian network of a tree shape. The method of choice nowadays. Most known and useful software called phylip uses this method.

Page 24: Phylogenetic Trees Lecture 1

24

Distance-Based Method

Input: distance matrix between species

Outline: Cluster species together Initially clusters are singletons At each iteration combine two “closest”

clusters to get a new one

Page 25: Phylogenetic Trees Lecture 1

25

Unweighted Pair Group Method using Arithmetic Averages (UPGMA)

UPGMA is a type of Distance-Based algorithm.

Despite its formidable acronym, the method is simple and intuitively appealing.

It works by clustering the sequences, at each stage amalgamating two clusters and, at the same time, creating a new node on the tree.

Thus, the tree can be imagined as being assembled upwards, each node being added above the others, and the edge lengths being determined by the difference in the heights of the nodes at the top and bottom of an edge.

Page 26: Phylogenetic Trees Lecture 1

26

An example showing how UPGMA producesa rooted phylogenetic tree

Page 27: Phylogenetic Trees Lecture 1

27

An example showing how UPGMA producesa rooted phylogenetic tree

Page 28: Phylogenetic Trees Lecture 1

28

An example showing how UPGMA producesa rooted phylogenetic tree

Page 29: Phylogenetic Trees Lecture 1

29

An example showing how UPGMA producesa rooted phylogenetic tree

Page 30: Phylogenetic Trees Lecture 1

30

An example showing how UPGMA producesa rooted phylogenetic tree

Page 31: Phylogenetic Trees Lecture 1

31

UPGMA Clustering Let Ci and Cj be clusters, define distance between them to

be

When we combine two cluster, Ci and Cj, to form a new cluster Ck, then

Define a node K and place its children nodes at depth

d(Ci, Cj)/2

i jCp Cqji

ji qpdCC

1CCd ),(

||||),(

||||

),(||),(||),(

ji

ljjliilk CC

CCdCCCdCCCd

Page 32: Phylogenetic Trees Lecture 1

32

Example

UPGMA construction on five objects.

The length of an edge = its (vertical) height.

2 3 4 1

6

8

9

d(7,8) / 2

5

7d(2,3) / 2

Page 33: Phylogenetic Trees Lecture 1

33

Molecular clock

This phylogenetic tree has all leaves in the same level. When this property holds, the phylogenetic tree is said to satisfy a molecular clock. Namely, the time from a speciation event to the formation of current species is identical for all paths (wrong assumption in reality).

Page 34: Phylogenetic Trees Lecture 1

34

Molecular Clock

1

2

3

4

UPGMA

2 3 4 1

UPGMA constructs trees that satisfy a molecular clock, even if the true tree does not satisfy a molecular clock.

Page 35: Phylogenetic Trees Lecture 1

35

Restrictive Correctness of UPGMAProposition: If the distance function is derived by adding edge distances in a tree T with a molecular clock, then UPGMA will reconstruct T.

Proof idea: Move a horizontal line from the bottom of the T to the top. Whenever an internal node is formed, the algorithm will create it.

Page 36: Phylogenetic Trees Lecture 1

36

Additivity

Molecular clock defines additive distances, namely,

distances between objects can be realized by a tree:

ab

c

i

j

k

cbkjd

cakid

bajid

),(

),(

),(

Page 37: Phylogenetic Trees Lecture 1

37

What is a Distance Matrix?

Given a set M of L objects with an L× L

distance matrix:

d(i, i) = 0, and for i ≠ j, d(i, j) > 0d(i, j) = d(j, i). For all i, j, k, it holds that d(i, k) ≤ d(i, j)+d(j, k).

Can we construct a weighted tree which realizes these distances?

Page 38: Phylogenetic Trees Lecture 1

38

Additive Distances

We say that the set M with L objects is additive if there is a tree T, L of its nodes correspond to the L objects, with positive weights on the edges, such that for all i, j, d(i, j) = dT(i, j), the length of the path from i to j in T.

Note: Sometimes the tree is required to be binary, and then the edge weights are required to be non-negative.

Page 39: Phylogenetic Trees Lecture 1

39

Three objects sets are additive:

For L=3: There is always a (unique) tree with one internal node.

( , )( , )( , )

d i j a bd i k a cd j k b c

ab

c

i

j

k

m

Thus0

2

1 )],(),(),([),( jidkjdkidmkdc

Page 40: Phylogenetic Trees Lecture 1

40

How about four objects?

L=4: Not all sets with 4 objects are additive:

e.g., there is no tree which realizes the below distances.

i j k l

i 0 2 2 2

j 0 2 2

k 0 3

l 0

Page 41: Phylogenetic Trees Lecture 1

41

The Four Points Condition

Theorem: A set M of L objects is additive iff any subset of four objects can be labeled i,j,k,l so that:

d(i, k) + d(j, l) = d(i, l) +d(k, j) ≥ d(i, j) + d(k, l)

We call {{i,j}, {k,l}} the “split” of {i, j, k, l}.

ik

lj

Proof:Additivity 4P Condition: By the figure...

Page 42: Phylogenetic Trees Lecture 1

42

4P Condition Additivity:

Induction on the number of objects, L.For L ≤ 3 the condition is empty and tree exists.

Consider L=4. B = d(i, k) +d(j, l) = d(i, l) +d(j, k) ≥ d(i, j) + d(k, l) = A

Let y = (B – A)/2 ≥ 0. Then the tree should look as follows:We have to find the distances a,b, c and f.

a b

i j

k

m

c

y

l

n

f

Page 43: Phylogenetic Trees Lecture 1

43

Tree construction for L = 4

a

b

i

j

k

m

c

y

l

n

f

Construct the tree by the given distances as follows:

1. Construct a tree for {i, j, k}, with internal vertex m

2. Add vertex n ,d(m,n) = y

3. Add edge (n, l), c+f = d(k, l)

n

f

n

f

n

fRemains to prove:

d(i,l) = dT(i,l)d(j,l) = dT(j,l)

Page 44: Phylogenetic Trees Lecture 1

44

Proof for “L = 4”

a

b

i

j

k

m

c

y

l

n

f

By the 4 points condition and the definition of y :d(i,l) = d(i,j) + d(k,l) +2y - d(k,j) = a + y + f = dT(i,l) (the middle equality holds since d(i,j), d(k,l) and d(k,j) are realized by the tree)d(j, l) = dT(j, l) is proved similarly.

B = d(i, k) +d(j, l) = d(i, l) +d(j, k) ≥ d(i, j) + d(k, l) = A,y = (B – A)/2 ≥ 0.

Page 45: Phylogenetic Trees Lecture 1

45

Induction step for “L > 4” : Remove Object L from the set By induction, there is a tree, T’, for {1, 2, … , L-1}. For each pair of labeled nodes (i, j) in T’, let aij, bij, cij

be defined by the following figure:

aij

bij

cij

i

j

L

mij

1[ ( , ) ( , ) ( , )]

2ijc d i L d j L d i j

Page 46: Phylogenetic Trees Lecture 1

46

Induction step: Pick i and j that minimize cij.

T is constructed by adding L (and possibly mij) to T’, as in the figure. Then d(i,L) = dT(i,L) and d(j,L) = dT(j,L)

Remains to prove: For each k ≠ i, j : d(k,L) = dT(k,L).

aij

bij

cij

i

j

L

mij

T’

Page 47: Phylogenetic Trees Lecture 1

47

Induction step (cont.)

Let k ≠ i, j be an arbitrary node in T’, and let n be the branching point of k in the path from i to j.

By the minimality of cij , {{i,j},{k,L}} is NOT a “split” of {i,j,k,L}. So assume WLOG that {{i,L},{j,k}} is a

“split” of {i,j, k,L}.

aij

bij

cij

i

j

L

mij

T’

k

n

Page 48: Phylogenetic Trees Lecture 1

48

Induction step (end)Since {{i,L},{j,k}} is a split, by the 4 points condition

d(L,k) = d(i,k) + d(L,j) - d(i,j)

d(i,k) = dT(i,k) and d(i,j) = dT(i,j) by induction hypothesis, and

d(L,j) = dT(L,j) by the construction.

Hence d(L,k) = dT(L,k). QED

aij

bij

cij

i

j

L

mij

T’

k

n