Introduction to Phylogenetic Analysis Irit Orr Feb 2005

Page 1: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Introduction to Phylogenetic Analysis

Irit Orr

Feb 2005

Page 2:

Subjects of this lecture

Introducing some of the terminology used in phylogenetics.

Introducing the more common evolutionary models.

Introducing some of the most commonly used methods for reconstructing phylogenies.

Explaining how to build and read phylogenetic trees.

Page 3:

Taxonomy - is the science of classification of organisms.

Page 4:

Phylogeny - is the evolution of a genetically related group of organisms.

Or: a study of relationships between a collection of "things" (genes, proteins, organs...) that are derived from a common ancestor.

Page 5:

Phylogenetics - Field of biology that deals with the relationships between taxa (organisms, molecular data).

It includes the discovery of these relationships, and the study of the causes behind this pattern.

Page 6:

Phylogenetics - WHY?

Find evolutionary ties between organisms. (Analyze changes occurring in different organisms during evolution.)

Find (understand) relationships between an ancestral sequence and its descendants. (Evolution of a family of sequences.)

Estimate the time of divergence between a group of organisms that share a common ancestor.

Page 7:

Phylogenetics - WHY?

The purpose of phylogenetic reconstruction is to attempt to estimate the phylogeny for some data.

The basic assumption is that: for any collection of data there will be some ancestral relationship between the current (contemporary) sequences.

The data itself contains information that can be used to reconstruct or to infer these ancestral relationships. This involves reconstructing a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.

Page 8:

[Figures: basic tree of life; eukaryote tree]

Page 9:

Molecular phylogenetics

The study of phylogenies and processes of evolution by the analysis of molecular data (DNA or amino acid sequences).

AAGAATC
AAGAGTT
AAGA(A/G)T(C/T)

Page 10:

Similar sequences, common ancestor...

... common ancestor, similar function

Page 11:

From a common ancestor sequence, two DNA sequences diverge.

Each of these two sequences starts to accumulate nucleotide substitutions.

The rate, number and character of these mutations are used in molecular evolution analysis.

Page 12:

What is a phylogenetic tree?

Phylogenetic tree = Dendrogram

An illustration of the hierarchical relationships among a group of organisms arising through evolution.

The relationships are usually represented by a schematic 'tree' comprising a set of nodes linked together by branches.

Each NODE represents a speciation event in evolution. Beyond this point, any sequence changes that occurred are specific to each branch (species).

Page 13:

In a phylogenetic tree.....

Terminal nodes (external nodes, tips or leaves) typically represent the actual taxa = known sequences from extant organisms. Known also as OTUs (Operational Taxonomic Units).

Internal nodes represent ancestral divergences into two (or more) genetically isolated groups. Known also as HTUs (Hypothetical Taxonomic Units).

Each internal node is attached to a branch(es) representing evolution from its ancestor to its descendants.

Page 14:

In a phylogenetic tree.....

The lengths of the branches in the tree can represent the evolutionary distances that separate the nodes (meaning the number of changes that occurred in the sequences prior to the next level of separation).
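The vocabulary above (terminal nodes, internal nodes, branch lengths) maps naturally onto a small recursive data structure. The following is a minimal sketch, not something taken from the lecture: the `Node` class, the helper names and the example branch lengths are all hypothetical, chosen only to illustrate the terms. The tree is written out in Newick notation, a common plain-text format for trees.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tree node: a terminal node (OTU) if it has no children,
    otherwise an internal node (HTU)."""
    name: Optional[str] = None
    branch_length: float = 0.0          # length of the branch to the parent
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Optional[str]]:
    """Names of all terminal nodes below (or at) this node."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def path_length(node: Node, target: str) -> Optional[float]:
    """Sum of branch lengths from `node` down to the tip `target`,
    or None if the tip is not below this node."""
    if node.name == target:
        return 0.0
    for child in node.children:
        below = path_length(child, target)
        if below is not None:
            return child.branch_length + below
    return None


def to_newick(node: Node, root: bool = True) -> str:
    """Serialize the tree in Newick notation."""
    label = node.name or ""
    if node.children:
        inner = ",".join(to_newick(c, False) for c in node.children)
        label = "(" + inner + ")" + label
    return label + ";" if root else f"{label}:{node.branch_length}"


# Hypothetical example: A and B share an internal node; C joins at the root.
tree = Node(children=[
    Node(children=[Node("A", 0.1), Node("B", 0.2)], branch_length=0.05),
    Node("C", 0.3),
])
print(leaves(tree))
print(to_newick(tree))
```

Dropping the branch lengths from the Newick string leaves only the branching order, which is the tree topology.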

Page 15:

Tree Topology

Is the information on the branching order of relationships (branching patterns) between taxa, without consideration of the branch lengths.

Page 16:

[Figure: phylogenetic trees, with internal nodes and external nodes labelled]

Page 17:

The Phylogenetic tree....

People still usually consider a phylogenetic reconstruction program to act like a black box. It takes input, churns around for a while and then spits out the actual phylogenetic answer.

This is incorrect.

First, the actual phylogenetic answer cannot be obtained by any known method.

The amount of evolutionary time that passed from the separation of the 2 sequences is not known.

Page 18:

All phylogenetic analysis methods can only provide estimates and educated guesses of what a phylogenetic tree might look like for the current set of data.

These methods can only estimate the number of changes that occurred from the time of separation (branching event).

These estimates are only as good as the data itself and only as good as the algorithm. Some algorithms in common use are actually quite poor methods.

Page 19:

Modelling Evolution

In molecular phylogenetics we try to understand sequence evolution.

To infer robust phylogenies for the data we need powerful (statistical) evolutionary models.

All methods describing sequence evolution use a model that consists of 2 components:

A phylogenetic tree

A model (a description of the way the nucleotide and amino acid sequences evolved).

Page 20:

Modelling Evolution

Models of molecular evolution are highly simplified descriptions of the history and process of sequence change. (A -> C, G -> T, A -> T, S -> T)

The history is represented by the phylogenetic tree topology and the expected amount of evolution on each branch (i.e. the branch lengths).

Most effort for modeling sequence evolution has been concentrated on the processes of nucleotide substitution and amino-acid replacement.

Page 21:

Modelling Evolution

Insertion and deletion events have proven difficult to model. Conventional methods for phylogeny reconstruction require a known sequence alignment and neglect alignment uncertainty.

Alignment columns with gaps are either removed from the analysis or are treated in an ad hoc fashion. As a result, evolutionary information from insertions and deletions is typically ignored during phylogeny reconstruction.

Page 22:

Modelling Evolution

The difficulty with insertions and deletions may be the most vexing problem for model-based approaches to studying sequence evolution.

Models differ in their assumptions regarding the rates at which all the possible replacements occur.

Page 23:

Modelling Evolution

There are 2 known approaches for mathematical models of sequence evolution which include variables that represent information (features) of the process of evolution:

Empirical models - use fixed parameter values (estimated from previous knowledge of datasets, and precomputed).

Parametric models - do not use fixed parameter values. They allow the values to be derived from the current dataset.

Page 24:

Modelling Evolution

There are different types of models:

DNA substitution models, with 3 types of parameters used by them:

Base frequency parameters

Base exchangeability parameters

Rate heterogeneity parameters (assign different variation rates to different parts of the sequence, e.g. different rates to each position in the codon, or to sites in a protein domain)

Page 25:

DNA Models

Amino Acids Model

Page 26:

Mutations in DNA as a source for evolutionary analysis

Only mutations that were fixed in the population are called substitutions.

We assume that each observed change in similar sequences represents a "single mutation event".

The greater the number of changes, the more possible types of mutations.

Page 27:

[Figure: an original sequence, A C T G A A C G T A, shown with four kinds of change: single substitution, multiple substitution, coincidental substitution and parallel substitution]

Page 28:

DNA Substitution Mutations

Transition - a change between purines (A, G) or between pyrimidines (T, C).

Transversion - a change between a purine (A, G) and a pyrimidine (T, C).

Substitution mutations usually arise from mispairing of bases during replication.
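The two definitions above translate directly into a few lines of code. This is a minimal sketch; the function names are hypothetical, and it assumes two already aligned, gap-free sequences of equal length containing only A, C, G and T. The example pair reuses the two sequences shown on the molecular phylogenetics slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify a pair of aligned bases as identical, transition or transversion."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError(f"not a DNA base pair: {a!r}, {b!r}")
    if a == b:
        return "identical"
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1: str, seq2: str) -> tuple:
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    kinds = [classify_substitution(a, b) for a, b in zip(seq1, seq2)]
    return kinds.count("transition"), kinds.count("transversion")


print(classify_substitution("A", "G"))
print(classify_substitution("A", "T"))
print(count_ts_tv("AAGAATC", "AAGAGTT"))
```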

Page 29:

Correction for likelihood of mutations in DNA sequences

There are several evolutionary models used for correction for the likelihood of multiple mutations and reversions in DNA sequences.

These evolutionary models use a normalized distance measurement that is the average degree of change per length of aligned sequences.

Page 30:

Jukes & Cantor one-parameter model

This model assumes that substitutions between the 4 bases occur with equal frequency, meaning there is no bias in the direction of the change.

[Figure: the bases A, G, T and C, with the substitutions between them labelled α]

α is the rate of substitution in each of the 3 directions for one base. (α is the one parameter.)
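The slide describes the model but not the distance correction usually derived from it. As a hedged illustration, the sketch below uses the standard Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites; this formula is textbook material and an addition here, not something shown in the lecture. The function names are hypothetical.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Observed proportion of differing sites, skipping columns with a gap."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Expected substitutions per site under the one-parameter model.

    The correction is undefined once p reaches 3/4, the level expected
    for two unrelated random sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("GCTTGTCCGTTACGAT", "ACTTGTCTGTTACGAT")
print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the observed one, because it accounts for multiple substitutions and reversions at the same site.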

Page 31:

Kimura two-parameter model

This model assumes that transitions (A - G or T - C) occur more often than transversions (purine - pyrimidine).

[Figure: the bases, with the substitutions between them labelled α or β]

α is the rate of transitional substitutions.

β is the rate of transversional substitutions.
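As with the one-parameter model, the slide gives the assumptions but not the resulting distance. The sketch below uses the standard Kimura two-parameter formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of sites differing by a transition and by a transversion. The formula is an addition from the general literature and the function name is hypothetical.

```python
import math


def kimura_2p_distance(P: float, Q: float) -> float:
    """Expected substitutions per site under the two-parameter model.

    P: proportion of sites showing a transition.
    Q: proportion of sites showing a transversion.
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if P < 0 or Q < 0 or a <= 0 or b <= 0:
        raise ValueError("P and Q are outside the range the formula supports")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Two transitions and no transversions across 16 aligned sites.
print(kimura_2p_distance(2 / 16, 0.0))
```

When one third of the observed differences are transitions and two thirds are transversions (no transition bias), the value coincides with the one-parameter correction for the same total proportion of differences.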

Page 32:

These evolutionary models improve the distance calculations between the sequences.

These evolutionary models have less effect on phylogenetic predictions for closely related sequences.

These evolutionary models have more effect with distantly related sequences.

Page 33:

Modelling Evolution

Amino acid replacement models

Amino acid frequency and exchangeability models - e.g. the Dayhoff model, very old and simple, based on counting the observed replacements in very similar proteins and using this information for the parameters.

The Dayhoff model assumes that all sites in a protein evolve independently of one another.

This simple model assumes that, at each site, the process of amino-acid replacement is defined by a matrix of replacement rates.

Page 34:

Modelling Evolution

For each possible change from one of the twenty amino acid types to another, there is a corresponding matrix entry. With the Dayhoff model, all sites evolve according to the same rate matrix.

JTT model, an updated version of the Dayhoff model.

Yang model, a more advanced and clever model, allowing different sites in the sequence to evolve at different rates. This model assumes that, except for rate heterogeneity, all sites evolve according to the same process.

Page 35:

Modelling Evolution

A limitation of all the aforementioned models of amino acid replacement is that they are based on changes among 20 states, where each state represents an amino acid type.

In reality, evolution occurs at the level of DNA sequences. Therefore, it is preferable to frame models of sequence evolution in terms of codons rather than in terms of amino acids.

To date, most codon-based models have employed the assumption that all changes to a codon involve only one of the three codon positions.

Page 36:

Modelling Evolution

In the evolution literature, changes to a codon are termed synonymous when they do not alter the amino acid specified by the codon.

Other changes are termed nonsynonymous because they do result in a change of the amino acid being specified by the codon.

AAG -> Lys, AAA -> Lys
AAA -> Lys, AGA -> Arg
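The distinction can be checked mechanically with a codon table. The sketch below builds the standard genetic code from the conventional 64-letter string (bases ordered T, C, A, G) and classifies a change between two codons; the function name is hypothetical. The two calls mirror the example above: AAG and AAA both encode Lys, while AGA encodes Arg.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ('*' marks a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_codon_change(before: str, after: str) -> str:
    """Return 'synonymous' if both codons encode the same amino acid,
    otherwise 'nonsynonymous'."""
    before, after = before.upper(), after.upper()
    if before not in CODON_TABLE or after not in CODON_TABLE:
        raise ValueError("codons must be three DNA bases")
    same = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same else "nonsynonymous"


print(classify_codon_change("AAG", "AAA"))  # Lys -> Lys
print(classify_codon_change("AAA", "AGA"))  # Lys -> Arg
```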

Page 37:

The Markov Process Model

Is a mathematical model of infrequent changes of (discrete) states over time, in which future events occur by chance and depend only on the current state, and not on the history of how that state was reached.

In molecular phylogenetics, the states of the process are the possible nucleotides or amino acids present at a given time and position in a sequence. State changes in this case represent mutations in sequences.
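To make the "depends only on the current state" property concrete, the sketch below writes out the transition probabilities of the simplest nucleotide Markov model, the one-parameter model with rate α per direction. The closed form used (probability 1/4 + 3/4·e^(-4αt) of seeing the same base after time t) is the standard textbook result, added here as an assumption rather than taken from the lecture; the function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_transition_matrix(alpha: float, t: float) -> list:
    """4x4 matrix of P(base j at time t | base i at time 0), rows in ACGT order."""
    decay = math.exp(-4.0 * alpha * t)
    same = 0.25 + 0.75 * decay
    different = 0.25 - 0.25 * decay
    return [[same if i == j else different for j in range(4)] for i in range(4)]


def matmul(x: list, y: list) -> list:
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Markov property: evolving for time 1 and then for time 2 gives the same
# probabilities as evolving for time 3, whatever happened in between.
two_steps = matmul(jc_transition_matrix(0.1, 1.0), jc_transition_matrix(0.1, 2.0))
one_step = jc_transition_matrix(0.1, 3.0)
print(two_steps[0][0], one_step[0][0])
```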

Page 38:

Rate heterogeneity and the Gamma Distribution model

From published work, it is known that:

Mutation rates vary considerably amongst sites of DNA and amino acid sequences, because of biochemical factors, constraints of the genetic code, selection for gene function, etc.

This variation is often modeled using a gamma distribution of rates across sequence sites. The shape of the gamma distribution is controlled by a parameter a, and the distribution's mean and variance are 1 and 1/a, respectively.

Page 39:

Rate heterogeneity and the Gamma Distribution model

Large values of a (particularly a > 1) give a bell-curve-shaped distribution, suggesting little or no rate heterogeneity.

Small values of a give a reverse-J-shaped distribution, suggesting higher levels of rate heterogeneity along with many sites with low rates of evolution.
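The effect of the shape parameter can be seen by sampling per-site rates. The sketch below is an illustration only: it draws rates from a gamma distribution with shape a and scale 1/a (which gives mean 1 and variance 1/a), using Python's standard `random.gammavariate`. The helper names, the sample size and the 0.1 cutoff for a "slow" site are arbitrary choices, not values from the lecture.

```python
import random
import statistics


def sample_site_rates(a: float, n_sites: int, seed: int = 0) -> list:
    """Draw one relative rate per site from Gamma(shape=a, scale=1/a)."""
    rng = random.Random(seed)
    return [rng.gammavariate(a, 1.0 / a) for _ in range(n_sites)]


def slow_fraction(rates: list, cutoff: float = 0.1) -> float:
    """Fraction of sites evolving at less than `cutoff` times the mean rate."""
    return sum(r < cutoff for r in rates) / len(rates)


for a in (0.5, 1.0, 5.0):
    rates = sample_site_rates(a, 20000)
    print(a,
          round(statistics.mean(rates), 2),
          round(statistics.pvariance(rates), 2),
          round(slow_fraction(rates), 3))
```

With a small a, a large share of the sampled sites are nearly invariant while a few evolve very fast; with a large a, almost every site sits close to the mean rate.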

Page 40:

The molecular clock hypothesis assumes that...

Homologous sequences of DNA evolve at a constant and invariable rate across all taxa (the same rate along all tree branches).

The rate of mutation is the same for all positions along the sequence.

If true, this hypothesis would make it possible to determine divergence times and phylogenetic relationships more accurately.

However, publications show variable rates of mutation (evolution), which question the validity of this theory.

Therefore, the molecular clock hypothesis is most suitable for closely related species.

Page 41:

Rooted Tree = Cladogram

A phylogenetic tree in which all the "objects" on it share a known common ancestor (the root). There exists a particular root node.

A root is a taxon (sequence) that branched off earlier than all the other taxa on the tree, but is related to them.

[Figure: rooted tree with tips A, B and C, and the root labelled]

The paths from the root to the nodes correspond to evolutionary time.

Page 42:

Unrooted Tree = Phenogram

A phylogenetic tree where all the "objects" on it are related descendants, but there is not enough information to specify the common ancestor (root).

The paths between nodes of the tree do not specify an evolutionary time.

[Figure: unrooted tree]

Page 43:

Rooted versus Unrooted

The number of tree topologies of a rooted tree is much higher than that of an unrooted tree for the same number of OTUs.

Therefore, the error of the unrooted tree topology is smaller than that of the rooted tree.
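The size of the gap can be made concrete with the standard counting formulas for strictly bifurcating trees with n labelled tips: (2n - 5)!! unrooted topologies and (2n - 3)!! rooted ones, where !! denotes the double factorial. These formulas are well-known results added for illustration; the function names are hypothetical.

```python
def double_factorial(n: int) -> int:
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2 (1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def num_unrooted_topologies(n_otus: int) -> int:
    """Number of unrooted bifurcating topologies for n labelled OTUs (n >= 3)."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    return double_factorial(2 * n_otus - 5)


def num_rooted_topologies(n_otus: int) -> int:
    """Number of rooted bifurcating topologies for n labelled OTUs (n >= 2)."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return double_factorial(2 * n_otus - 3)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_topologies(n), num_rooted_topologies(n))
```

Each unrooted topology with n tips has 2n - 3 branches on which a root could be placed, which is why the rooted count is 2n - 3 times larger.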

Page 44:

Orthologs - genes related by speciation events. Meaning the same genes in different species.

Paralogs - genes related by duplication events. Meaning duplicated genes in the same species.

Page 45:

Selecting the dataset (sequences) for phylogenetic analysis

We use both types of sequences for reconstructing phylogenetic trees: protein and DNA.

For DNA sequences, the rate of mutation is assumed to be the same in both coding and non-coding regions.

However, there is a difference in the substitution rate between coding and non-coding regions.

Non-coding DNA regions are known to have more substitutions than coding DNA regions.

Page 46:

Selecting the dataset (sequences) for phylogenetic analysis

For proteins, the rate of mutation is very low in the conserved regions, or "functional regions" as we refer to them.

Most evolution algorithms can better analyze regions that mutate slowly, with a small number of changes in the multiple alignment.

Regions that have a "high number of changes" need special algorithms to deal with them successfully.

Page 47:

Selecting the dataset (sequences) for phylogenetic analysis

‡ Sequences that are being used as the dataset belong together (orthologs).

‡ If no ancestral sequence is available you may use an "outgroup" as a reference to measure distances. In such a case, for an outgroup you need to choose a close relative to the group being compared.

For example: if the group is of mammalian sequences then the outgroup should be a sequence from birds and not plants.

Page 48:

Known Problems of Multiple Alignments

Good analysis is based on good alignments.

Check the alignment to see that important sites are not misaligned by the software used for the sequence alignment.

Misalignment can affect the significance of the site, and the tree.

For example: ATG as a start codon, or specific amino acids in functional domains.

Page 49:

Known Problems of Multiple Alignments

Gaps (indels) in a multiple alignment are usually not scored (or plainly ignored) by most programs.

Gaps are not scored since there is no suitable model of the evolutionary mechanism that produces them.

Page 50:

Known Problems of Multiple Alignments

[Figure: the protein T Y R R with its coding DNA ACA TAC AGG CGA, aligned against gapped variants T Y R - (ACA TAC AGG ---) and T Y - R (ACA TAC --- CGA)]

An alignment of proteins that contains gaps should be compared with the alignment of their DNA coding regions.

The reason is to be sure about the placement of gaps, since the degeneracy of the cod.
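The figure's example can be worked through in code. In the sketch below, the protein T Y R is aligned to T Y R R, and at the protein level the gap could sit in either of the last two columns, because both are arginine. Threading the codons onto each candidate alignment shows which placement the DNA supports. The helper names are hypothetical, and the shorter coding sequence ACATACCGA is an assumed example consistent with the codons in the figure.

```python
def thread_codons(aligned_protein: str, dna: str) -> str:
    """Lay codons under an aligned protein: '---' under each gap,
    the next codon under each residue."""
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    return " ".join("---" if ch == "-" else next(codons)
                    for ch in aligned_protein)


def codon_identities(row1: str, row2: str) -> int:
    """Number of alignment columns holding the same codon in both rows."""
    return sum(x == y for x, y in zip(row1.split(), row2.split()))


reference = thread_codons("TYRR", "ACATACAGGCGA")
# Two protein-level gap placements that look equally good:
for candidate in ("TYR-", "TY-R"):
    row = thread_codons(candidate, "ACATACCGA")
    print(candidate, "|", row, "|", codon_identities(reference, row))
```

Because arginine is encoded here by two different codons (AGG and CGA), the codon-level comparison favours one gap placement, which the protein alignment alone cannot do.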

Page 51:

Known Problems of Multiple Alignments

♥ Low complexity regions - affect the multiple alignment because they create random bias for various regions of the alignment.

♥ Low complexity regions should be removed from the alignment before building the tree.

♥ Or, if taken into the dataset, they need special models of biased regions.

If you delete these regions you need to consider the effect of the deletions on the branch lengths of the whole tree.

Page 52: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

How to choose the best method?

1. Choose a set of related sequences (DNA or proteins).
2. Obtain a multiple alignment.
3. Is there a strong similarity?
   * Strong similarity: Maximum Parsimony
   * Distant (weak) similarity: Distance methods
   * Very weak similarity: Maximum Likelihood
4. Check the validity of the results.

Page 53: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Given a multiple alignment, how do we construct the tree?

A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E - AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG

Taken from Dr. Itai Yanai

Page 54: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

The best method to build a phylogenetic tree

It will extract the maximum amount of information available from the sequence data.

It will combine this information with prior knowledge of patterns of sequence evolution (evolutionary models), and will handle model parameters (such as the transition/transversion bias κ) whose values are not known a priori.

Page 55: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Building Phylogenetic Trees

Main methods:

Distance matrix methods: Neighbour Joining, UPGMA
Character based methods: Parsimony methods, Maximum Likelihood method
Validation methods: Bootstrapping, Jack Knife

Page 56: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Character Based Methods

All character based methods assume that each character substitution is independent of its neighbors.

Maximum Parsimony (minimum evolution): this method gives (builds) the one tree that requires the fewest changes to explain the differences observed in the data.

Page 57: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony Methods

In other words, the Maximum Parsimony method finds the tree that needs the fewest changes to turn any sequence (taxon) on the tree into all the other sequences on the tree.

If the number of changes per sequence position is relatively small, then maximum parsimony approximates ML, and its estimates of tree topology will be similar to those of ML estimation.

As more divergent sequences are analysed, the degree of homoplasy (i.e. parallel, convergent, reversed or superimposed changes) increases. The true evolutionary tree becomes less likely to be the one with the least number of changes, and parsimony methods fail.

Page 58: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Character Based Methods

Q: How do you find the minimum number of changes needed to explain the data in a given tree?

A: Construct the set of possible ways to get from one sequence to the other, and choose the "best" (for example: Maximum Parsimony).

CCGCCACGA   P P R
CGGCCACGA   R P R

Page 59: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony Methods

Furthermore, when the true tree has short internal branches and long terminal branches, a phenomenon can occur whereby the long branches appear to attract one another and can be erroneously inferred to be too closely related.

Combinations of conditions under which this occurs are often called the Felsenstein zone, and parsimony is particularly affected by this problem because of its inability to deal with homoplasy.

In the Felsenstein zone, parsimony becomes increasingly certain of the wrong tree, a property referred to as inconsistency.

Page 60: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony Methods

In addition, parsimony lacks an explicit model of evolution.

Although originally seen by some as a strength of parsimony, this has now become a limiting factor preventing fruitful feed-forward between phylogenetic analysis and investigation of the biological implications of different models of evolution.

Page 61: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony

Start by classifying the sites:

            123456789012345678901
Mouse       CTTCGTTGGATCAGTTTGATA
Rat         CCTCGTTGGATCATTTTGATA
Dog         CTGCTTTGGATCAGTTTGAAC
Human       CCGCCTTGGATCAGTTTGAAC
------------------------------------
Invariant   *  * ******** *****
Variant      ** *        *     **
------------------------------------
Informative  **                **
Non-inform.     *        *

Taken from Dr. Itai Yanai

Page 62: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

What's in a SITE

Variable site
Contains at least two types of nucleotides or amino acids. A variable site can be a singleton or parsimony-informative.

Singleton site
A site is called a singleton site if it contains at least two types of nucleotides (or amino acids) with at most one of them occurring multiple times.

Constant site
If a site contains the same nucleotide or amino acid in all sequences, it is referred to as a constant site.
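The site classes defined above can be sketched in a few lines of Python. This is only an illustration, not the procedure used by any particular package: the function name `classify_sites` is hypothetical, the alignment is the Mouse/Rat/Dog/Human example from the previous slide, and the rule used for "parsimony-informative" (at least two character states, each present in at least two sequences) is the usual textbook definition, which is an assumption here because the slides do not spell it out.

```python
from collections import Counter

def classify_sites(alignment):
    """Group 1-based column numbers into constant, singleton and informative sites.

    `alignment` maps a sequence name to an aligned sequence (no gaps assumed).
    """
    sequences = list(alignment.values())
    classes = {"constant": [], "singleton": [], "informative": []}
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        if len(counts) == 1:
            # Same character in every sequence.
            classes["constant"].append(position)
        elif sum(1 for c in counts.values() if c >= 2) >= 2:
            # Assumed definition: at least two states occur at least twice each.
            classes["informative"].append(position)
        else:
            # Variable, but at most one state occurs more than once.
            classes["singleton"].append(position)
    return classes

alignment = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}

sites = classify_sites(alignment)
print("informative:", sites["informative"])
print("singleton:  ", sites["singleton"])
print("constant:   ", len(sites["constant"]), "sites")
```

For this alignment the informative columns are 2, 3, 20 and 21 and the singleton columns are 5 and 14, which agrees with the "Informative" and "Non-inform." rows of the slide.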

Page 63: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

            123456789012345678901
Mouse       CTTCGTTGGATCAGTTTGATA
Rat         CCTCGTTGGATCATTTTGATA
Dog         CTGCTTTGGATCAGTTTGAAC
Human       CCGCCTTGGATCAGTTTGAAC
             ** *

[Figure: the three possible unrooted trees for Mouse, Rat, Dog and Human, drawn separately for Site 2, Site 3 and Site 5, with the nucleotide states marked on the tips and internal nodes of each tree.]

Taken from Dr. Itai Yanai

Page 64: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony Methods

The optimal tree is found by adding up, for each possible tree, the number of changes at each informative site.

The tree that requires the least number of changes is chosen.
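As a rough sketch of this counting step, the Python below scores the three possible four-taxon trees for the Mouse/Rat/Dog/Human alignment shown earlier. It uses Fitch's small-parsimony algorithm to count changes per site; that choice is an assumption made for illustration, since the slides do not name a specific counting algorithm, and the helper names (`fitch`, `parsimony_score`) are hypothetical. All columns are scored, not only the informative ones; the non-informative columns add the same number of changes to every tree, so the ranking is unaffected.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes) for one site.

    `tree` is a nested tuple of taxon names; `states` maps taxon name -> character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: one change is required on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    total = 0
    for i in range(n_sites):
        states = {name: alignment[name][i] for name in names}
        total += fitch(tree, states)[1]
    return total

alignment = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}

# The three unrooted topologies for four taxa, each written with an arbitrary root.
trees = {
    "((Mouse,Rat),(Dog,Human))": (("Mouse", "Rat"), ("Dog", "Human")),
    "((Mouse,Dog),(Rat,Human))": (("Mouse", "Dog"), ("Rat", "Human")),
    "((Mouse,Human),(Rat,Dog))": (("Mouse", "Human"), ("Rat", "Dog")),
}

scores = {label: parsimony_score(tree, alignment) for label, tree in trees.items()}
best = min(scores, key=scores.get)
for label, score in scores.items():
    print(label, score)
print("most parsimonious:", best)
```

With this alignment the tree grouping Mouse with Rat and Dog with Human needs 8 changes, against 10 and 11 for the two alternatives, so it is the one chosen.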

Page 65: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony Methods

* The Maximum Parsimony method is good for similar sequences, i.e. a group of sequences with a small amount of variation.
* Maximum Parsimony methods do not give the branch lengths, only the branch order.
* For larger sets it is recommended to use the "branch and bound" search instead of an exhaustive Maximum Parsimony search.

Page 66: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Parsimony Methods are Available…

For DNA
In programs: paup, molphy, phylo_win
In the Phylip package: DNAPars, DNAPenny, etc.

For Protein
In programs: paup, molphy, phylo_win
In the Phylip package: PROTPars

Page 67: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Character Based Methods: Maximum Likelihood

A HYPOTHESIS is a model of evolution to explain the data; LIKELIHOOD will choose the hypothesis that fits the data best.

Likelihood methods regard the observed data as a fixed observation and seek the values of the statistical parameters that provide the most probable description of those data, given the model of evolution.

Page 68: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Character Based Methods: Maximum Likelihood

The likelihood does not describe either the probability that the events under study happened (they did happen), or that the model is true.

Likelihood describes how likely it is that a given process (model), as opposed to some other process, is responsible for the observed data.

These properties make likelihood particularly suited to historical inference problems, in which the observed data arise only once.

Page 69: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Character Based Methods: Maximum Likelihood

The Maximum Likelihood method (like the Maximum Parsimony method) performs its analysis on each position of the multiple alignment. This is why the method is very heavy on CPU.

It starts with a set of sequences and finds estimates for the variability at each position.

It checks the rates of transitions and transversions (with both the Kimura and the Jukes & Cantor models).

At the end, after all positions in the sequence alignment have been checked, the likelihood of the whole tree is reported.
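To make the per-position idea concrete, here is a deliberately small Python sketch for the simplest possible case: two sequences only, scored under the Jukes & Cantor model. It is an assumption-laden toy, not what phylip, paup or tree-puzzle actually do on a full tree. The function name `jc69_log_likelihood` and the grid search are illustrative choices, and sequences A and B are taken from the six-sequence example used elsewhere in these slides.

```python
import math

def jc69_log_likelihood(distance, n_sites, n_diff):
    """Log-likelihood of a pair of aligned sequences under Jukes & Cantor.

    `distance` is the expected number of substitutions per site separating
    the two sequences. Each site contributes independently, so the per-site
    log-probabilities are summed over the whole alignment.
    """
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # one specific identical pair, e.g. A/A
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # one specific differing pair, e.g. A/G
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

seq_a = "GCTTGTCCGTTACGAT"
seq_b = "ACTTGTCTGTTACGAT"
n_sites = len(seq_a)
n_diff = sum(1 for x, y in zip(seq_a, seq_b) if x != y)

# Crude maximisation: evaluate the likelihood on a grid of candidate distances.
grid = [i / 1000 for i in range(1, 2001)]
ml_distance = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Closed-form Jukes & Cantor distance, for comparison.
analytic = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)

print("differences:", n_diff, "of", n_sites)
print("grid ML estimate:", ml_distance)
print("analytic JC distance:", round(analytic, 4))
```

The grid maximum and the closed-form value agree to within the grid spacing. On a real tree the same sum over positions has to be evaluated for every candidate topology and set of branch lengths, which is where the CPU cost mentioned above comes from.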

Page 70: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Maximum Likelihood method

Can be found in the following packages: phylip, paup, mega or tree-puzzle.

Page 71: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Distances (pairwise) Methods

Distance: the number of substitutions per site per time period.

Evolutionary distances are calculated based on one of the DNA evolutionary models.

Distance methods use the same models of evolution as ML to estimate the evolutionary distance between each pair of sequences from the set under analysis. A phylogenetic tree is then fitted to those distances.
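As an illustration of model-based pairwise distances, the sketch below computes the Jukes & Cantor distance and the Kimura two-parameter distance for two aligned DNA sequences using their standard closed-form formulas. The function names are hypothetical, the sequences are A and B from the six-sequence example in these slides, and the code assumes equal-length, gap-free sequences containing only A, C, G and T.

```python
import math

PURINES = {"A", "G"}

def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): the fractions of sites showing a transition / a transversion."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n

def jc69_distance(seq1, seq2):
    """Jukes & Cantor: d = -3/4 * ln(1 - 4p/3), with p the fraction of differing sites."""
    P, Q = transition_transversion_fractions(seq1, seq2)
    p = P + Q
    return -0.75 * math.log(1 - 4 * p / 3)

def k2p_distance(seq1, seq2):
    """Kimura two-parameter: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    P, Q = transition_transversion_fractions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

seq_a = "GCTTGTCCGTTACGAT"
seq_b = "ACTTGTCTGTTACGAT"

print("JC69:", round(jc69_distance(seq_a, seq_b), 4))
print("K2P: ", round(k2p_distance(seq_a, seq_b), 4))
```

Both corrected distances come out slightly larger than the raw proportion of differing sites (2/16 = 0.125), because the models allow for multiple substitutions at the same position.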

Page 72: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Distances Methods

The estimated distances will usually be ML estimates for each pair (considered independently of the other sequences), but the set of all pairwise distances will not be compatible with any tree.

So, a best-fitting phylogeny is derived using non-ML methods.

Page 73: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Distances Methods

Disadvantages of distance methods include the inevitable loss of evolutionary information when a sequence alignment is converted to pairwise distances, and the inability to deal with models containing parameters for which the values are not known a priori (e.g. κ above).

Page 74: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Distances Methods: Neighbor-Joining

Neighbors: a pair of sequences, in a sequence set, that have the smallest number of changes (substitutions) between them.

On a phylogenetic tree, neighbors are joined by a branch to the same node (common ancestor).

Some of the distance methods, such as UPGMA, use the molecular clock hypothesis. Most of the other distance methods, including Neighbor-Joining, do not.

Page 75: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Common steps to build a TREE used by Distances methods

1. Multiple alignment, based on all-against-all pairwise comparisons.
2. Building a distance matrix of all the compared sequences (all pairs of OTUs).
3. Disregarding the actual sequences.
4. Constructing a guide tree by clustering the distances. Iteratively build the relations (branches and internal nodes) between all OTUs.

Page 76: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Distance method steps

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

A - GCTTGTCCGTTACGAT
B - ACTTGTCTGTTACGAT
C - ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E - AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG

From http://www.icp.ucl.ac.be/~opperd/private/upgma.html

First, construct a distance matrix:

     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8
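The clustering step can be sketched as follows. This is a minimal UPGMA written for illustration only (the function name `upgma` and the data layout are assumptions, and ties are simply broken by alphabetical order). It takes the lower-triangular distance matrix for A to F shown on this slide, repeatedly joins the closest pair of clusters, and averages distances weighted by cluster size.

```python
from itertools import combinations

def upgma(labels, matrix):
    """Cluster `labels` with UPGMA.

    `matrix` maps a pair of labels, e.g. ("A", "B"), to their distance.
    Returns a Newick-like string and a list of (cluster id, node height).
    """
    clusters = {label: (label, 1) for label in labels}  # id -> (subtree, leaf count)
    dist = {frozenset(pair): d for pair, d in matrix.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ab = dist[frozenset((a, b))]
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = a + b
        # Distance from the new cluster to every other one: size-weighted mean.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (f"({tree_a},{tree_b})", size_a + size_b)
        merges.append((merged, d_ab / 2))  # the new node sits at half the distance
    return next(iter(clusters.values()))[0], merges

labels = list("ABCDEF")
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
matrix = {(labels[j], row): d for row, values in rows.items() for j, d in enumerate(values)}

newick, merges = upgma(labels, matrix)
print(newick)
for cluster, height in merges:
    print(cluster, height)
```

On this matrix A and B are joined first (node height 1), then C joins them and D joins E (height 2), the two groups merge at height 3, and F joins last at height 4. Because every leaf ends up at the same distance from the root, the result is an ultrametric tree.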

Page 77: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Distances Matrix Methods

Can be found in the following programs: Clustalw, Phylo_win, Paup

In the GCG software package: Paupsearch, distances

In the Phylip package: DNADist, PROTDist, Fitch, Kitch, Neighbor

Page 78: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Statistical tests for Tree Topology

In many applications (methods), the primary interest is in the topology of the inferred evolutionary tree.

As with estimates of model parameters, a single point estimate is of little value without some measure of the confidence we can place in it.

A popular way of assessing the robustness of a tree is by the method of non-parametric bootstrapping.

Page 79: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Bootstrapping is…

A statistical method by which distributions that are difficult to calculate exactly can be estimated by the repeated creation and analysis of artificial datasets, which represent the population.

In the non-parametric bootstrap, these datasets are generated by resampling from the original data, whereas in the parametric bootstrap, the data are simulated according to the hypothesis being tested.
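A minimal sketch of the non-parametric bootstrap for an alignment is given below. Columns are resampled with replacement to create each artificial dataset, a tree is chosen for each one, and the fraction of replicates that recover a given topology is reported. Several things here are assumptions made to keep the example self-contained: the four-taxon Mouse/Rat/Dog/Human alignment from the parsimony slides stands in for real data, the tree for each replicate is picked by a simple Fitch parsimony count, ties go to the first topology listed, and all function names are hypothetical.

```python
import random
from collections import Counter

def fitch_changes(tree, states):
    """Return (candidate states, minimum changes) for one site on a nested-tuple tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], states)
    right_set, right_cost = fitch_changes(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def best_topology(alignment, trees):
    """Return the label of the tree needing the fewest changes (first one on ties)."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    def score(tree):
        return sum(
            fitch_changes(tree, {name: alignment[name][i] for name in names})[1]
            for i in range(n_sites)
        )
    return min(trees, key=lambda label: score(trees[label]))

def bootstrap_support(alignment, trees, replicates=100, seed=0):
    """Count how often each topology is chosen across bootstrap replicates."""
    rng = random.Random(seed)
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    counts = Counter()
    for _ in range(replicates):
        # Resample alignment columns with replacement, keeping the original length.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(alignment[name][c] for c in columns) for name in names}
        counts[best_topology(resampled, trees)] += 1
    return counts

alignment = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}
trees = {
    "((Mouse,Rat),(Dog,Human))": (("Mouse", "Rat"), ("Dog", "Human")),
    "((Mouse,Dog),(Rat,Human))": (("Mouse", "Dog"), ("Rat", "Human")),
    "((Mouse,Human),(Rat,Dog))": (("Mouse", "Human"), ("Rat", "Dog")),
}

support = bootstrap_support(alignment, trees, replicates=100, seed=0)
for label in trees:
    print(label, f"{support[label]}/100")
```

In practice the resampling and the consensus step are usually done by dedicated tools and the tree for each replicate is built with the same method as the original tree; the point of the sketch is only the column resampling and the counting of how often a grouping reappears.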

Page 80: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Given the following tree, estimate the confidence of the two internal branches.

[Figure: tree relating Chimpanzee, Gorilla, Human, Orang-utan and Gibbon.]

Taken from Dr. Itai Yanai

Page 81: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Estimating Confidence from the Resamplings

[Figure: three alternative trees of Chimpanzee, Gorilla, Human, Orang-utan and Gibbon obtained from the resampled datasets, occurring in 41/100, 28/100 and 31/100 of the replicates.]

1. Of the 100 trees:
   In 100 of the 100 trees, gibbon and orang-utan are split from the rest.
   In 41 of the 100 trees, chimp and gorilla are split from the rest.

2. Upon the original tree we superimpose bootstrap values:

[Figure: the original tree of Chimpanzee, Gorilla, Human, Orang-utan and Gibbon, with bootstrap values 100 and 41 on its two internal branches.]

Taken from Dr. Itai Yanai

Page 82: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

What programs can draw Trees?

* Rooted trees should be plotted using the DRAWGRAM program (phylip), or similar.
* Unrooted trees should be plotted using the DRAWTREE program (phylip), or similar.
* On a PC use the TreeView/TreeExplorer/NJPlot program.

Page 83: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Known problems of using popular software for reconstructing Phylogenetic Trees

Order of the input data (sequences): the order of the input sequences affects the tree construction. You can "correct" this effect in some of the programs (phylip), using the Jumble option (J in phylip, set to 10).

The number of possible trees is huge for large datasets. Often it is not possible to construct all trees, so one can guarantee only "a good" tree, not the "best tree".
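To give a feel for how quickly the number of trees grows, the sketch below uses the standard counting formulas for fully bifurcating, labelled trees: (2n-5)!! unrooted trees and (2n-3)!! rooted trees for n sequences. The function names are illustrative.

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 (or 2)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3 labelled tips."""
    return double_factorial(2 * n_taxa - 5)

def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2 labelled tips."""
    return double_factorial(2 * n_taxa - 3)

for n in (4, 5, 10, 20):
    print(n, "sequences:", count_unrooted_trees(n), "unrooted,",
          count_rooted_trees(n), "rooted")
```

Four sequences give only 3 unrooted trees, but ten sequences already give 2,027,025, which is why heuristic searches that cannot promise the single best tree are used for larger datasets.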

Page 84: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Known problems of using popular software for reconstructing Phylogenetic Trees

The definition of "best tree" is ambiguous. It might mean the most likely tree, or a tree with the fewest changes, or a tree that best fits a known model, etc.

The trees that result from various methods differ from each other. Nevertheless, in order to compare trees, one needs to assume some evolutionary model so that the trees may be tested.

Page 85: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Known problems of using popular software for reconstructing Phylogenetic Trees

* Pay attention to the data used for the tree construction; use "informative" data, without large gaps.
* Population effects often have to be considered, especially if there is a lot of variety (a large number of alleles for one protein).

Page 86: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Increasing the robustness of the tree

The best possible phylogenetic estimates will arise from using robust inference methods allied with accurate evolutionary models.

However, after statistical assessment of the results it could still be necessary to attempt to improve the quality of inferences drawn.

The two most obvious ways of increasing the accuracy of a phylogenetic inference are:
* to include more sequences in the data
* to increase the length of the sequences used.

Page 87: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Increasing the robustness of the tree

Until recently, the likely effects of these approaches had not been well characterized.

One of these studies shows that adding more sequences to an analysis does not increase the amount of information relating to different parts of the tree uniformly over that tree, whereas the use of longer sequences results in a linear increase in information over the whole of the tree.

Page 88: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Increasing the robustness of the tree

Such methods allow questions of experimental design in phylogenetic analysis to be answered, for example:
* questions regarding the numbers and lengths of sequences in the data set
* identification of genes with optimal rates of evolution for phylogenetic inference.

Page 89: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

How many trees to build?

* For each dataset it is recommended to build more than one tree. Build a tree using a distance method and, if possible, also use a character-based method, like maximum likelihood.
* The core of the tree should be similar in both methods; otherwise you may suspect that your tree is incorrect.

Page 90: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001 May;18(5):691-9.

Whelan S, Lio P, Goldman N. Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet. 2001 May;17(5):262-72.

Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987 Jul;4(4):406-25.

Foster PG, Hickey DA. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J Mol Evol. 1999 Mar;48(3):284-90.

Page 91: Introduction to Phylogenetic Analysis Irit Orr Feb 2005reconstruction is to attempt to estimate the phylogeny for some data. The basic assumption is that: `For any collection of data

Muller T, Vingron M. Modeling amino acid replacement. J Comput Biol. 2000;7(6):761-76.

Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994 Sep;11(5):725-36.

Yang Z. Maximum-Likelihood Models for Combined Analyses of Multiple Sequence Data. J Mol Evol. 1996 May;42(5):587-96.