Phylogenetic analysis

A brief introduction in 2 x 4 hours

[email protected]

© 2009 SIB

What you can learn today

• Understand trees
• Different types of gene relationships
• The difference between a cladogram and a phylogram
• Phylogenetic analysis methods
• Steps performed during a phylogenetic analysis
• Search strategies for tree topologies
• Measures for tree robustness
• Gene relationships and function prediction

Outline

• Introduction to phylogenetic analysis
• Application: Protein function prediction
• Databases, servers and software
• TP5 (practical)

Introduction

Phylogeny is the study of evolutionary relationships.

Phylogenetic analysis is the means of inferring evolutionary relationships.

[Figure: an ancestral genome diverging into the genomes of species 1 and species 2. Sources of change shown: polymorphisms and copy number variation (CNV); gene duplication, gene loss, gene fusion, gene fission, exon shuffling, retroposition, mobile elements, de novo gene origination; horizontal gene transfer (HGT).]

Trees

[Figure: two example trees with the taxa A to G, labelling the roots, internal nodes, end nodes and branches.]

Phylogenetic trees

• Cladogram

• Phylogram

The branch length represents the number of character changes

Molecular clock


Phylogenetic trees

• A phylogenetic tree is a model of the evolutionary relationships among operational taxonomic units (OTUs), based on homologous characters.

• But not all trees are phylogenetic trees

– Dendrogram: general term for a branching diagram

– Cladogram: branching diagram without branch length estimates

– Phylogram or phylogenetic tree: branching diagram with branch length estimates

Please note:

Guide trees produced during multiple sequence alignment have no phylogenetic meaning: the dendrograms are based on distances derived from pair-wise alignments; they are used to determine in what order sequences are aligned during the construction of the MSA.
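As a small illustration of the cladogram/phylogram distinction, the sketch below writes a tree in Newick notation (a common text format for trees that the slides do not mention) with branch lengths, then strips the lengths so that only the branching order remains. The taxon names and branch lengths are invented for illustration.

```python
import re

# Phylogram: Newick string with (made-up) branch lengths after each colon.
phylogram = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.15);"


def to_cladogram(newick: str) -> str:
    """Drop every ':<number>' branch length, keeping only the branching order."""
    return re.sub(r":[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", "", newick)


cladogram = to_cladogram(phylogram)
print(cladogram)  # ((A,B),(C,D));
```

Both strings describe the same topology; only the phylogram carries an estimate of how much change occurred along each branch.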


Rooted and unrooted trees

Outgroup

How many distinct trees?


Resolved (bifurcating) and unresolved (multifurcating) trees

[Figure: the taxa A to G drawn once as a fully resolved (bifurcating) tree and once as an unresolved (multifurcating) tree.]

Speciation and gene duplication

[Figure: two gene trees with the gene duplication events marked; one with the leaves A1, B1, C1, A2, B2, C2 and D, the other with the leaves A1, B1, B2, C, D, E and F.]

Relationships within homologs

[Figure: gene tree descending from an ancestral gene, with a gene duplication event and the leaves Human gene 1, Mouse gene 1, Frog gene 1, Human gene 2, Mouse gene 2, Frog gene 2 and Drosophila gene; brackets mark orthologs, paralogs and homologs.]

Relationships between orthologs and paralogs

[Figure: gene tree descending from an ancestral gene, with a gene duplication event and the leaves Human gene 1, Mouse gene 1, Frog gene 1, Human gene 2, Mouse gene 2, Frog gene 2 and Drosophila gene; labels mark orthologs (Group 1), orthologs (Group 2), inparalogs of Group 2, outparalogs of Group 1 and co-orthologs of the Drosophila gene.]

Gene trees versus species trees …


Gene relationships

Homologs = Genes of common origin
Orthologs = 1. Genes resulting from a speciation event, 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
Paralogs = Genes resulting from gene duplication
Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
One-to-many (1:n) orthologs = Orthologs of which at least one, and at most all but one, has undergone lineage-specific gene duplication subsequent to a particular speciation event
Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
Pseudo-orthologs = Paralogs with lineage-specific gene loss of orthologs
Xenologs = Orthologs derived by horizontal gene transfer from another lineage

Sequence data of actin-related protein 2

>Species A - RecName: Full=Actin-related protein 2;
MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE
ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR
EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR
RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL
VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH
IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV
LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR

>Species B - RecName: Full=Actin-related protein 2;
MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDE
ASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNR
EKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTR
RLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVL
VESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKH
IVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAV
LADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR

….

Phylogenetic analysis – an approach I

Species are: Caenorhabditis briggsae, Drosophila melanogaster, Homo sapiens, Mus musculus, Schizosaccharomyces pombe

ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE

ARP2_A AEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTF-FEKLKIDPRGRKILLTEPPMNPVANR
ARP2_B CSQLRQMLDINYPMDNGIVRNWDDMAHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR
ARP2_C ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR
ARP2_D ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
ARP2_E ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR

ARP2_A EKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVG
ARP2_B EKMFQVMFEQYGFNSIYVAVQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTR
ARP2_C EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR
ARP2_D EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
ARP2_E EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR

ARP2_A RLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVL
ARP2_B RLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVL
ARP2_C RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL
ARP2_D RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
ARP2_E RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL

ARP2_A MRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRA
ARP2_B SQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKH
ARP2_C VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH
ARP2_D VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
ARP2_E VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH

ARP2_A IVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAV
ARP2_B IVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAV
ARP2_C IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV
ARP2_D IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
ARP2_E IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV

ARP2_A LADIMAQND-HMWVSKAEWEEYGV-RALDKLGPRTT
ARP2_B LANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA-
ARP2_C LAEVTKDRD-GFWMSKQEYQEQGL-KVLQKLQKISH
ARP2_D LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR
ARP2_E LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR


Which sequence is likely to correspond to which species?


• ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• • ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE• • ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE• • ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE• • ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE•

© 2009 SIB

Distance matrix

     A    B    C    D    E
A    0    -    -    -    -
B  158    0    -    -    -
C  143  107    0    -    -
D  139   97   73    0    -
E  139   97   73    0    0

Species are: Caenorhabditis briggsae, Drosophila melanogaster, Homo sapiens, Mus musculus, Schizosaccharomyces pombe

Expected species tree for …

• Caenorhabditis briggsae
• Drosophila melanogaster
• Homo sapiens
• Mus musculus
• Schizosaccharomyces pombe

Phylogenetic analysis

1. Data selection

2. Data comparison

3. Selection of a data model

4. Selection of an evolutionary model

5. Tree-building

6. Tree evaluation


What data types can be used to infer phylogenies?

• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide sequences, amino acid sequences)
• Mixed characters
• …

Data selection

• To be considered:
– Input data must be homologous!
– Taxonomic range and taxonomic distribution (balance, avoid long branches)
– Content of phylogenetic information
– Number of character states
– Size of the dataset
– etc.


Data comparison

• To be considered:
– Prediction of characters that are derived from a common ancestor
– Choose a suitable alignment method
– Highly diverged sequences
  • Domain/family predictions
  • Structures

Alignment

• Pairwise alignment versus MSA
• MSA methods
– ClustalW (very fast)
– MUSCLE (very fast)
– MAFFT (fast)
– ProbCons
– T-Coffee
– …
• When to use which method and why?


Selection of a data model

• Characters to be selected for the analysis
• To be considered:
– Each position in the alignment should be homologous!
– Missing data (in some OTUs)
– Number of characters
– etc.

Selection of a data model

• Common methods
– Gap removal
– GBLOCKS


Evolutionary models

• Phylogenetic tree-building presumes particular evolutionary models

• The model chosen influences the outcome of the analysis and should be considered in the interpretation of the analysis results


Evolutionary models

• Which aspects are to be considered?
– …
– …
– …
– …
– etc.

Evolutionary models

• Which aspects are to be considered?
1. Frequencies of amino acid (aa) exchange
– …
– …
– …
– etc.

http://www.russell.embl-heidelberg.de/aas/other_images/lb3.gif

Frequencies of aa exchange

• Substitution matrices
– Empirically derived from alignment datasets
  • PAM (Dayhoff, 1968)
  • JTT (Jones, Taylor, Thornton, 1992)
  • Gonnet et al. (1992)
  • WAG (Whelan, Goldman, 2001)
  • mtREV (Adachi, Hasegawa, 1996, specific for mitochondrial data)
– Estimated rate matrix -> series of replacement probability matrices (e.g. PAM1 … PAM250)

Evolutionary models

• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
– …
– …
– etc.

Why?

[Figure: GC content]

Evolutionary models

• Which aspects are to be considered?

1. Frequencies of aa exchange

2. Change of aa frequencies during evolution
• GC content
– Differs between species (20-72%)
– Differs within a genome (isochores)
– Biased recombination-associated DNA repair
– Temperature

Evolutionary models

• Which aspects are to be considered?

1. Frequencies of aa exchange

2. Change of aa frequencies during evolution
• An exchangeability matrix can be built for a particular dataset
• JTT + F (JTT exchangeabilities combined with the aa frequencies observed in the dataset)

Evolutionary models

• Which aspects are to be considered?

1. Frequencies of aa exchange

2. Change of aa frequencies during evolution

3. Between-site rate variation or among-site substitution rate heterogeneity

Alignment


Evolutionary models

• Which aspects are to be considered?

1. Frequencies of aa exchange

2. Change of aa frequencies during evolution

3. Between-site rate variation or among-site substitution rate heterogeneity

• Variation in substitution rates among different positions
• Mostly discrete gamma model

[Figure: gamma distribution, annotated with the alpha (shape) parameter and the scaling factor; x-axis: relative evolutionary rate, y-axis: probability density. Source: http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png]

The gamma distribution is a continuous probability distribution.

Infinitely large alpha: the rate is the same for all sites (no rate variation)

alpha = 1: extensive rate variation

alpha < 1: many sites evolve very slowly (nearly invariable), a few evolve fast
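The formula behind the alpha parameter and the scaling factor did not survive in this transcript. As a reference point, a parameterisation commonly used for among-site rate variation (an assumption here, not necessarily the one on the slide) fixes the mean relative rate at 1 by setting the rate parameter equal to alpha:

```latex
f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\alpha r},
\qquad r > 0,
\qquad \mathrm{E}[r] = 1,
\qquad \mathrm{Var}[r] = \frac{1}{\alpha}
```

Under this form the variance of the relative rate is 1/alpha, so a very large alpha means that nearly all sites share the same rate, while a small alpha means strong rate heterogeneity.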

Evolutionary models

• Which aspects are to be considered?

1. Frequencies of aa exchange

2. Change of aa frequencies during evolution

3. Between-site rate variation or among-site substitution rate heterogeneity

• Variation in substitution rates among different positions
• Mostly discrete gamma model
• Select the number of categories (4/8)

Evolutionary models

• Which aspects are to be considered?

1. Frequencies of aa exchange

2. Change of aa frequencies during evolution

3. Between-site rate variation or among-site substitution rate heterogeneity

4. Presence of invariable sites

Evolutionary models

Notation, e.g.

JTT
JTT + F
JTT + F + gamma (4)
JTT + F + gamma (8) + I (under discussion)
JTT + F + I

• It is not always the most complex model that produces the best result.

• The more complex the model, the more complex the explanation of the results.


Evolutionary models

• Statistical selection of the best-fit model of evolution
– ProtTest
• AIC (Akaike Information Criterion)
– simple relationship between the likelihood and the number of parameters, used to estimate the distance of a model from truth
• BIC (Bayesian Information Criterion)
– includes a penalty for the number of parameters that also grows with the size of the dataset, to avoid overfitting of the selected model
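The two criteria above can be sketched in a few lines using their standard definitions, AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL, where k is the number of free parameters and lnL the maximised log-likelihood. Taking the alignment length as the sample size n is an assumption, and the log-likelihoods and parameter counts below are invented purely for illustration; they are not results from the course.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits of two models to one alignment: (lnL, number of parameters).
fits = {"JTT": (-5230.0, 7), "JTT + gamma": (-5205.0, 8)}
n_sites = 394  # assumed sample size: number of alignment columns

for model, (lnl, k) in fits.items():
    print(model, round(aic(lnl, k), 2), round(bic(lnl, k, n_sites), 2))

best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
```

With these made-up numbers the extra gamma parameter improves the likelihood by more than either penalty costs, so both criteria prefer the richer model; with a smaller likelihood gain they could disagree, because BIC penalises each parameter more heavily once ln(n) exceeds 2.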


Tree-building methods

• Distance (matrix) methods
1. Calculate distances for all pairs of taxa based on the sequence alignment
2. Construct a phylogenetic tree based on the distance matrix
• Character-based (sequence) methods
1. Construct a phylogenetic tree directly from the sequence alignment

Step 1: Compute distances

A simple measure of the extent of sequence divergence is the p distance:

$\hat{p} = n_d / n$

$\hat{p}$ = estimated proportion of differing sites (p distance)

$n_d$ = number of aa differences

$n$ = number of aa sites used
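The p distance defined above can be computed directly from two aligned sequences. The sketch below is a minimal version; skipping every column in which either sequence has a gap is an assumption (one common convention), and the function name is hypothetical. The example strings are the first ten alignment columns of ARP2 sequences from the alignment shown earlier.

```python
def p_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # gapped column: not counted in n
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Three differences in ten compared columns.
print(p_distance("MDSQGRKVVV", "MDSKGRNVIV"))  # 0.3
```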

Step 1: Compute distances

• Relationship of p with t (time)

[Figure: number of substitutions per site (y-axis, ticks at 0.5 and 1.0) plotted against time in million years (x-axis, ticks at 25, 50 and 75).]

Step 1: Compute distances

• Nonlinear relationship of p with t (time)

• Estimate the true number of amino acid substitutions between sequence pairs

– Poisson correction (PC distance)
– Gamma correction (Gamma distance)
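Both corrections turn an observed p distance into an estimate of the number of substitutions per site. The sketch below uses the textbook forms d = -ln(1 - p) for the Poisson correction and d = a[(1 - p)^(-1/a) - 1] for the gamma correction, where a is the gamma shape parameter; the function names are hypothetical and the value of a has to be chosen or estimated separately.

```python
import math


def poisson_distance(p: float) -> float:
    """Poisson-corrected distance: d = -ln(1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p)


def gamma_distance(p: float, a: float) -> float:
    """Gamma-corrected distance: d = a * ((1 - p) ** (-1 / a) - 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if a <= 0.0:
        raise ValueError("the shape parameter a must be positive")
    return a * ((1.0 - p) ** (-1.0 / a) - 1.0)


p = 0.3
print(round(poisson_distance(p), 4))     # larger than p: hidden changes are added back
print(round(gamma_distance(p, 1.0), 4))  # larger again when rates vary among sites
```

Both corrected distances exceed p because multiple substitutions at the same site are invisible in a simple count of differences, and the gamma-corrected value approaches the Poisson-corrected one as a becomes large.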


Step 2: Tree-building

Common distance methods

• Neighbor Joining (NJ)

• Unweighted/weighted pair-group method using arithmetic averages (UPGMA / WPGMA)

• Least Squares (LS)

• Minimum Evolution (ME)

Neighbor Joining (NJ)

• Saitou, Nei (1987)

• Principle

– Bottom-up clustering method
– Neighbours are defined as taxa connected by a single node in an unrooted tree; closest neighbours are successively joined by a new node until the tree is resolved.
– Result: A single, unrooted tree with branch length estimates
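The joining procedure can be sketched compactly. The code below is a minimal illustration of the standard neighbor-joining criterion, not the implementation of any particular package: at each step it joins the pair minimising Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r is the sum of distances from a node to all others. The input is the A to E distance matrix shown earlier in these slides; the node names and the edge-list output format are assumptions made for the example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (node, parent, branch_length) of an unrooted NJ tree.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from node i to every other current node
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # choose the pair with the smallest Q value
        i, j = min(
            combinations(nodes, 2),
            key=lambda pair: (n - 2) * d[frozenset(pair)] - r[pair[0]] - r[pair[1]],
        )
        dij = d[frozenset((i, j))]
        new = f"node{next_id}"
        next_id += 1
        # branch lengths from the joined pair to the new internal node
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, new, li))
        edges.append((j, new, dij - li))
        # distances from the new node to all remaining nodes
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Lower triangle of the distance matrix from the earlier slide.
raw = {("A", "B"): 158, ("A", "C"): 143, ("A", "D"): 139, ("A", "E"): 139,
       ("B", "C"): 107, ("B", "D"): 97, ("B", "E"): 97,
       ("C", "D"): 73, ("C", "E"): 73, ("D", "E"): 0}
distances = {frozenset(pair): float(value) for pair, value in raw.items()}

tree = neighbor_joining(["A", "B", "C", "D", "E"], distances)
for child, parent, length in tree:
    print(child, parent, length)
```

On this matrix D and E, which are at distance 0, are joined first with zero-length branches. Ties in Q are broken by input order here, which is an arbitrary choice of this sketch.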

Neighbor Joining (NJ)

[Figure from Saitou, Nei: Mol Biol Evol. 1987 Jul;4(4):406-25.]

Neighbor Joining (NJ)

• Advantage
– Very efficient
– Also for large datasets
• Disadvantage
– Does not examine all possible topologies

Character- (Sequence-) based methods

Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference

Maximum Parsimony (MP)

• Hennig, 1966
• Originally developed for morphological characters
• William of Ockham (1285-1349, Franciscan friar): the best hypothesis is the one that requires the smallest number of assumptions
• The topology of the result tree is the one that requires the smallest number of evolutionary changes
• Group of related methods

Maximum Parsimony (MP)

• Principle:
– Estimate the minimum number of substitutions for a given topology
– Parsimony-informative sites (shared-derived characters, exclude invariable sites and singletons)
– Searching MP trees
  • Exhaustive search
  • Branch-and-bound (Hendy-Penny, 1982)
  – Good but time-consuming, if m>20
  • Heuristic search
  – Result tree might not be the most parsimonious tree
– Result
  • Multiple result trees are possible (consensus tree)
  • Most parsimonious tree vs true tree
  • Unrooted result trees
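The notion of a parsimony-informative site can be made concrete with a short sketch. It uses the usual definition: a column is informative if at least two different character states each occur in at least two sequences, which excludes invariable sites and singletons. The function names are hypothetical, gap characters are treated like any other state (an assumption), and the five short sequences are the ones shown on the bootstrap slide later in this deck.

```python
from collections import Counter


def is_parsimony_informative(column) -> bool:
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if is_parsimony_informative(column)
    ]


alignment = ["ILKAEEK", "IVRSTQR", "IIRSSTK", "IIRSTTK", "LLKTTSR"]
print(informative_sites(alignment))  # [1, 2, 6]
```

Column 0 (I, I, I, I, L) is variable but not informative, because the single L is a singleton that costs one change on every possible topology.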

Maximum Parsimony (MP)

• Advantages
– Free from assumptions (model-free)
• Disadvantages
– Generally produces multiple result trees
– Does not take into account homoplasy
– Long-branch attraction (LBA): creates wrong topologies, if the substitution rate varies extensively between lineages

Maximum Likelihood (ML)

• Cavalli-Sforza, Edwards (1967), gene frequency data
• Felsenstein (1981), nucleotide sequences
• Kishino (1990), proteins
• Principle
– Calculates likelihoods for each position in the alignment and for all possible topologies (gaps generally removed)
– Result = tree with the highest likelihood
– Maximizes the likelihood of observing the sequence data for a specific model of character state changes
– For a given topology, the likelihood is maximized to estimate branch lengths, not the topology itself
• Search strategies: rarely exhaustive, mostly heuristic
– NNI (Nearest neighbor interchanges)
– TBR (Tree bisection-reconnection)
– SPR (Subtree pruning and regrafting)

Number of possible trees

[Table: number of possible rooted and unrooted trees as a function of the number of leaves.]
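The numbers in such a table follow from two standard counting formulas for strictly bifurcating, leaf-labelled trees: (2n-5)!! unrooted trees for n >= 3 leaves and (2n-3)!! rooted trees for n >= 2 leaves, where !! denotes the double factorial. The sketch below evaluates them; the function names are hypothetical.

```python
def double_factorial(k: int) -> int:
    """k * (k-2) * (k-4) * ... down to 1; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(leaves: int) -> int:
    """Number of unrooted bifurcating trees: (2n - 5)!!"""
    if leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * leaves - 5)


def n_rooted_trees(leaves: int) -> int:
    """Number of rooted bifurcating trees: (2n - 3)!!"""
    if leaves < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * leaves - 3)


print("leaves  rooted  unrooted")
for n in (3, 4, 5, 10):
    print(n, n_rooted_trees(n), n_unrooted_trees(n))
# 10 leaves already give 34459425 rooted and 2027025 unrooted trees
```

This growth is the reason exhaustive searches are rarely feasible and heuristic searches dominate in practice.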

Maximum Likelihood (ML)

• Software

– PhyML (fast)
– ProML (Phylip)
– ProtML
– RAxML (very fast)
– …

Bayesian estimation of phylogenies

• Very time-intensive
• Programs: MrBayes, PhyloBayes

[Figure: prior distribution and posterior distribution of probability (y-axis up to 1.0) over tree topologies 1, 2 and 3; the data (observations) lead from the prior to the posterior.]


Tree evaluation

Analyze how well the data support the result tree

Tests

1. Topology
• Tree reconciliation (comparison of the gene tree with the species tree)
• Robustness, e.g. bootstrap, aLRT (PhyML)

2. Branch length tests

Bootstrap

• Introduced by Bradley Efron (1979)
• Applied to phylogenies by Felsenstein (1985)
• Used to test the robustness of a tree topology
• Principle:
– New MSA datasets are created by randomly choosing N columns, with replacement, from the original MSA, where N is the length of the original MSA
– Phylogenetic analysis is then performed on all bootstrap replicates
– The consensus tree indicates bootstrap support for each node
• Mostly 1000 replicates (100 replicates for large datasets)
• Bootstrap support values: min. 98% (strict), min. 95% (accepted)

Create a bootstrap replicate

Seq_1 ILKAEEK
Seq_2 IVRSTQR
Seq_3 IIRSSTK
Seq_4 IIRSTTK
Seq_5 LLKTTSR

Bootstrap and Bayesian support values.
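Creating a bootstrap replicate amounts to drawing alignment columns at random, with replacement, until the replicate is as long as the original. The sketch below does this for the five short sequences above; the function name and the fixed seed are assumptions made so that the example is reproducible.

```python
import random


def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping rows intact."""
    length = len(next(iter(alignment.values())))
    # the same column indices are used for every sequence
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns) for name, seq in alignment.items()
    }


msa = {
    "Seq_1": "ILKAEEK",
    "Seq_2": "IVRSTQR",
    "Seq_3": "IIRSSTK",
    "Seq_4": "IIRSTTK",
    "Seq_5": "LLKTTSR",
}

rng = random.Random(42)  # fixed seed, only for reproducibility
replicate = bootstrap_replicate(msa, rng)
for name, seq in replicate.items():
    print(name, seq)
```

A tree is then built from each replicate with the same method as for the original alignment, and the fraction of replicate trees that contain a given branch is reported as its bootstrap support.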

PhyML aLRT

• approximate Likelihood-Ratio Test (aLRT)
• aLRT is a statistical test to compute branch supports: it uses the likelihood score of a branch to calculate the approximate probability that a particular branch really exists in the true tree. It is much faster than bootstrapping.

1. aLRT

2. Chi2: parametric branch support

3. aLRT-SH: non-parametric branch support based on a Shimodaira-Hasegawa-like procedure

4. aLRT Chi2 and SH: calculates parametric and non-parametric branch support; result is the minimum support of both methods

Application

Phylogenetic analysis for function prediction

Gene duplication

• Prokaryotes: at least 50%
• Eukaryotes: >90%

After gene duplication

• Coexistence (normally only for a short while)
• Mostly, only one copy is retained; the other copy
– becomes nonfunctional (non-functionalization),
– becomes a pseudogene (pseudogenization)
– is lost
• Both copies are retained
– Distinct expression pattern
– Distinct subcellular location (rare)
– One copy keeps the original function, the other copy acquires a new function (neofunctionalization)
– Deleterious mutations in both copies (subfunctionalization)

After gene duplication

• Synfunctionalization
1. Functional divergence of the paralogs (e.g. expression)
2. One paralog takes over (in part / fully) the function of the other paralog, which leads either to
  1. orthologs that are not functionally equivalent
  2. gene loss

Gene duplication followed by lineage-specific a) gene loss, b) function shuffling


Steroid hormone receptors

The cephalochordate Branchiostoma floridae diverged from other chordates after duplication of the ancestral steroid receptor (SR) gene.

BfER, the ortholog of vertebrate estrogen receptors, negatively regulates BfSR.

BfSR is specifically activated by estrogens and recognizes estrogen response elements.

Ancestral function: Bridgham JT, et al. PLoS Genet. 2008 Sep 12;4(9).

Phylogenomic databases

Some phylogenomic databases

• COG/KOG
• eggNOG
• Ensembl (Compara)
• HOGENOM
• InParanoid
• OMA browser
• OrthoDB
• OrthoMCL
• PhylomeDB

Phylogenomic databases differ in their
• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies

Software for phylogenetic analysis

Examples of software packages

• Phylip
• BioNJ
• PhyML
• PAML
• MEGA
• PAUP
• Tree Puzzle
• MrBayes

Servers for phylogenetic analysis

• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://atgc.lirmm.fr/phyml/
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm

Take home

• Phylogenetic trees are models, not knowledge
• Data selection is a very important step and can greatly facilitate phylogenetic analysis
• It is not always the most complex evolutionary model that leads to the best results, and complex models make the interpretation of the results more difficult!
• The most widely used tree-building method is ML
• Tree evaluation is the major step in phylogenetic analysis
• Orthology prediction is helpful for function assignment, but the function is only known when confirmed by wet lab experiments.

Further reading …

• Masatoshi Nei, Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press 2000.

• Dan Graur and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Sinauer Associates, Massachusetts.


TP5 1/2: Phylogenetic analysis

http://education.expasy.org/cours/phylo/MPB10_phylo_TP5.html


TP5 2/2 - Analysis Refinement, Interpretation of Results

Preparation for the next course (if you wish you can work in groups up to 5)

Phylogenetic analysis of the X,K-ATPase beta subunit family

• Collect homologs from chordates (human, macaque, mouse, rat, chicken, zebra finch, frog, zebrafish, fugu, Ciona intestinalis, C. savignyi); outgroup: Drosophila, Caenorhabditis elegans
• Perform a multiple sequence alignment
• Construct a data model
• Reconstruct a phylogenetic tree using ML
• Create one or more slides to present your analysis results to your colleagues next Monday. It is easiest to paste links to the analysis servers for alignments, trees, etc.
• Please send me your slides by Friday morning; if you work in groups, please indicate the names of your colleagues

Thank You

Remember:

Monday 17 December 2012

The course and practicals will take place in the computer room located at 10-12 Passage Baud-Bovy, behind UniMail