Phylogenetic analysis: a brief introduction in 2 x 4 hours [email protected]
© 2009 SIB
What you can learn today
• Understand trees
• Different types of gene relationships
• The difference between a cladogram and a phylogram
• Phylogenetic analysis methods
• Steps performed during a phylogenetic analysis
• Search strategies for tree topologies
• Measures for tree robustness
• Gene relationships and function prediction
Outline
• Introduction to phylogenetic analysis
• Application: Protein function prediction
• Databases, servers and software
• TP5
Introduction
Phylogeny is the study of evolutionary relationships.
Phylogenetic analysis is the means of inferring evolutionary relationships.
Ancestral genome
Genome species 1 Genome species 2
Polymorphisms - CNV
Gene duplication – Gene loss – gene fusion – gene fission - exon shuffling – retroposition – mobile elements – de novo gene origination
HGT HGT
Phylogenetic trees
• Cladogram
• Phylogram
The branch length represents the number of character changes
Molecular clock
Phylogenetic trees
• A phylogenetic tree is a model about the evolutionary relationship between operational taxonomic units (OTUs) based on homologous characters.
• But not all trees are phylogenetic trees
– Dendrogram: general term for a branching diagram
– Cladogram: branching diagram without branch length estimates
– Phylogram or phylogenetic tree: branching diagram with branch length estimates
Please note:
Guide trees produced during multiple sequence alignment have no phylogenetic meaning: the dendrograms are based on distances derived from pair-wise alignments; they are used to determine in what order sequences are aligned during the construction of the MSA.
Speciation and gene duplication
[Figure: example trees showing gene duplication events; leaf labels: A1, A2, B1, B2, C, C1, C2, D, E, F]
Human gene 1
Mouse gene 2
Mouse gene 1
Human gene 2
Frog gene 1
Frog gene 2
Drosophila gene
Orthologs
Orthologs
Paralogs
Homologs
Gene duplication
Ancestral gene
Relationships within homologs
Human gene 1
Mouse gene 2
Mouse gene 1
Human gene 2
Frog gene 1
Frog gene 2
Drosophila gene
Inparalogs of Group 2
Gene duplication
Ancestral gene
Co-orthologs of the Drosophila gene
Orthologs (Group 1)
Outparalogs of Group 1
Orthologs (Group 2)
Relationships between orthologs and paralogs
Gene relationships
Homologs = Genes of common origin
Orthologs = 1. Genes resulting from a speciation event; 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
Paralogs = Genes resulting from gene duplication
Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
One-to-many (1:n) orthologs = Orthologs of which at least one, and at most all but one, has undergone lineage-specific gene duplication subsequent to a particular speciation event
Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
Pseudo-orthologs = Paralogs that appear to be orthologs because the true orthologs were lost in a lineage-specific manner
Xenologs = Orthologs derived by horizontal gene transfer from another lineage
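The ortholog/paralog distinction can be sketched in code: two genes are orthologs if their last common ancestor in the gene tree is a speciation node, and paralogs if it is a duplication node. The tree below is an assumed reading of the "Relationships within homologs" slide (one duplication after the Drosophila lineage split off); the tuple encoding and the function names are illustrative only.

```python
# Gene tree as nested tuples: (event, left, right); leaves are gene names.
# The topology and event labels are an assumption based on the slide.
GENE_TREE = (
    "speciation",
    "Drosophila gene",
    ("duplication",
     ("speciation", "Frog gene 1",
      ("speciation", "Human gene 1", "Mouse gene 1")),
     ("speciation", "Frog gene 2",
      ("speciation", "Human gene 2", "Mouse gene 2"))),
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("need two distinct genes that are both in the tree")
    event, left, right = tree
    for child in (left, right):
        if {gene_a, gene_b} <= leaves(child):
            # Both genes sit in the same subtree: descend further.
            return relationship(child, gene_a, gene_b)
    return "orthologs" if event == "speciation" else "paralogs"


print(relationship(GENE_TREE, "Human gene 1", "Mouse gene 1"))
print(relationship(GENE_TREE, "Human gene 1", "Mouse gene 2"))
print(relationship(GENE_TREE, "Drosophila gene", "Human gene 2"))
```

In this sketch the Drosophila gene is classified as an ortholog of both human genes, which matches the co-ortholog case on the slide.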
Sequence data of actin-related protein 2
>Species A - RecName: Full=Actin-related protein 2;MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDEASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNREKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTRRLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVLVESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKHIVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAVLADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR
>Species B - RecName: Full=Actin-related protein 2;MDSQGRKVVV CDNGTGFVKC GYAGSNFPEH IFPALVGRPI IRSTTKVGNI EIKDLMVGDEASELRSMLEV NYPMENGIVR NWDDMKHLWD YTFGPEKLNI DTRNCKILLT EPPMNPTKNREKIVEVMFET YQFSGVYVAI QAVLTLYAQG LLTGVVVDSG DGVTHICPVY EGFSLPHLTRRLDIAGRDIT RYLIKLLLLR GYAFNHSADF ETVRMIKEKL CYVGYNIEQE QKLALETTVLVESYTLPDGR IIKVGGERFE APEALFQPHL INVEGVGVAE LLFNTIQAAD IDTRSEFYKHIVLSGGSTMY PGLPSRLERE LKQLYLERVL KGDVEKLSKF KIRIEDPPRR KHMVFLGGAVLADIMKDKDN FWMTRQEYQE KGVRVLEKLG VTVR
….
Phylogenetic analysis – an approach I
Species are:
Caenorhabditis briggsae
Drosophila melanogaster
Homo sapiens
Mus musculus
Schizosaccharomyces pombe
ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
*:* :* ******** *** *** . **::****::*: . *::::**:***:*

ARP2_A AEAVRSLLQVKYPMENGIIRDFEEMNQLWDYTF-FEKLKIDPRGRKILLTEPPMNPVANR
ARP2_B CSQLRQMLDINYPMDNGIVRNWDDMAHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR
ARP2_C ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR
ARP2_D ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
ARP2_E ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
.. :*.:*::.***:**::*::::* ::**:** :*:.**. *:******:** **

ARP2_A EKMCETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVG
ARP2_B EKMFQVMFEQYGFNSIYVAVQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTR
ARP2_C EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR
ARP2_D EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
ARP2_E EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
**: :.*** * *.. *:*:****:****** :***:********* **** . * **.

ARP2_A RLDVAGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVL
ARP2_B RLDIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVL
ARP2_C RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL
ARP2_D RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
ARP2_E RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
***:**** *.***.*** .** **.:******* :******:.*::* : .*: *****

ARP2_A MRNYTLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRA
ARP2_B SQQYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKH
ARP2_C VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH
ARP2_D VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
ARP2_E VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
.*******:*.:*.**:*.** ******:. * *::*: *. ***:*:* * :*.

ARP2_A IVLSGGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAV
ARP2_B IVLSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAV
ARP2_C IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV
ARP2_D IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
ARP2_E IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
******::**.*******.*:***:::***:.: : :**:.** .* *. **:****

ARP2_A LADIMAQND-HMWVSKAEWEEYGV-RALDKLGPRTT
ARP2_B LANLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA-
ARP2_C LAEVTKDRD-GFWMSKQEYQEQGL-KVLQKLQKISH
ARP2_D LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR
ARP2_E LADIMKDKD-NFWMTRQEYQEKGV-RVLEKLGVTVR
**:: :.* :*::. *::* *: . : **
Species are:
Caenorhabditis briggsae
Drosophila melanogaster
Homo sapiens
Mus musculus
Schizosaccharomyces pombe
Which sequence is likely to correspond to which species?
• ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• • ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE• • ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE• • ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• • ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE• • ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE• ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE•
© 2009 SIB
Distance matrix
A B C D E
A 0 - - - -
B 158 0 - - -
C 143 107 0 - -
D 139 97 73 0 -
E 139 97 73 0 0
Species are:
Caenorhabditis briggsae
Drosophila melanogaster
Homo sapiens
Mus musculus
Schizosaccharomyces pombe
Expected species tree for …
• Caenorhabditis briggsae
• Drosophila melanogaster
• Homo sapiens
• Mus musculus
• Schizosaccharomyces pombe
Phylogenetic analysis
1. Data selection
2. Data comparison
3. Selection of a data model
4. Selection of an evolutionary model
5. Tree-building
6. Tree evaluation
What data types can be used to infer phylogenies?
• Morphological characters
• Physiological characters
• Gene order
• Sequence data (nucleotide sequences, amino acid sequences)
• Mixed characters
• …
Data selection
• To be considered:
  – Input data must be homologous!
  – Taxonomic range and taxon distribution (balanced sampling; avoid long branches)
  – Content of phylogenetic information
  – Number of character states
  – Size of the dataset
  – etc.
Data comparison
• To be considered:
  – Prediction of characters that are derived from a common ancestor
  – Choose a suitable alignment method
  – Highly diverged sequences
    • Domain/family predictions
    • Structures
Alignment
• Pairwise alignment versus MSA
• MSA methods
  – ClustalW (very fast)
  – Muscle (very fast)
  – MAFFT (fast)
  – Probcons
  – T-coffee
  – …
• When to use which method and why?
Selection of a data model
• Characters to be selected for the analysis
• To be considered:
  – Each position in the alignment should be homologous!
  – Missing data (in some OTUs)
  – Number of characters
  – etc.
Evolutionary models
• Phylogenetic tree-building presumes particular evolutionary models
• The model chosen influences the outcome of the analysis and should be considered in the interpretation of the analysis results
Evolutionary models
• Which aspects are to be considered?
  1. Frequencies of aa exchange
  – …
  – …
  – …
  – etc.
Frequencies of aa exchange
• Substitution matrices
– Empirically derived from alignment datasets
• PAM (Dayhoff, 1968)
• JTT (Jones, Taylor, Thornton, 1992)
• Gonnet et al. (1992)
• WAG (Whelan, Goldman, 2001)
• mtREV (Adachi, Hasegawa, 1996, specific for mitochondrial data)
– Estimated rate matrix -> series of replacement probability matrices (e.g. PAM1 … PAM250)
Evolutionary models
• Which aspects are to be considered?
  1. Frequencies of aa exchange
  2. Change of aa frequencies during evolution
  – …
  – …
  – etc.
Why?
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
• GC content
  – Differs between species (20-72%)
  – Differs within a genome (isochores)
  – Biased recombination-associated DNA repair
  – Temperature
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
• An exchangeability matrix can be built for a particular dataset
• JTT + F
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation or Among-site substitution rate heterogeneity
• Variation in substitution rates among different positions
• Mostly discrete gamma model
• The gamma distribution is a continuous probability distribution with two parameters: the alpha (shape) parameter and a scaling factor
  – Infinitely large alpha value: the rate is the same for all sites (no rate variation)
  – alpha = 1: extensive rate variation
  – alpha < 1: most sites evolve very slowly (many nearly invariable sites), a few evolve fast
[Figure: gamma distribution probability density functions; x-axis: relative evolutionary rate, y-axis: probability density. Source: http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Gamma_distribution_pdf.png]
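The effect of alpha on the shape of the rate distribution can be explored with a few lines of code. This is a minimal sketch that assumes the parameterisation commonly used for among-site rate variation, in which the scaling factor is tied to alpha so that the mean relative rate is 1; the function name is illustrative.

```python
import math


def gamma_rate_density(rate, alpha):
    """Gamma density with shape alpha and mean 1 (scale = 1 / alpha)."""
    if rate <= 0 or alpha <= 0:
        raise ValueError("rate and alpha must be positive")
    return (alpha ** alpha) / math.gamma(alpha) \
        * rate ** (alpha - 1) * math.exp(-alpha * rate)


# Small alpha: density piles up near rate 0 (many slow sites).
# Large alpha: density concentrates around rate 1 (little rate variation).
for alpha in (0.5, 1.0, 5.0, 50.0):
    densities = [round(gamma_rate_density(r, alpha), 3) for r in (0.1, 1.0, 2.0)]
    print(f"alpha={alpha}: density at r=0.1, 1.0, 2.0 -> {densities}")
```

A discrete gamma model then approximates this continuous curve by a small number of equally probable rate categories.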
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation or Among-site substitution rate heterogeneity
• Variation in substitution rates among different positions
• Mostly discrete gamma model
• Select the number of categories (4/8)
Evolutionary models
• Which aspects are to be considered?
1. Frequencies of aa exchange
2. Change of aa frequencies during evolution
3. Between-site rate variation or Among-site substitution rate heterogeneity
4. Presence of invariable sites
Evolutionary models
Notation, e.g.
JTT
JTT + F
JTT + F + gamma (4)
JTT + F + gamma (8) + I (under discussion)
JTT + F + I
• It is not always the most complex model that produces the best result.
• The more complex the model, the more complex the explanation of the results.
Evolutionary models
• Statistical selection of best-fit models of evolution
  – ProtTest
• AIC (Akaike Information Criterion)
  – simple relationship between the likelihood and the number of parameters, used to estimate the distance of a model from the truth
• BIC (Bayesian Information Criterion)
  – penalizes the number of parameters more strongly as the dataset grows, to avoid overfitting of the selected model
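Both criteria can be written down in a few lines. The sketch below uses the standard definitions AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL; the log-likelihoods and parameter counts are made-up illustration values, and taking the alignment length as the sample size n is an assumption (other choices are in use).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits of two models to the same 400-column alignment:
# name -> (log-likelihood, number of free parameters). Values are invented.
fits = {
    "JTT": (-5230.0, 7),
    "JTT + F + gamma (4)": (-5190.0, 27),
}
for name, (lnl, k) in fits.items():
    print(name, round(aic(lnl, k), 1), round(bic(lnl, k, 400), 1))
```

With these invented numbers the two criteria disagree: AIC prefers the richer model, BIC the simpler one, because ln(400) is larger than 2 and every extra parameter costs more under BIC.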
Tree-building methods
• Distance (matrix) methods
1. Calculate distances for all pairs of taxa based on the sequence alignment
2. Construct a phylogenetic tree based on the distance matrix
• Character-based (sequence) methods
1. Construct a phylogenetic tree based on the sequence alignment
Step 1: Compute distances
Simple measure for the extent of sequence divergence:
p distance: p̂ = nd / n
p̂ = estimated proportion of differing sites (p distance)
nd = number of aa differences
n = number of aa used
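As a minimal sketch, the p distance of two aligned sequences can be computed as follows. Skipping columns that contain a gap in either sequence is an assumption made here (pairwise deletion); the example sequences are the short ones used on the bootstrap slide.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues among the compared (ungapped) columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # pairwise deletion: ignore gapped columns
        compared += 1
        if res_a != res_b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


print(p_distance("IIRSSTK", "IIRSTTK"))  # one difference in seven columns
print(p_distance("ILKAEEK", "IVRSTQR"))  # six differences in seven columns
```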
Step 1: Compute distances
• Relationship of p with t (time)
[Figure: number of substitutions per site (y-axis, ticks at 0.5 and 1.0) plotted against time in million years (x-axis, ticks at 25, 50 and 75)]
Step 1: Compute distances
• Nonlinear relationship of p with t (time)
• Estimate the true number of amino acid substitutions between sequence pairs
  – Poisson correction (PC distance)
  – Gamma correction (Gamma distance)
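Both corrections turn an observed p distance into an estimate of the number of substitutions per site. The sketch below uses the textbook formulas d = -ln(1 - p) for the Poisson correction and d = a[(1 - p)^(-1/a) - 1] for the gamma correction, where a is the gamma shape parameter; the function names are illustrative.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance: assumes equal rates at all sites."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p)


def gamma_distance(p, alpha):
    """Gamma-corrected distance: rates vary among sites with shape alpha."""
    if not 0 <= p < 1 or alpha <= 0:
        raise ValueError("need 0 <= p < 1 and alpha > 0")
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


# The correction grows with p, and more strongly when rates vary among sites.
for p in (0.1, 0.3, 0.5):
    print(p, round(poisson_distance(p), 3), round(gamma_distance(p, 1.0), 3))
```

For large alpha the gamma distance approaches the Poisson-corrected distance, which fits the statement that an infinitely large alpha means no rate variation among sites.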
Step 2: Tree-building
Common distance methods
• Neighbor Joining (NJ)
• Unweighted / weighted pair-group method using arithmetic averages (UPGMA / WPGMA)
• Least Squares (LS)
• Minimum Evolution (ME)
Neighbor Joining (NJ)
• Saitou, Nei (1987)
• Principle
  – Bottom-up clustering method
  – Neighbours are defined as taxa connected by a single node in an unrooted tree; closest neighbours are successively joined by a new node until the tree is resolved.
  – Result: A single, unrooted tree with branch length estimates
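The first joining step of NJ can be sketched with the distance matrix from the "Distance matrix" slide. The pair to join is the one that minimises the standard criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). Only this selection step is shown; a full implementation would also compute branch lengths and update the matrix after each join.

```python
LABELS = ["A", "B", "C", "D", "E"]
# Symmetric version of the lower-triangular matrix shown on the slide.
DIST = [
    [0, 158, 143, 139, 139],
    [158, 0, 107, 97, 97],
    [143, 107, 0, 73, 73],
    [139, 97, 73, 0, 0],
    [139, 97, 73, 0, 0],
]


def nj_first_pair(labels, dist):
    """Return the pair with the smallest Q value, together with that value."""
    n = len(labels)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (labels[i], labels[j]), q
    return best_pair, best_q


pair, q_value = nj_first_pair(LABELS, DIST)
print(pair, q_value)
```

For this matrix the two identical sequences D and E are joined first.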
Neighbor Joining (NJ)
• Advantage
  – Very efficient
  – Also for large datasets
• Disadvantage
  – Does not examine all possible topologies
Character- (Sequence-) based methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference
Maximum Parsimony (MP)
• Hennig, 1966
• Originally developed for morphological characters
• William of Ockham (1285-1349, Franciscan friar): the best hypothesis is the one that requires the smallest number of assumptions
• The topology of the result tree is the one that requires the smallest number of evolutionary changes
• Group of related methods
Maximum Parsimony (MP)
• Principle:
  – Estimate the minimum number of substitutions for a given topology
  – Parsimony-informative sites (shared-derived characters, exclude invariable sites and singletons)
  – Searching MP trees
    • Exhaustive search
    • Branch-and-bound (Hendy-Penny, 1982)
      – Good but time-consuming, if m>20
    • Heuristic search
      – Result tree might not be the most parsimonious tree
  – Result
    • Multiple result trees are possible (consensus tree)
    • Most parsimonious tree vs true tree
    • Unrooted result trees
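Counting the minimum number of substitutions for a given topology can be sketched with Fitch's algorithm, one common way to do this for unordered characters (the slides do not name a specific algorithm). The tree topology below is hypothetical, and the sequences are the five short ones from the bootstrap slide.

```python
ALIGNMENT = {
    "Seq_1": "ILKAEEK",
    "Seq_2": "IVRSTQR",
    "Seq_3": "IIRSSTK",
    "Seq_4": "IIRSTTK",
    "Seq_5": "LLKTTSR",
}
# Hypothetical topology as nested pairs; any rooting gives the same score.
TREE = (("Seq_1", "Seq_5"), ("Seq_2", ("Seq_3", "Seq_4")))


def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: take the union and count one substitution.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimum number of changes over all columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site_states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, site_states)[1]
    return total


print(parsimony_score(TREE, ALIGNMENT))
```

A tree search (exhaustive, branch-and-bound or heuristic) would evaluate this score for many candidate topologies and keep the ones with the lowest value.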
Maximum Parsimony (MP)
• Advantages
  – Free from assumptions (model-free)
• Disadvantages
  – Generally produces multiple result trees
  – Does not take into account homoplasy
  – Long-branch attraction (LBA): creates wrong topologies, if the substitution rate varies extensively between lineages
Maximum Likelihood (ML)
• Cavalli-Sforza, Edwards (1967), gene frequency data
• Felsenstein (1981), nucleotide sequences
• Kishino (1990), proteins
• Principle
  – Calculates likelihoods for each position in the alignment and for all possible topologies (gaps generally removed)
  – Result = tree with the highest likelihood
  – Maximizes the likelihood of observing the sequence data for a specific model of character state changes
  – Maximized to estimate branch lengths, not topologies
• Search strategies: rarely exhaustive, mostly heuristic
  – NNI (Nearest neighbor interchanges)
  – TBR (Tree bisection-reconnection)
  – SPR (Subtree pruning and regrafting)
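Why exhaustive searches are rare becomes clear when counting the candidate trees: for n taxa there are (2n - 5)!! = 3 x 5 x ... x (2n - 5) distinct unrooted, fully bifurcating topologies. A short sketch (the function name is illustrative):

```python
def n_unrooted_topologies(n_taxa):
    """Number of unrooted bifurcating tree topologies for n_taxa leaves."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


for n in (4, 5, 10, 20, 50):
    print(n, n_unrooted_topologies(n))
```

The count grows so quickly that heuristic moves such as NNI, SPR and TBR, which only explore the neighbourhood of a current tree, are the practical choice for all but the smallest datasets.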
Maximum Likelihood (ML)
• Software
  – PhyML (fast)
  – ProML (Phylip)
  – ProtML
  – RAxML (very fast)
  – …
Bayesian estimation of phylogenies
• Very time-intensive
• Programs: MrBayes, PhyloBayes
[Figure: prior distribution and posterior distribution over tree topologies 1, 2 and 3 (y-axis: probability, up to 1.0); the data (observations) turn the prior into the posterior]
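The step from prior to posterior in the figure is Bayes' rule applied to the three candidate topologies: posterior is proportional to prior times likelihood. The numbers below are invented for illustration; programs such as MrBayes do not enumerate topologies like this but sample trees and parameters from the posterior with MCMC.

```python
def posterior(prior, likelihood):
    """Normalise prior x likelihood over a finite set of hypotheses."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("at least one hypothesis needs non-zero support")
    return [value / total for value in joint]


# Flat prior over three topologies; hypothetical likelihoods of the data.
prior = [1 / 3, 1 / 3, 1 / 3]
likelihood = [0.2, 0.1, 0.1]
for i, prob in enumerate(posterior(prior, likelihood), start=1):
    print(f"Tree topology {i}: posterior probability {prob:.2f}")
```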
Tree evaluation
Analyze how well the data supports the result tree
Tests
1. Topology
  • Tree reconciliation (comparison of the gene tree with the species tree)
  • Robustness, e.g. bootstrap, aLRT (PhyML)
2. Branch length tests
Bootstrap
• Bootstrap by Bradley Efron (1979)
• Applied to phylogenies by Felsenstein (1985)
• Used to test the robustness of a tree topology
• Principle:
  – New MSA datasets are created by randomly drawing N columns, with replacement, from the original MSA, where N is the length of the original MSA
  – Phylogenetic analysis is then performed on all bootstrap replicates
  – The consensus tree indicates bootstrap support for each node
• Mostly 1000 replicates (100 replicates for large datasets)
• Bootstrap support values: min. 98% (strict), min. 95% (accepted)
Create a bootstrap replicate
Seq_1 ILKAEEK
Seq_2 IVRSTQR
Seq_3 IIRSSTK
Seq_4 IIRSTTK
Seq_5 LLKTTSR
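Creating one replicate from the five sequences above can be sketched as follows: column indices are drawn at random with replacement, and the same indices are applied to every sequence so that columns stay intact. The fixed seed is only there to make the illustration reproducible.

```python
import random

ALIGNMENT = {
    "Seq_1": "ILKAEEK",
    "Seq_2": "IVRSTQR",
    "Seq_3": "IIRSSTK",
    "Seq_4": "IIRSTTK",
    "Seq_5": "LLKTTSR",
}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    n_columns = len(next(iter(alignment.values())))
    chosen = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {
        name: "".join(seq[c] for c in chosen)
        for name, seq in alignment.items()
    }


replicate = bootstrap_replicate(ALIGNMENT, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Some original columns appear several times in a replicate and others not at all; a tree is then built from each replicate, and the fraction of replicate trees that contain a given branch is its bootstrap support.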
PhyML aLRT
• approximate Likelihood-Ratio Test (aLRT)
• aLRT is a statistical test to compute branch supports: it uses the likelihood score of a branch to calculate the approximate probability that a particular branch really exists in the true tree. It is much faster than bootstrapping.
1. aLRT
2. Chi2: parametric branch support
3. aLRT-SH: non-parametric branch support based on a Shimodaira-Hasegawa-like procedure
4. aLRT Chi2 and SH: calculates parametric and non-parametric branch support; result is the minimum support of both methods
After gene duplication
• Coexistence (normally only for a short while)
• Mostly, only one copy is retained; the other copy
  – becomes nonfunctional (non-functionalization),
  – becomes a pseudogene (pseudogenization), or
  – is lost
• Both copies are retained
  – Distinct expression pattern
  – Distinct subcellular location (rare)
  – One copy keeps the original function, the other copy acquires a new function (neofunctionalization)
  – Deleterious mutations in both copies, so that both are needed to maintain the original function (subfunctionalization)
After gene duplication
• Synfunctionalization
  1. Functional divergence of the paralogs (e.g. expression)
  2. One paralog takes over (in part / fully) the function of the other paralog, which leads either to
    – orthologs that are not functionally equivalent, or
    – gene loss
Steroid hormone receptors
Cephalochordate Branchiostoma floridae diverged from other chordates after duplication of the ancestral SR gene.
BfER, the ortholog of vertebrate estrogen receptors, negatively regulates BfSR.
BfSR is specifically activated by estrogens and recognizes estrogen response elements.
ancestral function: Bridgham JT, et al: PLoS Genet. 2008 Sep 12;4(9);
Phylogenomic databases
Some phylogenomic databases
• COG/KOG
• eggNOG
• Ensembl (Compara)
• HOGENOM
• InParanoid
• OMA browser
• OrthoDB
• OrthoMCL
• PhylomeDB
Phylogenomic databases differ in their
• Goals
• Methodologies
• Number of species
• Taxonomic range
• Hierarchies
• Result presentation
• Update frequencies
Software for phylogenetic analysis
Examples of software packages
• Phylip
• BioNJ
• PhyML
• PAML
• MEGA
• PAUP
• Tree Puzzle
• MrBayes
Servers for phylogenetic analysis
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://atgc.lirmm.fr/phyml/
• http://phylobench.vital-it.ch/raxml-bb/
• http://power.nhri.org.tw/power/home.htm
Take home
• Phylogenetic trees are models - not knowledge
• Data selection is a very important step and can largely facilitate phylogenetic analysis
• It is not always the most complex evolutionary model that leads to the best results, and complex models make the interpretation of the results more difficult!
• The most applied tree-building method is ML
• Tree evaluation is the major step in phylogenetic analysis
• Orthology prediction is helpful for function assignment, but the function is only known when confirmed by wet lab experiments.
Further reading …
• Masatoshi Nei, Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press 2000.
• Dan Graur and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Sinauer Associates, Massachusetts.
TP5 1/2: Phylogenetic analysis
http://education.expasy.org/cours/phylo/MPB10_phylo_TP5.html
TP5 2/2 - Analysis Refinement, Interpretation of Results
Preparation for the next course (if you wish you can work in groups up to 5)
Phylogenetic analysis of the X,K-ATPase beta subunit family
• Collect homologs from chordates (human, macaque, mouse, rat, chicken, zebra finch, frog, zebrafish, fugu, Ciona intestinalis, C. savignyi); outgroup: Drosophila, Caenorhabditis elegans
• Perform a multiple sequence alignment
• Construct a data model
• Reconstruct a phylogenetic tree using ML
• Create one or more slides to present your analysis results to your colleagues next Monday. It is easiest if you paste links to the analysis servers for alignments, trees, etc.
• Please send me your slides by Friday morning; if you work in groups, please indicate the names of your colleagues