Post on 19-Dec-2015
Phylogeny
- A brief introduction in 4 hours -
Outline
• Introduction
• Practical approach
• Evolutionary models
• Distance-based methods / TP5_1
• Databases and software
• Sequence-based methods / TP5_2
What is phylogeny?
Phylogeny is the evolutionary history and relationship of species.
Why is phylogeny of interest in a proteomics course?
What data types can be used to infer phylogenies?
• Morphological characters
• Physiological characters
• Gene order (e.g. in mitochondria)
• Sequence data
  – Nucleotide sequences
  – Amino acid sequences
• Mixed characters
• …
What is a phylogenetic tree?
• A phylogenetic tree is a model of the evolutionary relationships between species (OTUs, operational taxonomic units), based on homologous characters
• But not all trees are phylogenetic trees
  – Dendrogram: general term for a branching diagram
  – Cladogram: branching diagram without branch length estimates
  – Phylogenetic tree or phylogram: branching diagram with branch length estimates
What is a phylogenetic tree?
• Rooted or unrooted
• Bifurcating or multifurcating (resolved or unresolved)
Gene duplication
• Prokaryotes: at least 50%
• Eukaryotes: >90%
After gene duplication
• Coexistence (normally only for a short while)
• Mostly, only one copy is retained; the other copy
  – becomes nonfunctional (non-functionalization),
  – becomes a pseudogene (pseudogenization), or
  – is lost
• Both copies are retained
  – Distinct expression pattern
  – Distinct subcellular location (rare)
  – One copy keeps the original function, the other copy acquires a new function (neofunctionalization)
  – Deleterious mutations in both copies (subfunctionalization)
Relationships within homologs
[Figure: gene tree starting from an ancestral gene, with a gene duplication. Leaves: Human gene A, Mouse gene A, Frog gene A, Human gene B, Mouse gene B, Frog gene B, Drosophila gene AB. Annotated groups: orthologs (two groups), paralogs, homologs.]
Homologs …
Homologs = Genes of common origin
Orthologs = 1. Genes resulting from a speciation event; 2. Genes originating from an ancestral gene in the last common ancestor of the compared genomes
Co-orthologs = Orthologs that have undergone lineage-specific gene duplications subsequent to a particular speciation event
Paralogs = Genes resulting from gene duplication
Inparalogs = Paralogs resulting from lineage-specific duplication(s) subsequent to a particular speciation event
Outparalogs = Paralogs resulting from gene duplication(s) preceding a particular speciation event
One-to-one (1:1) orthologs = Orthologs with no (known) lineage-specific gene duplications subsequent to a particular speciation event
One-to-many (1:n) orthologs = Orthologs of which at least one, and at most all but one, has undergone lineage-specific gene duplication subsequent to a particular speciation event
Many-to-many (n:n) orthologs = Orthologs which have undergone lineage-specific gene duplications subsequent to a particular speciation event
Xenologs = Homologs acquired by horizontal gene transfer from another lineage
Relationships between orthologs and paralogs
[Figure: the same gene tree (ancestral gene, gene duplication; leaves Human, Mouse and Frog genes A and B, Drosophila gene AB). Annotated groups: Orthologs (Group 1), Orthologs (Group 2), Outparalogs of Group 1, Inparalogs of Group 2, Co-orthologs of Drosophila gene AB.]
Practical approach I
Actin-related protein 2 (first 60 columns of the alignment)
ARP2_A MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
ARP2_B MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_C MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_D MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_E MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
       *:*     :* ******** ***  *** . **::****::*:  . *::::**:***:*

Species are:
• Caenorhabditis briggsae
• Drosophila melanogaster
• Homo sapiens
• Mus musculus
• Schizosaccharomyces pombe

Can you build a dendrogram (tree) for the sequences of the alignment?
Can you assign the species to the corresponding sequences of the alignment?
Phylogenetic analysis
1. Select data
2. Alignment
3. Select a data model
4. Select a substitution model
5. Tree-building
  • [Distance matrix]
  • Tree-building
6. Tree evaluation
Select data
• To be considered:
  – Input data must be homologous!
  – Number of character states
  – Content of phylogenetic information
  – Size of the dataset
  – Automatically clustered data from large datasets
  – etc.
Alignment
• MSA methods
  – ClustalW
  – MUSCLE
  – MAFFT
  – ProbCons
  – T-Coffee
  – …
• See previous course …
Data model
= Characters selected for the analysis
• To be considered:
  – Each character should be homologous!
  – Missing data (in some OTUs)
  – Number of characters
  – etc.
Evolutionary models
Phylogenetic tree-building presumes particular evolutionary models
The model used influences the outcome of the analysis and should be considered in the interpretation of the analysis results
• Which aspects are to be considered?
  1. Frequencies of aa exchange
  2. Change of aa frequencies during evolution
  3. Between-site rate variation, or among-site substitution rate heterogeneity
  4. Presence of invariable sites
Evolutionary models
Notation, e.g.
  JTT
  JTT + F
  JTT + F + gamma (4)
  JTT + F + gamma (8) + I (under discussion)
  JTT + F + I
It is not always the most complex model that produces the best result.
The more complex the model, the more complex the explanation of the results.
Tree-building methods
• Distance (matrix) methods
  1. Calculate distances for all pairs of taxa based on the sequence alignment
  2. Construct a phylogenetic tree based on a distance matrix
• Character-based (sequence) methods
  1. Construct a phylogenetic tree based on the sequence alignment
Step 1: Compute distances
1. Estimate the number of amino acid substitutions between sequence pairs
p distance: $\hat{p} = n_d / n$
  $\hat{p}$ = proportion (p distance)
  $n_d$ = number of aa differences
  $n$ = number of aa used
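As a minimal sketch of the p distance, the following Python snippet computes n_d / n for every pair of the ARP2 sequences from the practical example. Skipping columns that contain a gap in either sequence (pairwise deletion) is an assumption, and the names `ALIGNMENT` and `p_distance` are illustrative only.

```python
from itertools import combinations

# First 60 alignment columns of actin-related protein 2, as given in the slides.
ALIGNMENT = {
    "ARP2_A": "MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE",
    "ARP2_B": "MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE",
    "ARP2_C": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "ARP2_D": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "ARP2_E": "MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE",
}


def p_distance(seq1, seq2):
    """Return n_d / n for two aligned sequences.

    Assumption: columns with a gap ('-') in either sequence are not counted.
    """
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    n_d = sum(a != b for a, b in pairs)  # number of aa differences
    return n_d / len(pairs)              # n = number of aa used


# Print the pairwise distances (one line per sequence pair).
for x, y in combinations(ALIGNMENT, 2):
    print(f"{x} {y} {p_distance(ALIGNMENT[x], ALIGNMENT[y]):.3f}")
```

The resulting pairwise values are the entries of the distance matrix used in step 2.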
Step 1: Compute distances
• Nonlinear relationship of p with t (time)
• Estimation of aa substitutions
  – Poisson correction
    • PC distance
  – Gamma correction
    • Gamma distance
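The slides do not show the formulas for the two corrections. A hedged sketch using the forms commonly given in textbooks is d = -ln(1 - p) for the Poisson correction and d = a[(1 - p)^(-1/a) - 1] for the gamma correction, where a is the shape parameter of the gamma distribution. The default a = 2.0 below is an arbitrary assumption, not a recommendation from the course.

```python
import math


def poisson_distance(p):
    """PC distance: estimated substitutions per site, d = -ln(1 - p)."""
    return -math.log(1.0 - p)


def gamma_distance(p, a=2.0):
    """Gamma distance: d = a * ((1 - p) ** (-1 / a) - 1).

    `a` is the gamma shape parameter; 2.0 is only a placeholder value.
    """
    return a * ((1.0 - p) ** (-1.0 / a) - 1.0)


# Both corrections grow faster than p itself as sequences diverge.
for p in (0.1, 0.3, 0.5, 0.7):
    print(p, round(poisson_distance(p), 3), round(gamma_distance(p), 3))
```

Both corrected distances are larger than p because multiple substitutions at the same site are hidden in the observed proportion of differences.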
Step 2: Tree-building
Common distance methods
• Neighbor Joining (NJ)
• UPGMA / WPGMA
• Least Squares (LS)
• Minimum Evolution (ME)
Neighbor Joining (NJ)
• Saitou, Nei (1987)
• Principle
  – Clustering method
  – Simplified minimum evolution principle
  – Neighbors = taxa connected by a single node in an unrooted tree
  – Computational process: star tree, followed by a successive joining of neighbors and the creation of new pairs of neighbors
  – Result:
    • A single final tree with branch length estimates
    • Unrooted tree
Neighbor Joining (NJ)
• Sum of branch lengths in the star tree
• Calculate the sum of all branch lengths for all possible neighbors …
Neighbor Joining (NJ)
• Calculate the branch length X-Y

• Recalculate the sum of all branch lengths
Neighbor Joining (NJ)
• Advantage
  – Very efficient
  – Also for large datasets
• Disadvantage
  – Does not examine all possible topologies
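The joining procedure can be sketched in Python as follows. This is a minimal illustration that uses the Q-criterion formulation of NJ commonly found in textbooks, not code from the course. The function name, the `node0`, `node1`, … labels for internal nodes, and the small example matrix are all assumptions made for illustration.

```python
def neighbor_joining(names, dist):
    """Return the edges (node, node, branch_length) of an unrooted NJ tree.

    `names` is a list of taxon labels, `dist` a symmetric matrix (list of lists).
    """
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    nodes = list(names)
    edges = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other nodes.
        r = {a: sum(d[a, b] for b in nodes) for a in nodes}
        # Choose the pair of neighbors that minimizes the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a, b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{new_id}"  # hypothetical label for the new internal node
        new_id += 1
        # Branch lengths from the joined neighbors to the new node.
        len_a = 0.5 * d[a, b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a, b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[u, c] = d[c, u] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        d[u, u] = 0.0
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a, b]))  # connect the last two nodes
    return edges


# Illustrative (made-up) distance matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for edge in neighbor_joining(taxa, matrix):
    print(edge)
```

Each pass through the loop joins one pair of neighbors, so only one topology is ever built, which is why the method is fast but does not examine alternative trees.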
Bootstrap
• Used to test the robustness of a tree topology
• Bootstrap method: Bradley Efron (1979); applied to phylogenies by Felsenstein (1985)
• Principle: new MSA datasets are created by randomly drawing N columns, with replacement, from the original MSA, where N is the length of the original MSA
• 100-1000 replicates
• Bootstrap support values: (75%), 95%, 98%
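The resampling step can be sketched in a few lines of Python. The function name `bootstrap_alignment` and the toy alignment are illustrative assumptions. Building a tree from each replicate and counting how often a clade recurs is left to whichever tree-building method is used.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Create one bootstrap replicate of an MSA.

    Draws N column indices with replacement (N = alignment length) and
    applies the same indices to every sequence, so columns stay intact.
    """
    n = len(next(iter(alignment.values())))
    columns = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


# Toy alignment (made up) and 100 replicates with a fixed seed.
toy = {"s1": "MDSQGRKV", "s2": "MESAP-IV", "s3": "MDSKGRNV"}
rng = random.Random(42)
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

The support value of a branch is then the percentage of replicate trees in which the same grouping of taxa appears.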
TP5 - 1st part, Exercises 1-5
http://education.expasy.org/m07_phylo.html
Ortholog databases & phylogenetic databases
Some databases providing orthologous groups and trees
• COG/KOG
• HOGENOM
• Ensembl
• OMA browser
• OrthoDB
• OrthoMCL
• Pfam
• PANDIT
• SYSTERS
• TreeBase
• Tree of Life
Phylogenetic software
Software packages
• Freely available
  – Phylip
  – BioNJ
  – PhyML
  – Tree Puzzle
  – MrBayes
• Commercial
  – PAUP
  – MEGA
Phylogenetic servers
• http://www.phylogeny.fr/
• http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html
• http://atgc.lirmm.fr/phyml/
• http://phylobench.vital-it.ch/raxml-bb/
• http://www.fbsc.ncifcrf.gov/app/htdocs/appdb/drawpage.php?appname=PAUP
• http://power.nhri.org.tw/power/home.htm
Sequence methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference
Maximum Parsimony (MP)
• Originally developed for morphological characters
• Hennig, 1966
• William of Ockham: the best hypothesis is the one that requires the smallest number of assumptions
Maximum Parsimony (MP)
• Principle:
  – Estimate the minimum number of substitutions for a given topology
  – Parsimony-informative sites (exclude invariable sites and singletons)
  – Searching MP trees
    • Exhaustive search
    • Branch-and-bound (Hendy-Penny, 1982)
      – Good but time-consuming, if m>20
    • Heuristic search
      – Result tree might not be the most parsimonious tree
  – Result
    • Multiple result trees are possible (strict consensus tree, majority-rule consensus tree)
    • Most parsimonious tree vs true tree
    • Unrooted result trees
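To make the first two points of the principle concrete, here is a small Python sketch. It uses Fitch's algorithm, one standard way to count the minimum number of substitutions at a site for a fixed topology; the slides do not name a specific algorithm, so this choice is an assumption. Trees are written as nested pairs of leaf names, and all function and leaf names are illustrative.

```python
def is_parsimony_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = {}
    for state in column:
        counts[state] = counts.get(state, 0) + 1
    return sum(1 for c in counts.values() if c >= 2) >= 2


def fitch(tree, states):
    """Return (possible ancestral states, minimum substitutions) for one site.

    `tree` is a leaf name or a pair (left, right); `states` maps leaf -> state.
    The unrooted tree is rooted arbitrarily; this does not change the count.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra substitution
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


site = {"w": "A", "x": "A", "y": "G", "z": "G"}
print(fitch((("w", "x"), ("y", "z")), site)[1])  # topology grouping w with x
print(fitch((("w", "y"), ("x", "z")), site)[1])  # topology grouping w with y
print(is_parsimony_informative("AAGG"), is_parsimony_informative("AAAG"))
```

Summing the per-site counts over all informative sites gives the parsimony score of a topology; the search strategies listed above differ only in how many topologies are scored.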
Maximum Parsimony (MP)
• Advantages
  – Free from assumptions (model-free)
• Disadvantages
  – Does not take into account homoplasy
  – Long-branch attraction (LBA): creates wrong topologies, if the substitution rate varies extensively between lineages
Maximum Likelihood (ML)
• Cavalli-Sforza, Edwards (1967), gene frequency data
• Felsenstein (1981), nucleotide sequences
• Kishino (1990), proteins
• Principle
  – Maximizes the likelihood of observing the sequence data for a specific model of character state changes
  – Likelihood of a site = sum of the probabilities of every possible reconstruction of ancestral states at the internal nodes
  – Likelihood of the tree = product of the likelihoods for all sites (log-likelihood = sum of the site log-likelihoods)
  – Result = tree with the highest likelihood
    • For a given topology, the likelihood is maximized over the branch lengths; the topologies themselves are explored by a tree search
    • Search strategies: rarely exhaustive, mostly heuristic
      • NNI (nearest-neighbor interchange)
      • TBR (tree bisection and reconnection)
      • SPR (subtree pruning and regrafting)
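The site and tree likelihoods described above can be illustrated with a short Python sketch. It assumes an equal-rates substitution model with uniform state frequencies (a Jukes-Cantor-type stand-in, not JTT) and fixed branch lengths. The tree encoding and all names are illustrative. Summing over ancestral states is done recursively from the leaves upward (Felsenstein's pruning approach), which gives the same value as enumerating every reconstruction.

```python
import math


def transition_prob(same, t, k):
    """P(state y after branch length t | state x) under an equal-rates model."""
    e = math.exp(-k * t / (k - 1))
    return 1 / k + (k - 1) / k * e if same else 1 / k - e / k


def conditional(node, site, alphabet):
    """Map state x -> P(observed leaves below node | node has state x).

    A node is a leaf name or a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if site[node] == x else 0.0 for x in alphabet}
    k = len(alphabet)
    result = {x: 1.0 for x in alphabet}
    for child, t in node:
        below = conditional(child, site, alphabet)
        for x in alphabet:
            result[x] *= sum(transition_prob(x == y, t, k) * below[y]
                             for y in alphabet)
    return result


def site_likelihood(tree, site, alphabet):
    """Sum over root states, each weighted by a uniform frequency 1/k."""
    cond = conditional(tree, site, alphabet)
    return sum(cond[x] for x in alphabet) / len(alphabet)


def log_likelihood(tree, sites, alphabet):
    """Log-likelihood of the tree = sum of the site log-likelihoods."""
    return sum(math.log(site_likelihood(tree, s, alphabet)) for s in sites)


# Toy rooted tree ((A:0.1, B:0.2):0.05, C:0.3) with made-up branch lengths.
tree = (((("A", 0.1), ("B", 0.2)), 0.05), ("C", 0.3))
sites = [{"A": "G", "B": "G", "C": "G"}, {"A": "G", "B": "T", "C": "G"}]
print(log_likelihood(tree, sites, "ACGT"))
```

An ML program repeats this calculation while adjusting the branch lengths and rearranging the topology, keeping the tree with the highest value.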
Number of possible trees
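• Unrooted bifurcating trees with n leaves: $(2n-5)!! = \frac{(2n-5)!}{2^{n-3}(n-3)!}$ for $n \ge 3$

• Rooted bifurcating trees with n leaves: $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}(n-2)!}$ for $n \ge 2$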
Number of possible trees
Leaves  Unrooted    Rooted
3              1         3
4              3        15
5             15       105
6            105       945
7            945     10395
8          10395    135135
9         135135   2027025
10       2027025  34459425
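The counts in the table can be reproduced with a short Python sketch based on the double-factorial formulas (2n-5)!! for unrooted and (2n-3)!! for rooted bifurcating trees. The function names are illustrative.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees with n >= 3 leaves."""
    return double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees with n >= 2 leaves."""
    return double_factorial(2 * n - 3)


for n in range(3, 11):
    print(n, unrooted_trees(n), rooted_trees(n))
```

The rapid growth of these numbers is the reason exhaustive searches are rarely feasible and heuristic search strategies are used instead.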
Maximum Likelihood (ML)
• Methods:
  – ProML (Phylip)
  – PhyML
  – RAxML
  – …
Tree evaluation
1. Topology
  1. Comparison with species tree
  2. Robustness, e.g. bootstrap
2. Branch lengths
TP5 – 2nd part, Exercise 6
http://education.expasy.org/m07_phylo.html