20150331 phylogenetic treestheory.bio.uu.nl/bpa/2015/slides/20150331_phylogenetic... · 2015. 3....

6
31Mar15 1 Phylogenetic trees Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 31 st 2015 A case story In August 1994 a nurse in Lafayette, LA, tests negative for HIV A few weeks later, she breaks off a messy 10 year affair with a doctor Three weeks later, while suffering from chronic fatigue symptoms, the doctor gives his exmistress a vitamin B12 shot, somewhat against her will In January 1995, the nurse tests positive for both HIV and hepatitis C. Investigation reveals no obvious means of infection (positive test for a sexual partner, accident with a patient, et cetera). The vitamin B12 shot becomes suspicious The doctor’s office records from the day are conveniently missing but eventually found by police buried in the back of a closet. The records show that the doctor had withdrawn blood samples from a known HIV patient and a known hepatitis C patient the same day as the vitamin B12 shot. The record keeping is not in line with standard office procedure and there is no information as to what happened to either blood sample The nurse never had contact with either patient Seemingly strong, but otherwise circumstantial, evidence that the doctor deliberately infected the nurse with HIV and hepatitis C Case story continued HIV evolves very fast This is partly why it has been so difficult to develop a cure Can we show that the HIV in the nurse is related to the HIV from the patient? 1. Take samples of HIV from the nurse 2. Take samples of HIV from the patient 3 Take samples of HIV from other HIV positive people from the 3. Take samples of HIV from other HIV positive people from the same town 4. Sequence HIV gene sequences 5. Construct a phylogeny of the HIV HIV phylogeny HIV strains found in patient } } HIV strains found in victim } } HIV strains found in other individuals from Lafayette Phylogenetic trees A phylogenetic tree represents the phylogeny of species or sequences Evolutionary signatures reveal the phylogenetic history Phylotenetic trees contain: Present day sequences Ancestral nodes Time Ancestral nodes A root The same tree can be represented in many different ways:

Upload: others

Post on 30-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • 31‐Mar‐15

    1

    Phylogenetic trees

    Bas E. DutilhSystems Biology: Bioinformatic Data Analysis

    Utrecht University, March 31st 2015

    A case story• In August 1994 a nurse in Lafayette, LA, tests negative for HIV• A few weeks later, she breaks off a messy 10 year affair with a doctor• Three weeks later, while suffering from chronic fatigue symptoms, the 

    doctor gives his ex‐mistress a vitamin B‐12 shot, somewhat against her will• In January 1995, the nurse tests positive for both HIV and hepatitis C. 

    Investigation reveals no obvious means of infection (positive test for a sexual partner, accident with a patient, et cetera). The vitamin B‐12 shot becomes suspicious

    • The doctor’s office records from the day are conveniently missing but eventually found by police buried in the back of a closet. The records show that the doctor had withdrawn blood samples from a known HIV patient and a known hepatitis C patient the same day as the vitamin B‐12 shot. The record keeping is not in line with standard office procedure and there is no information as to what happened to either blood sample

    • The nurse never had contact with either patient• Seemingly strong, but otherwise circumstantial, evidence that the doctor 

    deliberately infected the nurse with HIV and hepatitis C

    Case story continued• HIV evolves very fast

    – This is partly why it has been so difficult to develop a cure• Can we show that the HIV in the nurse is related to the HIV from the patient?1. Take samples of HIV from the nurse2. Take samples of HIV from the patient3 Take samples of HIV from other HIV positive people from the3. Take samples of HIV from other HIV positive people from the 

    same town4. Sequence HIV gene sequences5. Construct a phylogeny of the HIV

    HIV phylogeny

    HIV strains found in patient

    }}

    HIV strains found in victim}

    } HIV strains found in other individuals from Lafayette

    Phylogenetic trees• A phylogenetic tree represents thephylogeny of species or sequences– Evolutionary signatures reveal thephylogenetic history

    • Phylotenetic trees contain:– Present day sequences– Ancestral nodes

    Time

    Ancestral nodes– A root

    • The same tree can be represented in many different ways:

  • 31‐Mar‐15

    2

    Time

    Ways of constructing phylogenetic trees• Distance‐based approaches

    – Among the fastest programs for making phylogenetic trees– Unweighted Pair Group Method with Arithmetic mean (UPGMA)– Neighbor Joining (NJ)

    • Maximum likelihood and maximum parsimony approaches

    Distance‐based approach

    Multiple sequence alignment

    Evolutionary di t

    Calculate evolutionary divergence

    distance matrix

    Phylogenetic tree

    Cluster: UPGMA, Neighbor Joining

    Evolutionary distances• Sequence (dis‐)similarity represents evolutionary distance

    – Use similarity quantification methods from last week’s lectures• But: evolutionary distance does not correlate 1:1 with sequence alignment score– Because mutations at the same position in the sequence become increasingly likely

    – So we have to correct for that: )41ln(3 Dd So we have to correct for that: )3

    (4

    d

    Observed number of mutated positionsActual num

    ber o

    f mutations

    Jukes Cantor correction

    The molecular clock• The concept of a molecular clock is based on the observation that the number of amino acid differences in hemoglobin between different lineages changes roughly linearly with time, as estimated from fossilestimated from fossil evidence

    • In reverse, this would allow us to estimate the dates of evolutionary events from by biological sequence analysis

    • The molecular clock holds in some cases

    Human Mouse Chimp Worm Yeast

    Human ‐ 5 1 8 9

    Mouse ‐ 4 10 11

    Chimp 9 9

    UPGMA example• UPGMA is a greedy algorithm

    – This means that nodes with the smallest distances are joined first

    Chimp ‐ 9 9

    Worm ‐ 2

    Yeast ‐

    H+C Mouse Worm Yeast

    H+C ‐ 4.5 8.5 9

    Mouse ‐ 10 11

    Worm ‐ 2

    Yeast ‐

    Y

    W

    1

    1

    H

    C

    0.5

    0.5

  • 31‐Mar‐15

    3

    UPGMA example

    Human Mouse Chimp Worm Yeast

    Human ‐ 5 1 8 9

    Mouse ‐ 4 10 11

    Chimp 9 9

    • UPGMA is a greedy algorithm– This means that nodes with the smallest distances are joined first

    H+C Mouse W+Y

    H+C ‐ 4.5 8.75

    Mouse ‐ 10.5

    W+Y ‐M

    1.75

    2.25

    Y

    W

    1

    1

    H

    C

    0.5

    0.5

    Chimp ‐ 9 9

    Worm ‐ 2

    Yeast ‐

    UPGMA example

    Human Mouse Chimp Worm Yeast

    Human ‐ 5 1 8 9

    Mouse ‐ 4 10 11

    Chimp 9 9

    • UPGMA is a greedy algorithm– This means that nodes with the smallest distances are joined first– The root is added last at the mid‐point

    (H+C)+M W+Y

    (H+C)+M ‐ 9.33

    W+Y ‐

    M

    1.75

    2.25

    3.67

    2.42

    Y

    W

    1

    1

    H

    C

    0.5

    0.5

    Chimp ‐ 9 9

    Worm ‐ 2

    Yeast ‐

    An exam question

    A

    B

    D

    E

    C

    2

    1.5

    1

    0.5

    1. Assume that the molecular clock holds.a) Fill in the missing branch lengths.b) What algorithm was used for building this phylogenetic tree?c) What are dAB and dCD?d) Write this tree in Newick tree format with branch lengths.

    2. Research has revealed that the molecular clock does not hold for the lineage leading to C. If dBC = 6, what is the distance between C and its last common ancestor with A, B, D, and E?

    Answers

    A

    B

    D

    E

    C

    2

    1.5

    1

    0.52

    1.51.5

    3.5

    1. Assume that the molecular clock holdsa) See aboveb) Unweighted Pair Group Method with Arithmetic mean (UPGMA)c) dAB = 4, dCD = 7d) (((A:2,B:2):1,(D:1.5,E:1.5):1.5):0.5),C:3.5);

    2. 2.5

    Answers

    C

    0.5

    A

    B

    21

    2

    D

    E

    1.5

    1.51.5

    3.5A

    B

    E

    D

    C

    2

    1.5

    1

    0.52

    1.51.5

    3.5

    1. The following bracket‐notations are also correct:d) (((A:2,B:2):1,(D:1.5,E:1.5):1.5):0.5),C:3.5);d) (C:3.5,((D:1.5,E:1.5):1.5,(B:2,A:2):1):0.5));d) (C:3.5,((B:2,A:2):1,(D:1.5,E:1.5):1.5):0.5));d) (C:3.5,((A:2,B:2):1,(E:1.5,D:1.5):1.5):0.5));Et cetera

  • 31‐Mar‐15

    4

    Non‐uniform molecular clock• Greedy algorithms only work if the 

    clock runs at the same speed in all branches– All distances to the root are equal– The tree is called ultrametric

    • This is often not the case:Species A (fast evolving)Species B (slow evolving)Species C (fast evolving)Species D (slow evolving) NowNow

    Unequal rates of evolution are the rule

    Protist mitochondrionPlant  mitochondrion

    • Neighbor Joining (NJ) is designed to account for a non‐uniform molecular clock

    For detailed explanation check: youtu.be/B‐oHOoYvE6E

    Maximum parsimony (MP) and likelihood (ML)• Maximum parsimony (MP): the tree that requires the fewest evolutionary events to explain the alignment– Occam’s razor: the simplestexplanation of the observations

    • Maximum likelihood (ML): theMaximum likelihood (ML): the tree most likely to have led to the alignment given a certain model of evolution

    Maximum parsimony (MP)• MP example for a single position “alignment” in 5 species:

    Chimpanzee CGibbon TGorilla CHuman COrangutan T

    • Draw all possible trees for the sequences/species present in your multiple alignment

    • For each tree, identify where the mutations have taken place– Make parsimony assumption: minimum number of required mutations

    Maximum parsimony (MP)• How many trees are there?

    – # unrooted trees NU = (2n ‐ 5)!!– # rooted trees NR = (2n ‐ 3)!!

    • The MP tree has the minimum number of required mutations– It is the simplest explanation of the alignmentInformative positions contain at ≥2 different characters ≥2 × each

    = (2n ‐ 5) × (2n ‐ 7) × ... × 3 × 1= (2n ‐ 3) × (2n ‐ 5) × ... × 3 × 1

    – Informative positions contain at ≥2 different characters ≥2 × each

    7: a-t

    6: a-t

    1: c-t3: t-a

    Maximum likelihood (ML)• The simplest explanation is not always the most likely one

    – We know that some mutations are more likely than others• Like MP, ML starts by drawing all possible trees• For each tree, ML then calculates how likely it is that thistree gave rise to the observations (i.e. the alignment)– This depends on assumptions made about evolutionary events

    S b tit ti t i d lti• Substitution matrix and gap penalties• Faster/slower evolving positions in the sequence• Faster/slower evolving lineages in the tree

    – These assumptions are called the evolutionary model• The maximum likelihood tree is the tree that is most likelyto have generated the observations (i.e. the multiple alignment) given the model

  • 31‐Mar‐15

    5

    How to root a tree• Outgroup rooting

    – Include a basal sequence/species in the tree that branched off before the rest of the sequences/species diverged

    – You know that the root lies between this distant homolog and the rest of the sequences/species

    M

    Fish

    Yeast

    Mouse

    Human Y F M H

    1

    • Midpoint rooting– Assume that the root lies in the middle of longest distance between two nodes in the tree

    • Use a gene duplication event– Applicable when studying paralogousgenes in a family (next lecture) 

    f-

    h-

    f-

    h-

    Fish

    Yeast

    Mouse

    Human5

    22

    1

    1

    4 2

    12

    1Y F M H

    f- h- f- h-

    Remember: garbage in = garbage out!• Any sequences will be aligned by an alignment program

    – Even of the sequences are not homologous, and thus the “multiple sequence alignment” is meaningless

    • Any “multiple sequence alignment” will be turned into a “tree” by an phylogeny program– Even if the sequences are not homologous or if they are very badly aligned, and thus the “phylogenetic tree” is meaningless

    • Solutions:– Check your multiple sequence alignment carefully before making a phylogenetic tree

    – If the tree shows unexpected branching order, think twice about your methods before interpreting it as a true biological event

    Which phylogeny approach to use?• As with sequence alignments, remember that a phylogenetic tree is a hypothesis of the true evolutionary history

    • Distance trees are fast and give you a quick idea of the phylogenetic relationships– NJ is better than UPGMA because unequal rates of evolution are qa common phenomenon

    • ML trees are slower but tend to give more reliable phylogenetic trees– MP is rarely used for phylogenetic inference

    • You can test which approach is the best if you know the true Tree of Life

    Phylogenetic markers• Available/easy to sequence

    – Cytochrome C

    Fitch Science 1967

    • Present in all species• Constant function• Slowly evolving

    – SSU rRNA

    Fitch, Science 1967

    Woese et al, PNAS 1977

    Ribosomal RNA genes• Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)– 16S rRNA (Bacteria and Archaea)– 18S rRNA (Eukaryotes)

    Different genes tell different stories• Conflict between trees based on single genes

    – Unrecognized paralogy (next lecture)– Horizontal gene transfer– Mutation saturation biases divergent rates of evolutionMutation saturation, biases, divergent rates of evolution

    • Combine information from moregenes to average out theseanomalies– Complete genomes contain themaximum phylogenetic information

    – Application of complete genomes toreconstruct phylogenies is called phylogenomics

  • 31‐Mar‐15

    6

    Testing phylogeny approaches on fungi• Many complete genomes are available• Consensus about phylogeny• ML tree of the concatenated multiple 

    sequence alignments of universal proteins recovers most target nodes

    19 target nodes

    How do we assess confidence in an experiment?• We use statistics to assess confidence in an experiment– We repeat the experiment N timesand test how robustthe result is

    – How much variation is there in the result?

    How do we assess confidence in a tree?• We have already used all the data to create the phylogenetic tree, so how can we getinformation about the statistical confidence of the nodes in the tree?

    • “To pull oneself up by one's bootstraps” means “to better oneself without external help”

    • Bootstrap re‐sampling or bootstrapping:– A multiple alignment consists of many observations (i.e. the positions or columns)

    – We can randomly re‐sample new datasets from these manyobservations and test if the nodes in the phylogenetic tree are robust in these new datasets

    Bootstrap re‐sampling• Randomly sample columns (positions) from a multiple alignment– The number of sampled characters is identical to the number of positions in the original alignment

    – Sampling is done with replacement, so some positions can be sampled multiple times, while other positions are never chosen

    • Re‐calculate a bootstrap tree based on the randomly re‐p ysampled alignment

    • Repeat this e.g. 100‐1,000 times– This means making 100‐1,000 trees

    • For each branch in the original tree,in what percentage of the bootstraptrees was it correctly recovered?– This is the bootstrap score for the branch

    • Like many alignment algorithms, bootstrapping assumes that all positions in the alignment evolve independently (which is not true)

    An exam question• What is the maximum 

    number of bootstrap trees where Zebrafish_1A and Stickleback_1A could have formed a clade of two leaves?

    • What is the maximum number of bootstrap trees pwhere any Clawed Frog sequence could have formed a clade of two leaves with any Stickleback sequence?