![Page 1: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/1.jpg)
Selecting Genomes for Reconstruction of Ancestral Genomes
Louxin Zhang
Department of Mathematics
National University of Singapore
![Page 2: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/2.jpg)
Boreoeutherian Ancestor
![Page 3: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/3.jpg)
The Genome Selection for Reconstruction problem
Instance: A phylogeny P of a set of genomes, an integer k, and a reconstruction method T (say, parsimony).
Solution: The k genomes in the phylogeny that give the highest accuracy for reconstructing the ancestral genome at the root of the phylogeny, using method T.
![Page 4: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/4.jpg)
Two reasons
• It is often impossible to sequence all descendant genomes below an ancestor.
• More taxa do not necessarily give a higher accuracy for the reconstruction of ancestral character states in general (examples will be given below).
![Page 5: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/5.jpg)
Outline
Introduction to reconstruction accuracy analysis
More genomes are not necessarily better for reconstruction
Greedy algorithms for genome selection
A joint work with G. Li, J. Ma and M. Steel
![Page 6: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/6.jpg)
1. Reconstruction and its Accuracy
There are different methods for reconstructing ancestral character states:
• Parsimony
• Maximum likelihood methods (Koshi & Goldstein’96, Yang et al.’95)
• Bayesian methods (Yang et al.’95)
In this work, we study the problem with Fitch parsimony and maximum likelihood under the Jukes-Cantor evolutionary model.
![Page 7: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/7.jpg)
Jukes-Cantor Model
Characters evolve by a symmetric, reversible Markov process.
On a given branch, every kind of substitution has the same probability.
For simplicity, we assume there are two states 0 and 1.
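For the two-state case, the probability that both ends of a branch carry the same state (the "conservation probability" used later in the talk) has a standard closed form. The sketch below assumes a continuous-time symmetric process; the names `rate` (rate of leaving the current state) and `t` (branch length) are illustrative and do not appear in the slides.

```python
import math

def conservation_probability(rate: float, t: float) -> float:
    """P(same state at both ends of a branch) in the two-state symmetric model.

    Assumption: a continuous-time Markov process in which each state changes
    to the other at rate `rate`; `t` is the branch length.
    """
    return 0.5 * (1.0 + math.exp(-2.0 * rate * t))
```

A zero-length branch gives probability 1, and the value decreases toward 1/2 as the branch gets longer.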
![Page 8: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/8.jpg)
Reconstruction Accuracy
In the symmetric Jukes-Cantor model,
the reconstruction accuracy of a method is independent of the prior distribution of the states at the root.
![Page 9: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/9.jpg)
• D denotes a state configuration at the leaves: it has one state for each leaf.
• There are $2^n$ state configurations since there are 2 possible states at each leaf.

$$A_K = \sum_{D} \Pr(D \mid 0)\, I(0, D, K)$$

$I(0, D, K)$ is 1 if the method K reconstructs state 0 from D and 0 otherwise.
$\Pr(D \mid 0)$ is the probability that 0 at the root evolves into D.
![Page 10: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/10.jpg)
• D denotes a state configuration at the leaf nodes: it has one state for each leaf.
• There are $2^n$ state configurations since there are 2 possible states at each leaf.

$$A_K = \sum_{D} \Pr(D \mid 0)\, I(0, D, K)$$

The reconstruction accuracy is the sum of the generating probabilities of the state configurations that allow the true state 0 to be recovered by the method K.
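The definition of the accuracy can be evaluated directly on small trees by enumerating all leaf configurations. The sketch below is one way to do that, not the author's implementation. It assumes a hypothetical tree encoding (a leaf is `None`, an internal node is a tuple of `(subtree, conservation_probability)` pairs) and a `method` callable that maps a leaf configuration to a reconstructed root state.

```python
def leaf_distribution(tree, state):
    """Map each leaf configuration D (a tuple of 0/1) to Pr(D | node state).

    Assumed encoding: a leaf is None; an internal node is a tuple of
    (subtree, p) pairs, where p is the conservation probability on the branch.
    """
    if tree is None:                       # a leaf simply shows its own state
        return {(state,): 1.0}
    dist = {(): 1.0}
    for child, p in tree:
        child_dist = {}
        # The child keeps the parent's state with probability p, flips otherwise.
        for child_state, weight in ((state, p), (1 - state, 1.0 - p)):
            for config, prob in leaf_distribution(child, child_state).items():
                child_dist[config] = child_dist.get(config, 0.0) + weight * prob
        # Children are independent given the parent's state.
        dist = {d1 + d2: q1 * q2
                for d1, q1 in dist.items()
                for d2, q2 in child_dist.items()}
    return dist

def accuracy(tree, method):
    """A_K = sum over D of Pr(D | 0) * I(0, D, K), with K given by `method`."""
    return sum(prob for config, prob in leaf_distribution(tree, 0).items()
               if method(config) == 0)
```

For example, a method that just copies the first leaf, `lambda D: D[0]`, has accuracy equal to the conservation probability on the path to that leaf. The enumeration grows as $2^n$, so this is only practical for small n.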
![Page 11: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/11.jpg)
Previous Analysis Works
• Simulation studies (Martins’99, Mooers’04, Salisbury & Kim’01, Zhang & Nei’97, Yang et al.’95)
• Theoretical studies (Mossel’01, Lucena and Haussler’05, Maddison’95)
![Page 12: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/12.jpg)
Fitch Method
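Given a state configuration of the leaves, the Fitch method reconstructs a subset of states at each internal node recursively (from the leaves to the root).
[Figure: Fitch reconstruction on a small tree with node labels A, B, C and leaf states 0, 1, 0; one internal node receives {0, 1} and the root receives {0}.]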
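The bottom-up rule can be sketched in a few lines: a node takes the intersection of its children's sets when that intersection is non-empty, and their union otherwise. The tree encoding here (a leaf is an `int` state, an internal node is a pair of subtrees) is an assumption made for illustration.

```python
def fitch_set(tree):
    """Bottom-up Fitch pass on a binary tree.

    Assumed encoding: a leaf is its observed state (0 or 1); an internal
    node is a tuple of two subtrees. Returns the state set at the top node.
    """
    if isinstance(tree, int):
        return frozenset({tree})
    left, right = (fitch_set(child) for child in tree)
    common = left & right
    # Intersection if the children agree on something, union otherwise.
    return common if common else left | right
```

With leaf states 0, 1, 0 arranged as `((0, 1), 0)`, the inner node gets {0, 1} and the root gets {0}, as in the slide's example.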
![Page 13: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/13.jpg)
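Calculating the Reconstruction Accuracy of the Fitch Method
The unambiguous reconstruction accuracy: $P_{\text{Accuracy}} = P[\{1\} \mid 1] = P[\{0\} \mid 0]$, where $P[\{1\} \mid 1]$ is the probability that the Fitch method outputs exactly the true state at the root.
The (ambiguous) reconstruction accuracy additionally counts the ambiguous outcome $\{0, 1\}$ with weight ½: $P[\{1\} \mid 1] + \tfrac{1}{2} P[\{0, 1\} \mid 1]$.
$P[\{0\} \mid 1]$, $P[\{1\} \mid 1]$, and $P[\{0, 1\} \mid 1]$ can be calculated by a dynamic programming approach (Maddison, 1995).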
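A recursion of this kind can be sketched as follows. It is written in the spirit of the dynamic approach cited above and is not claimed to match Maddison's exact formulation. The tree encoding (a leaf is `None`, an internal node is a pair of `(subtree, conservation_probability)` entries) and all function names are assumptions for illustration.

```python
from itertools import product

S0, S1, S01 = frozenset({0}), frozenset({1}), frozenset({0, 1})
SETS = (S0, S1, S01)

def fitch_combine(x, y):
    """Fitch rule: intersection if non-empty, otherwise union."""
    common = x & y
    return common if common else x | y

def fitch_distribution(tree):
    """Return dist[s][X] = P[Fitch set at this node is X | true state s here].

    Assumed encoding: a leaf is None; an internal node is
    ((left_subtree, p_left), (right_subtree, p_right)).
    """
    if tree is None:
        return {s: {S0: float(s == 0), S1: float(s == 1), S01: 0.0}
                for s in (0, 1)}
    views = []
    for child, p in tree:
        below = fitch_distribution(child)
        # Distribution of the child's Fitch set as seen from the parent state s:
        # the child keeps s with probability p and flips with probability 1 - p.
        views.append({s: {x: p * below[s][x] + (1.0 - p) * below[1 - s][x]
                          for x in SETS} for s in (0, 1)})
    left, right = views
    dist = {s: {x: 0.0 for x in SETS} for s in (0, 1)}
    for s in (0, 1):
        for x, y in product(SETS, repeat=2):
            dist[s][fitch_combine(x, y)] += left[s][x] * right[s][y]
    return dist

def fitch_accuracies(tree):
    """Return (unambiguous, ambiguous) accuracy when the true root state is 0."""
    d = fitch_distribution(tree)[0]
    return d[S0], d[S0] + 0.5 * d[S01]
```

Each node is visited once and combines nine pairs of child sets per state, so the cost is linear in the number of nodes, in contrast to enumerating all $2^n$ leaf configurations.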
![Page 14: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/14.jpg)
Outline
Introduction to reconstruction accuracy analysis
More genomes are not necessarily better for reconstruction accuracy
Greedy algorithms for genome selection
![Page 15: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/15.jpg)
2. Reconstruction accuracy is not a monotone function of the size of taxon sampling
Unbalanced tree:
• There is a large clade with a long stem
• A short single sister lineage
Such a phylogeny is used when both the fossil record and data from extant species are used for reconstruction (Finarelli and Flynn, 2006)
![Page 16: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/16.jpg)
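Theorem 1: $A_{\text{parsimony}} < p_1$ if $\tfrac{1}{2} < p_2 \le p_1$.
$p_1$ is the conservation probability on branch AY; $p_2$ is the conservation probability on branch AZ.
[Figure: root A with children Y (branch $p_1$) and Z (branch $p_2$); the Fitch set at A is {0} when Y has state 0 and the Fitch set at Z is {0} or {0, 1}.]
Proof.

$$\begin{aligned}
P_A[\{0\} \mid 0] &= \Pr_{AY}[0 \to 0] \times \big( \Pr_{AZ}[0 \to 0]\, P_Z[\{0\} \text{ or } \{0,1\} \mid 0] + \Pr_{AZ}[0 \to 1]\, P_Z[\{0\} \text{ or } \{0,1\} \mid 1] \big) \\
&= p_1 \big( p_2 (1 - P_Z[\{1\} \mid 0]) + (1 - p_2)(1 - P_Z[\{1\} \mid 1]) \big) \\
&= p_1 \big( 1 - p_2 P_Z[\{1\} \mid 0] - (1 - p_2) P_Z[\{1\} \mid 1] \big)
\end{aligned}$$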
![Page 17: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/17.jpg)
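$p_1$ is the conservation probability on branch AY; $p_2$ is the conservation probability on branch AZ.
[Figure: root A with children Y (branch $p_1$) and Z (branch $p_2$); the Fitch set at A is {0, 1} when Y has state 0 and Z receives {1}, or when Y has state 1 and Z receives {0}.]

$$P_A[\{0,1\} \mid 0] = [p_1 p_2 + (1-p_1)(1-p_2)]\, P_Z[\{1\} \mid 0] + [p_1(1-p_2) + p_2(1-p_1)]\, P_Z[\{0\} \mid 0]$$

$$A_{\text{parsimony}} = P_A[\{0\} \mid 0] + \tfrac{1}{2} P_A[\{0,1\} \mid 0] = p_1 + \tfrac{1}{2}(1 - p_1 - p_2)\, P_Z[\{1\} \mid 0] + \tfrac{1}{2}(p_2 - p_1)\, P_Z[\{0\} \mid 0] < p_1$$

since $\tfrac{1}{2} < p_2 \le p_1$ makes both coefficients non-positive, the first strictly negative.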
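A small numeric instance makes the inequality concrete. The values below are assumptions chosen for illustration: $p_1 = 0.9$, $p_2 = 0.7$, and Z is the parent of two leaves, each on a branch with conservation probability 0.8, so that $P_Z[\{1\} \mid 0] = 0.2^2 = 0.04$ and $P_Z[\{0\} \mid 0] = 0.8^2 = 0.64$.

$$\begin{aligned}
A_{\text{parsimony}} &= 0.9 + \tfrac{1}{2}(1 - 0.9 - 0.7)(0.04) + \tfrac{1}{2}(0.7 - 0.9)(0.64) \\
&= 0.9 - 0.012 - 0.064 = 0.824 < 0.9 = p_1
\end{aligned}$$

In this instance, using the leaf Y alone gives accuracy 0.9, while adding the two leaves below Z lowers the parsimony accuracy to 0.824.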
![Page 18: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/18.jpg)
The reconstruction accuracy on comb-shaped trees in the limit case
![Page 19: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/19.jpg)
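Theorem 2: $A_{\text{ML}} = p_1$ if $\tfrac{1}{2} < p_2 \le p_1$.
$p_1$ is the conservation probability on branch AY; $p_2$ is the conservation probability on branch AZ.
[Figure: root A (state 0 or 1?) with leaf Y in state 0 on branch $p_1$ and the state configuration $D_Z$ below Z on branch $p_2$.]
$D_Z$: a state configuration below Z.
$\Pr_A(0D_Z \mid s)$: the probability that s at A evolves into the state configuration $0D_Z$, s = 0, 1.
Marginal ML method:

$$\begin{aligned}
\Pr_A(0D_Z \mid 0) &= p_1 \times [\, p_2 \Pr_Z(D_Z \mid 0) + (1-p_2) \Pr_Z(D_Z \mid 1) \,] \\
\Pr_A(0D_Z \mid 1) &= (1-p_1) \times [\, (1-p_2) \Pr_Z(D_Z \mid 0) + p_2 \Pr_Z(D_Z \mid 1) \,] \\
\Pr_A(0D_Z \mid 0) - \Pr_A(0D_Z \mid 1) &= (p_1 + p_2 - 1) \Pr_Z(D_Z \mid 0) + (p_1 - p_2) \Pr_Z(D_Z \mid 1) > 0
\end{aligned}$$

The marginal ML outputs 0 at A iff the state at Y is 0.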
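Theorem 2 can be checked numerically on a small tree by brute-force enumeration. The sketch below assumes a hypothetical tree encoding (a leaf is `None`, an internal node is a tuple of `(subtree, conservation_probability)` pairs) and assumes that marginal ML picks the root state with the larger likelihood, splitting exact ties evenly; the tie rule is an assumption, not something stated in the slides.

```python
def leaf_distribution(tree, state):
    """Map each leaf configuration (tuple of 0/1) to Pr(configuration | state)."""
    if tree is None:
        return {(state,): 1.0}
    dist = {(): 1.0}
    for child, p in tree:
        child_dist = {}
        for child_state, weight in ((state, p), (1 - state, 1.0 - p)):
            for config, prob in leaf_distribution(child, child_state).items():
                child_dist[config] = child_dist.get(config, 0.0) + weight * prob
        dist = {d1 + d2: q1 * q2
                for d1, q1 in dist.items()
                for d2, q2 in child_dist.items()}
    return dist

def marginal_ml_accuracy(tree):
    """Probability that marginal ML recovers root state 0 when 0 is the truth."""
    like0 = leaf_distribution(tree, 0)
    like1 = leaf_distribution(tree, 1)
    acc = 0.0
    for config, l0 in like0.items():
        l1 = like1[config]
        if l0 > l1:            # ML outputs 0
            acc += l0
        elif l0 == l1:         # assumed tie rule: half credit
            acc += 0.5 * l0
    return acc

# Illustrative values: Y is a leaf with p1 = 0.9; Z has p2 = 0.7 and two leaves.
clade_z = ((None, 0.8), (None, 0.8))
tree = ((None, 0.9), (clade_z, 0.7))
```

With these illustrative values, `marginal_ml_accuracy(tree)` agrees with $p_1 = 0.9$ up to floating-point error, which is what the theorem predicts for $\tfrac{1}{2} < p_2 \le p_1$.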
![Page 20: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/20.jpg)
Another Example showing the Non-monotone Property of Reconstruction Accuracy
![Page 21: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/21.jpg)
Simulation
Experiment setup:
• Yule birth-death model
• Conservation probability along branches: 0.5~1
• Count the number of random trees in which the ambiguous accuracy of using a single (longest or shortest) path is better than that from the full phylogeny
![Page 22: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/22.jpg)
Simulation results: Counting the bad trees
+: using the shortest path
![Page 23: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/23.jpg)
Comparison of Parsimony, Joint ML and Marginal ML
500 random trees with 12 leaves generated:
• Yule birth-death model
• Branch length is uniform from 0 to 1
Results:
• MML outperforms JML and MP.
• In 80% of instances, MML is strictly better than JML.
• In 99% of instances, JML is strictly better than MP.
![Page 24: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/24.jpg)
Outline
Introduction to reconstruction accuracy analysis
More genomes are not necessarily better for reconstruction accuracy
Greedy algorithms for genome selection problem
![Page 25: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/25.jpg)
Genome selection for reconstruction: the problem
Instance: A phylogeny P over n genomes, an integer k, and a reconstruction method T
Question: Find k genomes that allow the ancestral genome at the root of P to be reconstructed with the maximum accuracy, using method T.
![Page 26: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/26.jpg)
Our approaches
The genome selection problem is unlikely to be polynomial-time solvable (no hardness proof yet).
As a result, we propose two greedy algorithms for the problem:
• Forward greedy algorithm
• Backward greedy algorithm
![Page 27: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/27.jpg)
Forward Greedy Algorithm
1. Set S ← ∅;
2. For i = 1, 2, ..., k do
   2.1) for each genome g not in S, compute the accuracy A(g) of the reconstruction obtained by applying method T to S ∪ {g};
   2.2) add the genome g with the maximum accuracy A(g) to S;
3. Output S
S is the set of selected genomes.
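The forward procedure can be sketched as below. The accuracy computation for method T is left abstract: `accuracy` is an assumed callable that takes a frozenset of genomes and returns a number. The toy scoring function and the genome labels `g1` to `g4` are hypothetical stand-ins, not data from the talk.

```python
from math import prod

def forward_greedy(genomes, k, accuracy):
    """Select k genomes by repeatedly adding the one that maximizes accuracy.

    `accuracy(frozenset_of_genomes)` is an assumed callable standing in for
    the reconstruction accuracy under method T.
    """
    if not 0 <= k <= len(genomes):
        raise ValueError("k must be between 0 and the number of genomes")
    selected = []
    remaining = list(genomes)
    for _ in range(k):
        # Step 2.1 and 2.2: score every candidate extension, keep the best.
        best = max(remaining, key=lambda g: accuracy(frozenset(selected + [g])))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy score, for illustration only.
weights = {"g1": 0.5, "g2": 0.3, "g3": 0.4, "g4": 0.1}

def toy_accuracy(subset):
    return 1.0 - prod(1.0 - weights[g] for g in subset)

print(forward_greedy(list(weights), 2, toy_accuracy))  # prints ['g1', 'g3']
```

Each of the k rounds evaluates the accuracy once per remaining genome, so the procedure makes at most k·n accuracy evaluations.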
![Page 28: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/28.jpg)
Backward Greedy Algorithm
1. Let S contain all the given genomes;
2. For i = 1, 2, ..., n − k do
   2.1) for each genome g in S, compute the accuracy A’(g) of the reconstruction obtained by applying T to S − {g};
   2.2) remove from S the genome g whose A’(g) is the maximum over all g in S;
3. Output S
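The backward procedure can be sketched in the same style. As before, `accuracy` is an assumed callable standing in for the reconstruction accuracy under method T, and the toy scoring function and labels `g1` to `g4` are hypothetical.

```python
from math import prod

def backward_greedy(genomes, k, accuracy):
    """Start from all genomes and drop one at a time until k remain.

    At each step the genome removed is the one whose removal leaves the
    highest accuracy. `accuracy(frozenset_of_genomes)` is an assumed callable.
    """
    if not 0 <= k <= len(genomes):
        raise ValueError("k must be between 0 and the number of genomes")
    selected = list(genomes)
    while len(selected) > k:
        drop = max(
            selected,
            key=lambda g: accuracy(frozenset(x for x in selected if x != g)),
        )
        selected.remove(drop)
    return selected

# Hypothetical toy score, for illustration only.
weights = {"g1": 0.5, "g2": 0.3, "g3": 0.4, "g4": 0.1}

def toy_accuracy(subset):
    return 1.0 - prod(1.0 - weights[g] for g in subset)

print(backward_greedy(list(weights), 2, toy_accuracy))  # prints ['g1', 'g3']
```

On this toy score the two procedures agree, but on real accuracy functions, which need not be monotone, they can return different sets.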
![Page 29: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/29.jpg)
Validation test: Trees with the same height
Experiment setup:
• Random trees with N (9 or 16) leaves generated by the program Evolver in PAML with the following parameters:
• Birth rate = 10; Death rate = 5; Sampling fraction = 1.
• Tree height = 0.1, 0.2, 0.5, 1, 2, or 5.
![Page 30: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/30.jpg)
Performance of the selection method for reconstruction with Parsimony
![Page 31: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/31.jpg)
Performance of the selection method for reconstruction with Marginal Maximum Likelihood
![Page 32: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/32.jpg)
Performance of the selection method for reconstruction with Joint Maximum Likelihood
![Page 33: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/33.jpg)
Marginal Maximum Likelihood
![Page 34: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/34.jpg)
Parsimony Method
![Page 35: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/35.jpg)
Concluding remarks
Reconstruction accuracy is not monotone increasing with the taxon sampling size in unbalanced trees for the Parsimony method
--- Another kind of “inconsistency”
1. One implication of this observation is that the Parsimony and ML methods might not exploit the full power of incorporating the fossil record into current data. Hence, modification might be needed.
2. Caution should be used in drawing conclusions when testing hypotheses with ancestral state reconstruction.
![Page 36: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/36.jpg)
3. Is the reconstruction accuracy function monotone in ultrametric phylogenies? It seems true when the number of taxa is large.
Consider the complete binary tree: when the conservation probability on each branch is less than 7/8,
(the ambiguous reconstruction accuracy) = (the accuracy of using just one taxon) = 1/2
in the limit case. (A formula exists; see Steel’89.)
![Page 37: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/37.jpg)
Concluding remarks
• Formulated the genome selection for reconstruction problem
• Proposed two greedy algorithms for the problem
• Validation tests show that the reconstruction accuracy obtained with the genomes selected by the greedy algorithms is comparable to the maximum reconstruction accuracy.
![Page 38: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/38.jpg)
Thank You!
![Page 39: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/39.jpg)
A Biological Example
Boreoeutherian ancestor, from the ENCODE project
4 states at leaf nodes
Expected accuracy at the root node
![Page 40: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/40.jpg)
A Biological Example – Results
The Backward algorithm always performs similarly to the exhaustive search
With 8 leaf nodes, the accuracy from the Backward algorithm is 93.6%, close to the accuracy of 94.6% with the full phylogeny
![Page 41: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/41.jpg)
Outline
Introduction to phylogeny reconstruction accuracy analysis
More genomes are not necessarily better for reconstruction accuracy
Greedy algorithms for genome selection problem
Validation test
Conclusion
![Page 42: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/42.jpg)
Conclusion
• Formulated the genome selection for reconstruction problem
• Proposed two greedy algorithms for the problem
• Validation tests show that the reconstruction accuracy obtained with the genomes selected by the greedy algorithms is comparable to the maximum reconstruction accuracy.
![Page 43: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/43.jpg)
Fitch Parsimony method
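Given character states at the leaf nodes, the method reconstructs a subset of states at each internal node by the following rule:
[Figure: Fitch reconstruction on a small tree with leaf states 0, 1, 0; one internal node receives {0, 1} and the root receives {0}.]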
![Page 44: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/44.jpg)
More Genomes Are Not Necessarily Better – An example with 4 leaves
The complete tree
![Page 45: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/45.jpg)
More Genomes Are Not Necessarily Better – An example with 4 leaves
The unambiguous reconstruction accuracy of using one genome is
$$P_{\text{path}} = p^2 + (1-p)^2;$$
The unambiguous reconstruction accuracy of using all the 4 genomes is
$$P_{\text{whole}} = P_{\text{path}} - 3p^2(1-p)^2;$$
More genomes give more noise!
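The expression for $P_{\text{whole}}$ can be checked by hand. The derivation below is a sketch under these assumptions: the example is the complete binary tree with four leaves, every branch has conservation probability p, and the true root state is 0. The shorthand $q = 1 - p$ and the symbol Q (the distribution of the Fitch set at a child of the root, as seen from the root state 0) are introduced here and are not in the slides.

$$\begin{aligned}
P_{\text{cherry}}[\{0\} \mid 0] &= p^2, \quad P_{\text{cherry}}[\{1\} \mid 0] = q^2, \quad P_{\text{cherry}}[\{0,1\} \mid 0] = 2pq \\
Q[\{0\}] &= p \cdot p^2 + q \cdot q^2 = p^3 + q^3 = 1 - 3pq \\
Q[\{0,1\}] &= 2pq \\
P_{\text{whole}} &= Q[\{0\}]^2 + 2\, Q[\{0\}]\, Q[\{0,1\}] = (1 - 3pq)(1 + pq) \\
&= 1 - 2pq - 3p^2 q^2 = P_{\text{path}} - 3p^2(1-p)^2
\end{aligned}$$

The last step uses $P_{\text{path}} = p^2 + q^2 = 1 - 2pq$, the conservation probability along the two-branch path from the root to a single leaf.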
![Page 46: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/46.jpg)
A Small Example with Six Leaves
(The ambiguous reconstruction accuracy) < (the unambiguous accuracy on the shortest path) when 0.5 < p < 0.65
![Page 47: Selecting Genomes for Reconstruction of A ncestral Genomes](https://reader035.vdocuments.us/reader035/viewer/2022081519/568135d2550346895d9d3b2a/html5/thumbnails/47.jpg)
Reconstruction accuracy on complete phylogeny in limit case
When the conservation rate on each branch is less than 7/8,
(the ambiguous reconstruction accuracy) = (the accuracy of using just one genome) = 1/2