exact computation of coalescent likelihood under the infinite sites model yufeng wu university of...
Post on 19-Dec-2015
218 views
TRANSCRIPT
![Page 1: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/1.jpg)
Exact Computation of Coalescent Likelihood under the Infinite Sites
Model
Yufeng Wu
University of Connecticut
ISBRA 2009 1
![Page 2: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/2.jpg)
Coalescent Likelihood
• D: a set of binary sequences.• Coalescent genealogy: history with
coalescent and mutation events.• Coalescent likelihood P(D):
probability of observing D on coalescent model given mutation rate
• Assume no recombination. 00000 00010 00010 00010 01100 10000 10001
1
5
2
3
4
Coalescent Mutation
![Page 3: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/3.jpg)
Perfect Phylogeny• Infinite many sites model of
mutations: one mutation per site in history
• Perfect phylogeny– Site labels tree branches– Each site appears exactly once.– Sequence: list of mutations from
root to leaf.– Unique topologically, except root is
unknown and order of mutations on the same branch is not fixed.
00000
01100
3
![Page 4: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/4.jpg)
Genealogy and Perfect Phylogeny
• Perfect phylogeny: not enough timing information
– Exists many coalescent genealogy for a fixed perfect phylogeny.– Each genealogy: different probability (depending on )
• Coalescent likelihood: sum over all compatible genealogy.
1
5
2
3
4
1
5
2
3
4
1
5
2
3
4
![Page 5: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/5.jpg)
Computing Coalescent Likelihood• Computation of P(D): classic population genetics problem.
Statistical (inexact) approaches:– Importance sampling (IS): Griffiths and Tavare (1994), Stephens and
Donnelly (2000), Hobolth, Uyenoyama and Wiuf (2008).– MCMC: Kuhner,Yamato and Felsenstein (1995).
• Genetree: IS-based, widely used but (sometimes large) variance still exists.
• How feasible of computing exact P(D)?– Considered to be difficult for even medium-sized data (Song, Lyngso and
Hein, 2006).
• This talk: exact computation of P(D) is feasible for data significantly larger than previously believed.– A simple algorithmic trick: dynamic programming
5
![Page 6: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/6.jpg)
Ethier-Griffiths Recursion• Build a perfect phylogeny for D.• Ancestral configuration (AC): pairs of
sequence multiplicity and list of mutations for each sequence type at some time
• Transition probability between ACs: depends on AC and .
• Genealogy: path of ACs (from present to root)
• P(D): sum of probability of all paths.• EG: faster summation, backwards in time.
(1, 0), (3, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(1, 0), (1, 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(1, 0), (2, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(1, 0), (1, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(3, 4 0)6
![Page 7: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/7.jpg)
Computing Exact Likelihood• Key idea: forward instead of backwards
– Create all possible ACs reachable from the current AC (start from root). Update probability.
– Intuition of AC: growing coverage of the phylogeny, starting from root
• Possible events at root: three branching (b1, b2, b3), three mutations (m1, m2, m4).
• Branching: cover new branch
• Covered branch can mutate
• Each event: a new AC
b1
m2
b2Start from root AC
7
![Page 8: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/8.jpg)
Forward Finding of ACs • Finding all ACs by forward looking.
• Maintain a list of active events in each AC. Update in the new ACs.
• Rule: at a node in phylogeny, mutated branches covered branches (unless all branches are covered)
Mut branch = covered branch
8
![Page 9: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/9.jpg)
Why Forward?• Bottleneck: memory • Layer of ACs: ACs
with k mutation or branching events from root AC, k= 1,2,3…
• Key: only the current layer needs to be kept. Memory efficient.
• A single forward pass is enough to compute P(D).
9
Coalescent
Mutation
![Page 10: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/10.jpg)
Results on Simulated Data• Use Hudson’s program ms: 20, 30 , 40 and 50
sequences with = 1, 3 and 5. Each settings: 100 datasets. How many allow exact computation of P(D) within reasonable amount of time?
Number of sequences
% of feasible data
Number of sequences
Ave. run time (sec.) for feasible data
10
![Page 11: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/11.jpg)
Application: MLE of Mutation Rate• Given a set of sequences D, what is the
maximum likelihood estimate of the mutation rate ?
• Issue: Need to compute for many possible and root of genealogy is not known.
• Use exact likelihood– Compute P(D | ) for on a grid.– MLE of : maximize P(D | ).– Full likelihood: sum P(D | ) over all possible roots.– Faster computation: compute P(D) for multiple in
one pass. Quicker than computing P(D) for one each time.
11
P(D)
MLE()
![Page 12: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/12.jpg)
A Mitochondrial Data• Mitochondrial data from Ward, et al. (1991). Previously analyzed
by Griffiths and Tavare (1994) and others.– 55 sequences and 18 polymorphic sites.
– Believed to fit the infinite sites model.
12
![Page 13: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/13.jpg)
MLE of Mutation Rate for the Mitochondrial Data
• MLE of : 4.8 Griffiths and Tavare (1994)
• IS methods can have variance– Is 4.8 really the MLE?
• Compute the exact full likelihood for a grid of between 4.0 to 6.0.
13
![Page 14: Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1](https://reader031.vdocuments.us/reader031/viewer/2022032704/56649d3f5503460f94a18d8c/html5/thumbnails/14.jpg)
Conclusion• IS seems to work well for the Mitochondrial
data– However, IS can still have large variance for some
data.– Thus, exact computation may help when data is
not very large and/or relatively low mutation rate.– Can also help to evaluate different statistical
methods.
• Research supported by National Science Foundation.
14