distance-based genome rearrangement phylogeny
Post on 31-Jan-2016
26 Views
Preview:
DESCRIPTION
TRANSCRIPT
Distance-Based Distance-Based Genome Rearrangement PhylogenyGenome Rearrangement Phylogeny
Li-San Wang
University of Texas at Austinand
University of Pennsylvania
2
Genomes As Signed Permutations
1 –5 3 4 -2 -6 or5 –1 6 2 -4 -3 etc.
1
5
3
4
2
6
3
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
1 2 –6 –5 -4 -3 7 8 9 10
1 2 7 8 3 4 5 6 9 10
1 2 7 8 –6 -5 -4 -3 9 10
Inversion:
Transposition:
Inverted Transposition:
4
Phylogeny Reconstruction
FN: false negative (missing edge)FP: false positive (incorrect edge)
50% error rate
S1 S2 S3 S4 S5
FN
S1
S2
S3S4
S5FP
S1 S2 S3
S4 S5
5
Results
5% Error
6
Outline
• Genome Rearrangement Evolution• True evolutionary distance estimators• Distance-based phylogeny reconstruction• Simulation study: accuracy of tree reconstruction
7
Our Model: the Generalized Nadeau-Taylor Model [STOC’01]
• Three types of events: − Inversions (INV)− Transpositions (TRP)− Inverted Transpositions (ITP)
• Events of the same type are equiprobable• Probabilities of the three types have fixed ratio
• We focus on signed circular genomes in this talk.
8
Additive Distance Matrix and True Evolutionary Distance (T.E.D.)
S2 S3 S4 S5
S1 0 9 15 14 17S2 0 14 13 16S3 0 13 16S4 0 13 13
75
4
5
8
S1
S2
S3
S4
S5
S1
S5 0
Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distance in O(m2) time.
9
Error Tolerance of Neighbor Joining
Theorem [Atteson 1999]Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have
then neighbor joining returns T.
2
1|| ijij dD
10
Edit Distances Between Genomes
• (INV) Inversion distance [Hannenhalli & Pevzner 1995]− Computable in linear time [Moret et al 2001]
• (BP) Breakpoint distance [Watterson et al. 1982]− Computable in linear time− NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999]
1 2 3 4 5 6 7 8 9 10
1 2 3 -8 -7 -6 4 5 9 10
A =
B =
BP(A,B)=3
11
NJ(BP) and NJ(INV)
120 genes, 160 leavesUniformly Random Trees
Transpositions/inverted transpositions only
Inversion only
12
BP and INV
INV vs K(120 genes)
(K: Actual number of inversions) (Inversion-only evolution)
BP/2 vs K
13
Objectives
• Better techniques for estimating true evolutionary distances− Mathematical guarantees whenever possible− Good accuracy in simulation− Robust to model violations
• Main goal: improve topological accuracy of tree reconstructions using neighbor joining and weighbor
14
New Estimators
• Exact-IEBP• Approx-IEBP• EDE
IEBP: Inverting Expected BreakPoint distanceEDE: Empirically Derived Estimator (inverts the expected inversion distance)
15
Distance-Based Methods
WeighborINV
NJBP
EDE
IEBP(Exact-, Approx-)
16
BP/2
K
K
BP
/2
(1)
(2)
Estimate True Evolutionary DistancesUsing BP
BP/2 vs K (120 genes)
(K: Actual number of inversions) (Inversion-only evolution)
To use the scatter plot to estimate the actual number of events (K):
1. Compute BP/2
2. From the curve, look up the corresponding valueof K
17
Using Breakpoints to Estimate True Evolutionary Distances
• Compute fn(k)= E[BP(G0,Gk)]
(i.e. the expected number of breakpoints after k random events; n is the number of genes)
• Given two genomes G and G’:− Compute breakpoint distance d=BP(G,G’)
− Find k so that fn (k) is closest to d
• Challenge: finding fn (k)
18
True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data
T.E.D. Estimator
Exact-IEBP [WABI’01]
Approx-IEBP [STOC’01]
EDE [ISMB’01]
Based on the Expectation of
Breakpoint distance (Exact)
Breakpoint distance (Approx.)
Inversion distance (Approx.)
Derivation Analytical Analytical Empirical
Model knowledge
Required Required Inversion-only
Running Time O(n3) O(log n) O(1)
IEBP: Inverting the Expected BreakPoint distanceEDE: Empirically Derived Estimator
19
Approx-IEBP [Wang & Warnow, STOC’01]
20
True Evolutionary Distance Estimators
IEBP vs K(120 genes)
(K: Actual number of inversions) (Inversion-only evolution)
BP vs K
21
True Evolutionary Distance Estimators
BP INV
IEBP
EDE
22
Regression Formula for E(INV)
• Let n be the number of genes, and k be the number of inversions
• We use nonlinear regression to obtain easily computable formulas for E(INV) and Var(INV):
-> b=0.5956, c=0.4577
20
2
[ ( , )]~ ( ) min{ , } ( )kE INV G G x bx k
f x x xn x cx b n
0)0( f1. 1)0(' f2.
0 ( )f x x 3.
4.1( )f y : 0 1y y exists for all
23
Absolute Difference Plot
• Motivation: Recall the criterion of Atteson’s Theorem. For all i,j:
2
1|| ijij dD
24
Error Tolerance of Neighbor Joining
Theorem [Atteson 1999]Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have
then neighbor joining returns T.
2
1|| ijij dD
25
Absolute Difference Plot
• Motivation: Recall the criterion of Atteson’s Theorem. For all i,j:
• Absolute difference plot:
− Compute d=d(G,G’)− Plot |d-k| (y-axis) vs k (x-axis)
G G’
NT Model, k rearrangement events
2
1|| ijij dD
26
120 Genes, Inversion-only Model
27
120 Genes, Inv:Transp=1:1
28
120 Genes, Transp Only
29
Using True Evolutionary Distance Helps
120 genes160 taxaUniformly random treesTranspositions/invertedtranspositions only(180 runs per figure)
5%
30
Variance of True Evolutionary Distance Estimators
• There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences) − Weighbor [Bruno et al. 2000]
uses the variance of good t.e.d.s, and yield more accurate trees than NJ.
• Variance estimates for the t.e.d.s [Wang WABI’02]− Weighbor(IEBP),
Weighbor(EDE) K vs Exact-IEBP (120 genes)
31
Using True Evolutionary Distance Helps
120 genes160 taxaUniformly random treesTranspositions/invertedtranspositions only(180 runs per figure)
5%
32
Robustness
• EDE− Assumes inversion-only− When used with NJ or Weighbor, gives very
accurate results even under transposition-only evolution
• IEBP?
33
IEBP is Robust to Model Violations
120 genes, 160 taxaUniformly Random Trees(alpha,beta)=(0,0) (inversion only)
NJ(Exact-IEBP)
34
Beta Splitting Model [Aldous 1995]
-2 -1.5 -1 0 20
Comb UniformlyRandom
Aldous’suggestion
Yule CompletelyBalanced
35
Effect of Beta on Accuracy(120 Genes, 160 Taxa)
(Beta= -1.5: Uniform, -1: Aldous, 0: Yule)
36
Number of Genes
37
Number of Taxa
38
Observations
• Number of taxa has little effect on the accuracy of trees (at least under our way of model tree generation)
• Datasets with more genes give more accurate trees
• Topology has effect on the accuracy of trees for NJ(BP) and NJ(INV); using corrected distances reduces this effect.
39
Summary
• True evolutionary distance estimators for gene order data
• When used with Neighbor Joining and Weighbor, produce highly accurate trees
• How model trees are generated has effect on the accuracy of tree reconstruction methods
40
Acknowledgements
• University of Texas Tandy Warnow Robert K. Jansen Randy Linder Stacia Wyman
• University of New Mexico Bernard M.E. Moret David Bader Jijun Tang Mi Yan
• Central Washington University Linda Raubeson
• University of Canterbury Mike Steel
• NSF• Packard Foundation
top related