Coalescence
DNA Replication
DNA Coalescence
A coalescent event occurs when two lineages of DNA molecules merge back into a single DNA molecule at some time in the past.
Gene Tree (all copies of homologous DNA coalesce to a common ancestral molecule)
COALESCENCE OF n COPIES OF HOMOLOGOUS DNA
Coalescence in an Ideal Population of N with Ploidy Level x
• Each act of reproduction is equally likely to involve any of the N individuals, with each reproductive event being an independent event
• Under these conditions, the probability that two gametes are drawn from the same parental individual is 1/N
• With ploidy level x, the probability of identity by descent/coalescence from the previous generation is (1/x)(1/N) = 1/(xN)
• In practice, real populations are not ideal, so we treat the population as if it were an ideal population whose size is the “inbreeding effective size” Nef. The probability of coalescence in one generation is therefore 1/(xNef)
Sample Two Genes at Random
The probability of coalescence exactly t generations ago is the probability of no coalescence for the first t-1 generations in the past followed by a coalescent event at generation t:
$$\text{Prob.(Coalesce at } t) = \left(1 - \frac{1}{xN_{ef}}\right)^{t-1}\left(\frac{1}{xN_{ef}}\right)$$
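The geometric form of this probability can be checked numerically. The sketch below is only an illustration: the function name and the example values (x = 2, Nef = 50) are assumptions, not values from the slides.

```python
def prob_coalesce_at(t, x, n_ef):
    """Probability that two gene copies coalesce exactly t generations ago."""
    p = 1.0 / (x * n_ef)              # per-generation coalescence probability
    return (1.0 - p) ** (t - 1) * p   # no coalescence for t-1 generations, then one

# Hypothetical example: diploid (x = 2) with inbreeding effective size 50.
x, n_ef = 2, 50
probs = [prob_coalesce_at(t, x, n_ef) for t in range(1, 5001)]
total = sum(probs)

print(probs[0])  # probability of coalescing in the previous generation, 1/(x*Nef)
print(total)     # approaches 1 as more generations are included
```

Because the probabilities over all t sum to 1, coalescence of the two copies is certain if one looks far enough into the past.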
Sample Two Genes at Random
The average time to coalescence is:
$$\text{Expected(Time to Coalesce)} = \sum_{t=1}^{\infty} t\left(1 - \frac{1}{xN_{ef}}\right)^{t-1}\left(\frac{1}{xN_{ef}}\right) = xN_{ef}$$
The variance of time to coalescence of two genes ($\sigma_{ct}^2$) is the average or expectation of $(t - xN_{ef})^2$:
$$\sigma_{ct}^2 = \sum_{t=1}^{\infty} \left(t - xN_{ef}\right)^2\left(1 - \frac{1}{xN_{ef}}\right)^{t-1}\left(\frac{1}{xN_{ef}}\right) = xN_{ef}(xN_{ef} - 1) = x^2N_{ef}^2 - xN_{ef}$$
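A minimal sketch that evaluates both infinite sums by truncation and compares them with the closed forms xNef and xNef(xNef - 1). The function name, the truncation point, and the example values are illustrative assumptions.

```python
def pairwise_moments(x, n_ef, t_max=10_000):
    """Truncated-series mean and variance of the pairwise coalescence time."""
    p = 1.0 / (x * n_ef)
    center = x * n_ef            # the slide's closed-form mean
    mean = 0.0
    var = 0.0
    no_coal = 1.0                # (1 - p)^(t - 1), updated each generation
    for t in range(1, t_max + 1):
        prob_t = no_coal * p
        mean += t * prob_t
        var += (t - center) ** 2 * prob_t
        no_coal *= 1.0 - p
    return mean, var

# Hypothetical example: x = 2, Nef = 50, so xNef = 100.
mean, var = pairwise_moments(2, 50)
print(mean, 2 * 50)                   # series vs. xNef
print(var, (2 * 50) * (2 * 50 - 1))   # series vs. xNef(xNef - 1)
```

The variance is close to the square of the mean, which is why the standard deviation of a pairwise coalescence time is nearly as large as its expectation.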
Sample n Genes at Random
$$\text{Number of pairs of genes} = \binom{n}{2} = \frac{n!}{(n-2)!\,2!} = \frac{n(n-1)}{2}$$
$$\text{Prob.(coalescence in the previous gen.)} = \binom{n}{2}\frac{1}{xN} = \frac{n(n-1)}{2xN}$$
$$\text{Prob.(no coalescence in the previous gen.)} = 1 - \frac{n(n-1)}{2xN}$$
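The pair count and the one-generation probabilities can be computed directly. This is a sketch with made-up values (n = 10, x = 2, N = 1000). The expression adds the pairwise probabilities, so it is an approximation that works when n is much smaller than xN and at most one pair is expected to coalesce per generation.

```python
from math import comb

def prob_coalescence_previous_gen(n, x, pop_size):
    """Approximate probability that some pair among n lineages coalesces in one generation."""
    pairs = comb(n, 2)                 # n(n-1)/2 pairs of lineages
    return pairs / (x * pop_size)

# Hypothetical example values.
n, x, pop_size = 10, 2, 1000
p_coal = prob_coalescence_previous_gen(n, x, pop_size)
print(comb(n, 2))     # 45 pairs
print(p_coal)         # 45 / 2000
print(1 - p_coal)     # probability of no coalescence
```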
Sample n Genes at Random
$$\text{Prob.(first coalescence in } t \text{ generations)} = \left(1 - \frac{n(n-1)}{2xN}\right)^{t-1}\frac{n(n-1)}{2xN}$$
$$E(\text{time to first coalescence}) = \sum_{t=1}^{\infty} t\left(1 - \frac{n(n-1)}{2xN}\right)^{t-1}\frac{n(n-1)}{2xN} = \frac{2xN}{n(n-1)}$$
$$\sigma_1^2 = \sum_{t=1}^{\infty}\left(t - \frac{2xN}{n(n-1)}\right)^2\left(1 - \frac{n(n-1)}{2xN}\right)^{t-1}\frac{n(n-1)}{2xN} = \frac{2xN}{n(n-1)}\left(\frac{2xN}{n(n-1)} - 1\right)$$
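As a rough check on the mean and variance of the time to the first coalescent event, the waiting time can be simulated as a geometric random variable. The helper names, the seed, the number of replicates, and the example values are all assumptions made for illustration.

```python
import math
import random

def first_coalescence_moments(n, x, pop_size):
    """Closed-form mean and variance of the time to the first coalescent event."""
    mean = 2 * x * pop_size / (n * (n - 1))
    return mean, mean * (mean - 1)

def simulate_first_coalescence(n, x, pop_size, replicates=20000, seed=1):
    """Draw geometric waiting times by inverse-transform sampling."""
    rng = random.Random(seed)
    p = n * (n - 1) / (2 * x * pop_size)
    log_q = math.log(1.0 - p)
    times = []
    for _ in range(replicates):
        u = 1.0 - rng.random()                      # uniform on (0, 1]
        times.append(int(math.log(u) / log_q) + 1)  # generations until first event
    mean = sum(times) / replicates
    var = sum((t - mean) ** 2 for t in times) / (replicates - 1)
    return mean, var

# Hypothetical example: n = 10 genes, x = 2, N = 1000.
theory_mean, theory_var = first_coalescence_moments(10, 2, 1000)
sim_mean, sim_var = simulate_first_coalescence(10, 2, 1000)
print(theory_mean, sim_mean)
print(theory_var, sim_var)
```

The simulated values fluctuate around the closed forms; the size of that fluctuation is itself an illustration of how variable coalescence times are.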
Sample n Genes at Random
Once the first coalescent event has occurred, we now have n-1 gene lineages, and therefore we simply repeat all the calculations with n-1 rather than n. In general, the expected time and variance between the (k-1)th coalescent event and the kth event are:
$$E(\text{time between } k-1 \text{ and } k \text{ coalescent events}) = \frac{2xN}{(n-k+1)(n-k)}$$
$$\sigma_k^2 = \frac{2xN}{(n-k+1)(n-k)}\left(\frac{2xN}{(n-k+1)(n-k)} - 1\right)$$
$$E(\text{time to coalescence of all } n \text{ genes}) = \sum_{k=1}^{n-1}\frac{2xN}{(n-k+1)(n-k)} = 2xN\left(1 - \frac{1}{n}\right)$$
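The closed form for the total expected time follows from a telescoping sum. The step is not written out on the slide; a short derivation is:

```latex
\frac{1}{(n-k+1)(n-k)} = \frac{1}{n-k} - \frac{1}{n-k+1}
\quad\Longrightarrow\quad
\sum_{k=1}^{n-1}\frac{2xN}{(n-k+1)(n-k)}
  = 2xN\sum_{k=1}^{n-1}\left(\frac{1}{n-k} - \frac{1}{n-k+1}\right)
  = 2xN\left(1 - \frac{1}{n}\right)
```

All intermediate terms cancel, leaving only the first term of the last interval (1) and the last term of the first interval (1/n).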
Sample n Genes at Random
The average times to the first and last coalescence are:
2xNef/[n(n-1)] and 2xNef(1-1/n)
• Let n = 10 and x = 2, then the time span covered by coalescent events is expected to range from 0.0444Nef to 3.6Nef.
• Let n = 100, then the time span covered by coalescent events is expected to range from 0.0004Nef to 3.96Nef.
• These equations imply that you do not need large samples to cover deep (old) coalescent events, but if you want to sample recent coalescent events, large sample sizes are critical.
• For n large, the expected coalescent time for all genes is 2xNef
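The two numerical examples can be reproduced with a few lines. This is a sketch; the function name is an assumption, and times are expressed in units of Nef so no population size is needed.

```python
def coalescent_time_span(n, x=2):
    """Expected times (in units of Nef) to the first and to the last coalescent event."""
    first = 2 * x / (n * (n - 1))
    last = 2 * x * (1 - 1 / n)
    return first, last

for n in (10, 100):
    first, last = coalescent_time_span(n)
    print(f"n={n}: {first:.4f} Nef to {last:.2f} Nef")
# n=10: 0.0444 Nef to 3.60 Nef
# n=100: 0.0004 Nef to 3.96 Nef
```

Going from 10 to 100 genes moves the oldest expected event only from 3.6Nef to 3.96Nef, but moves the most recent expected event roughly a hundredfold closer to the present.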
Sample n Genes at Random
The variance of time to coalescence of n genes is:
$$\sum_{k=1}^{n-1}\frac{2xN}{(n-k+1)(n-k)}\left(\frac{2xN}{(n-k+1)(n-k)} - 1\right) \approx 4x^2N^2\sum_{i=2}^{n}\frac{1}{i^2(i-1)^2}$$
• Note that in both the 2- and n-sample cases, the mean coalescent times are proportional to Nef and the variances are proportional to Nef².
• The Standard Molecular Clock is a Poisson Clock in Which the Mean = Variance.
• The Coalescent is a noisy evolutionary process with much inherent variation that cannot be eliminated by large n’s; it is innate to the evolutionary process itself and is called “evolutionary stochasticity.”
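To see how good the approximation for the total variance is, the exact sum and the approximate sum can be compared. This is a sketch with hypothetical values (n = 10, x = 2, N = 1000). The two differ exactly by the total expected time, 2xN(1 - 1/n), which is small relative to a quantity of order N².

```python
def total_variance_exact(n, x, pop_size):
    """Sum of the interval variances m_k (m_k - 1) over the n-1 coalescent intervals."""
    total = 0.0
    for k in range(1, n):
        m = 2 * x * pop_size / ((n - k + 1) * (n - k))
        total += m * (m - 1)
    return total

def total_variance_approx(n, x, pop_size):
    """Approximation that keeps only the squared terms."""
    s = sum(1 / (i ** 2 * (i - 1) ** 2) for i in range(2, n + 1))
    return 4 * x ** 2 * pop_size ** 2 * s

# Hypothetical example values.
n, x, pop_size = 10, 2, 1000
exact = total_variance_exact(n, x, pop_size)
approx = total_variance_approx(n, x, pop_size)
print(exact, approx, approx - exact)
```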
Buri’s Experiment on Genetic Drift
Generation   Number of Populations Fixed for bw   Number of Populations Fixed for bw75
1            0                                    0
2            0                                    0
3            0                                    0
4            0                                    1
5            0                                    2
6            1                                    3
7            3                                    3
8            5                                    5
9            5                                    6
10           7                                    8
11           11                                   10
12           12                                   17
13           12                                   18
14           14                                   21
15           18                                   23
16           23                                   25
17           26                                   26
18           27                                   28
19           30                                   28
[Figure: Fixation (Coalescence) Times in 105 Replicates of the Same Evolutionary Process. Axes: Number of bw75 Alleles (0 to 32) and Generation of Fixation.]
Problem: No Replication With Most Real Data Sets. Only 1 Realization.
Evolutionary Stochasticity
Using the standard molecular clock and an estimate of μ of 10^-8 per year, the time to coalescence of all mtDNA to a common ancestral molecule has been estimated to be 290,000 years ago (Stoneking et al. 1986). This figure of 290,000 years, however, is subject to much error because of evolutionary stochasticity. When evolutionary stochasticity is taken into account (ignoring sampling error, measurement error, and the considerable ambiguity in μ), the 95% confidence interval around 290,000 is 152,000 years to 473,000 years (Templeton 1993) -- a span of over 300,000 years!
Coalescence of mtDNA in an Ideal Population of N♀ Haploids
• Each act of reproduction is equally likely to involve any of the N♀ individuals, with each reproductive event being an independent event
• Under these conditions, the probability that two gametes are drawn from the same parental individual is 1/N♀
• Under haploidy, the probability of identity by descent/coalescence from the previous generation is (1)(1/N♀) = 1/(N♀)
• In practice, real populations are not ideal, so we treat the population as if it were an ideal population whose size is the “inbreeding effective size” Nef♀. The probability of coalescence in one generation is therefore 1/(Nef♀)
Expected Coalescence Times for a Large Sample of Genes
Mitochondrial DNA: 2Nef♀ = Nef (if Nef♀ = Nef/2)
Y-Chromosomal DNA: 2Nef♂ = Nef (if Nef♂ = Nef/2)
X-Linked DNA: 3Nef
Autosomal DNA: 4Nef
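These four values all come from the large-sample result 2xNef with different ploidy levels and effective sizes. The sketch below makes that explicit. The value x = 1.5 for X-linked DNA is an assumption chosen to reproduce the slide's 3Nef, and Nef = 10,000 is a made-up number.

```python
def expected_time_large_sample(x, n_ef):
    """Large-sample expected coalescence time of all genes, 2*x*Nef."""
    return 2 * x * n_ef

n_ef = 10_000  # hypothetical inbreeding effective size

times = {
    # Haploid, uniparental: x = 1 and only one sex contributes (Nef/2).
    "Mitochondrial DNA": expected_time_large_sample(1, n_ef / 2),
    "Y-Chromosomal DNA": expected_time_large_sample(1, n_ef / 2),
    # Assumed effective ploidy of 1.5 for the X chromosome.
    "X-Linked DNA": expected_time_large_sample(1.5, n_ef),
    # Diploid autosomes: x = 2.
    "Autosomal DNA": expected_time_large_sample(2, n_ef),
}

for region, generations in times.items():
    print(f"{region}: {generations / n_ef:g} Nef")
```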
Estimated Coalescence Times for 24 Human Loci
[Figure: TMRCA (in millions of years, axis 0 to 9) plotted by locus, grouped as Uniparental Haploid DNA Regions, X-Linked Loci, and Autosomal Loci. Locus labels: Y-DNA, mtDNA, MAO, FIX, MSN/ALAS2, Xq13.3, G6PD, HS571B2, APLX, AMELX, TNFSF5, RRM2P4, PDHA1, MC1R, ECP, EDN, MS205, HFE, Hb-Beta, CYP1A2, FUT6, Lactase, CCR5, FUT2, MX1.]
Coalescence With Mutation
Mutation Creates Variation and Destroys Identity by Descent
Coalescence Before Mutation
$$\text{Prob.(coalescence before mutation)} = \text{Prob.(identity by descent)} = \left(1 - \frac{1}{xN_{ef}}\right)^{t-1}\left(\frac{1}{xN_{ef}}\right)(1-\mu)^{2t}$$
$$= \left(\text{Prob. of no coalescence for } t-1 \text{ gen.}\right) \times \left(\text{Prob. of coalescence at gen. } t\right) \times \left(\text{Prob. of no mutation in } 2t \text{ DNA replications}\right)$$
Mutation Before Coalescence
$$\text{Prob.(mutation before coalescence)} = \left(1 - \frac{1}{xN_{ef}}\right)^{t} 2\mu(1-\mu)^{2t-1}$$
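Mutation and Coalescence: Genetic Diversity = Expected Heterozygosity (where θ = 2xNefμ)
$$\text{Prob.(mutation before coalescence} \mid \text{mutation or coalescence)} = \frac{2\mu(1-\mu)^{2t-1}\left(1 - \frac{1}{xN_{ef}}\right)^{t}}{2\mu(1-\mu)^{2t-1}\left(1 - \frac{1}{xN_{ef}}\right)^{t} + \frac{1}{xN_{ef}}(1-\mu)^{2t}\left(1 - \frac{1}{xN_{ef}}\right)^{t-1}} = \frac{2xN_{ef}\mu - 2\mu}{2xN_{ef}\mu - 3\mu + 1}$$
$$\frac{2xN_{ef}\mu - 2\mu}{2xN_{ef}\mu - 3\mu + 1} \approx \frac{2xN_{ef}\mu}{2xN_{ef}\mu + 1} = \frac{\theta}{\theta + 1}$$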
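A short numerical check that the conditional probability does not depend on t, and that the θ/(θ + 1) approximation is close to the exact ratio. The function names and the example values (x = 2, Nef = 10,000, μ = 10^-5) are illustrative assumptions.

```python
def prob_mutation_first_raw(t, x, n_ef, mu):
    """Conditional probability computed from the two generation-t terms."""
    p = 1.0 / (x * n_ef)
    mutation = 2 * mu * (1 - mu) ** (2 * t - 1) * (1 - p) ** t
    coalescence = p * (1 - mu) ** (2 * t) * (1 - p) ** (t - 1)
    return mutation / (mutation + coalescence)

def prob_mutation_first_exact(x, n_ef, mu):
    """Simplified exact ratio: (2xNef*mu - 2mu) / (2xNef*mu - 3mu + 1)."""
    a = 2 * x * n_ef * mu
    return (a - 2 * mu) / (a - 3 * mu + 1)

def expected_heterozygosity(x, n_ef, mu):
    """Approximation theta / (theta + 1) with theta = 2*x*Nef*mu."""
    theta = 2 * x * n_ef * mu
    return theta / (theta + 1)

# Hypothetical example values.
x, n_ef, mu = 2, 10_000, 1e-5
print(prob_mutation_first_raw(1, x, n_ef, mu))
print(prob_mutation_first_raw(50, x, n_ef, mu))   # same value: t cancels
print(prob_mutation_first_exact(x, n_ef, mu))
print(expected_heterozygosity(x, n_ef, mu))       # theta = 0.4 gives 0.4 / 1.4
```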
Gene Vs. Allele (Haplotype) Tree
Gene Trees vs. Haplotype Trees
Gene trees are genealogies of genes. They describe how different copies at a homologous gene locus are “related” by ordering coalescent events.
The only branches in the gene tree that we can observe from sequence data are those marked by a mutation. All branches in the gene tree that are caused by DNA replication without mutation are not observable. Therefore, the tree observable from sequence data retains only those branches in the gene tree associated with a mutational change. This lower resolution tree is called an allele or haplotype tree.
The allele or haplotype tree is the gene tree in which all branches not marked by a mutational event are collapsed together.
Unrooted Haplotype Tree
The Inversion Tree Is Not Always The Same As A Tree of Species Or Populations, In This Case Because of Trans-specific Polymorphism
Haplotype trees are not new in population genetics; they have been around in the form of inversion trees since the 1930’s.
Haplotype Trees Can Coalesce Both Within And Between Species
The human MHC region fits this pattern; it takes 35 million years to coalesce, so humans and monkeys share polymorphic clades.
Ebersberger et al. (2007) Estimated Trees From 23,210 DNA Sequences In Apes & Rhesus Monkey:
Below Are The Numbers That Significantly Resolved the “Species Tree”
Haplotype Trees ≠ Species or Population Trees
It is dangerous to equate a haplotype tree to a species tree.
It is NEVER justified to equate a haplotype tree to a tree of populations
within a species because the problem of lineage sorting is greater and the time
between events is shorter. Moreover, a population tree need not exist at all.
Homoplasy & The Infinite Sites Model
• Homoplasy is the phenomenon of independent mutations (& many gene conversion events) yielding the same genetic state.
• Homoplasy represents a major difficulty when trying to reconstruct evolutionary trees, whether they are haplotype trees or the more traditional species trees of evolutionary biology.
• It is common in coalescent theory (and molecular evolution in general) to assume the infinite sites model in which each mutation occurs at a new nucleotide site.
• Under this model, there is no homoplasy because no nucleotide site can ever mutate more than once. Each mutation creates a new haplotype.
Homoplasy & The Infinite Sites Model
Motif                                               Number of Nucleotides in Motif   Number of Polymorphic Nucleotides   Percent Polymorphic
CG                                                  198                              19                                  9.6%
Polymerase arrest sites with motif TG(A/G)(A/G)GA   264                              8                                   3.0%
Mononucleotide runs ≥ 5 nucleotides                 456                              15                                  3.3%
All other sites                                     8,777                            46                                  0.5%
The distribution of polymorphic nucleotide sites in a 9.7 kb region of the human Lipoprotein Lipase gene over nucleotides associated with three known mutagenic motifs and all remaining nucleotide positions.
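The percentages in the table can be recomputed from the counts, which also shows how strongly the mutagenic motifs are enriched for polymorphism relative to all other sites. A small sketch, with variable names chosen here for illustration:

```python
# (motif, nucleotides in motif, polymorphic nucleotides), taken from the table.
motifs = [
    ("CG", 198, 19),
    ("Polymerase arrest sites TG(A/G)(A/G)GA", 264, 8),
    ("Mononucleotide runs >= 5 nucleotides", 456, 15),
    ("All other sites", 8777, 46),
]

percent = {name: 100 * poly / total for name, total, poly in motifs}
baseline = percent["All other sites"]

for name, value in percent.items():
    # Fold enrichment relative to sites outside the three motifs.
    print(f"{name}: {value:.1f}% polymorphic, {value / baseline:.1f}x baseline")

total_sites = sum(total for _, total, _ in motifs)
print(total_sites)  # 9695 nucleotides, matching the stated 9.7 kb region
```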
E.g., Apoprotein E Gene Region
[Figure: map of the Apoprotein E gene region on a scale from 0 to 5.5, showing Exons 1 to 4 and the polymorphic sites 73, 308, 471, 545, 560, 624, 832, 1163, 1522, 1575, 1998, 2440, 2907, 3106, 3673, 3701*, 3937, 4036, 4075, 4951, 5229A, 5229B, and 5361.]
No recombination has been detected in this region.
The Apo-protein E Haplotype Tree
[Figure: haplotype tree of the numbered Apo-protein E haplotypes, with branches labeled by the mutated site and the chimpanzee as outgroup.]
The Apo-protein E Haplotype Tree
[Figure: alternative placements of the mutations at sites 560, 624, and 1575 in part of the tree, compared under A. Maximum Parsimony and B. Statistical Parsimony.]
Use a finite sites mutation model that allows homoplasy. It can be shown that the probability of homoplasy between two nodes increases with the number of observed mutational differences between them. Therefore, allocate homoplasies to longer branches. This is called “Statistical Parsimony” because you can use models to calculate the probability of violating parsimony for a given branch length.
The Apo-protein E Statistical Parsimony Haplotype Tree
In this case, most of the homoplasy is associated with Alu sequences, a common repeat type in the human genome that is known to cause local gene conversion, which mimics the effects of parallel mutations.
Homoplasy is still common, as shown by circled mutations.
Estimated Times To Common Ancestor (Method of Takahata et al. 2001)
Dh = nucleotide differences within humans
Dhc = nucleotide differences between humans and chimps
[Figure: human and chimp lineages, with a node labeled “6 Million Years Ago”.]
TMRCA = 12Dh/Dhc
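The formula is a simple ratio scaled by the factor 12 given on the slide. The sketch below treats the result as being in millions of years, to match the 6-million-year label. The function name, the `factor` parameter, and the input values are hypothetical.

```python
def tmrca_million_years(d_h, d_hc, factor=12.0):
    """TMRCA = factor * Dh / Dhc, with the slide's factor of 12 as the default."""
    if d_hc <= 0:
        raise ValueError("Dhc must be positive")
    return factor * d_h / d_hc

# Hypothetical divergences (per-site nucleotide differences), not real data.
d_h = 0.001    # within humans
d_hc = 0.012   # between humans and chimps
print(tmrca_million_years(d_h, d_hc))  # about 1 (million years)
```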
The Apo-protein E Haplotype Coalescent
[Figure: coalescent tree of the numbered haplotypes with the mutations at the polymorphic sites placed on its branches; vertical axis: years (x 10^5), from 0 to 3.2.]
Estimate the distribution of the age of the haplotype or clade as a Gamma Distribution (Kimura, 1970) with mean T = 4N (or N for mtDNA) and variance T^2/(1+k) (Tajima, 1983), where k is the average pairwise divergence among present-day haplotypes derived from the haplotype being aged, measured as the number of nucleotide differences.
NOTE: VARIANCE INCREASES WITH INCREASING T AND DECREASING k!
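A gamma distribution with mean T and variance T²/(1 + k) has shape 1 + k and scale T/(1 + k). The sketch below converts (T, k) to those parameters and shows how the spread changes with k. The function name and the example values are assumptions, not estimates from the data.

```python
import math

def gamma_age_parameters(t_mean, k):
    """Shape and scale of a gamma distribution with mean T and variance T^2 / (1 + k)."""
    shape = 1 + k
    scale = t_mean / (1 + k)
    return shape, scale

# Hypothetical example: mean age of 200,000 years for two values of k.
for k in (0.5, 3):
    shape, scale = gamma_age_parameters(200_000, k)
    mean = shape * scale
    sd = math.sqrt(shape) * scale
    print(f"k={k}: shape={shape}, scale={scale:.0f}, mean={mean:.0f}, sd={sd:.0f}")
```

The coefficient of variation is 1/sqrt(1 + k), so a clade with few accumulated differences has an age estimate whose standard deviation is nearly as large as the estimate itself.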
The Apo-protein E Haplotype Coalescent
[Figure: the same coalescent tree (vertical axis: years x 10^5, from 0 to 3.2), shown with age distributions f(t) plotted against years (x 10^5).]
Because of Deviations From The Infinite Sites Model, Corrections Must Also be Made in How We Count the
Number of Mutations That Occurred in The Coalescent Process.
The Basic Idea of Coalescence Is That Any Two Copies of Homologous DNA Will Coalesce Back To An Ancestral
Molecule Either Within Or Between Species
[Figure: two copies of homologous DNA coalescing to an ancestral molecule at time t in the past.]
Mutations Can Accumulate in the Two DNA Lineages During This Time, t, to Coalescence. We Quantify This Mutational
Accumulation Through A Molecule Genetic Distance
[Figure: two DNA lineages diverging over time t, one accumulating X mutations and the other Y mutations.]
Molecule Genetic Distance = X + Y. If μ = the neutral substitution rate, then the Expected Value of X = μt and the Expected Value of Y = μt, so the Expected Value of the Genetic Distance = 2μt
Complication: Only under the infinite sites model are X + Y directly observable; otherwise X + Y ≥ the observed number of differences. Use models of DNA mutation to correct for undercounting.
Molecule Genetic Distance = X + Y = 2μt
THE JUKES-CANTOR GENETIC DISTANCE
Consider a single nucleotide site that has a probability μ of mutating per unit time (only neutral mutations are allowed). This model assumes that when a nucleotide site mutates it is equally likely to mutate to any of the three other nucleotide states. Suppose further that mutation is such a rare occurrence that in any time unit it is only likely for at most one DNA lineage to mutate and not both. Finally, let p_t be the probability that the nucleotide site is in the same state in the two DNA molecules being compared given they coalesced t time units ago. Note that p_t refers to identity by state and is observable from the current sequences. Then,
$$p_{t+1} = p_t(1-\mu)^2 + (1-p_t)\,2\mu/3 \approx (1-2\mu)p_t + 2\mu(1-p_t)/3$$
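$$p_{t+1} \approx (1-2\mu)p_t + 2\mu(1-p_t)/3$$
$$\Delta p = p_{t+1} - p_t = -2\mu p_t + 2\mu(1-p_t)/3 = -\tfrac{8}{3}\mu p_t + \tfrac{2}{3}\mu$$
Approximating the above by a differential equation yields:
$$\frac{dp_t}{dt} = -\tfrac{8}{3}\mu p_t + \tfrac{2}{3}\mu$$
whose solution, starting from identical molecules ($p_0 = 1$), is
$$p_t = \left(1 + 3e^{-8\mu t/3}\right)/4$$
Extract 2μt from the equation given above:
$$p_t = \tfrac{1}{4} + \tfrac{3}{4}e^{-8\mu t/3}$$
$$\tfrac{3}{4}e^{-8\mu t/3} = p_t - \tfrac{1}{4}$$
$$-\tfrac{8}{3}\mu t = \ln\left(\tfrac{4}{3}p_t - \tfrac{1}{3}\right)$$
$$2\mu t = -\tfrac{3}{4}\ln\left(\tfrac{4}{3}p_t - \tfrac{1}{3}\right) = D_{JC}$$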
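The step from the recursion to the exponential solution can be checked by iterating the recursion directly. This is a sketch under assumed values (μ = 10^-4 per time unit, t = 5000), starting from p_0 = 1 because the two molecules are identical at the moment of coalescence.

```python
import math

def identity_by_recursion(mu, t):
    """Iterate p_{t+1} = (1 - 2mu) p_t + 2mu (1 - p_t) / 3 starting from p_0 = 1."""
    p = 1.0
    for _ in range(t):
        p = (1 - 2 * mu) * p + 2 * mu * (1 - p) / 3
    return p

def identity_closed_form(mu, t):
    """Solution of the differential-equation approximation."""
    return (1 + 3 * math.exp(-8 * mu * t / 3)) / 4

# Hypothetical example values.
mu, t = 1e-4, 5000
print(identity_by_recursion(mu, t))
print(identity_closed_form(mu, t))
print(identity_closed_form(mu, 10 ** 7))  # approaches 1/4 for very long times
```

Both curves decay toward 1/4, the chance that two unrelated nucleotides match under equal base frequencies.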
$$D_{JC} = -\tfrac{3}{4}\ln\left(\tfrac{4}{3}p_t - \tfrac{1}{3}\right)$$
The above equation refers to only a single nucleotide, so p_t is either 0 or 1. Hence, this equation will not yield biologically meaningful results when applied to just a single nucleotide. Therefore, Jukes and Cantor (1969) assumed that the same set of assumptions is valid for all the nucleotides in the sequenced portion of the two molecules being compared. Defining π as the observed number of nucleotides that are different divided by the total number of nucleotides being compared, Jukes and Cantor noted that p_t is estimated by 1 - π. Hence, substituting 1 - π for p_t yields:
$$2\mu t = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}\pi\right) \equiv D_{JC}$$
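A minimal implementation of the Jukes-Cantor distance from counts of differing sites. The function name and the example counts are hypothetical. The guard reflects the fact that the logarithm is undefined once π reaches 3/4.

```python
import math

def jukes_cantor_distance(n_diff, n_sites):
    """D_JC = -(3/4) ln(1 - (4/3) pi), with pi the observed proportion of differing sites."""
    pi = n_diff / n_sites
    if pi >= 0.75:
        raise ValueError("pi must be below 3/4 for the distance to be defined")
    return -0.75 * math.log(1 - 4 * pi / 3)

# Hypothetical comparison: 10 differences in 1000 aligned nucleotides.
d = jukes_cantor_distance(10, 1000)
print(d)  # slightly larger than pi = 0.01, correcting for unseen multiple hits
```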
Molecule Genetic Distance = X + Y = 2μt
THE KIMURA 2-PARAMETER GENETIC DISTANCE
The Jukes and Cantor genetic distance model assumes neutrality and that mutations occur with equal probability to all 3 alternative nucleotide states. However, for some DNA, there can be a strong transition bias (e.g., mtDNA):
[Figure: the four nucleotides arranged as pyrimidines (T, C) and purines (A, G); transitions within each class occur at rate α, and each transversion between the classes occurs at rate β.]
where α is the rate of transition substitutions, and 2β is the rate of transversion substitutions. The total rate of substitution (mutation) is μ = α + 2β.
Kimura (J. Mol. Evol. 16: 111-120, 1980) showed that
$$\text{GENETIC DISTANCE} = D_t = 2(\alpha + 2\beta)t = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$$
where P is the observed proportion of homologous nucleotide sites that differ by a transition, and Q is the observed proportion of homologous nucleotide sites that differ by a transversion.
Note that if α = β (no transition bias), then we expect P = Q/2, so π = P + Q = (3/2)Q, or Q = (2/3)π. This yields the Jukes and Cantor distance, which is therefore a special case of the Kimura Distance.
If α >> β (large transition bias), as t gets large, P converges to 1/4 regardless of time, while Q is still sensitive to time. Therefore, for large times and with molecules showing an extreme transition bias, the distances depend increasingly only on the transversions. Therefore, you can get a big discrepancy between these two distances when a transition bias exists and when t is large enough.
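The relationship between the two distances can be seen by computing both for the same total proportion of differences. This sketch uses made-up proportions and function names chosen for illustration.

```python
import math

def kimura_2p_distance(p_transitions, q_transversions):
    """Kimura distance: -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    a = 1 - 2 * p_transitions - q_transversions
    b = 1 - 2 * q_transversions
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the distance to be defined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

def jukes_cantor_distance(pi):
    """Jukes-Cantor distance from the total proportion of differing sites."""
    return -0.75 * math.log(1 - 4 * pi / 3)

# Hypothetical proportions, both with pi = P + Q = 0.03.
no_bias = kimura_2p_distance(0.01, 0.02)   # P = Q/2, as expected without bias
biased = kimura_2p_distance(0.02, 0.01)    # excess of transitions
print(no_bias, jukes_cantor_distance(0.03))  # equal: Jukes-Cantor is the special case
print(biased)                                # larger than the Jukes-Cantor value
```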
Molecule Genetic Distance = X + Y = 2μt
You can have up to a 12 parameter model for just a single nucleotide (a parameter for each arrowhead). You can add many more parameters if you consider more than 1 nucleotide at a time.
[Figure: the pyrimidine (T, C) and purine (A, G) substitution diagram, with transition arrows labeled α and transversion arrows labeled β.]
If distances are small (Dt ≤ 0.05), most alternatives give about the same value, so people mostly use Jukes and Cantor, the simplest distance. Above 0.05, you need to investigate the properties of your data set more carefully. ModelTest can help you do this (I emphasize help because ModelTest gives some statistical criteria for evaluating 56 different models -- but conflicts frequently arise across criteria, so judgment is still needed). LOOK AT YOUR DATA!
Recombination Can Create Complex Networks Which Destroy the “Treeness” of the Relationships Among Haplotypes.
(Templeton et al., Am. J. Hum. Genet. 66: 69-83, 2000)
[Figure: Number of Recombination Events (vertical axis, 0 to 18) plotted along a horizontal axis running from 0 to 10000.]
Region of Overlap of the Inferred Intervals Of All 26 Recombination and Gene Conversion Events Not Likely to Be Artifacts.
LD in the human LPL gene
Recombination is not uniformly distributed in the human genome, but rather is concentrated into “hotspots” that separate regions of low to no recombination.
[Figure: pairwise |D’| across the LPL gene. Legend: Significant |D’|; Non-significant |D’|; Too few observations for any |D’| to be significant.]
Haplotype trees can be estimated for these two regions, but not for the entire LPL region.
Because of the random mating equation
$$D_t = D_0(1-r)^t$$
linkage disequilibrium is often interpreted as an indicator of the amount of recombination. This is justifiable when recombination is common relative to mutation. However, in regions of little to no recombination, the pattern of disequilibrium is determined primarily by the historical conditions that existed at the time of mutation, that is, the haplotype tree.
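The decay equation shows why disequilibrium persists for a very long time when recombination is rare. A small sketch; the function names and the recombination rates are illustrative assumptions.

```python
import math

def ld_after(d0, r, t):
    """Linkage disequilibrium after t generations of random mating: D_t = D_0 (1 - r)^t."""
    return d0 * (1 - r) ** t

def ld_half_life(r):
    """Generations needed for D to fall to half its starting value."""
    return math.log(0.5) / math.log(1 - r)

print(ld_after(0.25, 0.5, 3))    # unlinked loci: D halves every generation
print(ld_half_life(0.5))         # 1 generation
print(ld_half_life(0.001))       # roughly 693 generations for tightly linked sites
```

With r near zero, D reflects the history of the mutations that created the haplotypes far more than it reflects recombination.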
Apoprotein E Gene Region
[Figure: map of the Apoprotein E gene region on a scale from 0 to 5.5, showing Exons 1 to 4 and the polymorphic sites.]
These Two Sites Are in Strong Disequilibrium in All Samples
These Two Sites Show No Significant Disequilibrium in Any Sample
Note, African-Americans Have More D Than Europeans & EA Because of Admixture: Not All D Reflects Linkage
The Apo-protein E Haplotype Tree
[Figure: haplotype tree of the numbered Apo-protein E haplotypes, with branches labeled by the mutated site, including repeated mutations at sites 560, 624, 832, 1575, 1998, 4951, and 5361.]
These haplotypes are T at site 832 and C at site 3937.
These haplotypes are G at site 832 and T at site 3937.
These mutations are well separated in time and show little D.
These mutations are close in time and show much disequilibrium.
All Four Gametes Exist Because of Homoplasy, Not Recombination
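Counting the distinct two-site combinations in a sample is a common quick check for recombination: under the infinite sites model, seeing all four combinations would imply a recombination event. The sketch below uses made-up haplotypes at sites 832 and 3937 to show the count. As the slide notes, in this region the fourth combination arises from homoplasy, so the count alone cannot distinguish the two causes.

```python
def two_site_gametes(haplotypes, site_a, site_b):
    """Set of distinct (state at site_a, state at site_b) combinations in the sample."""
    return {(h[site_a], h[site_b]) for h in haplotypes}

# Hypothetical haplotypes: dictionaries mapping site number to nucleotide.
sample = [
    {832: "T", 3937: "C"},
    {832: "G", 3937: "T"},
    {832: "T", 3937: "T"},
    {832: "G", 3937: "C"},
]

gametes = two_site_gametes(sample, 832, 3937)
print(sorted(gametes))
print(len(gametes) == 4)  # all four gametes are present
```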