allele frequencies as stochastic processes: mathematical & statistical approaches
DESCRIPTION
Presented at Animal Breeding & Genomics Seminar. University of Wisconsin-Madison.
TRANSCRIPT
Allele Frequencies as Stochastic Processes: Mathematical and Statistical Approaches
Gota Morota
Nov 30, 2010
1 / 32
Outline
Change of Allele Frequencies as Stochastic Processes
Steady State Distributions of Allele Frequencies
Time Series Analysis
2 / 32
Various factors affecting allele frequencies
• Selection, mutation and migration (cross breeding) ⇒ systematic pressures (Wright 1949)
• Random fluctuations
  1. Random sampling of gametes (genetic drift)
  2. Random fluctuation in systematic pressures
⇓
Allele frequencies are functions of the systematic forces and the random components
5 / 32
Random Walk ⇒ Brownian Motion
[Figures 1-4: simulated random-walk paths plotted against Time. Figure 1: Time = [1:10]. Figure 2: Time = [1:100]. Figure 3: Time = [1:1000]. Figure 4: Time = [1:10000].]
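The passage from a random walk to Brownian motion can be illustrated with a short simulation. This is a minimal sketch, not the code behind Figures 1-4: the ±1 step distribution, the function name `scaled_random_walk`, and the seeds are assumptions made for illustration. Each step is scaled by sqrt(dt), so the end point of the walk at time t should have mean close to 0 and variance close to t.

```python
import math
import random


def scaled_random_walk(n_steps, t_end=1.0, seed=0):
    """Random walk on [0, t_end] whose steps are scaled by sqrt(dt).

    Assumption: steps are +1/-1 with equal probability (mean 0, variance 1).
    """
    rng = random.Random(seed)
    dt = t_end / n_steps
    position = 0.0
    times, path = [0.0], [0.0]
    for k in range(1, n_steps + 1):
        # A +/-1 step times sqrt(dt) has variance dt, matching Brownian motion.
        position += rng.choice((-1.0, 1.0)) * math.sqrt(dt)
        times.append(k * dt)
        path.append(position)
    return times, path


# End points of many independent walks approximate N(0, t_end).
end_points = [scaled_random_walk(200, t_end=1.0, seed=s)[1][-1] for s in range(2000)]
mean_end = sum(end_points) / len(end_points)
var_end = sum((e - mean_end) ** 2 for e in end_points) / len(end_points)
print(f"mean = {mean_end:.3f}, variance = {var_end:.3f}")
```

Increasing `n_steps` while keeping `t_end` fixed refines the path without changing the variance of its end point, which is the scaling under which the walk approaches Brownian motion.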
6 / 32
Brownian Motion ⇒ Diffusion Model
[Figure 5: random-walk path plotted against Time, Time = [1:10000].]
+ conditional on systematic forces
• treat the change of allele frequencies as a stochastic process
⇓
Diffusion Model
7 / 32
Diffusion Model
It frames the infinite number of paths that allele frequencies could take over time under certain systematic pressures.
[Figure: three panels of simulated allele-frequency paths; x-axis: Time (0 to 10000); y-axis: Allele Frequency.]
• pick a single time point t (say 5000 in the figure above)
• try to find the PDF at point t
• need to solve a partial differential equation (PDE)
• Fokker-Planck Equation!
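The idea of "many paths, one time point" can be sketched by brute force before turning to the PDE. The example below assumes a Wright-Fisher population under pure drift (binomial sampling of 2N gametes); the population size, number of paths, and the helper name `wright_fisher_path` are illustrative choices, not values from the talk. It collects the allele frequency of every path at one generation t and bins it into a crude histogram, which is an empirical stand-in for the PDF that the diffusion model describes.

```python
import random


def wright_fisher_path(p0, two_n, generations, rng):
    """One allele-frequency path under pure drift (binomial gamete sampling)."""
    x = p0
    path = [x]
    for _ in range(generations):
        # Draw 2N gametes; each carries the allele with probability x.
        copies = sum(1 for _ in range(two_n) if rng.random() < x)
        x = copies / two_n
        path.append(x)
    return path


rng = random.Random(1)
paths = [wright_fisher_path(0.5, two_n=40, generations=30, rng=rng) for _ in range(400)]

# Slice all paths at a single time point t and summarise the distribution.
t = 15
freqs_at_t = [path[t] for path in paths]
bins = [0] * 5  # five equal-width bins on [0, 1]
for f in freqs_at_t:
    bins[min(int(f * 5), 4)] += 1
print("histogram of allele frequency at t =", t, ":", bins)
```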
8 / 32
Fokker-Planck Equation
• Derived from a continuous time stochastic process (X)
• Partial differential equation

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left\{V_{\delta x}\,\phi(p, x; t)\right\} - \frac{\partial}{\partial x}\left\{M_{\delta x}\,\phi(p, x; t)\right\} \qquad (1)$$

where
• p: initial allele frequency (fixed)
• x: allele frequency (random variable)
• t: time (continuous variable)
• φ(p, x; t): PDF
• $V_{\delta x}$: variance of δx (amount of change in allele frequency per unit time)
• $M_{\delta x}$: mean of δx (amount of change in allele frequency per unit time)
• $V_{\delta x}$ and $M_{\delta x}$: both may depend on x and t
9 / 32
Fokker-Planck Equation for Brownian Motion
A standard Brownian motion can be constructed from a random walk whose steps have mean 0 and variance 1, under the right scaling. At time t it has the PDF of N(0, t).
• when t = 1.0, N(0, 1)
• when t = 1.5, N(0, 1.5)
Fokker-Planck equation:

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(p, x; t) \qquad (2)$$

$$= \text{Heat equation} \qquad (3)$$

$M_{\delta x} = 0$ and $V_{\delta x} = 1$ in equation (1)
Solution:

$$\phi(p, x; t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right) \qquad (4)$$
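That the N(0, t) density in equation (4) solves the heat equation (2) can be checked numerically. The sketch below approximates both sides of equation (2) with central finite differences; the step size `h` and the evaluation points are arbitrary illustrative choices.

```python
import math


def heat_kernel(x, t):
    """Density of N(0, t) evaluated at x, as in equation (4)."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def heat_residual(x, t, h=1e-3):
    """Finite-difference estimate of d(phi)/dt - (1/2) d^2(phi)/dx^2."""
    d_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
    d2_dx2 = (heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t)
              + heat_kernel(x - h, t)) / (h * h)
    return d_dt - 0.5 * d2_dx2


for t in (1.0, 1.5):
    for x in (-1.0, 0.0, 0.5, 2.0):
        print(f"t = {t}, x = {x:+.1f}, residual = {heat_residual(x, t):.2e}")
```

The residuals should be close to zero (limited only by the finite-difference error), which is what equation (2) requires.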
10 / 32
Solution of the Heat Equation (the Heat Kernel)
[Figure: heat kernel plotted against x on [−2, 2] for t = 0.00001, t = 0.01, t = 0.1, t = 1 and t = 10.]
11 / 32
Under Random Genetic Drift
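$$M_{\delta x} = 0, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$

Fokker-Planck equation for random genetic drift:

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} \qquad (5)$$

Solutions are obtained as infinite series by
• Kimura (1955): hypergeometric function
• Korn and Korn (1968): Gegenbauer polynomial

$$\phi = 6p(1 - p)\exp\left(-\frac{1}{2N_e}t\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3}{2N_e}t\right) + \cdots$$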
12 / 32
Solution of FPE (Kimura 1955)
[Page excerpt: GENETICS: MOTOO KIMURA, Vol. 41, 1955, p. 149.]
FIGS. 1-2. The processes of the change in the probability distribution of heterallelic classes, due to random sampling of gametes in reproduction. It is assumed that the population starts from the gene frequency 0.5 in Fig. 1 (left) and 0.1 in Fig. 2 (right). t = time in generations; N = effective size of the population; abscissa is gene frequency; ordinate is probability density.
The probability of heterozygosis is calculated by equation (15):

$$H_t = \int_0^1 2x(1 - x)\,\phi(x, t)\,dx = \sum_{i=1}^{\infty} \frac{(2i + 1)\,pq}{i(i + 1)}\,T^1_{i-1}(1 - 2p)\int_{-1}^{1}(1 - z^2)\,T^1_{i-1}(z)\,e^{-[i(i+1)/4N]t}\,dz.$$

By virtue of equation (14) (put m = 0), the last integral is 0 except for i = 1. Hence

$$H_t = pq \cdot \frac{3}{2} \cdot \frac{4}{3}\,e^{-(1/2N)t} = 2pq\,e^{-(1/2N)t} = H_0\,e^{-(1/2N)t}, \qquad (18)$$

showing that the heterozygosis decreases exactly at the rate of 1/(2N) per generation. This is readily confirmed by a simple calculation: Let p be the frequency of A in the population, where the frequency of the heterozygotes is 2p(1 − p). The amount of heterozygosis to be expected after one generation of random sampling of the gametes is

$$E\{2(p + \delta p)(1 - p - \delta p)\} = 2p(1 - p) - 2E(\delta p)^2 = 2p(1 - p) - 2\,\frac{p(1 - p)}{2N} = \left(1 - \frac{1}{2N}\right)2p(1 - p),$$

as was to be shown.
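The decay of heterozygosis at rate 1/(2N) per generation can be checked by simulation. This is a sketch under assumptions: a Wright-Fisher population with binomial sampling of 2N gametes, and illustrative values N = 10, p = 0.5 and 2000 replicates that do not come from the excerpt. It compares the simulated mean of 2x(1 − x) with the one-generation recursion H_0 (1 − 1/(2N))^t and with the exponential form H_0 exp(−t/(2N)) of equation (18).

```python
import math
import random


def drift_one_generation(x, two_n, rng):
    """Binomial sampling of 2N gametes from a pool with allele frequency x."""
    copies = sum(1 for _ in range(two_n) if rng.random() < x)
    return copies / two_n


def mean_heterozygosity(p0, n_eff, generations, replicates, seed=0):
    """Average 2x(1-x) over replicate populations, for each generation."""
    rng = random.Random(seed)
    two_n = 2 * n_eff
    totals = [0.0] * (generations + 1)
    for _ in range(replicates):
        x = p0
        totals[0] += 2.0 * x * (1.0 - x)
        for t in range(1, generations + 1):
            x = drift_one_generation(x, two_n, rng)
            totals[t] += 2.0 * x * (1.0 - x)
    return [total / replicates for total in totals]


N_EFF, P0 = 10, 0.5  # illustrative values, not from the talk
simulated = mean_heterozygosity(P0, N_EFF, generations=10, replicates=2000, seed=42)
h0 = 2.0 * P0 * (1.0 - P0)
exact = [h0 * (1.0 - 1.0 / (2 * N_EFF)) ** t for t in range(11)]
diffusion = [h0 * math.exp(-t / (2.0 * N_EFF)) for t in range(11)]
for t in (0, 5, 10):
    print(t, round(simulated[t], 4), round(exact[t], 4), round(diffusion[t], 4))
```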
13 / 32
Under Selection and Random Genetic Drift
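$$M_{\delta x} = s\,x(1 - x), \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} - s\,\frac{\partial}{\partial x}\left\{x(1 - x)\,\phi(p, x; t)\right\} \qquad (6)$$

Solutions are obtained as infinite series via the oblate spheroidal equation, using the transformation of allele frequencies z = 1 − 2x
• Kimura (1955)
• Kimura and Crow (1956)

$$\phi(p, x, t) = \sum_{k=0}^{\infty} C_k \exp(-\lambda_k t + 2cx)\,V^{(1)}_{1k}(z) \qquad (7)$$

where

$$V^{(1)}_{1k}(z) = \sum_{n=0,1} f^{k}_{n}\,T^{1}_{n}(z)$$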
14 / 32
Kolmogorov Backward Equation
• Derived from a continuous time stochastic process (P)
• Partial differential equation

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}V_{\delta p}\frac{\partial^2}{\partial p^2}\phi(p, x; t) + M_{\delta p}\frac{\partial}{\partial p}\phi(p, x; t) \qquad (8)$$

where
• p: initial allele frequency (random variable)
• x: allele frequency (a random variable, except that x at time t is held fixed)
• t: time (continuous variable)
• φ(p, x; t): PDF
• $V_{\delta p}$: variance of δp (amount of change in allele frequency)
• $M_{\delta p}$: mean of δp (amount of change in allele frequency)
• $V_{\delta p}$ and $M_{\delta p}$: both may depend on p but not on t (time homogeneous)
15 / 32
Steady State Distribution of Allele Frequencies
Equilibrium
• single point (balance between various forces that keeps allele frequencies near equilibrium)
• PDF
⇓
PDF of stable equilibrium instead of single point

Steady state allele frequency distribution
• Fisher (1922), (1930)
• Wright (1931), (1937), (1938)

$$\phi(p, x; t) = \text{solution of a Fokker-Planck equation} \qquad (9)$$

$$\lim_{t \to \infty}\phi(p, x; t) = \phi(x) \qquad (10)$$

$$\phi(x) = \frac{C}{V_{\delta x}}\exp\left(2\int\frac{M_{\delta x}}{V_{\delta x}}\,dx\right) \qquad (11)$$
16 / 32
Steady State Distribution – Random Genetic Drift
For a large value of t, only the first few terms have an impact on determining the actual form of the PDF.

$$\phi = 6p(1 - p)\exp\left(-\frac{t}{2N_e}\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots$$

Asymptotic formula:

$$\lim_{t \to \infty}\phi = C \cdot \exp\left(-\frac{1}{2N_e}t\right)$$
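How quickly the leading term takes over can be seen by evaluating the first two terms of the series directly. In this sketch the values p = 0.1, x = 0.3 and N_e = 50 are illustrative assumptions; the ratio of the second term to the first is 5(1 − 2p)(1 − 2x) exp(−t/N_e), which shrinks exponentially in t.

```python
import math


def first_two_terms(p, x, t, ne):
    """First two terms of the series solution under pure random drift."""
    term1 = 6.0 * p * (1.0 - p) * math.exp(-t / (2.0 * ne))
    term2 = (30.0 * p * (1.0 - p) * (1.0 - 2.0 * p) * (1.0 - 2.0 * x)
             * math.exp(-3.0 * t / (2.0 * ne)))
    return term1, term2


P, X, NE = 0.1, 0.3, 50  # illustrative values, not from the talk
for t in (10, 100, 500):
    term1, term2 = first_two_terms(P, X, t, NE)
    print(f"t = {t:4d}: term1 = {term1:.3e}, term2 = {term2:.3e}, "
          f"ratio = {term2 / term1:.3e}")
```

By t = 10 N_e generations the second term is several orders of magnitude smaller than the first, so the PDF is flat in x and decays as exp(−t/(2N_e)).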
17 / 32
Graphical Representation (Wright 1931)
[Page excerpt: SEWALL WRIGHT, p. 114.]
Before finally accepting this solution, however, it will be well to examine the terminal conditions. The amount of fixation at the extremes if N is large can be found directly from the Poisson series according to which the chance of drawing 0 where m is the mean number in a sample is $e^{-m}$. The contribution to the 0 class will thus be

$$(e^{-1} + e^{-2} + e^{-3} + \cdots)f = \frac{e^{-1}}{1 - e^{-1}}f = 0.582f.$$

[Figure: x-axis: Factor Frequency (25%, 50%, 75%).]
FIGURE 3. Distribution of gene frequencies in an isolated population in which fixation and loss of genes each is proceeding at the rate 1/4N in the absence of appreciable selection or mutation. $y = L_0 e^{-T/2N}$.
This is a little larger than the ½f deduced above and indicates a small amount of distortion near the ends due to the element of approximation involved in substituting integration for summation. The nature and amount of this distortion are indicated by the exact distributions obtained in the extreme cases of only 2 and 3 monoecious individuals.
Letting $L_0$ be the initial number of unfixed loci (pairs of allelomorphs) and T the number of generations we have approximately
18 / 32
Steady State Distribution – Selection and Mutation
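$$M_{\delta x} = -ux + v(1 - x) + \frac{x(1 - x)}{2}\frac{d\bar{a}}{dx}, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$

$$\phi(x) = C \cdot \exp(2N_e\bar{a})\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \qquad (12)$$

When A has selective advantage s over a:

$$\bar{a} = 2s \cdot x^2 + s \cdot 2x(1 - x) + 0 \cdot (1 - x)^2 = 2sx$$

$$\phi(x) = C \cdot \exp(4N_e s x)\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \qquad (13)$$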
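The constant C in equation (13) can be found numerically, which also shows how selection tilts the steady state distribution. This is a rough sketch: the parameter values (N_e = 1000, u = v = 0.001, s = 0.0005) are illustrative assumptions chosen so that 4N_e u and 4N_e v exceed 1 and the density stays bounded at the boundaries, and the midpoint rule is one simple choice of quadrature. With s = 0 the density reduces to a Beta(4N_e v, 4N_e u) distribution, which gives a closed-form check.

```python
import math


def unnormalized_density(x, ne, s, u, v):
    """Right-hand side of equation (13) without the constant C."""
    return (math.exp(4.0 * ne * s * x)
            * x ** (4.0 * ne * v - 1.0)
            * (1.0 - x) ** (4.0 * ne * u - 1.0))


def normalizing_constant_and_mean(ne, s, u, v, n_grid=2000):
    """Midpoint-rule estimates of C and of the mean allele frequency."""
    h = 1.0 / n_grid
    xs = [(i + 0.5) * h for i in range(n_grid)]
    weights = [unnormalized_density(x, ne, s, u, v) for x in xs]
    total = sum(weights) * h
    mean = sum(x * w for x, w in zip(xs, weights)) * h / total
    return 1.0 / total, mean


NE, U, V = 1000, 0.001, 0.001  # illustrative values, not from the talk
c_neutral, mean_neutral = normalizing_constant_and_mean(NE, 0.0, U, V)
c_selected, mean_selected = normalizing_constant_and_mean(NE, 0.0005, U, V)

# With s = 0 the density is Beta(4, 4), whose constant is Gamma(8) / Gamma(4)^2.
beta_constant = math.gamma(8.0) / (math.gamma(4.0) ** 2)
print(f"C (s = 0) = {c_neutral:.3f}, Beta(4, 4) constant = {beta_constant:.3f}")
print(f"mean allele frequency: s = 0 -> {mean_neutral:.3f}, "
      f"s = 0.0005 -> {mean_selected:.3f}")
```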
19 / 32
Graphical Representation (Wright 1937)
[Page image: GENETICS: S. WRIGHT, PROC. N. A. S., p. 308. Figs. 1, 2, 4, 5, 6 and 8; captions for the figures are on the opposite page.]
20 / 32
Time Series Analysis
When a variable is measured sequentially in time, the resulting data form a time series.
• Diffusion Model – Continuous time stochastic process
• Time Series – Discrete time stochastic process
21 / 32
Basic Models
Observations close together in time tend to be correlated
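• Autoregressive Model: AR(p)

$$X_t = c + \sum_{i=1}^{p}\psi_i X_{t-i} + \varepsilon_t \qquad (14)$$

• Moving Average Model: MA(q)

$$X_t = c + \sum_{i=1}^{q}\theta_i \varepsilon_{t-i} + \varepsilon_t \qquad (15)$$

• Autoregressive Moving Average Model: ARMA(p, q)

$$X_t = \text{AR}(p) + \text{MA}(q) \qquad (16)$$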
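Equations (14) to (16) translate directly into a simulation loop. The sketch below is an illustration under assumptions: Gaussian errors, a burn-in period to forget the zero starting values, and the helper names `simulate_arma` and `autocorrelation` are all choices made here. It checks two textbook facts: the lag-1 autocorrelation of an AR(1) process is ψ_1, and that of an MA(1) process is θ_1 / (1 + θ_1^2).

```python
import random


def simulate_arma(psi, theta, n, c=0.0, sigma=1.0, seed=0, burn_in=500):
    """Simulate X_t = c + sum(psi_i X_{t-i}) + sum(theta_i eps_{t-i}) + eps_t."""
    rng = random.Random(seed)
    p, q = len(psi), len(theta)
    x_hist = [0.0] * p    # past values of X, most recent last
    eps_hist = [0.0] * q  # past errors, most recent last
    out = []
    for _ in range(n + burn_in):
        eps = rng.gauss(0.0, sigma)
        ar_part = sum(psi[i] * x_hist[-1 - i] for i in range(p))
        ma_part = sum(theta[i] * eps_hist[-1 - i] for i in range(q))
        value = c + ar_part + ma_part + eps
        x_hist.append(value)
        eps_hist.append(eps)
        out.append(value)
    return out[burn_in:]


def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    den = sum((value - mean) ** 2 for value in series)
    return num / den


ar1 = simulate_arma(psi=[0.6], theta=[], n=20000, seed=1)
ma1 = simulate_arma(psi=[], theta=[0.5], n=20000, seed=2)
print("AR(1) lag-1 autocorrelation:", round(autocorrelation(ar1, 1), 3))
print("MA(1) lag-1 autocorrelation:", round(autocorrelation(ma1, 1), 3))
```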
22 / 32
Time Series as a Polynomial Equation
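$B^k X_t = X_{t-k}$ (backshift operator)

• AR(p)

$$X_t = \psi_1 X_{t-1} + \cdots + \psi_p X_{t-p}$$

$$X_t = (\psi_1 B + \cdots + \psi_p B^p)X_t$$

$$(1 - \psi_1 B - \cdots - \psi_p B^p)X_t = 0$$

• ARMA(p, q)

$$X_t - \psi_1 X_{t-1} - \cdots - \psi_p X_{t-p} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$

$$(1 - \psi_1 B - \cdots - \psi_p B^p)X_t = (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t$$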
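One use of the polynomial form is a stationarity check: an AR(p) process is stationary when every root z of 1 − ψ_1 z − ... − ψ_p z^p = 0 lies outside the unit circle. The slide does not state this criterion; it is a standard result added here for illustration. The sketch handles only p = 1 and p = 2 (closed-form roots, nonzero highest coefficient assumed), and the function names are hypothetical.

```python
import cmath


def ar_polynomial_roots(psi):
    """Roots of 1 - psi_1 z - ... - psi_p z^p for p = 1 or 2.

    Assumption: the highest-order coefficient is nonzero.
    """
    if len(psi) == 1:
        return [complex(1.0 / psi[0])]
    if len(psi) == 2:
        psi1, psi2 = psi
        # Quadratic a z^2 + b z + c = 0 with a = -psi2, b = -psi1, c = 1.
        a, b, c = -psi2, -psi1, 1.0
        disc = cmath.sqrt(b * b - 4.0 * a * c)
        return [(-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)]
    raise ValueError("this sketch only handles AR(1) and AR(2)")


def is_stationary(psi):
    """True when all roots of the AR polynomial lie outside the unit circle."""
    return all(abs(root) > 1.0 for root in ar_polynomial_roots(psi))


for coefficients in ([0.5], [1.0], [0.5, 0.3], [0.5, 0.6]):
    print(coefficients, "stationary:", is_stationary(coefficients))
```

The case `[1.0]` is the random walk: its single root sits exactly on the unit circle, so the check reports it as not stationary.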
23 / 32
Stationary Process
The mean and variance do not change over time. No trend.
Not stationary
[Figure 6: Random Walk, plotted against Time (1 to 10000).]
Looks stationary
[Figure 7: Detrended, plotted against Time (1 to 10000).]
Detrending:
• linear regression
• take a difference
• Autoregressive Integrated Moving Average: ARIMA(p,d,q)
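Detrending by differencing, the "I" in ARIMA, can be sketched in a few lines. The example assumes a simulated Gaussian random walk rather than the series shown in Figures 6 and 7, and `difference` is a hypothetical helper name. Taking the first difference of a random walk recovers its independent steps, so the differenced series has a stable spread while the original series drifts.

```python
import random


def difference(series, d=1):
    """Apply the first-difference operator (1 - B) to the series d times."""
    for _ in range(d):
        series = [series[t] - series[t - 1] for t in range(1, len(series))]
    return series


def variance(values):
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values) / len(values)


# Simulated random walk: cumulative sum of independent N(0, 1) steps.
rng = random.Random(7)
steps = [rng.gauss(0.0, 1.0) for _ in range(1000)]
walk, level = [], 0.0
for step in steps:
    level += step
    walk.append(level)

diffed = difference(walk)  # corresponds to d = 1 in ARIMA(p, d, q)
print("variance of walk, first vs second half:",
      round(variance(walk[:500]), 2), round(variance(walk[500:]), 2))
print("variance of differences, first vs second half:",
      round(variance(diffed[:500]), 2), round(variance(diffed[500:]), 2))
```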
24 / 32
Application on Allele Frequencies
• Influential SNPs – indicative of deterministic trends
• Uninfluential SNPs – random fluctuation?
• Diffusion Model – assumed Markovian process
• Time Series – which model describes the process of change of allele frequencies
Application
• Objective: model the process of change of allele frequencies
• Data: SNP genotypes of 4,798 Holstein bulls with 38,416 markers and milk yield
• Genotype imputation: FastPhase 1.4
• Estimation of marker effects: BayesCπ
25 / 32
BayesCπ
[Figure 8: GAW17. Title page of the following paper.]
Analysis of human mini-exome sequencing data using a Bayesian hierarchical mixture model: Genetic Analysis Workshop 17
Bueno Filho JS (1,2,*), Morota G (1,*), Tran QT (3), Maenner MJ (4), Vera-Cala LM (4,5), Engelman CD (4,§), and Meyers KJ (4,§)
1 Department of Dairy Science, University of Wisconsin-Madison, USA
2 Departamento de Ciencias Exatas, Universidade Federal de Lavras, Brasil
3 Department of Statistics, University of Wisconsin-Madison, USA
4 Department of Population Health Sciences, University of Wisconsin-Madison, USA
5 Departamento de Salud Publica, Universidad Industrial de Santander, Colombia
* Contributed equally to this work
§ Corresponding author
26 / 32
Allele Frequency of the Top Marker
[Figure 9: Time plots of allele frequencies (x-axis: Time; y-axis: Allele Frequency). Top panel: Original series. Bottom panel: Detrended series, obtained by taking the first-order difference.]
27 / 32
Autocorrelation and Partial Autocorrelation
ARIMA(1,1,1)?
[Figure 10: ACF and PACF. Four panels plotted against Lag: ACF and Partial ACF of the original series, and ACF and Partial ACF of the first-order difference series.]
28 / 32
Model Selection
Table 1: Comparison of several competitive models
Model            AIC        Model            AIC
ARIMA (1,0,0)    -51.56     ARIMA (1,1,0)    -52.47
ARIMA (0,1,0)    -49.38     ARIMA (1,0,1)    -51.13
ARIMA (0,0,1)    -46.41     ARIMA (1,1,1)    -51.02

Selected model (lowest AIC): ARIMA(1,1,0)

$$X_t = 0.635X_{t-1} + \varepsilon_t$$
Advanced Models
Time dependent variance
• ARCH (Autoregressive Conditional Heteroskedasticity)
• GARCH (Generalized Autoregressive Conditional Heteroskedasticity)
Multivariate
• VARMA (Vector Autoregression Moving Average)
• BVARMA (Bayesian Vector Autoregression Moving Average)
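A time-dependent variance of the GARCH type can be sketched with a short simulation. The recursion used here is the standard GARCH(1,1) form, sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}, which the slide does not spell out; the parameter values and the function name `simulate_garch11` are illustrative assumptions. When alpha + beta < 1 the long-run variance is omega / (1 − alpha − beta).

```python
import math
import random


def simulate_garch11(omega, alpha, beta, n, seed=0):
    """Simulate GARCH(1,1) errors and their conditional variances.

    Assumptions: standard normal innovations, alpha + beta < 1, and the
    recursion started at the long-run variance.
    """
    rng = random.Random(seed)
    sigma2 = omega / (1.0 - alpha - beta)
    errors, variances = [], []
    for _ in range(n):
        eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        errors.append(eps)
        variances.append(sigma2)
        # The next conditional variance depends on the latest squared error.
        sigma2 = omega + alpha * eps * eps + beta * sigma2
    return errors, variances


OMEGA, ALPHA, BETA = 0.1, 0.1, 0.8  # illustrative values, long-run variance 1
errors, variances = simulate_garch11(OMEGA, ALPHA, BETA, n=50000, seed=11)
mean_error = sum(errors) / len(errors)
sample_variance = sum((e - mean_error) ** 2 for e in errors) / len(errors)
print(f"sample variance = {sample_variance:.3f}, "
      f"long-run variance = {OMEGA / (1.0 - ALPHA - BETA):.3f}")
print(f"conditional variance range: {min(variances):.3f} to {max(variances):.3f}")
```

Setting `beta = 0` in this sketch gives an ARCH(1) process, in which the conditional variance depends only on the previous squared error.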
30 / 32
Intersection of Mathematics and Statistics
Under certain conditions
GARCH(1,1) ≈ Diffusion Model!
31 / 32
Thank you!
32 / 32