Allele Frequencies as Stochastic Processes: Mathematical and Statistical Approaches

Gota Morota

Nov 30, 2010


Outline

Change of Allele Frequencies as Stochastic Processes

Steady State Distributions of Allele Frequencies

Time Series Analysis


Various factors affecting allele frequencies

• Selection, mutation and migration (crossbreeding) ⇒ systematic pressures (Wright 1949)

• Random fluctuations
  1. Random sampling of gametes (genetic drift)
  2. Random fluctuation in systematic pressures

⇓

Allele frequencies are functions of the systematic forces and the random components

Random walk ⇒ Brownian Motion

[Figures 1-4: the random walk plotted against Time over windows of increasing length]

Figure 1: Time = [1, 10]

Figure 2: Time = [1, 100]

Figure 3: Time = [1, 1000]

Figure 4: Time = [1, 10000]
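The passage from a random walk to Brownian motion can be sketched numerically. The snippet below is a minimal illustration, not the code behind the figures: it assumes steps of ±sqrt(dt), so that the variance of the walk at time t equals t. The function names and the path counts are hypothetical choices.

```python
import math
import random


def scaled_random_walk(n_steps, t_end, rng):
    """One random-walk path on [0, t_end], rescaled so that Var(endpoint) = t_end."""
    dt = t_end / n_steps
    step = math.sqrt(dt)  # +/- sqrt(dt) steps: mean 0, variance dt
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.choice((-step, step)))
    return path


def endpoint_variance(n_paths, n_steps, t_end, seed=0):
    """Sample variance of the path endpoint over many independent walks."""
    rng = random.Random(seed)
    ends = [scaled_random_walk(n_steps, t_end, rng)[-1] for _ in range(n_paths)]
    mean = sum(ends) / n_paths
    return sum((e - mean) ** 2 for e in ends) / (n_paths - 1)


if __name__ == "__main__":
    # The endpoint variance should be close to the elapsed time t_end.
    for t_end in (1.0, 1.5):
        print(t_end, round(endpoint_variance(2000, 100, t_end), 3))
```

Refining the grid (larger n_steps) leaves the endpoint variance unchanged, which is the scaling under which the walk approaches Brownian motion.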

Brownian Motion ⇒ Diffusion Model

[Figure 5: the random walk plotted against Time, Time = [1, 10000]]

+ conditional on systematic forces

• treat the change of allele frequencies as a stochastic process

⇓

Diffusion Model

Diffusion Model

It frames the infinite number of paths that allele frequencies could take over time under given systematic pressures.

[Figure: three panels, each plotting Allele Frequency against Time (0 to 10000)]

• pick a single time point t (say t = 5000 in the panels above)

• try to find the PDF at time t

• this requires solving a partial differential equation (PDE)

• the Fokker-Planck equation!
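The idea of slicing many paths at one time point can be sketched by brute force before turning to the PDE. The example below is an illustration under stated assumptions, not the method of the slides: it assumes pure drift under a Wright-Fisher sampling scheme with a hypothetical number of gametes, simulates many paths, and builds a histogram of the allele frequency at time t as a crude stand-in for the PDF.

```python
import random


def wright_fisher_path(p0, n_gametes, n_gen, rng):
    """Allele-frequency path under drift only: binomial resampling each generation."""
    x = p0
    path = [x]
    for _ in range(n_gen):
        # Draw n_gametes gametes, each carrying the allele with probability x.
        count = sum(1 for _ in range(n_gametes) if rng.random() < x)
        x = count / n_gametes
        path.append(x)
    return path


def frequencies_at(t, p0, n_gametes, n_paths, seed=0):
    """Allele frequency at generation t across many independent paths."""
    rng = random.Random(seed)
    return [wright_fisher_path(p0, n_gametes, t, rng)[t] for _ in range(n_paths)]


def histogram(values, n_bins=10):
    """Histogram density estimate on [0, 1] (integrates to 1)."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    width = 1.0 / n_bins
    return [c / (len(values) * width) for c in counts]


if __name__ == "__main__":
    freqs = frequencies_at(t=10, p0=0.5, n_gametes=20, n_paths=1000)
    print([round(d, 2) for d in histogram(freqs)])
```

The diffusion approach replaces this simulation with an equation for the density itself, which is what the Fokker-Planck equation provides.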

Fokker-Planck Equation

• Derived from a continuous-time stochastic process (X)
• Partial differential equation

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left\{V_{\delta x}\,\phi(p, x; t)\right\} - \frac{\partial}{\partial x}\left\{M_{\delta x}\,\phi(p, x; t)\right\} \tag{1}$$

where
• p: initial allele frequency (fixed)
• x: allele frequency (random variable)
• t: time (continuous variable)
• φ(p, x; t): PDF
• V_δx: variance of δx (amount of change in allele frequency per unit time)
• M_δx: mean of δx (amount of change in allele frequency per unit time)
• V_δx and M_δx: both may depend on x and t

Fokker-Planck Equation for Brownian Motion

A standard Brownian motion can be constructed from a random walk whose steps have mean 0 and variance 1, under the right scaling. Its PDF at time t is N(0, t).

• when t = 1.0, N(0, 1)
• when t = 1.5, N(0, 1.5)

Fokker-Planck equation:

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(p, x; t) \tag{2}$$

This is the heat equation (3), obtained by setting M_δx = 0 and V_δx = 1 in equation (1).

Solution:

$$\phi(p, x; t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right) \tag{4}$$
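That the N(0, t) density in equation (4) solves equation (2) can be checked numerically. The sketch below approximates both sides with central finite differences; the step size h and the test points are arbitrary choices made for illustration.

```python
import math


def heat_kernel(x, t):
    """Density of N(0, t) evaluated at x, i.e. equation (4)."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def heat_residual(x, t, h=1e-3):
    """d(phi)/dt - (1/2) d2(phi)/dx2, approximated by central differences."""
    dphi_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
    d2phi_dx2 = (heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t)
                 + heat_kernel(x - h, t)) / (h * h)
    return dphi_dt - 0.5 * d2phi_dx2


if __name__ == "__main__":
    # Residuals should be near zero, up to discretisation error.
    for t in (1.0, 1.5):
        for x in (-1.0, 0.0, 0.5, 2.0):
            print(t, x, heat_residual(x, t))
```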

Solution of the Heat Equation (the Heat Kernel)

[Figure: the heat kernel plotted against x (from −2 to 2) for t = 0.00001, t = 0.01, t = 0.1, t = 1 and t = 10]

Under Random Genetic Drift

$$M_{\delta x} = 0, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$

Fokker-Planck equation for random genetic drift:

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} \tag{5}$$

Solutions are obtained as infinite series by

• Kimura (1955): hypergeometric function

• Korn and Korn (1968): Gegenbauer polynomial

$$\phi = 6p(1 - p)\exp\left(-\frac{t}{2N_e}\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots$$

Solution of FPE (Kimura 1955)

[Page excerpt: GENETICS: MOTOO KIMURA, Vol. 41 (1955), p. 149]

Figs. 1-2: The processes of the change in the probability distribution of heterallelic classes, due to random sampling of gametes in reproduction. It is assumed that the population starts from the gene frequency 0.5 in Fig. 1 (left) and 0.1 in Fig. 2 (right). t = time in generation; N = effective size of the population; abscissa is gene frequency; ordinate is probability density.

The probability of heterozygosis is calculated by equation (15):

$$H_t = \int_0^1 2x(1 - x)\,\phi(x, t)\,dx$$

[The series expansion of this integral is not recoverable from the extraction.]

By virtue of equation (14) (put m = 0), the last integral is 0 except for i = 1. Hence

$$H_t = 2pq\,e^{-(1/2N)t} = H_0\,e^{-(1/2N)t}, \tag{18}$$

showing that the heterozygosis decreases exactly at the rate of 1/(2N) per generation. This is readily confirmed by a simple calculation: Let p be the frequency of A in the population, where the frequency of the heterozygotes is 2p(1 − p). The amount of heterozygosis to be expected after one generation of random sampling of the gametes is

$$E\{2(p + \delta p)(1 - p - \delta p)\} = 2p(1 - p) - 2E(\delta p)^2 = 2p(1 - p) - 2\,\frac{p(1 - p)}{2N} = \left(1 - \frac{1}{2N}\right)2p(1 - p),$$

as was to be shown.
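The one-generation argument in the excerpt can be checked without simulation. The sketch below assumes a Wright-Fisher population with m = 2N gametes (a modelling assumption made for this illustration), builds the exact binomial transition matrix, and tracks the expected heterozygosity, which should shrink by the factor (1 − 1/(2N)) each generation. The function names are hypothetical.

```python
from math import comb


def wf_transition_matrix(m):
    """P[i][j] = Pr(j copies next generation | i copies now), with m = 2N gametes."""
    matrix = []
    for i in range(m + 1):
        x = i / m
        matrix.append([comb(m, j) * x ** j * (1.0 - x) ** (m - j)
                       for j in range(m + 1)])
    return matrix


def expected_heterozygosity(dist):
    """E[2x(1 - x)] when dist[i] is the probability of i copies among m gametes."""
    m = len(dist) - 1
    return sum(prob * 2.0 * (i / m) * (1.0 - i / m) for i, prob in enumerate(dist))


def heterozygosity_decay(m, i0, n_gen):
    """Expected heterozygosity for generations 0..n_gen, starting from i0 copies."""
    matrix = wf_transition_matrix(m)
    dist = [0.0] * (m + 1)
    dist[i0] = 1.0
    values = [expected_heterozygosity(dist)]
    for _ in range(n_gen):
        # Propagate the full distribution one generation forward.
        dist = [sum(dist[i] * matrix[i][j] for i in range(m + 1))
                for j in range(m + 1)]
        values.append(expected_heterozygosity(dist))
    return values


if __name__ == "__main__":
    m = 20  # hypothetical population of N = 10 diploid individuals
    for t, h in enumerate(heterozygosity_decay(m, 10, 5)):
        print(t, h, 0.5 * (1.0 - 1.0 / m) ** t)
```

For large N the discrete factor (1 − 1/(2N))^t is close to the exponential e^{−t/(2N)} of equation (18).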

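Under Selection and Random Genetic Drift

$$M_{\delta x} = sx(1 - x), \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} - s\frac{\partial}{\partial x}\left\{x(1 - x)\,\phi(p, x; t)\right\} \tag{6}$$

Solutions are obtained as an infinite series via the oblate spheroidal equation, after the transformation z = 1 − 2x of the allele frequency
• Kimura (1955)
• Kimura and Crow (1956)

$$\phi(p, x; t) = \sum_{k=0}^{\infty} C_k \exp(-\lambda_k t + 2cx)\,V^{(1)}_{1k}(z) \tag{7}$$

where

$$V^{(1)}_{1k}(z) = \sum_{n=0,1} f^{k}_{n}\,T^{1}_{n}(z)$$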

Kolmogorov Backward Equation

• Derived from a continuous-time stochastic process (P)
• Partial differential equation

$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}V_{\delta p}\frac{\partial^2}{\partial p^2}\phi(p, x; t) + M_{\delta p}\frac{\partial}{\partial p}\phi(p, x; t) \tag{8}$$

where
• p: initial allele frequency (random variable)
• x: allele frequency at time t (a random variable in general, but held fixed here)
• t: time (continuous variable)
• φ(p, x; t): PDF
• V_δp: variance of δp (amount of change in allele frequency)
• M_δp: mean of δp (amount of change in allele frequency)
• V_δp and M_δp: both may depend on p but not on t (time homogeneous)

Steady State Distribution of Allele Frequencies

Equilibrium
• single point (balance between the various forces that keep allele frequencies near equilibrium)
• PDF

⇓

PDF of stable equilibrium instead of single point

Steady state allele frequency distribution
• Fisher (1922), (1930)
• Wright (1931), (1937), (1938)

$$\phi(p, x; t) = \text{solution of a Fokker-Planck equation} \tag{9}$$

$$\lim_{t \to \infty} \phi(p, x; t) = \phi(x) \tag{10}$$

$$\phi(x) = \frac{C}{V_{\delta x}}\exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right) \tag{11}$$

Steady State Distribution – Random Genetic Drift

For large t, only the first few terms of the series determine the form of the PDF.

$$\phi = 6p(1 - p)\exp\left(-\frac{t}{2N_e}\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots$$

Asymptotic formula:

$$\phi \sim C \cdot \exp\left(-\frac{t}{2N_e}\right) \quad (t \to \infty)$$

Graphical Representation (Wright 1931)

[Page excerpt: SEWALL WRIGHT, p. 114]

Before finally accepting this solution, however, it will be well to examine the terminal conditions. The amount of fixation at the extremes if N is large can be found directly from the Poisson series according to which the chance of drawing 0 where m is the mean number in a sample is e^{-m}. The contribution to the 0 class will thus be

$$(e^{-1} + e^{-2} + e^{-3} + \cdots)f = \frac{e^{-1}}{1 - e^{-1}}f = 0.582f.$$

[Figure 3 (horizontal axis: Factor Frequency, marked 25%, 50%, 75%; series indexed by T): Distribution of gene frequencies in an isolated population in which fixation and loss of genes each is proceeding at the rate 1/4N in the absence of appreciable selection or mutation. y = L_0 e^{-T/2N}.]

This is a little larger than the ½f deduced above and indicates a small amount of distortion near the ends due to the element of approximation involved in substituting integration for summation. The nature and amount of this distortion are indicated by the exact distributions obtained in the extreme cases of only 2 and 3 monoecious individuals.

Letting L_0 be the initial number of unfixed loci (pairs of allelomorphs) and T the number of generations we have approximately
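The 0.582 in the excerpt is the sum of a geometric series of Poisson zero-class probabilities. A short check, with a hypothetical function name and an arbitrary truncation point:

```python
import math


def zero_class_share(n_terms=200):
    """Partial sum of e^-1 + e^-2 + e^-3 + ... (Poisson chance of drawing 0, mean m)."""
    return sum(math.exp(-m) for m in range(1, n_terms + 1))


# Closed form of the geometric series: e^-1 / (1 - e^-1).
closed_form = math.exp(-1.0) / (1.0 - math.exp(-1.0))

if __name__ == "__main__":
    print(round(zero_class_share(), 3), round(closed_form, 3))
```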

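Steady State Distribution – Selection and Mutation

$$M_{\delta x} = -ux + v(1 - x) + \frac{x(1 - x)}{2}\frac{d\bar{a}}{dx}, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$

$$\phi(x) = C \cdot \exp(2N_e\bar{a})\,x^{4N_ev - 1}(1 - x)^{4N_eu - 1} \tag{12}$$

Here ā is the population mean of the genotypic selective values. When A has selective advantage s over a:

$$\bar{a} = 2s \cdot x^2 + s \cdot 2x(1 - x) + 0 \cdot (1 - x)^2 = 2sx$$

$$\phi(x) = C \cdot \exp(4N_esx)\,x^{4N_ev - 1}(1 - x)^{4N_eu - 1} \tag{13}$$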
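The step from the general steady-state formula (11) to the closed form (13) can be checked numerically. The sketch below evaluates formula (11) with a trapezoid rule, using the mean and variance of δx given above with dā/dx = 2s, and compares it with equation (13). Both are unnormalised, so only their ratio matters. The parameter values and function names are hypothetical.

```python
import math


def selection_mutation_moments(ne, s, u, v):
    """Return (M, V): mean and variance of the change in x, with d(a-bar)/dx = 2s."""
    def mean_change(x):
        return -u * x + v * (1.0 - x) + s * x * (1.0 - x)

    def var_change(x):
        return x * (1.0 - x) / (2.0 * ne)

    return mean_change, var_change


def wright_density_numeric(x, mean_change, var_change, x_ref=0.5, n=2000):
    """Unnormalised formula (11): exp(2 * integral of M/V from x_ref to x) / V(x)."""
    h = (x - x_ref) / n
    total = 0.0
    for k in range(n + 1):
        y = x_ref + k * h
        weight = 0.5 if k in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * mean_change(y) / var_change(y)
    return math.exp(2.0 * total * h) / var_change(x)


def wright_density_closed(x, ne, s, u, v):
    """Unnormalised closed form, equation (13) with C = 1."""
    return (math.exp(4.0 * ne * s * x)
            * x ** (4.0 * ne * v - 1.0) * (1.0 - x) ** (4.0 * ne * u - 1.0))


if __name__ == "__main__":
    ne, s, u, v = 100.0, 0.01, 0.005, 0.005  # illustrative values only
    m_fun, v_fun = selection_mutation_moments(ne, s, u, v)
    for x in (0.2, 0.5, 0.8):
        ratio = (wright_density_numeric(x, m_fun, v_fun)
                 / wright_density_closed(x, ne, s, u, v))
        print(x, ratio)  # the ratio should be the same constant at every x
```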

Graphical Representation (Wright 1937)

[Page excerpt: GENETICS: S. WRIGHT, PROC. N. A. S., p. 308. Panels shown: Fig. 1, Fig. 2, Fig. 4, Fig. 5, Fig. 6, Fig. 8. (Captions for figures on opposite page.)]


When a variable is measured sequentially in time, the resulting data form a time series.

• Diffusion Model – Continuous time stochastic process

• Time Series – Discrete time stochastic process

Basic Models

Observations close together in time tend to be correlated

• Autoregressive Model: AR(p)

$$X_t = c + \sum_{i=1}^{p} \psi_i X_{t-i} + \varepsilon_t \tag{14}$$

• Moving Average Model: MA(q)

$$X_t = c + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t \tag{15}$$

• Autoregressive Moving Average Model: ARMA(p, q)

$$X_t = \mathrm{AR}(p) + \mathrm{MA}(q) \tag{16}$$
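As a concrete instance of equation (14), the sketch below simulates an AR(1) series and recovers ψ by least squares. It is a minimal illustration: the Gaussian errors, the burn-in length and the function names are assumptions, not part of the slides.

```python
import random


def simulate_ar1(psi, n, c=0.0, sigma=1.0, seed=0, burn_in=200):
    """Simulate X_t = c + psi * X_{t-1} + eps_t with eps_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for t in range(n + burn_in):
        x = c + psi * x + rng.gauss(0.0, sigma)
        if t >= burn_in:  # discard the start-up transient
            series.append(x)
    return series


def estimate_ar1(series):
    """Least-squares slope from regressing X_t on X_{t-1} (with intercept)."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    num = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    den = sum((a - mean_prev) ** 2 for a in prev)
    return num / den


if __name__ == "__main__":
    print(estimate_ar1(simulate_ar1(psi=0.6, n=5000)))
```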

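Time Series as a Polynomial Equation

$$B^k X_t = X_{t-k} \quad \text{(backshift operator)}$$

• AR(p)

$$X_t = \psi_1 X_{t-1} + \cdots + \psi_p X_{t-p} + \varepsilon_t$$

$$X_t = (\psi_1 B + \cdots + \psi_p B^p)X_t + \varepsilon_t$$

$$(1 - \psi_1 B - \cdots - \psi_p B^p)X_t = \varepsilon_t$$

• ARMA(p,q)

$$X_t - \psi_1 X_{t-1} - \cdots - \psi_p X_{t-p} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$

$$(1 - \psi_1 B - \cdots - \psi_p B^p)X_t = (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t$$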
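One use of the polynomial form is a stationarity check: an AR process is stationary when every root of 1 − ψ1 z − ... − ψp z^p lies outside the unit circle. The slides do not state this criterion; it is added here as a standard result. The sketch covers AR(1) and AR(2) only, with hypothetical function names.

```python
import cmath


def ar2_roots(psi1, psi2=0.0):
    """Roots z of the AR polynomial 1 - psi1*z - psi2*z**2 (AR(1) when psi2 == 0)."""
    if psi2 == 0:
        return [] if psi1 == 0 else [complex(1.0 / psi1)]
    disc = cmath.sqrt(psi1 * psi1 + 4.0 * psi2)
    # Quadratic formula applied to psi2*z**2 + psi1*z - 1 = 0.
    return [(-psi1 + disc) / (2.0 * psi2), (-psi1 - disc) / (2.0 * psi2)]


def is_stationary(psi1, psi2=0.0):
    """True when all roots of the AR polynomial lie strictly outside the unit circle."""
    return all(abs(z) > 1.0 for z in ar2_roots(psi1, psi2))


if __name__ == "__main__":
    print(is_stationary(0.5, 0.3))  # stationary AR(2)
    print(is_stationary(1.0))       # random walk: root on the unit circle
```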

Stationary Process

The mean and variance do not change over time. No trend.

Not stationary

[Figure 6: Random Walk, plotted against Time (1 to 10000)]

Looks stationary

[Figure 7: Detrended, plotted against Time (1 to 10000)]

Detrending:

• linear regression

• take a difference

• Autoregressive Integrated Moving Average: ARIMA(p,d,q)
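Detrending by differencing can be sketched in a few lines. The example below is illustrative only: it applies the d-th order difference (the "I" in ARIMA) to a simulated random walk and to simple polynomial trends. The function name is a hypothetical choice.

```python
import random


def difference(series, d=1):
    """Apply the first-order difference X_t - X_{t-1} to the series d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


if __name__ == "__main__":
    rng = random.Random(0)
    steps = [rng.gauss(0.0, 1.0) for _ in range(10)]
    walk = [0.0]
    for e in steps:
        walk.append(walk[-1] + e)  # random walk: non-stationary
    # Differencing the walk recovers its increments, which have no trend.
    print(difference(walk)[:3], steps[:3])
```

A linear trend disappears after one difference and a quadratic trend after two, which is why d is usually small in practice.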

Application to Allele Frequencies

• Influential SNPs – indicative of deterministic trends

• Uninfluential SNPs – random fluctuation?

• Diffusion Model – assumes a Markovian process

• Time Series – which model describes the process of change of allele frequencies?

Application

• Objective: model the process of change of allele frequencies

• Data: SNP genotypes of 4,798 Holstein bulls with 38,416 markers, and milk yield

• Genotype imputation: FastPhase 1.4

• Estimation of marker effects: BayesCπ

BayesCπ

[Figure 8: GAW17. Title page of the paper:]

Analysis of human mini-exome sequencing data using a Bayesian hierarchical mixture model: Genetic Analysis Workshop 17

Bueno Filho JS (1,2)†, Morota G (1)†, Tran QT (3), Maenner MJ (4), Vera-Cala LM (4,5), Engelman CD (4)§, and Meyers KJ (4)§

1 Department of Dairy Science, University of Wisconsin-Madison, USA
2 Departamento de Ciencias Exatas, Universidade Federal de Lavras, Brasil
3 Department of Statistics, University of Wisconsin-Madison, USA
4 Department of Population Health Sciences, University of Wisconsin-Madison, USA
5 Departamento de Salud Publica, Universidad Industrial de Santander, Colombia

† Contributed equally to this work
§ Corresponding author

Allele Frequency of the Top Marker

[Two panels, "Original" (top) and "Detrended" (bottom), each plotting Allele Frequency against Time (0 to 30)]

Figure 9: Time plots of allele frequencies. Top: Original series. Bottom: Smoothed by taking the first order difference.

Autocorrelation and Partial Autocorrelation

ARIMA(1,1,1)?

[Four panels, each plotted against Lag (0 to 14): ACF of the original series; Partial ACF of the original series; ACF of the first order difference series; Partial ACF of the first order difference series]

Figure 10: ACF and PACF
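The ACF panels plot sample autocorrelations at each lag. A minimal sketch of that computation follows, using the common estimator that divides by n at every lag; the function names are hypothetical, and the ±1.96/√n band is the usual large-sample approximation rather than something stated in the slides.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag (autocovariances divided by n)."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series) / n
    acf = []
    for k in range(max_lag + 1):
        ck = sum((series[t] - mean) * (series[t + k] - mean)
                 for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf


def white_noise_band(n):
    """Approximate 95% band for the ACF of white noise: 1.96 / sqrt(n)."""
    return 1.96 / math.sqrt(n)


if __name__ == "__main__":
    series = [1.0, -1.0] * 50  # strongly alternating toy series
    print(sample_acf(series, 3), white_noise_band(len(series)))
```

Lags whose autocorrelation falls outside the band are the ones that suggest AR or MA terms when choosing an ARIMA order.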

Model Selection

Table 1: Comparison of several competing models

Model          AIC       Model          AIC
ARIMA(1,0,0)   -51.56    ARIMA(1,1,0)   -52.47
ARIMA(0,1,0)   -49.38    ARIMA(1,0,1)   -51.13
ARIMA(0,0,1)   -46.41    ARIMA(1,1,1)   -51.02

ARIMA(1,1,0)

$$X_t = 0.635X_{t-1} + \varepsilon_t$$
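The selection rule behind Table 1 is to prefer the model with the smallest AIC. The sketch below applies that rule to the six values in the table; the dictionary layout and variable names are illustrative choices.

```python
# AIC values copied from Table 1, keyed by the ARIMA order (p, d, q).
aic_table = {
    (1, 0, 0): -51.56,
    (0, 1, 0): -49.38,
    (0, 0, 1): -46.41,
    (1, 1, 0): -52.47,
    (1, 0, 1): -51.13,
    (1, 1, 1): -51.02,
}

# The preferred model is the one with the minimum AIC.
best_order = min(aic_table, key=aic_table.get)

# AIC differences relative to the best model, for judging how close the others are.
delta_aic = {order: aic - aic_table[best_order] for order, aic in aic_table.items()}

if __name__ == "__main__":
    print(best_order)
    for order, delta in sorted(delta_aic.items(), key=lambda item: item[1]):
        print(order, round(delta, 2))
```

The differences are small for several candidates, so the ranking in the table separates the models only weakly.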

Advanced Models

Time-dependent variance

• ARCH (Autoregressive Conditional Heteroskedasticity)

• GARCH (Generalized Autoregressive Conditional Heteroskedasticity)

Multivariate

• VARMA (Vector Autoregressive Moving Average)

• BVARMA (Bayesian Vector Autoregressive Moving Average)
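The slides name GARCH without writing it out. As a hedged sketch, the standard GARCH(1,1) recursion sets the conditional variance to σ²_t = ω + α ε²_{t−1} + β σ²_{t−1}. The simulation below assumes Gaussian shocks; the parameter values and function name are hypothetical.

```python
import math
import random


def simulate_garch11(omega, alpha, beta, n, seed=0):
    """Simulate eps_t = sigma_t * z_t with sigma_t^2 = omega + alpha*eps_{t-1}^2 + beta*sigma_{t-1}^2."""
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    shocks, variances = [], []
    for _ in range(n):
        eps = math.sqrt(var) * rng.gauss(0.0, 1.0)
        shocks.append(eps)
        variances.append(var)
        var = omega + alpha * eps * eps + beta * var  # time-dependent variance
    return shocks, variances


if __name__ == "__main__":
    shocks, variances = simulate_garch11(omega=0.1, alpha=0.1, beta=0.8, n=5)
    print([round(v, 4) for v in variances])
```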

Intersection of Mathematics and Statistics

Under certain conditions

GARCH(1,1) ≈ Diffusion Model!

Thank you!
