Modelling substitutions

Upload: sacha-vazquez

Post on 01-Jan-2016

TRANSCRIPT

Page 1: Modelling substitutions

Modelling substitutions

Page 2: Modelling substitutions

3. Substitution models

The process of nucleotide changes over time depends on two groups of factors:
- mutations generated by accidental damage and copying errors
- phenotypic effects of these mutations, and the forces of selection acting on these

On most of the human genome, no selective force appears to be acting, so by inference mutations in these regions seem to have no phenotypic effect. (Any phenotypic effect is thought to carry at least some selective advantage or drawback; population genetics suggests that an average increase of 0.0001 offspring per individual is about the smallest fitness advantage that would have a measurable effect on the genome.)

Ignoring the force of selection means modelling the neutral process of mutation. Even this is difficult, so further simplifying assumptions are made:

- All genomic sites evolve in the same way
- The evolution of one site is independent of that of others
- What happens to a nucleotide site depends on its current state, not its history

Page 3: Modelling substitutions

Modelling substitutions

Assumptions:

1. All genomic sites evolve in the same way
2. The evolution of one site is independent of that of others
3. What happens to a nucleotide site depends on its current state, not its history

(1) means that we need only develop one model
(2) implies that we can focus on just one nucleotide site at a time
(3) implies we can use a Markov process

Page 4: Modelling substitutions

Stochastic processes

Mutations are random, so it seems sensible to model the process of mutation by a stochastic process

Def: A stochastic process is a set of random variables X(t), one for each t ∈ T. The set of possible values of X(t) (for all t) is called the state space Y.

(Technical note: The precise definition uses an underlying sample space Ω; then X(t) (for any fixed t) is a function Ω → Y, where Y is the state space. The stochastic behaviour of X is then determined by a probability measure assigning probabilities to subsets of Y^T, i.e. to sets of realizations of X(·).)

The set T is either {...,-2,-1,0,1,2,...} or ℝ, the set of real numbers, for a discrete-time and a continuous-time stochastic process respectively. We’ll encounter both examples. Similarly, stochastic processes can be discrete-space (with a finite or infinite number of states) or continuous-space (e.g. using ℝ or ℝ² as the state space). We’ll only use finite state spaces.

Page 5: Modelling substitutions

Markov process

To model the evolution of nucleotides, we can take T to be the real numbers, denoting evolutionary time in some units. X(t) then denotes the identity of a nucleotide at a particular locus in the genome; the state space is thus Y = {A, C, G, T}. In this setup we’ve used assumptions 1 and 2.

The third assumption, mathematically interpreted, states that the stochastic process X(t) is a Markov process. These are loosely defined as processes “without after-effect”, i.e. those in which what happens in the future does not depend on what happened in the past, only on the present. Or symbolically:

P( X(t0+h) = y | X(s) = f(s) for all s ≤ t0 ) =

P( X(t0+h) = y | X(t0) = f(t0) )

Andrey Andreyevich Markov, 1856-1922

Page 6: Modelling substitutions

Modelling the substitution process

It appears reasonable to suppose that the probability of, say, an A mutating into a T in a certain amount of time dt, is (to first approximation) proportional to this time span dt, say fAT*dt. The number fAT is called the rate for the mutation A to T. Similarly, let fAC and fAG denote the rates for mutation into a C and G.

The probability that the nucleotide A mutates into anything is of course fAT+fAC+fAG.

Now suppose the nucleotide is currently in state A. What is the probability of it being in either of the states A, C, G or T after time dt? For N=C,G or T this is

P( X(t+dt)=N | X(t)=A ) = fAN dt

while for N = A this is

P( X(t+dt)=A | X(t)=A ) = 1 – (fAT + fAC + fAG) dt

If we define the matrix R as RNM = fNM for N and M different, and RNN = –ΣM≠N fNM, then this becomes

P( X(t+dt)=M | X(t)=N ) = ( I + R dt )NM

where I is the 4x4 identity matrix.

The matrix R is called the rate matrix. It has the property that its rows sum to 0; this fact is equivalent to saying that “probability is conserved”.
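As a concrete sketch (Python with NumPy; the rate values below are hypothetical, chosen only for illustration), a rate matrix can be built from the off-diagonal rates fNM, with each diagonal entry set so that its row sums to 0:

```python
import numpy as np

# Hypothetical off-diagonal mutation rates f_NM for states (A, C, G, T).
rates = np.array([
    [0.0, 0.2, 0.5, 0.2],   # rates out of A: f_AC, f_AG, f_AT
    [0.2, 0.0, 0.2, 0.5],   # rates out of C
    [0.5, 0.2, 0.0, 0.2],   # rates out of G
    [0.2, 0.5, 0.2, 0.0],   # rates out of T
])

# Rate matrix: R_NM = f_NM off the diagonal, and R_NN = minus the sum of
# the other entries in row N, so every row of R sums to 0.
R = rates - np.diag(rates.sum(axis=1))

# First-order transition probabilities over a short interval dt:
dt = 0.001
P_short = np.eye(4) + R * dt

print(R.sum(axis=1))        # rows sum to 0: probability is conserved
print(P_short.sum(axis=1))  # rows of I + R dt sum to 1
```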

Page 7: Modelling substitutions

Evolution of the probabilities

What are the probabilities after time 2 dt?

P( X(t+2dt)=c | X(t)=a ) = (Definition of conditional prob.)

Σb P( X(t+dt)=b | X(t)=a ) P( X(t+2dt)=c | X(t+dt)=b, X(t)=a ) = (Markov property)

Σb P( X(t+dt)=b | X(t)=a ) P( X(t+2dt)=c | X(t+dt)=b ) = (Definition of rate matrix)

Σb (I + R dt)ab (I + R dt)bc = (Definition of matrix product)

[ (I + R dt)² ]ac

In words, this result states that the matrix (I + R dt)² contains the probabilities for any nucleotide to change into any other during the time 2dt.

Page 8: Modelling substitutions

Dynamics over finite times

This only gives us probabilities for very short time intervals. What about longer times? Suppose A(t) is the matrix giving the probabilities after time t, which can be large. Then, by the same argument, A(t+dt) should satisfy the following relation:

A(t+dt) = (I+R dt)A(t) = A(t) + R A(t) dt.

Taking A(t) to the left-hand side and dividing through by dt, we get ( A(t+dt) – A(t) ) / dt = R A(t), that is

(d/dt) A(t) = R A(t).

This equation is solved by A(t) = exp( R t ), where exp here denotes the matrix exponential. This is defined in the same way as the ordinary exponential function, replacing the argument by matrices throughout:

exp( R t ) = I + t R + (t²/2!) R² + (t³/3!) R³ + .....

It is not difficult to show that this series always converges, for any t, and by differentiating term-by-term it is also straightforward to show that it solves the differential equation above.
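The series can be evaluated directly by accumulating terms. The sketch below (NumPy, with a hypothetical rate matrix) truncates after 30 terms, which is ample for moderate Rt, and checks that the rows of exp(Rt) sum to 1:

```python
import numpy as np

def expm_series(R, t, terms=30):
    """Truncated power series for the matrix exponential exp(Rt)."""
    result = np.eye(R.shape[0])
    term = np.eye(R.shape[0])
    for k in range(1, terms):
        term = term @ (R * t) / k     # builds (t^k / k!) R^k incrementally
        result = result + term
    return result

# Hypothetical rate matrix (rows sum to 0).
R = np.array([
    [-0.9,  0.2,  0.5,  0.2],
    [ 0.2, -0.9,  0.2,  0.5],
    [ 0.5,  0.2, -0.9,  0.2],
    [ 0.2,  0.5,  0.2, -0.9],
])

P = expm_series(R, t=1.0)
print(P.sum(axis=1))   # rows of a transition-probability matrix sum to 1
```

In practice a library routine (e.g. scipy.linalg.expm) is preferable; the loop above only illustrates the definition.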

Page 9: Modelling substitutions

Computing the matrix exponential

exp( R t ) = I + t R + (t²/2!) R² + (t³/3!) R³ + .....     (*)

Computation of exp(Rt) using the formula above is quite inefficient. If R is symmetric (or reversible) then an eigenvalue decomposition can be computed – this does not take much time, but is rather involved; fortunately packages like LAPACK exist that do this for you. An eigenvalue decomposition is a pair of matrices A and Λ so that

R = A Λ A⁻¹

where Λ is a diagonal matrix diag(λ1,...,λ4). This simplifies the formula above:

exp( R t ) = A diag( exp(λ1 t), exp(λ2 t), exp(λ3 t), exp(λ4 t) ) A⁻¹

If the matrix R is completely general (not symmetric, not reversible), then an eigenvalue decomposition is not “stable”; for some matrices R, even with quite small entries, the eigenvalue decomposition might have very large entries. This makes the numerical precision of the result suddenly much lower. In this case an algorithm called the “Padé approximation” is useful; it is slower than the above approach, but still much faster than computing (*) directly, and numerically stable.
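For a symmetric rate matrix (a simplifying assumption; the matrix below is hypothetical), the eigendecomposition route can be sketched with NumPy: `eigh` returns real eigenvalues and an orthogonal eigenvector matrix A, so A⁻¹ is just its transpose:

```python
import numpy as np

# Hypothetical symmetric rate matrix (rows sum to 0).
R = np.array([
    [-0.9,  0.2,  0.5,  0.2],
    [ 0.2, -0.9,  0.2,  0.5],
    [ 0.5,  0.2, -0.9,  0.2],
    [ 0.2,  0.5,  0.2, -0.9],
])
t = 2.0

# R = A diag(l1..l4) A^T, so exp(Rt) = A diag(exp(l1 t)..exp(l4 t)) A^T.
eigvals, A = np.linalg.eigh(R)
P = A @ np.diag(np.exp(eigvals * t)) @ A.T

print(P.sum(axis=1))   # still a valid transition-probability matrix
```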

Page 10: Modelling substitutions

An incompletely known initial state

exp( R t )NM is the probability that nucleotide N evolves into M in time t. If the initial state is not known, but instead the probabilities of starting in each particular state are known, then the final probability distribution can still be calculated. Suppose vN is the probability of starting in state N; then the probability of ending up in state M is

ΣN vN exp( R t )NM = [ v exp( R t ) ]M
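A short sketch of this computation, with a hypothetical symmetric rate matrix and a hypothetical initial distribution v over (A, C, G, T):

```python
import numpy as np

def transition_matrix(R, t):
    # exp(Rt) via eigendecomposition; assumes a symmetric R.
    eigvals, A = np.linalg.eigh(R)
    return A @ np.diag(np.exp(eigvals * t)) @ A.T

# Hypothetical symmetric rate matrix (rows sum to 0).
R = np.array([
    [-0.9,  0.2,  0.5,  0.2],
    [ 0.2, -0.9,  0.2,  0.5],
    [ 0.5,  0.2, -0.9,  0.2],
    [ 0.2,  0.5,  0.2, -0.9],
])
v = np.array([0.4, 0.1, 0.3, 0.2])    # hypothetical initial distribution

v_t = v @ transition_matrix(R, 1.5)   # distribution after time t = 1.5
print(v_t.sum())                      # still a probability distribution
```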

Page 11: Modelling substitutions

Equilibrium distribution

In the long run, the state of the system X(t) does not depend on the initial state X(0) anymore. The probability distribution at that time is called the equilibrium distribution.

From the explicit solution, it can be shown that A(t) also satisfies the differential equation

(d/dt) A(t) = A(t) R.

The independence of the initial state means that for large t, v0 = v A(t) becomes independent of v. This is the equilibrium distribution. It also means that for the same large t, A(t) no longer depends on t; in other words A(t) R = 0. Multiplying on the left by v, you get

v0 R = 0

that is, the equilibrium distribution is the left eigenvector of R corresponding to the eigenvalue 0.

(The right eigenvector for eigenvalue 0 is always (1,1,1,1), because the rows of R sum to 0. In particular this means that R always has an eigenvalue 0.)
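Numerically, the equilibrium distribution can be found as the left eigenvector of R for eigenvalue 0 (equivalently, a right eigenvector of Rᵀ), normalized to sum to 1. For a symmetric rate matrix like the hypothetical one below it comes out uniform:

```python
import numpy as np

# Hypothetical symmetric rate matrix (rows sum to 0).
R = np.array([
    [-0.9,  0.2,  0.5,  0.2],
    [ 0.2, -0.9,  0.2,  0.5],
    [ 0.5,  0.2, -0.9,  0.2],
    [ 0.2,  0.5,  0.2, -0.9],
])

# Left eigenvectors of R are right eigenvectors of R^T.
eigvals, vecs = np.linalg.eig(R.T)
i = np.argmin(np.abs(eigvals))   # pick the eigenvalue closest to 0
pi = np.real(vecs[:, i])
pi = pi / pi.sum()               # normalize to a probability distribution

print(pi)       # uniform (0.25, 0.25, 0.25, 0.25) for this symmetric R
print(pi @ R)   # v0 R = 0: the distribution is unchanged by the process
```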

Page 12: Modelling substitutions

Simulating evolution

The formula A(t) = exp( R t ) allows us to calculate the probability of ending up in a certain state, conditional on some starting state. It also allows us to sample from the distribution of final states, conditional on a starting state.

Suppose we want to do even more: simulate an entire history X(t), from the right distribution. Simulations allow you to answer much more difficult questions about the process than analytical calculation can. Simulating substitution histories is itself not often terribly useful; but other evolutionary processes can be modelled in precisely the same way (i.e. using discrete-state continuous-time Markov models), and the general algorithm is useful.

The algorithm to simulate a history X(t) during an interval [0,T] is straightforward:
1. Let X be the initial state, t = 0
2. Draw a “waiting time” w(X)
3. Draw a new state s(X)
4. Increase t by w(X), and assign s(X) to X. Go to step 2 if t < T, otherwise stop.

Page 13: Modelling substitutions

Simulating evolution

To draw a waiting time, note that the probability for anything to happen during an interval dt, when the process is in state X, is –RXX dt. The probability that nothing happened during that interval is therefore

1+RXX dt

The probability that nothing happened in the interval [0,t] is

(1 + RXX dt)^(t/dt) =

limn→∞ [ (1 + 1/n)^n ]^(RXX t) =

exp( RXX t ),

writing n=1/(RXX dt); note that RXX is a negative number. In other words, the waiting time until the next event, when currently in state X, is exponentially distributed with parameter -RXX. Sampling from this distribution is done by sampling a number u from the uniform distribution [0,1], then

-log(u)/ (-RXX)

is exponentially distributed with parameter -RXX.
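This sampling recipe can be checked empirically. In the sketch below, RXX = -0.9 is a hypothetical diagonal rate-matrix entry; the mean of the sampled waiting times should approach the theoretical mean 1/(-RXX):

```python
import math
import random

random.seed(42)
R_XX = -0.9   # hypothetical diagonal rate-matrix entry (always negative)

# Draw u uniform on (0,1]; then -log(u)/(-R_XX) is exponentially
# distributed with parameter -R_XX. Using 1 - random() avoids log(0),
# since random() is uniform on [0, 1).
samples = [-math.log(1.0 - random.random()) / (-R_XX) for _ in range(100_000)]

mean = sum(samples) / len(samples)
print(mean)   # should be close to the theoretical mean 1/0.9
```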

Page 14: Modelling substitutions

Simulating evolution

Finally, you need to draw a new state. After drawing a waiting time, you know that the state will change; so the new state y should be drawn from

P( X(t+dt)=y | X(t)=x, X(t+dt) ≠ X(t) ) =

P( X(t+dt)=y, X(t+dt) ≠ X(t) | X(t)=x ) / P( X(t+dt) ≠ X(t) | X(t)=x ) =

P( X(t+dt)=y | X(t)=x ) / P( X(t+dt) ≠ X(t) | X(t)=x ) =

Rxy dt / ( –Rxx dt ) =

Rxy / –Rxx

In words, the new state is drawn with a probability proportional to the rate of moving into that state.
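Putting the waiting-time and new-state steps together gives the full simulation algorithm. The sketch below stores a hypothetical rate matrix as nested dicts of off-diagonal rates, so the exit rate –Rxx of a state is the sum of its row:

```python
import math
import random

# Hypothetical off-diagonal rates R_xy for states A, C, G, T.
RATES = {
    "A": {"C": 0.2, "G": 0.5, "T": 0.2},
    "C": {"A": 0.2, "G": 0.2, "T": 0.5},
    "G": {"A": 0.5, "C": 0.2, "T": 0.2},
    "T": {"A": 0.2, "C": 0.5, "G": 0.2},
}

def simulate(x0, T, rng=random):
    """Simulate a history X(t) on [0, T]; returns a list of (time, state)."""
    history = [(0.0, x0)]
    t, x = 0.0, x0
    while True:
        exit_rate = sum(RATES[x].values())              # this is -R_xx
        t += -math.log(1.0 - rng.random()) / exit_rate  # exponential waiting time
        if t >= T:
            return history
        u = rng.random() * exit_rate                    # new state ~ R_xy / -R_xx
        for y, rate in RATES[x].items():
            u -= rate
            if u <= 0:
                break
        x = y   # y is the drawn state (last one if rounding left u positive)
        history.append((t, x))

random.seed(1)
hist = simulate("A", T=10.0)
print(hist[:3])   # first few (time, state) events
```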

Page 15: Modelling substitutions

Felsenstein’s algorithm

We know how to calculate P( X(t)=y | X(0)=x ), the probability that a nucleotide x evolves into y during time t. But often we have sequences for more than 2 species, related by a phylogenetic tree, and we might be interested in the probability of seeing a configuration like the one in the figure, conditional on the phylogeny.

This probability depends on the phylogeny, and thus gives a way to determine whether one phylogeny is more likely than another (in a likelihood approach), or to determine the posterior support of various phylogenies (in a Bayesian approach).

[Figure: phylogenetic tree with three leaves in states T, A, A and branch lengths t1–t4]

Page 16: Modelling substitutions

Felsenstein’s algorithm

In principle it is simple to calculate the probability of the data under the model:

P(c=T,d=A,e=A) = ΣX,Y P(c=T|b=X) * P(d=A|b=X) * P(b=X|a=Y) * P(e=A|a=Y) * P(a=Y)

The last factor, P(a=Y), is the prior on the distribution of nucleotides. Usually one assumes that the phylogeny drawn is part of a much larger phylogeny in which the same process has gone on for a long time, so it is reasonable to take the equilibrium distribution as the prior.

The formula looks a bit hairy, but computationally it is still OK in this example. The trouble is that the summation extends over all internal nodes, of which there are n-1 for a binary tree with n leaves. There are therefore 4^(n-1) terms to be summed for an n-leaved tree, which makes this way of calculating the probability impractical.

A better way is to calculate partial probabilities at the internal nodes, in such a way that the computation of any partial probability in any node Q requires only partial probabilities of its descendant nodes. This is an example of a general strategy called “dynamic programming”.

[Figure: the same tree with leaves c, d, e (in states T, A, A), internal nodes a (the root) and b, and branch lengths t1–t4]

Page 17: Modelling substitutions

Felsenstein’s algorithm

In this case, the dynamic programming strategy works if you decide to compute, for each internal node Q, the likelihood of the data that “depends” on Q, conditional on the state of Q:

P(all the data on leaves at Q’s descendants | Q=x )

This is called Felsenstein’s peeling algorithm (because you peel branches from the tree until you’re at the root). For the example tree, you first compute

P( c=T, d=A | b=X ) = P(c=T|b=X) * P(d=A|b=X)

Then, you compute

P( c=T, d=A, e=A | a=Y ) = ΣX P(c=T,d=A|b=X) * P(b=X|a=Y) * P(e=A|a=Y)

The term P(c=T,d=A|b=X) can be looked up, because it has just been computed. You can check that at each internal node you need to calculate 4 numbers, and for this you need to do at most two single summations (each involving 4 terms); this happens when both direct descendants of the internal node are themselves internal nodes.

Finally, the likelihood is obtained by summing over the prior P(a=Y).

[Figure: the example tree with leaves c, d, e (in states T, A, A), internal nodes a (the root) and b, and branch lengths t1–t4]
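The peeling recursion for this example tree can be sketched in a few lines of NumPy. The rate matrix, branch lengths, and the assignment of branch lengths to edges (t1: b→c, t2: b→d, t3: a→b, t4: a→e) are hypothetical choices for illustration; the point is that each internal node combines the partial likelihoods of its children with one matrix-vector product per child:

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def transition_matrix(R, t):
    # exp(Rt) via eigendecomposition; assumes a symmetric R.
    w, A = np.linalg.eigh(R)
    return A @ np.diag(np.exp(w * t)) @ A.T

def leaf(nuc):
    # Partial likelihood at a leaf: 1 for the observed nucleotide, else 0.
    L = np.zeros(4)
    L[NUC[nuc]] = 1.0
    return L

# Hypothetical symmetric (Jukes-Cantor-like) rate matrix and branch lengths.
R = np.full((4, 4), 0.3) - np.diag([1.2] * 4)
t1, t2, t3, t4 = 0.1, 0.2, 0.3, 0.4

# Peel: first P(c=T, d=A | b=X) for each X, then P(c=T, d=A, e=A | a=Y).
L_b = (transition_matrix(R, t1) @ leaf("T")) * (transition_matrix(R, t2) @ leaf("A"))
L_a = (transition_matrix(R, t3) @ L_b) * (transition_matrix(R, t4) @ leaf("A"))

# Finally, sum over the prior; the equilibrium distribution of this
# symmetric R is uniform.
pi = np.full(4, 0.25)
likelihood = pi @ L_a
print(likelihood)
```

Each matrix-vector product is one of the “single summations” mentioned above; the total cost grows linearly in the number of nodes instead of exponentially.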