Modeling Sequence Conservation with Phylogenomic HMMs

CBB 231 / COMPSCI 261

B. Majoros

Overview of Comparative Genome Analysis

Noncomparative:

human:   ATCTCATTCGCGCATTCTGATCCGATCTATC
  -> model -> predicted genomic features

Comparative:

human:   ATCTCATTCGCGCATTCTGATCCGATCTATC
“informant” genomes:
chimp:   ATCTCATTCGCGCATTCTGATCCGATCTATC
mouse:   CTCTCATACGCGCCTTCTGTTCCGATGTATC
dog:     AAGTCATACGGGCAATCTCATGCGAACTACC
chicken: GTTTAACTCTCGGATAAATATCCAGCCAACA
  -> model -> predicted genomic features

Non-independence of Informants

Due to their common ancestry, the informant sequences are not independent. We can control for that non-independence by explicitly modeling their dependence structure using a phylogenetic tree:

We will see later that a phylogenetic tree (or “phylogeny”) can be interpreted as a special type of Bayesian network, in which sequence conservation probabilities are expressed as a function of the branch lengths.

Branch lengths represent evolutionary distance, which conflates the distinct phenomena of elapsed time and mutation rate.

Suppose we have a multiple-sequence alignment for a genomic region of interest:

The Utility of Sequence Conservation

Phylogenomic methods make use of the assumption that natural selection operates more strongly on some genomic features than others (e.g., coding versus noncoding), resulting in a detectable bias in sequence conservation for the features of interest.

More generally, conservation patterns may differ between levels of DNA organization (e.g., amino acids in coding segments, versus individual nucleotides in conserved noncoding elements).

Levels of Conservation

Species compared: A. fumigatus, A. nidulans

feature     amino acid conservation       nucleotide conservation
exon 1      100%                      >   71%
intron 1     14%                      <   51%
exon 2       98%                      >   85%
intron 2     29%                      <   49%
exon 3       97%                      >   82%
intron 3      9%                      <   49%
exon 4       96%                      >   83%

Phylogenomic Gene Finding

[Figure: a model of gene structure (states N, Esng, Einit, Eint, Efin, I; signals ATG, TAG, GT, AG) shown alongside a model of phylogeny (target S, informants I1 and I2, ancestors A1, A2, A3).]

model of gene structure combined with model of phylogeny = a model of gene structure informed by observed evolutionary divergence

PhyloHMMs

[Figure: the gene-structure model (states N, Esng, Einit, Eint, Efin, I; signals ATG, TAG, GT, AG), with a copy of the phylogeny (S, I1, I2, A1, A2, A3) attached to each state.]

A PhyloHMM is an HMM (or GHMM) in which each state q_i has an associated evolution model ψ_i describing the expected patterns and rates of evolution in the class of features represented by that state.

In practice, many states will share the same evolution model (“parameter tying”). A typical PhyloHMM will have only two evolution models: one for coding sequence and one for noncoding.

Recall: Gene Finding with a GHMM

Recall: gene finding with a GHMM involves finding the parse φ* which is most probable, given the input sequence S:

\[
\phi^* = \arg\max_\phi P(\phi \mid S)
       = \arg\max_\phi \frac{P(\phi, S)}{P(S)}
       = \arg\max_\phi P(\phi, S)
       = \arg\max_\phi P(\phi)\, P(S \mid \phi)
\]

This can be further factored into a product of emission, transition, and duration probabilities:

\[
\phi^* = \arg\max_\phi \prod_{(q_i, d_i) \in \phi}
  \underbrace{P(S_i \mid q_i, d_i)}_{\text{emission}}\;
  \underbrace{P(q_i \mid q_{i-1})}_{\text{transition}}\;
  \underbrace{P(d_i \mid q_i)}_{\text{duration}}
\]

Methods for efficiently evaluating these terms and extracting the optimal parse (via dynamic programming) were described in a previous lecture.

Incorporating Homology Evidence

Given a target sequence S and a set of homologous informant sequences I(1),...,I(n) aligned to S, we wish to find the most probable parse φ* of S, given S and I(1),...,I(n):

\[
\begin{aligned}
\phi^* &= \arg\max_\phi P(\phi \mid S, I^{(1)}, \ldots, I^{(n)}) \\
       &= \arg\max_\phi \frac{P(\phi, S, I^{(1)}, \ldots, I^{(n)})}{P(S, I^{(1)}, \ldots, I^{(n)})} \\
       &= \arg\max_\phi P(\phi, S, I^{(1)}, \ldots, I^{(n)}) \\
       &= \arg\max_\phi P(\phi)\, P(S, I^{(1)}, \ldots, I^{(n)} \mid \phi) \\
       &= \arg\max_\phi P(\phi)\, P(S \mid \phi)\, P(I^{(1)}, \ldots, I^{(n)} \mid S, \phi)
\end{aligned}
\]

where a parse φ = {(Γ_i, d_i) | 0 ≤ i < L} is a series of feature types Γ_i and their corresponding lengths d_i along the sequence in left-to-right order.

Using a Phylogeny as a Bayesian Network

If the target sequence S and the informant sequences I(1),...,I(n) are related via a known phylogeny, then we can re-root that phylogeny to place S at the root, and then attach matrices to the branches to describe the mutation probabilities between ancestral sequences and their children:

[Figure: phylogeny -> (reroot) -> Bayesian network]

Re-rooting at S corresponds to conditioning the informants on S.

Factoring the Likelihood

We can then use the re-rooted tree as a Bayesian network in order to factor the P(I(1),...,I(n)|S,φ) term. For the example tree, with informants I_1 and I_2 and ancestors A_1, A_2, A_3:

\[
P(I_1, I_2 \mid S) = \sum_{A_1, A_2, A_3}
  P(I_1 \mid A_2)\, P(A_2 \mid A_1)\, P(A_1 \mid A_3)\, P(I_2 \mid A_3)\, P(A_3 \mid S)
\]

This factorization is possible because of the common (and entirely justifiable) assumption of conditional independence in traditional phylogenetic models.

More generally:

\[
P(I^{(1)}, \ldots, I^{(n)} \mid S) =
  \sum_{\text{unobservables}} \; \prod_{\text{nonroot } v} P(v \mid \mathrm{parent}(v))
\]

where we have summed over all possible assignments to the unobservables (ancestral vertices).

Eliminating Useless Ancestors

Any ancestor having only one child can be eliminated algebraically:

[Figure: the chain A -> B -> C is replaced by A -> C]

This is a form of variable elimination (recall from the Bayesian networks lecture), and is trivially justified:

\[
\begin{aligned}
\sum_A \sum_B \sum_C P(A)\, P(B \mid A)\, P(C \mid B)
  &= \sum_A \sum_B \sum_C P(A)\, P(C, B \mid A) \\
  &= \sum_A \sum_C P(A) \left( \sum_B P(C, B \mid A) \right) \\
  &= \sum_A \sum_C P(A)\, P(C \mid A)
\end{aligned}
\]

Possible only because of the conditional independence assumption:

P(C | B, A) = P(C | B)
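As a worked illustration of the elimination above, the step can be written in matrix form. This is a sketch that assumes the same rate matrix Q applies on both branches and borrows the continuous-time Markov property P(t+s) = P(t)P(s) covered later in the lecture; the branch lengths t_{AB} and t_{BC} are hypothetical names introduced here.

```latex
\[
P(C = c \mid A = a)
  = \sum_{b} P(B = b \mid A = a)\, P(C = c \mid B = b)
  = \bigl[\mathbf{P}(t_{AB})\, \mathbf{P}(t_{BC})\bigr]_{a,c}
  = \bigl[\mathbf{P}(t_{AB} + t_{BC})\bigr]_{a,c}
\]
```

Under that assumption, removing a single-child ancestor amounts to replacing its two incident branches with one branch whose length is their sum.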

Evaluating the Likelihood

Given a pre-computed alignment of the target and informant sequences, and assuming independence between sites (columns of the alignment), the likelihood can be computed on a per-nucleotide basis using a recursion known as Felsenstein’s pruning algorithm (FPA):

\[
L_u(a) =
\begin{cases}
\delta(u, a) & \text{if } u \text{ is a leaf} \\[4pt]
\displaystyle \prod_{c \in C(u)} \sum_{b \in \alpha} L_c(b)\, P(c = b \mid u = a) & \text{otherwise}
\end{cases}
\]

for C(u) the children of node u, δ(u,a) the Kronecker match function, and the augmented DNA alphabet α={A,C,G,T,-} where ‘-’ denotes missing information, typically due to a gap in the alignment.

Evaluating the Likelihood Efficiently

Felsenstein’s recurrence should be computed using dynamic programming, for the sake of efficiency. L_u(a) can be evaluated using bottom-up DP (using a postorder tree traversal) or via memoization (computing each value only once and storing it in a “memo”). In either case, L_c(b) can be obtained during all subsequent evaluations via a simple lookup in a DP matrix:

\[
\mathrm{matrix}(u, a) =
\begin{cases}
\delta(u, a) & \text{if } u \text{ is a leaf} \\[4pt]
\displaystyle \prod_{c \in C(u)} \sum_{b \in \alpha} \mathrm{matrix}(c, b)\, P(c = b \mid u = a) & \text{otherwise}
\end{cases}
\]

The size of the matrix is |α|×N, for N the number of nodes (taxa, including ancestral taxa) in the tree, with |α|=4 for the DNA alphabet, or 5 when the gap symbol ‘-’ is included.
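The bottom-up evaluation described above can be sketched in C++ as follows. This is a minimal illustration rather than the lecture's implementation: the Node layout and the names prune, columnLikelihood, and jukesCantor are hypothetical, a Jukes-Cantor matrix stands in for P(c=b|u=a), and a gap at a leaf is treated as missing data (all ones) instead of as a fifth alphabet symbol.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Nucleotides are encoded as A=0, C=1, G=2, T=3; GAP marks missing data.
constexpr int NUC = 4;
constexpr int GAP = 4;
using Matrix = std::array<std::array<double, NUC>, NUC>;

// Jukes-Cantor substitution probabilities for a branch of length t
// (expected substitutions per site); a stand-in for P(c=b | u=a).
Matrix jukesCantor(double t) {
    const double e = std::exp(-4.0 * t / 3.0);
    Matrix P{};
    for (int a = 0; a < NUC; ++a)
        for (int b = 0; b < NUC; ++b)
            P[a][b] = (a == b) ? 0.25 + 0.75 * e : 0.25 - 0.25 * e;
    return P;
}

struct Node {
    std::vector<int> children;  // indices of child nodes (empty for a leaf)
    double branchLength = 0.0;  // length of the branch leading to this node
    int observed = GAP;         // symbol observed in this column (leaves only)
};

// Fills table[u][a] = L_u(a) for the subtree rooted at u (postorder).
void prune(const std::vector<Node>& tree, int u,
           std::vector<std::array<double, NUC>>& table) {
    const Node& node = tree[u];
    if (node.children.empty()) {
        // Leaf: Kronecker delta, or all ones when the symbol is missing.
        for (int a = 0; a < NUC; ++a)
            table[u][a] = (node.observed == GAP || node.observed == a) ? 1.0 : 0.0;
        return;
    }
    for (int a = 0; a < NUC; ++a) table[u][a] = 1.0;
    for (int c : node.children) {
        prune(tree, c, table);  // each child is computed exactly once
        const Matrix P = jukesCantor(tree[c].branchLength);
        for (int a = 0; a < NUC; ++a) {
            double sum = 0.0;
            for (int b = 0; b < NUC; ++b) sum += table[c][b] * P[a][b];
            table[u][a] *= sum;
        }
    }
}

// Conditional likelihood of one alignment column, given the target symbol
// at the root of the re-rooted tree: L_r(S[j]).
double columnLikelihood(const std::vector<Node>& tree, int root, int targetSymbol) {
    assert(targetSymbol >= 0 && targetSymbol < NUC);
    std::vector<std::array<double, NUC>> table(tree.size());
    prune(tree, root, table);
    return table[root][targetSymbol];
}
```

Because each node's row of the table is filled once and then only read, the work per column is proportional to the number of nodes times the square of the alphabet size.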

Applying Felsenstein’s Algorithm

Using Felsenstein’s recursion, the conditional likelihood for a single column j of the alignment is given by

\[
P(I^{(1)}[j], \ldots, I^{(n)}[j] \mid S, \theta) = L_r(S[j])
\]

where r is the root node of the tree, S[j] denotes the jth symbol in the target track of the alignment, and θ is the set of model parameters. Evaluating the conditional likelihood of the entire alignment (assuming independence between sites) can be accomplished via multiplication:

\[
P(I^{(1)}, \ldots, I^{(n)} \mid S, \theta) = \prod_{0 \le j < L} L_r(S[j])
\]

where L is the length of the alignment.

Modeling Feature Types

Now let us attend to φ. Recall that for the P(φ)P(S|φ) term we partitioned φ into a series of states and their durations. We can do the same for the informant term:

\[
P(I^{(1)}, \ldots, I^{(n)} \mid S, \phi) =
  \prod_{(q_i, d_i) \in \phi} P(I_i^{(1)}, \ldots, I_i^{(n)} \mid S_i, q_i, d_i)
\]

where I_i^(j) is the subsequence emitted by q_i (the ith state in φ) into the I^(j) track of the alignment, and we have again employed a conditional independence assumption between the features emitted by the different states in the parse. This decomposition by state allows us to utilize a different evolution model for different feature types.

Combining Terms into the Complete Formula

\[
\begin{aligned}
\phi^* &= \arg\max_\phi \prod_{(q_i, d_i) \in \phi}
   P(S_i \mid q_i, d_i)\, P(q_i \mid q_{i-1})\, P(d_i \mid q_i)\,
   P(I_i^{(1)}, \ldots, I_i^{(n)} \mid S_i, q_i, d_i) \\
 &= \arg\max_\phi \prod_{(q_i, d_i) \in \phi}
   P(S_i \mid q_i, d_i)\, P(q_i \mid q_{i-1})\, P(d_i \mid q_i)
   \prod_{b_i \le j \le e_i} L_r(S[j])
\end{aligned}
\]

where the output of the ith state in φ spans the interval (b_i, e_i) in the alignment. This optimization problem can be solved efficiently using a GHMM decoding algorithm and prefix sum arrays:

\[
A[i] =
\begin{cases}
\log f(0) & \text{for } i = 0 \\
A[i-1] + \log f(i) & \text{for } i > 0
\end{cases}
\]

\[
\prod_{b \le x \le e} f(x) = e^{A[e] - A[b-1]} \qquad (\text{taking } A[-1] = 0)
\]

Computing the product of f(x) over b ≤ x ≤ e for an arbitrary interval (b,e) within the sequence can be achieved by simple subtraction followed by exponentiation. Because this operation takes constant time regardless of the interval length, prefix sum arrays make interval scoring very fast.
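The prefix-sum idea can be sketched in C++ as follows. The function names are hypothetical, and the sketch assumes an inclusive array (A[i] holds the sum of log f(0) through log f(i)), so an interval b..e is recovered as A[e] minus A[b-1]; other indexing conventions shift these subscripts by one. Here f(x) would be the per-column likelihood L_r(S[x]).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds A[i] = log f(0) + ... + log f(i) from per-column values f(i) > 0.
std::vector<double> buildPrefixSums(const std::vector<double>& f) {
    std::vector<double> A(f.size());
    double running = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        assert(f[i] > 0.0);  // log is undefined for non-positive values
        running += std::log(f[i]);
        A[i] = running;
    }
    return A;
}

// Log of the product of f(x) over b <= x <= e, in constant time.
double logIntervalProduct(const std::vector<double>& A, std::size_t b, std::size_t e) {
    assert(b <= e && e < A.size());
    return A[e] - (b > 0 ? A[b - 1] : 0.0);
}
```

In a decoder it is usually safer to keep the result in log space and add it to the other log-probabilities, since exponentiating a long interval's score can underflow.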

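Evaluating the Probability of a Substitution

\[
L_u(a) =
\begin{cases}
\delta(u, a) & \text{if } u \text{ is a leaf} \\[4pt]
\displaystyle \prod_{c \in C(u)} \sum_{b \in \alpha} L_c(b)\, P(c = b \mid u = a) & \text{otherwise}
\end{cases}
\]

P(c=b|u=a) is the probability of observing base b in a child node, given that we observe base a at that location in the parent’s genome.

We can model this using a matrix of substitution probabilities, parameterized by the evolutionary time t that has passed between the parent and child species (rows index the parent base, columns the child base, in the order A, C, G, T):

\[
\mathbf{P}(t) =
\begin{bmatrix}
p_{A \to A} & p_{A \to C} & p_{A \to G} & p_{A \to T} \\
p_{C \to A} & p_{C \to C} & p_{C \to G} & p_{C \to T} \\
p_{G \to A} & p_{G \to C} & p_{G \to G} & p_{G \to T} \\
p_{T \to A} & p_{T \to C} & p_{T \to G} & p_{T \to T}
\end{bmatrix}
\]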

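Continuous-time Markov Models

Substitution models are typically based on continuous-time Markov models. The Markov property for continuous-time Markov chains states that:

\[
\mathbf{P}(t + s) = \mathbf{P}(t)\, \mathbf{P}(s)
\]

That is, the probability of a given substitution is insensitive to the absolute position along the time axis (i.e., the substitution rate is stationary), so that time-dependent substitution rates are simply compounded via matrix multiplication.

From this we can derive an instantaneous rate matrix Q from P(t), where we make use of the obvious fact that P(0)=I:

\[
\begin{aligned}
\frac{d\mathbf{P}(t)}{dt}
 &= \lim_{\Delta t \to 0} \frac{\mathbf{P}(t + \Delta t) - \mathbf{P}(t)}{\Delta t}
  = \lim_{\Delta t \to 0} \frac{\mathbf{P}(t)\, \mathbf{P}(\Delta t) - \mathbf{P}(t)}{\Delta t} \\
 &= \mathbf{P}(t) \lim_{\Delta t \to 0} \frac{\mathbf{P}(\Delta t) - \mathbf{I}}{\Delta t}
  = \mathbf{P}(t) \lim_{\Delta t \to 0} \frac{\mathbf{P}(\Delta t) - \mathbf{P}(0)}{\Delta t}
  = \mathbf{P}(t)\, \mathbf{Q}
\end{aligned}
\]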

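Obtaining P(t) from Q

\[
\frac{d\mathbf{P}(t)}{dt} = \mathbf{P}(t)\, \mathbf{Q}
\quad\Longrightarrow\quad
\mathbf{P}(t) = e^{\mathbf{Q}t} = \sum_{n=0}^{\infty} \frac{\mathbf{Q}^n t^n}{n!}
 = \mathbf{I} + \mathbf{Q}t + \frac{\mathbf{Q}^2 t^2}{2!} + \cdots
\]

e^{Qt} (the “matrix exponential”) denotes a Taylor expansion, as shown above.

In practice, we can solve this via eigenvector decomposition:

\[
\mathbf{Q} = \mathbf{G}
\begin{bmatrix}
\lambda_1 & & & \\
 & \lambda_2 & & \\
 & & \ddots & \\
 & & & \lambda_n
\end{bmatrix}
\mathbf{G}^{-1}
\qquad
\mathbf{P}(t) = \mathbf{G}
\begin{bmatrix}
e^{\lambda_1 t} & & & \\
 & e^{\lambda_2 t} & & \\
 & & \ddots & \\
 & & & e^{\lambda_n t}
\end{bmatrix}
\mathbf{G}^{-1}
\]

What remains is to determine Q and the branch lengths {t_i} of the phylogeny. We will return to this in a few minutes.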
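To make the series definition concrete, here is a small C++ sketch that sums a truncated Taylor series for e^{Qt}. It deliberately uses the series rather than the eigendecomposition recommended above, so it is only suitable as an illustration on short branches; the function names and the choice of a Jukes-Cantor Q (scaled to one expected substitution per unit time) are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int N = 4;  // A, C, G, T
using Matrix = std::array<std::array<double, N>, N>;

Matrix identity() {
    Matrix I{};
    for (int a = 0; a < N; ++a) I[a][a] = 1.0;
    return I;
}

Matrix multiply(const Matrix& X, const Matrix& Y) {
    Matrix Z{};
    for (int a = 0; a < N; ++a)
        for (int b = 0; b < N; ++b)
            for (int k = 0; k < N; ++k) Z[a][b] += X[a][k] * Y[k][b];
    return Z;
}

// Jukes-Cantor rate matrix: every off-diagonal rate is 1/3, so each row sums to 0.
Matrix jukesCantorQ() {
    Matrix Q{};
    for (int a = 0; a < N; ++a)
        for (int b = 0; b < N; ++b) Q[a][b] = (a == b) ? -1.0 : 1.0 / 3.0;
    return Q;
}

// P(t) = sum over n >= 0 of (Qt)^n / n!, truncated after `terms` terms.
Matrix expTaylor(const Matrix& Q, double t, int terms = 30) {
    assert(terms > 0);
    Matrix result = identity();
    Matrix term = identity();  // holds (Qt)^n / n! for the current n
    for (int n = 1; n < terms; ++n) {
        term = multiply(term, Q);
        for (auto& row : term)
            for (double& x : row) x *= t / n;
        for (int a = 0; a < N; ++a)
            for (int b = 0; b < N; ++b) result[a][b] += term[a][b];
    }
    return result;
}
```

For this Q the result can be compared against the known Jukes-Cantor closed form, 1/4 + (3/4)e^{-4t/3} on the diagonal and 1/4 - (1/4)e^{-4t/3} elsewhere, and against the Markov property P(s+t) = P(s)P(t).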

Constraining the Form of Q

Two key properties of rate matrices are reversibility and transition-transversion modeling. A reversible model is one in which

\[
\forall i, j: \quad \pi_i P_{ij}(t) = \pi_j P_{ji}(t)
\]

where π_i is the background frequency of base i. Reversibility ensures that eigenvalues will be real.

Transition-transversion modeling simply requires that the model be parameterized so that transition (R↔R, Y↔Y) and transversion (R↔Y) rates can be expressed separately, where R denotes a purine (A, G) and Y a pyrimidine (C, T).

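Some Common Forms for Q

Rows and columns are in the order A, C, G, T. Each diagonal entry, written −, is the negative of the sum of the other entries in its row.

Jukes-Cantor:

\[
\mathbf{Q}_{JK} =
\begin{bmatrix}
- & \alpha & \alpha & \alpha \\
\alpha & - & \alpha & \alpha \\
\alpha & \alpha & - & \alpha \\
\alpha & \alpha & \alpha & -
\end{bmatrix}
\]

Kimura:

\[
\mathbf{Q}_{K2P} =
\begin{bmatrix}
- & \beta & \alpha & \beta \\
\beta & - & \beta & \alpha \\
\alpha & \beta & - & \beta \\
\beta & \alpha & \beta & -
\end{bmatrix}
\]

Felsenstein:

\[
\mathbf{Q}_{FEL} =
\begin{bmatrix}
- & \alpha\pi_C & \alpha\pi_G & \alpha\pi_T \\
\alpha\pi_A & - & \alpha\pi_G & \alpha\pi_T \\
\alpha\pi_A & \alpha\pi_C & - & \alpha\pi_T \\
\alpha\pi_A & \alpha\pi_C & \alpha\pi_G & -
\end{bmatrix}
\]

Hasegawa, Kishino, Yano:

\[
\mathbf{Q}_{HKY} =
\begin{bmatrix}
- & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & - & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & - & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -
\end{bmatrix}
\]

General reversible model:

\[
\mathbf{Q}_{REV} =
\begin{bmatrix}
- & \beta\pi_C & \alpha\pi_G & \chi\pi_T \\
\beta\pi_A & - & \kappa\pi_G & \omega\pi_T \\
\alpha\pi_A & \kappa\pi_C & - & \tau\pi_T \\
\chi\pi_A & \omega\pi_C & \tau\pi_G & -
\end{bmatrix}
\]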
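As an illustration of how one of these forms is assembled, the following C++ sketch builds an HKY rate matrix from base frequencies and checks the rate-level form of reversibility, π_a Q_ab = π_b Q_ba (which corresponds to the condition stated for P(t) when P(t) = e^{Qt}). The function names are hypothetical, and the base order A, C, G, T is an assumption of the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int N = 4;  // order: A=0, C=1, G=2, T=3
using Matrix = std::array<std::array<double, N>, N>;
using Frequencies = std::array<double, N>;

// Transitions are A<->G and C<->T; with this encoding they differ by 2.
bool isTransition(int a, int b) { return (a + 2) % N == b; }

// HKY: off-diagonal rate a->b is alpha*pi[b] for transitions and
// beta*pi[b] for transversions; the diagonal makes each row sum to zero.
Matrix hkyRateMatrix(const Frequencies& pi, double alpha, double beta) {
    assert(alpha >= 0.0 && beta >= 0.0);
    Matrix Q{};
    for (int a = 0; a < N; ++a) {
        double rowSum = 0.0;
        for (int b = 0; b < N; ++b) {
            if (a == b) continue;
            Q[a][b] = (isTransition(a, b) ? alpha : beta) * pi[b];
            rowSum += Q[a][b];
        }
        Q[a][a] = -rowSum;
    }
    return Q;
}

// Checks pi[a]*Q[a][b] == pi[b]*Q[b][a] for every pair, within a tolerance.
bool satisfiesDetailedBalance(const Matrix& Q, const Frequencies& pi, double tol = 1e-12) {
    for (int a = 0; a < N; ++a)
        for (int b = 0; b < N; ++b)
            if (std::fabs(pi[a] * Q[a][b] - pi[b] * Q[b][a]) > tol) return false;
    return true;
}
```

Setting alpha equal to beta in this sketch gives the Felsenstein form, and additionally using equal frequencies gives a Jukes-Cantor matrix up to scaling.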

Estimation of Evolutionary Parameters

Now we return to the final problem of estimating the evolutionary parameters that determine the substitution probabilities P(t).

Consider an evolution model ψ = (τ, β, Q) consisting of a tree topology τ, a set β = {t_i | 0 ≤ i < n} of branch lengths, and a rate matrix Q. It is reasonable to infer the phylogeny first, and then to infer β and Q via maximum likelihood, given τ.

The UPGMA algorithm constructs phylogenetic trees from aligned sequences, as follows:

1. Initialize a population of tree stubs, one stub per sequence
2. Compute all pairwise sequence edit distances between stubs
3. Iteratively combine subtrees:
   4. Pick the two closest subtrees T_i and T_j
   5. Combine T_i and T_j into a new subtree T_k
   6. Remove T_i and T_j from the population, add T_k, and recompute distances between T_k and all other subtrees by averaging the distances of T_i and T_j, weighted by subtree size:

\[
\forall m \in L: \quad d_{km} = \frac{n_i\, d_{im} + n_j\, d_{jm}}{n_i + n_j}
\]

where L is the current population of subtrees and n_i is the number of sequences (leaves) in T_i.
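The UPGMA steps can be sketched in C++ as follows. This is a minimal illustration that records only the order of merges (the topology); branch lengths are omitted, and the Merge struct and the convention of numbering new subtrees n, n+1, ... are choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Merge { int left, right, parent; };

// dist: symmetric matrix of pairwise distances between n taxa (ids 0..n-1).
// Returns the n-1 merges in the order performed; new subtrees get ids n, n+1, ...
std::vector<Merge> upgma(std::vector<std::vector<double>> dist) {
    const int n = static_cast<int>(dist.size());
    assert(n >= 2);
    std::vector<int> id(n), size(n, 1);  // active subtrees: id and leaf count
    for (int i = 0; i < n; ++i) id[i] = i;
    std::vector<Merge> merges;
    int nextId = n;
    while (id.size() > 1) {
        // Pick the two closest active subtrees.
        std::size_t bi = 0, bj = 1;
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < id.size(); ++i)
            for (std::size_t j = i + 1; j < id.size(); ++j)
                if (dist[i][j] < best) { best = dist[i][j]; bi = i; bj = j; }
        merges.push_back({id[bi], id[bj], nextId});
        // Size-weighted average distance from the new subtree to each other one.
        for (std::size_t m = 0; m < id.size(); ++m) {
            if (m == bi || m == bj) continue;
            const double d = (size[bi] * dist[bi][m] + size[bj] * dist[bj][m]) /
                             (size[bi] + size[bj]);
            dist[bi][m] = dist[m][bi] = d;
        }
        // The new subtree takes slot bi; slot bj is removed from the population.
        id[bi] = nextId++;
        size[bi] += size[bj];
        id.erase(id.begin() + bj);
        size.erase(size.begin() + bj);
        dist.erase(dist.begin() + bj);
        for (auto& row : dist) row.erase(row.begin() + bj);
    }
    return merges;
}
```

The quadratic search for the closest pair is repeated at every merge, which is acceptable for the handful of informant genomes typical of this setting.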

UPGMA Example

[Figure: five taxa (1 to 5) are merged step by step; new ancestors 6, 7, 8, and 9 are added until a single tree remains.]

1. Start with a population of taxa
2. Compute all pairwise edit distances and pick the closest pair (3 & 4)
3. Join the closest pair with a new ancestor (6), which replaces them in the population
4. Recompute distances and again combine the closest subtrees
5. Repeat until only one tree remains.

\[
\forall m \in L: \quad d_{km} = \frac{n_i\, d_{im} + n_j\, d_{jm}}{n_i + n_j}
\]

Evaluating the Accuracy of UPGMA

After simulating the evolution of a random 5000 bp sequence over a randomly generated phylogeny, using a random HKY matrix:

True phylogeny: [figure]

Phylogeny inferred via UPGMA: [figure]

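Improving on UPGMA

A better algorithm than UPGMA is the Neighbor-Joining (NJ) algorithm, which follows the same logic as UPGMA but uses different distance formulae and, unlike UPGMA, does not assume that all lineages evolve at the same rate.

Choosing the nearest trees to merge is done via D_ij:

\[
D_{ij} = d_{ij} - r_i - r_j
\qquad
r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}
\]

Updating of distances for a new subtree T_k is done via:

\[
\forall m \in L: \quad d_{km} = \tfrac{1}{2} \left( d_{im} + d_{jm} - d_{ij} \right)
\]

Branch lengths for the children (i, j) of new node k are:

\[
d_{ik} = \tfrac{1}{2} \left( d_{ij} + r_i - r_j \right)
\qquad
d_{jk} = \tfrac{1}{2} \left( d_{ij} + r_j - r_i \right)
\]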
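One selection step of Neighbor-Joining, using the formulae above, can be sketched in C++ as follows. The names JoinChoice, chooseNeighbors, and distanceToNewNode are hypothetical, and the sketch covers a single join rather than the full iteration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using DistanceMatrix = std::vector<std::vector<double>>;

struct JoinChoice {
    std::size_t i, j;         // the pair of subtrees selected for joining
    double branchI, branchJ;  // branch lengths from i and j to the new node k
};

// Computes r_i = (1/(|L|-2)) * sum_k d_ik and picks the pair minimizing
// D_ij = d_ij - r_i - r_j (ties go to the first pair encountered).
JoinChoice chooseNeighbors(const DistanceMatrix& d) {
    const std::size_t n = d.size();
    assert(n >= 3);
    std::vector<double> r(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) r[i] += d[i][k];
        r[i] /= static_cast<double>(n - 2);
    }
    std::size_t bi = 0, bj = 1;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double D = d[i][j] - r[i] - r[j];
            if (D < best) { best = D; bi = i; bj = j; }
        }
    const double dij = d[bi][bj];
    return {bi, bj, 0.5 * (dij + r[bi] - r[bj]), 0.5 * (dij + r[bj] - r[bi])};
}

// Distance from the new node k (joining i and j) to a remaining subtree m.
double distanceToNewNode(const DistanceMatrix& d, std::size_t i, std::size_t j,
                         std::size_t m) {
    return 0.5 * (d[i][m] + d[j][m] - d[i][j]);
}
```

On a four-taxon tree with unequal branch lengths, such as leaf branches of 1, 3, 1, 3 around an internal branch of 1, the smallest raw distance links two non-sibling taxa, while the D_ij criterion selects a true sibling pair and recovers its branch lengths.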

Neighbor-Joining (NJ) Algorithm

Applying the Neighbor-Joining algorithm (plus arbitrary rooting), rather than UPGMA, produces a more accurate topology:

True phylogeny: [figure]

Phylogeny inferred via Neighbor-Joining: [figure]

Estimating Q and {t_i}

Now we return to the final problem of determining Q and the branch lengths {t_i}:

\[
(\mathbf{Q}, \{t_i\})^* = \arg\max_{(\mathbf{Q}, \{t_i\})}
  P\!\left(A_{S, I^{(1)}, \ldots, I^{(n)}} \mid \mathbf{Q}, \{t_i\}\right)
\]

for alignment A_{S,I(1),...,I(n)}.

BFGS algorithm: GNU Scientific Library, GSL routine gsl_multimin_fdfminimizer_vector_bfgs. This procedure requires all first partial derivatives of the objective function, which can be evaluated via “differencing”, that is, computing the full tree likelihood at two nearby points and taking the difference:

\[
\frac{\partial L(x)}{\partial x} \approx \frac{L(x + dx) - L(x)}{dx}
\]

or we can compute the partial derivatives analytically.
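The differencing approach can be sketched in C++ as a forward-difference gradient over a parameter vector. The function name differenceGradient and the default step size are illustrative choices, not part of the GSL; in the setting above, L would evaluate the full tree likelihood for a given vector of branch lengths and rate parameters.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Forward-difference approximation of the gradient:
// dL/dx_k ~ (L(x + dx * e_k) - L(x)) / dx, for each coordinate k.
std::vector<double> differenceGradient(
    const std::function<double(const std::vector<double>&)>& L,
    std::vector<double> x, double dx = 1e-6) {
    assert(dx > 0.0);
    const double base = L(x);  // evaluated once and reused for every coordinate
    std::vector<double> grad(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        const double saved = x[k];
        x[k] = saved + dx;
        grad[k] = (L(x) - base) / dx;
        x[k] = saved;  // restore before perturbing the next coordinate
    }
    return grad;
}
```

Each gradient costs one extra likelihood evaluation per parameter, and the step size trades truncation error against round-off, which is part of the motivation for deriving the partial derivatives analytically.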

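Likelihood Gradient

Derivative with respect to a branch length t_{x,y} (the branch from node x to its child y), where other(y) is the sibling of y and left and right are the two children of u:

\[
\frac{\partial L_u(a)}{\partial t_{x,y}} =
\begin{cases}
0 & \text{for a leaf } u \\
f_{=} & \text{for internal } u = x \\
f_{\neq} & \text{for internal } u \neq x
\end{cases}
\]

\[
\frac{\partial \mathbf{P}(t_{x,y})}{\partial t_{x,y}} = \mathbf{P}(t_{x,y})\, \mathbf{Q}
\]

\[
f_{=} = \left( \sum_{b \in \alpha} L_y(b)\, \frac{\partial P_{a,b}(t_{x,y})}{\partial t_{x,y}} \right)
        \left( \sum_{b \in \alpha} L_{\mathrm{other}(y)}(b)\, P_{a,b}(t_{x,\mathrm{other}(y)}) \right)
\]

\[
\begin{aligned}
f_{\neq} ={}& \left( \sum_{b \in \alpha} L_{\mathrm{left}}(b)\, P_{a,b}(t_{u,\mathrm{left}}) \right)
              \left( \sum_{b \in \alpha} \frac{\partial L_{\mathrm{right}}(b)}{\partial t_{x,y}}\, P_{a,b}(t_{u,\mathrm{right}}) \right) \\
          &+ \left( \sum_{b \in \alpha} L_{\mathrm{right}}(b)\, P_{a,b}(t_{u,\mathrm{right}}) \right)
              \left( \sum_{b \in \alpha} \frac{\partial L_{\mathrm{left}}(b)}{\partial t_{x,y}}\, P_{a,b}(t_{u,\mathrm{left}}) \right)
\end{aligned}
\]

Derivative with respect to a parameter π of the rate matrix, where ∘ denotes the elementwise product:

\[
\frac{\partial \mathbf{P}(t)}{\partial \pi} =
  \mathbf{G} \left[ \mathbf{F} \circ \left( \mathbf{G}^{-1}\, \frac{\partial \mathbf{Q}}{\partial \pi}\, \mathbf{G} \right) \right] \mathbf{G}^{-1}
\]

\[
\mathbf{F} = [f_{a,b}], \qquad
f_{a,b} =
\begin{cases}
t\, e^{\lambda_a t} & \text{if } \lambda_a = \lambda_b \\[4pt]
\dfrac{e^{\lambda_a t} - e^{\lambda_b t}}{\lambda_a - \lambda_b} & \text{otherwise}
\end{cases}
\]

\[
\frac{\partial L_u(a)}{\partial \pi} =
\begin{cases}
0 & \text{for a leaf } u \\
g_{\mathrm{int}} & \text{for internal } u
\end{cases}
\]

\[
\begin{aligned}
g_{\mathrm{int}} ={}& \left( \sum_{b \in \alpha} L_{\mathrm{left}}(b)\, P_{a,b}(t_{u,\mathrm{left}}) \right)
  \sum_{b \in \alpha} \left( \frac{\partial L_{\mathrm{right}}(b)}{\partial \pi}\, P_{a,b}(t_{u,\mathrm{right}})
    + L_{\mathrm{right}}(b)\, \frac{\partial P_{a,b}(t_{u,\mathrm{right}})}{\partial \pi} \right) \\
 &+ \left( \sum_{b \in \alpha} L_{\mathrm{right}}(b)\, P_{a,b}(t_{u,\mathrm{right}}) \right)
  \sum_{b \in \alpha} \left( \frac{\partial L_{\mathrm{left}}(b)}{\partial \pi}\, P_{a,b}(t_{u,\mathrm{left}})
    + L_{\mathrm{left}}(b)\, \frac{\partial P_{a,b}(t_{u,\mathrm{left}})}{\partial \pi} \right)
\end{aligned}
\]

Combining columns, for any parameter x:

\[
\frac{\partial L}{\partial x}(\text{all columns}) =
  \sum_{\text{columns } i} \left( \prod_{\text{columns } j \neq i} L(\mathrm{col}_j) \right)
  \frac{\partial L}{\partial x}(\mathrm{col}_i)
\]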
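The product over all other columns can be avoided by differentiating the log-likelihood instead. This is a common reformulation rather than something stated on the slides, and it assumes every column likelihood is strictly positive:

```latex
\[
\log L(\text{all columns}) = \sum_{\text{columns } i} \log L(\mathrm{col}_i)
\qquad\Longrightarrow\qquad
\frac{\partial \log L}{\partial x}(\text{all columns})
  = \sum_{\text{columns } i} \frac{1}{L(\mathrm{col}_i)}\,
    \frac{\partial L}{\partial x}(\mathrm{col}_i)
\]
\[
\frac{\partial L}{\partial x}(\text{all columns})
  = L(\text{all columns}) \cdot \frac{\partial \log L}{\partial x}(\text{all columns})
\]
```

Since the logarithm is monotonic, maximizing the log-likelihood yields the same parameters, and the per-column terms stay in a numerically comfortable range even for long alignments.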

Gradient Ascent with the GSL

    #include <iostream>
    using namespace std;

    // GSL::ObjectiveFunction, GSL::Vector and GSL::Optimizer are C++ wrapper
    // classes around the GNU Scientific Library (their header is not shown here).
    struct MyObjective : public GSL::ObjectiveFunction {
        // Objective: f(x) = (x-3)^2
        virtual double f(const GSL::Vector &currentPoint) {
            double x = currentPoint[0];
            return (x-3)*(x-3);
        }
        // Gradient: df/dx = 2(x-3)
        virtual void gradient(const GSL::Vector &currentPoint, GSL::Vector &gradient) {
            double x = currentPoint[0];
            gradient[0] = 2*(x-3);
        }
    };

    GSL::Vector initialPoint(1);
    initialPoint.setAllTo(0);
    MyObjective f;
    GSL::Optimizer optimizer(GSL::BFGS, f, initialPoint, 0.01, GSL::BY_EITHER, 0.01, 100);
    optimizer.run();
    cout << "optimal point: " << optimizer.getOptimalPoint() << endl;
    cout << "took " << optimizer.iterationsUsed() << " iterations" << endl;

Classifying Coding vs. Noncoding DNA

[Figure: classification accuracy (y-axis) versus noncoding mutation rate (x-axis)]

Figure 9.24: Classification accuracy (y-axis, percentages) of a 0th-order PhyloHMM for an exon identification task. Equal numbers of coding and noncoding segments were independently evolved over a simulated phylogeny, using a Jukes-Cantor model (section 9.6.4) with variable substitution rates, and then classified via likelihood ratio: P(S|coding)/P(S|noncoding). Coding substitution rate was fixed at 5%; noncoding substitution rate was varied from 6% to 80% (x-axis). Increasing the noncoding substitution rate relative to the coding rate quickly enabled the PhyloHMM to achieve reliable discrimination between coding and noncoding elements.

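Modeling Sequential Dependencies

Probabilities are now conditional on some number of symbols in preceding columns.

\[
\mathbf{Q}_{AA,AA} =
\begin{bmatrix}
- & \beta\pi_C & \alpha\pi_G & \chi\pi_T \\
\beta\pi_A & - & \kappa\pi_G & \omega\pi_T \\
\alpha\pi_A & \kappa\pi_C & - & \tau\pi_T \\
\chi\pi_A & \omega\pi_C & \tau\pi_G & -
\end{bmatrix}
\qquad
\mathbf{Q}_{AA,AC} = [\;\cdots\;]
\qquad \ldots \qquad
\mathbf{Q}_{TT,TT} = [\;\cdots\;]
\]

4^{2n} matrices

2nd order REV: 1536 free parameters

[Figure: a phylogeny with root context AAC, a leaf context ATG, and unobserved ancestral contexts marked “?”, annotated with the terms below.]

\[
\sum_{XYZ} P(Z \mid XY, AAC)
\qquad
\sum_{UVW} P(W \mid UV, XYZ)
\qquad
\sum_{JKL} P(L \mid JK, UVW)
\qquad
P(G \mid AT, JKL)
\]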
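One reading that reproduces the stated parameter count is sketched below. It assumes that each context-specific matrix is a REV matrix with six free exchangeability parameters (α, β, χ, κ, ω, τ), that the equilibrium frequencies are not counted, and that n = 2 is the model order:

```latex
\[
4^{2n}\Big|_{n=2} = 4^{4} = 256 \ \text{matrices},
\qquad
256 \times 6 = 1536 \ \text{free parameters}
\]
```

The count grows by a factor of 16 with each additional order, which is why higher-order models of this kind need much more training data.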

Summary

• Sequence conservation patterns can differ between genomic feature types, due to the effects of natural selection, and these biases can be used to improve prediction accuracy

• PhyloHMMs model conservation patterns in multi-sequence alignments via substitution rate matrices evaluated over a phylogeny

• The likelihood of a column in an alignment, given an evolutionary model, can be evaluated via Felsenstein’s pruning algorithm

• Dependencies between columns can be modeled using higher-order rate matrices
