Modeling Sequence Conservation with Phylogenomic HMM’s
CBB 231 / COMPSCI 261
B. Majoros
Overview of Comparative Genome Analysis
Noncomparative:
human: ATCTCATTCGCGCATTCTGATCCGATCTATC
→ model → predicted genomic features
Comparative:
human: ATCTCATTCGCGCATTCTGATCCGATCTATC
“informant” genomes:
chimp: ATCTCATTCGCGCATTCTGATCCGATCTATC
mouse: CTCTCATACGCGCCTTCTGTTCCGATGTATC
dog: AAGTCATACGGGCAATCTCATGCGAACTACC
chicken: GTTTAACTCTCGGATAAATATCCAGCCAACA
→ model → predicted genomic features
Non-independence of Informants
Suppose we have a multiple-sequence alignment for a genomic region of interest. Due to their common ancestry, the informant sequences are not independent. We can control for that non-independence by explicitly modeling their dependence structure using a phylogenetic tree.
We will see later that a phylogenetic tree (or “phylogeny”) can be interpreted as a special type of Bayesian network, in which sequence conservation probabilities are expressed as a function of the branch lengths.
Branch lengths represent evolutionary distance, which conflates the distinct phenomena of elapsed time and mutation rate.
The Utility of Sequence Conservation
Phylogenomic methods make use of the assumption that natural selection operates more strongly on some genomic features than others (e.g., coding versus noncoding), resulting in a detectable bias in sequence conservation for the features of interest.
Levels of Conservation
More generally, conservation patterns may differ between levels of DNA organization (e.g., amino acids in coding segments, versus individual nucleotides in conserved noncoding elements).
Conservation between A. fumigatus and A. nidulans:
feature     amino acid conservation        nucleotide conservation
exon 1      100%                      >    71%
intron 1    14%                       <    51%
exon 2      98%                       >    85%
intron 2    29%                       <    49%
exon 3      97%                       >    82%
intron 3    9%                        <    49%
exon 4      96%                       >    83%
Phylogenomic Gene Finding
[Figure: a model of gene structure (states N, Esng, Einit, Eint, Efin, I; signals ATG, TAG, GT, AG) combined with a model of phylogeny (target S, informants I1 and I2, ancestors A1, A2, A3)]
model of gene structure + model of phylogeny = a model of gene structure informed by observed evolutionary divergence
PhyloHMM’s
[Figure: the gene-structure HMM (states N, Esng, Einit, Eint, Efin, I; signals ATG, TAG, GT, AG) with a copy of the phylogeny (S, I1, I2, A1, A2, A3) attached to each state]
A PhyloHMM is an HMM (or GHMM) in which each state $q_i$ has an associated evolution model $\psi_i$ describing the expected patterns and rates of evolution in the class of features represented by that state.
In practice, many states will share the same evolution model (“parameter tying”). A typical PhyloHMM will have only two evolution models: one for coding sequence and one for noncoding.
Recall: Gene Finding with a GHMM
Recall: gene finding with a GHMM involves finding the parse $\phi^*$ which is most probable, given the input sequence $S$:
$$\phi^* = \operatorname*{argmax}_{\phi} P(\phi \mid S) = \operatorname*{argmax}_{\phi} \frac{P(\phi,S)}{P(S)} = \operatorname*{argmax}_{\phi} P(\phi,S) = \operatorname*{argmax}_{\phi} P(\phi)\,P(S \mid \phi)$$
This can be further factored into a product of emission, transition, and duration probabilities:
$$\operatorname*{argmax}_{\phi} \prod_{(q_i,d_i)\in\phi} \underbrace{P(S_i \mid q_i,d_i)}_{\text{emission}}\;\underbrace{P(q_i \mid q_{i-1})}_{\text{transition}}\;\underbrace{P(d_i \mid q_i)}_{\text{duration}}$$
Methods for efficiently evaluating these terms and extracting the optimal parse (via dynamic programming) were described in a previous lecture.
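Incorporating Homology Evidence
Given a target sequence $S$ and a set of homologous informant sequences $I^{(1)},\ldots,I^{(n)}$ aligned to $S$, we wish to find the most probable parse $\phi^*$ of $S$, given $S$ and $I^{(1)},\ldots,I^{(n)}$:
$$\begin{aligned}
\phi^* &= \operatorname*{argmax}_{\phi} P(\phi \mid S, I^{(1)},\ldots,I^{(n)}) \\
&= \operatorname*{argmax}_{\phi} \frac{P(\phi, S, I^{(1)},\ldots,I^{(n)})}{P(S, I^{(1)},\ldots,I^{(n)})} \\
&= \operatorname*{argmax}_{\phi} P(\phi, S, I^{(1)},\ldots,I^{(n)}) \\
&= \operatorname*{argmax}_{\phi} P(\phi)\,P(S, I^{(1)},\ldots,I^{(n)} \mid \phi) \\
&= \operatorname*{argmax}_{\phi} P(\phi)\,P(S \mid \phi)\,P(I^{(1)},\ldots,I^{(n)} \mid S,\phi)
\end{aligned}$$
where a parse $\phi = \{(\Gamma_i,d_i) \mid 0 \le i < L\}$ is a series of feature types $\Gamma_i$ and their corresponding lengths $d_i$ along the sequence in left-to-right order.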
Using a Phylogeny as a Bayesian Network
If the target sequence $S$ and the informant sequences $I^{(1)},\ldots,I^{(n)}$ are related via a known phylogeny, then we can re-root that phylogeny to place $S$ at the root, and then attach matrices to the branches to describe the mutation probabilities between ancestral sequences and their children:
[Figure: phylogeny → (re-root) → Bayesian network]
Re-rooting at $S$ corresponds to conditioning the informants on $S$...
Factoring the Likelihood
We can then use the re-rooted tree as a Bayesian network in order to factor the $P(I^{(1)},\ldots,I^{(n)} \mid S,\phi)$ term:
$$\sum_{A_1,A_2,A_3} P(I_1 \mid A_2)\,P(A_2 \mid A_1)\,P(A_1 \mid A_3)\,P(I_2 \mid A_3)\,P(A_3 \mid S)$$
This factorization is possible because of the common (and entirely justifiable) assumption of conditional independence in traditional phylogenetic models.
More generally:
$$P(I^{(1)},\ldots,I^{(n)} \mid S) = \sum_{\text{unobservables}} \left( \prod_{\text{nonroot } v} P(v \mid \mathrm{parent}(v)) \right)$$
where we have summed over all possible assignments to the unobservables (ancestral vertices).
Eliminating Useless Ancestors
Any ancestor having only one child can be eliminated algebraically:
[Figure: the chain A → B → C reduces to A → C]
This is a form of variable elimination (recall from the Bayesian networks lecture), and is trivially justified:
$$\begin{aligned}
\sum_A \sum_B \sum_C P(A)\,P(B \mid A)\,P(C \mid B) &= \sum_A \sum_B \sum_C P(A)\,P(C,B \mid A) \\
&= \sum_A \sum_C P(A) \left( \sum_B P(C,B \mid A) \right) \\
&= \sum_A \sum_C P(A)\,P(C \mid A)
\end{aligned}$$
Possible only because of the conditional independence assumption:
$$P(C \mid B, A) = P(C \mid B)$$
Evaluating the Likelihood
Given a pre-computed alignment of the target and informant sequences, and assuming independence between sites (columns of the alignment), the likelihood can be computed on a per-nucleotide basis using a recursion known as Felsenstein’s pruning algorithm (FPA):
$$L_u(a) = \begin{cases} \delta(u,a) & \text{if } u \text{ is a leaf} \\ \displaystyle\prod_{c\in C(u)} \sum_{b\in\alpha} L_c(b)\,P(c=b \mid u=a) & \text{otherwise} \end{cases}$$
for $C(u)$ the children of node $u$, $\delta(u,a)$ the Kronecker match function (1 if leaf $u$ carries symbol $a$, 0 otherwise), and the augmented DNA alphabet $\alpha=\{A,C,G,T,-\}$ where ‘-’ denotes missing information, typically due to a gap in the alignment.
Evaluating the Likelihood Efficiently
Felsenstein’s recurrence should be computed using dynamic programming, for the sake of efficiency. $L_u(a)$ can be evaluated using bottom-up DP (using a postorder tree traversal) or via memoization (computing each value only once and storing it in a “memo”). In either case, $L_c(b)$ can be obtained during all subsequent evaluations via a simple lookup in a DP matrix:
$$\mathrm{matrix}(u,a) = \begin{cases} \delta(u,a) & \text{if } u \text{ is a leaf} \\ \displaystyle\prod_{c\in C(u)} \sum_{b\in\alpha} \mathrm{matrix}(c,b)\,P(c=b \mid u=a) & \text{otherwise} \end{cases}$$
The size of the matrix is $|\alpha| \times N$, for $N$ the number of nodes in the tree (the observed taxa plus their ancestors), with $|\alpha|=4$ for the plain DNA alphabet or $|\alpha|=5$ when the gap symbol is included.
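The recursion can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Node` struct, the function names, and the use of a Jukes-Cantor model with rate `alpha` for P(c=b|u=a) are all hypothetical choices made for the example, and the gap symbol is left out so the alphabet has four symbols. Because each child is visited exactly once in postorder, no separate memo table is needed here.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Alphabet indices: A=0, C=1, G=2, T=3 (the gap symbol is omitted in this sketch).
constexpr int kAlpha = 4;
using Vec4 = std::array<double, kAlpha>;

// Hypothetical tree node: a leaf carries an observed base, an internal node
// carries children; branchLength is the branch from the parent to this node.
struct Node {
    int base = -1;                // observed base for a leaf, -1 otherwise
    double branchLength = 0.0;    // evolutionary distance from the parent
    std::vector<Node> children;   // empty for a leaf
};

// Jukes-Cantor P(child=b | parent=a) after time t (assumed model for the demo).
double jukesCantor(int a, int b, double t, double alpha = 1.0) {
    const double e = std::exp(-4.0 * alpha * t);
    return a == b ? 0.25 + 0.75 * e : 0.25 - 0.25 * e;
}

// Postorder evaluation of L_u(a) for every symbol a.
Vec4 pruning(const Node& u) {
    Vec4 L{};
    if (u.children.empty()) {     // leaf: Kronecker delta on the observed base
        for (int a = 0; a < kAlpha; ++a) L[a] = (a == u.base) ? 1.0 : 0.0;
        return L;
    }
    L.fill(1.0);
    for (const Node& c : u.children) {
        const Vec4 Lc = pruning(c);   // each child is evaluated exactly once
        for (int a = 0; a < kAlpha; ++a) {
            double sum = 0.0;
            for (int b = 0; b < kAlpha; ++b)
                sum += Lc[b] * jukesCantor(a, b, c.branchLength);
            L[a] *= sum;              // product over children of the inner sum
        }
    }
    return L;
}

// Conditional likelihood of one column given the target base at the root.
double columnLikelihood(const Node& root, int targetBase) {
    assert(targetBase >= 0 && targetBase < kAlpha);
    return pruning(root)[targetBase];
}
```

For a root with two leaf children, `columnLikelihood` reduces to the product of the two branch probabilities, which gives a quick sanity check of the recursion.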
Applying Felsenstein’s Algorithm
Using Felsenstein’s recursion, the conditional likelihood for a single column $j$ of the alignment is given by
$$P(I^{(1)}[j],\ldots,I^{(n)}[j] \mid S,\psi) = L_r(S[j])$$
where $r$ is the root node of the tree, $S[j]$ denotes the $j$th symbol in the target track of the alignment, and $\psi$ is the set of model parameters. Evaluating the conditional likelihood of the entire alignment (assuming independence between sites) can be accomplished via multiplication:
$$P(I^{(1)},\ldots,I^{(n)} \mid S,\psi) = \prod_{0 \le j < L} L_r(S[j])$$
where $L$ is the length of the alignment.
Modeling Feature Types
Now let us attend to $\phi$. Recall that for the $P(\phi)P(S \mid \phi)$ term we partitioned $\phi$ into a series of states and their durations. We can do the same for the informant term:
$$P(I^{(1)},\ldots,I^{(n)} \mid S,\phi) = \prod_{(q_i,d_i)\in\phi} P(I_i^{(1)},\ldots,I_i^{(n)} \mid S_i,q_i,d_i)$$
where $I_i^{(j)}$ is the subsequence emitted by $q_i$ (the $i$th state in $\phi$) into the $I^{(j)}$ track of the alignment, and we have again employed a conditional independence assumption between the features emitted by the different states in the parse. This decomposition by state allows us to utilize a different evolution model for different feature types.
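Combining Terms into the Complete Formula
$$\begin{aligned}
\phi^* &= \operatorname*{argmax}_{\phi} \prod_{(q_i,d_i)\in\phi} P(S_i \mid q_i,d_i)\,P(q_i \mid q_{i-1})\,P(d_i \mid q_i)\,P(I_i^{(1)},\ldots,I_i^{(n)} \mid S_i,q_i,d_i) \\
&= \operatorname*{argmax}_{\phi} \prod_{(q_i,d_i)\in\phi} P(S_i \mid q_i,d_i)\,P(q_i \mid q_{i-1})\,P(d_i \mid q_i) \prod_{b_i \le j \le e_i} L_r(S[j])
\end{aligned}$$
where the output of the $i$th state in $\phi$ spans the interval $(b_i,e_i)$ in the alignment. This optimization problem can be solved efficiently using a GHMM decoding algorithm and prefix sum arrays:
$$A[i] = \begin{cases} \log f(0) & \text{for } i = 0 \\ A[i-1] + \log f(i) & \text{for } i > 0 \end{cases}$$
Computing the product $\prod_{b \le x \le e} f(x)$ for an arbitrary interval $(b,e)$ within the sequence can be achieved by simple subtraction followed by exponentiation. With the array as defined here, the product is $e^{A[e]-A[b-1]}$ (taking $A[-1]=0$); the form $e^{A[e+1]-A[b]}$ applies when the array is shifted by one position so that it starts with a zero entry. Because this operation can be performed in constant time, the use of prefix sum arrays is very fast.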
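A minimal sketch of a log-space prefix sum array follows. It uses the common variant with a leading zero entry, A[0]=0 and A[i+1]=A[i]+log f(i), under which the product over the closed interval [b,e] is exp(A[e+1]-A[b]) with no special case at b=0. The function names are hypothetical, and the per-column values f(x) stand in for terms such as the column likelihoods L_r(S[j]).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds A[0..n] with A[0] = 0 and A[i+1] = A[i] + log f(i).
std::vector<double> buildLogPrefixSums(const std::vector<double>& f) {
    std::vector<double> A(f.size() + 1, 0.0);
    for (std::size_t i = 0; i < f.size(); ++i)
        A[i + 1] = A[i] + std::log(f[i]);
    return A;
}

// Log of the product of f over the closed interval [b, e], in constant time.
double intervalLogProduct(const std::vector<double>& A,
                          std::size_t b, std::size_t e) {
    assert(b <= e && e + 1 < A.size());
    return A[e + 1] - A[b];
}
```

Staying in log space until the final comparison avoids numerical underflow when the interval spans many columns; exponentiate only if the raw product is actually needed.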
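Evaluating the Probability of a Substitution
Recall Felsenstein’s recursion:
$$L_u(a) = \begin{cases} \delta(u,a) & \text{if } u \text{ is a leaf} \\ \displaystyle\prod_{c\in C(u)} \sum_{b\in\alpha} L_c(b)\,P(c=b \mid u=a) & \text{otherwise} \end{cases}$$
$P(c=b \mid u=a)$ is the probability of observing base $b$ in a child node, given that we observe base $a$ at that location in the parent’s genome.
We can model this using a matrix of substitution probabilities, parameterized by the evolutionary time $t$ that has passed between the parent and child species:
$$P(t) = \begin{bmatrix}
p_{A\to A} & p_{A\to C} & p_{A\to G} & p_{A\to T} \\
p_{C\to A} & p_{C\to C} & p_{C\to G} & p_{C\to T} \\
p_{G\to A} & p_{G\to C} & p_{G\to G} & p_{G\to T} \\
p_{T\to A} & p_{T\to C} & p_{T\to G} & p_{T\to T}
\end{bmatrix}$$
with rows indexed by the parent base and columns by the child base (in the order A, C, G, T).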
Continuous-time Markov Models
Substitution models are typically based on continuous-time Markov models. The Markov property for continuous-time Markov chains states that:
$$P(t+s) = P(t)\,P(s)$$
That is, the probability of a given substitution is insensitive to the absolute position along the time axis (i.e., the substitution rate is stationary), so that time-dependent substitution rates are simply compounded via matrix multiplication.
From this we can derive an instantaneous rate matrix $Q$ from $P(t)$, where we make use of the obvious fact that $P(0)=I$:
$$\begin{aligned}
\frac{dP(t)}{dt} &= \lim_{\Delta t \to 0} \frac{P(t+\Delta t) - P(t)}{\Delta t} \\
&= \lim_{\Delta t \to 0} \frac{P(t)\,P(\Delta t) - P(t)\,I}{\Delta t} \\
&= P(t) \lim_{\Delta t \to 0} \frac{P(\Delta t) - P(0)}{\Delta t} \\
&= P(t)\,Q
\end{aligned}$$
Obtaining P(t) from Q
$$\frac{dP(t)}{dt} = P(t)\,Q$$
$$P(t) = e^{Qt} = \sum_{n=0}^{\infty} \frac{Q^n t^n}{n!} = I + Qt + \frac{Q^2 t^2}{2!} + \cdots$$
$e^{Qt}$ (the “matrix exponential”) denotes a Taylor expansion, as shown above.
In practice, we can solve this via eigenvector decomposition:
$$Q = G \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix} G^{-1}
\qquad
P(t) = G \begin{bmatrix} e^{\lambda_1 t} & & & \\ & e^{\lambda_2 t} & & \\ & & \ddots & \\ & & & e^{\lambda_n t} \end{bmatrix} G^{-1}$$
What remains is to determine $Q$ and the branch lengths $\{t_i\}$ of the phylogeny. We will return to this in a few minutes.
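To make the relationship between Q and P(t) concrete, the sketch below evaluates the Taylor expansion directly for a 4×4 rate matrix. Note that this is the truncated series, not the eigenvector decomposition recommended above; it is adequate only when the entries of Qt are small, and is used here purely because it is short. The function names and the default of 40 terms are assumptions made for the example. For a Jukes-Cantor Q the result can be checked against the closed form 1/4 + 3/4·e^{-4αt} on the diagonal.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;

Mat4 identity4() {
    Mat4 I{};
    for (int i = 0; i < 4; ++i) I[i][i] = 1.0;
    return I;
}

Mat4 multiply(const Mat4& X, const Mat4& Y) {
    Mat4 Z{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                Z[i][j] += X[i][k] * Y[k][j];
    return Z;
}

// Jukes-Cantor rate matrix: every off-diagonal rate is alpha, and each
// diagonal entry makes its row sum to zero.
Mat4 jukesCantorQ(double alpha) {
    Mat4 Q{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            Q[i][j] = (i == j) ? -3.0 * alpha : alpha;
    return Q;
}

// P(t) = e^{Qt}, approximated by I + Qt + (Qt)^2/2! + ... up to `terms` terms.
Mat4 transitionMatrix(const Mat4& Q, double t, int terms = 40) {
    assert(terms > 0);
    Mat4 P = identity4();
    Mat4 term = identity4();            // holds (Qt)^n / n!, starting at n = 0
    for (int n = 1; n <= terms; ++n) {
        term = multiply(term, Q);
        for (auto& row : term)
            for (double& x : row) x *= t / n;
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                P[i][j] += term[i][j];
    }
    return P;
}
```

The same routine can be used to confirm the Markov property numerically: `multiply(transitionMatrix(Q, s), transitionMatrix(Q, t))` should agree with `transitionMatrix(Q, s + t)` up to rounding.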
Constraining the Form of Q
Two key properties of rate matrices are reversibility and transition-transversion modeling. A reversible model is one in which
$$\forall_{i,j}\quad \pi_i P_{ij}(t) = \pi_j P_{ji}(t)$$
where $\pi_i$ is the background frequency of base $i$. Reversibility ensures that eigenvalues will be real.
Transition-transversion modeling simply requires that the model be parameterized so that transition (R↔R, Y↔Y) and transversion (R↔Y) rates be differentially expressible.
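Some Common Forms for Q
Rows and columns are ordered A, C, G, T; each diagonal entry (−) is the negative of the sum of the other entries in its row.
Jukes-Cantor:
$$Q_{JC} = \begin{bmatrix} - & \alpha & \alpha & \alpha \\ \alpha & - & \alpha & \alpha \\ \alpha & \alpha & - & \alpha \\ \alpha & \alpha & \alpha & - \end{bmatrix}$$
Kimura:
$$Q_{K2P} = \begin{bmatrix} - & \beta & \alpha & \beta \\ \beta & - & \beta & \alpha \\ \alpha & \beta & - & \beta \\ \beta & \alpha & \beta & - \end{bmatrix}$$
Felsenstein:
$$Q_{FEL} = \begin{bmatrix} - & \alpha\pi_C & \alpha\pi_G & \alpha\pi_T \\ \alpha\pi_A & - & \alpha\pi_G & \alpha\pi_T \\ \alpha\pi_A & \alpha\pi_C & - & \alpha\pi_T \\ \alpha\pi_A & \alpha\pi_C & \alpha\pi_G & - \end{bmatrix}$$
Hasegawa, Kishino, Yano:
$$Q_{HKY} = \begin{bmatrix} - & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\ \beta\pi_A & - & \beta\pi_G & \alpha\pi_T \\ \alpha\pi_A & \beta\pi_C & - & \beta\pi_T \\ \beta\pi_A & \alpha\pi_C & \beta\pi_G & - \end{bmatrix}$$
General reversible model:
$$Q_{REV} = \begin{bmatrix} - & \beta\pi_C & \alpha\pi_G & \chi\pi_T \\ \beta\pi_A & - & \kappa\pi_G & \omega\pi_T \\ \alpha\pi_A & \kappa\pi_C & - & \tau\pi_T \\ \chi\pi_A & \omega\pi_C & \tau\pi_G & - \end{bmatrix}$$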
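As an illustration of these parameterizations, the following sketch builds an HKY rate matrix and checks two properties: every row sums to zero, and the matrix satisfies detailed balance, π_i Q_ij = π_j Q_ji. That condition on Q is the rate-matrix analogue of the reversibility condition stated for P(t) and implies it when P(t)=e^{Qt}. The function names and the tolerance are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;
using Vec4 = std::array<double, 4>;

// Base order A=0, C=1, G=2, T=3. A<->G (purines) and C<->T (pyrimidines)
// are transitions; every other change is a transversion.
bool isTransition(int i, int j) {
    return (i == 0 && j == 2) || (i == 2 && j == 0) ||
           (i == 1 && j == 3) || (i == 3 && j == 1);
}

// HKY: Q_ij = alpha * pi_j for transitions, beta * pi_j for transversions,
// with each diagonal entry set to minus the sum of the rest of its row.
Mat4 hkyRateMatrix(const Vec4& pi, double alpha, double beta) {
    assert(alpha > 0.0 && beta > 0.0);
    Mat4 Q{};
    for (int i = 0; i < 4; ++i) {
        double rowSum = 0.0;
        for (int j = 0; j < 4; ++j) {
            if (i == j) continue;
            Q[i][j] = (isTransition(i, j) ? alpha : beta) * pi[j];
            rowSum += Q[i][j];
        }
        Q[i][i] = -rowSum;
    }
    return Q;
}

// Detailed balance check: pi_i * Q_ij == pi_j * Q_ji for all i != j.
bool isReversible(const Mat4& Q, const Vec4& pi, double tol = 1e-12) {
    for (int i = 0; i < 4; ++i)
        for (int j = i + 1; j < 4; ++j)
            if (std::fabs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) > tol)
                return false;
    return true;
}
```

Setting all π to 1/4 in `hkyRateMatrix` recovers the Kimura pattern of α and β entries up to a constant factor, and additionally setting α equal to β recovers the Jukes-Cantor pattern.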
Estimation of Evolutionary Parameters
Now we return to the final problem of estimating the evolutionary parameters that determine the substitution probabilities $P(t)$.
Consider an evolution model $\psi = (\tau, \beta, Q)$ consisting of a tree topology $\tau$, a set $\beta = \{t_i \mid 0 \le i < n\}$ of branch lengths, and a rate matrix $Q$. It is reasonable to infer the phylogeny first, and then to infer $\beta$ and $Q$ via maximum likelihood, given $\tau$.
The UPGMA algorithm constructs phylogenetic trees from aligned sequences, as follows:
1. Initialize a population of tree stubs, one stub per sequence
2. Compute all pairwise sequence edit distances between stubs
3. Iteratively combine subtrees:
   4. Pick the two closest subtrees $T_i$ and $T_j$
   5. Combine $T_i$ and $T_j$ into a new subtree $T_k$
   6. Remove $T_i$ and $T_j$ from the population, add $T_k$, and recompute distances between $T_k$ and all other subtrees by averaging the distances from $T_i$ and $T_j$:
$$\forall_{m\in L}\quad d_{km} = \frac{n_i d_{im} + n_j d_{jm}}{n_i + n_j}$$
where $L$ is the current population of subtrees and $n_i$ is the number of taxa in subtree $T_i$.
UPGMA Example
[Figure: five taxa (1 to 5) are successively joined under new ancestors 6, 7, 8, and 9]
1. Start with a population of taxa
2. Compute all pairwise edit distances and pick the closest pair (3 & 4)
3. Join the closest pair with a new ancestor (6), which replaces them in the population
4. Recompute distances and again combine the closest subtrees
5. Repeat...
6. ...until only one tree remains.
$$\forall_{m\in L}\quad d_{km} = \frac{c_i d_{im} + c_j d_{jm}}{c_i + c_j}$$
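The UPGMA procedure can be sketched compactly if we assume the pairwise distances have already been computed and are supplied as a symmetric matrix. The `Cluster` struct, the function name, and the nested-parentheses output format are hypothetical choices for this example, and branch lengths are not tracked, so only the topology is produced.

```cpp
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// One subtree in the working population: its label and its number of taxa.
struct Cluster {
    std::string label;
    int size;
};

// UPGMA on a symmetric distance matrix d; returns the inferred topology
// as a nested-parentheses string.
std::string upgma(const std::vector<std::string>& names,
                  std::vector<std::vector<double>> d) {
    assert(names.size() == d.size() && !names.empty());
    std::vector<Cluster> pop;
    for (const auto& name : names) pop.push_back({name, 1});

    while (pop.size() > 1) {
        const int n = static_cast<int>(pop.size());

        // Pick the two closest subtrees Ti and Tj (bi < bj).
        int bi = 0, bj = 1;
        double best = std::numeric_limits<double>::infinity();
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (d[i][j] < best) { best = d[i][j]; bi = i; bj = j; }

        // Distances from the new subtree Tk: average weighted by taxon counts.
        const double ni = pop[bi].size, nj = pop[bj].size;
        std::vector<double> dk;
        for (int m = 0; m < n; ++m)
            if (m != bi && m != bj)
                dk.push_back((ni * d[bi][m] + nj * d[bj][m]) / (ni + nj));

        Cluster merged{"(" + pop[bi].label + "," + pop[bj].label + ")",
                       pop[bi].size + pop[bj].size};

        // Remove Ti and Tj (larger index first so positions stay valid).
        pop.erase(pop.begin() + bj);
        pop.erase(pop.begin() + bi);
        d.erase(d.begin() + bj);
        d.erase(d.begin() + bi);
        for (auto& row : d) {
            row.erase(row.begin() + bj);
            row.erase(row.begin() + bi);
        }

        // Add Tk as the last row and column of the distance matrix.
        for (std::size_t m = 0; m < d.size(); ++m) d[m].push_back(dk[m]);
        dk.push_back(0.0);
        d.push_back(dk);
        pop.push_back(merged);
    }
    return pop[0].label;
}
```

With taxa A, B, C, D where d(A,B)=2, d(C,D)=4 and all other distances are 10, the call returns `((A,B),(C,D))`. The O(n^3) search for the closest pair is kept for clarity rather than speed.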
Evaluating the Accuracy of UPGMA
After simulating the evolution of a random 5000 bp sequence over a randomly generated phylogeny, using a random HKY matrix:
True phylogeny: [figure]
Phylogeny inferred via UPGMA: [figure]
Improving on UPGMA
A better algorithm than UPGMA is the Neighbor-Joining (NJ) algorithm, which follows the same logic as UPGMA, but with different distance formulae.
Choosing the nearest trees to merge is done via $D_{ij}$:
$$D_{ij} = d_{ij} - r_i - r_j \qquad r_i = \frac{1}{|L|-2} \sum_{k\in L} d_{ik}$$
Updating of distances for a new subtree $T_k$ is done via:
$$\forall_{m\in L}\quad d_{km} = \frac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right)$$
Branch lengths for the children $(i, j)$ of new node $k$ are:
$$d_{ki} = \frac{1}{2}\left(d_{ij} + r_i - r_j\right) \qquad d_{kj} = \frac{1}{2}\left(d_{ij} + r_j - r_i\right)$$
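A single Neighbor-Joining merge step, following the formulae above, might look like the sketch below. The `NjStep` struct and the function name are hypothetical, the input is assumed to be a symmetric distance matrix with zeros on the diagonal, and the bookkeeping that replaces subtrees i and j by k in the population is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Result of one merge: the chosen pair, the branch lengths from the new
// node k to i and j, and the distances from k to every other subtree m.
struct NjStep {
    std::size_t i = 0, j = 0;
    double lenI = 0.0, lenJ = 0.0;
    std::vector<double> dk;
};

NjStep neighborJoiningStep(const Matrix& d) {
    const std::size_t n = d.size();
    assert(n >= 3);   // r_i divides by |L| - 2

    // r_i = (1 / (|L| - 2)) * sum over k of d_ik
    std::vector<double> r(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) r[i] += d[i][k];
        r[i] /= static_cast<double>(n - 2);
    }

    // Choose the pair minimizing D_ij = d_ij - r_i - r_j.
    NjStep step;
    double best = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            const double D = d[i][j] - r[i] - r[j];
            if (D < best) { best = D; step.i = i; step.j = j; }
        }

    // Branch lengths from the new node k to its children i and j.
    const std::size_t i = step.i, j = step.j;
    step.lenI = 0.5 * (d[i][j] + r[i] - r[j]);
    step.lenJ = 0.5 * (d[i][j] + r[j] - r[i]);

    // Distances from k to each remaining subtree m.
    for (std::size_t m = 0; m < n; ++m)
        if (m != i && m != j)
            step.dk.push_back(0.5 * (d[i][m] + d[j][m] - d[i][j]));
    return step;
}
```

As a check, consider four taxa on the tree ((A:1,B:3):1,(C:1,D:3)), whose pairwise distances are AB=4, AC=3, AD=5, BC=5, BD=7, CD=4. The smallest raw distance is AC, which is not a true sibling pair, whereas this step selects A and B and recovers their branch lengths 1 and 3.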
Neighbor-Joining (NJ) Algorithm
Applying the Neighbor-Joining algorithm (plus arbitrary rooting), rather than UPGMA, produces a more accurate topology:
True phylogeny: [figure]
Phylogeny inferred via Neighbor-Joining: [figure]
Estimating Q and {ti}
Now we return to the final problem of determining $Q$ and the branch lengths $\{t_i\}$:
$$(Q, \{t_i\})^* = \operatorname*{argmax}_{(Q,\{t_i\})} P(A_{S,I^{(1)},\ldots,I^{(n)}} \mid Q, \{t_i\})$$
for alignment $A_{S,I^{(1)},\ldots,I^{(n)}}$.
BFGS algorithm: GNU Scientific Library (GSL), routine gsl_multimin_fdfminimizer_vector_bfgs. This procedure requires all first partial derivatives of the objective function, which can be evaluated via “differencing”, that is, computing the full tree likelihood at two nearby points and taking the difference:
$$\frac{\partial L(x)}{\partial x} \approx \frac{L(x+dx) - L(x)}{dx}$$
or we can compute the partial derivatives analytically...
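The differencing approach is easy to sketch for an arbitrary objective. In the example below, the function name, the step size `dx`, and the use of `std::function` are assumptions made for illustration; in the phylogenetic setting `L` would be the full tree likelihood as a function of the branch lengths and rate-matrix parameters, so each gradient costs one extra likelihood evaluation per parameter.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Forward-difference approximation of the gradient of L at the point x:
// dL/dx_i is approximated by (L(x + dx * e_i) - L(x)) / dx.
std::vector<double> differenceGradient(
        const std::function<double(const std::vector<double>&)>& L,
        std::vector<double> x, double dx = 1e-6) {
    assert(dx > 0.0);
    const double base = L(x);              // L(x), shared by every component
    std::vector<double> grad(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double saved = x[i];
        x[i] = saved + dx;                 // perturb one coordinate
        grad[i] = (L(x) - base) / dx;
        x[i] = saved;                      // restore before the next one
    }
    return grad;
}
```

The choice of `dx` trades truncation error (too large) against rounding error (too small), which is one reason the analytic derivatives can be preferable.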
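Likelihood Gradient
$$\frac{\partial L_u(a)}{\partial t_{x,y}} = \begin{cases} 0 & \text{for a leaf} \\ f_{=} & \text{for internal } u = x \\ f_{\neq} & \text{for internal } u \neq x \end{cases}$$
$$\frac{\partial P(t_{x,y})}{\partial t_{x,y}} = P(t_{x,y})\,Q$$
$$f_{=} = \left( \sum_{b\in\alpha} L_y(b)\,\frac{\partial P_{a,b}(t_{x,y})}{\partial t_{x,y}} \right) \left( \sum_{b\in\alpha} L_{\mathrm{other}(y)}(b)\,P_{a,b}(t_{x,\mathrm{other}(y)}) \right)$$
$$\begin{aligned}
f_{\neq} ={}& \left( \sum_{b\in\alpha} L_{\mathrm{left}}(b)\,P_{a,b}(t_{u,\mathrm{left}}) \right) \left( \sum_{b\in\alpha} \frac{\partial L_{\mathrm{right}}(b)}{\partial t_{x,y}}\,P_{a,b}(t_{u,\mathrm{right}}) \right) \\
&+ \left( \sum_{b\in\alpha} L_{\mathrm{right}}(b)\,P_{a,b}(t_{u,\mathrm{right}}) \right) \left( \sum_{b\in\alpha} \frac{\partial L_{\mathrm{left}}(b)}{\partial t_{x,y}}\,P_{a,b}(t_{u,\mathrm{left}}) \right)
\end{aligned}$$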
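For a parameter $\pi$ of $Q$:
$$\frac{\partial P(t)}{\partial \pi} = G \left[ F \circ \left( G^{-1}\,\frac{\partial Q}{\partial \pi}\,G \right) \right] G^{-1}$$
$$F = [f_{a,b}], \qquad f_{a,b} = \begin{cases} t\,e^{\lambda_a t} & \text{if } \lambda_a = \lambda_b \\ \dfrac{e^{\lambda_a t} - e^{\lambda_b t}}{\lambda_a - \lambda_b} & \text{otherwise} \end{cases}$$
$$\frac{\partial L_u(a)}{\partial \pi} = \begin{cases} 0 & \text{for a leaf} \\ g_{\mathrm{int}} & \text{for internal } u \end{cases}$$
$$\begin{aligned}
g_{\mathrm{int}} ={}& \left( \sum_{b\in\alpha} L_{\mathrm{left}}(b)\,P_{a,b}(t_{u,\mathrm{left}}) \right) \sum_{b\in\alpha} \left( \frac{\partial L_{\mathrm{right}}(b)}{\partial \pi}\,P_{a,b}(t_{u,\mathrm{right}}) + L_{\mathrm{right}}(b)\,\frac{\partial P_{a,b}(t_{u,\mathrm{right}})}{\partial \pi} \right) \\
&+ \left( \sum_{b\in\alpha} L_{\mathrm{right}}(b)\,P_{a,b}(t_{u,\mathrm{right}}) \right) \sum_{b\in\alpha} \left( \frac{\partial L_{\mathrm{left}}(b)}{\partial \pi}\,P_{a,b}(t_{u,\mathrm{left}}) + L_{\mathrm{left}}(b)\,\frac{\partial P_{a,b}(t_{u,\mathrm{left}})}{\partial \pi} \right)
\end{aligned}$$
$$\frac{\partial L}{\partial x}(\text{all columns}) = \sum_{\text{columns } i} \left( \prod_{\text{columns } j \neq i} L(\mathrm{col}_j) \right) \frac{\partial L}{\partial x}(\mathrm{col}_i)$$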
Gradient Ascent with the GSL
// The GSL:: classes below belong to the C++ wrapper used on this slide;
// its header is not shown here.
#include <iostream>
using namespace std;

// Objective f(x) = (x-3)^2 together with its analytic gradient.
struct MyObjective : public GSL::ObjectiveFunction
{
  virtual double f(const GSL::Vector &currentPoint) {
    double x=currentPoint[0];
    return (x-3)*(x-3);
  }
  virtual void gradient(const GSL::Vector &currentPoint,GSL::Vector &gradient) {
    double x=currentPoint[0];
    gradient[0]=2*(x-3);  // d/dx of (x-3)^2
  }
};

// Start the search at x=0 and run BFGS.
GSL::Vector initialPoint(1);
initialPoint.setAllTo(0);
MyObjective f;
GSL::Optimizer optimizer(GSL::BFGS,f,initialPoint,0.01,GSL::BY_EITHER,0.01,100);
optimizer.run();
cout<<"optimal point: "<<optimizer.getOptimalPoint()<<endl;
cout<<"took "<<optimizer.iterationsUsed()<<" iterations"<<endl;
Classifying Coding vs. Noncoding DNA
[Figure: classification accuracy (y-axis) versus noncoding mutation rate (x-axis)]
Figure 9.24: Classification accuracy (y-axis, percentages) of a 0th-order PhyloHMM for an exon identification task. Equal numbers of coding and noncoding segments were independently evolved over a simulated phylogeny, using a Jukes-Cantor model (section 9.6.4) with variable substitution rates, and then classified via likelihood ratio: P(S|coding)/P(S|noncoding). Coding substitution rate was fixed at 5%; noncoding substitution rate was varied from 6% to 80% (x-axis). Increasing the noncoding substitution rate relative to the coding rate quickly enabled the PhyloHMM to achieve reliable discrimination between coding and noncoding elements.
Modeling Sequential Dependencies
Probabilities are now conditional on some number of symbols in preceding columns.
One rate matrix is used per pair of contexts, $Q_{AA,AA}, Q_{AA,AC}, \ldots, Q_{TT,TT}$ ($4^{2n}$ matrices), each of the general reversible form:
$$Q_{AA,AA} = \begin{bmatrix} - & \beta\pi_C & \alpha\pi_G & \chi\pi_T \\ \beta\pi_A & - & \kappa\pi_G & \omega\pi_T \\ \alpha\pi_A & \kappa\pi_C & - & \tau\pi_T \\ \chi\pi_A & \omega\pi_C & \tau\pi_G & - \end{bmatrix}$$
2nd order REV: 1536 free parameters
[Figure: a phylogeny whose vertices are labeled with trinucleotides (observed: AAC, ATG; unobserved: ?), annotated with the terms below]
$$\sum_{XYZ} P(Z \mid XY, AAC) \qquad \sum_{UVW} P(W \mid UV, XYZ) \qquad \sum_{JKL} P(L \mid JK, UVW) \qquad P(G \mid AT, JKL)$$
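One way to arrive at the figure of 1536 is sketched below. It assumes that each context-pair matrix carries its own six REV rate parameters (α, β, κ, χ, ω, τ) and that the base frequencies are not counted as free; the slide does not spell out the accounting, so this is an interpretation rather than the author's stated derivation.

```latex
\begin{aligned}
\text{number of matrices} &= 4^{2n}\Big|_{n=2} = 4^{4} = 256 \\
\text{free parameters} &= 256 \times 6 = 1536
\end{aligned}
```

The count grows by a factor of 16 with each additional order, which is why parameter tying or lower-order models are attractive when training data are limited.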
Summary
• Sequence conservation patterns can differ between genomic feature types, due to the effects of natural selection, and these biases can be used to improve prediction accuracy
• PhyloHMM’s model conservation patterns in multi-sequence alignments via substitution rate matrices evaluated over a phylogeny
• The likelihood of a column in an alignment, given an evolutionary model, can be evaluated via Felsenstein’s pruning algorithm
• Dependences between columns can be modeled using higher-order rate matrices