Hidden Markov Models
The three main questions on HMMs
1. Evaluation
GIVEN an HMM M, and a sequence x,
FIND Prob[ x | M ]
2. Decoding
GIVEN an HMM M, and a sequence x,
FIND the sequence of states π that maximizes P[ x, π | M ]
3. Learning
GIVEN an HMM M, with unspecified transition/emission probabilities, and a sequence x,
FIND parameters θ = (ei(.), aij) that maximize P[ x | θ ]
Decoding
GIVEN x = x1x2…xN
We want to find π = π1, …, πN, such that P[ x, π ] is maximized:
π* = argmaxπ P[ x, π ]
We can use dynamic programming!
Let Vk(i) = max{π1,…,πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ]
          = probability of the most likely sequence of states ending at state πi = k
[Figure: Viterbi trellis – one column of states 1, 2, …, K per position x1, x2, x3, …; every state in column i-1 connects to every state in column i]
The Viterbi Algorithm
Similar to “aligning” a set of states to a sequence
Time: O(K²N)
Space: O(KN)
[Figure: K×N dynamic-programming table – rows are states 1, 2, …, K, columns are positions x1 x2 x3 … xN; cell (j, i) holds Vj(i)]
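For concreteness, here is a minimal log-space Viterbi sketch in Python. It is an illustration, not the lecture's reference implementation: the function name viterbi and the dictionary parameters start_p, trans_p, emit_p are choices made here, the begin state is folded into start_p, there is no explicit end state, and all probabilities are assumed nonzero (add floors or pseudocounts otherwise).

import math

def viterbi(x, states, start_p, trans_p, emit_p):
    """Most likely state path for x, computed with sums of logs to avoid underflow."""
    # Initialization: V_k(1) = log a_0k + log e_k(x1)
    V = [{k: math.log(start_p[k]) + math.log(emit_p[k][x[0]]) for k in states}]
    back = [{}]
    # Iteration: V_l(i) = log e_l(xi) + max_k [ V_k(i-1) + log a_kl ]
    for i in range(1, len(x)):
        V.append({})
        back.append({})
        for l in states:
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(trans_p[k][l]))
            V[i][l] = math.log(emit_p[l][x[i]]) + V[i - 1][best_k] + math.log(trans_p[best_k][l])
            back[i][l] = best_k
    # Termination and traceback: follow the stored back-pointers from the best final state
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), V[-1][last]

On a two-state fair/loaded casino model, for example, this returns an F/L labeling of the rolls together with log P(x, π*).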
Evaluation
We demonstrated algorithms that allow us to compute:
P(x) Probability of x given the model
P(xi…xj) Probability of a substring of x given the model
P(πi = k | x) Probability that the i-th state is k, given x
A more refined measure of which states x may be in
Motivation for the Backward Algorithm
We want to compute
P(πi = k | x),
the probability distribution of the i-th state, given x
We start by computing
P(πi = k, x) = P(x1…xi, πi = k, xi+1…xN)
             = P(x1…xi, πi = k) P(xi+1…xN | x1…xi, πi = k)
             = P(x1…xi, πi = k) P(xi+1…xN | πi = k)
Then, P(πi = k | x) = P(πi = k, x) / P(x)
The first factor is the forward probability fk(i); the second is the backward probability bk(i).
The Backward Algorithm – derivation
Define the backward probability:
bk(i) = P(xi+1…xN | πi = k)
      = Σπi+1…πN P(xi+1, xi+2, …, xN, πi+1, …, πN | πi = k)
      = Σl Σπi+2…πN P(xi+1, xi+2, …, xN, πi+1 = l, πi+2, …, πN | πi = k)
      = Σl el(xi+1) akl Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = l)
      = Σl el(xi+1) akl bl(i+1)
The Backward Algorithm
We can compute bk(i) for all k, i, using dynamic programming
Initialization:
bk(N) = ak0, for all k
Iteration:
bk(i) = Σl el(xi+1) akl bl(i+1)
Termination:
P(x) = Σl a0l el(x1) bl(1)
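A short Python sketch of this recursion, to parallel the Viterbi sketch above (the parameter names and the separate end_p dictionary holding the ak0 transitions are assumptions of this sketch):

def backward(x, states, start_p, trans_p, emit_p, end_p):
    """Backward probabilities b_k(i) = P(x_{i+1}..x_N | pi_i = k), plus P(x)."""
    N = len(x)
    b = [{} for _ in range(N)]
    # Initialization: b_k(N) = a_k0 (transition to the end state)
    for k in states:
        b[N - 1][k] = end_p[k]
    # Iteration: b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)
    for i in range(N - 2, -1, -1):
        for k in states:
            b[i][k] = sum(emit_p[l][x[i + 1]] * trans_p[k][l] * b[i + 1][l] for l in states)
    # Termination: P(x) = sum_l a_0l e_l(x1) b_l(1)
    px = sum(start_p[l] * emit_p[l][x[0]] * b[0][l] for l in states)
    return b, px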
Computational Complexity
What are the running time and space required for Forward and Backward?
Time: O(K2N)
Space: O(KN)
Useful implementation technique to avoid underflows
Viterbi: sum of logs
Forward/Backward: rescaling at each position by multiplying by a constant
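As a sketch of the rescaling idea for the Forward algorithm (variable names are mine, and the end-state factor ak0 is omitted here for brevity): divide each column of forward values by its sum, and accumulate the logarithms of the scaling factors to recover log P(x) without ever storing a vanishingly small number.

import math

def forward_scaled(x, states, start_p, trans_p, emit_p):
    """Forward algorithm with rescaling at each position; returns log P(x)."""
    # Position 1, then rescale so the column sums to 1
    f = {k: start_p[k] * emit_p[k][x[0]] for k in states}
    scale = sum(f.values())
    f = {k: v / scale for k, v in f.items()}
    log_px = math.log(scale)
    # Positions 2..N: usual forward update, followed by the same rescaling
    for i in range(1, len(x)):
        f = {l: emit_p[l][x[i]] * sum(f[k] * trans_p[k][l] for k in states) for l in states}
        scale = sum(f.values())
        f = {k: v / scale for k, v in f.items()}
        log_px += math.log(scale)
    return log_px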
Viterbi, Forward, Backward
VITERBI
Initialization:
V0(0) = 1
Vk(0) = 0, for all k > 0
Iteration:
Vl(i) = el(xi) maxk Vk(i-1) akl
Termination:
P(x, π*) = maxk Vk(N)
FORWARD
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fl(i) = el(xi) Σk fk(i-1) akl
Termination:
P(x) = Σk fk(N) ak0
BACKWARD
Initialization:
bk(N) = ak0, for all k
Iteration:
bk(i) = Σl el(xi+1) akl bl(i+1)
Termination:
P(x) = Σk a0k ek(x1) bk(1)
Posterior Decoding
We can now calculate
P(πi = k | x) = fk(i) bk(i) / P(x)
Then, we can ask
What is the most likely state at position i of sequence x:
Define π^ by Posterior Decoding:
π^i = argmaxk P(πi = k | x)
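A sketch of posterior decoding that reuses the backward() sketch above, together with an analogous unscaled forward table (illustrative names again; for long sequences the rescaled versions from the previous slide would be needed to avoid underflow):

def forward_table(x, states, start_p, trans_p, emit_p, end_p):
    """Unscaled forward probabilities f_k(i) = P(x_1..x_i, pi_i = k), plus P(x)."""
    N = len(x)
    f = [{} for _ in range(N)]
    for k in states:
        f[0][k] = start_p[k] * emit_p[k][x[0]]
    for i in range(1, N):
        for l in states:
            f[i][l] = emit_p[l][x[i]] * sum(f[i - 1][k] * trans_p[k][l] for k in states)
    px = sum(f[N - 1][k] * end_p[k] for k in states)
    return f, px

def posterior_decode(x, states, start_p, trans_p, emit_p, end_p):
    """pi^_i = argmax_k P(pi_i = k | x) = argmax_k f_k(i) b_k(i) / P(x)."""
    f, _ = forward_table(x, states, start_p, trans_p, emit_p, end_p)
    b, _ = backward(x, states, start_p, trans_p, emit_p, end_p)
    # Dividing by P(x) is a common factor at every position, so it does not affect the argmax
    return [max(states, key=lambda k: f[i][k] * b[i][k]) for i in range(len(x))]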
Posterior Decoding
• For each state k,
Posterior Decoding gives a curve of the likelihood of that state at each position
That is sometimes more informative than the Viterbi path π*
• Posterior Decoding may give an invalid sequence of states
Why?
Posterior Decoding
• P(πi = k | x) = Σπ P(π | x) 1(πi = k)
                = Σ{π : πi = k} P(π | x)
where 1(A) = 1 if A is true, 0 otherwise
[Figure: the posterior probability P(πi = k | x) traced as a curve across positions x1 x2 x3 … xN for a fixed state k]
Posterior Decoding
• Example: How do we compute P(πi = k, πj = l | x), for i < j?
P(πi = k, πj = l | x) = fk(i) · P(xi+1…xj, πj = l | πi = k) · bl(j) / P(x)
In particular, for adjacent positions (j = i+1):
P(πi = k, πi+1 = l | x) = fk(i) akl el(xi+1) bl(i+1) / P(x)
[Figure: forward mass fk(i) up to position i in state k, and backward mass bl(j) from position j in state l, across positions x1 x2 x3 … xN]
A Modeling Example
CpG islands in DNA sequences
Example: CpG Islands
CpG nucleotides in the genome are frequently methylated
(We write "CpG" to avoid confusion with the CG base pair)
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands
Example: CpG Islands
• In CpG islands,
CG is more frequent
Other pairs (AA, AG, AT…) have different frequencies
Question: Detect CpG islands computationally
A model of CpG Islands – (1) Architecture
CpG island states:     A+ C+ G+ T+
Not CpG island states: A- C- G- T-
A model of CpG Islands – (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0 (state X+ or X- emits nucleotide X with probability 1, everything else with probability 0)
1. Transition probabilities within CpG islands:
established from many known (experimentally verified) CpG islands (training set)
2. Transition probabilities within other regions:
established from many known non-CpG-island regions
+    A     C     G     T
A  .180  .274  .426  .120
C  .171  .368  .274  .188
G  .161  .339  .375  .125
T  .079  .355  .384  .182

-    A     C     G     T
A  .300  .205  .285  .210
C  .322  .298  .078  .302
G  .248  .246  .298  .208
T  .177  .239  .292  .292
Log Likelihoods – Telling "Prediction" from "Random"
Another way to see the effect of the transitions: log-likelihood ratios
L(u, v) = log[ P(uv | +) / P(uv | -) ]

       A       C       G       T
A  -0.740  +0.419  +0.580  -0.803
C  -0.913  +0.302  +1.812  -0.685
G  -0.624  +0.461  +0.331  -0.730
T  -1.169  +0.573  +0.393  -0.679

Given a region x = x1…xN,
a quick-&-dirty way to decide whether the entire x is a CpG island:
P(x is CpG) > P(x is not CpG)  ⇔  Σi L(xi, xi+1) > 0
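A small Python sketch of this quick-&-dirty classifier, using the + and - transition tables from the previous slide (the dictionary layout and the function name cpg_score are choices made here):

import math

# Transition tables from the slides: + = inside a CpG island, - = outside
PLUS = {
    'A': {'A': .180, 'C': .274, 'G': .426, 'T': .120},
    'C': {'A': .171, 'C': .368, 'G': .274, 'T': .188},
    'G': {'A': .161, 'C': .339, 'G': .375, 'T': .125},
    'T': {'A': .079, 'C': .355, 'G': .384, 'T': .182},
}
MINUS = {
    'A': {'A': .300, 'C': .205, 'G': .285, 'T': .210},
    'C': {'A': .322, 'C': .298, 'G': .078, 'T': .302},
    'G': {'A': .248, 'C': .246, 'G': .298, 'T': .208},
    'T': {'A': .177, 'C': .239, 'G': .292, 'T': .292},
}

def cpg_score(x):
    """Sum of log-likelihood ratios L(xi, xi+1); > 0 suggests x looks like a CpG island."""
    return sum(math.log(PLUS[u][v] / MINUS[u][v]) for u, v in zip(x, x[1:]))

For instance, cpg_score("CGCGCGCG") comes out strongly positive, while cpg_score("ATATATAT") comes out strongly negative.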
A model of CpG Islands – (2) Transitions
• What about transitions between (+) and (-) states?
• They affect
the avg. length of a CpG island
the avg. separation between two CpG islands
[Figure: two states X and Y; X stays in X with probability p and moves to Y with probability 1-p; Y stays in Y with probability q and moves to X with probability 1-q]
Length distribution of region X:
P[lX = 1] = 1 - p
P[lX = 2] = p(1 - p)
…
P[lX = k] = p^(k-1) (1 - p)
E[lX] = 1/(1 - p)
Geometric distribution, with mean 1/(1-p)
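As a quick worked example of this mean (my own numbers, not from the slides): with p = 0.99,
E[lX] = Σk≥1 k · p^(k-1) (1 - p) = 1/(1 - p) = 1/0.01 = 100,
so runs of X average 100 positions, even though length 1 remains the single most probable outcome.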
[Figure: the eight-state model A+ C+ G+ T+ / A- C- G- T-, with total probability 1-p of leaving the (+) group at each step]
A model of CpG Islands – (2) Transitions
No reason to favor exiting/entering (+) and (-) regions at a particular nucleotide
To determine transition probabilities between (+) and (-) states
1. Estimate the average length of a CpG island: lCpG = 1/(1-p), so p = 1 - 1/lCpG
2. For each pair of (+) states k, l, let akl ← p × akl
3. For each (+) state k and (-) state l, let akl = (1-p)/4 (better: take the frequency of l in the (-) regions into account)
4. Do the same for (-) states
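A sketch of steps 1-4 in Python, reusing the PLUS and MINUS tables above (the function name, the choice to parameterize by separate stay-probabilities p and q, and the example lengths at the bottom are illustrative assumptions):

def combined_transitions(p, q):
    """Transitions over the 8 states A+..T+, A-..T-.
    p = probability of staying inside a CpG island at each step,
    q = probability of staying outside."""
    plus_states = [b + '+' for b in 'ACGT']
    minus_states = [b + '-' for b in 'ACGT']
    a = {}
    for k in plus_states:
        a[k] = {}
        for l in plus_states:            # step 2: scale within-island transitions by p
            a[k][l] = p * PLUS[k[0]][l[0]]
        for l in minus_states:           # step 3: spread the exit probability 1-p uniformly
            a[k][l] = (1 - p) / 4
    for k in minus_states:
        a[k] = {}
        for l in minus_states:           # step 4: same scaling for the (-) states, with q
            a[k][l] = q * MINUS[k[0]][l[0]]
        for l in plus_states:
            a[k][l] = (1 - q) / 4
    return a

# Example (made-up lengths): islands averaging 300 bp, separated by ~100,000 bp on average
a = combined_transitions(p=1 - 1/300, q=1 - 1/100000)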
A problem with this model: CpG islands don’t have exponential length distribution
This is a defect of HMMs – compensated for by their ease of analysis & computation
Applications of the model
Given a DNA region x,
The Viterbi algorithm predicts locations of CpG islands
Given a nucleotide xi (say xi = A),
the Viterbi parse tells whether xi is in a CpG island in the most likely general scenario
The Forward/Backward algorithms can calculate
P(xi is in a CpG island) = P(πi = A+ | x)
Posterior Decoding can assign locally optimal predictions of CpG islands:
π^i = argmaxk P(πi = k | x)
What if a new genome comes?
• We just sequenced the porcupine genome
• We know CpG islands play the same role in this genome
• However, we have no known CpG islands for porcupines
• We suspect the frequency and characteristics of CpG islands are quite different in porcupines
How do we adjust the parameters in our model?
LEARNING
Problem 3: Learning
Re-estimate the parameters of the model based on training data
Two learning scenarios
1. Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player lets us observe him one evening as he changes dice and produces 10,000 rolls
2. Estimation when the “right answer” is unknown
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x | θ)
1. When the right answer is known
Given x = x1…xN
for which the true path π = π1…πN is known,
Define:
Akl = # times the k→l transition occurs in π
Ek(b) = # times state k in π emits b in x
We can show that the maximum likelihood parameters (maximizing P(x | θ)) are:
akl = Akl / Σi Aki
ek(b) = Ek(b) / Σc Ek(c)
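A counting sketch of these maximum-likelihood estimates in Python (function and variable names are mine; it assumes x and π have the same length, and adds no pseudocounts yet):

from collections import defaultdict

def ml_estimate(x, pi, states, alphabet):
    """Maximum-likelihood a_kl and e_k(b) from a labeled training pair (x, pi), by counting."""
    A = {k: defaultdict(float) for k in states}   # A[k][l]: # of k->l transitions in pi
    E = {k: defaultdict(float) for k in states}   # E[k][b]: # of times state k emits b
    for i, (k, b) in enumerate(zip(pi, x)):
        E[k][b] += 1
        if i + 1 < len(pi):
            A[k][pi[i + 1]] += 1
    a, e = {}, {}
    for k in states:
        ta = sum(A[k].values()) or 1.0            # guard: a state that never occurs gets all zeros
        te = sum(E[k].values()) or 1.0
        a[k] = {l: A[k][l] / ta for l in states}
        e[k] = {b: E[k][b] / te for b in alphabet}
    return a, e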
1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of the transitions & emissions that occur in the training data
Drawback: given little data, there may be overfitting:
P(x | θ) is maximized, but θ is unreasonable
0 probabilities – VERY BAD
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then:
aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2; eF(2) = .3; eF(4) = 0; eF(5) = .1
Pseudocounts
Solution for small training sets:
Add pseudocounts
Akl = # times the k→l transition occurs in π, + rkl
Ek(b) = # times state k in π emits b in x, + rk(b)
rkl, rk(b) are pseudocounts representing our prior belief
Larger pseudocounts → strong prior belief
Small pseudocounts (ε < 1): just to avoid 0 probabilities
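The same counting sketch as ml_estimate above, but with the counts started at the pseudocounts instead of zero (again an illustration; r_trans and r_emit correspond to rkl and rk(b)):

def ml_estimate_pseudo(x, pi, states, alphabet, r_trans, r_emit):
    """ML estimation with pseudocounts added to the raw transition/emission counts."""
    A = {k: {l: r_trans[k][l] for l in states} for k in states}
    E = {k: {b: r_emit[k][b] for b in alphabet} for k in states}
    for i, (k, b) in enumerate(zip(pi, x)):
        E[k][b] += 1
        if i + 1 < len(pi):
            A[k][pi[i + 1]] += 1
    # Positive pseudocounts keep every denominator nonzero, so no probability is exactly 0
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e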
Pseudocounts
Example: dishonest casino
We will observe the player for one day: 600 rolls
Reasonable pseudocounts:
r0F = r0L = rF0 = rL0 = 1;
rFL = rLF = rFF = rLL = 1;
rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)
rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)
Above #s pretty arbitrary – assigning priors is an art
2. When the right answer is unknown
We don’t know the true Akl, Ek(b)
Idea:
• We estimate our “best guess” on what Akl, Ek(b) are
• We update the parameters of the model, based on our guess
• We repeat
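This iterate-and-re-estimate idea is the basis of the Baum-Welch algorithm. As a rough sketch of one iteration in Python (reusing the forward_table and backward sketches above, again without scaling, and ignoring re-estimation of the start/end transitions):

def baum_welch_step(x, states, alphabet, start_p, trans_p, emit_p, end_p):
    """One 'guess the counts, then re-estimate' iteration using posterior probabilities."""
    f, px = forward_table(x, states, start_p, trans_p, emit_p, end_p)
    b, _ = backward(x, states, start_p, trans_p, emit_p, end_p)
    N = len(x)
    # Expected counts under the current parameters
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in alphabet} for k in states}
    for i in range(N):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k] / px                 # P(pi_i = k | x)
            if i + 1 < N:
                for l in states:                                 # P(pi_i = k, pi_{i+1} = l | x)
                    A[k][l] += f[i][k] * trans_p[k][l] * emit_p[l][x[i + 1]] * b[i + 1][l] / px
    # Re-estimate the parameters from the expected counts
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return new_trans, new_emit

Repeating this step never decreases P(x | θ); one stops when the likelihood no longer improves.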