Post on 19-Dec-2015
HMM for CpG Islands
Parameter Estimation For HMM
Maximum Likelihood and the Information Inequality
Lecture #7
Background Readings: Chapter 3.3 in the textbook Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Danny Geiger and Nir Friedman.
HMM for CpG Islands
Reminder: Hidden Markov Model

Hidden states S1, S2, …, SL-1, SL emit the observed symbols x1, x2, …, xL-1, xL. The joint probability of a state path s and an output sequence x is

$$p(s, x) = p(s_1,\dots,s_L;\ x_1,\dots,x_L) = \prod_{i=1}^{L} m_{s_{i-1} s_i}\, e_{s_i}(x_i)$$

where m_kl are the transition probabilities (m_{s_0 s_1} standing for the initial-state probability) and e_k(b) are the emission probabilities.

Next we apply HMMs to the question of recognizing CpG islands.
Hidden Markov Model for CpG Islands

The states: Domain(Si) = {+, −} × {A, C, G, T} (8 values).

In this representation P(xi | si) = 0 or 1, depending on whether xi is consistent with si. E.g., xi = G is consistent with si = (+, G) and with si = (−, G), but not with any other state of si.

[Figure: the eight states A±, C±, G±, T±, each deterministically emitting its own letter.]
Reminder: Most Probable State Path

Given an output sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x):

$$s^* = (s_1^*,\dots,s_L^*) = \operatorname*{argmax}_{(s_1,\dots,s_L)} p(s_1,\dots,s_L \mid x_1,\dots,x_L)$$
Predicting CpG islands via a most probable path:

Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "−" states and 4 "+" states, one of each sign for each letter (8 states total). A most probable path (found by Viterbi's algorithm) predicts CpG islands: the maximal runs of "+" states along s*.

Experiments (Durbin et al., pp. 60-61) show that the predicted islands are shorter than the annotated ones. In addition, quite a few "false negatives" are found.
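The most probable path can be computed with Viterbi's algorithm in log space. The sketch below builds the 8-state "±" construction just described; the transition numbers are invented for illustration (a real CpG model would also condition the next letter on the previous one), so only the construction, not the parameter values, follows the slides.

```python
import math

def _log(p):
    # log of a probability, with log(0) treated as -infinity
    return math.log(p) if p > 0 else float("-inf")

def viterbi(x, states, init, m, e):
    """Most probable state path s* = argmax_s p(s, x), computed in log space."""
    V = [{k: _log(init[k]) + _log(e[k][x[0]]) for k in states}]
    back = []
    for xi in x[1:]:
        prev, col, ptr = V[-1], {}, {}
        for l in states:
            best = max(states, key=lambda k: prev[k] + _log(m[k][l]))
            ptr[l] = best
            col[l] = prev[best] + _log(m[best][l]) + _log(e[l][xi])
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# The 8 states (sign, letter); each emits its own letter with probability 1.
states = [(s, b) for s in "+-" for b in "ACGT"]
e = {(s, b): {c: 1.0 if c == b else 0.0 for c in "ACGT"} for s, b in states}
init = {k: 1.0 / 8 for k in states}
# Invented parameters: sticky signs; "+" favors C/G, "-" favors A/T.
sign_m = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
letter = {"+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
          "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
m = {(s1, b1): {(s2, b2): sign_m[s1][s2] * letter[s2][b2]
                for s2, b2 in states} for s1, b1 in states}

path = viterbi("CGCGCGAAATTT", states, init, m, e)
signs = "".join(s for s, b in path)  # maximal '+' runs are the predicted islands
```

On this toy input the decoded signs mark the CG-rich prefix as a "+" run, exactly the island-calling rule of the slide.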
Reminder: Most Probable State

Given an output sequence x = (x1,…,xL), si is a most probable state (at location i) if si = argmax_k p(Si = k | x), where

$$p(S_i = k \mid x) = \frac{p(S_i = k,\ x)}{p(x)} \ \propto\ p(S_i = k,\ x)$$
Finding the probability that a letter is in a CpG island via the algorithm for most probable state:

The probability that the occurrence of G in the i-th location is in a CpG island (a "+" state) is

$$\sum_{s^+} p(S_i = s^+ \mid x) \ =\ \frac{1}{p(x)} \sum_{s^+} F(S_i = s^+)\, B(S_i = s^+)$$

where F and B are the forward and backward variables. The summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
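A minimal sketch of computing these posteriors from the forward and backward variables. The two-state model and its numbers are invented for illustration; as a sanity check, the F·B/p(x) posterior is compared against brute-force enumeration of all state paths.

```python
import itertools
import math

def forward(x, states, init, m, e):
    F = [{k: init[k] * e[k][x[0]] for k in states}]
    for xi in x[1:]:
        F.append({l: e[l][xi] * sum(F[-1][k] * m[k][l] for k in states)
                  for l in states})
    return F

def backward(x, states, m, e):
    B = [{k: 1.0 for k in states}]
    for xi in reversed(x[1:]):
        B.insert(0, {k: sum(m[k][l] * e[l][xi] * B[0][l] for l in states)
                     for k in states})
    return B

def posteriors(x, states, init, m, e):
    # p(S_i = k | x) = F_i(k) * B_i(k) / p(x)
    F, B = forward(x, states, init, m, e), backward(x, states, m, e)
    px = sum(F[-1][k] for k in states)
    return [{k: F[i][k] * B[i][k] / px for k in states} for i in range(len(x))]

def brute_posterior(x, states, init, m, e, i, k):
    # direct enumeration of all state paths, for checking only
    num = den = 0.0
    for path in itertools.product(states, repeat=len(x)):
        p = init[path[0]] * e[path[0]][x[0]]
        for t in range(1, len(x)):
            p *= m[path[t - 1]][path[t]] * e[path[t]][x[t]]
        den += p
        if path[i] == k:
            num += p
    return num / den

# Invented two-state "+/-" model over the DNA alphabet.
states = ["+", "-"]
init = {"+": 0.5, "-": 0.5}
m = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.2, "-": 0.8}}
e = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
post = posteriors("GCGA", states, init, m, e)
```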
Parameter Estimation for HMM
Defining the Parameters

An HMM model is defined by the parameters mkl and ek(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters:

$$\theta = \{\, m_{kl} : k, l \text{ are states} \,\} \cup \{\, e_k(b) : k \text{ is a state},\ b \text{ is a letter} \,\}$$
Training Sets

To determine the values of (the parameters in) θ, we use a training set {x¹,...,xⁿ}, where each x^j is a sequence which is assumed to fit the model. Given the parameters θ, each sequence x^j has an assigned probability p(x^j | θ).
Maximum Likelihood Parameter Estimation for HMM

The elements of the training set {x¹,...,xⁿ} are assumed to be independent, so p(x¹,...,xⁿ | θ) = ∏_j p(x^j | θ).

ML parameter estimation looks for the θ which maximizes this product. The exact method for finding or approximating this θ depends on the nature of the training set used.
Data for HMM

The training set is characterized by:
1. For each x^j, the information on the states s^j_i (the symbols x^j_i are usually known).
2. Its size (the sum of the lengths of all the sequences).
Case 1: ML when Sequences are Fully Known

We know the complete structure of each sequence in the training set {x¹,...,xⁿ}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.

By the ML method, we look for parameters θ* which maximize the probability of the sample set:

$$p(x^1,\dots,x^n \mid \theta^*) = \max_{\theta}\ p(x^1,\dots,x^n \mid \theta)$$
Case 1: Sequences are Fully Known

For each x^j we have:

$$p(x^j \mid \theta) = \prod_{i=1}^{L_j} m_{s_{i-1} s_i}\, e_{s_i}(x_i)$$

Let Mkl = |{i : si-1 = k, si = l}| and Ek(b) = |{i : si = k, xi = b}| (counted in x^j). Then:

$$p(x^j \mid \theta) = \prod_{(k,l)} m_{kl}^{M_{kl}} \prod_{(k,b)} \big[e_k(b)\big]^{E_k(b)}$$
Case 1 (cont.)

By the independence of the x^j's, p(x¹,...,xⁿ | θ) = ∏_j p(x^j | θ). Thus, if Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:

$$p(x^1,\dots,x^n \mid \theta) = \prod_{(k,l)} m_{kl}^{M_{kl}} \prod_{(k,b)} \big[e_k(b)\big]^{E_k(b)}$$
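The counts Mkl and Ek(b) are all the likelihood needs. The sketch below checks this numerically on an invented toy model: the log-likelihood computed from the counts equals the sum of per-sequence log-probabilities (the initial-state factor is left out of both for brevity).

```python
import math
from collections import Counter

def transition_emission_counts(seqs):
    """seqs: list of (state_path, symbols) pairs with fully known states."""
    M, E = Counter(), Counter()
    for path, symbols in seqs:
        for k, l in zip(path, path[1:]):
            M[k, l] += 1          # transitions from k to l
        for k, b in zip(path, symbols):
            E[k, b] += 1          # emissions of b from state k
    return M, E

def loglik_from_counts(M, E, m, e):
    # log p(x^1..x^n | theta) = sum_kl M_kl log m_kl + sum_kb E_k(b) log e_k(b)
    return (sum(c * math.log(m[k][l]) for (k, l), c in M.items())
            + sum(c * math.log(e[k][b]) for (k, b), c in E.items()))

def loglik_direct(seqs, m, e):
    # the same quantity, accumulated term by term along each sequence
    total = 0.0
    for path, symbols in seqs:
        for k, l in zip(path, path[1:]):
            total += math.log(m[k][l])
        for k, b in zip(path, symbols):
            total += math.log(e[k][b])
    return total

# Invented toy parameters and labeled training sequences.
m = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.2, "-": 0.8}}
e = {"+": {"H": 0.7, "T": 0.3}, "-": {"H": 0.4, "T": 0.6}}
seqs = [("++--", "HHTT"), ("+-+", "HTH")]
M, E = transition_emission_counts(seqs)
```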
Case 1 (cont.)

So we need to find mkl's and ek(b)'s which maximize:

$$\prod_{(k,l)} m_{kl}^{M_{kl}} \prod_{(k,b)} \big[e_k(b)\big]^{E_k(b)}$$

subject to: for all states k, ∑_l mkl = 1 and ∑_b ek(b) = 1 [and mkl, ek(b) ≥ 0].
Case 1 (cont.)

Rewriting, we need to maximize:

$$F = \prod_{(k,l)} m_{kl}^{M_{kl}} \prod_{(k,b)} \big[e_k(b)\big]^{E_k(b)} = \prod_k \Big[\prod_l m_{kl}^{M_{kl}}\Big] \cdot \prod_k \Big[\prod_b \big[e_k(b)\big]^{E_k(b)}\Big]$$

subject to: for all k, ∑_l mkl = 1 and ∑_b ek(b) = 1.
Case 1 (cont.)

If for each k we maximize $\prod_l m_{kl}^{M_{kl}}$ subject to ∑_l mkl = 1, and also $\prod_b [e_k(b)]^{E_k(b)}$ subject to ∑_b ek(b) = 1, then we also maximize F.

Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
ML Parameter Estimation for a Single Die
Defining the Problem

Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, ∑θi = 1. Assume that the data is one sequence:

Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)

So we have to maximize

$$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5\, \theta_6^2$$

subject to: θ1 + θ2 + θ3 + θ4 + θ5 + θ6 = 1 [and θi ≥ 0], i.e.

$$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5 \Big(1 - \sum_{i=1}^{5}\theta_i\Big)^2$$
Side Comment: Sufficient Statistics

To compute the probability of the data in the die example we only need to record the number of times Ni the die fell on side i (namely N1, N2,…,N6). We do not need to recall the entire sequence of outcomes.

$$P(Data \mid \theta) = \theta_1^{N_1}\, \theta_2^{N_2}\, \theta_3^{N_3}\, \theta_4^{N_4}\, \theta_5^{N_5} \Big(1 - \sum_{i=1}^{5}\theta_i\Big)^{N_6}$$

{Ni | i = 1,…,6} is called a sufficient statistic for the multinomial sampling.
Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data′:

s(Data) = s(Data′)  ⟹  P(Data | θ) = P(Data′ | θ)

Exercise: define a sufficient statistic for the HMM model.
Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:

$$\log P(Data \mid \theta) = \log\Big[\theta_1^{N_1}\theta_2^{N_2}\theta_3^{N_3}\theta_4^{N_4}\theta_5^{N_5}\Big(1-\sum_{i=1}^{5}\theta_i\Big)^{N_6}\Big] = \sum_{i=1}^{5} N_i \log\theta_i + N_6 \log\Big(1-\sum_{i=1}^{5}\theta_i\Big)$$

A necessary condition for a (local) maximum is:

$$\frac{\partial \log P(Data \mid \theta)}{\partial \theta_j} = \frac{N_j}{\theta_j} - \frac{N_6}{1-\sum_{i=1}^{5}\theta_i} = 0 \qquad (j = 1,\dots,5)$$
Finding the Maximum

Rearranging terms:

$$\frac{N_j}{\theta_j} = \frac{N_6}{1-\sum_{i=1}^{5}\theta_i} = \frac{N_6}{\theta_6} \quad\Longrightarrow\quad \theta_j = \frac{N_j}{N_6}\,\theta_6$$

Dividing the j-th equation by the i-th equation gives θj/θi = Nj/Ni. Summing θj = (Nj/N6) θ6 from j = 1 to 6:

$$1 = \sum_{j=1}^{6} \theta_j = \frac{\theta_6}{N_6} \sum_{j=1}^{6} N_j = \frac{\theta_6}{N_6}\, N \quad\Longrightarrow\quad \theta_6 = \frac{N_6}{N}$$

So there is only one local, and hence global, maximum. Hence the MLE is given by:

$$\theta_i = \frac{N_i}{N}, \qquad i = 1,\dots,6$$

(where N = N1 + … + N6 is the total number of tosses).
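The relative-frequency MLE can be checked numerically on the Data sequence from the earlier die example; the comparison against the uniform die simply illustrates that the relative frequencies attain a higher likelihood (function names here are ours, not from the slides).

```python
from collections import Counter

def die_mle(data, faces):
    # theta_i = N_i / N: the relative frequency of each face
    n = len(data)
    counts = Counter(data)
    return {f: counts[f] / n for f in faces}

def likelihood(theta, data):
    # P(Data | theta) = product of theta over the observed outcomes
    p = 1.0
    for outcome in data:
        p *= theta[outcome]
    return p

faces = [1, 2, 3, 4, 5, 6]
data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]   # the Data sequence from the slide
theta_hat = die_mle(data, faces)
uniform = {f: 1 / 6 for f in faces}
```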
Note: Fractional Exponents are Possible

Some models allow the Ni's to be fractions (e.g., if we are uncertain of a die outcome, we may consider it "6" with 20% confidence and "5" with 80% confidence). Our analysis didn't assume that the Ni are integers, so it also applies to fractional exponents.
Generalization to a Distribution with Any Number k of Outcomes

Let X be a random variable with k values x1,…,xk denoting the k outcomes of independent, identically distributed experiments, with parameters θ = {θ1, θ2,…,θk} (θi is the probability of xi). Again, the data is one sequence of length n, in which xi appears ni times. Then we have to maximize

$$P(Data \mid \theta) = \theta_1^{n_1}\theta_2^{n_2}\cdots\theta_k^{n_k}, \qquad n = n_1 + n_2 + \dots + n_k$$

subject to: θ1 + θ2 + … + θk = 1, i.e.

$$P(Data \mid \theta) = \theta_1^{n_1}\cdots\theta_{k-1}^{n_{k-1}}\Big(1 - \sum_{i=1}^{k-1}\theta_i\Big)^{n_k}$$
Generalization for k Outcomes (cont.)

By treatment identical to the die case, the maximum is obtained when for all i:

$$\frac{\theta_i}{n_i} = \frac{\theta_k}{n_k}$$

Hence the MLE is given by the relative frequencies:

$$\theta_i = \frac{n_i}{n}, \qquad i = 1,\dots,k$$
ML for a Single Die, Normalized Version

Consider two experiments with a 3-sided die:
1. 10 tosses: 2 × x1, 3 × x2, 5 × x3.
2. 1000 tosses: 200 × x1, 300 × x2, 500 × x3.

Clearly, both imply the same ML parameters. In general, when formulating ML for a single die, we can ignore the actual number n of tosses and just use the fraction of each outcome.
Normalized Version of ML (cont.)

Thus we can replace the numbers of outcomes ni by pi = ni/n, and get the following normalized setting of the ML problem for a single die:

Given positive numbers p1,…,pk s.t. p1 + p2 + … + pk = 1, find parameters θ1,…,θk which maximize:

$$P(Data \mid \theta) = \theta_1^{p_1}\theta_2^{p_2}\cdots\theta_k^{p_k}$$

And the same analysis yields that a maximum is obtained when:

$$\theta_i = p_i, \qquad i = 1,\dots,k$$
Implication: The Kullback-Leibler Information Inequality
Rephrasing the ML Inequality

We can rephrase the "ML for a single die" result:

Let P = (p1,…,pk) be a probability distribution over a k-set. For any other such distribution Q = (q1,…,qk), let the likelihood be

$$P(Data \mid Q) = q_1^{p_1} q_2^{p_2}\cdots q_k^{p_k}$$

Then P(Data | Q) is maximized only when Q = P.

Taking logarithms, we get:

Let P = (p1,…,pk) be a probability distribution over a k-set. For any other such distribution Q = (q1,…,qk), consider the sum

$$R(Q) = \log P(Data \mid Q) = \sum_i p_i \log q_i$$

Then R(Q) has a unique maximum at Q = P.
The Kullback-Leibler Information Inequality

Given probability distributions P = (p1,…,pk) and Q = (q1,…,qk) over a k-set, the relative entropy of P and Q is defined by:

$$D(P \,\|\, Q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}$$

Then D(P ∥ Q) ≥ 0, with equality only when P = Q.
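A small numeric illustration of the inequality (the distributions are chosen arbitrarily): the divergence is positive for distinct distributions, zero for identical ones, and, notably, not symmetric in P and Q.

```python
import math

def kl_divergence(P, Q):
    # D(P || Q) = sum_i p_i log(p_i / q_i); terms with p_i = 0 contribute 0
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]
```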
Proof of the Information Inequality

By the logarithmic version of the "normalized maximum likelihood" (two slides back), ∑i pi log qi is uniquely maximized at Q = P, i.e. ∑i pi log pi ≥ ∑i pi log qi. Hence:

$$D(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i} = \sum_i p_i \log p_i - \sum_i p_i \log q_i \ \ge\ 0,$$

and equality holds only when pi = qi for all i. QED
Using the Solution for the "Die Maximum Likelihood" to Find Parameters for HMM When All States are Known
The Parameters

Let Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to:

Maximize

$$\prod_{(k,l)} m_{kl}^{M_{kl}} \prod_{(k,b)} \big[e_k(b)\big]^{E_k(b)}$$

subject to: for all states k, ∑_l mkl = 1, ∑_b ek(b) = 1, and mkl, ek(b) ≥ 0.
Apply to HMM (cont.)

We apply the previous technique to get, for each state k, the parameters {mkl | l = 1,…,m} and {ek(b) | b ∈ Σ}:

$$m_{kl} = \frac{M_{kl}}{\sum_{l'} M_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

which gives the optimal ML parameters.
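These normalized counts are straightforward to compute from fully labeled sequences. The sketch below (toy data and names invented) also takes optional pseudo-counts, anticipating the next slides; with r = 0 it returns the plain ML estimates.

```python
def estimate_hmm(seqs, states, alphabet, r_m=0.0, r_e=0.0):
    """ML estimates m_kl = M_kl / sum_l' M_kl' and e_k(b) = E_k(b) / sum_b' E_k(b'),
    with an optional pseudo-count (r_m, r_e) added to every count."""
    M = {k: {l: r_m for l in states} for k in states}
    E = {k: {b: r_e for b in alphabet} for k in states}
    for path, symbols in seqs:
        for k, l in zip(path, path[1:]):
            M[k][l] += 1
        for k, b in zip(path, symbols):
            E[k][b] += 1
    m = {k: {l: M[k][l] / sum(M[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return m, e

seqs = [("++--", "ACGT")]                       # one labeled toy sequence
m, e = estimate_hmm(seqs, "+-", "ACGT")
m2, e2 = estimate_hmm(seqs, "+-", "ACGT", r_m=1.0, r_e=1.0)
```

Note that with no pseudo-counts a state that never occurs in the training set would cause a division by zero, which is one practical reason for the pseudo-counts of the coming slide.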
Summary of Case 1: Sequences are Fully Known

We know the complete structure of each sequence in the training set {x¹,...,xⁿ}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b. When everything is known, we can find the (unique set of) parameters θ* which maximizes

$$p(x^1,\dots,x^n \mid \theta^*) = \max_{\theta}\ p(x^1,\dots,x^n \mid \theta)$$
Adding Pseudo-Counts in HMM

We may modify the actual counts by our prior knowledge/belief (e.g., when the sample set is too small): rkl is our prior belief on transitions from k to l, and rk(b) is our prior belief on emissions of b from state k. Then:

$$m_{kl} = \frac{M_{kl} + r_{kl}}{\sum_{l'} (M_{kl'} + r_{kl'})}, \qquad e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} \big(E_k(b') + r_k(b')\big)}$$
Case 2: State Paths are Unknown

Here we use ML with hidden parameters.
Die Likelihood with Hidden Parameters

Let X be a random variable with 3 values: 0, 1, 2. Hence the parameters are θ = {θ0, θ1, θ2}, ∑θi = 1.

Assume that the data is a sequence of 2 tosses which we don't see, but we know that the sum of the outcomes is 2.

The problem: find parameters which maximize the likelihood (probability) of the observed data.

Basic fact: the probability of an event is the sum of the probabilities of the simple events it contains. The probability space here is all sequences of 2 tosses:

(0,0), (0,1), (0,2), (1,0), …, (2,2)
Defining the Problem

Thus, we need to find parameters θ which maximize:

$$\Pr(\text{sum} = 2 \mid \theta) = \Pr\{(1,1), (2,0), (0,2)\} = \theta_1^2 + 2\,\theta_0\theta_2$$

Finding an optimal solution is in general a difficult task. Hence we use the following procedure:
1. "Guess" initial parameters.
2. Repeatedly improve the parameters using the EM algorithm (to be studied later in this course).

Next, we exemplify the EM algorithm on the above example.
E Step: Average Counts

Assume our initial parameters are θ0 = 0.5, θ1 = θ2 = 0.25. Then:

Pr(1,1) = 0.25² = 0.0625
Pr(2,0) = Pr(0,2) = 0.5 · 0.25 = 0.125
Pr(sum = 2 | θ) = 0.0625 + 2 · 0.125 = 0.3125.

We use the probabilities of the events to generate "average counts" of the outcomes:
Average count of 0: 2 · 0.125 = 0.25.
Average count of 1: 2 · 0.0625 = 0.125.
Average count of 2: 2 · 0.125 = 0.25.
M Step: Updating Probabilities by the Average Counts

The total of all average counts is 2 · 0.25 + 0.125 = 0.625. The relative frequencies of the average counts give the new parameters λ0, λ1, λ2:

λ0 = 0.25 / 0.625 = 0.4
λ1 = 0.125 / 0.625 = 0.2
λ2 = 0.25 / 0.625 = 0.4

The probabilities of the simple events according to the new parameters:

Pr(1,1) = 0.2² = 0.04
Pr(2,0) = Pr(0,2) = 0.4² = 0.16

The probability of the event by the new parameters:

Pr(sum = 2 | λ) = 0.04 + 2 · 0.16 = 0.36 ≥ 0.3125 = Pr(sum = 2 | θ).
Summary of the Algorithm

• Start with some estimated parameters θ.
• Use these parameters to define average counts of the outcomes.
• Define new parameters λ by the relative frequencies of the average counts.

We will show that this algorithm never decreases, and usually increases, the likelihood of the data. An application of this algorithm to HMMs is known as the Baum-Welch algorithm, which we will see next.
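The two-toss example can be run directly; a minimal sketch (the observation "sum = 2" is hard-coded, and the normalization of the average counts is folded into the M step):

```python
from itertools import product

def em_step(theta):
    """One E+M step for two hidden die tosses observed only through their sum."""
    # E step: "average counts" of each outcome, weighted by the joint
    # probability of the simple events consistent with the observation.
    counts = {0: 0.0, 1: 0.0, 2: 0.0}
    for a, b in product([0, 1, 2], repeat=2):
        if a + b == 2:
            p = theta[a] * theta[b]
            counts[a] += p
            counts[b] += p
    # M step: new parameters are the relative frequencies of the average counts.
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def likelihood(theta):
    # Pr(sum = 2 | theta) = theta_1^2 + 2 * theta_0 * theta_2
    return sum(theta[a] * theta[b] for a, b in product([0, 1, 2], repeat=2)
               if a + b == 2)

theta = {0: 0.5, 1: 0.25, 2: 0.25}   # the initial guess from the E-step slide
new = em_step(theta)
```

This reproduces the slides' numbers: likelihood 0.3125 before the step and 0.36 after, with new parameters (0.4, 0.2, 0.4).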