
Page 1: Hidden Markov Model

Hidden Markov Model

Ka-Lok Ng

Dept. of Bioinformatics

Asia University

Page 2: Hidden Markov Model

Hidden Markov Models and Gene Finding

[Diagram: three states labeled 1, 2, and 3, with arrows indicating transitions between them]

A rabbit has three homes. Three states: 1, 2, 3. State transitions such as 1 → 2, 2 → 1, etc. This is a discrete stochastic process. (x0, x1, …, xn) denotes the random sequence of the process, where xn is the home in which the rabbit is located at step n.

Page 3: Hidden Markov Model

• The occurrence of a future state in a Markov process depends on the immediately preceding state and only on it.

• The matrix P is called a homogeneous transition or stochastic matrix because all the transition probabilities pij are fixed and independent of time.

Hidden Markov Models and Gene Finding

Page 4: Hidden Markov Model

Hidden Markov Models and Gene Finding

The transition matrix P = (pij), with rows indexed by the current state i and columns by the next state j (each row sums to 1):

        1     2     3     4     5
  1   0.3   0.5   0.1   0.1   0
  2   0.2   0.4   0.4   0     0
  3   0     0.1   0.3   0.1   0.5
  4   0.2   0     0     0.6   0.2
  5   0     0.1   0.1   0.3   0.5
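A minimal sketch of how such a chain can be represented and simulated, assuming the row ordering reconstructed above (the slide's matrix was garbled in extraction, so treat the values as reconstructed rather than authoritative):

```python
import random

# Reconstructed transition matrix: P[i][j] = probability of moving
# from state i+1 to state j+1 (values as read from the page 4 slide).
P = [
    [0.3, 0.5, 0.1, 0.1, 0.0],
    [0.2, 0.4, 0.4, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.1, 0.5],
    [0.2, 0.0, 0.0, 0.6, 0.2],
    [0.0, 0.1, 0.1, 0.3, 0.5],
]

# Every row of a stochastic matrix must sum to 1.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)

def step(state):
    """Draw the next state given the current one (states are 1..5)."""
    return random.choices(range(1, 6), weights=P[state - 1])[0]

state = 1
for _ in range(10):
    state = step(state)
    print(state, end=" ")
```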

Page 5: Hidden Markov Model

• A transition matrix P together with the initial probabilities associated with the states completely define a Markov chain.

• One usually thinks of a Markov chain as describing the transitional behavior of a system over equal intervals.

• Situations exist where the length of the interval depends on the characteristics of the system and hence may not be equal. This case is referred to as imbedded Markov chains.

Hidden Markov Models and Gene Finding

Page 6: Hidden Markov Model

• Events A and B
• Marginal probability: p(A), p(B)
• Joint probability: p(A,B) = p(AB) = p(A∩B)
• Conditional probability:
• p(B|A) = given that A has occurred, what is the probability of B?
• p(A|B) = given that B has occurred, what is the probability of A?

Bayes probability

http://www3.nccu.edu.tw/~hsueh/statI/ch5.pdf

Page 7: Hidden Markov Model

• General rule of multiplication
• p(A∩B) = p(A) p(B|A) = (probability A occurs) × (probability B occurs after A has occurred)
• = p(B) p(A|B) = (probability B occurs) × (probability A occurs after B has occurred)
• Joint = marginal × conditional
• Conditional = joint / marginal
• p(B|A) = p(A∩B) / p(A)
• How about p(A|B)? (see the sketch below)

Bayes probability
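A small sketch of these rules with made-up numbers (p(A), p(B|A), and p(B) below are hypothetical values, not from the slides): the joint probability is marginal × conditional, and p(A|B) is recovered by dividing the same joint probability by p(B).

```python
# Hypothetical marginal and conditional probabilities (illustration only).
p_A = 0.4            # p(A)
p_B_given_A = 0.25   # p(B|A)
p_B = 0.2            # p(B)

# Joint = marginal * conditional
p_A_and_B = p_A * p_B_given_A          # p(A ∩ B) = 0.1

# Conditional = joint / marginal, in both directions
p_B_given_A_check = p_A_and_B / p_A    # recovers 0.25
p_A_given_B = p_A_and_B / p_B          # answers "How about p(A|B)?" -> 0.5

print(p_A_and_B, p_B_given_A_check, p_A_given_B)
```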

Page 8: Hidden Markov Model

Bayes probability

Page 9: Hidden Markov Model

Bayes probability

[Illustration: 10 films, 3 defective and 7 good]

Given 10 films, 3 of which are defective, what is the probability that two films drawn in succession are both defective?
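The slide leaves the question open; a quick application of the multiplication rule (drawing without replacement) gives the answer:

```python
from fractions import Fraction

# P(first defective) * P(second defective | first defective), without replacement
p_both = Fraction(3, 10) * Fraction(2, 9)
print(p_both, float(p_both))   # 1/15 ≈ 0.0667
```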

Page 10: Hidden Markov Model

Bayes probability

Loyalty of managers to their employer.

Page 11: Hidden Markov Model

Bayes probability

Probability of new employee loyalty

Page 12: Hidden Markov Model

Bayes probability

Probability (over 10 years and loyal) = ?

Probability (less than 1 year or loyal) = ? (a sketch of how such "and"/"or" probabilities are read off a contingency table follows)
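The contingency table behind these questions is not reproduced in this transcript, so the counts below are purely hypothetical; the sketch only shows how a joint ("and") probability and a union ("or") probability would be computed from such a table.

```python
# Hypothetical counts of managers by years of service (rows) and loyalty (columns).
# These numbers are NOT from the original slides.
table = {
    ("less_than_1yr", "loyal"): 10, ("less_than_1yr", "not_loyal"): 25,
    ("1_to_10yr",     "loyal"): 30, ("1_to_10yr",     "not_loyal"): 15,
    ("over_10yr",     "loyal"): 75, ("over_10yr",     "not_loyal"): 45,
}
total = sum(table.values())

def p(cells):
    """Probability of a set of (years, loyalty) cells."""
    return sum(table[c] for c in cells) / total

# P(over 10 years AND loyal): a single cell of the table.
p_and = p([("over_10yr", "loyal")])

# P(less than 1 year OR loyal) = P(<1yr) + P(loyal) - P(<1yr AND loyal).
p_lt1 = p([c for c in table if c[0] == "less_than_1yr"])
p_loyal = p([c for c in table if c[1] == "loyal"])
p_or = p_lt1 + p_loyal - p([("less_than_1yr", "loyal")])

print(round(p_and, 3), round(p_or, 3))   # 0.375 0.7 with these hypothetical counts
```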

Page 13: Hidden Markov Model

pij = P{ xn = j | xn-1 = i }

p1 = P{ x0 = 1 }

P{ x0 = 1, x1 = 2 } = P{ x1 = 2 | x0 = 1 } P{ x0 = 1 } = p12 p1

Let (x0, x1, …, xn) denote the random sequence of the process.

The joint probability is not easy to calculate directly; it is easier to work with conditional probabilities.

Hidden Markov Models and Gene Finding
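A tiny sketch of this factorization. The value of p12 is read from the reconstructed page 4 matrix; the initial probability p1 is not given on the slide, so the value below is an assumption for illustration only.

```python
# Chain-rule factorization for the first two steps of the chain.
p1 = 0.2     # assumed initial probability P{x0 = 1}; not given on the slide
p12 = 0.5    # P{x1 = 2 | x0 = 1}, from the reconstructed page 4 matrix

# P{x0 = 1, x1 = 2} = P{x1 = 2 | x0 = 1} * P{x0 = 1}
p_joint = p12 * p1
print(p_joint)   # 0.1
```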

Page 14: Hidden Markov Model

HMMs allow local characteristics of molecular sequences to be modeled and predicted within a rigorous statistical framework.

They allow knowledge from prior investigations to be incorporated into the analysis.

An example of an HMM: assume every nucleotide in a DNA seq. belongs either to a 'normal' region (N) or to a GC-rich region (R). Assume that the normal and GC-rich categories are not randomly interspersed with one another, but instead have a patchiness that tends to create GC-rich islands located within larger regions of normal sequence.

Hidden Markov Models and Gene Finding

Page 15: Hidden Markov Model

The states of the HMM are either N or R:
NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN

The two states emit nucleotides with their own characteristic frequencies. The word 'hidden' refers to the fact that the true state is unobserved, or hidden.

TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA

The sequence is about 60% AT and 40% GC, not too far from a random sequence. If we focus on the GC-rich regions (shown in red in the original slide), they are 83% GC (10/12), compared with a GC frequency of 23% (7/30) in the rest of the sequence. HMMs are able to capture both the patchiness of the two classes and the different compositional frequencies within the categories.

Hidden Markov Models and Gene Finding
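A quick way to check the overall composition quoted above (roughly 60% AT / 40% GC) for the example sequence; the per-region figures cannot be recomputed here because the red/black region annotation is lost in this transcript.

```python
seq = "TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA"

# Count G and C bases and report the GC fraction.
gc = sum(base in "GC" for base in seq)
print(gc, len(seq), round(gc / len(seq), 2))   # 17 42 0.4 -> about 40% GC
```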

Page 16: Hidden Markov Model

HMM applications: gene finding, motif identification, prediction of tRNAs and protein domains.

In general, if we have sequence features that we can divide into spatially localized classes, with each class having a distinct composition, HMMs are a good candidate for analyzing the feature or finding new examples of it.

Hidden Markov Models and Gene Finding

Page 17: Hidden Markov Model

Hidden Markov Models and GC-rich regions

Hidden Markov Models and Gene Finding

Training the HMM: the states of the HMM are the two categories, N or R. Transition probabilities govern the assignment of states from one position to the next. In the current example, if the present state is N, the following position will be N with probability 0.9 and R with probability 0.1. The four nucleotides in a sequence appear in each state according to the corresponding emission probabilities.

The working of an HMM has two steps (see the sketch below):
(1) Assignment of the hidden states.
(2) Emission of the observed nucleotides conditional on the hidden states.

[Diagram: two-state HMM with states N and R and transition arrows between them]
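A minimal sketch of the two-state HMM's parameters. The N→N/N→R probabilities and the emission values Pr(T|N) = 0.3, Pr(G|N) = Pr(C|N) = 0.2, Pr(C|R) = 0.4 come from these slides; the R→R probability 0.8 is implied by the page 20 calculation; the remaining entries (Pr(A|N), Pr(G|R), Pr(A|R), Pr(T|R)) are assumptions chosen only so that each distribution sums to 1.

```python
# Two hidden states: N (normal) and R (GC-rich).
transition = {
    "N": {"N": 0.9, "R": 0.1},   # from the slide
    "R": {"R": 0.8, "N": 0.2},   # 0.8 implied by the page 20 calculation; 0.2 by complement
}

emission = {
    "N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},   # Pr(A|N) assumed so the row sums to 1
    "R": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},   # Pr(C|R) from the slides; the rest assumed
}

# Sanity check: every probability row sums to 1.
for row in list(transition.values()) + list(emission.values()):
    assert abs(sum(row.values()) - 1.0) < 1e-9
```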

Page 18: Hidden Markov Model

Consider the sequence TGCC arising from the hidden-state path NNNN. The probability of the observed sequence is a product of the appropriate emission probabilities:

Pr(TGCC|NNNN) = 0.3*0.2*0.2*0.2 = 0.0024

where Pr(T|N) is the conditional probability of observing a T at a site given that the hidden state is N.

In general, the probability of a sequence is computed as the sum over all hidden states:

Pr(seq) = Σ over hidden_states of Pr(seq | hidden_states) Pr(hidden_states)

Hidden Markov Models and Gene Finding

[Table: sequence positions 1–4 with candidate hidden-state paths such as NNNN and NRRR]

Page 19: Hidden Markov Model

The description of the hidden state of the first residue in a sequence introduces a technical detail beyond the scope of this discussion, so we simplify by assuming that the first position is in the N state. The remaining three positions can each be N or R, giving 2*2*2 = 8 possible hidden-state paths.

Hidden Markov Models and Gene Finding

Pr(TGCC) = Pr(TGCC|NNNN) Pr(NNNN) + (the corresponding terms for the seven other hidden-state paths)

Pr(TGCC|NNNN) Pr(NNNN)
= Pr(T|N) Pr(G|N) Pr(C|N) Pr(C|N) × Pr(N→N) Pr(N→N) Pr(N→N)
= (0.3*0.2*0.2*0.2) × (0.9*0.9*0.9)
= 0.00175
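A small sketch enumerating the 8 hidden-state paths (first position fixed to N), which is what the sum above runs over:

```python
from itertools import product

# First position fixed to N; the remaining three positions are each N or R.
paths = ["N" + "".join(rest) for rest in product("NR", repeat=3)]
print(len(paths))   # 8
print(paths)        # ['NNNN', 'NNNR', 'NNRN', 'NNRR', 'NRNN', 'NRNR', 'NRRN', 'NRRR']
```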

Page 20: Hidden Markov Model

Pr(TGCC|NNRR) Pr(NNRR)
= Pr(T|N) Pr(G|N) Pr(C|R) Pr(C|R) × Pr(N→N) Pr(N→R) Pr(R→R)
= (0.3*0.2*0.4*0.4) × (0.9*0.1*0.8)
= 0.000691

Hidden Markov Models and Gene Finding

The most likely path is NNNN, whose probability is slightly higher than that of the path NRRR (0.00123).

We can use the path that contributes the maximum probability as our best estimate of the unknown hidden states.

If the fifth nucleotide in the series were a G or C, the path NRRRR would be more likely than NNNNN. A sketch reproducing these path scores follows.
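A sketch that scores a few candidate paths with the parameters assembled in the page 17 sketch (recall that some of those emission values were assumptions) and reproduces the numbers above; taking the argmax over all paths is a brute-force version of what the Viterbi algorithm does efficiently.

```python
from itertools import product

transition = {"N": {"N": 0.9, "R": 0.1}, "R": {"R": 0.8, "N": 0.2}}
emission = {
    "N": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2},   # partly assumed, see the page 17 sketch
    "R": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

def score(seq, path):
    """Pr(seq | path) * Pr(path), with the first state fixed (no initial probability)."""
    p = emission[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= transition[path[i - 1]][path[i]] * emission[path[i]][seq[i]]
    return p

seq = "TGCC"
print(round(score(seq, "NNNN"), 5))   # 0.00175
print(round(score(seq, "NNRR"), 6))   # 0.000691
print(round(score(seq, "NRRR"), 5))   # 0.00123

# Brute force over the 8 paths with the first state fixed to N.
paths = ["N" + "".join(rest) for rest in product("NR", repeat=3)]
best = max(paths, key=lambda p: score(seq, p))
print(best)                           # NNNN
```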

Page 21: Hidden Markov Model

Hidden Markov Models and Gene Finding


Figure on the right - Schematic of the hidden states included in an HMM

Boxes = signal sensors for regulatory elements, coding region start sites, intron donor and acceptor sites, and translation stop sites

Arrows = content sensors for intergenic regions, exons, and introns.

Each of these regions emits nucleotides with frequencies characteristic of that region, with these frequencies being obtained by training the HMM on data sets of many known genes.

Page 22: Hidden Markov Model

Figure: predicting genes. Three different prediction methods (Ensembl, Fgenesh, and Genscan) were used on a region of chromosome 17 that includes the well-annotated GOSR2 gene. The black images below indicate the locations of matching cDNA/EST sequences.

Hidden Markov Models and Gene Finding