hidden markov model ed anderson and sasha tkachev

Hidden Markov Model

Ed Anderson and Sasha Tkachev

Who Was Markov?

Graduate of Saint Petersburg University (1878), where he became a professor in 1886.

Applied the method of continued fractions, pioneered by his teacher Pafnuty Chebyshev, to probability theory.

He proved the central limit theorem under fairly general assumptions.

Most remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.

In 1923 Norbert Wiener became the first to treat rigorously a continuous Markov process. The foundation of a general theory was provided during the 1930s by Andrei Kolmogorov.

Excerpted from: http://www-groups.dcs.st-and.ac.uk/~history/Mathematicians/Markov.html

Andrei A. Markov

Born: 14 June 1856 in Ryazan, Russia

Died: 20 July 1922 in Petrograd, Russia

What is the Hidden Markov Model?

Clipped from http://www.nist.gov/dads/HTML/hiddenMarkovModel.html

What Makes HMM Useful?

Efficiency:
The algorithms are simple enough to be performant for real-time speech recognition.
Speed is advantageous when dealing with large biological data sets.

Strong Theoretical Basis:
Probability distributions must sum to 1.
Scores are not influenced by ad hoc criteria.
Scores may be compared across different experiments of varying size and complexity.

Well suited for analyzing noisy, time-phased or sequentially connected events.

What are HMM’s Limitations?

Model building is not so easy:
“Since HMM training algorithms are local optimizers, it pays to build HMMs on pre-aligned data whenever possible… the parameter space may be complex, with many spurious local optima that can trap a training algorithm.” (1)

Distance between related states must be constant. This is a disadvantage when analyzing distant and arbitrarily spaced items:
Amino acids in folded proteins
RNA base pairs

(1) Eddy, S.R., Profile hidden Markov models, Bioinformatics Review, Vol. 14, no. 9, 1998, pg. 757

A Concrete Example

Example adapted from http://en.wikipedia.org/wiki/Viterbi_algorithm

Can you guess the weather based on a person’s activity? Use the Forward algorithm to calculate the probabilities.

(A) Transition Probabilities

Today    Tomorrow: Rain    Sun
Rain               0.7     0.3
Sun                0.4     0.6

(Π) Initial State Probabilities (Typical Weather)

Rain    0.6
Sun     0.4

(B) Emission Probabilities

IF weather    THEN: Walk    Shop    Clean
Rain                0.1     0.4     0.5
Sun                 0.6     0.3     0.1

Observations: Walk (1), Shop (2), Clean (3)

For each observation, the first number is p(weather n | weather n-1) and the second is P(activity | weather). For the first observation, the first number is the initial state probability.

Hidden States     Walk        Shop        Clean       Probability
Sun-Sun-Sun       0.4 0.6     0.6 0.3     0.6 0.1     0.002592
Sun-Sun-Rain      0.4 0.6     0.6 0.3     0.4 0.5     0.008640   False Maximum
Sun-Rain-Sun      0.4 0.6     0.4 0.4     0.3 0.1     0.001152
Sun-Rain-Rain     0.4 0.6     0.4 0.4     0.7 0.5     0.013440   True Maximum
Rain-Sun-Sun      0.6 0.1     0.3 0.3     0.6 0.1     0.000324
Rain-Sun-Rain     0.6 0.1     0.3 0.3     0.4 0.5     0.001080
Rain-Rain-Sun     0.6 0.1     0.7 0.4     0.3 0.1     0.000504
Rain-Rain-Rain    0.6 0.1     0.7 0.4     0.7 0.5     0.005880

Per-step products marked on the slide: .24 and .06 (Walk), .18 and .16 (Shop), .20 and .35 (Clean).
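The path probabilities in this example can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names (path_probability, forward, greedy_path) are made up here, and reading the slide's "False Maximum" as the result of picking the locally best state at each step is an interpretation, not something the slide states.

```python
from itertools import product

# Model parameters copied from the weather example.
states = ("Rain", "Sun")
start = {"Rain": 0.6, "Sun": 0.4}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3},
         "Sun": {"Rain": 0.4, "Sun": 0.6}}
emit = {"Rain": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
        "Sun": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1}}
observations = ("Walk", "Shop", "Clean")


def path_probability(path, obs):
    """Joint probability of one hidden state path and the observations."""
    prob = start[path[0]] * emit[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        prob *= trans[prev][cur] * emit[cur][o]
    return prob


def forward(obs):
    """Forward algorithm: total probability of the observations."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())


def greedy_path(obs):
    """Pick the locally best state at each step (assumed reading of the
    slide's 'False Maximum')."""
    path = [max(states, key=lambda s: start[s] * emit[s][obs[0]])]
    for o in obs[1:]:
        prev = path[-1]
        path.append(max(states, key=lambda s: trans[prev][s] * emit[s][o]))
    return tuple(path)


# Enumerate all 2**3 = 8 hidden state paths, as in the table.
table = {path: path_probability(path, observations)
         for path in product(states, repeat=len(observations))}
best = max(table, key=table.get)

for path, prob in sorted(table.items(), key=lambda kv: -kv[1]):
    print("-".join(path), f"{prob:.6f}")
print("best path:", "-".join(best))
print("greedy path:", "-".join(greedy_path(observations)))
print("total probability:", f"{forward(observations):.6f}")
```

Enumeration costs grow exponentially with the number of observations, while the forward recursion visits each state once per time step, which is why the slide's question about calculating every path matters.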

How to Avoid False Optima? Is it necessary to calculate every possible path? The Viterbi algorithm can help.

Example from http://www.telecom.tuc.gr/~ntsourak/demo_viterbi.htm
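A minimal Viterbi sketch in Python, applied to the weather example, shows how the best path can be found without enumerating every path. The function name viterbi and its argument layout are assumptions made for illustration, not taken from the linked demo.

```python
def viterbi(obs, states, start, trans, emit):
    """Return the most probable hidden state path and its probability."""
    # Best score of any path ending in each state after the first observation.
    score = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        pointers, new_score = {}, {}
        for s in states:
            # Keep only the best predecessor for each state.
            prev = max(states, key=lambda p: score[p] * trans[p][s])
            pointers[s] = prev
            new_score[s] = score[prev] * trans[prev][s] * emit[s][o]
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    last = max(states, key=score.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Weather model from the concrete example.
states = ("Rain", "Sun")
start = {"Rain": 0.6, "Sun": 0.4}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3},
         "Sun": {"Rain": 0.4, "Sun": 0.6}}
emit = {"Rain": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
        "Sun": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1}}

path, prob = viterbi(("Walk", "Shop", "Clean"), states, start, trans, emit)
print("-".join(path), f"{prob:.6f}")
```

Because only the best predecessor of each state is kept at every step, the work is proportional to (number of states squared) times (number of observations) rather than to the number of possible paths.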

HMM In Speech Recognition

Handling a single word: evaluating each HMM according to the input, using the Viterbi search.

Every senone gets an HMM:

[Figure: word models for ONE, TWO and THREE. Phoneme labels on the slide: W, AH, N, T, UW, TH, R, IY. Caption: 5-state HMM]

Adapted from Shir, O. M., Speech Recognition Seminar, 10/15/03

Leiden Institute of Advanced Computer Science

HMM In Speech Recognition

Taken from Shir, O. M., Speech Recognition Seminar, 10/15/03

Leiden Institute of Advanced Computer Science

[Figure: states plotted against time. Legend: state with best path-score; state with path-score < best; state without a valid path-score]

P_j(t) = \max_i [ P_i(t-1) \, a_{ij} \, b_j(t) ]

where

P_j(t): total path-score ending up at state j at time t

a_{ij}: state transition probability, i to j

b_j(t): score for state j, given the input at time t
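A single application of this recursion can be sketched in Python. The sketch works with log scores so that products become sums; that choice, the function name viterbi_step and the list-based layout are assumptions for illustration, not details from the seminar slides. The numbers reuse the weather example (state 0 = Rain, state 1 = Sun) rather than real acoustic scores.

```python
import math


def viterbi_step(prev_log_scores, log_a, log_b_t):
    """Apply P_j(t) = max_i [P_i(t-1) a_ij b_j(t)] once, in log space.

    prev_log_scores[i] is log P_i(t-1), log_a[i][j] is log a_ij and
    log_b_t[j] is log b_j(t). Returns the new log scores and, for each
    state j, the index i of its best predecessor.
    """
    n = len(prev_log_scores)
    new_scores, backpointers = [], []
    for j in range(n):
        best_i = max(range(n), key=lambda i: prev_log_scores[i] + log_a[i][j])
        backpointers.append(best_i)
        new_scores.append(prev_log_scores[best_i] + log_a[best_i][j] + log_b_t[j])
    return new_scores, backpointers


# Path scores after observing Walk, then one step for observing Shop.
prev = [math.log(0.06), math.log(0.24)]
log_a = [[math.log(0.7), math.log(0.3)],
         [math.log(0.4), math.log(0.6)]]
log_b_shop = [math.log(0.4), math.log(0.3)]

scores, back = viterbi_step(prev, log_a, log_b_shop)
print([round(math.exp(s), 4) for s in scores], back)
```

Storing the best predecessor for every state at every time step is what allows the best path to be traced back once the last input frame has been processed.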

HMM in Bioinformatics

Sequence profiling
Gene finding
Protein secondary structure prediction
Radiation hybrid mapping
Genetic linkage mapping
Phylogenetic analysis

HMM in Sequence Profiling

Review – Lecture 7 Highlights

Emission probabilities and transition probabilities

HMM in Sequence Profiling

Log odds scores are comparable across sequences of different lengths.

Taken from lecture 7 slides, apparently from Krogh, “Computational Methods in Molecular Biology”, pages 45-63, Elsevier, 1998.
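One way to read the log odds claim is that each residue's probability under the model is divided by its probability under a background (null) model, so every position contributes a ratio instead of a raw probability that shrinks with length. The sketch below illustrates this with a made-up three-column DNA profile and a uniform background. Both are hypothetical values, and transition probabilities are left out to keep the example short.

```python
import math

# Hypothetical background (null model) and match-state emission
# probabilities; these numbers are invented for illustration.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def log_odds_score(seq, profile=PROFILE, background=BACKGROUND):
    """Sum of log2(model probability / background probability) per position."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the number of columns")
    return sum(math.log2(column[residue] / background[residue])
               for column, residue in zip(profile, seq))


print(f"ACT: {log_odds_score('ACT'):.3f} bits")
print(f"TTA: {log_odds_score('TTA'):.3f} bits")
```

A positive score means the sequence is more likely under the profile than under the background, and a negative score means the opposite.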

Why HMM for Sequence Analysis?

Position-specific scoring methods make intuitive sense.
BLAST and FASTA use pair-wise alignment as opposed to profile scoring.
Profile methods have historically used ad hoc scoring systems.
HMM gap penalties are grounded in probability theory.
HMMs provide a coherent, probabilistic model. (2)

(2) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pp. 755-763

Profile HMM Software

‘Motif’ models have strings of match states separated by a small number of insert states.

‘Profile’ models have insert and delete states associated with each match state. (3)

[Figure (4)]

(3) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pp. 755-763

(4) Ibid., Figure 3 on page 758.

HMMER Architecture (5)

Both local and global profile alignment.

(5) Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.

How Does it Work? (6)

Generative models work by recursive enumeration of possible sequences from a finite set of rules.

The Plan 7 architecture explicitly models the entire target sequence, regardless of how much of that sequence matches the main model.

All alignments to a Plan 7 model are “global” alignments!

(6) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.

HMMER Programs (7)

hmmalign - align sequences to an HMM profile
hmmbuild - build a profile HMM from an alignment
hmmcalibrate - calibrate HMM search statistics
hmmconvert - convert between profile HMM file formats
hmmemit - generate sequences from a profile HMM
hmmfetch - retrieve an HMM from an HMM database
hmmindex - create a binary SSI index for an HMM database
hmmpfam - search one or more sequences against an HMM database
hmmsearch - search a sequence database with a profile HMM

HMMER’s native alignment format is called Stockholm format, the format of the Pfam protein database that allows extensive markup and annotation.

HMMER can read alignments in several common formats, including the output of the CLUSTAL family of programs, Wisconsin/GCG MSF format, the input format for the PHYLIP phylogenetic analysis programs, and “aligned FASTA” format (where the sequences in a FASTA file contain gap symbols, so that they are all the same length).

(7) Excerpted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.

Building a profile with hmmbuild 8

> hmmbuild globin.hmm globins50.msf

hmmbuild - build a hidden Markov model from an alignment
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment file: globins50.msf
File format: MSF
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: MAP (gapmax hint: 0.50)
Null model used: (default)
Prior used: (default)
Sequence weighting method: G/S/C tree weights
New HMM file: globin.hmm
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Alignment: #1
Number of sequences: 50
Number of columns: 308

Constructed a profile HMM (length 143)
Average score: 189.04 bits
Minimum score: -17.62 bits
Maximum score: 234.09 bits
Std. deviation: 53.18 bits


Calibrating the profile 9

> hmmcalibrate globin.hmm

hmmcalibrate -- calibrate HMM search statistics

HMMER 2.3 (April 2003)

Copyright (C) 1992-2003 HHMI/Washington University School of Medicine

Freely distributed under the GNU General Public License (GPL)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

HMM file: globin.hmm

Length distribution mean: 325

Length distribution s.d.: 200

Number of samples: 5000

random seed: 1051632537

histogram(s) saved to: [not saved]

POSIX threads: 4

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

HMM : globins50

mu : -39.897396

lambda : 0.226086

max : -9.567000
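The mu and lambda values reported by hmmcalibrate are the location and scale parameters of an extreme value distribution fitted to scores of random sequences. As a hedged sketch of how such parameters turn a bit score into a P-value and an E-value, the Python below uses the standard Gumbel tail formula P(S >= x) = 1 - exp(-exp(-lambda (x - mu))). The function names and the database size of 1000 sequences are assumptions made for illustration, and the sketch is not a reproduction of HMMER's own code.

```python
import math

# Parameters copied from the hmmcalibrate output above.
MU = -39.897396
LAMBDA = 0.226086


def evd_pvalue(score, mu=MU, lam=LAMBDA):
    """P(S >= score) under a Gumbel (extreme value) distribution."""
    # -expm1(-y) equals 1 - exp(-y) and stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def evalue(score, n_sequences, mu=MU, lam=LAMBDA):
    """Expected number of random hits scoring at least `score`
    in a database of n_sequences (hypothetical size)."""
    return n_sequences * evd_pvalue(score, mu, lam)


for bits in (-20.0, 0.0, 50.0):
    print(f"score {bits:6.1f} bits: P = {evd_pvalue(bits):.3g}, "
          f"E = {evalue(bits, 1000):.3g}")
```

Under this reading, a more negative mu or a larger lambda makes any given score less likely to arise by chance, which is why calibration changes the reported E-values.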


Searching the sequence DB 10

Header Section

> hmmsearch globin.hmm Artemia.fa

hmmsearch - search a sequence database with a profile HMM
HMMER 2.3 (April 2003)
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine
Freely distributed under the GNU General Public License (GPL)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HMM file: globin.hmm [globins50]
Sequence database: Artemia.fa
per-sequence score cutoff: [none]
per-domain score cutoff: [none]
per-sequence Eval cutoff: <= 10
per-domain Eval cutoff: [none]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Query HMM: globins50
Accession: [none]
Description: [none]
[HMM has been calibrated; E-values are empirical estimates]


Searching the sequence DB (cont.) 11

Sequence Top Hits Section



Alignment Output Section



Score Histogram Section


Local versus Global Alignment 14

HMMER does not do local (Smith/Waterman) and global (Needleman/Wunsch) style alignments in the same way that most computational biology analysis programs do.

To HMMER, whether local or global alignments are allowed is part of the model, rather than being accomplished by running a different algorithm. You must choose what kind of alignments you want to allow when you build the model. By default, hmmbuild builds models which allow alignments that are global with respect to the HMM and local with respect to the sequence, and which allow multiple domains to hit per sequence.


Experimental Observations

My tests on the clipped SH3 Domain sequence in the Krogh paper. (15)

The insert gap penalty was small but significant.
The number of inserts had a linear, negative effect on the score.
Relative to the overall score, the inserts and deletes had a small effect.

(15) Krogh, “Computational Methods in Molecular Biology”, pages 45-63, Elsevier, 1998.

Avg Log Odds by Domain: 8.63, -1.46, 14.77

[Chart: Insert Region Log Odds Correlated to Number of Inserts. x-axis: Total Inserts (0 to 8); y-axis: Log Odds (-1.52 to -1.34)]
