
EEL6825: Pattern Recognition

Introduction to Markov systems

1. Introduction

Up to now, we have talked a lot about building statistical models from data. However, throughout our discussion thus far, we have made the sometimes implicit, simplifying assumption that our individual data samples (or trials) are statistically independent. Let,

$\{X_1, X_2, \ldots\}$ (1)

be a sequence of discrete random variables. Then the random variables $\{X_j\}$ are said to be independent if and only if,

$P(X_t = x_t \mid X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, \ldots, X_1 = x_1) = P(X_t = x_t)$ (2)

(Equation (2) generalizes trivially to sequences of continuous and multivariate random variables.)

Some classic examples of independent events are (1) a series of coin tosses, (2) a series of dice rolls, etc. In many instances, however, the independence assumption is not a good assumption, especially when we deal with sequential data generated over time. Consider, for example, the following random variables:

1. A person’s weight.

2. Acoustic signals in speech recognition.

3. Hand gestures.

4. Robot trajectories.

5. The weather.

6. The state of a dynamic system.

In each of the above examples, knowledge of the random variable at previous times $\{t-1, t-2, \ldots\}$ gives us useful information about the random variable at time $t$. If I know that my weight on day $(t-1)$ is 150 lb., then my weight on day $t$ is clearly constrained to be in some interval around 150 lb. Similarly, if I measure the position of my mobile robot at time $(t-1)$, $(x_{t-1}, y_{t-1})$, then the measurement of the robot's position $(x_t, y_t)$ at time $t$ is constrained to be in some neighborhood of $(x_{t-1}, y_{t-1})$,

$(x_t, y_t) = (x_{t-1}, y_{t-1}) \pm (\delta_x, \delta_y)$ (3)

where $\delta_x$ and $\delta_y$ are determined by the robot's maximum acceleration and the robot's sensor noise models. When statistical independence is not a good assumption, the following approximation is often made:

$P(X_t = x_t \mid X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, \ldots, X_1 = x_1) = P(X_t = x_t \mid X_{t-1} = x_{t-1})$ (4)

In other words, we assume that the outcome of the $t$th measurement (or trial) is dependent on the outcome of the $(t-1)$th measurement, but is conditionally independent of the outcomes of all previous measurements at times $\{t-2, t-3, \ldots, 1\}$ given the outcome of the $(t-1)$th measurement. This property is universally known as the Markov property. (The Russian mathematician Andrei Andreyevich Markov was the first to extensively study stochastic systems with property (4) in the early part of the 20th century.)

Note that although the Markov assumption in equation (4) only relaxes the independence assumption in equation (2) slightly, it proves to be a remarkably powerful tool in modeling data with sequential dependencies. Many real systems have been modeled very successfully using the Markov property, even when the Markov assumption is only an approximation of reality. A recent query of the INSPEC database of scientific and engineering abstracts generated 25,000 hits for the keyword "Markov."

Models that make use of the Markov property can be broadly classified into four categories along two dimensions: (1) observability and (2) actions. Markov models that are fully observable and involve no actions (i.e. are passive) are called Markov chains or observable Markov models. Markov chains that are only partially observable (i.e. the state of the system is observable only indirectly) are called hidden Markov models (HMMs).


Figure 1 below gives simple examples of a Markov chain and a hidden Markov model, respectively. In the Markov chain, the $\{S_i\}$ represent observable states of the system at a given time step, and the $\{a_{ij}\}$ represent probabilistic transitions between the states. In the hidden Markov model (HMM), the states $\{S_i\}$ are not directly observable (i.e. are hidden), but are only indirectly observed through a stochastic output observable (in this case, the color red or blue). The probability of observing a given observable (e.g. red or blue) is state-dependent; in the figure below, for example, there is a one-third probability of observing red and a two-thirds probability of observing blue in state $S_1$.

Markov systems where actions (by some agent) influence the state of the system at the next time step are studied in the broad area of reinforcement learning and are referred to as Markov decision processes. The table below summarizes the different types of Markov systems (rows: observability; columns: actions).

                            Passive                         Choose actions
    Fully observable        Observable Markov models        Markov decision processes (MDPs)
    Partially observable    Hidden Markov models (HMMs)     Partially observable Markov decision processes (POMDPs)

Before formalizing the notion of a Markov chain and a hidden Markov model, we list some of the successful real-world applications of HMMs:

1. Speech recognition
2. Language modeling
3. Gesture recognition (e.g. sign language)
4. Hand-writing recognition
5. Facial-expression recognition
6. Human skill modeling (e.g. surgical procedures)
7. Human control strategy analysis (e.g. driving)
8. Robot control (e.g. autonomous driving)
9. And others...

Figure 1: (left) a four-state observable Markov model with states $S_1, \ldots, S_4$ and transition probabilities $\{a_{ij}\}$; (right) a four-state, two-observable HMM with the same transition structure.

2. Observable Markov models

A. Definitions

Consider a system that consists of $N$ distinct states,

$S_i$, $i \in \{1, \ldots, N\}$, (5)

with the following properties. At regularly spaced, discrete-time intervals, the system undergoes a change of state (possibly back to its present state). Defining $q_t$, $t \in \{1, 2, \ldots\}$, as the state of the system at time $t$, assume that the state transitions of this system are governed by first-order Markov state-conditional probabilities, such that,

$P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i)$. (6)

Moreover, assume that the state-transition probabilities are stationary, such that,

$P(q_t = S_j \mid q_{t-1} = S_i) = P(q_s = S_j \mid q_{s-1} = S_i)$, $\forall s, t > 1$. (7)

Together, equations (5), (6) and (7) define a discrete, stationary, first-order Markov chain, or alternatively, an observable Markov model.

For notational convenience, let,

$a_{ij} \equiv P(q_t = S_j \mid q_{t-1} = S_i)$, $i, j \in \{1, \ldots, N\}$, (8)

let the $N \times N$ state-transition matrix $A$ be defined by,

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix}$ (9)

where,

$0 \le a_{ij} \le 1$, $i, j \in \{1, \ldots, N\}$, (10)

$\sum_{j=1}^{N} a_{ij} = 1$, $i \in \{1, \ldots, N\}$, (11)

and let the $N$-length initial-state probability vector $\pi = \{\pi_i\}$, $i \in \{1, \ldots, N\}$, with

$\pi_i = P(q_1 = S_i)$, $i \in \{1, \ldots, N\}$, (12)

define the initial ($t = 1$) state probability distribution. Then, the Markov chain is completely defined by $\lambda = \{A, \pi\}$.

B. Example

Consider Figure 2 below. It graphically describes a fully connected three-state Markov model ($N = 3$). (Fully connected means that $a_{ij} > 0$, $\forall i, j$.) This model could, for example, be a simple meteorological model. Assume that once a day at high noon, the weather conditions are observed and classified into one of three distinct states $\{S_1, S_2, S_3\}$ where,

$S_1$ = rain or snow, $S_2$ = cloudy, and $S_3$ = sunny. (13)

Furthermore, assume that the weather on any given day $t$ is dependent only on the weather on the previous day $(t-1)$. Then a possible state-transition matrix $A$ might be,


$A = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$ (14)

which is illustrated graphically in Figure 3 below.

Figure 2: a fully connected three-state Markov model with states $S_1$, $S_2$, $S_3$ and transition probabilities $\{a_{ij}\}$.

Figure 3: the weather model of equation (14), with each arc labeled by its transition probability.
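As a concrete illustration, here is a minimal Python sketch that simulates the weather chain of equation (14). The 0-based state encoding (0 = rain/snow, 1 = cloudy, 2 = sunny) and the function name are illustrative choices, not part of the notes:

    import numpy as np

    rng = np.random.default_rng(0)

    # Transition matrix from equation (14); row i holds P(next state | current state i).
    A = np.array([[0.4, 0.3, 0.3],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.1, 0.8]])

    def sample_weather(q0, T):
        """Simulate T days of weather, starting from state q0."""
        seq = [q0]
        for _ in range(T - 1):
            seq.append(rng.choice(3, p=A[seq[-1]]))  # draw next state given current
        return seq

    print(sample_weather(2, 10))  # e.g. ten days starting from 'sunny'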

C. Probability of observation sequences

Given a Markov model $\lambda = \{A, \pi\}$ and an observation sequence $O$,

$O = \{q_1, q_2, \ldots, q_T\}$ (15)

of length $T$, the probability of the observation sequence $O$ given the model $\lambda$, $P(O \mid \lambda)$, is given by,

$P(O \mid \lambda) = P(q_1) \times P(q_2 \mid q_1) \times \cdots \times P(q_T \mid q_{T-1})$ (16)

$P(O \mid \lambda) = \pi_{q_1} a_{q_1 q_2} \times \cdots \times a_{q_{T-1} q_T}$ (17)

(Note that in equation (17), the subscripts $q_t$ denote the index of the state at time $t$.)

Example: For the example in Section 2-B, what is the probability of the observation sequence $O$,


$O = \{S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3\}$ (18)

given that at time $t = 1$, the weather is sunny?

From the question,

$\pi_1 = \pi_2 = 0$, $\pi_3 = P(q_1 = S_3) = 1$. (19)

Therefore,

$P(O \mid \lambda) = \pi_3\, a_{33}\, a_{33}\, a_{31}\, a_{11}\, a_{13}\, a_{32}\, a_{23}$ (20)

$P(O \mid \lambda) = 1 \times 0.8 \times 0.8 \times 0.1 \times 0.4 \times 0.3 \times 0.1 \times 0.2$ (21)

$P(O \mid \lambda) \approx 1.536 \times 10^{-4}$ (22)
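A minimal Python sketch of this computation, using the matrix $A$ of equation (14) (the function name and 0-based state indices are mine):

    import numpy as np

    A = np.array([[0.4, 0.3, 0.3],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.1, 0.8]])
    pi = np.array([0.0, 0.0, 1.0])          # start sunny, per equation (19)

    def markov_sequence_prob(states, A, pi):
        """P(O | lambda) for an observable Markov chain, equation (17)."""
        p = pi[states[0]]
        for s_prev, s_next in zip(states[:-1], states[1:]):
            p *= A[s_prev, s_next]
        return p

    O = [2, 2, 2, 0, 0, 2, 1, 2]            # {S3, S3, S3, S1, S1, S3, S2, S3}
    print(markov_sequence_prob(O, A, pi))   # ~1.536e-4, matching equation (22)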

D. Expected length of same-state sequences

Given that at time $t$ the state of the model $\lambda = \{A, \pi\}$ is $S_i$, the probability that the model will stay in that state for exactly $d$ time steps is given by $P(O_d \mid \lambda)$, where,

$O_d = \{q_t = S_i, q_{t+1} = S_i, \ldots, q_{t+d-1} = S_i, q_{t+d} = S_j\}$, (23)

where,

$j \ne i$. (24)

Now,

$P(O_d \mid \lambda) = P(q_t = S_i) \times \left[\prod_{\tau = t+1}^{t+d-1} P(q_\tau = S_i \mid q_{\tau-1} = S_i)\right] \times P(q_{t+d} \ne S_i \mid q_{t+d-1} = S_i)$ (25)

where,

$P(q_t = S_i) = 1$ (given), (26)

$P(q_\tau = S_i \mid q_{\tau-1} = S_i) = a_{ii}$ (27)

$P(q_{t+d} \ne S_i \mid q_{t+d-1} = S_i) = 1 - P(q_{t+d} = S_i \mid q_{t+d-1} = S_i) = 1 - a_{ii}$ (28)

so that,

$P(O_d \mid \lambda) = a_{ii}^{(d-1)} \times (1 - a_{ii})$ (29)

The expected value of $d$ is given by,

$E[d] = \sum_{d=1}^{\infty} d \times P(O_d \mid \lambda)$ (30)

$E[d] = \sum_{d=1}^{\infty} d \times a_{ii}^{(d-1)} \times (1 - a_{ii})$ (31)

$E[d] = \dfrac{1}{(1 - a_{ii})}$ (32)
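The step from equation (31) to equation (32) follows by differentiating the geometric series; for completeness:

$\sum_{d=1}^{\infty} d\, a_{ii}^{(d-1)} = \frac{d}{d a_{ii}} \sum_{d=0}^{\infty} a_{ii}^{d} = \frac{d}{d a_{ii}} \left( \frac{1}{1 - a_{ii}} \right) = \frac{1}{(1 - a_{ii})^2}$

so that $E[d] = (1 - a_{ii}) \cdot \frac{1}{(1 - a_{ii})^2} = \frac{1}{1 - a_{ii}}$.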

Example: For the example in Section 2-B,

$E[d]_{S_1} = 1/(1 - 0.4) = 5/3$ days of rain or snow, (33)


$E[d]_{S_2} = 1/(1 - 0.6) = 5/2$ days of cloud cover, (34)

$E[d]_{S_3} = 1/(1 - 0.8) = 5$ days of sunshine. (35)

E. Probability of being in a given state at time $t$

Given a Markov model $\lambda = \{A, \pi\}$, the probability $P(q_t = S_j)$ of being in state $S_j$ at time $t$ can be computed inductively,

$P(q_1 = S_i) = \pi_i$ (by definition) (36)

$P(q_t = S_j) = \sum_{i=1}^{N} P(q_{t-1} = S_i)\, a_{ij}$, $t \in \{2, 3, \ldots\}$ (37)

Alternatively, we can write,

$P(q_t = S_j) = [\pi' A^{(t-1)}](j)$ (38)

where $[v](i)$ denotes the $i$th element of the vector $v$.

F. Fully connected Markov chains

Fully connected (finite-state) Markov chains ($a_{ij} > 0$, $\forall i, j$) are ergodic. That is, the limit of $P(q_t = S_j)$ as $t$ approaches infinity exists and is independent of the initial state probability vector $\pi$,

$\lim_{t \to \infty} P(q_t = S_j) = \lim_{t \to \infty} [\pi' A^{(t-1)}](j) = P(S_j)$, $\forall \pi$ (39)

Example: The limit in equation (39) typically converges for relatively small values of $t$. For the Markov chain in Section 2-B, for example, equation (39) converges to,

$P(S_1) \approx 0.1818$ (rain or snow) (40)

$P(S_2) \approx 0.2727$ (cloudy) (41)

$P(S_3) \approx 0.5455$ (sunny) (42)

for arbitrary initial values of $\pi$ by roughly $t = 20$, as illustrated in Figure 4, for example, with $\pi = [0.1\ \ 0.1\ \ 0.8]^T$. (This weather model is clearly not for Pittsburgh, PA, where sunshine is a rare occurrence indeed.)

Figure 4: $P(q_t = S_j)$, $j \in \{1, 2, 3\}$, as a function of $t$, converging to the limits of equations (40) to (42).
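A quick numerical check of equations (38) and (39) in Python (the function name is mine):

    import numpy as np

    A = np.array([[0.4, 0.3, 0.3],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.1, 0.8]])

    def state_probs(pi, A, t):
        """P(q_t = S_j) = [pi' A^(t-1)](j), equation (38)."""
        return pi @ np.linalg.matrix_power(A, t - 1)

    # Two very different initial distributions converge to the same limit (39).
    for pi0 in (np.array([1.0, 0.0, 0.0]), np.array([0.1, 0.1, 0.8])):
        print(state_probs(pi0, A, 20).round(4))  # -> [0.1818 0.2727 0.5455]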


Example: The limit in equation (39) need not be well defined if some $a_{ij} = 0$. Consider the two-state Markov model in Figure 5 below:

Figure 5: a two-state Markov model with $a_{12} = a_{21} = 1$.

The corresponding $A$ matrix is given by,

$A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ (43)

Note that for $A$ in equation (43),

$A^{2t} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $A^{(2t-1)} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, $t \in \{1, 2, \ldots\}$ (44)

Therefore, the limit in equation (39) does not exist.

G. Partially connected Markov chains

When some transition probabilities $a_{ij} = 0$, states may become transient or absorbing. A transient state $S_j$ is defined by the property,

$\lim_{t \to \infty} P(q_t = S_j) = 0$, (45)

while an absorbing state $S_j$ is defined by the property,

$\lim_{t \to \infty} P(q_t = S_j) = 1$. (46)

Example: Consider the three-state Markov model in Figure 6 below, where $0 < a_{11}, a_{12}, a_{21}, a_{22}, a_{23} < 1$, $a_{13} = a_{31} = a_{32} = 0$, and $a_{33} = 1$.

Figure 6: a three-state Markov chain with self-transitions $a_{11}$ and $a_{22}$, cross-transitions $a_{12}$, $a_{21}$ and $a_{23}$, and $a_{33} = 1$.

The states $S_1$ and $S_2$ are transient, while the state $S_3$ is absorbing.

H. Maximum-likelihood estimation for observable Markov models

Given an observation sequence $O$,

$O = \{q_1, q_2, \ldots, q_T\}$, (47)

the maximum-likelihood estimate of the parameters in the Markov model $\lambda = \{A, \pi\}$ is given by,

$a_{ij} = \dfrac{\text{number of transitions from state } S_i \text{ to state } S_j}{\text{number of transitions from state } S_i}$ (48)

$\pi_i = \begin{cases} 1, & \text{if } q_1 = S_i \\ 0, & \text{otherwise} \end{cases}$ (49)
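A minimal Python sketch of the counting estimates (48) and (49); the function name is mine, states are 0-based indices, and the sketch assumes every state occurs in the sequence:

    import numpy as np

    def estimate_markov(states, N):
        """ML estimates of A and pi from one state sequence, equations (48)-(49)."""
        counts = np.zeros((N, N))
        for i, j in zip(states[:-1], states[1:]):
            counts[i, j] += 1                                # count S_i -> S_j
        A_hat = counts / counts.sum(axis=1, keepdims=True)   # normalize each row
        pi_hat = np.zeros(N)
        pi_hat[states[0]] = 1.0                              # equation (49)
        return A_hat, pi_hat

Applied to the length-100 sequence of equation (50) below, this procedure reproduces the counts in equation (51) and the estimate in equation (52).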


Example: The following observation sequence $O$,

O = {2233311333333333333333333333112222233121333333333321323322233222222222221......1331222222222111222111333333} (50)

of length $T = 100$ was generated from the Markov model in Section 2-B, with $\pi = [1/3\ \ 1/3\ \ 1/3]^T$.

For each $a_{ij}$, we have to count the number of transitions from state $S_i$ to state $S_j$. The following matrix gives those transition frequencies:

$\begin{bmatrix} 7 & 4 & 5 \\ 5 & 27 & 4 \\ 4 & 4 & 40 \end{bmatrix}$ (51)

Normalizing the rows, we get that the maximum-likelihood estimate $\hat{A}$ of $A$ is given by,

$\hat{A} = \begin{bmatrix} 0.438 & 0.250 & 0.313 \\ 0.139 & 0.750 & 0.111 \\ 0.085 & 0.085 & 0.830 \end{bmatrix}$ (52)

Comparing $\hat{A}$ in equation (52) to $A$ in equation (14), we see that $\hat{A}$ is a reasonable approximation of $A$. Of course, for longer observation sequences (more data), the approximation of $A$ will improve asymptotically. Figure 7 below plots the error between the actual model $A$ [equation (14)] and the estimated model $\hat{A}$,

$e_{RMS} = \sqrt{\sum_{i,j} (a_{ij} - \hat{a}_{ij})^2}$ (53)

averaged over 20 randomly generated observation sequences, as the length $T$ of the observation sequences becomes large.

Figure 7: the error $e_{RMS}$ between $A$ and $\hat{A}$ as a function of observation-sequence length $T$ (from 10 to 10,000).


3. Hidden Markov models (HMMs)

A. Introduction

As we have seen, in Markov chains (or observable Markov models), it is assumed that we can observe the state $q_t$ at each time step $t$. In hidden Markov models, we remove that condition. In other words, we assume that the state at each time step $t$ is not observable; rather, we can only observe the state at each time step indirectly through some observable $O_t$, which is related to the state $q_t$ at time step $t$ through some output probability distribution. That is, a hidden Markov model is a doubly stochastic model, where both state transitions as well as output observables are governed by probabilistic distributions. Before we formalize this notion, let us look at a simple example.

Suppose that we have $N$ large urns (bowls) that are filled with differently colored balls. In Figure 8, $N = 3$, and there are three colors: (1) white, (2) gray and (3) black. We now generate a sequence of balls using the following procedure. We first pick one of the three urns at random, according to some stationary (fixed) probability distribution. Next, we pick a ball at random from that urn, and observe its color. The ball is then put back in the urn from which it was selected. Subsequently, a new urn is selected according to some stationary probability distribution associated with the current urn. That is, which urn is selected next is conditional on only the current urn. Once again, we pick a ball from the new urn, and the process is repeated to generate an observation sequence $O = \{O_t\}$, $t \in \{1, 2, \ldots, T\}$, of colors. A typical observation sequence is plotted in Figure 9 for $T = 15$.

Figure 8: three urns (#1, #2, #3) filled with white, gray and black balls.

Figure 9: a typical observation sequence of ball colors for $T = 15$.

Graphically, we can represent this process as a hidden Markov model (HMM), as shown in Figure 10. Each urn is represented by a state that is connected to all other states through probabilistic transitions. Each state also has some output probability distribution (e.g. ratio of black, gray and white balls in each urn) associated with it. This output probability distribution for each state is indicated graphically in Figure 10. For example, in state 1 (the left-most state), there are 1/4 white balls, 1/4 gray balls and 1/2 black balls.

Figure 10: the urn process represented as a three-state HMM.


Note that in this process, we can only observe the color of each ball, not the urn from which the ball was picked. Hence the name hidden Markov model, indicating that we cannot directly observe the underlying sequence of states that generated the observation sequence. For the sample observation sequence shown above, many different underlying state sequences are possible. In fact, there are a total of $3^{15} \approx 14.3 \times 10^6$ possible sequences.

B. Definition of an HMM

While the above example may appear contrived, it does have all the necessary properties of a hidden Markov model. In order to define the notion of an HMM formally, we introduce the following nomenclature:

$S_i$ = state $i$, (54)

$v_k$ = $k$th observable, (55)

$q_t$ = state of the system at time step $t$, (56)

$O_t$ = observable that is observed at time step $t$, $t \in \{1, \ldots, T\}$. (57)

Given this notation, a discrete-output¹ HMM is completely defined by $\lambda = \{A, B, \pi\}$, where,

$N$ = # of states in the model, (58)

$L$ = # of possible observables ($M$ in the Rabiner tutorial paper [1]), (59)

$A$ = $N \times N$ state transition matrix with elements $\{a_{ij}\}$, (60)

$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $i, j \in \{1, \ldots, N\}$, (61)

$B$ = $L \times N$ output probability distribution matrix with elements $\{b_{kj}\} = \{b_j(k)\}$, (62)

$b_j(k) = P(O_t = v_k \mid q_t = S_j)$, $j \in \{1, \ldots, N\}$, $k \in \{1, \ldots, L\}$, (63)

$\pi$ = $N$-length initial state probability vector with elements $\pi_i$, (64)

$\pi_i = P(q_1 = S_i)$, $i \in \{1, \ldots, N\}$. (65)

The parameters (probabilities) of the hidden Markov model are stationary (independent of time $t$).

Thus, the hidden Markov model is different from the observable Markov model in that we have added the output probability distribution matrix $B$. This makes it a much more powerful parametric formalism for modeling sequential, Markovian processes, since $B$ can model any arbitrary discrete distribution. Of course, the $B$ matrix also adds a large number of parameters to the model, so typically, more data is required to reliably train hidden Markov models vs. observable Markov models. For example, if $N = 5$ and $L = 100$, the observable Markov model has,

$N^2 + N = 30$ free parameters, (66)

while the hidden Markov model has,

$N^2 + N + N \times L = 530$ free parameters. (67)

[Note that for an observable Markov model, the $B$ matrix would be given by,

$B = I_N$ ($N \times N$ identity matrix), (68)

and the observables $v_i$ would be the states, so that $v_i = S_i$.]

1. There are also HMMs for continuous (continuous-valued) observables. These will be discussed later.


For the urn HMM, the output probability distribution matrix $B$ is given by,

$B = \begin{bmatrix} 1/4 & 3/5 & 1/5 \\ 1/4 & 1/5 & 3/5 \\ 1/2 & 1/5 & 1/5 \end{bmatrix}$ (rows: white, gray, black; columns: urns #1, #2, #3) (69)
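To make the ball-drawing procedure of Section 3-A concrete, here is a small Python sketch that samples an observation sequence from the urn HMM. The output matrix B is taken from equation (69); the transition matrix A and initial vector pi below are hypothetical, since the notes do not specify them:

    import numpy as np

    rng = np.random.default_rng(0)

    # B from equation (69): rows = colors (0 white, 1 gray, 2 black), columns = urns.
    B = np.array([[1/4, 3/5, 1/5],
                  [1/4, 1/5, 3/5],
                  [1/2, 1/5, 1/5]])

    # Hypothetical urn-to-urn transition matrix and initial distribution.
    A = np.array([[0.6, 0.2, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]])
    pi = np.array([1/3, 1/3, 1/3])

    def sample_urn_hmm(T):
        """Generate T ball colors from the urn HMM."""
        colors = []
        q = rng.choice(3, p=pi)                      # pick the first urn
        for _ in range(T):
            colors.append(rng.choice(3, p=B[:, q]))  # draw a ball from urn q
            q = rng.choice(3, p=A[q])                # pick the next urn
        return colors

    print(sample_urn_hmm(15))   # a T = 15 color sequence, as in Figure 9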

4. Three fundamental problems for HMMs

We can define three fundamental problems for hidden Markov models: (1) the evaluation problem, (2) the decoding problem and (3) the training problem.

A. Problem #1: Evaluation problem

Given an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, and a hidden Markov model $\lambda = \{A, B, \pi\}$, how do we efficiently compute $P(O \mid \lambda)$ (i.e. the probability of the observation sequence $O$ given the HMM $\lambda$)?

B. Problem #2: Decoding problem

Given an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, and a hidden Markov model $\lambda = \{A, B, \pi\}$, how do we efficiently compute the "best" (most likely) state sequence $Q = \{q_t\}$, $t \in \{1, \ldots, T\}$ (i.e. the state sequence $Q$ that best explains the observation sequence $O$ given $\lambda$)?

C. Problem #3: Training problem

Given an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, the number of states $N$, and the number of possible observables $L$ (implicitly defined by $O$), how do we efficiently compute the maximum-likelihood estimate of the parameters of the hidden Markov model $\lambda = \{A, B, \pi\}$? In other words, what HMM $\lambda$ maximizes $P(O \mid \lambda)$?

D. Discussion

Before we derive the solutions for these three problems, it is worthwhile to examine where each of the three problems would come up in a real-world application of hidden Markov models. Historically, one of the first application areas of HMMs was in speech recognition, so we will choose that as our example. Specifically, we will be looking at isolated spoken-word recognition, as illustrated in Figure 11 below. In this context, we can assign physical meaning to the hidden states as the phonemes of the words in the vocabulary of the word recognizer.

A simple isolated spoken-word recognizer can be constructed as follows. First, we decide which words $W_i$, $i \in \{1, \ldots, M\}$, we would like our system to be able to recognize. For example, we might be interested in recognizing the 10 spoken words corresponding to "zero," "one,"..., and "nine," for an automated phone system application. Next, we record hundreds or thousands of labeled examples of different individuals saying each of the words out loud. These data will serve as our training data. For each word $W_i$ we train a corresponding HMM $\lambda_i$ to model that word (problem #3 above).

Before using the discrete hidden Markov models, we first have to decide how we will convert the time-sampled, continuous-valued acoustic speech signals to sequences of discrete observables. This conversion typically involves two steps: (1) windowing and spectral preprocessing followed by (2) vector quantization. Windowing partitions the acoustic signal into possibly overlapping segments, each of which is then filtered through some spectral preprocessor (linear predictive coding, fast Fourier transform, etc.). The acoustic speech signal is thus converted to a sequence of continuous-valued spectral feature vectors $\{v_t\}$, which are then quantized to discrete observables $\{O_t\}$ through vector quantization. It is important to realize that the spectral preprocessing and VQ codebook have to be the same for all spoken words in our training data. Therefore, the VQ codebook is typically trained (using, for example, the LBG algorithm) on all the training data.

Now, suppose that we want to recognize an unknown utterance by some individual. We first convert that acoustic signal to a sequence of observables $\{O_t\}$ following the same conversion procedure as in the training


phase. Then, we evaluate either $P(O \mid \lambda_i)$ (problem #1 above) or $P(O, Q_i^* \mid \lambda_i)$, where $Q_i^*$ represents the most likely state sequence (phoneme sequence) for $O$ given $\lambda_i$ (problem #2). Finally, whichever HMM $\lambda_l$ yields the largest probability will correspond to the most likely word $W_l$.

Figure 11: block diagram of an isolated spoken-word recognizer. The acoustic speech signal passes through spectral preprocessing to give spectral feature vectors $\{v_t\}$; vector quantization converts these to the observation sequence $O = \{O_t\}$; each word HMM $\lambda_1, \ldots, \lambda_M$, trained off-line on many instances of its word, is evaluated (or decoded) to give $P(O \mid \lambda_i)$; and selecting the maximum yields the recognized word.

As we shall see later, although different HMM applications vary in detail and complexity, the diagram above contains many of the essential components that make up an HMM-based system: (1) spectral preprocessing, (2) vector quantization and (3) a bank of HMMs.

5. Solution to the evaluation problem (problem #1)

Problem statement: Given an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, and a hidden Markov model $\lambda = \{A, B, \pi\}$, how do we efficiently compute $P(O \mid \lambda)$ (i.e. the probability of the observation sequence $O$ given the HMM $\lambda$)?

A. Introduction

In hidden Markov models, the principal problem in computing $P(O \mid \lambda)$,

$O = \{O_t\}$, $t \in \{1, \ldots, T\}$, (70)

$\lambda = \{A, B, \pi\}$, (71)

is that we do not know what underlying state sequence $Q$,

$Q = \{q_t\}$, $t \in \{1, \ldots, T\}$, (72)

generated the given observation sequence $O$. If we assume a specific underlying state sequence $Q$, then the problem is much simpler, since,

$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid q_t, \lambda) = \prod_{t=1}^{T} b_{q_t}(O_t)$. (73)

[Recall that in the definition of hidden Markov models, the observation at time $t$, $O_t$, is dependent only on the state at time $t$, $q_t$.] The probability of any specific state sequence $Q$ can be expressed in terms of the HMM parameters as,

$P(Q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T} = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}$ (74)

[This is identical to observable Markov models.] The joint probability of $O$ and $Q$ given $\lambda$ can be written as,

$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda)$ (75)

so that,

$P(O \mid \lambda) = \sum_Q P(O, Q \mid \lambda) = \sum_Q P(O \mid Q, \lambda)\, P(Q \mid \lambda)$ (76)

Although equation (76) gives a computable expression for $P(O \mid \lambda)$, it does not offer a practical algorithm for computing $P(O \mid \lambda)$. Since there are $N^T$ possible state sequences, each of which requires on the order of $2T$ operations, equation (76) requires on the order of $2TN^T$ total operations. Even for very moderately sized problems this is unreasonable. For example, if $N = 5$ and $T = 100$, we would require approximately,

$2 \times 100 \times 5^{100} \approx 10^{72}$ operations. (77)

Clearly, we require a more efficient formulation for evaluating $P(O \mid \lambda)$. The forward algorithm, described below, does just that. A related algorithm, the backward algorithm, while not explicitly used to compute $P(O \mid \lambda)$, will be used later to efficiently solve the training problem, and is therefore also presented below. Together, these two algorithms are sometimes referred to as the forward-backward algorithm.


B. The forward algorithm

Let's define the following "forward" variable $\alpha_t(i)$:

$\alpha_t(i) = P(O_1, \ldots, O_t, q_t = S_i \mid \lambda)$, (78)

which denotes the probability of the partial observation sequence $\{O_1, \ldots, O_t\}$ and being in state $S_i$ at time step $t$ given the model $\lambda$. The variables $\alpha_t(i)$ can be computed inductively, and from them, $P(O \mid \lambda)$ is easily evaluated. The forward algorithm is defined below:

1. Initialization:

$\alpha_1(i) = P(O_1, q_1 = S_i \mid \lambda)$ (79)

$\alpha_1(i) = P(O_1 \mid q_1 = S_i, \lambda)\, P(q_1 = S_i \mid \lambda)$ (80)

$\alpha_1(i) = \pi_i\, b_i(O_1)$, $i \in \{1, \ldots, N\}$. (81)

2. Induction:

$\alpha_{t+1}(j) = P(O_1, \ldots, O_{t+1}, q_{t+1} = S_j \mid \lambda)$ (82)

$\alpha_{t+1}(j) = \sum_{i=1}^{N} P(O_1, \ldots, O_t, q_t = S_i \mid \lambda)\, P(q_{t+1} = S_j \mid q_t = S_i, \lambda)\, P(O_{t+1} \mid q_{t+1} = S_j, \lambda)$ (83)

$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(O_{t+1})$, $t \in \{1, \ldots, T-1\}$, $j \in \{1, \ldots, N\}$. (84)

3. Completion:

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$ [by definition (78)] (85)

The computation of $\alpha_{t+1}(j)$ in the induction step above (step 2) accounts for all possible state transitions to state $S_j$ from time step $t$ to time step $t+1$, and the observable $O_{t+1}$ at time step $t+1$. Figure 12 below illustrates the induction step graphically.

Figure 12: the forward induction step: $\alpha_{t+1}(j)$ collects $\alpha_t(i)\, a_{ij}$ over all states $S_1, \ldots, S_N$ and multiplies by $b_j(O_{t+1})$.

For the same values of $N$ and $T$ as before ($N = 5$, $T = 100$), the computation of $P(O \mid \lambda)$ now only takes on the order of,


$N^2 T = 2500$ operations, (86)

as opposed to $10^{72}$ operations. Thus, the forward algorithm offers a practical and efficient means for solving the evaluation problem in hidden Markov models.¹
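A compact Python sketch of the forward algorithm, equations (79) to (85). The conventions are mine: O holds 0-based observable indices, A is the N x N transition matrix, B is L x N with B[k, j] = b_j(k), pi is length N, and no scaling is applied (see the footnote below):

    import numpy as np

    def forward(O, A, B, pi):
        """Forward algorithm; returns all alpha_t(i) and P(O | lambda)."""
        T, N = len(O), A.shape[0]
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[O[0]]                          # initialization (81)
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[O[t + 1]]  # induction (84)
        return alpha, alpha[-1].sum()                    # completion (85)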

C. The backward algorithm

While the backward algorithm is not used explicitly to solve the evaluation problem, it will be used later for solving the training problem in conjunction with the forward algorithm. Because of its similarity to the forward algorithm, it is presented now, however. First, we define the "backward" variables $\beta_t(i)$ (similar to the forward variables) to be,

$\beta_t(i) = P(O_{t+1}, \ldots, O_T \mid q_t = S_i, \lambda)$ (87)

which denotes the probability of the partial observation sequence $\{O_{t+1}, \ldots, O_T\}$ given that at time step $t$ the model is in state $S_i$ and given the model $\lambda$. As with the forward variables $\alpha_t(i)$, the backward variables $\beta_t(i)$ can be computed inductively. The backward algorithm is defined below:

1. Initialization:

$\beta_T(i) = 1$, $i \in \{1, \ldots, N\}$ (88)

Note that equation (88) arbitrarily defines $\beta_T(i)$ such that,

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$, $t \in \{1, \ldots, T\}$ (89)

2. Induction:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$, $t \in \{T-1, T-2, \ldots, 1\}$, $i \in \{1, \ldots, N\}$. (90)

Equation (90) is similar to the induction step of the forward algorithm, except that now we propagate the values back from the end of the observation sequence, rather than forward from the beginning of $O$. This is illustrated graphically in Figure 13 below. As with the forward algorithm, the backward algorithm requires on the order of $N^2 T$ operations.

Figure 13: the backward induction step: $\beta_t(i)$ collects $a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$ over all states $S_1, \ldots, S_N$.

1. As we shall see later, this algorithm will require slight modification through scaling for implementation on finite-precision computers in order to prevent numerical underflow.
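A matching Python sketch of the backward algorithm, equations (88) to (90), under the same conventions as the forward() sketch above:

    import numpy as np

    def backward(O, A, B):
        """Backward algorithm; returns all beta_t(i)."""
        T, N = len(O), A.shape[0]
        beta = np.zeros((T, N))
        beta[-1] = 1.0                                   # initialization (88)
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[O[t + 1]] * beta[t + 1])    # induction (90)
        return beta

For any $t$, equation (89) then gives $P(O \mid \lambda)$ as (alpha[t] * beta[t]).sum().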


6. Solution to the training problem (problem #3)¹

Problem statement: Given an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, the number of states $N$, and the number of possible observables $L$ (implicitly defined by $O$), how do we efficiently compute the maximum-likelihood estimate of the parameters in the hidden Markov model $\lambda = \{A, B, \pi\}$? In other words, what HMM $\lambda$ maximizes $P(O \mid \lambda)$?

1. We first look at problem #3 rather than problem #2 (the decoding problem) because the solution of problem #3 makes extensive use of the forward-backward algorithm (i.e. solution to problem #1).

A. Introduction

Baum and his colleagues developed the Baum-Welch algorithm for training hidden Markov models in the late 1960s and early 1970s. The development of their algorithm precedes the publication by Dempster, et al. [2] of the general Expectation-Maximization (EM) algorithm (1977). It is interesting to note, however, that the Baum-Welch algorithm is simply a special case of the EM algorithm. Intuitively, we see why the EM algorithm may be applicable here, since hidden Markov models clearly have hidden information (the underlying state sequence $Q = \{q_t\}$ corresponding to the observable sequence $O = \{O_t\}$). Therefore, in developing the Baum-Welch algorithm for training HMMs, we will do so within the EM framework.

B. Initial formulation

Suppose that we knew the value of the underlying state sequence $Q = \{q_t\}$, $t \in \{1, 2, \ldots, T\}$. Then, the maximum-likelihood estimate of $\lambda = \{A, B, \pi\}$ given $O$ is straightforward and given below:

$a_{ij} = \dfrac{\text{number of transitions from state } S_i \text{ to state } S_j}{\text{number of transitions from state } S_i}$ (same as for observable Markov models) (91)

$b_j(k) = \dfrac{\text{number of times in state } S_j \text{ and observing symbol } v_k}{\text{number of times in state } S_j}$ (92)

$\pi_i = (\text{number of times in state } S_i \text{ at time } t = 1)$ (93)

Unfortunately, we don't know the state sequence $Q$; however, if we have a current estimate of $\lambda = \{A, B, \pi\}$, then we can compute "the expected number of..." for each of the numerators and denominators in equations (91), (92) and (93) (EM rears its ugly head again). Thus, assuming a current estimate $\lambda$ of the hidden Markov model, a better (or equally good) estimate $\bar{\lambda} = \{\bar{A}, \bar{B}, \bar{\pi}\}$ will be given by,

$\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i}$ (94)

$\bar{b}_j(k) = \dfrac{\text{expected number of times in state } S_j \text{ and observing symbol } v_k}{\text{expected number of times in state } S_j}$ (95)

$\bar{\pi}_i = (\text{expected number of times in state } S_i \text{ at time } t = 1)$ (96)

where the right-hand sides of equations (94), (95) and (96) can be expressed, as we shall see shortly, in terms of $\lambda = \{A, B, \pi\}$ (i.e. the current model parameter estimates).

C. Detailed derivation

In order to compute the quantities in equations (94), (95) and (96), we need to define some hidden variables similar to the mixture-of-Gaussians problem. Thus, let's define the following hidden variables for the reestimation of the state-transition matrix $A$:


$y_t(i,j) = \begin{cases} 1, & \text{if at time } t \text{ the system transitions from } S_i \text{ to } S_j \\ 0, & \text{otherwise} \end{cases}$ (97)

$y(i,j) = \text{number of transitions from state } S_i \text{ to state } S_j$ (98)

so that,

$y(i,j) = \sum_{t=1}^{T-1} y_t(i,j)$. (99)

Let us also define,

$y_t(i) = \begin{cases} 1, & \text{if at time } t \text{ the system transitions from state } S_i \\ 0, & \text{otherwise} \end{cases}$ (100)

$y(i) = \text{number of transitions from state } S_i$ (101)

so that,

$y(i) = \sum_{t=1}^{T-1} y_t(i)$. (102)

Similarly, let us define the following hidden variables for the reestimation of the output-probability distribution matrix $B$:

$z_t(j,k) = \begin{cases} 1, & \text{if at time } t \text{ the system is in state } S_j \text{ with output observable } v_k \\ 0, & \text{otherwise} \end{cases}$ (103)

$z(j,k) = \text{number of times in state } S_j \text{ and observing symbol } v_k$ (104)

so that,

$z(j,k) = \sum_{t=1}^{T} z_t(j,k)$. (105)

Let us also define,

$z_t(j) = \begin{cases} 1, & \text{if at time } t \text{ the system is in state } S_j \\ 0, & \text{otherwise} \end{cases}$ (106)

$z(j) = \text{number of times in state } S_j$ (107)

so that,

$z(j) = \sum_{t=1}^{T} z_t(j)$. (108)

In terms of the above definitions, equations (94), (95) and (96) can be written as,

$\bar{a}_{ij} = \dfrac{E[y(i,j)]}{E[y(i)]}$ (109)

$\bar{b}_j(k) = \dfrac{E[z(j,k)]}{E[z(j)]}$ (110)


$\bar{\pi}_i = E[y_1(i)]$ (111)

where $E[\cdot]$ denotes the expectation operator. To complete the derivation, we need to compute the expected values in equations (109), (110) and (111). First, consider $E[y(i,j)]$:

$E[y(i,j)] = E\left[\sum_{t=1}^{T-1} y_t(i,j)\right] = \sum_{t=1}^{T-1} E[y_t(i,j)]$ (112)

$E[y(i,j)] = \sum_{t=1}^{T-1} \left[0 \cdot P(y_t(i,j) = 0) + 1 \cdot P(y_t(i,j) = 1)\right]$ (113)

$E[y(i,j)] = \sum_{t=1}^{T-1} P(y_t(i,j) = 1)$ (114)

Let us define $\xi_t(i,j)$ as,

$\xi_t(i,j) \equiv E[y_t(i,j)]$ (115)

so that,

$\xi_t(i,j) = P(y_t(i,j) = 1) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$ (116)

$\xi_t(i,j) = \dfrac{P(q_t = S_i, q_{t+1} = S_j, O \mid \lambda)}{P(O \mid \lambda)}$ (117)

We can compute $\xi_t(i,j)$ in terms of the forward and backward variables. Figure 14 below illustrates how this is done graphically.

Figure 14: computation of $\xi_t(i,j)$: paths through state $S_i$ at time $t$ and state $S_j$ at time $t+1$ are accounted for by $\alpha_t(i)$, the factor $a_{ij}\, b_j(O_{t+1})$, and $\beta_{t+1}(j)$.

Recall from Section 5 that,

$\alpha_t(i) = P(O_1, \ldots, O_t, q_t = S_i \mid \lambda)$, and (118)

$\beta_{t+1}(j) = P(O_{t+2}, \ldots, O_T \mid q_{t+1} = S_j, \lambda)$ (119)

so that,


$\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$ (120)

In the numerator of equation (120), the forward variable $\alpha_t(i)$ and the backward variable $\beta_{t+1}(j)$ account for the entire observation sequence except for the observation $O_{t+1}$ at time $t+1$ and the transition from state $S_i$ to $S_j$ from $t$ to $t+1$. These are taken care of by multiplication by $b_j(O_{t+1})$ and $a_{ij}$, respectively.

[From the definition of $\xi_t(i,j)$, we require that,

$\sum_{i=1}^{N} \sum_{j=1}^{N} \xi_t(i,j) = 1$ (121)

Therefore, it must be true that,

$P(O \mid \lambda) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$ (122)

Note from equation (90) [reprinted below as equation (123)],

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$ (123)

that this is consistent with equation (89) [reprinted below as equation (124)],

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$.] (124)

Next, we consider $E[y(i)]$:

$E[y(i)] = \sum_{t=1}^{T-1} E[y_t(i)] = \sum_{t=1}^{T-1} P(y_t(i) = 1) = \sum_{t=1}^{T-1} P(q_t = S_i \mid O, \lambda)$ (125)

Let us define $\gamma_t(i)$ as,

$\gamma_t(i) \equiv E[y_t(i)]$ (126)

so that,

$\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \sum_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$ (127)

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$ (128)

Combining equations (109), (112), (115), (125) and (126), we arrive at the following iterative EM update rule for the state-transition matrix $A$:

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$ (129)

where $\xi_t(i,j)$ is given in equation (120) in terms of the current model parameters, and $\gamma_t(i)$ is given in equation (128) in terms of $\xi_t(i,j)$.

The iterative EM update rule for the output-probability-distribution matrix $B$ can be similarly derived. The numerator in equation (110) can be expressed as,


$E[z(j,k)] = \sum_{t=1}^{T} P(z_t(j,k) = 1) = \sum_{t=1}^{T} P(q_t = S_j, O_t = v_k \mid O, \lambda)$ (130)

$E[z(j,k)] = \sum_{t:\, O_t = v_k} \gamma_t(j)$ (131)

Similarly, the denominator in equation (110) can be expressed as,

$E[z(j)] = \sum_{t=1}^{T} P(z_t(j) = 1) = \sum_{t=1}^{T} P(q_t = S_j \mid O, \lambda)$ (132)

$E[z(j)] = \sum_{t=1}^{T} \gamma_t(j)$ (133)

so that,

$\bar{b}_j(k) = \dfrac{\sum_{t:\, O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$ (134)

Finally, from equation (111), we get that,

$\bar{\pi}_i = E[y_1(i)] = \gamma_1(i)$. (135)

D. Summary of results

We have now derived the Baum-Welch algorithm for iteratively training the parameters of a hidden Markov model. Given a current estimate of the HMM $\lambda = \{A, B, \pi\}$ and an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, the new estimate of the HMM is given by $\bar{\lambda} = \{\bar{A}, \bar{B}, \bar{\pi}\}$, where,

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, $i, j \in \{1, \ldots, N\}$, (136)

$\bar{b}_j(k) = \dfrac{\sum_{t:\, O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$, $j \in \{1, \ldots, N\}$, $k \in \{1, \ldots, L\}$, (137)

$\bar{\pi}_i = \gamma_1(i)$, $i \in \{1, \ldots, N\}$, (138)

and

$\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$ (139)

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$ (140)

Note from equation (139) the key role that the forward and backward variables play in the HMM reestimation formulas. Also, note that the reestimation equations (136), (137) and (138) guarantee that,

$\sum_{j=1}^{N} \bar{a}_{ij} = 1$, $i \in \{1, \ldots, N\}$, (141)

$\sum_{k=1}^{L} \bar{b}_j(k) = 1$, $j \in \{1, \ldots, N\}$, (142)

$\sum_{i=1}^{N} \bar{\pi}_i = 1$ (143)

after each update. Since the reestimation equations are EM equations, we are guaranteed that for each iteration,

$P(O \mid \bar{\lambda}) \ge P(O \mid \lambda)$ (144)

When $P(O \mid \bar{\lambda}) = P(O \mid \lambda)$, we are guaranteed to have reached a local maximum of the log-probability function. Equations (136), (137) and (138) can also be derived through direct maximization of the $Q$-function¹ $Q_{EM}(\lambda, \bar{\lambda})$, or constrained Lagrange optimization of $P(O \mid \lambda)$.

E. Simplification

The update equations (136) and (137) can be written in simplified form directly in terms of the forward and backward variables:

$\bar{a}_{ij} = \dfrac{\frac{1}{P(O \mid \lambda)} \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\frac{1}{P(O \mid \lambda)} \sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}$ (145)

$\bar{b}_j(k) = \dfrac{\frac{1}{P(O \mid \lambda)} \sum_{t:\, O_t = v_k} \alpha_t(j)\, \beta_t(j)}{\frac{1}{P(O \mid \lambda)} \sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}$ (146)

Equations (145) and (146) above were derived by combining equations (123), (136), (137), (139) and (140) above. For training on single observation sequences, equations (145) and (146) reduce to,

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}$ (147)

$\bar{b}_j(k) = \dfrac{\sum_{t:\, O_t = v_k} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}$ (148)

1. The $Q$ in this context is different from the state sequence $Q$ used throughout these notes; it is just an unfortunate collision of notation.
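Pulling these results together, here is a Python sketch of one Baum-Welch reestimation step, implementing equations (138), (147) and (148) on top of the forward() and backward() sketches given earlier (again without the scaling needed in practice):

    import numpy as np

    def baum_welch_step(O, A, B, pi):
        """One reestimation pass; returns (A_bar, B_bar, pi_bar)."""
        alpha, prob = forward(O, A, B, pi)
        beta = backward(O, A, B)
        gamma = alpha * beta / prob   # gamma_t(i) = alpha_t(i) beta_t(i) / P(O|lambda)
        # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O|lambda), eq. (139)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[O[1:]] * beta[1:])[:, None, :]) / prob
        A_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]        # equation (147)
        obs = np.asarray(O)
        B_bar = np.zeros_like(B)
        for k in range(B.shape[0]):
            B_bar[k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)  # equation (148)
        return A_bar, B_bar, gamma[0]                                   # equation (138)

Iterating baum_welch_step() until $P(O \mid \lambda)$ stops increasing implements the guarantee of equation (144).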


7. Solution to the decoding problem (problem #2)

Problem statement: Given an observation sequence $O = \{O_t\}$, $t \in \{1, \ldots, T\}$, and a hidden Markov model $\lambda = \{A, B, \pi\}$, how do we efficiently compute the "best" (most likely) state sequence $Q = \{q_t\}$, $t \in \{1, \ldots, T\}$ (i.e. the state sequence $Q$ that best explains the observation sequence $O$ given $\lambda$)?

A. Introduction

Since the underlying state sequence $Q = \{q_t\}$ that generated the observation sequence $O = \{O_t\}$ is unknown, we would sometimes like to be able to determine the most likely state sequence to have generated $O$ given the hidden Markov model $\lambda$. In speech recognition applications, for example, this is especially relevant, since the states of the hidden Markov model can be interpreted as phonemes.

One way to try to solve this problem is to choose the $q_t$ which are individually most likely at each time step $t$. The variables $\gamma_t(i)$ that we defined in the previous section are helpful here,

$\gamma_t(i) = P(q_t = S_i \mid O, \lambda)$ [see equations (126) and (127)]. (149)

$\gamma_t(i) = \dfrac{P(q_t = S_i, O \mid \lambda)}{P(O \mid \lambda)}$ (150)

Since,

$\alpha_t(i) = P(O_1, \ldots, O_t, q_t = S_i \mid \lambda)$, and (151)

$\beta_t(i) = P(O_{t+1}, \ldots, O_T \mid q_t = S_i, \lambda)$ (152)

the variables $\gamma_t(i)$ can be computed in terms of $\alpha_t(i)$ and $\beta_t(i)$,

$\gamma_t(i) = \dfrac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$ (153)

Now, in order to choose the individually optimal state sequence $Q^* = \{q_t^*\}$, we simply take the largest $\gamma_t(i)$ value for all $t$,

$q_t^* = \operatorname{argmax}_i\, \gamma_t(i)$, $t \in \{1, \ldots, T\}$ (154)

The basic problem with this solution is that equation (154) permits state sequences $Q^*$ that may contain state transitions,

$\{\ldots, q_t^* = S_i, q_{t+1}^* = S_j, \ldots\}$ (155)

that are not possible given $\lambda$ (i.e. $a_{ij} = 0$), so that $P(Q^* \mid \lambda) = 0$. A better solution, therefore, is to try to find the state sequence $Q$ that gives the single best path for the observation sequence $O$ and the HMM $\lambda$; in other words, we would like to find $Q$ such that,

$P(Q \mid O, \lambda)$ (156)

is maximized. The Viterbi algorithm, developed in the next section, does just that.
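A short Python sketch of this individually-most-likely decoding, equations (153) and (154), built on the earlier forward() and backward() sketches (and subject to exactly the weakness just described):

    import numpy as np

    def individually_most_likely(O, A, B, pi):
        """q_t* = argmax_i gamma_t(i); may yield an impossible state path."""
        alpha, prob = forward(O, A, B, pi)
        beta = backward(O, A, B)
        gamma = alpha * beta / prob      # equation (153)
        return gamma.argmax(axis=1)      # equation (154)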

B. The Viterbi algorithm

Thus, we want to maximize $P(Q \mid O, \lambda)$. From basic probability theory,

$P(Q \mid O, \lambda) = \dfrac{P(Q, O \mid \lambda)}{P(O \mid \lambda)}$ (157)


so that maximizing $P(Q \mid O, \lambda)$ is equivalent to maximizing the joint probability of $Q$ and $O$ given $\lambda$, $P(Q, O \mid \lambda)$, since $P(O \mid \lambda)$ is constant with respect to $Q$. As was the case for the evaluation problem, we could theoretically solve this problem by computing,

$P(Q, O \mid \lambda)$ for every state sequence $Q$, [see equations (73), (74) and (75)] (158)

and then selecting $Q^*$,

$Q^* = \operatorname{argmax}_Q\, P(Q, O \mid \lambda)$ (159)

as the optimal state sequence. Since there are a total of $N^T$ possible state sequences (e.g. if $N = 5$ and $T = 100$, then $N^T \approx 10^{70}$), however, equation (159) would be impractically slow to compute. Therefore, we will derive an inductive algorithm for solving the decoding problem similar in spirit to the forward-backward algorithm. This algorithm is known as the Viterbi algorithm.

First, let us define another variable $\delta_t(i)$,

$\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_t = S_i, O_1, \ldots, O_t \mid \lambda)$ (160)

which can be interpreted as the best score (i.e. highest probability) along a single state path at time $t$ for the first $t$ observations, ending in state $S_i$. Given the definition in equation (160), the following inductive relationship exists:

$\delta_{t+1}(j) = \left[\max_i\, \delta_t(i)\, a_{ij}\right] b_j(O_{t+1})$ (161)

Equation (161) forms the basis of the Viterbi algorithm and is very similar to the forward algorithm used to compute the $\alpha_t(i)$ variables. In order to completely specify the Viterbi algorithm, however, we will need another (Greek) variable $\psi_t(i)$ which keeps track of which $i$ maximizes the right-hand side of equation (161) for each time step $t$. The complete Viterbi algorithm for recovering the most likely state sequence $Q^*$ is given below:

1. Initialization:

$\delta_1(i) = P(q_1 = S_i, O_1 \mid \lambda)$ (162)

$\delta_1(i) = \pi_i\, b_i(O_1)$, $i \in \{1, \ldots, N\}$ (163)

2. Induction:

$\delta_t(j) = \left[\max_i\, \delta_{t-1}(i)\, a_{ij}\right] b_j(O_t)$, $j \in \{1, \ldots, N\}$, $t \in \{2, \ldots, T\}$ (164)

$\psi_t(j) = \operatorname{argmax}_i \left[\delta_{t-1}(i)\, a_{ij}\right]$, $j \in \{1, \ldots, N\}$, $t \in \{2, \ldots, T\}$ (165)

3. Termination:

$P^* = \max_Q\, P(Q, O \mid \lambda)$ (166)

$P^* = \max_i\, \delta_T(i)$ (167)

$q_T^* = \operatorname{argmax}_i\, \delta_T(i)$ (168)

4. Path (state-sequence) backtracking:

$q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t \in \{T-1, T-2, \ldots, 1\}$, (169)

where $Q^* = \{q_t^*\}$, $t \in \{1, \ldots, T\}$, is the most likely state sequence.
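A Python sketch of the full Viterbi recursion, equations (162) to (169), under the same conventions as the forward() sketch (and, like it, needing log-space or scaling in practice, as noted below):

    import numpy as np

    def viterbi(O, A, B, pi):
        """Most likely state path Q* and its probability P*."""
        T, N = len(O), A.shape[0]
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = pi * B[O[0]]                       # initialization (163)
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A        # delta_{t-1}(i) * a_ij
            psi[t] = scores.argmax(axis=0)            # equation (165)
            delta[t] = scores.max(axis=0) * B[O[t]]   # equation (164)
        q = np.zeros(T, dtype=int)
        q[-1] = delta[-1].argmax()                    # termination (168)
        for t in range(T - 2, -1, -1):
            q[t] = psi[t + 1, q[t + 1]]               # backtracking (169)
        return q, delta[-1].max()                     # Q* and P*, equation (167)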


As is the case for the forward-backward algorithm, the Viterbi algorithm requires slight modification for implementation on finite-precision computers in order to prevent numerical underflow.

8. Hidden Markov models with continuous-valued outputs

So far, we have looked only at discrete-output HMMs, where observations consist of sequences of discrete observables $O = \{O_t\}$. Although most applications of HMMs deal with continuous-valued signals, discrete-output HMMs are used most often in practice because they are by far the most computationally efficient type of HMM with which to work. Nevertheless, there do exist HMMs with continuous-valued outputs, and we review two main types of continuous-valued HMMs below.

A. Continuous-output HMMs

Assume that rather than have a sequence of discrete observables $O = \{O_t\}$, we instead have a sequence of continuous-valued vectors $X = \{\vec{x}_t\}$, $t \in \{1, \ldots, T\}$. Without preprocessing and vector quantization of the sequence $X$, we can no longer use the discrete-output probability matrix $B$. Rather, in continuous-output HMMs, we assume that each state $S_j$, $j \in \{1, \ldots, N\}$, is represented by an output probability density function (pdf) $b_j(\vec{x})$ that is a mixture of Gaussians:

$b_j(\vec{x}) = \sum_{k=1}^{L} c_{jk}\, b_{jk}(\vec{x})$, $j \in \{1, \ldots, N\}$ (170)

where,

$b_{jk}(\vec{x}) = N[\vec{x}, \vec{\mu}_{jk}, \Sigma_{jk}]$ (Gaussian pdf with mean $\vec{\mu}_{jk}$ and covariance $\Sigma_{jk}$), (171)

$c_{jk} = P(\omega_k \mid S_j)$ (probability of mixture component $\omega_k$ given state $S_j$), (172)

such that,

$\sum_{k=1}^{L} c_{jk} = 1$ and (173)

$\int b_j(\vec{x})\, d\vec{x} = 1$ (174)

Figure 15 illustrates schematically, for example, what a three-state continuous-output HMM with one-dimensional continuous-valued outputs may look like. The main drawback of continuous-output HMMs is their substantial computational complexity. Consider, for example, the number of free parameters that have to be estimated during training in continuous-output HMMs vs. discrete-output HMMs:

$a_{ij}$: $N \times N$ free parameters (same as for discrete-output HMMs) (175)

$c_{jk}$: $N \times L$ free parameters (similar to the $b_j(k)$ parameters for discrete-output HMMs) (176)

$\vec{\mu}_{jk}$: $N \times L \times d$ free parameters (177)

$\Sigma_{jk}$: $N \times L \times d^2$ free parameters (178)

Thus, from equations (177) and (178) we see that continuous HMMs have on the order of $NLd(1 + d)$ more parameters to estimate, which leads to much higher computational complexity during training. It is not just training that requires much more computation, however; even evaluation of $P(X \mid \lambda)$ requires an order-of-magnitude increase in computation, since each table lookup $b_j(O_t)$ in the discrete case is replaced by the relatively complex computation of $b_j(\vec{x})$,


$b_j(\vec{x}) = \sum_{k=1}^{L} c_{jk}\, b_{jk}(\vec{x})$ (179)

Equation (179) requires $L$ Gaussian function evaluations, $L$ multiplications and $L - 1$ additions, vs. one table lookup in the discrete case.
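For concreteness, here is a minimal sketch (ours, not the notes') of the mixture-of-Gaussians output density of equations (170) to (172); the function and array names are our own.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N[x, mu, Sigma], eq. (171)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def b_j(x, c_j, mu_j, Sigma_j):
    """Mixture-of-Gaussians output density b_j(x), eq. (170).

    c_j     : (L,)       mixture weights c_jk for state S_j (sum to 1)
    mu_j    : (L, d)     component means
    Sigma_j : (L, d, d)  component covariances
    """
    # L Gaussian evaluations, L multiplications, L - 1 additions, cf. eq. (179)
    return sum(c * gaussian_pdf(x, mu, S)
               for c, mu, S in zip(c_j, mu_j, Sigma_j))
```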

While computational complexity is a principal reason that discrete-output HMMs are often preferred to continuous-valued HMMs in practice, a secondary important reason is that continuous-valued HMMs tend to be more sensitive to initial parameter settings [3]. In other words, it is much more likely that in training continuous-valued HMMs, the model converges to a poor local maximum of the log-likelihood function, unless the initial parameter settings are selected with great care.

Huang [4] tried to address these concerns with the development of what he referred to as "semicontinuous" HMMs, which we briefly review in the next section.

[Figure 15: schematic example of a three-state continuous-output HMM with one-dimensional Gaussian-mixture output densities $b_j(x)$.]

B. Semicontinuous-output HMMs

The basic idea of semicontinuous HMMs is to combine the computational simplicity of discrete-output HMMs with the superior modeling capacity of continuous HMMs. Given a sequence of continuous-valued vectors $X = \{\vec{x}_t\}$, $t \in \{1, \ldots, T\}$, the distribution of the data $X$ is first estimated as a mixture of $L$ Gaussians, using the EM algorithms developed earlier in this course. The output probability of a vector $\vec{x}$ in state $S_j$ is then estimated as,

$b_j(\vec{x}) = \sum_{k=1}^{L} p(\vec{x} \mid \omega_k)\, b_j(k)$ (180)

where,

$p(\vec{x} \mid \omega_k)$ = the $k$th Gaussian pdf estimated through the EM algorithm, and (181)

$b_j(k) = P(\omega_k \mid S_j)$. (182)

The principal difference between semicontinuous HMMs and continuous HMMs is that instead of $N \times L$ Gaussians, there are only $L$ Gaussians in the model, and the parameters of the Gaussians are estimated prior to HMM training. This simplifies the model significantly, not in small part because a semicontinuous HMM typically has many fewer parameters that need to be estimated. Unfortunately, semicontinuous HMMs are still significantly more computationally costly than discrete-output HMMs.


In order to clarify the difference between discrete-output and continuous-valued output HMMs, let us look at a simple example. Consider a single-state HMM with,

$\lambda = \{A, B, \pi\}$, $A = [1]$, $B = [0.0\;\;0.5\;\;0.3\;\;0.2\;\;0.0]^T$, $\pi = [1]$ (183)

where the $k$th discrete observable corresponds to the real values $x$,

$\dfrac{k-1}{5} \le x < \dfrac{k}{5}$, $k \in \{1, 2, 3, 4, 5\}$. (184)

In other words, the discretization in equation (184) corresponds to a VQ codebook of,

$\bigl\{\tfrac{1}{10}, \tfrac{3}{10}, \tfrac{5}{10}, \tfrac{7}{10}, \tfrac{9}{10}\bigr\}$ (185)

as illustrated in Figure 16 below.

[Figure 16: the VQ codebook of equation (185): the interval $[0, 1)$ partitioned into five equal cells with centroids at 1/10, 3/10, 5/10, 7/10 and 9/10, labeled by observables 1 through 5.]

The pdf for the HMM in equation (183) and the VQ codebook in equation (185) is plotted in Figure 17 for $x \in [0, 1)$. The effects of vector quantization are clearly evident in the discontinuous profile of $p(x \mid \text{VQ}, \lambda)$.

[Figure 17: the discrete pdf $p(x \mid \text{VQ}, \lambda)$ for $x \in [0, 1)$; the piecewise-constant profile ranges from 0 to 2.5.]

Now, however, assume that we represent the continuous-valued data not as a VQ codebook, but rather as a mixture of Gaussians with means $\mu_i$,

$\mu_i = \dfrac{2i - 1}{10}$, $i \in \{1, 2, 3, 4, 5\}$ [same as the centroids in the VQ codebook of equation (185)] (186)

and variances $\sigma_i^2$,


$\sigma_i^2 = 0.0075$, $i \in \{1, 2, 3, 4, 5\}$. (187)

The resulting pdf $p(x \mid \lambda)$ is plotted in Figure 18. (Note that since $\lambda$ in equation (183) is a single-state HMM, the semicontinuous and continuous HMM representations for this problem are the same.) Clearly, the continuous-valued HMM is able to represent continuous distributions more accurately.

[Figure 18: the continuous-valued pdf $p(x \mid \lambda)$ for $x \in [0, 1)$, a smooth mixture-of-Gaussians profile ranging from 0 to about 2.5.]
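The contrast between Figures 17 and 18 is easy to reproduce; the sketch below (ours) evaluates the two densities of this example on a grid. Dividing each cell's probability mass by the cell width 1/5 to obtain the discrete density is our reading of Figure 17.

```python
import numpy as np

B = np.array([0.0, 0.5, 0.3, 0.2, 0.0])   # output probabilities, eq. (183)
mu = (2 * np.arange(1, 6) - 1) / 10.0      # centroids / means, eqs. (185), (186)
var = 0.0075                               # common variance, eq. (187)

def p_discrete(x):
    """Piecewise-constant pdf p(x | VQ, lambda): mass of the cell
    containing x divided by the cell width 1/5 (Figure 17)."""
    k = np.minimum((x * 5).astype(int), 4)
    return B[k] / (1 / 5)

def p_continuous(x):
    """Mixture-of-Gaussians pdf p(x | lambda) (Figure 18)."""
    comps = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return comps @ B

x = np.linspace(0, 1, 201, endpoint=False)
print(p_discrete(x).max(), p_continuous(x).max())   # about 2.5 and 2.3
```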

Evaluation, decoding and training of HMMs with continuous and semicontinuous output probability distributions will not be covered here. We simply note that the algorithms previously developed for discrete-output HMMs extend in a straightforward manner to the continuous and semicontinuous case.

9. Implementation issues

A. Discretization compensation

The discrete pdf for the single-state HMM in equation (183) illustrates a key shortcoming of applying discrete-output HMMs to continuous-valued data (as is most often the case in real-world applications). Consider, for example, two real-data sequences $X_1$ and $X_2$:

$X_1 = \{0.21, 0.31, 0.33, 0.64, 0.59, 0.27\}$ (188)

$X_2 = \{0.19, 0.31, 0.33, 0.64, 0.59, 0.27\}$ (189)

which differ only in the first element (0.21 vs. 0.19). Assuming the VQ codebook in (185) and the single-state HMM in (183), we can evaluate $P(X_1 \mid \lambda)$ and $P(X_2 \mid \lambda)$. First we convert $X_1$ and $X_2$ to sequences of discrete observables, $O_1$ and $O_2$, respectively:

$O_1 = \{2, 2, 2, 4, 3, 2\}$ (190)

$O_2 = \{1, 2, 2, 4, 3, 2\}$ (191)

Then, evaluating the probabilities of each sequence given $\lambda$,

$\ln P(X_1 \mid \lambda) = \ln P(O_1 \mid \lambda) = 4\ln(0.5) + \ln(0.3) + \ln(0.2) = -5.586$ (192)

$\ln P(X_2 \mid \lambda) = \ln P(O_2 \mid \lambda) = \ln(0.0) + 3\ln(0.5) + \ln(0.3) + \ln(0.2) = -\infty$ (193)

Thus, even though the two real sequences $X_1$ and $X_2$ are almost identical, they evaluate to radically different log-probabilities. Note that the same result could have been obtained for arbitrarily long sequences $X_1$ and $X_2$ which differ in only one element by some small $\epsilon$.


This singularity result is not desirable; it suggests that discrete-output HMMs may not be very robust to noise. Remember that HMMs are necessarily trained on finite-length sequences, so that rare events with nonzero probability may be possible yet, at the same time, may not be reflected in the training data (i.e. may never have been observed). The probabilities corresponding to such events will therefore converge to zero during HMM training. Alternatively, a sample sequence may have spurious readings due to sensor failure, etc., and such sequences will evaluate to zero probability on HMMs previously trained on less noisy data.

There are two basic ways to deal with this problem through discretization compensation: (1) semicontinuous evaluation, and (2) flooring. In semicontinuous evaluation, the HMM is first trained on discrete data (vector-quantized from real data). When new sequences of real data need to be evaluated, we assume that the VQ codebook previously generated represents a mixture of Gaussians with some uniform variance $\sigma^2$ that can be thought of as a smoothing parameter.

By far the most common method of discretization compensation, however, is flooring (also referred to as smoothing). Flooring is simple and effective. As before, discrete-output HMMs are trained normally. Once the HMM training has converged, zero entries in the resulting model $\lambda = \{A, B, \pi\}$ are replaced with some small $\epsilon$ (typically around 0.0001). Rows in the $A$ matrix, columns in the $B$ matrix, and the $\pi$ vector are then renormalized to fit the probabilistic constraints,

$\sum_{j=1}^{N} a_{ij} = 1$, $i \in \{1, \ldots, N\}$ (194)

$\sum_{k=1}^{L} b_j(k) = 1$, $j \in \{1, \ldots, N\}$ (195)

$\sum_{i=1}^{N} \pi_i = 1$ (196)

For example, the $B$ matrix for the single-state HMM in equation (183) would be modified for $\epsilon = 0.01$,

$B_f = \bigl[\,0.01/1.02\;\;\;0.50/1.02\;\;\;0.30/1.02\;\;\;0.20/1.02\;\;\;0.01/1.02\,\bigr]^T$ (197)

through flooring. With the floored matrix $B_f$, the observation sequences in equations (190) and (191) evaluate to,

$\ln P(O_1 \mid \lambda) = -5.705$ (198)

$\ln P(O_2 \mid \lambda) = -9.617$ (199)

Clearly, the flooring procedure ensures that observation sequences evaluate to finite log-probabilities. Note that the log-probability for $O_1$ changes little from $-5.586$; however, the log-probability for $O_2$ becomes finite. The log-probabilities in (198) and (199) still appear significantly different primarily because the fraction of near-zero observables for $O_2$ (1/6) is relatively high for this simple example. Typically, rare events that lead to zero probabilities in the hidden Markov model occur much less frequently.
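A minimal sketch (ours) of flooring the $B$ matrix, reproducing equations (197) through (199) for the single-state example:

```python
import numpy as np

B = np.array([0.0, 0.5, 0.3, 0.2, 0.0])   # single-state B, eq. (183)
eps = 0.01

# Flooring: replace zeros with eps, then renormalize, eqs. (195), (197)
B_f = np.where(B == 0.0, eps, B)
B_f /= B_f.sum()

O1 = [2, 2, 2, 4, 3, 2]   # eq. (190), observables numbered 1..5
O2 = [1, 2, 2, 4, 3, 2]   # eq. (191)

def log_prob(O, b):
    """ln P(O | lambda) for a single-state HMM (A = [1], pi = [1])."""
    with np.errstate(divide='ignore'):                 # allow ln(0) = -inf
        return sum(np.log(b[k - 1]) for k in O)

print(log_prob(O1, B), log_prob(O2, B))     # -5.586, -inf;   eqs. (192), (193)
print(log_prob(O1, B_f), log_prob(O2, B_f)) # -5.705, -9.617; eqs. (198), (199)
```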

B. Scaling for the forward-backward algorithm

The forward-backward algorithm, as previously presented, suffers from numerical underflow problems if implemented on any finite-precision computer. Recall the forward algorithm,


1. Initialization:

$\alpha_1(i) = \pi_i b_i(O_1)$, $i \in \{1, \ldots, N\}$ (200)

2. Induction:

$\alpha_{t+1}(j) = \Bigl[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Bigr] b_j(O_{t+1})$, $t \in \{1, \ldots, T-1\}$, $j \in \{1, \ldots, N\}$ (201)

3. Completion:

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$ (202)

for evaluating $P(O \mid \lambda)$, and the analogous backward algorithm,

1. Initialization:

$\beta_T(i) = 1$, $i \in \{1, \ldots, N\}$ (203)

2. Induction:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$, $t \in \{T-1, \ldots, 1\}$, $i \in \{1, \ldots, N\}$ (204)

Note from equations (201) and (204) that the induction steps of the forward and backward algorithms, respectively, multiply together numbers less than or equal to one. This implies that for large $T$ (i.e. long observation sequences),

$\alpha_T(i) \rightarrow 0$ and $\beta_1(i) \rightarrow 0$. (205)

Consider, for example, the simple three-state HMM with four output observables shown in Figure 19 below. This HMM was used to generate a sample observation sequence $O$ of $T = 100$ observables. For this observation sequence, we compute $\alpha_t(i)$, $t \in \{1, \ldots, 100\}$, $i \in \{1, 2, 3\}$, and plot $\log_{10} \alpha_t(i)$ as a function of $t$ in Figure 20. Note that even for this relatively simple HMM and relatively short observation sequence,

$\alpha_{100}(i) \approx 10^{-60}$ (206)

[Figure 19: three-state HMM with self-transition probabilities 0.4, 0.6 and 0.8, and cross-transition probabilities between 0.1 and 0.3.]

The numerical underflow problem becomes even more severe when the number of possible observables $L$ increases (since the $b_j(k)$ values will become smaller on average as $L$ increases). To illustrate this, we generate another HMM with the same state-transition matrix $A$ as before, but with an output-probability distribution matrix $B$ with $L = 100$ (i.e. 100 possible observables). We again compute $\alpha_t(i)$, $t \in \{1, \ldots, 100\}$, $i \in \{1, 2, 3\}$, and plot $\log_{10} \alpha_t(i)$ as a function of $t$ in Figure 21. Note that now,


$\alpha_{100}(i) \approx 10^{-180}$ (207)

[Figure 20: $\log_{10} \alpha_t(i)$ vs. $t$ for $L = 4$; the values fall to about $10^{-60}$ by $t = 100$. Figure 21: $\log_{10} \alpha_t(i)$ vs. $t$ for $L = 100$ (with the $L = 4$ curves shown for comparison); the values fall to about $10^{-180}$.]

Clearly, for arbitrarily long observation sequences, the forward-backward algorithm will not be numerically stable. Therefore, in order to make the algorithm numerically stable, we introduce scaling of the forward and backward variables that will keep the values of those variables within the dynamic range of a typical floating-point computer.
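The collapse in equations (206) and (207) is easy to reproduce. Below is a small sketch (ours) of the unscaled forward recursion of equations (200) to (202), run on an arbitrary random model rather than the exact HMM of Figure 19:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, T = 3, 100, 100

# Random row-stochastic A (N x N), B (N x L) and uniform pi: stand-ins
# for the model of Figure 19 with L = 100 observables.
A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((N, L)); B /= B.sum(axis=1, keepdims=True)
pi = np.full(N, 1.0 / N)
O = rng.integers(0, L, size=T)          # arbitrary observation sequence

alpha = pi * B[:, O[0]]                 # eq. (200)
for t in range(1, T):                   # eq. (201)
    alpha = (alpha @ A) * B[:, O[t]]

print(alpha)   # severe underflow: roughly 1e-200 here, cf. eqs. (206), (207)
```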

Below, we present the scaled forward algorithm:

1. Initialization:

$\tilde{\alpha}_1(i) = \pi_i b_i(O_1)$, $i \in \{1, \ldots, N\}$ (208)

$\hat{\alpha}_1(i) = c_1 \tilde{\alpha}_1(i)$, $i \in \{1, \ldots, N\}$ (scaling) (209)

2. Induction:

$\tilde{\alpha}_{t+1}(j) = \Bigl[\sum_{i=1}^{N} \hat{\alpha}_t(i)\, a_{ij}\Bigr] b_j(O_{t+1})$, $t \in \{1, \ldots, T-1\}$, $j \in \{1, \ldots, N\}$ (210)

$\hat{\alpha}_{t+1}(j) = c_{t+1} \tilde{\alpha}_{t+1}(j)$, $t \in \{1, \ldots, T-1\}$, $j \in \{1, \ldots, N\}$ (scaling) (211)


$c_t = \Bigl[\sum_{i=1}^{N} \tilde{\alpha}_t(i)\Bigr]^{-1}$, $t \in \{1, \ldots, T\}$ (scaling) (212)

3. Completion:

$P(O \mid \lambda) = \dfrac{1}{\prod_{t=1}^{T} c_t}$ (213)

In the scaled forward algorithm, the forward variables are scaled at each time step $t$ by a scaling coefficient $c_t$ so that the scaled forward variables $\hat{\alpha}_t(i)$ meet the following constraint:

$\sum_{i=1}^{N} \hat{\alpha}_t(i) = 1$, $t \in \{1, \ldots, T\}$ (214)

By comparing the scaled forward algorithm [equations (208) to (213)] with the unscaled forward algorithm [equations (200) to (202)], we observe that the scaled forward variables $\hat{\alpha}_t(i)$ are related to the forward variables $\alpha_t(i)$ of the unscaled forward algorithm by the following relationship:

$\hat{\alpha}_t(i) = \Bigl[\prod_{\tau=1}^{t} c_\tau\Bigr] \alpha_t(i)$ (215)

For $t = T$,

$\hat{\alpha}_T(i) = \Bigl[\prod_{t=1}^{T} c_t\Bigr] \alpha_T(i)$ (216)

so that from equation (214),

$\sum_{i=1}^{N} \hat{\alpha}_T(i) = \Bigl[\prod_{t=1}^{T} c_t\Bigr] \sum_{i=1}^{N} \alpha_T(i) = 1$ (217)

Since,

$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$, (218)

equation (217) can be written as,

$\Bigl[\prod_{t=1}^{T} c_t\Bigr] P(O \mid \lambda) = 1$ (219)

so that we arrive at equation (213),

$P(O \mid \lambda) = \dfrac{1}{\prod_{t=1}^{T} c_t}$ (220)

For large $T$, $P(O \mid \lambda)$ will typically be very small, so that in practice we compute equation (220) instead as,

$\log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t$ (221)
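Here is a minimal sketch (ours) of the scaled forward algorithm of equations (208) to (213), returning $\log P(O \mid \lambda)$ via equation (221):

```python
import numpy as np

def scaled_forward(A, B, pi, O):
    """Scaled forward algorithm, eqs. (208)-(213).

    Returns the scaled variables alpha_hat (T x N), the scaling
    coefficients c (T,), and log P(O | lambda) from eq. (221).
    """
    T, N = len(O), len(pi)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)

    alpha_tilde = pi * B[:, O[0]]                   # eq. (208)
    c[0] = 1.0 / alpha_tilde.sum()                  # eq. (212)
    alpha_hat[0] = c[0] * alpha_tilde               # eq. (209)

    for t in range(1, T):
        alpha_tilde = (alpha_hat[t - 1] @ A) * B[:, O[t]]   # eq. (210)
        c[t] = 1.0 / alpha_tilde.sum()                      # eq. (212)
        alpha_hat[t] = c[t] * alpha_tilde                   # eq. (211)

    log_prob = -np.log(c).sum()                     # eq. (221)
    return alpha_hat, c, log_prob
```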

Once the scaled forward algorithm has been executed, the scaled backward algorithm follows:


1. Initialization:

$\tilde{\beta}_T(i) = 1$, $i \in \{1, \ldots, N\}$ (222)

$\hat{\beta}_T(i) = c_T \tilde{\beta}_T(i)$, $i \in \{1, \ldots, N\}$ (scaling) (223)

2. Induction:

$\tilde{\beta}_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \hat{\beta}_{t+1}(j)$, $t \in \{T-1, \ldots, 1\}$, $i \in \{1, \ldots, N\}$ (224)

$\hat{\beta}_t(i) = c_t \tilde{\beta}_t(i)$, $t \in \{T-1, \ldots, 1\}$, $i \in \{1, \ldots, N\}$ (scaling) (225)

Note that for the scaled backward algorithm, we use the same scaling coefficients $c_t$ computed in the scaled forward algorithm. The scaled backward variables $\hat{\beta}_t(i)$ are related to the backward variables $\beta_t(i)$ of the unscaled backward algorithm by the following relationship:

$\hat{\beta}_t(i) = \Bigl[\prod_{\tau=t}^{T} c_\tau\Bigr] \beta_t(i)$ (226)

As an example of the scaled forward-backward algorithm, let us compute $\hat{\alpha}_t(i)$, $t \in \{1, \ldots, 100\}$, $i \in \{1, 2, 3\}$, and plot $\log_{10} \hat{\alpha}_t(i)$ as a function of $t$ for the three-state HMM with $L = 100$ for which we previously computed the unscaled variables $\alpha_t(i)$ (Figure 22). Note that now the scaled forward variables stay well within the floating-point range of a typical computer, and that this result holds no matter how large $T$ is.

[Figure 22: $\log_{10} \hat{\alpha}_t(i)$ vs. $t$ for $L = 100$; the scaled variables remain between about $10^{-2.5}$ and $10^{0}$.]

C. Scaling and the Baum-Welch algorithm

Previously, we derived the following iterative training algorithm for optimizing the parameters of a hidden Markov model $\lambda$ given some observation sequence $O$,

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}$ (227)


$\bar{b}_j(k) = \dfrac{\sum_{t \,:\, O_t = v_k} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}$ (228)

in terms of the unscaled forward-backward variables, where $\lambda = \{A, B, \pi\}$ is the current estimate of the model, and $\bar{\lambda} = \{\bar{A}, \bar{B}, \bar{\pi}\}$ is the new estimate of the HMM. Let us consider how the update equations (227) and (146) change if expressed in terms of the scaled forward-backward variables. Consider equation (227) first.

First let us substitute $\hat{\alpha}_t(i)$ for $\alpha_t(i)$ and $\hat{\beta}_t(i)$ for $\beta_t(i)$ in (227) and see how we need to modify the resulting equation to make it equivalent to (227):

$\bar{a}_{ij} \overset{?}{=} \dfrac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, a_{ij}\, b_j(O_{t+1})\, \hat{\beta}_{t+1}(j)}{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, \hat{\beta}_t(i)}$ (229)

Now, combine equation (229) with equations (215) and (226) so that,

$\bar{a}_{ij} \overset{?}{=} \dfrac{\sum_{t=1}^{T-1} \Bigl[\prod_{\tau=1}^{t} c_\tau\Bigr] \alpha_t(i)\, a_{ij}\, b_j(O_{t+1}) \Bigl[\prod_{\tau=t+1}^{T} c_\tau\Bigr] \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \Bigl[\prod_{\tau=1}^{t} c_\tau\Bigr] \alpha_t(i) \Bigl[\prod_{\tau=t}^{T} c_\tau\Bigr] \beta_t(i)}$ (230)

$\bar{a}_{ij} \overset{?}{=} \dfrac{\Bigl[\prod_{\tau=1}^{T} c_\tau\Bigr] \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\Bigl[\prod_{\tau=1}^{T} c_\tau\Bigr] \sum_{t=1}^{T-1} c_t\, \alpha_t(i)\, \beta_t(i)}$ (231)

From equation (220),

$\bar{a}_{ij} \overset{?}{=} \dfrac{\dfrac{1}{P(O \mid \lambda)} \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\dfrac{1}{P(O \mid \lambda)} \sum_{t=1}^{T-1} c_t\, \alpha_t(i)\, \beta_t(i)}$ (232)

Therefore, to express equation (227) in terms of the scaled forward-backward variables, we simply need to divide $\hat{\alpha}_t(i)\hat{\beta}_t(i)$ in the denominator of equation (229) by $c_t$. The correct update equation in terms of the scaled forward-backward variables is therefore given by,

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, a_{ij}\, b_j(O_{t+1})\, \hat{\beta}_{t+1}(j)}{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, \hat{\beta}_t(i)\, /\, c_t}$ (233)


Now let us consider the update for the output probability matrix $B$ in equation (146). Again, let us substitute $\hat{\alpha}_t(j)$ for $\alpha_t(j)$ and $\hat{\beta}_t(j)$ for $\beta_t(j)$,

$\bar{b}_j(k) \overset{?}{=} \dfrac{\sum_{t \,:\, O_t = v_k} \hat{\alpha}_t(j)\, \hat{\beta}_t(j)}{\sum_{t=1}^{T} \hat{\alpha}_t(j)\, \hat{\beta}_t(j)}$ (234)

Now, combine equation (234) with equations (215) and (226) so that,

$\bar{b}_j(k) \overset{?}{=} \dfrac{\sum_{t \,:\, O_t = v_k} \Bigl[\prod_{\tau=1}^{t} c_\tau\Bigr] \alpha_t(j) \Bigl[\prod_{\tau=t}^{T} c_\tau\Bigr] \beta_t(j)}{\sum_{t=1}^{T} \Bigl[\prod_{\tau=1}^{t} c_\tau\Bigr] \alpha_t(j) \Bigl[\prod_{\tau=t}^{T} c_\tau\Bigr] \beta_t(j)}$ (235)

$\bar{b}_j(k) \overset{?}{=} \dfrac{\dfrac{1}{P(O \mid \lambda)} \sum_{t \,:\, O_t = v_k} c_t\, \alpha_t(j)\, \beta_t(j)}{\dfrac{1}{P(O \mid \lambda)} \sum_{t=1}^{T} c_t\, \alpha_t(j)\, \beta_t(j)}$ (236)

Therefore, to express equation (146) in terms of the scaled forward-backward variables, we need to divide $\hat{\alpha}_t(j)\hat{\beta}_t(j)$ in both the numerator and denominator of equation (234) by $c_t$. The correct update equation in terms of the scaled forward-backward variables is therefore given by,

$\bar{b}_j(k) = \dfrac{\sum_{t \,:\, O_t = v_k} \hat{\alpha}_t(j)\, \hat{\beta}_t(j)\, /\, c_t}{\sum_{t=1}^{T} \hat{\alpha}_t(j)\, \hat{\beta}_t(j)\, /\, c_t}$ (237)

Equations (233) and (237) represent the Baum-Welch reestimation formulas in terms of the scaled forward-backward variables.
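A minimal sketch (ours) of one Baum-Welch iteration using the scaled variables, implementing equations (233) and (237). It reuses the `scaled_forward` sketch given earlier and assumes a companion `scaled_backward(A, B, O, c)` implementing equations (222) to (225); the $\pi$ update is omitted.

```python
import numpy as np

def baum_welch_step(A, B, pi, O):
    """One scaled Baum-Welch reestimation of A and B, eqs. (233), (237)."""
    T, N, L = len(O), A.shape[0], B.shape[1]
    alpha_hat, c, _ = scaled_forward(A, B, pi, O)   # eqs. (208)-(213)
    beta_hat = scaled_backward(A, B, O, c)          # eqs. (222)-(225), same c_t

    # Numerator of eq. (233): sum over t of alpha_hat_t(i) a_ij b_j(O_{t+1}) beta_hat_{t+1}(j)
    A_num = np.zeros((N, N))
    for t in range(T - 1):
        A_num += alpha_hat[t][:, None] * A * (B[:, O[t + 1]] * beta_hat[t + 1])[None, :]

    gamma = alpha_hat * beta_hat / c[:, None]       # alpha_hat beta_hat / c_t
    A_new = A_num / gamma[:T - 1].sum(axis=0)[:, None]      # eq. (233)

    B_new = np.zeros((N, L))
    for k in range(L):                                      # eq. (237)
        B_new[:, k] = gamma[np.asarray(O) == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new
```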

D. Scaling and the Viterbi algorithm

The Viterbi algorithm for finding the most likely state sequence $Q$ given $O$ and $\lambda$ was previously derived as:

1. Initialization:

$\delta_1(i) = P(q_1 = S_i, O_1 \mid \lambda)$ (238)

$\delta_1(i) = \pi_i b_i(O_1)$, $i \in \{1, \ldots, N\}$ (239)

2. Induction:

$\delta_t(j) = \bigl[\max_i \delta_{t-1}(i)\, a_{ij}\bigr]\, b_j(O_t)$, $j \in \{1, \ldots, N\}$, $t \in \{2, \ldots, T\}$ (240)

$\psi_t(j) = \operatorname{argmax}_i \bigl[\delta_{t-1}(i)\, a_{ij}\bigr]$, $j \in \{1, \ldots, N\}$, $t \in \{2, \ldots, T\}$ (241)


3. Termination:

$P^* = \max_Q P(Q, O \mid \lambda)$ (242)

$P^* = \max_i \delta_T(i)$ (243)

$q_T^* = \operatorname{argmax}_i \delta_T(i)$ (244)

4. Path (state-sequence) backtracking:

$q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t \in \{T-1, T-2, \ldots, 1\}$ (245)

where $Q = \{q_t^*\}$, $t \in \{1, \ldots, T\}$, is the most likely state sequence.

This algorithm is easily modified to eliminate numerical underflow problems by defining,

$\phi_t(i) = \max_{q_1, \ldots, q_{t-1}} \log P(q_1, \ldots, q_t = S_i, O_1, \ldots, O_t \mid \lambda)$ (246)

instead of,

$\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1, \ldots, q_t = S_i, O_1, \ldots, O_t \mid \lambda)$ (247)

Consequently, equation (239) changes to,

$\phi_1(i) = \log \pi_i + \log b_i(O_1)$, $i \in \{1, \ldots, N\}$; (248)

equation (240) changes to,

$\phi_t(j) = \max_i \bigl[\phi_{t-1}(i) + \log a_{ij}\bigr] + \log b_j(O_t)$, $j \in \{1, \ldots, N\}$, $t \in \{2, \ldots, T\}$; (249)

equation (241) changes to,

$\psi_t(j) = \operatorname{argmax}_i \bigl[\phi_{t-1}(i) + \log a_{ij}\bigr]$, $j \in \{1, \ldots, N\}$, $t \in \{2, \ldots, T\}$; (250)

and equation (243) changes to,

$\log P^* = \max_i \phi_T(i)$ (251)
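The log-space version amounts to a few-line change to the earlier Viterbi sketch. Below is a minimal version (ours) of equations (248) to (251), letting zero probabilities become $-\infty$:

```python
import numpy as np

def viterbi_log(A, B, pi, O):
    """Log-space Viterbi, eqs. (248)-(251); zero probabilities map to -inf."""
    with np.errstate(divide='ignore'):
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

    T, N = len(O), len(pi)
    phi = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)

    phi[0] = logpi + logB[:, O[0]]                      # eq. (248)
    for t in range(1, T):
        scores = phi[t - 1][:, None] + logA             # phi_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)                  # eq. (250)
        phi[t] = scores.max(axis=0) + logB[:, O[t]]     # eq. (249)

    q = np.zeros(T, dtype=int)
    q[-1] = phi[-1].argmax()
    for t in range(T - 2, -1, -1):                      # backtracking, eq. (245)
        q[t] = psi[t + 1][q[t + 1]]
    return q, phi[-1].max()                             # log P*, eq. (251)
```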

E. Multiple observation sequences

In many applications, rather than train a hidden Markov model on one long observation sequence $O$, we instead would like to train the model on $M$ short sequences $\{O^{(m)}\}$, $m \in \{1, \ldots, M\}$. For example, in speech recognition applications, where each HMM may represent one spoken word, we would like to train that HMM on many different utterances of that word. In handwritten gesture recognition, where each HMM may represent one of several different gestures (e.g. written letters), we again would like to train those HMMs on many different examples of that gesture.

Training a single HMM on multiple observation sequences leads to two changes in the general HMM framework: (1) the HMM structure that is chosen for those applications, and (2) a modified training algorithm for handling multiple observation sequences.

When training on multiple, relatively short observation sequences, typically the HMM model is restricted to a left-to-right structure. Consider, for example, the left-to-right model in Figure 23.

[Figure 23: five-state left-to-right HMM with states $S_1$ through $S_5$.]

Except for self-transitions, connections are only allowed in one direction. A fundamental property of all left-to-right HMMs is that the state transition probabilities are restricted to,

$a_{ij} = 0$, $\forall j < i$ (252)


State transitions may further be restricted to prevent large changes in state indices. For the HMM above, for example, the state-transition matrix is given by,

$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0 & 0 \\ 0 & a_{22} & a_{23} & a_{24} & 0 \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & 0 & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$ (253)
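In practice, a structure like equation (253) is usually imposed simply by initializing the forbidden transitions to zero; since the Baum-Welch update (233) multiplies by $a_{ij}$, zero entries stay zero throughout training. A minimal sketch (ours; the `max_jump` parameter is illustrative, with `max_jump = 2` matching equation (253)):

```python
import numpy as np

def left_to_right_A(N, max_jump=2):
    """Random left-to-right transition matrix: a_ij = 0 for j < i, eq. (252),
    and for j > i + max_jump, as in eq. (253)."""
    A = np.triu(np.random.rand(N, N))   # zero below the diagonal
    A = np.tril(A, k=max_jump)          # limit forward jumps
    A[-1, :] = 0.0
    A[-1, -1] = 1.0                     # final state is absorbing
    return A / A.sum(axis=1, keepdims=True)

print(left_to_right_A(5).round(2))
```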

The modification of the Baum-Welch reestimation formulas for multiple observation sequences $\{O^{(m)}\}$, $m \in \{1, \ldots, M\}$, is straightforward, if we assume that each observation sequence $O^{(m)}$ is independent of every other observation sequence in the training set. Our goal now is to optimize the parameters in $\lambda$ to maximize,

$P(\mathbf{O} \mid \lambda) = \prod_{m=1}^{M} P(O^{(m)} \mid \lambda)$ (254)

Since the reestimation formulas are based on frequencies of occurrence of various events, the reestimation formulas for multiple observation sequences are modified by adding together the individual frequencies of occurrence for each sequence. In terms of the unscaled variables,

$\bar{a}_{ij} = \dfrac{\sum_{m=1}^{M} \dfrac{1}{P(O^{(m)} \mid \lambda)} \sum_{t=1}^{T_m - 1} \alpha_t^{(m)}(i)\, a_{ij}\, b_j(O_{t+1}^{(m)})\, \beta_{t+1}^{(m)}(j)}{\sum_{m=1}^{M} \dfrac{1}{P(O^{(m)} \mid \lambda)} \sum_{t=1}^{T_m - 1} \alpha_t^{(m)}(i)\, \beta_t^{(m)}(i)}$ (255)

$\bar{b}_j(k) = \dfrac{\sum_{m=1}^{M} \dfrac{1}{P(O^{(m)} \mid \lambda)} \sum_{t \,:\, O_t^{(m)} = v_k} \alpha_t^{(m)}(j)\, \beta_t^{(m)}(j)}{\sum_{m=1}^{M} \dfrac{1}{P(O^{(m)} \mid \lambda)} \sum_{t=1}^{T_m} \alpha_t^{(m)}(j)\, \beta_t^{(m)}(j)}$ (256)

Note that the $1/P(O^{(m)} \mid \lambda)$ terms were cancelled out of the numerator and denominator previously for the case of a single observation sequence. For multiple observation sequences, these terms no longer cancel out. In terms of the scaled variables,


$\bar{a}_{ij} = \dfrac{\sum_{m=1}^{M} \sum_{t=1}^{T_m - 1} \hat{\alpha}_t^{(m)}(i)\, a_{ij}\, b_j(O_{t+1}^{(m)})\, \hat{\beta}_{t+1}^{(m)}(j)}{\sum_{m=1}^{M} \sum_{t=1}^{T_m - 1} \hat{\alpha}_t^{(m)}(i)\, \hat{\beta}_t^{(m)}(i)\, /\, c_t^{(m)}}$ (257)

$\bar{b}_j(k) = \dfrac{\sum_{m=1}^{M} \sum_{t \,:\, O_t^{(m)} = v_k} \hat{\alpha}_t^{(m)}(j)\, \hat{\beta}_t^{(m)}(j)\, /\, c_t^{(m)}}{\sum_{m=1}^{M} \sum_{t=1}^{T_m} \hat{\alpha}_t^{(m)}(j)\, \hat{\beta}_t^{(m)}(j)\, /\, c_t^{(m)}}$ (258)

Note that the $1/P(O^{(m)} \mid \lambda)$ terms no longer need to be explicitly included in equations (257) and (258), since equations (232) and (236) show that the scaled forward-backward variables implicitly include the $1/P(O^{(m)} \mid \lambda)$ terms. Finally, note that equations (110) and (111) in Rabiner's HMM tutorial paper [1] both contain errors, which equations (257) and (258) fix.
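In code, equations (257) and (258) just accumulate the per-sequence numerators and denominators of the single-sequence update. A minimal outline (ours), reusing the hypothetical `scaled_forward` and `scaled_backward` helpers from before:

```python
import numpy as np

def baum_welch_multi(A, B, pi, sequences):
    """One reestimation of A and B from several sequences, eqs. (257), (258)."""
    N, L = A.shape[0], B.shape[1]
    A_num = np.zeros((N, N)); A_den = np.zeros(N)
    B_num = np.zeros((N, L)); B_den = np.zeros(N)

    for O in sequences:                                  # outer sums over m
        T = len(O)
        alpha_hat, c, _ = scaled_forward(A, B, pi, O)
        beta_hat = scaled_backward(A, B, O, c)
        gamma = alpha_hat * beta_hat / c[:, None]        # alpha_hat beta_hat / c_t

        for t in range(T - 1):
            A_num += alpha_hat[t][:, None] * A * (B[:, O[t + 1]] * beta_hat[t + 1])[None, :]
        A_den += gamma[:T - 1].sum(axis=0)

        for k in range(L):
            B_num[:, k] += gamma[np.asarray(O) == k].sum(axis=0)
        B_den += gamma.sum(axis=0)

    return A_num / A_den[:, None], B_num / B_den[:, None]
```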

10. Baum-Welch convergence

Below, we illustrate some convergence issues of the Baum-Welch algorithm for a simple example and a real-world example.

A. Simple example

Here we conduct the following simple experiment to investigate how the convergence of the Baum-Welch algorithm is affected by the number of states that we assume for our model. We first generate an observation sequence $O$ of length $T = 10{,}000$ for the two-observable, two-state hidden Markov model $\lambda$ in Figure 24, where,

$A = \begin{bmatrix} 0.1 & 0.9 \\ 0.9 & 0.1 \end{bmatrix}$, $B = \begin{bmatrix} 0.1 & 0.9 \\ 0.9 & 0.1 \end{bmatrix}$, $\pi = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$. (259)

[Figure 24: the two-state generating HMM $\lambda$, with self-transition probabilities 0.1 and cross-transition probabilities 0.9.]

We then train three HMMs $\lambda_i$, $i \in \{1, 2, 3\}$, from random initial parameter settings, where $\lambda_i$ has $N = i$ states. The three resulting HMMs are shown in Figure 25 on the next page. Finally, we compute the normalized probability measure (normalized with respect to $T$, the length of the observation sequence),

$\tilde{P}(O \mid \lambda_i) = 10^{\log_{10} P(O \mid \lambda_i) / T} = P(O \mid \lambda_i)^{1/T}$ (260)

Table 1 below reports the resulting $\tilde{P}(O \mid \lambda_i)$ values. The final single-state HMM $\lambda_1$ cannot encode any sequential structure, and therefore captures only the aggregate distribution of observables (1/2 and 1/2). The two-state HMM $\lambda_2$ converges to the generating HMM in the above figure, which is encouraging. Although the Baum-Welch algorithm is only guaranteed to find a local maximum of the log-probability function, in this case it converges to the globally optimal value. Note also that $\tilde{P}(O \mid \lambda_i)$ improves significantly from $\lambda_1$ to


$\lambda_2$. This implies that two states are able to capture sequential characteristics of the observation sequence $O$ which the one-state HMM cannot capture. In this case, we know this to be true since the observation sequence $O$ was generated from a two-state HMM.

When we increase the number of states to three, the Baum-Welch algorithm converges to $\lambda_3$ in the figure above. Although $\lambda_3$ has significantly more representational power than $\lambda_2$, the normalized probability $\tilde{P}(O \mid \lambda_i)$ does not appreciably increase. This indicates that the two-state model is probably sufficient for modeling the sequential properties of $O$, and adding more states will not improve the model substantially. In fact, note that for the model $\lambda_3$, the single left state of $\lambda_2$ is effectively split up into two states in $\lambda_3$.

[Figure 25: the converged HMMs $\lambda_1$, $\lambda_2$ and $\lambda_3$; $\lambda_2$ recovers the generating transition probabilities (0.1, 0.9), while $\lambda_3$ splits one state of $\lambda_2$ into two. (Note: due to round-off error, some rows of $A$ do not add up to 1.)]

Table 1: Normalized probabilities for HMMs in Figure 25

  $i$    $\tilde{P}(O \mid \lambda_i)$
  1      0.5000
  2      0.5875
  3      0.5875

The above results suggest that, although we frequently do not know the exact number of states that we should use for modeling specific data in real applications, we may arrive at the optimal number of states for a specific observation sequence by training models with a different number of states, and then selecting the smallest (most general) model which yields the largest normalized probability value $\tilde{P}(O \mid \lambda)$. To illustrate this point, we consider a more complicated example below.
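A sketch (ours) of the first step of this experiment: sampling an observation sequence from the generating HMM of equation (259). The normalized measure of equation (260) then follows from the `scaled_forward` sketch of Section 9.B.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.1, 0.9], [0.9, 0.1]])
B = np.array([[0.1, 0.9], [0.9, 0.1]])
pi = np.array([0.5, 0.5])
T = 10_000

# Sample a state path and observation sequence from lambda, eq. (259)
O = np.zeros(T, dtype=int)
q = rng.choice(2, p=pi)
for t in range(T):
    O[t] = rng.choice(2, p=B[q])
    q = rng.choice(2, p=A[q])

# Normalized probability, eq. (260): P(O | lambda)^(1/T)
_, _, log_prob = scaled_forward(A, B, pi, O)
print(np.exp(log_prob / T))   # close to the 0.5875 reported in Table 1
```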


B. Real-world example: modeling human control trajectories

In previous work [5], hidden Markov models have been applied towards modeling and analyzing human control strategies in a dynamic graphic driving simulator (shown in Figure 26). For our example here, we preprocess and vector-quantize (to $L = 64$ levels) a sample human control trajectory in order to generate a discrete observation sequence $O$ of length $T = 3875$. We then train an eight-state ($N = 8$) HMM $\lambda_8$ on this observation sequence using the Baum-Welch reestimation formulas.

[Figure 26: the driving-simulator display, showing the steering wheel, speedometer, compass, map and car.]

It is instructive to look at the parameter values $\{A, B\}$ to which the final model converges. We can do this graphically by plotting the $A$ and $B$ matrices in Figure 27, where darker shades represent smaller values and lighter shades represent larger relative values. Although the model is parameterized by a total of,


$N \times N + N \times L = 576$ free parameters, (261)

(ignoring the $\pi$ vector), the final model retains only 142 nonzero parameters. In effect, the model was automatically reduced in terms of complexity during training. This type of model reduction is very typical of the Baum-Welch training algorithm.

[Figure 27: grayscale plots of the converged $a_{ij}$ ($8 \times 8$) and $b_j(k)$ ($8 \times 64$) matrices; darker shades correspond to smaller values.]

Now, let us conduct the same experiment as in part A. We first use the final HMM $\lambda_8$ to generate a test observation sequence $O'$ of length $T = 10{,}000$, and then train models $\lambda'_i$, $i \in \{1, \ldots, 10\}$, with $N = i$ states, from random initial parameter settings. Figure 28 below plots the normalized probabilities $\tilde{P}(O' \mid \lambda'_i)$ as a function of the number of states $i$. The maximum is given by,

$\max_i \tilde{P}(O' \mid \lambda'_i) = 0.080$ for $i = 9$ (262)

[Figure 28: $\tilde{P}(O' \mid \lambda'_i)$ vs. the number of states $i \in \{1, \ldots, 10\}$, with values ranging up to about 0.08.]

If we evaluate the test sequence $O'$ on the generating HMM $\lambda_8$, we get that,

$\tilde{P}(O' \mid \lambda_8) = 0.085$ (263)

These results indicate that when trained and evaluated on real data, the precise number of states in the hidden Markov model is not as important as the approximate number of states required to sufficiently encode the sequential properties of the underlying data. In the above example, one or two states clearly are insufficient to capture the sequential properties of the generating HMM $\lambda_8$. From equations (262) and (263) it is also evident that the Baum-Welch algorithm (a special case of the general EM algorithm) only converges to a local maximum. Nevertheless, the local maximum to which the training algorithm converges is typically a relatively good local maximum, as is the case here.

Before we leave this topic, a quick word about the rate of convergence of the Baum-Welch algorithm. In a typical training scenario, we execute the Baum-Welch reestimation formulas until the change in the model falls below some threshold $\delta$, such that,

$\dfrac{P(O \mid \lambda(t)) - P(O \mid \lambda(t-1))}{P(O \mid \lambda(t))} < \delta$ (264)


where $\lambda(t)$ denotes the model after the $t$th step of the algorithm. From experience, that threshold should be set to some very small value (e.g. $\delta = 0.000001$) in order to ensure that the model has reached a final local maximum. Consider, for example, the convergence of the Baum-Welch algorithm for $\lambda'_8$ trained on $O'$ as plotted in Figure 29. Note that $\tilde{P}(O' \mid \lambda'_8(t))$ levels out several times before the final local maximum is reached.

[Figure 29: $\tilde{P}(O' \mid \lambda'_8(t))$ vs. Baum-Welch iteration $t$, rising from about 0.02 to about 0.07 over roughly 150 iterations, with several plateaus along the way.]
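A typical training loop implementing the stopping rule of equation (264) might look like the following sketch (ours), reusing the hypothetical `baum_welch_step` and `scaled_forward` helpers; for numerical convenience it applies the relative-change test to the log-probability rather than the probability itself.

```python
import numpy as np

def train(A, B, pi, O, delta=1e-6, max_iter=1000):
    """Run Baum-Welch until the relative change in log P(O | lambda)
    falls below delta, a log-domain variant of eq. (264)."""
    _, _, prev = scaled_forward(A, B, pi, O)
    for _ in range(max_iter):
        A, B = baum_welch_step(A, B, pi, O)
        _, _, log_prob = scaled_forward(A, B, pi, O)
        if abs(log_prob - prev) / abs(log_prob) < delta:   # stopping test
            break
        prev = log_prob
    return A, B
```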

[1] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp. 257-86, 1989.

[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[3] L. R. Rabiner, B. H. Juang, S. E. Levinson and M. M. Sondhi, "Some properties of continuous hidden Markov model representations," AT&T Technical Journal, vol. 64, no. 6, pp. 1211-1222, 1986.

[4] X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov models for speech recognition, Edinburgh University Press, Edinburgh, 1990.

[5] M. C. Nechyba, Learning and validation of human control strategies, Ph.D. Thesis, Carnegie Mellon University, 1998.