
Hidden Markov Models for Speech Recognition

References:
1. Rabiner and Juang, Fundamentals of Speech Recognition, Chapter 6
2. Huang et al., Spoken Language Processing, Chapters 4, 8
3. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989
4. Gales and Young, The Application of Hidden Markov Models in Speech Recognition, Chapters 1-2, 2008
5. Young, "HMMs and Related Speech Recognition Technologies," Chapter 27, Springer Handbook of Speech Processing, Springer, 2007
6. J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, U.C. Berkeley TR-97-021

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

SP - Berlin Chen 2

Hidden Markov Model (HMM): A Brief Overview

History
- Published in papers of Baum in the late 1960s and early 1970s
- Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s (discrete HMMs)
- Then extended to continuous HMMs by Bell Labs

Assumptions
- The speech signal can be characterized as a parametric random (stochastic) process
- The parameters can be estimated in a precise, well-defined manner

Three fundamental problems
- Evaluation of the probability (likelihood) of a sequence of observations given a specific HMM
- Determination of a best sequence of model states
- Adjustment of model parameters so as to best account for the observed signals (or for discrimination purposes)

SP - Berlin Chen 3

Stochastic Process

• A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and generates a sequence of numeric values
  - Each numeric value in the sequence is modeled by a random variable
  - A stochastic process is just a (finite/infinite) sequence of random variables

• Examples
  (a) The sequence of recorded values of a speech utterance
  (b) The sequence of daily prices of a stock
  (c) The sequence of hourly traffic loads at a node of a communication network
  (d) The sequence of radar measurements of the position of an airplane

SP - Berlin Chen 4

Observable Markov Model

• Observable Markov Model (Markov Chain)
  - A first-order Markov chain of N states is a triple (S, A, π)
    • S is a set of N states
    • A is the N×N matrix of transition probabilities between states:
      P(s_t = j | s_{t-1} = i, s_{t-2} = k, ...) ≈ P(s_t = j | s_{t-1} = i) = A_ij
      (first-order and time-invariant assumptions)
    • π is the vector of initial state probabilities, π_j = P(s_1 = j)
  - The output of the process is the sequence of states at each instant of time, where each state corresponds to an observable event
  - The output in any given state is not random (deterministic)
  - Too simple to describe the speech signal characteristics

SP - Berlin Chen 5

Observable Markov Model (cont)

(Figure) First-order Markov chain of 2 states: states S1 and S2 with transition probabilities P(S1|S1), P(S2|S1), P(S1|S2), P(S2|S2)

(Figure) Second-order Markov chain of 2 states: composite states (prev state, cur state) S1S1, S1S2, S2S1, S2S2, with transition probabilities of the form P(S_k | S_i, S_j), e.g. P(S1|S1,S1), P(S2|S1,S1), P(S1|S1,S2), P(S2|S1,S2), P(S1|S2,S1), P(S2|S2,S1), P(S1|S2,S2), P(S2|S2,S2)

SP - Berlin Chen 6

Observable Markov Model (cont)

• Example 1: A 3-state Markov Chain
  State 1 generates symbol A only, State 2 generates symbol B only, and State 3 generates symbol C only

  π = (0.4, 0.5, 0.1)
        | 0.6  0.3  0.1 |
  A =   | 0.1  0.7  0.2 |
        | 0.3  0.2  0.5 |

  - Given a sequence of observed symbols O = {CABBCABC}, the only corresponding state sequence is {S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is
    P(O|λ) = P(S3)P(S1|S3)P(S2|S1)P(S2|S2)P(S3|S2)P(S1|S3)P(S2|S1)P(S3|S2)
           = 0.1×0.3×0.3×0.7×0.2×0.3×0.3×0.2 = 0.00002268

  (Figure: 3-state chain s1, s2, s3 emitting A, B, C, with self-loop probabilities 0.6, 0.7, 0.5)
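As a quick sanity check on the arithmetic above, the minimal C sketch below evaluates the probability of a fully observed state sequence directly from π and A. The values are the ones reconstructed for Example 1, and the function name traceProb is purely illustrative:

#include <stdio.h>

#define N 3
static const double PI[N]   = {0.4, 0.5, 0.1};
static const double A[N][N] = {{0.6, 0.3, 0.1},
                               {0.1, 0.7, 0.2},
                               {0.3, 0.2, 0.5}};

/* Probability of a fully observed state sequence s[0..T-1] (0-based state indices). */
double traceProb(const int *s, int T)
{
    double p = PI[s[0]];
    for (int t = 1; t < T; t++)
        p *= A[s[t-1]][s[t]];
    return p;
}

int main(void)
{
    /* O = CABBCABC  ->  S = S3 S1 S2 S2 S3 S1 S2 S3 (0-based: 2 0 1 1 2 0 1 2) */
    int s[] = {2, 0, 1, 1, 2, 0, 1, 2};
    printf("P(O|lambda) = %g\n", traceProb(s, 8));   /* expected 2.268e-05 */
    return 0;
}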

SP - Berlin Chen 7

Observable Markov Model (cont)

• Example 2: A three-state Markov chain for the Dow Jones Industrial average
  (state 1 = up; initial probabilities π = (0.5, 0.2, 0.3))

  The probability of 5 consecutive up days:
    P(5 consecutive up days) = P(1, 1, 1, 1, 1)
                             = π_1 a_11 a_11 a_11 a_11 = 0.5 × (0.6)^4 = 0.0648

SP - Berlin Chen 8

Observable Markov Model (cont)

• Example 3: Given a Markov model, what is the mean occupancy duration of each state i?

  Probability mass function of duration d in state i (a geometric distribution):
    p_i(d) = (a_ii)^(d-1) (1 - a_ii)

  Expected duration (number of time steps) in state i:
    E[d_i] = Σ_{d=1}^{∞} d · p_i(d)
           = Σ_{d=1}^{∞} d (a_ii)^(d-1) (1 - a_ii)
           = (1 - a_ii) · ∂/∂a_ii [ Σ_{d=1}^{∞} (a_ii)^d ]
           = (1 - a_ii) · ∂/∂a_ii [ a_ii / (1 - a_ii) ]
           = 1 / (1 - a_ii)

  (Figure: p_i(d) versus time (duration) d decays geometrically)

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

• HMM: an extended version of the Observable Markov Model
  - The observation is a probabilistic function (discrete or continuous) of a state, instead of a one-to-one correspondence with a state
  - The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)
    • What is hidden? The state sequence!
      Given the observation sequence, we are not sure which state sequence generated it

• Elements of an HMM (the State-Output HMM): λ = (S, A, B, π)
  - S is a set of N states
  - A is the N×N matrix of transition probabilities between states
  - B is a set of N probability functions, each describing the observation probability with respect to a state
  - π is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

• Two major assumptions
  - First-order (Markov) assumption
    • The state transition depends only on the origin and destination
    • Time-invariant
      P(s_t = j | s_{t-1} = i, s_{t-2} = k, ...) = P(s_t = j | s_{t-1} = i) = A_ij
  - Output-independent assumption
    • All observations are dependent only on the state that generated them, not on neighboring observations
      P(o_t | o_1 ... o_{t-1}, s_1 ... s_t) = P(o_t | s_t)

SP - Berlin Chen 12

Hidden Markov Model (cont)

• Two major types of HMMs, according to the observations
  - Discrete and finite observations
    • The observations that all distinct states generate are finite in number:
      V = {v_1, v_2, v_3, ..., v_M}, v_k ∈ R^L
    • In this case, the set of observation probability distributions B = {b_j(v_k)} is defined as b_j(v_k) = P(o_t = v_k | s_t = j), 1 ≤ k ≤ M, 1 ≤ j ≤ N
      (o_t: observation at time t; s_t: state at time t; for state j, b_j(v_k) consists of only M probability values)

  (Figure: a left-to-right HMM)

SP - Berlin Chen 13

Hidden Markov Model (cont)

• Two major types of HMMs, according to the observations
  - Continuous and infinite observations
    • The observations that all distinct states generate are infinite and continuous, that is, V = {v | v ∈ R^d}
    • In this case, the set of observation probability distributions B = {b_j(v)} is defined as b_j(v) = f_{O|S}(o_t = v | s_t = j), 1 ≤ j ≤ N; b_j(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions:

      b_j(v) = Σ_{k=1}^{M} w_jk N(v; μ_jk, Σ_jk)
             = Σ_{k=1}^{M} w_jk (2π)^(-d/2) |Σ_jk|^(-1/2) exp( -1/2 (v - μ_jk)^T Σ_jk^(-1) (v - μ_jk) )

      (w_jk: mixture weight, μ_jk: mean vector, Σ_jk: covariance matrix, v: observation vector)
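A minimal C sketch of this state-output density, assuming diagonal covariance matrices (the common choice discussed on the next slides). The array layout (mixture weights w, row-major mean and var arrays) and the function name are illustrative only, not from the slides or any toolkit:

#include <math.h>

/* log b_j(v) for one state with M diagonal-covariance Gaussian mixtures of dimension D.
   mean[k*D + d] and var[k*D + d] hold the d-th mean/variance of mixture k. */
double logOutputProb(const double *v, int D, int M,
                     const double *w, const double *mean, const double *var)
{
    double total = 0.0;
    for (int k = 0; k < M; k++) {
        double logp = log(w[k]);
        for (int d = 0; d < D; d++) {
            double diff = v[d] - mean[k*D + d];
            logp += -0.5 * (log(2.0 * 3.141592653589793 * var[k*D + d])
                            + diff * diff / var[k*D + d]);
        }
        total += exp(logp);   /* in practice, accumulate with a log-add to avoid underflow */
    }
    return log(total);
}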

SP - Berlin Chen 14

Hidden Markov Model (cont)

• Multivariate Gaussian Distributions
  - When X = (x_1, x_2, ..., x_d) is a d-dimensional random vector, the multivariate Gaussian pdf has the form:

      f(x; μ, Σ) = N(x; μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp( -1/2 (x - μ)^T Σ^(-1) (x - μ) )

    where μ = E[x] is the d-dimensional mean vector, Σ = E[(x - μ)(x - μ)^T] is the d×d covariance matrix, and |Σ| is its determinant
    The (i,j)-th element of Σ: σ_ij = E[(x_i - μ_i)(x_j - μ_j)]

  - If x_1, x_2, ..., x_d are independent, the covariance matrix is reduced to a diagonal covariance
    • Viewed as d independent scalar Gaussian distributions
    • Model complexity is significantly reduced

SP - Berlin Chen 15

Hidden Markov Model (cont)

• Multivariate Gaussian Distributions (figure of example densities)

SP - Berlin Chen 16

Hidden Markov Model (cont)

• Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

• Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)
  - MFCC: Mel-frequency cepstral coefficients

  (Figures: the two covariance matrices shown as images)

SP - Berlin Chen 17

Hidden Markov Model (cont)

• Multivariate Mixture Gaussian Distributions
  - More complex distributions with multiple local maxima can be approximated by mixtures of Gaussians (each a unimodal distribution):

      f(x) = Σ_{k=1}^{M} w_k N(x; μ_k, Σ_k),   with Σ_{k=1}^{M} w_k = 1

  - Gaussian mixtures with enough mixture components can approximate any distribution

SP - Berlin Chen 18

Hidden Markov Model (cont)

• Example 4: a 3-state discrete (ergodic) HMM

  π = (0.4, 0.5, 0.1)
        | 0.6  0.3  0.1 |
  A =   | 0.1  0.7  0.2 |
        | 0.3  0.2  0.5 |
  b_1: A 0.3, B 0.2, C 0.5    b_2: A 0.7, B 0.1, C 0.2    b_3: A 0.3, B 0.6, C 0.1

  - Given a sequence of observations O = {ABC}, there are 27 possible corresponding state sequences, and therefore the corresponding probability is

      P(O|λ) = Σ_{i=1}^{27} P(O, S_i | λ) = Σ_{i=1}^{27} P(O | S_i, λ) P(S_i | λ)

    E.g., when the state sequence is S_i = {s2 s2 s3}:
      P(O | S_i, λ) = b_2(A) b_2(B) b_3(C) = 0.7 × 0.1 × 0.1 = 0.007
      P(S_i | λ)    = π_2 a_22 a_23        = 0.5 × 0.7 × 0.2 = 0.07

  (Figure: ergodic 3-state HMM with the transition probabilities above)

SP - Berlin Chen 19

Hidden Markov Model (cont)

• Notations:
  - O = {o_1 o_2 o_3 ... o_T}: the observation (feature) sequence
  - S = {s_1 s_2 s_3 ... s_T}: the state sequence
  - λ: model for the HMM, λ = {A, B, π}
  - P(O|λ): the probability of observing O given the model λ
  - P(O|S, λ): the probability of observing O given λ and a state sequence S of λ
  - P(O, S|λ): the probability of observing O and S given λ
  - P(S|O, λ): the probability of observing S given O and λ

• Useful formulas
  - Bayes' rule:
      P(A|B) = P(A, B) / P(B) = P(B|A) P(A) / P(B)
      P(A, B) = P(A|B) P(B) = P(B|A) P(A)   (chain rule)
    Given a model λ describing the probability:
      P(A|B, λ) = P(B|A, λ) P(A|λ) / P(B|λ)

SP - Berlin Chen 20

Hidden Markov Model (cont)

• Useful formulas (cont.)
  - Total Probability Theorem (marginal probability):
      P(A) = Σ_B P(A, B) = Σ_B P(A|B) P(B)        if B is discrete and disjoint
      f(A) = ∫_B f(A|B) f(B) dB                   if B is continuous

  - Independence:
      P(x_1, x_2, ..., x_n) = P(x_1) P(x_2) ... P(x_n)   if x_1, x_2, ..., x_n are independent

  - Expectation:
      E[q(z)] = Σ_k q(k) P(z = k)     (discrete)
      E[q(z)] = ∫ q(z) f(z) dz        (continuous)

  (Figure: Venn diagram of A intersecting disjoint events B1, ..., B5)

SP - Berlin Chen 21

Three Basic Problems for HMM

• Given an observation sequence O = (o_1 o_2 ... o_T) and an HMM λ = (S, A, B, π)
  - Problem 1:
    How to efficiently compute P(O|λ)?
    => Evaluation problem
  - Problem 2:
    How to choose an optimal state sequence S = (s_1, s_2, ..., s_T)?
    => Decoding problem
  - Problem 3:
    How to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
    => Learning / training problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and λ, find P(O|λ) = Prob[observing O given λ]

• Direct Evaluation
  - Evaluate all possible state sequences S of length T that could generate the observation sequence O:

      P(O|λ) = Σ_{all S} P(O, S|λ) = Σ_{all S} P(O|S, λ) P(S|λ)

  - The probability of each path S:
    • By the Markov assumption (first-order HMM),

      P(S|λ) = P(s_1|λ) Π_{t=2}^{T} P(s_t | s_1 ... s_{t-1}, λ)     (by chain rule)
             = P(s_1|λ) Π_{t=2}^{T} P(s_t | s_{t-1}, λ)             (by Markov assumption)
             = π_{s_1} a_{s_1 s_2} a_{s_2 s_3} ... a_{s_{T-1} s_T}

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)
  - The joint output probability along the path S:
    • By the output-independence assumption, the probability that a particular observation symbol/vector is emitted at time t depends only on the state s_t and is conditionally independent of the past observations:

      P(O|S, λ) = P(o_1 o_2 ... o_T | s_1 s_2 ... s_T, λ)
                = Π_{t=1}^{T} P(o_t | o_1 ... o_{t-1}, s_1 ... s_T, λ)     (by chain rule)
                = Π_{t=1}^{T} P(o_t | s_t, λ)                              (by output-independence assumption)
                = Π_{t=1}^{T} b_{s_t}(o_t)

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)

      P(O|λ) = Σ_{all S} P(O|S, λ) P(S|λ)
             = Σ_{s_1, s_2, ..., s_T} π_{s_1} b_{s_1}(o_1) a_{s_1 s_2} b_{s_2}(o_2) ... a_{s_{T-1} s_T} b_{s_T}(o_T)

      where P(o_t | s_t, λ) = b_{s_t}(o_t)

  - Huge computation requirements: O(N^T)
    • Complexity: (2T-1)·N^T MUL, N^T - 1 ADD — exponential computational complexity
  - More efficient algorithms can be used to evaluate P(O|λ)
    • The Forward/Backward Procedure (Algorithm)

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.): state-time trellis diagram

  (Figure: a trellis with states s1, s2, s3 on the vertical axis and times 1, 2, 3, ..., T-1, T with observations o_1, o_2, o_3, ..., o_{T-1}, o_T on the horizontal axis; a node s_j at time t denotes that b_j(o_t) has been computed, and an arc denotes that the corresponding a_ij has been computed)

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

• Based on the HMM assumptions, the calculation of P(s_t | s_{t-1}, λ) and P(o_t | s_t, λ) involves only s_{t-1}, s_t, and o_t, so it is possible to compute the likelihood P(O|λ) with recursion on t

• Forward variable:
      α_t(i) = P(o_1 o_2 ... o_t, s_t = i | λ)
  - The probability that the HMM is in state i at time t, having generated the partial observation o_1 o_2 ... o_t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

• Algorithm

  1. Initialization:   α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N
  2. Induction:        α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}),  1 ≤ t ≤ T-1, 1 ≤ j ≤ N
  3. Termination:      P(O|λ) = Σ_{i=1}^{N} α_T(i)

  - Complexity: O(N^2 T)
    (N(N+1)(T-1) + N MUL, N(N-1)(T-1) ADD)

• Based on the lattice (trellis) structure
  - Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
  - All state sequences, regardless of how long they were previously, merge to N nodes (states) at each time instance t
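A minimal C sketch of the forward procedure exactly as summarized above (probability domain, no scaling). The array names, the MAXN/MAXT bounds, and the convention that b[t][j] holds the precomputed output probability b_j(o_t) are assumptions for illustration, not from the slides:

#define MAXN 16
#define MAXT 1000

double forwardProb(int N, int T, const double pi[MAXN],
                   double a[MAXN][MAXN], double b[MAXT][MAXN])
{
    static double alpha[MAXT][MAXN];
    int i, j, t;

    for (i = 0; i < N; i++)                       /* 1. initialization */
        alpha[0][i] = pi[i] * b[0][i];

    for (t = 0; t < T - 1; t++)                   /* 2. induction */
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            for (i = 0; i < N; i++)
                sum += alpha[t][i] * a[i][j];
            alpha[t + 1][j] = sum * b[t + 1][j];
        }

    double p = 0.0;                               /* 3. termination */
    for (i = 0; i < N; i++)
        p += alpha[T - 1][i];
    return p;
}

In practice the recursion is carried out with scaling or in the log domain (see the log-addition slide later), since the raw probabilities underflow quickly for long utterances.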

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

  Derivation of the induction step:

      α_{t+1}(j) = P(o_1 o_2 ... o_{t+1}, s_{t+1} = j | λ)
                 = P(o_1 ... o_t, s_{t+1} = j | λ) · P(o_{t+1} | o_1 ... o_t, s_{t+1} = j, λ)        (P(A,B) = P(A) P(B|A))
                 = P(o_1 ... o_t, s_{t+1} = j | λ) · P(o_{t+1} | s_{t+1} = j, λ)                     (output-independence assumption)
                 = P(o_1 ... o_t, s_{t+1} = j | λ) · b_j(o_{t+1})
                 = [ Σ_{i=1}^{N} P(o_1 ... o_t, s_t = i, s_{t+1} = j | λ) ] · b_j(o_{t+1})           (total probability over s_t)
                 = [ Σ_{i=1}^{N} P(o_1 ... o_t, s_t = i | λ) P(s_{t+1} = j | o_1 ... o_t, s_t = i, λ) ] · b_j(o_{t+1})
                 = [ Σ_{i=1}^{N} P(o_1 ... o_t, s_t = i | λ) P(s_{t+1} = j | s_t = i, λ) ] · b_j(o_{t+1})   (first-order Markov assumption)
                 = [ Σ_{i=1}^{N} α_t(i) a_ij ] · b_j(o_{t+1})

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

• Example: α_3(3) = P(o_1 o_2 o_3, s_3 = 3 | λ) = [ α_2(1) a_13 + α_2(2) a_23 + α_2(3) a_33 ] b_3(o_3)

  (Figure: state-time trellis with states s1, s2, s3 over times 1, 2, 3, ..., T-1, T and observations o_1, o_2, o_3, ..., o_{T-1}, o_T; a node s_j at time t denotes that b_j(o_t) has been computed, and an arc denotes that a_ij has been computed)

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

  (Figure: forward trellis; e.g., with α_1 = (0.35, 0.02, 0.09), transition probabilities a_11 = 0.6, a_21 = 0.5, a_31 = 0.4, and b_1(up) = 0.7:
   α_2(1) = (0.6×0.35 + 0.5×0.02 + 0.4×0.09) × 0.7 = 0.1792)

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

• Backward variable: β_t(i) = P(o_{t+1} o_{t+2} ... o_T | s_t = i, λ)

• Algorithm

  1. Initialization:   β_T(i) = 1,  1 ≤ i ≤ N
  2. Induction:        β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T-1, ..., 1, 1 ≤ i ≤ N
  3. Termination:      P(O|λ) = Σ_{j=1}^{N} π_j b_j(o_1) β_1(j)

  - Complexity: O(N^2 T)
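A matching C sketch of the backward procedure, using the same illustrative data layout and MAXN/MAXT constants as the forwardProb() sketch above:

/* Fills beta[t][i] = P(o_{t+1} ... o_T | s_t = i, lambda) and returns P(O|lambda). */
double backwardProb(int N, int T, const double pi[MAXN],
                    double a[MAXN][MAXN], double b[MAXT][MAXN],
                    double beta[MAXT][MAXN])
{
    int i, j, t;

    for (i = 0; i < N; i++)                       /* 1. initialization */
        beta[T - 1][i] = 1.0;

    for (t = T - 2; t >= 0; t--)                  /* 2. induction, right to left */
        for (i = 0; i < N; i++) {
            double sum = 0.0;
            for (j = 0; j < N; j++)
                sum += a[i][j] * b[t + 1][j] * beta[t + 1][j];
            beta[t][i] = sum;
        }

    double p = 0.0;                               /* 3. termination */
    for (j = 0; j < N; j++)
        p += pi[j] * b[0][j] * beta[0][j];
    return p;
}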

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

• Why does the termination step work?

      α_t(i) β_t(i) = P(o_1 ... o_t, s_t = i | λ) · P(o_{t+1} ... o_T | s_t = i, λ)
                    = P(o_1 ... o_t | s_t = i, λ) P(s_t = i | λ) · P(o_{t+1} ... o_T | s_t = i, λ)
                    = P(o_1 ... o_T | s_t = i, λ) P(s_t = i | λ)
                    = P(O, s_t = i | λ)

• Therefore, for any t:

      P(O|λ) = Σ_{i=1}^{N} P(O, s_t = i | λ) = Σ_{i=1}^{N} α_t(i) β_t(i)

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

• Example: β_2(3) = P(o_3 o_4 ... o_T | s_2 = 3, λ) = a_31 b_1(o_3) β_3(1) + a_32 b_2(o_3) β_3(2) + a_33 b_3(o_3) β_3(3)

  (Figure: state-time trellis for the backward recursion, states s1, s2, s3 over times 1, 2, 3, ..., T-1, T)

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

  (Figure: hidden state chain S1 → S2 → S3 → ... → ST, with each observation Ot generated from its state St)

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S = (s_1, s_2, ..., s_T)?

• The first optimality criterion: choose the states s_t that are individually most likely at each time t

  Define the a posteriori (state occupation) probability variable:

      γ_t(i) = P(s_t = i | O, λ)
             = P(s_t = i, O | λ) / P(O|λ)
             = α_t(i) β_t(i) / Σ_{m=1}^{N} α_t(m) β_t(m)

  - a soft alignment of an HMM state to the observation (feature)

  - Solution: s_t* = arg max_{1≤i≤N} [γ_t(i)], 1 ≤ t ≤ T
    • Problem: maximizing the probability at each time t individually, S* = (s_1*, s_2*, ..., s_T*) may not be a valid sequence (e.g., a_{s_t* s_{t+1}*} = 0)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

• Example: P(s_3 = 3, O | λ) = α_3(3) β_3(3)

  (Figure: state-time trellis; the forward paths into node s3 at time 3 give α_3(3), the backward paths out of it give β_3(3); an invalid decoded sequence can arise when, e.g., a_23 = 0)

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

• The second optimality criterion: the Viterbi algorithm, which can be regarded as dynamic programming applied to the HMM, or as a modified forward algorithm
  - Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
    • Find a single optimal state sequence S = (s_1, s_2, ..., s_T)
  - How to find the second, third, etc. optimal state sequences? (difficult!)
  - The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
    • State-time trellis diagram

  1. R. Bellman, "On the Theory of Dynamic Programming," Proceedings of the National Academy of Sciences, 1952
  2. A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, 13(2), 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm: find a best state sequence S* for a given observation sequence O

  Define a new variable (the best score along a single path, at time t, which accounts for the first t observations and ends in state i):

      δ_t(i) = max_{s_1, s_2, ..., s_{t-1}} P(s_1 s_2 ... s_{t-1}, s_t = i, o_1 o_2 ... o_t | λ)

  Initialization:     δ_1(i) = π_i b_i(o_1)
  By induction:       δ_{t+1}(j) = [ max_{1≤i≤N} δ_t(i) a_ij ] b_j(o_{t+1})
  For backtracking:   ψ_{t+1}(j) = arg max_{1≤i≤N} [ δ_t(i) a_ij ]
  Termination:        s_T* = arg max_{1≤i≤N} δ_T(i); then backtrack s_t* = ψ_{t+1}(s_{t+1}*)

  - Complexity: O(N^2 T)

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

  (Figure: state-time trellis for the Viterbi search over states s1, s2, s3 and times 1, 2, 3, ..., T-1, T; at each node only the best incoming path is kept, e.g., δ_3(3))

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

  (Figure: Viterbi trellis; e.g., with δ_1(1) = 0.35, a_11 = 0.6, and b_1(up) = 0.7:
   δ_2(1) = (0.6 × 0.35) × 0.7 = 0.147)

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm in the logarithmic form: find a best state sequence S* for a given observation sequence O

  Define a new variable (the best log-score along a single path, at time t, which accounts for the first t observations and ends in state i):

      δ_t(i) = max_{s_1, s_2, ..., s_{t-1}} log P(s_1 s_2 ... s_{t-1}, s_t = i, o_1 o_2 ... o_t | λ)

  Initialization:     δ_1(i) = log π_i + log b_i(o_1)
  By induction:       δ_{t+1}(j) = max_{1≤i≤N} [ δ_t(i) + log a_ij ] + log b_j(o_{t+1})
  For backtracking:   ψ_{t+1}(j) = arg max_{1≤i≤N} [ δ_t(i) + log a_ij ]
  Termination:        s_T* = arg max_{1≤i≤N} δ_T(i); then backtrack from ψ
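A minimal C sketch of this log-domain Viterbi recursion, using the same illustrative MAXN/MAXT layout as the earlier forward/backward sketches; logpi, loga, logb are assumed to hold log probabilities:

/* Returns the best path log-score and writes the optimal state sequence to path[0..T-1]. */
double viterbiDecode(int N, int T, const double logpi[MAXN],
                     double loga[MAXN][MAXN], double logb[MAXT][MAXN], int path[MAXT])
{
    static double delta[MAXT][MAXN];
    static int    psi[MAXT][MAXN];
    int i, j, t;

    for (i = 0; i < N; i++)                       /* initialization */
        delta[0][i] = logpi[i] + logb[0][i];

    for (t = 1; t < T; t++)                       /* induction */
        for (j = 0; j < N; j++) {
            double best = delta[t - 1][0] + loga[0][j];
            int besti = 0;
            for (i = 1; i < N; i++) {
                double s = delta[t - 1][i] + loga[i][j];
                if (s > best) { best = s; besti = i; }
            }
            delta[t][j] = best + logb[t][j];
            psi[t][j] = besti;
        }

    double bestScore = delta[T - 1][0];           /* termination */
    path[T - 1] = 0;
    for (i = 1; i < N; i++)
        if (delta[T - 1][i] > bestScore) { bestScore = delta[T - 1][i]; path[T - 1] = i; }

    for (t = T - 2; t >= 0; t--)                  /* backtrace */
        path[t] = psi[t + 1][path[t + 1]];
    return bestScore;
}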

SP - Berlin Chen 42

Homework 1

• A three-state Hidden Markov Model for the Dow Jones Industrial average

  - Find the probability P(up, up, unchanged, down, unchanged, down, up | λ)
  - Find the optimal state sequence of the model which generates the observation sequence (up, up, unchanged, down, unchanged, down, up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

• In the forward-backward algorithm, operations are usually implemented in the logarithmic domain

• Assume that we want to add P_1 and P_2, given log_b P_1 and log_b P_2:

      if P_1 ≥ P_2:  log_b(P_1 + P_2) = log_b P_1 + log_b(1 + b^(log_b P_2 - log_b P_1))
      else:          log_b(P_1 + P_2) = log_b P_2 + log_b(1 + b^(log_b P_1 - log_b P_2))

  The values of log_b(1 + b^x) can be saved in a table to speed up the operations

  (Figure: number line showing log P_1, log P_2 and log(P_1 + P_2))

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

• An example code (HTK-style log addition):

#define LZERO  (-1.0E10)           /* ~log(0) */
#define LSMALL (-0.5E10)           /* log values < LSMALL are set to LZERO */
#define minLogExp (-log(-LZERO))   /* ~= -23 */

double LogAdd(double x, double y)
{
    double temp, diff, z;
    if (x < y) {                   /* make sure x >= y */
        temp = x; x = y; y = temp;
    }
    diff = y - x;                  /* notice that diff <= 0 */
    if (diff < minLogExp)          /* if y is far smaller than x */
        return (x < LSMALL) ? LZERO : x;
    else {
        z = exp(diff);
        return x + log(1.0 + z);
    }
}

SP - Berlin Chen 45

Basic Problem 3 of HMM: Intuitive View

• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O_1, ..., O_L | λ) or log P(O_1, ..., O_L | λ)?
  - Belongs to a typical problem of "inferential statistics"
  - The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
  - The data is incomplete because of the hidden state sequences
  - Well solved by the Baum-Welch (also known as forward-backward) algorithm and the EM (Expectation-Maximization) algorithm
    • Iterative update and improvement
    • Based on the Maximum Likelihood (ML) criterion

  - Suppose we have L training utterances for the HMM; S is a possible state sequence of the HMM:

      log P(O_1, O_2, ..., O_L | λ) = Σ_{l=1}^{L} log P(O_l | λ) = Σ_{l=1}^{L} log Σ_{all S} P(O_l, S | λ)

    The "log of sum" form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)

• Hard Assignment
  - Given the data, assumed to follow a multinomial distribution, each sample is assigned entirely to one state

  (Figure: four samples, 2 black (B) and 2 white (W), all assigned to state S1)

  P(B|S1) = 2/4 = 0.5
  P(W|S1) = 2/4 = 0.5

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)

• Soft Assignment
  - Given the data, assumed to follow a multinomial distribution, each sample is shared between states according to its posterior state probabilities P(s_t = 1 | O) and P(s_t = 2 | O), with P(s_t = 1 | O) + P(s_t = 2 | O) = 1
  - Maximize the likelihood of the data given the alignment

  (Figure: four samples with posterior weights for states S1/S2 of 0.7/0.3, 0.4/0.6, 0.9/0.1, 0.5/0.5)

  P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
  P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
  P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
  P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMM: Intuitive View (cont.)

• Relationship between the forward and backward variables:

      α_t(i) = P(o_1 o_2 ... o_t, s_t = i | λ),       α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(o_t)
      β_t(i) = P(o_{t+1} ... o_T | s_t = i, λ),       β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)

      α_t(i) β_t(i) = P(O, s_t = i | λ)
      Σ_{i=1}^{N} α_t(i) β_t(i) = P(O|λ)

SP - Berlin Chen 49

Basic Problem 3 of HMM: Intuitive View (cont.)

• Define a new variable:
  - the probability of being in state i at time t and in state j at time t+1:

      ξ_t(i, j) = P(s_t = i, s_{t+1} = j | O, λ)
                = P(s_t = i, s_{t+1} = j, O | λ) / P(O|λ)
                = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)
                = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / [ Σ_{m=1}^{N} Σ_{n=1}^{N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n) ]

• Recall the a posteriori (state occupation) probability variable:

      γ_t(i) = P(s_t = i | O, λ) = α_t(i) β_t(i) / Σ_{m=1}^{N} α_t(m) β_t(m)

  Note that γ_t(i) can also be represented as:

      γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)    (for t < T)

  (Figure: trellis fragment showing the transition from state i at time t to state j at time t+1)

SP - Berlin Chen 50

Basic Problem 3 of HMM: Intuitive View (cont.)

• Example: P(s_3 = 3, s_4 = 1, O | λ) = α_3(3) a_31 b_1(o_4) β_4(1)

  (Figure: state-time trellis with states s1, s2, s3 over times 1, 2, 3, 4, ..., T-1, T)

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont.)

• Expected number of transitions from state i to state j in O:   Σ_{t=1}^{T-1} ξ_t(i, j)

• Expected number of transitions from state i in O:   Σ_{t=1}^{T-1} γ_t(i) = Σ_{t=1}^{T-1} Σ_{j=1}^{N} ξ_t(i, j)

• A set of reasonable re-estimation formulas for {π, A} is:

      π̂_i = expected frequency (number of times) in state i at time t = 1 = γ_1(i)

      â_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
           = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

  (Formulas for a single training utterance)
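A minimal C sketch of one Baum-Welch accumulation pass for the transition probabilities, following the single-utterance formulas above. It reuses the illustrative alpha/beta/a/b layout of the earlier sketches and works in the plain probability domain; real systems accumulate counts over many utterances and use scaling or log arithmetic:

/* pObs = P(O|lambda); newA receives the re-estimated transition matrix. */
void reestimateTransitions(int N, int T,
                           double alpha[MAXT][MAXN], double beta[MAXT][MAXN],
                           double a[MAXN][MAXN], double b[MAXT][MAXN],
                           double pObs, double newA[MAXN][MAXN])
{
    int i, j, t;
    for (i = 0; i < N; i++) {
        double denom = 0.0;                          /* sum over t of gamma_t(i), t = 1..T-1 */
        for (t = 0; t < T - 1; t++)
            denom += alpha[t][i] * beta[t][i] / pObs;
        for (j = 0; j < N; j++) {
            double numer = 0.0;                      /* sum over t of xi_t(i,j) */
            for (t = 0; t < T - 1; t++)
                numer += alpha[t][i] * a[i][j] * b[t + 1][j] * beta[t + 1][j] / pObs;
            newA[i][j] = (denom > 0.0) ? numer / denom : 0.0;
        }
    }
}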

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulas for B is:
  - For discrete and finite observations, b_j(v_k) = P(o_t = v_k | s_t = j):

      b̂_j(v_k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)
                = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

  - For continuous and infinite observations, b_j(v) = f_{O|S}(o_t = v | s_t = j), modeled as a mixture of multivariate Gaussian distributions:

      b_j(v) = Σ_{k=1}^{M} c_jk N(v; μ_jk, Σ_jk)
             = Σ_{k=1}^{M} c_jk (2π)^(-L/2) |Σ_jk|^(-1/2) exp( -1/2 (v - μ_jk)^T Σ_jk^(-1) (v - μ_jk) )

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)
    • Define a new variable γ_t(j, k):
      - the probability of being in state j at time t, with the k-th mixture component accounting for o_t:

        γ_t(j, k) = P(s_t = j, m_t = k | O, λ)
                  = P(s_t = j | O, λ) · P(m_t = k | s_t = j, o_t, λ)       (output-independence assumption applied)
                  = γ_t(j) · c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm)

      where γ_t(j) = α_t(j) β_t(j) / Σ_{m=1}^{N} α_t(m) β_t(m)

      Note: Σ_{m=1}^{M} γ_t(j, m) = γ_t(j)

  (Figure: the output distribution of state 1 as a mixture of Gaussians N_1, N_2, N_3 with weights c_11, c_12, c_13)

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.), re-estimation formulas for a single training utterance:

      ĉ_jk = (expected number of times in state j and mixture k) / (expected number of times in state j)
           = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)

      μ̂_jk = weighted average (mean) of the observations at state j and mixture k
            = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

      Σ̂_jk = weighted covariance of the observations at state j and mixture k
            = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk)(o_t - μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

  (Figure: several utterances of the same unit, e.g. 台師大, each aligned to the 3-state model s1-s2-s3 by the forward-backward (FB) procedure; the statistics from all utterances are pooled before re-estimation)

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.), re-estimation formulas for multiple (L) training utterances (superscript l indexes the utterance, of length T_l):

      π̂_i = (expected frequency in state i at time t = 1) = Σ_{l=1}^{L} γ_1^l(i) / L

      â_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
           = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)

      ĉ_jk = (expected number of times in state j and mixture k) / (expected number of times in state j)
           = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} Σ_{m=1}^{M} γ_t^l(j, m)

      μ̂_jk = weighted average (mean) of the observations at state j and mixture k
            = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) o_t^l / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)

      Σ̂_jk = weighted covariance of the observations at state j and mixture k
            = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) (o_t^l - μ̂_jk)(o_t^l - μ̂_jk)^T / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)

  (Formulas for multiple (L) training utterances)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For discrete and finite observations (cont.), re-estimation formulas for multiple (L) training utterances:

      π̂_i = (expected frequency in state i at time t = 1) = Σ_{l=1}^{L} γ_1^l(i) / L

      â_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
           = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)

      b̂_j(v_k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)
                = Σ_{l=1}^{L} Σ_{t: o_t^l = v_k} γ_t^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j)

  (Formulas for multiple (L) training utterances)

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  - The semicontinuous, or tied-mixture, HMM:

      b_j(o) = Σ_{k=1}^{M} b_j(k) f(o | v_k) = Σ_{k=1}^{M} b_j(k) N(o; μ_k, Σ_k)

    (b_j(k): the k-th mixture weight of state j, discrete and model-dependent;
     f(o | v_k): the k-th mixture density function or k-th codeword, shared across HMMs; M is very large)

  - A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
  - Because M is large, we can simply use the L most significant values of f(o | v_k)
    • Experience showed that an L of about 1~3% of M is adequate
  - Partial tying of f(o | v_k) for different phonetic classes

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

  (Figure: two HMMs whose states s1, s2, s3 each keep their own discrete weights b_j(1), ..., b_j(k), ..., b_j(M), all pointing into a shared codebook of Gaussian kernels N(μ_1, Σ_1), N(μ_2, Σ_2), ..., N(μ_k, Σ_k), ..., N(μ_M, Σ_M))

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  - Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  - A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  - It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  - Assume that we have a training set of observations and an initial estimate of all model parameters
  - Step 1: The set of training observation sequences is segmented into states based on the current model (finding the optimal state sequence by the Viterbi algorithm)
  - Step 2:
    • For a discrete-density HMM (using an M-codeword codebook):
        b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For a continuous-density HMM (M Gaussian mixtures per state):
      - cluster the observation vectors within each state j into a set of M clusters, then
        ŵ_jm = number of vectors classified into cluster m of state j, divided by the number of vectors in state j
        μ̂_jm = sample mean of the vectors classified into cluster m of state j
        Σ̂_jm = sample covariance matrix of the vectors classified into cluster m of state j
  - Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop — the initial model is generated

  (Figure: left-to-right model s1-s2-s3)

SP - Berlin Chen 62

Initialization of HMM (cont)

  (Flowchart: Training Data + Initial Model → State Sequence Segmentation → Estimate parameters of observations via Segmental K-means → Model Re-estimation → Model Convergence? NO: loop back to segmentation; YES: output Model Parameters)

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM
  - 3 states and a 2-codeword codebook; after Viterbi segmentation of the 10 observations O_1 ... O_10 into states s1, s2, s3, counting codewords v1, v2 in each state gives:
    • b_1(v_1) = 3/4, b_1(v_2) = 1/4
    • b_2(v_1) = 1/3, b_2(v_2) = 2/3
    • b_3(v_1) = 2/3, b_3(v_2) = 1/3

  (Figure: state-time segmentation of the 10 observations and the left-to-right 3-state model s1-s2-s3)

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
  - 3 states and 4 Gaussian mixtures per state
  - After segmenting the observations O_1 ... O_N into states, the vectors of each state are clustered by K-means (e.g., splitting the global mean into cluster 1 mean, cluster 2 mean, and further into clusters 1-1, 1-2, 1-3, 1-4, ...) to initialize the mixture means, covariances, and weights

  (Figure: state-time segmentation and K-means splitting for the left-to-right 3-state model s1-s2-s3)

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  - The state duration follows an exponential (geometric) distribution:
      d_i(t) = (a_ii)^(t-1) (1 - a_ii)
    • This does not provide an adequate representation of the temporal structure of speech
  - First-order (Markov) assumption: the state transition depends only on the origin and destination
  - Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames

  Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

  (Figure: candidate state-duration distributions — geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution)

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

  (Figure: likelihood surface over the model configuration space, with the current model configuration at a local maximum)

SP - Berlin Chen 68

Homework-2 (1/2)

  (Figure: an ergodic 3-state discrete HMM; every transition probability is about 0.33/0.34, and the emission probabilities are
   s1: A 0.34, B 0.33, C 0.33;  s2: A 0.33, B 0.34, C 0.33;  s3: A 0.33, B 0.33, C 0.34)

TrainSet 1:
1. ABBCABCAABC
2. ABCABC
3. ABCA ABC
4. BBABCAB
5. BCAABCCAB
6. CACCABCA
7. CABCABCA
8. CABCA
9. CABCA

TrainSet 2:
1. BBBCCBC
2. CCBABB
3. AACCBBB
4. BBABBAC
5. CCA ABBAB
6. BBBCCBAA
7. ABBBBABA
8. CCCCC
9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1. Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2. Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3. Which class do the following testing sequences belong to?
    ABCABCCABAABABCCCCBBB

P4. What are the results if Observable Markov Models were used instead in P1, P2, and P3?

SP - Berlin Chen 70

Isolated Word Recognition

  (Figure: the speech signal is passed through feature extraction to produce the feature sequence X; the likelihoods p(X|M_1), p(X|M_2), ..., p(X|M_V) of all word models, plus p(X|M_Sil) of a silence model, are computed and the most likely word is selected)

      Label(X) = arg max_k p(X | M_k)

  Viterbi approximation:

      Label(X) = arg max_k max_S p(X, S | M_k)
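A minimal C sketch of this decision rule, using the Viterbi approximation and the viterbiDecode() sketch given earlier. The WordModel layout (per-model log parameters and per-utterance logb table) is an assumption for illustration only:

typedef struct {
    const char *word;
    int N;                                   /* number of states */
    double logpi[MAXN];
    double loga[MAXN][MAXN];
    double logb[MAXT][MAXN];                 /* log b_j(o_t), filled per utterance */
} WordModel;

const char *recognizeIsolatedWord(WordModel *models, int V, int T)
{
    int k, best = 0;
    int path[MAXT];
    double bestScore = -1.0e30;
    for (k = 0; k < V; k++) {                /* score every word model */
        double s = viterbiDecode(models[k].N, T, models[k].logpi,
                                 models[k].loga, models[k].logb, path);
        if (s > bestScore) { bestScore = s; best = k; }
    }
    return models[best].word;                /* most likely word */
}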

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the word recognition error rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  - Substitution: an incorrect word was substituted for the correct word
  - Deletion: a correct word was omitted in the recognized sentence
  - Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  - A maximum substring matching problem
  - Can be handled by dynamic programming

• Example:
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
  ("the" deleted, "not" inserted, "effect", "is", "clear" matched)

  - Error analysis: one deletion and one insertion
  - Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    WER = 100% × (Sub + Del + Ins) / (no. of words in the correct sentence) = 100% × 2/4 = 50%
          (might be higher than 100%)
    WCR = 100% × Matched words / (no. of words in the correct sentence) = 100% × 3/4 = 75%
    WAR = 100% × (Matched - Ins) / (no. of words in the correct sentence) = 100% × (3-1)/4 = 50%
          (might be negative)

    WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

  (Figure: alignment grid; the reference sentence (index j) runs along one axis and the recognized/test sentence (index i) along the other; each grid cell [i][j] stores the minimum word-error alignment up to that point, and each move corresponds to a hit/substitution (diagonal), insertion, or deletion)

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen); i indexes the test (recognized) words LT[i], j indexes the reference words LR[j]; the penalties for substitution, deletion, and insertion errors are all set to 1 here

  Step 1: Initialization
    G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1;  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1;  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1..n (test), for j = 1..m (reference):
      G[i][j] = min( G[i-1][j] + 1                       (Insertion),
                     G[i][j-1] + 1                       (Deletion),
                     G[i-1][j-1] + 1  if LT[i] ≠ LR[j]   (Substitution),
                     G[i-1][j-1]      if LT[i] = LR[j]   (Match) )
      B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical), 3 (Substitution, diagonal), or 4 (Match, diagonal)

  Step 3: Backtrace and measure
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path: from B[n][m] back to B[0][0]
      if B[i][j] = 1, print Insertion LT[i], then go left
      else if B[i][j] = 2, print Deletion LR[j], then go down
      else print LR[j] (Hit/Match or Substitution), then go diagonally down

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm - Initialization (HTK-style grid, where each cell keeps the accumulated score and the ins/del/sub/hit counts)

  (Figure: n×m alignment grid between the recognized/test word sequence (1..n) and the correct/reference word sequence (1..m); cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a deletion, or from (i-1,j-1) diagonally)

grid[0][0].score = 0; grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;

for (i = 1; i <= n; i++) {            /* test */
    grid[i][0] = grid[i-1][0];
    grid[i][0].dir = HOR;
    grid[i][0].score += InsPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j] = grid[0][j-1];
    grid[0][j].dir = VERT;
    grid[0][j].score += DelPen;
    grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (main loop; note that gridi[j] = gridi1[j-1] etc. are structure assignments):

for (i = 1; i <= n; i++) {                     /* test */
    gridi = grid[i]; gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {                 /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i])
            d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {                /* DIAG = hit or sub */
            gridi[j] = gridi1[j-1];
            gridi[j].score = d;
            gridi[j].dir = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
        } else if (h < v) {                    /* HOR = ins */
            gridi[j] = gridi1[j];
            gridi[j].score = h;
            gridi[j].dir = HOR;
            ++gridi[j].ins;
        } else {                               /* VERT = del */
            gridi[j] = gridi[j-1];
            gridi[j].score = v;
            gridi[j].dir = VERT;
            ++gridi[j].del;
        }
    } /* for j */
} /* for i */

• Example 1:
    Correct: A C B C C      Test: B A B C
    One optimal alignment: Ins B, Hit A, Del C, Hit B, Hit C, Del C  =>  WER = 60%
    (there is still another optimal alignment)

  (Figure: 4×5 alignment grid annotated with (Ins, Del, Sub, Hit) counts along the best path, HTK-style)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2:
    Correct: A C B C C      Test: B A A C
    Several alignments are equally good (the penalties for substitution, deletion, and insertion errors are all set to 1 here):
    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C    =>  WER = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C    =>  WER = 80%
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C           =>  WER = 80%

  (Figure: 4×5 alignment grid annotated with (Ins, Del, Sub, Hit) counts along the optimal paths)

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of the penalties for substitution, deletion, and insertion errors:

  HTK error penalties:   subPen = 10, delPen = 7, insPen = 7
  NIST error penalties:  subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference (one character per line, with confidence fields):
    100000 100000 桃 / 芝 / 颱 / 風 / 重 / 創 / 花 / 蓮 / 光 / 復 / 鄉 / 大 / 興 / 村 / 死 / 傷 / 慘 / 重 / 感 / 觸 / 最 / 多 ......

  ASR Output (same format):
    100000 100000 桃 / 芝 / 颱 / 風 / 重 / 創 / 花 / 蓮 / 光 / 復 / 鄉 / 打 / 新 / 村 / 次 / 傷 / 殘 / 周 / 感 / 觸 / 最 / 多 ......

SP - Berlin Chen 80

Homework 3

• 506 BN (broadcast news) stories of ASR outputs
  - Report the CER (character error rate) of the first 1, 100, 200, and all 506 stories
  - The result should show the numbers of substitution, deletion, and insertion errors

  Sample scoring outputs:

  ------------------------ Overall Results (1 story) -------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]

  ------------------------ Overall Results (100 stories) ---------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]

  ------------------------ Overall Results (200 stories) ---------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]

  ------------------------ Overall Results (506 stories) ---------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  (Figure: the two-bottle (A, B) ball-drawing experiment and a 3-state HMM with output probabilities such as {A: 0.3, B: 0.2, C: 0.5}, {A: 0.7, B: 0.1, C: 0.2}, {A: 0.3, B: 0.6, C: 0.1})

  Observed data O: the "ball sequence" o_1 o_2 ... o_T
  Latent data S: the "bottle sequence"
  Parameters λ to be estimated to maximize log P(O|λ):
    P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

  Each EM iteration produces a new λ' such that p(O|λ') ≥ p(O|λ)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction to EM (Expectation-Maximization)
  - Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data
      (in our case here, the state sequence is the latent data)
    • Direct access to the data necessary to estimate the parameters is impossible or difficult
      (in our case here, it is almost impossible to estimate {A, B, π} without consideration of the state sequence)
  - Two major steps:
    • E: compute the expectation with respect to the latent data, E[S | λ, O], using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = X_1 X_2 ... X_n with realizations x = x_1 x_2 ... x_n:

  - The Maximum Likelihood (ML) principle: find the model parameter Φ so that the likelihood p(x|Φ) is maximum
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

      μ_ML = (1/n) Σ_{i=1}^{n} x_i
      Σ_ML = (1/n) Σ_{i=1}^{n} (x_i - μ_ML)(x_i - μ_ML)^T

  - The Maximum A Posteriori (MAP) principle: find the model parameter Φ so that the posterior likelihood p(Φ|x) is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  - Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)

• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
  - The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  - The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  - Assume we have the current estimate λ̄ and estimate the probability that each S occurred in the generation of O
  - Pretend we had in fact observed the complete data pair (O, S), with frequency proportional to the probability P(S|O, λ̄), and compute a new λ: the maximum likelihood estimate of λ
  - Does the process converge?
  - Algorithm
    • Unknown model setting: incomplete-data likelihood P(O|λ); complete-data likelihood P(O, S|λ)
    • Log-likelihood expression (Bayes' rule) and expectation taken over S:

        log P(O|λ) = log P(O, S|λ) - log P(S|O, λ)

        log P(O|λ) = Σ_S P(S|O, λ̄) log P(O|λ)
                   = Σ_S P(S|O, λ̄) log P(O, S|λ) - Σ_S P(S|O, λ̄) log P(S|O, λ)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  - Algorithm (cont.)
    • We can thus express log P(O|λ) as follows:

        log P(O|λ) = Q(λ̄, λ) + H(λ̄, λ)

      where
        Q(λ̄, λ) = Σ_S P(S|O, λ̄) log P(O, S|λ)
        H(λ̄, λ) = -Σ_S P(S|O, λ̄) log P(S|O, λ)

    • We want log P(O|λ) ≥ log P(O|λ̄), i.e.

        Q(λ̄, λ) + H(λ̄, λ) ≥ Q(λ̄, λ̄) + H(λ̄, λ̄)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ̄, λ) has the following property: H(λ̄, λ) ≥ H(λ̄, λ̄)

      H(λ̄, λ) - H(λ̄, λ̄) = Σ_S P(S|O, λ̄) log [ P(S|O, λ̄) / P(S|O, λ) ]        (a Kullback-Leibler (KL) distance)
                         ≥ Σ_S P(S|O, λ̄) ( 1 - P(S|O, λ) / P(S|O, λ̄) )         (Jensen's inequality: log x ≥ 1 - 1/x)
                         = Σ_S P(S|O, λ̄) - Σ_S P(S|O, λ) = 0

  - Therefore, for maximizing log P(O|λ) we only need to maximize the Q-function (auxiliary function):

      Q(λ̄, λ) = Σ_S P(S|O, λ̄) log P(O, S|λ)

    i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  - By maximizing the auxiliary function:

      Q(λ̄, λ) = Σ_S P(S|O, λ̄) log P(O, S|λ) = Σ_S [ P(O, S|λ̄) / P(O|λ̄) ] log P(O, S|λ)

  - where P(O, S|λ) and log P(O, S|λ) can be expressed as:

      P(O, S|λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)

      log P(O, S|λ) = log π_{s_1} + Σ_{t=2}^{T} log a_{s_{t-1} s_t} + Σ_{t=1}^{T} log b_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as:

      Q(λ̄, λ) = Q_π(λ̄, π) + Q_a(λ̄, a) + Q_b(λ̄, b)

  where

      Q_π(λ̄, π) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ̄) / P(O|λ̄) ] log π_i

      Q_a(λ̄, a) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ̄) / P(O|λ̄) ] log a_ij

      Q_b(λ̄, b) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ̄) / P(O|λ̄) ] log b_j(v_k)

  Each term has the form Σ_j w_j log y_j, with the y_j constrained to sum to one

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π_i, a_ij, and b_j(v_k)
  - They can be maximized individually
  - All are of the same form:

      F(y) = g(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j,   where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0

    F(y) has its maximum value when:

      y_j = w_j / Σ_{j'=1}^{N} w_{j'}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  Suppose that   F = Σ_{j=1}^{N} w_j log y_j + l ( Σ_{j=1}^{N} y_j - 1 )      (constraint: Σ_j y_j = 1)

  By applying the Lagrange multiplier l:

      ∂F/∂y_j = w_j / y_j + l = 0   =>   w_j = -l y_j, for all j
      =>  Σ_j w_j = -l Σ_j y_j = -l
      =>  y_j = w_j / Σ_{j'} w_{j'}

  Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̂ = (π̂, Â, B̂) can be expressed as:

      π̂_i = P(O, s_1 = i | λ) / P(O|λ) = γ_1(i)

      â_ij = [ Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) ] / [ Σ_{t=1}^{T-1} P(O, s_t = i | λ) ]
           = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

      b̂_i(v_k) = [ Σ_{t: o_t = v_k} P(O, s_t = i | λ) ] / [ Σ_{t=1}^{T} P(O, s_t = i | λ) ]
               = Σ_{t: o_t = v_k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)
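A minimal C sketch of the last re-estimation formula (the discrete output probabilities), reusing the illustrative gamma layout of the earlier sketches; obs[t] is assumed to hold the codebook index of frame t and MAXK an assumed codebook bound:

#define MAXK 256

void reestimateDiscreteB(int N, int T, int K,
                         double gamma[MAXT][MAXN], const int obs[MAXT],
                         double newB[MAXN][MAXK])
{
    int i, k, t;
    for (i = 0; i < N; i++) {
        double denom = 0.0;
        for (k = 0; k < K; k++) newB[i][k] = 0.0;
        for (t = 0; t < T; t++) {
            newB[i][obs[t]] += gamma[t][i];     /* numerator: frames where o_t = v_k */
            denom += gamma[t][i];               /* denominator: all frames assigned to state i */
        }
        for (k = 0; k < K; k++)
            if (denom > 0.0) newB[i][k] /= denom;
    }
}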

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  - The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  - The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  - The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)
             = Σ_{k=1}^{M} c_jk (2π)^(-L/2) |Σ_jk|^(-1/2) exp( -1/2 (o - μ_jk)^T Σ_jk^(-1) (o - μ_jk) ),
      with Σ_{k=1}^{M} c_jk = 1

  (Figure: the distribution for state i as a mixture of Gaussians N_1, N_2, N_3 with weights w_i1, w_i2, w_i3)

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

      P(O, S|λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
                = Σ_{k_1=1}^{M} ... Σ_{k_T=1}^{M} π_{s_1} c_{s_1 k_1} b_{s_1 k_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)

  so that

      P(O|λ) = Σ_S Σ_K P(O, S, K|λ),

  where K = (k_1, k_2, ..., k_T) is one of the possible mixture-component sequences along the state sequence S

  (Note: Π_{t=1}^{T} Σ_{k=1}^{M} a_{t,k} = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t,k_t}, i.e. the product of sums expands into a sum over all index sequences)

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as:

      Q(λ̄, λ) = Σ_S Σ_K P(S, K|O, λ̄) log P(O, S, K|λ)
               = Σ_S Σ_K [ P(O, S, K|λ̄) / P(O|λ̄) ] log P(O, S, K|λ)

  with

      log P(O, S, K|λ) = log π_{s_1} + Σ_{t=1}^{T-1} log a_{s_t s_{t+1}} + Σ_{t=1}^{T} log b_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c_{s_t k_t}

  so that Q(λ̄, λ) = Q_π(λ̄, π) + Q_a(λ̄, a) + Q_b(λ̄, b) + Q_c(λ̄, c)
    (initial probabilities, state transition probabilities, Gaussian mixture density functions, and mixture weights)

EM Applied to Continuous HMM Training (4/7)

• The only difference, compared with discrete HMM training, lies in the Q_b and Q_c terms:

      Q_b(λ̄, b) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ̄) log b_jk(o_t)

      Q_c(λ̄, c) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ̄) log c_jk

  where P(s_t = j, k_t = k | O, λ̄) = γ_t(j, k)

EM Applied to Continuous HMM Training (5/7)

• Re-estimation of the mean vector: let γ_t(j, k) = P(s_t = j, k_t = k | O, λ̄); since

      log b_jk(o_t) = -(L/2) log(2π) - (1/2) log |Σ_jk| - (1/2) (o_t - μ_jk)^T Σ_jk^(-1) (o_t - μ_jk),

  setting the derivative of Q_b with respect to μ_jk to zero (using d(x^T C x)/dx = (C + C^T)x and the symmetry of Σ_jk):

      ∂Q_b/∂μ_jk = Σ_{t=1}^{T} γ_t(j, k) Σ_jk^(-1) (o_t - μ_jk) = 0

  gives the weighted average (mean) of the observations at state j and mixture k:

      μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

EM Applied to Continuous HMM Training (6/7)

• Re-estimation of the covariance matrix: setting the derivative of Q_b with respect to Σ_jk^(-1) to zero (using ∂ log|X| / ∂X = (X^(-1))^T, ∂(a^T X b)/∂X = a b^T, and the symmetry of Σ_jk):

      ∂Q_b/∂Σ_jk^(-1) = (1/2) Σ_{t=1}^{T} γ_t(j, k) [ Σ_jk - (o_t - μ_jk)(o_t - μ_jk)^T ] = 0

  gives the weighted covariance of the observations at state j and mixture k:

      Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ_jk)(o_t - μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as:

      μ̂_jk = Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ)
            = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

      Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk)(o_t - μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

      ĉ_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)

Page 2: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 2

Hidden Markov Model (HMM)A Brief Overview

Historyndash Published in papers of Baum in late 1960s and early 1970sndash Introduced to speech processing by Baker (CMU) and Jelinek

(IBM) in the 1970s (discrete HMMs)ndash Then extended to continuous HMMs by Bell LabsAssumptionsndash Speech signal can be characterized as a parametric random

(stochastic) processndash Parameters can be estimated in a precise well-defined mannerThree fundamental problemsndash Evaluation of probability (likelihood) of a sequence of

observations given a specific HMMndash Determination of a best sequence of model statesndash Adjustment of model parameters so as to best account for

observed signals (or discrimination purposes)

SP - Berlin Chen 3

Stochastic Process

bull A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and generates a sequence of numeric valuesndash Each numeric value in the sequence is modeled by a random

variablendash A stochastic process is just a (finiteinfinite) sequence of random

variables

bull Examples(a) The sequence of recorded values of a speech utterance(b) The sequence of daily prices of a stock(c) The sequence of hourly traffic loads at a node of a

communication network(d) The sequence of radar measurements of the position of an

airplane

SP - Berlin Chen 4

Observable Markov Model

bull Observable Markov Model (Markov Chain)ndash First-order Markov chain of N states is a triple (SA)

bull S is a set of N statesbull A is the NN matrix of transition probabilities between states

P(st=j|st-1=i st-2=k helliphellip) asymp P(st=j|st-1=i) asymp Aij

bull is the vector of initial state probabilitiesj =P(s1=j)

ndash The output of the process is the set of states at each instant of time when each state corresponds to an observable event

ndash The output in any given state is not random (deterministic)

ndash Too simple to describe the speech signal characteristics

First-order and time-invariant assumptions

SP - Berlin Chen 5

Observable Markov Model (cont)

S1 S2

11 SSP 22 SSP

12 SSP

21 SSP

S1 S1S1 S2

S2 S1 S2S2

21 SSP

111 SSSP

222 SSSP

212 SSSP

221 SSSP

112 SSSP

121 SSSP

First-order Markov chain of 2 states

Second-order Markov chain of 2 states 122 SSSP

211 SSSP

(Prev State Cur State)

SP - Berlin Chen 6

Observable Markov Model (cont)

bull Example 1 A 3-state Markov Chain State 1 generates symbol A only State 2 generates symbol B only

and State 3 generates symbol C only

ndash Given a sequence of observed symbols O=CABBCABC the only one corresponding state sequence is S3S1S2S2S3S1S2S3 and the corresponding probability is

P(O|)=P(S3)P(S1|S3)P(S2|S1)P(S2|S2)P(S3|S2)P(S1|S3)P(S2|S1)P(S3|S2)=0103030702030302=000002268

105040 502030207010103060

As2 s3

A

B C

06

07

0301

02

0201

03

05

s1

SP - Berlin Chen 7

Observable Markov Model (cont)

bull Example 2 A three-state Markov chain for the Dow Jones Industrial average

030205

tiπ

The probability of 5 consecutive up days

006480605

11111days econsecutiv 54

111111111 aaaa

PupP

SP - Berlin Chen 8

Observable Markov Model (cont)

bull Example 3 Given a Markov model what is the mean occupancy duration of each state i

iiiiii

ii

d

dii

iiii

dii

dii

dii

iid

ii

i

aaaa

aa

aaadddPd

aa

iddP

11

111=

11

state ain duration ofnumber Expected1=

statein duration offunction massy probabilit

11

1

1

1

Time (Duration)

Probability

a geometric distribution

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

• Notations
– O=o1o2o3……oT: the observation (feature) sequence
– S=s1s2s3……sT: the state sequence
– λ: model for the HMM, λ={A, B, π}
– P(O|λ): the probability of observing O given the model λ
– P(O|S,λ): the probability of observing O given λ and a state sequence S of λ
– P(O,S|λ): the probability of observing O and S given λ
– P(S|O,λ): the probability of observing S given O and λ
• Useful formulas
– Bayes' Rule:

P(A,B) = P(A|B) P(B) = P(B|A) P(A)          (chain rule)

P(A|B) = P(A,B) / P(B) = P(B|A) P(A) / P(B)

P(A|B,λ) = P(A,B|λ) / P(B|λ) = P(B|A,λ) P(A|λ) / P(B|λ)      (λ: model describing the probability)

SP - Berlin Chen 20

Hidden Markov Model (cont)

• Useful formulas (cont.)
– Total Probability Theorem (marginal probability):
  P(A) = Σ_B P(A,B) = Σ_B P(A|B) P(B)        if B is discrete and disjoint
  f(A) = ∫_B f(A|B) f(B) dB                  if B is continuous
– Independence: if x1, x2, …, xn are independent, then P(x1,x2,…,xn) = P(x1) P(x2) … P(xn)
– Expectation:
  E[q(z)] = Σ_k q(k) P(z=k)                  if z is discrete
  E[q(z)] = ∫_z q(z) f(z) dz                 if z is continuous

(Figure: Venn diagram of an event A overlapping disjoint events B1–B5)

SP - Berlin Chen 21

Three Basic Problems for HMM

• Given an observation sequence O=(o1,o2,…,oT) and an HMM λ=(S,π,A,B)
– Problem 1:
  How to efficiently compute P(O|λ)?  → Evaluation problem
– Problem 2:
  How to choose an optimal state sequence S=(s1,s2,……,sT)?  → Decoding problem
– Problem 3:
  How to adjust the model parameters λ=(π,A,B) to maximize P(O|λ)?  → Learning / Training problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and λ, find P(O|λ) = Prob[observing O given λ]
• Direct Evaluation
– Evaluating all possible state sequences of length T that generate the observation sequence O
– The probability of each path S:
• By Markov assumption (first-order HMM)

P(O|λ) = Σ_{all S} P(O,S|λ) = Σ_{all S} P(O|S,λ) P(S|λ)

P(S|λ) = P(s1|λ) Π_{t=2}^{T} P(st | s1, …, st-1, λ)        (by chain rule)
       ≈ P(s1|λ) Π_{t=2}^{T} P(st | st-1, λ)               (by Markov assumption)
       = πs1 as1s2 as2s3 … asT-1sT

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)
– The joint output probability along the path S
• By output-independent assumption
– The probability that a particular observation symbol/vector is emitted at time t depends only on the state st and is conditionally independent of the past observations

P(O|S,λ) = P(o1, …, oT | s1, …, sT, λ)
         = Π_{t=1}^{T} P(ot | o1, …, ot-1, s1, …, sT, λ)     (by chain rule)
         ≈ Π_{t=1}^{T} P(ot | st, λ)                         (by output-independent assumption)
         = Π_{t=1}^{T} bst(ot)

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)
– Huge computation requirements: O(N^T)
• Exponential computational complexity
• More efficient algorithms can be used to evaluate P(O|λ)
– Forward/Backward Procedure (Algorithm)

P(O|λ) = Σ_{all S} P(O|S,λ) P(S|λ)
       = Σ_{s1,s2,…,sT} πs1 bs1(o1) as1s2 bs2(o2) … asT-1sT bsT(oT)

Complexity: (2T-1)·N^T MUL, N^T - 1 ADD         (where P(ot|st,λ) = bst(ot))

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.): State-time Trellis Diagram

(Figure: state-time trellis with states s1, s2, s3 on the vertical axis and observations O1, O2, O3, …, OT-1, OT along times 1…T; a shaded state node denotes that bj(ot) has been computed and an arc denotes that aij has been computed)

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

• Based on the HMM assumptions, the calculation of P(st|st-1,λ) and P(ot|st,λ) involves only st-1, st and ot, so it is possible to compute the likelihood with recursion on t
• Forward variable: αt(i) = P(o1, o2, …, ot, st=i | λ)
– The probability that the HMM is in state i at time t, having generated the partial observation o1o2…ot

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

• Algorithm
– Complexity: O(N^2 T)
• Based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
• All state sequences, regardless of how long they were previously, merge to N nodes (states) at each time instance t

1. Initialization:   α1(i) = πi bi(o1),  1 ≤ i ≤ N
2. Induction:        αt+1(j) = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1),  1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination:      P(O|λ) = Σ_{i=1}^{N} αT(i)

Complexity: N(N+1)(T-1)+N MUL, N(N-1)(T-1)+(N-1) ADD
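The three steps above translate almost line by line into code. The following C99 sketch is illustrative only; the row-major array layout and the function name are assumptions, not part of the slides.

/* Forward procedure for a discrete HMM.
   pi: N initial probs, A: N*N transition matrix (row-major),
   B: N*M emission matrix (row-major), obs: T observation indices,
   alpha: caller-provided T*N work array. Returns P(O|lambda). */
double forward(int N, int M, int T,
               const double *pi, const double *A, const double *B,
               const int *obs, double *alpha)
{
    /* 1. Initialization: alpha_1(i) = pi_i * b_i(o_1) */
    for (int i = 0; i < N; i++)
        alpha[0 * N + i] = pi[i] * B[i * M + obs[0]];

    /* 2. Induction: alpha_{t+1}(j) = [ sum_i alpha_t(i) a_ij ] * b_j(o_{t+1}) */
    for (int t = 1; t < T; t++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += alpha[(t - 1) * N + i] * A[i * N + j];
            alpha[t * N + j] = sum * B[j * M + obs[t]];
        }

    /* 3. Termination: P(O|lambda) = sum_i alpha_T(i) */
    double p = 0.0;
    for (int i = 0; i < N; i++)
        p += alpha[(T - 1) * N + i];
    return p;
}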

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

αt+1(j) = P(o1, o2, …, ot+1, st+1=j | λ)
        = P(o1, …, ot, st+1=j | λ) P(ot+1 | o1, …, ot, st+1=j, λ)
        = P(o1, …, ot, st+1=j | λ) bj(ot+1)                                (output-independent assumption)
        = [ Σ_{i=1}^{N} P(o1, …, ot, st=i, st+1=j | λ) ] bj(ot+1)          (P(A) = Σ_{all B} P(A,B))
        = [ Σ_{i=1}^{N} P(o1, …, ot, st=i | λ) P(st+1=j | o1, …, ot, st=i, λ) ] bj(ot+1)      (P(A,B) = P(A)P(B|A))
        = [ Σ_{i=1}^{N} P(o1, …, ot, st=i | λ) P(st+1=j | st=i, λ) ] bj(ot+1)                 (first-order Markov assumption)
        = [ Σ_{i=1}^{N} αt(i) aij ] bj(ot+1)

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

• α3(3) = P(o1, o2, o3, s3=3 | λ) = [α2(1)a13 + α2(2)a23 + α2(3)a33] b3(o3)

(Figure: state-time trellis diagram illustrating the forward recursion into state 3 at time 3; a state node denotes that bj(ot) has been computed and an arc denotes that aij has been computed)

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

(Figure: example of one forward induction step)
(0.6×0.35 + 0.5×0.02 + 0.4×0.09) × 0.7 = 0.1792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

• Backward variable: βt(i) = P(ot+1, ot+2, …, oT | st=i, λ)

1. Initialization:   βT(i) = 1,  1 ≤ i ≤ N
2. Induction:        βt(i) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j),  t = T-1, …, 1, 1 ≤ i ≤ N
3. Termination:      P(O|λ) = Σ_{j=1}^{N} πj bj(o1) β1(j)

Complexity: about 2N^2(T-1) MUL and N(N-1)(T-1) ADD
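A companion C99 sketch of the backward steps above, under the same illustrative array conventions as the forward sketch.

/* Backward procedure for a discrete HMM; beta is a caller-provided T*N array.
   Returns P(O|lambda) from the termination step. */
double backward(int N, int M, int T,
                const double *pi, const double *A, const double *B,
                const int *obs, double *beta)
{
    /* 1. Initialization: beta_T(i) = 1 */
    for (int i = 0; i < N; i++)
        beta[(T - 1) * N + i] = 1.0;

    /* 2. Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j) */
    for (int t = T - 2; t >= 0; t--)
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += A[i * N + j] * B[j * M + obs[t + 1]] * beta[(t + 1) * N + j];
            beta[t * N + i] = sum;
        }

    /* 3. Termination: P(O|lambda) = sum_j pi_j * b_j(o_1) * beta_1(j) */
    double p = 0.0;
    for (int j = 0; j < N; j++)
        p += pi[j] * B[j * M + obs[0]] * beta[0 * N + j];
    return p;
}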

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

• Why?

  βt(i) = P(ot+1, ot+2, …, oT | st=i, λ)

  αt(i) βt(i) = P(o1, …, ot, st=i | λ) P(ot+1, …, oT | st=i, λ) = P(O, st=i | λ)

• P(O|λ) = Σ_{i=1}^{N} P(O, st=i | λ) = Σ_{i=1}^{N} αt(i) βt(i)       (for any t)

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

• β2(3) = P(o3, o4, …, oT | s2=3, λ) = a31 b1(o3)β3(1) + a32 b2(o3)β3(2) + a33 b3(o3)β3(3)

(Figure: state-time trellis diagram illustrating the backward recursion out of state 3 at time 2)

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

(Figure: Bayesian-network view of an HMM: a chain of hidden states S1 → S2 → S3 → … → ST, each state St emitting an observation Ot)

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1,s2,……,sT)?

• The first optimal criterion: choose the states st that are individually most likely at each time t
– Define the a posteriori probability variable γt(i) = P(st=i | O, λ):

  γt(i) = P(st=i, O | λ) / P(O|λ) = αt(i) βt(i) / P(O|λ) = αt(i) βt(i) / Σ_{m=1}^{N} αt(m) βt(m)

  (the state occupation probability (count) – a soft alignment of HMM states to the observations (features))

– Solution: st* = arg max_i [γt(i)], 1 ≤ t ≤ T
• Problem: maximizing the probability at each time t individually, S*=(s1*,s2*,…,sT*) may not be a valid state sequence (e.g. ast*st+1* = 0)
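A small C sketch of the state occupation probability above, computed from the forward/backward arrays of the earlier sketches (same illustrative T*N row-major layout).

/* gamma_t(i) = alpha_t(i) * beta_t(i) / sum_m alpha_t(m) * beta_t(m) */
void occupation_prob(int N, int T, const double *alpha, const double *beta, double *gamma)
{
    for (int t = 0; t < T; t++) {
        double norm = 0.0;                      /* = P(O|lambda) at every t */
        for (int m = 0; m < N; m++)
            norm += alpha[t * N + m] * beta[t * N + m];
        for (int i = 0; i < N; i++)
            gamma[t * N + i] = alpha[t * N + i] * beta[t * N + i] / norm;
    }
}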

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

• P(s3=3, O | λ) = α3(3) β3(3)

(Figure: state-time trellis highlighting the forward paths into state 3 at time 3 (α3(3)) and the backward paths leaving it (β3(3)); a zero transition probability such as a23=0 illustrates why individually most likely states may not form a valid path)

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

• The second optimal criterion: the Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
– Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
• Find a single optimal state sequence S=(s1,s2,……,sT)
– How to find the second, third, etc. optimal state sequences? (difficult!)
– The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
• State-time trellis diagram

1. R. Bellman, "On the Theory of Dynamic Programming," Proceedings of the National Academy of Sciences, 1952
2. A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, 13 (2), 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm
– Complexity: O(N^2 T)

Find a best state sequence S*=(s1*,s2*,…,sT*) for a given observation O=(o1,o2,…,oT)

Define a new variable: δt(i) = max_{s1,s2,…,st-1} P(s1, s2, …, st-1, st=i, o1, o2, …, ot | λ)
  = the best score along a single path, at time t, which accounts for the first t observations and ends in state i

By induction:     δt+1(j) = [ max_i δt(i) aij ] bj(ot+1)
For backtracking: ψt+1(j) = arg max_i δt(i) aij
Termination:      sT* = arg max_i δT(i)
Backtracking:     we can backtrace from st* = ψt+1(st+1*),  t = T-1, …, 1
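The recursion above differs from the forward procedure only in replacing the sum by a max and remembering the best predecessor. The following C99 sketch is illustrative only (same assumed array layout as the earlier sketches).

/* Viterbi decoding for a discrete HMM; writes the best state sequence into
   path[T] and returns its probability. */
double viterbi(int N, int M, int T,
               const double *pi, const double *A, const double *B,
               const int *obs, int *path)
{
    double delta[T][N];
    int    psi[T][N];

    for (int i = 0; i < N; i++) {               /* initialization */
        delta[0][i] = pi[i] * B[i * M + obs[0]];
        psi[0][i] = 0;
    }
    for (int t = 1; t < T; t++)                 /* induction: keep only the best incoming path */
        for (int j = 0; j < N; j++) {
            int best_i = 0;
            double best = delta[t - 1][0] * A[0 * N + j];
            for (int i = 1; i < N; i++) {
                double score = delta[t - 1][i] * A[i * N + j];
                if (score > best) { best = score; best_i = i; }
            }
            delta[t][j] = best * B[j * M + obs[t]];
            psi[t][j]   = best_i;
        }
    int best_last = 0;                          /* termination */
    for (int i = 1; i < N; i++)
        if (delta[T - 1][i] > delta[T - 1][best_last]) best_last = i;
    path[T - 1] = best_last;
    for (int t = T - 2; t >= 0; t--)            /* backtracking */
        path[t] = psi[t + 1][path[t + 1]];
    return delta[T - 1][best_last];
}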

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

(Figure: state-time trellis diagram for the Viterbi algorithm; only the best incoming path into each state node, e.g. δ3(3), is kept at each time step)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

(Figure: example of one Viterbi induction step)
(0.6×0.35) × 0.7 = 0.147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm in the logarithmic form

Find a best state sequence S*=(s1*,s2*,…,sT*) for a given observation O=(o1,o2,…,oT)

Define a new variable: δt(i) = max_{s1,s2,…,st-1} log P(s1, s2, …, st-1, st=i, o1, o2, …, ot | λ)
  = the best (log-domain) score along a single path, at time t, which accounts for the first t observations and ends in state i

By induction:     δt+1(j) = max_i [ δt(i) + log aij ] + log bj(ot+1)
For backtracking: ψt+1(j) = arg max_i [ δt(i) + log aij ]
Termination:      sT* = arg max_i δT(i)
Backtracking:     we can backtrace from st* = ψt+1(st+1*),  t = T-1, …, 1

SP - Berlin Chen 42

Homework 1
• A three-state Hidden Markov Model for the Dow Jones Industrial average
– Find the probability P(up, up, unchanged, down, unchanged, down, up | λ)
– Find the optimal state sequence of the model which generates the observation sequence (up, up, unchanged, down, unchanged, down, up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

• In the forward-backward algorithm, operations are usually implemented in the logarithmic domain
• Assume that we want to add P1 and P2, given logb P1 and logb P2:

  if P1 ≥ P2:   logb(P1+P2) = logb P1 + logb(1 + b^(logb P2 - logb P1))
  else:         logb(P1+P2) = logb P2 + logb(1 + b^(logb P1 - logb P2))

• The values of logb(1 + b^x) can be saved in a table to speed up the operations

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

• An example code:

#include <math.h>

#define LZERO  (-1.0E10)        /* ~log(0) */
#define LSMALL (-0.5E10)        /* log values < LSMALL are set to LZERO */
#define minLogExp -log(-LZERO)  /* ~= -23 */

double LogAdd(double x, double y)
{
    double temp, diff, z;
    if (x < y) {
        temp = x; x = y; y = temp;
    }
    diff = y - x;               /* notice that diff <= 0 */
    if (diff < minLogExp)       /* if y is far smaller than x */
        return (x < LSMALL) ? LZERO : x;
    else {
        z = exp(diff);
        return x + log(1.0 + z);
    }
}

SP - Berlin Chen 45

Basic Problem 3 of HMM: Intuitive View

• How to adjust (re-estimate) the model parameters λ=(π,A,B) to maximize P(O1,…,OL|λ) or log P(O1,…,OL|λ)?
– A typical problem of "inferential statistics"
– The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
– The data is incomplete because of the hidden state sequences
– Well solved by the Baum-Welch (also known as forward-backward) algorithm and the EM (Expectation-Maximization) algorithm
• Iterative update and improvement
• Based on the Maximum Likelihood (ML) criterion

– Suppose we have L training utterances O1, O2, …, OL for the HMM, and S is a possible state sequence of the HMM:

  log P(O1,O2,…,OL | λ) = Σ_{l=1}^{L} log P(Ol | λ) = Σ_{l=1}^{L} log [ Σ_{all S} P(Ol, S | λ) ]
                        = Σ_{l=1}^{L} log [ Σ_{all S} P(Ol | S, λ) P(S | λ) ]

The "log of sum" form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)
• Hard Assignment
– Given the data follow a multinomial distribution

  State S1:  P(B|S1) = 2/4 = 0.5,   P(W|S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)
• Soft Assignment
– Given the data follow a multinomial distribution
– Maximize the likelihood of the data given the alignment
– Posterior state probabilities γt(1)=P(st=S1|O) and γt(2)=P(st=S2|O), with γt(1)+γt(2)=1 for each observation

State S1 / State S2 posteriors per observation: (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)

  P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
  P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
  P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
  P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMM: Intuitive View (cont.)

• Relationship between the forward and backward variables

  αt(i) = P(o1, o2, …, ot, st=i | λ) = [ Σ_{j=1}^{N} αt-1(j) aji ] bi(ot)

  βt(i) = P(ot+1, …, oT | st=i, λ) = Σ_{j=1}^{N} aij bj(ot+1) βt+1(j)

  αt(i) βt(i) = P(st=i, O | λ)        Σ_{i=1}^{N} αt(i) βt(i) = P(O | λ)

SP - Berlin Chen 49

Basic Problem 3 of HMM: Intuitive View (cont.)

• Define a new variable:  ξt(i,j) = P(st=i, st+1=j | O, λ)
– Probability of being at state i at time t and at state j at time t+1

  ξt(i,j) = P(st=i, st+1=j, O | λ) / P(O|λ)
          = αt(i) aij bj(ot+1) βt+1(j) / P(O|λ)
          = αt(i) aij bj(ot+1) βt+1(j) / Σ_{m=1}^{N} Σ_{n=1}^{N} αt(m) amn bn(ot+1) βt+1(n)

• Recall the a posteriori probability variable:  γt(i) = P(st=i | O, λ) = αt(i) βt(i) / Σ_{m=1}^{N} αt(m) βt(m)

– Note that γt(i) can also be represented as γt(i) = Σ_{j=1}^{N} ξt(i,j)   (for 1 ≤ t ≤ T-1)

(Figure: trellis fragment showing the transition from state i at time t to state j at time t+1)

SP - Berlin Chen 50

Basic Problem 3 of HMM: Intuitive View (cont.)

• P(s3=3, s4=1, O | λ) = α3(3) a31 b1(o4) β4(1)

(Figure: state-time trellis highlighting the forward path into state 3 at time 3, the transition a31, the emission b1(o4), and the backward path out of state 1 at time 4)

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont.)

• Expected number of transitions from state i to state j in O:  Σ_{t=1}^{T-1} ξt(i,j)

• Expected number of transitions from state i in O:  Σ_{t=1}^{T-1} γt(i) = Σ_{t=1}^{T-1} Σ_{j=1}^{N} ξt(i,j)

• A set of reasonable re-estimation formulas for {π, A} is

  π̄i = expected frequency (number of times) in state i at time t=1 = γ1(i)

  āij = expected number of transitions from state i to state j / expected number of transitions from state i
      = Σ_{t=1}^{T-1} ξt(i,j) / Σ_{t=1}^{T-1} γt(i)

  where ξt(i,j) = P(st=i, st+1=j | O, λ) and γt(i) = P(st=i | O, λ)

Formulae for a single training utterance
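A C sketch of this single-utterance re-estimation pass for π and A, built on the alpha/beta/gamma arrays of the earlier sketches (the array layout, the pO argument and the function name are illustrative assumptions).

/* newPi[i] = gamma_1(i);  newA[i][j] = sum_t xi_t(i,j) / sum_t gamma_t(i) */
void reestimate_transitions(int N, int M, int T,
                            const double *A, const double *B, const int *obs,
                            const double *alpha, const double *beta,
                            const double *gamma, double pO,
                            double *newPi, double *newA)
{
    for (int i = 0; i < N; i++) {
        newPi[i] = gamma[0 * N + i];
        double denom = 0.0;                          /* sum_{t=1}^{T-1} gamma_t(i) */
        for (int t = 0; t < T - 1; t++)
            denom += gamma[t * N + i];
        for (int j = 0; j < N; j++) {
            double numer = 0.0;                      /* sum_{t=1}^{T-1} xi_t(i,j) */
            for (int t = 0; t < T - 1; t++)
                numer += alpha[t * N + i] * A[i * N + j]
                       * B[j * M + obs[t + 1]] * beta[(t + 1) * N + j] / pO;
            newA[i * N + j] = numer / denom;
        }
    }
}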

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulas for B is
– For discrete and finite observations, bj(vk)=P(ot=vk|st=j):

  b̄j(vk) = expected number of times in state j and observing symbol vk / expected number of times in state j
          = Σ_{t: ot=vk} γt(j) / Σ_{t=1}^{T} γt(j)

– For continuous and infinite observations, bj(v)=fO|S(ot=v|st=j):

  bj(v) = Σ_{k=1}^{M} cjk N(v; μjk, Σjk)
        = Σ_{k=1}^{M} cjk (2π)^(-L/2) |Σjk|^(-1/2) exp( -1/2 (v-μjk)^T Σjk^(-1) (v-μjk) )

  (modeled as a mixture of multivariate Gaussian distributions)

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

– For continuous and infinite observations (cont.)
• Define a new variable γt(j,k)
– γt(j,k) is the probability of being in state j at time t, with the k-th mixture component accounting for ot:

  γt(j,k) = P(st=j, mt=k | O, λ)
          = P(st=j | O, λ) P(mt=k | st=j, ot, λ)
          = [ αt(j) βt(j) / Σ_{s} αt(s) βt(s) ] · [ cjk N(ot; μjk, Σjk) / Σ_{m=1}^{M} cjm N(ot; μjm, Σjm) ]

  (the observation-independence assumption is applied)

  Note: Σ_{k=1}^{M} γt(j,k) = γt(j)

(Figure: the state-output distribution of state 1 is a weighted sum of Gaussians, with mixture weights c11, c12, c13 over component densities N1, N2, N3)

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

– For continuous and infinite observations (cont.)

  c̄jk = expected number of times in state j and mixture k / expected number of times in state j
       = Σ_{t=1}^{T} γt(j,k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γt(j,m)

  μ̄jk = weighted average (mean) of the observations at state j and mixture k
       = Σ_{t=1}^{T} γt(j,k) ot / Σ_{t=1}^{T} γt(j,k)

  Σ̄jk = weighted covariance of the observations at state j and mixture k
       = Σ_{t=1}^{T} γt(j,k) (ot - μ̄jk)(ot - μ̄jk)^T / Σ_{t=1}^{T} γt(j,k)

Formulae for a single training utterance
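As an illustration of the weighted-mean update above, the following C sketch accumulates the new mixture means for one state j, assuming the γt(j,k) values have already been computed into a T×M array (all names and layouts here are illustrative; the covariance update follows the same pattern with (ot - μjk) squared).

/* mu_jk = sum_t gamma_t(j,k) * o_t  /  sum_t gamma_t(j,k) */
void reestimate_means(int T, int Mmix, int d,
                      const double *gamma_jk,   /* T x Mmix: gamma_t(j,k)      */
                      const double *obs,        /* T x d: feature vectors o_t  */
                      double *new_mu)           /* Mmix x d: updated means     */
{
    for (int k = 0; k < Mmix; k++) {
        double occ = 0.0;
        for (int i = 0; i < d; i++) new_mu[k * d + i] = 0.0;
        for (int t = 0; t < T; t++) {
            double g = gamma_jk[t * Mmix + k];
            occ += g;
            for (int i = 0; i < d; i++)
                new_mu[k * d + i] += g * obs[t * d + i];
        }
        for (int i = 0; i < d; i++)
            new_mu[k * d + i] /= occ;
    }
}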

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

(Figure: a 3-state left-to-right HMM for the word「台師大」; each of several training utterances is processed with the forward-backward (FB) procedure and the statistics are pooled)

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

– For continuous and infinite observations (cont.)

  Formulae for multiple (L) training utterances:

  c̄jk = Σ_{l=1}^{L} Σ_{t=1}^{Tl} γt^l(j,k) / Σ_{l=1}^{L} Σ_{t=1}^{Tl} Σ_{m=1}^{M} γt^l(j,m)

  μ̄jk = Σ_{l=1}^{L} Σ_{t=1}^{Tl} γt^l(j,k) ot^l / Σ_{l=1}^{L} Σ_{t=1}^{Tl} γt^l(j,k)

  Σ̄jk = Σ_{l=1}^{L} Σ_{t=1}^{Tl} γt^l(j,k) (ot^l - μ̄jk)(ot^l - μ̄jk)^T / Σ_{l=1}^{L} Σ_{t=1}^{Tl} γt^l(j,k)

  āij = Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} ξt^l(i,j) / Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} γt^l(i)

  π̄i = expected frequency (number of times) in state i at time t=1 = (1/L) Σ_{l=1}^{L} γ1^l(i)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

– For discrete and finite observations (cont.)

  Formulae for multiple (L) training utterances:

  b̄j(vk) = Σ_{l=1}^{L} Σ_{t: ot^l=vk} γt^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{Tl} γt^l(j)

  āij = Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} ξt^l(i,j) / Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} γt^l(i)

  π̄i = expected frequency (number of times) in state i at time t=1 = (1/L) Σ_{l=1}^{L} γ1^l(i)

SP - Berlin Chen 58

Semicontinuous HMMs
• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
– The semicontinuous or tied-mixture HMM
– A combination of the discrete HMM and the continuous HMM
• A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
– Because M is large, we can simply use the L most significant values
• Experience showed that L of about 1~3% of M is adequate
– Partial tying of the codebook for different phonetic classes

bj(o) = Σ_{k=1}^{M} bj(k) f(o|vk) = Σ_{k=1}^{M} bj(k) N(o; μk, Σk)

(bj(o): state output probability of state j; bj(k): k-th mixture weight of state j, discrete and model-dependent;
 f(o|vk): k-th mixture density function, i.e. k-th codeword, shared across HMMs; M is very large)
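A tiny C sketch of the tied-mixture output probability above: the Gaussian codebook scores f(o|vk) are computed once per frame and shared by every state and model, and each state only contributes its own discrete weights bj(k); restricting the sum to the top L codewords is the speed-up mentioned above. Names are illustrative.

double semicontinuous_output(int M,
                             const double *codebook_score,  /* f(o|v_k), k = 0..M-1, shared */
                             const double *weight_j)        /* b_j(k),   k = 0..M-1, per state */
{
    double b = 0.0;
    for (int k = 0; k < M; k++)
        b += weight_j[k] * codebook_score[k];
    return b;
}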

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

(Figure: two HMMs whose states s1, s2, s3 each carry their own mixture weights b1(1)…b1(M), b2(1)…b2(M), b3(1)…b3(M), all pointing to a shared codebook of Gaussian kernels N(μ1,Σ1), N(μ2,Σ2), …, N(μk,Σk), …, N(μM,ΣM))

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
– Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
– A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
– It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM
• A good initialization of HMM training: Segmental K-Means Segmentation into States
– Assume that we have a training set of observations and an initial estimate of all model parameters
– Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
– Step 2:
• For discrete density HMMs (using an M-codeword codebook):
• For continuous density HMMs (M Gaussian mixtures per state):
– Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated

  b̄j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)

  ŵjm  = (number of vectors classified into cluster m of state j) divided by the number of vectors in state j
  μ̂jm  = sample mean of the vectors classified into cluster m of state j
  Σ̂jm  = sample covariance matrix of the vectors classified into cluster m of state j

  (cluster the observation vectors within each state j into a set of M clusters)

(Figure: a 3-state left-to-right HMM s1 → s2 → s3)

SP - Berlin Chen 62

Initialization of HMM (cont)

(Flowchart: Training Data + Initial Model → State Sequence Segmentation → estimate the parameters of the observation distributions via Segmental K-means → Model Reestimation → Model Convergence? If NO, loop back to segmentation; if YES, output the Model Parameters)

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM
– 3 states and 2 codewords
• b1(v1)=3/4, b1(v2)=1/4
• b2(v1)=1/3, b2(v2)=2/3
• b3(v1)=2/3, b3(v2)=1/3

(Figure: a state-time trellis over observations O1…O10 showing the Viterbi-segmented state alignment of a 3-state left-to-right HMM; the codeword occurrences v1, v2 within each state segment give the counts used above)

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
– 3 states and 4 Gaussian mixtures per state

(Figure: a state-time trellis over observations O1…ON giving the state-aligned vectors for each of the three states; within each state, K-means splits the vectors from the global mean into cluster means, yielding the 4 mixture components per state, e.g. μ11, μ12, μ13, μ14 for state 1)

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
– The state duration follows an exponential (geometric) distribution:  di(t) = aii^(t-1) (1 - aii)
• Doesn't provide an adequate representation of the temporal structure of speech
– First-order (Markov) assumption: the state transition depends only on the origin and destination
– Output-independent assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

(Figure: candidate state-duration distributions: geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution)

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

(Figure: likelihood surface over the model configuration space; training converges to a local optimum near the current model configuration)

SP - Berlin Chen 68

Homework-2 (1/2)

(Figure: an ergodic 3-state discrete HMM over symbols {A, B, C}; the states have emission probabilities {A:.34, B:.33, C:.33}, {A:.33, B:.34, C:.33}, {A:.33, B:.33, C:.34}, and all transition probabilities are 0.33 or 0.34)

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training
P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively
P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB
P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

(Figure: isolated word recognition. The speech signal goes through Feature Extraction to give the feature sequence X; each word model M1, M2, …, MV and the silence model MSil produces a likelihood p(X|M1), p(X|M2), …, p(X|MV), p(X|MSil); the Most Likely Word Selector outputs the recognized label)

Label(X) = arg max_k p(X|Mk)

Viterbi approximation:  Label(X) = arg max_k max_S p(X,S|Mk)
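The decision rule above is just an arg-max over per-model scores. The following C sketch shows that loop; word_score() is a placeholder standing for whichever evaluator is used (forward log-likelihood or the Viterbi approximation), not an API from the slides.

int recognize_word(int V, double (*word_score)(int model_index))
{
    int best = 0;
    double best_score = word_score(0);
    for (int k = 1; k < V; k++) {
        double s = word_score(k);          /* log p(X|M_k) or Viterbi score */
        if (s > best_score) { best_score = s; best = k; }
    }
    return best;                           /* index of the most likely word */
}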

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
– Substitution
• An incorrect word was substituted for the correct word
– Deletion
• A correct word was omitted in the recognized sentence
– Insertion
• An extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Can be handled by dynamic programming
• Example:
– Correct: "the effect is clear";  Recognized: "effect is not clear"
– Error analysis: one deletion ("the") and one insertion ("not"); the other three words are matched
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

  Word Error Rate      = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%    (might be higher than 100%)
  Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%
  Word Accuracy Rate   = 100% × (Matched - Ins words) / (No. of words in the correct sentence) = (3-1)/4 = 50%    (might be negative)

  WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)

(Figure: alignment grid with the reference word index j on one axis and the test word index i on the other; n denotes the word length of the recognized/test sentence and m the word length of the correct/reference sentence; each grid cell [i,j] holds the minimum word error alignment at that point and records which kind of alignment (hit, substitution, insertion, deletion) produced it)

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen)

Step 1: Initialization.  G[0][0] = 0
  for i = 1…n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
  for j = 1…m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

Step 2: Iteration.
  for i = 1…n (test), for j = 1…m (reference):
    G[i][j] = min{ G[i-1][j] + 1                      (Insertion)
                   G[i][j-1] + 1                      (Deletion)
                   G[i-1][j-1] + 1  if LR[j] ≠ LT[i]  (Substitution)
                   G[i-1][j-1]      if LR[j] = LT[i]  (Match) }
    B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical), 3 (Substitution, diagonal) or 4 (Match, diagonal)

Step 3: Measure and Backtrace.
  Word Error Rate = 100% × G[n][m] / m
  Word Accuracy Rate = 100% - Word Error Rate
  Optimal backtrace path: from B[n][m] back to B[0][0]
    if B[i][j] = 1, print "Insertion LT[i]" then go left
    else if B[i][j] = 2, print "Deletion LR[j]" then go down
    else print "LR[j]/LT[i] Hit/Match or Substitution" then go down diagonally

(Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here)

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

(Figure: the alignment grid; the recognized/test word sequence 1…i…n-1, n runs along one axis and the correct/reference word sequence 1…j…m-1, m along the other; the first row accumulates Ins penalties, the first column Del penalties, and each interior cell (i,j) up to (n,m) is reached from (i-1,j-1), (i-1,j) or (i,j-1))

• A Dynamic Programming Algorithm (HTK)
– Initialization:

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub   = grid[0][0].hit = 0;
grid[0][0].dir   = NIL;

for (i = 1; i <= n; i++) {            /* test */
    grid[i][0] = grid[i-1][0];
    grid[i][0].dir = HOR;
    grid[i][0].score += insPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j] = grid[0][j-1];
    grid[0][j].dir = VERT;
    grid[0][j].score += delPen;
    grid[0][j].del++;
}

(Figure: after initialization the first row reads 1·Ins, 2·Ins, 3·Ins, … and the first column 1·Del, 2·Del, 3·Del, …; each interior cell (i,j) is later filled from its neighbors (i-1,j-1), (i-1,j) and (i,j-1))  (HTK)

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program:

for (i = 1; i <= n; i++) {                      /* test */
    gridi = grid[i]; gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {                  /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i])
            d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {                 /* DIAG = hit or sub */
            gridi[j] = gridi1[j-1];
            gridi[j].score = d;
            gridi[j].dir = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
        } else if (h < v) {                     /* HOR = ins */
            gridi[j] = gridi1[j];
            gridi[j].score = h;
            gridi[j].dir = HOR;
            ++gridi[j].ins;
        } else {                                /* VERT = del */
            gridi[j] = gridi[j-1];
            gridi[j].score = v;
            gridi[j].dir = VERT;
            ++gridi[j].del;
        }
    }   /* for j */
}       /* for i */

(Figure: Example 1. Correct: "A C B C C", Test: "B A B C". Each grid cell stores the accumulated (Ins, Del, Sub, Hit) counts; backtracing the optimal path gives the alignment Ins B, Hit A, Del C, Hit B, Hit C, Del C, i.e. Alignment 1 with WER = 3/5 = 60%. Another optimal alignment still exists.)  (HTK)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

(Figure: Example 2. Correct: "A C B C C", Test: "B A A C". Three different optimal alignments all give WER = 4/5 = 80%, e.g.
 Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C
 Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C
 Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C
Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here)

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

Reference:   桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
ASR Output:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……
(each character line in the files carries two additional numeric fields, e.g. "100000 100000 桃")

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

(Figure: two bottles A and B containing balls of different colors)
• Observed data O: "ball sequence";  Latent data S: "bottle sequence"
• Parameters λ to be estimated to maximize log P(O|λ):
  π = {P(A), P(B)},  A = {P(B|A), P(A|B), …},  B = {P(R|A), P(G|A), P(R|B), P(G|B)}

(Figure: a model λ generates o1 o2 …… oT with likelihood p(O|λ); re-estimation yields a new model λ̄, e.g. the 3-state discrete HMM of Example 4 with updated transition and emission probabilities, such that p(O|λ̄) > p(O|λ))

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
• Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence is the latent data
• Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {π, A, B} without consideration of the state sequence
– Two Major Steps:
• E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations, E_S[ · | O, λ]
• M: provides a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = X1, X2, …, Xn → x = x1, x2, …, xn
– The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(x|Φ) is maximum;
  for example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

  μ_ML = (1/n) Σ_{i=1}^{n} xi,     Σ_ML = (1/n) Σ_{i=1}^{n} (xi - μ_ML)(xi - μ_ML)^T

– The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior p(Φ|x) is maximum

(ML and MAP)
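A small C sketch of the ML estimates above for an i.i.d. sample x1…xn, using a diagonal covariance for simplicity; the function name and flattened array layout are illustrative assumptions.

/* mu[i] = (1/n) sum_t x_t[i];  var[i] = (1/n) sum_t (x_t[i] - mu[i])^2 */
void ml_gaussian_diag(int n, int d, const double *x, double *mu, double *var)
{
    for (int i = 0; i < d; i++) { mu[i] = 0.0; var[i] = 0.0; }
    for (int t = 0; t < n; t++)
        for (int i = 0; i < d; i++)
            mu[i] += x[t * d + i] / n;
    for (int t = 0; t < n; t++)
        for (int i = 0; i < d; i++) {
            double diff = x[t * d + i] - mu[i];
            var[i] += diff * diff / n;
        }
}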

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters to maximize the log-likelihood of incomplete data by iteratively maximizing the expectation of the log-likelihood from complete data
• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
– The observable training data O
• We want to maximize P(O|λ); λ is a parameter vector
– The hidden (unobservable) data S
• E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

• Assume we have λ, and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O,S), with frequency proportional to the probability P(S|O,λ), to compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?
– Algorithm:
• Log-likelihood expression, and expectation taken over S

  P(O,S|λ̄) = P(S|O,λ̄) P(O|λ̄)                               (Bayes' rule)
  log P(O|λ̄) = log P(O,S|λ̄) - log P(S|O,λ̄)                 (complete-data vs. incomplete-data likelihood; λ̄ is the unknown model setting)

  Taking the expectation over S under the current model λ (the left side does not depend on S):
  log P(O|λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄) - Σ_S P(S|O,λ) log P(S|O,λ̄)

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
• We can thus express log P(O|λ̄) as follows:

  log P(O|λ̄) = Q(λ,λ̄) - H(λ,λ̄)

  where  Q(λ,λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)
         H(λ,λ̄) = Σ_S P(S|O,λ) log P(S|O,λ̄)

• We want log P(O|λ̄) ≥ log P(O|λ), i.e.

  log P(O|λ̄) - log P(O|λ) = [Q(λ,λ̄) - Q(λ,λ)] - [H(λ,λ̄) - H(λ,λ)] ≥ 0

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ,λ̄) - H(λ,λ) has the following property:

  H(λ,λ̄) - H(λ,λ) = Σ_S P(S|O,λ) log [ P(S|O,λ̄) / P(S|O,λ) ]
                   ≤ Σ_S P(S|O,λ) [ P(S|O,λ̄) / P(S|O,λ) - 1 ]        (Jensen's inequality: log x ≤ x - 1)
                   = Σ_S P(S|O,λ̄) - Σ_S P(S|O,λ) = 1 - 1 = 0

  (H(λ,λ) - H(λ,λ̄) is the Kullback-Leibler (KL) distance)

– Therefore, for maximizing log P(O|λ̄), we only need to maximize the Q-function (auxiliary function)

  Q(λ,λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)
         = the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
– By maximizing the auxiliary function

  Q(λ,λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄) = Σ_S [ P(O,S|λ) / P(O|λ) ] log P(O,S|λ̄)

– Where P(O,S|λ) and log P(O,S|λ̄) can be expressed as

  P(O,S|λ) = πs1 bs1(o1) Π_{t=2}^{T} ast-1st bst(ot)

  log P(O,S|λ̄) = log π̄s1 + Σ_{t=2}^{T} log āst-1st + Σ_{t=1}^{T} log b̄st(ot)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ,λ̄) = Q_π(λ,π̄) + Q_a(λ,ā) + Q_b(λ,b̄), where

  Q_π(λ,π̄) = Σ_{i=1}^{N} [ P(O, s1=i | λ) / P(O|λ) ] log π̄i

  Q_a(λ,ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, st=i, st+1=j | λ) / P(O|λ) ] log āij

  Q_b(λ,b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: ot=vk} [ P(O, st=j | λ) / P(O|λ) ] log b̄j(vk)

  (each term has the form Σ_j wj log yj)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄i, āij and b̄j(vk)
– They can be maximized individually
– All are of the same form:

  F(y1,y2,…,yN) = Σ_{j=1}^{N} wj log yj,   where Σ_{j=1}^{N} yj = 1 and yj ≥ 0

  F has its maximum value when  yj = wj / Σ_{j=1}^{N} wj

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  Suppose that  F = Σ_{j=1}^{N} wj log yj  subject to the constraint  Σ_{j=1}^{N} yj = 1

  By applying the Lagrange multiplier ℓ:
  F = Σ_{j=1}^{N} wj log yj + ℓ ( Σ_{j=1}^{N} yj - 1 )
  ∂F/∂yj = wj / yj + ℓ = 0   →   wj = -ℓ yj   →   Σ_{j=1}^{N} wj = -ℓ Σ_{j=1}^{N} yj = -ℓ
  →  yj = wj / Σ_{j=1}^{N} wj

  (Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html)

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as

  π̄i = P(O, s1=i | λ) / P(O|λ) = γ1(i)

  āij = [ Σ_{t=1}^{T-1} P(O, st=i, st+1=j | λ) / P(O|λ) ] / [ Σ_{t=1}^{T-1} P(O, st=i | λ) / P(O|λ) ]
      = Σ_{t=1}^{T-1} ξt(i,j) / Σ_{t=1}^{T-1} γt(i)

  b̄j(vk) = [ Σ_{t: ot=vk} P(O, st=j | λ) / P(O|λ) ] / [ Σ_{t=1}^{T} P(O, st=j | λ) / P(O|λ) ]
         = Σ_{t: ot=vk} γt(j) / Σ_{t=1}^{T} γt(j)

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of the state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

  bj(o) = Σ_{k=1}^{M} cjk bjk(o) = Σ_{k=1}^{M} cjk N(o; μjk, Σjk)
        = Σ_{k=1}^{M} cjk (2π)^(-L/2) |Σjk|^(-1/2) exp( -1/2 (o-μjk)^T Σjk^(-1) (o-μjk) ),
  with Σ_{k=1}^{M} cjk = 1

(Figure: the distribution for state i is a weighted sum of Gaussians N1, N2, N3 with mixture weights wi1, wi2, wi3)

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express bj(o) with respect to each single mixture component bjk(o):

  p(O,S|λ) = Π_{t=1}^{T} ast-1st bst(ot) = Π_{t=1}^{T} ast-1st [ Σ_{k=1}^{M} cstk bstk(ot) ]
           = Σ_{K} Π_{t=1}^{T} ast-1st cstkt bstkt(ot)            (with as0s1 ≡ πs1)

  where K = (k1, k2, …, kT) is one possible mixture-component sequence along the state sequence S

  p(O|λ) = Σ_S Σ_K p(O,S,K|λ)

  (Note: Π_{t=1}^{T} Σ_{kt=1}^{M} atkt = Σ_{k1=1}^{M} Σ_{k2=1}^{M} … Σ_{kT=1}^{M} Π_{t=1}^{T} atkt)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

  Q(λ,λ̄) = Σ_S Σ_K P(S,K|O,λ) log p(O,S,K|λ̄)
         = Σ_S Σ_K [ p(O,S,K|λ) / p(O|λ) ] log p(O,S,K|λ̄)

  log p(O,S,K|λ̄) = log π̄s1 + Σ_{t=2}^{T} log āst-1st + Σ_{t=1}^{T} log b̄stkt(ot) + Σ_{t=1}^{T} log c̄stkt

  →  Q(λ,λ̄) = Q_π + Q_a + Q_b + Q_c
  (initial probabilities, state transition probabilities, mixture-component Gaussian density functions, mixture weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

  Q_b(λ,b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(st=j, kt=k | O, λ) log b̄jk(ot)
           = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} γt(j,k) log b̄jk(ot)

  Q_c(λ,c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(st=j, kt=k | O, λ) log c̄jk
           = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} γt(j,k) log c̄jk

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Maximizing Q_b with respect to the mean vectors:

  log b̄jk(ot) = -(L/2) log(2π) - (1/2) log |Σ̄jk| - (1/2) (ot - μ̄jk)^T Σ̄jk^(-1) (ot - μ̄jk)

  ∂Q_b/∂μ̄jk = Σ_{t=1}^{T} γt(j,k) Σ̄jk^(-1) (ot - μ̄jk) = 0

  →  μ̄jk = Σ_{t=1}^{T} γt(j,k) ot / Σ_{t=1}^{T} γt(j,k)
          (the weighted average (mean) of the observations at state j and mixture k)

  (using d(x^T C x)/dx = (C + C^T) x, and Σ̄jk is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximizing Q_b with respect to the covariance matrices:

  ∂Q_b/∂Σ̄jk^(-1) = Σ_{t=1}^{T} γt(j,k) [ (1/2) Σ̄jk - (1/2) (ot - μ̄jk)(ot - μ̄jk)^T ] = 0

  →  Σ̄jk = Σ_{t=1}^{T} γt(j,k) (ot - μ̄jk)(ot - μ̄jk)^T / Σ_{t=1}^{T} γt(j,k)
           (the weighted covariance of the observations at state j and mixture k)

  (using d log(det X)/dX = (X^(-1))^T and d(a^T X b)/dX = a b^T; Σ̄jk is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

  μ̄jk = Σ_{t=1}^{T} p(st=j, kt=k | O, λ) ot / Σ_{t=1}^{T} p(st=j, kt=k | O, λ)
      = Σ_{t=1}^{T} γt(j,k) ot / Σ_{t=1}^{T} γt(j,k)

  Σ̄jk = Σ_{t=1}^{T} γt(j,k) (ot - μ̄jk)(ot - μ̄jk)^T / Σ_{t=1}^{T} γt(j,k)

  c̄jk = Σ_{t=1}^{T} γt(j,k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γt(j,m)

Page 3: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 3

Stochastic Process

bull A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and generates a sequence of numeric valuesndash Each numeric value in the sequence is modeled by a random

variablendash A stochastic process is just a (finiteinfinite) sequence of random

variables

bull Examples(a) The sequence of recorded values of a speech utterance(b) The sequence of daily prices of a stock(c) The sequence of hourly traffic loads at a node of a

communication network(d) The sequence of radar measurements of the position of an

airplane

SP - Berlin Chen 4

Observable Markov Model

bull Observable Markov Model (Markov Chain)ndash First-order Markov chain of N states is a triple (SA)

bull S is a set of N statesbull A is the NN matrix of transition probabilities between states

P(st=j|st-1=i st-2=k helliphellip) asymp P(st=j|st-1=i) asymp Aij

bull is the vector of initial state probabilitiesj =P(s1=j)

ndash The output of the process is the set of states at each instant of time when each state corresponds to an observable event

ndash The output in any given state is not random (deterministic)

ndash Too simple to describe the speech signal characteristics

First-order and time-invariant assumptions

SP - Berlin Chen 5

Observable Markov Model (cont)

S1 S2

11 SSP 22 SSP

12 SSP

21 SSP

S1 S1S1 S2

S2 S1 S2S2

21 SSP

111 SSSP

222 SSSP

212 SSSP

221 SSSP

112 SSSP

121 SSSP

First-order Markov chain of 2 states

Second-order Markov chain of 2 states 122 SSSP

211 SSSP

(Prev State Cur State)

SP - Berlin Chen 6

Observable Markov Model (cont)

bull Example 1 A 3-state Markov Chain State 1 generates symbol A only State 2 generates symbol B only

and State 3 generates symbol C only

ndash Given a sequence of observed symbols O=CABBCABC the only one corresponding state sequence is S3S1S2S2S3S1S2S3 and the corresponding probability is

P(O|)=P(S3)P(S1|S3)P(S2|S1)P(S2|S2)P(S3|S2)P(S1|S3)P(S2|S1)P(S3|S2)=0103030702030302=000002268

105040 502030207010103060

As2 s3

A

B C

06

07

0301

02

0201

03

05

s1

SP - Berlin Chen 7

Observable Markov Model (cont)

bull Example 2 A three-state Markov chain for the Dow Jones Industrial average

030205

tiπ

The probability of 5 consecutive up days

006480605

11111days econsecutiv 54

111111111 aaaa

PupP

SP - Berlin Chen 8

Observable Markov Model (cont)

bull Example 3 Given a Markov model what is the mean occupancy duration of each state i

iiiiii

ii

d

dii

iiii

dii

dii

dii

iid

ii

i

aaaa

aa

aaadddPd

aa

iddP

11

111=

11

state ain duration ofnumber Expected1=

statein duration offunction massy probabilit

11

1

1

1

Time (Duration)

Probability

a geometric distribution

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMM: Intuitive View

• How to adjust (re-estimate) the model parameters λ=(A,B,π) to maximize P(O1,…,OL|λ) or log P(O1,…,OL|λ)?
  – It belongs to a typical problem of "inferential statistics"
  – The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
  – The data is incomplete because of the hidden state sequences
  – Well solved by the Baum-Welch algorithm (also known as forward-backward) and the EM (Expectation-Maximization) algorithm
    • Iterative update and improvement
    • Based on the Maximum Likelihood (ML) criterion
  – Suppose that we have L training utterances O1, O2, …, OL for the HMM λ, and S denotes a possible state sequence of the HMM:
    $\log P(\mathbf{O}_1,\ldots,\mathbf{O}_L\mid\lambda)=\sum_{l=1}^{L}\log P(\mathbf{O}_l\mid\lambda)=\sum_{l=1}^{L}\log\sum_{\text{all }S}P(\mathbf{O}_l,S\mid\lambda)$
  – The "log of sum" form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)

• Hard Assignment
  – Given the data follow a multinomial distribution

    State S1:
    P(B|S1) = 2/4 = 0.5
    P(W|S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)

• Soft Assignment
  – Given the data follow a multinomial distribution
  – Maximize the likelihood of the data given the alignment
  – Each sample is assigned to state S1 with posterior weight $P(s_t=1\mid\mathbf{O})$ and to state S2 with weight $P(s_t=2\mid\mathbf{O})$, where the two weights sum to 1; e.g. the four samples carry weights (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5):

    P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
    P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
    P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 = 0.27
    P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 = 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMM: Intuitive View (cont.)

• Relationship between the forward and backward variables
    $\alpha_t(i)=P(o_1 o_2\cdots o_t,\,s_t=i\mid\lambda)$,   $\alpha_{t+1}(j)=\Big[\sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\Big]b_j(o_{t+1})$
    $\beta_t(i)=P(o_{t+1}o_{t+2}\cdots o_T\mid s_t=i,\lambda)$,   $\beta_t(i)=\sum_{j=1}^{N}a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)$
    $\alpha_t(i)\,\beta_t(i)=P(\mathbf{O},s_t=i\mid\lambda)$,   $\sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i)=P(\mathbf{O}\mid\lambda)$

SP - Berlin Chen 49

Basic Problem 3 of HMM: Intuitive View (cont.)

• Define a new variable
  – The probability of being at state i at time t and at state j at time t+1:
    $\xi_t(i,j)=P(s_t=i,s_{t+1}=j\mid\mathbf{O},\lambda)
      =\dfrac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(\mathbf{O}\mid\lambda)}
      =\dfrac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{m=1}^{N}\sum_{n=1}^{N}\alpha_t(m)\,a_{mn}\,b_n(o_{t+1})\,\beta_{t+1}(n)}$
    (note: $p(A\mid B)=p(A,B)/p(B)$)
• Recall the posterior probability variable
    $\gamma_t(i)=P(s_t=i\mid\mathbf{O},\lambda)=\dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{m=1}^{N}\alpha_t(m)\,\beta_t(m)}$
  – Note: $\gamma_t(i)$ can also be represented as $\gamma_t(i)=\sum_{j=1}^{N}\xi_t(i,j)$, for $1\le t\le T-1$
  – (Figure: a trellis segment connecting state i at time t to state j at time t+1)

SP - Berlin Chen 50
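A minimal C sketch of how these two posteriors are typically computed once the forward and backward passes have filled alpha[t][i] and beta[t][i] (linear-domain arrays; all names and sizes are assumptions for illustration, not from the slides):

#define NSTATE 3
#define MAXT   100

/* gamma[t][i] = P(s_t = i | O)  and  xi[t][i][j] = P(s_t = i, s_{t+1} = j | O). */
void posteriors(int T,
                double alpha[MAXT][NSTATE], double beta[MAXT][NSTATE],
                double a[NSTATE][NSTATE],   double b[MAXT][NSTATE],
                double gamma[MAXT][NSTATE], double xi[MAXT][NSTATE][NSTATE])
{
    int i, j, t;
    for (t = 0; t < T; t++) {
        double norm = 0.0;                           /* = P(O | lambda) */
        for (i = 0; i < NSTATE; i++) norm += alpha[t][i] * beta[t][i];
        for (i = 0; i < NSTATE; i++)
            gamma[t][i] = alpha[t][i] * beta[t][i] / norm;
        if (t == T - 1) continue;                    /* xi is defined only up to T-1 */
        for (i = 0; i < NSTATE; i++)
            for (j = 0; j < NSTATE; j++)
                xi[t][i][j] = alpha[t][i] * a[i][j] * b[t+1][j] * beta[t+1][j] / norm;
    }
}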

Basic Problem 3 of HMM: Intuitive View (cont.)

• $P(s_3=3,\,s_4=1,\,\mathbf{O}\mid\lambda)=\alpha_3(3)\,a_{31}\,b_1(o_4)\,\beta_4(1)$
  – (Figure: state-time trellis over states s1, s2, s3 and observations O1, O2, O3, …, OT; the single transition from state 3 at time 3 to state 1 at time 4 is highlighted.)

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont.)

• $\gamma_t(i)=P(s_t=i\mid\mathbf{O},\lambda)$ = expected frequency (number of times) in state i at time t; in particular $\hat{\pi}_i=\gamma_1(i)$ = expected frequency in state i at time t = 1
• $\sum_{t=1}^{T-1}\gamma_t(i)=\sum_{t=1}^{T-1}\sum_{j=1}^{N}\xi_t(i,j)$ = expected number of transitions from state i in O
  $\sum_{t=1}^{T-1}\xi_t(i,j)$ = expected number of transitions from state i to state j in O
• A set of reasonable re-estimation formulae for A is
    $\hat{a}_{ij}=\dfrac{\text{expected number of transitions from state }i\text{ to state }j}{\text{expected number of transitions from state }i}
      =\dfrac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$
  – Formulae for a single training utterance

SP - Berlin Chen 52
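A small C sketch of this re-estimation step for a single utterance, under the same assumed array layout as the previous sketch (gamma[t][i] and xi[t][i][j] already computed; all names are illustrative):

#define NSTATE 3
#define MAXT   100

/* One Baum-Welch style update of the transition matrix from the posteriors of
   a single utterance of length T.                                             */
void reestimate_A(int T,
                  double gamma[MAXT][NSTATE],
                  double xi[MAXT][NSTATE][NSTATE],
                  double a_new[NSTATE][NSTATE])
{
    int i, j, t;
    for (i = 0; i < NSTATE; i++) {
        double denom = 0.0;                        /* expected # transitions from i */
        for (t = 0; t < T - 1; t++) denom += gamma[t][i];
        for (j = 0; j < NSTATE; j++) {
            double num = 0.0;                      /* expected # transitions i -> j */
            for (t = 0; t < T - 1; t++) num += xi[t][i][j];
            a_new[i][j] = (denom > 0.0) ? num / denom : 0.0;
        }
    }
}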

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulae for B is
  – For discrete and finite observations, $b_j(v_k)=P(o_t=v_k\mid s_t=j)$:
    $\hat{b}_j(v_k)=\dfrac{\text{expected number of times in state }j\text{ and observing symbol }v_k}{\text{expected number of times in state }j}
      =\dfrac{\sum_{t,\ o_t=v_k}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}$
  – For continuous and infinite observations, $b_j(\mathbf{v})=f_{\mathbf{O}\mid S}(\mathbf{o}_t=\mathbf{v}\mid s_t=j)$, modeled as a mixture of multivariate Gaussian distributions:
    $b_j(\mathbf{v})=\sum_{k=1}^{M}c_{jk}\,N(\mathbf{v};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})
      =\sum_{k=1}^{M}c_{jk}\dfrac{1}{(2\pi)^{L/2}\lvert\boldsymbol{\Sigma}_{jk}\rvert^{1/2}}
       \exp\!\Big(-\tfrac{1}{2}(\mathbf{v}-\boldsymbol{\mu}_{jk})^{T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{v}-\boldsymbol{\mu}_{jk})\Big)$,
    with $\sum_{k=1}^{M}c_{jk}=1$

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)
    • Define a new variable $\gamma_t(j,k)$
      – $\gamma_t(j,k)$ is the probability of being in state j at time t with the k-th mixture component accounting for $\mathbf{o}_t$:
        $\gamma_t(j,k)=P(s_t=j,\,m_t=k\mid\mathbf{O},\lambda)
          =P(s_t=j\mid\mathbf{O},\lambda)\,P(m_t=k\mid s_t=j,\mathbf{o}_t,\lambda)
          =\gamma_t(j)\cdot\dfrac{c_{jk}\,N(\mathbf{o}_t;\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})}{\sum_{m=1}^{M}c_{jm}\,N(\mathbf{o}_t;\boldsymbol{\mu}_{jm},\boldsymbol{\Sigma}_{jm})}$
        (the observation-independence assumption is applied; note $p(A\mid B)=p(A,B)/p(B)$)
      – Note: $\sum_{k=1}^{M}\gamma_t(j,k)=\gamma_t(j)$
      – (Figure: the output distribution for state 1 is a mixture of Gaussians $N_1, N_2, N_3$ with weights $c_{11}, c_{12}, c_{13}$.)

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)
    $\hat{c}_{jk}=\dfrac{\text{expected number of times in state }j\text{ and mixture }k}{\text{expected number of times in state }j}
      =\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)}$
    $\hat{\boldsymbol{\mu}}_{jk}=\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$   (weighted average (mean) of the observations at state j and mixture k)
    $\hat{\boldsymbol{\Sigma}}_{jk}=\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\hat{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\hat{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$   (weighted covariance of the observations at state j and mixture k)
  – Formulae for a single training utterance

SP - Berlin Chen 55
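As a sketch of how the mean formula above is usually realized, the sufficient statistics are accumulated frame by frame and divided at the end. The array names (gammam for γ_t(j,k), DIM for the feature dimension) and sizes are assumptions for illustration only; num[][] and den[][] are assumed zero-initialized by the caller.

#define NSTATE 3
#define NMIX   4
#define MAXT   100
#define DIM    39

/* Accumulate numerator and denominator of the mixture-mean update for one
   utterance; afterwards mu_new[j][k][d] = num[j][k][d] / den[j][k].          */
void accumulate_means(int T,
                      double o[MAXT][DIM],
                      double gammam[MAXT][NSTATE][NMIX],
                      double num[NSTATE][NMIX][DIM], double den[NSTATE][NMIX])
{
    int t, j, k, d;
    for (t = 0; t < T; t++)
        for (j = 0; j < NSTATE; j++)
            for (k = 0; k < NMIX; k++) {
                den[j][k] += gammam[t][j][k];
                for (d = 0; d < DIM; d++)
                    num[j][k][d] += gammam[t][j][k] * o[t][d];
            }
}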

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances
  – (Figure: a left-to-right HMM with states s1, s2, s3 for the word 台師大 (an abbreviation for National Taiwan Normal University); the forward-backward (FB) statistics are collected separately from each training utterance and then pooled for re-estimation.)

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)
    $\hat{c}_{jk}=\dfrac{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{\,l}(j,k)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\sum_{m=1}^{M}\gamma_t^{\,l}(j,m)}
      =\dfrac{\text{expected number of times in state }j\text{ and mixture }k}{\text{expected number of times in state }j}$
    $\hat{\boldsymbol{\mu}}_{jk}=\dfrac{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{\,l}(j,k)\,\mathbf{o}_t^{\,l}}{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{\,l}(j,k)}$   (weighted average (mean) of the observations at state j and mixture k)
    $\hat{\boldsymbol{\Sigma}}_{jk}=\dfrac{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{\,l}(j,k)\,(\mathbf{o}_t^{\,l}-\hat{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t^{\,l}-\hat{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{\,l}(j,k)}$   (weighted covariance of the observations at state j and mixture k)
    $\hat{a}_{ij}=\dfrac{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\xi_t^{\,l}(i,j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\gamma_t^{\,l}(i)}
      =\dfrac{\text{expected number of transitions from state }i\text{ to state }j}{\text{expected number of transitions from state }i}$
    $\hat{\pi}_i=\dfrac{1}{L}\sum_{l=1}^{L}\gamma_1^{\,l}(i)$   (expected frequency (number of times) in state i at time t = 1)
  – Formulae for multiple (L) training utterances

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For discrete and finite observations (cont.)
    $\hat{\pi}_i=\dfrac{1}{L}\sum_{l=1}^{L}\gamma_1^{\,l}(i)$   (expected frequency (number of times) in state i at time t = 1)
    $\hat{a}_{ij}=\dfrac{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\xi_t^{\,l}(i,j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\gamma_t^{\,l}(i)}
      =\dfrac{\text{expected number of transitions from state }i\text{ to state }j}{\text{expected number of transitions from state }i}$
    $\hat{b}_j(v_k)=\dfrac{\sum_{l=1}^{L}\sum_{t,\ \mathbf{o}_t^{\,l}=v_k}\gamma_t^{\,l}(j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{\,l}(j)}
      =\dfrac{\text{expected number of times in state }j\text{ and observing symbol }v_k}{\text{expected number of times in state }j}$
  – Formulae for multiple (L) training utterances

SP - Berlin Chen 58

Semicontinuous HMMs
• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous or tied-mixture HMM
    $b_j(\mathbf{o})=\sum_{k=1}^{M}b_j(k)\,f(\mathbf{o}\mid v_k)=\sum_{k=1}^{M}b_j(k)\,N(\mathbf{o};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)$
    where $b_j(\mathbf{o})$ is the state output probability of state j, $b_j(k)$ is the k-th mixture weight of state j (discrete, model-dependent), and $f(\mathbf{o}\mid v_k)$ is the k-th mixture density function or k-th codeword (shared across HMMs; M is very large)
  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete model-dependent weight coefficients and continuous model-independent codebook probability density functions
  – Because M is large, we can simply use the L most significant values of $f(\mathbf{o}\mid v_k)$
    • Experience showed that an L of about 1~3% of M is adequate
  – Partial tying of $f(\mathbf{o}\mid v_k)$ for different phonetic classes

SP - Berlin Chen 59
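A minimal C sketch of the tied-mixture state output probability above, restricted to the L most significant codewords as the slide suggests. The sizes NCODE and TOPL and all array names are assumptions for illustration:

#define NCODE 256    /* M: size of the shared codebook (assumed) */
#define TOPL  4      /* L: number of most significant codewords kept */

/* Combine the shared codeword densities f[k] = f(o | v_k), already evaluated
   for the current frame, with the state-dependent discrete weights bj[k];
   only the TOPL largest densities (indices given in top[]) are used.         */
double sc_output_prob(const double f[NCODE], const double bj[NCODE],
                      const int top[TOPL])
{
    double p = 0.0;
    int n;
    for (n = 0; n < TOPL; n++)
        p += bj[top[n]] * f[top[n]];
    return p;
}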

Semicontinuous HMMs (cont)

(Figure: two HMMs, each with states s1, s2, s3; every state keeps its own discrete weights b_j(1), …, b_j(k), …, b_j(M), while all states of all models share the same codebook of Gaussian kernels N(μ1,Σ1), N(μ2,Σ2), …, N(μk,Σk), …, N(μM,ΣM).)

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate for modeling the speech signal (also called the "beads-on-a-string" model)
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61
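To make the left-to-right topology concrete, a 3-state phone model is often initialized with a transition matrix like the following sketch (the values are illustrative assumptions, not taken from the slides):

/* Left-to-right (no-skip) transition matrix for a 3-state model:
   each state either stays in place or moves one state to the right.  */
double a[3][3] = {
    { 0.6, 0.4, 0.0 },
    { 0.0, 0.6, 0.4 },
    { 0.0, 0.0, 1.0 },   /* the last state loops until the model is exited */
};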

Initialization of HMM
• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2:
    • For a discrete-density HMM (using an M-codeword codebook):
      $\hat{b}_j(k)=\dfrac{\text{number of vectors with codebook index }k\text{ in state }j}{\text{number of vectors in state }j}$
    • For a continuous-density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state into a set of M clusters, then
      $\hat{w}_{jm}$ = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
      $\hat{\boldsymbol{\mu}}_{jm}$ = sample mean of the vectors classified in cluster m of state j
      $\hat{\boldsymbol{\Sigma}}_{jm}$ = sample covariance matrix of the vectors classified in cluster m of state j
  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated
  – (Figure: a left-to-right model s1, s2, s3)

SP - Berlin Chen 62

Initialization of HMM (cont)

(Flow chart: the Training Data and an Initial Model feed State Sequence Segmentation; the segmented data are used to estimate the observation parameters via Segmental K-means; model re-estimation repeats until Model Convergence is reached (NO loops back to segmentation, YES outputs the final Model Parameters).)

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM
  – 3 states and a 2-codeword codebook; after segmenting the observations O1~O10 (each labeled v1 or v2) into the three states and counting codeword occurrences:
    • b1(v1) = 3/4,  b1(v2) = 1/4
    • b2(v1) = 1/3,  b2(v2) = 2/3
    • b3(v1) = 2/3,  b3(v2) = 1/3
  – (Figure: state-time trellis over times 1~10 showing the segmentation of O1~O10 into states s1, s2, s3 for the left-to-right model.)

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
  – 3 states and 4 Gaussian mixtures per state
  – (Figure: the observation vectors O1, O2, …, ON aligned to each state are split by K-means, starting from the global mean into cluster means; for state 1, for example, the clusters yield the four mixture parameter sets (w11,μ11,Σ11), (w12,μ12,Σ12), (w13,μ13,Σ13), (w14,μ14,Σ14).)

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution, $d_i(t)=a_{ii}^{\,t-1}(1-a_{ii})$
    • It does not provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination
  – Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames
• Researchers have proposed a number of techniques to address these limitations, although these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66
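As a quick worked example of the exponential-duration assumption above (not in the original slides): the expected duration of state i is $\sum_{t\ge 1} t\,a_{ii}^{\,t-1}(1-a_{ii}) = 1/(1-a_{ii})$, so a self-loop probability of $a_{ii}=0.8$ gives an average stay of 5 frames (about 50 ms at a typical 10 ms frame shift), with the probability of longer stays decaying geometrically, which does not match observed phone-duration statistics.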

Known Limitations of HMMs (2/3)

• Duration modeling
  – (Figure: candidate state-duration models that can replace the implicit one: geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution.)

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized
  – (Figure: the likelihood surface over the model configuration space; training climbs to the local maximum nearest the current model configuration.)

SP - Berlin Chen 68

Homework-2 (1/2)

(Figure: the initial three-state ergodic HMM. Each state has self-transition probability 0.34 and transition probability 0.33 to each of the other two states; the initial output probabilities are s1: A:0.34, B:0.33, C:0.33; s2: A:0.33, B:0.34, C:0.33; s3: A:0.33, B:0.33, C:0.34. One HMM is to be trained from each of the two training sets below.)

TrainSet 1:
  1. ABBCABCAABC   2. ABCABC    3. ABCA ABC   4. BBABCAB   5. BCAABCCAB
  6. CACCABCA      7. CABCABCA  8. CABCA      9. CABCA

TrainSet 2:
  1. BBBCCBC   2. CCBABB    3. AACCBBB   4. BBABBAC   5. CCA ABBAB
  6. BBBCCBAA  7. ABBBBABA  8. CCCCC     9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results obtained by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to: ABCABCCABAABABCCCCBBB?

P4: What are the results if Observable Markov Models were used instead in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

(Figure: block diagram of isolated word recognition. The speech signal passes through feature extraction to give the feature sequence X; the likelihoods p(X|M1), p(X|M2), …, p(X|MV) of the word models and p(X|MSil) of the silence model are computed in parallel, and a most-likely-word selector picks the answer.)

    $\text{Label}(\mathbf{X})=\arg\max_{k}\,p(\mathbf{X}\mid M_k)$

• Viterbi approximation
    $\text{Label}(\mathbf{X})=\arg\max_{k}\,\max_{\mathbf{S}}\,p(\mathbf{X},\mathbf{S}\mid M_k)$

SP - Berlin Chen 71
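A small C sketch of the selector in the block diagram above, under the Viterbi approximation. The FeatureSeq container and the logp_word() helper (which would run Viterbi decoding of X against word model k, e.g. with the earlier viterbi() sketch) are assumptions introduced only for illustration:

typedef struct { int T; const double *feat; } FeatureSeq;  /* assumed feature container */

/* Return the index of the word model with the highest best-path log likelihood. */
int recognize(int V, const FeatureSeq *X,
              double (*logp_word)(int k, const FeatureSeq *X))
{
    int k, best = 0;
    double best_lp = logp_word(0, X);
    for (k = 1; k < V; k++) {
        double lp = logp_word(k, X);
        if (lp > best_lp) { best_lp = lp; best = k; }
    }
    return best;
}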

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the word recognition error rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution
    • An incorrect word was substituted for the correct word
  – Deletion
    • A correct word was omitted in the recognized sentence
  – Insertion
    • An extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
    ("the" deleted, "not" inserted; "effect", "is", "clear" matched)
  – Error analysis: one deletion and one insertion
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)
    Word Error Rate = 100% x (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%   (might be higher than 100%)
    Word Correction Rate = 100% x (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%
    Word Accuracy Rate = 100% x (Matched - Inserted words) / (No. of words in the correct sentence) = (3-1)/4 = 50%   (might be negative)
  – Note: WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)
  – (Figure: the alignment grid. One axis indexes the words of the correct/reference sentence and the other the words of the recognized/test sentence; each grid point [i,j] holds the minimum-word-error alignment up to that point, reached by one of four kinds of alignment step: insertion, deletion, substitution, or hit/match.)

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)
  – Step 1: Initialization
      G[0][0] = 0
      for i = 1..n (test):       G[i][0] = G[i-1][0] + 1,  B[i][0] = 1 (Insertion, horizontal direction)
      for j = 1..m (reference):  G[0][j] = G[0][j-1] + 1,  B[0][j] = 2 (Deletion, vertical direction)
  – Step 2: Iteration
      for i = 1..n (test), for j = 1..m (reference):
        G[i][j] = min{ G[i-1][j]   + 1  (Insertion, horizontal direction),
                       G[i][j-1]   + 1  (Deletion, vertical direction),
                       G[i-1][j-1] + 1  (Substitution, diagonal direction, if LT[i] != LR[j]),
                       G[i-1][j-1]      (Match, diagonal direction, if LT[i] = LR[j]) }
        B[i][j] = 1 (Insertion), 2 (Deletion), 3 (Substitution), or 4 (Match), recording which case was chosen
  – Step 3: Measure and backtrace
      Word Error Rate = 100% x G[n][m] / m
      Word Accuracy Rate = 100% - Word Error Rate
      Optimal backtrace path: from B[n][m] back to B[0][0]
        if B[i][j] = 1: print LT[i] (Insertion), then go left
        else if B[i][j] = 2: print LR[j] (Deletion), then go down
        else: print LR[j] (Hit/Match or Substitution), then go diagonally down
  – Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here
  – (Axes: Ref j, Test i)

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK)
  – (Figure: the (n+1) x (m+1) grid, with the recognized/test word index i = 1..n along one axis and the correct/reference word index j = 1..m along the other; cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a deletion, or from (i-1,j-1) by a substitution or hit, ending at (n,m).)
  – Initialization:

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;
    for (i = 1; i <= n; i++) {              /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {              /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (HTK):

    for (i = 1; i <= n; i++) {               /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {           /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {          /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];      /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {              /* HOR = ins */
                gridi[j] = gridi1[j];        /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                         /* VERT = del */
                gridi[j] = gridi[j-1];       /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        } /* for j */
    } /* for i */

• Example 1
    Correct: A C B C C      Test: B A B C
  – (Figure: the filled grid of (Ins, Del, Sub, Hit) counts for this pair, starting from (0,0,0,0) at the origin.)
  – One optimal alignment (HTK): Ins B, Hit A, Del C, Hit B, Hit C, Del C
  – Alignment 1: WER = 3/5 = 60% (there is still another optimal alignment)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
    Correct: A C B C C      Test: B A A C
  – (Figure: the filled grid of (Ins, Del, Sub, Hit) counts for this pair.)
  – Three optimal alignments, all with WER = 4/5 = 80%:
      Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C
      Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C
      Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C
  – Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors
  – HTK error penalties:   subPen = 10,  delPen = 7,  insPen = 7
  – NIST error penalties:  subPenNIST = 4,  delPenNIST = 3,  insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance
  – Each line of the provided files carries two numeric fields (e.g. "100000 100000") followed by one character; only the character columns are compared:
    Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
    ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and all 506 stories
  – The result should show the number of substitution, deletion and insertion errors

    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    ====================================================================
    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    ====================================================================
    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    ====================================================================
    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
    ====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

(Figure: the two-bottle (urn-and-ball) illustration. Bottles A and B hold red (R) and green (G) balls; the observed data O is the ball sequence o1 o2 …… oT and the latent data S is the bottle sequence. The parameters λ to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).)

(Figure: for the three-state ergodic HMM of the earlier example (output probabilities such as s1: A:0.3, B:0.2, C:0.5 and transition probabilities 0.6, 0.7, 0.3, 0.2, 0.1, …), EM re-estimates λ into a new model λ' such that p(O|λ') > p(O|λ).)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction to EM (Expectation-Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate (A, B, π) without consideration of the state sequence
  – Two major steps
    • E: the expectation E[·|O, λ] is taken with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations
    • M: provides a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84
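Stated compactly (a restatement of the two steps in the notation used on the following slides):

\text{E-step: } Q(\lambda,\bar{\lambda}) = E_{S}\big[\log P(\mathbf{O},S\mid\bar{\lambda}) \,\big|\, \mathbf{O},\lambda\big]
             = \sum_{\text{all } S} P(S\mid\mathbf{O},\lambda)\,\log P(\mathbf{O},S\mid\bar{\lambda})
\qquad
\text{M-step: } \bar{\lambda} \leftarrow \arg\max_{\bar{\lambda}} Q(\lambda,\bar{\lambda})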

The EM Algorithm (3/7)

• Estimation principle based on observations $\mathbf{X}=\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n$
  – The Maximum Likelihood (ML) principle: find the model parameter $\Phi$ so that the likelihood $p(\mathbf{X}\mid\Phi)$ is maximum. For example, if $\Phi=\{\boldsymbol{\mu},\boldsymbol{\Sigma}\}$ are the parameters of a multivariate normal distribution and $\mathbf{X}$ is i.i.d. (independent, identically distributed), then the ML estimates are
    $\boldsymbol{\mu}_{ML}=\dfrac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$,   $\boldsymbol{\Sigma}_{ML}=\dfrac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}_{ML})(\mathbf{x}_i-\boldsymbol{\mu}_{ML})^{T}$
  – The Maximum A Posteriori (MAP) principle: find the model parameter $\Phi$ so that the posterior $p(\Phi\mid\mathbf{X})$ is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O}\mid\lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O},S\mid\lambda)$
• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data $\mathbf{O}$
    • We want to maximize $P(\mathbf{O}\mid\lambda)$; $\lambda$ is a parameter vector
  – The hidden (unobservable) data $S$
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have $\lambda$ and estimate the probability that each $S$ occurred in the generation of $\mathbf{O}$
  – Pretend we had in fact observed a complete data pair $(\mathbf{O},S)$ with frequency proportional to the probability $P(S\mid\mathbf{O},\lambda)$, and compute a new $\bar{\lambda}$, the maximum likelihood estimate of $\lambda$
  – Does the process converge?
  – Algorithm
    • Bayes' rule: $P(\mathbf{O},S\mid\bar{\lambda})=P(\mathbf{O}\mid\bar{\lambda})\,P(S\mid\mathbf{O},\bar{\lambda})$
      (incomplete-data likelihood $P(\mathbf{O}\mid\bar{\lambda})$, complete-data likelihood $P(\mathbf{O},S\mid\bar{\lambda})$; $\bar{\lambda}$ is the unknown model setting)
    • Log-likelihood expression: $\log P(\mathbf{O}\mid\bar{\lambda})=\log P(\mathbf{O},S\mid\bar{\lambda})-\log P(S\mid\mathbf{O},\bar{\lambda})$
    • Take the expectation over $S$ with respect to $P(S\mid\mathbf{O},\lambda)$:
      $\log P(\mathbf{O}\mid\bar{\lambda})=\sum_{S}P(S\mid\mathbf{O},\lambda)\log P(\mathbf{O},S\mid\bar{\lambda})-\sum_{S}P(S\mid\mathbf{O},\lambda)\log P(S\mid\mathbf{O},\bar{\lambda})$

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express $\log P(\mathbf{O}\mid\bar{\lambda})$ as follows:
      $\log P(\mathbf{O}\mid\bar{\lambda})=Q(\lambda,\bar{\lambda})+H(\lambda,\bar{\lambda})$, where
      $Q(\lambda,\bar{\lambda})=\sum_{S}P(S\mid\mathbf{O},\lambda)\log P(\mathbf{O},S\mid\bar{\lambda})$
      $H(\lambda,\bar{\lambda})=-\sum_{S}P(S\mid\mathbf{O},\lambda)\log P(S\mid\mathbf{O},\bar{\lambda})$
    • We want $\log P(\mathbf{O}\mid\bar{\lambda})\ge\log P(\mathbf{O}\mid\lambda)$:
      $\log P(\mathbf{O}\mid\bar{\lambda})-\log P(\mathbf{O}\mid\lambda)
        =\big[Q(\lambda,\bar{\lambda})-Q(\lambda,\lambda)\big]+\big[H(\lambda,\bar{\lambda})-H(\lambda,\lambda)\big]$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda,\bar{\lambda})-H(\lambda,\lambda)$ has the following property: $H(\lambda,\bar{\lambda})-H(\lambda,\lambda)\ge 0$
    $H(\lambda,\bar{\lambda})-H(\lambda,\lambda)
      =-\sum_{S}P(S\mid\mathbf{O},\lambda)\log\dfrac{P(S\mid\mathbf{O},\bar{\lambda})}{P(S\mid\mathbf{O},\lambda)}
      \ge-\sum_{S}P(S\mid\mathbf{O},\lambda)\left(\dfrac{P(S\mid\mathbf{O},\bar{\lambda})}{P(S\mid\mathbf{O},\lambda)}-1\right)=0$
    (using $\log x\le x-1$, Jensen's inequality; this quantity is the Kullback-Leibler (KL) distance between the two posteriors)
  – Therefore, for maximizing $\log P(\mathbf{O}\mid\bar{\lambda})$ we only need to maximize the Q-function (auxiliary function)
    $Q(\lambda,\bar{\lambda})=\sum_{S}P(S\mid\mathbf{O},\lambda)\log P(\mathbf{O},S\mid\bar{\lambda})$
    (the expectation of the complete-data log likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda=(\boldsymbol{\pi},\mathbf{A},\mathbf{B})$
  – By maximizing the auxiliary function
    $Q(\lambda,\bar{\lambda})=\sum_{S}P(S\mid\mathbf{O},\lambda)\log P(\mathbf{O},S\mid\bar{\lambda})
      =\sum_{S}\dfrac{P(\mathbf{O},S\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\log P(\mathbf{O},S\mid\bar{\lambda})$
  – Where $P(\mathbf{O},S\mid\lambda)$ and $\log P(\mathbf{O},S\mid\bar{\lambda})$ can be expressed as
    $P(\mathbf{O},S\mid\lambda)=\pi_{s_1}\prod_{t=1}^{T-1}a_{s_t s_{t+1}}\prod_{t=1}^{T}b_{s_t}(\mathbf{o}_t)$
    $\log P(\mathbf{O},S\mid\bar{\lambda})=\log\bar{\pi}_{s_1}+\sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}}+\sum_{t=1}^{T}\log\bar{b}_{s_t}(\mathbf{o}_t)$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as $Q(\lambda,\bar{\lambda})=Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}})+Q_{a}(\lambda,\bar{\mathbf{A}})+Q_{b}(\lambda,\bar{\mathbf{B}})$, where
    $Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}})=\sum_{i=1}^{N}\dfrac{P(\mathbf{O},s_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\log\bar{\pi}_i$
    $Q_{a}(\lambda,\bar{\mathbf{A}})=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\dfrac{P(\mathbf{O},s_t=i,s_{t+1}=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\log\bar{a}_{ij}$
    $Q_{b}(\lambda,\bar{\mathbf{B}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t,\ \mathbf{o}_t=v_k}\dfrac{P(\mathbf{O},s_t=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\log\bar{b}_j(v_k)$
  – Each of the three terms has the form $\sum_{i}w_i\log y_i$

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(v_k)$
  – They can be maximized individually
  – All are of the same form:
    $F(y_1,y_2,\ldots,y_N)=\sum_{j=1}^{N}w_j\log y_j$, where $\sum_{j=1}^{N}y_j=1$ and $y_j\ge 0$,
    has its maximum value when $y_j=\dfrac{w_j}{\sum_{j'=1}^{N}w_{j'}}$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply the Lagrange multiplier method
    Suppose $F=\sum_{j=1}^{N}w_j\log y_j+\ell\Big(\sum_{j=1}^{N}y_j-1\Big)$   (constraint: $\sum_{j=1}^{N}y_j=1$)
    $\dfrac{\partial F}{\partial y_j}=\dfrac{w_j}{y_j}+\ell=0\ \ \forall j
      \;\Rightarrow\; w_j=-\ell\,y_j
      \;\Rightarrow\; \sum_{j=1}^{N}w_j=-\ell\sum_{j=1}^{N}y_j=-\ell
      \;\Rightarrow\; y_j=\dfrac{w_j}{\sum_{j'=1}^{N}w_{j'}}$
  – Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\lambda}=(\bar{\boldsymbol{\pi}},\bar{\mathbf{A}},\bar{\mathbf{B}})$ can be expressed as
    $\bar{\pi}_i=\dfrac{P(\mathbf{O},s_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}=\gamma_1(i)$
    $\bar{a}_{ij}=\dfrac{\sum_{t=1}^{T-1}P(\mathbf{O},s_t=i,s_{t+1}=j\mid\lambda)/P(\mathbf{O}\mid\lambda)}{\sum_{t=1}^{T-1}P(\mathbf{O},s_t=i\mid\lambda)/P(\mathbf{O}\mid\lambda)}
      =\dfrac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$
    $\bar{b}_i(v_k)=\dfrac{\sum_{t,\ \mathbf{o}_t=v_k}P(\mathbf{O},s_t=i\mid\lambda)/P(\mathbf{O}\mid\lambda)}{\sum_{t=1}^{T}P(\mathbf{O},s_t=i\mid\lambda)/P(\mathbf{O}\mid\lambda)}
      =\dfrac{\sum_{t,\ \mathbf{o}_t=v_k}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}$

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
    $b_j(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,b_{jk}(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,N(\mathbf{o};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})
      =\sum_{k=1}^{M}c_{jk}\dfrac{1}{(2\pi)^{L/2}\lvert\boldsymbol{\Sigma}_{jk}\rvert^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}-\boldsymbol{\mu}_{jk})\Big)$,
    with $\sum_{k=1}^{M}c_{jk}=1$
  – (Figure: the output distribution for state i is a mixture of Gaussians $N_1, N_2, N_3$ with weights $w_{i1}, w_{i2}, w_{i3}$.)

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o}_t)$ with respect to each single mixture component $b_{jk}(\mathbf{o}_t)$:
    $p(\mathbf{O},S\mid\lambda)=\prod_{t=1}^{T}a_{s_{t-1}s_t}\,b_{s_t}(\mathbf{o}_t)
      =\prod_{t=1}^{T}a_{s_{t-1}s_t}\sum_{k=1}^{M}c_{s_t k}\,b_{s_t k}(\mathbf{o}_t)
      =\sum_{\mathbf{K}}\prod_{t=1}^{T}a_{s_{t-1}s_t}\,c_{s_t k_t}\,b_{s_t k_t}(\mathbf{o}_t)$   (with $a_{s_0 s_1}\equiv\pi_{s_1}$)
    $p(\mathbf{O}\mid\lambda)=\sum_{S}\sum_{\mathbf{K}}p(\mathbf{O},S,\mathbf{K}\mid\lambda)$
  – $\mathbf{K}=(k_1,k_2,\ldots,k_T)$ is one possible mixture component sequence along the state sequence $S$
  – Note: $\prod_{t=1}^{T}\Big(\sum_{k=1}^{M}a_{tk}\Big)
      =\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T}a_{t k_t}$
      (e.g. $(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+a_{22}+\cdots+a_{2M})\cdots$)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
    $Q(\lambda,\bar{\lambda})=\sum_{S}\sum_{\mathbf{K}}P(S,\mathbf{K}\mid\mathbf{O},\lambda)\log p(\mathbf{O},S,\mathbf{K}\mid\bar{\lambda})
      =\sum_{S}\sum_{\mathbf{K}}\dfrac{p(\mathbf{O},S,\mathbf{K}\mid\lambda)}{p(\mathbf{O}\mid\lambda)}\log p(\mathbf{O},S,\mathbf{K}\mid\bar{\lambda})$
    $\log p(\mathbf{O},S,\mathbf{K}\mid\bar{\lambda})=\log\bar{\pi}_{s_1}+\sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}}+\sum_{t=1}^{T}\log\bar{b}_{s_t k_t}(\mathbf{o}_t)+\sum_{t=1}^{T}\log\bar{c}_{s_t k_t}$
  – So $Q=Q_{\pi}+Q_{a}+Q_{b}+Q_{c}$: initial probabilities, state transition probabilities, mixture-component Gaussian density functions, and mixture weights

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:
    $Q_{b}(\lambda,\bar{\mathbf{B}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_t=j,k_t=k\mid\mathbf{O},\lambda)\log\bar{b}_{jk}(\mathbf{o}_t)$
    $Q_{c}(\lambda,\bar{\mathbf{c}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_t=j,k_t=k\mid\mathbf{O},\lambda)\log\bar{c}_{jk}$
    with $\gamma_t(j,k)=P(s_t=j,k_t=k\mid\mathbf{O},\lambda)$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Maximizing $Q_b$ with respect to the mean vector: let
    $L_b=\log b_{jk}(\mathbf{o}_t)
      =-\tfrac{L}{2}\log(2\pi)-\tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}_{jk}\rvert
       -\tfrac{1}{2}(\mathbf{o}_t-\boldsymbol{\mu}_{jk})^{T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\boldsymbol{\mu}_{jk})$
    $\dfrac{\partial Q_b}{\partial\bar{\boldsymbol{\mu}}_{jk}}
      =\sum_{t=1}^{T}\gamma_t(j,k)\,\dfrac{\partial L_b}{\partial\bar{\boldsymbol{\mu}}_{jk}}
      =\sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})=0
      \;\Rightarrow\;\bar{\boldsymbol{\mu}}_{jk}=\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  – (The matrix identity $\dfrac{\partial\,\mathbf{x}^{T}\mathbf{C}\mathbf{x}}{\partial\mathbf{x}}=(\mathbf{C}+\mathbf{C}^{T})\mathbf{x}$ is used; $\boldsymbol{\Sigma}_{jk}$ is symmetric)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximizing $Q_b$ with respect to the covariance matrix:
    $\dfrac{\partial Q_b}{\partial\bar{\boldsymbol{\Sigma}}_{jk}}
      =\sum_{t=1}^{T}\gamma_t(j,k)\,\dfrac{\partial}{\partial\bar{\boldsymbol{\Sigma}}_{jk}}
       \Big[-\tfrac{1}{2}\log\lvert\bar{\boldsymbol{\Sigma}}_{jk}\rvert
        -\tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})\Big]=0$
    $\Rightarrow\ \sum_{t=1}^{T}\gamma_t(j,k)\Big[\bar{\boldsymbol{\Sigma}}_{jk}-(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\Big]=0$
    $\Rightarrow\ \bar{\boldsymbol{\Sigma}}_{jk}=\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  – (The matrix identities $\dfrac{\partial\,\mathbf{a}^{T}\mathbf{X}\mathbf{b}}{\partial\mathbf{X}}=\mathbf{a}\mathbf{b}^{T}$ and $\dfrac{\partial\log\det(\mathbf{X})}{\partial\mathbf{X}}=(\mathbf{X}^{-1})^{T}$ are used; $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as
    $\bar{\boldsymbol{\mu}}_{jk}=\dfrac{\sum_{t=1}^{T}P(s_t=j,k_t=k\mid\mathbf{O},\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T}P(s_t=j,k_t=k\mid\mathbf{O},\lambda)}
      =\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$
    $\bar{\boldsymbol{\Sigma}}_{jk}=\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$
    $\bar{c}_{jk}=\dfrac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)}$

Page 4: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 4

Observable Markov Model

bull Observable Markov Model (Markov Chain)ndash First-order Markov chain of N states is a triple (SA)

bull S is a set of N statesbull A is the NN matrix of transition probabilities between states

P(st=j|st-1=i st-2=k helliphellip) asymp P(st=j|st-1=i) asymp Aij

bull is the vector of initial state probabilitiesj =P(s1=j)

ndash The output of the process is the set of states at each instant of time when each state corresponds to an observable event

ndash The output in any given state is not random (deterministic)

ndash Too simple to describe the speech signal characteristics

First-order and time-invariant assumptions

SP - Berlin Chen 5

Observable Markov Model (cont)

S1 S2

11 SSP 22 SSP

12 SSP

21 SSP

S1 S1S1 S2

S2 S1 S2S2

21 SSP

111 SSSP

222 SSSP

212 SSSP

221 SSSP

112 SSSP

121 SSSP

First-order Markov chain of 2 states

Second-order Markov chain of 2 states 122 SSSP

211 SSSP

(Prev State Cur State)

SP - Berlin Chen 6

Observable Markov Model (cont)

bull Example 1 A 3-state Markov Chain State 1 generates symbol A only State 2 generates symbol B only

and State 3 generates symbol C only

ndash Given a sequence of observed symbols O=CABBCABC the only one corresponding state sequence is S3S1S2S2S3S1S2S3 and the corresponding probability is

P(O|)=P(S3)P(S1|S3)P(S2|S1)P(S2|S2)P(S3|S2)P(S1|S3)P(S2|S1)P(S3|S2)=0103030702030302=000002268

105040 502030207010103060

As2 s3

A

B C

06

07

0301

02

0201

03

05

s1

SP - Berlin Chen 7

Observable Markov Model (cont)

bull Example 2 A three-state Markov chain for the Dow Jones Industrial average

030205

tiπ

The probability of 5 consecutive up days

006480605

11111days econsecutiv 54

111111111 aaaa

PupP

SP - Berlin Chen 8

Observable Markov Model (cont)

bull Example 3 Given a Markov model what is the mean occupancy duration of each state i

iiiiii

ii

d

dii

iiii

dii

dii

dii

iid

ii

i

aaaa

aa

aaadddPd

aa

iddP

11

111=

11

state ain duration ofnumber Expected1=

statein duration offunction massy probabilit

11

1

1

1

Time (Duration)

Probability

a geometric distribution

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

(Block diagram: Speech Signal → Feature Extraction → Feature Sequence X → a bank of word HMMs M1, M2, ..., MV plus a silence model MSil, each computing the likelihood p(X|Mk) → Most Likely Word Selector.)

Label(X) = arg max_k p(X | M_k)

Viterbi approximation:
Label(X) = arg max_k max_S p(X, S | M_k)

SP - Berlin Chen 71
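The selector above is just an arg max over per-word-model likelihoods; a small illustrative Python sketch (score_fn is an assumed name standing for any forward-probability or Viterbi scorer, and word_models maps word labels to their HMMs M_k plus the silence model):

  def recognize(X, word_models, score_fn):
      # Label(X) = argmax_k p(X | M_k); with a Viterbi scorer this is the
      # Viterbi approximation argmax_k max_S p(X, S | M_k).
      scores = {w: score_fn(X, m) for w, m in word_models.items()}
      return max(scores, key=scores.get)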

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the word recognition error rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
– Substitution: an incorrect word was substituted for the correct word
– Deletion: a correct word was omitted in the recognized sentence
– Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Can be handled by dynamic programming
• Example
  Correct: "the effect is clear"
  Recognized: "effect is not clear"
– Error analysis: one deletion ("the") and one insertion ("not")
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

Word Error Rate = 100% x (Sub + Del + Ins words) / (No. of words in the correct sentence) = (0+1+1)/4 = 50%   (might be higher than 100%)

Word Correction Rate = 100% x (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%

Word Accuracy Rate = 100% x (Matched - Ins words) / (No. of words in the correct sentence) = (3-1)/4 = 50%   (might be negative)

Note: WER + WAR = 100%.

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)
– One symbol denotes the word length of the correct/reference sentence, the other the word length of the recognized/test sentence
– The grid cell [i, j] stores the minimum word error alignment at that point (axes: Ref i, Test j)
(Figure: DP grid showing the possible kinds of alignment at each cell: hit or substitution along the diagonal, deletion and insertion along the axes.)

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen)

Step 1: Initialization
  G[0][0] = 0
  for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1 (Insertion, horizontal direction)
  for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2 (Deletion, vertical direction)

Step 2: Iteration
  for i = 1..n (test), for j = 1..m (reference):
    G[i][j] = min{ G[i-1][j] + 1 (Insertion),
                   G[i][j-1] + 1 (Deletion),
                   G[i-1][j-1] + 1  if LT[i] != LR[j] (Substitution),
                   G[i-1][j-1]      if LT[i] == LR[j] (Match) }
    B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical), 3 (Substitution, diagonal) or 4 (Match, diagonal), according to the chosen term

Step 3: Backtrace and Measure
  Word Error Rate = 100% x G[n][m] / m
  Word Accuracy Rate = 100% - Word Error Rate
  Optimal backtrace path: from B[n][m] back to B[0][0];
    if B[i][j] = 1, print LT[i] (Insertion), then go left;
    else if B[i][j] = 2, print LR[j] (Deletion), then go down;
    else print LR[j] (Hit/Match or Substitution), then go diagonally down.

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here. (Axes: Ref j, Test i.)

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

(Figure: alignment grid between the correct/reference word sequence and the recognized/test word sequence, indices i = 1..n and j = 1..m, starting at cell (0, 0); Ins and Del moves run along the two axes and the goal cell is (n, m).)

• A Dynamic Programming Algorithm
– Initialization

  grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
  grid[0][0].sub = grid[0][0].hit = 0;  grid[0][0].dir = NIL;
  for (i = 1; i <= n; i++) {            /* test */
    grid[i][0] = grid[i-1][0];  grid[i][0].dir = HOR;
    grid[i][0].score += InsPen;  grid[i][0].ins++;
  }
  for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j] = grid[0][j-1];  grid[0][j].dir = VERT;
    grid[0][j].score += DelPen;  grid[0][j].del++;
  }

(Figure: the first row of the grid accumulates 1 Ins, 2 Ins, 3 Ins, ... and the first column 1 Del, 2 Del, 3 Del, ...; each inner cell (i, j) is reached from (i-1, j-1), (i-1, j) or (i, j-1). HTK convention.)

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program

  for (i = 1; i <= n; i++) {                    /* test */
    gridi = grid[i];  gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {                  /* reference */
      h = gridi1[j].score + insPen;
      d = gridi1[j-1].score;  if (lRef[j] != lTest[i]) d += subPen;
      v = gridi[j-1].score + delPen;
      if (d <= h && d <= v) {                   /* DIAG = hit or sub */
        gridi[j] = gridi1[j-1];                 /* structure assignment */
        gridi[j].score = d;  gridi[j].dir = DIAG;
        if (lRef[j] == lTest[i]) ++gridi[j].hit;  else ++gridi[j].sub;
      } else if (h < v) {                       /* HOR = ins */
        gridi[j] = gridi1[j];  gridi[j].score = h;  gridi[j].dir = HOR;  ++gridi[j].ins;
      } else {                                  /* VERT = del */
        gridi[j] = gridi[j-1];  gridi[j].score = v;  gridi[j].dir = VERT;  ++gridi[j].del;
      }
    }  /* for j */
  }    /* for i */

(DP grid for Example 1 with (Ins, Del, Sub, Hit) counts accumulated in each cell; full table omitted.)

• Example 1
  Correct: A C B C C
  Test:    B A B C
  Alignment 1 (HTK backtrace): Ins B, Hit A, Del C, Hit B, Hit C, Del C      WER = (1+2+0)/5 = 60%
  (There is still another optimal alignment.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

(DP grid for Example 2 with (Ins, Del, Sub, Hit) counts accumulated in each cell; full table omitted.)

• Example 2
  Correct: A C B C C
  Test:    B A A C
  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C      WER = (1+2+1)/5 = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C      WER = (1+2+1)/5 = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C             WER = (0+1+3)/5 = 80%

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79
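As a complement to the C fragments above, here is a compact, self-contained Python sketch of the same dynamic-programming alignment with configurable penalties (the HTK values are the defaults, the NIST values can be passed instead; the function name and interface are illustrative, not taken from HTK):

  def align(ref, hyp, sub_pen=10, del_pen=7, ins_pen=7):
      # Align the recognized word list `hyp` against the reference `ref`;
      # returns (#sub, #del, #ins, #hit) of one minimum-cost alignment.
      n, m = len(hyp), len(ref)
      score = [[0] * (m + 1) for _ in range(n + 1)]
      back = [[None] * (m + 1) for _ in range(n + 1)]
      for i in range(1, n + 1):
          score[i][0] = score[i - 1][0] + ins_pen; back[i][0] = 'I'
      for j in range(1, m + 1):
          score[0][j] = score[0][j - 1] + del_pen; back[0][j] = 'D'
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              d = score[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else sub_pen)
              h = score[i - 1][j] + ins_pen          # insertion: consume a test word
              v = score[i][j - 1] + del_pen          # deletion: consume a reference word
              score[i][j], back[i][j] = min((d, 'DIAG'), (h, 'I'), (v, 'D'))
      sub = dele = ins = hit = 0
      i, j = n, m
      while i > 0 or j > 0:                          # backtrace and count error types
          if back[i][j] == 'DIAG':
              if hyp[i - 1] == ref[j - 1]: hit += 1
              else: sub += 1
              i, j = i - 1, j - 1
          elif back[i][j] == 'I': ins += 1; i -= 1
          else: dele += 1; j -= 1
      return sub, dele, ins, hit

For the earlier example, align("the effect is clear".split(), "effect is not clear".split()) yields 0 substitutions, 1 deletion, 1 insertion and 3 hits, i.e. WER = (0+1+1)/4 = 50%.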

Homework 3
• Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results -----------------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
=============================================================================
------------------------ Overall Results -----------------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
=============================================================================
------------------------ Overall Results -----------------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
=============================================================================
------------------------ Overall Results -----------------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
=============================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

(Figure: two bottles A and B containing red (R) and green (G) balls. Observed data O: the "ball sequence"; latent data S: the "bottle sequence". The parameters to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).)

(Figure: a 3-state ergodic HMM with emission distributions A:.3/B:.2/C:.5, A:.7/B:.1/C:.2, A:.3/B:.6/C:.1 and the transition probabilities shown; its parameters λ are re-estimated from o1 o2 ... oT to a new λ̄ such that p(O|λ̄) > p(O|λ).)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
  • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence S is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence
– Two major steps:
  • E: take the expectation with respect to the latent data S, using the current estimate of the parameters λ and conditioned on the observations O
  • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84
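Written out, one iteration of the two steps just described looks as follows (a sketch in the notation used on the next slides, with λ the current estimate and λ̄ the candidate re-estimate):

  \[
  \begin{aligned}
  \textbf{E-step:}\quad & Q(\lambda,\bar{\lambda})
      = E_{S\mid \mathbf{O},\lambda}\bigl[\log P(\mathbf{O},S\mid\bar{\lambda})\bigr]
      = \sum_{S} P(S\mid \mathbf{O},\lambda)\,\log P(\mathbf{O},S\mid\bar{\lambda}) \\
  \textbf{M-step:}\quad & \bar{\lambda}^{*}
      = \arg\max_{\bar{\lambda}} Q(\lambda,\bar{\lambda}),
      \qquad\text{then set } \lambda \leftarrow \bar{\lambda}^{*}
      \text{ and repeat until } P(\mathbf{O}\mid\lambda)\text{ converges.}
  \end{aligned}
  \]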

The EM Algorithm (3/7)

• Estimation principle based on observations X = X_1 X_2 ... X_n, with samples x = x_1 x_2 ... x_n

– The Maximum Likelihood (ML) Principle:
  find the model parameter Φ so that the likelihood p(x|Φ) is maximum.
  For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates are
    μ_ML = (1/n) Σ_{i=1}^{n} x_i
    Σ_ML = (1/n) Σ_{i=1}^{n} (x_i - μ_ML)(x_i - μ_ML)^T

– The Maximum A Posteriori (MAP) Principle:
  find the model parameter Φ so that the posterior p(Φ|x) is maximum.

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
– Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)
• First, scalar (discrete) random variables are used to introduce the EM algorithm
– The observable training data O
  • We want to maximize P(O|λ); λ is a parameter vector
– The hidden (unobservable) data S
  • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S|λ), to compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?
– Algorithm
  • Log-likelihood expression and expectation taken over S:
    By Bayes' rule,  P(O, S|λ̄) = P(S|O, λ̄) P(O|λ̄)
    so  log P(O|λ̄) = log P(O, S|λ̄) - log P(S|O, λ̄)
    Taking the expectation over S with respect to P(S|O, λ) (the current model λ and the observations O), for an unknown model setting λ̄:
    log P(O|λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄) - Σ_S P(S|O, λ) log P(S|O, λ̄)
    where log P(O|λ̄) is the incomplete-data likelihood and log P(O, S|λ̄) is the complete-data likelihood.

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express log P(O|λ̄) as follows:
    log P(O|λ̄) = Q(λ, λ̄) - H(λ, λ̄)
    where
    Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
    H(λ, λ̄) = Σ_S P(S|O, λ) log P(S|O, λ̄)
  • We want log P(O|λ̄) ≥ log P(O|λ), i.e.
    log P(O|λ̄) - log P(O|λ) = [Q(λ, λ̄) - Q(λ, λ)] + [H(λ, λ) - H(λ, λ̄)]

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property:
  H(λ, λ̄) - H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ̄) / P(S|O, λ) ]
                      ≤ Σ_S P(S|O, λ) [ P(S|O, λ̄) / P(S|O, λ) - 1 ]      (Jensen's inequality: log x ≤ x - 1)
                      = Σ_S P(S|O, λ̄) - Σ_S P(S|O, λ) = 0
  (the negative of this difference is the Kullback-Leibler (KL) distance)
– Therefore, for maximizing log P(O|λ̄), we only need to maximize the Q-function (auxiliary function)
  Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
  i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
– By maximizing the auxiliary function
  Q(λ, λ̄) = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ̄)
– Where log P(O, S|λ) and log P(O, S|λ̄) can be expressed as
  P(O, S|λ) = π_{s1} Π_{t=1}^{T-1} a_{s_t s_{t+1}} Π_{t=1}^{T} b_{s_t}(o_t)
  log P(O, S|λ) = log π_{s1} + Σ_{t=1}^{T-1} log a_{s_t s_{t+1}} + Σ_{t=1}^{T} log b_{s_t}(o_t)
  log P(O, S|λ̄) = log π̄_{s1} + Σ_{t=1}^{T-1} log ā_{s_t s_{t+1}} + Σ_{t=1}^{T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄), where

  Q_π(λ, π̄) = Σ_{i=1}^{N} [ P(O, s_1 = i|λ) / P(O|λ) ] log π̄_i

  Q_a(λ, ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j|λ) / P(O|λ) ] log ā_{ij}

  Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: o_t = v_k} [ P(O, s_t = j|λ) / P(O|λ) ] log b̄_j(v_k)

  Each of these sums has the form Σ_i w_i log y_i.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_{ij} and b̄_j(k)
– They can be maximized individually
– All are of the same form:
  F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j,  where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0,
  has its maximum value when  y_j = w_j / Σ_{j=1}^{N} w_j.

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
  Suppose  F = Σ_{j=1}^{N} w_j log y_j + ℓ (1 - Σ_{j=1}^{N} y_j)      (constraint: Σ_{j=1}^{N} y_j = 1)
  By applying the Lagrange multiplier ℓ:
  ∂F/∂y_j = w_j / y_j - ℓ = 0   =>   w_j = ℓ y_j,  for all j
  Summing over j:  Σ_{j=1}^{N} w_j = ℓ Σ_{j=1}^{N} y_j = ℓ
  Therefore  y_j = w_j / Σ_{j=1}^{N} w_j.

  Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as

  π̄_i = P(O, s_1 = i|λ) / P(O|λ)

  ā_{ij} = [ Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j|λ) / P(O|λ) ] / [ Σ_{t=1}^{T-1} P(O, s_t = i|λ) / P(O|λ) ]

  b̄_i(k) = [ Σ_{t=1, s.t. o_t = v_k}^{T} P(O, s_t = i|λ) / P(O|λ) ] / [ Σ_{t=1}^{T} P(O, s_t = i|λ) / P(O|λ) ]

SP - Berlin Chen 94
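A minimal NumPy sketch of one such re-estimation iteration for a single training sequence (illustrative only: no log/scaling tricks, so it is usable only for short sequences; the forward/backward variables follow the recursions given on the earlier slides):

  import numpy as np

  def baum_welch_step(pi, A, B, obs):
      # pi: (N,), A: (N,N), B: (N,M) with B[j,k] = b_j(v_k); obs: list of codeword indices.
      T, N = len(obs), len(pi)
      alpha = np.zeros((T, N))                       # alpha_t(i) = P(o_1..o_t, s_t=i | lambda)
      alpha[0] = pi * B[:, obs[0]]
      for t in range(1, T):
          alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
      beta = np.zeros((T, N))                        # beta_t(i) = P(o_{t+1}..o_T | s_t=i, lambda)
      beta[-1] = 1.0
      for t in range(T - 2, -1, -1):
          beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
      PO = alpha[-1].sum()                           # P(O | lambda)
      gamma = alpha * beta / PO                      # gamma_t(i) = P(s_t=i | O, lambda)
      xi = (alpha[:-1, :, None] * A[None] *          # xi_t(i,j) = P(s_t=i, s_{t+1}=j | O, lambda)
            (B[:, obs[1:]].T * beta[1:])[:, None, :]) / PO
      new_pi = gamma[0]
      new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
      new_B = np.zeros_like(B)
      for k in range(B.shape[1]):
          new_B[:, k] = gamma[np.asarray(obs) == k].sum(axis=0)
      new_B /= gamma.sum(axis=0)[:, None]
      return new_pi, new_A, new_B

For multiple training utterances, the numerators and denominators above are simply accumulated over all utterances before dividing, as in the multiple-utterance formulas given earlier in the lecture.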

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of the state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)

  b_j(o) = Σ_{k=1}^{M} c_{jk} b_{jk}(o) = Σ_{k=1}^{M} c_{jk} N(o; μ_{jk}, Σ_{jk})
         = Σ_{k=1}^{M} c_{jk} (2π)^{-L/2} |Σ_{jk}|^{-1/2} exp( -(1/2)(o - μ_{jk})^T Σ_{jk}^{-1} (o - μ_{jk}) ),
  with Σ_{k=1}^{M} c_{jk} = 1.

(Figure: the distribution for state i as a weighted sum of Gaussians N1, N2, N3 with mixture weights w_{i1}, w_{i2}, w_{i3}.)

SP - Berlin Chen 95
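A small illustrative NumPy sketch (names not from the lecture, full covariances assumed) of evaluating this mixture state output probability, together with the within-state mixture posterior that, multiplied by the state occupancy, gives the γ_t(j, k) used on the following slides:

  import numpy as np

  def mixture_state_likelihood(o, c, mu, cov):
      # b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk)
      # c: (M,) weights, mu: (M, L) means, cov: (M, L, L) covariance matrices.
      L = len(o)
      comps = np.empty(len(c))
      for k in range(len(c)):
          diff = o - mu[k]
          norm = (2 * np.pi) ** (-L / 2) * np.linalg.det(cov[k]) ** (-0.5)
          comps[k] = c[k] * norm * np.exp(-0.5 * diff @ np.linalg.inv(cov[k]) @ diff)
      b = comps.sum()
      return b, comps / b      # b_j(o) and P(mixture k | o, state j)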

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_{jk}(o_t):

  p(O|λ) = Σ_S π_{s1} Π_{t=1}^{T-1} a_{s_t s_{t+1}} Π_{t=1}^{T} b_{s_t}(o_t)
         = Σ_S π_{s1} Π_{t=1}^{T-1} a_{s_t s_{t+1}} Π_{t=1}^{T} [ Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t) ]
         = Σ_S Σ_K p(O, S, K|λ),
  where
  p(O, S, K|λ) = π_{s1} Π_{t=1}^{T-1} a_{s_t s_{t+1}} Π_{t=1}^{T} c_{s_t k_t} b_{s_t k_t}(o_t)
  and K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along the state sequence S.

  Note (interchanging product and sum):
  Π_{t=1}^{T} Σ_{k=1}^{M} a_{t k} = (a_{11} + a_{12} + ... + a_{1M})(a_{21} + a_{22} + ... + a_{2M}) ... (a_{T1} + ... + a_{TM})
                                  = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

  Q(λ, λ̄) = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] log p(O, S, K|λ̄)

  with

  log p(O, S, K|λ̄) = log π̄_{s1} + Σ_{t=1}^{T-1} log ā_{s_t s_{t+1}} + Σ_{t=1}^{T} log b̄_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c̄_{s_t k_t}

  so that Q = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄):
  initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights.

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

  Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k|O, λ) log b̄_{jk}(o_t)

  Q_c(λ, c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k|O, λ) log c̄_{jk}

  where γ_t(j, k) = P(s_t = j, k_t = k|O, λ) is the probability of being in state j at time t with the k-th mixture component accounting for o_t.

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

  Let γ_t(j, k) = P(s_t = j, k_t = k|O, λ).  With

  b̄_{jk}(o_t) = (2π)^{-L/2} |Σ̄_{jk}|^{-1/2} exp( -(1/2)(o_t - μ̄_{jk})^T Σ̄_{jk}^{-1} (o_t - μ̄_{jk}) )

  log b̄_{jk}(o_t) = -(L/2) log(2π) - (1/2) log|Σ̄_{jk}| - (1/2)(o_t - μ̄_{jk})^T Σ̄_{jk}^{-1} (o_t - μ̄_{jk})

  set the derivative of Q_b(λ, b̄) = Σ_{t} Σ_{j} Σ_{k} γ_t(j, k) log b̄_{jk}(o_t) with respect to μ̄_{jk} to zero
  (using d(x^T C x)/dx = (C + C^T)x, with Σ̄_{jk} symmetric):

  ∂Q_b/∂μ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) Σ̄_{jk}^{-1} (o_t - μ̄_{jk}) = 0

  =>  μ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

  Similarly, set the derivative of Q_b with respect to Σ̄_{jk}^{-1} to zero
  (using d log|det X| / dX = (X^{-1})^T and d(a^T X b)/dX = a b^T, with Σ̄_{jk} symmetric):

  ∂Q_b/∂Σ̄_{jk}^{-1} = Σ_{t=1}^{T} γ_t(j, k) [ (1/2) Σ̄_{jk} - (1/2)(o_t - μ̄_{jk})(o_t - μ̄_{jk})^T ] = 0

  =>  Σ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̄_{jk})(o_t - μ̄_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

  μ̄_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k|O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k|O, λ)
          = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

  Σ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̄_{jk})(o_t - μ̄_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

  c̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
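A minimal NumPy sketch of these closed-form updates for one state j, given its mixture occupation probabilities from the E-step (layout and names are illustrative):

  import numpy as np

  def reestimate_mixtures(gammas, obs):
      # gammas: (T, M) with gammas[t, k] = gamma_t(j, k); obs: (T, L) observation vectors.
      denom = gammas.sum(axis=0)                     # sum_t gamma_t(j,k)
      c = denom / gammas.sum()                       # new mixture weights c_jk
      mu = (gammas.T @ obs) / denom[:, None]         # new means mu_jk
      cov = []
      for k in range(gammas.shape[1]):
          diff = obs - mu[k]
          cov.append((gammas[:, k, None] * diff).T @ diff / denom[k])   # new Sigma_jk
      return c, mu, np.array(cov)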

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  - By maximizing the auxiliary function
    $$Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}|\bar{\lambda}) = \sum_{\mathbf{S}} \frac{P(\mathbf{O},\mathbf{S}|\lambda)}{P(\mathbf{O}|\lambda)}\,\log P(\mathbf{O},\mathbf{S}|\bar{\lambda})$$
  - where P(O, S|λ) and log P(O, S|λ̄) can be expressed as
    $$P(\mathbf{O},\mathbf{S}|\lambda) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(\mathbf{o}_t)$$
    $$\log P(\mathbf{O},\mathbf{S}|\bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1}\log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T}\log \bar{b}_{s_t}(\mathbf{o}_t)$$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as
  $$Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a}) + Q_{b}(\lambda,\bar{b})$$
  where
  $$Q_{\pi}(\lambda,\bar{\pi}) = \sum_{i=1}^{N} \frac{P(\mathbf{O}, s_1=i\,|\,\lambda)}{P(\mathbf{O}|\lambda)}\,\log \bar{\pi}_i$$
  $$Q_{a}(\lambda,\bar{a}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1} \frac{P(\mathbf{O}, s_t=i, s_{t+1}=j\,|\,\lambda)}{P(\mathbf{O}|\lambda)}\,\log \bar{a}_{ij}$$
  $$Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t:\,\mathbf{o}_t=v_k} \frac{P(\mathbf{O}, s_t=j\,|\,\lambda)}{P(\mathbf{O}|\lambda)}\,\log \bar{b}_j(v_k)$$
  - Each term is a sum of the form Σ_i w_i log y_i, with fixed weights w_i and probabilities y_i to be optimized

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_ij and b̄_j(k)
  - They can be maximized individually
  - All are of the same form:
    $$F(y_1,y_2,\ldots,y_N) = \sum_{j=1}^{N} w_j\log y_j\,,\qquad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \geq 0,$$
    has its maximum value when
    $$y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$
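A quick numerical illustration (not in the original slides): with N = 3 and weights w = (6, 3, 1), the maximum of F is attained at the normalized weights

$$y_1 = \frac{6}{6+3+1} = 0.6,\qquad y_2 = \frac{3}{10} = 0.3,\qquad y_3 = \frac{1}{10} = 0.1,$$

which is why each re-estimation formula on the following slides is simply a ratio of expected counts.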

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
  - Introduce a Lagrange multiplier l for the constraint Σ_{j=1}^{N} y_j = 1 and suppose that
    $$F = \sum_{j=1}^{N} w_j\log y_j + l\Big(\sum_{j=1}^{N} y_j - 1\Big)$$
  - Setting the partial derivatives to zero,
    $$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + l = 0 \;\Rightarrow\; w_j = -l\,y_j \;\Rightarrow\; \sum_{j=1}^{N} w_j = -l\sum_{j=1}^{N} y_j = -l \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$
  - Lagrange multipliers: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as
  $$\bar{\pi}_i = \frac{P(\mathbf{O}, s_1=i\,|\,\lambda)}{P(\mathbf{O}|\lambda)} = \gamma_1(i)$$
  $$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t=i, s_{t+1}=j\,|\,\lambda)\,/\,P(\mathbf{O}|\lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t=i\,|\,\lambda)\,/\,P(\mathbf{O}|\lambda)} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$$
  $$\bar{b}_i(k) = \frac{\sum_{t:\,\mathbf{o}_t=v_k} P(\mathbf{O}, s_t=i\,|\,\lambda)\,/\,P(\mathbf{O}|\lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, s_t=i\,|\,\lambda)\,/\,P(\mathbf{O}|\lambda)} = \frac{\sum_{t:\,\mathbf{o}_t=v_k}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}$$
  (γ_t(i) and ξ_t(i, j) are the state-occupation and state-transition posteriors introduced earlier)
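A compact numpy sketch of these re-estimation formulas, assuming the posteriors gamma[t, i] = P(s_t = i | O, λ) and xi[t, i, j] = P(s_t = i, s_{t+1} = j | O, λ) have already been obtained from the forward-backward procedure (the array names are illustrative):

import numpy as np

def reestimate_discrete_hmm(gamma, xi, obs, M):
    # One Baum-Welch M-step for a discrete HMM.
    #   gamma : (T, N) state-occupation posteriors
    #   xi    : (T-1, N, N) state-transition posteriors
    #   obs   : (T,) observation symbol indices in {0, ..., M-1}
    T, N = gamma.shape
    pi_new = gamma[0]                                           # pi_i = gamma_1(i)
    a_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i)
    b_new = np.zeros((N, M))
    for k in range(M):
        b_new[:, k] = gamma[obs == k].sum(axis=0)               # expected count of symbol k in state i
    b_new /= gamma.sum(axis=0)[:, None]                         # normalize by total time spent in state i
    return pi_new, a_new, b_new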

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  - The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  - The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  - The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)
  - Distribution for state j:
    $$b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\,b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\,N(\mathbf{o};\mu_{jk},\Sigma_{jk}) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{L/2}\,|\Sigma_{jk}|^{1/2}}\exp\Big(-\frac{1}{2}(\mathbf{o}-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}-\mu_{jk})\Big)$$
    with the mixture weights satisfying $\sum_{k=1}^{M} c_{jk} = 1$
  [Figure: the output distribution of a state drawn as a weighted sum of three Gaussians N_1, N_2, N_3 with weights w_i1, w_i2, w_i3]
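A small numpy sketch of this state-output density, restricted to diagonal covariances for brevity (the slides allow full covariance matrices); it returns log b_j(o) and uses log-sum-exp over the mixture components for numerical stability:

import numpy as np

def gmm_log_output_prob(o, c, mu, var):
    # log b_j(o) for one state: weights c (M,), means mu (M, L), diagonal variances var (M, L)
    diff = o - mu                                                                   # (M, L)
    log_comp = -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=1)   # log N(o; mu_k, var_k)
    a = np.log(c) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())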

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_jk(o_t):
  $$p(\mathbf{O},\mathbf{S}|\lambda) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\,b_{s_t}(\mathbf{o}_t) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\sum_{k=1}^{M} c_{s_t k}\,b_{s_t k}(\mathbf{o}_t) = \sum_{\mathbf{K}} \prod_{t=1}^{T} a_{s_{t-1}s_t}\,c_{s_t k_t}\,b_{s_t k_t}(\mathbf{o}_t)$$
  (here $a_{s_0 s_1}$ is understood as the initial probability $\pi_{s_1}$)
• Therefore
  $$p(\mathbf{O}|\lambda) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} p(\mathbf{O},\mathbf{S},\mathbf{K}|\lambda)\,,\qquad p(\mathbf{O},\mathbf{S},\mathbf{K}|\lambda) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\,c_{s_t k_t}\,b_{s_t k_t}(\mathbf{o}_t)$$
  where $\mathbf{K} = (k_1, k_2, \ldots, k_T)$ is one of the possible mixture-component sequences along the state sequence $\mathbf{S}$
• Note (a product of sums expands into a sum of products):
  $$\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k} = (a_{1,1}+\cdots+a_{1,M})(a_{2,1}+\cdots+a_{2,M})\cdots(a_{T,1}+\cdots+a_{T,M}) = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t,k_t}$$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
  $$Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} P(\mathbf{S},\mathbf{K}|\mathbf{O},\lambda)\,\log p(\mathbf{O},\mathbf{S},\mathbf{K}|\bar{\lambda}) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} \frac{p(\mathbf{O},\mathbf{S},\mathbf{K}|\lambda)}{p(\mathbf{O}|\lambda)}\,\log p(\mathbf{O},\mathbf{S},\mathbf{K}|\bar{\lambda})$$
  with
  $$\log p(\mathbf{O},\mathbf{S},\mathbf{K}|\bar{\lambda}) = \sum_{t=1}^{T}\log \bar{a}_{s_{t-1}s_t} + \sum_{t=1}^{T}\log \bar{c}_{s_t k_t} + \sum_{t=1}^{T}\log \bar{b}_{s_t k_t}(\mathbf{o}_t)$$
  so the auxiliary function again splits into independent terms,
  $$Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a}) + Q_{b}(\lambda,\bar{b}) + Q_{c}(\lambda,\bar{c})$$
  for the initial probabilities, the state-transition probabilities, the Gaussian density functions, and the mixture weights, respectively

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:
  $$Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j, k_t=k\,|\,\mathbf{O},\lambda)\,\log \bar{b}_{jk}(\mathbf{o}_t)$$
  $$Q_{c}(\lambda,\bar{c}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j, k_t=k\,|\,\mathbf{O},\lambda)\,\log \bar{c}_{jk}$$
  where $\gamma_t(j,k) = P(s_t=j, k_t=k\,|\,\mathbf{O},\lambda)$ is the probability of being in state j at time t with the k-th mixture component accounting for $\mathbf{o}_t$
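A sketch of how this per-mixture posterior can be computed from the ordinary state posterior gamma[t, j] and the mixture parameters, again assuming diagonal covariances (array names are illustrative):

import numpy as np

def mixture_posteriors(gamma, O, c, mu, var):
    # gamma_t(j,k) = gamma_t(j) * c_jk N(o_t; mu_jk, var_jk) / sum_m c_jm N(o_t; mu_jm, var_jm)
    #   gamma : (T, N) state posteriors, O : (T, L) observations
    #   c : (N, M) mixture weights, mu / var : (N, M, L) diagonal Gaussians
    diff = O[:, None, None, :] - mu[None, :, :, :]                                   # (T, N, M, L)
    log_comp = -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=-1)   # (T, N, M)
    comp = c[None] * np.exp(log_comp)                 # weighted component likelihoods
    comp /= comp.sum(axis=-1, keepdims=True)          # P(k_t = k | s_t = j, o_t)
    return gamma[:, :, None] * comp                   # (T, N, M) array of gamma_t(j, k)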

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Maximizing Q_b with respect to the mean of each mixture component:
  - Let each single Gaussian component be
    $$\bar{b}_{jk}(\mathbf{o}_t) = \frac{1}{(2\pi)^{L/2}\,|\bar{\Sigma}_{jk}|^{1/2}}\exp\Big(-\frac{1}{2}(\mathbf{o}_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk})\Big)$$
    $$\log \bar{b}_{jk}(\mathbf{o}_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log|\bar{\Sigma}_{jk}| - \frac{1}{2}(\mathbf{o}_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk})$$
  - Differentiating and setting the result to zero (using $\frac{d}{d\mathbf{x}}\,\mathbf{x}^{T}\mathbf{C}\,\mathbf{x} = (\mathbf{C}+\mathbf{C}^{T})\,\mathbf{x}$, with $\bar{\Sigma}_{jk}^{-1}$ symmetric here):
    $$\frac{\partial Q_b(\lambda,\bar{b})}{\partial \bar{\mu}_{jk}} = \sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk}) = 0 \;\Rightarrow\; \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximizing Q_b with respect to the covariance of each mixture component:
  - Differentiating with respect to $\bar{\Sigma}_{jk}$ (conveniently, with respect to $\bar{\Sigma}_{jk}^{-1}$) and setting the result to zero,
    $$\frac{\partial Q_b(\lambda,\bar{b})}{\partial \bar{\Sigma}_{jk}^{-1}} = \sum_{t=1}^{T}\gamma_t(j,k)\Big[\frac{1}{2}\bar{\Sigma}_{jk} - \frac{1}{2}(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}\Big] = 0$$
    $$\Rightarrow\; \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
  - (using $\frac{\partial}{\partial \mathbf{X}}\log\det(\mathbf{X}) = (\mathbf{X}^{-1})^{T}$, $\frac{\partial}{\partial \mathbf{X}}\,\mathbf{a}^{T}\mathbf{X}\,\mathbf{b} = \mathbf{a}\,\mathbf{b}^{T}$, $\det(\mathbf{X}^{-1}) = 1/\det(\mathbf{X})$, and the symmetry of $\bar{\Sigma}_{jk}$)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as
  $$\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} p(s_t=j, k_t=k\,|\,\mathbf{O},\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T} p(s_t=j, k_t=k\,|\,\mathbf{O},\lambda)} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
  $$\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
  $$\bar{c}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k'=1}^{M}\gamma_t(j,k')}$$
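A numpy sketch of this M-step for the mixture parameters, assuming the per-mixture posteriors gamma_jk[t, j, k] from the earlier sketch and full covariance matrices (array names are illustrative):

import numpy as np

def reestimate_mixtures(gamma_jk, O):
    # M-step for the Gaussian-mixture state-output parameters.
    #   gamma_jk : (T, N, M) posteriors gamma_t(j, k)
    #   O        : (T, L) observation vectors
    occ = gamma_jk.sum(axis=0)                                   # (N, M) expected occupancy of each mixture
    c_new = occ / occ.sum(axis=1, keepdims=True)                 # mixture weights c_jk
    mu_new = np.einsum('tjk,tl->jkl', gamma_jk, O) / occ[..., None]              # means mu_jk
    diff = O[:, None, None, :] - mu_new[None]                    # (T, N, M, L)
    sigma_new = np.einsum('tjk,tjkl,tjkm->jklm', gamma_jk, diff, diff) / occ[..., None, None]  # covariances
    return c_new, mu_new, sigma_new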


The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 7: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 7

Observable Markov Model (cont)

bull Example 2 A three-state Markov chain for the Dow Jones Industrial average

030205

tiπ

The probability of 5 consecutive up days

006480605

11111days econsecutiv 54

111111111 aaaa

PupP

SP - Berlin Chen 8

Observable Markov Model (cont)

bull Example 3 Given a Markov model what is the mean occupancy duration of each state i

iiiiii

ii

d

dii

iiii

dii

dii

dii

iid

ii

i

aaaa

aa

aaadddPd

aa

iddP

11

111=

11

state ain duration ofnumber Expected1=

statein duration offunction massy probabilit

11

1

1

1

Time (Duration)

Probability

a geometric distribution

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)
  – The joint output probability along the path S:
    • By the output-independence assumption, the probability that a particular observation symbol/vector is emitted at time t depends only on the state st and is conditionally independent of the past observations:

      P(O|S, λ) = P(o1, ..., oT | s1, ..., sT, λ)
                = P(o1|s1, λ) Π_{t=2}^{T} P(ot | o1, ..., o_{t-1}, s1, ..., sT, λ)   (chain rule)
                = Π_{t=1}^{T} P(ot | st, λ)                                           (output independence)
                = Π_{t=1}^{T} b_{st}(ot)

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)

    P(O|λ) = Σ_{all S} P(S|λ) P(O|S, λ)
           = Σ_{s1, s2, ..., sT} π_{s1} b_{s1}(o1) a_{s1 s2} b_{s2}(o2) ... a_{s_{T-1} s_T} b_{sT}(oT)

    where  P(ot | st, λ) = b_{st}(ot)

  – Huge computation requirements: O(N^T)
    • Complexity: (2T-1)·N^T multiplications and N^T − 1 additions (exponential computational complexity)
  • A more efficient algorithm can be used to evaluate P(O|λ):
    – the Forward/Backward Procedure (Algorithm)

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.): State-time Trellis Diagram

  [Figure: a 3-state trellis (states s1, s2, s3) unrolled over time 1, 2, 3, ..., T-1, T with observations
   O1, O2, O3, ..., OT-1, OT; a node si at time t denotes that bi(ot) has been computed, and an arc
   denotes that the corresponding aij has been computed]

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

• Based on the HMM assumptions, the calculation of P(st | s_{t-1}) and P(ot | st) involves only s_{t-1}, st, and ot, so it is possible to compute the likelihood with a recursion on t

• Forward variable:  α_t(i) = P(o1, o2, ..., ot, st = i | λ)
  – The probability that the HMM is in state i at time t, having generated the partial observation o1 o2 ... ot

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

• Algorithm

  1. Initialization:  α_1(i) = π_i b_i(o1),  1 ≤ i ≤ N
  2. Induction:       α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}),  1 ≤ t ≤ T-1,  1 ≤ j ≤ N
  3. Termination:     P(O|λ) = Σ_{i=1}^{N} α_T(i)

  – Complexity: O(N²T)
    (about N(N+1)(T-1)+N multiplications and N(N-1)(T-1)+(N-1) additions)

• Based on the lattice (trellis) structure
  – Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
• All state sequences, regardless of how long previously, merge to N nodes (states) at each time instance t
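A minimal sketch of the forward procedure for a discrete-observation HMM; the function name and array layout are illustrative assumptions, and the probabilities would normally be rescaled or kept in the log domain (see the later slides on probability addition) to avoid underflow:

    /* Forward procedure for a discrete HMM.
     * pi[i]   : initial state probabilities     (N)
     * a[i][j] : transition probabilities        (N x N)
     * b[j][k] : discrete output probabilities   (N x M)
     * obs[t]  : observation symbol indices      (T)
     * alpha   : caller-allocated T x N array, filled with the forward variables.
     * Returns P(O | lambda). */
    double forward(int N, int T, const double *pi, double **a, double **b,
                   const int *obs, double **alpha)
    {
        for (int i = 0; i < N; i++)                       /* 1. initialization */
            alpha[0][i] = pi[i] * b[i][obs[0]];

        for (int t = 0; t < T - 1; t++)                   /* 2. induction */
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int i = 0; i < N; i++)
                    sum += alpha[t][i] * a[i][j];
                alpha[t + 1][j] = sum * b[j][obs[t + 1]];
            }

        double p = 0.0;                                   /* 3. termination */
        for (int i = 0; i < N; i++)
            p += alpha[T - 1][i];
        return p;
    }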

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

  Derivation of the induction step:

    α_{t+1}(j) = P(o1, ..., ot, o_{t+1}, s_{t+1} = j | λ)
              = P(o_{t+1} | o1, ..., ot, s_{t+1} = j, λ) P(o1, ..., ot, s_{t+1} = j | λ)
              = P(o_{t+1} | s_{t+1} = j, λ) P(o1, ..., ot, s_{t+1} = j | λ)                      (output-independence assumption)
              = b_j(o_{t+1}) Σ_{i=1}^{N} P(o1, ..., ot, st = i, s_{t+1} = j | λ)
              = b_j(o_{t+1}) Σ_{i=1}^{N} P(s_{t+1} = j | o1, ..., ot, st = i, λ) P(o1, ..., ot, st = i | λ)
              = b_j(o_{t+1}) Σ_{i=1}^{N} P(s_{t+1} = j | st = i, λ) P(o1, ..., ot, st = i | λ)   (first-order Markov assumption)
              = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1})

  (using  P(A, B) = P(A|B) P(B)  and  P(A) = Σ_{all B} P(A, B))

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

• α_3(3) = P(o1, o2, o3, s3 = 3 | λ) = [ α_2(1) a_13 + α_2(2) a_23 + α_2(3) a_33 ] b_3(o3)

  [Figure: the same 3-state state-time trellis; a node si at time t denotes that bi(ot) has been computed,
   and an arc denotes that the corresponding aij has been computed]

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

  [Figure: forward trellis for the Dow Jones example; transition probabilities shown include 0.6, 0.5, 0.4, 0.7, 0.1, 0.3, and one forward cell is computed as]

  (0.6×0.35 + 0.5×0.02 + 0.4×0.09)×0.7 = 0.1792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

• Backward variable:  β_t(i) = P(o_{t+1}, o_{t+2}, ..., oT | st = i, λ)

  1. Initialization:  β_T(i) = 1,  1 ≤ i ≤ N
  2. Induction:       β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T-1, ..., 1,  1 ≤ i ≤ N
  3. Termination:     P(O|λ) = Σ_{j=1}^{N} π_j b_j(o1) β_1(j)

  – Complexity: O(N²T)  (about 2N²(T-1) multiplications and N(N-1)(T-1)+(N-1) additions)
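A matching sketch of the backward recursion, under the same illustrative array layout as the forward sketch above:

    /* Backward procedure for a discrete HMM; arguments as in forward().
     * Fills beta[t][i] = P(o_{t+2}, ..., o_T | s at frame t = i, lambda). */
    void backward(int N, int T, double **a, double **b, const int *obs,
                  double **beta)
    {
        for (int i = 0; i < N; i++)               /* 1. initialization */
            beta[T - 1][i] = 1.0;

        for (int t = T - 2; t >= 0; t--)          /* 2. induction, backwards in time */
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    sum += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
                beta[t][i] = sum;
            }
    }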

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

• Why?

    β_t(i) = P(o_{t+1}, ..., oT | st = i, λ)
           = Σ_{j=1}^{N} P(o_{t+1}, ..., oT, s_{t+1} = j | st = i, λ)
           = Σ_{j=1}^{N} P(s_{t+1} = j | st = i, λ) P(o_{t+1} | s_{t+1} = j, λ) P(o_{t+2}, ..., oT | s_{t+1} = j, λ)
           = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)

• Moreover,

    α_t(i) β_t(i) = P(o1, ..., ot, st = i | λ) P(o_{t+1}, ..., oT | st = i, λ) = P(O, st = i | λ)

    and therefore  P(O|λ) = Σ_{i=1}^{N} P(O, st = i | λ) = Σ_{i=1}^{N} α_t(i) β_t(i)   for any t

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

• β_2(3) = P(o3, o4, ..., oT | s2 = 3, λ) = a_31 b_1(o3) β_3(1) + a_32 b_2(o3) β_3(2) + a_33 b_3(o3) β_3(3)

  [Figure: the same 3-state state-time trellis, with the backward recursion out of state 3 at time 2 highlighted]

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

  [Figure: Bayesian network for an HMM — a hidden state chain S1 → S2 → S3 → ... → ST, with each state St emitting an observation Ot]

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S = (s1, s2, ..., sT)?

• The first optimality criterion: choose the states st that are individually most likely at each time t

  Define the a posteriori probability variable (the state occupation probability, or "count" — a soft alignment of an HMM state to the observation/feature at time t):

    γ_t(i) = P(st = i | O, λ) = α_t(i) β_t(i) / P(O|λ) = α_t(i) β_t(i) / Σ_{m=1}^{N} α_t(m) β_t(m)

  – Solution:  st* = arg max_{1 ≤ i ≤ N} γ_t(i),  1 ≤ t ≤ T
  • Problem: maximizing the probability at each time t individually, S* = (s1*, s2*, ..., sT*) may not be a valid sequence (e.g., a_{st* s_{t+1}*} = 0)
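A small illustrative routine (hypothetical names, α and β stored per frame as in the earlier sketches) that turns the forward and backward variables into these state occupation probabilities:

    /* gamma[t][i] = P(s_t = i | O, lambda), computed from alpha and beta. */
    void occupation_probs(int N, int T, double **alpha, double **beta,
                          double **gamma)
    {
        for (int t = 0; t < T; t++) {
            double norm = 0.0;                    /* = P(O | lambda) */
            for (int i = 0; i < N; i++)
                norm += alpha[t][i] * beta[t][i];
            for (int i = 0; i < N; i++)
                gamma[t][i] = alpha[t][i] * beta[t][i] / norm;
        }
    }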

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

• P(s3 = 3, O | λ) = α_3(3) β_3(3)

  [Figure: state-time trellis with the forward paths into state 3 at time 3 (α_3(3)) and the backward paths
   out of state 3 at time 3 (β_3(3)) highlighted; one arc is marked with a_23 = 0, illustrating how the
   individually-most-likely criterion can yield an invalid state sequence]

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

• The second optimality criterion: the Viterbi algorithm. It can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
  – Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
    • Find a single optimal state sequence S = (s1, s2, ..., sT)
  – How to find the second, third, etc. optimal state sequences? (difficult!)
  – The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
    • State-time trellis diagram

  1. R. Bellman, "On the Theory of Dynamic Programming," Proceedings of the National Academy of Sciences, 1952
  2. A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory 13(2), 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm

  Find the best state sequence S* = (s1*, s2*, ..., sT*) for a given observation sequence O = (o1, o2, ..., oT)

  Define a new variable
    δ_t(i) = max_{s1, ..., s_{t-1}} P(s1, ..., s_{t-1}, st = i, o1, ..., ot | λ)
  = the best score along a single path at time t which accounts for the first t observations and ends in state i

  By induction:       δ_{t+1}(j) = [ max_{1 ≤ i ≤ N} δ_t(i) a_ij ] b_j(o_{t+1})
  For backtracing:    ψ_{t+1}(j) = arg max_{1 ≤ i ≤ N} δ_t(i) a_ij
  Termination:        sT* = arg max_{1 ≤ i ≤ N} δ_T(i)
  We can backtrace from  st* = ψ_{t+1}(s_{t+1}*),  t = T-1, ..., 1

  – Complexity: O(N²T)

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

  [Figure: state-time trellis for the Viterbi search; at each node only the best incoming path is kept,
   e.g., the cell δ_3(3) retains the single best path among those entering state 3 at time 3]

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

  [Figure: Viterbi trellis for the Dow Jones example; one cell is computed as]

  (0.6×0.35)×0.7 = 0.147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm in the logarithmic form

  Find the best state sequence S* = (s1*, s2*, ..., sT*) for a given observation sequence O = (o1, o2, ..., oT)

  Define a new variable
    δ_t(i) = max_{s1, ..., s_{t-1}} log P(s1, ..., s_{t-1}, st = i, o1, ..., ot | λ)
  = the best (log) score along a single path at time t which accounts for the first t observations and ends in state i

  By induction:       δ_{t+1}(j) = max_{1 ≤ i ≤ N} [ δ_t(i) + log a_ij ] + log b_j(o_{t+1})
  For backtracing:    ψ_{t+1}(j) = arg max_{1 ≤ i ≤ N} [ δ_t(i) + log a_ij ]
  Termination:        sT* = arg max_{1 ≤ i ≤ N} δ_T(i)
  We can backtrace from  st* = ψ_{t+1}(s_{t+1}*),  t = T-1, ..., 1
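A minimal log-domain Viterbi sketch for a discrete HMM, with illustrative names and the same array layout as the earlier sketches; the log probabilities are assumed to have been precomputed:

    #include <math.h>

    /* Log-domain Viterbi decoding. log_pi, log_a, log_b hold log probabilities;
     * delta and psi are caller-allocated T x N work arrays; path[] receives the
     * best state sequence; the best log score is returned. */
    double viterbi(int N, int T, const double *log_pi, double **log_a,
                   double **log_b, const int *obs, double **delta, int **psi,
                   int *path)
    {
        for (int i = 0; i < N; i++) {                 /* initialization */
            delta[0][i] = log_pi[i] + log_b[i][obs[0]];
            psi[0][i] = 0;
        }
        for (int t = 1; t < T; t++)                   /* induction */
            for (int j = 0; j < N; j++) {
                double best = -HUGE_VAL; int arg = 0;
                for (int i = 0; i < N; i++) {
                    double s = delta[t - 1][i] + log_a[i][j];
                    if (s > best) { best = s; arg = i; }
                }
                delta[t][j] = best + log_b[j][obs[t]];
                psi[t][j] = arg;
            }
        double best = -HUGE_VAL;                      /* termination */
        for (int i = 0; i < N; i++)
            if (delta[T - 1][i] > best) { best = delta[T - 1][i]; path[T - 1] = i; }
        for (int t = T - 2; t >= 0; t--)              /* backtrace */
            path[t] = psi[t + 1][path[t + 1]];
        return best;
    }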

SP - Berlin Chen 42

Homework 1
• A three-state Hidden Markov Model for the Dow Jones Industrial average

  – Find the probability P(up, up, unchanged, down, unchanged, down, up | λ)
  – Find the optimal state sequence of the model which generates the observation sequence (up, up, unchanged, down, unchanged, down, up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

• In the forward-backward algorithm, operations are usually implemented in the logarithmic domain

• Assume that we want to add P1 and P2, given log_b P1 and log_b P2:

    if P1 ≥ P2:  log_b(P1 + P2) = log_b P1 + log_b(1 + b^{log_b P2 − log_b P1})
    else:        log_b(P1 + P2) = log_b P2 + log_b(1 + b^{log_b P1 − log_b P2})

  The values of log_b(1 + b^x) can be saved in a table to speed up the operations

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

• An example code:

    #define LZERO  (-1.0E10)           /* ~log(0) */
    #define LSMALL (-0.5E10)           /* log values < LSMALL are set to LZERO */
    #define minLogExp (-log(-LZERO))   /* ~= -23 */

    double LogAdd(double x, double y)
    {
        double temp, diff, z;
        if (x < y) {                   /* make sure x >= y */
            temp = x; x = y; y = temp;
        }
        diff = y - x;                  /* notice that diff <= 0 */
        if (diff < minLogExp)          /* y is far smaller than x */
            return (x < LSMALL) ? LZERO : x;
        else {
            z = exp(diff);
            return x + log(1.0 + z);
        }
    }
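As a usage illustration (a hypothetical fragment, consistent with the earlier sketches), the forward induction step can be accumulated entirely in the log domain with this routine:

    /* log-domain version of: sum_i alpha[t][i] * a[i][j] */
    double log_sum = LZERO;
    for (int i = 0; i < N; i++)
        log_sum = LogAdd(log_sum, log_alpha[t][i] + log_a[i][j]);
    log_alpha[t + 1][j] = log_sum + log_b[j][obs[t + 1]];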

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O1, ..., OL | λ) or log P(O1, ..., OL | λ)?
  – This belongs to a typical problem of "inferential statistics"
  – It is the most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
  – The data is incomplete because of the hidden state sequences
  – It is well solved by the Baum-Welch algorithm (known as the forward-backward algorithm), an instance of the EM (Expectation-Maximization) algorithm
    • Iterative update and improvement
    • Based on the Maximum Likelihood (ML) criterion

  – Suppose we have L training utterances O1, O2, ..., OL for the HMM λ, and S denotes a possible state sequence of the HMM:

    log P(O1, O2, ..., OL | λ) = log Π_{l=1}^{L} P(Ol | λ) = Σ_{l=1}^{L} log P(Ol | λ) = Σ_{l=1}^{L} log Σ_{all S} P(Ol, S | λ)

  The "log of sum" form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)

• Hard Assignment
  – Given that the data follow a multinomial distribution

  [Figure: four samples (2 black, 2 white) all assigned to state S1]

    P(B | S1) = 2/4 = 0.5
    P(W | S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)

• Soft Assignment
  – Given that the data follow a multinomial distribution
  – Maximize the likelihood of the data given the alignment

  [Figure: four samples shared between State S1 and State S2 with posterior weights
   P(st = s1 | O) / P(st = s2 | O) of 0.7/0.3, 0.4/0.6, 0.9/0.1, and 0.5/0.5;
   each pair sums to 1]

    P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
    P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
    P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 ≈ 0.27
    P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 ≈ 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

• Relationship between the forward and backward variables

    α_t(i) = P(o1, ..., ot, st = i | λ),            α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(ot)
    β_t(i) = P(o_{t+1}, ..., oT | st = i, λ),       β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)

    α_t(i) β_t(i) = P(O, st = i | λ),               Σ_{i=1}^{N} α_t(i) β_t(i) = P(O|λ)

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

• Define a new variable:
  – the probability of being in state i at time t and in state j at time t+1

    ξ_t(i, j) = P(st = i, s_{t+1} = j | O, λ)
             = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)
             = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / [ Σ_{m=1}^{N} Σ_{n=1}^{N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n) ]

    (using  P(A|B) = P(A, B) / P(B))

  [Figure: one trellis arc from state i at time t to state j at time t+1]

• Recall the posteriori probability variable:

    γ_t(i) = P(st = i | O, λ) = α_t(i) β_t(i) / Σ_{m=1}^{N} α_t(m) β_t(m)

  Note that γ_t(i) can also be represented as  γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j),   for t = 1, ..., T-1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

• P(s3 = 3, s4 = 1, O | λ) = α_3(3) a_31 b_1(o4) β_4(1)

  [Figure: state-time trellis with the arc from state 3 at time 3 to state 1 at time 4 highlighted]

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

• Expected number of transitions from state i to state j in O:  Σ_{t=1}^{T-1} ξ_t(i, j)

• Expected number of transitions from state i in O:  Σ_{t=1}^{T-1} γ_t(i) = Σ_{t=1}^{T-1} Σ_{j=1}^{N} ξ_t(i, j)

• A set of reasonable re-estimation formulas for π and A is

    π̄_i = expected frequency (number of times) in state i at time t = 1 = γ_1(i)

    ā_ij = expected number of transitions from state i to state j / expected number of transitions from state i
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

  where  ξ_t(i, j) = P(st = i, s_{t+1} = j | O, λ)  and  γ_t(i) = P(st = i | O, λ)

  (Formulae for a single training utterance)

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

• A set of reasonable re-estimation formulas for B is
  – For discrete and finite observations, bj(vk) = P(ot = vk | st = j):

    b̄_j(vk) = expected number of times in state j observing symbol vk / expected number of times in state j
            = Σ_{t: ot = vk} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

  – For continuous and infinite observations, bj(v) = f_{O|S}(ot = v | st = j), modeled as a mixture of multivariate Gaussian distributions:

    b_j(v) = Σ_{k=1}^{M} c_jk N(v; μ_jk, Σ_jk)
           = Σ_{k=1}^{M} c_jk (2π)^{-L/2} |Σ_jk|^{-1/2} exp( -1/2 (v - μ_jk)^T Σ_jk^{-1} (v - μ_jk) )
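Putting the re-estimation formulas for A and B together, one Baum-Welch accumulation pass for a discrete HMM could be sketched as follows (illustrative names only; scaling or log-domain handling is omitted, and the caller normalizes the accumulators afterwards):

    /* Accumulate Baum-Welch statistics for one utterance, given alpha, beta
     * (T x N) and pObs = P(O|lambda). Caller zeroes the accumulators first
     * and finally sets a[i][j] = num_a[i][j]/den_a[i],
     * b[j][k] = num_b[j][k]/den_b[j]. */
    void accumulate(int N, int T, double **a, double **b, const int *obs,
                    double **alpha, double **beta, double pObs,
                    double **num_a, double *den_a,
                    double **num_b, double *den_b)
    {
        for (int t = 0; t < T; t++)
            for (int i = 0; i < N; i++) {
                double gamma = alpha[t][i] * beta[t][i] / pObs;  /* gamma_t(i) */
                num_b[i][obs[t]] += gamma;        /* state i observing symbol obs[t] */
                den_b[i] += gamma;
                if (t < T - 1) {
                    den_a[i] += gamma;            /* transitions leaving state i */
                    for (int j = 0; j < N; j++)   /* xi_t(i,j) */
                        num_a[i][j] += alpha[t][i] * a[i][j]
                                       * b[j][obs[t + 1]] * beta[t + 1][j] / pObs;
                }
            }
    }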

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

  – For continuous and infinite observations (cont.)
    • Define a new variable γ_t(j, k):
      – γ_t(j, k) is the probability of being in state j at time t, with the k-th mixture component accounting for ot:

        γ_t(j, k) = P(st = j, m_t = k | O, λ)
                 = P(st = j | O, λ) · P(m_t = k | st = j, O, λ)
                 = [ α_t(j) β_t(j) / Σ_{s=1}^{N} α_t(s) β_t(s) ] · [ c_jk N(ot; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(ot; μ_jm, Σ_jm) ]

        (the observation-independence assumption is applied; P(A|B) = P(A, B)/P(B))

      – Note that  Σ_{m=1}^{M} γ_t(j, m) = γ_t(j)

  [Figure: the output distribution of state 1 drawn as a mixture of three Gaussians N1, N2, N3 with weights c11, c12, c13]
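A small hypothetical helper showing how this mixture-level posterior could be computed for one frame, given the per-mixture densities of state j already evaluated:

    /* gamma_t(j,k) = gamma_t(j) * c[j][k]*N_k(o_t) / sum_m c[j][m]*N_m(o_t).
     * c_j[m] are the mixture weights of state j, dens_j[m] the Gaussian
     * densities N_m(o_t), and gamma_tj = gamma_t(j). */
    double mixture_posterior(int M, int k, const double *c_j,
                             const double *dens_j, double gamma_tj)
    {
        double denom = 0.0;
        for (int m = 0; m < M; m++)
            denom += c_j[m] * dens_j[m];
        return gamma_tj * c_j[k] * dens_j[k] / denom;
    }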

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

  – For continuous and infinite observations (cont.)

    c̄_jk = expected number of times in state j and mixture k / expected number of times in state j
         = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)

    μ̄_jk = weighted average (mean) of observations at state j and mixture k
         = Σ_{t=1}^{T} γ_t(j, k) ot / Σ_{t=1}^{T} γ_t(j, k)

    Σ̄_jk = weighted covariance of observations at state j and mixture k
         = Σ_{t=1}^{T} γ_t(j, k) (ot − μ_jk)(ot − μ_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

  (Formulae for a single training utterance)
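As a rough sketch (hypothetical names, diagonal covariances), these statistics are typically accumulated frame by frame and normalized at the end:

    /* Accumulate occupancy-weighted statistics for one state/mixture pair over
     * one utterance; gmk[t] holds gamma_t(j,k), o[t] the frame-t feature vector. */
    void accumulate_gaussian(int T, int d, const double *gmk, double **o,
                             double *occ, double *sum, double *sqsum)
    {
        for (int t = 0; t < T; t++) {
            *occ += gmk[t];
            for (int i = 0; i < d; i++) {
                sum[i]   += gmk[t] * o[t][i];
                sqsum[i] += gmk[t] * o[t][i] * o[t][i];
            }
        }
        /* afterwards: mean[i] = sum[i]/occ;  var[i] = sqsum[i]/occ - mean[i]^2 */
    }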

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

• Multiple Training Utterances

  [Figure: a 3-state left-to-right HMM for the word 台師大; forward-backward (F-B) statistics are accumulated over each of several training utterances before the parameters are updated]

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

  – For continuous and infinite observations (cont.)

    Formulae for multiple (L) training utterances:

    ā_ij = Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} γ_t^l(i)
         = expected number of transitions from state i to state j / expected number of transitions from state i

    π̄_i = (1/L) Σ_{l=1}^{L} γ_1^l(i)
         = expected frequency (number of times) in state i at time t = 1

    c̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{Tl} γ_t^l(j, k) / Σ_{l=1}^{L} Σ_{t=1}^{Tl} Σ_{m=1}^{M} γ_t^l(j, m)
         = expected number of times in state j and mixture k / expected number of times in state j

    μ̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{Tl} γ_t^l(j, k) ot^l / Σ_{l=1}^{L} Σ_{t=1}^{Tl} γ_t^l(j, k)
         = weighted average (mean) of observations at state j and mixture k

    Σ̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{Tl} γ_t^l(j, k) (ot^l − μ_jk)(ot^l − μ_jk)^T / Σ_{l=1}^{L} Σ_{t=1}^{Tl} γ_t^l(j, k)
         = weighted covariance of observations at state j and mixture k

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

  – For discrete and finite observations (cont.)

    Formulae for multiple (L) training utterances:

    ā_ij = Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{Tl-1} γ_t^l(i)
         = expected number of transitions from state i to state j / expected number of transitions from state i

    π̄_i = (1/L) Σ_{l=1}^{L} γ_1^l(i)
         = expected frequency (number of times) in state i at time t = 1

    b̄_j(vk) = Σ_{l=1}^{L} Σ_{t: ot^l = vk} γ_t^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{Tl} γ_t^l(j)
            = expected number of times in state j observing symbol vk / expected number of times in state j

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous, or tied-mixture, HMM:

      b_j(o) = Σ_{k=1}^{M} b_j(k) f(o | v_k) = Σ_{k=1}^{M} b_j(k) N(o; μ_k, Σ_k)

    where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent), and f(o | v_k) is the k-th mixture density function, or k-th codeword (shared across HMMs; M is very large)

  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
  – Because M is large, we can simply use only the L most significant values
    • Experience showed that L ≈ 1~3% of M is adequate
  – Partial tying of f(o | v_k) for different phonetic classes

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

  [Figure: two HMMs (each with states s1, s2, s3) whose state-dependent mixture weights b_j(1), ..., b_j(k), ..., b_j(M) all point into one shared codebook of Gaussian kernels N(μ1, Σ1), N(μ2, Σ2), ..., N(μk, Σk), ..., N(μM, ΣM)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)
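As a small illustration (the numeric values are made up), a 3-state left-to-right topology simply means a transition matrix that only allows self-loops and forward moves:

    /* 3-state left-to-right transition matrix: no backward transitions. */
    double a[3][3] = {
        { 0.6, 0.4, 0.0 },
        { 0.0, 0.7, 0.3 },
        { 0.0, 0.0, 1.0 },
    };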

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states based on the current model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2:
    • For a discrete density HMM (using an M-codeword codebook):

      b̄_j(k) = number of vectors with codebook index k in state j / number of vectors in state j

    • For a continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state into a set of M clusters, then

      w̄_jm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
      μ̄_jm = sample mean of the vectors classified in cluster m of state j
      Σ̄_jm = sample covariance matrix of the vectors classified in cluster m of state j

  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop — the initial model is generated

SP - Berlin Chen 62

Initialization of HMM (cont)

  [Flowchart: Training Data + Initial Model → State Sequence Segmentation → Estimate parameters of the observation distributions via Segmental K-Means → Model Reestimation → Model Convergence? — NO: loop back to segmentation; YES: output Model Parameters]

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM
  – 3 states and a 2-codeword codebook (v1, v2)

    b1(v1) = 3/4, b1(v2) = 1/4
    b2(v1) = 1/3, b2(v2) = 2/3
    b3(v1) = 2/3, b3(v2) = 1/3

  [Figure: a 10-frame observation sequence O1 ... O10 aligned to the three states on the state-time trellis; within each state segment the codewords (v1 or v2) are counted to obtain the bj(vk) values above]

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
  – 3 states and 4 Gaussian mixtures per state

  [Figure: the observation vectors aligned to each state are clustered by K-means (starting from the global mean, then splitting into cluster 1 mean, cluster 2 mean, ...) into 4 clusters per state, giving the initial mixture parameters (μ_j1, Σ_j1), ..., (μ_j4, Σ_j4) for each state j]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution:  d_i(t) = a_ii^{t-1} (1 − a_ii)
    • This does not provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination
  – Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications.

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

  [Figure: state-duration histograms compared against a geometric/exponential distribution, an empirical distribution, a Gaussian distribution, and a Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

  [Figure: likelihood plotted over the model configuration space, with the current model configuration sitting at a local maximum]

SP - Berlin Chen 68

Homework-2 (1/2)

  [Figure: a 3-state ergodic HMM with transition probabilities 0.34/0.33/0.33 on each state and output distributions
   s1: A:.34 B:.33 C:.33,  s2: A:.33 B:.34 C:.33,  s3: A:.33 B:.33 C:.34]

TrainSet 1:
  1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB
  6. CACCABCA     7. CABCABCA  8. CABCA   9. CABCA

TrainSet 2:
  1. BBBCCBC   2. CCBABB   3. AACCBBB  4. BBABBAC  5. CCA ABBAB
  6. BBBCCBAA  7. ABBBBABA 8. CCCCC    9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2, and P3?

SP - Berlin Chen 70

Isolated Word Recognition

  [Figure: an isolated word recognizer. The speech signal goes through feature extraction to produce the feature sequence X, which is scored against word models M1, M2, ..., MV and a silence model MSil; the likelihoods p(X|M1), p(X|M2), ..., p(X|MV), p(X|MSil) feed a most-likely-word selector]

    Label(X) = arg max_k p(X | Mk)

  Viterbi approximation:

    Label(X) = arg max_k max_S p(X, S | Mk)

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming

• Example:

    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
    – Error analysis: one deletion ("the") and one insertion ("not"), three matched words

  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR):

    WER = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% × 2/4 = 50%   (might be higher than 100%)
    WCR = 100% × (Matched words) / (No. of words in the correct sentence) = 100% × 3/4 = 75%
    WAR = 100% × (Matched words − Ins) / (No. of words in the correct sentence) = 100% × (3−1)/4 = 50%   (might be negative)

    WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (textbook formulation)

  [Figure: alignment grid with the reference (correct) word sequence (index j) on one axis and the recognized (test) word sequence (index i) on the other; each grid cell [i][j] stores the minimum-word-error alignment up to that point, reached by one of the allowed moves: hit or substitution diagonally, insertion horizontally, deletion vertically]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)
  (i indexes the test/recognized string of length n; j indexes the reference/correct string of length m)

  Step 1 (Initialization):
    G[0][0] = 0
    for i = 1..n (test):       G[i][0] = G[i-1][0] + 1,  B[i][0] = 1   (Insertion, horizontal direction)
    for j = 1..m (reference):  G[0][j] = G[0][j-1] + 1,  B[0][j] = 2   (Deletion, vertical direction)

  Step 2 (Iteration):
    for i = 1..n (test), for j = 1..m (reference):
      G[i][j] = min( G[i-1][j] + 1                        (Insertion)
                     G[i][j-1] + 1                        (Deletion)
                     G[i-1][j-1] + 1   if LT[i] ≠ LR[j]   (Substitution)
                     G[i-1][j-1]       if LT[i] = LR[j]   (Match) )
      B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical), 3 (Substitution, diagonal), or 4 (Match, diagonal)

  Step 3 (Backtrace and measure):
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Backtrace the optimal path from B[n][m] back to B[0][0]:
      if B[i][j] = 1, print "Insertion: LT[i]", then go left;
      else if B[i][j] = 2, print "Deletion: LR[j]", then go down;
      else print LR[j] (Hit/Match or Substitution), then go diagonally down

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style implementation)

  [Figure: the (n+1) × (m+1) alignment grid, with the correct/reference word sequence (1..m, index j) on one axis and the recognized/test word sequence (1..n, index i) on the other; cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a deletion, or from (i-1,j-1) by a hit/substitution]

  – Initialization:

    grid[0][0].score = 0;
    grid[0][0].ins = grid[0][0].del = grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;

    for (i = 1; i <= n; i++) {          /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {          /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (main DP loop):

    for (i = 1; i <= n; i++) {            /* test */
        gridi = grid[i]; gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {        /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {       /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {           /* HOR = ins */
                gridi[j] = gridi1[j];
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                      /* VERT = del */
                gridi[j] = gridi[j-1];
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }  /* for j */
    }  /* for i */

• Example 1:

    Correct: A C B C C
    Test:    B A B C

  [Figure: the filled grid of (Ins, Del, Sub, Hit) counts for this pair, HTK-style]

    One optimal alignment: Ins B, Hit A, Del C, Hit B, Hit C, Del C
    → 1 insertion + 2 deletions over 5 reference words:  Alignment 1: WER = 60%
    (there is still another optimal alignment)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2:

    Correct: A C B C C
    Test:    B A A C

  [Figure: the filled grid of (Ins, Del, Sub, Hit) counts for this pair]

    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   →  WER = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   →  WER = 80%
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          →  WER = 80%

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion, and insertion errors:

    HTK error penalties:   subPen = 10,  delPen = 7,  insPen = 7
    NIST error penalties:  subPenNIST = 4,  delPenNIST = 3,  insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference (excerpt; one character per line, each preceded by two numeric fields such as "100000 100000"):
    桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ......

  ASR output (excerpt; same format):
    桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ......

SP - Berlin Chen 80

Homework 3 (cont.)

• 506 BN (broadcast news) stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and all 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

  Example score summaries:

    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  [Figure: two bottles A and B containing red (R) and green (G) balls.
   Observed data O: the "ball sequence"; latent data S: the "bottle sequence".
   Parameters to be estimated so as to maximize log P(O|λ):
   P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)]

  [Figure: a 3-state ergodic HMM (output distributions such as A:.3 B:.2 C:.5, A:.7 B:.1 C:.2, A:.3 B:.6 C:.1)
   whose parameters λ are iteratively re-estimated from o1 o2 ... oT so that  p(O|λ̄) ≥ p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction to EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B, π} without consideration of the state sequence
  – Two major steps:
    • E (Expectation): an expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations: E[S | O, λ]
    • M (Maximization): provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = X1, X2, ..., Xn with realizations x = x1, x2, ..., xn:

  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(x|Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

      μ_ML = (1/n) Σ_{i=1}^{n} x_i ,    Σ_ML = (1/n) Σ_{i=1}^{n} (x_i − μ_ML)(x_i − μ_ML)^T

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior likelihood p(Φ|x) is maximum.

  (ML and MAP)

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)

• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have the current model λ, and estimate the probability that each possible S occurred in the generation of O
  – Pretend we had in fact observed the complete data pair (O, S), with frequency proportional to the probability P(S|O, λ), and compute a new maximum likelihood estimate λ̄ of λ
  – Does the process converge?

  – Algorithm:
    • Log-likelihood expression, with the expectation taken over S:

      P(O, S | λ̄) = P(S | O, λ̄) P(O | λ̄)                                (Bayes' rule)
      ⇒ log P(O | λ̄) = log P(O, S | λ̄) − log P(S | O, λ̄)

      Taking the expectation over S with respect to P(S|O, λ) (the current, "unknown model setting"):

      log P(O | λ̄) = Σ_S P(S|O, λ) log P(O, S | λ̄) − Σ_S P(S|O, λ) log P(S | O, λ̄)

      (incomplete-data likelihood on the left; complete-data likelihood inside the first sum)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.):
    • We can thus express log P(O|λ̄) as follows:

      log P(O|λ̄) = Q(λ, λ̄) − H(λ, λ̄)

      where   Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
              H(λ, λ̄) = Σ_S P(S|O, λ) log P(S|O, λ̄)

    • We want  log P(O|λ̄) ≥ log P(O|λ), and

      log P(O|λ̄) − log P(O|λ) = [ Q(λ, λ̄) − Q(λ, λ) ] + [ H(λ, λ) − H(λ, λ̄) ]

SP - Berlin Chen 88

The EM Algorithm (7/7)

  – H(λ, λ̄) has the following property:

      H(λ, λ̄) − H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ̄) / P(S|O, λ) ]
                        ≤ Σ_S P(S|O, λ) [ P(S|O, λ̄) / P(S|O, λ) − 1 ]      (log x ≤ x − 1, Jensen's inequality)
                        = Σ_S P(S|O, λ̄) − Σ_S P(S|O, λ) = 1 − 1 = 0

    i.e., H(λ, λ) − H(λ, λ̄) is a Kullback-Leibler (KL) distance and is always ≥ 0

  – Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function (auxiliary function):

      Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)

    — the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function

      Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄) = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ̄)

  – where P(O, S|λ) and log P(O, S|λ̄) can be expressed as

      P(O, S|λ) = π_{s1} Π_{t=1}^{T-1} a_{st s_{t+1}} Π_{t=1}^{T} b_{st}(ot)

      log P(O, S|λ̄) = log π̄_{s1} + Σ_{t=1}^{T-1} log ā_{st s_{t+1}} + Σ_{t=1}^{T} log b̄_{st}(ot)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as  Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄), where

    Q_π(λ, π̄) = Σ_{i=1}^{N} [ P(O, s1 = i | λ) / P(O|λ) ] log π̄_i

    Q_a(λ, ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, st = i, s_{t+1} = j | λ) / P(O|λ) ] log ā_ij

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: ot = vk} [ P(O, st = j | λ) / P(O|λ) ] log b̄_j(vk)

  (each term has the form Σ_j w_j log y_j)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_ij, and b̄_j(vk)
  – They can be maximized individually
  – All are of the same form:

      F(y1, y2, ..., yN) = Σ_{j=1}^{N} w_j log y_j ,   where  y_j ≥ 0  and  Σ_{j=1}^{N} y_j = 1,

    which has its maximum value when   y_j = w_j / Σ_{j'=1}^{N} w_{j'}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

    Suppose that   F = Σ_{j=1}^{N} w_j log y_j + l ( Σ_{j=1}^{N} y_j − 1 )        (constraint: Σ_j y_j = 1)

    ∂F/∂y_j = w_j / y_j + l = 0   ⇒   y_j = − w_j / l ,  for all j

    Σ_{j=1}^{N} y_j = 1   ⇒   l = − Σ_{j=1}^{N} w_j

    ⇒   y_j = w_j / Σ_{j=1}^{N} w_j

  (Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html)

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as

    π̄_i = P(O, s1 = i | λ) / P(O|λ) = γ_1(i)

    ā_ij = Σ_{t=1}^{T-1} [ P(O, st = i, s_{t+1} = j | λ) / P(O|λ) ] / Σ_{t=1}^{T-1} [ P(O, st = i | λ) / P(O|λ) ]
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    b̄_i(vk) = Σ_{t: ot = vk} [ P(O, st = i | λ) / P(O|λ) ] / Σ_{t=1}^{T} [ P(O, st = i | λ) / P(O|λ) ]
            = Σ_{t: ot = vk} γ_t(i) / Σ_{t=1}^{T} γ_t(i)

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o)
             = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)
             = Σ_{k=1}^{M} c_jk (2π)^{-L/2} |Σ_jk|^{-1/2} exp( -1/2 (o − μ_jk)^T Σ_jk^{-1} (o − μ_jk) ),
      with  Σ_{k=1}^{M} c_jk = 1

  [Figure: the output distribution for state i drawn as a mixture of Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

    p(O|λ) = Σ_S p(O, S|λ),   where
    p(O, S|λ) = Π_{t=1}^{T} a_{s_{t-1} st} b_{st}(ot) = Π_{t=1}^{T} a_{s_{t-1} st} [ Σ_{k=1}^{M} c_{st k} b_{st k}(ot) ]
    (with a_{s0 s1} ≡ π_{s1})

  Expanding the product of sums over an explicit mixture-component sequence K = (k1, k2, ..., kT)
  — one possible mixture-component sequence along the state sequence S:

    p(O|λ) = Σ_S Σ_K p(O, S, K|λ),   where
    p(O, S, K|λ) = Π_{t=1}^{T} a_{s_{t-1} st} c_{st kt} b_{st kt}(ot)

  Note:  Π_{t=1}^{T} ( Σ_{k=1}^{M} x_{t,k} ) = Σ_{k1=1}^{M} Σ_{k2=1}^{M} ... Σ_{kT=1}^{M} Π_{t=1}^{T} x_{t,kt}
         (the product of T sums expands into M^T product terms)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ̄) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̄)
            = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] log p(O, S, K|λ̄)

  with

    log p(O, S, K|λ̄) = log π̄_{s1} + Σ_{t=1}^{T-1} log ā_{st s_{t+1}} + Σ_{t=1}^{T} log b̄_{st kt}(ot) + Σ_{t=1}^{T} log c̄_{st kt}

  so that  Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have when compared with discrete HMM training:

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(st = j, kt = k | O, λ) log b̄_jk(ot)

    Q_c(λ, c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(st = j, kt = k | O, λ) log c̄_jk

  where  P(st = j, kt = k | O, λ) = γ_t(j, k)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Re-estimating the mean vectors. Let γ_t(j, k) = P(st = j, kt = k | O, λ) and

    b̄_jk(ot) = N(ot; μ̄_jk, Σ̄_jk) = (2π)^{-L/2} |Σ̄_jk|^{-1/2} exp( -1/2 (ot − μ̄_jk)^T Σ̄_jk^{-1} (ot − μ̄_jk) )

    log b̄_jk(ot) = −(L/2) log(2π) − (1/2) log|Σ̄_jk| − (1/2) (ot − μ̄_jk)^T Σ̄_jk^{-1} (ot − μ̄_jk)

  Setting the derivative of Q_b with respect to μ̄_jk to zero:

    ∂Q_b(λ, b̄) / ∂μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) Σ̄_jk^{-1} (ot − μ̄_jk) = 0

    ⇒   μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) ot / Σ_{t=1}^{T} γ_t(j, k)

  (using  d(x^T C x)/dx = (C + C^T) x;  Σ̄_jk^{-1} is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Re-estimating the covariance matrices. Writing log b̄_jk(ot) in terms of Σ̄_jk^{-1},

    log b̄_jk(ot) = −(L/2) log(2π) + (1/2) log|Σ̄_jk^{-1}| − (1/2) (ot − μ̄_jk)^T Σ̄_jk^{-1} (ot − μ̄_jk),

  and setting the derivative of Q_b with respect to Σ̄_jk^{-1} to zero:

    ∂Q_b(λ, b̄) / ∂Σ̄_jk^{-1} = Σ_{t=1}^{T} γ_t(j, k) [ (1/2) Σ̄_jk − (1/2) (ot − μ̄_jk)(ot − μ̄_jk)^T ] = 0

    ⇒   Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (ot − μ̄_jk)(ot − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

  (using  d(log det X)/dX = (X^{-1})^T  and  d(a^T X b)/dX = a b^T;  Σ̄_jk is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

    μ̄_jk = Σ_{t=1}^{T} P(st = j, kt = k | O, λ) ot / Σ_{t=1}^{T} P(st = j, kt = k | O, λ)
          = Σ_{t=1}^{T} γ_t(j, k) ot / Σ_{t=1}^{T} γ_t(j, k)

    Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (ot − μ̄_jk)(ot − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

    c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)

jkkj

kjc

Page 8: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 8

Observable Markov Model (cont)

bull Example 3 Given a Markov model what is the mean occupancy duration of each state i

iiiiii

ii

d

dii

iiii

dii

dii

dii

iid

ii

i

aaaa

aa

aaadddPd

aa

iddP

11

111=

11

state ain duration ofnumber Expected1=

statein duration offunction massy probabilit

11

1

1

1

Time (Duration)

Probability

a geometric distribution

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

  [Figure: an HMM (states s_1, s_2, s_3) for the word 台師大, trained from several utterances; each utterance is aligned with the model by the Forward-Backward (FB) procedure]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)

  Formulae for multiple (L) training utterances:

    \bar{c}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}
                        {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \gamma_t^l(j,m)}
    \quad \text{(expected number of times in state } j \text{ and mixture } k \text{ / expected number of times in state } j\text{)}

    \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)\, o_t^l}
                                       {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}
    \quad \text{(weighted average (mean) of observations at state } j \text{ and mixture } k\text{)}

    \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)\,
                                           (o_t^l - \bar{\boldsymbol{\mu}}_{jk})(o_t^l - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
                                          {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}
    \quad \text{(weighted covariance of observations at state } j \text{ and mixture } k\text{)}

    \bar{\pi}_i = \text{expected frequency (number of times) in state } i \text{ at time } t = 1
                = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}
                        {\text{expected number of transitions from state } i}
                 = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i,j)}
                        {\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For discrete and finite observations (cont.)

  Formulae for multiple (L) training utterances:

    \bar{\pi}_i = \text{expected frequency (number of times) in state } i \text{ at time } t = 1
                = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}
                        {\text{expected number of transitions from state } i}
                 = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i,j)}
                        {\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}

    \bar{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}
                          {\text{expected number of times in state } j}
                   = \frac{\sum_{l=1}^{L} \sum_{t=1,\, o_t^l = v_k}^{T_l} \gamma_t^l(j)}
                          {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j)}

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  - The semicontinuous or tied-mixture HMM

    b_j(\mathbf{o}) = \sum_{k=1}^{M} b_j(k)\, f(\mathbf{o} \mid v_k)
                    = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{o}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

    where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent), and f(\mathbf{o} \mid v_k) is the k-th mixture density function or k-th codeword (shared across HMMs; M is very large)

  - A combination of the discrete HMM and the continuous HMM
    • A combination of discrete model-dependent weight coefficients and continuous model-independent codebook probability density functions
  - Because M is large, we can simply use the L most significant values of f(\mathbf{o} \mid v_k)
    • Experience showed that an L of about 1~3% of M is adequate
  - Partial tying of f(\mathbf{o} \mid v_k) for different phonetic classes
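A small C sketch of the top-L evaluation (the function and argument names are assumptions for illustration):

    /* f[k]   : f(o | v_k), the shared codebook densities evaluated at the current frame
       bj[k]  : discrete mixture weights of state j
       top[l] : indices of the L most significant codewords for this frame
                (selected beforehand, e.g. by sorting f[k])                              */
    double semicontinuous_bj(const double *f, const double *bj,
                             const int *top, int L)
    {
        double p = 0.0;
        for (int l = 0; l < L; l++) {
            int k = top[l];
            p += bj[k] * f[k];    /* b_j(o) ~= sum over the top-L codewords only */
        }
        return p;
    }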

SP - Berlin Chen 59

Semicontinuous HMMs (cont.)

[Figure: two HMMs (each with states s_1, s_2, s_3) whose state-dependent mixture weights b_1(k), b_2(k), b_3(k), k = 1..M, all point into a single shared codebook of Gaussian kernels N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2), \ldots, N(\boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  - Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  - A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  - It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  - Assume that we have a training set of observations and an initial estimate of all model parameters
  - Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  - Step 2:
    • For a discrete density HMM (using an M-codeword codebook):

        \hat{b}_j(k) = \frac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j}

    • For a continuous density HMM (M Gaussian mixtures per state):
      cluster the observation vectors within each state j into a set of M clusters, then
        \hat{w}_{jm} = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
        \hat{\boldsymbol{\mu}}_{jm} = sample mean of the vectors classified in cluster m of state j
        \hat{\boldsymbol{\Sigma}}_{jm} = sample covariance matrix of the vectors classified in cluster m of state j
  - Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop: the initial model is generated

  [Figure: a left-to-right HMM with states s_1, s_2, s_3]

SP - Berlin Chen 62

Initialization of HMM (cont.)

[Flowchart: Training Data plus an Initial Model feed State Sequence Segmentation; the parameters of the observation distributions are then estimated via Segmental K-means and used for Model Re-estimation; if the Model Convergence test fails (NO), loop back to the segmentation step; if it succeeds (YES), output the Model Parameters]

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for a discrete HMM
  - 3 states and 2 codewords; after segmentation into states, the relative counts give
    • b_1(v_1) = 3/4, b_1(v_2) = 1/4
    • b_2(v_1) = 1/3, b_2(v_2) = 2/3
    • b_3(v_1) = 2/3, b_3(v_2) = 1/3

  [Figure: the Viterbi state segmentation of observations o_1 .. o_10 (each labeled with codeword v_1 or v_2) over a 3-state left-to-right HMM (s_1, s_2, s_3), from which the counts above are collected]

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for a continuous HMM
  - 3 states and 4 Gaussian mixtures per state

  [Figure: the observations assigned to each state are split by K-means, starting from the global mean, into cluster means; for state 1 this yields the mixture parameters (\mu_{11}, \Sigma_{11}), (\mu_{12}, \Sigma_{12}), (\mu_{13}, \Sigma_{13}), (\mu_{14}, \Sigma_{14}), and similarly for the other states of the 3-state left-to-right HMM]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  - The state duration follows an exponential (geometric) distribution

      d_i(t) = a_{ii}^{\,t-1}\,(1 - a_{ii})

    • This doesn't provide an adequate representation of the temporal structure of speech
  - First-order (Markov) assumption: the state transition depends only on the origin and destination
  - Output-independent assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

  Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications
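A tiny C sketch of this implicit duration model; the self-loop probability value is an assumed example, not from the slides:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a_ii = 0.8;                      /* assumed self-loop probability */
        for (int t = 1; t <= 10; t++)           /* d_i(t) = a_ii^(t-1) * (1 - a_ii) */
            printf("d_i(%d) = %.4f\n", t, pow(a_ii, t - 1) * (1.0 - a_ii));
        printf("expected duration = %.2f frames\n", 1.0 / (1.0 - a_ii));
        return 0;
    }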

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling: the implicit geometric/exponential distribution can be replaced by an empirical distribution, a Gaussian distribution, or a Gamma distribution

  [Figure: example duration distributions for the geometric/exponential, empirical, Gaussian and Gamma cases]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

  [Figure: likelihood plotted over the model configuration space, with the current model configuration sitting at a local optimum]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: an initial 3-state ergodic HMM (s_1, s_2, s_3); all transition probabilities are 0.33 or 0.34, and the three states have initial emission probabilities (A: 0.34, B: 0.33, C: 0.33), (A: 0.33, B: 0.34, C: 0.33) and (A: 0.33, B: 0.33, C: 0.34)]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to: ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: the speech signal passes through Feature Extraction to produce a feature sequence X; X is scored against the word models M_1, M_2, ..., M_V and a silence model M_Sil to obtain the likelihoods p(X | M_1), p(X | M_2), ..., p(X | M_V), p(X | M_Sil); a Most Likely Word Selector then outputs the recognized label]

  Label(X) = \arg\max_{k} p(X \mid M_k)

  Viterbi approximation:

  Label(X) = \arg\max_{k} \max_{S} p(X, S \mid M_k)

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  - Substitution: an incorrect word was substituted for the correct word
  - Deletion: a correct word was omitted in the recognized sentence
  - Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  - A maximum substring matching problem
  - Can be handled by dynamic programming

• Example
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
  ("the" is deleted, "not" is inserted, and "effect", "is", "clear" are matched)

  - Error analysis: one deletion and one insertion
  - Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate = 100% x (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% x 2/4 = 50%
    Word Correction Rate = 100% x (Matched words) / (No. of words in the correct sentence) = 100% x 3/4 = 75%
    Word Accuracy Rate = 100% x (Matched words - Ins) / (No. of words in the correct sentence) = 100% x (3 - 1)/4 = 50%

  Note: WER + WAR = 100%; WER might be higher than 100%, and WAR might be negative
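For instance, the three measures can be computed from the error counts of this example (a trivial C sketch, not part of the slides):

    #include <stdio.h>

    int main(void)
    {
        int nref = 4, sub = 0, del = 1, ins = 1, hit = 3;   /* counts from the example above */
        printf("WER = %.0f%%\n", 100.0 * (sub + del + ins) / nref);   /* 50% */
        printf("WCR = %.0f%%\n", 100.0 * hit / nref);                 /* 75% */
        printf("WAR = %.0f%%\n", 100.0 * (hit - ins) / nref);         /* 50% */
        return 0;
    }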

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

  [Figure: the alignment grid; one axis is the correct/reference word string (its word length is one of the two grid dimensions) and the other is the recognized/test word string; each cell [i, j] stores the minimum word-error alignment up to that grid point, and the arrows show the kinds of alignment step (hit/substitution on the diagonal, insertion and deletion along the axes)]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)
  (index i runs over the test sentence, 1..n; index j runs over the reference sentence, 1..m)

  Step 1: Initialization
    G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1..n (test)
      for j = 1..m (reference)
        G[i][j] = min( G[i-1][j] + 1                        (Insertion),
                       G[i][j-1] + 1                        (Deletion),
                       G[i-1][j-1] + 1  if LT[i] != LR[j]   (Substitution),
                       G[i-1][j-1]      if LT[i] == LR[j]   (Match) )
        B[i][j] = 1 (Insertion, horizontal direction), 2 (Deletion, vertical direction),
                  3 (Substitution, diagonal direction) or 4 (Match, diagonal direction), according to the chosen case

  Step 3: Backtrace and Measure
    Word Error Rate = 100% x G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path from B[n][m] back to B[0][0]:
      if B[i][j] = 1, print Ins(LT[i]) and go left
      else if B[i][j] = 2, print Del(LR[j]) and go down
      else print LR[j] (Hit/Match or Substitution) and go down diagonally

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: the DP alignment grid between the correct/reference word sequence and the recognized/test word sequence; moves along one axis are insertions (Ins), moves along the other are deletions (Del), the goal cell is (n, m), and the three predecessor cells of (i, j) are (i-1, j-1), (i-1, j) and (i, j-1) (HTK convention)]

• A Dynamic Programming Algorithm
  - Initialization

      grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
      grid[0][0].sub = grid[0][0].hit = 0;
      grid[0][0].dir = NIL;

      for (i = 1; i <= n; i++) {        /* test */
          grid[i][0] = grid[i-1][0];
          grid[i][0].dir = HOR;
          grid[i][0].score += InsPen;
          grid[i][0].ins++;
      }

      for (j = 1; j <= m; j++) {        /* reference */
          grid[0][j] = grid[0][j-1];
          grid[0][j].dir = VERT;
          grid[0][j].score += DelPen;
          grid[0][j].del++;
      }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program

      for (i = 1; i <= n; i++) {                /* test */
          gridi  = grid[i];
          gridi1 = grid[i-1];
          for (j = 1; j <= m; j++) {            /* reference */
              h = gridi1[j].score + insPen;
              d = gridi1[j-1].score;
              if (lRef[j] != lTest[i])
                  d += subPen;
              v = gridi[j-1].score + delPen;
              if (d <= h && d <= v) {           /* DIAG = hit or sub */
                  gridi[j] = gridi1[j-1];
                  gridi[j].score = d;
                  gridi[j].dir = DIAG;
                  if (lRef[j] == lTest[i]) ++gridi[j].hit;
                  else                     ++gridi[j].sub;
              } else if (h < v) {               /* HOR = ins */
                  gridi[j] = gridi1[j];
                  gridi[j].score = h;
                  gridi[j].dir = HOR;
                  ++gridi[j].ins;
              } else {                          /* VERT = del */
                  gridi[j] = gridi[j-1];
                  gridi[j].score = v;
                  gridi[j].dir = VERT;
                  ++gridi[j].del;
              }
          } /* for j */
      } /* for i */

• Example 1
    Correct: A C B C C
    Test:    B A B C

    Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C  ->  WER = 60%
    (There is still another optimal alignment.)

  [Figure: the DP grid filled with cumulative (Ins, Del, Sub, Hit) counts for this example, following the HTK convention]

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
    Correct: A C B C C
    Test:    B A A C

    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C  ->  WER = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C  ->  WER = 80%
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          ->  WER = 80%

  [Figure: the DP grid filled with cumulative (Ins, Del, Sub, Hit) counts for this example]

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

    HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
    NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference (excerpt):  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output (excerpt): 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (In the files, each character appears on its own line, preceded by two numeric fields such as "100000 100000".)

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  - Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  - The result should show the number of substitution, deletion and insertion errors

    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    ==================================================================
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    ==================================================================
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    ==================================================================
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
    ==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B of colored balls; the observed data O is the "ball sequence" and the latent data S is the "bottle sequence"; the parameters \lambda to be estimated so as to maximize log P(O | \lambda) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)]

[Figure: given the training observations o_1 o_2 ... o_T, a 3-state ergodic HMM \lambda (with emission distributions such as A: .3, B: .2, C: .5 and transition probabilities such as 0.6, 0.7, 0.3, 0.2, 0.1) is re-estimated into a new model \bar{\lambda} such that p(O | \bar{\lambda}) > p(O | \lambda)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  - Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B, \pi} without consideration of the state sequence.
  - Two Major Steps:
    • E: take the expectation E_S[\,\cdot \mid O, \lambda\,] with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP
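A short C sketch of the ML estimates for i.i.d. data (only the diagonal of the covariance is computed, and the sample points are made up for illustration):

    #include <stdio.h>

    #define NPTS 4
    #define DIM  2

    int main(void)
    {
        double x[NPTS][DIM] = { {1.0, 2.0}, {2.0, 0.0}, {0.0, 1.0}, {1.0, 1.0} };
        double mu[DIM] = {0.0}, var[DIM] = {0.0};

        for (int i = 0; i < NPTS; i++)
            for (int d = 0; d < DIM; d++) mu[d] += x[i][d] / NPTS;      /* mu_ML */

        for (int i = 0; i < NPTS; i++)
            for (int d = 0; d < DIM; d++) {
                double diff = x[i][d] - mu[d];
                var[d] += diff * diff / NPTS;                            /* diag of Sigma_ML */
            }

        for (int d = 0; d < DIM; d++)
            printf("dim %d: mean = %.3f  variance = %.3f\n", d, mu[d], var[d]);
        return 0;
    }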

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  - Discover new model parameters to maximize the log-likelihood of incomplete data by iteratively maximizing the expectation of the log-likelihood from complete data

• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  - The observable training data O
    • We want to maximize P(O \mid \lambda); \lambda is a parameter vector
  - The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  - Assume we have \lambda and estimate the probability that each S occurred in the generation of O
  - Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S \mid O, \lambda), to compute a new \bar{\lambda}, the maximum likelihood estimate of \lambda
  - Does the process converge?

  - Algorithm
    • Log-likelihood expression and expectation taken over S:

      By Bayes' rule,  P(O, S \mid \bar{\lambda}) = P(S \mid O, \bar{\lambda})\, P(O \mid \bar{\lambda})
      \Rightarrow \log P(O \mid \bar{\lambda}) = \log P(O, S \mid \bar{\lambda}) - \log P(S \mid O, \bar{\lambda})

      Taking the expectation over S with the current (known) model setting \lambda:

      \log P(O \mid \bar{\lambda})
        = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})
        - \sum_{S} P(S \mid O, \lambda)\, \log P(S \mid O, \bar{\lambda})

      (\bar{\lambda}: unknown model setting; P(O, S \mid \bar{\lambda}): complete-data likelihood; P(O \mid \bar{\lambda}): incomplete-data likelihood)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  - Algorithm (cont.)
    • We can thus express \log P(O \mid \bar{\lambda}) as follows:

      \log P(O \mid \bar{\lambda}) = Q(\lambda, \bar{\lambda}) - H(\lambda, \bar{\lambda}), where

      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})

      H(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(S \mid O, \bar{\lambda})

    • We want \log P(O \mid \bar{\lambda}) \ge \log P(O \mid \lambda), i.e.

      Q(\lambda, \bar{\lambda}) - H(\lambda, \bar{\lambda}) \ge Q(\lambda, \lambda) - H(\lambda, \lambda)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(\lambda, \bar{\lambda}) has the following property:

    H(\lambda, \bar{\lambda}) - H(\lambda, \lambda)
      = \sum_{S} P(S \mid O, \lambda)\, \log \frac{P(S \mid O, \bar{\lambda})}{P(S \mid O, \lambda)}
      \le \sum_{S} P(S \mid O, \lambda)\, \Big( \frac{P(S \mid O, \bar{\lambda})}{P(S \mid O, \lambda)} - 1 \Big)
      = 0

    (using \log x \le x - 1, i.e. Jensen's inequality; equivalently, the Kullback-Leibler (KL) distance is non-negative)

  - Therefore, for maximizing \log P(O \mid \bar{\lambda}), we only need to maximize the Q-function (auxiliary function)

      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})

    (the expectation of the complete-data log-likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector \lambda = (\boldsymbol{\pi}, \mathbf{A}, \mathbf{B})
  - By maximizing the auxiliary function

      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})
                                = \sum_{S} \frac{P(O, S \mid \lambda)}{P(O \mid \lambda)}\, \log P(O, S \mid \bar{\lambda})

  - Where P(O, S \mid \lambda) and \log P(O, S \mid \bar{\lambda}) can be expressed as

      P(O, S \mid \lambda) = \pi_{s_1}\, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)

      \log P(O, S \mid \bar{\lambda}) = \log \bar{\pi}_{s_1}
         + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1} s_t}
         + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

    Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{a}(\lambda, \bar{\mathbf{A}}) + Q_{b}(\lambda, \bar{\mathbf{B}}), where

    Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}})
      = \sum_{i=1}^{N} \frac{P(O, s_1 = i \mid \lambda)}{P(O \mid \lambda)}\, \log \bar{\pi}_i
      = \sum_{i=1}^{N} P(s_1 = i \mid O, \lambda)\, \log \bar{\pi}_i

    Q_{a}(\lambda, \bar{\mathbf{A}})
      = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1}
        \frac{P(O, s_t = i, s_{t+1} = j \mid \lambda)}{P(O \mid \lambda)}\, \log \bar{a}_{ij}

    Q_{b}(\lambda, \bar{\mathbf{B}})
      = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1,\, o_t = v_k}^{T}
        \frac{P(O, s_t = j \mid \lambda)}{P(O \mid \lambda)}\, \log \bar{b}_j(v_k)

  (each term has the form \sum_i w_i \log y_i)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in \bar{\pi}_i, \bar{a}_{ij} and \bar{b}_j(k)
  - They can be maximized individually
  - All are of the same form:

      F(\mathbf{y}) = \sum_{j=1}^{N} w_j \log y_j,
      \quad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0,

    which has its maximum value when

      y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

    Suppose that
      F = \sum_{j=1}^{N} w_j \log y_j + \ell \Big( \sum_{j=1}^{N} y_j - 1 \Big)
    (the constraint is \sum_{j=1}^{N} y_j = 1)

    \frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0
    \;\Rightarrow\; w_j = -\ell\, y_j
    \;\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell
    \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

  (Lagrange multiplier tutorial: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html)

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set \bar{\lambda} = (\bar{\boldsymbol{\pi}}, \bar{\mathbf{A}}, \bar{\mathbf{B}}) can be expressed as

    \bar{\pi}_i = \frac{P(O, s_1 = i \mid \lambda)}{P(O \mid \lambda)} = P(s_1 = i \mid O, \lambda) = \gamma_1(i)

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j \mid \lambda)\, /\, P(O \mid \lambda)}
                        {\sum_{t=1}^{T-1} P(O, s_t = i \mid \lambda)\, /\, P(O \mid \lambda)}
                 = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

    \bar{b}_j(v_k) = \frac{\sum_{t=1,\, o_t = v_k}^{T} P(O, s_t = j \mid \lambda)\, /\, P(O \mid \lambda)}
                          {\sum_{t=1}^{T} P(O, s_t = j \mid \lambda)\, /\, P(O \mid \lambda)}
                   = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  - The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  - A discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  - The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o})
                      = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})
                      = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}}
                        \exp\!\Big( -\tfrac{1}{2} (\mathbf{o} - \boldsymbol{\mu}_{jk})^{\mathsf T}
                                    \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o} - \boldsymbol{\mu}_{jk}) \Big),
      \qquad \sum_{k=1}^{M} c_{jk} = 1

  [Figure: the output distribution for state i as a mixture of Gaussians N_1, N_2, N_3 with weights w_{i1}, w_{i2}, w_{i3}]
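A C sketch of evaluating this state output probability with diagonal covariances (the dimension, names and layout are assumptions; a real implementation would accumulate in the log domain with a log-add routine rather than exponentiate directly):

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define D 39            /* feature dimension (assumed) */

    static double log_gaussian_diag(const double o[D], const double mu[D], const double var[D])
    {
        double lp = -0.5 * D * log(2.0 * M_PI);
        for (int d = 0; d < D; d++) {
            double diff = o[d] - mu[d];
            lp -= 0.5 * (log(var[d]) + diff * diff / var[d]);
        }
        return lp;
    }

    /* log b_j(o) = log sum_k c_jk N(o; mu_jk, Sigma_jk) for one state j */
    double log_state_output_prob(int M, const double o[D],
                                 const double c[M], double mu[M][D], double var[M][D])
    {
        double p = 0.0;
        for (int k = 0; k < M; k++)
            p += c[k] * exp(log_gaussian_diag(o, mu[k], var[k]));
        return log(p);
    }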

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(\mathbf{o}) with respect to each single mixture component b_{jk}(\mathbf{o}):

    P(O, S \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)
                         = \prod_{t=1}^{T} a_{s_{t-1} s_t} \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(o_t)
                         = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M}
                           \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(o_t)

  so that

    P(O \mid \lambda) = \sum_{S} \sum_{K} P(O, S, K \mid \lambda), \qquad
    P(O, S, K \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(o_t)

  where K = (k_1, k_2, \ldots, k_T) is one of the possible mixture-component sequences along the state sequence S
  (here a_{s_0 s_1} denotes the initial probability \pi_{s_1})

  Note:

    \prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk}
      = \Big( \sum_{k=1}^{M} a_{1k} \Big) \Big( \sum_{k=1}^{M} a_{2k} \Big) \cdots \Big( \sum_{k=1}^{M} a_{Tk} \Big)
      = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(\lambda, \bar{\lambda}) = \sum_{S} \sum_{K} P(S, K \mid O, \lambda)\, \log P(O, S, K \mid \bar{\lambda})
                              = \sum_{S} \sum_{K} \frac{P(O, S, K \mid \lambda)}{P(O \mid \lambda)}\, \log P(O, S, K \mid \bar{\lambda})

  where

    \log P(O, S, K \mid \bar{\lambda}) = \log \bar{\pi}_{s_1}
       + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}}
       + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(o_t)
       + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t}

  so that

    Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}})   (initial probabilities)
                              + Q_{a}(\lambda, \bar{\mathbf{A}})          (state transition probabilities)
                              + Q_{b}(\lambda, \bar{\mathbf{B}})          (Gaussian density functions)
                              + Q_{c}(\lambda, \bar{\mathbf{C}})          (mixture component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the terms Q_b and Q_c:

    Q_{b}(\lambda, \bar{\mathbf{B}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        P(s_t = j, k_t = k \mid O, \lambda)\, \log \bar{b}_{jk}(o_t)

    Q_{c}(\lambda, \bar{\mathbf{C}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        P(s_t = j, k_t = k \mid O, \lambda)\, \log \bar{c}_{jk}

  where P(s_t = j, k_t = k \mid O, \lambda) = \gamma_t(j,k)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

  Let \gamma_t(j,k) = P(s_t = j, k_t = k \mid O, \lambda). Then

    Q_{b}(\lambda, \bar{\mathbf{B}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        \gamma_t(j,k)\, \log \bar{b}_{jk}(o_t)

  where

    \log \bar{b}_{jk}(o_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}|
       - \frac{1}{2} (o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}
         \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk})

  Setting the derivative with respect to \bar{\boldsymbol{\mu}}_{jk} to zero
  (using \frac{\partial}{\partial \mathbf{x}} \mathbf{x}^{\mathsf T}\mathbf{C}\mathbf{x} = (\mathbf{C} + \mathbf{C}^{\mathsf T})\mathbf{x}, and noting that \bar{\boldsymbol{\Sigma}}_{jk}^{-1} is symmetric):

    \frac{\partial Q_b}{\partial \bar{\boldsymbol{\mu}}_{jk}}
       = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (o_t - \bar{\boldsymbol{\mu}}_{jk}) = 0
    \quad \Rightarrow \quad
    \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

  Setting the derivative of Q_b with respect to \bar{\boldsymbol{\Sigma}}_{jk}^{-1} to zero
  (using \frac{\partial}{\partial \mathbf{X}} \log|\det \mathbf{X}| = (\mathbf{X}^{-1})^{\mathsf T},
  \frac{\partial}{\partial \mathbf{X}} \mathbf{a}^{\mathsf T}\mathbf{X}\mathbf{b} = \mathbf{a}\mathbf{b}^{\mathsf T},
  and the symmetry of \bar{\boldsymbol{\Sigma}}_{jk}):

    \frac{\partial Q_b}{\partial \bar{\boldsymbol{\Sigma}}_{jk}^{-1}}
      = \frac{1}{2} \sum_{t=1}^{T} \gamma_t(j,k)
        \Big[ \bar{\boldsymbol{\Sigma}}_{jk} - (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T} \Big] = 0

    \Rightarrow \quad
    \bar{\boldsymbol{\Sigma}}_{jk}
      = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
             {\sum_{t=1}^{T} \gamma_t(j,k)}

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    \bar{\boldsymbol{\mu}}_{jk}
      = \frac{\sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)\, o_t}
             {\sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)}
      = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

    \bar{\boldsymbol{\Sigma}}_{jk}
      = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\boldsymbol{\mu}}_{jk})(o_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
             {\sum_{t=1}^{T} \gamma_t(j,k)}

    \bar{c}_{jk}
      = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}

Page 9: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 9

Hidden Markov Model

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

  Reference:   桃芝颱風重創花蓮光復鄉大興村死傷慘重感觸最多……
  ASR Output:  桃芝颱風重創花蓮光復鄉打新村次傷殘周感觸最多……

  (In the data files each character is preceded by two numeric fields, e.g. "100000 100000 桃".)

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results ---------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  =====================================================================
  ------------------------ Overall Results ---------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  =====================================================================
  ------------------------ Overall Results ---------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  =====================================================================
  ------------------------ Overall Results ---------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  =====================================================================
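As a quick sanity check (added to these notes, not on the original slide), the 506-story numbers follow directly from the definitions on the "Measures of ASR Performance" slides:

    \mathrm{Corr} = \frac{H}{N} = \frac{57144}{65812} \approx 86.83\%,\qquad
    \mathrm{Acc}  = \frac{H - I}{N} = \frac{57144 - 504}{65812} \approx 86.06\%

    \mathrm{CER}  = \frac{S + D + I}{N} = \frac{7839 + 829 + 504}{65812} \approx 13.94\% = 100\% - \mathrm{Acc}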

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  [Figure: two bottles A and B containing red (R) and green (G) balls.]

  Observed data O:  the "ball sequence"
  Latent data S:    the "bottle sequence"
  Parameters λ to be estimated so as to maximize log P(O|λ):
      P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

  [Figure: a three-state ergodic discrete HMM (states s1, s2, s3 with output probabilities over
   A, B, C and transition probabilities such as 0.6, 0.7, 0.3, 0.2, 0.1); given training
   observations o1 o2 …… oT with likelihood p(O|λ), one EM re-estimation produces a new model λ̄
   with p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate
      variables, called latent data; in our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or
      difficult; in our case here, it is almost impossible to estimate {A, B} without
      consideration of the state sequence
  – Two Major Steps
    • E: take the expectation with respect to the latent data S, using the current estimate
      of the parameters and conditioned on the observations, i.e. E_S[ · | O, λ ]
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML)
      or Maximum A Posteriori (MAP) criterion
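For reference, the two steps can be written compactly with the Q-function notation that the later slides introduce (this summary is added here and is not on the original slide):

    \text{E-step:}\quad
    Q(\lambda,\bar{\lambda})
      = E_S\!\left[\,\log P(O,S\mid\bar{\lambda}) \,\middle|\, O,\lambda\right]
      = \sum_S P(S\mid O,\lambda)\,\log P(O,S\mid\bar{\lambda})

    \text{M-step:}\quad
    \bar{\lambda}^{\,\ast} = \arg\max_{\bar{\lambda}} Q(\lambda,\bar{\lambda})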

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = {x1, x2, …, xn}:

  – The Maximum Likelihood (ML) Principle:
    find the model parameter Φ so that the likelihood p(X|Φ) is maximum,

        \Phi_{ML} = \arg\max_{\Phi} p(\mathbf{X}\mid\Phi)

    For example, if Φ = (μ, Σ) are the parameters of a multivariate normal distribution and X
    is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

        \boldsymbol{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i,\qquad
        \boldsymbol{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n}
            (\mathbf{x}_i-\boldsymbol{\mu}_{ML})(\mathbf{x}_i-\boldsymbol{\mu}_{ML})^{T}

  – The Maximum A Posteriori (MAP) Principle:
    find the model parameter Φ so that the posterior probability p(Φ|X) is maximum,

        \Phi_{MAP} = \arg\max_{\Phi} p(\Phi\mid\mathbf{X})
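A tiny one-dimensional instance of the ML formulas (added here as an illustration, not from the slide): for the samples x = {1, 2, 3},

    \mu_{ML} = \tfrac{1}{3}(1+2+3) = 2,\qquad
    \sigma^{2}_{ML} = \tfrac{1}{3}\left[(1-2)^2+(2-2)^2+(3-2)^2\right] = \tfrac{2}{3}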

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters λ̄ that maximize the log-likelihood of the incomplete data,
    log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the
    complete data, log P(O,S|λ)

• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the
      underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to
  the probability P(S|O,λ), and compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?

– Algorithm
  • Log-likelihood expression, with the expectation taken over S:

    By Bayes' rule,

        P(O,S\mid\lambda) = P(S\mid O,\lambda)\,P(O\mid\lambda)
        \quad\Rightarrow\quad
        \log P(O\mid\lambda) = \log P(O,S\mid\lambda) - \log P(S\mid O,\lambda)

    Writing the same identity for an unknown (new) model setting λ̄ and taking the expectation
    over S with respect to P(S|O,λ):

        \log P(O\mid\bar{\lambda})
          = \sum_S P(S\mid O,\lambda)\,\log P(O,S\mid\bar{\lambda})
          - \sum_S P(S\mid O,\lambda)\,\log P(S\mid O,\bar{\lambda})

    (the first term involves the complete-data likelihood; the left-hand side is the
     incomplete-data likelihood)

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express log P(O|λ̄) as follows:

        \log P(O\mid\bar{\lambda}) = Q(\lambda,\bar{\lambda}) - H(\lambda,\bar{\lambda})

    where

        Q(\lambda,\bar{\lambda}) = \sum_S P(S\mid O,\lambda)\,\log P(O,S\mid\bar{\lambda}),\qquad
        H(\lambda,\bar{\lambda}) = \sum_S P(S\mid O,\lambda)\,\log P(S\mid O,\bar{\lambda})

  • We want log P(O|λ̄) ≥ log P(O|λ); subtracting the two decompositions gives

        \log P(O\mid\bar{\lambda}) - \log P(O\mid\lambda)
          = \bigl[Q(\lambda,\bar{\lambda}) - Q(\lambda,\lambda)\bigr]
          - \bigl[H(\lambda,\bar{\lambda}) - H(\lambda,\lambda)\bigr]

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ,λ̄) has the following property:

      H(\lambda,\bar{\lambda}) - H(\lambda,\lambda)
        = \sum_S P(S\mid O,\lambda)\,\log\frac{P(S\mid O,\bar{\lambda})}{P(S\mid O,\lambda)}
        \le \sum_S P(S\mid O,\lambda)\left[\frac{P(S\mid O,\bar{\lambda})}{P(S\mid O,\lambda)} - 1\right]
        = \sum_S P(S\mid O,\bar{\lambda}) - \sum_S P(S\mid O,\lambda) = 0

  using log x ≤ x − 1 (Jensen's inequality); equivalently, the Kullback-Leibler (KL) distance
  between P(S|O,λ) and P(S|O,λ̄) is non-negative, so H(λ,λ̄) ≤ H(λ,λ).

– Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function
  (auxiliary function):

      Q(\lambda,\bar{\lambda}) = \sum_S P(S\mid O,\lambda)\,\log P(O,S\mid\bar{\lambda})

  i.e., the expectation of the complete-data log-likelihood with respect to the latent
  state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – by maximizing the auxiliary function

        Q(\lambda,\bar{\lambda})
          = \sum_S \frac{P(O,S\mid\lambda)}{P(O\mid\lambda)}\,\log P(O,S\mid\bar{\lambda})

  – where P(O,S|λ) and log P(O,S|λ̄) can be expressed as

        P(O,S\mid\lambda)
          = \pi_{s_1}\, b_{s_1}(o_1)\prod_{t=2}^{T} a_{s_{t-1}s_t}\, b_{s_t}(o_t)

        \log P(O,S\mid\bar{\lambda})
          = \log\bar{\pi}_{s_1}
          + \sum_{t=2}^{T}\log\bar{a}_{s_{t-1}s_t}
          + \sum_{t=1}^{T}\log\bar{b}_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

      Q(\lambda,\bar{\lambda})
        = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a}) + Q_{b}(\lambda,\bar{b})

  where

      Q_{\pi}(\lambda,\bar{\pi})
        = \sum_{i=1}^{N}\frac{P(O, s_1=i\mid\lambda)}{P(O\mid\lambda)}\,\log\bar{\pi}_i

      Q_{a}(\lambda,\bar{a})
        = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}
          \frac{P(O, s_t=i, s_{t+1}=j\mid\lambda)}{P(O\mid\lambda)}\,\log\bar{a}_{ij}

      Q_{b}(\lambda,\bar{b})
        = \sum_{j=1}^{N}\sum_{k=1}^{M}\;
          \sum_{t=1,\;\text{s.t. } o_t=v_k}^{T}
          \frac{P(O, s_t=j\mid\lambda)}{P(O\mid\lambda)}\,\log\bar{b}_j(v_k)

  (each term is a weighted sum of logarithms, of the form \sum_i w_i \log y_i)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, \bar{\pi}_i, \bar{a}_{ij} and \bar{b}_j(k)
  – They can be maximized individually
  – All are of the same form:

        F(y_1,\dots,y_N) = \sum_{j=1}^{N} w_j\,\log y_j,
        \quad\text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0,

    has its maximum value when

        y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
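A tiny numeric check of this lemma (added here, not on the slide): with N = 3 and weights w = (2, 1, 1), the maximizer is

    y = \left(\tfrac{2}{4},\; \tfrac{1}{4},\; \tfrac{1}{4}\right) = (0.5,\, 0.25,\, 0.25)

and any other y on the probability simplex gives a smaller value of F.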

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  Introducing the Lagrange multiplier \ell for the constraint \sum_{j=1}^{N} y_j = 1:

      F = \sum_{j=1}^{N} w_j\,\log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right)

  Setting the partial derivatives to zero:

      \frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0
      \;\Rightarrow\; w_j = -\ell\, y_j,\qquad j = 1,\dots,N

  Summing over j and using the constraint:

      \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell
      \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

  Lagrange multiplier tutorial: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as

      \bar{\pi}_i = \frac{P(O, s_1=i\mid\lambda)}{P(O\mid\lambda)}

      \bar{a}_{ij}
        = \frac{\sum_{t=1}^{T-1} P(O, s_t=i, s_{t+1}=j\mid\lambda)\,/\,P(O\mid\lambda)}
               {\sum_{t=1}^{T-1} P(O, s_t=i\mid\lambda)\,/\,P(O\mid\lambda)}

      \bar{b}_i(k)
        = \frac{\sum_{t=1,\;\text{s.t. } o_t=v_k}^{T} P(O, s_t=i\mid\lambda)\,/\,P(O\mid\lambda)}
               {\sum_{t=1}^{T} P(O, s_t=i\mid\lambda)\,/\,P(O\mid\lambda)}

  Equivalently, in terms of the state-occupation and transition posteriors γ_t(i) and ξ_t(i,j)
  defined earlier:

      \bar{\pi}_i = \gamma_1(i),\qquad
      \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)},\qquad
      \bar{b}_i(k) = \frac{\sum_{t=1,\;\text{s.t. } o_t=v_k}^{T}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}
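To make the update concrete, here is a minimal single-iteration Baum-Welch sketch for a discrete HMM that follows the formulas above; the 3-state/3-symbol model, the toy observation sequence, and all variable names are illustrative assumptions added to these notes, not values from the lecture.

    #include <stdio.h>

    #define N 3               /* states  */
    #define M 3               /* symbols: 0=A, 1=B, 2=C */
    #define T 4               /* length of the toy observation sequence */

    int main(void)
    {
        double pi[N]   = {0.4, 0.3, 0.3};
        double a[N][N] = {{0.6, 0.3, 0.1}, {0.2, 0.5, 0.3}, {0.3, 0.2, 0.5}};
        double b[N][M] = {{0.3, 0.2, 0.5}, {0.7, 0.1, 0.2}, {0.3, 0.6, 0.1}};
        int    o[T]    = {0, 1, 1, 2};                 /* toy observations A B B C */

        double alpha[T][N], beta[T][N], gamma[T][N], xi[T-1][N][N], P = 0.0;
        int i, j, k, t;

        /* forward pass */
        for (i = 0; i < N; i++) alpha[0][i] = pi[i] * b[i][o[0]];
        for (t = 1; t < T; t++)
            for (j = 0; j < N; j++) {
                double s = 0.0;
                for (i = 0; i < N; i++) s += alpha[t-1][i] * a[i][j];
                alpha[t][j] = s * b[j][o[t]];
            }
        for (i = 0; i < N; i++) P += alpha[T-1][i];    /* P(O|lambda) */

        /* backward pass */
        for (i = 0; i < N; i++) beta[T-1][i] = 1.0;
        for (t = T - 2; t >= 0; t--)
            for (i = 0; i < N; i++) {
                double s = 0.0;
                for (j = 0; j < N; j++) s += a[i][j] * b[j][o[t+1]] * beta[t+1][j];
                beta[t][i] = s;
            }

        /* state-occupation and transition posteriors */
        for (t = 0; t < T; t++)
            for (i = 0; i < N; i++) gamma[t][i] = alpha[t][i] * beta[t][i] / P;
        for (t = 0; t < T - 1; t++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    xi[t][i][j] = alpha[t][i] * a[i][j] * b[j][o[t+1]] * beta[t+1][j] / P;

        /* re-estimation, following the formulas above */
        for (i = 0; i < N; i++) {
            double occ = 0.0;                          /* sum over t < T-1 of gamma_t(i) */
            for (t = 0; t < T - 1; t++) occ += gamma[t][i];
            pi[i] = gamma[0][i];
            for (j = 0; j < N; j++) {
                double num = 0.0;
                for (t = 0; t < T - 1; t++) num += xi[t][i][j];
                a[i][j] = num / occ;
            }
            for (k = 0; k < M; k++) {
                double num = 0.0, den = 0.0;
                for (t = 0; t < T; t++) { den += gamma[t][i]; if (o[t] == k) num += gamma[t][i]; }
                b[i][k] = num / den;
            }
        }

        printf("P(O|lambda) = %g, new a[0][0] = %g, new b[0][A] = %g\n", P, a[0][0], b[0][0]);
        return 0;
    }

In practice the forward/backward recursions are scaled or kept in the log domain (cf. the LogAdd routine shown earlier), and the numerators and denominators are accumulated over all L training utterances before the final division, as in the multiple-utterance formulas of the earlier slides.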

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observations do not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the
    state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the
    continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture
    density functions (M mixtures):

        b_j(\mathbf{o})
          = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o})
          = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o};\,\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})
          = \sum_{k=1}^{M} c_{jk}\,(2\pi)^{-L/2}\,|\boldsymbol{\Sigma}_{jk}|^{-1/2}
            \exp\!\left[-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{T}
                         \boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}-\boldsymbol{\mu}_{jk})\right],
        \qquad \sum_{k=1}^{M} c_{jk} = 1

  [Figure: "Distribution for State i", a mixture of three Gaussians N1, N2, N3 with weights
   w_{i1}, w_{i2}, w_{i3}.]
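A small sketch of evaluating this state output probability for the common diagonal-covariance case (added here; the function name, the sizes and the toy parameters are assumptions, not from the lecture):

    #include <math.h>
    #include <stdio.h>

    #define M 2    /* mixture components per state */
    #define L 2    /* feature vector dimensionality */

    /* b_j(o) = sum_k c[k] * N(o; mu[k], diag(var[k])), diagonal covariances assumed */
    double state_output_prob(double c[M], double mu[M][L], double var[M][L], double o[L])
    {
        const double PI = 3.14159265358979323846;
        double b = 0.0;
        int k, d;
        for (k = 0; k < M; k++) {
            double logg = -0.5 * L * log(2.0 * PI);   /* log Gaussian, built up per dimension */
            for (d = 0; d < L; d++) {
                double dev = o[d] - mu[k][d];
                logg += -0.5 * log(var[k][d]) - 0.5 * dev * dev / var[k][d];
            }
            b += c[k] * exp(logg);
        }
        return b;
    }

    int main(void)
    {
        double c[M]      = {0.6, 0.4};                   /* toy mixture weights */
        double mu[M][L]  = {{0.0, 0.0}, {1.0, 1.0}};     /* toy means           */
        double var[M][L] = {{1.0, 1.0}, {0.5, 0.5}};     /* toy variances       */
        double o[L]      = {0.2, 0.4};                   /* one observation     */
        printf("b_j(o) = %g\n", state_output_prob(c, mu, var, o));
        return 0;
    }

In a real recognizer this quantity would be computed in the log domain and combined with the LogAdd routine shown earlier to avoid underflow.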

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(\mathbf{o}) with respect to each single mixture component b_{jk}(\mathbf{o}):

      p(O,S\mid\lambda)
        = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(\mathbf{o}_t)
        = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}}
          \prod_{t=1}^{T}\left[\sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(\mathbf{o}_t)\right]
        = \sum_{\mathbf{K}} \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}}
          \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)
        = \sum_{\mathbf{K}} p(O,S,\mathbf{K}\mid\lambda)

  where \mathbf{K} = (k_1, k_2, \dots, k_T) is one of the possible mixture component sequences
  along the state sequence S, and therefore

      p(O\mid\lambda) = \sum_{S}\sum_{\mathbf{K}} p(O,S,\mathbf{K}\mid\lambda)

  Note: the product of sums expands into a sum over all component sequences,

      \prod_{t=1}^{T}\left(\sum_{k=1}^{M} a_{tk}\right)
        = (a_{11}+\dots+a_{1M})(a_{21}+\dots+a_{2M})\cdots(a_{T1}+\dots+a_{TM})
        = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\;\prod_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

      Q(\lambda,\bar{\lambda})
        = \sum_{S}\sum_{\mathbf{K}} P(S,\mathbf{K}\mid O,\lambda)\,\log p(O,S,\mathbf{K}\mid\bar{\lambda})
        = \sum_{S}\sum_{\mathbf{K}} \frac{p(O,S,\mathbf{K}\mid\lambda)}{p(O\mid\lambda)}\,
          \log p(O,S,\mathbf{K}\mid\bar{\lambda})

  with the complete-data log-likelihood

      \log p(O,S,\mathbf{K}\mid\bar{\lambda})
        = \log\bar{\pi}_{s_1}
        + \sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}}
        + \sum_{t=1}^{T}\log\bar{b}_{s_t k_t}(\mathbf{o}_t)
        + \sum_{t=1}^{T}\log\bar{c}_{s_t k_t}

  so that

      Q(\lambda,\bar{\lambda})
        = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a})
        + Q_{b}(\lambda,\bar{b}) + Q_{c}(\lambda,\bar{c})

  (initial probabilities, state transition probabilities, Gaussian density functions,
   and mixture component weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the terms for the Gaussian
  densities and the mixture weights:

      Q_{b}(\lambda,\bar{b})
        = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}
          P(s_t=j, k_t=k\mid O,\lambda)\,\log\bar{b}_{jk}(\mathbf{o}_t)

      Q_{c}(\lambda,\bar{c})
        = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}
          P(s_t=j, k_t=k\mid O,\lambda)\,\log\bar{c}_{jk}

  where the joint state/mixture posterior P(s_t=j, k_t=k\mid O,\lambda) is written as
  \gamma_t(j,k) below.

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Re-estimation of the mixture mean vectors. Let

      \gamma_t(j,k) \equiv P(s_t=j, k_t=k\mid O,\lambda)

  and write the k-th Gaussian of state j as

      \bar{b}_{jk}(\mathbf{o}_t)
        = (2\pi)^{-L/2}\,|\bar{\boldsymbol{\Sigma}}_{jk}|^{-1/2}
          \exp\!\left[-\tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}
                       \bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})\right]

      \log\bar{b}_{jk}(\mathbf{o}_t)
        = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}|
          - \tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}
            \bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})

  Setting the derivative of Q_b with respect to \bar{\boldsymbol{\mu}}_{jk} to zero
  (using d(\mathbf{x}^{T}\mathbf{C}\mathbf{x})/d\mathbf{x} = (\mathbf{C}+\mathbf{C}^{T})\mathbf{x};
  \bar{\boldsymbol{\Sigma}}_{jk}^{-1} is symmetric here):

      \frac{\partial Q_b}{\partial\bar{\boldsymbol{\mu}}_{jk}}
        = \sum_{t=1}^{T}\gamma_t(j,k)\,
          \bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk}) = 0
      \quad\Rightarrow\quad
      \bar{\boldsymbol{\mu}}_{jk}
        = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Re-estimation of the mixture covariance matrices. Setting the derivative of Q_b with respect
  to \bar{\boldsymbol{\Sigma}}_{jk} (equivalently, with respect to
  \bar{\boldsymbol{\Sigma}}_{jk}^{-1}) to zero, using
  d\log\det(\mathbf{X})/d\mathbf{X} = (\mathbf{X}^{-1})^{T} and
  d(\mathbf{a}^{T}\mathbf{X}\mathbf{b})/d\mathbf{X} = \mathbf{a}\mathbf{b}^{T}
  (\bar{\boldsymbol{\Sigma}}_{jk} is symmetric here):

      \frac{\partial Q_b}{\partial\bar{\boldsymbol{\Sigma}}_{jk}^{-1}}
        = \tfrac{1}{2}\sum_{t=1}^{T}\gamma_t(j,k)
          \left[\bar{\boldsymbol{\Sigma}}_{jk}
                - (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})
                  (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\right] = 0

      \Rightarrow\quad
      \bar{\boldsymbol{\Sigma}}_{jk}
        = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,
                (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}
               {\sum_{t=1}^{T}\gamma_t(j,k)}

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

      \bar{\boldsymbol{\mu}}_{jk}
        = \frac{\sum_{t=1}^{T} p(s_t=j, k_t=k\mid O,\lambda)\,\mathbf{o}_t}
               {\sum_{t=1}^{T} p(s_t=j, k_t=k\mid O,\lambda)}
        = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}

      \bar{\boldsymbol{\Sigma}}_{jk}
        = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,
                (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}
               {\sum_{t=1}^{T}\gamma_t(j,k)}

      \bar{c}_{jk}
        = \frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)}
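A minimal sketch of these continuous-density updates for one state j, assuming diagonal covariances and assuming the posteriors gamma[t][k] = P(s_t=j, k_t=k | O, λ) have already been computed (e.g. from a forward-backward pass); all names, sizes and toy numbers below are illustrative assumptions added to these notes, not values from the lecture.

    #include <stdio.h>

    #define T 3      /* frames            */
    #define M 2      /* mixtures          */
    #define D 2      /* feature dimension */

    int main(void)
    {
        double o[T][D]     = {{0.0, 1.0}, {1.0, 2.0}, {2.0, 0.0}};   /* toy observations */
        double gamma[T][M] = {{0.6, 0.4}, {0.5, 0.5}, {0.2, 0.8}};   /* toy posteriors   */
        double mu[M][D], var[M][D], c[M];
        int t, k, d;

        double occ_j = 0.0;                   /* sum over t and m of gamma_t(j,m) */
        for (t = 0; t < T; t++) for (k = 0; k < M; k++) occ_j += gamma[t][k];

        for (k = 0; k < M; k++) {
            double occ = 0.0;                 /* sum over t of gamma_t(j,k) */
            for (t = 0; t < T; t++) occ += gamma[t][k];

            for (d = 0; d < D; d++) {         /* weighted mean */
                double num = 0.0;
                for (t = 0; t < T; t++) num += gamma[t][k] * o[t][d];
                mu[k][d] = num / occ;
            }
            for (d = 0; d < D; d++) {         /* weighted (diagonal) covariance */
                double num = 0.0;
                for (t = 0; t < T; t++) {
                    double dev = o[t][d] - mu[k][d];
                    num += gamma[t][k] * dev * dev;
                }
                var[k][d] = num / occ;
            }
            c[k] = occ / occ_j;               /* mixture weight */
            printf("mixture %d: c=%.3f mu=(%.3f, %.3f) var=(%.3f, %.3f)\n",
                   k, c[k], mu[k][0], mu[k][1], var[k][0], var[k][1]);
        }
        return 0;
    }

As with the discrete case, in practice these numerators and denominators are accumulated over all training utterances before the final division, following the multiple-utterance formulas of the earlier slides.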

Page 10: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 10

Hidden Markov Model (cont)

bull HMM an extended version of Observable Markov Modelndash The observation is turned to be a probabilistic function (discrete or

continuous) of a state instead of an one-to-one correspondence of a state

ndash The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)

bull What is hidden The State SequenceAccording to the observation sequence we are not sure which state sequence generates it

bull Elements of an HMM (the State-Output HMM) =SABndash S is a set of N statesndash A is the NN matrix of transition probabilities between statesndash B is a set of N probability functions each describing the observation

probability with respect to a statendash is the vector of initial state probabilities

SP - Berlin Chen 11

Hidden Markov Model (cont)

bull Two major assumptions ndash First order (Markov) assumption

bull The state transition depends only on the origin and destinationbull Time-invariant

ndash Output-independent assumptionbull All observations are dependent on the state that generated them

not on neighboring observations

jitt AijPisjsPisjsP 11

tttttttt sPsP oooooo 2112

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 11: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

Page 12: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 12

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Discrete and finite observations

bull The observations that all distinct states generate are finite in numberV=v1 v2 v3 helliphellip vM vkRL

bull In this case the set of observation probability distributions B=bj(vk) is defined as bj(vk)=P(ot=vk|st=j) 1kM 1jNot observation at time t st state at time t for state j bj(vk) consists of only M probability values

A left-to-right HMM

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulae for B is
  - For discrete and finite observations, b_j(v_k) = P(o_t = v_k \mid s_t = j):

      \bar{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}
                            {\text{expected number of times in state } j}
                     = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

  - For continuous and infinite observations, b_j(\mathbf{v}) = f_{\mathbf{O}|S}(\mathbf{o}_t = \mathbf{v} \mid s_t = j), modeled as a mixture of multivariate Gaussian distributions:

      b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{v}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})
        = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}}
          \exp\!\Big(-\tfrac{1}{2} (\mathbf{v}-\boldsymbol{\mu}_{jk})^{\mathsf T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{v}-\boldsymbol{\mu}_{jk})\Big)

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)
    • Define a new variable \gamma_t(j,k)
      - \gamma_t(j,k) is the probability of being in state j at time t, with the k-th mixture component accounting for \mathbf{o}_t:

        \gamma_t(j,k) = P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)
          = P(s_t = j \mid \mathbf{O}, \lambda)\; P(m_t = k \mid s_t = j, \mathbf{o}_t, \lambda)
          = \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{m=1}^{N} \alpha_t(m)\,\beta_t(m)}
            \cdot \frac{c_{jk}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}
                       {\sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})}

        (the observation-independence assumption is applied in the second step; recall p(A,B) = p(A \mid B)\,P(B))

      - Note that \sum_{m=1}^{M} \gamma_t(j,m) = \gamma_t(j)

  [Figure: the mixture distribution for state 1, with weights c_{11}, c_{12}, c_{13} over Gaussian components N_1, N_2, N_3]
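  A small sketch of how the state posterior γ_t(j) might be split across the M mixture components of a state; comp_pdf[k] is assumed to hold N(o_t; μ_jk, Σ_jk) for the current frame and c[k] the mixture weights (names are illustrative).

    #include <stddef.h>

    void split_state_posterior(double gamma_tj, const double *c,
                               const double *comp_pdf, size_t M,
                               double *gamma_tjk /* M outputs */)
    {
        double denom = 0.0;
        for (size_t m = 0; m < M; ++m)
            denom += c[m] * comp_pdf[m];          /* b_j(o_t) for this state        */

        for (size_t k = 0; k < M; ++k)
            gamma_tjk[k] = gamma_tj * c[k] * comp_pdf[k] / denom;
        /* By construction, sum_k gamma_tjk[k] equals gamma_tj. */
    }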

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)

      \bar{c}_{jk} = \frac{\text{expected number of times in state } j \text{ and mixture } k}
                          {\text{expected number of times in state } j}
                   = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}

      \bar{\boldsymbol{\mu}}_{jk} = \text{weighted average (mean) of observations at state } j \text{ and mixture } k
                   = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

      \bar{\boldsymbol{\Sigma}}_{jk} = \text{weighted covariance of observations at state } j \text{ and mixture } k
                   = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
                          {\sum_{t=1}^{T} \gamma_t(j,k)}

  (Formulae for a single training utterance)

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

  [Figure: a 3-state left-to-right HMM (s1, s2, s3) for the word 台師大, with the forward-backward (F-B) procedure applied separately to each training utterance]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.): formulae for multiple (L) training utterances

      \bar{c}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}
                          {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \gamma_t^l(j,m)}
        \quad \text{(expected count in state } j \text{ and mixture } k \text{ over expected count in state } j\text{)}

      \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)\, \mathbf{o}_t^l}
                                         {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}
        \quad \text{(weighted average of observations at state } j \text{ and mixture } k\text{)}

      \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)\,
                                             (\mathbf{o}_t^l - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t^l - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
                                            {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}
        \quad \text{(weighted covariance of observations at state } j \text{ and mixture } k\text{)}

      \bar{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)
        \quad \text{(expected frequency in state } i \text{ at time } t=1\text{)}

      \bar{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i,j)}
                          {\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}
        \quad \text{(expected transitions from state } i \text{ to state } j \text{ over expected transitions from state } i\text{)}

  (Formulae for multiple (L) training utterances)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For discrete and finite observations (cont.): formulae for multiple (L) training utterances

      \bar{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)
        \quad \text{(expected frequency in state } i \text{ at time } t=1\text{)}

      \bar{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i,j)}
                          {\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}
        \quad \text{(expected transitions from state } i \text{ to state } j \text{ over expected transitions from state } i\text{)}

      \bar{b}_j(v_k) = \frac{\sum_{l=1}^{L} \sum_{t=1,\; o_t^l = v_k}^{T_l} \gamma_t^l(j)}
                            {\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j)}
        \quad \text{(expected count in state } j \text{ observing symbol } v_k \text{ over expected count in state } j\text{)}

  (Formulae for multiple (L) training utterances)

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  - The semicontinuous, or tied-mixture, HMM
  - A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions

      b_j(\mathbf{o}) = \sum_{k=1}^{M} b_j(k)\, f(\mathbf{o} \mid v_k)
                      = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{o}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

      where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent) and f(\mathbf{o} \mid v_k) is the k-th mixture density function, i.e. the k-th codeword (shared across HMMs; M is very large)

  - Because M is large, we can simply use the L most significant values of f(\mathbf{o} \mid v_k)
    • Experience showed that an L of about 1~3% of M is adequate
  - Partial tying of f(\mathbf{o} \mid v_k) for different phonetic classes
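  A minimal sketch of the tied-mixture output probability restricted to the top-L codewords; codeword_pdf, weights and top_idx are illustrative names (the shared per-frame codeword densities, the state's weights, and the indices of the L largest densities), not an HTK API.

    #include <stddef.h>

    double semicontinuous_output_prob(const double *weights,      /* b_j(k), per state  */
                                      const double *codeword_pdf, /* f(o | v_k), shared */
                                      const size_t *top_idx, size_t L)
    {
        double b = 0.0;
        for (size_t r = 0; r < L; ++r) {
            size_t k = top_idx[r];
            b += weights[k] * codeword_pdf[k];    /* b_j(k) * f(o | v_k) */
        }
        return b;
    }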

SP - Berlin Chen 59

Semicontinuous HMMs (cont.)

  [Figure: two 3-state HMMs whose state-dependent mixture weights b_j(1), ..., b_j(k), ..., b_j(M) all point into one shared codebook of Gaussian kernels N(\mu_1, \Sigma_1), N(\mu_2, \Sigma_2), ..., N(\mu_k, \Sigma_k), ..., N(\mu_M, \Sigma_M)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  - Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  - A left-to-right topology is a natural candidate for modeling the speech signal (also called the "beads-on-a-string" model); a minimal sketch of such a topology follows below
  - It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)
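  As a concrete illustration of a left-to-right topology, this sketch fills a transition matrix with a self-loop plus one forward arc per state; the 0.5/0.5 split is an arbitrary starting value for illustration, not a trained one.

    #include <string.h>

    #define MAX_STATES 8

    /* Initialize an N-state left-to-right transition matrix:
       each state may stay (a_ii) or move one state to the right (a_i,i+1). */
    void init_left_to_right(double A[MAX_STATES][MAX_STATES], int N)
    {
        memset(A, 0, sizeof(double) * MAX_STATES * MAX_STATES);
        for (int i = 0; i < N; ++i) {
            if (i + 1 < N) {
                A[i][i]     = 0.5;   /* self-loop           */
                A[i][i + 1] = 0.5;   /* forward transition  */
            } else {
                A[i][i] = 1.0;       /* final state absorbs */
            }
        }
    }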

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means segmentation into states (a structural sketch follows below)
  - Assume that we have a training set of observations and an initial estimate of all model parameters
  - Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  - Step 2:
    • For discrete-density HMMs (using an M-codeword codebook):
      b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For continuous-density HMMs (M Gaussian mixtures per state): cluster the observation vectors within each state into a set of M clusters; then
      w_{jm} = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
      \mu_{jm} = sample mean of the vectors classified in cluster m of state j
      \Sigma_{jm} = sample covariance matrix of the vectors classified in cluster m of state j
  - Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated
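  A structural sketch of the outer loop only; viterbi_segment(), reestimate_from_segmentation(), HMM and Corpus are hypothetical placeholders standing in for the three steps above, not routines defined in this lecture.

    #include <math.h>

    typedef struct HMM HMM;          /* model parameters (A, B, pi)     */
    typedef struct Corpus Corpus;    /* training observation sequences  */

    double viterbi_segment(HMM *hmm, const Corpus *data);              /* Step 1 */
    void   reestimate_from_segmentation(HMM *hmm, const Corpus *data); /* Step 2 */

    void segmental_kmeans_init(HMM *hmm, const Corpus *data, double threshold)
    {
        double prev = -INFINITY;
        for (;;) {
            double score = viterbi_segment(hmm, data);  /* also stores the alignment */
            reestimate_from_segmentation(hmm, data);
            if (fabs(score - prev) <= threshold)        /* Step 3: convergence check */
                break;
            prev = score;
        }
    }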

SP - Berlin Chen 62

Initialization of HMM (cont.)

  [Flowchart: Training Data and an Initial Model feed a loop of state-sequence segmentation followed by parameter estimation via Segmental K-means and model re-estimation; if the model has not converged (NO) the loop repeats, otherwise (YES) the Model Parameters are output]

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for a discrete HMM
  - 3 states and 2 codewords (v1, v2)
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3

  [Figure: a 10-frame observation sequence O1 ... O10 segmented across the three states s1, s2, s3 of a left-to-right model, with each frame labeled by its codeword (v1 or v2); counting codewords within each state segment gives the b_j(v_k) estimates above]

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for a continuous HMM
  - 3 states and 4 Gaussian mixtures per state

  [Figure: N training observation sequences O1, O2, ..., ON segmented across the three states s1, s2, s3; within each state the vectors are split by K-means (starting from the global mean, then cluster 1 mean, cluster 2 mean, ...) into 4 clusters, giving the per-mixture parameters (w_{j1}, \mu_{j1}, \Sigma_{j1}), ..., (w_{j4}, \mu_{j4}, \Sigma_{j4})]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  - The state duration follows an exponential (geometric) distribution, d_i(t) = a_{ii}^{\,t-1}(1 - a_{ii})
    • This does not provide an adequate representation of the temporal structure of speech
  - First-order (Markov) assumption: the state transition depends only on the origin and destination states
  - Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames

  Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications
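  As a small worked consequence of the geometric duration assumption (not stated explicitly on the slide), the expected state duration follows directly:

    % Expected duration of state i under d_i(t) = a_{ii}^{t-1}(1 - a_{ii})
    \begin{aligned}
    E[d_i] &= \sum_{t=1}^{\infty} t \, a_{ii}^{\,t-1} (1 - a_{ii})
            = (1 - a_{ii}) \sum_{t=1}^{\infty} t \, a_{ii}^{\,t-1} \\
           &= (1 - a_{ii}) \cdot \frac{1}{(1 - a_{ii})^{2}}
            = \frac{1}{1 - a_{ii}},
    \end{aligned}
    \qquad \text{e.g. } a_{ii} = 0.8 \;\Rightarrow\; E[d_i] = 5 \text{ frames.}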

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

  [Figure: example state-duration distributions: geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

  [Figure: likelihood surface over the model configuration space, with the current model configuration at a local maximum]

SP - Berlin Chen 68

Homework-2 (1/2)

  [Figure: a 3-state ergodic HMM; every transition probability is 0.33 or 0.34, and the initial symbol distributions are s1: A:0.34 B:0.33 C:0.33, s2: A:0.33 B:0.34 C:0.33, s3: A:0.33 B:0.33 C:0.34]

  TrainSet 1:
  1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB
  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

  TrainSet 2:
  1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB
  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to?
    ABCABCCAB
    AABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

  [Figure: the speech signal passes through feature extraction to give the feature sequence X; X is scored against each word model M_1, M_2, ..., M_V and a silence model M_Sil, and the most likely word selector picks the model with the highest likelihood]

    \text{Label}(\mathbf{X}) = \arg\max_{k} p(\mathbf{X} \mid M_k)

  Viterbi approximation:

    \text{Label}(\mathbf{X}) = \arg\max_{k} \max_{\mathbf{S}} p(\mathbf{X}, \mathbf{S} \mid M_k)
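  A minimal sketch of the selector loop, assuming a scoring routine log_viterbi_score() that returns max_S log p(X, S | M_k) for one model; the routine name, WordHMM type and feature layout are illustrative assumptions.

    #include <stddef.h>
    #include <math.h>

    typedef struct WordHMM WordHMM;                 /* one word model M_k      */
    double log_viterbi_score(const WordHMM *m,      /* max_S log p(X, S | M_k) */
                             const double *X, size_t T);

    /* Return the index of the word model with the highest Viterbi score. */
    size_t recognize_isolated_word(const WordHMM *const *models, size_t V,
                                   const double *X, size_t T)
    {
        size_t best = 0;
        double best_score = -INFINITY;
        for (size_t k = 0; k < V; ++k) {
            double s = log_viterbi_score(models[k], X, T);
            if (s > best_score) { best_score = s; best = k; }
        }
        return best;
    }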

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  - Substitution: an incorrect word was substituted for the correct word
  - Deletion: a correct word was omitted in the recognized sentence
  - Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  - A maximum substring matching problem
  - Can be handled by dynamic programming

• Example
  Correct:    "the effect is clear"
  Recognized: "effect is not clear"
  ("the" deleted, "not" inserted, "effect", "is", "clear" matched)

  - Error analysis: one deletion and one insertion
  - Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate      = 100% x (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%
      (might be higher than 100%)
    Word Correction Rate = 100% x (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%
    Word Accuracy Rate   = 100% x (Matched words - Ins) / (No. of words in the correct sentence) = (3-1)/4 = 50%
      (might be negative)

    WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

  [Figure: alignment grid with one axis indexing the correct/reference words and the other the recognized/test words (word lengths m and n respectively); each grid cell [i,j] stores the minimum word-error alignment up to that point, and the possible kinds of alignment (hit, substitution, insertion, deletion) correspond to the different moves into a cell]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen); here i indexes the test (recognized) words LT[1..n] and j the reference (correct) words LR[1..m]

  Step 1: Initialization
    G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1..n (test), for j = 1..m (reference):
      G[i][j] = min { G[i-1][j] + 1    (Insertion, horizontal direction),
                      G[i][j-1] + 1    (Deletion, vertical direction),
                      G[i-1][j-1] + 1  (Substitution, diagonal direction, if LT[i] != LR[j]),
                      G[i-1][j-1]      (Match, diagonal direction, if LT[i] == LR[j]) }
      B[i][j] = 1 (Insertion), 2 (Deletion), 3 (Substitution) or 4 (Match), according to the chosen term

  Step 3: Measure and Backtrace
    Word Error Rate    = 100% x G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path: from B[n][m] back to B[0][0];
      if B[i][j] == 1, print "Insertion: LT[i]" and go left;
      else if B[i][j] == 2, print "Deletion: LR[j]" and go down;
      else print "Hit/Match or Substitution: LR[j]" and go diagonally

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style implementation)
  - Initialization; the grid spans the recognized/test word sequence i = 1..n against the correct/reference word sequence j = 1..m, with moves from (i-1,j-1), (i-1,j) and (i,j-1) into (i,j):

    /* origin cell */
    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub   = grid[0][0].hit = 0;
    grid[0][0].dir   = NIL;

    for (i = 1; i <= n; i++) {            /* test axis: pure insertions     */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }

    for (j = 1; j <= m; j++) {            /* reference axis: pure deletions */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (recursion step; lTest[] holds the test words and lRef[] the reference words):

    for (i = 1; i <= n; i++) {                    /* test      */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {                /* reference */
            h = gridi1[j].score + insPen;         /* horizontal: insertion */
            d = gridi1[j-1].score;                /* diagonal: hit or sub  */
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;        /* vertical: deletion    */
            if (d <= h && d <= v) {               /* DIAG = hit or sub     */
                gridi[j] = gridi1[j-1];
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {                   /* HOR = ins             */
                gridi[j] = gridi1[j];
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                              /* VERT = del            */
                gridi[j] = gridi[j-1];
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }   /* for j */
    }       /* for i */

• Example 1 (HTK-style alignment)
  Correct: A C B C C
  Test:    B A B C

  [Figure: the filled alignment grid, each cell annotated with its (Ins, Del, Sub, Hit) counts; the optimal backtrace reads Ins B, Hit A, Del C, Hit B, Hit C, Del C]

  Alignment 1: WER = (1 Ins + 2 Del + 0 Sub) / 5 = 60%
  (There is still another optimal alignment.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
  Correct: A C B C C
  Test:    B A A C

  [Figure: the filled alignment grid with (Ins, Del, Sub, Hit) counts per cell]

  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   (WER = 80%)
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   (WER = 80%)
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          (WER = 80%)

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (each character in the label files is preceded by two numeric fields, e.g. "100000 100000 桃")

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  - Report the CER (character error rate) of the first one, 100, 200 and all 506 stories
  - The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  [Figure: two bottles A and B containing red (R) and green (G) balls]

  Observed data O: the "ball sequence" o1 o2 …… oT, with likelihood p(O|λ)
  Latent data S: the "bottle sequence"
  Parameters to be estimated to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

  [Figure: a 3-state HMM whose transition and symbol-emission probabilities are re-estimated from λ to λ̄ so that p(O|λ̄) > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation-Maximization)
  - Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data. In our case here, the state sequence S is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate (A, B, π) without consideration of the state sequence
  - Two Major Steps:
    • E: take the expectation E[· | O, λ] with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations O
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = X_1, X_2, ..., X_n with realizations x = x_1, x_2, ..., x_n

  - The Maximum Likelihood (ML) Principle: find the model parameter \Phi so that the likelihood p(\mathbf{x} \mid \Phi) is maximum.
    For example, if \Phi = (\boldsymbol{\mu}, \boldsymbol{\Sigma}) are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimate of \Phi = (\boldsymbol{\mu}, \boldsymbol{\Sigma}) is

      \boldsymbol{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i, \qquad
      \boldsymbol{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}_{ML})(\mathbf{x}_i - \boldsymbol{\mu}_{ML})^{\mathsf T}

  - The Maximum A Posteriori (MAP) Principle: find the model parameter \Phi so that the posterior likelihood p(\Phi \mid \mathbf{x}) is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  - Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,S|λ)

• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  - The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  - The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  - Assume we have \lambda and estimate the probability that each S occurred in the generation of O
  - Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S \mid O, \lambda), to compute a new \bar{\lambda}, the maximum likelihood estimate of \lambda
  - Does the process converge?
  - Algorithm
    • Log-likelihood expression and expectation taken over S:

      \log P(O, S \mid \bar{\lambda}) = \log P(S \mid O, \bar{\lambda}) + \log P(O \mid \bar{\lambda})
        \quad \text{(Bayes' rule; } \bar{\lambda} \text{ is the unknown model setting)}

      \Rightarrow\; \log P(O \mid \bar{\lambda}) = \log P(O, S \mid \bar{\lambda}) - \log P(S \mid O, \bar{\lambda})

      Taking the expectation over S with respect to P(S \mid O, \lambda):

      \log P(O \mid \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})
                                   - \sum_{S} P(S \mid O, \lambda)\, \log P(S \mid O, \bar{\lambda})

      (complete-data likelihood vs. incomplete-data likelihood)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  - Algorithm (cont.): we can thus express \log P(O \mid \bar{\lambda}) as follows

      \log P(O \mid \bar{\lambda}) = Q(\lambda, \bar{\lambda}) + H(\lambda, \bar{\lambda})

      where

      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda}), \qquad
      H(\lambda, \bar{\lambda}) = -\sum_{S} P(S \mid O, \lambda)\, \log P(S \mid O, \bar{\lambda})

  - We want \log P(O \mid \bar{\lambda}) \ge \log P(O \mid \lambda), i.e.

      Q(\lambda, \bar{\lambda}) + H(\lambda, \bar{\lambda}) \ge Q(\lambda, \lambda) + H(\lambda, \lambda)

SP - Berlin Chen 88

The EM Algorithm (7/7)

  - H(\lambda, \bar{\lambda}) has the following property: H(\lambda, \bar{\lambda}) \ge H(\lambda, \lambda)

      H(\lambda, \bar{\lambda}) - H(\lambda, \lambda)
        = -\sum_{S} P(S \mid O, \lambda)\, \log \frac{P(S \mid O, \bar{\lambda})}{P(S \mid O, \lambda)}
        \ge -\sum_{S} P(S \mid O, \lambda)\, \Big( \frac{P(S \mid O, \bar{\lambda})}{P(S \mid O, \lambda)} - 1 \Big) = 0

      (Jensen's inequality, using \log x \le x - 1; this difference is the Kullback-Leibler (KL) distance)

  - Therefore, for maximizing \log P(O \mid \bar{\lambda}) we only need to maximize the Q-function (auxiliary function)

      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda}),

      i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector \lambda = (\boldsymbol{\pi}, \mathbf{A}, \mathbf{B})
  - By maximizing the auxiliary function

      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid \mathbf{O}, \lambda)\, \log P(\mathbf{O}, S \mid \bar{\lambda})
        = \sum_{S} \frac{P(\mathbf{O}, S \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log P(\mathbf{O}, S \mid \bar{\lambda})

  - where P(\mathbf{O}, S \mid \lambda) and \log P(\mathbf{O}, S \mid \bar{\lambda}) can be expressed as

      P(\mathbf{O}, S \mid \lambda) = \pi_{s_1}\, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(o_t)

      \log P(\mathbf{O}, S \mid \bar{\lambda}) = \log \bar{\pi}_{s_1}
        + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1} s_t}
        + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{a}(\lambda, \bar{\mathbf{A}}) + Q_{b}(\lambda, \bar{\mathbf{B}}), where

    Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) = \sum_{i=1}^{N} \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log \bar{\pi}_i

    Q_{a}(\lambda, \bar{\mathbf{A}}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1}
        \frac{P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log \bar{a}_{ij}

    Q_{b}(\lambda, \bar{\mathbf{B}}) = \sum_{j=1}^{N} \sum_{k} \sum_{t=1,\; o_t = v_k}^{T}
        \frac{P(\mathbf{O}, s_t = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log \bar{b}_j(v_k)

  Each term has the form \sum_j w_j \log y_j (weights w_j, parameters y_j), which is handled on the next two slides.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, \bar{\pi}_i, \bar{a}_{ij} and \bar{b}_j(k)
  - They can be maximized individually
  - All are of the same form:

      F(y_1, y_2, \dots, y_N) = \sum_{j=1}^{N} w_j \log y_j, \qquad
      \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0,

      has its maximum value when \; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier \ell with the constraint \sum_{j=1}^{N} y_j = 1

    F = \sum_{j=1}^{N} w_j \log y_j + \ell \Big( 1 - \sum_{j=1}^{N} y_j \Big)

    \frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} - \ell = 0
      \;\Rightarrow\; w_j = \ell\, y_j \text{ for all } j
      \;\Rightarrow\; \sum_{j=1}^{N} w_j = \ell \sum_{j=1}^{N} y_j = \ell
      \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set \bar{\lambda} = (\bar{\boldsymbol{\pi}}, \bar{\mathbf{A}}, \bar{\mathbf{B}}) can be expressed as

    \bar{\pi}_i = \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} = \gamma_1(i)

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda) / P(\mathbf{O} \mid \lambda)}
                        {\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
                 = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

    \bar{b}_i(v_k) = \frac{\sum_{t=1,\; o_t = v_k}^{T} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
                          {\sum_{t=1}^{T} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
                   = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
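  A minimal sketch of the discrete emission update from the state posteriors and the observed codeword indices; the array names and layouts are illustrative assumptions, not the lecture's code.

    #include <stddef.h>

    /* gamma: T x N (row-major); obs[t] in {0, ..., K-1}; B_new: N x K output. */
    void reestimate_emissions(const double *gamma, const int *obs,
                              size_t N, size_t K, size_t T, double *B_new)
    {
        for (size_t j = 0; j < N; ++j) {
            double denom = 0.0;
            for (size_t k = 0; k < K; ++k)
                B_new[j * K + k] = 0.0;
            for (size_t t = 0; t < T; ++t) {
                denom += gamma[t * N + j];                          /* sum_t gamma_t(j)     */
                B_new[j * K + (size_t)obs[t]] += gamma[t * N + j]; /* frames with o_t = v_k */
            }
            for (size_t k = 0; k < K; ++k)
                if (denom > 0.0)
                    B_new[j * K + k] /= denom;
        }
    }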

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  - The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  - The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  - The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o})
                      = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})
                      = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}}
                        \exp\!\Big(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{\mathsf T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}-\boldsymbol{\mu}_{jk})\Big),
      \qquad \sum_{k=1}^{M} c_{jk} = 1

  [Figure: the mixture distribution for state i, with weights w_{i1}, w_{i2}, w_{i3} over Gaussian components N_1, N_2, N_3]
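  In practice each component density is usually evaluated in the log domain with a diagonal covariance, as discussed earlier for the multivariate Gaussian; the following is a minimal sketch of such an evaluation (illustrative, not an HTK routine).

    #include <math.h>
    #include <stddef.h>

    /* Log of N(o; mu, diag(var)) for an L-dimensional observation; var[d] > 0. */
    double log_gaussian_diag(const double *o, const double *mu,
                             const double *var, size_t L)
    {
        static const double PI = 3.14159265358979323846;
        double logp = -0.5 * (double)L * log(2.0 * PI);
        for (size_t d = 0; d < L; ++d) {
            double diff = o[d] - mu[d];
            logp -= 0.5 * (log(var[d]) + diff * diff / var[d]);
        }
        return logp;
    }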

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(\mathbf{o}_t) with respect to each single mixture component b_{jk}(\mathbf{o}_t):

    P(\mathbf{O}, S \mid \lambda)
      = \pi_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} \Big( \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(\mathbf{o}_t) \Big)
      = \sum_{k_1=1}^{M} \cdots \sum_{k_T=1}^{M} \pi_{s_1}
        \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)

  - where \mathbf{K} = (k_1, k_2, \dots, k_T) is one of the possible mixture component sequences along the state sequence S, so that

    p(\mathbf{O}, S, \mathbf{K} \mid \lambda) = \pi_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t}
        \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t), \qquad
    p(\mathbf{O} \mid \lambda) = \sum_{S} \sum_{\mathbf{K}} p(\mathbf{O}, S, \mathbf{K} \mid \lambda)

  - Note the identity used above:
    \prod_{t=1}^{T} \sum_{k=1}^{M} a_t(k) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_t(k_t)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(\lambda, \bar{\lambda}) = \sum_{S} \sum_{\mathbf{K}} P(S, \mathbf{K} \mid \mathbf{O}, \lambda)\, \log p(\mathbf{O}, S, \mathbf{K} \mid \bar{\lambda})
      = \sum_{S} \sum_{\mathbf{K}} \frac{p(\mathbf{O}, S, \mathbf{K} \mid \lambda)}{p(\mathbf{O} \mid \lambda)}\, \log p(\mathbf{O}, S, \mathbf{K} \mid \bar{\lambda})

    \log p(\mathbf{O}, S, \mathbf{K} \mid \bar{\lambda}) = \log \bar{\pi}_{s_1}
      + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1} s_t}
      + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t}
      + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(\mathbf{o}_t)

    \Rightarrow\; Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{a}(\lambda, \bar{\mathbf{A}})
      + Q_{b}(\lambda, \bar{\mathbf{B}}) + Q_{c}(\lambda, \bar{\mathbf{C}})

    (initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the terms

    Q_{b}(\lambda, \bar{\mathbf{B}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, \log \bar{b}_{jk}(\mathbf{o}_t)

    Q_{c}(\lambda, \bar{\mathbf{C}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, \log \bar{c}_{jk}

  where P(s_t = j, m_t = k \mid \mathbf{O}, \lambda) = \gamma_t(j,k)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let \gamma_t(j,k) = P(s_t = j, m_t = k \mid \mathbf{O}, \lambda). Writing out the Gaussian log-density,

    \log \bar{b}_{jk}(\mathbf{o}_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}|
      - \tfrac{1}{2}(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1}\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})

  Setting the derivative of Q_b with respect to \bar{\boldsymbol{\mu}}_{jk} to zero,

    \frac{\partial Q_b}{\partial \bar{\boldsymbol{\mu}}_{jk}}
      = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk}) = 0
    \quad\Rightarrow\quad
    \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

  (using \frac{\partial}{\partial \mathbf{x}}\, \mathbf{x}^{\mathsf T} \mathbf{C}\, \mathbf{x} = (\mathbf{C} + \mathbf{C}^{\mathsf T})\mathbf{x}, and the fact that \bar{\boldsymbol{\Sigma}}_{jk} is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, setting the derivative of Q_b with respect to \bar{\boldsymbol{\Sigma}}_{jk} to zero (using \frac{\partial}{\partial \mathbf{X}} \log \det \mathbf{X} = (\mathbf{X}^{-1})^{\mathsf T} and \frac{\partial}{\partial \mathbf{X}}\, \mathbf{a}^{\mathsf T} \mathbf{X}\, \mathbf{b} = \mathbf{a}\mathbf{b}^{\mathsf T}, with \bar{\boldsymbol{\Sigma}}_{jk} symmetric here) gives

    \frac{\partial Q_b}{\partial \bar{\boldsymbol{\Sigma}}_{jk}^{-1}}
      = \tfrac{1}{2} \sum_{t=1}^{T} \gamma_t(j,k)\,
        \Big[ \bar{\boldsymbol{\Sigma}}_{jk} - (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T} \Big] = 0

    \Rightarrow\quad
    \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,
        (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
        {\sum_{t=1}^{T} \gamma_t(j,k)}

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, \mathbf{o}_t}
                                       {\sum_{t=1}^{T} P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)}
                                = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

    \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,
        (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{\mathsf T}}
        {\sum_{t=1}^{T} \gamma_t(j,k)}

    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}

SP - Berlin Chen 13

Hidden Markov Model (cont)

bull Two major types of HMMs according to the observationsndash Continuous and infinite observations

bull The observations that all distinct states generate are infinite and continuous that is V=v| vRd

bull In this case the set of observation probability distributions B=bj(v) is defined as bj(v)=fO|S(ot=v|st=j) 1jN bj(v) is a continuous probability density function (pdf)and is often a mixture of Multivariate Gaussian (Normal)Distributions

M

kjkjk

tjk

jk

jkj d

πwb

1

1

21 2

1exp2

12

μvΣμvΣ

v

CovarianceMatrix

Mean Vector

Observation VectorMixtureWeight

SP - Berlin Chen 14

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributionsndash When X=(x1 x2hellip xd) is a d-dimensional random vector the

multivariate Gaussian pdf has the form

ndash If x1 x2hellip xd are independent the covariance matrix is reduced to diagonal covariance

bull Viewed as d independent scalar Gaussian distributions bull Model complexity is significantly reduced

jijijjiiijij

ttt

t

μμxxEμxμxEi-j

EE

ELπ

Nf d

of elevment The

oft determinan the theis and matrix coverance theis

rmean vecto ldimensiona- theis where

21exp

2

1

th

1

212

Σ

ΣΣμμxxμxμxΣΣ

xμμ

μxΣμxΣ

ΣμxΣμxX

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================
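These figures follow the word correction rate and word accuracy rate defined earlier: %Corr = H/N and Acc = (H − I)/N. For the full 506-story set, for instance, 57144/65812 = 86.83% and (57144 − 504)/65812 = 86.06%, matching the numbers reported above.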

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

[Figure: the two-bottle example (bottles A and B, each holding balls of two colours R and G) next to a 3-state ergodic HMM with output distributions such as {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1} and transition probabilities 0.6, 0.7, 0.3, 0.3, 0.2, 0.2, 0.1, ...]

• Observed data O: the "ball sequence"; latent data S: the "bottle sequence"
• Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)
• Given the training observations o1 o2 …… oT, each re-estimation produces a new model λ̄ with p(O|λ̄) > p(O|λ)

SP - Berlin Chen 83

The EM Algorithm (27)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data. In our case here, the state sequence S is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate (A, B) without consideration of the state sequence.
  – Two Major Steps:
    • E: form the expectation with respect to the latent data S, using the current estimate of the parameters λ and conditioned on the observations O
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (37)

• Estimation principles based on observations X = {X1, X2, …, Xn} → x = (x1, x2, …, xn)

  – The Maximum Likelihood (ML) Principle:
    find the model parameter Φ so that the likelihood p(x|Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

      μ_ML = (1/n) Σ_{i=1}^{n} x_i
      Σ_ML = (1/n) Σ_{i=1}^{n} (x_i − μ_ML)(x_i − μ_ML)^t

  – The Maximum A Posteriori (MAP) Principle:
    find the model parameter Φ so that the posterior probability p(Φ|x) is maximum.

  ML and MAP
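A quick one-dimensional illustration (numbers chosen here for this transcript, not from the slides): for i.i.d. samples x = {1, 2, 3}, the ML estimates are μ_ML = (1 + 2 + 3)/3 = 2 and σ²_ML = ((1−2)² + (2−2)² + (3−2)²)/3 = 2/3; note that the ML covariance uses a 1/n factor rather than the unbiased 1/(n−1).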

SP - Berlin Chen 85

The EM Algorithm (47)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,S|λ)
• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (57)

– Assume we have λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S|O,λ), and from it compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?
– Algorithm
  • Log-likelihood expression and expectation taken over S:

    Bayes' rule:  P(S|O,λ) = P(O,S|λ) / P(O|λ)
                  (complete-data likelihood / incomplete-data likelihood)

    ⇒ log P(O|λ) = log P(O,S|λ) − log P(S|O,λ)

    Taking the expectation over S under P(S|O,λ), for an unknown model setting λ̄:

    log P(O|λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄) − Σ_S P(S|O,λ) log P(S|O,λ̄)

SP - Berlin Chen 87

The EM Algorithm (67)

– Algorithm (cont.)
  • We can thus express log P(O|λ̄) as follows:

    log P(O|λ̄) = Q(λ, λ̄) − H(λ, λ̄),   where

    Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)
    H(λ, λ̄) = Σ_S P(S|O,λ) log P(S|O,λ̄)

  • We want log P(O|λ̄) ≥ log P(O|λ), i.e.

    Q(λ, λ̄) − H(λ, λ̄) ≥ Q(λ, λ) − H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (77)

• H(λ, λ̄) − H(λ, λ) has the following property:

    H(λ, λ̄) − H(λ, λ) = Σ_S P(S|O,λ) log [ P(S|O,λ̄) / P(S|O,λ) ]
                      ≤ Σ_S P(S|O,λ) [ P(S|O,λ̄) / P(S|O,λ) − 1 ]        (Jensen's inequality: log x ≤ x − 1)
                      = Σ_S P(S|O,λ̄) − Σ_S P(S|O,λ) = 1 − 1 = 0

  (equivalently, H(λ, λ) − H(λ, λ̄) is a Kullback-Leibler (KL) distance, which is ≥ 0)

– Therefore, for maximizing log P(O|λ̄), we only need to maximize the Q-function (auxiliary function)

    Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)

  the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function

    Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄) = Σ_S [ P(O,S|λ) / P(O|λ) ] log P(O,S|λ̄)

  – where P(O,S|λ) and log P(O,S|λ̄) can be expressed as

    P(O,S|λ) = π_{s1} Π_{t=1}^{T-1} a_{s_t s_{t+1}} Π_{t=1}^{T} b_{s_t}(o_t)

    log P(O,S|λ̄) = log π̄_{s1} + Σ_{t=1}^{T-1} log ā_{s_t s_{t+1}} + Σ_{t=1}^{T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

• Rewrite the auxiliary function as Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄), where

    Q_π(λ, π̄) = Σ_{i=1}^{N} [ P(O, s1 = i | λ) / P(O|λ) ] log π̄_i

    Q_a(λ, ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] log ā_{ij}

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] log b̄_j(v_k)

  Each term has the generic form Σ_i w_i log y_i.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

• The auxiliary function contains three independent terms, π̄_i, ā_{ij} and b̄_j(k)
  – They can be maximized individually
  – All are of the same form:

    F(y_1, y_2, …, y_N) = Σ_{j=1}^{N} w_j log y_j,   where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0,

    has its maximum value when

    y_j = w_j / Σ_{j'=1}^{N} w_{j'}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

• Proof: apply a Lagrange multiplier ℓ

  Suppose that

    F = Σ_{j=1}^{N} w_j log y_j + ℓ ( Σ_{j=1}^{N} y_j − 1 )        (constraint: Σ_{j=1}^{N} y_j = 1)

  By applying the Lagrange multiplier:

    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = −ℓ y_j,  for all j
    Σ_{j=1}^{N} w_j = −ℓ Σ_{j=1}^{N} y_j = −ℓ
    ⇒  y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
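For instance (weights invented here for illustration), with N = 3 and w = (2, 1, 1) the maximizing point is y = (2/4, 1/4, 1/4) = (0.5, 0.25, 0.25). This is exactly why the re-estimation formulas that follow all take the form of an expected count divided by a total expected count.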

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

• The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as

    π̄_i = P(O, s1 = i | λ) / P(O|λ) = γ_1(i)

    ā_{ij} = Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / Σ_{t=1}^{T-1} P(O, s_t = i | λ)
           = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    b̄_i(k) = Σ_{t: o_t = v_k} P(O, s_t = i | λ) / Σ_{t=1}^{T} P(O, s_t = i | λ)
            = Σ_{t: o_t = v_k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)
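The formulas above translate almost directly into code. Below is a small illustrative C sketch (my own naming, not from the slides) of one re-estimation step for a single training utterance, assuming the forward variables alpha[t][i] and backward variables beta[t][i] have already been computed by the forward/backward procedures described earlier:

#define N 3     /* states         */
#define M 2     /* output symbols */
#define T 10    /* frames         */

void reestimate(const double alpha[T][N], const double beta[T][N],
                const double a[N][N], const double b[N][M], const int obs[T],
                double pi_new[N], double a_new[N][N], double b_new[N][M])
{
    double PO = 0.0, gamma[T][N];
    int t, i, j, k;

    for (i = 0; i < N; i++) PO += alpha[T-1][i];          /* P(O|lambda) */

    for (t = 0; t < T; t++)                               /* gamma_t(i)  */
        for (i = 0; i < N; i++)
            gamma[t][i] = alpha[t][i] * beta[t][i] / PO;

    for (i = 0; i < N; i++) {
        double occ_trans = 0.0, occ_all = 0.0;

        pi_new[i] = gamma[0][i];                          /* expected count at t = 1 */
        for (t = 0; t < T-1; t++) occ_trans += gamma[t][i];
        for (t = 0; t < T;   t++) occ_all   += gamma[t][i];

        for (j = 0; j < N; j++) {                         /* a_new[i][j] */
            double xi_sum = 0.0;
            for (t = 0; t < T-1; t++)                     /* xi_t(i,j)   */
                xi_sum += alpha[t][i] * a[i][j] * b[j][obs[t+1]] * beta[t+1][j] / PO;
            a_new[i][j] = xi_sum / occ_trans;
        }
        for (k = 0; k < M; k++) {                         /* b_new[i][k] */
            double num = 0.0;
            for (t = 0; t < T; t++)
                if (obs[t] == k) num += gamma[t][i];
            b_new[i][k] = num / occ_all;
        }
    }
}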

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

    b_j(o) = Σ_{k=1}^{M} c_{jk} b_{jk}(o) = Σ_{k=1}^{M} c_{jk} N(o; μ_{jk}, Σ_{jk}),   with Σ_{k=1}^{M} c_{jk} = 1

    N(o; μ_{jk}, Σ_{jk}) = (2π)^{-L/2} |Σ_{jk}|^{-1/2} exp( −½ (o − μ_{jk})^T Σ_{jk}^{-1} (o − μ_{jk}) )

  [Figure: the distribution for state i is a mixture of three Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3.]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

• Express b_j(o_t) with respect to each single mixture component b_{jk}(o_t):

    p(O|λ) = Σ_S Π_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)                         (a_{s_0 s_1} denotes π_{s_1})
           = Σ_S Π_{t=1}^{T} a_{s_{t-1} s_t} [ Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t) ]
           = Σ_S Σ_K Π_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)
           = Σ_S Σ_K p(O, S, K | λ)

  where K = (k_1, k_2, …, k_T) is one of the possible mixture-component sequences along the state sequence S.

  Note (interchange of product and sum):

    Π_{t=1}^{T} Σ_{k=1}^{M} a_{tk}
      = (a_{11} + a_{12} + … + a_{1M})(a_{21} + a_{22} + … + a_{2M}) … (a_{T1} + a_{T2} + … + a_{TM})
      = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} … Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ̄) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̄)
             = Σ_S Σ_K [ p(O, S, K | λ) / p(O|λ) ] log p(O, S, K | λ̄)

  where

    log p(O, S, K | λ̄) = log π̄_{s1} + Σ_{t=1}^{T-1} log ā_{s_t s_{t+1}} + Σ_{t=1}^{T} log b̄_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c̄_{s_t k_t}

  so that

    Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
    (initial probabilities, state transition probabilities, Gaussian density functions, mixture-component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

• The only difference we have when compared with discrete HMM training lies in the terms for the Gaussian densities and the mixture weights:

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b̄_{jk}(o_t)

    Q_c(λ, c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log c̄_{jk}

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

  Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ).  With

    b̄_{jk}(o_t) = N(o_t; μ̄_{jk}, Σ̄_{jk}) = (2π)^{-L/2} |Σ̄_{jk}|^{-1/2} exp( −½ (o_t − μ̄_{jk})^T Σ̄_{jk}^{-1} (o_t − μ̄_{jk}) )

    log b̄_{jk}(o_t) = −(L/2) log(2π) − ½ log|Σ̄_{jk}| − ½ (o_t − μ̄_{jk})^T Σ̄_{jk}^{-1} (o_t − μ̄_{jk})

  setting the derivative of Q_b(λ, b̄) with respect to μ̄_{jk} to zero,

    ∂Q_b/∂μ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) Σ̄_{jk}^{-1} (o_t − μ̄_{jk}) = 0

  (using d(x^T C x)/dx = (C + C^T) x = 2 C x, since Σ̄_{jk}^{-1} is symmetric here), which gives

    μ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

  Similarly, setting the derivative of Q_b(λ, b̄) with respect to Σ̄_{jk}^{-1} to zero,

    ∂Q_b/∂Σ̄_{jk}^{-1} = Σ_{t=1}^{T} γ_t(j, k) [ ½ Σ̄_{jk} − ½ (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T ] = 0

  (using d log det X / dX = (X^{-1})^T = X^{-1} and d(a^T X b)/dX = a b^T, with Σ̄_{jk} symmetric here), which gives

    Σ̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    μ̄_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
           = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

    Σ̄_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
           = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

    c̄_{jk} = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
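As with the discrete case, these updates are just weighted averages. A small illustrative C sketch (my own naming, not from the slides) that accumulates the mixture-weight and mean statistics for one utterance, given the state posteriors gamma[t][j] and the per-component Gaussian densities, might look as follows (the covariance update would accumulate the outer-product terms inside the same loop):

#define NS 3     /* states              */
#define MK 2     /* mixtures per state  */
#define DIM 4    /* feature dimension   */
#define TT 100   /* frames              */

void mixture_update(const double gamma[TT][NS], const double gdens[TT][NS][MK],
                    const double c[NS][MK], const double o[TT][DIM],
                    double c_new[NS][MK], double mu_new[NS][MK][DIM])
{
    int t, j, k, d;

    for (j = 0; j < NS; j++) {
        double occ_j = 0.0;                 /* sum over t and k of gamma_t(j,k) */
        double occ_jk[MK] = { 0.0 };
        double mu_num[MK][DIM] = { { 0.0 } };

        for (t = 0; t < TT; t++) {
            double mix_total = 0.0;
            for (k = 0; k < MK; k++) mix_total += c[j][k] * gdens[t][j][k];
            for (k = 0; k < MK; k++) {
                /* gamma_t(j,k): posterior of being in state j with component k */
                double g_jk = gamma[t][j] * c[j][k] * gdens[t][j][k] / mix_total;
                occ_jk[k] += g_jk;
                occ_j     += g_jk;
                for (d = 0; d < DIM; d++) mu_num[k][d] += g_jk * o[t][d];
            }
        }
        for (k = 0; k < MK; k++) {
            c_new[j][k] = occ_jk[k] / occ_j;                       /* mixture weight */
            for (d = 0; d < DIM; d++)
                mu_new[j][k][d] = mu_num[k][d] / occ_jk[k];        /* mixture mean   */
        }
    }
}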

Page 15: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 15

Hidden Markov Model (cont)

bull Multivariate Gaussian Distributions

SP - Berlin Chen 16

Hidden Markov Model (cont)

bull Covariance matrix of the correlated feature vectors (Mel-frequency filter bank outputs)

bull Covariance matrix of the partially de-correlated feature vectors (MFCC without C0)ndash MFCC Mel-frequency cepstral

coefficients

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

    α_t(j) = P(o_1,\ldots,o_t, s_t=j | λ)
           = P(o_1,\ldots,o_{t-1}, s_t=j | λ)\, P(o_t | s_t=j, λ)                                  (output-independent assumption)
           = \Big[\sum_{i=1}^{N} P(o_1,\ldots,o_{t-1}, s_{t-1}=i, s_t=j | λ)\Big]\, b_j(o_t)         (P(A) = \sum_{all\,B} P(A,B))
           = \Big[\sum_{i=1}^{N} P(o_1,\ldots,o_{t-1}, s_{t-1}=i | λ)\, P(s_t=j | s_{t-1}=i, λ)\Big]\, b_j(o_t)   (first-order Markov assumption)
           = \Big[\sum_{i=1}^{N} α_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

• α_3(3) = P(o_1,o_2,o_3, s_3=3 | λ) = [α_2(1)a_{13} + α_2(2)a_{23} + α_2(3)a_{33}]\, b_3(o_3)

  – [Figure: state-time trellis illustrating the forward recursion; a node s_i denotes that b_j(o_t) has been computed, and an arc a_ij denotes that a_ij has been computed]

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average
  – [Figure: three-state HMM with transition probabilities such as 0.6, 0.5, 0.4, 0.7, 0.1, 0.3 on the arcs]
  – Example forward step: (0.6×0.35 + 0.5×0.02 + 0.4×0.009)×0.7 = 0.1792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

• Backward variable:  β_t(i) = P(o_{t+1},o_{t+2},\ldots,o_T | s_t=i, λ)

• Algorithm
  1. Initialization:  β_T(i) = 1,  1 ≤ i ≤ N
  2. Induction:       β_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, β_{t+1}(j),  t = T-1,\ldots,1,  1 ≤ i ≤ N
  3. Termination:     P(O|λ) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, β_1(j)

  – Complexity: O(N^2 T)  (about 2N^2(T-1) MUL and N(N-1)(T-1) ADD)

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

• Why?

    α_t(i)\, β_t(i) = P(o_1,\ldots,o_t, s_t=i | λ)\, P(o_{t+1},\ldots,o_T | s_t=i, λ)
                    = P(o_1,\ldots,o_T, s_t=i | λ)
                    = P(O, s_t=i | λ)

• Therefore

    P(O|λ) = \sum_{i=1}^{N} P(O, s_t=i | λ) = \sum_{i=1}^{N} α_t(i)\, β_t(i)   (for any t)

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

• β_2(3) = P(o_3,o_4,\ldots,o_T | s_2=3, λ) = a_{31} b_1(o_3) β_3(1) + a_{32} b_2(o_3) β_3(2) + a_{33} b_3(o_3) β_3(3)

  – [Figure: state-time trellis illustrating the backward recursion over states s1, s2, s3 and times 1,…,T]

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

  – [Figure: dynamic Bayesian network with the hidden state chain S1 → S2 → S3 → … → ST and the emissions O1, O2, O3, …, OT]

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1,s2,……,sT)?

• The first optimal criterion: choose the states s_t that are individually most likely at each time t
  – Define the a posteriori probability variable (state occupation probability/count — a soft alignment of an HMM state to the observation/feature at time t)

    γ_t(i) = P(s_t=i | O, λ) = \frac{P(s_t=i, O | λ)}{P(O|λ)} = \frac{α_t(i)\,β_t(i)}{\sum_{m=1}^{N} α_t(m)\,β_t(m)}

  – Solution: s_t^* = \arg\max_{1≤i≤N} [γ_t(i)],  1 ≤ t ≤ T
    • Problem: maximizing the probability at each time t individually, S*=(s1*,s2*,…,sT*) may not be a valid sequence (e.g. a_{s_t^* s_{t+1}^*} = 0)

SP - Berlin Chen 36
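As a companion sketch (again not from the slides), the backward recursion and the state occupation probability γ_t(i) can be computed with the same kind of loops; the arrays a, b and the sizes N, T are the same illustrative assumptions used in the forward sketch above, and alpha[][] is assumed to have been filled by it.

    /* Backward recursion and state occupation probabilities gamma_t(i)
     * (illustrative sketch built on the forward() example above). */
    void backward_and_gamma(const double a[N][N], const double b[N][T],
                            const double alpha[T][N],
                            double beta[T][N], double gamma[T][N])
    {
        /* Initialization: beta_T(i) = 1 */
        for (int i = 0; i < N; i++)
            beta[T - 1][i] = 1.0;

        /* Induction: beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j) */
        for (int t = T - 2; t >= 0; t--)
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++)
                    sum += a[i][j] * b[j][t + 1] * beta[t + 1][j];
                beta[t][i] = sum;
            }

        /* gamma_t(i) = alpha_t(i) beta_t(i) / sum_m alpha_t(m) beta_t(m) */
        for (int t = 0; t < T; t++) {
            double norm = 0.0;
            for (int m = 0; m < N; m++)
                norm += alpha[t][m] * beta[t][m];
            for (int i = 0; i < N; i++)
                gamma[t][i] = alpha[t][i] * beta[t][i] / norm;
        }
    }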

Basic Problem 2 of HMM (cont)

• P(s_3=3, O | λ) = α_3(3)\, β_3(3)

  – [Figure: state-time trellis highlighting α_3(3) (the paths entering state 3 at time 3) and β_3(3) (the paths leaving state 3 at time 3); an individually-most-likely sequence can become invalid when, e.g., a_{23}=0]

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

• The second optimal criterion: the Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
  – Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
    • Find a single optimal state sequence S=(s1,s2,……,sT)
    – How to find the second, third, etc. optimal state sequences? (difficult!)
  – The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
    • State-time trellis diagram

  1. R. Bellman, "On the Theory of Dynamic Programming," Proceedings of the National Academy of Sciences, 1952
  2. A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, 13(2), 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm
  – Find the best state sequence S^*=(s_1^*,s_2^*,\ldots,s_T^*) for a given observation sequence O=(o_1,o_2,\ldots,o_T)
  – Define a new variable: the best score along a single path, at time t, which accounts for the first t observations and ends in state i

    δ_t(i) = \max_{s_1,s_2,\ldots,s_{t-1}} P(s_1,s_2,\ldots,s_{t-1}, s_t=i, o_1,o_2,\ldots,o_t | λ)

  – By induction:        δ_{t+1}(j) = \big[\max_{1≤i≤N} δ_t(i)\,a_{ij}\big]\, b_j(o_{t+1})
  – For backtracking:    ψ_{t+1}(j) = \arg\max_{1≤i≤N} δ_t(i)\,a_{ij}
  – We can backtrace the optimal sequence from s_T^* = \arg\max_{1≤i≤N} δ_T(i)

  – Complexity: O(N^2 T)

SP - Berlin Chen 39
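A minimal Viterbi sketch (not from the original slides) is shown below in C, using the same illustrative arrays pi[N], a[N][N] and b[j][t] = b_j(o_t) as the forward example; in practice the log-domain form of the next slides is preferred to avoid underflow.

    /* Viterbi decoding sketch: returns the best-path score and fills
     * path[t] with the optimal state indices (illustrative assumptions). */
    double viterbi(const double pi[N], const double a[N][N], const double b[N][T],
                   int path[T])
    {
        double delta[T][N];
        int    psi[T][N];

        /* Initialization: delta_1(i) = pi_i b_i(o_1) */
        for (int i = 0; i < N; i++) {
            delta[0][i] = pi[i] * b[i][0];
            psi[0][i] = 0;
        }
        /* Induction: delta_{t+1}(j) = max_i [delta_t(i) a_ij] b_j(o_{t+1}) */
        for (int t = 1; t < T; t++)
            for (int j = 0; j < N; j++) {
                int    best_i = 0;
                double best   = delta[t - 1][0] * a[0][j];
                for (int i = 1; i < N; i++) {
                    double s = delta[t - 1][i] * a[i][j];
                    if (s > best) { best = s; best_i = i; }
                }
                delta[t][j] = best * b[j][t];
                psi[t][j]   = best_i;          /* remember the best predecessor */
            }
        /* Termination and backtrace */
        int best_last = 0;
        for (int i = 1; i < N; i++)
            if (delta[T - 1][i] > delta[T - 1][best_last]) best_last = i;
        path[T - 1] = best_last;
        for (int t = T - 2; t >= 0; t--)
            path[t] = psi[t + 1][path[t + 1]];
        return delta[T - 1][best_last];
    }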

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

  – [Figure: state-time trellis for the Viterbi recursion; δ_3(3) is the score of the single best path ending in state 3 at time 3]

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• A three-state Hidden Markov Model for the Dow Jones Industrial average
  – [Figure: the same three-state HMM, with transition probabilities such as 0.6, 0.5, 0.4, 0.7, 0.1, 0.3]
  – Example Viterbi step: (0.6×0.35)×0.7 = 0.147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm in the logarithmic form
  – Find the best state sequence S^*=(s_1^*,s_2^*,\ldots,s_T^*) for a given observation sequence O=(o_1,o_2,\ldots,o_T)
  – Define a new variable: the best (log-domain) score along a single path, at time t, which accounts for the first t observations and ends in state i

    δ_t(i) = \max_{s_1,s_2,\ldots,s_{t-1}} \log P(s_1,s_2,\ldots,s_{t-1}, s_t=i, o_1,o_2,\ldots,o_t | λ)

  – By induction:        δ_{t+1}(j) = \max_{1≤i≤N} \big[δ_t(i) + \log a_{ij}\big] + \log b_j(o_{t+1})
  – For backtracking:    ψ_{t+1}(j) = \arg\max_{1≤i≤N} \big[δ_t(i) + \log a_{ij}\big]
  – We can backtrace the optimal sequence from s_T^* = \arg\max_{1≤i≤N} δ_T(i)

SP - Berlin Chen 42

Homework 1

• A three-state Hidden Markov Model for the Dow Jones Industrial average
  – Find the probability P(up, up, unchanged, down, unchanged, down, up | λ)
  – Find the optimal state sequence of the model which generates the observation sequence (up, up, unchanged, down, unchanged, down, up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

• In the Forward-Backward algorithm, operations are usually implemented in the logarithmic domain
• Assume that we want to add P_1 and P_2, given \log_b P_1 and \log_b P_2:

    if P_1 ≥ P_2:  \log_b(P_1+P_2) = \log_b P_1 + \log_b\big(1 + b^{\,\log_b P_2 - \log_b P_1}\big)
    else:          \log_b(P_1+P_2) = \log_b P_2 + \log_b\big(1 + b^{\,\log_b P_1 - \log_b P_2}\big)

  – The values of \log_b(1 + b^{x}) can be saved in a table to speed up the operations

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

• An example code:

    #include <math.h>

    #define LZERO  (-1.0E10)        /* ~log(0) */
    #define LSMALL (-0.5E10)        /* log values < LSMALL are set to LZERO */
    #define minLogExp -log(-LZERO)  /* ~= -23 */

    double LogAdd(double x, double y)
    {
        double temp, diff, z;
        if (x < y) {                /* make sure x is the larger one */
            temp = x; x = y; y = temp;
        }
        diff = y - x;               /* notice that diff <= 0 */
        if (diff < minLogExp)       /* if y' is far smaller than x' */
            return (x < LSMALL) ? LZERO : x;
        else {
            z = exp(diff);
            return x + log(1.0 + z);
        }
    }

SP - Berlin Chen 45
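For instance, the forward induction of the earlier slide can be carried out entirely in the log domain with this routine. The fragment below is an illustrative sketch only; it relies on LogAdd and LZERO as defined above and on the assumed state count N from the earlier sketches, with log_alpha_prev, log_a and log_b_jt holding the logarithms of the corresponding quantities.

    /* One log-domain forward induction step (illustrative):
     * log alpha_t(j) = LogAdd over i of [log alpha_{t-1}(i) + log a_ij] + log b_j(o_t) */
    double log_forward_step(const double log_alpha_prev[N], const double log_a[N][N],
                            double log_b_jt, int j)
    {
        double sum = LZERO;                 /* running log-sum, starts at log(0) */
        for (int i = 0; i < N; i++)
            sum = LogAdd(sum, log_alpha_prev[i] + log_a[i][j]);
        return sum + log_b_jt;
    }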

Basic Problem 3 of HMM: Intuitive View

• How to adjust (re-estimate) the model parameters λ=(A,B,π) to maximize P(O_1,…,O_L|λ) or log P(O_1,…,O_L|λ)?
  – Belonging to a typical problem of "inferential statistics"
  – The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
  – The data is incomplete because of the hidden state sequences
  – Well solved by the Baum-Welch (known as forward-backward) algorithm and the EM (Expectation-Maximization) algorithm
    • Iterative update and improvement
    • Based on the Maximum Likelihood (ML) criterion

  – Suppose we have L training utterances for the HMM, and S is a possible state sequence of the HMM:

    \log P(O_1,\ldots,O_L | λ) = \sum_{l=1}^{L} \log P(O_l | λ) = \sum_{l=1}^{L} \log \sum_{all\,S} P(O_l, S | λ)

    The "log of sum" form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)

• Hard Assignment
  – Given the data follow a multinomial distribution

    State S1:  P(B|S1) = 2/4 = 0.5,  P(W|S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)

• Soft Assignment
  – Given the data follow a multinomial distribution
  – Maximize the likelihood of the data given the alignment
  – Each sample is assigned to states S1 and S2 with posterior weights P(s_t=S1|O) and P(s_t=S2|O), which sum to 1, e.g. (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)

    P(B|S1) = (0.7+0.9)/(0.7+0.4+0.9+0.5) = 1.6/2.5 = 0.64
    P(W|S1) = (0.4+0.5)/(0.7+0.4+0.9+0.5) = 0.9/2.5 = 0.36
    P(B|S2) = (0.3+0.1)/(0.3+0.6+0.1+0.5) = 0.4/1.5 ≈ 0.27
    P(W|S2) = (0.6+0.5)/(0.3+0.6+0.1+0.5) = 1.1/1.5 ≈ 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMM: Intuitive View (cont)

• Relationship between the forward and backward variables

    α_t(i) = P(o_1,\ldots,o_t, s_t=i | λ) = \Big[\sum_{j=1}^{N} α_{t-1}(j)\,a_{ji}\Big]\, b_i(o_t)

    β_t(i) = P(o_{t+1},\ldots,o_T | s_t=i, λ) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, β_{t+1}(j)

    α_t(i)\,β_t(i) = P(O, s_t=i | λ),    \sum_{i=1}^{N} α_t(i)\,β_t(i) = P(O|λ)

SP - Berlin Chen 49

Basic Problem 3 of HMM: Intuitive View (cont)

• Define a new variable ξ_t(i,j)
  – Probability of being at state i at time t and at state j at time t+1 (using P(A|B)=P(A,B)/P(B)):

    ξ_t(i,j) = P(s_t=i, s_{t+1}=j | O, λ)
             = \frac{α_t(i)\,a_{ij}\,b_j(o_{t+1})\,β_{t+1}(j)}{P(O|λ)}
             = \frac{α_t(i)\,a_{ij}\,b_j(o_{t+1})\,β_{t+1}(j)}{\sum_{m=1}^{N}\sum_{n=1}^{N} α_t(m)\,a_{mn}\,b_n(o_{t+1})\,β_{t+1}(n)}

• Recall the a posteriori (state occupation) probability variable

    γ_t(i) = P(s_t=i | O, λ) = \frac{α_t(i)\,β_t(i)}{\sum_{m=1}^{N} α_t(m)\,β_t(m)}

  – Note that γ_t(i) can also be represented as

    γ_t(i) = \sum_{j=1}^{N} ξ_t(i,j)    (for t = 1,\ldots,T-1)

SP - Berlin Chen 50

Basic Problem 3 of HMM: Intuitive View (cont)

• P(s_3=3, s_4=1, O | λ) = α_3(3)\,a_{31}\,b_1(o_4)\,β_4(1)

  – [Figure: state-time trellis highlighting the transition from state 3 at time 3 to state 1 at time 4]

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont)

•  ξ_t(i,j) = P(s_t=i, s_{t+1}=j | O, λ),   γ_t(i) = P(s_t=i | O, λ)

•  \sum_{t=1}^{T-1} ξ_t(i,j) = expected number of transitions from state i to state j in O
•  \sum_{t=1}^{T-1} γ_t(i) = \sum_{t=1}^{T-1}\sum_{j=1}^{N} ξ_t(i,j) = expected number of transitions from state i in O

• A set of reasonable re-estimation formulae for {π, A} is

    \bar{π}_i = expected frequency (number of times) in state i at time t=1 = γ_1(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}
                = \frac{\sum_{t=1}^{T-1} ξ_t(i,j)}{\sum_{t=1}^{T-1} γ_t(i)}

  (Formulae for a single training utterance)

SP - Berlin Chen 52
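The re-estimation of the transition probabilities can be sketched directly from the formulae above. The following illustrative C fragment (not from the slides) accumulates ξ_t(i,j) and γ_t(i) from the forward/backward tables of a single utterance, reusing the assumed arrays and sizes of the earlier sketches.

    /* Sketch of transition re-estimation for one utterance:
     * a_new[i][j] = sum_t xi_t(i,j) / sum_t gamma_t(i). */
    void reestimate_transitions(const double alpha[T][N], const double beta[T][N],
                                const double a[N][N], const double b[N][T],
                                double a_new[N][N])
    {
        double num[N][N] = {{0.0}}, den[N] = {0.0};

        for (int t = 0; t < T - 1; t++) {
            /* normalizer: sum over all (i,j) of alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) */
            double norm = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    norm += alpha[t][i] * a[i][j] * b[j][t + 1] * beta[t + 1][j];

            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double xi = alpha[t][i] * a[i][j] * b[j][t + 1] * beta[t + 1][j] / norm;
                    num[i][j] += xi;     /* expected transitions i -> j   */
                    den[i]    += xi;     /* expected transitions out of i */
                }
        }
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a_new[i][j] = num[i][j] / den[i];
    }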

Basic Problem 3 of HMM: Intuitive View (cont)

• A set of reasonable re-estimation formulae for B is
  – For discrete and finite observations, b_j(v_k) = P(o_t=v_k | s_t=j):

    \bar{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}
                  = \frac{\sum_{t=1,\,o_t=v_k}^{T} γ_t(j)}{\sum_{t=1}^{T} γ_t(j)}

  – For continuous and infinite observations, b_j(v) = f_{O|S}(o_t=v | s_t=j), modeled as a mixture of multivariate Gaussian distributions:

    b_j(v) = \sum_{k=1}^{M} c_{jk}\, N(v; μ_{jk}, Σ_{jk})
           = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2π)^{L/2}|Σ_{jk}|^{1/2}} \exp\!\Big(-\frac{1}{2}(v-μ_{jk})^{T} Σ_{jk}^{-1}(v-μ_{jk})\Big)

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont)

  – For continuous and infinite observations (cont.)
    • Define a new variable γ_t(j,k)
      – γ_t(j,k) is the probability of being in state j at time t with the k-th mixture component accounting for o_t:

    γ_t(j,k) = P(s_t=j, m_t=k | O, λ)
             = P(s_t=j | O, λ)\, P(m_t=k | s_t=j, O, λ)
             = γ_t(j)\, \frac{P(m_t=k | s_t=j, λ)\, p(o_t | m_t=k, s_t=j, λ)}{p(o_t | s_t=j, λ)}     (observation-independence assumption applied)
             = γ_t(j)\, \frac{c_{jk}\, N(o_t; μ_{jk}, Σ_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(o_t; μ_{jm}, Σ_{jm})}

      – Note: \sum_{k=1}^{M} γ_t(j,k) = γ_t(j),  and  γ_t(j) = \frac{α_t(j)\,β_t(j)}{\sum_{m} α_t(m)\,β_t(m)}

  – [Figure: the output distribution for state 1 is a Gaussian mixture with weights c_{11}, c_{12}, c_{13} and component densities N_1, N_2, N_3]

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont)

  – For continuous and infinite observations (cont.)

    \bar{c}_{jk} = \frac{\text{expected number of times in state } j \text{ and mixture } k}{\text{expected number of times in state } j}
                = \frac{\sum_{t=1}^{T} γ_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M} γ_t(j,m)}

    \bar{μ}_{jk} = weighted average (mean) of the observations at state j and mixture k
                = \frac{\sum_{t=1}^{T} γ_t(j,k)\, o_t}{\sum_{t=1}^{T} γ_t(j,k)}

    \bar{Σ}_{jk} = weighted covariance of the observations at state j and mixture k
                = \frac{\sum_{t=1}^{T} γ_t(j,k)\, (o_t-\bar{μ}_{jk})(o_t-\bar{μ}_{jk})^{T}}{\sum_{t=1}^{T} γ_t(j,k)}

  (Formulae for a single training utterance)

SP - Berlin Chen 55
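These weighted-sum updates are straightforward accumulations. The sketch below (not from the slides) shows the mixture-weight and mean updates for a single state j in C; gamma_jk[t][k] is assumed to hold γ_t(j,k), obs[t][d] the observation vectors, and the sizes M and D are illustrative assumptions.

    /* Sketch of the c_jk and mu_jk updates for one state j (illustrative). */
    #define M 4    /* number of mixture components (assumed) */
    #define D 39   /* feature dimension (assumed)            */

    void reestimate_state_mixtures(const double gamma_jk[T][M], const double obs[T][D],
                                   double c_new[M], double mu_new[M][D])
    {
        double occ[M] = {0.0}, occ_all = 0.0;

        for (int k = 0; k < M; k++)
            for (int d = 0; d < D; d++)
                mu_new[k][d] = 0.0;

        for (int t = 0; t < T; t++)
            for (int k = 0; k < M; k++) {
                occ[k]  += gamma_jk[t][k];        /* soft count of (state j, mixture k) */
                occ_all += gamma_jk[t][k];        /* soft count of state j              */
                for (int d = 0; d < D; d++)
                    mu_new[k][d] += gamma_jk[t][k] * obs[t][d];
            }

        for (int k = 0; k < M; k++) {
            c_new[k] = occ[k] / occ_all;          /* c_jk = sum_t gamma_t(j,k) / sum_{t,m} gamma_t(j,m) */
            for (int d = 0; d < D; d++)
                mu_new[k][d] /= occ[k];           /* mu_jk = weighted mean of the observations */
        }
    }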

Basic Problem 3 of HMMIntuitive View (cont)

• Multiple Training Utterances
  – [Figure: several training utterances of the same unit (e.g. "台師大"), each aligned to a three-state HMM (s1, s2, s3) by the forward-backward (F-B) procedure]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont)

  – For continuous and infinite observations (cont.)
    Formulae for multiple (L) training utterances:

    \bar{π}_i = expected frequency (number of times) in state i at time t=1 = \frac{1}{L}\sum_{l=1}^{L} γ_1^{l}(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}
                = \frac{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1} ξ_t^{l}(i,j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1} γ_t^{l}(i)}

    \bar{c}_{jk} = \frac{\text{expected number of times in state } j \text{ and mixture } k}{\text{expected number of times in state } j}
                = \frac{\sum_{l=1}^{L}\sum_{t=1}^{T_l} γ_t^{l}(j,k)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\sum_{m=1}^{M} γ_t^{l}(j,m)}

    \bar{μ}_{jk} = weighted average (mean) of the observations at state j and mixture k
                = \frac{\sum_{l=1}^{L}\sum_{t=1}^{T_l} γ_t^{l}(j,k)\, o_t^{l}}{\sum_{l=1}^{L}\sum_{t=1}^{T_l} γ_t^{l}(j,k)}

    \bar{Σ}_{jk} = weighted covariance of the observations at state j and mixture k
                = \frac{\sum_{l=1}^{L}\sum_{t=1}^{T_l} γ_t^{l}(j,k)\, (o_t^{l}-\bar{μ}_{jk})(o_t^{l}-\bar{μ}_{jk})^{T}}{\sum_{l=1}^{L}\sum_{t=1}^{T_l} γ_t^{l}(j,k)}

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont)

  – For discrete and finite observations (cont.)
    Formulae for multiple (L) training utterances:

    \bar{π}_i = expected frequency (number of times) in state i at time t=1 = \frac{1}{L}\sum_{l=1}^{L} γ_1^{l}(i)

    \bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}
                = \frac{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1} ξ_t^{l}(i,j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1} γ_t^{l}(i)}

    \bar{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}
                  = \frac{\sum_{l=1}^{L}\sum_{t=1,\,o_t^{l}=v_k}^{T_l} γ_t^{l}(j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l} γ_t^{l}(j)}

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous or tied-mixture HMM

    b_j(o) = \sum_{k=1}^{M} b_j(k)\, f(o | v_k) = \sum_{k=1}^{M} b_j(k)\, N(o; μ_k, Σ_k)

    where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent) and f(o|v_k) is the k-th mixture density function or k-th codeword (shared across HMMs; M is very large)

  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete model-dependent weight coefficients and continuous model-independent codebook probability density functions
  – Because M is large, we can simply use the L most significant values of f(o|v_k)
    • Experience showed that L of 1~3 of M is adequate
  – Partial tying of f(o|v_k) for different phonetic classes

SP - Berlin Chen 59
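A small sketch of the tied-mixture output probability is given below (not from the slides): the codebook densities are shared, and only the few most significant codewords are kept in the sum. The codebook size M_CODE and the pruning depth TOP_L are illustrative assumptions.

    /* Semicontinuous state output probability with top-L codeword pruning (sketch). */
    #define M_CODE 256  /* codebook size (assumed)                     */
    #define TOP_L  4    /* number of codewords actually used (assumed) */

    /* f[k]  : precomputed codebook densities f(o_t|v_k) for the current frame
     * bj[k] : discrete, model-dependent mixture weights b_j(k) of state j    */
    double semicontinuous_output(const double f[M_CODE], const double bj[M_CODE])
    {
        int best[TOP_L] = {0};
        /* pick the TOP_L codewords with the largest densities (simple selection) */
        for (int n = 0; n < TOP_L; n++) {
            int arg = -1;
            for (int k = 0; k < M_CODE; k++) {
                int used = 0;
                for (int m = 0; m < n; m++) if (best[m] == k) used = 1;
                if (!used && (arg < 0 || f[k] > f[arg])) arg = k;
            }
            best[n] = arg;
        }
        /* b_j(o) ~= sum over the selected codewords of b_j(k) f(o|v_k) */
        double p = 0.0;
        for (int n = 0; n < TOP_L; n++)
            p += bj[best[n]] * f[best[n]];
        return p;
    }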

Semicontinuous HMMs (cont)

  – [Figure: two HMMs (states s1, s2, s3 each) whose state-dependent mixture weights b_j(1), …, b_j(k), …, b_j(M) all point into a single shared codebook of Gaussian kernels N(μ_1,Σ_1), N(μ_2,Σ_2), …, N(μ_k,Σ_k), …, N(μ_M,Σ_M)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  – It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61
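A left-to-right topology simply constrains the transition matrix so that a state can only stay in itself or move forward. The sketch below is illustrative only (not from the slides); the state count and the 0.5/0.5 split between self-loop and forward transition are assumptions used purely for initialization.

    /* Sketch of a plain left-to-right topology initialization. */
    #define N_STATES 3   /* e.g. a 3-state phone model (assumed) */

    void init_left_to_right(double a[N_STATES][N_STATES], double pi[N_STATES])
    {
        for (int i = 0; i < N_STATES; i++) {
            for (int j = 0; j < N_STATES; j++)
                a[i][j] = 0.0;               /* no skips or backward transitions   */
            a[i][i] = 0.5;                   /* self-loop                           */
            if (i + 1 < N_STATES)
                a[i][i + 1] = 0.5;           /* move to the next state              */
            else
                a[i][i] = 1.0;               /* last state loops until the model exits */
            pi[i] = (i == 0) ? 1.0 : 0.0;    /* always start in the first state     */
        }
    }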

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states, based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2:
    • For discrete density HMM (using an M-codeword codebook):
      b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For continuous density HMM (M Gaussian mixtures per state):
      – cluster the observation vectors within each state into a set of M clusters
      – w_{jm} = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
      – μ_{jm} = sample mean of the vectors classified in cluster m of state j
      – Σ_{jm} = sample covariance matrix of the vectors classified in cluster m of state j
  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop — the initial model is generated

SP - Berlin Chen 62

Initialization of HMM (cont)

  – [Flowchart: Training Data + Initial Model → State Sequence Segmentation → Estimate parameters of Observation via Segmental K-means → Model Reestimation → Model Convergence? (NO: loop back; YES: output Model Parameters)]

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for discrete HMM
  – 3 states and 2 codewords (v1, v2); 10 observation frames O1…O10 segmented into states s1, s2, s3
    • b1(v1)=3/4, b1(v2)=1/4
    • b2(v1)=1/3, b2(v2)=2/3
    • b3(v1)=2/3, b3(v2)=1/3
  – [Figure: Viterbi state segmentation of O1…O10 over the trellis, with each frame labeled by its codeword v1 or v2]

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for continuous HMM
  – 3 states and 4 Gaussian mixtures per state
  – [Figure: observation frames O1…ON segmented into states s1, s2, s3; within each state, K-means splits the vectors from the global mean into cluster means (e.g. μ11, μ12, μ13, μ14 for state 1)]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution

    d_i(t) = a_{ii}^{\,t-1}\,(1-a_{ii})

    • Doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination
  – Output-independent assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

  Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66
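To make the implicit duration model concrete, the small program below (illustrative only; the self-loop value 0.8 is an assumption) prints the geometric duration distribution d_i(t) = a_ii^(t-1)(1-a_ii) and its mean 1/(1-a_ii), which follows from the geometric distribution.

    /* Implicit state-duration distribution of a conventional HMM (sketch). */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a_ii = 0.8;                              /* assumed self-loop probability */
        printf("expected duration: %.1f frames\n", 1.0 / (1.0 - a_ii));
        for (int t = 1; t <= 10; t++)                   /* d_i(t) = a_ii^(t-1) (1 - a_ii) */
            printf("d(%d) = %.4f\n", t, pow(a_ii, t - 1) * (1.0 - a_ii));
        return 0;
    }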

Known Limitations of HMMs (2/3)

• Duration modeling
  – [Figure: state duration modeled by a geometric/exponential distribution, an empirical distribution, a Gaussian distribution, and a Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized
  – [Figure: likelihood surface over the model configuration space, with the current model configuration at a local maximum]

SP - Berlin Chen 68

Homework-2 (1/2)

  – [Figure: three ergodic discrete HMMs (states s1, s2, s3; transition probabilities 0.33/0.34) with state output probabilities {A:0.34, B:0.33, C:0.33}, {A:0.33, B:0.34, C:0.33} and {A:0.33, B:0.33, C:0.34}]

  TrainSet 1:
  1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

  TrainSet 2:
  1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

  P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.
  P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.
  P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB
  P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

  – [Figure: isolated word recognition architecture — the speech signal goes through feature extraction to produce the feature sequence X; each word model M_1, M_2, …, M_V (plus a silence model M_Sil) computes its likelihood p(X|M_k); the most-likely-word selector outputs the label]

    Label(X) = \arg\max_{k} p(X | M_k)

  – Viterbi approximation:

    Label(X) = \arg\max_{k} \max_{S} p(X, S | M_k)

SP - Berlin Chen 71
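The selector itself is just an argmax over per-model scores. The sketch below is illustrative only (not from the slides): word_log_likelihood() is a hypothetical scorer that would run the forward or Viterbi computation of the earlier slides for model k on the current feature sequence, and N_WORDS is an assumed vocabulary size.

    /* Sketch of the most-likely-word selector for isolated word recognition. */
    #define N_WORDS 10   /* vocabulary size, including a silence model (assumed) */

    /* hypothetical scorer: returns log p(X|M_k) or max_S log p(X,S|M_k) */
    extern double word_log_likelihood(int model_index);

    int recognize_isolated_word(void)
    {
        int    best_word  = 0;
        double best_score = word_log_likelihood(0);
        for (int k = 1; k < N_WORDS; k++) {
            double score = word_log_likelihood(k);
            if (score > best_score) { best_score = score; best_word = k; }
        }
        return best_word;   /* index of the recognized word */
    }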

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example:
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
  – Error analysis: one deletion ("the") and one insertion ("not")
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = (0+1+1)/4 = 50%
    Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%
    Word Accuracy Rate = 100% × (Matched − Ins) / (No. of words in the correct sentence) = (3−1)/4 = 50%

  – WER + WAR = 100%
  – WER might be higher than 100%; WAR might be negative

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)
  – [Figure: alignment grid with the reference (correct) words indexed by j and the recognized (test) words indexed by i; n denotes the word length of the recognized/test sentence and m the word length of the correct/reference sentence; each grid cell [i,j] stores the minimum word-error alignment, and each cell is reached by one of the kinds of alignment (hit, substitution, deletion, insertion)]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)
  (i indexes the test sentence, j indexes the reference sentence; the penalties for substitution, deletion and insertion errors are all set to 1 here)

  Step 1. Initialization:  G[0][0] = 0
    for i = 1..n (test):       G[i][0] = G[i-1][0] + 1;  B[i][0] = 1   (Insertion, horizontal direction)
    for j = 1..m (reference):  G[0][j] = G[0][j-1] + 1;  B[0][j] = 2   (Deletion, vertical direction)

  Step 2. Iteration:
    for i = 1..n (test)
      for j = 1..m (reference)
        G[i][j] = min of:
          G[i-1][j]   + 1                           (1: Insertion, horizontal direction)
          G[i][j-1]   + 1                           (2: Deletion, vertical direction)
          G[i-1][j-1] + 1   if LR[j] ≠ LT[i]        (3: Substitution, diagonal direction)
          G[i-1][j-1]       if LR[j] = LT[i]        (4: Match, diagonal direction)
        B[i][j] = the choice (1–4) that achieves the minimum

  Step 3. Backtrace and measure:
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path: from B[n][m] back to B[0][0]
      if B[i][j] = 1, print "Insertion: LT[i]", then go left
      else if B[i][j] = 2, print "Deletion: LR[j]", then go down
      else print "Hit/Match or Substitution: LR[j] / LT[i]", then go diagonally down

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style implementation)
  – [Figure: alignment grid with the recognized/test word sequence 1…i…n on one axis and the correct/reference word sequence 1…j…m on the other; cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a deletion, and from (i-1,j-1) by a hit or substitution]

  – Initialization:

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;
    for (i = 1; i <= n; i++) {          /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {          /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program:

    for (i = 1; i <= n; i++) {                     /* test */
        gridi = grid[i]; gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {                 /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {                /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];            /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
            } else if (h < v) {                    /* HOR = ins */
                gridi[j] = gridi1[j];
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                               /* VERT = del */
                gridi[j] = gridi[j-1];
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }  /* for j */
    }  /* for i */

• Example 1:
    Correct: A C B C C
    Test:    B A B C
  – [Figure: the filled alignment grid (HTK-style); each cell shows its (Ins, Del, Sub, Hit) counts, and the final cell is (1, 2, 0, 3)]
  – One optimal alignment (there is still another optimal alignment): Ins B, Hit A, Del C, Hit B, Hit C, Del C
  – Alignment 1: WER = 60%

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2:
    Correct: A C B C C
    Test:    B A A C
  – [Figure: the filled alignment grid; each cell shows its (Ins, Del, Sub, Hit) counts, and the final cell is (1, 2, 1, 2)]
  – Three optimal alignments, all with WER = 80%:
    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C
  – Note: the penalties for substitution, deletion and insertion errors are all set to 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors:
  – HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance
  – [Example alignment of a Chinese broadcast-news story; each character carries two scores (e.g. 100000 100000)]
    Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
    ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200 and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ==================================================================
  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ==================================================================
  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ==================================================================
  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  – [Figure: the classic two-bottle (A, B) ball-drawing example and a three-state ergodic HMM with output probabilities such as {A:0.3,B:0.2,C:0.5}, {A:0.7,B:0.1,C:0.2}, {A:0.3,B:0.6,C:0.1} and transition probabilities 0.6, 0.7, 0.3, 0.2, 0.1, …]

  Observed data O: "ball sequence";  latent data S: "bottle sequence"
  Parameters to be estimated to maximize log P(O|λ):
    λ = {P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)}

  o_1 o_2 …… o_T  →  p(O|λ);  after re-estimation λ → λ', with p(O|λ') > p(O|λ)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data
      (in our case here, the state sequence is the latent data)
    • Direct access to the data necessary to estimate the parameters is impossible or difficult
      (in our case here, it is almost impossible to estimate {A,B,π} without consideration of the state sequence)
  – Two Major Steps:
    • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations: E[S|O,λ]
    • M: provides a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = x_1, x_2, \ldots, x_n:
  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum
    • For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

      μ_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i,    Σ_{ML} = \frac{1}{n}\sum_{i=1}^{n} (x_i-μ_{ML})(x_i-μ_{ML})^{T}

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior likelihood p(Φ|X) is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of complete data, log P(O,S|λ)

• First, using scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g. the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O,S), with frequency proportional to the probability P(S|O,λ), to compute a new \bar{λ}, the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm:
    • Unknown model setting: complete-data likelihood P(O,S|λ), incomplete-data likelihood P(O|λ)
    • By Bayes' rule:

      P(O,S|\bar{λ}) = P(S|O,\bar{λ})\, P(O|\bar{λ})
      \log P(O|\bar{λ}) = \log P(O,S|\bar{λ}) - \log P(S|O,\bar{λ})

    • Take the expectation over S with respect to P(S|O,λ) (the current model):

      \log P(O|\bar{λ}) = \sum_{S} P(S|O,λ)\,\log P(O,S|\bar{λ}) - \sum_{S} P(S|O,λ)\,\log P(S|O,\bar{λ})

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express \log P(O|\bar{λ}) as follows:

      \log P(O|\bar{λ}) = Q(λ,\bar{λ}) + H(λ,\bar{λ}),  where

      Q(λ,\bar{λ}) = \sum_{S} P(S|O,λ)\,\log P(O,S|\bar{λ})
      H(λ,\bar{λ}) = -\sum_{S} P(S|O,λ)\,\log P(S|O,\bar{λ})

    • We want \log P(O|\bar{λ}) ≥ \log P(O|λ), i.e.

      Q(λ,\bar{λ}) + H(λ,\bar{λ}) ≥ Q(λ,λ) + H(λ,λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ,\bar{λ}) has the following property: H(λ,\bar{λ}) ≥ H(λ,λ)

    H(λ,\bar{λ}) - H(λ,λ) = \sum_{S} P(S|O,λ)\,\log\frac{P(S|O,λ)}{P(S|O,\bar{λ})} ≥ 0

    (a Kullback-Leibler (KL) distance; nonnegative by Jensen's inequality, using \log(1/x) ≥ 1-x)

  – Therefore, for maximizing \log P(O|\bar{λ}), we only need to maximize the Q-function (auxiliary function)

      Q(λ,\bar{λ}) = \sum_{S} P(S|O,λ)\,\log P(O,S|\bar{λ})

    — the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function

    Q(λ,\bar{λ}) = \sum_{S} P(S|O,λ)\,\log P(O,S|\bar{λ}) = \sum_{S} \frac{P(O,S|λ)}{P(O|λ)}\,\log P(O,S|\bar{λ})

  – where P(O,S|λ) and \log P(O,S|\bar{λ}) can be expressed as

    P(O,S|λ) = π_{s_1} b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1}s_t}\, b_{s_t}(o_t)

    \log P(O,S|\bar{λ}) = \log \bar{π}_{s_1} + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1}s_t} + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

    Q(λ,\bar{λ}) = Q_{π}(λ,\bar{π}) + Q_{a}(λ,\bar{a}) + Q_{b}(λ,\bar{b}),  where

    Q_{π}(λ,\bar{π}) = \sum_{i=1}^{N} \frac{P(O, s_1=i | λ)}{P(O|λ)}\,\log \bar{π}_i

    Q_{a}(λ,\bar{a}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1} \frac{P(O, s_t=i, s_{t+1}=j | λ)}{P(O|λ)}\,\log \bar{a}_{ij}

    Q_{b}(λ,\bar{b}) = \sum_{j=1}^{N}\sum_{k}\sum_{t:\,o_t=v_k} \frac{P(O, s_t=j | λ)}{P(O|λ)}\,\log \bar{b}_j(v_k)

  – Each term has the form \sum_j w_j \log y_j, with the y_j subject to a sum-to-one constraint

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, \bar{π}_i, \bar{a}_{ij} and \bar{b}_j(k)
  – They can be maximized individually
  – All are of the same form

    F(y_1,y_2,\ldots,y_N) = \sum_{j=1}^{N} w_j \log y_j,   where \sum_{j=1}^{N} y_j = 1 and y_j ≥ 0

    F has its maximum value when  y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply the Lagrange multiplier method
  – Suppose the constraint is \sum_{j=1}^{N} y_j = 1. By applying a Lagrange multiplier ℓ,

    F = \sum_{j=1}^{N} w_j \log y_j + ℓ\Big(\sum_{j=1}^{N} y_j - 1\Big)

    \frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + ℓ = 0   ⟹   w_j = -ℓ\, y_j   (for every j)

    \sum_{j=1}^{N} w_j = -ℓ \sum_{j=1}^{N} y_j = -ℓ   ⟹   y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

  (Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html)

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set \bar{λ} = (\bar{π}, \bar{A}, \bar{B}) can be expressed as

    \bar{π}_i = \frac{P(O, s_1=i | λ)}{P(O|λ)} = γ_1(i)

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t=i, s_{t+1}=j | λ)/P(O|λ)}{\sum_{t=1}^{T-1} P(O, s_t=i | λ)/P(O|λ)} = \frac{\sum_{t=1}^{T-1} ξ_t(i,j)}{\sum_{t=1}^{T-1} γ_t(i)}

    \bar{b}_i(k) = \frac{\sum_{t=1,\,o_t=v_k}^{T} P(O, s_t=i | λ)/P(O|λ)}{\sum_{t=1}^{T} P(O, s_t=i | λ)/P(O|λ)} = \frac{\sum_{t=1,\,o_t=v_k}^{T} γ_t(i)}{\sum_{t=1}^{T} γ_t(i)}

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observations do not come from a finite set, but from a continuous space
  – The difference between the discrete and continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

    b_j(o) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(o) = \sum_{k=1}^{M} c_{jk}\, N(o; μ_{jk}, Σ_{jk})
           = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2π)^{L/2}|Σ_{jk}|^{1/2}} \exp\!\Big(-\frac{1}{2}(o-μ_{jk})^{T} Σ_{jk}^{-1}(o-μ_{jk})\Big),
    with \sum_{k=1}^{M} c_{jk} = 1

  – [Figure: the distribution for state i is a mixture of Gaussians N_1, N_2, N_3 with weights w_{i1}, w_{i2}, w_{i3}]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_{jk}(o_t):

    p(O, S | λ) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\, b_{s_t}(o_t)
                = \prod_{t=1}^{T} a_{s_{t-1}s_t} \Big[\sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(o_t)\Big]
                = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{s_{t-1}s_t}\, c_{s_t k_t}\, b_{s_t k_t}(o_t)

    (Note: \prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = \sum_{k_1=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t},
     e.g. (a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM}))

• Therefore

    p(O | λ) = \sum_{S}\sum_{K} p(O, S, K | λ),

    where K = (k_1,k_2,\ldots,k_T) is one of the possible mixture-component sequences along the state sequence S, and

    p(O, S, K | λ) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\, c_{s_t k_t}\, b_{s_t k_t}(o_t)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ,\bar{λ}) = \sum_{S}\sum_{K} P(S,K | O, λ)\,\log p(O,S,K | \bar{λ})
                 = \sum_{S}\sum_{K} \frac{p(O,S,K | λ)}{p(O|λ)}\,\log p(O,S,K | \bar{λ})

    \log p(O,S,K | \bar{λ}) = \sum_{t=1}^{T} \log \bar{a}_{s_{t-1}s_t} + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t} + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(o_t)

    Q(λ,\bar{λ}) = Q_{π}(λ,\bar{π}) + Q_{a}(λ,\bar{a}) + Q_{b}(λ,\bar{b}) + Q_{c}(λ,\bar{c})

    (initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

    Q_{b}(λ,\bar{b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j, k_t=k | O, λ)\,\log \bar{b}_{jk}(o_t)

    Q_{c}(λ,\bar{c}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j, k_t=k | O, λ)\,\log \bar{c}_{jk}

    where P(s_t=j, k_t=k | O, λ) = γ_t(j,k)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j,k) = P(s_t=j, k_t=k | O, λ) and write the Gaussian term as

    \log \bar{b}_{jk}(o_t) = -\frac{L}{2}\log(2π) - \frac{1}{2}\log|\bar{Σ}_{jk}| - \frac{1}{2}(o_t-\bar{μ}_{jk})^{T}\bar{Σ}_{jk}^{-1}(o_t-\bar{μ}_{jk})

• Maximizing Q_b with respect to \bar{μ}_{jk} (using \frac{\partial}{\partial x}(x^{T}Cx) = (C+C^{T})x, with Σ_{jk} symmetric):

    \frac{\partial Q_{b}(λ,\bar{b})}{\partial \bar{μ}_{jk}} = \sum_{t=1}^{T} γ_t(j,k)\,\bar{Σ}_{jk}^{-1}(o_t - \bar{μ}_{jk}) = 0

    ⟹  \bar{μ}_{jk} = \frac{\sum_{t=1}^{T} γ_t(j,k)\, o_t}{\sum_{t=1}^{T} γ_t(j,k)}

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximizing Q_b with respect to \bar{Σ}_{jk} (differentiating the \log|\bar{Σ}_{jk}| and (o_t-\bar{μ}_{jk})^{T}\bar{Σ}_{jk}^{-1}(o_t-\bar{μ}_{jk}) terms, with \bar{Σ}_{jk} symmetric) gives

    \frac{\partial Q_{b}(λ,\bar{b})}{\partial \bar{Σ}_{jk}} = 0
    ⟹  \bar{Σ}_{jk} = \frac{\sum_{t=1}^{T} γ_t(j,k)\, (o_t-\bar{μ}_{jk})(o_t-\bar{μ}_{jk})^{T}}{\sum_{t=1}^{T} γ_t(j,k)}

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

    \bar{μ}_{jk} = \frac{\sum_{t=1}^{T} P(s_t=j, k_t=k | O, λ)\, o_t}{\sum_{t=1}^{T} P(s_t=j, k_t=k | O, λ)} = \frac{\sum_{t=1}^{T} γ_t(j,k)\, o_t}{\sum_{t=1}^{T} γ_t(j,k)}

    \bar{Σ}_{jk} = \frac{\sum_{t=1}^{T} γ_t(j,k)\, (o_t-\bar{μ}_{jk})(o_t-\bar{μ}_{jk})^{T}}{\sum_{t=1}^{T} γ_t(j,k)}

    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} γ_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M} γ_t(j,m)}


Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
– The state duration follows an exponential (geometric) distribution, $d_i(t) = (a_{ii})^{t-1}(1 - a_{ii})$
  • Doesn't provide an adequate representation of the temporal structure of speech
– First-order (Markov) assumption: the state transition depends only on the origin and destination
– Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications.

SP - Berlin Chen 66
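A small sketch (my own, not from the slides) of what the exponential-duration assumption implies: for an assumed self-loop probability a_ii, the duration distribution is d_i(t) = a_ii^(t-1)(1 − a_ii), with mean 1/(1 − a_ii) frames.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a_ii = 0.6;                 /* assumed self-loop probability */
    double mean = 1.0 / (1.0 - a_ii);  /* expected number of frames     */

    printf("mean duration = %.2f frames\n", mean);
    for (int t = 1; t <= 5; t++)       /* d_i(t) decays geometrically   */
        printf("d(%d) = %.4f\n", t, pow(a_ii, t - 1) * (1.0 - a_ii));
    return 0;
}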

Known Limitations of HMMs (2/3)

• Duration modeling

[Figure: candidate duration models – geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

[Figure: the likelihood surface over the model configuration space, with the current model configuration sitting at a local maximum]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: a 3-state ergodic HMM used as the initial model – every transition probability is 0.33 or 0.34, and the three states' emission probabilities are {A:.34, B:.33, C:.33}, {A:.33, B:.34, C:.33}, and {A:.33, B:.33, C:.34}]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to: ABCABCCABAABABCCCCBBB?

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Block diagram: Speech Signal → Feature Extraction → feature sequence X → likelihoods p(X|M_1), p(X|M_2), …, p(X|M_V), p(X|M_Sil) computed in parallel by the word models M_1 … M_V and the silence model M_Sil → Most Likely Word Selector]

$\text{Label}(\mathbf{X}) = \arg\max_k \; p(\mathbf{X} \mid M_k)$

Viterbi approximation:
$\text{Label}(\mathbf{X}) = \arg\max_k \; \max_{\mathbf{S}} \; p(\mathbf{X}, \mathbf{S} \mid M_k)$

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
– Substitution: an incorrect word was substituted for the correct word
– Deletion: a correct word was omitted in the recognized sentence
– Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Can be handled by dynamic programming

• Example
  Correct:    "the effect is clear"
  Recognized: "effect is not clear"
– Error analysis: one deletion ("the") and one insertion ("not"); "effect", "is", "clear" are matched
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

  Word Error Rate = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%   (might be higher than 100%)
  Word Correction Rate = 100% × Matched / (No. of words in the correct sentence) = 3/4 = 75%
  Word Accuracy Rate = 100% × (Matched − Ins) / (No. of words in the correct sentence) = (3 − 1)/4 = 50%   (might be negative)

  Note: WER + WAR = 100%

SP - Berlin Chen 73
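The following tiny C program is my own arithmetic check (not from the slides): it simply turns the alignment counts of the example above into the three measures. The variable names are illustrative.

#include <stdio.h>

int main(void)
{
    /* "the effect is clear" vs "effect is not clear":
     * 3 matched words, 1 deletion, 1 insertion, 0 substitutions, N = 4 */
    int N = 4, hit = 3, sub = 0, del = 1, ins = 1;

    double wer = 100.0 * (sub + del + ins) / N;   /* 50% */
    double wcr = 100.0 * hit / N;                 /* 75% */
    double war = 100.0 * (hit - ins) / N;         /* 50% */

    printf("WER=%.0f%%  WCR=%.0f%%  WAR=%.0f%%  (WER + WAR = %.0f%%)\n",
           wer, wcr, war, wer + war);
    return 0;
}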

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)

[Figure: the alignment grid – one axis indexes the correct/reference sentence (Ref, length m) and the other the recognized/test sentence (Test, length n); each cell [i, j] stores the minimum word error alignment up to that point, reached by one of several kinds of alignment steps (hit, substitution, insertion, deletion)]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen)

Step 1 (Initialization):
  G[0][0] = 0
  for i = 1, …, n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1 (Insertion, horizontal direction)
  for j = 1, …, m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2 (Deletion, vertical direction)

Step 2 (Iteration):
  for i = 1, …, n (test), for j = 1, …, m (reference):
    G[i][j] = min ( G[i-1][j]   + 1   (Insertion),
                    G[i][j-1]   + 1   (Deletion),
                    G[i-1][j-1] + 1   (Substitution, if LT[i] ≠ LR[j]),
                    G[i-1][j-1]       (Match, if LT[i] = LR[j]) )
    B[i][j] = 1 Insertion (horizontal), 2 Deletion (vertical), 3 Substitution (diagonal), 4 Match (diagonal)

Step 3 (Measure and Backtrace):
  Word Error Rate = 100% × G[n][m] / m
  Word Accuracy Rate = 100% − Word Error Rate
  Optimal backtrace path (from B[n][m] back to B[0][0]):
    if B[i][j] = 1, print LT[i] (Insertion), then go left
    else if B[i][j] = 2, print LR[j] (Deletion), then go down
    else print LR[j] (Hit/Match or Substitution), then go diagonally down

Note: the penalties for substitution, deletion and insertion errors are all set to 1 here (Ref indexed by j, Test by i).

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm – Initialization

[Figure: the (n+1) × (m+1) alignment grid spanned by the recognized/test word sequence (index i = 1 … n) and the correct/reference word sequence (index j = 1 … m); moving from cell (0,0) to (n,m), a horizontal step is an insertion, a vertical step a deletion, and a diagonal step a hit or substitution; HTK convention: cell (i, j) is reached from (i-1, j), (i, j-1), or (i-1, j-1)]

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;

for (i = 1; i <= n; i++) {            /* test */
    grid[i][0]        = grid[i-1][0];
    grid[i][0].dir    = HOR;
    grid[i][0].score += insPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j]        = grid[0][j-1];
    grid[0][j].dir    = VERT;
    grid[0][j].score += delPen;
    grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program

for (i = 1; i <= n; i++) {                  /* test */
    gridi  = grid[i];
    gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {              /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i])
            d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {             /* DIAG = hit or sub */
            gridi[j]       = gridi1[j-1];   /* structure assignment */
            gridi[j].score = d;
            gridi[j].dir   = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit;
            else                     ++gridi[j].sub;
        } else if (h < v) {                 /* HOR = ins */
            gridi[j]       = gridi1[j];     /* structure assignment */
            gridi[j].score = h;
            gridi[j].dir   = HOR;
            ++gridi[j].ins;
        } else {                            /* VERT = del */
            gridi[j]       = gridi[j-1];    /* structure assignment */
            gridi[j].score = v;
            gridi[j].dir   = VERT;
            ++gridi[j].del;
        }
    } /* for j */
} /* for i */

[Figure: the filled grid for Example 1, each cell storing its (Ins, Del, Sub, Hit) counts, with the HTK-style backtrace marked]

• Example 1
  Correct: A C B C C
  Test:    B A B C
  Alignment 1 (read off the backtrace): Del C, Hit C, Hit B, Del C, Hit A, Ins B  →  1 Ins + 2 Del + 0 Sub, WER = 3/5 = 60%
  (There is still another optimal alignment with the same WER.)

SP - Berlin Chen 77
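The loop above is only a fragment, so here is a self-contained sketch (my own scaffolding, not from the slides) that wraps the slide-76 initialization and the slide-77 loop into a runnable program and runs it on Example 1. The cell layout, the NIL/HOR/VERT/DIAG codes, the MAXW bound and the word arrays are illustrative assumptions; note that, depending on tie-breaking, it may report a different but equally scored alignment from the one drawn above, while the total number of errors (3) and the WER (60%) stay the same.

#include <stdio.h>
#include <string.h>

enum { NIL, HOR, VERT, DIAG };

typedef struct { int score, ins, del, sub, hit, dir; } Cell;

#define MAXW 16
static Cell grid[MAXW][MAXW];          /* zero-initialized: cell (0,0) done */

int main(void)
{
    const char *lTest[] = { "", "B", "A", "B", "C" };         /* i = 1..n */
    const char *lRef[]  = { "", "A", "C", "B", "C", "C" };    /* j = 1..m */
    int n = 4, m = 5;
    int insPen = 1, delPen = 1, subPen = 1;   /* unit penalties, as on slide 75 */
    int i, j;

    /* initialization (slide 76) */
    for (i = 1; i <= n; i++) {
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;  grid[i][0].score += insPen;  grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT; grid[0][j].score += delPen;  grid[0][j].del++;
    }

    /* iteration (this slide) */
    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
            int h = grid[i-1][j].score + insPen;
            int d = grid[i-1][j-1].score;
            int v = grid[i][j-1].score + delPen;
            int match = (strcmp(lRef[j], lTest[i]) == 0);
            if (!match) d += subPen;
            if (d <= h && d <= v) {                   /* DIAG = hit or sub */
                grid[i][j] = grid[i-1][j-1];
                grid[i][j].score = d; grid[i][j].dir = DIAG;
                if (match) grid[i][j].hit++; else grid[i][j].sub++;
            } else if (h < v) {                       /* HOR = ins */
                grid[i][j] = grid[i-1][j];
                grid[i][j].score = h; grid[i][j].dir = HOR; grid[i][j].ins++;
            } else {                                  /* VERT = del */
                grid[i][j] = grid[i][j-1];
                grid[i][j].score = v; grid[i][j].dir = VERT; grid[i][j].del++;
            }
        }

    printf("Ins=%d Del=%d Sub=%d Hit=%d  WER=%.0f%%\n",
           grid[n][m].ins, grid[n][m].del, grid[n][m].sub, grid[n][m].hit,
           100.0 * grid[n][m].score / m);
    return 0;
}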

Measures of ASR Performance (7/8)

• Example 2
  Correct: A C B C C
  Test:    B A A C

[Figure: the filled alignment grid for Example 2, each cell again storing its (Ins, Del, Sub, Hit) counts]

  Alignment 1 (backtrace order): Del C, Hit C, Sub B, Del C, Hit A, Ins B  →  WER = 4/5 = 80%
  Alignment 2 (backtrace order): Del C, Hit C, Del B, Sub C, Hit A, Ins B  →  WER = 80%
  Alignment 3 (backtrace order): Del C, Hit C, Sub B, Sub C, Sub A         →  WER = 80%

Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

Reference (in the original file every character is preceded by two score fields, e.g. "100000 100000"):
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……

ASR Output:
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first 1, 100, 200, and all 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results (506 stories) --------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================================
------------------------ Overall Results (1 story) ------------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================================
------------------------ Overall Results (100 stories) --------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================================
------------------------ Overall Results (200 stories) --------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================================

SP - Berlin Chen 81
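As a sanity check (my own arithmetic, not part of the homework statement), the HTK-style percentages follow directly from the hit/insertion counts of the 506-story block:

$\%\text{Corr} = 100 \times \frac{H}{N} = 100 \times \frac{57144}{65812} \approx 86.83\%, \qquad
\text{Acc} = 100 \times \frac{H - I}{N} = 100 \times \frac{57144 - 504}{65812} \approx 86.06\%$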

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B of red/green balls – observed data O: the "ball sequence"; latent data S: the "bottle sequence"; parameters to be estimated so as to maximize logP(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)]

[Figure: a 3-state ergodic HMM λ with emission probabilities such as {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1} and transition probabilities 0.6, 0.7, 0.3, 0.2, 0.1, …; given o1 o2 …… oT, re-estimation produces a new model λ̂ with p(O|λ̂) > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
  • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence is the latent data.
  • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence.
– Two Major Steps
  • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations, $E_{\mathbf{S}}[\,\cdot \mid \mathbf{O}, \lambda\,]$
  • M: provides a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ (a realization of the random sample $X_1, X_2, \ldots, X_n$): ML and MAP

– The Maximum Likelihood (ML) Principle: find the model parameter $\Phi$ so that the likelihood $p(\mathbf{X} \mid \Phi)$ is maximum. For example, if $\Phi = \{\mu, \Sigma\}$ are the parameters of a multivariate normal distribution and $\mathbf{X}$ is i.i.d. (independent, identically distributed), then the ML estimates of $\mu$ and $\Sigma$ are

  $\mu_{ML} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i, \qquad
   \Sigma_{ML} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \mu_{ML})(\mathbf{x}_i - \mu_{ML})^{T}$

– The Maximum A Posteriori (MAP) Principle: find the model parameter $\Phi$ so that the posterior probability $p(\Phi \mid \mathbf{X})$ is maximum

SP - Berlin Chen 85
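A small C sketch (my own example, not from the slides) of the ML estimates above for a diagonal-covariance Gaussian: the mean is the sample mean and the variance the sample variance, both divided by n as in the ML formulas. The toy data and sizes are illustrative.

#include <stdio.h>

#define NSAMP 4
#define DIM   2

int main(void)
{
    double x[NSAMP][DIM] = { {1, 2}, {2, 3}, {3, 2}, {2, 1} };
    double mu[DIM] = {0}, var[DIM] = {0};

    /* mu_ML = (1/n) * sum_i x_i */
    for (int i = 0; i < NSAMP; i++)
        for (int d = 0; d < DIM; d++)
            mu[d] += x[i][d] / NSAMP;

    /* diagonal of Sigma_ML = (1/n) * sum_i (x_i - mu)(x_i - mu)^T */
    for (int i = 0; i < NSAMP; i++)
        for (int d = 0; d < DIM; d++) {
            double diff = x[i][d] - mu[d];
            var[d] += diff * diff / NSAMP;
        }

    for (int d = 0; d < DIM; d++)
        printf("mu[%d]=%.3f  var[%d]=%.3f\n", d, mu[d], d, var[d]);
    return 0;
}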

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O} \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O}, \mathbf{S} \mid \lambda)$

• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
– The observable training data $\mathbf{O}$
  • We want to maximize $P(\mathbf{O} \mid \lambda)$; $\lambda$ is a parameter vector
– The hidden (unobservable) data $\mathbf{S}$
  • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have $\lambda$ and estimate the probability that each $\mathbf{S}$ occurred in the generation of $\mathbf{O}$
– Pretend we had in fact observed a complete data pair $(\mathbf{O}, \mathbf{S})$, with frequency proportional to the probability $P(\mathbf{O}, \mathbf{S} \mid \lambda)$, to compute a new $\hat{\lambda}$, the maximum likelihood estimate of $\lambda$
– Does the process converge?
– Algorithm
  • Log-likelihood expression and expectation taken over S

  Bayes' rule (complete-data likelihood on the left, incomplete-data likelihood on the right; $\hat{\lambda}$ is the unknown model setting):
  $P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda}) = P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda})\, P(\mathbf{O} \mid \hat{\lambda})
   \;\Rightarrow\; \log P(\mathbf{O} \mid \hat{\lambda}) = \log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda}) - \log P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda})$

  Take the expectation over S with respect to $P(\mathbf{S} \mid \mathbf{O}, \lambda)$:
  $\log P(\mathbf{O} \mid \hat{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda}) - \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda})$

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express $\log P(\mathbf{O} \mid \hat{\lambda})$ as follows

  $\log P(\mathbf{O} \mid \hat{\lambda}) = Q(\lambda, \hat{\lambda}) - H(\lambda, \hat{\lambda})$,  where

  $Q(\lambda, \hat{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda})$

  $H(\lambda, \hat{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda})$

  • We want $\log P(\mathbf{O} \mid \hat{\lambda}) \geq \log P(\mathbf{O} \mid \lambda)$, i.e.

  $\log P(\mathbf{O} \mid \hat{\lambda}) - \log P(\mathbf{O} \mid \lambda)
   = \big[\,Q(\lambda, \hat{\lambda}) - Q(\lambda, \lambda)\,\big] + \big[\,H(\lambda, \lambda) - H(\lambda, \hat{\lambda})\,\big] \geq 0$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda, \hat{\lambda})$ has the following property: $H(\lambda, \lambda) - H(\lambda, \hat{\lambda}) \geq 0$

  $H(\lambda, \lambda) - H(\lambda, \hat{\lambda})
   = -\sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log \frac{P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda})}{P(\mathbf{S} \mid \mathbf{O}, \lambda)}
   \geq -\sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \left( \frac{P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda})}{P(\mathbf{S} \mid \mathbf{O}, \lambda)} - 1 \right)
   = -\sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \hat{\lambda}) + \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) = 0$

  (Jensen's inequality, $\log x \leq x - 1$; the quantity on the left is the Kullback–Leibler (KL) distance between the two posteriors)

– Therefore, for maximizing $\log P(\mathbf{O} \mid \hat{\lambda})$ we only need to maximize the Q-function (auxiliary function)

  $Q(\lambda, \hat{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda})$

  (the expectation of the complete-data log-likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda = (\mathbf{A}, \mathbf{B}, \pi)$
– By maximizing the auxiliary function

  $Q(\lambda, \hat{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda})
   = \sum_{\mathbf{S}} \frac{P(\mathbf{O}, \mathbf{S} \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda})$

– Where $P(\mathbf{O}, \mathbf{S} \mid \lambda)$ and $\log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda})$ can be expressed as

  $P(\mathbf{O}, \mathbf{S} \mid \lambda) = \pi_{s_1} b_{s_1}(\mathbf{o}_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(\mathbf{o}_t)$

  $\log P(\mathbf{O}, \mathbf{S} \mid \hat{\lambda}) = \log \hat{\pi}_{s_1} + \sum_{t=2}^{T} \log \hat{a}_{s_{t-1} s_t} + \sum_{t=1}^{T} \log \hat{b}_{s_t}(\mathbf{o}_t)$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as $Q(\lambda, \hat{\lambda}) = Q_{\pi}(\lambda, \hat{\pi}) + Q_{a}(\lambda, \hat{\mathbf{a}}) + Q_{b}(\lambda, \hat{\mathbf{b}})$, where

  $Q_{\pi}(\lambda, \hat{\pi}) = \sum_{i=1}^{N} \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \hat{\pi}_i$

  $Q_{a}(\lambda, \hat{\mathbf{a}}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \frac{P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \hat{a}_{ij}$

  $Q_{b}(\lambda, \hat{\mathbf{b}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t:\, \mathbf{o}_t = v_k} \frac{P(\mathbf{O}, s_t = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \hat{b}_j(v_k)$

  (each term has the form $\sum_j w_j \log y_j$)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, $\hat{\pi}_i$, $\hat{a}_{ij}$ and $\hat{b}_j(k)$
– Can be maximized individually
– All of the same form

  $F(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j$, where $\sum_{j=1}^{N} y_j = 1$ and $y_j \geq 0$,

  has its maximum value when $\; y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

SP - Berlin Chen 92
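A quick numerical instance of this lemma (my own example, not on the slide): with two weights $w_1 = 2$ and $w_2 = 3$, the function $F(y_1, y_2) = 2\log y_1 + 3\log y_2$ subject to $y_1 + y_2 = 1$ is maximized at

$y_1 = \frac{2}{2+3} = 0.4, \qquad y_2 = \frac{3}{2+3} = 0.6,$

which is exactly the "normalize the expected counts" operation used to obtain the re-estimates on the following slides.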

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  By applying the Lagrange multiplier $\ell$ with the constraint $\sum_{j=1}^{N} y_j = 1$:

  $F = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right)$

  $\dfrac{\partial F}{\partial y_j} = \dfrac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j, \;\; \forall j$

  Suppose $\sum_{j=1}^{N} y_j = 1$; then $\sum_{j=1}^{N} w_j = -\ell$, and hence

  $y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

  Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\hat{\lambda} = (\hat{\mathbf{A}}, \hat{\mathbf{B}}, \hat{\pi})$ can be expressed as

  $\hat{\pi}_i = \dfrac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} = \gamma_1(i)$

  $\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda) / P(\mathbf{O} \mid \lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
   = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

  $\hat{b}_i(k) = \dfrac{\sum_{t:\, \mathbf{o}_t = v_k} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
   = \dfrac{\sum_{t:\, \mathbf{o}_t = v_k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

SP - Berlin Chen 94
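The following C function is a minimal sketch (my own, not from the slides) of the re-estimation step above, assuming the state-occupation probabilities gamma[t][i] = P(s_t = i | O, λ) and the transition probabilities xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, λ) have already been computed with the forward-backward procedure. All sizes and names are illustrative.

#define T 10   /* frames    */
#define N  3   /* states    */
#define M  2   /* codewords */

void reestimate_discrete(const double gamma[T][N],
                         const double xi[T-1][N][N],
                         const int obs[T],          /* codeword index per frame */
                         double pi[N], double a[N][N], double b[N][M])
{
    for (int i = 0; i < N; i++) {
        double occ = 0.0, occT = 0.0;               /* sums of gamma_t(i)      */
        for (int t = 0; t < T - 1; t++) occ  += gamma[t][i];
        for (int t = 0; t < T;     t++) occT += gamma[t][i];

        pi[i] = gamma[0][i];                        /* new initial probability */

        for (int j = 0; j < N; j++) {               /* new transition probs    */
            double num = 0.0;
            for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
            a[i][j] = occ > 0.0 ? num / occ : 0.0;
        }
        for (int k = 0; k < M; k++) {               /* new emission probs      */
            double num = 0.0;
            for (int t = 0; t < T; t++)
                if (obs[t] == k) num += gamma[t][i];
            b[i][k] = occT > 0.0 ? num / occT : 0.0;
        }
    }
}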

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and continuous HMM lies in a different form of state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)

  $b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \mu_{jk}, \Sigma_{jk})
   = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{L/2} \left|\Sigma_{jk}\right|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{o} - \mu_{jk})^{T} \Sigma_{jk}^{-1} (\mathbf{o} - \mu_{jk}) \right),
   \qquad \sum_{k=1}^{M} c_{jk} = 1$

[Figure: the distribution for state i as a weighted sum of Gaussians N1, N2, N3 with mixture weights w_i1, w_i2, w_i3]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$

  $p(\mathbf{O}, \mathbf{S} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(\mathbf{o}_t)
   = \prod_{t=1}^{T} a_{s_{t-1} s_t} \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(\mathbf{o}_t)
   = \sum_{\mathbf{K}} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$

  so that

  $p(\mathbf{O} \mid \lambda) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda),
   \qquad p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$

  where $\mathbf{K} = (k_1, k_2, \ldots, k_T)$ is one of the possible mixture component sequences along the state sequence $\mathbf{S}$

  Note: $\prod_{t=1}^{T} \sum_{k=1}^{M} a_{t,k} = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t,k_t}$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

  $Q(\lambda, \hat{\lambda}) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} P(\mathbf{S}, \mathbf{K} \mid \mathbf{O}, \lambda) \log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \hat{\lambda})
   = \sum_{\mathbf{S}} \sum_{\mathbf{K}} \frac{p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda)}{p(\mathbf{O} \mid \lambda)} \log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \hat{\lambda})$

  with

  $\log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \hat{\lambda}) = \log \hat{\pi}_{s_1} + \sum_{t=2}^{T} \log \hat{a}_{s_{t-1} s_t} + \sum_{t=1}^{T} \log \hat{b}_{s_t k_t}(\mathbf{o}_t) + \sum_{t=1}^{T} \log \hat{c}_{s_t k_t}$

  so that

  $Q(\lambda, \hat{\lambda}) = Q_{\pi}(\lambda, \hat{\pi}) + Q_{a}(\lambda, \hat{\mathbf{a}}) + Q_{b}(\lambda, \hat{\mathbf{b}}) + Q_{c}(\lambda, \hat{\mathbf{c}})$

  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with discrete HMM training:

  $Q_{b}(\lambda, \hat{\mathbf{b}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid \mathbf{O}, \lambda) \log \hat{b}_{jk}(\mathbf{o}_t)$

  $Q_{c}(\lambda, \hat{\mathbf{c}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid \mathbf{O}, \lambda) \log \hat{c}_{jk}$

  where $\gamma_t(j, k) = P(s_t = j, k_t = k \mid \mathbf{O}, \lambda)$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Maximize $Q_b$ with respect to the mixture mean $\hat{\mu}_{jk}$. Let $\gamma_t(j, k) = P(s_t = j, k_t = k \mid \mathbf{O}, \lambda)$; then

  $Q_b(\lambda, \hat{\mathbf{b}}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} \gamma_t(j, k) \log \hat{b}_{jk}(\mathbf{o}_t)$

  $\log \hat{b}_{jk}(\mathbf{o}_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log\left|\hat{\Sigma}_{jk}\right| - \frac{1}{2} (\mathbf{o}_t - \hat{\mu}_{jk})^{T} \hat{\Sigma}_{jk}^{-1} (\mathbf{o}_t - \hat{\mu}_{jk})$

  Setting the derivative to zero (using $\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^{T}\mathbf{C}\mathbf{x} = (\mathbf{C} + \mathbf{C}^{T})\mathbf{x}$ and the fact that $\hat{\Sigma}_{jk}$ is symmetric here):

  $\frac{\partial Q_b}{\partial \hat{\mu}_{jk}} = \sum_{t=1}^{T} \gamma_t(j, k)\, \hat{\Sigma}_{jk}^{-1} (\mathbf{o}_t - \hat{\mu}_{jk}) = 0
   \;\Rightarrow\; \hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j, k)}$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, maximize $Q_b$ with respect to the mixture covariance $\hat{\Sigma}_{jk}$ (using $\frac{\partial}{\partial \mathbf{X}} \mathbf{a}^{T}\mathbf{X}\mathbf{b} = \mathbf{a}\mathbf{b}^{T}$, $\frac{\partial}{\partial \mathbf{X}} \log\left(\det \mathbf{X}\right) = (\mathbf{X}^{-1})^{T}$, and the fact that $\hat{\Sigma}_{jk}$ is symmetric here):

  $\frac{\partial Q_b}{\partial \hat{\Sigma}_{jk}^{-1}} = \frac{1}{2} \sum_{t=1}^{T} \gamma_t(j, k) \left[ \hat{\Sigma}_{jk} - (\mathbf{o}_t - \hat{\mu}_{jk})(\mathbf{o}_t - \hat{\mu}_{jk})^{T} \right] = 0$

  $\Rightarrow\; \hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (\mathbf{o}_t - \hat{\mu}_{jk})(\mathbf{o}_t - \hat{\mu}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j, k)}$

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

  $\hat{\mu}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)}
   = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j, k)}$

  $\hat{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (\mathbf{o}_t - \hat{\mu}_{jk})(\mathbf{o}_t - \hat{\mu}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j, k)}$

  $\hat{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{k'=1}^{M} \gamma_t(j, k')}$

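As a final illustration, the following C function is my own sketch (not from the slides) of the mixture re-estimation above for one state j with diagonal covariances: given gamma[t][k] = γ_t(j, k), it accumulates the weighted statistics and forms the new weight, mean and variance of every mixture component. All sizes and names are illustrative, and it assumes every mixture receives nonzero occupancy.

#define T  20   /* frames             */
#define M   4   /* mixtures per state */
#define DIM 2   /* feature dimension  */

void reestimate_mixtures(const double gamma[T][M], const double o[T][DIM],
                         double c[M], double mu[M][DIM], double var[M][DIM])
{
    double occ_state = 0.0;                   /* sum_t sum_k gamma_t(j,k) */

    for (int k = 0; k < M; k++) {
        double occ = 0.0;
        double sum[DIM] = {0.0}, sqr[DIM] = {0.0};

        for (int t = 0; t < T; t++) {
            occ += gamma[t][k];
            for (int d = 0; d < DIM; d++) {
                sum[d] += gamma[t][k] * o[t][d];
                sqr[d] += gamma[t][k] * o[t][d] * o[t][d];
            }
        }
        occ_state += occ;

        for (int d = 0; d < DIM; d++) {
            mu[k][d]  = sum[d] / occ;                       /* new mean     */
            var[k][d] = sqr[d] / occ - mu[k][d] * mu[k][d]; /* new variance */
        }
        c[k] = occ;          /* numerator of the weight; normalized below   */
    }
    for (int k = 0; k < M; k++)
        c[k] /= occ_state;   /* c_jk = occ_k / sum_k' occ_k'                */
}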
Page 17: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 17

Hidden Markov Model (cont)

bull Multivariate Mixture Gaussian Distributions (cont)ndash More complex distributions with multiple local maxima can be

approximated by Gaussian (a unimodal distribution) mixtures

ndash Gaussian mixtures with enough mixture components can approximate any distribution

1 11

M

kk

M

kkkkk wNwf Σμxx

SP - Berlin Chen 18

Hidden Markov Model (cont)

bull Example 4 a 3-state discrete HMM

ndash Given a sequence of observations O=ABC there are 27 possible corresponding state sequences and therefore the corresponding probability is

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

06

07

0301

02

0201

03

05 105040

106030201070

502030502030207010103060

333

222

111

CBACBA

CBA

A

bbbbbbbbb

070207050

0070101070 when

sequence state

23222

322322

27

1

27

1

ssPssPsPP

sPsPsPPsssgE

PPPP

i

ii

ii

iii

i

λS

CBAλSOS

SλSλSOλSOλO

Ergodic HMM

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have when compared with Discrete HMM training

  Q_{b}(\lambda, \bar{\mathbf{b}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda) \log \bar{b}_{jk}(\mathbf{o}_t)

  Q_{c}(\lambda, \bar{\mathbf{c}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda) \log \bar{c}_{jk}

  where \gamma_{t}(j, k) = P(s_t = j, k_t = k \mid O, \lambda) is the probability of being in state j at time t with the k-th mixture component accounting for \mathbf{o}_t

SP - Berlin Chen 98
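In practice the posterior gamma_t(j, k) = P(s_t = j, k_t = k | O, lambda) is obtained by splitting the state-level posterior gamma_t(j) across mixture components in proportion to c_jk N(o_t; mu_jk, Sigma_jk), as in the earlier "intuitive view" slide on continuous observations. A minimal sketch for one time frame follows; the function name and argument layout are my own, and alpha_t, beta_t are assumed to be the (unscaled) forward and backward values at time t.

import numpy as np

def gaussian_pdf(o, mu, Sigma):
    """Multivariate Gaussian density N(o; mu, Sigma)."""
    L, d = len(mu), o - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / np.sqrt((2 * np.pi) ** L * np.linalg.det(Sigma))

def mixture_posteriors(alpha_t, beta_t, o_t, c, mus, Sigmas):
    """Return gamma_t(j, k) for all states j and mixture components k at one time t.
    c is an (N, M) weight matrix; mus[j][k] and Sigmas[j][k] hold the component parameters."""
    N, M = c.shape
    gamma_state = alpha_t * beta_t / np.sum(alpha_t * beta_t)        # gamma_t(j)
    gamma_jk = np.zeros((N, M))
    for j in range(N):
        comp = np.array([c[j, k] * gaussian_pdf(o_t, mus[j][k], Sigmas[j][k]) for k in range(M)])
        gamma_jk[j] = gamma_state[j] * comp / comp.sum()             # split gamma_t(j) over the mixtures
    return gamma_jk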

EM Applied to Continuous HMM Training (5/7)

• Let \gamma_{t}(j, k) = P(s_t = j, k_t = k \mid O, \lambda), and

  b_{jk}(\mathbf{o}_t) = N(\mathbf{o}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{o}_t - \boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o}_t - \boldsymbol{\mu}_{jk}) \right)

  \log b_{jk}(\mathbf{o}_t) = -\frac{L}{2} \log(2\pi) - \frac{1}{2} \log |\boldsymbol{\Sigma}_{jk}| - \frac{1}{2} (\mathbf{o}_t - \boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o}_t - \boldsymbol{\mu}_{jk})

• Maximize Q_{b}(\lambda, \bar{\mathbf{b}}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} \gamma_{t}(j, k) \log \bar{b}_{jk}(\mathbf{o}_t) with respect to \bar{\boldsymbol{\mu}}_{jk}

  \frac{\partial Q_{b}(\lambda, \bar{\mathbf{b}})}{\partial \bar{\boldsymbol{\mu}}_{jk}} = \sum_{t=1}^{T} \gamma_{t}(j, k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk}) = 0

  \Rightarrow \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_{t}(j, k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_{t}(j, k)}

  Note: \frac{\partial (\mathbf{x}^{T} \mathbf{C} \mathbf{x})}{\partial \mathbf{x}} = (\mathbf{C} + \mathbf{C}^{T})\, \mathbf{x}, and \boldsymbol{\Sigma}_{jk}^{-1} is symmetric here

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximize Q_{b}(\lambda, \bar{\mathbf{b}}) with respect to the covariance, differentiating with respect to the inverse covariance \bar{\boldsymbol{\Sigma}}_{jk}^{-1}

  \log \bar{b}_{jk}(\mathbf{o}_t) = -\frac{L}{2} \log(2\pi) + \frac{1}{2} \log |\bar{\boldsymbol{\Sigma}}_{jk}^{-1}| - \frac{1}{2} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})

  \frac{\partial Q_{b}(\lambda, \bar{\mathbf{b}})}{\partial \bar{\boldsymbol{\Sigma}}_{jk}^{-1}} = \sum_{t=1}^{T} \gamma_{t}(j, k) \left[ \frac{1}{2} \bar{\boldsymbol{\Sigma}}_{jk} - \frac{1}{2} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T} \right] = 0

  \Rightarrow \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_{t}(j, k)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_{t}(j, k)}

  Notes: \frac{\partial (\mathbf{a}^{T} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^{T}, \quad \frac{\partial \det(\mathbf{X})}{\partial \mathbf{X}} = \det(\mathbf{X})\, (\mathbf{X}^{-1})^{T}, and \bar{\boldsymbol{\Sigma}}_{jk} is symmetric here

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

  \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)}

  \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)}

  \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_{t}(j, k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_{t}(j, m)}

  where \gamma_{t}(j, k) = p(s_t = j, k_t = k \mid O, \lambda)
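Given the per-frame posteriors gamma_t(j, k) for a training utterance (for example computed frame by frame as in the sketch following the (4/7) slide), all three updates above are weighted averages over the frames. The sketch below collects them for a single utterance; the array names and shapes are my own choices, and the random inputs in the usage example are purely illustrative.

import numpy as np

def reestimate_mixtures(gamma, O):
    """gamma: (T, N, M) array of gamma_t(j, k); O: (T, L) array of observation vectors.
    Returns updated means (N, M, L), covariances (N, M, L, L) and mixture weights (N, M)."""
    T, N, M = gamma.shape
    L = O.shape[1]
    occ = gamma.sum(axis=0)                                    # sum_t gamma_t(j, k)
    mu = np.einsum('tjk,tl->jkl', gamma, O) / occ[..., None]   # posterior-weighted means
    Sigma = np.zeros((N, M, L, L))
    for j in range(N):
        for k in range(M):
            d = O - mu[j, k]                                   # deviations from the new mean
            Sigma[j, k] = (gamma[:, j, k, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0) / occ[j, k]
    c = occ / occ.sum(axis=1, keepdims=True)                   # weights sum to 1 within each state
    return mu, Sigma, c

# Example shapes: T = 50 frames, N = 3 states, M = 2 mixtures, L = 13-dimensional features
rng = np.random.default_rng(1)
gamma = rng.dirichlet(np.ones(3 * 2), size=50).reshape(50, 3, 2)
O = rng.normal(size=(50, 13))
mu, Sigma, c = reestimate_mixtures(gamma, O)
print(mu.shape, Sigma.shape, c.shape)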

Page 19: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 19

Hidden Markov Model (cont)

bull Notationsndash O=o1o2o3helliphellipoT the observation (feature) sequencendash S= s1s2s3helliphellipsT the state sequencendash model for HMM =AB ndash P(O|) The probability of observing O given the model ndash P(O|S) The probability of observing O given and a state

sequence S of ndash P(OS|) The probability of observing O and S given ndash P(S|O) The probability of observing S given O and

bull Useful formulasndash Bayesrsquo Rule

BP

APABPBP

BAPBAP

BPBAPAPABPBAP

yprobabilit thedescribing model

BPAPABP

BPBAP

BAP

λ

λλλ

λλ

λ

chain rule

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.), with L training utterances (utterance l has length T_l):

    c̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} Σ_{m=1}^{M} γ_t^l(j, m)
         = (expected number of times in state j and mixture k) / (expected number of times in state j)

    μ̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) o_t^l / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)
         = weighted average (mean) of the observations at state j and mixture k

    Σ̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) (o_t^l - μ̄_jk)(o_t^l - μ̄_jk)^T / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)
         = weighted covariance of the observations at state j and mixture k

    π̄_i = expected frequency (number of times) in state i at time t = 1 = (1/L) Σ_{l=1}^{L} γ_1^l(i)

    ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
         = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)

    (Formulas for multiple (L) training utterances)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For discrete and finite observations (cont.), with L training utterances:

    π̄_i = expected frequency (number of times) in state i at time t = 1 = (1/L) Σ_{l=1}^{L} γ_1^l(i)

    ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
         = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)

    b̄_j(v_k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)
             = Σ_{l=1}^{L} Σ_{t=1, o_t^l = v_k}^{T_l} γ_t^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j)

    (Formulas for multiple (L) training utterances)

SP - Berlin Chen 58

Semicontinuous HMMs
• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous or tied-mixture HMM
  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions

    b_j(o) = Σ_{k=1}^{M} b_j(k) f(o | v_k) = Σ_{k=1}^{M} b_j(k) N(o; μ_k, Σ_k)

    where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent) and f(o | v_k) is the k-th mixture density function, i.e. the k-th codeword (shared across all HMMs; M is very large)

  – Because M is large, we can simply use the L most significant values of f(o | v_k)
    • Experience showed that an L of about 1~3% of M is adequate
  – Partial tying of f(o | v_k) for different phonetic classes

SP - Berlin Chen 59
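A small sketch of how the tied-mixture output probability can be evaluated with only the L most significant codewords per frame; the precomputed per-frame codeword likelihoods and the top-L index list are assumptions of this sketch.

    /* Sketch: semicontinuous state output probability restricted to the top-L codewords.
       codeword_like[k] = f(o_t | v_k) has been evaluated once per frame for the shared
       codebook (size M); top[0..L-1] holds the indices of the L largest values;
       bj[k] is the discrete, model-dependent weight b_j(k). */
    double semicont_output_prob(int L, const int *top,
                                const double *codeword_like, const double *bj)
    {
        double p = 0.0;
        for (int l = 0; l < L; l++) {
            int k = top[l];
            p += bj[k] * codeword_like[k];    /* b_j(o) ~ sum over top-L of b_j(k) f(o | v_k) */
        }
        return p;
    }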

Semicontinuous HMMs (cont)

[Figure: two 3-state HMMs; each state j keeps its own discrete weights b_j(1), …, b_j(k), …, b_j(M), while all states of all models share a single codebook of Gaussian kernels N(μ_1, Σ_1), N(μ_2, Σ_2), …, N(μ_k, Σ_k), …, N(μ_M, Σ_M)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate for modeling the speech signal (also called the "beads-on-a-string" model)
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM
• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2:
    • For a discrete density HMM (using an M-codeword codebook):
      b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For a continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state j into a set of M clusters, then
      ŵ_jm = (number of vectors classified in cluster m of state j) divided by (number of vectors in state j)
      μ̂_jm = sample mean of the vectors classified in cluster m of state j
      Σ̂_jm = sample covariance matrix of the vectors classified in cluster m of state j
  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop: the initial model is generated.

SP - Berlin Chen 62

Initialization of HMM (cont)

[Flowchart: Training Data and Initial Model → State Sequence Segmentation → Estimate parameters of observations via Segmental K-means → Model Re-estimation → Model Convergence? NO: loop back to segmentation; YES: output Model Parameters]

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM
  – 3 states and a 2-codeword codebook
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3

[Figure: a 10-frame observation sequence O1 … O10 segmented into states s1, s2, s3 along the trellis; counting how often codewords v1 and v2 occur within each state gives the initial estimates b_j(v_k) above]

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
  – 3 states and 4 Gaussian mixtures per state

[Figure: the observation frames O1 … ON are segmented into states s1, s2, s3; within each state, K-means splits the vectors (starting from the global mean into cluster means) into 4 clusters, giving the initial mixture parameters (μ_j1, Σ_j1), …, (μ_j4, Σ_j4) for that state]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution, d_i(t) = a_ii^(t-1) (1 - a_ii)
    • Doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination
  – Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames
• Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66
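The geometric/exponential duration behaviour implied by the self-loop probability is easy to see numerically; the toy C program below (the self-loop value 0.8 is just an example, not taken from the slides) prints d_i(t) = a_ii^(t-1)(1 - a_ii) for the first few durations together with the mean duration 1/(1 - a_ii).

    /* Toy illustration of the implicit HMM state-duration distribution. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a_ii = 0.8;                                   /* example self-transition prob. */
        for (int t = 1; t <= 10; t++)
            printf("d(%2d) = %.4f\n", t, pow(a_ii, t - 1) * (1.0 - a_ii));
        printf("mean duration = %.1f frames\n", 1.0 / (1.0 - a_ii));
        return 0;
    }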

Known Limitations of HMMs (2/3)

• Duration modeling

[Figure: candidate state-duration distributions: geometric/exponential distribution (implicit in the HMM), empirical distribution, Gaussian distribution, Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

[Figure: likelihood as a function of the model configuration space; training climbs to the local optimum nearest the current model configuration]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: initial 3-state HMM; every state has self-transition probability 0.34 and probability 0.33 of moving to each of the other two states; output probabilities: s1: A=.34, B=.33, C=.33; s2: A=.33, B=.34, C=.33; s3: A=.33, B=.33, C=.34]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCAABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCAABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2, and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: isolated word recognition. The speech signal passes through feature extraction to give the feature sequence X; the likelihoods p(X | M1), p(X | M2), …, p(X | MV) of the V word models, plus the silence model likelihood p(X | MSil), are computed in parallel, and the most likely word selector outputs the recognized label.]

Label(X) = argmax_k p(X | M_k)

Viterbi approximation: Label(X) = argmax_k max_S p(X, S | M_k)

SP - Berlin Chen 71
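The selector in the block diagram above boils down to an argmax over per-model scores. The sketch below assumes a hypothetical routine viterbi_log_like() that returns max_S log p(X, S | M_k) for one word model; the routine and the opaque model type are only declared here, as assumptions of the sketch.

    /* Sketch of the most-likely-word selector under the Viterbi approximation.
       Hmm is an opaque word-model type and viterbi_log_like() is a hypothetical
       scoring routine assumed to be implemented elsewhere. */
    typedef struct Hmm Hmm;
    double viterbi_log_like(const Hmm *model, const float *X, int T);

    int select_word(const Hmm *const *models, int num_models, const float *X, int T)
    {
        int best = 0;
        double best_score = viterbi_log_like(models[0], X, T);
        for (int k = 1; k < num_models; k++) {
            double score = viterbi_log_like(models[k], X, T);  /* max_S log p(X, S | M_k) */
            if (score > best_score) { best_score = score; best = k; }
        }
        return best;                                           /* Label(X) = argmax_k */
    }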

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example
  Correct: "the effect is clear"
  Recognized: "effect is not clear"
  – Error analysis: one deletion ("the") and one insertion ("not"); the remaining three words are matched
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate = 100% × (Sub + Del + Ins) / (no. of words in the correct sentence) = 100% × 2/4 = 50%
    Word Correction Rate = 100% × (matched words) / (no. of words in the correct sentence) = 100% × 3/4 = 75%
    Word Accuracy Rate = 100% × (matched words - Ins) / (no. of words in the correct sentence) = 100% × (3 - 1)/4 = 50%

  – Notes: WER + WAR = 100%; WER might be higher than 100%; WAR might be negative

SP - Berlin Chen 73
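The three rates are simple functions of the alignment counts; the few lines of C below reproduce the numbers of the example above (4 reference words, 0 substitutions, 1 deletion, 1 insertion, 3 matched words).

    /* Reproduces the example: WER = 50%, WCR = 75%, WAR = 50%. */
    #include <stdio.h>

    int main(void)
    {
        int n_ref = 4, sub = 0, del = 1, ins = 1, matched = 3;
        printf("WER = %.0f%%\n", 100.0 * (sub + del + ins) / n_ref);
        printf("WCR = %.0f%%\n", 100.0 * matched / n_ref);
        printf("WAR = %.0f%%\n", 100.0 * (matched - ins) / n_ref);
        return 0;
    }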

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

[Figure: DP grid over the reference word string (index i) and the test word string (index j); one symbol denotes the word length of the correct/reference sentence and the other the word length of the recognized/test sentence. Each cell [i, j] stores the minimum word error alignment reaching it, and each local step corresponds to one of the kinds of alignment: hit, substitution, deletion, or insertion.]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen); here the test string is indexed by i (length n) and the reference string by j (length m)

  Step 1: Initialization
    G[0][0] = 0
    for i = 1 … n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1 (Insertion, horizontal direction)
    for j = 1 … m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2 (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1 … n (test), for j = 1 … m (reference):
      G[i][j] = min( G[i-1][j] + 1 (Insertion),
                     G[i][j-1] + 1 (Deletion),
                     G[i-1][j-1] + 1 if LT[i] ≠ LR[j] (Substitution),
                     G[i-1][j-1]     if LT[i] = LR[j] (Match) )
      B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical), 3 (Substitution, diagonal), or 4 (Match, diagonal), according to which case was chosen

  Step 3: Measure and Backtrace
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path from B[n][m] back to B[0][0]:
      if B[i][j] = 1, print "Insertion, LT[i]", then go left
      else if B[i][j] = 2, print "Deletion, LR[j]", then go down
      else print "LR[j], Hit/Match or Substitution", then go diagonally down

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: the DP grid (HTK-style). The recognized/test word sequence indexes i = 1 … n and the correct/reference word sequence indexes j = 1 … m; cell (i, j) is reached from (i-1, j) by an insertion, from (i, j-1) by a deletion, and from (i-1, j-1) by a hit or substitution.]

• A Dynamic Programming Algorithm
  – Initialization:

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;
    for (i = 1; i <= n; i++) {          /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {          /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program:

    for (i = 1; i <= n; i++) {                    /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {                /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {               /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];           /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {                   /* HOR = ins */
                gridi[j] = gridi1[j];             /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                              /* VERT = del */
                gridi[j] = gridi[j-1];            /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }   /* for j */
    }   /* for i */

[Figure: the filled DP grid of (Ins, Del, Sub, Hit) counts for Example 1.]

• Example 1
  Correct: A C B C C
  Test:    B A B C
  Alignment 1 (WER = 60%): Ins B, Hit A, Del C, Hit B, Hit C, Del C
  (1 insertion + 2 deletions over 5 reference words = 3/5 = 60%; there is still another optimal alignment.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

[Figure: the filled DP grid of (Ins, Del, Sub, Hit) counts for Example 2.]

• Example 2
  Correct: A C B C C
  Test:    B A A C
  Alignment 1 (WER = 80%): Ins B, Hit A, Del C, Sub B, Hit C, Del C
  Alignment 2 (WER = 80%): Ins B, Hit A, Sub C, Del B, Hit C, Del C
  Alignment 3 (WER = 80%): Sub A, Sub C, Sub B, Hit C, Del C

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion, and insertion errors:
  – HTK error penalties: subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

Reference (each character is preceded by two "100000" fields in the label file):
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……

ASR Output:
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

------------------------ Overall Results (all 506 stories) --------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]

------------------------ Overall Results (first story) ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]

------------------------ Overall Results (first 100 stories) ------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]

------------------------ Overall Results (first 200 stories) ------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles, A and B, containing balls of two colors (R and G). Observed data O: the "ball sequence"; latent data S: the "bottle sequence". Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).]

[Figure: a 3-state HMM with discrete output distributions (e.g., A:.3 B:.2 C:.5, A:.7 B:.1 C:.2, A:.3 B:.6 C:.1) and the associated transition probabilities; given the observations o1 o2 … oT with likelihood p(O|λ), re-estimation yields a new model λ̄ with p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate A, B without consideration of the state sequence
  – Two Major Steps
    • E: expectation, with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations (i.e., an expectation over S given O and λ)
    • M: provides a new estimate of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = x_1, x_2, …, x_n:
  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X | Φ) is maximum.
    For example, if Φ = (μ, Σ) is the parameter set of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

      μ_ML = (1/n) Σ_{i=1}^{n} x_i
      Σ_ML = (1/n) Σ_{i=1}^{n} (x_i - μ_ML)(x_i - μ_ML)^T

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior p(Φ | X) is maximum.

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O | λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S | λ)
• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O | λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S | λ), and compute from it a new λ̄, the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression, with the expectation taken over S:

      P(O, S | λ) = P(S | O, λ) P(O | λ)    (Bayes' rule: complete-data vs. incomplete-data likelihood)
      log P(O | λ) = log P(O, S | λ) - log P(S | O, λ)

      Taking the expectation over S with respect to P(S | O, λ), where λ is the current model and λ̄ is the unknown model setting to be estimated:

      log P(O | λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄) - Σ_S P(S | O, λ) log P(S | O, λ̄)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O | λ̄) as follows:

      log P(O | λ̄) = Q(λ, λ̄) - H(λ, λ̄),   where
      Q(λ, λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄)
      H(λ, λ̄) = Σ_S P(S | O, λ) log P(S | O, λ̄)

    • We want log P(O | λ̄) ≥ log P(O | λ), i.e.

      log P(O | λ̄) - log P(O | λ) = [ Q(λ, λ̄) - Q(λ, λ) ] - [ H(λ, λ̄) - H(λ, λ) ]

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property:

    H(λ, λ̄) - H(λ, λ) = Σ_S P(S | O, λ) log [ P(S | O, λ̄) / P(S | O, λ) ]
                       ≤ Σ_S P(S | O, λ) [ P(S | O, λ̄) / P(S | O, λ) - 1 ]     (log x ≤ x - 1; Jensen's inequality)
                       = Σ_S P(S | O, λ̄) - Σ_S P(S | O, λ) = 1 - 1 = 0

  (H(λ, λ̄) - H(λ, λ) is the negative of the Kullback-Leibler (KL) distance between P(S | O, λ) and P(S | O, λ̄))

  – Therefore, for maximizing log P(O | λ̄), we only need to maximize the Q-function (auxiliary function)

    Q(λ, λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄)

    (the expectation of the complete-data log-likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function

    Q(λ, λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄) = Σ_S [ P(O, S | λ) / P(O | λ) ] log P(O, S | λ̄)

  – where P(O, S | λ) and log P(O, S | λ̄) can be expressed as

    P(O, S | λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)

    log P(O, S | λ̄) = log π̄_{s_1} + Σ_{t=2}^{T} log ā_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄), where

    Q_π(λ, π̄) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O | λ) ] log π̄_i

    Q_a(λ, ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O | λ) ] log ā_ij

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O | λ) ] log b̄_j(v_k)

  Each of the three terms is of the form Σ_j w_j log y_j (weights w_j, probabilities y_j).

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_ij, and b̄_j(k)
  – They can be maximized individually
  – They are all of the same form:

    F(y_1, y_2, …, y_N) = Σ_{j=1}^{N} w_j log y_j,   where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0,

    has its maximum value at  y_j = w_j / Σ_{j'=1}^{N} w_{j'}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply the Lagrange multiplier method

  Suppose F = Σ_{j=1}^{N} w_j log y_j, subject to the constraint Σ_{j=1}^{N} y_j = 1.

  By applying a Lagrange multiplier ℓ:

    F = Σ_{j=1}^{N} w_j log y_j + ℓ ( Σ_{j=1}^{N} y_j - 1 )

    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = -ℓ y_j,  for all j

    Σ_{j=1}^{N} w_j = -ℓ Σ_{j=1}^{N} y_j = -ℓ

    ⇒   y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as

    π̄_i = P(O, s_1 = i | λ) / P(O | λ) = γ_1(i)

    ā_ij = Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / Σ_{t=1}^{T-1} P(O, s_t = i | λ)
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    b̄_j(v_k) = Σ_{t=1, o_t = v_k}^{T} P(O, s_t = j | λ) / Σ_{t=1}^{T} P(O, s_t = j | λ)
             = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observations do not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

    b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)
           = Σ_{k=1}^{M} c_jk (2π)^(-L/2) |Σ_jk|^(-1/2) exp( -(1/2)(o - μ_jk)^T Σ_jk^{-1} (o - μ_jk) ),
    with Σ_{k=1}^{M} c_jk = 1

[Figure: the distribution for state i is a weighted sum of Gaussians N_1, N_2, N_3 with mixture weights w_i1, w_i2, w_i3]

SP - Berlin Chen 95
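To make the mixture output probability concrete, here is a small C sketch that evaluates b_j(o) for one state under a diagonal-covariance simplification; the diagonal assumption and the array layout are choices of this sketch, not of the slides.

    /* Sketch: b_j(o) = sum_k c_jk N(o; mu_jk, Sigma_jk) with diagonal covariances. */
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    double gmm_output_prob(int M, int D, const double *c,
                           const double *mu, const double *var,  /* each M x D */
                           const double *o)                      /* D          */
    {
        double p = 0.0;
        for (int k = 0; k < M; k++) {
            double log_g = -0.5 * D * log(2.0 * M_PI);            /* Gaussian normaliser */
            for (int d = 0; d < D; d++) {
                double diff = o[d] - mu[k*D+d];
                log_g += -0.5 * log(var[k*D+d]) - 0.5 * diff * diff / var[k*D+d];
            }
            p += c[k] * exp(log_g);                               /* weight c_jk times N(o; ...) */
        }
        return p;
    }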

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

    p(O | λ) = Σ_S p(O, S | λ) = Σ_S π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
             = Σ_S Σ_K p(O, S, K | λ)

  where K = (k_1, k_2, …, k_T) is one of the possible mixture-component sequences along the state sequence S, and

    p(O, S, K | λ) = π_{s_1} c_{s_1 k_1} b_{s_1 k_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)

  (Note: Π_{t=1}^{T} Σ_{k=1}^{M} a_{t,k} = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} … Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t,k_t}, which is how the inner sum over all mixture sequences K arises.)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ̄) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̄)
             = Σ_S Σ_K [ p(O, S, K | λ) / p(O | λ) ] log p(O, S, K | λ̄)

  with

    log p(O, S, K | λ̄) = log π̄_{s_1} + Σ_{t=2}^{T} log ā_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̄_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c̄_{s_t k_t}

  so that Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
  (initial probabilities, state transition probabilities, Gaussian mixture density functions, and mixture-component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the last two terms:

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b̄_jk(o_t)

    Q_c(λ, c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log c̄_jk

  where γ_t(j, k) = P(s_t = j, k_t = k | O, λ)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). For the Gaussian component,

    log b̄_jk(o_t) = -(L/2) log(2π) - (1/2) log |Σ̄_jk| - (1/2)(o_t - μ̄_jk)^T Σ̄_jk^{-1} (o_t - μ̄_jk)

Setting the derivative of Q_b(λ, b̄) with respect to μ̄_jk to zero (using d(x^T C x)/dx = (C + C^T) x and the fact that Σ̄_jk is symmetric):

    ∂Q_b(λ, b̄)/∂μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) Σ̄_jk^{-1} (o_t - μ̄_jk) = 0

    ⇒   μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

  (the weighted average (mean) of the observations at state j and mixture k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Similarly, setting the derivative of Q_b(λ, b̄) with respect to Σ̄_jk^{-1} to zero (using d log det(X)/dX = (X^{-1})^T, d(a^T X b)/dX = a b^T, and the symmetry of Σ̄_jk):

    ∂Q_b(λ, b̄)/∂Σ̄_jk^{-1} = (1/2) Σ_{t=1}^{T} γ_t(j, k) [ Σ̄_jk - (o_t - μ̄_jk)(o_t - μ̄_jk)^T ] = 0

    ⇒   Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̄_jk)(o_t - μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

  (the weighted covariance of the observations at state j and mixture k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    μ̄_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
         = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

    Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̄_jk)(o_t - μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

    c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)

Page 20: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 20

Hidden Markov Model (cont)

bull Useful formulas (Cont)ndash Total Probability Theorem

BB

BallBall

BdBBfBAfdBBAf

BBPBAPBAPAP

continuous is if

disjoint and disrete is if

nn

n

xPxPxPxxxPxxx

2121

21

tindependen are if

z

kz zdzzqzf

zkqkzPzqE continuous

discrete

z

Expectation

marginal probability

A B4

B1

B2B3

B5

Venn Diagram

SP - Berlin Chen 21

Three Basic Problems for HMM

bull Given an observation sequence O=(o1o2hellipoT)and an HMM =(SAB)ndash Problem 1

How to efficiently compute P(O|) Evaluation problem

ndash Problem 2How to choose an optimal state sequence S=(s1s2helliphellip sT) Decoding Problem

ndash Problem 3How to adjust the model parameter =(AB) to maximize P(O|) Learning Training Problem

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

• H(λ, λ̄) − H(λ, λ) has the following property:
    H(λ, λ̄) − H(λ, λ) = Σ_S P(S|O,λ) log [ P(S|O,λ̄) / P(S|O,λ) ]
                       ≤ Σ_S P(S|O,λ) [ P(S|O,λ̄) / P(S|O,λ) − 1 ]        (Jensen's inequality: log x ≤ x − 1)
                       = Σ_S P(S|O,λ̄) − Σ_S P(S|O,λ) = 0
  (equivalently, H(λ, λ) − H(λ, λ̄) is a Kullback-Leibler (KL) distance, which is never negative)
– Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function (auxiliary function)
    Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)
  i.e. the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
– By maximizing the auxiliary function
    Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄) = Σ_S [ P(O,S|λ) / P(O|λ) ] log P(O,S|λ̄)
– where P(O,S|λ) and log P(O,S|λ̄) can be expressed as
    P(O,S|λ) = π_{s1} b_{s1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
    log P(O,S|λ̄) = log π̄_{s1} + Σ_{t=2}^{T} log ā_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

• Rewrite the auxiliary function as Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄), where
    Q_π(λ, π̄) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O|λ) ] log π̄_i
    Q_a(λ, ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] log ā_ij
    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{all k} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] log b̄_j(v_k)
  Each term is of the form Σ_i w_i log y_i.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

• The auxiliary function contains three independent terms, π̄_i, ā_ij and b̄_j(k)
– They can be maximized individually
– All are of the same form:
    F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j ,   where y_j ≥ 0 and Σ_{j=1}^{N} y_j = 1,
  has its maximum value when
    y_j = w_j / Σ_{j=1}^{N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

• Proof: apply a Lagrange multiplier ℓ with the constraint Σ_{j=1}^{N} y_j = 1:
    F = Σ_{j=1}^{N} w_j log y_j + ℓ ( Σ_{j=1}^{N} y_j − 1 )
    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = −ℓ y_j ,  for every j
  Summing over j and using the constraint: Σ_{j=1}^{N} w_j = −ℓ Σ_{j=1}^{N} y_j = −ℓ, hence
    y_j = w_j / Σ_{j=1}^{N} w_j
Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

• The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as:
    π̄_i = P(O, s_1 = i | λ) / P(O|λ) = γ_1(i)
    ā_ij = [ Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] / [ Σ_{t=1}^{T-1} P(O, s_t = i | λ) / P(O|λ) ]
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
    b̄_i(k) = [ Σ_{t: o_t = v_k} P(O, s_t = i | λ) / P(O|λ) ] / [ Σ_{t=1}^{T} P(O, s_t = i | λ) / P(O|λ) ]
           = Σ_{t: o_t = v_k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)
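To make the update step concrete, the following is a minimal C sketch of the discrete-HMM re-estimation formulas above, assuming the posteriors gamma[t][i] and xi[t][i][j] have already been computed with the forward-backward procedure; the array names, the fixed sizes N, K, T and the symbol-index array obs[] are illustrative assumptions, not part of the original slides.

  #define N 3    /* number of states  (illustrative) */
  #define K 2    /* number of symbols (illustrative) */
  #define T 10   /* number of observation frames     */

  /* Re-estimate pi, a, b from the posteriors
       gamma[t][i] = P(s_t = i | O, lambda)
       xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda)  */
  void reestimate_discrete(const double gamma[T][N],
                           const double xi[T - 1][N][N],
                           const int obs[T],
                           double pi[N], double a[N][N], double b[N][K])
  {
      for (int i = 0; i < N; i++) {
          /* new initial probability: expected frequency in state i at t = 1 */
          pi[i] = gamma[0][i];

          /* new transition probabilities:
             expected #transitions i->j / expected #transitions out of i */
          double occ = 0.0;                        /* sum_t gamma_t(i), t = 1..T-1 */
          for (int t = 0; t < T - 1; t++) occ += gamma[t][i];
          for (int j = 0; j < N; j++) {
              double num = 0.0;                    /* sum_t xi_t(i,j) */
              for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
              a[i][j] = (occ > 0.0) ? num / occ : 0.0;
          }

          /* new emission probabilities:
             expected #times in state i observing v_k / expected #times in state i */
          double occ_all = occ + gamma[T - 1][i];  /* sum over t = 1..T */
          for (int k = 0; k < K; k++) {
              double num = 0.0;
              for (int t = 0; t < T; t++)
                  if (obs[t] == k) num += gamma[t][i];
              b[i][k] = (occ_all > 0.0) ? num / occ_all : 0.0;
          }
      }
  }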

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
    b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)
    N(o; μ_jk, Σ_jk) = (2π)^{-L/2} |Σ_jk|^{-1/2} exp[ −(1/2) (o − μ_jk)^T Σ_jk^{-1} (o − μ_jk) ]
  with mixture weights c_jk ≥ 0 and Σ_{k=1}^{M} c_jk = 1
[Figure: the distribution for state i drawn as a weighted sum of Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3.]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

• Express b_j(o_t) with respect to each single mixture component b_jk(o_t):
    p(O|λ) = Σ_S p(O, S|λ) = Σ_S π_{s1} Π_{t=2}^{T} a_{s_{t-1} s_t} Π_{t=1}^{T} b_{s_t}(o_t)
           = Σ_S π_{s1} Π_{t=2}^{T} a_{s_{t-1} s_t} Π_{t=1}^{T} Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t)
           = Σ_S Σ_K p(O, S, K|λ)
  where K = (k_1, k_2, ..., k_T) is one of the possible mixture-component sequences along the state sequence S, and
    p(O, S, K|λ) = π_{s1} Π_{t=2}^{T} a_{s_{t-1} s_t} Π_{t=1}^{T} c_{s_t k_t} b_{s_t k_t}(o_t)
  Note (interchanging product and sum):
    Π_{t=1}^{T} Σ_{k=1}^{M} a_{t k} = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

• Therefore, an auxiliary function for the EM algorithm can be written as
    Q(λ, λ̄) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̄)
  with
    log p(O, S, K | λ̄) = log π̄_{s1} + Σ_{t=2}^{T} log ā_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̄_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c̄_{s_t k_t}
  so that Q = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄):
  initial probabilities, state-transition probabilities, Gaussian density functions, and mixture-component weights.

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

• The only difference, compared with discrete HMM training, lies in the terms for the Gaussian densities and the mixture weights:
    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b̄_jk(o_t)
    Q_c(λ, c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log c̄_jk

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ) and b̄_jk(o_t) = N(o_t; μ̄_jk, Σ̄_jk). Then
    log b̄_jk(o_t) = −(L/2) log(2π) − (1/2) log |Σ̄_jk| − (1/2) (o_t − μ̄_jk)^T Σ̄_jk^{-1} (o_t − μ̄_jk)
    Q_b(λ, b̄) = Σ_{t=1}^{T} Σ_{j=1}^{N} Σ_{k=1}^{M} γ_t(j, k) log b̄_jk(o_t)
Setting the derivative with respect to the mean vector to zero (using d(x^T C x)/dx = (C + C^T) x, and Σ̄_jk^{-1} is symmetric here):
    ∂Q_b / ∂μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) Σ̄_jk^{-1} (o_t − μ̄_jk) = 0
    ⇒ μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

Similarly, setting the derivative with respect to the (inverse) covariance matrix to zero (using d log det(X)/dX = (X^{-1})^T, d(a^T X b)/dX = a b^T, and Σ̄_jk symmetric):
    ∂Q_b / ∂Σ̄_jk^{-1} = (1/2) Σ_{t=1}^{T} γ_t(j, k) [ Σ̄_jk − (o_t − μ̄_jk)(o_t − μ̄_jk)^T ] = 0
    ⇒ Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

• The new model parameters for each mixture component and mixture weight can be expressed as:
    μ̄_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
         = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)
    Σ̄_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)
    c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
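As a concrete illustration of the quantities used above, here is a small C sketch that evaluates a diagonal-covariance Gaussian mixture for one state and turns it into the per-component responsibilities that split γ_t(j) into γ_t(j,k). It is only a sketch under stated assumptions (diagonal covariances, illustrative array names mean[k][d], var[k][d], c[k] and sizes M, D), not code from the slides.

  #include <math.h>

  #define M 4    /* number of mixture components (illustrative) */
  #define D 13   /* feature dimension (illustrative)            */

  /* log N(o; mu, Sigma) for a diagonal-covariance Gaussian */
  static double log_gauss(const double o[D], const double mean[D], const double var[D])
  {
      double lp = -0.5 * D * log(2.0 * 3.141592653589793);
      for (int d = 0; d < D; d++) {
          double diff = o[d] - mean[d];
          lp -= 0.5 * (log(var[d]) + diff * diff / var[d]);
      }
      return lp;
  }

  /* b_j(o) = sum_k c_k N(o; mean_k, var_k); also returns
     resp[k] = c_k N_k(o) / b_j(o), the factor by which
     gamma_t(j) is shared among the mixture components.   */
  double mixture_output(const double o[D], const double c[M],
                        const double mean[M][D], const double var[M][D],
                        double resp[M])
  {
      double b = 0.0;
      for (int k = 0; k < M; k++) {
          resp[k] = c[k] * exp(log_gauss(o, mean[k], var[k]));
          b += resp[k];
      }
      for (int k = 0; k < M; k++)
          resp[k] = (b > 0.0) ? resp[k] / b : 0.0;
      return b;
  }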


SP - Berlin Chen 21

Three Basic Problems for HMM

• Given an observation sequence O = (o1, o2, ..., oT) and an HMM λ = (S, A, B, π):
– Problem 1: How to efficiently compute P(O|λ)?  (Evaluation problem)
– Problem 2: How to choose an optimal state sequence S = (s1, s2, ..., sT)?  (Decoding problem)
– Problem 3: How to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?  (Learning / training problem)

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and λ, find P(O|λ) = Prob[observing O given λ]
• Direct Evaluation
– Evaluate all possible state sequences S of length T that could generate the observation sequence O:
    P(O|λ) = Σ_{all S} P(O, S|λ) = Σ_{all S} P(S|λ) P(O|S, λ)
– The probability of each path S:
• By the Markov assumption (first-order HMM),
    P(S|λ) = P(s1|λ) Π_{t=2}^{T} P(s_t | s1, ..., s_{t-1}, λ)        (by the chain rule)
           = P(s1|λ) Π_{t=2}^{T} P(s_t | s_{t-1}, λ)                 (by the Markov assumption)
           = π_{s1} a_{s1 s2} a_{s2 s3} ... a_{s_{T-1} s_T}

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)
– The joint output probability along the path S:
• By the output-independence assumption, the probability that a particular observation symbol/vector is emitted at time t depends only on the state s_t, and is conditionally independent of the past observations:
    P(O|S, λ) = P(o1, ..., oT | s1, ..., sT, λ)
              = Π_{t=1}^{T} P(o_t | o1, ..., o_{t-1}, s1, ..., sT, λ)
              = Π_{t=1}^{T} P(o_t | s_t, λ)        (by the output-independence assumption)
              = Π_{t=1}^{T} b_{s_t}(o_t)

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

• Direct Evaluation (cont.)
    P(O|λ) = Σ_{all S} P(S|λ) P(O|S, λ)
           = Σ_{s1, s2, ..., sT} π_{s1} b_{s1}(o1) a_{s1 s2} b_{s2}(o2) ... a_{s_{T-1} s_T} b_{s_T}(oT)
– Huge computation requirement: O(N^T)
  (complexity: MUL ≈ (2T − 1) N^T, ADD ≈ N^T − 1; exponential computational complexity)
• A more efficient algorithm can be used to evaluate P(O|λ)
– the Forward/Backward Procedure (Algorithm)

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

• Based on the HMM assumptions, the calculation of P(s_t | s_{t-1}) and P(o_t | s_t) involves only s_{t-1}, s_t and o_t, so it is possible to compute the likelihood with a recursion on t
• Forward variable:
    α_t(i) = P(o1, o2, ..., o_t, s_t = i | λ)
– the probability that the HMM is in state i at time t, having generated the partial observation o1 o2 ... o_t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

• Algorithm
  1. Initialization:  α_1(i) = π_i b_i(o1),  1 ≤ i ≤ N
  2. Induction:       α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}),  1 ≤ t ≤ T−1,  1 ≤ j ≤ N
  3. Termination:     P(O|λ) = Σ_{i=1}^{N} α_T(i)
– Complexity: O(N²T)   (MUL: N(N+1)(T−1) + N, ADD: N(N−1)(T−1) + (N−1))
• Based on the lattice (trellis) structure
– Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
• All state sequences, regardless of how long previously, merge to the N nodes (states) at each time instance t
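A minimal C sketch of the forward procedure above, for a discrete-observation HMM; the array names (pi, a, b, obs) and the fixed sizes are illustrative assumptions, and a practical implementation would work with scaled or log-domain probabilities to avoid underflow (see the probability-addition slides later).

  #define N 3    /* number of states (illustrative) */
  #define K 3    /* number of observation symbols   */
  #define T 7    /* number of observation frames    */

  /* Forward procedure: returns P(O | lambda) and fills alpha[t][i]. */
  double forward(const double pi[N], const double a[N][N],
                 const double b[N][K], const int obs[T], double alpha[T][N])
  {
      /* 1. Initialization: alpha_1(i) = pi_i * b_i(o_1) */
      for (int i = 0; i < N; i++)
          alpha[0][i] = pi[i] * b[i][obs[0]];

      /* 2. Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1}) */
      for (int t = 0; t < T - 1; t++)
          for (int j = 0; j < N; j++) {
              double sum = 0.0;
              for (int i = 0; i < N; i++)
                  sum += alpha[t][i] * a[i][j];
              alpha[t + 1][j] = sum * b[j][obs[t + 1]];
          }

      /* 3. Termination: P(O|lambda) = sum_i alpha_T(i) */
      double p = 0.0;
      for (int i = 0; i < N; i++)
          p += alpha[T - 1][i];
      return p;
  }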

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

Derivation of the induction step:
    α_{t+1}(j) = P(o1, ..., o_{t+1}, s_{t+1} = j | λ)
               = P(o1, ..., o_t, s_{t+1} = j | λ) P(o_{t+1} | o1, ..., o_t, s_{t+1} = j, λ)
               = P(o1, ..., o_t, s_{t+1} = j | λ) b_j(o_{t+1})                                   (output-independence assumption)
               = [ Σ_{i=1}^{N} P(o1, ..., o_t, s_t = i, s_{t+1} = j | λ) ] b_j(o_{t+1})
               = [ Σ_{i=1}^{N} P(o1, ..., o_t, s_t = i | λ) P(s_{t+1} = j | s_t = i, λ) ] b_j(o_{t+1})   (first-order Markov assumption)
               = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1})
(using P(A, B) = P(A) P(B|A) and P(o_{t+1} | s_{t+1} = j, λ) = b_j(o_{t+1}))

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

• Example:  α_3(3) = P(o1, o2, o3, s3 = 3 | λ) = [ α_2(1) a_13 + α_2(2) a_23 + α_2(3) a_33 ] b_3(o3)
[Figure: state-time trellis (states s1, s2, s3 against times 1 ... T and observations O1 ... OT) showing the three arcs into state 3 at time 3 that are summed; a node denotes that b_j(o_t) has been computed, an arc that a_ij has been computed.]

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

[Figure: the three-state Dow Jones Industrial Average HMM with its transition probabilities (0.6, 0.5, 0.4, 0.7, 0.1, 0.3, ...) and the forward lattice for the first two frames.]
Example forward step:  (0.6 × 0.35 + 0.5 × 0.02 + 0.4 × 0.09) × 0.7 = 0.1792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

• Backward variable:  β_t(i) = P(o_{t+1}, o_{t+2}, ..., oT | s_t = i, λ)
• Algorithm
  1. Initialization:  β_T(i) = 1,  1 ≤ i ≤ N
  2. Induction:       β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T−1, ..., 1,  1 ≤ i ≤ N
  3. Termination:     P(O|λ) = Σ_{j=1}^{N} π_j b_j(o1) β_1(j)
– Complexity: O(N²T)   (MUL ≈ 2N²(T−1) + N, ADD ≈ N(N−1)(T−1) + (N−1))
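A matching C sketch of the backward procedure, under the same illustrative assumptions (discrete observations, unscaled probabilities) and reusing the N, K, T constants from the forward sketch above.

  /* Backward procedure: fills beta[t][i] and returns P(O | lambda).
     N, K, T as in the forward-procedure sketch.                     */
  double backward(const double pi[N], const double a[N][N],
                  const double b[N][K], const int obs[T], double beta[T][N])
  {
      /* 1. Initialization: beta_T(i) = 1 */
      for (int i = 0; i < N; i++)
          beta[T - 1][i] = 1.0;

      /* 2. Induction: beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j) */
      for (int t = T - 2; t >= 0; t--)
          for (int i = 0; i < N; i++) {
              double sum = 0.0;
              for (int j = 0; j < N; j++)
                  sum += a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j];
              beta[t][i] = sum;
          }

      /* 3. Termination: P(O|lambda) = sum_j pi_j b_j(o_1) beta_1(j) */
      double p = 0.0;
      for (int j = 0; j < N; j++)
          p += pi[j] * b[j][obs[0]] * beta[0][j];
      return p;
  }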

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

• Why?
    α_t(i) β_t(i) = P(o1, ..., o_t, s_t = i | λ) · P(o_{t+1}, ..., oT | s_t = i, λ)
                  = P(o1, ..., o_t, s_t = i | λ) · P(o_{t+1}, ..., oT | o1, ..., o_t, s_t = i, λ)
                  = P(o1, ..., o_t, o_{t+1}, ..., oT, s_t = i | λ)
                  = P(O, s_t = i | λ)
  (the conditioning on o1, ..., o_t can be dropped because, given s_t = i, the future observations are independent of the past ones)
• Therefore
    P(O, s_t = i | λ) = α_t(i) β_t(i)
    P(O|λ) = Σ_{i=1}^{N} P(O, s_t = i | λ) = Σ_{i=1}^{N} α_t(i) β_t(i)   for any t

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

• Example:  β_2(3) = P(o3, o4, ..., oT | s2 = 3, λ) = a_31 b_1(o3) β_3(1) + a_32 b_2(o3) β_3(2) + a_33 b_3(o3) β_3(3)
[Figure: state-time trellis (states s1, s2, s3 against times 1 ... T) showing the three arcs leaving state 3 at time 2 that are summed in the backward recursion.]

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

[Figure: the HMM drawn as a Bayesian network: a Markov chain of hidden states S1 → S2 → S3 → ... → ST, with each state St emitting the corresponding observation Ot.]

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S = (s1, s2, ..., sT)?
• The first optimality criterion: choose the states s_t that are individually most likely at each time t
  Define the a-posteriori probability variable (state occupation probability):
    γ_t(i) = P(s_t = i | O, λ) = P(O, s_t = i | λ) / P(O|λ)
           = α_t(i) β_t(i) / Σ_{m=1}^{N} α_t(m) β_t(m)
  – a soft alignment of an HMM state to the observation (feature) at time t
– Solution: s_t* = arg max_{1 ≤ i ≤ N} γ_t(i),  1 ≤ t ≤ T
• Problem: maximizing the probability at each time t individually, S* = (s1*, s2*, ..., sT*) may not be a valid sequence (e.g., a_{s_t* s_{t+1}*} = 0)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

• Example:  P(s3 = 3, O | λ) = α_3(3) β_3(3)
[Figure: state-time trellis (states s1, s2, s3 against times 1 ... T); the forward part α_3(3) covers O1 ... O3 and the backward part β_3(3) covers O4 ... OT. The figure also illustrates how individually-most-likely states can form an invalid path, e.g. when a_23 = 0.]

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

• The second optimality criterion: the Viterbi algorithm, which can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
– Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
• Find a single optimal state sequence S = (s1, s2, ..., sT)
– How to find the second, third, etc. optimal state sequences is difficult
– The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
• State-time trellis diagram
1. R. Bellman, "On the Theory of Dynamic Programming," Proceedings of the National Academy of Sciences, 1952
2. A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, 13 (2), 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm: find the best state sequence S* for a given observation sequence O = (o1, o2, ..., oT)
  Define a new variable
    δ_t(i) = max_{s1, s2, ..., s_{t-1}} P(s1, ..., s_{t-1}, s_t = i, o1, ..., o_t | λ)
  = the best score along a single path, at time t, which accounts for the first t observations and ends in state i
  1. Initialization:  δ_1(i) = π_i b_i(o1),  ψ_1(i) = 0
  2. Induction:       δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ] b_j(o_t);   ψ_t(j) = arg max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ]   (for backtracking)
  3. Termination:     P* = max_{1 ≤ i ≤ N} δ_T(i),  s_T* = arg max_{1 ≤ i ≤ N} δ_T(i)
  4. Backtracking:    s_t* = ψ_{t+1}(s_{t+1}*),  t = T−1, ..., 1, giving S* = (s1*, s2*, ..., sT*)
– Complexity: O(N²T)
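A minimal C sketch of the Viterbi algorithm above, in the probability domain (the log-domain variant on a later slide simply replaces products with sums of logarithms). The array names are illustrative assumptions, reusing the N, K, T constants from the forward sketch.

  /* Viterbi decoding: returns the best path probability P* and
     writes the optimal state sequence into path[0..T-1].
     N, K, T as in the forward-procedure sketch.                 */
  double viterbi(const double pi[N], const double a[N][N],
                 const double b[N][K], const int obs[T], int path[T])
  {
      double delta[T][N];   /* best score ending in state j at time t */
      int    psi[T][N];     /* backpointer for backtracking           */

      /* 1. Initialization */
      for (int i = 0; i < N; i++) {
          delta[0][i] = pi[i] * b[i][obs[0]];
          psi[0][i] = 0;
      }

      /* 2. Induction: keep only the best incoming path */
      for (int t = 1; t < T; t++)
          for (int j = 0; j < N; j++) {
              int best_i = 0;
              double best = delta[t - 1][0] * a[0][j];
              for (int i = 1; i < N; i++) {
                  double score = delta[t - 1][i] * a[i][j];
                  if (score > best) { best = score; best_i = i; }
              }
              delta[t][j] = best * b[j][obs[t]];
              psi[t][j] = best_i;
          }

      /* 3. Termination */
      int best_last = 0;
      for (int i = 1; i < N; i++)
          if (delta[T - 1][i] > delta[T - 1][best_last]) best_last = i;

      /* 4. Backtracking */
      path[T - 1] = best_last;
      for (int t = T - 2; t >= 0; t--)
          path[t] = psi[t + 1][path[t + 1]];

      return delta[T - 1][best_last];
  }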

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

[Figure: state-time trellis (states s1, s2, s3 against times 1 ... T and observations O1 ... OT) illustrating the Viterbi recursion; at each node, e.g. δ_3(3), only the best incoming path is kept and remembered.]

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

[Figure: the three-state Dow Jones Industrial Average HMM and its Viterbi lattice for the first two frames.]
Example Viterbi step:  (0.6 × 0.35) × 0.7 = 0.147  (only the best predecessor is kept, instead of the sum 0.1792 used by the forward procedure)

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

• Algorithm in the logarithmic form: find the best state sequence S* for a given observation sequence O = (o1, o2, ..., oT)
  Define a new variable
    δ_t(i) = max_{s1, ..., s_{t-1}} log P(s1, ..., s_{t-1}, s_t = i, o1, ..., o_t | λ)
  = the best score along a single path, at time t, which accounts for the first t observations and ends in state i
  1. Initialization:  δ_1(i) = log π_i + log b_i(o1),  ψ_1(i) = 0
  2. Induction:       δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) + log a_ij ] + log b_j(o_t);   ψ_t(j) = arg max_{1 ≤ i ≤ N} [ δ_{t-1}(i) + log a_ij ]
  3. Termination:     log P* = max_{1 ≤ i ≤ N} δ_T(i),  s_T* = arg max_{1 ≤ i ≤ N} δ_T(i)
  4. Backtracking:    s_t* = ψ_{t+1}(s_{t+1}*),  t = T−1, ..., 1

SP - Berlin Chen 42

Homework 1
• A three-state Hidden Markov Model for the Dow Jones Industrial average
– Find the probability P(up, up, unchanged, down, unchanged, down, up | λ)
– Find the optimal state sequence of the model which generates the observation sequence (up, up, unchanged, down, unchanged, down, up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

• In the forward-backward algorithm, operations are usually implemented in the logarithmic domain
• Assume that we want to add P1 and P2, given only log_b P1 and log_b P2:
    if P1 ≥ P2:  log_b(P1 + P2) = log_b P1 + log_b(1 + b^{log_b P2 − log_b P1})
    else:        log_b(P1 + P2) = log_b P2 + log_b(1 + b^{log_b P1 − log_b P2})
  The values of log_b(1 + b^x) can be saved in a table to speed up the operations.

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

• An example code (HTK-style):

  #include <math.h>

  #define LZERO  (-1.0E10)            /* ~log(0) */
  #define LSMALL (-0.5E10)            /* log values < LSMALL are set to LZERO */
  #define minLogExp (-log(-LZERO))    /* ~= -23 */

  double LogAdd(double x, double y)
  {
      double temp, diff, z;
      if (x < y) {
          temp = x; x = y; y = temp;
      }
      diff = y - x;                   /* notice that diff <= 0 */
      if (diff < minLogExp)           /* if y is far smaller than x */
          return (x < LSMALL) ? LZERO : x;
      else {
          z = exp(diff);
          return x + log(1.0 + z);
      }
  }

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O1, ..., OL|λ), or log P(O1, ..., OL|λ)?
– A typical problem of "inferential statistics"
– The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
– The data is incomplete because of the hidden state sequences
– Well solved by the Baum-Welch (also known as forward-backward) algorithm and the EM (Expectation-Maximization) algorithm
• Iterative update and improvement
• Based on the Maximum Likelihood (ML) criterion
– Suppose we have L training utterances O1, O2, ..., OL for the HMM, and S is a possible state sequence of the HMM:
    log P(O1, ..., OL|λ) = Σ_{l=1}^{L} log P(Ol|λ) = Σ_{l=1}^{L} log [ Σ_{all S} P(Ol, S|λ) ]
  The "log of sum" form is difficult to deal with.

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

• Hard Assignment
– Given that the data follow a multinomial distribution
[Figure: four samples (black/white balls), all assigned to state S1.]
    P(B|S1) = 2/4 = 0.5
    P(W|S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)
• Soft Assignment
– Given that the data follow a multinomial distribution
– Maximize the likelihood of the data given the alignment
[Figure: four samples, each softly assigned to states S1 and S2 with posteriors P(s_t = S1|O) and P(s_t = S2|O): (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5); the two posteriors sum to 1 for each sample.]
    P(B|S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
    P(W|S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
    P(B|S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27
    P(W|S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

• Relationship between the forward and backward variables:
    α_t(i) = P(o1, ..., o_t, s_t = i | λ) = [ Σ_{j=1}^{N} α_{t-1}(j) a_ji ] b_i(o_t)
    β_t(i) = P(o_{t+1}, ..., oT | s_t = i, λ) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)
    α_t(i) β_t(i) = P(O, s_t = i | λ),   P(O|λ) = Σ_{i=1}^{N} α_t(i) β_t(i)

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

• Define a new variable:
– the probability of being at state i at time t and at state j at time t+1
    ξ_t(i, j) = P(s_t = i, s_{t+1} = j | O, λ)
              = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O|λ)
              = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / [ Σ_{m=1}^{N} Σ_{n=1}^{N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n) ]
  (using p(A, B) = p(B) p(A|B))
• Recall the a-posteriori probability variable:
    γ_t(i) = P(s_t = i | O, λ) = α_t(i) β_t(i) / Σ_{m=1}^{N} α_t(m) β_t(m)
  Note that γ_t(i) can also be represented as
    γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)   for t = 1, ..., T−1
[Figure: a trellis fragment showing the arc from state i at time t to state j at time t+1.]
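The two posteriors can be computed directly from the forward and backward arrays of the earlier sketches; the following C fragment is a minimal illustration under the same assumptions (the array names and the reuse of the N, K, T constants are mine, not from the slides).

  /* Occupation and transition posteriors from alpha/beta:
       gamma[t][i] = P(s_t = i | O, lambda)
       xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda)
     N, K, T as in the forward-procedure sketch.            */
  void posteriors(const double a[N][N], const double b[N][K], const int obs[T],
                  const double alpha[T][N], const double beta[T][N],
                  double p_obs,                     /* P(O | lambda) */
                  double gamma[T][N], double xi[T - 1][N][N])
  {
      for (int t = 0; t < T; t++)
          for (int i = 0; i < N; i++)
              gamma[t][i] = alpha[t][i] * beta[t][i] / p_obs;

      for (int t = 0; t < T - 1; t++)
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  xi[t][i][j] = alpha[t][i] * a[i][j]
                              * b[j][obs[t + 1]] * beta[t + 1][j] / p_obs;
  }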

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

• Example:  P(s3 = 3, s4 = 1, O | λ) = α_3(3) a_31 b_1(o4) β_4(1)
[Figure: state-time trellis highlighting the transition from state 3 at time 3 to state 1 at time 4.]

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

•  Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from state i to state j in O
•  Σ_{t=1}^{T-1} γ_t(i) = Σ_{t=1}^{T-1} Σ_{j=1}^{N} ξ_t(i, j) = expected number of transitions from state i in O
• A set of reasonable re-estimation formulas for {π, A} is:
    π̄_i = expected frequency (number of times) in state i at time t = 1 = γ_1(i)
    ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
  where γ_t(i) = P(s_t = i | O, λ) and ξ_t(i, j) = P(s_t = i, s_{t+1} = j | O, λ)
(Formulae for a single training utterance)

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

• A set of reasonable re-estimation formulas for B is:
– For discrete and finite observations, b_j(v_k) = P(o_t = v_k | s_t = j):
    b̄_j(v_k) = (expected number of times in state j observing symbol v_k) / (expected number of times in state j)
             = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
– For continuous and infinite observations, b_j(v) = f_{O|S}(o_t = v | s_t = j), modeled as a mixture of multivariate Gaussian distributions:
    b_j(v) = Σ_{k=1}^{M} c_jk N(v; μ_jk, Σ_jk) = Σ_{k=1}^{M} c_jk (2π)^{-L/2} |Σ_jk|^{-1/2} exp[ −(1/2) (v − μ_jk)^T Σ_jk^{-1} (v − μ_jk) ]

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

– For continuous and infinite observations (cont.)
• Define a new variable γ_t(j, k):
– the probability of being in state j at time t, with the k-th mixture component accounting for o_t
    γ_t(j, k) = P(s_t = j, k_t = k | O, λ)
              = P(s_t = j | O, λ) P(k_t = k | s_t = j, o_t, λ)        (observation-independence assumption applied)
              = γ_t(j) · c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm)
  Note:  Σ_{m=1}^{M} γ_t(j, m) = γ_t(j)
[Figure: the output distribution of state 1 drawn as a mixture of Gaussians N1, N2, N3 with weights c11, c12, c13.]

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

– For continuous and infinite observations (cont.):
    c̄_jk = (expected number of times in state j and mixture k) / (expected number of times in state j)
         = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
    μ̄_jk = weighted average (mean) of the observations at state j and mixture k
         = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)
    Σ̄_jk = weighted covariance of the observations at state j and mixture k
         = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j, k)
(Formulae for a single training utterance)

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

[Figure: the same three-state HMM (s1, s2, s3), e.g. for the word 台師大, trained on several utterances; the forward-backward (F-B) procedure is run on each utterance and the resulting counts are pooled.]

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

– For continuous and infinite observations (cont.), the formulas for multiple (L) training utterances pool the counts over the utterances l = 1, ..., L (each of length T_l):
    c̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} Σ_{m=1}^{M} γ_t^l(j, m)
    μ̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) o_t^l / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)
    Σ̄_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) (o_t^l − μ̄_jk)(o_t^l − μ̄_jk)^T / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)
    π̄_i = (1/L) Σ_{l=1}^{L} γ_1^l(i)   (expected frequency in state i at time t = 1)
    ā_ij = Σ_{l=1}^{L} Σ_{t=1}^{T_l − 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l − 1} γ_t^l(i)
(Formulae for multiple (L) training utterances)

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

– For discrete and finite observations (cont.), with multiple (L) training utterances:
    π̄_i = (1/L) Σ_{l=1}^{L} γ_1^l(i)   (expected frequency in state i at time t = 1)
    ā_ij = Σ_{l=1}^{L} Σ_{t=1}^{T_l − 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l − 1} γ_t^l(i)
         (expected number of transitions from state i to state j / expected number of transitions from state i)
    b̄_j(v_k) = Σ_{l=1}^{L} Σ_{t: o_t^l = v_k} γ_t^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j)
         (expected number of times in state j observing symbol v_k / expected number of times in state j)
(Formulae for multiple (L) training utterances)
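In practice, the multi-utterance formulas above are implemented as "accumulate, then normalize": the numerators and denominators are summed over all utterances before the single division at the end. The following C sketch illustrates this for the transition probabilities only; the structure and function names are illustrative assumptions (N as in the earlier sketches), not code from the slides.

  /* Pooled accumulators for a_ij over L training utterances. */
  typedef struct {
      double xi_sum[N][N];     /* sum_l sum_t xi_t(i,j)   */
      double gamma_sum[N];     /* sum_l sum_t gamma_t(i)  */
  } TransAccs;

  /* Add one utterance of length Tl to the accumulators. */
  void accumulate_utterance(TransAccs *acc, int Tl,
                            const double gamma[][N], const double xi[][N][N])
  {
      for (int t = 0; t < Tl - 1; t++)
          for (int i = 0; i < N; i++) {
              acc->gamma_sum[i] += gamma[t][i];
              for (int j = 0; j < N; j++)
                  acc->xi_sum[i][j] += xi[t][i][j];
          }
  }

  /* Normalize once, after all utterances have been accumulated. */
  void finish_transitions(const TransAccs *acc, double a[N][N])
  {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              a[i][j] = (acc->gamma_sum[i] > 0.0)
                        ? acc->xi_sum[i][j] / acc->gamma_sum[i] : 0.0;
  }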

SP - Berlin Chen 58

Semicontinuous HMMs
• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
– The semicontinuous, or tied-mixture, HMM:
    b_j(o) = Σ_{k=1}^{M} b_j(k) f(o|v_k) = Σ_{k=1}^{M} b_j(k) N(o; μ_k, Σ_k)
  where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent) and f(o|v_k) is the k-th mixture density function, or k-th codeword (shared across HMMs; M is very large)
– A combination of the discrete HMM and the continuous HMM
• A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
– Because M is large, we can simply use the L most significant values
• Experience showed that an L of about 1~3% of M is adequate
– Partial tying of the codewords f(o|v_k) for different phonetic classes

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

[Figure: two HMMs (each with states s1, s2, s3) whose state output weights b_j^1(k), b_j^2(k), b_j^3(k), k = 1 ... M, are discrete distributions over one shared codebook of Gaussian kernels N(μ_1, Σ_1), N(μ_2, Σ_2), ..., N(μ_k, Σ_k), ..., N(μ_M, Σ_M).]

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
– 3 states and 4 Gaussian mixtures per state
[Figure: N observation frames O1, O2, ..., ON segmented into the states of a left-to-right HMM s1 → s2 → s3; within each state segment, K-means splits the vectors into clusters (global mean → cluster 1 mean, cluster 2 mean, ...), giving the initial mixture parameters (μ_11, Σ_11), (μ_12, Σ_12), (μ_13, Σ_13), (μ_14, Σ_14), and so on for the other states.]

SP - Berlin Chen 65

Known Limitations of HMMs (13)

• The assumptions of conventional HMMs in speech processing:
– The state duration follows an exponential (geometric) distribution
    d_i(t) = a_ii^{t-1} (1 − a_ii)
• which does not provide an adequate representation of the temporal structure of speech
– First-order (Markov) assumption: the state transition depends only on the origin and destination states
– Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames
Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications.

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

[Figure: state-duration modeling alternatives: the implicit geometric/exponential distribution of a standard HMM, compared with an empirical distribution, a Gaussian distribution, and a Gamma distribution.]

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

[Figure: a fully connected three-state HMM (s1, s2, s3) with near-uniform transition probabilities (0.34 / 0.33 / 0.33 out of each state) and discrete output distributions over {A, B, C}: {A:0.34, B:0.33, C:0.33}, {A:0.33, B:0.34, C:0.33}, {A:0.33, B:0.33, C:0.34} (one per state).]

TrainSet 1:
 1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA
TrainSet 2:
 1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.
P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.
P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB
P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Block diagram: the speech signal passes through Feature Extraction to give the feature sequence X; the likelihoods p(X|M1), p(X|M2), ..., p(X|MV) and p(X|MSil) of the word models (and the silence model) are computed in parallel, and the Most-Likely-Word Selector outputs the recognized label.]
    Label(X) = arg max_k p(X | M_k)
Viterbi approximation:
    Label(X) = arg max_k max_S p(X, S | M_k)

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

• Example
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
– Error analysis: one deletion ("the") and one insertion ("not"); "effect", "is" and "clear" are matched
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)
    WER = 100% × (Sub + Del + Ins) / (no. of words in the correct sentence) = 100% × (0 + 1 + 1)/4 = 50%   (might be higher than 100%)
    WCR = 100% × Matched / (no. of words in the correct sentence) = 100% × 3/4 = 75%
    WAR = 100% × (Matched − Ins) / (no. of words in the correct sentence) = 100% × (3 − 1)/4 = 50%   (might be negative)
    WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (textbook version)
[Figure: the alignment grid; the word lengths of the correct/reference sentence and of the recognized/test sentence define the two axes (Ref i, Test j in this version); each grid cell [i, j] stores the minimum word-error alignment up to that point, and each move corresponds to one of the kinds of alignment: hit (match), substitution, deletion, or insertion.]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen); i indexes the test (recognized) string of length n, j indexes the reference string of length m; LT[i] and LR[j] are the i-th test word and the j-th reference word
  Step 1: Initialization
    G[0][0] = 0
    for i = 1 ... n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1   (Insertion, horizontal direction)
    for j = 1 ... m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2   (Deletion, vertical direction)
  Step 2: Iteration
    for i = 1 ... n (test), for j = 1 ... m (reference):
      G[i][j] = min { G[i-1][j] + 1 (Insertion),  G[i][j-1] + 1 (Deletion),
                      G[i-1][j-1] + 1 if LR[j] ≠ LT[i] (Substitution),  G[i-1][j-1] if LR[j] = LT[i] (Match) }
      B[i][j] = 1 Insertion (horizontal) / 2 Deletion (vertical) / 3 Substitution (diagonal) / 4 Match (diagonal), according to which term attains the minimum
  Step 3: Backtrace and Measure
    Word Error Rate = 100% × G[n][m] / m;  Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path from B[n][m] to B[0][0]: if B[i][j] = 1, print Insertion LT[i] and go left; else if B[i][j] = 2, print Deletion LR[j] and go down; else print Substitution or Hit/Match (LR[j], LT[i]) and go diagonally
  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.
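For reference, here is a compact, self-contained C version of the same dynamic program with unit penalties; the function name, the MAXW limit and the string-array interface are illustrative assumptions, and it returns only the total error count, whereas the slide's version also keeps the per-type counts and the backtrace.

  #include <string.h>

  #define MAXW 64   /* maximum sentence length in words (illustrative) */

  /* Minimum number of substitutions + deletions + insertions needed to
     turn ref[0..m-1] into test[0..n-1]; WER = 100.0 * errors / m.       */
  int word_errors(char *ref[], int m, char *test[], int n)
  {
      static int G[MAXW + 1][MAXW + 1];

      for (int i = 0; i <= n; i++) G[i][0] = i;      /* pure insertions */
      for (int j = 0; j <= m; j++) G[0][j] = j;      /* pure deletions  */

      for (int i = 1; i <= n; i++)
          for (int j = 1; j <= m; j++) {
              int d = G[i - 1][j - 1] + (strcmp(test[i - 1], ref[j - 1]) != 0);
              int h = G[i - 1][j] + 1;               /* insertion */
              int v = G[i][j - 1] + 1;               /* deletion  */
              int best = d;
              if (h < best) best = h;
              if (v < best) best = v;
              G[i][j] = best;
          }
      return G[n][m];
  }

For the earlier example (correct "the effect is clear", recognized "effect is not clear") this returns 2, giving WER = 2/4 = 50%, in agreement with the hand computation above.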

SP - Berlin Chen 75

Measures of ASR Performance (58)

[Figure (HTK-style grid): the correct/reference word sequence indexes one axis (j = 1 ... m) and the recognized/test word sequence the other (i = 1 ... n); cell (i, j) is reached from (i-1, j) by an insertion, from (i, j-1) by a deletion, or from (i-1, j-1) by a hit/substitution, and the first row and column accumulate pure insertions and deletions up to (n, m).]
• A Dynamic Programming Algorithm
– Initialization:

  grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
  grid[0][0].sub = grid[0][0].hit = 0;
  grid[0][0].dir = NIL;
  for (i = 1; i <= n; i++) {            /* test */
      grid[i][0] = grid[i-1][0];
      grid[i][0].dir = HOR;
      grid[i][0].score += InsPen;
      grid[i][0].ins++;
  }
  for (j = 1; j <= m; j++) {            /* reference */
      grid[0][j] = grid[0][j-1];
      grid[0][j].dir = VERT;
      grid[0][j].score += DelPen;
      grid[0][j].del++;
  }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program (iteration step):

  for (i = 1; i <= n; i++) {                       /* test */
      gridi  = grid[i];
      gridi1 = grid[i-1];
      for (j = 1; j <= m; j++) {                   /* reference */
          h = gridi1[j].score + insPen;
          d = gridi1[j-1].score;
          if (lRef[j] != lTest[i])
              d += subPen;
          v = gridi[j-1].score + delPen;
          if (d <= h && d <= v) {                  /* DIAG = hit or sub */
              gridi[j] = gridi1[j-1];
              gridi[j].score = d;
              gridi[j].dir = DIAG;
              if (lRef[j] == lTest[i]) ++gridi[j].hit;
              else                     ++gridi[j].sub;
          } else if (h < v) {                      /* HOR = ins */
              gridi[j] = gridi1[j];
              gridi[j].score = h;
              gridi[j].dir = HOR;
              ++gridi[j].ins;
          } else {                                 /* VERT = del */
              gridi[j] = gridi[j-1];
              gridi[j].score = v;
              gridi[j].dir = VERT;
              ++gridi[j].del;
          }
      }   /* for j */
  }   /* for i */

• Example 1
    Correct:    A C B C C
    Test:       B A B C
[Figure: the filled DP grid for this pair; each cell stores its (Ins, Del, Sub, Hit) counts, and the backtraced path reads (in order) Ins B, Hit A, Del C, Hit B, Hit C, Del C.]
  Alignment 1: 2 deletions + 1 insertion + 0 substitutions over 5 reference words, WER = 3/5 = 60%
  (There can still be other optimal alignments with the same score.)

SP - Berlin Chen 77

Measures of ASR Performance (78)

• Example 2
    Correct:    A C B C C
    Test:       B A A C
[Figure: the filled DP grid for this pair; several backtrace paths attain the same minimum cost.]
  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   (1 insertion, 2 deletions, 1 substitution), WER = 4/5 = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   (1 insertion, 2 deletions, 1 substitution), WER = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          (3 substitutions, 1 deletion), WER = 80%
  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (88)

• Two common settings of different penalties for substitution, deletion and insertion errors:
    HTK error penalties:   subPen = 10,  delPen = 7,  insPen = 7
    NIST error penalties:  subPenNIST = 4,  delPenNIST = 3,  insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance
    Reference:   桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
    ASR Output:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……
  (in the data files, each character appears on its own line, preceded by two time fields "100000 100000")

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first 1, 100, 200, and all 506 stories
– The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 22: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 22

Basic Problem 1 of HMM (cont)

Given O and find P(O|)= Prob[observing O given ]bull Direct Evaluation

ndash Evaluating all possible state sequences of length T that generating observation sequence O

ndash The probability of each path Sbull By Markov assumption (First-order HMM)

Sallall

PPPP

SSOSOOS

TT sssssss

T

ttt

T

t

tt

aaa

ssPsP

ssPsPP

132211

211

2

111

S

SP

By Markov assumption

By chain rule

SP - Berlin Chen 23

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (cont)ndash The joint output probability along the path S

bull By output-independent assumptionndash The probability that a particular observation symbolvector is

emitted at time t depends only on the state st and is conditionally independent of the past observations

SOP

T

tts

T

ttt

T

t

Ttt

T

TT

tb

sP

sPsP

sPP

1

1

21

1111

11

o

o

ooo

oSO

By output-independent assumption

SP - Berlin Chen 24

Basic Problem 1 of HMM (cont)

bull Direct Evaluation (Cont)

ndash Huge Computation Requirements O(NT)bull Exponential computational complexity

bull A more efficient algorithms can be used to evaluate ndash ForwardBackward ProcedureAlgorithm

Tssssss

sssss

allTssssssssss

all

TTT

T

TTT

babab

bbbaaa

PPP

ooo

ooo

SOSO

s

S

1221

21

11

21132211

21

21

ADD 1- NTN2 MUL N1T-2 TTT Complexity

OP

tstt tbsP oo

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)
    • Define a new variable γ_t(j, k)
      – γ_t(j, k) is the probability of being in state j at time t, with the k-th mixture component accounting for o_t:

          γ_t(j, k) = P(s_t = j, m_t = k | O, λ)
                    = P(s_t = j | O, λ) · P(m_t = k | s_t = j, o_t, λ)
                    = [ α_t(j) β_t(j) / Σ_{m=1}^{N} α_t(m) β_t(m) ] · [ c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1}^{M} c_jm N(o_t; μ_jm, Σ_jm) ]

        (the observation-independence assumption is applied; as on the earlier figure, the distribution for state 1 is a weighted combination of Gaussians N_1, N_2, N_3 with weights c_11, c_12, c_13)

      – Note: Σ_{k=1}^{M} γ_t(j, k) = γ_t(j),  and p(A, B) = p(B) p(A | B)

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)  (a small C sketch follows below)

      ĉ_jk = expected number of times in state j and mixture k / expected number of times in state j
           = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)

      μ̂_jk = weighted average (mean) of observations at state j and mixture k
           = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

      Σ̂_jk = weighted covariance of observations at state j and mixture k
           = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk)(o_t - μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

  (Formulae for a single training utterance)

SP - Berlin Chen 55
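A minimal C sketch of the per-state mixture re-estimation above. For brevity it assumes diagonal covariances, whereas the slide's formula uses full covariance matrices; all names and sizes are illustrative assumptions.

    /* Re-estimate the GMM of one state from the per-frame mixture
       posteriors gamma_mix[t][k] = gamma_t(j,k). */
    #define T 100   /* frames            */
    #define M 4     /* mixtures          */
    #define D 13    /* feature dimension */

    void reestimate_state_gmm(const double o[T][D], const double gamma_mix[T][M],
                              double c[M], double mu[M][D], double var[M][D])
    {
        double occ_state = 0.0, occ[M] = {0};

        for (int t = 0; t < T; t++)
            for (int k = 0; k < M; k++) {
                occ[k]    += gamma_mix[t][k];
                occ_state += gamma_mix[t][k];
            }

        for (int k = 0; k < M; k++) {
            c[k] = occ[k] / occ_state;                /* new mixture weight */
            for (int d = 0; d < D; d++) {
                double m = 0.0;
                for (int t = 0; t < T; t++) m += gamma_mix[t][k] * o[t][d];
                mu[k][d] = m / occ[k];                /* new weighted mean  */
            }
            for (int d = 0; d < D; d++) {
                double v = 0.0;
                for (int t = 0; t < T; t++) {
                    double diff = o[t][d] - mu[k][d];
                    v += gamma_mix[t][k] * diff * diff;
                }
                var[k][d] = v / occ[k];               /* new weighted (diagonal) covariance */
            }
        }
    }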

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

  [Figure: several training utterances of the same unit (e.g. 台師大) are each aligned to the 3-state model s1-s2-s3 by the forward-backward (FB) procedure, and their statistics are pooled]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)

    Formulae for multiple (L) training utterances:

      ĉ_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} Σ_{m=1}^{M} γ_t^l(j, m)
           (expected number of times in state j and mixture k / expected number of times in state j)

      μ̂_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) o_t^l / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)
           (weighted average (mean) of observations at state j and mixture k)

      Σ̂_jk = Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k) (o_t^l - μ̂_jk)(o_t^l - μ̂_jk)^T / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j, k)
           (weighted covariance of observations at state j and mixture k)

      π̂_i = (1/L) Σ_{l=1}^{L} γ_1^l(i)
           (expected frequency (number of times) in state i at time t = 1)

      â_ij = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)
           (expected number of transitions from state i to state j / expected number of transitions from state i)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For discrete and finite observations (cont.)

    Formulae for multiple (L) training utterances:

      π̂_i = (1/L) Σ_{l=1}^{L} γ_1^l(i)
           (expected frequency (number of times) in state i at time t = 1)

      â_ij = Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} ξ_t^l(i, j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l - 1} γ_t^l(i)
           (expected number of transitions from state i to state j / expected number of transitions from state i)

      b̂_j(v_k) = Σ_{l=1}^{L} Σ_{t=1, o_t^l = v_k}^{T_l} γ_t^l(j) / Σ_{l=1}^{L} Σ_{t=1}^{T_l} γ_t^l(j)
           (expected number of times in state j and observing symbol v_k / expected number of times in state j)

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous or tied-mixture HMM
  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
  – Because M is large, we can simply use the L most significant values (a small C sketch follows below)
    • Experience showed that an L of about 1~3% of M is adequate
  – Partial tying of f(o | v_k) for different phonetic classes

  State output probability of state j:

    b_j(o) = Σ_{k=1}^{M} b_j(k) f(o | v_k) = Σ_{k=1}^{M} b_j(k) N(o; μ_k, Σ_k)

  where b_j(k) is the k-th mixture weight of state j (discrete, model-dependent) and f(o | v_k) is the k-th mixture density function, i.e. the k-th codeword (shared across HMMs; M is very large)

SP - Berlin Chen 59
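A minimal C sketch of a tied-mixture state output probability that keeps only the top-L most significant codeword densities, as described above; the codebook size, L and all names are illustrative assumptions.

    /* Semicontinuous state output probability with top-L pruning.
       f[k] holds the shared codebook densities f(o | v_k) already
       evaluated for the current frame; w[k] are the weights b_j(k)
       of the state being scored. */
    #define M     256   /* codebook size           */
    #define TOP_L 4     /* most significant values */

    double semicont_output_prob(const double f[M], const double w[M])
    {
        unsigned char used[M] = {0};
        int best[TOP_L];
        double prob = 0.0;

        /* select the TOP_L largest codeword densities (simple selection) */
        for (int n = 0; n < TOP_L; n++) {
            int arg = -1;
            for (int k = 0; k < M; k++)
                if (!used[k] && (arg < 0 || f[k] > f[arg])) arg = k;
            used[arg] = 1;
            best[n] = arg;
        }
        /* b_j(o) ~ sum over the selected codewords of b_j(k) * f(o | v_k) */
        for (int n = 0; n < TOP_L; n++)
            prob += w[best[n]] * f[best[n]];
        return prob;
    }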

Semicontinuous HMMs (cont)

[Figure: two HMMs (each with states s1, s2, s3) whose state-dependent mixture weights b_1(k), b_2(k), b_3(k), k = 1...M, all point into one shared codebook of Gaussian kernels N(μ_1, Σ_1), N(μ_2, Σ_2), ..., N(μ_k, Σ_k), ..., N(μ_M, Σ_M)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model); a small sketch follows below
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61
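As mentioned above, a left-to-right topology can be encoded directly in the transition matrix; the following C sketch initializes such a matrix. The number of states and the 0.5/0.5 split are arbitrary illustrative choices, not values from the lecture.

    /* Initialize a left-to-right ("beads-on-a-string") transition matrix:
       each state may only stay where it is or move one state forward. */
    #define NSTATE 5

    void init_left_to_right(double a[NSTATE][NSTATE])
    {
        for (int i = 0; i < NSTATE; i++)
            for (int j = 0; j < NSTATE; j++)
                a[i][j] = 0.0;

        for (int i = 0; i < NSTATE; i++) {
            if (i < NSTATE - 1) {
                a[i][i]     = 0.5;   /* self-loop: stay in the same state  */
                a[i][i + 1] = 0.5;   /* forward transition to the next one */
            } else {
                a[i][i] = 1.0;       /* final state only loops on itself   */
            }
        }
    }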

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2:
    • For discrete density HMM (using an M-codeword codebook):
        b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state j into a set of M clusters, then
        ŵ_jm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
        μ̂_jm = sample mean of the vectors classified in cluster m of state j
        Σ̂_jm = sample covariance matrix of the vectors classified in cluster m of state j
  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop: the initial model is generated

SP - Berlin Chen 62

Initialization of HMM (cont)

[Flowchart: Training Data and Initial Model feed State Sequence Segmentation; parameters of the observation distributions are estimated via Segmental K-means; Model Re-estimation follows; if Model Convergence is NO, loop back to segmentation, if YES, output the Model Parameters]

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for discrete HMM
  – 3 states and 2 codewords (v1, v2)
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3

  [Figure: observations O1...O10 (time 1...10) assigned to states s1, s2, s3 of the 3-state model by Viterbi segmentation, with each observation quantized to codeword v1 or v2]

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for continuous HMM
  – 3 states and 4 Gaussian mixtures per state

  [Figure: observations O1...ON (time 1...N) segmented into states s1, s2, s3; within each state the vectors are clustered by K-means, starting from the global mean, into cluster means (e.g. μ11, μ12, μ13, μ14 with the corresponding Σ and weights for state 1) to initialize the 4 mixtures]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution

      d_i(t) = (a_ii)^{t-1} (1 - a_ii)

    • It doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination
  – Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

  Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

  [Figure: candidate state-duration distributions: geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

  [Figure: likelihood surface over the model configuration space; training climbs to a local optimum near the current model configuration]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: the initial 3-state ergodic HMM; each state's self-transition probability is 0.34 and its transitions to the other two states are 0.33 each; the emission distributions of the three states are A:.34/B:.33/C:.33, A:.33/B:.34/C:.33 and A:.33/B:.33/C:.34]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were used instead in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: the speech signal goes through feature extraction to produce the feature sequence X; X is scored against every word model M1, M2, ..., MV and a silence model MSil, and the most-likely-word selector picks the word whose model gives the highest likelihood]

    Label(X) = argmax_k p(X | M_k)

  Viterbi approximation (a small C sketch of the selector follows below):

    Label(X) = argmax_k max_S p(X, S | M_k)

SP - Berlin Chen 71
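A minimal C sketch of the word selector above: given per-model log-likelihood scores (e.g. from the Viterbi approximation), pick the word whose model scores highest. The vocabulary and score values are made up for illustration.

    #include <stdio.h>

    #define VOCAB 3

    int main(void)
    {
        const char *word[VOCAB]   = { "yes", "no", "maybe" };
        double      loglik[VOCAB] = { -143.2, -151.8, -149.5 };  /* log p(X, S* | M_k) */
        int         best = 0;

        for (int k = 1; k < VOCAB; k++)
            if (loglik[k] > loglik[best]) best = k;

        printf("recognized word: %s\n", word[best]);
        return 0;
    }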

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming

• Example:
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"

  – Error analysis: one deletion ("the") and one insertion ("not")
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR); a small C sketch follows below

    Word Error Rate      = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% × 2/4 = 50%   (might be higher than 100%)
    Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence)   = 100% × 3/4 = 75%
    Word Accuracy Rate   = 100% × (Matched - Ins words) / (No. of words in the correct sentence) = 100% × (3 - 1)/4 = 50%   (might be negative)

    Note: WER + WAR = 100%

SP - Berlin Chen 73
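A minimal C sketch that plugs the alignment counts of this example into the three rate formulas; the counts are exactly those of the example above.

    #include <stdio.h>

    int main(void)
    {
        int n_ref = 4;                       /* words in the correct sentence */
        int sub = 0, del = 1, ins = 1, hit = 3;

        double wer = 100.0 * (sub + del + ins) / n_ref;   /* 50% */
        double wcr = 100.0 * hit / n_ref;                 /* 75% */
        double war = 100.0 * (hit - ins) / n_ref;         /* 50% */

        printf("WER = %.0f%%  WCR = %.0f%%  WAR = %.0f%%\n", wer, wcr, war);
        printf("WER + WAR = %.0f%%\n", wer + war);        /* always 100 */
        return 0;
    }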

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

  [Figure: alignment grid with the correct/reference words (index j, length m) on one axis and the recognized/test words (index i, length n) on the other; each grid cell [i, j] stores the minimum word error alignment reaching it, and the arrows show the possible kinds of alignment steps (insertion, deletion, substitution, hit)]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen); i indexes the test (recognized) words LT[i], j indexes the reference words LR[j]

  Step 1, Initialization:
    G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1;  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1;  B[0][j] = 2  (Deletion, vertical direction)

  Step 2, Iteration:
    for i = 1..n (test), for j = 1..m (reference):
      G[i][j] = min( G[i-1][j] + 1    (Insertion),
                     G[i][j-1] + 1    (Deletion),
                     G[i-1][j-1] + 1  (Substitution, if LT[i] != LR[j]),
                     G[i-1][j-1]      (Match, if LT[i] == LR[j]) )
      B[i][j] records which case was chosen: 1. Insertion (horizontal direction), 2. Deletion (vertical direction), 3. Substitution (diagonal direction), 4. Match (diagonal direction)

  Step 3, Backtrace and measure:
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path from B[n][m] back to B[0][0]:
      if B[i][j] = 1, print LT[i] (Insertion), then go left
      else if B[i][j] = 2, print LR[j] (Deletion), then go down
      else, print LR[j] (Hit/Match or Substitution), then go down diagonally

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style grid)
  – Initialization:

      grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
      grid[0][0].sub   = grid[0][0].hit = 0;
      grid[0][0].dir   = NIL;

      for (i = 1; i <= n; i++) {            /* test axis      */
          grid[i][0]        = grid[i-1][0];
          grid[i][0].dir    = HOR;
          grid[i][0].score += InsPen;
          grid[i][0].ins++;
      }
      for (j = 1; j <= m; j++) {            /* reference axis */
          grid[0][j]        = grid[0][j-1];
          grid[0][j].dir    = VERT;
          grid[0][j].score += DelPen;
          grid[0][j].del++;
      }

  [Figure: the alignment grid with the recognized/test word sequence (1 ... n) on one axis and the correct/reference word sequence (1 ... m) on the other; a cell (i, j) is reached from (i-1, j) by an insertion, from (i, j-1) by a deletion, and from (i-1, j-1) by a substitution or hit]

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (the assignments gridi[j] = ... copy whole cells, i.e. structure assignment):

    for (i = 1; i <= n; i++) {                    /* test */
        gridi  = grid[i];  gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {                /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i]) d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {               /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];           /* structure assignment */
                gridi[j].score = d;  gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
            } else if (h < v) {                   /* HOR = ins */
                gridi[j] = gridi1[j];             /* structure assignment */
                gridi[j].score = h;  gridi[j].dir = HOR;  ++gridi[j].ins;
            } else {                              /* VERT = del */
                gridi[j] = gridi[j-1];            /* structure assignment */
                gridi[j].score = v;  gridi[j].dir = VERT;  ++gridi[j].del;
            }
        } /* for j */
    } /* for i */

• Example 1 (HTK-style grid, each cell annotated with its (Ins, Del, Sub, Hit) counts):
    Correct: A C B C C
    Test:    B A B C
    Optimal alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C  →  WER = 3/5 = 60%
    (There is still another optimal alignment.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (HTK-style grid, each cell annotated with its (Ins, Del, Sub, Hit) counts):
    Correct: A C B C C
    Test:    B A A C

    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   →  WER = 4/5 = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   →  WER = 80%
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          →  WER = 80%

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

    HTK error penalties:  subPen = 10,  delPen = 7,  insPen = 7
    NIST error penalties: subPenNIST = 4,  delPenNIST = 3,  insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference (one character per line, each preceded by two alignment fields of 100000):
  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……

  ASR output (same format):
  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200 and all 506 stories
  – The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
====================================================================
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
====================================================================
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
====================================================================
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B containing red (R) and green (G) balls. Observed data O: the "ball sequence"; latent data S: the "bottle sequence". Parameters λ to be estimated to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)]

[Figure: a 3-state HMM with emission distributions A:.3/B:.2/C:.5, A:.7/B:.1/C:.2, A:.3/B:.6/C:.1 and transition probabilities 0.7, 0.6, 0.3, 0.2, 0.1, ...; given o_1 o_2 ... o_T, re-estimation produces λ̂ with p(O | λ̂) > p(O | λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data. In our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B, π} without consideration of the state sequence
  – Two major steps:
    • E: take the expectation E[ log P(O, S | λ̂) | O, λ ] with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = {X_1, X_2, ..., X_n} = {x_1, x_2, ..., x_n}  (a small C sketch follows below)

  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X | Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimate of Φ = {μ, Σ} is

      μ_ML = (1/n) Σ_{i=1}^{n} x_i,    Σ_ML = (1/n) Σ_{i=1}^{n} (x_i - μ_ML)(x_i - μ_ML)^T

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior probability p(Φ | X) is maximum

SP - Berlin Chen 85
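A minimal C sketch of the ML estimate for a multivariate normal from i.i.d. data, using a diagonal covariance for brevity; the data values and dimensions are illustrative, not from the lecture.

    #include <stdio.h>

    #define NPTS 4
    #define DIM  2

    int main(void)
    {
        double x[NPTS][DIM] = { {1.0, 2.0}, {2.0, 0.0}, {0.0, 1.0}, {1.0, 1.0} };
        double mu[DIM] = {0}, var[DIM] = {0};

        for (int i = 0; i < NPTS; i++)
            for (int d = 0; d < DIM; d++) mu[d] += x[i][d] / NPTS;   /* mu_ML */

        for (int i = 0; i < NPTS; i++)
            for (int d = 0; d < DIM; d++) {
                double diff = x[i][d] - mu[d];
                var[d] += diff * diff / NPTS;                        /* Sigma_ML (diagonal) */
            }

        for (int d = 0; d < DIM; d++)
            printf("dim %d: mean = %.3f  var = %.3f\n", d, mu[d], var[d]);
        return 0;
    }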

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of incomplete data, log P(O | λ), by iteratively maximizing the expectation of the log-likelihood of complete data, log P(O, S | λ)

• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O | λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have the current estimate λ, and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S | λ), to compute a new λ̂: the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression and expectation taken over S:
      by Bayes' rule, P(O, S | λ̂) = P(S | O, λ̂) P(O | λ̂)  (complete-data likelihood vs. incomplete-data likelihood), so

        log P(O | λ̂) = log P(O, S | λ̂) - log P(S | O, λ̂)

      Taking the expectation over S under P(S | O, λ) (λ̂ is the unknown model setting to be found):

        log P(O | λ̂) = Σ_S P(S | O, λ) log P(O, S | λ̂) - Σ_S P(S | O, λ) log P(S | O, λ̂)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O | λ̂) as follows:

        log P(O | λ̂) = Q(λ, λ̂) - H(λ, λ̂)

      where

        Q(λ, λ̂) = Σ_S P(S | O, λ) log P(O, S | λ̂)
        H(λ, λ̂) = Σ_S P(S | O, λ) log P(S | O, λ̂)

    • We want log P(O | λ̂) ≥ log P(O | λ), i.e.

        Q(λ, λ̂) - H(λ, λ̂) ≥ Q(λ, λ) - H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̂) has the following property:

    H(λ, λ̂) - H(λ, λ) = Σ_S P(S | O, λ) log [ P(S | O, λ̂) / P(S | O, λ) ]
                       ≤ Σ_S P(S | O, λ) [ P(S | O, λ̂) / P(S | O, λ) - 1 ]     (log x ≤ x - 1; Jensen's inequality)
                       = Σ_S P(S | O, λ̂) - Σ_S P(S | O, λ) = 0

  (the negative of the left-hand side is the Kullback-Leibler (KL) distance between P(S | O, λ) and P(S | O, λ̂))

  – Therefore, since log P(O | λ̂) - log P(O | λ) = [Q(λ, λ̂) - Q(λ, λ)] - [H(λ, λ̂) - H(λ, λ)], for maximizing log P(O | λ̂) we only need to maximize the Q-function (auxiliary function)

      Q(λ, λ̂) = Σ_S P(S | O, λ) log P(O, S | λ̂)

    i.e. the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function

      Q(λ, λ̂) = Σ_S [ P(O, S | λ) / P(O | λ) ] log P(O, S | λ̂)

  – where P(O, S | λ) and log P(O, S | λ̂) can be expressed as

      P(O, S | λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)

      log P(O, S | λ̂) = log π̂_{s_1} + Σ_{t=2}^{T} log â_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̂_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂), where

    Q_π(λ, π̂) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O | λ) ] log π̂_i

    Q_a(λ, â) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O | λ) ] log â_ij

    Q_b(λ, b̂) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1, o_t = v_k}^{T} [ P(O, s_t = j | λ) / P(O | λ) ] log b̂_j(v_k)

  (each term has the form Σ_j w_j log y_j)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in π̂_i, â_ij and b̂_j(k)
  – They can be maximized individually
  – All are of the same form

      F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j,   where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0

    F has its maximum value when

      y_j = w_j / Σ_{j=1}^{N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier ℓ with the constraint Σ_{j=1}^{N} y_j = 1

    F = Σ_{j=1}^{N} w_j log y_j + ℓ ( Σ_{j=1}^{N} y_j - 1 )

    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = -ℓ y_j,  for all j

    Summing over j:  Σ_{j=1}^{N} w_j = -ℓ Σ_{j=1}^{N} y_j = -ℓ

    Therefore   y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̂ = (π̂, Â, B̂) can be expressed as

    π̂_i = P(O, s_1 = i | λ) / P(O | λ) = γ_1(i)

    â_ij = [ Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / P(O | λ) ] / [ Σ_{t=1}^{T-1} P(O, s_t = i | λ) / P(O | λ) ]
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    b̂_i(v_k) = [ Σ_{t=1, o_t = v_k}^{T} P(O, s_t = i | λ) / P(O | λ) ] / [ Σ_{t=1}^{T} P(O, s_t = i | λ) / P(O | λ) ]
             = Σ_{t=1, o_t = v_k}^{T} γ_t(i) / Σ_{t=1}^{T} γ_t(i)

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk),   with Σ_{k=1}^{M} c_jk = 1

      N(o; μ_jk, Σ_jk) = (2π)^{-L/2} |Σ_jk|^{-1/2} exp( -(1/2) (o - μ_jk)^T Σ_jk^{-1} (o - μ_jk) )

  [Figure: the distribution for state i as a weighted combination of Gaussians N_1, N_2, N_3 with weights w_i1, w_i2, w_i3]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

    p(O, S | λ) = Π_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
                = Π_{t=1}^{T} a_{s_{t-1} s_t} [ Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t) ]
                = Σ_{K} Π_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)
                = Σ_{K} p(O, S, K | λ)

  where K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along the state sequence S, and therefore

    p(O | λ) = Σ_S Σ_K p(O, S, K | λ)

  Note (interchanging product and sum):

    Π_{t=1}^{T} Σ_{k=1}^{M} a_t(k) = (a_1(1) + ... + a_1(M)) (a_2(1) + ... + a_2(M)) ... (a_T(1) + ... + a_T(M))
                                   = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_t(k_t)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ̂) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̂)
             = Σ_S Σ_K [ p(O, S, K | λ) / p(O | λ) ] log p(O, S, K | λ̂)

  with

    log p(O, S, K | λ̂) = log π̂_{s_1} + Σ_{t=2}^{T} log â_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̂_{s_t k_t}(o_t) + Σ_{t=1}^{T} log ĉ_{s_t k_t}

  so that

    Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂) + Q_c(λ, ĉ)

  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with discrete HMM training, is

    Q_b(λ, b̂) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b̂_jk(o_t)

    Q_c(λ, ĉ) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log ĉ_jk

  where γ_t(j, k) = P(s_t = j, k_t = k | O, λ)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Maximization with respect to the mixture means. Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). Since

    log b̂_jk(o_t) = -(L/2) log(2π) - (1/2) log |Σ̂_jk| - (1/2) (o_t - μ̂_jk)^T Σ̂_jk^{-1} (o_t - μ̂_jk),

  setting the derivative of Q_b with respect to μ̂_jk to zero gives

    ∂Q_b(λ, b̂)/∂μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) Σ̂_jk^{-1} (o_t - μ̂_jk) = 0

  (using d(x^T C x)/dx = (C + C^T) x and the fact that Σ̂_jk is symmetric), so

    Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk) = 0   ⇒   μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

  i.e. the weighted average (mean) of the observations at state j and mixture k

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximization with respect to the mixture covariances. Setting the derivative of Q_b with respect to Σ̂_jk (equivalently, Σ̂_jk^{-1}) to zero,

    ∂Q_b(λ, b̂)/∂Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) [ -(1/2) Σ̂_jk^{-1} + (1/2) Σ̂_jk^{-1} (o_t - μ̂_jk)(o_t - μ̂_jk)^T Σ̂_jk^{-1} ] = 0

  (using d log det(X) / dX = (X^{-1})^T, d(a^T X b)/dX = a b^T, and the symmetry of Σ̂_jk), which gives

    Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk)(o_t - μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

  i.e. the weighted covariance of the observations at state j and mixture k

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    μ̂_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
          = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

    Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ̂_jk)(o_t - μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

    ĉ_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k)


Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda,\bar{\lambda})$ has the following property: $H(\lambda,\bar{\lambda})-H(\lambda,\lambda)\ge 0$
$$H(\lambda,\bar{\lambda})-H(\lambda,\lambda)=\sum_{S}P(S|\mathbf{O},\lambda)\log\frac{P(S|\mathbf{O},\lambda)}{P(S|\mathbf{O},\bar{\lambda})}\ \ge\ \sum_{S}P(S|\mathbf{O},\lambda)\Big(1-\frac{P(S|\mathbf{O},\bar{\lambda})}{P(S|\mathbf{O},\lambda)}\Big)=1-1=0$$
using Jensen's inequality, $\log x\le x-1$; the left-hand side is the Kullback-Leibler (KL) distance between $P(S|\mathbf{O},\lambda)$ and $P(S|\mathbf{O},\bar{\lambda})$.
– Therefore, for maximizing $\log P(\mathbf{O}|\bar{\lambda})$ we only need to maximize the Q-function (auxiliary function)
$$Q(\lambda,\bar{\lambda})=\sum_{S}P(S|\mathbf{O},\lambda)\log P(\mathbf{O},S|\bar{\lambda})$$
i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda=(\mathbf{A},\mathbf{B},\boldsymbol{\pi})$
– By maximizing the auxiliary function
$$Q(\lambda,\bar{\lambda})=\sum_{S}P(S|\mathbf{O},\lambda)\log P(\mathbf{O},S|\bar{\lambda})=\sum_{S}\frac{P(\mathbf{O},S|\lambda)}{P(\mathbf{O}|\lambda)}\log P(\mathbf{O},S|\bar{\lambda})$$
– where $P(\mathbf{O},S|\lambda)$ and $\log P(\mathbf{O},S|\bar{\lambda})$ can be expressed as
$$P(\mathbf{O},S|\lambda)=\pi_{s_1}b_{s_1}(\mathbf{o}_1)\prod_{t=2}^{T}a_{s_{t-1}s_t}b_{s_t}(\mathbf{o}_t)$$
$$\log P(\mathbf{O},S|\bar{\lambda})=\log\bar{\pi}_{s_1}+\sum_{t=2}^{T}\log\bar{a}_{s_{t-1}s_t}+\sum_{t=1}^{T}\log\bar{b}_{s_t}(\mathbf{o}_t)$$
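The complete-data term above is easy to compute for any single state sequence; a minimal numpy sketch (array and function names are illustrative, states and obs are integer index sequences):

```python
import numpy as np

def complete_data_log_likelihood(pi, A, B, states, obs):
    """log P(O, S | lambda) for a discrete HMM and one given state sequence S.
    pi: (N,) initial probs, A: (N, N) transition probs, B: (N, M) symbol probs."""
    logp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for t in range(1, len(obs)):
        logp += np.log(A[states[t - 1], states[t]]) + np.log(B[states[t], obs[t]])
    return logp
```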

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as
$$Q(\lambda,\bar{\lambda})=Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}})+Q_{a}(\lambda,\bar{\mathbf{A}})+Q_{b}(\lambda,\bar{\mathbf{B}})$$
where
$$Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}})=\sum_{i=1}^{N}\frac{P(\mathbf{O},s_1=i|\lambda)}{P(\mathbf{O}|\lambda)}\log\bar{\pi}_i$$
$$Q_{a}(\lambda,\bar{\mathbf{A}})=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\frac{P(\mathbf{O},s_t=i,s_{t+1}=j|\lambda)}{P(\mathbf{O}|\lambda)}\log\bar{a}_{ij}$$
$$Q_{b}(\lambda,\bar{\mathbf{B}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\ \sum_{\substack{t=1\\ \text{s.t. }o_t=v_k}}^{T}\frac{P(\mathbf{O},s_t=j|\lambda)}{P(\mathbf{O}|\lambda)}\log\bar{b}_{j}(v_k)$$
(each term has the form $\sum_j w_j\log y_j$)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(k)$
– Can be maximized individually
– All are of the same form
$$F(y_1,y_2,\ldots,y_N)=\sum_{j=1}^{N}w_j\log y_j,\qquad\text{where }\sum_{j=1}^{N}y_j=1\ \text{and}\ y_j\ge 0,$$
which has maximum value when
$$y_j=\frac{w_j}{\sum_{j'=1}^{N}w_{j'}}$$
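For instance (an illustrative check): with $N=3$ and weights $w=(2,1,1)$ the maximizing point is $y=(0.5,0.25,0.25)$. In the HMM case the $w_j$ are expected counts computed under the current model, so each re-estimated probability below is simply a normalized expected count.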

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
By applying the Lagrange multiplier $\ell$ with the constraint $\sum_{j=1}^{N}y_j=1$, suppose
$$F=\sum_{j=1}^{N}w_j\log y_j+\ell\Big(\sum_{j=1}^{N}y_j-1\Big)$$
$$\frac{\partial F}{\partial y_j}=\frac{w_j}{y_j}+\ell=0\ \Rightarrow\ w_j=-\ell\,y_j,\ \forall j\ \Rightarrow\ \sum_{j=1}^{N}w_j=-\ell\sum_{j=1}^{N}y_j=-\ell$$
$$\Rightarrow\quad y_j=\frac{w_j}{\sum_{j'=1}^{N}w_{j'}}$$

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\lambda}=(\bar{\mathbf{A}},\bar{\mathbf{B}},\bar{\boldsymbol{\pi}})$ can be expressed as
$$\bar{\pi}_i=\frac{P(\mathbf{O},s_1=i|\lambda)}{P(\mathbf{O}|\lambda)}$$
$$\bar{a}_{ij}=\frac{\sum_{t=1}^{T-1}P(\mathbf{O},s_t=i,s_{t+1}=j|\lambda)}{\sum_{t=1}^{T-1}P(\mathbf{O},s_t=i|\lambda)}$$
$$\bar{b}_{i}(k)=\frac{\sum_{t=1,\ \text{s.t. }o_t=v_k}^{T}P(\mathbf{O},s_t=i|\lambda)}{\sum_{t=1}^{T}P(\mathbf{O},s_t=i|\lambda)}$$
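A minimal numpy sketch of these discrete-HMM re-estimation formulas, assuming the state-occupation statistics have already been obtained from a forward-backward pass as gamma (γ_t(i), shape T×N) and xi (ξ_t(i,j), shape (T−1)×N×N); function and array names are illustrative only:

```python
import numpy as np

def reestimate_discrete_hmm(gamma, xi, obs, M):
    """One Baum-Welch M-step for a discrete HMM.
    gamma[t, i] = P(s_t = i | O, lambda),            shape (T, N)
    xi[t, i, j] = P(s_t = i, s_{t+1} = j | O, lambda), shape (T-1, N, N)
    obs[t] in {0, ..., M-1} is the observed symbol index at time t."""
    T, N = gamma.shape
    pi_new = gamma[0]                                          # expected count in state i at t = 1
    a_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # transitions i->j / transitions out of i
    b_new = np.zeros((N, M))
    for k in range(M):
        b_new[:, k] = gamma[obs == k].sum(axis=0)              # frames where o_t = v_k
    b_new /= gamma.sum(axis=0)[:, None]
    return pi_new, a_new, b_new
```

Each update is a ratio of expected counts, exactly the normalized-count form derived above.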

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
$$b_j(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,b_{jk}(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,N(\mathbf{o};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})=\sum_{k=1}^{M}c_{jk}\,\frac{1}{(2\pi)^{L/2}|\boldsymbol{\Sigma}_{jk}|^{1/2}}\exp\!\Big(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}-\boldsymbol{\mu}_{jk})\Big)$$
with the constraint $\sum_{k=1}^{M}c_{jk}=1$

(Figure: distribution for state i as a mixture of Gaussians N1, N2, N3 with weights wi1, wi2, wi3)
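A minimal sketch of evaluating this state output probability for one state, assuming full-covariance components and using scipy's multivariate normal density (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def state_likelihood(o, c_j, mu_j, sigma_j):
    """b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk) for a single state j.
    c_j: (M,) mixture weights, mu_j: (M, L) means, sigma_j: (M, L, L) covariances."""
    return sum(c * multivariate_normal.pdf(o, mean=m, cov=S)
               for c, m, S in zip(c_j, mu_j, sigma_j))
```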

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$:
$$p(\mathbf{O},S|\lambda)=\prod_{t=1}^{T}a_{s_{t-1}s_t}b_{s_t}(\mathbf{o}_t)=\prod_{t=1}^{T}a_{s_{t-1}s_t}\Big[\sum_{k=1}^{M}c_{s_tk}\,b_{s_tk}(\mathbf{o}_t)\Big]=\sum_{\mathbf{K}}\ \prod_{t=1}^{T}a_{s_{t-1}s_t}\,c_{s_tk_t}\,b_{s_tk_t}(\mathbf{o}_t)$$
(with $a_{s_0s_1}\equiv\pi_{s_1}$) where $\mathbf{K}=(k_1,k_2,\ldots,k_T)$ is one of the possible mixture component sequences along the state sequence $S$, so that
$$p(\mathbf{O},S,\mathbf{K}|\lambda)=\prod_{t=1}^{T}a_{s_{t-1}s_t}\,c_{s_tk_t}\,b_{s_tk_t}(\mathbf{o}_t),\qquad p(\mathbf{O}|\lambda)=\sum_{S}\sum_{\mathbf{K}}p(\mathbf{O},S,\mathbf{K}|\lambda)$$
Note:
$$\prod_{t=1}^{T}\Big(\sum_{k=1}^{M}a_{tk}\Big)=(a_{11}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM})=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\ \prod_{t=1}^{T}a_{tk_t}$$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
$$Q(\lambda,\bar{\lambda})=\sum_{S}\sum_{\mathbf{K}}P(S,\mathbf{K}|\mathbf{O},\lambda)\log p(\mathbf{O},S,\mathbf{K}|\bar{\lambda})=\sum_{S}\sum_{\mathbf{K}}\frac{p(\mathbf{O},S,\mathbf{K}|\lambda)}{p(\mathbf{O}|\lambda)}\log p(\mathbf{O},S,\mathbf{K}|\bar{\lambda})$$
with
$$\log p(\mathbf{O},S,\mathbf{K}|\bar{\lambda})=\log\bar{\pi}_{s_1}+\sum_{t=2}^{T}\log\bar{a}_{s_{t-1}s_t}+\sum_{t=1}^{T}\log\bar{b}_{s_tk_t}(\mathbf{o}_t)+\sum_{t=1}^{T}\log\bar{c}_{s_tk_t}$$
so that
$$Q(\lambda,\bar{\lambda})=Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}})+Q_{a}(\lambda,\bar{\mathbf{A}})+Q_{b}(\lambda,\bar{\mathbf{b}})+Q_{c}(\lambda,\bar{\mathbf{c}})$$
(initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have when compared with discrete HMM training:
$$Q_{b}(\lambda,\bar{\mathbf{b}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_t=j,k_t=k|\mathbf{O},\lambda)\log\bar{b}_{jk}(\mathbf{o}_t)$$
$$Q_{c}(\lambda,\bar{\mathbf{c}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_t=j,k_t=k|\mathbf{O},\lambda)\log\bar{c}_{jk}$$
where $P(s_t=j,k_t=k|\mathbf{O},\lambda)=\gamma_t(j,k)$
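For completeness, the mixture-level posterior used here can be obtained from the state-level posterior by weighting with the responsibility of each Gaussian component (the same quantity γ_t(j,k) used in the intuitive Baum-Welch view, under the observation-independence assumption):

$$\gamma_t(j,k)=P(s_t=j,k_t=k|\mathbf{O},\lambda)=P(s_t=j|\mathbf{O},\lambda)\cdot\frac{c_{jk}\,N(\mathbf{o}_t;\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})}{\sum_{m=1}^{M}c_{jm}\,N(\mathbf{o}_t;\boldsymbol{\mu}_{jm},\boldsymbol{\Sigma}_{jm})}$$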

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let $\gamma_t(j,k)=P(s_t=j,k_t=k|\mathbf{O},\lambda)$ and write, for each component,
$$\log\bar{b}_{jk}(\mathbf{o}_t)=-\tfrac{L}{2}\log(2\pi)-\tfrac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}|-\tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})$$
• Setting the derivative of $Q_b$ with respect to $\bar{\boldsymbol{\mu}}_{jk}$ to zero:
$$\frac{\partial Q_b(\lambda,\bar{\mathbf{b}})}{\partial\bar{\boldsymbol{\mu}}_{jk}}=\sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})=0$$
$$\Rightarrow\quad\bar{\boldsymbol{\mu}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
(using $\frac{d}{d\mathbf{x}}(\mathbf{x}^{T}\mathbf{C}\mathbf{x})=(\mathbf{C}+\mathbf{C}^{T})\mathbf{x}$; $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, setting the derivative of $Q_b$ with respect to $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$ to zero:
$$\frac{\partial Q_b(\lambda,\bar{\mathbf{b}})}{\partial\bar{\boldsymbol{\Sigma}}_{jk}^{-1}}=\frac{1}{2}\sum_{t=1}^{T}\gamma_t(j,k)\Big[\bar{\boldsymbol{\Sigma}}_{jk}-(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\Big]=0$$
$$\Rightarrow\quad\bar{\boldsymbol{\Sigma}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
(using $\frac{\partial}{\partial\mathbf{X}}(\mathbf{a}^{T}\mathbf{X}\mathbf{b})=\mathbf{a}\mathbf{b}^{T}$ and $\frac{\partial}{\partial\mathbf{X}}\det(\mathbf{X})=\det(\mathbf{X})(\mathbf{X}^{-1})^{T}$; $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as
$$\bar{\boldsymbol{\mu}}_{jk}=\frac{\sum_{t=1}^{T}p(s_t=j,k_t=k|\mathbf{O},\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T}p(s_t=j,k_t=k|\mathbf{O},\lambda)}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
$$\bar{\boldsymbol{\Sigma}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
$$\bar{c}_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k'=1}^{M}\gamma_t(j,k')}$$
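A minimal numpy sketch of these mixture-level updates, assuming the occupancies gamma_jk[t, j, k] = γ_t(j,k) and the observations obs (T×L) are already available from the E-step; array names are illustrative only:

```python
import numpy as np

def reestimate_gaussian_mixtures(gamma_jk, obs):
    """M-step updates for the state-mixture Gaussians of a continuous HMM.
    gamma_jk[t, j, k] = gamma_t(j, k), shape (T, N, M); obs shape (T, L)."""
    occ = gamma_jk.sum(axis=0)                           # (N, M): sum_t gamma_t(j, k)
    c_new = occ / occ.sum(axis=1, keepdims=True)         # mixture weights per state
    mu_new = np.einsum('tjk,tl->jkl', gamma_jk, obs) / occ[..., None]
    T, N, M = gamma_jk.shape
    L = obs.shape[1]
    sigma_new = np.zeros((N, M, L, L))
    for j in range(N):
        for k in range(M):
            d = obs - mu_new[j, k]                       # (T, L) centered observations
            sigma_new[j, k] = (gamma_jk[:, j, k, None] * d).T @ d / occ[j, k]
    return c_new, mu_new, sigma_new
```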

Page 25: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 25

Basic Problem 1 of HMM (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes that aij has been computed

bull Direct Evaluation (Cont)State-time Trellis Diagram

s2

s1

s3

SP - Berlin Chen 26

Basic Problem 1 of HMM- The Forward Procedure

bull Based on the HMM assumptions the calculation ofand involves only

and so it is possible to compute the likelihood with recursion on

bull Forward variable ndash The probability that the HMM is in state i at time t having

generating partial observation o1o2hellipot

ssP 1tt tt sP o 1ts tsto

t

λisoooPi tt21t

SP - Berlin Chen 27

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull Algorithm

ndash Complexity O(N2T)

bull Based on the lattice (trellis) structurendash Computed in a time-synchronous fashion from left-to-right where

each cell for time t is completely computed before proceeding to time t+1

bull All state sequences regardless how long previously merge to N nodes (states) at each time instance t

N

iT

tj

N

iijtt

ii

iαλP

NjT-t baiαjα

Ni bπiα

1

11

1

11

ion 3Terminat

1 11Induction 2

1tion Initializa 1

O

o

o

TNN-T-NN-

T N+N T-N+N2

2

111 ADD 11 MUL

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous, or tied-mixture, HMM

      b_j(o) = Σ_{k=1}^{M} b_j(k) f(o | v_k) = Σ_{k=1}^{M} b_j(k) N(o; μ_k, Σ_k)

      where b_j(o)  is the state output probability of state j,
            b_j(k)  is the k-th mixture weight of state j (discrete, model-dependent), and
            f(o|v_k) = N(o; μ_k, Σ_k) is the k-th mixture density function, i.e. the k-th codeword
                       (shared across all HMMs; M is very large)

  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
  – Because M is large, we can simply use only the L most significant values
    • Experience showed that an L of about 1~3% of M is adequate
  – Partial tying of f(o | v_k) for different phonetic classes
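As a small illustration of the "L most significant values" idea, the sketch below evaluates the shared-codebook output probability using only the L_top largest codeword densities. The names and the simple selection loop are my own assumptions; in practice the top-L selection would be done once per frame and reused by every state, since the densities f(o_t|v_k) are model-independent.

    /* b_j(o_t) ~= sum over the L_top largest codeword densities of
     *             b_j(k) * f(o_t|v_k); codeword_lik[k] = f(o_t|v_k) is shared. */
    double semicont_output_prob(int M, int L_top,
                                const double *codeword_lik, /* f(o_t|v_k), k = 0..M-1    */
                                const double *b_j)          /* weights b_j(k) of state j */
    {
        int used[4096] = { 0 };          /* this sketch assumes M <= 4096 */
        double sum = 0.0;

        for (int n = 0; n < L_top && n < M; n++) {
            int best = -1;
            for (int k = 0; k < M; k++)   /* pick the next-largest density */
                if (!used[k] && (best < 0 || codeword_lik[k] > codeword_lik[best]))
                    best = k;
            used[best] = 1;
            sum += b_j[best] * codeword_lik[best];
        }
        return sum;
    }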

SP - Berlin Chen 59

Semicontinuous HMMs (cont.)

  (Figure: two HMMs, each with states s1, s2, s3; every state j keeps its own discrete weights b_j(1), ..., b_j(k), ..., b_j(M), while all states of all models share the same codebook of Gaussian kernels N(μ_1, Σ_1), N(μ_2, Σ_2), ..., N(μ_k, Σ_k), ..., N(μ_M, Σ_M))

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states, based on the initial model (finding the optimal state sequence by the Viterbi Algorithm)
  – Step 2:
    • For discrete density HMM (using an M-codeword codebook):
        b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For continuous density HMM (M Gaussian mixtures per state):
        cluster the observation vectors within each state into a set of M clusters, then
        ŵ_jm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
        μ̂_jm = sample mean of the vectors classified in cluster m of state j
        Σ̂_jm = sample covariance matrix of the vectors classified in cluster m of state j
  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop — the initial model is generated

  (Figure: a left-to-right HMM with states s1, s2, s3)
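As an illustration of Step 2 for the discrete-density case, the sketch below re-estimates b_j(k) as a normalized histogram from one Viterbi-segmented utterance. The array layout and the function name are assumptions made for this example, not part of the procedure as stated on the slide.

    /* state_of[t]: state index assigned to frame t by the Viterbi segmentation
     * code_of[t]:  codebook index of the vector observed at frame t
     * b[j][k] <- (number of vectors with codebook index k in state j)
     *            / (number of vectors in state j)                          */
    void reestimate_discrete_b(int T, int N, int M,
                               const int *state_of, const int *code_of,
                               double b[N][M])
    {
        double count[N];                          /* vectors per state */
        for (int j = 0; j < N; j++) {
            count[j] = 0.0;
            for (int k = 0; k < M; k++) b[j][k] = 0.0;
        }
        for (int t = 0; t < T; t++) {
            b[state_of[t]][code_of[t]] += 1.0;
            count[state_of[t]] += 1.0;
        }
        for (int j = 0; j < N; j++)
            if (count[j] > 0.0)
                for (int k = 0; k < M; k++) b[j][k] /= count[j];
    }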

SP - Berlin Chen 62

Initialization of HMM (cont.)

  (Flowchart: Training Data and an Initial Model feed a loop of State Sequence Segmentation → Estimate parameters of Observation via Segmental K-means → Model Re-estimation → Model Convergence? — NO: loop back to segmentation; YES: output the Model Parameters)

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for discrete HMM
  – 3 states and 2 codewords; after segmenting the 10 observations O1 ... O10 into states:
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3

  (Figure: a state-time trellis over O1 ... O10 showing the segmentation of the observations into the states s1, s2, s3 of a left-to-right HMM, with each observation labeled by codeword v1 or v2)

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for continuous HMM
  – 3 states and 4 Gaussian mixtures per state

  (Figure: a state-time trellis over O1 ... ON; the observation vectors assigned to each state are clustered by K-means — starting from the global mean, then cluster 1 mean, cluster 2 mean, ... — giving the per-state mixture parameters (w_11, μ_11, Σ_11), (w_12, μ_12, Σ_12), (w_13, μ_13, Σ_13), (w_14, μ_14, Σ_14), and likewise for states 2 and 3)

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution

        d_i(t) = a_ii^{t-1} (1 − a_ii)

    • It doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination
  – Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

  Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications
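A short, standard derivation (not on the slide) makes explicit what this geometric duration model implies: the expected duration of a stay in state i is determined entirely by the self-loop probability a_ii.

    \bar{d}_i \;=\; \sum_{t=1}^{\infty} t \, d_i(t)
              \;=\; \sum_{t=1}^{\infty} t \, a_{ii}^{\,t-1} (1 - a_{ii})
              \;=\; \frac{1}{1 - a_{ii}}
    % e.g. a_{ii} = 0.9 gives an expected duration of 10 frames, yet the most
    % probable duration under d_i(t) is still t = 1, which is rarely realistic
    % for speech segments.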

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

  (Figure: state-duration histograms compared with a geometric/exponential distribution, an empirical distribution, a Gaussian distribution, and a Gamma distribution)

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

  (Figure: likelihood surface over the model configuration space, with the current model configuration sitting at a local optimum)

SP - Berlin Chen 68

Homework-2 (1/2)

  (Figure: an initial 3-state ergodic HMM; every transition probability is roughly 1/3 (0.34 or 0.33), and the three states have initial output distributions (A:0.34, B:0.33, C:0.33), (A:0.33, B:0.34, C:0.33) and (A:0.33, B:0.33, C:0.34))

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to: ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

  (Figure: the speech signal passes through Feature Extraction to give the feature sequence X; the likelihoods p(X|M_1), p(X|M_2), ..., p(X|M_V) of the word models M_1 ... M_V and p(X|M_Sil) of the silence model M_Sil are computed, and the Most Likely Word Selector outputs the recognized label)

      Label(X) = argmax_k p(X | M_k)

  Viterbi Approximation:

      Label(X) = argmax_k max_S p(X, S | M_k)
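A trivial sketch of the word selector in the figure (the name and interface are illustrative): given the per-model log-likelihoods log p(X|M_k) returned by the forward procedure, or their Viterbi approximations, the recognized word is simply the argmax.

    /* logp[k] = log p(X|M_k)  (or the Viterbi approximation max_S log p(X,S|M_k)) */
    int most_likely_word(int V, const double *logp)
    {
        int best = 0;
        for (int k = 1; k < V; k++)
            if (logp[k] > logp[best])
                best = k;
        return best;             /* index of argmax_k log p(X|M_k) */
    }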

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming

• Example:
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
                ("the" deleted; "effect", "is", "clear" matched; "not" inserted)

  – Error analysis: one deletion and one insertion
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

      Word Error Rate      = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%
      Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence)   = 3/4 = 75%
      Word Accuracy Rate   = 100% × (Matched words − Ins) / (No. of words in the correct sentence) = (3−1)/4 = 50%

  Note: WER + WAR = 100%; WER might be higher than 100%, and correspondingly WAR might be negative

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

  (Figure: an alignment grid between the correct/reference word string (Ref) and the recognized/test word string (Test); one grid dimension is the word length of the reference sentence and the other is that of the test sentence; each grid cell [i,j] keeps the minimum word error alignment up to that point and is reached by one of the kinds of alignment: hit, substitution, deletion, or insertion)

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)    (Ref: index j, Test: index i)

  Step 1: Initialization
      G[0][0] = 0
      for i = 1 .. n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
      for j = 1 .. m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
      for i = 1 .. n (test)
        for j = 1 .. m (reference)
          G[i][j] = min( G[i-1][j]   + 1   (Insertion),
                         G[i][j-1]   + 1   (Deletion),
                         G[i-1][j-1] + 1   (Substitution, if LR[j] != LT[i]),
                         G[i-1][j-1]       (Match, if LR[j] == LT[i]) )
          B[i][j] = 1 (Insertion, horizontal direction), 2 (Deletion, vertical direction),
                    3 (Substitution, diagonal direction), or 4 (Match, diagonal direction)

  Step 3: Measure and Backtrace
      Word Error Rate    = 100% × G[n][m] / m
      Word Accuracy Rate = 100% − Word Error Rate
      Optimal backtrace path: from B[n][m] back to B[0][0]
        if B[i][j] == 1: print LT[i], "Insertion", then go left
        else if B[i][j] == 2: print LR[j], "Deletion", then go down
        else: print LR[j], "Hit/Match" or "Substitution", then go down diagonally

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK)
  – Initialization

    (Figure: the (n+1)×(m+1) alignment grid between the recognized/test word sequence (index i = 1 .. n) and the correct/reference word sequence (index j = 1 .. m); cell (i,j) is reached from (i-1,j) by an insertion, from (i,j-1) by a deletion, and from (i-1,j-1) by a hit or substitution; starting from (0,0), the first row accumulates insertions (1 Ins, 2 Ins, 3 Ins, ...) and the first column accumulates deletions (1 Del, 2 Del, 3 Del, ...))

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;

    for (i = 1; i <= n; i++) {        /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {        /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program

    for (i = 1; i <= n; i++) {            /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {        /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {                 /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];             /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {                     /* HOR = ins */
                gridi[j] = gridi1[j];               /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                                /* VERT = del */
                gridi[j] = gridi[j-1];              /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        } /* for j */
    }     /* for i */

• Example 1 (all penalties = 1)
    Correct: A C B C C
    Test:    B A B C

  (Figure: the filled alignment grid, each cell recording the running counts (Ins, Del, Sub, Hit); one optimal backtrace gives the alignment
     Ins B, Hit A, Del C, Hit B, Hit C, Del C
   i.e. Alignment 1, WER = 3/5 = 60%. There is still another optimal alignment.)
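The fragments above can be assembled into a single compilable routine. The sketch below is an illustrative reconstruction under my own assumptions (struct layout, array sizes, unit penalties); it is not the HTK implementation, but it reproduces the recurrence and the tie-breaking order of the program above, and its main() runs Example 1.

    #include <stdio.h>
    #include <string.h>

    #define MAXW 64
    enum { NIL, DIAG, HOR, VERT };

    typedef struct { int score, ins, del, sub, hit, dir; } Cell;

    /* ref[1..m] and test[1..n] hold single-character "words" as in Example 1 */
    static void align_wer(const char *ref, int m, const char *test, int n)
    {
        static Cell grid[MAXW][MAXW];
        memset(grid, 0, sizeof(grid));

        for (int i = 1; i <= n; i++) {                 /* first row: insertions */
            grid[i][0] = grid[i-1][0];
            grid[i][0].dir = HOR; grid[i][0].score += 1; grid[i][0].ins++;
        }
        for (int j = 1; j <= m; j++) {                 /* first column: deletions */
            grid[0][j] = grid[0][j-1];
            grid[0][j].dir = VERT; grid[0][j].score += 1; grid[0][j].del++;
        }
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int h = grid[i-1][j].score + 1;                      /* insertion  */
                int v = grid[i][j-1].score + 1;                      /* deletion   */
                int d = grid[i-1][j-1].score + (ref[j] != test[i]);  /* hit or sub */
                if (d <= h && d <= v) {
                    grid[i][j] = grid[i-1][j-1]; grid[i][j].score = d; grid[i][j].dir = DIAG;
                    if (ref[j] == test[i]) grid[i][j].hit++; else grid[i][j].sub++;
                } else if (h < v) {
                    grid[i][j] = grid[i-1][j];   grid[i][j].score = h; grid[i][j].dir = HOR;  grid[i][j].ins++;
                } else {
                    grid[i][j] = grid[i][j-1];   grid[i][j].score = v; grid[i][j].dir = VERT; grid[i][j].del++;
                }
            }

        Cell *e = &grid[n][m];
        printf("Ins=%d Del=%d Sub=%d Hit=%d  WER=%.1f%%\n",
               e->ins, e->del, e->sub, e->hit, 100.0 * e->score / m);
    }

    int main(void)
    {
        /* Example 1 above: Correct = A C B C C, Test = B A B C  ->  WER = 60% */
        char ref[]  = " ACBCC";   /* 1-based: ref[1..5]  */
        char test[] = " BABC";    /* 1-based: test[1..4] */
        align_wer(ref, 5, test, 4);
        return 0;
    }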

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (all penalties = 1)
    Correct: A C B C C
    Test:    B A A C

  (Figure: the filled alignment grid with per-cell (Ins, Del, Sub, Hit) counts; three different backtraces are all optimal)

    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C     WER = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C     WER = 80%
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C            WER = 80%

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

    HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
    NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

    Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
    ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (in the original files, each character is preceded by two score fields, "100000 100000")

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

    ------------------------ Overall Results (506 stories) --------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    ==================================================================================
    ------------------------ Overall Results (1 story) ------------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    ==================================================================================
    ------------------------ Overall Results (100 stories) --------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    ==================================================================================
    ------------------------ Overall Results (200 stories) --------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
    ==================================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  (Figure, left: two bottles A and B containing red (R) and green (G) balls.
   Observed data O: the "ball sequence"; latent data S: the "bottle sequence".
   Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).)

  (Figure, right: a 3-state HMM λ with output distributions such as {A:0.3, B:0.2, C:0.5}, {A:0.7, B:0.1, C:0.2}, {A:0.3, B:0.6, C:0.1} and transition probabilities 0.7, 0.6, 0.3, 0.2, 0.1, ...; given the training observations o1 o2 ... oT, re-estimation yields a new model λ̄ with p(O|λ̄) > p(O|λ).)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data
      (in our case here, the state sequence S is the latent data)
    • Direct access to the data necessary to estimate the parameters is impossible or difficult
      (in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence)
  – Two Major Steps:
    • E: take the expectation with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations O
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = X_1, X_2, ..., X_n → x = (x_1, x_2, ..., x_n)   (ML and MAP)

  – The Maximum Likelihood (ML) Principle:
    find the model parameter Φ so that the likelihood p(x|Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimate of Φ = {μ, Σ} is

        μ_ML = (1/n) Σ_{i=1}^{n} x_i
        Σ_ML = (1/n) Σ_{i=1}^{n} (x_i − μ_ML)(x_i − μ_ML)^T

  – The Maximum A Posteriori (MAP) Principle:
    find the model parameter Φ so that the posterior likelihood p(Φ|x) is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)

• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ̂ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S|O,λ̂), to compute a new λ, the maximum likelihood estimate of λ
– Does the process converge?

– Algorithm
  • Log-likelihood expression and expectation taken over S:

      P(O, S|λ) = P(S|O,λ) P(O|λ)                                   (Bayes' rule)
      ⇒  log P(O|λ) = log P(O, S|λ) − log P(S|O,λ)
         (incomplete-data likelihood on the left, complete-data likelihood on the right; λ is the unknown model setting)

      Taking the expectation over S under P(S|O,λ̂), i.e. under the current estimate λ̂:

      log P(O|λ) = Σ_S P(S|O,λ̂) log P(O, S|λ) − Σ_S P(S|O,λ̂) log P(S|O,λ)

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express log P(O|λ) as follows:

      log P(O|λ) = Q(λ̂, λ) + H(λ̂, λ)

      where  Q(λ̂, λ) = Σ_S P(S|O,λ̂) log P(O, S|λ)
             H(λ̂, λ) = − Σ_S P(S|O,λ̂) log P(S|O,λ)

  • We want  log P(O|λ) ≥ log P(O|λ̂), and

      log P(O|λ) − log P(O|λ̂) = [ Q(λ̂, λ) − Q(λ̂, λ̂) ] + [ H(λ̂, λ) − H(λ̂, λ̂) ]

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ̂, λ) − H(λ̂, λ̂) has the following property:

    H(λ̂, λ) − H(λ̂, λ̂) = − Σ_S P(S|O,λ̂) log [ P(S|O,λ) / P(S|O,λ̂) ]
                        ≥ − Σ_S P(S|O,λ̂) [ P(S|O,λ) / P(S|O,λ̂) − 1 ]       (log x ≤ x − 1, Jensen's inequality)
                        = − [ Σ_S P(S|O,λ) − Σ_S P(S|O,λ̂) ]
                        = 0

  (H(λ̂, λ) − H(λ̂, λ̂) is the Kullback-Leibler (KL) distance between P(S|O,λ̂) and P(S|O,λ))

– Therefore, for maximizing log P(O|λ), we only need to maximize the Q-function (auxiliary function)

    Q(λ̂, λ) = Σ_S P(S|O,λ̂) log P(O, S|λ)

  (the expectation of the complete-data log-likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function

      Q(λ̂, λ) = Σ_S P(S|O,λ̂) log P(O, S|λ) = Σ_S [ P(O, S|λ̂) / P(O|λ̂) ] log P(O, S|λ)

  – Where P(O, S|λ) and log P(O, S|λ) can be expressed as

      P(O, S|λ) = π_{s_1} Π_{t=1}^{T-1} a_{s_t s_{t+1}} Π_{t=1}^{T} b_{s_t}(o_t)

      log P(O, S|λ) = log π_{s_1} + Σ_{t=1}^{T-1} log a_{s_t s_{t+1}} + Σ_{t=1}^{T} log b_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

    Q(λ̂, λ) = Q_π(λ̂, π) + Q_a(λ̂, a) + Q_b(λ̂, b)

    Q_π(λ̂, π) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ̂) / P(O|λ̂) ] log π_i

    Q_a(λ̂, a) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ̂) / P(O|λ̂) ] log a_ij

    Q_b(λ̂, b) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1, o_t = v_k}^{T} [ P(O, s_t = j | λ̂) / P(O|λ̂) ] log b_j(k)

  (each term has the general form Σ_i w_i log y_i treated on the next slide)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π_i, a_ij and b_j(k)
  – They can be maximized individually
  – All are of the same form:

      F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j,   where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0,

      has its maximum value when   y_j = w_j / Σ_{j'=1}^{N} w_j'

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange Multiplier

    By applying the Lagrange multiplier λ with the constraint Σ_{j=1}^{N} y_j = 1, suppose that

        F = Σ_{j=1}^{N} w_j log y_j + λ ( Σ_{j=1}^{N} y_j − 1 )

    ∂F/∂y_j = w_j / y_j + λ = 0   ⇒   w_j = −λ y_j,  for all j

    ⇒  Σ_{j=1}^{N} w_j = −λ Σ_{j=1}^{N} y_j = −λ

    ⇒  y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as

    π̄_i = P(O, s_1 = i | λ̂) / P(O|λ̂) = γ_1(i)

    ā_ij = Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ̂) / P(O|λ̂) ]  /  Σ_{t=1}^{T-1} [ P(O, s_t = i | λ̂) / P(O|λ̂) ]
         = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    b̄_i(k) = Σ_{t=1, o_t = v_k}^{T} [ P(O, s_t = i | λ̂) / P(O|λ̂) ]  /  Σ_{t=1}^{T} [ P(O, s_t = i | λ̂) / P(O|λ̂) ]
           = Σ_{t=1, o_t = v_k}^{T} γ_t(i) / Σ_{t=1}^{T} γ_t(i)
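Once ξ and γ have been accumulated over the training data, turning them into the new parameters is a matter of a few divisions. The sketch below uses illustrative names for the accumulators and assumes every state is visited (non-zero denominators); it is not taken from any toolkit.

    /* xi_sum[i][j]  = sum_{t=1..T-1} xi_t(i,j)
     * gam_trans[i]  = sum_{t=1..T-1} gamma_t(i)
     * gam_obs[i][k] = sum over {t : o_t = v_k} of gamma_t(i)
     * gam_tot[i]    = sum_{t=1..T}   gamma_t(i)
     * gam_1[i]      = gamma_1(i)                                        */
    void reestimate_discrete_hmm(int N, int M,
                                 double xi_sum[N][N], double gam_trans[N],
                                 double gam_obs[N][M], double gam_tot[N],
                                 double gam_1[N],
                                 double pi[N], double A[N][N], double B[N][M])
    {
        for (int i = 0; i < N; i++) {
            pi[i] = gam_1[i];                               /* new initial probability    */
            for (int j = 0; j < N; j++)
                A[i][j] = xi_sum[i][j] / gam_trans[i];      /* new transition probability */
            for (int k = 0; k < M; k++)
                B[i][k] = gam_obs[i][k] / gam_tot[i];       /* new output probability     */
        }
    }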

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)
             = Σ_{k=1}^{M} c_jk (2π)^{-L/2} |Σ_jk|^{-1/2} exp( −(1/2)(o − μ_jk)^T Σ_jk^{-1} (o − μ_jk) ),
      with Σ_{k=1}^{M} c_jk = 1

  (Figure: the distribution for state i as a weighted sum of the Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3)

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

    p(O, S|λ) = Π_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)                      (with a_{s_0 s_1} ≡ π_{s_1})
              = Π_{t=1}^{T} a_{s_{t-1} s_t} Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t)
              = Σ_K p(O, S, K|λ)

    where  p(O, S, K|λ) = Π_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)

    and K = (k_1, k_2, ..., k_T) is one of the possible mixture-component sequences along the state sequence S

    p(O|λ) = Σ_S Σ_K p(O, S, K|λ)

  Note: the interchange of product and sum uses the expansion

      Π_{t=1}^{T} Σ_{k=1}^{M} a_{t k} = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t k_t}

      (e.g. (a_11 + a_12 + ... + a_1M)(a_21 + ... + a_2M) ... expands into a sum over all index sequences)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ̂, λ) = Σ_S Σ_K P(S, K | O, λ̂) log p(O, S, K|λ)
             = Σ_S Σ_K [ p(O, S, K|λ̂) / p(O|λ̂) ] log p(O, S, K|λ)

    log p(O, S, K|λ) = Σ_{t=1}^{T} log a_{s_{t-1} s_t} + Σ_{t=1}^{T} log c_{s_t k_t} + Σ_{t=1}^{T} log b_{s_t k_t}(o_t)

    ⇒  Q(λ̂, λ) = Q_π(λ̂, π) + Q_a(λ̂, a) + Q_b(λ̂, b) + Q_c(λ̂, c)

       (initial probabilities; state transition probabilities; Gaussian density functions; mixture-component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with discrete HMM training:

    Q_b(λ̂, b) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ̂) log b_jk(o_t)

    Q_c(λ̂, c) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ̂) log c_jk

  where P(s_t = j, k_t = k | O, λ̂) = γ_t(j, k)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Re-estimation of the mean vectors. Let γ_t(j,k) = P(s_t = j, k_t = k | O, λ̂) and

    b_jk(o_t) = N(o_t; μ_jk, Σ_jk) = (2π)^{-L/2} |Σ_jk|^{-1/2} exp( −(1/2)(o_t − μ_jk)^T Σ_jk^{-1} (o_t − μ_jk) )

    log b_jk(o_t) = −(L/2) log(2π) − (1/2) log|Σ_jk| − (1/2)(o_t − μ_jk)^T Σ_jk^{-1} (o_t − μ_jk)

  Setting the derivative of Q_b(λ̂, b) with respect to μ_jk to zero,

    ∂Q_b/∂μ_jk = Σ_{t=1}^{T} γ_t(j,k) Σ_jk^{-1} (o_t − μ_jk) = 0

    ⇒  μ̄_jk = Σ_{t=1}^{T} γ_t(j,k) o_t / Σ_{t=1}^{T} γ_t(j,k)

  (matrix calculus: d(x^T C x)/dx = (C + C^T) x = 2 C x for symmetric C, and Σ_jk^{-1} is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Re-estimation of the covariance matrices. Differentiating Q_b(λ̂, b) with respect to Σ_jk (most conveniently with respect to Σ_jk^{-1}) and setting the result to zero,

    ∂Q_b/∂Σ_jk^{-1} = Σ_{t=1}^{T} γ_t(j,k) [ (1/2) Σ_jk − (1/2)(o_t − μ_jk)(o_t − μ_jk)^T ] = 0

    ⇒  Σ̄_jk = Σ_{t=1}^{T} γ_t(j,k) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j,k)

  (matrix calculus: d det(X)/dX = det(X)(X^{-1})^T and d(a^T X b)/dX = a b^T; Σ_jk is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    μ̄_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ̂) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ̂)
          = Σ_{t=1}^{T} γ_t(j,k) o_t / Σ_{t=1}^{T} γ_t(j,k)

    Σ̄_jk = Σ_{t=1}^{T} γ_t(j,k) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1}^{T} γ_t(j,k)

    c̄_jk = Σ_{t=1}^{T} γ_t(j,k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j,m)


Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

Let \gamma_t(j, k) = P(s_t = j, k_t = k \mid O, \lambda), and write the Gaussian component as

\bar{b}_{jk}(o_t) = N(o_t; \bar{\mu}_{jk}, \bar{\Sigma}_{jk}) = \frac{1}{(2\pi)^{L/2} |\bar{\Sigma}_{jk}|^{1/2}} \exp\!\left( -\frac{1}{2} (o_t - \bar{\mu}_{jk})^{T} \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk}) \right)

\log \bar{b}_{jk}(o_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log|\bar{\Sigma}_{jk}| - \frac{1}{2} (o_t - \bar{\mu}_{jk})^{T} \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk})

Setting \partial Q_b / \partial \bar{\mu}_{jk} = 0, and using \frac{\partial (x^{T} C x)}{\partial x} = (C + C^{T}) x (C = \bar{\Sigma}_{jk}^{-1} is symmetric here):

\frac{\partial Q_b}{\partial \bar{\mu}_{jk}} = \sum_{t=1}^{T} \gamma_t(j, k)\, \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk}) = 0

\Rightarrow\quad \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j, k)}

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Similarly, setting \partial Q_b / \partial \bar{\Sigma}_{jk}^{-1} = 0, using \frac{\partial \log\det(X)}{\partial X} = (X^{-1})^{T} and \frac{\partial (a^{T} X b)}{\partial X} = a b^{T} (with \bar{\Sigma}_{jk} symmetric and \log|\bar{\Sigma}_{jk}| = -\log|\bar{\Sigma}_{jk}^{-1}|):

\frac{\partial Q_b}{\partial \bar{\Sigma}_{jk}^{-1}} = \frac{1}{2} \sum_{t=1}^{T} \gamma_t(j, k) \left[ \bar{\Sigma}_{jk} - (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T} \right] = 0

\Rightarrow\quad \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j, k)}

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)\, o_t}{\sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j, k)}

\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j, k)}

\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j, m)}
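Putting the continuous-density updates together, the following C sketch re-estimates the weight, mean and variance of one mixture component (j, k) from the accumulated \gamma_t(j,k); it is only an illustration under a diagonal-covariance assumption, with hypothetical names, not the slides' own code.

/* Re-estimate one Gaussian mixture component (j,k) of a continuous HMM.
   gamma_jk[t]  : gamma_t(j,k) for t = 0..T-1 (from the E-step)
   occ_state_j  : sum over t and over all mixtures m of gamma_t(j,m)
   obs[t*dim+d] : feature vectors; dim = feature dimension
   Diagonal covariance is assumed for simplicity. */
void update_component(int T, int dim, const double *gamma_jk, double occ_state_j,
                      const double *obs, double *c_jk, double *mu_jk, double *var_jk)
{
    double occ = 0.0;
    for (int t = 0; t < T; t++) occ += gamma_jk[t];
    *c_jk = occ_state_j > 0.0 ? occ / occ_state_j : 0.0;    /* mixture weight */

    for (int d = 0; d < dim; d++) {                          /* weighted mean */
        double num = 0.0;
        for (int t = 0; t < T; t++) num += gamma_jk[t] * obs[t * dim + d];
        mu_jk[d] = occ > 0.0 ? num / occ : 0.0;
    }
    for (int d = 0; d < dim; d++) {                          /* weighted variance */
        double num = 0.0;
        for (int t = 0; t < T; t++) {
            double diff = obs[t * dim + d] - mu_jk[d];
            num += gamma_jk[t] * diff * diff;
        }
        var_jk[d] = occ > 0.0 ? num / occ : 0.0;
    }
}

Accumulating the numerators and denominators over all L training utterances before the division gives the multiple-utterance versions of the same formulas.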



Page 28: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 28

Basic Problem 1 of HMM- The Forward Procedure (cont)

tj

N

iijt

tj

N

itttt

tj

N

ittttt

tj

N

ittt

tjtt

tttt

ttttt

ttt

ttt

obai

obisjsPisoooP

obisooojsPisoooP

objsisoooP

objsoooP

jsoPjsoooP

jsPjsoPjsoooP

jsPjsoooP

jsoooPj

11

111121

111211121

11121

121

121

121

21

21

λλ

λλ

λ

λ

λλ

λλλ

λλ

λ

first-order Markovassumption

BAPAPABP

tjtt objsoP λ

Ball

BAPAP

APABPBAP outputindependentassumption

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  - A maximum substring matching problem
  - Can be handled by dynamic programming

• Example:

  Correct: "the effect is clear"
  Recognized: "effect is not clear"

  - Error analysis: one deletion ("the") and one insertion ("not"); the remaining three words are matched
  - Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

  Word Error Rate = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%
  Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%
  Word Accuracy Rate = 100% × (Matched words - Ins) / (No. of words in the correct sentence) = (3 - 1)/4 = 50%

  Note: WER + WAR = 100%; the WER might be higher than 100%, and the WAR might be negative.

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

[Figure: the alignment grid. One axis (Ref, index i) runs over the words of the correct/reference sentence and the other (Test, index j) over the words of the recognized/test sentence; one variable denotes the word length of the correct/reference sentence and the other the word length of the recognized/test sentence. Grid cell [i, j] stores the minimum word-error alignment up to that point, and each move corresponds to one of the kinds of alignment (hit, substitution, insertion, deletion).]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)
  (Here i indexes the test sentence, 1 ≤ i ≤ n, and j indexes the reference sentence, 1 ≤ j ≤ m. The penalties for substitution, deletion and insertion errors are all set to be 1 here.)

Step 1: Initialization
  G[0][0] = 0
  for i = 1 ... n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
  for j = 1 ... m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

Step 2: Iteration
  for i = 1 ... n (test), for j = 1 ... m (reference):
    G[i][j] = min { G[i-1][j] + 1        (Insertion),
                    G[i][j-1] + 1        (Deletion),
                    G[i-1][j-1] + 1      (Substitution, if LT[i] ≠ LR[j]),
                    G[i-1][j-1]          (Match, if LT[i] = LR[j]) }
    B[i][j] = 1 (Insertion, horizontal direction), 2 (Deletion, vertical direction),
              3 (Substitution, diagonal direction) or 4 (Match, diagonal direction),
              according to the term chosen in the min

Step 3: Measure and backtrace
  Word Error Rate = 100% × G[n][m] / m
  Word Accuracy Rate = 100% - Word Error Rate
  Optimal backtrace path: from B[n][m] back to B[0][0]
    if B[i][j] = 1: print LT[i] (Insertion), then go left
    else if B[i][j] = 2: print LR[j] (Deletion), then go down
    else: print LR[j] (Hit/Match or Substitution), then go diagonally down
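The same alignment as a small self-contained C program (a sketch written for this transcript with unit penalties; it is not the textbook's or HTK's code):

#include <stdio.h>
#include <string.h>

#define MAXW 64

/* G[i][j] = minimum number of errors aligning the first i test words
   with the first j reference words (unit penalties, as in Steps 1-3). */
static int word_errors(char test[][MAXW], int n, char ref[][MAXW], int m)
{
    int G[32][32];                                /* assumes n, m < 32 for this sketch */
    for (int i = 0; i <= n; i++) G[i][0] = i;     /* i insertions  */
    for (int j = 0; j <= m; j++) G[0][j] = j;     /* j deletions   */
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int sub = G[i-1][j-1] + (strcmp(test[i-1], ref[j-1]) != 0);
            int ins = G[i-1][j] + 1;
            int del = G[i][j-1] + 1;
            int best = sub;
            if (ins < best) best = ins;
            if (del < best) best = del;
            G[i][j] = best;
        }
    }
    return G[n][m];
}

int main(void)
{
    /* the example of (2/8): Correct "the effect is clear", Recognized "effect is not clear" */
    char ref[][MAXW]  = { "the", "effect", "is", "clear" };
    char test[][MAXW] = { "effect", "is", "not", "clear" };
    int m = 4, n = 4;
    int e = word_errors(test, n, ref, m);
    printf("errors = %d, WER = %.1f%%\n", e, 100.0 * e / m);
    return 0;
}

Running it on that example prints errors = 2, WER = 50.0%, matching the hand-worked result above.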

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: the DP grid spanning the correct/reference word sequence (j = 1 ... m) and the recognized/test word sequence (i = 1 ... n), starting from cell (0,0). Moves along the test axis are insertions, moves along the reference axis are deletions, and diagonal moves from (i-1, j-1) to (i, j) are substitutions or hits; this is the HTK-style alignment grid.]

• A Dynamic Programming Algorithm - Initialization (HTK-style):

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;
for (i = 1; i <= n; i++) {            /* test */
    grid[i][0] = grid[i-1][0];
    grid[i][0].dir = HOR;
    grid[i][0].score += InsPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j] = grid[0][j-1];
    grid[0][j].dir = VERT;
    grid[0][j].score += DelPen;
    grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program:

for (i = 1; i <= n; i++) {                        /* test */
    gridi  = grid[i];
    gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {                    /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i])
            d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {                   /* DIAG = hit or sub */
            gridi[j] = gridi1[j-1];
            gridi[j].score = d;
            gridi[j].dir = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit;
            else ++gridi[j].sub;
        }
        else if (h < v) {                         /* HOR = ins */
            gridi[j] = gridi1[j];
            gridi[j].score = h;
            gridi[j].dir = HOR;
            ++gridi[j].ins;
        }
        else {                                    /* VERT = del */
            gridi[j] = gridi[j-1];
            gridi[j].score = v;
            gridi[j].dir = VERT;
            ++gridi[j].del;
        }
    } /* for j */
} /* for i */

• Example 1:
  Correct: A C B C C
  Test:    B A B C

[Figure: the filled DP grid (HTK-style); each cell records its accumulated (Ins, Del, Sub, Hit) counts. Backtracing from the final cell gives the alignment Ins B, Hit A, Del C, Hit B, Hit C, Del C.]

Alignment 1: WER = 3/5 = 60% (another optimal alignment also exists)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2:
  Correct: A C B C C
  Test:    B A A C

[Figure: the filled DP grid; each cell records its accumulated (Ins, Del, Sub, Hit) counts. Three different alignments are optimal, each with WER = 4/5 = 80%:
  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C]

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors:
  - HTK error penalties: subPen = 10, delPen = 7, insPen = 7
  - NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

Reference (in the original listing each character is preceded by two "100000 100000" fields):
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……

ASR Output:
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  - Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
  - The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
===================================================================

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
===================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
===================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
===================================================================
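These summary blocks appear to follow the HTK HResults convention, so the percentages can be reproduced from the counts (H hits, D deletions, S substitutions, I insertions, N reference tokens):

%Corr = H / N × 100%,    Acc = (H - I) / N × 100%

For the full 506-story set this gives 57144/65812 = 86.83% and (57144 - 504)/65812 = 86.06%, matching the first block.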

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: the two-bottle example. Balls are drawn repeatedly from bottles A and B:
  Observed data O: the "ball sequence" (the colors, e.g. R or G, that are drawn)
  Latent data S: the "bottle sequence" (which bottle each ball came from)
  Parameters λ to be estimated so as to maximize log P(O | λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)]

[Figure: a 3-state HMM λ (states s1, s2, s3) with discrete emission probabilities such as {A:0.3, B:0.2, C:0.5}, {A:0.7, B:0.1, C:0.2} and {A:0.3, B:0.6, C:0.1}, generating the observation sequence o_1 o_2 ... o_T with likelihood p(O | λ). Re-estimation yields an updated model λ' with p(O | λ') > p(O | λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  - Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence S is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence
  - Two Major Steps:
    • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations, E_S[ · | O, λ]
    • M: provides a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations:
  - The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X | Φ) is maximum. For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X = (x_1, x_2, ..., x_n) is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

    μ_ML = (1/n) Σ_{i=1}^{n} x_i
    Σ_ML = (1/n) Σ_{i=1}^{n} (x_i - μ_ML)(x_i - μ_ML)^T

  - The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior p(Φ | X) is maximum
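A small C sketch of these ML estimates for the diagonal-covariance case (written for this transcript; the slide's formulas cover the full-covariance case, and the data below are invented for illustration):

#include <stdio.h>

#define N 4   /* number of samples */
#define D 2   /* feature dimension */

int main(void)
{
    double x[N][D] = { {1.0, 2.0}, {2.0, 3.0}, {3.0, 5.0}, {2.0, 2.0} };
    double mean[D] = {0}, var[D] = {0};

    /* mu_ML = (1/n) * sum_i x_i */
    for (int i = 0; i < N; i++)
        for (int d = 0; d < D; d++)
            mean[d] += x[i][d] / N;

    /* diagonal of Sigma_ML = (1/n) * sum_i (x_i - mu)(x_i - mu)^T */
    for (int i = 0; i < N; i++)
        for (int d = 0; d < D; d++) {
            double c = x[i][d] - mean[d];
            var[d] += c * c / N;
        }

    for (int d = 0; d < D; d++)
        printf("dim %d: mean = %.3f, var = %.3f\n", d, mean[d], var[d]);
    return 0;
}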

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  - Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O | λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S | λ)

• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  - The observable training data O
    • We want to maximize P(O | λ); λ is a parameter vector
  - The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  - Assume we have λ and estimate the probability that each S occurred in the generation of O
  - Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S | λ), to compute a new λ', the maximum likelihood estimate of λ
  - Does the process converge?
  - Algorithm:
    • Log-likelihood expression, and expectation taken over S:

      By Bayes' rule:  P(O, S | λ') = P(S | O, λ') P(O | λ')
      ⇒ log P(O | λ') = log P(O, S | λ') - log P(S | O, λ')
        (incomplete-data log-likelihood = complete-data log-likelihood minus the state-posterior term, under the unknown new model setting λ')

      Taking the expectation over S with respect to P(S | O, λ), the state-sequence posterior under the current model λ:

      log P(O | λ') = Σ_S P(S | O, λ) log P(O, S | λ') - Σ_S P(S | O, λ) log P(S | O, λ')

SP - Berlin Chen 87

The EM Algorithm (6/7)

  - Algorithm (cont.):
    • We can thus express log P(O | λ') as follows:

      log P(O | λ') = Q(λ, λ') - H(λ, λ')

      where

      Q(λ, λ') = Σ_S P(S | O, λ) log P(O, S | λ')
      H(λ, λ') = Σ_S P(S | O, λ) log P(S | O, λ')

    • We want log P(O | λ') ≥ log P(O | λ), and

      log P(O | λ') - log P(O | λ) = [Q(λ, λ') - Q(λ, λ)] + [H(λ, λ) - H(λ, λ')]

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ') has the following property:

  H(λ, λ') - H(λ, λ) = Σ_S P(S | O, λ) log [ P(S | O, λ') / P(S | O, λ) ]
                     ≤ Σ_S P(S | O, λ) [ P(S | O, λ') / P(S | O, λ) - 1 ]    (Jensen's inequality: log x ≤ x - 1)
                     = Σ_S P(S | O, λ') - Σ_S P(S | O, λ) = 1 - 1 = 0

  so H(λ, λ') ≤ H(λ, λ); the non-negative difference H(λ, λ) - H(λ, λ') is the Kullback-Leibler (KL) distance.

  - Therefore, for maximizing log P(O | λ') we only need to maximize the Q-function (auxiliary function)

    Q(λ, λ') = Σ_S P(S | O, λ) log P(O, S | λ'),

    the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  - By maximizing the auxiliary function

    Q(λ, λ') = Σ_S P(S | O, λ) log P(O, S | λ')
             = Σ_S [ P(O, S | λ) / P(O | λ) ] log P(O, S | λ')

  - where P(O, S | λ) and log P(O, S | λ') can be expressed as

    P(O, S | λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)

    log P(O, S | λ') = log π'_{s_1} + Σ_{t=2}^{T} log a'_{s_{t-1} s_t} + Σ_{t=1}^{T} log b'_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

  Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b')

  where

  Q_π(λ, π') = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O | λ) ] log π'_i

  Q_a(λ, a') = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O | λ) ] log a'_{ij}

  Q_b(λ, b') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O | λ) ] log b'_j(v_k)

  - Each of the three terms has the common form Σ_j w_j log y_j (weights w_i, variables y_i), which is maximized next.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π'_i, a'_{ij} and b'_j(k)
  - They can be maximized individually
  - All are of the same form:

    F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j,   where y_j ≥ 0 and Σ_{j=1}^{N} y_j = 1,

    and F has its maximum value when

    y_j = w_j / Σ_{j=1}^{N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  By applying the Lagrange multiplier ℓ with the constraint Σ_{j=1}^{N} y_j = 1:

  F = Σ_{j=1}^{N} w_j log y_j + ℓ ( 1 - Σ_{j=1}^{N} y_j )

  ∂F/∂y_j = w_j / y_j - ℓ = 0   ⇒   w_j = ℓ y_j   for all j

  Summing over j:   Σ_{j=1}^{N} w_j = ℓ Σ_{j=1}^{N} y_j = ℓ

  ⇒   y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ' = (π', A', B') can be expressed as:

  π'_i = P(O, s_1 = i | λ) / P(O | λ)

  a'_{ij} = Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / Σ_{t=1}^{T-1} P(O, s_t = i | λ)

  b'_i(v_k) = Σ_{t: o_t = v_k} P(O, s_t = i | λ) / Σ_{t=1}^{T} P(O, s_t = i | λ)

  (The P(O | λ) factors cancel in the ratios; these quantities are exactly the occupation and transition counts γ_1(i), ξ_t(i, j) and γ_t(i) introduced in the intuitive view earlier.)

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3
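A C sketch of evaluating this state output probability for the common diagonal-covariance case (function names and toy parameters are chosen here for illustration; they are not from the slides):

#include <math.h>
#include <stdio.h>

#define DIM 2    /* feature dimension  */
#define MIX 2    /* mixtures per state */

/* log b_j(o) = log sum_k c_jk * N(o; mu_jk, diag(var_jk)) for one state. */
double state_log_prob(const double o[DIM], const double c[MIX],
                      const double mu[MIX][DIM], const double var[MIX][DIM])
{
    const double TWO_PI = 6.283185307179586;
    double sum = 0.0;
    for (int k = 0; k < MIX; k++) {
        double log_gauss = 0.0;
        for (int d = 0; d < DIM; d++) {
            double diff = o[d] - mu[k][d];
            log_gauss += -0.5 * (log(TWO_PI * var[k][d]) + diff * diff / var[k][d]);
        }
        sum += c[k] * exp(log_gauss);   /* real systems use log-add here to avoid underflow */
    }
    return log(sum);
}

int main(void)
{
    double c[MIX] = {0.6, 0.4};
    double mu[MIX][DIM]  = { {0.0, 0.0}, {3.0, 3.0} };
    double var[MIX][DIM] = { {1.0, 1.0}, {2.0, 2.0} };
    double o[DIM] = {0.5, -0.2};
    printf("log b(o) = %f\n", state_log_prob(o, c, mu, var));
    return 0;
}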

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_{jk}(o_t):

  p(O, S | λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
              = π_{s_1} [ Σ_{k=1}^{M} c_{s_1 k} b_{s_1 k}(o_1) ] Π_{t=2}^{T} a_{s_{t-1} s_t} [ Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t) ]
              = Σ_K p(O, S, K | λ)

  where K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along the state sequence S, and

  p(O, S, K | λ) = π_{s_1} c_{s_1 k_1} b_{s_1 k_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t),

  so that

  p(O | λ) = Σ_S Σ_K p(O, S, K | λ)

  Note: the expansion uses the identity

  Π_{t=1}^{T} ( Σ_{k=1}^{M} a_{tk} ) = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t k_t}
  (i.e., (a_{11}+...+a_{1M})(a_{21}+...+a_{2M})...(a_{T1}+...+a_{TM}) expanded term by term)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

  Q(λ, λ') = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ')
           = Σ_S Σ_K [ p(O, S, K | λ) / p(O | λ) ] log p(O, S, K | λ')

  with

  log p(O, S, K | λ') = log π'_{s_1} + Σ_{t=2}^{T} log a'_{s_{t-1} s_t} + Σ_{t=1}^{T} log b'_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c'_{s_t k_t},

  so that Q decomposes into independent terms:

  Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b') + Q_c(λ, c')

  - initial probabilities (π'), state transition probabilities (a'), Gaussian density functions (b'), and mixture component weights (c')

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference, compared with discrete HMM training, lies in the terms for the mixture density functions and the mixture weights:

  Q_b(λ, b') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b'_{jk}(o_t)

  Q_c(λ, c') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log c'_{jk}

  where P(s_t = j, k_t = k | O, λ) = γ_t(j, k) is the probability of being in state j at time t with the k-th mixture component accounting for o_t.
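The per-mixture occupancy γ_t(j, k) can be obtained by splitting the state occupancy γ_t(j) across the mixture components; a small C sketch (names chosen here for illustration; dens[k] holds the already-evaluated component density values N(o_t; μ_jk, Σ_jk)):

/* gamma_t(j, k) = gamma_t(j) * c_jk * N(o_t; mu_jk, Sigma_jk)
                   / sum_m c_jm * N(o_t; mu_jm, Sigma_jm)        */
void split_occupancy(double gamma_tj, const double c[], const double dens[],
                     int num_mix, double gamma_tjk[])
{
    double denom = 0.0;
    for (int m = 0; m < num_mix; m++)
        denom += c[m] * dens[m];
    for (int k = 0; k < num_mix; k++)
        gamma_tjk[k] = (denom > 0.0) ? gamma_tj * c[k] * dens[k] / denom : 0.0;
}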

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

  Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). Substituting the Gaussian density

  b'_{jk}(o_t) = (2π)^{-L/2} |Σ'_{jk}|^{-1/2} exp( -(1/2) (o_t - μ'_{jk})^T (Σ'_{jk})^{-1} (o_t - μ'_{jk}) )

  into Q_b gives

  Q_b(λ, b') = Σ_{t=1}^{T} Σ_{j=1}^{N} Σ_{k=1}^{M} γ_t(j, k) [ -(L/2) log(2π) - (1/2) log |Σ'_{jk}| - (1/2) (o_t - μ'_{jk})^T (Σ'_{jk})^{-1} (o_t - μ'_{jk}) ]

  Setting the derivative with respect to μ'_{jk} to zero, using d(x^T C x)/dx = (C + C^T) x and the symmetry of (Σ'_{jk})^{-1}:

  ∂Q_b/∂μ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) (Σ'_{jk})^{-1} (o_t - μ'_{jk}) = 0

  ⇒  μ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

  Similarly, setting the derivative of Q_b with respect to (Σ'_{jk})^{-1} to zero, using d(a^T X b)/dX = a b^T and d log det(X)/dX = (X^{-1})^T with Σ'_{jk} symmetric:

  ∂Q_b/∂(Σ'_{jk})^{-1} = Σ_{t=1}^{T} γ_t(j, k) [ (1/2) Σ'_{jk} - (1/2) (o_t - μ'_{jk})(o_t - μ'_{jk})^T ] = 0

  ⇒  Σ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ'_{jk})(o_t - μ'_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as:

  μ'_{jk} = Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ)
          = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

  Σ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ'_{jk})(o_t - μ'_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

  c'_{jk} = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
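As a closing illustration, a C sketch of accumulating these statistics over one utterance for a single state/mixture pair with a diagonal covariance (array and function names are invented for this transcript; gamma_jk[t] is the per-mixture occupancy from the sketch above):

#define DIM 13

/* new_mu  = sum_t gamma_jk[t] * o[t]            / sum_t gamma_jk[t]
   new_var = sum_t gamma_jk[t] * (o[t]-new_mu)^2 / sum_t gamma_jk[t]   (diagonal)
   The returned occupancy sum is the numerator of the mixture weight c'_jk. */
void reestimate_mixture(const double o[][DIM], const double gamma_jk[], int num_frames,
                        double new_mu[DIM], double new_var[DIM], double *occ)
{
    double denom = 0.0;
    for (int d = 0; d < DIM; d++) { new_mu[d] = 0.0; new_var[d] = 0.0; }
    for (int t = 0; t < num_frames; t++) denom += gamma_jk[t];

    for (int t = 0; t < num_frames; t++)
        for (int d = 0; d < DIM; d++)
            new_mu[d] += gamma_jk[t] * o[t][d];
    for (int d = 0; d < DIM; d++)
        new_mu[d] = (denom > 0.0) ? new_mu[d] / denom : 0.0;

    for (int t = 0; t < num_frames; t++)
        for (int d = 0; d < DIM; d++) {
            double diff = o[t][d] - new_mu[d];
            new_var[d] += gamma_jk[t] * diff * diff;
        }
    for (int d = 0; d < DIM; d++)
        new_var[d] = (denom > 0.0) ? new_var[d] / denom : 0.0;

    *occ = denom;
}

Dividing the returned occupancy by the summed occupancies over all mixtures of the state gives the new mixture weight c'_{jk}, completing one Baum-Welch iteration for the continuous-density case.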

Page 29: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 29

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull 3(3)=P(o1o2o3s3=3|)=[2(1)a13+ 2(2)a23 +2(3)a33]b3(o3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

State

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s1

OT-1

si denotes that bj(ot) has been computed aij denotes aij has been computed

s2

s1

s3

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 30: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 30

Basic Problem 1 of HMM- The Forward Procedure (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035+05002+040009)07=01792

SP - Berlin Chen 31

Basic Problem 1 of HMM- The Backward Procedure

bull Backward variable t(i)=P(ot+1ot+2hellipoT|st=i )

TNNT-NN-

T NNT-N

jbP

NiT-tjbai

Nii

N

jjj

N

jttjijt

2

22

111

111

T

11 ADD

212 MUL Complexity

nTerminatio 3

1 11 Induction 2

1 1 tionInitializa 1

oO

o

SP - Berlin Chen 32

Basic Problem 1 of HMM- Backward Procedure (cont)

bull Why

bull

isP

isP

isPisP

isPisPisP

isPisPii

t

tTt

ttTt

tTttttt

tTtttt

tt

1

1

2121

2121

O

ooo

ooo

oooooo

oooooo

iiisP ttt O

N

itt

N

it iiisPP

11 OO

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)

bull Hard Assignment
ndash Given that the data follow a multinomial distribution

[Figure: four samples are hard-assigned to state S1, two of symbol B and two of symbol W.]

P(B | S1) = 2/4 = 0.5
P(W | S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)

bull Soft Assignment
ndash Given that the data follow a multinomial distribution
ndash Maximize the likelihood of the data given the alignment

[Figure: each of four samples is softly assigned to states S1 and S2 with posterior weights P(s_t = S1 | O) and P(s_t = S2 | O) that sum to 1 for every t, e.g. (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5).]

P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 ≈ 0.27
P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 ≈ 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMM: Intuitive View (cont)

bull Relationship between the forward and backward variables

  α_t(i) = P(o1, o2, ..., o_t, s_t = i | λ),   α_(t+1)(j) = [ Σ_{i=1..N} α_t(i) a_ij ] b_j(o_(t+1))

  β_t(i) = P(o_(t+1), o_(t+2), ..., o_T | s_t = i, λ),   β_t(i) = Σ_{j=1..N} a_ij b_j(o_(t+1)) β_(t+1)(j)

  α_t(i) β_t(i) = P(O, s_t = i | λ),   P(O | λ) = Σ_{i=1..N} α_t(i) β_t(i)

SP - Berlin Chen 49

Basic Problem 3 of HMM: Intuitive View (cont)

bull Define a new variable ξ_t(i, j)
ndash The probability of being at state i at time t and at state j at time t+1

  ξ_t(i, j) = P(s_t = i, s_(t+1) = j | O, λ)
            = P(s_t = i, s_(t+1) = j, O | λ) / P(O | λ)
            = α_t(i) a_ij b_j(o_(t+1)) β_(t+1)(j) / P(O | λ)
            = α_t(i) a_ij b_j(o_(t+1)) β_(t+1)(j) / [ Σ_{m=1..N} Σ_{n=1..N} α_t(m) a_mn b_n(o_(t+1)) β_(t+1)(n) ]

  (Note that p(A, B) can also be represented as p(A, B) = p(A | B) P(B).)

bull Recall the posterior probability variable

  γ_t(i) = P(s_t = i | O, λ) = α_t(i) β_t(i) / Σ_{m=1..N} α_t(m) β_t(m)

  and note that γ_t(i) = Σ_{j=1..N} ξ_t(i, j), for 1 ≤ t ≤ T-1

[Figure: trellis fragment showing state i at time t connected to state j at time t+1 by a_ij.]
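The C sketch below (an illustration, not from the slides) computes γ_t(i) and ξ_t(i, j) for one time index t from already-computed forward and backward variables; the fixed model size N and the array names are assumptions.

#define N 3                                   /* number of states (assumed) */

/* gamma[i] = P(s_t = i | O, lambda),  xi[i][j] = P(s_t = i, s_{t+1} = j | O, lambda) */
void PosteriorsAtT(const double alpha[N],     /* alpha_t(i)     */
                   const double beta[N],      /* beta_t(i)      */
                   const double betaNext[N],  /* beta_{t+1}(j)  */
                   const double A[N][N],      /* a_ij           */
                   const double bNext[N],     /* b_j(o_{t+1})   */
                   double gamma[N], double xi[N][N])
{
   double denomG = 0.0, denomX = 0.0;
   int i, j;
   for (i = 0; i < N; i++) denomG += alpha[i] * beta[i];           /* = P(O | lambda) */
   for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
         denomX += alpha[i] * A[i][j] * bNext[j] * betaNext[j];    /* also P(O | lambda) */
   for (i = 0; i < N; i++) {
      gamma[i] = alpha[i] * beta[i] / denomG;
      for (j = 0; j < N; j++)
         xi[i][j] = alpha[i] * A[i][j] * bNext[j] * betaNext[j] / denomX;
   }
}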

SP - Berlin Chen 50

Basic Problem 3 of HMM: Intuitive View (cont)

bull P(s3 = 3, s4 = 1, O | λ) = α_3(3) a_31 b_1(o4) β_4(1)

[Figure: state-time trellis (states s1, s2, s3 against time 1, 2, 3, 4, ..., T-1, T and observations O1, O2, O3, ..., OT) illustrating the paths through state s3 at time 3 and state s1 at time 4.]

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont)

bull ξ_t(i, j) = P(s_t = i, s_(t+1) = j | O, λ)

bull γ_t(i) = P(s_t = i | O, λ)

bull A set of reasonable re-estimation formulae for A is

  expected number of transitions from state i to state j in O = Σ_{t=1..T-1} ξ_t(i, j)

  expected number of transitions from state i in O = Σ_{t=1..T-1} γ_t(i)

  ā_ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
       = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)

  π̄_i = expected frequency (number of times) in state i at time t = 1 = γ_1(i)

Formulae for Single Training Utterance
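A small C sketch of how these counts are typically accumulated and normalized for a single utterance (an illustration under assumed array names and sizes; the γ and ξ arrays are taken as already computed):

#define N 3      /* number of states (assumed) */
#define T 100    /* number of frames (assumed) */

/* Re-estimate a_ij = sum_{t=1..T-1} xi_t(i,j) / sum_{t=1..T-1} gamma_t(i) */
void ReestimateA(const double gamma[T][N], const double xi[T][N][N],
                 double Anew[N][N])
{
   int t, i, j;
   for (i = 0; i < N; i++) {
      double denom = 0.0;
      for (t = 0; t < T - 1; t++) denom += gamma[t][i];    /* expected transitions out of i */
      for (j = 0; j < N; j++) {
         double numer = 0.0;
         for (t = 0; t < T - 1; t++) numer += xi[t][i][j]; /* expected transitions i -> j   */
         Anew[i][j] = numer / denom;
      }
   }
}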

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont)

bull A set of reasonable re-estimation formulae for B is
ndash For discrete and finite observations, b_j(v_k) = P(o_t = v_k | s_t = j):

  b̄_j(v_k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)
            = Σ_{t=1..T, o_t = v_k} γ_t(j) / Σ_{t=1..T} γ_t(j)

ndash For continuous and infinite observations, b_j(v) = f_{O|S}(o_t = v | s_t = j), modeled as a mixture of multivariate Gaussian distributions:

  b_j(v) = Σ_{k=1..M} c_jk N(v; μ_jk, Σ_jk)
         = Σ_{k=1..M} c_jk [ (2π)^(-L/2) |Σ_jk|^(-1/2) exp( -(1/2)(v - μ_jk)ᵀ Σ_jk⁻¹ (v - μ_jk) ) ]

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont)

ndash For continuous and infinite observations (cont.)
bull Define a new variable γ_t(j, k)
ndash γ_t(j, k) is the probability of being in state j at time t with the k-th mixture component accounting for o_t:

  γ_t(j, k) = P(s_t = j, m_t = k | O, λ)
            = P(s_t = j | O, λ) · P(m_t = k | s_t = j, O, λ)
            = [ α_t(j) β_t(j) / Σ_{m=1..N} α_t(m) β_t(m) ] · [ c_jk N(o_t; μ_jk, Σ_jk) / Σ_{m=1..M} c_jm N(o_t; μ_jm, Σ_jm) ]

  (the observation-independence assumption is applied; note that γ_t(j) = Σ_{m=1..M} γ_t(j, m), and that p(A, B) = p(A | B) P(B))

[Figure: the output distribution of state 1 is a mixture of Gaussians N1, N2, N3 with weights c11, c12, c13.]

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont)

ndash For continuous and infinite observations (cont.)

  μ̄_jk = weighted average (mean) of the observations at state j and mixture k
        = Σ_{t=1..T} γ_t(j, k) o_t / Σ_{t=1..T} γ_t(j, k)

  Σ̄_jk = weighted covariance of the observations at state j and mixture k
        = Σ_{t=1..T} γ_t(j, k) (o_t - μ̄_jk)(o_t - μ̄_jk)ᵀ / Σ_{t=1..T} γ_t(j, k)

  c̄_jk = (expected number of times in state j and mixture k) / (expected number of times in state j)
        = Σ_{t=1..T} γ_t(j, k) / Σ_{t=1..T} Σ_{m=1..M} γ_t(j, m)

Formulae for Single Training Utterance
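In practice these formulae are usually implemented with accumulators. The C sketch below is illustrative only (diagonal covariances, a fixed feature dimension D, and the structure/array names are assumptions): it accumulates the zeroth-, first- and second-order statistics weighted by γ_t(j, k) and then normalizes them.

#define D 39   /* feature dimension (assumed) */

typedef struct {           /* accumulators for one state j, mixture k       */
   double occ;             /* sum_t gamma_t(j,k)                            */
   double sum[D];          /* sum_t gamma_t(j,k) * o_t                      */
   double sqr[D];          /* sum_t gamma_t(j,k) * o_t^2 (diagonal terms)   */
} MixAcc;

void Accumulate(MixAcc *acc, double gammaJK, const double o[D])
{
   int d;
   acc->occ += gammaJK;
   for (d = 0; d < D; d++) {
      acc->sum[d] += gammaJK * o[d];
      acc->sqr[d] += gammaJK * o[d] * o[d];
   }
}

void Normalize(const MixAcc *acc, double stateOcc,   /* sum_t sum_m gamma_t(j,m) */
               double mu[D], double var[D], double *weight)
{
   int d;
   for (d = 0; d < D; d++) {
      mu[d]  = acc->sum[d] / acc->occ;                 /* weighted mean           */
      var[d] = acc->sqr[d] / acc->occ - mu[d] * mu[d]; /* weighted variance       */
   }
   *weight = acc->occ / stateOcc;                      /* mixture weight c_jk     */
}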

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont)

bull Multiple Training Utterances

[Figure: multiple training utterances of the word 台師大 are each aligned to the same 3-state HMM (s1, s2, s3); the forward-backward (FB) procedure is run on every utterance and the statistics are pooled.]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont)

ndash For continuous and infinite observations (cont.)

Formulae for Multiple (L) Training Utterances:

  c̄_jk = Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(j, k) / Σ_{l=1..L} Σ_{t=1..T_l} Σ_{m=1..M} γ_t^l(j, m)
        = (expected number of times in state j and mixture k) / (expected number of times in state j)

  μ̄_jk = Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(j, k) o_t^l / Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(j, k)
        = weighted average (mean) of the observations at state j and mixture k

  Σ̄_jk = Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(j, k) (o_t^l - μ̄_jk)(o_t^l - μ̄_jk)ᵀ / Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(j, k)
        = weighted covariance of the observations at state j and mixture k

  ā_ij = Σ_{l=1..L} Σ_{t=1..T_l-1} ξ_t^l(i, j) / Σ_{l=1..L} Σ_{t=1..T_l-1} γ_t^l(i)
       = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

  π̄_i = (1/L) Σ_{l=1..L} γ_1^l(i)
       = expected frequency (number of times) in state i at time t = 1

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont)

ndash For discrete and finite observations (cont.)

Formulae for Multiple (L) Training Utterances:

  π̄_i = (1/L) Σ_{l=1..L} γ_1^l(i)
       = expected frequency (number of times) in state i at time t = 1

  ā_ij = Σ_{l=1..L} Σ_{t=1..T_l-1} ξ_t^l(i, j) / Σ_{l=1..L} Σ_{t=1..T_l-1} γ_t^l(i)
       = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

  b̄_j(v_k) = Σ_{l=1..L} Σ_{t=1..T_l, o_t^l = v_k} γ_t^l(j) / Σ_{l=1..L} Σ_{t=1..T_l} γ_t^l(j)
            = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)

SP - Berlin Chen 58

Semicontinuous HMMs
bull The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
ndash The semicontinuous or tied-mixture HMM
ndash A combination of the discrete HMM and the continuous HMM
bull A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
ndash Because M is large, we can simply use the L most significant values
bull Experience showed that an L of about 1~3% of M is adequate
ndash Partial tying of f(o | v_k) for different phonetic classes

  b_j(o) = Σ_{k=1..M} b_j(k) f(o | v_k) = Σ_{k=1..M} b_j(k) N(o; μ_k, Σ_k)

  where b_j(o) is the state output probability of state j, b_j(k) is the k-th mixture weight of state j (discrete, model-dependent), and f(o | v_k) is the k-th mixture density function or k-th codeword (shared across HMMs; M is very large)
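A small C sketch of this tied-mixture output probability (illustrative only; the codebook size, the pruning to the topL most significant codewords, and the array names are assumptions, not lecture code):

#define M_CODE 256   /* shared codebook size (assumed) */

/* b_j(o) = sum over the top-L codewords of b_j(k) * f(o | v_k).
   density[k] holds the precomputed f(o | v_k) for the current frame,
   topIdx[] holds the indices of the L largest density values.      */
double SemiContOutProb(const double weight_j[M_CODE],   /* b_j(k)  */
                       const double density[M_CODE],
                       const int topIdx[], int topL)
{
   double b = 0.0;
   int n;
   for (n = 0; n < topL; n++) {
      int k = topIdx[n];
      b += weight_j[k] * density[k];
   }
   return b;
}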

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

[Figure: two tied-mixture HMMs; every state j keeps its own discrete weights b_j(1), ..., b_j(k), ..., b_j(M), while all states of all models share the common codebook of Gaussian kernels N(μ_1, Σ_1), N(μ_2, Σ_2), ..., N(μ_k, Σ_k), ..., N(μ_M, Σ_M).]

SP - Berlin Chen 60

HMM Topology

bull Speech is a time-evolving, non-stationary signal
ndash Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
ndash A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
ndash It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM
bull A good initialization of HMM training: Segmental K-Means Segmentation into States
ndash Assume that we have a training set of observations and an initial estimate of all model parameters
ndash Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
ndash Step 2:
bull For a discrete density HMM (using an M-codeword codebook):
  b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
bull For a continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state j into a set of M clusters, then
  ŵ_jm = (number of vectors classified into cluster m of state j) / (number of vectors in state j)
  μ̂_jm = sample mean of the vectors classified into cluster m of state j
  Σ̂_jm = sample covariance matrix of the vectors classified into cluster m of state j
ndash Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated

[Figure: a left-to-right 3-state HMM (s1, s2, s3).]
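As an illustration of Step 2 for the discrete-density case (assumed sizes and array names; the per-frame state labels come from the Viterbi segmentation of Step 1):

#define NSTATE 3
#define NCODE  2

/* b_j(k) = (#vectors with codebook index k in state j) / (#vectors in state j) */
void EstimateDiscreteB(const int state[], const int code[], int T,
                       double b[NSTATE][NCODE])
{
   int count[NSTATE][NCODE] = {{0}};
   int total[NSTATE] = {0};
   int t, j, k;
   for (t = 0; t < T; t++) {          /* state[t], code[t] from the segmentation */
      count[state[t]][code[t]]++;
      total[state[t]]++;
   }
   for (j = 0; j < NSTATE; j++)
      for (k = 0; k < NCODE; k++)
         b[j][k] = (double)count[j][k] / (double)total[j];
}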

SP - Berlin Chen 62

Initialization of HMM (cont)

[Flowchart: Training Data and an Initial Model feed a State Sequence Segmentation step; parameters of the observation distributions are estimated via Segmental K-means, followed by Model Re-estimation; a Model Convergence test either loops back (NO) or outputs the Model Parameters (YES).]

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for a discrete HMM
ndash 3 states and a 2-codeword codebook
bull b1(v1) = 3/4, b1(v2) = 1/4
bull b2(v1) = 1/3, b2(v2) = 2/3
bull b3(v1) = 2/3, b3(v2) = 1/3

[Figure: ten observations O1 ... O10, each quantized to codeword v1 or v2, are aligned to the states s1, s2, s3 of a left-to-right HMM by the state segmentation; the output probabilities above are the relative counts of v1 and v2 within each state.]

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for a continuous HMM
ndash 3 states and 4 Gaussian mixtures per state

[Figure: the observations O1 ... ON assigned to each state s1, s2, s3 are clustered by K-means, starting from the global mean and splitting into cluster means (e.g. μ_11, μ_12, μ_13, μ_14 for state 1) to initialize the 4 mixtures of each state.]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

bull The assumptions of conventional HMMs in Speech Processing
ndash The state duration follows an exponential (geometric) distribution
  d_i(t) = a_ii^(t-1) (1 - a_ii)
bull This doesn't provide an adequate representation of the temporal structure of speech
ndash First-order (Markov) assumption: the state transition depends only on the origin and destination
ndash Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations, although these solutions have not significantly improved speech recognition accuracy for practical applications
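A short worked consequence of this geometric duration model (added here as an illustration, not from the original slide): the expected duration of state i is

  E[d_i] = Σ_{t=1..∞} t a_ii^(t-1) (1 - a_ii) = (1 - a_ii) / (1 - a_ii)² = 1 / (1 - a_ii)

so, for example, a self-transition probability a_ii = 0.8 gives an expected stay of 5 frames.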

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

bull Duration modeling

[Figure: candidate duration models - geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution.]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

[Figure: likelihood plotted over the model configuration space; training climbs from the current model configuration to a local maximum, not necessarily the global one.]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: a fully connected 3-state HMM (s1, s2, s3); out of each state, one transition has probability 0.34 and the other two have probability 0.33, and the three symbol-output distributions are {A: 0.34, B: 0.33, C: 0.33}, {A: 0.33, B: 0.34, C: 0.33}, {A: 0.33, B: 0.33, C: 0.34} (one per state).]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively

P3: Which class do the following testing sequences belong to: ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were used instead in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: isolated word recognition - the speech signal goes through Feature Extraction to produce the feature sequence X; the likelihoods p(X | M1), p(X | M2), ..., p(X | MV) of the word models M1 ... MV and p(X | M_Sil) of the silence model M_Sil are computed, and the Most Likely Word Selector outputs the label.]

  Label(X) = argmax_k p(X | M_k)

  Viterbi approximation:  Label(X) = argmax_k max_S p(X, S | M_k)
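A minimal C sketch of the word selector above (an illustration only; it assumes the per-model log-likelihoods log p(X | M_k), or their Viterbi approximations, have already been computed elsewhere):

/* Return the index of the most likely word model, argmax_k log p(X | M_k). */
int MostLikelyWord(const double logLik[], int V)
{
   int k, best = 0;
   for (k = 1; k < V; k++)
      if (logLik[k] > logLik[best]) best = k;
   return best;
}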

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errors
ndash Substitution: an incorrect word was substituted for the correct word
ndash Deletion: a correct word was omitted in the recognized sentence
ndash Insertion: an extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
bull Calculate the WER by aligning the correct word string against the recognized word string
ndash A maximum substring matching problem
ndash Can be handled by dynamic programming
bull Example

  Correct:    "the effect is clear"
  Recognized: "effect is not clear"   ("the" deleted, "not" inserted; "effect", "is", "clear" matched)

ndash Error analysis: one deletion and one insertion
ndash Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

  Word Error Rate = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%
  Word Correction Rate = 100% × Matched / (No. of words in the correct sentence) = 3/4 = 75%
  Word Accuracy Rate = 100% × (Matched - Ins) / (No. of words in the correct sentence) = (3 - 1)/4 = 50%

  Note: WER + WAR = 100%; WER might be higher than 100%, and WAR might be negative

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
bull A Dynamic Programming Algorithm (Textbook)

[Figure: the alignment grid, with the correct/reference word sequence along one axis (Ref) and the recognized/test word sequence along the other (Test); each grid point [i, j] stores the minimum word-error alignment up to that point, and each step corresponds to one of the possible kinds of alignment (hit, substitution, insertion, deletion).]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
bull Algorithm (by Berlin Chen); i indexes the test sentence (length n), j indexes the reference sentence (length m), and the penalties for substitution, deletion and insertion errors are all set to 1 here

Step 1: Initialization
  G[0][0] = 0
  for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
  for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

Step 2: Iteration
  for i = 1..n (test)
    for j = 1..m (reference)
      G[i][j] = min( G[i-1][j] + 1                          (Insertion, horizontal direction),
                     G[i][j-1] + 1                          (Deletion, vertical direction),
                     G[i-1][j-1] + 1, if LT[i] ≠ LR[j]      (Substitution, diagonal direction),
                     G[i-1][j-1],     if LT[i] = LR[j]      (Match, diagonal direction) )
      B[i][j] = 1 (Insertion), 2 (Deletion), 3 (Substitution) or 4 (Match), according to the chosen term

Step 3: Backtrace and Measure
  Word Error Rate = 100% × G[n][m] / m
  Word Accuracy Rate = 100% - Word Error Rate
  Optimal backtrace path: from B[n][m] back to B[0][0]
    if B[i][j] = 1, print LT[i] (Insertion) and go left
    else if B[i][j] = 2, print LR[j] (Deletion) and go down
    else print LR[j] (Hit/Match or Substitution) and go diagonally down

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: the DP grid with the correct/reference word sequence (1 ... j ... m) on one axis and the recognized/test word sequence (1 ... i ... n) on the other; cell (i, j) is reached from (i-1, j-1), (i-1, j) or (i, j-1), and the first row and column accumulate pure insertions and deletions (HTK convention).]

bull A Dynamic Programming Algorithm
ndash Initialization

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;
for (i=1; i<=n; i++) {          /* test */
   grid[i][0] = grid[i-1][0];
   grid[i][0].dir = HOR;
   grid[i][0].score += InsPen;
   grid[i][0].ins++;
}
for (j=1; j<=m; j++) {          /* reference */
   grid[0][j] = grid[0][j-1];
   grid[0][j].dir = VERT;
   grid[0][j].score += DelPen;
   grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
bull Program

for (i=1; i<=n; i++) {                 /* test */
   gridi  = grid[i];
   gridi1 = grid[i-1];
   for (j=1; j<=m; j++) {              /* reference */
      h = gridi1[j].score + insPen;
      d = gridi1[j-1].score;
      if (lRef[j] != lTest[i])
         d += subPen;
      v = gridi[j-1].score + delPen;
      if (d<=h && d<=v) {              /* DIAG = hit or sub */
         gridi[j] = gridi1[j-1];       /* structure assignment */
         gridi[j].score = d;
         gridi[j].dir = DIAG;
         if (lRef[j] == lTest[i]) ++gridi[j].hit;
         else                     ++gridi[j].sub;
      }
      else if (h<v) {                  /* HOR = ins */
         gridi[j] = gridi1[j];         /* structure assignment */
         gridi[j].score = h;
         gridi[j].dir = HOR;
         ++gridi[j].ins;
      }
      else {                           /* VERT = del */
         gridi[j] = gridi[j-1];        /* structure assignment */
         gridi[j].score = v;
         gridi[j].dir = VERT;
         ++gridi[j].del;
      }
   } /* for j */
} /* for i */

bull Example 1 (HTK convention; each grid cell records the counts (Ins, Del, Sub, Hit))
  Correct: A C B C C
  Test:    B A B C

[Figure: the filled DP grid for this pair; the backtraced path reads Ins B, Hit A, Del C, Hit B, Hit C, Del C.]

  Alignment 1: WER = (1 Ins + 2 Del + 0 Sub) / 5 = 60%   (there is still another optimal alignment)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

bull Example 2
  Correct: A C B C C
  Test:    B A A C

[Figure: the filled DP grid for this pair; three different backtraced paths are optimal.]

  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C - WER = (1 + 2 + 1) / 5 = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C - WER = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C - WER = 80%

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

bull Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
bull Measures of ASR Performance

  Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (in the original label files each character is preceded by its start and end time, shown as "100000 100000")

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputs
ndash Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
ndash The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ==================================================================

  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ==================================================================
  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ==================================================================
  ------------------------ Overall Results ------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure (left): two bottles A and B of colored balls; the observed data O is the "ball sequence" o1 o2 ...... oT and the latent data S is the "bottle sequence". The parameters λ to be estimated so as to maximize log P(O | λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).]

[Figure (right): a 3-state HMM with output distributions such as {A: .3, B: .2, C: .5}, {A: .7, B: .1, C: .2}, {A: .3, B: .6, C: .1} and transition probabilities such as 0.6, 0.7, 0.3, 0.2, 0.1; re-estimation moves from λ to λ̄ with p(O | λ̄) > p(O | λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

bull Introduction of EM (Expectation Maximization)
ndash Why EM?
bull Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data. In our case here, the state sequence S is the latent data
bull Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate (A, B, π) without consideration of the state sequence
ndash Two Major Steps
bull E: take the expectation E[ · | O, λ] with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations
bull M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

bull Estimation principles based on observations X = (x1, x2, ..., xn)
ndash The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X | Φ) is maximum. For example, if Φ = (μ, Σ) are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

  μ_ML = (1/n) Σ_{i=1..n} x_i
  Σ_ML = (1/n) Σ_{i=1..n} (x_i - μ_ML)(x_i - μ_ML)ᵀ

ndash The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior likelihood p(Φ | X) is maximum
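A small C sketch of the ML estimates above for an i.i.d. sample (illustrative only; a diagonal covariance and a fixed dimension are assumed):

#define DIM 2

/* ML estimates of a multivariate Gaussian (diagonal covariance shown). */
void MLGaussian(const double x[][DIM], int n, double mu[DIM], double var[DIM])
{
   int i, d;
   for (d = 0; d < DIM; d++) { mu[d] = 0.0; var[d] = 0.0; }
   for (i = 0; i < n; i++)
      for (d = 0; d < DIM; d++) mu[d] += x[i][d];
   for (d = 0; d < DIM; d++) mu[d] /= n;                 /* mu_ML = (1/n) sum x_i      */
   for (i = 0; i < n; i++)
      for (d = 0; d < DIM; d++) {
         double diff = x[i][d] - mu[d];
         var[d] += diff * diff;
      }
   for (d = 0; d < DIM; d++) var[d] /= n;                /* Sigma_ML (diagonal terms)  */
}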

SP - Berlin Chen 85

The EM Algorithm (4/7)

bull The EM Algorithm is important to HMMs and other learning techniques
ndash Discover new model parameters to maximize the log-likelihood of incomplete data by iteratively maximizing the expectation of the log-likelihood from complete data

bull Firstly, scalar (discrete) random variables are used to introduce the EM algorithm
ndash The observable training data O
bull We want to maximize P(O | λ); λ is a parameter vector
ndash The hidden (unobservable) data S
bull E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

ndash Assume we have λ and estimate the probability that each S occurred in the generation of O
ndash Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S | λ), to compute a new λ̄, the maximum likelihood estimate of λ
ndash Does the process converge?
ndash Algorithm
bull Log-likelihood expression and expectation taken over S (Bayes' rule):

  P(O, S | λ̄) = P(S | O, λ̄) P(O | λ̄)      (complete-data likelihood; λ̄ is the unknown model setting)

  log P(O | λ̄) = log P(O, S | λ̄) - log P(S | O, λ̄)      (incomplete-data likelihood)

  Taking the expectation over S under P(S | O, λ):

  log P(O | λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄) - Σ_S P(S | O, λ) log P(S | O, λ̄)

SP - Berlin Chen 87

The EM Algorithm (6/7)

ndash Algorithm (cont.)
bull We can thus express log P(O | λ̄) as follows

  log P(O | λ̄) = Q(λ, λ̄) - H(λ, λ̄)

  where
  Q(λ, λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄)
  H(λ, λ̄) = Σ_S P(S | O, λ) log P(S | O, λ̄)

bull We want log P(O | λ̄) ≥ log P(O | λ), i.e.

  log P(O | λ̄) - log P(O | λ) = [ Q(λ, λ̄) - Q(λ, λ) ] - [ H(λ, λ̄) - H(λ, λ) ]

SP - Berlin Chen 88

The EM Algorithm (7/7)

bull H(λ, λ̄) has the following property (Jensen's inequality / Kullback-Leibler (KL) distance):

  H(λ, λ̄) - H(λ, λ) = Σ_S P(S | O, λ) log [ P(S | O, λ̄) / P(S | O, λ) ]
                     ≤ Σ_S P(S | O, λ) [ P(S | O, λ̄) / P(S | O, λ) - 1 ]      (since log x ≤ x - 1)
                     = Σ_S P(S | O, λ̄) - Σ_S P(S | O, λ) = 1 - 1 = 0

ndash Therefore, for maximizing log P(O | λ̄) we only need to maximize the Q-function (auxiliary function)

  Q(λ, λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄)
  (the expectation of the complete-data log-likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

bull Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
ndash By maximizing the auxiliary function

  Q(λ, λ̄) = Σ_S P(S | O, λ) log P(O, S | λ̄) = Σ_S [ P(O, S | λ) / P(O | λ) ] log P(O, S | λ̄)

ndash Where P(O, S | λ) and log P(O, S | λ̄) can be expressed as

  P(O, S | λ) = π_{s1} b_{s1}(o1) Π_{t=1..T-1} a_{s_t s(t+1)} Π_{t=2..T} b_{s_t}(o_t)

  log P(O, S | λ̄) = log π̄_{s1} + Σ_{t=1..T-1} log ā_{s_t s(t+1)} + Σ_{t=1..T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

bull Rewrite the auxiliary function as

  Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄)

  Q_π(λ, π̄) = Σ_{i=1..N} [ P(O, s1 = i | λ) / P(O | λ) ] log π̄_i

  Q_a(λ, ā) = Σ_{i=1..N} Σ_{j=1..N} Σ_{t=1..T-1} [ P(O, s_t = i, s_(t+1) = j | λ) / P(O | λ) ] log ā_ij

  Q_b(λ, b̄) = Σ_{j=1..N} Σ_{k} Σ_{t=1..T, o_t = v_k} [ P(O, s_t = j | λ) / P(O | λ) ] log b̄_j(v_k)

  (each term has the form Σ_j w_j log y_j)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

bull The auxiliary function contains three independent terms, in π̄_i, ā_ij and b̄_j(v_k)
ndash They can be maximized individually
ndash All are of the same form

  F(y1, y2, ..., yN) = Σ_{j=1..N} w_j log y_j,  where Σ_{j=1..N} y_j = 1 and y_j ≥ 0,

  has its maximum value when  y_j = w_j / Σ_{j=1..N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

bull Proof: apply the Lagrange multiplier method

  Suppose  F = Σ_{j=1..N} w_j log y_j + λ ( Σ_{j=1..N} y_j - 1 )      (λ: Lagrange multiplier; Σ_j y_j = 1 is the constraint)

  ∂F/∂y_j = w_j / y_j + λ = 0   ⇒   w_j = -λ y_j,  for all j

  Summing over j:  Σ_{j=1..N} w_j = -λ Σ_{j=1..N} y_j = -λ

  Therefore  y_j = w_j / Σ_{j=1..N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

bull The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as

  π̄_i = P(O, s1 = i | λ) / P(O | λ) = γ_1(i)

  ā_ij = [ Σ_{t=1..T-1} P(O, s_t = i, s_(t+1) = j | λ) / P(O | λ) ] / [ Σ_{t=1..T-1} P(O, s_t = i | λ) / P(O | λ) ]
       = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)

  b̄_i(v_k) = [ Σ_{t=1..T, o_t = v_k} P(O, s_t = i | λ) / P(O | λ) ] / [ Σ_{t=1..T} P(O, s_t = i | λ) / P(O | λ) ]
            = Σ_{t=1..T, o_t = v_k} γ_t(i) / Σ_{t=1..T} γ_t(i)

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

bull Continuous HMM: the state observation does not come from a finite set, but from a continuous space
ndash The difference between the discrete and the continuous HMM lies in a different form of state output probability
ndash The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMM
ndash The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)

  b_j(o) = Σ_{k=1..M} c_jk b_jk(o) = Σ_{k=1..M} c_jk N(o; μ_jk, Σ_jk)
         = Σ_{k=1..M} c_jk [ (2π)^(-L/2) |Σ_jk|^(-1/2) exp( -(1/2)(o - μ_jk)ᵀ Σ_jk⁻¹ (o - μ_jk) ) ],
  with  Σ_{k=1..M} c_jk = 1

[Figure: the distribution for state i is a mixture of Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3.]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

bull Express b_j(o_t) in terms of its single mixture components b_jk(o_t)

  p(O, S | λ) = π_{s1} Π_{t=1..T-1} a_{s_t s(t+1)} Π_{t=1..T} b_{s_t}(o_t)
              = π_{s1} Π_{t=1..T-1} a_{s_t s(t+1)} Π_{t=1..T} [ Σ_{k=1..M} c_{s_t k} b_{s_t k}(o_t) ]

bull Summing over every possible mixture-component sequence K = (k1, k2, ..., kT) along the state sequence S:

  p(O, S, K | λ) = π_{s1} Π_{t=1..T-1} a_{s_t s(t+1)} Π_{t=1..T} c_{s_t k_t} b_{s_t k_t}(o_t)

  p(O | λ) = Σ_S Σ_K p(O, S, K | λ)

  (Note: Π_{t=1..T} Σ_{k=1..M} x_{t,k} = Σ_{k1} Σ_{k2} ... Σ_{kT} Π_{t=1..T} x_{t,k_t})

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

bull Therefore, an auxiliary function for the EM algorithm can be written as

  Q(λ, λ̄) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̄)
           = Σ_S Σ_K [ p(O, S, K | λ) / p(O | λ) ] log p(O, S, K | λ̄)

  log p(O, S, K | λ̄) = log π̄_{s1} + Σ_{t=1..T-1} log ā_{s_t s(t+1)} + Σ_{t=1..T} log b̄_{s_t k_t}(o_t) + Σ_{t=1..T} log c̄_{s_t k_t}

  Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
  (initial probabilities, state transition probabilities, Gaussian density functions of the mixture components, and mixture weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

bull The only difference we have when compared with discrete HMM training:

  Q_b(λ, b̄) = Σ_{t=1..T} Σ_{j=1..N} Σ_{k=1..M} P(s_t = j, k_t = k | O, λ) log b̄_jk(o_t)

  Q_c(λ, c̄) = Σ_{t=1..T} Σ_{j=1..N} Σ_{k=1..M} P(s_t = j, k_t = k | O, λ) log c̄_jk

  where  γ_t(j, k) = P(s_t = j, k_t = k | O, λ)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

  Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ)

  log b̄_jk(o_t) = -(L/2) log(2π) - (1/2) log |Σ̄_jk| - (1/2)(o_t - μ̄_jk)ᵀ Σ̄_jk⁻¹ (o_t - μ̄_jk)

  Setting the derivative of Q_b with respect to μ̄_jk to zero:

  ∂Q_b/∂μ̄_jk = Σ_{t=1..T} γ_t(j, k) Σ̄_jk⁻¹ (o_t - μ̄_jk) = 0
  (using d(xᵀ C x)/dx = (C + Cᵀ) x, with Σ̄_jk⁻¹ symmetric)

  ⇒  μ̄_jk = Σ_{t=1..T} γ_t(j, k) o_t / Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

  Writing P = Σ̄_jk⁻¹, we have  log b̄_jk(o_t) = -(L/2) log(2π) + (1/2) log |P| - (1/2)(o_t - μ̄_jk)ᵀ P (o_t - μ̄_jk)

  Setting the derivative of Q_b with respect to P to zero:

  ∂Q_b/∂P = Σ_{t=1..T} γ_t(j, k) [ (1/2) P⁻¹ - (1/2)(o_t - μ̄_jk)(o_t - μ̄_jk)ᵀ ] = 0
  (using ∂ log|P| / ∂P = (P⁻¹)ᵀ and ∂(aᵀ P a)/∂P = a aᵀ, with P symmetric)

  ⇒  Σ̄_jk = Σ_{t=1..T} γ_t(j, k) (o_t - μ̄_jk)(o_t - μ̄_jk)ᵀ / Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

  μ̄_jk = [ Σ_{t=1..T} P(s_t = j, k_t = k | O, λ) o_t ] / [ Σ_{t=1..T} P(s_t = j, k_t = k | O, λ) ]
        = Σ_{t=1..T} γ_t(j, k) o_t / Σ_{t=1..T} γ_t(j, k)

  Σ̄_jk = Σ_{t=1..T} γ_t(j, k) (o_t - μ̄_jk)(o_t - μ̄_jk)ᵀ / Σ_{t=1..T} γ_t(j, k)

  c̄_jk = Σ_{t=1..T} γ_t(j, k) / Σ_{t=1..T} Σ_{m=1..M} γ_t(j, m)


Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function
      Q(\lambda, \bar{\lambda}) = \sum_{S} P(S|O, \lambda) \log P(O, S|\bar{\lambda}) = \sum_{S} \frac{P(O, S|\lambda)}{P(O|\lambda)} \log P(O, S|\bar{\lambda})
  – Where P(O, S|λ) and log P(O, S|λ̄) can be expressed as
      P(O, S|\lambda) = \pi_{s_1} b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
      \log P(O, S|\bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1} s_t} + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as
    Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\pi}) + Q_{a}(\lambda, \bar{a}) + Q_{b}(\lambda, \bar{b})
  where
    Q_{\pi}(\lambda, \bar{\pi}) = \sum_{i=1}^{N} \frac{P(O, s_1 = i|\lambda)}{P(O|\lambda)} \log \bar{\pi}_i
    Q_{a}(\lambda, \bar{a}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \frac{P(O, s_t = i, s_{t+1} = j|\lambda)}{P(O|\lambda)} \log \bar{a}_{ij}
    Q_{b}(\lambda, \bar{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t:\, o_t = v_k} \frac{P(O, s_t = j|\lambda)}{P(O|\lambda)} \log \bar{b}_j(v_k)
  – Each term is a weighted sum of logarithms of the form Σ_i w_i log y_i

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in π̄_i, ā_ij and b̄_j(k)
  – Can be maximized individually
  – All of the same form:
      F(y_1, y_2, ..., y_N) = \sum_{j=1}^{N} w_j \log y_j, \quad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0,
    has maximum value when
      y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
    By applying the Lagrange multiplier ℓ with the constraint Σ_j y_j = 1, suppose that
      F = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right)
    Then
      \frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \quad \forall j
      \Rightarrow \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell
      \Rightarrow y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}
  (A tiny numeric check of this result follows below.)

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93
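As a quick sanity check of the result just proved (not from the original slides), the following toy C program compares F(y) = Σ_j w_j log y_j at the claimed maximizer y_j = w_j / Σ_j w_j against an arbitrary other point on the simplex; the weight values are made up.

#include <math.h>
#include <stdio.h>

#define NW 3

/* F(y) = sum_j w_j * log(y_j) */
static double F(const double w[NW], const double y[NW])
{
    double f = 0.0;
    for (int j = 0; j < NW; j++) f += w[j] * log(y[j]);
    return f;
}

int main(void)
{
    const double w[NW] = {2.0, 5.0, 3.0};           /* arbitrary nonnegative weights  */
    double wsum = 0.0, y_opt[NW];
    for (int j = 0; j < NW; j++) wsum += w[j];
    for (int j = 0; j < NW; j++) y_opt[j] = w[j] / wsum;   /* claimed maximizer       */

    const double y_other[NW] = {0.3, 0.3, 0.4};     /* any other point on the simplex */
    printf("F(y_opt)   = %f\n", F(w, y_opt));       /* should be the larger value     */
    printf("F(y_other) = %f\n", F(w, y_other));
    return 0;
}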

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (π̄, Ā, B̄) can be expressed as
    \bar{\pi}_i = \frac{P(O, s_1 = i|\lambda)}{P(O|\lambda)} = P(s_1 = i|O, \lambda)
    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(s_t = i, s_{t+1} = j|O, \lambda)}{\sum_{t=1}^{T-1} P(s_t = i|O, \lambda)}
    \bar{b}_i(k) = \frac{\sum_{t=1,\; \text{s.t. } o_t = v_k}^{T} P(s_t = i|O, \lambda)}{\sum_{t=1}^{T} P(s_t = i|O, \lambda)}
  (A short C sketch of these updates follows below.)

SP - Berlin Chen 94
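The re-estimation formulas above translate almost directly into code. The sketch below (not part of the original slides) assumes the state-occupation posteriors gamma[t][i] = P(s_t = i|O, λ) and the transition posteriors xi[t][i][j] = P(s_t = i, s_{t+1} = j|O, λ) have already been computed by the forward-backward procedure described earlier; the array sizes and names are illustrative, and a real trainer would accumulate these statistics over many utterances.

#define T  10  /* number of observation frames (illustrative)        */
#define NS 3   /* number of HMM states                                */
#define M  2   /* number of discrete observation symbols (codewords)  */

/* Re-estimation of a discrete HMM from posteriors:
   pi_i   = gamma_1(i)
   a_ij   = sum_{t=1..T-1} xi_t(i,j)      / sum_{t=1..T-1} gamma_t(i)
   b_i(k) = sum_{t: o_t = v_k} gamma_t(i) / sum_{t=1..T}   gamma_t(i)            */
static void reestimate_discrete_hmm(const double gamma[T][NS],
                                    const double xi[T - 1][NS][NS],
                                    const int    obs[T],   /* o_t as codeword index 0..M-1 */
                                    double pi[NS], double a[NS][NS], double b[NS][M])
{
    for (int i = 0; i < NS; i++) {
        pi[i] = gamma[0][i];

        double occ_trans = 0.0, occ_all = 0.0;       /* state-occupation counts     */
        for (int t = 0; t < T - 1; t++) occ_trans += gamma[t][i];
        for (int t = 0; t < T; t++)     occ_all   += gamma[t][i];

        for (int j = 0; j < NS; j++) {               /* expected transition counts  */
            double num = 0.0;
            for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
            a[i][j] = num / occ_trans;               /* assumes state i is visited  */
        }
        for (int k = 0; k < M; k++) {                /* expected emission counts    */
            double num = 0.0;
            for (int t = 0; t < T; t++)
                if (obs[t] == k) num += gamma[t][i];
            b[i][k] = num / occ_all;
        }
    }
}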

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
      b_j(o) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(o) = \sum_{k=1}^{M} c_{jk}\, N(o; \mu_{jk}, \Sigma_{jk})
             = \sum_{k=1}^{M} c_{jk} \frac{1}{(2\pi)^{L/2} |\Sigma_{jk}|^{1/2}} \exp\!\left( -\frac{1}{2} (o - \mu_{jk})^T \Sigma_{jk}^{-1} (o - \mu_{jk}) \right),
      \qquad \sum_{k=1}^{M} c_{jk} = 1
  [Figure: the distribution for state i drawn as a weighted sum of Gaussians N_1, N_2, N_3 with mixture weights w_i1, w_i2, w_i3]
  (A short C sketch of evaluating b_j(o) follows below.)

SP - Berlin Chen 95
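For illustration, here is a minimal C sketch (not from the original slides) of evaluating the state output probability b_j(o) = Σ_k c_jk N(o; μ_jk, Σ_jk); for simplicity it assumes diagonal covariance matrices, whereas the formula above allows full covariances, and all names and dimensions are made up.

#include <math.h>

#define D  2   /* feature dimension (illustrative)      */
#define MK 3   /* number of Gaussian mixture components */

/* One diagonal-covariance Gaussian component N(o; mu, diag(var)) */
static double diag_gaussian(const double o[D], const double mu[D], const double var[D])
{
    double logp = -0.5 * D * log(2.0 * 3.14159265358979323846);
    for (int d = 0; d < D; d++) {
        double diff = o[d] - mu[d];
        logp += -0.5 * log(var[d]) - 0.5 * diff * diff / var[d];
    }
    return exp(logp);
}

/* State output probability b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk),
   with sum_k c_jk = 1 for each state j.                                 */
static double state_output_prob(const double o[D],
                                const double c[MK],
                                const double mu[MK][D],
                                const double var[MK][D])
{
    double b = 0.0;
    for (int k = 0; k < MK; k++)
        b += c[k] * diag_gaussian(o, mu[k], var[k]);
    return b;
}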

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_jk(o_t):
    p(O|\lambda) = \sum_{S} \pi_{s_1} \left[ \prod_{t=2}^{T} a_{s_{t-1} s_t} \right] \prod_{t=1}^{T} b_{s_t}(o_t)
                 = \sum_{S} \pi_{s_1} \left[ \prod_{t=2}^{T} a_{s_{t-1} s_t} \right] \prod_{t=1}^{T} \left[ \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(o_t) \right]
                 = \sum_{S} \sum_{K} \pi_{s_1} \left[ \prod_{t=2}^{T} a_{s_{t-1} s_t} \right] \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(o_t)
                 = \sum_{S} \sum_{K} p(O, S, K|\lambda)
  where K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along the state sequence S
  – Note: the product of sums expands into a sum of products, e.g.
      \prod_{t=1}^{T} \sum_{k=1}^{M} a_{t k} = (a_{11} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
    Q(\lambda, \bar{\lambda}) = \sum_{S} \sum_{K} P(S, K|O, \lambda) \log p(O, S, K|\bar{\lambda})
  with
    \log p(O, S, K|\bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=2}^{T} \log \bar{a}_{s_{t-1} s_t} + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(o_t) + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t}
  so that
    Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\pi}) + Q_{a}(\lambda, \bar{a}) + Q_{b}(\lambda, \bar{b}) + Q_{c}(\lambda, \bar{c})
  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture components, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with discrete HMM training:
    Q_{b}(\lambda, \bar{b}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(s_t = j, k_t = k|O, \lambda) \log \bar{b}_{jk}(o_t)
    Q_{c}(\lambda, \bar{c}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(s_t = j, k_t = k|O, \lambda) \log \bar{c}_{jk}
  where the joint state/mixture posterior is abbreviated \gamma_t(j, k) = P(s_t = j, k_t = k|O, \lambda)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

  – Maximizing Q_b with respect to the mean vectors: let γ_t(j,k) = P(s_t = j, k_t = k|O, λ) and
      b_{jk}(o_t) = \frac{1}{(2\pi)^{L/2} |\Sigma_{jk}|^{1/2}} \exp\!\left( -\frac{1}{2} (o_t - \mu_{jk})^T \Sigma_{jk}^{-1} (o_t - \mu_{jk}) \right)
      \log b_{jk}(o_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_{jk}| - \frac{1}{2} (o_t - \mu_{jk})^T \Sigma_{jk}^{-1} (o_t - \mu_{jk})
    Setting the derivative of Q_b(\lambda, \bar{b}) = \sum_{t} \sum_{j} \sum_{k} \gamma_t(j,k) \log \bar{b}_{jk}(o_t) with respect to \bar{\mu}_{jk} to zero gives
      \frac{\partial Q_b}{\partial \bar{\mu}_{jk}} = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk}) = 0
      \Rightarrow \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}
    (using \frac{\partial}{\partial x} x^T C x = (C + C^T)\, x, and \Sigma_{jk} is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

  – Similarly, setting the derivative of Q_b with respect to \bar{\Sigma}_{jk} to zero gives
      \frac{\partial Q_b}{\partial \bar{\Sigma}_{jk}} = \frac{1}{2} \sum_{t=1}^{T} \gamma_t(j,k) \left[ \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^T \bar{\Sigma}_{jk}^{-1} - \bar{\Sigma}_{jk}^{-1} \right] = 0
      \Rightarrow \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^T}{\sum_{t=1}^{T} \gamma_t(j,k)}
    (using \frac{\partial}{\partial X} \det(X) = \det(X)\,(X^{-1})^T and \frac{\partial}{\partial X}\, a^T X b = a\, b^T, with \Sigma_{jk} symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as
    \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} P(s_t = j, k_t = k|O, \lambda)\, o_t}{\sum_{t=1}^{T} P(s_t = j, k_t = k|O, \lambda)}
    \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} P(s_t = j, k_t = k|O, \lambda)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^T}{\sum_{t=1}^{T} P(s_t = j, k_t = k|O, \lambda)}
    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j, m)}
  (A short C sketch that accumulates these statistics follows below.)
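To make the updates above concrete, here is a minimal C sketch (not part of the original slides) that re-estimates the mixture weights, means and covariances of the state output distributions from the joint posteriors gamma[t][j][k] = P(s_t = j, k_t = k|O, λ); it assumes those posteriors are already available from the forward-backward pass, restricts itself to diagonal covariances for simplicity, and uses illustrative array sizes and names.

#define T  10  /* frames (illustrative)  */
#define NS 3   /* states                  */
#define MK 2   /* mixtures per state      */
#define D  2   /* feature dimension       */

/* Re-estimation of mixture weights, means and (diagonal) covariances:
   c_jk     = sum_t gamma_t(j,k)                                / sum_t sum_m gamma_t(j,m)
   mu_jk    = sum_t gamma_t(j,k) o_t                            / sum_t gamma_t(j,k)
   Sigma_jk = sum_t gamma_t(j,k) (o_t - mu_jk)(o_t - mu_jk)^T   / sum_t gamma_t(j,k)      */
static void reestimate_gmm_states(const double o[T][D],
                                  const double gamma[T][NS][MK],
                                  double c[NS][MK], double mu[NS][MK][D], double var[NS][MK][D])
{
    for (int j = 0; j < NS; j++) {
        double occ_state = 0.0;                       /* sum_t sum_m gamma_t(j,m) */
        for (int t = 0; t < T; t++)
            for (int m = 0; m < MK; m++) occ_state += gamma[t][j][m];

        for (int k = 0; k < MK; k++) {
            double occ = 0.0;                         /* sum_t gamma_t(j,k)       */
            for (int t = 0; t < T; t++) occ += gamma[t][j][k];
            c[j][k] = occ / occ_state;

            for (int d = 0; d < D; d++) {             /* weighted mean             */
                double num = 0.0;
                for (int t = 0; t < T; t++) num += gamma[t][j][k] * o[t][d];
                mu[j][k][d] = num / occ;
            }
            for (int d = 0; d < D; d++) {             /* weighted diagonal variance */
                double num = 0.0;
                for (int t = 0; t < T; t++) {
                    double diff = o[t][d] - mu[j][k][d];
                    num += gamma[t][j][k] * diff * diff;
                }
                var[j][k][d] = num / occ;
            }
        }
    }
}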



Page 33: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 33

Basic Problem 1 of HMM- The Backward Procedure (cont)

bull 2(3)=P(o3o4hellip oT|s2=3)=a31 b1(o3)3(1) +a32 b2(o3)3(2)+a33 b1(o3)3(3)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O2 O3 OT

1 2 3 T-1 T Time

s2

s3

s3

OT-1

s2

s3

s1

State

s2

s1

s3

HMM is a Kind of Bayesian Network

SP - Berlin Chen 34

S1 S2 S3 ST

O1 O2 O3 OT

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example (worked numerically in the snippet after this slide)
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"   (the: deleted; effect, is, clear: matched; not: inserted)
  – Error analysis: one deletion and one insertion
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate      = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% × 2/4 = 50%
    Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence) = 100% × 3/4 = 75%
    Word Accuracy Rate   = 100% × (Matched − Ins words) / (No. of words in the correct sentence) = 100% × (3 − 1)/4 = 50%

  – WER + WAR = 100%
  – WER might be higher than 100%, and WAR might be negative

SP - Berlin Chen 73
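
The same arithmetic, wrapped in a tiny C helper; the counts passed in main() are those of the example above.

    #include <stdio.h>

    /* WER / WCR / WAR from alignment counts, as defined on the slide above.
       n_ref is the number of words in the correct (reference) sentence.    */
    static void report_rates(int hit, int sub, int del, int ins, int n_ref)
    {
        double wer = 100.0 * (sub + del + ins) / n_ref;  /* may exceed 100% */
        double wcr = 100.0 * hit / n_ref;
        double war = 100.0 * (hit - ins) / n_ref;        /* may be negative */
        printf("WER=%.1f%%  WCR=%.1f%%  WAR=%.1f%%\n", wer, wcr, war);
    }

    int main(void)
    {
        /* "the effect is clear" vs "effect is not clear":
           3 hits, 0 substitutions, 1 deletion, 1 insertion, 4 reference words */
        report_rates(3, 0, 1, 1, 4);   /* prints WER=50.0%  WCR=75.0%  WAR=50.0% */
        return 0;
    }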

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

[Figure: alignment grid with the reference word index (Ref i) on one axis and the test word index (Test j) on the other; n and m denote the word lengths of the correct/reference sentence and the recognized/test sentence; each grid point [i,j] stores the minimum word error alignment reaching it, and each move corresponds to one of the kinds of alignment (hit, substitution, deletion, insertion)]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)

  Step 1: Initialization
    G[0][0] = 0
    for i = 1 … n (test):       G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1 … m (reference):  G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1 … n (test), for j = 1 … m (reference):
      G[i][j] = min( G[i-1][j] + 1    (Insertion),
                     G[i][j-1] + 1    (Deletion),
                     G[i-1][j-1] + 1  (Substitution, if LR[j] ≠ LT[i]),
                     G[i-1][j-1]      (Match, if LR[j] = LT[i]) )
      B[i][j] = 1 (Insertion, horizontal direction), 2 (Deletion, vertical direction),
                3 (Substitution, diagonal direction) or 4 (Match, diagonal direction),
                according to the predecessor chosen above

  Step 3: Measure and Backtrace
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path: B[n][m] → … → B[0][0]
      if B[i][j] = 1: print LT[i] (Insertion), then go left
      else if B[i][j] = 2: print LR[j] (Deletion), then go down
      else: print LR[j] (Hit/Match or Substitution), then go down diagonally

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here
  (a runnable C version of this dynamic program is sketched after this slide)

SP - Berlin Chen 75
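
Below is a compact, self-contained C sketch of the alignment algorithm above, with all penalties set to 1 as on the slide; the array bound and the demo word strings (taken from the earlier "the effect is clear" example) are illustrative assumptions.

    #include <stdio.h>
    #include <string.h>

    #define MAXW 64                      /* max words per sentence (assumed) */
    enum { INS = 1, DEL = 2, SUB = 3, MATCH = 4 };

    /* Minimum-error alignment with unit penalties, as in Steps 1-3 above.
       test[1..n] and ref[1..m] hold the word strings; returns G[n][m].    */
    static int align(char *test[], int n, char *ref[], int m)
    {
        static int G[MAXW][MAXW], B[MAXW][MAXW];

        G[0][0] = 0;
        for (int i = 1; i <= n; i++) { G[i][0] = G[i - 1][0] + 1; B[i][0] = INS; }
        for (int j = 1; j <= m; j++) { G[0][j] = G[0][j - 1] + 1; B[0][j] = DEL; }

        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int hit  = strcmp(test[i], ref[j]) == 0;
                int diag = G[i - 1][j - 1] + (hit ? 0 : 1);
                int ins  = G[i - 1][j] + 1;
                int del  = G[i][j - 1] + 1;
                if (diag <= ins && diag <= del) { G[i][j] = diag; B[i][j] = hit ? MATCH : SUB; }
                else if (ins < del)             { G[i][j] = ins;  B[i][j] = INS; }
                else                            { G[i][j] = del;  B[i][j] = DEL; }
            }

        /* backtrace from (n, m) to (0, 0), printing the alignment in reverse */
        for (int i = n, j = m; i > 0 || j > 0; ) {
            if (j == 0 || B[i][j] == INS)      { printf("Ins %s\n", test[i]); i--; }
            else if (i == 0 || B[i][j] == DEL) { printf("Del %s\n", ref[j]);  j--; }
            else { printf("%s %s\n", B[i][j] == MATCH ? "Hit" : "Sub", ref[j]); i--; j--; }
        }
        return G[n][m];
    }

    int main(void)
    {
        /* the slide's example: Correct "the effect is clear", Recognized "effect is not clear" */
        char *ref[]  = { "", "the", "effect", "is", "clear" };          /* m = 4 */
        char *test[] = { "", "effect", "is", "not", "clear" };          /* n = 4 */
        int m = 4, n = 4;

        int errors = align(test, n, ref, m);
        printf("WER = %.1f%%\n", 100.0 * errors / m);                   /* 50.0% */
        return 0;
    }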

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style)

[Figure: alignment grid between the correct/reference word sequence and the recognized/test word sequence; the first row accumulates pure insertions (1 Ins, 2 Ins, 3 Ins, …), the first column pure deletions (1 Del, 2 Del, 3 Del, …), and an interior cell (i,j) is reached from (i-1,j-1), (i-1,j) or (i,j-1), following HTK conventions]

  – Initialization:

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;  grid[0][0].dir = NIL;

    for (i = 1; i <= n; i++) {                      /* test */
        grid[i][0] = grid[i-1][0];  grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;  grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {                      /* reference */
        grid[0][j] = grid[0][j-1];  grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;  grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program:

    for (i = 1; i <= n; i++) {                       /* test */
        gridi = grid[i];  gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {                   /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i]) d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {                  /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];              /* structure assignment */
                gridi[j].score = d;  gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;  else ++gridi[j].sub;
            } else if (h < v) {                      /* HOR = ins */
                gridi[j] = gridi1[j];                /* structure assignment */
                gridi[j].score = h;  gridi[j].dir = HOR;  ++gridi[j].ins;
            } else {                                 /* VERT = del */
                gridi[j] = gridi[j-1];               /* structure assignment */
                gridi[j].score = v;  gridi[j].dir = VERT;  ++gridi[j].del;
            }
        }  /* for j */
    }      /* for i */

• Example 1
    Correct: A C B C C
    Test:    B A B C
    Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C  →  WER = 60%
    (there is still another optimal alignment)

[Figure: the filled alignment grid for this example, each cell annotated with its accumulated (Ins, Del, Sub, Hit) counts, following HTK conventions]

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
    Correct: A C B C C
    Test:    B A A C

    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C  →  WER = 80%
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C  →  WER = 80%
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C         →  WER = 80%

[Figure: the corresponding alignment grid, each cell annotated with its accumulated (Ins, Del, Sub, Hit) counts]

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7

  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference:   桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (in the files, each character is preceded by two time fields, e.g. "100000 100000")

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

  ------------------------ Overall Results --------------------------
  SENT: Correct=0.00% [H=0, S=506, N=506]
  WORD: Corr=86.83%, Acc=86.06% [H=57144, D=829, S=7839, I=504, N=65812]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: Correct=0.00% [H=0, S=1, N=1]
  WORD: Corr=81.52%, Acc=81.52% [H=75, D=4, S=13, I=0, N=92]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: Correct=0.00% [H=0, S=100, N=100]
  WORD: Corr=87.66%, Acc=86.83% [H=10832, D=177, S=1348, I=102, N=12357]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: Correct=0.00% [H=0, S=200, N=200]
  WORD: Corr=87.91%, Acc=87.18% [H=22657, D=293, S=2824, I=186, N=25774]
  ====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B of colored balls, and a 3-state HMM (s1, s2, s3) with transition probabilities such as 0.6, 0.7, 0.3, 0.2, 0.1 and state observation distributions {A: .3, B: .2, C: .5}, {A: .7, B: .1, C: .2}, {A: .3, B: .6, C: .1}]

Observed data O: "ball sequence";  Latent data S: "bottle sequence"

Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

Given o1 o2 …… oT, iteratively update λ → λ' such that p(O|λ') > p(O|λ)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data (in our case here, the state sequence S is the latent data)
    • Direct access to the data necessary to estimate the parameters is impossible or difficult (in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence)
  – Two Major Steps
    • E: take the expectation with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations O
    • M: provide a new estimation of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = X1, X2, …, Xn → x = x1, x2, …, xn
  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(x|Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of Φ = {μ, Σ} are
      μ_ML = (1/n) Σ_{i=1..n} x_i
      Σ_ML = (1/n) Σ_{i=1..n} (x_i − μ_ML)(x_i − μ_ML)ᵀ
    (a tiny numerical sketch of these estimates follows this slide)
  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior p(Φ|x) is maximum

(ML and MAP)

SP - Berlin Chen 85
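
A tiny C sketch of the ML estimates shown above for i.i.d. data, using a 2-dimensional example; the data values are made up purely for illustration.

    #include <stdio.h>

    #define N 4          /* number of samples (illustrative) */
    #define D 2          /* dimensionality                    */

    int main(void)
    {
        double x[N][D] = { {1.0, 2.0}, {2.0, 1.0}, {3.0, 4.0}, {2.0, 3.0} };
        double mu[D] = {0}, sigma[D][D] = {{0}};

        /* mu_ML = (1/n) * sum_i x_i */
        for (int i = 0; i < N; i++)
            for (int d = 0; d < D; d++) mu[d] += x[i][d] / N;

        /* Sigma_ML = (1/n) * sum_i (x_i - mu)(x_i - mu)^T */
        for (int i = 0; i < N; i++)
            for (int r = 0; r < D; r++)
                for (int c = 0; c < D; c++)
                    sigma[r][c] += (x[i][r] - mu[r]) * (x[i][c] - mu[c]) / N;

        printf("mu = (%.2f, %.2f)\n", mu[0], mu[1]);
        printf("Sigma = [%.2f %.2f; %.2f %.2f]\n",
               sigma[0][0], sigma[0][1], sigma[1][0], sigma[1][1]);
        return 0;
    }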

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood log P(O|λ) of the incomplete data by iteratively maximizing the expectation of the log-likelihood log P(O,S|λ) of the complete data
• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S|O,λ), to compute a new λ', the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression, and the expectation taken over S:
      By Bayes' rule, the complete-data likelihood and the incomplete-data likelihood are related by
        P(O, S | λ') = P(S | O, λ') P(O | λ')
        ⇒ log P(O | λ') = log P(O, S | λ') − log P(S | O, λ')
      Take the expectation over S under the known model λ (λ' is the unknown model setting):
        log P(O | λ') = Σ_S P(S|O,λ) log P(O, S | λ') − Σ_S P(S|O,λ) log P(S | O, λ')

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O|λ') as follows:
        log P(O|λ') = Q(λ, λ') + H(λ, λ'),
      where
        Q(λ, λ') = Σ_S P(S|O,λ) log P(O, S | λ')
        H(λ, λ') = − Σ_S P(S|O,λ) log P(S | O, λ')
    • We want log P(O|λ') ≥ log P(O|λ), i.e.
        Q(λ, λ') + H(λ, λ') ≥ Q(λ, λ) + H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ') has the following property: H(λ, λ') ≥ H(λ, λ), since
    H(λ, λ') − H(λ, λ) = − Σ_S P(S|O,λ) log [ P(S|O,λ') / P(S|O,λ) ]
                       ≥ − Σ_S P(S|O,λ) [ P(S|O,λ') / P(S|O,λ) − 1 ]     (Jensen's inequality, log x ≤ x − 1)
                       = − Σ_S P(S|O,λ') + Σ_S P(S|O,λ) = −1 + 1 = 0
  (H(λ, λ') − H(λ, λ) is the Kullback-Leibler (KL) distance between P(S|O,λ) and P(S|O,λ'))
  – Therefore, for maximizing log P(O|λ'), we only need to maximize the Q-function (auxiliary function)
      Q(λ, λ') = Σ_S P(S|O,λ) log P(O, S | λ'),
    i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function
      Q(λ, λ') = Σ_S P(S|O,λ) log P(O,S|λ') = Σ_S [ P(O,S|λ) / P(O|λ) ] log P(O,S|λ')
  – where P(O,S|λ) and log P(O,S|λ') can be expressed as
      P(O,S|λ) = π_{s1} · Π_{t=1..T-1} a_{s_t s_{t+1}} · Π_{t=1..T} b_{s_t}(o_t)
      log P(O,S|λ') = log π'_{s1} + Σ_{t=1..T-1} log a'_{s_t s_{t+1}} + Σ_{t=1..T} log b'_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b'), where

    Q_π(λ, π') = Σ_{i=1..N} [ P(O, s_1 = i | λ) / P(O|λ) ] · log π'_i

    Q_a(λ, a') = Σ_{i=1..N} Σ_{j=1..N} Σ_{t=1..T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] · log a'_{ij}

    Q_b(λ, b') = Σ_{j=1..N} Σ_{k=1..M} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] · log b'_j(v_k)

  – Each term is of the form Σ_j w_j log y_j (the w_i and y_i used on the next slide)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π'_i, a'_{ij} and b'_j(k)
  – They can be maximized individually
  – All are of the same form
      F(y) = g(y_1, y_2, …, y_N) = Σ_{j=1..N} w_j log y_j,  where y_j ≥ 0 and Σ_{j=1..N} y_j = 1
    F(y) has its maximum value when
      y_j = w_j / Σ_{j=1..N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
  By applying the Lagrange multiplier l with the constraint Σ_{j=1..N} y_j = 1, suppose that
      F = Σ_{j=1..N} w_j log y_j + l ( Σ_{j=1..N} y_j − 1 )
  Setting ∂F/∂y_j = w_j / y_j + l = 0 gives w_j = −l·y_j for every j, hence
      Σ_{j=1..N} w_j = −l Σ_{j=1..N} y_j = −l,
  and therefore
      y_j = w_j / Σ_{j=1..N} w_j

  Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ' = (A', B', π') can be expressed as (with γ_t(i) = P(s_t = i | O, λ) and ξ_t(i,j) = P(s_t = i, s_{t+1} = j | O, λ)):

    π'_i = P(O, s_1 = i | λ) / P(O|λ) = γ_1(i)

    a'_{ij} = Σ_{t=1..T-1} P(O, s_t = i, s_{t+1} = j | λ) / Σ_{t=1..T-1} P(O, s_t = i | λ)
            = Σ_{t=1..T-1} ξ_t(i,j) / Σ_{t=1..T-1} γ_t(i)

    b'_i(k) = Σ_{t: o_t = v_k} P(O, s_t = i | λ) / Σ_{t=1..T} P(O, s_t = i | λ)
            = Σ_{t: o_t = v_k} γ_t(i) / Σ_{t=1..T} γ_t(i)

(A condensed C sketch of these updates follows this slide.)

SP - Berlin Chen 94
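
A condensed C sketch of these re-estimation formulas for a single training utterance, assuming the posteriors gamma[t][i] = γ_t(i) and xi[t][i][j] = ξ_t(i,j) have already been computed with the forward-backward algorithm; the sizes and names are illustrative assumptions.

    #define T 100   /* number of frames (assumed) */
    #define N 3     /* number of states (assumed) */
    #define M 2     /* codebook size    (assumed) */

    /* Re-estimate pi, A and B from the posteriors of one training utterance:
       gamma[t][i] = P(s_t = i | O, lambda), xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda),
       obs[t] is the codeword index of o_t.                                                    */
    void reestimate(const double gamma[T][N], const double xi[T - 1][N][N],
                    const int obs[T], double pi[N], double A[N][N], double B[N][M])
    {
        for (int i = 0; i < N; i++) {
            pi[i] = gamma[0][i];                             /* pi_i = gamma_1(i) */

            double occ_trans = 0.0, occ_all = 0.0;
            for (int t = 0; t < T - 1; t++) occ_trans += gamma[t][i];
            for (int t = 0; t < T;     t++) occ_all   += gamma[t][i];

            for (int j = 0; j < N; j++) {                    /* a_ij */
                double num = 0.0;
                for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
                A[i][j] = num / occ_trans;
            }
            for (int k = 0; k < M; k++) {                    /* b_i(v_k) */
                double num = 0.0;
                for (int t = 0; t < T; t++)
                    if (obs[t] == k) num += gamma[t][i];
                B[i][k] = num / occ_all;
            }
        }
    }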

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1..M} c_{jk} b_{jk}(o) = Σ_{k=1..M} c_{jk} N(o; μ_{jk}, Σ_{jk})
             = Σ_{k=1..M} c_{jk} (2π)^(−L/2) |Σ_{jk}|^(−1/2) exp( −(1/2)(o − μ_{jk})ᵀ Σ_{jk}^(−1) (o − μ_{jk}) ),
      with Σ_{k=1..M} c_{jk} = 1

[Figure: distribution for state i – a mixture of three Gaussians N1, N2, N3 with weights w_{i1}, w_{i2}, w_{i3}]

(A small sketch of evaluating this output density follows this slide.)

SP - Berlin Chen 95
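
A small C sketch of evaluating such a state output density in the log domain, restricted to diagonal covariances for simplicity (the full-covariance form on the slide would additionally need the inverse and determinant of Σ_jk); all sizes and names are illustrative assumptions.

    #include <math.h>

    #define L 39    /* feature dimension   (assumed) */
    #define M 4     /* mixtures per state  (assumed) */

    /* log b_j(o) = log sum_k c_jk N(o; mu_jk, diag(var_jk)) for one state j. */
    double log_output_prob(const double o[L], const double c[M],
                           const double mu[M][L], const double var[M][L])
    {
        const double PI = 3.14159265358979323846;
        double best = -HUGE_VAL, logp[M];

        for (int k = 0; k < M; k++) {
            double e = 0.0, logdet = 0.0;
            for (int d = 0; d < L; d++) {
                double diff = o[d] - mu[k][d];
                e      += diff * diff / var[k][d];
                logdet += log(var[k][d]);
            }
            logp[k] = log(c[k]) - 0.5 * (L * log(2.0 * PI) + logdet + e);
            if (logp[k] > best) best = logp[k];
        }
        /* log-sum-exp over the mixture components (cf. the LogAdd routine
           shown earlier for the forward-backward algorithm).              */
        double sum = 0.0;
        for (int k = 0; k < M; k++) sum += exp(logp[k] - best);
        return best + log(sum);
    }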

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_{jk}(o_t):

    p(O, S | λ) = π_{s1} · Π_{t=1..T-1} a_{s_t s_{t+1}} · Π_{t=1..T} b_{s_t}(o_t)
                = π_{s1} · Π_{t=1..T-1} a_{s_t s_{t+1}} · Π_{t=1..T} [ Σ_{k=1..M} c_{s_t k} b_{s_t k}(o_t) ]
                = Σ_K p(O, S, K | λ),

  where K = (k_1, k_2, …, k_T) is one of the possible mixture-component sequences along the state sequence S, and

    p(O, S, K | λ) = π_{s1} · Π_{t=1..T-1} a_{s_t s_{t+1}} · Π_{t=1..T} c_{s_t k_t} b_{s_t k_t}(o_t),
    p(O | λ) = Σ_S Σ_K p(O, S, K | λ)

  Note: Π_{t=1..T} [ Σ_{k=1..M} a_{t,k} ] = Σ_{k_1=1..M} Σ_{k_2=1..M} … Σ_{k_T=1..M} Π_{t=1..T} a_{t,k_t}
  (e.g., (a_{1,1}+…+a_{1,M})(a_{2,1}+…+a_{2,M})…(a_{T,1}+…+a_{T,M}) expands into a sum of M^T products)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ') = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ')
             = Σ_S Σ_K [ p(O, S, K | λ) / p(O | λ) ] log p(O, S, K | λ')

  with
    log p(O, S, K | λ') = log π'_{s1} + Σ_{t=1..T-1} log a'_{s_t s_{t+1}} + Σ_{t=1..T} log b'_{s_t k_t}(o_t) + Σ_{t=1..T} log c'_{s_t k_t},

  so that Q decomposes as Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b') + Q_c(λ, c')
  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

    Q_b(λ, b') = Σ_{j=1..N} Σ_{k=1..M} Σ_{t=1..T} P(s_t = j, k_t = k | O, λ) log b'_{jk}(o_t)

    Q_c(λ, c') = Σ_{j=1..N} Σ_{k=1..M} Σ_{t=1..T} P(s_t = j, k_t = k | O, λ) log c'_{jk}

  where γ_t(j,k) = P(s_t = j, k_t = k | O, λ) is the probability of being in state j at time t with the k-th mixture component accounting for o_t

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j,k) = P(s_t = j, k_t = k | O, λ). With

    b'_{jk}(o_t) = (2π)^(−L/2) |Σ'_{jk}|^(−1/2) exp( −(1/2)(o_t − μ'_{jk})ᵀ (Σ'_{jk})^(−1) (o_t − μ'_{jk}) )
    log b'_{jk}(o_t) = −(L/2) log(2π) − (1/2) log|Σ'_{jk}| − (1/2)(o_t − μ'_{jk})ᵀ (Σ'_{jk})^(−1) (o_t − μ'_{jk}),

  set the derivative of Q_b with respect to μ'_{jk} to zero (using d(xᵀCx)/dx = (C + Cᵀ)x, and (Σ'_{jk})^(−1) is symmetric here):

    ∂Q_b/∂μ'_{jk} = Σ_{t=1..T} γ_t(j,k) (Σ'_{jk})^(−1) (o_t − μ'_{jk}) = 0,

  which gives

    μ'_{jk} = Σ_{t=1..T} γ_t(j,k) o_t / Σ_{t=1..T} γ_t(j,k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, set the derivative of Q_b with respect to (Σ'_{jk})^(−1) to zero (using d(aᵀXb)/dX = abᵀ and d det(X)/dX = det(X)(X^(−1))ᵀ, with Σ'_{jk} symmetric here):

    ∂Q_b/∂(Σ'_{jk})^(−1) = Σ_{t=1..T} γ_t(j,k) [ (1/2) Σ'_{jk} − (1/2)(o_t − μ'_{jk})(o_t − μ'_{jk})ᵀ ] = 0,

  which gives

    Σ'_{jk} = Σ_{t=1..T} γ_t(j,k) (o_t − μ'_{jk})(o_t − μ'_{jk})ᵀ / Σ_{t=1..T} γ_t(j,k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

    μ'_{jk} = Σ_{t=1..T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1..T} p(s_t = j, k_t = k | O, λ)
            = Σ_{t=1..T} γ_t(j,k) o_t / Σ_{t=1..T} γ_t(j,k)

    Σ'_{jk} = Σ_{t=1..T} γ_t(j,k) (o_t − μ'_{jk})(o_t − μ'_{jk})ᵀ / Σ_{t=1..T} γ_t(j,k)

    c'_{jk} = Σ_{t=1..T} γ_t(j,k) / Σ_{t=1..T} Σ_{k'=1..M} γ_t(j,k')

(A compact accumulation/update sketch of these formulas follows.)
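
A minimal C sketch of these mixture updates for a single state j, again assuming diagonal covariances and precomputed component posteriors gamma[t][k] = γ_t(j,k); sizes and names are illustrative assumptions only.

    #define T 100   /* frames             (assumed) */
    #define M 4     /* mixtures per state (assumed) */
    #define L 39    /* feature dimension  (assumed) */

    /* Update weights, means and (diagonal) variances of state j's mixtures
       from gamma[t][k] = P(s_t = j, k_t = k | O, lambda) and the frames o[t]. */
    void update_state_mixtures(const double gamma[T][M], const double o[T][L],
                               double c[M], double mu[M][L], double var[M][L])
    {
        double occ_state = 0.0;                    /* sum_t sum_k gamma_t(j,k) */
        for (int t = 0; t < T; t++)
            for (int k = 0; k < M; k++) occ_state += gamma[t][k];

        for (int k = 0; k < M; k++) {
            double occ = 0.0;                      /* sum_t gamma_t(j,k) */
            for (int t = 0; t < T; t++) occ += gamma[t][k];

            c[k] = occ / occ_state;                /* c_jk */

            for (int d = 0; d < L; d++) {          /* mu_jk */
                double num = 0.0;
                for (int t = 0; t < T; t++) num += gamma[t][k] * o[t][d];
                mu[k][d] = num / occ;
            }
            for (int d = 0; d < L; d++) {          /* diagonal Sigma_jk */
                double num = 0.0;
                for (int t = 0; t < T; t++) {
                    double diff = o[t][d] - mu[k][d];
                    num += gamma[t][k] * diff * diff;
                }
                var[k][d] = num / occ;
            }
        }
    }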

Page 35: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 35

Basic Problem 2 of HMM

How to choose an optimal state sequence S=(s1s2helliphellip sT)

N

1m tt

ttN

1m t

ttt

mmii

msPisP

PisP

i

λO

λOλO

λO

bull The first optimal criterion Choose the states st are individually most likely at each time t

Define a posteriori probability variable

ndash Solution st = argi max [t(i)] 1 t Tbull Problem maximizing the probability at each time t individually

S= s1s2hellipsT may not be a valid sequence (eg astst+1 = 0)

λO isPi tt

state occupation probability (count) ndash a soft alignment of HMM state to the observation (feature)

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont.)

• $\sum_{t=1}^{T-1} \xi_t(i,j)$ = expected number of transitions from state i to state j in $\mathbf{O}$

• $\sum_{t=1}^{T-1} \gamma_t(i) = \sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)$ = expected number of transitions from state i in $\mathbf{O}$

• A set of reasonable re-estimation formulae for {π, A} is

$\bar{\pi}_i = \gamma_1(i)$ = expected frequency (number of times) in state i at time t = 1

$\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

where $\xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid \mathbf{O}, \lambda)$ and $\gamma_t(i) = P(s_t = i \mid \mathbf{O}, \lambda)$.

Formulae for Single Training Utterance
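In code these re-estimates reduce to time-summing the ξ and γ statistics and dividing. A minimal sketch reusing the arrays from the ComputeXiGamma sketch above (names and fixed sizes are illustrative assumptions):

#define N 3      /* number of states (illustrative) */
#define T 100    /* number of frames (illustrative) */

/* Sketch: re-estimate pi and A from one utterance's xi/gamma statistics. */
void ReestimateTransitions(const double xi[T-1][N][N], const double gamma[T][N],
                           double pi_new[N], double a_new[N][N])
{
    for (int i = 0; i < N; i++) {
        pi_new[i] = gamma[0][i];              /* expected frequency in state i at t = 1 */
        double from_i = 0.0;                  /* expected transitions out of state i    */
        for (int t = 0; t < T - 1; t++)
            from_i += gamma[t][i];
        for (int j = 0; j < N; j++) {
            double i_to_j = 0.0;              /* expected transitions i -> j            */
            for (int t = 0; t < T - 1; t++)
                i_to_j += xi[t][i][j];
            a_new[i][j] = i_to_j / from_i;
        }
    }
}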

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulae for B is
– For discrete and finite observations, $b_j(v_k) = P(o_t = v_k \mid s_t = j)$:

$\bar{b}_j(v_k) = \dfrac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} = \dfrac{\sum_{t=1, \text{ s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$

– For continuous and infinite observations, $b_j(\mathbf{v}) = f_{\mathbf{O} \mid S}(\mathbf{o}_t = \mathbf{v} \mid s_t = j)$, modeled as a mixture of multivariate Gaussian distributions:

$b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{v}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} c_{jk}\, \dfrac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\!\Big( -\tfrac{1}{2} (\mathbf{v} - \boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{v} - \boldsymbol{\mu}_{jk}) \Big)$

Basic Problem 3 of HMM: Intuitive View (cont.)

– For continuous and infinite observation (cont.)
• Define a new variable $\gamma_t(j,k)$
– $\gamma_t(j,k)$ is the probability of being in state j at time t with the k-th mixture component accounting for $\mathbf{o}_t$:

$\gamma_t(j,k) = P(s_t = j, m_t = k \mid \mathbf{O}, \lambda) = P(s_t = j \mid \mathbf{O}, \lambda)\, P(m_t = k \mid s_t = j, \mathbf{O}, \lambda)$
$\qquad = \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{m=1}^{N} \alpha_t(m)\, \beta_t(m)} \cdot \dfrac{c_{jk}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})}$

(the observation-independence assumption is applied; note also that $p(A, B) = p(A \mid B)\, P(B)$ and that $\gamma_t(j) = \sum_{m=1}^{M} \gamma_t(j, m)$)

[Figure: mixture distribution for state 1 with weights $c_{11}, c_{12}, c_{13}$ and Gaussian components $N_1, N_2, N_3$.]

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

– For continuous and infinite observation (cont.)

$\bar{c}_{jk} = \dfrac{\text{expected number of times in state } j \text{ and mixture } k}{\text{expected number of times in state } j} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}$

$\bar{\boldsymbol{\mu}}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$  (weighted average (mean) of observations at state j and mixture k)

$\bar{\boldsymbol{\Sigma}}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$  (weighted covariance of observations at state j and mixture k)

Formulae for Single Training Utterance
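A corresponding accumulation sketch for the Gaussian-mixture parameters of one state/mixture pair follows (one utterance; diagonal covariances are used for brevity, and all names and fixed sizes are illustrative assumptions):

#define D 39    /* feature dimension (illustrative) */
#define T 100   /* number of frames (illustrative)  */

/* gamma_jk[t] = gamma_t(j,k); gamma_j[t] = gamma_t(j) = sum_m gamma_t(j,m). */
void ReestimateMixture(const double gamma_jk[T], const double gamma_j[T],
                       const double obs[T][D],
                       double *c_jk, double mu_jk[D], double var_jk[D])
{
    double occ_jk = 0.0, occ_j = 0.0;
    for (int t = 0; t < T; t++) { occ_jk += gamma_jk[t]; occ_j += gamma_j[t]; }
    *c_jk = occ_jk / occ_j;                       /* mixture weight               */

    for (int d = 0; d < D; d++) {
        double num = 0.0;
        for (int t = 0; t < T; t++) num += gamma_jk[t] * obs[t][d];
        mu_jk[d] = num / occ_jk;                  /* weighted mean                */
    }
    for (int d = 0; d < D; d++) {
        double num = 0.0;
        for (int t = 0; t < T; t++) {
            double diff = obs[t][d] - mu_jk[d];
            num += gamma_jk[t] * diff * diff;
        }
        var_jk[d] = num / occ_jk;                 /* weighted (diagonal) variance */
    }
}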

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

[Figure: several training utterances of the word 台師大, each processed with a forward-backward (FB) pass against the shared 3-state model (s1, s2, s3) so that their statistics can be pooled for re-estimation.]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

– For continuous and infinite observation (cont.)

Formulae for Multiple (L) Training Utterances:

$\bar{c}_{jk} = \dfrac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \gamma_t^l(j,m)}$  (expected number of times in state j and mixture k, over the expected number of times in state j)

$\bar{\boldsymbol{\mu}}_{jk} = \dfrac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)\, \mathbf{o}_t^l}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}$  (weighted average (mean) of observations at state j and mixture k)

$\bar{\boldsymbol{\Sigma}}_{jk} = \dfrac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)\, (\mathbf{o}_t^l - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t^l - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j,k)}$  (weighted covariance of observations at state j and mixture k)

$\bar{a}_{ij} = \dfrac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i,j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}$  (expected number of transitions from state i to state j, over the expected number of transitions from state i)

$\bar{\pi}_i = \dfrac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)$  (expected frequency (number of times) in state i at time t = 1)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

– For discrete and finite observation (cont.)

Formulae for Multiple (L) Training Utterances:

$\bar{\pi}_i = \dfrac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)$  (expected frequency (number of times) in state i at time t = 1)

$\bar{a}_{ij} = \dfrac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i,j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}$  (expected number of transitions from state i to state j, over the expected number of transitions from state i)

$\bar{b}_j(v_k) = \dfrac{\sum_{l=1}^{L} \sum_{t=1, \text{ s.t. } o_t^l = v_k}^{T_l} \gamma_t^l(j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j)}$  (expected number of times in state j and observing symbol $v_k$, over the expected number of times in state j)

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
– The semicontinuous or tied-mixture HMM
– A combination of the discrete HMM and the continuous HMM
• A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
– Because M is large, we can simply use the L most significant values
• Experience showed that L being 1~3% of M is adequate
– Partial tying of $f(\mathbf{o} \mid v_k)$ for different phonetic classes

$b_j(\mathbf{o}) = \sum_{k=1}^{M} b_j(k)\, f(\mathbf{o} \mid v_k) = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{o}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

where $b_j(\mathbf{o})$ is the state output probability of state j, $b_j(k)$ is the k-th mixture weight of state j (discrete, model-dependent), and $f(\mathbf{o} \mid v_k)$ is the k-th mixture density function, i.e. the k-th codeword (shared across HMMs; M is very large).
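As a sketch of how such a tied-mixture output probability might be evaluated with only the L most significant codewords (the codebook scores are assumed to be precomputed once per frame; all names and sizes are illustrative assumptions):

/* codeword_prob[k] = f(o_t | v_k) for the current frame (shared across HMMs),
 * best[l]          = indices of the L most significant codewords for this frame,
 * weight[k]        = discrete mixture weights b_j(k) of the state being scored. */
double TiedMixtureOutProb(const double *weight, const double *codeword_prob,
                          const int *best, int L)
{
    double p = 0.0;
    for (int l = 0; l < L; l++) {
        int k = best[l];
        p += weight[k] * codeword_prob[k];   /* b_j(k) * f(o | v_k) */
    }
    return p;
}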

SP - Berlin Chen 59

Semicontinuous HMMs (cont.)

[Figure: two 3-state HMMs whose states s1, s2, s3 each keep their own discrete weights $b_1(k), b_2(k), b_3(k)$ for $k = 1, \ldots, M$, while all states share the same codebook of Gaussian kernels $N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2), \ldots, N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \ldots, N(\boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M)$.]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
– Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
– A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
– It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
– Assume that we have a training set of observations and an initial estimate of all model parameters
– Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
– Step 2:
• For a discrete density HMM (using an M-codeword codebook):
$\hat{b}_j(k) = \dfrac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j}$
• For a continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state j into a set of M clusters, then
$\hat{w}_{jm}$ = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
$\hat{\boldsymbol{\mu}}_{jm}$ = sample mean of the vectors classified in cluster m of state j
$\hat{\boldsymbol{\Sigma}}_{jm}$ = sample covariance matrix of the vectors classified in cluster m of state j
– Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated.

[Figure: left-to-right model with states s1, s2, s3.]

SP - Berlin Chen 62

Initialization of HMM (cont.)

[Flow chart: Training Data and an Initial Model feed a State Sequence Segmentation step; the parameters of the observation distributions are then estimated via Segmental K-means and the model is re-estimated; if the model has not converged (NO), the loop returns to segmentation with the updated Model Parameters, otherwise (YES) the Model Parameters are output.]
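A compact sketch of that training loop is given below; the type and function names stand for the boxes in the flow chart and are placeholders, not a real API:

typedef struct HMM HMM;               /* opaque model type (illustrative)                 */
typedef struct Utterance Utterance;   /* one training observation sequence (illustrative) */
#define CONVERGE_THRESH 1.0e-4        /* illustrative stopping threshold                  */

void   SegmentIntoStates(HMM *m, const Utterance *u, int n);   /* Viterbi segmentation   */
void   EstimateParameters(HMM *m, const Utterance *u, int n);  /* per-state K-means etc. */
double TotalModelScore(const HMM *m, const Utterance *u, int n);

void SegmentalKMeansInit(HMM *model, const Utterance *train, int numUtt)
{
    double prevScore = -1.0e30;
    for (;;) {
        SegmentIntoStates(model, train, numUtt);
        EstimateParameters(model, train, numUtt);
        double score = TotalModelScore(model, train, numUtt);
        if (score - prevScore < CONVERGE_THRESH)   /* model converged? */
            break;
        prevScore = score;
    }
}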

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for a discrete HMM
– 3 states and 2 codewords ($v_1$, $v_2$); counting the segmented frames gives
• $b_1(v_1) = 3/4$, $b_1(v_2) = 1/4$
• $b_2(v_1) = 1/3$, $b_2(v_2) = 2/3$
• $b_3(v_1) = 2/3$, $b_3(v_2) = 1/3$

[Figure: state-time trellis over the observations $O_1 \ldots O_{10}$ (time 1 to 10) showing how the frames are segmented into states s1, s2, s3 of the left-to-right model, with each frame labeled by its codeword $v_1$ or $v_2$.]

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for a continuous HMM
– 3 states and 4 Gaussian mixtures per state

[Figure: state-time trellis over the observations $O_1 \ldots O_N$ segmenting frames into states s1, s2, s3; within each state, K-means splits the frames, starting from the global mean, into cluster means (e.g., $\boldsymbol{\mu}_{11}, \boldsymbol{\mu}_{12}, \boldsymbol{\mu}_{13}, \boldsymbol{\mu}_{14}$ for state 1), one per Gaussian mixture component.]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
– The state duration follows an exponential distribution,
$d_i(t) = a_{ii}^{\,t-1} (1 - a_{ii})$
• It doesn't provide an adequate representation of the temporal structure of speech
– First-order (Markov) assumption: the state transition depends only on the origin and destination
– Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations, although these solutions have not significantly improved speech recognition accuracy for practical applications.
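For concreteness, this implicit duration model is geometric, and the expected number of frames spent in state i is $1/(1 - a_{ii})$. A tiny sketch with an illustrative self-loop probability (the value 0.6 is not from the slides):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a_ii = 0.6;                       /* illustrative self-loop probability      */
    for (int t = 1; t <= 5; t++) {
        double d = pow(a_ii, t - 1) * (1.0 - a_ii);   /* d_i(t) = a_ii^(t-1) (1 - a_ii) */
        printf("d_i(%d) = %.3f\n", t, d);
    }
    printf("expected duration = %.2f frames\n", 1.0 / (1.0 - a_ii));
    return 0;
}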

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

[Figure: example duration distributions: geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution.]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

[Figure: likelihood plotted over the model configuration space, with the current model configuration at a local maximum.]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: an ergodic 3-state HMM (s1, s2, s3); the three states have initial observation probabilities {A: 0.34, B: 0.33, C: 0.33}, {A: 0.33, B: 0.34, C: 0.33} and {A: 0.33, B: 0.33, C: 0.34}, and each state's three outgoing transition probabilities are 0.34, 0.33 and 0.33.]

TrainSet 1:
1. ABBCABCAABC
2. ABCABC
3. ABCA ABC
4. BBABCAB
5. BCAABCCAB
6. CACCABCA
7. CABCABCA
8. CABCA
9. CABCA

TrainSet 2:
1. BBBCCBC
2. CCBABB
3. AACCBBB
4. BBABBAC
5. CCA ABBAB
6. BBBCCBAA
7. ABBBBABA
8. CCCCC
9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: the speech signal goes through Feature Extraction to give the feature sequence X; X is scored against every word model, giving the likelihoods $p(\mathbf{X} \mid M_1), p(\mathbf{X} \mid M_2), \ldots, p(\mathbf{X} \mid M_V)$ and $p(\mathbf{X} \mid M_{Sil})$ for the silence model, and a Most Likely Word Selector outputs the recognized word by the Maximum Likelihood (ML) rule.]

$\text{Label}(\mathbf{X}) = \arg\max_k\, p(\mathbf{X} \mid M_k)$

Viterbi approximation: $\text{Label}(\mathbf{X}) = \arg\max_k\, \max_{\mathbf{S}}\, p(\mathbf{X}, \mathbf{S} \mid M_k)$
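A minimal sketch of that decision rule (the scoring function stands for a forward or Viterbi pass against one word HMM; the type, function and parameter names are illustrative assumptions):

#define V 1000                           /* vocabulary size (illustrative)        */

typedef struct WordHMM WordHMM;          /* opaque word-model type (illustrative) */
double LogLikelihood(const WordHMM *m, const float *X, int T);   /* placeholder scorer */

int MostLikelyWord(const WordHMM *models[V], const float *X, int T)
{
    int    best = 0;
    double bestScore = LogLikelihood(models[0], X, T);
    for (int k = 1; k < V; k++) {
        double score = LogLikelihood(models[k], X, T);   /* log p(X | M_k) */
        if (score > bestScore) { bestScore = score; best = k; }
    }
    return best;                         /* argmax_k p(X | M_k) */
}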

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
– Substitution: an incorrect word was substituted for the correct word
– Deletion: a correct word was omitted in the recognized sentence
– Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Can be handled by dynamic programming

• Example

Correct: "the effect is clear"
Recognized: "effect is not clear"
("the" deleted; "effect", "is" and "clear" matched; "not" inserted)

– Error analysis: one deletion and one insertion
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

Word Error Rate = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%   (might be higher than 100%)

Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%

Word Accuracy Rate = 100% × (Matched - Ins words) / (No. of words in the correct sentence) = (3 - 1)/4 = 50%   (might be negative)

Note: WER + WAR = 100%
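These three rates follow directly from the alignment counts; a tiny sketch using the example's numbers (4 reference words; 3 matches, 0 substitutions, 1 deletion, 1 insertion):

#include <stdio.h>

int main(void)
{
    int n = 4, hit = 3, sub = 0, del = 1, ins = 1;   /* counts from the example */
    double wer = 100.0 * (sub + del + ins) / n;      /* 50%                     */
    double wcr = 100.0 * hit / n;                    /* 75%                     */
    double war = 100.0 * (hit - ins) / n;            /* 50%                     */
    printf("WER = %.0f%%, WCR = %.0f%%, WAR = %.0f%%\n", wer, wcr, war);
    return 0;
}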

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

[Figure: alignment grid with the reference word index (Ref, i) on one axis and the test word index (Test, j) on the other; n denotes the word length of the correct/reference sentence and m the word length of the recognized/test sentence; each grid cell [i, j] holds the minimum word error alignment up to that point, and the arrows indicate the possible kinds of alignment (hit or substitution diagonally, insertion and deletion along the axes).]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen); i indexes the test sentence (length n) and j indexes the reference sentence (length m)

Step 1, Initialization:
G[0][0] = 0
for i = 1, ..., n (test):  G[i][0] = G[i-1][0] + 1;  B[i][0] = 1 (Insertion, horizontal direction)
for j = 1, ..., m (reference):  G[0][j] = G[0][j-1] + 1;  B[0][j] = 2 (Deletion, vertical direction)

Step 2, Iteration:
for i = 1, ..., n (test), for j = 1, ..., m (reference):
G[i][j] = min( G[i-1][j] + 1 (Insertion), G[i][j-1] + 1 (Deletion), G[i-1][j-1] + 1 if LT[i] ≠ LR[j] (Substitution), G[i-1][j-1] if LT[i] = LR[j] (Match) )
B[i][j] = 1 (Insertion, horizontal direction), 2 (Deletion, vertical direction), 3 (Substitution, diagonal direction) or 4 (Match, diagonal direction), according to which case gave the minimum

Step 3, Backtrace and Measure:
Word Error Rate = 100% × G[n][m] / m
Word Accuracy Rate = 100% - Word Error Rate
Optimal backtrace path: from B[n][m] back to B[0][0]; if B[i][j] = 1, print LT[i] (Insertion) and go left; else if B[i][j] = 2, print LR[j] (Deletion) and go down; else print LR[j] (Hit/Match or Substitution) and go down diagonally

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: DP alignment grid between the correct/reference word sequence and the recognized/test word sequence; the first row and first column accumulate pure insertions and deletions, and an interior cell (i, j) is reached from (i-1, j-1), (i-1, j) or (i, j-1) on the way to the final cell (n, m). (HTK)]

• A Dynamic Programming Algorithm
– Initialization

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;

for (i = 1; i <= n; i++) {            /* test */
    grid[i][0] = grid[i-1][0];
    grid[i][0].dir = HOR;
    grid[i][0].score += InsPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j] = grid[0][j-1];
    grid[0][j].dir = VERT;
    grid[0][j].score += DelPen;
    grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program

for (i = 1; i <= n; i++) {                      /* test */
    gridi  = grid[i];
    gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {                  /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i])
            d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {                 /* DIAG = hit or sub */
            gridi[j] = gridi1[j-1];
            gridi[j].score = d;
            gridi[j].dir = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit;
            else                     ++gridi[j].sub;
        } else if (h < v) {                     /* HOR = ins */
            gridi[j] = gridi1[j];
            gridi[j].score = h;
            gridi[j].dir = HOR;
            ++gridi[j].ins;
        } else {                                /* VERT = del */
            gridi[j] = gridi[j-1];
            gridi[j].score = v;
            gridi[j].dir = VERT;
            ++gridi[j].del;
        }
    }   /* for j */
}       /* for i */

(the grid-cell copies such as gridi[j] = gridi1[j-1] are structure assignments)

• Example 1 (HTK)

Correct: A C B C C
Test: B A B C

[Figure: the filled DP grid, each cell annotated with its (Ins, Del, Sub, Hit) counts; backtracing from the final cell yields the alignment below, and there is still another optimal alignment.]

Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C    WER = 60%

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (HTK)

Correct: A C B C C
Test: B A A C

[Figure: the filled DP grid with (Ins, Del, Sub, Hit) counts per cell; several backtrace paths are equally good.]

Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C    WER = 80%
Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C    WER = 80%
Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C    WER = 80%

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

HTK error penalties: subPen = 10, delPen = 7, insPen = 7
NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

Reference (each character is listed on its own line, preceded by two "100000 100000" fields):
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……

ASR Output (same format):
桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first one, 100, 200, and 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: Correct=0.00 [H=0, S=506, N=506]
WORD: Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================

------------------------ Overall Results ------------------------
SENT: Correct=0.00 [H=0, S=1, N=1]
WORD: Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: Correct=0.00 [H=0, S=100, N=100]
WORD: Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: Correct=0.00 [H=0, S=200, N=200]
WORD: Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B containing red (R) and green (G) balls. Observed data O: the "ball sequence"; latent data S: the "bottle sequence". The parameters to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).]

[Figure: a 3-state HMM λ (states s1, s2, s3 with discrete output distributions over {A, B, C} and transition probabilities) generating the observation sequence $o_1 o_2 \ldots o_T$ with likelihood $p(\mathbf{O} \mid \lambda)$; each re-estimation step produces a new model $\bar{\lambda}$ such that $p(\mathbf{O} \mid \bar{\lambda}) > p(\mathbf{O} \mid \lambda)$.]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
• Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data; in our case here, the state sequence S is the latent data
• Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence
– Two Major Steps:
• E: take the expectation $E_{\mathbf{S}}[\,\cdot \mid \mathbf{O}, \lambda\,]$ with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
• M: provide a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations $\mathbf{X} = X_1, X_2, \ldots, X_n \rightarrow \mathbf{x} = x_1, x_2, \ldots, x_n$

– The Maximum Likelihood (ML) Principle: find the model parameter $\Phi$ so that the likelihood $p(\mathbf{x} \mid \Phi)$ is maximum. For example, if $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}$ is the parameter set of a multivariate normal distribution and $\mathbf{X}$ is i.i.d. (independent, identically distributed), then the ML estimate of $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}$ is

$\boldsymbol{\mu}_{ML} = \dfrac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i, \qquad \boldsymbol{\Sigma}_{ML} = \dfrac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}_{ML})(\mathbf{x}_i - \boldsymbol{\mu}_{ML})^{T}$

– The Maximum A Posteriori (MAP) Principle: find the model parameter $\Phi$ so that the posterior probability $p(\Phi \mid \mathbf{x})$ is maximum
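A small sketch of these ML estimates for i.i.d. vector data follows (the fixed dimension, the diagonal-only covariance, and all names are simplifying assumptions for brevity):

#define D 2     /* feature dimension (illustrative) */

/* Sketch: ML mean and (diagonal) variance from n i.i.d. samples x[i][d]. */
void MLEstimate(int n, const double x[][D], double mu[D], double var[D])
{
    for (int d = 0; d < D; d++) {
        mu[d] = 0.0;
        for (int i = 0; i < n; i++) mu[d] += x[i][d];
        mu[d] /= n;                               /* sample mean                 */
    }
    for (int d = 0; d < D; d++) {
        var[d] = 0.0;
        for (int i = 0; i < n; i++) {
            double diff = x[i][d] - mu[d];
            var[d] += diff * diff;
        }
        var[d] /= n;                              /* ML (biased) sample variance */
    }
}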

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O} \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O}, \mathbf{S} \mid \lambda)$

• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
– The observable training data $\mathbf{O}$
• We want to maximize $P(\mathbf{O} \mid \lambda)$; $\lambda$ is a parameter vector
– The hidden (unobservable) data $\mathbf{S}$
• E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have the current model $\lambda$ and estimate the probability $P(\mathbf{S} \mid \mathbf{O}, \lambda)$ that each state sequence $\mathbf{S}$ occurred in the generation of $\mathbf{O}$
– Pretend we had in fact observed a complete data pair $(\mathbf{O}, \mathbf{S})$ with frequency proportional to that probability, and compute a new maximum likelihood estimate $\bar{\lambda}$ of $\lambda$
– Does the process converge?
– Algorithm
• Log-likelihood expression and expectation taken over S

By Bayes' rule, $P(\mathbf{S}, \mathbf{O} \mid \bar{\lambda}) = P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})\, P(\mathbf{O} \mid \bar{\lambda})$, so that

$\log P(\mathbf{O} \mid \bar{\lambda}) = \log P(\mathbf{S}, \mathbf{O} \mid \bar{\lambda}) - \log P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})$

(the complete data likelihood versus the incomplete data likelihood; $\bar{\lambda}$ is the unknown model setting to be found)

Take the expectation over S with respect to $P(\mathbf{S} \mid \mathbf{O}, \lambda)$, the posterior under the current model:

$\log P(\mathbf{O} \mid \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) - \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})$

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
• We can thus express $\log P(\mathbf{O} \mid \bar{\lambda})$ as follows:

$\log P(\mathbf{O} \mid \bar{\lambda}) = Q(\lambda, \bar{\lambda}) + H(\lambda, \bar{\lambda})$, where

$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda})$

$H(\lambda, \bar{\lambda}) = -\sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})$

• We want $\log P(\mathbf{O} \mid \bar{\lambda}) \ge \log P(\mathbf{O} \mid \lambda)$, and

$\log P(\mathbf{O} \mid \bar{\lambda}) - \log P(\mathbf{O} \mid \lambda) = \big[ Q(\lambda, \bar{\lambda}) - Q(\lambda, \lambda) \big] + \big[ H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) \big]$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda, \bar{\lambda})$ has the following property: $H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) \ge 0$

$H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) = -\sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log \dfrac{P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})}{P(\mathbf{S} \mid \mathbf{O}, \lambda)} \ge -\sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \Big( \dfrac{P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})}{P(\mathbf{S} \mid \mathbf{O}, \lambda)} - 1 \Big) = 0$

(Jensen's inequality, $\log x \le x - 1$; this difference is the Kullback-Leibler (KL) distance between the two posteriors)

– Therefore, for maximizing $\log P(\mathbf{O} \mid \bar{\lambda})$, we only need to maximize the Q-function (auxiliary function)

$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda})$

i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$
– By maximizing the auxiliary function

$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda)\, \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) = \sum_{\mathbf{S}} \dfrac{P(\mathbf{O}, \mathbf{S} \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda})$

– where $P(\mathbf{O}, \mathbf{S} \mid \lambda)$ and $\log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda})$ can be expressed as

$P(\mathbf{O}, \mathbf{S} \mid \lambda) = \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(o_t)$

$\log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

$Q(\lambda, \bar{\lambda}) = Q_{\boldsymbol{\pi}}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{\mathbf{a}}(\lambda, \bar{\mathbf{a}}) + Q_{\mathbf{b}}(\lambda, \bar{\mathbf{b}})$

$Q_{\boldsymbol{\pi}}(\lambda, \bar{\boldsymbol{\pi}}) = \sum_{i=1}^{N} \dfrac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log \bar{\pi}_i$

$Q_{\mathbf{a}}(\lambda, \bar{\mathbf{a}}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \dfrac{P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log \bar{a}_{ij}$

$Q_{\mathbf{b}}(\lambda, \bar{\mathbf{b}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1, \text{ s.t. } o_t = v_k}^{T} \dfrac{P(\mathbf{O}, s_t = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)}\, \log \bar{b}_j(v_k)$

(each term is of the form $\sum_j w_j \log y_j$, cf. the next slide)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(k)$
– They can be maximized individually
– All are of the same form:

$F(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j$, where $y_j \ge 0$ and $\sum_{j=1}^{N} y_j = 1$,

has its maximum value when $y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$.

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j$ with the constraint $\sum_{j=1}^{N} y_j = 1$. By applying the Lagrange multiplier $\ell$,

$F = \sum_{j=1}^{N} w_j \log y_j + \ell \Big( \sum_{j=1}^{N} y_j - 1 \Big)$

$\dfrac{\partial F}{\partial y_j} = \dfrac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j,\ \forall j \;\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell$

$\Rightarrow\; y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\lambda} = (\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\boldsymbol{\pi}})$ can therefore be expressed as

$\bar{\pi}_i = \dfrac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)}$

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)\, /\, P(\mathbf{O} \mid \lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i \mid \lambda)\, /\, P(\mathbf{O} \mid \lambda)}$

$\bar{b}_j(v_k) = \dfrac{\sum_{t=1, \text{ s.t. } o_t = v_k}^{T} P(\mathbf{O}, s_t = j \mid \lambda)\, /\, P(\mathbf{O} \mid \lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, s_t = j \mid \lambda)\, /\, P(\mathbf{O} \mid \lambda)}$

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)

$b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} c_{jk}\, \dfrac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\!\Big( -\tfrac{1}{2} (\mathbf{o} - \boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o} - \boldsymbol{\mu}_{jk}) \Big), \qquad \sum_{k=1}^{M} c_{jk} = 1$

[Figure: distribution for state i as a mixture of Gaussians $N_1, N_2, N_3$ with weights $w_{i1}, w_{i2}, w_{i3}$.]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$:

$p(\mathbf{O}, \mathbf{S} \mid \lambda) = \pi_{s_1} \Big[ \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \Big] \prod_{t=1}^{T} \Big[ \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(\mathbf{o}_t) \Big] = \sum_{\mathbf{K}} \pi_{s_1} \Big[ \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \Big] \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$

where $\mathbf{K} = (k_1, k_2, \ldots, k_T)$ is one of the possible mixture-component sequences along the state sequence $\mathbf{S}$, so that

$p(\mathbf{O} \mid \lambda) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda), \qquad p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda) = \pi_{s_1} \Big[ \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \Big] \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$

Note: $\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$, e.g. $(a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM})$.

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} P(\mathbf{S}, \mathbf{K} \mid \mathbf{O}, \lambda)\, \log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \bar{\lambda}) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} \dfrac{p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda)}{p(\mathbf{O} \mid \lambda)}\, \log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \bar{\lambda})$

$\log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(\mathbf{o}_t) + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t}$

$\Rightarrow\; Q(\lambda, \bar{\lambda}) = Q_{\boldsymbol{\pi}}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{\mathbf{a}}(\lambda, \bar{\mathbf{a}}) + Q_{\mathbf{b}}(\lambda, \bar{\mathbf{b}}) + Q_{\mathbf{c}}(\lambda, \bar{\mathbf{c}})$

(initial probabilities, state transition probabilities, Gaussian density components, and mixture weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

$Q_{\mathbf{b}}(\lambda, \bar{\mathbf{b}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, \log \bar{b}_{jk}(\mathbf{o}_t)$

$Q_{\mathbf{c}}(\lambda, \bar{\mathbf{c}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, \log \bar{c}_{jk}$

where $P(s_t = j, m_t = k \mid \mathbf{O}, \lambda) = \gamma_t(j, k)$.

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

Let $\gamma_t(j,k) = P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)$ and

$\log b_{jk}(\mathbf{o}_t) = -\dfrac{L}{2} \log(2\pi) - \dfrac{1}{2} \log |\boldsymbol{\Sigma}_{jk}| - \dfrac{1}{2} (\mathbf{o}_t - \boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o}_t - \boldsymbol{\mu}_{jk})$

Setting the derivative of $Q_{\mathbf{b}}$ with respect to $\bar{\boldsymbol{\mu}}_{jk}$ to zero,

$\dfrac{\partial Q_{\mathbf{b}}}{\partial \bar{\boldsymbol{\mu}}_{jk}} = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk}) = 0 \;\Rightarrow\; \bar{\boldsymbol{\mu}}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$

(using $\dfrac{d\, \mathbf{x}^{T} \mathbf{C} \mathbf{x}}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^{T})\, \mathbf{x}$; $\boldsymbol{\Sigma}_{jk}$ is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Similarly, setting the derivative of $Q_{\mathbf{b}}$ with respect to $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$ to zero,

$\dfrac{\partial Q_{\mathbf{b}}}{\partial \bar{\boldsymbol{\Sigma}}_{jk}^{-1}} = \sum_{t=1}^{T} \gamma_t(j,k) \Big[ \tfrac{1}{2} \bar{\boldsymbol{\Sigma}}_{jk} - \tfrac{1}{2} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T} \Big] = 0$

$\Rightarrow\; \bar{\boldsymbol{\Sigma}}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$

(using $\dfrac{d\, \mathbf{a}^{T} \mathbf{X} \mathbf{b}}{d\mathbf{X}} = \mathbf{a}\mathbf{b}^{T}$ and $\dfrac{d\, \det \mathbf{X}}{d\mathbf{X}} = \det(\mathbf{X})\, (\mathbf{X}^{-1})^{T}$; $\boldsymbol{\Sigma}_{jk}$ is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

$\bar{\boldsymbol{\mu}}_{jk} = \dfrac{\sum_{t=1}^{T} p(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} p(s_t = j, m_t = k \mid \mathbf{O}, \lambda)}$

$\bar{\boldsymbol{\Sigma}}_{jk} = \dfrac{\sum_{t=1}^{T} p(s_t = j, m_t = k \mid \mathbf{O}, \lambda)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} p(s_t = j, m_t = k \mid \mathbf{O}, \lambda)}$

$\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{k'=1}^{M} \gamma_t(j, k')}$

Page 36: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 36

Basic Problem 2 of HMM (cont)

bull P(s3 = 3 O | )=3(3)3(3)

O1

State

O2 O3 OT

1 2 3 T-1 T time

OT-1

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s1

s3

s2

s3

s3

s2

s3

s3

s2

s3

s3

3(3)

s2

s3

s3

s2

s3

s1

3(3)

s2

s1

s3

a23=0

SP - Berlin Chen 37

Basic Problem 2 of HMM- The Viterbi Algorithm

bull The second optimal criterion The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm

ndash Instead of summing up probabilities from different paths coming to the same destination state the Viterbi algorithm picks and remembers the best path

bull Find a single optimal state sequence S=(s1s2helliphellip sT)

ndash How to find the second third etc optimal state sequences (difficult )

ndash The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm

bull State-time trellis diagram1 R Bellman rdquoOn the Theory of Dynamic Programmingrdquo Proceedings of the National Academy of Sciences 19522AJ Viterbi Error bounds for convolutional codes and an asymptotically optimum decoding algorithmrdquo

IEEE Transactions on Information Theory 13 (2) 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have when compared with discrete HMM training:

$$Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid \mathbf{O}, \lambda) \log \bar{b}_{jk}(\mathbf{o}_t)$$

$$Q_{c}(\lambda,\bar{c}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid \mathbf{O}, \lambda) \log \bar{c}_{jk}$$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let $\gamma_t(j,k) = P(s_t = j, k_t = k \mid \mathbf{O}, \lambda)$, so that

$$Q_{b}(\lambda,\bar{b}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} \gamma_t(j,k) \log \bar{b}_{jk}(\mathbf{o}_t)$$

$$\bar{b}_{jk}(\mathbf{o}_t) = \frac{1}{(2\pi)^{L/2} \left|\bar{\boldsymbol{\Sigma}}_{jk}\right|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})\right)$$

$$\log \bar{b}_{jk}(\mathbf{o}_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log\left|\bar{\boldsymbol{\Sigma}}_{jk}\right| - \frac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})$$

Setting the derivative with respect to the mean to zero (using $\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^{T}\mathbf{C}\mathbf{x} = (\mathbf{C}+\mathbf{C}^{T})\mathbf{x}$, and $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$ is symmetric here):

$$\frac{\partial Q_{b}(\lambda,\bar{b}_{jk})}{\partial \bar{\boldsymbol{\mu}}_{jk}} = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk}) = 0$$

$$\Rightarrow\; \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Setting the derivative with respect to the (inverse) covariance matrix to zero (using $\frac{\partial}{\partial \mathbf{X}} \log\det(\mathbf{X}) = (\mathbf{X}^{-1})^{T}$ and $\frac{\partial}{\partial \mathbf{X}} \mathbf{a}^{T}\mathbf{X}\mathbf{b} = \mathbf{a}\mathbf{b}^{T}$, and $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here):

$$\frac{\partial Q_{b}(\lambda,\bar{b}_{jk})}{\partial \bar{\boldsymbol{\Sigma}}_{jk}^{-1}} = \sum_{t=1}^{T} \gamma_t(j,k) \left[ \frac{1}{2}\bar{\boldsymbol{\Sigma}}_{jk} - \frac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T} \right] = 0$$

$$\Rightarrow\; \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

$$\bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$$

$$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k'=1}^{M} \gamma_t(j,k')}$$
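A minimal sketch of one mixture-component update implied by the formulas above, under two stated assumptions that are not in the slides: the covariances are taken to be diagonal, and the mixture occupancy gamma_jk[t] = P(s_t = j, k_t = k | O, λ) has already been computed. Array layouts and the function name are illustrative.

/* Sketch: M-step update of one Gaussian mixture component (j,k), assuming
 * diagonal covariances and precomputed gamma_jk[t] = P(s_t=j, k_t=k | O, lambda).
 * obs is T x D (row-major); D is the feature dimension. */
void update_mixture(int T, int D, const double *gamma_jk, const double *obs,
                    double *mean /* D */, double *var /* D */, double *occ_out)
{
    double occ = 0.0;
    for (int t = 0; t < T; t++) occ += gamma_jk[t];

    for (int d = 0; d < D; d++) {
        double num = 0.0;
        for (int t = 0; t < T; t++) num += gamma_jk[t] * obs[t * D + d];
        mean[d] = num / occ;                         /* weighted mean */
    }
    for (int d = 0; d < D; d++) {
        double num = 0.0;
        for (int t = 0; t < T; t++) {
            double diff = obs[t * D + d] - mean[d];
            num += gamma_jk[t] * diff * diff;
        }
        var[d] = num / occ;                          /* weighted (diagonal) variance */
    }
    *occ_out = occ;  /* caller divides by the sum over k to get the new weight c_jk */
}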


SP - Berlin Chen 37

Basic Problem 2 of HMM - The Viterbi Algorithm

• The second optimal criterion: the Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM, or as a modified forward algorithm
  - Instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path
• Find a single optimal state sequence S = (s1, s2, ..., sT)
  - How to find the second, third, etc. optimal state sequences? (difficult!)
  - The Viterbi algorithm can also be illustrated in a trellis framework similar to the one for the forward algorithm
• State-time trellis diagram

1. R. Bellman, "On the Theory of Dynamic Programming," Proceedings of the National Academy of Sciences, 1952
2. A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, 13 (2), 1967

SP - Berlin Chen 38

Basic Problem 2 of HMM - The Viterbi Algorithm (cont.)

• Algorithm
  - Find the best state sequence S* for a given observation sequence O = (o1, o2, ..., oT)
  - Define a new variable

$$\delta_t(i) = \max_{s_1, s_2, \ldots, s_{t-1}} P(s_1, s_2, \ldots, s_{t-1}, s_t = i, o_1, o_2, \ldots, o_t \mid \lambda)$$

    = the best score along a single path at time t, which accounts for the first t observations and ends in state i

  - By induction

$$\delta_{t+1}(j) = \left[ \max_{1 \leq i \leq N} \delta_t(i)\, a_{ij} \right] b_j(o_{t+1})$$

  - For backtracking

$$\psi_{t+1}(j) = \arg\max_{1 \leq i \leq N} \delta_t(i)\, a_{ij}$$

  - We can backtrace from

$$s_T^{*} = \arg\max_{1 \leq i \leq N} \delta_T(i)$$

  - Complexity: O(N^2 T)

SP - Berlin Chen 39

Basic Problem 2 of HMM - The Viterbi Algorithm (cont.)

[Figure: state-time trellis diagram for the Viterbi algorithm, with states s1, s2, s3 on the vertical axis against observations o1, o2, o3, ..., oT-1, oT at times 1, 2, 3, ..., T-1, T; the marked node corresponds to δ3(3)]

SP - Berlin Chen 40

Basic Problem 2 of HMM - The Viterbi Algorithm (cont.)

• A three-state Hidden Markov Model for the Dow Jones Industrial average

[Figure: Viterbi trellis for the three-state Dow Jones HMM; one step of the recursion gives, for example, (0.6 x 0.35) x 0.7 = 0.147]

SP - Berlin Chen 41

Basic Problem 2 of HMM - The Viterbi Algorithm (cont.)

• Algorithm in the logarithmic form
  - Find the best state sequence S* for a given observation sequence O = (o1, o2, ..., oT)
  - Define a new variable

$$\delta_t(i) = \max_{s_1, s_2, \ldots, s_{t-1}} \log P(s_1, s_2, \ldots, s_{t-1}, s_t = i, o_1, o_2, \ldots, o_t \mid \lambda)$$

    = the best score along a single path at time t, which accounts for the first t observations and ends in state i

  - By induction

$$\delta_{t+1}(j) = \max_{1 \leq i \leq N} \left[ \delta_t(i) + \log a_{ij} \right] + \log b_j(o_{t+1})$$

  - For backtracking

$$\psi_{t+1}(j) = \arg\max_{1 \leq i \leq N} \left[ \delta_t(i) + \log a_{ij} \right]$$

  - We can backtrace from

$$s_T^{*} = \arg\max_{1 \leq i \leq N} \delta_T(i)$$
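A minimal sketch of the log-domain recursion above in C, assuming the model has already been converted to log-probabilities; the array layout (logA[i*N+j], logB[j*T+t] holding log b_j(o_t) evaluated per frame, logPi[i]) and the function name are illustrative assumptions.

#include <float.h>

/* Sketch: log-domain Viterbi decoding. Returns the best log-score and
 * writes the best state sequence into path[0..T-1]. */
double viterbi(int N, int T, const double *logPi, const double *logA,
               const double *logB, int *path)
{
    double delta[N], delta_new[N];   /* C99 variable-length arrays */
    int psi[T][N];                   /* backpointers */

    for (int i = 0; i < N; i++)
        delta[i] = logPi[i] + logB[i * T + 0];

    for (int t = 1; t < T; t++) {
        for (int j = 0; j < N; j++) {
            double best = -DBL_MAX; int arg = 0;
            for (int i = 0; i < N; i++) {
                double s = delta[i] + logA[i * N + j];
                if (s > best) { best = s; arg = i; }
            }
            delta_new[j] = best + logB[j * T + t];
            psi[t][j] = arg;
        }
        for (int j = 0; j < N; j++) delta[j] = delta_new[j];
    }

    double best = -DBL_MAX;
    for (int i = 0; i < N; i++)
        if (delta[i] > best) { best = delta[i]; path[T - 1] = i; }
    for (int t = T - 1; t > 0; t--)  /* backtrace */
        path[t - 1] = psi[t][path[t]];
    return best;
}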

SP - Berlin Chen 42

Homework 1

• A three-state Hidden Markov Model for the Dow Jones Industrial average

  - Find the probability P(up, up, unchanged, down, unchanged, down, up | λ)
  - Find the optimal state sequence of the model which generates the observation sequence (up, up, unchanged, down, unchanged, down, up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

• In the Forward-Backward algorithm, operations are usually implemented in the logarithmic domain
• Assume that we want to add P1 and P2

$$\text{if } P_1 \geq P_2: \quad \log_b(P_1 + P_2) = \log_b P_1 + \log_b\!\left(1 + b^{\,\log_b P_2 - \log_b P_1}\right)$$

$$\text{else}: \quad \log_b(P_1 + P_2) = \log_b P_2 + \log_b\!\left(1 + b^{\,\log_b P_1 - \log_b P_2}\right)$$

The values of $\log_b(1 + b^{x})$ can be saved in a table to speed up the operations.

[Figure: P1 and P2 are added in the linear domain, while log P1 and log P2 are combined directly into log(P1 + P2) in the log domain]

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont.)

• An example code:

#include <math.h>

#define LZERO  (-1.0E10)            /* ~log(0) */
#define LSMALL (-0.5E10)            /* log values < LSMALL are set to LZERO */
#define minLogExp (-log(-LZERO))    /* ~= -23 */

double LogAdd(double x, double y)
{
    double temp, diff, z;
    if (x < y) {                    /* make sure x holds the larger value */
        temp = x; x = y; y = temp;
    }
    diff = y - x;                   /* notice that diff <= 0 */
    if (diff < minLogExp)           /* if y is far smaller than x */
        return (x < LSMALL) ? LZERO : x;
    else {
        z = exp(diff);
        return x + log(1.0 + z);
    }
}
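A short usage sketch (my addition, not from the slides), assuming the LogAdd definition above is in scope: adding two log-domain probabilities should match adding them in the linear domain.

#include <stdio.h>

int main(void)
{
    /* combining log(0.004) and log(0.001) should give log(0.005) */
    double lp = LogAdd(log(0.004), log(0.001));
    printf("%f %f\n", lp, log(0.005));   /* both print about -5.2983 */
    return 0;
}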

SP - Berlin Chen 45

Basic Problem 3 of HMM: Intuitive View

• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O1, ..., OL | λ) or log P(O1, ..., OL | λ)?
  - Belonging to a typical problem of "inferential statistics"
  - The most difficult of the three problems, because there is no known analytical method that maximizes the joint probability of the training data in a closed form
  - The data is incomplete because of the hidden state sequences
  - Well solved by the Baum-Welch (known as forward-backward) algorithm and the EM (Expectation-Maximization) algorithm
    • Iterative update and improvement
    • Based on the Maximum Likelihood (ML) criterion

  - Suppose that we have L training utterances for the HMM, and S is a possible state sequence of the HMM

$$\log P(\mathbf{O}^{1}, \mathbf{O}^{2}, \ldots, \mathbf{O}^{L} \mid \lambda) = \sum_{l=1}^{L} \log P(\mathbf{O}^{l} \mid \lambda) = \sum_{l=1}^{L} \log \sum_{\text{all } S} P(\mathbf{O}^{l}, S \mid \lambda)$$

The "log of sum" form is difficult to deal with.

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation: A Schematic Depiction (1/2)

• Hard Assignment
  - Given the data, follow a multinomial distribution

  State S1:
  P(B | S1) = 2/4 = 0.5
  P(W | S1) = 2/4 = 0.5

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation: A Schematic Depiction (2/2)

• Soft Assignment
  - Given the data, follow a multinomial distribution
  - Maximize the likelihood of the data given the alignment
  - Each sample is assigned to states S1 and S2 with posterior weights P(s_t = S1 | O) and P(s_t = S2 | O), where P(s_t = S1 | O) + P(s_t = S2 | O) = 1; here the four samples carry the weight pairs (0.7, 0.3), (0.4, 0.6), (0.9, 0.1), (0.5, 0.5)

  P(B | S1) = (0.7 + 0.9) / (0.7 + 0.4 + 0.9 + 0.5) = 1.6/2.5 = 0.64
  P(W | S1) = (0.4 + 0.5) / (0.7 + 0.4 + 0.9 + 0.5) = 0.9/2.5 = 0.36
  P(B | S2) = (0.3 + 0.1) / (0.3 + 0.6 + 0.1 + 0.5) = 0.4/1.5 = 0.27
  P(W | S2) = (0.6 + 0.5) / (0.3 + 0.6 + 0.1 + 0.5) = 1.1/1.5 = 0.73

SP - Berlin Chen 48

Basic Problem 3 of HMM: Intuitive View (cont.)

• Relationship between the forward and backward variables

$$\alpha_t(i) = P(o_1 o_2 \cdots o_t, s_t = i \mid \lambda) \,, \qquad \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$$

$$\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid s_t = i, \lambda) \,, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

$$\alpha_t(i)\, \beta_t(i) = P(\mathbf{O}, s_t = i \mid \lambda) \,, \qquad \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) = P(\mathbf{O} \mid \lambda)$$
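A minimal sketch of the two recursions above for a discrete HMM, in the linear domain (a practical version would either rescale per frame or use LogAdd as shown earlier); the array layout (A[i*N+j] = a_ij, B[i*M+k] = b_i(v_k), obs[t] = codeword index of o_t) is an illustrative assumption.

/* Sketch: forward and backward recursions, filling alpha (T x N) and beta (T x N). */
void forward_backward(int N, int M, int T, const double *pi, const double *A,
                      const double *B, const int *obs, double *alpha, double *beta)
{
    for (int i = 0; i < N; i++)                      /* forward initialization */
        alpha[0 * N + i] = pi[i] * B[i * M + obs[0]];

    for (int t = 0; t < T - 1; t++)                  /* forward induction */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int i = 0; i < N; i++) sum += alpha[t * N + i] * A[i * N + j];
            alpha[(t + 1) * N + j] = sum * B[j * M + obs[t + 1]];
        }

    for (int i = 0; i < N; i++)                      /* backward initialization */
        beta[(T - 1) * N + i] = 1.0;

    for (int t = T - 2; t >= 0; t--)                 /* backward induction */
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += A[i * N + j] * B[j * M + obs[t + 1]] * beta[(t + 1) * N + j];
            beta[t * N + i] = sum;
        }
}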

SP - Berlin Chen 49

Basic Problem 3 of HMM: Intuitive View (cont.)

• Define a new variable ξ_t(i, j)
  - Probability of being at state i at time t and at state j at time t+1

$$\xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid \mathbf{O}, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(\mathbf{O} \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_t(m)\, a_{mn}\, b_n(o_{t+1})\, \beta_{t+1}(n)}$$

  (using p(A, B) = P(B) p(A | B), i.e. p(A | B) = p(A, B) / P(B))

• Recall the posterior probability variable

$$\gamma_t(i) = P(s_t = i \mid \mathbf{O}, \lambda) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{m=1}^{N} \alpha_t(m)\, \beta_t(m)}$$

  Note that γ_t(i) can also be represented as

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j) \quad \text{for } 1 \leq t \leq T-1$$

[Figure: transition from state i at time t to state j at time t+1]
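A minimal sketch of computing γ_t(i) and ξ_t(i,j) from the forward-backward variables, following the definitions above; arrays are laid out as in the forward_backward() sketch earlier, which is an illustrative assumption rather than part of the slides.

/* Sketch: posterior statistics gamma (T x N) and xi ((T-1) x N x N). */
void posteriors(int N, int M, int T, const double *A, const double *B,
                const int *obs, const double *alpha, const double *beta,
                double *gamma, double *xi)
{
    for (int t = 0; t < T; t++) {
        double norm = 0.0;
        for (int m = 0; m < N; m++) norm += alpha[t * N + m] * beta[t * N + m];
        for (int i = 0; i < N; i++)
            gamma[t * N + i] = alpha[t * N + i] * beta[t * N + i] / norm;  /* P(s_t=i | O) */
    }
    for (int t = 0; t < T - 1; t++) {
        double norm = 0.0;
        for (int m = 0; m < N; m++)
            for (int n = 0; n < N; n++)
                norm += alpha[t * N + m] * A[m * N + n]
                      * B[n * M + obs[t + 1]] * beta[(t + 1) * N + n];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                xi[t * N * N + i * N + j] =
                    alpha[t * N + i] * A[i * N + j]
                  * B[j * M + obs[t + 1]] * beta[(t + 1) * N + j] / norm;  /* P(s_t=i, s_{t+1}=j | O) */
    }
}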

SP - Berlin Chen 50

Basic Problem 3 of HMM: Intuitive View (cont.)

$$P(s_3 = 3, s_4 = 1, \mathbf{O} \mid \lambda) = \alpha_3(3)\, a_{31}\, b_1(o_4)\, \beta_4(1)$$

[Figure: state-time trellis (states s1, s2, s3 against observations o1, o2, o3, ..., oT-1, oT) illustrating the product above between times t = 3 and t = 4]

SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont.)

$$\sum_{t=1}^{T-1} \xi_t(i,j) = \text{expected number of transitions from state } i \text{ to state } j \text{ in } \mathbf{O}$$

$$\sum_{t=1}^{T-1} \gamma_t(i) = \sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j) = \text{expected number of transitions from state } i \text{ in } \mathbf{O}$$

• A set of reasonable re-estimation formulae for π and A is

$$\bar{\pi}_i = \gamma_1(i) = \text{expected frequency (number of times) in state } i \text{ at time } t = 1$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

(Formulae for a single training utterance)

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulae for B is
  - For discrete and finite observations, b_j(v_k) = P(o_t = v_k | s_t = j):

$$\bar{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} = \frac{\sum_{t=1,\; o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

  - For continuous and infinite observations, b_j(v) = f_{O|S}(o_t = v | s_t = j), modeled as a mixture of multivariate Gaussian distributions:

$$b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{v}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{L/2} \left|\boldsymbol{\Sigma}_{jk}\right|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{v}-\boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{v}-\boldsymbol{\mu}_{jk})\right), \qquad \sum_{k=1}^{M} c_{jk} = 1$$

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)
    • Define a new variable γ_t(j, k)
      - γ_t(j, k) is the probability of being in state j at time t, with the k-th mixture component accounting for o_t

$$\gamma_t(j,k) = P(s_t = j, m_t = k \mid \mathbf{O}, \lambda) = P(s_t = j \mid \mathbf{O}, \lambda)\, P(m_t = k \mid s_t = j, \mathbf{o}_t, \lambda) = \left[ \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{m=1}^{N} \alpha_t(m)\, \beta_t(m)} \right] \left[ \frac{c_{jk}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})} \right]$$

      (the observation-independence assumption is applied; again using p(A, B) = P(B) p(A | B))

    Note:

$$\sum_{m=1}^{M} \gamma_t(j,m) = \gamma_t(j)$$

[Figure: mixture distribution for state 1, with weights c11, c12, c13 over Gaussian components N_1, N_2, N_3]

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)

$$\bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)} = \text{weighted average (mean) of the observations at state } j \text{ and mixture } k$$

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)} = \text{weighted covariance of the observations at state } j \text{ and mixture } k$$

$$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)} = \frac{\text{expected number of times in state } j \text{ and mixture } k}{\text{expected number of times in state } j}$$

(Formulae for a single training utterance)

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

[Figure: several training utterances of the same word (e.g. "台師大") are each aligned against the three-state left-to-right model (s1, s2, s3), and the forward-backward (F-B) statistics are accumulated over all of them]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For continuous and infinite observations (cont.)
    Formulae for multiple (L) training utterances, where the superscript l indexes the utterance of length T_l:

$$\bar{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^{l}(i) = \text{expected frequency (number of times) in state } i \text{ at time } t = 1$$

$$\bar{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^{l}(i,j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^{l}(i)} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

$$\bar{c}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{l}(j,k)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \gamma_t^{l}(j,m)} = \frac{\text{expected number of times in state } j \text{ and mixture } k}{\text{expected number of times in state } j}$$

$$\bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{l}(j,k)\, \mathbf{o}_t^{l}}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{l}(j,k)} = \text{weighted average (mean) of the observations at state } j \text{ and mixture } k$$

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{l}(j,k)\,(\mathbf{o}_t^{l}-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t^{l}-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{l}(j,k)} = \text{weighted covariance of the observations at state } j \text{ and mixture } k$$

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  - For discrete and finite observations (cont.)
    Formulae for multiple (L) training utterances:

$$\bar{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^{l}(i) = \text{expected frequency (number of times) in state } i \text{ at time } t = 1$$

$$\bar{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^{l}(i,j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^{l}(i)} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

$$\bar{b}_j(v_k) = \frac{\sum_{l=1}^{L} \sum_{t=1,\; o_t^{l} = v_k}^{T_l} \gamma_t^{l}(j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{l}(j)} = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$$

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  - The semicontinuous or tied-mixture HMM

$$b_j(\mathbf{o}) = \sum_{k=1}^{M} b_j(k)\, f(\mathbf{o} \mid v_k) = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{o}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

    (state output probability of state j; b_j(k) is the k-th mixture weight of state j, discrete and model-dependent; f(o | v_k) is the k-th mixture density function, or k-th codeword, shared across HMMs, and M is very large)

  - A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
  - Because M is large, we can simply use the L most significant values of f(o | v_k)
    • Experience showed that an L of 1~3% of M is adequate
  - Partial tying of f(o | v_k) for different phonetic classes
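A minimal sketch of the top-L evaluation idea just described, under assumed data structures that are not in the slides: the shared codebook densities f(o_t | v_k) are evaluated once per frame, the indices of the L largest are kept, and each state's output probability sums only over those codewords.

/* Sketch: semicontinuous state output probability using only the topL codewords.
 * dens[k]      = f(o_t | v_k) for the shared codebook (computed once per frame)
 * weights_j[k] = b_j(k), the discrete mixture weights of state j
 * topIdx[r]    = indices of the topL largest dens[] values (precomputed per frame) */
double semicont_output_prob(int topL, const double *dens,
                            const double *weights_j, const int *topIdx)
{
    double p = 0.0;
    for (int r = 0; r < topL; r++) {
        int k = topIdx[r];
        p += weights_j[k] * dens[k];   /* b_j(k) * f(o_t | v_k) */
    }
    return p;
}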

SP - Berlin Chen 59

Semicontinuous HMMs (cont.)

[Figure: two HMMs with states s1, s2, s3; each state j keeps its own discrete weights b_j(1), ..., b_j(k), ..., b_j(M), while all states of all models share the same codebook of Gaussian kernels N(μ1, Σ1), N(μ2, Σ2), ..., N(μk, Σk), ..., N(μM, ΣM)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  - Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  - A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  - It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means Segmentation into States
  - Assume that we have a training set of observations and an initial estimate of all model parameters
  - Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  - Step 2:
    • For a discrete density HMM (using an M-codeword codebook):
      b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For a continuous density HMM (M Gaussian mixtures per state):
      cluster the observation vectors within each state into a set of M clusters;
      w_jm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j;
      μ_jm = sample mean of the vectors classified in cluster m of state j;
      Σ_jm = sample covariance matrix of the vectors classified in cluster m of state j
  - Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated

SP - Berlin Chen 62

Initialization of HMM (cont.)

[Flowchart: Training Data and an Initial Model feed a loop of State Sequence Segmentation, then Estimation of the Observation Parameters via Segmental K-means, then Model Re-estimation; if the model has not converged (NO) the loop repeats, otherwise (YES) the Model Parameters are output]

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for a discrete HMM
  - 3 states and 2 codewords
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3

[Figure: state-time trellis for a 10-frame observation sequence O1...O10 segmented into the three states s1, s2, s3, with each frame labeled by codeword v1 or v2]

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for a continuous HMM
  - 3 states and 4 Gaussian mixtures per state

[Figure: state-time trellis for an N-frame observation sequence O1...ON segmented into the three states s1, s2, s3; within each state, K-means splits the assigned vectors, starting from the global mean into cluster means (cluster 1 mean, cluster 2 mean, ...), which initialize the mixture means, e.g. μ11, μ12, μ13, μ14 for state s1]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  - The state duration follows an exponential (geometric) distribution

$$d_i(t) = a_{ii}^{\,t-1}\,(1 - a_{ii})$$

    • This does not provide an adequate representation of the temporal structure of speech; for example, with a_ii = 0.8 the single most likely duration is one frame and the mean duration is only 1/(1 - a_ii) = 5 frames
  - First-order (Markov) assumption: the state transition depends only on the origin and destination
  - Output-independence assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations, although these solutions have not significantly improved speech recognition accuracy for practical applications.

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling

[Figure: candidate state-duration models compared against the HMM's implicit geometric/exponential distribution: an empirical distribution, a Gaussian distribution, and a Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

[Figure: likelihood surface over the model configuration space; the current model configuration sits at a local maximum rather than the global one]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: two three-state ergodic HMMs to be trained; in each, every state has transition probabilities of about 0.33~0.34 to itself and to the other two states, and the state output distributions over the symbols {A, B, C} are (A: 0.34, B: 0.33, C: 0.33), (A: 0.33, B: 0.34, C: 0.33) and (A: 0.33, B: 0.33, C: 0.34)]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: block diagram of isolated word recognition. The speech signal goes through Feature Extraction to produce the feature sequence X; X is scored against each word model M1, M2, ..., MV and a silence model MSil to obtain the likelihoods p(X | M1), p(X | M2), ..., p(X | MV), p(X | MSil); a Most Likely Word Selector outputs the recognized label]

$$\text{Label}(\mathbf{X}) = \arg\max_{k}\, p(\mathbf{X} \mid M_k)$$

Viterbi approximation:

$$\text{Label}(\mathbf{X}) = \arg\max_{k}\, \max_{\mathbf{S}}\, p(\mathbf{X}, \mathbf{S} \mid M_k)$$

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  - Substitution
    • An incorrect word was substituted for the correct word
  - Deletion
    • A correct word was omitted in the recognized sentence
  - Insertion
    • An extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  - A maximum substring matching problem
  - Can be handled by dynamic programming
• Example

  Correct: "the effect is clear"
  Recognized: "effect is not clear"
  ("the" is deleted, "not" is inserted, and "effect", "is", "clear" are matched)

  - Error analysis: one deletion and one insertion
  - Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

  Word Error Rate = 100% x (Sub + Del + Ins) / (No. of words in the correct sentence) = 2/4 = 50%
  Word Correction Rate = 100% x (Matched words) / (No. of words in the correct sentence) = 3/4 = 75%
  Word Accuracy Rate = 100% x (Matched words - Ins) / (No. of words in the correct sentence) = (3 - 1)/4 = 50%

  Note: WER + WAR = 100%; the WER might be higher than 100%, and the WAR might be negative.

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (Textbook)

[Figure: alignment grid with the reference word index i on one axis and the test word index j on the other; n denotes the word length of the correct/reference sentence and m the word length of the recognized/test sentence, each cell holds the minimum word-error alignment score at grid point [i, j], and the different kinds of alignment moves (insertion, deletion, substitution, hit) are marked]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)

Step 1: Initialization
  G[0][0] = 0
  for i = 1 ... n (test):      G[i][0] = G[i-1][0] + 1, B[i][0] = 1 (Insertion, horizontal direction)
  for j = 1 ... m (reference): G[0][j] = G[0][j-1] + 1, B[0][j] = 2 (Deletion, vertical direction)

Step 2: Iteration
  for i = 1 ... n (test), for j = 1 ... m (reference):
    G[i][j] = min { G[i-1][j] + 1    (Insertion, horizontal direction),
                    G[i][j-1] + 1    (Deletion, vertical direction),
                    G[i-1][j-1] + 1  (Substitution, diagonal direction, if LT[i] != LR[j]),
                    G[i-1][j-1]      (Match, diagonal direction, if LT[i] == LR[j]) }
    B[i][j] = 1 (Insertion), 2 (Deletion), 3 (Substitution) or 4 (Match), according to which case was chosen

Step 3: Backtrace and Measure
  Word Error Rate = 100% x G[n][m] / m
  Word Accuracy Rate = 100% - Word Error Rate
  Optimal backtrace path: B[n][m] -> ... -> B[0][0]
    if B[i][j] = 1, print LT[i] (Insertion), then go left
    else if B[i][j] = 2, print LR[j] (Deletion), then go down
    else print LR[j] (Hit/Match or Substitution), then go down diagonally

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm
  - Initialization (HTK-style code)

[Figure: alignment grid with the correct/reference word sequence (1 ... j ... m) on one axis and the recognized/test word sequence (1 ... i ... n) on the other; the first row is filled by insertions, the first column by deletions, and an interior cell (i, j) is reached from (i-1, j), (i, j-1) or (i-1, j-1)]

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;

for (i = 1; i <= n; i++) {          /* test */
    grid[i][0] = grid[i-1][0];
    grid[i][0].dir = HOR;
    grid[i][0].score += InsPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {          /* reference */
    grid[0][j] = grid[0][j-1];
    grid[0][j].dir = VERT;
    grid[0][j].score += DelPen;
    grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (main loop)

for (i = 1; i <= n; i++) {                 /* test */
    gridi = grid[i]; gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {             /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i])
            d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {            /* DIAG = hit or sub */
            gridi[j] = gridi1[j-1];
            gridi[j].score = d;
            gridi[j].dir = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
        } else if (h < v) {                /* HOR = ins */
            gridi[j] = gridi1[j];
            gridi[j].score = h;
            gridi[j].dir = HOR;
            ++gridi[j].ins;
        } else {                           /* VERT = del */
            gridi[j] = gridi[j-1];
            gridi[j].score = v;
            gridi[j].dir = VERT;
            ++gridi[j].del;
        }
    }  /* for j */
}  /* for i */

• Example 1
  Correct: A C B C C
  Test:    B A B C

[Figure: 5 x 4 alignment grid filled with (Ins, Del, Sub, Hit) counts, HTK-style]

  Alignment 1: Ins B / Hit A / Del C / Hit B / Hit C / Del C, WER = (1 + 2 + 0)/5 = 60% (there is still another optimal alignment)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
  Correct: A C B C C
  Test:    B A A C

[Figure: 5 x 4 alignment grid filled with (Ins, Del, Sub, Hit) counts; three different backtraces reach the same minimum cost]

  Alignment 1: Ins B / Hit A / Del C / Sub B / Hit C / Del C, WER = 80%
  Alignment 2: Ins B / Hit A / Sub C / Del B / Hit C / Del C, WER = 80%
  Alignment 3: Sub A / Sub C / Sub B / Hit C / Del C, WER = 80%

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ...
  ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ...

  (in the files, each character is preceded by two numeric fields, e.g. "100000 100000", which can be ignored for error-rate scoring)

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  - Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
  - The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two-bottle (A, B) ball-drawing experiment, and a three-state HMM whose output distributions over {A, B, C} are refined, illustrating that each EM iteration moves from λ to a new λ̄ with p(O | λ̄) > p(O | λ)]

Observed data O: the "ball sequence" o1, o2, ..., oT
Latent data S: the "bottle sequence"
Parameters λ to be estimated so as to maximize log P(O | λ): P(A), P(B), P(B | A), P(A | B), P(R | A), P(G | A), P(R | B), P(G | B)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation-Maximization)
  - Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence S is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate (A, B, π) without consideration of the state sequence.
  - Two major steps:
    • E: the expectation E[ · | O, λ] is taken with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations
    • M: provides a new estimation of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = x1, x2, ..., xn (of the random variables X1, X2, ..., Xn)
  - The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X | Φ) is maximum. For example, if Φ = {μ, Σ} holds the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimate of Φ = {μ, Σ} is

$$\boldsymbol{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \,, \qquad \boldsymbol{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}_{ML})(\mathbf{x}_i - \boldsymbol{\mu}_{ML})^{T}$$

  - The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the likelihood p(Φ | X) is maximum
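A minimal sketch of the ML estimate above for i.i.d. vector samples; the flattened array layout (x[i*D + d] for sample i, dimension d) and the function name are illustrative assumptions.

/* Sketch: ML estimate of a multivariate Gaussian from n i.i.d. D-dimensional samples. */
void ml_gaussian(int n, int D, const double *x,
                 double *mu /* D */, double *Sigma /* D x D */)
{
    for (int d = 0; d < D; d++) {
        mu[d] = 0.0;
        for (int i = 0; i < n; i++) mu[d] += x[i * D + d];
        mu[d] /= n;                                   /* sample mean */
    }
    for (int a = 0; a < D; a++)
        for (int b = 0; b < D; b++) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += (x[i * D + a] - mu[a]) * (x[i * D + b] - mu[b]);
            Sigma[a * D + b] = s / n;                 /* ML covariance (divides by n) */
        }
}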

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  - Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O | λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S | λ)
• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
  - The observable training data O
    • We want to maximize P(O | λ); λ is a parameter vector
  - The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  - Assume we have λ and estimate the probability that each S occurred in the generation of O
  - Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S | λ), to compute a new λ̄, the maximum likelihood estimate of λ
  - Does the process converge?
  - Algorithm
    • Log-likelihood expression, with the expectation taken over S

$$P(\mathbf{O}, S \mid \lambda) = P(S \mid \mathbf{O}, \lambda)\, P(\mathbf{O} \mid \lambda) \quad \text{(Bayes' rule)}$$

$$\log P(\mathbf{O} \mid \lambda) = \log P(\mathbf{O}, S \mid \lambda) - \log P(S \mid \mathbf{O}, \lambda)$$

Taking the expectation over S under the (unknown) model setting λ:

$$\log P(\mathbf{O} \mid \lambda) = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, S \mid \lambda) - \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(S \mid \mathbf{O}, \lambda)$$

(incomplete-data likelihood on the left-hand side; complete-data likelihood inside the first sum)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  - Algorithm (cont.)
    • We can thus express log P(O | λ̄) as follows

$$\log P(\mathbf{O} \mid \bar{\lambda}) = Q(\lambda, \bar{\lambda}) + H(\lambda, \bar{\lambda})$$

where

$$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, S \mid \bar{\lambda})$$

$$H(\lambda, \bar{\lambda}) = -\sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(S \mid \mathbf{O}, \bar{\lambda})$$

    • We want $\log P(\mathbf{O} \mid \bar{\lambda}) \geq \log P(\mathbf{O} \mid \lambda)$:

$$\log P(\mathbf{O} \mid \bar{\lambda}) - \log P(\mathbf{O} \mid \lambda) = \left[ Q(\lambda, \bar{\lambda}) - Q(\lambda, \lambda) \right] + \left[ H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) \right]$$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda, \bar{\lambda})$ has the following property: $H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) \geq 0$

$$H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) = -\sum_{S} P(S \mid \mathbf{O}, \lambda) \log \frac{P(S \mid \mathbf{O}, \bar{\lambda})}{P(S \mid \mathbf{O}, \lambda)} \geq -\sum_{S} P(S \mid \mathbf{O}, \lambda) \left( \frac{P(S \mid \mathbf{O}, \bar{\lambda})}{P(S \mid \mathbf{O}, \lambda)} - 1 \right) = 0$$

(using $\log x \leq x - 1$, i.e. Jensen's inequality; the left-hand side is the Kullback-Leibler (KL) distance between the two posteriors)

  - Therefore, for maximizing $\log P(\mathbf{O} \mid \bar{\lambda})$, we only need to maximize the Q-function (auxiliary function)

$$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, S \mid \bar{\lambda})$$

(the expectation of the complete-data log likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 38: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 38

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm

ndash Complexity O(N2T)

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor maxarg

maxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

max variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two coupled illustrations of EM.
(1) Two-bottle example, bottles A and B: the observed data O is the ball sequence o1 o2 …… oT, the latent data S is the bottle sequence; the parameters to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).
(2) A three-state HMM λ (states s1, s2, s3) with transition probabilities (0.7, 0.6, 0.3, 0.3, 0.2, 0.2, 0.1, …) and per-state symbol distributions {A: .3, B: .2, C: .5}, {A: .7, B: .1, C: .2}, {A: .3, B: .6, C: .1}; re-estimation produces a new model λ̄ with p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
 – Why EM?
  • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence S is the latent data.
  • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B, π} without consideration of the state sequence.
 – Two Major Steps:
  • E: expectation with respect to the latent data S, using the current estimate of the parameters λ and conditioned on the observations O, i.e. E_S[ · | O, λ]
  • M: provides a new estimation of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = {X1, X2, …, Xn} → x = {x1, x2, …, xn}:
 – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum. For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are
   μ_ML = (1/n) Σ_{i=1..n} x_i
   Σ_ML = (1/n) Σ_{i=1..n} (x_i − μ_ML)(x_i − μ_ML)^T
 – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior probability p(Φ|x) is maximum
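As a quick illustration (a self-contained C sketch; the toy data and sizes are invented, not from the lecture), the ML estimates above amount to a sample mean and a sample covariance divided by n:

#include <stdio.h>

#define N 4   /* number of i.i.d. training vectors (toy data) */
#define D 2   /* feature dimension */

int main(void)
{
    double x[N][D] = { {1.0, 2.0}, {2.0, 0.0}, {0.0, 1.0}, {3.0, 1.0} };
    double mu[D] = {0.0, 0.0};
    double sigma[D][D] = { {0.0, 0.0}, {0.0, 0.0} };
    int i, r, c;

    /* mu_ML = (1/n) * sum_i x_i */
    for (i = 0; i < N; i++)
        for (r = 0; r < D; r++)
            mu[r] += x[i][r] / N;

    /* Sigma_ML = (1/n) * sum_i (x_i - mu_ML)(x_i - mu_ML)^T */
    for (i = 0; i < N; i++)
        for (r = 0; r < D; r++)
            for (c = 0; c < D; c++)
                sigma[r][c] += (x[i][r] - mu[r]) * (x[i][c] - mu[c]) / N;

    printf("mu    = [%.3f %.3f]\n", mu[0], mu[1]);
    printf("Sigma = [%.3f %.3f; %.3f %.3f]\n",
           sigma[0][0], sigma[0][1], sigma[1][0], sigma[1][1]);
    return 0;
}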

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
 – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)
• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
 – The observable training data O
  • We want to maximize P(O|λ); λ is a parameter vector
 – The hidden (unobservable) data S
  • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

 – Assume we have λ and estimate the probability that each S occurred in the generation of O
 – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S|λ), to compute a new λ̄, the maximum likelihood estimate of the parameters
 – Does the process converge?
 – Algorithm
  • Log-likelihood expression and expectation taken over S:
    Bayes' rule: P(O, S|λ̄) = P(S|O, λ̄) P(O|λ̄)   (complete-data likelihood; λ̄ is the unknown model setting)
    ⇒ log P(O|λ̄) = log P(O, S|λ̄) − log P(S|O, λ̄)   (incomplete-data likelihood on the left)
    Take the expectation over S, weighting by P(S|O, λ) under the current model λ:
    log P(O|λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄) − Σ_S P(S|O, λ) log P(S|O, λ̄)

SP - Berlin Chen 87

The EM Algorithm (6/7)

 – Algorithm (Cont.)
  • We can thus express log P(O|λ̄) as follows:
    log P(O|λ̄) = Q(λ, λ̄) − H(λ, λ̄)
    where
    Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
    H(λ, λ̄) = Σ_S P(S|O, λ) log P(S|O, λ̄)
  • We want log P(O|λ̄) ≥ log P(O|λ), i.e.
    Q(λ, λ̄) − H(λ, λ̄) ≥ Q(λ, λ) − H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property:
  H(λ, λ̄) − H(λ, λ)
   = Σ_S P(S|O, λ) log [ P(S|O, λ̄) / P(S|O, λ) ]
   ≤ Σ_S P(S|O, λ) [ P(S|O, λ̄) / P(S|O, λ) − 1 ]        (Jensen's inequality: log x ≤ x − 1)
   = Σ_S P(S|O, λ̄) − Σ_S P(S|O, λ) = 0
  (equivalently, the difference is the negative of a Kullback-Leibler (KL) distance, which is never negative)
 – Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function (auxiliary function)
   Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
   i.e. the expectation of the complete-data log likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
 – By maximizing the auxiliary function
   Q(λ, λ̄) = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ̄)
 – where P(O, S|λ) and log P(O, S|λ̄) can be expressed as
   P(O, S|λ) = π_{s1} b_{s1}(o1) ∏_{t=2..T} a_{s_{t−1} s_t} b_{s_t}(o_t)
   log P(O, S|λ̄) = log π̄_{s1} + Σ_{t=2..T} log ā_{s_{t−1} s_t} + Σ_{t=1..T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as
  Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄)
  where
  Q_π(λ, π̄) = Σ_{i=1..N} [ P(O, s1 = i|λ) / P(O|λ) ] log π̄_i
  Q_a(λ, ā) = Σ_{i=1..N} Σ_{j=1..N} Σ_{t=1..T−1} [ P(O, s_t = i, s_{t+1} = j|λ) / P(O|λ) ] log ā_{ij}
  Q_b(λ, b̄) = Σ_{j=1..N} Σ_{k=1..M} Σ_{t: o_t = v_k} [ P(O, s_t = j|λ) / P(O|λ) ] log b̄_j(k)
  (each term has the form Σ_i w_i log y_i)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_ij and b̄_j(k)
 – They can be maximized individually
 – All are of the same form:
   F(y_1, y_2, …, y_N) = Σ_{j=1..N} w_j log y_j ,  where Σ_{j=1..N} y_j = 1 and y_j ≥ 0
   F has its maximum value when
   y_j = w_j / Σ_{j'=1..N} w_{j'}
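For a quick numeric check (using natural logs): with N = 3 and weights w = (2, 1, 1), the constrained maximum is attained at y = (1/2, 1/4, 1/4), giving F = 2 log(1/2) + log(1/4) + log(1/4) ≈ −4.16, whereas the uniform point y = (1/3, 1/3, 1/3) gives 4 log(1/3) ≈ −4.39.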

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
  By applying a Lagrange multiplier ℓ with the constraint Σ_{j=1..N} y_j = 1:
  F = Σ_{j=1..N} w_j log y_j + ℓ ( Σ_{j=1..N} y_j − 1 )
  ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = −ℓ y_j , for all j
  ⇒ Σ_{j=1..N} w_j = −ℓ Σ_{j=1..N} y_j = −ℓ
  ⇒ y_j = w_j / Σ_{j=1..N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as

  π̄_i = P(O, s1 = i|λ) / P(O|λ) = P(s1 = i|O, λ)

  ā_ij = Σ_{t=1..T−1} P(s_t = i, s_{t+1} = j|O, λ) / Σ_{t=1..T−1} P(s_t = i|O, λ)

  b̄_i(k) = Σ_{t=1..T, s.t. o_t = v_k} P(s_t = i|O, λ) / Σ_{t=1..T} P(s_t = i|O, λ)
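A minimal C sketch of this re-estimation step for a single training utterance, assuming the posteriors gam[t][i] = P(s_t = i|O, λ) and xi[t][i][j] = P(s_t = i, s_{t+1} = j|O, λ) have already been produced by the forward-backward algorithm; the array sizes and names are illustrative only, not from the slides:

#define T_MAX   1000
#define N_STATE 3
#define N_SYM   2

double gam[T_MAX][N_STATE];             /* gam[t][i]   = P(s_t = i | O, lambda)               */
double xi[T_MAX][N_STATE][N_STATE];     /* xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda)  */
double pi[N_STATE], a[N_STATE][N_STATE], b[N_STATE][N_SYM];

void reestimate(int T, const int obs[]) /* obs[t] is a codeword index in 0..N_SYM-1 */
{
    int i, j, k, t;
    for (i = 0; i < N_STATE; i++) {
        double from_i = 0.0, in_i = 0.0;            /* assumes state i is visited at least once */
        pi[i] = gam[0][i];                          /* expected frequency in state i at t = 1   */
        for (t = 0; t < T - 1; t++) from_i += gam[t][i];  /* expected transitions out of i      */
        for (t = 0; t < T; t++)     in_i   += gam[t][i];  /* expected time spent in state i     */

        for (j = 0; j < N_STATE; j++) {             /* a_ij = E[#(i->j)] / E[#(transitions from i)] */
            double num = 0.0;
            for (t = 0; t < T - 1; t++) num += xi[t][i][j];
            a[i][j] = num / from_i;
        }
        for (k = 0; k < N_SYM; k++) {               /* b_i(k) = E[time in i observing v_k] / E[time in i] */
            double num = 0.0;
            for (t = 0; t < T; t++)
                if (obs[t] == k) num += gam[t][i];
            b[i][k] = num / in_i;
        }
    }
}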

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
 – The difference between the discrete and the continuous HMM lies in a different form of state output probability
 – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
 – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

   b_j(o) = Σ_{k=1..M} c_jk b_jk(o) = Σ_{k=1..M} c_jk N(o; μ_jk, Σ_jk)
          = Σ_{k=1..M} c_jk [ 1 / ( (2π)^{L/2} |Σ_jk|^{1/2} ) ] exp( −(1/2) (o − μ_jk)^T Σ_jk^{−1} (o − μ_jk) )

   with Σ_{k=1..M} c_jk = 1

   [Figure: distribution for state i as a mixture of Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3]
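A compact C sketch of evaluating the state output probability b_j(o) for one state, simplified to diagonal covariances (the slide's formula allows a full Σ_jk); the mixture count, dimension and names are illustrative, and the file should be linked with -lm:

#include <math.h>

#define M_MIX 3                         /* mixtures per state (illustrative) */
#define L_DIM 39                        /* feature dimension  (illustrative) */

/* b_j(o) = sum_k c[k] * N(o; mu[k], diag(var[k])), each Gaussian evaluated via its log density */
double state_output_prob(const double o[L_DIM],
                         const double c[M_MIX],
                         const double mu[M_MIX][L_DIM],
                         const double var[M_MIX][L_DIM])
{
    double b = 0.0;
    int k, d;
    for (k = 0; k < M_MIX; k++) {
        double log_n = -0.5 * L_DIM * log(2.0 * 3.14159265358979);
        for (d = 0; d < L_DIM; d++) {
            double diff = o[d] - mu[k][d];
            log_n -= 0.5 * (log(var[k][d]) + diff * diff / var[k][d]);
        }
        b += c[k] * exp(log_n);         /* in practice this sum is kept in the log domain (cf. LogAdd) */
    }
    return b;
}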

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

  p(O|λ) = Σ_S Σ_K p(O, S, K|λ)

  where K = {k_1, k_2, …, k_T} is one of the possible mixture-component sequences along the state sequence S = {s_1, s_2, …, s_T}, and

  p(O, S, K|λ) = π_{s1} c_{s1 k1} b_{s1 k1}(o1) ∏_{t=2..T} a_{s_{t−1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)

  Note (interchanging the product over time with the sum over mixtures):
  ∏_{t=1..T} Σ_{k=1..M} a_{t k}
   = (a_{11} + a_{12} + … + a_{1M})(a_{21} + a_{22} + … + a_{2M}) … (a_{T1} + … + a_{TM})
   = Σ_{k_1=1..M} Σ_{k_2=1..M} … Σ_{k_T=1..M} ∏_{t=1..T} a_{t k_t}
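As a sanity check of this interchange with T = 2 and M = 2: (a_{11} + a_{12})(a_{21} + a_{22}) = a_{11}a_{21} + a_{11}a_{22} + a_{12}a_{21} + a_{12}a_{22}, and the four terms on the right enumerate exactly the four mixture-component sequences (k_1, k_2).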

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

  Q(λ, λ̄) = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] log p(O, S, K|λ̄)

  log p(O, S, K|λ̄) = log π̄_{s1} + Σ_{t=2..T} log ā_{s_{t−1} s_t} + Σ_{t=1..T} log b̄_{s_t k_t}(o_t) + Σ_{t=1..T} log c̄_{s_t k_t}

  ⇒ Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
    (initial probabilities, state transition probabilities, Gaussian density functions, mixture components)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with discrete HMM training:

  Q_b(λ, b̄) = Σ_{j=1..N} Σ_{k=1..M} Σ_{t=1..T} P(s_t = j, k_t = k|O, λ) log b̄_jk(o_t)

  Q_c(λ, c̄) = Σ_{j=1..N} Σ_{k=1..M} Σ_{t=1..T} P(s_t = j, k_t = k|O, λ) log c̄_jk

  where γ_t(j, k) = P(s_t = j, k_t = k|O, λ)

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Re-estimation of the mean vectors. Let γ_t(j, k) = P(s_t = j, k_t = k|O, λ), and

  b̄_jk(o_t) = N(o_t; μ̄_jk, Σ̄_jk) = [ 1 / ( (2π)^{L/2} |Σ̄_jk|^{1/2} ) ] exp( −(1/2) (o_t − μ̄_jk)^T Σ̄_jk^{−1} (o_t − μ̄_jk) )

  log b̄_jk(o_t) = −(L/2) log(2π) − (1/2) log |Σ̄_jk| − (1/2) (o_t − μ̄_jk)^T Σ̄_jk^{−1} (o_t − μ̄_jk)

  ∂Q_b(λ, b̄) / ∂μ̄_jk = Σ_{t=1..T} γ_t(j, k) Σ̄_jk^{−1} (o_t − μ̄_jk) = 0

  (using d(x^T C x)/dx = (C + C^T) x, and Σ̄_jk is symmetric here)

  ⇒ μ̄_jk = Σ_{t=1..T} γ_t(j, k) o_t / Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Re-estimation of the covariance matrices:

  ∂Q_b(λ, b̄) / ∂Σ̄_jk^{−1} = Σ_{t=1..T} γ_t(j, k) [ (1/2) Σ̄_jk − (1/2) (o_t − μ̄_jk)(o_t − μ̄_jk)^T ] = 0

  (using d(a^T X b)/dX = a b^T and d det(X)/dX = det(X) (X^{−1})^T, with Σ̄_jk symmetric here)

  ⇒ Σ̄_jk = Σ_{t=1..T} γ_t(j, k) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

  μ̄_jk = Σ_{t=1..T} p(s_t = j, k_t = k|O, λ) o_t / Σ_{t=1..T} p(s_t = j, k_t = k|O, λ)

  Σ̄_jk = Σ_{t=1..T} p(s_t = j, k_t = k|O, λ) (o_t − μ̄_jk)(o_t − μ̄_jk)^T / Σ_{t=1..T} p(s_t = j, k_t = k|O, λ)

  c̄_jk = Σ_{t=1..T} γ_t(j, k) / Σ_{t=1..T} Σ_{k'=1..M} γ_t(j, k')
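These formulas are usually realized as per-state accumulators over time. Below is a minimal C sketch under the same diagonal-covariance simplification as before, with gam_mix[t][j][k] standing for γ_t(j, k) = P(s_t = j, k_t = k|O, λ) from the forward-backward pass; names and sizes are illustrative, not the lecture's reference code:

#define T_MAX   1000
#define N_STATE 3
#define M_MIX   3
#define L_DIM   39

double gam_mix[T_MAX][N_STATE][M_MIX];   /* gamma_t(j,k)              */
double o[T_MAX][L_DIM];                  /* observation vectors o_t   */
double mu[N_STATE][M_MIX][L_DIM];        /* new means                 */
double var[N_STATE][M_MIX][L_DIM];       /* new diagonal covariances  */
double c[N_STATE][M_MIX];                /* new mixture weights       */

void reestimate_mixtures(int T)
{
    int t, j, k, d;
    for (j = 0; j < N_STATE; j++) {
        double occ_state = 0.0;                        /* sum_t sum_k gamma_t(j,k) */
        for (t = 0; t < T; t++)
            for (k = 0; k < M_MIX; k++)
                occ_state += gam_mix[t][j][k];

        for (k = 0; k < M_MIX; k++) {
            double occ = 0.0;                          /* sum_t gamma_t(j,k)       */
            double sum_o[L_DIM] = {0.0}, sum_sq[L_DIM] = {0.0};
            for (t = 0; t < T; t++) {
                occ += gam_mix[t][j][k];
                for (d = 0; d < L_DIM; d++) {
                    sum_o[d]  += gam_mix[t][j][k] * o[t][d];
                    sum_sq[d] += gam_mix[t][j][k] * o[t][d] * o[t][d];
                }
            }
            c[j][k] = occ / occ_state;                 /* mixture weight                          */
            for (d = 0; d < L_DIM; d++) {
                mu[j][k][d]  = sum_o[d] / occ;         /* weighted mean                           */
                var[j][k][d] = sum_sq[d] / occ         /* weighted variance: E[o^2] - mean^2,     */
                               - mu[j][k][d] * mu[j][k][d];  /* equal to sum_t gamma (o-mu)^2/occ */
            }
        }
    }
}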

Page 39: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 39

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

s2

s3

s1

O1

s2

s3

s1

s2

s3

s1

s2

s1

s3

State

O2 O3 OT

1 2 3 T-1 T time

s2

s3

s1

OT-1

s2

s1

s3

3(3)

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 40: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 40

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull A three-state Hidden Markov Model for the Dow Jones Industrial average

06

0504 07

01

03

(06035)07=0147

SP - Berlin Chen 41

Basic Problem 2 of HMM- The Viterbi Algorithm (cont)

bull Algorithm in the logarithmic form

iδs

aij

baij

itt

issssPi

sss=

TNi

T

ijtNit

tjijtNit

tttssst

T

T

t

1

11

111

21121

21

21

maxarg from backtracecan We

gbacktracinFor logmaxarg

log logmaxinduction By

statein ends andn observatio first for the accounts which at timepath single a along scorebest the=

logmax variablenew a Define

n observatiogiven afor sequence statebest a Find

121

o

ooo

oooOS

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (5/8)
[Figure: alignment grid with the correct/reference word sequence along one axis and the recognized/test word sequence along the other, starting from cell (0,0) and ending at (n,m); horizontal moves are insertions, vertical moves are deletions, and cell (i,j) is reached from (i-1,j-1), (i-1,j) or (i,j-1). HTK follows this convention.]

• A Dynamic Programming Algorithm
  – Initialization

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;

    for (i = 1; i <= n; i++) {          /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }

    for (j = 1; j <= m; j++) {          /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program

    for (i = 1; i <= n; i++) {                 /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {             /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {            /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];        /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {                /* HOR = ins */
                gridi[j] = gridi1[j];          /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                           /* VERT = del */
                gridi[j] = gridi[j-1];         /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }  /* for j */
    }      /* for i */

• Example 1 (HTK-style alignment; each grid cell records the running counts (Ins, Del, Sub, Hit))
    Correct: A C B C C
    Test:    B A B C
  One optimal alignment (WER = 60%): Ins B, Hit A, Del C, Hit B, Hit C, Del C
  – There is still another optimal alignment with the same total error count
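The two fragments above presuppose a grid whose cells carry an accumulated score, the four running counts and a backtrace direction. Below is a minimal, self-contained sketch that defines such a cell type and ties the initialization and the main loop together for the example; the struct layout, the direction constants and the fixed-size grid are assumptions made here for illustration, not the exact definitions used in the lecture code or in HTK.

    #include <stdio.h>
    #include <string.h>

    enum dir { NIL, DIAG, HOR, VERT };            /* backtrace directions */

    typedef struct {
        int score, ins, del, sub, hit;
        enum dir dir;
    } Cell;

    #define MAXW 64
    static Cell grid[MAXW][MAXW];                 /* grid[i][j]: i over test, j over reference */

    /* Align lTest[1..n] against lRef[1..m] with unit penalties; return the final cell. */
    static Cell align(const char *lTest, int n, const char *lRef, int m)
    {
        const int insPen = 1, delPen = 1, subPen = 1;
        memset(grid, 0, sizeof grid);
        for (int i = 1; i <= n; i++) {            /* horizontal moves = insertions */
            grid[i][0] = grid[i-1][0];
            grid[i][0].dir = HOR; grid[i][0].score += insPen; grid[i][0].ins++;
        }
        for (int j = 1; j <= m; j++) {            /* vertical moves = deletions */
            grid[0][j] = grid[0][j-1];
            grid[0][j].dir = VERT; grid[0][j].score += delPen; grid[0][j].del++;
        }
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int h = grid[i-1][j].score + insPen;
                int d = grid[i-1][j-1].score + (lRef[j] != lTest[i] ? subPen : 0);
                int v = grid[i][j-1].score + delPen;
                if (d <= h && d <= v) {           /* diagonal: hit or substitution */
                    grid[i][j] = grid[i-1][j-1]; grid[i][j].score = d; grid[i][j].dir = DIAG;
                    if (lRef[j] == lTest[i]) grid[i][j].hit++; else grid[i][j].sub++;
                } else if (h < v) {               /* horizontal: insertion */
                    grid[i][j] = grid[i-1][j]; grid[i][j].score = h; grid[i][j].dir = HOR; grid[i][j].ins++;
                } else {                          /* vertical: deletion */
                    grid[i][j] = grid[i][j-1]; grid[i][j].score = v; grid[i][j].dir = VERT; grid[i][j].del++;
                }
            }
        return grid[n][m];
    }

    int main(void)
    {
        /* Example 1: reference A C B C C, test B A B C (strings are 1-based inside align()) */
        const char ref[] = " ACBCC", test[] = " BABC";
        Cell c = align(test, 4, ref, 5);
        printf("errors = %d  ->  WER = %.0f%%  (Ins=%d Del=%d Sub=%d Hit=%d)\n",
               c.ins + c.del + c.sub, 100.0 * (c.ins + c.del + c.sub) / 5,
               c.ins, c.del, c.sub, c.hit);
        return 0;
    }

The total error count (3 errors, i.e. WER = 60%) is unique, but the Ins/Del/Sub breakdown of the reported path depends on how ties are broken, which is why the slide lists more than one optimal alignment.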

SP - Berlin Chen 77

Measures of ASR Performance (7/8)
• Example 2 (each grid cell records the running counts (Ins, Del, Sub, Hit))
    Correct: A C B C C
    Test:    B A A C
  – Alignment 1 (WER = 80%): Ins B, Hit A, Del C, Sub B, Hit C, Del C
  – Alignment 2 (WER = 80%): Ins B, Hit A, Sub C, Del B, Hit C, Del C
  – Alignment 3 (WER = 80%): Sub A, Sub C, Sub B, Hit C, Del C
  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)
• Two common settings of different penalties for substitution, deletion and insertion errors
  – HTK error penalties: subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3
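With either setting, the only change to the program above is the values of the three penalty constants. Note that in both cases a substitution (10 or 4) still costs less than a deletion plus an insertion (7 + 7 = 14, or 3 + 3 = 6), so a one-for-one word mismatch is still aligned as a substitution; compared with unit penalties (1 versus 2), however, the relative margin is smaller.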

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance
    Reference:   桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ......
    ASR Output:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ......
  (in the original listing every character is accompanied by two numeric fields, shown as "100000 100000")

SP - Berlin Chen 80

Homework 3
• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200 and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    ==================================================================
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    ==================================================================
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    ==================================================================
    ------------------------ Overall Results ------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
    ==================================================================
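Reading the first summary line, H, D, S, I and N can be taken as the numbers of hits, deletions, substitutions, insertions and reference characters (note H + D + S = 57144 + 829 + 7839 = 65812 = N). Then %Corr = H/N = 57144/65812 ≈ 86.83% and Acc = (H - I)/N = (57144 - 504)/65812 ≈ 86.06%, so the character error rate to report is CER = (S + D + I)/N = (7839 + 829 + 504)/65812 ≈ 13.94% = 100% - Acc.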

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)
[Figure: two illustrative examples.
 (a) Two bottles, A and B, containing red (R) and green (G) balls. Observed data O: the "ball sequence"; latent data S: the "bottle sequence". The parameters λ to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).
 (b) A three-state HMM (s1, s2, s3) with discrete emission probabilities over the symbols {A, B, C} and the transition probabilities shown; given the training observations o1 o2 ... oT, re-estimation moves from λ to a new model λ' such that p(O|λ') > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)
• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data; in our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence
  – Two Major Steps
    • E: take the expectation with respect to the latent data, E[S|O, λ], using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)
• Estimation principle based on observations $\mathbf{X} = \{X_1, X_2, \dots, X_n\}$ with observed values $\mathbf{x} = \{x_1, x_2, \dots, x_n\}$
  – The Maximum Likelihood (ML) Principle: find the model parameter $\Phi$ so that the likelihood $p(\mathbf{x}|\Phi)$ is maximum.
    For example, if $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}$ are the parameters of a multivariate normal distribution and $\mathbf{X}$ is i.i.d. (independent, identically distributed), then the ML estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are
    $\boldsymbol{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i, \qquad
     \boldsymbol{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}_{ML})(\mathbf{x}_i-\boldsymbol{\mu}_{ML})^{T}$
  – The Maximum A Posteriori (MAP) Principle: find the model parameter $\Phi$ so that the posterior likelihood $p(\Phi|\mathbf{x})$ is maximum.
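As a small illustration of the ML principle, the C sketch below computes the ML estimates of the mean and of a diagonal covariance from i.i.d. feature vectors. The diagonal-covariance restriction and all names are assumptions made to keep the example short; it is not code from the lecture.

    #include <stddef.h>

    /* ML estimates for n i.i.d. d-dimensional vectors x[0..n-1][0..d-1]:
         mean[j] = (1/n) * sum_i x[i][j]
         var[j]  = (1/n) * sum_i (x[i][j] - mean[j])^2   (diagonal covariance only) */
    void gaussian_ml_estimate(double **x, size_t n, size_t d, double *mean, double *var)
    {
        for (size_t j = 0; j < d; j++) { mean[j] = 0.0; var[j] = 0.0; }
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < d; j++)
                mean[j] += x[i][j];
        for (size_t j = 0; j < d; j++) mean[j] /= (double)n;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < d; j++) {
                double diff = x[i][j] - mean[j];
                var[j] += diff * diff;
            }
        for (size_t j = 0; j < d; j++) var[j] /= (double)n;   /* ML divides by n, not n-1 */
    }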

SP - Berlin Chen 85

The EM Algorithm (4/7)
• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O}|\lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O},\mathbf{S}|\lambda)$
• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data $\mathbf{O}$
    • We want to maximize $P(\mathbf{O}|\lambda)$; $\lambda$ is a parameter vector
  – The hidden (unobservable) data $\mathbf{S}$
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)
  – Assume we have $\lambda$ and estimate the probability that each $\mathbf{S}$ occurred in the generation of $\mathbf{O}$
  – Pretend we had in fact observed the complete data pair $(\mathbf{O},\mathbf{S})$, with frequency proportional to the probability $P(\mathbf{S}|\mathbf{O},\lambda)$, to compute a new $\bar{\lambda}$, the maximum likelihood estimate of $\lambda$
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression and expectation taken over S
      By Bayes' rule, $P(\mathbf{O},\mathbf{S}|\bar{\lambda}) = P(\mathbf{S}|\mathbf{O},\bar{\lambda})\,P(\mathbf{O}|\bar{\lambda})$, so
      $\log P(\mathbf{O}|\bar{\lambda}) = \log P(\mathbf{O},\mathbf{S}|\bar{\lambda}) - \log P(\mathbf{S}|\mathbf{O},\bar{\lambda})$
      (complete data likelihood and incomplete data likelihood; $\bar{\lambda}$ is the unknown model setting)
      Taking the expectation over $\mathbf{S}$ with respect to the current model, $P(\mathbf{S}|\mathbf{O},\lambda)$:
      $\log P(\mathbf{O}|\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\log P(\mathbf{O},\mathbf{S}|\bar{\lambda}) - \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\log P(\mathbf{S}|\mathbf{O},\bar{\lambda})$

SP - Berlin Chen 87

The EM Algorithm (6/7)
  – Algorithm (cont.)
    • We can thus express $\log P(\mathbf{O}|\bar{\lambda})$ as follows
      $\log P(\mathbf{O}|\bar{\lambda}) = Q(\lambda,\bar{\lambda}) + H(\lambda,\bar{\lambda})$
      where
      $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}|\bar{\lambda})$
      $H(\lambda,\bar{\lambda}) = -\sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\,\log P(\mathbf{S}|\mathbf{O},\bar{\lambda})$
    • We want $\log P(\mathbf{O}|\bar{\lambda}) \ge \log P(\mathbf{O}|\lambda)$, i.e.
      $Q(\lambda,\bar{\lambda}) + H(\lambda,\bar{\lambda}) \ge Q(\lambda,\lambda) + H(\lambda,\lambda)$

SP - Berlin Chen 88

The EM Algorithm (7/7)
• $H(\lambda,\bar{\lambda})$ has the following property: $H(\lambda,\bar{\lambda}) \ge H(\lambda,\lambda)$
  $H(\lambda,\bar{\lambda}) - H(\lambda,\lambda)
    = -\sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\log\frac{P(\mathbf{S}|\mathbf{O},\bar{\lambda})}{P(\mathbf{S}|\mathbf{O},\lambda)}
    \ge -\sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\left(\frac{P(\mathbf{S}|\mathbf{O},\bar{\lambda})}{P(\mathbf{S}|\mathbf{O},\lambda)} - 1\right)
    = -\sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\bar{\lambda}) + \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda) = 0$
  (using $\log x \le x-1$, a form of Jensen's inequality; $H(\lambda,\bar{\lambda}) - H(\lambda,\lambda)$ is the Kullback-Leibler (KL) distance between $P(\mathbf{S}|\mathbf{O},\lambda)$ and $P(\mathbf{S}|\mathbf{O},\bar{\lambda})$)
  – Therefore, for maximizing $\log P(\mathbf{O}|\bar{\lambda})$ we only need to maximize the Q-function (auxiliary function)
    $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\log P(\mathbf{O},\mathbf{S}|\bar{\lambda})$,
    the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)
• Apply the EM algorithm to iteratively refine the HMM parameter vector $\bar{\lambda} = (\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\boldsymbol{\pi}})$
  – By maximizing the auxiliary function
    $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}|\mathbf{O},\lambda)\log P(\mathbf{O},\mathbf{S}|\bar{\lambda})
       = \sum_{\mathbf{S}} \frac{P(\mathbf{O},\mathbf{S}|\lambda)}{P(\mathbf{O}|\lambda)}\log P(\mathbf{O},\mathbf{S}|\bar{\lambda})$
  – Where $\log P(\mathbf{O},\mathbf{S}|\lambda)$ and $\log P(\mathbf{O},\mathbf{S}|\bar{\lambda})$ can be expressed as
    $\log P(\mathbf{O},\mathbf{S}|\lambda) = \log\pi_{s_1} + \sum_{t=1}^{T-1}\log a_{s_t s_{t+1}} + \sum_{t=1}^{T}\log b_{s_t}(\mathbf{o}_t)$
    $\log P(\mathbf{O},\mathbf{S}|\bar{\lambda}) = \log\bar{\pi}_{s_1} + \sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T}\log\bar{b}_{s_t}(\mathbf{o}_t)$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)
• Rewrite the auxiliary function as
  $Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}}) + Q_{a}(\lambda,\bar{\mathbf{A}}) + Q_{b}(\lambda,\bar{\mathbf{B}})$
  where
  $Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}}) = \sum_{i=1}^{N}\frac{P(\mathbf{O},s_1=i|\lambda)}{P(\mathbf{O}|\lambda)}\,\log\bar{\pi}_i$
  $Q_{a}(\lambda,\bar{\mathbf{A}}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\frac{P(\mathbf{O},s_t=i,s_{t+1}=j|\lambda)}{P(\mathbf{O}|\lambda)}\,\log\bar{a}_{ij}$
  $Q_{b}(\lambda,\bar{\mathbf{B}}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t:\,\mathbf{o}_t=v_k}\frac{P(\mathbf{O},s_t=j|\lambda)}{P(\mathbf{O}|\lambda)}\,\log\bar{b}_j(v_k)$
  (each term is of the form $\sum_i w_i\log y_i$)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)
• The auxiliary function contains three independent terms, in $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(k)$
  – They can be maximized individually
  – All are of the same form:
    $F(\mathbf{y}) = g(y_1, y_2, \dots, y_N) = \sum_{j=1}^{N} w_j\log y_j$, where $y_j \ge 0$ and $\sum_{j=1}^{N} y_j = 1$,
    has its maximum value when
    $y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)
• Proof: apply a Lagrange multiplier
  – By applying the Lagrange multiplier $\ell$ with the constraint $\sum_{j=1}^{N} y_j = 1$, define
    $F = \sum_{j=1}^{N} w_j\log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right)$
  – Setting $\dfrac{\partial F}{\partial y_j} = \dfrac{w_j}{y_j} + \ell = 0$ gives $w_j = -\ell\,y_j$ for every $j$
  – Summing over $j$: $\sum_{j=1}^{N} w_j = -\ell\sum_{j=1}^{N} y_j = -\ell$, and therefore
    $y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$
  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)
• The new model parameter set $\bar{\lambda} = (\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\boldsymbol{\pi}})$ can be expressed as
  $\bar{\pi}_i = P(s_1=i|\mathbf{O},\lambda) = \gamma_1(i)$
  $\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} P(s_t=i,\,s_{t+1}=j|\mathbf{O},\lambda)}{\sum_{t=1}^{T-1} P(s_t=i|\mathbf{O},\lambda)}
               = \dfrac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$
  $\bar{b}_i(k) = \dfrac{\sum_{t:\,\mathbf{o}_t=v_k} P(s_t=i|\mathbf{O},\lambda)}{\sum_{t=1}^{T} P(s_t=i|\mathbf{O},\lambda)}
               = \dfrac{\sum_{t:\,\mathbf{o}_t=v_k}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}$
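For a single training utterance, the corresponding re-estimation (M-step) can be sketched in C as below. It assumes the state posteriors gamma[t][i] = P(s_t = i|O, λ) and transition posteriors xi[t][i][j] = P(s_t = i, s_{t+1} = j|O, λ) have already been produced by the forward-backward pass, and that obs[t] holds the codebook index of o_t; the array shapes and names are illustrative only.

    /* Discrete-HMM Baum-Welch M-step for one utterance.
       N states, M codewords, T frames.
       gamma[t][i] = P(s_t = i | O, lambda)              (t = 0..T-1)
       xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda)  (t = 0..T-2)
       obs[t]      = codebook index of o_t (0..M-1)
       Outputs: pi[i], a[i][j], b[i][k]  (the new parameters). */
    void discrete_hmm_mstep(int N, int M, int T,
                            double **gamma, double ***xi, const int *obs,
                            double *pi, double **a, double **b)
    {
        for (int i = 0; i < N; i++) {
            pi[i] = gamma[0][i];                       /* pi_i = gamma_1(i) */

            double denomA = 0.0, denomB = 0.0;
            for (int t = 0; t < T - 1; t++) denomA += gamma[t][i];
            for (int t = 0; t < T;     t++) denomB += gamma[t][i];

            for (int j = 0; j < N; j++) {              /* a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i) */
                double num = 0.0;
                for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
                a[i][j] = num / denomA;
            }

            for (int k = 0; k < M; k++) b[i][k] = 0.0; /* b_i(k) = sum_{t: o_t = v_k} gamma_t(i) / sum_t gamma_t(i) */
            for (int t = 0; t < T; t++) b[i][obs[t]] += gamma[t][i];
            for (int k = 0; k < M; k++) b[i][k] /= denomB;
        }
    }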

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)
• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)
    $b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\,b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\,N(\mathbf{o};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})
      = \sum_{k=1}^{M}\frac{c_{jk}}{(2\pi)^{L/2}\,|\boldsymbol{\Sigma}_{jk}|^{1/2}}
        \exp\!\left(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}-\boldsymbol{\mu}_{jk})\right)$,
    with $\sum_{k=1}^{M} c_{jk} = 1$
  [Figure: the distribution for state i drawn as a weighted sum of three Gaussians N1, N2, N3 with mixture weights w_i1, w_i2, w_i3.]
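A sketch of how such a state output probability can be evaluated in the log domain is given below; it assumes diagonal covariance matrices (a common simplification) and combines the mixture components with a log-sum-exp for numerical stability. The function and parameter names are illustrative, not from the lecture or from any particular toolkit.

    #include <math.h>
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* log b_j(o) = log sum_k c[k] * N(o; mu[k], diag(var[k])) for one state.
       M mixtures, L-dimensional observation o. Assumes M <= 256 for this sketch. */
    double log_state_output_prob(int M, int L, const double *o,
                                 const double *c, double **mu, double **var)
    {
        double logp[256], best = -1.0e300;
        for (int k = 0; k < M; k++) {
            double lp = log(c[k]) - 0.5 * L * log(2.0 * M_PI);
            for (int l = 0; l < L; l++) {
                double d = o[l] - mu[k][l];
                lp -= 0.5 * (log(var[k][l]) + d * d / var[k][l]);
            }
            logp[k] = lp;
            if (lp > best) best = lp;
        }
        double sum = 0.0;                  /* log-sum-exp over the mixture components */
        for (int k = 0; k < M; k++) sum += exp(logp[k] - best);
        return best + log(sum);
    }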

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)
• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$
  $p(\mathbf{O},\mathbf{S}|\lambda) = \left(\prod_{t=1}^{T} a_{s_{t-1}s_t}\right)\prod_{t=1}^{T} b_{s_t}(\mathbf{o}_t)
    = \left(\prod_{t=1}^{T} a_{s_{t-1}s_t}\right)\prod_{t=1}^{T}\sum_{k=1}^{M} c_{s_t k}\,b_{s_t k}(\mathbf{o}_t)$
  (with the convention that $a_{s_0 s_1}$ denotes the initial probability $\pi_{s_1}$), so that
  $p(\mathbf{O}|\lambda) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} p(\mathbf{O},\mathbf{S},\mathbf{K}|\lambda)$, with
  $p(\mathbf{O},\mathbf{S},\mathbf{K}|\lambda) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\,c_{s_t k_t}\,b_{s_t k_t}(\mathbf{o}_t)$
  where $\mathbf{K} = (k_1, k_2, \dots, k_T)$ is one of the possible mixture component sequences along the state sequence $\mathbf{S}$
  – Note: a product of sums expands into a sum of products over all component sequences,
    $\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)
• Therefore, an auxiliary function for the EM algorithm can be written as
  $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} P(\mathbf{S},\mathbf{K}|\mathbf{O},\lambda)\log p(\mathbf{O},\mathbf{S},\mathbf{K}|\bar{\lambda})
     = \sum_{\mathbf{S}}\sum_{\mathbf{K}}\frac{p(\mathbf{O},\mathbf{S},\mathbf{K}|\lambda)}{p(\mathbf{O}|\lambda)}\log p(\mathbf{O},\mathbf{S},\mathbf{K}|\bar{\lambda})$
  where
  $\log p(\mathbf{O},\mathbf{S},\mathbf{K}|\bar{\lambda}) = \log\bar{\pi}_{s_1} + \sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}}
     + \sum_{t=1}^{T}\log\bar{b}_{s_t k_t}(\mathbf{o}_t) + \sum_{t=1}^{T}\log\bar{c}_{s_t k_t}$
  so the auxiliary function splits into independent terms
  $Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\boldsymbol{\pi}}) + Q_{a}(\lambda,\bar{\mathbf{A}}) + Q_{b}(\lambda,\bar{\mathbf{B}}) + Q_{c}(\lambda,\bar{\mathbf{C}})$
  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)
• The only difference, compared with discrete HMM training, lies in the terms that involve the mixture components:
  $Q_{b}(\lambda,\bar{\mathbf{B}}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j,\,k_t=k|\mathbf{O},\lambda)\,\log\bar{b}_{jk}(\mathbf{o}_t)$
  $Q_{c}(\lambda,\bar{\mathbf{C}}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j,\,k_t=k|\mathbf{O},\lambda)\,\log\bar{c}_{jk}$
  where $P(s_t=j,\,k_t=k|\mathbf{O},\lambda) = \gamma_t(j,k)$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)
• Let $\gamma_t(j,k) = P(s_t=j,\,k_t=k|\mathbf{O},\lambda)$ and
  $\bar{b}_{jk}(\mathbf{o}_t) = \frac{1}{(2\pi)^{L/2}\,|\bar{\boldsymbol{\Sigma}}_{jk}|^{1/2}}
     \exp\!\left(-\tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})\right)$
  so that
  $\log\bar{b}_{jk}(\mathbf{o}_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}|
     - \tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})$
• Setting the derivative of $Q_b$ with respect to $\bar{\boldsymbol{\mu}}_{jk}$ to zero,
  $\frac{\partial Q_{b}(\lambda,\bar{\mathbf{B}})}{\partial\bar{\boldsymbol{\mu}}_{jk}}
     = \sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk}) = 0$
  gives
  $\bar{\boldsymbol{\mu}}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  (using $\frac{\partial}{\partial\mathbf{x}}\,\mathbf{x}^{T}\mathbf{C}\,\mathbf{x} = (\mathbf{C}+\mathbf{C}^{T})\mathbf{x}$; $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)
• Similarly, setting the derivative of $Q_b$ with respect to $\bar{\boldsymbol{\Sigma}}_{jk}$ (through $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$) to zero,
  $\frac{\partial Q_{b}(\lambda,\bar{\mathbf{B}})}{\partial\bar{\boldsymbol{\Sigma}}_{jk}^{-1}}
     = \tfrac{1}{2}\sum_{t=1}^{T}\gamma_t(j,k)\left[\bar{\boldsymbol{\Sigma}}_{jk}
       - (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}\right] = 0$
  gives
  $\bar{\boldsymbol{\Sigma}}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  (using $\frac{\partial}{\partial\mathbf{X}}\,\mathbf{a}^{T}\mathbf{X}\,\mathbf{b} = \mathbf{a}\mathbf{b}^{T}$,
   $\frac{\partial}{\partial\mathbf{X}}\log\det\mathbf{X} = (\mathbf{X}^{-1})^{T}$ and $\det(\mathbf{X}^{-1}) = 1/\det(\mathbf{X})$;
   $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)
• The new model parameter set for each mixture component and mixture weight can be expressed as
  $\bar{\boldsymbol{\mu}}_{jk} = \dfrac{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|\mathbf{O},\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|\mathbf{O},\lambda)}
     = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  $\bar{\boldsymbol{\Sigma}}_{jk} = \dfrac{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|\mathbf{O},\lambda)\,(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|\mathbf{O},\lambda)}$
  $\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k'=1}^{M}\gamma_t(j,k')}$
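Putting the three updates together, a single-utterance re-estimation of one state's mixture parameters can be sketched in C as below. It assumes the occupancy posteriors gamma[t][k] = γ_t(j,k) for that state are already available from the forward-backward pass and, for brevity, uses diagonal covariances; all names and array shapes are illustrative, not the lecture's own implementation.

    /* Continuous-mixture HMM M-step for one state j:
       M mixtures, L-dimensional observations obs[t][l], t = 0..T-1,
       gamma[t][k] = gamma_t(j,k).
       Outputs: c[k] (mixture weights), mu[k][l] (means), var[k][l] (diagonal covariances). */
    void gmm_state_mstep(int M, int L, int T,
                         double **gamma, double **obs,
                         double *c, double **mu, double **var)
    {
        double total = 0.0;
        for (int k = 0; k < M; k++) {
            double occ = 0.0;                      /* sum_t gamma_t(j,k) */
            for (int t = 0; t < T; t++) occ += gamma[t][k];
            total += occ;

            for (int l = 0; l < L; l++) {          /* mu_jk = sum_t gamma_t(j,k) o_t / occ */
                double num = 0.0;
                for (int t = 0; t < T; t++) num += gamma[t][k] * obs[t][l];
                mu[k][l] = num / occ;
            }
            for (int l = 0; l < L; l++) {          /* diagonal Sigma_jk: weighted scatter around the new mean */
                double num = 0.0;
                for (int t = 0; t < T; t++) {
                    double d = obs[t][l] - mu[k][l];
                    num += gamma[t][k] * d * d;
                }
                var[k][l] = num / occ;
            }
            c[k] = occ;                            /* numerator of c_jk; normalized below */
        }
        for (int k = 0; k < M; k++) c[k] /= total; /* c_jk = occ_k / sum_{k'} occ_{k'} */
    }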


P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 42: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 42

Homework 1bull A three-state Hidden Markov Model for the Dow Jones

Industrial average

ndash Find the probability P(up up unchanged down unchanged down up|)

ndash Fnd the optimal state sequence of the model which generates the observation sequence (up up unchanged down unchanged down up)

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

[Initial model: an ergodic 3-state HMM (states s1, s2, s3) in which each state has self-transition probability 0.34 and probability 0.33 of moving to each of the other two states; the initial emission probabilities are A:0.34 B:0.33 C:0.33 for s1, A:0.33 B:0.34 C:0.33 for s2, and A:0.33 B:0.33 C:0.34 for s3; one HMM is trained on each training set below]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were used instead in P1, P2, and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Block diagram: the speech signal is passed through Feature Extraction to produce the feature sequence X; the likelihoods p(X|M1), p(X|M2), ..., p(X|MV) of each word model, plus p(X|MSil) of a silence model, are computed, and the Most Likely Word Selector outputs the recognized label]

  Label(X) = arg max_k p(X | M_k)

Viterbi approximation:

  Label(X) = arg max_k max_S p(X, S | M_k)
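A minimal sketch of the final selection step (not from the original slides): given per-model Viterbi log-likelihoods, the recognizer simply picks the word model with the highest score. The helper viterbi_log_likelihood() is assumed to exist and is named for illustration only.

  /* Sketch (not from the original slides): pick the most likely word model.
   * viterbi_log_likelihood() is an assumed helper returning max_S log p(X,S|M_k). */
  extern double viterbi_log_likelihood(const double *X, int T, int model_index);

  int most_likely_word(const double *X, int T, int num_models)
  {
      int k, best = 0;
      double best_score = viterbi_log_likelihood(X, T, 0);
      for (k = 1; k < num_models; k++) {
          double score = viterbi_log_likelihood(X, T, k);
          if (score > best_score) { best_score = score; best = k; }
      }
      return best;   /* index of the recognized word (or the silence model) */
  }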

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the word recognition error rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming

• Example
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
  – Error analysis: one deletion ("the") and one insertion ("not"); "effect", "is", and "clear" are matched
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate      = 100% x (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% x (0 + 1 + 1) / 4 = 50%   (might be higher than 100%)
    Word Correction Rate = 100% x (Matched words) / (No. of words in the correct sentence)   = 100% x 3 / 4 = 75%
    Word Accuracy Rate   = 100% x (Matched - Ins) / (No. of words in the correct sentence)   = 100% x (3 - 1) / 4 = 50%       (might be negative)

  – Note that WER + WAR = 100%

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)
  – One index (i) runs over the words of the correct/reference sentence and the other (j) over the words of the recognized/test sentence; the two sentence lengths define the grid size
  – Each grid point [i, j] stores the minimum word-error alignment ending there, together with the kind of alignment step (hit, substitution, insertion, or deletion) that produced it
  [Figure: alignment grid with Ref on the i axis and Test on the j axis]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen); here Test is indexed by i (length n) and Reference by j (length m)

  Step 1: Initialization
    G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1..n (test), for j = 1..m (reference):
      G[i][j] = min( G[i-1][j] + 1                       (Insertion),
                     G[i][j-1] + 1                       (Deletion),
                     G[i-1][j-1] + 1  if LT[i] != LR[j]  (Substitution),
                     G[i-1][j-1]      if LT[i] == LR[j]  (Match) )
      B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical),
                3 (Substitution, diagonal), or 4 (Match, diagonal), according to the chosen term

  Step 3: Backtrace and Measure
    Word Error Rate = 100% x G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path from B[n][m] back to B[0][0]:
      if B[i][j] = 1, print LT[i] (Insertion) and go left;
      else if B[i][j] = 2, print LR[j] (Deletion) and go down;
      else print LR[j] (Hit/Match or Substitution) and go down diagonally

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here

SP - Berlin Chen 75

Measures of ASR Performance (5/8)
• A Dynamic Programming Algorithm (HTK-style bookkeeping)
  [Figure: alignment grid with the correct/reference word sequence (j = 1..m) on one axis and the recognized/test word sequence (i = 1..n) on the other; an insertion moves horizontally from (i-1, j), a deletion moves vertically from (i, j-1), and a hit/substitution moves diagonally from (i-1, j-1); each cell keeps (ins, del, sub, hit) counts along the best path]

  – Initialization

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub   = grid[0][0].hit = 0;
    grid[0][0].dir   = NIL;
    for (i = 1; i <= n; i++) {            /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {            /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program

  for (i = 1; i <= n; i++) {                 /* test */
      gridi = grid[i];  gridi1 = grid[i-1];
      for (j = 1; j <= m; j++) {             /* reference */
          h = gridi1[j].score + insPen;
          d = gridi1[j-1].score;
          if (lRef[j] != lTest[i])
              d += subPen;
          v = gridi[j-1].score + delPen;
          if (d <= h && d <= v) {            /* DIAG = hit or sub */
              gridi[j] = gridi1[j-1];
              gridi[j].score = d;
              gridi[j].dir = DIAG;
              if (lRef[j] == lTest[i]) ++gridi[j].hit;
              else                     ++gridi[j].sub;
          } else if (h < v) {                /* HOR = ins */
              gridi[j] = gridi1[j];
              gridi[j].score = h;
              gridi[j].dir = HOR;
              ++gridi[j].ins;
          } else {                           /* VERT = del */
              gridi[j] = gridi[j-1];
              gridi[j].score = v;
              gridi[j].dir = VERT;
              ++gridi[j].del;
          }
      }  /* for j */
  }      /* for i */

• Example 1
    Correct: A C B C C
    Test:    B A B C
  [Figure: the filled DP grid, where each cell records the (Ins, Del, Sub, Hit) counts along the best path to that cell]
  One optimal alignment (HTK backtrace): Ins B, Hit A, Del C, Hit B, Hit C, Del C
  Alignment 1: WER = (1 Ins + 2 Del + 0 Sub) / 5 = 60%  (another optimal alignment also exists)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)
• Example 2
    Correct: A C B C C
    Test:    B A A C
  [Figure: the filled DP grid with the (Ins, Del, Sub, Hit) counts for each cell]
  Three alignments are equally optimal, each with WER = 80%:
    Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   (1 Ins, 2 Del, 1 Sub)
    Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   (1 Ins, 2 Del, 1 Sub)
    Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          (0 Ins, 1 Del, 3 Sub)
  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion, and insertion errors

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

  Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……
  (the two numeric fields that precede each character in the files are omitted here)

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and all 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

  ------------------------ Overall Results (first story) ------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ------------------------ Overall Results (first 100 stories) ------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ------------------------ Overall Results (first 200 stories) ------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ------------------------ Overall Results (all 506 stories) --------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: the two-bottle (A, B) example: balls are drawn from bottles A and B]
  Observed data O: the "ball sequence"; latent data S: the "bottle sequence"
  Parameters λ to be estimated to maximize log P(O|λ):
    P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a 3-state HMM example with emission distributions (A:.3 B:.2 C:.5), (A:.7 B:.1 C:.2), (A:.3 B:.6 C:.1) and transition probabilities; given the observations o1 o2 ... oT, training produces a new model λ' with p(O|λ') > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data (in our case here, the state sequence is the latent data)
    • Direct access to the data necessary to estimate the parameters is impossible or difficult (in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence)
  – Two major steps:
    • E: take the expectation over the latent data S, using the current estimate of the parameters and conditioned on the observations, i.e., E_{S|O,λ}[ · ]
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = X1, X2, ..., Xn (with realizations x1, x2, ..., xn):

  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum. For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

      μ_ML = (1/n) Σ_{i=1}^{n} x_i
      Σ_ML = (1/n) Σ_{i=1}^{n} (x_i - μ_ML)(x_i - μ_ML)^T

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior probability p(Φ|X) ∝ p(X|Φ) p(Φ) is maximum
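As a small illustration (a sketch, not from the slides), the ML estimates above can be computed by simple accumulation; the fixed dimensionality and the diagonal covariance are illustrative simplifications.

  /* Sketch (not from the original slides): ML estimates of the mean and a diagonal
   * covariance for n i.i.d. 3-dimensional vectors x[0..n-1]. */
  #include <stddef.h>

  void ml_estimate(const double x[][3], size_t n, double mean[3], double var[3])
  {
      const size_t d = 3;                   /* illustrative dimensionality */
      size_t i, k;
      for (k = 0; k < d; k++) mean[k] = var[k] = 0.0;
      for (i = 0; i < n; i++)
          for (k = 0; k < d; k++) mean[k] += x[i][k];
      for (k = 0; k < d; k++) mean[k] /= (double)n;
      for (i = 0; i < n; i++)
          for (k = 0; k < d; k++) {
              double diff = x[i][k] - mean[k];
              var[k] += diff * diff;        /* diagonal of (x - mu)(x - mu)^T */
          }
      for (k = 0; k < d; k++) var[k] /= (double)n;
  }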

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters that maximize the log-likelihood of incomplete data by iteratively maximizing the expectation of the log-likelihood of the complete data

• First, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have a current model λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S|O, λ), and compute a new λ', the maximum likelihood estimate of the parameters
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression, with the expectation taken over S

      By Bayes' rule (complete-data vs. incomplete-data likelihood):
        P(O, S|λ') = P(S|O, λ') P(O|λ')
        log P(O|λ') = log P(O, S|λ') - log P(S|O, λ')

      Taking the expectation over S under the current (known) model setting λ:
        log P(O|λ') = Σ_S P(S|O, λ) log P(O, S|λ') - Σ_S P(S|O, λ) log P(S|O, λ')

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O|λ') as follows:

        log P(O|λ') = Q(λ, λ') - H(λ, λ')

      where

        Q(λ, λ') = Σ_S P(S|O, λ) log P(O, S|λ')
        H(λ, λ') = Σ_S P(S|O, λ) log P(S|O, λ')

    • We want log P(O|λ') ≥ log P(O|λ). Since

        log P(O|λ') - log P(O|λ) = [ Q(λ, λ') - Q(λ, λ) ] - [ H(λ, λ') - H(λ, λ) ],

      it suffices to choose λ' with Q(λ, λ') ≥ Q(λ, λ), provided that H(λ, λ') - H(λ, λ) ≤ 0 (shown next)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ') has the following property:

    H(λ, λ') - H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ') / P(S|O, λ) ]
                       ≤ Σ_S P(S|O, λ) [ P(S|O, λ') / P(S|O, λ) - 1 ]      (Jensen's inequality: log x ≤ x - 1)
                       = Σ_S P(S|O, λ') - Σ_S P(S|O, λ) = 1 - 1 = 0

  (the negative of the left-hand side is the Kullback-Leibler (KL) distance between P(S|O, λ) and P(S|O, λ'))

  – Therefore, for maximizing log P(O|λ') we only need to maximize the Q-function (auxiliary function)

      Q(λ, λ') = Σ_S P(S|O, λ) log P(O, S|λ'),

    the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function

      Q(λ, λ') = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ')

  – where P(O, S|λ') and log P(O, S|λ') can be expressed as

      P(O, S|λ') = π'_{s_1} b'_{s_1}(o_1) Π_{t=2}^{T} a'_{s_{t-1} s_t} b'_{s_t}(o_t)

      log P(O, S|λ') = log π'_{s_1} + Σ_{t=2}^{T} log a'_{s_{t-1} s_t} + Σ_{t=1}^{T} log b'_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b'), where

    Q_π(λ, π') = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O|λ) ] log π'_i

    Q_a(λ, a') = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] log a'_{ij}

    Q_b(λ, b') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] log b'_j(k)

  Each term is of the form Σ_j w_j log y_j (known weights w_j, probabilities y_j to be optimized)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in π'_i, a'_{ij}, and b'_j(k)
  – They can be maximized individually
  – All are of the same form:

      F(y_1, y_2, ..., y_N) = Σ_{j=1}^{N} w_j log y_j,   where y_j ≥ 0 and Σ_{j=1}^{N} y_j = 1,

    which attains its maximum value when

      y_j = w_j / Σ_{j'=1}^{N} w_{j'}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  By applying the Lagrange multiplier ε with the constraint Σ_{j=1}^{N} y_j = 1:

    F = Σ_{j=1}^{N} w_j log y_j + ε (1 - Σ_{j=1}^{N} y_j)

    ∂F/∂y_j = w_j / y_j - ε = 0   so that   w_j = ε y_j   for every j

  Summing over j gives  Σ_{j=1}^{N} w_j = ε Σ_{j=1}^{N} y_j = ε,  and therefore

    y_j = w_j / Σ_{j'=1}^{N} w_{j'}

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ' = (π', A', B') can therefore be expressed as

    π'_i = P(O, s_1 = i | λ) / P(O | λ)

    a'_{ij} = [ Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / P(O | λ) ] / [ Σ_{t=1}^{T-1} P(O, s_t = i | λ) / P(O | λ) ]

    b'_j(k) = [ Σ_{t: o_t = v_k} P(O, s_t = j | λ) / P(O | λ) ] / [ Σ_{t=1}^{T} P(O, s_t = j | λ) / P(O | λ) ]

  i.e., the expected relative frequencies obtained from the forward-backward (Baum-Welch) statistics
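As a minimal sketch (not from the original slides), the updates above can be implemented by accumulating the per-frame posteriors gamma_t(i) and xi_t(i, j), assumed here to be already computed by the forward-backward procedure; all sizes and names are illustrative.

  /* Sketch (not from the original slides): discrete-HMM re-estimation of a[i][j] and b[j][k]
   * from gamma[t][i] = P(s_t=i|O,lambda) and xi[t][i][j] = P(s_t=i,s_{t+1}=j|O,lambda). */
  #define T 100   /* number of frames */
  #define N 3     /* number of states */
  #define M 2     /* codebook size    */

  void reestimate(const double gamma[T][N], const double xi[T-1][N][N],
                  const int o[T],           /* o[t] = codeword index of frame t */
                  double a[N][N], double b[N][M], double pi[N])
  {
      int i, j, k, t;
      for (i = 0; i < N; i++) {
          double denomA = 0.0, denomB = 0.0;
          pi[i] = gamma[0][i];                              /* expected freq. in state i at t = 1 */
          for (t = 0; t < T - 1; t++) denomA += gamma[t][i];
          for (t = 0; t < T; t++)     denomB += gamma[t][i];
          for (j = 0; j < N; j++) {                         /* transitions i -> j */
              double num = 0.0;
              for (t = 0; t < T - 1; t++) num += xi[t][i][j];
              a[i][j] = num / denomA;
          }
          for (k = 0; k < M; k++) {                         /* emissions of codeword k in state i */
              double num = 0.0;
              for (t = 0; t < T; t++) if (o[t] == k) num += gamma[t][i];
              b[i][k] = num / denomB;
          }
      }
  }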

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1}^{M} c_{jk} b_{jk}(o) = Σ_{k=1}^{M} c_{jk} N(o; μ_{jk}, Σ_{jk})
             = Σ_{k=1}^{M} c_{jk} (2π)^{-L/2} |Σ_{jk}|^{-1/2} exp( -(1/2) (o - μ_{jk})^T Σ_{jk}^{-1} (o - μ_{jk}) ),

    where L is the feature dimension and the mixture weights satisfy Σ_{k=1}^{M} c_{jk} = 1

  [Figure: the distribution for state i is a mixture of Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3]
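A minimal sketch (not from the original slides) of evaluating this mixture output probability for one state, assuming diagonal covariances; the constants D and M and all names are illustrative.

  /* Sketch (not from the original slides): b_j(o) = sum_k c[k] * N(o; mu[k], diag(var[k]))
   * for one state with M diagonal-covariance Gaussians of dimension D. */
  #include <math.h>

  #define D 39
  #define M 4

  double state_output_prob(const double o[D],
                           const double c[M], const double mu[M][D], const double var[M][D])
  {
      const double TWO_PI = 6.283185307179586;
      double b = 0.0;
      int k, d;
      for (k = 0; k < M; k++) {
          double logdet = 0.0, dist = 0.0;
          for (d = 0; d < D; d++) {
              double diff = o[d] - mu[k][d];
              logdet += log(var[k][d]);           /* log|Sigma| for a diagonal covariance */
              dist   += diff * diff / var[k][d];  /* (o - mu)^T Sigma^{-1} (o - mu)        */
          }
          b += c[k] * exp(-0.5 * (D * log(TWO_PI) + logdet + dist));
      }
      return b;
  }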

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_{s_t}(o_t) with respect to each single mixture component b_{s_t k}(o_t):

    p(O, S|λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
              = Σ_K [ π_{s_1} c_{s_1 k_1} b_{s_1 k_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t) ]
              = Σ_K p(O, S, K|λ),

  where K = (k_1, k_2, ..., k_T) is one possible mixture-component sequence along the state sequence S, and therefore

    p(O|λ) = Σ_S Σ_K p(O, S, K|λ)

  Note: the expansion uses the identity
    Π_{t=1}^{T} ( Σ_{k=1}^{M} x_{tk} ) = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} x_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ') = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] log p(O, S, K|λ')

  with

    log p(O, S, K|λ') = log π'_{s_1} + Σ_{t=2}^{T} log a'_{s_{t-1} s_t} + Σ_{t=1}^{T} log b'_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c'_{s_t k_t}

  so that

    Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b') + Q_c(λ, c')

  (initial probabilities, state transition probabilities, Gaussian mixture component density functions, and mixture weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

    Q_b(λ, b') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b'_{jk}(o_t)

    Q_c(λ, c') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log c'_{jk}

  where γ_t(j, k) = P(s_t = j, k_t = k | O, λ) is the probability of being in state j at time t with the k-th mixture component accounting for o_t

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Maximizing Q_b with respect to the mean vectors: let γ_t(j, k) = P(s_t = j, k_t = k | O, λ) and

    log b'_{jk}(o_t) = -(L/2) log(2π) - (1/2) log|Σ'_{jk}| - (1/2) (o_t - μ'_{jk})^T (Σ'_{jk})^{-1} (o_t - μ'_{jk})

  Setting the derivative of Q_b with respect to μ'_{jk} to zero,

    ∂Q_b/∂μ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) (Σ'_{jk})^{-1} (o_t - μ'_{jk}) = 0

  (using d(x^T C x)/dx = (C + C^T) x and the symmetry of (Σ'_{jk})^{-1}) gives

    μ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximizing Q_b with respect to the covariance matrices: setting the derivative of Q_b with respect to Σ'_{jk} to zero, and using the matrix identities d(log det X)/dX = (X^{-1})^T and d(a^T X b)/dX = a b^T (with Σ'_{jk} symmetric), gives

    Σ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ'_{jk})(o_t - μ'_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

    μ'_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
            = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

    Σ'_{jk} = Σ_{t=1}^{T} γ_t(j, k) (o_t - μ'_{jk})(o_t - μ'_{jk})^T / Σ_{t=1}^{T} γ_t(j, k)

    c'_{jk} = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k'=1}^{M} γ_t(j, k')
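A minimal sketch (not from the original slides) of accumulating these weighted statistics for one state, assuming the posteriors gamma_t(j, k) are already available and the covariances are diagonal; all constants and names are illustrative.

  /* Sketch (not from the original slides): continuous-HMM updates for one state j,
   * given gamma[t][k] = gamma_t(j,k). Diagonal covariances are assumed. */
  #define T 100
  #define M 4
  #define D 39

  void update_state_gmm(const double gamma[T][M], const double o[T][D],
                        double c[M], double mu[M][D], double var[M][D])
  {
      double occ[M], occ_all = 0.0;
      int t, k, d;

      for (k = 0; k < M; k++) {                       /* zero the accumulators */
          occ[k] = 0.0;
          for (d = 0; d < D; d++) mu[k][d] = var[k][d] = 0.0;
      }
      for (t = 0; t < T; t++)
          for (k = 0; k < M; k++) {
              occ[k] += gamma[t][k];
              for (d = 0; d < D; d++) mu[k][d] += gamma[t][k] * o[t][d];
          }
      for (k = 0; k < M; k++) {
          occ_all += occ[k];
          if (occ[k] > 0.0)
              for (d = 0; d < D; d++) mu[k][d] /= occ[k];      /* weighted mean */
      }
      for (t = 0; t < T; t++)                                   /* weighted covariance */
          for (k = 0; k < M; k++)
              for (d = 0; d < D; d++) {
                  double diff = o[t][d] - mu[k][d];
                  var[k][d] += gamma[t][k] * diff * diff;
              }
      for (k = 0; k < M; k++) {
          if (occ[k] > 0.0)
              for (d = 0; d < D; d++) var[k][d] /= occ[k];
          c[k] = occ[k] / occ_all;                              /* mixture weight */
      }
  }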

Page 43: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 43

Probability Addition in F-B Algorithm

bull In Forward-backward algorithm operations usually implemented in logarithmic domain

bull Assume that we want to add and1P 2P

21

12

loglog221

loglog121

21

1logloglog

else1logloglog

if

PPbb

PPbb

bb

bb

bPPP

bPPP

PP

The values of can besaved in in a table to speedup the operations

xb b1log

P1

P2

P1 +P2logP1

logP2 log(P1+P2)

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 44: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 44

Probability Addition in F-B Algorithm (cont)

bull An example codedefine LZERO (-10E10) ~log(0) define LSMALL (-05E10) log values lt LSMALL are set to LZEROdefine minLogExp -log(-LZERO) ~=-23double LogAdd(double x double y)double tempdiffz if (xlty)

temp = x x = y y = tempdiff = y-x notice that diff lt= 0if (diffltminLogExp) if yrsquo is far smaller than xrsquo

return (xltLSMALL) LZEROxelsez = exp(diff)return x+log(10+z)

SP - Berlin Chen 45

Basic Problem 3 of HMMIntuitive View

bull How to adjust (re-estimate) the model parameter =(AB) to maximize P(O1hellip OL|) or logP(O1hellip OL |)ndash Belonging to a typical problem of ldquoinferential statisticsrdquondash The most difficult of the three problems because there is no known

analytical method that maximizes the joint probability of the training data in a close form

ndash The data is incomplete because of the hidden state sequencesndash Well-solved by the Baum-Welch (known as forward-backward)

algorithm and EM (Expectation-Maximization) algorithmbull Iterative update and improvementbull Based on Maximum Likelihood (ML) criterion

HMM theof sequence state possible a -HMM for the utterances training have that weSuppose-

loglog

loglog

1 1

121

S

SOSO

OOOO

S

L

PPP

PP

R

l alll

L

ll

L

llL

The ldquolog of sumrdquo form is difficult to deal with

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

Discrete case:
  b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)

Continuous case: cluster the observation vectors within each state j into a set of M clusters; then
  ŵ_jm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
  μ̂_jm = sample mean of the vectors classified in cluster m of state j
  Σ̂_jm = sample covariance matrix of the vectors classified in cluster m of state j

SP - Berlin Chen 62
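For the discrete-density case, Step 2 amounts to simple counting over the Viterbi-segmented frames. A minimal sketch, with hypothetical array names and fixed example sizes:

    #define N_STATES    3
    #define M_CODEWORDS 2

    /* Step 2 for a discrete-density HMM: given the Viterbi state segmentation
     * state_of[t] and the VQ codeword index code_of[t] of every training frame,
     * estimate b_j(k) = (#frames with codeword k in state j) / (#frames in state j). */
    void init_discrete_output_probs(int T, const int *state_of, const int *code_of,
                                    double b[N_STATES][M_CODEWORDS])
    {
        int count[N_STATES][M_CODEWORDS] = {{0}};
        int total[N_STATES] = {0};

        for (int t = 0; t < T; t++) {
            count[state_of[t]][code_of[t]]++;
            total[state_of[t]]++;
        }
        for (int j = 0; j < N_STATES; j++)
            for (int k = 0; k < M_CODEWORDS; k++)
                b[j][k] = (total[j] > 0) ? (double)count[j][k] / total[j] : 0.0;
    }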

Initialization of HMM (cont)

[Flowchart: Training Data and an Initial Model feed a State Sequence Segmentation step; the segmented observations are used to estimate the observation parameters via Segmental K-means, followed by Model Re-estimation; if the Model Convergence test fails (NO) the procedure loops back to segmentation, otherwise (YES) the Model Parameters are output]

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM – 3 states and 2 codewords (v1, v2); the segmentation gives
  • b1(v1)=3/4, b1(v2)=1/4
  • b2(v1)=1/3, b2(v2)=2/3
  • b3(v1)=2/3, b3(v2)=1/3
[Figure: a 3-state trellis over the observations O1–O10 (time 1–10); each frame is assigned to state s1, s2 or s3 by the segmentation and is labeled with codeword v1 or v2, from which the counts above are obtained]

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM – 3 states and 4 Gaussian mixtures per state
[Figure: a 3-state trellis over the observations O1–ON (time 1–N); the frames assigned to each state are clustered by K-means, starting from the global mean and splitting into cluster means, to obtain the 4 mixture components (mean, covariance and weight for mixtures 1–4) of every state]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)
• The assumptions of conventional HMMs in Speech Processing
  – The state duration follows an exponential (geometric) distribution: $d_i(t)=(a_{ii})^{t-1}(1-a_{ii})$
    • This doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination states
  – Output-independence assumption: all observation frames depend only on the state that generated them, not on neighboring observation frames
• Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66
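To spell the first limitation out, the self-loop structure implies a geometric duration model, whose mean follows directly (a short derivation, not on the original slide):

$d_i(t)=(a_{ii})^{t-1}(1-a_{ii}), \qquad E[d_i]=\sum_{t=1}^{\infty} t\,(a_{ii})^{t-1}(1-a_{ii})=\dfrac{1}{1-a_{ii}}$

For example, a self-transition probability $a_{ii}=0.8$ gives an expected state duration of $1/(1-0.8)=5$ frames, with the probability of longer stays decaying geometrically, which matches real phone-duration statistics poorly.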

Known Limitations of HMMs (2/3)
• Duration modeling
[Figure: alternative state-duration models – geometric/exponential distribution, empirical distribution, Gaussian distribution, Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)
• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized
[Figure: the likelihood plotted over the model configuration space, with the current model configuration sitting at a local optimum]

SP - Berlin Chen 68

Homework-2 (1/2)
[Figure: a 3-state ergodic HMM used as the initial model – each state has self-transition probability 0.34 and transition probability 0.33 to each of the other two states; the three states' initial observation probabilities are (A:0.34, B:0.33, C:0.33), (A:0.33, B:0.34, C:0.33) and (A:0.33, B:0.33, C:0.34)]

Train Set 1: 1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

Train Set 2: 1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: block diagram of isolated word recognition – the speech signal passes through Feature Extraction to give the feature sequence X; X is scored against every word model M1, M2, …, MV and a silence model MSil to obtain the likelihoods p(X|M1), p(X|M2), …, p(X|MV), p(X|MSil); a Most Likely Word Selector outputs the recognized label]

$\mathrm{Label}(\mathbf{X})=\arg\max_{k} p(\mathbf{X}\mid M_k)$

Viterbi approximation: $\mathrm{Label}(\mathbf{X})=\arg\max_{k}\,\max_{\mathbf{S}}\, p(\mathbf{X},\mathbf{S}\mid M_k)$

SP - Berlin Chen 71
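A minimal C sketch of the most-likely-word selector under the Viterbi approximation. The function viterbi_log_score is a hypothetical placeholder for the per-model dynamic-programming scorer, i.e. log max_S p(X, S | M_k), computed elsewhere.

    /* Most-likely-word selector: Label(X) = arg max_k p(X | M_k) under the
     * Viterbi approximation.  viterbi_log_score() is a hypothetical hook
     * returning log max_S p(X, S | M_k) for word model k.                 */
    extern double viterbi_log_score(const double *X, int T, int model_index);

    int most_likely_word(const double *X, int T, int V /* number of word models */)
    {
        int best = 0;
        double best_score = viterbi_log_score(X, T, 0);
        for (int k = 1; k < V; k++) {
            double score = viterbi_log_score(X, T, k);
            if (score > best_score) { best_score = score; best = k; }
        }
        return best;
    }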

Measures of ASR Performance (1/8)
• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example
  Correct: "the effect is clear"      Recognized: "effect is not clear"
  – Error analysis: one deletion ("the") and one insertion ("not")
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

  Word Error Rate = 100% × (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% × (0+1+1)/4 = 50%
  Word Correction Rate = 100% × (Matched words) / (No. of words in the correct sentence) = 100% × 3/4 = 75%
  Word Accuracy Rate = 100% × (Matched − Ins) / (No. of words in the correct sentence) = 100% × (3−1)/4 = 50%

  Note: WER + WAR = 100%; WER might be higher than 100%, and WAR might be negative.

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)
  – One index runs over the words of the correct/reference sentence (Ref, i) and the other over the words of the recognized/test sentence (Test, j); the grid entry at [i][j] keeps the minimum word-error alignment up to that point
[Figure: the alignment grid, showing the kinds of alignment steps (hit/match, substitution, insertion, deletion) that can enter a grid point]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen); here the test sentence is indexed by i (1..n) and the reference sentence by j (1..m)

  Step 1 (Initialization): G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1, B[i][0] = 1 (Insertion, Horizontal Direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1, B[0][j] = 2 (Deletion, Vertical Direction)

  Step 2 (Iteration): for i = 1..n (test), for j = 1..m (reference):
    G[i][j] = min{ G[i-1][j] + 1 (Insertion),
                   G[i][j-1] + 1 (Deletion),
                   G[i-1][j-1] + 1 if LT[i] ≠ LR[j] (Substitution),
                   G[i-1][j-1]     if LT[i] = LR[j] (Match) }
    B[i][j] records which move was taken: 1. Insertion (Horizontal Direction), 2. Deletion (Vertical Direction), 3. Substitution (Diagonal Direction), 4. Match (Diagonal Direction)

  Step 3 (Measure and Backtrace):
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path from B[n][m] to B[0][0]:
      if B[i][j] = 1, print LT[i] (Insertion) and go left;
      else if B[i][j] = 2, print LR[j] (Deletion) and go down;
      else print LR[j] (Hit/Match or Substitution) and go diagonally

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)
• A Dynamic Programming Algorithm – Initialization (HTK-style grid; each cell keeps a score plus ins/del/sub/hit counts)

[Figure: the DP grid with the correct/reference word sequence along one axis and the recognized/test word sequence along the other; a cell (i,j) is reached from (i-1,j-1), (i-1,j) or (i,j-1), and the cells of the first row and first column accumulate pure insertions and deletions]

  grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
  grid[0][0].sub = grid[0][0].hit = 0;
  grid[0][0].dir = NIL;

  for (i = 1; i <= n; i++) {          /* test axis */
      grid[i][0] = grid[i-1][0];
      grid[i][0].dir = HOR;
      grid[i][0].score += InsPen;
      grid[i][0].ins++;
  }
  for (j = 1; j <= m; j++) {          /* reference axis */
      grid[0][j] = grid[0][j-1];
      grid[0][j].dir = VERT;
      grid[0][j].score += DelPen;
      grid[0][j].del++;
  }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program (a self-contained, runnable version of this alignment is sketched after this slide)

  for (i = 1; i <= n; i++) {                         /* test */
      gridi = grid[i];  gridi1 = grid[i-1];
      for (j = 1; j <= m; j++) {                     /* reference */
          h = gridi1[j].score + insPen;
          d = gridi1[j-1].score;
          if (lRef[j] != lTest[i]) d += subPen;
          v = gridi[j-1].score + delPen;
          if (d <= h && d <= v) {                    /* DIAG = hit or sub */
              gridi[j] = gridi1[j-1];  gridi[j].score = d;  gridi[j].dir = DIAG;
              if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
          } else if (h < v) {                        /* HOR = ins */
              gridi[j] = gridi1[j];  gridi[j].score = h;  gridi[j].dir = HOR;  ++gridi[j].ins;
          } else {                                   /* VERT = del */
              gridi[j] = gridi[j-1];  gridi[j].score = v;  gridi[j].dir = VERT;  ++gridi[j].del;
          }
      }  /* for j */
  }  /* for i */

[Figure: the filled alignment grid in HTK style for Example 1; each cell records its accumulated (Ins, Del, Sub, Hit) counts, and the struct assignment in the program above is what copies these counts from the predecessor cell]

• Example 1
  Correct: A C B C C      Test: B A B C
  One optimal alignment: Ins B, Hit A, Del C, Hit B, Hit C, Del C  (there is still another optimal alignment)
  Alignment 1: WER = (1 Ins + 2 Del + 0 Sub) / 5 = 60%

SP - Berlin Chen 77
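Because the slide's fragments omit the surrounding declarations, the following is a compact, self-contained sketch of the same alignment (a sketch, not the HTK source): unit penalties, per-cell error counts, and a small main() that reproduces Example 1 above.

    #include <stdio.h>
    #include <string.h>

    #define MAXW 64

    typedef struct { int score, ins, del, sub, hit; } Cell;

    /* Align a recognized (test) word string against the reference string with
     * unit insertion/deletion/substitution penalties and report the WER.      */
    void align_and_score(char ref[][16], int m, char test[][16], int n)
    {
        static Cell g[MAXW + 1][MAXW + 1];
        memset(g, 0, sizeof g);

        for (int i = 1; i <= n; i++) {               /* test axis: insertions      */
            g[i][0] = g[i - 1][0];
            g[i][0].score += 1;  g[i][0].ins += 1;
        }
        for (int j = 1; j <= m; j++) {               /* reference axis: deletions  */
            g[0][j] = g[0][j - 1];
            g[0][j].score += 1;  g[0][j].del += 1;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int match = (strcmp(ref[j - 1], test[i - 1]) == 0);
                int d = g[i - 1][j - 1].score + (match ? 0 : 1);  /* hit or substitution */
                int h = g[i - 1][j].score + 1;                    /* insertion           */
                int v = g[i][j - 1].score + 1;                    /* deletion            */
                if (d <= h && d <= v) {
                    g[i][j] = g[i - 1][j - 1];  g[i][j].score = d;
                    if (match) g[i][j].hit++; else g[i][j].sub++;
                } else if (h < v) {
                    g[i][j] = g[i - 1][j];  g[i][j].score = h;  g[i][j].ins++;
                } else {
                    g[i][j] = g[i][j - 1];  g[i][j].score = v;  g[i][j].del++;
                }
            }
        }
        /* the counts correspond to one of the (possibly several) optimal alignments */
        Cell r = g[n][m];
        printf("Sub=%d Del=%d Ins=%d Hit=%d  WER=%.1f%%\n",
               r.sub, r.del, r.ins, r.hit, 100.0 * (r.sub + r.del + r.ins) / m);
    }

    int main(void)
    {
        /* Example 1: Correct = A C B C C, Test = B A B C  ->  WER 60% */
        char ref[][16]  = { "A", "C", "B", "C", "C" };
        char test[][16] = { "B", "A", "B", "C" };
        align_and_score(ref, 5, test, 4);
        return 0;
    }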

Measures of ASR Performance (7/8)
[Figure: the filled alignment grid in HTK style for Example 2; each cell records its accumulated (Ins, Del, Sub, Hit) counts]

• Example 2
  Correct: A C B C C      Test: B A A C
  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   →  WER = 4/5 = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   →  WER = 4/5 = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          →  WER = 4/5 = 80%

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)
• Two common settings of different penalties for substitution, deletion and insertion errors
  – HTK error penalties: subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance
  – Reference (per-character lines; the numeric fields of each line are omitted here): 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  – ASR Output (same format): 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)
[Figure: two illustrations of the estimation problem. Left: the two-bottle example – A and B are bottles, the observed data O is the "ball sequence" o1 o2 …… oT, the latent data S is the "bottle sequence", and the parameters λ to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B). Right: a 3-state HMM (s1, s2, s3) with transition probabilities such as 0.7, 0.6, 0.3, 0.2, 0.1 and state observation probabilities (A:.3, B:.2, C:.5), (A:.7, B:.1, C:.2), (A:.3, B:.6, C:.1); re-estimation produces λ̄ with p(O|λ̄) > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)
• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence
  – Two Major Steps:
    • E: take the expectation $E_{\mathbf{S}}[\,\cdot\mid\mathbf{O},\lambda]$ with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimation of the parameters according to Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)
• Estimation principle based on observations $\mathbf{X}=(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n)$ (ML and MAP):
  – The Maximum Likelihood (ML) Principle: find the model parameter $\Phi$ so that the likelihood $p(\mathbf{X}\mid\Phi)$ is maximum. For example, if $\Phi=\{\mu,\Sigma\}$ are the parameters of a multivariate normal distribution and $\mathbf{X}$ is i.i.d. (independent, identically distributed), then the ML estimates are
    $\mu_{ML}=\dfrac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i, \qquad \Sigma_{ML}=\dfrac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\mu_{ML})(\mathbf{x}_i-\mu_{ML})^{T}$
  – The Maximum A Posteriori (MAP) Principle: find the model parameter $\Phi$ so that the posterior $p(\Phi\mid\mathbf{X})$ is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)
• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O}\mid\lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O},\mathbf{S}\mid\lambda)$
• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data $\mathbf{O}$
    • We want to maximize $P(\mathbf{O}\mid\lambda)$; $\lambda$ is a parameter vector
  – The hidden (unobservable) data $\mathbf{S}$
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)
  – Assume we have $\lambda$ and estimate the probability that each $\mathbf{S}$ occurred in the generation of $\mathbf{O}$
  – Pretend we had in fact observed a complete data pair $(\mathbf{O},\mathbf{S})$ with frequency proportional to the probability $P(\mathbf{O},\mathbf{S}\mid\lambda)$, and from it compute a new $\bar{\lambda}$, the maximum likelihood estimate of $\lambda$
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression and expectation taken over $\mathbf{S}$ (Bayes' rule):
      $P(\mathbf{O},\mathbf{S}\mid\bar{\lambda}) = P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})\,P(\mathbf{O}\mid\bar{\lambda})$   (complete-data likelihood; $\bar{\lambda}$ is the unknown model setting)
      $\log P(\mathbf{O}\mid\bar{\lambda}) = \log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda}) - \log P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})$   (incomplete-data likelihood)
      Taking the expectation over $\mathbf{S}$ with respect to $P(\mathbf{S}\mid\mathbf{O},\lambda)$:
      $\log P(\mathbf{O}\mid\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda}) - \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})$

SP - Berlin Chen 87

The EM Algorithm (6/7)
  – Algorithm (cont.)
    • We can thus express $\log P(\mathbf{O}\mid\bar{\lambda})$ as follows:
      $\log P(\mathbf{O}\mid\bar{\lambda}) = Q(\lambda,\bar{\lambda}) - H(\lambda,\bar{\lambda})$
      where
      $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$
      $H(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})$
    • We want $\log P(\mathbf{O}\mid\bar{\lambda}) \ge \log P(\mathbf{O}\mid\lambda)$, i.e.
      $Q(\lambda,\bar{\lambda}) - H(\lambda,\bar{\lambda}) \ge Q(\lambda,\lambda) - H(\lambda,\lambda)$

SP - Berlin Chen 88

The EM Algorithm (7/7)
• $H(\lambda,\bar{\lambda})$ has the following property:
  $H(\lambda,\bar{\lambda}) - H(\lambda,\lambda) = \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log\dfrac{P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})}{P(\mathbf{S}\mid\mathbf{O},\lambda)} \;\le\; \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\left(\dfrac{P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})}{P(\mathbf{S}\mid\mathbf{O},\lambda)} - 1\right) = 0$
  (using $\log x \le x-1$, i.e. Jensen's inequality; $H(\lambda,\lambda)-H(\lambda,\bar{\lambda})$ is a Kullback-Leibler (KL) distance)
  – Therefore, for maximizing $\log P(\mathbf{O}\mid\bar{\lambda})$ we only need to maximize the Q-function (auxiliary function)
    $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$
    (the expectation of the complete-data log likelihood with respect to the latent state sequences)

SP - Berlin Chen 89
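Putting the last two slides together, the monotonicity argument can be stated in one chain (a restatement of the property above, not an additional assumption):

$\log P(\mathbf{O}\mid\bar{\lambda}) - \log P(\mathbf{O}\mid\lambda) = \big[Q(\lambda,\bar{\lambda}) - Q(\lambda,\lambda)\big] + \big[H(\lambda,\lambda) - H(\lambda,\bar{\lambda})\big] \;\ge\; Q(\lambda,\bar{\lambda}) - Q(\lambda,\lambda)$

since $H(\lambda,\lambda)-H(\lambda,\bar{\lambda})$ is a Kullback-Leibler distance and hence non-negative. Any $\bar{\lambda}$ that does not decrease the Q-function therefore cannot decrease the log-likelihood.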

EM Applied to Discrete HMM Training (1/5)
• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda=(\mathbf{A},\mathbf{B},\pi)$
  – By maximizing the auxiliary function
    $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda}) = \sum_{\mathbf{S}} \dfrac{P(\mathbf{O},\mathbf{S}\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$
  – Where $P(\mathbf{O},\mathbf{S}\mid\lambda)$ and $\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$ can be expressed as
    $P(\mathbf{O},\mathbf{S}\mid\lambda) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(o_t)$
    $\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)
• Rewrite the auxiliary function as $Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\pi}) + Q_{\mathbf{a}}(\lambda,\bar{\mathbf{a}}) + Q_{\mathbf{b}}(\lambda,\bar{\mathbf{b}})$, where
  $Q_{\pi}(\lambda,\bar{\pi}) = \sum_{i=1}^{N} \dfrac{P(\mathbf{O}, s_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log \bar{\pi}_i$
  $Q_{\mathbf{a}}(\lambda,\bar{\mathbf{a}}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1} \dfrac{P(\mathbf{O}, s_t=i, s_{t+1}=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log \bar{a}_{ij}$
  $Q_{\mathbf{b}}(\lambda,\bar{\mathbf{b}}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\;\sum_{t:\,o_t=v_k} \dfrac{P(\mathbf{O}, s_t=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log \bar{b}_j(v_k)$
  – Each term is a weighted sum of logarithms of the form $\sum_i w_i \log y_i$

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)
• The auxiliary function contains three independent terms, in $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(k)$
  – They can be maximized individually
  – All are of the same form:
    $F(\mathbf{y}) = g(y_1,y_2,\ldots,y_N) = \sum_{j=1}^{N} w_j \log y_j$,  where $y_j \ge 0$ and $\sum_{j=1}^{N} y_j = 1$,
    has its maximum value when $y_j = \dfrac{w_j}{\sum_{j=1}^{N} w_j}$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)
• Proof: apply a Lagrange multiplier
  By applying the Lagrange multiplier $\ell$ with the constraint $\sum_{j=1}^{N} y_j = 1$, suppose that
  $F = \sum_{j=1}^{N} w_j \log y_j + \ell\Big(\sum_{j=1}^{N} y_j - 1\Big)$
  $\dfrac{\partial F}{\partial y_j} = \dfrac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\, y_j \;\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; y_j = \dfrac{w_j}{\sum_{j=1}^{N} w_j}$

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)
• The new model parameter set $\bar{\lambda}=(\bar{\mathbf{A}},\bar{\mathbf{B}},\bar{\pi})$ can be expressed as
  $\bar{\pi}_i = \dfrac{P(\mathbf{O}, s_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \gamma_1(i)$
  $\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t=i, s_{t+1}=j\mid\lambda)\,/\,P(\mathbf{O}\mid\lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t=i\mid\lambda)\,/\,P(\mathbf{O}\mid\lambda)} = \dfrac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$
  $\bar{b}_i(v_k) = \dfrac{\sum_{t=1,\;o_t=v_k}^{T} P(\mathbf{O}, s_t=i\mid\lambda)\,/\,P(\mathbf{O}\mid\lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, s_t=i\mid\lambda)\,/\,P(\mathbf{O}\mid\lambda)} = \dfrac{\sum_{t=1,\;o_t=v_k}^{T}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}$

SP - Berlin Chen 94
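As a concrete reading of these update formulas, the sketch below re-estimates A and B for a single utterance. It assumes, for illustration only, that the unscaled forward/backward variables alpha, beta and the total likelihood prob = P(O|λ) are already available; a practical implementation would work with scaled or log-domain quantities.

    #define N    3      /* number of states           (example value) */
    #define M    2      /* number of discrete symbols (example value) */
    #define TMAX 100

    /* One Baum-Welch M-step for a discrete HMM, single utterance.
     * alpha[t][i], beta[t][i] : forward/backward variables (unscaled)
     * a[i][j], b[i][k]        : current model parameters
     * o[t]                    : observed symbol indices, t = 0..T-1
     * prob                    : P(O | lambda) = sum_i alpha[T-1][i]     */
    void reestimate_discrete(int T, const int *o, double prob,
                             double alpha[TMAX][N], double beta[TMAX][N],
                             double a[N][N], double b[N][M],
                             double a_new[N][N], double b_new[N][M])
    {
        double gamma[TMAX][N];

        for (int t = 0; t < T; t++)                    /* gamma_t(i) = alpha*beta / P(O) */
            for (int i = 0; i < N; i++)
                gamma[t][i] = alpha[t][i] * beta[t][i] / prob;

        for (int i = 0; i < N; i++) {
            double denom = 0.0;
            for (int t = 0; t < T - 1; t++) denom += gamma[t][i];
            for (int j = 0; j < N; j++) {
                double numer = 0.0;                    /* sum_t xi_t(i,j)                */
                for (int t = 0; t < T - 1; t++)
                    numer += alpha[t][i] * a[i][j] * b[j][o[t + 1]] * beta[t + 1][j] / prob;
                a_new[i][j] = (denom > 0.0) ? numer / denom : 0.0;
            }
            double total = denom + gamma[T - 1][i];    /* sum of gamma over all T frames */
            for (int k = 0; k < M; k++) {
                double numer = 0.0;
                for (int t = 0; t < T; t++)
                    if (o[t] == k) numer += gamma[t][i];
                b_new[i][k] = (total > 0.0) ? numer / total : 0.0;
            }
        }
    }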

EM Applied to Continuous HMM Training (1/7)
• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
    $b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o};\mu_{jk},\Sigma_{jk}) = \sum_{k=1}^{M} \dfrac{c_{jk}}{\sqrt{(2\pi)^{L}\,|\Sigma_{jk}|}}\exp\!\Big[-\tfrac{1}{2}(\mathbf{o}-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}-\mu_{jk})\Big]$,
    with $\sum_{k=1}^{M} c_{jk} = 1$ (L here denotes the feature dimensionality)
[Figure: the distribution for a state i drawn as a mixture of three Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3]

SP - Berlin Chen 95
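A small sketch of evaluating the continuous state output probability b_j(o) = Σ_k c_jk N(o; μ_jk, Σ_jk). Diagonal covariances, the dimensions, and the array names are simplifying assumptions made here for illustration, not something specified on the slide.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define DIM 39   /* feature dimensionality (example value)                 */
    #define MIX 4    /* number of Gaussian mixtures per state (example value)  */

    /* log b_j(o) for one state: a Gaussian mixture with diagonal covariances.
     * c[k]      : mixture weights (summing to 1)
     * mu[k][d]  : mixture means
     * var[k][d] : diagonal variances
     * o[d]      : observation vector                                           */
    double log_state_output_prob(const double o[DIM], const double c[MIX],
                                 const double mu[MIX][DIM], const double var[MIX][DIM])
    {
        double prob = 0.0;
        for (int k = 0; k < MIX; k++) {
            double logg = -0.5 * DIM * log(2.0 * M_PI);   /* log N(o; mu_jk, Sigma_jk) */
            for (int d = 0; d < DIM; d++) {
                double diff = o[d] - mu[k][d];
                logg -= 0.5 * (log(var[k][d]) + diff * diff / var[k][d]);
            }
            prob += c[k] * exp(logg);                     /* weight by c_jk            */
        }
        return log(prob);
    }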

EM Applied to Continuous HMM Training (2/7)
• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$:
  $p(\mathbf{O},\mathbf{S}\mid\lambda) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(\mathbf{o}_t) = \sum_{k_1=1}^{M}\cdots\sum_{k_T=1}^{M}\; \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$
  so that
  $p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\lambda) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$,
  where $\mathbf{K}=(k_1,k_2,\ldots,k_T)$ is one of the possible mixture component sequences along with the state sequence $\mathbf{S}$, and
  $p(\mathbf{O}\mid\lambda) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\lambda)$
  – Note: the interchange of product and sum uses $\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t k} = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)
• Therefore, an auxiliary function for the EM algorithm can be written as
  $Q(\lambda,\bar{\lambda}) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} P(\mathbf{S},\mathbf{K}\mid\mathbf{O},\lambda)\,\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\lambda}) = \sum_{\mathbf{S}}\sum_{\mathbf{K}} \dfrac{p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\lambda)}{p(\mathbf{O}\mid\lambda)}\,\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\lambda})$
  with
  $\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1}\log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T}\log \bar{b}_{s_t k_t}(\mathbf{o}_t) + \sum_{t=1}^{T}\log \bar{c}_{s_t k_t}$
  so that $Q = Q_{\pi}(\lambda,\bar{\pi}) + Q_{\mathbf{a}}(\lambda,\bar{\mathbf{a}}) + Q_{\mathbf{b}}(\lambda,\bar{\mathbf{b}}) + Q_{\mathbf{c}}(\lambda,\bar{\mathbf{c}})$
  (initial probabilities, state transition probabilities, mixture-component Gaussian density functions, and mixture weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)
• The only difference compared with discrete HMM training:
  $Q_{\mathbf{b}}(\lambda,\bar{\mathbf{b}}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j, k_t=k\mid\mathbf{O},\lambda)\,\log \bar{b}_{jk}(\mathbf{o}_t)$
  $Q_{\mathbf{c}}(\lambda,\bar{\mathbf{c}}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j, k_t=k\mid\mathbf{O},\lambda)\,\log \bar{c}_{jk}$
  where $P(s_t=j, k_t=k\mid\mathbf{O},\lambda) = \gamma_t(j,k)$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)
• Let $\gamma_t(j,k) = P(s_t=j, k_t=k\mid\mathbf{O},\lambda)$ and write
  $\log \bar{b}_{jk}(\mathbf{o}_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\Sigma}_{jk}| - \tfrac{1}{2}(\mathbf{o}_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk})$
  Setting the derivative of $Q_{\mathbf{b}}$ with respect to the mean to zero (using $\frac{\partial\,\mathbf{x}^{T}\mathbf{C}\mathbf{x}}{\partial \mathbf{x}} = (\mathbf{C}+\mathbf{C}^{T})\mathbf{x} = 2\mathbf{C}\mathbf{x}$, since $\bar{\Sigma}_{jk}^{-1}$ is symmetric here):
  $\dfrac{\partial Q_{\mathbf{b}}}{\partial \bar{\mu}_{jk}} = \sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk}) = 0 \;\Rightarrow\; \bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)
• Similarly, setting the derivative of $Q_{\mathbf{b}}$ with respect to the covariance to zero (using $\frac{\partial \log|\mathbf{X}|}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^{T}$ and $\frac{\partial\,\mathbf{a}^{T}\mathbf{X}^{-1}\mathbf{b}}{\partial \mathbf{X}} = -(\mathbf{X}^{-1})^{T}\mathbf{a}\mathbf{b}^{T}(\mathbf{X}^{-1})^{T}$, with $\bar{\Sigma}_{jk}$ symmetric here):
  $\dfrac{\partial Q_{\mathbf{b}}}{\partial \bar{\Sigma}_{jk}} = \tfrac{1}{2}\sum_{t=1}^{T}\gamma_t(j,k)\Big[-\bar{\Sigma}_{jk}^{-1} + \bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}\Big] = 0$
  $\Rightarrow\; \bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)
• The new model parameter set for each mixture component and mixture weight can be expressed as
  $\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} p(s_t=j, k_t=k\mid\mathbf{O},\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T} p(s_t=j, k_t=k\mid\mathbf{O},\lambda)} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  $\bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$
  $\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)}$

SP - Berlin Chen 46

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Hard Assignmentndash Given the data follow a multinomial distribution

State S1

P(B| S1)=24=05

P(W| S1)=24=05

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMM
• A good initialization of HMM training: Segmental K-Means Segmentation into States
– Assume that we have a training set of observations and an initial estimate of all model parameters
– Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi Algorithm)
– Step 2:
• For discrete density HMM (using an M-codeword codebook):
$$\hat{b}_j(k)=\frac{\text{number of vectors with codebook index }k\text{ in state }j}{\text{number of vectors in state }j}$$
• For continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state into a set of M clusters, then
$$\hat{w}_{jm}=\frac{\text{number of vectors classified in cluster }m\text{ of state }j}{\text{number of vectors in state }j}$$
$$\hat{\mu}_{jm}=\text{sample mean of the vectors classified in cluster }m\text{ of state }j$$
$$\hat{\Sigma}_{jm}=\text{sample covariance matrix of the vectors classified in cluster }m\text{ of state }j$$
– Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop: the initial model is generated

(Figure: a 3-state left-to-right HMM, s1–s2–s3.)

SP - Berlin Chen 62
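The counting in Step 2 for the discrete case can be written in a few lines. This is a hedged sketch assuming the Viterbi state alignment state[t] and the codebook index code[t] of every training vector are already available; the array layout and function name are illustrative assumptions.

/* Re-estimate b_j(k) by relative frequency after Viterbi segmentation (Step 2, discrete case). */
void estimate_discrete_b(int T, const int *state, const int *code,
                         int N, int M, double b[N][M])
{
    int count[N][M];
    int total[N];
    for (int j = 0; j < N; j++) { total[j] = 0; for (int k = 0; k < M; k++) count[j][k] = 0; }

    for (int t = 0; t < T; t++) {          /* count (state, codeword) co-occurrences */
        count[state[t]][code[t]]++;
        total[state[t]]++;
    }
    for (int j = 0; j < N; j++)
        for (int k = 0; k < M; k++)
            b[j][k] = total[j] > 0 ? (double)count[j][k] / total[j] : 1.0 / M;
}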

Initialization of HMM (cont.)

(Flowchart: Training Data → Initial Model → State Sequence Segmentation → Estimate parameters of Observation via Segmental K-means → Model Convergence? If NO, re-estimate the model and loop back to segmentation; if YES, output the Model Parameters.)

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for a discrete HMM
– 3 states and 2 codewords
• b1(v1)=3/4, b1(v2)=1/4
• b2(v1)=1/3, b2(v2)=2/3
• b3(v1)=2/3, b3(v2)=1/3

(Figure: a trellis over the observations O1–O10 (time 1–10) with states s1, s2, s3 at each time; the segmented state sequence determines, for every observation, whether it is quantized to codeword v1 or v2, and the counts above follow from that segmentation.)

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for a continuous HMM
– 3 states and 4 Gaussian mixtures per state

(Figure: a trellis over the observations O1–ON with states s1, s2, s3; within each state the assigned vectors are clustered by K-means, starting from the global mean and splitting into cluster 1 mean, cluster 2 mean, and so on, which initialize the four mixture components of that state (annotated in the original figure as mixtures (1,1), (1,2), (1,3), (1,4)).)

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in Speech Processing
– The state duration follows an exponential (geometric) distribution
$$d_i(t)=a_{ii}^{\,t-1}(1-a_{ii})$$
• This doesn't provide an adequate representation of the temporal structure of speech
– First-order (Markov) assumption: the state transition depends only on the origin and destination
– Output-independent assumption: all observation frames are dependent on the state that generated them, not on neighboring observation frames
⇒ Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications

SP - Berlin Chen 66
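A quick numeric check of the implied duration model: the sketch below (illustrative, not from the slides) evaluates d_i(t) = a_ii^(t-1)(1 − a_ii) and its mean 1/(1 − a_ii) for an assumed self-loop probability.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a_ii = 0.8;                        /* assumed self-loop probability        */
    double mean = 1.0 / (1.0 - a_ii);         /* expected state duration = 5 frames   */

    printf("expected duration: %.1f frames\n", mean);
    for (int t = 1; t <= 10; t++) {
        double d = pow(a_ii, t - 1) * (1.0 - a_ii);   /* P(stay exactly t frames)     */
        printf("d(%2d) = %.4f\n", t, d);
    }
    return 0;
}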

Known Limitations of HMMs (2/3)

• Duration modeling

(Figure: four candidate state-duration distributions compared — the geometric/exponential distribution implied by the standard HMM, an empirical distribution, a Gaussian distribution, and a Gamma distribution.)

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized

(Figure: the likelihood plotted over the model configuration space, with the current model configuration sitting at a local maximum.)

SP - Berlin Chen 68

Homework-2 (1/2)

(Figure: a fully-connected 3-state HMM (s1, s2, s3). Each state has self-transition probability 0.34 and transition probability 0.33 to each of the other two states. Initial observation probabilities: s1: A:0.34, B:0.33, C:0.33; s2: A:0.33, B:0.34, C:0.33; s3: A:0.33, B:0.33, C:0.34.)

Train Set 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

Train Set 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70
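For the recognition parts of the homework, the class of a test sequence is the model giving the larger likelihood P(O|λ). A hedged C sketch of the forward algorithm for such a discrete HMM is shown below; the 3-state, 3-symbol sizes mirror the homework setup, but the function itself and its scaling scheme are illustrative assumptions.

#include <math.h>

#define N 3   /* states  */
#define K 3   /* symbols: 0=A, 1=B, 2=C */

/* log P(O | lambda) by the forward algorithm for a discrete HMM */
double forward_loglik(int T, const int *obs,
                      const double pi[N], const double a[N][N], const double b[N][K])
{
    double alpha[N], next[N], loglik = 0.0;

    /* initialization: alpha_1(i) = pi_i * b_i(o_1) */
    for (int i = 0; i < N; i++) alpha[i] = pi[i] * b[i][obs[0]];

    for (int t = 1; t < T; t++) {
        double scale = 0.0;
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int i = 0; i < N; i++) s += alpha[i] * a[i][j];   /* induction step */
            next[j] = s * b[j][obs[t]];
            scale += next[j];
        }
        /* normalize to avoid underflow and accumulate the log of the scaling factor */
        loglik += log(scale);
        for (int j = 0; j < N; j++) alpha[j] = next[j] / scale;
    }

    double tail = 0.0;
    for (int i = 0; i < N; i++) tail += alpha[i];   /* termination: sum over final states */
    return loglik + log(tail);
}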

Isolated Word Recognition

(Block diagram: the speech signal goes through Feature Extraction to give the feature sequence X; X is scored against every word model — p(X|M_1), p(X|M_2), ..., p(X|M_V), and the silence model p(X|M_Sil) — and the Most Likely Word Selector outputs the label.)

$$\text{Label}(X)=\arg\max_{k}\,p(X|M_k)$$

Viterbi Approximation:

$$\text{Label}(X)=\arg\max_{k}\,\max_{S}\,p(X,S|M_k)$$

SP - Berlin Chen 71
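The final selection step is just an argmax over the per-model scores. The short C sketch below assumes the (Viterbi-approximated) log likelihoods have already been computed for each word model; the array name and function are illustrative.

/* pick the word whose model gives the highest (Viterbi-approximated) log likelihood */
int best_word(int V, const double *loglik)      /* loglik[k] = max_S log p(X, S | M_k) */
{
    int best = 0;
    for (int k = 1; k < V; k++)
        if (loglik[k] > loglik[best]) best = k;
    return best;                                 /* index of the most likely word       */
}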

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures

• There are typically three types of word recognition errors
– Substitution
• An incorrect word was substituted for the correct word
– Deletion
• A correct word was omitted in the recognized sentence
– Insertion
• An extra word was added in the recognized sentence

• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Can be handled by dynamic programming

• Example:
Correct: "the effect is clear"
Recognized: "effect is not clear"
– Error analysis: one deletion ("the") and one insertion ("not")
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

$$\text{Word Error Rate}=100\%\times\frac{\text{Sub}+\text{Del}+\text{Ins}}{\text{No. of words in the correct sentence}}=100\%\times\frac{2}{4}=50\%$$

$$\text{Word Correction Rate}=100\%\times\frac{\text{Matched words}}{\text{No. of words in the correct sentence}}=100\%\times\frac{3}{4}=75\%$$

$$\text{Word Accuracy Rate}=100\%\times\frac{\text{Matched words}-\text{Ins words}}{\text{No. of words in the correct sentence}}=100\%\times\frac{3-1}{4}=50\%$$

Note: WER + WAR = 100%; the WER might be higher than 100%, and the WAR might be negative.

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)

– n denotes the word length of the correct/reference sentence; m denotes the word length of the recognized/test sentence
– G[i][j] is the minimum word error alignment at the grid point [i, j] (Ref word i, Test word j); each grid point can be reached by four kinds of alignment: insertion, deletion, substitution, or hit (match)

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen)

Step 1: Initialization
G[0][0] = 0
for i = 1..n (test):      G[i][0] = G[i-1][0] + 1, B[i][0] = 1 (Insertion, Horizontal Direction)
for j = 1..m (reference): G[0][j] = G[0][j-1] + 1, B[0][j] = 2 (Deletion, Vertical Direction)

Step 2: Iteration
for i = 1..n (test), for j = 1..m (reference):
G[i][j] = min { G[i-1][j] + 1 (Insertion), G[i][j-1] + 1 (Deletion),
                G[i-1][j-1] + 1 if LT[i] ≠ LR[j] (Substitution), G[i-1][j-1] if LT[i] = LR[j] (Match) }
B[i][j] = 1 (Insertion, Horizontal Direction), 2 (Deletion, Vertical Direction), 3 (Substitution, Diagonal Direction), or 4 (Match, Diagonal Direction), according to which term attains the minimum

Step 3: Measure and Backtrace
Word Error Rate = 100% × G[n][m] / m
Word Accuracy Rate = 100% − Word Error Rate
Optimal backtrace path: from B[n][m] back to B[0][0]; if B[i][j] = 1 print LT[i] (Insertion) and go left; else if B[i][j] = 2 print LR[j] (Deletion) and go down; else print LR[j] (Hit/Match or Substitution) and go down diagonally

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here. (Ref: j, Test: i.)

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

(Figure: the alignment grid, with the correct/reference word sequence 1..m on the vertical axis (j) and the recognized/test word sequence 1..n on the horizontal axis (i); horizontal moves from (i-1,j) are insertions, vertical moves from (i,j-1) are deletions, and diagonal moves from (i-1,j-1) are substitutions or hits.)

• A Dynamic Programming Algorithm (HTK)
– Initialization:

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;

for (i=1; i<=n; i++) {   /* test */
  grid[i][0] = grid[i-1][0];  grid[i][0].dir = HOR;
  grid[i][0].score += InsPen;  grid[i][0].ins++;
}
for (j=1; j<=m; j++) {   /* reference */
  grid[0][j] = grid[0][j-1];  grid[0][j].dir = VERT;
  grid[0][j].score += DelPen;  grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program:

for (i=1; i<=n; i++) {  /* test */  gridi = grid[i];  gridi1 = grid[i-1];
  for (j=1; j<=m; j++) {  /* reference */
    h = gridi1[j].score + insPen;
    d = gridi1[j-1].score;  if (lRef[j] != lTest[i]) d += subPen;
    v = gridi[j-1].score + delPen;
    if (d <= h && d <= v) {        /* DIAG = hit or sub */
      gridi[j] = gridi1[j-1];  /* structure assignment */  gridi[j].score = d;  gridi[j].dir = DIAG;
      if (lRef[j] == lTest[i]) ++gridi[j].hit;  else ++gridi[j].sub;
    } else if (h < v) {            /* HOR = ins */
      gridi[j] = gridi1[j];   /* structure assignment */  gridi[j].score = h;  gridi[j].dir = HOR;   ++gridi[j].ins;
    } else {                       /* VERT = del */
      gridi[j] = gridi[j-1];  /* structure assignment */  gridi[j].score = v;  gridi[j].dir = VERT;  ++gridi[j].del;
    }
  }  /* for j */
}    /* for i */

• Example 1 (HTK):
Correct: A C B C C
Test:    B A B C
Optimal alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C → WER = 3/5 = 60%

(Figure: the filled alignment grid of (Ins, Del, Sub, Hit) counts for this example; another optimal alignment with the same score also exists.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (HTK):
Correct: A C B C C
Test:    B A A C
Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C — WER = 4/5 = 80%
Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C — WER = 80%
Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C — WER = 80%

(Figure: the filled alignment grid of (Ins, Del, Sub, Hit) counts for this example.)

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors:

HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first 1, 100, 200, and 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

(Figure: two bottles A and B containing red (R) and green (G) balls; a bottle is chosen, a ball is drawn, and only the ball sequence o1 o2 ... oT is observed.)

Observed data O: "ball sequence"; Latent data S: "bottle sequence"
Parameters to be estimated to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

(Figure: a 3-state HMM example with emission probabilities {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1} and transition probabilities 0.6, 0.7, 0.3, 0.2, 0.1; given o1 o2 ... oT, EM re-estimates λ → λ̄ so that p(O|λ̄) > p(O|λ).)

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
• Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data (in our case here, the state sequence is the latent data)
• Direct access to the data necessary to estimate the parameters is impossible or difficult (in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence)
– Two Major Steps:
• E: take the expectation with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations, E[ · | O, λ]
• M: provide a new estimation of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = x_1, x_2, ..., x_n:

– The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum. For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

$$\mu_{ML}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i,\qquad \Sigma_{ML}=\frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\mu_{ML})(\mathbf{x}_i-\mu_{ML})^{T}$$

– The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior likelihood p(Φ|X) ∝ p(X|Φ)p(Φ) is maximum

ML and MAP

SP - Berlin Chen 85
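As a minimal numeric illustration of these ML formulas (a sketch with assumed dimensions and names, not part of the lecture), the following C fragment computes the sample mean and full covariance of n i.i.d. vectors.

#define D 2   /* vector dimension (illustrative) */

/* ML estimates: mu = (1/n) * sum x_i ;  Sigma = (1/n) * sum (x_i - mu)(x_i - mu)^T */
void ml_gaussian(int n, const double x[][D], double mu[D], double sigma[D][D])
{
    for (int d = 0; d < D; d++) mu[d] = 0.0;
    for (int i = 0; i < n; i++)
        for (int d = 0; d < D; d++) mu[d] += x[i][d] / n;

    for (int r = 0; r < D; r++)
        for (int c = 0; c < D; c++) sigma[r][c] = 0.0;
    for (int i = 0; i < n; i++)
        for (int r = 0; r < D; r++)
            for (int c = 0; c < D; c++)
                sigma[r][c] += (x[i][r] - mu[r]) * (x[i][c] - mu[c]) / n;
}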

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,S|λ)

• Firstly, scalar (discrete) random variables are used to introduce the EM algorithm
– The observable training data O
• We want to maximize P(O|λ); λ is a parameter vector
– The hidden (unobservable) data S
• E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have the current model λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O,S), with frequency proportional to the probability P(S|O,λ), and compute a new λ̄, the maximum likelihood estimate of the parameters
– Does the process converge?

– Algorithm:
• Log-likelihood expression and expectation taken over S (Bayes' rule; P(O,S|λ̄) is the complete-data likelihood, P(O|λ̄) the incomplete-data likelihood, and λ̄ the unknown model setting):

$$P(S,O|\bar\lambda)=P(S|O,\bar\lambda)\,P(O|\bar\lambda)\ \Rightarrow\ \log P(O|\bar\lambda)=\log P(O,S|\bar\lambda)-\log P(S|O,\bar\lambda)$$

Taking the expectation over S with respect to $P(S|O,\lambda)$ (the current model) gives

$$\log P(O|\bar\lambda)=\sum_{S}P(S|O,\lambda)\log P(O,S|\bar\lambda)-\sum_{S}P(S|O,\lambda)\log P(S|O,\bar\lambda)$$

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.):
• We can thus express $\log P(O|\bar\lambda)$ as follows:

$$\log P(O|\bar\lambda)=Q(\lambda,\bar\lambda)-H(\lambda,\bar\lambda)$$

where

$$Q(\lambda,\bar\lambda)=\sum_{S}P(S|O,\lambda)\log P(O,S|\bar\lambda),\qquad H(\lambda,\bar\lambda)=\sum_{S}P(S|O,\lambda)\log P(S|O,\bar\lambda)$$

• We want $\log P(O|\bar\lambda)\ge\log P(O|\lambda)$:

$$\log P(O|\bar\lambda)-\log P(O|\lambda)=\big[Q(\lambda,\bar\lambda)-Q(\lambda,\lambda)\big]+\big[H(\lambda,\lambda)-H(\lambda,\bar\lambda)\big]$$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda,\bar\lambda)$ has the following property:

$$H(\lambda,\bar\lambda)-H(\lambda,\lambda)=\sum_{S}P(S|O,\lambda)\log\frac{P(S|O,\bar\lambda)}{P(S|O,\lambda)}\le\sum_{S}P(S|O,\lambda)\left(\frac{P(S|O,\bar\lambda)}{P(S|O,\lambda)}-1\right)=0$$

using $\log x\le x-1$ (Jensen's inequality; equivalently, the non-negativity of the Kullback-Leibler (KL) distance).

– Therefore, for maximizing $\log P(O|\bar\lambda)$ we only need to maximize the Q-function (auxiliary function)

$$Q(\lambda,\bar\lambda)=\sum_{S}P(S|O,\lambda)\log P(O,S|\bar\lambda)$$

i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
– By maximizing the auxiliary function

$$Q(\lambda,\bar\lambda)=\sum_{S}P(S|\mathbf{O},\lambda)\log P(\mathbf{O},S|\bar\lambda)=\sum_{S}\frac{P(\mathbf{O},S|\lambda)}{P(\mathbf{O}|\lambda)}\log P(\mathbf{O},S|\bar\lambda)$$

– where $P(\mathbf{O},S|\bar\lambda)$ and $\log P(\mathbf{O},S|\bar\lambda)$ can be expressed as

$$P(\mathbf{O},S|\bar\lambda)=\bar\pi_{s_1}\prod_{t=1}^{T-1}\bar a_{s_ts_{t+1}}\prod_{t=1}^{T}\bar b_{s_t}(\mathbf{o}_t)$$

$$\log P(\mathbf{O},S|\bar\lambda)=\log\bar\pi_{s_1}+\sum_{t=1}^{T-1}\log\bar a_{s_ts_{t+1}}+\sum_{t=1}^{T}\log\bar b_{s_t}(\mathbf{o}_t)$$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

$$Q(\lambda,\bar\lambda)=Q_{\pi}(\lambda,\bar\pi)+Q_{a}(\lambda,\bar a)+Q_{b}(\lambda,\bar b)$$

$$Q_{\pi}(\lambda,\bar\pi)=\sum_{i=1}^{N}\frac{P(\mathbf{O},s_1=i|\lambda)}{P(\mathbf{O}|\lambda)}\log\bar\pi_i$$

$$Q_{a}(\lambda,\bar a)=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\frac{P(\mathbf{O},s_t=i,s_{t+1}=j|\lambda)}{P(\mathbf{O}|\lambda)}\log\bar a_{ij}$$

$$Q_{b}(\lambda,\bar b)=\sum_{j=1}^{N}\sum_{k}\sum_{t:\ \mathbf{o}_t=v_k}\frac{P(\mathbf{O},s_t=j|\lambda)}{P(\mathbf{O}|\lambda)}\log\bar b_j(v_k)$$

Each term is a sum of the form $\sum_i w_i\log y_i$.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in $\pi_i$, $a_{ij}$ and $b_j(k)$
– They can be maximized individually
– All are of the same form:

$$F(y_1,y_2,\ldots,y_N)=\sum_{j=1}^{N}w_j\log y_j,\qquad\text{where }y_j\ge0\text{ and }\sum_{j=1}^{N}y_j=1$$

$$F\text{ has its maximum value when }\ y_j=\frac{w_j}{\sum_{j=1}^{N}w_j}$$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier $\ell$, with the constraint $\sum_{j=1}^{N}y_j=1$:

$$F=\sum_{j=1}^{N}w_j\log y_j+\ell\Big(1-\sum_{j=1}^{N}y_j\Big)$$

$$\frac{\partial F}{\partial y_j}=\frac{w_j}{y_j}-\ell=0\ \Rightarrow\ w_j=\ell\,y_j\ \ \forall j\ \Rightarrow\ \sum_{j=1}^{N}w_j=\ell\sum_{j=1}^{N}y_j=\ell\ \Rightarrow\ y_j=\frac{w_j}{\sum_{j=1}^{N}w_j}$$

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar\lambda=(\bar\pi,\bar A,\bar B)$ can be expressed as

$$\bar\pi_i=\frac{P(\mathbf{O},s_1=i|\lambda)}{P(\mathbf{O}|\lambda)}=\gamma_1(i)$$

$$\bar a_{ij}=\frac{\sum_{t=1}^{T-1}P(\mathbf{O},s_t=i,s_{t+1}=j|\lambda)/P(\mathbf{O}|\lambda)}{\sum_{t=1}^{T-1}P(\mathbf{O},s_t=i|\lambda)/P(\mathbf{O}|\lambda)}=\frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$$

$$\bar b_i(k)=\frac{\sum_{t=1,\ \mathbf{o}_t=v_k}^{T}P(\mathbf{O},s_t=i|\lambda)/P(\mathbf{O}|\lambda)}{\sum_{t=1}^{T}P(\mathbf{O},s_t=i|\lambda)/P(\mathbf{O}|\lambda)}=\frac{\sum_{t=1,\ \mathbf{o}_t=v_k}^{T}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}$$

SP - Berlin Chen 94
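A compact C sketch of these discrete re-estimation formulas is given below. It assumes the posteriors gamma[t][i] and xi[t][i][j] have already been computed by the forward-backward procedure; the array shapes and names are illustrative assumptions, not a toolkit API.

/* One Baum-Welch M-step for a discrete HMM, given the E-step posteriors. */
void reestimate_discrete(int T, int N, int K, const int *obs,
                         double gamma[T][N], double xi[T][N][N],
                         double pi[N], double a[N][N], double b[N][K])
{
    for (int i = 0; i < N; i++) {
        pi[i] = gamma[0][i];                                /* pi_i = gamma_1(i)            */

        double denom_a = 0.0;
        for (int t = 0; t + 1 < T; t++) denom_a += gamma[t][i];
        for (int j = 0; j < N; j++) {
            double num = 0.0;
            for (int t = 0; t + 1 < T; t++) num += xi[t][i][j];
            a[i][j] = num / denom_a;                        /* a_ij = sum xi / sum gamma    */
        }

        double denom_b = 0.0;
        for (int t = 0; t < T; t++) denom_b += gamma[t][i];
        for (int k = 0; k < K; k++) {
            double num = 0.0;
            for (int t = 0; t < T; t++)
                if (obs[t] == k) num += gamma[t][i];        /* only frames emitting v_k     */
            b[i][k] = num / denom_b;
        }
    }
}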

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in the different form of the state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

$$b_j(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,b_{jk}(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,N(\mathbf{o};\mu_{jk},\Sigma_{jk})=\sum_{k=1}^{M}\frac{c_{jk}}{(2\pi)^{L/2}|\Sigma_{jk}|^{1/2}}\exp\!\Big(-\frac{1}{2}(\mathbf{o}-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}-\mu_{jk})\Big),\qquad\sum_{k=1}^{M}c_{jk}=1$$

(Figure: the distribution for state i drawn as a weighted sum of Gaussians N_1, N_2, N_3 with weights w_{i1}, w_{i2}, w_{i3}.)

SP - Berlin Chen 95
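To show how such a mixture output probability is evaluated in practice, here is a hedged C sketch for diagonal-covariance mixtures; the diagonal restriction, sizes, and identifiers are assumptions made for brevity.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define DIM 39   /* feature dimension (assumed)  */
#define MIX 8    /* mixtures per state (assumed) */

typedef struct {
    double c;            /* mixture weight c_jk   */
    double mean[DIM];    /* mu_jk                 */
    double var[DIM];     /* diagonal of Sigma_jk  */
} Mixture;

/* b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk) with diagonal covariances */
double mixture_output_prob(const Mixture mix[MIX], const double *o)
{
    double b = 0.0;
    for (int k = 0; k < MIX; k++) {
        double logdet = 0.0, maha = 0.0;
        for (int d = 0; d < DIM; d++) {
            double diff = o[d] - mix[k].mean[d];
            logdet += log(mix[k].var[d]);
            maha   += diff * diff / mix[k].var[d];
        }
        double logN = -0.5 * (DIM * log(2.0 * M_PI) + logdet + maha);   /* log Gaussian */
        b += mix[k].c * exp(logN);
    }
    return b;
}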

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$:

$$p(\mathbf{O},S|\lambda)=\prod_{t=1}^{T}a_{s_{t-1}s_t}\,b_{s_t}(\mathbf{o}_t)=\prod_{t=1}^{T}a_{s_{t-1}s_t}\Big(\sum_{k=1}^{M}c_{s_tk}\,b_{s_tk}(\mathbf{o}_t)\Big)$$

$$p(\mathbf{O}|\lambda)=\sum_{S}\sum_{K}p(\mathbf{O},S,K|\lambda),\qquad p(\mathbf{O},S,K|\lambda)=\prod_{t=1}^{T}a_{s_{t-1}s_t}\,c_{s_tk_t}\,b_{s_tk_t}(\mathbf{o}_t)$$

where $K=\{k_1,k_2,\ldots,k_T\}$ is one of the possible mixture component sequences along the state sequence $S$ (and $a_{s_0s_1}$ denotes the initial probability $\pi_{s_1}$).

Note: a product of sums expands into a sum of products, e.g.

$$\prod_{t=1}^{T}\sum_{k=1}^{M}a_{t,k}=\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T}a_{t,k_t}$$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

$$Q(\lambda,\bar\lambda)=\sum_{S}\sum_{K}P(S,K|\mathbf{O},\lambda)\log p(\mathbf{O},S,K|\bar\lambda)=\sum_{S}\sum_{K}\frac{p(\mathbf{O},S,K|\lambda)}{p(\mathbf{O}|\lambda)}\log p(\mathbf{O},S,K|\bar\lambda)$$

$$\log p(\mathbf{O},S,K|\bar\lambda)=\log\bar\pi_{s_1}+\sum_{t=1}^{T-1}\log\bar a_{s_ts_{t+1}}+\sum_{t=1}^{T}\log\bar b_{s_tk_t}(\mathbf{o}_t)+\sum_{t=1}^{T}\log\bar c_{s_tk_t}$$

$$Q(\lambda,\bar\lambda)=Q_{\pi}(\lambda,\bar\pi)+Q_{a}(\lambda,\bar a)+Q_{b}(\lambda,\bar b)+Q_{c}(\lambda,\bar c)$$

(initial probabilities, state transition probabilities, Gaussian density functions of the mixture components, and mixture weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with the discrete HMM training lies in the terms

$$Q_{b}(\lambda,\bar b)=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_t=j,k_t=k|\mathbf{O},\lambda)\log\bar b_{jk}(\mathbf{o}_t)$$

$$Q_{c}(\lambda,\bar c)=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_t=j,k_t=k|\mathbf{O},\lambda)\log\bar c_{jk}$$

where the posterior $P(s_t=j,k_t=k|\mathbf{O},\lambda)$ is denoted $\gamma_t(j,k)$.

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let $\gamma_t(j,k)=P(s_t=j,k_t=k|\mathbf{O},\lambda)$ and write the single-Gaussian log density (for L-dimensional observations) as

$$\log b_{jk}(\mathbf{o}_t)=-\frac{L}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma_{jk}|-\frac{1}{2}(\mathbf{o}_t-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}_t-\mu_{jk})$$

Maximizing $Q_b$ with respect to $\mu_{jk}$ (using $\frac{\partial}{\partial\mathbf{x}}\mathbf{x}^{T}C\mathbf{x}=(C+C^{T})\mathbf{x}$, with $\Sigma_{jk}^{-1}$ symmetric):

$$\frac{\partial Q_b(\lambda,\bar b)}{\partial\mu_{jk}}=\sum_{t=1}^{T}\gamma_t(j,k)\,\Sigma_{jk}^{-1}(\mathbf{o}_t-\mu_{jk})=0\ \Rightarrow\ \bar\mu_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Maximizing $Q_b$ with respect to $\Sigma_{jk}$ (using $\frac{\partial\log\det X}{\partial X}=(X^{-1})^{T}$ and $\frac{\partial\,\mathbf{a}^{T}X\mathbf{b}}{\partial X}=\mathbf{a}\mathbf{b}^{T}$, with $\Sigma_{jk}$ symmetric):

$$\frac{\partial Q_b(\lambda,\bar b)}{\partial\Sigma_{jk}^{-1}}=\sum_{t=1}^{T}\gamma_t(j,k)\Big[\tfrac{1}{2}\Sigma_{jk}-\tfrac{1}{2}(\mathbf{o}_t-\mu_{jk})(\mathbf{o}_t-\mu_{jk})^{T}\Big]=0$$

$$\Rightarrow\ \bar\Sigma_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar\mu_{jk})(\mathbf{o}_t-\bar\mu_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as

$$\bar\mu_{jk}=\frac{\sum_{t=1}^{T}p(s_t=j,k_t=k|\mathbf{O},\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T}p(s_t=j,k_t=k|\mathbf{O},\lambda)}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$

$$\bar\Sigma_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar\mu_{jk})(\mathbf{o}_t-\bar\mu_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$

$$\bar c_{jk}=\frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)}$$
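Putting the last three formulas together, the following hedged C sketch accumulates the sufficient statistics over one utterance and then updates the mean, (diagonal) variance, and weight of a single mixture component; the diagonal-covariance restriction and all identifiers are assumptions made for brevity.

#define DIM 39   /* feature dimension (assumed) */

/* Update mu_jk, diagonal Sigma_jk and c_jk of one mixture component from gamma_t(j,k). */
void update_mixture(int T, const double o[][DIM],
                    const double *gamma_jk,      /* gamma_t(j,k), t = 0..T-1          */
                    double occ_state_j,          /* sum over t and m of gamma_t(j,m)  */
                    double mean[DIM], double var[DIM], double *weight)
{
    double occ = 0.0, sum[DIM] = {0.0}, sqr[DIM] = {0.0};

    for (int t = 0; t < T; t++) {                /* accumulate sufficient statistics  */
        occ += gamma_jk[t];
        for (int d = 0; d < DIM; d++) {
            sum[d] += gamma_jk[t] * o[t][d];
            sqr[d] += gamma_jk[t] * o[t][d] * o[t][d];
        }
    }
    for (int d = 0; d < DIM; d++) {
        mean[d] = sum[d] / occ;                               /* new mu_jk                 */
        var[d]  = sqr[d] / occ - mean[d] * mean[d];           /* E[o^2] - mean^2           */
    }
    *weight = occ / occ_state_j;                              /* new c_jk                  */
}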

Page 47: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 47

Maximum Likelihood (ML) Estimation A Schematic Depiction (12)

bull Soft Assignmentndash Given the data follow a multinomial distributionndash Maximize the likelihood of the data given the alignment

State S1 State S2

07 03

04 06

09 01

05 05

P(B| S1)=(07+09)(07+04+09+05)

=1625=064

P(W| S1)=(04+05)(07+04+09+05)

=0925=036

P(B| S2)=(03+01)(03+06+01+05)

=0415=027

P(W| S2)=( 06+05)(03+06+01+05)

=01115=073

1 1 OssP tt 2 2 OssP tt

121 tt

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 48: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 48

Basic Problem 3 of HMMIntuitive View (cont)

bull Relationship between the forward and backward variables

ti

N

jjit

ttt

baj

isPi

o

ooo

1

1

21

ij

N

jtjt

tTttt

abj

isPi

111

21

o

ooo

N

itt

ttt

Pii

isPii

1

O

O

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
– The state duration follows an exponential (geometric) distribution,

$$d_i(t) = a_{ii}^{t-1} (1 - a_{ii})$$

  • which does not provide an adequate representation of the temporal structure of speech
– First-order (Markov) assumption: the state transition depends only on the origin and destination states
– Output-independence assumption: all observation frames depend only on the state that generated them, not on neighboring observation frames
• Researchers have proposed a number of techniques to address these limitations, although these solutions have not significantly improved speech recognition accuracy for practical applications
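A quick consequence of the duration formula above (a standard derivation, not shown on the slide): the expected duration in state i depends only on the self-loop probability,

$$\bar{d}_i = \sum_{t=1}^{\infty} t\, a_{ii}^{t-1} (1 - a_{ii}) = \frac{1}{1 - a_{ii}}$$

so, for example, a self-loop probability of a_ii = 0.8 implies an expected stay of 5 frames in state i.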

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling
[Figure: state-duration distributions compared: a geometric/exponential distribution, an empirical distribution, a Gaussian distribution, and a Gamma distribution.]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized
[Figure: likelihood plotted over the model configuration space; training climbs from the current model configuration to a nearby local maximum rather than the global one.]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: an ergodic 3-state HMM (states s1, s2, s3); all transition probabilities are initialized to 0.33 or 0.34, and the three states' initial output probabilities are (A: .34, B: .33, C: .33), (A: .33, B: .34, C: .33) and (A: .33, B: .33, C: .34) respectively.]

Train Set 1: 1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA
Train Set 2: 1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.
P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.
P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB
P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: isolated word recognition. The speech signal goes through feature extraction to give the feature sequence X; the likelihoods p(X|M_1), p(X|M_2), ..., p(X|M_V) of all word models (plus the silence model M_Sil, p(X|M_Sil)) are computed in parallel, and the most-likely-word selector outputs the recognized label.]

$$\text{Label}(\mathbf{X}) = \arg\max_{k}\, p(\mathbf{X} \mid M_k)$$

Viterbi approximation:

$$\text{Label}(\mathbf{X}) = \arg\max_{k}\, \max_{\mathbf{S}}\, p(\mathbf{X}, \mathbf{S} \mid M_k)$$
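A minimal sketch of the selector above, assuming each word model object exposes log_likelihood(X) and best_path_score(X) methods (both method names are illustrative assumptions):

def recognize_isolated_word(X, word_models, use_viterbi=True):
    """Pick the word whose HMM gives the feature sequence X the best score.

    word_models is a list of (label, model) pairs, e.g. [("yes", hmm_yes), ("no", hmm_no)].
    """
    def score(model):
        # Viterbi approximation: maximize over state sequences instead of summing them
        return model.best_path_score(X) if use_viterbi else model.log_likelihood(X)
    best_label, _ = max(word_models, key=lambda pair: score(pair[1]))
    return best_label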

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
– Substitution: an incorrect word was substituted for the correct word
– Deletion: a correct word was omitted in the recognized sentence
– Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
– A maximum substring matching problem
– Can be handled by dynamic programming
• Example
  Correct: "the effect is clear"
  Recognized: "effect is not clear"
– Error analysis: one deletion ("the") and one insertion ("not"); "effect", "is" and "clear" are matched
– Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

$$\text{Word Error Rate} = 100\% \times \frac{\text{Sub} + \text{Del} + \text{Ins}}{\text{No. of words in the correct sentence}} = 100\% \times \frac{2}{4} = 50\% \quad (\text{might be higher than } 100\%)$$

$$\text{Word Correction Rate} = 100\% \times \frac{\text{Matched words}}{\text{No. of words in the correct sentence}} = 100\% \times \frac{3}{4} = 75\%$$

$$\text{Word Accuracy Rate} = 100\% \times \frac{\text{Matched} - \text{Ins words}}{\text{No. of words in the correct sentence}} = 100\% \times \frac{3 - 1}{4} = 50\% \quad (\text{might be negative})$$

Note: WER + WAR = 100%.
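The numbers in this example can be reproduced directly from the raw error counts; the short sketch below (a hypothetical helper, not from the slides) does exactly that.

def asr_rates(n_ref, sub, dele, ins):
    """Word error / correction / accuracy rates, in percent."""
    matched = n_ref - sub - dele
    wer = 100.0 * (sub + dele + ins) / n_ref     # may exceed 100
    wcr = 100.0 * matched / n_ref
    war = 100.0 * (matched - ins) / n_ref        # may be negative
    return wer, wcr, war

print(asr_rates(n_ref=4, sub=0, dele=1, ins=1))  # (50.0, 75.0, 50.0)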

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A Dynamic Programming Algorithm (Textbook)
[Figure: alignment grid; one axis indexes the words of the correct/reference sentence and the other the words of the recognized/test sentence. Each grid point [i, j] stores the minimum word error alignment reaching it, and each move corresponds to one of the kinds of alignment: hit/match, substitution, deletion or insertion.]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen)

Step 1: Initialization. G[0][0] = 0
  for i = 1, ..., n (test):      G[i][0] = G[i-1][0] + 1;  B[i][0] = 1 (Insertion, Horizontal Direction)
  for j = 1, ..., m (reference): G[0][j] = G[0][j-1] + 1;  B[0][j] = 2 (Deletion, Vertical Direction)

Step 2: Iteration.
  for i = 1, ..., n (test), for j = 1, ..., m (reference):
    G[i][j] = min( G[i-1][j] + 1   (Insertion, Horizontal Direction),
                   G[i][j-1] + 1   (Deletion, Vertical Direction),
                   G[i-1][j-1] + 1 (Substitution, Diagonal Direction, if LT[i] != LR[j]),
                   G[i-1][j-1]     (Match, Diagonal Direction, if LT[i] == LR[j]) )
    B[i][j] = 1. Insertion, 2. Deletion, 3. Substitution, or 4. Match, according to the chosen direction

Step 3: Measure and Backtrace.
  Word Error Rate = 100% × G[n][m] / m
  Word Accuracy Rate = 100% - Word Error Rate
  Optimal backtrace path: from B[n][m] back to B[0][0]
    if B[i][j] == 1: print LT[i] (Insertion), then go left
    else if B[i][j] == 2: print LR[j] (Deletion), then go down
    else: print LR[j] (Hit/Match or Substitution), then go diagonally down

Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: (n+1) x (m+1) alignment grid between the recognized/test word sequence (1, 2, ..., i, ..., n) and the correct/reference word sequence (1, 2, ..., j, ..., m), in the HTK layout; horizontal moves from (i-1, j) are insertions, vertical moves from (i, j-1) are deletions, and diagonal moves come from (i-1, j-1).]

• A Dynamic Programming Algorithm
– Initialization

grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
grid[0][0].sub = grid[0][0].hit = 0;
grid[0][0].dir = NIL;

for (i = 1; i <= n; i++) {            /* test */
    grid[i][0] = grid[i-1][0];
    grid[i][0].dir = HOR;
    grid[i][0].score += InsPen;
    grid[i][0].ins++;
}
for (j = 1; j <= m; j++) {            /* reference */
    grid[0][j] = grid[0][j-1];
    grid[0][j].dir = VERT;
    grid[0][j].score += DelPen;
    grid[0][j].del++;
}

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program

for (i = 1; i <= n; i++) {   /* test */
    gridi = grid[i]; gridi1 = grid[i-1];
    for (j = 1; j <= m; j++) {   /* reference */
        h = gridi1[j].score + insPen;
        d = gridi1[j-1].score;
        if (lRef[j] != lTest[i]) d += subPen;
        v = gridi[j-1].score + delPen;
        if (d <= h && d <= v) {          /* DIAG = hit or sub */
            gridi[j] = gridi1[j-1]; gridi[j].score = d; gridi[j].dir = DIAG;
            if (lRef[j] == lTest[i]) ++gridi[j].hit; else ++gridi[j].sub;
        } else if (h < v) {              /* HOR = ins */
            gridi[j] = gridi1[j]; gridi[j].score = h; gridi[j].dir = HOR; ++gridi[j].ins;
        } else {                         /* VERT = del */
            gridi[j] = gridi[j-1]; gridi[j].score = v; gridi[j].dir = VERT; ++gridi[j].del;
        }
    } /* for j */
} /* for i */

• Example 1
  Correct: A C B C C
  Test:    B A B C
[Figure: the (Ins, Del, Sub, Hit) counts filled into the HTK-style grid for this pair; the backtrace reads Del C, Hit C, Hit B, Del C, Hit A, Ins B.]
Alignment 1: WER = 60% (and there is still another optimal alignment with the same score).

SP - Berlin Chen 77

Measures of ASR Performance (7/8)
• Example 2
  Correct: A C B C C
  Test:    B A A C
[Figure: the (Ins, Del, Sub, Hit) counts filled into the grid for this pair; several backtrace paths are optimal.]
Alignment 1: Del C, Hit C, Sub B, Del C, Hit A, Ins B; WER = 80%
Alignment 2: Del C, Hit C, Del B, Sub C, Hit A, Ins B; WER = 80%
Alignment 3: Del C, Hit C, Sub B, Sub C, Sub A; WER = 80%
Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors
HTK error penalties: subPen = 10, delPen = 7, insPen = 7
NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance
Reference (one time-stamped character per line, e.g. "100000 100000 桃"):
  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
ASR Output:
  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first one, 100, 200 and 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================
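For reference, the %Corr and Acc figures in these report blocks follow directly from the H, D, S, I counts; the small sketch below (an illustrative helper, not part of the assignment hand-out) reproduces the overall numbers.

def hresults_summary(H, D, S, I):
    """Correctness and accuracy as reported in HTK HResults-style output."""
    N = H + D + S                  # number of reference tokens
    corr = 100.0 * H / N           # %Corr ignores insertions
    acc = 100.0 * (H - I) / N      # Acc additionally penalizes insertions
    return N, corr, acc

print(hresults_summary(H=57144, D=829, S=7839, I=504))   # N=65812, ~86.83, ~86.06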

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B containing balls of two colors (R and G). The observed data O is the "ball sequence"; the latent data S is the "bottle sequence". The parameters λ to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).]

[Figure: a 3-state HMM (s1, s2, s3) with output probabilities such as (A: .3, B: .2, C: .5), (A: .7, B: .1, C: .2), (A: .3, B: .6, C: .1) and transition probabilities 0.6, 0.7, 0.3, 0.2, 0.1, ...; given the training observations o1 o2 ... oT, each training iteration moves from λ to a new model λ̄ with p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
  • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data; in our case here, the state sequence is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence
– Two major steps:
  • E: take the expectation E_S[ · | O, λ] with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations
  • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = X_1, X_2, ..., X_n with observed values x_1, x_2, ..., x_n
– The Maximum Likelihood (ML) principle: find the model parameter Φ so that the likelihood p(x|Φ) is maximum. For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

$$\boldsymbol{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i , \qquad \boldsymbol{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}_{ML})(\mathbf{x}_i - \boldsymbol{\mu}_{ML})^{t}$$

– The Maximum A Posteriori (MAP) principle: find the model parameter Φ so that the posterior likelihood p(Φ|x) is maximum

(Either ML or MAP can serve as the estimation criterion.)
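A tiny NumPy check of the ML formulas above (purely illustrative; the data and numbers are made up for this note):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.5, size=(5000, 2))   # i.i.d. samples

mu_ml = X.mean(axis=0)                 # (1/n) * sum_i x_i
diff = X - mu_ml
sigma_ml = diff.T @ diff / len(X)      # (1/n) * sum_i (x_i - mu)(x_i - mu)^T

print(mu_ml)      # close to [1, -2]
print(sigma_ml)   # close to 2.25 * I (variance 1.5^2 on the diagonal)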

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
– Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)
• First, scalar (discrete) random variables are used to introduce the EM algorithm
– The observable training data O
  • We want to maximize P(O|λ); λ is a parameter vector
– The hidden (unobservable) data S
  • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O, S|λ), and compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?
– Algorithm
  • Log-likelihood expression and expectation taken over S. By Bayes' rule (λ̄ being the unknown model setting to be found),

$$P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) = P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})\, P(\mathbf{O} \mid \bar{\lambda})$$

$$\log P(\mathbf{O} \mid \bar{\lambda}) = \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) - \log P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})$$

  • Taking the expectation over S with respect to P(S|O, λ) (the current model),

$$\log P(\mathbf{O} \mid \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) - \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})$$

  (log P(O|λ̄) is the incomplete-data likelihood; P(O, S|λ̄) is the complete-data likelihood.)

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express log P(O|λ̄) as follows:

$$\log P(\mathbf{O} \mid \bar{\lambda}) = Q(\lambda, \bar{\lambda}) - H(\lambda, \bar{\lambda})$$

where

$$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}), \qquad H(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})$$

  • We want log P(O|λ̄) ≥ log P(O|λ), i.e.

$$Q(\lambda, \bar{\lambda}) - H(\lambda, \bar{\lambda}) \ge Q(\lambda, \lambda) - H(\lambda, \lambda)$$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property:

$$H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log \frac{P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})}{P(\mathbf{S} \mid \mathbf{O}, \lambda)} \le \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \left( \frac{P(\mathbf{S} \mid \mathbf{O}, \bar{\lambda})}{P(\mathbf{S} \mid \mathbf{O}, \lambda)} - 1 \right) = 0$$

using Jensen's inequality (log x ≤ x - 1); the left-hand side is the negative Kullback-Leibler (KL) distance between the two state posteriors.

– Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function (auxiliary function)

$$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda})$$

i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
– By maximizing the auxiliary function

$$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} P(\mathbf{S} \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) = \sum_{\mathbf{S}} \frac{P(\mathbf{O}, \mathbf{S} \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda})$$

– where P(O, S|λ) and log P(O, S|λ̄) can be expressed as

$$P(\mathbf{O}, \mathbf{S} \mid \lambda) = \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(\mathbf{o}_t)$$

$$\log P(\mathbf{O}, \mathbf{S} \mid \bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{s_t}(\mathbf{o}_t)$$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

$$Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\pi}) + Q_{a}(\lambda, \bar{A}) + Q_{b}(\lambda, \bar{B})$$

where

$$Q_{\pi}(\lambda, \bar{\pi}) = \sum_{i=1}^{N} \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \bar{\pi}_i$$

$$Q_{a}(\lambda, \bar{A}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \frac{P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \bar{a}_{ij}$$

$$Q_{b}(\lambda, \bar{B}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{\substack{t = 1 \\ \mathbf{o}_t = v_k}}^{T} \frac{P(\mathbf{O}, s_t = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \bar{b}_j(k)$$

Each of the three terms is a weighted sum of logarithms of the form Σ_i w_i log y_i.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_ij and b̄_j(k)
– They can be maximized individually
– All are of the same form

$$F(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j , \qquad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0 ,$$

which has its maximum value when

$$y_j = \frac{w_j}{\sum_{j=1}^{N} w_j}$$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier. Suppose that

$$F = \sum_{j=1}^{N} w_j \log y_j + \varepsilon \left( \sum_{j=1}^{N} y_j - 1 \right)$$

By applying the Lagrange multiplier ε with the constraint Σ_j y_j = 1,

$$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \varepsilon = 0 \;\; \forall j \;\Rightarrow\; w_j = -\varepsilon\, y_j \;\Rightarrow\; \sum_{j=1}^{N} w_j = -\varepsilon \sum_{j=1}^{N} y_j = -\varepsilon \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j=1}^{N} w_j}$$

Lagrange multipliers: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as (a code sketch of this M-step follows the formulas)

$$\bar{\pi}_i = \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} = \gamma_1(i)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda) \,/\, P(\mathbf{O} \mid \lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i \mid \lambda) \,/\, P(\mathbf{O} \mid \lambda)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$\bar{b}_j(k) = \frac{\sum_{t=1,\, \mathbf{o}_t = v_k}^{T} P(\mathbf{O}, s_t = j \mid \lambda) \,/\, P(\mathbf{O} \mid \lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, s_t = j \mid \lambda) \,/\, P(\mathbf{O} \mid \lambda)} = \frac{\sum_{t=1,\, \mathbf{o}_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

where γ_t(i) and ξ_t(i, j) are the state and transition posteriors defined earlier.
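A compact NumPy sketch of this M-step, assuming the E-step has already produced gamma (a T x N array of state posteriors), xi (a (T-1) x N x N array of transition posteriors) and obs (the codeword index of each frame); all of these names are illustrative assumptions.

import numpy as np

def discrete_hmm_m_step(gamma, xi, obs, n_symbols):
    """Re-estimate (pi, A, B) of a discrete HMM from E-step statistics."""
    pi = gamma[0]                                         # pi_i = gamma_1(i)
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # sum_t xi_t(i,j) / sum_t gamma_t(i)
    B = np.zeros((gamma.shape[1], n_symbols))
    for k in range(n_symbols):
        B[:, k] = gamma[obs == k].sum(axis=0)             # numerator: frames where o_t = v_k
    B /= gamma.sum(axis=0)[:, None]
    return pi, A, B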

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

$$b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{L/2}\, |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{o} - \boldsymbol{\mu}_{jk})^{t}\, \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o} - \boldsymbol{\mu}_{jk}) \right), \qquad \sum_{k=1}^{M} c_{jk} = 1$$

[Figure: the distribution for state i drawn as a mixture of three Gaussians N_1, N_2, N_3 with weights w_i1, w_i2, w_i3.]
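In practice this state output density is usually evaluated in the log domain to avoid numerical underflow; a short sketch (illustrative names only) is given below.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_state_output_prob(o, weights, means, covs):
    """log b_j(o) for one state with a Gaussian-mixture output distribution."""
    log_terms = [np.log(w) + multivariate_normal.logpdf(o, mean=m, cov=c)
                 for w, m, c in zip(weights, means, covs)]
    return logsumexp(log_terms)        # log of sum_k c_jk * N(o; mu_jk, Sigma_jk)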

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_jk(o_t):

$$p(\mathbf{O} \mid \lambda) = \sum_{\mathbf{S}} \prod_{t=1}^{T} a_{s_{t-1} s_t}\, b_{s_t}(\mathbf{o}_t) = \sum_{\mathbf{S}} \prod_{t=1}^{T} a_{s_{t-1} s_t} \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(\mathbf{o}_t) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda)$$

(with the convention a_{s_0 s_1} ≡ π_{s_1}), where K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along the state sequence S, and

$$p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$$

Note: interchanging the product and the sum uses the identity

$$\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore an auxiliary function for the EM algorithm can be written as

$$Q(\lambda, \bar{\lambda}) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} P(\mathbf{S}, \mathbf{K} \mid \mathbf{O}, \lambda) \log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \bar{\lambda}) = \sum_{\mathbf{S}} \sum_{\mathbf{K}} \frac{p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \lambda)}{p(\mathbf{O} \mid \lambda)} \log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \bar{\lambda})$$

with

$$\log p(\mathbf{O}, \mathbf{S}, \mathbf{K} \mid \bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t} + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(\mathbf{o}_t)$$

so that Q decomposes into four terms,

$$Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\pi}) + Q_{a}(\lambda, \bar{A}) + Q_{b}(\lambda, \bar{\mathbf{b}}) + Q_{c}(\lambda, \bar{\mathbf{c}})$$

for the initial probabilities, the state transition probabilities, the Gaussian density functions, and the mixture component weights.

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the last two terms:

$$Q_{b}(\lambda, \bar{\mathbf{b}}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(s_t = j, k_t = k \mid \mathbf{O}, \lambda) \log \bar{b}_{jk}(\mathbf{o}_t)$$

$$Q_{c}(\lambda, \bar{\mathbf{c}}) = \sum_{t=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{M} P(s_t = j, k_t = k \mid \mathbf{O}, \lambda) \log \bar{c}_{jk}$$

where the posterior γ_t(j, k) = P(s_t = j, k_t = k | O, λ) plays the role that γ_t(j) played in the discrete case.

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). Writing out the Gaussian,

$$\log \bar{b}_{jk}(\mathbf{o}_t) = -\frac{L}{2}\log(2\pi) - \frac{1}{2}\log\left|\bar{\boldsymbol{\Sigma}}_{jk}\right| - \frac{1}{2} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{t}\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})$$

Setting the derivative of Q_b with respect to μ̄_jk to zero (using d(x^t C x)/dx = (C + C^t) x and the symmetry of Σ̄_jk),

$$\frac{\partial Q_b}{\partial \bar{\boldsymbol{\mu}}_{jk}} = \sum_{t=1}^{T} \gamma_t(j, k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk}) = 0 \;\;\Rightarrow\;\; \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j, k)}$$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, setting the derivative of Q_b with respect to Σ̄_jk to zero, and using d(a^t X b)/dX = a b^t, d det(X)/dX = det(X) (X^{-1})^t and the symmetry of Σ̄_jk, gives

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{t}}{\sum_{t=1}^{T} \gamma_t(j, k)}$$

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

$$\bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j, k)}$$

$$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{t}}{\sum_{t=1}^{T} \gamma_t(j, k)}$$

$$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{k'=1}^{M} \gamma_t(j, k')}$$

where γ_t(j, k) = P(s_t = j, k_t = k | O, λ).
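To close, a minimal NumPy sketch of these continuous-HMM M-step updates for a single state j, assuming the E-step has produced gamma_jk (a T x M array holding γ_t(j, k) for that state) and obs (a T x dim array of observation vectors); the names are illustrative assumptions.

import numpy as np

def gmm_state_m_step(gamma_jk, obs):
    """Re-estimate mixture weights, means and covariances of one HMM state."""
    occ = gamma_jk.sum(axis=0)                     # sum_t gamma_t(j,k), shape (M,)
    weights = occ / occ.sum()                      # new c_jk
    means = (gamma_jk.T @ obs) / occ[:, None]      # new mu_jk
    covs = []
    for k in range(gamma_jk.shape[1]):
        diff = obs - means[k]
        covs.append((gamma_jk[:, k, None] * diff).T @ diff / occ[k])   # new Sigma_jk
    return weights, means, np.array(covs)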

Page 49: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 49

Basic Problem 3 of HMMIntuitive View (cont)

bull Define a new variable

ndash Probability being at state i at time t and at state j at time t+1

bull Recall the posteriori probability variable

N

m

N

nttnmnt

ttjijtttjijt

ttt

nbam

jbaiP

jbaiP

jsisPji

1 111

1111

1

o

oλO

oλO

λO

)(for 11

1 TtjijsisPiN

jt

N

jttt

Ο

λO 1 jsisPji ttt

OisPi tt

N

mtt

ttt

mm

iii

1

as drepresente becan also Note

BP

BApBAp

i

j

t t+1

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 50: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 50

Basic Problem 3 of HMMIntuitive View (cont)

bull P(s3 = 3 s4 = 1O | )=3(3)a31b1(o4)1(4)

O1

s2

s1

s3

s2

s1

s3

s2

s1

s1

State

O2 O3 OT

1 2 3 4 T-1 T time

OT-1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s1

s3

SP - Berlin Chen 51

Basic Problem 3 of HMMIntuitive View (cont)

bull

bull

bull A set of reasonable re-estimation formula for A is

1

1in state to state from ns transitioofnumber expected

T

tt jiji O

1

1

1

1 1in state from ns transitioofnumber expected

T

t

T

t

N

jtt ijii O

itii

1 1 at time statein times)of(number freqency expected

ijξ

ijia 1T-

1t t

1T-

1t t

ij

state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

λO 1 jsisPji ttt

OisPi tt

Formulae for Single Training Utterance

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming

• Example
  Correct:    "the effect is clear"
  Recognized: "effect is not clear"   ("the" deleted, "not" inserted, the rest matched)
  – Error analysis: one deletion and one insertion
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate      = 100% x (Sub + Del + Ins) / No. of words in the correct sentence = 2/4 = 50%
    Word Correction Rate = 100% x Matched words / No. of words in the correct sentence     = 3/4 = 75%
    Word Accuracy Rate   = 100% x (Matched words - Ins) / No. of words in the correct sentence = (3-1)/4 = 50%

  Note: WER + WAR = 100%; WER might be higher than 100%, and WAR might be negative
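A tiny helper reflecting these definitions; the counts below reproduce the slide's example (4 reference words, 3 matches, 1 deletion, 1 insertion, 0 substitutions):

    #include <stdio.h>

    struct rates { double wer, wcr, war; };

    struct rates score(int N, int hit, int sub, int del, int ins)
    {
        struct rates r;
        r.wer = 100.0 * (sub + del + ins) / N;   /* word error rate      */
        r.wcr = 100.0 * hit / N;                 /* word correction rate */
        r.war = 100.0 * (hit - ins) / N;         /* word accuracy rate   */
        return r;
    }

    int main(void)
    {
        struct rates r = score(4, 3, 0, 1, 1);
        printf("WER=%.0f%% WCR=%.0f%% WAR=%.0f%%\n", r.wer, r.wcr, r.war);
        return 0;   /* prints WER=50% WCR=75% WAR=50% */
    }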

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (textbook)
  – n denotes the word length of the correct/reference sentence, and m denotes the word length of the recognized/test sentence
  – G[i][j] holds the minimum word-error alignment score at grid point [i, j]

[Figure: alignment grid with axes Ref i and Test j, illustrating the possible kinds of alignment steps (hit/match, substitution, insertion, deletion)]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen); here index i runs over the test sentence and index j over the reference sentence

  Step 1: Initialization
      G[0][0] = 0
      for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
      for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
      for i = 1..n (test), for j = 1..m (reference):
          G[i][j] = min( G[i-1][j] + 1                        (Insertion)
                         G[i][j-1] + 1                        (Deletion)
                         G[i-1][j-1] + 1  if LT[i] != LR[j]   (Substitution)
                         G[i-1][j-1]      if LT[i] == LR[j]   (Match) )
          B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical),
                    3 (Substitution, diagonal), or 4 (Match, diagonal), according to the chosen step

  Step 3: Measure and Backtrace
      Word Error Rate    = 100% x G[n][m] / m
      Word Accuracy Rate = 100% - Word Error Rate
      Optimal backtrace path: follow B from B[n][m] back to B[0][0]
          if B[i][j] == 1: print LT[i] (Insertion), then go left
          else if B[i][j] == 2: print LR[j] (Deletion), then go down
          else: print LR[j] (Hit/Match or Substitution), then go diagonally down

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style grid)

[Figure: alignment grid with the correct/reference word sequence along one axis (1 .. n) and the recognized/test word sequence along the other (1 .. m); each interior cell (i, j) is reached from (i-1, j-1), (i-1, j), or (i, j-1); the first row and column accumulate pure insertions (1 Ins, 2 Ins, 3 Ins, ...) and pure deletions (1 Del, 2 Del, 3 Del, ...)]

  – Initialization

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub   = grid[0][0].hit = 0;
    grid[0][0].dir   = NIL;

    for (i = 1; i <= n; i++) {            /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {            /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program

  for (i = 1; i <= n; i++) {                  /* test */
      gridi  = grid[i];
      gridi1 = grid[i-1];
      for (j = 1; j <= m; j++) {              /* reference */
          h = gridi1[j].score + insPen;
          d = gridi1[j-1].score;
          if (lRef[j] != lTest[i])
              d += subPen;
          v = gridi[j-1].score + delPen;
          if (d <= h && d <= v) {             /* DIAG = hit or sub */
              gridi[j] = gridi1[j-1];         /* structure assignment */
              gridi[j].score = d;
              gridi[j].dir = DIAG;
              if (lRef[j] == lTest[i]) ++gridi[j].hit;
              else                     ++gridi[j].sub;
          } else if (h < v) {                 /* HOR = ins */
              gridi[j] = gridi1[j];           /* structure assignment */
              gridi[j].score = h;
              gridi[j].dir = HOR;
              ++gridi[j].ins;
          } else {                            /* VERT = del */
              gridi[j] = gridi[j-1];          /* structure assignment */
              gridi[j].score = v;
              gridi[j].dir = VERT;
              ++gridi[j].del;
          }
      }  /* for j */
  }  /* for i */

• Example 1 (HTK-style alignment; each grid cell is labeled with its (Ins, Del, Sub, Hit) counts)

  Correct: A C B C C
  Test:    B A B C

  [Figure: alignment grid starting from (0,0,0,0) at the origin; the optimal path reads
   Ins B, Hit A, Del C, Hit B, Hit C, Del C]

  Alignment 1: WER = 60%  (there is still another optimal alignment)
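The slides stop at filling the grid; a short backtrace sketch in the same style, assuming the globals used above (grid[][] with .dir filled in, the DIAG/HOR/VERT constants, the word arrays lRef[] and lTest[], plus <stdio.h> and <string.h>), would recover the printed alignment:

    /* prints the alignment from the end of the sentences back to the start */
    void backtrace(int n, int m)
    {
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (grid[i][j].dir == HOR) {             /* insertion of test word i      */
                printf("Ins %s\n", lTest[i]); i--;
            } else if (grid[i][j].dir == VERT) {     /* deletion of reference word j  */
                printf("Del %s\n", lRef[j]);  j--;
            } else {                                 /* DIAG: hit/match or substitution */
                printf("%s %s\n",
                       strcmp(lRef[j], lTest[i]) == 0 ? "Hit" : "Sub", lRef[j]);
                i--; j--;
            }
        }
    }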

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (each grid cell again labeled with its (Ins, Del, Sub, Hit) counts)

  Correct: A C B C C
  Test:    B A A C

  [Figure: alignment grid for this pair; several different paths reach the same minimum cost]

  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C    WER = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C    WER = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C           WER = 80%

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors
  – HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance
  – The label files list, per line, two numeric time fields followed by one Chinese character

  Reference:   桃芝颱風重創花蓮光復鄉大興村死傷慘重感觸最多……
  ASR Output:  桃芝颱風重創花蓮光復鄉打新村次傷殘周感觸最多……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ====================================================================
  ------------------------ Overall Results --------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
  ====================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles A and B containing red (R) and green (G) balls]
  – Observed data O: the "ball sequence"; latent data S: the "bottle sequence"
  – Parameters to be estimated to maximize log P(O|λ):
    P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a 3-state HMM (s1, s2, s3) with example transition probabilities (e.g. 0.6, 0.7, 0.3, 0.2, 0.1) and state observation probabilities such as {A: .3, B: .2, C: .5}, {A: .7, B: .1, C: .2}, {A: .3, B: .6, C: .1}; given o1 o2 ... oT, re-estimation moves from λ to λ̂ such that p(O|λ̂) > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization of the likelihood function relies on intermediate variables called latent data
      (in our case here, the state sequence S is the latent data)
    • Direct access to the data necessary to estimate the parameters is impossible or difficult
      (in our case here, it is almost impossible to estimate {A, B, π} without consideration of the state sequence)
  – Two Major Steps
    • E: take the expectation of the complete-data log-likelihood with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations O
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria
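In symbols (standard EM notation, consistent with the Q-function introduced later in these slides):

    \text{E-step: } Q(\hat{\lambda}, \lambda)
        = E_S\!\left[\log P(\mathbf{O}, S \mid \hat{\lambda}) \mid \mathbf{O}, \lambda\right]
        = \sum_{S} P(S \mid \mathbf{O}, \lambda)\, \log P(\mathbf{O}, S \mid \hat{\lambda})

    \text{M-step: } \hat{\lambda} \leftarrow \arg\max_{\hat{\lambda}} Q(\hat{\lambda}, \lambda)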

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations
  – The sequence of random variables X = X1, X2, ..., Xn takes the observed values x = x1, x2, ..., xn

  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(x|Φ) is maximum;
    for example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d.
    (independent, identically distributed), then the ML estimates of μ and Σ are

        \boldsymbol{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i
        \qquad
        \boldsymbol{\Sigma}_{ML} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu}_{ML})(\mathbf{x}_i - \boldsymbol{\mu}_{ML})^T

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior p(Φ|x) is maximum

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,S|λ)

• Firstly, using scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S|O,λ), and compute a new λ̂, the maximum likelihood estimate of λ
  – Does the process converge?

  – Algorithm
    • Log-likelihood expression, and expectation taken over S

      Bayes' rule (complete-data vs. incomplete-data likelihood, unknown model setting λ̂):

          P(\mathbf{O}, S \mid \hat{\lambda}) = P(S \mid \mathbf{O}, \hat{\lambda})\, P(\mathbf{O} \mid \hat{\lambda})
          \;\Rightarrow\;
          \log P(\mathbf{O} \mid \hat{\lambda}) = \log P(\mathbf{O}, S \mid \hat{\lambda}) - \log P(S \mid \mathbf{O}, \hat{\lambda})

      Taking the expectation over S with respect to the current model, P(S|O,λ):

          \log P(\mathbf{O} \mid \hat{\lambda})
          = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, S \mid \hat{\lambda})
          - \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(S \mid \mathbf{O}, \hat{\lambda})

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O|λ̂) as follows

        \log P(\mathbf{O} \mid \hat{\lambda}) = Q(\lambda, \hat{\lambda}) - H(\lambda, \hat{\lambda})

      where

        Q(\lambda, \hat{\lambda}) = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, S \mid \hat{\lambda})
        \qquad
        H(\lambda, \hat{\lambda}) = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(S \mid \mathbf{O}, \hat{\lambda})

    • We want  log P(O|λ̂) ≥ log P(O|λ), i.e.

        Q(\lambda, \hat{\lambda}) - H(\lambda, \hat{\lambda}) \;\ge\; Q(\lambda, \lambda) - H(\lambda, \lambda)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̂) - H(λ, λ) has the following property (using log x ≤ x - 1, i.e. Jensen's inequality; this quantity is a negative Kullback-Leibler (KL) distance)

    H(\lambda, \hat{\lambda}) - H(\lambda, \lambda)
      = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log \frac{P(S \mid \mathbf{O}, \hat{\lambda})}{P(S \mid \mathbf{O}, \lambda)}
      \;\le\; \sum_{S} P(S \mid \mathbf{O}, \lambda) \left( \frac{P(S \mid \mathbf{O}, \hat{\lambda})}{P(S \mid \mathbf{O}, \lambda)} - 1 \right) = 0

  – Therefore, for maximizing log P(O|λ̂) we only need to maximize the Q-function (auxiliary function)

    Q(\lambda, \hat{\lambda}) = \sum_{S} P(S \mid \mathbf{O}, \lambda) \log P(\mathbf{O}, S \mid \hat{\lambda})

    (the expectation of the complete-data log-likelihood with respect to the latent state sequences)

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function

    Q(\lambda, \hat{\lambda}) = \sum_{S} \frac{P(\mathbf{O}, S \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log P(\mathbf{O}, S \mid \hat{\lambda})

  – where P(O,S|λ) and log P(O,S|λ̂) can be expressed as

    P(\mathbf{O}, S \mid \lambda) = \pi_{s_1} b_{s_1}(\mathbf{o}_1) \prod_{t=1}^{T-1} a_{s_t s_{t+1}} b_{s_{t+1}}(\mathbf{o}_{t+1})

    \log P(\mathbf{O}, S \mid \hat{\lambda}) = \log \hat{\pi}_{s_1}
        + \sum_{t=1}^{T-1} \log \hat{a}_{s_t s_{t+1}}
        + \sum_{t=1}^{T} \log \hat{b}_{s_t}(\mathbf{o}_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as  Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂), where

    Q_{\pi}(\lambda, \hat{\pi}) = \sum_{i=1}^{N} \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \hat{\pi}_i

    Q_{a}(\lambda, \hat{a}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1}
        \frac{P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \hat{a}_{ij}

    Q_{b}(\lambda, \hat{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t:\, \mathbf{o}_t = v_k}
        \frac{P(\mathbf{O}, s_t = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)} \log \hat{b}_j(v_k)

  (each term has the generic form  \sum_i w_i \log y_i  treated on the next slide)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̂_i, â_ij and b̂_j(k)
  – Can be maximized individually
  – All of the same form

    F(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j,
    \quad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0,
    \quad \text{has maximum value when } y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply the Lagrange multiplier

  By applying a Lagrange multiplier \ell with the constraint \sum_{j=1}^{N} y_j = 1, suppose that

      F = \sum_{j=1}^{N} w_j \log y_j + \ell \left( \sum_{j=1}^{N} y_j - 1 \right)

      \frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0
      \;\Rightarrow\; w_j = -\ell\, y_j \;\; \forall j
      \;\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell
      \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̂ = (Â, B̂, π̂) can be expressed as

    \hat{\pi}_i = \frac{P(\mathbf{O}, s_1 = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)}

    \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda) / P(\mathbf{O} \mid \lambda)}
                        {\sum_{t=1}^{T-1} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}

    \hat{b}_i(k) = \frac{\sum_{t=1,\, \mathbf{o}_t = v_k}^{T} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
                        {\sum_{t=1}^{T} P(\mathbf{O}, s_t = i \mid \lambda) / P(\mathbf{O} \mid \lambda)}
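A minimal M-step sketch, assuming the expected counts in the formulas above have already been accumulated by a forward-backward pass into the hypothetical arrays named below:

    /* M-step for a discrete HMM from expected counts accumulated by forward-backward;
       matrices are stored row-major as flat arrays of size N*N or N*M               */
    void reestimate(int N, int M,
                    const double *gamma1,        /* gamma_1(i)                                     */
                    const double *gamma_sum,     /* sum_{t=1..T-1} gamma_t(i)                      */
                    const double *gamma_sum_T,   /* sum_{t=1..T}   gamma_t(i)                      */
                    const double *gamma_obs,     /* [i*M+k]: sum over t with o_t=v_k of gamma_t(i) */
                    const double *xi_sum,        /* [i*N+j]: sum_{t=1..T-1} xi_t(i,j)              */
                    double *pi, double *a, double *b)
    {
        for (int i = 0; i < N; i++) {
            pi[i] = gamma1[i];                                /* pi_i = gamma_1(i)              */
            for (int j = 0; j < N; j++)
                a[i*N + j] = xi_sum[i*N + j] / gamma_sum[i];  /* a_ij = sum xi / sum gamma      */
            for (int k = 0; k < M; k++)
                b[i*M + k] = gamma_obs[i*M + k] / gamma_sum_T[i];
        }
    }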

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)

    b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o})
                    = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})
                    = \sum_{k=1}^{M} c_{jk} \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}}
                      \exp\!\left( -\tfrac{1}{2} (\mathbf{o}-\boldsymbol{\mu}_{jk})^T \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o}-\boldsymbol{\mu}_{jk}) \right),
    \qquad \sum_{k=1}^{M} c_{jk} = 1

[Figure: the distribution for state i drawn as a mixture of three Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3]
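A small sketch of evaluating this state output probability, restricted to diagonal covariances for simplicity (a common implementation choice, not something the slide mandates); mu and var are flat row-major arrays of size M*L:

    #include <math.h>

    /* b_j(o) for one state: M diagonal-covariance Gaussians of dimension L */
    double state_output_prob(int M, int L, const double *o, const double *c,
                             const double *mu, const double *var)
    {
        double b = 0.0;
        for (int k = 0; k < M; k++) {
            double log_g = -0.5 * L * log(2.0 * M_PI);       /* log Gaussian normalizer */
            for (int d = 0; d < L; d++) {
                double diff = o[d] - mu[k*L + d];
                log_g += -0.5 * log(var[k*L + d]) - 0.5 * diff * diff / var[k*L + d];
            }
            b += c[k] * exp(log_g);                          /* weighted mixture sum    */
        }
        return b;   /* real systems keep this in the log domain to avoid underflow */
    }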

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o)

    p(\mathbf{O}, S \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\, b_{s_t}(\mathbf{o}_t)
        = \prod_{t=1}^{T} a_{s_{t-1}s_t} \sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(\mathbf{o}_t)
        = \sum_{\mathbf{K}} \prod_{t=1}^{T} a_{s_{t-1}s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)
    \qquad (a_{s_0 s_1} \text{ denotes the initial probability } \pi_{s_1})

    p(\mathbf{O} \mid \lambda) = \sum_{S} \sum_{\mathbf{K}} p(\mathbf{O}, S, \mathbf{K} \mid \lambda),
    \qquad p(\mathbf{O}, S, \mathbf{K} \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)

  – K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along with the state sequence S

  Note:
    \prod_{t=1}^{T} \left( \sum_{k=1}^{M} a_{tk} \right)
      = (a_{11}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM})
      = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(\lambda, \hat{\lambda}) = \sum_{S} \sum_{\mathbf{K}}
        \frac{p(\mathbf{O}, S, \mathbf{K} \mid \lambda)}{p(\mathbf{O} \mid \lambda)}
        \log p(\mathbf{O}, S, \mathbf{K} \mid \hat{\lambda})

    \log p(\mathbf{O}, S, \mathbf{K} \mid \hat{\lambda})
        = \sum_{t=1}^{T} \log \hat{a}_{s_{t-1}s_t}
        + \sum_{t=1}^{T} \log \hat{c}_{s_t k_t}
        + \sum_{t=1}^{T} \log \hat{b}_{s_t k_t}(\mathbf{o}_t)

    Q(\lambda, \hat{\lambda}) = Q_{\pi}(\lambda, \hat{\pi}) + Q_{a}(\lambda, \hat{a}) + Q_{b}(\lambda, \hat{b}) + Q_{c}(\lambda, \hat{c})
      (initial probabilities, state transition probabilities, Gaussian density functions, mixture components)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with discrete HMM training:

    Q_{b}(\lambda, \hat{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        P(s_t = j, k_t = k \mid \mathbf{O}, \lambda)\, \log \hat{b}_{jk}(\mathbf{o}_t)

    Q_{c}(\lambda, \hat{c}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T}
        P(s_t = j, k_t = k \mid \mathbf{O}, \lambda)\, \log \hat{c}_{jk}

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let  \gamma_t(j,k) = P(s_t = j, k_t = k \mid \mathbf{O}, \lambda)  and write the single-Gaussian log-density as

    \log \hat{b}_{jk}(\mathbf{o}_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\hat{\boldsymbol{\Sigma}}_{jk}|
        - \tfrac{1}{2} (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})^T \hat{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})

  Setting the derivative of Q_b with respect to \hat{\boldsymbol{\mu}}_{jk} to zero,

    \frac{\partial Q_b}{\partial \hat{\boldsymbol{\mu}}_{jk}}
        = \sum_{t=1}^{T} \gamma_t(j,k)\, \hat{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk}) = 0
    \;\Rightarrow\;
    \hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

  (Here  \frac{d\, \mathbf{x}^T \mathbf{C}\, \mathbf{x}}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x},  and \hat{\boldsymbol{\Sigma}}_{jk} is symmetric)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, setting the derivative of Q_b with respect to \hat{\boldsymbol{\Sigma}}_{jk} to zero gives the covariance re-estimate

    \hat{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})^T}
                                          {\sum_{t=1}^{T} \gamma_t(j,k)}

  (Using  \frac{\partial \log\det \mathbf{X}}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^T  and  \frac{\partial\, \mathbf{a}^T \mathbf{X}\, \mathbf{b}}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^T;  \hat{\boldsymbol{\Sigma}}_{jk} is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    \hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)\, \mathbf{o}_t}
                                       {\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)}

    \hat{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)\,
                                           (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})^T}
                                          {\sum_{t=1}^{T} p(s_t = j, k_t = k \mid \mathbf{O}, \lambda)}

    \hat{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k'=1}^{M} \gamma_t(j,k')}

kjc


SP - Berlin Chen 51

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulae for A is

    \xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid \mathbf{O}, \lambda)
        (expected number of transitions from state i to state j at time t)

    \gamma_t(i) = P(s_t = i \mid \mathbf{O}, \lambda) = \sum_{j=1}^{N} \xi_t(i,j)
        (expected frequency/number of times in state i at time t)

    \hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}
                        {\text{expected number of transitions from state } i}
                 = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

  (Formulae for a single training utterance)

SP - Berlin Chen 52

Basic Problem 3 of HMM: Intuitive View (cont.)

• A set of reasonable re-estimation formulae for B is
  – For discrete and finite observations,  b_j(v_k) = P(o_t = v_k | s_t = j):

      \hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}
                            {\text{expected number of times in state } j}
                     = \frac{\sum_{t=1,\, \mathbf{o}_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

  – For continuous and infinite observations,  b_j(\mathbf{v}) = f_{O|S}(\mathbf{o}_t = \mathbf{v} \mid s_t = j):

      b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{v}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})
                      = \sum_{k=1}^{M} c_{jk} \frac{1}{(2\pi)^{L/2}|\boldsymbol{\Sigma}_{jk}|^{1/2}}
                        \exp\!\left( -\tfrac{1}{2}(\mathbf{v}-\boldsymbol{\mu}_{jk})^T \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{v}-\boldsymbol{\mu}_{jk}) \right)

      (modeled as a mixture of multivariate Gaussian distributions)

SP - Berlin Chen 53

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)
    • Define a new variable \gamma_t(j,k), the probability of being in state j at time t with the k-th mixture component accounting for \mathbf{o}_t:

        \gamma_t(j,k) = P(s_t = j, m_t = k \mid \mathbf{O}, \lambda)
                      = P(s_t = j \mid \mathbf{O}, \lambda)\, P(m_t = k \mid s_t = j, \mathbf{O}, \lambda)
                      = \gamma_t(j)\, \frac{c_{jk}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}
                                           {\sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})}

        (the observation-independence assumption is applied;  p(A,B) = p(B)\, p(A|B))

      Note:  \gamma_t(j) = \sum_{m=1}^{M} \gamma_t(j,m)

[Figure: the distribution for state 1 as a mixture of Gaussians N1, N2, N3 with weights c11, c12, c13]

SP - Berlin Chen 54

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.)

    \hat{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m)}
        (expected number of times in state j and mixture k, divided by the expected number of times in state j)

    \hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}
        (weighted average (mean) of the observations at state j and mixture k)

    \hat{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \hat{\boldsymbol{\mu}}_{jk})^T}{\sum_{t=1}^{T} \gamma_t(j,k)}
        (weighted covariance of the observations at state j and mixture k)

  (Formulae for a single training utterance)

SP - Berlin Chen 55

Basic Problem 3 of HMM: Intuitive View (cont.)

• Multiple Training Utterances

[Figure: several utterances of the same word (e.g. "台師大") are each aligned to the 3-state HMM (s1, s2, s3), and the Forward-Backward (FB) statistics are accumulated over all of them]

SP - Berlin Chen 56

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For continuous and infinite observations (cont.): formulae for multiple (L) training utterances

    \hat{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^{\,l}(i)
        (expected frequency/number of times in state i at time t = 1)

    \hat{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^{\,l}(i,j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^{\,l}(i)}
        (expected number of transitions from state i to state j, divided by the expected number of transitions from state i)

    \hat{c}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{\,l}(j,k)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \sum_{m=1}^{M} \gamma_t^{\,l}(j,m)}
        (expected number of times in state j and mixture k, divided by the expected number of times in state j)

    \hat{\boldsymbol{\mu}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{\,l}(j,k)\, \mathbf{o}_t^{\,l}}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{\,l}(j,k)}
        (weighted average (mean) of the observations at state j and mixture k)

    \hat{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{\,l}(j,k)\, (\mathbf{o}_t^{\,l} - \hat{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t^{\,l} - \hat{\boldsymbol{\mu}}_{jk})^T}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{\,l}(j,k)}
        (weighted covariance of the observations at state j and mixture k)

SP - Berlin Chen 57

Basic Problem 3 of HMM: Intuitive View (cont.)

  – For discrete and finite observations (cont.): formulae for multiple (L) training utterances

    \hat{\pi}_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^{\,l}(i)
        (expected frequency/number of times in state i at time t = 1)

    \hat{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^{\,l}(i,j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^{\,l}(i)}
        (expected number of transitions from state i to state j, divided by the expected number of transitions from state i)

    \hat{b}_j(v_k) = \frac{\sum_{l=1}^{L} \sum_{t=1,\, \mathbf{o}_t^l = v_k}^{T_l} \gamma_t^{\,l}(j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^{\,l}(j)}
        (expected number of times in state j observing symbol v_k, divided by the expected number of times in state j)

SP - Berlin Chen 58

Semicontinuous HMMs

• The HMM state mixture density functions are tied together across all the models to form a set of shared kernels
  – The semicontinuous, or tied-mixture, HMM

    b_j(\mathbf{o}) = \sum_{k=1}^{M} b_j(k)\, f(\mathbf{o} \mid v_k) = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{o}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

    (state output probability of state j; b_j(k) is the k-th mixture weight of state j, which is discrete and model-dependent;
     f(\mathbf{o}|v_k) is the k-th mixture density function, or k-th codeword, shared across HMMs, and M is very large)

  – A combination of the discrete HMM and the continuous HMM
    • A combination of discrete, model-dependent weight coefficients and continuous, model-independent codebook probability density functions
  – Because M is large, we can simply use the L most significant values
    • Experience showed that an L of about 1~3% of M is adequate
  – Partial tying of f(\mathbf{o}|v_k) for different phonetic classes
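A sketch of the tied-mixture output computation, restricted to the top-L codeword densities for each frame as the slide suggests; frame_density[] and topL_index[] are assumed to have been produced by a shared codebook evaluation step for the current frame:

    /* semicontinuous output probability for one state: model-dependent weights times
       shared codeword densities, restricted to the L most significant codewords      */
    double sc_output_prob(int L_top, const int *topL_index, const double *frame_density,
                          const double *state_weight /* b_j(k), k = 0..M-1 */)
    {
        double b = 0.0;
        for (int r = 0; r < L_top; r++) {
            int k = topL_index[r];                 /* index of an active codeword */
            b += state_weight[k] * frame_density[k];
        }
        return b;
    }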

SP - Berlin Chen 59

Semicontinuous HMMs (cont.)

[Figure: two HMMs, each with states s1, s2, s3; every state j keeps its own discrete weight vector (b_j(1), ..., b_j(k), ..., b_j(M)), while all states share the same codebook of Gaussian kernels N(μ1, Σ1), N(μ2, Σ2), ..., N(μk, Σk), ..., N(μM, ΣM)]

SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate to model the speech signal (also called the "beads-on-a-string" model)
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)
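A small sketch of what a left-to-right topology looks like as a transition matrix, here for a hypothetical 3-state phone model with self-loop probability 0.6 (illustrative values, not taken from the slide):

    #define S 3

    /* left-to-right (Bakis) topology: each state may stay or move one state forward */
    void init_left_to_right(double a[S][S], double self_loop)
    {
        for (int i = 0; i < S; i++)
            for (int j = 0; j < S; j++)
                a[i][j] = 0.0;                       /* no backward or skip transitions   */
        for (int i = 0; i < S; i++) {
            a[i][i] = self_loop;                     /* stay in the same state            */
            if (i + 1 < S)
                a[i][i+1] = 1.0 - self_loop;         /* advance to the next state         */
            else
                a[i][i] = 1.0;                       /* last state: exit handled elsewhere */
        }
    }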

SP - Berlin Chen 61

Initialization of HMM

• A good initialization of HMM training: Segmental K-Means segmentation into states
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: the set of training observation sequences is segmented into states, based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2 (see the sketch after this list for the discrete case):
    • For a discrete-density HMM (using an M-codeword codebook):
        b̂_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)
    • For a continuous-density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state into a set of M clusters, then
        ŵ_jm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j
        μ̂_jm = sample mean of the vectors classified in cluster m of state j
        Σ̂_jm = sample covariance matrix of the vectors classified in cluster m of state j
  – Step 3: evaluate the model score; if the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop, and the initial model is generated

[Figure: a left-to-right 3-state HMM s1, s2, s3]
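A toy sketch of Step 2 for the discrete case, assuming the Viterbi segmentation and vector quantizer have already produced per-frame arrays state[] and codeword[] (hypothetical names):

    /* Step 2, discrete case: relative frequency of each codeword per state */
    void estimate_discrete_b(int T, const int *state, const int *codeword,
                             int N, int M, double *b_hat /* N*M, row-major */)
    {
        for (int j = 0; j < N; j++) {
            int total = 0;
            for (int k = 0; k < M; k++) b_hat[j*M + k] = 0.0;
            for (int t = 0; t < T; t++)
                if (state[t] == j) { b_hat[j*M + codeword[t]] += 1.0; total++; }
            for (int k = 0; k < M; k++)
                if (total > 0) b_hat[j*M + k] /= total;   /* count / frames in state j */
        }
    }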

SP - Berlin Chen 62

Initialization of HMM (cont.)

[Flow chart: the Training Data and an Initial Model feed a loop of
 State Sequence Segmentation -> Estimate the observation parameters via Segmental K-means -> Model Re-estimation ->
 Model Convergence? If NO, repeat the loop; if YES, output the Model Parameters]

SP - Berlin Chen 63

Initialization of HMM (cont.)

• An example for a discrete HMM
  – 3 states and a 2-codeword codebook (v1, v2)
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3

[Figure: ten observations O1 .. O10 segmented across the states s1, s2, s3 of a left-to-right model; counting how many v1/v2 codewords fall into each state's segment yields the estimates above]

SP - Berlin Chen 64

Initialization of HMM (cont.)

• An example for a continuous HMM
  – 3 states and 4 Gaussian mixtures per state

[Figure: observations O1 .. ON segmented across the states s1, s2, s3; within each state the vectors are split by K-means (global mean -> cluster means), giving per-state mixture parameters such as (μ11, Σ11, c11), (μ12, Σ12, c12), (μ13, Σ13, c13), (μ14, Σ14, c14)]

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 52: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 52

Basic Problem 3 of HMMIntuitive View (cont)

bull A set of reasonable re-estimation formula for B isndash For discrete and finite observation bj(vk)=P(ot=vk|st=j)

ndash For continuous and infinite observation bj(v)=fO|S(ot=v|st=j)

statein timesofnumber expected symbol observing and statein timesofnumber expected

T

1t

T

such that 1t

j

j

jjjsPb

t

t

kkkj

k

vovvov

M

kjk

tjk

jk

Ljk

M

kjkjkjkj jk

cNcb1

1 21

1 21exp

2

1 μvμvΣ

Σμvv

Modeled as a mixture of multivariate Gaussian distributions

SP - Berlin Chen 53

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)bull Define a new variable

ndash is the probability of being in state j at time twith the k-th mixture component accounting for ot

M

mjmjmtjm

jkjktjkN

stt

tt

tt

tttttt

t

ttttt

t

ttt

ttt

ttt

ttt

Nc

Nc

ss

jj

jspkmjspjskmP

j

jspkmjspjskmP

j

jspjskmp

j

jskmPj

jskmPjsP

kmjsPkj

11

applied) is assumptiont independen-on(observati

Σμo

Σμo

λoλoλ

λOλOλ

λOλO

λO

λOλO

λO

c11

c12

c13

N1

N2N3

Distribution for State 1

kjt kjt

M

mtt mjj

1 Note

BP

BApBAp

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

    Reference : 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
    ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

    (in the original file every character is accompanied by two numeric fields, shown as
     "100000 100000"; they are omitted here)

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, 100, 200, and all 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

    ------------------------ Overall Results ---------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    =====================================================================
    ------------------------ Overall Results ---------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    =====================================================================
    ------------------------ Overall Results ---------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    =====================================================================
    ------------------------ Overall Results ---------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
    =====================================================================
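To read these HResults-style lines: with H hits, D deletions, S substitutions, I insertions,
and N = H + D + S reference tokens, the two percentages follow the usual HTK conventions.
A worked check on the 506-story line, in LaTeX:

    \%\mathrm{Corr} = 100\cdot\frac{H}{N} = 100\cdot\frac{57144}{65812} \approx 86.83\%,
    \qquad
    \mathrm{Acc} = 100\cdot\frac{H-I}{N} = 100\cdot\frac{57144-504}{65812} \approx 86.06\% .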

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

• Two-bottle example (bottles A and B, holding Red and Green balls)
  – Observed data O: the "ball sequence"
  – Latent data S: the "bottle sequence"
  – Parameters λ to be estimated so as to maximize log P(O|λ):
      P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a 3-state HMM (s1, s2, s3) with transition probabilities 0.6, 0.7, 0.3, 0.3, 0.2,
 0.2, 0.1, 0.3, 0.7 and state output distributions {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2},
 {A:.3, B:.6, C:.1}; given the observations o1 o2 …… oT, training re-estimates λ as λ̄ so
 that p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate
      variables, called latent data. In our case here, the state sequence is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or
      difficult. In our case here, it is almost impossible to estimate (A, B) without
      consideration of the state sequence.
  – Two Major Steps
    • E: take the expectation E_S[ · | O, λ ] with respect to the latent data, using the
      current estimate of the parameters and conditioned on the observations
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML)
      or Maximum A Posteriori (MAP) criterion
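As a self-contained toy illustration of this E/M loop (an illustrative sketch, not taken from
the slides): fit a two-component, unit-variance Gaussian mixture to a few scalar points. The
E-step computes the posterior of the latent component for each point, and the M-step
re-estimates the weights and means from those posteriors.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double x[] = { -2.1, -1.9, -2.3, 1.8, 2.2, 2.0, 1.7 };
        const int n = 7;
        double w[2]  = { 0.5, 0.5 };     /* mixture weights  */
        double mu[2] = { -1.0, 1.0 };    /* component means  */

        for (int it = 0; it < 50; it++) {
            double resp[7][2];                       /* E-step: component posteriors      */
            for (int i = 0; i < n; i++) {
                double g0 = w[0] * exp(-0.5 * (x[i] - mu[0]) * (x[i] - mu[0]));
                double g1 = w[1] * exp(-0.5 * (x[i] - mu[1]) * (x[i] - mu[1]));
                resp[i][0] = g0 / (g0 + g1);
                resp[i][1] = g1 / (g0 + g1);
            }
            for (int k = 0; k < 2; k++) {            /* M-step: weighted ML re-estimates  */
                double occ = 0.0, sum = 0.0;
                for (int i = 0; i < n; i++) { occ += resp[i][k]; sum += resp[i][k] * x[i]; }
                w[k]  = occ / n;
                mu[k] = sum / occ;
            }
        }
        printf("w = (%.2f, %.2f), mu = (%.2f, %.2f)\n", w[0], w[1], mu[0], mu[1]);
        return 0;   /* converges to roughly w = (0.43, 0.57), mu = (-2.1, 1.9) */
    }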

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = {x_1, x_2, …, x_n}

  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood
    p(X|Φ) is maximum. For example, if Φ = {μ, Σ} are the parameters of a multivariate normal
    distribution and X is i.i.d. (independent, identically distributed), then the ML estimates
    of μ and Σ are

        μ_ML = (1/n) Σ_{i=1..n} x_i
        Σ_ML = (1/n) Σ_{i=1..n} (x_i - μ_ML)(x_i - μ_ML)^T

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the
    posterior probability p(Φ|x) is maximum

  ML and MAP
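A small sketch of the ML estimate above in code (illustrative assumptions: the covariance is
restricted to its diagonal, and the function name and data layout are made up here, not part
of the slides):

    /* x is an n-by-dim matrix of i.i.d. observations, stored row-major. */
    void ml_gaussian_estimate(int n, int dim, const double *x, double *mu, double *var)
    {
        for (int d = 0; d < dim; d++) {
            double s = 0.0, ss = 0.0;
            for (int i = 0; i < n; i++) s += x[i*dim + d];
            mu[d] = s / n;                                 /* mu_ML = (1/n) sum_i x_i      */
            for (int i = 0; i < n; i++) {
                double diff = x[i*dim + d] - mu[d];
                ss += diff * diff;
            }
            var[d] = ss / n;                               /* diagonal entry of Sigma_ML   */
        }
    }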

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters that maximize the log-likelihood of the incomplete data,
    log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the
    complete data, log P(O, S|λ)

• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the
      underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ, and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to
  the probability P(O, S|λ), and compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?

– Algorithm
  • Log-likelihood expression, with the expectation taken over S:

        P(O, S|λ) = P(S|O, λ) · P(O|λ)                        (Bayes' rule)

        log P(O|λ) = log P(O, S|λ) - log P(S|O, λ)
        (incomplete-data likelihood on the left; complete-data likelihood and the
         latent-data posterior, under the unknown model setting, on the right)

  • Taking the expectation over S with respect to P(S|O, λ):

        log P(O|λ) = Σ_S P(S|O, λ) log P(O, S|λ) - Σ_S P(S|O, λ) log P(S|O, λ)

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express log P(O|λ̄) as follows:

        log P(O|λ̄) = Q(λ, λ̄) - H(λ, λ̄)

    where

        Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
        H(λ, λ̄) = Σ_S P(S|O, λ) log P(S|O, λ̄)

  • We want log P(O|λ̄) ≥ log P(O|λ), i.e.

        Q(λ, λ̄) - H(λ, λ̄) ≥ Q(λ, λ) - H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property: H(λ, λ̄) ≤ H(λ, λ)

      H(λ, λ̄) - H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ̄) / P(S|O, λ) ]    (Kullback-Leibler (KL) distance)
                         ≤ Σ_S P(S|O, λ) [ P(S|O, λ̄) / P(S|O, λ) - 1 ]    (Jensen's inequality: log x ≤ x - 1)
                         = Σ_S P(S|O, λ̄) - Σ_S P(S|O, λ) = 1 - 1 = 0

– Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function
  (auxiliary function)

      Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)

  i.e., the expectation of the complete-data log-likelihood with respect to the latent
  state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function

        Q(λ, λ̄) = Σ_S P(S|O, λ) log P(O, S|λ̄)
                 = Σ_S [ P(O, S|λ) / P(O|λ) ] · log P(O, S|λ̄)

  – Where P(O, S|λ) and log P(O, S|λ̄) can be expressed as

        P(O, S|λ)     = π_{s_1} b_{s_1}(o_1) · Π_{t=2..T} a_{s_{t-1} s_t} b_{s_t}(o_t)

        log P(O, S|λ̄) = log π̄_{s_1} + Σ_{t=2..T} log ā_{s_{t-1} s_t} + Σ_{t=1..T} log b̄_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄), where

    Q_π(λ, π̄) = Σ_{i=1..N} [ P(O, s_1 = i | λ) / P(O|λ) ] · log π̄_i

    Q_a(λ, ā) = Σ_{i=1..N} Σ_{j=1..N} Σ_{t=1..T-1}
                 [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] · log ā_ij

    Q_b(λ, b̄) = Σ_{j=1..N} Σ_{all k} Σ_{t: o_t = v_k}
                 [ P(O, s_t = j | λ) / P(O|λ) ] · log b̄_j(k)

  Each term is a sum of the form Σ_i w_i log y_i.
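For reference (this connection is not spelled out on this slide, but it matches the Intuitive
View slides earlier in the notes), the weights appearing in Q_a and Q_b are exactly the
Forward-Backward posteriors, in LaTeX:

    \xi_t(i,j) = \frac{P(\mathbf{O}, s_t = i, s_{t+1} = j \mid \lambda)}{P(\mathbf{O} \mid \lambda)},
    \qquad
    \gamma_t(i) = \frac{P(\mathbf{O}, s_t = i \mid \lambda)}{P(\mathbf{O} \mid \lambda)} .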

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_ij, and b̄_j(k)
  – They can be maximized individually
  – All are of the same form:

        F(y_1, y_2, …, y_N) = Σ_{j=1..N} w_j log y_j ,   where Σ_{j=1..N} y_j = 1 and y_j ≥ 0

    F has its maximum value when

        y_j = w_j / ( Σ_{j'=1..N} w_{j'} )

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier ℓ with the constraint Σ_{j=1..N} y_j = 1

    By applying the Lagrange multiplier:
        F = Σ_{j=1..N} w_j log y_j + ℓ ( Σ_{j=1..N} y_j - 1 )

    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = -ℓ · y_j   for every j

    Summing over j and using the constraint:   Σ_{j=1..N} w_j = -ℓ · Σ_{j=1..N} y_j = -ℓ

    Therefore   y_j = w_j / ( Σ_{j'=1..N} w_{j'} )

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as

    π̄_i = P(O, s_1 = i | λ) / P(O|λ)

    ā_ij = Σ_{t=1..T-1} P(O, s_t = i, s_{t+1} = j | λ)  /  Σ_{t=1..T-1} P(O, s_t = i | λ)

    b̄_i(k) = Σ_{t: o_t = v_k, 1 ≤ t ≤ T} P(O, s_t = i | λ)  /  Σ_{t=1..T} P(O, s_t = i | λ)
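A compact M-step sketch of these updates for a single training utterance is given below. It
is illustrative only (the names and fixed sizes are assumptions, not from the slides): it
assumes the posteriors gamma[t][i] = P(s_t = i | O, λ) and xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, λ)
have already been produced by the Forward-Backward pass, and obs[t] holds the codeword index
of o_t. The multiple-utterance version simply sums the same statistics over utterances.

    #define N 3      /* number of states     */
    #define K 2      /* number of codewords  */
    #define T 10     /* number of frames     */

    void reestimate_discrete(const double gamma[T][N], const double xi[T-1][N][N],
                             const int obs[T],
                             double pi_new[N], double a_new[N][N], double b_new[N][K])
    {
        for (int i = 0; i < N; i++) {
            double occ   = 0.0;   /* expected number of times in state i, t = 1..T-1 */
            double occ_T = 0.0;   /* expected number of times in state i, t = 1..T   */

            pi_new[i] = gamma[0][i];                       /* pi_i = gamma_1(i)       */

            for (int t = 0; t < T - 1; t++) occ   += gamma[t][i];
            for (int t = 0; t < T;     t++) occ_T += gamma[t][i];

            for (int j = 0; j < N; j++) {                  /* a_ij = sum_t xi / sum_t gamma */
                double num = 0.0;
                for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
                a_new[i][j] = num / occ;
            }

            for (int k = 0; k < K; k++) {                  /* b_i(k): only frames with o_t = v_k */
                double num = 0.0;
                for (int t = 0; t < T; t++) if (obs[t] == k) num += gamma[t][i];
                b_new[i][k] = num / occ_T;
            }
        }
    }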

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a
  continuous space
  – The difference between the discrete and the continuous HMM lies in a different form
    of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from
    the continuous space to the discrete space

• Continuous-Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian
    mixture density functions (M mixtures):

        b_j(o) = Σ_{k=1..M} c_jk · b_jk(o) = Σ_{k=1..M} c_jk · N(o; μ_jk, Σ_jk),
        with Σ_{k=1..M} c_jk = 1 for every state j

        N(o; μ_jk, Σ_jk) = (2π)^{-L/2} |Σ_jk|^{-1/2} exp( -(1/2) (o - μ_jk)^T Σ_jk^{-1} (o - μ_jk) )

  [Figure: the distribution for state i drawn as the weighted sum w_i1·N_1 + w_i2·N_2 + w_i3·N_3.]
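A sketch of this state output probability in code, under the common ASR simplification of
diagonal covariances (the slide writes the full-covariance form; names and fixed sizes below
are assumptions for illustration):

    #include <math.h>

    #define DIM 13    /* feature dimension L (illustrative)  */
    #define MIX 4     /* number of mixture components M      */

    /* c[k] are the mixture weights c_jk; mean[k][d] and var[k][d] the Gaussian parameters. */
    double state_output_prob(const double o[DIM], const double c[MIX],
                             const double mean[MIX][DIM], const double var[MIX][DIM])
    {
        const double LOG_2PI = 1.8378770664093453;   /* log(2*pi) */
        double b = 0.0;

        for (int k = 0; k < MIX; k++) {
            double log_g = -0.5 * DIM * LOG_2PI;     /* log of the normalization constant */
            for (int d = 0; d < DIM; d++) {
                double diff = o[d] - mean[k][d];
                log_g -= 0.5 * (log(var[k][d]) + diff * diff / var[k][d]);
            }
            b += c[k] * exp(log_g);                  /* c_jk * N(o; mu_jk, Sigma_jk)      */
        }
        return b;
    }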

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

    p(O, S|λ) = π_{s_1} Π_{t=2..T} a_{s_{t-1} s_t} · Π_{t=1..T} [ Σ_{k=1..M} c_{s_t k} b_{s_t k}(o_t) ]

    p(O|λ) = Σ_S p(O, S|λ) = Σ_S Σ_K p(O, S, K|λ),   where

    p(O, S, K|λ) = π_{s_1} Π_{t=2..T} a_{s_{t-1} s_t} · Π_{t=1..T} c_{s_t k_t} b_{s_t k_t}(o_t)

  and K = (k_1, k_2, …, k_T) is one of the possible mixture-component sequences along the
  state sequence S.

  Note (expanding a product of sums into a sum of products):

    Π_{t=1..T} [ Σ_{k=1..M} a_{t k} ]
      = (a_{11} + a_{12} + … + a_{1M})(a_{21} + a_{22} + … + a_{2M}) … (a_{T1} + … + a_{TM})
      = Σ_{k_1=1..M} Σ_{k_2=1..M} … Σ_{k_T=1..M} Π_{t=1..T} a_{t k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ̄) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K|λ̄)
             = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] · log p(O, S, K|λ̄)

  with

    log p(O, S, K|λ̄) = log π̄_{s_1} + Σ_{t=2..T} log ā_{s_{t-1} s_t}
                        + Σ_{t=1..T} log b̄_{s_t k_t}(o_t) + Σ_{t=1..T} log c̄_{s_t k_t}

  so that Q = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
  (initial probabilities, state transition probabilities, Gaussian density functions, and
   mixture-component weights, respectively).

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

    Q_b(λ, b̄) = Σ_{j=1..N} Σ_{k=1..M} Σ_{t=1..T} P(s_t = j, k_t = k | O, λ) · log b̄_jk(o_t)

    Q_c(λ, c̄) = Σ_{j=1..N} Σ_{k=1..M} Σ_{t=1..T} P(s_t = j, k_t = k | O, λ) · log c̄_jk

  The posterior P(s_t = j, k_t = k | O, λ) is written γ_t(j, k) below.
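As a reminder of how this posterior is obtained in practice (it is the quantity γ_t(j,k)
defined in the Intuitive View part of these notes, computed from the Forward-Backward
variables α and β), in LaTeX:

    \gamma_t(j,k) = \gamma_t(j)\,
      \frac{c_{jk}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}
           {\sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})},
    \qquad
    \gamma_t(j) = \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)} .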

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). For the mean vectors:

    Q_b(λ, b̄_jk) = Σ_{t=1..T} γ_t(j, k) · log b̄_jk(o_t)

    log b̄_jk(o_t) = -(L/2) log(2π) - (1/2) log |Σ̄_jk|
                     - (1/2) (o_t - μ̄_jk)^T Σ̄_jk^{-1} (o_t - μ̄_jk)

  Setting the derivative with respect to μ̄_jk to zero
  (using d(x^T C x)/dx = (C + C^T)x, and Σ̄_jk is symmetric here):

    ∂Q_b/∂μ̄_jk = Σ_{t=1..T} γ_t(j, k) Σ̄_jk^{-1} (o_t - μ̄_jk) = 0

    ⇒   μ̄_jk = Σ_{t=1..T} γ_t(j, k) · o_t  /  Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, take the derivative with respect to Σ̄_jk^{-1}
  (using d log det(X)/dX = (X^{-1})^T and d(a^T X b)/dX = a b^T, with Σ̄_jk symmetric here):

    Q_b(λ, b̄_jk) = Σ_{t=1..T} γ_t(j, k) [ -(L/2) log(2π) + (1/2) log |Σ̄_jk^{-1}|
                     - (1/2) (o_t - μ̄_jk)^T Σ̄_jk^{-1} (o_t - μ̄_jk) ]

    ∂Q_b/∂Σ̄_jk^{-1} = (1/2) Σ_{t=1..T} γ_t(j, k) [ Σ̄_jk - (o_t - μ̄_jk)(o_t - μ̄_jk)^T ] = 0

    ⇒   Σ̄_jk = Σ_{t=1..T} γ_t(j, k) (o_t - μ̄_jk)(o_t - μ̄_jk)^T  /  Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and each mixture weight can be
  expressed as

    μ̄_jk = Σ_{t=1..T} p(s_t = j, k_t = k | O, λ) · o_t
            /  Σ_{t=1..T} p(s_t = j, k_t = k | O, λ)

    Σ̄_jk = Σ_{t=1..T} p(s_t = j, k_t = k | O, λ) · (o_t - μ̄_jk)(o_t - μ̄_jk)^T
            /  Σ_{t=1..T} p(s_t = j, k_t = k | O, λ)

    c̄_jk = Σ_{t=1..T} γ_t(j, k)  /  Σ_{t=1..T} Σ_{m=1..M} γ_t(j, m)
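Below is a compact sketch of these mixture updates in code, with the usual diagonal-covariance
simplification. It is illustrative only: the array names and fixed sizes are assumptions,
gm[t][j][k] stands for the E-step posterior γ_t(j,k), and obs[t][d] for the observation o_t.

    #define NSTATE 3
    #define NMIX   4
    #define DIM    13
    #define NFRAME 100

    void reestimate_mixtures(const double gm[NFRAME][NSTATE][NMIX],
                             const double obs[NFRAME][DIM],
                             double c_new[NSTATE][NMIX],
                             double mu_new[NSTATE][NMIX][DIM],
                             double var_new[NSTATE][NMIX][DIM])
    {
        for (int j = 0; j < NSTATE; j++) {
            double state_occ = 0.0;                       /* sum_t sum_k gamma_t(j,k) */
            for (int t = 0; t < NFRAME; t++)
                for (int k = 0; k < NMIX; k++)
                    state_occ += gm[t][j][k];

            for (int k = 0; k < NMIX; k++) {
                double occ = 0.0;                         /* sum_t gamma_t(j,k)       */
                for (int t = 0; t < NFRAME; t++) occ += gm[t][j][k];

                c_new[j][k] = occ / state_occ;            /* mixture weight update    */

                for (int d = 0; d < DIM; d++) {           /* weighted mean            */
                    double num = 0.0;
                    for (int t = 0; t < NFRAME; t++) num += gm[t][j][k] * obs[t][d];
                    mu_new[j][k][d] = num / occ;
                }
                for (int d = 0; d < DIM; d++) {           /* weighted (diagonal) covariance */
                    double num = 0.0;
                    for (int t = 0; t < NFRAME; t++) {
                        double diff = obs[t][d] - mu_new[j][k][d];
                        num += gm[t][j][k] * diff * diff;
                    }
                    var_new[j][k][d] = num / occ;
                }
            }
        }
    }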



Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 54: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 54

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

T

1t

T

1t mixture and stateat nsobservatio of (mean) average weightedkj

kjkj

t

tt

jk

jmγ

jkγ

jkjc T

1t

M

1m t

T

1t t

jk

statein timesofnumber expected

mixture and statein timesofnumber expected

T

1t

T

1t

mixture and stateat nsobservatio of covariance weighted

kj

kj

kj

t

tjktjktt

jk

μoμo

Σ

Formulae for Single Training Utterance

SP - Berlin Chen 55

Basic Problem 3 of HMMIntuitive View (cont)

bull Multiple Training Utterances

台師大s2

s1

s3

FB FB FB

SP - Berlin Chen 56

Basic Problem 3 of HMMIntuitive View (cont)

ndash For continuous and infinite observation (Cont)

L

l

T

t

lt

L

l

T

tt

lt

jkl

l

kj

kjkj

1 1

1 1

mixture and stateat nsobservatio of (mean) average weighted

jmγ

jkγ

jkjc

L

l

T

t

M

m

lt

L

l

T

t

lt

jkl

l

1 1 1

1 1 statein timesofnumber expected

mixture and statein timesofnumber expected

L

l

T

t

lt

L

l

T

t

tjktjkt

lt

jk

l

l

kj

kj

kj

1 1

1 1

mixture and stateat nsobservatio of covariance weighted

μoμo

Σ

Formulae for Multiple (L) Training Utterances

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\lambda} = (\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\boldsymbol{\pi}})$ can be expressed as:

$\bar{\pi}_i = \dfrac{P(O, s_1 = i|\lambda)}{P(O|\lambda)}$

$\bar{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j|\lambda)\,/\,P(O|\lambda)}{\sum_{t=1}^{T-1} P(O, s_t = i|\lambda)\,/\,P(O|\lambda)}$

$\bar{b}_i(k) = \dfrac{\sum_{t=1,\; o_t = v_k}^{T} P(O, s_t = i|\lambda)\,/\,P(O|\lambda)}{\sum_{t=1}^{T} P(O, s_t = i|\lambda)\,/\,P(O|\lambda)}$
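In practice these ratios are the state posteriors delivered by the forward-backward procedure. A hedged numpy sketch (not from the slides), assuming gamma[t, i] = P(s_t = i | O, λ) and xi[t, i, j] = P(s_t = i, s_{t+1} = j | O, λ) have already been computed:

import numpy as np

def reestimate_discrete(gamma, xi, obs, K):
    """One Baum-Welch M-step for a discrete HMM.
    gamma: (T, N) state posteriors, xi: (T-1, N, N) transition posteriors,
    obs: length-T list of symbol indices in 0..K-1 (assumed given)."""
    T, N = gamma.shape
    pi_new = gamma[0]                                            # expected count of starting in state i
    a_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # expected transitions / expected visits
    b_new = np.zeros((N, K))
    for t, o in enumerate(obs):
        b_new[:, o] += gamma[t]                                  # expected emissions of v_k from state i
    b_new /= gamma.sum(axis=0)[:, None]
    return pi_new, a_new, b_new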

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and continuous HMM lies in a different form of state output probability
– Discrete HMM requires the quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM:
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

$b_j(o) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(o) = \sum_{k=1}^{M} c_{jk}\, N(o; \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} \dfrac{c_{jk}}{(2\pi)^{L/2}|\Sigma_{jk}|^{1/2}}\exp\left(-\tfrac{1}{2}(o-\mu_{jk})^{T}\Sigma_{jk}^{-1}(o-\mu_{jk})\right)$,  with $\sum_{k=1}^{M} c_{jk} = 1$

(Figure: the distribution for a state drawn as a weighted sum of Gaussian components $N_1, N_2, N_3$ with mixture weights $w_{i1}, w_{i2}, w_{i3}$.)
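As a sketch (illustrative only; the toy 2-mixture parameters are made up), the state-output probability of a Gaussian-mixture state can be evaluated directly from this formula:

import numpy as np

def gmm_state_output(o, c, mu, sigma):
    """b_j(o) for one state: c (M,) mixture weights, mu (M, L) means, sigma (M, L, L) covariances."""
    M, L = mu.shape
    total = 0.0
    for k in range(M):
        diff = o - mu[k]
        quad = diff @ np.linalg.solve(sigma[k], diff)             # (o - mu)^T Sigma^{-1} (o - mu)
        norm = (2 * np.pi) ** (L / 2) * np.sqrt(np.linalg.det(sigma[k]))
        total += c[k] * np.exp(-0.5 * quad) / norm
    return total

# Usage sketch: a 2-mixture, 2-dimensional state
c = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = np.array([np.eye(2), 0.5 * np.eye(2)])
print(gmm_state_output(np.array([0.5, 0.5]), c, mu, sigma))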

EM Applied to Continuous HMM Training (2/7)

• Express $P(O,S|\lambda)$ with respect to each single mixture component $b_{jk}(o)$ rather than $b_j(o)$:

$P(O,S|\lambda) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}}\prod_{t=1}^{T} b_{s_t}(o_t) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}}\prod_{t=1}^{T}\sum_{k=1}^{M} c_{s_t k}\, b_{s_t k}(o_t)$

$p(O|\lambda) = \sum_{S}\sum_{K} p(O, S, K|\lambda)$,  where

$p(O, S, K|\lambda) = \pi_{s_1}\prod_{t=1}^{T-1} a_{s_t s_{t+1}}\prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(o_t)$

and $K = (k_1, k_2, \ldots, k_T)$ is one of the possible mixture component sequences along the state sequence $S$.

Note: $\prod_{t=1}^{T}\sum_{k=1}^{M} a_{tk} = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t k_t}$, i.e. $(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots$ expands into a sum over all index sequences.
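This interchange of product and sum is easy to confirm numerically (an illustrative sketch with arbitrary values):

import numpy as np
from itertools import product

rng = np.random.default_rng(2)
a = rng.random((4, 3))                                     # a[t, k] with T=4, M=3
lhs = np.prod(a.sum(axis=1))                               # prod_t sum_k a_{tk}
rhs = sum(np.prod([a[t, k] for t, k in enumerate(ks)])     # sum over all mixture sequences K
          for ks in product(range(3), repeat=4))
print(np.isclose(lhs, rhs))                                # True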

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

$Q(\lambda,\bar{\lambda}) = \sum_{S}\sum_{K} P(S,K|O,\lambda)\log p(O,S,K|\bar{\lambda}) = \sum_{S}\sum_{K} \dfrac{p(O,S,K|\lambda)}{p(O|\lambda)}\log p(O,S,K|\bar{\lambda})$

$\log p(O,S,K|\bar{\lambda}) = \log\bar{\pi}_{s_1} + \sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T}\log\bar{b}_{s_t k_t}(o_t) + \sum_{t=1}^{T}\log\bar{c}_{s_t k_t}$

$Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a}) + Q_{b}(\lambda,\bar{b}) + Q_{c}(\lambda,\bar{c})$

(initial probabilities, state transition probabilities, Gaussian density functions, and mixture components, respectively)

EM Applied to Continuous HMM Training (4/7)

• The only difference we have, when compared with Discrete HMM training:

$Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t = j, k_t = k|O,\lambda)\log\bar{b}_{jk}(o_t)$

$Q_{c}(\lambda,\bar{c}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t = j, k_t = k|O,\lambda)\log\bar{c}_{jk}$

with the mixture-level posterior written as $\gamma_t(j,k) = P(s_t = j, k_t = k|O,\lambda)$.

EM Applied to Continuous HMM Training (5/7)

Let $\gamma_t(j,k) = P(s_t = j, k_t = k|O,\lambda)$ and

$\bar{b}_{jk}(o_t) = \dfrac{1}{(2\pi)^{L/2}|\bar{\Sigma}_{jk}|^{1/2}}\exp\left(-\tfrac{1}{2}(o_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}(o_t-\bar{\mu}_{jk})\right)$

$\log\bar{b}_{jk}(o_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\Sigma}_{jk}| - \tfrac{1}{2}(o_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}(o_t-\bar{\mu}_{jk})$

Setting the derivative of $Q_b$ with respect to the mean to zero gives the mean update:

$\dfrac{\partial Q_{b}(\lambda,\bar{b}_{jk})}{\partial\bar{\mu}_{jk}} = \sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\Sigma}_{jk}^{-1}(o_t-\bar{\mu}_{jk}) = 0 \;\Rightarrow\; \bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,o_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$

(using $\dfrac{\partial\, x^{T}Cx}{\partial x} = (C + C^{T})x$, and $\bar{\Sigma}_{jk}^{-1}$ is symmetric here; $L$ is the dimensionality of $o_t$)

EM Applied to Continuous HMM Training (6/7)

Similarly, setting the derivative of $Q_b$ with respect to the covariance to zero,

$\dfrac{\partial Q_{b}(\lambda,\bar{b}_{jk})}{\partial\bar{\Sigma}_{jk}} = \dfrac{\partial}{\partial\bar{\Sigma}_{jk}}\sum_{t=1}^{T}\gamma_t(j,k)\left[-\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\Sigma}_{jk}| - \tfrac{1}{2}(o_t-\bar{\mu}_{jk})^{T}\bar{\Sigma}_{jk}^{-1}(o_t-\bar{\mu}_{jk})\right] = 0$

$\Rightarrow\; \bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(o_t-\bar{\mu}_{jk})(o_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$

(using $\dfrac{\partial\, a^{T}Xb}{\partial X} = ab^{T}$ and $\dfrac{\partial\det(X)}{\partial X} = \det(X)\,(X^{-1})^{T}$, and $\bar{\Sigma}_{jk}$ is symmetric here)

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as:

$\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} p(s_t = j, k_t = k|O,\lambda)\,o_t}{\sum_{t=1}^{T} p(s_t = j, k_t = k|O,\lambda)} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,o_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$

$\bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)\,(o_t-\bar{\mu}_{jk})(o_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$

$\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{k'=1}^{M}\gamma_t(j,k')}$
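A compact numpy sketch of these three updates for a single state (illustrative only, not the lecture's implementation), assuming the mixture-level posteriors gamma[t, k] = P(s_t = j, k_t = k | O, λ) are already available from the forward-backward pass:

import numpy as np

def reestimate_gmm_state(gamma, obs):
    """M-step updates for one state's Gaussian mixture.
    gamma: (T, M) posteriors gamma_t(j, k); obs: (T, L) observation vectors."""
    occ = gamma.sum(axis=0)                                 # sum_t gamma_t(j, k), per mixture
    c_new = occ / occ.sum()                                 # new mixture weights
    mu_new = (gamma.T @ obs) / occ[:, None]                 # weighted means, shape (M, L)
    M, L = gamma.shape[1], obs.shape[1]
    sigma_new = np.zeros((M, L, L))
    for k in range(M):
        diff = obs - mu_new[k]                              # (T, L)
        sigma_new[k] = (gamma[:, k, None] * diff).T @ diff / occ[k]   # weighted covariance
    return c_new, mu_new, sigma_new

# Usage sketch: T=100 frames, M=2 mixtures, L=3 dimensions, synthetic posteriors
rng = np.random.default_rng(3)
c, mu, sig = reestimate_gmm_state(rng.random((100, 2)), rng.normal(size=(100, 3)))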

Page 57: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 57

Basic Problem 3 of HMMIntuitive View (cont)

ndash For discrete and finite observation (cont)

L

l

li i

Lti

11

11)( at time statein times)of(number freqency expected

ijξ

ijia

L

l

-T

t

lt

L

l

-T

t

lt

ijl

l

1

1

1

1

1

1 state fromn transitioofnumber expected

state to state fromn transitioofnumber expected

statein timesofnumber expected symbol observing and statein timesofnumber expected

1 1t

1such that

1t

L

l

T lt

L

l

T lt

kkkj

l

l

k

j

j

jjjsPb

vovvov

Formulae for Multiple (L) Training Utterances

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution, d_i(t) = (a_ii)^(t-1) (1 - a_ii)
    • This doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination states
  – Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames
• Researchers have proposed a number of techniques to address these limitations, although these solutions have not significantly improved speech recognition accuracy for practical applications
  (a small numeric illustration of the implied duration distribution follows below)
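A small numeric illustration (the self-loop value is an assumed example) of the duration distribution implied by a state self-loop probability a_ii: d_i(t) = a_ii^(t-1) (1 - a_ii) is geometric, so short stays are always the most likely and the mean duration is 1 / (1 - a_ii) frames.

    a_ii = 0.8
    d = [a_ii ** (t - 1) * (1 - a_ii) for t in range(1, 6)]
    print(d)                    # monotonically decreasing duration probabilities
    print(1.0 / (1.0 - a_ii))   # expected duration: 5 frames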

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling
[Figure: candidate state-duration distributions: the geometric/exponential distribution, an empirical distribution, a Gaussian distribution, and a Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized
[Figure: likelihood plotted over the model configuration space; the current model configuration sits at a local maximum]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: a 3-state ergodic HMM (s1, s2, s3); every initial and transition probability is 0.33 or 0.34, and the three state output distributions are A:0.34 B:0.33 C:0.33, A:0.33 B:0.34 C:0.33, and A:0.33 B:0.33 C:0.34]

TrainSet 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

TrainSet 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and 50th iterations of Baum-Welch training.

P2: Please show the recognition results obtained by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to? ABCABCCABAABABCCCCBBB

P4: What are the results if Observable Markov Models are used instead in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Block diagram: the speech signal goes through Feature Extraction to produce a feature sequence X; X is scored against each word model M_1, M_2, ..., M_V and a silence model M_Sil, giving the likelihoods p(X|M_1), p(X|M_2), ..., p(X|M_V), p(X|M_Sil); a Most Likely Word Selector outputs the recognized label]

Label(X) = argmax_k p(X|M_k)

Viterbi approximation: Label(X) = argmax_k max_S p(X, S|M_k)

(a minimal selection sketch follows below)
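A minimal sketch (inputs are assumptions) of the final selection step: log_scores[k] stands for log p(X|M_k), or its Viterbi approximation max_S log p(X, S|M_k), computed elsewhere for the feature sequence X.

    import numpy as np

    def most_likely_word(log_scores, labels):
        best = int(np.argmax(log_scores))   # pick the word model with the highest score
        return labels[best]

    print(most_likely_word(np.array([-1250.3, -1187.9, -1402.1]), ["yes", "no", "sil"]))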

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the Word Recognition Error Rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)

• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example
    Correct:    "the effect is clear"
    Recognized: "effect is not clear"
  – Error analysis: one deletion ("the") and one insertion ("not")
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)

    Word Error Rate      = 100% * (Sub + Del + Ins) / (No. of words in the correct sentence) = 100% * 2/4 = 50%
    Word Correction Rate = 100% * (Matched words) / (No. of words in the correct sentence) = 100% * 3/4 = 75%
    Word Accuracy Rate   = 100% * (Matched - Ins words) / (No. of words in the correct sentence) = 100% * (3 - 1)/4 = 50%

  – WER + WAR = 100%; WER might be higher than 100%, and WAR might be negative
    (a small numeric check of these rates follows below)
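A small numeric check (sketch) of the three rates on the example above, with 0 substitutions, 1 deletion, 1 insertion, 3 matched words, and 4 words in the correct sentence.

    def asr_rates(sub, dele, ins, matched, n_ref):
        wer = 100.0 * (sub + dele + ins) / n_ref
        wcr = 100.0 * matched / n_ref
        war = 100.0 * (matched - ins) / n_ref
        return wer, wcr, war

    print(asr_rates(sub=0, dele=1, ins=1, matched=3, n_ref=4))   # (50.0, 75.0, 50.0)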

SP - Berlin Chen 73

Measures of ASR Performance (3/8)

• A Dynamic Programming Algorithm (textbook)
  – n denotes the word length of the recognized/test sentence; m denotes the word length of the correct/reference sentence
  – G[i][j] holds the minimum word-error alignment score at grid point [i, j]
[Figure: DP grid over test index i and reference index j, showing the kinds of alignment that can enter a grid point: insertion, deletion, substitution, or hit]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)

  Step 1: Initialization
    G[0][0] = 0
    for i = 1, ..., n (test):      G[i][0] = G[i-1][0] + 1;  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1, ..., m (reference): G[0][j] = G[0][j-1] + 1;  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1, ..., n (test), for j = 1, ..., m (reference):
      G[i][j] = min { G[i-1][j] + 1                        (Insertion, horizontal direction)
                      G[i][j-1] + 1                        (Deletion, vertical direction)
                      G[i-1][j-1] + 1  if LT[i] != LR[j]   (Substitution, diagonal direction)
                      G[i-1][j-1]      if LT[i] == LR[j]   (Match, diagonal direction) }
      B[i][j] = 1, 2, 3 or 4 according to which case attains the minimum

  Step 3: Backtrace and Measure
    Word Error Rate = 100% * G[n][m] / m
    Word Accuracy Rate = 100% - Word Error Rate
    Optimal backtrace path: from B[n][m] back to B[0][0]
      if B[i][j] = 1: print LT[i] (Insertion), then go left
      else if B[i][j] = 2: print LR[j] (Deletion), then go down
      else: print LR[j] (Hit/Match or Substitution), then go diagonally down

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK): Initialization

[Figure: (n+1) x (m+1) grid with the recognized/test word sequence 1..n on one axis and the correct/reference word sequence 1..m on the other; horizontal moves from (i-1, j) are insertions, vertical moves from (i, j-1) are deletions, diagonal moves come from (i-1, j-1), and the path ends at grid point (n, m)]

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub   = grid[0][0].hit = 0;
    grid[0][0].dir   = NIL;
    for (i = 1; i <= n; i++) {            /* test */
        grid[i][0]        = grid[i-1][0];
        grid[i][0].dir    = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {            /* reference */
        grid[0][j]        = grid[0][j-1];
        grid[0][j].dir    = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (HTK)

    for (i = 1; i <= n; i++) {                 /* test */
        gridi  = grid[i];  gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {             /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {            /* DIAG = hit or sub */
                gridi[j]       = gridi1[j-1];  /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir   = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {                /* HOR = ins */
                gridi[j]       = gridi1[j];    /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir   = HOR;
                ++gridi[j].ins;
            } else {                           /* VERT = del */
                gridi[j]       = gridi[j-1];   /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir   = VERT;
                ++gridi[j].del;
            }
        }   /* for j */
    }       /* for i */

• Example 1
    Correct: A C B C C
    Test:    B A B C
  Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C, giving WER = (1 + 2 + 0) / 5 = 60% (there is still another optimal alignment)
[Figure: the DP grid annotated with the cumulative (Ins, Del, Sub, Hit) counts at every grid point]

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
    Correct: A C B C C
    Test:    B A A C
  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C, giving WER = (1 + 2 + 1) / 5 = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C, giving WER = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C, giving WER = 80%
  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here
[Figure: the DP grid annotated with the cumulative (Ins, Del, Sub, Hit) counts at every grid point]

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of the penalties for substitution, deletion and insertion errors
  – HTK error penalties: subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3
  (a small re-implementation sketch using such penalties follows below)
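A compact Python re-implementation sketch (not the HTK code above) showing how the substitution/deletion/insertion penalties plug into the same DP recursion; the NIST-style values are used as defaults here.

    def align_cost(ref, hyp, sub_pen=4, del_pen=3, ins_pen=3):
        n, m = len(hyp), len(ref)
        G = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            G[i][0] = G[i - 1][0] + ins_pen
        for j in range(1, m + 1):
            G[0][j] = G[0][j - 1] + del_pen
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = G[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else sub_pen)
                G[i][j] = min(diag, G[i - 1][j] + ins_pen, G[i][j - 1] + del_pen)
        return G[n][m]                      # cost of the best alignment

    print(align_cost(list("ACBCC"), list("BABC")))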

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

Reference:  桃芝颱風重創花蓮光復鄉大興村死傷慘重感觸最多......
ASR Output: 桃芝颱風重創花蓮光復鄉打新村次傷殘周感觸最多......
(in the original files each character is preceded by two numeric fields, e.g. "100000 100000 桃")

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, the first 100, the first 200, and all 506 stories
  – The results should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure: two bottles, A and B, containing red (R) and green (G) balls]
• Observed data O: the "ball sequence"; latent data S: the "bottle sequence"
• Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a 3-state HMM (s1, s2, s3) with output distributions such as {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1} and transition probabilities such as 0.7, 0.6, 0.3, 0.2, 0.1; given the observations o1 o2 ... oT, re-estimation produces a new model λ̂ with p(O|λ̂) > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data; in our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B, π} without consideration of the state sequence
  – Two major steps
    • E: take the expectation with respect to the latent data, E[ . |O, λ], using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = x_1, x_2, ..., x_n
  – The Maximum Likelihood (ML) principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are
      μ_ML = (1/n) Σ_{i=1..n} x_i
      Σ_ML = (1/n) Σ_{i=1..n} (x_i - μ_ML)(x_i - μ_ML)^T
    (a small numeric sketch of these estimates follows below)
  – The Maximum A Posteriori (MAP) principle: find the model parameter Φ so that the posterior p(Φ|X) ∝ p(X|Φ) p(Φ) is maximum
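A small numeric sketch (toy data, assumed) of the ML estimates quoted above for an i.i.d. Gaussian sample: the sample mean and the biased (1/n) sample covariance.

    import numpy as np

    X = np.random.default_rng(0).normal(size=(1000, 2))    # assumed toy data
    mu_ml = X.mean(axis=0)
    Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)         # (1/n) sum of outer products
    print(mu_ml)
    print(Sigma_ml)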

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters that maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)
• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed the complete data pair (O, S), with frequency proportional to the probability P(S|O, λ), and compute a new λ̂, the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression, with the expectation taken over S:
        P(O, S|λ) = P(S|O, λ) P(O|λ)   (Bayes' rule; P(O|λ) is the incomplete-data likelihood, P(O, S|λ) the complete-data likelihood)
        log P(O|λ) = log P(O, S|λ) - log P(S|O, λ)
      Taking the expectation over S under the current (unknown) model setting λ:
        log P(O|λ) = Σ_S P(S|O, λ) log P(O, S|λ) - Σ_S P(S|O, λ) log P(S|O, λ)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O|λ̂) as follows:
        log P(O|λ̂) = Q(λ, λ̂) - H(λ, λ̂), where
        Q(λ, λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂)
        H(λ, λ̂) = Σ_S P(S|O, λ) log P(S|O, λ̂)
    • We want log P(O|λ̂) ≥ log P(O|λ), i.e.
        Q(λ, λ̂) - H(λ, λ̂) ≥ Q(λ, λ) - H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̂) has the following property:
    H(λ, λ̂) - H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ̂) / P(S|O, λ) ]
                       ≤ Σ_S P(S|O, λ) [ P(S|O, λ̂) / P(S|O, λ) - 1 ]   (Jensen's inequality, log x ≤ x - 1)
                       = Σ_S P(S|O, λ̂) - Σ_S P(S|O, λ) = 0
  so H(λ, λ) - H(λ, λ̂) ≥ 0; it is a Kullback-Leibler (KL) distance
  – Therefore, for maximizing log P(O|λ̂) we only need to maximize the Q-function (auxiliary function)
      Q(λ, λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂)
    i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
  – By maximizing the auxiliary function
      Q(λ, λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂) = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ̂)
  – where P(O, S|λ) and log P(O, S|λ̂) can be expressed as
      P(O, S|λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2..T} a_{s_{t-1} s_t} b_{s_t}(o_t)
      log P(O, S|λ̂) = log π̂_{s_1} + Σ_{t=2..T} log â_{s_{t-1} s_t} + Σ_{t=1..T} log b̂_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂), where
    Q_π(λ, π̂) = Σ_{i=1..N} [ P(O, s_1 = i|λ) / P(O|λ) ] log π̂_i = Σ_{i=1..N} P(s_1 = i|O, λ) log π̂_i
    Q_a(λ, â) = Σ_{i=1..N} Σ_{j=1..N} Σ_{t=1..T-1} P(s_t = i, s_{t+1} = j|O, λ) log â_{ij}
    Q_b(λ, b̂) = Σ_{j=1..N} Σ_{k=1..M} Σ_{t: o_t = v_k} P(s_t = j|O, λ) log b̂_j(k)
  – Each term is a weighted sum of logarithms of the form Σ_j w_j log y_j

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̂_i, â_{ij} and b̂_j(k)
  – They can be maximized individually
  – All are of the same form
      F(y_1, y_2, ..., y_N) = Σ_{j=1..N} w_j log y_j, where y_j ≥ 0 and Σ_{j=1..N} y_j = 1,
    which has its maximum value when y_j = w_j / Σ_{j=1..N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
    By applying a Lagrange multiplier, suppose that
      F = Σ_{j=1..N} w_j log y_j + λ ( Σ_{j=1..N} y_j - 1 )   (constraint: Σ_{j=1..N} y_j = 1)
    ∂F/∂y_j = w_j / y_j + λ = 0  ⇒  w_j = -λ y_j  for every j
    Summing over j: Σ_{j=1..N} w_j = -λ Σ_{j=1..N} y_j = -λ
    Therefore y_j = w_j / Σ_{j=1..N} w_j
  (Lagrange multiplier tutorial: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html)

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̂ = (π̂, Â, B̂) can be expressed as
    π̂_i = P(O, s_1 = i|λ) / P(O|λ) = P(s_1 = i|O, λ)
    â_{ij} = Σ_{t=1..T-1} P(s_t = i, s_{t+1} = j|O, λ) / Σ_{t=1..T-1} P(s_t = i|O, λ)
    b̂_j(k) = Σ_{t=1..T, s.t. o_t = v_k} P(s_t = j|O, λ) / Σ_{t=1..T} P(s_t = j|O, λ)
  (a small sketch of these updates, given the posteriors from a forward-backward pass, follows below)
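A minimal sketch (input shapes are assumptions) of the re-estimation formulas above, given the state posteriors gamma[t, i] = P(s_t = i|O, λ) and the transition posteriors xi[t, i, j] = P(s_t = i, s_{t+1} = j|O, λ) computed elsewhere by the forward-backward algorithm, plus the quantized observation indices obs[t] in {0..M-1}.

    import numpy as np

    def reestimate_discrete(gamma, xi, obs, M):
        # gamma: (T, N); xi: (T-1, N, N); obs: length-T sequence of codeword indices
        T, N = gamma.shape
        obs = np.asarray(obs)
        pi_new = gamma[0]                                           # P(s_1 = i | O)
        a_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # sum_t xi / sum_t gamma
        b_new = np.zeros((N, M))
        for k in range(M):
            b_new[:, k] = gamma[obs == k].sum(axis=0)               # soft counts of v_k per state
        b_new /= gamma.sum(axis=0)[:, None]
        return pi_new, a_new, b_new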

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observations do not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
      b_j(o) = Σ_{k=1..M} c_{jk} b_{jk}(o) = Σ_{k=1..M} c_{jk} N(o; μ_{jk}, Σ_{jk}),   with Σ_{k=1..M} c_{jk} = 1
      N(o; μ_{jk}, Σ_{jk}) = (2π)^(-L/2) |Σ_{jk}|^(-1/2) exp( -(1/2) (o - μ_{jk})^T Σ_{jk}^(-1) (o - μ_{jk}) )
[Figure: the distribution for state i drawn as a weighted sum of the Gaussians N_1, N_2, N_3 with weights w_i1, w_i2, w_i3]
  (a minimal sketch of this mixture output probability follows below)
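A minimal sketch (shapes are assumptions) of the continuous mixture state output probability b_j(o) = Σ_k c_jk N(o; μ_jk, Σ_jk) with full covariance matrices.

    import numpy as np

    def gaussian_pdf(o, mu, Sigma):
        D = len(o)
        diff = o - mu
        norm = np.sqrt(((2.0 * np.pi) ** D) * np.linalg.det(Sigma))
        return float(np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm)

    def mixture_output_prob(o, c_j, mu_j, Sigma_j):
        # c_j: (M,) weights summing to 1; mu_j: (M, D); Sigma_j: (M, D, D)
        return float(sum(c * gaussian_pdf(o, m, S)
                         for c, m, S in zip(c_j, mu_j, Sigma_j)))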

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_{jk}(o):
    p(O, S|λ) = Π_{t=1..T} a_{s_{t-1} s_t} b_{s_t}(o_t) = Π_{t=1..T} a_{s_{t-1} s_t} [ Σ_{k=1..M} c_{s_t k} b_{s_t k}(o_t) ]
              = Σ_K Π_{t=1..T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)
    (with a_{s_0 s_1} standing for the initial probability π_{s_1})
  – where K = (k_1, k_2, ..., k_T) is one of the possible mixture-component sequences along the state sequence S, so that
      p(O, S, K|λ) = Π_{t=1..T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)   and   p(O|λ) = Σ_S Σ_K p(O, S, K|λ)
  – Note: the product of sums expands into a sum of products,
      Π_{t=1..T} Σ_{k=1..M} a_{t,k} = Σ_{k_1=1..M} Σ_{k_2=1..M} ... Σ_{k_T=1..M} Π_{t=1..T} a_{t,k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
    Q(λ, λ̂) = Σ_S Σ_K P(S, K|O, λ) log p(O, S, K|λ̂) = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] log p(O, S, K|λ̂)
    log p(O, S, K|λ̂) = Σ_{t=1..T} log â_{s_{t-1} s_t} + Σ_{t=1..T} log ĉ_{s_t k_t} + Σ_{t=1..T} log b̂_{s_t k_t}(o_t)
    Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂) + Q_c(λ, ĉ)
  – i.e., terms for the initial probabilities, the state transition probabilities, the Gaussian density functions, and the mixture-component weights

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:
    Q_b(λ, b̂) = Σ_{t=1..T} Σ_{j=1..N} Σ_{k=1..M} P(s_t = j, k_t = k|O, λ) log b̂_{jk}(o_t)
    Q_c(λ, ĉ) = Σ_{t=1..T} Σ_{j=1..N} Σ_{k=1..M} P(s_t = j, k_t = k|O, λ) log ĉ_{jk}

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j, k) = P(s_t = j, k_t = k|O, λ) and write the single Gaussian component as
    b̂_{jk}(o_t) = (2π)^(-L/2) |Σ̂_{jk}|^(-1/2) exp( -(1/2) (o_t - μ̂_{jk})^T Σ̂_{jk}^(-1) (o_t - μ̂_{jk}) )
    log b̂_{jk}(o_t) = -(L/2) log(2π) - (1/2) log|Σ̂_{jk}| - (1/2) (o_t - μ̂_{jk})^T Σ̂_{jk}^(-1) (o_t - μ̂_{jk})
• Maximize Q_b with respect to μ̂_{jk}, using d(x^T C x)/dx = (C + C^T) x and the symmetry of Σ̂_{jk}:
    ∂Q_b/∂μ̂_{jk} = Σ_{t=1..T} γ_t(j, k) Σ̂_{jk}^(-1) (o_t - μ̂_{jk}) = 0
    ⇒ μ̂_{jk} = Σ_{t=1..T} γ_t(j, k) o_t / Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximize Q_b with respect to Σ̂_{jk}, using ∂(a^T X b)/∂X = a b^T and ∂ log det(X)/∂X = (X^(-1))^T, with Σ̂_{jk} symmetric:
    ∂Q_b/∂Σ̂_{jk}^(-1) = Σ_{t=1..T} γ_t(j, k) [ (1/2) Σ̂_{jk} - (1/2) (o_t - μ̂_{jk})(o_t - μ̂_{jk})^T ] = 0
    ⇒ Σ̂_{jk} = Σ_{t=1..T} γ_t(j, k) (o_t - μ̂_{jk})(o_t - μ̂_{jk})^T / Σ_{t=1..T} γ_t(j, k)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as
    μ̂_{jk} = Σ_{t=1..T} p(s_t = j, k_t = k|O, λ) o_t / Σ_{t=1..T} p(s_t = j, k_t = k|O, λ)
    Σ̂_{jk} = Σ_{t=1..T} p(s_t = j, k_t = k|O, λ) (o_t - μ̂_{jk})(o_t - μ̂_{jk})^T / Σ_{t=1..T} p(s_t = j, k_t = k|O, λ)
    ĉ_{jk} = Σ_{t=1..T} p(s_t = j, k_t = k|O, λ) / Σ_{t=1..T} Σ_{k'=1..M} p(s_t = j, k_t = k'|O, λ)
  (a small sketch of these updates follows below)
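A minimal sketch (input shapes are assumptions) of the closed-form updates above, given the mixture-component posteriors gamma[t, j, k] = p(s_t = j, k_t = k|O, λ) computed elsewhere and the observations O of shape (T, D).

    import numpy as np

    def reestimate_mixtures(gamma, O):
        T, N, M = gamma.shape
        occ = gamma.sum(axis=0)                                  # (N, M) soft occupation counts
        c_new = occ / occ.sum(axis=1, keepdims=True)             # mixture weights
        mu_new = np.einsum('tjk,td->jkd', gamma, O) / occ[..., None]
        D = O.shape[1]
        Sigma_new = np.zeros((N, M, D, D))
        for j in range(N):
            for k in range(M):
                diff = O - mu_new[j, k]                          # (T, D)
                outer = np.einsum('ti,tj->tij', diff, diff)      # per-frame outer products
                Sigma_new[j, k] = (gamma[:, j, k, None, None] * outer).sum(axis=0) / occ[j, k]
        return c_new, mu_new, Sigma_new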

Page 58: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 58

Semicontinuous HMMs bull The HMM state mixture density functions are tied

together across all the models to form a set of shared kernelsndash The semicontinuous or tied-mixture HMM

ndash A combination of the discrete HMM and the continuous HMMbull A combination of discrete model-dependent weight coefficients and

continuous model-independent codebook probability density functionsndash Because M is large we can simply use the L most significant

valuesbull Experience showed that L is 1~3 of M is adequate

ndash Partial tying of for different phonetic class

kk

M

1k j

M

1k kjj Nkbvfkbb Σμooo

state output Probability of state j k-th mixture weight

t of state j(discrete model-dependent)

k-th mixture density function or k-th codeword(shared across HMMs M is very large)

kvf o

kvf o

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 59: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 59

Semicontinuous HMMs (cont)

s2

s3

s1

s2

s3

s1

Mb

kb

b

2

2

2

1

Mb

kb

b

1

1

1

1

Mb

kb

b

3

3

3

1

11 ΣμN

22 ΣμN

MMN Σμ

kkN Σμ

SP - Berlin Chen 60

HMM Topology

bull Speech is time-evolving non-stationary signalndash Each HMM state has the ability to capture some quasi-stationary

segment in the non-stationary speech signalndash A left-to-right topology is a natural candidate to model the

speech signal (also called the ldquobeads-on-a-stringrdquo model)

ndash It is general to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have $\lambda$ and estimate the probability that each $S$ occurred in the generation of $O$
– Pretend we had in fact observed the complete data pair $(O,S)$, with frequency proportional to the probability $P(S|O,\lambda)$, and compute from it a new $\bar{\lambda}$, the maximum likelihood estimate of $\lambda$
– Does the process converge?
– Algorithm
  • Log-likelihood expression and expectation taken over $S$:
    complete-data likelihood: $P(O,S|\lambda) = P(S|O,\lambda)\,P(O|\lambda)$  (Bayes' rule; $P(O|\lambda)$ is the incomplete-data likelihood)
    $\log P(O|\lambda) = \log P(O,S|\lambda) - \log P(S|O,\lambda)$
  • Take the expectation over $S$ under $P(S|O,\lambda)$ (computed with the known model $\lambda$) for an unknown model setting $\bar{\lambda}$:
    $\log P(O|\bar{\lambda}) = \sum_{S} P(S|O,\lambda)\log P(O,S|\bar{\lambda}) - \sum_{S} P(S|O,\lambda)\log P(S|O,\bar{\lambda})$

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express $\log P(O|\bar{\lambda})$ as follows:
    $\log P(O|\bar{\lambda}) = Q(\lambda,\bar{\lambda}) + H(\lambda,\bar{\lambda})$
    where
    $Q(\lambda,\bar{\lambda}) = \sum_{S} P(S|O,\lambda)\log P(O,S|\bar{\lambda})$
    $H(\lambda,\bar{\lambda}) = -\sum_{S} P(S|O,\lambda)\log P(S|O,\bar{\lambda})$
  • We want $\log P(O|\bar{\lambda}) \ge \log P(O|\lambda)$, i.e.
    $Q(\lambda,\bar{\lambda}) + H(\lambda,\bar{\lambda}) \ge Q(\lambda,\lambda) + H(\lambda,\lambda)$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda,\bar{\lambda})$ has the following property: $H(\lambda,\bar{\lambda}) \ge H(\lambda,\lambda)$, because
  $H(\lambda,\bar{\lambda}) - H(\lambda,\lambda) = -\sum_{S} P(S|O,\lambda)\log\frac{P(S|O,\bar{\lambda})}{P(S|O,\lambda)} \ge -\sum_{S} P(S|O,\lambda)\left(\frac{P(S|O,\bar{\lambda})}{P(S|O,\lambda)}-1\right) = 0$
  (Jensen's inequality, using $\log x \le x-1$; the left-hand side is the Kullback-Leibler (KL) distance between $P(S|O,\lambda)$ and $P(S|O,\bar{\lambda})$)
– Therefore, for maximizing $\log P(O|\bar{\lambda})$ we only need to maximize the Q-function (auxiliary function)
  $Q(\lambda,\bar{\lambda}) = \sum_{S} P(S|O,\lambda)\log P(O,S|\bar{\lambda})$
  — the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$
  – By maximizing the auxiliary function
    $Q(\lambda,\bar{\lambda}) = \sum_{S} P(S|O,\lambda)\log P(O,S|\bar{\lambda}) = \sum_{S}\frac{P(O,S|\lambda)}{P(O|\lambda)}\log P(O,S|\bar{\lambda})$
  – where $P(O,S|\lambda)$ and $\log P(O,S|\lambda)$ can be expressed as
    $P(O,S|\lambda) = \pi_{s_1}\, b_{s_1}(o_1)\prod_{t=2}^{T} a_{s_{t-1}s_t}\, b_{s_t}(o_t)$
    $\log P(O,S|\lambda) = \log\pi_{s_1} + \log b_{s_1}(o_1) + \sum_{t=2}^{T}\log a_{s_{t-1}s_t} + \sum_{t=2}^{T}\log b_{s_t}(o_t)$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as $Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a}) + Q_{b}(\lambda,\bar{b})$, where
  $Q_{\pi}(\lambda,\bar{\pi}) = \sum_{i=1}^{N}\frac{P(O,\,s_1=i|\lambda)}{P(O|\lambda)}\log\bar{\pi}_i$
  $Q_{a}(\lambda,\bar{a}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\frac{P(O,\,s_t=i,\,s_{t+1}=j|\lambda)}{P(O|\lambda)}\log\bar{a}_{ij}$
  $Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N}\sum_{k}\sum_{t:\,o_t=v_k}\frac{P(O,\,s_t=j|\lambda)}{P(O|\lambda)}\log\bar{b}_j(k)$
  (each term is a sum of the form $\sum_j w_j\log y_j$ — see the next slide)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(k)$
  – They can be maximized individually
  – All are of the same form
    $F(y_1,\ldots,y_N) = \sum_{j=1}^{N} w_j\log y_j$,  where $y_j \ge 0$ and $\sum_{j=1}^{N} y_j = 1$
    $F$ has its maximum value when $y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier $\ell$ with the constraint $\sum_{j=1}^{N} y_j = 1$
  Suppose $F = \sum_{j=1}^{N} w_j\log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right)$
  $\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\Rightarrow\; w_j = -\ell\,y_j,\ \forall j$
  $\Rightarrow\; \sum_{j=1}^{N} w_j = -\ell\sum_{j=1}^{N} y_j = -\ell \;\Rightarrow\; y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$
  Lagrange multipliers: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\lambda} = (\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\boldsymbol{\pi}})$ can be expressed as
  $\bar{\pi}_i = \frac{P(O,\,s_1=i|\lambda)}{P(O|\lambda)}$
  $\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(O,\,s_t=i,\,s_{t+1}=j|\lambda)}{\sum_{t=1}^{T-1} P(O,\,s_t=i|\lambda)}$
  $\bar{b}_i(k) = \frac{\sum_{t=1,\ \mathrm{s.t.}\ o_t=v_k}^{T} P(O,\,s_t=i|\lambda)}{\sum_{t=1}^{T} P(O,\,s_t=i|\lambda)}$
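The NumPy sketch below (added for illustration; not part of the original slides) shows how these re-estimation formulas are typically computed for a single training sequence: the needed posteriors are obtained with the forward-backward algorithm. All function and variable names are assumptions made for the sketch.

    import numpy as np

    def forward_backward(pi, A, B, obs):
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return alpha, beta, alpha[T - 1].sum()            # alpha[T-1].sum() = P(O | lambda)

    def reestimate(pi, A, B, obs):
        T, N = len(obs), len(pi)
        alpha, beta, PO = forward_backward(pi, A, B, obs)
        gamma = alpha * beta / PO                          # P(s_t = i | O, lambda)
        xi = np.zeros((T - 1, N, N))                       # P(s_t = i, s_{t+1} = j | O, lambda)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / PO
        new_pi = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B)
        for k in range(B.shape[1]):                        # sum over frames where o_t = v_k
            new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
        return new_pi, new_A, new_B

    # toy usage: 2 states, 2 codewords
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.8, 0.2], [0.3, 0.7]])
    print(reestimate(pi, A, B, [0, 1, 1, 0, 1]))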

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space into the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
    $b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\,b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\,N(\mathbf{o};\mu_{jk},\Sigma_{jk}) = \sum_{k=1}^{M}\frac{c_{jk}}{(2\pi)^{L/2}|\Sigma_{jk}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{o}-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}-\mu_{jk})\right)$,  with $\sum_{k=1}^{M} c_{jk} = 1$
  [Figure: the distribution for state i drawn as a weighted sum of Gaussians N1, N2, N3 with weights w_{i1}, w_{i2}, w_{i3}]
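A small sketch (added for illustration; not part of the original slides) of evaluating the Gaussian-mixture state observation density b_j(o) defined above, restricted to diagonal covariances for simplicity; the dimensions and parameter values are arbitrary assumptions.

    import numpy as np

    def gmm_state_density(o, c, mu, var):
        """c: (M,) mixture weights; mu, var: (M, L) means and diagonal variances."""
        L = o.shape[0]
        norm = (2 * np.pi) ** (L / 2) * np.sqrt(var.prod(axis=1))
        expo = np.exp(-0.5 * (((o - mu) ** 2) / var).sum(axis=1))
        return float((c * expo / norm).sum())              # sum_k c_k * N(o; mu_k, var_k)

    o = np.array([0.3, -1.2])
    c = np.array([0.5, 0.3, 0.2])
    mu = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])
    var = np.ones((3, 2))
    print(gmm_state_density(o, c, mu, var))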

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o}_t)$ with respect to each single mixture component $b_{jk}(\mathbf{o}_t)$:
  $p(O,S|\lambda) = \pi_{s_1} b_{s_1}(\mathbf{o}_1)\prod_{t=2}^{T} a_{s_{t-1}s_t}\, b_{s_t}(\mathbf{o}_t) = \sum_{k_1=1}^{M}\cdots\sum_{k_T=1}^{M}\ \pi_{s_1} c_{s_1 k_1} b_{s_1 k_1}(\mathbf{o}_1)\prod_{t=2}^{T} a_{s_{t-1}s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$
  so that
  $p(O|\lambda) = \sum_{S}\sum_{K} p(O,S,K|\lambda)$,  where $K = (k_1, k_2, \ldots, k_T)$ is one of the possible mixture-component sequences along the state sequence $S$, and
  $p(O,S,K|\lambda) = \pi_{s_1} c_{s_1 k_1} b_{s_1 k_1}(\mathbf{o}_1)\prod_{t=2}^{T} a_{s_{t-1}s_t}\, c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$
  Note: $\prod_{t=1}^{T}\sum_{k=1}^{M} a_{t,k} = \sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\cdots\sum_{k_T=1}^{M}\prod_{t=1}^{T} a_{t,k_t}$  (e.g. expanding $(a_{1,1}+\cdots+a_{1,M})(a_{2,1}+\cdots+a_{2,M})\cdots$)

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
  $Q(\lambda,\bar{\lambda}) = \sum_{S}\sum_{K} P(S,K|O,\lambda)\log p(O,S,K|\bar{\lambda}) = \sum_{S}\sum_{K}\frac{p(O,S,K|\lambda)}{p(O|\lambda)}\log p(O,S,K|\bar{\lambda})$
  with
  $\log p(O,S,K|\bar{\lambda}) = \log\bar{\pi}_{s_1} + \sum_{t=2}^{T}\log\bar{a}_{s_{t-1}s_t} + \sum_{t=1}^{T}\log\bar{b}_{s_t k_t}(\mathbf{o}_t) + \sum_{t=1}^{T}\log\bar{c}_{s_t k_t}$
  so that
  $Q(\lambda,\bar{\lambda}) = Q_{\pi}(\lambda,\bar{\pi}) + Q_{a}(\lambda,\bar{a}) + Q_{b}(\lambda,\bar{b}) + Q_{c}(\lambda,\bar{c})$
  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference from discrete HMM training lies in the $Q_b$ and $Q_c$ terms:
  $Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j,\,k_t=k|O,\lambda)\log\bar{b}_{jk}(\mathbf{o}_t)$
  $Q_{c}(\lambda,\bar{c}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T} P(s_t=j,\,k_t=k|O,\lambda)\log\bar{c}_{jk}$
  where $\gamma_t(j,k) = P(s_t=j,\,k_t=k|O,\lambda)$ is the posterior probability of being in state j with mixture component k at time t

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

  Let $\gamma_t(j,k) = P(s_t=j,\,k_t=k|O,\lambda)$ and
  $b_{jk}(\mathbf{o}_t) = \frac{1}{(2\pi)^{L/2}|\Sigma_{jk}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{o}_t-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}_t-\mu_{jk})\right)$
  $\Rightarrow\ \log b_{jk}(\mathbf{o}_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_{jk}| - \tfrac{1}{2}(\mathbf{o}_t-\mu_{jk})^{T}\Sigma_{jk}^{-1}(\mathbf{o}_t-\mu_{jk})$
  Setting the derivative of $Q_b$ with respect to $\bar{\mu}_{jk}$ to zero (using $\frac{\partial\,\mathbf{x}^{T}\mathbf{C}\mathbf{x}}{\partial\mathbf{x}} = (\mathbf{C}+\mathbf{C}^{T})\mathbf{x}$; $\bar{\Sigma}_{jk}^{-1}$ is symmetric here):
  $\frac{\partial Q_b(\lambda,\bar{b})}{\partial\bar{\mu}_{jk}} = \sum_{t=1}^{T}\gamma_t(j,k)\,\bar{\Sigma}_{jk}^{-1}(\mathbf{o}_t-\bar{\mu}_{jk}) = 0 \;\Rightarrow\; \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

  Setting the derivative of $Q_b$ with respect to $\bar{\Sigma}_{jk}^{-1}$ to zero (using $\frac{\partial\,\mathbf{a}^{T}\mathbf{X}\mathbf{b}}{\partial\mathbf{X}} = \mathbf{a}\mathbf{b}^{T}$ and $\frac{\partial\log\det\mathbf{X}}{\partial\mathbf{X}} = (\mathbf{X}^{-1})^{T}$; $\bar{\Sigma}_{jk}$ is symmetric here):
  $\frac{\partial Q_b(\lambda,\bar{b})}{\partial\bar{\Sigma}_{jk}^{-1}} = \sum_{t=1}^{T}\gamma_t(j,k)\left[\tfrac{1}{2}\bar{\Sigma}_{jk} - \tfrac{1}{2}(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}\right] = 0$
  $\Rightarrow\; \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as
  $\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|O,\lambda)\,\mathbf{o}_t}{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|O,\lambda)}$
  $\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|O,\lambda)\,(\mathbf{o}_t-\bar{\mu}_{jk})(\mathbf{o}_t-\bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|O,\lambda)}$
  $\bar{c}_{jk} = \frac{\sum_{t=1}^{T} p(s_t=j,\,k_t=k|O,\lambda)}{\sum_{t=1}^{T}\sum_{k'=1}^{M} p(s_t=j,\,k_t=k'|O,\lambda)}$
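An added NumPy sketch (not from the original slides) of these mixture updates for a single state j, assuming the posterior occupancies gamma[t, k] = P(s_t=j, k_t=k | O, λ) have already been computed (e.g. by forward-backward); the shapes, names and random data are assumptions for the sketch.

    import numpy as np

    def update_state_mixtures(O, gamma):
        """O: (T, L) observation vectors; gamma: (T, M) occupancies for one state."""
        occ = gamma.sum(axis=0)                            # (M,) total occupancy per mixture
        c_new = occ / occ.sum()                            # mixture weights
        mu_new = (gamma.T @ O) / occ[:, None]              # (M, L) means
        sigma_new = []
        for k in range(gamma.shape[1]):                    # full covariance per mixture
            d = O - mu_new[k]
            sigma_new.append((gamma[:, k][:, None] * d).T @ d / occ[k])
        return c_new, mu_new, np.stack(sigma_new)          # (M,), (M, L), (M, L, L)

    O = np.random.randn(100, 2)
    gamma = np.random.rand(100, 3); gamma /= gamma.sum(axis=1, keepdims=True)
    c, mu, Sigma = update_state_mixtures(O, gamma)
    print(c, mu.shape, Sigma.shape)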


SP - Berlin Chen 60

HMM Topology

• Speech is a time-evolving, non-stationary signal
  – Each HMM state has the ability to capture some quasi-stationary segment in the non-stationary speech signal
  – A left-to-right topology is a natural candidate for modeling the speech signal (also called the "beads-on-a-string" model)
  – It is common to represent a phone using 3~5 states (English) and a syllable using 6~8 states (Mandarin Chinese)
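As a small added illustration (not from the original slides), a left-to-right topology simply constrains the transition matrix so each state can only loop on itself or move forward; the probability values below are arbitrary assumptions.

    import numpy as np

    A = np.array([[0.6, 0.4, 0.0],      # s1 -> s1 (self-loop) or s2
                  [0.0, 0.7, 0.3],      # s2 -> s2 or s3
                  [0.0, 0.0, 1.0]])     # s3 is the final state
    pi = np.array([1.0, 0.0, 0.0])      # always start in the leftmost state
    print(A.sum(axis=1))                # each row sums to 1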

SP - Berlin Chen 61

Initialization of HMM
• A good initialization of HMM training: Segmental K-Means Segmentation into States
  – Assume that we have a training set of observations and an initial estimate of all model parameters
  – Step 1: The set of training observation sequences is segmented into states based on the initial model (finding the optimal state sequence by the Viterbi algorithm)
  – Step 2:
    • For a discrete density HMM (using an M-codeword codebook):
      $\hat{b}_j(k) = \frac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j}$
    • For a continuous density HMM (M Gaussian mixtures per state): cluster the observation vectors within each state j into a set of M clusters, then
      $\hat{w}_{jm}$ = number of vectors classified into cluster m of state j, divided by the number of vectors in state j
      $\hat{\mu}_{jm}$ = sample mean of the vectors classified into cluster m of state j
      $\hat{\Sigma}_{jm}$ = sample covariance matrix of the vectors classified into cluster m of state j
  – Step 3: Evaluate the model score. If the difference between the previous and current model scores is greater than a threshold, go back to Step 1; otherwise stop — the initial model is generated.
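An added sketch (not from the original slides) of Step 2 for the continuous-density case: after Viterbi segmentation has assigned frames to a state, run plain k-means within that state and take per-cluster weights, means and covariances as the initial mixture parameters. All names, sizes and the random data are assumptions; empty clusters are not handled.

    import numpy as np

    def init_state_mixtures(frames, M, n_iter=10):
        """frames: (n, L) vectors assigned to one state; returns (weights, means, covariances)."""
        rng = np.random.default_rng(0)
        centers = frames[rng.choice(len(frames), M, replace=False)]
        for _ in range(n_iter):                            # plain k-means
            labels = np.argmin(((frames[:, None, :] - centers) ** 2).sum(-1), axis=1)
            centers = np.stack([frames[labels == m].mean(0) for m in range(M)])
        w = np.array([(labels == m).mean() for m in range(M)])      # cluster weights
        Sigma = np.stack([np.cov(frames[labels == m].T) for m in range(M)])
        return w, centers, Sigma

    state_frames = np.random.randn(200, 13)    # e.g. 13-dim cepstral vectors in one state
    w, mu, Sigma = init_state_mixtures(state_frames, M=4)
    print(w, mu.shape, Sigma.shape)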

SP - Berlin Chen 62

Initialization of HMM (cont)

[Flowchart: Training Data + Initial Model → State Sequence Segmentation → Estimate parameters of the observation distributions via Segmental K-means → Model Re-estimation → Model Convergence? — NO: loop back to segmentation; YES: output Model Parameters]

SP - Berlin Chen 63

Initialization of HMM (cont)

• An example for a discrete HMM
  – 3 states and 2 codewords (v1, v2); the observation frames O1..O10 are aligned to the states s1, s2, s3 by Viterbi segmentation, and the codeword counts within each state give
    • b1(v1) = 3/4, b1(v2) = 1/4
    • b2(v1) = 1/3, b2(v2) = 2/3
    • b3(v1) = 2/3, b3(v2) = 1/3
  [Figure: state-time trellis (states s1, s2, s3 over frames 1-10) showing the segmentation used for the counts]

SP - Berlin Chen 64

Initialization of HMM (cont)

• An example for a continuous HMM
  – 3 states and 4 Gaussian mixtures per state
  [Figure: state-time trellis over frames O1..ON; within each state, K-means splits the assigned vectors (starting from the global mean into cluster 1 mean, cluster 2 mean, ...) to initialize the 4 mixture components, e.g. (μ11,Σ11,w11), (μ12,Σ12,w12), (μ13,Σ13,w13), (μ14,Σ14,w14)]

SP - Berlin Chen 65

Known Limitations of HMMs (1/3)

• The assumptions of conventional HMMs in speech processing
  – The state duration follows an exponential (geometric) distribution, $d_i(t) = a_{ii}^{\,t-1}(1-a_{ii})$
    • This doesn't provide an adequate representation of the temporal structure of speech
  – First-order (Markov) assumption: the state transition depends only on the origin and destination states
  – Output-independence assumption: all observation frames are dependent only on the state that generated them, not on neighboring observation frames
• Researchers have proposed a number of techniques to address these limitations, albeit these solutions have not significantly improved speech recognition accuracy for practical applications
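A quick numerical illustration (added here; not from the original slides) of the implicit duration model d_i(t) = a_ii^(t-1)(1-a_ii): the probability of staying exactly t frames in state i decays geometrically, which is often a poor fit for real phone durations. The self-loop value is an assumed example.

    import numpy as np

    a_ii = 0.8                                    # assumed self-loop probability
    t = np.arange(1, 11)
    d = a_ii ** (t - 1) * (1 - a_ii)
    print(list(zip(t, np.round(d, 3))))           # mean duration = 1 / (1 - a_ii) = 5 frames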

SP - Berlin Chen 66

Known Limitations of HMMs (2/3)

• Duration modeling
  [Figure: alternative state-duration models — the geometric/exponential distribution implicit in the HMM, an empirical distribution, a Gaussian distribution, and a Gamma distribution]

SP - Berlin Chen 67

Known Limitations of HMMs (3/3)

• The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) are only locally optimized
  [Figure: likelihood surface over the model configuration space, with the current model configuration sitting at a local maximum]

SP - Berlin Chen 68

Homework-2 (1/2)

[Figure: the initial fully connected 3-state HMM (s1, s2, s3); each state has a nearly uniform symbol distribution ({A:0.34, B:0.33, C:0.33}, {A:0.33, B:0.34, C:0.33}, {A:0.33, B:0.33, C:0.34}) and nearly uniform transition probabilities (0.33/0.34)]

Train Set 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB  6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

Train Set 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCA ABBAB  6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

SP - Berlin Chen 69

Homework-2 (2/2)

P1: Please specify the model parameters after the first and the 50th iterations of Baum-Welch training.

P2: Please show the recognition results by using the above training sequences as the testing data (the so-called inside testing). You have to perform the recognition task with the HMMs trained from the first and the 50th iterations of Baum-Welch training, respectively.

P3: Which class do the following testing sequences belong to: ABCABCCABAABABCCCCBBB?

P4: What are the results if Observable Markov Models were instead used in P1, P2 and P3?

SP - Berlin Chen 70

Isolated Word Recognition

[Figure: isolated word recognition system — the speech signal goes through feature extraction to produce the feature sequence X; the likelihoods p(X|M1), p(X|M2), ..., p(X|MV) of all word models (plus a silence model MSil, p(X|MSil)) are computed in parallel, and the most-likely-word selector outputs the recognized label]

  $\mathrm{Label}(X) = \arg\max_{k}\, p(X|M_k)$

  Viterbi approximation:
  $\mathrm{Label}(X) = \arg\max_{k}\,\max_{S}\, p(X,S|M_k)$
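An added sketch (not from the original slides) of the Viterbi-approximation decision rule above, using simple discrete-HMM word models (π, A, B) and an already-quantized observation sequence; the two toy models and the observation sequence are assumptions made for the sketch.

    import numpy as np

    def viterbi_log_score(obs, pi, A, B):
        """max over state sequences S of log p(obs, S | model)."""
        delta = np.log(pi) + np.log(B[:, obs[0]])
        for o in obs[1:]:
            delta = np.max(delta[:, None] + np.log(A), axis=0) + np.log(B[:, o])
        return delta.max()

    def recognize(obs, word_models):    # Label(X) = argmax_k max_S p(X, S | M_k)
        return max(word_models, key=lambda w: viterbi_log_score(obs, *word_models[w]))

    pi = np.array([0.9, 0.1]); A = np.array([[0.6, 0.4], [0.1, 0.9]])
    models = {"yes": (pi, A, np.array([[0.9, 0.1], [0.2, 0.8]])),
              "no":  (pi, A, np.array([[0.1, 0.9], [0.8, 0.2]]))}
    print(recognize([0, 0, 1, 1], models))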

SP - Berlin Chen 71

Measures of ASR Performance (1/8)

• Evaluating the performance of automatic speech recognition (ASR) systems is critical, and the word recognition error rate (WER) is one of the most important measures
• There are typically three types of word recognition errors
  – Substitution: an incorrect word was substituted for the correct word
  – Deletion: a correct word was omitted in the recognized sentence
  – Insertion: an extra word was added in the recognized sentence
• How to determine the minimum error rate?

SP - Berlin Chen 72

Measures of ASR Performance (2/8)
• Calculate the WER by aligning the correct word string against the recognized word string
  – A maximum substring matching problem
  – Can be handled by dynamic programming
• Example
  Correct: "the effect is clear"    Recognized: "effect is not clear"
  – Error analysis: one deletion ("the") and one insertion ("not"); three words matched
  – Measures: word error rate (WER), word correction rate (WCR), word accuracy rate (WAR)
    Word Error Rate = 100% × (Sub + Del + Ins) / No. of words in the correct sentence = (0+1+1)/4 = 50%   (might be higher than 100%)
    Word Correction Rate = 100% × Matched words / No. of words in the correct sentence = 3/4 = 75%
    Word Accuracy Rate = 100% × (Matched − Ins) words / No. of words in the correct sentence = (3−1)/4 = 50%   (might be negative)
  – Note: WER + WAR = 100%
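An added Python sketch (not the slides' C program) that computes the WER of this example with the same unit penalties, via standard edit-distance dynamic programming.

    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        n, m = len(ref), len(hyp)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1): D[i][0] = i                 # all deletions
        for j in range(1, m + 1): D[0][j] = j                 # all insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = D[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
        return 100.0 * D[n][m] / n

    print(wer("the effect is clear", "effect is not clear"))  # 50.0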

SP - Berlin Chen 73

Measures of ASR Performance (3/8)
• A dynamic programming algorithm (textbook)
  – n denotes the word length of the correct/reference sentence; m denotes the word length of the recognized/test sentence
  – Each grid point [i, j] stores the minimum word-error alignment of the first i reference words against the first j test words
  [Figure: alignment grid (Ref: i, Test: j) showing the three kinds of alignment moves into a cell — insertion, deletion, and substitution/hit]

SP - Berlin Chen 74

Measures of ASR Performance (4/8)
• Algorithm (by Berlin Chen); i indexes the test sentence (length n) and j indexes the reference sentence (length m)

  Step 1: Initialization
    G[0][0] = 0
    for i = 1..n (test):      G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, horizontal direction)
    for j = 1..m (reference): G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, vertical direction)

  Step 2: Iteration
    for i = 1..n (test), for j = 1..m (reference):
      G[i][j] = min( G[i-1][j] + 1    (Insertion),
                     G[i][j-1] + 1    (Deletion),
                     G[i-1][j-1] + 1  (if LT[i] ≠ LR[j]: Substitution),
                     G[i-1][j-1]      (if LT[i] = LR[j]: Match) )
      B[i][j] = 1 (Insertion, horizontal), 2 (Deletion, vertical), 3 (Substitution, diagonal) or 4 (Match, diagonal), according to the chosen predecessor

  Step 3: Measure and backtrace
    Word Error Rate = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path: B[n][m] → ... → B[0][0]
      if B[i][j] = 1: print LT[i] (Insertion), then go left
      else if B[i][j] = 2: print LR[j] (Deletion), then go down
      else: print LR[j] (Hit/Match or Substitution), then go diagonally down

  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

[Figure: DP grid with the recognized/test word sequence (i = 1..n) on one axis and the correct/reference word sequence (j = 1..m) on the other; cell (i, j) is reached from (i-1, j) by an insertion, from (i, j-1) by a deletion, or from (i-1, j-1) by a hit/substitution (HTK convention)]

• A Dynamic Programming Algorithm
  – Initialization

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub = grid[0][0].hit = 0;
    grid[0][0].dir = NIL;
    for (i = 1; i <= n; i++) {            /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }
    for (j = 1; j <= m; j++) {            /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program (main iteration; gridi[j] = gridi1[j-1], etc., are structure assignments)

    for (i = 1; i <= n; i++) {                 /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {             /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {            /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];        /* structure assignment */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else ++gridi[j].sub;
            } else if (h < v) {                /* HOR = ins */
                gridi[j] = gridi1[j];          /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                           /* VERT = del */
                gridi[j] = gridi[j-1];         /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }  /* for j */
    }  /* for i */

• Example 1 (HTK convention; cumulative counts shown as (Ins, Del, Sub, Hit))
    Correct (reference): A C B C C
    Test (recognized):   B A B C
  [Figure: DP grid of cumulative (Ins, Del, Sub, Hit) tuples; the optimal path ends at (1, 2, 0, 3)]
    Alignment 1: Ins B, Hit A, Del C, Hit B, Hit C, Del C  →  WER = (0 + 2 + 1)/5 = 60%
    (There is still another optimal alignment.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
    Correct (reference): A C B C C
    Test (recognized):   B A A C
  [Figure: DP grid of cumulative (Ins, Del, Sub, Hit) tuples; the optimal path ends at (1, 2, 1, 2)]
  Three optimal alignments, all with WER = (1 + 2 + 1)/5 = 80%:
    – Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C
    – Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C
    – Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C
  Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors
  – HTK error penalties: subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

  Reference (in the label file each character appears on its own line, preceded by two "100000" fields):
  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  (roughly: "Typhoon Toraji devastated Daxing Village in Guangfu Township, Hualien, with heavy casualties ...")

  ASR Output (same format):
  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

SP - Berlin Chen 80

Homework 3

• 506 broadcast-news (BN) stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and all 506 stories
  – The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
==================================================================
------------------------ Overall Results ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
==================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 61: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 61

Initialization of HMMbull A good initialization of HMM training

Segmental K-Means Segmentation into Statesndash Assume that we have a training set of observations and an initial estimate of all

model parametersndash Step 1 The set of training observation sequences is segmented into states based

on the initial model (finding the optimal state sequence by Viterbi Algorithm)ndash Step 2

bull For discrete density HMM (using M-codeword codebook)

bull For continuous density HMM (M Gaussian mixtures per state)

ndash Step 3 Evaluate the model scoreIf the difference between the previous and current model scores is greater than a threshold go back to Step 1 otherwise stop the initial model is generated

j

jkkb j statein vectorsofnumber the statein index codebook with vectorsofnumber the

state of cluster in classified vectors theofmatrix covariance sample

state of cluster in classified vectors theofmean sample statein vectorsofnumber by the divided

state of cluster in classified vectorsofnumber clustersofset aintostateeach within n vectorsobservatioecluster th

jm

jmj

jmwMj

jm

jm

jm

s2s1 s3

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 62: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 62

Initialization of HMM (cont)

Training Data

Initial Model

Model Reestimation

StateSequenceSegmemtation

Estimate parameters of Observation via

Segmental K-means

Model Convergence

NO

Model Parameters

YES

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (6/8)
• Program

    for (i = 1; i <= n; i++) {                    /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {                /* reference */
            h = gridi1[j].score + insPen;         /* from (i-1,j):   insertion        */
            d = gridi1[j-1].score;                /* from (i-1,j-1): hit or sub       */
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;        /* from (i,j-1):   deletion         */

            if (d <= h && d <= v) {               /* DIAG = hit or sub */
                gridi[j] = gridi1[j-1];           /* structure assignment (copies the counts) */
                gridi[j].score = d;
                gridi[j].dir = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {                   /* HOR = ins */
                gridi[j] = gridi1[j];             /* structure assignment */
                gridi[j].score = h;
                gridi[j].dir = HOR;
                ++gridi[j].ins;
            } else {                              /* VERT = del */
                gridi[j] = gridi[j-1];            /* structure assignment */
                gridi[j].score = v;
                gridi[j].dir = VERT;
                ++gridi[j].del;
            }
        }  /* for j */
    }      /* for i */

• Example 1 (HTK-style grid; cell entries are (Ins, Del, Sub, Hit))
  Correct: A C B C C
  Test:    B A B C
  One optimal alignment: Ins B, Hit A, Del C, Hit B, Hit C, Del C
  → 1 insertion + 2 deletions + 0 substitutions ⇒ Alignment 1: WER = 3/5 = 60%
  (Another optimal alignment with the same cost also exists.)

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (HTK-style grid; cell entries are (Ins, Del, Sub, Hit))
  Correct: A C B C C
  Test:    B A A C
  Three optimal alignments, all with 4 errors against 5 reference words:
  – Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   → WER = 80%
  – Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   → WER = 80%
  – Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          → WER = 80%

  Note: the penalties for substitution, deletion and insertion errors are all set to be 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors
– HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
– NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79
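Putting the alignment slides together, a self-contained C sketch of the HTK-style word alignment: it fills the grid with the HTK penalties, tracks (Ins, Del, Sub, Hit) counts per cell, and reports the WER for Example 1 above. Symbols are single characters and all names are illustrative, not taken from any toolkit.

    #include <stdio.h>
    #include <string.h>

    enum { NIL, DIAG, HOR, VERT };

    typedef struct { int score, ins, del, sub, hit, dir; } Cell;

    #define MAXLEN 32
    static Cell grid[MAXLEN][MAXLEN];

    /* ref = correct sentence, test = recognized sentence (one char per word) */
    static void align(const char *ref, const char *test,
                      int subPen, int delPen, int insPen)
    {
        int m = (int)strlen(ref);     /* reference length */
        int n = (int)strlen(test);    /* test length      */
        int i, j;

        memset(grid, 0, sizeof grid);
        for (i = 1; i <= n; i++) {                        /* first column: insertions */
            grid[i][0] = grid[i-1][0];
            grid[i][0].score += insPen; grid[i][0].ins++; grid[i][0].dir = HOR;
        }
        for (j = 1; j <= m; j++) {                        /* first row: deletions */
            grid[0][j] = grid[0][j-1];
            grid[0][j].score += delPen; grid[0][j].del++; grid[0][j].dir = VERT;
        }
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                int h = grid[i-1][j].score + insPen;
                int d = grid[i-1][j-1].score + (ref[j-1] != test[i-1] ? subPen : 0);
                int v = grid[i][j-1].score + delPen;
                if (d <= h && d <= v) {                   /* hit or substitution */
                    grid[i][j] = grid[i-1][j-1]; grid[i][j].score = d; grid[i][j].dir = DIAG;
                    if (ref[j-1] == test[i-1]) grid[i][j].hit++; else grid[i][j].sub++;
                } else if (h < v) {                       /* insertion */
                    grid[i][j] = grid[i-1][j]; grid[i][j].score = h; grid[i][j].dir = HOR;
                    grid[i][j].ins++;
                } else {                                  /* deletion */
                    grid[i][j] = grid[i][j-1]; grid[i][j].score = v; grid[i][j].dir = VERT;
                    grid[i][j].del++;
                }
            }
        }
        {
            Cell c = grid[n][m];
            printf("Ins=%d Del=%d Sub=%d Hit=%d  WER=%.1f%%\n",
                   c.ins, c.del, c.sub, c.hit,
                   100.0 * (c.ins + c.del + c.sub) / m);
        }
    }

    int main(void)
    {
        /* Example 1: Correct = A C B C C, Test = B A B C, HTK penalties */
        align("ACBCC", "BABC", 10, 7, 7);
        /* expected: Ins=1 Del=2 Sub=0 Hit=3  WER=60.0% */
        return 0;
    }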

Homework 3
• Measures of ASR Performance
– In the raw data each Chinese character is preceded by two score fields (e.g. "100000 100000"); only the characters are shown below

  Reference:  桃芝颱風重創花蓮光復鄉大興村死傷慘重感觸最多……
  ASR Output: 桃芝颱風重創花蓮光復鄉打新村次傷殘周感觸最多……

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
– Report the CER (character error rate) of the first one, 100, 200, and 506 stories
– The result should show the number of substitution, deletion and insertion errors

------------------------ Overall Results (all 506 stories) ------------------------
SENT: %Correct=0.00 [H=0, S=506, N=506]
WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
------------------------ Overall Results (first story) ------------------------
SENT: %Correct=0.00 [H=0, S=1, N=1]
WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
------------------------ Overall Results (first 100 stories) ------------------------
SENT: %Correct=0.00 [H=0, S=100, N=100]
WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
------------------------ Overall Results (first 200 stories) ------------------------
SENT: %Correct=0.00 [H=0, S=200, N=200]
WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]

SP - Berlin Chen 81
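As a sanity check on the HTK-style figures above: for the full 506-story set, H + D + S = 57144 + 829 + 7839 = 65812 = N, %Corr = 100 × H / N = 100 × 57144 / 65812 ≈ 86.83, and Acc = 100 × (H − I) / N = 100 × (57144 − 504) / 65812 ≈ 86.06, which matches the reported numbers.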

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

• Example: two bottles A and B of colored balls
– Observed data O: the "ball sequence" o1 o2 …… oT
– Latent data S: the "bottle sequence"
– Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a three-state model (s1, s2, s3) with transition probabilities such as 0.7, 0.6, 0.3, 0.2, 0.1
and state output distributions {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1};
each re-estimation step moves from λ to a new λ' with p(O|λ') > p(O|λ)]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
– Why EM?
  • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data.
    In our case here, the state sequence is the latent data.
  • Direct access to the data necessary to estimate the parameters is impossible or difficult.
    In our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence.
– Two Major Steps
  • E: take the expectation with respect to the latent data S, conditioned on the observations O and using the current estimate λ of the parameters
  • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

SP - Berlin Chen 84
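In symbols, one EM iteration can be stated compactly (using λ for the current model and λ' for the new one, as in the following slides):

    E-step:  Q(λ, λ') = E_{S|O,λ}[ log P(O, S|λ') ] = Σ_S P(S|O, λ) log P(O, S|λ')
    M-step:  λ ← argmax_{λ'} Q(λ, λ')

Iterating the two steps never decreases the likelihood: P(O|λ') ≥ P(O|λ).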

The EM Algorithm (3/7)

• Estimation principles based on observations X = x1, x2, …, xn

– The Maximum Likelihood (ML) Principle:
  find the model parameter Φ so that the likelihood p(X|Φ) is maximum.
  For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d.
  (independent, identically distributed), then the ML estimates of μ and Σ are

    μ_ML = (1/n) Σ_{i=1}^{n} x_i
    Σ_ML = (1/n) Σ_{i=1}^{n} (x_i − μ_ML)(x_i − μ_ML)^T

– The Maximum A Posteriori (MAP) Principle:
  find the model parameter Φ so that the posterior probability p(Φ|X) ∝ p(X|Φ) p(Φ) is maximum.

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters that maximize the log-likelihood of the incomplete data, log P(O|λ),
  by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)

• First, use scalar (discrete) random variables to introduce the EM algorithm
– The observable training data O
  • We want to maximize P(O|λ); λ is a parameter vector
– The hidden (unobservable) data S
  • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed the complete data pair (O, S), with frequency proportional to the probability P(S|O, λ), and compute a new λ': the maximum likelihood estimate of the parameters
– Does the process converge?

– Algorithm
  • Log-likelihood expression and expectation taken over S

    Bayes' rule (complete-data vs. incomplete-data likelihood), written for the unknown model setting λ':
      P(O, S|λ') = P(S|O, λ') P(O|λ')
      ⇒ log P(O|λ') = log P(O, S|λ') − log P(S|O, λ')

    Take the expectation over S with respect to P(S|O, λ), i.e. under the current (known) model λ:
      log P(O|λ') = Σ_S P(S|O, λ) log P(O, S|λ') − Σ_S P(S|O, λ) log P(S|O, λ')

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (Cont.)
  • We can thus express log P(O|λ') as follows:

      log P(O|λ') = Q(λ, λ') − H(λ, λ')

    where
      Q(λ, λ') = Σ_S P(S|O, λ) log P(O, S|λ')
      H(λ, λ') = Σ_S P(S|O, λ) log P(S|O, λ')

  • We want log P(O|λ') ≥ log P(O|λ), i.e.

      log P(O|λ') − log P(O|λ) = [ Q(λ, λ') − Q(λ, λ) ] − [ H(λ, λ') − H(λ, λ) ] ≥ 0

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ') has the following property:

    H(λ, λ') − H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ') / P(S|O, λ) ]
                       ≤ Σ_S P(S|O, λ) [ P(S|O, λ') / P(S|O, λ) − 1 ]     (log x ≤ x − 1, Jensen's inequality)
                       = Σ_S P(S|O, λ') − Σ_S P(S|O, λ) = 1 − 1 = 0

  (equivalently, −[H(λ, λ') − H(λ, λ)] is the Kullback-Leibler (KL) distance between P(S|O, λ) and P(S|O, λ'), which is never negative)

– Therefore, for maximizing log P(O|λ') we only need to maximize the Q-function (auxiliary function)

    Q(λ, λ') = Σ_S P(S|O, λ) log P(O, S|λ')

  i.e. the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89
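Before turning to HMMs, a self-contained toy instance of EM in C can make the E- and M-steps concrete: a two-component one-dimensional Gaussian mixture, where the latent data S is the component identity of each sample. The data values and initial parameters below are made up purely for illustration.

    #include <stdio.h>
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    static double gauss(double x, double mu, double var)
    {
        return exp(-0.5 * (x - mu) * (x - mu) / var) / sqrt(2.0 * M_PI * var);
    }

    int main(void)
    {
        /* toy observations drawn around two clusters */
        double o[] = { -2.1, -1.9, -2.3, -1.7, 1.8, 2.2, 2.0, 2.4, 1.6 };
        int T = (int)(sizeof o / sizeof o[0]);

        double c1 = 0.5, c2 = 0.5;        /* mixture weights   */
        double mu1 = -1.0, mu2 = 1.0;     /* initial means     */
        double v1 = 1.0, v2 = 1.0;        /* initial variances */

        for (int it = 0; it < 20; it++) {
            double n1 = 0, n2 = 0, s1 = 0, s2 = 0, q1 = 0, q2 = 0, loglik = 0;
            for (int t = 0; t < T; t++) {
                /* E-step: responsibilities gamma_k(t) = P(S_t = k | o_t, lambda) */
                double p1 = c1 * gauss(o[t], mu1, v1);
                double p2 = c2 * gauss(o[t], mu2, v2);
                double g1 = p1 / (p1 + p2), g2 = 1.0 - g1;
                loglik += log(p1 + p2);
                /* accumulate sufficient statistics for the M-step */
                n1 += g1;               n2 += g2;
                s1 += g1 * o[t];        s2 += g2 * o[t];
                q1 += g1 * o[t] * o[t]; q2 += g2 * o[t] * o[t];
            }
            /* M-step: maximize the auxiliary function Q in closed form */
            c1 = n1 / T;               c2 = n2 / T;
            mu1 = s1 / n1;             mu2 = s2 / n2;
            v1 = q1 / n1 - mu1 * mu1;  v2 = q2 / n2 - mu2 * mu2;
            printf("iter %2d  log-likelihood = %f\n", it, loglik);
        }
        printf("c=(%.2f,%.2f)  mu=(%.2f,%.2f)  var=(%.2f,%.2f)\n", c1, c2, mu1, mu2, v1, v2);
        return 0;
    }

The printed log-likelihood is non-decreasing from iteration to iteration, which is exactly the guarantee derived on the previous slides.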

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (π, A, B)
– By maximizing the auxiliary function

    Q(λ, λ') = Σ_S P(S|O, λ) log P(O, S|λ') = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ')

– where P(O, S|λ') and log P(O, S|λ') can be expressed as

    P(O, S|λ') = π'_{s1} · Π_{t=1}^{T−1} a'_{s_t s_{t+1}} · Π_{t=1}^{T} b'_{s_t}(o_t)

    log P(O, S|λ') = log π'_{s1} + Σ_{t=1}^{T−1} log a'_{s_t s_{t+1}} + Σ_{t=1}^{T} log b'_{s_t}(o_t)

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b'), where

    Q_π(λ, π') = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O|λ) ] log π'_i

    Q_a(λ, a') = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T−1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] log a'_{ij}

    Q_b(λ, b') = Σ_{j=1}^{N} Σ_{all v_k} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] log b'_j(k)

  (each term has the form Σ_i w_i log y_i, with weights w_i and probabilities y_i)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, involving π_i, a_ij and b_j(k)
– They can be maximized individually
– All are of the same form

    F(y_1, y_2, …, y_N) = Σ_{j=1}^{N} w_j log y_j ,   where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0,

  which has its maximum value when

    y_j = w_j / Σ_{j=1}^{N} w_j

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  By applying a Lagrange multiplier ℓ to the constraint Σ_{j=1}^{N} y_j = 1:

    F = Σ_{j=1}^{N} w_j log y_j + ℓ ( Σ_{j=1}^{N} y_j − 1 )

    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = −ℓ y_j ,  for all j

    Summing over j:  Σ_{j=1}^{N} w_j = −ℓ Σ_{j=1}^{N} y_j = −ℓ

    Therefore  y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ' = (π', A', B') can be expressed as

    π'_i = P(O, s_1 = i | λ) / P(O|λ)

    a'_ij = [ Σ_{t=1}^{T−1} P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] / [ Σ_{t=1}^{T−1} P(O, s_t = i | λ) / P(O|λ) ]

    b'_i(k) = [ Σ_{t=1, s.t. o_t = v_k}^{T} P(O, s_t = i | λ) / P(O|λ) ] / [ Σ_{t=1}^{T} P(O, s_t = i | λ) / P(O|λ) ]

SP - Berlin Chen 94
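A minimal C sketch of the re-estimation (M-step) formulas above, assuming the posteriors γ_t(i) = P(s_t = i|O, λ) and ξ_t(i,j) = P(s_t = i, s_{t+1} = j|O, λ) have already been obtained from a forward-backward (E-step) pass; the array names, fixed sizes, and single-utterance setting are illustrative. It compiles as a stand-alone translation unit, with a driver expected to supply the posteriors.

    #define N 3      /* number of states (illustrative)    */
    #define K 8      /* number of codewords (illustrative) */

    /* Re-estimation for a discrete HMM, given
       gamma[t][i] = P(s_t = i | O, lambda) and
       xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda);
       o[t] is the codeword index observed at time t. */
    void reestimate(int T, const int o[],
                    const double gamma[][N], const double xi[][N][N],
                    double pi[N], double a[N][N], double b[N][K])
    {
        for (int i = 0; i < N; i++) {
            pi[i] = gamma[0][i];                          /* new initial probability   */

            double occ_trans = 0.0;                       /* sum over t=1..T-1 of gamma */
            for (int t = 0; t < T - 1; t++) occ_trans += gamma[t][i];
            for (int j = 0; j < N; j++) {
                double num = 0.0;                         /* sum over t=1..T-1 of xi    */
                for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
                a[i][j] = num / occ_trans;                /* new transition probability */
            }

            double occ_all = 0.0;                         /* sum over t=1..T of gamma   */
            for (int t = 0; t < T; t++) occ_all += gamma[t][i];
            for (int k = 0; k < K; k++) {
                double num = 0.0;                         /* sum restricted to o_t = v_k */
                for (int t = 0; t < T; t++)
                    if (o[t] == k) num += gamma[t][i];
                b[i][k] = num / occ_all;                  /* new output probability      */
            }
        }
    }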

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of the state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

    b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)

    N(o; μ_jk, Σ_jk) = (2π)^{−L/2} |Σ_jk|^{−1/2} exp( −½ (o − μ_jk)^T Σ_jk^{−1} (o − μ_jk) )

    with Σ_{k=1}^{M} c_jk = 1

[Figure: distribution for state i shown as a weighted sum of Gaussians N1, N2, N3 with weights w_i1, w_i2, w_i3]

SP - Berlin Chen 95
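As a small illustration of the mixture output density above, the following C function evaluates b_j(o) for one state, assuming diagonal covariance matrices (a common choice in ASR); the full-covariance form on the slide would additionally need the matrix inverse and determinant. The dimensions L and M are illustrative.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define L 13   /* feature dimension (illustrative, e.g. MFCCs) */
    #define M 4    /* Gaussian mixtures per state (illustrative)   */

    /* b_j(o) = sum_k c_jk N(o; mu_jk, Sigma_jk) with diagonal Sigma_jk.
       c[k] are the mixture weights, mu[k][d] the means, var[k][d] the variances. */
    double state_likelihood(const double o[L],
                            const double c[M],
                            const double mu[M][L],
                            const double var[M][L])
    {
        double b = 0.0;
        for (int k = 0; k < M; k++) {
            double logp = -0.5 * L * log(2.0 * M_PI);
            for (int d = 0; d < L; d++) {
                double diff = o[d] - mu[k][d];
                logp -= 0.5 * (log(var[k][d]) + diff * diff / var[k][d]);
            }
            b += c[k] * exp(logp);      /* weighted Gaussian density */
        }
        return b;
    }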

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_jk(o):

    p(O, S|λ') = π'_{s1} Π_{t=1}^{T−1} a'_{s_t s_{t+1}} Π_{t=1}^{T} b'_{s_t}(o_t)
               = π'_{s1} Π_{t=1}^{T−1} a'_{s_t s_{t+1}} Π_{t=1}^{T} [ Σ_{k=1}^{M} c'_{s_t k} b'_{s_t k}(o_t) ]
               = Σ_K [ π'_{s1} Π_{t=1}^{T−1} a'_{s_t s_{t+1}} Π_{t=1}^{T} c'_{s_t k_t} b'_{s_t k_t}(o_t) ]
               = Σ_K p(O, S, K|λ')

  where K = (k_1, k_2, …, k_T) is one of the possible mixture-component sequences along the state sequence S, and

    p(O|λ') = Σ_S Σ_K p(O, S, K|λ')

  Note: Π_{t=1}^{T} ( Σ_{k=1}^{M} a_{t,k} )
        = (a_{1,1} + a_{1,2} + … + a_{1,M})(a_{2,1} + a_{2,2} + … + a_{2,M}) … (a_{T,1} + a_{T,2} + … + a_{T,M})
        = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} … Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t,k_t}

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ') = Σ_S Σ_K P(S, K|O, λ) log p(O, S, K|λ')
             = Σ_S Σ_K [ p(O, S, K|λ) / p(O|λ) ] log p(O, S, K|λ')

    log p(O, S, K|λ') = log π'_{s1} + Σ_{t=1}^{T−1} log a'_{s_t s_{t+1}} + Σ_{t=1}^{T} log b'_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c'_{s_t k_t}

    ⇒ Q(λ, λ') = Q_π(λ, π') + Q_a(λ, a') + Q_b(λ, b') + Q_c(λ, c')
      (initial probabilities, state transition probabilities, Gaussian density functions, mixture weights)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

    Q_b(λ, b') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b'_jk(o_t)

    Q_c(λ, c') = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log c'_jk

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ), and write the log mixture-component density as

    log b'_jk(o_t) = −(L/2) log(2π) − ½ log |Σ'_jk| − ½ (o_t − μ'_jk)^T Σ'_jk^{−1} (o_t − μ'_jk)

  so that Q_b(λ, b') = Σ_{t=1}^{T} Σ_{j=1}^{N} Σ_{k=1}^{M} γ_t(j, k) log b'_jk(o_t)

• Maximization with respect to the mean μ'_jk:

    ∂Q_b/∂μ'_jk = Σ_{t=1}^{T} γ_t(j, k) Σ'_jk^{−1} (o_t − μ'_jk) = 0

    ⇒  μ'_jk = [ Σ_{t=1}^{T} γ_t(j, k) o_t ] / [ Σ_{t=1}^{T} γ_t(j, k) ]

  (using d(x^T C x)/dx = (C + C^T) x, and Σ'_jk is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Maximization with respect to the covariance Σ'_jk:

    ∂Q_b/∂Σ'_jk^{−1} = ½ Σ_{t=1}^{T} γ_t(j, k) [ Σ'_jk − (o_t − μ'_jk)(o_t − μ'_jk)^T ] = 0

    ⇒  Σ'_jk = [ Σ_{t=1}^{T} γ_t(j, k) (o_t − μ'_jk)(o_t − μ'_jk)^T ] / [ Σ_{t=1}^{T} γ_t(j, k) ]

  (using d(det X)/dX = det(X) (X^{−1})^T and d(a^T X b)/dX = a b^T; Σ'_jk is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    μ'_jk = [ Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t ] / [ Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) ]

    Σ'_jk = [ Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) (o_t − μ'_jk)(o_t − μ'_jk)^T ] / [ Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) ]

    c'_jk = [ Σ_{t=1}^{T} γ_t(j, k) ] / [ Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k) ]
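A minimal C sketch of the continuous-density M-step above for one state j, again assuming diagonal covariances and that γ_t(j,k) = P(s_t = j, k_t = k|O, λ) is supplied by the E-step; the names and fixed sizes are illustrative.

    #define L 13   /* feature dimension (illustrative)  */
    #define M 4    /* mixtures per state (illustrative) */

    /* gamma[t][k] holds gamma_t(j,k) for this state; o[t][d] are the observations. */
    void reestimate_state(int T, const double o[][L], const double gamma[][M],
                          double c[M], double mu[M][L], double var[M][L])
    {
        double occ_state = 0.0;                       /* sum over t and k of gamma_t(j,k) */
        for (int k = 0; k < M; k++) {
            double occ = 0.0, num[L] = {0}, sq[L] = {0};
            for (int t = 0; t < T; t++) {
                occ += gamma[t][k];                   /* sum_t gamma_t(j,k)               */
                for (int d = 0; d < L; d++) {
                    num[d] += gamma[t][k] * o[t][d];            /* sum_t gamma * o_t      */
                    sq[d]  += gamma[t][k] * o[t][d] * o[t][d];  /* sum_t gamma * o_t^2    */
                }
            }
            occ_state += occ;
            for (int d = 0; d < L; d++) {
                mu[k][d]  = num[d] / occ;                       /* new mean               */
                var[k][d] = sq[d] / occ - mu[k][d] * mu[k][d];  /* new (diagonal) variance */
            }
            c[k] = occ;                               /* numerator of the mixture weight  */
        }
        for (int k = 0; k < M; k++)
            c[k] /= occ_state;                        /* c'_jk = sum_t gamma / sum_t sum_k gamma */
    }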

Page 63: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 63

Initialization of HMM (cont)

bull An example for discrete HMMndash 3 states and 2 codeword

bull b1(v1)=34 b1(v2)=14bull b2(v1)=13 b2(v2)=23bull b3(v1)=23 b3(v2)=13

O1

State

O2 O3

1 2 3 4 5 6 7 8 9 10O4

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

O5 O6 O9O8O7 O10

v1

v2

s2s1 s3

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 64: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 64

Initialization of HMM (cont)

bull An example for Continuous HMMndash 3 states and 4 Gaussian mixtures per state

O1

State

O2

1 2 NON

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

s2

s3

s1

Global mean Cluster 1 mean

Cluster 2mean

K-means 111111121212

131313 141414

s2s1 s3

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 65: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 65

Known Limitations of HMMs (13)

bull The assumptions of conventional HMMs in Speech Processingndash The state duration follows an exponential distribution

bull Donrsquot provide adequate representation of the temporal structure of speech

ndash First-order (Markov) assumption the state transition depends only on the origin and destination

ndash Output-independent assumption all observation frames are dependent on the state that generated them not on neighboring observation frames

Researchers have proposed a number of techniques to address these limitations albeit these solution have not significantly improved speech recognition accuracy for practical applications

iitiii aatd 11

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (4/8)

• Algorithm (by Berlin Chen)

  Step 1. Initialization:
    G[0][0] = 0
    for i = 1, ..., n (test):       G[i][0] = G[i-1][0] + 1,  B[i][0] = 1  (Insertion, Horizontal Direction)
    for j = 1, ..., m (reference):  G[0][j] = G[0][j-1] + 1,  B[0][j] = 2  (Deletion, Vertical Direction)

  Step 2. Iteration:
    for i = 1, ..., n (test), for j = 1, ..., m (reference):
      G[i][j] = min( G[i-1][j]   + 1                      (Insertion),
                     G[i][j-1]   + 1                      (Deletion),
                     G[i-1][j-1] + 1  if LT[i] ≠ LR[j]    (Substitution),
                     G[i-1][j-1]      if LT[i] = LR[j]    (Match) )
      B[i][j] = 1 (Insertion, Horizontal), 2 (Deletion, Vertical), 3 (Substitution, Diagonal), or 4 (Match, Diagonal), according to which term attains the minimum

  Step 3. Measure and Backtrace:
    Word Error Rate    = 100% × G[n][m] / m
    Word Accuracy Rate = 100% − Word Error Rate
    Optimal backtrace path from B[n][m] back to B[0][0]:
      if B[i][j] = 1, print LT[i] (Insertion) and go left;
      else if B[i][j] = 2, print LR[j] (Deletion) and go down;
      else print LR[j] (Hit/Match or Substitution) and go down diagonally

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.
  (LR = reference word string, indexed by j; LT = test word string, indexed by i)

SP - Berlin Chen 75

Measures of ASR Performance (5/8)

• A Dynamic Programming Algorithm (HTK-style)
  – Initialization (the recognized/test word sequence is indexed by i = 1..n, the correct/reference word sequence by j = 1..m):

    grid[0][0].score = grid[0][0].ins = grid[0][0].del = 0;
    grid[0][0].sub   = grid[0][0].hit = 0;
    grid[0][0].dir   = NIL;

    for (i = 1; i <= n; i++) {          /* test */
        grid[i][0] = grid[i-1][0];
        grid[i][0].dir = HOR;
        grid[i][0].score += InsPen;
        grid[i][0].ins++;
    }

    for (j = 1; j <= m; j++) {          /* reference */
        grid[0][j] = grid[0][j-1];
        grid[0][j].dir = VERT;
        grid[0][j].score += DelPen;
        grid[0][j].del++;
    }

[Figure: the (n+1) × (m+1) alignment grid. The recognized/test word sequence (1, 2, ..., i, ..., n-1, n) runs along one axis and the correct/reference word sequence (1, 2, ..., j, ..., m-1, m) along the other. Moving horizontally from (i-1, j) to (i, j) is an insertion, vertically from (i, j-1) to (i, j) a deletion, and diagonally from (i-1, j-1) to (i, j) a hit or substitution; the first row accumulates 1 Ins, 2 Ins, 3 Ins, ... and the first column 1 Del, 2 Del, 3 Del, ... up to cell (n, m).]

SP - Berlin Chen 76

Measures of ASR Performance (6/8)

• Program (main loop):

    for (i = 1; i <= n; i++) {               /* test */
        gridi  = grid[i];
        gridi1 = grid[i-1];
        for (j = 1; j <= m; j++) {           /* reference */
            h = gridi1[j].score + insPen;
            d = gridi1[j-1].score;
            if (lRef[j] != lTest[i])
                d += subPen;
            v = gridi[j-1].score + delPen;
            if (d <= h && d <= v) {          /* DIAG = hit or sub */
                gridi[j]       = gridi1[j-1];
                gridi[j].score = d;
                gridi[j].dir   = DIAG;
                if (lRef[j] == lTest[i]) ++gridi[j].hit;
                else                     ++gridi[j].sub;
            } else if (h < v) {              /* HOR = ins */
                gridi[j]       = gridi1[j];
                gridi[j].score = h;
                gridi[j].dir   = HOR;
                ++gridi[j].ins;
            } else {                         /* VERT = del */
                gridi[j]       = gridi[j-1];
                gridi[j].score = v;
                gridi[j].dir   = VERT;
                ++gridi[j].del;
            }
        }  /* for j */
    }      /* for i */

• Example 1 (HTK-style, unit penalties)

  Correct (reference):  A C B C C
  Recognized (test):    B A B C

[Figure: the alignment grid with the running (Ins, Del, Sub, Hit) counts in each cell, starting from (0,0,0,0) at the origin and ending at (1,2,0,3) in cell (n, m).]

  Optimal backtrace: Ins B, Hit A, Del C, Hit B, Hit C, Del C

  Alignment 1: WER = (1 Ins + 2 Del + 0 Sub) / 5 = 60%
  (There is still another optimal alignment with the same error count.)

  Note: the assignments such as gridi[j] = gridi1[j-1] in the program above are C structure assignments.
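To make the fragments above runnable end to end, here is a simplified, self-contained sketch that computes only the minimum error count (unit penalties, no per-cell Ins/Del/Sub/Hit bookkeeping or backtrace) and reproduces the 60% WER of Example 1; the word labels come from the slide, everything else is illustrative.

    #include <stdio.h>
    #include <string.h>

    #define MAXW 16

    static int min3(int a, int b, int c)
    {
        int m = (a < b) ? a : b;
        return (m < c) ? m : c;
    }

    /* Minimum word errors between a reference of m words and a test of n words. */
    int word_errors(char ref[][8], int m, char test[][8], int n)
    {
        int G[MAXW][MAXW];
        int i, j;

        G[0][0] = 0;
        for (i = 1; i <= n; i++) G[i][0] = G[i-1][0] + 1;   /* insertions */
        for (j = 1; j <= m; j++) G[0][j] = G[0][j-1] + 1;   /* deletions  */

        for (i = 1; i <= n; i++)           /* test      */
            for (j = 1; j <= m; j++) {     /* reference */
                int diag = G[i-1][j-1] + (strcmp(test[i-1], ref[j-1]) != 0);
                G[i][j] = min3(G[i-1][j] + 1, G[i][j-1] + 1, diag);
            }
        return G[n][m];
    }

    int main(void)
    {
        char ref[5][8]  = { "A", "C", "B", "C", "C" };   /* correct    */
        char test[4][8] = { "B", "A", "B", "C" };        /* recognized */
        int errors = word_errors(ref, 5, test, 4);

        printf("errors = %d, WER = %.0f%%\n", errors, 100.0 * errors / 5);  /* 3, 60% */
        return 0;
    }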

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2 (unit penalties)

  Correct (reference):  A C B C C
  Recognized (test):    B A A C

[Figure: the alignment grid with the running (Ins, Del, Sub, Hit) counts in each cell, ending at (1,2,1,2) in cell (n, m).]

  Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C   →  WER = 80%
  Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C   →  WER = 80%
  Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C          →  WER = 80%

  Note: the penalties for substitution, deletion, and insertion errors are all set to 1 here.

SP - Berlin Chen 78

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion, and insertion errors:

  HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

SP - Berlin Chen 79

Homework 3

• Measures of ASR Performance

  Reference:   桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (In the label files each character is preceded by its start and end times; in this excerpt the time fields are all 100000 100000.)

SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

  ------------------------ Overall Results (506 stories) ------------------------
  SENT: %Correct = 0.00 [H=0, S=506, N=506]
  WORD: %Corr = 86.83, Acc = 86.06 [H=57144, D=829, S=7839, I=504, N=65812]

  ------------------------ Overall Results (1 story) ----------------------------
  SENT: %Correct = 0.00 [H=0, S=1, N=1]
  WORD: %Corr = 81.52, Acc = 81.52 [H=75, D=4, S=13, I=0, N=92]

  ------------------------ Overall Results (100 stories) ------------------------
  SENT: %Correct = 0.00 [H=0, S=100, N=100]
  WORD: %Corr = 87.66, Acc = 86.83 [H=10832, D=177, S=1348, I=102, N=12357]

  ------------------------ Overall Results (200 stories) ------------------------
  SENT: %Correct = 0.00 [H=0, S=200, N=200]
  WORD: %Corr = 87.91, Acc = 87.18 [H=22657, D=293, S=2824, I=186, N=25774]
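The percentages in these HTK-style result lines follow from the counts as %Corr = H/N and Acc = (H − I)/N; a quick C check using the 506-story line:

    #include <stdio.h>

    int main(void)
    {
        /* counts from the 506-story "Overall Results" line above */
        double H = 57144, D = 829, S = 7839, I = 504, N = 65812;

        printf("Corr = %.2f%%\n", 100.0 * H / N);             /* 86.83 */
        printf("Acc  = %.2f%%\n", 100.0 * (H - I) / N);       /* 86.06 */
        printf("CER  = %.2f%%\n", 100.0 * (S + D + I) / N);   /* = 100 - Acc */
        return 0;
    }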

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

[Figure, left: two bottles (urns) A and B containing colored balls. Observed data O: the "ball sequence"; latent data S: the "bottle sequence". Parameters to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).]

[Figure, right: a three-state HMM (s1, s2, s3) with discrete observation distributions such as {A: 0.3, B: 0.2, C: 0.5}, {A: 0.7, B: 0.1, C: 0.2}, {A: 0.3, B: 0.6, C: 0.1} and transition probabilities such as 0.7, 0.3, 0.2, 0.1, 0.6. Given the training observations o1, o2, ..., oT, the model λ is updated to λ̄ such that p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables called latent data. In our case here, the state sequence S is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence.
  – Two major steps:
    • E: take the expectation with respect to the latent data S, using the current estimate of the parameters and conditioned on the observations, E[ · | O, λ]
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion
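In symbols (using the notation derived in the following slides), the two steps can be stated compactly as

$$\text{E-step:}\quad Q(\lambda,\bar{\lambda})=E_{\mathbf{S}}\bigl[\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})\,\big|\,\mathbf{O},\lambda\bigr]=\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$$

$$\text{M-step:}\quad \bar{\lambda}^{*}=\arg\max_{\bar{\lambda}}\;Q(\lambda,\bar{\lambda})$$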

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations X = (x_1, x_2, ..., x_n):
  – The Maximum Likelihood (ML) principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum. For example, if Φ = (μ, Σ) are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

$$\boldsymbol{\mu}_{ML}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_{i},\qquad
\boldsymbol{\Sigma}_{ML}=\frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_{i}-\boldsymbol{\mu}_{ML})(\mathbf{x}_{i}-\boldsymbol{\mu}_{ML})^{T}$$

  – The Maximum A Posteriori (MAP) principle: find the model parameter Φ so that the posterior likelihood p(Φ|x) is maximum.
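A small C sketch of the ML formulas above (sample mean and the 1/n-normalized covariance) for d-dimensional i.i.d. data; the data values are arbitrary illustrations.

    #include <stdio.h>

    #define NS 4    /* number of samples */
    #define D  2    /* feature dimension */

    int main(void)
    {
        double x[NS][D] = { {1.0, 2.0}, {2.0, 0.5}, {0.0, 1.5}, {1.0, 2.0} };
        double mu[D] = { 0.0 }, sigma[D][D] = { { 0.0 } };
        int i, p, q;

        /* mu_ML = (1/n) sum_i x_i */
        for (i = 0; i < NS; i++)
            for (p = 0; p < D; p++)
                mu[p] += x[i][p] / NS;

        /* Sigma_ML = (1/n) sum_i (x_i - mu)(x_i - mu)^T */
        for (i = 0; i < NS; i++)
            for (p = 0; p < D; p++)
                for (q = 0; q < D; q++)
                    sigma[p][q] += (x[i][p] - mu[p]) * (x[i][q] - mu[q]) / NS;

        printf("mu = (%.3f, %.3f)\n", mu[0], mu[1]);
        for (p = 0; p < D; p++)
            printf("Sigma[%d] = (%.3f, %.3f)\n", p, sigma[p][0], sigma[p][1]);
        return 0;
    }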

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)

• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(S|O, λ), to compute a new λ̄, the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression and expectation taken over S (λ̄ is the unknown model setting):

By Bayes' rule, the complete-data likelihood P(O, S|λ̄) and the incomplete-data likelihood P(O|λ̄) are related by

$$P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})=P(\mathbf{O}\mid\bar{\lambda})\,P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})
\;\Rightarrow\;
\log P(\mathbf{O}\mid\bar{\lambda})=\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})-\log P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})$$

Taking the expectation over S under the current model λ (note that log P(O|λ̄) does not depend on S):

$$\log P(\mathbf{O}\mid\bar{\lambda})
=\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})
-\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\log P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})$$

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.)
    • We can thus express log P(O|λ̄) as follows:

$$\log P(\mathbf{O}\mid\bar{\lambda}) = Q(\lambda,\bar{\lambda}) - H(\lambda,\bar{\lambda})$$

where

$$Q(\lambda,\bar{\lambda})=\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda}),\qquad
H(\lambda,\bar{\lambda})=\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\log P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})$$

    • We want log P(O|λ̄) ≥ log P(O|λ), i.e.

$$\log P(\mathbf{O}\mid\bar{\lambda})-\log P(\mathbf{O}\mid\lambda)
=\bigl[Q(\lambda,\bar{\lambda})-Q(\lambda,\lambda)\bigr]-\bigl[H(\lambda,\bar{\lambda})-H(\lambda,\lambda)\bigr]\;\ge\;0$$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property:

$$H(\lambda,\bar{\lambda})-H(\lambda,\lambda)
=\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\log\frac{P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})}{P(\mathbf{S}\mid\mathbf{O},\lambda)}
\;\le\;\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\left(\frac{P(\mathbf{S}\mid\mathbf{O},\bar{\lambda})}{P(\mathbf{S}\mid\mathbf{O},\lambda)}-1\right)=0$$

(using log x ≤ x − 1, i.e. Jensen's inequality; the difference is the negative of a Kullback-Leibler (KL) distance, which is non-negative)

  – Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function (auxiliary function)

$$Q(\lambda,\bar{\lambda})=\sum_{\mathbf{S}}P(\mathbf{S}\mid\mathbf{O},\lambda)\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$$

the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function

$$Q(\lambda,\bar{\lambda})=\sum_{\mathbf{S}}\frac{P(\mathbf{O},\mathbf{S}\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})$$

  – where log P(O, S|λ) and log P(O, S|λ̄) can be expressed as

$$\log P(\mathbf{O},\mathbf{S}\mid\lambda)=\log\pi_{s_1}+\sum_{t=1}^{T-1}\log a_{s_t s_{t+1}}+\sum_{t=1}^{T}\log b_{s_t}(o_t)$$

$$\log P(\mathbf{O},\mathbf{S}\mid\bar{\lambda})=\log\bar{\pi}_{s_1}+\sum_{t=1}^{T-1}\log\bar{a}_{s_t s_{t+1}}+\sum_{t=1}^{T}\log\bar{b}_{s_t}(o_t)$$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

$$Q(\lambda,\bar{\lambda})=Q_{\pi}(\lambda,\bar{\pi})+Q_{a}(\lambda,\bar{a})+Q_{b}(\lambda,\bar{b})$$

where

$$Q_{\pi}(\lambda,\bar{\pi})=\sum_{i=1}^{N}\frac{P(\mathbf{O},s_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log\bar{\pi}_{i}$$

$$Q_{a}(\lambda,\bar{a})=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\frac{P(\mathbf{O},s_t=i,s_{t+1}=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log\bar{a}_{ij}$$

$$Q_{b}(\lambda,\bar{b})=\sum_{j=1}^{N}\sum_{k=1}^{K}\;\sum_{t:\,o_t=v_k}\frac{P(\mathbf{O},s_t=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}\,\log\bar{b}_{j}(k)$$

Each of these sums has the same form, a weighted sum of logarithms of the parameters to be re-estimated, Σ_j w_j log y_j.

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, in π_i, a_ij, and b_j(k)
  – They can be maximized individually
  – All are of the same form:

$$F(\mathbf{y})=g(y_1,y_2,\ldots,y_N)=\sum_{j=1}^{N}w_{j}\log y_{j},\qquad
\text{where }\sum_{j=1}^{N}y_{j}=1\text{ and }y_{j}\ge 0,$$

has its maximum value when

$$y_{j}=\frac{w_{j}}{\sum_{j'=1}^{N}w_{j'}}$$
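For instance, with N = 3 and weights w = (2, 1, 1), the constrained maximum of F(y) is attained at y = (1/2, 1/4, 1/4), i.e. y_j = w_j / (w_1 + w_2 + w_3); every re-estimation formula below is an instance of this result.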

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply the Lagrange multiplier

By applying the Lagrange multiplier ℓ to the constraint Σ_j y_j = 1,

$$F=\sum_{j=1}^{N}w_{j}\log y_{j}+\ell\Bigl(1-\sum_{j=1}^{N}y_{j}\Bigr)$$

$$\frac{\partial F}{\partial y_{j}}=\frac{w_{j}}{y_{j}}-\ell=0
\;\Rightarrow\; w_{j}=\ell\,y_{j},\quad\forall j$$

Summing over j gives

$$\sum_{j=1}^{N}w_{j}=\ell\sum_{j=1}^{N}y_{j}=\ell
\;\Rightarrow\; y_{j}=\frac{w_{j}}{\sum_{j'=1}^{N}w_{j'}}$$

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as:

$$\bar{\pi}_{i}=\frac{P(\mathbf{O},s_{1}=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}=P(s_{1}=i\mid\mathbf{O},\lambda)$$

$$\bar{a}_{ij}=\frac{\sum_{t=1}^{T-1}P(s_{t}=i,s_{t+1}=j\mid\mathbf{O},\lambda)}{\sum_{t=1}^{T-1}P(s_{t}=i\mid\mathbf{O},\lambda)}$$

$$\bar{b}_{i}(k)=\frac{\sum_{t=1,\;o_{t}=v_{k}}^{T}P(s_{t}=i\mid\mathbf{O},\lambda)}{\sum_{t=1}^{T}P(s_{t}=i\mid\mathbf{O},\lambda)}$$
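A minimal C sketch of these re-estimation formulas, assuming the posteriors gamma_[t][i] = P(s_t = i | O, λ) and xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, λ) have already been produced by the forward-backward procedure (not shown); the dimensions, the observation symbols, and the uniform dummy posteriors in main() are purely illustrative.

    #include <stdio.h>

    #define T 4   /* observation length         */
    #define N 3   /* number of states           */
    #define K 3   /* number of discrete symbols */

    double gamma_[T][N];          /* gamma_[t][i] = P(s_t = i | O, lambda)             */
    double xi[T-1][N][N];         /* xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda) */
    int    o[T] = { 0, 2, 1, 2 }; /* observation symbol indices o_t                    */

    double pi_new[N], a_new[N][N], b_new[N][K];

    void reestimate(void)
    {
        int i, j, k, t;
        for (i = 0; i < N; i++) {
            double occ = 0.0, occT1 = 0.0;
            for (t = 0; t < T; t++)     occ   += gamma_[t][i];
            for (t = 0; t < T - 1; t++) occT1 += gamma_[t][i];

            pi_new[i] = gamma_[0][i];                 /* new initial probability       */

            for (j = 0; j < N; j++) {                 /* new transition probabilities  */
                double num = 0.0;
                for (t = 0; t < T - 1; t++) num += xi[t][i][j];
                a_new[i][j] = num / occT1;
            }
            for (k = 0; k < K; k++) {                 /* new observation probabilities */
                double num = 0.0;
                for (t = 0; t < T; t++)
                    if (o[t] == k) num += gamma_[t][i];
                b_new[i][k] = num / occ;
            }
        }
    }

    int main(void)
    {
        int i, j, t;
        for (t = 0; t < T; t++)                       /* uniform dummy posteriors */
            for (i = 0; i < N; i++) {
                gamma_[t][i] = 1.0 / N;
                if (t < T - 1)
                    for (j = 0; j < N; j++) xi[t][i][j] = 1.0 / (N * N);
            }
        reestimate();
        printf("pi_new[0] = %.3f, a_new[0][0] = %.3f, b_new[0][0] = %.3f\n",
               pi_new[0], a_new[0][0], b_new[0][0]);
        return 0;
    }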

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observations do not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

$$b_{j}(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,b_{jk}(\mathbf{o})
=\sum_{k=1}^{M}c_{jk}\,N(\mathbf{o};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk}),\qquad
\sum_{k=1}^{M}c_{jk}=1$$

$$N(\mathbf{o};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk})
=\frac{1}{(2\pi)^{L/2}\,\bigl|\boldsymbol{\Sigma}_{jk}\bigr|^{1/2}}
\exp\!\Bigl(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{T}\boldsymbol{\Sigma}_{jk}^{-1}(\mathbf{o}-\boldsymbol{\mu}_{jk})\Bigr)$$

[Figure: the distribution for state i is a weighted sum of Gaussians N_1, N_2, N_3 with mixture weights w_i1, w_i2, w_i3.]

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_jk(o_t):

$$p(\mathbf{O}\mid\lambda)
=\sum_{\mathbf{S}}\prod_{t=1}^{T}a_{s_{t-1}s_{t}}\,b_{s_{t}}(\mathbf{o}_{t})
=\sum_{\mathbf{S}}\prod_{t=1}^{T}a_{s_{t-1}s_{t}}\sum_{k=1}^{M}c_{s_{t}k}\,b_{s_{t}k}(\mathbf{o}_{t})
=\sum_{\mathbf{S}}\sum_{\mathbf{K}}p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\lambda)$$

where K = (k_1, k_2, ..., k_T) is one of the possible mixture component sequences along the state sequence S (and a_{s_0 s_1} denotes the initial probability π_{s_1}), with

$$p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\lambda)
=\prod_{t=1}^{T}a_{s_{t-1}s_{t}}\,c_{s_{t}k_{t}}\,b_{s_{t}k_{t}}(\mathbf{o}_{t})$$

Note: expanding the product of sums gives one term per component sequence,

$$\prod_{t=1}^{T}\Bigl(\sum_{k=1}^{M}a_{tk}\Bigr)
=(a_{11}+a_{12}+\cdots+a_{1M})(a_{21}+\cdots+a_{2M})\cdots(a_{T1}+\cdots+a_{TM})
=\sum_{k_{1}=1}^{M}\sum_{k_{2}=1}^{M}\cdots\sum_{k_{T}=1}^{M}\prod_{t=1}^{T}a_{tk_{t}}$$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

$$Q(\lambda,\bar{\lambda})=\sum_{\mathbf{S}}\sum_{\mathbf{K}}
\frac{p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\lambda)}{p(\mathbf{O}\mid\lambda)}\,
\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\lambda})$$

with

$$\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\lambda})
=\log\bar{\pi}_{s_{1}}
+\sum_{t=1}^{T-1}\log\bar{a}_{s_{t}s_{t+1}}
+\sum_{t=1}^{T}\log\bar{b}_{s_{t}k_{t}}(\mathbf{o}_{t})
+\sum_{t=1}^{T}\log\bar{c}_{s_{t}k_{t}}$$

so that

$$Q(\lambda,\bar{\lambda})
=Q_{\pi}(\lambda,\bar{\pi})+Q_{a}(\lambda,\bar{a})
+Q_{b}(\lambda,\bar{b})+Q_{c}(\lambda,\bar{c})$$

(initial probabilities, state transition probabilities, Gaussian density functions, and mixture components/weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:

$$Q_{b}(\lambda,\bar{b})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}
P(s_{t}=j,k_{t}=k\mid\mathbf{O},\lambda)\,\log\bar{b}_{jk}(\mathbf{o}_{t})$$

$$Q_{c}(\lambda,\bar{c})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}
P(s_{t}=j,k_{t}=k\mid\mathbf{O},\lambda)\,\log\bar{c}_{jk}$$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). Writing out the single-Gaussian log-density,

$$\bar{b}_{jk}(\mathbf{o}_{t})
=\frac{1}{(2\pi)^{L/2}\bigl|\bar{\boldsymbol{\Sigma}}_{jk}\bigr|^{1/2}}
\exp\!\Bigl(-\tfrac{1}{2}(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})\Bigr)$$

$$\log\bar{b}_{jk}(\mathbf{o}_{t})
=-\frac{L}{2}\log(2\pi)-\frac{1}{2}\log\bigl|\bar{\boldsymbol{\Sigma}}_{jk}\bigr|
-\frac{1}{2}(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})$$

Setting the derivative of Q_b with respect to the mean vector to zero,

$$\frac{\partial Q_{b}(\lambda,\bar{b})}{\partial\bar{\boldsymbol{\mu}}_{jk}}
=\sum_{t=1}^{T}\gamma_{t}(j,k)\,\bar{\boldsymbol{\Sigma}}_{jk}^{-1}(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})=\mathbf{0}
\;\Rightarrow\;
\bar{\boldsymbol{\mu}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)\,\mathbf{o}_{t}}{\sum_{t=1}^{T}\gamma_{t}(j,k)}$$

(using d(x^T C x)/dx = (C + C^T)x = 2Cx, since Σ̄_jk, and hence Σ̄_jk^{-1}, is symmetric)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Setting the derivative of Q_b with respect to the (inverse) covariance matrix to zero,

$$\frac{\partial Q_{b}(\lambda,\bar{b})}{\partial\bar{\boldsymbol{\Sigma}}_{jk}^{-1}}
=\sum_{t=1}^{T}\gamma_{t}(j,k)
\Bigl[\tfrac{1}{2}\bar{\boldsymbol{\Sigma}}_{jk}
-\tfrac{1}{2}(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})^{T}\Bigr]=\mathbf{0}$$

$$\Rightarrow\;
\bar{\boldsymbol{\Sigma}}_{jk}
=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)\,(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk})^{T}}
{\sum_{t=1}^{T}\gamma_{t}(j,k)}$$

(using ∂(a^T X b)/∂X = a b^T and ∂ log det X/∂X = (X^{-1})^T, with Σ̄_jk symmetric and det Σ̄_jk^{-1} = 1/det Σ̄_jk)

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc
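A matching C sketch of the mixture-mean and mixture-weight updates above, assuming the mixture occupancy posteriors gam[t][j][k] = p(s_t = j, k_t = k | O, λ) are already available from forward-backward; the dimensions, observations, and uniform dummy posteriors are illustrative, and the covariance update would follow the same pattern with outer products of (o_t − μ̄_jk).

    #include <stdio.h>

    #define T 4   /* frames             */
    #define N 2   /* states             */
    #define M 2   /* mixtures per state */
    #define D 2   /* feature dimension  */

    double gam[T][N][M];     /* gam[t][j][k] = p(s_t = j, k_t = k | O, lambda) */
    double o[T][D] = { {0.1, 0.2}, {0.4, 0.3}, {0.2, 0.8}, {0.9, 0.5} };
    double mu_new[N][M][D], c_new[N][M];

    void reestimate_mixtures(void)
    {
        int t, j, k, d;
        for (j = 0; j < N; j++) {
            double state_occ = 0.0;                   /* sum over t and k */
            for (t = 0; t < T; t++)
                for (k = 0; k < M; k++) state_occ += gam[t][j][k];

            for (k = 0; k < M; k++) {
                double occ = 0.0;
                for (t = 0; t < T; t++) occ += gam[t][j][k];

                for (d = 0; d < D; d++) {             /* weighted-mean update */
                    double num = 0.0;
                    for (t = 0; t < T; t++) num += gam[t][j][k] * o[t][d];
                    mu_new[j][k][d] = num / occ;
                }
                c_new[j][k] = occ / state_occ;        /* mixture-weight update */
            }
        }
    }

    int main(void)
    {
        int t, j, k;
        for (t = 0; t < T; t++)                       /* uniform dummy posteriors */
            for (j = 0; j < N; j++)
                for (k = 0; k < M; k++) gam[t][j][k] = 1.0 / (N * M);
        reestimate_mixtures();
        printf("c_new[0][0] = %.3f, mu_new[0][0] = (%.3f, %.3f)\n",
               c_new[0][0], mu_new[0][0][0], mu_new[0][0][1]);
        return 0;
    }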

Page 66: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 66

Known Limitations of HMMs (23)

bull Duration modeling

geometricexponentialdistribution

empirical distribution

Gaussian distribution

Gammadistribution

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 67: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 67

Known Limitations of HMMs (33)

bull The HMM parameters trained by the Baum-Welch algorithm (or EM algorithm) were only locally optimized

Current Model Configuration Model Configuration Space

Likelihood

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 68: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 68

Homework-2 (12)

s2

s1

s3

A34B33C33

034

034

033033

033

033033

033

034

A33B34C33 A33B33C34

TrainSet 11 ABBCABCAABC 2 ABCABC 3 ABCA ABC 4 BBABCAB 5 BCAABCCAB 6 CACCABCA 7 CABCABCA 8 CABCA 9 CABCA

TrainSet 21 BBBCCBC 2 CCBABB 3 AACCBBB 4 BBABBAC 5 CCA ABBAB 6 BBBCCBAA 7 ABBBBABA 8 CCCCC 9 BBAAA

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (7/8)

• Example 2
  – Correct: A C B C C;  Test: B A A C
  – [Figure: the filled grid of (Ins, Del, Sub, Hit) counts for Example 2.]
  – Alignment 1: Ins B, Hit A, Del C, Sub B, Hit C, Del C  →  WER = 80%
  – Alignment 2: Ins B, Hit A, Sub C, Del B, Hit C, Del C  →  WER = 80%
  – Alignment 3: Sub A, Sub C, Sub B, Hit C, Del C         →  WER = 80%
  – Note: the penalties for substitution, deletion and insertion errors are all set to 1 here.

Measures of ASR Performance (8/8)

• Two common settings of different penalties for substitution, deletion and insertion errors:
  – HTK error penalties:  subPen = 10, delPen = 7, insPen = 7
  – NIST error penalties: subPenNIST = 4, delPenNIST = 3, insPenNIST = 3

Homework 3

• Measures of ASR Performance
  – Reference:  桃芝颱風重創花蓮光復鄉大興村死傷慘重感觸最多……
  – ASR Output: 桃芝颱風重創花蓮光復鄉打新村次傷殘周感觸最多……
  – (in the data files each character is preceded by two time fields, e.g. "100000 100000")

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion and insertion errors

    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=506, N=506]
    WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
    ====================================================================
    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=1, N=1]
    WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
    ====================================================================
    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=100, N=100]
    WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
    ====================================================================
    ------------------------ Overall Results --------------------------
    SENT: %Correct=0.00 [H=0, S=200, N=200]
    WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
    ====================================================================

Symbols for Mathematical Operations

[The table of notation on this slide did not survive text extraction.]

The EM Algorithm (1/7)

[Figure: two bottles, A and B, containing red (R) and green (G) balls.]
  – Observed data O: the "ball sequence"; latent data S: the "bottle sequence"
  – Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a three-state HMM (s1, s2, s3) with discrete output distributions {A: .3, B: .2, C: .5}, {A: .7, B: .1, C: .2}, {A: .3, B: .6, C: .1} and the transition probabilities shown in the original diagram (0.6, 0.7, 0.3, 0.3, 0.2, 0.2, 0.1), generating o1 o2 …… oT with likelihood p(O|λ); each EM re-estimation produces a new model λ̂ with p(O|λ̂) > p(O|λ).]

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate {A, B} without consideration of the state sequence.
  – Two major steps:
    • E: take the expectation E[S|O, λ] with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion

The EM Algorithm (3/7)

• Estimation principles based on the observations X = X_1, X_2, ..., X_n → x_1, x_2, ..., x_n (ML and MAP):
  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum.
    For example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are
      μ_ML = (1/n) Σ_{i=1}^{n} x_i ,   Σ_ML = (1/n) Σ_{i=1}^{n} (x_i − μ_ML)(x_i − μ_ML)^T
  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior probability p(Φ|X) is maximum.

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – It discovers new model parameters that maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O, S|λ)
• First, scalar (discrete) random variables are used to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

The EM Algorithm (5/7)

  – Assume we have an initial model λ and estimate the probability that each state sequence S occurred in the generation of O
  – Pretend we had in fact observed the complete data pair (O, S), with frequency proportional to that probability P(S|O, λ), and from it compute a new λ̂, the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm:
    • Log-likelihood expression, with the expectation taken over S. By Bayes' rule,
        P(O, S|λ̂) = P(S|O, λ̂) P(O|λ̂)   (complete-data likelihood vs. incomplete-data likelihood)
      so
        log P(O|λ̂) = log P(O, S|λ̂) − log P(S|O, λ̂)
      Taking the expectation over S under the current (known) model λ, with λ̂ the unknown model setting to be estimated:
        log P(O|λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂) − Σ_S P(S|O, λ) log P(S|O, λ̂)

The EM Algorithm (6/7)

  – Algorithm (cont.):
    • We can thus express log P(O|λ̂) as follows:
        log P(O|λ̂) = Q(λ, λ̂) − H(λ, λ̂)
      where
        Q(λ, λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂)
        H(λ, λ̂) = Σ_S P(S|O, λ) log P(S|O, λ̂)
    • We want log P(O|λ̂) ≥ log P(O|λ), i.e.
        log P(O|λ̂) − log P(O|λ) = [Q(λ, λ̂) − Q(λ, λ)] − [H(λ, λ̂) − H(λ, λ)] ≥ 0

The EM Algorithm (7/7)

• H(λ, λ̂) has the following property:
    H(λ, λ̂) − H(λ, λ) = Σ_S P(S|O, λ) log [ P(S|O, λ̂) / P(S|O, λ) ]
                        ≤ Σ_S P(S|O, λ) [ P(S|O, λ̂) / P(S|O, λ) − 1 ]      (Jensen's inequality: log x ≤ x − 1)
                        = Σ_S P(S|O, λ̂) − Σ_S P(S|O, λ) = 1 − 1 = 0
  so H(λ, λ) − H(λ, λ̂) ≥ 0; it is in fact a Kullback-Leibler (KL) distance.
  – Therefore, for maximizing log P(O|λ̂) we only need to maximize the Q-function (auxiliary function)
      Q(λ, λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂) ,
    i.e. the expectation of the complete-data log-likelihood with respect to the latent state sequences.
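Putting the last two slides together, the monotonicity argument can be restated compactly in LaTeX (this is only a restatement of the Q and H quantities defined above, not additional material from the slides):

    \log P(O\mid\hat{\lambda}) - \log P(O\mid\lambda)
      = \bigl[Q(\lambda,\hat{\lambda}) - Q(\lambda,\lambda)\bigr]
        - \bigl[H(\lambda,\hat{\lambda}) - H(\lambda,\lambda)\bigr]
      \;\ge\; Q(\lambda,\hat{\lambda}) - Q(\lambda,\lambda),

    \text{so choosing } \hat{\lambda} = \arg\max_{\lambda'} Q(\lambda,\lambda')
    \text{ guarantees } P(O\mid\hat{\lambda}) \ge P(O\mid\lambda).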

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function
      Q(λ, λ̂) = Σ_S P(S|O, λ) log P(O, S|λ̂) = Σ_S [ P(O, S|λ) / P(O|λ) ] log P(O, S|λ̂)
  – where P(O, S|λ) and log P(O, S|λ̂) can be expressed as
      P(O, S|λ) = π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
      log P(O, S|λ̂) = log π̂_{s_1} + Σ_{t=2}^{T} log â_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̂_{s_t}(o_t)

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as
    Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂)
  where
    Q_π(λ, π̂) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O|λ) ] log π̂_i
    Q_a(λ, â) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] log â_ij
    Q_b(λ, b̂) = Σ_{j=1}^{N} Σ_{all k} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] log b̂_j(k)
  Each term is a sum of the form Σ_j w_j log y_j (see the next slide).

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̂_i, â_ij and b̂_j(k)
  – They can be maximized individually
  – They are all of the same form
      F(y_1, y_2, ..., y_N) = g(y_1, ..., y_N) = Σ_{j=1}^{N} w_j log y_j ,
      where Σ_{j=1}^{N} y_j = 1 and y_j ≥ 0
    F has its maximum value when
      y_j = w_j / Σ_{j=1}^{N} w_j

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier
  By applying the Lagrange multiplier ℓ with the constraint Σ_{j=1}^{N} y_j = 1, suppose that
      F = Σ_{j=1}^{N} w_j log y_j + ℓ ( 1 − Σ_{j=1}^{N} y_j )
      ∂F/∂y_j = w_j / y_j − ℓ = 0   ⇒   w_j = ℓ y_j , for all j
  Summing over j:
      Σ_{j=1}^{N} w_j = ℓ Σ_{j=1}^{N} y_j = ℓ
  therefore
      y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̂ = (Â, B̂, π̂) can be expressed as
    π̂_i = P(O, s_1 = i | λ) / P(O | λ)
    â_ij = Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ) / Σ_{t=1}^{T-1} P(O, s_t = i | λ)
    b̂_i(k) = Σ_{t: o_t = v_k} P(O, s_t = i | λ) / Σ_{t=1}^{T} P(O, s_t = i | λ)
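The updates above are normalized expected counts. As a concrete illustration, here is a minimal C sketch of the M-step (not the slides' code; it assumes the posteriors gamma[t][i] = P(s_t = i | O, λ) and xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, λ) have already been computed, e.g. with the forward-backward algorithm, and the size constants are illustrative):

    #define N_STATES  3        /* assumed model sizes, for illustration only */
    #define N_SYMBOLS 3

    /* One Baum-Welch M-step for a discrete HMM (single training sequence). */
    void reestimate(int T, const int obs[],                 /* obs[t] in 0..N_SYMBOLS-1 */
                    const double gamma[][N_STATES],
                    const double xi[][N_STATES][N_STATES],
                    double pi[N_STATES],
                    double A[N_STATES][N_STATES],
                    double B[N_STATES][N_SYMBOLS])
    {
        for (int i = 0; i < N_STATES; i++) {
            pi[i] = gamma[0][i];                            /* new initial probability */

            double occ_trans = 0.0, occ_all = 0.0;          /* expected state occupancies */
            for (int t = 0; t < T - 1; t++) occ_trans += gamma[t][i];
            for (int t = 0; t < T;     t++) occ_all   += gamma[t][i];

            for (int j = 0; j < N_STATES; j++) {            /* new transition probabilities */
                double num = 0.0;
                for (int t = 0; t < T - 1; t++) num += xi[t][i][j];
                A[i][j] = num / occ_trans;
            }
            for (int k = 0; k < N_SYMBOLS; k++) {           /* new output probabilities */
                double num = 0.0;
                for (int t = 0; t < T; t++)
                    if (obs[t] == k) num += gamma[t][i];
                B[i][k] = num / occ_all;
            }
        }
    }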

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous-Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
      b_j(o) = Σ_{k=1}^{M} c_jk b_jk(o) = Σ_{k=1}^{M} c_jk N(o; μ_jk, Σ_jk)
             = Σ_{k=1}^{M} c_jk (2π)^{-L/2} |Σ_jk|^{-1/2} exp( −(1/2)(o − μ_jk)^T Σ_jk^{-1} (o − μ_jk) ) ,
      with Σ_{k=1}^{M} c_jk = 1
  – [Figure: the distribution for state i drawn as a weighted sum of Gaussians N_1, N_2, N_3 with mixture weights w_i1, w_i2, w_i3.]
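For illustration, a minimal C sketch of evaluating b_j(o) for one state (not the slides' code). It assumes diagonal covariance matrices, which is a common simplification; the slide's formula allows full covariances. The fixed dimension bound and parameter names are assumptions:

    #include <math.h>

    #define L_MAX  39                    /* assumed maximum feature dimension */
    #define TWO_PI 6.283185307179586

    /* b_j(o) = sum_k c[k] * N(o; mean[k], diag(var[k])), for one state.
       L = feature dimension (L <= L_MAX), M = number of mixture components. */
    double mixture_density(int L, int M, const double o[],
                           const double c[], const double mean[][L_MAX],
                           const double var[][L_MAX])
    {
        double b = 0.0;
        for (int k = 0; k < M; k++) {
            double log_g = -0.5 * L * log(TWO_PI);          /* log N(o; mu, Sigma) */
            for (int d = 0; d < L; d++) {
                double diff = o[d] - mean[k][d];
                log_g -= 0.5 * (log(var[k][d]) + diff * diff / var[k][d]);
            }
            b += c[k] * exp(log_g);
        }
        return b;
    }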

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o_t) with respect to each single mixture component b_jk(o_t):
    p(O, S | λ) = Π_{t=1}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)
                = Π_{t=1}^{T} a_{s_{t-1} s_t} [ Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t) ]
                = Σ_K Π_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t)
  where K = (k_1, k_2, ..., k_T) is one of the possible mixture-component sequences along the state sequence S, so
    p(O, S, K | λ) = Π_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t k_t} b_{s_t k_t}(o_t) ,
    p(O | λ) = Σ_S Σ_K p(O, S, K | λ)
  Note:
    Π_{t=1}^{T} [ Σ_{k=1}^{M} a_{t k} ]
      = (a_11 + a_12 + ... + a_1M)(a_21 + a_22 + ... + a_2M) ... (a_T1 + a_T2 + ... + a_TM)
      = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} ... Σ_{k_T=1}^{M} Π_{t=1}^{T} a_{t k_t}

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
    Q(λ, λ̂) = Σ_S Σ_K P(S, K | O, λ) log p(O, S, K | λ̂)
            = Σ_S Σ_K [ p(O, S, K | λ) / p(O | λ) ] log p(O, S, K | λ̂)
  with
    log p(O, S, K | λ̂) = log π̂_{s_1} + Σ_{t=2}^{T} log â_{s_{t-1} s_t} + Σ_{t=1}^{T} log b̂_{s_t k_t}(o_t) + Σ_{t=1}^{T} log ĉ_{s_t k_t}
  so that
    Q(λ, λ̂) = Q_π(λ, π̂) + Q_a(λ, â) + Q_b(λ, b̂) + Q_c(λ, ĉ)
  (initial probabilities, state transition probabilities, Gaussian density functions, and mixture-component weights)

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training:
    Q_b(λ, b̂) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log b̂_jk(o_t)
    Q_c(λ, ĉ) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) log ĉ_jk

EM Applied to Continuous HMM Training (5/7)

  Let γ_t(j, k) = P(s_t = j, k_t = k | O, λ). Since
    b̂_jk(o_t) = (2π)^{-L/2} |Σ̂_jk|^{-1/2} exp( −(1/2)(o_t − μ̂_jk)^T Σ̂_jk^{-1} (o_t − μ̂_jk) )
    log b̂_jk(o_t) = −(L/2) log(2π) − (1/2) log|Σ̂_jk| − (1/2)(o_t − μ̂_jk)^T Σ̂_jk^{-1} (o_t − μ̂_jk) ,
  setting the derivative of Q_b with respect to μ̂_jk to zero, and using
    d(x^T C x)/dx = (C + C^T) x   (Σ̂_jk^{-1} is symmetric here)
  gives
    ∂Q_b/∂μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) Σ̂_jk^{-1} (o_t − μ̂_jk) = 0
  ⇒ μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) o_t / Σ_{t=1}^{T} γ_t(j, k)

EM Applied to Continuous HMM Training (6/7)

  Similarly, setting the derivative of Q_b with respect to Σ̂_jk^{-1} to zero, and using
    d(a^T X b)/dX = a b^T   and   d det(X)/dX = det(X) (X^{-1})^T   (Σ̂_jk is symmetric here)
  gives
    ∂Q_b/∂Σ̂_jk^{-1} = Σ_{t=1}^{T} γ_t(j, k) [ (1/2) Σ̂_jk − (1/2)(o_t − μ̂_jk)(o_t − μ̂_jk)^T ] = 0
  ⇒ Σ̂_jk = Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̂_jk)(o_t − μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)

EM Applied to Continuous HMM Training (7/7)

• The new model parameters for each mixture component and mixture weight can be expressed as
    μ̂_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) o_t / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
    Σ̂_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) (o_t − μ̂_jk)(o_t − μ̂_jk)^T / Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)
    ĉ_jk = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) / Σ_{t=1}^{T} Σ_{k'=1}^{M} p(s_t = j, k_t = k' | O, λ)
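As in the discrete case, these updates are weighted averages over time. A minimal C accumulation sketch for one state j and mixture k follows (not the slides' code; it assumes the posteriors gmix[t][j][k] = p(s_t = j, k_t = k | O, λ) and the observations obs[t][d] are already available, keeps only diagonal variance statistics for brevity, and uses illustrative size constants):

    #define N_STATES 3                     /* assumed sizes, for illustration */
    #define M_MIX    2
    #define L_DIM    39

    /* Accumulate the numerators/denominator of the mu/Sigma/c updates for
       state j, mixture k. mu_new[] should hold the already re-estimated mean
       (or run a first pass for the mean, then a second pass for the variance). */
    void accumulate_mixture(int T, int j, int k,
                            const double gmix[][N_STATES][M_MIX],
                            const double obs[][L_DIM],
                            const double mu_new[L_DIM],
                            double *occ, double num_mean[L_DIM], double num_var[L_DIM])
    {
        *occ = 0.0;
        for (int d = 0; d < L_DIM; d++) num_mean[d] = num_var[d] = 0.0;

        for (int t = 0; t < T; t++) {
            double g = gmix[t][j][k];                /* mixture occupancy at time t */
            *occ += g;
            for (int d = 0; d < L_DIM; d++) {
                double diff = obs[t][d] - mu_new[d];
                num_mean[d] += g * obs[t][d];        /* numerator of mu_jk    */
                num_var[d]  += g * diff * diff;      /* numerator of Sigma_jk */
            }
        }
        /* mu_jk[d]    = num_mean[d] / *occ
           Sigma_jk[d] = num_var[d]  / *occ
           c_jk        = *occ / (sum of *occ over all mixtures k of state j) */
    }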

Page 69: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 69

Homework-2 (22)

P1 Please specify the model parameters after the first and 50th iterations of Baum-Welch training

P2 Please show the recognition results by using the above training sequences as the testing data (The so-called inside testing) You have to perform the recognition task with the HMMs trained from the first and 50th iterations of Baum-Welch training respectively

P3 Which class do the following testing sequences belong toABCABCCABAABABCCCCBBB

P4 What are the results if Observable Markov Models were instead used in P1 P2 and P3

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 70: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 70

Isolated Word Recognition

Word Model M2

2Mp X

Word Model M1

1Mp X

Word Model MV

VMp X

Word Model MSil

SilMp X

Feature Extraction

Mos

t Lik

e W

ord

Sele

ctor

MML

Feature Sequence

X

SpeechSignal

Likelihood of M1

Likelihood of M2

Likelihood of MV

Likelihood of MSil

kk

MpLabel XX maxarg

Viterbi Approximation

kk

MpLabel SXXS

maxmaxarg

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 71: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 71

Measures of ASR Performance (18)

bull Evaluating the performance of automatic speech recognition (ASR) systems is critical and the Word Recognition Error Rate (WER) is one of the most important measures

bull There are typically three types of word recognition errorsndash Substitution

bull An incorrect word was substituted for the correct wordndash Deletion

bull A correct word was omitted in the recognized sentencendash Insertion

bull An extra word was added in the recognized sentence

bull How to determine the minimum error rate

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM algorithm is important to HMMs and other learning techniques
  – Discover new model parameters that maximize the log-likelihood of the incomplete data, log P(O|λ),
    by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,S|λ)

• First, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

– Assume we have λ and estimate the probability that each S occurred in the generation of O
– Pretend we had in fact observed the complete data pair (O,S), with frequency proportional to the probability P(S|O,λ), and from it compute a new λ̄, the maximum likelihood estimate of λ
– Does the process converge?
– Algorithm
  • Log-likelihood expression, with the expectation taken over S:

    Bayes' rule:   P(O,S|λ̄) = P(S|O,λ̄) · P(O|λ̄)
                   (complete-data likelihood)   (incomplete-data likelihood)

    ⇒  log P(O|λ̄) = log P(O,S|λ̄) − log P(S|O,λ̄)          (λ̄: the unknown model setting)

    Taking the expectation over S under the current model λ:

    log P(O|λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄) − Σ_S P(S|O,λ) log P(S|O,λ̄)

SP - Berlin Chen 87

The EM Algorithm (6/7)

– Algorithm (cont.)
  • We can thus express log P(O|λ̄) as follows:

      log P(O|λ̄) = Q(λ, λ̄) + H(λ, λ̄)

    where

      Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)
      H(λ, λ̄) = − Σ_S P(S|O,λ) log P(S|O,λ̄)

  • We want  log P(O|λ̄) ≥ log P(O|λ), i.e.

      Q(λ, λ̄) + H(λ, λ̄) ≥ Q(λ, λ) + H(λ, λ)

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(λ, λ̄) has the following property:  H(λ, λ̄) ≥ H(λ, λ)

    H(λ, λ̄) − H(λ, λ) = − Σ_S P(S|O,λ) log [ P(S|O,λ̄) / P(S|O,λ) ]
                       ≥ − Σ_S P(S|O,λ) [ P(S|O,λ̄) / P(S|O,λ) − 1 ]        (Jensen's inequality: log x ≤ x − 1)
                       = − Σ_S P(S|O,λ̄) + Σ_S P(S|O,λ) = 0

  (the left-hand side is the Kullback-Leibler (KL) distance between P(S|O,λ) and P(S|O,λ̄))

– Therefore, for maximizing log P(O|λ̄) we only need to maximize the Q-function (auxiliary function)

    Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)

  i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector λ = (A, B, π)
  – By maximizing the auxiliary function

      Q(λ, λ̄) = Σ_S P(S|O,λ) log P(O,S|λ̄)
               = Σ_S [ P(O,S|λ) / P(O|λ) ] log P(O,S|λ̄)

  – Where P(O,S|λ) and log P(O,S|λ̄) can be expressed as

      P(O,S|λ) = π_{s_1} · ∏_{t=1}^{T-1} a_{s_t s_{t+1}} · ∏_{t=1}^{T} b_{s_t}(o_t)

      log P(O,S|λ̄) = log π̄_{s_1} + Σ_{t=1}^{T-1} log ā_{s_t s_{t+1}} + Σ_{t=1}^{T} log b̄_{s_t}(o_t)
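For concreteness, a small worked instance of the complete-data log-likelihood above (a toy case of my own, not from the slides): for T = 3 and the state path S = (1, 1, 2),

    \log P(O, S \mid \bar{\lambda}) = \log\bar{\pi}_{1} + \log\bar{a}_{11} + \log\bar{a}_{12} + \log\bar{b}_{1}(o_1) + \log\bar{b}_{1}(o_2) + \log\bar{b}_{2}(o_3)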

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as

    Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄)

  where

    Q_π(λ, π̄) = Σ_{i=1}^{N} [ P(O, s_1 = i | λ) / P(O|λ) ] · log π̄_i

    Q_a(λ, ā) = Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{t=1}^{T-1} [ P(O, s_t = i, s_{t+1} = j | λ) / P(O|λ) ] · log ā_{ij}

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{all k} Σ_{t: o_t = v_k} [ P(O, s_t = j | λ) / P(O|λ) ] · log b̄_j(k)

  (each term is a weighted sum of logarithms, of the form Σ_i w_i log y_i)

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, π̄_i, ā_{ij} and b̄_j(k)
  – They can be maximized individually
  – All are of the same form

      F(y_1, y_2, …, y_N) = Σ_{j=1}^{N} w_j log y_j ,   where  Σ_{j=1}^{N} y_j = 1  and  y_j ≥ 0

    F has its maximum value when

      y_j = w_j / Σ_{j'=1}^{N} w_{j'}

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier

  By applying the Lagrange multiplier ℓ to F with the constraint Σ_{j=1}^{N} y_j = 1:

    F = Σ_{j=1}^{N} w_j log y_j + ℓ ( Σ_{j=1}^{N} y_j − 1 )

    ∂F/∂y_j = w_j / y_j + ℓ = 0   ⇒   w_j = −ℓ y_j ,  ∀ j

  Summing over j and using the constraint:

    Σ_{j=1}^{N} w_j = −ℓ Σ_{j=1}^{N} y_j = −ℓ

    ⇒   y_j = w_j / Σ_{j=1}^{N} w_j

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
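A quick numerical check of this result (my own toy numbers, natural logarithms): with w = (2, 1, 1) the maximizer is y = (0.5, 0.25, 0.25), and, for instance, the uniform assignment scores lower:

    F(0.5, 0.25, 0.25) = 2\log 0.5 + \log 0.25 + \log 0.25 \approx -4.16 \;>\; F(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}) = 4\log\tfrac{1}{3} \approx -4.39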

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set λ̄ = (Ā, B̄, π̄) can be expressed as

    π̄_i = P(O, s_1 = i | λ) / P(O|λ)

    ā_{ij} = Σ_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j | λ)  /  Σ_{t=1}^{T-1} P(O, s_t = i | λ)

    b̄_i(k) = Σ_{t=1, s.t. o_t = v_k}^{T} P(O, s_t = i | λ)  /  Σ_{t=1}^{T} P(O, s_t = i | λ)
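A minimal C sketch of this M-step, assuming the E-step posteriors gamma[t][i] = P(s_t = i | O, λ) and xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, λ) have already been computed (e.g., by the Forward-Backward procedure); the fixed sizes and names are illustrative assumptions, not from the slides:

#define N 3    /* number of states (illustrative) */
#define M 4    /* codebook size    (illustrative) */
#define T 100  /* number of frames (illustrative) */

/* Discrete-HMM M-step: re-estimate pi, a, b from the posteriors. */
void reestimate(const double gamma[T][N], const double xi[T - 1][N][N],
                const int obs[T],                 /* obs[t] in 0..M-1 */
                double pi[N], double a[N][N], double b[N][M])
{
    int i, j, k, t;

    for (i = 0; i < N; i++) {
        double occ = 0.0, occ_trans = 0.0;

        pi[i] = gamma[0][i];                       /* new pi_i = P(s_1 = i | O, lambda) */

        for (t = 0; t < T - 1; t++) occ_trans += gamma[t][i];
        for (j = 0; j < N; j++) {                  /* new a_ij */
            double num = 0.0;
            for (t = 0; t < T - 1; t++) num += xi[t][i][j];
            a[i][j] = num / occ_trans;
        }

        for (t = 0; t < T; t++) occ += gamma[t][i];
        for (k = 0; k < M; k++) {                  /* new b_i(k) */
            double num = 0.0;
            for (t = 0; t < T; t++)
                if (obs[t] == k) num += gamma[t][i];
            b[i][k] = num / occ;
        }
    }
}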

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in the form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space

• Continuous-Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

      b_j(o) = Σ_{k=1}^{M} c_{jk} b_{jk}(o)
             = Σ_{k=1}^{M} c_{jk} N(o; μ_{jk}, Σ_{jk})
             = Σ_{k=1}^{M} c_{jk} · (2π)^{-L/2} |Σ_{jk}|^{-1/2} exp( −½ (o − μ_{jk})^T Σ_{jk}^{-1} (o − μ_{jk}) ) ,

      with  Σ_{k=1}^{M} c_{jk} = 1

  [Figure: the distribution for state i drawn as a weighted sum of Gaussians N_1, N_2, N_3 with mixture weights w_{i1}, w_{i2}, w_{i3}]
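A minimal C sketch of evaluating log b_j(o) for one state, assuming diagonal covariances (a common choice in ASR front ends) and using the log-sum-exp trick for numerical stability; the names and fixed sizes are illustrative assumptions, not from the slides:

#include <math.h>

#define L 2   /* feature dimension (illustrative)         */
#define M 3   /* number of mixture components per state   */

/* log b_j(o) = log sum_k c_k N(o; mu_k, diag(var_k)), computed stably. */
double gmm_log_density(const double o[L], const double c[M],
                       const double mu[M][L], const double var[M][L])
{
    double logp[M], maxlog = -INFINITY, sum = 0.0;
    int k, d;

    for (k = 0; k < M; k++) {
        double e = 0.0;                       /* log of one weighted component */
        for (d = 0; d < L; d++) {
            double diff = o[d] - mu[k][d];
            e += -0.5 * (log(2.0 * M_PI * var[k][d]) + diff * diff / var[k][d]);
        }
        logp[k] = log(c[k]) + e;
        if (logp[k] > maxlog) maxlog = logp[k];
    }
    for (k = 0; k < M; k++) sum += exp(logp[k] - maxlog);   /* log-sum-exp */
    return maxlog + log(sum);
}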

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express b_j(o) with respect to each single mixture component b_{jk}(o):

    p(O,S|λ) = π_{s_1} · ∏_{t=1}^{T-1} a_{s_t s_{t+1}} · ∏_{t=1}^{T} b_{s_t}(o_t)
             = π_{s_1} · ∏_{t=1}^{T-1} a_{s_t s_{t+1}} · ∏_{t=1}^{T} [ Σ_{k=1}^{M} c_{s_t k} b_{s_t k}(o_t) ]

  so that

    p(O|λ) = Σ_S p(O,S|λ) = Σ_S Σ_K p(O,S,K|λ) ,

    p(O,S,K|λ) = π_{s_1} · ∏_{t=1}^{T-1} a_{s_t s_{t+1}} · ∏_{t=1}^{T} c_{s_t k_t} b_{s_t k_t}(o_t)

  where K = (k_1, k_2, …, k_T) is one of the possible mixture-component sequences along the state sequence S

  Note (product-to-sum expansion):

    ∏_{t=1}^{T} Σ_{k=1}^{M} a_{t k} = Σ_{k_1=1}^{M} Σ_{k_2=1}^{M} … Σ_{k_T=1}^{M} ∏_{t=1}^{T} a_{t k_t}

    e.g., (a_{11} + a_{12} + … + a_{1M})(a_{21} + a_{22} + … + a_{2M}) … (a_{T1} + a_{T2} + … + a_{TM}) expands into M^T product terms

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

    Q(λ, λ̄) = Σ_S Σ_K P(S,K|O,λ) log p(O,S,K|λ̄)
             = Σ_S Σ_K [ p(O,S,K|λ) / p(O|λ) ] log p(O,S,K|λ̄)

  with

    log p(O,S,K|λ̄) = log π̄_{s_1} + Σ_{t=1}^{T-1} log ā_{s_t s_{t+1}} + Σ_{t=1}^{T} log b̄_{s_t k_t}(o_t) + Σ_{t=1}^{T} log c̄_{s_t k_t}

  so that

    Q(λ, λ̄) = Q_π(λ, π̄) + Q_a(λ, ā) + Q_b(λ, b̄) + Q_c(λ, c̄)
               (initial probabilities, state transition probabilities, Gaussian density functions, mixture components)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference from discrete HMM training lies in the Q_b and Q_c terms:

    Q_b(λ, b̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) · log b̄_{jk}(o_t)

    Q_c(λ, c̄) = Σ_{j=1}^{N} Σ_{k=1}^{M} Σ_{t=1}^{T} P(s_t = j, k_t = k | O, λ) · log c̄_{jk}
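In practice the joint posterior P(s_t = j, k_t = k | O, λ) is obtained by splitting the state posterior γ_t(j) across the mixture components in proportion to their responsibilities, γ_t(j,k) = γ_t(j) · c_{jk} b_{jk}(o_t) / Σ_m c_{jm} b_{jm}(o_t). This decomposition is not spelled out on the slide, so the short C sketch below is offered only as an assumption in that sense (names are illustrative):

#define M 3   /* number of mixture components (illustrative) */

/* gamma_t(j,k) = gamma_t(j) * c_jk * b_jk(o_t) / sum_m c_jm * b_jm(o_t) */
void split_state_posterior(double gamma_j,        /* state posterior gamma_t(j)      */
                           const double c[M],     /* mixture weights c_jk            */
                           const double dens[M],  /* component densities b_jk(o_t)   */
                           double gamma_jk[M])    /* output: joint posteriors        */
{
    double denom = 0.0;
    int k;
    for (k = 0; k < M; k++) denom += c[k] * dens[k];
    for (k = 0; k < M; k++) gamma_jk[k] = gamma_j * c[k] * dens[k] / denom;
}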

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let γ_t(j,k) = P(s_t = j, k_t = k | O, λ).  With

    b̄_{jk}(o_t) = (2π)^{-L/2} |Σ̄_{jk}|^{-1/2} exp( −½ (o_t − μ̄_{jk})^T Σ̄_{jk}^{-1} (o_t − μ̄_{jk}) )

    log b̄_{jk}(o_t) = −(L/2) log(2π) − ½ log|Σ̄_{jk}| − ½ (o_t − μ̄_{jk})^T Σ̄_{jk}^{-1} (o_t − μ̄_{jk})

  set the derivative of Q_b(λ, b̄) = Σ_t Σ_j Σ_k γ_t(j,k) log b̄_{jk}(o_t) with respect to μ̄_{jk} to zero:

    ∂Q_b/∂μ̄_{jk} = Σ_{t=1}^{T} γ_t(j,k) Σ̄_{jk}^{-1} (o_t − μ̄_{jk}) = 0

    ⇒  μ̄_{jk} = Σ_{t=1}^{T} γ_t(j,k) o_t  /  Σ_{t=1}^{T} γ_t(j,k)

  (using  d(x^T C x)/dx = (C + C^T) x , and the fact that Σ̄_{jk} is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, set the derivative of Q_b with respect to Σ̄_{jk}^{-1} to zero:

    ∂Q_b/∂Σ̄_{jk}^{-1} = Σ_{t=1}^{T} γ_t(j,k) [ ½ Σ̄_{jk} − ½ (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T ] = 0

    ⇒  Σ̄_{jk} = Σ_{t=1}^{T} γ_t(j,k) (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T  /  Σ_{t=1}^{T} γ_t(j,k)

  (using  ∂(a^T X b)/∂X = a b^T  and  ∂ log det(X)/∂X = (X^{-1})^T , with Σ̄_{jk} symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

    μ̄_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) · o_t  /  Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)

    Σ̄_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ) · (o_t − μ̄_{jk})(o_t − μ̄_{jk})^T  /  Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)

    c̄_{jk} = Σ_{t=1}^{T} p(s_t = j, k_t = k | O, λ)  /  Σ_{t=1}^{T} Σ_{k'=1}^{M} p(s_t = j, k_t = k' | O, λ)
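To close the loop, a minimal C sketch of these continuous-HMM M-step updates for one state j, assuming diagonal covariances and that the joint posteriors gamma[t][k] = γ_t(j,k) are already available from the E-step; the names and fixed sizes are illustrative assumptions, not the slides' notation:

#define L 2    /* feature dimension (illustrative)       */
#define M 3    /* mixture components per state           */
#define T 100  /* number of frames (illustrative)        */

/* Continuous-HMM M-step for one state j:
   c_jk   = sum_t gamma_t(j,k) / sum_t sum_m gamma_t(j,m)
   mu_jk  = sum_t gamma_t(j,k) o_t / sum_t gamma_t(j,k)
   var_jk = sum_t gamma_t(j,k) (o_t - mu_jk)^2 / sum_t gamma_t(j,k)   (diagonal) */
void reestimate_state(const double gamma[T][M], const double o[T][L],
                      double c[M], double mu[M][L], double var[M][L])
{
    double occ[M] = {0.0}, occ_all = 0.0;
    int t, k, d;

    for (k = 0; k < M; k++)
        for (t = 0; t < T; t++) occ[k] += gamma[t][k];
    for (k = 0; k < M; k++) occ_all += occ[k];

    for (k = 0; k < M; k++) {
        c[k] = occ[k] / occ_all;                       /* new mixture weight */

        for (d = 0; d < L; d++) mu[k][d] = 0.0;
        for (t = 0; t < T; t++)
            for (d = 0; d < L; d++) mu[k][d] += gamma[t][k] * o[t][d];
        for (d = 0; d < L; d++) mu[k][d] /= occ[k];    /* new mean */

        for (d = 0; d < L; d++) var[k][d] = 0.0;
        for (t = 0; t < T; t++)
            for (d = 0; d < L; d++) {
                double diff = o[t][d] - mu[k][d];
                var[k][d] += gamma[t][k] * diff * diff;
            }
        for (d = 0; d < L; d++) var[k][d] /= occ[k];   /* new (diagonal) covariance */
    }
}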

Page 72: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 72

Measures of ASR Performance (28)bull Calculate the WER by aligning the correct word string

against the recognized word stringndash A maximum substring matching problemndash Can be handled by dynamic programming

bull Example

ndash Error analysis one deletion and one insertionndash Measures word error rate (WER) word correction rate (WCR)

word accuracy rate (WAR)

Correct ldquothe effect is clearrdquoRecognized ldquoeffect is not clearrdquo

504

13sentencecorrect in thewordsofNo

wordsIns- Matched100RateAccuracy Word

7543

sentencecorrect in the wordsof No wordsMatched100Rate Correction Word

5042

sentencecorrect in the wordsof No wordsInsDelSub100RateError Word

matched matchedinserted

deleted

WER+WAR=100

Might be higher than 100

Might be negative

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 73: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 73

Measures of ASR Performance (38)bull A Dynamic Programming Algorithm (Textbook)

denotes for the word length of the correctreference sentencedenotes for the word length of the recognizedtest sentence

minimum word error alignmentat the a grid [ij]

hit

hit

kinds ofalignment

Ref i

Test j

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 74: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 74

Measures of ASR Performance (48)bull Algorithm (by Berlin Chen)

Direction) (Vertical Deletion 2B[0][j]

11]-G[0][jG[0][j] ce referen m1jfor

Direction) l(Horizonta on Inserti1B[i][0]

11][0]-G[iG[i][0] testn 1ifor

0G[0][0] tionInitializa 1 Step

test i for reference j for

Direction) (Diagonal match 4 Direction) (Diagonaltion Substitu3

Direction) (Vertical n Deletio2 Direction) l(Horizonta on Inserti1

B[i][j]

Match) LT[i]LR[j] (if 1]-1][j-G[ion)Substituti LT[i]LR[j] (if 11]-1][j-G[i

)(Delection 11]-G[i][j )(Insertion 11][j]-G[i

minG[i][j]

ce referen m1jfor testn 1ifor

Iteration 2 Step

diagonallydown go then onSubstitutior h HitMatc LR[i] LR[j]print else down go then Deletion LR[j]print 2B[i][j] if else left go then nInsertioLT[i] print 1B[i][j] if

B[0][0])(B[n][m] path backtrace Optimal RateError Word100RateAccuracy Word

mG[n][m]100RateError Word

Backtrace and Measure 3 Step

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

Ref j

Test i

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 75: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 75

Measures of ASR Performance (58)

CorrectReference Word Sequence

Recognizedtest WordSequence

1 2 3 4 5 hellip hellip i hellip hellip n-1 n

mm-1

j4

3

21

1Ins

Del

Ins (ij)

Ins (nm)

Del

Del

bull A Dynamic Programming Algorithmndash Initialization

grid[0][0]score = grid[0][0]ins= grid[0][0]del = 0grid[0][0]sub = grid[0][0]hit = 0grid[0][0]dir = NIL

for (i=1ilt=ni++) testgrid[i][0] = grid[i-1][0]grid[i][0]dir = HORgrid[i][0]score +=InsPengrid[i][0]ins ++

00

for (j=1jlt=mj++) referencegrid[0][j] = grid[0][j-1]grid[0][j]dir = VERTgrid[0][j]score

+= DelPengrid[0][j]del ++

2Ins 3Ins

1Del

2Del3Del

HTK

(i-1j-1)

(i-1j)

(ij-1)

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3
• Measures of ASR Performance

  Reference:  桃 芝 颱 風 重 創 花 蓮 光 復 鄉 大 興 村 死 傷 慘 重 感 觸 最 多 ……
  ASR Output: 桃 芝 颱 風 重 創 花 蓮 光 復 鄉 打 新 村 次 傷 殘 周 感 觸 最 多 ……

  (In the original files each character carries two numeric fields, e.g. 100000 100000; they are omitted here for readability.)

SP - Berlin Chen 80

Homework 3
• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first one, 100, 200, and 506 stories
  – The result should show the number of substitution, deletion, and insertion errors

  ------------------------ Overall Results (506 stories) ------------------------
  SENT: %Correct=0.00 [H=0, S=506, N=506]
  WORD: %Corr=86.83, Acc=86.06 [H=57144, D=829, S=7839, I=504, N=65812]
  ------------------------ Overall Results (1 story) ----------------------------
  SENT: %Correct=0.00 [H=0, S=1, N=1]
  WORD: %Corr=81.52, Acc=81.52 [H=75, D=4, S=13, I=0, N=92]
  ------------------------ Overall Results (100 stories) ------------------------
  SENT: %Correct=0.00 [H=0, S=100, N=100]
  WORD: %Corr=87.66, Acc=86.83 [H=10832, D=177, S=1348, I=102, N=12357]
  ------------------------ Overall Results (200 stories) ------------------------
  SENT: %Correct=0.00 [H=0, S=200, N=200]
  WORD: %Corr=87.91, Acc=87.18 [H=22657, D=293, S=2824, I=186, N=25774]
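For clarity, the figures in these HResults-style summaries are related by the standard definitions (H = hits, D = deletions, S = substitutions, I = insertions, and N = H + D + S reference tokens):

  $$\%\text{Corr} = \frac{H}{N}\times 100\%, \qquad \text{Acc} = \frac{H-I}{N}\times 100\%, \qquad \text{Error rate} = \frac{S+D+I}{N}\times 100\% = 100\% - \text{Acc}$$

For example, the 506-story line gives Corr = 57144/65812 = 86.83% and Acc = (57144 − 504)/65812 = 86.06%, i.e. a character error rate of about 13.94%.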

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

  [Figure: two-bottle (urn) example with bottles A and B. Observed data O: the "ball sequence"; latent data S: the "bottle sequence". The parameters to be estimated so as to maximize log P(O|λ) are P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B).]

  [Figure: a three-state HMM (s1, s2, s3) with discrete output distributions {A:0.3, B:0.2, C:0.5}, {A:0.7, B:0.1, C:0.2}, {A:0.3, B:0.6, C:0.1} and transition probabilities including 0.6, 0.7, 0.3, 0.2, 0.1; given training observations o1 o2 …… oT, re-estimation moves from λ to a new model λ̄ with p(O|λ̄) > p(O|λ).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data. In our case here, the state sequence S is the latent data.
    • Direct access to the data necessary to estimate the parameters is impossible or difficult. In our case here, it is almost impossible to estimate (A, B) without consideration of the state sequence.
  – Two Major Steps:
    • E: take the expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations, i.e. $E_{S \mid O, \lambda}[\cdot]$
    • M: provide a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criterion (stated compactly below)
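Stated compactly, and using the notation adopted on the following slides ($\lambda$ the current estimate, $\bar{\lambda}$ the new one, $S$ the latent state sequence), the two steps are:

  $$\text{E-step:}\quad Q(\lambda, \bar{\lambda}) = E_{S \mid O, \lambda}\big[\log P(O, S \mid \bar{\lambda})\big] = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})$$
  $$\text{M-step:}\quad \bar{\lambda} \leftarrow \arg\max_{\bar{\lambda}}\; Q(\lambda, \bar{\lambda}) \quad (\text{or the MAP variant, with a prior on } \bar{\lambda})$$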

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principle based on observations:
  – The Maximum Likelihood (ML) Principle:
    find the model parameter $\Phi$ so that the likelihood $p(\mathbf{x} \mid \Phi)$ is maximum.
    For example, if $\Phi = \{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}$ are the parameters of a multivariate normal distribution and $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ is i.i.d. (independent, identically distributed), then the ML estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are
    $$\boldsymbol{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i, \qquad \boldsymbol{\Sigma}_{ML} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}_{ML})(\mathbf{x}_i-\boldsymbol{\mu}_{ML})^{T}$$
  – The Maximum A Posteriori (MAP) Principle:
    find the model parameter $\Phi$ so that the posterior $p(\Phi \mid \mathbf{x})$ is maximum.
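As a tiny concrete illustration of the ML formulas above, the scalar (univariate) case can be coded directly; this sketch is not part of the original slides and uses the biased 1/n variance estimate, exactly as in the formula.

    /* Sketch: ML estimates of a univariate Gaussian's mean and variance from
       i.i.d. samples, i.e. the scalar case of the formulas above. */
    #include <stdio.h>

    void gaussian_ml(const double *x, int n, double *mu, double *var)
    {
        int i;
        double m = 0.0, v = 0.0;
        for (i = 0; i < n; i++) m += x[i];
        m /= n;                                   /* mu_ML  = (1/n) sum x_i          */
        for (i = 0; i < n; i++) v += (x[i] - m) * (x[i] - m);
        v /= n;                                   /* var_ML = (1/n) sum (x_i - mu)^2 */
        *mu = m;  *var = v;
    }

    int main(void)
    {
        double x[] = {1.0, 2.0, 2.0, 3.0};
        double mu, var;
        gaussian_ml(x, 4, &mu, &var);
        printf("mu_ML = %.2f, var_ML = %.2f\n", mu, var);   /* prints 2.00, 0.50 */
        return 0;
    }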

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters that maximize the log-likelihood of the incomplete data, $\log P(O \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(O, S \mid \lambda)$
• Firstly, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data $O$
    • We want to maximize $P(O \mid \lambda)$; $\lambda$ is a parameter vector
  – The hidden (unobservable) data $S$
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have the current model $\lambda$ and estimate the probability that each $S$ occurred in the generation of $O$
  – Pretend we had in fact observed a complete data pair $(O, S)$, with frequency proportional to the probability $P(S \mid O, \lambda)$, to compute a new maximum likelihood estimate $\bar{\lambda}$
  – Does the process converge?
  – Algorithm:
    • Log-likelihood expression, with the expectation taken over $S$. By Bayes' rule (relating the complete-data and incomplete-data likelihoods),
      $$P(O, S \mid \bar{\lambda}) = P(S \mid O, \bar{\lambda})\, P(O \mid \bar{\lambda}) \;\;\Rightarrow\;\; \log P(O \mid \bar{\lambda}) = \log P(O, S \mid \bar{\lambda}) - \log P(S \mid O, \bar{\lambda})$$
      Taking the expectation over $S$ under the current model $\lambda$, i.e. weighting by $P(S \mid O, \lambda)$:
      $$\log P(O \mid \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda}) - \sum_{S} P(S \mid O, \lambda)\, \log P(S \mid O, \bar{\lambda})$$
      (the left-hand side is unchanged because $\log P(O \mid \bar{\lambda})$ does not depend on $S$ and $\sum_S P(S \mid O, \lambda) = 1$; $\bar{\lambda}$ is the unknown model setting to be found)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.):
    • We can thus express $\log P(O \mid \bar{\lambda})$ as follows:
      $$\log P(O \mid \bar{\lambda}) = Q(\lambda, \bar{\lambda}) + H(\lambda, \bar{\lambda})$$
      where
      $$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda}), \qquad H(\lambda, \bar{\lambda}) = -\sum_{S} P(S \mid O, \lambda)\, \log P(S \mid O, \bar{\lambda})$$
    • We want $\log P(O \mid \bar{\lambda}) \ge \log P(O \mid \lambda)$, i.e.
      $$Q(\lambda, \bar{\lambda}) + H(\lambda, \bar{\lambda}) \;\ge\; Q(\lambda, \lambda) + H(\lambda, \lambda)$$

SP - Berlin Chen 88

The EM Algorithm (7/7)

• $H(\lambda, \bar{\lambda})$ has the following property:
  $$H(\lambda, \bar{\lambda}) - H(\lambda, \lambda) = -\sum_{S} P(S \mid O, \lambda)\, \log \frac{P(S \mid O, \bar{\lambda})}{P(S \mid O, \lambda)} \;\ge\; -\sum_{S} P(S \mid O, \lambda)\left(\frac{P(S \mid O, \bar{\lambda})}{P(S \mid O, \lambda)} - 1\right) = 0$$
  using $\log x \le x - 1$ (Jensen's inequality); the left-hand side is the Kullback–Leibler (KL) distance between $P(S \mid O, \lambda)$ and $P(S \mid O, \bar{\lambda})$, which is always non-negative.
  – Therefore, since $H(\lambda, \bar{\lambda}) \ge H(\lambda, \lambda)$, for maximizing $\log P(O \mid \bar{\lambda})$ we only need to maximize the Q-function (auxiliary function)
  $$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda})$$
  i.e., the expectation of the complete-data log-likelihood with respect to the latent state sequences.

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (1/5)

• Apply the EM algorithm to iteratively refine the HMM parameter vector $\lambda = (\mathbf{A}, \mathbf{B}, \boldsymbol{\pi})$
  – By maximizing the auxiliary function
    $$Q(\lambda, \bar{\lambda}) = \sum_{S} P(S \mid O, \lambda)\, \log P(O, S \mid \bar{\lambda}) = \sum_{S} \frac{P(O, S \mid \lambda)}{P(O \mid \lambda)}\, \log P(O, S \mid \bar{\lambda})$$
  – where $P(O, S \mid \lambda)$ and $\log P(O, S \mid \bar{\lambda})$ can be expressed as
    $$P(O, S \mid \lambda) = \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(o_t)$$
    $$\log P(O, S \mid \bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{s_t}(o_t)$$

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (2/5)

• Rewrite the auxiliary function as $Q(\lambda, \bar{\lambda}) = Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{a}(\lambda, \bar{\mathbf{A}}) + Q_{b}(\lambda, \bar{\mathbf{B}})$, where
  $$Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) = \sum_{i=1}^{N} \frac{P(O, s_1 = i \mid \lambda)}{P(O \mid \lambda)}\, \log \bar{\pi}_i$$
  $$Q_{a}(\lambda, \bar{\mathbf{A}}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \frac{P(O, s_t = i, s_{t+1} = j \mid \lambda)}{P(O \mid \lambda)}\, \log \bar{a}_{ij}$$
  $$Q_{b}(\lambda, \bar{\mathbf{B}}) = \sum_{j=1}^{N} \sum_{\text{all } v_k} \;\sum_{\substack{t=1 \\ o_t = v_k}}^{T} \frac{P(O, s_t = j \mid \lambda)}{P(O \mid \lambda)}\, \log \bar{b}_j(k)$$
  – Each term is a weighted sum of logarithms, i.e. of the form $\sum_j w_j \log y_j$ (the $w_i$, $y_i$ of the next slide).

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (3/5)

• The auxiliary function contains three independent terms, $\bar{\pi}_i$, $\bar{a}_{ij}$ and $\bar{b}_j(k)$
  – They can be maximized individually
  – All are of the same form:
    $$F(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j, \qquad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0,$$
    which has its maximum value when
    $$y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (4/5)

• Proof: apply a Lagrange multiplier $\ell$ with the constraint $\sum_{j=1}^{N} y_j = 1$:
  $$F = \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right)$$
  $$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \;\;\Rightarrow\;\; w_j = -\ell\, y_j \quad \forall j$$
  Summing over $j$: $\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell$, and therefore
  $$y_j = \frac{w_j}{\sum_{j'=1}^{N} w_{j'}}$$

  Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\lambda} = (\bar{\mathbf{A}}, \bar{\mathbf{B}}, \bar{\boldsymbol{\pi}})$ can be expressed as
  $$\bar{\pi}_i = \frac{P(O, s_1 = i \mid \lambda)}{P(O \mid \lambda)}$$
  $$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = i, s_{t+1} = j \mid \lambda)}{\sum_{t=1}^{T-1} P(O, s_t = i \mid \lambda)}$$
  $$\bar{b}_i(k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} P(O, s_t = i \mid \lambda)}{\sum_{t=1}^{T} P(O, s_t = i \mid \lambda)}$$
  (The numerators and denominators are exactly the state- and transition-occupancy statistics delivered by the forward–backward algorithm; see the sketch below.)
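In practice these ratios are computed from the occupancy posteriors, since $P(O, s_t = i \mid \lambda)/P(O \mid \lambda) = \gamma_t(i)$ and $P(O, s_t = i, s_{t+1} = j \mid \lambda)/P(O \mid \lambda) = \xi_t(i,j)$, so the common factor $P(O \mid \lambda)$ cancels. The following is only a minimal sketch of the resulting update step, assuming those posteriors have already been obtained (e.g., by the forward–backward procedure); the fixed sizes N, K, T and the array names are illustrative, not part of the slides.

    /* Sketch of the discrete-HMM re-estimation step, given
       gamma[t][i] = P(s_t = i | O, lambda) and
       xi[t][i][j] = P(s_t = i, s_{t+1} = j | O, lambda).
       N states, K discrete observation symbols, T frames; o[t] in {0..K-1}. */
    #define N  3
    #define K  8
    #define T  100

    void reestimate(const double gamma[T][N], const double xi[T-1][N][N],
                    const int o[T],
                    double pi[N], double a[N][N], double b[N][K])
    {
        int i, j, k, t;
        for (i = 0; i < N; i++) {
            double occ = 0.0, occ1 = 0.0;             /* state occupancies       */
            pi[i] = gamma[0][i];                      /* new initial probability */
            for (t = 0; t < T - 1; t++) occ1 += gamma[t][i];
            for (t = 0; t < T; t++)     occ  += gamma[t][i];
            for (j = 0; j < N; j++) {                 /* new transition probs    */
                double num = 0.0;
                for (t = 0; t < T - 1; t++) num += xi[t][i][j];
                a[i][j] = num / occ1;
            }
            for (k = 0; k < K; k++) {                 /* new output probs        */
                double num = 0.0;
                for (t = 0; t < T; t++)
                    if (o[t] == k) num += gamma[t][i];
                b[i][k] = num / occ;
            }
        }
    }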

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observations do not come from a finite set, but from a continuous space
  – The difference between the discrete and the continuous HMM lies in a different form of the state output probability
  – The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
  – The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):
    $$b_j(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, b_{jk}(\mathbf{o}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{o}; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{L/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu}_{jk})^{T} \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{o}-\boldsymbol{\mu}_{jk})\right), \qquad \sum_{k=1}^{M} c_{jk} = 1$$
  [Figure: distribution for state i as a weighted sum of Gaussians N1, N2, N3 with mixture weights wi1, wi2, wi3; see the evaluation sketch below.]
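A direct implementation of $b_j(\mathbf{o})$ is often written with diagonal covariance matrices, a common simplification of the full-covariance formula above; the sketch below makes that assumption, and the dimensions M_COMP and L_DIM are illustrative choices, not taken from the slides.

    /* Sketch: state-output probability b_j(o) of a Gaussian-mixture state with
       M_COMP diagonal-covariance components of dimension L_DIM (the
       full-covariance case differs only in the quadratic form and determinant). */
    #include <math.h>
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define M_COMP 4
    #define L_DIM  13

    double mixture_output_prob(const double o[L_DIM],
                               const double c[M_COMP],            /* weights   */
                               const double mu[M_COMP][L_DIM],    /* means     */
                               const double var[M_COMP][L_DIM])   /* variances */
    {
        double b = 0.0;
        int k, d;
        for (k = 0; k < M_COMP; k++) {
            double logg = -0.5 * L_DIM * log(2.0 * M_PI);          /* log N(o; mu, Sigma) */
            for (d = 0; d < L_DIM; d++) {
                double diff = o[d] - mu[k][d];
                logg += -0.5 * log(var[k][d]) - 0.5 * diff * diff / var[k][d];
            }
            b += c[k] * exp(logg);    /* b_j(o) = sum_k c_jk N(o; mu_jk, Sigma_jk) */
        }
        return b;
    }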

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_j(\mathbf{o})$ with respect to each single mixture component $b_{jk}(\mathbf{o})$:
  $$p(O \mid \lambda) = \sum_{S} \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} b_{s_t}(\mathbf{o}_t) = \sum_{S} \sum_{K} p(O, S, K \mid \lambda)$$
  where
  $$p(O, S, K \mid \lambda) = \pi_{s_1} \prod_{t=1}^{T-1} a_{s_t s_{t+1}} \prod_{t=1}^{T} c_{s_t k_t}\, b_{s_t k_t}(\mathbf{o}_t)$$
  and $K = k_1 k_2 \cdots k_T$ is one of the possible mixture component sequences along the state sequence $S$.

  Note: the interchange of product and sum uses (for generic terms $x_{tk}$)
  $$\prod_{t=1}^{T} \sum_{k=1}^{M} x_{tk} = (x_{11} + \cdots + x_{1M})(x_{21} + \cdots + x_{2M}) \cdots (x_{T1} + \cdots + x_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} x_{t k_t}$$

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as
  $$Q(\lambda, \bar{\lambda}) = \sum_{S} \sum_{K} P(S, K \mid O, \lambda)\, \log p(O, S, K \mid \bar{\lambda}) = \sum_{S} \sum_{K} \frac{p(O, S, K \mid \lambda)}{p(O \mid \lambda)}\, \log p(O, S, K \mid \bar{\lambda})$$
  with
  $$\log p(O, S, K \mid \bar{\lambda}) = \log \bar{\pi}_{s_1} + \sum_{t=1}^{T-1} \log \bar{a}_{s_t s_{t+1}} + \sum_{t=1}^{T} \log \bar{b}_{s_t k_t}(\mathbf{o}_t) + \sum_{t=1}^{T} \log \bar{c}_{s_t k_t}$$
  so that $Q = Q_{\pi}(\lambda, \bar{\boldsymbol{\pi}}) + Q_{a}(\lambda, \bar{\mathbf{a}}) + Q_{b}(\lambda, \bar{\mathbf{b}}) + Q_{c}(\lambda, \bar{\mathbf{c}})$: the terms correspond to the initial probabilities, the state transition probabilities, the Gaussian density functions, and the mixture components (weights), respectively.

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the terms for the Gaussian densities and the mixture weights:
  $$Q_{b}(\lambda, \bar{\mathbf{b}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)\, \log \bar{b}_{jk}(\mathbf{o}_t)$$
  $$Q_{c}(\lambda, \bar{\mathbf{c}}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \mid O, \lambda)\, \log \bar{c}_{jk}$$

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let $\gamma_t(j,k) = P(s_t = j, k_t = k \mid O, \lambda)$ and
  $$\bar{b}_{jk}(\mathbf{o}_t) = \frac{1}{(2\pi)^{L/2} |\bar{\boldsymbol{\Sigma}}_{jk}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})\right)$$
  so that
  $$\log \bar{b}_{jk}(\mathbf{o}_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\boldsymbol{\Sigma}}_{jk}| - \tfrac{1}{2}(\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})^{T} \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t-\bar{\boldsymbol{\mu}}_{jk})$$
• Setting the derivative of $Q_b$ with respect to $\bar{\boldsymbol{\mu}}_{jk}$ to zero,
  $$\frac{\partial Q_b(\lambda, \bar{\mathbf{b}})}{\partial \bar{\boldsymbol{\mu}}_{jk}} = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\boldsymbol{\Sigma}}_{jk}^{-1} (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk}) = 0 \;\;\Rightarrow\;\; \bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$$
  using $\dfrac{d\,(\mathbf{x}^{T}\mathbf{C}\mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^{T})\mathbf{x} = 2\mathbf{C}\mathbf{x}$ for symmetric $\mathbf{C}$ ($\bar{\boldsymbol{\Sigma}}_{jk}$, and hence $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$, is symmetric here).

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, setting the derivative of $Q_b$ with respect to $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$ to zero,
  $$\frac{\partial Q_b(\lambda, \bar{\mathbf{b}})}{\partial \bar{\boldsymbol{\Sigma}}_{jk}^{-1}} = \sum_{t=1}^{T} \gamma_t(j,k)\left[\tfrac{1}{2}\bar{\boldsymbol{\Sigma}}_{jk} - \tfrac{1}{2}(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}\right] = 0$$
  $$\Rightarrow\;\; \bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}$$
  using $\dfrac{d\,\log(\det \mathbf{X})}{d\mathbf{X}} = (\mathbf{X}^{-1})^{T} = \mathbf{X}^{-1}$ for symmetric $\mathbf{X}$ (and $\det \mathbf{X}^{-1} = 1/\det \mathbf{X}$), together with $\dfrac{d\,(\mathbf{a}^{T}\mathbf{X}\mathbf{b})}{d\mathbf{X}} = \mathbf{a}\mathbf{b}^{T}$; $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here.

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as
  $$\bar{\boldsymbol{\mu}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)}$$
  $$\bar{\boldsymbol{\Sigma}}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)\, (\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_{jk})^{T}}{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)}$$
  $$\bar{c}_{jk} = \frac{\sum_{t=1}^{T} p(s_t = j, k_t = k \mid O, \lambda)}{\sum_{t=1}^{T} \sum_{k'=1}^{M} p(s_t = j, k_t = k' \mid O, \lambda)}$$
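As with the discrete case, these updates are driven by the mixture-occupancy posteriors. The sketch below performs the update for a single state with diagonal covariances (a simplification of the full-covariance formulas above); the posteriors gmix[t][k] = P(s_t = j, k_t = k | O, λ) are assumed to be already available, and all sizes and names are illustrative.

    /* Sketch of the continuous-HMM mixture update for one state j, assuming
       gmix[t][k] = P(s_t = j, k_t = k | O, lambda) is given and covariances
       are diagonal. T2 frames, M2 mixture components, L2-dimensional vectors. */
    #define T2 100
    #define M2 4
    #define L2 13

    void update_state_gmm(const double gmix[T2][M2], const double o[T2][L2],
                          double c[M2], double mu[M2][L2], double var[M2][L2])
    {
        int t, k, d;
        double state_occ = 0.0;
        for (t = 0; t < T2; t++)
            for (k = 0; k < M2; k++) state_occ += gmix[t][k];
        for (k = 0; k < M2; k++) {
            double occ = 0.0;
            for (t = 0; t < T2; t++) occ += gmix[t][k];
            c[k] = occ / state_occ;                        /* new mixture weight  */
            for (d = 0; d < L2; d++) {                     /* new mean            */
                double num = 0.0;
                for (t = 0; t < T2; t++) num += gmix[t][k] * o[t][d];
                mu[k][d] = num / occ;
            }
            for (d = 0; d < L2; d++) {                     /* new (diag) variance */
                double num = 0.0;
                for (t = 0; t < T2; t++) {
                    double diff = o[t][d] - mu[k][d];
                    num += gmix[t][k] * diff * diff;
                }
                var[k][d] = num / occ;
            }
        }
    }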

Page 76: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 76

Measures of ASR Performance (68)bull Programfor (i=1ilt=ni++) test gridi = grid[i] gridi1 = grid[i-1]

for (j=1jlt=mj++) reference h = gridi1[j]score +insPen

d = gridi1[j-1]scoreif (lRef[j] = lTest[i])

d += subPenv = gridi[j-1]score + delPenif (dlt=h ampamp dlt=v) DIAG = hit or sub

gridi[j] = gridi1[j-1]gridi[j]score = dgridi[j]dir = DIAGif (lRef[j] == lTest[i]) ++gridi[j]hitelse ++gridi[j]sub

else if (hltv) HOR = ins gridi[j] = gridi1[j] gridi[j]score = hgridi[j]dir = HOR++ gridi[j]ins

else VERT = del

gridi[j] = gridi[j-1]gridi[j]score = vgridi[j]dir = VERT++gridi[j]del

for j for i

B A B C

C

C

B

C

A

00

(InsDelSubHit)

(0000) (1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1102)

(1202)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1103)

(1203)Delete C

Hit C

Hit B

Del C

Hit A

Ins B

A C B C CB A B CTest

Correct

Del CHit CHit BDel CHit AIns B

HTK

bull Example 1Correct

Test

Still have anOther optimalalignment

Alignment 1 WER= 60

structure assignment

structure assignment

structure assignment

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 77: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 77

Measures of ASR Performance (78)

B A A C

C

C

B

C

A

00

(InsDelSubHit)

(0000)(1000) (2000) (3000) (4000)

(0100)

(0200)

(0300)

(0400)

(0500)

i

j

(0010)

(0110)

(0201)

(0301)

(0401)

(1001)

(1101)or(0020)

(1201)or (0120)

(0211)

(0311)

(2001)

(1011)

(1111)

(1211)

(0311) (0221)or (1302)

(3001)

(2002)

(2102)or (1021)

(1112)

(1212)Delete C

Hit C

Sub B

Del C

Hit A

Ins B

A C B C CB A A CTest

Correct

Del CHit CSub BDel CHit AIns B

bull Example 2Correct

Test

A C B C CB A A CTest

Correct

Del CHit CDel BSub CHit AIns BB A A CTestCorrect

Del CHit CSub BSub CSub A

A C B C C

Alignment 1 WER= 80

Alignment 2WER=80

Alignment 3WER=80

Note the penalties for substitution deletionand insertion errors are all set to be 1 here

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 78: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 78

Measures of ASR Performance (88)

bull Two common settings of different penalties for substitution deletion and insertion errors

HTK error penalties subPen = 10delPen = 7insPen = 7

NIST error penaltiessubPenNIST = 4delPenNIST = 3insPenNIST = 3

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 79: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 79

Homework 3bull Measures of ASR Performance

100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 大100000 100000 興100000 100000 村100000 100000 死100000 100000 傷100000 100000 慘100000 100000 重100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

Reference100000 100000 桃100000 100000 芝100000 100000 颱100000 100000 風100000 100000 重100000 100000 創100000 100000 花100000 100000 蓮100000 100000 光100000 100000 復100000 100000 鄉100000 100000 打100000 100000 新100000 100000 村100000 100000 次100000 100000 傷100000 100000 殘100000 100000 周100000 100000 感100000 100000 觸100000 100000 最100000 100000 多helliphellip

ASR Output

SP - Berlin Chen 80

Homework 3

bull 506 BN stories of ASR outputsndash Report the CER (character error rate) of the first one 100 200

and 506 storiesndash The result should show the number of substitution deletion and

insertion errors

------------------------ Overall Results ------------------------------------------------------------------------

SENT Correct=000 [H=0 S=506 N=506]WORD Corr=8683 Acc=8606 [H=57144 D=829 S=7839 I=504 N=65812]===================================================================

------------------------ Overall Results ----------------------------------------------------------------------

SENT Correct=000 [H=0 S=1 N=1]WORD Corr=8152 Acc=8152 [H=75 D=4 S=13 I=0 N=92]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=100 N=100]WORD Corr=8766 Acc=8683 [H=10832 D=177 S=1348 I=102 N=12357]===================================================================------------------------ Overall Results -----------------------------------------------------------------------

SENT Correct=000 [H=0 S=200 N=200]WORD Corr=8791 Acc=8718 [H=22657 D=293 S=2824 I=186 N=25774]===================================================================

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference compared with discrete HMM training lies in the terms for the Gaussian densities and the mixture weights:

  Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \,|\, O, \lambda) \log \bar{b}_{jk}(o_t)

  Q_{c}(\lambda,\bar{c}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} P(s_t = j, k_t = k \,|\, O, \lambda) \log \bar{c}_{jk}

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

• Let \gamma_t(j,k) = P(s_t = j, k_t = k \,|\, O, \lambda), so that

  Q_{b}(\lambda,\bar{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t=1}^{T} \gamma_t(j,k) \log \bar{b}_{jk}(o_t)

  \bar{b}_{jk}(o_t) = N(o_t; \bar{\mu}_{jk}, \bar{\Sigma}_{jk}) = \frac{1}{(2\pi)^{L/2} |\bar{\Sigma}_{jk}|^{1/2}} \exp\!\left( -\tfrac{1}{2} (o_t - \bar{\mu}_{jk})^{T} \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk}) \right)

  \log \bar{b}_{jk}(o_t) = -\tfrac{L}{2}\log(2\pi) - \tfrac{1}{2}\log|\bar{\Sigma}_{jk}| - \tfrac{1}{2} (o_t - \bar{\mu}_{jk})^{T} \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk})

• Setting the derivative with respect to \bar{\mu}_{jk} to zero,

  \frac{\partial Q_b}{\partial \bar{\mu}_{jk}} = \sum_{t=1}^{T} \gamma_t(j,k)\, \bar{\Sigma}_{jk}^{-1} (o_t - \bar{\mu}_{jk}) = 0
  \;\Rightarrow\;
  \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

  (using \frac{d\, x^{T} C x}{dx} = (C + C^{T})\, x; \bar{\Sigma}_{jk} is symmetric here)

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

• Similarly, differentiating Q_b with respect to \bar{\Sigma}_{jk}^{-1} and setting the result to zero:

  \frac{\partial Q_b}{\partial \bar{\Sigma}_{jk}^{-1}} = \sum_{t=1}^{T} \gamma_t(j,k) \left[ \tfrac{1}{2} \bar{\Sigma}_{jk} - \tfrac{1}{2} (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T} \right] = 0

  \;\Rightarrow\;
  \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}

  (using \frac{d\, a^{T} X b}{dX} = a b^{T} and \frac{d \log \det(X)}{dX} = (X^{-1})^{T}; \bar{\Sigma}_{jk} is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

  \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

  \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \bar{\mu}_{jk})(o_t - \bar{\mu}_{jk})^{T}}{\sum_{t=1}^{T} \gamma_t(j,k)}

  \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}

  where \gamma_t(j,k) = P(s_t = j, k_t = k \,|\, O, \lambda) (a sketch of these updates follows below)
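A compact sketch of these three updates (my own illustration, not from the slides): given the per-frame component posteriors gamma[t, j, k] and the observation vectors, the new means, covariances, and mixture weights are posterior-weighted averages.

```python
import numpy as np

def reestimate_mixtures(gamma, O):
    """gamma: (T, N, M) posteriors P(s_t=j, k_t=k | O, lambda); O: (T, L) observations.
    Returns means (N, M, L), covariances (N, M, L, L), and mixture weights (N, M)."""
    T, N, M = gamma.shape
    L = O.shape[1]
    occ = gamma.sum(axis=0)                              # (N, M) soft occupancy counts
    mu = np.einsum('tjk,tl->jkl', gamma, O) / occ[..., None]
    Sigma = np.zeros((N, M, L, L))
    for j in range(N):
        for k in range(M):
            d = O - mu[j, k]                             # centered observations, (T, L)
            Sigma[j, k] = (gamma[:, j, k, None, None] * d[:, :, None] * d[:, None, :]).sum(axis=0) / occ[j, k]
    c = occ / occ.sum(axis=1, keepdims=True)             # mixture weights per state
    return mu, Sigma, c
```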


SP - Berlin Chen 80

Homework 3

• 506 BN stories of ASR outputs
  – Report the CER (character error rate) of the first 1, the first 100, the first 200, and all 506 stories
  – The result should show the number of substitution, deletion, and insertion errors (a CER-computation sketch follows the sample output below)

------------------------ Overall Results ------------------------
SENT: %Correct = 0.00 [H=0, S=506, N=506]
WORD: %Corr = 86.83, Acc = 86.06 [H=57144, D=829, S=7839, I=504, N=65812]
------------------------ Overall Results ------------------------
SENT: %Correct = 0.00 [H=0, S=1, N=1]
WORD: %Corr = 81.52, Acc = 81.52 [H=75, D=4, S=13, I=0, N=92]
------------------------ Overall Results ------------------------
SENT: %Correct = 0.00 [H=0, S=100, N=100]
WORD: %Corr = 87.66, Acc = 86.83 [H=10832, D=177, S=1348, I=102, N=12357]
------------------------ Overall Results ------------------------
SENT: %Correct = 0.00 [H=0, S=200, N=200]
WORD: %Corr = 87.91, Acc = 87.18 [H=22657, D=293, S=2824, I=186, N=25774]
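For the homework itself, CER is just a character-level edit distance; the sketch below (my own illustration, not part of the course materials) counts substitutions, deletions, and insertions with dynamic programming:

```python
def cer(ref, hyp):
    """Character error rate with substitution/deletion/insertion counts."""
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total_errors, subs, dels, ins) aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)                      # all reference chars deleted
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)                      # all hypothesis chars inserted
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # match: no added error
                continue
            c, s, d, ins = dp[i - 1][j - 1]
            best = (c + 1, s + 1, d, ins)            # substitution
            c, s, d, ins = dp[i - 1][j]
            best = min(best, (c + 1, s, d + 1, ins)) # deletion
            c, s, d, ins = dp[i][j - 1]
            best = min(best, (c + 1, s, d, ins + 1)) # insertion
            dp[i][j] = best
    errors, S, D, I = dp[R][H]
    return {"S": S, "D": D, "I": I, "CER": errors / max(R, 1)}

print(cer("abcde", "abXdee"))   # toy example: 1 substitution, 1 insertion, CER = 0.4
```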

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (1/7)

• Observed data O: the "ball sequence"; latent data S: the "bottle sequence" (two bottles A and B)
• Parameters λ to be estimated so as to maximize log P(O|λ): P(A), P(B), P(B|A), P(A|B), P(R|A), P(G|A), P(R|B), P(G|B)

[Figure: a three-state HMM (s1, s2, s3) with transition probabilities such as 0.7, 0.3, 0.2, 0.1, 0.6 and per-state symbol distributions such as {A: .3, B: .2, C: .5}, {A: .7, B: .1, C: .2}, {A: .3, B: .6, C: .1}; given o1 o2 …… oT, re-estimation seeks \bar{\lambda} with p(O|\bar{\lambda}) > p(O|\lambda).]

SP - Berlin Chen 83

The EM Algorithm (2/7)

• Introduction of EM (Expectation Maximization)
  – Why EM?
    • Simple optimization algorithms for the likelihood function rely on intermediate variables, called latent data; in our case here, the state sequence is the latent data
    • Direct access to the data necessary to estimate the parameters is impossible or difficult; in our case here, it is almost impossible to estimate A, B without consideration of the state sequence
  – Two Major Steps:
    • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations, E[S | O, λ]
    • M: provides a new estimate of the parameters according to the Maximum Likelihood (ML) or Maximum A Posteriori (MAP) criteria

SP - Berlin Chen 84

The EM Algorithm (3/7)

• Estimation principles based on observations X = x_1, x_2, \ldots, x_n:

  – The Maximum Likelihood (ML) Principle: find the model parameter Φ so that the likelihood p(X|Φ) is maximum; for example, if Φ = {μ, Σ} are the parameters of a multivariate normal distribution and X is i.i.d. (independent, identically distributed), then the ML estimates of μ and Σ are

    \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
    \Sigma_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})(x_i - \mu_{ML})^{T}

    (a one-line numpy version of these estimates follows below)

  – The Maximum A Posteriori (MAP) Principle: find the model parameter Φ so that the posterior p(Φ|X) is maximum
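As a minimal illustration of the ML principle (my own sketch, assuming i.i.d. vector data X drawn for the example), the closed-form estimates above are one line each in numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=5000)

mu_ml = X.mean(axis=0)                     # (1/n) * sum_i x_i
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)          # (1/n) * sum_i (x_i - mu)(x_i - mu)^T
print(mu_ml, Sigma_ml, sep="\n")
```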

SP - Berlin Chen 85

The EM Algorithm (4/7)

• The EM Algorithm is important to HMMs and other learning techniques
  – Discover new model parameters that maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,S|λ)
• First, use scalar (discrete) random variables to introduce the EM algorithm
  – The observable training data O
    • We want to maximize P(O|λ); λ is a parameter vector
  – The hidden (unobservable) data S
    • E.g., the component probabilities (or densities) of the observable data, or the underlying state sequence in HMMs

SP - Berlin Chen 86

The EM Algorithm (5/7)

  – Assume we have λ and estimate the probability that each S occurred in the generation of O
  – Pretend we had in fact observed a complete data pair (O, S), with frequency proportional to the probability P(O,S|λ), and from it compute a new λ̄, the maximum likelihood estimate of λ
  – Does the process converge?
  – Algorithm
    • Log-likelihood expression, with the expectation taken over S:

      P(O,S|\lambda) = P(S|O,\lambda)\, P(O|\lambda)   (Bayes' rule)

      \log P(O|\lambda) = \log P(O,S|\lambda) - \log P(S|O,\lambda)

      (incomplete-data likelihood on the left, complete-data likelihood on the right; λ is the unknown model setting)

      Taking the expectation over S under P(S|O,\lambda):

      \log P(O|\lambda) = \sum_{S} P(S|O,\lambda) \log P(O,S|\lambda) - \sum_{S} P(S|O,\lambda) \log P(S|O,\lambda)

SP - Berlin Chen 87

The EM Algorithm (6/7)

  – Algorithm (cont.):
    • We can thus express \log P(O|\bar{\lambda}) as follows:

      \log P(O|\bar{\lambda}) = Q(\lambda,\bar{\lambda}) + H(\lambda,\bar{\lambda}), \quad \text{where}

      Q(\lambda,\bar{\lambda}) = \sum_{S} P(S|O,\lambda) \log P(O,S|\bar{\lambda}), \qquad
      H(\lambda,\bar{\lambda}) = -\sum_{S} P(S|O,\lambda) \log P(S|O,\bar{\lambda})

    • We want \log P(O|\bar{\lambda}) \ge \log P(O|\lambda):

      \log P(O|\bar{\lambda}) - \log P(O|\lambda) = \left[ Q(\lambda,\bar{\lambda}) - Q(\lambda,\lambda) \right] + \left[ H(\lambda,\bar{\lambda}) - H(\lambda,\lambda) \right]

SP - Berlin Chen 88

The EM Algorithm (7/7)

• H(\lambda,\bar{\lambda}) has the following property (Jensen's inequality / Kullback-Leibler (KL) distance):

  H(\lambda,\bar{\lambda}) - H(\lambda,\lambda)
  = -\sum_{S} P(S|O,\lambda) \log \frac{P(S|O,\bar{\lambda})}{P(S|O,\lambda)}
  \ge -\sum_{S} P(S|O,\lambda) \left( \frac{P(S|O,\bar{\lambda})}{P(S|O,\lambda)} - 1 \right) = 0

  (using \log x \le x - 1)

  – Therefore, for maximizing \log P(O|\bar{\lambda}), we only need to maximize the Q-function (auxiliary function)

    Q(\lambda,\bar{\lambda}) = \sum_{S} P(S|O,\lambda) \log P(O,S|\bar{\lambda})

  – the expectation of the complete-data log-likelihood with respect to the latent state sequences (a small numerical demonstration follows below)
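To make the guarantee concrete, here is a small self-contained demonstration (my own addition, not from the slides) that each EM iteration does not decrease the incomplete-data log-likelihood, using a two-component 1-D Gaussian mixture as the latent-variable model:

```python
import numpy as np

rng = np.random.default_rng(1)
# observations drawn from a true two-component mixture
O = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# initial parameter guess lambda = (weights, means, variances)
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def loglik(w, mu, var):
    comp = w * np.exp(-0.5 * (O[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum(axis=1)).sum()         # log P(O | lambda)

prev = -np.inf
for it in range(20):
    # E-step: posterior P(S = k | o_t, lambda) for each observation
    comp = w * np.exp(-0.5 * (O[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from expected complete-data statistics
    Nk = gamma.sum(axis=0)
    w, mu = Nk / len(O), (gamma * O[:, None]).sum(axis=0) / Nk
    var = (gamma * (O[:, None] - mu) ** 2).sum(axis=0) / Nk
    cur = loglik(w, mu, var)
    assert cur >= prev - 1e-9                     # monotone non-decreasing likelihood
    prev = cur
print("final log-likelihood:", prev)
```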

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 81: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 81

Symbols for Mathematical Operations

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 82: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 82

The EM Algorithm (17)

A B

Observed data O ldquoball sequencerdquoLatent data S ldquobottle sequencerdquo

Parameters to be estimated to maximize logP(O|λ)=P(A)P(B)P(B|A)P(A|B)P(R|A)P(G|A)P(R|B)P(G|B)

o1o2helliphellipoT p(O|λ)

λ

s2

s1

s3

A3B2C5

A7B1C2 A3B6C1

07

0303

0202

0103

07

p(O|λ)gt p(O|λ)

06

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 83: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 83

The EM Algorithm (27)

bull Introduction of EM (Expectation Maximization)ndash Why EM

bull Simple optimization algorithms for likelihood function relies on the intermediate variables called latent dataIn our case here the state sequence is the latent data

bull Direct access to the data necessary to estimate the parameters is impossible or difficultIn our case here it is almost impossible to estimate AB without consideration of the state sequence

ndash Two Major Steps bull E expectation with respect to the latent data using the current

estimate of the parameters and conditioned on the observations

bull M provides a new estimation of the parameters according to Maximum likelihood (ML) or Maximum A Posterior (MAP) Criteria

OλS E

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 84: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 84

The EM Algorithm (37)

bull Estimation principle based on observations

ndash The Maximum Likelihood (ML) Principlefind the model parameter so that the likelihood is maximumfor example if is the parameters of a multivariate normal distribution and X is iid (independent identically distributed) then the ML estimate of is

ndash The Maximum A Posteriori (MAP) Principlefind the model parameter so that the likelihood is maximum

n

i

tMLiMLiML

n

iiML nn 11

1 1 μxμxΣxμ

n21 XXXX nxxxx 21

ΦxpΦ

Φ xΦp

ΣμΦ

ΣμΦ

ML and MAP

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 85: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 85

The EM Algorithm (47)

bull The EM Algorithm is important to HMMs and other learning techniquesndash Discover new model parameters to maximize the log-likelihood

of incomplete data by iteratively maximizing the expectation of log-likelihood from complete data

bull Firstly using scalar (discrete) random variables to introduce the EM algorithmndash The observable training data

bull We want to maximize is a parameter vectorndash The hidden (unobservable) data

bull Eg the component probabilities (or densities) of observable data or the underlying state sequence in HMMs

O

Sλ λOP

λSO log P λOPlog

O

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (55)

bull The new model parameter set can be expressed as

BAπλ =

T

tt

T

vot

t

T

tt

T

vot

t

i

T

tt

T

tt

T

tt

T

ttt

ij

i

i

i

isP

isP

kb

i

ji

isP

jsisPa

iP

isP

ktkt

1

st1

1

st1

1

1

1

11

1

1

11

11

O

O

O

O

OO

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (17)

bull Continuous HMM the state observation does not come from a finite set but from a continuous spacendash The difference between the discrete and continuous HMM lies

in a different form of state output probabilityndash Discrete HMM requires the quantization procedure to map

observation vectors from the continuous space to the discrete space

bull Continuous Mixture HMMndash The state observation distribution of HMM is modeled by

multivariate Gaussian mixture density functions (M mixtures)

Distribution for State i

M

kjkjk

tjk

jk

Ljk

M

kjkjkjk

M

kjkjkj

cNc

bcb

1

121

1

1

21exp

2

1 μoΣμoΣ

Σμo

oo

M

1k jk 1c

wi1

wi2

wi3

N1

N2N3

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (27)

bull Express with respect to each single mixture component

ojb ojkb

M

k

T

ttksks

M

k

M

k

T

tsss

T

tts

T

tsss

T

tttttt

ttt

bca

bap

1 11 1

1

1

1

1

1

1 2

11

11

o

oλSO

sequence state with thealong sequencecomponent mixture possible theof one

1

1

111

SK

oλKSO

T

ttksks

T

tsss tttttt

bcap

S K

λKSOλO pp

M

k

M

k

M

k

T

t tk

TMTTMM

T

t

M

k tk

T t

t t

a

aaaaaaaaa

a

1 1 1 1

212222111211

1 1

1 2

Note

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (37)

bull Therefore an auxiliary function for the EM algorithm can be written as

S K

S K

λKSOλO

λKSO

λKSOλOKSλλ

log

log

pp

p

pPQ

T

t

T

tkstks

T

tsss tttttt

cbap1 1

1

1logloglogloglog

11oλKSO

cQQQQQ c λbλaλλλλ baπ mixture

componentsGaussiandensity

functions

state transitionprobabilities

initial probabilities

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (47)

bull The only difference we have when compared with Discrete HMM training

T

ttjk

N

j

M

kttc

T

ttjk

N

j

M

ktt

ckkjsPQ

bkkjsPQ

1 1 1

1 1 1

log

log

oλΟcλ

oλΟbλb

kjt

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (57)

T

tt

T

ttt

jk

T

tjktjkt

jk

T

t

N

j

M

ktjkt

jk

jktjkjk

tjk

jktjkt

jktjktjk

jktjkt

jkt

jkLjkjkttjk

M

kttt

jk

jk

jk

bjkQ

b

Lb

Nb

kkjsPjk

1

1

1

1

1 1 1

1

11

1

21

2

1

0

log

log21log2

12log2log

21exp

2

1

Let

μoΣ

μ

o

μbλ

μoΣμ

o

μoΣμoΣo

μoΣμoΣ

Σμoo

λΟ

b

here symmetric is and

)(

1

jkΣd

d xCCxCxx TT

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (67)

T

tt

T

t

tjktjktt

jk

jkjkt

jktjktjkjk

T

t

T

ttjkjkjkt

jkt

jktjktjk

T

t

T

ttjkt

T

tjk

tjktjktjkjkt

jk

T

t

N

j

M

ktjkt

jk

jkt

jktjktjkjk

jkt

jktjktjkjkjkjkjk

tjk

jktjkt

jktjktjk

jk

jk

jkjk

jkjk

jk

bjkQ

b

Lb

1

1

11

1 1

1

11

1 1

1

1

111

1

1 1 1

111

1111

1

021

log

21

21

21log

21log2

12log2log

μoμoΣ

ΣΣμoμoΣΣΣΣΣ

ΣμoμoΣΣ

ΣμoμoΣΣ

Σ

o

Σbλ

ΣμoμoΣΣ

ΣμoμoΣΣΣΣΣ

o

μoΣμoΣo

b

TT

dd XabX

XbXa T

T

)( 1

here symmetric is and

)det(det

jkΣd

d TXXXX

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (77)

bull The new model parameter set for each mixture component and mixture weight can be expressed as

T

tt

T

ttt

T

t

tt

T

tt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

o

λOλO

oλO

λO

μ

T

tt

T

t

tjktjktt

T

t

tt

T

t

tjktjkt

tt

jk

kj

kj

pkkjsp

pkkjsp

1

1

1

1

μoμo

λOλO

μoμoλO

λO

Σ

T

1t

M

1k t

T

1t t

jkkj

kjc

Page 86: Hidden Markov Models for Speech Recognitionberlin.csie.ntnu.edu.tw/Courses/Speech Processing/Lectures2015... · Hidden Markov Models for Speech Recognition ... A Tutorial on Hidden

SP - Berlin Chen 86

The EM Algorithm (57)

ndash Assume we have and estimate the probability that each occurred in the generation of

ndash Pretend we had in fact observed a complete data pair with frequency proportional to the probability to computed a new the maximum likelihood estimate of

ndash Does the process convergendash Algorithm

bull Log-likelihood expression and expectation taken over S

λOλOSλSO PPP

λOSλSOλO logloglog PPP

λ λSO P

λO

λ

SO

S

Bayesrsquo rule

take expectation over S

incomplete data likelihood

SSλOSλOSλSOλOS

λOλOSλO

loglog

loglog

PPPP

PPPs

unknown model setting

complete data likelihood

SP - Berlin Chen 87

The EM Algorithm (67)

ndash Algorithm (Cont)bull We can thus express as follows

bull We want

S

S

SS

λOSλOSλλ

λSOλOSλλ

λλλλ

λOSλOSλSOλOS

λO

log

log

where

loglog

log

PPH

PPQ

HQ

PPPP

P

λλλλλλλλ

λλλλλλλλ

λOλO

loglog

HHQQHQHQ

PP

λOPlog

λOλO PP loglog

SP - Berlin Chen 88

The EM Algorithm (77)

bull has the following property

ndash Therefore for maximizing we only need to maximize the Q-function (auxiliary function)

0 0

)1log(

1

log

λλλλ

λOSλOS

λOSλOS

λOS

λOSλOS

λOS

λλλλ

S

S

S

HH

PP

xxPP

P

PP

P

HH

S

λSOλOSλλ log PPQ

λλλλ HH

λOPlog

Jensenrsquos inequality

Kullbuack-Leibler (KL) distance

Expectation of the completedata log likelihood with respectto the latent state sequences

SP - Berlin Chen 89

EM Applied to Discrete HMM Training (15)

bull Apply EM algorithm to iteratively refine the HMM parameter vector ndash By maximizing the auxiliary function

ndash Where and can be expressed as

)( πBAλ

S

S

λSOλOλSO

λSOλOSλλ

PlogP

P

PlogPQ

T

tts

T

tsss

T

tts

T

tsss

T

tts

T

tsss

ttt

ttt

ttt

baP

baP

baP

1

1

1

1

1

1

1

1

1

loglogloglog

loglogloglog

11

11

11

oλSO

oλSO

oλSO

λSO P λSO P

SP - Berlin Chen 90

EM Applied to Discrete HMM Training (25)

bull Rewrite the auxiliary function as

N

j k votj

tT

ts

N

i

N

j

T

tij

ttT

tss

N

iis

kt

t

tt

kbP

jsPkb

PP

Q

aP

jsisPa

PP

Q

PisP

PP

Q

QQQQ

1 all 1

1 1

1

1

1

all

1

1

1

1

all

loglog

log

log

loglog

1

1

λOλO

λOλSO

λOλO

λOλSO

λOλO

λOλSO

λ

bλaλλλλ

Sb

Sa

baπ

wi yi

SP - Berlin Chen 91

EM Applied to Discrete HMM Training (35)

bull The auxiliary function contains three independentterms and ndash Can be maximized individuallyndash All of the same form

ija kbji

when valuemaximum has

and where

N

1j j

jj

j

N

1j j

N

1j jjN21

w

wyF

0y1yylogwyyygF

y

y

SP - Berlin Chen 92

EM Applied to Discrete HMM Training (45)

bull Proof Apply Lagrange Multiplier

N

1j j

jj

N

1j j

N

1j j

N

1j j

j

j

j

j

j

N

1j

N

1j jjj

N

1j jj

w

wy

wwy

jyw

0yw

yF

1yylogwylogwF

that Suppose

Multiplier Lagrange applyingBy

Constraint

Lagrange Multiplier httpwwwslimycom~steuardteachingtutorialsLagrangehtml

SP - Berlin Chen 93

EM Applied to Discrete HMM Training (5/5)

• The new model parameter set $\bar{\boldsymbol{\lambda}}=(\bar{\mathbf{A}},\bar{\mathbf{B}},\bar{\boldsymbol{\pi}})$ can be expressed as

$$\bar{\pi}_{i}=\frac{P(\mathbf{O},s_{1}=i\mid\boldsymbol{\lambda})}{P(\mathbf{O}\mid\boldsymbol{\lambda})}$$

$$\bar{a}_{ij}=\frac{\displaystyle\sum_{t=1}^{T-1}P(\mathbf{O},s_{t}=i,s_{t+1}=j\mid\boldsymbol{\lambda})\big/P(\mathbf{O}\mid\boldsymbol{\lambda})}{\displaystyle\sum_{t=1}^{T-1}P(\mathbf{O},s_{t}=i\mid\boldsymbol{\lambda})\big/P(\mathbf{O}\mid\boldsymbol{\lambda})}$$

$$\bar{b}_{i}(k)=\frac{\displaystyle\sum_{\substack{t=1\\ \text{s.t. }\mathbf{o}_{t}=v_{k}}}^{T}P(\mathbf{O},s_{t}=i\mid\boldsymbol{\lambda})\big/P(\mathbf{O}\mid\boldsymbol{\lambda})}{\displaystyle\sum_{t=1}^{T}P(\mathbf{O},s_{t}=i\mid\boldsymbol{\lambda})\big/P(\mathbf{O}\mid\boldsymbol{\lambda})}$$
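The sketch below (illustrative only; names are not from the slides) turns these re-estimation formulas into code. It assumes the state posteriors gamma[t, i] = P(s_t = i | O, λ) and the transition posteriors xi[t, i, j] = P(s_t = i, s_{t+1} = j | O, λ) have already been produced by the forward-backward procedure, and that the observations are discrete codebook indices.

```python
import numpy as np

def reestimate_discrete(obs_idx, gamma, xi, K):
    """M-step for a discrete HMM.

    obs_idx : (T,) int array, o_t given as indices into the codebook {v_1..v_K}
    gamma   : (T, N) array, gamma[t, i] = P(s_t=i | O, lambda)
    xi      : (T-1, N, N) array, xi[t, i, j] = P(s_t=i, s_{t+1}=j | O, lambda)
    """
    T, N = gamma.shape
    pi_bar = gamma[0]                                    # pi_i from the t = 1 posterior
    a_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    b_bar = np.zeros((N, K))
    for k in range(K):                                   # numerator counts only frames with o_t = v_k
        b_bar[:, k] = gamma[obs_idx == k].sum(axis=0)
    b_bar /= gamma.sum(axis=0)[:, None]
    return pi_bar, a_bar, b_bar
```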

SP - Berlin Chen 94

EM Applied to Continuous HMM Training (1/7)

• Continuous HMM: the state observation does not come from a finite set, but from a continuous space
– The difference between the discrete and the continuous HMM lies in a different form of state output probability
– The discrete HMM requires a quantization procedure to map observation vectors from the continuous space to the discrete space
• Continuous Mixture HMM
– The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures):

$$b_{j}(\mathbf{o})=\sum_{k=1}^{M}c_{jk}\,b_{jk}(\mathbf{o})
=\sum_{k=1}^{M}c_{jk}\,N\!\left(\mathbf{o};\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk}\right)
=\sum_{k=1}^{M}\frac{c_{jk}}{(2\pi)^{L/2}\left|\boldsymbol{\Sigma}_{jk}\right|^{1/2}}
\exp\!\left[-\frac{1}{2}\left(\mathbf{o}-\boldsymbol{\mu}_{jk}\right)^{T}\boldsymbol{\Sigma}_{jk}^{-1}\left(\mathbf{o}-\boldsymbol{\mu}_{jk}\right)\right],
\qquad\sum_{k=1}^{M}c_{jk}=1$$

[Figure: distribution for state i shown as a mixture of three Gaussians N_1, N_2, N_3 with weights w_i1, w_i2, w_i3]
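For illustration, a minimal numpy sketch (not from the slides; names are assumptions) that evaluates this state output density for one state:

```python
import numpy as np

def log_gauss(o, mu, sigma):
    """log N(o; mu, sigma) for a full covariance matrix sigma."""
    L = len(mu)
    d = o - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (L * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(sigma, d))

def state_output_prob(o, c_j, mu_j, sigma_j):
    """b_j(o) = sum_k c_jk N(o; mu_jk, Sigma_jk) for a single state j."""
    return sum(c_j[k] * np.exp(log_gauss(o, mu_j[k], sigma_j[k]))
               for k in range(len(c_j)))
```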

SP - Berlin Chen 95

EM Applied to Continuous HMM Training (2/7)

• Express $b_{j}(\mathbf{o}_{t})$ with respect to each single mixture component $b_{jk}(\mathbf{o}_{t})$:

$$p(\mathbf{O}\mid\boldsymbol{\lambda})
=\sum_{\mathbf{S}}\pi_{s_{1}}b_{s_{1}}(\mathbf{o}_{1})\prod_{t=2}^{T}a_{s_{t-1}s_{t}}b_{s_{t}}(\mathbf{o}_{t})
=\sum_{\mathbf{S}}\pi_{s_{1}}\!\left[\sum_{k=1}^{M}c_{s_{1}k}b_{s_{1}k}(\mathbf{o}_{1})\right]\prod_{t=2}^{T}a_{s_{t-1}s_{t}}\!\left[\sum_{k=1}^{M}c_{s_{t}k}b_{s_{t}k}(\mathbf{o}_{t})\right]
=\sum_{\mathbf{S}}\sum_{\mathbf{K}}p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\boldsymbol{\lambda})$$

with

$$p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\boldsymbol{\lambda})
=\pi_{s_{1}}c_{s_{1}k_{1}}b_{s_{1}k_{1}}(\mathbf{o}_{1})\prod_{t=2}^{T}a_{s_{t-1}s_{t}}\,c_{s_{t}k_{t}}\,b_{s_{t}k_{t}}(\mathbf{o}_{t})$$

where $\mathbf{K}=\left(k_{1},k_{2},\ldots,k_{T}\right)$ is one of the possible mixture component sequences along with the state sequence $\mathbf{S}$.

Note: the interchange of product and sum uses

$$\prod_{t=1}^{T}\left(\sum_{k=1}^{M}a_{tk}\right)
=\left(a_{11}+\cdots+a_{1M}\right)\left(a_{21}+\cdots+a_{2M}\right)\cdots\left(a_{T1}+\cdots+a_{TM}\right)
=\sum_{k_{1}=1}^{M}\sum_{k_{2}=1}^{M}\cdots\sum_{k_{T}=1}^{M}\prod_{t=1}^{T}a_{tk_{t}}$$
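A tiny numerical check of this product-sum interchange (illustrative only):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
T, M = 4, 3
x = rng.random((T, M))                               # x[t, k] plays the role of a_tk

lhs = np.prod(x.sum(axis=1))                         # prod_t (sum_k x_tk)
rhs = sum(np.prod([x[t, k] for t, k in enumerate(ks)])
          for ks in product(range(M), repeat=T))     # sum over all mixture sequences K
print(np.isclose(lhs, rhs))                          # True
```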

SP - Berlin Chen 96

EM Applied to Continuous HMM Training (3/7)

• Therefore, an auxiliary function for the EM algorithm can be written as

$$Q(\boldsymbol{\lambda},\bar{\boldsymbol{\lambda}})
=\sum_{\mathbf{S}}\sum_{\mathbf{K}}P(\mathbf{S},\mathbf{K}\mid\mathbf{O},\boldsymbol{\lambda})\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\boldsymbol{\lambda}})
=\sum_{\mathbf{S}}\sum_{\mathbf{K}}\frac{p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\boldsymbol{\lambda})}{p(\mathbf{O}\mid\boldsymbol{\lambda})}\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\boldsymbol{\lambda}})$$

with

$$\log p(\mathbf{O},\mathbf{S},\mathbf{K}\mid\bar{\boldsymbol{\lambda}})
=\log\bar{\pi}_{s_{1}}+\sum_{t=2}^{T}\log\bar{a}_{s_{t-1}s_{t}}+\sum_{t=1}^{T}\log\bar{b}_{s_{t}k_{t}}(\mathbf{o}_{t})+\sum_{t=1}^{T}\log\bar{c}_{s_{t}k_{t}}$$

so that

$$Q(\boldsymbol{\lambda},\bar{\boldsymbol{\lambda}})
=Q_{\pi}(\boldsymbol{\lambda},\bar{\boldsymbol{\pi}})+Q_{a}(\boldsymbol{\lambda},\bar{\mathbf{a}})+Q_{b}(\boldsymbol{\lambda},\bar{\mathbf{b}})+Q_{c}(\boldsymbol{\lambda},\bar{\mathbf{c}})$$

(initial probabilities, state transition probabilities, Gaussian density functions, and mixture component weights, respectively)

SP - Berlin Chen 97

EM Applied to Continuous HMM Training (4/7)

• The only difference, compared with discrete HMM training, lies in the terms for the Gaussian densities and for the mixture weights:

$$Q_{b}(\boldsymbol{\lambda},\bar{\mathbf{b}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_{t}=j,k_{t}=k\mid\mathbf{O},\boldsymbol{\lambda})\log\bar{b}_{jk}(\mathbf{o}_{t})$$

$$Q_{c}(\boldsymbol{\lambda},\bar{\mathbf{c}})=\sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t=1}^{T}P(s_{t}=j,k_{t}=k\mid\mathbf{O},\boldsymbol{\lambda})\log\bar{c}_{jk}$$
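In practice the posterior P(s_t = j, k_t = k | O, λ) that weights both terms is usually obtained by splitting the state posterior γ_t(j) across mixture components in proportion to each component's share of b_j(o_t). A sketch under that standard convention (assumed names, not from the slides; it reuses log_gauss() from the earlier sketch):

```python
import numpy as np

def mixture_posteriors(obs, gamma_state, c, mu, sigma):
    """gamma[t, j, k] = P(s_t=j, k_t=k | O, lambda).

    gamma_state : (T, N) state posteriors gamma_t(j), e.g. from forward-backward
    c, mu, sigma: mixture weights (N, M), means (N, M, L), covariances (N, M, L, L)
    """
    T = obs.shape[0]
    N, M = c.shape
    gamma = np.zeros((T, N, M))
    for t in range(T):
        for j in range(N):
            comp = np.array([c[j, k] * np.exp(log_gauss(obs[t], mu[j, k], sigma[j, k]))
                             for k in range(M)])
            gamma[t, j] = gamma_state[t, j] * comp / comp.sum()
    return gamma
```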

SP - Berlin Chen 98

EM Applied to Continuous HMM Training (5/7)

Let $\gamma_{t}(j,k)=P(s_{t}=j,k_{t}=k\mid\mathbf{O},\boldsymbol{\lambda})$ and

$$\bar{b}_{jk}(\mathbf{o}_{t})=N\!\left(\mathbf{o}_{t};\bar{\boldsymbol{\mu}}_{jk},\bar{\boldsymbol{\Sigma}}_{jk}\right)
=\frac{1}{(2\pi)^{L/2}\left|\bar{\boldsymbol{\Sigma}}_{jk}\right|^{1/2}}
\exp\!\left[-\frac{1}{2}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)\right]$$

$$\log\bar{b}_{jk}(\mathbf{o}_{t})
=-\frac{L}{2}\log(2\pi)-\frac{1}{2}\log\left|\bar{\boldsymbol{\Sigma}}_{jk}\right|
-\frac{1}{2}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)$$

Setting the derivative of $Q_{b}$ with respect to $\bar{\boldsymbol{\mu}}_{jk}$ to zero:

$$\frac{\partial Q_{b}(\boldsymbol{\lambda},\bar{\mathbf{b}})}{\partial\bar{\boldsymbol{\mu}}_{jk}}
=\frac{\partial}{\partial\bar{\boldsymbol{\mu}}_{jk}}\sum_{t=1}^{T}\sum_{j'=1}^{N}\sum_{k'=1}^{M}\gamma_{t}(j',k')\log\bar{b}_{j'k'}(\mathbf{o}_{t})
=\sum_{t=1}^{T}\gamma_{t}(j,k)\,\bar{\boldsymbol{\Sigma}}_{jk}^{-1}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)=\mathbf{0}$$

$$\Rightarrow\;\;\bar{\boldsymbol{\mu}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)\,\mathbf{o}_{t}}{\sum_{t=1}^{T}\gamma_{t}(j,k)}$$

(using $\dfrac{\partial\,\mathbf{x}^{T}\mathbf{C}\mathbf{x}}{\partial\mathbf{x}}=\left(\mathbf{C}+\mathbf{C}^{T}\right)\mathbf{x}$, and $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here)
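A quick numerical check of this stationary point (illustrative only, not from the slides): for a fixed positive-definite covariance, the γ-weighted mean maximizes the γ-weighted Gaussian log-likelihood term.

```python
import numpy as np

rng = np.random.default_rng(3)
T, L = 50, 2
obs = rng.normal(size=(T, L))                       # observation vectors o_t
gamma = rng.random(T)                               # stands in for gamma_t(j,k)
sigma_inv = np.linalg.inv(np.cov(obs.T) + np.eye(L))

def weighted_term(mu):
    # sum_t gamma_t * (-1/2) (o_t - mu)^T Sigma^{-1} (o_t - mu)
    d = obs - mu
    return -0.5 * np.sum(gamma * np.einsum('tl,lm,tm->t', d, sigma_inv, d))

mu_bar = (gamma[:, None] * obs).sum(axis=0) / gamma.sum()
for _ in range(200):                                # any perturbation lowers the objective
    assert weighted_term(mu_bar + rng.normal(scale=0.5, size=L)) <= weighted_term(mu_bar) + 1e-9
print("weighted mean:", mu_bar)
```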

SP - Berlin Chen 99

EM Applied to Continuous HMM Training (6/7)

Setting the derivative of $Q_{b}$ with respect to $\bar{\boldsymbol{\Sigma}}_{jk}^{-1}$ to zero (differentiating with respect to the precision matrix is equivalent and more convenient):

$$\frac{\partial Q_{b}(\boldsymbol{\lambda},\bar{\mathbf{b}})}{\partial\bar{\boldsymbol{\Sigma}}_{jk}^{-1}}
=\frac{\partial}{\partial\bar{\boldsymbol{\Sigma}}_{jk}^{-1}}\sum_{t=1}^{T}\gamma_{t}(j,k)
\left[-\frac{L}{2}\log(2\pi)+\frac{1}{2}\log\left|\bar{\boldsymbol{\Sigma}}_{jk}^{-1}\right|
-\frac{1}{2}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)^{T}\bar{\boldsymbol{\Sigma}}_{jk}^{-1}\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)\right]$$

$$=\frac{1}{2}\sum_{t=1}^{T}\gamma_{t}(j,k)\left[\bar{\boldsymbol{\Sigma}}_{jk}-\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)^{T}\right]=\mathbf{0}$$

$$\Rightarrow\;\;\bar{\boldsymbol{\Sigma}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)^{T}}{\sum_{t=1}^{T}\gamma_{t}(j,k)}$$

(using $\dfrac{\partial\,\mathbf{a}^{T}\mathbf{X}\mathbf{b}}{\partial\mathbf{X}}=\mathbf{a}\mathbf{b}^{T}$ and $\dfrac{\partial\det\mathbf{X}}{\partial\mathbf{X}}=\det(\mathbf{X})\left(\mathbf{X}^{-1}\right)^{T}$; $\bar{\boldsymbol{\Sigma}}_{jk}$ is symmetric here)

SP - Berlin Chen 100

EM Applied to Continuous HMM Training (7/7)

• The new model parameter set for each mixture component and mixture weight can be expressed as

$$\bar{\boldsymbol{\mu}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)\,\mathbf{o}_{t}}{\sum_{t=1}^{T}\gamma_{t}(j,k)},
\qquad
\bar{\boldsymbol{\Sigma}}_{jk}=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)\left(\mathbf{o}_{t}-\bar{\boldsymbol{\mu}}_{jk}\right)^{T}}{\sum_{t=1}^{T}\gamma_{t}(j,k)},
\qquad
\bar{c}_{jk}=\frac{\sum_{t=1}^{T}\gamma_{t}(j,k)}{\sum_{t=1}^{T}\sum_{k'=1}^{M}\gamma_{t}(j,k')}$$

with $\gamma_{t}(j,k)=P\!\left(s_{t}=j,k_{t}=k\mid\mathbf{O},\boldsymbol{\lambda}\right)$.
