TRANSCRIPT
Hierarchical Markov Network
ICRI-CI Retreat 2013, May 9, 2013
Boris Ginzburg
ICRI - Computational Intelligence
ACKNOWLEDGMENT: Daniel Rubin, Dima Vainbrand, Ronny Ronen, Ohad Falik, Zev Rivlin, Mike
Deisher, Shai Fine, Shie Mannor
Summary
• Hierarchical Hidden Markov Model (H-HMM):
A well-known statistical model for complex temporal pattern recognition (Fine, Singer, Tishby, 1998)
• Hierarchical Markov Network (HMN):
- A compact and computationally efficient extension of the H-HMM, based on merging identical sub-models
- A new, efficient Viterbi algorithm in which the computations of a sub-model are shared by all of its “parents”
Hidden Markov Model
The Hidden Markov Model (HMM) is among the leading tools for temporal pattern recognition.
Used in:
- Speech recognition
- Handwriting recognition
- Language processing
- Gesture recognition
- Bioinformatics
- Machine translation
- Speech synthesis
Hidden Markov Model
An HMM is a stochastic FSM described by a Markov model:
Transition probability: α(i,j) := Prob( q(t+1) = j | q(t) = i );
Initial probability: π(i) := Prob( q(1) = i );
State [i] emits symbol [o] with probability β(i,o) := Prob( o | q(t) = i ).
[Figure: an example 5-state HMM with transition probabilities (30%, 30%, 40%, 20%, 80%) and an emitted observation sequence (1.2, 2.7, 3.4, 4.9).]
The state of the model is hidden from the observer.
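As a concrete toy example, here is such a model in Python (a minimal sketch; the states, symbols, and probabilities are invented for illustration, not taken from the slides):

```python
# A toy 2-state HMM as plain Python data: initial distribution pi,
# transition probabilities alpha, emission probabilities beta.
# All numbers are illustrative only.

pi = [0.6, 0.4]                # pi[i]       = Prob(q(1) = i)
alpha = [[0.7, 0.3],           # alpha[i][j] = Prob(q(t+1) = j | q(t) = i)
         [0.4, 0.6]]
beta = [{"a": 0.9, "b": 0.1},  # beta[i][o]  = Prob(o | q(t) = i)
        {"a": 0.2, "b": 0.8}]
```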
HMM: Viterbi Algorithm
Problem: Given an observation sequence O = ( o(1), …, o(T) ), what is the most probable state sequence Q = ( q(1), …, q(T) )?
Solution: the Viterbi algorithm, based on dynamic programming.
For each state [x] and time t:
Define δ(x,t) as the likelihood of the most probable state sequence that ends in [x] at time t and generates the observations ( o(1), …, o(t) ):
δ(x,t) := max over q(1), …, q(t-1) of P( q(1), …, q(t-1), q(t) = x, o(1), …, o(t) ).
We work with the negative log-likelihood S(x,t) := -log δ(x,t).
S(x,t) is called the token at state [x] at time t.
HMM: Viterbi Algorithm
The Viterbi algorithm is based on the principle of dynamic programming. Writing p(x) := -log π(x), a(x,y) := -log α(x,y), and b(y,o) := -log β(y,o):
Initialization (t = 1): S(x,1) = p(x) + b(x, o(1)); ψ(x,1) = 0.
Induction (t → t+1): S(y,t+1) = min_x [ S(x,t) + a(x,y) ] + b(y, o(t+1)); ψ(y,t+1) = argmin_x [ S(x,t) + a(x,y) ].
Termination (t = T): S_min = min_x S(x,T); q(T) = argmin_x S(x,T).
Backward pass, path recovery (t = T-1, …, 1): q(t) = ψ( q(t+1), t+1 ).
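A minimal sketch of this procedure in Python, working directly with the negative-log tokens S(x,t) (the function and variable names are ours, not from the slides):

```python
import math

def viterbi(obs, pi, alpha, beta):
    """Most probable state path, minimizing S = -log likelihood.

    pi[x]       : initial probability of state x
    alpha[x][y] : transition probability x -> y
    beta[x][o]  : emission probability of symbol o in state x
    """
    nl = lambda p: -math.log(p) if p > 0 else float("inf")
    n, T = len(pi), len(obs)

    # Initialization (t = 0 here): S(x,0) = p(x) + b(x, o(0)).
    S = [[nl(pi[x]) + nl(beta[x][obs[0]]) for x in range(n)]]
    psi = [[0] * n]

    # Induction: S(y,t+1) = min_x [S(x,t) + a(x,y)] + b(y, o(t+1)).
    for t in range(1, T):
        row, back = [], []
        for y in range(n):
            best_x = min(range(n), key=lambda x: S[t-1][x] + nl(alpha[x][y]))
            row.append(S[t-1][best_x] + nl(alpha[best_x][y]) + nl(beta[y][obs[t]]))
            back.append(best_x)
        S.append(row)
        psi.append(back)

    # Termination and backward path recovery: q(t) = psi(q(t+1), t+1).
    q = [min(range(n), key=lambda x: S[T-1][x])]
    for t in range(T - 1, 0, -1):
        q.append(psi[t][q[-1]])
    return list(reversed(q)), min(S[T-1])
```

With the toy model defined earlier, viterbi(["a", "b", "b"], pi, alpha, beta) returns the best state path together with its final token score S_min.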
Hierarchy of HMMs
Example: speech recognition uses a multi-layer hierarchy of HMMs:
- Acoustic Model-1: sub-phoneme level
- Acoustic Model-2: word level
- Language model: bi-grams / tri-grams
[Figure: the three layers illustrated on the phrase “nice beach”: the language model over words, word-level HMMs, and sub-phoneme models (N AY S for “nice”).]
Hierarchical HMM
The H-HMM replaces a complex hierarchy of simple HMMs with one unified model [Fine, Singer, Tishby, 1998].
An H-HMM is a hierarchical FSM with two types of states:
- “Complex” state: a state which is itself an HMM
- “Production” state: a simple state on the lowest level of the hierarchy, which emits an observed symbol
There is an efficient Viterbi algorithm for the H-HMM with complexity O(T·N²), where
T is the sequence duration and N is the number of states in the H-HMM
[Murphy & Paskin, 2001], [Wakabayashi & Miura, 2012].
[Figure: phoneme-level HMMs for “speech” (s p i ʧ), “beach” (b i ʧ), and “peach” (p i ʧ).]
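A rough sketch of these two state types as data structures (our own hypothetical representation, not taken from the cited papers):

```python
from dataclasses import dataclass

@dataclass
class ProductionState:
    """Lowest level of the hierarchy: emits observed symbols."""
    emissions: dict    # symbol -> emission probability beta(i, o)

@dataclass
class ComplexState:
    """A state that is itself an HMM over its child states."""
    children: list     # ProductionState or ComplexState instances
    initial: dict      # child index -> initial probability pi
    transitions: dict  # (child i, child j) -> transition probability alpha
```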
Scalability Problem
Example: dictionary = {speech, beach, peach}, with a 3-state HMM for each phoneme model:
- 10 instances of phoneme HMMs
- only 5 distinct HMM templates (s, p, b, i, ʧ)
PROBLEM: structural and computational redundancy, both for the “hierarchy of HMMs” and for the H-HMM.
Hierarchical Markov Network
Hierarchical Markov Network (HMN): a compact representation of the H-HMM, where each sub-model is embedded once and serves multiple “parents”.
[Figure: the HMN for {speech, beach, peach}: each phoneme sub-HMM (s, p, b, i, ʧ) appears once and is shared by all words that use it.]
The HMN is based on “call-return” semantics: a parent node calls a sub-HMM, which computes the score of a subsequence and returns the result to the parent node.
HMN: Viterbi Algorithm
The key observation: the Viterbi computations inside identical HMMs are almost the same. Consider, e.g., the H-HMMs for “beach” and “peach”:
[Figure: two copies of the “i” sub-HMM (internal states x1, x2, x3), one inside “beach” (states b.0 b.1 i.0 i.1 ʧ.0 ʧ.1) and one inside “peach” (p.0 p.1 i.0 i.1 ʧ.0 ʧ.1). A token entering “beach.i” with score 30 at time t advances to 35, 45, 52 at times t+1, t+2, t+3; a token entering “peach.i” with score 10 advances to 15, 25, 32. The internal increments (+5, +10, +7) are identical in both copies.]
HMN: Viterbi Algorithm
The token S(.) in the sub-HMM “i” of the word “beach” is based on the score of the previous phoneme “b”:
S(beach.i.x1, t) = [ S(beach.b1, t-1) + a(beach.b1, beach.i0) ] + [ a(beach.i0, beach.i.x1) + b(beach.i.x1, o(t)) ];
For “peach”, S(.) is based on the score from “p”:
S(peach.i.x1, t) = [ S(peach.p1, t-1) + a(peach.p1, peach.i0) ] + [ a(peach.i0, peach.i.x1) + b(peach.i.x1, o(t)) ];
The last two terms in both expressions are equal:
a(beach.i0, beach.i.x1) + b(beach.i.x1, o(t)) = a(peach.i0, peach.i.x1) + b(peach.i.x1, o(t)),
so we can do this computation once and use it for both words.
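A minimal sketch of this reuse (the function name and dictionary layout are ours; the numbers come from the figures on these slides):

```python
def apply_shared_subhmm(parent_prefix_scores, sub_increment):
    """Add the sub-HMM's internal -log score increment, computed once,
    to every parent's prefix score."""
    return {p: s + sub_increment for p, s in parent_prefix_scores.items()}

# The shared "i" sub-HMM accumulates +22 over three frames (see figure);
# "beach" (prefix 30) and "peach" (prefix 10) both reuse that result.
print(apply_shared_subhmm({"beach": 30, "peach": 10}, 22))
# -> {'beach': 52, 'peach': 32}
```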
HMN: Viterbi Algorithm
One sub-HMM can serve multiple parents: it computes the score of a sub-sequence and returns it to all of its parents.
[Figure: the shared version: “beach” (prefix score 30 at time t) and “peach” (prefix score 10 at time t) both call the single “i” sub-HMM (states x1, x2, x3), which accumulates internal scores 5, 15, 22 at times t+1, t+2, t+3 and returns 30+22 and 10+22 at time t+3.]
HMN: Call-Return
Child HMM:
– serves multiple calls from multiple nodes; the child maintains a list of received calls
– all calls received at the same moment are merged and computed together; the child keeps a list of “return addresses”
– multiple tokens can be generated by one call (all marked by the time at which the call was started)
– when a token reaches the “end” state, its score is sent to the parent
Parent node:
– maintains a list of open calls and their prefix scores
– adds the prefix score to the score received from the child (see the sketch below)
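A rough sketch of the bookkeeping this implies (the record layout is our own assumption, loosely following the worked example on the next slide):

```python
from dataclasses import dataclass, field

@dataclass
class ChildCall:
    """One merged call into a child sub-HMM. Calls received at the same
    moment share a single record; the return addresses let every token
    that reaches the "end" state be routed back to all callers."""
    call_id: int
    start_time: int        # tokens produced by this call are marked with it
    return_addrs: list = field(default_factory=list)  # parent nodes to notify
    num_tokens: int = 0    # tokens still alive inside the child

@dataclass
class OpenCall:
    """Parent-side record of an issued call; the prefix score is added
    to whatever score the child eventually returns."""
    call_id: int
    prefix_score: float
```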
HMN: Call-Return
[Figure: a worked example of the call list in a child sub-HMM with states i.0 and i.1. Each call record carries an id, a start time (t+1, …, t+4), a return address (i.1), and the number of pending tokens (e.g., call 1 at t+1 returning to i.1 with 2 tokens); tokens carry (score, start-time) pairs such as (18,0), (15,1), (10,2), (40,3), (10,4).]
HMN: Temporal Hierarchy
How can we support multiple temporal scales on different levels of the hierarchy? Possible directions (see the sketch below):
– Exponentially increase the time scale at each level of the hierarchy: Δ_d = 2·Δ_{d+1}
– Let one call cover a number of time-overlapping sub-sequences; the child selects the sub-sequence with the best score: S(x, t_d) = min( S(x, t_{d+1}), S(x, t_{d+1}+1) )
Inspired by Tali Tishby's talk yesterday.
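A minimal sketch of the second direction under assumed semantics (all names are ours): if level d runs at half the rate of level d+1, each coarse step keeps the better (smaller) of the two overlapping fine-scale scores:

```python
def coarsen_scores(fine_scores):
    """Map token scores from level d+1 to level d, where Δ_d = 2·Δ_{d+1}:
    S(x, t_d) = min(S(x, t_{d+1}), S(x, t_{d+1} + 1))."""
    return [min(fine_scores[t], fine_scores[t + 1])
            for t in range(0, len(fine_scores) - 1, 2)]

print(coarsen_scores([5, 3, 7, 2, 9, 9, 4, 6]))  # -> [3, 2, 9, 4]
```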
HMN: Temporal Hierarchy
[Figure: the call-list example revisited for the temporal-hierarchy case; the record layout (id, start time, return address, number of tokens) and the token scores match the previous slide.]
HMN: Performance
The HMN has potential performance benefits over the HMM/H-HMM when the cost of an HMM exceeds the cost of a “call-return”:
- The cost of an HMM is roughly proportional to its number of arcs.
- The cost of a call/return is fixed: it depends only on the number of return tokens per call, not on the size of the HMM.
Back-of-the-envelope estimate:
- Cost of one Viterbi step on a 5-state HMM: ~10 MACs.
- Cost of one return token: ~1 MAC.
The additional HMN cost is increased complexity: book-keeping in the HMM, a more complex parent-node structure, etc.
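Taking those estimates at face value, a quick break-even sketch (the formula is our reading of the slide's numbers; the function is hypothetical):

```python
def macs_per_step(num_parents, hmm_macs=10, token_macs=1):
    """Per-step MAC counts for a sub-HMM shared by num_parents parents:
    without sharing, every parent runs a private copy of the sub-HMM;
    with HMN, it runs once and each parent pays only for its return token."""
    plain = num_parents * hmm_macs
    hmn = hmm_macs + num_parents * token_macs
    return plain, hmn

for k in (1, 2, 5, 10):
    print(k, macs_per_step(k))
# 1 (10, 11)
# 2 (20, 12)
# 5 (50, 15)
# 10 (100, 20)
# -> sharing pays off as soon as a sub-HMM has two or more parents.
```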
Next Steps
- A promising idea; we need to establish its efficiency beyond a plain back-of-the-envelope estimate, and build a prototype to check the performance claims
- Extend the HMN theory to support multiple temporal scales on different levels of the hierarchy
- Explore the connection between Convolutional Neural Networks and the HMN
BACKUP