TRANSCRIPT
Hierarchical Markov Network
ICRI-CI Retreat 2013, May 9, 2013
Boris Ginzburg
ICRI - Computational Intelligence
ACKNOWLEDGMENT: Daniel Rubin, Dima Vainbrand, Ronny Ronen, Ohad Falik, Zev Rivlin, Mike
Deisher, Shai Fine, Shie Mannor
Summary
• Hierarchical Hidden Markov Model (H-HMM):
A well-known statistical model for complex temporal pattern recognition (Fine, Singer, Tishby, 1998)
• Hierarchical Markov Network (HMN):
- A compact and computationally efficient extension of the H-HMM, based on merging identical sub-models
- A new, efficient Viterbi algorithm in which the computations of a sub-model are shared by all of its “parents”
Hidden Markov Model
The Hidden Markov Model (HMM) is among the leading tools for temporal pattern recognition.
Used in:
- Speech recognition
- Handwriting recognition
- Language processing
- Gesture recognition
- Bioinformatics
- Machine translation
- Speech synthesis
Hidden Markov Model
An HMM is a stochastic FSM described by a Markov model:
Transition probability: α(i,j) := Prob( q(t+1) = j | q(t) = i );
Initial probability: π(i) := Prob( q(1) = i );
State [i] emits symbol [o] with probability β(i,o) := Prob( o | q(t) = i ).
[Figure: an example 5-state HMM with transition probabilities (30%, 30%, 40%, 20%, 80%) and an emitted observation sequence (1.2, 2.7, 3.4, 4.9).]
The state of the model is hidden from the observer.
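As a concrete toy example, here is such a model in Python (a minimal sketch; the states, symbols, and probabilities are invented for illustration, not taken from the slides):

```python
# A toy 2-state HMM as plain Python data: initial distribution pi,
# transition probabilities alpha, emission probabilities beta.
# All numbers are illustrative only.

pi = [0.6, 0.4]                # pi[i]       = Prob(q(1) = i)
alpha = [[0.7, 0.3],           # alpha[i][j] = Prob(q(t+1) = j | q(t) = i)
         [0.4, 0.6]]
beta = [{"a": 0.9, "b": 0.1},  # beta[i][o]  = Prob(o | q(t) = i)
        {"a": 0.2, "b": 0.8}]
```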
HMM: Viterbi Algorithm
Problem: Given an observation sequence O = ( o(1), …, o(T) ), what is the most probable state sequence Q = ( q(1), …, q(T) )?
Solution: the Viterbi algorithm, based on dynamic programming.
For each state [x] and time t:
Define δ(x,t) as the likelihood of the most probable state sequence that ends in [x] at time t and generates the observations ( o(1), …, o(t) ):
δ(x,t) := max over q(1), …, q(t-1) of P( q(1), …, q(t-1), q(t) = x, o(1), …, o(t) ).
We work with the negative log-likelihood S(x,t) := -log δ(x,t).
S(x,t) is called the token at state [x] at time t.
HMM: Viterbi Algorithm
The Viterbi algorithm is based on the principle of dynamic programming. Writing p(x) := -log π(x), a(x,y) := -log α(x,y), and b(y,o) := -log β(y,o):
Initialization (t = 1): S(x,1) = p(x) + b(x, o(1)); ψ(x,1) = 0.
Induction (t → t+1): S(y,t+1) = min_x [ S(x,t) + a(x,y) ] + b(y, o(t+1)); ψ(y,t+1) = argmin_x [ S(x,t) + a(x,y) ].
Termination (t = T): S_min = min_x S(x,T); q(T) = argmin_x S(x,T).
Backward pass, path recovery (t = T-1, …, 1): q(t) = ψ( q(t+1), t+1 ).
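A minimal sketch of this procedure in Python, working directly with the negative-log tokens S(x,t) (the function and variable names are ours, not from the slides):

```python
import math

def viterbi(obs, pi, alpha, beta):
    """Most probable state path, minimizing S = -log likelihood.

    pi[x]       : initial probability of state x
    alpha[x][y] : transition probability x -> y
    beta[x][o]  : emission probability of symbol o in state x
    """
    nl = lambda p: -math.log(p) if p > 0 else float("inf")
    n, T = len(pi), len(obs)

    # Initialization (t = 0 here): S(x,0) = p(x) + b(x, o(0)).
    S = [[nl(pi[x]) + nl(beta[x][obs[0]]) for x in range(n)]]
    psi = [[0] * n]

    # Induction: S(y,t+1) = min_x [S(x,t) + a(x,y)] + b(y, o(t+1)).
    for t in range(1, T):
        row, back = [], []
        for y in range(n):
            best_x = min(range(n), key=lambda x: S[t-1][x] + nl(alpha[x][y]))
            row.append(S[t-1][best_x] + nl(alpha[best_x][y]) + nl(beta[y][obs[t]]))
            back.append(best_x)
        S.append(row)
        psi.append(back)

    # Termination and backward path recovery: q(t) = psi(q(t+1), t+1).
    q = [min(range(n), key=lambda x: S[T-1][x])]
    for t in range(T - 1, 0, -1):
        q.append(psi[t][q[-1]])
    return list(reversed(q)), min(S[T-1])
```

With the toy model defined earlier, viterbi(["a", "b", "b"], pi, alpha, beta) returns the best state path together with its final token score S_min.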
Hierarchy of HMMs
Example: speech recognition uses a multi-layer hierarchy of HMMs:
- Acoustic Model-1: sub-phoneme level
- Acoustic Model-2: word level
- Language model: bi-grams / tri-grams
[Figure: the three layers illustrated on the phrase “nice beach”: the language model over words, word-level HMMs, and sub-phoneme models (N AY S for “nice”).]
Hierarchical HMM
The H-HMM replaces a complex hierarchy of simple HMMs with one unified model [Fine, Singer, Tishby, 1998].
An H-HMM is a hierarchical FSM with two types of states:
- “Complex” state: a state which is itself an HMM
- “Production” state: a simple state on the lowest level of the hierarchy, which emits an observed symbol
There is an efficient Viterbi algorithm for the H-HMM with complexity O(T·N²), where
T is the sequence duration and N is the number of states in the H-HMM
[Murphy & Paskin, 2001], [Wakabayashi & Miura, 2012].
[Figure: phoneme-level HMMs for “speech” (s p i ʧ), “beach” (b i ʧ), and “peach” (p i ʧ).]
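A rough sketch of these two state types as data structures (our own hypothetical representation, not taken from the cited papers):

```python
from dataclasses import dataclass

@dataclass
class ProductionState:
    """Lowest level of the hierarchy: emits observed symbols."""
    emissions: dict    # symbol -> emission probability beta(i, o)

@dataclass
class ComplexState:
    """A state that is itself an HMM over its child states."""
    children: list     # ProductionState or ComplexState instances
    initial: dict      # child index -> initial probability pi
    transitions: dict  # (child i, child j) -> transition probability alpha
```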
Scalability Problem
Example: dictionary = {speech, beach, peach}, with a 3-state HMM for each phoneme model:
- 10 instances of phoneme HMMs
- only 5 distinct HMM templates (s, p, b, i, ʧ)
PROBLEM: structural and computational redundancy, both for the “hierarchy of HMMs” and for the H-HMM.
Hierarchical Markov Network
Hierarchical Markov Network (HMN): a compact representation of the H-HMM, where each sub-model is embedded once and serves multiple “parents”.
[Figure: the HMN for {speech, beach, peach}: each phoneme sub-HMM (s, p, b, i, ʧ) appears once and is shared by all words that use it.]
The HMN is based on “call-return” semantics: a parent node calls a sub-HMM, which computes the score of a subsequence and returns the result to the parent node.
HMN: Viterbi Algorithm
The key observation: the Viterbi computations inside identical HMMs are almost the same. Consider, e.g., the H-HMMs for “beach” and “peach”:
[Figure: two copies of the “i” sub-HMM (internal states x1, x2, x3), one inside “beach” (states b.0 b.1 i.0 i.1 ʧ.0 ʧ.1) and one inside “peach” (p.0 p.1 i.0 i.1 ʧ.0 ʧ.1). A token entering “beach.i” with score 30 at time t advances to 35, 45, 52 at times t+1, t+2, t+3; a token entering “peach.i” with score 10 advances to 15, 25, 32. The internal increments (+5, +10, +7) are identical in both copies.]
HMN: Viterbi Algorithm
The token S(.) in the sub-HMM “i” of the word “beach” is based on the score of the previous phoneme “b”:
S(beach.i.x1, t) = [ S(beach.b1, t-1) + a(beach.b1, beach.i0) ] + [ a(beach.i0, beach.i.x1) + b(beach.i.x1, o(t)) ];
For “peach”, S(.) is based on the score from “p”:
S(peach.i.x1, t) = [ S(peach.p1, t-1) + a(peach.p1, peach.i0) ] + [ a(peach.i0, peach.i.x1) + b(peach.i.x1, o(t)) ];
The last two terms in both expressions are equal:
a(beach.i0, beach.i.x1) + b(beach.i.x1, o(t)) = a(peach.i0, peach.i.x1) + b(peach.i.x1, o(t)),
so we can do this computation once and use it for both words.
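A minimal sketch of this reuse (the function name and dictionary layout are ours; the numbers come from the figures on these slides):

```python
def apply_shared_subhmm(parent_prefix_scores, sub_increment):
    """Add the sub-HMM's internal -log score increment, computed once,
    to every parent's prefix score."""
    return {p: s + sub_increment for p, s in parent_prefix_scores.items()}

# The shared "i" sub-HMM accumulates +22 over three frames (see figure);
# "beach" (prefix 30) and "peach" (prefix 10) both reuse that result.
print(apply_shared_subhmm({"beach": 30, "peach": 10}, 22))
# -> {'beach': 52, 'peach': 32}
```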
HMN: Viterbi Algorithm
One sub-HMM can serve multiple parents: it computes the score of a sub-sequence and returns it to all of its parents.
[Figure: the shared version: “beach” (prefix score 30 at time t) and “peach” (prefix score 10 at time t) both call the single “i” sub-HMM (states x1, x2, x3), which accumulates internal scores 5, 15, 22 at times t+1, t+2, t+3 and returns 30+22 and 10+22 at time t+3.]
HMN: Call-Return
Child HMM:
– serves multiple calls from multiple nodes; the child maintains a list of received calls
– all calls received at the same moment are merged and computed together; the child keeps a list of “return addresses”
– multiple tokens can be generated by one call (all marked by the time at which the call was started)
– when a token reaches the “end” state, its score is sent to the parent
Parent node:
– maintains a list of open calls and their prefix scores
– adds the prefix score to the score received from the child (see the sketch below)
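A rough sketch of the bookkeeping this implies (the record layout is our own assumption, loosely following the worked example on the next slide):

```python
from dataclasses import dataclass, field

@dataclass
class ChildCall:
    """One merged call into a child sub-HMM. Calls received at the same
    moment share a single record; the return addresses let every token
    that reaches the "end" state be routed back to all callers."""
    call_id: int
    start_time: int        # tokens produced by this call are marked with it
    return_addrs: list = field(default_factory=list)  # parent nodes to notify
    num_tokens: int = 0    # tokens still alive inside the child

@dataclass
class OpenCall:
    """Parent-side record of an issued call; the prefix score is added
    to whatever score the child eventually returns."""
    call_id: int
    prefix_score: float
```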
HMN: Call-Return
[Figure: a worked example of the call list in a child sub-HMM with states i.0 and i.1. Each call record carries an id, a start time (t+1, …, t+4), a return address (i.1), and the number of pending tokens (e.g., call 1 at t+1 returning to i.1 with 2 tokens); tokens carry (score, start-time) pairs such as (18,0), (15,1), (10,2), (40,3), (10,4).]
HMN: Temporal Hierarchy
How can we support multiple temporal scales on different levels of the hierarchy? Possible directions (see the sketch below):
– Exponentially increase the time scale at each level of the hierarchy: Δ_d = 2·Δ_{d+1}
– Let one call cover a number of time-overlapping sub-sequences; the child selects the sub-sequence with the best score: S(x, t_d) = min( S(x, t_{d+1}), S(x, t_{d+1}+1) )
Inspired by Tali Tishby's talk yesterday.
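A minimal sketch of the second direction under assumed semantics (all names are ours): if level d runs at half the rate of level d+1, each coarse step keeps the better (smaller) of the two overlapping fine-scale scores:

```python
def coarsen_scores(fine_scores):
    """Map token scores from level d+1 to level d, where Δ_d = 2·Δ_{d+1}:
    S(x, t_d) = min(S(x, t_{d+1}), S(x, t_{d+1} + 1))."""
    return [min(fine_scores[t], fine_scores[t + 1])
            for t in range(0, len(fine_scores) - 1, 2)]

print(coarsen_scores([5, 3, 7, 2, 9, 9, 4, 6]))  # -> [3, 2, 9, 4]
```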
HMN: Temporal Hierarchy
[Figure: the call-list example revisited for the temporal-hierarchy case; the record layout (id, start time, return address, number of tokens) and the token scores match the previous slide.]
HMN: Performance
The HMN has potential performance benefits over the HMM/H-HMM when the cost of an HMM exceeds the cost of a “call-return”:
- The cost of an HMM is roughly proportional to its number of arcs.
- The cost of a call/return is fixed: it depends only on the number of return tokens per call, not on the size of the HMM.
Back-of-the-envelope estimate:
- Cost of one Viterbi step on a 5-state HMM: ~10 MACs.
- Cost of one return token: ~1 MAC.
The additional HMN cost is increased complexity: book-keeping in the HMM, a more complex parent-node structure, etc.
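Taking those estimates at face value, a quick break-even sketch (the formula is our reading of the slide's numbers; the function is hypothetical):

```python
def macs_per_step(num_parents, hmm_macs=10, token_macs=1):
    """Per-step MAC counts for a sub-HMM shared by num_parents parents:
    without sharing, every parent runs a private copy of the sub-HMM;
    with HMN, it runs once and each parent pays only for its return token."""
    plain = num_parents * hmm_macs
    hmn = hmm_macs + num_parents * token_macs
    return plain, hmn

for k in (1, 2, 5, 10):
    print(k, macs_per_step(k))
# 1 (10, 11)
# 2 (20, 12)
# 5 (50, 15)
# 10 (100, 20)
# -> sharing pays off as soon as a sub-HMM has two or more parents.
```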
Next Steps
- A promising idea; we need to establish its efficiency beyond a plain back-of-the-envelope estimate, and build a prototype to check the performance claims
- Extend the HMN theory to support multiple temporal scales on different levels of the hierarchy
- Explore the connection between Convolutional Neural Networks and the HMN
BACKUP