A Spectral Algorithm for Learning Hidden Markov Models
Based on Daniel Hsu, Sham Kakade, and Tong Zhang, arXiv:0811.4413
Kentaro Hanaki
New York University
May 5, 2015
Table of Contents
Introduction
Learning Algorithm
Learning Guarantees
Conclusion
What is an HMM?
[Graphical model: hidden-state chain h1 → h2 → h3 → h4 → h5, each ht emitting an observation xt]
- Probabilistic model for sequential observations Pr[x_1, ..., x_T]
- Observations are generated by underlying hidden states (see the sampling sketch below)
- Hidden states follow the Markov assumption
- Low-dimensional hidden states make the dynamics easier to understand
- Applications: speech recognition, NLP (PoS tagging, NER, MT, ...), etc.
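To make the generative story concrete, here is a minimal sampling sketch in Python (not from the paper; the parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM: m = 2 hidden states, n = 3 observation symbols (made-up numbers).
T = np.array([[0.7, 0.4],                 # T[a, b] = Pr[h_t = a | h_{t-1} = b]
              [0.3, 0.6]])
O = np.array([[0.5, 0.1],                 # O[i, a] = Pr[x_t = i | h_t = a]
              [0.4, 0.3],
              [0.1, 0.6]])
pi = np.array([0.6, 0.4])                 # pi[a] = Pr[h_1 = a]

def sample_sequence(length):
    """Hidden states follow the Markov chain; each observation
    depends only on the current hidden state."""
    h = rng.choice(2, p=pi)               # h_1 ~ pi
    xs = []
    for _ in range(length):
        xs.append(rng.choice(3, p=O[:, h]))   # x_t ~ O[:, h_t]
        h = rng.choice(2, p=T[:, h])          # h_{t+1} ~ T[:, h_t]
    return xs

print(sample_sequence(5))
```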
Parameters of an HMM
[Same graphical model as above, with T = Pr(ht | ht−1) labeling the hidden-state transitions and O = Pr(xt | ht) labeling the emissions]
- Pr[x_1, ..., x_T] = Σ_{h_1,...,h_T} Pr[h_1] (∏_{t=2}^T Pr[h_t | h_{t-1}]) (∏_{t=1}^T Pr[x_t | h_t]) (evaluated directly in the brute-force sketch below)
- Observation matrix: O_{i,a} = Pr[x_t = i | h_t = a]
- Transition matrix: T_{a,b} = Pr[h_t = a | h_{t-1} = b]
- Prior for the initial hidden state: π_a = Pr[h_1 = a]
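As a sanity check, the joint-probability formula can be evaluated by brute force, summing over all hidden paths (a sketch reusing the toy parameters from above; only feasible for short sequences):

```python
import numpy as np
from itertools import product

# Toy parameters from the sampling sketch above (made-up numbers).
T = np.array([[0.7, 0.4], [0.3, 0.6]])
O = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
pi = np.array([0.6, 0.4])

def joint_prob(xs):
    """Direct translation of the formula: sum over all hidden paths of
    Pr[h_1] * prod_t Pr[h_t | h_{t-1}] * prod_t Pr[x_t | h_t]."""
    total = 0.0
    for hs in product(range(2), repeat=len(xs)):
        p = pi[hs[0]] * O[xs[0], hs[0]]
        for t in range(1, len(xs)):
            p *= T[hs[t], hs[t-1]] * O[xs[t], hs[t]]
        total += p
    return total

print(joint_prob([0, 2, 1]))  # Pr[x1 = 0, x2 = 2, x3 = 1]
```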
Training HMM — Traditional Approaches
The natural loss for a probabilistic model is the negative log-likelihood (NLL).

1. Find the global minimum of the NLL
   - The NLL is not convex due to the presence of hidden states
   - Solving a non-convex optimization problem is extremely hard

2. Heuristically minimize the NLL using the EM algorithm (see the sketch below)
   - The NLL is convex once the hidden states are fixed
   - E-step: infer and fix the hidden states based on the current parameters
   - M-step: minimize the NLL with respect to the parameters
   - Tends to get stuck in local minima
   - Works fine in practice, but learning guarantees are hard to analyze
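For reference, a minimal sketch of one Baum-Welch iteration for a single sequence. Note this is the standard soft-EM update (posteriors over hidden states) rather than the hard "infer and fix" variant described above; conventions match the toy sketches (column-stochastic T and O), and a real implementation would work in log space over many sequences:

```python
import numpy as np

def baum_welch_step(xs, T, O, pi):
    """One EM iteration: T[a, b] = Pr[h_t = a | h_{t-1} = b],
    O[i, a] = Pr[x_t = i | h_t = a], xs is a list of symbol indices."""
    m, L = T.shape[0], len(xs)
    # E-step: forward-backward gives posteriors over hidden states.
    alpha = np.zeros((L, m)); beta = np.zeros((L, m))
    alpha[0] = pi * O[xs[0]]
    for t in range(1, L):
        alpha[t] = O[xs[t]] * (T @ alpha[t - 1])
    beta[-1] = 1.0
    for t in range(L - 2, -1, -1):
        beta[t] = T.T @ (O[xs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)       # Pr[h_t = a | x_1..x_L]
    xi = np.zeros((m, m))                           # expected transition counts
    for t in range(L - 1):
        p = T * np.outer(O[xs[t + 1]] * beta[t + 1], alpha[t])
        xi += p / p.sum()
    # M-step: re-estimate parameters from the expected counts.
    T_new = xi / xi.sum(axis=0, keepdims=True)
    O_new = np.zeros_like(O)
    for t in range(L):
        O_new[xs[t]] += gamma[t]
    O_new /= O_new.sum(axis=0, keepdims=True)
    return T_new, O_new, gamma[0]
```

One would iterate this update until the NLL stops improving; each iteration can only improve the likelihood, but only to a local minimum.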
Contribution of the Paper
The paper proposes an efficient training method that admits a unique solution and comes with learning guarantees.
Overview of the Algorithm
- Learn a model for predicting joint probabilities of observations
- Learning is hard due to the presence of hidden states
- Two-step solution that never refers to hidden states directly:
  - Map to the observation space, where probabilities can be estimated
  - Find a subspace in which the dynamics are tractable
Observation Operator Representation
In an HMM, the joint probability can be written as

Pr[x_1, ..., x_t] = 1_m^T A_{x_t} ··· A_{x_1} π

where A_x is the observation operator

A_x = T diag(O_{x,1}, ..., O_{x,m})
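This representation is easy to check in code (a sketch; A(x) builds the operator from the toy T and O used earlier, and the result matches joint_prob from the brute-force sketch):

```python
import numpy as np

# Toy HMM as in the earlier sketches (columns of T and O sum to 1).
T = np.array([[0.7, 0.4], [0.3, 0.6]])
O = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
pi = np.array([0.6, 0.4])

def A(x):
    """Observation operator A_x = T diag(O_{x,1}, ..., O_{x,m})."""
    return T @ np.diag(O[x])

def joint_via_operators(xs):
    """Pr[x_1, ..., x_t] = 1_m^T A_{x_t} ... A_{x_1} pi."""
    v = pi
    for x in xs:               # apply the operators in sequence order
        v = A(x) @ v
    return v.sum()             # left-multiply by the all-ones vector

print(joint_via_operators([0, 2, 1]))  # equals joint_prob([0, 2, 1]) above
```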
Proof
Proof for t = 2 (Generalization is straightforward)
Pr[x_1 = i, x_2 = j]
  = Σ_{a,b} Pr[x_1 = i, x_2 = j | h_1 = a, h_2 = b] Pr[h_1 = a, h_2 = b]
  = Σ_{a,b} Pr[x_1 = i, x_2 = j | h_1 = a, h_2 = b] T_{b,a} π_a
  = Σ_{a,b} Pr[x_1 = i | h_1 = a] Pr[x_2 = j | h_2 = b] T_{b,a} π_a
  = Σ_{a,b} O_{j,b} T_{b,a} O_{i,a} π_a
  = 1_m^T T diag(O_{j,1}, ..., O_{j,m}) T diag(O_{i,1}, ..., O_{i,m}) π
  = 1_m^T A_j A_i π
Observation Operator Representation
Problem: estimating π and A_x requires inferring hidden states

- π: the prior over hidden states
- A_x: transitions between hidden states

Solution: map everything into a space in which

- each component can be estimated from observations
- the dynamics are as easy as in the hidden-state space
Mapping to Subspace
- First map everything into the observation space using O:

  π → Oπ,   1_m → O 1_m,   A_x → O A_x O^{-1}

- But O^{-1} is not well-defined (O is n × m, not square), and the dynamics are complicated in this space
- Instead, find U such that U^T O is invertible (i.e., the map preserves the hidden-state dynamics) and set

  π → b_1 = (U^T O) π,   1_m → b_∞ = ((U^T O)^T)^{-1} 1_m

  A_x → B_x = (U^T O) A_x (U^T O)^{-1}

  Pr[x_1, ..., x_t] = b_∞^T B_{x_t} ··· B_{x_1} b_1

  (Note the inverse transpose in b_∞: it gives b_∞^T = 1_m^T (U^T O)^{-1}, so the (U^T O) factors telescope.)
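The telescoping can be verified numerically: any U with U^T O invertible leaves the computed probability unchanged. A sketch, using a random U as a stand-in for the SVD-based choice introduced later:

```python
import numpy as np

T = np.array([[0.7, 0.4], [0.3, 0.6]])              # toy HMM as before
O = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
pi = np.array([0.6, 0.4])
A = lambda x: T @ np.diag(O[x])                     # observation operators

# A generic n x m matrix has U^T O invertible with probability 1.
U = np.random.default_rng(1).standard_normal((3, 2))
M = U.T @ O

b1 = M @ pi                                         # b_1 = (U^T O) pi
binf = np.linalg.solve(M.T, np.ones(2))             # b_inf = (U^T O)^{-T} 1_m
B = lambda x: M @ A(x) @ np.linalg.inv(M)           # B_x = (U^T O) A_x (U^T O)^{-1}

v = b1
for x in [0, 2, 1]:
    v = B(x) @ v
print(binf @ v)   # identical to 1^T A_{x3} A_{x2} A_{x1} pi: the M factors telescope
```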
Estimating b_1, b_∞, and B_x

b_1, b_∞, and B_x can be estimated using

(P_1)_i = Pr[x_1 = i]
(P_{2,1})_{i,j} = Pr[x_2 = i, x_1 = j]
(P_{3,x,1})_{i,j} = Pr[x_3 = i, x_2 = x, x_1 = j]

Lemma. b_1, b_∞, and B_x can be expressed as

b_1 = U^T P_1
b_∞ = (P_{2,1}^T U)^+ P_1
B_x = (U^T P_{3,x,1})(U^T P_{2,1})^+
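With error-free (population) moments, these formulas recover the B_x representation exactly. A sketch using the toy model, where the moment matrices are written in terms of T, O, π only for the purpose of this check:

```python
import numpy as np

T = np.array([[0.7, 0.4], [0.3, 0.6]])              # toy HMM as before
O = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
pi = np.array([0.6, 0.4])
A = lambda x: T @ np.diag(O[x])
n, m = O.shape

# Population moments (no sampling error):
P1 = O @ pi                                         # (P1)_i = Pr[x1 = i]
P21 = O @ T @ np.diag(pi) @ O.T                     # (P21)_{i,j} = Pr[x2 = i, x1 = j]
P3x1 = [O @ A(x) @ T @ np.diag(pi) @ O.T for x in range(n)]

U = np.linalg.svd(P21)[0][:, :m]                    # top-m left singular vectors

b1 = U.T @ P1                                       # lemma formulas
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

v = b1
for x in [0, 2, 1]:
    v = B[x] @ v
print(binf @ v)   # again equals the true Pr[x1 = 0, x2 = 2, x3 = 1]
```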
Finding U
Lemma. Assume π > 0 and that O and T have column rank m (i.e., no two hidden states are identical). If U is the matrix of left singular vectors of P_{2,1} corresponding to non-zero singular values, then U^T O is invertible.
Proof of the Existence of (U^T O)^{-1}

P_{2,1} can be rewritten as

P_{2,1} = O T diag(π) O^T

So rank(P_{2,1}) ≤ rank(O). Also, T diag(π) O^T has full row rank by the assumptions, which implies

O = P_{2,1} (T diag(π) O^T)^+

Therefore rank(O) ≤ rank(P_{2,1}), so

rank(O) = rank(P_{2,1}) = rank(U) = m

and U^T O has rank m, hence is invertible.
Learning Algorithm
1. Randomly sample N observation triples (x_1, x_2, x_3) and estimate P_1, P_{2,1}, and P_{3,x,1} by their empirical frequencies

2. Compute the SVD of the estimated P_{2,1} and let U be the matrix of left singular vectors corresponding to the m largest singular values

3. Compute the model parameters b_1, b_∞, and B_x from the lemma formulas (an end-to-end sketch follows below)
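Putting the three steps together on sampled triples (a minimal end-to-end sketch; sample_triple and the toy model are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM used only to generate data; the learner sees just the triples.
T = np.array([[0.7, 0.4], [0.3, 0.6]])
O = np.array([[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]])
pi = np.array([0.6, 0.4])
n, m = O.shape

def sample_triple():
    h1 = rng.choice(m, p=pi);   x1 = rng.choice(n, p=O[:, h1])
    h2 = rng.choice(m, p=T[:, h1]); x2 = rng.choice(n, p=O[:, h2])
    h3 = rng.choice(m, p=T[:, h2]); x3 = rng.choice(n, p=O[:, h3])
    return x1, x2, x3

# Step 1: empirical estimates of P1, P21, P3x1 from N triples.
N = 100_000
P1 = np.zeros(n); P21 = np.zeros((n, n)); P3x1 = np.zeros((n, n, n))
for _ in range(N):
    x1, x2, x3 = sample_triple()
    P1[x1] += 1; P21[x2, x1] += 1; P3x1[x2][x3, x1] += 1
P1 /= N; P21 /= N; P3x1 /= N

# Step 2: U = top-m left singular vectors of the estimated P21.
U = np.linalg.svd(P21)[0][:, :m]

# Step 3: plug the estimates into the lemma formulas.
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

v = b1
for x in [0, 2, 1]:
    v = B[x] @ v
print(binf @ v)   # close to the true joint probability for large N
```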
Learning Bound for Joint Probabilities
Theorem. There exists a constant C > 0 such that

Pr[ Σ_{x_1,...,x_t} |Pr[x_1, ..., x_t] − P̂r[x_1, ..., x_t]| ≥ ε ] ≤ exp( − (σ_m(O)^2 σ_m(P_{2,1})^4 N ε^2) / (C (1 + n_0(ε) σ_m(P_{2,1})^2) t^2) )

where

n_0(ε) = min{ k : ε(k) ≤ ε/4 }

ε(k) = min{ Σ_{j∈S} Pr[x_2 = j] : S ⊂ [n], |S| = n − k }

and σ_m(·) denotes the m-th largest singular value.
Note: the bound gets looser as the sequence length t grows!
Brief Idea of the Proof
1. Compute the sampling error for P_1, P_{2,1}, and P_{3,x,1}
2. Compute how the sampling error propagates to the accuracy of the estimates
   - Compute the approximation error for U
   - Compute the approximation error for b_1, b_∞, and B_x
Learning Bound for Conditional Probabilities
The conditional probability is

Pr[x_t | x_{t-1}, ..., x_1] = b_∞^T B_{x_t} b_t / (Σ_x b_∞^T B_x b_t)

with internal state b_t (the belief after conditioning on x_1, ..., x_{t-1}) updated by

b_{t+1} = B_{x_t} b_t / (b_∞^T B_{x_t} b_t)
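A sketch of the resulting filtering loop, assuming b1, binf, and the list B of operators were computed as in the earlier sketches:

```python
import numpy as np

def conditional_probs(xs, b1, binf, B):
    """Return Pr[x_t | x_1, ..., x_{t-1}] for each t, maintaining the
    internal state b_t via the recursion above (a sketch, not the paper's code)."""
    probs, b = [], b1
    for x in xs:
        scores = np.array([binf @ (Bx @ b) for Bx in B])  # b_inf^T B_x b_t for all x
        probs.append(scores[x] / scores.sum())            # normalized prediction
        b = (B[x] @ b) / scores[x]                        # b_{t+1} = B_{x_t} b_t / (b_inf^T B_{x_t} b_t)
    return probs
```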
Conditional probabilities also have a learning guarantee:

- The bound is independent of the length of the sequence
- It involves two more parameters than the joint-probability bound:
  - α: the smallest entry of the operators A_x
  - γ: the error for hidden states measured in the observation space
Conclusion
- Spectral learning for HMMs yields a unique solution
- Joint probabilities can be estimated without inferring hidden states
- The learning guarantee for joint probabilities depends on the sequence length, but the guarantee for conditional probabilities does not