
Page 1: A Spectral Algorithm for Learning Hidden Markov Models

Introduction Learning Algorithm Learning Guarantees Conclusion

A Spectral Algorithm for Learning Hidden Markov Models

Based on Daniel Hsu, Sham Kakade, and Tong Zhang, arXiv:0811.4413

Kentaro Hanaki

New York University

May 5, 2015

Spectral Algorithm for Learning HMM New York University

Page 2

Table of Contents

Introduction

Learning Algorithm

Learning Guarantees

Conclusion


Page 4

What is an HMM?

[Graphical model: hidden chain h1 → h2 → h3 → h4 → h5; each ht emits observation xt]

- Probabilistic model for sequential observations Pr[x_1, ..., x_T]

- Observations are generated from their underlying hidden states

- Hidden states follow the Markov assumption

- Low-dimensional hidden states make the dynamics easier to understand

- Applications: speech recognition, NLP (PoS tagging, NER, MT, ...), etc.

Page 5

Parameters of HMM

[Graphical model as before, annotated with the transition distribution T = Pr(ht | ht−1) between hidden states and the emission distribution O = Pr(xt | ht)]

- Pr[x_1, ..., x_T] = Σ_{h_1,...,h_T} Pr[h_1] (Π_{t=2}^T Pr[h_t | h_{t−1}]) (Π_{t=1}^T Pr[x_t | h_t])

- Observation matrix: O_{i,a} = Pr[x_t = i | h_t = a]

- Transition matrix: T_{a,b} = Pr[h_t = a | h_{t−1} = b]

- Prior for the initial hidden state: π_a = Pr[h_1 = a]
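These definitions can be sanity-checked numerically. A minimal sketch (the toy values of T, O and π below are mine, not from the paper) computes Pr[x_1, ..., x_T] both by brute-force marginalization over hidden sequences and with the standard forward recursion; the two must agree.

```python
import itertools
import numpy as np

# Toy HMM: m = 2 hidden states, n = 3 observation symbols (made-up numbers).
T = np.array([[0.7, 0.2],   # T[a, b] = Pr[h_t = a | h_{t-1} = b]  (columns sum to 1)
              [0.3, 0.8]])
O = np.array([[0.5, 0.1],   # O[i, a] = Pr[x_t = i | h_t = a]      (columns sum to 1)
              [0.3, 0.3],
              [0.2, 0.6]])
pi = np.array([0.6, 0.4])   # pi[a] = Pr[h_1 = a]

def joint_bruteforce(xs):
    """Pr[x_1..x_T] by summing over all hidden sequences."""
    m, total = T.shape[0], 0.0
    for hs in itertools.product(range(m), repeat=len(xs)):
        p = pi[hs[0]] * O[xs[0], hs[0]]
        for t in range(1, len(xs)):
            p *= T[hs[t], hs[t - 1]] * O[xs[t], hs[t]]
        total += p
    return total

def joint_forward(xs):
    """Forward recursion: alpha_t(a) = Pr[x_1..x_t, h_t = a]."""
    alpha = pi * O[xs[0]]
    for x in xs[1:]:
        alpha = O[x] * (T @ alpha)
    return alpha.sum()

xs = [0, 2, 1, 2]
assert np.isclose(joint_bruteforce(xs), joint_forward(xs))
```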

Page 6

Training HMM — Traditional Approaches

Natural loss for probabilistic model is negative log likelihood (NLL)

1. Find the global minimum of the NLL
   - The NLL is not convex due to the presence of hidden states
   - Solving the non-convex optimization problem is extremely hard

2. Heuristically minimize the NLL with the EM algorithm
   - The NLL is convex once the hidden states are fixed
   - E-step: infer and fix the hidden states given the current parameters
   - M-step: minimize the NLL with respect to the parameters
   - Tends to get stuck in local minima
   - Works well in practice, but learning guarantees are hard to analyze
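For HMMs, the EM alternation described above is the classical Baum–Welch algorithm. A minimal single-sequence sketch (the random toy data and initialization are mine); the in-loop assertion checks the one property EM does guarantee, that the likelihood never decreases:

```python
import numpy as np

rng = np.random.default_rng(0)

# A short synthetic observation sequence over n = 3 symbols (toy data, mine).
xs = rng.integers(0, 3, size=40)
m, n = 2, 3

# Random column-stochastic initialization: columns of T and O sum to 1.
T = rng.random((m, m)); T /= T.sum(axis=0)
O = rng.random((n, m)); O /= O.sum(axis=0)
pi = np.full(m, 1.0 / m)

def forward_backward(xs, T, O, pi):
    """alpha[t, a] = Pr[x_1..x_t, h_t = a]; beta[t, a] = Pr[x_{t+1}..x_L | h_t = a]."""
    L = len(xs)
    alpha, beta = np.zeros((L, m)), np.ones((L, m))
    alpha[0] = pi * O[xs[0]]
    for t in range(1, L):
        alpha[t] = O[xs[t]] * (T @ alpha[t - 1])
    for t in range(L - 2, -1, -1):
        beta[t] = T.T @ (O[xs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()        # last value: likelihood Pr[x_1..x_L]

prev_ll = -np.inf
for _ in range(20):                            # EM iterations
    alpha, beta, lik = forward_backward(xs, T, O, pi)
    ll = np.log(lik)
    assert ll >= prev_ll - 1e-9                # EM never decreases the likelihood
    prev_ll = ll
    gamma = alpha * beta / lik                 # gamma[t, a] = Pr[h_t = a | x_1..x_L]
    xi = np.zeros((m, m))                      # expected transition counts, b -> a
    for t in range(len(xs) - 1):
        xi += T * np.outer(O[xs[t + 1]] * beta[t + 1], alpha[t]) / lik
    pi = gamma[0]                              # M-step: re-estimate from expected counts
    T = xi / gamma[:-1].sum(axis=0)
    O = np.zeros((n, m))
    for t, x in enumerate(xs):
        O[x] += gamma[t]
    O /= gamma.sum(axis=0)
```

Nothing here certifies closeness to the global optimum: a different seed can converge to a different stationary point, which is exactly the weakness the spectral algorithm avoids.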

Page 7

Contribution of the Paper

The paper proposes an efficient training method that admits a unique solution and comes with learning guarantees


Page 9

Overview of the Algorithm

- Learn a model that predicts joint probabilities of observations

- Learning is hard due to the presence of hidden states

- Two-step solution that never refers to hidden states directly:
  - Map to the observation space, where probabilities can be estimated
  - Find a subspace in which the dynamics are tractable

Page 10

Observation Operator Representation

In an HMM, the joint probability can be written as

Pr[x_1, ..., x_t] = 1_m^T A_{x_t} · · · A_{x_1} π

where A_x is the observation operator

A_x = T diag(O_{x,1}, ..., O_{x,m})
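This operator form is easy to verify numerically: a toy check (parameters are mine, not from the paper) that 1_m^T A_{x_t} ⋯ A_{x_1} π matches brute-force marginalization for every length-3 observation string.

```python
import itertools
import numpy as np

# Toy HMM parameters (mine): columns of T and O sum to 1.
T = np.array([[0.7, 0.2], [0.3, 0.8]])
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
pi = np.array([0.6, 0.4])
m, n = 2, 3

A = [T @ np.diag(O[x]) for x in range(n)]   # A_x = T diag(O_{x,1}, ..., O_{x,m})

def joint_operator(xs):
    """Pr[x_1..x_t] = 1^T A_{x_t} ... A_{x_1} pi."""
    v = pi
    for x in xs:
        v = A[x] @ v
    return v.sum()

def joint_bruteforce(xs):
    """Same probability by summing over all hidden sequences."""
    total = 0.0
    for hs in itertools.product(range(m), repeat=len(xs)):
        p = pi[hs[0]] * O[xs[0], hs[0]]
        for t in range(1, len(xs)):
            p *= T[hs[t], hs[t - 1]] * O[xs[t], hs[t]]
        total += p
    return total

for xs in itertools.product(range(n), repeat=3):
    assert np.isclose(joint_operator(list(xs)), joint_bruteforce(list(xs)))
```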

Page 11

Proof

Proof for t = 2 (the generalization is straightforward):

Pr[x_1 = i, x_2 = j]
  = Σ_{a,b} Pr[x_1 = i, x_2 = j | h_1 = a, h_2 = b] Pr[h_1 = a, h_2 = b]
  = Σ_{a,b} Pr[x_1 = i, x_2 = j | h_1 = a, h_2 = b] T_{b,a} π_a
  = Σ_{a,b} Pr[x_1 = i | h_1 = a] Pr[x_2 = j | h_2 = b] T_{b,a} π_a
  = Σ_{a,b} O_{j,b} T_{b,a} O_{i,a} π_a
  = 1_m^T T diag(O_{j,1}, ..., O_{j,m}) T diag(O_{i,1}, ..., O_{i,m}) π
  = 1_m^T A_j A_i π

(the leading 1_m^T T equals 1_m^T because the columns of T sum to one)

Page 12

Observation Operator Representation

Problem: estimating π and A_x requires inferring hidden states

- π: prior over hidden states

- A_x: transitions between hidden states

Solution: map everything into a space in which

- each component can be estimated from observations

- the dynamics are as simple as in the hidden-state space

Page 13

Mapping to Subspace

- First map everything into the observation space using O:

  π → Oπ,  1_m → O 1_m,  A_x → O A_x O^{−1}

- But O^{−1} is not well-defined, and the dynamics are complicated in this space

- Instead, find U such that U^T O is invertible (i.e., the hidden-state dynamics are preserved), and set

  π → b_1 = (U^T O) π,  1_m → b_∞ = ((U^T O)^T)^{−1} 1_m

  A_x → B_x = (U^T O) A_x (U^T O)^{−1}

  Pr[x_1, ..., x_t] = b_∞^T B_{x_t} · · · B_{x_1} b_1

Page 14

Estimating b_1, b_∞ and B_x

b_1, b_∞ and B_x can be estimated using

(P_1)_i = Pr[x_1 = i]

(P_{2,1})_{i,j} = Pr[x_2 = i, x_1 = j]

(P_{3,x,1})_{i,j} = Pr[x_3 = i, x_2 = x, x_1 = j]

Lemma. b_1, b_∞ and B_x can be expressed as

b_1 = U^T P_1

b_∞ = (P_{2,1}^T U)^+ P_1

B_x = (U^T P_{3,x,1})(U^T P_{2,1})^+
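With exact population matrices there is no sampling error, so the lemma can be verified to machine precision. A sketch (toy parameters are mine) that forms P_1, P_{2,1}, P_{3,x,1} in closed form and recovers joint probabilities from b_1, b_∞ and B_x:

```python
import numpy as np

# Toy HMM (mine): columns of T and O sum to 1.
T = np.array([[0.7, 0.2], [0.3, 0.8]])
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
pi = np.array([0.6, 0.4])
m, n = 2, 3
A = [T @ np.diag(O[x]) for x in range(n)]          # observation operators

# Exact (population) observable matrices.
P1 = O @ pi                                        # P1[i]    = Pr[x1 = i]
P21 = O @ T @ np.diag(pi) @ O.T                    # P21[i,j] = Pr[x2 = i, x1 = j]
P3x1 = [O @ A[x] @ T @ np.diag(pi) @ O.T           # P3x1[x][i,j] = Pr[x3 = i, x2 = x, x1 = j]
        for x in range(n)]

# U: top-m left singular vectors of P21.
U = np.linalg.svd(P21)[0][:, :m]

# The lemma's observable parameters.
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def joint_observable(xs):
    """Pr[x_1..x_t] = binf^T B_{x_t} ... B_{x_1} b1."""
    v = b1
    for x in xs:
        v = B[x] @ v
    return binf @ v

def joint_true(xs):
    """Reference value via the observation operators."""
    v = pi
    for x in xs:
        v = A[x] @ v
    return v.sum()

xs = [2, 0, 1]
assert np.isclose(joint_observable(xs), joint_true(xs))
```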

Page 15

Finding U

Lemma. Assume π > 0 elementwise and that O and T have column rank m (i.e., no two hidden states are identical). If U is the matrix of left singular vectors of P_{2,1} corresponding to its non-zero singular values, then U^T O is invertible.

Page 16

Proof for the Existence of (U^T O)^{−1}

P_{2,1} can be rewritten as

P_{2,1} = O T diag(π) O^T

so rank(P_{2,1}) ≤ rank(O). Also, T diag(π) O^T has full row rank by the assumptions, which implies

O = P_{2,1} (T diag(π) O^T)^+

and therefore rank(O) ≤ rank(P_{2,1}). So

rank(O) = rank(P_{2,1}) = rank(U) = m

and U^T O has rank m, hence is invertible.

Page 17

Learning Algorithm

1. Randomly sample N observation triples (x_1, x_2, x_3) and estimate P_1, P_{2,1} and P_{3,x,1}

2. Compute the SVD of P_{2,1} and let U be the matrix of left singular vectors corresponding to the m largest singular values

3. Compute the model parameters b_1, b_∞ and B_x
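The three steps translate directly into code. A sketch of the full algorithm on a toy HMM (the parameters, sample size and tolerance are mine); with finite N the recovered probabilities are only approximate, which is exactly what the guarantees in the next section quantify:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM (mine): columns of T and O sum to 1; m = 2 states, n = 3 symbols.
T = np.array([[0.7, 0.2], [0.3, 0.8]])
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
pi = np.array([0.6, 0.4])
m, n, N = 2, 3, 1_000_000

def sample_from(cols, cond):
    """Draw one sample per entry of `cond`, using column cols[:, cond[i]] as the distribution."""
    out = np.empty(len(cond), dtype=int)
    for b in range(cols.shape[1]):
        idx = np.flatnonzero(cond == b)
        out[idx] = rng.choice(cols.shape[0], size=len(idx), p=cols[:, b])
    return out

# Step 1: sample N observation triples (x1, x2, x3) and estimate P1, P21, P3x1.
h1 = rng.choice(m, size=N, p=pi)
h2 = sample_from(T, h1)
h3 = sample_from(T, h2)
x1, x2, x3 = sample_from(O, h1), sample_from(O, h2), sample_from(O, h3)

P1 = np.bincount(x1, minlength=n) / N
P21 = np.bincount(x2 * n + x1, minlength=n * n).reshape(n, n) / N
P3x1 = [np.bincount(x3[x2 == x] * n + x1[x2 == x], minlength=n * n).reshape(n, n) / N
        for x in range(n)]

# Step 2: SVD of P21; U = left singular vectors of the m largest singular values.
U = np.linalg.svd(P21)[0][:, :m]

# Step 3: model parameters b1, b_inf, Bx.
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def joint_est(xs):
    v = b1
    for x in xs:
        v = B[x] @ v
    return binf @ v

def joint_true(xs):
    v = pi
    for x in xs:
        v = T @ (O[x] * v)       # apply A_x = T diag(O_x)
    return v.sum()

# Total-variation error over all length-2 strings shrinks as N grows.
tv = sum(abs(joint_est([i, j]) - joint_true([i, j])) for i in range(n) for j in range(n))
assert tv < 0.1
```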


Page 19

Learning Bound for Joint Probabilities

Theorem. There exists a constant C > 0 such that

Pr[ Σ_{x_1,...,x_t} | Pr[x_1, ..., x_t] − P̂r[x_1, ..., x_t] | ≥ ε ]
  ≤ exp( − σ_m(O)^2 σ_m(P_{2,1})^4 N ε^2 / ( C (1 + n_0(ε) σ_m(P_{2,1})^2) t^2 ) )

where

n_0(ε) = min{ k : ε(k) ≤ ε/4 }

ε(k) = min{ Σ_{j∈S} Pr[x_2 = j] : S ⊂ [n], |S| = n − k }

and σ_m denotes the m-th largest singular value.

Note: the bound gets looser as the sequence length t grows!

Page 20

Brief Idea of the Proof

1. Compute the sampling error for P_1, P_{2,1} and P_{3,x,1}

2. Compute how the sampling error propagates to the accuracy
   - Compute the approximation error for U
   - Compute the approximation error for b_1, b_∞ and B_x

Page 21

Learning Bound for Conditional Probabilities

The conditional probability is

Pr[x_t | x_{t−1}, ..., x_1] = b_∞^T B_{x_t} b_t / Σ_x b_∞^T B_x b_t

with the belief update

b_{t+1} = B_{x_t} b_t / (b_∞^T B_{x_t} b_t)

The conditional probability also has a learning guarantee:

- The bound is independent of the length of the sequence

- It involves two more parameters than the joint probability bound:
  - α: smallest entry of the transition operators A_x
  - γ: error for the hidden states, measured in the observation space
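A sketch of the conditional-probability recursion (toy parameters are mine; with exact population matrices the recursion must reproduce the true conditionals computed by the forward algorithm):

```python
import numpy as np

# Toy HMM (mine); exact population quantities, so the two computations agree.
T = np.array([[0.7, 0.2], [0.3, 0.8]])
O = np.array([[0.5, 0.1], [0.3, 0.3], [0.2, 0.6]])
pi = np.array([0.6, 0.4])
m, n = 2, 3
A = [T @ np.diag(O[x]) for x in range(n)]

P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ A[x] @ T @ np.diag(pi) @ O.T for x in range(n)]
U = np.linalg.svd(P21)[0][:, :m]
b1 = U.T @ P1
binf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def conditionals_observable(xs):
    """Pr[x_t | x_{t-1}, ..., x_1] via b_{t+1} = B_x b_t / (binf^T B_x b_t)."""
    b, out = b1, []
    for x in xs:
        num = binf @ (B[x] @ b)
        den = sum(binf @ (B[y] @ b) for y in range(n))
        out.append(num / den)
        b = B[x] @ b / num            # belief update
    return out

def conditionals_forward(xs):
    """True conditionals from the forward recursion over hidden states."""
    alpha, out = pi, []               # alpha = Pr[h_t | x_1, ..., x_{t-1}]
    for x in xs:
        w = O[x] * alpha              # w[a] = Pr[x_t, h_t = a | past]
        out.append(w.sum())           # Pr[x_t | x_1, ..., x_{t-1}]
        alpha = T @ (w / w.sum())
    return out

xs = [2, 0, 1, 1, 2]
assert np.allclose(conditionals_observable(xs), conditionals_forward(xs))
```

With estimated rather than exact parameters, the belief update renormalizes at every step, which is why the conditional-probability bound does not accumulate error with the sequence length.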


Page 23

Conclusion

- Spectral learning for HMMs yields a unique solution

- The joint probability is estimated without inferring hidden states

- The learning guarantee for the joint probability depends on the sequence length, but the guarantee for the conditional probability does not
