
A spectral algorithm for learning hidden Markov models

[Figure: HMM graphical model — hidden state chain h1 → h2 → h3 → …, each state ht emitting an observation xt]

Daniel Hsu (UCSD), Sham M. Kakade (TTI-C), Tong Zhang (Rutgers)


Motivation

• Hidden Markov Models (HMMs) – popular model for sequential data (e.g. speech, bio-sequences, natural language)

[Figure: HMM graphical model — a Markovian hidden state process h1 → h2 → h3 → … and the observation sequence x1, x2, x3, …; each state defines a distribution over observations]

• Hidden state sequence not observed; hence, unsupervised learning.


Motivation

• Why HMMs?

– Handle temporally-dependent data

– Succinct “factored” representation when the state space is low-dimensional (cf. autoregressive models)

• Some uses of HMMs:

– Monitor “belief state” of dynamical system

– Infer latent variables from time series

– Density estimation


Preliminaries: parameters of discrete HMMs

Sequences of hidden states (h1, h2, . . .) and observations (x1, x2, . . .).

• Hidden states $\{1, 2, \dots, m\}$; observations $\{1, 2, \dots, n\}$

• Conditional independences given by the graphical model:

[Figure: h1 → h2 → h3 → …, with emissions x1, x2, x3, …]

• Initial state distribution $\vec\pi \in \mathbb{R}^m$: $\pi_i = \Pr[h_1 = i]$

• Transition matrix $T \in \mathbb{R}^{m \times m}$: $T_{ij} = \Pr[h_{t+1} = i \mid h_t = j]$

• Observation matrix $O \in \mathbb{R}^{n \times m}$: $O_{ij} = \Pr[x_t = i \mid h_t = j]$

$$\Pr[x_{1:t}] = \sum_{h_1} \Pr[h_1] \cdot \sum_{h_2} \Pr[h_2 \mid h_1]\,\Pr[x_1 \mid h_1] \cdots \sum_{h_{t+1}} \Pr[h_{t+1} \mid h_t]\,\Pr[x_t \mid h_t]$$
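This nested sum is just the forward recursion. As a rough illustration, here is a minimal NumPy sketch (not part of the slides; the function name is made up) that computes $\Pr[x_{1:t}]$ directly from $\vec\pi$, $T$, and $O$:

```python
import numpy as np

def joint_probability(x, pi, T, O):
    """Pr[x_1, ..., x_t] for an observation sequence x of symbols in {0, ..., n-1}.

    pi : (m,)   initial distribution, pi[i] = Pr[h1 = i]
    T  : (m, m) transitions, T[i, j] = Pr[h_{t+1} = i | h_t = j]
    O  : (n, m) observations, O[i, j] = Pr[x_t = i | h_t = j]
    """
    alpha = pi.copy()                     # alpha[j] = Pr[h_1 = j]
    for symbol in x:
        # multiply in Pr[x_t | h_t], then marginalize h_t into h_{t+1}
        alpha = T @ (O[symbol] * alpha)
    return float(alpha.sum())             # sum out the final hidden state
```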


Preliminaries: learning discrete HMMs

• Popular heuristic: Expectation-Maximization (EM) (a.k.a. the Baum-Welch algorithm)

• Computationally hard in general under cryptographic assumptions (Terwijn, ’02)

• This work: Computationally efficient algorithm with learning guarantees for invertible HMMs:

Assume $T \in \mathbb{R}^{m \times m}$ and $O \in \mathbb{R}^{n \times m}$ have rank $m$ (here, $n \ge m$).


Our contributions

• Simple and efficient algorithm for learning invertible HMMs.

• Sample complexity bounds for

– Joint probability estimation (total variation distance):

$$\sum_{x_{1:t}} \bigl|\Pr[x_{1:t}] - \widehat{\Pr}[x_{1:t}]\bigr| \le \varepsilon$$

(relevant for density estimation tasks)

– Conditional probability estimation (KL distance):

$$\mathrm{KL}\bigl(\Pr[x_t \mid x_{1:t-1}] \,\big\|\, \widehat{\Pr}[x_t \mid x_{1:t-1}]\bigr) \le \varepsilon$$

(relevant for next-symbol prediction / “belief states”)

• Connects subspace identification to observation operators.


Outline

1. Motivation and preliminaries

2. Discrete HMMs: key ideas

3. Observable representation for HMMs

4. Learning algorithm and guarantees

5. Conclusions and future work



Discrete HMMs: linear model

Let $\vec h_t \in \{\vec e_1, \dots, \vec e_m\}$ and $\vec x_t \in \{\vec e_1, \dots, \vec e_n\}$ (coordinate vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$).

$$\mathbb{E}[\vec h_{t+1} \mid \vec h_t] = T\,\vec h_t \qquad \text{and} \qquad \mathbb{E}[\vec x_t \mid \vec h_t] = O\,\vec h_t$$

In expectation, dynamics and observation process are linear!

e.g. conditioned on $h_t = 2$ (i.e. $\vec h_t = \vec e_2$):

$$\mathbb{E}[\vec h_{t+1} \mid \vec h_t] = \begin{bmatrix} 0.2 & 0.4 & 0.6 \\ 0.3 & 0.6 & 0.4 \\ 0.5 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.6 \\ 0 \end{bmatrix}$$
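A two-line NumPy check of this example (a sketch, not from the slides):

```python
import numpy as np

# Transition matrix from the example above (columns sum to 1).
T = np.array([[0.2, 0.4, 0.6],
              [0.3, 0.6, 0.4],
              [0.5, 0.0, 0.0]])
e2 = np.array([0.0, 1.0, 0.0])   # h_t = 2, i.e. the second coordinate vector

print(T @ e2)                    # [0.4 0.6 0. ] = E[h_{t+1} | h_t = e_2]
```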

Upshot: Can borrow “subspace identification” techniques from linear systems theory.


Discrete HMMs: linear model

Exploiting linearity

• Subspace identification for general linear models

– Use SVD to discover a subspace containing the relevant states, then learn effective transition and observation matrices (Ljung, ’87).

– Analysis typically assumes additive noise (independent of state), e.g. Gaussian noise (Kalman filter); not applicable to HMMs.

• This work: Use subspace identification, then learn an alternative HMM parameterization.


Discrete HMMs: observation operators

For $x \in \{1, \dots, n\}$, define

$$A_x \stackrel{\triangle}{=} T \begin{bmatrix} O_{x,1} & & 0 \\ & \ddots & \\ 0 & & O_{x,m} \end{bmatrix} \in \mathbb{R}^{m \times m}, \qquad [A_x]_{i,j} = \Pr[\, h_{t+1} = i \,\wedge\, x_t = x \mid h_t = j \,].$$

The $\{A_x\}$ are observation operators (Schützenberger, ’61; Jaeger, ’00).

[Figure: one step of the HMM — $T$ advances the hidden state $h_t$ and $O$ emits $x_t$; together they form the operator $A_{x_t}$]


Discrete HMMs: observation operators

Using observation operators

Matrix multiplication handles “local” marginalization of hidden variables: e.g.

$$\Pr[x_1, x_2] = \sum_{h_1} \Pr[h_1] \cdot \sum_{h_2} \Pr[h_2 \mid h_1]\,\Pr[x_1 \mid h_1] \cdot \sum_{h_3} \Pr[h_3 \mid h_2]\,\Pr[x_2 \mid h_2] = \vec 1_m^\top A_{x_2} A_{x_1}\,\vec\pi$$

where $\vec 1_m \in \mathbb{R}^m$ is the all-ones vector.

Upshot: The {Ax} contain the same information as T and O.
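As a quick sanity check (a sketch with a made-up random HMM, not from the slides), one can build the operators $A_x = T\,\mathrm{diag}(O_{x,\cdot})$ and confirm that the operator product reproduces the explicit sum over hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5

# Random column-stochastic HMM parameters (columns sum to 1).
T = rng.random((m, m)); T /= T.sum(axis=0)
O = rng.random((n, m)); O /= O.sum(axis=0)
pi = rng.random(m);     pi /= pi.sum()

# Observation operators: A[x] = T @ diag(O[x, :]).
A = np.array([T @ np.diag(O[x]) for x in range(n)])

# Check Pr[x1, x2] = 1_m^T A_{x2} A_{x1} pi against the explicit sum over h1, h2, h3.
x1, x2 = 1, 4
lhs = np.ones(m) @ A[x2] @ A[x1] @ pi
rhs = sum(pi[a] * O[x1, a] * T[b, a] * O[x2, b] * T[c, b]
          for a in range(m) for b in range(m) for c in range(m))
assert np.isclose(lhs, rhs)
```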


Discrete HMMs: observation operators

Learning observation operators

• Previous methods face the problem of discovering and extracting the relationship between hidden states and observations (Jaeger, ’00).

– Various techniques proposed (e.g. James and Singh, ’04; Wiewiora, ’05).

– Formal guarantees were unclear.

• This work: Combine subspace identification with observation operators to yield an observable HMM representation that is efficiently learnable.



Observable representation for HMMs

Key rank condition: require $T \in \mathbb{R}^{m \times m}$ and $O \in \mathbb{R}^{n \times m}$ to have rank $m$ (rules out pathological cases from hardness reductions).

Define $P_1 \in \mathbb{R}^n$, $P_{2,1} \in \mathbb{R}^{n \times n}$, and $P_{3,x,1} \in \mathbb{R}^{n \times n}$ for $x = 1, \dots, n$ by

$$[P_1]_i = \Pr[x_1 = i], \qquad [P_{2,1}]_{i,j} = \Pr[x_2 = i,\, x_1 = j], \qquad [P_{3,x,1}]_{i,j} = \Pr[x_3 = i,\, x_2 = x,\, x_1 = j]$$

(probabilities of singletons, doubles, and triples).

Claim: Can recover equivalent HMM parameters from $P_1$, $P_{2,1}$, $\{P_{3,x,1}\}$, and these quantities can be estimated from data.
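The slides only state that these quantities can be estimated from data; a minimal sketch of the empirical estimates (the function name and the triple format are assumptions for illustration) could look like this:

```python
import numpy as np

def estimate_moments(triples, n):
    """Empirical P_1, P_{2,1}, {P_{3,x,1}} from i.i.d. observation triples.

    triples : iterable of (x1, x2, x3), symbols in {0, ..., n-1}
    Returns P1 (n,), P21 (n, n) with P21[i, j] ~ Pr[x2 = i, x1 = j],
    and P3 (n, n, n) with P3[x, i, j] ~ Pr[x3 = i, x2 = x, x1 = j].
    """
    P1 = np.zeros(n)
    P21 = np.zeros((n, n))
    P3 = np.zeros((n, n, n))
    count = 0
    for x1, x2, x3 in triples:
        P1[x1] += 1
        P21[x2, x1] += 1
        P3[x2, x3, x1] += 1
        count += 1
    return P1 / count, P21 / count, P3 / count
```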


Observable representation for HMMs

“Thin” SVD: $P_{2,1} = U \Sigma V^\top$, where $U = [\vec u_1 \,|\, \cdots \,|\, \vec u_m] \in \mathbb{R}^{n \times m}$.

The rank condition guarantees $m$ non-zero singular values.

[Figure: range$(U)$ = range$(O)$ — the conditional means $\mathbb{E}[\vec x_t \mid \vec h_t]$ lie in the subspace spanned by $\vec u_1, \dots, \vec u_m$]

New parameters (based on $U$) implicitly transform hidden states:

$$\vec h_t \mapsto (U^\top O)\,\vec h_t = U^\top\,\mathbb{E}[\vec x_t \mid \vec h_t]$$

(i.e., the coordinate representation of $\mathbb{E}[\vec x_t \mid \vec h_t]$ w.r.t. $\{\vec u_1, \dots, \vec u_m\}$).


Observable representation for HMMs

For each $x = 1, \dots, n$,

$$B_x \stackrel{\triangle}{=} (U^\top P_{3,x,1})\,(U^\top P_{2,1})^{+} \qquad (X^{+}\ \text{is the pseudoinverse of}\ X)$$

$$\phantom{B_x} = (U^\top O)\,A_x\,(U^\top O)^{-1}. \qquad \text{(algebra)}$$

The $B_x$ operate in the coordinate system defined by $\{\vec u_1, \dots, \vec u_m\}$ (the columns of $U$).

$$\Pr[x_{1:t}] = \vec 1_m^\top A_{x_t} \cdots A_{x_1}\,\vec\pi = \vec 1_m^\top (U^\top O)^{-1} B_{x_t} \cdots B_{x_1} (U^\top O)\,\vec\pi$$

[Figure: $A_{x_t}$ (built from $T$ and $O$) acts on the hidden state $\vec h_t$, while $B_{x_t}$ acts in the $U$-coordinates: $B_{x_t}\,(U^\top O\,\vec h_t) = U^\top(O\,A_{x_t}\vec h_t)$]

Upshot: It suffices to learn $\{B_x\}$ instead of $\{A_x\}$.
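A small numerical check of the relation $B_x = (U^\top O)\,A_x\,(U^\top O)^{-1}$ (a sketch, not from the slides, assuming a random HMM with strictly positive $\vec\pi$ and full-rank $T$ and $O$; the population moments are written directly from their definitions as sums over $h_1, h_2, h_3$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 6
T = rng.random((m, m)); T /= T.sum(axis=0)   # column-stochastic, full rank a.s.
O = rng.random((n, m)); O /= O.sum(axis=0)
pi = rng.random(m);     pi /= pi.sum()

# Population moments from their definitions (indices a=h1, b=h2, c=h3).
P21 = np.einsum('ib,ba,ja,a->ij', O, T, O, pi)               # Pr[x2=i, x1=j]
P3  = np.einsum('ic,cb,xb,ba,ja,a->xij', O, T, O, T, O, pi)  # Pr[x3=i, x2=x, x1=j]

U = np.linalg.svd(P21)[0][:, :m]     # top m left singular vectors of P_{2,1}
for x in range(n):
    Ax = T @ np.diag(O[x])
    Bx = (U.T @ P3[x]) @ np.linalg.pinv(U.T @ P21)
    assert np.allclose(Bx, (U.T @ O) @ Ax @ np.linalg.inv(U.T @ O))
```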



Learning algorithm

The algorithm

1. Look at triples of observations $(x_1, x_2, x_3)$ in the data; estimate the frequencies $P_1$, $P_{2,1}$, and $\{P_{3,x,1}\}$

2. Compute the SVD of $P_{2,1}$ to get the matrix $U$ of top $m$ left singular vectors (“subspace identification”)

3. Compute $B_x \stackrel{\triangle}{=} (U^\top P_{3,x,1})(U^\top P_{2,1})^{+}$ for each $x$ (“observation operators”)

4. Compute $b_1 \stackrel{\triangle}{=} U^\top P_1$ and $b_\infty \stackrel{\triangle}{=} (P_{2,1}^\top U)^{+} P_1$
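Putting the four steps together, here is a minimal sketch (function and variable names are illustrative, not from the slides) of steps 2–4, given moment estimates such as those produced by the estimation sketch above:

```python
import numpy as np

def spectral_hmm(P1, P21, P3, m):
    """Steps 2-4 of the algorithm.

    P1  : (n,)      estimate of P_1
    P21 : (n, n)    estimate of P_{2,1}
    P3  : (n, n, n) P3[x] is the estimate of P_{3,x,1}
    m   : number of hidden states
    Returns (b1, binf, B) with B[x] the operator for symbol x.
    """
    n = P1.shape[0]
    U = np.linalg.svd(P21)[0][:, :m]           # subspace identification
    pinv_UP21 = np.linalg.pinv(U.T @ P21)      # (U^T P_{2,1})^+
    B = np.array([(U.T @ P3[x]) @ pinv_UP21 for x in range(n)])
    b1 = U.T @ P1
    binf = np.linalg.pinv(P21.T @ U) @ P1      # (P_{2,1}^T U)^+ P_1
    return b1, binf, B
```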


Learning algorithm

• Joint probability calculations:

$$\Pr[x_1, \dots, x_t] \stackrel{\triangle}{=} b_\infty^\top B_{x_t} \cdots B_{x_1}\, b_1.$$

• Conditional probabilities: Given x1:t−1,

$$\Pr[x_t \mid x_{1:t-1}] \stackrel{\triangle}{=} b_\infty^\top B_{x_t}\, b_t \qquad \text{where} \qquad b_t \stackrel{\triangle}{=} \frac{B_{x_{t-1}} \cdots B_{x_1}\, b_1}{b_\infty^\top B_{x_{t-1}} \cdots B_{x_1}\, b_1} \;\approx\; (U^\top O)\,\mathbb{E}[\vec h_t \mid x_{1:t-1}].$$

“Belief states” $b_t$ are linearly related to the conditional hidden states ($b_t$ lives in the hypercube $[-1,+1]^m$ instead of the simplex $\Delta_m$).
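In code, the joint-probability and next-symbol computations are straightforward (a sketch using the outputs of the learning sketch above; normalizing the belief state at every step is equivalent to the definition of $b_t$, since the scaling cancels):

```python
import numpy as np

def joint_prob(x, b1, binf, B):
    """Estimated Pr[x_1, ..., x_t] = b_inf^T B_{x_t} ... B_{x_1} b_1."""
    b = b1
    for symbol in x:
        b = B[symbol] @ b
    return float(binf @ b)

def next_symbol_probs(history, b1, binf, B):
    """Estimated Pr[x_t = . | x_{1:t-1}] via the belief state b_t."""
    b = b1
    for symbol in history:
        b = B[symbol] @ b
        b = b / (binf @ b)        # b_t = B...b1 / (b_inf^T B...b1)
    return np.array([binf @ (B[x] @ b) for x in range(len(B))])
```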


Sample complexity bound

Joint probability accuracy: with probability ≥ 1− δ,

$$O\!\left( \frac{t^2}{\varepsilon^2} \cdot \left( \frac{m}{\sigma_m(O)^2\,\sigma_m(P_{2,1})^4} + \frac{m \cdot n_0}{\sigma_m(O)^2\,\sigma_m(P_{2,1})^2} \right) \cdot \log\frac{1}{\delta} \right)$$

observation triples sampled from the HMM suffices to guarantee

$$\sum_{x_1, \dots, x_t} \bigl|\Pr[x_1, \dots, x_t] - \widehat{\Pr}[x_1, \dots, x_t]\bigr| \le \varepsilon.$$

• m: number of states

• n0: number of observations that account for most of the probability mass

• σm(M): mth largest singular value of matrix M

Also have a sample complexity bound for conditional probability accuracy.


Conclusions and future work

Summary:

• Simple and efficient learning algorithm for invertible HMMs (sample complexity bounds, streaming implementation, etc.)

• Observable representation lets us avoid estimating $T$ and $O$ (n.b. these could be recovered if we really wanted, but less stably).

• SVD subspace captures dynamics of observable representation.

• Salvages old ideas and techniques from automata and control theory.

Future work:

• Improve efficiency, stability

• General linear dynamical models

• Behavior of EM under the rank condition?


Thanks!


Sample complexity bound

Conditional probability accuracy: with probability ≥ 1− δ,

$$\mathrm{poly}\bigl(1/\varepsilon,\ 1/\alpha,\ 1/\gamma,\ 1/\sigma_m(O),\ 1/\sigma_m(P_{2,1})\bigr) \cdot (m^2 + m\,n_0) \cdot \log\frac{1}{\delta}$$

observation triples sampled from the HMM suffices to guarantee

$$\mathrm{KL}\bigl(\Pr[x_t \mid x_{1:t-1}] \,\big\|\, \widehat{\Pr}[x_t \mid x_{1:t-1}]\bigr) \le \varepsilon$$

for all t and all history sequences x1:t−1.

• $n_0$: number of observations that account for $1 - \epsilon_0$ of the total probability mass, where $\epsilon_0 = \sigma_m(O)\,\sigma_m(P_{2,1})\,\varepsilon / (4\sqrt{m})$

• $\gamma = \inf_{\vec v\,:\,\|\vec v\|_1 = 1} \|O\vec v\|_1$, the “value of observation” (Even-Dar et al., ’07)

• $\alpha = \min_{x,i,j}\,[A_x]_{ij}$ (stochasticity requirement)


Computer simulations

• Simple “cycle” HMM: m = 9 states, n = 180 observations

• Train using 20000 sequences of length 100 (generated by true model)

• Results (average log-loss on 2000 test sequences of length 100):

– True model: 4.79 nats per symbol
– EM, initialized with true model: 4.79 nats per symbol
– EM, random initializations: 5.15 nats per symbol
– Our algorithm: 4.88 nats per symbol
– Our algorithm, followed by EM: 4.81 nats per symbol
