
A SPECTRAL ALGORITHM FOR LEARNING HIDDEN MARKOV MODELS THAT HAVE SILENT STATES

DRAFT – PLEASE DO NOT CITE

Dean Alderucci

OVERVIEW

The literature on Probably Approximately Correct (PAC) learning of Hidden Markov Models

(HMMs) reveals an intimate, bijective relationship between the (unknown) sequence of states in the HMM and the string of outputs generated by those states. For simplicity, I refer to each such output of a state as a “letter” from some appropriate alphabet, and a sequence of letters is referred to as a “string”. In the normal HMM model, each state emits one letter at every time step with probability one, and so each letter received by the learning algorithm is known to match some unknown state in the HMM. Existing PAC learning models are predicated upon this feature. Appendix A provides a brief overview of some known results for learning HMMs.

However, there are situations in which the HMM to be learned includes “silent states”, which do not emit a letter. HMMs with silent states can often provide a very compact representation of the phenomena being modeled, and can greatly reduce the number of transitions among states. HMMs with silent states may more naturally model reality, or may be especially advantageous in environments in which computational processing depends on the number of transitions.

As one example, HMMs with silent states are employed in computational biology to model families of related genetic sequences and determine which sequences belong to which family. Related sequences match each other along certain portions of their “strings” but not along other portions due to insertions or deletions at random points (e.g., due to replication errors, evolution or other changes in one of the organisms generating some but not all of the sequences). When applied to this use, the HMMs are known as “profile HMMs”. The states of a profile HMM have a very regular transition structure, as depicted in Figure 1 below.

The discussion that follows outlines an algorithm for learning HMMs with silent states. Clearly, the introduction of an unknown number of silent states complicates the learning process: there is no longer a one-to-one correspondence between outputs, which are observable, and the sequence of states transitioned through, which are never known.

In particular, the algorithm presented here builds upon the algorithm and analysis developed by Hsu, Kakade, and Zhang (hereinafter referred to as the "HKZ algorithm") in order to accommodate silent states. Familiarity with that algorithm and the notation in [Hsu et al.] is assumed. Appendix B presents select highlights from that work. The new algorithm below includes models which appear to be different from the work described by [Hsu et al.]. The new algorithm also includes some fairly straightforward extensions to [Hsu et al.] and to the Observable Operator Model on which [Hsu et al.] builds (and which was first described by [Jaeger]). For these extensions I have noted where in [Hsu et al.] the original work is stated and how exactly their work was modified.

In the discussion below, the term "normal HMM" means an HMM that has no silent states, and the term "silent HMM" means an HMM with at least one silent state. Similarly, a “normal state” is a state that always emits a letter, and a “silent state” is a state that never emits a letter. References to definitions and lemmas from [Hsu et al.] follow the original numbering in that paper, while definitions and lemmas for my algorithm are referenced by letters rather than by numbers.


Profile HMMs

For clarity of exposition, profile HMMs will be used to illustrate several features of silent

HMMs and the algorithm described here. Nevertheless, the algorithm applies to more general HMMs, and is not limited to profile HMMs.

As depicted in Figure 1 below, the states of a profile HMM are grouped into match, insert and delete states. Match states model the positions where different strings (e.g., from different organisms) properly align with each other. For example, a path from the begin state through only match states to the end would model a string without insertions or deletions: a reference string, in essence. Of course, even in this case various different reference strings could be generated based on the output probabilities of each match state.

Insert states have loops to themselves, thereby modeling the insertion of one or more letters between two aligned (matching) positions in strings. Delete states are silent, and thereby model the elimination of a letter from a string. Each delete state has one transition to another delete state, which allows multiple letters to be eliminated from a string.

Figure 1 – A simple Profile HMM

Assumptions

In the discussion below, several assumptions are made for simplification, although not all are

necessary to the analysis presented. These assumptions are valid for profile HMMs such as the one depicted above. Later it will be noted which assumptions can be relaxed.

1. The HMM has an end state and no cycles, so all strings are of arbitrary but finite length.

2. No silent state has a loop to itself.

3. There are no "backwards" transitions. For every transition from state i to state j:
      i ≤ j if i is a non-silent state
      i < j if i is a silent state

4. The states in the HMM exhibit a regular state structure (repeated groups of match, insert and delete states) and a regular transition structure:

   Every match state Mi transitions:
      to insert state Ii
      to delete state Di+1
      to match state Mi+1

   Every insert state Ii transitions:
      to itself Ii
      to match state Mi+1


   Every delete state Di transitions:
      to delete state Di+1
      to match state Mi+1

A more general way of expressing assumption 4 is that the HMM has a compact representation which bounds the number of transitions in the HMM to O(m), rather than O(m²) in a general HMM, where m is the number of states. For example, with the minor exception of the first and last groups, there are equal numbers (roughly m/3) of match, insert and delete states in the HMM of Figure 1. Each match state has three transitions out, each insert state has two transitions out, and each delete state has two transitions out, so a profile HMM with m states has about 7m/3 transitions. Similarly, the number of transitions into any state is a constant, and is ≤ 3.

Primary Modifications to the HKZ Algorithm

Lemma 1 of [Hsu et al.] introduces the notation

Ax = T diag(Ox,1, ..., Ox,m)

in which T is the state transition probability matrix of the HMM and Ox,i is the probability of emitting letter x from state i, so that diag(Ox,1, ..., Ox,m) is a diagonal matrix in which the ith diagonal element is the probability of emitting letter x from state i. For simplicity we will use the more concise notation:

diag(Ox) := diag(Ox,1, ..., Ox,m)

so

Ax = T diag(Ox)

This expression renders the computation of joint probabilities for normal HMMs very compact. The joint probability of a sequence of letters x1, x2, ..., xt (i.e. the probability of the normal HMM outputting that sequence) is:

Prob[x1, x2, ..., xt] = 1T Axt Axt-1 ... Ax1 π

in which π represents the initial state distribution. The above equation can be expanded to:

= 1T T diag(Oxt) T diag(Oxt-1) ... T diag(Ox1) π

and this representation will be more amenable to various reductions below.
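To make the mechanics concrete, the following sketch (not part of [Hsu et al.]; the function names and the toy parameters are illustrative assumptions) computes this joint probability directly from known HMM parameters, using the Ax operators just described.

import numpy as np

def A_x(T, O, x):
    # A_x = T diag(O_x): entry (i, j) is Prob[emit letter x in state j, then move to state i | state j]
    return T @ np.diag(O[x, :])

def joint_prob(letters, T, O, pi):
    # Prob[x1, ..., xt] = 1^T A_{xt} ... A_{x1} pi
    v = pi.copy()
    for x in letters:          # A_{x1} is applied first, A_{xt} last
        v = A_x(T, O, x) @ v
    return v.sum()             # the leading 1^T sums over all final states

# Toy normal HMM with 2 states and alphabet {0, 1}; columns of T and O sum to 1.
T = np.array([[0.7, 0.4],
              [0.3, 0.6]])
O = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([1.0, 0.0])
print(joint_prob([0, 0, 1], T, O, pi))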

Definition A: Let S denote an m×1 vector of probabilities, where each element Si of S is the probability of being in state i after having just emitted the most recent letter.

In a normal HMM, a superscript on vector S denotes the number of letters that were output, and also the (discrete) time step at which the probabilities are valid. The vector S(t+1) represents the state probabilities that exist immediately after the time at which the probabilities are S(t).


Observation A: In a normal HMM,

S(t+1) = diag(Ox) T S(t)

can be viewed as a vector of conditional state probabilities combined with a specific letter x output. Specifically, each element Si(t+1) of the vector S(t+1) indicates the probability of being in state i and having emitted letter x, given that in the previous time step the state probabilities were those specified by the vector S(t). Note that this grouping, {diag(Ox) T}, departs slightly from the notation of [Hsu et al.], which groups the two factors in the opposite order as Ax = T diag(Ox). The reasons for this difference will soon be clear.

Thus {diag(Ox) T} can be viewed as an operator on a state probability vector S(t). Chaining together these operators, we can determine the probability of generating multiple consecutive letters and ending up in different states, conditioned upon the current state probabilities. For example, the expression:

diag(Ox) T diag(Oy) T S(t) = S(t+2)

represents the probability of being in each state, and having emitted the letters y and then x, given the starting point of state probability vector S(t). Note that S(t) can be viewed (in the normal HMM) as implicitly representing that there have been t time steps and therefore t letters were previously output, even though S(t) does not in any way store which letters were output.

Observation B: The probability that the first letter emitted is x1, after the first time step in a normal HMM, is:

S(1) = diag(Ox1) π

Theorem A: In a silent HMM, the operator

Cx = T̄ T' diag(Ox) T

where T̄ (read "T-bar") is the product over all silent states r of the operators Rr defined below,

T̄ = ∏ r∈silent Rr

satisfies the equation:

S(t+1) = T̄ T' diag(Ox) T S(t)

where t ≥ 1 corresponds to successive letters output by the HMM.

This operator Cx is therefore analogous to the operator for normal HMMs:

diag(Ox) T

This expression and its constituent terms will be explained in detail. However, first it will be helpful to understand the paradigm upon which the expression is based. This paradigm is the foundation for all changes to the [Hsu et al.] algorithm described in this paper.


Handling Silent States

In a silent HMM, a normal state will output a letter while a silent state will not, so each time

step will in general not correspond to a letter output. In other words, each letter output corresponds to one or more state transitions. Therefore, for each letter output, we must consider all possible silent states which could have been traversed after leaving the normal state which output the letter.

Definition B: An interval (referred to below interchangeably as a "step") in a silent HMM includes all state transitions (through any normal and silent states) immediately before the outputting of the next letter.

For easy comparison with normal HMMs, we continue using the notation t to denote successive intervals in the HMM's activity, but incrementing t denotes outputting a letter, not a single state transition. Therefore, one increment in t can correspond to many transitions through silent states, which do not output any letters.

First note that for each normal state in the silent HMM, outputting a letter is guaranteed and so this increments t. Therefore, for normal state j, the probability of being in state j in the next increment of t is essentially as described above for normal HMMs:

element j of {S(t+1)} = element j of {diag(Ox) T S(t)}

In other words, for a normal state j, we transition from some other state (whether normal or silent) into j and output some letter x, and there are no other transitions. The probabilities of transitioning and outputting a letter depend on the probabilities during the preceding time step of which state we are in.

This operator need only calculate probabilities for rows that correspond to normal states. In practice this would mean the matrix multiplications would be simplified, since all rows for silent states would be ignored. However, for ease of notation the operator could instead be represented by a modified diagonal matrix: a matrix identical to diag(Ox), except that the rows corresponding to silent states are all zero. Multiplication by this modified diagonal matrix would yield a state probability vector S which had zero for all silent states.

Turning now to silent states, in the same step (i.e. corresponding to one letter output), there can be transitions from the normal state through one or more silent states.

Observation C: In a single step, the HMM transitions through exactly one normal state, at the beginning of the step.

This clearly follows from Definition B. Since a step includes all activity immediately before the outputting of the next letter, the next step must commence with the outputting of a letter. Also, since each normal state outputs one letter, only one normal state is included in a step.

Observation D: In a single step, the HMM transitions through any silent states only after transitioning through exactly one normal state.


Much like Observation C, since there are only normal and silent states, and since the single normal state begins a step, all other states, which must be silent, must come after that normal state during a single step.

Observation E: From the above two Observations, we can model the transitions through silent states as two

phases representing the possible events after a letter is output by some normal state. In the first phase, we consider the probability of transitioning into the silent state from some normal state that has just emitted a letter. After all such transitions out of normal states are concluded, we turn to the second phase, in which we consider the probability of transitioning into the silent state from all other silent states.

First Phase of silent state transitions during one step:

For each silent state j, we can transition into j from any normal state during the same step, hence there is no change in superscript on the state probability vector S. Therefore we can compute the (updated) transition probability as:

row j of {T' S(t)} = row j of {S(t)}

where T' is a copy of the transition matrix T, except that T' includes only the columns corresponding to normal states. Therefore, the product T' S(t) only takes into account elements of S(t) that correspond to normal state probabilities, and only generates probabilities for silent states. We must also introduce scaling of previously-computed probabilities for normal states, because the probability of being in a normal state implies that there was not a (permissible) transition from a normal state to a silent state. The existing probabilities of being in a normal state must be scaled by the sum of all probabilities of transitioning to a normal state. This sum is exactly the probability that there is no transition from that normal state to any silent state. In summary, the full operator T' has the following characteristics:

normal state rows: a diagonal element equal to the sum of all transition probabilities to any normal state,
      i.e. scales the previously-computed values for normal states

silent state rows: a modified copy of the corresponding row from T,
      i.e. outputs probabilities for silent states, and only considers (is based on) normal state probabilities

Note that since all silent states are ignored by the operator T', the modified diagonal matrix suggested above is unnecessary. We can represent the calculation simply as diag(Ox), since silent state calculations will be ignored by the next operator T'.
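As an illustration only, the first-phase operator could be realized as follows; this is a sketch under the assumptions that T is column-stochastic (T[i, j] = Prob[next state i | current state j]) and that `silent` is a boolean mask over the states, and the function name is hypothetical.

import numpy as np

def build_T_prime(T, silent):
    # T' acts on the vector produced by {diag(Ox) T}:
    #  - silent rows: the rows of T restricted to normal-state columns, i.e. the probability
    #    of entering each silent state from the normal state that just emitted a letter
    #  - normal rows: a diagonal scaling equal to the total probability of moving from that
    #    normal state to any normal state (i.e. of NOT entering a silent state in this step)
    m = T.shape[0]
    normal = ~silent
    Tp = np.zeros((m, m))
    Tp[np.ix_(silent, normal)] = T[np.ix_(silent, normal)]
    stay = T[normal, :].sum(axis=0)          # for each column j: Prob[j -> some normal state]
    for j in np.flatnonzero(normal):
        Tp[j, j] = stay[j]
    return Tp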

Second Phase of silent state transitions during one step:

By assumption 3, any transition from one silent state to another must be from lower state

number to higher state number. We therefore iterate through each silent state in ascending order, and consider potential predecessor silent states from lowest to highest state number. This can be understood from the following procedure, which enforces the sequence required by transition ordering:


for j = 1 .. m (for only silent states j)
    prob[being in state j] = as calculated in phase one
    for k = 1 .. (j - 1) (for only silent states k)
        prob[being in state j] += prob[being in state k] * prob[transition from k to j]

We can express this process with the following matrix equations for updating probabilities:

T̄ S(t) = ( ∏ j∈silent Rj ) S(t)

where each Rj includes:
    for each silent row i < j, a corresponding row with a diagonal element to scale the probability
    for (silent) row j, a row from T that includes only columns corresponding to silent states
    for each silent row i > j, a zero row
    for each normal row, a corresponding row from the identity matrix I

By these properties, multiplying a state probability vector S by Rj induces the following row transformations:

silent state rows < j: Rj scales the previously-computed values for those silent states, as explained in phase one, in order to account for the probability of not transitioning into the next silent state

silent state row j: Rj computes the probability of transitioning into state j from another silent state

silent state rows > j: Rj returns zero, as these probabilities have not been calculated

normal state rows: Rj does not alter the previously computed probabilities for normal states

Changing rows above j to zero probability is not strictly necessary: subsequent applications of the Rj operators will eventually overwrite the probability for each silent state j, and since no silent state has a self loop, the current probability of being in state j is not considered.

Note that the ordering described above requires that the Rj operators are applied to S in ascending order. For example, if 3, 6 and 9 are the silent states in an HMM, then the second phase updates would be:

T̄ S(t) = R9 R6 R3 S(t)
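Equivalently, the second phase can be carried out directly on the state probability vector, mirroring the pseudocode above; the subtraction from lower silent states plays the role of the diagonal scaling in the Rj operators. This is a sketch under the same assumptions and naming as the earlier build_T_prime sketch.

import numpy as np

def phase_two(s, T, silent):
    # Silent-to-silent transitions within one step, processed in ascending state order.
    s = s.copy()
    silent_idx = np.flatnonzero(silent)
    for j in silent_idx:
        for k in silent_idx[silent_idx < j]:
            flow = s[k] * T[j, k]    # mass moving from silent state k to silent state j
            s[j] += flow
            s[k] -= flow             # i.e. scale s[k] by (1 - T[j, k]), as in Rj
    return s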

We denote the combined operator (on normal and silent rows) by

Cx = T̄ T' diag(Ox) T

and therefore, as explained above:

S(t+1) = [ T̄ T' diag(Ox) T ] S(t)

= [ ( ∏ j∈silent Rj ) T' diag(Ox) T ] S(t)

Note that the operator Cx can be viewed as modeling the change in state probabilities according to the three types of events that occur to change states: (i) outputting a letter from a normal state, (ii) transitioning from a normal state to a silent state, and (iii) transitioning from a silent state to one or more other silent states.


Accordingly, each of the three constituent operators

{diag(Ox) T},   T',   T̄

by construction is an operator on a vector of state probabilities, and returns a vector of (updated) state probabilities.

Note that the paradigm of 'steps' commencing immediately before a letter is output mandates a different notation than Ax in Lemma 1 of [Hsu et al.]. Ax = T diag(Ox), which means that Ax, under the current paradigm, combines the emitting of a letter from a normal state in one step with the transition to another (potentially normal) state in the next step. In contrast, in the operator Cx the two components {T diag(Ox)} are multiplied in the opposite order from that in Ax.

Complexity of Operators

Each of the constituents of the operator Cx is efficient. By assumption 4, each state has a constant number of transitions in and transitions out. Therefore, each of the constituent operators involves O(m) operations, and in particular O(1) per row (state) involved by the operator. Even without assumption 4, the number of operations per row would be just O(m).

Definition C

Let π' denote the vector of probabilities of being in a state immediately before the first letter is output, i.e. immediately before the first 'step'.

Note that this is different than the probability π that the HMM starts in a state. For example, the HMM can start in one silent state and transition to one or more other silent states before outputting the first letter. However, it is possible that only one silent state has a non-zero probability in π.

The vector π' can be calculated using the above operators. Since before the first letter is output, the only possible transitions are among silent states, we can conclude:

π' = T̄ π

Theorem B

The probability that the first letter generated by the HMM is x can be calculated by:

Prob[x1 = x] = 1T Cx π'

where 1T is a row vector in which every element is 1.

From the definition of π', it is easy to see that

S(1) = Cx π'

where S(1) is the vector of state probabilities after the first letter x is output. The 1T operator sums the column vector across all states.

Theorem C

The joint probability that the HMM generates a sequence of letters x1, ..., xt can be calculated by:

Prob[x1, …, xt] = 1T Cxt … Cx1 π'

Much like Theorem B above, each successive application of a particular Cx operator on the initial state


probability vector π' generates a state probability vector S, conditioned upon the particular letter being output. Summing across all state probabilities at the end yields the probability of the sequence.
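Putting the pieces together, here is a sketch of the Cx update and of the string probability of Theorem C, reusing build_T_prime and phase_two from the sketches above; it assumes the true parameters are known and that the columns of O corresponding to silent states are all zero (as noted for Theorem D below).

import numpy as np

def apply_Cx(s, x, T, O, silent):
    # S(t+1) = T-bar T' diag(Ox) T S(t): transition, emit x from a normal state,
    # then the two phases of silent-state transitions.
    s = np.diag(O[x, :]) @ (T @ s)       # silent entries become zero since O is zero there
    s = build_T_prime(T, silent) @ s     # phase one: normal -> silent, with scaling of normal rows
    return phase_two(s, T, silent)       # phase two: silent -> silent, ascending order

def string_prob(letters, T, O, pi, silent):
    # Theorem C: Prob[x1, ..., xt] = 1^T C_{xt} ... C_{x1} pi',  with pi' = T-bar pi.
    s = phase_two(pi.copy(), T, silent)  # pi' accounts for silent transitions before the first letter
    for x in letters:
        s = apply_Cx(s, x, T, O, silent)
    return s.sum()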

Theorem D

The vector of probabilities of the first letter output by the HMM is:

P1 = O π'

P1 is defined in section 3.1 of [Hsu et al.], and the above equation is very similar to that in [Hsu et al.]. By the definition of O, multiplying O by a state probability vector yields a vector of probabilities, each element of which represents the probability of some state outputting the particular letter. Recall that from Definition B the start of a step necessarily means a transition to a normal state and the outputting of a letter.

For columns of O representing silent states, all elements are zero, whereas columns of O representing normal states are probability vectors and so must sum to 1.

Definition D

Let Q(t) denote a matrix of probabilities of transitioning across step t, i.e. from immediately before step t to the end of step t, unconditioned on the letter output.

That is, each element Qij of Q represents the probability of being in state j at the end of step t-1, i.e. immediately before outputting a letter in step t, and

then being in state i at the end of step t, i.e. immediately before the next letter is output,

all regardless of the particular letter output during step t.

For example, element i, j of Q(1) represents the probability of being in state j immediately before outputting the first letter, and then being in state i immediately before the second letter is output (i.e. at the end of step 1), regardless of the first letter output.

Theorem E

Q(1) = T̄ T' T diag(π')

By definition, π' is the vector of state probabilities immediately before step 1. From Theorem B,

S(1) = Cx π'

where S(1) is the vector of state probabilities after the first letter x is output. If we alter this equation slightly to

Cx diag(π')

then the resulting m x m matrix has the following properties. The element in row i column j of this product is element j of π' times element i,j of Cx. This is exactly the probability of starting step 1 in state j, outputting the letter x, and ending step 1 in state i. Therefore the sum over all possible letters yields the probability of being in each state at the end of step 1, not conditioned on (regardless of) the letter output:

∑x Cx diag(π') = ∑x ( T̄ T' diag(Ox) T ) diag(π')

= T̄ T' ( ∑x diag(Ox) ) T diag(π')

= T̄ T' T diag(π')

which is exactly Q(1). (In the last step, the sum ∑x diag(Ox) equals the identity on the normal-state coordinates and zero on the silent-state coordinates; since the operator T' ignores the silent-state entries of its input, the sum can be replaced by the identity.)
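A sketch of Theorem E, which doubles as a numerical sanity check: Q(1) assembled from T, T' and T̄ should equal the sum over letters of Cx diag(π'). The helpers build_T_prime, phase_two and apply_Cx are those from the earlier sketches, and all names remain illustrative.

import numpy as np

def phase_two_matrix(T, silent):
    # The matrix T-bar, obtained by applying the phase-two procedure to each basis vector.
    m = T.shape[0]
    return np.column_stack([phase_two(np.eye(m)[:, j], T, silent) for j in range(m)])

def Q1(T, pi, silent):
    # Theorem E: Q(1) = T-bar T' T diag(pi'),  with pi' = T-bar pi.
    Tbar = phase_two_matrix(T, silent)
    pi_prime = Tbar @ pi
    return Tbar @ build_T_prime(T, silent) @ T @ np.diag(pi_prime)

# Sanity check: summing Cx diag(pi') over all letters x should match Q1(T, pi, silent),
# where column j of Cx diag(pi') is pi'[j] * apply_Cx(e_j, x, T, O, silent).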

Corollary

The product

Cx Q(1) is an m x m matrix, in which the element at row i column j represents the probability of:

being in state j immediately before the first letter, and then being in state i immediately before the third letter is output,

not conditioned upon the particular first letter, but conditioned upon the second letter being x.

Theorem F

The matrix of probabilities of the first pair of letters output by the HMM is:

P2,1 = O Q(1) OT

P2,1 is defined in section 3.1 of [Hsu et al.]. From the definition of Q(1), multiplying a row i of Q(1) by all columns of OT yields a row vector representing the probability of the first letter being each possible letter, conditioned upon ending in state i at the end of the first step (i.e. immediately before the second letter is output). Therefore Q(1) OT is the matrix over all states at the end of the first step. Multiplying a row j of O by this matrix yields a row vector of probabilities representing the probability of the first letter being each possible letter, conditioned upon the second letter being j. Therefore the product is exactly the pairwise probabilities P2,1.

Theorem G

The matrix of probabilities of the first three letters output by the HMM, conditioned on the second letter being x, is:

P3,x,1 = O Cx Q(1) OT

P3,x,1 is defined in section 3.1 of [Hsu et al.]. From the above corollary to Theorem E, multiplying a row i of (Cx Q(1)) by all columns of OT yields a row vector representing the probability of the first letter being each possible letter, conditioned upon ending in state i at the end of the second step and conditioned upon the second letter being x. Therefore (Cx Q(1)) OT is the matrix over all states at the end of the second step. Multiplying a row j of O by this matrix yields a row vector of probabilities representing the probability of the first letter being each possible letter, conditioned upon the second letter being x and the third letter being j. Therefore the product is exactly the triplet probabilities P3,x,1.
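The three observable quantities can be computed from known parameters as follows; in the learning setting they would instead be estimated empirically from observed strings. This sketch reuses Q1, phase_two_matrix and apply_Cx from above, and the names are again illustrative.

import numpy as np

def Cx_matrix(x, T, O, silent):
    # The operator Cx as an explicit matrix, built column by column.
    m = T.shape[0]
    return np.column_stack([apply_Cx(np.eye(m)[:, j], x, T, O, silent) for j in range(m)])

def moments(T, O, pi, silent):
    # Theorems D, F, G:  P1 = O pi',  P_{2,1} = O Q(1) O^T,  P_{3,x,1} = O Cx Q(1) O^T
    pi_prime = phase_two_matrix(T, silent) @ pi
    Q = Q1(T, pi, silent)
    P1 = O @ pi_prime
    P21 = O @ Q @ O.T
    P3x1 = {x: O @ Cx_matrix(x, T, O, silent) @ Q @ O.T for x in range(O.shape[0])}
    return P1, P21, P3x1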

We are now prepared to show that the above modifications permit the spectral algorithm of [Hsu et al.] to handle silent states. As described below in Appendix B, that algorithm computes the following parameters from the observed letters:

b1 = UT P1

b∞ = (P2,1T U)+ P1

Bx = UT P3,x,1 (UT P2,1)+ for each letter x
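A sketch of this computation; as in [Hsu et al.], U is taken to be the matrix of top left singular vectors of P2,1, and the superscript + denotes the Moore-Penrose pseudo-inverse. In practice P1, P2,1 and P3,x,1 would be empirical estimates; the function name and the choice of m are illustrative.

import numpy as np

def spectral_params(P1, P21, P3x1, m):
    U, _, _ = np.linalg.svd(P21)
    U = U[:, :m]                                        # top-m left singular vectors of P_{2,1}
    b1 = U.T @ P1                                       # b1 = U^T P1
    binf = np.linalg.pinv(P21.T @ U) @ P1               # b_inf = (P_{2,1}^T U)^+ P1
    Bx = {x: U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21)  # Bx = U^T P_{3,x,1} (U^T P_{2,1})^+
          for x in P3x1}
    return b1, binf, Bx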


We can prove the following lemmas which are virtually identical to those in Lemma 3 of [Hsu et al.]:

Lemma A: The computed parameters, which are based exclusively on observable letters, are related to the unobserved HMM parameters as follows:

1. b1 = UT O π'

2. b∞T = 1T (UT O)-1

3. Bx = (UT O) T̄ T' diag(Ox) T (UT O)-1

Lemma A.1 follows readily from the fact that P1 = O π' as shown in Theorem D.

Proof of Lemma A.2:

The equation for b∞T is

b∞T = ( (P2,1T U)+ P1 )T = P1T (UT P2,1)+

We can re-express P1T from Theorem D:

P1T = π'T OT

Note that since T̄ T' T has columns that each sum to 1, we can write:

1T T̄ T' T = 1T

where 1T denotes a row vector in which every element is 1. Further, any row vector π'T can be expressed as

π'T = 1T diag(π')

Therefore

P1T = 1T T̄ T' T diag(π') OT

which can further be rewritten:

P1T = 1T (UT O)-1 (UT O) T̄ T' T diag(π') OT

which by Theorem F (together with Theorem E) is

P1T = 1T (UT O)-1 UT P2,1

We can plug this value for P1T into the expression for b∞T:

b∞T = 1T (UT O)-1 UT P2,1 (UT P2,1)+

b∞T = 1T (UT O)-1

Proof of Lemma A.3:

The quantity is defined as:

Bx = UT P3,x,1 (UT P2,1)+


We can expand P3,x,1 by Theorem G:

Bx = UT O Cx Q(1) OT (UT P2,1)+

by Theorem E:

Bx = UT O Cx T̄ T' T diag(π') OT (UT P2,1)+

which can further be rewritten

Bx = UT O Cx (UT O)-1 (UT O) T̄ T' T diag(π') OT (UT P2,1)+

and by Theorem F can be simplified

Bx = UT O Cx (UT O)-1 UT P2,1 (UT P2,1)+

and further simplified:

Bx = (UT O) Cx (UT O)-1

which by Theorem A is:

Bx = (UT O) T̄ T' diag(Ox) T (UT O)-1

Theorem H

Prob[x1, x2, ..., xt] = b∞T Bxt Bxt-1 ... Bx2 Bx1 b1

Proof: The right hand side of the expression can be rewritten using the equations in Lemma A as:

1T (UT O)-1 (UT O) Cxt (UT O)-1 ... (UT O) Cx1 (UT O)-1 UT O π'

= 1T Cxt ... Cx1 π'

and by Theorem C:

Prob[x1, ..., xt] = 1T Cxt ... Cx1 π'
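Once the parameters have been estimated, Theorem H is what turns them into probability estimates for arbitrary strings; a sketch, reusing spectral_params and moments from the earlier sketches:

def predict_prob(letters, b1, binf, Bx):
    # Theorem H: Prob[x1, ..., xt] = b_inf^T B_{xt} ... B_{x1} b1
    v = b1.copy()
    for x in letters:            # B_{x1} applied first, B_{xt} last
        v = Bx[x] @ v
    return float(binf @ v)

# Example usage:
#   P1, P21, P3x1 = moments(T, O, pi, silent)
#   b1, binf, Bx = spectral_params(P1, P21, P3x1, m=T.shape[0])
#   predict_prob([0, 1, 0], b1, binf, Bx)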

Modified Assumptions of [Hsu et al.]

Condition 1 of [Hsu et al.] is that every element of π > 0, and both T and O have full rank. The

modified analysis above requires only that P21 be full rank and that UT O has full rank. These modified requirements are also consequences of Condition 1 of [Hsu et al.]. Note that the fact that UT O has full rank means that this quantity can be inverted, as in Lemmas A.2 and A.3.

Section 2.1 of [Hsu et al.] also requires that the number of different letters in the alphabet of the HMM be greater than the number of states. This is not required if P21 has full rank.

[Remaining Work

The basic algorithm of [Hsu et al.] (i.e. computation of the P matrices and the b quantities) is altered only modestly, though the appropriate quantities and analysis have undergone significant modification. What now remains is to prove the accuracy of the probability estimates and the sample complexity of the new algorithm.]


References

N. Abe, M. Warmuth, “On the Computational Complexity Of Approximating Distributions By Probabilistic Automata”, Machine Learning, Vol. 9, Issue 2, pp. 205-260 (1992).

A. Clark, F. Thollard. "PAC-learnability of Probabilistic Deterministic Finite State Automata", Journal of Machine Learning Research, Vol. 5, pp. 473-497 (2004).

P. Dupont, F. Denis, Y. Esposito, "Links Between Probabilistic Automata And Hidden Markov Models: Probability Distributions, Learning Models And Induction Algorithms", Pattern Recognition, Vol. 38, pp. 1349-1371 (2005).

R. Gavalda, P. Keller, J. Pineau, et al. “PAC-Learning of Markov Models with Hidden State”, Machine Learning: ECML 2006, Proceedings, Lecture Notes In Computer Science, Vol. 4212, pp. 150-161 (2006).

D. Hsu, S. Kakade, T. Zhang, "A Spectral Algorithm for Learning Hidden Markov Models", Proceedings of the 22nd Annual Conference on Learning Theory (2009).

H. Jaeger. "Observable operator models for discrete stochastic time series". Neural Computation, vol. 12, issue 6, pp. 1371 – 1398 (2000).

M. Kearns, Y. Mansour, R. Rubinfeld, D. Ron, R. Schapire, L. Sellie, "On the Learnability Of Discrete Distributions", Proceedings of the 26th Annual ACM Symposium on Theory of Computing (1994).

M. Kearns, L. Valiant, "Cryptographic Limitations On Learning Boolean Formulae And Finite Automata", Journal of the ACM 41(1) pp. 67–95 (1994).

R. Lyngso, C. Pedersen, "Complexity of Comparing Hidden Markov Models", Algorithms And Computation, Proceedings, Lecture Notes In Computer Science, Vol. 2223, pp. 416-428 (2001).

E. Mossel, S. Roch, “Learning Nonsingular Phylogenies And Hidden Markov Models”, The Annals of Applied Probability, Vol. 16, No. 2, 583–614 (2006).

D. Ron, Y. Singer, N. Tishby, "On the Learnability and Usage of Acyclic Probabilistic Finite Automata", Journal Of Computer And System Sciences, Vol. 56, Issue 2, pp. 133-152 (1998).

S. Siddiqi, B. Boots, G. Gordon, "Reduced Rank Hidden Markov Models", Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (2010).

S. Terwijn, “On the Learnability of Hidden Markov Models”, Grammatical Inference: Algorithms And Applications, Lecture Notes In Artificial Intelligence, Vol. 2484, pp. 261-268 (2002).


APPENDIX A

Some Basic Results in Learning Hidden Markov Models

A.1 Overview

• Basics of Hidden Markov Models
• Primary Definitions and Concepts
• Results for Learnability of General Hidden Markov Models
• Approximating Hidden Markov Models

A.2 Basics of Hidden Markov Models

A Hidden Markov Model (HMM) is a widely-used tool for modeling discrete distributions in

many areas such as speech recognition, natural language processing and DNA sequence alignment. Informally, an HMM is a type of finite state automaton. Starting from an initial state, at each time step the HMM emits one random output (a "letter") from its alphabet, and transitions to the next state at random. An HMM is “hidden” because the learner never observes which state it is in. The only data the learning algorithm is provided is the sequence of outputs (a "string") induced by the (hidden) state sequence.

Figure A-1 – A simple HMM and one output string

At each time step, both the next state and the letter emitted are determined only by the current state. This is known as the "Markov property" or the "memoryless" property. For example, in Figure A-1, from state S1, there is a 50% chance that the next state is S2, and a 50% chance the next state is S3. Also from state S1, there is a 60% chance that the letter output is "A", and a 40% chance it is "B".


[Figure A-1 depicts three states S1, S2 and S3, each with its own emission probabilities over the alphabet {A, B, C, D} (one state emits A 60% / B 40%, another A 50% / C 25% / D 25%, another B 10% / C 45% / D 45%), and transitions from S1 to S2 and to S3 with 50% probability each; one example run passes through S1 and S3 and outputs "BC".]


Formally, an HMM is defined by several characteristics.

Definition: An HMM consists of the following:

• S - a set of states
• Σ - an alphabet of possible observations (also called "emissions")
• T - transition probabilities between states
• O - observation (emission) probabilities for each state
• π - probabilities of the initial state

If the number of states is N, then:
    the initial state probability vector π is N x 1 and sums to 1
    the transition matrix T is N x N, in which each column sums to 1
    the observation matrix O is |Σ| x N, in which each column sums to 1

Note also that one or more states are "final" states which signal the end of a run of the HMM. On each run the HMM provides the learning algorithm with only the sequence of letters that are output as the HMM transitions through a sequence of (hidden) states until reaching a final state. The strings output by an HMM typically vary in length since the path from the initial state to the final state is stochastic and can include different numbers of states.
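A minimal sketch of this definition, sampling one string from such an HMM; the designated final state, the column-stochastic conventions and all names are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def sample_string(T, O, pi, final_state, alphabet):
    # Walk the hidden state sequence, emitting one letter per state, until the final state is reached.
    state = rng.choice(len(pi), p=pi)
    letters = []
    while state != final_state:
        letters.append(alphabet[rng.choice(O.shape[0], p=O[:, state])])
        state = rng.choice(T.shape[0], p=T[:, state])
    return "".join(letters)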

Learning Goals

In general, different applications have different goals in learning HMMs. One goal is to learn

the topology of an HMM, i.e. to learn the particular transition probabilities among states, including which transitions between states cannot occur. In some applications, when learning the topology of an unknown HMM the learning algorithm does not even know the number of states. In other applications, the topology is simpler to learn because the number of states is known or assumed, and only transition probabilities among states must be learned. This is sometimes referred to as parameter estimation.

Another possible learning goal is to learn only the characteristics of the strings that the HMM can output – i.e. to learn the distribution of the strings. If the learning algorithm is interested in the distribution of strings but not the topology of the HMM that generates those strings, then the learning algorithm might for example use a hypothesis class that is not even an HMM.

In practice, HMMs are very often learned by a type of "Expectation-Maximization" algorithm which iteratively re-estimates likelihoods of strings and converges to a local optimum. Although these algorithms are very fast, they suffer from two significant limitations. First, since the algorithm is not guaranteed to converge to a global optimum, it gives no performance guarantees. Second, these algorithms require that the HMM topology be known, and so they only learn the transition and observation probabilities. This requirement can be especially problematic, since in many problems the number of states is not known even to any nontrivial approximation.

Note that, unlike some other PAC learning problems, there are no negative examples in learning HMMs. Each output string is generated by the HMM and therefore must be considered a positive example. In fact, the strings are generated according to the HMM's distribution. This distribution is often the very thing to be learned rather than merely an inconvenient feature of the learning process.


A.3 Main Results for Learning HMMs

Learning general HMMs is hard. Nevertheless, progress has been made in recent years in

learning specific classes of HMMs. Before proceeding with the main results, the following observations and definitions will be helpful.

Observation 1: For any distribution of strings output by an HMM, there are an infinite number of HMMs with different states and topologies that can generate that distribution exactly.

This Observation means that, absent any further information or constraints, an algorithm that learns an HMM topology can never know whether the hypothesis HMM has the same topology as the HMM that generated the strings used in the learning algorithm. At best, the learned topology would produce a distribution of strings equivalent to the distribution of the target HMM.

A simple example shows this. Consider the following pair of HMMs:

Figure A-2 – Pair of HMMs with equivalent output distributions

Assume in both the top and bottom HMMs, both S1's have exactly the same probability of outputting each letter in the HMM's alphabet, and both S5's also have equivalent output probabilities. In the top HMM, state S2 follows state S1 with certainty. In the bottom HMM, each of states S3 and S4 follows S1 with probability 1/2. If both S3 and S4 have the same probability of outputting letters as each other and as S2, then the two HMMs produce the same distribution and cannot be distinguished from their outputs alone.

In fact, this observation is at the heart of many of the hardness results in learning HMMs. If, for example, S3 and S4 had almost the same output distributions, it would be very hard to distinguish the two states, and therefore it would be hard to even know that there were two states rather than one state. For example, assume the following output probabilities:

State S3: output letter A with probability 1
State S4: output letter A with probability 1 – ε (for some small ε)
          output letter B with probability ε

On most trials both states S3 and S4 would output A, and even once a few B's were output, it would not be easy to determine whether there were two states, or just one state which occasionally output B.


Two types of finite automata are related to HMMs. Since both have been studied extensively they will be useful in simplifying some of the proofs related to HMMs. However, since the present topic is HMMs and not finite automata most of the related results are stated without proof.

Definition: A Probabilistic Deterministic Finite Automaton (PDFA) is a finite automaton that has, for each state, a probability distribution over the transitions going out from that state, and each transition corresponds to a different letter that is output.

Such a finite automaton is deterministic in that given any state and symbol that is output from that state, the next state is determined. For example, for an alphabet Σ = {A,B}, there are (at most) two transitions from a state to other states, and each transition has a corresponding letter and a probability.

Figure A-3 – a simple PDFA

Definition: A Probabilistic Non-deterministic Finite Automaton (PNFA) is a finite automaton that has, for each state, a probability distribution over the transitions going out from that state, and each transition corresponds to a letter that is output.

A PNFA is very similar to a PDFA, except a PNFA may have, from any state, more than one transition that outputs the same letter. Such an automaton is non-deterministic in that given any state and symbol that is output from that state, the next state is not determined. This type of finite automaton has more expressive power, and consequently it is harder to obtain positive results for learning.

Figure A-4 – a simple PNFA


[Figure A-3 shows a PDFA with states q1, q2 and q3 in which q1 has one outgoing transition emitting A with probability 50% and one emitting B with probability 50%. Figure A-4 shows a PNFA with states q1, q2, q3 and q4 in which q1 has one transition emitting A with probability 50% and two distinct transitions each emitting B with probability 25%.]
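To make the distinction concrete, a PDFA can be represented as a map from each state to its outgoing transitions, with at most one transition per letter; the sketch below samples strings from a structure like the PDFA of Figure A-3 (the exact topology is an assumption read off the figure, and the representation is purely illustrative).

import random

# For each state: a list of (letter, next_state, probability) with no repeated letters.
pdfa = {
    "q1": [("A", "q2", 0.5), ("B", "q3", 0.5)],
    "q2": [],   # treated here as a terminal state
    "q3": [],
}

def sample_pdfa(pdfa, start="q1"):
    state, letters = start, []
    while pdfa[state]:
        letter, nxt, _ = random.choices(pdfa[state],
                                        weights=[p for _, _, p in pdfa[state]])[0]
        letters.append(letter)
        state = nxt
    return "".join(letters)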


Relationship between PDFA, PNFA and HMMs

It is easy to see that PDFAs are a subclass of PNFAs – a PDFA merely has (at most) one

transition per letter from any state, while a PNFA can have more than one transition per letter from any state.

More importantly, HMMs and PNFAs are essentially equivalent. Any HMM can be transformed to a PNFA with the same number of states. Any PNFA can be transformed to an HMM, though in general with a different number of states. See [Dupont, Denis, Esposito] for the specific conversion procedures between the two.

Any PDFA can be converted to an HMM having a different number of states; see [Gavalda et al.] or [Terwijn] for the specific conversion procedure. However, some PNFAs (and therefore some HMMs) cannot be transformed to a PDFA (unless we allow an infinite number of states). A simple example is shown in Figures 5 and 6 of [Dupont, Denis, Esposito]. Therefore, PDFAs are in fact a proper subclass of PNFAs.

A.3.1 Learning General HMMs is Hard

Twenty years ago [Abe, Warmuth] proved that learning distributions generated by either HMMs

or PNFAs is hard. The proof provided by [Abe, Warmuth] is so long and detailed that describing it would detract from the exposition of this paper. Nevertheless, some very high level points from that proof will be made. Then, a different but simpler proof sketch will be given, which builds upon information that is assumed to be possessed by the reader.

Specifically, [Abe, Warmuth] proved that you cannot even learn the optimal parameters (e.g., transition and observation probabilities for an HMM having a known number of states) in polynomial time unless RP = NP. Stated another way, when the target number of states is known, then learning is essentially a probability estimation problem: the goal is to learn those values which globally maximize the likelihood of the observed strings, rendering the learned HMM sufficiently close to some 'optimal' HMM that produced the strings.

The training time is polynomial in accuracy, confidence, and example length, but is exponential in alphabet size (unless RP = NP). When the alphabet size is variable, it is hard to maximize the likelihood of an individual string using HMMs (unless P = NP). The required sample complexity is moderate: except for some log factors, the number of examples required is linear in the number of transition probabilities to be trained and a low degree polynomial in the example length, accuracy and confidence.

[Abe, Warmuth] also show that when the number of states in the HMM is unknown, an HMM cannot be PAC learned even for the simplest alphabet with two letters.

Looking beyond the formal proof of [Abe, Warmuth], the following sketches may provide a more intuitive reason for the hardness in learning HMMs.

[Kearns, Valiant] Theorem

Under the Discrete Cube Root assumption, the class of acyclic PDFAs of polynomial size cannot be PAC learned efficiently.

Corollary: HMMs are not efficiently learnable

Since every PDFA can be transformed to an HMM, this theorem shows that the class of general HMMs cannot be PAC learned efficiently under the Discrete Cube Root assumption.


[Kearns, Mansour, et al.] Theorem

Under the Noisy Parity assumption, PDFAs are not efficiently learnable.

Proof sketch: We construct a PDFA that generates a parity function over uniform examples. In the example below, there are three bits x1, x2, x3 which form the example x, and one 'label' bit f(x). The target parity includes x1 and x3, but not x2.

Figure A-5 – PDFA for noisy parity

Imagine the top row as the "parity 0" row and the bottom as the "parity 1" row. In general, when there is a transition which outputs a "0", the parity is unchanged (i.e. do not switch rows), and when there is a transition which outputs a "1", the parity flips (i.e. switch rows). However, the transition for x2 does not change the parity (row) since x2 is not in the target parity.

All transitions that generate the bits x1, x2, x3 of the example have probability 1/2. Therefore the distribution over the first 3 bits is uniform. The final transitions generate the label bit f(x): there is an η chance of switching to the other row, and a 1 – η chance of staying in the same row. Therefore, the label bit f(x) is the correct parity of x with probability 1 – η and incorrect with probability η.

If you could learn the likelihood (probability) of any particular four-bit string b1 b2 b3 b4 being generated by this PDFA, this would be equivalent to determining which string has a higher probability: b1 b2 b3 followed by 1, or followed by 0. The higher probability string would indicate the "true" final (parity) bit (since η < 1/2). You could therefore learn noisy parity in violation of the assumption. Note that even if you could just evaluate which of two strings was more likely, rather than their exact likelihoods, you would have the same ability to learn noisy parity.

Corollary: HMMs are not efficiently learnable

Since every PDFA can be transformed to an HMM, this theorem shows that the class of general HMMs cannot be PAC learned efficiently under the Noisy Parity assumption.

As a prelude to issues raised below, a PDFA that represents a noisy parity function has states which are too "similar". The states are hard to distinguish based on prefixes of strings that are output upon passing through similar states.


[Figure A-5 shows the PDFA as two parallel rows of states, q1,0 through qf,0 (parity 0) and q1,1 through qf,1 (parity 1), with 0- and 1-labelled transitions between the columns corresponding to x1, x2, x3 and the label bit f(x).]


A.4 Approximating HMMs

Although learning HMMs is hard, there exists an algorithm for approximating HMMs.

[Gavalda et al.] show an algorithm for efficiently approximating the distribution of strings generated by an HMM using PDFAs, and the approximation can achieve arbitrary accuracy under the L∞ norm. Furthermore, the algorithm does not require that the learning algorithm know the number of states in the HMM. A primary requirement, discussed below, is that the states be "different enough" from each other.

We first show that PNFAs can be approximated by PDFAs. Since every HMM can be transformed to a PNFA, this means that any HMM can be approximated by a PDFA. We then show an algorithm that constructs a PDFA that approximates a target distribution (e.g., generated by an HMM) with arbitrary accuracy.

Theorem 1. Any PNFA (and so any HMM) can be approximated with a PDFA that is ε-close.

Using the L∞ norm, we define the distance between two distributions (e.g., the distribution of the target PNFA and that of the PDFA to be constructed) as the maximum difference between the probabilities that the two distributions assign to the same string (over all possible strings).

Let ε be some small probability. We will ignore strings that have less than probability ε of being generated by the target PNFA. Those strings will be referred to as "unlikely" strings, and strings which have probability ≥ ε will be referred to as "likely" strings.

Observation 1. The PNFA can generate at most 1/ε distinct likely strings.

Let Slikely be the set of all likely strings. If Slikely contained more than 1/ε likely strings, their probabilities alone would sum to more than 1. However, the sum of the probabilities of all (likely and unlikely) strings must be exactly 1.

In order to see that approximation is always possible, and also as a warm up for the algorithm, note that we could (but will not) easily construct a simple PDFA that generates all strings in Slikely with the same probabilities as the target PNFA. Recall that with a PDFA a possible sequence of states is completely determined by a string the PDFA generates. A PDFA to generate Slikely would be a tree. The nodes at level i would correspond to the possible letters in the ith position of all strings in Slikely. The probability of transitioning from a node at level i to a node at level i+1 would be the probability that a particular letter is the (i+1)th letter conditioned upon the prefix leading to that node. This tree would have |Slikely| ≤ 1/ε leaves.
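A sketch of this warm-up construction, building the prefix tree and its transition probabilities from the likely strings; it assumes the strings' probabilities under the target are supplied, and it omits end-of-string probabilities for brevity. All names are illustrative.

def build_prefix_tree_pdfa(likely):
    # likely: dict mapping each likely string to its probability under the target PNFA.
    # Each node of the tree is a prefix; each edge is labelled with a letter and carries the
    # conditional probability of that letter given the prefix.
    mass = {"": sum(likely.values())}
    children = {}
    for s, p in likely.items():
        for i in range(len(s)):
            prefix, letter = s[:i], s[i]
            mass[prefix + letter] = mass.get(prefix + letter, 0.0) + p
            children.setdefault(prefix, set()).add(letter)
    return {(prefix, letter): mass[prefix + letter] / mass[prefix]
            for prefix, letters in children.items() for letter in letters}

# Example: build_prefix_tree_pdfa({"AB": 0.5, "AC": 0.3, "B": 0.2})[("A", "B")] == 0.625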

As alluded to above, a primary impediment to learning HMMs and various finite automata is that two (or more) states may be very "similar" in that they yield similar output distributions starting at either state. Therefore, the [Gavalda et al.] algorithm requires distinguishability among states.

Definition Two states s1, s2 in a PDFA have distinguishability μ if the distributions over strings generated starting at s1 and at s2 assign probabilities to some string that differ by μ (i.e., the L∞ distance between the two distributions is μ).

Definition A PDFA has distinguishability μ if every two states in the PDFA have distinguishability ≥ μ.


In other words, when a PDFA has distinguishability μ, the distributions over strings starting at any two states are "sufficiently different". Analogously, [Ron, Singer, Tishby] show that one can PAC learn PDFAs which are acyclic and which meet a minimum distinguishability criterion.

Definition A prefix of a string s is the sequence of the first α letters of the string for some α > 0.

Definition A suffix σ of a string s is the sequence of the letters which follow some prefix p. Therefore the string s can be written as the concatenation pσ.

The [Gavalda et al.] algorithm uses the set of strings output by the PNFA to build a PDFA one state at a time. During the construction process, each state is either a "finalized" state or a "candidate" state. Finalized states output a particular letter with a desired probability, and are also sufficiently "different" from other states. Candidate states are those which have not yet been demonstrated to be sufficiently different.

Recall that in a PDFA a sequence of letters uniquely corresponds to a unique sequence of transitions and so to a unique sequence of states. Each state in the constructed PDFA will essentially represent a different prefix: the sequence of letters that are generated to get to that state from the initial state. Each state will also collect or "store" a set of suffixes, namely the suffixes of the strings that would pass through that state.

[Gavalda et al.] Algorithm

Inputs: a set S of strings generated by a target PNFA that is μ-distinguishable, a confidence parameter δ, and a maximum number of states n.
Outputs: a PDFA that approximates the PNFA distribution.

1. Start with one finalized state: the initial state.

2. For each letter (e.g., A,B) which is a first letter of some string in S, create a corresponding candidate state (e.g., SA, SB) with a transition from the initial state.

3. Starting with one string s from S, process each letter in s by traversing the PDFA.

Begin at the initial state. Each letter tells you which state to transition to next. Continue processing letters for as long as you are transitioning from a finalized state, not a candidate state.

4. If a candidate state is reached, add the remaining suffix of the string s to that candidate state.

5. Determine whether the number of suffixes in a candidate state is "large enough". If not, continue processing with the next string.

6. If the number of suffixes in a candidate state is "large enough", determine whether the candidate state should be merged with an existing finalized state.

Generally, we merge a candidate state with an existing finalized state if the two states are "close enough". As discussed below, "close enough" generally means the frequency with which each suffix occurs in the candidate state is about the same as the frequency with which that suffix occurs in the existing finalized state.



7. If we do not merge, then we promote the candidate state to a new finalized state. We also create new candidate states: one for each first letter in the suffixes of the newly promoted state. Each of these new candidate states is linked by a transition from the new finalized state.

8. After processing all strings, the result is an automaton in which each finalized state corresponds to a prefix. Set the probability of reaching each state to the frequency of that prefix over all strings. This can be done one level at a time by setting each transition probability to the probability of the next letter conditioned upon the prefix reached so far. (A simplified code sketch of this construction appears below.)
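To make the control flow of steps 1 through 7 concrete, here is a simplified Python sketch. The state representation (prefixes as strings), the helper names is_large_enough and is_close_enough, and the bookkeeping details are mine, not from [Gavalda et al.]; the two tests are passed in as callables so they can be instantiated with the "large enough" and "close enough" criteria discussed next, and step 8 (setting the transition probabilities) is omitted.

from collections import defaultdict

def learn_pdfa(strings, is_large_enough, is_close_enough):
    """Simplified sketch of the state-merging construction (steps 1-7).

    Finalized states are identified by a representative prefix; candidate
    states collect multisets of suffixes until they are either merged into
    an existing finalized state or promoted to a new one.
    """
    transitions = {}                       # (state, letter) -> state
    finalized = {"": defaultdict(int)}     # state -> suffix counts
    candidates = {}                        # (parent state, letter) -> suffix counts

    # Step 2: one candidate per first letter seen in the sample.
    for s in strings:
        if s:
            candidates.setdefault(("", s[0]), defaultdict(int))

    for s in strings:
        # Step 3: walk the finalized part of the automaton.
        state, i = "", 0
        while i < len(s) and (state, s[i]) in transitions:
            state = transitions[(state, s[i])]
            i += 1
        if i == len(s):
            continue
        cand = (state, s[i])
        candidates.setdefault(cand, defaultdict(int))
        # Step 4: store the remaining suffix in the candidate state.
        candidates[cand][s[i + 1:]] += 1

        # Steps 5-7: once the candidate has enough data, merge or promote it.
        if is_large_enough(candidates[cand]):
            suffixes = candidates.pop(cand)
            target = next((f for f in finalized
                           if is_close_enough(suffixes, finalized[f])), None)
            if target is None:             # Step 7: promote to a finalized state
                target = cand[0] + cand[1]
                finalized[target] = suffixes
                for suf in suffixes:
                    if suf:
                        candidates.setdefault((target, suf[0]), defaultdict(int))
            transitions[cand] = target     # Step 6 or 7: wire up the transition
    return transitions, finalized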

'Large enough'

Step 5 requires determining whether the number of suffixes in a candidate state is "large enough". Specifically, a candidate state is large enough if the number of suffixes exceeds:

( 3(1 + μ/4) / (μ/4) )2 lg( 4(n|Σ| + 2) / (δμ) )

See the proof in section 5 of [Gavalda et al.].
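For illustration, the threshold as reconstructed above can be coded directly. Since the expression here is my reading of the original, the exact constants should be checked against section 5 of [Gavalda et al.] before relying on it.

from math import log2

def large_enough_threshold(mu, delta, n, alphabet_size):
    """Minimum number of stored suffixes before a candidate state is tested.

    Transcribes the threshold as written above; the precise constants should
    be taken from section 5 of [Gavalda et al.].
    """
    return (3 * (1 + mu / 4) / (mu / 4)) ** 2 * log2(4 * (n * alphabet_size + 2) / (delta * mu))

def is_large_enough(suffix_counts, mu, delta, n, alphabet_size):
    return sum(suffix_counts.values()) > large_enough_threshold(mu, delta, n, alphabet_size)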

'Close enough'

Step 6 requires determining whether the frequency with which each suffix σ occurs in the candidate state sc is about the same as the frequency with which that suffix occurs in some existing finalized state sf. Specifically, a candidate state is close enough to an existing finalized state if:

∃ sf such that ∀σ: | freq(σ occurring in sc) − freq(σ occurring in sf) | ≤ μ/2

See the proof in section 5 of [Gavalda et al.].
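The "close enough" test translates directly into code; representing each state's stored suffixes as a dictionary of counts (as in the sketch above) is my choice of data structure, not something prescribed by [Gavalda et al.].

def is_close_enough(cand_counts, final_counts, mu):
    """True if every suffix occurs with roughly the same relative frequency
    (within mu/2) in the candidate state and in the finalized state."""
    cand_total = sum(cand_counts.values()) or 1
    final_total = sum(final_counts.values()) or 1
    for suffix in set(cand_counts) | set(final_counts):
        cand_freq = cand_counts.get(suffix, 0) / cand_total
        final_freq = final_counts.get(suffix, 0) / final_total
        if abs(cand_freq - final_freq) > mu / 2:
            return False
    return True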

As stated above, the learned PDFA can be converted to an HMM. However, since this conversion does not replicate the target HMM, only its distribution, this step is probably unnecessary for most situations.



APPENDIX B
Overview of the Hsu, Kakade, Zhang Spectral Method

A recent and exciting body of work utilizes spectral learning algorithms for HMMs. [Hsu, Kakade, Zhang] provide an algorithm for PAC (Probably Approximately Correct) learning a subclass of HMMs from a set of strings, in fact from very short prefixes of those strings. The algorithm provides an accurate approximation for the HMM distribution according to the L1 norm (i.e. the "Manhattan distance").

Much like the [Gavalda et al.] algorithm described above, the [Hsu, Kakade, Zhang] algorithm requires that the states of the HMM are "different enough" – i.e. the distributions of strings induced by different hidden states are distinct. This requirement is fulfilled by a condition on the observation matrix of the HMM. The algorithm also imposes a similar requirement on the correlation between consecutive letters in a string.

The algorithm imposes three primary requirements on the HMM to be learned:

1. The number of letters in the HMM's alphabet is not less than the number of states in the HMM.

2. The observation and transition matrices are full rank.

3. Every state has a non-zero probability of being selected as the first state.

Requirement 2 essentially means that no state has a distribution of outputs that is a linear combination of other states' output distributions. In combination with Requirement 1, the rank of both matrices is equal to the number of states n. Requirement 3 means that every state has a positive probability of generating the first letter. This feature will be important to the algorithm, as described below.
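These three requirements are easy to check numerically for a candidate HMM. The sketch below uses the conventions adopted later in this appendix (columns of T and O sum to 1, T[i, j] = Pr[next state i | current state j], O[x, i] = Pr[letter x | state i]); the toy matrices are illustrative only.

import numpy as np

def check_hkz_requirements(T, O, pi):
    """Check the three conditions: |alphabet| >= |states|, full-rank T and O,
    and strictly positive initial state probabilities."""
    n = T.shape[0]
    L = O.shape[0]
    req1 = L >= n
    req2 = np.linalg.matrix_rank(O) == n and np.linalg.matrix_rank(T) == n
    req3 = np.all(pi > 0)
    return req1, req2, req3

# Toy 2-state, 3-letter HMM (columns of T and O sum to 1)
T = np.array([[0.7, 0.4],
              [0.3, 0.6]])
O = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.7]])
pi = np.array([0.6, 0.4])
print(check_hkz_requirements(T, O, pi))   # (True, True, True)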

The algorithm learns from three matrices which are estimated from the observed strings:

[P1] probabilities for the first letter in the string
[P2,1] probabilities for the first two letters in the string
[P3,x,1] probabilities for the first three letters in the string

For an HMM with an alphabet of L letters, the dimensions of these matrices are:
[P1] L x 1
[P2,1] L x L
[P3,x,1] L x L for each of L matrices (one for each second letter x)

Each of the above matrices can be empirically estimated from the set of strings – in fact from just the first three letters of each string. As will be seen below, the algorithm can learn the HMM even while ignoring all letters beyond the first three. In general, this is enabled by Requirement 3, which means that every state has a chance of being the first state, and therefore has a chance of outputting the first letter. The authors point out that the algorithm can be extended to larger sets of consecutive letters, and can take advantage of singles, pairs, and triples other than those that are at the beginnings of the strings.
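A minimal sketch of this empirical estimation, with letters encoded as integers 0..L-1 and every string assumed to have length at least three, might look as follows; the indexing convention (second letter indexes the rows of P2,1, first letter the columns) matches Lemma 2 below.

import numpy as np

def estimate_moments(strings, L):
    """Estimate P1, P2,1 and the L matrices P3,x,1 from sampled strings.

    strings: iterable of sequences of integer letters in {0, ..., L-1},
             each of length at least 3.
    Conventions: P21[b, a] = Pr[second letter b, first letter a];
                 P3x1[x, c, a] = Pr[third letter c, second letter x, first letter a].
    """
    P1 = np.zeros(L)
    P21 = np.zeros((L, L))
    P3x1 = np.zeros((L, L, L))          # indexed [x][third][first]
    m = 0
    for s in strings:
        a, b, c = s[0], s[1], s[2]
        P1[a] += 1
        P21[b, a] += 1
        P3x1[b, c, a] += 1
        m += 1
    return P1 / m, P21 / m, P3x1 / m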



The algorithm works as follows:

Inputs: number of states n, and a set of strings output from the target HMM meeting all requirements.
Output: parameters for the learned representation of an HMM (but not an HMM).

1. From the set of all strings, estimate [P1], [P2,1] and each [P3,x,1].

2. Compute U: the matrix whose columns are the left singular vectors corresponding to the n largest singular values in a singular value decomposition of the estimate of P2,1. In other words:

P2,1 = U Σ VT

where
Σ is a diagonal matrix of singular values, in descending order
U is a matrix of left singular vectors
V is a matrix of right singular vectors

3. Compute the following parameters:

b1 = UT P1
b∞ = (P2,1T U)+ P1
Bx = UT P3,x,1 (UT P2,1)+ for each letter x

where M+ denotes the Moore-Penrose pseudo-inverse of M.
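Steps 2 and 3 then reduce to a singular value decomposition and a few pseudo-inverses. A minimal numpy sketch (function name mine) follows, keeping the n left singular vectors with the largest singular values and using the moment matrices in the indexing convention of the estimation sketch above.

import numpy as np

def spectral_parameters(P1, P21, P3x1, n):
    """Compute b1, b_inf and the observable operators B_x (steps 2 and 3).

    P3x1 is indexed as [x][third letter][first letter]; n is the number of
    hidden states (equivalently, the number of singular vectors kept).
    """
    # Step 2: top-n left singular vectors of P2,1.
    U, _, _ = np.linalg.svd(P21)
    U = U[:, :n]

    # Step 3: model parameters.
    b1 = U.T @ P1
    b_inf = np.linalg.pinv(P21.T @ U) @ P1
    B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(P3x1.shape[0])]
    return b1, b_inf, B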

The model parameters computed in step 3 can be used to determine many desirable characteristics of the distribution of strings from the target HMM.

[Hsu, Kakade, Zhang] Theorem The likelihood of any particular string of letters x1, x2, … xt can be computed from:

Prob[x1, x2, … xt] = b∞T Bxt Bxt-1 … Bx2 Bx1 b1
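In code, evaluating this formula from the learned parameters is just a chain of matrix-vector products, applying Bx1 first and Bxt last; a small helper (name mine) is shown below, with letters given as integers indexing the list of operators.

import numpy as np

def string_probability(b1, b_inf, B, letters):
    """Compute Prob[x1, ..., xt] = b_inf^T B_xt ... B_x1 b1."""
    v = b1
    for x in letters:            # apply B_x1 first, B_xt last
        v = B[x] @ v
    return float(b_inf @ v)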

To prove this theorem we need the following four lemmas. Let
T be the HMM's transition matrix
O be the HMM's observation matrix
π be the HMM's probabilities of initial states

We will also slightly abuse notation by assuming that the matrices estimated in Step 1 are the true quantities. The results remain the same for estimates that are "close enough".

Lemma 1. b1 = UT O π

Proof: From the definition of the observation matrix O and the initial state distribution π, the probability distribution of the first letter in an output string is:

O π = P1

Since b1 = UT P1, b1 = UT O π as well.



Lemma 2. P2,1 = O T diag(π) OT

where diag(π) is a diagonal matrix in which each diagonal element is an element of π.

Proof: From the definition of the observation matrix O and the initial state distribution π, the matrix

diag(π) OT

denotes, for each row i and column j, the probability of the first state being i and the first letter being j. Multiplying this matrix by the transition matrix T gives a matrix

T diag(π) OT

in which for each row i and column j there is the probability of the second state being i and the first letter being j. Finally, multiplying by observation matrix O gives a matrix

O T diag(π) OT

in which for each row i and column j there is the probability of the second letter being i and the first letter being j. This is the definition of P2,1.□
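Lemma 2 can also be sanity-checked numerically: on a small HMM (toy numbers below, chosen arbitrarily), summing over the hidden first and second states reproduces O T diag(π) OT.

import numpy as np

# Toy HMM: T[i, j] = Pr[next state i | state j], O[x, i] = Pr[letter x | state i]
T = np.array([[0.7, 0.4],
              [0.3, 0.6]])
O = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.7]])
pi = np.array([0.6, 0.4])

# Brute force: P2,1[i, j] = sum over hidden states s1, s2 of
#   Pr[s1] * Pr[letter j | s1] * Pr[s2 | s1] * Pr[letter i | s2]
L, n = O.shape
P21_brute = np.zeros((L, L))
for s1 in range(n):
    for s2 in range(n):
        for j in range(L):
            for i in range(L):
                P21_brute[i, j] += pi[s1] * O[j, s1] * T[s2, s1] * O[i, s2]

# Lemma 2: P2,1 = O T diag(pi) O^T
print(np.allclose(P21_brute, O @ T @ np.diag(pi) @ O.T))   # True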

Lemma 3. b∞T = 1T (UT O)-1, where 1T is a row vector in which every element is 1.

Proof: The equation for b∞T is:

b∞T = ((P2,1T U)+ P1)T = P1T (UT P2,1)+

We can re-express P1T using P1 = O π from the proof of Lemma 1:

P1T = πT OT

Note that since T is a transition (probability) matrix, each column of T sums to 1. This simple fact can be expressed as:

1T T = 1T

where 1 denotes a column vector in which every element is 1. Further, any row vector πT can be expressed as

1T diag(π)

Therefore

P1T = 1T T diag(π) OT

which can further be rewritten, since O has full rank (so that UT O is invertible):

P1T = 1T (UT O)-1 (UT O) T diag(π) OT

which by Lemma 2 is

P1T = 1T (UT O)-1 UT P2,1

We can plug this value for P1T into the expression for b∞T:

b∞T = (1T (UT O)-1 UT P2,1) (UT P2,1)+

Since (UT P2,1) and its pseudo-inverse are multiplied together (and UT P2,1 has full row rank), this simplifies to:

b∞T = 1T (UT O)-1



Lemma 4. Bx = (UT O) T diag(Ox) (UT O)-1 for every letter x
where diag(Ox) is a diagonal matrix whose diagonal elements are the entries of the row of O for letter x (i.e. the probability of observing letter x from each state).

Proof: The quantity Bx is defined as:

Bx = UT P3,x,1 (UT P2,1)+

We can expand P3,x,1 in terms of T and diag(Ox), since P3,x,1 is the matrix of probabilities of the first and third letters jointly with the second letter being x. Recall from the proof of Lemma 2 that

T diag(π) OT

is a matrix in which the element at row i and column j denotes the probability of the second state being i and the first letter being j. Therefore the matrix

diag(Ox) T diag(π) OT

represents the probability of the second state being i, the first letter being j, and the second letter being x. Similarly, the matrix

T diag(Ox) T diag(π) OT

represents the probability of the third state being i, the first letter being j, and the second letter being x. Finally,

O T diag(Ox) T diag(π) OT

represents the probability of the third letter being i, the first letter being j, and the second letter being x. In other words

P3,x,1 = O T diag(Ox) T diag(π) OT

We now use a similar expansion as in Lemma 3, since O has full rank:

P3,x,1 = O T diag(Ox) (UT O)-1 (UT O) T diag(π) OT

which simplifies again because of Lemma 2 to:

P3,x,1 = O T diag(Ox) (UT O)-1 UT P2,1

Plugging this value into the expression for Bx

Bx = UT O T diag(Ox) (UT O)-1 UT P2,1 (UT P2,1)+

As in Lemma 3, since (UT P2,1) and its pseudo-inverse are multiplied together, this simplifies to:

Bx = (UT O) T diag(Ox) (UT O)-1

We are now prepared to prove the [Hsu, Kakade, Zhang] Theorem. The likelihood of a particular string of letters x1, x2, … xt can be computed in a straightforward manner from the definition of transition and observation matrices. For example, the probability that the first letter is letter x1 is:

Prob[x1] = 1T diag(Ox1) π

Since, as noted in the proof of Lemma 3, 1T T = 1T, this is also

Prob[x1] = 1T T diag(Ox1) π

We can build upon this for any arbitrary string, using reasoning similar to that in Lemma 4, to find:

Prob[x1, x2, … xt] = 1T T diag(Oxt) … T diag(Ox2) T diag(Ox1) π



Therefore, we would like to construct this product from the learned model parameters b1, b∞, Bx. Now we can use Lemma 4 to create the telescoping product:

Bx = (UT O) T diag(Ox) (UT O)-1

Bxt ... Bx2 Bx1
= (UT O) T diag(Oxt) (UT O)-1 … (UT O) T diag(Ox2) (UT O)-1 (UT O) T diag(Ox1) (UT O)-1
= (UT O) T diag(Oxt) ... T diag(Ox2) T diag(Ox1) (UT O)-1

Therefore

Bxt ... Bx2 Bx1 b1
= (UT O) T diag(Oxt) ... T diag(Ox2) T diag(Ox1) (UT O)-1 b1
= (UT O) T diag(Oxt) ... T diag(Ox2) T diag(Ox1) (UT O)-1 UT O π
= (UT O) T diag(Oxt) ... T diag(Ox2) T diag(Ox1) π

and finally

b∞T Bxt ... Bx2 Bx1 b1
= b∞T (UT O) T diag(Oxt) ... T diag(Ox2) T diag(Ox1) π
= 1T (UT O)-1 (UT O) T diag(Oxt) ... T diag(Ox2) T diag(Ox1) π
= 1T T diag(Oxt) ... T diag(Ox2) T diag(Ox1) π
= Prob[x1, x2, … xt]
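The entire chain of identities can be verified numerically on a toy HMM: compute the exact P1, P2,1 and P3,x,1 from T, O and π, run steps 2 and 3, and compare the spectral expression against the brute-force product 1T T diag(Oxt) ... T diag(Ox1) π. The HMM below is an arbitrary illustration; with exact (rather than estimated) moment matrices the two values agree up to floating-point error.

import numpy as np

T = np.array([[0.7, 0.4],
              [0.3, 0.6]])                  # columns sum to 1
O = np.array([[0.5, 0.1],
              [0.3, 0.2],
              [0.2, 0.7]])                  # columns sum to 1
pi = np.array([0.6, 0.4])
L, n = O.shape

# Exact moments (Lemmas 1 and 2, and the expansion of P3,x,1 in Lemma 4).
P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = np.array([O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(L)])

# Steps 2 and 3 of the HKZ algorithm.
U = np.linalg.svd(P21)[0][:, :n]
b1 = U.T @ P1
b_inf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(L)]

string = [0, 2, 1, 2]                       # an arbitrary string of letters

# Learned-parameter computation: b_inf^T B_xt ... B_x1 b1
v = b1
for x in string:
    v = B[x] @ v
spectral = float(b_inf @ v)

# Brute-force computation: 1^T T diag(O_xt) ... T diag(O_x1) pi
w = pi
for x in string:
    w = T @ np.diag(O[x]) @ w
brute = float(w.sum())

print(np.isclose(spectral, brute))          # True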

Note that computing the SVD of an L x L matrix, as well as other matrix operations in the algorithm, can be done in polynomial time. The algorithm learns a hypothesis which is not an HMM but includes a set of what one can consider to be "hidden states" and other quantities.

This PAC algorithm has engendered significant interest. For example, a modification by [Siddiqi et al.] adapts the spectral algorithm to HMMs in which the transition and observation matrices do not have full rank. This modification not only handles HMMs that could not be handled by the original algorithm, it also allows for a more compact representation (since now the rank is less than the number of states) and therefore more efficient learning.
