DEGREE PROJECT, IN ENGINEERING PHYSICS; SYSTEMS, CONTROL AND ROBOTICS, SECOND LEVEL
STOCKHOLM, SWEDEN 2015
On Identification of Hidden Markov Models Using Spectral and Non-Negative Matrix Factorization Methods
ROBERT MATTILA
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING
KTH Royal Institute of Technology
Abstract
Department of Automatic Control
School of Electrical Engineering
Master of Science
On Identification of Hidden Markov Models Using Spectral and
Non-Negative Matrix Factorization Methods
by Robert Mattila
Hidden Markov Models (HMMs) are popular tools for modeling discrete time series.
Since the parameters of these models can be hard to derive analytically or directly mea-
sure, various algorithms are available for estimating these from observed data. The
most common method, the Expectation-Maximization algorithm, suffers from problems
with local minima and slow convergence. A spectral algorithm that has received con-
siderable attention in the field of machine learning claims to avoid these issues. This
thesis implements and benchmarks said algorithm on various systems to see how well it
performs.
One of the concerns with the proposed spectral algorithm is that it cannot guarantee that
the estimates are stochastically valid: it may recover negative or complex probabilities,
due to an eigenvalue decomposition.
Another approach to the HMM identification problem is to leverage results from Non-
Negative Matrix Factorization (NNMF) theory. Inspired by an algorithm employing a
Structured NNMF (SNNMF), assumptions are presented to guarantee that the factor-
ization problem can be cast into a convex optimization problem.
Three novel recursive algorithms are then derived for estimating the dynamics of an
HMM when the sensor dynamics are known. These can be used in an online setting
where time and/or computational resources are limited, since they only require the
current estimate of the HMM parameters and the new observation. Numerical results
for the algorithms are provided.
KTH Kungliga Tekniska Högskolan
Referat
Department of Automatic Control
School of Electrical Engineering
Master's Thesis
Identification of Hidden Markov Models via Spectral and Non-Negative
Matrix Factorization Methods
by Robert Mattila
A popular tool for modeling discrete time series is the Hidden Markov Model (HMM).
In practice, the parameters of these models can be hard to derive or to measure
directly, which has led to the construction of various algorithms for estimating them
from measured data. The most common algorithm for such estimation, the
Expectation-Maximization method, has been shown to suffer from problems with local
minima and slow convergence. A method that uses spectral factorization is said to
avoid these problems and has received considerable attention in machine learning.
This thesis implements this algorithm and evaluates its performance on various systems.
One of the problems with this spectral learning algorithm is that it gives no
guarantees that the estimated parameters are stochastically valid: the result may be
probabilities that are negative or complex. This is due to an eigenvalue decomposition
in one step of the algorithm.
Another possible route to estimating the parameters of an HMM is to use results from
the theory of Non-Negative Matrix Factorization (NNMF). Inspired by an algorithm that
employs a structured NNMF, assumptions are presented under which the factorization
problem can be converted into a convex optimization problem.
Three new recursive algorithms are then derived that can estimate the dynamics of an
HMM when the sensor used to make the measurements is assumed to have known dynamics.
These can be used in real-time applications where time and/or computational resources
are limited, since they only need an old estimate and a new observation to improve the
estimate. Numerical simulations are performed and the results are then presented for a
number of systems.
Acknowledgements
I would like to thank my supervisor Prof. Bo Wahlberg and Assoc. Prof. Cristian
Rojas for inviting me to work with them and for their input and support during the
development of this thesis.
I had the opportunity to spend some time at University of British Columbia (UBC)
with Prof. Vikram Krishnamurthy, who gave me valuable feedback and almost too
many things to try out during my stay there, for which I am very grateful.
I would also like to thank his students, Sujay Bhatt, Anup Aprem and Yan Duan, for
welcoming me so kindly to the lab.
Robert Mattila
January 2015
Contents
Abstract
Referat
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations
Symbols
1 Introduction
1.1 Background and Setting
1.2 Notation
1.3 Hidden Markov Models
1.3.1 Transitions
1.3.1.1 Example of a Markov Chain
1.3.2 Observations
1.3.2.1 Example of an HMM
1.4 Problem Formulation
1.4.1 Example Problem
1.5 Expectation-Maximization Method
1.6 Overview of Related Work
1.7 Outline of Thesis
2 Methods for Identification Using Batch Data
2.1 Introduction to Spectral Learning
2.2 Moments
2.3 Derivation of the Spectral Learning Algorithm
2.4 Introduction to Non-Negative Matrix Factorization
2.5 Identification using Structured Non-Negative Matrix Factorization
3 Methods for Online Identification
3.1 Introduction
3.2 Convexity and Convex Optimization
3.2.1 Introduction
3.2.2 The Projected Gradient Descent Method
3.2.3 The Primal-Dual Method
3.3 Online Structured Non-Negative Matrix Factorization for Known Sensor Dynamics
3.3.1 Derivation of the Algorithms
3.3.1.1 Estimating S_{2,1} Recursively
3.3.1.2 Projected Gradient Descent
3.3.1.3 The Primal-Dual Method without Inequality Constraints
3.3.2 Spherical Coordinates
4 Notes on Implementation
4.1 Measuring the Accuracy of Estimates
4.1.1 Kullback-Leibler Divergence
4.1.2 Eigenvalues and Eigenvectors
4.1.3 Probability of Output Sequences
4.1.4 Matrix Norms
4.2 Re-Ordering the States
4.3 Some Notes on Implementing the Spectral Learning Algorithm
4.3.1 Faulty Matrices
4.3.2 If the Matrix is not Diagonal?
4.3.3 Separating the Eigenvalues
5 Numerical Results for Spectral Learning and Structured Non-Negative Matrix Factorization
5.1 Comparison of Spectral Learning and Structured Non-Negative Matrix Factorization
5.2 Performance of the Spectral Learning Algorithm
5.3 Higher-Dimensional Systems with Spectral Learning
5.4 Perturbation Analysis
6 Numerical Results for Online Structured Non-Negative Matrix Factorization
6.1 Comparison of the Algorithms
6.2 Performance as Dimension Increases
6.3 Tracking Time-Varying Dynamics
7 Conclusions
7.1 Summary and Conclusions
7.2 Contributions
7.3 Future Work
7.3.1 Combining with Other Methods
7.3.2 Other Explanations for the Performance of the Spectral Learning Algorithm
7.3.3 How to choose the weight ρ(k)?
7.3.4 Formal Convergence Properties of Online Structured Non-Negative Matrix Factorization
7.3.5 Adaptive Step-Size
7.3.6 The Primal-Dual Method with the Inequality Constraint
7.3.7 Other Parametrizations than Spherical
7.3.8 Computational Complexity of Online Structured Non-Negative Matrix Factorization
A Examples Used in Benchmarks
B Computing Environment
Bibliography
List of Figures
1.1 Example of a Markov chain, illustrating Equation (1.10). The blue circles represent the discrete states of the system and the numbers on the edges the probabilities of transitioning between two states.
1.2 Example of an HMM, illustrating Equation (1.10) and Equation (1.14). The blue circles represent the hidden state of the system and the green squares the possible observations. Two green squares with the same number are equivalent, they correspond to the same observation, and are only drawn separated in the figure for clarity.
2.1 The spectral learning algorithm.
2.2 The structured non-negative matrix factorization algorithm.
3.1 Example of a convex function. The (gray) line between any two points on the graph lies above the graph. A negative gradient is shown as a red arrow. Following the direction of the negative gradient at each point (i.e. performing a gradient descent) will reach the global minimum.
3.2 The online structured non-negative matrix factorization algorithm employing the projected gradient descent method.
3.3 The online structured non-negative matrix factorization algorithm employing the primal-dual method.
3.4 The online structured non-negative matrix factorization algorithm employing a spherical coordinates parametrization.
4.1 Graph depiction of the HMMs Σa and Σb which are related by renaming the hidden states. They cannot be distinguished from just sequences of observations.
4.2 Performance of the spectral learning algorithm when Steps 4 and 5 are performed N_g times to find the set of random variables giving the largest spread of the eigenvalues. Every data point is averaged over 75 simulations.
5.1 Comparison of SL and SNNMF. All data points are averaged over 20 simulations.
5.2 Performance of the spectral learning algorithm on numerous examples (found in Appendix A). All data points are averaged over 20 simulations.
5.3 Performance of the EM-algorithm on three examples. Every data point is an average over three realizations of the HMM. Random initial guesses were used for starting the EM-algorithm.
5.4 Performance of the spectral learning algorithm as the dimension of the HMM increases. All data points are averaged over 15 simulations with 3 random matrices, each evaluated over 5 simulations.
6.1 Performance of the three recursive methods on two examples: E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10. Every data point is an average over ten simulations.
6.2 Performance of PGDM as the dimension of the system increases. The systems are generated randomly with Y = X. Each data point is an average over nine simulations with three pairs of random matrices, each used for three simulations.
6.3 Error as PGDM tracks a time-varying system. The left plot shows the case where only the Markov chain changes (at the dashed red line), and the sensor dynamics stays constant. The right plot shows the case where both the sensor dynamics and the Markov chain change. Every data point is an average over ten simulations. Two different choices for the weight in the update of the estimate of the S_{2,1} matrix are shown.
List of Tables
5.1 Mean slopes of the dF-error for the spectral learning algorithm in the log-log plot Figure 5.2 together with the second largest eigenvalue of the transition matrix for various examples. The system matrices can be found in Appendix A.
5.2 The result of Section 5.2 (Figure 5.2) sorted according to the error in the estimate of the observation matrix when 10^7 samples were available (which is an indicator of how well the spectral learning algorithm has performed). The condition number of OT appears to correlate well with this, except for the outlier Example 8.
6.1 Comparison of the convergence rate and time consumption of the three recursive methods on two examples (E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10). Mean Slope refers to the slopes in Figure 6.1.
6.2 Performance of the PGDM as the dimension of the system increases. States refer to the value of X and Y (equal). Mean Slope refers to the slopes in Figure 6.2.
Abbreviations
EM Expectation-Maximization
HMM Hidden Markov Model
NNMF Non-Negative Matrix Factorization
PDM Primal-Dual Method
PGDM Projected Gradient Descent Method
SCM Spherical Coordinates Method
SL Spectral Learning
SNNMF Structured Non-Negative Matrix Factorization
SVD Singular Value Decomposition
Symbols
·, · estimate of numerical quantity ·
1 (column) vector of ones
[·]• element • of tensor ·
I identity matrix
δi,j the Kronecker delta
∆ the probability simplex
∆· change of ·
·^+ Moore-Penrose pseudo-inverse
Tr · matrix trace
diag · matrix with elements of vector · as diagonal elements
ei ith Cartesian (column) unit vector
∇· gradient
N (·, ·) normally distributed number
Ω· square root of term in generalized Pythagorean trigonometric identity
O observation matrix (elements in columns sum to one)
T transition matrix (elements in columns sum to one)
dF (·, ·) mean squared error of matrix elements
S· moment
P permutation matrix
π stationary distribution
A dummy variable
Jij matrix of zeros, except for a one at element (i, j)
ρ(·) weight of new observations compared to old observations
L Lagrangian
Chapter 1
Introduction
1.1 Background and Setting
Mathematical modelling is one of the cornerstones of much of the knowledge of mankind.
However, the complexity of some systems limits how well certain parameters in the
mathematical model can be determined numerically. Complexity is not the only limiting
factor: sometimes the time needed to derive an analytical model is not available, or it
could be necessary to break the system in order to perform some measurements, which
can be very costly. This has led to the field of system identification. Instead of
analytically deriving or measuring parameters, algorithms are developed that generate
estimates of the parameters that fit observed data well. This thesis treats such
identification procedures for a certain kind of model, namely the Hidden Markov Model
(HMM).
The foundation of the theory of Markov models was laid by Andrey Markov at the
beginning of the 20th century. Since then, this mathematical model has found
applications in a vast number of fields, such as automatic speech recognition, natural
language processing and genomic sequence modeling. A Markov chain is one such model,
defined as a discrete stochastic process that traverses its finite set of states as time
evolves in discrete steps. The critical assumption is that the probability of the current
state transitioning to another is independent of the history of the process: it depends
only on the current state of the system.
Usually, the future state of the system is impossible to know, since the state changes
are random. However, through statistical analysis, some things can be deduced. One of
the most important is the expected value of the system state in the future. In financial
markets, this could tell how a stock is expected to perform and whether or not we should
sell or buy more shares.
There are many generalizations of the basic Markov chain. It is common that the state
of the system itself cannot be directly observed and one has to rely on observations
of some other quantities. For example, the state of a medical patient, sick or healthy,
cannot be directly observed by the doctor. The doctor uses thermometer readings,
answers to questions and other instruments to try to find the true (hidden) state of the
patient. If the true state of the system behaves like a Markov chain, then such a system
is called an HMM.
Another possible generalization is to imagine that there is a control input that can
be designed to steer the system towards a desired state. Mathematically, this reflects
itself as changing the probabilities of certain transitions in the underlying (hidden)
Markov chain. This is what is known as a Partially Observed Markov Decision Process
(POMDP).
So-called HMM-filters have been devised to find the hidden state of a system from
observations of the HMM, and have in recent years proven very successful
in various domains. Such filters, together with dynamic programming, constitute the
fundamental tools for solving POMDP-problems.
HMM-filters are also used for prediction and filtering purposes. Regardless of how they
are applied, they require the knowledge of all probabilities of the HMM, i.e. the param-
eters of the HMM. Since these are not always known, either by analytical derivations or
direct measurements on the system, they might have to be estimated.
The currently predominant methods for estimating these parameters, such as the
Baum-Welch/Expectation-Maximization (EM) method from Baum et al. [3], rely on local
optimization schemes to iteratively calculate better and better estimates. This
approach is prone to problems with local minima and slow convergence. Some recent
results by Hsu et al. [12] in the field of machine learning seem promising for solving the
estimation problem without the issues of sensitivity to initial guesses and slow
convergence. They obtain an estimate through a one-shot method employing techniques from
linear algebra.
The aim of this project is two-fold. We first intend to explore and implement the so-
called spectral learning algorithm for HMMs by Hsu et al. [12]. We will draw some
parallels to similar work and see how the algorithm relates. The second intention is to
examine the problem of estimating the parameters in an online setting, meaning that a
new estimate is calculated using only a new measurement and the old estimate. This
is important in a real-time application where memory and/or computational resources
might be limited.
1.2 Notation
This section will briefly introduce some mathematical concepts and notation that will
be used throughout the text. We will let × be reserved for scalar multiplications, since
there will be no need for using the cross product between vectors.
Elements of vectors and matrices are given as subscripts of brackets: if A is a matrix
then the element on the ith row and jth column is [A]ij . The ith Cartesian unit vector
will be denoted ei and the vector of ones as 1, both column vectors.
An empirical estimate of a numerical quantity x, not necessarily scalar, will be denoted
as $\hat{x}$. The probability measure of an event will be written Pr[ · ].
The (n − 1)-dimensional probability simplex ∆ is defined as
$$\Delta = \{ x \in \mathbb{R}^n : x \geq 0,\ \mathbf{1}^T x = 1 \}. \tag{1.1}$$
In Equation (1.1), and in the rest of the text, ≥ should be evaluated elementwise on
vector and matrix quantities.
The gradient of a (scalar, vector or matrix) function f with respect to x will be denoted
$\nabla_x f$ or $\partial f / \partial x$ depending on which one is more visually pleasing in the present context.
The Hadamard, or elementwise, product between two matrices A and B will be denoted
with the $\odot$-operator and is defined as
$$[A \odot B]_{ij} = [A]_{ij} \times [B]_{ij}. \tag{1.2}$$
The Frobenius norm of a matrix A is written and defined as
$$\|A\|_F = \sqrt{\sum_{ij} [A]_{ij}^2}, \tag{1.3}$$
where the sum is over all elements of A.
A normally distributed number with mean µ and standard deviation σ will be written
N (µ, σ).
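The notation above maps directly onto a few lines of code. The following is a minimal sketch in plain Python, with matrices stored as lists of rows; the function names in_simplex, hadamard and frobenius_norm are chosen here for illustration and do not appear elsewhere in the text.

```python
import math

def in_simplex(x, tol=1e-9):
    """Check Equation (1.1): x is elementwise non-negative and sums to one."""
    return all(v >= 0.0 for v in x) and abs(sum(x) - 1.0) <= tol

def hadamard(A, B):
    """Elementwise product of two equally sized matrices, Equation (1.2)."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def frobenius_norm(A):
    """Square root of the sum of all squared elements, Equation (1.3)."""
    return math.sqrt(sum(a * a for row in A for a in row))

# Small illustrative matrices (arbitrary values).
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, 0.0], [1.0, 2.0]]

print(in_simplex([0.2, 0.3, 0.5]))
print(hadamard(A, B))
print(frobenius_norm(A))
```

A small tolerance is used in the sum-to-one check since floating point sums of probabilities are rarely exactly one.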
1.3 Hidden Markov Models
This section will formally introduce the model that is central to this thesis, namely the
Hidden Markov Model (HMM). We will begin by discussing the standard Markov model,
or Markov chain, since the HMM is a generalization.
A Markov chain is a discrete stochastic model that can be represented as a sequence
of stochastic variables x1, x2, x3, . . . . The state of the Markov chain at time k is xk.
The crucial assumption is that these variables obey the Markov property. This is an
assumption that the conditional probability distribution of future states depends only
on the current state and not on past states. Formally, this can be stated as
$$\Pr[x_{k+1} = i \mid x_1 = i_1, x_2 = i_2, \dots, x_k = i_k] = \Pr[x_{k+1} = i \mid x_k = i_k]. \tag{1.4}$$
The possible values that the variables $x_i$ can take form a set called the state-space of the
Markov chain. We will denote this set by $\mathcal{X}$ and let it be a finite consecutive subset
of $\mathbb{N}^+$ that includes 1, i.e. $\mathcal{X} = \{1, 2, \dots, X\}$. Here X denotes the number of discrete
states of the Markov chain, and will be referred to as the dimension of the Markov chain.
If the transition probabilities do not depend on the absolute time, the chain is said to
be time-invariant or stationary. This means that
$$\Pr[x_{k+m+1} = i \mid x_{k+m} = j] = \Pr[x_{k+1} = i \mid x_k = j], \tag{1.5}$$
for all m. In this case, the concept of the chain’s stationary distribution is often intro-
duced:
Definition 1.1. The stationary distribution, $\pi \in [0, 1]^X$, of a time-invariant Markov
chain is defined as
$$[\pi]_i = \Pr[x_k = i] = \Pr[x_1 = i]. \tag{1.6}$$
Remark 1.2. We will provide an example of a time-variant Markov chain in a later
chapter. In that case, we will treat it slightly heuristically and assume that the chain is
stationary on each of the intervals between the change of dynamics.
1.3.1 Transitions
Since there are X × X possible state transitions in a Markov chain of dimension X,
we can define a matrix that will keep track of the probabilities of each one of these
transitions:
Figure 1.1: Example of a Markov chain with states 1: sunny, 2: cloudy and 3: rainy, illustrating Equation (1.10). The blue circles represent the discrete states of the system and the numbers on the edges the probabilities of transitioning between two states.
Definition 1.3. The (state) transition matrix¹, $T \in [0, 1]^{X \times X}$, is defined as
$$[T]_{ij} = \Pr[x_{k+1} = i \mid x_k = j]. \tag{1.7}$$
The transition matrix is, by construction, a stochastic matrix, which implies that it has
the following two properties:
$$[T]_{ij} \in [0, 1] \quad \forall\, i, j = 1, 2, \dots, X \tag{1.8}$$
and
$$\sum_{i=1}^{X} [T]_{ij} = 1 \quad \forall\, j = 1, 2, \dots, X. \tag{1.9}$$
This means that the elements of T are non-negative and that the elements of each column
of T sum to one.
1.3.1.1 Example of a Markov Chain
Consider the following example. The weather in a small town can only be in one out of
three states: sunny, cloudy and rainy. Our assumption that the state-space $\mathcal{X} \subset \mathbb{N}^+$ is
without loss of generality, since we can simply assign a unique number to each one of
these states. A scientist in this town has calculated the probabilities for changes in the
weather. For example, he says that there is a seventy per cent chance of a cloudy day
turning into a rainy day after the night.
¹ It is also common to define the transition matrix as P with $[P]_{ij} = \Pr[x_{k+1} = j \mid x_k = i]$, i.e. as the transpose of the T defined here, $T^T = P$. We will use T instead of P to keep consistency with previous work in the field.
The scientist has provided us only with the following transition matrix
$$T = \begin{pmatrix} 0.8 & 0.1 & 0.2 \\ 0.1 & 0.2 & 0.4 \\ 0.1 & 0.7 & 0.4 \end{pmatrix} \tag{1.10}$$
and says that the ordering is: one - sunny, two - cloudy and three - rainy. It is common
to visualize Markov chains as finite transition systems. This representation is equivalent to
providing a transition matrix, if the transition probabilities are specified on the edges
of the graph. We can easily construct the transition system in Figure 1.1 from the
transition matrix that the scientist has provided.
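To make the weather example concrete, the sketch below checks the two stochasticity properties, Equations (1.8) and (1.9), for the matrix in Equation (1.10), draws a short sample path, and approximates the stationary distribution of Definition 1.1. It relies on the standard fact that, for a column-stochastic T, the stationary distribution satisfies π = Tπ, and approximates it by repeated multiplication. The helper names and the zero-based state indices are assumptions made for this illustration only.

```python
import random

# Transition matrix from Equation (1.10): T[i][j] = Pr[x_{k+1} = i+1 | x_k = j+1]
# (Python indices are zero-based, the states in the text are one-based).
T = [[0.8, 0.1, 0.2],
     [0.1, 0.2, 0.4],
     [0.1, 0.7, 0.4]]
STATES = ["sunny", "cloudy", "rainy"]

def is_column_stochastic(M, tol=1e-9):
    """Elements lie in [0, 1] and every column sums to one."""
    in_range = all(0.0 <= v <= 1.0 for row in M for v in row)
    sums_ok = all(abs(sum(row[j] for row in M) - 1.0) <= tol
                  for j in range(len(M[0])))
    return in_range and sums_ok

def step(T, j, rng):
    """Draw the next state given the current state j, using column j of T."""
    column = [row[j] for row in T]
    return rng.choices(range(len(column)), weights=column)[0]

def simulate(T, x0, n, rng):
    """Sample a path x_1, ..., x_n starting in state x0."""
    path = [x0]
    for _ in range(n - 1):
        path.append(step(T, path[-1], rng))
    return path

def stationary_distribution(T, iterations=200):
    """Approximate pi = T pi by power iteration from the uniform distribution."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(T[i][j] * pi[j] for j in range(n)) for i in range(n)]
    return pi

rng = random.Random(0)  # fixed seed so that the run is repeatable
path = simulate(T, x0=0, n=10, rng=rng)
pi = stationary_distribution(T)
print(is_column_stochastic(T))
print([STATES[x] for x in path])
print(pi)
```

Solving π = Tπ by hand for this matrix gives π = (4/9, 2/9, 1/3), so in the long run the town is sunny a little less than half of the time.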
1.3.2 Observations
A common extension of the Markov chain is to assume that the current state is observed
through a noisy sensor. This makes the model an HMM, since the true state of the
Markov chain is never directly seen. If we assume that the noise is discrete valued and
only takes a finite number of values, then the possible observations of the Markov chain
will lie in a finite discrete set as well. Let the number of possible discrete observations
be Y and map each observation to a unique element in the set $\{1, 2, \dots, Y\} = \mathcal{Y}$. This
makes it possible to define a matrix of the probabilities of making each observation,
given each hidden state of the Markov chain. The observation (also called emission)
made at time k is $y_k \in \mathcal{Y}$. Formally, this means that:
Definition 1.4. Let the observation (or emission) matrix², $O \in [0, 1]^{Y \times X}$, be defined
as
$$[O]_{ij} = \Pr[y_k = i \mid x_k = j]. \tag{1.11}$$
Assuming that we always make some observation, we note that O is a stochastic
matrix:
$$[O]_{ij} \in [0, 1] \quad \forall\, i = 1, 2, \dots, Y \text{ and } j = 1, 2, \dots, X \tag{1.12}$$
and
$$\sum_{i=1}^{Y} [O]_{ij} = 1 \quad \forall\, j = 1, 2, \dots, X. \tag{1.13}$$
This means that the elements in each one of the columns of O sum to one.
Figure 1.2: Example of an HMM with hidden states 1: sunny, 2: cloudy and 3: rainy, illustrating Equation (1.10) and Equation (1.14). The blue circles represent the hidden state of the system and the green squares the possible observations. Two green squares with the same number are equivalent, they correspond to the same observation, and are only drawn separated in the figure for clarity.
1.3.2.1 Example of an HMM
Continuing on the previous example, assume that we are imprisoned in a room without
any windows in the strange three-weather town. Our only method of deducing the
outside weather is by observing the outerwear the prison guard wears each day when
he comes to feed us. In this setting, the state of the weather is hidden from us, but
we can make (noisy) observations that give us clues about the outside conditions. If the
guard has a limited wardrobe, then the only possible observations might be: (1) shirt
and short pants, (2) long pants and a coat, and (3) raincoat and an umbrella.
Since the guard checks the weather forecast each morning before leaving for work, he
is fairly good at dressing appropriately for the weather. Assume that, on a sunny day,
there is a ninety per cent chance of him wearing shirt and short pants, and a ten per
cent chance of him wearing long pants and a coat. Similar probabilities can be assigned
to the two other hidden states (cloudy and rainy) and observation (raincoat and an
umbrella). We can summarize this in an observation matrix:
$$O = \begin{pmatrix} 0.9 & 0.2 & 0.0 \\ 0.1 & 0.5 & 0.2 \\ 0.0 & 0.3 & 0.8 \end{pmatrix}, \tag{1.14}$$
where we have assigned a number to each one of the observations, or in an illustration
such as Figure 1.2.
² As with the transition matrix, there is another common definition of the observation matrix: $B = O^T$. We will use O to keep consistency with previous work.
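The generative mechanism of the example can be sketched in a few lines: at each time step an outfit is drawn from the column of O selected by the current weather, and the weather then moves according to the corresponding column of T. The code below is an illustration only; sample_hmm and draw_from_column are hypothetical helper names, and the outfit labels are shorthand for the three observations listed in the example.

```python
import random

# Equation (1.10): T[i][j] = Pr[x_{k+1} = i+1 | x_k = j+1], columns sum to one.
T = [[0.8, 0.1, 0.2],
     [0.1, 0.2, 0.4],
     [0.1, 0.7, 0.4]]
# Equation (1.14): O[i][j] = Pr[y_k = i+1 | x_k = j+1], columns sum to one.
O = [[0.9, 0.2, 0.0],
     [0.1, 0.5, 0.2],
     [0.0, 0.3, 0.8]]
WEATHER = ["sunny", "cloudy", "rainy"]
OUTFIT = ["shirt and short pants", "long pants and coat", "raincoat and umbrella"]

def draw_from_column(M, j, rng):
    """Draw a row index with probabilities given by column j of M."""
    column = [row[j] for row in M]
    return rng.choices(range(len(column)), weights=column)[0]

def sample_hmm(T, O, x0, n, rng):
    """Return n hidden states and the n observations they emit."""
    states, observations = [], []
    x = x0
    for _ in range(n):
        states.append(x)
        observations.append(draw_from_column(O, x, rng))  # emission
        x = draw_from_column(T, x, rng)                   # transition
    return states, observations

rng = random.Random(1)  # fixed seed so that the run is repeatable
states, observations = sample_hmm(T, O, x0=0, n=1000, rng=rng)
for x, y in list(zip(states, observations))[:5]:
    print(WEATHER[x], "->", OUTFIT[y])
```

Only the list observations would be available to the prisoner; the list states is exactly what is hidden.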
1.4 Problem Formulation
In practice, the transition and observation matrices are usually unknown and hard to
model. The data that is readily available is usually a sequence of observations from
the system: $y_1, y_2, \dots, y_k$. The system identification (or learning) problem is to find
estimates $\hat{O}$ and $\hat{T}$ from the measured output sequence.
We will in a later chapter also be concerned with the problem of only estimating the tran-
sition matrix from a sequence of observations, when the observation matrix is assumed
to be known.
1.4.1 Example Problem
Again continuing on the previous example, we might not be lucky enough to have a
scientist friend who can provide us with a model of the weather, nor any a priori insight
into how good the guard is at dressing appropriately every morning. In this case, we can
simply make a list of the outfit each day over a period of time (say: day 1 - shirt
and short pants, day 2 - shirt and short pants, day 3 - long pants and a coat, . . . )
and then apply one of the algorithms outlined in this text to get good estimates of the
(weather) transition matrix and the (outfit) observation matrix. Once we have estimates
of these two matrices we can apply an HMM-filter to deduce the current weather and make
predictions about future weather conditions, which we can use to plan our escape from
the jail.
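As an illustration of the last step, the following sketch implements the standard recursive HMM filter under the column-stochastic convention used here: the belief over the weather is propagated through T and then reweighted by the row of O that matches the observed outfit. For simplicity the true matrices from Equations (1.10) and (1.14) stand in for the estimates, the prior is assumed uniform, and the name hmm_filter is chosen for this example only.

```python
# Stand-ins for the estimated matrices: the true T and O of the example.
T = [[0.8, 0.1, 0.2],
     [0.1, 0.2, 0.4],
     [0.1, 0.7, 0.4]]
O = [[0.9, 0.2, 0.0],
     [0.1, 0.5, 0.2],
     [0.0, 0.3, 0.8]]

def hmm_filter(T, O, prior, observations):
    """Return Pr[x_k = . | y_1, ..., y_k] for every k, given a prior on x_1."""
    n = len(T)
    belief = list(prior)
    history = []
    for k, y in enumerate(observations):
        if k > 0:
            # Prediction: propagate yesterday's belief through the chain.
            belief = [sum(T[i][j] * belief[j] for j in range(n)) for i in range(n)]
        # Correction: weight by the likelihood of today's observation, normalize.
        unnormalized = [O[y][i] * belief[i] for i in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        history.append(belief)
    return history

# Zero-based outfit indices: 0 = shirt, 1 = coat, 2 = raincoat (assumed diary).
diary = [0, 0, 1, 2, 2]
beliefs = hmm_filter(T, O, prior=[1 / 3, 1 / 3, 1 / 3], observations=diary)
for day, b in enumerate(beliefs, start=1):
    print(day, [round(p, 3) for p in b])
```

A one-step weather forecast is obtained by applying the prediction step once more to the final belief.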
1.5 Expectation-Maximization Method
The Expectation-Maximization (EM) method is one of the most widely employed meth-
ods for solving the above outlined estimation problem. Since its weaknesses are some of
the motivations of the algorithms in this text, we will provide a very brief summary of
the algorithm here.
It solves the following optimization problem iteratively:
$$\theta^* = \arg\max_{\theta \in \Theta} \Pr[y_1, y_2, \dots, y_k \mid \theta], \tag{1.15}$$
where θ are the parameters of the model, Θ is the feasible parameter set and $\theta^*$ the
maximum-likelihood estimate of the true θ. However, in EM, instead of working directly
with the likelihood function, $\Pr[y_1, y_2, \dots, y_k \mid \theta]$, an auxiliary likelihood function is
considered. After an initial guess $\theta_0$ of the true θ has been chosen, the algorithm iteratively
repeats the following two steps:
Expectation Step Calculate the auxiliary likelihood function
$$Q(\theta_{n-1}, \theta) = E\big[ \log \Pr[x_1, \dots, x_k, y_1, \dots, y_k \mid \theta] \,\big|\, y_1, \dots, y_k, \theta_{n-1} \big]. \tag{1.16}$$
Maximization Step Update the parameter estimate as
$$\theta_n = \arg\max_{\theta \in \Theta} Q(\theta_{n-1}, \theta). \tag{1.17}$$
The method is also referred to as the Baum-Welch method when it is applied to an
HMM. See for example Rabiner [23] and references therein for more details.
The two main objections to EM are that the algorithm is sometimes extremely
slow and that it is only guaranteed to converge to a local maximum of the likelihood
function.
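To make the objective in Equation (1.15) concrete, the following minimal sketch evaluates the likelihood Pr[y1, . . . , yk | θ] with the standard forward recursion. The recursion itself is not described in this section, and the two-state model and all names below are hypothetical, chosen only for illustration. The matrices are column-stochastic, and states and observations are indexed from zero.

```python
from itertools import product


def likelihood(T, O, pi, ys):
    """Return Pr[y_1, ..., y_k | theta] for theta = (T, O, pi).

    T[m][n] = Pr[x_{t+1} = m | x_t = n] and O[i][j] = Pr[y_t = i | x_t = j],
    so both matrices are column-stochastic.
    """
    X = len(pi)
    # alpha[j] = Pr[y_1, ..., y_t, x_t = j]
    alpha = [O[ys[0]][j] * pi[j] for j in range(X)]
    for y in ys[1:]:
        alpha = [O[y][m] * sum(T[m][n] * alpha[n] for n in range(X))
                 for m in range(X)]
    return sum(alpha)


# Hypothetical two-state, two-output model (illustration only).
T = [[0.9, 0.2], [0.1, 0.8]]
O = [[0.7, 0.1], [0.3, 0.9]]
pi = [0.5, 0.5]

p_single = likelihood(T, O, pi, [0])
# Sanity check: the likelihoods of all length-3 output sequences sum to one.
total = sum(likelihood(T, O, pi, list(ys)) for ys in product(range(2), repeat=3))
```

EM never needs to maximize this function directly, but any candidate θ produced by the iterations can be scored with it.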
1.6 Overview of Related Work
There have been attempts to solve the HMM identification problem by casting the
HMM to a linear stochastic state-space model and then using identification techniques
for linear systems to recover the parameters. One of the problems with this approach is
that the parameters of the HMM (i.e. the transition and observation matrices) are subject to
constraints that a usual linear state-space model is not; namely that the elements in
each column should sum to one and be non-negative.
A subset of linear systems are positive linear systems, which are closer to HMMs since
they have positivity constraints. However, they do not have the sum-to-one constraints.
The identification of such systems has been studied by for example Anderson [2]. Van-
luyten [26] gives a thorough overview of such, and related, work. He notes that it is
often wrongly stated that obtaining a valid HMM from one with negative parameters
(which is what results from some identification techniques) is just a matter of finding
a similarity transform. He develops a method of positive identification using positive
matrix factorization-methods, meaning that the estimated matrices have non-negative
elements by construction.
However, he only solves the identification problem for HMMs that are in the form
of Mealy machines. In what he refers to as a Mealy HMM, the output at the present
time does not depend solely on the current state of the system, but also on the state
the system will progress to. This closely resembles the observation operator form
employed by Jaeger [13] and Hsu et al. [12]. In their representation of an HMM, the
transition and emission matrices are grouped together.
In this thesis, the interest lies in recovering the transition and observation matrices
explicitly (which gives more insight and intuition on how the system functions). This
corresponds to the Moore HMM of Vanluyten [26], where the event of progressing to a
certain state and the event of producing a certain observation are independent. Van-
luyten [26] shows that a Moore HMM can easily be converted into a Mealy HMM by,
roughly, multiplying together the system matrices. The formula for this conversion in
Vanluyten [26, p. 68] is identical to the formula for finding the observation operators
given in Hsu et al. [12, Lemma 1].
Vanluyten [26] notes that the conversion in the other direction, i.e. from a Mealy HMM
to a Moore HMM, is also always possible, but results in a highly non-minimal
representation of the HMM, and that there is currently no known way of reducing a non-minimal
HMM to a minimal one if the positivity of the parameters has to be preserved. This
thesis will focus on identifying the parameters directly and not converting from a Mealy
to a Moore representation of the HMM.
Rodu et al. [24] make a comparison between different methods for HMM identification,
albeit only for models on the Mealy-form (i.e. not recovering explicit expressions for the
system matrices).
Hjalmarsson and Ninness [11] present a novel method for estimating the parameters of
an HMM using sub-space inspired techniques from the automatic control field. Their
algorithm is non-iterative, but cannot guarantee that the estimates are stochastically
valid.
Johnson [14] provides a linear algebraic explanation of the algorithm of Hsu et al. [12].
He does not outline the steps required to recover the transition and observation matrix
of the HMM explicitly though. This recovery step is only left as a note in an appendix
of Hsu et al. [12] and has been generalized to work on a more general class of systems
by Anandkumar et al. [1].
1.7 Outline of Thesis
Chapter 1 provides an overview of the problem at hand along with necessary concepts.
Chapter 2 treats the identification procedure when a batch of data is given using the
spectral learning algorithm of Hsu et al. [12] and the structured non-negative matrix
factorization algorithm of Lakshminarayanan and Raich [16] and Vanluyten et al. [28].
Chapter 3 is devoted to the online estimation problem. Assumptions are stated and the
structured non-negative matrix factorization algorithm is recast into a convex optimiza-
tion problem that can be solved recursively. Three methods of solving the estimation
problem are presented and discussed to some length. The result is three novel methods
of performing online estimation of the transition probabilities of an HMM.
Chapter 4 discusses some practical concerns when implementing the algorithms, along
with methods of measuring the accuracy of the estimates. This is necessary for the
benchmarks to come in the next two chapters.
Chapter 5 and Chapter 6 present numerical results from simulations of the algorithms
outlined in the previous chapters. The batch data identification methods are tested on
various systems with various amounts of available samples. The online methods are
tested on time-invariant systems, but also on a time-variant system where the dynamics
of the HMM change with time.
Chapter 7 concludes the thesis and summarizes the results. Implications for future work
are also provided.
Chapter 2
Methods for Identification Using
Batch Data
This chapter introduces two algorithms for identification of HMMs when batch data
is available. This means that the algorithms work in an offline setting: a batch of
observations is presented and the algorithms process the data to generate estimates of
the parameters of the HMM. If a new observation is made, then the process has to be
repeated from the beginning.
2.1 Introduction to Spectral Learning
Spectral learning refers to algorithms using spectral (i.e. eigen- or singular value) de-
compositions to find estimates of some unknown parameters of a system. For the case
of HMMs, this means finding estimates of the transition matrix and the observation
matrix. Hsu et al. [12] proposed a spectral method for identification of HMMs, which
has been generalized to other models by Anandkumar et al. [1].
Although Hsu et al. [12] are concerned with estimating probabilities of sequences of
outputs, i.e. estimating the cumulative joint distribution Pr[y1, y2, . . . , yk] and the con-
ditional distribution Pr[yk+1|y1, y2, . . . , yk], they leave a note (Hsu et al. [12, Appendix
C]) on how the method by Mossel and Roch [22] can be used in conjunction with their
method to recover explicit expressions for the (estimated) transition and observation
matrices. This combined method will be outlined in the subsequent sections.
The idea of Hsu et al. [12] is to relate observable quantities, correlations in triplets in
an output sequence, to the system parameters using some clever algebraic tricks. Since
a large part of the description was displaced to an appendix, some fairly crucial steps
were explained in rather few words. In this chapter, we will try to explain these steps
in more detail and emphasize some of the quirks of the algorithm.
2.2 Moments
The method by Hsu et al. [12] (which will be referred to as the spectral learning al-
gorithm) relies on getting empirical estimates of the joint probabilities of n-tuples of
observations (nth order moments), with n being 1, 2 and 3. The first and second order
moments are vector and matrix quantities, respectively:
Definition 2.1. The first order moment, $S_1 \in [0, 1]^{Y}$, is defined as
$$[S_1]_i = \Pr[y_1 = i] \quad \forall\, i = 1, 2, \ldots, Y, \tag{2.1}$$
and the second order moment, $S_{2,1} \in [0, 1]^{Y \times Y}$, is defined as
$$[S_{2,1}]_{ij} = \Pr[y_2 = i, y_1 = j] \quad \forall\, i, j = 1, 2, \ldots, Y. \tag{2.2}$$
Remark 2.2. Note that it is possible to define other second order moments, such as $S_{1,3}$,
in an analogous fashion.
The third order moment is a third order tensor, but can be represented as a matrix if
one index is fixed:
Definition 2.3. The third order moments, $S_{3,y,1} \in [0, 1]^{Y \times Y}$, are defined as
$$[S_{3,y,1}]_{ij} = \Pr[y_3 = i, y_2 = y, y_1 = j], \tag{2.3}$$
for $i, j = 1, 2, \ldots, Y$.
Remark 2.4. S1, S2,1 and S3,y,1 are denoted P1, P2,1 and P3,y,1 in Hsu et al. [12] and
related work. This notation is a bit unfortunate since it clashes with the customary
notation of the transition matrix as P and can therefore cause unnecessary confusion.
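As an illustration of Definitions 2.1 and 2.3, the sketch below forms empirical estimates of the moments by counting relative frequencies in a list of observation triplets. The function name, the nested-list layout and the toy data are assumptions made for this example; observations are indexed from zero.

```python
def empirical_moments(triplets, Y):
    """Relative-frequency estimates of S_1, S_{2,1}, S_{3,1} and S_{3,y,1}.

    triplets is a list of (y1, y2, y3) with values in 0, ..., Y-1.
    S3y1[y][i][j] estimates Pr[y3 = i, y2 = y, y1 = j].
    """
    M = len(triplets)
    S1 = [0.0] * Y
    S21 = [[0.0] * Y for _ in range(Y)]
    S31 = [[0.0] * Y for _ in range(Y)]
    S3y1 = [[[0.0] * Y for _ in range(Y)] for _ in range(Y)]
    for y1, y2, y3 in triplets:
        S1[y1] += 1.0 / M
        S21[y2][y1] += 1.0 / M        # row index: later observation
        S31[y3][y1] += 1.0 / M
        S3y1[y2][y3][y1] += 1.0 / M
    return S1, S21, S31, S3y1


# Toy data: four triplets over two possible observations.
triplets = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
S1, S21, S31, S3y1 = empirical_moments(triplets, Y=2)
```

Note the index order: the row index always belongs to the later observation, matching $[S_{2,1}]_{ij} = \Pr[y_2 = i, y_1 = j]$.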
2.3 Derivation of the Spectral Learning Algorithm
We will in this section derive relations between the moments (of which we can form
empirical estimates using measured batch data) and the HMM parameters T and O, i.e.
the transition and observation matrices. As we derive these expressions, the algorithm
will become clear. We provide a summary of the algorithm in Figure 2.1.
To begin with, consider the ith component of the first order moment vector,
$$\begin{aligned}
[S_1]_i &= \Pr[y_1 = i] \\
&= \sum_{j \in \mathcal{X}} \Pr[y_1 = i \mid x_1 = j] \Pr[x_1 = j] \\
&= \sum_{j \in \mathcal{X}} [O]_{ij} [\pi]_j \\
&= [O\pi]_i,
\end{aligned} \tag{2.4}$$
where the law of total probability was used in the second equality. This is equivalent to
$$S_1 = O\pi, \tag{2.5}$$
if written as a vector equation. An intuitive interpretation of this equation is that π
provides the probability of the Markov chain being in a certain state, and that O
then multiplies π to provide the probability of each possible observation. We will now
derive similar expressions for the higher order moments as well. A few more steps are
required to express the second order moment matrix using O and T. Consider $S_{2,1}$ in
component form,
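$$\begin{aligned}
[S_{2,1}]_{ij} &= \Pr[y_2 = i, y_1 = j] \\
&= \sum_{m \in \mathcal{X}} \Pr[y_2 = i, y_1 = j \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} \underbrace{\Pr[y_2 = i \mid x_2 = m]}_{= [O]_{im}} \Pr[y_1 = j \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} \underbrace{\Pr[y_1 = j \mid x_2 = m, x_1 = n]}_{= \Pr[y_1 = j \mid x_1 = n] = [O]_{jn}} \Pr[x_1 = n \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} [O]_{jn} \frac{\Pr[x_1 = n, x_2 = m]}{\Pr[x_2 = m]} \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} [O]_{jn} \Pr[x_2 = m \mid x_1 = n] \Pr[x_1 = n] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} [O]_{jn} [T]_{mn} [\pi]_n \\
&= \sum_{m \in \mathcal{X}} \sum_{n \in \mathcal{X}} [O]_{im} [T]_{mn} [\pi]_n [O^T]_{nj} \\
&= [O T \operatorname{diag}(\pi) O^T]_{ij},
\end{aligned} \tag{2.6}$$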
where the third and fourth equalities use the conditional independence of the Markov
chain and the fifth and sixth are applications of Bayes' formula. This can be written
as a matrix expression as
$$S_{2,1} = O T \operatorname{diag}(\pi) O^T. \tag{2.7}$$
It might be easier to gain some intuition of this equation by a simple reformulation:
$$\begin{aligned}
S_{2,1} &= O T \operatorname{diag}(\pi) O^T \\
&= O T [O \operatorname{diag}(\pi)]^T.
\end{aligned} \tag{2.8}$$
Read from right to left: π provides the probability of the Markov chain being
in a certain state, O then gives the probability of the first item in the observation pair,
T transitions the Markov chain to the next state, and the second O then
gives the probability of the second observation.
As mentioned earlier, the third order moment is a (third order) tensor, but can be
written as a second order tensor (i.e. a matrix) if one index is fixed. Let us fix the
second index to be y. In this case, we can derive the following expression for the (i, j)th
component of S3,y,1:
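$$\begin{aligned}
[S_{3,y,1}]_{ij} &= \Pr[y_3 = i, y_2 = y, y_1 = j] \\
&= \sum_{n \in \mathcal{X}} \Pr[y_3 = i, y_2 = y, y_1 = j \mid x_3 = n] \Pr[x_3 = n] \\
&= \sum_{n \in \mathcal{X}} \underbrace{\Pr[y_3 = i \mid x_3 = n]}_{= [O]_{in}} \Pr[y_2 = y, y_1 = j \mid x_3 = n] \Pr[x_3 = n] \\
&= \sum_{n \in \mathcal{X}} [O]_{in} \sum_{m \in \mathcal{X}} \Pr[y_2 = y, y_1 = j \mid x_3 = n, x_2 = m] \Pr[x_2 = m \mid x_3 = n] \Pr[x_3 = n] \\
&= \sum_{n \in \mathcal{X}} [O]_{in} \sum_{m \in \mathcal{X}} \Pr[y_2 = y \mid x_2 = m] \Pr[y_1 = j \mid x_2 = m] \Pr[x_3 = n \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{n \in \mathcal{X}} [O]_{in} \sum_{m \in \mathcal{X}} [O]_{ym} \Pr[y_1 = j \mid x_2 = m] [T]_{nm} \Pr[x_2 = m] \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} [O]_{in} [O]_{ym} [T]_{nm} \sum_{l \in \mathcal{X}} \Pr[y_1 = j \mid x_2 = m, x_1 = l] \Pr[x_1 = l \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} [O]_{in} [O]_{ym} [T]_{nm} \sum_{l \in \mathcal{X}} [O]_{jl} \Pr[x_2 = m \mid x_1 = l] \Pr[x_1 = l] \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} \sum_{l \in \mathcal{X}} [O]_{in} [O]_{ym} [T]_{nm} [O]_{jl} [T]_{ml} [\pi]_l \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} \sum_{l \in \mathcal{X}} [O]_{in} [T]_{nm} [O]_{ym} [T]_{ml} [\pi]_l [O^T]_{lj} \\
&= [O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T]_{ij}.
\end{aligned} \tag{2.9}$$
Or, expressed as a matrix equation
$$S_{3,y,1} = O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T. \tag{2.10}$$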
We will also need to consider the second order moment with a “jump”, S3,1. Instead
of redoing similar calculations as those resulting in Equation (2.7), we note that by the
law of total probability
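$$\begin{aligned}
[S_{3,1}]_{ij} &= \Pr[y_3 = i, y_1 = j] \\
&= \sum_{y \in \mathcal{Y}} \Pr[y_3 = i, y_2 = y, y_1 = j] \\
&= \sum_{y \in \mathcal{Y}} [S_{3,y,1}]_{ij}
\end{aligned} \tag{2.11}$$
and thus that
$$\begin{aligned}
S_{3,1} &= \sum_{y \in \mathcal{Y}} S_{3,y,1} \\
&= \sum_{y \in \mathcal{Y}} O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T \\
&= O T \Big( \sum_{y \in \mathcal{Y}} \operatorname{diag}(e_y^T O) \Big) T \operatorname{diag}(\pi) O^T \\
&= O T \begin{bmatrix} \sum_{y \in \mathcal{Y}} [O]_{y,1} & & 0 \\ & \ddots & \\ 0 & & \sum_{y \in \mathcal{Y}} [O]_{y,X} \end{bmatrix} T \operatorname{diag}(\pi) O^T \\
&= O T I T \operatorname{diag}(\pi) O^T \\
&= O T T \operatorname{diag}(\pi) O^T,
\end{aligned} \tag{2.12}$$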
where the fact that O is a stochastic matrix, and therefore that the elements of each
column sum to one, was used in the fifth equality.
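The matrix expressions in Equations (2.7), (2.10) and (2.12) can be checked numerically. The sketch below, which assumes NumPy is available, computes the moments of a small hypothetical HMM by brute-force summation of the joint probability over all hidden state sequences and compares them with the closed forms. The model parameters are arbitrary illustrative values, and indices start at zero.

```python
import itertools
import numpy as np

# Hypothetical model: X = 2 hidden states, Y = 3 observations.
T = np.array([[0.9, 0.2], [0.1, 0.8]])               # T[m, n] = Pr[x' = m | x = n]
O = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])   # O[i, j] = Pr[y = i | x = j]
pi = np.array([0.4, 0.6])
X, Y = 2, 3

# Brute force: sum the joint probability of (x1, x2, x3, y1, y2, y3).
S21 = np.zeros((Y, Y))
S3y1 = np.zeros((Y, Y, Y))        # S3y1[y] is the matrix S_{3,y,1}
for x1, x2, x3 in itertools.product(range(X), repeat=3):
    p_states = pi[x1] * T[x2, x1] * T[x3, x2]
    for y1, y2, y3 in itertools.product(range(Y), repeat=3):
        p = p_states * O[y1, x1] * O[y2, x2] * O[y3, x3]
        S21[y2, y1] += p
        S3y1[y2, y3, y1] += p
S31 = S3y1.sum(axis=0)            # Equation (2.11)

D = np.diag(pi)
ok_27 = np.allclose(S21, O @ T @ D @ O.T)
ok_210 = all(np.allclose(S3y1[y], O @ T @ np.diag(O[y]) @ T @ D @ O.T)
             for y in range(Y))
ok_212 = np.allclose(S31, O @ T @ T @ D @ O.T)
```

Here `np.diag(O[y])` plays the role of $\operatorname{diag}(e_y^T O)$, since row y of O is exactly $e_y^T O$.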
The spectral learning method relies on two assumptions. Firstly,
Assumption 2.5 (Hsu et al. [12]). $\pi > 0$ elementwise, and O and T have rank X.
Secondly, when Hsu et al. [12] generalized the method in Mossel and Roch [22] to handle
cases where there are more possible observations than hidden states, they introduced a
matrix $U \in \mathbb{R}^{Y \times X}$ with the following property:
Assumption 2.6 (Hsu et al. [12]). $U^T O$ is invertible.
This matrix can be freely chosen as long as Assumption 2.6 is satisfied. However, to
make the algorithm more concrete, Hsu et al. [12] provide a suggestion for how this
matrix can be chosen. We state, and slightly rephrase, Lemma 2 of Hsu et al. [12]:
Lemma 2.7. Assume that Assumption 2.5 holds. Then $\operatorname{rank}(S_{2,1}) = X$ and the matrix V of
left singular vectors of $S_{2,1}$ corresponding to non-zero singular values has $\operatorname{range}(V) =
\operatorname{range}(O)$. So taking U = V fulfills Assumption 2.6.
The interested reader can find the proof in Hsu et al. [12, p. 6]. We will from here on
assume that the choice suggested in Lemma 2.7 is made for U.
Now that we have derived expressions for all the necessary moment tensors, we will
show how they can be combined cleverly to yield an expression for recovering O. The
vital step of the algorithm is as follows: We will rewrite Equation (2.10) to introduce
Equation (2.12), which is easy since they share a common factor. This will result in a
diagonalization from which the diagonal with eigenvalues can be identified as one row
of O. As noted above, we introduce the matrix U to be able to handle the case that we
have more possible discrete observations than hidden states. Consider Equation (2.10)
pre-multiplied by $U^T$,
$$\begin{aligned}
U^T S_{3,y,1} &= U^T O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T \\
&= (U^T O T) \operatorname{diag}(e_y^T O) \underbrace{(U^T O T)^{-1} (U^T O T)}_{= I} T \operatorname{diag}(\pi) O^T \\
&= (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} U^T \underbrace{O T T \operatorname{diag}(\pi) O^T}_{= S_{3,1} \text{ by } (2.12)} \\
&= (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} U^T S_{3,1}.
\end{aligned} \tag{2.13}$$
$U^T S_{3,1}$ has full row rank by Assumption 2.5 and Assumption 2.6, so multiplying by its
Moore-Penrose pseudo-inverse (see, for example, Golub and Van Loan [10] for details)
from the right gives
$$(U^T S_{3,y,1})(U^T S_{3,1})^+ = (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1}. \tag{2.14}$$
This is an eigendecomposition, so the eigenvalues of $(U^T S_{3,y,1})(U^T S_{3,1})^+$ are precisely
the elements of the yth row of the emission matrix O (i.e. $e_y^T O$). Everything on the left
hand side of Equation (2.14) can be estimated from data, which in turn allows us to
calculate an estimate of O, row by row.
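Equation (2.14) can be illustrated numerically. The following sketch, assuming NumPy and a small hypothetical HMM, builds the exact moments from Equations (2.7), (2.10) and (2.12), chooses U as in Lemma 2.7, and compares the eigenvalues of the left hand side with the rows of O. Both sides are sorted before the comparison, because `numpy.linalg.eigvals` returns eigenvalues in no order related to the hidden states.

```python
import numpy as np

# Hypothetical model: X = 2 hidden states, Y = 3 observations.
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])
pi = np.array([0.4, 0.6])
X, Y = 2, 3

D = np.diag(pi)
S21 = O @ T @ D @ O.T                      # Equation (2.7)
S31 = O @ T @ T @ D @ O.T                  # Equation (2.12)

# Lemma 2.7: left singular vectors of S21 for the X largest singular values.
U = np.linalg.svd(S21)[0][:, :X]
pinv31 = np.linalg.pinv(U.T @ S31)

rows_recovered = []
for y in range(Y):
    S3y1 = O @ T @ np.diag(O[y]) @ T @ D @ O.T    # Equation (2.10)
    B_y = (U.T @ S3y1) @ pinv31                   # left hand side of (2.14)
    rows_recovered.append(np.sort(np.linalg.eigvals(B_y).real))

rows_match = all(np.allclose(rows_recovered[y], np.sort(O[y])) for y in range(Y))
```

With exact moments the match holds up to rounding; with empirical moments the eigenvalues are only approximations of the entries of O.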
There is a subtlety here though that, for example, was not recognized by Mattfeld
[20]: the order of the eigenvalues. Equation (2.14) allows us to calculate an
estimate of one row of O at a time. But using, say, the eig-command (see footnote 1) in MATLAB,
the eigenvalues will be returned in descending order (although this is not guaranteed; see
StackOverflow [25]). If this command is used to calculate the eigenvalues of the left
hand side of Equation (2.14) for each row of O, then the assembled observation matrix
will be faulty, since the elements of each row will be sorted according to their
sizes. The elements of the rows should instead be ordered according to the hidden
states.
The trick is to exploit the fact that the same matrix, $U^T O T$, diagonalizes the left hand
side of Equation (2.14) for all $y \in \mathcal{Y}$. To get a consistent ordering amongst the elements
in the rows of O, we do an eigendecomposition for some y and then use the ordered set of
eigenvectors from this decomposition to diagonalize the left hand side of Equation (2.14)
for every other y. Mattfeld [20, see especially Algorithm 1, p. 13] made the above
mentioned error, and thus gets an inconsistent ordering of the elements amongst the
rows of the estimated observation matrix.
Hsu et al. [12] introduce some further improvements to the estimation procedure of the
observation matrix to increase the robustness. It could be that some row, let us say
row y, of O has (at least) two elements that are equal. This means that the eigenvalues
of $(U^T S_{3,y,1})(U^T S_{3,1})^+$ are not distinct. In this case, if the geometric multiplicity is
lower than the algebraic multiplicity, we would not get the required number of (linearly
independent) eigenvectors to do an eigendecomposition. Consequently, we would not be able to
invert the matrix of eigenvectors to do the diagonalization transformation and recover
the other rows of O.
We could do the initial eigendecomposition for every y ∈ Y until we find one that is valid
in the sense outlined in the previous paragraph. However, this would fail if no row of O
is valid. It is shown in Mossel and Roch [22] that if we, instead of a single row, consider
a random combination of all rows of O, then the eigenvalues will be separated with high
probability. To do this, we define the Y random variables $g_y \sim \mathcal{N}(0, 1)$, $y = 1, \ldots, Y$,
and consider the weighted summation of Equation (2.14),
$$\begin{aligned}
\sum_{y \in \mathcal{Y}} g_y (U^T S_{3,y,1})(U^T S_{3,1})^+ &= \sum_{y \in \mathcal{Y}} g_y (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} \\
&= \sum_{y \in \mathcal{Y}} (U^T O T)\, g_y \operatorname{diag}(e_y^T O) (U^T O T)^{-1} \\
&= (U^T O T) \Big( \sum_{y \in \mathcal{Y}} g_y \operatorname{diag}(e_y^T O) \Big) (U^T O T)^{-1}.
\end{aligned} \tag{2.15}$$
Footnote 1: http://se.mathworks.com/help/matlab/ref/eig.html
Performing an eigendecomposition of $\sum_{y \in \mathcal{Y}} g_y (U^T S_{3,y,1})(U^T S_{3,1})^+$ will recover the
diagonalization transformation $U^T O T$, up to permutations and scalings of the columns,
which can be used to diagonalize the left hand side of Equation (2.14) for every $y \in \mathcal{Y}$.
The permutations of the columns correspond to different labelings of the hidden states.
This is further elaborated on in Section 4.1.2 and Section 4.2. Roughly, since the hid-
den states are hidden, their naming is irrelevant in a realization since they cannot be
observed. The above discussion was concerned with guaranteeing a consistent ordering
of the elements amongst the rows of O, i.e. that we use the same labeling of the hidden
states for every possible observation, but not that we recover the order corresponding
to the true system matrices.
The diagonalization matrix we recover from the eigendecomposition of the sum on the
left hand side of Equation (2.15) would be $U^T O T K P$, where K is a diagonal matrix
of scaling factors and P is a permutation matrix, i.e. the identity matrix with some
columns swapped. Which columns of I are swapped to form P is determined by the algorithm
we use to do the eigendecomposition (for example the eig-command in MATLAB).
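We multiply both sides of Equation (2.14) by this matrix from the right and by its
inverse from the left, which yields
$$\begin{aligned}
&(U^T O T K P)^{-1} (U^T S_{3,y,1})(U^T S_{3,1})^+ (U^T O T K P) \\
&\quad = (U^T O T K P)^{-1} (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} (U^T O T K P) \\
&\quad = (K P)^{-1} (U^T O T)^{-1} (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} (U^T O T) (K P) \\
&\quad = P^{-1} K^{-1} \operatorname{diag}(e_y^T O) K P \\
&\quad = P^{-1} K^{-1} K \operatorname{diag}(e_y^T O) P \\
&\quad = P^{-1} \operatorname{diag}(e_y^T O) P,
\end{aligned} \tag{2.16}$$
since K is diagonal.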
We can thus calculate the left hand side of Equation (2.16) for every y ∈ Y to recover
each row of O as the diagonal of the resulting expression. The observation matrix we
construct using this procedure can however have some columns swapped compared to
the true observation matrix. Hsu et al. [12, p. 30] state that the recovered observation
matrix is in exact correspondence with the true observation matrix, but that is not
always true.
Once O is recovered, we can find π and T using the following relations:
$$O^+ S_1 = O^+ O \pi = \pi \tag{2.17}$$
and
$$O^+ S_{2,1} (O^+)^T \operatorname{diag}(\pi)^{-1} = O^+ (O T \operatorname{diag}(\pi) O^T)(O^+)^T \operatorname{diag}(\pi)^{-1} = T. \tag{2.18}$$
Note that the transition matrix and stationary distribution that we recover using the
above relations have the same re-ordering of the states as the observation
matrix used in the expressions.
Algorithm: Spectral Learning (SL)
Input: X - number of hidden states, Y - number of possible discrete observations, M - number of observation triplets to sample.
Output: HMM parameters $\hat{O}$, $\hat{T}$ and $\hat{\pi}$.
1. Sample M triplets of observations $(y_1, y_2, y_3)$ from the HMM and form empirical estimates $\hat{S}_1$, $\hat{S}_{2,1}$, $\hat{S}_{3,1}$ and $\hat{S}_{3,y,1}$ for $y = 1, \ldots, Y$.
2. Calculate the SVD of $\hat{S}_{2,1}$ and form $\hat{U}$ by taking the left singular vectors corresponding to the X largest singular values as columns.
3. Draw Y random variables $g_y \sim \mathcal{N}(0, 1)$, $y = 1, \ldots, Y$.
4. Perform an eigendecomposition of the left hand side of Equation (2.15) when the above estimates are used, i.e. of
$$\sum_{y \in \mathcal{Y}} g_y (\hat{U}^T \hat{S}_{3,y,1})(\hat{U}^T \hat{S}_{3,1})^+. \tag{2.19}$$
5. Use the (same) matrix of eigenvectors from Step 4 to diagonalize $(\hat{U}^T \hat{S}_{3,y,1})(\hat{U}^T \hat{S}_{3,1})^+$ for $y = 1, \ldots, Y$ and take the diagonal as row y of $\hat{O}$. The diagonalizations are performed by multiplying by the matrix of eigenvectors from the right and by its inverse from the left, in accordance with Equation (2.16).
6. Calculate
$$\hat{\pi} \leftarrow \hat{O}^+ \hat{S}_1 \tag{2.20}$$
and
$$\hat{T} \leftarrow \hat{O}^+ \hat{S}_{2,1} (\hat{O}^+)^T \operatorname{diag}(\hat{\pi})^{-1}. \tag{2.21}$$
7. Return $\hat{O}$, $\hat{T}$ and $\hat{\pi}$.
Figure 2.1: The spectral learning algorithm.
The spectral learning algorithm uses the method outlined above to recover estimates
of the observation matrix, transition matrix and stationary distribution using the estimates
$\hat{S}_1$, $\hat{S}_{2,1}$, $\hat{S}_{3,1}$ and $\hat{S}_{3,y,1}$ instead of the true tensors. The algorithm is summarized in
Figure 2.1.
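Steps 2 to 6 of the algorithm can be sketched in a few lines of NumPy. The sketch below is an illustration under simplifying assumptions, not a reference implementation: it is fed the exact moments of a small hypothetical HMM instead of empirical estimates, so that the output can be compared with the true parameters. The function name and the test model are made up for this example.

```python
import itertools
import numpy as np


def spectral_learning(S1, S21, S31, S3y1, X, rng):
    """Steps 2-6 of the spectral learning algorithm; S3y1[y] holds S_{3,y,1}."""
    Y = S21.shape[0]
    U = np.linalg.svd(S21)[0][:, :X]                         # Step 2
    pinv31 = np.linalg.pinv(U.T @ S31)
    B = [U.T @ S3y1[y] @ pinv31 for y in range(Y)]           # cf. Equation (2.14)
    g = rng.standard_normal(Y)                               # Step 3
    _, V = np.linalg.eig(sum(g[y] * B[y] for y in range(Y)))  # Step 4
    V_inv = np.linalg.inv(V)
    # Step 5: the same eigenvectors diagonalize every B[y], cf. Equation (2.16).
    O_hat = np.array([np.real(np.diag(V_inv @ B[y] @ V)) for y in range(Y)])
    O_pinv = np.linalg.pinv(O_hat)
    pi_hat = O_pinv @ S1                                     # Step 6, (2.20)
    T_hat = O_pinv @ S21 @ O_pinv.T @ np.diag(1.0 / pi_hat)  # Step 6, (2.21)
    return O_hat, T_hat, pi_hat


# Hypothetical model: X = 2 hidden states, Y = 3 observations.
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.6, 0.1], [0.3, 0.2], [0.1, 0.7]])
pi = np.array([0.4, 0.6])
D = np.diag(pi)
S1 = O @ pi
S21 = O @ T @ D @ O.T
S31 = O @ T @ T @ D @ O.T
S3y1 = [O @ T @ np.diag(O[y]) @ T @ D @ O.T for y in range(3)]

O_hat, T_hat, pi_hat = spectral_learning(
    S1, S21, S31, S3y1, X=2, rng=np.random.default_rng(0))

# The hidden states are only recovered up to a relabeling (permutation).
matches = [list(p) for p in itertools.permutations(range(2))
           if np.allclose(O_hat[:, list(p)], O)]
```

The list `matches` contains the relabelings of the hidden states under which the recovered observation matrix agrees with the true one; the same relabeling has to be applied to the rows and columns of `T_hat` and to `pi_hat` before they are compared with T and π.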
2.4 Introduction to Non-Negative Matrix Factorization
One quite serious flaw of the spectral learning algorithm is that it leaves no guarantees
that the estimated matrices are valid stochastic matrices. It is a one-shot method
that can result in a transition or observation matrix with negative (or perhaps worse,
imaginary) elements. It appears that this problem has not yet been tackled successfully
in the literature. Another approach to the identification procedure of HMMs is to
leverage the results from Non-Negative Matrix Factorization (NNMF) theory. Much of
this work is inspired by the identification methods developed in the control field for
positive linear systems.
Using an NNMF ensures the non-negativity of the estimated parameters. This is not
enough for the HMM identification problem however, since it has some further complica-
tions compared to regular linear systems: the system parameters have to be stochastic.
In particular, this means that the elements of each column of O and T have to sum to
one. These constraints have to be considered when performing the factorization.
It is possible to formulate the HMM identification problem using an NNMF as an op-
timization problem. This is done by Vanluyten et al. [28] and Lakshminarayanan and
Raich [16]. Their methods are very similar and only differ in the way that the opti-
mization problem is solved. Their methods are also conceptually close to the spectral
learning algorithm outlined in the previous section: a matrix of second order moments
is empirically estimated from data and then decomposed into factors, which can be
identified as the parameters of the HMM.
Vanluyten et al. [27] generalize the method to use higher order moments as well. This
complicates the procedure of recovering explicit expressions for the transition and ob-
servation matrix. Recovering explicit expressions is not the aim in Vanluyten et al. [27]
though, and only formulas to recover their product are derived. This is also done by
Finesso et al. [9] and Cybenko and Crespi [7] using slightly different procedures.
In this text, only the method of Vanluyten et al. [28] and Lakshminarayanan and Raich
[16] will be discussed, since it resembles the spectral learning algorithm the most.
2.5 Identification using Structured Non-Negative Matrix
Factorization
A Non-Negative Matrix Factorization (NNMF) is a decomposition of a non-negative
matrix Q into two non-negative matrices V and A such that
Q = V A. (2.22)
Non-negativity in this text refers to elementwise non-negativity of the matrices. Methods
for solving this decomposition usually relax the above constraint to
Q ≈ V A. (2.23)
See for example Lee and Seung [17] for some background on NNMF and a discussion on
how the decomposition can be performed numerically.
Equation (2.22) is the standard NNMF. The methods mentioned in the previous sec-
tion that employ a pure NNMF for the identification procedure do not recover explicit
expressions for the transition and observation matrix. Since our interest lies in actually
recovering these matrices, we will only describe the method that uses a slight variation
of the standard NNMF. Vanluyten et al. [28] introduce the Structured Non-Negative Ma-
trix Factorization (SNNMF), which is also employed by Lakshminarayanan and Raich
[16] without citing Vanluyten et al. [28] and without introducing the name SNNMF.
In an SNNMF, the decomposition of Q is still performed into two non-negative matrices
V and A (albeit different from those in Equation (2.22)), but using a slightly different
structure of the decomposition:
$$Q = V A V^T. \tag{2.24}$$
Again, the decomposition is usually done approximately and not exactly.
The insight of Vanluyten et al. [28] and Lakshminarayanan and Raich [16] is to compare
Equation (2.7) from the previous section,
$$S_{2,1} = O T \operatorname{diag}(\pi) O^T, \tag{2.7}$$
to Equation (2.24). They realize that it is possible to identify
$$Q = S_{2,1}, \tag{2.25}$$
$$V = O, \tag{2.26}$$
$$A = T \operatorname{diag}(\pi), \tag{2.27}$$
and then formulate and solve Equation (2.24) approximately as a constrained minimization
problem,
$$\begin{aligned}
\min_{V, A} \quad & \| S_{2,1} - V A V^T \| \\
\text{s.t.} \quad & A \geq 0, \quad \mathbf{1}^T A \mathbf{1} = 1, \\
& V \geq 0, \quad \mathbf{1}^T V = \mathbf{1}^T,
\end{aligned} \tag{2.28}$$
where $V \in \mathbb{R}^{Y \times X}$ and $A \in \mathbb{R}^{X \times X}$. The resulting V immediately recovers the observation
matrix, O. T and π can be recovered using the following expressions,
$$\pi = A^T \mathbf{1} \tag{2.29}$$
and
$$T = A \operatorname{diag}(\pi)^{-1}. \tag{2.30}$$
The constraints and the relations for T and π will be discussed further in the next
chapter, but it is intuitively plausible that they enforce the stochasticity of the involved
quantities.
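The recovery relations in Equations (2.29) and (2.30) amount to taking column sums of A and normalizing each column. A minimal sketch, with a hypothetical helper name and made-up test values:

```python
def recover_T_and_pi(A):
    """Recover (T, pi) from A = T diag(pi), where T is column-stochastic."""
    X = len(A)
    # pi = A^T 1: the n-th entry is the sum of column n of A.
    pi = [sum(A[m][n] for m in range(X)) for n in range(X)]
    # T = A diag(pi)^{-1}: divide column n of A by pi[n].
    T = [[A[m][n] / pi[n] for n in range(X)] for m in range(X)]
    return T, pi


# Build A from known (hypothetical) parameters and recover them again.
T_true = [[0.9, 0.2], [0.1, 0.8]]
pi_true = [0.4, 0.6]
A = [[T_true[m][n] * pi_true[n] for n in range(2)] for m in range(2)]
T_rec, pi_rec = recover_T_and_pi(A)
```

Since the columns of T sum to one, the column sums of A are exactly the elements of π, and the constraint $\mathbf{1}^T A \mathbf{1} = 1$ makes π sum to one.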
Note that the exact norm is unspecified in Equation (2.28). Vanluyten et al. [28] con-
sider the Kullback-Leibler divergence and propose an iterative algorithm for solving the
minimization problem. Lakshminarayanan and Raich [16] consider both the Kullback-
Leibler divergence and an unspecified norm, presumably the Euclidean, but provide
only an iterative method employing an alternating least squares approach for solving
the minimization problem for the unspecified norm.
The idea is then, just as in the spectral learning algorithm, to estimate $S_{2,1}$ from data
and use the estimate $\hat{S}_{2,1}$ in the minimization problem to recover estimates $\hat{O}$, $\hat{T}$ and $\hat{\pi}$. The benefit
of this approach is that the estimates are stochastically valid by construction.
However, Equation (2.28) is a non-convex optimization problem, which brings along the
question of local minima. Both the algorithm by Vanluyten et al. [28] and the algorithm
by Lakshminarayanan and Raich [16] are only guaranteed to converge to a local minimum.
This is the current trade-off between the spectral learning algorithm, which avoids the
problem with local minima but fails to guarantee the generation of valid estimates, and
the SNNMF algorithm.
The complete SNNMF algorithm is outlined in Figure 2.2, where it should be noted that
minimizing $\| \cdot \|$ is the same as minimizing $\| \cdot \|^2$.
Algorithm: Structured Non-Negative Matrix Factorization (SNNMF)
Input: X - number of hidden states, Y - number of possible discrete observations, M - number of observation pairs to sample.
Output: HMM parameters $\hat{O}$, $\hat{T}$ and $\hat{\pi}$.
1. Sample M pairs of observations $(y_1, y_2)$ from the HMM and form an empirical estimate $\hat{S}_{2,1}$.
2. Solve the optimization problem
$$\begin{aligned}
\min_{V \in \mathbb{R}^{Y \times X},\, A \in \mathbb{R}^{X \times X}} \quad & \| \hat{S}_{2,1} - V A V^T \|_F^2 \\
\text{s.t.} \quad & A \geq 0, \quad \mathbf{1}^T A \mathbf{1} = 1, \\
& V \geq 0, \quad \mathbf{1}^T V = \mathbf{1}^T.
\end{aligned} \tag{2.31}$$
3. Calculate
$$\hat{\pi} \leftarrow A^T \mathbf{1} \tag{2.32}$$
and
$$\hat{T} \leftarrow A \operatorname{diag}(\hat{\pi})^{-1}. \tag{2.33}$$
4. Return $\hat{O} \leftarrow V$, $\hat{T}$ and $\hat{\pi}$.
Figure 2.2: The structured non-negative matrix factorization algorithm.
Chapter 3
Methods for Online Identification
This chapter presents three novel methods for estimating the dynamics of the Markov
chain underlying an HMM when the dynamics of the sensor used to make observations
of the HMM are assumed to be known. We first give a brief introduction to convexity
and convex optimization, since these concepts are used when the methods are derived
in the later sections.
3.1 Introduction
In the previous chapter, we assumed that we were given a sequence of outputs from an
HMM, or could wait long enough to sample said sequence. This sequence was then used
to estimate the transition and observation matrix of the generating HMM. One possible
critique of this approach is that once a new measurement becomes available, everything
has to be recalculated from scratch. This is not a problem for applications where time,
memory and computational resources are near unlimited. However, for some real-time
applications, one or more of these factors can be severely limited.
By online identification, we refer to a recursive formulation of an identification algorithm.
This means that once a new measurement becomes available, a new estimate is generated
using only the new measurement and the old estimate. In this way, we avoid storing all
observations in memory, and we also avoid redoing expensive calculations from scratch
(such as the singular-value decomposition in the spectral learning algorithm).
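As a simple illustration of what a recursive formulation looks like, the sketch below updates a running estimate of the second order moment matrix from the old estimate and the newest pair of observations only. This running average is a generic construction assumed here for illustration; it is not one of the methods derived in this chapter, and it uses overlapping consecutive pairs from a single output sequence.

```python
def update_S21(S21, k, y_prev, y_new):
    """Update the estimate of S_{2,1} in place after the k-th observed pair.

    S21[i][j] estimates Pr[y_next = i, y_current = j]; k counts pairs from 1.
    Only the old estimate and the newest pair (y_prev, y_new) are needed.
    """
    Y = len(S21)
    for i in range(Y):
        for j in range(Y):
            indicator = 1.0 if (i == y_new and j == y_prev) else 0.0
            S21[i][j] += (indicator - S21[i][j]) / k
    return S21


# Toy output sequence over two possible observations.
ys = [0, 1, 1, 0, 1]
S21 = [[0.0, 0.0], [0.0, 0.0]]
for k in range(1, len(ys)):
    update_S21(S21, k, ys[k - 1], ys[k])
```

After processing k pairs, the running estimate equals the batch relative frequencies of those k pairs, but no past observations have to be stored.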
The aim of this chapter is to formulate the SNNMF algorithm from the previous chapter
(see Section 2.5) in a recursive form (under some assumptions). To the best of our
knowledge, this has not been done before in the literature, and is the main result of this
thesis.
Figure 3.1: Example of a convex function. The (gray) line between any two points on the graph lies above the graph. A negative gradient is shown as a red arrow. Following the direction of the negative gradient at each point (i.e. performing a gradient descent) will reach the global minimum.
3.2 Convexity and Convex Optimization
3.2.1 Introduction
We will see that the optimization problem we try to solve as part of the SNNMF, under
some assumptions, is convex. This section presents some results and methods from
convex optimization that will be used in the recursive formulation.
Most of the concepts and methods that will be discussed below can be found in Boyd
and Vandenberghe [4], which is one of the standard textbooks on convex optimization.
For completeness, we cite their definition of a convex set and a convex function:
Definition 3.1 (Boyd and Vandenberghe [4]). A set C is convex if for any $x_1, x_2 \in C$
and any $\theta \in [0, 1]$, we have
$$\theta x_1 + (1 - \theta) x_2 \in C. \tag{3.1}$$
Definition 3.2 (Boyd and Vandenberghe [4]). A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if the
domain of f is a convex set and if for all $x_1$ and $x_2$ in the domain of f, and $\theta \in [0, 1]$,
we have
$$f(\theta x_1 + (1 - \theta) x_2) \leq \theta f(x_1) + (1 - \theta) f(x_2). \tag{3.2}$$
The geometrical interpretation of a convex function is that the line between any two
points on the function’s surface always lies above (or touches) the surface. What is so
valuable about this property is that if we follow the negative gradient of the function, i.e.
the steepest slope downhill, to end up in a local minimum, then we are guaranteed that
it is also a global minimum. This is because any local minimum of a convex function is
also a global minimum (Boyd and Vandenberghe [4, c.f. Section 4.2.2]). Figure 3.1 tries
to make this intuitively plausible. See Boyd and Vandenberghe [4] for a more rigorous
discussion.
Before presenting a selection of methods used in convex optimization, we will first prove
a result that will justify our use of them in the sections to come.
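Theorem 3.3. Let $A \in \mathbb{R}^{Y \times Y}$, $B \in \mathbb{R}^{Y \times X}$ and $C \in \mathbb{R}^{X \times X}$, then $\| A - B C B^T \|$ is a
convex function in C.
Proof. Let $f(C) = \| A - B C B^T \|$, $\theta \in [0, 1]$ and $C_1, C_2 \in \mathbb{R}^{X \times X}$. Then following
Definition 3.2,
$$\begin{aligned}
f(\theta C_1 + (1 - \theta) C_2) &= \| A - B (\theta C_1 + (1 - \theta) C_2) B^T \| \\
&= \| \underbrace{\theta A + (1 - \theta) A}_{= A} - \theta B C_1 B^T - (1 - \theta) B C_2 B^T \| \\
&= \| \theta [A - B C_1 B^T] + (1 - \theta) [A - B C_2 B^T] \| \\
&\leq \| \theta [A - B C_1 B^T] \| + \| (1 - \theta) [A - B C_2 B^T] \| \\
&= \theta \| A - B C_1 B^T \| + (1 - \theta) \| A - B C_2 B^T \| \\
&= \theta f(C_1) + (1 - \theta) f(C_2).
\end{aligned} \tag{3.3}$$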
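Remark 3.4. Note that the norm is left unspecified since only the triangle inequality and
absolute homogeneity were used in the proof. Also note that $\|\cdot\|^2$ is convex whenever $\|\cdot\|$ is,
since squaring is convex and non-decreasing on the non-negative reals.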
3.2.2 The Projected Gradient Descent Method
The regular gradient descent method, mentioned briefly above, is a standard method in
optimization. The idea is to follow the steepest downhill slope of a function to reach
the bottom of a “valley”. For non-convex functions, this means that a local minimum
will be approached. If the function is convex however, then any local minimum is also
a global minimum.
Formally, the regular gradient descent method is as follows. Assume that the aim is to
minimize some function f(x), i.e. solve the problem
$$
\min_{x} \; f(x). \tag{3.4}
$$
Take x0 as an initial guess of a global minimum and then iteratively refine the guess as
xk+1 = xk − ηk∇xf(xk), (3.5)
where $\eta_k$ is the so-called step-size of the gradient descent. (Sometimes, the normalized
gradient $\nabla_x f(x_k) / \|\nabla_x f(x_k)\|$ is used and explicitly written out, but this is equivalent to
replacing $\eta_k$ by $\eta_k \times 1/\|\nabla_x f(x_k)\|$.)
This will make x move according to the steepest downward slope of the surface of f(x).
Note that there are no restrictions on what values x can take. Thus, Equation (3.5) is
an unconstrained minimization algorithm.
If the optimization problem is of the form
$$
\min_{x} \; f(x) \quad \text{s.t.} \quad x \in C, \tag{3.6}
$$
for some convex set C, then the update rule in Equation (3.5) does not guarantee that the
constraint is fulfilled.
The projected gradient descent method is a relatively straight-forward generalization of
the gradient descent method to handle constraints. In the projected gradient descent
method constraints are enforced by performing a projection of xk+1 onto the constraint
set C. The method can formally be written as
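$$
x_{k+1} = \operatorname{Proj}_C\big[x_k - \eta_k \nabla_x f(x_k)\big], \tag{3.7}
$$
where $\operatorname{Proj}_C[a]$ is the projection operator solving
$$
\operatorname{Proj}_C[a] = \arg\min_{b \in C} \|a - b\|. \tag{3.8}
$$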
A more detailed description of the projected gradient descent method, along with dis-
cussions on the convergence for different step-sizes, can be found in Fan and Yao [8] and
Wang and Xiu [30].
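To make Equations (3.7) and (3.8) concrete, the following is a minimal sketch of the projected gradient descent iteration. The toy problem (a quadratic cost with a box constraint), the constant step-size and all function names are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, step=0.1, iters=200):
    """Iterate x <- Proj_C[x - step * grad(x)], cf. Equation (3.7), with a constant step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project(x - step * grad(x))
    return x

# Hypothetical toy problem: minimize ||x - c||^2 subject to x in the box [0, 1]^2.
c = np.array([1.5, -0.3])
grad = lambda x: 2.0 * (x - c)

# For a box, the Euclidean projection of Equation (3.8) is element-wise clipping.
project_box = lambda a: np.clip(a, 0.0, 1.0)

x_star = projected_gradient_descent(grad, project_box, x0=np.zeros(2))
print(x_star)  # the point of the box closest to c
```

For this box the projection has a closed form; for other constraint sets, such as the probability simplex used later in the chapter, a dedicated projection routine is needed.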
3.2.3 The Primal-Dual Method
Another popular constrained optimization method is the primal-dual method. Unlike the
projected gradient descent method, the constraints are only enforced softly, meaning that
they will be fulfilled asymptotically.
For a problem with equality constraints,
$$
\min_{x} \; f(x) \quad \text{s.t.} \quad g(x) = 0, \tag{3.9}
$$
where g(x) ∈ Rm, the so-called Lagrangian L is introduced as
L(x, λ) = f(x) + λT g(x), (3.10)
where λ ∈ Rm. The algorithm then iterates between the primal update step,
xk+1 = xk − ηk∇xL(xk, λk) (3.11)
and the dual update step,
λk+1 = λk + ηk∇λL(xk, λk). (3.12)
A more thorough discussion can be found in Krishnamurthy and Abad [15] and Boyd
and Vandenberghe [4].
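The primal and dual updates in Equations (3.11) and (3.12) can be sketched as follows. The toy problem, the constant step-size and the iteration count are assumptions chosen for illustration; for this particular quadratic cost with one linear constraint the iteration is a contraction whenever the step-size is below 1/3.

```python
import numpy as np

# Hypothetical toy problem: minimize 0.5 * ||x - c||^2 subject to 1^T x = 1.
c = np.array([0.5, 0.2, 0.9])
f_grad = lambda x: x - c                 # gradient of f
g = lambda x: np.sum(x) - 1.0            # equality constraint g(x) = 0
g_grad = lambda x: np.ones_like(x)       # gradient of g

x, lam, eta = np.zeros(3), 0.0, 0.1
for _ in range(2000):
    grad_x = f_grad(x) + lam * g_grad(x)     # gradient of L(x, lam) w.r.t. x
    grad_lam = g(x)                          # gradient of L(x, lam) w.r.t. lam
    # Both updates use the current iterates (x_k, lam_k), as in (3.11) and (3.12).
    x, lam = x - eta * grad_x, lam + eta * grad_lam

print(x, lam)
```

Note that the iterates only satisfy the constraint in the limit, which is the "soft" enforcement described above.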
3.3 Online Structured Non-Negative Matrix Factorization
for Known Sensor Dynamics
In some applications, either the transition matrix or the observation matrix is known.
The transition matrix describes the system dynamics and can at times be modelled
using domain knowledge. The observation matrix models the noise and the sensor used
to measure the system. Since the sensor is a design parameter for some systems, we will
assume that the observation matrix is known in this chapter. Thus, we will only try to
recover the dynamics of the underlying Markov chain of the HMM.
With this assumption, we will see that the optimization problem of the SNNMF formu-
lation becomes convex. We can then apply the methods described above and be sure
that the global minimum (i.e. the one corresponding to the true transition matrix) will be reached.
It is worth reasoning about how this assumption would influence the spectral learning
algorithm, since it was the topic of the first part of this thesis. The eigendecomposition
would be avoided under this assumption since it is used for recovering O. This would
remove the possibility of ending up with negative or complex elements in the estimates.
However, there is still no guarantee that the sum-to-one condition will be satisfied.
Lindberg and Omre [19] provide an example of an application where this assumption
is made, i.e., where the observation matrix is known, but the dynamics of the Markov
chain are not. They model earthquakes using a generalized HMM with multiple layers
of hidden states. Vercauteren et al. [29] provide another application, in which they
estimate the number of competing terminals in a wireless network
in order to tune parameters that will increase its performance.
3.3.1 Derivation of the Algorithms
We will continue the discussion from where we left it in Section 2.5. To recap, the
identification problem had been formulated as the following minimization problem,
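$$
\min_{\substack{V \in \mathbb{R}^{Y \times X} \\ A \in \mathbb{R}^{X \times X}}} \; \|S_{2,1} - V A V^T\|_F^2
\quad \text{s.t.} \quad
A \ge 0, \;\; \mathbf{1}^T A \mathbf{1} = 1, \;\;
V \ge 0, \;\; \mathbf{1}^T V = \mathbf{1}^T, \tag{2.31}
$$
from which estimates of the HMM parameters could be obtained using
$$
O = V, \tag{3.13}
$$
$$
\pi = A^T \mathbf{1}, \tag{2.32}
$$
$$
T = A \operatorname{diag}(\pi)^{-1}. \tag{2.33}
$$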
This exploited the fact that we can write the second order moment matrix $S_{2,1}$ as
$$
S_{2,1} = O\, T \operatorname{diag}(\pi)\, O^T \tag{2.7}
$$
$$
\phantom{S_{2,1}} = O A O^T, \tag{3.14}
$$
with $A = T \operatorname{diag}(\pi)$.
The constraints enforced on V are fairly self-explanatory. Since we identify V = O, V ≥ 0
ensures that we do not deal with negative probabilities. $\mathbf{1}^T V = \mathbf{1}^T$ makes sure that
we get a stochastic matrix, where the elements in each column sum to one. Note that
these two constraints together ensure that every element lies in the interval [0, 1].
The constraints that we enforce on A are non-negativity and that the sum of all its
elements should be one, which follows from the fact that
$$
\mathbf{1}^T A \mathbf{1} = \mathbf{1}^T T \operatorname{diag}(\pi) \mathbf{1}
= \mathbf{1}^T T \pi
= \mathbf{1}^T \pi
= 1, \tag{3.15}
$$
since the sum of the elements of each column of T is one and π is a stochastic vector.
The estimate of the stationary distribution, Equation (2.32), follows from
$$
A^T \mathbf{1} = \operatorname{diag}(\pi)\, T^T \mathbf{1}
= \operatorname{diag}(\pi) \mathbf{1}
= \pi. \tag{3.16}
$$
Note that this ensures that π is stochastic since, from the constraint $\mathbf{1}^T A \mathbf{1} = 1$, it follows
that
$$
1 = \mathbf{1}^T A^T \mathbf{1} = \mathbf{1}^T \pi. \tag{3.17}
$$
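To derive the formula for the estimate of the transition matrix, we simply solve for
T in the definition of A, namely Equation (3.14), to get $T = A \operatorname{diag}(\pi)^{-1}$. Note that
Equation (2.33) also guarantees that T is a valid transition matrix, since
$$
\begin{aligned}
\mathbf{1}^T T &= \mathbf{1}^T A \operatorname{diag}(\pi)^{-1}
= \mathbf{1}^T A \operatorname{diag}(A^T \mathbf{1})^{-1} \\
&= \mathbf{1}^T
\begin{bmatrix}
[A]_{1,1} & \cdots & [A]_{1,X} \\
\vdots & \ddots & \vdots \\
[A]_{X,1} & \cdots & [A]_{X,X}
\end{bmatrix}
\begin{bmatrix}
\frac{1}{[A]_{1,1} + \cdots + [A]_{X,1}} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{1}{[A]_{1,X} + \cdots + [A]_{X,X}}
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{[A]_{1,1} + [A]_{2,1} + \cdots + [A]_{X,1}}{[A]_{1,1} + [A]_{2,1} + \cdots + [A]_{X,1}}
& \cdots &
\frac{[A]_{1,X} + [A]_{2,X} + \cdots + [A]_{X,X}}{[A]_{1,X} + [A]_{2,X} + \cdots + [A]_{X,X}}
\end{bmatrix}
= \mathbf{1}^T,
\end{aligned}
\tag{3.18}
$$
which shows that the elements of each column in T sum to one.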
Remark 3.5. The inverse matrix in the expression for T, Equation (2.33), is easy to calculate
since it is the inverse of a diagonal matrix: $\operatorname{diag}([\gamma_1 \, \ldots \, \gamma_n])^{-1} = \operatorname{diag}([\tfrac{1}{\gamma_1} \, \ldots \, \tfrac{1}{\gamma_n}])$.
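The relations (2.32), (2.33) and (3.15) to (3.18) can be checked numerically on a small example. The sketch below assumes a hypothetical two-state chain with a column-stochastic transition matrix; the numbers are chosen only for illustration.

```python
import numpy as np

# Hypothetical two-state chain; columns of T sum to one and T @ pi == pi.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0

A = T @ np.diag(pi)                  # A = T diag(pi), cf. Equation (3.14)

ones = np.ones(2)
pi_hat = A.T @ ones                  # Equation (2.32)
T_hat = A @ np.diag(1.0 / pi_hat)    # Equation (2.33), diagonal inverse as in Remark 3.5

print(A.sum())                       # total mass of A, Equation (3.15)
print(pi_hat)                        # recovered stationary distribution
print(T_hat.sum(axis=0))             # column sums of the recovered T, Equation (3.18)
```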
As mentioned above, the necessary assumption that we will make in this chapter is that
we know the sensor dynamics (i.e. O). Equation (2.31) then reduces to
$$
\min_{A \in \mathbb{R}^{X \times X}} \; \|S_{2,1} - O A O^T\|_F^2
\quad \text{s.t.} \quad A \ge 0, \;\; \mathbf{1}^T A \mathbf{1} = 1. \tag{3.19}
$$
Referring to Theorem 3.3 and Remark 3.4, we can conclude that the cost function is convex.
Furthermore, it is easily seen by vectorizing A (i.e. stacking the column vectors as one big vector)
that the constraint set is simply the $(X^2 - 1)$-dimensional probability simplex. Boyd and
Vandenberghe [4, Section 2.2.4] show that this is a convex set. Thus, the optimization
problem in Equation (3.19) is a convex optimization problem.
We will now leverage the methods from the previous section to solve the problem. All of
these methods require gradients of some functions. To avoid having to perform numerical
calculations of these gradients, we will derive analytical expressions.
3.3.1.1 Estimating S2,1 Recursively
However, before solving the optimization problem, we will briefly explain how S2,1 can
be estimated online.
The basic idea is to keep track of the number of times that the output has been observed
to jump between two possible outputs, i.e. keep track of the relative frequencies of pairs
of observations. There are at least two ways of numerically representing the current
estimate. We can either keep track of the probabilities that can be calculated from the
counting, i.e. keep an explicit estimate S2,1, or keep track of the exact number of times
that all possible pairs of observations have been seen and then when needed, calculate
S2,1.
Each approach has its advantages and disadvantages. It seems convenient to keep an ex-
plicit representation of S2,1 available all the time. The problem is updating the estimate
once a new observation is available. To make sure that every observation is accounted
for correctly, one has to keep track of the time k since starting the estimation procedure.
Given the previous observation $y_{k-1}$ and the new observation $y_k$, the update rule can
be written
$$
[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_k} \delta_{j,y_{k-1}}}{k}, \tag{3.20}
$$
where $\delta_{i,j}$ is the Kronecker delta. That is, $S_{2,1}$ is converted to the exact number of previously
observed jumps from output j to output i, the element corresponding to the new
observation is then increased, and everything is normalized to yield a new $S_{2,1}$.
This is the method we will use in the algorithms to be presented in this chapter. Another
approach is to keep track of the number of jumps, and then normalize this matrix to get
an estimate of $S_{2,1}$ once needed. This implicit representation has some benefits, since
we have some flexibility in how we generate our estimate of $S_{2,1}$. For example, we can
choose to weight new observations more if we are tracking a system that is time-varying.
The update procedure in this case would be
$$
[C]_{ij} \leftarrow [C]_{ij} + \delta_{i,y_k} \delta_{j,y_{k-1}} \times \rho(k) \tag{3.21}
$$
and
$$
S_{2,1} \leftarrow \frac{C}{\sum_{ij} [C]_{ij}}, \tag{3.22}
$$
where we use C as a counting matrix and ρ(k) as a time-varying weight of the
observations. Taking ρ(k) = 1 yields an unweighted estimate and is equivalent to
Equation (3.20). Taking ρ(k) as some increasing function puts more weight on newer
observations.
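Both bookkeeping schemes can be sketched in a few lines. The code below is an illustrative sketch: observations are assumed to be 0-based integer indices, and the function names and the short observation sequence are hypothetical.

```python
import numpy as np

def update_s21(S, k, y_prev, y_now):
    """Explicit update of Equation (3.20); k counts the observed pairs, starting at 1."""
    S = S * (k - 1)              # back to raw pair counts (creates a new array)
    S[y_now, y_prev] += 1.0      # row = new observation, column = previous observation
    return S / k

def update_counts(C, k, y_prev, y_now, rho=lambda k: 1.0):
    """Implicit update of Equation (3.21) with an optional time-varying weight rho(k)."""
    C = C.copy()
    C[y_now, y_prev] += rho(k)
    return C

Y = 3
ys = [0, 1, 1, 0, 2, 1, 0, 0, 2]     # hypothetical observation sequence
S = np.zeros((Y, Y))
C = np.zeros((Y, Y))
for k, (y_prev, y_now) in enumerate(zip(ys[:-1], ys[1:]), start=1):
    S = update_s21(S, k, y_prev, y_now)
    C = update_counts(C, k, y_prev, y_now)

S_from_counts = C / C.sum()          # Equation (3.22)
print(np.allclose(S, S_from_counts))
```

With ρ(k) = 1 the two representations agree; passing, for example, an increasing `rho` to `update_counts` would emphasize recent pairs.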
3.3.1.2 Projected Gradient Descent
The first method we consider for solving Equation (3.19) is the projected gradient descent
method. As explained in Section 3.2.2, we first perform a regular gradient descent and
then project the resulting estimate onto the constraint set. In our case, the constraint set
is the probability simplex. Projections onto simplices have, among others, been studied
by Chen and Ye [6] and Wang and Carreira-Perpinan [31].
We will not dwell on this matter here, but mention that in our implementation, the
method from Wang and Carreira-Perpinan [31] is used due to its ease of implementation.
The cost function that we try to minimize is $\|S_{2,1} - O A O^T\|_F^2$. We seek to calculate its
gradient with respect to A. Let
$$
f(A) = \frac{1}{2} \|S_{2,1} - O A O^T\|_F^2. \tag{3.23}
$$
It is well-known that the following relations hold for the squared Frobenius norm of a
matrix: $\|\Psi\|_F^2 = \operatorname{Tr}(\Psi^T \Psi) = \operatorname{Tr}(\Psi \Psi^T)$. Using this, we can expand Equation (3.23) as
$$
f(A) = \frac{1}{2} \|S_{2,1} - O A O^T\|_F^2
= \frac{1}{2} \operatorname{Tr}\big([S_{2,1} - O A O^T][S_{2,1}^T - O A^T O^T]\big). \tag{3.24}
$$
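Let $J_{ij}$ be the matrix of all zeros except for the element (i, j) which is a one, i.e.
$[J_{ij}]_{ab} = \delta_{i,a} \delta_{j,b}$. Also let $\Delta A = \varepsilon J_{ij}$, with $1 \gg \varepsilon > 0$, and to ease the notation in the
following equations, denote $S_{2,1}$ as simply S. Then
$$
\begin{aligned}
f(A + \Delta A) &= \frac{1}{2} \operatorname{Tr}\big([(S - O A O^T) - O \Delta A O^T][(S^T - O A^T O^T) - O \Delta A^T O^T]\big) \\
&= \underbrace{\frac{1}{2} \operatorname{Tr}\big([S - O A O^T][S^T - O A^T O^T]\big)}_{= f(A)} \\
&\quad + \frac{1}{2} \operatorname{Tr}\big(-[S - O A O^T] O \Delta A^T O^T - O \Delta A O^T [S^T - O A^T O^T] + O \Delta A O^T O \Delta A^T O^T\big) \\
&\approx f(A) + \frac{1}{2} \operatorname{Tr}\big(-[S - O A O^T] O \Delta A^T O^T - O \Delta A O^T [S^T - O A^T O^T]\big) \\
&= f(A) - \operatorname{Tr}\big([S - O A O^T] O \Delta A^T O^T\big) \\
&= f(A) - \operatorname{Tr}\big([S - O A O^T] O J_{ij}^T O^T\big) \times \varepsilon,
\end{aligned}
\tag{3.25}
$$
where terms of higher order than linear were disregarded. Comparing this expression to
the first order Taylor expansion of $f(A + \Delta A)$,
$$
f(A + \varepsilon J_{ij}) \approx f(A) + \varepsilon \frac{\partial f(A)}{\partial A_{ij}}, \tag{3.26}
$$
we conclude that
$$
\frac{\partial f(A)}{\partial A_{ij}} = -\operatorname{Tr}\big([S - O A O^T] O J_{ij}^T O^T\big)
= -\operatorname{Tr}\big([S - O A O^T] O J_{ji} O^T\big), \tag{3.27}
$$
and thus that
$$
\big[\nabla_A \|S_{2,1} - O A O^T\|_F^2\big]_{ij} = -2 \operatorname{Tr}\big([S_{2,1} - O A O^T] O J_{ji} O^T\big). \tag{3.28}
$$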
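Since the trace in Equation (3.28) picks out element (i, j) of $O^T [S_{2,1} - O A O^T] O$, the whole gradient can be written as the single matrix expression $-2\, O^T [S_{2,1} - O A O^T] O$. The sketch below compares this matrix form (a rewriting that is not stated explicitly in the text above) with the element-wise trace formula and with central finite differences on random, purely illustrative matrices.

```python
import numpy as np

def cost(S, O, A):
    """Squared Frobenius cost ||S - O A O^T||_F^2."""
    return np.linalg.norm(S - O @ A @ O.T, "fro") ** 2

def grad_cost(S, O, A):
    """Matrix form of Equation (3.28)."""
    return -2.0 * O.T @ (S - O @ A @ O.T) @ O

rng = np.random.default_rng(0)
Y, X = 4, 3
O, S, A = rng.random((Y, X)), rng.random((Y, Y)), rng.random((X, X))

eps = 1e-6
G_trace = np.zeros((X, X))
G_fd = np.zeros((X, X))
for i in range(X):
    for j in range(X):
        # Element-wise formula, literally as in Equation (3.28).
        J_ji = np.zeros((X, X))
        J_ji[j, i] = 1.0
        G_trace[i, j] = -2.0 * np.trace((S - O @ A @ O.T) @ O @ J_ji @ O.T)
        # Central finite difference in the direction J_ij.
        dA = np.zeros((X, X))
        dA[i, j] = eps
        G_fd[i, j] = (cost(S, O, A + dA) - cost(S, O, A - dA)) / (2 * eps)

G = grad_cost(S, O, A)
print(np.max(np.abs(G - G_trace)), np.max(np.abs(G - G_fd)))
```

The matrix form avoids the double loop over (i, j), which matters in an online setting where the gradient is recomputed for every new observation.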
To generate recursive estimates of the transition matrix, we first update our estimate of
$S_{2,1}$ as explained in the previous section and then perform a projected gradient descent
step on the cost function $\|S_{2,1} - O A O^T\|_F^2$ with respect to A. Since we have an explicit
expression for the gradient, we present the full algorithm in Figure 3.2.
Remark 3.6. The initial guess of A is not very important since the algorithm is globally
convergent.
Algorithm: Online SNNMF using Projected Gradient Descent Method (PGDM)
Input:
X - number of hidden states,
O - observation matrix,
yprev - previous observation,
ynow - current observation,
k - time since start of estimation procedure,
S2,1 - current estimate of S2,1,
A - current estimate of the variable A.
Output: Recursive estimates of HMM parameters; T and π.
0. Initialize the recursion with k = 1, $S_{2,1} \in \mathbb{R}^{Y \times Y}$ as a matrix of zeros and $A \in \mathbb{R}^{X \times X}$
as $I \times \frac{1}{X}$. Choose a step-size $\eta_k$.
1. Update the estimate of $S_{2,1}$,
$$
[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_{\text{now}}} \delta_{j,y_{\text{prev}}}}{k}, \tag{3.29}
$$
where $\delta_{i,j}$ is the Kronecker delta.
2. Let $[J_{ij}]_{ab} = \delta_{i,a} \delta_{j,b}$ and calculate the gradient of the cost function
$$
[G]_{ij} \leftarrow \big[\nabla_A \|S_{2,1} - O A O^T\|_F^2\big]_{ij} = -2 \operatorname{Tr}\big([S_{2,1} - O A O^T] O J_{ji} O^T\big). \tag{3.28}
$$
3. Perform the projected gradient descent
$$
A \leftarrow \operatorname{Proj}_{\Delta}\big[A - \eta_k \times G\big], \tag{3.30}
$$
where ∆ is the probability simplex.
4. Calculate and return:
$$
\pi \leftarrow A^T \mathbf{1} \tag{2.32}
$$
and
$$
T \leftarrow A \operatorname{diag}(\pi)^{-1}. \tag{2.33}
$$
5. Take
$$
y_{\text{prev}} \leftarrow y_{\text{now}} \tag{3.31}
$$
$$
k \leftarrow k + 1 \tag{3.32}
$$
and save a new measurement in ynow, then go to Step 1.
Figure 3.2: The online structured non-negative matrix factorization algorithm employing the projected gradient descent method.
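The steps of Figure 3.2 can be sketched as below. This is a minimal illustration rather than the implementation used for the results in this thesis: the simplex projection is a sort-based routine in the spirit of Wang and Carreira-Perpinan [31], the gradient is written in the matrix form $-2\, O^T [S_{2,1} - O A O^T] O$, the test system, the constant step-size and the random seed are hypothetical, and the guard against zero entries in π is an added safeguard that is not part of the algorithm as stated.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    shift = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + shift, 0.0)

def pgdm_step(S, A, O, k, y_prev, y_now, eta):
    """Steps 1-4 of Figure 3.2 for one new observation (0-based observation indices)."""
    S = S * (k - 1)
    S[y_now, y_prev] += 1.0
    S = S / k                                          # Step 1, Equation (3.29)
    G = -2.0 * O.T @ (S - O @ A @ O.T) @ O             # Step 2, Equation (3.28)
    A = project_simplex((A - eta * G).ravel()).reshape(A.shape)  # Step 3
    pi = A.T @ np.ones(A.shape[0])                     # Step 4, Equation (2.32)
    T = A / np.where(pi > 0, pi, 1.0)                  # A diag(pi)^{-1}, guarded (assumption)
    return S, A, T, pi

# Hypothetical two-state system with column-stochastic T and O.
T_true = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.8, 0.1], [0.2, 0.9]])
rng = np.random.default_rng(0)

X = 2
S, A = np.zeros((2, 2)), np.eye(X) / X                 # Step 0
x = 0
y_prev = rng.choice(2, p=O[:, x])
for k in range(1, 5001):
    x = rng.choice(2, p=T_true[:, x])                  # simulate the hidden chain
    y_now = rng.choice(2, p=O[:, x])                   # simulate the sensor
    S, A, T_hat, pi_hat = pgdm_step(S, A, O, k, y_prev, y_now, eta=0.1)
    y_prev = y_now                                     # Step 5

print(T_hat)
print(pi_hat)
```

Because of the projection, every iterate of A is non-negative and sums to one, so each returned T has unit column sums wherever the corresponding entry of π is positive.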
3.3.1.3 The Primal-Dual Method without Inequality Constraints
We will now solve the optimization problem using the primal-dual method instead. The
Lagrangian L from Equation (3.10) translates to the following expression for the problem
at hand:
$$
L(A, \lambda) = \|S_{2,1} - O A O^T\|_F^2 + \lambda(\mathbf{1}^T A \mathbf{1} - 1), \tag{3.33}
$$
where λ ∈ R. This follows from Equation (3.19) by a simple reformulation of the equality
constraint.
Note that we do not in this formulation restrict the elements of A to lie in [0, 1], only
that they sum to one. The inequality constraint can be incorporated in the primal-dual
method as well by restricting a new set of Lagrangian multipliers (i.e. extending λ)
to never change sign. This can be achieved by, for example, using a parametrization of
the extra Lagrangian multipliers that employs exponential functions. We will not pursue
that path in this thesis and will leave it as future work. However, using only the equality
constraint appears to give good empirical results.
The update rules for the primal variable (A) and the dual (λ) are
Ak+1 = Ak − ηk∇AL(Ak, λk), (3.34)
λk+1 = λk + ηk∇λL(Ak, λk). (3.35)
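Thus we will need the gradients of the Lagrangian with respect to A and λ. As in the
previous section, let $[J_{ij}]_{ab} = \delta_{i,a} \delta_{j,b}$, i.e. the matrix of all zeros except for the element
at row i and column j which is a one. Reusing the result of Equation (3.28) and with f
defined as in Equation (3.23), the first gradient can be calculated as
$$
\big[\nabla_A L(A, \lambda)\big]_{ij} = 2 \frac{\partial f(A)}{\partial A_{ij}} + \lambda \big[\nabla_A \mathbf{1}^T A \mathbf{1}\big]_{ij}
= -2 \operatorname{Tr}\big([S_{2,1} - O A O^T] O J_{ji} O^T\big) + \lambda, \tag{3.36}
$$
since
$$
\nabla_A (\mathbf{1}^T A \mathbf{1}) = \nabla_A \sum_{i,j \in \mathcal{X}} [A]_{ij}
= \begin{bmatrix}
1 & \cdots & 1 \\
\vdots & \ddots & \vdots \\
1 & \cdots & 1
\end{bmatrix}. \tag{3.37}
$$
Furthermore, the gradient with respect to λ is
$$
\nabla_\lambda L(A, \lambda) = \frac{\partial}{\partial \lambda} \Big( \|S_{2,1} - O A O^T\|_F^2 + \lambda(\mathbf{1}^T A \mathbf{1} - 1) \Big)
= \mathbf{1}^T A \mathbf{1} - 1. \tag{3.38}
$$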
With expressions for the gradients, we present the full algorithm in Figure 3.3.
3.3.2 Spherical Coordinates
The final method that we will derive is fundamentally different from the two previous
ones. They have been constrained optimization methods, due to Equation (3.19) be-
ing a constrained optimization problem. We will now reformulate the problem as an
unconstrained optimization problem, but still enforce the constraints.
We will adapt the method from Krishnamurthy and Abad [15], which cleverly exploits
the Pythagorean trigonometric identity, i.e., for any x ∈ R:
$$
\sin^2 x + \cos^2 x \equiv 1. \tag{3.44}
$$
The idea is to generalize this one-dimensional identity to a multi-dimensional identity,
and then parametrize the elements of A, which are constrained to lie on a simplex,
using the terms in the identity. Since products involving the squares of sin and cos are
guaranteed to lie in [0, 1], we end up with elements guaranteed to lie in the same interval.
Furthermore, since the identity holds, the sum of all elements will be one.
We will make use of the following set of functions (for X > 1):
$$
\Omega_n(\alpha) =
\begin{cases}
\cos[\alpha]_1 & \text{if } n = 1, \\
\cos[\alpha]_n \times \prod_{k=1}^{n-1} \sin[\alpha]_k & \text{if } 2 \le n \le X^2 - 1, \\
\sin[\alpha]_{X^2-1} \times \prod_{k=1}^{X^2-2} \sin[\alpha]_k & \text{if } n = X^2,
\end{cases}
\tag{3.45}
$$
where $\alpha \in \mathbb{R}^{X^2 - 1}$. Each $\Omega_n$ can be interpreted as the square root of one term in a
generalized Pythagorean trigonometric identity.
Algorithm: Online SNNMF using the Primal-Dual Method (PDM)
Input:
X - number of hidden states,
O - observation matrix,
yprev - previous observation,
ynow - current observation,
k - time since start of estimation procedure,
S2,1 - current estimate of S2,1,
A - current estimate of the variable A,
λ - current value of the dual variable.
Output: Recursive estimates of HMM parameters; T and π.
0. Initialize the recursion with k = 1, $S_{2,1} \in \mathbb{R}^{Y \times Y}$ as a matrix of zeros, $A \in \mathbb{R}^{X \times X}$
as $I \times \frac{1}{X}$ and λ = 0. Choose a step-size $\eta_k$.
1. Update the estimate of $S_{2,1}$,
$$
[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_{\text{now}}} \delta_{j,y_{\text{prev}}}}{k}, \tag{3.39}
$$
where $\delta_{i,j}$ is the Kronecker delta.
2. Let $[J_{ij}]_{ab} = \delta_{i,a} \delta_{j,b}$ and calculate the gradient of the Lagrangian with respect to
the primal and dual variable:
$$
\big[G_A\big]_{ij} \leftarrow \big[\nabla_A L(A, \lambda)\big]_{ij} = -2 \operatorname{Tr}\big([S_{2,1} - O A O^T] O J_{ji} O^T\big) + \lambda \tag{3.36}
$$
and, since λ is a scalar,
$$
G_\lambda \leftarrow \nabla_\lambda L(A, \lambda) = \mathbf{1}^T A \mathbf{1} - 1. \tag{3.38}
$$
3. Update the primal variable
$$
A \leftarrow A - \eta_k \times G_A. \tag{3.40}
$$
4. Update the dual variable
$$
\lambda \leftarrow \lambda + \eta_k \times G_\lambda. \tag{3.41}
$$
5. Calculate and return:
$$
\pi \leftarrow A^T \mathbf{1} \tag{2.32}
$$
and
$$
T \leftarrow A \operatorname{diag}(\pi)^{-1}. \tag{2.33}
$$
6. Take
$$
y_{\text{prev}} \leftarrow y_{\text{now}} \tag{3.42}
$$
$$
k \leftarrow k + 1 \tag{3.43}
$$
and save a new measurement in ynow, then go to Step 1.
Figure 3.3: The online structured non-negative matrix factorization algorithm employing the primal-dual method.
Remark 3.7. To get some intuition on why this is true, consider how the $\Omega_n$-functions
can be derived:
$$
\begin{aligned}
1 &\equiv \cos^2 x + \sin^2 x \\
&= \cos^2 x + \sin^2 x \times \underbrace{(\cos^2 y + \sin^2 y)}_{\equiv 1} \\
&= \cos^2 x + \sin^2 x \times (\cos^2 y + \sin^2 y \times [\cos^2 z + \sin^2 z]) \\
&\;\;\vdots
\end{aligned}
\tag{3.46}
$$
We parametrize the A matrix using the parameter vector α and the $\Omega_n$-functions as
$$
[A]_{ij} = \Omega^2_{(i-1)X+j}(\alpha). \tag{3.47}
$$
It is straightforward to check that, using this parametrization, the elements of A sum
to one and all lie in the interval [0, 1]; see Krishnamurthy and Abad [15] for
details.
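The parametrization in Equations (3.45) and (3.47) can be sketched with cumulative products. The function names below are hypothetical, and the row-major reshape is an assumption that matches the index (i − 1)X + j in Equation (3.47).

```python
import numpy as np

def omega(alpha):
    """Vector (Omega_1, ..., Omega_{X^2}) of Equation (3.45) for alpha in R^{X^2 - 1}."""
    alpha = np.asarray(alpha, dtype=float)
    # sin_prod[n] holds the product of sin(alpha_k) over the first n angles (1 for n = 0).
    sin_prod = np.concatenate(([1.0], np.cumprod(np.sin(alpha))))
    # The last function has no cosine factor, so a 1 is appended.
    cos_ext = np.concatenate((np.cos(alpha), [1.0]))
    return cos_ext * sin_prod

def A_from_alpha(alpha, X):
    """Equation (3.47): [A]_ij = Omega_{(i-1)X+j}^2, filled row by row."""
    return (omega(alpha) ** 2).reshape(X, X)

X = 3
rng = np.random.default_rng(0)
alpha = rng.uniform(0.0, np.pi / 2, size=X * X - 1)   # illustrative parameter vector
A = A_from_alpha(alpha, X)
print(A.sum(), A.min(), A.max())
```

Whatever real-valued α is supplied, the resulting A is non-negative and sums to one (up to rounding), which is what allows the constraints to be dropped from the optimization problem.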
This implies that both the constraint $\mathbf{1}^T A \mathbf{1} = 1$ and the constraint A ≥ 0 have been
dealt with. We can thus instead try to solve the unconstrained minimization problem
in α:
$$
\min_{\alpha \in \mathbb{R}^{X^2 - 1}} \; \|S_{2,1} - O A(\alpha) O^T\|_F^2. \tag{3.48}
$$
It should be noted that since we are considering sums of squares of the trigonometric
functions, it is sufficient to only consider $\alpha \in [0, \pi/2]^{X^2 - 1}$.
It is not entirely clear if this parametrization preserves the convexity of the original
problem. We note that neither $\sin^2 x$ nor $\cos^2 x$ is convex or concave on the whole interval
[0, π/2] ($\sin^2 x$ is convex on [0, π/4] and concave on [π/4, π/2], and conversely for $\cos^2 x$),
and leave it as future work to provide a (dis)proof of convexity.
We will apply the regular gradient descent method from Section 3.2.2 to update the pa-
rameter vector α since we are now dealing with an unconstrained optimization problem.
With f as in Equation (3.23), the descent update is
αk+1 = αk − 2ηk∇αf(A(αk)). (3.49)
The gradient is somewhat involved to evaluate. A partial result is given in Wang et al.
[32]. We will however perform a slightly different derivation here. Using the chain rule,
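we get
$$
\begin{aligned}
\big[\nabla_\alpha f\big]_l = \frac{\partial f}{\partial [\alpha]_l}
&= \frac{\partial f}{\partial [A]_{1,1}} \frac{\partial [A]_{1,1}}{\partial [\alpha]_l} + \cdots + \frac{\partial f}{\partial [A]_{X,X}} \frac{\partial [A]_{X,X}}{\partial [\alpha]_l} \\
&= \mathbf{1}^T \left(
\begin{bmatrix}
\frac{\partial f}{\partial [A]_{1,1}} & \cdots & \frac{\partial f}{\partial [A]_{1,X}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial [A]_{X,1}} & \cdots & \frac{\partial f}{\partial [A]_{X,X}}
\end{bmatrix}
\circ
\frac{\partial}{\partial [\alpha]_l}
\begin{bmatrix}
[A]_{1,1} & \cdots & [A]_{1,X} \\
\vdots & \ddots & \vdots \\
[A]_{X,1} & \cdots & [A]_{X,X}
\end{bmatrix}
\right) \mathbf{1} \\
&= \mathbf{1}^T \left( \frac{\partial f}{\partial A} \circ \frac{\partial A}{\partial [\alpha]_l} \right) \mathbf{1},
\end{aligned}
\tag{3.50}
$$
where $\circ$ denotes the element-wise product. In this expression, $\frac{\partial f}{\partial A}$ is known from
Section 3.3.1.2, i.e. Equation (3.27), and $\frac{\partial A}{\partial [\alpha]_l}$ can be calculated as
$$
\left[ \frac{\partial A}{\partial [\alpha]_l} \right]_{ij} = \frac{\partial [A]_{ij}}{\partial [\alpha]_l}
= \frac{\partial}{\partial [\alpha]_l} \Omega^2_{(i-1)X+j}
= 2 \times \Omega_{(i-1)X+j} \times \frac{\partial}{\partial [\alpha]_l} \Omega_{(i-1)X+j}, \tag{3.51}
$$
with the last derivative evaluating to
$$
\frac{\partial \Omega_n}{\partial [\alpha]_l} =
\begin{cases}
-\sin[\alpha]_1 & \text{if } n = 1 \text{ and } l = 1, \\
-\sin[\alpha]_n \times \prod_{k=1}^{n-1} \sin[\alpha]_k & \text{if } 2 \le n \le X^2 - 1 \text{ and } l = n, \\
\frac{\cos[\alpha]_n}{\tan[\alpha]_l} \times \prod_{k=1}^{n-1} \sin[\alpha]_k & \text{if } 2 \le n \le X^2 - 1 \text{ and } l < n, \\
\frac{1}{\tan[\alpha]_l} \times \prod_{k=1}^{X^2-1} \sin[\alpha]_k & \text{if } n = X^2, \\
0 & \text{otherwise, i.e. if } n \le X^2 - 1 \text{ and } l > n,
\end{cases}
\tag{3.52}
$$
where the last case holds because $\Omega_n$ does not depend on $[\alpha]_l$ for $l > n$.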
Thus, taking Equation (3.50), Equation (3.51) and Equation (3.52) together, everything
in the expression for∇αf is known and the gradient descent according to Equation (3.49)
can be performed.
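The chain-rule expressions can be sketched and sanity-checked against finite differences as below. The sketch assumes that $\partial \Omega_n / \partial [\alpha]_l$ is zero whenever $l > n$ (since $\Omega_n$ does not depend on those angles), that no angle has a vanishing tangent, and that A is filled row by row; the function names and the random test data are hypothetical.

```python
import numpy as np

def omega(alpha):
    """Vector (Omega_1, ..., Omega_{X^2}) of Equation (3.45)."""
    sin_prod = np.concatenate(([1.0], np.cumprod(np.sin(alpha))))
    return np.concatenate((np.cos(alpha), [1.0])) * sin_prod

def omega_jacobian(alpha):
    """J[n, l] = d Omega_{n+1} / d alpha_{l+1} (0-based), cf. Equation (3.52)."""
    m = alpha.size                       # m = X^2 - 1
    om = omega(alpha)
    J = np.zeros((m + 1, m))             # entries with l > n stay zero
    for n in range(m + 1):
        for l in range(n):               # l < n: a sine factor becomes a cosine
            J[n, l] = om[n] / np.tan(alpha[l])
        if n < m:                        # l = n: the cosine factor becomes -sine
            J[n, n] = -np.sin(alpha[n]) * np.prod(np.sin(alpha[:n]))
    return J

def f(alpha, S, O):
    """f(A(alpha)) with f as in Equation (3.23)."""
    X = O.shape[1]
    A = (omega(alpha) ** 2).reshape(X, X)
    return 0.5 * np.linalg.norm(S - O @ A @ O.T, "fro") ** 2

def grad_f(alpha, S, O):
    """Gradient of f with respect to alpha via Equations (3.50)-(3.52)."""
    X = O.shape[1]
    om = omega(alpha)
    A = (om ** 2).reshape(X, X)
    df_dA = -(O.T @ (S - O @ A @ O.T) @ O).ravel()           # Equation (3.27)
    dA_dalpha = 2.0 * om[:, None] * omega_jacobian(alpha)    # Equation (3.51)
    return df_dA @ dA_dalpha                                 # Equation (3.50)

rng = np.random.default_rng(1)
X, Y = 2, 3
O, S = rng.random((Y, X)), rng.random((Y, Y))
alpha = rng.uniform(0.2, 1.3, size=X * X - 1)

eps = 1e-6
g_fd = np.array([(f(alpha + eps * e, S, O) - f(alpha - eps * e, S, O)) / (2 * eps)
                 for e in np.eye(alpha.size)])
g = grad_f(alpha, S, O)
print(np.max(np.abs(g - g_fd)))
```

The double loop makes the cost of forming the Jacobian grow quickly with the number of states; a practical implementation would presumably reuse the cumulative sine products instead of recomputing them.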
Note that the expressions above are mostly of theoretical value, e.g. for studying the
convergence of the algorithm. In a real-time implementation, one would
preferably derive expressions that allow intermediate results to be reused.
The number of trigonometric expressions to evaluate in the above formulas
grows fast as the number of states of the Markov chain grows.
We present the full algorithm in Figure 3.4.
Algorithm: Online SNNMF using Spherical Coordinates Method (SCM)
Input:
X - number of hidden states,
O - observation matrix,
yprev - previous observation,
ynow - current observation,
k - time since start of estimation procedure,
S2,1 - current estimate of S2,1,
α - current parametrization of A.
Output: Recursive estimates of HMM parameters; T and π.
0. Initialize the recursion with k = 1, $S_{2,1} \in \mathbb{R}^{Y \times Y}$ as a matrix of zeros and $\alpha \in \mathbb{R}^{X^2 - 1}$
as a random vector. Choose a step-size $\eta_k$.
1. Update the estimate of $S_{2,1}$,
$$
[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_{\text{now}}} \delta_{j,y_{\text{prev}}}}{k}, \tag{3.53}
$$
where $\delta_{i,j}$ is the Kronecker delta.
2. Calculate the gradient of the cost function with respect to α according to
Equation (3.50), Equation (3.51) and Equation (3.52).
3. Perform the gradient descent
$$
\alpha \leftarrow \alpha - 2 \eta_k \nabla_\alpha f(A(\alpha)). \tag{3.54}
$$
4. Reconstruct A as
$$
[A]_{ij} = \Omega^2_{(i-1)X+j}, \tag{3.55}
$$
using Equation (3.45).
5. Calculate and return:
$$
\pi \leftarrow A^T \mathbf{1} \tag{2.32}
$$
and
$$
T \leftarrow A \operatorname{diag}(\pi)^{-1}. \tag{2.33}
$$
6. Take
$$
y_{\text{prev}} \leftarrow y_{\text{now}} \tag{3.56}
$$
$$
k \leftarrow k + 1 \tag{3.57}
$$
and save a new measurement in ynow, then go to Step 1.
Figure 3.4: The online structured non-negative matrix factorization algorithm employing a spherical coordinates parametrization.
Chapter 4
Notes on Implementation
4.1 Measuring the Accuracy of Estimates
To be able to compare the different methods introduced in the previous chapters, a
measure of correctness has to be defined. That is, we want a function $d(x, \hat{x})$ that, for
some numerical quantity $x$ and an estimate $\hat{x}$, gives an indication of how close $\hat{x}$ is to
$x$. There are several commonly used ones in the literature, of which a subset will be
introduced and briefly discussed below. Some are not suitable for our purpose since, for
example, they break down for estimates with negative elements.
4.1.1 Kullback-Leibler Divergence
A commonly used measure that is employed by, for example, Vanluyten et al. [28] and
Lee and Seung [17], is a modified Kullback-Leibler divergence. For two non-negative
matrices A and B of equal size, it is defined as
$$
d_{KL}(A, B) = \sum_{ij} \left( [A]_{ij} \log \frac{[A]_{ij}}{[B]_{ij}} - [A]_{ij} + [B]_{ij} \right), \tag{4.1}
$$
where the sum is over all the elements of A and B.
Burnham and Anderson [5] give an interpretation of it as the information lost when an
approximate model, in this case B, is used instead of the true model, A. The expression
above reduces to the Kullback-Leibler divergence if $\sum_{ij} [A]_{ij} = \sum_{ij} [B]_{ij} = 1$.
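Equation (4.1) translates directly into code. The sketch below assumes strictly positive entries in both matrices, so that the logarithm and the division are well defined; the function name and the example matrices are hypothetical.

```python
import numpy as np

def d_kl(A, B):
    """Modified Kullback-Leibler divergence of Equation (4.1); assumes A > 0 and B > 0."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.sum(A * np.log(A / B) - A + B))

# Two illustrative matrices whose elements each sum to one.
P = np.array([[0.4, 0.1], [0.2, 0.3]])
Q = np.array([[0.25, 0.25], [0.25, 0.25]])

print(d_kl(P, Q))                       # modified divergence
print(np.sum(P * np.log(P / Q)))        # ordinary Kullback-Leibler divergence
```

When both matrices sum to one, the two printed values coincide, because the terms $-[A]_{ij} + [B]_{ij}$ cancel in the sum.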
The problem with this measure is the logarithm term: it will only give a meaningful
answer if all elements in the matrices are non-negative. This complicates matters since
two of the methods treated in this text (the spectral learning algorithm from Section 2.1
and the primal-dual method from Section 3.2.3) give no guarantee that the elements of
the estimates will be non-negative. This is of course a criticism of the algorithms, but
it is still interesting to be able to compare their performances on different systems and
against other methods.
A possible solution could be to employ heuristics, such as clipping negative elements to
zero. This would yield sensible estimates, but the measure would still collapse due to
the division (by zero). This problem is touched upon by Hsu et al. [12, p. 10 and p. 12],
who add another condition on the considered system matrices, namely that all elements
should be larger than or equal to some α > 0. One could follow this path and clip elements to
α instead of zero, but it is still not straight-forward how to compare different methods
to each other, since there is no obvious choice of α and also because the clipping of
elements is not well motivated.
4.1.2 Eigenvalues and Eigenvectors
It is common to classify the behaviour of linear systems using the eigenvalues of the
system matrices. It is true that the eigenvalues and eigenvectors play an important
role in describing the properties of a Markov chain. For example, the second largest
eigenvalue of the transition matrix is closely related to how fast the Markov chain reaches
its stationary distribution, see Levin et al. [18] for details. A possibility for measuring
the correctness of the estimates could be to measure the proximity of the eigenvalues
of the estimated matrices to those of the true matrices. However, some problems are
apparent with this approach:
• Since we are performing identification of HMMs, we have to consider not only
the transition matrix, which is square, but also the observation matrix, which is
not necessarily square. Whenever the number of possible observations is larger
than the number of hidden states, i.e., when Y > X, the eigenvalues of O are
not well-defined. A possible alternative could be to consider the singular values
instead.
• How should the deviation from the true eigenvalues and eigenvectors be measured
and summarized?
Another concern is that renaming the states of the underlying Markov chain in an HMM
will result in an equivalent system (since they are hidden), but in a different description
in terms of the matrices used to describe it. This could shift the eigenvalues, even if the
estimates are correct.
[Figure 4.1, panels (a) System ΣA and (b) System ΣB: two hidden states and two outputs each; transition probabilities p, 1−p, q, 1−q and observation probabilities r, 1−r, s, 1−s.]
Figure 4.1: Graph depiction of the HMMs ΣA and ΣB, which are related by renaming
the hidden states. They cannot be distinguished from just sequences of observations.
To make this concrete, consider the following system,
$$
\Sigma_A = \left\{ T_A = \begin{bmatrix} p & 1-q \\ 1-p & q \end{bmatrix}, \;\;
O_A = \begin{bmatrix} r & s \\ 1-r & 1-s \end{bmatrix} \right\}, \tag{4.2}
$$
with p, q, r, s ∈ [0, 1].
Swapping the labels of the two hidden states translates to permuting the corresponding
rows and columns in the transition matrix, and permuting the corresponding columns
in the observation matrix. The resulting system description is
$$
\Sigma_B = \left\{ T_B = \begin{bmatrix} q & 1-p \\ 1-q & p \end{bmatrix}, \;\;
O_B = \begin{bmatrix} s & r \\ 1-s & 1-r \end{bmatrix} \right\}. \tag{4.3}
$$
One can easily convince oneself that the systems ΣA and ΣB are indistinguishable when
sequences of outputs are the only thing that can be observed, by for example drawing a
graph representation of the system, as seen in Figure 4.1.
However, for ΣA, the eigenvalues can be calculated to be {1, p + q − 1} for the transition
matrix and {1, r − s} for the observation matrix. For ΣB, the eigenvalues remain at
{1, p + q − 1} for the transition matrix, but move to {1, s − r} for the observation matrix.
Thus, the eigenvalues are different even though the systems would yield indistinguishable
outputs.
A comparison of the eigenvalues (and possibly the eigenvectors) would first require a
renaming of the states in the estimated matrices so to correspond to the naming in the
true matrices. After that, the two points raised above would have to be considered.
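The eigenvalue statements for ΣA and ΣB can be verified numerically. The parameter values below are arbitrary illustrative choices, and the relabeling is implemented with a hypothetical 2 × 2 swap matrix P.

```python
import numpy as np

p, q, r, s = 0.9, 0.7, 0.8, 0.3          # illustrative values in [0, 1]

T_A = np.array([[p, 1 - q], [1 - p, q]])
O_A = np.array([[r, s], [1 - r, 1 - s]])

P = np.array([[0.0, 1.0], [1.0, 0.0]])   # swaps the labels of the two hidden states
T_B = P.T @ T_A @ P                      # permute rows and columns of T
O_B = O_A @ P                            # permute columns of O

eig = lambda M: np.sort(np.linalg.eigvals(M).real)

print(eig(T_A), eig(T_B))                # identical: {p + q - 1, 1}
print(eig(O_A), eig(O_B))                # {r - s, 1} versus {s - r, 1}
```

The transition matrix undergoes a similarity transformation, which preserves eigenvalues, whereas the observation matrix is only multiplied from the right, which does not.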
4.1.3 Probability of Output Sequences
The above discussion about the problems that permuting the hidden states of an HMM
can cause when measuring correctness is valid for the Kullback-Leibler divergence as
well. Since matrix elements are compared to each other in two different realizations (the
estimated and the true), the states have to line up for the comparison to make sense.
One invariant, i.e., a quantity that is independent of how the hidden states are permuted,
is the probability of seeing a certain sequence of outputs. This measure, as defined by
Zhao and Poupart [33], is
$$
d_{seq}(T, O, \hat{T}, \hat{O}) = \sum_{y_1, \ldots, y_k \in \mathcal{T}} \big| \Pr[y_1, \ldots, y_k] - \widehat{\Pr}[y_1, \ldots, y_k] \big|^{\frac{1}{k}}, \tag{4.4}
$$
where $\mathcal{T}$ is a set of test sequences. Hsu et al. [12] derive theoretical bounds on the
convergence of the spectral learning algorithm for a similar measure, namely
$$
\sum_{y_1, \ldots, y_k \in \mathcal{T}} \big| \Pr[y_1, \ldots, y_k] - \widehat{\Pr}[y_1, \ldots, y_k] \big|, \tag{4.5}
$$
where $\mathcal{T} = \{(y_1, \ldots, y_k) : y_i \in \mathcal{Y} \text{ for } i = 1, \ldots, k\}$ is the set of all possible output
sequences of length k.
The calculation of the above joint probabilities involves the multiplication of the transi-
tion and observation matrices. This can lead to the measure indicating a well-fitted model,
even though the individual matrices can have negative elements or columns not summing
to one. It does not give a clear indication of how close the estimated matrices are to the
true ones.
This could be sufficient depending on the application. For example, if we are only inter-
ested in the conditional probability of the next observation, given all the observations
up to this time,
Pr[yk|y1, . . . , yk−1], (4.6)
then it would be sufficient to know only the product of the transition and observation
matrix, as shown in Hsu et al. [12]. However, if the system identification is performed
with the purpose of gaining some insight on how the system at hand is constructed, then
knowledge of both the transition matrix and the observation matrix is wanted, and we
would prefer to have a measure of how well these two matrices are estimated.
4.1.4 Matrix Norms
Mattfeld [20] gave an overview of how the methods in Hsu et al. [12] can be implemented
and suggested using the Frobenius norm of the difference between the estimated matrices
and the true matrices as a measure of correctness.
The Frobenius norm of a matrix is defined as
$$
\|A\|_F = \sqrt{\sum_{ij} |[A]_{ij}|^2}, \tag{4.7}
$$
where the sum is over all elements of A. This is equivalent to
$$
\|A\|_F = \sqrt{\operatorname{Tr}(A^* A)}, \tag{4.8}
$$
where $A^*$ denotes the conjugate transpose of the matrix A.
From Equation (4.7), it is clear that the measure
$$
d_F(A, B) = \|A - B\|_F^2 \tag{4.9}
$$
is the sum of the squared differences between corresponding elements of A and B. This is an
intuitive measure of how close two matrices are and is widely used in numerical linear
algebra. In our setting, one has to be careful about two things though:
• As briefly discussed in Section 4.3.3, the ordering of the hidden states of an HMM is
not unique as long as the observation matrix is permuted correspondingly. There
is no guarantee that any of the algorithms from Chapter 2 will give estimates
of the transition matrix and observation matrix with the same ordering of the
states as the true system. This is not a problem for a real system identification
problem, since the true model is unknown, but it is crucial when benchmarking
the algorithm.
Plotting $d_F(T, \hat{T})$ or $d_F(O, \hat{O})$ does not make sense unless $\hat{T}$ and $\hat{O}$ are permuted
so that the hidden states correspond to those of the true system T and O. Mattfeld
[20] failed to realize this.
• Another, less critical note is that since the Frobenius norm is a sum over all
the elements of a matrix, it naturally grows with the dimension of the matrix.
Hence, a fair comparison between a model with three hidden states (transition matrix
of dimension three, with nine elements) and a model with four hidden states (transition
matrix of dimension four, with sixteen elements) cannot be made unless some
normalization is performed first.
We suggest dividing the squared Frobenius norm by the number of elements in the
matrix, effectively giving the mean squared error of all the elements, as the measure.
For matrices $A, \hat{A} \in \mathbb{R}^{m \times n}$ this would be
\[ d_F(A, \hat{A}) = \frac{\|A - \hat{A}\|_F^2}{m \times n} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} \left| [A]_{ij} - [\hat{A}]_{ij} \right|^2}{m \times n}. \qquad (4.10) \]
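And explicitly for the matrices (and vector) that we are concerned with in this
text:
\[ d_F(T, \hat{T}) = \frac{\|T - \hat{T}\|_F^2}{X^2} = \frac{\sum_{i=1}^{X} \sum_{j=1}^{X} \left| [T]_{ij} - [\hat{T}]_{ij} \right|^2}{X^2}, \qquad (4.11) \]
\[ d_F(O, \hat{O}) = \frac{\|O - \hat{O}\|_F^2}{X \times Y} = \frac{\sum_{i=1}^{Y} \sum_{j=1}^{X} \left| [O]_{ij} - [\hat{O}]_{ij} \right|^2}{X \times Y}, \qquad (4.12) \]
\[ d_F(\pi, \hat{\pi}) = \frac{\|\pi - \hat{\pi}\|_2^2}{X} = \frac{\sum_{i=1}^{X} \left| [\pi]_i - [\hat{\pi}]_i \right|^2}{X}. \qquad (4.13) \]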
The last row is justified since the Frobenius norm is a straightforward generaliza-
tion of the Euclidean norm to matrices.
Equation (4.10) is the measure that we will use in the simulations in the chapters to
come.
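As a minimal sketch of how the measure in Equation (4.10) can be evaluated numerically, consider the following Python/NumPy function. The function name `d_f` and the example matrices are our own illustrative choices and do not come from the implementation used for the simulations.

```python
import numpy as np


def d_f(true, est):
    """Mean squared error over all elements, as in Equation (4.10).

    Works for matrices (T, O) and vectors (pi) alike; np.abs also covers
    estimates with complex entries.
    """
    true = np.asarray(true)
    est = np.asarray(est)
    if true.shape != est.shape:
        raise ValueError("true and estimated quantities must have the same shape")
    # Squared Frobenius norm of the difference divided by the number of elements.
    return float(np.sum(np.abs(true - est) ** 2) / true.size)


# Hypothetical 2x2 column-stochastic matrices, for illustration only.
T_true = np.array([[0.9, 0.2],
                   [0.1, 0.8]])
T_est = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
print(d_f(T_true, T_est))
```

Dividing by `true.size` is what makes errors comparable between systems of different dimension.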
4.2 Re-Ordering the States
To make sure that the estimated matrices can be compared to the true ones using the
dF -measure defined above, we will need to re-order the columns and rows so that they
correspond to the order of the states in the true matrices.
Given a set $M$ with $n$ elements, assign a unique number from 1 to $n!$ to each of the
possible permutations of the items in the set. Let $\mathrm{perm}_i : M \to M$ be the permutation
operator giving the $i$th permutation of the items in $M$.
Define
\[ P_i = \left[ \mathrm{perm}_i \{e_1, \dots, e_X\} \right], \qquad (4.14) \]
where $[\,\cdot\,]$ concatenates the vectors in $\cdot$ horizontally, as the $i$th permutation matrix of
dimension $X \times X$.
Now, we will let the matrix $T^*$ denote the estimated transition matrix with permuted
states that best resembles the ordering in the true transition matrix $T$.
We will calculate it as
\[ T^* = P_{i^*}^{-1} \hat{T} P_{i^*}, \qquad (4.15) \]
where
\[ i^* = \arg\min_{i \in \{1, \dots, X!\}} \|P_i^{-1} \hat{T} P_i - T\|_F^2. \qquad (4.16) \]
$P_{i^*}$ specifies how the states of the estimated quantities should be permuted so as to line up
with the order used in the true quantities. Once we have found $P_{i^*}$ and have sorted $\hat{T}$,
the estimated observation matrix and the estimated stationary distribution vector have
to be sorted correspondingly. This is done by taking
\[ O^* = \hat{O} P_{i^*} \qquad (4.17) \]
and
\[ \pi^* = \hat{\pi} P_{i^*}. \qquad (4.18) \]
The observant reader will note that this might not give the true ordering of the states
at all. It might happen that some permutation of the states, other than the true one,
gives a better fit to the true model. This might give a slightly smaller error for a small
number of samples, but as the elements of the estimated matrices start to converge, the
best-fitting permutation should coincide with the true one.
In all figures to come where dF is plotted this re-ordering is performed so that the
measure makes sense.
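The re-ordering described in this section can be sketched as a brute-force search over all $X!$ permutations. The function name `reorder_states` is hypothetical, and the sketch assumes that the stationary distribution is stored as a one-dimensional array, so that multiplying it by the permutation matrix from the right simply permutes its entries.

```python
import itertools

import numpy as np


def reorder_states(T_true, T_est, O_est, pi_est):
    """Align the estimated state ordering with the true one by exhaustive search."""
    X = T_true.shape[0]
    identity = np.eye(X)
    best_err, best_P = np.inf, identity
    for perm in itertools.permutations(range(X)):
        # Permutation matrix: the columns of the identity in permuted order.
        P = identity[:, list(perm)]
        # For a permutation matrix the inverse equals the transpose.
        err = np.linalg.norm(P.T @ T_est @ P - T_true, "fro") ** 2
        if err < best_err:
            best_err, best_P = err, P
    P = best_P
    return P.T @ T_est @ P, O_est @ P, pi_est @ P
```

The loop visits $X!$ candidates, which is cheap for a handful of states but grows very quickly with the number of hidden states.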
4.3 Some Notes on Implementing the Spectral Learning Algorithm
There are some ambiguities when it comes to implementing the spectral learning algo-
rithm of Hsu et al. [12]. Some of these will be briefly touched upon in the following
sections.
4.3.1 Faulty Matrices
The issue that the spectral learning algorithm (and also the primal-dual method, which
only enforces the constraints softly) is not guaranteed to return valid stochastic matrices
has to be handled somehow. This means that the estimated transition and observation
matrix might have elements lying outside of the interval [0, 1] and some columns’ ele-
ments might not sum to one. For the spectral learning algorithm, some elements might
even have a non-zero imaginary part due to the eigenvalue calculation.
We have chosen not to use the heuristic approach of clipping negative or complex ele-
ments and enforcing stochasticity by normalizing each column. Rather, we include a plot
of the fraction of non-stochastic estimates that the spectral learning algorithm yields as
the number of samples grows.
It has been found beneficial empirically to rerun Step 3 to Step 6 of the spectral learning
algorithm, as outlined in Figure 2.1, until feasible estimates for the transition and obser-
vation matrix are acquired. It might be the case that no such estimates are ever found,
so we have in our implementation taken 75 to be the maximum number of iterations for
rerunning these steps.
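A minimal sketch of the validity check and the rerun loop could look as follows. The callable `run_steps_3_to_6` is a hypothetical stand-in for Step 3 to Step 6 of Figure 2.1, and the numerical tolerances (`slack`, `sum_tol`) are assumptions made for this illustration.

```python
import numpy as np

# Assumed slack around the interval limits to absorb rounding errors.
EPS_SLACK = 4 * np.finfo(float).eps


def is_stochastic(M, slack=EPS_SLACK, sum_tol=1e-8):
    """Check that M is real, has elements in [0, 1] and columns summing to one."""
    M = np.asarray(M)
    if np.iscomplexobj(M):
        if np.any(np.abs(M.imag) > slack):
            return False
        M = M.real
    in_range = np.all(M >= -slack) and np.all(M <= 1.0 + slack)
    columns_sum_to_one = np.allclose(M.sum(axis=0), 1.0, atol=sum_tol)
    return bool(in_range and columns_sum_to_one)


def estimate_with_retries(run_steps_3_to_6, max_iter=75):
    """Rerun the (hypothetical) estimation steps until a feasible estimate is found."""
    T_est = O_est = None
    for _ in range(max_iter):
        T_est, O_est = run_steps_3_to_6()
        if is_stochastic(T_est) and is_stochastic(O_est):
            return T_est, O_est, True
    # No feasible estimate was found; return the last attempt and flag it.
    return T_est, O_est, False
```

Returning a flag instead of raising an error makes it straightforward to count the fraction of non-stochastic estimates over many simulations.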
4.3.2 If the Matrix is not Diagonal?
Step 5 of Figure 2.1 should result in a diagonal matrix, whose diagonal is then taken as
one row of the estimated observation matrix.
If the resulting matrix is not diagonal (due to this being a numerical implementation
with estimated quantities), one has to make a decision:
• The first option is to set all off-diagonal elements to zero.
• Another option would be to do an eigendecomposition on the resulting (non-
diagonal) matrix, and use the diagonal matrix of eigenvalues from this decom-
position. The rationale is that this should include more data (the off-diagonal
elements) in the estimate, thus giving a better estimate. One has to be careful
with the ordering of these new eigenvalues though, as mentioned at the end of
Section 2.3.
In the simulations in Chapter 5, the first option is chosen.
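The two options above can be sketched as follows, where `M` stands for the (numerically almost diagonal) matrix resulting from Step 5. Both function names are hypothetical and only illustrate the difference between the two choices.

```python
import numpy as np


def row_by_zeroing_off_diagonal(M):
    """Option 1: ignore the off-diagonal elements and keep the diagonal."""
    return np.diag(M).copy()


def row_by_eigenvalues(M):
    """Option 2: use the eigenvalues of the full matrix.

    The order returned by the solver is arbitrary, so the eigenvalues would
    still have to be matched to the hidden states afterwards.
    """
    return np.linalg.eigvals(M)
```

For an exactly diagonal matrix the two options return the same values, possibly in a different order.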
4.3.3 Separating the Eigenvalues
Hsu et al. [12] mention that the main source of instability with the spectral learning
algorithm is the risk of not getting separated eigenvalues in Step 5 of Figure 2.1. Even
if the eigenvalues are separated, they might not be very well spread apart. Analyti-
cally, this is not a problem since we will recover a non-singular similarity transform.
Numerically, however, the small separation might cause problems.
[Figure 4.2, "Influence of Eigenvalue Separation": four panels plotted against the number of samples ($10^3$ to $10^6$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the error of the π-estimate $d_F(\pi, \hat{\pi})$ and the fraction of non-stochastic estimates (of either O or T), with one curve for each of $N_g = 1, 2, 10, 25, 50$.]
Figure 4.2: Performance of the spectral learning algorithm when Step 4 and 5 are performed $N_g$ times to find the set of random variables giving the largest spread of the
eigenvalues. Every data point is averaged over 75 simulations.
One idea could be to rerun Step 4 and Step 5 a number of times, say $N_g$, and choose the
vector $g_{i^*}$ of random variables that resulted in the largest separation of the eigenvalues;
\[ i^* = \arg\max_{i \in \{1, \dots, N_g\}} \left\{ \min_{j \neq k} |\lambda_j - \lambda_k| \right\}, \qquad (4.19) \]
where the eigenvalues are evaluated as functions of $g_i$.
However, as seen in Figure 4.2, where this is done for $N_g \in \{1, 2, 10, 25, 50\}$ on Example 2
of Appendix A, a higher value of $N_g$ seems, if anything, to deteriorate the estimates.
It appears as if choosing the eigenvalues with the largest separation generates more
non-stochastic estimates of the transition and observation matrix. In the simulations in
Chapter 5, we take $N_g = 1$, i.e. we do not try to find the largest spread. We do however
rerun these steps (4 and 5) if any eigenvalues happen to be equal.
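The selection rule in Equation (4.19) can be sketched as below. Here `candidates` is assumed to be a list holding the eigenvalues obtained for each of the $N_g$ random draws; how these eigenvalues are produced (Step 4 and Step 5) is outside the scope of the sketch, and the function names are hypothetical.

```python
import numpy as np


def min_separation(eigs):
    """Smallest pairwise distance |lambda_j - lambda_k| over all j != k."""
    eigs = np.asarray(eigs)
    diff = np.abs(eigs[:, None] - eigs[None, :])
    upper = np.triu_indices(len(eigs), k=1)  # every unordered pair once
    return float(diff[upper].min())


def pick_best_draw(candidates):
    """Index i* of the draw whose eigenvalues are best separated."""
    separations = [min_separation(e) for e in candidates]
    return int(np.argmax(separations))
```

A minimum separation of zero signals repeated eigenvalues, which is the case where Step 4 and Step 5 are rerun.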
Chapter 5
Numerical Results for Spectral Learning and Structured Non-Negative Matrix Factorization
In the plots in this chapter, samples refers to the total amount of observation samples
used for estimating the parameters of the HMM and not the number of observation pairs
or triplets.
The measure of correctness utilized is dF (·, ·), which is the mean squared error of all the
elements of the matrix in question. See Section 4.1.4 for a discussion.
We report the fraction of matrices that do not have all elements in the interval [0, 1]
(fraction of non-stochastic estimates), since the spectral learning algorithm does not
guarantee that the estimated quantities fulfill this constraint.
The examples used in this chapter can be found in Appendix A and will in the plots be
referred to as Ei, where i is the number of the example.
5.1 Comparison of Spectral Learning and Structured Non-Negative Matrix Factorization
In this section, we compare the Spectral Learning (SL) algorithm to the Structured
Non-Negative Matrix Factorization (SNNMF) algorithm on two systems. These two
algorithms are described in Chapter 2.
[Figure 5.1, "Comparison of SL and SNNMF": four panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the time consumption in seconds and the fraction of non-stochastic estimates (of either O or T), with curves for SL E1, SL E2, SNNMF E1 and SNNMF E2.]
Figure 5.1: Comparison of SL and SNNMF. All data points are averaged over 20 simulations.
The first system is one with two hidden states and two possible discrete observations
(Example 1). The second system is one with three hidden states and three possible
discrete observations (Example 2).
Both algorithms were tested on 20 different realizations of the two HMMs for every
number of samples. The average $d_F$-error is plotted in Figure 5.1 along with the time
consumption and the fraction of non-stochastic estimates. For the SNNMF-algorithm,
the multistart-command in MATLAB (http://se.mathworks.com/help/gads/multistart-class.html)
was used to try to find the global minimum, with 25 different starting positions.
The spectral learning algorithm not only runs much faster than SNNMF, it also
appears to converge much closer to the true solution as the number of samples grows. As
a matter of fact, it seems as if the SNNMF algorithm fails to recover the true system
matrices, even for a large number of samples. This could be a problem of a local
minimum. The large number of initial conditions tested by the multistart-command
should however alleviate this.
For Example 2, SNNMF appears to outperform the spectral learning algorithm when
fewer than about $10^5$ samples are available. This is the point where the spectral learning
algorithm starts to provide valid stochastic estimates.
This suggests that one could try to use the spectral learning algorithm as a first al-
ternative for identifying a system (due to the very fast run-time): if it generates a
high percentage of non-stochastic estimates, then one could fall back on the SNNMF
algorithm.
5.2 Performance of the Spectral Learning Algorithm
In this section, we evaluate the spectral learning algorithm on a multitude of systems
with three hidden states to see if we can draw any conclusions on the consistency of the
algorithm.
Simulation results can be found in Figure 5.2, where every data point is an average over
20 different realizations of the HMM. Mean slopes in the log-log diagram, which give an
indication of the convergence rate, have been calculated and are provided in Table 5.1
(the data points at $10^3$ samples were disregarded in this calculation) along with the
second largest eigenvalue of the transition matrix. The second largest eigenvalue is
related to the mixing of the Markov chain and could be a possible indicator of how well
the algorithm will perform.
Example   Mean Slope            Mean Slope            Mean Slope              λ2 of T
          $d_F(T, \hat{T})$     $d_F(O, \hat{O})$     $d_F(\pi, \hat{\pi})$
2         -1.166                -1.583                -1.029                  0.30
3         -1.020                -1.213                -1.110                  0.75
4         -0.309                -0.161                -0.179                  0.40
5         -0.944                -0.361                -0.716                  0.25 + 0.43i
6         -0.956                -1.344                -0.957                  0.56
7         -0.870                -0.973                -0.727                  0.23
8         -0.355                -0.027                -0.247                  0.13
Table 5.1: Mean slopes of the $d_F$-error for the spectral learning algorithm in the log-log plot Figure 5.2 together with the second largest eigenvalue of the transition matrix
for various examples. The system matrices can be found in Appendix A.
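The two quantities reported in Table 5.1 can be computed along the following lines. This is only a sketch under stated assumptions: the slope is obtained here from a least-squares line fit in the log-log domain, and "second largest" is taken to mean second largest in modulus; neither choice is confirmed to be the one used for the table, and the function names are hypothetical.

```python
import numpy as np


def mean_loglog_slope(samples, errors):
    """Slope of a straight line fitted to (log10 samples, log10 errors)."""
    slope, _intercept = np.polyfit(np.log10(samples), np.log10(errors), 1)
    return float(slope)


def second_largest_eigenvalue(T):
    """Eigenvalue of T with the second largest modulus."""
    eigs = np.linalg.eigvals(T)
    order = np.argsort(-np.abs(eigs))  # descending modulus
    return eigs[order[1]]
```

An error that decays like one over the number of samples corresponds to a slope of -1 in this diagram.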
[Figure 5.2, "Performance of SL on Various Examples": five panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the error of the π-estimate $d_F(\pi, \hat{\pi})$, the fraction of non-stochastic estimates (of either O or T) and the time consumption in seconds, with one curve for each of the examples E2 to E8.]
Figure 5.2: Performance of the spectral learning algorithm on numerous examples (found in Appendix A). All data points are averaged over 20 simulations.
Even though there appears to be a continuous scale for the performance of the spectral
learning algorithm on the examples considered, three different classes are somewhat
apparent from Figure 5.2: no convergence, slow convergence and fast convergence. The
examples that are the easiest to classify are Example 3 and 6 that belong to the fast
convergence class and Example 4, 5 and 8 that belong to the no convergence class. The
other examples appear to be somewhere in the middle (slow convergence class).
Worth noting is that the convergence rate for all examples appears to be constant once
the spectral learning algorithm starts providing stochastic estimates (which happens for
different amounts of samples for these examples).
As discussed in Section 4.3.1, some steps of the algorithm are re-evaluated until a valid
estimate is generated, or a maximum number of iterations (75) has been reached. This
accounts for the slightly higher time consumption of Example 4, 5 and 8: since the
amount of non-stochastic estimates is very high, the algorithm has to redo those steps
up to 75 times.
A larger second largest eigenvalue of the transition matrix appears to give slightly better
performance. It is not a perfect indicator however: compare Example 4 with λ2 = 0.4
and Example 2 with λ2 = 0.3. The spectral learning algorithm fails to generate good
estimates for Example 4, even though its λ2 is larger than that of Example 2, on which
it recovers good estimates.
We next provide a comparison of the spectral learning algorithm to the standard
Expectation-Maximization (EM) algorithm. We use the implementation of the EM-algorithm
provided in MATLAB, hmmtrain (http://se.mathworks.com/help/stats/hmmtrain.html).
The limit of maximum iterations was taken as the
default value of 500. Due to the very slow convergence rate, we only calculate averages
over three realizations of the HMMs and only provide data for three of the examples
in Figure 5.2 (one from each performance class). The EM-algorithm was
initialized with random matrices. The results are seen in Figure 5.3.
This figure illustrates the critique of the EM-method fairly well. First of all, it is
very slow. The time-scale is orders of magnitude larger than for the spectral learning
algorithm. At $10^6$ to $10^7$ samples, the spectral learning algorithm generates an estimate
in less than a second, whereas the time consumption increases rapidly for EM and it
requires ten minutes for about $10^5$ samples. Secondly, it appears to be very sensitive
to what initial conditions are chosen (as seen in the actual increase in error when the
number of samples increased for Example 3), which highlights the possibility of getting
stuck in local minima.
[Figure 5.3, "Performance of the EM-method": three panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$ and the time consumption in seconds (0 to 600), with curves for EM E2, EM E3 and EM E4.]
Figure 5.3: Performance of the EM-algorithm on three examples. Every data point is an average over three realizations of the HMM. Random initial guesses were used
for starting the EM-algorithm.
Since the benchmark of the EM-algorithm was not very extensive, no hard conclusions
can be drawn from Figure 5.2 and Figure 5.3. However, the results seem to indicate that the
spectral learning algorithm does not perform worse than EM, at least not when using
random initial guesses. A good approach here would probably be to use the spectral
algorithm to generate an initial guess for the EM-algorithm, which can then be used to
refine the estimate. This is especially true when a larger number of samples is available.
The spectral learning algorithm is very fast even when the data batch size is large. The
resulting estimates are however very good on the examples where it actually converges,
so a further refinement using EM might not be needed. Note that this assumes that
the estimates recovered from the spectral learning algorithm are stochastically valid. If
they are not, and it is possible to manipulate them slightly so that they become valid,
then that is probably a good approach. If they are very far from being valid, then one
is probably better off using a random initial guess in EM.
5.3 Higher-Dimensional Systems with Spectral Learning
In this section, we evaluate how well the spectral learning algorithm performs on larger
systems. We benchmark the algorithm on a number of examples, each with an increasing
number of hidden and observable states.
The systems used in this benchmark are generated randomly. We let the transition
matrix be a perturbed shifted identity matrix, generated as:
1. Choose $X$ and let $T \in \mathbb{R}^{X \times X}$ be
\[ T = \begin{bmatrix} 0 & 1 & & 0 \\ \vdots & & \ddots & \\ 0 & \cdots & 0 & 1 \\ 0 & \cdots & 0 & 0 \end{bmatrix}. \qquad (5.1) \]
2. Perturb $T$ as
\[ [T]_{ij} \leftarrow \left| [T]_{ij} + \mathcal{N}\!\left(0, \tfrac{1}{X}\right) \right|. \qquad (5.2) \]
3. Normalize each column of $T$.
The number of observable states is assumed to be equal to the number of hidden
states, i.e. Y = X. The observation matrix is then generated in the same manner as
the transition matrix, except that it is chosen as the identity matrix in the first step,
and not a shifted one.
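The generation procedure above can be sketched as follows. Two points are assumptions on our part: the second argument of the normal distribution in Equation (5.2) is interpreted as a variance, and "normalize" is taken to mean dividing each column by its sum. The function name `random_system` is hypothetical.

```python
import numpy as np


def random_system(X, rng):
    """Generate a random pair (T, O) with Y = X, both column-stochastic."""

    def perturb_and_normalize(M):
        # Step 2: add Gaussian noise and take absolute values
        # (assumption: 1/X is the variance, so the standard deviation is sqrt(1/X)).
        M = np.abs(M + rng.normal(0.0, np.sqrt(1.0 / X), size=M.shape))
        # Step 3: make every column sum to one.
        return M / M.sum(axis=0, keepdims=True)

    T = perturb_and_normalize(np.eye(X, k=1))  # Step 1: ones on the superdiagonal
    O = perturb_and_normalize(np.eye(X))       # identity instead of shifted identity
    return T, O


T, O = random_system(5, np.random.default_rng(0))
```

Passing the random generator explicitly keeps the generated systems reproducible between runs.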
Simulations were only performed with up to nine hidden states due to the combinatorial
explosion when re-ordering the states (described in Section 4.2). To get results for larger
systems, a measure of correctness that avoids the combinatorial problem of aligning the
states with the original ordering would have to be used.
Three different matrices were generated and each used in five realizations of the HMM
for each sample size. The averages of these 15 simulations for different dimensions are
plotted in Figure 5.4.
We note the following things. First of all, the results are very inconclusive. The spectral
learning algorithm performed well on the 9-state example, extremely well on the 5-
state example, but very poorly on the 3- and 6-state examples. There appears to be some
underlying factor of the matrices, apart from the dimension, that determines how well
the spectral learning algorithm performs. This was seen in the previous section where
the examples in Appendix A could be roughly classified into three classes depending on
the performance.
[Figure 5.4, "Performance of SL as System Dimension Increases": four panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the time consumption in seconds and the fraction of non-stochastic estimates (of either O or T), with one curve for each of X = 3, 5, 6, 7, 8, 9.]
Figure 5.4: Performance of the spectral learning algorithm as the dimension of the HMM increases. All data points are averaged over 15 simulations with 3 random matrices, each evaluated over 5 simulations.
Secondly, and somewhat trivially, increasing the number of states increases the time
consumption of the algorithm.
Thirdly, (almost) every single estimate generated had a non-stochastic transition or
observation matrix. A possible explanation for this could be that the method for gener-
ating the random system matrices gives rise to some elements very close to zero. Thus,
even if the estimate is very good, but happens to have some element lying close to zero
but on the “wrong” (negative) side, then the estimate is classified as faulty. This is
especially apparent for the 5-state example; the spectral learning algorithm appears to
come very close to the true system matrices, but still fails to generate valid matrices.
In the implementation, some slack was left around zero to allow for numerical errors (a
few multiples of the machine epsilon).
This makes it hard to use the methodology suggested in Section 5.1, i.e. trust the
estimates once the fraction of non-stochastic estimates approaches zero. A milder
variant of the heuristic approach of clipping elements could be used: clip all
elements in $[-\varepsilon, 0)$ to zero and then check if the matrix is valid, where $\varepsilon > 0$ is chosen
small.
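A minimal sketch of this clipping heuristic is given below. The default value of `eps` is an arbitrary illustrative choice, the function names are hypothetical, and validity is checked here only as "all elements in [0, 1]", which is the criterion used for the fraction of non-stochastic estimates.

```python
import numpy as np


def clip_small_negatives(M, eps=1e-3):
    """Set elements in [-eps, 0) to zero and leave all other elements untouched."""
    M = np.array(M, dtype=float)  # copy; assumes a real-valued estimate
    M[(M >= -eps) & (M < 0)] = 0.0
    return M


def is_valid_after_clipping(M, eps=1e-3):
    """Check that all elements lie in [0, 1] once small negatives are clipped."""
    C = clip_small_negatives(M, eps)
    return bool(np.all(C >= 0.0) and np.all(C <= 1.0))
```

Elements that are more negative than `-eps` are deliberately left alone, so a clearly faulty estimate is still classified as faulty.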
5.4 Perturbation Analysis
From the results in the previous sections, it is clear that the performance of the spectral
learning algorithm depends heavily on the system. Of the examples in Appendix A,
Example 3 showed excellent performance, whereas Example 4 and Example 5 seemed
almost impossible to identify, even with a huge number of samples at hand.
In this section, we will perform an analysis to see if we can explain the
discrepancy between these cases. The outline of the analysis can be found in standard
textbooks on numerical analysis, see for example Moler [21].
As explained in the derivation of the spectral learning algorithm (Section 2.3), the other
parameters of the HMM are calculated from the estimated observation matrix. This
means that the accuracy of the estimated observation matrix is crucial.
Recall Step 5 of Figure 2.1, i.e., that the rows of $\hat{O}$ are obtained by diagonalizing
$(\hat{U}^T \hat{S}_{3,y,1})(\hat{U}^T \hat{S}_{3,1})^+$ for $y = 1, 2, \dots, Y$. How sensitive are the eigenvalues (which we try
to find by diagonalization) to the accuracy of the estimates $\hat{S}_{3,y,1}$ and $\hat{S}_{3,1}$?
Assume for simplicity that we are dealing with a square system (X = Y), so that we
can disregard the $U$-matrix. A very basic analysis can be performed as follows. We are
interested in the eigenvalues of $S_{3,y,1} S_{3,1}^{-1}$. For notational ease, let
\[ M = S_{3,y,1} S_{3,1}^{-1}. \qquad (5.3) \]
Write the eigendecomposition of $M$ as
\[ M = R D R^{-1}, \qquad (5.4) \]
or equivalently
\[ D = R^{-1} M R, \qquad (5.5) \]
where $D$ is a diagonal matrix with the eigenvalues of $M$ as its diagonal, and $R$ is the
matrix of eigenvectors sorted correspondingly. Now perturb $M$ by $\Delta M$ and assume that
the same matrix of eigenvectors is used in the diagonalization step. This will result in a
perturbation $\Delta D$ of the diagonal matrix of eigenvalues,
\[ D + \Delta D = R^{-1}(M + \Delta M)R = R^{-1} M R + R^{-1} \Delta M R, \qquad (5.6) \]
giving
\[ \Delta D = R^{-1} \Delta M R. \qquad (5.7) \]
Taking the norm of this expression yields
\[ \|\Delta D\| = \|R^{-1} \Delta M R\| \le \|R^{-1}\| \|R\| \|\Delta M\| = \kappa(R) \times \|\Delta M\|, \qquad (5.8) \]
where we used the definition of the condition number $\kappa(R) = \|R\| \|R^{-1}\|$. This very
simple analysis tells us that the sensitivity of the eigenvalues of $M$, i.e., the elements in one
row of the estimated observation matrix, is related to the condition number of $R$, i.e.,
the matrix of eigenvectors used in the diagonalization procedure.
Notice here that $\Delta D$ is not necessarily diagonal. Translating the above relation back to
our original variables, we get
\[ \|\Delta \operatorname{diag}(e_y^T O)\| \le \kappa(OT) \times \|\hat{S}_{3,y,1} \hat{S}_{3,1}^{-1} - S_{3,y,1} S_{3,1}^{-1}\|. \qquad (5.9) \]
Thus, the accuracy of the estimate of the observation matrix is related to the condition
number of $OT$ and the accuracy of the estimate of the moments.
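The bound in Equation (5.8) can be checked numerically on a small example. The matrices below are arbitrary illustrative choices, not quantities from any of the examples in Appendix A, and the spectral norm is assumed as the matrix norm.

```python
import numpy as np

# Hypothetical, well-conditioned matrix of eigenvectors and eigenvalues.
R = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5],
              [0.5, 0.0, 1.0]])
D = np.diag([0.2, 0.5, 0.3])
R_inv = np.linalg.inv(R)
M = R @ D @ R_inv                      # Equation (5.4)

rng = np.random.default_rng(0)
delta_M = 1e-3 * rng.standard_normal((3, 3))
delta_D = R_inv @ delta_M @ R          # Equation (5.7)

lhs = np.linalg.norm(delta_D, 2)                          # ||Delta D||
rhs = np.linalg.cond(R, 2) * np.linalg.norm(delta_M, 2)   # kappa(R) * ||Delta M||
print(lhs, rhs)
```

The inequality holds for any submultiplicative norm, so the choice of the spectral norm only affects how tight the bound is.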
Example   Error $d_F(O, \hat{O})$ with $10^7$ samples   cond(OT)
4         2.75×10^-1                                    115.4
5         4.42×10^-2                                    27.1
8         1.72×10^-2                                    208.3
7         6.06×10^-4                                    21.6
2         4.55×10^-5                                    10.8
6         4.66×10^-6                                    5.4
3         3.03×10^-7                                    2.6
Table 5.2: The result of Section 5.2 (Figure 5.2) sorted according to the error in the estimate of the observation matrix when $10^7$ samples were available (which is an indicator of how well the spectral learning algorithm has performed). The condition number of $OT$ appears to correlate well with this, except for the outlier Example 8.
In Table 5.2, we sort the results from Section 5.2 (i.e. Figure 5.2) according to the
error when $10^7$ samples were used in the estimation procedure. The condition number
of $OT$ appears to be a fairly good indicator of how well the spectral learning algorithm
performs. Except for the outlier Example 8, the condition numbers are perfectly sorted
in a descending fashion. Example 8 could perhaps be explained by the second factor in
Equation (5.9), $\|\hat{S}_{3,y,1} \hat{S}_{3,1}^{-1} - S_{3,y,1} S_{3,1}^{-1}\|$, or by the fact that $U$ has to be included in the analysis.
This could also possibly explain Example 5, which has a relatively large error, but a
condition number that is not very different from that of Example 7.
This suggests that the condition number of OT could be used as a gauge for how much
the estimates of the spectral learning algorithm can be trusted (this is not as useful as
it might seem, since this condition number is unknown when performing identification
of a system). A more detailed analysis could perhaps reveal a better relation.
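For a known system, the gauge discussed above is a one-line computation. The sketch below takes $OT$ to be the product of the observation and transition matrices and uses the 2-norm condition number; the function name `conditioning_gauge` is hypothetical.

```python
import numpy as np


def conditioning_gauge(O, T):
    """2-norm condition number of the product of O and T."""
    return float(np.linalg.cond(O @ T))
```

A value close to one indicates a well-conditioned diagonalization step, whereas the examples that were hard to identify in Table 5.2 have values in the hundreds.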
Chapter 6
Numerical Results for Online Structured Non-Negative Matrix Factorization
This chapter aims to demonstrate the performance of the recursive algorithms described
in Chapter 3. The details of all examples can be found in Appendix A.
6.1 Comparison of the Algorithms
In this section, we compare the performance of the three recursive methods derived in
Chapter 3. We do this on two systems with three hidden states and a varying amount
of possible discrete observations. One system has three possible observations (Exam-
ple 2), and one has ten possible observations (Example 3). The exact transition and
observation matrices used in the examples are found in Appendix A. The methods are
labeled as Projected Gradient Descent Method (PGDM), Primal-Dual Method (PDM)
and Spherical Coordinates Method (SCM) in the plots and tables.
Example   Method   Iteration Average [s]   Mean Slope
2         PGDM     2.38×10^-4              -0.339
2         PDM      1.18×10^-4              -0.314
2         SCM      9.01×10^-4              -0.231
3         PGDM     2.70×10^-4              -0.232
3         PDM      1.37×10^-4              -0.228
3         SCM      1.02×10^-3              -0.117
Table 6.1: Comparison of the convergence rate and time consumption of the three recursive methods on two examples (E2 with X = 3, Y = 3 and E3 with X = 3,
Y = 10). Mean Slope refers to the slopes in Figure 6.1.
[Figure 6.1, "Performance of PGDM, PDM and SCM": two panels showing the error of the T-estimate $d_F(T, \hat{T})$ against the iteration number ($10^0$ to $10^5$), one panel with curves for PGDM, PDM and SCM on E2 and one with the same methods on E3.]
Figure 6.1: Performance of the three recursive methods on two examples: E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10. Every data point is an average over ten
simulations.
There is one parameter that we are free to choose in each algorithm: the step-size $\eta_k$
in the gradient descent. We have for simplicity chosen to use a constant. A range of
different values were tested and the one found to give the best result for each method
was then chosen. These were $\eta_{\mathrm{PGDM}} = 0.25$, $\eta_{\mathrm{PDM}} = 0.05$ and $\eta_{\mathrm{SCM}} = \frac{2\pi}{10^3}$.
For each example, ten realizations of the HMM were generated and the methods were
fed one output sample at a time to recursively improve the estimate of the transition
matrix. An average was then calculated for each method of the error in each iteration.
The plot of the averages of these errors for certain iterations can be seen in Figure 6.1.
The average time for one iteration along with the average slope (in a log-log plot) of
each algorithm are reported in Table 6.1. The PGDM appears to converge slightly faster
than the two other methods. The SCM seems to struggle for a while with Example 3.
This can perhaps be resolved by using another step-size.
[Figure 6.2, "Performance of PGDM on Higher-Dimensional Systems": one panel showing the error of the T-estimate $d_F(T, \hat{T})$ against the iteration number ($10^0$ to $10^5$), with one curve for each of X = 5, 10, 20.]
Figure 6.2: Performance of PGDM as the dimension of the system increases. The systems are generated randomly with Y = X. Each data point is an average over nine
simulations with three pairs of random matrices, each used for three simulations.
The average time for one iteration of the PGDM is about twice that of the PDM.
The time could probably be reduced by using more efficient code for the projection
operation. The SCM is up to about one order of magnitude slower than the other two methods.
This is expected since the expression for the gradient (see Section 3.3.2) is quite involved
and has not been optimized for performance.
6.2 Performance as Dimension Increases
In this section, we evaluate the performance of the recursive algo-
rithms as the dimension of the system increases. The three proposed methods appeared
to perform very similarly in the previous section. Since the PGDM had the fastest
convergence rate, we will perform the benchmarks only for this method in this section.
The systems we perform the benchmarks on are generated randomly, in the same manner
as in Section 5.3. We provide results for $X \in \{5, 10, 20\}$, with Y = X. The results can
be seen in Figure 6.2. Every data point is an average over nine simulation runs. Three
random pairs of matrices were generated and used for three simulations each.
States   Method   Iteration Average [s]   Mean Slope
5        PGDM     4.34×10^-4              -0.350
10       PGDM     1.74×10^-3              -0.301
20       PGDM     8.36×10^-3              -0.281
Table 6.2: Performance of the PGDM as the dimension of the system increases. States refers to the value of X and Y (equal). Mean Slope refers to the slopes in Figure 6.2.
Perhaps somewhat unintuitively, it appears as if the performance actually increases as
the number of states increases (as seen in the decrease in error). This can be explained
by the starting guess (taken to be the identity matrix). The way of generating the
random matrices will generate a lot of elements close to zero as the dimension increases.
This makes the initial guess better and better with our performance measure (the mean
squared error of each element).
Worth noting is that the convergence rate, i.e. the slope, remains almost the same as the
number of states increases. The exact values for the average slopes in the log-log plot
are given in Table 6.2.
6.3 Tracking Time-Varying Dynamics
In this section, we demonstrate that the recursive algorithms can track a system with
time-varying dynamics. This means that either, or both, the transition matrix and the
observation matrix change over time.
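Formally, we let the system be modelled as:
\[ T(k) = \begin{cases} T_1, & k \in \{0, 1, \dots, \tau - 1\}, \\ T_2, & k \in \{\tau, \tau + 1, \dots\}, \end{cases} \qquad (6.1) \]
and
\[ O(k) = \begin{cases} O_1, & k \in \{0, 1, \dots, \tau - 1\}, \\ O_2, & k \in \{\tau, \tau + 1, \dots\}, \end{cases} \qquad (6.2) \]
where $T_1$, $T_2$, $O_1$ and $O_2$ are constant transition and observation matrices, respectively,
and $\tau$ is the time for the switch.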
We provide simulation results for two different scenarios. The first is that only the
dynamics of the Markov chain changes and the dynamics of the sensor remain constant,
i.e. O1 = O2. In the second example, both the dynamics of the Markov chain and the
dynamics of the sensor change. The exact matrices used can be found in Appendix A.
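A switching system of the form in Equations (6.1) and (6.2) can be simulated with the sketch below. It assumes column-stochastic matrices, i.e. column x of T holds the distribution of the next state and column x of O holds the distribution of the observation given the current state x. Using T(k) for the transition from time k to k + 1 is also an assumption, and the function name is hypothetical.

```python
import numpy as np


def simulate_switching_hmm(T1, T2, O1, O2, tau, n_steps, rng, x0=0):
    """Simulate states and observations of an HMM whose dynamics switch at k = tau."""
    x = x0
    states, observations = [], []
    for k in range(n_steps):
        T, O = (T1, O1) if k < tau else (T2, O2)
        # Draw the observation from column x of the current observation matrix.
        y = rng.choice(O.shape[0], p=O[:, x])
        states.append(int(x))
        observations.append(int(y))
        # Draw the next state from column x of the current transition matrix.
        x = rng.choice(T.shape[0], p=T[:, x])
    return np.array(states), np.array(observations)
```

Setting `O2` equal to `O1` gives the first scenario, where only the dynamics of the Markov chain change.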
The switch of dynamics is made at time k = τ = 21667 in Figure 6.3. The PGDM
was used and every data point is an average over ten simulations. The left plot shows
the case where the sensor does not change its dynamics (i.e. O constant, T changes)
and the right plot shows the case where both system and sensor dynamics change. Also
illustrated is the performance for two different choices of the weight coefficient in the
update of the S2,1-matrix. This was discussed in Section 3.3.1.1.
[Figure 6.3, "PGDM Tracking Time-Varying Dynamics": two panels showing the error of the T-estimate $d_F(T, \hat{T})$ against the iteration number (0 to $9 \cdot 10^4$). The left panel has curves for E6 → E2 with ρ(k) = 1 and ρ(k) = k, the right panel has curves for E2 → E9 with ρ(k) = 1 and ρ(k) = k.]
Figure 6.3: Error as PGDM tracks a time-varying system. The left plot shows the case where only the Markov chain changes (at the dashed red line), and the sensor dynamics stay constant. The right plot shows the case where both the sensor dynamics and the Markov chain change. Every data point is an average over ten simulations. Two different choices for the weight in the update of the estimate of the $S_{2,1}$ matrix are
shown.
Giving more weight to newer measurements, in these examples by taking ρ(k) = k,
speeds up the convergence after the change of dynamics. The convergence is slower
after the change of dynamics than for the initial system. This could be remedied by
putting even more weight on newer measurements, by for example taking $\rho(k) = k^2$ or
a higher power.
Nonetheless, Figure 6.3 demonstrates that PGDM can successfully be used to track a
time-varying system.
Chapter 7
Conclusions
7.1 Summary and Conclusions
The theme of this thesis has been identification of HMMs. The “standard” method
today, the EM-algorithm, is prone to problems of local minima and slow convergence.
We have implemented the spectral learning algorithm outlined in Hsu et al. [12], which
uses spectral methods to deliver one-shot estimates of the parameters of the HMM. It is
claimed that this method avoids the convergence problems of EM. We benchmarked it
on various examples and have seen that the algorithm overall performs well, but fails to
identify some systems. It is orders of magnitude faster than EM and generated better
estimates than EM for some systems when EM was started with random initial guesses.
We proposed that the difficulty of identifying some systems is related to the condition
number of the product of the observation and transition matrices, OT, and found in our
numerical examples that this gives some indication of how the spectral learning
algorithm will perform.
A downside of the spectral learning algorithm is that there is no guarantee that the
estimates are valid in the sense that the matrices are stochastic. We noticed a general
pattern: once the algorithm started to provide valid estimates, the estimates
were very good. This happened once a sufficient number of samples was available for
the estimation procedure (the required number differed between systems).
This suggests that estimates generated by the spectral learning algorithm can be
trusted if they are valid. Since the true parameters are unknown in a real-world scenario,
we propose the following workflow. Given a batch of data (observations of the HMM),
partition it and apply the spectral learning algorithm to each partition. If the percentage
of non-valid estimates is high, employ EM or some other method. If not, use the spectral
learning algorithm on the full data set. Since this gives a one-shot result, EM can then
be used to refine the estimate, with the spectral estimate as initial guess.
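The proposed workflow can be sketched as follows. The validity test assumes column-stochastic matrices; `estimator` is a hypothetical callable standing in for the spectral learning algorithm (it should return a pair of estimated matrices), and the threshold `max_invalid_fraction` is an illustrative parameter, since the text does not quantify "high".

```python
import numpy as np


def is_stochastically_valid(M, tol=1e-8):
    """True if M is real, non-negative, and every column sums to one."""
    M = np.asarray(M)
    if np.iscomplexobj(M):
        if np.any(np.abs(M.imag) > tol):
            return False
        M = M.real
    return bool(np.all(M >= -tol) and np.allclose(M.sum(axis=0), 1.0, atol=tol))


def choose_method(observations, n_parts, estimator, max_invalid_fraction=0.5):
    """Run the estimator on each partition and decide which method to trust."""
    parts = np.array_split(np.asarray(observations), n_parts)
    n_invalid = 0
    for part in parts:
        T_hat, O_hat = estimator(part)
        if not (is_stochastically_valid(T_hat) and is_stochastically_valid(O_hat)):
            n_invalid += 1
    # Many invalid partition estimates: fall back on EM or some other method.
    return "fallback" if n_invalid / n_parts > max_invalid_fraction else "spectral"
```

If the result is `"spectral"`, the estimator would then be run once on the full data set, optionally followed by EM started from that estimate.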
In the derivation of the spectral learning algorithm, we tried to clarify some of the quirks
that had been overlooked in at least one work building on Hsu et al. [12]. We also provided
a way of measuring the accuracy of the estimates. Some previous work, using a matrix
norm as measure, had failed to provide a sensible measure of how accurate the estimated
matrices were. The measurement is complicated by the fact that the hidden states of the HMM
can be permuted without changing the outside appearance of the system. We therefore had to
re-order the states in the estimates so that they matched the order of the true system.
Once this was done, we could use the mean squared error of the elements between the
true matrices and the estimated matrices.
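The re-ordering step can be illustrated with a brute-force search over state permutations, which is feasible for the small state spaces used here. The selection criterion (sum of the two mean squared errors) and the function name `align_and_mse` are assumptions for this sketch; the matching rule used in the thesis may differ. Matrices are taken to be column-stochastic, so relabelling the hidden states permutes both rows and columns of T and the columns of O.

```python
import itertools

import numpy as np


def align_and_mse(T_true, O_true, T_hat, O_hat):
    """Find the state re-ordering of the estimates that best matches the truth.

    Returns (mse_T, mse_O, perm), where perm[i] is the estimated state
    that is matched to true state i.
    """
    n_states = T_true.shape[0]
    best = None
    for perm in itertools.permutations(range(n_states)):
        p = list(perm)
        T_p = T_hat[np.ix_(p, p)]  # permute rows and columns of T together
        O_p = O_hat[:, p]          # permute only the columns of O
        mse_T = np.mean((T_p - T_true) ** 2)
        mse_O = np.mean((O_p - O_true) ** 2)
        if best is None or mse_T + mse_O < best[0]:
            best = (mse_T + mse_O, mse_T, mse_O, perm)
    return best[1], best[2], best[3]
```

The search visits X! permutations for X hidden states, so it is only meant for small examples such as those in Appendix A.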
We also discussed some ideas that could improve the estimates given by the spectral
learning algorithm: clipping negative/complex elements and trying to maximize the
spread of the eigenvalues in the diagonalization. We saw that maximizing the spread
gave worse results, so it was not used in the numerical examples.
We then introduced a method employing a variant of non-negative matrix factorization.
This method formulated the identification procedure as an optimization problem. The
benefit was that we could guarantee that the estimated parameters were stochastically
valid by enforcing some constraints when solving the optimization problem. The
disadvantage was that the optimization problem was non-convex, so we could not rule
out convergence to a merely local minimum, similar to the EM-algorithm.
Inspired by this method, we made the assumption that the dynamics of the sensor in
the HMM were known. We proved that the problem then reduces to a convex optimiza-
tion problem. Since it is of interest in a real-time setting to continuously improve the
estimates, we set out to formulate a recursive algorithm for estimating the unknown
dynamics of the Markov chain underlying the HMM.
Two relatively standard methods from constrained convex optimization were introduced
and employed to do this reformulation: the projected gradient descent method and
the primal-dual method. We also exploited a multi-dimensional generalization of the
Pythagorean trigonometric identity to transform the constrained optimization problem
to an unconstrained optimization problem which allowed us to use the regular gradient
descent method. This resulted in three novel methods for online estimation of the
transition dynamics of an HMM.
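Two of the building blocks mentioned above can be sketched in a few lines. The first function is a Euclidean projection onto the probability simplex using a sort-based procedure of the kind described by Wang and Carreira-Perpiñán [31], which is what a projected gradient step needs after each update. The second maps unconstrained angles to a point on the simplex by squaring spherical coordinates, which is one common form of the generalized Pythagorean identity. Both are illustrations under these assumptions and are not claimed to reproduce the exact expressions of Chapter 3; the function names are hypothetical.

```python
import numpy as np


def project_to_simplex(y):
    """Euclidean projection of y onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]                 # entries in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, y.size + 1)
    # Largest index (0-based) for which the shifted entry stays positive.
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + lam, 0.0)


def squared_spherical(theta):
    """Map n - 1 angles to n non-negative numbers that sum to one.

    x_1 = cos^2(t_1), x_2 = sin^2(t_1) cos^2(t_2), ...,
    x_n = sin^2(t_1) ... sin^2(t_{n-1}).
    """
    theta = np.asarray(theta, dtype=float)
    s2 = np.sin(theta) ** 2
    c2 = np.cos(theta) ** 2
    prefix = np.concatenate(([1.0], np.cumprod(s2)))  # products of sin^2 terms
    return prefix * np.concatenate((c2, [1.0]))
```

The projection keeps the iterate feasible after every gradient step, whereas the spherical map makes every value of the angles feasible, so that plain gradient descent can be applied to the angles.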
These three algorithms were then evaluated on numerical examples. We found that the
formulation using the projected gradient descent method had slightly better performance
than the other two, judging by both convergence speed and computation time. This could,
however, change if a more sophisticated choice of the step-sizes in the algorithms
were made.
We also demonstrated that the algorithms can successfully track a system with
time-varying dynamics, in both the sensor and the Markov chain.
7.2 Contributions
In short, we have provided:
• A discussion of some of the subtleties concerning the implementation of the spectral
learning algorithm.
• A way of measuring accuracy along with benchmark data for the spectral learning
algorithm on various examples.
• A proposal for how to work with the spectral learning algorithm in a real-world
setting.
• A small benchmark of a structured non-negative matrix factorization algorithm.
• Three novel methods for estimating the transition dynamics of an HMM when the
sensor dynamics are known, employing a structured non-negative matrix factorization
and convex optimization.
– Derivations with explicit expressions for all terms.
– A reformulation of the constrained optimization problem to an unconstrained
optimization problem by exploiting a multi-dimensional Pythagorean trigono-
metric identity.
– Benchmarks on various examples.
– Proof-of-concept that they can successfully track a system with time-varying
dynamics.
7.3 Future Work
There are many paths that are left unexplored in this work. We here provide brief
indications for what could be investigated in future work.
7.3.1 Combining with Other Methods
Anandkumar et al. [1] generalize the method in Hsu et al. [12] and use a slightly different
method for recovering the transition and observation matrices. One could study cases
where the spectral learning algorithm from Hsu et al. [12] fails to provide valid estimates
and see if this alternative method gives better performance. In that case, some hybrid
method could be devised, which falls back on the generalized method for estimating
difficult systems.
7.3.2 Other Explanations for the Performance of the Spectral Learning
Algorithm
We proposed that the condition number of OT could be used to explain the performance
of the spectral learning algorithm on different examples. However, we saw that this was
not a perfect indicator. How close to being singular the observation matrix is should be
a good indicator for how well any algorithm can identify the system. (If O has rank one
for example, then any T can be used to generate indistinguishable outputs.) A possible
measure could be the angle between the column vectors of O.
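Both indicators are cheap to compute. The sketch below evaluates the 2-norm condition number of OT and the smallest pairwise angle between the columns of O; the function names are illustrative, and using the smallest angle as the summary statistic is an assumption, since the text does not specify how the angles would be aggregated.

```python
import numpy as np


def condition_number_OT(O, T):
    """2-norm condition number of the product OT (computed from singular values)."""
    return np.linalg.cond(O @ T)


def min_column_angle_deg(O):
    """Smallest angle, in degrees, between any two distinct columns of O."""
    O = np.asarray(O, dtype=float)
    cols = O / np.linalg.norm(O, axis=0)          # normalize each column
    cosines = np.clip(cols.T @ cols, -1.0, 1.0)   # guard against round-off
    upper = np.triu_indices(O.shape[1], k=1)      # each pair of columns once
    return float(np.degrees(np.arccos(cosines[upper]).min()))
```

An angle close to zero means that two hidden states produce nearly the same observation distribution, which is the near-singular situation described above.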
7.3.3 How to choose the weight ρ(k)?
The weighting factor ρ(k) from Section 3.3.1.1 influences how fast the change of dynamics
will be “noticed” and incorporated in the estimate of S2,1. In this thesis, we provided
simulation results for two choices of ρ(k), but one could explore how best to choose this
factor to guarantee good convergence properties.
7.3.4 Formal Convergence Properties of Online Structured Non-Negative
Matrix Factorization
Also, the formal convergence properties of the three novel algorithms should be studied.
This is non-trivial since there are two quantities for which convergence has to
be guaranteed simultaneously: the observation matrix, $\hat{O} \to O$, and the second order
moment matrix, $\hat{S}_{2,1} \to S_{2,1}$.
7.3.5 Adaptive Step-Size
There has been much work done on adaptive step-size algorithms for solving optimiza-
tion problems. For simplicity, we used constant step-sizes that seemed to perform well
empirically. A more detailed study of how the step-sizes influence the convergence prop-
erties of the algorithms is a possible path for future work.
7.3.6 The Primal-Dual Method with the Inequality Constraint
We neglected the inequality constraint of the optimization problem in Section 3.3.1.3.
This could be incorporated by adding X × X extra Lagrange multipliers, for example
of the form $\mu_i = e^{\theta_i}$, which would guarantee that $\mu_i$ does not change sign.
7.3.7 Other Parametrizations than Spherical
The parametrization employed in Section 3.3.2 is interesting since it converts the con-
strained optimization problem to an unconstrained optimization problem. However, the
generalized Pythagorean trigonometric identity is not the only possible parametrization
that can be employed.
For example, taking
\[
[A]_{ij} = \frac{e^{\theta_{ij}}}{\sum_{k=1}^{X} \sum_{l=1}^{X} e^{\theta_{kl}}} \tag{7.1}
\]
would also guarantee that all elements of A are non-negative and that their sum is
one. Furthermore, the preservation of the convexity could be easier to analyze since the
exponential function is convex.
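A minimal sketch of the parametrization (7.1) is shown below; the function name `softmax_matrix` is illustrative. Subtracting the largest entry of θ before exponentiating is a standard numerical safeguard that leaves the ratio in (7.1) unchanged.

```python
import numpy as np


def softmax_matrix(theta):
    """Evaluate (7.1): entries are positive and the whole matrix sums to one."""
    theta = np.asarray(theta, dtype=float)
    z = np.exp(theta - theta.max())  # the shift cancels in the ratio
    return z / z.sum()


A = softmax_matrix(np.zeros((3, 3)))  # all-zero parameters give a uniform matrix
```

One property worth keeping in mind is that every entry produced this way is strictly positive, so matrices with exact zeros can only be approached in the limit as some θ entries tend to minus infinity.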
7.3.8 Computational Complexity of Online Structured Non-Negative
Matrix Factorization
Numerical simulations provide insight, but a theoretical expression for the computational
complexity would make comparison to other methods easier.
Appendix A
Examples Used in Benchmarks
This appendix contains the parameters used in each of the examples from the text. They
have either been taken from various texts on HMMs or been constructed for this thesis.
Example 1
\[
T = \begin{bmatrix} 0.90 & 0.30 \\ 0.10 & 0.70 \end{bmatrix} \tag{A.1}
\]
\[
O = \begin{bmatrix} 0.80 & 0.20 \\ 0.20 & 0.80 \end{bmatrix} \tag{A.2}
\]
Example 2
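\[
T = \begin{bmatrix} 0.60 & 0.30 & 0.30 \\ 0.20 & 0.50 & 0.30 \\ 0.20 & 0.20 & 0.40 \end{bmatrix} \tag{A.3}
\]
\[
O = \begin{bmatrix} 0.80 & 0.20 & 0.30 \\ 0.10 & 0.70 & 0.10 \\ 0.10 & 0.10 & 0.60 \end{bmatrix} \tag{A.4}
\]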
Example 3
This system was used by Mattfeld [20] to benchmark the spectral learning algorithm.
The numbers as presented here are rounded to two decimal places. The exact fractions
from Mattfeld [20, p. 26] were used in the numerical simulations.
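\[
T = \begin{bmatrix} 0.80 & 0.07 & 0.17 \\ 0.10 & 0.87 & 0.17 \\ 0.10 & 0.07 & 0.67 \end{bmatrix} \tag{A.5}
\]
\[
O = \begin{bmatrix}
0.40 & 0.05 & 0.02 \\
0.07 & 0.55 & 0.02 \\
0.07 & 0.05 & 0.42 \\
0.07 & 0.05 & 0.42 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02
\end{bmatrix} \tag{A.6}
\]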
Example 4
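\[
T = \begin{bmatrix} 0.50 & 0.10 & 0.20 \\ 0.20 & 0.60 & 0.40 \\ 0.30 & 0.30 & 0.40 \end{bmatrix} \tag{A.7}
\]
\[
O = \begin{bmatrix} 0.20 & 0.40 & 0.70 \\ 0.70 & 0.40 & 0.10 \\ 0.10 & 0.20 & 0.20 \end{bmatrix} \tag{A.8}
\]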
Example 5
\[
T = \begin{bmatrix} 0.50 & 0.00 & 0.50 \\ 0.50 & 0.50 & 0.00 \\ 0.00 & 0.50 & 0.50 \end{bmatrix} \tag{A.9}
\]
\[
O = \begin{bmatrix} 0.70 & 0.30 & 0.80 \\ 0.20 & 0.40 & 0.05 \\ 0.10 & 0.30 & 0.15 \end{bmatrix} \tag{A.10}
\]
Example 6
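\[
T = \begin{bmatrix} 0.70 & 0.20 & 0.10 \\ 0.20 & 0.60 & 0.30 \\ 0.10 & 0.20 & 0.60 \end{bmatrix} \tag{A.11}
\]
\[
O = \begin{bmatrix} 0.80 & 0.20 & 0.30 \\ 0.10 & 0.70 & 0.10 \\ 0.10 & 0.10 & 0.60 \end{bmatrix} \tag{A.12}
\]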
Example 7
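\[
T = \begin{bmatrix} 0.50 & 0.30 & 0.40 \\ 0.10 & 0.40 & 0.40 \\ 0.40 & 0.30 & 0.20 \end{bmatrix} \tag{A.13}
\]
\[
O = \begin{bmatrix} 0.80 & 0.20 & 0.30 \\ 0.10 & 0.70 & 0.10 \\ 0.10 & 0.10 & 0.60 \end{bmatrix} \tag{A.14}
\]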
Example 8
These are two randomly generated matrices. The elements shown here are rounded.
In the numerical simulations, the elements of each column summed to one.
\[
T = \begin{bmatrix} 0.61 & 0.40 & 0.51 \\ 0.33 & 0.37 & 0.37 \\ 0.06 & 0.23 & 0.13 \end{bmatrix} \tag{A.15}
\]
\[
O = \begin{bmatrix}
0.13 & 0.11 & 0.03 \\
0.18 & 0.14 & 0.08 \\
0.07 & 0.08 & 0.11 \\
0.15 & 0.11 & 0.10 \\
0.02 & 0.15 & 0.13 \\
0.12 & 0.17 & 0.10 \\
0.03 & 0.01 & 0.08 \\
0.00 & 0.16 & 0.17 \\
0.12 & 0.03 & 0.09 \\
0.18 & 0.03 & 0.12
\end{bmatrix} \tag{A.16}
\]
Example 9
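\[
T = \begin{bmatrix} 0.60 & 0.30 & 0.50 \\ 0.00 & 0.50 & 0.40 \\ 0.40 & 0.20 & 0.10 \end{bmatrix} \tag{A.17}
\]
\[
O = \begin{bmatrix} 0.80 & 0.30 & 0.40 \\ 0.00 & 0.60 & 0.00 \\ 0.20 & 0.10 & 0.60 \end{bmatrix} \tag{A.18}
\]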
Appendix B
Computing Environment
All simulations in the thesis were performed on a MacBook Air with the following
specifications:
• Mac OS X Version 10.9.5
• CPU 1.3 GHz Intel Core i5
• RAM 4 GB 1600 MHz DDR3
• MATLAB R2013a (8.1.0.604)
Bibliography
[1] A. Anandkumar, D. Hsu, and S. M. Kakade. A Method of Moments for Mixture Models and Hidden Markov Models. ArXiv e-prints, Mar. 2012.
[2] B. D. O. Anderson. New developments in the theory of positive systems. In C. I. Byrnes, B. N. Datta, C. F. Martin, and D. S. Gilliam, editors, Systems and Control in the Twenty-First Century, number 22 in Systems & Control: Foundations & Applications, pages 17–36. Birkhäuser Boston. ISBN 978-1-4612-8662-2, 978-1-4612-4120-1. URL http://link.springer.com/chapter/10.1007/978-1-4612-4120-1_2.
[3] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. pages 164–171. URL http://www.jstor.org/stable/2239727.
[4] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press. ISBN 0521833787 9780521833783.
[5] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer Science & Business Media. ISBN 9780387953649.
[6] Y. Chen and X. Ye. Projection onto a simplex. URL http://arxiv.org/abs/1101.6081.
[7] G. Cybenko and V. Crespi. Learning hidden Markov models using nonnegative matrix factorization. 57(6):3963–3970. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5773017.
[8] S. Fan and Y. Yao. Strong convergence of a projected gradient method. 2012:1–10. ISSN 1110-757X, 1687-0042. doi: 10.1155/2012/410137. URL http://www.hindawi.com/journals/jam/2012/410137/.
[9] L. Finesso, A. Grassi, and P. Spreij. Two-step nonnegative matrix factorization algorithm for the approximate realization of hidden Markov models. URL http://arxiv.org/abs/1007.3435.
[10] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press.
[11] H. Hjalmarsson and B. Ninness. Fast, non-iterative estimation of hidden Markov models. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 4, pages 2253–2256. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=681597.
[12] D. Hsu, S. M. Kakade, and T. Zhang. A Spectral Algorithm for Learning Hidden Markov Models. ArXiv e-prints, Nov. 2008.
[13] H. Jaeger. Observable operator models for discrete stochastic time series. 12(6):1371–1398. ISSN 0899-7667, 1530-888X. doi: 10.1162/089976600300015411. URL http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015411.
[14] M. J. Johnson. A simple explanation of a spectral algorithm for learning hidden Markov models. URL http://arxiv.org/abs/1204.2477.
[15] V. Krishnamurthy and F. V. Abad. Gradient based policy optimization of constrained Markov decision processes. URL http://arxiv.org/abs/1110.4946.
[16] B. Lakshminarayanan and R. Raich. Non-negative matrix factorization for parameter estimation in hidden Markov models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 89–94. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5589231.
[17] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press.
[18] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Soc. ISBN 9780821886274.
[19] D. V. Lindberg and H. Omre. Inference of the transition matrix in convolved hidden Markov models by a generalized Baum-Welch algorithm. URL https://wiki.math.ntnu.no/_media/ure/subm-2014-4.pdf.
[20] C. Mattfeld. Implementing spectral methods for hidden Markov models with real-valued emissions. CoRR, abs/1404.7472, 2014. URL http://arxiv.org/abs/1404.7472.
[21] C. Moler. Numerical Computing with MATLAB. Society for Industrial and Applied Mathematics, 2004. ISBN 9780898715606. URL http://books.google.ca/books?id=-vPtcriflH0C.
[22] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. eprint arXiv:cs/0502076, Feb. 2005.
[23] L. R. Rabiner. First-hand: The hidden Markov model. URL http://www.ieeeghn.org/wiki/index.php/First-Hand:The_Hidden_Markov_Model.
[24] J. Rodu, D. Foster, W. Wu, and L. Ungar. Using regression for spectral estimation of HMMs. In A.-H. Dediu, C. Martín-Vide, R. Mitkov, and B. Truthe, editors, Statistical Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science, pages 212–223. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-39592-5. doi: 10.1007/978-3-642-39593-2_19. URL http://dx.doi.org/10.1007/978-3-642-39593-2_19.
[25] StackOverflow. Question: Does matlab eig always returns [sic] sorted values. URL http://stackoverflow.com/questions/13704384/does-matlab-eig-always-returns-sorted-values.
[26] B. Vanluyten. Realization, Identification and Filtering for Hidden Markov Models using Matrix Factorization Techniques. PhD thesis, Katholieke Universiteit Leuven, 2008.
[27] B. Vanluyten, J. C. Willems, and B. De Moor. A new approach for the identification of hidden Markov models. In Decision and Control, 2007 46th IEEE Conference on, pages 4901–4905. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4434912.
[28] B. Vanluyten, J. C. Willems, and B. De Moor. Structured nonnegative matrix factorization with applications to hidden Markov realization and clustering. 429(7):1409–1424. ISSN 00243795. doi: 10.1016/j.laa.2008.03.010. URL http://linkinghub.elsevier.com/retrieve/pii/S0024379508001262.
[29] T. Vercauteren, A. L. Toledo, and X. Wang. Online Bayesian estimation of hidden Markov models with unknown transition matrix and applications to IEEE 802.11 networks. In Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP’05). IEEE International Conference on, volume 4, pages iv–13. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1415933.
[30] C. Wang and N. Xiu. Convergence of the gradient projection method for generalized convex minimization. 16(2):111–120. URL http://link.springer.com/article/10.1023/A:1008714607737.
[31] W. Wang and M. A. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. URL http://arxiv.org/abs/1309.1541.
[32] X. Wang, V. Krishnamurthy, and J. Wang. Stochastic gradient algorithms for design
of minimum error-rate linear dispersion codes in MIMO wireless systems. 54(4):
1242–1255. ISSN 1053-587X. doi: 10.1109/TSP.2005.863122.
[33] H. Zhao and P. Poupart. A sober look at spectral learning. URL http://arxiv.org/abs/1406.4631.