
DEGREE PROJECT, IN ENGINEERING PHYSICS; SYSTEMS, CONTROL AND ROBOTICS, SECOND LEVEL

STOCKHOLM, SWEDEN 2015

On Identification of Hidden Markov Models Using Spectral and Non-Negative Matrix Factorization Methods

ROBERT MATTILA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING

KTH Royal Institute of Technology

Abstract

Department of Automatic Control

School of Electrical Engineering

Master of Science

On Identification of Hidden Markov Models Using Spectral and

Non-Negative Matrix Factorization Methods

by Robert Mattila

Hidden Markov Models (HMMs) are popular tools for modeling discrete time series. Since the parameters of these models can be hard to derive analytically or directly measure, various algorithms are available for estimating these from observed data. The most common method, the Expectation-Maximization algorithm, suffers from problems with local minima and slow convergence. A spectral algorithm that has received considerable attention in the field of machine learning claims to avoid these issues. This thesis implements and benchmarks said algorithm on various systems to see how well it performs.

One of the concerns with the proposed spectral algorithm is that it cannot guarantee that

the estimates are stochastically valid: it may recover negative or complex probabilities,

due to an eigenvalue decomposition.

Another approach to the HMM identification problem is to leverage results from Non-Negative Matrix Factorization (NNMF) theory. Inspired by an algorithm employing a Structured NNMF (SNNMF), assumptions are presented to guarantee that the factorization problem can be cast into a convex optimization problem.

Three novel recursive algorithms are then derived for estimating the dynamics of an

HMM when the sensor dynamics are known. These can be used in an online setting

where time and/or computational resources are limited, since they only require the

current estimate of the HMM parameters and the new observation. Numerical results

for the algorithms are provided.

http://www.kth.se


KTH Kungliga Tekniska Högskolan

Referat

Department of Automatic Control

School of Electrical Engineering

Degree Project

Identification of Hidden Markov Models via Spectral and Non-Negative Matrix Factorization Methods

by Robert Mattila

A popular tool for modeling discrete time series is the Hidden Markov Model (HMM). In practice, the parameters of these models can be hard to derive or to measure directly, which has led to the construction of various algorithms for estimating them from measured data. The most common algorithm for such estimation, the Expectation-Maximization method, has been shown to have problems with local minima and slow convergence. A method that uses spectral factorization is said to avoid these problems and has received considerable attention in machine learning. This thesis implements this algorithm and evaluates its performance on various systems.

One of the problems with this spectral learning algorithm is that it gives no guarantee that the estimated parameters are stochastically valid: the result may be probabilities that are negative or complex. This is due to an eigenvalue decomposition in one step of the algorithm.

Another possible route to estimating the parameters of an HMM is to use results from the theory of Non-Negative Matrix Factorization (NNMF). Inspired by an algorithm that uses a structured NNMF, assumptions are presented under which the factorization problem can be converted into a convex optimization problem.

Three new recursive algorithms are then derived that can estimate the dynamics of an HMM when the sensor used to make measurements is assumed to have known dynamics. These can be used in real-time applications where time and/or computational resources are limited, since they only need an old estimate and a new observation to improve the estimate. Numerical simulations are carried out and the results are then presented for a number of systems.

Acknowledgements

I would like to thank my supervisor Prof. Bo Wahlberg and Assoc. Prof. Cristian

Rojas for inviting me to work with them and for their input and support during the

development of this thesis.

I had the opportunity to spend some time at University of British Columbia (UBC)

with Prof. Vikram Krishnamurthy, who gave me valuable feedback and almost too

many things to try out during my stay there, for which I am very grateful.

I would also like to thank his students, Sujay Bhatt, Anup Aprem and Yan Duan, for

welcoming me so kindly to the lab.

Robert Mattila

January 2015


Contents

Abstract
Referat
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations
Symbols

1 Introduction
1.1 Background and Setting
1.2 Notation
1.3 Hidden Markov Models
1.3.1 Transitions
1.3.1.1 Example of a Markov Chain
1.3.2 Observations
1.3.2.1 Example of an HMM
1.4 Problem Formulation
1.4.1 Example Problem
1.5 Expectation-Maximization Method
1.6 Overview of Related Work
1.7 Outline of Thesis

2 Methods for Identification Using Batch Data
2.1 Introduction to Spectral Learning
2.2 Moments
2.3 Derivation of the Spectral Learning Algorithm
2.4 Introduction to Non-Negative Matrix Factorization
2.5 Identification using Structured Non-Negative Matrix Factorization

3 Methods for Online Identification
3.1 Introduction
3.2 Convexity and Convex Optimization
3.2.1 Introduction
3.2.2 The Projected Gradient Descent Method
3.2.3 The Primal-Dual Method
3.3 Online Structured Non-Negative Matrix Factorization for Known Sensor Dynamics
3.3.1 Derivation of the Algorithms
3.3.1.1 Estimating $S_{2,1}$ Recursively
3.3.1.2 Projected Gradient Descent
3.3.1.3 The Primal-Dual Method without Inequality Constraints
3.3.2 Spherical Coordinates

4 Notes on Implementation
4.1 Measuring the Accuracy of Estimates
4.1.1 Kullback-Leibler Divergence
4.1.2 Eigenvalues and Eigenvectors
4.1.3 Probability of Output Sequences
4.1.4 Matrix Norms
4.2 Re-Ordering the States
4.3 Some Notes on Implementing the Spectral Learning Algorithm
4.3.1 Faulty Matrices
4.3.2 If the Matrix is not Diagonal?
4.3.3 Separating the Eigenvalues

5 Numerical Results for Spectral Learning and Structured Non-Negative Matrix Factorization
5.1 Comparison of Spectral Learning and Structured Non-Negative Matrix Factorization
5.2 Performance of the Spectral Learning Algorithm
5.3 Higher-Dimensional Systems with Spectral Learning
5.4 Perturbation Analysis

6 Numerical Results for Online Structured Non-Negative Matrix Factorization
6.1 Comparison of the Algorithms
6.2 Performance as Dimension Increases
6.3 Tracking Time-Varying Dynamics

7 Conclusions
7.1 Summary and Conclusions
7.2 Contributions
7.3 Future Work
7.3.1 Combining with Other Methods
7.3.2 Other Explanations for the Performance of the Spectral Learning Algorithm
7.3.3 How to choose the weight ρ(k)?
7.3.4 Formal Convergence Properties of Online Structured Non-Negative Matrix Factorization
7.3.5 Adaptive Step-Size
7.3.6 The Primal-Dual Method with the Inequality Constraint
7.3.7 Other Parametrizations than Spherical
7.3.8 Computational Complexity of Online Structured Non-Negative Matrix Factorization

A Examples Used in Benchmarks

B Computing Environment

Bibliography

List of Figures

1.1 Example of a Markov chain, illustrating Equation (1.10). The blue circles represent the discrete states of the system and the numbers on the edges the probabilities of transitioning between two states.

1.2 Example of an HMM, illustrating Equation (1.10) and Equation (1.14). The blue circles represent the hidden state of the system and the green squares the possible observations. Two green squares with the same number are equivalent, they correspond to the same observation, and are only drawn separated in the figure for clarity.

2.1 The spectral learning algorithm.

2.2 The structured non-negative matrix factorization algorithm.

3.1 Example of a convex function. The (gray) line between any two points on the graph lies above the graph. A negative gradient is shown as a red arrow. Following the direction of the gradient at each point (i.e. performing a gradient descent) will reach the global minimum.

3.2 The online structured non-negative matrix factorization algorithm employing the projected gradient descent method.

3.3 The online structured non-negative matrix factorization algorithm employing the primal-dual method.

3.4 The online structured non-negative matrix factorization algorithm employing a spherical coordinates parametrization.

4.1 Graph depiction of the HMMs $\Sigma_a$ and $\Sigma_b$ which are related by renaming the hidden states. They cannot be distinguished from just sequences of observations.

4.2 Performance of the spectral learning algorithm when Step 4 and 5 are performed $N_g$ times to find the set of random variables giving the largest spread of the eigenvalues. Every data point is averaged over 75 simulations.

5.1 Comparison of SL and SNNMF. All data points are averaged over 20 simulations.

5.2 Performance of the spectral learning algorithm on numerous examples (found in Appendix A). All data points are averaged over 20 simulations.

5.3 Performance of the EM-algorithm on three examples. Every data point is an average over three realizations of the HMM. Random initial guesses were used for starting the EM-algorithm.

5.4 Performance of the spectral learning algorithm as the dimension of the HMM increases. All data points are averaged over 15 simulations with 3 random matrices, each evaluated over 5 simulations.

6.1 Performance of the three recursive methods on two examples: E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10. Every data point is an average over ten simulations.

6.2 Performance of PGDM as the dimension of the system increases. The systems are generated randomly with Y = X. Each data point is an average over nine simulations with three pairs of random matrices, each used for three simulations.

6.3 Error as PGDM tracks a time-varying system. The left plot shows the case where only the Markov chain changes (at the dashed red line), and the sensor dynamics stays constant. The right plot shows the case where both the sensor dynamics and the Markov chain change. Every data point is an average over ten simulations. Two different choices for the weight in the update of the estimate of the $S_{2,1}$ matrix are shown.

List of Tables

5.1 Mean slopes of the $d_F$-error for the spectral learning algorithm in the log-log plot Figure 5.2 together with the second largest eigenvalue of the transition matrix for various examples. The system matrices can be found in Appendix A.

5.2 The result of Section 5.2 (Figure 5.2) sorted according to the error in the estimate of the observation matrix when $10^7$ samples were available (which is an indicator of how well the spectral learning algorithm has performed). The condition number of OT appears to correlate well with this, except for the outlier Example 8.

6.1 Comparison of the convergence rate and time consumption of the three recursive methods on two examples (E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10). Mean Slope refers to the slopes in Figure 6.1.

6.2 Performance of the PGDM as the dimension of the system increases. States refer to the value of X and Y (equal). Mean Slope refers to the slopes in Figure 6.2.


Abbreviations

EM Expectation-Maximization

HMM Hidden Markov Model

NNMF Non-Negative Matrix Factorization

PDM Primal-Dual Method

PGDM Projected Gradient Descent Method

SCM Spherical Coordinates Method

SL Spectral Learning

SNNMF Structured Non-Negative Matrix Factorization

SVD Singular Value Decomposition


Symbols

·, · estimate of numerical quantity ·

1 (column) vector of ones

[·]• element • of tensor ·

I identity matrix

δi,j the Kronecker delta

∆ the probability simplex

∆· change of ·

·+ Moore-Penrose pseudo-inverse

Tr · matrix trace

diag · matrix with elements of vector · as diagonal elements

ei ith Cartesian (column) unit vector

∇· gradient

N (·, ·) normally distributed number

Ω· square root of term in generalized Pythagorean trigonometric identity

O observation matrix (elements in columns sum to one)

T transition matrix (elements in columns sum to one)

dF (·, ·) mean squared error of matrix elements

S· moment

P permutation matrix

π stationary distribution

A dummy variable

Jij matrix of zeros, except for a one at element (i, j)

ρ(·) weight of new observations compared to old observations

L Lagrangian


Dedicated to Juan Parera Lopez


Chapter 1

Introduction

1.1 Background and Setting

Mathematical modelling is one of the cornerstones of much of the knowledge of mankind. However, the complexity of some systems limits how well certain parameters in the mathematical model can be determined numerically. Complexity is not the only limiting factor: sometimes the time needed to derive an analytical model is not available, or it could be necessary to break the system in order to perform some measurements, which can be very costly. This has led to the field of system identification. Instead of analytically deriving or measuring parameters, algorithms are developed that generate estimates of the parameters that fit observed data well. This thesis treats such identification procedures for a certain kind of model, namely the Hidden Markov Model (HMM).

The foundation of the theory of Markov models was laid by Andrey Markov in the beginning of the 20th century. Since then, this mathematical model has found applications in a vast number of fields, such as automatic speech recognition, natural language processing and genomic sequence modeling. A Markov chain is one such model, defined as a discrete stochastic process, traversing through its finite set of states as time evolves in discrete steps. The critical assumption is that the probability of the current state transitioning to another is independent of the history of the process: it depends only on the current state of the system.

Usually, the future state of the system is impossible to know, since the state changes

are random. However, through statistical analysis, some things can be deduced. One of

the most important is the expected value of the system state in the future. In financial

markets, this could tell how a stock is expected to perform and whether or not we should

sell or buy more shares.


There are many generalizations of the basic Markov chain. It is common that the state of the system itself cannot be directly observed and one has to rely on observations of some other quantities. For example, the state of a medical patient, sick or healthy, cannot be directly observed by the doctor. The doctor uses readings of a thermometer, answers to questions and other instruments to try to find the true (hidden) state of the patient. If the true state of the system behaves like a Markov chain, then such a system is called an HMM.

Another possible generalization is to imagine that there is a control input that can

be designed to steer the system towards a desired state. Mathematically, this reflects

itself as changing the probabilities of certain transitions in the underlying (hidden)

Markov chain. This is what is known as a Partially Observed Markov Decision Process

(POMDP).

So-called HMM-filters have been devised to find the hidden state of a system from observations of the HMM, and have in recent years proven very successful in various domains. Such filters, together with dynamic programming, constitute the fundamental tools for solving POMDP-problems.

HMM-filters are also used for prediction and filtering purposes. Regardless of how they are applied, they require the knowledge of all probabilities of the HMM, i.e. the parameters of the HMM. Since these are not always known, either by analytical derivations or direct measurements on the system, they might have to be estimated.

The currently predominant methods for estimating these parameters, such as the Baum-Welch/Expectation-Maximization (EM) method from Baum et al. [3], rely on schemes based on local optimization to iteratively calculate better and better estimates. This approach is prone to problems with local minima and slow convergence. Some recent results by Hsu et al. [12] in the field of machine learning seem promising for solving the estimation problem without the issue of sensitivity to initial guesses and slow convergence. They obtain an estimate through a one-shot method employing techniques from linear algebra.

The aim of this project is two-fold. We first intend to explore and implement the so-called spectral learning algorithm for HMMs by Hsu et al. [12]. We will draw some parallels to similar work and see how the algorithm relates. The second intention is to examine the problem of estimating the parameters in an online setting, meaning that a new estimate is calculated using only a new measurement and the old estimate. This is important in a real-time application where memory and/or computational resources might be limited.


1.2 Notation

This section will briefly introduce some mathematical concepts and notation that will

be used throughout the text. We will let × be reserved for scalar multiplications, since

there will be no need for using the cross product between vectors.

Elements of vectors and matrices are given as subscripts of brackets: if $A$ is a matrix then the element on the $i$th row and $j$th column is $[A]_{ij}$. The $i$th Cartesian unit vector will be denoted $e_i$ and the vector of ones as $\mathbf{1}$, both column vectors.

An empirical estimate of a numerical quantity $x$, not necessarily scalar, will be denoted as $\hat{x}$. The probability measure of an event will be written $\Pr[\,\cdot\,]$.

The $(n-1)$-dimensional probability simplex $\Delta$ is defined as

$$\Delta = \{ x \in \mathbb{R}^n : x \geq 0,\ \mathbf{1}^T x = 1 \}. \tag{1.1}$$

In Equation (1.1), and in the rest of the text, ≥ should be evaluated elementwise on

vector and matrix quantities.

The gradient of a (scalar, vector or matrix) function $f$ with respect to $x$ will be denoted $\nabla_x f$ or $\frac{\partial f}{\partial x}$ depending on which one is more visually pleasing in the present context.

The Hadamard, or elementwise, product between two matrices $A$ and $B$ will be denoted with the $\odot$-operator and is defined as

$$[A \odot B]_{ij} = [A]_{ij} \times [B]_{ij}. \tag{1.2}$$

The Frobenius norm of a matrix $A$ is written and defined as

$$\|A\|_F = \sqrt{\sum_{ij} [A]_{ij}^2}, \tag{1.3}$$

where the sum is over all elements of $A$.

A normally distributed number with mean $\mu$ and standard deviation $\sigma$ will be written $\mathcal{N}(\mu, \sigma)$.
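As a small illustration of the notation in this section, the following sketch evaluates the simplex condition of Equation (1.1), the Hadamard product of Equation (1.2) and the Frobenius norm of Equation (1.3) numerically. The use of Python with NumPy, the helper name `in_simplex`, the tolerance and the example matrices are assumptions made for illustration only; they are not taken from the thesis.

```python
import numpy as np

def in_simplex(x, tol=1e-12):
    """Check x >= 0 (elementwise) and 1^T x = 1, as in Equation (1.1).

    The tolerance is an illustrative choice to absorb rounding errors.
    """
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= -tol) and abs(x.sum() - 1.0) <= tol)

# Two arbitrary example matrices (hypothetical values).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.5, 0.0],
              [2.0, 1.0]])

hadamard = A * B                     # elementwise product, Equation (1.2)
frobenius = np.sqrt(np.sum(A ** 2))  # Frobenius norm, Equation (1.3)
```

In NumPy the `*` operator on arrays is already elementwise, so it plays the role of the Hadamard product, while ordinary matrix multiplication is written `A @ B`.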


1.3 Hidden Markov Models

This section will formally introduce the model that is central to this thesis, namely the Hidden Markov Model (HMM). We will begin by discussing the standard Markov model, or Markov chain, since the HMM is a generalization.

A Markov chain is a discrete stochastic model that can be represented as a sequence of stochastic variables $x_1, x_2, x_3, \dots$. The state of the Markov chain at time $k$ is $x_k$. The crucial assumption is that these variables obey the Markov property. This is an assumption that the conditional probability distribution of future states depends only on the current state and not on past states. Formally, this can be stated as

$$\Pr[x_{k+1} = i \mid x_1 = i_1, x_2 = i_2, \dots, x_k = i_k] = \Pr[x_{k+1} = i \mid x_k = i_k]. \tag{1.4}$$

The set of possible values that the variables $x_k$ can take is called the state-space of the Markov chain. We will denote this set with $\mathcal{X}$ and let it be a finite consecutive subset of $\mathbb{N}^+$ that includes 1, i.e. $\mathcal{X} = \{1, 2, \dots, X\}$. Here $X$ denotes the number of discrete states of the Markov chain, and will be referred to as the dimension of the Markov chain.

If the transition probabilities do not depend on the absolute time, the chain is said to be time-invariant or stationary. This means that

$$\Pr[x_{k+m+1} = i \mid x_{k+m} = j] = \Pr[x_{k+1} = i \mid x_k = j], \tag{1.5}$$

for all $m$. In this case, the concept of the chain's stationary distribution is often introduced:

Definition 1.1. The stationary distribution, $\pi \in [0, 1]^X$, of a time-invariant Markov chain is defined as

$$[\pi]_i = \Pr[x_k = i] = \Pr[x_1 = i]. \tag{1.6}$$

Remark 1.2. We will provide an example of a time-variant Markov chain in a later

chapter. In that case, we will treat it slightly heuristically and assume that the chain is

stationary on each of the intervals between the change of dynamics.

1.3.1 Transitions

Since there are X × X possible state transitions in a Markov chain of dimension X,

we can define a matrix that will keep track of the probabilities of each one of these

transitions:


Figure 1.1: Example of a Markov chain, illustrating Equation (1.10). The blue circles represent the discrete states of the system and the numbers on the edges the probabilities of transitioning between two states.

Definition 1.3. The (state) transition matrix$^{1}$, $T \in [0, 1]^{X \times X}$, is defined as

$$[T]_{ij} = \Pr[x_{k+1} = i \mid x_k = j]. \tag{1.7}$$

The transition matrix is, by construction, a stochastic matrix, which implies that it has the following two properties:

$$[T]_{ij} \in [0, 1] \quad \forall\, i, j = 1, 2, \dots, X \tag{1.8}$$

and

$$\sum_{i=1}^{X} [T]_{ij} = 1 \quad \forall\, j = 1, 2, \dots, X. \tag{1.9}$$

This means that the elements of $T$ are non-negative and that the elements of each column of $T$ sum to one.

1.3.1.1 Example of a Markov Chain

Consider the following example. The weather in a small town can only be in one out of

three states: sunny, cloudy and rainy. Our assumption that the state-space $\mathcal{X} \subset \mathbb{N}^+$ is

without loss of generality, since we can simply assign a unique number to each one of

these states. A scientist in this town has calculated the probabilities for changes in the

weather. For example, he says that there is a seventy per cent chance of a cloudy day

turning into a rainy day after the night.

1 It is also common to define the transition matrix as $P$ with $[P]_{ij} = \Pr[x_{k+1} = j \mid x_k = i]$, i.e. as the transpose of the $T$ defined here, $T^T = P$. We will use $T$ instead of $P$ to keep consistency with previous work in the field.


The scientist has provided us only with the following transition matrix

$$T = \begin{pmatrix} 0.8 & 0.1 & 0.2 \\ 0.1 & 0.2 & 0.4 \\ 0.1 & 0.7 & 0.4 \end{pmatrix} \tag{1.10}$$

and says that the ordering is: one - sunny, two - cloudy and three - rainy. It is common

to visualize Markov chains as finite transition systems. This representation is equivalent to

providing a transition matrix, if the transition probabilities are specified on the edges

of the graph. We can easily construct the transition system in Figure 1.1 from the

transition matrix that the scientist has provided.
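The properties of the weather chain can be checked numerically. The sketch below, written in Python with NumPy as an assumption of this illustration, verifies that the matrix in Equation (1.10) is column-stochastic and computes the stationary distribution of Definition 1.1 as the eigenvector of $T$ belonging to the eigenvalue one. The helper names `is_column_stochastic` and `stationary_distribution` are hypothetical, and states are indexed from zero in the code.

```python
import numpy as np

# Transition matrix from Equation (1.10).
# Column j holds Pr[x_{k+1} = i | x_k = j]; indices are 0-based here,
# so 0 = sunny, 1 = cloudy, 2 = rainy.
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.2, 0.4],
              [0.1, 0.7, 0.4]])

def is_column_stochastic(M, tol=1e-12):
    """Non-negative elements and columns that sum to one."""
    return bool(np.all(M >= 0) and np.allclose(M.sum(axis=0), 1.0, atol=tol))

def stationary_distribution(T):
    """Eigenvector of T for the eigenvalue closest to 1, scaled to sum to one."""
    eigvals, eigvecs = np.linalg.eig(T)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

pi = stationary_distribution(T)

# Probability of a cloudy day (column 1) turning into a rainy day (row 2).
p_cloudy_to_rainy = T[2, 1]
```

For this particular matrix, solving $T\pi = \pi$ by hand gives $\pi = (4/9, 2/9, 1/3)$, so in the long run the town is sunny a little less than half of the time.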

1.3.2 Observations

A common extension of the Markov chain is to assume that the current state is observed through a noisy sensor. This makes the model an HMM, since the true state of the Markov chain is never directly seen. If we assume that the noise is discrete valued and only takes a finite number of values, then the possible observations of the Markov chain will lie in a finite discrete set as well. Let the number of possible discrete observations be $Y$ and map each observation to a unique element in the set $\mathcal{Y} = \{1, 2, \dots, Y\}$. This makes it possible to define a matrix of the probabilities of making each observation, given each hidden state of the Markov chain. The observation (also called emission) made at time $k$ is $y_k \in \mathcal{Y}$. Formally, this means that:

Definition 1.4. Let the observation (or emission) matrix$^{2}$, $O \in [0, 1]^{Y \times X}$, be defined as

$$[O]_{ij} = \Pr[y_k = i \mid x_k = j]. \tag{1.11}$$

Assuming that we always make some observation, we note that $O$ is a stochastic matrix:

$$[O]_{ij} \in [0, 1] \quad \forall\, i = 1, 2, \dots, Y \text{ and } j = 1, 2, \dots, X \tag{1.12}$$

and

$$\sum_{i=1}^{Y} [O]_{ij} = 1 \quad \forall\, j = 1, 2, \dots, X. \tag{1.13}$$

This means that the elements in each one of the columns of $O$ sum to one.


Figure 1.2: Example of an HMM, illustrating Equation (1.10) and Equation (1.14). The blue circles represent the hidden state of the system and the green squares the possible observations. Two green squares with the same number are equivalent, they correspond to the same observation, and are only drawn separated in the figure for clarity.

1.3.2.1 Example of an HMM

Continuing on the previous example, assume that we are imprisoned in a room without

any windows in the strange three-weather town. Our only method of deducing the

outside weather is by observing the outerwear the prison guard wears each day when

he comes to feed us. In this setting, the state of the weather is hidden from us, but

we can make (noisy) observations that give us clues about the outside conditions. If the

guard has a limited wardrobe, then the only possible observations might be: (1) shirt

and short pants, (2) long pants and a coat, and (3) raincoat and an umbrella.

Since the guard checks the weather forecast each morning before leaving for work, he

is fairly good at dressing appropriately for the weather. Assume that, on a sunny day,

there is a ninety per cent chance of him wearing shirt and short pants, and a ten per

cent chance of him wearing long pants and a coat. Similar probabilities can be assigned

to the two other hidden states (cloudy and rainy) and observation (raincoat and an

umbrella). We can summarize this in an observation matrix:

$$O = \begin{pmatrix} 0.9 & 0.2 & 0.0 \\ 0.1 & 0.5 & 0.2 \\ 0.0 & 0.3 & 0.8 \end{pmatrix}, \tag{1.14}$$

2 As with the case of the transition matrix, there is another common definition of the observation matrix: $B = O^T$. We will use $O$ to keep consistency with previous work.


where we have assigned a number to each one of the observations, or in an illustration

such as Figure 1.2.
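To make the roles of the two matrices concrete, the following sketch draws a sequence of hidden weather states and observed outfits from the HMM defined by Equation (1.10) and Equation (1.14). It is an illustration only: the function name `sample_hmm`, the choice of starting in the sunny state, the sequence length and the random seed are assumptions, and indices are 0-based in the code.

```python
import numpy as np

# Transition matrix (1.10) and observation matrix (1.14), both column-stochastic.
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.2, 0.4],
              [0.1, 0.7, 0.4]])
O = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.5, 0.2],
              [0.0, 0.3, 0.8]])

def sample_hmm(T, O, n, rng, x0=0):
    """Draw n (state, observation) pairs from the HMM; indices are 0-based."""
    n_states, n_obs = T.shape[0], O.shape[0]
    xs = np.empty(n, dtype=int)
    ys = np.empty(n, dtype=int)
    x = x0
    for k in range(n):
        xs[k] = x
        ys[k] = rng.choice(n_obs, p=O[:, x])   # observation given the state
        x = rng.choice(n_states, p=T[:, x])    # next state given the state
    return xs, ys

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
xs, ys = sample_hmm(T, O, 10_000, rng)
```

In an identification setting only `ys` would be available; `xs` is returned here so that the simulated data can be compared with the model.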

1.4 Problem Formulation

In practice, the transition and observation matrices are usually unknown and hard to

model. The data that is readily available is usually a sequence of observations from

the system: $y_1, y_2, \dots, y_k$. The system identification (or learning) problem is to find

estimates $\hat{O}$ and $\hat{T}$ from the measured output sequence.

We will in a later chapter also be concerned with the problem of only estimating the transition matrix from a sequence of observations, when the observation matrix is assumed to be known.

1.4.1 Example Problem

Again continuing on the previous example, we might not be lucky enough to have a

scientist friend that can provide us with a model of the weather, nor any a priori insight

into how good the guard is at dressing appropriately every morning. In this case, we can

simply make a list of the outfit each day over a period of time (say: day 1 - shirt

and short pants, day 2 - shirt and short pants, day 3 - long pants and a coat, . . . )

and then apply one of the algorithms outlined in this text to get good estimates of the

(weather) transition matrix and the (outfit) observation matrix. Once we have estimates

of these two matrices we can apply an HMM-filter to deduce the current weather and do

predictions about future weather conditions, which we can use to plan our escape from

the jail.
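The filtering and prediction step mentioned above can be sketched with the standard forward recursion for HMMs, in which the belief over the hidden state is propagated through $T$ and then reweighted by the row of $O$ that corresponds to the new observation. The sketch assumes that the matrices are known (here the true ones from Equation (1.10) and Equation (1.14) are used in place of estimates), that the prior is the stationary distribution of the weather chain, and that indices are 0-based. The function names `hmm_filter` and `predict` are hypothetical.

```python
import numpy as np

T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.2, 0.4],
              [0.1, 0.7, 0.4]])
O = np.array([[0.9, 0.2, 0.0],
              [0.1, 0.5, 0.2],
              [0.0, 0.3, 0.8]])

def hmm_filter(ys, T, O, prior):
    """Return out[k, i] = Pr[x_k = i | y_1, ..., y_k] for 0-based observations."""
    belief = np.asarray(prior, dtype=float)
    out = np.empty((len(ys), T.shape[0]))
    for k, y in enumerate(ys):
        if k > 0:
            belief = T @ belief        # time update (prediction)
        belief = O[y] * belief         # measurement update
        belief = belief / belief.sum() # normalize to a probability vector
        out[k] = belief
    return out

def predict(belief, T, steps=1):
    """Distribution of the state `steps` time steps ahead of `belief`."""
    return np.linalg.matrix_power(T, steps) @ belief

# Outfits seen on five consecutive days (0 = shirt and short pants, ...).
ys = [0, 0, 1, 2, 2]
prior = np.array([4 / 9, 2 / 9, 1 / 3])  # stationary distribution of T
posteriors = hmm_filter(ys, T, O, prior)
tomorrow = predict(posteriors[-1], T)
```

After the first day, on which shirt and short pants are observed, the belief is $(0.9, 0.1, 0)$: rain is ruled out because the guard never wears that outfit on a rainy day according to Equation (1.14).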

1.5 Expectation-Maximization Method

The Expectation-Maximization (EM) method is one of the most widely employed methods for solving the above outlined estimation problem. Since its weaknesses are some of the motivations of the algorithms in this text, we will provide a very brief summary of the algorithm here.

It solves the following optimization problem iteratively:

$$\theta^* = \arg\max_{\theta \in \Theta} \Pr[y_1, y_2, \dots, y_k \mid \theta], \tag{1.15}$$


where $\theta$ are the parameters of the model, $\Theta$ is the feasible parameter set and $\theta^*$ the maximum-likelihood estimate of the true $\theta$. However, in EM, instead of working directly with the likelihood function, $\Pr[y_1, y_2, \dots, y_k \mid \theta]$, an auxiliary likelihood function is considered. After an initial guess $\theta_0$ of the true $\theta$ has been chosen, the algorithm iteratively repeats the following two steps:

Expectation Step Calculate the auxiliary likelihood function

$$Q(\theta_{n-1}, \theta) = \mathrm{E}\big[ \log \Pr[x_1, \dots, x_k, y_1, \dots, y_k \mid \theta] \,\big|\, y_1, \dots, y_k, \theta_{n-1} \big]. \tag{1.16}$$

Maximization Step Update the parameter estimation as

θn = arg maxθ∈Θ

Q(θn−1, θ). (1.17)

The method is also referred to as the Baum-Welch method when it is applied to an

HMM. See for example Rabiner [23] and references therein for more details.

The two main objections to EM are that the algorithm is sometimes extremely

slow and that it is only guaranteed to converge to a local maximum of the likelihood

function.

1.6 Overview of Related Work

There have been attempts to solve the HMM identification problem by casting the

HMM to a linear stochastic state-space model and then using identification techniques

for linear systems to recover the parameters. One of the problems with this approach is

that the parameters of the HMM (i.e. the transition and observation matrices) are subject to

constraints that an ordinary linear state-space model is not; namely that the elements in

each column should sum to one and be non-negative.

A subset of linear systems are positive linear systems, which are closer to HMMs since

they have positivity constraints. However, they do not have the sum-to-one constraints.

The identification of such systems has been studied by for example Anderson [2]. Van-

luyten [26] gives a thorough overview of such, and related, work. He notes that it is

often wrongly stated that obtaining a valid HMM from one with negative parameters

(which is what results from some identification techniques) is just a matter of finding

a similarity transform. He develops a method of positive identification using positive

matrix factorization-methods, meaning that the estimated matrices have non-negative

elements by construction.


However, he only solves the identification problem for HMMs that are in the form

of Mealy machines. In what he refers to as a Mealy HMM, the output at the present

time does not depend solely on the current state of the system, but also on what state

the system will progress to. This closely resembles the observation operator form

employed by Jaeger [13] and Hsu et al. [12]. In their representation of an HMM, the

transition and emission matrices are grouped together.

In this thesis, the interest lies in recovering the transition and observation matrices

explicitly (which gives more insight and intuition on how the system functions). This

corresponds to the Moore HMM of Vanluyten [26], where the event of progressing to a

certain state and the event of producing a certain observation are independent. Van-

luyten [26] shows that a Moore HMM can easily be converted into a Mealy HMM by,

roughly, multiplying together the system matrices. The formula for this conversion in

Vanluyten [26, p. 68] is identical to the formula for finding the observation operators

given in Hsu et al. [12, Lemma 1].

Vanluyten [26] notes that the conversion in the other direction, i.e. from a Mealy HMM

to a Moore HMM, is also always possible, but results in a highly non-minimal represen-

tation of the HMM and that there is currently no known way of reducing a non-minimal

HMM to a minimal one if the positivity of the parameters has to be preserved. This

thesis will focus on identifying the parameters directly and not converting from a Mealy

to a Moore representation of the HMM.

Rodu et al. [24] make a comparison between different methods for HMM identification,

albeit only for models on the Mealy-form (i.e. not recovering explicit expressions for the

system matrices).

Hjalmarsson and Ninness [11] present a novel method for estimating the parameters of

an HMM using sub-space inspired techniques from the automatic control field. Their

algorithm is non-iterative, but cannot guarantee that the estimates are stochastically

valid.

Johnson [14] provides a linear algebraic explanation of the algorithm of Hsu et al. [12].

He does not outline the steps required to recover the transition and observation matrix

of the HMM explicitly though. This recovery step is only left as a note in an appendix

of Hsu et al. [12] and has been generalized to work on a more general class of systems

by Anandkumar et al. [1].


1.7 Outline of Thesis

Chapter 1 provides an overview of the problem at hand along with necessary concepts.

Chapter 2 treats the identification procedure when a batch of data is given using the

spectral learning algorithm of Hsu et al. [12] and the structured non-negative matrix

factorization algorithm of Lakshminarayanan and Raich [16] and Vanluyten et al. [28].

Chapter 3 is devoted to the online estimation problem. Assumptions are stated and the

structured non-negative matrix factorization algorithm is recast into a convex optimiza-

tion problem that can be solved recursively. Three methods of solving the estimation

problem are presented and discussed to some length. The result is three novel methods

of performing online estimation of the transition probabilities of an HMM.

Chapter 4 discusses some practical concerns when implementing the algorithms, along

with methods of measuring the accuracy of the estimates. This is necessary for the

benchmarks to come in the next two chapters.

Chapter 5 and Chapter 6 present numerical results from simulations of the algorithms

outlined in the previous chapters. The batch data identification methods are tested on

various systems with various amounts of available samples. The online methods are

tested on time-invariant systems, but also on a time-variant system where the dynamics

of the HMM change with time.

Chapter 7 concludes the thesis and summarizes the results. Implications for future work

are also provided.

Chapter 2

Methods for Identification Using

Batch Data

This chapter introduces two algorithms for identification of HMMs when batch data

is available. This means that the algorithms work in an offline setting: a batch of

observations is presented and the algorithms process the data to generate estimates of

the parameters of the HMM. If a new observation is made, then the process has to be

repeated from the beginning.

2.1 Introduction to Spectral Learning

Spectral learning refers to algorithms using spectral (i.e. eigen- or singular value) de-

compositions to find estimates of some unknown parameters of a system. For the case

of HMMs, this means finding estimates of the transition matrix and the observation

matrix. Hsu et al. [12] proposed a spectral method for identification of HMMs, which

has been generalized to other models by Anandkumar et al. [1].

Although Hsu et al. [12] are concerned with estimating probabilities of sequences of

outputs, i.e. estimating the cumulative joint distribution Pr[y1, y2, . . . , yk] and the con-

ditional distribution Pr[yk+1|y1, y2, . . . , yk], they leave a note (Hsu et al. [12, Appendix

C]) on how the method by Mossel and Roch [22] can be used in conjunction with their

method to recover explicit expressions for the (estimated) transition and observation

matrices. This combined method will be outlined in the subsequent sections.

The idea of Hsu et al. [12] is to relate observable quantities, correlations in triplets in

an output sequence, to the system parameters using some clever algebraic tricks. Since

a large part of the description was displaced to an appendix, some fairly crucial steps


were explained in rather few words. In this chapter, we will try to explain these steps

in more detail and emphasize some of the quirks of the algorithm.

2.2 Moments

The method by Hsu et al. [12] (which will be referred to as the spectral learning al-

gorithm) relies on getting empirical estimates of the joint probabilities of n-tuples of

observations (nth order moments), with n being 1, 2 and 3. The first and second order

moments are vector and matrix quantities, respectively:

Definition 2.1. The first order moment, $S_1 \in [0, 1]^{Y}$, is defined as

$$[S_1]_i = \Pr[y_1 = i] \quad \forall\, i = 1, 2, \ldots, Y, \tag{2.1}$$

and the second order moment, $S_{2,1} \in [0, 1]^{Y \times Y}$, is defined as

$$[S_{2,1}]_{ij} = \Pr[y_2 = i, y_1 = j] \quad \forall\, i, j = 1, 2, \ldots, Y. \tag{2.2}$$

Remark 2.2. Note that it is possible to define other second order moments, such as $S_{1,3}$,

in an analogous fashion.

The third order moment is a third order tensor, but can be represented as a matrix if

one index is fixed:

Definition 2.3. The third order moments, $S_{3,y,1} \in [0, 1]^{Y \times Y}$, are defined as

$$[S_{3,y,1}]_{ij} = \Pr[y_3 = i, y_2 = y, y_1 = j], \tag{2.3}$$

for $i, j = 1, 2, \ldots, Y$.

Remark 2.4. $S_1$, $S_{2,1}$ and $S_{3,y,1}$ are denoted $P_1$, $P_{2,1}$ and $P_{3,y,1}$ in Hsu et al. [12] and

related work. This notation is a bit unfortunate since it clashes with the customary

notation of the transition matrix as P and can therefore cause unnecessary confusion.
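As a concrete illustration of Definitions 2.1 and 2.3, the following is a minimal Python/NumPy sketch of how empirical estimates of the moments could be formed from a list of observation triplets. The function name `empirical_moments`, the zero-based encoding of the observation symbols and the array layout `S3y1[y, i, j]` are assumptions made for this illustration; they are not taken from Hsu et al. [12].

```python
import numpy as np

def empirical_moments(triplets, num_obs):
    """Relative-frequency estimates of S_1, S_{2,1}, S_{3,1} and S_{3,y,1}.

    triplets: iterable of (y1, y2, y3) with symbols in 0, ..., num_obs - 1
    (zero-based encoding is an assumption of this sketch).
    """
    triplets = np.asarray(triplets, dtype=int)
    M = len(triplets)
    S1 = np.zeros(num_obs)
    S21 = np.zeros((num_obs, num_obs))
    S3y1 = np.zeros((num_obs, num_obs, num_obs))  # indexed as [y, i, j]
    for y1, y2, y3 in triplets:
        S1[y1] += 1           # [S_1]_i        = Pr[y1 = i]
        S21[y2, y1] += 1      # [S_{2,1}]_ij   = Pr[y2 = i, y1 = j]
        S3y1[y2, y3, y1] += 1 # [S_{3,y,1}]_ij = Pr[y3 = i, y2 = y, y1 = j]
    S1 /= M
    S21 /= M
    S3y1 /= M
    S31 = S3y1.sum(axis=0)    # marginalize over the middle observation
    return S1, S21, S31, S3y1

# Toy data: four triplets over two observation symbols.
S1, S21, S31, S3y1 = empirical_moments(
    [(0, 1, 0), (0, 1, 1), (1, 0, 0), (0, 0, 1)], num_obs=2)
```

Each returned array sums to one, since it is a normalized histogram over the sampled triplets.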

2.3 Derivation of the Spectral Learning Algorithm

We will in this section derive relations between the moments (of which we can form

empirical estimates using measured batch data) and the HMM parameters T and O, i.e.

the transition and observation matrices. As we derive these expressions, the algorithm

will become clear. We provide a summary of the algorithm in Figure 2.1.


To begin with, consider the ith component of the first order moment vector,

$$\begin{aligned}
[S_1]_i &= \Pr[y_1 = i] \\
&= \sum_{j \in \mathcal{X}} \Pr[y_1 = i \mid x_1 = j] \Pr[x_1 = j] \\
&= \sum_{j \in \mathcal{X}} [O]_{ij} [\pi]_j \\
&= [O\pi]_i,
\end{aligned} \tag{2.4}$$

where the law of total probability was used in the second equality. This is equivalent to

$$S_1 = O\pi, \tag{2.5}$$

if written as a vector equation. An intuitive interpretation of this equation is that π

provides the probability of the Markov chain being in a certain state, and that O

then multiplies π to provide the probability of each possible observation. We will now

derive similar expressions for the higher order moments as well. A few more steps are

required to express the second order moment matrix using O and T . Consider S2,1 on

component form,

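$$\begin{aligned}
[S_{2,1}]_{ij} &= \Pr[y_2 = i, y_1 = j] \\
&= \sum_{m \in \mathcal{X}} \Pr[y_2 = i, y_1 = j \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} \underbrace{\Pr[y_2 = i \mid x_2 = m]}_{= [O]_{im}} \Pr[y_1 = j \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} \underbrace{\Pr[y_1 = j \mid x_2 = m, x_1 = n]}_{= \Pr[y_1 = j \mid x_1 = n] = [O]_{jn}} \Pr[x_1 = n \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} [O]_{jn} \frac{\Pr[x_1 = n, x_2 = m]}{\Pr[x_2 = m]} \Pr[x_2 = m] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} [O]_{jn} \Pr[x_2 = m \mid x_1 = n] \Pr[x_1 = n] \\
&= \sum_{m \in \mathcal{X}} [O]_{im} \sum_{n \in \mathcal{X}} [O]_{jn} [T]_{mn} [\pi]_n \\
&= \sum_{m \in \mathcal{X}} \sum_{n \in \mathcal{X}} [O]_{im} [T]_{mn} [\pi]_n [O^T]_{nj} \\
&= [O T \operatorname{diag}(\pi) O^T]_{ij},
\end{aligned} \tag{2.6}$$

where the third and fourth equalities use the conditional independence of the Markov chain and the fifth and sixth are applications of Bayes' formula. This can be written as a matrix expression as

$$S_{2,1} = O T \operatorname{diag}(\pi) O^T. \tag{2.7}$$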

It might be easier to gain some intuition of this equation by a simple reformulation:

$$\begin{aligned}
S_{2,1} &= O T \operatorname{diag}(\pi) O^T \\
&= O T [O \operatorname{diag}(\pi)]^T.
\end{aligned} \tag{2.8}$$

Read from right to left: π provides the probability of the Markov chain being

in a certain state, O then gives the probability of the first item in the observation pair,

T transitions the Markov chain to the next state, and the second O finally

gives the probability of the second observation.
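Equations (2.5) and (2.7) can be checked numerically on a small example. The sketch below uses a hypothetical HMM with two hidden states and three observation symbols (the numbers in `O`, `T` and `pi` are made up for illustration) and compares the matrix expression with a brute-force sum over the hidden states.

```python
import numpy as np

# Hypothetical example HMM; columns of O and T sum to one.
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])        # [O]_ij = Pr[y = i | x = j]
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # [T]_mn = Pr[x2 = m | x1 = n]
pi = np.array([2.0, 1.0]) / 3.0   # stationary distribution of T

def brute_force_S21(O, T, pi):
    """Pr[y2 = i, y1 = j] by summing over the hidden states x1 = n, x2 = m."""
    Y, X = O.shape
    S = np.zeros((Y, Y))
    for i in range(Y):
        for j in range(Y):
            for n in range(X):
                for m in range(X):
                    S[i, j] += pi[n] * O[j, n] * T[m, n] * O[i, m]
    return S

S1 = O @ pi                          # Equation (2.5)
S21 = O @ T @ np.diag(pi) @ O.T      # Equation (2.7)
S21_check = brute_force_S21(O, T, pi)
max_deviation = np.abs(S21 - S21_check).max()
```

The two computations of the second order moment should agree up to floating-point round-off.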

As mentioned earlier, the third order moment is a (third order) tensor, but can be

written as a second order tensor (i.e. a matrix) if one index is fixed. Let us fix the

second index to be y. In this case, we can derive the following expression for the (i, j)th

component of S3,y,1:

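$$\begin{aligned}
[S_{3,y,1}]_{ij} &= \Pr[y_3 = i, y_2 = y, y_1 = j] \\
&= \sum_{n \in \mathcal{X}} \Pr[y_3 = i, y_2 = y, y_1 = j \mid x_3 = n] \Pr[x_3 = n] \\
&= \sum_{n \in \mathcal{X}} \underbrace{\Pr[y_3 = i \mid x_3 = n]}_{= [O]_{in}} \Pr[y_2 = y, y_1 = j \mid x_3 = n] \Pr[x_3 = n] \\
&= \sum_{n \in \mathcal{X}} [O]_{in} \sum_{m \in \mathcal{X}} \Pr[y_2 = y, y_1 = j \mid x_3 = n, x_2 = m] \Pr[x_2 = m \mid x_3 = n] \Pr[x_3 = n] \\
&= \sum_{n \in \mathcal{X}} [O]_{in} \sum_{m \in \mathcal{X}} \Pr[y_2 = y \mid x_2 = m] \Pr[y_1 = j \mid x_2 = m] \Pr[x_3 = n \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{n \in \mathcal{X}} [O]_{in} \sum_{m \in \mathcal{X}} [O]_{ym} \Pr[y_1 = j \mid x_2 = m] [T]_{nm} \Pr[x_2 = m] \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} [O]_{in} [O]_{ym} [T]_{nm} \sum_{l \in \mathcal{X}} \Pr[y_1 = j \mid x_2 = m, x_1 = l] \Pr[x_1 = l \mid x_2 = m] \Pr[x_2 = m] \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} [O]_{in} [O]_{ym} [T]_{nm} \sum_{l \in \mathcal{X}} [O]_{jl} \Pr[x_2 = m \mid x_1 = l] \Pr[x_1 = l] \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} \sum_{l \in \mathcal{X}} [O]_{in} [O]_{ym} [T]_{nm} [O]_{jl} [T]_{ml} [\pi]_l \\
&= \sum_{n \in \mathcal{X}} \sum_{m \in \mathcal{X}} \sum_{l \in \mathcal{X}} [O]_{in} [T]_{nm} [O]_{ym} [T]_{ml} [\pi]_l [O^T]_{lj} \\
&= [O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T]_{ij}.
\end{aligned} \tag{2.9}$$

Or, expressed as a matrix equation,

$$S_{3,y,1} = O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T. \tag{2.10}$$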

We will also need to consider the second order moment with a “jump”, S3,1. Instead

of redoing similar calculations as those resulting in Equation (2.7), we note that by the

law of total probability

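$$\begin{aligned}
[S_{3,1}]_{ij} &= \Pr[y_3 = i, y_1 = j] \\
&= \sum_{y \in \mathcal{Y}} \Pr[y_3 = i, y_2 = y, y_1 = j] \\
&= \sum_{y \in \mathcal{Y}} [S_{3,y,1}]_{ij}
\end{aligned} \tag{2.11}$$

and thus that

$$\begin{aligned}
S_{3,1} &= \sum_{y \in \mathcal{Y}} S_{3,y,1} \\
&= \sum_{y \in \mathcal{Y}} O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T \\
&= O T \Big( \sum_{y \in \mathcal{Y}} \operatorname{diag}(e_y^T O) \Big) T \operatorname{diag}(\pi) O^T \\
&= O T \begin{pmatrix} \sum_{y \in \mathcal{Y}} [O]_{y,1} & & 0 \\ & \ddots & \\ 0 & & \sum_{y \in \mathcal{Y}} [O]_{y,X} \end{pmatrix} T \operatorname{diag}(\pi) O^T \\
&= O T I T \operatorname{diag}(\pi) O^T \\
&= O T T \operatorname{diag}(\pi) O^T,
\end{aligned} \tag{2.12}$$

where the fact that O is a stochastic matrix, and therefore that the elements of each column sum to one, was used in the fifth equality.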
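The relations (2.10) and (2.12) can be checked in the same way. The following sketch, again for a hypothetical two-state, three-symbol HMM with made-up numbers, evaluates the triple sum in Equation (2.9) directly with `np.einsum` and compares it to the matrix expressions.

```python
import numpy as np

# Hypothetical example HMM (same conventions as in Equations (2.4)-(2.12)).
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0
Y, X = O.shape

# Triple sum over hidden states l (time 1), m (time 2), n (time 3):
# Pr[y3=i, y2=y, y1=j] = sum_{l,m,n} pi_l O_jl T_ml O_ym T_nm O_in
S3y1_sum = np.einsum('l,jl,ml,ym,nm,in->yij', pi, O, T, O, T, O)

# Matrix form, Equation (2.10); O[y] is the row vector e_y^T O.
S3y1 = np.stack([O @ T @ np.diag(O[y]) @ T @ np.diag(pi) @ O.T
                 for y in range(Y)])

# Second order moment with a "jump", Equation (2.12).
S31 = O @ T @ T @ np.diag(pi) @ O.T
S31_from_sum = S3y1.sum(axis=0)      # Equation (2.11)
```

Here `S3y1[y]` holds the matrix $S_{3,y,1}$; the leading index for y is a layout choice of this sketch.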

The spectral learning method relies on two assumptions. Firstly,

Assumption 2.5 (Hsu et al. [12]). π > 0 elementwise, and O and T are rank X.

Secondly, when Hsu et al. [12] generalized the method in Mossel and Roch [22] to handle

cases where there are more possible observations than hidden states, they introduced a

matrix $U \in \mathbb{R}^{Y \times X}$ with the following property:

Assumption 2.6 (Hsu et al. [12]). $U^T O$ is invertible.


This matrix can be freely chosen as long as Assumption 2.6 is satisfied. However, to

make the algorithm more concrete, Hsu et al. [12] provide a suggestion for how this

matrix can be chosen. We state, and slightly rephrase, Lemma 2 of Hsu et al. [12]:

Lemma 2.7. Assume Assumption 2.5 holds, then rank(S2,1) = X and the matrix V of

left singular vectors of S2,1 corresponding to non-zero singular values, has range(V ) =

range(O). So taking U = V fulfills Assumption 2.6.

The interested reader can find the proof in Hsu et al. [12, p. 6]. We will from here on

assume that the choice suggested in Lemma 2.7 is made for U .

Now that we have derived expressions for all the necessary moment tensors, we will

show how they can be combined cleverly to yield an expression for recovering O. The

vital step of the algorithm is as follows: We will rewrite Equation (2.10) to introduce

Equation (2.12), which is easy since they share a common factor. This will result in a

diagonalization from which the diagonal with eigenvalues can be identified as one row

of O. As noted above, we introduce the matrix U to be able to handle the case that we

have more possible discrete observations than hidden states. Consider Equation (2.10)

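pre-multiplied by $U^T$,

$$\begin{aligned}
U^T S_{3,y,1} &= U^T O T \operatorname{diag}(e_y^T O) T \operatorname{diag}(\pi) O^T \\
&= U^T O T \operatorname{diag}(e_y^T O) \underbrace{(U^T O T)^{-1} (U^T O T)}_{= I} T \operatorname{diag}(\pi) O^T \\
&= (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} U^T \underbrace{O T T \operatorname{diag}(\pi) O^T}_{= S_{3,1} \text{ by } (2.12)} \\
&= (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} U^T S_{3,1}.
\end{aligned} \tag{2.13}$$

$U^T S_{3,1}$ has full row rank by Assumption 2.5 and Assumption 2.6, so multiplying by its Moore-Penrose pseudo-inverse (see, for example, Golub and Van Loan [10] for details) from the right gives

$$(U^T S_{3,y,1})(U^T S_{3,1})^+ = (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1}. \tag{2.14}$$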

This is an eigendecomposition, so the eigenvalues of $(U^T S_{3,y,1})(U^T S_{3,1})^+$ are precisely

the elements of the yth row of the emission matrix O (i.e. $e_y^T O$). Everything on the left

hand side of Equation (2.14) can be estimated from data, which in turn allows us to

calculate an estimate of O, row by row.
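Equation (2.14) can be illustrated numerically with exact moments. The sketch below assumes a hypothetical two-state, three-symbol HMM with made-up numbers, chooses U as in Lemma 2.7 and checks that the eigenvalues of $(U^T S_{3,y,1})(U^T S_{3,1})^+$ coincide, as a set, with the entries of row y of O. With moments estimated from data the agreement would only be approximate.

```python
import numpy as np

# Hypothetical example HMM; O and T have rank X = 2 and pi > 0.
O = np.array([[0.7, 0.1],
              [0.2, 0.3],
              [0.1, 0.6]])
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0
Y, X = O.shape

# Exact moments from Equations (2.7), (2.10) and (2.12).
S21 = O @ T @ np.diag(pi) @ O.T
S3y1 = np.stack([O @ T @ np.diag(O[y]) @ T @ np.diag(pi) @ O.T
                 for y in range(Y)])
S31 = S3y1.sum(axis=0)

# U: left singular vectors of S21 for the X largest singular values (Lemma 2.7).
U = np.linalg.svd(S21)[0][:, :X]
right_inverse = np.linalg.pinv(U.T @ S31)   # Moore-Penrose pseudo-inverse

def row_eigenvalues(y):
    """Sorted eigenvalues of the left hand side of Equation (2.14)."""
    B_y = U.T @ S3y1[y] @ right_inverse
    return np.sort(np.real(np.linalg.eigvals(B_y)))

eigenvalues_per_row = [row_eigenvalues(y) for y in range(Y)]
```

Sorting is used here only to compare sets of values; it deliberately ignores the ordering issue that the following paragraphs address.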

There is a subtlety here, though, that was not recognized by, for example, Mattfeld

[20]: the order of the eigenvalues. Equation (2.14) allows us to calculate an

estimate of one row of O at a time. But using, say, the eig-command (footnote 1) in MATLAB,

the eigenvalues will be returned in a descending order (this is not guaranteed, though; see

StackOverflow [25]). If this command is used to calculate the eigenvalues of the left

hand side of Equation (2.14) for each row of O, then the assembled observation matrix

will be faulty, since the elements of each row will be sorted according to the elements'

sizes. The elements of the rows should instead be sorted according to the order of the hidden

states.

The trick is to exploit the fact that the same matrix, UTOT , diagonalizes the left hand

side of Equation (2.14) for all y ∈ Y. To get a consistent ordering amongst the elements

in the rows of O, we do an eigendecomposition for some y and then use the ordered set of

eigenvectors from this decomposition to diagonalize the left hand side of Equation (2.14)

for every other y. Mattfeld [20, see especially Algorithm 1, p. 13] made the

above-mentioned error, and thus gets an inconsistent ordering of the elements amongst the

rows of the estimated observation matrix.

Hsu et al. [12] introduce some further improvements to the estimation procedure of the

observation matrix to increase the robustness. It could be that some row, let us say

row y, of O has (at least) two elements that are equal. This means that the eigenvalues

of $(U^T S_{3,y,1})(U^T S_{3,1})^+$ are not distinct. In this case, if the geometric multiplicity is

lower than the algebraic multiplicity, we would not get the required number of (linearly independent)

eigenvectors to do an eigendecomposition. Consequently, we would not be able to

invert the matrix of eigenvectors to do the diagonalization transformation and recover

the other rows of O.

We could do the initial eigendecomposition for every y ∈ Y until we find one that is valid

in the sense outlined in the previous paragraph. However, this would fail if no row of O

is valid. It is shown in Mossel and Roch [22] that if we, instead of a single row, consider

a random combination of all rows of O, then the eigenvalues will be separated with high

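probability. To do this, we define the Y random variables $g_y \sim N(0, 1)$, $y = 1, \ldots, Y$, and consider the weighted summation of Equation (2.14),

$$\begin{aligned}
\sum_{y \in \mathcal{Y}} g_y (U^T S_{3,y,1})(U^T S_{3,1})^+ &= \sum_{y \in \mathcal{Y}} g_y (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} \\
&= \sum_{y \in \mathcal{Y}} (U^T O T)\, g_y \operatorname{diag}(e_y^T O) (U^T O T)^{-1} \\
&= (U^T O T) \Big( \sum_{y \in \mathcal{Y}} g_y \operatorname{diag}(e_y^T O) \Big) (U^T O T)^{-1}.
\end{aligned} \tag{2.15}$$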

1 http://se.mathworks.com/help/matlab/ref/eig.html


Performing an eigendecomposition of $\sum_{y \in \mathcal{Y}} g_y (U^T S_{3,y,1})(U^T S_{3,1})^+$ will recover the diagonalization transformation $U^T O T$, up to permutations and scalings of the columns, which can be used to diagonalize the left hand side of Equation (2.14) for every $y \in \mathcal{Y}$.

The permutations of the columns correspond to different labelings of the hidden states.

This is further elaborated on in Section 4.1.2 and Section 4.2. Roughly, since the hid-

den states are hidden, their naming is irrelevant in a realization since they cannot be

observed. The above discussion was concerned with guaranteeing a consistent ordering

of the elements amongst the rows of O, i.e. that we use the same labeling of the hidden

states for every possible observation, but not that we recover the order corresponding

to the true system matrices.

The diagonalization matrix we recover from the eigendecomposition of the sum on the

left hand side of Equation (2.15) would be $U^T O T K P$, where K is a diagonal matrix

of scaling factors and P is the identity matrix with some columns swapped. Which

columns of I are swapped to form P is determined by the algorithm we use to do the

eigendecomposition (for example the eig-command in MATLAB).

We multiply both sides of Equation (2.14) by this matrix from the right and by its

inverse from the left, which yields

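$$\begin{aligned}
&(U^T O T K P)^{-1} (U^T S_{3,y,1})(U^T S_{3,1})^+ (U^T O T K P) \\
&\quad = (U^T O T K P)^{-1} (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} (U^T O T K P) \\
&\quad = (K P)^{-1} (U^T O T)^{-1} (U^T O T) \operatorname{diag}(e_y^T O) (U^T O T)^{-1} (U^T O T) (K P) \\
&\quad = P^{-1} K^{-1} \operatorname{diag}(e_y^T O) K P \\
&\quad = P^{-1} K^{-1} K \operatorname{diag}(e_y^T O) P \\
&\quad = P^{-1} \operatorname{diag}(e_y^T O) P,
\end{aligned} \tag{2.16}$$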

since K is diagonal.

We can thus calculate the left hand side of Equation (2.16) for every y ∈ Y to recover

each row of O as the diagonal of the resulting expression. The observation matrix we

construct using this procedure can however have some columns swapped compared to

the true observation matrix. Hsu et al. [12, p. 30] state that the recovered observation

matrix is in exact correspondence with the true observation matrix, but that is not

always true.
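The ordering argument of Equations (2.15) and (2.16) can be sketched in code as follows. The function `recover_observation_matrix` and the helper `matches_up_to_permutation` are hypothetical names introduced for this illustration, the example HMM is made up, and exact moments are used so that the recovery is exact up to round-off. This is a sketch of the procedure described above, not the reference implementation of Hsu et al. [12].

```python
import itertools
import numpy as np

def recover_observation_matrix(S21, S31, S3y1, num_states, rng):
    """Recover O (up to a column permutation) via Equations (2.15)-(2.16)."""
    Y = S21.shape[0]
    U = np.linalg.svd(S21)[0][:, :num_states]
    right_inverse = np.linalg.pinv(U.T @ S31)
    # B[y] is the left hand side of Equation (2.14) for observation y.
    B = np.stack([U.T @ S3y1[y] @ right_inverse for y in range(Y)])
    # Random combination, Equation (2.15); its eigenvectors are shared by all B[y].
    g = rng.standard_normal(Y)
    _, R = np.linalg.eig(np.tensordot(g, B, axes=1))
    R_inv = np.linalg.inv(R)
    # Same eigenvector matrix for every y gives a consistent state labeling.
    O_hat = np.stack([np.diag(R_inv @ B[y] @ R) for y in range(Y)])
    return np.real(O_hat)

def matches_up_to_permutation(O_hat, O):
    """True if some re-ordering of the columns of O_hat equals O."""
    X = O.shape[1]
    return any(np.allclose(O_hat[:, list(p)], O)
               for p in itertools.permutations(range(X)))

# Hypothetical example HMM with exact moments.
O = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
T = np.array([[0.9, 0.2], [0.1, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0
S21 = O @ T @ np.diag(pi) @ O.T
S3y1 = np.stack([O @ T @ np.diag(O[y]) @ T @ np.diag(pi) @ O.T
                 for y in range(O.shape[0])])
S31 = S3y1.sum(axis=0)

O_hat = recover_observation_matrix(S21, S31, S3y1, 2, np.random.default_rng(0))
```

Which column permutation is returned depends on the eigenvalue ordering of the numerical routine, which is why the comparison is made over all permutations.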

Once O is recovered, we can find π and T using the following relations:

$$O^+ S_1 = O^+ O \pi = \pi \tag{2.17}$$


Algorithm: Spectral Learning (SL)

Input: X - number of hidden states, Y - number of possible discrete observations, M - number of observation triplets to sample.

Output: HMM parameters O, T and π.

1. Sample M triplets of observations (y1, y2, y3) from the HMM and form empirical estimates $\hat{S}_1$, $\hat{S}_{2,1}$, $\hat{S}_{3,1}$ and $\hat{S}_{3,y,1}$ for y = 1, . . . , Y.

2. Calculate the SVD of $\hat{S}_{2,1}$ and form U by taking the left singular vectors corresponding to the X largest singular values as columns.

3. Generate Y random variables $g_y \sim N(0, 1)$, y = 1, . . . , Y.

4. Perform an eigendecomposition of the left hand side of Equation (2.15) when the above estimates are used, i.e. of

$$\sum_{y \in \mathcal{Y}} g_y (U^T \hat{S}_{3,y,1})(U^T \hat{S}_{3,1})^+. \tag{2.19}$$

5. Use the (same) matrix of eigenvectors from Step 4 to diagonalize $(U^T \hat{S}_{3,y,1})(U^T \hat{S}_{3,1})^+$ for y = 1, . . . , Y and take the diagonal as row y of O. The diagonalizations are performed by multiplying by the inverse of the eigenvector matrix from the left and by the eigenvector matrix itself from the right, in accordance with Equation (2.16).

6. Calculate

$$\pi \leftarrow O^+ \hat{S}_1 \tag{2.20}$$

and

$$T \leftarrow O^+ \hat{S}_{2,1} (O^+)^T \operatorname{diag}(\pi)^{-1}. \tag{2.21}$$

7. Return O, T and π.

Figure 2.1: The spectral learning algorithm.

and

$$O^+ S_{2,1} (O^+)^T \operatorname{diag}(\pi)^{-1} = O^+ (O T \operatorname{diag}(\pi) O^T) (O^+)^T \operatorname{diag}(\pi)^{-1} = T. \tag{2.18}$$

Note that the transition matrix and stationary distribution that we recover using the above relations have the same re-ordering of the states as the observation matrix used in the expressions.

The spectral learning algorithm uses the method outlined above to recover estimates of the observation matrix, transition matrix and stationary distribution using estimates $\hat{S}_1$, $\hat{S}_{2,1}$, $\hat{S}_{3,1}$ and $\hat{S}_{3,y,1}$ instead of the true tensors. The algorithm is summarized in Figure 2.1.
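The recovery relations (2.17) and (2.18), together with the remark on re-ordered states, can be illustrated as follows. The function name `recover_pi_and_T` and the example numbers are assumptions of this sketch; exact moments are used, so the recovery is exact up to round-off.

```python
import numpy as np

def recover_pi_and_T(O, S1, S21):
    """Apply Equations (2.17) and (2.18) for a given observation matrix O."""
    O_pinv = np.linalg.pinv(O)                      # O^+ (O has full column rank)
    pi = O_pinv @ S1                                # Equation (2.17)
    T = O_pinv @ S21 @ O_pinv.T @ np.diag(1.0 / pi) # Equation (2.18)
    return pi, T

# Hypothetical example HMM with exact moments.
O = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
T = np.array([[0.9, 0.2], [0.1, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0
S1 = O @ pi
S21 = O @ T @ np.diag(pi) @ O.T

pi_rec, T_rec = recover_pi_and_T(O, S1, S21)

# If the columns of O are swapped (another labeling of the hidden states),
# pi and T come out with the same re-labeling.
perm = [1, 0]
pi_perm, T_perm = recover_pi_and_T(O[:, perm], S1, S21)
```

With the swapped observation matrix, `pi_perm` equals `pi[perm]` and `T_perm` equals `T` with both rows and columns re-ordered by `perm`.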


2.4 Introduction to Non-Negative Matrix Factorization

One quite serious flaw of the spectral learning algorithm is that it gives no guarantee

that the estimated matrices are valid stochastic matrices. It is a one-shot method

that can result in a transition or observation matrix with negative (or perhaps worse,

imaginary) elements. It appears that this problem has not yet been tackled successfully

in the literature. Another approach to the identification procedure of HMMs is to

leverage the results from Non-Negative Matrix Factorization (NNMF) theory. Much of

this work is inspired by the identification methods developed in the control field for

positive linear systems.

Using an NNMF ensures the non-negativity of the estimated parameters. This is not

enough for the HMM identification problem however, since it has some further complica-

tions compared to regular linear systems: the system parameters have to be stochastic.

In particular, this means that the elements of each column of O and T have to sum to

one. These constraints have to be considered when performing the factorization.

It is possible to formulate the HMM identification problem using an NNMF as an op-

timization problem. This is done by Vanluyten et al. [28] and Lakshminarayanan and

Raich [16]. Their methods are very similar and only differ in the way that the opti-

mization problem is solved. Their methods are also conceptually close to the spectral

learning algorithm outlined in the previous section: a matrix of second order moments

is empirically estimated from data and then decomposed into factors, which can be

identified as the parameters of the HMM.

Vanluyten et al. [27] generalize the method to use higher order moments as well. This

complicates the procedure of recovering explicit expressions for the transition and ob-

servation matrix. Recovering explicit expressions is not the aim in Vanluyten et al. [27]

though, and only formulas to recover their product are derived. This is also done by

Finesso et al. [9] and Cybenko and Crespi [7] using slightly different procedures.

In this text, only the methods of Vanluyten et al. [28] and Lakshminarayanan and Raich

[16] will be discussed, since they resemble the spectral learning algorithm the most.


2.5 Identification using Structured Non-Negative Matrix

Factorization

A Non-Negative Matrix Factorization (NNMF) is a decomposition of a non-negative

matrix Q into two non-negative matrices V and A such that

Q = V A. (2.22)

Non-negativity in this text refers to elementwise non-negativity of the matrices. Methods

for solving this decomposition usually relax the above constraint to

Q ≈ V A. (2.23)

See for example Lee and Seung [17] for some background on NNMF and a discussion on

how the decomposition can be performed numerically.

Equation (2.22) is the standard NNMF. The methods mentioned in the previous sec-

tion that employ a pure NNMF for the identification procedure do not recover explicit

expressions for the transition and observation matrix. Since our interest lies in actually

recovering these matrices, we will only describe the method that uses a slight variation

of the standard NNMF. Vanluyten et al. [28] introduce the Structured Non-Negative Ma-

trix Factorization (SNNMF), which is also employed by Lakshminarayanan and Raich

[16] without citing Vanluyten et al. [28] and without introducing the name SNNMF.

In an SNNMF, the decomposition of Q is still performed into two non-negative matrices

V and A (albeit different from those in Equation (2.22)), but using a slightly different

structure of the decomposition:

$$Q = V A V^T. \tag{2.24}$$

Again, the decomposition is usually done approximately and not exactly.

The insight of Vanluyten et al. [28] and Lakshminarayanan and Raich [16] is to compare

Equation (2.7) from the previous section,

$$S_{2,1} = O T \operatorname{diag}(\pi) O^T, \tag{2.7}$$

to Equation (2.24). They realize that it is possible to identify

$$Q = S_{2,1}, \tag{2.25}$$

$$V = O, \tag{2.26}$$

$$A = T \operatorname{diag}(\pi), \tag{2.27}$$


and then formulate and solve Equation (2.24) approximately as a constrained minimiza-

tion problem,

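$$\begin{aligned}
\min_{V, A} \quad & \| S_{2,1} - V A V^T \| \\
\text{s.t.} \quad & A \geq 0, \quad \mathbf{1}^T A \mathbf{1} = 1, \\
& V \geq 0, \quad \mathbf{1}^T V = \mathbf{1}^T,
\end{aligned} \tag{2.28}$$

where $V \in \mathbb{R}^{Y \times X}$ and $A \in \mathbb{R}^{X \times X}$. The resulting V immediately recovers the observation matrix, O. T and π can be recovered using the following expressions,

$$\pi = A^T \mathbf{1} \tag{2.29}$$

and

$$T = A \operatorname{diag}(\pi)^{-1}. \tag{2.30}$$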

The constraints and the relations for T and π will be discussed further in the next

chapter, but it is intuitively plausible that they enforce the stochasticity of the involved

quantities.
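To make the identification in Equations (2.25) to (2.30) concrete, the sketch below builds $A = T \operatorname{diag}(\pi)$ for a hypothetical example HMM (made-up numbers), checks that the pair (V, A) = (O, A) satisfies the constraints of Equation (2.28) with zero residual, and recovers π and T from A. It does not solve the optimization problem itself.

```python
import numpy as np

# Hypothetical example HMM; columns of O and T sum to one.
O = np.array([[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]])
T = np.array([[0.9, 0.2], [0.1, 0.8]])
pi = np.array([2.0, 1.0]) / 3.0
X = T.shape[0]

S21 = O @ T @ np.diag(pi) @ O.T    # Equation (2.7)
V = O                              # Equation (2.26)
A = T @ np.diag(pi)                # Equation (2.27): [A]_mn = Pr[x2 = m, x1 = n]

# Constraints of Equation (2.28).
feasible = bool((A >= 0).all() and np.isclose(A.sum(), 1.0)
                and (V >= 0).all() and np.allclose(V.sum(axis=0), 1.0))

# Objective value at the true parameters (Frobenius norm assumed here).
residual = np.linalg.norm(S21 - V @ A @ V.T)

# Equations (2.29) and (2.30).
pi_rec = A.T @ np.ones(X)
T_rec = A @ np.diag(1.0 / pi_rec)
```

The recovery of π works because each column of T sums to one, so summing the rows of A leaves exactly the entries of π.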

Note that the exact norm is unspecified in Equation (2.28). Vanluyten et al. [28] con-

sider the Kullback-Leibler divergence and propose an iterative algorithm for solving the

minimization problem. Lakshminarayanan and Raich [16] consider both the Kullback-

Leibler divergence and an unspecified norm, presumably the Euclidean, but provide

only an iterative method employing an alternating least squares approach for solving

the minimization problem for the unspecified norm.

The idea is then, just as in the spectral learning algorithm, to estimate $S_{2,1}$ from data and use the estimate $\hat{S}_{2,1}$ in the minimization problem to recover estimates $\hat{O}$, $\hat{T}$ and $\hat{\pi}$. The benefit of this approach is that the estimates are stochastically valid by construction.

However, Equation (2.28) is a non-convex optimization problem, which brings along the

question of local minima. Both the algorithm by Vanluyten et al. [28] and the algorithm

by Lakshminarayanan and Raich [16] are only guaranteed to converge to a local minimum.

This is the current trade-off between the spectral learning algorithm, that avoids the

problem with local minima but fails to guarantee the generation of valid estimates, and

the SNNMF algorithm.

The complete SNNMF algorithm is outlined in Figure 2.2, where it should be noted that

minimizing $\| \cdot \|$ is the same as minimizing $\| \cdot \|^2$.


Algorithm: Structured Non-Negative Matrix Factorization (SNNMF)

Input: X - number of hidden states, Y - number of possible discrete observations, M - number of observation pairs to sample.

Output: HMM parameters O, T and π.

1. Sample M pairs of observations (y1, y2) from the HMM and form an empirical estimate $\hat{S}_{2,1}$.

2. Solve the optimization problem

$$\begin{aligned}
\min_{V \in \mathbb{R}^{Y \times X},\, A \in \mathbb{R}^{X \times X}} \quad & \| \hat{S}_{2,1} - V A V^T \|_F^2 \\
\text{s.t.} \quad & A \geq 0, \quad \mathbf{1}^T A \mathbf{1} = 1, \\
& V \geq 0, \quad \mathbf{1}^T V = \mathbf{1}^T.
\end{aligned} \tag{2.31}$$

3. Calculate

$$\pi \leftarrow A^T \mathbf{1} \tag{2.32}$$

and

$$T \leftarrow A \operatorname{diag}(\pi)^{-1}. \tag{2.33}$$

4. Return O ← V, T and π.

Figure 2.2: The structured non-negative matrix factorization algorithm.

Chapter 3

Methods for Online Identification

This chapter presents three novel methods for estimating the dynamics of the Markov

chain underlying an HMM when the dynamics of the sensor used to make observations

of the HMM are assumed to be known. We first give a brief introduction to convexity

and convex optimization, since we will be employing such concepts when we in the later

sections derive the methods.

3.1 Introduction

In the previous chapter, we assumed that we were given a sequence of outputs from an

HMM, or could wait long enough to sample said sequence. This sequence was then used

to estimate the transition and observation matrix of the generating HMM. One possible

critique of this approach is that once a new measurement becomes available, everything

has to be recalculated from scratch. This is not a problem for applications where time,

memory and computational resources are near unlimited. However, for some real-time

applications, one or more of these factors can be severely limited.

By online identification, we refer to a recursive formulation of an identification algorithm.

This means that once a new measurement becomes available, a new estimate is generated

using only the new measurement and the old estimate. In this way, we avoid storing all

observations in memory, and we also avoid redoing expensive calculations from scratch

(such as the singular-value decomposition in the spectral learning algorithm).

The aim of this chapter is to formulate the SNNMF algorithm from the previous chapter

(see Section 2.5) on a recursive form (under some assumptions). To the best of our

knowledge, this has not been done before in the literature, and is the main result of this

thesis.


Figure 3.1: Example of a convex function. The (gray) line between any two points on the graph lies above the graph. A negative gradient is shown as a red arrow. Following the direction of the negative gradient at each point (i.e. performing a gradient descent) will reach the global minimum.

3.2 Convexity and Convex Optimization

3.2.1 Introduction

We will see that the optimization problem we try to solve as part of the SNNMF, under

some assumptions, is convex. This section presents some results and methods from

convex optimization that will be used in the recursive formulation.

Most of the concepts and methods that will be discussed below can be found in Boyd

and Vandenberghe [4], which is one of the standard textbooks on convex optimization.

For completeness, we cite their definition of a convex set and a convex function:

Definition 3.1 (Boyd and Vandenberghe [4]). A set C is convex if for any $x_1, x_2 \in C$ and any $\theta \in [0, 1]$, we have

$$\theta x_1 + (1 - \theta) x_2 \in C. \tag{3.1}$$

Definition 3.2 (Boyd and Vandenberghe [4]). A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if the domain of f is a convex set and if for all $x_1$ and $x_2$ in the domain of f, and $\theta \in [0, 1]$, we have

$$f(\theta x_1 + (1 - \theta) x_2) \leq \theta f(x_1) + (1 - \theta) f(x_2). \tag{3.2}$$

The geometrical interpretation of a convex function is that the line segment between any two

points on the function's surface always lies above (or touches) the surface. What is so

valuable about this property is that if we follow the negative gradient of the function, i.e.

the steepest slope downhill, to end up in a local minimum, then we are guaranteed that

it is also a global minimum. This is because any local minimum of a convex function is

also a global minimum (Boyd and Vandenberghe [4, c.f. Section 4.2.2]). Figure 3.1 tries

to make this intuitively plausible. See Boyd and Vandenberghe [4] for a more rigorous

discussion.


Before presenting a selection of methods used in convex optimization, we will first prove

a result that will justify our use of them in the sections to come.

Theorem 3.3. Let $A \in \mathbb{R}^{Y \times Y}$, $B \in \mathbb{R}^{Y \times X}$ and $C \in \mathbb{R}^{X \times X}$. Then $\|A - BCB^T\|$ is a convex function in $C$.

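Proof. Let $f(C) = \|A - BCB^T\|$, $\theta \in [0, 1]$ and $C_1, C_2 \in \mathbb{R}^{X \times X}$. Then, following Definition 3.2,

$$
\begin{aligned}
f(\theta C_1 + (1-\theta) C_2) &= \|A - B(\theta C_1 + (1-\theta) C_2)B^T\| \\
&= \|\underbrace{\theta A + (1-\theta) A}_{=A} - \theta B C_1 B^T - (1-\theta) B C_2 B^T\| \\
&= \|\theta [A - B C_1 B^T] + (1-\theta)[A - B C_2 B^T]\| \\
&\le \|\theta [A - B C_1 B^T]\| + \|(1-\theta)[A - B C_2 B^T]\| \\
&= \theta \|A - B C_1 B^T\| + (1-\theta)\|A - B C_2 B^T\| \\
&= \theta f(C_1) + (1-\theta) f(C_2).
\end{aligned} \tag{3.3}
$$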

Remark 3.4. Note that the norm is left unspecified since only the triangle inequality (together with the absolute homogeneity of the norm) was used in the proof. Also note that $\|\cdot\|^2$ is convex if $\|\cdot\|$ is convex.
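Theorem 3.3 can also be checked numerically. The following is a minimal sketch, assuming the Frobenius norm as the (otherwise unspecified) norm and random matrices of hypothetical sizes Y = 4 and X = 3; it only samples the convexity inequality of Definition 3.2 and is of course no substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
Y, X = 4, 3  # hypothetical dimensions, chosen only for illustration
A = rng.standard_normal((Y, Y))
B = rng.standard_normal((Y, X))


def f(C):
    """Frobenius-norm instance of f(C) = ||A - B C B^T||."""
    return np.linalg.norm(A - B @ C @ B.T, ord="fro")


# Largest observed value of f(theta*C1 + (1-theta)*C2) - [theta*f(C1) + (1-theta)*f(C2)].
worst_gap = -np.inf
for _ in range(1000):
    C1 = rng.standard_normal((X, X))
    C2 = rng.standard_normal((X, X))
    theta = rng.uniform()
    lhs = f(theta * C1 + (1 - theta) * C2)
    rhs = theta * f(C1) + (1 - theta) * f(C2)
    worst_gap = max(worst_gap, lhs - rhs)

# Convexity predicts that the gap is never positive (up to rounding).
print(worst_gap)
```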

3.2.2 The Projected Gradient Descent Method

The regular gradient descent method, mentioned briefly above, is a standard method in

optimization. The idea is to follow the steepest downhill slope of a function to reach

the bottom of a “valley”. For non-convex functions, this means that a local minimum

will be approached. If the function is convex however, then any local minimum is also

a global minimum.

Formally, the regular gradient descent method is as follows. Assume that the aim is to

minimize some function f(x), i.e. solve the problem

$$\min_{x} f(x). \tag{3.4}$$

Take x0 as an initial guess of a global minimum and then iteratively refine the guess as

$$x_{k+1} = x_k - \eta_k \nabla_x f(x_k), \tag{3.5}$$


where $\eta_k$ is the so-called step-size of the gradient descent. (Sometimes, the normalized gradient $\nabla_x f(x_k) / \|\nabla_x f(x_k)\|$ is used and explicitly written out, but this is equivalent to replacing the step-size $\eta_k$ by $\eta_k \times 1/\|\nabla_x f(x_k)\|$.)

This will make x move according to the steepest downward slope of the surface of f(x).

Note that there are no restrictions on what values x can take. Thus, Equation (3.5) is

an unconstrained minimization algorithm.
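As an illustration of the update rule in Equation (3.5), here is a minimal sketch with a constant step-size. The test function, the step-size and the iteration count are assumptions made for this example only.

```python
import numpy as np


def gradient_descent(grad, x0, step, n_iter):
    """Iterate x <- x - step * grad(x), i.e. Equation (3.5) with a constant step-size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - step * grad(x)
    return x


# Convex test function f(x) = 0.5 * ||x - c||^2, with gradient x - c and minimum at c.
c = np.array([1.0, -2.0])
x_min = gradient_descent(lambda x: x - c, x0=[0.0, 0.0], step=0.1, n_iter=500)
print(x_min)
```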

If the optimization problem is of the form

$$
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & x \in C,
\end{aligned} \tag{3.6}
$$

for some convex set C, then the update rule Equation (3.5) will not guarantee that the

constraint is fulfilled.

The projected gradient descent method is a relatively straight-forward generalization of

the gradient descent method to handle constraints. In the projected gradient descent

method constraints are enforced by performing a projection of xk+1 onto the constraint

set C. The method can formally be written as

$$x_{k+1} = \mathrm{Proj}_C\big(x_k - \eta_k \nabla_x f(x_k)\big), \tag{3.7}$$

where $\mathrm{Proj}_C(a)$ is the projection operator solving

$$\mathrm{Proj}_C(a) = \arg\min_{b \in C} \|a - b\|. \tag{3.8}$$

A more detailed description of the projected gradient descent method, along with dis-

cussions on the convergence for different step-sizes, can be found in Fan and Yao [8] and

Wang and Xiu [30].
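The following sketch illustrates Equations (3.7) and (3.8) on a deliberately simple constraint set, the non-negative orthant, for which the Euclidean projection is an element-wise clipping. The test function and all numerical values are assumptions chosen for illustration.

```python
import numpy as np


def projected_gradient_descent(grad, project, x0, step, n_iter):
    """Iterate x <- Proj_C(x - step * grad(x)), i.e. Equation (3.7)."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(n_iter):
        x = project(x - step * grad(x))
    return x


def project_nonneg(a):
    """Euclidean projection onto C = {x : x >= 0}."""
    return np.maximum(a, 0.0)


# Minimize 0.5 * ||x - c||^2 subject to x >= 0; the constrained minimum is max(c, 0).
c = np.array([1.0, -2.0])
x_star = projected_gradient_descent(lambda x: x - c, project_nonneg,
                                    x0=[0.5, 0.5], step=0.1, n_iter=500)
print(x_star)
```

Without the projection, the second coordinate would converge to -2 and violate the constraint.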

3.2.3 The Primal-Dual Method

Another popular constrained optimization method is the primal-dual method. Unlike the

projected gradient descent method, the constraints are only enforced softly, meaning that

they will be fulfilled asymptotically.


For a problem with equality constraints,

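$$
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & g(x) = 0,
\end{aligned} \tag{3.9}
$$

where $g(x) \in \mathbb{R}^m$, the so-called Lagrangian $L$ is introduced as

$$L(x, \lambda) = f(x) + \lambda^T g(x), \tag{3.10}$$

where $\lambda \in \mathbb{R}^m$. The algorithm then iterates between the primal update step,

$$x_{k+1} = x_k - \eta_k \nabla_x L(x_k, \lambda_k) \tag{3.11}$$

and the dual update step,

$$\lambda_{k+1} = \lambda_k + \eta_k \nabla_\lambda L(x_k, \lambda_k). \tag{3.12}$$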

A more thorough discussion can be found in Krishnamurthy and Abad [15] and Boyd

and Vandenberghe [4].
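A minimal sketch of the iteration in Equations (3.11) and (3.12), assuming a toy problem (a quadratic cost with a single sum-to-one constraint) and a constant step-size. Both choices are made here for illustration and are not taken from the references.

```python
import numpy as np

# Toy problem: minimize 0.5 * ||x - c||^2 subject to g(x) = 1^T x - 1 = 0.
c = np.array([1.0, 2.0])
ones = np.ones(2)

x = np.zeros(2)   # primal variable
lam = 0.0         # dual variable (Lagrange multiplier)
eta = 0.1         # constant step-size (assumption)

for _ in range(2000):
    grad_x = (x - c) + lam * ones   # gradient of L(x, lam) with respect to x
    grad_lam = ones @ x - 1.0       # gradient of L(x, lam) with respect to lam, i.e. g(x)
    # Both updates use the old iterates, as in Equations (3.11)-(3.12).
    x, lam = x - eta * grad_x, lam + eta * grad_lam

print(x, lam)
```

For this problem the stationarity condition x - c + lam * 1 = 0 together with the constraint gives x = (0, 1) and lam = 1, which the iterates approach. The constraint is only satisfied in the limit, which is the "soft" enforcement referred to above.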

3.3 Online Structured Non-Negative Matrix Factorization for Known Sensor Dynamics

In some applications, either the transition matrix or the observation matrix is known.

The transition matrix describes the system dynamics and can at times be modelled

using domain knowledge. The observation matrix models the noise and the sensor used

to measure the system. Since the sensor is a design parameter for some systems, we will

assume that the observation matrix is known in this chapter. Thus, we will only try to

recover the dynamics of the underlying Markov chain of the HMM.

With this assumption, we will see that the optimization problem of the SNNMF formulation becomes convex. We can then apply the methods described above and be sure that the global minimum (i.e. the one corresponding to the true transition matrix) will be reached.

It is worth reasoning about how this assumption would influence the spectral learning

algorithm, since it was the topic of the first part of this thesis. The eigendecomposition

would be avoided under this assumption since it is used for recovering O. This would

get rid of the possibility to end up with negative or complex elements in the estimates.

However, there is still no guarantee that the sum-to-one condition will be satisfied.


Lindberg and Omre [19] provide an example of an application where this assumption

is made, i.e., where the observation matrix is known, but the dynamics of the Markov

chain are not. They model earthquakes using a generalized HMM with multiple layers

of hidden states. Vercauteren et al. [29] provide another application, when they try to

solve the problem of estimating the number of competing terminals in a wireless network

to be able to tune parameters that will increase its performance.

3.3.1 Derivation of the Algorithms

We will continue the discussion from where we left it in Section 2.5. To recap, the

identification problem had been formulated as the following minimization problem,

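$$
\begin{aligned}
\min_{V \in \mathbb{R}^{Y \times X},\; A \in \mathbb{R}^{X \times X}} \quad & \|S_{2,1} - V A V^T\|_F^2 \\
\text{s.t.} \quad & A \ge 0, \quad \mathbf{1}^T A \mathbf{1} = 1, \\
& V \ge 0, \quad \mathbf{1}^T V = \mathbf{1}^T,
\end{aligned} \tag{2.31}
$$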

from which estimates of the HMM parameters could be obtained using

$$O = V, \tag{3.13}$$

$$\pi = A^T \mathbf{1}, \tag{2.32}$$

$$T = A\,\mathrm{diag}(\pi)^{-1}. \tag{2.33}$$

This exploited the fact that we can write the second order moment matrix S2,1 as

$$
\begin{aligned}
S_{2,1} &= O\,T\,\mathrm{diag}(\pi)\,O^T \qquad (2.7) \\
&= O A O^T. \qquad\qquad\quad\;\; (3.14)
\end{aligned}
$$

The constraints enforced on $V$ are fairly self-explanatory. Since we identify $V = O$, $V \ge 0$ ensures that we do not deal with negative probabilities. $\mathbf{1}^T V = \mathbf{1}^T$ makes sure that we get a stochastic matrix, where the elements in each column sum to one. Note that these two constraints together ensure that every element lies in the interval $[0, 1]$.


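The constraints that we enforce on $A$ are non-negativity and that the sum of all its elements should be one, which follows from the fact that

$$
\begin{aligned}
\mathbf{1}^T A \mathbf{1} &= \mathbf{1}^T T\,\mathrm{diag}(\pi)\,\mathbf{1} \\
&= \mathbf{1}^T T \pi \\
&= \mathbf{1}^T \pi \\
&= 1,
\end{aligned} \tag{3.15}
$$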

since the sum of the elements of each column of T is one and π is a stochastic vector.

The estimate of the stationary distribution, Equation (2.32), follows from

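$$
\begin{aligned}
A^T \mathbf{1} &= \mathrm{diag}(\pi)\,T^T \mathbf{1} \\
&= \mathrm{diag}(\pi)\,\mathbf{1} \\
&= \pi.
\end{aligned} \tag{3.16}
$$

Note that this ensures that $\pi$ is stochastic since from the constraint $\mathbf{1}^T A \mathbf{1} = 1$, it follows that

$$
\begin{aligned}
1 &= \mathbf{1}^T A^T \mathbf{1} \\
&= \mathbf{1}^T \pi.
\end{aligned} \tag{3.17}
$$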

To derive the formula for the estimate of the transition matrix, we simply solve for

T in the definition of A, namely Equation (3.14), to get T = Adiag(π)−1. Note that

Equation (2.33) also guarantees that T is a valid transition matrix, since

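$$
\begin{aligned}
\mathbf{1}^T T &= \mathbf{1}^T A\,\mathrm{diag}(\pi)^{-1} \\
&= \mathbf{1}^T A\,\mathrm{diag}(A^T \mathbf{1})^{-1} \\
&= \mathbf{1}^T
\begin{bmatrix}
[A]_{1,1} & \cdots & [A]_{1,X} \\
\vdots & \ddots & \vdots \\
[A]_{X,1} & \cdots & [A]_{X,X}
\end{bmatrix}
\begin{bmatrix}
\frac{1}{[A]_{1,1} + \cdots + [A]_{X,1}} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \frac{1}{[A]_{1,X} + \cdots + [A]_{X,X}}
\end{bmatrix} \\
&= \begin{bmatrix}
\frac{[A]_{1,1} + [A]_{2,1} + \cdots + [A]_{X,1}}{[A]_{1,1} + [A]_{2,1} + \cdots + [A]_{X,1}} & \cdots & \frac{[A]_{1,X} + [A]_{2,X} + \cdots + [A]_{X,X}}{[A]_{1,X} + [A]_{2,X} + \cdots + [A]_{X,X}}
\end{bmatrix} \\
&= \mathbf{1}^T,
\end{aligned} \tag{3.18}
$$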

which shows that the elements of each column in T sum to one.

Remark 3.5. The inverse matrix in the expression for $T$, Equation (2.33), is easy to calculate since it is the inverse of a diagonal matrix: $\mathrm{diag}([\gamma_1 \ldots \gamma_n])^{-1} = \mathrm{diag}([\tfrac{1}{\gamma_1} \ldots \tfrac{1}{\gamma_n}])$.
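The recovery of π and T from A in Equations (2.32) and (2.33) can be sketched as follows. The numerical values of T and π are hypothetical and chosen only so that the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical two-state example: column-stochastic T and a stochastic vector pi.
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])

A = T @ np.diag(pi)              # definition of A, cf. Equation (3.14)

pi_rec = A.T @ np.ones(2)        # Equation (2.32): column sums of A
T_rec = A @ np.diag(1.0 / pi_rec)  # Equation (2.33), using Remark 3.5 for the inverse

print(pi_rec)
print(T_rec)
print(T_rec.sum(axis=0))         # column sums of the recovered transition matrix
```

Note that the inverse in Equation (2.33) requires every element of π to be non-zero, i.e. that no column of A sums to zero.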


As mentioned above, the necessary assumption that we will make in this chapter is that

we know the sensor dynamics (i.e. O). Equation (2.31) then reduces to

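$$
\begin{aligned}
\min_{A \in \mathbb{R}^{X \times X}} \quad & \|S_{2,1} - O A O^T\|_F^2 \\
\text{s.t.} \quad & A \ge 0, \quad \mathbf{1}^T A \mathbf{1} = 1.
\end{aligned} \tag{3.19}
$$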

Referring to Theorem 3.3 and Remark 3.4, we can conclude that the cost function is convex. Furthermore, it is easily seen by vectorizing $A$ (i.e. stacking the column vectors as one big vector) that the constraint set is simply the $(X^2 - 1)$-dimensional probability simplex. Boyd and Vandenberghe [4, Section 2.2.4] show that this is a convex set. Thus, the optimization problem in Equation (3.19) is a convex optimization problem.

We will now leverage the methods from the previous section to solve the problem. All of

these methods require gradients of some functions. To avoid having to perform numerical

calculations of these gradients, we will derive analytical expressions.

3.3.1.1 Estimating S2,1 Recursively

However, before solving the optimization problem, we will briefly explain how S2,1 can

be estimated online.

The basic idea is to keep track of the number of times that the output has been observed

to jump between two possible outputs, i.e. keep track of the relative frequencies of pairs

of observations. There are at least two ways of numerically representing the current

estimate. We can either keep track of the probabilities that can be calculated from the

counting, i.e. keep an explicit estimate S2,1, or keep track of the exact number of times

that all possible pairs of observations have been seen and then when needed, calculate

S2,1.

Each approach has its advantages and disadvantages. It seems convenient to keep an ex-

plicit representation of S2,1 available all the time. The problem is updating the estimate

once a new observation is available. To make sure that every observation is accounted

for correctly, one has to keep track of the time k since starting the estimation procedure.

Given the previous observation yk−1 and the new observation yk, the update rule can

be written

$$[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_k}\,\delta_{j,y_{k-1}}}{k}, \tag{3.20}$$

where δi,j is the Kronecker delta. S2,1 is converted to the exact number of previously

observed jumps between the two outputs j and i, the element corresponding to the new

observation is then increased and everything is normalized to yield a new S2,1.


This is the method we will use in the algorithms to be presented in this chapter. Another

approach is to keep track of the number of jumps, and then normalize this matrix to get

an estimate of S2,1 once needed. This implicit representation has some benefits, since we have some flexibility in how we generate our estimate of S2,1. For example, we can choose to weight new observations more if we are tracking a system that is time-varying.

The update procedure in this case would be

$$[C]_{ij} \leftarrow [C]_{ij} + \delta_{i,y_k}\,\delta_{j,y_{k-1}} \times \rho(k) \tag{3.21}$$

and

$$S_{2,1} \leftarrow \frac{C}{\sum_{ij} [C]_{ij}}, \tag{3.22}$$

where we use $C$ as a counting matrix and $\rho(k)$ as a time-varying weight of the observations. Taking $\rho(k) = 1$ yields an unweighted estimate and is equivalent to Equation (3.20). Taking $\rho(k)$ as some increasing function puts more weight on newer observations.
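The two bookkeeping schemes can be sketched as follows. This is a minimal illustration that assumes 0-based observation symbols and uses uniformly random placeholder observations instead of an actual HMM output; the function names are hypothetical.

```python
import numpy as np


def update_explicit(S, k, y_now, y_prev):
    """Equation (3.20): convert back to counts, add the new pair, re-normalize."""
    S = S * (k - 1)
    S[y_now, y_prev] += 1.0
    return S / k


def update_counts(C, y_now, y_prev, weight=1.0):
    """Equation (3.21): accumulate (possibly weighted) pair counts."""
    C = C.copy()
    C[y_now, y_prev] += weight
    return C


rng = np.random.default_rng(1)
Y = 3
ys = rng.integers(0, Y, size=200)  # placeholder observation sequence (assumption)

S = np.zeros((Y, Y))
C = np.zeros((Y, Y))
for k in range(1, len(ys)):        # k counts the observed pairs (y_{k-1}, y_k)
    S = update_explicit(S, k, ys[k], ys[k - 1])
    C = update_counts(C, ys[k], ys[k - 1])   # rho(k) = 1

S_from_counts = C / C.sum()        # Equation (3.22)
print(np.abs(S - S_from_counts).max())
```

With ρ(k) = 1 the two estimates agree up to rounding, as stated above. Passing an increasing `weight` to `update_counts` would emphasize recent observations.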

3.3.1.2 Projected Gradient Descent

The first method we consider for solving Equation (3.19) is the projected gradient descent

method. As explained in Section 3.2.2, we first perform a regular gradient descent and

then project the resulting estimate onto the constraint set. In our case, the constraint set

is the probability simplex. Projections onto simplices have, among others, been studied

by Chen and Ye [6] and Wang and Carreira-Perpinan [31].

We will not dwell on this matter here, but mention that in our implementation, the

method from Wang and Carreira-Perpinan [31] is used due to its ease of implementation.
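For reference, a sort-based Euclidean projection onto the probability simplex can be sketched as below. This is our own minimal rendering of the commonly used sort-and-threshold scheme and is not claimed to reproduce the cited implementation line by line; the function name is hypothetical.

```python
import numpy as np


def project_simplex(y):
    """Euclidean projection of a vector y onto {x : x >= 0, sum(x) = 1}."""
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]                 # sort in descending order
    css = np.cumsum(u)
    j = np.arange(1, y.size + 1)
    # Largest index for which the shifted element is still positive.
    rho = j[u + (1.0 - css) / j > 0][-1]
    shift = (1.0 - css[rho - 1]) / rho
    return np.maximum(y + shift, 0.0)


p1 = project_simplex([2.0, 0.0])
p2 = project_simplex([1.0, 1.0])
p3 = project_simplex([0.2, 0.3, 0.5])    # already on the simplex
print(p1, p2, p3)
```

To project a matrix A, one can apply the function to the vectorized matrix, e.g. `project_simplex(A.ravel()).reshape(A.shape)`.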

The cost function that we try to minimize is $\|S_{2,1} - O A O^T\|_F^2$. We seek to calculate its gradient with respect to $A$. Let

$$f(A) = \frac{1}{2}\|S_{2,1} - O A O^T\|_F^2. \tag{3.23}$$

It is well-known that the following relations hold for the squared Frobenius norm of a matrix: $\|\Psi\|_F^2 = \mathrm{Tr}(\Psi^T \Psi) = \mathrm{Tr}(\Psi \Psi^T)$. Using this, we can expand Equation (3.23) as

$$
\begin{aligned}
f(A) &= \frac{1}{2}\|S_{2,1} - O A O^T\|_F^2 \\
&= \frac{1}{2}\mathrm{Tr}\big([S_{2,1} - O A O^T][S_{2,1}^T - O A^T O^T]\big).
\end{aligned} \tag{3.24}
$$


Let $J_{ij}$ be the matrix of all zeros except for the element $(i, j)$ which is a one, i.e. $[J_{ij}]_{ab} = \delta_{i,a}\delta_{j,b}$. Also let $\Delta A = \varepsilon J_{ij}$, with $1 \gg \varepsilon > 0$, and to ease the notation in the following equations, denote $S_{2,1}$ as simply $S$. Then

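$$
\begin{aligned}
f(A + \Delta A) &= \frac{1}{2}\mathrm{Tr}\big\{[(S - OAO^T) - O\Delta A O^T][(S^T - OA^TO^T) - O\Delta A^T O^T]\big\} \\
&= \underbrace{\frac{1}{2}\mathrm{Tr}\big\{[S - OAO^T][S^T - OA^TO^T]\big\}}_{=f(A)} \\
&\quad + \frac{1}{2}\mathrm{Tr}\big\{-[S - OAO^T]O\Delta A^T O^T - O\Delta A O^T[S^T - OA^TO^T] + O\Delta A O^T O\Delta A^T O^T\big\} \\
&\approx f(A) + \frac{1}{2}\mathrm{Tr}\big\{-[S - OAO^T]O\Delta A^T O^T - O\Delta A O^T[S^T - OA^TO^T]\big\} \\
&= f(A) - \mathrm{Tr}\big\{[S - OAO^T]O\Delta A^T O^T\big\} \\
&= f(A) - \mathrm{Tr}\big\{[S - OAO^T]O J_{ij}^T O^T\big\} \times \varepsilon,
\end{aligned} \tag{3.25}
$$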

where terms of higher order than linear were disregarded. Comparing this expression to

the first order Taylor expansion of f(A+ ∆A),

$$f(A + \varepsilon J_{ij}) \approx f(A) + \varepsilon \frac{\partial f(A)}{\partial A_{ij}}, \tag{3.26}$$

we conclude that

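$$
\begin{aligned}
\frac{\partial f(A)}{\partial A_{ij}} &= -\mathrm{Tr}\big\{[S - OAO^T]O J_{ij}^T O^T\big\} \\
&= -\mathrm{Tr}\big\{[S - OAO^T]O J_{ji} O^T\big\},
\end{aligned} \tag{3.27}
$$

and thus that

$$\big[\nabla_A \|S_{2,1} - OAO^T\|_F^2\big]_{ij} = -2\,\mathrm{Tr}\big\{[S_{2,1} - OAO^T]O J_{ji} O^T\big\}. \tag{3.28}$$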
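The element-wise trace expression in Equation (3.28) can be evaluated for all (i, j) at once. Using the cyclic property of the trace, Tr{[S − OAOᵀ] O J_ji Oᵀ} = [Oᵀ(S − OAOᵀ)O]_ij, so the whole gradient equals −2 Oᵀ(S − OAOᵀ)O. The sketch below compares the trace form, this compact form (our own rewriting, not stated in the text above) and a finite-difference approximation on random data of hypothetical size.

```python
import numpy as np

rng = np.random.default_rng(2)
Y, X = 4, 3                                   # hypothetical dimensions
O = rng.random((Y, X)); O /= O.sum(axis=0)    # random column-stochastic matrix
S = rng.random((Y, Y)); S /= S.sum()
A = rng.random((X, X)); A /= A.sum()


def cost(A):
    return np.linalg.norm(S - O @ A @ O.T, "fro") ** 2


def grad_trace(A):
    """Gradient evaluated element by element as in Equation (3.28)."""
    R = S - O @ A @ O.T
    G = np.zeros((X, X))
    for i in range(X):
        for j in range(X):
            J_ji = np.zeros((X, X))
            J_ji[j, i] = 1.0
            G[i, j] = -2.0 * np.trace(R @ O @ J_ji @ O.T)
    return G


def grad_compact(A):
    """Same gradient written as a single matrix product."""
    return -2.0 * O.T @ (S - O @ A @ O.T) @ O


# Central finite differences as an independent check.
eps = 1e-6
G_fd = np.zeros((X, X))
for i in range(X):
    for j in range(X):
        E = np.zeros((X, X))
        E[i, j] = eps
        G_fd[i, j] = (cost(A + E) - cost(A - E)) / (2 * eps)

print(np.abs(grad_trace(A) - grad_compact(A)).max())
print(np.abs(G_fd - grad_compact(A)).max())
```

The compact form avoids the double loop and the X² trace evaluations, which matters in an online setting.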

To generate recursive estimates of the transition matrix, we first update our estimate of

S2,1 as explained in the previous section and then perform a projected gradient descent

on the cost function ‖S2,1 − OAOT ‖2F with respect to A. Since we have an explicit

expression for the gradient, we present the full algorithm in Figure 3.2.

Remark 3.6. The initial guess of A is not very important since the algorithm is globally

convergent.


Algorithm: Online SNNMF using Projected Gradient Descent Method (PGDM)

Input:
$X$ - number of hidden states,
$O$ - observation matrix,
$y_{\text{prev}}$ - previous observation,
$y_{\text{now}}$ - current observation,
$k$ - time since start of estimation procedure,
$S_{2,1}$ - current estimate of $S_{2,1}$,
$A$ - current estimate of the variable $A$.

Output: Recursive estimates of HMM parameters; $T$ and $\pi$.

0. Initialize the recursion with $k = 1$, $S_{2,1} \in \mathbb{R}^{Y \times Y}$ as a matrix of zeros and $A \in \mathbb{R}^{X \times X}$ as $I \times \frac{1}{X}$. Choose a step-size $\eta_k$.

1. Update the estimate of $S_{2,1}$,

$$[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_{\text{now}}}\,\delta_{j,y_{\text{prev}}}}{k}, \tag{3.29}$$

where $\delta_{i,j}$ is the Kronecker delta.

2. Let $[J_{ij}]_{ab} = \delta_{i,a}\delta_{j,b}$ and calculate the gradient of the cost function

$$[G]_{ij} \leftarrow \big[\nabla_A \|S_{2,1} - OAO^T\|_F^2\big]_{ij} = -2\,\mathrm{Tr}\big\{[S_{2,1} - OAO^T]O J_{ji} O^T\big\}. \tag{3.28}$$

3. Perform the projected gradient descent

$$A \leftarrow \mathrm{Proj}_{\Delta}\big(A - \eta_k \times G\big), \tag{3.30}$$

where $\Delta$ is the probability simplex.

4. Calculate and return:

$$\pi \leftarrow A^T \mathbf{1}, \tag{2.32}$$

$$T \leftarrow A\,\mathrm{diag}(\pi)^{-1}. \tag{2.33}$$

5. Take

$$y_{\text{prev}} \leftarrow y_{\text{now}}, \tag{3.31}$$

$$k \leftarrow k + 1 \tag{3.32}$$

and save a new measurement in $y_{\text{now}}$, then go to Step 1.

Figure 3.2: The online structured non-negative matrix factorization algorithm employing the projected gradient descent method.
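The steps of the projected gradient descent algorithm can be sketched as below. This is an illustrative rendering under several assumptions, not the implementation used for the simulations: the HMM (`T_true`, `O_true`), the constant step-size, the sample size and all function names are hypothetical; the gradient is evaluated in the compact form −2 Oᵀ(S − OAOᵀ)O, which equals Equation (3.28) element by element; and only the final estimates are returned instead of one estimate per time step.

```python
import numpy as np


def project_simplex(y):
    """Sort-based Euclidean projection onto the probability simplex."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, y.size + 1)
    rho = j[u + (1.0 - css) / j > 0][-1]
    return np.maximum(y + (1.0 - css[rho - 1]) / rho, 0.0)


def simulate_hmm(T, O, n, rng):
    """Draw n observations; T and O are column-stochastic, symbols are 0-based."""
    x = rng.integers(T.shape[0])
    ys = np.empty(n, dtype=int)
    for t in range(n):
        ys[t] = rng.choice(O.shape[0], p=O[:, x])
        x = rng.choice(T.shape[0], p=T[:, x])
    return ys


def online_pgdm(ys, O, step=0.5):
    n_obs, n_states = O.shape
    S = np.zeros((n_obs, n_obs))
    A = np.eye(n_states) / n_states                 # Step 0
    for k in range(1, len(ys)):
        S *= (k - 1)                                # Step 1: recursive update of S_{2,1}
        S[ys[k], ys[k - 1]] += 1.0
        S /= k
        G = -2.0 * O.T @ (S - O @ A @ O.T) @ O      # Step 2: gradient
        A = project_simplex((A - step * G).ravel()).reshape(A.shape)  # Step 3
    pi = A.T @ np.ones(n_states)                    # Step 4: pi = A^T 1
    T = A / np.maximum(pi, 1e-12)                   # T = A diag(pi)^{-1}, guarded against 0
    return T, pi, A


T_true = np.array([[0.9, 0.3], [0.1, 0.7]])
O_true = np.array([[0.8, 0.1], [0.1, 0.1], [0.1, 0.8]])
rng = np.random.default_rng(0)
ys = simulate_hmm(T_true, O_true, 5000, rng)
T_hat, pi_hat, A_hat = online_pgdm(ys, O_true)
print(T_hat)
print(pi_hat)
```

Because of the projection, the estimate of A is non-negative and sums to one at every iteration, so π and T are valid stochastic quantities whenever no column of A is identically zero.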


3.3.1.3 The Primal-Dual Method without Inequality Constraints

We will now solve the optimization using the primal-dual method instead. The La-

grangian L from Equation (3.10) translates to the following expression for the problem

at hand:

$$L(A, \lambda) = \|S_{2,1} - OAO^T\|_F^2 + \lambda(\mathbf{1}^T A \mathbf{1} - 1), \tag{3.33}$$

where λ ∈ R. This follows from Equation (3.19) by a simple reformulation of the equality

constraint.

Note that we do not in this formulation restrict the elements of A to lie in [0, 1], only

that they sum to one. The inequality constraint can be incorporated in the primal-dual

method as well by restraining a new set of Lagrangian multipliers (i.e. extending λ)

to never change sign. This can be achieved by, for example, using a parametrization of the extra Lagrangian multipliers that employs exponential functions. We will not pursue

that path in this thesis and will leave it as future work. However, using only the equality

constraint appears to give good empirical results.

The update rules for the primal variable (A) and the dual (λ) are

$$A_{k+1} = A_k - \eta_k \nabla_A L(A_k, \lambda_k), \tag{3.34}$$

$$\lambda_{k+1} = \lambda_k + \eta_k \nabla_\lambda L(A_k, \lambda_k). \tag{3.35}$$

Thus we will need the gradients of the Lagrangian with respect to A and λ. As in the

previous section, let [Jij ]ab = δi,aδj,b, i.e. the matrix of all zeros except for the element

at row i and column j which is a one. Reusing the result of Equation (3.28) and with f

defined as in Equation (3.23), the first gradient can be calculated as

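$$
\begin{aligned}
\big[\nabla_A L(A, \lambda)\big]_{ij} &= 2\frac{\partial f(A)}{\partial A_{ij}} + \lambda\big[\nabla_A \mathbf{1}^T A \mathbf{1}\big]_{ij} \\
&= -2\,\mathrm{Tr}\big\{[S_{2,1} - OAO^T]O J_{ji} O^T\big\} + \lambda,
\end{aligned} \tag{3.36}
$$

since

$$
\nabla_A(\mathbf{1}^T A \mathbf{1}) = \nabla_A \sum_{i,j \in X} [A]_{ij} =
\begin{bmatrix}
1 & \cdots & 1 \\
\vdots & \ddots & \vdots \\
1 & \cdots & 1
\end{bmatrix}. \tag{3.37}
$$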


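Furthermore, the gradient with respect to $\lambda$ is

$$
\begin{aligned}
\nabla_\lambda L(A, \lambda) &= \frac{\partial}{\partial \lambda}\Big(\|S_{2,1} - OAO^T\|_F^2 + \lambda(\mathbf{1}^T A \mathbf{1} - 1)\Big) \\
&= \mathbf{1}^T A \mathbf{1} - 1.
\end{aligned} \tag{3.38}
$$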

With expressions for the gradients, we present the full algorithm in Figure 3.3.
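To illustrate the primal and dual updates in isolation, the sketch below iterates Equations (3.34) and (3.35) on a fixed, noise-free moment matrix. This is a simplification made here for clarity (in the online algorithm S2,1 is updated between iterations), and the model, the constant step-size and the iteration count are assumptions.

```python
import numpy as np

# Hypothetical model used to build an exact S_{2,1} = O A O^T.
O = np.array([[0.8, 0.1], [0.1, 0.1], [0.1, 0.8]])
T_true = np.array([[0.9, 0.3], [0.1, 0.7]])
pi_true = np.array([0.75, 0.25])          # stationary distribution of T_true
A_true = T_true @ np.diag(pi_true)
S = O @ A_true @ O.T

A = np.eye(2) / 2     # primal variable, initialized as I / X
lam = 0.0             # dual variable
eta = 0.1             # constant step-size (assumption)

for _ in range(2000):
    G_A = -2.0 * O.T @ (S - O @ A @ O.T) @ O + lam   # Equation (3.36), compact form
    G_lam = A.sum() - 1.0                             # Equation (3.38)
    # Both updates use the old iterates, as in Equations (3.34)-(3.35).
    A, lam = A - eta * G_A, lam + eta * G_lam

print(A)
print(lam)
```

Since only the equality constraint is included, nothing prevents intermediate iterates (or, with noisy data, the final estimate) from containing negative elements.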

3.3.2 Spherical Coordinates

The final method that we will derive is fundamentally different from the two previous

ones. They have been constrained optimization methods, due to Equation (3.19) be-

ing a constrained optimization problem. We will now reformulate the problem as an

unconstrained optimization problem, but still enforce the constraints.

We will adapt the method from Krishnamurthy and Abad [15] which cleverly exploits

the Pythagorean trigonometric identity, i.e., for any x ∈ R:

$$\sin^2 x + \cos^2 x \equiv 1. \tag{3.44}$$

The idea is to generalize this one-dimensional identity to a multi-dimensional identity,

and then parametrize the elements of A, which are constrained to lie on a simplex,

using the terms in the identity. Since products involving the squares of sin and cos are

guaranteed to lie in [0, 1], we end up with elements guaranteed to lie in the same interval.

Furthermore, since the identity holds, the sum of all elements will be one.

We will make use of the following set of functions (for X > 1):

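$$
\Omega_n(\alpha) =
\begin{cases}
\cos[\alpha]_1 & \text{if } n = 1, \\
\cos[\alpha]_n \times \prod_{k=1}^{n-1} \sin[\alpha]_k & \text{if } 2 \le n \le X^2 - 1, \\
\sin[\alpha]_{X^2-1} \times \prod_{k=1}^{X^2-2} \sin[\alpha]_k & \text{if } n = X^2,
\end{cases} \tag{3.45}
$$

where $\alpha \in \mathbb{R}^{X^2-1}$. Each $\Omega_n$ can be interpreted as the square root of one term in a generalized Pythagorean trigonometric identity.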


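Algorithm: Online SNNMF using the Primal-Dual Method (PDM)

Input:
$X$ - number of hidden states,
$O$ - observation matrix,
$y_{\text{prev}}$ - previous observation,
$y_{\text{now}}$ - current observation,
$k$ - time since start of estimation procedure,
$S_{2,1}$ - current estimate of $S_{2,1}$,
$A$ - current estimate of the variable $A$,
$\lambda$ - current value of the dual variable.

Output: Recursive estimates of HMM parameters; $T$ and $\pi$.

0. Initialize the recursion with $k = 1$, $S_{2,1} \in \mathbb{R}^{Y \times Y}$ as a matrix of zeros, $A \in \mathbb{R}^{X \times X}$ as $I \times \frac{1}{X}$ and $\lambda = 0$. Choose a step-size $\eta_k$.

1. Update the estimate of $S_{2,1}$,

$$[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_{\text{now}}}\,\delta_{j,y_{\text{prev}}}}{k}, \tag{3.39}$$

where $\delta_{i,j}$ is the Kronecker delta.

2. Let $[J_{ij}]_{ab} = \delta_{i,a}\delta_{j,b}$ and calculate the gradient of the Lagrangian with respect to the primal and dual variable:

$$[G_A]_{ij} \leftarrow \big[\nabla_A L(A, \lambda)\big]_{ij} = -2\,\mathrm{Tr}\big\{[S_{2,1} - OAO^T]O J_{ji} O^T\big\} + \lambda \tag{3.36}$$

and

$$G_\lambda \leftarrow \nabla_\lambda L(A, \lambda) = \mathbf{1}^T A \mathbf{1} - 1. \tag{3.38}$$

3. Update the primal variable

$$A \leftarrow A - \eta_k \times G_A. \tag{3.40}$$

4. Update the dual variable

$$\lambda \leftarrow \lambda + \eta_k \times G_\lambda. \tag{3.41}$$

5. Calculate and return:

$$\pi \leftarrow A^T \mathbf{1}, \tag{2.32}$$

$$T \leftarrow A\,\mathrm{diag}(\pi)^{-1}. \tag{2.33}$$

6. Take

$$y_{\text{prev}} \leftarrow y_{\text{now}}, \tag{3.42}$$

$$k \leftarrow k + 1 \tag{3.43}$$

and save a new measurement in $y_{\text{now}}$, then go to Step 1.

Figure 3.3: The online structured non-negative matrix factorization algorithm employing the primal-dual method.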


Remark 3.7. To get some intuition on why this is true, consider how the Ωn-functions

can be derived:

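$$
\begin{aligned}
1 &\equiv \cos^2 x + \sin^2 x \\
&= \cos^2 x + \sin^2 x \times \underbrace{(\cos^2 y + \sin^2 y)}_{\equiv 1} \\
&= \cos^2 x + \sin^2 x \times (\cos^2 y + \sin^2 y \times [\cos^2 z + \sin^2 z]) \\
&\;\;\vdots
\end{aligned} \tag{3.46}
$$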

We parametrize the $A$ matrix using the parameter vector $\alpha$ and the $\Omega_n$-functions as

$$[A]_{ij} = \Omega^2_{(i-1)X+j}(\alpha). \tag{3.47}$$

It is straightforward to check that, using this parametrization, the elements of $A$ sum to one and all lie in the interval $[0, 1]$; see Krishnamurthy and Abad [15] for details.

This implies that both the constraint 1TA1 = 1 and the constraint A ≥ 0 have been

dealt with. We can thus instead try to solve the unconstrained minimization problem

in α:

$$\min_{\alpha \in \mathbb{R}^{X^2-1}} \|S_{2,1} - O A(\alpha) O^T\|_F^2. \tag{3.48}$$

It should be noted that since we are considering sums of squares of the trigonometric functions, it is sufficient to only consider $\alpha \in [0, \pi/2]^{X^2-1}$.
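The parametrization in Equations (3.45) and (3.47) can be sketched as follows. The function names are hypothetical, indices are 0-based in the code, and the row-major reshape corresponds to the index (i − 1)X + j in Equation (3.47).

```python
import numpy as np


def omega(alpha):
    """Return (Omega_1, ..., Omega_N) of Equation (3.45), where N = len(alpha) + 1."""
    alpha = np.asarray(alpha, dtype=float)
    # prefix[n] = prod_{k=1}^{n} sin(alpha_k), with prefix[0] = 1 (empty product).
    prefix = np.concatenate(([1.0], np.cumprod(np.sin(alpha))))
    return np.append(np.cos(alpha) * prefix[:-1], prefix[-1])


def A_from_alpha(alpha, X):
    """Equation (3.47): [A]_ij = Omega_{(i-1)X+j}(alpha)^2."""
    return (omega(alpha) ** 2).reshape(X, X)


rng = np.random.default_rng(4)
X = 3
alpha = rng.uniform(-10.0, 10.0, size=X**2 - 1)   # unconstrained parameters
A = A_from_alpha(alpha, X)
print(A)
print(A.sum())
```

Any real-valued α yields a non-negative matrix whose elements sum to one, which is what allows the constrained problem to be replaced by an unconstrained one.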

It is not entirely clear if this parametrization preserves the convexity of the original problem. We note that neither $\sin^2 x$ nor $\cos^2 x$ is convex on the whole interval $[0, \pi/2]$ ($\sin^2 x$ is convex on $[0, \pi/4]$ and concave on $[\pi/4, \pi/2]$, and conversely for $\cos^2 x$), and leave it as future work to provide a (dis)proof of convexity.

We will apply the regular gradient descent method from Section 3.2.2 to update the pa-

rameter vector α since we are now dealing with an unconstrained optimization problem.

With f as in Equation (3.23), the descent update is

$$\alpha_{k+1} = \alpha_k - 2\eta_k \nabla_\alpha f(A(\alpha_k)). \tag{3.49}$$

The gradient is somewhat involved to evaluate. A partial result is given in Wang et al.

[32]. We will however perform a slightly different derivation here. Using the chain rule,


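we get

$$
\begin{aligned}
\big[\nabla_\alpha f\big]_l &= \frac{\partial f}{\partial [\alpha]_l} \\
&= \frac{\partial f}{\partial [A]_{1,1}}\frac{\partial [A]_{1,1}}{\partial [\alpha]_l} + \cdots + \frac{\partial f}{\partial [A]_{X,X}}\frac{\partial [A]_{X,X}}{\partial [\alpha]_l} \\
&= \mathbf{1}^T \left(
\begin{bmatrix}
\frac{\partial f}{\partial [A]_{1,1}} & \cdots & \frac{\partial f}{\partial [A]_{1,X}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial [A]_{X,1}} & \cdots & \frac{\partial f}{\partial [A]_{X,X}}
\end{bmatrix}
\circ
\frac{\partial}{\partial [\alpha]_l}
\begin{bmatrix}
[A]_{1,1} & \cdots & [A]_{1,X} \\
\vdots & \ddots & \vdots \\
[A]_{X,1} & \cdots & [A]_{X,X}
\end{bmatrix}
\right) \mathbf{1} \\
&= \mathbf{1}^T \left(\frac{\partial f}{\partial A} \circ \frac{\partial A}{\partial [\alpha]_l}\right) \mathbf{1},
\end{aligned} \tag{3.50}
$$

where $\circ$ denotes the element-wise (Hadamard) product, so that the last expression is the sum over all $(i, j)$ of the products in the second line.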

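In this expression, $\frac{\partial f}{\partial A}$ is known from Section 3.3.1.2, i.e. Equation (3.27), and $\frac{\partial A}{\partial [\alpha]_l}$ can be calculated as

$$
\begin{aligned}
\left[\frac{\partial A}{\partial [\alpha]_l}\right]_{ij} &= \frac{\partial [A]_{ij}}{\partial [\alpha]_l} \\
&= \frac{\partial}{\partial [\alpha]_l}\,\Omega^2_{(i-1)X+j} \\
&= 2 \times \Omega_{(i-1)X+j} \times \frac{\partial}{\partial [\alpha]_l}\,\Omega_{(i-1)X+j},
\end{aligned} \tag{3.51}
$$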

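with the last derivative evaluating to

$$
\frac{\partial \Omega_n}{\partial [\alpha]_l} =
\begin{cases}
-\sin[\alpha]_1 & \text{if } n = 1 \text{ and } l = 1, \\
-\sin[\alpha]_n \times \prod_{k=1}^{n-1} \sin[\alpha]_k & \text{if } 2 \le n \le X^2 - 1 \text{ and } l = n, \\
\frac{\cos[\alpha]_n}{\tan[\alpha]_l} \times \prod_{k=1}^{n-1} \sin[\alpha]_k & \text{if } 2 \le n \le X^2 - 1 \text{ and } l < n, \\
\frac{1}{\tan[\alpha]_l} \times \prod_{k=1}^{X^2-1} \sin[\alpha]_k & \text{if } n = X^2, \\
0 & \text{if } l > n,
\end{cases} \tag{3.52}
$$

where the last case reflects that $\Omega_n$ does not depend on $[\alpha]_l$ for $l > n$.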

Thus, taking Equation (3.50), Equation (3.51) and Equation (3.52) together, everything in the expression for $\nabla_\alpha f$ is known and the gradient descent according to Equation (3.49) can be performed.
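The derivative of the Ω-functions is easy to get wrong, so a numerical check is useful. The sketch below implements the partial derivatives ∂Ω_n/∂[α]_l under the reading that the derivative is zero whenever l > n (since Ω_n does not depend on those angles) and compares them with central finite differences. The function names and the test point are assumptions.

```python
import numpy as np


def omega(alpha):
    """Omega_1, ..., Omega_N of Equation (3.45), where N = len(alpha) + 1."""
    prefix = np.concatenate(([1.0], np.cumprod(np.sin(alpha))))
    return np.append(np.cos(alpha) * prefix[:-1], prefix[-1])


def omega_jacobian(alpha):
    """jac[n-1, l-1] = d Omega_n / d alpha_l (1-based n and l as in the text)."""
    n_funcs = alpha.size + 1
    om = omega(alpha)
    jac = np.zeros((n_funcs, alpha.size))      # entries with l > n stay zero
    for n in range(1, n_funcs + 1):
        for l in range(1, alpha.size + 1):
            if l == n:
                # -sin(alpha_n) * prod_{k<n} sin(alpha_k); the empty product is 1.
                jac[n - 1, l - 1] = -np.sin(alpha[n - 1]) * np.prod(np.sin(alpha[:n - 1]))
            elif l < n:
                # Replacing sin(alpha_l) by cos(alpha_l) equals dividing by tan(alpha_l).
                jac[n - 1, l - 1] = om[n - 1] / np.tan(alpha[l - 1])
    return jac


rng = np.random.default_rng(3)
alpha = rng.uniform(0.2, 1.3, size=3)          # X = 2, so X^2 - 1 = 3 angles
eps = 1e-6
jac_fd = np.zeros((4, 3))
for l in range(3):
    d = np.zeros(3)
    d[l] = eps
    jac_fd[:, l] = (omega(alpha + d) - omega(alpha - d)) / (2 * eps)

print(np.abs(omega_jacobian(alpha) - jac_fd).max())
```

The division by tan[α]_l is undefined when an angle is a multiple of π, so an implementation would need to treat such points separately, for example by evaluating the products with cos[α]_l substituted directly.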

Note that the expressions above are mostly of theoretical value and for studying the

convergence of the algorithm in an implementation. In a real-time setting, one would

preferably derive expressions that can be calculated with the possibility of reusing results.

The number of expressions with trigonometric functions to evaluate in the above formulas grows quickly as the number of states of the Markov chain grows.

We present the full algorithm in Figure 3.4.


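Algorithm: Online SNNMF using Spherical Coordinates Method (SCM)

Input:
$X$ - number of hidden states,
$O$ - observation matrix,
$y_{\text{prev}}$ - previous observation,
$y_{\text{now}}$ - current observation,
$k$ - time since start of estimation procedure,
$S_{2,1}$ - current estimate of $S_{2,1}$,
$\alpha$ - current parametrization of $A$.

Output: Recursive estimates of HMM parameters; $T$ and $\pi$.

0. Initialize the recursion with $k = 1$, $S_{2,1} \in \mathbb{R}^{Y \times Y}$ as a matrix of zeros and $\alpha \in \mathbb{R}^{X^2-1}$ as a random vector. Choose a step-size $\eta_k$.

1. Update the estimate of $S_{2,1}$,

$$[S_{2,1}]_{ij} \leftarrow \frac{[S_{2,1}]_{ij} \times (k - 1) + \delta_{i,y_{\text{now}}}\,\delta_{j,y_{\text{prev}}}}{k}, \tag{3.53}$$

where $\delta_{i,j}$ is the Kronecker delta.

2. Calculate the gradient of the cost function with respect to $\alpha$ according to Equation (3.50), Equation (3.51) and Equation (3.52).

3. Perform the gradient descent

$$\alpha \leftarrow \alpha - 2\eta_k \nabla_\alpha f(A(\alpha)). \tag{3.54}$$

4. Reconstruct $A$ as

$$[A]_{ij} = \Omega^2_{(i-1)X+j}, \tag{3.55}$$

using Equation (3.45).

5. Calculate and return:

$$\pi \leftarrow A^T \mathbf{1}, \tag{2.32}$$

$$T \leftarrow A\,\mathrm{diag}(\pi)^{-1}. \tag{2.33}$$

6. Take

$$y_{\text{prev}} \leftarrow y_{\text{now}}, \tag{3.56}$$

$$k \leftarrow k + 1 \tag{3.57}$$

and save a new measurement in $y_{\text{now}}$, then go to Step 1.

Figure 3.4: The online structured non-negative matrix factorization algorithm employing a spherical coordinates parametrization.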

Chapter 4

Notes on Implementation

4.1 Measuring the Accuracy of Estimates

To be able to compare the different methods introduced in the previous chapters, a

measure of correctness has to be defined. That is, we want a function $d(x, \hat{x})$ that, for some numerical quantity $x$ and an estimate $\hat{x}$, gives an indication of how close $\hat{x}$ is to $x$. There are several commonly used ones in the literature, of which a subset will be

introduced and briefly discussed below. Some are not suitable for our purpose since, for

example, they break down for estimates with negative elements.

4.1.1 Kullback-Leibler Divergence

A commonly used measure that is employed by, for example, Vanluyten et al. [28] and Lee and Seung [17], is a modified Kullback-Leibler divergence. For two non-negative matrices $A$ and $B$ of equal size, it is defined as

$$d_{KL}(A, B) = \sum_{ij}\left([A]_{ij} \log\frac{[A]_{ij}}{[B]_{ij}} - [A]_{ij} + [B]_{ij}\right), \tag{4.1}$$

where the sum is over all the elements of A and B.

Burnham and Anderson [5] give an interpretation of it as the information lost when an

approximate model, in this case B, is used instead of the true model, A. The expression

above reduces to the Kullback-Leibler divergence if $\sum_{ij} [A]_{ij} = \sum_{ij} [B]_{ij} = 1$.

The problem with this measure is the logarithm term: it will only give a meaningful

answer if all elements in the matrices are non-negative. This complicates matters since

two of the methods treated in this text (the spectral learning algorithm from Section 2.1

and the primal-dual method from Section 3.2.3) give no guarantee that the elements of

the estimates will be non-negative. This is of course a criticism of the algorithms, but

it is still interesting to be able to compare their performances on different systems and

against other methods.

A possible solution could be to employ heuristics, such as clipping negative elements to

zero. This would yield sensible estimates, but the measure would still collapse due to

the division (by zero). This problem is touched upon by Hsu et al. [12, p. 10 and p. 12],

who add another condition on the considered system matrices, namely that all elements should be larger than or equal to some α > 0. One could follow this path and clip elements to

α instead of zero, but it is still not straight-forward how to compare different methods

to each other, since there is no obvious choice of α and also because the clipping of

elements is not well motivated.
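The behaviour discussed in this section can be reproduced with a few lines. This is a minimal sketch of Equation (4.1) for strictly positive matrices, followed by the case in which an element of the estimate has been clipped to zero; the matrices are hypothetical.

```python
import numpy as np


def d_kl(A, B):
    """Modified Kullback-Leibler divergence of Equation (4.1)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return float(np.sum(A * np.log(A / B) - A + B))


A_true = np.array([[0.5, 0.5]])
B_good = np.array([[0.25, 0.75]])
B_clipped = np.array([[0.0, 1.0]])     # e.g. a negative estimate clipped to zero

same = d_kl(A_true, A_true)
good = d_kl(A_true, B_good)
with np.errstate(divide="ignore"):     # silence the division-by-zero warning
    clipped = d_kl(A_true, B_clipped)

print(same, good, clipped)
```

The clipped estimate yields an infinite value, which is the collapse of the measure referred to above.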

4.1.2 Eigenvalues and Eigenvectors

It is common to classify the behaviour of linear systems using the eigenvalues of the

system matrices. It is true that the eigenvalues and eigenvectors play an important

role in describing the properties of a Markov chain. For example, the second largest

eigenvalue of the transition matrix is closely related to how fast the Markov chain reaches

its stationary distribution, see Levin et al. [18] for details. A possibility for measuring

the correctness of the estimates could be to measure the proximity of the eigenvalues

of the estimated matrices to those of the true matrices. However, some problems are

apparent with this approach:

• Since we are performing identification of HMMs, we have to not only consider

the transition matrix, which is square, but also the observation matrix, which is

not necessarily square. Whenever the number of possible observations is larger

than the number of hidden states, i.e., when Y > X, the eigenvalues of O are

not well-defined. A possible alternative could be to consider the singular values

instead.

• How should the deviation from the true eigenvalues and eigenvectors be measured

and summarized?

Another concern is that renaming the states of the underlying Markov chain in an HMM

will result in an equivalent system (since they are hidden), but in a different description

in terms of the matrices used to describe it. This could shift the eigenvalues, even if the

estimates are correct.


[Figure: two graph diagrams, (a) System ΣA and (b) System ΣB, each with two hidden states and two outputs, with edges labeled by the probabilities p, 1−p, q, 1−q, r, 1−r, s and 1−s.]

Figure 4.1: Graph depiction of the HMMs ΣA and ΣB, which are related by renaming the hidden states. They cannot be distinguished from just sequences of observations.

To make this concrete, consider the following system,

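$$
\Sigma_A = \left\{
T_A = \begin{bmatrix} p & 1-q \\ 1-p & q \end{bmatrix},\quad
O_A = \begin{bmatrix} r & s \\ 1-r & 1-s \end{bmatrix}
\right\}, \tag{4.2}
$$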

with p, q, r, s ∈ [0, 1].

Swapping the labels of the two hidden states translates to permuting the corresponding

rows and columns in the transition matrix, and permuting the corresponding columns

in the observation matrix. The resulting system description is

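$$
\Sigma_B = \left\{
T_B = \begin{bmatrix} q & 1-p \\ 1-q & p \end{bmatrix},\quad
O_B = \begin{bmatrix} s & r \\ 1-s & 1-r \end{bmatrix}
\right\}. \tag{4.3}
$$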

One can easily convince oneself that the systems ΣA and ΣB are indistinguishable when

sequences of outputs are the only thing that can be observed, by for example drawing a

graph representation of the system, as seen in Figure 4.1.

However, for ΣA, the eigenvalues can be calculated to be {1, p + q − 1} for the transition matrix and {1, r − s} for the observation matrix. For ΣB, the eigenvalues remain {1, p + q − 1} for the transition matrix, but move to {1, s − r} for the observation matrix. Thus, the eigenvalues are different even though the systems would yield indistinguishable outputs.

A comparison of the eigenvalues (and possibly the eigenvectors) would first require a renaming of the states in the estimated matrices so as to correspond to the naming in the true matrices. After that, the two points raised above would have to be considered.
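The eigenvalue example can be reproduced numerically. The values of p, q, r and s below are hypothetical; the relabeled system is obtained by permuting rows and columns of the transition matrix and the columns of the observation matrix.

```python
import numpy as np

p, q, r, s = 0.9, 0.7, 0.8, 0.3          # hypothetical parameter values
T_A = np.array([[p, 1 - q], [1 - p, q]])
O_A = np.array([[r, s], [1 - r, 1 - s]])

P = np.array([[0.0, 1.0], [1.0, 0.0]])    # swaps the two hidden states
T_B = P @ T_A @ P                         # permute rows and columns
O_B = O_A @ P                             # permute columns


def sorted_eigs(M):
    return np.sort(np.linalg.eigvals(M).real)


eig_TA, eig_TB = sorted_eigs(T_A), sorted_eigs(T_B)
eig_OA, eig_OB = sorted_eigs(O_A), sorted_eigs(O_B)
print(eig_TA, eig_TB)   # {1, p + q - 1} in both cases
print(eig_OA, eig_OB)   # {1, r - s} versus {1, s - r}
```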


4.1.3 Probability of Output Sequences

The above discussion about the problems that permuting the hidden states of an HMM

can cause when measuring correctness is valid for the Kullback-Leibler divergence as

well. Since matrix elements are compared to each other in two different realizations (the

estimated and the true), the states have to line up for the comparison to make sense.

One invariant, i.e., a quantity that is independent of how the hidden states are permuted,

is the probability of seeing a certain sequence of outputs. This measure, as defined by

Zhao and Poupart [33], is

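$$d_{seq}(T, O, \hat{T}, \hat{O}) = \sum_{(y_1, \ldots, y_k) \in \mathcal{T}} \left|\Pr[y_1, \ldots, y_k] - \widehat{\Pr}[y_1, \ldots, y_k]\right|^{\frac{1}{k}}, \tag{4.4}$$

where $\mathcal{T}$ is a set of test sequences. Hsu et al. [12] derive theoretical bounds on the convergence of the spectral learning algorithm for a similar measure, namely

$$\sum_{(y_1, \ldots, y_k) \in \mathcal{T}} \left|\Pr[y_1, \ldots, y_k] - \widehat{\Pr}[y_1, \ldots, y_k]\right|, \tag{4.5}$$

where $\mathcal{T} = \{(y_1, \ldots, y_k) : y_i \in \mathcal{Y} \text{ for } i = 1, \ldots, k\}$ is the set of all possible output sequences of length $k$.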
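The invariance of output-sequence probabilities under relabeling of the hidden states can be checked with a short sketch. It assumes column-stochastic matrices, 0-based observation symbols and that the chain starts in the initial distribution `pi`; the function names and parameter values are hypothetical.

```python
import itertools

import numpy as np


def seq_prob(ys, T, O, pi):
    """Pr[y_1, ..., y_k] computed with the forward recursion."""
    a = O[ys[0], :] * pi
    for y in ys[1:]:
        a = O[y, :] * (T @ a)
    return a.sum()


def d_all_sequences(model_a, model_b, k):
    """Sum of absolute probability differences over all sequences of length k, cf. Eq. (4.5)."""
    n_obs = model_a[1].shape[0]
    return sum(abs(seq_prob(ys, *model_a) - seq_prob(ys, *model_b))
               for ys in itertools.product(range(n_obs), repeat=k))


p, q, r, s = 0.9, 0.7, 0.8, 0.3
T_A = np.array([[p, 1 - q], [1 - p, q]])
O_A = np.array([[r, s], [1 - r, 1 - s]])
pi_A = np.array([0.75, 0.25])             # stationary distribution of T_A
P = np.array([[0.0, 1.0], [1.0, 0.0]])    # swaps the two hidden states

sigma_A = (T_A, O_A, pi_A)
sigma_B = (P @ T_A @ P, O_A @ P, P @ pi_A)

dist = d_all_sequences(sigma_A, sigma_B, k=4)
total = sum(seq_prob(ys, *sigma_A) for ys in itertools.product(range(2), repeat=4))
print(dist, total)
```

The number of sequences grows as Yᵏ, so in practice the sum is restricted to a set of test sequences as in Equation (4.4).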

The calculation of the above joint probabilities involves the multiplication of the transi-

tion and emission matrix. This can lead to the measure indicating a well-fitted model,

even though the individual matrices can have negative elements or columns not summing to one. It does not give a clear indication of how close the estimated matrices are to the true ones.

This could be sufficient depending on the application. For example, if we are only inter-

ested in the conditional probability of the next observation, given all the observations

up to this time,

Pr[yk|y1, . . . , yk−1], (4.6)

then it would be sufficient to know only the product of the transition and observation

matrix, as shown in Hsu et al. [12]. However, if the system identification is performed

with the purpose of gaining some insight on how the system at hand is constructed, then

knowledge of both the transition matrix and the observation matrix is wanted, and we

would prefer to have a measure of how well these two matrices are estimated.

4.1.4 Matrix Norms

Mattfeld [20] gave an overview of how the methods in Hsu et al. [12] can be implemented


and suggested using the Frobenius norm of the difference between the estimated matrices

and the true matrices as a measure of correctness.

The Frobenius norm of a matrix is defined as

$$\|A\|_F = \sqrt{\sum_{ij} |[A]_{ij}|^2}, \tag{4.7}$$

where the sum is over all elements of $A$. This is equivalent to

$$\|A\|_F = \sqrt{\mathrm{Tr}(A^* A)}, \tag{4.8}$$

where A∗ denotes the conjugate transpose of the matrix A.

From Equation (4.7), it is clear that the measure

$$d_F(A, B) = \|A - B\|_F^2 \tag{4.9}$$

is the sum of the squared differences between corresponding elements of $A$ and $B$. This is an

intuitive measure of how close two matrices are and is widely used in numerical linear

algebra. In our setting, one has to be careful about two things though:

• As briefly discussed in Section 4.3.3, the ordering of the hidden states of an HMM is

not unique as long as the observation matrix is permuted correspondingly. There

is no guarantee that any of the algorithms from Chapter 2 will give estimates

of the transition matrix and observation matrix with the same ordering of the

states as the true system. This is not a problem for a real system identification

problem, since the true model is unknown, but it is crucial when benchmarking

the algorithm.

Plotting $d_F(T, \hat{T})$ or $d_F(O, \hat{O})$ does not make sense unless $\hat{T}$ and $\hat{O}$ are permuted so that the hidden states correspond to those of the true system $T$ and $O$. Mattfeld [20] failed to realize this.

• Another, less critical note is that since the Frobenius norm is a sum over all

the elements of a matrix, it naturally grows with the dimension of the matrix.

Hence, a fair comparison between a three hidden states model (transition matrix

of dimension three, with nine elements) and a four hidden states model (transition

matrix of dimension four, with sixteen elements) cannot be made unless some

normalization is performed first.

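We suggest dividing the squared Frobenius norm by the number of elements in the matrix, effectively giving the mean squared error of all the elements, as the measure. For matrices $A, \hat{A} \in \mathbb{R}^{m \times n}$ this would be

$$d_F(A, \hat{A}) = \frac{\|A - \hat{A}\|_F^2}{m \times n} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \big|[A]_{ij} - [\hat{A}]_{ij}\big|^2}{m \times n}. \tag{4.10}$$

And explicitly for the matrices (and vector) that we are concerned with in this text:

$$d_F(T, \hat{T}) = \frac{\|T - \hat{T}\|_F^2}{X^2} = \frac{\sum_{i=1}^{X}\sum_{j=1}^{X} \big|[T]_{ij} - [\hat{T}]_{ij}\big|^2}{X^2}, \tag{4.11}$$

$$d_F(O, \hat{O}) = \frac{\|O - \hat{O}\|_F^2}{X \times Y} = \frac{\sum_{i=1}^{Y}\sum_{j=1}^{X} \big|[O]_{ij} - [\hat{O}]_{ij}\big|^2}{X \times Y}, \tag{4.12}$$

$$d_F(\pi, \hat{\pi}) = \frac{\|\pi - \hat{\pi}\|_2^2}{X} = \frac{\sum_{i=1}^{X} \big|[\pi]_i - [\hat{\pi}]_i\big|^2}{X}. \tag{4.13}$$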

The last row is justified since the Frobenius norm is a straightforward generaliza-

tion of the Euclidean norm to matrices.

Equation (4.10) is the measure that we will use in the simulations in the chapters to

come.
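The normalized measure of Equation (4.10) amounts to a mean squared error over the elements and can be sketched as follows; the function name and the test matrices are hypothetical.

```python
import numpy as np


def d_f(M_true, M_est):
    """Equation (4.10): squared Frobenius norm of the difference, divided by the element count."""
    M_true = np.asarray(M_true, dtype=float)
    M_est = np.asarray(M_est, dtype=float)
    return float(np.sum(np.abs(M_true - M_est) ** 2) / M_true.size)


T_true = np.array([[1.0, 0.0], [0.0, 1.0]])
T_est = np.array([[0.9, 0.1], [0.1, 0.9]])
err_T = d_f(T_true, T_est)                      # four elements, each off by 0.1
err_pi = d_f([0.75, 0.25], [0.70, 0.30])        # the same function covers Equation (4.13)
print(err_T, err_pi)
```

The estimates are assumed to have been re-ordered to match the state labeling of the true system before the measure is evaluated.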

4.2 Re-Ordering the States

To make sure that the estimated matrices can be compared to the true ones using the

dF -measure defined above, we will need to re-order the columns and rows so that they

correspond to the order of the states in the true matrices.

Given a set M with n elements, assign a unique number from 1 to n! to each of the

possible permutations of the items in the set. Let $\mathrm{perm}_i : M \to M$ be the permutation
operator giving the $i$th permutation of the items in $M$.

Define

$$P_i = \left[ \mathrm{perm}_i \{ e_1, \dots, e_X \} \right], \tag{4.14}$$

where $[\cdot]$ concatenates the vectors in $\cdot$ horizontally, as the $i$th permutation matrix of
dimension $X \times X$.

Now, we will let the matrix $T^*$ denote the estimated transition matrix with permuted
states that seems to resemble the ordering in the true transition matrix $T$ the best.
We will calculate it as

$$T^* = P_{i^*}^{-1} \hat{T} P_{i^*}, \tag{4.15}$$

where

$$i^* = \arg\min_{i \in \{1, \dots, X!\}} \| P_i^{-1} \hat{T} P_i - T \|_F^2. \tag{4.16}$$

$P_{i^*}$ specifies how the states of the estimated quantities should be permuted so as to line up
with the order used in the true quantities. Once we have found $P_{i^*}$, and have sorted $\hat{T}$,
the estimated observation matrix and the estimated stationary distribution vector have
to be sorted correspondingly. This is done by taking

$$O^* = \hat{O} P_{i^*} \tag{4.17}$$

and

$$\pi^* = \hat{\pi} P_{i^*}. \tag{4.18}$$

The observant reader will note that this might not give the true ordering of the states
at all. It might happen that some permutation of the states, other than the true one,
gives a better fit to the true model. This might give a slightly smaller error for a small
number of samples, but as the elements of the estimated matrices start to converge, so
should the selected permutation converge to the true one.

In all figures to come where dF is plotted this re-ordering is performed so that the

measure makes sense.
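The re-ordering in Equations (4.15) to (4.18) can be sketched as a brute-force search over all $X!$ orderings. The sketch below is only an illustration under stated assumptions: the name `reorder_states` is hypothetical, matrices are lists of rows, and the permutation matrix is represented by an index tuple `perm` such that $P = [e_{perm(1)}, \dots, e_{perm(X)}]$, which makes entry $(i, j)$ of $P^{-1} \hat{T} P$ equal to entry $(perm(i), perm(j))$ of $\hat{T}$.

```python
from itertools import permutations


def reorder_states(T_true, T_est, O_est, pi_est):
    """Find the state ordering of the estimates that best matches T_true."""
    X = len(T_true)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(X)):
        # Squared Frobenius distance between the permuted estimate and T_true.
        cost = sum(
            abs(T_est[perm[i]][perm[j]] - T_true[i][j]) ** 2
            for i in range(X)
            for j in range(X)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    p = best_perm
    T_star = [[T_est[p[i]][p[j]] for j in range(X)] for i in range(X)]
    O_star = [[row[p[j]] for j in range(X)] for row in O_est]  # permute columns only
    pi_star = [pi_est[p[j]] for j in range(X)]
    return T_star, O_star, pi_star, p


# Made-up example: the estimate equals the true model with the two states swapped.
T_true = [[0.9, 0.2], [0.1, 0.8]]
T_est = [[0.8, 0.1], [0.2, 0.9]]
O_est = [[0.3, 0.6], [0.7, 0.4]]
pi_est = [1 / 3, 2 / 3]
T_star, O_star, pi_star, perm = reorder_states(T_true, T_est, O_est, pi_est)
print(perm, T_star)
```

The loop visits all $X!$ permutations, which is the combinatorial cost that limits the number of hidden states that can be benchmarked in this way.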

4.3 Some Notes on Implementing the Spectral Learning Algorithm

There are some ambiguities when it comes to implementing the spectral learning algo-

rithm of Hsu et al. [12]. Some of these will be briefly touched upon in the following

sections.

4.3.1 Faulty Matrices

The issue that the spectral learning algorithm (and also the primal-dual method, which

only enforces the constraints softly) is not guaranteed to return valid stochastic matrices

has to be handled somehow. This means that the estimated transition and observation

matrix might have elements lying outside of the interval [0, 1] and some columns’ ele-

ments might not sum to one. For the spectral learning algorithm, some elements might

even have a non-zero imaginary part due to the eigenvalue calculation.


We have chosen not to use the heuristic approach of clipping negative or complex ele-

ments and enforcing stochasticity by normalizing each column. Rather, we include a plot

of the fraction of non-stochastic estimates that the spectral learning algorithm yields as

the number of samples grows.

It has been found beneficial empirically to rerun Step 3 to Step 6 of the spectral learning

algorithm, as outlined in Figure 2.1, until feasible estimates for the transition and obser-

vation matrix are acquired. It might be the case that no such estimates are ever found,

so we have in our implementation taken 75 to be the maximum number of iterations for

rerunning these steps.
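The validity check and the bounded retry described above could look roughly as follows. This is a sketch under assumptions: `run_steps_3_to_6` is a hypothetical callable standing in for Step 3 to Step 6 of Figure 2.1, the tolerances `tol` and `sum_tol` are illustrative values, and matrices are assumed to be column-stochastic lists of rows.

```python
def is_stochastic(matrix, tol=1e-12, sum_tol=1e-9):
    """True if all entries are real and in [0, 1] and every column sums to one."""
    for row in matrix:
        for value in row:
            z = complex(value)
            if abs(z.imag) > tol or z.real < -tol or z.real > 1.0 + tol:
                return False
    for j in range(len(matrix[0])):
        column_sum = sum(complex(row[j]).real for row in matrix)
        if abs(column_sum - 1.0) > sum_tol:
            return False
    return True


def estimate_with_retries(run_steps_3_to_6, max_tries=75):
    """Re-run the estimator until T and O are both valid or the budget is spent."""
    for attempt in range(1, max_tries + 1):
        T_est, O_est = run_steps_3_to_6()
        if is_stochastic(T_est) and is_stochastic(O_est):
            return T_est, O_est, attempt, True
    return T_est, O_est, max_tries, False


# Toy stand-in for the estimator: the first two calls return an invalid T
# (an entry above one, then complex entries); the third call is valid.
canned = iter([
    ([[1.1, 0.2], [-0.1, 0.8]], [[0.6, 0.3], [0.4, 0.7]]),
    ([[0.9 + 0.1j, 0.2], [0.1 - 0.1j, 0.8]], [[0.6, 0.3], [0.4, 0.7]]),
    ([[0.9, 0.2], [0.1, 0.8]], [[0.6, 0.3], [0.4, 0.7]]),
])
T_hat, O_hat, tries, ok = estimate_with_retries(lambda: next(canned))
print(tries, ok)
```

Returning the attempt count together with a success flag makes it easy to report the fraction of non-stochastic estimates separately from the estimates themselves.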

4.3.2 If the Matrix is not Diagonal?

Step 5 of Figure 2.1 should result in a diagonal matrix, whose diagonal is then taken as

one row of the estimated observation matrix.

If the resulting matrix is not diagonal (due to this being a numerical implementation

with estimated quantities), one has to make a decision:

• The first option is to set all off-diagonal elements to zero.

• Another option would be to do an eigendecomposition on the resulting (non-diagonal)
matrix, and use the diagonal matrix of eigenvalues from this decomposition.
The rationale is that this should include more data (the off-diagonal
elements) in the estimate, thus giving a better estimate. One has to be careful
with the ordering of these new eigenvalues though, as mentioned at the end of
Section 2.3.

In the simulations in Chapter 5, the first option is chosen.

4.3.3 Separating the Eigenvalues

Hsu et al. [12] mention that the main source of instability with the spectral learning

algorithm is the risk of not getting separated eigenvalues in Step 5 of Figure 2.1. Even

if the eigenvalues are separated, they might not be very well spread apart. Analyti-

cally, this is not a problem since we will recover a non-singular similarity transform.

Numerically, however, the small separation might cause problems.


[Figure 4.2, "Influence of Eigenvalue Separation": four panels plotted against the number of samples ($10^3$ to $10^6$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the error of the π-estimate $d_F(\pi, \hat{\pi})$, and the fraction of non-stochastic estimates (of either O or T), with one curve for each of $N_g = 1, 2, 10, 25, 50$.]

Figure 4.2: Performance of the spectral learning algorithm when Step 4 and 5 are performed $N_g$ times to find the set of random variables giving the largest spread of the eigenvalues. Every data point is averaged over 75 simulations.

One idea could be to rerun Step 4 and Step 5 a number of times, say $N_g$, and choose the
vector $g_{i^*}$ of random variables that resulted in the largest separation of the eigenvalues;

$$i^* = \arg\max_{i \in \{1, \dots, N_g\}} \min_{j \neq k} |\lambda_j - \lambda_k|, \tag{4.19}$$

where the eigenvalues are evaluated as functions of $g_i$.

However, as seen in Figure 4.2, where this is done for $N_g \in \{1, 2, 10, 25, 50\}$ on Example 2
of Appendix A, a higher value of $N_g$ does, if anything, seem to deteriorate the estimates.

It appears as if choosing the eigenvalues with the largest separation generates more


non-stochastic estimates of the transition and observation matrix. In the simulations in

Chapter 5, we take Ng = 1, i.e. we do not try to find the largest spread. We do however

rerun these steps (4 and 5) if any eigenvalues happen to be equal.
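The selection rule in Equation (4.19) can be sketched as follows, assuming the eigenvalues obtained for each draw $g_i$ have already been computed and collected in a list. The function names and the example numbers are hypothetical.

```python
def min_separation(eigenvalues):
    """Smallest pairwise distance |lambda_j - lambda_k| over all j != k."""
    return min(
        abs(eigenvalues[j] - eigenvalues[k])
        for j in range(len(eigenvalues))
        for k in range(j + 1, len(eigenvalues))
    )


def pick_best_draw(eigenvalue_sets):
    """Index of the draw whose eigenvalues are the most spread apart."""
    return max(range(len(eigenvalue_sets)),
               key=lambda i: min_separation(eigenvalue_sets[i]))


# Made-up eigenvalues for N_g = 3 draws of the random vector g.
candidates = [
    [0.10, 0.12, 0.90],  # two eigenvalues nearly coincide
    [0.10, 0.45, 0.90],  # well separated
    [0.30, 0.30, 0.80],  # repeated eigenvalue: separation zero
]
best = pick_best_draw(candidates)
print(best, min_separation(candidates[best]))
```

A separation of exactly zero, as in the third candidate, corresponds to the case of equal eigenvalues in which Steps 4 and 5 are rerun even when $N_g = 1$.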

Chapter 5

Numerical Results for Spectral Learning and Structured Non-Negative Matrix Factorization

In the plots in this chapter, samples refers to the total amount of observation samples

used for estimating the parameters of the HMM and not the number of observation pairs

or triplets.

The measure of correctness utilized is dF (·, ·), which is the mean squared error of all the

elements of the matrix in question. See Section 4.1.4 for a discussion.

We report the fraction of matrices that do not have all elements in the interval [0, 1]

(fraction of non-stochastic estimates), since the spectral learning algorithm does not

guarantee that the estimated quantities fulfill this constraint.

The examples used in this chapter can be found in Appendix A and will in the plots be

referred to as Ei, where i is the number of the example.

5.1 Comparison of Spectral Learning and Structured Non-Negative Matrix Factorization

In this section, we compare the Spectral Learning (SL) algorithm to the Structured

Non-Negative Matrix Factorization (SNNMF) algorithm on two systems. These two

algorithms are described in Chapter 2.


[Figure 5.1, "Comparison of SL and SNNMF": four panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the time [s], and the fraction of non-stochastic estimates (of either O or T). Curves: SL E1, SL E2, SNNMF E1, SNNMF E2.]

Figure 5.1: Comparison of SL and SNNMF. All data points are averaged over 20 simulations.

The first system is one with two hidden states and two possible discrete observations

(Example 1). The second system is one with three hidden states and three possible

discrete observations (Example 2).

Both algorithms were tested on 20 different realizations of the two HMMs for each
sample size. The average dF -error is plotted in Figure 5.1 along with the time
consumption and the fraction of non-stochastic estimates. For the SNNMF-algorithm,
the multistart-command¹ in MATLAB was used to try to find the global minimum,
with 25 different starting positions.

¹ http://se.mathworks.com/help/gads/multistart-class.html


The spectral learning algorithm not only runs much faster than SNNMF, it also
appears to converge much closer to the true solution as the number of samples grows. As
a matter of fact, it seems as if the SNNMF algorithm fails to recover the true system
matrices, even for a large number of samples. This could be a problem of a local
minimum. The large number of initial conditions tested by the multistart-command
should, however, alleviate this.

For Example 2, SNNMF appears to outperform the spectral learning algorithm when
fewer than about $10^5$ samples are available. This is the point where the spectral learning
algorithm starts to provide valid stochastic estimates.

This suggests that one could try to use the spectral learning algorithm as a first al-

ternative for identifying a system (due to the very fast run-time): if it generates a

high percentage of non-stochastic estimates, then one could fall back on the SNNMF

algorithm.

5.2 Performance of the Spectral Learning Algorithm

In this section, we evaluate the spectral learning algorithm on a multitude of systems

with three hidden states to see if we can draw any conclusions on the consistency of the

algorithm.

Simulation results can be found in Figure 5.2, where every data point is an average over

20 different realizations of the HMM. Mean slopes in the log-log diagram, which give an

indication of the convergence rate, have been calculated and are provided in Table 5.1

(the data points at $10^3$ samples were disregarded in this calculation) along with the

second largest eigenvalue of the transition matrix. The second largest eigenvalue is

related to the mixing of the Markov chain and could be a possible indicator of how well

the algorithm will perform.

Example   Mean slope of $d_F(T, \hat{T})$   Mean slope of $d_F(O, \hat{O})$   Mean slope of $d_F(\pi, \hat{\pi})$   $\lambda_2$ of $T$
2         -1.166                            -1.583                            -1.029                                0.30
3         -1.020                            -1.213                            -1.110                                0.75
4         -0.309                            -0.161                            -0.179                                0.40
5         -0.944                            -0.361                            -0.716                                0.25 + 0.43i
6         -0.956                            -1.344                            -0.957                                0.56
7         -0.870                            -0.973                            -0.727                                0.23
8         -0.355                            -0.027                            -0.247                                0.13

Table 5.1: Mean slopes of the dF -error for the spectral learning algorithm in the log-log plot Figure 5.2 together with the second largest eigenvalue of the transition matrix for various examples. The system matrices can be found in Appendix A.
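The text does not spell out how the mean slopes were computed. One plausible way, shown here purely as an assumption, is an ordinary least-squares fit of $\log_{10}$(error) against $\log_{10}$(samples); the function name and the data below are made up for illustration.

```python
import math


def loglog_slope(samples, errors):
    """Least-squares slope of log10(error) versus log10(samples)."""
    xs = [math.log10(s) for s in samples]
    ys = [math.log10(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


# Made-up errors that decay as 1/N; the point at 10^3 samples is left out,
# mirroring the choice described for Table 5.1.
samples = [1e4, 1e5, 1e6, 1e7]
errors = [1e-2, 1e-3, 1e-4, 1e-5]
slope = loglog_slope(samples, errors)
print(round(slope, 3))
```

A slope close to -1 means that the mean squared error falls roughly in inverse proportion to the number of samples.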


[Figure 5.2, "Performance of SL on Various Examples": five panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the error of the π-estimate $d_F(\pi, \hat{\pi})$, the fraction of non-stochastic estimates (of either O or T), and the time [s]. Curves: E2, E3, E4, E5, E6, E7, E8.]

Figure 5.2: Performance of the spectral learning algorithm on numerous examples (found in Appendix A). All data points are averaged over 20 simulations.


Even though there appears to be a continuous scale for the performance of the spectral
learning algorithm on the examples considered, three different classes are somewhat
apparent from Figure 5.2: no convergence, slow convergence and fast convergence. The
examples that are the easiest to classify are Examples 3 and 6, which belong to the fast
convergence class, and Examples 4, 5 and 8, which belong to the no convergence class. The
other examples appear to be somewhere in the middle (slow convergence class).

Worth noting is that the convergence rate for all examples appears to be constant once

the spectral learning algorithm starts providing stochastic estimates (which happens for

different amounts of samples for these examples).

As discussed in Section 4.3.1, some steps of the algorithm are re-evaluated until a valid
estimate is generated, or a maximum number of iterations (75) has been reached. This
accounts for the slightly higher time consumption of Examples 4, 5 and 8: since the
fraction of non-stochastic estimates is very high, the algorithm has to redo those steps
up to 75 times.

A larger second largest eigenvalue of the transition matrix appears to give slightly better
performance. It is not a perfect indicator, however: compare Example 4 with $\lambda_2 = 0.4$
and Example 2 with $\lambda_2 = 0.3$. The spectral learning algorithm fails to generate good
estimates for Example 4, even though its $\lambda_2$ is larger than that of Example 2, on which
it recovers good estimates.

We next provide a comparison of the spectral learning algorithm to the standard
Expectation-Maximization (EM) algorithm. We use the implementation of the EM-algorithm
provided in MATLAB, hmmtrain². The limit of maximum iterations was taken as the
default value of 500. Due to the very slow convergence rate, we only calculate averages
over three realizations of the HMMs and only provide data for three of the examples
in Figure 5.2 (one from each performance class). The EM-algorithm was
initialized with random matrices. The results are seen in Figure 5.3.

This figure illustrates the critique of the EM-method fairly well. First of all, it is
very slow. The time-scale is orders of magnitude larger than for the spectral learning
algorithm. At $10^6$ to $10^7$ samples, the spectral learning algorithm generates an estimate
in less than a second, whereas the time consumption increases rapidly for EM, which
requires ten minutes for about $10^5$ samples. Secondly, it appears to be very sensitive
to the initial conditions chosen (as seen in the actual increase in error when the
number of samples increased for Example 3), which highlights the possibility of getting
stuck in local minima.

² http://se.mathworks.com/help/stats/hmmtrain.html


[Figure 5.3, "Performance of the EM-method": three panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, and the time [s]. Curves: EM E2, EM E3, EM E4.]

Figure 5.3: Performance of the EM-algorithm on three examples. Every data point is an average over three realizations of the HMM. Random initial guesses were used for starting the EM-algorithm.

Since the benchmark of the EM-algorithm was not very extensive, no firm conclusions
can be drawn from Figure 5.2 and Figure 5.3. However, the results seem to indicate that the
spectral learning algorithm does not perform worse than EM, at least not when EM is
started from random initial guesses. A good approach here would probably be to use the spectral
algorithm to generate an initial guess, which the EM-algorithm can then
refine. This is especially true when a larger number of samples is available.
The spectral learning algorithm is very fast even when the data batch size is large. The
resulting estimates are, however, very good on the examples where it actually converges,
so a further refinement using EM might not be needed. Note that this assumes that
the estimates recovered from the spectral learning algorithm are stochastically valid. If
they are not, but can be made valid by manipulating them slightly, then doing so is probably a good
approach. If they are very far from being valid, then one is probably better off using a
random initial guess in EM.


5.3 Higher-Dimensional Systems with Spectral Learning

In this section, we evaluate how well the spectral learning algorithm performs on larger

systems. We benchmark the algorithm on a number of examples, each with an increasing

amount of hidden and observable states.

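The systems used in this benchmark are generated randomly. We let the transition
matrix be a perturbed shifted identity matrix, generated as:

1. Choose $X$ and let $T \in \mathbb{R}^{X \times X}$ be

$$T = \begin{bmatrix} 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \\ 0 & \cdots & 0 & 0 \end{bmatrix}. \tag{5.1}$$

2. Perturb $T$ as

$$[T]_{ij} \leftarrow \left| [T]_{ij} + \mathcal{N}\!\left(0, \tfrac{1}{X}\right) \right|. \tag{5.2}$$

3. Normalize each column of $T$.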

The number of observable states is assumed to be equal to the number of hidden
states, i.e. $Y = X$. The observation matrix is then generated in the same manner as
the transition matrix, except that it is chosen as the identity matrix in the first step,
and not a shifted one.
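The three generation steps can be sketched as below. Two points are assumptions of this sketch and not stated in the text: the second argument of $\mathcal{N}(0, 1/X)$ is read as the variance (so the standard deviation is $\sqrt{1/X}$), and an independent noise sample is drawn for every element. The function name `random_system` is hypothetical.

```python
import random


def random_system(X, rng, shifted):
    """Perturbed (shifted) identity matrix with columns normalized to sum to one."""
    offset = 1 if shifted else 0
    # Step 1: ones on the superdiagonal (shifted) or on the diagonal (not shifted).
    M = [[1.0 if j == i + offset else 0.0 for j in range(X)] for i in range(X)]
    # Step 2: add Gaussian noise to every element and take the absolute value.
    sigma = (1.0 / X) ** 0.5  # assumption: 1/X is the variance
    M = [[abs(value + rng.gauss(0.0, sigma)) for value in row] for row in M]
    # Step 3: normalize each column.
    column_sums = [sum(M[i][j] for i in range(X)) for j in range(X)]
    return [[M[i][j] / column_sums[j] for j in range(X)] for i in range(X)]


rng = random.Random(0)  # fixed seed so that the example is repeatable
T = random_system(5, rng, shifted=True)   # transition matrix
O = random_system(5, rng, shifted=False)  # observation matrix, Y = X
print([round(sum(T[i][j] for i in range(5)), 12) for j in range(5)])
```

Because of the absolute value in the second step, all entries are non-negative, and the normalization then makes every column a valid probability distribution.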

Simulations were only performed with up to nine hidden states due to the combinatorial

explosion when re-ordering the states (described in Section 4.2). To get results for larger

systems, a measure of correctness that avoids the combinatorial problem of aligning the

states with the original ordering would have to be used.

Three different matrices were generated and each used in five realizations of the HMM

for each sample size. The averages of these 15 simulations for different dimensions are

plotted in Figure 5.4.

We note the following things. First of all, the results are very inconclusive. The spectral
learning algorithm performed well on the 9-state example, extremely well on the 5-state
example, but very poorly on the 3- and 6-state examples. There appears to be some
underlying factor of the matrices, apart from the dimension, that determines how well
the spectral learning algorithm performs. This was seen in the previous section, where
the examples in Appendix A could be roughly classified into three classes depending on
the performance.


[Figure 5.4, "Performance of SL as System Dimension Increases": four panels plotted against the number of samples ($10^3$ to $10^7$), showing the error of the O-estimate $d_F(O, \hat{O})$, the error of the T-estimate $d_F(T, \hat{T})$, the time [s], and the fraction of non-stochastic estimates (of either O or T). Curves: X = 3, X = 5, X = 6, X = 7, X = 8, X = 9.]

Figure 5.4: Performance of the spectral learning algorithm as the dimension of the HMM increases. All data points are averaged over 15 simulations with 3 random matrices, each evaluated over 5 simulations.


Secondly, and somewhat trivially, increasing the number of states increases the time

consumption of the algorithm.

Thirdly, (almost) every single estimate generated had a non-stochastic transition or

observation matrix. A possible explanation for this could be that the method for gener-

ating the random system matrices gives rise to some elements very close to zero. Thus,

even if the estimate is very good, but happens to have some element lying close to zero

but on the “wrong” (negative) side, then the estimate is classified as faulty. This is

especially apparent for the 5-state example; the spectral learning algorithm appears to

come very close to the true system matrices, but still fails to generate valid matrices.

In the implementation, some slack was left around zero to allow for numerical errors (a

few multiples of the machine epsilon).

This makes it hard to use the methodology suggested in Section 5.1, i.e. to trust the
estimates once the fraction of non-stochastic estimates approaches zero. A variant of the
heuristic approach of clipping elements that are close to zero could be used instead: clip all
elements in $[-\varepsilon, 0)$ to zero and then check if the matrix is valid, where $\varepsilon > 0$ is chosen
small.
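The clip-then-check idea can be sketched as follows. The value of `eps`, the function names and the example matrices are illustrative assumptions; real-valued entries are assumed, and validity is checked only in the sense used in this chapter, i.e. that all elements lie in the interval [0, 1].

```python
def clip_small_negatives(matrix, eps=1e-3):
    """Set entries in [-eps, 0) to zero and leave all other entries untouched."""
    return [[0.0 if -eps <= value < 0.0 else value for value in row] for row in matrix]


def entries_in_unit_interval(matrix):
    """True if every entry lies in [0, 1]."""
    return all(0.0 <= value <= 1.0 for row in matrix for value in row)


# Made-up estimates: one is only marginally negative, the other is far off.
nearly_valid = [[0.7, -0.0004], [0.3, 1.0]]
far_off = [[0.7, -0.2], [0.3, 1.2]]

for name, estimate in (("nearly valid", nearly_valid), ("far off", far_off)):
    clipped = clip_small_negatives(estimate)
    print(name, entries_in_unit_interval(estimate), entries_in_unit_interval(clipped))
```

Only the first estimate becomes valid after clipping; the second remains classified as non-stochastic, which is the intended behaviour when $\varepsilon$ is small.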

5.4 Perturbation Analysis

From the results in the previous sections, it is clear that the performance of the spectral
learning algorithm depends heavily on the system. Of the examples in Appendix A,
Example 3 showed excellent performance, whereas Example 4 and Example 5 seemed
almost impossible to identify, even with a huge number of samples at hand.

In this section, we will perform an analysis to see if we can explain the
discrepancy between these cases. The outline of the analysis can be found in standard
textbooks on numerical analysis, see for example Moler [21].

As explained in the derivation of the spectral learning algorithm (Section 2.3), the other

parameters of the HMM are calculated from the estimated observation matrix. This

means that the accuracy of the estimated observation matrix is crucial.

Recall Step 5 of Figure 2.1, i.e., that the rows of $\hat{O}$ are obtained by diagonalizing
$(\hat{U}^T \hat{S}_{3,y,1})(\hat{U}^T \hat{S}_{3,1})^{+}$ for $y = 1, 2, \dots, Y$. How sensitive are the eigenvalues (which we try
to find by diagonalization) to the accuracy of the estimates $\hat{S}_{3,y,1}$ and $\hat{S}_{3,1}$?

Assume for simplicity that we are dealing with a square system ($X = Y$), so that we
can disregard the $U$-matrix. A very basic analysis can be performed as follows. We are
interested in the eigenvalues of $S_{3,y,1} S_{3,1}^{-1}$. For notational ease, let

$$M = S_{3,y,1} S_{3,1}^{-1}. \tag{5.3}$$

Write the eigendecomposition of $M$ as

$$M = R D R^{-1}, \tag{5.4}$$

or equivalently

$$D = R^{-1} M R, \tag{5.5}$$

where $D$ is a diagonal matrix with the eigenvalues of $M$ as its diagonal, and $R$ is the
matrix of eigenvectors sorted correspondingly. Now perturb $M$ by $\Delta M$ and assume that
the same matrix of eigenvectors is used in the diagonalization step. This will result in a
perturbation $\Delta D$ of the diagonal matrix of eigenvalues,

$$D + \Delta D = R^{-1}(M + \Delta M)R = R^{-1} M R + R^{-1} \Delta M R, \tag{5.6}$$

giving

$$\Delta D = R^{-1} \Delta M R. \tag{5.7}$$

Taking the norm of this expression yields

$$\|\Delta D\| = \|R^{-1} \Delta M R\| \leq \|R^{-1}\| \|R\| \|\Delta M\| = \kappa(R) \times \|\Delta M\|, \tag{5.8}$$

where we used the definition of the condition number $\kappa(R) = \|R\| \|R^{-1}\|$. This very
simple analysis tells us that the sensitivity of the eigenvalues of $M$, i.e., the elements in one
row of the estimated observation matrix, is related to the condition number of $R$, i.e.,
the matrix of eigenvectors used in the diagonalization procedure.

Notice here that $\Delta D$ is not necessarily diagonal. Translating the above relation back to
our original variables, we get

$$\|\Delta \operatorname{diag}(e_y^T O)\| \leq \kappa(OT) \times \|\hat{S}_{3,y,1} \hat{S}_{3,1}^{-1} - S_{3,y,1} S_{3,1}^{-1}\|. \tag{5.9}$$

Thus, the accuracy of the estimate of the observation matrix is related to the condition
number of the product $OT$ and the accuracy of the estimate of the moments.
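The bound in Equation (5.8) can be checked numerically on a small example. The sketch below is an illustration under assumptions: it uses the Frobenius norm (which is submultiplicative, so the inequality applies), it is restricted to 2 × 2 matrices to avoid external libraries, and the matrices `R_well`, `R_ill` and `dM` are made-up stand-ins for $R$ and $\Delta M$.

```python
def matmul(A, B):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]


def fro(A):
    """Frobenius norm."""
    return sum(value * value for row in A for value in row) ** 0.5


def bound_terms(R, dM):
    """Return (||dD||, kappa(R) * ||dM||, kappa(R)) with dD = R^{-1} dM R."""
    R_inv = inv2(R)
    dD = matmul(matmul(R_inv, dM), R)
    kappa = fro(R) * fro(R_inv)
    return fro(dD), kappa * fro(dM), kappa


dM = [[1e-3, -2e-3], [5e-4, 1e-3]]    # made-up perturbation of M
R_well = [[0.9, 0.2], [0.1, 0.8]]     # columns far from parallel
R_ill = [[0.55, 0.5], [0.45, 0.5]]    # columns nearly parallel
results = [bound_terms(R, dM) for R in (R_well, R_ill)]
for lhs, rhs, kappa in results:
    print(f"kappa = {kappa:6.2f}   ||dD|| = {lhs:.2e}   bound = {rhs:.2e}")
```

The nearly parallel columns of `R_ill` give a condition number about ten times larger than that of `R_well`, so the same perturbation of $M$ is allowed to move the eigenvalues correspondingly further.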


Example   Error $d_F(O, \hat{O})$ with $10^7$ samples   cond($OT$)
4         $2.75 \times 10^{-1}$                         115.4
5         $4.42 \times 10^{-2}$                         27.1
8         $1.72 \times 10^{-2}$                         208.3
7         $6.06 \times 10^{-4}$                         21.6
2         $4.55 \times 10^{-5}$                         10.8
6         $4.66 \times 10^{-6}$                         5.4
3         $3.03 \times 10^{-7}$                         2.6

Table 5.2: The result of Section 5.2 (Figure 5.2) sorted according to the error in the estimate of the observation matrix when $10^7$ samples were available (which is an indicator of how well the spectral learning algorithm has performed). The condition number of $OT$ appears to correlate well with this, except for the outlier Example 8.

In Table 5.2, we sort the results from Section 5.2 (i.e. Figure 5.2) according to the
error when $10^7$ samples were used in the estimation procedure. The condition number
of $OT$ appears to be a fairly good indicator of how well the spectral learning algorithm
performs. Except for the outlier Example 8, the condition numbers are perfectly sorted
in descending order. Example 8 could perhaps be explained by the second factor in
Equation (5.9), $\|\hat{S}_{3,y,1} \hat{S}_{3,1}^{-1} - S_{3,y,1} S_{3,1}^{-1}\|$, or by the fact that $U$ has to be included in the analysis.
This could also possibly explain Example 5, which has a relatively large error, but a
condition number that is not very different from that of Example 7.

This suggests that the condition number of OT could be used as a gauge for how much

the estimates of the spectral learning algorithm can be trusted (this is not as useful as

it might seem, since this condition number is unknown when performing identification

of a system). A more detailed analysis could perhaps reveal a better relation.

Chapter 6

Numerical Results for Online Structured Non-Negative Matrix Factorization

This chapter aims to demonstrate the performance of the recursive algorithms described

in Chapter 3. The details of all examples can be found in Appendix A.

6.1 Comparison of the Algorithms

In this section, we compare the performance of the three recursive methods derived in

Chapter 3. We do this on two systems with three hidden states and a varying amount

of possible discrete observations. One system has three possible observations (Exam-

ple 2), and one has ten possible observations (Example 3). The exact transition and

observation matrices used in the examples are found in Appendix A. The methods are
labeled as Projected Gradient Descent Method (PGDM), Primal-Dual Method (PDM)
and Spherical Coordinates Method (SCM) in the plots and tables.

Example   Method   Iteration average [s]    Mean slope
2         PGDM     $2.38 \times 10^{-4}$    -0.339
2         PDM      $1.18 \times 10^{-4}$    -0.314
2         SCM      $9.01 \times 10^{-4}$    -0.231
3         PGDM     $2.70 \times 10^{-4}$    -0.232
3         PDM      $1.37 \times 10^{-4}$    -0.228
3         SCM      $1.02 \times 10^{-3}$    -0.117

Table 6.1: Comparison of the convergence rate and time consumption of the three recursive methods on two examples (E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10). Mean Slope refers to the slopes in Figure 6.1.

[Figure 6.1, "Performance of PGDM, PDM and SCM": two panels showing the error of the T-estimate $d_F(T, \hat{T})$ against the iteration number ($10^0$ to $10^5$). Left panel curves: PGDM E2, PDM E2, SCM E2. Right panel curves: PGDM E3, PDM E3, SCM E3.]

Figure 6.1: Performance of the three recursive methods on two examples: E2 with X = 3, Y = 3 and E3 with X = 3, Y = 10. Every data point is an average over ten simulations.

There is one parameter that we are free to choose in each algorithm: the step-size $\eta_k$
in the gradient descent. We have for simplicity chosen to use a constant. A range of
different values was tested and the one found to give the best result for each method
was then chosen. These were $\eta_{\mathrm{PGDM}} = 0.25$, $\eta_{\mathrm{PDM}} = 0.05$ and $\eta_{\mathrm{SCM}} = \frac{2\pi}{10^3}$.

For each example, ten realizations of the HMM were generated and the methods were

fed one output sample at a time to recursively improve the estimate of the transition

matrix. An average was then calculated for each method of the error in each iteration.

The plot of the averages of these errors for certain iterations can be seen in Figure 6.1.

The average time for one iteration along with the average slope (in a log-log plot) of

each algorithm are reported in Table 6.1. The PGDM appears to converge slightly faster

than the two other methods. The SCM seems to struggle for a while with Example 3.

This can perhaps be resolved by using another step-size.


[Figure 6.2, "Performance of PGDM on Higher-Dimensional Systems": one panel showing the error of the T-estimate $d_F(T, \hat{T})$ against the iteration number ($10^0$ to $10^5$). Curves: X = 5, X = 10, X = 20.]

Figure 6.2: Performance of PGDM as the dimension of the system increases. The systems are generated randomly with Y = X. Each data point is an average over nine simulations with three pairs of random matrices, each used for three simulations.

The average time for one iteration of the PGDM is about twice that of the PDM.
The time could probably be reduced by using more efficient code for the projection
operation. The SCM is about one order of magnitude slower than the other two methods.
This is expected since the expression for the gradient (see Section 3.3.2) is quite involved
and has not been optimized for performance.

6.2 Performance as Dimension Increases

In this section, we evaluate the performance of the recursive algorithms
as the dimension of the system increases. The three proposed methods appeared
to perform very similarly in the previous section. Since the PGDM had the fastest
convergence rate, we will perform the benchmarks only for this method in this section.

The systems we perform the benchmarks on are generated randomly, in the same manner
as in Section 5.3. We provide results for $X \in \{5, 10, 20\}$, with $Y = X$. The results can
be seen in Figure 6.2. Every data point is an average over nine simulation runs. Three
random pairs of matrices were generated and used for three simulations each.

States   Method   Iteration average [s]    Mean slope
5        PGDM     $4.34 \times 10^{-4}$    -0.350
10       PGDM     $1.74 \times 10^{-3}$    -0.301
20       PGDM     $8.36 \times 10^{-3}$    -0.281

Table 6.2: Performance of the PGDM as the dimension of the system increases. States refer to the value of X and Y (equal). Mean Slope refers to the slopes in Figure 6.2.


Perhaps somewhat unintuitively, it appears as if the performance actually increases as

the number of states increases (as seen in the decrease in error). This can be explained

by the starting guess (taken to be the identity matrix). The way of generating the

random matrices will generate a lot of elements close to zero as the dimension increases.

This makes the initial guess better and better with our performance measure (the mean

squared error of each element).

Worth noting is that the convergence rate, i.e. the slope, is almost the same as the

number of states increases. The exact values for the average slopes in the log-log plot

are given in Table 6.2.

6.3 Tracking Time-Varying Dynamics

In this section, we demonstrate that the recursive algorithms can track a system with

time-varying dynamics. This means that either, or both, the transition matrix and the

observation matrix change over time.

Formally, we let the system be modelled as:

$$T(k) = \begin{cases} T_1, & k \in \{0, 1, \dots, \tau - 1\}, \\ T_2, & k \in \{\tau, \tau + 1, \dots\}, \end{cases} \tag{6.1}$$

and

$$O(k) = \begin{cases} O_1, & k \in \{0, 1, \dots, \tau - 1\}, \\ O_2, & k \in \{\tau, \tau + 1, \dots\}, \end{cases} \tag{6.2}$$

where $T_1$, $T_2$, $O_1$ and $O_2$ are constant transition and observation matrices, respectively,
and $\tau$ is the time for the switch.
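A switching system of the form (6.1) and (6.2) can be simulated with a few lines of Python. The sketch assumes column-stochastic matrices given as lists of rows, and that $T(k)$ governs the transition from time $k$ to time $k + 1$; the function names and the initial state `x0` are hypothetical choices made for the example.

```python
import random


def sample_column(matrix, j, rng):
    """Draw a row index i with probability matrix[i][j] (columns sum to one)."""
    u, cumulative = rng.random(), 0.0
    for i in range(len(matrix)):
        cumulative += matrix[i][j]
        if u < cumulative:
            return i
    return len(matrix) - 1  # guard against rounding in the cumulative sum


def simulate_switching_hmm(T1, O1, T2, O2, tau, n_steps, rng, x0=0):
    """Simulate hidden states and observations with a switch of dynamics at k = tau."""
    states, observations = [], []
    x = x0
    for k in range(n_steps):
        T, O = (T1, O1) if k < tau else (T2, O2)
        states.append(x)
        observations.append(sample_column(O, x, rng))  # y_k drawn from column x_k of O(k)
        x = sample_column(T, x, rng)                   # x_{k+1} drawn from column x_k of T(k)
    return states, observations


# Made-up two-state example where only the Markov chain changes (O1 = O2).
T1 = [[0.9, 0.2], [0.1, 0.8]]
T2 = [[0.3, 0.6], [0.7, 0.4]]
O1 = [[0.8, 0.1], [0.2, 0.9]]
states, observations = simulate_switching_hmm(T1, O1, T2, O1, tau=50, n_steps=100,
                                              rng=random.Random(1))
print(observations[:10])
```

Feeding `observations` one sample at a time to a recursive estimator then mimics the tracking experiment, with the change of dynamics occurring at index `tau`.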

We provide simulation results for two different scenarios. The first is that only the

dynamics of the Markov chain changes and the dynamics of the sensor remain constant,

i.e. O1 = O2. In the second example, both the dynamics of the Markov chain and the

dynamics of the sensor change. The exact matrices used can be found in Appendix A.

The switch of dynamics is made at time k = τ = 21667 in Figure 6.3. The PGDM

was used and every data point is an average over ten simulations. The left plot shows

the case where the sensor does not change its dynamics (i.e. O constant, T changes)

and the right plot shows the case where both system and sensor dynamics change. Also

illustrated is the performance for two different choices of the weight coefficient in the

update of the S2,1-matrix. This was discussed in Section 3.3.1.1.


[Figure 6.3, "PGDM Tracking Time-Varying Dynamics": two panels showing the error of the T-estimate $d_F(T, \hat{T})$ against the iteration number (0 to $9 \cdot 10^4$). Left panel curves: E6 → E2 with ρ(k) = 1 and with ρ(k) = k. Right panel curves: E2 → E9 with ρ(k) = 1 and with ρ(k) = k.]

Figure 6.3: Error as PGDM tracks a time-varying system. The left plot shows the case where only the Markov chain changes (at the dashed red line), and the sensor dynamics stays constant. The right plot shows the case where both the sensor dynamics and the Markov chain change. Every data point is an average over ten simulations. Two different choices for the weight in the update of the estimate of the $S_{2,1}$ matrix are shown.

Giving more weight to newer measurements, in these examples by taking $\rho(k) = k$,
speeds up the convergence after the change of dynamics. The convergence is slower
after the change of dynamics than for the initial system. This could be remedied by
putting even more weight on newer measurements, for example by taking $\rho(k) = k^2$ or
a higher power.

Nonetheless, Figure 6.3 demonstrates that PGDM can successfully be used to track a

time-varying system.

Chapter 7

Conclusions

7.1 Summary and Conclusions

The theme of this thesis has been identification of HMMs. The “standard” method

today, the EM-algorithm, is prone to problems of local minima and slow convergence.

We have implemented the spectral learning algorithm outlined in Hsu et al. [12], which

uses spectral methods to deliver one-shot estimates of the parameters of the HMM. It is

claimed that this method avoids the convergence problems of EM. We benchmarked it

on various examples and have seen that the algorithm overall performs well, but fails to

identify some systems. It is orders of magnitude faster than EM and generated better

estimates than EM for some systems when EM was started with random initial guesses.

We proposed that the difficulty of identifying some systems is related to the condition number of the product of the observation and transition matrices, OT. In our numerical examples, this condition number gave some indication of how well the spectral learning algorithm would perform.
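As a rough illustration of this diagnostic, the sketch below computes a condition number of OT for Example 1 and Example 2 from Appendix A. Two assumptions are made here: the 1-norm condition number is used because it needs only a matrix inverse (the norm used in the thesis is not restated in this chapter), and O is taken to be square so that OT is invertible. The helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (square matrices only)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def norm1(m):
    """Induced 1-norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in m) for j in range(len(m[0])))


def cond1(m):
    """1-norm condition number (an assumed choice of norm)."""
    return norm1(m) * norm1(inverse(m))


# Example 1 and Example 2 from Appendix A.
T1 = [[0.90, 0.30], [0.10, 0.70]]
O1 = [[0.80, 0.20], [0.20, 0.80]]
T2 = [[0.60, 0.30, 0.30], [0.20, 0.50, 0.30], [0.20, 0.20, 0.40]]
O2 = [[0.80, 0.20, 0.30], [0.10, 0.70, 0.10], [0.10, 0.10, 0.60]]

print("Example 1:", cond1(matmul(O1, T1)))
print("Example 2:", cond1(matmul(O2, T2)))
```

A larger value means OT is closer to singular. For a non-square O, a singular-value-based condition number would be needed instead.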

A down-side of the spectral learning algorithm is that there is no guarantee that the

estimates are valid in the sense that the matrices are stochastic. We noticed a general pattern: once the algorithm started to provide valid estimates, the estimates were very good. This happened once a sufficient number of samples was available for the estimation procedure (the number differed between systems).

This suggests that estimates generated from the spectral learning algorithm can be

trusted if they are valid. Since the true parameters are unknown in a real-world scenario,

we propose the following workflow. Given a batch of data (observations of the HMM),

partition it and apply the spectral learning algorithm to each part. If the percentage


of non-valid estimates is high, employ EM or some other method. If not, use the spectral

learning algorithm on the full data set. Since this gives a one-shot result, EM can then

be used with this estimate as initial guess to refine the estimate.
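The proposed workflow can be sketched as follows. This is a minimal sketch under stated assumptions: `spectral_estimate` is a hypothetical callable standing in for an implementation of the spectral learning algorithm, the threshold `max_invalid_fraction` is an arbitrary illustrative value (the text only says "high"), and validity is checked as "real, non-negative, columns summing to one" within a tolerance.

```python
def is_column_stochastic(matrix, tol=1e-6):
    """True if all entries are real and non-negative and each column sums to one."""
    for row in matrix:
        for value in row:
            value = complex(value)
            if abs(value.imag) > tol or value.real < -tol:
                return False
    n_cols = len(matrix[0])
    return all(abs(sum(complex(row[j]).real for row in matrix) - 1.0) <= tol
               for j in range(n_cols))


def choose_method(observations, n_parts, spectral_estimate, max_invalid_fraction=0.5):
    """Decide how to proceed based on how often the partial estimates are invalid.

    `spectral_estimate(part)` is a hypothetical function returning (T_hat, O_hat).
    Observations that do not fill a whole part are ignored in this sketch.
    """
    size = len(observations) // n_parts
    parts = [observations[i * size:(i + 1) * size] for i in range(n_parts)]
    invalid = 0
    for part in parts:
        T_hat, O_hat = spectral_estimate(part)
        if not (is_column_stochastic(T_hat) and is_column_stochastic(O_hat)):
            invalid += 1
    if invalid / n_parts > max_invalid_fraction:
        return "em"                  # fall back on EM or some other method
    return "spectral_then_em"        # spectral estimate on all data, refined by EM


# Stand-in estimator that always returns Example 1 (for demonstration only).
def dummy_estimate(part):
    return [[0.90, 0.30], [0.10, 0.70]], [[0.80, 0.20], [0.20, 0.80]]


print(choose_method(list(range(1000)), 10, dummy_estimate))
```

The returned label only names the branch of the workflow; running the chosen estimator on the full data set is left to the caller.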

In the derivation of the spectral learning algorithm, we tried to clarify some of the quirks

that had been overlooked in at least one work building on Hsu et al. [12]. We also provided

a way of measuring the accuracy of the estimates. Some previous work had failed to

provide a sensible measure for how accurate these matrices were when using a matrix

norm as measure. This was complicated by the fact that the hidden states of the HMM

can be permuted without changing the outside appearance of the system. We had to

re-order the states in the estimates so that they matched the order of the true system.

Once this was done, we could use the mean squared error of the elements between the

true matrices and the estimated matrices.
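The re-ordering step can be illustrated with a brute-force search over all relabelings of the hidden states. This is a sketch under assumptions: the matching criterion (the sum of the two mean squared errors) and the function names are illustrative choices, and an exhaustive search is only practical for small state spaces such as the three-state examples in Appendix A.

```python
from itertools import permutations


def permute_states(T, O, perm):
    """Relabel hidden states: new state i corresponds to old state perm[i]."""
    n = len(perm)
    T_p = [[T[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    O_p = [[row[perm[j]] for j in range(n)] for row in O]  # only columns move
    return T_p, O_p


def mse(A, B):
    """Mean squared error over the elements of two equally sized matrices."""
    diffs = [(a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def align_states(T_true, O_true, T_hat, O_hat):
    """Try every state ordering and keep the one with the smallest total error."""
    best = None
    for perm in permutations(range(len(T_true))):
        T_p, O_p = permute_states(T_hat, O_hat, perm)
        err = mse(T_true, T_p) + mse(O_true, O_p)
        if best is None or err < best[0]:
            best = (err, T_p, O_p)
    return best


# Example 2 from Appendix A, with the hidden states deliberately scrambled.
T2 = [[0.60, 0.30, 0.30], [0.20, 0.50, 0.30], [0.20, 0.20, 0.40]]
O2 = [[0.80, 0.20, 0.30], [0.10, 0.70, 0.10], [0.10, 0.10, 0.60]]
T_scrambled, O_scrambled = permute_states(T2, O2, (2, 0, 1))

err, T_aligned, O_aligned = align_states(T2, O2, T_scrambled, O_scrambled)
print(err, T_aligned == T2, O_aligned == O2)
```

Note that a permutation acts on both the rows and the columns of T but only on the columns of O, since the observation symbols keep their labels.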

We also discussed two ideas that could improve the estimates given by the spectral learning algorithm: clipping negative/complex elements, and trying to maximize the spread of the eigenvalues in the diagonalization. Maximizing the spread gave worse results and was thus not used in the numerical examples.
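One possible reading of the clipping idea is sketched below. The specific choices are assumptions rather than the procedure used in the thesis: the imaginary part is discarded, negative real parts are set to zero, and each column is rescaled to sum to one (an all-zero column is replaced by a uniform column).

```python
def clip_to_stochastic(matrix):
    """Map a possibly complex/negative estimate to a column-stochastic matrix.

    Assumed recipe: keep the real part, clip at zero, renormalize each column.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    clipped = [[max(complex(v).real, 0.0) for v in row] for row in matrix]
    for j in range(n_cols):
        col_sum = sum(clipped[i][j] for i in range(n_rows))
        for i in range(n_rows):
            if col_sum > 0:
                clipped[i][j] /= col_sum
            else:
                clipped[i][j] = 1.0 / n_rows  # no information left in this column
    return clipped


# A made-up invalid estimate with one negative and two complex entries.
estimate = [[1.1, 0.3 + 0.05j],
            [-0.1, 0.7 - 0.05j]]
print(clip_to_stochastic(estimate))
```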

We then introduced a method employing a variant of non-negative matrix factorization.

This method formulated the identification procedure as an optimization problem. The benefit was that we could guarantee that the estimated parameters were stochastically valid by enforcing constraints when solving the optimization problem. The disadvantage was that the optimization problem was non-convex, so we could not rule out convergence to a merely local minimum, as with the EM-algorithm.

Inspired by this method, we made the assumption that the dynamics of the sensor in

the HMM were known. We proved that the problem then reduces to a convex optimiza-

tion problem. Since it is of interest in a real-time setting to continuously improve the

estimates, we set out to formulate a recursive algorithm for estimating the unknown

dynamics of the Markov chain underlying the HMM.

Two relatively standard methods from constrained convex optimization were introduced

and employed to do this reformulation: the projected gradient descent method and

the primal-dual method. We also exploited a multi-dimensional generalization of the

Pythagorean trigonometric identity to transform the constrained optimization problem

to an unconstrained optimization problem which allowed us to use the regular gradient

descent method. This resulted in three novel methods for online estimation of the

transition dynamics of an HMM.
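To make the projected gradient idea concrete, the sketch below projects a vector onto the probability simplex with the sort-based procedure described in, for example, Wang and Carreira-Perpiñán [31], and uses it in a projected gradient iteration. The objective is a toy quadratic chosen for illustration, not the cost function of the thesis, and the step size and iteration count are arbitrary assumptions.

```python
def project_onto_simplex(y):
    """Euclidean projection of y onto {x : x_i >= 0, sum_i x_i = 1}."""
    u = sorted(y, reverse=True)
    cumulative = 0.0
    shift = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (1.0 - cumulative) / j
        if u_j + candidate > 0:
            shift = candidate  # keep the shift of the largest feasible j
    return [max(v + shift, 0.0) for v in y]


def projected_gradient_descent(gradient, x0, step, n_iter):
    """Iterate x <- P(x - step * grad(x)), where P is the simplex projection."""
    x = list(x0)
    for _ in range(n_iter):
        g = gradient(x)
        x = project_onto_simplex([xi - step * gi for xi, gi in zip(x, g)])
    return x


# Toy problem: minimize ||x - c||^2 over the simplex (c itself is infeasible).
c = [0.9, 0.4, -0.2]
solution = projected_gradient_descent(
    gradient=lambda x: [2.0 * (xi - ci) for xi, ci in zip(x, c)],
    x0=[1.0 / 3.0] * 3,
    step=0.25,
    n_iter=200,
)
print(solution)
```

Every iterate is non-negative and sums to one, which is the property that makes the method attractive for estimating the columns of a stochastic matrix.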


These three algorithms were then evaluated on numerical examples. We found that the formulation using the projected gradient descent method had slightly better performance than the other two, in terms of both convergence speed and computation time. This could change, however, with a more sophisticated choice of the step-sizes in the algorithms.

We also demonstrated that the algorithms can successfully track a system with time-

varying dynamics: both of the sensor and the Markov chain.

7.2 Contributions

In short, we have provided:

• A discussion of some of the delicacies concerning the implementation of the spectral

learning algorithm.

• A way of measuring accuracy along with benchmark data for the spectral learning

algorithm on various examples.

• A proposal for how to work with the spectral learning algorithm in a real-world

setting.

• A small benchmark of a structured non-negative matrix factorization algorithm.

• Three novel methods for estimating the transition dynamics of an HMM when the

sensor dynamics are known employing a structured non-negative matrix factoriza-

tion and convex optimization.

– Derivations with explicit expressions for all terms.

– A reformulation of the constrained optimization problem to an unconstrained

optimization problem by exploiting a multi-dimensional Pythagorean trigono-

metric identity.

– Benchmarks on various examples.

– Proof-of-concept that they can successfully track a system with time-varying

dynamics.

7.3 Future Work

There are many paths that are left unexplored in this work. We here provide brief

indications for what could be investigated in future work.


7.3.1 Combining with Other Methods

Anandkumar et al. [1] generalize the method in Hsu et al. [12] and use a slightly different

method for recovering the transition and observation matrices. One could study cases

where the spectral learning algorithm from Hsu et al. [12] fails to provide valid estimates

and see if this alternative method gives better performance. In that case, some hybrid

method could be devised, which falls back on the generalized method for estimating

difficult systems.

7.3.2 Other Explanations for the Performance of the Spectral Learning

Algorithm

We proposed that the condition number of OT could be used to explain the performance

of the spectral learning algorithm on different examples. However, we saw that this was

not a perfect indicator. How close to being singular the observation matrix is should be

a good indicator for how well any algorithm can identify the system. (If O has rank one

for example, then any T can be used to generate indistinguishable outputs.) A possible

measure could be the angle between the column vectors of O.
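A minimal sketch of such a measure is given below: it returns the smallest pairwise angle between the columns of O, which is zero when two hidden states produce identical output distributions. Taking the minimum over all pairs is an assumption about how the angles would be combined into a single number.

```python
import math


def column(matrix, j):
    return [row[j] for row in matrix]


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against round-off
    return math.acos(cosine)


def smallest_column_angle(matrix):
    """Smallest angle between any two distinct columns of the matrix."""
    n_cols = len(matrix[0])
    return min(angle_between(column(matrix, i), column(matrix, j))
               for i in range(n_cols) for j in range(i + 1, n_cols))


O_example1 = [[0.80, 0.20], [0.20, 0.80]]   # Example 1 from Appendix A
O_rank_one = [[0.50, 0.50], [0.50, 0.50]]   # states are indistinguishable

print(smallest_column_angle(O_example1))
print(smallest_column_angle(O_rank_one))
```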

7.3.3 How to choose the weight ρ(k)?

The weighting factor ρ(k) from Section 3.3.1.1 influences how fast the change of dynamics

will be “noticed” and incorporated in the estimate of S2,1. In this thesis, we provided

simulation results for two choices of ρ(k), but one could explore how best to choose this

factor to guarantee good convergence properties.

7.3.4 Formal Convergence Properties of Online Structured Non-Negative

Matrix Factorization

The formal convergence properties of the three novel algorithms should also be studied. This is non-trivial since there are two quantities for which convergence has to be guaranteed simultaneously: the observation matrix, $\hat{O} \to O$, and the second-order moment matrix, $\hat{S}_{2,1} \to S_{2,1}$.

7.3.5 Adaptive Step-Size

There has been much work done on adaptive step-size algorithms for solving optimiza-

tion problems. For simplicity, we used constant step-sizes that seemed to perform well


empirically. A more detailed study of how the step-sizes influence the convergence prop-

erties of the algorithms is a possible path for future work.

7.3.6 The Primal-Dual Method with the Inequality Constraint

We neglected the inequality constraint of the optimization problem in Section 3.3.1.3. It could be incorporated by adding X × X extra Lagrange multipliers, for example of the form $\mu_i = e^{\theta_i}$, which would guarantee that $\mu_i$ does not change sign.

7.3.7 Other Parametrizations than Spherical

The parametrization employed in Section 3.3.2 is interesting since it converts the con-

strained optimization problem to an unconstrained optimization problem. However, the

generalized Pythagorean trigonometric identity is not the only possible parametrization

that can be employed.

For example, taking

$$[A]_{ij} = \frac{e^{\theta_{ij}}}{\sum_{k=1}^{X} \sum_{l=1}^{X} e^{\theta_{kl}}} \tag{7.1}$$

would also guarantee that all elements of A are non-negative and that their sum is

one. Furthermore, the preservation of the convexity could be easier to analyze since the

exponential function is convex.
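The two kinds of parametrization can be compared in a few lines. In this sketch, `softmax_matrix` implements (7.1) directly, while `spherical_simplex` uses one common form of the generalized Pythagorean identity (squared spherical coordinates); the exact form used in Section 3.3.2 may differ, so the second function should be read as an assumption.

```python
import math


def softmax_matrix(theta):
    """[A]_ij = exp(theta_ij) / sum_kl exp(theta_kl), as in (7.1)."""
    peak = max(max(row) for row in theta)  # subtracting the max avoids overflow
    exps = [[math.exp(t - peak) for t in row] for row in theta]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def spherical_simplex(angles):
    """Map n-1 unconstrained angles to n non-negative numbers that sum to one.

    Uses cos^2(a_1) + sin^2(a_1) * (cos^2(a_2) + sin^2(a_2) * (...)) = 1.
    """
    coords = []
    sin_product = 1.0
    for a in angles:
        coords.append((sin_product * math.cos(a)) ** 2)
        sin_product *= math.sin(a)
    coords.append(sin_product ** 2)
    return coords


A = softmax_matrix([[0.0, 1.0], [-2.0, 3.0]])
p = spherical_simplex([0.3, 1.2, 2.5])
print(sum(sum(row) for row in A), sum(p))
```

Both maps take arbitrary real parameters to valid probabilities, so a gradient step in the parameters can never leave the feasible set. One difference is that the softmax form yields strictly positive entries, whereas the spherical form can reach exact zeros.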

7.3.8 Computational Complexity of Online Structured Non-Negative

Matrix Factorization

Numerical simulations provide insight, but a theoretical expression for the computational

complexity would make comparison to other methods easier.

Appendix A

Examples Used in Benchmarks

This appendix contains the parameters used in each of the examples from the text. They have either been taken from various texts on HMMs or been constructed for this thesis.

Example 1

$$T = \begin{bmatrix} 0.90 & 0.30 \\ 0.10 & 0.70 \end{bmatrix} \tag{A.1}$$

$$O = \begin{bmatrix} 0.80 & 0.20 \\ 0.20 & 0.80 \end{bmatrix} \tag{A.2}$$
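As a hedged illustration of how such parameters can be used, the sketch below simulates observations from Example 1. It assumes the column convention suggested by the column sums in this appendix, namely that element (i, j) of T is the probability of moving to state i from state j, and element (y, j) of O is the probability of observing symbol y in state j. The function names, the seed and the initial state are arbitrary choices.

```python
import random


def sample_from(probabilities, rng):
    """Draw an index according to a discrete probability vector."""
    threshold = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(probabilities) - 1  # guard against round-off in the last bin


def simulate_hmm(T, O, n_samples, rng, initial_state=0):
    """Generate observations from an HMM with column-stochastic T and O."""
    state = initial_state
    observations = []
    for _ in range(n_samples):
        observations.append(sample_from([row[state] for row in O], rng))
        state = sample_from([row[state] for row in T], rng)
    return observations


T1 = [[0.90, 0.30], [0.10, 0.70]]
O1 = [[0.80, 0.20], [0.20, 0.80]]
obs = simulate_hmm(T1, O1, 20000, random.Random(0))

# The stationary distribution of T1 is (0.75, 0.25), so symbol 0 should appear
# with probability 0.8 * 0.75 + 0.2 * 0.25 = 0.65 in the long run.
print(obs.count(0) / len(obs))
```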

Example 2

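$$T = \begin{bmatrix} 0.60 & 0.30 & 0.30 \\ 0.20 & 0.50 & 0.30 \\ 0.20 & 0.20 & 0.40 \end{bmatrix} \tag{A.3}$$

$$O = \begin{bmatrix} 0.80 & 0.20 & 0.30 \\ 0.10 & 0.70 & 0.10 \\ 0.10 & 0.10 & 0.60 \end{bmatrix} \tag{A.4}$$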


Example 3

This system was used by Mattfeld [20] to benchmark the spectral learning algorithm.

The numbers as presented here are rounded to two decimal places. The exact fractions

from Mattfeld [20, p. 26] were used in the numerical simulations.

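$$T = \begin{bmatrix} 0.80 & 0.07 & 0.17 \\ 0.10 & 0.87 & 0.17 \\ 0.10 & 0.07 & 0.67 \end{bmatrix} \tag{A.5}$$

$$O = \begin{bmatrix}
0.40 & 0.05 & 0.02 \\
0.07 & 0.55 & 0.02 \\
0.07 & 0.05 & 0.42 \\
0.07 & 0.05 & 0.42 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02 \\
0.07 & 0.05 & 0.02
\end{bmatrix} \tag{A.6}$$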

Example 4

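$$T = \begin{bmatrix} 0.50 & 0.10 & 0.20 \\ 0.20 & 0.60 & 0.40 \\ 0.30 & 0.30 & 0.40 \end{bmatrix} \tag{A.7}$$

$$O = \begin{bmatrix} 0.20 & 0.40 & 0.70 \\ 0.70 & 0.40 & 0.10 \\ 0.10 & 0.20 & 0.20 \end{bmatrix} \tag{A.8}$$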

Example 5

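$$T = \begin{bmatrix} 0.50 & 0.00 & 0.50 \\ 0.50 & 0.50 & 0.00 \\ 0.00 & 0.50 & 0.50 \end{bmatrix} \tag{A.9}$$

$$O = \begin{bmatrix} 0.70 & 0.30 & 0.80 \\ 0.20 & 0.40 & 0.05 \\ 0.10 & 0.30 & 0.15 \end{bmatrix} \tag{A.10}$$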

Example 6

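$$T = \begin{bmatrix} 0.70 & 0.20 & 0.10 \\ 0.20 & 0.60 & 0.30 \\ 0.10 & 0.20 & 0.60 \end{bmatrix} \tag{A.11}$$

$$O = \begin{bmatrix} 0.80 & 0.20 & 0.30 \\ 0.10 & 0.70 & 0.10 \\ 0.10 & 0.10 & 0.60 \end{bmatrix} \tag{A.12}$$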

Example 7

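$$T = \begin{bmatrix} 0.50 & 0.30 & 0.40 \\ 0.10 & 0.40 & 0.40 \\ 0.40 & 0.30 & 0.20 \end{bmatrix} \tag{A.13}$$

$$O = \begin{bmatrix} 0.80 & 0.20 & 0.30 \\ 0.10 & 0.70 & 0.10 \\ 0.10 & 0.10 & 0.60 \end{bmatrix} \tag{A.14}$$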

Example 8

These are two randomly generated matrices. The elements shown here are rounded; in the numerical simulations, the elements of each column summed to one.

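$$T = \begin{bmatrix} 0.61 & 0.40 & 0.51 \\ 0.33 & 0.37 & 0.37 \\ 0.06 & 0.23 & 0.13 \end{bmatrix} \tag{A.15}$$

$$O = \begin{bmatrix}
0.13 & 0.11 & 0.03 \\
0.18 & 0.14 & 0.08 \\
0.07 & 0.08 & 0.11 \\
0.15 & 0.11 & 0.10 \\
0.02 & 0.15 & 0.13 \\
0.12 & 0.17 & 0.10 \\
0.03 & 0.01 & 0.08 \\
0.00 & 0.16 & 0.17 \\
0.12 & 0.03 & 0.09 \\
0.18 & 0.03 & 0.12
\end{bmatrix} \tag{A.16}$$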

Example 9

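$$T = \begin{bmatrix} 0.60 & 0.30 & 0.50 \\ 0.00 & 0.50 & 0.40 \\ 0.40 & 0.20 & 0.10 \end{bmatrix} \tag{A.17}$$

$$O = \begin{bmatrix} 0.80 & 0.30 & 0.40 \\ 0.00 & 0.60 & 0.00 \\ 0.20 & 0.10 & 0.60 \end{bmatrix} \tag{A.18}$$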

Appendix B

Computing Environment

All simulations in the thesis were performed on a MacBook Air with the following

specifications:

• Mac OS X Version 10.9.5

• CPU 1.3 GHz Intel Core i5

• RAM 4 GB 1600 MHz DDR3

• MATLAB R2013a (8.1.0.604)


Bibliography

[1] A. Anandkumar, D. Hsu, and S. M. Kakade. A Method of Moments for Mixture Models and Hidden Markov Models. ArXiv e-prints, Mar. 2012.

[2] B. D. O. Anderson. New developments in the theory of positive systems. In C. I. Byrnes, B. N. Datta, C. F. Martin, and D. S. Gilliam, editors, Systems and Control in the Twenty-First Century, number 22 in Systems & Control: Foundations & Applications, pages 17–36. Birkhäuser Boston. ISBN 978-1-4612-8662-2, 978-1-4612-4120-1. URL http://link.springer.com/chapter/10.1007/978-1-4612-4120-1_2.

[3] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. pages 164–171. URL http://www.jstor.org/stable/2239727.

[4] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press. ISBN 0521833787, 9780521833783.

[5] K. P. Burnham and D. R. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer Science & Business Media. ISBN 9780387953649.

[6] Y. Chen and X. Ye. Projection onto a simplex. URL http://arxiv.org/abs/1101.6081.

[7] G. Cybenko and V. Crespi. Learning hidden Markov models using nonnegative matrix factorization. 57(6):3963–3970. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5773017.

[8] S. Fan and Y. Yao. Strong convergence of a projected gradient method. 2012:1–10. ISSN 1110-757X, 1687-0042. doi: 10.1155/2012/410137. URL http://www.hindawi.com/journals/jam/2012/410137/.

[9] L. Finesso, A. Grassi, and P. Spreij. Two-step nonnegative matrix factorization algorithm for the approximate realization of hidden Markov models. URL http://arxiv.org/abs/1007.3435.


[10] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press.

[11] H. Hjalmarsson and B. Ninness. Fast, non-iterative estimation of hidden Markov models. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 4, pages 2253–2256. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=681597.

[12] D. Hsu, S. M. Kakade, and T. Zhang. A Spectral Algorithm for Learning Hidden Markov Models. ArXiv e-prints, Nov. 2008.

[13] H. Jaeger. Observable operator models for discrete stochastic time series. 12(6):1371–1398. ISSN 0899-7667, 1530-888X. doi: 10.1162/089976600300015411. URL http://www.mitpressjournals.org/doi/abs/10.1162/089976600300015411.

[14] M. J. Johnson. A simple explanation of a spectral algorithm for learning hidden Markov models. URL http://arxiv.org/abs/1204.2477.

[15] V. Krishnamurthy and F. V. Abad. Gradient based policy optimization of constrained Markov decision processes. URL http://arxiv.org/abs/1110.4946.

[16] B. Lakshminarayanan and R. Raich. Non-negative matrix factorization for parameter estimation in hidden Markov models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 89–94. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5589231.

[17] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press.

[18] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Soc. ISBN 9780821886274.

[19] D. V. Lindberg and H. Omre. Inference of the transition matrix in convolved hidden Markov models by a generalized Baum-Welch algorithm. URL https://wiki.math.ntnu.no/_media/ure/subm-2014-4.pdf.

[20] C. Mattfeld. Implementing spectral methods for hidden Markov models with real-valued emissions. CoRR, abs/1404.7472, 2014. URL http://arxiv.org/abs/1404.7472.

[21] C. Moler. Numerical Computing with MATLAB. Society for Industrial and Applied Mathematics, 2004. ISBN 9780898715606. URL http://books.google.ca/books?id=-vPtcriflH0C.

[22] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. eprint arXiv:cs/0502076, Feb. 2005.




[23] L. R. Rabiner. First-Hand: The Hidden Markov Model. URL http://www.ieeeghn.org/wiki/index.php/First-Hand:The_Hidden_Markov_Model.

[24] J. Rodu, D. Foster, W. Wu, and L. Ungar. Using regression for spectral estimation of HMMs. In A.-H. Dediu, C. Martín-Vide, R. Mitkov, and B. Truthe, editors, Statistical Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science, pages 212–223. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-39592-5. doi: 10.1007/978-3-642-39593-2_19. URL http://dx.doi.org/10.1007/978-3-642-39593-2_19.

[25] StackOverflow. Question: Does matlab eig always returns[sic] sorted values. URL http://stackoverflow.com/questions/13704384/does-matlab-eig-always-returns-sorted-values.

[26] B. Vanluyten. Realization, Identification and Filtering for Hidden Markov Models using Matrix Factorization Techniques. PhD thesis, Katholieke Universiteit Leuven, 2008.

[27] B. Vanluyten, J. C. Willems, and B. De Moor. A new approach for the identification of hidden Markov models. In Decision and Control, 2007 46th IEEE Conference on, pages 4901–4905. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4434912.

[28] B. Vanluyten, J. C. Willems, and B. De Moor. Structured nonnegative matrix factorization with applications to hidden Markov realization and clustering. 429(7):1409–1424. ISSN 00243795. doi: 10.1016/j.laa.2008.03.010. URL http://linkinghub.elsevier.com/retrieve/pii/S0024379508001262.

[29] T. Vercauteren, A. L. Toledo, and X. Wang. Online Bayesian estimation of hidden Markov models with unknown transition matrix and applications to IEEE 802.11 networks. In Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP'05). IEEE International Conference on, volume 4, pages iv–13. IEEE. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1415933.

[30] C. Wang and N. Xiu. Convergence of the gradient projection method for generalized convex minimization. 16(2):111–120. URL http://link.springer.com/article/10.1023/A:1008714607737.

[31] W. Wang and M. A. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. URL http://arxiv.org/abs/1309.1541.


[32] X. Wang, V. Krishnamurthy, and J. Wang. Stochastic gradient algorithms for design

of minimum error-rate linear dispersion codes in MIMO wireless systems. 54(4):

1242–1255. ISSN 1053-587X. doi: 10.1109/TSP.2005.863122.

[33] H. Zhao and P. Poupart. A sober look at spectral learning. URL http://arxiv.org/abs/1406.4631.



TRITA XR-EE-RT 2015:001

www.kth.se
