

Thesis for the degree of Doctor of Philosophy

Probabilistic Sequence Models

with Speech and Language Applications

Gustav Eje Henter

Communication Theory
School of Electrical Engineering

KTH – Royal Institute of Technology

Stockholm 2013


Henter, Gustav Eje
Probabilistic Sequence Models with Speech and Language Applications

Copyright © 2013 Gustav Eje Henter, except where otherwise stated. All rights reserved.

ISBN 978-91-7501-932-1
TRITA-EE 2013:042
ISSN 1653-5146

Communication Theory
School of Electrical Engineering
KTH – Royal Institute of Technology
SE-100 44 Stockholm, Sweden


Abstract

Series data, sequences of measured values, are ubiquitous. Whenever observations are made along a path in space or time, a data sequence results. To comprehend nature and shape it to our will, or to make informed decisions based on what we know, we need methods to make sense of such data. Of particular interest are probabilistic descriptions, which enable us to represent uncertainty and random variation inherent to the world around us.

This thesis presents and expands upon some tools for creating probabilistic models of sequences, with an eye towards applications involving speech and language. Modelling speech and language is not only of use for creating listening, reading, talking, and writing machines—for instance allowing human-friendly interfaces to future computational intelligences and smart devices of today—but probabilistic models may also ultimately tell us something about ourselves and the world we occupy.

The central theme of the thesis is the creation of new or improved models more appropriate for our intended applications, by weakening limiting and questionable assumptions made by standard modelling techniques. One contribution of this thesis examines causal-state splitting reconstruction (CSSR), an algorithm for learning discrete-valued sequence models whose states are minimal sufficient statistics for prediction. Unlike many traditional techniques, CSSR does not require the number of process states to be specified a priori, but builds a pattern vocabulary from data alone, making it applicable for language acquisition and the identification of stochastic grammars. A paper in the thesis shows that CSSR handles noise and errors expected in natural data poorly, but that the learner can be extended in a simple manner to yield more robust and stable results also in the presence of corruptions.

Even when the complexities of language are put aside, challenges remain. The seemingly simple task of accurately describing human speech signals, so that natural synthetic speech can be generated, has proved difficult, as humans are highly attuned to what speech should sound like. A pair of papers in the thesis therefore study nonparametric techniques suitable for improved acoustic modelling of speech for synthesis applications. Each of the two papers targets a known-incorrect assumption of established methods, based on the hypothesis that nonparametric techniques can better represent and recreate essential characteristics of natural speech.

In the first paper of the pair, Gaussian process dynamical models (GPDMs), nonlinear, continuous state-space dynamical models based on Gaussian processes, are shown to better replicate voiced speech, without traditional dynamical features or assumptions that cepstral parameters follow linear autoregressive processes. Additional dimensions of the state space are able to represent other salient signal aspects such as prosodic variation. The second paper, meanwhile, introduces KDE-HMMs, asymptotically consistent Markov models for continuous-valued data based on kernel density estimation, that additionally have been extended with a fixed-cardinality discrete hidden state. This construction is shown to provide improved probabilistic descriptions of nonlinear time series, compared to reference models from different paradigms. The hidden state can be used to control process output, making KDE-HMMs compelling as a probabilistic alternative to hybrid speech-synthesis approaches.

A final paper of the thesis discusses how models can be improved even when one is restricted to a fundamentally imperfect model class. Minimum entropy rate simplification (MERS), an information-theoretic scheme for postprocessing models for generative applications involving both speech and text, is introduced. MERS reduces the entropy rate of a model while remaining as close as possible to the starting model. This is shown to produce simplified models that concentrate on the most common and characteristic behaviours, and provides a continuum of simplifications between the original model and zero-entropy, completely predictable output. As the tails of fitted distributions may be inflated by noise or empirical variability that a model has failed to capture, MERS's ability to concentrate on high-probability output is also demonstrated to be useful for denoising models trained on disturbed data.

Keywords: Time series, acoustic modelling, speech synthesis, stochastic processes, causal-state splitting reconstruction, robust causal states, pattern discovery, Markov models, HMMs, nonparametric models, Gaussian processes, Gaussian process dynamical models, nonlinear Kalman filters, information theory, minimum entropy rate simplification, kernel density estimation, time-series bootstrap.


List of Papers

The thesis is based on the following papers:

[A] G. E. Henter, M. R. Frean, and W. B. Kleijn, "Gaussian Process Dynamical Models for Nonparametric Speech Representation and Synthesis," in Proc. ICASSP, 2012, pp. 4505–4508.

[B] G. E. Henter and W. B. Kleijn, "Picking Up the Pieces: Causal States in Noisy Data, and How to Recover Them," Pattern Recogn. Lett., vol. 34, no. 5, pp. 587–594, 2013.

[C] G. E. Henter and W. B. Kleijn, "Minimum Entropy Rate Simplification of Stochastic Processes," IEEE T. Pattern Anal., submitted.

[D] G. E. Henter, A. Leijon, and W. B. Kleijn, "Kernel Density Estimation-Based Markov Models with Hidden State," manuscript in preparation.


In addition to papers A–D, the following papers have also been produced in part by the author of the thesis:

[E] G. E. Henter and W. B. Kleijn, "Simplified Probability Models for Generative Tasks: A Rate-Distortion Approach," in Proc. EUSIPCO, vol. 18, 2010, pp. 1159–1163.

[F] G. E. Henter and W. B. Kleijn, "Intermediate-State HMMs to Capture Continuously-Changing Signal Features," in Proc. Interspeech, vol. 12, 2011, pp. 1817–1820.

[G] P. N. Petkov, W. B. Kleijn, and G. E. Henter, "Enhancing Subjective Speech Intelligibility Using a Statistical Model of Speech," in Proc. Interspeech, vol. 13, 2012, pp. 166–169.

[H] P. N. Petkov, G. E. Henter, and W. B. Kleijn, "Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise," IEEE T. Audio Speech, vol. 21, no. 5, pp. 1035–1045, 2013.


Acknowledgements

While this thesis may have just one name printed on the front page, it owes its existence to the input and support of many people, and I am grateful for each and every one. First and foremost, I must give thanks to my two scientific guiding stars at SIP who offered me the opportunity to pursue a PhD, and first among them my principal supervisor, supermind Professor Bastiaan Kleijn: Your scientific creativity is immense and inspiring, your considerable experience is dispensed in lucid and compact pearls of wisdom, your formidable skill in debate has helped hone mine, and your editing capabilities and sense for greatness in scientific presentation are second to none, with our papers being much stronger as a result. Thank you for your immeasurable teachings, and for developing me from a student into a researcher. I hope you can see hints of yourself reflected back in this thesis.

Next, I am deeply grateful to my second supervisor, Professor Arne Leijon: Your abilities as a scientist and engineer, your didactic skills, your approachability, and your dependability are as great as your humility. Thank you for turning me from a student into a teacher. It has been a delight to work with you on the Pattern Recognition course, and I value your opinion immensely, whether in science or outside of it.

Beside my two supervisors, I am also indebted to many other seniors in the department, including Professor Markus Flierl, Doctor Saikat Chatterjee, and Professor Mikael Skoglund—not only directly, but also for their work in keeping electrical engineering at KTH a thriving research environment. On that note, I also must thank Dora Söderberg for her happy assistance with administrative matters. Additionally, I am grateful to Professor Ragnar Thobaben for helpfully performing quality review of the thesis.

Next I must thank my coworkers at SIP, and lately Communication Theory, who have made my years as a PhD student rewarding both scientifically and socially. Aside from Ermin, Konrad, Pravin, Zhanyu, and Sebastian, with whom I've had the pleasure of sharing an office, special thanks go to all who have worked on their sound and imaging PhDs alongside me, including, alphabetically, Anders, Chris, David Z, Du, Guoqiang, Haopeng, Jalil, Jan, Janusz, Minyue, Nasser, Obada, Petko (also a scientific collaborator), and Svante. Among the post-docs, Cees, Hannes, and Timo also come to mind. Our many interesting discussions at lunch, around the fika table, at seminars, and on many other occasions will remain a fond memory of mine.

Speaking of interesting discussions, I would be remiss if I did not mention my MSc student Andreas and the many inquisitive, ambitious, and creative students that I've interacted with in the Pattern Recognition course. What I learned through you in my teaching efforts has sharpened my scientific understanding and contributed tangibly to my research.

Outside the confines of KTH, I would like to express my gratitude to those who helped make my visit to Wellington enjoyable, including Bastiaan, officemate Yusuke, scientific collaborator Marcus, as well as Kristi and Daniel D. Thanks also go to the scientists in other countries whom I've had the pleasure of working with on the ACORNS and LISTA projects, with a particular mention going to Simon and Cassie at the University of Edinburgh, who have generously decided to provide me with employment for the coming year. I also thank Matt in Cambridge for thoughtful e-mail conversations.

Leaving the scientific domain, I wish to extend many warm thanks to my friends and peers, both in Sweden and across the globe, who have helped keep my life fun and interesting during the PhD. This includes Leigh, Leon, and Michael in the Pacific Northwest; Daniel P and Richard in California; Natalie in Quebec; Brian in Oklahoma; the eminent Douglas and the crew on the East Coast; Danny in France; Philip in Germany; Donna in Denmark; and Eivind in Norway.

In my home base of Sweden I am grateful to a multitude of friends hailing from different circles and eras. This includes peers from my education, with a particular nod to the class at Höglandsskolan, including Gabriel, Mattias, Erik, Rikard, Gustav N, Ulf, Tobias, Björn, Johanna, Jonas, and Clara, studymates Fritjof, Eric, and Torbjörn from KTH, as well as more recent good friends such as Joel, Patrik, Agata, Antonio, and David B, to name just a few. I hope I can deepen these friendships and continue to add new ones in years to come.

I would not be who I am today if not for my educators in school and at university. Suffice it to say that, among many, I am particularly grateful to Agneta, Katarina, and Leena for their positive and enduring influence on my life. I also have had the fortune of growing up in the context of a large and stable extended family, with grandparents Ave, Maud, Berit, and Edgar, who always have had a keen interest in what I've been doing.

Having peeled away many layers of close friends and acquaintances, we get to the core of being. While for the scientific content of this dissertation I am most indebted to my supervisors, the existence of the thesis is at least as attributable to four people whom I can always count on, both in times of joy and in times of need. These are Gabriel, my friend for more than half my life, my father Jan-Inge and my mother Maria—givers of life and constant sources of wisdom, love, and support from childhood until the present day—and my brother and best friend Viking. Thank you for believing in me even when I did not. Without your encouragement, this thesis would not have come to pass.

Gustav Eje Henter
Edinburgh, October 2013


Contents

Abstract

List of Papers

Acknowledgements

Contents

Acronyms and Abbreviations

I Introduction

1 The Importance of Sequence Modelling
  1.1 Probability Theory
  1.2 Overview of Sequence Data Applications
  1.3 Applications in Speech and Language
  1.4 Thesis Disposition
2 The Data
  2.1 Preliminary Notation
  2.2 Stochastic Processes
  2.3 Finite and Infinite Duration
  2.4 Sample Interdependence
  2.5 The Time Dimension
3 Solving Problems Using Data
  3.1 Task Archetypes
    3.1.1 Classification
    3.1.2 Synthesis
    3.1.3 Regression and Prediction
    3.1.4 Estimation and Model Selection
  3.2 Loss Functions
  3.3 Optimal Decisions under Uncertainty
    3.3.1 Supervised Problems
    3.3.2 Unsupervised Problems
  3.4 Task Constraints
  3.5 The Necessity of Assumptions
  3.6 Data-Based Decision Making
    3.6.1 Unsupervised Problems
    3.6.2 Supervised Problems
    3.6.3 Discriminative Procedures
    3.6.4 Computational Considerations
4 Common Model Paradigms
  4.1 Parametric vs. Nonparametric
  4.2 Generative vs. Discriminative
  4.3 Probabilistic vs. Geometric
  4.4 Fully Bayesian Approaches
  4.5 Mixture Models
  4.6 Ensemble Methods
5 Modelling Sequence Data
  5.1 Stationarity
  5.2 Ergodicity
  5.3 Bayesian Networks
  5.4 Markov Processes
    5.4.1 Markov Chains
    5.4.2 Continuous-Valued Markov Models
    5.4.3 Variable-Order Models
  5.5 Long-Memory Models
    5.5.1 Hidden-Markov Models and Kalman Filters
    5.5.2 Models Incorporating Neural Networks
    5.5.3 Hidden Semi-Markov Models
  5.6 Combining Models and Paradigms
  5.7 Finite Sequences
  5.8 Other Approaches
6 Solving Practical Problems
  6.1 Solution Procedure
  6.2 Feature Engineering
    6.2.1 Information Removal
    6.2.2 Exposing Problem Structure
    6.2.3 Finding Fitting Features
  6.3 Bias, Variance, and Overfitting
  6.4 Preprocessing and Postprocessing
    6.4.1 Domain Adaptation
    6.4.2 Smoothing
7 The Thesis Research
  7.1 Thesis Context
  7.2 Overview of Work
  7.3 Summary of Contributions and Outlook
References

II Included papers

A Gaussian Process Dynamical Models for Nonparametric Speech Representation and Synthesis
  1 Introduction
  2 Introducing GPDMs for Speech
    2.1 Continuous, Multidimensional State-Spaces
    2.2 Gaussian Process Dynamical Models
  3 Implementing GPDMs for Speech
    3.1 Feature Representation
    3.2 Covariance Functions
    3.3 Advanced Initialization
  4 Experiments
    4.1 Speech Generation
    4.2 Speech Representation
  5 Conclusions and Future Work
  References

B Picking Up the Pieces: Causal States in Noisy Data, and How to Recover Them
  1 Introduction
  2 Background
    2.1 Causal-State Systems
    2.2 CSSR
  3 Limitations of CSSR
    3.1 Practical Consequences
    3.2 An HMM Non-Learnability Criterion
    3.3 Checking Non-Learnability
  4 Robust Causal States
    4.1 Robust Homogenization
    4.2 Recovering Causal Structure
  5 A Practical Example
    5.1 The Flip Process
    5.2 Recovering the Flip Process
  6 Related Approaches
  7 Conclusions
  References
  A Proof of Theorem 1
  B Proof of Theorem 2
    B.1 Distances in Robust Homogenization
    B.2 Limiting Behavior
  C Causal States of the Flip Process
    C.1 First Parts of the Theorem
    C.2 The Final Point of the Theorem

C Minimum Entropy Rate Simplification of Stochastic Processes
  1 Introduction
  2 Background
    2.1 Task-Appropriate Post-Processing
    2.2 Relations to Sparsity and Denoising
  3 Minimum Entropy Rate Simplification
    3.1 Preliminary Definitions
    3.2 Quantifying Simplicity
    3.3 Preventing Oversimplification
    3.4 The General MERS Formulation
  4 MERS for Gaussian Processes
    4.1 Purely Nondeterministic Processes
    4.2 Weighted Itakura-Saito Divergence
    4.3 Conserving the Variance
    4.4 General Solution for Gaussian Processes
    4.5 Relation to the Wiener Filter
  5 MERS for Markov Chains
    5.1 General Solution for Markov Chains
    5.2 Solution Properties
  6 Experiments
    6.1 Thresholding-Based Simplification
    6.2 Simplifying a Meteorological Model
    6.3 Simplifying a Text Model
    6.4 Denoising a Speech Grammar
  7 Conclusions and Future Work
  References
  A Supplementary Derivations for Gaussian MERS
    A.1 Gaussian MERS Solution
    A.2 Weighted Itakura-Saito Divergence MERS
    A.3 Variance-Constrained Gaussian MERS
    A.4 Translation Invariance
  B Supplementary Material for Markov Chain MERS
    B.1 Bigram Matrix MERS Formulation
    B.2 Markov Chain MERS Solution

D Kernel Density Estimation-Based Markov Models with Hidden State
  1 Introduction
  2 Background
    2.1 Markov Models
    2.2 Hidden-State Models
    2.3 Combined Models
  3 KDE Models for Time Series
    3.1 Kernel Density Estimation
    3.2 Kernel Conditional Density Estimation
    3.3 KDE Markov Models
    3.4 KDE-MM Data Generation
    3.5 KDE Hidden Markov Models
  4 Parameter Estimation
    4.1 KDE Likelihood Function
    4.2 Expectation Maximization for KDE-HMMs
    4.3 Generalized EM Update Formulas for KDE-HMM
    4.4 Accelerated Updates
    4.5 KDE-MM Bandwidth Selection
    4.6 Initialization and Summary
  5 Experiments on Synthetic Data
    5.1 Data Series
    5.2 Experiment Setup
    5.3 Analysis
  6 Experiments on Real-Life Data
    6.1 Data Series
    6.2 Markov-Model Experiment
    6.3 Hidden-Markov Experiment
    6.4 Data Generation Experiment
  7 Conclusions and Future Work
  References


Acronyms and Abbreviations

ACORNS Acquisition of Communication and Recognition Skills

AD-converter Analog-to-Digital converter

AIC Akaike Information Criterion

AMISE Asymptotic Mean Integrated Squared Error

AR Autoregressive

AR-HMM Autoregressive Hidden Markov Model

ARMA Autoregressive Moving Average

ASR Automatic Speech Recognition

BIC Bayesian Information Criterion

CCSA Clustered Causal-State Algorithm

CELP Code-Excited Linear Prediction

CMN Cepstral Mean Normalization

CSSR Causal-State Splitting Reconstruction

CTW Context Tree Weighting

DAG Directed Acyclic Graph

DNA Deoxyribonucleic Acid

DNN Deep Neural Network

EBW Extended Baum-Welch

ECG Electrocardiogram


EM Expectation Maximization

FET Future and Emerging Technologies

GARCH Generalized Autoregressive Conditional Heteroskedasticity

GMM Gaussian Mixture Model

GP Gaussian Process

GPDM Gaussian Process Dynamical Model

GP-LVM Gaussian Process Latent Variable Model

HMM Hidden Markov Model

HSMM Hidden Semi-Markov Model

HTS HMM-based Speech Synthesis system

IID, i.i.d. Independent and Identically Distributed

KCDE Kernel Conditional Density Estimation

KDE Kernel Density Estimation

KDE-HMM KDE Hidden Markov Model

KDE/HMM KDE Hidden Markov Model (special case of KDE-HMM)

KDE-MM KDE Markov Model

KL-divergence Kullback-Leibler divergence

LISTA The Listening Talker

LSTM Long Short-Term Memory

MA Moving Average

MAP Maximum A-Posteriori

MCE Minimum Classification Error

MCMC Markov Chain Monte Carlo

MERS Minimum Entropy Rate Simplification

MFCC Mel-Frequency Cepstral Coefficient

MFD Markov Forecast Density

MGE Minimum Generation Error

ML Maximum Likelihood


MLP Multilayer Perceptron

MLPG Maximum Likelihood Parameter Generation

MM Minorize-Maximization, Markov Model

MMI Maximum Mutual Information

MRF Markov Random Field

MSE Mean Squared Error

MUSHRA Multiple Stimuli with Hidden Reference and Anchor

NLP Natural Language Processing

NP-hard Non-deterministic Polynomial-time hard

PAC learning Probably Approximately Correct learning

PCA Principal Component Analysis

pdf Probability Density Function

PDFA Probabilistic Deterministic Finite Automaton

pmf Probability Mass Function

POMDP Partially Observable Markov Decision Process

POS Part-of-Speech

PPCA Probabilistic Principal Component Analysis

PPM Prediction by Partial Matching

RBF Radial Basis Function

RBM Restricted Boltzmann Machine

RCS Robust Causal States

RMS Root Mean Square

RNN Recurrent Neural Network

RVM Relevance Vector Machine

RV Random Variable

SETAR Self-Exciting Threshold Autoregressive

SNR Signal-to-Noise Ratio

STRAIGHT Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum


SVM Support Vector Machine

TDNN Time-Delay Neural Network

VLMM Variable-Length Markov Model

VOM Variable-Order Markov Model

VTLN Vocal Tract Length Normalization

WSS Wide-Sense Stationary


Part I

Introduction


Introduction

1 The Importance of Sequence Modelling

In the last few centuries, human society has evolved at a pace unprecedented in the historical record. This development has propelled humanity from a pre-industrial civilization to a technologically advanced information society, leaving hardly any aspect of life untouched.

A driving force in the recent societal evolution has been broad and systematic application of the scientific method of iteratively forming hypotheses and validating these against natural data, rather than relying on intuition and subjective beliefs alone. Identifying more accurate and appropriate models of observed patterns and phenomena has allowed scientists and engineers to better predict and shape the world. Such breakthroughs, in turn, have enabled data collection of higher accuracy and greater scope than before, creating a feedback loop. Along the way, scientists and engineers have amassed an ever-growing toolchest of methods for analyzing and describing natural data. The purpose of this thesis is to present and expand on some of these tools relevant for probabilistic sequence modelling.

1.1 Probability Theory

Since the beginning of the 20th century, probability theory has seen rapid development and risen to prominence as an invaluable modelling tool in applied sciences. Important early milestones of probability theory include the presentation of Hilbert's so-called sixth problem—a call [1], in 1900, to put physics (including probabilities) on a firm mathematical basis—followed by Kolmogorov's subsequent axiomatization of probability theory [2] in the 1930s. Today, probability theory and randomness sit at the heart of our gold-standard physical theories of the universe, as a key component of quantum-mechanical models; the wave function governed by the famous Schrödinger equation, for instance, is given a fundamentally probabilistic interpretation.

Probability theory has also seen widespread application far beyond the microcosmic realm. An important early promoter of applications of probability theory outside physics was the biologist and statistician Sir R. A. Fisher. At any scale, the real world is fraught with uncertainty, as is the data we gather from it. Probability theory, which provides a principled method to represent, quantify, and perform computation under uncertainty, is therefore of interest.¹ Some, such as Jaynes [4], have taken this reasoning further, and interpret probability theory as an extension of logic to the case where we do not have sufficient information. Cox's theorem [5, 6] is an attempt to derive this correspondence from first principles.

¹ Other techniques for quantifying uncertainty also exist, e.g., fuzzy logic, which allows non-binary (partial) set membership [3].

At present, probability theory permeates much of the scientific theory and methodology in many fields of science—natural and social—including speech and language as considered in this thesis.

1.2 Overview of Sequence Data Applications

Time-dependent data series are everywhere. They are in the stars, and in the Earth. They appear in our society, and within ourselves. In astronomy, we encounter series showing periodic patterns of sunspots [7] or the transient intensity profiles of gamma ray bursts and many other sources [8]. In geology and meteorology, time dependence is ubiquitous in, for instance, temperature series and rainfall data [9, 10]. In biology and ecology, time series map the rise and fall of predator and prey populations [11], and appear as sampled acoustic recordings of marine and terrestrial life [12]. In military as well as civilian life, sonar and radar data take the shape of time series. In economy and finance, time chronicles the wealth of nations as well as the fluctuating prices of stocks and other financial instruments [13, 14, 15]. In social sciences, political approval ratings and crime statistics, for instance, change over time [16]. In physiology and medicine, electrocardiograms and electroencephalograms are some well-known time-dependent signals, as are epidemiological data. In culture and entertainment, series-data manifestations include music (scores and performances), motion-capture data for computer graphics, as well as sports results.

Not all series data are necessarily time-dependent, either. The base-pair sequences in DNA and RNA are important examples of data series indexed by space rather than time. Other non-temporal examples include quantities measured along spatial paths, for instance elevation profiles of roads.

1.3 Applications in Speech and Language

A particularly rich trove of sequence data concerns human communication in different forms, most prominently speech and text. Our interest in collecting and analyzing such data is not surprising: we use speech and language to express and communicate abstract thought. In essence, these processes map out the spaces where our thoughts reside. There is even evidence that language shapes our cognition [17], and language development has (through the "social brain hypothesis") been proposed as an intimate correlate of brain size in human evolution [18].

Much research effort in signal processing, natural language processing, linguistics, machine learning, and artificial intelligence has been devoted to devising models of speech and text. Such models are also of interest in fields such as neurobiology and psychoacoustics, for instance for speech perception research [19].

In some cases, breakthroughs in speech and text modelling have ushered in widespread social and societal changes. The speech modelling and signal processing (source coding) necessary for compressing and transmitting speech across cellular phone networks (e.g., CELP [20]), in particular, has had substantial impact in shaping modern society. In other areas, the promise of automatic speech and language processing technology has not been realized in full, and is subject to ongoing research. This includes many speech and language understanding tasks such as automatic speech recognition, or text translation and summarization, where machines still lag behind human performance considerably, except, perhaps, in the most narrowly-defined tasks [21].² Another interesting example is speech synthesis, where artificial speech can be at least as intelligible as human speech, while the naturalness of synthetic speech remains noticeably inferior to human speech [23, 24].

² The curious fact that computers excel at many tasks that cause humans great difficulty, but have problems with tasks we humans find easy, is known as Moravec's paradox [22].

Interestingly, models and tools developed for speech and language data have frequently been useful in other applications as well. In fact, speech and language applications have been a driving force in the development of several widely used sequence-modelling paradigms, such as hidden Markov models and the recent efforts to use deep neural networks for time series modelling [25] (both of which will be discussed in later sections).

1.4 Thesis Disposition

As seen in section 1.2, series data comes from many sources, and our study of such data can have many goals. Given the myriad applications, there exists a vast body of different techniques for modelling such data. Not every method is suited to every application, however, and it is therefore important to select an approach that is appropriate for the task at hand. Discrete-valued data, for example, is typically handled using different methods than continuous-valued series. More subtle differences between applications exist as well: for instance, it turns out that optimal parameter estimates in resolution-constrained source coding are different from those in entropy-constrained coding [26]. One aim of this thesis is to also consider in which situations different techniques are preferable, for instance in classification versus synthesis applications.

The remainder of this thesis introduction provides a more in-depth overview of sequence models with applications to speech and language. Specifically, section 2 defines stochastic processes, and describes some properties that set sequence data apart from other datasets. Section 3 then outlines characteristics that define various problem classes of interest, and how we may use models to solve the problems in the face of uncertainty. A taxonomy of different types of models is presented in section 4, followed by a closer study of common sequence-modelling paradigms and some associated assumptions in section 5. Section 6 then discusses the surrounding procedure and considerations involved in applying the models in practice. Section 7, finally, sets the stage for the manuscripts that constitute the body of the thesis and concludes the introduction by discussing the results and relations between the papers.

2 The Data

The goal of this section is to introduce stochastic processes as a general framework for representing sources of random sequence data. This entails defining stochastic processes (sections 2.1 and 2.2), discussing finite- and infinite-duration data sources (section 2.3), and outlining the one-dimensional between-sample dependence properties that distinguish discrete-time processes from many other data sources (sections 2.4 and 2.5).

2.1 Preliminary Notation

We begin by introducing some standard notation. In the following, capital letters, e.g., $X$, denote random variables (RVs), while lower-case letters identify specific, nonrandom realizations $x$ of the random variables, for instance an observed sample value. Boldface indicates (possibly) vector- or matrix-valued variables, scalars are non-bold, while curly fonts generally identify sets. In particular, $\mathcal{X}$ will represent the state space of $X$, the set of values the random variable can take, i.e., $x \in \mathcal{X}$.

The function $f$ will represent both probability density functions (pdfs) of continuous-valued random variables and probability mass functions (pmfs) of discrete-valued RVs, depending on context. To distinguish different distributions, we will write, e.g., $f_X(x)$ to represent the pdf or pmf of $X$ evaluated at $x$. $f_X(x)$, $f_X(\cdot)$, or the shorthand $f_X$ may sometimes be used to represent the entire distribution, that is, the function evaluated at all $x \in \mathcal{X}$. Conditional distributions, for example of $X$ given $\Theta = \theta$, are written as $f_{X \mid \Theta}(x \mid \theta)$, while $f_X(x;\, \theta)$ with a semicolon represents a dependence on a non-stochastic, but possibly unknown, quantity or parameter $\theta$; these two examples also illustrate how Bayesian and frequentist interpretations are differentiated. The set of $x$ for which $f_X(x)$ is nonzero is the support of $X$; it is a (not necessarily strict) subset of $\mathcal{X}$. Occasionally, the somewhat loose notation $\{X\}$ may be used in pdfs or pmfs to indicate a set of random variables, each taking values on the same space $\mathcal{X}$.

2.2 Stochastic Processes

Simply put, a stochastic process is a collection of possibly dependent random variables $X_t$ that share the same observation space or state space, here $\mathcal{X}$. The variables in the set are indexed by elements $t$ of some index set $\mathcal{T}$, so we can think of the stochastic process as the set $\{X_t : t \in \mathcal{T}\}$.³

³ An alternative interpretation is that a stochastic process is a function-valued random variable, that is, a distribution over the space of functions $x(t)$, where the index set $\mathcal{T}$ defines the domain of the function while the state space $\mathcal{X}$ defines its range. This "function-space view" can be particularly helpful for processes on continuous index sets, such as Gaussian processes as discussed in [27].

A rigorous definition of a stochastic process requires a measurable space $(\mathcal{X}, \Sigma_{\mathcal{X}})$ and a probability space $(\Omega, \Sigma_{\Omega}, f)$, where $\Omega$ is a sample space, $\Sigma_S \subseteq 2^S$ signifies a sigma algebra over a set $S$, and $f : \Sigma_{\Omega} \to [0, 1]$ is a measurable function assigning consistent probabilities to the events (elements) in $\Sigma_{\Omega}$. A stochastic process is then a set of random variables $\{X_t : t \in \mathcal{T}\}$ taking values on $\mathcal{X}$. For more on this formal treatment, see, e.g., [28].
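To make the abstraction concrete, here is a minimal sketch, not from the thesis itself, of a discrete-time stochastic process realized in code: a stationary Gaussian AR(1) model (a construction revisited in section 5.4). The function and parameter names are our own illustrative choices.

```python
import numpy as np

def sample_ar1(T, a=0.9, sigma=1.0, seed=0):
    """Draw one realization (x_1, ..., x_T) of a discrete-time stochastic
    process: a stationary Gaussian AR(1), X_t = a*X_{t-1} + W_t with
    W_t ~ N(0, sigma^2) i.i.d. The index set is {1, ..., T} and the
    state space is the real line."""
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    # Start from the stationary marginal so all X_t share one distribution.
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - a**2))
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.normal(0.0, sigma)
    return x

x = sample_ar1(1000)  # an observed time series: one realization of the process
```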

Throughout this thesis, we will apply a probabilistic perspective, and consider the data under examination to be generated by a stochastic process of some sort. (This does not necessarily imply that one needs to use probabilistic methods to solve sequence-related tasks later on.) Discrete-time sequence data, specifically, is characterized by stochastic processes where the index set is scalar, fully ordered, and discrete. We will assume that this set is integer-valued: either the positive integers for processes with a specific starting point, or all of $\mathbb{Z}$ for bi-infinite sequences. (Other index sets for sequences can generally be mapped onto these in an order-preserving fashion.) We will term the resulting stochastic processes time series models, in contrast to time series, which are observed, nonrandom data sequences.⁴

⁴ Note that continuous-time processes ($\mathcal{T} = \mathbb{R}$, say) can be sampled by picking out a discrete subset of the random variables contained in the process. If limitations are imposed on the set of possible realizations of the continuous process, the samples may sometimes be sufficient for reconstructing the full, continuous-time function—an example is the case of equidistant sampling from a band-limited function, which can be reconstructed perfectly from samples with a sufficiently high sampling frequency, as per the sampling theorem [29].

Unlike the index set $\mathcal{T}$, few constraints will be imposed on the set of possible values $\mathcal{X}$. Nevertheless, as with regular random variables, the nature of the state space has a substantial impact on the nature of the process, as well as on which modelling techniques are preferable. Generally speaking, discrete-valued processes over finite alphabets are the simplest to model, but countably infinite alphabets, e.g., $\mathcal{X} = \mathbb{Z}$, are also possible. For continuous-valued processes, where $\mathcal{X} \subseteq \mathbb{R}^D$ in general, it is common to make assumptions on the marginal pdfs $f_{X_t}(x_t)$ such as continuity or a specific parametric shape.

As a point of notation, we will from now on write $X$ to denote a discrete-time sequence of random variables—typically distributed according to some time series model—and $x$ to signify a realized string of observations, e.g., a sampled series. In particular, we may write $X$ to represent the stochastic process itself. In situations where we want to refer to particular random variables or observations from the process, we shall write $X_\tau$, where $\tau \subseteq \mathcal{T}$ is a set of time indices. The special notation

$X_{t_1}^{t_2} = (X_{t_1},\, X_{t_1+1},\, \ldots,\, X_{t_2})$   (1)

identifies a contiguous sequence or sample from $t_1$ up to and including $t_2$, i.e., $X_{t_1}^{t_2} = X_\tau$ with

$\tau = \{t_1,\, t_1 + 1,\, \ldots,\, t_2\}$.   (2)

The capital letter $T$ is frequently used to denote the end time of a sequence; thus $X_1^T$ is a set of $T$ random variables, not a vector transpose, which would be written $X^{\intercal}$. In rare cases, we write $x^{(n)}$ to refer to a specific sequence (the $n$th sequence in some set), in contrast to $x_n$, which identifies a single sampled value. A special notation, finally, identifies the set of all possible sequences and sub-sequences with elements taking values on $\mathcal{X}$; essentially this is $\mathcal{X}^{2^{\mathcal{T}}}$ (or a subset if only contiguous sub-sequences are considered), $2^{\mathcal{T}}$ being the powerset of $\mathcal{T}$, representing all possible index sequences.

2.3 Finite and Infinite Duration

Some practical data sources only generate observations for a short time, and, as a consequence, the length of observed sequences $x$ may be limited by the source itself. Other processes keep going forever, at least in theory. We distinguish these as finite-duration and infinite-duration stochastic processes, respectively. We can define finite-duration processes based on infinite-duration processes with the positive integers as index set, where we also introduce a special output (observation) symbol $\varnothing$, representing "no output." Finite-duration processes are then, in principle, the processes supported on

$\mathcal{X}_{\mathrm{fin}} = \{\, x \in (\mathcal{X} \cup \{\varnothing\})^{\mathcal{T}} : t,\, t' \in \mathcal{T} \,\wedge\, x_t = \varnothing \,\wedge\, t' \geq t \;\Rightarrow\; x_{t'} = \varnothing \,\}$,   (3)

the set of infinite sequences where all values beyond the first $\varnothing$ also are $\varnothing$. Since $\varnothing$ represents no output and the sequence ending, one generally works with finite sequences $x_1^T$ in practice; whether the considered sequence actually ended at $T$, or potentially continues beyond that point, is sometimes ambiguous and has to be inferred from the context.

Finite-duration sequences can be created by chopping up infinite-duration sequences at specific points, or otherwise extracting finite segments from an infinite series, ignoring any dependences between the segments. Sentences, for instance, are often seen as realizations of a finite-duration process, even if they are part of a continuous stream of text or speech.
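As a toy illustration of this construction (our own sketch; the alphabet and termination probability are invented for the example, and the implicit "no output" symbol $\varnothing$ simply stops the emission loop):

```python
import numpy as np

# A toy finite-duration process over a hypothetical three-symbol alphabet.
# Once the "no output" event occurs, nothing further is emitted, so only
# the finite prefix x_1^T is kept.
ALPHABET = ["a", "b", "c"]

def sample_finite_sequence(p_end=0.1, seed=0):
    rng = np.random.default_rng(seed)
    seq = []
    while rng.random() >= p_end:  # with probability p_end, emit "no output"
        seq.append(ALPHABET[rng.integers(len(ALPHABET))])
    return seq                    # the finite sequence x_1^T

print(sample_finite_sequence())
```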

2.4 Sample Interdependence

While many elementary probability-theory problems assume that data samples are mutually independent, stochastic processes can be used as a framework for representing dependence between a set of comparable observations. Since it is common in practice for data samples to exhibit interdependence, stochastic processes are very useful modelling tools.⁵

⁵ From a philosophical point of view, it can even be argued that no events in the world are truly independent, since the quantum-mechanical wave function of a particle generally has infinite support except when confined to infinitely deep potential wells. This makes it theoretically possible, though prohibitively unlikely, for particles to influence each other at arbitrarily large distances.

Dependence between variables, as for instance described by stochastic processes, is a double-edged sword. If $X$ and $Y$ are two independent random variables, then the joint probability density function factors as

$f_{X,Y}(x, y) \equiv f_X(x)\, f_Y(y)$.   (4)

For dependent variables, this is no longer true. This complicates learning, as a dependent pdf essentially has to be learned for all pairs $(x, y)$, which is generally more difficult—compare the pmf $f_X(x)\, f_Y(y)$ of two independent discrete random variables on an $N$-symbol alphabet, which has $2(N - 1)$ degrees of freedom, against the general joint pmf $f_{X,Y}(x, y)$, which has $N^2 - 1$. For $N = 10$, for instance, that is 18 free parameters in the independent case versus 99 in the general case.

Another effect when samples covary is that each new sample contains less new information, because of the redundancy. Consider the extreme case when samples are known to be totally dependent, such that $X_t = X_1 \;\forall t \in \mathcal{T}$: here, all samples will have the same value, and no sample beyond the first observation contributes any additional information about $f_{X_t}$. Hence we cannot accurately infer, e.g., the mean $\mu = \mathrm{E}(X_t)$ even from an infinitely long sample sequence. More generally, it can be shown that when nearby samples are somewhat dependent, but not too strongly so, it is possible to converge on the true mean at the same asymptotic rate as when samples are IID, but with an error that is greater by a constant factor for any large but fixed sample size. In other words, a larger number of samples (by a certain factor) is required to obtain a desired precision, compared to the independent case. A more formal discussion of this property, complete with an example, is provided in section 5.2.
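This behaviour is easy to verify by simulation. The following sketch, our own illustration under the assumption of a Gaussian AR(1) dependence structure, compares the spread of the sample mean for correlated samples against IID samples from the same marginal distribution:

```python
import numpy as np

# Illustrative simulation (not from the thesis): the sample mean of a
# positively correlated AR(1) series converges at the same O(1/sqrt(T))
# rate as IID sampling from the same marginal distribution, but with an
# error that is larger by a constant factor.
rng = np.random.default_rng(0)
a, T, runs = 0.8, 10_000, 200
sigma_x = 1.0 / np.sqrt(1.0 - a**2)  # stationary marginal std of the AR(1)

def ar1_sample_mean():
    x = rng.normal(0.0, sigma_x)  # start in the stationary distribution
    total = x
    for _ in range(T - 1):
        x = a * x + rng.normal()
        total += x
    return total / T

ar1_means = np.array([ar1_sample_mean() for _ in range(runs)])
iid_means = rng.normal(0.0, sigma_x, size=(runs, T)).mean(axis=1)

print("std of sample mean, AR(1):", ar1_means.std())  # roughly 3x the IID value
print("std of sample mean, IID:  ", iid_means.std())
```

For $a = 0.8$, asymptotic theory predicts an inflation factor of $\sqrt{(1+a)/(1-a)} = 3$, which the simulation reproduces up to sampling noise.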

On the other hand, non-independence between random variables is an essential property in many applications, e.g., in machine learning, since dependence allows analysts to infer the properties (distribution) of unobserved quantities from observed ones. For instance, such dependence between observed feature data and class labels is a necessity for informed classification. In the case of time series specifically, dependence between nearby samples can allow more accurate short-term time-series forecasting (prediction of future values from past values), compared to the case where no nearby samples are available. For example, it is much easier to predict tomorrow's weather if one knows what the weather was like today.

2.5 The Time Dimension

As mentioned earlier, we consider stochastic processes where the index set is the integers or a (typically infinite) contiguous subset thereof. This introduces a natural ordering of the random variables $X_t$, where variables are assigned to regularly-spaced points on the real line. For other kinds of stochastic processes, the index set can be grids or lattices, and it is also possible to assign random variables to nodes in a general graph (cf. the framework in section 5.3).

A key property of natural processes is that spatial or temporal ordering reflects a natural tendency of nearby variables to covary more, while variables at great separation are virtually independent. In other words, dependences are localized. For discrete-time sequence data, where variables are arranged along a line, this means that dependences may be expected to have a one-dimensional structure of some kind. (Note that "one-dimensional" in this case only refers to the manner in which variables influence each other, so each individual variable $X_t$ may be a high-dimensional vector, e.g., the pixels in a frame of video.)

Some specific localized dependence structures are introduced in sections 5.4 through 5.6. For now, we point out that these one-dimensional dependences have a common property in that they simplify inference: given sufficient information about the current (and recent) values of the process, the process variables are partitioned into a past set and a future set that are conditionally independent. As a consequence, fast inference algorithms such as the forward recursion in section 5.5 can be developed. The existence of such algorithms is highly significant for the practice of time-series modelling, in that many models and methods that are computationally infeasible in the general case can be, and are, used effectively to solve problems involving sequence data.
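As a minimal illustration of how such conditional-independence structure enables fast inference (our own sketch, using the first-order Markov chains of section 5.4 rather than any particular model from the thesis), the joint probability factorizes so that a sequence log-likelihood is computed in one $O(T)$ pass:

```python
import numpy as np

def markov_chain_loglik(x, init, trans):
    """Log-likelihood of a discrete sequence x under a first-order Markov
    chain: log f(x_1..x_T) = log f(x_1) + sum_t log f(x_t | x_{t-1}).
    The chain rule plus the localized (Markov) dependence turns an
    exponential-size joint distribution into one left-to-right pass."""
    ll = np.log(init[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        ll += np.log(trans[prev, cur])
    return ll

init = np.array([0.5, 0.5])                 # f(x_1)
trans = np.array([[0.9, 0.1], [0.2, 0.8]])  # f(x_t | x_{t-1})
print(markov_chain_loglik([0, 0, 1, 1, 0], init, trans))
```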

In theory, grids and many other graphical network structures may also be partitioned into conditionally independent sets if the values of a sufficient number of observables are given, but this number usually grows with graph size, and generally does not lend itself to developing efficient algorithms.

3 Solving Problems Using Data

Our interest in sequence data stems from the fact that it carries information about the process that generated it. Real-world data can therefore be used to solve practical problems. While the details may vary between applications, this section provides an overview of different tasks, what sets them apart, and how, in principle, we can use data to solve them.

We begin by characterizing different types of tasks (section 3.1), and then consider methods for quantifying task performance (section 3.2) and making optimal decisions under uncertainty (section 3.3). A discussion of different constraints which may impose restrictions on how problems are solved is given in section 3.4. Section 3.5 discusses the necessity of making assumptions, such that decisions can be made based on empirical data (section 3.6).

3.1 Task Archetypes

The central task of applied problem-solving can generally be phrased as making a decision of some kind, based on the available data—after all, if we already have decided what to do, and cannot be influenced by data on the problem, there is rarely any interest in further analysis.

A function that accepts data as input and returns a decision is known as a decision function or decision rule. In this context, a decision means choosing an element from a set of different possibilities. If the decision function is not completely manually specified, but also has a direct dependence on other data from related decision situations, it may also be called a learner. The ability not to decide based on the prior beliefs of the analyst alone, but to also provide an explicit entry point for information gleaned from the real world to guide the decisions, has proved to be an extraordinarily powerful and versatile approach to successful decision making. A more extensive discussion of statistical decision theory can be found in [30].

Some common task archetypes, which differ in the nature of the inputs and the outputs of the decision function, are classification, synthesis, regression, prediction, and estimation. These are outlined below in turn.

3.1.1 Classification

In a classification problem, the training data $\mathcal{D}$ takes the form of labelled examples, that is, pairs of observed features $x$ and an associated label variable $c$:

$\mathcal{D} = \{(x_n,\, c_n)\}_{n=1}^{N}$,   (5)

where $N$ forthwith represents the sample size. Since we are working with sequence data, we are particularly interested in cases where each observation actually is a sequence,

$x^{(n)} = \left( x^{(n)}_1,\, x^{(n)}_2,\, \ldots,\, x^{(n)}_{T_n} \right)$.   (6)

These sequences may have different lengths ($T_n$ for sequence $n$). The classification task is to predict the unknown labels of new examples where only the feature sequences can be observed. In other words, one should devise a classifier, a function $\hat{c}(x)$ that, given a sequence, returns an estimate of the class label. The set of labels $\mathcal{C}$ (the possible decision choices) is generally discrete and finite; at the very least, we expect $\mathcal{C}$ to have lower cardinality than the set of possible feature sequences.

Two subtypes of classification problems can be identified: in the simplest kind, each feature sequence is associated with exactly one label. Without loss of generality, $c \in \mathcal{C} \subseteq \mathbb{Z}$. Some examples of this setting are isolated word recognition for speech [31, 32], or sentiment classification [33] or spam filtering [34] for text data. In a more general setting, the "label" may actually be a sequence $c$ of discrete labels itself. One example of this is automatic speech recognition (ASR), where the task is to convert speech sequence input to appropriate sequences of words or phones [35]. Another case is part-of-speech (POS) tagging in NLP, e.g., [36], where each word in the input text should be labelled with an appropriate part-of-speech tag. State estimation in automatic control is a third example.

The umbrella of classification is quite broad. Detection problems [37] can be seen as classification tasks, where one is classifying sequences or sequence frames into the categories "signal present" and "signal not present." Certain collaborative filtering tasks can also be formulated as predicting the user-rating label based on other labelled examples [38].
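As an illustrative sketch of generative sequence classification under these definitions (the per-class Markov chains, smoothing constant, and toy data below are our own assumptions, not a method from the included papers), one can fit one model per class and apply a maximum-likelihood decision rule:

```python
import numpy as np

def fit_markov(seqs, K, alpha=1.0):
    """Estimate initial and transition pmfs from sequences over {0..K-1},
    with add-alpha smoothing so unseen transitions keep nonzero mass."""
    init = np.full(K, alpha)
    trans = np.full((K, K), alpha)
    for s in seqs:
        init[s[0]] += 1
        for u, v in zip(s[:-1], s[1:]):
            trans[u, v] += 1
    return init / init.sum(), trans / trans.sum(axis=1, keepdims=True)

def loglik(s, init, trans):
    return np.log(init[s[0]]) + sum(np.log(trans[u, v]) for u, v in zip(s[:-1], s[1:]))

def classify(s, models):
    # Decision rule: pick the class whose model best explains the sequence.
    return max(models, key=lambda c: loglik(s, *models[c]))

# Toy usage with two classes of binary sequences:
models = {
    "steady":      fit_markov([[0, 0, 0, 1, 1, 1], [1, 1, 1, 1, 0, 0]], K=2),
    "alternating": fit_markov([[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]], K=2),
}
print(classify([0, 0, 1, 1, 1, 0], models))  # "steady" should win here
```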

3.1.2 Synthesis

Synthesis problems are essentially classification problems in reverse. The training set $\mathcal{D}$ takes the same feature-label pair form as in equation (5) from before, but the task is different. Specifically, one is given a label $c$ or label sequence $c$, and asked to reconstruct an appropriate observation sequence $\hat{x}$ to go with it. The set of possible choices in the decision problem is then defined by the state space, i.e., $\hat{x}$ is chosen among the conceivable realizations of the stochastic process. This set generally has greater cardinality than the label set, which is the opposite situation from classification. Also, the synthesis output is generally considered observable, whereas the class labels returned by a classifier may not be possible to observe directly. In case the samples $X_t$ are continuous-valued, this is essentially a kind of regression problem.

A prominent synthesis application is speech synthesis, where the task is to create new, synthetic speech signals corresponding to given words or phone sequences (labels) that are to be spoken [39]. The synthetic speech features $\hat{x}_t$, e.g., [40], are generally continuous-valued. Tasks such as machine translation [41] where the output is a text also have an element of synthesis, as the output lives in the same space as the observations. In contrast to speech synthesis, however, text output is discrete, and typically consists of strings of characters or words, $\hat{x}_t \in \mathcal{X}$, $\mathcal{X}$ being a finite alphabet or dictionary.
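Continuing the illustration, synthesis runs the same kind of model in the generative direction: given a label, one draws an observation sequence from the corresponding class-conditional model. A minimal, self-contained sketch with a hand-specified model for a hypothetical label:

```python
import numpy as np

# Synthesis as classification in reverse (our own illustration): given a
# label, draw an observation sequence from that class's generative model.
def synthesize(init, trans, T=8, seed=0):
    rng = np.random.default_rng(seed)
    x = [int(rng.choice(len(init), p=init))]
    for _ in range(T - 1):
        x.append(int(rng.choice(len(init), p=trans[x[-1]])))
    return x

# A hand-specified Markov model for a hypothetical "alternating" label:
init = np.array([0.5, 0.5])
trans = np.array([[0.1, 0.9], [0.9, 0.1]])
print(synthesize(init, trans))  # tends to produce alternating output
```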

3.1.3 Regression and Prediction

Regression is another type of supervised learning problem, generally characterized by finding a functional relationship (which may be stochastic) that relates one continuous-valued random variable to another. The training data takes the form

$\mathcal{D} = \{(x_n,\, y_n)\}_{n=1}^{N}$,   (7)

the task being to reconstruct the dependent variable $y_n$ from the independent variables $x_n$, i.e., to create a predictor $\hat{y}(x)$. (The dependent/independent nomenclature here, although standard, can be confusing, as it does not imply statistical independence among the elements of $X$.) Unlike previous scenarios, $X$ and $Y$ typically take values on spaces of similar cardinality.

We are interested in cases where the datapoints are sequences. A prominent application area involving sequence pair data $(x^{(n)}, y^{(n)})$ is control theory, where one wants to use a series of inputs $x$ to control the current and future outputs $Y$ of some process in order to maintain specific values or satisfy certain constraints [42]. The specific task of identifying the relationship between the input and output series is known as system identification [43].

A notable special case of the above arises when a sequence of observations from some process is available, with the variables of interest for prediction being future values of the same process. In other words, the task is to create a function $\hat{x}_{\tau'}(x_{\tau})$, where $\tau \subset \mathcal{T}$ and $\tau' \subset \mathcal{T}$ here denote disjoint, nonempty sets of time indices. (There is thus no distinction between $X_t$- and $Y_t$-variables anymore, just between different times.) When the times in $\tau'$ exceed those in $\tau$, this prediction situation is frequently known as forecasting. Since one essentially has a regression problem mapping values of a process to other values of the process itself, another term is autoregression (see, in particular, section 5.4). Other variants involve extrapolating data series backwards in time, or filling in missing values (imputation).

Like in synthesis problems, the decision variable in these prediction problems lives in the same space as the observations, and is considered explicitly observable. However, the training data only consists of single data series without explicit labels, similar to the unsupervised data for estimation tasks below.

Forecasting is very common in certain application areas such as meteorology [44], climatology [45], and economics [46], but is only rarely a goal in itself when working with speech or language data. Nevertheless, predictive models have important applications in these areas as well: good short-term speech-signal predictors are of importance for predictive speech coding [20], while predictive models of text can be used as a component of automatic speech recognition systems, to correct misheard or unintelligible words based on context [47].

3.1.4 Estimation and Model Selection

Classification, synthesis, and regression are examples of so-called supervised learning, as the training data contains both observations $x$ and other values, e.g., labels $c$, that interpret them or provide a reference or answer of some kind. In unsupervised learning, only observed data samples $x$ are available, with no interpretation. The most prevalent example of unsupervised learning is model selection of some sort, where, for a given training dataset of independent samples

$\mathcal{D}^{(\mathrm{us})} = \{x_n\}_{n=1}^{N}$ ,   (8)

the task is to select the most appropriate description $m$ of the data from a set of possible stochastic models $\mathcal{M}$, each a distribution $f_m(x)$ over $\mathcal{X}$. (The letters "us" here stand for unsupervised.) Especially relevant to this thesis is the case where the training dataset contains sequences $x^{(n)}$. Each sequence is assumed to be independent, but samples within the same sequence may show dependence.

Typically, the set of models to choose from forms a manifold or a set of manifolds in the space of all possible distributions. If we introduce a coordinate system over the manifold, the problem of model selection then reduces to what is known as parameter estimation, where the parameter $\theta \in \mathcal{P}$ corresponds to the coordinates of a unique model on the manifold. In other words, there is a bijection between $\mathcal{M}$ and $\mathcal{P}$, and we have

$f_m(x) \equiv f(x;\, \theta)$   (9)

for some $m(\theta)$. An estimator is thus a function $\hat{\theta}(\{X\})$ of random observations from a distribution, which we apply to the available (training) data $\mathcal{D}^{(\mathrm{us})}$.⁶

⁶This presentation tends towards the frequentist perspective, where the parameter $\theta$ is treated as an unknown number without any probabilistic interpretation. A deeper look at the alternative Bayesian view, where the parameters are treated as random variables $\Theta$ which covary with other observables, is provided in section 4.4, among others.

Parameter estimation is most frequently encountered as a sub-task in standard approaches to classification or synthesis: first, one uses the training data subsets

$\mathcal{D}_c = \{x_n \in \mathcal{D} : c_n = c\} \quad \forall c \in \mathcal{C}$   (10)

to create a separate model $m_c$ for each label $c$; the resulting set of models is then used to generate appropriate classifier outputs for given inputs. The procedure will be discussed further in sections 3.6 and 6.1, while more comprehensive coverage specifically of estimation is provided in [48]. Most often, the set of parameters $\mathcal{P}$ (possible decision choices) is continuous, as they are coordinates on a manifold. Discrete parameters and estimation problems also exist, for instance order selection [49] (estimating the order parameter of a model, e.g., the degree of a polynomial in a regression model). These typically correspond to selecting one of several possible manifolds in distribution space; such discrete problems are often referred to as model selection rather than parameter estimation.

There are also situations where the estimated value is seen as a goal in itself, such as determining the signal-to-noise ratio of a wireless system [50], estimating the mean value of a population, tracking an object on radar, or identifying the value of fundamental constants in physics. These examples illustrate that estimation problems can also arise in non-probabilistic settings, such as classical physics, and that there need not be a one-to-one mapping between the estimated parameter and the model that generated the data (e.g., many different distributions can have the same mean). In this case, the one-to-one relationship from equation (9) might not apply, and we may instead see the quantity $\theta$ to be estimated as a many-to-one function $\theta(f_X)$ of the actual distribution $f_X$ of the observations. However, it still suffices to estimate $f_X$ to obtain an estimate of $\theta$.

In addition to the situations discussed here, there are many other kinds of decision problems, such as ranking [51] (choosing an ordering of a set of elements) or the related problem of selecting a subset of items. Most of these, however, are less common in sequence data applications, and will therefore not be considered further in this thesis.

3.2 Loss Functions

For a given decision task, we frequently want to select a course of action that is as good as possible under the circumstances. This suggests expressing the task as a mathematical optimization problem. To make this more concrete, we must quantify what constitutes good and bad performance.

A highly general method for evaluating different options in a decision problem is to define a loss function or cost function $L(\hat{y}, y)$ which associates a cost with choosing the element $\hat{y}$ when the optimal choice would have been $y$. (The symbol $y$ here is a stand-in that could denote a class label, observation sequence, or a distribution parameter; we are by no means referring to regression exclusively.) Typically the loss $L(y, y)$ of making the optimal choice is zero, with other alternatives having equal or greater loss. The loss-function concept can also be generalized to situations where the decision and the world state occupy different spaces, though we will not do so explicitly here.

The loss function encodes our priorities between different choices, so the details are highly application dependent. Nevertheless, some common alternatives exist. For classification, it is common to take

$L_{01}(\hat{y}, y) = I(\hat{y} \neq y)$ ,   (11)

where $I(\cdot)$ is an indicator function. This is called 0-1 loss [52]. 0-1 loss considers each deviation from optimality "equally bad," and simply counts the number of misclassifications a decision rule makes when applied to a dataset. On continuous sets, for instance in regression problems, the quadratic loss function

$L_2(\hat{y}, y) = \|\hat{y} - y\|_2^2$   (12)
$= \sum_i (\hat{y}_i - y_i)^2$   (13)

is often used. As this is a continuous function, the derivative of which is linear in $\hat{y}$, it frequently leads to linear equations, making it easier to optimize than 0-1 loss. However, the quadratic loss is not invariant under non-affine transformations of the input variables. For example,

$L_2(0, 1) = 1 = L_2(1, 2)$   (14)
$L_2(\ln 0, \ln 1) = \infty \neq L_2(\ln 1, \ln 2)$ .   (15)

It is therefore important to choose an appropriate data representation when using the quadratic loss.
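As a small illustration of (14) and (15), the following sketch (standard-library Python with made-up scalar values) evaluates the quadratic loss before and after a logarithmic change of representation:

```python
import math

def l2_loss(y_hat, y):
    """Quadratic (squared-error) loss between two scalars, cf. (12)."""
    return (y_hat - y) ** 2

# The pairs (0, 1) and (1, 2) incur identical quadratic loss...
print(l2_loss(0.0, 1.0), l2_loss(1.0, 2.0))    # -> 1.0 1.0

# ...but after a log transform the second pair's loss changes, and the
# first pair's loss diverges entirely, since ln 0 = -infinity.
print(l2_loss(math.log(1.0), math.log(2.0)))   # -> about 0.48, not 1.0
```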

The quadratic loss function is also applicable for comparing densities of random variables. If $f_{\hat{Y}}(y)$ is a selected density, while the true, reference distribution (optimal choice) is $f_Y(y)$, we can simply replace the sum in (12) by an integral, and obtain

$L_2\big(f_{\hat{Y}}(y),\, f_Y(y)\big) = \int \big(f_{\hat{Y}}(y) - f_Y(y)\big)^2\, dy$ .   (16)


Another important loss function for comparing densities is the Kullback-Leibler divergence [53], or KL-divergence for short, which is defined by

$L_{\mathrm{KL}}\big(f_{\hat{Y}}(y),\, f_Y(y)\big) = \int f_Y(y) \ln \frac{f_Y(y)}{f_{\hat{Y}}(y)}\, dy$   (17)
$= -h(Y) - \mathrm{E}\big(\ln f_{\hat{Y}}(Y)\big)$ ,   (18)

where $h$ denotes the differential entropy [54]. Unlike the other loss functions presented, the KL-divergence is asymmetric in its arguments. Specifically, it tends to strongly penalize situations where $f_{\hat{Y}}$ has smaller support than the reference $f_Y$ [55]. The KL-divergence has connections to information theory, and may, for instance, be interpreted as the number of excess bits sent in source coding when using a coding scheme optimal for the distribution $f_{\hat{Y}}$ when the true distribution is $f_Y$.

If it is possible to assign reasonable numerical costs to the various possible errors in an application, that is usually preferable to using a standard loss function like those above. Assessing the costs of different scenarios may be relatively straightforward in financial mathematics, but can be challenging in other areas, where the values of different wins and losses may not be monetary, and thus not necessarily directly comparable.

3.3 Optimal Decisions under Uncertainty

A practical problem with loss functions as presented thus far is that one must know the true optimal decision in order to be able to compute the loss. Obviously, that information is not accessible in a real decision situation, where the optimal course of action is unknown. Instead, one has to act based on other information, available in the form of random variables that have been observed. Such information is seldom sufficient to determine the true optimal course of action with certainty, but induces a conditional probability distribution over the different possible actions, indicating how likely it is that each one of them is the reference option (the one with the lowest cost once the answer is revealed). Based on this distribution, one can form a decision in many different ways:

3.3.1 Supervised Problems

If, in a supervised learning problem, the joint distribution $f_{X,C}(x, c)$ of features and labels is available, it is possible to minimize the expected loss, also known as the risk, given the input data. In fact, risk minimization only requires knowledge about the conditional distribution of the unknown quantity given the knowns, e.g., the conditional distribution $f_{C|X}(c \mid x)$ of labels given the features in classification; the joint distribution is not necessary. The associated minimization problem can be written

$\hat{c}(x) = \arg\min_{c \in \mathcal{C}} \mathrm{E}(L(c, C) \mid X = x)$   (19)
$= \arg\min_{c \in \mathcal{C}} \sum_{c' \in \mathcal{C}} L(c, c')\, f_{C|X}(c' \mid x)$ .   (20)

(Classification will be used as a running example in much of this section. The corresponding problem for other supervised situations is completely analogous.)

In supervised learning problems, minimizing the expected 0-1 loss $L_{01}$ leads to a decision rule of the form

$\hat{c}_{\mathrm{MAP}}(x) = \arg\max_{c \in \mathcal{C}} f_{C|X}(c \mid x)$   (21)
$= \arg\max_{c \in \mathcal{C}} f_{X|C}(x \mid c)\, f_C(c)$ .   (22)

This is known as the maximum a-posteriori (MAP) decision rule, as it selects the most probable alternative given the input data $x$. The term $f_C(c)$, the distribution over the class labels in the absence of any further information about the instance (i.e., before observing $x$), is termed the prior distribution. In the special case when all class labels are equally likely a priori, so that $f_C$ is a uniform distribution, this reduces to

$\hat{c}_{\mathrm{ML}}(x) = \arg\max_{c \in \mathcal{C}} f_{X|C}(x \mid c)$ .   (23)

This is known as the maximum likelihood (ML) rule, as it maximizes the likelihood function $f_{X|C}(x \mid c)$, which is the probability of the observed data as a function of an unobserved quantity or parameter, here $c$. In speech synthesis, the MAP rule of maximizing the probability of the output given the input is widely used, though it is confusingly labelled as "maximum likelihood parameter generation" (MLPG) [56],

$\hat{x}_{\mathrm{MLPG}}(c) = \arg\max_{x \in \mathcal{X}} f_{X|C}(x \mid c)$ .   (24)
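To see how the prior separates the two rules, consider this toy sketch (a two-class problem; the likelihood and prior values are invented and not drawn from any real system):

```python
# Toy comparison of the MAP rule (22) and the ML rule (23).
# The likelihood f_{X|C}(x | c) and prior f_C(c) values are invented.

likelihood = {"yes": 0.30, "no": 0.45}   # f_{X|C}(x | c) for one observed x
prior      = {"yes": 0.80, "no": 0.20}   # f_C(c)

c_ml  = max(likelihood, key=lambda c: likelihood[c])
c_map = max(likelihood, key=lambda c: likelihood[c] * prior[c])

print(c_ml)   # -> "no":  the likelihood alone favours "no"
print(c_map)  # -> "yes": the strong prior overturns the likelihood
```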

For regression-type tasks, the expected quadratic loss, also known as mean square error (MSE), is minimized by the conditional expected value of the missing variable, e.g.,

$\hat{x}_{\mathrm{MSE}}(c) = \arg\min_{x \in \mathcal{X}} \mathrm{E}\big(\|x - X\|_2^2 \mid C = c\big)$   (25)
$= \mathrm{E}(X \mid C = c)$   (26)
$= \int x\, f_{X|C}(x \mid c)\, dx$ .   (27)

(This tacitly assumes that the conditional mean is an element of $\mathcal{X}$, so that it can be selected by the optimization. This is for instance satisfied if $\mathcal{X}$ is convex. If $\mathrm{E}(X \mid C = c) \notin \mathcal{X}$, one may have to project the mean onto $\mathcal{X}$.) For Gaussian distributions, the mean is the same as the mode, and this gives the same result as MAP. There are also other loss functions that are minimized by the same mean value, in particular the so-called Bregman divergences [57, 58], of which the quadratic loss function is a special case.

A more risk-averse alternative to the expected loss is to minimize the worst-case-scenario loss. This can be written

$\hat{c}_{\mathrm{minimax}}(x) = \arg\min_{c \in \mathcal{C}} \sup_{c' \in \mathcal{C}(x)} L(c, c')$ ,   (28)

where $\mathcal{C}(x)$ denotes the set of possible $c$-values given that the input features are $x$. This is known as the minimax criterion, as it minimizes the maximum loss. Notably, the decision does not depend on the exact probability distribution of the output variable given the input, but only on the possible losses that can be incurred for a specific decision (the support of the pdf). The principle is therefore also applicable in cases where there is no explicit probability distribution over the output variable.

3.3.2 Unsupervised Problems

In unsupervised problems, where we want to identify the distribution or other structure of an unlabelled dataset, we can use the quadratic or Kullback-Leibler loss, respectively, to define the two optimization problems

$\hat{f}_2(\cdot) = \arg\min_{g(\cdot) \in \mathcal{M}} \int \big(g(x) - f_X(x)\big)^2\, dx$   (29)
$\hat{f}_{\mathrm{KL}}(\cdot) = \arg\min_{g(\cdot) \in \mathcal{M}} \int f_X(x) \ln \frac{f_X(x)}{g(x)}\, dx$ ,   (30)

where $\mathcal{M}$ is a set of models (probability densities). Like the supervised case above, this requires that the data distribution $f_X$ is known. Note that the problems are nontrivial if $f_X(\cdot) \notin \mathcal{M}$. The set of models $\mathcal{M}$ may, for instance, be a parametric family of distributions $f(x;\, \theta)$ indexed by the parameter $\theta$, so that

$\mathcal{M} = \{f(\cdot;\, \theta)\}_{\theta \in \mathcal{P}}$ .   (31)

The previous optimization problems are then equivalent to the parameter estimation problems

$\hat{\theta}_2 = \arg\min_{\theta \in \mathcal{P}} \int \big(f(x;\, \theta) - f_X(x)\big)^2\, dx$   (32)
$\hat{\theta}_{\mathrm{KL}} = \arg\min_{\theta \in \mathcal{P}} \int f_X(x) \ln \frac{f_X(x)}{f(x;\, \theta)}\, dx$ .   (33)


Among the two loss functions, the KL-divergence is significantly more common in parameter estimation, whereas $L_2$ minimization is the norm in nonparametric density estimation, possibly because it simplifies calculations [59]. Generally speaking, the KL-divergence is more sensitive to capturing the tails of the distribution than is the $L_2$ loss, so a distribution obtained through KL-minimization is often more spread out, whereas $L_2$-minimizers are comparatively more accurate near peaks of $f_X$.

Interestingly, KL-divergence minimization can, similarly to the supervised learning problems earlier, be formulated as minimizing an expected value,

$\hat{\theta}_{\mathrm{KL}} = \arg\min_{\theta \in \mathcal{P}} \mathrm{E}\big(\ln f_X(X) - \ln f(X;\, \theta)\big)$   (34)
$= \arg\max_{\theta \in \mathcal{P}} \mathrm{E}\big(\ln f(X;\, \theta)\big)$ .   (35)

Notice that this amounts to maximizing the (expected) likelihood of the data, and that the derivation and criterion do not require the parameter $\theta$ to be a realization of a random variable, which is appropriate for frequentist methods. The ML-KL connection is explored further in section 3.6.

3.4 Task Constraints

Practical problems are not necessarily always as straightforward as outlined above. Other side constraints on the data and how it is acquired or presented may also affect the solution to the task. Some of these will be discussed in this section.

Sometimes, there are restrictions on the information that is available. In a supervised learning application, label data $c$ may sometimes be expensive to acquire, whereas feature data $x$ comes cheap. Both speech and texts, for instance, may be readily obtained in bulk from the Internet, but accurate annotation (e.g., speech transcription or POS-tagging) requires costly human intervention. In this case, a middle ground between supervised and unsupervised learning, known as semi-supervised learning [60], exists. In semi-supervised learning, only some of the training examples are labelled. Often, it is possible to improve performance over pure supervised learning on the labelled examples only, as the distribution of unlabelled points may carry information about where (for instance) different clusters in the data may be located, which may then be identified as belonging to different classes using the few labelled examples.

Another information restriction is censored data, where, for some datapoints, only a subset of the elements in $x$ are available. And even if all elements are available, they may not be accurate. There could be saturation effects in the features, or noise bursts that make elements unreliable. Particularly challenging are the situations where entire datapoints are misleading, due to labels being incorrect, a situation known as label noise [61].

Another class of constraints surrounds the manner in which data are presented. The classic learning scenario, where a large set of training data is provided, a classifier, synthesizer or other similar problem-solver is created using the data, and the resulting decision function is applied to the task at hand, is known as batch learning. Sometimes, however, data arrives instance by instance in a stream. This is called online learning. In this scenario, only part of the information in a new instance is provided at first, and the learner is required to make a decision based on the incomplete datum (e.g., predict the class from given features); after this, the true answer is revealed, and there is an opportunity to refine the decision function based on the new information.

The online learning situation is more demanding than batch learning, because new information is constantly coming in, and procedures that cannot be updated or generate output efficiently may be unable to keep up. Generally speaking, an online algorithm can always be applied in a batch setting, by presenting the training examples one by one until they have been exhausted, and then applying the resulting problem-solver to the actual task at hand. However, some results on converting batch learning algorithms to an online setting do exist, e.g., [62].

In some online learning settings, it may be possible for the problem-solving algorithm to influence the examples that arrive, for instance by presenting different feature sets and being told the appropriate label. This is known as active learning [63]. As this procedure has similarities with how human infants and children develop, active learning may, among other things, be of interest in artificial intelligence applications.

Though the above restrictions and paradigms are important, the remainder of this thesis will focus on offline, purely supervised or unsupervised learning, specifically on how to create and identify useful models for sequence data of different kinds.

3.5 The Necessity of Assumptions

So far, we have seen how a variety of decision problems involving sequence data can be formulated, and how they can be solved if the behaviour of the data (meaning the distribution) is provided. In practice, the distribution is unknown, and one must make some sort of assumptions to solve the problem. This is a symptom of a general property in science and philosophy that some things have to be taken for granted; even the fundamental building blocks of mathematics are expressed as axioms [64], whose internal consistency cannot be established beyond all doubt [65].

Without making assumptions that connect future decisions with past decisions in the problems we are considering, every new situation is a blank slate, and it is not possible to use past data to aid future decisions. There are theoretical arguments (see, e.g., chapter 2 in [66]) that assumption-free learning cannot generalize to new instances. In machine learning, assumptions are often referred to as inductive bias, since they bias a learner towards certain conclusions, particularly when data is scarce.

The question then becomes what assumptions to make. As usual, the answer depends on the situation. In general, one typically makes assumptions such that the problem is in some way solvable. Preferably, the assumptions should also match the characteristics of the practical situation under consideration.

Beyond this, there is a myriad of different possible situations. A useful view is to place different assumptions on a scale ranging from strong to weak. Strong assumptions are typically necessary when only little data is available. At the extreme end, if we propose a single, specific distribution $f_X$ or $f_{X,C}$, there is no need for training data at all; optimal decisions under the assumed distribution can be computed directly from previously presented principles. The hazards of making such strong assumptions are that they could be wildly wrong, and there is no way for data to steer the decision making in the right direction.

If the training dataset is large, one may consider making weaker assumptions, leaving a bigger hole for the data to fill. The most conservative assumption that can be made is arguably that the data distribution follows the so-called empirical distribution (in equation (36) below), which is supported exclusively on the observations in the given dataset. This does not assume the existence of any possible observations beyond the values seen in the training data. Sadly, this is too conservative and cannot generalize at all to new situations; see equation (45). Nevertheless, the empirical distribution, as a probability distribution which (together with the sample size $N$) uniquely represents the dataset, can be used as a tool for developing techniques for making decisions based on a mixture of assumptions and empirical data, as outlined below.

As a middle ground between no assumptions, and assumptions only, the classic approach to problem solving with data follows the example of estimation and model selection: one assumes that the data was generated by a distribution in a specific set $\mathcal{M}$, and then uses the data to pick a fitting model (or several; cf. sections 4.4 and 4.6) from the set. If the dimensionality of the set is fixed, such that it can be parameterized by a coordinate $\theta \in \mathbb{R}^p$, $\mathcal{M}$ is known as a parametric family, and the models that it contains are called parametric models. If, on the other hand, the dimensionality of the set of possible models grows without bound as the size $N$ of the training dataset increases, the setup is typically considered nonparametric.


3.6 Data-Based Decision Making

We shall now look at techniques for making informed decisions and solving the tasks from section 3.1, using a mixture of data and assumptions. The hope is that, if the assumptions are reasonably close to the true situation, a practically useful solution will result.

3.6.1 Unsupervised Problems

The decision criteria presented in section 3.3 are unrealistic, as they assume that the relevant distribution $f_{X,C}(x, c)$ or $f_X(x)$ is known, whereas in a practical application only training data $\mathcal{D}$ or $\mathcal{D}^{(\mathrm{us})}$ is available. To get around this, one can define the empirical distribution

$\dot{f}(x) = \dot{f}\big(x \mid \mathcal{D}^{(\mathrm{us})}\big) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n)$   (36)

consisting of Dirac spikes at each datapoint. This distribution essentially turns expected values into sums over the available datapoints. Inserting this approximation as the reference distribution into equation (35), one obtains

$\hat{\theta}_{\mathrm{ML}}\big(\mathcal{D}^{(\mathrm{us})}\big) = \arg\max_{\theta \in \mathcal{P}} \frac{1}{N} \sum_{n=1}^{N} \ln f(x_n;\, \theta)$   (37)
$= \arg\max_{\theta \in \mathcal{P}} \ln f\big(\mathcal{D}^{(\mathrm{us})};\, \theta\big)$   (38)
$= \arg\max_{\theta \in \mathcal{P}} f\big(\mathcal{D}^{(\mathrm{us})};\, \theta\big)$ .   (39)

In other words, minimizing the KL-divergence between the parametric approximation and the empirical distribution is the same as choosing the parameter according to the maximum likelihood rule in equation (23) (cf. [67]); $\hat{\theta}_{\mathrm{ML}}$ is the parameter choice which maximizes the probability of the given data.⁷
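As a concrete sketch of (37), consider a univariate Gaussian family fitted to synthetic data; the closed-form maximizer used below is the standard textbook result for this family, and all numbers are invented:

```python
import math
import random

random.seed(0)
data = [random.gauss(2.0, 1.5) for _ in range(1000)]  # synthetic dataset

def avg_log_likelihood(data, mu, sigma):
    """(1/N) sum_n ln f(x_n; theta) for a univariate Gaussian, as in (37)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data) / len(data)

# For this family the maximizer of (37) is available in closed form:
# the sample mean and the (biased) sample standard deviation.
n = len(data)
mu_ml = sum(data) / n
sigma_ml = math.sqrt(sum((x - mu_ml) ** 2 for x in data) / n)

# Any perturbed parameter value attains a lower average log-likelihood.
print(avg_log_likelihood(data, mu_ml, sigma_ml))
print(avg_log_likelihood(data, mu_ml + 0.3, sigma_ml))  # strictly smaller
```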

Maximum likelihood is the classic frequentist parameter estimation procedure, as it has numerous desirable properties. Most importantly, it is [70, 71]:

1. Consistent: if the true $f_X(x) \equiv f(x;\, \theta)$ for some $\theta$, the estimator (37) will converge in probability on this value as $N \to \infty$.

2. Asymptotically efficient: as $N$ grows large, $\hat{\theta}_{\mathrm{ML}}$ will be as close as possible to the true value $\theta$ for the given dataset size (again assuming the true model is in $\mathcal{M}$).

3. Parameterization invariant: if $\theta' = g(\theta)$ is an alternative parameterization of the parametric family $\mathcal{M}$, and $\hat{\theta}'_{\mathrm{ML}}\big(\mathcal{D}^{(\mathrm{us})}\big)$ is the maximum likelihood estimate of this alternative parameter for a given dataset, then $\hat{\theta}'_{\mathrm{ML}}\big(\mathcal{D}^{(\mathrm{us})}\big) \equiv g\big(\hat{\theta}_{\mathrm{ML}}\big(\mathcal{D}^{(\mathrm{us})}\big)\big)$. As parameters are not considered directly observable, there may be more than one compelling parameterization of a family $\mathcal{M}$; this result assures us that how we represent the family of distributions will not affect the result.

⁷For the record, minimizing the quadratic loss ($L_2$-norm) between the empirical distribution and a parametric family yields the estimator

$\hat{\theta}_2\big(\mathcal{D}^{(\mathrm{us})}\big) = \arg\min_{\theta \in \mathcal{P}} \left( \frac{1}{2} \int \big(f(x;\, \theta)\big)^2\, dx - \frac{1}{N} \sum_{n=1}^{N} f(x_n;\, \theta) \right)$ .   (40)

A cross-validation version of this estimator is commonly used with nonparametric methods, specifically in bandwidth selection for kernel density estimation (KDE) [68, 69], but the formula has received relatively little attention in the traditional parameter estimation literature, compared to the maximum likelihood approach.

This is a frequentist treatment, in that the parameter $\theta$ is considered unknown but not stochastic, and we do not use a probability distribution to quantify the uncertainty in $\theta$. The Bayesian alternative is to instead introduce a probability distribution over the models in $\mathcal{M}$. This is equivalent to letting the true parameter be a random variable $\Theta$ with a prior distribution $f_\Theta(\theta)$. The joint distribution can be written

$f_{X,\Theta}(x, \theta) \equiv f_{X|\Theta}(x \mid \theta)\, f_\Theta(\theta)$ ,   (41)

where $f_{X|\Theta}(x \mid \theta) \equiv f_X(x;\, \theta)$ are functionally identical, but carry different interpretations. Minimizing the expected 0-1 loss for $\theta$ leads to the maximum a-posteriori parameter estimate

$\hat{\theta}_{\mathrm{MAP}}\big(\mathcal{D}^{(\mathrm{us})}\big) = \arg\max_{\theta \in \mathcal{P}} f_\Theta(\theta) \prod_{n=1}^{N} f_{X|\Theta}(x_n \mid \theta)$   (42)
$= \arg\max_{\theta \in \mathcal{P}} f_{\Theta|\{X\}}\big(\theta \mid \mathcal{D}^{(\mathrm{us})}\big)$ ,   (43)

completely analogous to the MAP rule in equation (21). A more in-depth treatment of the Bayesian perspective will be reserved for section 4.4.
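For a conjugate illustration of (42), consider i.i.d. Bernoulli data with a Beta prior on the success probability; the hyperparameters and observations below are invented, and the closed-form posterior mode is the standard result for this pair:

```python
# MAP parameter estimation as in (42) for i.i.d. Bernoulli data with a
# Beta(a, b) prior. All numbers are toy values for illustration.

a, b = 3.0, 3.0          # prior hyperparameters, weakly favouring 0.5
data = [1, 1, 1, 0, 1]   # k = 4 successes out of N = 5 trials

k, n = sum(data), len(data)
theta_ml  = k / n                            # maximum likelihood: 0.8
theta_map = (k + a - 1) / (n + a + b - 2)    # posterior mode: 6/9 = 0.667

print(theta_ml, theta_map)   # the prior pulls the estimate towards 0.5
```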

3.6.2 Supervised Problems

For supervised problems, the situation is usually a little more complex. The empirical distribution in (36) has very narrow support, so the induced conditional distribution

$\dot{f}_{C|X}(c \mid x) = \frac{\dot{f}_{X,C}(x, c)}{\dot{f}_X(x)}$   (44)
$= \frac{\dot{f}_{X,C}(x, c)}{\sum_{c' \in \mathcal{C}} \dot{f}_{X,C}(x, c')}$   (45)

is undefined unless $x \in \{x_n\}_{n=1}^{N}$. The risk minimization rule (20) then cannot be applied directly. In classification, the classic way to get around this is to:

1. propose a parametric family

$\mathcal{M} = \{f_{X,C}(\cdot, \cdot;\, \theta)\}_{\theta \in \mathcal{P}}$   (46)

for the joint distribution $f_{X,C}$,

2. estimate the parameters $\theta$ of the joint distribution using unsupervised learning (typically maximum likelihood) on the data $\mathcal{D}$, and

3. insert the selected pdf $f_{X,C}\big(x, c;\, \hat{\theta}\big)$ into a decision rule, commonly MAP (21) or maximum likelihood (23), to obtain a classifier.

Assuming data samples are independent, the maximum likelihood objective function for the parameter estimation factors as

$f_{\{X,C\}}(\mathcal{D};\, \theta) = \prod_{n=1}^{N} f_{X,C}(x_n, c_n;\, \theta)$   (47)
$= \prod_{c \in \mathcal{C}} \prod_{n=1}^{N} \big( f_{X|C}(x_n \mid c;\, \theta)\, f_C(c;\, \theta) \big)^{I(c_n = c)}$   (48)
$= \prod_{c \in \mathcal{C}} f_C(c;\, \theta)^{|\mathcal{D}_c|}\, f_{\{X\}|C}(\mathcal{D}_c \mid c;\, \theta)$ ,   (49)

where $\mathcal{D}_c$ from equation (10) is the subset of datapoints corresponding to label $c$, and $|\mathcal{D}_c|$ denotes its cardinality. Typically, the class-conditional distributions $f_{\{X\}|C}(\mathcal{D}_c \mid c;\, \theta)$ depend on disjoint subsets $\theta_c$ of the parameters for each class. As the objective function factors, these parameters can then be estimated independently, class by class, simplifying the computations.

3.6.3 Discriminative Procedures

The proposed maximum likelihood classification approach based on unsupervised parameter estimation typically yields reasonable results, but may underperform in cases where the assumptions are incorrect and the true distribution of the data is not part of $\mathcal{M}$. There are several techniques, known as discriminative learning, which attempt to optimize objectives that are closer to the supervised task at hand. The general idea is to worry less about describing the distributions accurately, and instead concentrate on the decision boundaries (the inputs $x$ at which $\hat{c}$ changes value), and try to position these as appropriately as possible.


One discriminative strategy, known as maximum mutual information (MMI) training [72], is to select the parameters to maximize the conditional probability

$\hat{\theta}_{\mathrm{MMI}}(\mathcal{D}) = \arg\max_{\theta \in \mathcal{P}} \prod_{n=1}^{N} f_{C|X}(c_n \mid x_n;\, \theta)$   (50)

of the correct class. This maximizes the information-theoretic mutual information between labels and observations, evaluated on the training data. Compared to MAP parameter estimation for unsupervised data, this procedure optimizes the conditional probabilities used in a MAP decision (21) for the supervised task. This is much closer to the supervised task performance we ultimately are interested in, and hence (50) tends to improve results in practice, at the expense of being more difficult to optimize than traditional approaches such as (37).

Another possibility is to consider the decision function, e.g., (21) or (23), as an abstract function of both the input data and the parameters $\theta$. Minimizing the expected loss as a function of the parameters then yields the formulation

$\hat{\theta} = \arg\min_{\theta \in \mathcal{P}} \mathrm{E}\big(L(\hat{c}(X;\, \theta), C)\big)$ ,   (51)

in the case of classification. Note that this is a valid problem also in cases where the classifier $\hat{c}$ is just an arbitrary function with no probabilistic interpretation at all. We can treat the performance of the decision function on the training data (i.e., the empirical distribution) as a proxy for performance on future examples. This suggests selecting parameters according to

$\hat{\theta}(\mathcal{D}) = \arg\min_{\theta \in \mathcal{P}} \frac{1}{N} \sum_{n=1}^{N} L\big(\hat{c}(x_n;\, \theta), c_n\big)$ .   (52)

For the special case of 0-1 loss, this yields an estimate

$\hat{\theta}_{\mathrm{MCE}}(\mathcal{D}) = \arg\min_{\theta \in \mathcal{P}} \sum_{n=1}^{N} I\big(\hat{c}(x_n;\, \theta) \neq c_n\big)$ .   (53)

This is known as minimum classification error (MCE) [73], as it chooses a parameter set that minimizes the number of misclassifications that $\hat{c}$ makes on the training data. While MCE often achieves good practical performance on the task under consideration, the objective function is not straightforward to optimize, as it is piecewise constant in $\theta$ and only takes on a finite set of discrete values.
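A toy evaluation of the MCE objective (53) makes the optimization difficulty visible: for a one-parameter threshold classifier on invented data, the criterion is a step function of the parameter, so gradients are zero almost everywhere:

```python
# Evaluating the MCE criterion (53) for a toy one-parameter classifier
# that predicts class 1 whenever x exceeds a threshold theta.

train = [(0.2, 0), (0.8, 0), (1.1, 1), (1.9, 1), (0.9, 1)]

def mce(theta):
    """Number of training misclassifications, cf. equation (53)."""
    return sum(int((x > theta) != bool(c)) for x, c in train)

for theta in [0.0, 0.5, 0.85, 1.0, 1.5]:
    print(theta, mce(theta))   # piecewise constant in theta
```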

Equation (52) and the reasoning behind it also apply to synthesis tasks. For speech, assuming a quadratic loss function (more appropriate than 0-1 loss now that $\mathcal{X}$ is continuous and equipped with a metric) and a given data-generation procedure $\hat{x}(\cdot;\, \theta)$ such as MLPG, one obtains the parameter estimation problem

$\hat{\theta}_{\mathrm{MGE}}(\mathcal{D}) = \arg\min_{\theta \in \mathcal{P}} \sum_{n=1}^{N} \big(\hat{x}(c_n;\, \theta) - x_n\big)^2$ .   (54)

This is known as minimum generation error (MGE) training, and tends to improve subjective results over synthesis from ML-estimated models [74]. Again, this is achieved by considering an optimization problem more closely related to the practical problem of interest.

3.6.4 Computational Considerations

As a final remark, it is not uncommon that the eventual optimization problem one obtains cannot be solved analytically, especially for discriminative methods. Often, however, optimization techniques such as gradient descent can be applied to solve the problem iteratively. Typically, these methods take a proposed solution and improve it, such that the value of the objective function improves with each step. Applying the procedure multiple times eventually converges on a local stationary point of the original objective [75]. Aside from gradient descent, another example of this procedure is the EM algorithm [76], which can be formulated as interleaving optimizations of two different sets of variables [77]. In other cases, the problem can be simplified or approximated to a form which can be solved more readily (e.g., by relaxing the optimization problem [78]), but the resulting solution may then not be identical to that of the original formulation.

4 Common Model Paradigms

The assumptions made when solving practical problems with the methods from section 3.6 are to a large part embodied by the model set $\mathcal{M}$. Consequently, the different modelling techniques that have been proposed are almost as varied as the decision problems we may face.

This section establishes some of the major paradigms in general statistical modelling and problem solving, complete with example techniques from the different classes; the companion section 5 discusses assumptions and models specific to sequence data. Topics covered here include the distinctions between parametric and nonparametric models (section 4.1), between discriminative and generative models (section 4.2), and between probabilistic and geometric approaches (section 4.3). Methods that combine multiple models, either from a Bayesian perspective (section 4.4), using mixtures (section 4.5), or through other means (section 4.6), are also considered.


4.1 Parametric vs. Nonparametric

The distinction between parametric and nonparametric methods, introduced in section 3.5, is quite significant in practice. The ever-growing set of distributions $\mathcal{M}$ that nonparametric methods can describe as the number of samples $N$ increases enables asymptotic convergence as $N \to \infty$ even under very weak assumptions on the generating distribution (see, for instance, [79]). On the other hand, the weak assumptions mean that larger datasets generally are necessary to attain a certain performance; [80] gives some upper bounds for performance in density estimation. In addition, computational complexity often grows superlinearly with dataset size $N$, whereas common parametric approaches typically are linear in $N$, though they may be superlinear in $p$, the dimensionality of the parameter space $\mathcal{P}$. Approximations can frequently be employed in nonparametric approaches to make the computational load linear in $N$, but at the expense of the asymptotic convergence properties; in a sense, such approximations make nonparametric methods parametric.

Parametric distributions are exceedingly common in applications. Most well-known is probably the normal or Gaussian distribution, which is a member of the important exponential family, consisting of distributions that can be written as

$f_X(x;\, \theta) = h(x) \exp\big( \eta^\top(\theta)\, T(x) - K(\theta) \big)$ .   (55)

Here, $T$ is a vector of natural sufficient statistics,⁸ $\eta$ is a corresponding set of natural parameters (the distribution is written in natural form if $\eta(\theta) \equiv \theta$), while $K(\theta)$, the logarithm of the normalization constant, ensures that the density integrates to one. This family also includes several other important continuous and discrete distributions such as the beta and gamma distributions, the von Mises-Fisher distribution, as well as the geometric and Poisson distributions. Mixture distributions, see section 4.5, are generally not in the exponential family. Student's t-distribution, in particular, can be written as an infinite mixture and is not in the exponential family.

To be in the exponential family it is additionally required that the support of the distribution is independent of $\theta$. The set of uniform distributions on, e.g., $[0, \theta]$, is therefore not an exponential family. Being in the exponential family has certain advantages for parameter estimation, for instance the existence of a sufficient statistic (the Pitman-Koopman-Darmois theorem) whose dimension remains bounded as the sample size $N \to \infty$, but we shall not elaborate further on these here.

⁸Here, a statistic is any function of the available data, and a sufficient statistic is a function computable over the data which contains all information relevant for optimal parameter estimation. This means that the mutual information between the parameter and the sufficient statistic is the same as the mutual information between the parameter and the data, regardless of how the parameter is distributed [54].
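As a concrete instance of (55), the univariate Gaussian with $\theta = (\mu, \sigma^2)$ can be rewritten in exponential-family form; the decomposition below is the standard textbook one, included purely for illustration:

```latex
f_X(x;\mu,\sigma^2)
  = \underbrace{\tfrac{1}{\sqrt{2\pi}}}_{h(x)}
    \exp\Bigg(
      \underbrace{\begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}^{\!\top}}_{\eta(\theta)^\top}
      \underbrace{\begin{pmatrix} x \\ x^2 \end{pmatrix}}_{T(x)}
      - \underbrace{\Big( \frac{\mu^2}{2\sigma^2} + \ln\sigma \Big)}_{K(\theta)}
    \Bigg)
```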

The most prominent nonparametric density estimation method is arguably kernel density estimation (KDE) [81, 82], shadowed by its closely related "dual," k-nearest neighbour density estimation, although the latter does not lead to normalizable densities [83]. Other techniques include series expansions and maximum penalized likelihood methods (closely related to splines). Some discussion of different nonparametric density estimation methods is available in [59, 80].
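For reference, a univariate Gaussian-kernel KDE is only a few lines; the minimal sketch below (standard-library Python on toy observations, with the bandwidth $h$ as a free smoothing parameter, cf. the bandwidth selection mentioned in footnote 7) is one possible form:

```python
import math

def kde(x, data, h):
    """Gaussian-kernel density estimate at x, with bandwidth h > 0."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xn) / h) ** 2) for xn in data) / norm

data = [0.9, 1.1, 1.3, 2.8, 3.1]      # toy observations
print(kde(1.0, data, h=0.3))          # high density near the data
print(kde(5.0, data, h=0.3))          # near zero far from the data
```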

4.2 Generative vs. Discriminative

Equation (20) shows that minimum-risk decisions only require knowledge about the conditional distribution of the decision quantity given the input. Approximate minimum-risk decisions can thus be made based on approximations of this distribution, e.g., $f_{C|X}\big(c \mid x;\, \hat{\theta}\big)$. A model that describes this conditional probability directly is said to be discriminative, since it is designed for class discrimination. The alternative is to instead approximate the full joint feature-label distribution $f_{X,C}(x, c)$; such a model can subsequently be used for discrimination by rearranging the relations

$f_{X,C}(x, c) \equiv f_X(x)\, f_{C|X}(c \mid x)$   (56)
$\equiv f_C(c)\, f_{X|C}(x \mid c)$ ,   (57)

but is in addition capable of sampling new feature-label pairs, or new features given labels, which is useful in synthesis tasks. Such models are known as generative models. We note that (56) suggests that $f_{C|X}$ may have fewer degrees of freedom than $f_{X,C}$, and thus may be easier to estimate from finite datasets, giving discriminative methods a statistical advantage.

The density models discussed in the previous section can all easily be adapted as class-conditional feature distributions $f_{X|C}$, which, together with a categorical distribution $f_C$ for the classes, form a full, generative model; cf. (57). The spline-based method for log-likelihood ratios in [84], in contrast, exclusively describes the conditional class distribution $f_{C|X}$, without suggesting a joint distribution $f_{X,C}$. A curious intermediate case is k-nearest neighbour classification, which gives unnormalizable pdfs in unsupervised density estimation, but does lead to well-formed probabilities in classification [83].

The difference between using a discriminative and a generative model for a discriminative task is not substantial. When parameters in generative models are chosen using MAP or ML, it becomes possible to use the generative properties of the approach to investigate what the model has learned about the typical example, for instance by sampling. This interpretability can often be advantageous in understanding how well a certain statistical model works for a problem. If, however, the model is misspecified (the true pdf is not in $\mathcal{M}$), a parameter estimation approach grounded in the discriminative perspective, such as MMI or MCE training, often yields better ultimate performance on the task. Aside from experimental evidence, this can be seen theoretically following [85], where it is shown that MMI is equivalent to ordinary maximum-likelihood estimation in an expanded model; the greater flexibility (increased number of parameters) of this model makes it possible to adjust to a larger variety of situations than before.

One downside of discriminative models is that they are only trained to describe the conditional distribution $f_{C|X}$ accurately, so any joint distribution $f_{X,C}$ formed by slotting discriminatively trained parameter values such as $\hat{\theta}_{\mathrm{MMI}}$ or $\hat{\theta}_{\mathrm{MGE}}$ into a generative family may not be meaningful. It can also happen that a generative view may increase end performance in discrimination: one example is unsupervised pre-training for deep neural networks (DNNs), which is an explicit data-generation perspective that has shown recent success in initializing DNNs for discriminative tasks [86].

4.3 Probabilistic vs. Geometric

In equation (53), we saw one example showing that solutions to classification problems need not be derived from probabilistic principles in order for empirical risk minimization to be possible. In general, as long as the task is not explicitly to estimate a density or density parameter, there also exist nonprobabilistic solution techniques. Instead of being based on maximizing probabilities, these are generally founded on minimizing distances, and are therefore termed geometric methods. Because they do not describe any probabilities or distributions, geometric models do not allow sampling and can only be used discriminatively.

A simple geometric approach is exemplar-based classification, where instances are classified based on which of two class examples they are closest to in feature space, according to some distance metric. Under the Euclidean distance metric, the decision regions are then separated by a hyperplane. A more general example of this geometric notion is maximum-margin methods such as support vector machines (SVMs) [87], which separate classes using a hyperplane in a high-dimensional feature space [88]. SVMs have turned out to be very strong performers on many binary classification tasks. Another geometric method, used, e.g., in prediction and forecasting, is linear regression by least squares, the "squares" being square Euclidean distances [83].

It has turned out that many geometric methods are special cases of a corresponding probabilistic model. For instance, the two-exemplar-based approach above yields identical decision surfaces to classification based on a model where both classes are Gaussian distributed with identical covariance matrices. Furthermore, while the arithmetic mean is the point which minimizes the sum of square distances to all datapoints in a set, the mean is also the maximum likelihood estimate of the location parameter and mode of a Gaussian distribution for the same dataset. The correspondence extends to finding sets of reference points that minimize the square distances to the closest reference point (through the k-means algorithm [89]), and training mixtures of Gaussian components with identical standard deviations. Probabilities of different points for a Gaussian distribution with arbitrary covariance matrix $C$ can be mapped one-to-one to Mahalanobis distances [90]

$d(x, \mu) = \sqrt{(x - \mu)^\top C^{-1} (x - \mu)}$   (58)

from the mean vector $\mu$. In general, every Bregman divergence, interpretable as a kind of asymmetric distance measure, can be connected to a corresponding distribution in the exponential family [58].

Another example that highlights the connection between geometric and probabilistic methods is nearest-neighbour classification, where the estimated class is chosen as the class of the example in the training data that is closest to the input features according to some distance metric. This is obviously a geometric decision, yet the same technique can also be interpreted probabilistically, as a special case of k-nearest neighbour from before. A notable example where, on the other hand, identifying an equivalent probabilistic model has been difficult is the case of SVMs; [91] represents a more successful interpretation attempt. There exist, however, methods closely related to SVMs that are probabilistic from the ground up, e.g., relevance vector machines (RVMs) [92].

Geometric intuition can often aid the solution of machine learning problems, as well as add to the understanding of statistical methods, e.g., by considering ML-estimation as a minimal KL-divergence projection in information geometry; this is how equation (37) was derived. Furthermore, many discrimination and prediction rules, also those derived from probabilistic principles, can be formulated in terms of distances.

Probabilistic approaches, however, bring more to the table than their geometric counterparts, since they in addition explicitly model the uncertainty in the decision situation. This makes them naturally capable of estimating stochastic quantities such as the overall expected risk of a classifier, or how certain a learner is of its decision for a given input. (One caveat is that the selected models are biased towards overestimating their own performance on the training material, since they were selected among all other models for having the greatest apparent accuracy there [93, 94].) It may also be possible for probabilistic methods to flag unusual input data that falls in feature regions where the model has not seen many examples and its decisions may be inaccurate.


4.4 Fully Bayesian Approaches

Section 3.6 introduced the Bayesian notion of assigning initial (prior) probabilities to the different models in the set $\mathcal{M}$. These probabilities, together with the observations, determine the final model selected, as seen in equation (42). By taking the logarithm of the objective function,

$\hat{\theta}_{\mathrm{MAP}}(\mathcal{D}) = \arg\max_{\theta \in \mathcal{P}} \big( \ln f_\Theta(\theta) + \ln f_{\{X,C\}|\Theta}(\mathcal{D} \mid \theta) \big)$   (59)
$= \arg\max_{\theta \in \mathcal{P}} \left( \ln f_\Theta(\theta) + \sum_{n=1}^{N} \ln f_{X,C|\Theta}(x_n, c_n \mid \theta) \right)$ ,   (60)

we see that the log prior $\ln f_\Theta$ can be seen as an additive regularizer of the data log-likelihood $\ln f_{\{X,C\}|\Theta}$, biasing the estimate towards certain regions of parameter space, with the additional property that $f_\Theta$ also has a probabilistic interpretation. The Bayesian framework has more implications than this, however. Since the parameter $\Theta$ in the Bayesian framework is a random variable, one can marginalize it out in many decision situations: both in supervised settings such as classification, where we can form

$\hat{c}(x \mid \mathcal{D}) = \arg\min_{c \in \mathcal{C}} \mathrm{E}(L(c, C) \mid X = x,\, \mathcal{D})$   (61)
$= \arg\min_{c \in \mathcal{C}} \sum_{c' \in \mathcal{C}} L(c, c')\, f_{C|X,\{X,C\}}(c' \mid x,\, \mathcal{D})$ ,   (62)

using the law of total probability to compute

$\hat{f}_{C|X,\{X,C\}}(c' \mid x,\, \mathcal{D}) = \int f_{C|X,\Theta}(c' \mid x,\, \theta)\, f_{\Theta|\{X,C\}}(\theta \mid \mathcal{D})\, d\theta$ ,   (63)

as well as in unsupervised density estimation, where one obtains the predictive distribution

$\hat{f}_{X|\{X\}}\big(x \mid \mathcal{D}^{(\mathrm{us})}\big) = \int f_{X|\Theta}(x \mid \theta)\, f_{\Theta|\{X\}}\big(\theta \mid \mathcal{D}^{(\mathrm{us})}\big)\, d\theta$ .   (64)

(We assume samples are conditionally independent given $\Theta$.) Methods like the above, which take into account the entire posterior distribution $f_{\Theta|\{X,C\}}$ or $f_{\Theta|\{X\}}$ for the parameter, are known as fully Bayesian, in contrast to MAP or maximum likelihood, which are based on point estimates (single values) $\hat{\theta}_{\mathrm{MAP}}$ or $\hat{\theta}_{\mathrm{ML}}$.
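Continuing the Beta-Bernoulli toy example from section 3.6, the predictive integral (64) has a closed form for that conjugate pair, so the fully Bayesian prediction can be compared directly against a point-estimate plug-in (again, all numbers are invented):

```python
# Fully Bayesian prediction (64) for the Beta-Bernoulli pair: rather than
# plugging in one point estimate, integrate over the posterior
# Beta(a + k, b + n - k), which here yields a simple closed form.

a, b = 3.0, 3.0
data = [1, 1, 1, 0, 1]
k, n = sum(data), len(data)

p_fully_bayes = (a + k) / (a + b + n)           # predictive P(X = 1 | D)
p_map_plugin  = (k + a - 1) / (n + a + b - 2)   # plug-in with the MAP estimate

print(p_fully_bayes, p_map_plugin)   # 7/11 = 0.636... vs 6/9 = 0.666...
```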

The fully Bayesian strategy of incorporating all possibilities into decisions lends some protection against situations where there is substantial uncertainty about which parameter value is appropriate. In such cases, picking a single value through point estimation can be highly misleading, as will be discussed in section 6.3. Weighting models together as in (64) also enables the approach to represent certain distributions not contained in $\mathcal{M}$, but within the convex hull spanned by $\mathcal{M}$ [95]. On the other hand, marginalizing over all parameter values complicates the mathematics of fully Bayesian approaches, and often necessitates approximations, e.g., so-called variational inference [96], or computationally intensive simulation methods such as Markov chain Monte Carlo (MCMC) [97, 98]. Sometimes the distributions formed by summing over models in $\mathcal{M}$ can be physically unreasonable, for instance leading to speech models having more formants (vocal tract resonances) than natural human speech can contain.

Despite the conceptual appeal, Bayesian methods have a somewhat controversial history in statistics. This is partially because the choice of prior distribution often cannot be motivated on theoretical grounds, and partially because of the subjectivist philosophical interpretations that are sometimes attached to Bayesian methods.

In certain applications a prior distribution can be inferred from previously available information. When creating a speaker-specific speech recognizer, for instance, we can use the distribution of voice parameters seen in a previous training material of many speakers as our prior. Sometimes, however, there is a desire to select a "non-informative prior," which represents having no knowledge about the distribution. This condition of perfect ignorance has proved difficult to describe mathematically. An early proposal, attributed to Bayes and Laplace, is the principle of indifference: assume a uniform prior over all parameter values. Unlike ML-estimation, which is only a point estimate, the distribution given by this rule depends on the parameterization for continuous variables. This is unappealing, since it is hard to argue which representation should be preferred for a fundamentally unobservable quantity such as a distribution parameter.

Other proposals for objective prior selection exist, notably the so-called Jeffreys prior [99]

$f_{\mathrm{Jeff}}(\theta) \propto \sqrt{\det \mathrm{E}\Big( \big( \nabla \ln f_{X|\Theta}(X \mid \theta) \big) \big( \nabla \ln f_{X|\Theta}(X \mid \theta) \big)^\top \mid \Theta = \theta \Big)}$ ,   (65)

which is independent of representation and only depends on the model $f_{X|\Theta}$. While attractive, the Jeffreys prior is not a panacea. Frequently, it is not a normalizable probability distribution (it has infinite integral over the parameter space), in which case it is said to be improper. This may be acceptable as long as the posterior distribution used for making decisions based on data is a proper distribution, which is often the case. However, there are situations (notably mixture models [100]) where the posterior is always improper. There are also arguments that the Jeffreys prior is unsatisfactory for multidimensional parameters [101].⁹

⁹An alternative approach to non-informative priors, investigated, e.g., in [102], is to formulate weakly informative priors: prior distributions designed with the intention to exert little influence over the conclusions of an experiment while being sufficiently informative to avoid the pathologies of standard non-informative priors.


In recent years, the prior distribution has come to be seen as more of an asset and less of a liability, and Bayesian methods have experienced increased use in applications. For one thing, it can be argued that non-Bayesian methods also make assumptions, but that these simply are not stated as explicitly. Having the assumptions up front, and formulated probabilistically as a Bayesian prior distribution, provides a clearer picture of what is being taken for granted. The prior also provides a very explicit entry-point for other available information about the expected behaviour of the variable(s) in question. This can be used to (gently or strongly) bias results in the right direction, and possibly increase performance compared to methods that make weaker assumptions.

In contrast to the use of Bayesian methods as an applied tool in specific situations, the philosophical interpretation of Bayesian probabilities as subjective probabilities (as in, e.g., [103]) remains controversial. For some discussion see [104] and the accompanying commentary.

4.5 Mixture Models

Fully Bayesian methods are part of a broad category of approaches where more than one model is employed to solve a single problem. A general term for such techniques is ensemble methods. From a high-level perspective, the idea of combining different approaches and opinions is compelling, as it is frequently used in real-life decision making, e.g., panels of experts, juries, and scientific research groups; cf. [105].

Perhaps the simplest class of ensemble models is mixture models, where the pdf is a weighted sum of different models. Most well-known are Gaussian mixture models (GMMs), which have the form

$f_X(x;\, \theta) = \sum_{k=1}^{K} p_k\, \mathcal{N}(x;\, \mu_k,\, C_k)$ ,   (66)

where $\mathcal{N}(x;\, \mu,\, C)$ denotes a normal-distribution pdf with mean vector $\mu$ and covariance matrix $C$ evaluated at $x$, while $\mu_k$ and $C_k$ are parameters for component $k$, with nonnegative component probabilities $p_k$ satisfying $\sum_{k=1}^{K} p_k = 1$. $K$, the number of components, is typically fixed (nonrandom). Many component shapes other than Gaussians are possible, for instance mixtures of beta distributions for data confined to an interval [106].

Mixture-model samples are most easily understood as generated through a two-stage process: for every datapoint $X_n$, a discrete random variable $Q_n$ is first drawn from the distribution $f_Q(k) = p_k$. The realization $k = q_n$ then tells us which component distribution $f_{X|Q}(x_n \mid q_n)$ to use when generating the observed $x$-value. $Q_n$ here is thus an unobservable (hidden) random variable that helps generate the $n$th observation. Such unobservable variables, which are sampled anew for each datapoint, are known as latent variables, in contrast to Bayesian parameters $\Theta$, which have the same, though unobservable, value for the entire dataset. This "hidden selector" structure also suggests that mixture models may be particularly appropriate in cases where the studied distribution is composed of many different subpopulations with potentially different properties.
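The two-stage generative story translates directly into ancestral sampling code; the sketch below (toy weights, means, and standard deviations, with univariate components) draws samples from a GMM of the form (66):

```python
import random

# Two-stage sampling from a GMM: first draw the hidden component
# indicator Q_n, then draw X_n from the selected Gaussian component.

weights = [0.3, 0.7]     # p_k
means   = [-2.0, 1.5]    # mu_k
stds    = [0.5, 1.0]     # sqrt(C_k) in one dimension

def sample_gmm():
    q = random.choices(range(len(weights)), weights=weights)[0]  # latent Q_n
    return random.gauss(means[q], stds[q])                       # X_n | Q_n = q

print([round(sample_gmm(), 2) for _ in range(5)])
```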

Mixture models are superficially similar to fully Bayesian approaches in that the predictive data distributions (64) and (66) in both cases are a weighted combination of different pdfs from simpler models, often from the exponential family. However, the distributions resulting from the two methods can often show quite different asymptotic learning behaviour (that is, when we let the training set size $N$ grow). For a Bayesian model, the likelihood term in (60) will eventually swamp the prior (wherever the prior is nonzero) as the amount of data becomes large. If, for example, $\mathcal{M}$ is a set of distinct categorical distributions, the posterior probability will eventually come to strongly favour the one among these that is closest to the true data distribution. The resulting predictive distribution (64) will be virtually indistinguishable from one of the distributions in $\mathcal{M}$. A mixture model, in contrast, may retain all components and keep them distinct even as $N \to \infty$, and can therefore more easily converge on a variety of shapes that are different from and more complex than any of the components. In cases where the true generating distribution is not in $\mathcal{M}$, the standard Bayesian approach with conjugate priors often converges on a compromise model which best explains the entire dataset (but see also [95]), while mixture models, roughly speaking, partition the state space $\mathcal{X}$ into $K$ pieces, and use a separate model for each part. However, the two ideas can be combined to form Bayesian mixture models, e.g., [83, 106], in which the component probabilities and parameters are random variables.

Theoretical results show that mixture models can approximate any distribution well, given a sufficient number of components [83]. Models with a large number of components may require substantial amounts of data to be accurate, however. Somewhat surprising, then, is that Bayesian mixture models with an effectively infinite number of components can be formulated and made practical [107]; these are of particular interest when it is difficult to propose an adequate number of components $K$ a priori.

4.6 Ensemble Methods

While Bayesian approaches and mixture models have a long history in statistics, other kinds of ensemble models have recently surged in popularity due to their consistently strong showings in prediction contests such as the widely publicized Netflix Prize. Three important approaches—bagging, boosting, and stacking—are discussed below:


Bagging [108] stands for bootstrap aggregating, and is the simplest of the three techniques. In this method, the bootstrap [109] (resampling from $\mathcal{D}$ with replacement) is applied to create a large number of different datasets based on the original data. Individual problem solvers with the same structure are then trained on the resampled datasets, leading to a potentially different parameter estimate for each. To form a decision, the outputs of the trained problem solvers are combined using, e.g., averaging (minimizing the squared error on continuous spaces) or simple majority voting (discrete spaces).
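As an illustration of the procedure, a minimal bagging sketch follows (the least-squares line fit is a stand-in base learner of my own choosing; any learner exposing a fit/predict pair could be substituted):

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging_fit_predict(fit, predict, x_train, y_train, x_test, n_models=25):
    """Train n_models copies of a base learner on bootstrap resamples
    and average their predictions (suitable for continuous outputs)."""
    n = len(x_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        model = fit(x_train[idx], y_train[idx])
        preds.append(predict(model, x_test))
    return np.mean(preds, axis=0)               # averaging combines the ensemble

# Example base learner: a straight line fitted by least squares.
fit = lambda x, y: np.polyfit(x, y, deg=1)
predict = lambda m, x: np.polyval(m, x)

x = rng.uniform(-1, 1, 200)
y = 2.0 * x + rng.normal(0, 0.3, 200)
y_hat = bagging_fit_predict(fit, predict, x, y, np.linspace(-1, 1, 5))
```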

Compared to bagging, boosting, e.g., [110, 111], is a more directed method specifically for classification. Boosting trains a sequence of classifiers on different weightings of the available data. After each model is trained, the data is reweighted so that incorrectly classified examples are given more weight when the next classifier is created. This creates a sequence of models that in some sense complement each other. To obtain the final decision, the different learners are typically weighted together depending on their classification accuracy. There are some impressive theoretical results showing that boosting asymptotically can approach optimal classification accuracy, even if the component classifiers are only marginally different from chance [112].
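The reweighting idea can be sketched in the style of discrete AdaBoost (a standard formulation included here for illustration, with decision stumps assumed as the weak classifiers; it is not an algorithm from the thesis):

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted decision stump on 1-D data: the threshold and sign
    minimizing weighted misclassification. Labels y are +/-1."""
    best = None
    for thr in np.unique(x):
        for sign in (+1, -1):
            pred = sign * np.where(x >= thr, 1, -1)
            err = np.sum(w * (pred != y))
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best[1], best[2]

def adaboost(x, y, n_rounds=10):
    n = len(x)
    w = np.full(n, 1.0 / n)                     # start from uniform data weights
    ensemble = []
    for _ in range(n_rounds):
        thr, sign = fit_stump(x, y, w)
        pred = sign * np.where(x >= thr, 1, -1)
        err = max(np.sum(w * (pred != y)), 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # accuracy-based learner weight
        ensemble.append((alpha, thr, sign))
        w *= np.exp(-alpha * y * pred)          # upweight misclassified examples
        w /= w.sum()
    return ensemble
```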

Stacking, finally, is the idea to create a pool of learners, and then use another, higher-level learner to combine their outputs [113]. Notably, the two best-performing teams from the Netflix Prize competition both used large, stacked ensembles for their predictors [114]. Artificial neural networks can be seen as an extreme example of stacking, where each layer is an ensemble of simple classifiers (perceptrons) with linear decision surfaces.

Generally, ensemble methods appear to perform better the more different the component models are, as this allows them to complement each other more effectively. Bayesian methods are therefore sometimes considered relatively weak ensembles [114], as all component models share the same structure and the parameter distribution tends to concentrate on a small subset of the possible models once sufficient amounts of data become available. Nevertheless, the ability to average over multiple models makes it possible for Bayesian approaches to converge on distributions not present in $\mathcal{M}$ [95]. This is appealing, as many traditional convergence and optimality results, e.g., for maximum likelihood, only consider the situation where the true distribution is in $\mathcal{M}$.

5 Modelling Sequence Data

The model paradigms from section 4 all apply to sequence models as well, but sequence models have the additional complication that samples may also depend on each other. Handling this requires assumptions on the nature of how samples interact, and models that express and quantify the dependences. This section introduces stationarity (section 5.1) and ergodicity (section 5.2) as two central assumptions in sequence data models, and outlines Bayesian networks as a framework for visualizing dependence structures (section 5.3). Thereafter, Markov models (section 5.4), hidden Markov models (section 5.5), and various combinations (section 5.6) are described, along with example techniques from each class. Finally, we sketch how the models can be applied to situations that do not satisfy stationarity or ergodicity (section 5.7), and touch upon some alternative paradigms which do not fit well in standard state-based sequence modelling frameworks (section 5.8), for instance detector-based techniques.

5.1 Stationarity

For sequence data, as considered in this thesis, there are two commonly seen standard assumptions: that data series are stationary, and that they are ergodic. Stationarity can be seen as a way to make the dependences of the process uniform and structured, while ergodicity is a concept for localizing the influence of the dependences. In this section we explore stationarity in greater detail, with a discussion of ergodicity reserved for section 5.2.

Stationarity is the notion that the probabilistic properties of samples from different times are comparable. Just as it is commonly assumed that the laws of nature are invariant over time, a stationary process is governed by the same distribution regardless of what the time variable $t$ is. More specifically, a sequence is strictly stationary if, for any sequence of observations $x$, the probability (or pdf for continuous-valued processes) of that sequence appearing as a pattern is identical at all times—that is, the process satisfies

f_{X_\tau}(x) \equiv f_{X_{\tau+\ell}}(x) \qquad (67)

for all $\ell$, as long as the time indices

\tau = \{t_1, \ldots, t_T\} \qquad (68)
\tau + \ell = \{t_1 + \ell, \ldots, t_T + \ell\} \qquad (69)

for the pattern both are subsets of the index set $\mathcal{T}$. In other words, any substring $x$ has the same probability of appearing at any time.

Stationarity also comes in other forms than strict stationarity. A process is weakly stationary or wide-sense stationary (WSS) if its second-order properties do not depend on time,

\mathrm{E}(X_1) = \mathrm{E}(X_t) \qquad (70)
\mathrm{E}\left(X_1 X_{1+\ell}^{\top}\right) = \mathrm{E}\left(X_t X_{t+\ell}^{\top}\right) \qquad (71)


for all valid $t$ and lags $\ell$; in other words, means and covariances are time-independent. Note that this does not require the distributions at different times to be the same.

The advantage of stationarity is that it simplifies learning. Under strong stationarity, there is essentially a single, translated distribution to learn, rather than a new distribution for each time $t$. In the case of weak stationarity, the distributions at individual times may differ, but there are still invariant means and variances that can be identified.

If the properties of the process (its distribution) change slowly with time or otherwise are confined to a small set of similar distributions, a model that assumes stationarity can still yield good results on a dataset. Speech signals, for instance, change relatively slowly on the millisecond scale, so stationary models are often sufficiently accurate over intervals of 20 ms or so.

In other cases, nonstationarity may have to be incorporated into the model itself. Seasonal dependences, which are common in time series of geological, biological, or societal origin, can be expressed through cyclostationary processes. These can be thought of as a set of $T_p$ alternating (and possibly dependent) stationary processes, $T_p$ being the period, where sample $X_t$ is generated from process number $t \bmod T_p$. Cyclostationary models can be created by letting process parameters (e.g., the mean) vary according to a given pattern [115]. It may then be possible to remove the nonstationary aspects of the data and obtain a stationary residual process. There are also nonparametric approaches to seasonality, for instance the climate data example from [27]. Common cycle periods are days and years; tides also have a monthly component, while weekly patterns often appear in data affected by human activity. Similar techniques can also be applied to describe ongoing trends in the data, for instance by adding terms that model linear growth.

Many deviations from stationarity are, unlike cyclostationarity, difficult to describe in a structured manner. One example is one-shot, transient processes, i.e., finite-duration phenomena. To accurately estimate the behaviour of such processes, it is typically necessary to have several independent realizations of the process. A deeper look at methods for working with nonstationary time series can be found in [116].

5.2 Ergodicity

Ergodicity is often presented as a short-memory property. Essentially, it is the idea that long samples from a process will be representative of all the possible behaviours of the process. More formally, if $\tau$ and $\tau'$ are two sets of time indices and $(X_\tau = x)$ and $(X_{\tau'} = x')$ are two positive-measure events, meaning that the two output patterns can be generated by the process, then the process is ergodic if the event $(X_\tau = x \cap X_{\tau'+\ell} = x')$ also has positive measure for some $\ell$. In other words, all possible patterns can also co-occur


in the same realization, at least if they are sufficiently far apart. (This definition is a slight simplification of that given in [117].)

Though, technically, ergodicity can be defined through a measure-preserving transform without requiring stationarity, stationarity is commonly assumed before ergodicity is considered. If a process is both stationary and ergodic, which is common to assume in applications, this means that the behaviour is time-invariant, and that the time average of any integrable function $g(x)$ converges to the ensemble average (the standard expected value of the function),

\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g(X_t) = \int g(x)\, f_{X_1}(x)\, dx \qquad (72)

almost everywhere [118]. For such a process, a single sample sequence, if long enough, suffices to learn the behaviour of the process to an arbitrary (given) accuracy. Ergodicity thus implies that the data satisfies a kind of law of large numbers for dependent samples.
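A quick numerical illustration of (72) (my own example, not from the thesis): for a stationary, ergodic AR(1) process, the time average of $g(x) = x^2$ approaches the corresponding ensemble average, which is known in closed form here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stationary, ergodic AR(1): X_t = a X_{t-1} + E_t with E_t ~ N(0, 1).
a, T = 0.9, 200_000
x = np.empty(T)
x[0] = rng.normal(0, np.sqrt(1 / (1 - a**2)))   # draw from stationary distribution
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal()

time_avg = np.mean(x**2)        # left-hand side of (72) with g(x) = x^2
ensemble_avg = 1 / (1 - a**2)   # E(X_t^2) for this process, in closed form
print(time_avg, ensemble_avg)   # the two agree closely for long T
```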

Of course, some ergodic systems converge to the ensemble average faster than others, and between independence and simple ergodicity exist several different kinds of so-called mixing (weak and strong). This is reminiscent of the various strengthened versions of the law of large numbers, including the central limit theorem. Exactly how quickly an ergodic system forgets can be quantified by mixing coefficients or by the autocovariance time $T_{\mathrm{Cov}}$. If the process $X$ is weakly stationary with mean value $\mu$ and variance $\sigma^2$, the autocovariance time is defined through

\sum_{\ell=1}^{\infty} \left| \mathrm{Cov}(X_t,\, X_{t+\ell}) \right| = \sigma^2 T_{\mathrm{Cov}}, \qquad (73)

assuming the sum is finite. This number affects how quickly the sample mean

\widehat{\mu}_N(x) = \frac{1}{N} \sum_{t=1}^{N} x_t \qquad (74)

converges to the true mean $\mu = \mathrm{E}(X_t)$—specifically, one can show [119] that

\mathrm{E}\left( L_Q\left( \widehat{\mu}_N(X),\, \mu \right) \right) = \mathrm{E}\left( \left( \widehat{\mu}_N(X) - \mu \right)^2 \right) \qquad (75)
\leq \frac{\sigma^2}{N} \left( 1 + 2\, T_{\mathrm{Cov}} \right). \qquad (76)

This result can be used to extend the law of large numbers, and also quantifies how learning from dependent data requires more samples than learning from independent samples due to the redundancy, as was mentioned in section 2.4. More precisely, an accuracy that is achievable with $N$ datapoints for independent data (which has $T_{\mathrm{Cov}} = 0$) now requires $N(1 + 2T_{\mathrm{Cov}})$ samples instead, an increase by a constant factor. It is as if one is working with uncorrelated datapoints spaced $2T_{\mathrm{Cov}}$ samples apart; this explains the interpretation of $T_{\mathrm{Cov}}$ as an amount of time.
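As a sketch of how $T_{\mathrm{Cov}}$ might be estimated in practice (my own illustration; the infinite sum in (73) is truncated at a finite maximum lag):

```python
import numpy as np

rng = np.random.default_rng(0)

def autocov_time(x, max_lag=1000):
    """Estimate T_Cov from a 1-D sample path by truncating the sum in (73)."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(xc**2)
    acov = np.array([np.mean(xc[:-l] * xc[l:]) for l in range(1, max_lag + 1)])
    return np.sum(np.abs(acov)) / var

# AR(1) example path; the true T_Cov here is a/(1 - a) = 9.
a, T = 0.9, 100_000
x = np.empty(T)
x[0] = 0.0
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal()

t_cov = autocov_time(x)
print(t_cov, 1 + 2 * t_cov)   # sample-size inflation factor from (76)
```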

It is a theorem that non-ergodic stationary processes can be decomposed into a set of component processes, each of which is ergodic [120]. Essentially, a process is either free to show all possible behaviours in a single realization, or restricted to pick a subset of them; within such a subset, all behaviours associated with the subset can be observed in a single realization. As a consequence, ergodicity is not always necessary for prediction: if we are asked to predict the likely future observations of a long sample, the sample will be relevant to the current ergodic component, and that is the only ergodic component that matters for the prediction task at hand.

5.3 Bayesian Networks

We have seen that, in order to be able to efficiently learn a process from its output data, variable dependences must be localized (ergodicity). Learning also becomes much easier if the dependences have some form of invariant structure (stationarity). Fortunately, many observed processes satisfy these criteria to a large degree.

In the following subsections, we introduce the two dominant paradigms for modelling stationary and ergodic sequence data. The methods can be extended to also handle certain types of nonstationarity (particularly trends and seasonality) and to model finite-duration sources.

When classifying sequence models, dependences among variables will be illustrated in the language of Bayesian networks [121, 122], which are a type of graphical model. More specifically, a Bayesian network is a directed acyclic graph (DAG) which expresses the causal dependences among a set of random variables.^{10} To be concrete, let $\mathcal{V} = \{1, \ldots, N\}$ be a topologically ordered indexing of the nodes in a DAG, so that the set of edges $\mathcal{E}$ satisfies $\mathcal{E} \subseteq \{(i, i') \in \mathcal{V} \times \mathcal{V} : i' > i\}$. Let further $\mathcal{J}(i) \subseteq \mathcal{E}$ be the edges pointing to $i$. Also let $X_{\mathcal{V}} = \{X_i : i \in \mathcal{V}\}$ be a set of possibly dependent random variables. The graph $(\mathcal{V}, \mathcal{E})$ is then a Bayesian network with respect to $X_{\mathcal{V}}$

^{10} Markov random fields (MRFs), one alternative to Bayesian networks, utilize undirected graphs, and can therefore be symmetric in the relations between variables. This is useful, e.g., in models of atomic interaction such as the Ising model and spin glasses [123]. Restricted Boltzmann machines, common building blocks in deep neural networks, are also expressed as MRFs [124, 125]. Hybrid models which combine both paradigms exist as well [126, 25].


if the joint probability distribution of $X_{\mathcal{V}}$ satisfies

f_{X_{\mathcal{V}}}(x_{\mathcal{V}}) \equiv \prod_{i \in \mathcal{V}} f_{X_i \mid X_{\mathcal{J}(i)}}\left(x_i \mid x_{\mathcal{J}(i)}\right). \qquad (77)

(Note that this is different from the factorization

f_{X_{\mathcal{V}}}(x_{\mathcal{V}}) \equiv \prod_{i \in \mathcal{V}} f_{X_i \mid X_1, \ldots, X_{i-1}}\left(x_i \mid x_1, \ldots, x_{i-1}\right) \qquad (78)

which applies to all sets of random variables.)

We see that the absence of an edge between $i$ and $i' > i$ in a Bayesian

network means that the corresponding variables are conditionally independent, given the variables in $\mathcal{J}(i')$—that is, the joint pdf factors as

f_{X_i, X_{i'} \mid X_{\mathcal{J}(i')}}\left(x_i, x_{i'} \mid x_{\mathcal{J}(i')}\right) \equiv f_{X_i \mid X_{\mathcal{J}(i')}}\left(x_i \mid x_{\mathcal{J}(i')}\right) f_{X_{i'} \mid X_{\mathcal{J}(i')}}\left(x_{i'} \mid x_{\mathcal{J}(i')}\right) \qquad (79)

when $i \notin \mathcal{J}(i')$, $i < i'$. Though they may not look like much, such independences can substantially simplify computations.^{11}

We shall see that a useful notion of localized dependences is models that have Bayesian networks on $\mathcal{V} \supseteq \mathcal{T}$ whose causal dependences, i.e., the sets $\mathcal{J}(i)$, do not grow indefinitely with $i \in \mathcal{V}$. For probability distributions that represent stationary processes, the sets $\mathcal{J}(i)$ generally have some recurring structure to them, such that the causal dependences at any point simply are translated versions of the dependences at any other point. Such a Bayesian network is known as a dynamic Bayesian network.

Although general inference in a Bayesian network is NP-hard to even approximate [128], the factorization in (77) suggests a sequential computation such that sampling or probability computation for sets of contiguous variables starting at $i = 1$ can be performed efficiently. This property is important for efficient computations with the models presented here.

5.4 Markov Processes

Possibly the simplest assumption that can be made when creating a sequence-data model is that the generating process has limited-range memory, so that only the most recent data samples are informative for predicting future behaviour, assuming the underlying model is known. This notion is formalized by the well-known Markov property

f_{X_t \mid X_{-\infty}^{t-1}}\left(x_t \mid x_{-\infty}^{t-1}\right) \equiv f_{X_t \mid X_{t-p}^{t-1}}\left(x_t \mid x_{t-p}^{t-1}\right); \qquad (80)

^{11} An important caveat is that dependences in the network change depending on whether or not the value of a variable is known, though simple algorithms exist to take this into account, e.g., [127].


Figure 1: Bayesian network representing a first-order Markov model (nodes $X_{t-1} \to X_t \to X_{t+1}$).

any process that satisfies this for all $t$ is said to be a Markov process of order $p$.

Relation (80) implies that future evolution is conditionally independent of the past, given the current context or state $x_{t-p}^{t-1}$.^{12} Given $x_{t-p}^{t-1}$, we can recursively apply (80) to find the joint probability distribution of any future samples. The state is thus what is known as a sufficient statistic for prediction. In Bayesian network terminology, we have $\mathcal{J}(t) = \{t-p, \ldots, t-1\}$. The associated between-variable dependences (and conditional independences) are represented graphically in figure 1 for $p = 1$. As a special case, one can assume that a process has no memory at all and $p = 0$, in which case the data is IID.

5.4.1 Markov Chains

Discrete-valued Markov processes are often known as Markov chains. (Some authors use this term to refer to continuous-valued processes, but this text will not.) Stationary Markov chains on finite alphabets (meaning $|\mathcal{X}| < \infty$) are extremely common as language models in speech and text modelling. A highly appealing property is that the general next-step or transition distribution $f_{X_{p+1} \mid X_1^{p}} : \mathcal{X}^{p+1} \to [0, 1]$ of these models can be specified using only $|\mathcal{X}|^{p+1}$ numbers (i.e., parameters), which is polynomial in the alphabet size. The values of $f_{X_{p+1} \mid X_1^{p}}$ are called transition probabilities.

Any finite-order Markov chain on $\mathcal{X}$ can be converted to a first-order Markov chain over the alphabet $\mathcal{X}^p$. For first-order Markov chains on finite alphabets with time-independent next-step distribution, the transition probabilities can be arranged into a transition matrix $A \in [0, 1]^{|\mathcal{X}| \times |\mathcal{X}|}$ with elements

(A)_{ii'} = a_{ii'} = f_{X_2 \mid X_1}(i' \mid i). \qquad (81)

Despite the capital font, $A$ is typically not a random quantity; the capital here merely highlights that it is a matrix. Note that some texts define transition matrices as $A^{\top}$, the transpose of our $A$.

^{12} In this view, the state is essentially a random variable which, if known, decouples the future evolution of a process from its past. As will become apparent in section 5.5, such a state need not take values on $\mathcal{X}$. The term “state space” will therefore sometimes be used to refer to a space other than $\mathcal{X}$, depending on what we mean by the “state” of the process under consideration.


The simple idea of collecting transition probabilities in a matrix is surprisingly powerful, in that many meaningful properties or relevant probabilities of Markov chains can be expressed in linear-algebraic terms using $A$. Since the elements on each row of $A$ form a probability distribution, one for example has that

A \mathbf{1} = \mathbf{1}, \qquad (82)

where $\mathbf{1}$ is a column vector of all ones.^{13} $\mathbf{1}$ is thus a right eigenvector with eigenvalue 1, which necessarily is the largest eigenvalue of $A$. Many important properties of stationary and ergodic models can then be derived from the Perron-Frobenius theorem [129, 130]. If $f_{X_t}(i)$ is a stationary distribution of the Markov chain and the vector $\pi$ has elements $\pi_i = f_{X_t}(i)$, we have

A^{\top} \pi = \pi, \qquad (83)

so $\pi$ is a left eigenvector of $A$. If this eigenvector is unique (the eigenvalue 1 has multiplicity 1) and $\pi > 0$, the Markov chain is ergodic. The spectral gap of $A$ (the difference between the moduli of the two largest eigenvalues) bounds the asymptotic decay rate towards the stationary distribution $\pi$ for a Markov chain with an arbitrary starting state.
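These linear-algebraic properties are easy to verify numerically. A small sketch (with a made-up two-state chain) reads the stationary distribution $\pi$ of (83) off the leading left eigenvector of $A$:

```python
import numpy as np

# Hypothetical two-state transition matrix; rows sum to one, cf. (82).
A = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Left eigenvectors of A are right eigenvectors of A transpose, cf. (83).
eigvals, eigvecs = np.linalg.eig(A.T)
k = np.argmax(eigvals.real)        # eigenvalue 1 is the largest
pi = eigvecs[:, k].real
pi /= pi.sum()                     # normalize to a probability vector

print(pi)        # stationary distribution; here [0.8, 0.2]
print(pi @ A)    # unchanged by one step of the chain
```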

5.4.2 Continuous-Valued Markov Models

Continuous-valued data with stationary and finite Markovian dependences can be modelled using numerous different techniques. Most prominent are perhaps standard linear autoregressive models (AR models), where new values are generated as a linear combination of recent observations plus some weakly stationary, independent zero-mean driving noise $\mathrm{E}$,^{14} i.e.,

X_t = \sum_{\ell=1}^{p} A_\ell X_{t-\ell} + \mathrm{E}_t \qquad (84)

f_{X_{p+1} \mid X_1^{p}}\left(x_{p+1} \mid x_1^{p}\right) = f_{\mathrm{E}}\left(x_{p+1} - \sum_{\ell=1}^{p} A_\ell\, x_{p+1-\ell}\right), \qquad (85)

where the $A_\ell$ are (typically nonrandom) matrices of autoregressive parameters. These are models for which linear prediction based on past values is optimal. Certain conditions apply on $\{A_\ell\}_{\ell=1}^{p}$ and the variance of $\mathrm{E}_t$ for the process to be stable and stationary. Extending the model to a nonzero mean is straightforward.

^{13} Nonnegative real matrices satisfying (82) are said to be right stochastic, which is an unfortunate term as the matrix itself need not be a stochastic quantity. Matrix-valued random variables are instead known as random matrices.

^{14} Think of this notation as “capital epsilon,” with $\varepsilon$ representing a specific realization.
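To make (84) concrete, a minimal simulation sketch (scalar case, with invented coefficients and Gaussian driving noise assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(A, T, noise_std=1.0):
    """Simulate a scalar AR(p) process per (84): X_t = sum_l A_l X_{t-l} + E_t."""
    p = len(A)
    x = np.zeros(T + p)   # zero-padded warm-up context
    for t in range(p, T + p):
        # x[t-p:t][::-1] lists the p most recent samples, newest first.
        x[t] = np.dot(A, x[t - p:t][::-1]) + rng.normal(0.0, noise_std)
    return x[p:]

# Stable AR(2) example: both roots of z^2 - 0.5 z - 0.3 lie inside the unit circle.
x = simulate_ar([0.5, 0.3], T=10_000)
```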


Autoregressive models are sometimes introduced from a geometric perspective, as methods for minimizing the mean-square prediction error. Predicting the conditional mean of the next datapoint then becomes a linear regression problem. These geometric models have straightforward probabilistic analogues where samples are perturbed (driven) by, for instance, IID zero-mean Gaussian noise $\mathrm{E}$ around the geometric predictions. Such a model leads to one additional parameter to estimate, namely the noise standard deviation, but in turn provides a generative, probabilistic model of the data. As with the Gaussian distribution examples in section 4.3, the means of the probabilistic and the geometric predictions coincide, so estimates of the remaining parameters are the same in the two interpretations.

Autoregressive moving-average models (ARMA) [131] are also very common in applications. These are a minor generalization of AR models where the noise is a moving-average process, as in

X_t = \mu + \sum_{\ell=1}^{p} A_\ell \left(X_{t-\ell} - \mu\right) + \sum_{\ell=0}^{o} B_\ell\, \mathrm{E}_{t-\ell}, \qquad (86)

$o$ being the order of the moving average, $\mu$ being the process mean parameter, and $\{B_\ell\}_{\ell=0}^{o}$ being parameters of the moving average. It is generally assumed that $B_0 = I$, the identity matrix.

AR and ARMA models are often among the first techniques a practitioner will apply to a new problem. The Wold decomposition [132] shows that any weakly stationary and ergodic process can be decomposed into a (possibly infinite-order) linear ARMA process driven by a white noise process, meaning a process which is weakly stationary and has samples which are mutually uncorrelated, though not necessarily independent or identically distributed. This decomposition can be seen as a motivation for the widespread use of truncated (i.e., finite-order) linear ARMA models in applications.

For more complex data, nonlinear approaches are generally of interest. A simple trick is to transform the data, in the hope that the process becomes more linear, for instance by taking the logarithm of data series which experience multiplicative updates. If this is not sufficient, a plethora of nonlinear models exists for the analyst to choose from. One may for example consider piecewise-linear, regime-switching approaches such as self-exciting threshold autoregressive (SETAR) models [133], where the regime is driven by a function of the context variable, or geometric methods such as (non-recurrent) time-delay neural networks for recognition [134] or kernel-based AR models for prediction [135, 136], to name a few different alternatives.

Common to all these models is the fact that they are parametric, and thus can only represent a restricted family of continuous-valued Markov processes of a given order. Somewhat surprisingly, perhaps, there exist procedures which are capable of learning much wider classes of continuous-valued Markov models, e.g., [137]. The idea, also used as a component


of thesis paper [138], is to use KDE for nonparametric estimation of the Markov transition density $f_{X_t \mid X_{t-p}^{t-1}}$.

5.4.3 Variable-Order Models

In the above treatment, the order $p$ is treated as given and fixed. In practice, it is not known what $p$ is, or even if it has a finite value. As models with low $p$ nearly always constitute special cases of models with higher $p$, one may be tempted to choose large-order models, but these have many parameters and can be difficult to learn, cf. section 6.3. A more refined approach is to treat $p$ as a parameter and attempt to estimate it, i.e., perform order selection. A review of some order-selection criteria is provided in [49]; cross-validation techniques are a popular choice in applications. See also section 6.3.

For Markov chains, there also exist methods that let the order of the model depend adaptively on the latest observations (the context): sometimes many symbols into the past have to be evaluated before the state can be uniquely determined, whereas at other times only a few of the most recent observations may suffice. This is essentially the same as grouping certain strings in $\mathcal{X}^p$ whose latest symbols are similar, and assuming that their statistics are the same. These descriptions are known as variable-order Markov models (VOMs) or variable-length Markov models (VLMMs). The order required by each context can be adaptively learned from data in various ways, for instance following [139].

Variable-order techniques are also referred to as context-tree methods, since contexts can be arranged in a tree, with observed symbols on the branches and where each leaf corresponds to a specific predicted distribution for the next sample. A more sophisticated embodiment of the same idea is the causal-state learning algorithm in [140, 141], which can compactly represent certain strictly sofic processes [142], where some contexts may have essentially infinite order. Specifically, an unbounded amount of lookback may be required for a strictly sofic process in order to uniquely identify the state, though, once a sufficiently long history is provided, the number of possible distributions for future samples is finite.

Instead of simply selecting a single order, or a single order per context, it is also possible to combine models of different orders. This results in ensemble models of various sorts, with the potential to provide greater predictive accuracy; see section 4.6. The (binary) context-tree weighting (CTW) method in [143] is an example of a weighted ensemble of Markov chains of varying order.

5.5 Long-Memory Models

For processes with long-range dependences, a Markov model may require high order, many parameters, and lots of data to provide a reasonable description.


Figure 2: Bayesian network representing a first-order hidden-state model (hidden states $Q_{t-1} \to Q_t \to Q_{t+1}$, each generating an observation $X_t$). Shaded nodes represent observables while other nodes are unobserved latent variables.

A more common approach to create models with long memory is to let the state of the process be a latent variable $Q_t$. This latent variable generates the data we see at time $t$, but is itself not directly observable. We call this a hidden-state model. Like before, we let the state follow a finite-order Markovian process, i.e., satisfy (80), so that the entire model may be specified by the two distributions $f_{X_t \mid Q_t}$ and $f_{Q_t \mid Q_{t-p}^{t-1}}$. This leads to a Bayesian network over both state and observation variables, with a dependence structure as illustrated in figure 2 for a first-order hidden-state process.

5.5.1 Hidden Markov Models and Kalman Filters

Naturally, the nature of the space $\mathcal{Q}$ on which the $q_t$-variables take values has a substantial effect on the processes that result. Discrete state spaces, in particular, lead to hidden Markov models (HMMs) [35], of which innumerable variations exist. Continuous state spaces, on the other hand, produce Kalman filters [144] and nonlinear extensions thereof, e.g., [145, 146].

HMMs are a very natural fit for connecting speech and language (text) data, as they can represent continuous-valued observations, as in a sound signal, which are generated based on an underlying discrete grammar or language over a set of phones or words. In speech modelling for recognition and synthesis, one thus takes $x$ as a representation of the speech signal, and identifies the corresponding label sequence $c$ with a sequence of HMM states $q$. This enables the use of the forward-backward or Viterbi algorithm to perform decoding for automatic speech recognition [35] or provide specific state sequences as input to control parametric speech synthesis [39].

Even though the underlying hidden state has limited-range memory in HMMs and Kalman filters, the $X$ process observations need not satisfy the Markov property. Among other things, this is evident in how the forward algorithm

f_{Q_t \mid X_{t_0}^{t}}\left(q \mid x_{t_0}^{t}\right) = f_{X_t \mid Q_t}\left(x_t \mid q\right) \sum_{q' \in \mathcal{Q}} f_{Q_t \mid Q_{t-1}}\left(q \mid q'\right) f_{Q_{t-1} \mid X_{t_0}^{t-1}}\left(q' \mid x_{t_0}^{t-1}\right) \qquad (87)

for state-inference in HMMs [35] recursively depends on all previous observations. Hidden-state models can therefore integrate predictive information over arbitrary lengths of time. This is crucial in many applications, as it makes it simple to model long-range signal structure such as, for instance, the order of sounds in speech signals or various chart patterns in financial technical analysis, using only a limited state space (possible values for the hidden-state variable). Patterns with variable durations are especially easy to represent.
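A minimal sketch of the recursion (87) (a normalized filtering variant of the forward algorithm; the two-state model parameters are invented for illustration):

```python
import numpy as np

def forward_filter(A, emission_pdf, observations, pi0):
    """Recursive state filtering in an HMM, cf. (87).

    A[i, j] is the transition probability from state i to state j,
    emission_pdf(x, q) evaluates f_{X|Q}(x | q), and pi0 is the initial
    state distribution. Each step is normalized so that alpha holds
    the filtered posterior over the current state."""
    n_states = len(pi0)
    alpha = pi0 * np.array([emission_pdf(observations[0], q) for q in range(n_states)])
    alpha /= alpha.sum()
    for x in observations[1:]:
        pred = A.T @ alpha              # sum over previous states q'
        alpha = pred * np.array([emission_pdf(x, q) for q in range(n_states)])
        alpha /= alpha.sum()            # normalize to a posterior
    return alpha

# Two-state illustration with unit-variance Gaussian emissions.
A = np.array([[0.95, 0.05], [0.10, 0.90]])
means = np.array([0.0, 3.0])
pdf = lambda x, q: np.exp(-0.5 * (x - means[q])**2) / np.sqrt(2 * np.pi)
print(forward_filter(A, pdf, [0.1, 0.2, 2.9, 3.1], np.array([0.5, 0.5])))
```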

All Markov chains can be formulated as HMMs where $f_{X_t \mid Q_t}(x_t \mid q_t) = \mathbb{I}(x_t = q_t)$, so HMMs are strictly more general than finite-order Markov chains. At the same time, it is sufficient that the underlying Markov chain is ergodic for the entire HMM to be ergodic as well [147]. Furthermore, unlike many other possible models with long-range memory, fast inference is possible for HMMs using the forward-backward algorithm [35]. (The same holds for linear Kalman filters where all variables are Gaussian [148, 149].) This combination of expressive power and computational efficiency has been central to the success of HMMs in applications, since it enables the models to be efficiently trained and refined on very large databases. Efficient algorithms are not as easily derived for multidimensional stochastic processes, e.g., on grids or lattices; more general results on graphs which admit fast inference are provided by [150, 151].

Another advantage of the discrete state in a hidden Markov model is that it, like the components in a mixture model, effectively partitions the data into different subsets, each of which may be explained by different distributions. This makes it possible to capture detailed features of the observation distribution. Unlike regular mixture models, however, HMMs also describe how different components correlate across time, creating a class of mixture models suitable for describing time series.

On the other hand, the fact that the state is not directly visible complicates learning. While the EM algorithm [76] can be used for iterative parameter estimation in HMMs, and is guaranteed to converge on a local optimum in some special cases [152, 153], the method assumes that the number of states is known a priori. This is not necessarily true in practice, and data-driven criteria for deciding on a suitable number of states are therefore of interest. The causal-state learning algorithm in [140, 141] can be seen as an example of such a method, capable of learning more general representations than finite-order Markov chains, and yielding a number of states that converges on the true number of so-called causal states (a certain information-theoretic representation of stochastic processes [154]). Thesis paper [155] extends these ideas to be more robust to noise.


Figure 3: Illustration of dependences and observables in a recurrent neural network for classification (inputs $X_t$, memory states $Q_t$, and output labels $C_t$). The model is discriminative rather than generative, so the graph cannot directly be interpreted as a Bayesian network.

5.5.2 Models Incorporating Neural Networks

Hidden state variables can also be introduced to add long memory to non-probabilistic discriminatory models, for instance by creating artificial neural networks where the activations of certain units, instead of being expressed directly in the output, are fed into the inputs of other units at the next time step. Such constructions are known as recurrent neural networks (RNN). Many variants exist, with one interesting example being long short-term memory (LSTM) networks [156], where internal network activations do not decay with time and gating units allow information to flow in or out of memory.

Probabilistic extensions of LSTMs have been shown to do well on certain time series tasks such as recognizing handwritten text [157]. Unlike naïve HMM implementations, these models do not assume a one-to-one mapping between states and recognition labels, and instead use their memory state $Q_t$ as a middle layer between $X_t$ and $C_t$, as in figure 3. This is similar to the traditional input-state-output setup of Kalman filters in control theory, e.g., [148, 149].

Other approaches use neural networks for state-to-observation mappings, but employ HMMs for the state-dynamics back end. An example is the use of MLP features for speech recognition, e.g., [158, 159], which are features based on phone activations for each frame, derived from multilayer perceptrons (a type of artificial neural network) trained to recognize phones. These features can be used in isolation or in tandem with more traditional feature representations.

In recent years, deep belief networks based on restricted Boltzmann machines (RBMs) have produced strong results in ASR when paired with an


HMM back-end [25]. This work has garnered substantial interest from the speech community. Unlike feedforward MLPs, these constructions have probabilistic interpretations and are often initialized from a probabilistic perspective. They are thus compelling to use in a generative setting, and applications of deep neural networks to speech synthesis have just begun to spring up, e.g., [160, 161, 162].

5.5.3 Hidden Semi-Markov Models

A practical shortcoming of standard HMMs is that they have very poor ability to model different duration distributions. Since the HMM only remembers its current state, and essentially flips a coin at each $t$ to decide whether it should stay in the state or transition elsewhere, the time spent in each state follows a discrete memoryless (i.e., geometric) distribution

f_D(d; a) = a^{d-1}(1 - a), \qquad (88)

where the parameter $a$ is the probability of remaining in the state and $d$ is a positive integer duration value. This distribution has a relatively large standard deviation that is coupled to the average duration, and attains its mode at the shortest possible duration $d = 1$ for all valid $a$. Natural processes such as phone durations in speech, in contrast, often show distributions peaked around a value larger than the minimal duration, and with a narrower dispersion around this value than a fitted geometric distribution would provide; see also figure 2 in [163] for an example plot of ground-truth durations from TV programming data.

One possibility to better model non-geometric state-duration distributions is to let each state also contain a model of the duration distribution of the state. Whenever a new state is entered, a duration is generated from this distribution, and the process remains in the state for exactly as many steps as the generated duration shows. Following this, the process moves to another state according to a Markov chain with time-independent transition probabilities where self-transitions are forbidden, and the procedure repeats. This creates a so-called hidden semi-Markov model (HSMM), which can describe arbitrary state-duration distributions.
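A minimal generative sketch of this procedure (illustration only; the shifted-Poisson duration model and all parameter values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hsmm_states(A, sample_duration, n_steps, q0=0):
    """Generate an HSMM state sequence: on entering state q, draw an
    explicit duration, hold the state, then transition (A has no self-loops)."""
    states, q = [], q0
    while len(states) < n_steps:
        d = sample_duration(q)             # explicit state-duration model
        states.extend([q] * d)
        q = rng.choice(len(A), p=A[q])     # move to a different state
    return np.array(states[:n_steps])

# Two states; self-transitions forbidden, so each row puts its mass elsewhere.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
# Shifted Poisson durations: mode away from d = 1, unlike the geometric (88).
mean_dur = [5.0, 12.0]
sample_duration = lambda q: 1 + rng.poisson(mean_dur[q] - 1)

q_path = sample_hsmm_states(A, sample_duration, n_steps=100)
```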

The name semi-Markov comes from the fact that the process is Markovian (in the sense that past and future are conditionally independent) given knowledge about the current state $Q_t$ and a counter $D_t$ specifying how long the process has remained in the current state. The state space is therefore effectively two-dimensional, which creates issues with applying standard algorithms efficiently. If, however, the duration distribution is log-concave (which implies unimodality), fast dynamic-programming analogues of the Viterbi algorithm exist [163]; these can be used as part of approximate EM training.


Alternatively, certain non-exponential duration distributions, e.g., the negative binomial distribution, can be represented as ordinary HMMs by creating a group of states tied to have the same output distribution, so that the H(S)MM has identical output distributions as long as the state remains in the tied group. This enables one to apply standard EM training algorithms. The duration distributions that can be represented by combining geometric distributions in this manner are known as phase-type distributions. Continuous-valued phase-type distributions are dense on $[0, \infty)$, and can therefore approximate any duration distribution well, when given a sufficient number of sub-states [164].

Hidden semi-Markov models have been used in automatic speech recognition [165], but the gains they produce there are perceived as relatively minor. For parametric speech synthesis, in contrast, the improvements in duration modelling delivered by HSMM systems are more noticeable [166], and HSMMs are used in state-of-the-art text-to-speech systems, e.g., HTS [167]. An application to segmentation of TV recordings substantially lowered frame classification error [163]. Other HSMM uses involve modelling durations of human motion patterns, for example [168] and [169].

An alternative to using hidden semi-Markov models to obtain natural duration distributions is to introduce a continuous state space (that is, use a possibly nonlinear Kalman filter), which can represent gradual, intermediate progress between states and behaviours even while remaining strictly Markovian. This approach was explored in papers [170] and [171] by the thesis author and collaborators.

5.6 Combining Models and Paradigms

Often, real-world data series exhibit both short-range correlations and long-range structure and patterns. Theoretically, all this can be represented by a discrete-state HMM, given a sufficient number of states, but in practice that number may be prohibitively large. In a nutshell, HMMs trade the limited memory range of Markov models for limited memory resolution. Instead, it is often preferable to combine the Markovian observation dependences and hidden-state approaches discussed above within a single model. The result is a network structure as illustrated in figure 4, where between-sample dependences follow a Markov process with properties determined by the hidden state.^{15} Importantly, fast inference remains possible.

Typically, the Markovian part in a model of the above type captures simple and predictable local aspects of the data such as continuity, allowing the hidden state to concentrate on more complex, long-range interactions.

^{15} More general dynamic Bayesian networks for speech recognition have been considered in [172]. There are also algorithms to learn the structure of dynamic networks from data [173].


Figure 4: Bayesian network representing a model with Markovian and hidden-state dependences. Shaded nodes represent observables while other nodes are unobserved latent variables.

In speech signal modelling, for example, autoregressive models describe correlations between analysis frames due to physical constraints on the motion of speech articulators, while the hidden states typically are based on a language model (e.g., n-grams), thus accounting for grammar and other large-scale structures of speech and language.

Joint Markovian and hidden-state structure is seen in, e.g., GARCH models [14] from econometrics, or in SETAR models driven by an unobservable Markov chain, so-called Markov switching models [174, 133]. The practice of adding dynamic features such as velocity and acceleration—used in acoustic models for speech recognition and synthesis [175], and in signature verification [176], for example—can be seen as a trick for introducing implicit between-frame correlations. However, this is mathematically inconsistent, in that it assigns probability mass to impossible feature sequences and tends to underestimate data variability [177]. More principled approaches to modelling temporal correlations such as autoregressive HMMs (AR-HMMs) [178] or trajectory models [126] yield improved model accuracy in the case of speech [177]. Thesis paper [138] can be seen as a combination of continuous-valued nonparametric Markov models similar to [137] with the discrete hidden state of HMMs. Such a hidden state provides a method to control the Markov process output, making it useful for language modelling in speech signals.

It may not be strictly necessary to explicitly introduce Markovian dependences in order to enforce output with a high degree of continuity. Papers [170] and [171] investigate ideas based on continuous-valued state spaces, where the dynamics and output mapping pdfs change continuously with state-space position.

In the context of long-memory modelling of text and other discrete data, there are methods which, rather than add a hidden state, instead create Markov-chain models without any explicit bound on the memory length, where the model order then grows with every observed symbol. Examples include context trees extended to unbounded depth [179], unbounded-length prediction by partial matching (PPM*) [180], and the sequence memoizer


[181, 182]. All of these can be viewed as ensemble methods, combining predictions based on Markov models of different orders. In contrast to the other proposals, the sequence memoizer is Bayesian and based on Pitman-Yor processes [183], which are capable of replicating the power-law-like behaviour exhibited by many text sources. (We say power-law-like, as it turns out that many sources claimed to be governed by power laws in the literature show significant differences from true power-law distributions [184].) A shortcoming of creating and combining Markov-chain models of unbounded order is that the computational complexity of training on a given sequence often becomes quadratic in the sequence length.

5.7 Finite Sequences

For processes that are not bi-infinite, a distribution over the starting state ($X_1$ or $Q_1$) must be specified. To ensure stationarity, this $f_{Q_1}$ must satisfy

f_{Q_2}(q) = \int f_{Q_2 \mid Q_1}\left(q \mid q'\right) f_{Q_1}\left(q'\right) dq' \qquad (89)
           = f_{Q_1}(q) \qquad (90)

for all $q$. If the process is ergodic, $f_{Q_1}$ is uniquely determined by the conditional next-step distribution $f_{Q_2 \mid Q_1}$. For Markov chains, this is the stationary distribution vector $\pi$ from section 5.4.

Finite-duration Markov chains and HMMs can be formed by adding an

attracting state (a state that only transitions to itself) associated with a “no output” symbol. By starting in a different state than the attracting one, transient behaviour results. Even if the resulting model is not stationary in the ordinary sense of the word—for one thing, $f_{Q_1}(i) \neq \pi_i$ for several $i$—it can still be governed by an unchanging (time-independent) next-step distribution

f_{Q_t \mid Q_{t-1}}\left(q \mid q'\right) \equiv f_{Q_2 \mid Q_1}\left(q \mid q'\right). \qquad (91)

This introduces similar advantages for learning as conventional stationarity does in infinite-duration models. Finite-duration models of this type are commonly used to represent single speech utterances.

5.8 Other Approaches

Not all techniques for working with sequence data fit neatly into the above state-based framework. Early efforts in speech recognition were based on templates or exemplars; incoming speech was time-stretched or compressed through dynamic time warping [185, 31] such that a maximal geometric similarity to each reference example could be computed. Hidden Markov


models, popularized in part by a timely 1989 tutorial [35] by Rabiner, provided a more principled method for assigning alignment and similarity costs, and have largely superseded time-warping methods.^{16} Under the surface, however, both paradigms use very similar dynamic programming techniques to compute how well duration-variable speech matches a given hypothesis.

In speech synthesis, HMM-based methods show substantial flexibility, and can adapt their voice to recreate different speakers and affective states. Higher naturalness, however, is generally provided by concatenative or unit selection synthesis approaches (cf. [187]), where fast search algorithms are used to find fitting signal segments from a speech database, segments which then are concatenated to express a desired target phrase. These techniques can be seen as geometric methods, as they are based on non-probabilistic target and join costs. (The target cost measures how well a segment expresses the desired phone or phone sequence in context, and the join cost how well the segment matches neighbouring segment candidates.) Unit selection approaches have some connections with nonparametric methods, in that the different possibilities that the method can express are explicitly linked to the training database and its size. (They are not ensembles, however, since only a single model—a single realization even—is used at any given point in time.)

There are also problem-solving techniques that do not model or treat the datastream on a sample-by-sample or frame-by-frame basis. One such approach to classification, and specifically speech recognition, is to use detectors—small functions, each of which is designed to be sensitive to the occurrence of certain events in a datastream. In ASR, relevant events may be the occurrence of speech building blocks such as phones or simple words.

A set of detectors continuously scans the incoming data, and if the probability is sufficiently large for a specific event to have occurred at some $t$, a detection is flagged for that time for the detector in question. Subsequent processing tries to make sense of the set of detections, which is not a sampled time series in any typical sense, nor is it modelled as such. An example of a detector framework of this kind is [188]. An alternative view grounded in (continuous-time) point processes can be found in [189].

Detectors can also be utilized in a more traditional time-series approach: if the detection probabilities for the set of detectors for each frame are arranged in a vector, a sequence of detector-based feature vectors is obtained. This data can be described with traditional time-series models. The use of MLP features in speech recognition [159], already mentioned in section 5.5, is an example of this approach.

^{16} This is not to say that geometric methods are out of the picture; [186] provides an innovative recent SVM-based discriminative model for ASR.


6 Solving Practical Problems

So far, we have introduced the most common tasks involving sequence data, as well as the mathematical models and techniques used for solving them. In this section, we summarize how everything can be put together and discuss some important considerations for successful applied problem solving; [114] provides another view of several key issues in practical machine learning.

We begin by outlining a typical protocol followed when approaching a new problem (section 6.1). Thereafter (in section 6.2), feature engineering, the step that adapts the data to a form suitable for modelling, is considered. The phenomenon of overfitting, perhaps the most famous nemesis of data analysts and statisticians, is explored subsequently (section 6.3). Another notable practical problem is mismatch between training and application; we discuss (in section 6.4) some methods, falling partially or wholly outside traditional parameter estimation, to enhance models for specific applications.

6.1 Solution Procedure

In solving a practical problem with the assistance of sequence data, there are a few standard steps:

1. Define the problem by deciding on a specific quantitative performance measure to optimize.

2. Gather data. This can, e.g., entail recording speech data or obtaining a suitable text corpus, but also transcribing speech recordings or annotating corpora (i.e., adding labels to the data).

3. Create a feature extractor. This is a very important step, especially for supervised learning, in which the data is transformed to a form that is more suitable for modelling. An overview of the considerations involved is given in the next section.

4. Specify a model for the feature data. Many of the distinguishing characteristics of different model paradigms have been covered previously, and will not be repeated here.

5. Estimate the model parameters based on training data. Even nonparametric methods often have tuning parameters that need to be adjusted. Sometimes the procedure is not as simple as just applying the standard parameter estimation techniques from section 3.6, but may also involve additional processing as discussed in section 6.4 to improve performance.


6. Estimate performance. Since the apparent performance on the training data can be significantly misleading, this is usually done on held-out data not used during training, a form of cross-validation, but other methods also exist.

7. Tweak the feature extractor, the model, and possibly other processing steps until system performance is satisfactory.

8. Apply the resulting solution model to the original problem.

A few of these steps have been discussed in previous sections, particularly the parts that relate to model building. We now turn to discuss some of the surrounding aspects, such as feature creation and various types of preprocessing and postprocessing to adapt to different situations. While crucial to successful problem solving, these tend to be highly application dependent and are frequently difficult to put on an objective basis. Although a substantial amount of work has been done to put several of these considerations on a firmer theoretical footing, the fact remains that successful problem solving requires a combination of art and science.

6.2 Feature Engineering

Especially in supervised learning problems, it is quite uncommon to model the raw input data directly. Often, the feature data that the model describes is a processed version of the original input. The process that transforms the raw input to suitable features is known as feature extraction, and the process of designing such extractors is known as feature engineering. This is generally the most application-dependent aspect of any machine-learning application, as once the data is in a suitable form, one can just apply one of the many available mathematical models to describe the situation and obtain a solution to the problem.

In classification, feature extraction is often considered a one-way transformation, but to perform synthesis it is important that the feature data can be inverted to recover a signal in the original input domain. It is not necessary for this inverse to be unique, just that some reasonable “pseudoinverse” in the original input space can be selected.

6.2.1 Information Removal

Feature extraction generally has two goals:

1. To remove information in the data that is redundant or of no or little value to the task at hand. This yields a lower-dimensional description that requires fewer bits and is faster to process.

2. To transform the remaining information into a form that exposes the structure of the problem and allows it to be modelled efficiently.


To see why the first objective, information removal, can be beneficial, one can think of CD-quality speech signals. These have a very high bitrate, but many frequencies may be fundamentally inaudible or masked by other sounds at different times—this is why lossy compression of audio can be so efficient [190]. Obviously, the human-audible signal aspects alone suffice for high-accuracy speech recognition, as evidenced by human recognition performance on lossily compressed speech data. Furthermore, many perceptible vocal cues such as pitch may convey emphasis and speaker emotional state, and facilitate turn-taking, but are not of central importance for recognizing the words themselves (though pitch can be of importance in tonal languages such as Mandarin).

In situations where the experimenter does not know a priori what information to keep, feature extractors can be made to incorporate unsupervised dimensionality-reduction techniques such as principal component analysis (PCA) or factor analysis [191] to discard information while retaining most of the empirical variability. The use of autoencoders in deep neural networks [192] can be seen as an approach for performing nonlinear dimensionality reduction. There are also dimensionality-reduction techniques utilizing labelled data, e.g., [193], which may be better at selecting information aspects relevant for specific supervised tasks. Though PCA and the concept of dimensionality reduction often are motivated on geometric grounds, there exist probabilistic extensions as well, for instance probabilistic PCA (PPCA) [194].
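As a small sketch of PCA-style dimensionality reduction (a standard SVD formulation, included purely as an illustration):

```python
import numpy as np

def pca_reduce(x, n_components):
    """Project data onto its n_components leading principal directions."""
    x_centered = x - x.mean(axis=0)
    # Right singular vectors of the centred data are the principal axes.
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:n_components].T

# Synthetic 10-D data whose variability lies mostly in a 3-D subspace.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10)) \
    + 0.1 * rng.normal(size=(500, 10))
z = pca_reduce(x, n_components=3)   # compact feature representation
```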

6.2.2 Exposing Problem Structure

The second objective of feature extraction is to expose the structure of a problem in a way that can be described and learned efficiently. (This assumes that there is some kind of structure to the problem; otherwise learning is impossible, as discussed in section 3.5.) One way to think about this is to consider where the features fall in the high-dimensional space of the raw input—typically, certain values go together, so that the data lies on a low-dimensional manifold in the high-dimensional space. The feature extraction is then essentially the task of extracting meaningful, or at least useful, coordinates on this manifold. It is easy to see that coordinates based on the structure of the problem can be expected to be helpful, e.g., for geometric methods. In the case of discrete-valued data such as text, the situation is a little less straightforward to visualize, but elements of this reasoning still remain.

A subtask of exposing the structure of the data is to separate independent or uncorrelated aspects of the material. This can make a significant difference in the number of parameters that need to be estimated: compare the diagonal covariance matrices appropriate for Gaussian-distributed vectors with independent elements against the general covariance matrix,


which has $D(D+1)/2$ degrees of freedom, growing as $D^2$. ($D$ here denotes the feature-vector dimensionality; naturally, dimensionality reduction will reduce the parameter set as well.)

6.2.3 Finding Fitting Features

Solving a learning task efficiently requires a synergy between features and models: the distribution and behaviour of the features should match the assumptions made by the model. If, for instance, the features as designed are naturally Gaussian distributed, then a Gaussian model may be expected to be appropriate. If feature-vector components are independent, Gaussian distributions with diagonal covariances or other models that assume independence may be used; otherwise correlated Gaussians or GMMs may be required. Sometimes, it can be necessary to apply mathematical transformations to adjust the range of the data and express the data in a natural and well-behaved way.

Often, nature can be used as a source of inspiration in creating effective feature extractors. Take speech signals as an example: Mel-frequency cepstrum coefficients (MFCCs), common in speech recognition [195], both incorporate a short-term Fourier transform (similar to the effect of the human cochlea) as a step to decorrelate the feature data, and remove inaudible aspects based on a psychoacoustic frequency-discrimination scale. The perceptual theme can be used to reduce feature-vector dimensionality even further, based on models of human hearing [196]. Features related to MFCCs are also used in speech synthesis, though somewhat different and augmented by pitch tracking and other feature data, e.g., [40], in order not to sacrifice quality by discarding too much information. (See also the ASR versus synthesis comparison in [175].)

The auditory frequency scales used in MFCCs and similar representations, originally introduced based on the human auditory system and psychoacoustics, have recently been related to mathematical invariance and stability properties under frequency transposition and other perturbations in so-called wavelet scatterings [197]. This interesting line of work also connects with modulation features, another biologically-linked speech feature [198], and with cascades of nonlinear processing units seen in convolutional deep belief networks [199].

6.3 Bias, Variance, and Overfitting

Apart from selecting features and models that go well together, an important practical concern is choosing a model family of adequate complexity. Theoretical results show that methods such as maximum likelihood are asymptotically consistent if the data-generating distribution is an element in M. This seems to suggest that model families should be chosen as inclusive as possible, to ensure that they contain the "true generating model," or a close approximation thereof. This intuition is however very wrong in finite-sample situations. In fact, even if the data were generated by a model in M (which is so unlikely in an application as to be practically impossible), a simplified model might, surprisingly, give better predictions both in supervised and unsupervised settings. The task of the analyst is thus not to determine the theoretically correct parametric family, but to propose a set of models M which yields a good solution to the task, given the available data. This was succinctly articulated by George E. P. Box in his famous quotation that "all models are wrong, but some models are useful" [200].

The problem of choosing an appropriate model family is compounded by the fact that large, complex model families M often are able to fit the training data well, even when their actual explanatory value is very poor. This phenomenon is known as overfitting, and relates to the fact that a model with many parameters may be capable of fitting all the random peculiarities of the data, whereas a simpler model is forced to choose a shape which captures the broad tendencies of the material, in order to fit the data as well as possible. Simpler models therefore often generalize better to unseen instances, and thus provide more reliable solutions to the problem at hand.
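A minimal numerical illustration of this effect (not from the thesis; the sinusoidal toy data and the polynomial degrees are arbitrary): fitting polynomials of increasing degree to a small noisy sample drives the training error down while the error on held-out data eventually grows.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(1000)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The flexible model fits the training set ever better, but its
    # held-out error reveals that it has captured random peculiarities.
    print(f"degree {degree}: train MSE {mse_train:.3f}, "
          f"test MSE {mse_test:.3f}")
\end{verbatim}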

In section 4.3 we mentioned how the apparent training-data performance is biased towards overly optimistic values, simply because one chooses the model that maximizes this value. This bias becomes worse the more different models we choose from (i.e., the larger M is) [93], which essentially is why overfitted models are so easily selected by the unwary analyst. There exist analytical methods to compensate for the inherent bias in the training-data performance of large models, e.g., Akaike's information criterion (AIC) [201] or the Bayesian information criterion (BIC) [202],17 but cross-validation as touched upon in section 5.4 is possibly the most common approach in practice.
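In their usual statements, the two criteria are simple functions of the maximized log-likelihood, the number of free parameters k, and (for BIC) the number of datapoints n; a small sketch of the standard definitions follows (the numbers in the example are made up):

\begin{verbatim}
import math

def aic(log_likelihood, k):
    """Akaike's information criterion: 2k - 2 log L-hat (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k log n - 2 log L-hat."""
    return k * math.log(n) - 2 * log_likelihood

# A richer model must improve the fit enough to offset its complexity
# penalty before either criterion prefers it.
print(aic(-1200.0, k=5), aic(-1195.0, k=12))
print(bic(-1200.0, k=5, n=1000), bic(-1195.0, k=12, n=1000))
\end{verbatim}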

Fully Bayesian methods as in section 4.4 lend some protection against overfitting, since the distribution is computed as the average over all plausible models, rather than picking the single model that looks most appealing. If the Bayesian approach is uncertain about which model is most appropriate, the set of models considered probable will be broad, and one may expect the predicted distribution to be smoothed out and less sensitive to stochastic variation than a distribution based on a point estimate. Nonparametric modelling can be seen as another method to combat overfitting, by gradually growing the model set as more data becomes available, letting model complexity adapt to dataset size.

17 As an aside, it can be proved that AIC is an inconsistent estimator of the true order [203, 204], tending to overestimate the order of Markov chains, whereas BIC, in contrast, is consistent. This is often a bit of a red herring, however, since models of fixed, overestimated order generally learn to let higher-order terms go to zero, and thus also converge on the correct predictor in the limit; hence, consistent order estimation is not necessary for consistent distribution estimation.

For any technique we use, there are limits on how much data is sufficient or necessary to have a high probability of learning a well-performing model. More specifically, the excess risk in classification (i.e., the difference between the average loss of a learner and the lowest achievable average loss) can be bounded using results from computational learning theory [66]. One important tool in such efforts is Hoeffding's inequality, which puts bounds on the difference between the empirical and the true probability of an event, and thus can be used to relate the expected generalization performance to the observed error rate.
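For reference (a standard statement of the inequality, not taken from the thesis): for the empirical frequency $\hat{p}_n$ of an event in $n$ i.i.d. observations with true probability $p$, Hoeffding's inequality gives
\[
P\bigl(\lvert \hat{p}_n - p \rvert \ge \varepsilon\bigr) \le 2 e^{-2 n \varepsilon^2},
\]
so that $n \ge \ln(2/\delta)/(2\varepsilon^2)$ samples suffice for the empirical frequency to be $\varepsilon$-accurate with probability at least $1-\delta$.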

Mathematically, the task of selecting a proper model (family) complexity can be expressed as a so-called bias-variance trade-off. Simple model families have a large asymptotic bias (systematic error), in that their best model may never be close to the generating model. The optimal parameters, being few, can, however, be estimated accurately from modest amounts of data. Complex M, on the other hand, typically have the potential to come closer to the true distribution, but as there are many models to choose from, more data will be needed to get there. The selected models therefore have a strong dependence on random properties of the sample data, and show a lot of variation (random error) around the most appropriate model.

To make the bias-variance trade-off more concrete, consider a case where the estimate $\hat{\theta}$ of an unknown parameter $\theta$ satisfies $\hat{\theta} \sim \mathcal{N}(\mu,\, \sigma^2)$. The mean square error of the estimate is then
\begin{align}
E\left(\bigl(\hat{\theta} - \theta\bigr)^2\right)
&= E\left(\bigl(\hat{\theta} - \mu + \mu - \theta\bigr)^2\right) \tag{92} \\
&= (\theta - \mu)^2 + \sigma^2, \tag{93}
\end{align}
which elegantly decomposes into a term relating to the systematic bias, and a term relating to the random uncertainty. Both bias and variance have to be small for the total expected error to be small. An example where the above trade-off can be used explicitly in model selection is bandwidth selection for KDE, where the asymptotic mean integrated square error (AMISE) takes the form of a sum of a bias and a variance term [59]. Bias-variance-type trade-offs also exist for other loss functions, e.g., 0-1 loss [52].
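For completeness (a standard result from the KDE literature, e.g., [59], stated here in one common form rather than taken from the thesis): for a univariate KDE with kernel $K$ and bandwidth $h$, writing $R(g) = \int g^2(x)\,dx$ and $\mu_2(K) = \int x^2 K(x)\,dx$,
\[
\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4}{4}\,\mu_2^2(K)\,R(f''),
\]
where the first term is the variance (decreasing in $h$) and the second the squared bias (increasing in $h$); minimizing over $h$ yields the familiar $h^{\star} \propto n^{-1/5}$ rate.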

6.4 Preprocessing and Postprocessing

In applications of machine learning, speech, or language technology to real life, there is often a mismatch between how a model is created and trained, and how it later is applied. An example is the profusion of different noise environments, speaker accents, and other signal degradation sources that a general speech recognizer must face without having seen them all in the lab.


The mismatch between training and application is challenging to predict through theory, and often has to be compensated for by the analyst, using methods external to the standard estimation and decision procedures. Frequently, as we shall see in this section, this compensation occurs either as a data preprocessing (typically as part of the feature extraction stage) or as a kind of postprocessing after model selection or parameter estimation has been performed.

6.4.1 Domain Adaptation

A canonical example of how training data and application can differ is that models often are trained on data from a subset of individuals or cases, while the input data in subsequent applications typically comes from other individuals or cases, having a different probability distribution. As a matter of notation, the data is said to come from the source domain, with the target domain being the intended application; methods to mitigate problems due to the mismatch are collectively referred to as domain adaptation or transfer learning [205].

Feature extraction, as covered in section 6.2, can be seen as a form of preprocessing. Certain types of feature normalization can, for instance, make a speech recognizer more robust to variations in the speaker's voice and the ambient acoustic environment. An example is cepstral mean normalization (CMN), e.g., [206], where each utterance feature sequence is translated to have mean zero. This moves both training and test data to a common region in feature space, and constitutes an example of feature-based domain adaptation. A simple extension is to normalize both mean and variance. A preprocessing step to correct spelling errors could be seen as a method for reducing uneven performance when working with texts written by different individuals. Another notable feature-based domain adaptation scheme from NLP which takes the form of a preprocessing step is described in [207].
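A minimal sketch of CMN (assuming features arrive as a NumPy array of frames; the variance-normalizing extension mentioned above is included as an option):

\begin{verbatim}
import numpy as np

def cepstral_mean_normalize(features, normalize_variance=False):
    """Translate an utterance's feature sequence to zero mean per dimension.

    features: array of shape (num_frames, num_coefficients).
    With normalize_variance=True, each dimension is also scaled to unit
    standard deviation (mean and variance normalization).
    """
    out = features - features.mean(axis=0)
    if normalize_variance:
        out = out / (out.std(axis=0) + 1e-12)  # guard against zero variance
    return out

# Stand-in "MFCC" frames with a constant channel offset:
utterance = np.random.randn(200, 13) + 5.0
print(cepstral_mean_normalize(utterance).mean(axis=0).round(6))  # ~0
\end{verbatim}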

If the mismatch is simply due to uneven sampling, where cases that are commonly encountered in practice are less common in the training database, this can be overcome by selecting or weighting datapoints prior to training [208], a type of instance-based domain adaptation. This requires retraining (picking a new model) every time the domain changes, however, and cannot compensate for cases where application-relevant training examples are not merely scarce, but absent altogether.
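The weighting idea can be sketched as follows (a hypothetical one-dimensional example assuming NumPy and SciPy, and assuming the source and target densities are known; the estimator is a standard importance-weighted average in the spirit of [208]):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Training data are drawn from the source domain...
x_source = rng.normal(loc=0.0, scale=1.0, size=5000)

# ...while the application (target domain) follows a shifted distribution.
p_source = norm(loc=0.0, scale=1.0)
p_target = norm(loc=1.0, scale=1.0)

# Weights w(x) = p_target(x) / p_source(x) make weighted statistics of
# the source sample mimic the target domain.
w = p_target.pdf(x_source) / p_source.pdf(x_source)

print(np.mean(x_source))                # ~0.0 (source-domain mean)
print(np.average(x_source, weights=w))  # ~1.0 (target-domain mean)
\end{verbatim}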

A third option is model-based domain adaptation, where the trained model is adjusted to fit the application. This can, for instance, be done by re-estimating certain parameters, e.g., vocal tract length normalization (VTLN) in speech recognition [209, 210], or by transforming means and variance parameters in HMMs [211]. This is thus a kind of model postprocessing.


6.4.2 Smoothing

Even if we do not know any specifics about the target domain, models can be estimated and processed in ways that typically increase performance in applications. A basic, but important, example of this is model smoothing, which is of importance in language modelling for speech recognition, as well as in other NLP tasks [212].

Smoothing addresses a problem with maximum likelihood estimation for categorical distributions, for instance in language models, which assigns zero probability to any event not observed in the training data. Such an event could be a unique word or word combination (n-gram). Paradoxically, extremely uncommon events can occur quite often in practice: for large text corpora, for instance, it is typical that about half the words in the corpus only occur once, so-called hapax legomena. Even if a previously unseen word combination occurs only once in an entire sequence of test data, the probability of the entire sequence will still be estimated to be zero. The model therefore rejects many perfectly valid texts, which is known as the zero-probability or zero-frequency problem [213].

The idea of smoothing is to broaden the support of the estimated distribution, so that events previously considered impossible now have some probability of occurring. This can be seen as a recognition of the fact that practical diversity is greater than the variability found in the training data. A simple scheme for categorical data is to add an initial count of one, or some other number, to every bin (possible observation), e.g., [214]; these pseudocounts, together with the empirical frequencies from the data, are used to determine the estimated distribution. The pseudocounts can be interpreted as a Bayesian prior, where the experimenter a priori creates "fake observations" to indicate that all elements in X have a real possibility of occurring. Jeffreys prior, in particular, amounts to adding a count of 1/2 to every bin.
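A minimal sketch of such pseudocount (additive) smoothing, cf. [214] (the tiny corpus and vocabulary are made up; alpha = 1 gives add-one smoothing and alpha = 1/2 the Jeffreys-prior variant mentioned above):

\begin{verbatim}
from collections import Counter

def smoothed_probabilities(counts, vocabulary, alpha=1.0):
    """Additive smoothing: p(x) = (count(x) + alpha) / (N + alpha * |X|)."""
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {x: (counts.get(x, 0) + alpha) / total for x in vocabulary}

counts = Counter(["the", "cat", "sat", "the"])
vocab = {"the", "cat", "sat", "mat"}  # "mat" was never observed

p = smoothed_probabilities(counts, vocab, alpha=1.0)
print(p["mat"])         # > 0: unseen events are no longer impossible
print(sum(p.values()))  # 1.0: still a proper distribution
\end{verbatim}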

Smoothing to broaden the support of estimated distributions can also be performed after model estimation, as a postprocessing step. For uncommon word combinations, another approach is to interpolate between a high-order Markov (n-gram) model and a lower-order model [215], a kind of model combination.
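The simplest form of such interpolation can be sketched as follows (cf. [215]; here between a bigram and a unigram model, with the mixing weight lam a hypothetical fixed constant that would in practice be tuned on held-out data):

\begin{verbatim}
def interpolated_bigram(w_prev, w, bigram_counts, unigram_counts, lam=0.7):
    """p(w | w_prev) = lam * p_bigram(w | w_prev) + (1 - lam) * p_unigram(w)."""
    n_prev = unigram_counts.get(w_prev, 0)
    p_bi = bigram_counts.get((w_prev, w), 0) / n_prev if n_prev else 0.0
    p_uni = unigram_counts.get(w, 0) / sum(unigram_counts.values())
    return lam * p_bi + (1 - lam) * p_uni

unigrams = {"the": 2, "cat": 1, "sat": 1}
bigrams = {("the", "cat"): 1, ("cat", "sat"): 1, ("sat", "the"): 1}

# An unseen bigram still receives nonzero probability via the unigram term:
print(interpolated_bigram("the", "sat", bigrams, unigrams))  # 0.075
\end{verbatim}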

The opposite situation of the zero-probability problem, when a selected model assigns much probability mass to observations that are very unlikely to be generated by the true process, can be an issue in generative tasks. Speech sampled from state-of-the-art parametric speech synthesis models, for instance, sounds warbly and bubbly in a manner human speech does not [177]. Evidently, most of the probability mass is assigned to sounds which are radically unlike natural speech. Texts generated from Markov chains are another example: while the words and local word combinations may be standard, the text as a whole is generally incoherent in a manner atypical for texts of human origin.

The above failings can be seen as fundamental shortcomings of the models, being unable to capture and represent important sources of variation and instead interpreting these as random noise. For want of better models, the issues can be lessened by reducing the support of the output distributions after training to concentrate on the most probable and representative observations. Speech synthesis using MLPG [56], for instance, takes this to the extreme, by only generating the single most probable output element. An information-theoretic scheme for continuously selectable levels of this type of "sharpening" (as an opposite of smoothing) is the topic of thesis paper [55], some ideas of which were first published in [216].

7 The Thesis Research

We now move our focus towards the specific research contributions contained in this thesis. To start, some context is provided (section 7.1), explaining why research on speech and language models remains relevant. Thereafter an overview of the different research papers and their objectives is presented (section 7.2), followed by a list of main contributions of the individual papers and some brief words on future topics to explore (in section 7.3).

7.1 Thesis Context

A common theme of machine learning and artificial intelligence has been that machines can do things that no human can, but many things that humans find easy are very difficult for a computer. Speech and language are a good example of this divide. While computers can record, edit, store, transmit, and reproduce text and sound in ways that could hardly have been imagined just fifty years ago, we still cannot interface with machines through speech and text nearly as naturally as we do with human beings.

That speech and natural language technology falls short in important areas is not for lack of trying. Massive research efforts have been directed towards automatic speech recognition, text-to-speech systems, machine translation, and similar applications, and have produced numerous insights and breakthroughs big and small. Today, tools and technologies exploiting these efforts are widely deployed, but remain held back from realizing their full potential by lingering limitations in the underlying science. The hunt is thus still on for better models which capture the true essence of speech and language, or at least improve our modelling capacity to deliver solutions that improve on the state of the art.


Paper                    A              B              C                D
Core technique           GPDMs          CSSR/RCS       MERS             KDE-HMMs
Type                     Generative     Generative     Generative       Generative
Observations             Continuous     Discrete       Any              Continuous
Application              Synthesis      Language       Synthesis,       Synthesis
                                        acquisition    denoising
Has Markov dependence?                  ✓              ✓                ✓
Long-memory mechanism    Continuous     Unbounded      Unbounded        Discrete
                         state          order          order/none       state
Nonparametric            ✓              ✓                               Yes and no
What's new?              Application    Criterion,     Algorithm,       Model,
                                        algorithm      application      algorithm

Table 1: Overview of the ideas presented in the thesis papers. Tick marks mean "yes" while empty cells signify a "no." Further explanation and discussion is provided in the text.

7.2 Overview of Work

We need better speech and language models, and the research contained in this thesis represents efforts to identify and examine promising candidates. In one way or another, each paper in the thesis demonstrates a method for overcoming limitations of entrenched modelling paradigms in particular application scenarios. Often, but not always, this is accomplished by investigating model classes with greater descriptive power than the standard methods, e.g., through nonparametric techniques, or Markov chains with potentially unbounded memory.

The majority of the thesis papers focus on synthesis applications, but paper B considers a method primarily suitable for NLP-related recognition tasks (e.g., named-entity recognition) or pattern discovery for language acquisition. Papers A and D, meanwhile, both look to overcome shortcomings in traditional HMM-based acoustic models used in parametric speech synthesis, either through applying a more powerful and appropriate state-space representation (replacing the oversimplification of a discrete, one-dimensional phonetic state), or by developing novel, controllable and consistent nonparametric Markov models for continuous-valued processes (which may capture speech dynamics better than the linear Markov models currently in use). Paper C, finally, considers how we best can make do with the imperfect models we have, with results that are applicable to continuous as well as discrete-valued sources, i.e., both speech and text.

To illustrate how the research efforts included in the thesis connect, and where they differ, table 1 provides a structured comparison of the four thesis papers. While the table is largely self-explanatory, a few comments are in order: The "Markov dependence" row refers to whether or not causal dependences between observations exist if all latent variables (if any) are given. GPDMs show that a hidden, continuous-valued state suffices to generate smoothly changing output, without additional between-observation Markovian dependences, though paper A conjectures that introducing such dependences (which is straightforward using dynamic features) may further improve synthesis quality. "Unbounded order" long memory in the table means that the most straightforward way to think about the technique in question is as a Markov process of possibly unbounded or infinite order, at least for certain contexts. MERS can be solved explicitly for (continuous-valued) Gaussian processes with arbitrary power spectral densities (meaning infinite order), but in the case of discrete-valued processes only finite-order Markov chains are considered in the paper. KDE-HMMs have a set of nonparametric Markov processes that generate the observations, but the models switch between the different processes based on a parametric hidden Markov chain; these models therefore contain both parametric and nonparametric elements.

7.3 Summary of Contributions and Outlook

To provide some more detail on the individual research efforts, the main achievements of the thesis papers can be summarized as follows:

• Paper A: The first (to our knowledge) application of Gaussian process dynamical models (GPDMs) [217, 218] to speech generation. GPDMs are a class of nonparametric dynamic models where autoregressive state dynamics $f_{Q_t|Q_{t-1}}$ and state-conditional output distributions $f_{X_t|Q_t}$ are given by Gaussian processes [27]. It is shown that GPDMs produce more natural output for voiced speech than do comparable hidden Markov models, both under sampling and MLPG. This is attributed to the ability of the continuous state-space to represent gradual transitions and recreate natural phone durations. Additional state-space dimensions are shown to enable the representation of prosodic variation.

• Paper B: An algebraic criterion for deciding when the causal-state representation of a process can be learned by the causal-state splitting reconstruction (CSSR) algorithm from [140, 141]. The criterion is applied to show that CSSR cannot learn a simple two-state process when disturbed by non-zero probability substitution noise. A specification of how to apply the criterion algorithmically is also provided.

• Paper B: A modified CSSR algorithm, dubbed robust causal states (RCS). It is proved that, unlike CSSR, the algorithm is capable of recovering the underlying causal states of a process also in the presence of noise (substitutions, deletions, and insertions). The conclusions are supported by simulation data.

• Paper C: A general information-theoretic scheme, minimum entropy rate simplification (MERS), for simplifying stochastic processes by concentrating on the most representative observations. Inspired by rate-distortion theory in source coding [54], MERS is formulated as an optimization problem over a set of stochastic processes, selecting the process with the least entropy rate, subject to a constraint on the permitted dissimilarity from the original reference process; a schematic formulation is given after this list. A multitude of different process classes and dissimilarity measures can be used.

• Paper C: Explicit solutions to the MERS problem for Gaussian processes under different dissimilarity measures and constraints on the process under consideration. The solution formulas also apply to entropy maximization under similar constraints.

• Paper C: An explicit solution formula to the MERS problem for first-order Markov chains under a reverse KL-divergence dissimilarity. The solution also applies to entropy maximization under similar constraints. Experiments demonstrate the results of MERS on Markov chains, including text generation and an application to denoising a model trained on disturbed data, where MERS outperforms an alternative scheme based on thresholding.

• Paper D: KDE-HMMs, a new signal model extending asymptotically consistent continuous-valued Markov processes based on kernel density estimation (a construction likely first considered in [137]) with a discrete hidden Markov state. This is promising for modelling the trajectories of acoustic parameters in speech (which have proved difficult to describe well enough to generate natural-sounding speech [177]), with the added latent state enabling synthesis output to be controlled.

• Paper D: Guaranteed-ascent generalized EM-based parameter update formulas for KDE-HMMs, as well as relaxed update formulas which allow faster convergence in practice. Experiments show that KDE-HMMs can be trained using the relaxed formulas to achieve superior prediction performance on nonlinear time series, compared to several reference models.
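Schematically (a restatement of the verbal description of paper C above, not the paper's exact notation), MERS selects
\[
\widehat{Q} = \operatorname*{arg\,min}_{Q \in \mathcal{Q}} \bar{H}(Q)
\quad \text{subject to} \quad d(Q, P) \le D,
\]
where $P$ is the reference process, $\mathcal{Q}$ the set of candidate processes, $\bar{H}$ denotes entropy rate, $d$ is the chosen dissimilarity measure, and $D$ the permitted dissimilarity.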

In-depth information can be found in the papers themselves, which are included in part II below.


Of course, many threads remain to be explored. To name a few items: the acoustic models of paper A need to be extended to handle discontinuities and to allow the synthesis of arbitrary speech utterances. The learning techniques in paper B are not yet practical in large-scale scenarios. Work on minimum entropy rate simplification from paper C may investigate new applications and additional dissimilarity measures. And the models of paper D, despite their many appealing properties, have yet to be adapted to perform nonparametric speech synthesis. These are but a few of many possible research tasks still ahead of us in speech and language modelling.

References

[1] D. Hilbert, "Mathematische Probleme," Nachrichten von der Königl. Gesellschaft der Wiss. zu Göttingen, pp. 253–297, 1900.

[2] A. N. Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung. Berlin: Julius Springer, 1933.

[3] T. J. Ross, Fuzzy Logic with Engineering Applications, 3rd ed. Chichester, UK: John Wiley & Sons, 2010.

[4] E. T. Jaynes, Probability Theory: The Logic of Science, G. L. Bretthorst, Ed. Cambridge, UK: Cambridge University Press, 2003.

[5] R. T. Cox, "Probability, frequency, and reasonable expectation," Am. Jour. Phys., vol. 14, pp. 1–13, 1946.

[6] ——, The Algebra of Probable Inference. Baltimore, MD: Johns Hopkins Press, 1961.

[7] G. U. Yule, "On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers," Phil. Trans. R. Soc. A, vol. 226, pp. 267–298, 1927.

[8] S. Vaughan, "Random time series in astronomy," Phil. Trans. R. Soc. A, vol. 371, no. 1984, 2013.

[9] M. E. Mann, R. S. Bradley, and M. K. Hughes, "Northern hemisphere temperatures during the past millennium: inferences, uncertainties, and limitations," Geophys. Res. Lett., vol. 26, no. 6, pp. 759–762, 1999.

[10] C. Duchon and R. Hale, Time Series Analysis in Meteorology and Climatology: An Introduction. Wiley-Blackwell, 2012.

[11] P. Turchin, Complex Population Dynamics: A Theoretical/Empirical Synthesis. Princeton, NJ: Princeton University Press, 2003.


[12] D. Stern, Ed., Remote Sensing of Animals, ser. Acou. Today, vol. 8, no. 3, 2012.

[13] F. Black and M. Scholes, "The pricing of options and corporate liabilities," J. Polit. Econ., vol. 81, no. 3, pp. 637–654, 1973.

[14] T. P. Bollerslev, "Generalized autoregressive conditional heteroskedasticity," J. Econom., vol. 31, no. 3, pp. 307–327, 1986.

[15] W. Enders, Applied Econometric Time Series, 2nd ed. Hoboken, NJ: John Wiley & Sons, 2004.

[16] R. McCleary and R. A. Hay, Jr., Applied Time Series Analysis for the Social Sciences. Beverly Hills, CA: SAGE Publications, 1980.

[17] A. L. Gilbert, T. Regier, P. Kay, and R. B. Ivry, "Whorf hypothesis is supported in the right visual field but not the left," P. Natl. Acad. Sci. USA, vol. 103, no. 2, pp. 489–494, 2006.

[18] R. I. M. Dunbar, "Coevolution of neocortical size, group size and language in humans," Behav. Brain Sci., vol. 16, no. 4, pp. 681–735, 1993.

[19] S. Stadler, A. Leijon, and B. Hagerman, "An information theoretic approach to estimate speech intelligibility for normal and impaired hearing," in Proc. Interspeech, Antwerpen, BE, Aug. 2007, pp. 398–401.

[20] M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. ICASSP, vol. 10, 1985, pp. 937–940.

[21] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Comput. Speech Lang., vol. 24, no. 1, pp. 45–66, 2010.

[22] H. Moravec, Mind Children. Harvard University Press, 1988.

[23] S. King and V. Karaiskos, "The Blizzard Challenge 2010," in Proc. Blizzard Challenge 2010 Workshop, 2010.

[24] C. Valentini-Botinhao, E. Godoy, Y. Stylianou, B. Sauert, S. King, and J. Yamagishi, "Improving intelligibility in noise of HMM-generated speech via noise-dependent and -independent methods," in Proc. ICASSP, May 2013.


[25] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal. Proc. Mag., vol. 29, no. 6, pp. 82–97, 2012.

[26] A. Ozerov and W. B. Kleijn, "Asymptotically optimal model estimation for quantization," IEEE Trans. Comm., vol. 59, no. 4, pp. 1031–1042, 2011.

[27] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.

[28] J. L. Doob, Stochastic Processes. New York, NY: John Wiley & Sons, 1953.

[29] C. E. Shannon, "Communication in the presence of noise," Proc. IRE, vol. 37, no. 1, pp. 10–21, 1949.

[30] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York, NY: Springer-Verlag, 1985.

[31] C. Myers, L. R. Rabiner, and A. E. Rosenberg, "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition," IEEE T. Acoust. Speech, vol. 28, no. 6, pp. 623–635, 1980.

[32] K. J. Lang, A. H. Waibel, and G. E. Hinton, "A time-delay neural network architecture for isolated word recognition," Neural Networks, vol. 3, no. 1, pp. 23–43, 1990.

[33] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," in Proc. EMNLP, vol. 10, 2002, pp. 79–86.

[34] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An evaluation of naive Bayesian anti-spam filtering," in Proc. ECML Workshop Mach. Learn. New Inf. Age, G. Potamias, V. Moustakis, and M. van Someren, Eds., 2000, pp. 9–17.

[35] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[36] A. Ratnaparkhi, "A maximum entropy model for part-of-speech tagging," in Proc. EMNLP, vol. 1, 1996, pp. 133–142.


[37] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume 2: Detection Theory. Upper Saddle River, NJ: Prentice Hall, 1998.

[38] K. Miyahara and M. J. Pazzani, "Collaborative filtering with the simple Bayesian classifier," in Proc. PRICAI, 2000, pp. 679–689.

[39] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Commun., vol. 51, no. 11, pp. 1039–1064, 2009.

[40] H. Kawahara, "STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds," Acoust. Sci. & Tech., vol. 27, no. 6, pp. 349–353, 2006.

[41] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, J. D. Lafferty, and R. L. Mercer, "Analysis, statistical transfer, and synthesis in machine translation," in Proc. TMI, 1992, pp. 83–100.

[42] T. Glad and L. Ljung, Control Theory: Multivariable and Nonlinear Methods. London, UK: Taylor & Francis, 2000.

[43] L. Ljung, System Identification – Theory for the User, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999.

[44] A. C. Lorenc, "Analysis methods for numerical weather prediction," Q. J. Roy. Meteor. Soc., vol. 112, no. 474, pp. 1177–1194, 1986.

[45] M. Collins, "Ensembles and probabilities: a new era in the prediction of climate change," Phil. Trans. R. Soc. A, vol. 365, no. 1857, pp. 1957–1970, 2007.

[46] M. P. Clements and D. E. Hendry, Forecasting Economic Time Series. Cambridge, UK: Cambridge University Press, 1998.

[47] C. Chelba and F. Jelinek, "Structured language modeling," Comput. Speech Lang., vol. 14, no. 4, pp. 283–332, 2000.

[48] S. M. Kay, Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Upper Saddle River, NJ: Prentice Hall, 1993.

[49] P. Stoica and Y. Selén, "Model-order selection: a review of information criterion rules," IEEE Signal. Proc. Mag., vol. 21, no. 4, pp. 36–47, 2004.

[50] D. R. Pauluzzi and N. C. Beaulieu, "A comparison of SNR estimation techniques for the AWGN channel," IEEE T. Commun., vol. 48, no. 10, pp. 1681–1691, 2000.

[51] O. Chapelle and Y. Zhang, "A dynamic Bayesian network click model for web search ranking," in Proc. WWW, 2009, pp. 1–10.


[52] R. Kohavi and D. H. Wolpert, "Bias plus variance decomposition for zero-one loss functions," in Proc. ICML, 1996, pp. 275–283.

[53] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.

[54] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York, NY: John Wiley & Sons, 1991.

[55] G. E. Henter and W. B. Kleijn, "Minimum entropy rate simplification of stochastic processes," IEEE T. Pattern Anal., submitted.

[56] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in Proc. ICASSP, vol. 3, 2000, pp. 1315–1318.

[57] L. M. Bregman, "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Comp. Math. Math+, vol. 7, no. 3, pp. 200–217, 1967.

[58] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," J. Mach. Learn. Res., vol. 6, pp. 1705–1749, 2005.

[59] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, UK: Chapman & Hall, 1986.

[60] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA: MIT Press, 2006.

[61] C. E. Brodley and M. A. Friedl, "Identifying and eliminating mislabeled training instances," in Proc. Natl. Conf. Artif. Intell., vol. 13, 1996, pp. 799–805.

[62] S. M. Kakade and A. T. Kalai, "From batch to transductive online learning," in Proc. NIPS, 2005.

[63] D. D. Lewis and W. A. Gale, "A sequential algorithm for training text classifiers," in Proc. ACM-SIGIR, vol. 17, 1994, pp. 3–12.

[64] K. Kunen, Set Theory. Amsterdam, NL: Elsevier, 1980.

[65] K. Gödel, "Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I," Monatsh. Math., vol. 38, no. 1, pp. 173–198, 1931.

[66] T. M. Mitchell, Machine Learning. New York, NY: McGraw-Hill, 1997.


[67] H. Akaike, "Information theory and an extension of the maximum likelihood principle," in Proc. Second Int. Symp. Inf. Theory, 1973, pp. 267–281.

[68] M. Rudemo, "Empirical choice of histograms and kernel density estimators," Scand. J. Statist., vol. 9, pp. 65–78, 1982.

[69] B. A. Turlach, "Bandwidth selection in kernel density estimation: A review," Institut de Statistique, Université catholique de Louvain, Voie du Roman Pays 34, B-1348 Louvain-la-Neuve, Belgium, Tech. Rep. Discussion Paper 9317, 1993.

[70] T. S. Ferguson, A Course in Large Sample Theory. London, UK: Chapman & Hall, 1996.

[71] U. Grenander and M. I. Miller, Pattern Theory: From Representation to Inference. Oxford, UK: Oxford University Press, 2007.

[72] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proc. ICASSP, vol. 11, 1986, pp. 49–52.

[73] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE T. Signal Proces., vol. 40, no. 12, pp. 3043–3054, 1992.

[74] Y.-J. Wu and R.-H. Wang, "Minimum generation error training for HMM-based speech synthesis," in Proc. ICASSP, vol. 1, 2006, pp. I-89–I-92.

[75] A. Gunawardana and W. Byrne, "Discriminative speaker adaptation with conditional maximum likelihood linear regression," in Proc. Eurospeech, 2001, pp. 1203–1206.

[76] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.

[77] I. Csiszár and G. Tusnády, "Information geometry and alternating minimization procedures," in Stat. Decis., Suppl. Issue Number 1, 1984, pp. 205–237.

[78] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, UK: Cambridge University Press, 2004.

[79] L. Devroye and L. Györfi, Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.


[80] G. Wahba, "Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation," Ann. Stat., vol. 3, no. 1, pp. 15–29, 1975.

[81] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," Ann. Math. Stat., vol. 27, no. 3, pp. 832–837, 1956.

[82] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Stat., vol. 33, no. 3, pp. 1065–1076, 1962.

[83] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.

[84] B. W. Silverman, "Density ratios, empirical likelihood and cot death," Appl. Stat. – J. Roy. St. C, vol. 27, no. 1, pp. 26–33, 1978.

[85] T. Minka, "Discriminative models, not discriminative training," Microsoft Research Cambridge, Tech. Rep. MSR-TR-2005-144, 2005. [Online]. Available: http://research.microsoft.com/pubs/70229/tr-2005-144.pdf

[86] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" J. Mach. Learn. Res., vol. 11, pp. 625–660, 2010.

[87] C. Cortes and V. N. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.

[88] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 1st ed. Cambridge, MA: MIT Press, 2001.

[89] S. Lloyd, "Least squares quantization in PCM," IEEE T. Inform. Theory, vol. 28, no. 2, pp. 129–137, 1982.

[90] P. C. Mahalanobis, "On the generalised distance in statistics," Proc. Natl. Inst. Sci. India, vol. 2, no. 1, pp. 49–55, 1936.

[91] V. Franc, A. Zien, and B. Schölkopf, "Support vector machines as probabilistic models," in Proc. ICML, 2011, pp. 665–672.

[92] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, 2001.

[93] B. Efron, "How biased is the apparent error rate of a prediction rule?" J. Am. Stat. Assoc., vol. 81, no. 394, pp. 461–470, 1986.

[94] D. L. Verbyla, "Potential prediction bias in regression and discriminant analysis," Can. J. Forest Res., vol. 16, no. 6, pp. 1255–1257, 1986.


[95] C. R. Shalizi, "Dynamics of Bayesian updating with dependent data and misspecified models," Electron. J. Statist., vol. 3, pp. 1039–1074, 2009.

[96] M. J. Beal, "Variational algorithms for approximate Bayesian inference," Ph.D. dissertation, University College London, 2003.

[97] R. M. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," Dept. of Computer Science, University of Toronto, Tech. Rep. CRG-TR-93-1, 1993.

[98] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press, 2003.

[99] H. Jeffreys, "An invariant form for the prior probability in estimation problems," P. Roy. Soc. A. – Math. Phy., vol. 186, no. 1007, pp. 453–461, 1946.

[100] L. Wasserman, "Asymptotic inference for mixture models by using data-dependent priors," J. Roy. Stat. Soc. B, vol. 62, no. 1, pp. 159–180, 2000.

[101] A. R. Syversveen, "Noninformative Bayesian priors: Interpretation and problems with construction and applications," Norges Teknisk-Naturvitenskapelige Universitet, Tech. Rep. 3, 1998.

[102] A. Gelman, A. Jakulin, M. G. Pittau, and Y.-S. Su, "A weakly informative default prior distribution for logistic and other regression models," Ann. Appl. Stat., vol. 2, no. 4, pp. 1360–1383, 2008.

[103] L. J. Savage, The Foundations of Statistics, 2nd ed. New York, NY: Dover Publications, 1972.

[104] A. Gelman, "Objections to Bayesian statistics," Bayesian Analysis, vol. 3, no. 3, pp. 445–450, 2008.

[105] J. Surowiecki, The Wisdom of Crowds. New York, NY: Doubleday, 2005.

[106] Z. Ma and A. Leijon, "Bayesian estimation of beta mixture models with variational inference," IEEE T. Pattern Anal., vol. 33, no. 11, pp. 2160–2173, 2011.

[107] C. E. Rasmussen, "The infinite Gaussian mixture model," in Proc. NIPS, vol. 12, 2000, pp. 554–560.

[108] L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.


[109] B. Efron, "Bootstrap methods: Another look at the jackknife," Ann. Stat., vol. 7, pp. 1–26, 1979.

[110] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.

[111] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Ann. Stat., vol. 28, no. 2, pp. 337–407, 2000.

[112] R. E. Schapire, "The strength of weak learnability," Mach. Learn., vol. 5, no. 2, pp. 197–227, 1990.

[113] D. H. Wolpert, "Stacked generalization," Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.

[114] P. Domingos, "A few useful things to know about machine learning," Commun. ACM, vol. 55, no. 10, pp. 78–87, 2012.

[115] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting, 2nd ed. New York, NY: Springer, 2002.

[116] M. B. Priestley, Non-Linear and Non-Stationary Time Series Analysis. Waltham, MA: Academic Press, 1988.

[117] M. Brin and S. Garrett, Introduction to Dynamical Systems. Cambridge University Press, 2002.

[118] G. D. Birkhoff, "Proof of the ergodic theorem," Proc. Natl. Acad. Sci. USA, vol. 17, no. 12, pp. 656–660, 1931.

[119] C. R. Shalizi. (2010) The world's simplest ergodic theorem. [Online]. Available: http://bactra.org/weblog/668.html

[120] C. R. Shalizi and A. Kontorovich. (2010) Almost None of the Theory of Stochastic Processes. Version 0.1.1. http://www.stat.cmu.edu/~cshalizi/almost-none/.

[121] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, ser. Representation and Reasoning. San Mateo, CA: Morgan Kaufmann, 1988.

[122] T. Koski and J. M. Noble, Bayesian Networks: An Introduction. Chichester, UK: John Wiley & Sons, 2009.

[123] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction, ser. International Series of Monographs on Physics. Oxford, UK: Oxford University Press, 2001, vol. 111.


[124] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, 1986, ch. 6.

[125] Y. Freund and D. Haussler, "Unsupervised learning of distributions on binary vectors using 2-layer networks," in Proc. NIPS, 1992, pp. 912–919.

[126] H. Zen, K. Tokuda, and T. Kitamura, "Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences," Comput. Speech Lang., vol. 21, no. 1, pp. 153–173, 2007.

[127] R. D. Shachter, "Bayes-ball: Rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams)," in Proc. UAI, 1998, pp. 480–487.

[128] P. Dagum and M. Luby, "Approximating probabilistic inference in Bayesian belief networks is NP-hard," Artif. Intell., vol. 60, no. 1, pp. 141–153, 1993.

[129] O. Perron, "Zur Theorie der Matrices," Math. Ann., vol. 64, no. 2, pp. 248–263, 1907.

[130] F. G. Frobenius, "Über Matrizen aus nicht negativen Elementen," Sitzungsber. Königl. Preuss. Akad. Wiss., pp. 456–477, 1912.

[131] P. Whittle, Hypothesis Testing in Time Series Analysis. Stockholm, Sweden: Almqvist & Wiksell, 1951.

[132] H. O. A. Wold, A Study in the Analysis of Stationary Time Series, 2nd ed. Stockholm, Sweden: Almqvist & Wiksell, 1954.

[133] H. Tong and K. S. Lim, "Threshold autoregression, limit cycles and cyclical data," J. Roy. Stat. Soc. B, vol. 42, no. 3, pp. 245–292, 1980.

[134] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE T. Acoust. Speech, vol. 37, no. 3, pp. 328–339, 1989.

[135] M. Kallas, P. Honeine, C. Richard, C. Francis, and H. Amoud, "Kernel-based autoregressive modeling with a pre-image technique," in Proc. SSP, 2011, pp. 281–284.

[136] ——, "Prediction of time series using Yule-Walker equations with kernels," in Proc. ICASSP, 2012, pp. 2185–2188.


[137] M. B. Rajarshi, "Bootstrap in Markov-sequences based on estimates of transition density," Ann. I. Stat. Math., vol. 42, no. 2, pp. 253–268, 1990.

[138] G. E. Henter, A. Leijon, and W. B. Kleijn, "Kernel density estimation-based Markov models with hidden state," manuscript in preparation.

[139] P. Bühlmann and A. J. Wyner, "Variable length Markov chains," Ann. Stat., vol. 27, no. 2, pp. 480–513, 1999.

[140] C. Shalizi and K. Shalizi, "Blind construction of optimal nonlinear recursive predictors for discrete sequences," in Proc. UAI, vol. 20, 2004, pp. 504–511.

[141] C. Shalizi, K. Shalizi, and J. Crutchfield, "An algorithm for pattern discovery in time series," Santa Fe Institute, Tech. Rep. 02-10-060, 2002, http://arxiv.org/abs/cs.LG/0210025.

[142] B. Weiss, "Subshifts of finite type and sofic systems," Monatsh. Math., vol. 77, no. 5, pp. 462–474, Oct. 1973.

[143] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE T. Inform. Theory, vol. 41, no. 3, pp. 653–664, 1995.

[144] G. Welch and G. Bishop, "An introduction to the Kalman filter," SIGGRAPH 2001 Course 8, 2001.

[145] E. A. Wan and A. T. Nelson, "Dual Kalman filtering methods for nonlinear prediction, smoothing, and estimation," in Proc. NIPS 1996, vol. 9, 1997.

[146] E. A. Wan, R. van der Merwe, and A. T. Nelson, "Dual estimation and the unscented transformation," in Proc. NIPS 1999, vol. 11, 2000, pp. 666–672.

[147] A. Schönhuth and H. Jaeger, "Characterization of ergodic hidden Markov sources," IEEE T. Inform. Theory, vol. 55, no. 5, pp. 2107–2118, 2009.

[148] R. E. Kálmán, "A new approach to linear filtering and prediction problems," J. Basic Eng. – T. ASME, vol. 82, no. 1, pp. 35–45, 1960.

[149] R. E. Kálmán and R. S. Bucy, "New results in linear filtering and prediction theory," J. Basic Eng. – T. ASME, vol. 83, no. 3, pp. 95–108, 1961.

[150] T. Jebara, "MAP estimation, message passing, and perfect graphs," in Proc. UAI. AUAI Press, 2009, pp. 258–267.


[151] J. R. Foulds, N. Navaroli, P. Smyth, and A. T. Ihler, "Revisiting MAP estimation, message passing and perfect graphs," in Proc. AISTATS, 2011, pp. 278–286.

[152] P. S. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, "An inequality for rational functions with applications to some statistical estimation problems," IEEE T. Inform. Theory, vol. 37, no. 1, pp. 107–113, 1991.

[153] D. Kanevsky, "Extended Baum transformations for general functions," in Proc. ICASSP, vol. 1, 2004, pp. I-821–824.

[154] C. Shalizi and J. Crutchfield, "Computational mechanics: Pattern and prediction, structure and simplicity," J. Stat. Phys., vol. 104, no. 3–4, pp. 817–879, 2001.

[155] G. E. Henter and W. B. Kleijn, "Picking up the pieces: Causal states in noisy data, and how to recover them," Pattern Recogn. Lett., vol. 34, no. 5, pp. 587–594, 2013.

[156] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[157] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE T. Pattern Anal., vol. 31, no. 5, pp. 855–868, 2009.

[158] H. Hermansky, D. P. W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proc. ICASSP, vol. 3, 2000, pp. 1635–1638.

[159] Q. Zhu, B. Y. Chen, N. Morgan, and A. Stolcke, "On using MLP features in LVCSR," in Proc. Interspeech, 2004.

[160] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP, 2013, pp. 7962–7966.

[161] Z.-H. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis," IEEE T. Speech Audi. P., vol. 21, no. 10, pp. 2129–2139, 2013.

[162] S. Kang, X. Qian, and H. Meng, "Multi-distribution deep belief network for speech synthesis," in Proc. ICASSP, 2013, pp. 8012–8016.


[163] D. Tweed, R. Fisher, J. Bins, and T. List, "Efficient hidden semi-Markov model inference for structured video sequences," in Proc. VSPETS, 2005, pp. 247–254.

[164] S. Asmussen, O. Nerman, and M. Olsson, "Fitting phase-type distributions via the EM algorithm," Scand. J. Statist., vol. 23, no. 4, pp. 419–441, 1996.

[165] M. J. Russell and R. K. Moore, "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition," in Proc. ICASSP, vol. 10, 1985, pp. 5–8.

[166] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in Proc. ICSLP, 2004, pp. 1393–1396.

[167] J. Yamagishi, H. Zen, Y.-J. Wu, T. Toda, and K. Tokuda, "The HTS-2008 system: Yet another evaluation of the speaker-adaptive HMM-based speech synthesis system in the 2008 Blizzard Challenge," in Blizzard Challenge 2008 Workshop, 2008.

[168] N. Niwase, J. Yamagishi, and T. Kobayashi, "Human walking motion synthesis with desired pace and stride length based on HSMM," IEICE T. Inf. Syst., vol. 88, no. 11, pp. 2492–2499, 2005.

[169] E. Bozkurt, S. Asta, S. Özkul, Y. Yemez, and E. Erzin, "Multimodal analysis of speech prosody and upper body gestures using hidden semi-Markov models," in Proc. ICASSP, 2013, pp. 3562–3656.

[170] G. E. Henter and W. B. Kleijn, "Intermediate-state HMMs to capture continuously-changing signal features," in Proc. Interspeech, vol. 12, 2011, pp. 1817–1820.

[171] G. E. Henter, M. R. Frean, and W. B. Kleijn, "Gaussian process dynamical models for nonparametric speech representation and synthesis," in Proc. ICASSP, 2012, pp. 4505–4508.

[172] J. A. Bilmes and C. Bartels, "Graphical model architectures for speech recognition," IEEE Signal Proc. Mag., vol. 22, no. 5, pp. 89–100, 2005.

[173] N. Friedman, K. Murphy, and S. Russell, "Learning the structure of dynamic probabilistic networks," in Proc. UAI, vol. 14, 1998, pp. 139–147.

[174] J. D. Hamilton, "A new approach to the economic analysis of nonstationary time series and the business cycle," Econometrica, vol. 57, no. 2, pp. 357–384, 1989.


[175] J. Dines, J. Yamagishi, and S. King, "Measuring the gap between HMM-based ASR and TTS," in Proc. Interspeech, 2009, pp. 1391–1394.

[176] G. Rigoll and A. Kosmala, "A systematic comparison between on-line and off-line methods for signature verification with hidden Markov models," in Proc. ICPR, vol. 2, 1998, pp. 1755–1757.

[177] M. Shannon, H. Zen, and W. Byrne, "The effect of using normalized models in statistical speech synthesis," in Proc. Interspeech, vol. 12, 2011.

[178] M. Shannon and W. Byrne, "Autoregressive HMMs for speech synthesis," in Proc. Interspeech, vol. 10, 2009, pp. 400–403.

[179] F. M. J. Willems, "The context-tree weighting method: Extensions," IEEE T. Inform. Theory, vol. 44, no. 2, pp. 792–798, 1998.

[180] J. G. Cleary and W. J. Teahan, "Unbounded length contexts for PPM," Comput. J., vol. 40, no. 2 and 3, pp. 67–75, 1997.

[181] F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh, "The sequence memoizer," Commun. ACM, vol. 54, no. 2, pp. 91–98, 2011.

[182] J. Gasthaus, F. Wood, and Y. W. Teh, "Lossless compression based on the sequence memoizer," in Proc. DCC, 2010, pp. 337–345.

[183] J. Pitman and M. Yor, "The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator," Ann. Probab., vol. 25, no. 2, pp. 855–900, 1997.

[184] A. Clauset, C. R. Shalizi, and M. E. J. Newman, "Power-law distributions in empirical data," SIAM Rev., vol. 51, no. 4, pp. 661–703, 2009.

[185] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE T. Acoust. Speech, vol. 26, no. 1, pp. 43–49, 1978.

[186] S.-X. Zhang and M. J. F. Gales, "Structured support vector machines for noise robust continuous speech recognition," in Proc. Interspeech, 2011, pp. 989–990.

[187] A. J. Hunt and A. W. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in Proc. ICASSP, 1996, pp. 373–376.


[188] C.-H. Lee, M. A. Clements, S. Dusan, E. Fosler-Lussier, K. Johnson, B.-H. Juang, and L. R. Rabiner, "An overview on automatic speech attribute transcription (ASAT)," in Proc. Interspeech, 2007, pp. 1825–1828.

[189] A. Jansen and P. Niyogi, "Point process models for event-based speech recognition," Speech Commun., vol. 51, no. 12, pp. 1155–1168, 2009.

[190] J. Benesty, M. M. Sondhi, and Y. Huang, Springer Handbook of Speech Processing. New York, NY: Springer, 2008.

[191] C. J. C. Burges, "Geometric methods for feature extraction and dimensional reduction," in Data Mining and Knowledge Discovery Handbook. New York, NY: Springer, 2005, pp. 59–91.

[192] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[193] D. Zhang, Z.-H. Zhou, and S. Chen, "Semi-supervised dimensionality reduction," in Proc. SDM, vol. 7, 2007, pp. 629–634.

[194] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," J. Roy. Stat. Soc. B, vol. 61, no. 3, pp. 611–622, 1999.

[195] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Upper Saddle River, NJ: Prentice Hall, 1993.

[196] C. Koniaris, S. Chatterjee, and W. B. Kleijn, "Selecting static and dynamic features using an advanced auditory model for speech recognition," in Proc. ICASSP, 2010, pp. 4342–4345.

[197] J. Andén and S. Mallat, "Deep scattering spectrum," IEEE T. Signal Proces., submitted. [Online]. Available: http://arxiv.org/abs/1304.6763

[198] J. H. McDermott and E. P. Simoncelli, "Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis," Neuron, vol. 71, no. 5, pp. 926–940, 2011.

[199] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proc. ICML, 2009, pp. 609–616.

[200] G. E. P. Box and N. R. Draper, Empirical Model-Building and Response Surfaces. New York, NY: John Wiley & Sons, 1987.

[201] H. Akaike, "A new look at the statistical model identification," IEEE T. Automat. Contr., vol. 19, no. 6, pp. 716–723, 1974.


[202] G. Schwarz, "Estimating the dimension of a model," Ann. Stat., vol. 6, no. 2, pp. 461–464, 1978.

[203] R. L. Kashyap, "Inconsistency of the AIC rule for estimating the order of autoregressive models," IEEE T. Automat. Contr., vol. 25, no. 5, pp. 996–998, 1980.

[204] R. W. Katz, "On some criteria for estimating the order of a Markov chain," Technometrics, vol. 23, no. 3, pp. 243–249, 1981.

[205] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE T. Knowl. Data En., vol. 22, no. 10, pp. 1345–1359, 2010.

[206] S. Molau, F. Hilger, and H. Ney, "Feature space normalization in adverse acoustic conditions," in Proc. ICASSP, 2003, pp. I-656–I-659.

[207] H. Daumé III, "Frustratingly easy domain adaptation," in Proc. ACL, vol. 45, 2007, pp. 256–263.

[208] H. Shimodaira, "Improving predictive inference under covariate shift by weighting the log-likelihood function," J. Stat. Plan. Infer., vol. 90, no. 2, pp. 227–244, 2000.

[209] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," in Proc. ICASSP, 1996, pp. 346–348.

[210] T. Claes, I. Dologlou, L. ten Bosch, and D. Van Compernolle, "A novel feature transformation for vocal tract length normalization in automatic speech recognition," IEEE T. Speech Audi. P., vol. 6, no. 6, pp. 549–557, 1998.

[211] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Comput. Speech Lang., vol. 12, no. 2, pp. 75–98, 1998.

[212] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," in Proc. ACL, vol. 34, 1996, pp. 310–318.

[213] I. H. Witten and T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression," IEEE T. Inform. Theory, vol. 37, no. 4, pp. 1085–1094, 1991.

[214] G. J. Lidstone, "Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities," T. Fac. Actuar., vol. 8, pp. 182–192, 1920.


[215] F. Jelinek and R. L. Mercer, "Interpolated estimation of Markov source parameters from sparse data," in Proc. Workshop Pattern Recognit. Pract., 1980.

[216] G. E. Henter and W. B. Kleijn, "Simplified probability models for generative tasks: a rate-distortion approach," in Proc. EUSIPCO, vol. 18, 2010, pp. 1159–1163.

[217] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models," in Proc. NIPS 2005, vol. 18, 2006, pp. 1441–1448.

[218] ——, "Gaussian process dynamical models for human motion," IEEE T. Pattern Anal., vol. 30, no. 2, pp. 283–298, Feb. 2008.